# Linking scRNA-seq data against `Gene`

In [None]:
!lamin login testuser2

In [None]:
import lamindb as ln
import lamindb.schema as lns
import bionty as bt

ln.Run()

Consider an scRNA-seq count matrix in the form of an in-memory `AnnData` object:

In [None]:
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()

In [None]:
adata

Check out the features of this dataset:

In [None]:
adata.var.head()

## Configure a `Gene` reference for parsing features

The features in this data object are genes, indexed by Ensembl gene ids. We'd like to link these features so that we can query the data by gene!
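Before linking, it can help to sanity-check that the index really holds Ensembl gene ids. The sketch below is a minimal, standard-library check that assumes the usual stable-id layout for mouse genes (the prefix `ENSMUSG` followed by 11 digits, optionally with a `.version` suffix). The helper `split_by_id_format` and the sample ids are illustrative and not part of lamindb or Bionty.

```python
import re

# Assumed layout of a mouse Ensembl gene id: "ENSMUSG" + 11 digits,
# optionally followed by a version suffix such as ".2".
ENSEMBL_MOUSE_GENE = re.compile(r"ENSMUSG\d{11}(\.\d+)?")


def split_by_id_format(ids):
    """Split ids into (well_formed, malformed) lists, preserving order."""
    well_formed, malformed = [], []
    for gene_id in ids:
        if ENSEMBL_MOUSE_GENE.fullmatch(gene_id):
            well_formed.append(gene_id)
        else:
            malformed.append(gene_id)
    return well_formed, malformed


# Illustrative values; in the notebook you would pass `adata.var.index`.
sample_ids = ["ENSMUSG00000000001", "ENSMUSG00000000003.2", "Actg1"]
ok, bad = split_by_id_format(sample_ids)
print(f"{len(ok)} well-formed, {len(bad)} malformed: {bad}")
```

A format check like this only catches malformed strings; whether an id actually exists in the reference is decided when the features are parsed.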

Features are often knowledge-based entities. [Bionty](https://lamin.ai/docs/bionty) provides several knowledge-based tables for basic biological entities.

```{note}

- For an overview of knowledge tables, see: {mod}`~bionty`.
- For an overview of lookup identifiers, see: [Bionty lookup](https://lamin.ai/docs/bionty/bionty.lookup).
```

Let's first configure a gene reference we'd like to use for parsing features in this dataset:

```{tip}

Bionty provides lookups for ids and species, so you don't need to memorize the exact strings:

```python

# avoid shadowing the Python built-in `id`
gene_id = bt.lookup.gene_id.ensembl_gene_id
species = bt.Species().lookup.mouse
```

By default, `ensembl_gene_id` is used. You may pass any other id available from `bt.lookup.gene_id`.

In [None]:
reference = bt.Gene(species="mouse")

In [None]:
reference

## Parse features

Now let's parse the features from the data using the reference configured above:

In [None]:
features = ln.Features(data=adata, reference=reference)

Here, all features were successfully (unambiguously) linked against their canonical reference in `bionty.Gene`.
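Conceptually, linking means looking up each feature id in the reference and keeping track of the ids that cannot be resolved. The following is a minimal sketch of that idea in plain Python; it is not how `ln.Features` is implemented. The function `link_features` and the small hand-typed `reference_table` are assumptions made for illustration, not data pulled from Bionty.

```python
# Toy reference keyed by Ensembl gene id (hand-typed for illustration).
reference_table = {
    "ENSMUSG00000000001": {"symbol": "Gnai3"},
    "ENSMUSG00000000003": {"symbol": "Pbsn"},
    "ENSMUSG00000000028": {"symbol": "Cdc45"},
}


def link_features(feature_ids, reference):
    """Return (linked, unmapped): records found in the reference, and ids that were not."""
    linked = {}
    unmapped = []
    for feature_id in feature_ids:
        record = reference.get(feature_id)
        if record is None:
            unmapped.append(feature_id)
        else:
            linked[feature_id] = record
    return linked, unmapped


# Hypothetical feature ids; the last one is deliberately absent from the reference.
feature_ids = ["ENSMUSG00000000001", "ENSMUSG00000000028", "ENSMUSG99999999999"]
linked, unmapped = link_features(feature_ids, reference_table)
print(f"linked {len(linked)} of {len(feature_ids)} features; unmapped: {unmapped}")
```

When `unmapped` is empty, every feature was resolved, which corresponds to the fully successful case described above.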

This creates a feature set of type `gene`, indexed by its hash:

In [None]:
features
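To illustrate why a hash is a convenient index for a feature set, here is a minimal sketch that derives an order-independent digest from a collection of feature ids. The hashing scheme (SHA-256 over the sorted, de-duplicated ids) and the name `feature_set_hash` are assumptions for illustration; lamindb's actual scheme may differ.

```python
import hashlib


def feature_set_hash(feature_ids):
    """Return a hex digest that depends only on the set of ids, not on their order."""
    canonical = "\n".join(sorted(set(feature_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same ids in a different order yield the same digest.
first = feature_set_hash(["ENSMUSG00000000003", "ENSMUSG00000000001"])
second = feature_set_hash(["ENSMUSG00000000001", "ENSMUSG00000000003"])
print(first == second)
```

With such an index, two datasets that measure exactly the same genes resolve to the same feature set instead of creating duplicates.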

This feature set links records for 10k genes. Here are the first 3, all of which can be queried:

In [None]:
features.genes[:3]

Hence, you can query not just by Ensembl ID, but also by gene symbol, NCBI ID, gene type, etc.

## Track data with features (genes)

Now we can track data together with the parsed features by passing `features` when instantiating `DObject`.

```{tip}

Features can also be linked after instantiation:
```python

dobject = ln.DObject(adata, name="Mouse Lymph Node scRNA-seq")
dobject.features = features
```

In [None]:
dobject = ln.DObject(adata, name="Mouse Lymph Node scRNA-seq", features=features)

The features can now be accessed via the relationship on `dobject`:

In [None]:
dobject.features[0].genes[:3]

Add the dobject and its linked features to the database:

In [None]:
ln.add(dobject)

## Querying data by features

```{seealso}

Basic queries:

- {doc}`/guide/select`
- {doc}`/guide/query-book`

```

Let us query gene records by symbol:

In [None]:
ln.select(lns.bionty.Gene, symbol="Actg1").df()

How do we retrieve the data objects in which this gene was a feature?

We could first query all feature sets that contain the gene, and then query `DObject` by that.

In [None]:
# use a new name so the query does not overwrite the `features` object parsed earlier
feature_sets = (
    ln.select(ln.Features, lns.bionty.Gene.symbol)
    .join(ln.Features.genes)
    .where(lns.bionty.Gene.symbol == "Actg1")
)
feature_sets.df().head()

Of course, we can also write everything in one statement:

In [None]:
dobjects = (
    ln.select(ln.DObject)
    .join(ln.DObject.features)
    .join(ln.Features.genes)
    .where(lns.bionty.Gene.symbol == "Actg1")
)
dobjects.df().head()
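The chained joins above follow a data object to its feature sets and from there to the genes. As a rough sketch of the relational idea, the same traversal can be written against an in-memory SQLite database. All table and column names below (`dobject`, `dobject_features`, `features`, `features_gene`, `gene`) and the toy rows are assumptions for illustration and do not reflect lamindb's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: data objects link to feature sets, feature sets link to genes.
conn.executescript(
    """
    CREATE TABLE gene (id TEXT PRIMARY KEY, symbol TEXT);
    CREATE TABLE features (id TEXT PRIMARY KEY);
    CREATE TABLE features_gene (features_id TEXT, gene_id TEXT);
    CREATE TABLE dobject (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE dobject_features (dobject_id TEXT, features_id TEXT);

    INSERT INTO gene VALUES ('g1', 'Actg1'), ('g2', 'Gnai3');
    INSERT INTO features VALUES ('f1'), ('f2');
    INSERT INTO features_gene VALUES ('f1', 'g1'), ('f1', 'g2'), ('f2', 'g2');
    INSERT INTO dobject VALUES
        ('d1', 'Mouse Lymph Node scRNA-seq'),
        ('d2', 'Another toy dataset');
    INSERT INTO dobject_features VALUES ('d1', 'f1'), ('d2', 'f2');
    """
)

# Walk dobject -> feature set -> gene and filter on the gene symbol.
rows = conn.execute(
    """
    SELECT DISTINCT dobject.id, dobject.name
    FROM dobject
    JOIN dobject_features ON dobject_features.dobject_id = dobject.id
    JOIN features ON features.id = dobject_features.features_id
    JOIN features_gene ON features_gene.features_id = features.id
    JOIN gene ON gene.id = features_gene.gene_id
    WHERE gene.symbol = ?
    """,
    ("Actg1",),
).fetchall()
conn.close()

print(rows)
```

Only the data object whose feature set contains `Actg1` is returned; the second toy dataset is filtered out because its feature set lacks that gene.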