# Linking scRNA-seq data against `Gene`

In [None]:
!lamin login testuser2

In [None]:
import lamindb as ln
import lnschema_bionty as bt

ln.track()

Consider an scRNA-seq count matrix in the form of an in-memory `AnnData` object:

In [None]:
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()

In [None]:
adata

Check out the features of this dataset:

In [None]:
adata.var.head()

## Parse features

The features in this data object are genes, indexed by Ensembl gene IDs. We'd like to link these features so that we can query the data by gene.

Features are often knowledge-based entities. [Bionty](https://lamin.ai/docs/bionty) provides several knowledge-based tables for basic biological entities.

```{note}

- For an overview of knowledge tables, see: {mod}`~bionty`.
```
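Because the features are indexed by Ensembl gene IDs, the species can usually be read off the ID prefix: mouse gene IDs start with `ENSMUSG` and human gene IDs with `ENSG`. The following is a minimal sketch of a sanity check on the index before parsing. The helper `guess_species` and the prefix table are hypothetical and not part of lamindb or Bionty; the IDs are only examples of the expected shape.

```python
from collections import Counter

# Assumed prefix table for this sketch; extend it for other species as needed.
PREFIX_TO_SPECIES = {"ENSMUSG": "mouse", "ENSG": "human"}


def guess_species(gene_ids):
    """Count how many IDs match each known Ensembl gene-ID prefix."""
    counts = Counter()
    for gene_id in gene_ids:
        for prefix, species in PREFIX_TO_SPECIES.items():
            if gene_id.startswith(prefix):
                counts[species] += 1
                break
        else:
            # No known prefix matched this identifier.
            counts["unknown"] += 1
    return counts


example_ids = ["ENSMUSG00000000001", "ENSMUSG00000000028", "ENSG00000000003"]
print(guess_species(example_ids))
```

In the notebook, the same check could be run on `adata.var.index` to confirm that the identifiers look like mouse IDs before choosing the species.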

Now let's parse the features from the data, matching them against the Ensembl gene ID field of `Gene`:

In [None]:
# Don't forget to specify the species here; the default is "human"
features = ln.Features(adata.var.index, bt.Gene.ensembl_gene_id, species="mouse")

Here, all features were successfully (unambiguously) linked against their canonical reference in `bionty.Gene`.

This creates a feature set of type `gene`, indexed by its hash:

In [None]:
features

This feature set links records for 10k genes. Here are the first 3, all of which can be queried:

In [None]:
features.genes[:3]

Hence, the data can be queried not just by Ensembl ID, but also by gene symbol, NCBI ID, gene type, etc.
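To illustrate what indexing a feature set by its hash can mean, here is a minimal sketch. It assumes the hash is derived from the sorted, de-duplicated feature identifiers, so that the same set of genes always maps to the same key regardless of order. This is an assumption for illustration only, not necessarily how lamindb computes the hash; `feature_set_hash` is a hypothetical helper.

```python
import hashlib


def feature_set_hash(feature_ids):
    """Return an order-independent hash for a collection of feature IDs."""
    # Sorting and de-duplicating makes the result depend only on set membership.
    joined = "\n".join(sorted(set(feature_ids)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:20]


a = feature_set_hash(["ENSMUSG00000000001", "ENSMUSG00000000028"])
b = feature_set_hash(["ENSMUSG00000000028", "ENSMUSG00000000001"])
print(a == b)  # same genes in a different order give the same hash
```

With a scheme like this, two datasets measured on the same genes would share one feature set record instead of creating duplicates.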

## Track data with features (genes)

Now we can track data together with the parsed features by passing `features` when instantiating `File`.

````{tip}

Features can also be linked after instantiation:

```python
file = ln.File(adata, name="Mouse Lymph Node scRNA-seq")
file.features = features
```
````

In [None]:
file = ln.File(adata, name="Mouse Lymph Node scRNA-seq", features=features)

The features can now be accessed via their relationship to the data object:

In [None]:
file.features[0].genes[:3]

Add the file and its linked features to the database:

In [None]:
ln.add(file)

## Querying data by features

```{seealso}

Basic queries:

- {doc}`/guide/select`
- {doc}`/guide/query-book`

```

Let us query gene records by symbol:

In [None]:
ln.select(bt.Gene, symbol="Actg1").df()

How do we retrieve the data objects in which this gene was a feature?

We could first query all feature sets that contain the gene, and then query `File` by those feature sets.

In [None]:
features = (
    ln.select(ln.Features, bt.Gene.symbol)
    .join(ln.Features.genes)
    .where(bt.Gene.symbol == "Actg1")
)
features.df().head()

Of course, we can also write everything in one statement:

In [None]:
files = (
    ln.select(ln.File)
    .join(ln.File.features)
    .join(ln.Features.genes)
    .where(bt.Gene.symbol == "Actg1")
)
files.df().head()
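Conceptually, the chained joins above walk from files to feature sets to genes through link tables. The following self-contained sketch reproduces that idea with the standard library's `sqlite3` module. The table and column names are hypothetical and chosen only for illustration; they are not lamindb's actual schema, and the second file and the gene `Cd3e` are made-up sample rows.

```python
import sqlite3

# In-memory database with a hypothetical, simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE file (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE features (id TEXT PRIMARY KEY);
    CREATE TABLE gene (id TEXT PRIMARY KEY, symbol TEXT);
    CREATE TABLE file_features (file_id TEXT, features_id TEXT);
    CREATE TABLE features_gene (features_id TEXT, gene_id TEXT);

    INSERT INTO file VALUES
        ('f1', 'Mouse Lymph Node scRNA-seq'), ('f2', 'Another dataset');
    INSERT INTO features VALUES ('fs1'), ('fs2');
    INSERT INTO gene VALUES ('g1', 'Actg1'), ('g2', 'Cd3e');
    INSERT INTO file_features VALUES ('f1', 'fs1'), ('f2', 'fs2');
    INSERT INTO features_gene VALUES ('fs1', 'g1'), ('fs1', 'g2'), ('fs2', 'g2');
    """
)


def files_with_gene(symbol):
    """Return names of files whose feature set contains the given gene symbol."""
    rows = conn.execute(
        """
        SELECT DISTINCT file.name
        FROM file
        JOIN file_features ON file_features.file_id = file.id
        JOIN features_gene ON features_gene.features_id = file_features.features_id
        JOIN gene ON gene.id = features_gene.gene_id
        WHERE gene.symbol = ?
        ORDER BY file.name
        """,
        (symbol,),
    ).fetchall()
    return [name for (name,) in rows]


print(files_with_gene("Actg1"))  # ['Mouse Lymph Node scRNA-seq']
```

Each `.join(...)` in the notebook query corresponds to one hop through a link table here, and the `.where(...)` clause corresponds to the final filter on the gene symbol.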