# Link features of data objects - data lakehouse

So far, we have used LaminDB like a data lake: we could not yet query ingested data by its _features_.

Let us fix this! By linking features, we impose some integrity on the data objects in our storage and make them queryable.

Soon, we'll also see how this enables us to stream partial objects.

In [None]:
!lndb login testuser2  # log in as another user to simulate team collaboration

In [None]:
import lamindb as ln
import lamindb.knowledge as lnk

ln.nb.header()

## Linking scRNA-seq data against `Gene`

Consider an scRNA-seq count matrix in the form of an in-memory `AnnData` object:

In [None]:
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()

adata.var.head()

The features in this data object are genes, indexed by Ensembl gene IDs. We'd like to link these features so that we can query the data by gene!

Features are often knowledge-based entities. `lamindb.knowledge` (under the hood, [Bionty](https://lamin.ai/docs/bionty)) provides several knowledge-based tables for basic biological entities.

In [None]:
reference = lnk.Gene(
    id=lnk.lookup.gene_id.ensembl_gene_id,
    species=lnk.lookup.species.mouse,
)

```{note}

- For an overview of knowledge tables, see: {class}`~lamindb.knowledge`.
- For an overview of lookup identifiers, see: {class}`~lamindb.knowledge.lookup`.
```

In [None]:
dobject = ln.DObject(adata, name="Mouse Lymph Node scRNA-seq", features_ref=reference)

Because we provided a gene reference for parsing features, the data object is now linked to a feature set of type gene (indexed by its hash):

In [None]:
dobject.features
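
To picture what "indexed by its hash" can mean: an identifier for a feature set can be derived deterministically from the feature IDs it contains, so that the same set of genes always yields the same key. Below is a minimal sketch of that idea using only the standard library. The helper name `feature_set_hash` and the choice of SHA-256 are illustrative assumptions, not LaminDB's actual scheme.

```python
import hashlib


def feature_set_hash(feature_ids):
    """Return a deterministic hex digest for a collection of feature IDs.

    Hypothetical helper for illustration; not LaminDB's implementation.
    """
    # Sort and deduplicate so that order and repeats do not change the hash
    canonical = "\n".join(sorted(set(feature_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Ensembl-style mouse gene IDs used as example input
ids_a = ["ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000028"]
ids_b = list(reversed(ids_a))

hash_a = feature_set_hash(ids_a)
hash_b = feature_set_hash(ids_b)

# The same set of features gives the same key, regardless of order
print(hash_a == hash_b)
```

With such a key, two data objects that measure exactly the same features can point to one shared feature set record instead of storing it twice.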

This feature set links records for 10k genes. Here are the first 3, all of which can be queried:

In [None]:
dobject.features[0].genes[:3]

Hence, we can query not just by Ensembl ID, but also by gene symbol, NCBI ID, gene type, etc.
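
Why linking makes this possible can be sketched with a toy model: once each feature points to a full reference record, any attribute of that record can serve as a query key. The example below is a standalone illustration with placeholder values. The names `GeneRecord`, `links`, and `find_datasets` are hypothetical and not part of the LaminDB API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """Toy stand-in for a gene record; the fields are illustrative."""

    ensembl_id: str
    symbol: str
    gene_type: str


# Toy reference table (placeholder values, not real annotations)
genes = [
    GeneRecord("ENS_TOY_1", "GeneA", "protein_coding"),
    GeneRecord("ENS_TOY_2", "GeneB", "protein_coding"),
    GeneRecord("ENS_TOY_3", "GeneC", "lncRNA"),
]

# Links from data object names to the IDs of their features
links = {
    "dataset_1": {"ENS_TOY_1", "ENS_TOY_2"},
    "dataset_2": {"ENS_TOY_3"},
}


def find_datasets(**criteria):
    """Return names of data objects with a feature matching all criteria."""
    matching_ids = {
        gene.ensembl_id
        for gene in genes
        if all(getattr(gene, key) == value for key, value in criteria.items())
    }
    # Keep data objects whose feature IDs overlap with the matching records
    return sorted(name for name, ids in links.items() if ids & matching_ids)


print(find_datasets(symbol="GeneA"))      # ['dataset_1']
print(find_datasets(gene_type="lncRNA"))  # ['dataset_2']
```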

Here, all features were successfully (unambiguously) linked against their canonical reference in `bionty.Gene`.
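
What "unambiguously linked" means can be made concrete with a small check: every feature ID should match exactly one record in the reference. The sketch below classifies feature IDs accordingly. It is a standalone illustration with placeholder IDs, and `check_linking` is a hypothetical helper, not how LaminDB or Bionty implement linking.

```python
from collections import Counter


def check_linking(feature_ids, reference_ids):
    """Classify feature IDs as linked, ambiguous, or missing.

    Hypothetical helper for illustration only.
    """
    # Count how many reference records carry each ID
    counts = Counter(reference_ids)
    report = {"linked": [], "ambiguous": [], "missing": []}
    for feature_id in feature_ids:
        n_matches = counts[feature_id]
        if n_matches == 1:
            report["linked"].append(feature_id)
        elif n_matches > 1:
            report["ambiguous"].append(feature_id)
        else:
            report["missing"].append(feature_id)
    return report


# Placeholder IDs: "id3" occurs twice in the reference, "id9" not at all
reference_ids = ["id1", "id2", "id3", "id3"]
report = check_linking(["id1", "id3", "id9"], reference_ids)
print(report)
```

In the scRNA-seq example above, the equivalent of the `ambiguous` and `missing` lists would both be empty.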

In [None]:
ln.add(dobject)

## Linking flow cytometry data against cell markers

Let us now consider a flow cytometry example dataset:

In [None]:
filepath = ln.dev.datasets.file_fcs()

filepath

Because the file is a standard `.fcs` file, `ln.DObject` can parse it under the hood using [readfcs](https://lamin.ai/docs/readfcs).

Alternatively, we can load it into memory as an `AnnData` object: `adata = readfcs.read_fcs(filepath)`.

We'll use a reference based on the `CellMarker` ontology to link features:

In [None]:
reference = lnk.CellMarker()

In [None]:
dobject = ln.DObject(filepath, features_ref=reference);

In [None]:
ln.add(dobject)