# Link features

So far, we have used LaminDB like a data lake: we have not yet enabled queries by the features[^features] of ingested data.

[^features]: We mostly use the term feature as a synonym for variable (statistics), column and field (databases), and dimension (machine learning).


We can also use LaminDB like a queryable data warehouse to store links[^relations] and monitor data integrity.

Let us explain how to implement this by providing feature models at ingestion!

[^relations]: We mostly use the term link as a synonym for relation and reference.
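
To make the idea of links concrete, here is a minimal sketch of a link table that connects data objects to features, using an in-memory SQLite database. The table names, the `ingest` and `dobjects_with_feature` helpers, and the toy feature names are all hypothetical and chosen for illustration; this is not LaminDB's actual schema or API.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only (not LaminDB's tables).
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE dobject (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE feature (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE dobject_feature (
        dobject_id INTEGER REFERENCES dobject(id),
        feature_id INTEGER REFERENCES feature(id),
        PRIMARY KEY (dobject_id, feature_id)
    );
    """
)


def ingest(con, name, feature_names):
    """Store a data object and link it to each of its features."""
    dobject_id = con.execute(
        "INSERT INTO dobject (name) VALUES (?)", (name,)
    ).lastrowid
    for feature_name in feature_names:
        # Create the feature record only if it does not exist yet.
        con.execute(
            "INSERT OR IGNORE INTO feature (name) VALUES (?)", (feature_name,)
        )
        (feature_id,) = con.execute(
            "SELECT id FROM feature WHERE name = ?", (feature_name,)
        ).fetchone()
        # The link row is what makes the data object queryable by feature.
        con.execute(
            "INSERT OR IGNORE INTO dobject_feature VALUES (?, ?)",
            (dobject_id, feature_id),
        )
    return dobject_id


def dobjects_with_feature(con, feature_name):
    """Return the names of all data objects linked to a feature."""
    rows = con.execute(
        """
        SELECT d.name
        FROM dobject AS d
        JOIN dobject_feature AS l ON l.dobject_id = d.id
        JOIN feature AS f ON f.id = l.feature_id
        WHERE f.name = ?
        ORDER BY d.name
        """,
        (feature_name,),
    ).fetchall()
    return [name for (name,) in rows]


# Toy data: the object and feature names are made up.
ingest(con, "flow_example", ["CD3", "CD8"])
ingest(con, "rna_example", ["CD8", "CD19"])
print(dobjects_with_feature(con, "CD8"))
```

Without the `dobject_feature` rows, the data objects would still be stored, but there would be nothing to join on when asking which objects measure a given feature.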

In [None]:
import lamindb as db
import bionty as bt  # https://lamin.ai/docs/bionty
import scanpy as sc  # https://scanpy.readthedocs.io

db.header()

## Example datasets

Consider
- `data1`: a flow cytometry dataset in form of an `.fcs` file
- `data2`: a scRNA-seq count matrix in form of an `AnnData` object in memory

In [None]:
data1 = db.datasets.file_fcs()
data1

In [None]:
data2 = sc.read(db.datasets.file_mouse_sc_lymph_node())

## Define feature models

For `data1`, we specify a feature model using the `bionty` `CellMarker` entity for human cell markers.

In [None]:
feature_model1 = bt.CellMarker(species=bt.lookup.species.human)

Let us now ingest the data by passing the feature model to `db.do.ingest.add`. This creates all necessary links in the background, so that we can later query the `dobject` by its features.

It also logs and stores information on data integrity:

In [None]:
db.do.ingest.add(data1, feature_model=feature_model1, featureset_name="flow_panel_1")

With this feature model, 9 features can't be linked, so we won't be able to query for them.

We can overcome this by working with a custom feature model, discussed later.
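
The kind of integrity information reported above can be pictured as a comparison between the feature names in the data and the identifiers a feature model knows about. The sketch below is an assumption about the general idea, not LaminDB's implementation: `check_features` is a hypothetical helper, and the reference set and panel are toy values.

```python
def check_features(feature_names, reference):
    """Split feature names into those found in a reference set and those not.

    Hypothetical helper for illustration; not part of LaminDB or bionty.
    """
    reference = set(reference)
    mapped = [name for name in feature_names if name in reference]
    unmapped = [name for name in feature_names if name not in reference]
    n_total = len(feature_names)
    return {
        "n_mapped": len(mapped),
        "n_unmapped": len(unmapped),
        "unmapped": unmapped,
        # Guard against division by zero for empty inputs.
        "frac_mapped": len(mapped) / n_total if n_total else 0.0,
    }


# Toy reference and panel; real flow panels often contain channels such as
# scatter or time that are not cell markers and therefore cannot be linked.
toy_reference = {"CD3", "CD8", "CD19"}
toy_panel = ["CD3", "CD8", "FSC-A", "Time"]

report = check_features(toy_panel, toy_reference)
print(report)
```

Features that end up in `unmapped` are the ones that cannot be linked and hence cannot be queried.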

Features in `data2` are indexed by Ensembl gene IDs. For an overview of gene IDs, see: [`bt.lookup.gene_id`](https://lamin.ai/docs/bionty/api).

In [None]:
data2.var.head()
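
Beyond inspecting the first rows, a pattern check can indicate whether an index holds Ensembl gene IDs. The sketch below assumes the common stable ID layout of `ENS`, an optional three-letter species prefix (such as `MUS` for mouse, none for human), `G`, and 11 digits, optionally followed by a version suffix; species with other prefix lengths would need a broader pattern. The helper names are hypothetical.

```python
import re

# Assumed layout: ENS + optional 3-letter species prefix + G + 11 digits,
# with an optional ".version" suffix.
ENSEMBL_GENE_ID = re.compile(r"ENS(?:[A-Z]{3})?G\d{11}(?:\.\d+)?")


def looks_like_ensembl_gene_id(identifier):
    """Return True if the string matches the assumed Ensembl gene ID layout."""
    return ENSEMBL_GENE_ID.fullmatch(identifier) is not None


def fraction_ensembl(identifiers):
    """Return the fraction of identifiers matching the assumed layout."""
    identifiers = list(identifiers)
    if not identifiers:
        return 0.0
    n_match = sum(looks_like_ensembl_gene_id(i) for i in identifiers)
    return n_match / len(identifiers)


# Example with a mouse ID, a human ID, and a gene symbol.
example_index = ["ENSMUSG00000059552", "ENSG00000141510", "Trp53"]
print(fraction_ensembl(example_index))
```

For an `AnnData` object, the same check could be applied to the variable index, for example `fraction_ensembl(data2.var.index)`.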

Hence, we use a feature model based on Ensembl gene IDs and ingest the data with it.

In [None]:
feature_model2 = bt.Gene(
    id=bt.lookup.gene_id.ensembl_gene_id, species=bt.lookup.species.mouse
)

In [None]:
db.do.ingest.add(
    data2,
    name="mouse_sc_lymph_node",
    feature_model=feature_model2,
    featureset_name="mouse_1k",
)

We can retrieve the integrity information through `.logs`:

In [None]:
db.do.ingest.logs

Finalize the ingestion.

In [None]:
db.do.ingest.commit()