# Link features

So far, we have used LaminDB like a data lake: we could not yet query ingested data by its features[^features].

[^features]: We mostly use the term feature as a synonym for variable (statistics), column and field (databases), and dimension (machine learning).


We can also use LaminDB like a queryable data warehouse that stores links[^relations] and monitors data integrity.

Let us see how to do this by providing feature models at ingestion!

[^relations]: We mostly use the term link as a synonym for relation and reference.

In [1]:
import lamindb as db
import bionty as bt  # https://lamin.ai/docs/bionty
import scanpy as sc  # https://scanpy.readthedocs.io

sc.settings.verbosity = 0  # silence scanpy logging on the global settings object
db.header()

2022-08-23 10:37:56,499:INFO - Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-08-23 10:37:56,500:INFO - NumExpr defaulting to 8 threads.


| | |
|---|---|
| id | ZKJX7AnXzQQp |
| version | draft |
| time_init | 2022-07-30 16:14 |
| time_run | 2022-08-23 08:37 |
| pypackage | bionty==0.0.6+17.g0499eed lamindb==0.2.1 scanpy==1.8.2 |


## Example datasets

Consider two datasets:

- `data1`: a flow cytometry dataset in the form of an `.fcs` file
- `data2`: a scRNA-seq count matrix in the form of an in-memory `AnnData` object

In [2]:
data1 = db.datasets.file_fcs()
data1

PosixPath('example.fcs')

In [3]:
import random

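# Download the dataset from the EBI Single Cell Expression Atlas
data2 = sc.datasets.ebi_expression_atlas("E-MTAB-8414")
# Keep a random subset of 1000 genes to keep the example small
data2 = data2[:, data2.var_names.isin(random.sample(list(data2.var_names), 1000))]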

HTTPError: HTTP Error 500: Internal Server Error (https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-8414/)

## Define feature models

For `data1`, we specify a feature model using the `bionty` `Gene` entity with gene symbols.

In [None]:
feature_model1 = bt.Gene(id=bt.lookup.gene_id.name, species=bt.lookup.species.human)

Let us now ingest the data by passing the feature model to `db.do.ingest.add`. This creates all necessary links in the background, so that we can later query the `dobject` by its features.

It also logs and stores information on data integrity:

In [None]:
db.do.ingest.add(data1, feature_model=feature_model1)

🔶 name column not found, using index as features.




✅ 9 terms (56.2%) are linked.
🔶 7 terms (43.8%) are not linked.


With this feature model, 7 of the 16 features can't be linked, and hence we won't be able to query for them.

We can overcome this by working with a custom feature model, discussed later.
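
The integrity report above boils down to a set-membership check against a reference vocabulary. The following is a minimal sketch of that idea, not LaminDB's or Bionty's actual implementation: the `link_report` helper and the `reference_symbols` set are hypothetical, and the nine linked symbols are illustrative placeholders. Only the seven unlinked names are taken from the log of this notebook.

```python
def link_report(features, reference):
    """Split features into linked and unlinked terms and compute percentages."""
    linked = [f for f in features if f in reference]
    unlinked = [f for f in features if f not in reference]
    n_total = len(features)
    return {
        "n_mapped": len(linked),
        "percent_mapped": round(100 * len(linked) / n_total, 1),
        "percent_unmapped": round(100 * len(unlinked) / n_total, 1),
        "unmapped": unlinked,
    }


# Hypothetical, heavily abbreviated reference vocabulary of gene symbols.
reference_symbols = {
    "CD4", "CD8A", "CD19", "CD27", "CD28", "CCR7", "CD57", "CCR5", "PTPRC",
}

# Nine placeholder symbols that are in the reference, followed by the
# seven names that the log reports as unmapped.
features = [
    "CD4", "CD8A", "CD19", "CD27", "CD28", "CCR7", "CD57", "CCR5", "PTPRC",
    "FSC-A", "FSC-H", "SSC-A", "KI67", "CD3", "CD45RO", "VIVID / CD14",
]

report = link_report(features, reference_symbols)
print(report["n_mapped"], report["percent_mapped"], report["percent_unmapped"])
print(report["unmapped"])
```

Names such as `FSC-A` or `VIVID / CD14` describe instrument channels or stains rather than genes, which is why a gene-based feature model cannot link them.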

Features in `data2` are indexed by Ensembl gene IDs. For an overview of gene IDs, see: [`bt.lookup.gene_id`](https://lamin.ai/docs/bionty/api).

In [None]:
data2.var.head()

ENSMUSG00000000318
ENSMUSG00000000386
ENSMUSG00000000440
ENSMUSG00000000441
ENSMUSG00000000605


Hence, we use a feature model based on Ensembl IDs and ingest the data with it.

In [None]:
feature_model2 = bt.Gene(
    id=bt.lookup.gene_id.ensembl_gene_id, species=bt.lookup.species.mouse
)
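
Before settling on an Ensembl-based feature model, it can help to verify that the index really consists of Ensembl gene IDs. The sketch below uses a simplified pattern (`ENS`, an optional species prefix such as `MUS` for mouse, `G` for gene, then 11 digits); the `fraction_ensembl_ids` helper is hypothetical and not part of LaminDB or Bionty. With an `AnnData` object, one would pass `data2.var_names` instead of the hand-written list.

```python
import re

# Simplified pattern for Ensembl gene IDs, e.g. ENSG00000139618 (human)
# or ENSMUSG00000000318 (mouse). It ignores version suffixes.
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}$")


def fraction_ensembl_ids(names):
    """Return the fraction of names that look like Ensembl gene IDs."""
    names = list(names)
    if not names:
        return 0.0
    n_matching = sum(bool(ENSEMBL_GENE_ID.match(name)) for name in names)
    return n_matching / len(names)


# The first five identifiers shown for data2 above.
var_names = [
    "ENSMUSG00000000318",
    "ENSMUSG00000000386",
    "ENSMUSG00000000440",
    "ENSMUSG00000000441",
    "ENSMUSG00000000605",
]

print(fraction_ensembl_ids(var_names))
```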

In [None]:
db.do.ingest.add(data2, name="ebi_E-MTAB-8414", feature_model=feature_model2)

🔶 ensembl_gene_id column not found, using index as features.
✅ 1000 terms (100.0%) are linked.
🔶 0 terms (0.0%) are not linked.


We can retrieve the integrity information through `.logs`:

In [None]:
db.do.ingest.logs

{'example.fcs': {'feature': 'name',
  'n_mapped': 9,
  'percent_mapped': 56.2,
  'unmapped': Index(['FSC-A', 'FSC-H', 'SSC-A', 'KI67', 'CD3', 'CD45RO', 'VIVID / CD14'], dtype='object')},
 'ebi_E-MTAB-8414.h5ad': {'feature': 'ensembl_gene_id',
  'n_mapped': 1000,
  'percent_mapped': 100.0,
  'unmapped': Index([], dtype='object')}}
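
Because the logs are a plain dictionary keyed by file name, they are easy to screen programmatically. The sketch below assumes the structure printed above, reproduced by hand with lists in place of pandas `Index` objects; the `incompletely_linked` helper and its 100% threshold are hypothetical choices for illustration.

```python
# Hand-copied stand-in for db.do.ingest.logs, using lists instead of pandas Index.
logs = {
    "example.fcs": {
        "feature": "name",
        "n_mapped": 9,
        "percent_mapped": 56.2,
        "unmapped": ["FSC-A", "FSC-H", "SSC-A", "KI67", "CD3", "CD45RO", "VIVID / CD14"],
    },
    "ebi_E-MTAB-8414.h5ad": {
        "feature": "ensembl_gene_id",
        "n_mapped": 1000,
        "percent_mapped": 100.0,
        "unmapped": [],
    },
}


def incompletely_linked(logs, min_percent=100.0):
    """Return the unmapped features of every file linked below min_percent."""
    return {
        name: list(entry["unmapped"])
        for name, entry in logs.items()
        if entry["percent_mapped"] < min_percent
    }


for name, unmapped in incompletely_linked(logs).items():
    print(f"{name}: {len(unmapped)} unlinked features: {unmapped}")
```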

Finalize the ingestion.

In [None]:
db.do.ingest.commit()

✅ Annotated data mWmIx2pzpzz4eCOa0jR5f with the following features:
+------------+------------+------------+
| geneset.id | biometa.id | species.id |
+------------+------------+------------+
|     3      |     3      |     1      |
+------------+------------+------------+
✅ Annotated data ZbdKAJ3DAEszfn8MpGCwd with the following features:
+------------+------------+------------+
| geneset.id | biometa.id | species.id |
+------------+------------+------------+
|     4      |     4      |     2      |
+------------+------------+------------+
✅ Ingested the following dobjects:
+-------------------------------------------------+-----------------------------------+-----------------------+
|                     dobject                     |              jupynb               |         user          |
+-------------------------------------------------+-----------------------------------+---------

RuntimeError: Make sure you save the notebook in your editor before publishing!
You can avoid the need for manually saving in Jupyter Lab, which auto-saves the buffer during publish.