# Link features

So far, we have used LaminDB like a data lake: we ingested data, but could not query it by its features[^features].

[^features]: We mostly use the term feature as a synonym for variable (statistics), column and field (databases), and dimension (machine learning).


We can also use LaminDB like a queryable data warehouse to store links[^relations] and monitor data integrity.

Let us explain how to implement this by providing feature models at ingestion!

[^relations]: We mostly use the term link as a synonym for relation and reference.

In [1]:
import lamindb as db
import bionty as bt  # https://lamin.ai/docs/bionty
import scanpy as sc  # https://scanpy.readthedocs.io

db.header()

2022-08-16 20:19:29,014:INFO - Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-08-16 20:19:29,015:INFO - NumExpr defaulting to 8 threads.


| | |
|---|---|
| id | ZKJX7AnXzQQp |
| version | draft |
| time_init | 2022-07-30 16:14 |
| time_run | 2022-08-16 18:19 |
| pypackage | bionty==0.0.6+17.g0499eed lamindb==0.2.1 scanpy==1.8.2 |


## Example datasets

Consider two datasets:
- `data1`: a flow cytometry dataset in the form of an `.fcs` file
- `data2`: a scRNA-seq count matrix in the form of an in-memory `AnnData` object

In [2]:
data1 = db.datasets.file_fcs()
data1

PosixPath('example.fcs')

In [3]:
data2 = sc.datasets.pbmc3k()
data2

AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

## Define feature models

For `data1`, we specify a feature model using the bionty Gene entity with id `hgnc_symbol` ([genenames.org](https://www.genenames.org/)).

In [4]:
feature_model1 = bt.Gene(id=bt.lookup.gene_ids.hgnc_symbol)

Let us now ingest the data by passing the feature model to `db.do.ingest.add`. This creates all necessary links in the background, so that we can later query the `dobject` by its features.

It also logs and stores information on data integrity:

In [5]:
db.do.ingest.add(data1, feature_model=feature_model1)

🔶 hgnc_symbol column not found, using index as features.
🔶 9 terms (56.2%) are not mappable.


With this feature model, 9 of the 16 features can't be linked, and hence we won't be able to query for them.

We can overcome this by working with a custom feature model, discussed later.
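To illustrate what "not mappable" means, here is a minimal sketch of such an integrity check in plain Python. It is not LaminDB's or bionty's implementation: `REFERENCE_SYMBOLS`, `mapping_report`, and the channel names are hypothetical stand-ins chosen for illustration.

```python
# Hypothetical reference set standing in for the gene symbols a feature model knows.
REFERENCE_SYMBOLS = {"CD4", "CD8A", "PTPRC", "MKI67", "CD14"}


def mapping_report(features, reference):
    """Split features into mapped and unmapped and report the mapped share."""
    mapped = [f for f in features if f in reference]
    unmapped = [f for f in features if f not in reference]
    percent_mapped = round(100 * len(mapped) / len(features), 1) if features else 0.0
    return {
        "n_mapped": len(mapped),
        "percent_mapped": percent_mapped,
        "unmapped": unmapped,
    }


# Toy channel names: scatter channels and marker aliases are not gene symbols.
channels = ["FSC-A", "SSC-A", "KI67", "CD3", "CD4", "CD14"]
report = mapping_report(channels, REFERENCE_SYMBOLS)
print(report)
```

In this toy example, scatter channels such as `FSC-A` and aliases such as `KI67` fall outside the reference set, which is the kind of mismatch the warning above reports.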

Features in `data2` are identified by Ensembl gene IDs, stored in the `gene_ids` column. For an overview of gene IDs, see: [`bt.lookup.gene_ids`](https://lamin.ai/docs/bionty/api).

In [6]:
data2.var = data2.var.rename(columns=dict(gene_ids=bt.lookup.gene_ids.ensembl_gene_id))
data2.var.head()

| index | ensembl_gene_id |
|---|---|
| MIR1302-10 | ENSG00000243485 |
| FAM138A | ENSG00000237613 |
| OR4F5 | ENSG00000186092 |
| RP11-34P13.7 | ENSG00000238009 |
| RP11-34P13.8 | ENSG00000239945 |
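The rename only relabels the column; the index of gene symbols and the ID values stay untouched. A minimal sketch on a toy pandas DataFrame, assuming that `bt.lookup.gene_ids.ensembl_gene_id` resolves to the string `"ensembl_gene_id"` (as the column name in the output suggests):

```python
import pandas as pd

# Toy stand-in for data2.var: gene symbols as index, one ID column.
var = pd.DataFrame(
    {"gene_ids": ["ENSG00000243485", "ENSG00000237613"]},
    index=["MIR1302-10", "FAM138A"],
)

# Assumption: the bionty lookup value is the plain string "ensembl_gene_id".
var = var.rename(columns={"gene_ids": "ensembl_gene_id"})

print(var.columns.tolist())
print(var.index.tolist())
```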


Hence, we use a feature model based on Ensembl gene IDs and ingest the data with it.

In [7]:
feature_model2 = bt.Gene(id=bt.lookup.gene_ids.ensembl_gene_id)

In [8]:
db.do.ingest.add(data2, name="scanpy_pbmc3k", feature_model=feature_model2)

🔶 9154 terms (28.0%) are not mappable.


We can retrieve the integrity information through `.logs`:

In [9]:
db.do.ingest.logs

{'example.fcs': {'feature': 'hgnc_symbol',
  'n_mapped': 7,
  'percent_mapped': 43.8,
  'unmapped': Index(['FSC-A', 'FSC-H', 'SSC-A', 'KI67', 'CD3', 'CD45RO', 'CD8', 'CD57',
         'VIVID / CD14'],
        dtype='object')},
 'scanpy_pbmc3k.h5ad': {'feature': 'ensembl_gene_id',
  'n_mapped': 23584,
  'percent_mapped': 72.0,
  'unmapped': Index(['ENSG00000238009', 'ENSG00000239945', 'ENSG00000237683',
         'ENSG00000239906', 'ENSG00000241599', 'ENSG00000228463',
         'ENSG00000237094', 'ENSG00000235249', 'ENSG00000236601',
         'ENSG00000236743',
         ...
         'ENSG00000217792', 'ENSG00000268276', 'ENSG00000148828',
         'ENSG00000215700', 'ENSG00000215699', 'ENSG00000215635',
         'ENSG00000268590', 'ENSG00000251180', 'ENSG00000215616',
         'ENSG00000215611'],
        dtype='object', name='ensembl_gene_id', length=9154)}}
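A dictionary of this shape is easy to condense into a per-object summary. The sketch below is plain Python and does not call LaminDB: `summarize` is a hypothetical helper, and `logs` is a hand-copied version of the `example.fcs` entry above, with the pandas `Index` replaced by a list.

```python
# Hand-copied from the example.fcs entry above; a list stands in for the pandas Index.
logs = {
    "example.fcs": {
        "feature": "hgnc_symbol",
        "n_mapped": 7,
        "percent_mapped": 43.8,
        "unmapped": [
            "FSC-A", "FSC-H", "SSC-A", "KI67", "CD3",
            "CD45RO", "CD8", "CD57", "VIVID / CD14",
        ],
    },
}


def summarize(logs):
    """Return one summary row per ingested object (hypothetical helper)."""
    rows = []
    for name, entry in logs.items():
        n_unmapped = len(entry["unmapped"])
        n_total = entry["n_mapped"] + n_unmapped
        rows.append(
            {
                "name": name,
                "feature": entry["feature"],
                "n_total": n_total,
                "n_unmapped": n_unmapped,
                "percent_unmapped": round(100 * n_unmapped / n_total, 1),
            }
        )
    return rows


summary = summarize(logs)
for row in summary:
    print(row)
```

For `example.fcs`, 7 mapped plus 9 unmapped features give 16 features in total.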

Finalize the ingestion.

In [10]:
db.do.ingest.commit()

✅ Annotated data BU9X0K5GXuWeuX3c4KMvl with the following features:
+------------+------------+------------+
| geneset.id | biometa.id | species.id |
+------------+------------+------------+
|     3      |     3      |     1      |
+------------+------------+------------+
✅ Annotated data XVGV6RMiPlOySQFPlHDYg with the following features:
+------------+------------+------------+
| geneset.id | biometa.id | species.id |
+------------+------------+------------+
|     4      |     4      |     1      |
+------------+------------+------------+
✅ Ingested the following dobjects:
+-----------------------------------------------+-----------------------------------+-----------------------+
|                    dobject                    |              jupynb               |         user          |
+-----------------------------------------------+-----------------------------------+---------------

RuntimeError: Make sure you save the notebook in your editor before publishing!
You can avoid the need for manually saving in Jupyter Lab, which auto-saves the buffer during publish.