# Track data using bio-registries & provenance

In [None]:
# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

lb.settings.species = "human"  # globally set species
lb.settings.auto_save_parents = False

In [None]:
ln.track()

## Track cell types, tissues and diseases

We fetch an example dataset from LaminDB that has a few cell type, tissue and disease annotations:

In [None]:
adata = ln.dev.datasets.anndata_with_obs()

In [None]:
adata

In [None]:
adata.var_names[:5]

In [None]:
adata.obs[["tissue", "cell_type", "disease"]].value_counts()

### Register biological metadata and link to the dataset

As a first step, we register the AnnData object with LaminDB using {meth}`~lamindb.File.from_anndata`:

In [None]:
file = ln.File.from_anndata(
    adata, key="mini_anndata_with_obs.h5ad", var_ref=lb.Gene.ensembl_gene_id
)

In [None]:
file.save()

In [None]:
cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)

All of these look good and contain no typos, so let's save them to their registries:

In [None]:
ln.save(cell_types)
ln.save(tissues)
ln.save(diseases)
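To make concrete what "contain no typos" means above, here is a minimal, dependency-free sketch of validating observed values against a controlled vocabulary. It is not how LaminDB or Bionty implement validation: `KNOWN_CELL_TYPES`, `split_validated` and `suggest` are hypothetical names, and the vocabulary is a hand-written stand-in for a real ontology-backed registry.

```python
from difflib import get_close_matches

# Hypothetical mini-vocabulary standing in for a cell type registry (assumption).
KNOWN_CELL_TYPES = {"T cell", "hematopoietic stem cell", "hepatocyte"}


def split_validated(values, vocabulary):
    """Return (validated, unvalidated) unique values in first-seen order."""
    validated, unvalidated = [], []
    for value in dict.fromkeys(values):  # de-duplicate while keeping order
        (validated if value in vocabulary else unvalidated).append(value)
    return validated, unvalidated


def suggest(value, vocabulary):
    """Suggest the closest vocabulary term for a value that failed validation."""
    return get_close_matches(value, sorted(vocabulary), n=1, cutoff=0.6)


# The last value contains a deliberate typo ("hematopoetic").
observed = ["T cell", "T cell", "hepatocyte", "hematopoetic stem cell"]
validated, unvalidated = split_validated(observed, KNOWN_CELL_TYPES)

print(validated)
print(unvalidated)
print({value: suggest(value, KNOWN_CELL_TYPES) for value in unvalidated})
```

Only values that land in `validated` would be safe to save as-is; anything in `unvalidated` deserves a look before it reaches a registry.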

We also need features under which to group these labels:

In [None]:
ln.Feature(name="cell_type", type="category").save()
ln.Feature(name="tissue", type="category").save()
ln.Feature(name="disease", type="category").save()

Link the labels to the file:

In [None]:
file.add_labels(cell_types)
file.add_labels(tissues)
file.add_labels(diseases)

In [None]:
file.describe()

In [None]:
file.view_lineage()

Examine the currently available cell types and tissues:

In [None]:
lb.CellType.filter().df()

In [None]:
lb.Tissue.filter().df()

## Processing the dataset

To track our data transformation, we create a new {class}`~lamindb.Transform` of type "pipeline":

In [None]:
transform = ln.Transform(
    name="Subset to T-cells and liver lymphoma", version="0.1.0", type="pipeline"
)

Switch tracking to the new transform:

In [None]:
ln.track(transform)

### Get a backed AnnData object

In [None]:
file = ln.File.filter(key="mini_anndata_with_obs.h5ad").one()

In [None]:
adata = file.backed()
adata

In [None]:
adata.obs[["cell_type", "disease"]].value_counts()

### Subset dataset to specific cell types and diseases

Build a boolean mask that selects the subset, then apply it:

In [None]:
subset_obs = adata.obs.cell_type.isin(["T cell", "hematopoietic stem cell"]) & (
    adata.obs.disease.isin(["liver lymphoma", "chronic kidney disease"])
)

In [None]:
adata_subset = adata[subset_obs]
adata_subset

In [None]:
adata_subset.obs[["cell_type", "disease"]].value_counts()
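The mask above keeps an observation only when both conditions hold. As a dependency-free sketch of the same boolean logic (the rows below are hypothetical and are not the values of the example dataset), the selection and a rough equivalent of `value_counts()` can be written in plain Python:

```python
from collections import Counter

# Hypothetical observation rows (assumption); the real values live in adata.obs.
rows = [
    {"cell_type": "T cell", "disease": "liver lymphoma"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "chronic kidney disease"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row: True only if BOTH conditions hold (the `&` in the mask).
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in rows
]

# Apply the mask, analogous to adata[subset_obs].
subset = [row for row, keep in zip(rows, mask) if keep]

# Count each (cell_type, disease) combination in the subset.
counts = Counter((row["cell_type"], row["disease"]) for row in subset)

print(mask)
print(counts)
```

A row with a matching cell type but a non-matching disease (the second row) is dropped, as is a row with a matching disease but a non-matching cell type (the fourth row).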

This subset can now be registered:

In [None]:
file_subset = ln.File.from_anndata(
    adata_subset.to_memory(),
    key="subset/mini_anndata_with_obs.h5ad",
    var_ref=lb.Gene.ensembl_gene_id,
)

In [None]:
file_subset.save()

Add labels to the subset file. All of them validate:

In [None]:
# derive labels from the subset, not from the full dataset
cell_types = lb.CellType.from_values(adata_subset.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata_subset.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata_subset.obs.disease, lb.Disease.name)

file_subset.add_labels(cell_types)
file_subset.add_labels(tissues)
file_subset.add_labels(diseases)

In [None]:
file_subset.describe()

## Examine data lineage

Common questions that might arise are:

- Which h5ad file is in the `subset` subfolder?
- Which notebook ingested this file?
- By whom?
- And which file is its parent?

Let's answer these using LaminDB.

To learn which h5ad file is in the `subset` subfolder, query for an `.h5ad` file under that key prefix that is labeled with "hematopoietic stem cell" or "T cell":

In [None]:
cell_types_bt_lookup = lb.CellType.lookup()

In [None]:
my_subset = ln.File.filter(
    suffix=".h5ad",
    key__startswith="subset",
    cell_types__in=[
        cell_types_bt_lookup.hematopoietic_stem_cell,
        cell_types_bt_lookup.t_cell,
    ],
).first()

In [None]:
my_subset.view_lineage()
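Conceptually, the remaining questions (which transform produced a file, by whom, and which file is its parent) amount to walking a small provenance graph. The following is a minimal sketch of that idea and not LaminDB's data model: the `Transform`, `Run` and `FileRecord` classes, the `lineage` helper, the user handle and the ingesting notebook's name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transform:
    name: str
    type: str  # e.g. "notebook" or "pipeline"


@dataclass(frozen=True)
class Run:
    transform: Transform
    created_by: str


@dataclass(frozen=True)
class FileRecord:
    key: str
    run: Run  # the run that produced this file
    parent: Optional["FileRecord"] = None  # the input file, if any


def lineage(file):
    """Walk from a file back to its root, returning one description per step."""
    steps = []
    current = file
    while current is not None:
        run = current.run
        steps.append(f"{current.key} <- {run.transform.name} (by {run.created_by})")
        current = current.parent
    return steps


# Hypothetical records mirroring the two files registered in this notebook.
ingest = Run(Transform("Example ingest notebook", "notebook"), created_by="testuser1")
subset_run = Run(
    Transform("Subset to T-cells and liver lymphoma", "pipeline"),
    created_by="testuser1",
)
parent_file = FileRecord("mini_anndata_with_obs.h5ad", run=ingest)
subset_file = FileRecord(
    "subset/mini_anndata_with_obs.h5ad", run=subset_run, parent=parent_file
)

for step in lineage(subset_file):
    print(step)
```

Each printed line names a file, the transform that produced it and the user behind the run, starting from the subset and ending at the originally ingested file.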

In [None]:
!lamin delete analysis-usecase
!rm -r ./analysis-usecase