# Track data lineage through processing steps

In [None]:
# Initialize a LaminDB instance with the bionty schema (skip if you have already loaded an instance)

!lamin init --storage bio-lineage --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

In [None]:
ln.track()

## Track AnnData with cell types, tissues and diseases

### An AnnData with `tissue`, `cell_type`, `disease` in `.obs`

In [None]:
adata = ln.dev.datasets.anndata_with_obs()

In [None]:
adata

In [None]:
adata.obs[["tissue", "cell_type"]].value_counts()

### Register biological metadata and link to the dataset

```{note}

If you don't want auto-populated ontology_ids, use `ln.parse(..., from_bionty=False)`.

Also see: {doc}`docs:guide/parse`
```

In [None]:
file = ln.File(adata, key="mini_anndata_with_obs.h5ad")

In [None]:
ln.save(file)

In [None]:
cell_types = ln.parse(adata.obs.cell_type, lb.CellType.name)
tissues = ln.parse(adata.obs.tissue, lb.Tissue.name)

In [None]:
ln.add(cell_types)
ln.add(tissues);

In [None]:
file.cell_types.set(cell_types)
file.tissues.set(tissues)

### Your vocabulary store

In [None]:
ln.select(lb.CellType).df()

In [None]:
ln.select(lb.Tissue).df()

## Processing of the dataset

Pull the ingested parent dataset:

In [None]:
file = ln.select(ln.File, key="mini_anndata_with_obs.h5ad").one()

In [None]:
adata = file.backed()  # get a backed AnnData object (cloud-backed if the storage is remote)

In [None]:
adata

In [None]:
adata.obs[["cell_type", "disease"]].value_counts()

### Subset dataset to specific cell types and diseases (`.obs`)

In [None]:
transform = ln.Transform(
    type="pipeline", name="subset_to_T_cells_and_liver_lymphoma", version="0.1.0"
)

In [None]:
ln.track(transform)

In [None]:
obs = file.subsetter()

In [None]:
subset_obs = obs.cell_type.isin(["T cell", "hematopoietic stem cell"]) & (
    obs.disease.isin(["liver lymphoma", "chronic kidney disease"])
)
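
The mask above combines two membership tests with an element-wise AND. A minimal sketch of the same logic on a plain pandas DataFrame follows; it assumes the object returned by `file.subsetter()` behaves like a DataFrame for `isin` and `&`, and the toy rows are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Toy stand-in for `.obs`; values are illustrative, not from the real dataset.
obs = pd.DataFrame(
    {
        "cell_type": ["T cell", "T cell", "hematopoietic stem cell", "hepatocyte"],
        "disease": [
            "liver lymphoma",
            "Alzheimer disease",
            "chronic kidney disease",
            "liver lymphoma",
        ],
    }
)

# Each `isin` call yields one boolean per row; `&` combines them element-wise.
cell_type_mask = obs["cell_type"].isin(["T cell", "hematopoietic stem cell"])
disease_mask = obs["disease"].isin(["liver lymphoma", "chronic kidney disease"])
mask = cell_type_mask & disease_mask

# Only rows that satisfy both conditions are kept.
selected = obs[mask]
print(selected)
```

A row is kept only when its cell type is in the first list and its disease is in the second, so a T cell with an unlisted disease is dropped.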

In [None]:
adata_subset = file.stream(subset_obs=subset_obs, is_run_input=True)

In [None]:
adata_subset

In [None]:
adata_subset.obs[["cell_type", "disease"]].value_counts()

````{tip}

- For `h5ad`, you can provide either `subset_obs` OR `subset_var` for subsetting (h5py doesn't support indexing along multiple axes).
- For `zarr`, you can provide BOTH. (Note that zarr streaming and subsetting only work with `anndata>=0.9.1`.)

Subset to specific genes (`.var`):

```python
genes = adata.var.index.values[:10]
var = file.subsetter()
adata_subset_var = file.stream(subset_var=var.index.isin(genes))
```

````
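
To make the single-axis restriction more concrete, here is a minimal sketch that uses a NumPy array as a stand-in for the expression matrix. It shows that subsetting one axis at a time gives the same result as a joint selection on both axes. This only illustrates the idea and says nothing about how `file.stream` is implemented; the array and masks are made up for the example.

```python
import numpy as np

# Stand-in for an expression matrix: 4 observations x 5 variables.
X = np.arange(20).reshape(4, 5)
obs_mask = np.array([True, False, True, False])
var_mask = np.array([True, True, False, False, True])

# Step 1: subset rows. Step 2: subset columns of the intermediate result.
two_step = X[obs_mask][:, var_mask]

# Joint selection on both axes at once; np.ix_ builds an open mesh from the masks.
joint = X[np.ix_(obs_mask, var_mask)]

print(two_step)
print(np.array_equal(two_step, joint))
```

Under that assumption, one option for `h5ad` would be to stream along one axis and apply the second mask to the in-memory result.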

## Add the subset `AnnData` to LaminDB

In [None]:
file_subset = ln.File(adata_subset, key="subset/mini_anndata_with_obs.h5ad")

In [None]:
ln.save(file_subset)

Link the subsetted file to cell types:

In [None]:
cell_types = ln.parse(adata_subset.obs.cell_type, lb.CellType.name)

In [None]:
file_subset.cell_types.set(cell_types)

## Data lineage

- Which h5ad file is in the `subset` subfolder?
- Which notebook or pipeline ingested this file?
- By whom?
- And which file is its parent?

Query for an `.h5ad` file under `subset` that is linked to "hematopoietic stem cell" or "T cell":

In [None]:
file_subset = (
    ln.select(ln.File, suffix=".h5ad")
    .filter(
        key__startswith="subset",
        cell_types__name__in=["hematopoietic stem cell", "T cell"],
    )
    .first()
)
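
The double-underscore lookups in this query read like Django field lookups. Assuming they follow those conventions, `key__startswith` is a prefix test and `cell_types__name__in` matches a file when at least one linked cell type has a name in the list. The sketch below spells out that reading in plain Python; `FileRecord` and the sample records are hypothetical stand-ins, not LaminDB classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileRecord:
    """Hypothetical stand-in for a registered file, not the LaminDB File class."""

    key: str
    suffix: str
    cell_types: List[str] = field(default_factory=list)


records = [
    FileRecord("mini_anndata_with_obs.h5ad", ".h5ad", ["T cell", "hepatocyte"]),
    FileRecord(
        "subset/mini_anndata_with_obs.h5ad",
        ".h5ad",
        ["T cell", "hematopoietic stem cell"],
    ),
]

wanted = {"hematopoietic stem cell", "T cell"}

matches = [
    r
    for r in records
    if r.suffix == ".h5ad"  # suffix=".h5ad"
    and r.key.startswith("subset")  # key__startswith="subset"
    and wanted.intersection(r.cell_types)  # any linked name is in the list
]

# Analogous to `.first()`: take the first match, or None if there is none.
first_match = matches[0] if matches else None
print(first_match)
```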

File:

In [None]:
file_subset

Which notebook or pipeline ingested this file?

In [None]:
file_subset.transform

Who ingested this file?

In [None]:
file_subset.created_by

Parent files:

In [None]:
file_subset.run.inputs.values()
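
The queries above follow a chain: a file points to the run that created it, the run points to its transform, and the run lists its input files. The sketch below models that chain with plain dataclasses and walks it upstream. All class and function names here are hypothetical and chosen for illustration; they mirror the attribute names used above (`run`, `transform`, `inputs`) but are not the LaminDB implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transform:
    name: str
    type: str


@dataclass
class Run:
    transform: Transform
    inputs: List["File"] = field(default_factory=list)


@dataclass
class File:
    key: str
    run: Optional[Run] = None  # the run that created this file, if tracked


def ancestors(file: File) -> List[str]:
    """Return the keys of all upstream files, nearest parents first."""
    found = []
    queue = list(file.run.inputs) if file.run else []
    while queue:
        parent = queue.pop(0)
        found.append(parent.key)
        if parent.run:
            queue.extend(parent.run.inputs)
    return found


parent = File("mini_anndata_with_obs.h5ad")
subset_run = Run(
    Transform("subset_to_T_cells_and_liver_lymphoma", "pipeline"),
    inputs=[parent],
)
child = File("subset/mini_anndata_with_obs.h5ad", run=subset_run)

print(ancestors(child))
```

In this toy model the parent file has no recorded run, so the walk stops after one step; with longer pipelines the same loop would continue through every upstream run.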