[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/analysis-flow.ipynb)

# Analysis flow

This guide tracks typical data transformations that occur during analysis, such as subsetting a dataset.

For a more general introduction, read {doc}`/project-flow` first.

In [None]:
# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./analysis-usecase --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
from lamin_utils import logger

## Register an initial dataset

We register an initial artifact by running the pipeline script [register_example_file.py](https://github.com/laminlabs/lamin-usecases/blob/main/docs/analysis-flow-scripts/register_example_file.py).

In [None]:
!python analysis-flow-scripts/register_example_file.py

## Pull the registered dataset, apply a transformation, and register the result

Track the current notebook:

In [None]:
ln.settings.transform.stem_uid = "eNef4Arw8nNM"
ln.settings.transform.version = "0"
ln.track()

In [None]:
artifact = ln.Artifact.filter(description="anndata with obs").one()
artifact.describe()

### Get a backed AnnData object

In [None]:
adata = artifact.open()
adata

### Subset dataset to specific cell types and diseases

In [None]:
cell_types = artifact.cell_types.all().lookup(return_field="name")
diseases = artifact.diseases.all().lookup(return_field="name")
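
The lookup objects expose each record name as an attribute, which is why the cells that follow can write `cell_types.t_cell` instead of the string `"T cell"`. The sketch below only illustrates that idea with the standard library. `make_lookup` and `to_attribute_name` are hypothetical helpers, and the normalization rule (lowercase, non-word characters replaced by underscores) is an assumption, not LaminDB's actual implementation.

```python
import re
from types import SimpleNamespace


def to_attribute_name(label: str) -> str:
    """Turn a label into a lowercase identifier (assumed normalization rule)."""
    return re.sub(r"\W+", "_", label.strip().lower()).strip("_")


def make_lookup(names):
    """Hypothetical stand-in for a name lookup: attribute -> original name."""
    return SimpleNamespace(**{to_attribute_name(n): n for n in names})


cell_types = make_lookup(["T cell", "hematopoietic stem cell"])
print(cell_types.t_cell)
print(cell_types.hematopoietic_stem_cell)
```

The benefit of such a lookup is tab completion and protection against typos in free-text labels.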

Create the subset:

In [None]:
subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & adata.obs.disease.isin(
    [diseases.liver_lymphoma, diseases.chronic_kidney_disease]
)

In [None]:
adata_subset = adata[subset_obs]
adata_subset

In [None]:
adata_subset.obs[["cell_type", "disease"]].value_counts()
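
The subset mask keeps an observation only if both its cell type and its disease are in the selected lists. The plain-Python sketch below mirrors that logic on a toy table so the row-by-row behavior is easy to see. All rows and values are invented for illustration, and the notebook itself relies on pandas rather than on this code.

```python
from collections import Counter

# Toy observation table; rows and values are invented for illustration.
obs = [
    {"cell_type": "T cell", "disease": "liver lymphoma"},
    {"cell_type": "T cell", "disease": "normal"},
    {"cell_type": "hematopoietic stem cell", "disease": "chronic kidney disease"},
    {"cell_type": "B cell", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row: True only if BOTH membership tests pass.
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count (cell_type, disease) combinations, similar in spirit to value_counts().
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
print(mask)
print(counts)
```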

Register the subsetted AnnData:

In [None]:
curate = ln.Curate.from_anndata(
    adata_subset.to_memory(),  # load the backed subset into memory
    var_index=bt.Gene.ensembl_gene_id,  # the var index holds Ensembl gene IDs
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)

curate.validate()
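
Conceptually, validation checks that every value in a categorical column matches an entry in the corresponding registry. The toy sketch below illustrates that idea only. `find_non_validated`, the registries, and the misspelled value are all hypothetical, and this is not how `ln.Curate` is implemented.

```python
def find_non_validated(columns: dict, registries: dict) -> dict:
    """Map each column to the sorted values missing from its registry."""
    problems = {}
    for column, values in columns.items():
        missing = sorted(set(values) - registries[column])
        if missing:
            problems[column] = missing
    return problems


# Hypothetical registries of known names.
registries = {
    "cell_type": {"T cell", "hematopoietic stem cell"},
    "disease": {"liver lymphoma", "chronic kidney disease"},
}
# Toy columns; "T cel" is a deliberate typo.
columns = {
    "cell_type": ["T cell", "T cel"],
    "disease": ["liver lymphoma"],
}

problems = find_non_validated(columns, registries)
print(problems)
```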

In [None]:
artifact = curate.save_artifact(description="anndata with obs subset")

In [None]:
artifact.describe()

## Examine data flow

Query a subsetted `.h5ad` artifact containing "hematopoietic stem cell" and "T cell":

In [None]:
cell_types = bt.CellType.lookup()

In [None]:
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()

In [None]:
my_subset
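
The keyword arguments in the query follow the Django-style `field__operator` convention: `description__endswith` tests a string suffix and `cell_types__in` tests membership in a list. The sketch below shows how such keys can be parsed and applied to plain dictionaries. `matches` and the toy records are hypothetical, and the rule that one overlapping value is enough for `__in` on a multi-valued field is an assumption about the ORM's behavior, not a statement about LaminDB's internals.

```python
def matches(record: dict, **lookups) -> bool:
    """Return True if `record` satisfies every `field__operator=value` lookup."""
    for key, expected in lookups.items():
        field, _, operator = key.partition("__")
        actual = record[field]
        if operator == "":  # no operator means exact match
            ok = actual == expected
        elif operator == "endswith":
            ok = actual.endswith(expected)
        elif operator == "in":
            # For a multi-valued field, one overlapping value is enough (assumed).
            values = actual if isinstance(actual, (list, set, tuple)) else [actual]
            ok = any(v in expected for v in values)
        else:
            raise ValueError(f"unsupported operator: {operator}")
        if not ok:
            return False
    return True


# Toy records standing in for registered artifacts.
records = [
    {"suffix": ".h5ad", "description": "anndata with obs", "cell_types": ["T cell", "B cell"]},
    {"suffix": ".h5ad", "description": "anndata with obs subset", "cell_types": ["T cell"]},
]

hits = [
    r
    for r in records
    if matches(
        r,
        suffix=".h5ad",
        description__endswith="subset",
        cell_types__in=["hematopoietic stem cell", "T cell"],
    )
]
print([r["description"] for r in hits])
```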

Common questions that might arise are:

- What is the history of this artifact?
- Which features and labels are associated with it?
- Which notebook analyzed and registered this artifact?
- By whom?
- And which artifact is its parent?

Let's answer these using LaminDB:

In [None]:
print("--> What is the history of this artifact?\n")
artifact.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
logger.print(artifact.features)
logger.print(artifact.labels)

print("\n\n--> Which notebook analyzed and registered this artifact?\n")
logger.print(artifact.transform)

print("\n\n--> By whom?\n")
logger.print(artifact.created_by)

print("\n\n--> And which artifact is its parent?\n")
display(artifact.run.input_artifacts.df())
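
The answers above all follow one chain of links: an artifact points to the run that created it, and the run points to its transform, its creator, and its input artifacts. The sketch below models that chain with small dataclasses to show how parents can be found by walking it. The classes, the `ancestors` helper, and the user name are hypothetical simplifications, not LaminDB's registries.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transform:
    name: str


@dataclass
class Run:
    transform: Transform
    created_by: str
    input_artifacts: List["Artifact"] = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Optional[Run] = None  # the run that created this artifact


def ancestors(artifact: Artifact) -> List[str]:
    """Walk artifact -> run -> input artifacts, nearest parents first."""
    found = []
    queue = list(artifact.run.input_artifacts) if artifact.run else []
    while queue:
        parent = queue.pop(0)
        found.append(parent.description)
        if parent.run:
            queue.extend(parent.run.input_artifacts)
    return found


# Toy lineage mirroring this guide: a script registers the dataset,
# then a notebook run reads it and registers the subset.
raw = Artifact(
    "anndata with obs",
    run=Run(Transform("register_example_file.py"), "user-a"),
)
subset = Artifact(
    "anndata with obs subset",
    run=Run(Transform("Analysis flow"), "user-a", [raw]),
)
print(ancestors(subset))
```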

In [None]:
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase