[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/analysis-flow.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/laminlabs/lamin-usecases/main?labpath=lamin-usecases%2Fdocs%2Fanalysis-flow.ipynb)

# Analysis flow

Here, we'll track typical data transformations that occur during analysis, such as subsetting a dataset.

For a more general introduction, read {doc}`/project-flow` first.

## Setup

In [None]:
# initialize a lamindb instance that includes the Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
from lamin_utils import logger

bt.settings.auto_save_parents = False

## Register an initial dataset

Here we register an initial artifact with a pipeline script.

In [None]:
# register_example_file.py


def register_example_file():
    # create a pipeline transform to track the registration of the artifact
    transform = ln.Transform(
        name="register example artifact", type="pipeline", version="0.0.1"
    )
    ln.track(transform=transform)

    # an example dataset that has a few cell type, tissue and disease annotations
    adata = ln.core.datasets.anndata_with_obs()

    # validate and register features
    genes = bt.Gene.from_values(
        adata.var_names,
        bt.Gene.ensembl_gene_id,
        organism="human",
    )
    ln.save(genes)
    obs_features = ln.Feature.from_df(adata.obs)
    ln.save(obs_features)

    # validate and register labels
    cell_types = bt.CellType.from_values(adata.obs["cell_type"])
    ln.save(cell_types)
    tissues = bt.Tissue.from_values(adata.obs["tissue"])
    ln.save(tissues)
    diseases = bt.Disease.from_values(adata.obs["disease"])
    ln.save(diseases)

    # register artifact and annotate with features & labels
    artifact = ln.Artifact.from_anndata(
        adata,
        description="anndata with obs"
    )
    artifact.save()
    artifact.features.add_from_anndata(
        var_field=bt.Gene.ensembl_gene_id,
        organism="human",
    )
    features = ln.Feature.lookup()
    artifact.labels.add(cell_types, features.cell_type)
    artifact.labels.add(tissues, features.tissue)
    artifact.labels.add(diseases, features.disease)


register_example_file()

## Pull the registered dataset, apply a transformation, and register the result

Set the current notebook as the new transform:

In [None]:
ln.settings.transform.stem_uid = "eNef4Arw8nNM"
ln.settings.transform.version = "0"
ln.track()

In [None]:
artifact = ln.Artifact.filter(description="anndata with obs").one()

In [None]:
artifact.describe()

### Get a backed AnnData object

In [None]:
adata = artifact.backed()
adata

### Subset dataset to specific cell types and diseases

In [None]:
cell_types = artifact.cell_types.all().lookup(return_field="name")
diseases = artifact.diseases.all().lookup(return_field="name")

Create the subset:

In [None]:
subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
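
The mask above combines two `isin` conditions element-wise. The following is a minimal, pandas-only sketch of the same pattern; the toy `obs` table and its values are illustrative assumptions and do not come from the example dataset:

```python
import pandas as pd

# Toy stand-in for adata.obs (hypothetical values, for illustration only)
obs = pd.DataFrame(
    {
        "cell_type": ["T cell", "T cell", "hematopoietic stem cell", "hepatocyte"],
        "disease": [
            "liver lymphoma",
            "Alzheimer disease",
            "chronic kidney disease",
            "liver lymphoma",
        ],
    }
)

keep_cell_types = ["T cell", "hematopoietic stem cell"]
keep_diseases = ["liver lymphoma", "chronic kidney disease"]

# Each isin() call returns a boolean Series; & combines them element-wise,
# so a row is kept only if both conditions hold.
mask = obs["cell_type"].isin(keep_cell_types) & obs["disease"].isin(keep_diseases)
subset = obs[mask]
```

Only rows that match both a selected cell type and a selected disease survive, which is why the subset can be much smaller than either condition alone would suggest.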

In [None]:
adata_subset = adata[subset_obs]
adata_subset

In [None]:
adata_subset.obs[["cell_type", "disease"]].value_counts()
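
Calling `value_counts()` on several columns counts each distinct combination of values, which makes it a quick sanity check for a subset. A small sketch on a toy table (the rows are made up for illustration):

```python
import pandas as pd

# Toy stand-in for adata_subset.obs (hypothetical values)
obs = pd.DataFrame(
    {
        "cell_type": ["T cell", "T cell", "hematopoietic stem cell"],
        "disease": ["liver lymphoma", "liver lymphoma", "chronic kidney disease"],
    }
)

# One count per (cell_type, disease) combination, sorted by descending frequency
counts = obs[["cell_type", "disease"]].value_counts()
```

The result is a Series indexed by `(cell_type, disease)` pairs, so a single count can be looked up with a tuple such as `counts[("T cell", "liver lymphoma")]`.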

Register the subsetted AnnData:

In [None]:
file_subset = ln.Artifact.from_anndata(
    adata_subset.to_memory(),
    description="anndata with obs subset"
)

In [None]:
file_subset.save()

In [None]:
file_subset.features.add_from_anndata(
    var_field=bt.Gene.ensembl_gene_id,
    organism="human",  # optionally, globally set organism via bt.settings.organism = "human"
)

In [None]:
features = ln.Feature.lookup()

file_subset.labels.add(adata_subset.obs.cell_type, features.cell_type)
file_subset.labels.add(adata_subset.obs.disease, features.disease)
file_subset.labels.add(adata_subset.obs.tissue, features.tissue)

## Examine data flow

Query a subsetted `.h5ad` artifact containing "hematopoietic stem cell" and "T cell":

In [None]:
cell_types = bt.CellType.lookup()

In [None]:
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()

In [None]:
my_subset

Common questions that might arise are:

- What is the history of this artifact?
- Which features and labels are associated with it?
- Which notebook analyzed and registered this artifact?
- By whom?
- And which artifact is its parent?

Let's answer these questions using LaminDB:

In [None]:
print("--> What is the history of this artifact?\n")
file_subset.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
logger.print(file_subset.features)
logger.print(file_subset.labels)

print("\n\n--> Which notebook analyzed and registered this artifact?\n")
logger.print(file_subset.transform)

print("\n\n--> By whom?\n")
logger.print(file_subset.created_by)

print("\n\n--> And which artifact is its parent?\n")
display(file_subset.run.input_artifacts.df())
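
To make the relationships behind these answers concrete, here is a toy model of artifacts, runs, and transforms written with plain dataclasses. It is only a conceptual sketch and not how LaminDB is implemented; the class names, the `history` helper, and the user handle `"user-a"` are all hypothetical:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Transform:
    name: str
    type: str  # e.g. "pipeline" or "notebook"


@dataclass
class Run:
    transform: Transform
    created_by: str
    inputs: list[Artifact] = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Run  # the run that produced this artifact


def history(artifact: Artifact) -> list[str]:
    """Walk back through input artifacts and describe each step."""
    steps = []
    stack = [artifact]
    while stack:
        current = stack.pop()
        steps.append(f"{current.run.transform.name} -> {current.description}")
        stack.extend(current.run.inputs)
    return steps


# Mirror the two steps of this guide: a pipeline registers the full dataset,
# then a notebook run reads it and registers the subset.
register_run = Run(Transform("register example artifact", "pipeline"), "user-a")
full = Artifact("anndata with obs", register_run)
subset_run = Run(Transform("Analysis flow", "notebook"), "user-a", inputs=[full])
subset = Artifact("anndata with obs subset", subset_run)

trace = history(subset)
parents = [a.description for a in subset.run.inputs]
```

In this toy model, the parent of an artifact is simply whatever its creating run took as input, which is the same idea the `run.input_artifacts` query above relies on.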

In [None]:
# clean up: delete the test instance and its local storage
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase