[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/analysis-flow.ipynb)

# Analysis flow

Here, we'll track typical data transformations that occur during analysis, such as subsetting a dataset and registering the result.

In [None]:
# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-analysis-flow --modules bionty

In [None]:
import lamindb as ln
import bionty as bt

## Save an initial dataset

```{eval-rst}
.. literalinclude:: analysis-flow-scripts/register_example_file.py
   :language: python
   :caption: register_example_file.py
```

In [None]:
!python analysis-flow-scripts/register_example_file.py

## Open a dataset, subset it, and register the result

Track the current notebook:

In [None]:
ln.track("eNef4Arw8nNM")

In [None]:
artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()

### Get a backed AnnData object

In [None]:
adata = artifact.open()
adata

### Subset dataset to specific cell types and diseases

In [None]:
cell_types = artifact.cell_types.all().distinct().lookup(return_field="name")
diseases = artifact.diseases.all().distinct().lookup(return_field="name")
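
The lookup objects expose each label name as an attribute, so `"T cell"` becomes reachable as `cell_types.t_cell`. The sketch below imitates that idea with a hypothetical `make_lookup` helper built on the standard library. It is a simplified stand-in for illustration, not LaminDB's implementation, and the exact normalization rules LaminDB applies may differ.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Expose each name as a lowercase, underscore-separated attribute.

    Hypothetical helper for illustration only; not part of LaminDB.
    """
    attrs = {}
    for name in names:
        # Replace every run of non-alphanumeric characters with one underscore.
        key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
        attrs[key] = name
    return SimpleNamespace(**attrs)


toy_cell_types = make_lookup(["T cell", "hematopoietic stem cell"])
print(toy_cell_types.t_cell)
print(toy_cell_types.hematopoietic_stem_cell)
```

Accessing names through attributes gives tab completion in a notebook and avoids typos in free-text label strings.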

Create the subset:

In [None]:
subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
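
The mask above keeps a row only when both conditions hold: its cell type is in the first list and its disease is in the second. The pure-Python sketch below reproduces that logic on a few hypothetical records, so it runs without the dataset. The record values are made up for illustration and are not taken from the example file.

```python
from collections import Counter

# Hypothetical observation records standing in for rows of `adata.obs`.
obs = [
    {"cell_type": "T cell", "disease": "liver lymphoma"},
    {"cell_type": "T cell", "disease": "normal"},
    {"cell_type": "hematopoietic stem cell", "disease": "chronic kidney disease"},
    {"cell_type": "B cell", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row: True only if both conditions hold (the role of `&`).
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]

# Keep the rows whose mask entry is True.
subset = [row for row, keep in zip(obs, mask) if keep]

# Count the remaining (cell_type, disease) combinations,
# similar in spirit to pandas `value_counts()`.
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
print(mask)
print(counts)
```

Because the two conditions are combined with a logical AND, a T cell from a donor without one of the selected diseases is dropped, as is a lymphoma sample of an unselected cell type.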

In [None]:
adata_subset = adata[subset_obs]
adata_subset

In [None]:
adata_subset.obs[["cell_type", "disease"]].value_counts()

Register the subsetted AnnData:

In [None]:
curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()

In [None]:
artifact = curate.save_artifact(description="anndata with obs subset")
artifact.describe()

## Examine data lineage

Query a subsetted `.h5ad` artifact containing "hematopoietic stem cell" and "T cell":

In [None]:
cell_types = bt.CellType.lookup()

In [None]:
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset

Common questions that might arise are:

- What is the history of this artifact?
- Which features and labels are associated with it?
- Which notebook analyzed and registered this artifact?
- Who registered it?
- Which artifact is its parent?
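
Conceptually, each of these questions is answered by following links from an artifact to the run that produced it, and from that run to its transform, its creator, and its input artifacts. The toy model below sketches those relationships with plain dataclasses. The class layout, the `parents` helper, and the `user_a` value are assumptions made for illustration; this is not LaminDB's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transform:
    name: str  # a notebook or script


@dataclass
class Run:
    transform: Transform
    created_by: str
    input_artifacts: List["Artifact"] = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Optional[Run] = None  # the run that produced this artifact


def parents(artifact):
    """Return the artifacts consumed by the run that created `artifact`."""
    return artifact.run.input_artifacts if artifact.run else []


# The script registers the initial dataset; the notebook reads it and saves a subset.
source = Artifact(
    "anndata with obs",
    run=Run(Transform("register_example_file.py"), "user_a"),
)
notebook_run = Run(Transform("analysis-flow.ipynb"), "user_a", input_artifacts=[source])
subset = Artifact("anndata with obs subset", run=notebook_run)

print(subset.run.transform.name)  # which notebook saved it
print(subset.run.created_by)  # who saved it
print([a.description for a in parents(subset)])  # which artifacts were inputs
```

Walking `parents` repeatedly from any artifact traces its history back to the datasets that were registered first.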

Let's answer these questions using LaminDB:

In [None]:
artifact.features

In [None]:
print("--> What is the lineage of this artifact?\n")
artifact.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
print(artifact.features)
print(artifact.labels)

print("\n\n--> Which notebook analyzed and saved this artifact?\n")
print(artifact.transform)

print("\n\n--> Who saved this artifact?\n")
print(artifact.created_by)

print("\n\n--> Which artifacts were inputs?\n")
display(artifact.run.input_artifacts.to_dataframe())

In [None]:
!rm -r ./test-analysis-flow
!lamin delete --force test-analysis-flow