[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna2.ipynb)

# Integrate scRNA-seq datasets

scRNA-seq data integration combines data from several single-cell RNA sequencing experiments so that they can be analyzed jointly to uncover shared or distinct biological patterns.

Here, we query two scRNA-seq datasets by registered metadata, such as cell types, and then integrate them.

## Setup

In [None]:
!lamin load test-scrna

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd
import anndata as ad

In [None]:
ln.track()

## Access ![](https://img.shields.io/badge/Access-10b981)

### Query files by provenance metadata

In [None]:
users = ln.User.lookup()

In [None]:
ln.Transform.filter(created_by=users.falexwolf).search("register scrna")

In [None]:
transform = ln.Transform.filter(id="Nv48yAceNSh8z8").one()

In [None]:
ln.File.filter(transform=transform).df()

### Query files by biological metadata

In [None]:
assays = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
cell_types = lb.CellType.lookup()

In [None]:
query = ln.File.filter(
    experimental_factors=assays.single_cell_rna_sequencing,
    species=species.human,
    cell_types=cell_types.conventional_dendritic_cell,
)

In [None]:
query.df()

## Transform ![](https://img.shields.io/badge/Transform-10b981) 

### Compare gene sets

Get file objects:

In [None]:
# unpacking assumes that the query matches exactly two files
file1, file2 = query.list()

In [None]:
file1.describe()

In [None]:
file1.view_flow()

In [None]:
file2.describe()

In [None]:
file2.view_flow()

Load files into memory:

In [None]:
file1_adata = file1.load()
file2_adata = file2.load()

The shared genes can be computed from the registered feature sets alone, without loading the files into memory:

In [None]:
file1_genes = file1.features["var"]
file2_genes = file2.features["var"]

shared_genes = file1_genes & file2_genes
len(shared_genes)

In [None]:
shared_genes.list("symbol")[:10]
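
The `&` above yields the genes that are registered for both files, that is, a set intersection. As a rough illustration of the same idea outside of LaminDB, the sketch below intersects two pandas indexes. The gene identifiers are made-up placeholders, not values from the files in this notebook.

```python
import pandas as pd

# Toy gene identifiers (placeholders, not taken from the datasets above)
genes_a = pd.Index(["gene1", "gene2", "gene3", "gene4"])
genes_b = pd.Index(["gene3", "gene4", "gene5"])

# Index.intersection keeps only the identifiers present in both indexes
shared = genes_a.intersection(genes_b)
print(len(shared), sorted(shared))
```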

To concatenate the two datasets, their gene identifiers must match. We therefore map the Ensembl gene IDs in `file2` to gene symbols:

In [None]:
# build a Series that maps ensembl_gene_id (index) to symbol (values)
mapper = (
    pd.DataFrame(shared_genes.values_list("ensembl_gene_id", "symbol"))
    .set_index(0)[1]
)
mapper.head()

In [None]:
file2_adata.var.rename(index=mapper, inplace=True)
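
Because the mapper is built from the shared genes only, it is worth knowing how `rename` treats labels that are missing from it. The sketch below uses a toy table with placeholder IDs and symbols (not real gene annotations) to show that pandas renames the labels it finds in the mapper and leaves the others unchanged.

```python
import pandas as pd

# Toy var table indexed by placeholder IDs (not real Ensembl IDs)
var = pd.DataFrame({"n_cells": [10, 20, 30]}, index=["ID1", "ID2", "ID3"])

# Mapper from ID to symbol that covers only two of the three IDs
mapper = pd.Series({"ID1": "SYM1", "ID2": "SYM2"})

# Labels found in the mapper are renamed; "ID3" is left untouched
renamed = var.rename(index=mapper)
print(list(renamed.index))
```

In this toy case, `ID3` keeps its original label, so an index renamed this way can contain a mix of symbols and IDs.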

### Compare cell types

In [None]:
file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()

shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names

We can now subset the two datasets by shared cell types:

In [None]:
file1_adata_subset = file1_adata[
    file1_adata.obs["cell_type"].isin(shared_celltypes_names)
]

file2_adata_subset = file2_adata[
    file2_adata.obs["cell_type"].isin(shared_celltypes_names)
]
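
The subsetting above relies on a boolean mask built with `isin`. A minimal sketch of that mechanism on a toy `obs`-like table follows; the cell and cell type names are placeholders chosen for illustration.

```python
import pandas as pd

# Toy obs table with one cell type annotation per cell (placeholder names)
obs = pd.DataFrame(
    {"cell_type": ["type_a", "type_b", "type_c", "type_a"]},
    index=["cell1", "cell2", "cell3", "cell4"],
)
shared_names = ["type_a", "type_c"]

# isin returns one boolean per row: True where the cell type is shared
mask = obs["cell_type"].isin(shared_names)

# Boolean indexing keeps only the rows where the mask is True
subset = obs[mask]
print(list(subset.index))
```

Indexing an AnnData object with such a mask returns a view; calling `.copy()` on the result gives an independent object if one is needed.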

Concatenate subsetted datasets:

In [None]:
adata_concat = ad.concat(
    [file1_adata_subset, file2_adata_subset],
    label="file",
    keys=[file1.description, file2.description],
)
adata_concat
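
By default, `ad.concat` uses `join="inner"`, so only the variables present in both objects are kept, while `label` and `keys` record the origin of each cell in `obs`. The sketch below is a rough pandas analogue on toy expression tables (cells as rows, genes as columns). It is not AnnData itself, and all names and values are placeholders.

```python
import pandas as pd

# Toy expression tables: rows are cells, columns are genes (placeholders)
df1 = pd.DataFrame({"geneA": [1, 2], "geneB": [3, 4]}, index=["c1", "c2"])
df2 = pd.DataFrame({"geneB": [5], "geneC": [6]}, index=["c3"])

# join="inner" keeps only the columns (genes) shared by both tables
combined = pd.concat([df1, df2], join="inner")

# Record the origin of each row, similar in spirit to label="file"
combined["file"] = ["dataset1"] * len(df1) + ["dataset2"] * len(df2)
print(combined)
```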

In [None]:
adata_concat.obs.value_counts()

In [None]:
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna