# Analyze the sharded dataset in memory

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import anndata as ad

In [None]:
ln.track()

In [None]:
ln.Dataset.filter().df()

In [None]:
dataset = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()

In [None]:
dataset.files.df()

If the dataset doesn't consist of too many files, we can now load it into memory.

Under the hood, the `AnnData` objects are concatenated during loading.

How long this takes depends on factors such as the number and size of the files.

If you load the dataset often, consider storing a concatenated version of it rather than the individual pieces.

In [None]:
adata = dataset.load()

As in pandas, concatenation defaults to an outer join, so the variables of all files are kept:

In [None]:
adata
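
To make the outer join concrete, here is a minimal pandas sketch that is independent of LaminDB. The two toy tables stand in for two shards with partially overlapping gene columns; all gene names, cell names, and values are made up for illustration. It shows how pandas fills missing entries (with `NaN`), which is not necessarily how `dataset.load()` fills them.

```python
import pandas as pd

# Two toy "shards": rows are cells, columns are genes (all values are made up).
shard_a = pd.DataFrame({"geneA": [1, 2], "geneB": [3, 4]}, index=["cell1", "cell2"])
shard_b = pd.DataFrame({"geneB": [5], "geneC": [6]}, index=["cell3"])

# join="outer" is the pandas default: keep the union of all columns.
outer = pd.concat([shard_a, shard_b], join="outer")

# join="inner" keeps only the columns shared by every shard.
inner = pd.concat([shard_a, shard_b], join="inner")

print(outer.shape)  # (3, 3)
print(inner.shape)  # (3, 1)
```

With an outer join, no gene is dropped, at the cost of missing values for cells whose shard did not measure a given gene.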

The `AnnData` object references the individual files in its `.obs` annotations:

In [None]:
adata.obs.file_id.cat.categories
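
Because `file_id` is a categorical column, it can also be used to count or select cells per source file. The following is a sketch on a toy `.obs`-like table; the IDs `fileA` and `fileB` are hypothetical placeholders, not real file IDs.

```python
import pandas as pd

# Toy stand-in for adata.obs with a categorical file_id column.
obs = pd.DataFrame(
    {"file_id": pd.Categorical(["fileA", "fileA", "fileB"])},
    index=["cell1", "cell2", "cell3"],
)

# The distinct source files, as in .cat.categories above.
file_ids = list(obs.file_id.cat.categories)

# Number of cells contributed by each file.
cells_per_file = obs.file_id.value_counts().to_dict()

# Boolean mask selecting the cells that came from one file.
cells_from_a = obs[obs.file_id == "fileA"]

print(file_ids)
print(cells_per_file)
```

The same kind of boolean mask should also work for subsetting the `AnnData` object itself, for example `adata[adata.obs.file_id == some_id]`, where `some_id` is one of the categories.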

We can easily obtain Ensembl IDs for gene symbols using the lookup object:

In [None]:
genes = lb.Gene.lookup(field="symbol")

In [None]:
genes.itm2a.ensembl_gene_id
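
Judging from `genes.itm2a`, the lookup object appears to expose each record under a lowercased attribute name derived from the chosen field. The sketch below illustrates that idea with the standard library only. It is a guess at the concept, not LaminDB's implementation: `make_lookup`, the records, and the placeholder IDs are all hypothetical.

```python
from types import SimpleNamespace

# Hypothetical records; the IDs are placeholders, not real Ensembl identifiers.
records = [
    {"symbol": "GENE-A", "ensembl_gene_id": "ENSG_PLACEHOLDER_1"},
    {"symbol": "GeneB", "ensembl_gene_id": "ENSG_PLACEHOLDER_2"},
]


def make_lookup(records, field):
    """Build an object whose attributes are normalized values of `field`."""

    def to_attr(value):
        # Lowercase and replace characters that are invalid in attribute names.
        return "".join(c if c.isalnum() else "_" for c in value.lower())

    entries = {to_attr(r[field]): SimpleNamespace(**r) for r in records}
    return SimpleNamespace(**entries)


toy_genes = make_lookup(records, field="symbol")
print(toy_genes.gene_a.ensembl_gene_id)
print(toy_genes.geneb.symbol)
```

Attribute access like this is mainly a convenience for interactive work, since it enables tab completion on record names.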

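Let us compute a two-component PCA and plot it, colored by ITM2A expression. Passing `save="_itm2a"` makes scanpy write the figure to `./figures/pca_itm2a.pdf`: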

In [None]:
import scanpy as sc

sc.pp.pca(adata, n_comps=2)

In [None]:
sc.pl.pca(
    adata,
    color=genes.itm2a.ensembl_gene_id,
    title=(
        f"{genes.itm2a.symbol} / {genes.itm2a.ensembl_gene_id} /"
        f" {genes.itm2a.description}"
    ),
    save="_itm2a",
)

In [None]:
file = ln.File("./figures/pca_itm2a.pdf", description="My result on ITM2A")

In [None]:
file.save()

In [None]:
file.view_flow()

In [None]:
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna