![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)

# Concatenate datasets to a single array store

In the previous notebooks, we've seen how to incrementally create a collection of scRNA-seq datasets and train models on it.

Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices based on arbitrary metadata (see this [blog post](https://lamin.ai/blog/arrayloader-benchmarks)). This is what CELLxGENE does to create Census: a number of `.h5ad` files are concatenated into a single `tiledbsoma` array store ({doc}`docs:cellxgene`).

:::{note}

This notebook is based on [the tiledbsoma documentation](https://tiledbsoma.readthedocs.io/en/latest/notebooks/tutorial_soma_append_mode.html).

:::

In [None]:
import lamindb as ln
import pandas as pd
import scanpy as sc
import tiledbsoma.io  # used when adding the PCA matrix to the store
from lamindb.core.storage import register_for_tiledbsoma_store, write_tiledbsoma_store
from functools import reduce

In [None]:
ln.context.uid = "oJN8WmVrxI8m0000"
ln.context.track()

Query the collection of `.h5ad` files that we'd like to convert into a single array.

In [None]:
collection = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="2"
).one()
collection.describe()

## Prepare the array store

Prepare a path for a new `tiledbsoma.Experiment`.

We will create our array store at the LaminDB instance root with folder name `"scrna.tiledbsoma"`.

In [None]:
soma_path = (ln.settings.storage.root / "scrna.tiledbsoma").as_posix()  # any AWS S3 path would work here as well

## Prepare the AnnData objects

We need to prepare the `AnnData` objects in the collection so that they can be concatenated into one `tiledbsoma.Experiment`. They need to have the same `.var` and `.obs` columns, and `.uns` and `.obsp` should be removed.

In [None]:
adatas = [artifact.load() for artifact in collection.ordered_artifacts]

Compute the intersection of all columns. All `AnnData` objects should have the same columns in their `.obs`, `.var`, and `.raw.var` to be ingested into one `tiledbsoma.Experiment`.

In [None]:
obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas])
var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas])
var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])
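
To see what these `reduce` calls compute, here is a minimal sketch on toy data. The column names (`cell_type`, `donor`, `tissue`, `assay`) are made up for illustration and are not taken from the collection.

```python
from functools import reduce

import pandas as pd

# Hypothetical .obs column sets of three datasets (illustrative names only).
columns_per_dataset = [
    pd.Index(["cell_type", "donor", "tissue"]),
    pd.Index(["cell_type", "tissue", "assay"]),
    pd.Index(["tissue", "cell_type"]),
]

# reduce applies Index.intersection pairwise, from left to right,
# so only columns present in every dataset survive.
common_columns = reduce(pd.Index.intersection, columns_per_dataset)
print(sorted(common_columns))
```

Columns that appear in only some datasets (here `donor` and `assay`) are dropped from the result.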

Prepare the `AnnData` objects for concatenation: create id fields, sanitize index names, intersect columns, and drop slots. We have to drop `.obsp`, `.uns`, and any dataframe columns that are not in the intersections obtained above; otherwise the ingestion will fail. We also need to provide `obs` and `var` names to `tiledbsoma.io.register_anndatas`, so we create these fields (`obs_id`, `var_id`) from the dataframe indices.

In [None]:
for i, adata in enumerate(adatas):
    del adata.obsp
    del adata.uns
    
    adata.obs = adata.obs.filter(obs_columns)
    adata.obs["obs_id"] = adata.obs.index
    adata.obs["dataset"] = i
    adata.obs.index.name = None
    
    adata.var = adata.var.filter(var_columns)
    adata.var["var_id"] = adata.var.index
    adata.var.index.name = None
    
    drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
    adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
    adata.raw.var["var_id"] = adata.raw.var.index
    adata.raw.var.index.name = None
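
The effect of these steps on a single dataframe can be sketched with plain pandas. The toy `obs` frame, its barcodes, and the `batch` column below are hypothetical stand-ins, and `common_columns` is assumed to be the result of the column intersection.

```python
import pandas as pd

# Toy stand-in for one dataset's .obs; names and values are illustrative.
obs = pd.DataFrame(
    {"cell_type": ["B cell", "T cell"], "batch": ["a", "b"]},
    index=pd.Index(["AAACCTG-1", "AAACGGG-1"], name="barcode"),
)
common_columns = pd.Index(["cell_type"])  # assumed result of the intersection

obs = obs.filter(common_columns)  # keep only the shared columns
obs["obs_id"] = obs.index         # copy the index into a regular column
obs["dataset"] = 0                # record which dataset the rows came from
obs.index.name = None             # sanitize the index name

print(obs)
```

After this, the `batch` column is gone and the barcodes are available both as the index and as the `obs_id` column.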

## Create the array store

Register all the AnnData objects. Pass `store=None` because `tiledbsoma.Experiment` doesn't exist yet:

In [None]:
registration_mapping, adatas = register_for_tiledbsoma_store(
    store=None,
    adatas=adatas,
    measurement_name="RNA",
    obs_field_name="obs_id",
    var_field_name="var_id",
    append_obsm_varm=True
)

Ingest the `AnnData` objects sequentially, providing the context. This saves the `AnnData` objects in one array store.

In [None]:
for adata in adatas:
    soma_artifact = write_tiledbsoma_store(
        store=soma_path,
        adata=adata,
        measurement_name="RNA",
        registration_mapping=registration_mapping
    )
    soma_artifact.save()

## Query the array store

Open the experiment through the registered `Artifact` and query `X` and `obs` from the array store.

In [None]:
with soma_artifact.open() as soma_store:
    obs = soma_store["obs"]
    ms_rna = soma_store["ms"]["RNA"]
    
    n_obs = len(obs)
    n_var = len(ms_rna["var"])
    X = ms_rna["X"]["data"].read().coos((n_obs, n_var)).concat().to_scipy()
    
    print(obs.read().concat().to_pandas())
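
Once `X` and `obs` are in memory, slicing by metadata could look like the following sketch. It uses toy stand-ins instead of the real store and assumes, as in the SOMA data model, that the `soma_joinid` column of `obs` gives the row position of each observation in `X`.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for the objects read from the store (values are made up).
obs_df = pd.DataFrame({"soma_joinid": [0, 1, 2, 3], "dataset": [0, 0, 1, 1]})
X_toy = sparse.coo_matrix(np.arange(12).reshape(4, 3))

# COO matrices don't support row indexing, so convert to CSR first.
X_csr = X_toy.tocsr()

# Select the rows of X that belong to the second dataset.
rows = obs_df.loc[obs_df["dataset"] == 1, "soma_joinid"].to_numpy()
X_subset = X_csr[rows]

print(X_subset.shape)
```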

## Update the array store

Calculate PCA from the queried `X`.

In [None]:
pca_array = sc.pp.pca(X, n_comps=2)

In [None]:
soma_artifact

Open the array store in write mode and add the PCA. When the store is updated, the corresponding artifact is also updated with a new version.

In [None]:
with soma_artifact.open(mode="w") as soma_store:
    tiledbsoma.io.add_matrix_to_collection(
        exp=soma_store,
        measurement_name="RNA",
        collection_name="obsm",
        matrix_name="pca",
        matrix_data=pca_array
    )

Note that the artifact has changed.

In [None]:
soma_artifact