![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)

# Concatenate datasets to a single array store

In the previous notebooks, we've seen how to incrementally create a collection of scRNA-seq datasets and train models on it.

Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries that slice by arbitrary metadata (see this [blog post](https://lamin.ai/blog/arrayloader-benchmarks)). This is what CELLxGENE does to create Census: a number of `.h5ad` files are concatenated into a single `tiledbsoma` array store ({doc}`docs:cellxgene`).

:::{note}

This notebook shows how `lamindb` can be used with `tiledbsoma` append mode, which is also explained in [the tiledbsoma documentation](https://tiledbsoma.readthedocs.io/en/latest/notebooks/tutorial_soma_append_mode.html).

:::

In [None]:
import lamindb as ln
import pandas as pd
import scanpy as sc
import tiledbsoma.io
from functools import reduce

In [None]:
ln.context.uid = "oJN8WmVrxI8m0000"
ln.context.track()

Query the collection of `h5ad` files that we'd like to convert into a single array.

In [None]:
collection = ln.Collection.get(
    name="My versioned scRNA-seq collection", version="2"
)
collection.describe()

## Prepare the AnnData objects

We need to prepare the `AnnData` objects in the collection so that they can be concatenated into one `tiledbsoma.Experiment`. They need to have the same `.var` and `.obs` columns, and `.uns` and `.obsp` should be removed.

In [None]:
adatas = [artifact.load() for artifact in collection.ordered_artifacts]

Compute the intersection of all columns. All `AnnData` objects should have the same columns in their `.obs`, `.var`, and `.raw.var` to be ingested into one `tiledbsoma.Experiment`.

In [None]:
obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas])
var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas])
var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])
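
As a toy illustration of the `reduce` pattern above, the sketch below intersects the columns of three small, made-up dataframes. The column names are hypothetical and stand in for `adata.obs.columns`; only `pandas` is needed.

```python
from functools import reduce

import pandas as pd

# Three made-up `.obs`-like tables with partially overlapping columns.
toy_tables = [
    pd.DataFrame(columns=["cell_type", "tissue", "donor"]),
    pd.DataFrame(columns=["tissue", "cell_type", "assay"]),
    pd.DataFrame(columns=["cell_type", "tissue"]),
]

# Fold the pairwise intersection over all column indexes.
shared_columns = reduce(pd.Index.intersection, [t.columns for t in toy_tables])
print(sorted(shared_columns))
```

Only the columns present in every table survive, which is the property the ingestion relies on.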

Prepare the `AnnData` objects for concatenation: create the id fields, sanitize the `index` names, intersect the columns, and drop slots. We have to drop `.obsp` and `.uns`, as well as any dataframe columns that are not in the intersections obtained above; otherwise the ingestion will fail. Because `ln.integrations.save_tiledbsoma_experiment` requires the names of the `obs` and `var` id fields, we create these fields (`obs_id`, `var_id`) from the dataframe indices.

In [None]:
for i, adata in enumerate(adatas):
    del adata.obsp
    del adata.uns
    
    adata.obs = adata.obs.filter(obs_columns)
    adata.obs["obs_id"] = adata.obs.index
    adata.obs["dataset"] = i
    adata.obs.index.name = None
    
    adata.var = adata.var.filter(var_columns)
    adata.var["var_id"] = adata.var.index
    adata.var.index.name = None
    
    drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
    adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
    adata.raw.var["var_id"] = adata.raw.var.index
    adata.raw.var.index.name = None
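
The effect of the per-object steps can be seen on a small, made-up `.obs`-like dataframe. This is a sketch with hypothetical column and cell names; it uses `DataFrame.filter` the same way as the loop above and does not touch any real `AnnData` object.

```python
import pandas as pd

# A made-up `.obs`-like table whose index carries a name.
toy_obs = pd.DataFrame(
    {"cell_type": ["T cell", "B cell"], "donor": ["d1", "d2"]},
    index=pd.Index(["cell_0", "cell_1"], name="barcode"),
)
shared = pd.Index(["cell_type"])  # hypothetical column intersection

toy_obs = toy_obs.filter(shared)   # keeps "cell_type", drops "donor"
toy_obs["obs_id"] = toy_obs.index  # copy the index into a regular column
toy_obs["dataset"] = 0             # position of the source dataset
toy_obs.index.name = None          # clear the index name
print(toy_obs)
```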

## Create the array store

Ingest the `AnnData` objects. This saves them in one array store, creates an `Artifact`, and saves it. The function also writes the current `run.uid` to the `obs` of the `tiledbsoma.Experiment`, under `lamin_run_uid`.

In [None]:
soma_artifact = ln.integrations.save_tiledbsoma_experiment(
    adatas,
    description="tiledbsoma experiment",
    measurement_name="RNA",
    obs_id_name="obs_id",
    var_id_name="var_id",
    append_obsm_varm=True
)

## Query the array store

Open and query the experiment. We can use the registered `Artifact`. Here we query `obs` from the array store.

In [None]:
with soma_artifact.open() as soma_store:
    obs = soma_store["obs"]
    var = soma_store["ms"]["RNA"]["var"]
    
    obs_columns_store = obs.schema.names
    var_columns_store = var.schema.names
    
    obs_store_df = obs.read().concat().to_pandas()
    
    print(obs_store_df)

## Append `AnnData` to the array store

Prepare a new `AnnData` object to be appended to the store.

In [None]:
adata = ln.core.datasets.anndata_with_obs()

In [None]:
adata.obs["obs_id"] = adata.obs.index
adata.var["var_id"] = adata.var.index

adata.obs["dataset"] = obs_store_df["dataset"].max() + 1  # next unused dataset index

obs_columns_same = [obs_col for obs_col in adata.obs.columns if obs_col in obs_columns_store]
adata.obs = adata.obs[obs_columns_same]

var_columns_same = [var_col for var_col in adata.var.columns if var_col in var_columns_store]
adata.var = adata.var[var_columns_same]
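
The column alignment above can be tried on made-up data. In this sketch, `toy_store_columns` is a hypothetical stand-in for `obs.schema.names`, and the extra `batch` column plays the role of a column the store does not have.

```python
import pandas as pd

# Hypothetical column names as they might be read from the store schema.
toy_store_columns = ["soma_joinid", "obs_id", "dataset", "cell_type", "lamin_run_uid"]

toy_new_obs = pd.DataFrame(
    {"cell_type": ["T cell"], "batch": ["b1"], "obs_id": ["cell_0"], "dataset": [2]}
)

# Keep only the columns the store already knows, preserving the dataframe's order.
kept = [col for col in toy_new_obs.columns if col in toy_store_columns]
toy_new_obs = toy_new_obs[kept]
print(kept)
```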

In [None]:
adata.write_h5ad("adata_to_append.h5ad")

Append `AnnData`.

In [None]:
soma_artifact = ln.integrations.save_tiledbsoma_experiment(
    ["adata_to_append.h5ad"],
    revises=soma_artifact,
    measurement_name="RNA",
    obs_id_name="obs_id",
    var_id_name="var_id"
)

## Update the array store

Read `X` from the store.

In [None]:
with soma_artifact.open() as soma_store:  # mode="r" by default
    ms_rna = soma_store["ms"]["RNA"]  # the "RNA" measurement
    n_obs = len(soma_store["obs"])
    n_var = len(ms_rna["var"])
    X = ms_rna["X"]["data"].read().coos((n_obs, n_var)).concat().to_scipy()

Calculate PCA from the queried `X`.

In [None]:
pca_array = sc.pp.pca(X, n_comps=2)
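
To make the shape of the result concrete, the following is a minimal NumPy sketch of a two-component PCA on a small random dense matrix. It is not scanpy's implementation (the helper `pca_sketch` is hypothetical and the data are made up), but it yields the same kind of output: one row per observation and `n_comps` columns.

```python
import numpy as np


def pca_sketch(matrix: np.ndarray, n_comps: int = 2) -> np.ndarray:
    """Project the rows of `matrix` onto its first `n_comps` principal axes."""
    centered = matrix - matrix.mean(axis=0)  # center each feature
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_comps] * s[:n_comps]  # scores = U * S


rng = np.random.default_rng(0)
toy_X = rng.normal(size=(6, 4))  # 6 observations, 4 features
toy_pca = pca_sketch(toy_X, n_comps=2)
print(toy_pca.shape)
```

An array of this shape, with as many rows as there are observations, is what gets stored under `obsm` when the PCA is added to the store.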

In [None]:
soma_artifact

Open the array store in write mode and add the PCA. When the store is updated, the corresponding artifact is also updated and gets a new version.

In [None]:
with soma_artifact.open(mode="w") as soma_store:
    tiledbsoma.io.add_matrix_to_collection(
        exp=soma_store,
        measurement_name="RNA",
        collection_name="obsm",
        matrix_name="pca",
        matrix_data=pca_array
    )

Note that the artifact has been changed.

In [None]:
soma_artifact