![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)

# Transform a number of array shards to a single array store

In the previous notebooks, we've seen how to incrementally create a collection of datasets and train models on it.

In some situations, we want to concatenate all datasets into one big array store, which speeds up ad-hoc cloud queries for slices selected by arbitrary metadata.

This is what CELLxGENE does to create Census: a number of `.h5ad` files are concatenated to give rise to a single TileDB-SOMA array store. See how this looks for `cellxgene` here: {doc}`docs:cellxgene`.

In [None]:
import lamindb as ln
import anndata as ad
import pandas as pd

import tiledbsoma
import tiledbsoma.io

from functools import reduce

In [None]:
ln.settings.transform.stem_uid = "oJN8WmVrxI8m"
ln.settings.transform.version = "1"
ln.track()

Retrieve the collection of `h5ad` files to be concatenated into a SOMA Experiment.

In [None]:
collection = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="2"
).one()

In [None]:
collection.describe()

Prepare a path and a context for a new `tiledbsoma.Experiment`.

In [None]:
soma_path = (ln.settings.storage / "scrna.tiledbsoma").as_posix()

If the instance storage is an `s3` bucket outside `us-east-1`, we need to create a context that carries the bucket's region.

In [None]:
storage_settings = ln.settings._storage_settings
if storage_settings.type == "s3":
    storage_region = storage_settings.region
    ctx = tiledbsoma.SOMATileDBContext(tiledb_config={"vfs.s3.region": storage_region})
else:
    ctx = None
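The branch above can be factored into a small helper that only builds the config dictionary, which makes the logic easy to check without cloud access. This is a minimal sketch: the name `tiledb_config_for` and its arguments are hypothetical, and only the `vfs.s3.region` key is taken from the cell above. Unlike that cell, the sketch also returns `None` when no region is known.

```python
from typing import Optional


def tiledb_config_for(storage_type: str, region: Optional[str]) -> Optional[dict]:
    """Return a TileDB config dict for S3 storage, or None otherwise.

    Hypothetical helper that mirrors the if/else in the cell above.
    """
    if storage_type == "s3" and region:
        # Same key the notebook passes to SOMATileDBContext
        return {"vfs.s3.region": region}
    return None


print(tiledb_config_for("s3", "eu-central-1"))
print(tiledb_config_for("local", None))
```

The returned dictionary would be passed as `tiledb_config=` to `tiledbsoma.SOMATileDBContext`, exactly as in the cell above, and a `None` result corresponds to `ctx = None`.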

We need to prepare the `AnnData` objects in the collection so that they can be concatenated into one `tiledbsoma.Experiment`. They need to have the same `.var` and `.obs` columns, and `.uns` and `.obsp` should be removed.

In [None]:
adatas = [ad.read_h5ad(artifact.cache()) for artifact in collection.artifacts]

Compute the intersection of all columns.

In [None]:
obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas])
var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas])
var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])
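The `reduce` pattern above folds a pairwise intersection over all datasets. The sketch below shows the same idea with plain Python sets so that it runs without `pandas`; `common_columns` and the column names are made up for illustration.

```python
from functools import reduce


def common_columns(column_lists):
    """Return the columns present in every list, in the order of the first list."""
    # Fold a pairwise set intersection over all column lists
    shared = reduce(set.intersection, (set(cols) for cols in column_lists))
    return [col for col in column_lists[0] if col in shared]


# Illustrative column names, not taken from the collection
obs_column_lists = [
    ["cell_type", "donor", "tissue"],
    ["cell_type", "tissue", "assay"],
    ["tissue", "cell_type"],
]
print(common_columns(obs_column_lists))  # ['cell_type', 'tissue']
```

Only columns that appear in every dataset survive, which is why a column present in just one `.h5ad` file is dropped from the concatenated store.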

Prepare the `AnnData` objects for concatenation: add id fields, clear the `index` names, restrict the columns to the intersection computed above, and drop the `.obsp` and `.uns` slots.

In [None]:
for i, adata in enumerate(adatas):
    # Drop slots that should not be ingested
    del adata.obsp
    del adata.uns

    # Keep only the shared obs columns and add the id fields used for registration
    adata.obs = adata.obs.filter(obs_columns)
    adata.obs["obs_id"] = adata.obs.index
    adata.obs["dataset"] = i
    adata.obs.index.name = None

    # Same for var
    adata.var = adata.var.filter(var_columns)
    adata.var["var_id"] = adata.var.index
    adata.var.index.name = None

    # raw.var is modified in place: drop the non-shared columns, then add the id field
    drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
    adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
    adata.raw.var["var_id"] = adata.raw.var.index
    adata.raw.var.index.name = None

Register all the `AnnData` objects. Pass `experiment_uri=None` because the `tiledbsoma.Experiment` doesn't exist yet.

In [None]:
registration_mapping = tiledbsoma.io.register_anndatas(
    experiment_uri=None,
    adatas=adatas,
    measurement_name="RNA",
    obs_field_name="obs_id",
    var_field_name="var_id",
    append_obsm_varm=True
)
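Conceptually, registration inspects all inputs up front so that every distinct `obs_id` and `var_id` can be given a stable integer index before anything is written. The toy function below illustrates that idea only. It is not how `tiledbsoma` implements registration, and `assign_join_ids` and the gene names are hypothetical.

```python
def assign_join_ids(id_lists):
    """Map each distinct id to a consecutive integer, in first-seen order."""
    mapping = {}
    for ids in id_lists:
        for identifier in ids:
            # setdefault leaves the integer of an already-seen id unchanged
            mapping.setdefault(identifier, len(mapping))
    return mapping


# Two datasets that share one gene
var_ids_per_dataset = [
    ["gene_a", "gene_b"],
    ["gene_b", "gene_c"],
]
print(assign_join_ids(var_ids_per_dataset))
# {'gene_a': 0, 'gene_b': 1, 'gene_c': 2}
```

Because the shared id keeps a single index, rows from different datasets that refer to the same gene end up in the same position of the combined store.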

Ingest the `AnnData` objects sequentially, providing the context.

In [None]:
for adata in adatas:
    tiledbsoma.io.from_anndata(
        experiment_uri=soma_path,
        anndata=adata,
        measurement_name="RNA",
        registration_mapping=registration_mapping,
        context=ctx
    )

Register the created `tiledbsoma.Experiment` storage in `lamindb`.

In [None]:
artifact_soma = ln.Artifact(soma_path, description="My scRNA-seq SOMA Experiment")
artifact_soma.save()

Open and query the experiment.

In [None]:
experiment = tiledbsoma.Experiment.open(artifact_soma.path.as_posix(), context=ctx)

In [None]:
experiment

In [None]:
experiment["obs"].read().concat().to_pandas()

In [None]:
experiment.close()