# Stream data

When working with large serialized objects, it is often inefficient to load entire files into memory.

Here, we show how to subset an `AnnData` object stored in the cloud.

In [None]:
instance_name = "lamindb-ci-test-stream"

!lamin load testuser1/{instance_name}
!lamin delete {instance_name}
!lamin init --storage s3://lamindb-ci --name {instance_name}

In [None]:
import lamindb as ln

Check the configured storage:

In [None]:
ln.setup.settings.storage.root

In [None]:
ln.select(ln.Storage).df()

Register a file:

In [None]:
file = ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad")

In [None]:
file = ln.save(file)

## Access subsets of an AnnData object through `file.stream()`

Get its backed cloud representation:

In [None]:
adata = file.backed()

Inspect its metadata:

In [None]:
adata.obs.head()

In [None]:
adata.obs.cell_type.value_counts()

In [None]:
assert (adata.obs.cell_type == "CD34+").sum() == 2

Construct a subsetter based on the metadata:

In [None]:
obs = file.subsetter()
subset_obs = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)
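
To make the mask logic concrete, here is a minimal sketch that applies the same expression to a small in-memory pandas DataFrame. The toy table and its values are invented for illustration, and this is not a description of how `file.subsetter()` works internally; it only shows which rows such a condition selects.

```python
import pandas as pd

# Toy stand-in for `obs`; the values are invented for illustration.
obs = pd.DataFrame(
    {
        "cell_type": ["Dendritic cells", "CD34+", "CD14+ Monocytes", "Dendritic cells"],
        "percent_mito": [0.01, 0.02, 0.08, 0.05],
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# `&` combines the two boolean Series element-wise; the parentheses are
# required because `&` binds more tightly than `<=`.
mask = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)

# Keep only the rows where both conditions hold.
selected = obs[mask]
```

A row is kept only if its cell type is in the list and its `percent_mito` is at most 0.05, so a monocyte above the threshold is dropped just like a cell of another type.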

In [None]:
adata_subset = file.stream(subset_obs=subset_obs)

In [None]:
adata_subset

In [None]:
adata_subset.obs.cell_type.value_counts()

In [None]:
assert (adata_subset.obs.cell_type == "CD34+").sum() == 0

You can do the same with a zarr object:

In [None]:
file = ln.save(ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.zarr"))
adata_subset = file.stream(subset_obs=subset_obs)
adata_subset.obs.cell_type.value_counts()

In [None]:
assert (adata_subset.obs.cell_type == "CD34+").sum() == 0

## Access subsets of an AnnData object directly through `file.backed()`

It is also possible to access the attributes of AnnData objects and subset them directly through `file.backed()`, without loading the full objects into memory.

This works for both `.zarr` and `.h5ad` files.

In [None]:
file = ln.select(ln.File, name="pbmc68k.zarr").one()

In [None]:
file.backed()

In [None]:
file = ln.select(ln.File, name="pbmc68k.h5ad").one()

In [None]:
adata = file.backed()

Note that the object above is an `AnnDataAccessor` object, not an `AnnData` object.

In [None]:
adata

It is possible to access AnnData attributes without loading them into memory:

In [None]:
adata.obsm

In [None]:
adata.varm

In [None]:
adata.obsp

In [None]:
adata.varp

However, `.obs`, `.var`, and `.uns` are always loaded fully into memory when the `AnnDataAccessor` is initialized:

In [None]:
adata.obs.head()

In [None]:
adata.var.head()

In [None]:
adata.uns.keys()

Without subsetting, the `AnnDataAccessor` object gives references to the lazy `h5` or `zarr` arrays:

In [None]:
adata.X

In [None]:
adata.obsm["X_pca"]

The following is also a lazy `SparseDataset` from the `anndata` package:

In [None]:
adata.obsp["distances"]
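
As a local analogy for what a lazy array reference means, the sketch below uses a NumPy memory-mapped file. This is an assumption-laden illustration, not the mechanism `AnnDataAccessor` uses for `h5` or `zarr` storage: the file name, matrix, and mask are all made up.

```python
import os
import tempfile

import numpy as np

# Write a small matrix to disk so there is something to read lazily.
path = os.path.join(tempfile.mkdtemp(), "X.npy")
np.save(path, np.arange(20, dtype=np.float32).reshape(5, 4))

# mmap_mode="r" returns a memory-mapped array: the data stays on disk
# until it is indexed.
X_lazy = np.load(path, mmap_mode="r")

# Indexing with a boolean mask copies only the selected rows into a
# regular in-memory array.
row_mask = np.array([True, False, False, True, False])
X_subset = np.asarray(X_lazy[row_mask])
```

The pattern is the same as in the cells above: holding the reference is cheap, and memory is used only for the rows that are actually requested.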

Get a subset of the object; attributes are loaded only on explicit access:

In [None]:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
num_idx = obs_idx.sum()  # number of selected observations

adata_subset = adata[obs_idx]

In [None]:
adata_subset

Check the shape of the subset:

In [None]:
assert adata_subset.shape == (num_idx, adata.shape[1])

In [None]:
adata_subset.obs.cell_type.value_counts()

In [None]:
assert (adata_subset.obs.cell_type == "CD34+").sum() == 0

Subsets load the arrays into memory only on direct access:

In [None]:
adata_subset.X

In [None]:
adata_subset.obsm["X_pca"]

In [None]:
assert adata_subset.obsp["distances"].shape[0] == num_idx
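
The assertion above checks only the first dimension. In `AnnData`, entries of `.obsp` are pairwise arrays of shape `(n_obs, n_obs)`, so subsetting observations selects both rows and columns. The sketch below shows this on a toy NumPy matrix with invented values; it assumes the accessor follows the same convention and does not reproduce its implementation.

```python
import numpy as np

# Toy symmetric "distances" matrix for 4 observations (invented values).
distances = np.array(
    [
        [0.0, 1.0, 2.0, 3.0],
        [1.0, 0.0, 4.0, 5.0],
        [2.0, 4.0, 0.0, 6.0],
        [3.0, 5.0, 6.0, 0.0],
    ]
)
obs_idx = np.array([True, False, True, True])
num_idx = int(obs_idx.sum())

# np.ix_ builds an open mesh, so the mask is applied to rows and columns.
subset = distances[np.ix_(obs_idx, obs_idx)]
```

With three selected observations, the result is a 3 x 3 matrix that is still symmetric, because the same mask is used on both axes.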

In [None]:
!lamin delete {instance_name}