# Stream cloud-backed objects

When working with large serialized objects, it is often inefficient to load entire files into memory.

Here, we show how to subset an `AnnData` object stored in the cloud without downloading the whole file.

In [None]:
# initialize a test instance for this notebook
# this needs to be called *before* importing lamindb in Python
# if you'd like to load or init an instance after, use the Python API: ln.setup.init(...)
!lamin init --storage s3://lamindb-ci --name lamindb-ci-test-stream

In [None]:
import lamindb as ln

ln.settings.verbosity = 3  # show hints

In [None]:
# save some test data
ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.zarr").save()
ln.File("s3://lamindb-ci/lndb-storage/testfile.hdf5").save()

Many file formats support cloud-backed objects that stream only the requested content into memory.

Here, we'll first show how to access a cloud-backed `AnnData` object and then discuss generic HDF5 and zarr objects.

## AnnData

In [None]:
file = ln.File.select(key="lndb-storage/pbmc68k.h5ad").one()

In [None]:
adata = file.backed()

Note that the object above is an `AnnDataAccessor` object, not an `AnnData` object:

In [None]:
adata

It is possible to access `AnnData` attributes without loading them into memory:

In [None]:
print(adata.obsm)
print(adata.varm)
print(adata.obsp)

However, `.obs`, `.var` and `.uns` are always loaded fully into memory when the `AnnDataAccessor` is initialized:

In [None]:
adata.obs.head()

In [None]:
adata.var.head()

In [None]:
adata.uns.keys()

Without subsetting, the `AnnDataAccessor` object gives references to underlying lazy `h5` or `zarr` arrays:

In [None]:
adata.X

In [None]:
adata.obsm["X_pca"]

And to a lazy `SparseDataset` from the `anndata` package:

In [None]:
adata.obsp["distances"]
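To make "lazy" concrete, here is a minimal sketch that uses plain `h5py` on a local temporary file rather than the `AnnDataAccessor` itself. The file name, the dataset name `X_pca`, and the array shape are made up for illustration. It shows that indexing the file returns a dataset handle, and that data is only read once you slice it.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical local stand-in for a cloud-backed file (name and shape are assumptions).
path = os.path.join(tempfile.mkdtemp(), "example.h5")
rng = np.random.default_rng(0)
with h5py.File(path, "w") as f:
    f.create_dataset("X_pca", data=rng.normal(size=(1000, 50)))

with h5py.File(path, "r") as f:
    lazy = f["X_pca"]  # h5py.Dataset: a handle, the values are not read yet
    is_lazy = isinstance(lazy, h5py.Dataset)
    block = lazy[10:20, :5]  # slicing reads only the requested rows and columns
    print(type(lazy).__name__, lazy.shape, lazy.dtype)
    print(type(block).__name__, block.shape)
```

Shape and dtype come from the file's metadata, so inspecting them does not pull the array into memory.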

Get a subset of the object. Its attributes are loaded only on explicit access:

In [None]:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)

adata_subset = adata[obs_idx]

In [None]:
adata_subset
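The boolean index used above is an ordinary pandas mask built from `.obs`, which is already in memory. As a small sketch, the same expression can be tried on a toy data frame. The values and cell names below are invented for illustration and do not come from `pbmc68k`.

```python
import pandas as pd

# Toy stand-in for `adata.obs`; all values are made up.
obs = pd.DataFrame(
    {
        "cell_type": ["Dendritic cells", "CD14+ Monocytes", "CD34+", "Dendritic cells"],
        "percent_mito": [0.02, 0.08, 0.01, 0.05],
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# Same pattern as above: membership test combined with a threshold.
mask = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)
n_selected = int(mask.sum())  # number of rows where the mask is True
selected = obs.index[mask.to_numpy()].tolist()
print(n_selected, selected)
```

Because the mask is computed from in-memory metadata, building it does not touch the arrays stored in the file.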

Check the shape of the subset:

In [None]:
num_idx = obs_idx.sum()
assert adata_subset.shape == (num_idx, adata.shape[1])
assert (adata_subset.obs.cell_type == "CD34+").sum() == 0

In [None]:
adata_subset.obs.cell_type.value_counts()

Subsets load the arrays into memory only on direct access:

In [None]:
print(adata_subset.X)

In [None]:
print(adata_subset.obsm["X_pca"])

In [None]:
assert adata_subset.obsp["distances"].shape[0] == num_idx

To load the entire subset into memory as an actual `AnnData` object, use `to_memory()`:

In [None]:
adata_subset.to_memory()
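The difference between accessing one attribute and calling `to_memory()` can be pictured with a small conceptual sketch. The `LazySubset` class below is hypothetical and is not how lamindb implements its accessor. It only illustrates the pattern: remember the selected rows, read a field when it is requested, and read everything when asked to materialize.

```python
from typing import Callable, Dict, List, Sequence


class LazySubset:
    """Hypothetical illustration: remembers row indices and reads on access."""

    def __init__(
        self,
        readers: Dict[str, Callable[[Sequence[int]], List]],
        rows: Sequence[int],
    ):
        self._readers = readers  # field name -> function that reads the given rows
        self._rows = list(rows)
        self.reads: List[str] = []  # log of the fields that were actually read

    def __getitem__(self, name: str) -> List:
        self.reads.append(name)
        return self._readers[name](self._rows)

    def to_memory(self) -> Dict[str, List]:
        # Materialize every field for the selected rows.
        return {name: self[name] for name in self._readers}


# In-memory lists stand in for remote arrays.
store = {
    "X": [[i, i + 1] for i in range(6)],
    "X_pca": [[i * 0.5] for i in range(6)],
}
readers = {
    name: (lambda rows, data=data: [data[r] for r in rows])
    for name, data in store.items()
}

subset = LazySubset(readers, rows=[1, 4])
reads_before = list(subset.reads)  # nothing is read when the subset is created
x = subset["X"]  # reads only rows 1 and 4 of "X"
full = subset.to_memory()  # reads every field for the selected rows
```

In this sketch, creating the subset is cheap, and the cost of reading is paid only for the fields you touch.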

## Generic HDF5 & zarr

Let us query a generic HDF5 file:

In [None]:
file = ln.File.select(key="lndb-storage/testfile.hdf5").one()

And get a backed accessor:

In [None]:
backed = file.backed()

The returned object holds the connection in `.connection` and the `h5py.File` or `zarr.Group` in `.storage`:

In [None]:
backed

In [None]:
backed.storage
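If `.storage` is an `h5py.File`, the usual `h5py` API applies to it. The layout of `testfile.hdf5` is not shown in this notebook, so the sketch below builds a small local file with made-up dataset names (`mat`, `grp/vec`) and uses it as a stand-in. It lists all datasets and then reads a single row.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical local file; dataset names and contents are assumptions.
path = os.path.join(tempfile.mkdtemp(), "generic.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("mat", data=np.arange(12).reshape(3, 4))
    f.create_group("grp").create_dataset("vec", data=np.arange(5))

datasets = {}


def collect(name, obj):
    # visititems calls this for every group and dataset below the root.
    if isinstance(obj, h5py.Dataset):
        datasets[name] = obj.shape


with h5py.File(path, "r") as storage:  # stand-in for `backed.storage`
    storage.visititems(collect)
    first_row = storage["mat"][0, :]  # reads one row, not the whole dataset
print(datasets)
```

Listing names and shapes first is a cheap way to decide which slices are worth reading.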

In [None]:
!lamin delete lamindb-ci-test-stream