# Stream data

When working with large serialized objects, it is often inefficient to load entire files into memory.
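
To make the cost difference concrete, here is a minimal standard-library sketch of partial reading. It is not how lamindb or AnnData read data; the record layout, file name, and helper functions are hypothetical and only illustrate that a reader can seek to the rows it needs instead of loading the whole file.

```python
import os
import struct
import tempfile

# Hypothetical fixed-size record: little-endian int32 id followed by a float64 value.
RECORD = struct.Struct("<id")


def write_records(path, n):
    """Write n fixed-size records so that row i starts at byte i * RECORD.size."""
    with open(path, "wb") as f:
        for i in range(n):
            f.write(RECORD.pack(i, i * 0.5))


def read_rows(path, rows):
    """Read only the requested rows by seeking to their byte offsets."""
    out = []
    with open(path, "rb") as f:
        for i in rows:
            f.seek(i * RECORD.size)
            out.append(RECORD.unpack(f.read(RECORD.size)))
    return out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    write_records(path, 10_000)
    file_size = os.path.getsize(path)
    selected = read_rows(path, [3, 42, 9_999])
    # Only the selected records were read from disk.
    bytes_read = len(selected) * RECORD.size
```

In this toy setup, three rows are fetched by reading 36 bytes out of a 120,000-byte file. Chunked formats such as HDF5 and zarr allow a similar kind of selective access, which is what makes streaming subsets from cloud storage practical.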

Here, we show how to subset an `AnnData` object stored in the cloud without loading the full object into memory.

In [None]:
import os

orm = "django" if os.getenv("LAMINDB_USE_DJANGO") == "1" else "sqlalchemy"
instance_name = f"lamindb-ci-test-stream-{orm}"

!lamin load testuser1/{instance_name}
!lamin delete {instance_name}
!lamin init --storage s3://lamindb-ci --name {instance_name}

In [None]:
import lamindb as ln

Check the configured storage:

In [None]:
ln.setup.settings.storage.root

Register a file:

In [None]:
if ln._USE_DJANGO:
    file = ln.File.create("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad")
else:
    file = ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad")
file = ln.add(file)

Get its backed cloud representation:

In [None]:
adata = file.backed()

Inspect its metadata:

In [None]:
adata.obs.head()

In [None]:
adata.obs.cell_type.value_counts()

In [None]:
assert (adata.obs.cell_type == "CD34+").sum() == 2

Construct a subsetter based on the metadata:

In [None]:
obs = file.subsetter()
subset_obs = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)
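
The expression above follows the usual pandas boolean-mask idiom. As a sketch of how such a mask behaves, the following uses a plain `pandas.DataFrame` as a stand-in for the observation metadata; the values are made up for illustration, and the object returned by `file.subsetter()` may differ from a `DataFrame` in other respects.

```python
import pandas as pd

# Toy stand-in for the observation metadata; values are made up for illustration.
obs = pd.DataFrame(
    {
        "cell_type": ["Dendritic cells", "CD34+", "CD14+ Monocytes", "Dendritic cells"],
        "percent_mito": [0.01, 0.02, 0.09, 0.05],
    }
)

# Element-wise AND of two conditions; the comparison needs its own parentheses
# because `&` binds more tightly than `<=` in Python.
mask = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)
selected = obs[mask]
```

Here only the first and last rows satisfy both conditions: the `CD34+` row fails the cell-type test, and the `CD14+ Monocytes` row exceeds the `percent_mito` threshold.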

In [None]:
adata_subset = file.stream(subset_obs=subset_obs)

In [None]:
adata_subset

In [None]:
adata_subset.obs.cell_type.value_counts()

In [None]:
assert (adata_subset.obs.cell_type == "CD34+").sum() == 0

It is also possible to access the attributes of an `AnnData` object and subset them directly through `file.backed()` without loading the full object into memory:

In [None]:
adata

Note that the object above is an `AnnDataAccessor` object, not an `AnnData` object.

Check the reference to `.X`:

In [None]:
adata.X

Get a subset of the object. Its attributes are loaded only on explicit access:

In [None]:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
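
The idea of loading attributes only on explicit access can be sketched with a small pure-Python class. This is not the implementation of `AnnDataAccessor`; the class `LazyAccessor`, its loader functions, and the `load_count` bookkeeping are hypothetical names introduced only to illustrate the pattern.

```python
class LazyAccessor:
    """Hypothetical accessor that loads each attribute on first access only."""

    def __init__(self, loaders):
        self._loaders = loaders  # attribute name -> zero-argument loader function
        self._cache = {}
        self.load_count = {name: 0 for name in loaders}

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. for lazy attributes.
        loaders = self.__dict__.get("_loaders", {})
        if name not in loaders:
            raise AttributeError(name)
        if name not in self._cache:
            self.load_count[name] += 1
            self._cache[name] = loaders[name]()
        return self._cache[name]


accessor = LazyAccessor(
    {
        "obs": lambda: ["cell_a", "cell_b"],
        "X": lambda: [[1.0, 0.0], [0.0, 2.0]],
    }
)
first = accessor.obs   # triggers the "obs" loader
second = accessor.obs  # served from the cache, no second load
```

After these two accesses, the `obs` loader has run once and the `X` loader has not run at all, which mirrors the behavior described above: data that is never accessed is never read.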

In [None]:
adata_subset

In [None]:
adata_subset.obs.cell_type.value_counts()

You can do the same with a zarr object:

In [None]:
if ln._USE_DJANGO:
    file = ln.add(ln.File.create("s3://lamindb-ci/lndb-storage/pbmc68k.zarr"))
else:
    file = ln.add(ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.zarr"))
adata_subset = file.stream(subset_obs=subset_obs)
adata_subset.obs.cell_type.value_counts()

In [None]:
assert (adata_subset.obs.cell_type == "CD34+").sum() == 0

In [None]:
!lamin delete {instance_name}