# Partial streaming of annotated data matrices from the cloud

When working with large datasets, loading the entire dataset into memory is often inefficient and time-consuming.

Here we demonstrate partial streaming of data objects with LaminDB, which lets you query a data object and load only the parts you need.

```{note}

This notebook uses `AnnData` objects as examples.

In the future, other data objects will provide similar functionality.

```

Let's first set up a LaminDB instance using cloud storage (AWS S3):

In [None]:
import lamindb as ln
import lamindb.schema as lns

ln.nb.header()

## Prepare data object (skip if you already ingested your data into LaminDB)

### An AnnData object

Here we load a scRNA-seq dataset as an `AnnData` object. Its `cell_type` column in `.obs` provides the labels we will use for streaming.

In [None]:
pbmc68k = ln.dev.datasets.anndata_pbmc68k_reduced()

pbmc68k

In [None]:
pbmc68k.obs["cell_type"].value_counts()

### Ingest AnnData object into LaminDB

This follows our canonical [ingest](https://lamin.ai/docs/db/guide/quickstart) process.

In [None]:
pbmc68k_h5ad = ln.DObject(pbmc68k, name="pbmc68k")

# Optionally, store the AnnData object in the zarr format
pbmc68k_zarr = ln.DObject(pbmc68k, name="pbmc68k", format="zarr")

In [None]:
ln.add([pbmc68k_h5ad, pbmc68k_zarr]);

Load the zarr-backed object (no downloading or caching happens here):

In [None]:
pbmc68k_zarr.load()

## Stream data objects

We saw that both data objects contain the `cell_type` labels `Dendritic cells` and `CD14+ Monocytes`.

Now let's fetch only the data labeled with these two cell types.

First, we obtain the ingested AnnData DObjects by querying the LaminDB instance.

```{note}

This is merely a database query; it does **not** download the data.

```

In [None]:
dobjects = ln.select(ln.DObject).join(lns.Run, id=ln.nb.run.id).all()

dobjects

In [None]:
dobject = dobjects[0]

Next, we prepare the query strings to query the columns of `.obs` for each `AnnData` object. For details see the [pandas docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html).

```{note}

Soon, we'll integrate the within-object queries with the SQL queries.

```

### (A) Pandas-style query strings

In [None]:
query_string = "cell_type == 'Dendritic cells' | cell_type == 'CD14+ Monocytes'"
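To make the query-string syntax concrete, here is a minimal sketch on a toy `DataFrame` standing in for `.obs`. The index labels and the extra `CD4+ T cells` rows are made up for illustration, and the snippet only exercises `pandas.DataFrame.query`, not LaminDB itself.

```python
import pandas as pd

# Toy stand-in for `.obs` (hypothetical values, not the pbmc68k data).
obs = pd.DataFrame(
    {
        "cell_type": [
            "Dendritic cells",
            "CD14+ Monocytes",
            "CD4+ T cells",
            "Dendritic cells",
        ]
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# Same query string as in the cell above. Inside `DataFrame.query`,
# `|` has the precedence of `or`, so no extra parentheses are needed.
query_string = "cell_type == 'Dendritic cells' | cell_type == 'CD14+ Monocytes'"

selected = obs.query(query_string)
print(selected)
```

Only the rows whose `cell_type` matches one of the two labels remain in `selected`.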

Subset the `AnnData` object with the query string above and load the result directly into memory.

```{note}

No caching happens here!

When `ln.subset` is executed, only the `.obs` columns are loaded to evaluate the query. For all remaining fields, **only the matching subset** is loaded into memory.

```

In [None]:
ln.subset(dobject, query_obs=query_string)
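The two-step idea described in the note (read the labels first, then read only the matching rows) can be sketched conceptually as follows. This is not LaminDB's implementation: a `numpy.memmap` file stands in for the stored matrix, and all names and values are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins: a small label array plays the role of `.obs`,
# and an on-disk matrix plays the role of the stored `.X`.
cell_types = np.array(
    ["Dendritic cells", "CD4+ T cells", "CD14+ Monocytes", "CD4+ T cells"]
)
n_obs, n_vars = len(cell_types), 3

# Write a toy matrix to disk so that it can be read back lazily.
path = os.path.join(tempfile.mkdtemp(), "X.dat")
X_disk = np.memmap(path, dtype="float32", mode="w+", shape=(n_obs, n_vars))
X_disk[:] = np.arange(n_obs * n_vars, dtype="float32").reshape(n_obs, n_vars)
X_disk.flush()

# Step 1: only the labels are in memory; build a boolean row mask from them.
mask = np.isin(cell_types, ["Dendritic cells", "CD14+ Monocytes"])

# Step 2: open the matrix read-only and copy only the matching rows into memory.
X_lazy = np.memmap(path, dtype="float32", mode="r", shape=(n_obs, n_vars))
X_subset = np.asarray(X_lazy[mask])

print(X_subset.shape)
```

The full matrix is never copied into memory as a whole; only the rows selected by the mask are materialized.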

```{tip}

Set `use_concat=True` to return a single concatenated AnnData object (runs [`anndata.concat`](https://anndata.readthedocs.io/en/latest/generated/anndata.concat.html) under the hood).

See an example in (B) Lazy query expressions.
```

### (B) Lazy query expressions

Lazy selectors allow convenient subsetting with complicated conditions.

Operators, methods, and NumPy functions are supported.

In [None]:
from lamindb import lazy
import numpy as np

In [None]:
query_string = lazy.cell_type.isin(("Dendritic cells", "CD14+ Monocytes")) & (
    lazy.percent_mito <= 0.05
)
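Assuming the lazy expression is evaluated against the columns of `.obs`, it reads like the following eager pandas mask. The toy `obs` table and its values are made up for illustration; only standard pandas calls are used.

```python
import pandas as pd

# Toy stand-in for `.obs` (hypothetical values).
obs = pd.DataFrame(
    {
        "cell_type": [
            "Dendritic cells",
            "CD14+ Monocytes",
            "CD4+ T cells",
            "Dendritic cells",
        ],
        "percent_mito": [0.01, 0.10, 0.02, 0.05],
    }
)

# Eager counterpart of the lazy expression: each condition yields a boolean
# Series, and `&` combines them element-wise. The parentheses around the
# comparison are required because `&` binds more tightly than `<=`.
mask = obs["cell_type"].isin(("Dendritic cells", "CD14+ Monocytes")) & (
    obs["percent_mito"] <= 0.05
)

print(obs[mask])
```

The difference is that the eager version needs `obs` in memory up front, while a lazy expression can be built before any data is read.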

In [None]:
adata = ln.subset(dobject, query_obs=query_string, use_concat=True)

In [None]:
adata

Let's now check whether the returned AnnData contains only the queried categories:

In [None]:
adata.obs["cell_type"].value_counts()
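If you prefer a programmatic check over reading the counts, one option is to compare the observed labels against the expected set. A minimal sketch on a toy categorical column (values are made up; the same pattern would apply to `adata.obs["cell_type"]`):

```python
import pandas as pd

# Toy categorical column standing in for `.obs["cell_type"]` (hypothetical values).
cell_type = pd.Series(
    ["Dendritic cells", "CD14+ Monocytes", "CD4+ T cells"], dtype="category"
)

expected = {"Dendritic cells", "CD14+ Monocytes"}
subset = cell_type[cell_type.isin(list(expected))]

# For categorical data, `value_counts()` may still list categories that no
# longer occur (with a count of 0), so compare the observed values instead.
observed = set(subset.unique())
only_expected = observed <= expected

print(only_expected)
```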

## [Not for users] Clean the test data from CI

Clean the test instance.

In [None]:
ln.delete(dobjects + [ln.nb.run], force=True)