[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/data.ipynb)

# Query files & datasets

We saw how LaminDB lets you query & search across files & datasets using registries: {doc}`meta`.

This guide addresses "queries within datasets", for instance:
```python
import lamindb as ln
ulabels = ln.ULabel.lookup()
# access a batch of the iris dataset ingested in the tutorial
df = ln.File.filter(ulabels=ulabels.setosa).first().load()
# subset the rows (observations) of the batch to organism "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]
```

Because the file was validated, subsetting the `DataFrame` is guaranteed to succeed, sparing you the headache of re-curating features & labels.
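
As a local illustration of this kind of row-wise subsetting, here is a minimal sketch that needs no LaminDB instance. The `DataFrame` is a hypothetical stand-in for a batch of the iris dataset; its values and the `sepal_length` column are assumptions made for the example, not data from the tutorial.

```python
import pandas as pd

# Hypothetical stand-in for a validated batch of the iris dataset
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3, 4.9],
        "iris_organism_name": ["setosa", "versicolor", "virginica", "setosa"],
    }
)

# A boolean mask has one entry per row, so it goes in the row position of .loc
mask = df["iris_organism_name"] == "setosa"
df_setosa = df.loc[mask]

print(df_setosa)
```

Passing the mask in the column position (`df.loc[:, mask]`) would instead try to select columns and fail on the index mismatch.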

Such within-dataset queries are also possible for cloud-backed datasets using DuckDB, TileDB, zarr, HDF5, parquet and other storage backends.

- For a use case with TileDB, see: {doc}`docs:cellxgene-census`
- For a use case with DuckDB, see: {doc}`docs:rxrx`

In this notebook, we show how to subset an `AnnData` object and generic `HDF5` and `zarr` datasets stored in the cloud.

## Setup

In [None]:
!lamin init --storage s3://lamindb-ci --name test-data

In [None]:
import lamindb as ln

In [None]:
ln.settings.verbosity = "info"

We'll need some test data:

In [None]:
ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
ln.File("s3://lamindb-ci/lndb-storage/testfile.hdf5").save()

## AnnData

An `h5ad` file stored on s3:

In [None]:
file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one()

In [None]:
file.path

In [None]:
adata = file.backed()

The result is an `AnnDataAccessor`, an `AnnData`-like object backed by the file in the cloud:

In [None]:
adata

Without subsetting, the `AnnDataAccessor` object references underlying lazy `h5` or `zarr` arrays:

In [None]:
adata.X
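
To make "lazy" concrete, the following sketch uses plain `h5py` on a small local file rather than the `AnnDataAccessor` itself. The file name, dataset name `X`, and array contents are assumptions chosen for illustration. It shows that holding a dataset handle reads only metadata, and that data is read from storage only for the rows you index.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small example file (hypothetical contents)
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("X", data=np.arange(20, dtype="float32").reshape(5, 4))

with h5py.File(path, "r") as f:
    X = f["X"]           # lazy handle: no array data has been read yet
    shape = X.shape      # shape comes from metadata only
    rows = X[[1, 3], :]  # reads just rows 1 and 3 into a NumPy array

print(shape, type(rows).__name__)
```

Note that `h5py` requires index lists to be in increasing order.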

You can subset it like a normal `AnnData` object:

In [None]:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
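
The index above is an ordinary pandas boolean mask built from two conditions. A minimal sketch of how such a compound mask evaluates, using a hypothetical `obs` table whose values are made up for illustration:

```python
import pandas as pd

# Hypothetical observation annotations (values are illustrative only)
obs = pd.DataFrame(
    {
        "cell_type": ["Dendritic cells", "CD14+ Monocytes", "CD19+ B", "Dendritic cells"],
        "percent_mito": [0.01, 0.08, 0.02, 0.05],
    }
)

# `&` binds tighter than `<=`, so the comparison needs its own parentheses
obs_idx = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)

print(obs_idx.tolist())
```

Only rows that satisfy both conditions are kept, so the monocyte row with a mitochondrial fraction above the threshold is dropped.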

Subsets load arrays into memory upon direct access:

In [None]:
adata_subset.X

To load the entire subset into memory as an actual `AnnData` object, use `to_memory()`:

In [None]:
adata_subset.to_memory()

## Generic HDF5

Let us query a generic HDF5 file:

In [None]:
file = ln.File.filter(key="lndb-storage/testfile.hdf5").one()

And get a backed accessor:

In [None]:
backed = file.backed()

The returned object holds the open connection in `.connection` and the `h5py.File` or `zarr.Group` in `.storage`:

In [None]:
backed

In [None]:
backed.storage
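
Once you hold an `h5py.File`, you can explore it with the regular `h5py` API. The sketch below assumes a small local file with a made-up layout (the group and dataset names are hypothetical, not the contents of `testfile.hdf5`) and lists every dataset with its shape.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small file with a hypothetical layout
path = os.path.join(tempfile.mkdtemp(), "generic.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("measurements")
    grp.create_dataset("values", data=np.arange(6).reshape(2, 3))
    f.create_dataset("labels", data=np.array([1, 2, 3]))

found = {}


def record(name, obj):
    # visititems calls this for every group and dataset; keep datasets only
    if isinstance(obj, h5py.Dataset):
        found[name] = obj.shape


with h5py.File(path, "r") as storage:
    storage.visititems(record)

print(found)
```

The same traversal should apply to the object in `.storage` when it is an `h5py.File`; a `zarr.Group` has its own, similar API.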

In [None]:
!lamin delete --force test-data