[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/data.ipynb)

# Query arrays

We saw how LaminDB lets you query & search across artifacts & datasets using registries: {doc}`meta`.

Let us now look at the following case:

```
# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file annotated with "setosa"
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# select all observations (rows) in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]
```

Because the artifact was validated, the column and label used in this query are known to exist, so querying the `DataFrame` is guaranteed to succeed!
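To make the row selection concrete, here is a minimal sketch on a small in-memory `DataFrame`. The data and the `sepal_length` column are invented for illustration; only the `iris_organism_name` column name comes from the example above, and the plain string `"setosa"` stands in for `ulabels.setosa.name`.

```python
import pandas as pd

# Hypothetical stand-in for the loaded artifact (values are made up)
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
        "iris_organism_name": ["setosa", "versicolor", "setosa", "virginica"],
    }
)

# The boolean mask has one entry per row, so it goes in the row position of .loc
mask = df["iris_organism_name"] == "setosa"
df_setosa = df.loc[mask]

print(df_setosa)
```

The mask selects observations (rows); all columns are kept.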

Such within-dataset queries are also possible for cloud-backed datasets using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.

- For a use case with TileDB, see: {doc}`docs:cellxgene`
- For a use case with DuckDB, see: {doc}`docs:rxrx`

In this notebook, we show how to subset an `AnnData` object and how to access generic `HDF5` and `zarr` datasets in the cloud.

## Setup

In [None]:
!lamin init --storage s3://lamindb-ci --name test-data

In [None]:
import lamindb as ln

In [None]:
ln.settings.verbosity = "info"

We'll need some test data:

In [None]:
ln.Artifact("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/lndb-storage/testfile.hdf5").save()

## AnnData

An `h5ad` artifact stored on s3:

In [None]:
artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()

In [None]:
artifact.path

In [None]:
adata = artifact.backed()

The result is an `AnnDataAccessor`, an `AnnData`-like object backed by cloud storage:

In [None]:
adata

Without subsetting, the `AnnDataAccessor` object references underlying lazy `h5` or `zarr` arrays:

In [None]:
adata.X

You can subset it like a normal `AnnData` object:

In [None]:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset

Subsets load arrays into memory upon direct access:

In [None]:
adata_subset.X

To load the entire subset into memory as an actual `AnnData` object, use `to_memory()`:

In [None]:
adata_subset.to_memory()

## Generic HDF5

Let us query a generic HDF5 artifact:

In [None]:
artifact = ln.Artifact.filter(key="lndb-storage/testfile.hdf5").one()

And get a backed accessor:

In [None]:
backed = artifact.backed()

The returned object holds the open connection in `.connection` and an `h5py.File` or `zarr.Group` in `.storage`:

In [None]:
backed

In [None]:
backed.storage
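If `.storage` is an `h5py.File`, it can be navigated like any other h5py handle. The following is a minimal sketch, assuming a local temporary file in place of the cloud-backed artifact; the dataset name `data` is hypothetical and does not describe the contents of `testfile.hdf5`.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small local HDF5 file as a stand-in for the cloud-backed artifact
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(20).reshape(4, 5))

with h5py.File(path, "r") as f:
    dataset = f["data"]       # handle to the dataset, values are not read yet
    shape = dataset.shape     # available from metadata alone
    block = dataset[1:3, :2]  # only this slice is read into memory

print(shape, block.tolist())
```

Slicing the dataset handle reads only the requested region, which is what makes within-dataset queries practical for large remote files.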

In [None]:
# clean up test instance
!lamin delete --force test-data
!rm -r test-data