[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/arrays.ipynb)

# Slice arrays

We saw how LaminDB lets you query and search across artifacts and collections using registries: {doc}`registries`.

Let us now look at the following case:

```
# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file annotated with "setosa"
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# select all observations (rows) in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]
```

Because the artifact was validated, querying the `DataFrame` is guaranteed to succeed!
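
As a minimal sketch of the row selection above, here is the same boolean-mask pattern on a toy table. The values are made up for illustration, and the string `"setosa"` stands in for `ulabels.setosa.name`; only the column name `iris_organism_name` comes from the example above.

```python
import pandas as pd

# Toy stand-in for the loaded artifact; the values are invented for illustration.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
        "iris_organism_name": ["setosa", "versicolor", "setosa", "virginica"],
    }
)

# The mask has one boolean per row, so it belongs in the row position of .loc.
is_setosa = df["iris_organism_name"] == "setosa"
df_setosa = df.loc[is_setosa]
print(df_setosa)
```

The mask is aligned on the row index, which is why it selects observations rather than columns.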

Such within-collection queries are also possible for cloud-backed collections using [DuckDB](https://duckdb.org),
[TileDB](https://tiledb.com), [zarr](https://zarr.readthedocs.io/en/stable), [HDF5](https://docs.h5py.org/en/stable),
[parquet](https://parquet.apache.org), and other storage backends.

- For a use case with TileDB, see: {doc}`docs:cellxgene`
- For a use case with DuckDB, see: {doc}`docs:rxrx`

In this notebook, we show how to subset `AnnData` objects and generic `HDF5` and `zarr` collections accessed in the cloud.

Let us create a remote instance for testing.

In [None]:
!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-arrays

Import lamindb and track this notebook.

In [None]:
import lamindb as ln

ln.track("hsRyWJggf2Ca")

We'll need some test data:

In [None]:
ln.Artifact("s3://lamindb-ci/test-arrays/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/test-arrays/testfile.hdf5").save()

Note that it is also possible to register Hugging Face paths. This requires the `huggingface_hub` package to be installed.

We register a folder of `parquet` files as a single artifact.

In [None]:
ln.Artifact("hf://datasets/Koncopd/lamindb-test/sharded_parquet").save()

We also register a collection of individual `parquet` files.

In [None]:
artifact_shard1 = ln.Artifact(
    "hf://datasets/Koncopd/lamindb-test/sharded_parquet/louvain=0/947eee0b064440c9b9910ca2eb89e608-0.parquet"
).save()
artifact_shard2 = ln.Artifact(
    "hf://datasets/Koncopd/lamindb-test/sharded_parquet/louvain=1/947eee0b064440c9b9910ca2eb89e608-0.parquet"
).save()

ln.Collection(
    [artifact_shard1, artifact_shard2], key="sharded_parquet_collection"
).save()

## AnnData

An `h5ad` artifact stored on s3:

In [None]:
artifact = ln.Artifact.get(key="pbmc68k.h5ad")

In [None]:
artifact.path

In [None]:
adata = artifact.open()

This object is an `AnnDataAccessor` object, an `AnnData` object backed in the cloud:

In [None]:
adata

Without subsetting, the `AnnDataAccessor` object references underlying lazy `h5` or `zarr` arrays:

In [None]:
adata.X

You can subset it like a normal `AnnData` object:

In [None]:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset

Subsets load arrays into memory upon direct access:

In [None]:
adata_subset.X

To load the entire subset into memory as an actual `AnnData` object, use `to_memory()`:

In [None]:
adata_subset.to_memory()

## Generic HDF5

Let us query a generic HDF5 artifact:

In [None]:
artifact = ln.Artifact.get(key="testfile.hdf5")

And get a backed accessor:

In [None]:
backed = artifact.open()

The returned object holds the connection in `.connection` and an `h5py.File` or `zarr.Group` in `.storage`:

In [None]:
backed

In [None]:
backed.storage
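
To illustrate what lazy access through an `h5py.File` looks like, here is a minimal local sketch. The internal layout of `testfile.hdf5` is not shown in this notebook, so the file below is created on the fly and the dataset name `measurements` is hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small local HDF5 file; the dataset name "measurements" is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("measurements", data=np.arange(100).reshape(20, 5))

with h5py.File(path, "r") as f:
    dset = f["measurements"]  # a handle to the dataset, no array data read yet
    shape = dset.shape        # shape comes from metadata only
    block = dset[2:4, :3]     # only this slice is read and returned as a NumPy array

print(shape)
print(block)
```

Slicing the dataset handle, rather than converting it to an array first, is what keeps the read limited to the requested block.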

## Parquet

A dataframe stored as sharded `parquet`.

In [None]:
artifact = ln.Artifact.get(key="sharded_parquet")

In [None]:
artifact.path.view_tree()

In [None]:
backed = artifact.open()

This returns a [pyarrow dataset](https://arrow.apache.org/docs/python/dataset.html).

In [None]:
backed

In [None]:
backed.head(5).to_pandas()

It is also possible to open a collection of cloud artifacts.

In [None]:
collection = ln.Collection.get(key="sharded_parquet_collection")

In [None]:
backed = collection.open()

In [None]:
backed

In [None]:
backed.to_table().to_pandas()

By default, `Artifact.open()` and `Collection.open()` use `pyarrow` to lazily open dataframes. `polars` can also be used by passing `engine="polars"`. Note that `.open(engine="polars")` returns a context manager that yields a [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html).

In [None]:
with collection.open(engine="polars") as lazy_df:
    display(lazy_df.collect().to_pandas())

Yet another way to open several parquet files as a single dataset is to call `.open()` directly on a query set.

In [None]:
backed = ln.Artifact.filter(suffix=".parquet").open()

In [None]:
backed

In [None]:
# clean up test instance
!lamin delete --force test-arrays