[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/arrays.ipynb)

# Slice & stream arrays

We saw how LaminDB lets you query and search across artifacts using registries: {doc}`registries`.

Let us now query the datasets in storage themselves. Here, we show how to subset an `AnnData` object as well as generic `HDF5` and `zarr` collections without downloading them from the cloud.

In [None]:
# replace with your username and S3 bucket
!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-arrays

Import lamindb and track this notebook.

In [None]:
import lamindb as ln
import numpy as np
import zarr

ln.track()

We'll need some test data:

In [None]:
ln.Artifact("s3://lamindb-ci/test-arrays/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/test-arrays/testfile.hdf5").save()

## AnnData

An `h5ad` artifact stored on s3:

In [None]:
artifact = ln.Artifact.get(key="pbmc68k.h5ad")

In [None]:
artifact.path

In [None]:
access = artifact.open()

This object is an `AnnDataAccessor` object, an `AnnData` object backed in the cloud:

In [None]:
access

Without subsetting, the `AnnDataAccessor` object references underlying lazy `h5` or `zarr` arrays:

In [None]:
access.X

You can subset it like a normal `AnnData` object:

In [None]:
obs_idx = access.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    access.obs.percent_mito <= 0.05
)
access_subset = access[obs_idx]
access_subset
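
To make the mask logic concrete, here is a minimal in-memory sketch of the same pattern. It does not touch the cloud: `obs` and `X` are hypothetical stand-ins for `access.obs` and `access.X`, and their values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `access.obs`; the values are made up.
obs = pd.DataFrame(
    {
        "cell_type": [
            "Dendritic cells",
            "CD14+ Monocytes",
            "CD4+ T cells",
            "Dendritic cells",
        ],
        "percent_mito": [0.02, 0.07, 0.01, 0.05],
    }
)
# Hypothetical stand-in for the expression matrix: 4 cells x 2 genes.
X = np.arange(8).reshape(4, 2)

# Combine a membership test and a threshold into one boolean mask.
mask = obs["cell_type"].isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs["percent_mito"] <= 0.05
)

# Rows where the mask is True are kept; all others are dropped.
X_subset = X[mask.to_numpy()]
```

The mask has one entry per observation, so it can index the rows of any array aligned with `obs`.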

Subsets load arrays into memory upon direct access:

In [None]:
access_subset.X

To load the entire subset into memory as an actual `AnnData` object, use `to_memory()`:

In [None]:
adata_subset = access_subset.to_memory()

adata_subset

### Add a column to a cloud AnnData object

It is also possible to add columns to `.obs` and `.var` of cloud AnnData objects without downloading them.

Create a new `AnnData` `zarr` artifact.

In [None]:
adata_subset.write_zarr("adata_subset.zarr")

In [None]:
artifact = ln.Artifact(
    "adata_subset.zarr", description="test add column to adata"
).save()

In [None]:
artifact

In [None]:
with artifact.open(mode="r+") as access:
    access.add_column(where="obs", col_name="ones", col=np.ones(access.shape[0]))
    display(access)

The version of the artifact is updated after the modification.

In [None]:
artifact

In [None]:
artifact.delete(permanent=True)

## SpatialData

It is also possible to access `AnnData` objects inside `SpatialData` `tables`:

In [None]:
artifact = ln.Artifact.connect("laminlabs/lamindata").get(
    key="visium_aligned_guide_min.zarr"
)

access = artifact.open()

In [None]:
access

In [None]:
access.tables

This gives you the same `AnnDataAccessor` object as for a normal `AnnData`.

In [None]:
table = access.tables["table"]

table

You can subset it and read into memory as an actual `AnnData`:

In [None]:
table_subset = table[table.obs["clone"] == "diploid"]

table_subset

```python
adata = table_subset.to_memory()
```

## Generic HDF5

Let us query a generic HDF5 artifact:

In [None]:
artifact = ln.Artifact.get(key="testfile.hdf5")

And get a backed accessor:

In [None]:
backed = artifact.open()

The returned object holds the connection in `.connection` and an `h5py.File` or `zarr.Group` in `.storage`:

In [None]:
backed

In [None]:
backed.storage
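
When `.storage` is an `h5py.File`, it can be navigated and sliced with the usual `h5py` calls. The sketch below shows this on a small local file created for the purpose; the file name, the `group/values` dataset, and its contents are assumptions made for illustration and are not the layout of `testfile.hdf5`.

```python
import os
import tempfile

import h5py
import numpy as np

# Create a small local HDF5 file (hypothetical layout, for illustration only).
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("group/values", data=np.arange(100).reshape(10, 10))

with h5py.File(path, "r") as f:
    # List every group and dataset in the file.
    names = []
    f.visit(names.append)

    # Indexing a dataset reads only the requested slice into memory.
    dset = f["group/values"]
    block = dset[2:4, :3]
```

Only the selected rows and columns are read from the file, which is what makes slicing a backed object cheaper than loading the full array.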

## Parquet

A dataframe stored as sharded `parquet`.

Note that it is also possible to register and access Hugging Face paths. This requires the `huggingface_hub` package to be installed.

In [None]:
artifact = ln.Artifact.connect("laminlabs/lamindata").get(key="sharded_parquet")

In [None]:
artifact.path.view_tree()

In [None]:
backed = artifact.open()

This returns a [pyarrow dataset](https://arrow.apache.org/docs/python/dataset.html).

In [None]:
backed

In [None]:
backed.head(5).to_pandas()

It is also possible to open a collection of cloud artifacts.

In [None]:
collection = ln.Collection.connect("laminlabs/lamindata").get(
    key="sharded_parquet_collection"
)

In [None]:
backed = collection.open()

In [None]:
backed

In [None]:
backed.to_table().to_pandas()

By default, `Artifact.open()` and `Collection.open()` use `pyarrow` to lazily open dataframes. `polars` can also be used by passing `engine="polars"`. Note that `.open(engine="polars")` returns a context manager that yields a [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html).

In [None]:
with collection.open(engine="polars") as lazy_df:
    display(lazy_df.collect().to_pandas())

Yet another way to open several parquet files as a single dataset is to call `.open()` directly on a query set.

In [None]:
backed = ln.Artifact.filter(suffix=".parquet").open()

In [None]:
backed

## Stream arrays into cloud

It is also possible to write directly into the default cloud (or local) storage of the current instance and then save the result as an `Artifact`. This can be done using {meth}`~lamindb.Artifact.from_lazy`, which returns a {class}`~lamindb.models.LazyArtifact`. This object creates a real artifact on `.save()` with the provided arguments.

In [None]:
lazy = ln.Artifact.from_lazy(suffix=".zarr", overwrite_versions=True, key="mydata.zarr")

lazy

Stream an array into `lazy.path` in the default instance storage using `zarr`.

In [None]:
store = zarr.storage.FsspecStore.from_url(lazy.path.as_posix())

group = zarr.open(store, mode="w")
group["ones"] = np.ones(3)

Save and get the artifact.

In [None]:
artifact = lazy.save()

artifact

In [None]:
artifact.delete(permanent=True)

In [None]:
# clean up test instance
ln.setup.delete("test-arrays", force=True)