[![hub](https://img.shields.io/badge/Source%20%26%20report%20-mediumseagreen)](https://lamin.ai/laminlabs/arc-virtual-cell-atlas/transform/l6GZa1J999W5)

# Arc Virtual Cell Atlas: scRNA-seq

The [Arc Virtual Cell Atlas](https://github.com/ArcInstitute/arc-virtual-cell-atlas) hosts one of the largest collections of scRNA-seq datasets.

Lamin mirrors the dataset for simplified access here: [laminlabs/arc-virtual-cell-atlas](https://lamin.ai/laminlabs/arc-virtual-cell-atlas).

If you use the data academically, please cite the original publications, [Youngblut _et al._ (2025)](https://arcinstitute.org/manuscripts/scBaseCamp) and [Zhang _et al._ (2025)](https://biorxiv.org/10.1101/2025.02.20.639398).

Connect to the source instance.

In [None]:
# pip install 'lamindb[jupyter,bionty,wetlab,gcp]'
!lamin connect laminlabs/arc-virtual-cell-atlas

```{note}

If you want to transfer artifacts or metadata into your own instance, use `.using("laminlabs/arc-virtual-cell-atlas")` when accessing registries and then `.save()` ({doc}`/transfer`).

```

In [None]:
import lamindb as ln
import bionty as bt
import wetlab as wl
import pyarrow.compute as pc

## Metadata

50 cell lines.

In [None]:
bt.CellLine.df()

380 compounds.

In [None]:
wl.Compound.df(limit=None)

1,138 perturbations.

In [None]:
wl.CompoundPerturbation.df(limit=None)

17 metadata features.

In [None]:
ln.Feature.df()

## The Tahoe-100M collection

Every individual dataset in the atlas is an `.h5ad` file that is registered as an artifact in LaminDB.

Let us first query for the `Tahoe-100M` collection.

In [None]:
# get the collection: https://lamin.ai/laminlabs/arc-virtual-cell-atlas/collection/BpavRL4ntRTzWEE5
collection = ln.Collection.get(key="tahoe100")
# 14 artifacts in this collection, each corresponding to a plate
collection.artifacts.df()

In [None]:
# check the curated metadata of the first artifact
artifact1 = collection.artifacts.all()[0]
artifact1.describe()

## Query artifacts of interest based on metadata

Let's find which datasets contain A549 cells perturbed with Piroxicam.

In [None]:
cell_lines = bt.CellLine.lookup()
drugs = wl.Compound.lookup()

artifacts_a549_piroxicam = collection.artifacts.filter(
    cell_lines=cell_lines.a549, compounds=drugs.piroxicam
).all()
artifacts_a549_piroxicam.df()

You can download an `.h5ad` into your local cache:

```python
artifact1.cache()
```

Or stream it:

```python
artifact1.open()
```


## Open the obs metadata parquet file as a PyArrow Dataset

Open the obs metadata file (2.29G) as a `pyarrow.dataset.Dataset`, which can be filtered without loading the whole file into memory.

In [None]:
ulabels = ln.ULabel.lookup()
parquet_artifact = ln.Artifact.filter(
    key__contains="obs_metadata.parquet", ulabels=ulabels.tahoe_100
).one()
parquet_artifact

In [None]:
dataset = parquet_artifact.open()
dataset.schema

Find the A549 cells that are perturbed with Piroxicam.

In [None]:
filter_expr = (pc.field("cell_name") == cell_lines.a549.name) & (
    pc.field("drug") == drugs.piroxicam.name
)
df = dataset.scanner(filter=filter_expr).to_table().to_pandas()
df.value_counts("plate")

In [None]:
df.head()

Retrieve the corresponding cells from the `.h5ad` files, one plate at a time.

```python
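# map each plate to the barcodes of its matching cells
plate_cells = df.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)

adatas = []
for artifact in artifacts_a549_piroxicam:
    plate = artifact.features.get_values()["plate"]
    idxs = plate_cells.get(plate)
    if idxs is None:  # no matching cells on this plate
        continue
    print(f"Loading {len(idxs)} cells from plate {plate}")
    # stream the file and load only the selected cells into memory
    with artifact.open() as store:
        adata = store[idxs].to_memory()
        adatas.append(adata)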
```
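
The plate-to-barcodes lookup used in the loop can be illustrated on a toy data frame. This is a sketch under assumptions: the rows are made up, the `toy_`-prefixed names and `counts` are hypothetical, and `plate3` stands in for a plate that has no matching cells.

```python
import pandas as pd

# Toy stand-in for the filtered obs metadata; values are invented.
toy_df = pd.DataFrame(
    {
        "plate": ["plate1", "plate2", "plate2"],
        "BARCODE_SUB_LIB_ID": ["bc1", "bc4", "bc7"],
    }
)

# One list of barcodes per plate, indexed by plate name.
toy_plate_cells = toy_df.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)

counts = {}
for plate in ["plate1", "plate2", "plate3"]:
    # Series.get returns the default instead of raising for an unknown label.
    idxs = toy_plate_cells.get(plate, [])
    if not idxs:
        print(f"No matching cells on {plate}, skipping")
        continue
    counts[plate] = len(idxs)
    print(f"{plate}: {len(idxs)} cells")
```

Without a default, `Series.get` returns `None` for a missing plate, so calling `len()` on the result would fail; supplying a default or checking for `None` avoids that.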

# TBD