# Download Datasets

This notebook demonstrates how to download and use datasets from PerturbLab's resource registry.

## Features
- Automatic caching in `~/.cache/perturblab/`
- Support for scPerturb benchmark datasets (55+ datasets)
- Gene Ontology (GO) files
- Progress tracking and resume support
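
The download log further down in this notebook targets a `.tmp_...` file inside the cache directory, which suggests a download-to-temp-then-rename pattern. The sketch below illustrates that general pattern only. It is not PerturbLab's implementation, and `cached_fetch` and its `fetch` callback are hypothetical names introduced for this example.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(cache_dir: Path, name: str, fetch: Callable[[Path], None]) -> Path:
    """Return cache_dir/name, calling fetch(tmp_path) only on a cache miss.

    Hypothetical helper: `fetch` stands in for whatever writes the file.
    """
    target = cache_dir / name
    if target.exists():
        return target  # cache hit: nothing to download

    cache_dir.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem and replaces the target atomically.
    fd, tmp_name = tempfile.mkstemp(prefix=f".tmp_{name}_", dir=cache_dir)
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        fetch(tmp_path)
        os.replace(tmp_path, target)
    finally:
        # If fetch failed before the rename, remove the partial file.
        tmp_path.unlink(missing_ok=True)
    return target
```

With this shape, an interrupted or failed download never leaves a half-written file under the final cache name, so a later call sees a clean cache miss rather than a corrupt hit.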


In [1]:
from perturblab.data.resources import load_dataset, get_dataset, list_datasets
import anndata as ad

# List all available datasets
print("Available datasets:")
datasets = list_datasets()
print(f"Total: {len(datasets)} datasets")
print("\nFirst 10 datasets:")
for ds in datasets[:10]:
    print(f"  - {ds}")


Available datasets:
Total: 58 datasets

First 10 datasets:
  - go/gene2go_gears
  - go/go_basic
  - go/go_full
  - scperturb/adamson_2016_10x001
  - scperturb/adamson_2016_10x005
  - scperturb/adamson_2016_10x010
  - scperturb/aissa_2021
  - scperturb/chang_2021
  - scperturb/cui_2023
  - scperturb/datlinger_2017


## Download scPerturb Benchmark Dataset


In [2]:
# Download a scPerturb dataset. The first call downloads the file;
# later calls return the cached copy.
h5ad_path = load_dataset('scperturb/norman_2019')
print(f"Dataset downloaded to: {h5ad_path}")

# Load into AnnData
adata = ad.read_h5ad(h5ad_path)
# AnnData shape is (n_obs, n_vars), i.e. (cells, genes)
print(f"\nDataset shape: {adata.shape}")
print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")
print(f"\nObservation columns: {list(adata.obs.columns[:5])}")
print(f"Variable columns: {list(adata.var.columns[:5])}")


[perturblab] [INFO] Cache miss: norman_2019.h5ad (is_directory=False), creating...
[perturblab] [INFO] Downloading remote resource 'norman_2019'...
[perturblab] [INFO] Downloading: https://zenodo.org/record/13350497/files/NormanWeissman2019.h5ad?download=1
[perturblab] [INFO] Target: /home/wzq/.cache/perturblab/auto/.tmp_norman_2019.h5ad_ehl30_je
[perturblab] [ERROR] Failed to cache norman_2019.h5ad: Failed to download from https://zenodo.org/record/13350497/files/NormanWeissman2019.h5ad?download=1: 404 Client Error: NOT FOUND for url: https://zenodo.org/records/13350497/files/NormanWeissman2019.h5ad


DownloadError: Failed to download from https://zenodo.org/record/13350497/files/NormanWeissman2019.h5ad?download=1: 404 Client Error: NOT FOUND for url: https://zenodo.org/records/13350497/files/NormanWeissman2019.h5ad
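
The run above failed with a 404 from the remote host, so the cell never reached `ad.read_h5ad`. One way to keep a notebook running when a single remote file is unavailable is to try several keys in order. This is a minimal sketch: `first_available` is a hypothetical helper, and it catches `Exception` broadly because the import path of `DownloadError` is not shown in this notebook.

```python
def first_available(loader, keys):
    """Try loader(key) for each key in order.

    Returns (key, result, errors) for the first key that loads, or
    (None, None, errors) if every key fails. `errors` maps each failed
    key to its error message.
    """
    errors = {}
    for key in keys:
        try:
            return key, loader(key), errors
        except Exception as exc:  # assumption: narrow this to DownloadError if importable
            errors[key] = str(exc)
    return None, None, errors
```

Assuming `load_dataset` behaves as in the cell above, a call such as `first_available(load_dataset, ['scperturb/norman_2019', 'scperturb/adamson_2016_10x001'])` would fall through to the second key when the first download fails, and `errors` would record why.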

## Download Gene Ontology File


In [None]:
import os

# Download the GO ontology file
go_path = load_dataset('go/go_basic')
print(f"GO file downloaded to: {go_path}")

# Check the file size in MB
file_size_mb = os.path.getsize(go_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")

# A second call returns the cached path without downloading again
go_path_2 = load_dataset('go/go_basic')
print(f"\nCached path (same): {go_path == go_path_2}")


[perturblab] [INFO] Loading resource 'go_basic' from /home/wzq/.cache/perturblab/auto/go_basic...
GO file downloaded to: /home/wzq/.cache/perturblab/auto/go_basic
File size: 29.96 MB

Cached path (same): True
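
Comparing paths confirms that both calls resolve to the same file, but not how long each call took. A small timing wrapper can make the cache effect visible. `timed` is a hypothetical helper written for this sketch and is not part of PerturbLab.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `path, seconds = timed(load_dataset, 'go/go_basic')` run twice in a row should report a much smaller `seconds` on the second call if the cache is being used.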


## Get Dataset Resource Metadata


In [None]:
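# Get the resource object to inspect its metadata without downloading.
# Note: `_remote_config` is a private attribute and may change between versions.
resource = get_dataset('scperturb/norman_2019')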
print(f"Resource key: {resource.key}")
print(f"Resource type: {type(resource).__name__}")
print(f"Has remote config: {resource._remote_config is not None}")

if resource._remote_config:
    print(f"Downloader: {resource._remote_config.get('downloader', 'N/A')}")


Resource key: norman_2019
Resource type: h5adFile
Has remote config: True
Downloader: HTTPDownloader
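
Because the cell above reads the private `_remote_config` attribute, it may break if the attribute is renamed. The sketch below collects the same fields defensively with `getattr`. `describe_resource` is a hypothetical helper, and it assumes only what the cell above shows: a `key` attribute and a dict-like `_remote_config`.

```python
from types import SimpleNamespace


def describe_resource(resource):
    """Summarise a resource object without failing on missing attributes."""
    remote = getattr(resource, "_remote_config", None)
    return {
        "key": getattr(resource, "key", None),
        "type": type(resource).__name__,
        "has_remote": remote is not None,
        "downloader": remote.get("downloader", "N/A") if remote else "N/A",
    }


# Stand-in object for illustration; a real call would pass get_dataset(...).
fake = SimpleNamespace(key="norman_2019", _remote_config={"downloader": "HTTPDownloader"})
print(describe_resource(fake))
```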
