# Download Datasets

This notebook demonstrates how to download and use datasets from PerturbLab's resource registry.

## Features
- Automatic caching in `~/.cache/perturblab/`
- Support for scPerturb benchmark datasets (the registry lists 57 entries in total, GO resources included)
- Access to Gene Ontology (GO) files
- Progress tracking and resume support


In [5]:
from perturblab.data.resources import load_dataset, get_dataset, list_datasets
import anndata as ad

# List all available datasets
print("Available datasets:")
datasets = list_datasets()
print(f"Total: {len(datasets)} datasets")
print("\nFirst 10 datasets:")
for ds in datasets[:10]:
    print(f"  - {ds}")


Available datasets:
Total: 57 datasets

First 10 datasets:
  - go/gene2go_gears
  - go/go_basic
  - go/go_full
  - scperturb/adamson_2016_10x001
  - scperturb/adamson_2016_10x005
  - scperturb/adamson_2016_10x010
  - scperturb/aissa_2021
  - scperturb/chang_2021
  - scperturb/cui_2023
  - scperturb/datlinger_2017


## Download scPerturb Benchmark Dataset


In [6]:
# Download the dataset (or reuse the cached copy) and get its local path
h5ad_path = load_dataset('scperturb/adamson_2016_10x001')
print(f"Dataset downloaded to: {h5ad_path}")

# Load into AnnData
adata = ad.read_h5ad(h5ad_path)
print(f"\nDataset shape: {adata.shape}")
print(f"Genes: {adata.n_vars}, Cells: {adata.n_obs}")
print(f"\nObservations columns: {list(adata.obs.columns[:5])}")
print(f"Variables columns: {list(adata.var.columns[:5])}")


Dataset downloaded to: C:\Users\Administrator\.cache\perturblab\auto\adamson_2016_10x001.h5ad

Dataset shape: (5768, 35635)
Genes: 35635, Cells: 5768

Observations columns: ['perturbation', 'read count', 'UMI count', 'tissue_type', 'cell_line']
Variables columns: ['ensembl_id', 'ncounts', 'ncells']

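A common first check after loading is how many cells carry each perturbation label. The sketch below counts labels with the standard library only; `cells_per_label` is a hypothetical helper, and the labels are made-up stand-ins. With the dataset loaded, `adata.obs["perturbation"]` (the column shown in the output above) could be passed instead.

```python
from collections import Counter


def cells_per_label(labels, top=5):
    """Return the `top` most frequent labels as (label, count) pairs."""
    return Counter(labels).most_common(top)


# Made-up stand-in labels, not values from the real dataset.
# With the data loaded: cells_per_label(adata.obs["perturbation"])
labels = ["control", "gene_a", "control", "gene_b", "gene_a", "control"]

for label, count in cells_per_label(labels):
    print(f"{label}: {count} cells")
```
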

## Download Gene Ontology File


In [7]:
import os

# Download GO ontology file
go_path = load_dataset('go/go_basic')
print(f"GO file downloaded to: {go_path}")

# Check file size
file_size_mb = os.path.getsize(go_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")

# Second call uses cache (instant)
go_path_2 = load_dataset('go/go_basic')
print(f"\nCached path (same): {go_path == go_path_2}")


[perturblab] [INFO] Cache miss: go_basic (is_directory=False), creating...
[perturblab] [INFO] Downloading remote resource 'go_basic'...
[perturblab] [INFO] Downloading GO ontology (basic)
[perturblab] [INFO] Downloading: http://purl.obolibrary.org/obo/go/go-basic.obo
[perturblab] [INFO] Target: C:\Users\Administrator\.cache\perturblab\auto\.tmp_go_basic_n_yjnjsv


[perturblab] [INFO] Downloaded 30.0 MB in 1.9s (16.2 MB/s)


go-basic.obo: 100%|██████████| 30.0M/30.0M [00:01<00:00, 16.9MB/s]

[perturblab] [INFO] Download complete: C:\Users\Administrator\.cache\perturblab\auto\.tmp_go_basic_n_yjnjsv
[perturblab] [INFO] Cached go_basic: 30.0 MB (created in 3.5s)
[perturblab] [INFO] Loading resource 'go_basic' from C:\Users\Administrator\.cache\perturblab\auto\go_basic...
GO file downloaded to: C:\Users\Administrator\.cache\perturblab\auto\go_basic
File size: 29.96 MB

Cached path (same): True

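The log above shows the file being written to a `.tmp_go_basic_...` target before it appears under its final cache name, and the second call returning the same path without downloading. One common way to get that behavior is to download to a temporary file and rename it into place. The sketch below illustrates the pattern with the standard library; it is not PerturbLab's implementation, and `cached_fetch` and `fake_download` are hypothetical names.

```python
import os
import tempfile
from pathlib import Path


def cached_fetch(key, fetch, cache_dir):
    """Return the cached path for `key`, calling `fetch(tmp_path)` on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    final_path = cache_dir / key
    if final_path.exists():
        return final_path  # cache hit: nothing is fetched

    # Cache miss: write to a temporary file in the same directory, then
    # rename it into place so a half-written file never looks like a hit.
    fd, tmp_name = tempfile.mkstemp(prefix=f".tmp_{key}_", dir=cache_dir)
    os.close(fd)
    try:
        fetch(Path(tmp_name))
        os.replace(tmp_name, final_path)
    finally:
        # Clean up the temporary file if the fetch or rename failed.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
    return final_path


# Stand-in for a real download: it only records the call and writes bytes.
calls = []


def fake_download(target):
    calls.append(target)
    target.write_bytes(b"format-version: 1.2\n")


with tempfile.TemporaryDirectory() as tmp_cache:
    first = cached_fetch("go_basic", fake_download, tmp_cache)
    second = cached_fetch("go_basic", fake_download, tmp_cache)
    print(first == second, len(calls))
```

Renaming within the same directory keeps the final step a single filesystem operation, which is why the temporary file is created next to its destination rather than in the system temp folder.
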

## Get Dataset Resource Metadata


In [8]:
# Get resource object for metadata
resource = get_dataset('scperturb/norman_2019_filtered')
print(f"Resource key: {resource.key}")
print(f"Resource type: {type(resource).__name__}")

# NOTE: `_remote_config` is a private attribute (leading underscore),
# so it may change between versions; it is read here for inspection only.
print(f"Has remote config: {resource._remote_config is not None}")

if resource._remote_config:
    print(f"Downloader: {resource._remote_config.get('downloader', 'N/A')}")


Resource key: norman_2019_filtered
Resource type: h5adFile
Has remote config: True
Downloader: HTTPDownloader

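Because the cell above reads a private attribute, a more defensive variant may be useful when inspecting several resources. This is a minimal sketch, assuming only what the output shows (a `key` attribute and a dict-like `_remote_config`); `describe_resource` is a hypothetical helper and the `SimpleNamespace` object is a stand-in for a real resource.

```python
from types import SimpleNamespace


def describe_resource(resource):
    """Collect basic metadata, tolerating a missing private attribute."""
    remote = getattr(resource, "_remote_config", None)
    return {
        "key": getattr(resource, "key", None),
        "type": type(resource).__name__,
        "has_remote": remote is not None,
        # `_remote_config` is assumed to be dict-like, as in the cell above.
        "downloader": remote.get("downloader", "N/A") if remote else "N/A",
    }


# Stand-in object mirroring the values printed above; with PerturbLab
# installed, pass get_dataset('scperturb/norman_2019_filtered') instead.
stand_in = SimpleNamespace(
    key="norman_2019_filtered",
    _remote_config={"downloader": "HTTPDownloader"},
)
print(describe_resource(stand_in))
```
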