# Single-Cell PAGE Tutorial: CRC Atlas with Hallmark Gene Sets

This notebook demonstrates `pypage-sc` on a real single-cell dataset: a colorectal cancer (CRC) scRNA-seq atlas from [CZ CELLxGENE](https://cellxgene.cziscience.com/) (13,843 cells, 25,344 genes).

**What we'll cover:**
1. Downloading the dataset from CELLxGENE
2. Exploring the AnnData object
3. Loading Hallmark gene sets from a GMT file
4. Running SingleCellPAGE (per-cell MI/CMI scoring + Geary's C spatial coherence)
5. Interpreting the results
6. Visualizing pathway scores on the UMAP embedding
7. Injecting scores into AnnData for downstream use with scanpy
8. Generating the interactive VISION-like HTML report
9. Exploring the per-cell score matrix
10. Equivalent CLI command

**Requirements:** `bio-pypage`, `anndata`, `scanpy` (optional, for plotting)

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import anndata

from pypage import GeneSets, SingleCellPAGE
from pypage.plotting import (
    plot_pathway_embedding,
    plot_consistency_ranking,
    interactive_report_to_html,
)

np.random.seed(42)

## 1. Download the CRC dataset

Download the h5ad file from CZ CELLxGENE. This is a ~200 MB file with 13,843 epithelial cells from colorectal cancer patients.

In [None]:
import os
import urllib.request

DATA_URL = "https://datasets.cellxgene.cziscience.com/d6742179-6c5f-4ddc-8327-b6719b157abd.h5ad"
DATA_PATH = "CRC.h5ad"

if not os.path.exists(DATA_PATH):
    print(f"Downloading {DATA_PATH} (~200 MB)...")
    urllib.request.urlretrieve(DATA_URL, DATA_PATH)
    print("Done.")
else:
    print(f"{DATA_PATH} already exists, skipping download.")

## 2. Explore the AnnData object

In [None]:
adata = anndata.read_h5ad(DATA_PATH)
adata

In [None]:
print("var_names (gene IDs):")
print(adata.var_names[:5].tolist())

print("\nadata.var columns:")
print(list(adata.var.columns))

print("\ngene symbols (adata.var['gene']):")
print(adata.var["gene"].head().tolist())

The `var_names` are Ensembl IDs (e.g. `ENSG00000000003`), but gene symbols are stored in `adata.var['gene']`. We need to set `var_names` to the symbols so they match the GMT file.

In [None]:
adata.var_names = adata.var["gene"].astype(str).values
adata.var_names_make_unique()

print(f"var_names now: {adata.var_names[:5].tolist()}")
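
Before relying on the renamed index, it can help to check how many symbols are duplicated, because `var_names_make_unique()` appends a numeric suffix (such as `-1`) to repeated names, and a suffixed name will no longer match the GMT file. The following is a minimal sketch on a hypothetical toy series; in the notebook the same calls would apply to `adata.var["gene"]`, which still holds the original symbols.

```python
import pandas as pd

# Hypothetical symbols standing in for adata.var["gene"] (not real data)
toy_symbols = pd.Series(["TP53", "MYC", "TP53", "KRAS"])

# keep=False marks every occurrence of a repeated symbol, not only the later ones
is_dup = toy_symbols.duplicated(keep=False)
n_dup_entries = int(is_dup.sum())
dup_symbols = sorted(toy_symbols[is_dup].unique())

print(f"{n_dup_entries} entries share {len(dup_symbols)} duplicated symbol(s): {dup_symbols}")
```

If the count is large, it may be worth deciding explicitly how to handle the duplicates rather than relying on the automatic suffixes.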

In [None]:
print("Available embeddings:", list(adata.obsm.keys()))
print("Available obs columns:", list(adata.obs.columns)[:10])

## 3. Load Hallmark gene sets

In [None]:
gs = GeneSets.from_gmt("../example_data/h.all.v2026.1.Hs.symbols.gmt")
print(gs)

# Check gene overlap
shared = np.intersect1d(adata.var_names, gs.genes)
print(f"\nShared genes: {len(shared)} / {len(gs.genes)} GMT genes")
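
The overall overlap above can hide individual gene sets with poor coverage. The sketch below computes per-set coverage with plain Python rather than the `GeneSets` API, whose internals are not shown here. It assumes the standard GMT layout (set name, description, then tab-separated genes). The helper names `parse_gmt_lines` and `coverage_per_set`, and the toy data, are hypothetical.

```python
def parse_gmt_lines(lines):
    """Parse GMT-formatted lines into {set_name: set_of_genes}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        genes = {g for g in fields[2:] if g}  # fields[1] is the description
        if genes:
            gene_sets[fields[0]] = genes
    return gene_sets


def coverage_per_set(gene_sets, measured_genes):
    """Fraction of each gene set that is present among the measured genes."""
    measured = set(measured_genes)
    return {name: len(genes & measured) / len(genes)
            for name, genes in gene_sets.items()}


# Hypothetical toy input; in the notebook, the lines would come from the GMT
# file and the measured genes from adata.var_names.
toy_gmt = [
    "SET_A\tdescription\tTP53\tMYC\tKRAS\tEGFR",
    "SET_B\tdescription\tIFIT1\tISG15",
]
toy_measured = ["TP53", "MYC", "ISG15", "GAPDH"]

coverage = coverage_per_set(parse_gmt_lines(toy_gmt), toy_measured)
for name, frac in coverage.items():
    print(f"{name}: {frac:.0%} of genes measured")
```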

## 4. Run SingleCellPAGE

SingleCellPAGE performs three steps:
1. **Score pathways** — compute MI or CMI for each cell and each pathway
2. **Compute Geary's C** — measure spatial autocorrelation of pathway scores on the cell-cell KNN graph. Reports C' = 1 - C (higher = more spatially coherent).
3. **Permutation test** — generate size-matched random gene sets, compute their C', and derive empirical p-values + BH FDR
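
To make step 2 concrete, here is a minimal NumPy sketch of C' = 1 - C using the textbook definition of Geary's C on a small dense weight matrix. It is an illustration under stated assumptions, not the `pypage` implementation: the function name `gearys_c_prime` and the toy chain graph are hypothetical, and a real KNN graph would normally be stored as a sparse matrix.

```python
import numpy as np


def gearys_c_prime(x, weights):
    """Return C' = 1 - C, where C is Geary's C of x on a weighted graph."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size
    # Squared differences between every pair of nodes, weighted by the graph
    diff_sq = (x[:, None] - x[None, :]) ** 2
    numerator = (n - 1) * np.sum(w * diff_sq)
    denominator = 2.0 * w.sum() * np.sum((x - x.mean()) ** 2)
    return 1.0 - numerator / denominator


# Toy graph: a chain of 6 "cells", each linked to its immediate neighbours
n_cells = 6
chain = np.zeros((n_cells, n_cells))
for i in range(n_cells - 1):
    chain[i, i + 1] = chain[i + 1, i] = 1.0

smooth = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])       # neighbours mostly agree
alternating = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # neighbours always differ

print(f"smooth:      C' = {gearys_c_prime(smooth, chain):.3f}")
print(f"alternating: C' = {gearys_c_prime(alternating, chain):.3f}")
```

The smooth signal gives a positive C' and the alternating one a negative C', which matches the reading that higher values indicate scores that agree between neighbouring cells.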

In [None]:
sc = SingleCellPAGE(
    adata=adata,
    genesets=gs,
    function="cmi",
    n_bins=10,
    n_jobs=4,
)
sc

In [None]:
results = sc.run(n_permutations=1000)
print(f"Total pathways: {len(results)}")
print(f"Significant (FDR < 0.05): {(results['FDR'] < 0.05).sum()}")

## 5. Results

The results DataFrame contains:
- **pathway** — pathway name
- **consistency** — spatial autocorrelation score (C' = 1 - Geary's C). Higher values mean pathway scores vary coherently across the cell manifold.
- **p-value** — empirical p-value from size-matched random gene sets
- **FDR** — Benjamini-Hochberg corrected p-value
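
For readers who want to see how the last two columns can be derived, the sketch below computes an empirical p-value from a null distribution and applies the Benjamini-Hochberg adjustment. It is a generic illustration, not the library's code: the function names are hypothetical, and the `+1` pseudocount in the p-value is one common convention that `pypage` may or may not use.

```python
import numpy as np


def empirical_pvalue(observed, null_values):
    """Share of null statistics at least as large as the observed one.

    The +1 terms (an assumed convention) keep the p-value above zero.
    """
    null_values = np.asarray(null_values, dtype=float)
    return (1 + np.sum(null_values >= observed)) / (1 + null_values.size)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out


# Hypothetical values, for illustration only
p_example = empirical_pvalue(0.5, [0.1, 0.2, 0.6])
fdr_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p_example)
print(fdr_example)
```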

In [None]:
results.head(20)

In [None]:
sig = results[results["FDR"] < 0.05]
print(f"Significant pathways (FDR < 0.05): {len(sig)}\n")
sig

## 6. Visualization

### Consistency ranking

Bar chart of top pathways ranked by spatial consistency. Blue = significant (FDR < 0.05), red = not significant.

In [None]:
ax = plot_consistency_ranking(results, top_n=20, fdr_threshold=0.05)
ax.set_title("CRC — Hallmark pathway consistency ranking")
plt.tight_layout()
plt.show()

### Pathway scores on UMAP

Overlay per-cell pathway scores on the UMAP embedding. The top hit — **HALLMARK_INTERFERON_ALPHA_RESPONSE** — should show clear spatial structure.

In [None]:
top_pathways = results.head(4)["pathway"].tolist()

fig, axes = plt.subplots(1, 4, figsize=(22, 4.5))
for ax, pw_name in zip(axes, top_pathways):
    sc.plot_pathway_on_embedding(pw_name, embedding_key="X_umap", ax=ax, size=2)
plt.tight_layout()
plt.show()

## 7. Inject scores into AnnData

Add all per-cell pathway scores as `scPAGE_*` columns in `adata.obs`. This makes them available for any downstream scanpy analysis.

In [None]:
for i, pw_name in enumerate(sc.pathway_names):
    adata.obs[f"scPAGE_{pw_name}"] = sc.scores[:, i]

scpage_cols = [c for c in adata.obs.columns if c.startswith("scPAGE_")]
print(f"Added {len(scpage_cols)} scPAGE_ columns to adata.obs")
print(f"Example: {scpage_cols[:3]}")

In [None]:
# If scanpy is installed, use its UMAP plot
try:
    import scanpy as sc_mod
    sc_mod.pl.umap(
        adata,
        color="scPAGE_HALLMARK_INTERFERON_ALPHA_RESPONSE",
        title="HALLMARK_INTERFERON_ALPHA_RESPONSE",
        frameon=False,
    )
except ImportError:
    print("scanpy not installed — using built-in plot instead")
    sc.plot_pathway_on_embedding(
        "HALLMARK_INTERFERON_ALPHA_RESPONSE", embedding_key="X_umap"
    )
    plt.show()

### Save annotated AnnData

Uncomment the lines below to write the annotated AnnData, including the pathway scores, to disk for later use:

In [None]:
# adata.write_h5ad("CRC_scPAGE.h5ad")
# print("Saved annotated AnnData with scPAGE scores.")

## 8. Interactive HTML report

Generate a self-contained VISION-like interactive report. Open it in any browser to click through pathways and see per-cell scores on the UMAP.

In [None]:
interactive_report_to_html(
    results=results,
    scores=sc.scores,
    pathway_names=list(sc.pathway_names),
    embeddings=sc.embeddings,
    output_path="CRC_report.html",
    fdr_threshold=0.05,
    title="CRC — Hallmark Gene Sets",
)
print("Interactive report saved to CRC_report.html")
print("Open in a browser to explore pathway scores interactively.")

## 9. Exploring the per-cell score matrix

The raw scores are stored in `sc.scores` with shape `(n_cells, n_pathways)`. Here we summarize each pathway's score distribution across all cells and list the ten pathways with the highest mean score.

In [None]:
print(f"Score matrix shape: {sc.scores.shape}")

scores_df = pd.DataFrame(sc.scores, columns=sc.pathway_names)
scores_df.describe().T.sort_values("mean", ascending=False).head(10)
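
To compare scores across cell types, one option is to group the score table by a label column. The sketch below uses hypothetical toy data. In the notebook, the scores would be `scores_df` and the labels would come from `adata.obs`, assuming a `cell_type` column is present in this dataset.

```python
import pandas as pd

# Hypothetical per-cell scores and labels (not real results)
toy_scores = pd.DataFrame(
    [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [2.0, 0.0]],
    columns=["PATHWAY_A", "PATHWAY_B"],
)
toy_labels = pd.Series(
    ["T cell", "T cell", "B cell", "B cell", "B cell", "T cell"],
    index=[f"cell_{i}" for i in range(6)],  # barcode-like index, as in adata.obs
)

# .values groups by position. The score table has a default integer index,
# whereas the labels are indexed by cell barcode, so aligning on the index
# would not match any rows.
per_type_mean = toy_scores.groupby(toy_labels.values).mean()
print(per_type_mean)
```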

## 10. Equivalent CLI command

The same analysis can be run from the command line:

```bash
pypage-sc --adata CRC.h5ad --gene-column gene \
    --gmt h.all.v2026.1.Hs.symbols.gmt --seed 42 --n-jobs 4
```

This creates `CRC_scPAGE/` with:
- `results.tsv` — pathway results (consistency, p-value, FDR)
- `ranking.pdf` / `ranking.html` — consistency ranking bar chart
- `report.html` — interactive VISION-like report
- `adata.h5ad` — annotated AnnData with `scPAGE_` scores in `.obs`
- `umap_plots/` — per-pathway UMAP PDFs for the top 10 pathways
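
The results table written by the CLI can be loaded back into pandas for filtering. This is a minimal sketch that assumes `results.tsv` uses the same column names as the results DataFrame shown earlier (`pathway`, `consistency`, `p-value`, `FDR`); the actual file layout should be checked. An in-memory stand-in with made-up values replaces the real file so the example is self-contained.

```python
import io

import pandas as pd

# Stand-in for CRC_scPAGE/results.tsv. Column names are an assumption and
# the values are invented for illustration.
toy_tsv = io.StringIO(
    "pathway\tconsistency\tp-value\tFDR\n"
    "PATHWAY_A\t0.42\t0.001\t0.004\n"
    "PATHWAY_B\t0.05\t0.300\t0.400\n"
    "PATHWAY_C\t0.31\t0.002\t0.004\n"
)

results_from_cli = pd.read_csv(toy_tsv, sep="\t")
significant = (
    results_from_cli[results_from_cli["FDR"] < 0.05]
    .sort_values("consistency", ascending=False)
)
print(significant["pathway"].tolist())
```

With the real output, `toy_tsv` would be replaced by the path to `results.tsv`.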

### Additional CLI options

```bash
# Disable outputs you don't need
pypage-sc --adata CRC.h5ad --gene-column gene --gmt pathways.gmt \
    --no-report --no-save-adata

# More UMAP PDFs, specific embedding
pypage-sc --adata CRC.h5ad --gene-column gene --gmt pathways.gmt \
    --umap-top-n 20 --embedding-key X_tsne

# Export per-cell scores as TSV
pypage-sc --adata CRC.h5ad --gene-column gene --gmt pathways.gmt \
    --scores scores.tsv
```

## Next steps

- Increase `n_permutations` (default 1000) for more robust p-values in publication analyses
- Use `function='cmi'` (default) to correct for annotation bias; `function='mi'` is faster but doesn't account for genes appearing in multiple pathways
- Try different gene set collections (C2 KEGG, C5 GO, etc.) via `GeneSets.from_gmt()`
- Use `SingleCellPAGE.run_neighborhoods(labels=adata.obs['cell_type'])` to run standard bulk PAGE separately on each cell-type cluster