# Single-Cell PAGE Tutorial

This notebook demonstrates `SingleCellPAGE`, which brings VISION-like single-cell pathway analysis to pyPAGE using information-theoretic scoring.

**What it does:**
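1. Computes mutual information (MI), or conditional mutual information (CMI, which conditions on per-gene annotation counts to correct for annotation bias), per cell for each pathway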
2. Tests whether pathway scores vary coherently across the cell manifold using **Geary's C** autocorrelation on a KNN graph
3. Identifies statistically significant pathways via permutation testing with size-matched random gene sets
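
As a rough illustration of step 1, the sketch below scores one cell against one pathway. It is not pyPAGE's implementation: the rank-based equal-frequency binning, the helper names `discretize` and `mutual_information`, and the toy data are all assumptions made for this example.

```python
import numpy as np

def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins (by rank)."""
    ranks = np.argsort(np.argsort(values))
    return (ranks * n_bins) // len(values)

def mutual_information(x, y):
    """MI in bits between two discrete, non-negative integer label vectors."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)              # contingency table of counts
    joint /= joint.sum()                     # joint probabilities
    px = joint.sum(axis=1, keepdims=True)    # marginal over x
    py = joint.sum(axis=0, keepdims=True)    # marginal over y
    nz = joint > 0                           # skip empty cells (0 * log 0 = 0)
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
n_genes = 500
membership = np.zeros(n_genes, dtype=int)
membership[:30] = 1                          # hypothetical 30-gene pathway

cell = rng.normal(0.0, 0.5, n_genes)         # one cell's expression across genes
cell[:30] += 3.0                             # pathway genes are up in this cell

shuffled = rng.permutation(membership)       # random gene set of the same size

mi_pathway = mutual_information(discretize(cell, 5), membership)
mi_random = mutual_information(discretize(cell, 5), shuffled)
print(f"pathway MI: {mi_pathway:.3f} bits, random set MI: {mi_random:.3f} bits")
```

A pathway whose genes sit together in the top expression bin yields a clearly higher MI than a random gene set of the same size.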

**Two modes:**
- `run()` — per-cell scoring + spatial coherence testing (main mode)
- `run_neighborhoods()` — aggregate cells into groups, then run standard bulk PAGE per group

**Inputs:** an AnnData object, or a raw NumPy expression matrix plus gene names

> **Runtime note:** Parameters below are tuned for interactive speed. Increase `n_permutations` for publication-scale analyses.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import anndata

from pypage import GeneSets, SingleCellPAGE

np.random.seed(42)
%matplotlib inline

## 1) Create synthetic single-cell data

We simulate a dataset with:
- **200 cells** in two clusters (100 each)
- **500 genes** with baseline noisy expression
- **5 "planted" pathways** whose member genes are differentially expressed between clusters (strong signal)
- **5 "null" pathways** with random gene membership (no signal)

If `SingleCellPAGE` works correctly, planted pathways should show high spatial consistency (C') and low FDR, while null pathways should show neither.

In [None]:
n_cells = 200
n_genes = 500
n_planted = 5
n_null = 5
pathway_size = 30

gene_names = np.array([f"gene_{i}" for i in range(n_genes)])

# Baseline noise for all cells; the first `half` cells form cluster A, the rest cluster B
half = n_cells // 2
expression = np.random.randn(n_cells, n_genes) * 0.5

# --- Plant signal in the first 5 pathways ---
pathway_genes_list = []
pathway_names_list = []

for p in range(n_planted):
    start = p * pathway_size
    end = start + pathway_size
    # Strong differential expression between clusters
    expression[:half, start:end] += 3.0
    expression[half:, start:end] -= 3.0
    for g_idx in range(start, end):
        pathway_genes_list.append(gene_names[g_idx])
        pathway_names_list.append(f"planted_{p}")

# --- Null pathways: random genes, no differential expression ---
used_genes = set(range(n_planted * pathway_size))
for p in range(n_null):
    available = [i for i in range(n_genes) if i not in used_genes]
    chosen = np.random.choice(available, size=pathway_size, replace=False)
    for g_idx in chosen:
        pathway_genes_list.append(gene_names[g_idx])
        pathway_names_list.append(f"null_{p}")
        used_genes.add(g_idx)

# Build GeneSets
gs = GeneSets(
    genes=np.array(pathway_genes_list),
    pathways=np.array(pathway_names_list),
)

# Cluster labels for later use
labels = np.array(["cluster_A"] * half + ["cluster_B"] * half)

print(f"Expression matrix: {expression.shape}")
print(f"Gene sets: {gs.n_pathways} pathways, {gs.n_genes} genes")
print(f"Pathways: {gs.pathways}")

## 2) Create a synthetic UMAP embedding

For visualization, we create a simple 2D embedding that reflects the two-cluster structure. In practice, the embedding would come from `scanpy.tl.umap` or a similar tool.

In [None]:
# Simulate a UMAP-like embedding with two clusters
umap_coords = np.zeros((n_cells, 2))
umap_coords[:half] = np.random.randn(half, 2) * 0.8 + np.array([-3, 0])
umap_coords[half:] = np.random.randn(half, 2) * 0.8 + np.array([3, 0])

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(umap_coords[:, 0], umap_coords[:, 1],
           c=[0]*half + [1]*half, cmap='Set1', s=10)
ax.set_xlabel('UMAP 1')
ax.set_ylabel('UMAP 2')
ax.set_title('Synthetic embedding (colored by cluster)')
plt.tight_layout()
plt.show()

## 3) Build an AnnData object

`SingleCellPAGE` accepts AnnData natively. It extracts:
- `.X` — expression matrix
- `.var_names` — gene names
- `.obsp['connectivities']` — precomputed KNN graph (if available)
- `.obsm['X_umap']` / `'X_tsne'` — embeddings for visualization

In [None]:
adata = anndata.AnnData(
    X=expression,
    var=pd.DataFrame(index=gene_names),
    obs=pd.DataFrame({"cluster": labels}),
    obsm={"X_umap": umap_coords},
)
adata

## 4) Initialize SingleCellPAGE

Key parameters:
- `n_bins` — number of bins for discretizing gene expression (default: 10; lower = faster)
- `function` — `'cmi'` (default, corrects for annotation bias) or `'mi'`
- `n_neighbors` — KNN graph size (default: `ceil(sqrt(n_cells))`, capped at 100)
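
The default neighborhood size quoted above is easy to reproduce by hand. The helper name `default_n_neighbors` is hypothetical; it only restates the documented rule and is not taken from the pyPAGE source.

```python
import math

def default_n_neighbors(n_cells, cap=100):
    """ceil(sqrt(n_cells)), capped at `cap`, as described in the parameter list."""
    return min(math.ceil(math.sqrt(n_cells)), cap)

for n in (200, 5_000, 50_000):
    print(f"{n:>6} cells -> k = {default_n_neighbors(n)}")
```

For the 200-cell dataset in this tutorial the rule gives k = 15, and the cap only takes effect above 10,000 cells.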

In [None]:
sc_page = SingleCellPAGE(
    adata=adata,
    genesets=gs,
    n_bins=5,
    function='mi',  # 'mi' for speed in this demo; use 'cmi' for bias correction
)
sc_page

## 5) Run the analysis

The `run()` method performs four steps:

1. **Build KNN graph** — either from precomputed connectivity or from the expression data
2. **Score pathways** — compute MI (or CMI) for each cell and each pathway
3. **Compute Geary's C** — measure spatial autocorrelation of pathway scores on the KNN graph. Reports C' = 1 - C (higher = more coherent)
4. **Permutation test** — generate size-matched random gene sets, compute their C', and derive empirical p-values + BH-corrected FDR
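
To make step 3 concrete: for scores $x$ on a graph with weights $w_{ij}$, Geary's C is $C = \frac{(N-1)\sum_{i,j} w_{ij}(x_i - x_j)^2}{2\,\left(\sum_{i,j} w_{ij}\right)\sum_i (x_i - \bar{x})^2}$, so C' = 1 - C is close to 1 when neighbors have similar scores and close to 0 when scores are unrelated to the graph. The sketch below is a dense NumPy version on a toy graph. The function name `geary_c_prime` and the line-shaped graph are assumptions for illustration, not the library's code.

```python
import numpy as np

def geary_c_prime(scores, W):
    """Return C' = 1 - C for a score vector on a dense (n x n) weight matrix W."""
    n = len(scores)
    diff_sq = (scores[:, None] - scores[None, :]) ** 2
    numerator = (n - 1) * (W * diff_sq).sum()
    denominator = 2.0 * W.sum() * ((scores - scores.mean()) ** 2).sum()
    return 1.0 - numerator / denominator

# Toy graph: 200 cells on a line, each linked to its 5 nearest cells on either side.
n = 200
idx = np.arange(n)
gap = np.abs(idx[:, None] - idx[None, :])
W = ((gap <= 5) & (gap > 0)).astype(float)

rng = np.random.default_rng(0)
coherent = np.where(idx < n // 2, 1.0, -1.0) + rng.normal(0, 0.2, n)  # two-block signal
noise = rng.normal(0, 1, n)                                           # no graph structure

print(f"coherent scores: C' = {geary_c_prime(coherent, W):.3f}")
print(f"random scores:   C' = {geary_c_prime(noise, W):.3f}")
```

The block-structured scores give C' near 1, while the unstructured scores give C' near 0.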

In [None]:
results = sc_page.run(n_permutations=200)
results

**Interpretation:**
- **consistency** (C') — how coherently the pathway varies across the cell manifold. Higher = cells with similar pathway activity are neighbors on the KNN graph.
- **p-value** — empirical p-value from size-matched random gene sets.
- **FDR** — Benjamini-Hochberg corrected p-value.

Planted pathways should rank at the top with high consistency and low FDR.
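
The p-value and FDR columns can be understood through the following sketch. It assumes a one-sided empirical p-value with the common +1 correction; whether pyPAGE uses exactly this convention is an assumption. The Benjamini-Hochberg step is the standard procedure. The function names and the simulated null values are hypothetical.

```python
import numpy as np

def empirical_pvalue(observed, null_values):
    """One-sided empirical p-value; the +1 terms keep p strictly above 0."""
    null_values = np.asarray(null_values)
    return (1 + np.sum(null_values >= observed)) / (1 + len(null_values))

def benjamini_hochberg(pvalues):
    """BH-adjusted p-values (FDR), returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out

rng = np.random.default_rng(0)
null_c_prime = rng.normal(0.0, 0.05, 200)    # stand-in for C' of 200 random gene sets
observed = {"planted_0": 0.90, "null_0": 0.01}

pvals = np.array([empirical_pvalue(v, null_c_prime) for v in observed.values()])
fdr = benjamini_hochberg(pvals)
for name, p, q in zip(observed, pvals, fdr):
    print(f"{name}: p = {p:.4f}, FDR = {q:.4f}")
```

Under this convention, a pathway that beats every random gene set receives p = 1 / (n_permutations + 1), which is why more permutations allow smaller p-values.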

In [None]:
# Compare planted vs null
planted = results[results['pathway'].str.startswith('planted')]
null = results[results['pathway'].str.startswith('null')]

print("Planted pathways:")
print(f"  Mean consistency: {planted['consistency'].mean():.4f}")
print(f"  Mean FDR:         {planted['FDR'].mean():.4f}")
print()
print("Null pathways:")
print(f"  Mean consistency: {null['consistency'].mean():.4f}")
print(f"  Mean FDR:         {null['FDR'].mean():.4f}")

## 6) Visualize results

### 6a) Consistency ranking

Bar plot of top pathways ranked by spatial consistency. Blue bars indicate FDR < 0.05.

In [None]:
ax = sc_page.plot_consistency_ranking(top_n=10, fdr_threshold=0.05)
plt.tight_layout()
plt.show()

### 6b) Pathway scores on embedding

Overlay per-cell pathway scores on the UMAP embedding. Planted pathways should show clear cluster-specific patterns.

In [None]:
# Plot the first 3 planted and first 2 null pathways (in `results` order) side by side
planted_names = planted['pathway'].values[:3]
null_names = null['pathway'].values[:2]
pathways_to_plot = list(planted_names) + list(null_names)

fig, axes = plt.subplots(1, len(pathways_to_plot), figsize=(4 * len(pathways_to_plot), 3.5))
for i, pw in enumerate(pathways_to_plot):
    sc_page.plot_pathway_on_embedding(pw, embedding_key='X_umap', ax=axes[i], size=10)
plt.tight_layout()
plt.show()

### 6c) Pathway heatmap across clusters

Mean pathway scores per cluster, showing which pathways differentiate the groups.
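
For reference, the matrix behind a heatmap of this kind can be approximated as a group-wise mean of the per-cell scores. This is a minimal sketch with toy data. The helper name `mean_scores_per_group` is hypothetical, and the plotting method may apply additional scaling that is not shown here.

```python
import numpy as np

def mean_scores_per_group(scores, labels):
    """Return (group_names, matrix) with one row of mean pathway scores per group."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    means = np.vstack([scores[labels == g].mean(axis=0) for g in groups])
    return groups, means

rng = np.random.default_rng(0)
toy_scores = rng.normal(0, 0.1, (6, 2))      # 6 cells x 2 hypothetical pathways
toy_scores[:3, 0] += 1.0                     # pathway 0 is high in group A only
toy_labels = ["A", "A", "A", "B", "B", "B"]

groups, means = mean_scores_per_group(toy_scores, toy_labels)
print(groups)
print(means.round(2))
```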

In [None]:
ax = sc_page.plot_pathway_heatmap(labels)
plt.tight_layout()
plt.show()

## 7) Per-cell score matrix

The raw per-cell pathway scores are stored in `sc_page.scores` (shape: `n_cells × n_pathways`). You can use these for downstream analysis, clustering, or custom visualization.

In [None]:
scores_df = pd.DataFrame(
    sc_page.scores,
    columns=sc_page.pathway_names,
)
scores_df.insert(0, 'cluster', labels)

print(f"Score matrix shape: {sc_page.scores.shape}")
scores_df.head()

In [None]:
# Distribution of scores for a planted vs null pathway
fig, axes = plt.subplots(1, 2, figsize=(10, 3.5))

for ax, pw, title in zip(axes, ['planted_0', 'null_0'], ['Planted pathway', 'Null pathway']):
    for cluster in ['cluster_A', 'cluster_B']:
        mask = scores_df['cluster'] == cluster
        ax.hist(scores_df.loc[mask, pw], bins=15, alpha=0.6, label=cluster)
    ax.set_xlabel('MI score')
    ax.set_ylabel('Count')
    ax.set_title(title)
    ax.legend()

plt.tight_layout()
plt.show()

## 8) Alternative mode: neighborhood-level PAGE

If you already have cluster labels (e.g., from Leiden clustering), you can aggregate cells into groups and run standard bulk PAGE per group. This is faster and gives the full PAGE output (enrichment heatmap, hypergeometric tests) per group.
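
The aggregation idea can be sketched as follows, assuming each group is summarized by its mean expression minus the mean of all other cells. How `run_neighborhoods()` actually summarizes a group is not described in this tutorial, so both the contrast and the function name `group_vs_rest` are assumptions.

```python
import numpy as np

def group_vs_rest(expression, labels):
    """Per-group profile: mean(group) - mean(all other cells), one value per gene."""
    labels = np.asarray(labels)
    profiles = {}
    for g in np.unique(labels):
        in_group = labels == g
        profiles[str(g)] = (expression[in_group].mean(axis=0)
                            - expression[~in_group].mean(axis=0))
    return profiles

rng = np.random.default_rng(0)
expr = rng.normal(0, 0.5, (200, 50))         # 200 cells x 50 genes of noise
expr[:100, :10] += 3.0                       # first 10 genes up in cluster_A
expr[100:, :10] -= 3.0                       # and down in cluster_B
labels = np.array(["cluster_A"] * 100 + ["cluster_B"] * 100)

profiles = group_vs_rest(expr, labels)
print(f"planted genes: {profiles['cluster_A'][:10].mean():.2f}")
print(f"other genes:   {profiles['cluster_A'][10:].mean():.2f}")
```

Each per-group profile is a single gene-level vector, which is the kind of input a bulk enrichment analysis operates on.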

In [None]:
summary, group_results = sc_page.run_neighborhoods(labels=labels)
summary

In [None]:
# Inspect results for cluster_A
results_a, hm_a = group_results['cluster_A']
print(f"Cluster A: {len(results_a)} significant pathways")
if len(results_a) > 0:
    print(results_a.head(10))

In [None]:
# Inspect results for cluster_B
results_b, hm_b = group_results['cluster_B']
print(f"Cluster B: {len(results_b)} significant pathways")
if len(results_b) > 0:
    print(results_b.head(10))

## 9) Using numpy arrays directly

If you don't have an AnnData object, you can pass a NumPy expression matrix (cells × genes) and the matching gene names directly.

In [None]:
sc_page_np = SingleCellPAGE(
    expression=expression,
    genes=gene_names,
    genesets=gs,
    n_bins=5,
    function='mi',
)

results_np = sc_page_np.run(n_permutations=100)
results_np.head()

## 10) Using CMI to correct for annotation bias

The default `function='cmi'` conditions on per-gene membership counts (how many pathways each gene belongs to). This corrects for annotation bias, where heavily annotated genes drive spurious pathway scores. For real data, prefer CMI.
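
The effect of conditioning can be illustrated with the identity $I(X;Y\mid Z) = \sum_z p(z)\, I(X;Y\mid Z=z)$. In the toy example below, `z` is a simplified binary stand-in for the membership count (0 = lightly annotated, 1 = heavily annotated). The helper names and simulated data are assumptions for illustration, not pyPAGE code.

```python
import numpy as np

def mutual_information(x, y):
    """MI in bits between two discrete, non-negative integer label vectors."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) as the p(z)-weighted average of MI within each stratum of z."""
    total = 0.0
    for value in np.unique(z):
        mask = z == value
        total += mask.mean() * mutual_information(x[mask], y[mask])
    return total

rng = np.random.default_rng(0)
n_genes = 2000
z = rng.integers(0, 2, n_genes)              # annotation level per gene
# Heavily annotated genes are more often highly expressed AND more often in the
# pathway, but expression and membership are independent once z is known.
x = (rng.random(n_genes) < np.where(z == 1, 0.8, 0.2)).astype(int)   # expression bin
y = (rng.random(n_genes) < np.where(z == 1, 0.5, 0.05)).astype(int)  # membership

mi = mutual_information(x, y)
cmi = conditional_mutual_information(x, y, z)
print(f"MI = {mi:.4f} bits, CMI = {cmi:.4f} bits")
```

Plain MI picks up the association induced by annotation level alone, while CMI is close to zero because no dependence remains within each stratum.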

In [None]:
sc_page_cmi = SingleCellPAGE(
    adata=adata,
    genesets=gs,
    n_bins=5,
    function='cmi',  # conditional mutual information
)

results_cmi = sc_page_cmi.run(n_permutations=100)
results_cmi

## 11) Providing a precomputed KNN graph

If you already computed a cell-cell connectivity graph (e.g., via `scanpy.pp.neighbors`), you can pass it directly to avoid recomputation.
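
To show what such a graph contains, here is a brute-force NumPy sketch of a symmetric KNN adjacency matrix. It is an illustration only: the function name `knn_connectivity` is hypothetical, the output is dense with 0/1 weights, and it is not the algorithm used by `build_knn_graph` or scanpy.

```python
import numpy as np

def knn_connectivity(X, k):
    """Dense symmetric 0/1 KNN adjacency from Euclidean distances between rows of X."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                       # exclude self-neighbors
    nearest = np.argpartition(d2, k, axis=1)[:, :k]    # k nearest per row (unordered)
    W = np.zeros(d2.shape)
    rows = np.repeat(np.arange(X.shape[0]), k)
    W[rows, nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                          # keep an edge if either end lists it

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 0.8, (100, 2)),          # two well-separated toy clusters
               rng.normal(5, 0.8, (100, 2))])
W = knn_connectivity(X, k=15)

print(f"Graph shape: {W.shape}, nonzeros: {int(W.sum())}")
print(f"Edges between clusters: {int(W[:100, 100:].sum())}")
```

A connectivity matrix should be square with one row per cell, in the same cell order as the expression matrix. For large datasets a sparse format such as `scipy.sparse.csr_matrix` is far more memory-efficient than the dense array used here.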

In [None]:
from pypage.spatial import build_knn_graph

# Build a custom KNN graph
W = build_knn_graph(expression, k=15)
print(f"Graph shape: {W.shape}, nonzeros: {W.nnz}")

# Pass it via the connectivity parameter
sc_page_precomp = SingleCellPAGE(
    expression=expression,
    genes=gene_names,
    genesets=gs,
    n_bins=5,
    function='mi',
    connectivity=W,
)
results_precomp = sc_page_precomp.run(n_permutations=50)
results_precomp.head()

## Summary

| Feature | Method |
|---|---|
| Per-cell pathway scoring | `run()` — MI or CMI per cell |
| Spatial coherence test | Geary's C on KNN graph |
| Significance | Permutation test + BH FDR |
| Neighborhood mode | `run_neighborhoods(labels=...)` — standard PAGE per group |
| Input formats | AnnData, numpy arrays, precomputed graphs |
| Visualization | Embedding overlays, heatmaps, consistency ranking |

### Recommended settings for real data

- `n_bins=10` (default) for finer expression discretization
- `function='cmi'` (default) for annotation bias correction
- `n_permutations=1000` for robust p-values
- Pass an `adata` with precomputed neighbors (e.g., from `scanpy.pp.neighbors`), so the coherence test runs on the same KNN graph as your clustering and embedding
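
The permutation count matters because it bounds how small an empirical p-value can get. The sketch below assumes the common (1 + hits) / (1 + n_permutations) convention, which may differ from pyPAGE's exact formula. The helper name is hypothetical.

```python
def min_empirical_pvalue(n_permutations):
    """Smallest attainable p-value when no random gene set beats the observed score."""
    return 1.0 / (n_permutations + 1)

for n in (50, 100, 200, 1000):
    print(f"{n:>5} permutations -> smallest p = {min_empirical_pvalue(n):.4f}")
```

With 50 permutations no pathway can reach p below about 0.02, so the small counts used in this tutorial are suitable only for quick demos.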