# pyPAGE example notebook: bulk and single-cell

This notebook demonstrates two end-to-end workflows:

1. **Bulk-like analysis** using bundled example differential expression and GO annotations.
2. **Single-cell-like analysis** where differential scores are computed from two cell groups with a Mann-Whitney U test, then passed to `pyPAGE`.

> Runtime note: parameters below are tuned for interactive speed. Increase `n_shuffle` for publication-scale analyses.
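
To see why the number of shuffles matters, the sketch below computes an empirical permutation p-value with plain NumPy. It is a generic illustration, not pyPAGE's internal estimator: the function `empirical_p_value`, the add-one correction, and the toy statistic are all assumptions made for this example.

```python
import numpy as np


def empirical_p_value(observed: float, null_stats: np.ndarray) -> float:
    """Add-one permutation p-value: (1 + #null >= observed) / (1 + n_shuffle)."""
    return (1 + np.sum(null_stats >= observed)) / (1 + len(null_stats))


rng = np.random.default_rng(0)
observed = 10.0  # hypothetical statistic far above every null draw

for n_shuffle in (50, 1000):
    null_stats = rng.normal(size=n_shuffle)  # stand-in for shuffled statistics
    p = empirical_p_value(observed, null_stats)
    print(f"n_shuffle={n_shuffle}: smallest attainable p = {p:.4f}")
```

Under this estimator the smallest attainable p-value is 1/(n_shuffle + 1), so 50 shuffles cannot resolve anything below about 0.0196, while 1000 shuffles reach about 0.0010.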


In [1]:
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

from pypage import PAGE, ExpressionProfile, GeneSets

np.random.seed(0)




## 1) Bulk example

Load differential expression bins and GO gene sets from `example_data/`, then run `PAGE`.


In [2]:
expr_df = pd.read_csv(
    "../example_data/AP2S1.tab.gz",
    sep="\t",
    header=None,
    names=["gene", "bin"],
)
expr_df.head()

|   | gene | bin |
|---|---|---|
| 0 | AP2S1 | 0 |
| 1 | CNP | 0 |
| 2 | CAVIN1 | 2 |
| 3 | FUOM | 0 |
| 4 | NOL4L | 0 |


In [3]:
expr_df.bin.value_counts()

bin
1    17026
2      283
0      233
Name: count, dtype: int64

In [6]:
exp_bulk = ExpressionProfile(expr_df["gene"].values, expr_df["bin"].values, is_bin=True)
exp_bulk

Expression Profile
>> num_genes: 17542
>> num_bins: 10

In [4]:
go_df = pd.read_csv(
    "../example_data/GO_BP_2021_index.txt.gz",
    sep="\t",
    header=None,
    names=["gene", "pathway"],
)
go_df.head()

|   | gene | pathway |
|---|---|---|
| 0 | SDF2L1 | 'de novo' posttranslational protein folding (G... |
| 1 | HSPA9 | 'de novo' posttranslational protein folding (G... |
| 2 | CCT2 | 'de novo' posttranslational protein folding (G... |
| 3 | ST13 | 'de novo' posttranslational protein folding (G... |
| 4 | HSPA6 | 'de novo' posttranslational protein folding (G... |


In [7]:
ont_bulk = GeneSets(go_df["gene"].values, go_df["pathway"].values, n_bins=6)

In [8]:
# Optional speed filter for notebook use
ont_bulk.filter_pathways(min_size=20, max_size=400)

p_bulk = PAGE(
    exp_bulk,
    ont_bulk,
    n_shuffle=50,
    alpha=0.01,
    k=7,
    filter_redundant=True,
    n_jobs=1,
)

bulk_results, bulk_hm = p_bulk.run()
print(f"Bulk pathways returned: {len(bulk_results)}")
bulk_results.head(10)


Bulk pathways returned: 57





|   | pathway | CMI | p-value | Regulation pattern |
|---|---|---|---|---|
| 0 | mitotic spindle organization (GO:0007052) | 0.004761 | 0.0 | 1 |
| 1 | rRNA metabolic process (GO:0016072) | 0.004278 | 0.0 | 1 |
| 2 | mitotic sister chromatid segregation (GO:0000070) | 0.003973 | 0.0 | 1 |
| 3 | generation of neurons (GO:0048699) | 0.003268 | 0.0 | -1 |
| 4 | positive regulation of DNA biosynthetic proces... | 0.00316 | 0.0 | 1 |
| 5 | positive regulation of protein localization to... | 0.003078 | 0.0 | 1 |
| 6 | protein stabilization (GO:0050821) | 0.002708 | 0.0 | 1 |
| 7 | axonogenesis (GO:0007409) | 0.002569 | 0.0 | -1 |
| 8 | positive regulation of cell junction assembly ... | 0.002509 | 0.0 | -1 |
| 9 | anterior/posterior pattern specification (GO:0... | 0.002219 | 0.0 | -1 |
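
The size filter applied before the run can be approximated directly in pandas, which is handy for checking how many pathways survive a given threshold. This is a sketch under the assumption that `filter_pathways` keeps gene sets whose size lies between `min_size` and `max_size` inclusive. The helper `filter_by_size` and the toy table are hypothetical and not part of pyPAGE.

```python
import pandas as pd


def filter_by_size(df: pd.DataFrame, min_size: int, max_size: int) -> pd.DataFrame:
    """Keep rows whose pathway has between min_size and max_size distinct genes."""
    # Number of distinct genes per pathway, mapped back onto every row.
    sizes = df["pathway"].map(df.groupby("pathway")["gene"].nunique())
    return df[(sizes >= min_size) & (sizes <= max_size)]


# Toy gene-to-pathway table in the same two-column layout as go_df.
toy = pd.DataFrame({
    "gene": ["A", "B", "C", "D", "E", "F"],
    "pathway": ["p1", "p1", "p1", "p2", "p3", "p3"],
})

kept = filter_by_size(toy, min_size=2, max_size=2)
print(sorted(kept["pathway"].unique()))  # ['p3']
```

Applied to `go_df` with the bounds used above, the same helper gives a quick estimate of how many pathways enter the test, without touching the `GeneSets` object.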


In [None]:
# Visualize top pathways
if bulk_hm is not None:
    bulk_hm.show(max_rows=30, title="Bulk example")

## 2) Single-cell example

Create two synthetic cell populations (`A` vs `B`), compute per-gene differential scores with a Mann-Whitney U test, and run `PAGE` on those scores.
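
The scoring scheme described above (sign of the mean difference times one minus the U-test p-value) can also be computed for all genes in a single call. A minimal sketch, assuming SciPy 1.7 or later, where `mannwhitneyu` accepts an `axis` argument. `signed_u_scores` is a hypothetical helper, not part of pyPAGE.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def signed_u_scores(group_a: np.ndarray, group_b: np.ndarray) -> np.ndarray:
    """Signed (1 - p) score per gene; rows are cells, columns are genes."""
    # One vectorized call tests every gene (column) at once.
    p = mannwhitneyu(group_a, group_b, alternative="two-sided", axis=0).pvalue
    sign = np.sign(group_b.mean(axis=0) - group_a.mean(axis=0))
    score = sign * (1.0 - p)
    # Genes with no variation across both groups carry no signal: force them to 0.
    constant = np.ptp(np.vstack([group_a, group_b]), axis=0) == 0
    score[constant] = 0.0
    return np.nan_to_num(score)


rng = np.random.default_rng(0)
a = rng.poisson(2.0, size=(30, 5))
b = rng.poisson(4.0, size=(30, 5))
a[:, 0] = 1  # make the first gene constant in both groups
b[:, 0] = 1
print(signed_u_scores(a, b))
```

Positive scores mark genes that are higher in the second group, negative scores genes that are lower, and values near zero genes with no detectable shift.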


In [None]:
# Build a gene universe from ontology genes and sample a subset for speed
all_ont_genes = pd.unique(go_df["gene"])
n_genes = 2500
genes_sc = np.random.choice(all_ont_genes, size=n_genes, replace=False)

n_cells_a = 60
n_cells_b = 60

# Baseline expression per gene
baseline = np.random.gamma(shape=2.0, scale=1.0, size=n_genes)

# Simulate two groups with targeted perturbations
cells_a = np.random.poisson(lam=np.clip(baseline, 0.05, None), size=(n_cells_a, n_genes))
perturbed = baseline.copy()
perturbed[:120] *= 1.8   # up in B
perturbed[120:240] *= 0.5  # down in B
cells_b = np.random.poisson(lam=np.clip(perturbed, 0.05, None), size=(n_cells_b, n_genes))


def differential_score_u_test(group_a: np.ndarray, group_b: np.ndarray) -> np.ndarray:
    """Return sign(mean_B - mean_A) * (1 - p-value) per gene (columns are genes)."""
    n = group_a.shape[1]
    score = np.zeros(n, dtype=float)
    sign = np.sign(group_b.mean(axis=0) - group_a.mean(axis=0))
    for i in range(n):
        a, b = group_a[:, i], group_b[:, i]
        # Skip only genes with no variation across both groups; the test is
        # uninformative there. Comparing the sets of unique values instead
        # would also zero out genes whose counts differ in frequency.
        if np.ptp(np.concatenate([a, b])) == 0:
            continue
        p = mannwhitneyu(a, b, alternative="two-sided").pvalue
        score[i] = sign[i] * (1.0 - p)
    return score

scores_sc = differential_score_u_test(cells_a, cells_b)

exp_sc = ExpressionProfile(genes_sc, scores_sc, n_bins=10)

# Restrict annotation to selected genes for this toy single-cell run
go_sc = go_df[go_df["gene"].isin(set(genes_sc))].copy()
ont_sc = GeneSets(go_sc["gene"].values, go_sc["pathway"].values, n_bins=6)
ont_sc.filter_pathways(min_size=15, max_size=300)

p_sc = PAGE(
    exp_sc,
    ont_sc,
    n_shuffle=30,
    alpha=0.01,
    k=7,
    filter_redundant=True,
    n_jobs=1,
)

sc_results, sc_hm = p_sc.run()
print(f"Single-cell pathways returned: {len(sc_results)}")
sc_results.head(10)


In [None]:
if sc_hm is not None:
    sc_hm.show(max_rows=30, title="Single-cell example")


## Next steps

- Increase `n_shuffle` (for example, to 500 or 1000+) for finer-grained, more stable permutation p-values.
- Replace synthetic single-cell matrices with your real cell-by-gene matrix and group labels.
- Optionally convert accessions with `convert_from_to(...)` if expression and annotations use different gene ID types.
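
For the second point, a real dataset usually arrives as one cell-by-gene matrix plus a label per cell. The sketch below splits such a table into the two arrays that a per-gene scoring function expects. The names `counts`, `labels`, and `split_groups` are hypothetical, and the tiny matrix is made up for illustration.

```python
import numpy as np
import pandas as pd


def split_groups(counts: pd.DataFrame, labels: pd.Series, a: str, b: str):
    """Return (genes, cells_a, cells_b) with cells as rows and genes as columns."""
    labels = labels.reindex(counts.index)  # align labels to the matrix rows
    cells_a = counts.loc[labels == a].to_numpy()
    cells_b = counts.loc[labels == b].to_numpy()
    return counts.columns.to_numpy(), cells_a, cells_b


# Made-up 4-cell, 2-gene count matrix with one group label per cell.
counts = pd.DataFrame(
    [[0, 3], [1, 4], [5, 0], [6, 1]],
    index=["cell1", "cell2", "cell3", "cell4"],
    columns=["GENE1", "GENE2"],
)
labels = pd.Series(["A", "A", "B", "B"], index=counts.index)

genes, cells_a, cells_b = split_groups(counts, labels, "A", "B")
print(genes, cells_a.shape, cells_b.shape)
```

The returned arrays can then go through `differential_score_u_test` and `ExpressionProfile(genes, scores, n_bins=10)` exactly as in the single-cell example.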
