# Bulk PAGE Tutorial: DESeq2 Log Fold-Change with Hallmark Gene Sets

This notebook demonstrates a complete bulk pyPAGE workflow using DESeq2 differential expression results and MSigDB Hallmark gene sets.

**What we'll cover:**
1. Loading DESeq2 results as continuous expression scores
2. Loading Hallmark gene sets from a GMT file
3. Running PAGE with CMI (conditional mutual information) to correct for annotation bias
4. Interpreting the results and the iPAGE-style enrichment heatmap
5. Exploring enriched genes and the enrichment score matrix
6. Redundancy filtering
7. Manual pathway analysis
8. Equivalent CLI command

**Data included in the repo:**
- `example_data/test_DESeq_logFC.txt.gz` — DESeq2 output (20,725 genes)
- `example_data/h.all.v2026.1.Hs.symbols.gmt` — MSigDB Hallmark gene sets (50 pathways)

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pypage import PAGE, ExpressionProfile, GeneSets

np.random.seed(42)

## 1. Load the DESeq2 results

The input file has columns including `GENE` (gene symbols) and `log2FoldChange` (continuous differential expression scores). pyPAGE will auto-discretize continuous scores into equal-frequency bins.

In [None]:
deseq = pd.read_csv("../example_data/test_DESeq_logFC.txt.gz", sep="\t")
print(f"Shape: {deseq.shape}")
deseq.head()

In [None]:
fig, ax = plt.subplots(figsize=(7, 3))
ax.hist(deseq["log2FoldChange"].dropna(), bins=80, edgecolor="none", color="#4a90d9")
ax.set_xlabel("log2 Fold-Change")
ax.set_ylabel("Genes")
ax.set_title("Distribution of DESeq2 log2 fold-changes")
plt.tight_layout()
plt.show()

Create an `ExpressionProfile` from the gene symbols and log2 fold-change values. Since these are continuous scores, `is_bin=False` (default) and they will be discretized into `n_bins` equal-frequency bins.

In [None]:
exp = ExpressionProfile(
    deseq["GENE"].values,
    deseq["log2FoldChange"].values,
    n_bins=10,
)
exp
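
To make "equal-frequency bins" concrete, here is a minimal plain-Python sketch of rank-based binning. It is an illustration under assumptions, not pyPAGE's internal code: `equal_frequency_bins` is a hypothetical helper, the toy scores are random, and how `ExpressionProfile` treats ties or missing values is not shown in this notebook.

```python
import random
from collections import Counter

def equal_frequency_bins(scores, n_bins=10):
    """Assign each score a bin index so that bins hold (nearly) equal numbers of genes."""
    # Gene indices ordered from lowest to highest score (ties keep input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, gene_index in enumerate(order):
        # Map ranks 0..n-1 onto bin indices 0..n_bins-1.
        bins[gene_index] = rank * n_bins // len(scores)
    return bins

rng = random.Random(0)
toy_lfc = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # toy stand-in for log2 fold-changes
bins = equal_frequency_bins(toy_lfc, n_bins=10)
bin_sizes = Counter(bins)
print(sorted(bin_sizes.items()))  # every bin holds 100 of the 1000 toy genes
```

In this sketch bin 0 holds the most negative fold-changes and bin 9 the most positive, which matches the low-to-high column order of the heatmap later in the notebook.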

## 2. Load Hallmark gene sets

MSigDB Hallmark gene sets are provided in GMT format. `GeneSets.from_gmt()` reads `.gmt` or `.gmt.gz` files directly.

In [None]:
gs = GeneSets.from_gmt("../example_data/h.all.v2026.1.Hs.symbols.gmt")
print(gs)
print(f"\nExample pathways: {gs.pathways[:5]}")
print(f"Total genes across all pathways: {gs.n_genes}")
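
For reference, GMT is a tab-separated text format: each line holds a gene set name, a description field, and then the member genes. The sketch below parses that layout with the standard library only. `parse_gmt` and the two toy gene sets are hypothetical; they illustrate the file layout and say nothing about how `GeneSets.from_gmt()` is implemented.

```python
import io

def parse_gmt(handle):
    """Parse GMT lines into a {gene_set_name: [genes]} dictionary."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets

# Two made-up gene sets in GMT layout (names and URLs are placeholders).
toy_gmt = io.StringIO(
    "HYPOTHETICAL_SET_A\thttps://example.org/a\tTP53\tMYC\tEGFR\n"
    "HYPOTHETICAL_SET_B\thttps://example.org/b\tSTAT1\tIRF1\n"
)
toy_sets = parse_gmt(toy_gmt)
for name, genes in toy_sets.items():
    print(name, len(genes), genes)
```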

## 3. Run PAGE

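The default `function='cmi'` scores each pathway by the conditional mutual information between pathway membership and expression bin, conditioned on per-gene membership counts (how many gene sets each gene belongs to). This corrects for annotation bias, where heavily annotated genes appear in many gene sets, and is the core contribution of this method over standard PAGE.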
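
As a rough illustration of what "conditioning" means here, the sketch below computes I(X; Y | Z) from three lists of discrete labels using a plug-in (count-based) estimate. The function name, the toy data, and the choice of estimator are assumptions made for this example; pyPAGE's own estimator and the way it bins membership counts are not shown in this notebook.

```python
from collections import Counter
from math import log2

def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in bits for equal-length label sequences."""
    n = len(x)
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    cmi = 0.0
    for (xi, yi, zi), count in n_xyz.items():
        # p(x,y,z) * log2[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ], written with raw counts.
        cmi += (count / n) * log2(count * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)]))
    return cmi

# Toy data for 8 genes: x = expression bin, y = pathway membership,
# z = number of gene sets the gene belongs to (all values are made up).
x = [0, 0, 1, 1, 0, 0, 1, 1]
y = [0, 0, 1, 1, 0, 1, 0, 1]
z = [1, 1, 1, 1, 2, 2, 2, 2]
cmi = conditional_mutual_information(x, y, z)
print(f"I(X; Y | Z) = {cmi:.3f} bits")
```

Among the genes with `z == 1`, membership tracks the expression bin perfectly (1 bit); among the genes with `z == 2`, the two are independent (0 bits). The conditional value is the weighted average of the two strata, 0.5 bits.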

Key parameters:
- `n_shuffle=10000` — permutations for p-value estimation (we use 1000 here for speed)
- `alpha=0.005` — p-value threshold for informative pathways
- `k=20` — early stopping after k consecutive non-significant pathways
- `filter_redundant=True` — remove redundant pathways via CMI
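
One way to read `alpha` and `k` together is as a stopping rule over a ranked list of pathways. The sketch below assumes pathways are visited in rank order and that a significant pathway resets the miss counter. `select_informative`, the precomputed toy p-values, and the `<=` comparison are illustrative assumptions, not pyPAGE's code.

```python
def select_informative(ranked_pvalues, alpha=0.005, k=20):
    """Return indices of significant pathways, stopping after k consecutive misses."""
    accepted = []
    consecutive_misses = 0
    for index, p_value in enumerate(ranked_pvalues):
        if p_value <= alpha:
            accepted.append(index)
            consecutive_misses = 0  # a hit resets the counter
        else:
            consecutive_misses += 1
            if consecutive_misses >= k:
                break  # remaining pathways are never tested
    return accepted

# Toy p-values in rank order; with k=3 the scan stops before the last entry.
toy_pvalues = [0.001, 0.2, 0.001, 0.3, 0.4, 0.5, 0.001]
kept = select_informative(toy_pvalues, alpha=0.005, k=3)
print(kept)
```

With `k=3` the last toy pathway is never examined even though its p-value is small, which shows the trade-off: a smaller `k` saves permutations but can miss pathways ranked low in the list.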

In [None]:
p = PAGE(
    exp, gs,
    function='cmi',
    n_shuffle=1000,
    alpha=0.005,
    k=20,
    filter_redundant=True,
)
results, heatmap = p.run()

print(f"Significant pathways: {len(results)}")

## 4. Results

The results DataFrame contains:
- **pathway** — pathway name
- **CMI** — conditional mutual information score
- **z-score** — z-score of observed CMI vs. permutation null distribution
- **p-value** — empirical p-value from permutation test
- **Regulation pattern** — `1` for upregulated, `-1` for downregulated
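
The z-score and empirical p-value columns can be illustrated with a generic permutation test. This sketch is hypothetical throughout: `toy_score` replaces CMI with a simple difference in mean bin index, the data are synthetic, and the `(1 + hits) / (1 + n_shuffle)` p-value is a common convention that pyPAGE may or may not use.

```python
import random
from statistics import mean, pstdev

def toy_score(bins, membership):
    """Stand-in statistic: |mean bin of members - mean bin of non-members|."""
    inside = [b for b, m in zip(bins, membership) if m]
    outside = [b for b, m in zip(bins, membership) if not m]
    return abs(mean(inside) - mean(outside))

def permutation_test(bins, membership, n_shuffle=1000, seed=42):
    """Compare the observed score with scores obtained after shuffling the bins."""
    rng = random.Random(seed)
    observed = toy_score(bins, membership)
    shuffled = list(bins)
    null_scores = []
    for _ in range(n_shuffle):
        rng.shuffle(shuffled)  # break the link between bins and membership
        null_scores.append(toy_score(shuffled, membership))
    spread = pstdev(null_scores)
    z_score = (observed - mean(null_scores)) / spread if spread > 0 else float("nan")
    hits = sum(score >= observed for score in null_scores)
    p_value = (1 + hits) / (1 + n_shuffle)
    return observed, z_score, p_value

# 200 synthetic genes in 10 equal bins; the 30 members sit in the top bins.
toy_bins = [i // 20 for i in range(200)]
toy_membership = [1 if i >= 170 else 0 for i in range(200)]
observed, z_score, p_value = permutation_test(toy_bins, toy_membership)
print(f"score={observed:.2f}  z={z_score:.1f}  p={p_value:.4f}")
```

Note that with this convention the smallest reachable p-value is `1 / (1 + n_shuffle)`, so the number of permutations limits how far below `alpha` a p-value can fall.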

In [None]:
results

### iPAGE-style enrichment heatmap

The `Heatmap` object visualizes per-bin enrichment patterns. Each row is a pathway, each column is an expression bin (low to high). Positive values (yellow) indicate overrepresentation of pathway genes in that bin; negative values (dark) indicate underrepresentation.

In [None]:
if heatmap is not None:
    heatmap.show(title="Hallmark Gene Sets — DESeq2 Log Fold-Change")

## 5. Enriched genes per pathway

`get_enriched_genes()` returns pathway member genes grouped by expression bin. This lets you see which genes drive the enrichment signal.

In [None]:
if len(results) > 0:
    top_pathway = results.iloc[0]["pathway"]
    enriched = p.get_enriched_genes(top_pathway)
    print(f"Genes in '{top_pathway}' by expression bin:\n")
    for i, genes_in_bin in enumerate(enriched):
        if len(genes_in_bin) > 0:
            preview = ", ".join(genes_in_bin[:8])
            suffix = f" ... ({len(genes_in_bin)} total)" if len(genes_in_bin) > 8 else ""
            print(f"  Bin {i}: {preview}{suffix}")
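
Conceptually, grouping by bin is a small bookkeeping step. The sketch below shows the idea on made-up data; `genes_by_bin` and the placeholder gene names are hypothetical, and the exact return type of `get_enriched_genes()` should be checked against the pyPAGE documentation.

```python
def genes_by_bin(genes, bins, members, n_bins):
    """Return one list per expression bin containing the pathway members in that bin."""
    member_set = set(members)
    grouped = [[] for _ in range(n_bins)]
    for gene, bin_index in zip(genes, bins):
        if gene in member_set:
            grouped[bin_index].append(gene)
    return grouped

# Placeholder gene names with made-up bin assignments (3 bins).
toy_genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
toy_gene_bins = [0, 0, 1, 1, 2, 2]
toy_members = ["G2", "G5", "G6"]
grouped = genes_by_bin(toy_genes, toy_gene_bins, toy_members, n_bins=3)
print(grouped)
```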

### Enrichment score matrix

`get_es_matrix()` returns the full enrichment score matrix (log10 hypergeometric p-values). This is the numerical data behind the heatmap.
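
As a rough illustration of what one cell of such a matrix could represent, the sketch below computes a signed log10 hypergeometric p-value for a single pathway in a single bin. The helper names, the toy counts, and the tie-breaking between the two tails are assumptions; only the sign convention (positive for over-, negative for underrepresentation) follows the heatmap description above.

```python
from math import comb, log10

def hypergeom_prob(lo, hi, n_total, n_members, n_in_bin):
    """P(lo <= X <= hi), X = number of pathway members among the genes in one bin."""
    lo = max(lo, 0)
    hi = min(hi, n_members, n_in_bin)
    ways = sum(
        comb(n_members, i) * comb(n_total - n_members, n_in_bin - i)
        for i in range(lo, hi + 1)
    )
    return ways / comb(n_total, n_in_bin)

def signed_log10_p(k, n_total, n_members, n_in_bin):
    """Positive score for overrepresentation, negative for underrepresentation."""
    p_over = hypergeom_prob(k, n_in_bin, n_total, n_members, n_in_bin)  # P(X >= k)
    p_under = hypergeom_prob(0, k, n_total, n_members, n_in_bin)        # P(X <= k)
    return -log10(p_over) if p_over <= p_under else log10(p_under)

# Toy setting: 1000 genes, 50 in the pathway, 100 genes per bin (5 members expected).
over = signed_log10_p(15, 1000, 50, 100)   # 15 members observed in the bin
under = signed_log10_p(0, 1000, 50, 100)   # no members observed in the bin
print(f"over: {over:.2f}  under: {under:.2f}")
```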

In [None]:
if heatmap is not None:
    es = p.get_es_matrix()
    print(f"Enrichment score matrix: {es.shape}")
    display(es)  # a bare expression inside an if-block is not rendered by Jupyter

## 6. Redundancy filtering

When `filter_redundant=True`, PAGE removes pathways that are redundant with already-accepted pathways based on conditional mutual information between memberships. Inspect what was filtered:

In [None]:
killed = p.get_redundancy_log()
if len(killed) > 0:
    print(f"Removed {len(killed)} redundant pathways:")
    display(killed)
else:
    print("No pathways were removed by redundancy filtering.")

print("\nAll informative pathways (including redundant):")
p.full_results
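
The greedy structure of such a filter can be sketched in a few lines. For brevity this example swaps in plain mutual information between 0/1 membership vectors as the redundancy measure, with a made-up threshold; pyPAGE's actual CMI-based criterion and its threshold are not shown in this notebook, and all names below are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(a, b):
    """Plug-in estimate of I(A; B) in bits for two equal-length label sequences."""
    n = len(a)
    n_ab, n_a, n_b = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum(
        (count / n) * log2(count * n / (n_a[ai] * n_b[bi]))
        for (ai, bi), count in n_ab.items()
    )

def filter_redundant(ranked_names, membership, max_shared_bits=0.2):
    """Keep a pathway unless it shares too much information with an accepted one."""
    accepted, killed = [], []
    for name in ranked_names:
        blocker = next(
            (kept for kept in accepted
             if mutual_information(membership[name], membership[kept]) > max_shared_bits),
            None,
        )
        if blocker is None:
            accepted.append(name)
        else:
            killed.append((name, blocker))  # record which pathway caused the removal
    return accepted, killed

def membership_vector(member_indices, n_genes=40):
    return [1 if i in member_indices else 0 for i in range(n_genes)]

# Toy universe of 40 genes; SET_A_SUBSET overlaps SET_A, SET_B is disjoint.
toy_membership = {
    "SET_A": membership_vector({0, 1, 2, 3}),
    "SET_A_SUBSET": membership_vector({0, 1, 2}),
    "SET_B": membership_vector({20, 21, 22, 23}),
}
accepted, killed = filter_redundant(["SET_A", "SET_A_SUBSET", "SET_B"], toy_membership)
print("accepted:", accepted)
print("killed:", killed)
```

Because the filter is greedy, the order of the ranked list matters: a pathway is only compared against pathways accepted before it.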

## 7. Manual pathway analysis

Use `run_manual()` to analyze specific pathways of interest without permutation testing. This is useful for inspecting pathways you know are relevant.

In [None]:
p_manual = PAGE(exp, gs)
manual_results, manual_hm = p_manual.run_manual([
    "HALLMARK_INTERFERON_GAMMA_RESPONSE",
    "HALLMARK_E2F_TARGETS",
    "HALLMARK_APOPTOSIS",
])
display(manual_results)

if manual_hm is not None:
    manual_hm.show(title="Manual pathway inspection")

## 8. Equivalent CLI command

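A comparable analysis can be run from the command line. Options not passed explicitly fall back to the CLI defaults, so the permutation count may differ from the `n_shuffle=1000` used above: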

```bash
pypage -e example_data/test_DESeq_logFC.txt.gz \
    --gmt example_data/h.all.v2026.1.Hs.symbols.gmt \
    --cols GENE,log2FoldChange --seed 42
```

This creates `example_data/test_DESeq_logFC_PAGE/` with:
- `results.tsv` — pathway results
- `results.matrix.tsv` — enrichment score matrix
- `results.killed.tsv` — redundancy filtering log
- `heatmap.pdf` — iPAGE-style enrichment heatmap
- `heatmap.html` — interactive HTML heatmap

Re-plot with a custom color scale:

```bash
pypage --draw-only -e example_data/test_DESeq_logFC.txt.gz \
    --min-val -2 --max-val 3 --bar-min -1 --bar-max 1
```