# Lab 11: Transcriptomics - GWAS + RNA-seq + GSEA Integration

## Student Notebook

This notebook contains exercises for the Transcriptomics laboratory session. You will implement a real-world bioinformatics workflow integrating:

- **GWAS (Genome-Wide Association Studies)** data from public databases
- **RNA-seq** differential expression analysis
- **GSEA (Gene Set Enrichment Analysis)** for pathway analysis
- **Protein-Protein Interaction (PPI)** network analysis

**Learning Objectives:**
- Retrieve GWAS associations from the EBI GWAS Catalog API
- Extract and process gene sets from GWAS results
- Analyze RNA-seq differential expression data
- Perform Gene Set Enrichment Analysis (GSEA)
- Interpret GSEA results and identify leading edge genes
- Build and visualize protein interaction networks

---

## Laboratory Pipeline Overview

This laboratory follows an integrated multi-omics analysis pipeline:

1. **Exercise 1: Fetch GWAS Associations** - Retrieve genetic associations from the GWAS Catalog
2. **Exercise 2: Extract GWAS Genes** - Convert SNP associations to candidate gene sets
3. **Exercise 3: Load & Visualize RNA-seq Data** - Explore differential expression with volcano plots
4. **Exercise 4: Rank Genes for GSEA** - Prepare ranked gene lists for enrichment analysis
5. **Exercise 5: Run GWAS-GSEA Integration** - Test if GWAS genes are enriched in expression changes
6. **Exercise 6: Interpret GSEA Results** - Analyze enrichment scores and significance
7. **Exercise 7: Network Analysis** - Build PPI networks from leading edge genes

---

### Biological Context

**Why integrate GWAS and RNA-seq?**

GWAS identifies genetic variants (SNPs) associated with diseases or traits, but these variants often lie in non-coding regions, making it difficult to identify the causal genes. By integrating GWAS results with RNA-seq differential expression data, we can:

1. Test whether GWAS-implicated genes show altered expression in disease
2. Prioritize candidate genes for functional validation
3. Identify disease-relevant pathways and networks

In this lab, we'll analyze **Crohn's disease** as our case study - a chronic inflammatory bowel disease with a strong genetic component.

In [None]:
# Import required libraries
%matplotlib inline

# Data manipulation
import pandas as pd
import numpy as np
from typing import Set, Dict

# API and network
import requests
import networkx as nx

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from pyvis.network import Network

# Gene Set Enrichment Analysis
import gseapy as gp

# Configure plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

print("✓ All libraries imported successfully")

---

## Exercise 1: Fetch GWAS Associations from the GWAS Catalog

**Background:**

The [GWAS Catalog](https://www.ebi.ac.uk/gwas/) is a curated collection of published genome-wide association studies. It provides a REST API to programmatically access association data, including:
- Variant identifiers (rsIDs)
- Association p-values
- Mapped genes near each variant

**Goals:**
1. Query the GWAS Catalog API for a specific disease trait
2. Filter associations by genome-wide significance (p < 5×10⁻⁸)
3. Extract SNP IDs, p-values, and mapped genes

**Key Concepts:**
- REST API usage for biological databases
- GWAS significance thresholds
- SNP-to-gene mapping

**API Documentation:** https://www.ebi.ac.uk/gwas/rest/docs/api

---

### TODO:

1. Complete the `fetch_gwas_associations()` function:
   - Make a GET request to the GWAS Catalog API
   - Parse the JSON response to extract associations
   - Filter associations by p-value threshold
   - Return a DataFrame with rsid, pvalue, and mapped_genes columns

**Hints:**
- API endpoint: `https://www.ebi.ac.uk/gwas/rest/api/associations`
- Use `params={"efoTrait": trait}` for query parameters
- Set `headers={"Accept": "application/json"}`
- Associations are in `response.json()["_embedded"]["associations"]`
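
The parsing and filtering steps can be tried offline before calling the live API. The sketch below runs on a small hand-written payload, not real Catalog data. Only the `_embedded` / `associations` nesting and the `pvalue` key come from the hints above; the `variant` and `genes` keys, the helper name `filter_associations`, and all values are hypothetical placeholders, so inspect a real response to find the actual field names.

```python
import json

# Hand-written stand-in for a response body (NOT real GWAS Catalog data).
# "variant" and "genes" are hypothetical keys used only for this illustration.
mock_body = json.dumps({
    "_embedded": {
        "associations": [
            {"pvalue": 3e-12, "variant": "rsA", "genes": "GENE_A"},
            {"pvalue": 2e-6, "variant": "rsB", "genes": "GENE_B"},
            {"pvalue": 5e-8, "variant": "rsC", "genes": "GENE_C, GENE_D"},
            {"variant": "rsD", "genes": "GENE_E"},  # p-value missing
        ]
    }
})


def filter_associations(body: str, pval_threshold: float = 5e-8) -> list:
    """Keep associations whose p-value is at or below the threshold."""
    associations = json.loads(body)["_embedded"]["associations"]
    kept = []
    for assoc in associations:
        # A missing p-value defaults to 1.0, so the record is filtered out.
        pval = float(assoc.get("pvalue", 1.0))
        if pval <= pval_threshold:
            kept.append({
                "rsid": assoc["variant"],
                "pvalue": pval,
                "mapped_genes": assoc["genes"],
            })
    return kept


records = filter_associations(mock_body)
for rec in records:
    print(rec)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` to get the three required columns.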

In [None]:
def fetch_gwas_associations(trait: str, pval_threshold: float = 5e-8) -> pd.DataFrame:
    """
    Retrieve real GWAS associations for a given trait from the GWAS Catalog API.
    
    Parameters:
    -----------
    trait : str
        Disease or phenotype name (e.g., "Crohn's disease")
    pval_threshold : float
        GWAS significance cutoff (default: 5e-8 for genome-wide significance)
    
    Returns:
    --------
    pd.DataFrame
        DataFrame with columns:
        - rsid: SNP identifier
        - pvalue: Association p-value
        - mapped_genes: Genes mapped to the variant
    """
    url = "https://www.ebi.ac.uk/gwas/rest/api/associations"
    params = {"efoTrait": trait}
    headers = {"Accept": "application/json"}
    
    print(f"Querying GWAS Catalog for: {trait}")
    
    # TODO: Make the API request
    # response = requests.get(...)
    # response.raise_for_status()
    
    records = []
    
    # TODO: Parse the JSON response
    # data = response.json()["_embedded"]["associations"]
    
    # TODO: Loop through associations and filter by p-value
    # for assoc in data:
    #     pval = float(assoc.get("pvalue", 1.0))
    #     if pval <= pval_threshold:
    #         records.append({
    #             "rsid": ...,
    #             "pvalue": ...,
    #             "mapped_genes": ...
    #         })
    
    print(f"✓ Found {len(records)} significant associations (p <= {pval_threshold})")
    return pd.DataFrame(records)

In [None]:
# Exercise 1: Fetch GWAS associations for Crohn's disease
print("=" * 60)
print("Exercise 1: Fetch GWAS Associations")
print("=" * 60)

# Query the GWAS Catalog
gwas_df = fetch_gwas_associations("Crohn's disease")

# Display results
print(f"\nGWAS Results Summary:")
print(f"  Total significant SNPs: {len(gwas_df)}")
if len(gwas_df) > 0:
    print(f"  P-value range: {gwas_df['pvalue'].min():.2e} - {gwas_df['pvalue'].max():.2e}")

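# Show first few associations (skipped while the function still returns an empty
# frame, because an empty DataFrame has no 'pvalue' column to sort on)
if len(gwas_df) > 0:
    print("\nTop 10 associations by p-value:")
    display(gwas_df.sort_values('pvalue').head(10))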

In [None]:
def plot_gwas_pvalues(gwas_df: pd.DataFrame):
    """
    Visualize the distribution of GWAS p-values.
    
    Parameters:
    -----------
    gwas_df : pd.DataFrame
        GWAS associations with 'pvalue' column
    """
    plt.figure(figsize=(8, 5))
    
    # Transform p-values to -log10 scale
    neg_log_p = -np.log10(gwas_df["pvalue"])
    
    sns.histplot(neg_log_p, bins=30, kde=True, color='steelblue')
    
    # Add significance threshold line
    plt.axvline(-np.log10(5e-8), color='red', linestyle='--', 
                label='Genome-wide significance (5e-8)')
    
    plt.xlabel("-log10(GWAS p-value)", fontsize=12)
    plt.ylabel("Number of SNPs", fontsize=12)
    plt.title("GWAS Signal Strength Distribution", fontsize=14)
    plt.legend()
    plt.tight_layout()
    plt.show()

# Visualize GWAS p-value distribution (run after completing fetch_gwas_associations)
if len(gwas_df) > 0:
    plot_gwas_pvalues(gwas_df)
else:
    print("Complete fetch_gwas_associations() first to visualize results.")

---

## Exercise 2: Extract GWAS Candidate Genes

**Background:**

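GWAS identifies genetic variants (SNPs), but for functional interpretation, we need to identify the genes these variants may affect. The GWAS Catalog provides "mapped genes", which are assigned by genomic position:
- The gene that contains the variant, when the variant lies within a gene
- The nearest upstream and downstream genes, when the variant is intergenic

Position-based mapping is only a first approximation: the causal gene may instead be another gene in the same linkage disequilibrium (LD) block, or a gene regulated by the variant (an eQTL target).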

**Goals:**
1. Parse the mapped_genes field from GWAS results
2. Create a unique set of candidate genes
3. Understand the gene set size and composition

**Key Concepts:**
- SNP-to-gene mapping strategies
- Gene set construction for pathway analysis
- Data cleaning and parsing

---

### TODO:

1. Complete the `extract_gwas_genes()` function:
   - Iterate through the mapped_genes column
   - Split comma-separated gene lists
   - Clean whitespace and add unique genes to a set
   - Handle missing/empty values appropriately

**Hints:**
- Use `gwas_df["mapped_genes"].dropna()` to skip missing values
- Use `gene_str.split(",")` to split multiple genes
- Use `gene.strip()` to remove whitespace
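
The hints above split on commas only. As a hedged extension, the sketch below also splits on a hyphen surrounded by whitespace, on the assumption that some entries list the two genes flanking an intergenic variant as `GENE1 - GENE2`; check your own query results before relying on this. Requiring whitespace around the hyphen matters because many gene symbols contain hyphens themselves. The helper name `split_gene_field` and the example strings are made up for illustration.

```python
import re

# Split on commas, or on a hyphen that has whitespace on both sides.
# Requiring the whitespace keeps hyphenated symbols such as HLA-DQA1 intact.
SEPARATOR = re.compile(r"\s*,\s*|\s+-\s+")
SKIP = {"intergenic", "na", "none"}


def split_gene_field(gene_str: str) -> set:
    """Return the cleaned gene symbols found in one mapped_genes entry."""
    symbols = set()
    for token in SEPARATOR.split(gene_str):
        token = token.strip()
        if token and token.lower() not in SKIP:
            symbols.add(token)
    return symbols


# Made-up example entries (illustrative strings, not query results)
examples = ["NOD2", "IL23R, IL12RB2", "HLA-DQA1 - HLA-DRB1", " intergenic ", ""]
genes = set()
for entry in examples:
    genes |= split_gene_field(entry)

print(sorted(genes))
```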

In [None]:
def extract_gwas_genes(gwas_df: pd.DataFrame) -> Set[str]:
    """
    Convert GWAS SNP associations into a set of candidate genes.
    
    Parameters:
    -----------
    gwas_df : pd.DataFrame
        Output of fetch_gwas_associations() with 'mapped_genes' column
    
    Returns:
    --------
    Set[str]
        Unique gene symbols implicated by GWAS
    
    Notes:
    ------
    The mapped_genes field may contain multiple genes separated by commas.
    Some entries may contain special annotations like "intergenic" which
    should be handled appropriately.
    """
    genes = set()
    
    # TODO: Iterate through mapped_genes column
    # for gene_str in gwas_df["mapped_genes"].dropna():
    #     # TODO: Split by comma and clean whitespace
    #     for g in gene_str.split(","):
    #         gene = g.strip()
    #         # TODO: Skip empty strings and special annotations
    #         if gene and gene.lower() not in ['intergenic', 'na', 'none']:
    #             genes.add(gene)
    
    return genes

In [None]:
# Exercise 2: Extract candidate genes from GWAS results
print("=" * 60)
print("Exercise 2: Extract GWAS Candidate Genes")
print("=" * 60)

gwas_genes = extract_gwas_genes(gwas_df)

print(f"\nGWAS Gene Set Summary:")
print(f"  Total unique genes: {len(gwas_genes)}")
if len(gwas_genes) > 0:
    print(f"  SNPs per gene (avg): {len(gwas_df) / len(gwas_genes):.2f}")

    # Display some example genes
    print(f"\nExample GWAS genes (first 20):")
    sorted_genes = sorted(gwas_genes)
    print(", ".join(sorted_genes[:20]))

    # Check for well-known Crohn's disease genes
    known_cd_genes = {'NOD2', 'IL23R', 'ATG16L1', 'IRGM', 'IL10', 'CARD9'}
    found_known = known_cd_genes & gwas_genes
    print(f"\nKnown Crohn's disease genes found: {found_known if found_known else 'None in this query'}")
else:
    print("Complete extract_gwas_genes() first.")

---

## Exercise 3: Load and Visualize RNA-seq Differential Expression Data

**Background:**

RNA-seq measures gene expression levels across the transcriptome. Differential expression analysis compares expression between conditions (e.g., disease vs. healthy) to identify genes with altered expression.

Key metrics from differential expression analysis:
- **logFC (log2 Fold Change)**: Magnitude and direction of expression change
- **p-value**: Statistical significance of the change

**Volcano plots** are a standard visualization that shows both significance (-log10 p-value on y-axis) and effect size (logFC on x-axis).
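
To make the two axes concrete, here is a minimal sketch that computes the plot coordinates and the color category for a few invented genes. It uses only the standard library; the gene names, values, and the helper `volcano_category` are illustrative, and the cutoffs mirror the defaults used later in `plot_volcano()`.

```python
import math

# Toy DE results: gene -> (logFC, p-value). Values are invented.
toy_de = {
    "GENE_UP": (2.1, 1e-6),
    "GENE_DOWN": (-1.8, 3e-4),
    "GENE_SMALL": (0.4, 1e-5),   # significant but small effect
    "GENE_NOISY": (2.5, 0.3),    # large effect but not significant
}


def volcano_category(logfc: float, pvalue: float,
                     pval_cutoff: float = 0.05,
                     logfc_cutoff: float = 1.0) -> str:
    """Classify a gene the way the volcano plot colors it."""
    if pvalue < pval_cutoff and logfc > logfc_cutoff:
        return "Upregulated"
    if pvalue < pval_cutoff and logfc < -logfc_cutoff:
        return "Downregulated"
    return "Not Significant"


points = {}
for gene, (logfc, pvalue) in toy_de.items():
    y = -math.log10(pvalue)  # y-axis coordinate
    points[gene] = (logfc, y, volcano_category(logfc, pvalue))
    print(f"{gene:11s} x={logfc:+.1f}  y={y:.2f}  {points[gene][2]}")
```

A gene must pass both cutoffs to be colored, which is why `GENE_SMALL` and `GENE_NOISY` both end up gray.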

**Goals:**
1. Load RNA-seq differential expression results
2. Create static and interactive volcano plots
3. Identify significantly differentially expressed genes

**Key Concepts:**
- Differential expression analysis interpretation
- Volcano plot visualization
- Multiple testing correction
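
Multiple testing correction is listed as a key concept, but the data in this exercise carry only raw p-values. One common choice, assumed here rather than prescribed by the lab, is the Benjamini-Hochberg procedure, which converts p-values into FDR-adjusted q-values. The sketch below is a plain-Python illustration on invented numbers; the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(pvalues: list) -> list:
    """Return BH-adjusted p-values (q-values) in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of p * m / rank over this rank and all larger ranks (monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.60]  # invented p-values
qvals = benjamini_hochberg(raw)
for p, q in zip(raw, qvals):
    print(f"p={p:.3f}  q={q:.4f}")
```

In this toy case the genes with p = 0.039 and p = 0.041 pass a raw 0.05 cutoff but not a 0.05 FDR cutoff, which is the practical reason to correct before counting "significant" genes.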

---

### TODO:

1. Study the provided `generate_synthetic_rnaseq_data()` function
2. Complete `plot_volcano()` to create a volcano plot:
   - Calculate -log10(pvalue) for y-axis
   - Color genes by significance and direction
   - Add threshold lines
3. Complete `interactive_volcano_plot()` using Plotly

In [None]:
# This function is provided - study how it works

def generate_synthetic_rnaseq_data(n_genes: int = 5000, 
                                   gwas_genes: Set[str] = None,
                                   enrichment_strength: float = 0.3) -> pd.DataFrame:
    """
    Generate synthetic RNA-seq differential expression data.
    
    Parameters:
    -----------
    n_genes : int
        Number of genes to simulate
    gwas_genes : Set[str]
        GWAS genes to include (will have slightly elevated effect sizes)
    enrichment_strength : float
        How much to bias GWAS genes toward differential expression
    
    Returns:
    --------
    pd.DataFrame
        Simulated DE results with gene, logFC, pvalue columns
    """
    np.random.seed(42)
    
    # Generate random gene names
    genes = [f"GENE{i:04d}" for i in range(n_genes)]
    
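    # Add GWAS genes if provided
    if gwas_genes:
        # Sort so the gene order (and thus the simulated data) is reproducible;
        # iteration order of a set of strings can differ between Python sessions
        gwas_list = sorted(gwas_genes)
        # Replace the first random genes with up to 200 GWAS genes
        for i, gwas_gene in enumerate(gwas_list[:min(200, n_genes)]):
            genes[i] = gwas_gene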
    
    # Generate logFC values (normal distribution)
    logfc = np.random.normal(0, 1, n_genes)
    
    # Bias GWAS genes toward larger effect sizes
    if gwas_genes:
        for i, gene in enumerate(genes):
            if gene in gwas_genes:
                # Add enrichment bias
                logfc[i] += np.random.choice([-1, 1]) * enrichment_strength * np.abs(np.random.normal(1, 0.5))
    
    # Generate p-values (related to effect size + noise)
    noise = np.random.exponential(0.5, n_genes)
    pvalues = 10 ** (-np.abs(logfc) * 2 - noise + np.random.normal(0, 1, n_genes))
    pvalues = np.clip(pvalues, 1e-300, 1)
    
    df = pd.DataFrame({
        'gene': genes,
        'logFC': logfc,
        'pvalue': pvalues
    })
    
    return df


def load_rnaseq_results(filepath: str = None, 
                        gwas_genes: Set[str] = None) -> pd.DataFrame:
    """
    Load RNA-seq differential expression results.
    
    Parameters:
    -----------
    filepath : str
        Path to CSV file with columns: gene, logFC, pvalue
        If None, generates synthetic data
    gwas_genes : Set[str]
        GWAS genes for synthetic data generation
    
    Returns:
    --------
    pd.DataFrame
        DE results indexed by gene
    """
    if filepath:
        df = pd.read_csv(filepath)
    else:
        # Generate synthetic data for demonstration
        df = generate_synthetic_rnaseq_data(n_genes=5000, gwas_genes=gwas_genes)
    
    df = df.set_index("gene")
    return df

In [None]:
# Exercise 3: Load RNA-seq data
print("=" * 60)
print("Exercise 3: Load and Visualize RNA-seq Data")
print("=" * 60)

# Load/generate RNA-seq data
de_df = load_rnaseq_results(gwas_genes=gwas_genes)

print(f"\nRNA-seq Data Summary:")
print(f"  Total genes: {len(de_df)}")
print(f"  logFC range: [{de_df['logFC'].min():.2f}, {de_df['logFC'].max():.2f}]")
print(f"  Significant genes (p < 0.05): {(de_df['pvalue'] < 0.05).sum()}")
print(f"  Highly significant (p < 0.001): {(de_df['pvalue'] < 0.001).sum()}")

# Display summary statistics
de_df.describe()

In [None]:
def plot_volcano(de_df: pd.DataFrame, pval_cutoff: float = 0.05, logfc_cutoff: float = 1.0):
    """
    Create a volcano plot of RNA-seq differential expression.
    
    Parameters:
    -----------
    de_df : pd.DataFrame
        RNA-seq DE results with logFC and pvalue columns
    pval_cutoff : float
        Significance threshold for coloring
    logfc_cutoff : float
        Effect size threshold for coloring
    """
    plt.figure(figsize=(10, 7))
    
    # TODO: Calculate -log10(pvalue)
    # neg_log_p = ...
    
    # TODO: Define categories based on significance and logFC cutoffs
    # sig = de_df["pvalue"] < pval_cutoff
    # up = (de_df["logFC"] > logfc_cutoff) & sig
    # down = (de_df["logFC"] < -logfc_cutoff) & sig
    
    # TODO: Plot points with different colors
    # - Not significant: gray
    # - Upregulated: firebrick
    # - Downregulated: steelblue
    
    # TODO: Add threshold lines (horizontal for p-value, vertical for logFC)
    
    plt.xlabel("log2 Fold Change", fontsize=12)
    plt.ylabel("-log10 p-value", fontsize=12)
    plt.title("RNA-seq Differential Expression Volcano Plot", fontsize=14)
    plt.legend(loc='upper right')
    plt.tight_layout()
    plt.show()

# Create static volcano plot
plot_volcano(de_df)

In [None]:
def interactive_volcano_plot(de_df: pd.DataFrame, pval_cutoff: float = 0.05, logfc_cutoff: float = 1.0):
    """
    Create an interactive volcano plot for RNA-seq results using Plotly.
    
    Parameters:
    -----------
    de_df : pd.DataFrame
        RNA-seq DE results with logFC and pvalue columns
    pval_cutoff : float
        Significance threshold
    logfc_cutoff : float
        Effect size threshold for coloring
    
    Returns:
    --------
    Interactive Plotly figure
    """
    df = de_df.copy()
    
    # TODO: Calculate -log10(pvalue)
    # df["neg_log10_p"] = ...
    
    df["gene_name"] = df.index
    
    # TODO: Assign categories based on significance and direction (with logFC cutoff)
    # df["category"] = "Not Significant"
    # df.loc[(df["pvalue"] < pval_cutoff) & (df["logFC"] > logfc_cutoff), "category"] = "Upregulated"
    # df.loc[(df["pvalue"] < pval_cutoff) & (df["logFC"] < -logfc_cutoff), "category"] = "Downregulated"
    
    color_map = {
        "Not Significant": "lightgray",
        "Upregulated": "firebrick",
        "Downregulated": "steelblue"
    }
    
    # TODO: Create Plotly scatter plot using px.scatter()
    # fig = px.scatter(
    #     df,
    #     x="logFC",
    #     y="neg_log10_p",
    #     color="category",
    #     color_discrete_map=color_map,
    #     hover_name="gene_name",
    #     ...
    # )
    
    # TODO: Add horizontal and vertical threshold lines
    # fig.add_hline(...)
    # fig.add_vline(...)
    
    # fig.show()
    print("Complete the interactive_volcano_plot() function")

# Create interactive volcano plot
interactive_volcano_plot(de_df)

---

## Exercise 4: Rank Genes for GSEA

**Background:**

Gene Set Enrichment Analysis (GSEA) tests whether the members of a predefined gene set cluster toward the top or bottom of a list of genes ranked by their expression change between conditions. Unlike over-representation analysis, which first applies a significance cutoff to select a subset of genes, GSEA uses a **ranked list** of all genes.

**Ranking metrics:**
- **logFC**: Simple and interpretable
- **Signed p-value**: -log10(p) × sign(logFC) - incorporates significance
- **t-statistic**: If available from DE analysis
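
The choice of metric can change the gene order. The sketch below ranks four invented genes by logFC and by the signed p-value defined above, using only the standard library; the gene names, values, and helper names are illustrative.

```python
import math

# Invented DE results: gene -> (logFC, p-value)
toy_de = {
    "GENE_A": (3.0, 0.04),    # big change, weak evidence
    "GENE_B": (1.2, 1e-8),    # modest change, strong evidence
    "GENE_C": (-2.0, 1e-5),
    "GENE_D": (-0.3, 0.5),
}


def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def signed_pvalue(logfc: float, pvalue: float) -> float:
    """-log10(p) multiplied by the sign of logFC."""
    return -math.log10(pvalue) * sign(logfc)


# Sort gene names from highest to lowest score under each metric
by_logfc = sorted(toy_de, key=lambda g: toy_de[g][0], reverse=True)
by_signed_p = sorted(toy_de, key=lambda g: signed_pvalue(*toy_de[g]), reverse=True)

print("Ranked by logFC:        ", by_logfc)
print("Ranked by signed p-value:", by_signed_p)
```

`GENE_A` leads the logFC ranking, while `GENE_B` leads the signed p-value ranking because its smaller change is much better supported.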

**Goals:**
1. Create a ranked gene list based on expression changes
2. Visualize where GWAS genes appear in the ranking
3. Understand the pre-ranked GSEA approach

**Key Concepts:**
- Gene ranking strategies
- Pre-ranked GSEA
- Visualizing gene positions in ranked lists

---

### TODO:

1. Complete the `rank_genes_for_gsea()` function:
   - Extract the ranking metric column
   - Sort genes from highest to lowest score
   - Handle missing values

In [None]:
def rank_genes_for_gsea(de_df: pd.DataFrame, 
                        ranking_metric: str = "logFC") -> pd.Series:
    """
    Create a ranked gene list for GSEA.
    
    Parameters:
    -----------
    de_df : pd.DataFrame
        RNA-seq DE results with logFC and optionally pvalue columns
    ranking_metric : str
        Column to use for ranking. Options:
        - "logFC": Use log fold change directly
        - "signed_pvalue": Use -log10(p) * sign(logFC)
    
    Returns:
    --------
    pd.Series
        Gene rankings (index = gene, value = score)
        Sorted from highest to lowest
    """
    # TODO: Handle different ranking metrics
    # if ranking_metric == "signed_pvalue":
    #     ranked = -np.log10(de_df["pvalue"]) * np.sign(de_df["logFC"])
    # else:
    #     ranked = de_df[ranking_metric]
    
    # TODO: Sort from highest to lowest and drop NA values
    # ranked = ranked.sort_values(ascending=False)
    # ranked = ranked.dropna()
    
    # Placeholder - replace with your implementation
    ranked = pd.Series(dtype=float)
    
    return ranked

In [None]:
# Exercise 4: Rank genes for GSEA
print("=" * 60)
print("Exercise 4: Rank Genes for GSEA")
print("=" * 60)

# Create ranked gene list
ranked_genes = rank_genes_for_gsea(de_df, ranking_metric="logFC")

print(f"\nRanked Gene List Summary:")
print(f"  Total ranked genes: {len(ranked_genes)}")

if len(ranked_genes) > 0:
    print(f"  Score range: [{ranked_genes.min():.3f}, {ranked_genes.max():.3f}]")

    print("\nTop 10 upregulated genes:")
    print(ranked_genes.head(10).to_frame('logFC'))

    print("\nTop 10 downregulated genes:")
    print(ranked_genes.tail(10).to_frame('logFC'))
else:
    print("Complete rank_genes_for_gsea() first.")

In [None]:
# This function is provided - study how the barcode plot works

def plot_gwas_gene_ranking(ranked_genes: pd.Series, gwas_genes: Set[str]):
    """
    Visualize where GWAS genes appear in the ranked expression list.
    
    This "barcode plot" shows the distribution of GWAS genes across
    the ranked list. If GWAS genes are enriched at one end, it suggests
    they tend to be up- or down-regulated in disease.
    """
    # Find positions of GWAS genes in ranked list
    positions = [
        i for i, gene in enumerate(ranked_genes.index)
        if gene in gwas_genes
    ]
    
    print(f"GWAS genes found in ranked list: {len(positions)} / {len(gwas_genes)}")
    
    fig, axes = plt.subplots(2, 1, figsize=(12, 5), 
                             gridspec_kw={'height_ratios': [3, 1]})
    
    # Top plot: ranked scores
    ax1 = axes[0]
    ax1.plot(range(len(ranked_genes)), ranked_genes.values, 
             color='steelblue', alpha=0.7, linewidth=0.5)
    ax1.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax1.set_ylabel('Expression Score (logFC)', fontsize=11)
    ax1.set_title('Ranked Gene Expression with GWAS Gene Positions', fontsize=13)
    ax1.set_xlim(0, len(ranked_genes))
    
    # Highlight GWAS genes
    for pos in positions:
        ax1.axvline(pos, color='red', alpha=0.3, linewidth=0.5)
    
    # Bottom plot: barcode
    ax2 = axes[1]
    ax2.eventplot(positions, orientation='horizontal', colors='red', linewidths=0.8)
    ax2.set_xlabel('Gene Rank (high -> low expression change)', fontsize=11)
    ax2.set_yticks([])
    ax2.set_xlim(0, len(ranked_genes))
    ax2.set_ylabel('GWAS\nGenes', fontsize=10)
    
    plt.tight_layout()
    plt.show()

# Visualize GWAS gene positions (run after completing rank_genes_for_gsea)
if len(ranked_genes) > 0 and len(gwas_genes) > 0:
    plot_gwas_gene_ranking(ranked_genes, gwas_genes)
else:
    print("Complete previous exercises first.")

---

## Exercise 5: Run GWAS-GSEA Integration Analysis

**Background:**

Now we perform the key integration step: testing whether GWAS-associated genes are enriched in the RNA-seq expression changes. GSEA works by:

1. Walking down the ranked gene list
2. Increasing a running score when encountering a gene in the gene set
3. Decreasing the score for genes not in the set
4. The **Enrichment Score (ES)** is the maximum deviation from zero
5. Significance is determined by permutation testing
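
Steps 1 to 4 can be sketched in a few lines. This is a simplified, unweighted version in which every hit counts equally; `gseapy` by default also weights hits by the ranking metric, so its scores will not match these numbers. The gene names and the function `enrichment_score` are illustrative.

```python
def enrichment_score(ranked_genes: list, gene_set: set) -> tuple:
    """Unweighted running-sum statistic over a ranked gene list.

    Returns (ES, running_sum), where ES is the running-sum value with the
    largest absolute deviation from zero.
    """
    n_total = len(ranked_genes)
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must overlap the list without covering it")
    hit_step = 1.0 / n_hits                 # increment for a gene in the set
    miss_step = 1.0 / (n_total - n_hits)    # decrement for a gene not in it

    running, current = [], 0.0
    for gene in ranked_genes:
        current += hit_step if gene in gene_set else -miss_step
        running.append(current)
    es = max(running, key=abs)
    return es, running


ranked = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"]  # high -> low score

es_top, running_top = enrichment_score(ranked, {"G1", "G2", "G4"})
es_bottom, _ = enrichment_score(ranked, {"G6", "G7", "G8"})

print("running sum:", [round(v, 2) for v in running_top])
print("ES (set near the top):   ", round(es_top, 2))
print("ES (set near the bottom):", round(es_bottom, 2))
```

The step sizes are chosen so the running sum always returns to zero at the end of the list; only the path in between carries the signal.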

**Interpretation:**
- **Positive NES**: Gene set enriched at the TOP of the list (upregulated)
- **Negative NES**: Gene set enriched at the BOTTOM of the list (downregulated)
- **NES near 0**: Little or no enrichment (judge significance from the p-value and FDR, not from the NES alone)

**Goals:**
1. Run pre-ranked GSEA with GWAS gene set
2. Visualize the enrichment curve
3. Interpret the Normalized Enrichment Score (NES)

**Key Concepts:**
- GSEA algorithm mechanics
- Permutation testing
- Enrichment score interpretation
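
Permutation testing can be illustrated with a small standard-library sketch: shuffle the gene order many times, recompute the enrichment score each time, and count how often the shuffled score is at least as extreme as the observed one. This is a simplified illustration under stated assumptions (unweighted score, two-sided comparison, add-one correction); `gseapy` computes its p-values differently in detail, and the helper names are hypothetical.

```python
import random


def enrichment_score(ranked_genes: list, gene_set: set) -> float:
    """Unweighted running-sum ES (largest absolute deviation from zero)."""
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    hit_step = 1.0 / n_hits
    miss_step = 1.0 / (len(ranked_genes) - n_hits)
    best, current = 0.0, 0.0
    for gene in ranked_genes:
        current += hit_step if gene in gene_set else -miss_step
        if abs(current) > abs(best):
            best = current
    return best


def permutation_pvalue(ranked_genes, gene_set, n_permutations=1000, seed=42):
    """Estimate how often shuffled gene orders give an ES as extreme as observed."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between rank and set membership
        null_es = enrichment_score(shuffled, gene_set)
        if abs(null_es) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimate is never exactly zero
    return observed, (extreme + 1) / (n_permutations + 1)


ranked = [f"G{i}" for i in range(100)]       # G0 has the highest score
gene_set = {f"G{i}" for i in range(10)}      # all ten members sit at the top

observed_es, p_value = permutation_pvalue(ranked, gene_set)
print(f"ES = {observed_es:.2f}, permutation p = {p_value:.4f}")
```

With 1000 permutations the smallest reportable p-value under this scheme is 1/1001, which is why more permutations are needed to resolve very small p-values.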

---

### TODO:

1. Complete the `run_gwas_gsea()` function:
   - Create a gene set dictionary
   - Use `gp.prerank()` to run pre-ranked GSEA
   - Return the result object

**Hints:**
- Gene set format: `{"set_name": list(genes)}`
- Key parameters for `gp.prerank()`: `rnk`, `gene_sets`, `permutation_num`, `seed`, `min_size`

In [None]:
def run_gwas_gsea(ranked_genes: pd.Series,
                  gwas_genes: Set[str],
                  n_permutations: int = 1000):
    """
    Test whether GWAS-associated genes are enriched in RNA-seq changes.
    
    Parameters:
    -----------
    ranked_genes : pd.Series
        Ranked gene list (from rank_genes_for_gsea)
    gwas_genes : Set[str]
        Genes implicated by GWAS
    n_permutations : int
        Number of permutations for significance testing
    
    Returns:
    --------
    gseapy prerank result object
        Not a plain dict: later cells read its .res2d, .results, and .ranking attributes
    """
    # TODO: Create gene set dictionary
    # gene_set = {"GWAS_Crohns_Disease_Genes": list(gwas_genes)}

    print(f"Running GSEA with {len(gwas_genes)} GWAS genes...")
    print(f"  Permutations: {n_permutations}")
    
    # TODO: Run pre-ranked GSEA using gp.prerank()
    # pre_res = gp.prerank(
    #     rnk=ranked_genes,
    #     gene_sets=gene_set,
    #     permutation_num=n_permutations,
    #     seed=42,
    #     outdir=None,
    #     verbose=False,
    #     min_size=1,
    # )
    
    print("✓ GSEA completed")
    # return pre_res
    return None  # Replace with your implementation

In [None]:
# Exercise 5: Run GSEA
print("=" * 60)
print("Exercise 5: Run GWAS-GSEA Integration Analysis")
print("=" * 60)

# Run GSEA analysis
if len(ranked_genes) > 0 and len(gwas_genes) > 0:
    gsea_res = run_gwas_gsea(ranked_genes, gwas_genes, n_permutations=1000)
    
    if gsea_res is not None:
        print("\nGSEA Results Preview:")
        display(gsea_res.res2d)
    else:
        print("Complete run_gwas_gsea() first.")
else:
    print("Complete previous exercises first.")
    gsea_res = None

In [None]:
# This function is provided - it uses gseapy's built-in plotting

def plot_gsea_enrichment(pre_res):
    """
    Visualize the GSEA enrichment curve.
    """
    # Get the term name
    top_term = pre_res.res2d.sort_values('NES', ascending=False)['Term'].values[0]
    
    # Use gseapy's built-in plotting
    gp.plot.gseaplot(
        rank_metric=pre_res.ranking,
        term=top_term,
        ofname=None,
        **pre_res.results[top_term]
    )

# Plot GSEA enrichment curve
if gsea_res is not None:
    print("\nGSEA Enrichment Plot:")
    plot_gsea_enrichment(gsea_res)
else:
    print("Complete run_gwas_gsea() first.")

---

## Exercise 6: Interpret and Summarize GSEA Results

**Background:**

GSEA provides several key metrics for interpretation:

| Metric | Description | Interpretation |
|--------|-------------|----------------|
| **ES** | Enrichment Score | Maximum deviation of running sum from zero |
| **NES** | Normalized ES | ES divided by the mean ES of the permutation (null) distribution, so scores are comparable across gene sets of different sizes |
| **p-value** | Nominal p-value | Statistical significance (uncorrected) |
| **FDR** | False Discovery Rate | Multiple testing corrected significance |
| **Leading Edge** | Core enrichment genes | Genes contributing most to enrichment |

**Leading Edge Genes:**
These are the gene set members that appear in the ranked list at or before the point of maximum enrichment when the ES is positive, or at or after that point when the ES is negative. They represent the "core" of the enrichment signal and are often prioritized for follow-up.
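
The leading edge can be read directly off the running sum. The sketch below is a minimal illustration with an unweighted running sum and invented gene names; `gseapy` reports the leading edge for you (and uses a weighted score by default), so this hypothetical `leading_edge` helper is for intuition only.

```python
def leading_edge(ranked_genes: list, gene_set: set) -> list:
    """Gene-set members up to the running-sum peak (or from the trough on)."""
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    hit_step = 1.0 / n_hits
    miss_step = 1.0 / (len(ranked_genes) - n_hits)

    running, current = [], 0.0
    for gene in ranked_genes:
        current += hit_step if gene in gene_set else -miss_step
        running.append(current)

    # Position of the largest absolute deviation from zero
    peak = max(range(len(running)), key=lambda i: abs(running[i]))
    if running[peak] >= 0:
        window = ranked_genes[:peak + 1]   # positive ES: top of the list to the peak
    else:
        window = ranked_genes[peak:]       # negative ES: trough to the end of the list
    return [g for g in window if g in gene_set]


ranked = [f"G{i}" for i in range(1, 11)]   # G1 = highest score, G10 = lowest

up_edge = leading_edge(ranked, {"G1", "G2", "G4", "G9"})
down_edge = leading_edge(ranked, {"G2", "G8", "G9", "G10"})

print("Leading edge (positive ES):", up_edge)
print("Leading edge (negative ES):", down_edge)
```

In both toy sets one member falls outside the leading edge (`G9` and `G2` respectively): it belongs to the gene set but sits on the wrong side of the peak, so it does not contribute to the enrichment signal.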

**Goals:**
1. Extract and display key GSEA metrics
2. Identify leading edge genes
3. Interpret the biological significance

---

### TODO:

1. Complete the `summarize_gsea_results()` function:
   - Extract NES, p-value, FDR, and leading edge genes from results
   - Return a summary DataFrame

In [None]:
def summarize_gsea_results(pre_res) -> pd.DataFrame:
    """
    Summarize GSEA results for interpretation.
    
    Parameters:
    -----------
    pre_res : gseapy result object
        Output of run_gwas_gsea()
    
    Returns:
    --------
    pd.DataFrame with:
        - NES: Normalized Enrichment Score
        - p-value: Nominal significance
        - FDR: False Discovery Rate
        - Leading edge genes: Core enrichment genes
    """
    res = pre_res.res2d.copy()
    
    # TODO: Select and rename relevant columns
    # The gseapy result includes the columns: NES, NOM p-val, FDR q-val, FWER p-val, Lead_genes
    # "NOM p-val" is the nominal p-value; "FWER p-val" is a family-wise corrected value
    # summary = res[["NES", "NOM p-val", "FDR q-val", "Lead_genes"]].copy()
    # summary.columns = ["NES", "p-value", "FDR", "Leading Edge Genes"]
    
    # Placeholder
    summary = pd.DataFrame()
    
    return summary

In [None]:
# Exercise 6: Summarize GSEA results
print("=" * 60)
print("Exercise 6: Interpret and Summarize GSEA Results")
print("=" * 60)

if gsea_res is not None:
    summary = summarize_gsea_results(gsea_res)
    
    if len(summary) > 0:
        print("\n" + "=" * 50)
        print("GSEA RESULTS SUMMARY")
        print("=" * 50)

        # Extract values
        nes = summary["NES"].iloc[0]
        pval = summary["p-value"].iloc[0]
        fdr = summary["FDR"].iloc[0]
        lead_genes = summary["Leading Edge Genes"].iloc[0]

        print(f"\nNormalized Enrichment Score (NES): {nes:.3f}")
        print(f"Nominal p-value: {pval:.4f}")
        print(f"FDR q-value: {fdr:.4f}")

        # Interpret the result
        print("\n" + "-" * 50)
        print("INTERPRETATION:")
        print("-" * 50)

        direction = "UPREGULATED" if nes > 0 else "DOWNREGULATED"
        print(f"• GWAS genes tend to be {direction} in disease")

        if pval < 0.05:
            print(f"• The enrichment is statistically significant (p < 0.05)")
        else:
            print(f"• The enrichment is NOT statistically significant (p >= 0.05)")

        # Count leading edge genes (gseapy joins them with ";", not ",")
        lead_gene_list = lead_genes.split(";") if lead_genes else []
        print(f"\nLeading edge genes: {len(lead_gene_list)} genes")
    else:
        print("Complete summarize_gsea_results() first.")
else:
    print("Complete previous exercises first.")
    summary = pd.DataFrame()

---

## Exercise 7: Leading Edge Analysis & Protein Interaction Network

**Background:**

Leading edge genes represent the "core" of the enrichment signal. To understand how these genes work together, we can analyze their protein-protein interactions (PPIs). Protein interaction networks reveal:

- **Hub genes**: Highly connected proteins (potential key regulators)
- **Functional modules**: Clusters of interacting proteins (pathways)
- **Potential drug targets**: Well-connected disease genes

**Data Sources:**
- STRING database: Curated and predicted protein interactions
- BioGRID: Experimentally validated interactions

**Goals:**
1. Extract leading edge genes from GSEA
2. Build a protein interaction network
3. Visualize the network with GWAS genes highlighted
4. Identify hub genes and modules

**Key Concepts:**
- Protein interaction networks
- Network topology (degree, centrality)
- Interactive network visualization
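
Degree and degree centrality, the measures behind "hub genes", can be computed by hand on a tiny edge list. The sketch below uses only the standard library and invented protein names and scores; the normalization (degree divided by n - 1) is the same definition NetworkX uses for `nx.degree_centrality`.

```python
from collections import defaultdict

# Invented edge list: (protein1, protein2, confidence score)
edges = [
    ("P1", "P2", 0.95),
    ("P1", "P3", 0.80),
    ("P1", "P4", 0.72),
    ("P2", "P3", 0.65),   # below the 0.7 threshold, dropped
    ("P4", "P5", 0.90),
]
score_threshold = 0.7

# Undirected adjacency sets built from edges that pass the threshold
neighbors = defaultdict(set)
for a, b, score in edges:
    if score >= score_threshold:
        neighbors[a].add(b)
        neighbors[b].add(a)

n_nodes = len(neighbors)
# Degree centrality: degree divided by the maximum possible degree (n - 1)
centrality = {node: len(nbrs) / (n_nodes - 1) for node, nbrs in neighbors.items()}

hub = max(centrality, key=centrality.get)
for node in sorted(centrality, key=centrality.get, reverse=True):
    print(f"{node}: degree={len(neighbors[node])}, centrality={centrality[node]:.2f}")
print("Hub:", hub)
```

Raising `score_threshold` removes edges and can change which node is the hub, so hub calls should be reported together with the threshold used.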

---

### TODO:

1. Extract leading edge genes from GSEA results
2. Complete the `build_ppi_network()` function:
   - Filter edges by score threshold
   - Only include edges between genes of interest
   - Build a NetworkX graph

In [None]:
# Exercise 7: Extract leading edge genes
print("=" * 60)
print("Exercise 7: Leading Edge & Network Analysis")
print("=" * 60)

# TODO: Extract leading edge genes from GSEA results
# (gseapy stores them as one string with genes separated by ";")
# leading_edge_str = gsea_res.res2d.iloc[0]["Lead_genes"]
# leading_edge = set(gene.strip() for gene in leading_edge_str.split(";")) if leading_edge_str else set()

leading_edge = set()  # Replace with your implementation

print(f"\nLeading edge genes: {len(leading_edge)}")
if len(leading_edge) > 0 and len(gwas_genes) > 0:
    print(f"GWAS genes in leading edge: {len(leading_edge & gwas_genes)}")

In [None]:
# This function is provided - it generates synthetic PPI data for demonstration

def generate_synthetic_ppi_data(genes: Set[str], 
                                 density: float = 0.1) -> pd.DataFrame:
    """
    Generate synthetic PPI data for demonstration.
    In a real analysis, you would use data from STRING or BioGRID.
    """
    np.random.seed(42)
    gene_list = list(genes)
    n = len(gene_list)
    
    edges = []
    for i in range(n):
        for j in range(i+1, n):
            if np.random.random() < density:
                score = np.random.uniform(0.4, 1.0)
                edges.append({
                    "protein1": gene_list[i],
                    "protein2": gene_list[j],
                    "score": score
                })
    
    return pd.DataFrame(edges)


def build_ppi_network(edge_df: pd.DataFrame,
                      genes_of_interest: Set[str],
                      score_threshold: float = 0.7) -> nx.Graph:
    """
    Construct a protein interaction network for genes of interest.
    
    Parameters:
    -----------
    edge_df : pd.DataFrame
        PPI edges with columns: protein1, protein2, score
    genes_of_interest : Set[str]
        Genes to include in the network
    score_threshold : float
        Minimum interaction confidence score (0-1)
    
    Returns:
    --------
    networkx.Graph
        Protein interaction network
    """
    G = nx.Graph()
    
    # TODO: Add edges that pass threshold and involve genes of interest
    # for _, row in edge_df.iterrows():
    #     if row["score"] >= score_threshold:
    #         if row["protein1"] in genes_of_interest and row["protein2"] in genes_of_interest:
    #             G.add_edge(row["protein1"], row["protein2"], weight=row["score"])
    
    return G

In [None]:
# Build PPI network
print("\nBuilding PPI network...")

# Use leading edge genes for the network (fall back to GWAS genes if the leading edge has 5 or fewer genes)
network_genes = leading_edge if len(leading_edge) > 5 else gwas_genes

if len(network_genes) > 0:
    # Generate synthetic PPI data
    ppi_edges = generate_synthetic_ppi_data(network_genes, density=0.15)
    print(f"  Total potential interactions: {len(ppi_edges)}")

    # Build filtered network
    G = build_ppi_network(ppi_edges, network_genes, score_threshold=0.7)

    print("\nNetwork Statistics:")
    print(f"  Nodes (proteins): {G.number_of_nodes()}")
    print(f"  Edges (interactions): {G.number_of_edges()}")

    if G.number_of_nodes() > 0:
        print(f"  Network density: {nx.density(G):.4f}")
        
        # Connected components
        components = list(nx.connected_components(G))
        print(f"  Connected components: {len(components)}")
        print(f"  Largest component size: {len(max(components, key=len))}")
else:
    print("Complete previous exercises first.")
    G = nx.Graph()

In [None]:
# Analyze network topology
if G.number_of_nodes() > 0:
    print("\n" + "=" * 50)
    print("NETWORK TOPOLOGY ANALYSIS")
    print("=" * 50)
    
    # Calculate degree centrality
    degree_cent = nx.degree_centrality(G)
    
    # Sort by centrality
    sorted_genes = sorted(degree_cent.items(), key=lambda x: x[1], reverse=True)
    
    print("\nTop 10 Hub Genes (by degree centrality):")
    print("-" * 40)
    for i, (gene, cent) in enumerate(sorted_genes[:10], 1):
        degree = G.degree(gene)
        gwas_marker = "*" if gene in gwas_genes else " "
        print(f"  {i:2d}. {gwas_marker} {gene:15s} | Degree: {degree:3d} | Centrality: {cent:.3f}")
    
    print("\n  * = GWAS-associated gene")
else:
    print("Complete build_ppi_network() first.")

In [None]:
# Static network visualization with matplotlib
if 0 < G.number_of_nodes() < 100:
    plt.figure(figsize=(12, 10))
    
    # Create layout
    pos = nx.spring_layout(G, k=2, iterations=50, seed=42)
    
    # Node colors: red for GWAS genes, blue for others
    node_colors = ['red' if node in gwas_genes else 'lightblue' for node in G.nodes()]
    
    # Node sizes based on degree
    degrees = dict(G.degree())
    node_sizes = [300 + 100 * degrees[node] for node in G.nodes()]
    
    # Draw network
    nx.draw_networkx_edges(G, pos, alpha=0.3, width=1)
    nx.draw_networkx_nodes(G, pos, node_color=node_colors, node_size=node_sizes, alpha=0.8)
    nx.draw_networkx_labels(G, pos, font_size=8)
    
    plt.title("Protein Interaction Network of Leading Edge Genes\n(Red = GWAS genes, Size = Degree)", fontsize=14)
    plt.axis('off')
    plt.tight_layout()
    plt.show()
else:
    print("Network too large, empty, or not yet built.")

In [None]:
# This function is provided - it creates an interactive network visualization

def visualize_network_pyvis(G: nx.Graph, 
                            highlight_genes: Set[str] = None,
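                            filename: str = "ppi_network.html"):
    """
    Create an interactive network visualization using PyVis.
    Nodes in highlight_genes are drawn in red; node size scales with degree.
    """
    # Treat a missing highlight set as empty so the membership tests below are safe
    highlight_genes = highlight_genes or set()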
    if G.number_of_nodes() == 0:
        print("Empty network - nothing to visualize")
        return
    
    # Create PyVis network
    net = Network(
        height="600px",
        width="100%",
        bgcolor="#ffffff",
        font_color="#333333"
    )
    
    # Add nodes with styling
    degrees = dict(G.degree())
    for node in G.nodes():
        color = "#e74c3c" if highlight_genes and node in highlight_genes else "#3498db"
        size = 15 + 3 * degrees[node]
        net.add_node(
            node, 
            label=node, 
            color=color, 
            size=size,
            title=f"{node}\nDegree: {degrees[node]}\nGWAS: {'Yes' if node in highlight_genes else 'No'}"
        )
    
    # Add edges
    for u, v, d in G.edges(data=True):
        weight = d.get("weight", 0.5)
        net.add_edge(u, v, value=weight, title=f"Score: {weight:.2f}")
    
    # Configure physics
    net.set_options("""
    var options = {
        "physics": {
            "forceAtlas2Based": {
                "gravitationalConstant": -50,
                "centralGravity": 0.01,
                "springLength": 100,
                "springConstant": 0.08
            },
            "solver": "forceAtlas2Based"
        }
    }
    """)
    
    net.save_graph(filename)
    print(f"Interactive network saved to: {filename}")
    print("  Open in a web browser to explore the network")

# Create interactive network visualization
if G.number_of_nodes() > 0:
    visualize_network_pyvis(G, highlight_genes=gwas_genes)
else:
    print("Complete build_ppi_network() first.")

---

## Summary & Conclusions

In this laboratory, you performed an integrated analysis of **Crohn's disease** using multiple data types:

### Key Results:

1. **GWAS Analysis** (Exercises 1-2)
   - Retrieved genome-wide significant associations from the GWAS Catalog
   - Extracted candidate genes implicated by genetic variants

2. **RNA-seq Analysis** (Exercises 3-4)
   - Visualized differential expression using volcano plots
   - Created ranked gene lists for enrichment analysis

3. **GSEA Integration** (Exercises 5-6)
   - Tested whether GWAS genes show altered expression in disease
   - Identified leading edge genes contributing to enrichment

4. **Network Analysis** (Exercise 7)
   - Built protein interaction networks from key genes
   - Identified hub genes by degree centrality (on synthetic PPI data here; with real STRING or BioGRID interactions, hubs are candidate therapeutic targets)

### Biological Insights:

This type of integrative analysis helps:
- **Prioritize candidate genes** from GWAS for functional validation
- **Identify disease mechanisms** through pathway enrichment
- **Discover potential drug targets** via network analysis

### Next Steps:

1. Validate key genes experimentally (knockdown, CRISPR)
2. Perform pathway analysis (KEGG, Reactome)
3. Integrate additional data types (proteomics, metabolomics)
4. Query drug databases for existing compounds targeting hub genes

In [None]:
# Final summary table
print("\n" + "=" * 60)
print("LABORATORY SUMMARY")
print("=" * 60)

# Collect results (using defaults if exercises not completed)
summary_data = {
    "Metric": [
        "Disease/Trait",
        "GWAS Significant SNPs",
        "GWAS Candidate Genes", 
        "RNA-seq Genes Analyzed",
        "Significant DE Genes (p<0.05)",
        "GSEA NES",
        "GSEA p-value",
        "Leading Edge Genes",
        "PPI Network Nodes",
        "PPI Network Edges"
    ],
    "Value": [
        "Crohn's Disease",
        len(gwas_df) if 'gwas_df' in dir() else "N/A",
        len(gwas_genes) if 'gwas_genes' in dir() else "N/A",
        len(de_df) if 'de_df' in dir() else "N/A",
        (de_df['pvalue'] < 0.05).sum() if 'de_df' in dir() else "N/A",
        f"{summary['NES'].iloc[0]:.3f}" if 'summary' in dir() and len(summary) > 0 else "N/A",
        f"{summary['p-value'].iloc[0]:.4f}" if 'summary' in dir() and len(summary) > 0 else "N/A",
        len(leading_edge) if 'leading_edge' in dir() else "N/A",
        G.number_of_nodes() if 'G' in dir() else "N/A",
        G.number_of_edges() if 'G' in dir() else "N/A"
    ]
}

summary_table = pd.DataFrame(summary_data)
print(summary_table.to_string(index=False))