# GSEA Tutorial

JAYZ (The University OF Myself)

# 🍔GSEA Tutorial🤗

------------------------------------------------------------------------

## 🤪Part 1 : The Core Concepts of GSEA🤠

------------------------------------------------------------------------

### Lesson 1: The Key Question - Moving Beyond Arbitrary Cutoffs🥶

**Goal:** To understand the limitations of standard “cutoff-based” enrichment analysis and to grasp the fundamentally different and more powerful question that GSEA asks.

#### **The “Classic” Way: Over-Representation Analysis (ORA)**

Let’s first revisit the type of enrichment analysis we have already performed in our proteomics and metabolomics projects. This method is formally called **Over-Representation Analysis (ORA)**.

-   **The ORA Workflow:**

    1.  **Start with a universe:** You have a list of all genes/proteins/metabolites detected in your experiment (e.g., 20,000 genes).

    2.  **Apply a strict cutoff:** You apply an arbitrary statistical threshold to create a short list of “significant” genes. For example, you select only the genes where the **adjusted p-value \< 0.05 AND the \|log2 Fold Change\| \> 1**. This might give you a list of 500 “interesting” genes.

    3.  **Ask the ORA question:** You then use the hypergeometric test to ask: “Is the ‘Apoptosis’ pathway, which has 100 member genes in the universe, surprisingly over-represented in my short list of 500 genes?”

-   **The Visual Analogy:** Imagine all your genes are marbles in a giant urn. You pull out a handful of marbles that are “significant” (e.g., the red ones). ORA then asks, “Did I get more ‘Apoptosis’ marbles in my hand than I would expect by random chance?”
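
The ORA question maps onto base R's `phyper()` function. The sketch below reuses the numbers from the workflow (20,000 genes, 100 in the pathway, 500 in the short list). The overlap of 10 genes is an invented value, chosen only to make the example concrete.

``` r
# Minimal ORA sketch with the hypergeometric test (base R only).
universe_size <- 20000  # all detected genes
pathway_size  <- 100    # 'Apoptosis' genes in the universe
list_size     <- 500    # genes that passed the cutoff
overlap       <- 10     # ASSUMED for illustration: pathway genes in the short list

# Overlap expected if the short list were a random draw from the universe
expected_overlap <- list_size * pathway_size / universe_size

# P(X >= overlap): chance of drawing at least `overlap` pathway genes
ora_pvalue <- phyper(
  q = overlap - 1,
  m = pathway_size,
  n = universe_size - pathway_size,
  k = list_size,
  lower.tail = FALSE
)

cat("Expected overlap:", expected_overlap, "\n")
cat("ORA p-value:", signif(ora_pvalue, 3), "\n")
```

Note the `overlap - 1`: with `lower.tail = FALSE`, `phyper()` returns P(X \> q), so subtracting one turns it into “at least this many”.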

#### **The Limitations and Problems of the ORA / Cutoff Method**

This method is simple and useful, but it has two major conceptual flaws:

**1. The Arbitrary Cutoff:**

-   The choice of a p-value cutoff (0.05) or a fold-change cutoff (1.0) is ***completely arbitrary***. ***Is a gene with an adjusted p-value of 0.051 truly biologically meaningless? Is a gene with a log2FC of 0.99 truly uninteresting?***

-   **Information Loss:** By applying this strict cutoff, you are ***throwing away the vast majority of your data***. You are completely ignoring the thousands of genes that showed a weaker but still potentially important change. ***Biology is often about subtle, coordinated shifts, not just blockbuster hits.***

**2. The Sensitivity Problem:**

-   Imagine a biological pathway (a signaling cascade, for example) where every single one of the 20 genes in the pathway is ***upregulated*** by a small but consistent amount (e.g., a log2FC of 0.5 for all of them).

-   Because none of these genes passes the arbitrary \|log2FC\| \> 1 threshold, **none of them will make it into your “significant” list.**

-   As a result, ***the ORA method will be completely blind to this clear and important biological signal***. It will ***report that this pathway is not significant***, even though the whole pathway has shifted: a false negative. The method is not sensitive to small but coordinated changes.
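
To make the sensitivity problem tangible, here is a small simulation. Everything in it is made up for illustration: the background distribution, the gene names, and the 20-gene “cascade” whose members all have a log2FC of exactly 0.5.

``` r
# Sketch: a coordinated but modest shift is invisible to a hard cutoff.
set.seed(1)
n_genes <- 20000
deg <- data.frame(
  gene   = paste0("gene", seq_len(n_genes)),
  log2FC = rnorm(n_genes, mean = 0, sd = 0.3),
  padj   = runif(n_genes)
)

# Hypothetical 20-gene cascade: every member has log2FC = 0.5
cascade <- deg$gene[1:20]
deg$log2FC[1:20] <- 0.5
deg$padj[1:20]   <- 0.01

# The "classic" cutoff from the ORA workflow
significant <- deg$gene[deg$padj < 0.05 & abs(deg$log2FC) > 1]

n_cascade_in_list <- sum(cascade %in% significant)
cat("Cascade genes in the significant list:", n_cascade_in_list,
    "of", length(cascade), "\n")

# Yet the cascade genes all sit in the upper part of the log2FC ranking
cascade_ranks <- rank(-deg$log2FC)[1:20]
cat("Median rank of cascade genes:", median(cascade_ranks),
    "of", n_genes, "\n")
```

The cutoff keeps none of the cascade genes, although their ranks show they are far from randomly placed. That rank information is exactly what a threshold-free method can use.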

#### **The GSEA Solution: A Fundamentally Different Question**

GSEA was designed specifically to overcome these two problems. It does away with arbitrary cutoffs and uses your entire dataset.

-   **The GSEA Workflow (Conceptual):**

    1.  **Start with the universe:** You have your list of all 20,000 genes.

    2.  **Rank the ENTIRE list:** Instead of creating a short list, *you rank all 20,000 genes from “most upregulated” at the top to “most downregulated” at the bottom*. This ranking is typically based on a metric that combines the fold change and the p-value (like the t-statistic).

    3.  **Ask the GSEA question:** Now, for the ‘Apoptosis’ pathway, GSEA asks a much more elegant question: **“Are the 100 genes belonging to the ‘Apoptosis’ pathway randomly distributed throughout my entire ranked list of 20,000 genes, or do they show a tendency to accumulate at the top (upregulated) or the bottom (downregulated)?”**

-   **The Visual Analogy:** Imagine your 20,000 genes are runners in a marathon, ranked from first to last place. The ‘Apoptosis’ genes are all wearing blue shirts. GSEA asks: “Are the blue-shirted runners spread randomly throughout the entire pack of 20,000 runners, or are they suspiciously clustered together at the front or at the back of the pack?”

#### **Why GSEA is More Powerful**

-   **It is Threshold-Free:** ***It uses all of your data. No information is thrown away.***

-   **It is More Sensitive:** It can detect those subtle but coordinated changes. In our example of the signaling cascade where all 20 genes had a log2FC of 0.5, ORA would miss it completely. GSEA, however, can detect that all 20 of these genes sit together in the upper part of the ranked list and, if the shift is consistent enough, would report the pathway as significant.

### **Lesson 1: Summary & Status Check**

-   **Conceptually**, we now understand the critical difference between the two main types of enrichment analysis.

    -   **ORA (Over-Representation Analysis):** Asks if a pathway is over-represented in a short, pre-defined list of significant genes. It is simple but suffers from arbitrary cutoffs and loss of information.

    -   **GSEA (Gene Set Enrichment Analysis):** Asks if a pathway’s member genes are non-randomly distributed at the top or bottom of the entire ranked list of all genes. It is threshold-free and more sensitive to subtle, coordinated changes.

------------------------------------------------------------------------

### Lesson 2: The GSEA Algorithm, Step-by-Step🤩

**Goal:** To understand the three logical stages of the GSEA algorithm: Ranking, Calculating the Enrichment Score, and Assessing Significance.

#### **Step 1: Create a Master Ranked List of All Genes**

**The Concept:** The foundation of GSEA is a single, ranked list that represents your entire experiment. This list must capture both the **magnitude** (how much did it change?) and the **significance** (how confident are we in this change?) for every single gene you measured.

**The Method:**

1.  You start with your differential expression results table (from `DESeq2`, `limma`, etc.). This table has a log2 Fold Change (logFC) and a p-value for all 20,000 genes.

2.  You cannot simply rank by fold change, because a gene with a huge fold change but a terrible p-value is not reliable. You cannot simply rank by p-value, because a tiny p-value with a near-zero fold change is not biologically interesting.

3.  Therefore, you must combine these into a single **ranking metric**. Common and effective choices include:

    -   The **t-statistic** from the `limma` output.

    -   The stat column from the `DESeq2` output.

    -   A manually calculated metric like: ***`sign(logFC) * -log10(pvalue)`***.

4.  You then rank all 20,000 genes in ***descending order based on this metric***. The result is a single list where the gene with the highest positive metric (most strongly upregulated) is at position #1, and the gene with the most negative metric (most strongly downregulated) is at position #20,000.
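
The four steps above can be sketched in a few lines of base R. The table below is a toy: the gene names and numbers are invented, and the metric is the `sign(logFC) * -log10(pvalue)` option from the list.

``` r
# Sketch: turn a differential expression table into a ranked, named vector.
# The four genes and their values are invented for illustration.
de_table <- data.frame(
  gene   = c("geneA", "geneB", "geneC", "geneD"),
  logFC  = c(2.0, -1.5, 0.3, -0.2),
  pvalue = c(1e-6, 1e-4, 0.5, 0.01)
)

# Signed significance: direction from logFC, strength from the p-value
de_table$metric <- sign(de_table$logFC) * -log10(de_table$pvalue)

# Named vector, sorted from most upregulated to most downregulated
ranked <- setNames(de_table$metric, de_table$gene)
ranked <- sort(ranked, decreasing = TRUE)
print(ranked)
```

Notice that `geneD` has a small fold change but a good p-value, so it ranks below `geneC` yet above `geneB`: the metric orders genes by direction first and confidence second.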

#### **Step 2: Calculate the Enrichment Score (ES)**

**The Concept:** This is the heart of the algorithm. For a single gene set (e.g., the “Hallmark Apoptosis” set, which contains 161 genes), we want to calculate a score that tells us if its members are clustered at the top or bottom of our master ranked list. GSEA does this using an elegant “random walk” method.

**The “Random Walk” Analogy:**  
Imagine you are walking from the #1 ranked gene down to the #20,000th gene. You have a pencil and are drawing a graph.

-   You start your pencil at **zero**.

-   Every time you pass a gene that is **IN** your gene set (a “hit”), you take a big step **UP**. The size of the step is proportional to the absolute value of the gene’s ranking metric (so hits at the extremes of the list give bigger steps up than hits in the middle).

-   Every time you pass a gene that is **NOT** in your gene set (a “miss”), you take a small step **DOWN**.

**The Calculation:**

1.  The algorithm walks down the ranked list from gene 1 to 20,000.

2.  It keeps a running-sum statistic. When it encounters a gene belonging to the “Apoptosis” set, it increases the running sum. When it encounters a gene not in the set, it decreases the running sum.

3.  This creates a “mountain range” plot. If the apoptosis genes are randomly distributed, the line will just jitter randomly around zero.

4.  **However**, if the apoptosis genes are clustered at the top of the list, you will get a series of big steps UP at the beginning, causing the running sum to climb rapidly to a large positive value before it starts to drift back down.

5.  If the apoptosis genes are clustered at the bottom, you will get a series of small steps DOWN for a long time, followed by a series of big steps UP at the very end. The running sum therefore drifts to a large negative value before the final hits pull it back toward zero.

**The Enrichment Score (ES):** The ES is defined as the **maximum deviation of the running sum from zero**.

-   A large **positive ES** means the gene set is enriched ***at the top of the list*** (associated with the ***“upregulated” phenotype)***.

-   A large **negative ES** means the gene set is enriched ***at the bottom of the list*** (associated with the ***“downregulated” phenotype***).

-   An ES close to **zero** means the ***gene set is not enriched*** (its members are ***scattered randomly***).

(This is the classic GSEA plot we will learn to make. ***The green line is the “random walk”. The peak of that line is the ES***.)
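
The random walk is short enough to write out by hand. The function below is a teaching sketch in base R, not the code that any GSEA package actually runs; `enrichment_score()` and the six-gene `toy_stats` vector are hypothetical names and values introduced only for this example.

``` r
# Sketch of the weighted running sum behind the Enrichment Score (base R).
# `stats` is a named numeric vector sorted in decreasing order;
# `gene_set` is a character vector of gene names. Both are toy inputs.
enrichment_score <- function(stats, gene_set, weight = 1) {
  is_hit <- names(stats) %in% gene_set
  n_miss <- sum(!is_hit)

  # Hits step up in proportion to |metric|^weight; misses step down equally.
  hit_steps <- ifelse(is_hit, abs(stats)^weight, 0)
  step_up   <- hit_steps / sum(hit_steps)
  step_down <- ifelse(is_hit, 0, 1 / n_miss)

  running_sum <- cumsum(step_up - step_down)
  # ES = the running-sum value that lies farthest from zero
  es <- running_sum[which.max(abs(running_sum))]
  list(es = unname(es), running_sum = unname(running_sum))
}

toy_stats <- c(g1 = 5, g2 = 4, g3 = 3, g4 = -1, g5 = -2, g6 = -6)

top_set    <- enrichment_score(toy_stats, c("g1", "g2"))
bottom_set <- enrichment_score(toy_stats, c("g5", "g6"))

cat("ES for a set at the top:   ", top_set$es, "\n")
cat("ES for a set at the bottom:", bottom_set$es, "\n")
```

Because the up-steps and down-steps are each scaled to sum to one, the walk always returns to zero at the end of the list. You can draw the “mountain range” yourself with `plot(top_set$running_sum, type = "l")`.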

#### **Step 3: Assess the Statistical Significance**

**The Concept:** We have an ES for “Apoptosis” (e.g., 0.72). But is that score impressive? Could we have gotten a score that high just by random chance with a random set of 161 genes? We need to calculate a ***p-value.***

**The Method: Permutation Testing**  
GSEA establishes significance in a very clever and robust way: **it creates its own null distribution by shuffling the data.**

1.  The algorithm takes our master ranked list of 20,000 genes.

2.  It then **randomly shuffles the gene labels**. Now the ranking metrics are ***associated with the wrong genes***. This creates a “random” ranked list.

3.  It ***re-calculates*** the Enrichment Score for “Apoptosis” using this shuffled list. It records this new, random ES.

4.  It repeats this shuffling process **thousands of times** (e.g., 10,000 times), generating a null distribution of 10,000 random Enrichment Scores. This distribution shows what a typical ES looks like for the “Apoptosis” set when there is no real biological signal.

5.  **The p-value** is then calculated as the fraction of random ES scores from the null distribution that were equal to or more extreme than the actual ES we observed from our real data.

6.  Finally, because we are testing thousands of gene sets at once, it calculates a **False Discovery Rate (FDR)** or adjusted p-value to correct for multiple testing. ***This FDR is the most important value for determining significance.***

**Normalization (NES):** The raw ES is dependent on the size of the gene set. To compare enrichment between a small set and a large set, the score is normalized. This **Normalized Enrichment Score (NES)** is what is typically used for ranking and comparing significant pathways.
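
The shuffle-and-recompute loop, together with the normalization, can be sketched in base R on simulated data. Several details here are assumptions rather than something stated above: the simulated 20-gene set, comparing only against null scores of the same sign, the `+ 1` in the p-value (a common convention that avoids reporting exactly zero), and computing the NES as the observed ES divided by the mean of those same-sign null scores.

``` r
# Sketch: permutation p-value and NES for one gene set (base R, simulated data).
enrichment_score <- function(stats, gene_set) {
  is_hit <- names(stats) %in% gene_set
  step_up   <- ifelse(is_hit, abs(stats), 0)
  step_up   <- step_up / sum(step_up)
  step_down <- ifelse(is_hit, 0, 1 / sum(!is_hit))
  running_sum <- cumsum(step_up - step_down)
  unname(running_sum[which.max(abs(running_sum))])
}

set.seed(42)
n_genes <- 1000
stats <- rnorm(n_genes)
names(stats) <- paste0("gene", seq_len(n_genes))

# Hypothetical 20-gene set with a built-in upward shift
gene_set <- names(stats)[1:20]
stats[gene_set] <- stats[gene_set] + 2
stats <- sort(stats, decreasing = TRUE)

observed_es <- enrichment_score(stats, gene_set)

# Null distribution: shuffle the gene labels, keep the metric values in place
n_perm <- 1000
null_es <- replicate(n_perm, {
  shuffled <- stats
  names(shuffled) <- sample(names(stats))
  enrichment_score(shuffled, gene_set)
})

# Compare against null scores with the same sign as the observed ES
same_sign <- null_es[sign(null_es) == sign(observed_es)]
p_value <- (sum(abs(same_sign) >= abs(observed_es)) + 1) / (length(same_sign) + 1)

# NES: observed ES relative to the typical same-sign null ES
nes <- observed_es / mean(abs(same_sign))

cat("Observed ES:", round(observed_es, 3), "\n")
cat("Permutation p-value:", signif(p_value, 3), "\n")
cat("NES:", round(nes, 2), "\n")
```

With only 1,000 permutations the smallest p-value this sketch can report is roughly 1 / (number of same-sign null scores + 1), which is why real analyses use many more permutations or a more refined estimator.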

### **Lesson 2: Summary & Status Check**

-   **Conceptually**, we have dissected the GSEA algorithm into its three essential parts. We understand that it is a **Rank-then-Walk-then-Shuffle** process.

    1.  **Rank:** Create a master list of all genes based on a robust metric.

    2.  **Walk:** Calculate an **Enrichment Score (ES)** by walking down the list and seeing if a gene set’s members accumulate at either end.

    3.  **Shuffle:** Determine the significance of the ES by comparing it to a null distribution created by thousands of **permutations** of the gene labels.

------------------------------------------------------------------------

### Lesson 3: The Gene Sets - Your Biological Prior Knowledge🤧

**Goal:** To understand the concept of a gene set and to become familiar with the Molecular Signatures Database (`MSigDB`), the gold-standard resource for curated gene sets.

#### **Concept 1: What is a Gene Set?**

A gene set is simply a **list of genes that share a common biological function, location, or regulation.** It represents a ***piece of pre-existing***, curated biological knowledge. Think of it as a ***pre-defined “team” of genes.***

-   **Example 1: A KEGG Pathway Gene Set.** The “KEGG GLYCOLYSIS” gene set would be a list of all the genes that encode the enzymes involved in the glycolysis pathway.

-   **Example 2: A GO Term Gene Set.** The “GO DNA REPAIR” gene set would be a list of all genes that have been annotated with the “DNA Repair” Gene Ontology term.

-   **Example 3: A Regulatory Gene Set.** A gene set could be a list of all genes that have a binding site for a particular transcription factor (like TP53) in their promoter region.

GSEA is a general method. The biological meaning of its output is entirely dependent on the quality and nature of the gene sets you use as input.

#### **Concept 2: The Gold Standard - The Molecular Signatures Database (MSigDB)**

Manually creating these gene sets would be a monumental task. Thankfully, the Broad Institute (the same group that created GSEA) maintains the **Molecular Signatures Database (MSigDB)**. ***This is a massive, meticulously curated collection of thousands of gene sets that is freely available to the scientific community.***

MSigDB is the canonical resource for GSEA. It is organized into several major collections, and knowing the main ones is crucial for any bioinformatician.

#### **The Key MSigDB Collections**

Here are the most important collections you will encounter. Each one is designed to answer a different type of biological question.

-   **H: Hallmark Gene Sets (The Best Place to Start)**

    -   **What it is:** This is the smallest and most refined collection, consisting of only **50 gene sets**. Each Hallmark set represents a well-defined, core biological process (e.g., “HALLMARK_APOPTOSIS”, “HALLMARK_INFLAMMATORY_RESPONSE”).

    -   **How it was made:** The MSigDB curators used a sophisticated computational approach to distill the thousands of overlapping gene sets from other collections down to their essential, non-redundant core.

    -   **Why you should use it:** It reduces noise and redundancy, making the results much easier to interpret. For most analyses, **starting with the Hallmark collection is the recommended best practice.**

-   **C2: Curated Gene Sets**

    -   **What it is:** This is a massive collection (over 6,000 sets) gathered from various online pathway databases, publications, and knowledge bases.

    -   **Sub-collections:** It is divided into important sub-collections:

        -   **C2:CP:KEGG:** Gene sets from the famous KEGG pathway database.

        -   **C2:CP:Reactome:** Gene sets from the highly detailed Reactome pathway database.

        -   **C2:CP:BioCarta:** Gene sets from the BioCarta pathway database.

    -   **Why you should use it:** When you want to investigate specific, well-known canonical pathways. It’s more detailed than the Hallmark collection but also more redundant (e.g., many KEGG pathways will overlap).

-   **C5: GO Gene Sets**

    -   **What it is:** A very large collection (over 10,000 sets) where each gene set corresponds to a Gene Ontology (GO) term.

    -   **Sub-collections:**

        -   **C5:GO:BP:** For Biological Process.

        -   **C5:GO:MF:** For Molecular Function.

        -   **C5:GO:CC:** For Cellular Component.

    -   **Why you should use it:** When you want to explore biological functions in a more granular and comprehensive way than just looking at pathways. The results can be very detailed but also highly redundant.

-   **C3: Regulatory Target Gene Sets**

    -   **What it is:** Gene sets where all the genes are thought to be regulated by a specific transcription factor or microRNA.

    -   **Why you should use it:** When your primary question is about gene regulation. For example, if your experiment involves knocking out a transcription factor, you would use this collection to see if its known target genes are significantly downregulated.

### **Lesson 3: Summary & Status Check**

-   **Conceptually**, we understand that GSEA’s power comes from leveraging prior biological knowledge in the form of **curated gene sets**.

-   We have been introduced to the **MSigDB database** as the central, authoritative resource for these gene sets.

-   **Crucially**, we now know the major MSigDB collections and have a strategic plan for using them:

    1.  **Always start with the Hallmark (H) collection** for a high-level, easy-to-interpret view of the results.

    2.  If needed, follow up with more detailed collections like **KEGG/Reactome (C2)** or **Gene Ontology (C5)** to explore more specific hypotheses.

We have now completed our tour of the core concepts of GSEA. We understand why it’s better than ORA, how the algorithm works, and what biological knowledge it uses as input.

------------------------------------------------------------------------

## 😥**Part 2: A Practical GSEA Project in R😇**

**Project:** “**Analyzing the Transcriptional Response to Estrogen Treatment in a Breast Cancer Cell Line (MCF7)**.”

**Biological Question:** “Estrogen is a key hormone that drives the growth of certain breast cancers. We want to use GSEA to identify the core biological pathways and hallmark processes that are activated or suppressed in MCF7 cells after estrogen treatment.”

**Our Starting Point:** We will pretend a colleague has already performed the RNA-seq experiment and the differential expression analysis using DESeq2. They have handed us a single CSV file: `estrogen_deg_results.csv`. This file contains the complete results for all ~20,000 detected genes.

------------------------------------------------------------------------

### **Lesson 4: Project Setup and Data Preparation🙂‍↔️**

**Goal:** To set up our R environment, load the differential expression results, and, most importantly, ***create the master ranked list of genes that will be the primary input for our GSEA.***

#### **Chunk 1: Project Setup and Installing Packages**

**Explanation:** First, we’ll create an organized project structure. Then, we will install the two key R packages we need for this entire analysis.

-   **`fgsea`:** An R package (distributed through Bioconductor) for running a fast preranked Gene Set Enrichment Analysis. It’s extremely fast and widely used in the community.

-   **`msigdbr`:** A brilliant helper package that allows us to download and format gene sets directly from the MSigDB database inside R, saving us from manual downloads.

**Action:**

1.  On your computer, create a new project folder: `Project_GSEA_Estrogen`.

2.  Inside, create the sub-folders: `data`, `scripts`, and `figures`.

3.  In RStudio, create a new R Project in this main folder.

4.  Create a new R script and save it in the scripts folder as `01_gsea_analysis.R`.

5.  In the R console, install the necessary packages:

    ``` r
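    install.packages("BiocManager")  # installer for Bioconductor packages
    BiocManager::install("fgsea")    # fgsea is distributed via Bioconductor, not CRAN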
    install.packages("msigdbr")
    install.packages("tidyverse") # For data manipulation and plotting
    ```

#### **Chunk 2: Loading the Differential Expression Data**

**Explanation:** We need to load our colleague’s results file into R. For this lesson, since we don’t have a real file, I will provide code that creates a realistic, sample data frame. In a real project, you would simply use `read.csv()` to load your file. We will then inspect the data to understand its structure.

**Action:**  
Add the following code to your `01_gsea_analysis.R` script.

``` r
# --------------------------------------------------------------------------
# Script: 01_gsea_analysis.R
# Project: GSEA of Estrogen Response in MCF7 Cells
# --------------------------------------------------------------------------

# Load the libraries
library(tidyverse)
library(fgsea)
library(msigdbr)

# --- 1. Load and Prepare the Data ---

# In a real project, you would load your data like this:
# deg_results <- read.csv("data/estrogen_deg_results.csv")

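# For this tutorial, we will CREATE a sample data frame that
# looks just like a real differential expression results file.
# We borrow real gene symbols from the Hallmark collection, because
# placeholder names such as "GENE1" would match no MSigDB gene set later on.
set.seed(42) # for reproducibility
hallmark_df <- msigdbr(species = "Homo sapiens", category = "H")
er_genes <- unique(hallmark_df$gene_symbol[
  hallmark_df$gs_name == "HALLMARK_ESTROGEN_RESPONSE_EARLY"
])
other_genes <- setdiff(unique(hallmark_df$gene_symbol), er_genes)

deg_results <- data.frame(
  gene_symbol = c(sample(er_genes, 50), sample(other_genes, 950)),
  log2FoldChange = rnorm(1000, 0, 1.5),
  pvalue = runif(1000, 0, 1)
)
# Let's make the 50 estrogen-response genes look upregulated
deg_results$log2FoldChange[1:50] <- rnorm(50, 2, 0.5)
deg_results$pvalue[1:50] <- runif(50, 0, 0.01)
# Adjust all p-values together, as a real pipeline would
deg_results$padj <- p.adjust(deg_results$pvalue, method = "BH")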

# Inspect the loaded data
head(deg_results)
```

-   **Run this code.** You now have a data frame called `deg_results`. Look at the first few rows. You’ll see it has the essential columns: a gene identifier (`gene_symbol`), a `log2FoldChange`, a raw p-value (`pvalue`), and an adjusted p-value (`padj`).

#### **Chunk 3: Creating the Ranked Gene List**

**Explanation:** This is the most critical preparatory step in the entire analysis. As we learned in Lesson 2, GSEA does not take the whole data frame as input. It requires a single, **named, ranked vector**.

-   **The values** in the vector will be our ***ranking metric***.

-   **The names** of the vector elements will be the corresponding gene symbols.

We will create this ranked list, handle any potential issues like missing values or duplicate gene names, and then sort it in descending order.

**Action:**  
Add this chunk to your script.

``` r
# --- 2. Create the Ranked Gene List for GSEA ---

# First, let's create our ranking metric. A good choice is sign(logFC) * -log10(pvalue)
# This captures both magnitude/direction and significance.
deg_results$rank_metric <- sign(deg_results$log2FoldChange) * -log10(deg_results$pvalue)

# Next, we create the named vector.
# 1. Filter out any genes with NA values in our metric.
# 2. Handle duplicate gene symbols. Some genes might be measured by multiple probes.
#    We will keep the one with the highest absolute rank metric
#    (with_ties = FALSE guarantees exactly one row per gene).
ranked_genes <- deg_results %>%
  filter(!is.na(rank_metric)) %>%
  group_by(gene_symbol) %>%
  slice_max(order_by = abs(rank_metric), n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # Now, pull the metric and gene symbols into the named vector format
  pull(rank_metric, name = gene_symbol)

# Finally, sort the vector in descending order. This is required by fgsea.
ranked_genes <- sort(ranked_genes, decreasing = TRUE)

# Let's inspect our final ranked list. This is the direct input for GSEA.
cat("--- Our final ranked gene list ---\n")
head(ranked_genes)
cat("\n... (and at the other end) ...\n")
tail(ranked_genes)
cat("\nTotal number of ranked genes:", length(ranked_genes), "\n")
```

-   **Run this code.**

-   **Verification:** Inspect the output in your console. You should see a list of numbers (the ranking metric) with gene symbols as their names. The `head()` output will show the genes with the largest positive metric (our most upregulated genes), and the `tail()` output will show the genes with the most negative metric (our most downregulated genes). You have successfully created the exact data structure that `fgsea` needs.
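
Eyeballing `head()` and `tail()` is a good start, but a few automated checks catch problems that are easy to miss. The helper below is a hypothetical convenience function written for this tutorial (it is not part of `fgsea`), shown on two toy vectors. With the tutorial objects you would call it on `ranked_genes`.

``` r
# Sketch: sanity checks for a ranked gene vector before running GSEA.
# `check_ranked_genes()` is a hypothetical helper, not part of fgsea.
check_ranked_genes <- function(ranked) {
  problems <- character(0)
  if (is.null(names(ranked)))
    problems <- c(problems, "no gene names")
  if (anyNA(ranked))
    problems <- c(problems, "NA values")
  if (any(is.infinite(ranked)))
    problems <- c(problems, "infinite values (p-value of 0?)")
  if (anyDuplicated(names(ranked)) > 0)
    problems <- c(problems, "duplicate gene names")
  if (isTRUE(is.unsorted(rev(ranked), na.rm = TRUE)))
    problems <- c(problems, "not sorted in decreasing order")
  problems
}

# Toy inputs with invented values
good <- c(geneA = 4.2, geneB = 1.3, geneC = -0.5, geneD = -2.7)
bad  <- c(geneA = 1.0, geneB = Inf, geneA = 3.0)

good_problems <- check_ranked_genes(good)
bad_problems  <- check_ranked_genes(bad)

print(good_problems)
print(bad_problems)
```

The infinite-value check matters for the `-log10(pvalue)` metric in particular: a p-value reported as exactly 0 turns into `Inf`, which then dominates the ranking.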

### **Lesson 4: Summary & Status Check**

-   **Conceptually**, we have put our theoretical knowledge into practice. We understand that the goal of this stage is not just to load data, but to distill it into a master ranked list that captures the full result of our differential expression experiment.

-   **Practically**, we have set up our project, installed the necessary tools, loaded our data, and, most importantly, have written the code to convert a standard DEG results table into the specific **named, sorted vector** format required for GSEA.

------------------------------------------------------------------------

### **Lesson 5: Acquiring Gene Sets🙁**

**Goal:** To use the `msigdbr` R package to download the Hallmark gene sets for Homo sapiens and format them into the specific list structure that the `fgsea` package requires.

#### **Chunk 1: Understanding the `msigdbr` Package**

**Explanation:** In the past, scientists had to go to the Broad Institute’s website, manually download a `.gmt` file, and then write code to parse that file. The `msigdbr` package makes this entire process obsolete and much more reproducible.

-   Its main function, `msigdbr()`, returns MSigDB gene sets from data distributed with the package (or its companion data package), so no manual download from the website is needed.

-   You can specify the species you want (e.g., “Homo sapiens”, “Mus musculus”).

-   You can specify the category you are interested in (e.g., “H” for Hallmark, “C2” for Curated, “C5” for GO).

-   The function returns a tidy data frame, which is much easier to work with than the old `.gmt` files.

#### **Chunk 2: Downloading the Hallmark Gene Sets**

**Explanation:** As we discussed in Lesson 3, the Hallmark collection is the best place to start. It provides a high-level, non-redundant view of the most important biological processes. We will use the `msigdbr` function to fetch this collection for humans.

**Action:**  
Add this chunk to your `01_gsea_analysis.R` script.

``` r
# --- 3. Acquire the Gene Sets from MSigDB ---

# We will use the msigdbr package to get the Hallmark gene sets for humans.
# Note: newer msigdbr releases rename the `category` argument to `collection`;
# if you see a deprecation warning, use collection = "H" instead.

# Let's pull the data and inspect it
hallmark_sets_df <- msigdbr(species = "Homo sapiens", category = "H")

# Inspect the resulting data frame
head(hallmark_sets_df)
```

-   **Run this code.**

-   **Verification:** Look at the `hallmark_sets_df` data frame. You will see it has a very simple and useful structure. ***Each row represents one gene belonging to one gene set. The key columns are:***

    -   `gs_name`: The name of the gene set (e.g., “HALLMARK_APOPTOSIS”).

    -   `gene_symbol`: The gene that is a member of that set.

    This tidy format is a great starting point.

#### **Chunk 3: Formatting the Gene Sets for `fgsea`**

**Explanation:** The `fgsea` package is very efficient, and to achieve this speed, it requires the gene sets to be in a specific format: a **named list**.

-   Each **element** **of the list should be a gene set.**

-   The **name** ***of each list element should be the name of the gene set (e.g., “HALLMARK_APOPTOSIS”).***

-   The **content** ***of each list element should be a simple character vector of all the gene symbols belonging to that set.***

We need to convert the data frame we just downloaded from `msigdbr` into this list format. While this can be done in several ways in R (for example with `lapply()` or a loop), the base R `split()` function is often the most concise.

**Action:**  
Add this final chunk for the lesson to your script.

``` r
# The fgsea function requires the gene sets to be in a named list format.
# We will convert the data frame from msigdbr into this format.

# The 'split()' function is perfect for this. It splits the 'gene_symbol' column
# into a list, based on the values in the 'gs_name' column.
hallmark_sets_list <- split(x = hallmark_sets_df$gene_symbol, f = hallmark_sets_df$gs_name)

# Let's inspect our final gene set list to verify the format.
cat("--- Our final gene set list for fgsea ---\n")
# Look at the first two pathways in the list
str(head(hallmark_sets_list, 2))
```

-   **Run this code.**

-   **Verification:** The `str()` command is our proof. The output in the console will show the structure of our new object, `hallmark_sets_list`. You should see that it is a “List of 50” (because there are 50 Hallmark sets). When you expand the first element, you will see something like:

        $ HALLMARK_ADIPOGENESIS: chr [1:200] "ACSL1" "SCD" "FASN" ...

    This confirms we have the exact format required: a named list where each element is a vector of gene symbols.
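
Having the right format is necessary but not sufficient: the gene identifiers in the sets must also match the names of the ranked vector. A quick overlap count shows this before you run anything. The sketch below uses toy inputs with invented names; with the tutorial objects you would substitute `hallmark_sets_list` and `ranked_genes`.

``` r
# Sketch: how many genes of each set are actually present in the ranked list?
# Toy inputs with invented gene names and values.
toy_sets <- list(
  SET_A = c("geneA", "geneB", "geneC"),
  SET_B = c("geneC", "geneX", "geneY"),
  SET_C = c("geneP", "geneQ")
)
toy_ranked <- c(geneA = 3.2, geneC = 1.1, geneD = 0.4, geneB = -0.8, geneX = -2.5)

# Count, for every set, the members that appear among the ranked gene names
n_matched <- vapply(
  toy_sets,
  function(genes) sum(unique(genes) %in% names(toy_ranked)),
  integer(1)
)

overlap <- data.frame(
  gene_set  = names(toy_sets),
  set_size  = unname(lengths(toy_sets)),
  n_matched = unname(n_matched)
)
overlap$fraction_matched <- overlap$n_matched / overlap$set_size
print(overlap)
```

A set with no matched genes (like `SET_C` here) cannot be tested at all. If most of your sets look like that, the usual culprit is an identifier mismatch, for example Ensembl IDs in the ranked list versus gene symbols in the sets.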

### **Lesson 5: Summary & Status Check**

-   **Conceptually**, we understand that we need a source of high-quality, curated gene sets to power our GSEA, and that MSigDB is the standard for this.

-   **Practically**, we have used the modern and reproducible `msigdbr` package to programmatically fetch the Hallmark gene sets directly into R. We have also successfully transformed this data into the specific **named list format** required by our analysis tool, `fgsea`.

We have now prepared both of our key inputs:

1.  `ranked_genes`: Our data-driven ranked list from the experiment.

2.  `hallmark_sets_list`: Our knowledge-driven list of biological pathways.

The stage is perfectly set. We are ready to bring these two inputs together and run the GSEA.

------------------------------------------------------------------------

### **Lesson 6: Running the GSEA and Interpreting the Results Table🤩**

**Goal:** To use the `fgsea` package to run the Gene Set Enrichment Analysis, to understand the structure of the results table it produces, and to identify the statistically significant pathways.

#### **Chunk 1: Running the `fgsea` Function**

**Explanation:** The `fgsea` package is beautifully designed. The main function, `fgsea()`, is simple to use because we’ve already done the hard work of formatting our inputs correctly. It takes our list of pathways (`hallmark_sets_list`), our ranked gene list (`ranked_genes`), and a few other parameters. ***One key parameter is `nPermSimple`, which sets how many permutations are used for the preliminary p-value estimate; current versions of `fgsea()` then refine small p-values with a multilevel algorithm (older versions exposed a single `nperm` argument instead). A value of 10,000 is a robust choice for a final analysis.***

**Action:**  
Add this chunk to your `01_gsea_analysis.R` script.

``` r
# --- 4. Run the Gene Set Enrichment Analysis ---

# Set a seed for reproducibility of the random permutations
set.seed(42)

# Run the fgsea algorithm!
fgsea_results <- fgsea(
  pathways = hallmark_sets_list,
  stats = ranked_genes,
  nPermSimple = 10000 # Permutations for the preliminary p-value estimate
)

# Let's inspect the results table
head(fgsea_results)
```

-   **Run this code.** The `fgsea` function is highly optimized and should complete the 10,000 permutations very quickly. The result, `fgsea_results`, is a data frame (technically a “data.table”) containing the GSEA results for every one of the 50 Hallmark pathways.

#### **Chunk 2: Understanding the GSEA Results Table**

**Explanation:** The `fgsea_results` table is the core output of our analysis. To use it, we must understand what each column means. Let’s go through the most important ones.

-   `pathway`: The name of the gene set from MSigDB (e.g., “HALLMARK_APOPTOSIS”).

-   `pval`: The raw statistical p-value calculated from the permutation test. It tells you the probability of getting an Enrichment Score (ES) as extreme as the one observed, just by random chance.

-   `padj`: The **adjusted p-value** (or FDR, or q-value). This is the p-value after correcting for the fact that we tested 50 different pathways at once (the multiple testing problem). **This is the most important column for determining statistical significance.** ***A common cutoff is `padj < 0.05` or `padj < 0.1`***.

-   `ES`: The **Enrichment Score**. This is the peak of the “random walk” we discussed in Lesson 2. ***It reflects the degree to which a gene set is over-represented at the extremes of the ranked list.***

-   `NES`: The **Normalized Enrichment Score**. This is the most important score for interpretation.

    -   The ES is normalized to account for the size of the gene set, allowing you to compare the results for a small gene set versus a large gene set.

    -   The **sign of the NES** is critical:

        -   **Positive NES:** The pathway is enriched at the **top** of your ranked list. In our case, this means it’s associated with genes that were **upregulated** by estrogen.

        -   **Negative NES:** The pathway is enriched at the **bottom** of your ranked list, meaning it’s associated with genes that were **downregulated** by estrogen.

-   `size`: The number of genes in the pathway after filtering to only the genes present in our dataset.

-   `leadingEdge`: ***A list of the core member genes of the pathway that contributed most to the ES***. We will look at this more in the next lesson.
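To make the `ES` and `leadingEdge` columns concrete, here is a minimal base R sketch of the weighted running sum on a toy ranked list. The gene names, scores, and set membership are hypothetical, and this is a simplified illustration of the idea, not the `fgsea` implementation (which also estimates p-values and the NES).

``` r
# Toy example only: gene names, scores, and set membership are hypothetical.
ranked_toy <- c(g1 = 3.0, g2 = 2.5, g3 = 2.0, g4 = 1.0, g5 = 0.5,
                g6 = -0.5, g7 = -1.0, g8 = -2.0, g9 = -2.5, g10 = -3.0)
set_toy <- c("g1", "g2", "g4", "g9")

# Weighted running sum: step up at each hit (proportional to |score|),
# step down by a constant amount at each miss.
running_sum <- function(stats, gene_set) {
  stats  <- sort(stats, decreasing = TRUE)
  is_hit <- names(stats) %in% gene_set
  hit_weight <- abs(stats) * is_hit
  cumsum(hit_weight / sum(hit_weight) - (!is_hit) / sum(!is_hit))
}

walk_toy <- running_sum(ranked_toy, set_toy)
peak_toy <- which.max(abs(walk_toy))      # position of the largest deviation from zero
es_toy   <- unname(walk_toy[peak_toy])    # the Enrichment Score

# Leading edge: set members at or before a positive peak
# (or at or after a negative valley).
members_in_order <- names(walk_toy) %in% set_toy
leading_edge_toy <- if (es_toy > 0) {
  names(walk_toy)[members_in_order & seq_along(walk_toy) <= peak_toy]
} else {
  names(walk_toy)[members_in_order & seq_along(walk_toy) >= peak_toy]
}

print(round(walk_toy, 3))
cat("ES:", round(es_toy, 3), "\n")
cat("Leading edge:", leading_edge_toy, "\n")
```

In this toy case the walk peaks at `g2`, so the ES is 11/18 (about 0.611) and the leading edge is `g1` and `g2`. The late member `g9` still belongs to the set but falls after the peak, so it is not part of the leading edge.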

#### **Chunk 3: Filtering and Ordering the Results**

**Explanation:** The results table is often long and not sorted in the most useful way. Our first analytical task is to ***filter this table to find the significant pathways and then order them to see the most important biological findings***. We will filter by the adjusted p-value (`padj`) and then sort by the `NES` to ***see the top upregulated and downregulated pathways.***
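Before filtering, it can help to see what the adjustment actually does. The sketch below applies base R’s `p.adjust()` with the Benjamini-Hochberg method to five made-up p-values (the set names and values are hypothetical) and compares how many pass a 0.05 cutoff before and after adjustment.

``` r
# Hypothetical raw p-values for five gene sets (illustration only)
toy_pvals <- c(set_a = 0.001, set_b = 0.012, set_c = 0.035,
               set_d = 0.048, set_e = 0.6)

# Benjamini-Hochberg adjustment
toy_padj <- p.adjust(toy_pvals, method = "BH")

# Compare what survives a 0.05 cutoff before and after adjustment
passes_raw <- names(toy_pvals)[toy_pvals < 0.05]
passes_adj <- names(toy_padj)[toy_padj < 0.05]

print(round(toy_padj, 4))
cat("Pass on raw p-value:     ", passes_raw, "\n")
cat("Pass on adjusted p-value:", passes_adj, "\n")
```

Four of the toy sets pass on the raw p-value, but only `set_a` and `set_b` remain after adjustment, which is why the filter uses `padj` rather than `pval`.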

**Action:**  
Add this chunk to your script.

``` r
# --- 5. Interpret the Results ---

# First, let's filter for the significantly enriched pathways
significant_pathways <- fgsea_results %>%
  filter(padj < 0.05) %>%
  arrange(desc(NES)) # Sort by NES to see top positive and negative pathways

# Print the significant results
cat("--- Significant Enriched Pathways (padj < 0.05) ---\n")
print(significant_pathways[, .(pathway, NES, padj, size)]) # Show key columns

# Let's separate the top upregulated and downregulated pathways for clarity
top_positive_pathways <- significant_pathways %>%
  filter(NES > 0) %>%
  head(10)

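# The table is sorted by descending NES, so re-sort ascending here;
# otherwise head(10) would return the negative pathways closest to zero
top_negative_pathways <- significant_pathways %>%
  filter(NES < 0) %>%
  arrange(NES) %>%
  head(10)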

# You can also create a nice summary plot of the top pathways
plot_data <- bind_rows(top_positive_pathways, top_negative_pathways)

summary_plot <- ggplot(plot_data, aes(x = NES, y = reorder(pathway, NES))) +
  geom_col(aes(fill = NES > 0)) +
  scale_fill_manual(values = c("TRUE" = "#d95f02", "FALSE" = "#1b9e77"), guide = "none") +
  labs(
    title = "Top Enriched Hallmark Pathways in Estrogen Response",
    x = "Normalized Enrichment Score (NES)",
    y = "Hallmark Pathway"
  ) +
  theme_minimal()

print(summary_plot)
ggsave("figures/01_gsea_summary_plot.png", summary_plot, width = 10, height = 8)
```

-   **Run this code.**

-   **Verification:**

    1.  The printed table in your console is your first key result. It gives you a clean, easy-to-read list of the pathways that were significantly affected by the treatment. This is a core component of your final report.

    2.  The bar chart (`summary_plot`) is a compact visual summary that immediately communicates the main findings. The orange bars are the pathways enriched among genes upregulated by estrogen (activated processes), and the green bars are the pathways enriched among downregulated genes (suppressed processes).

### **Lesson 6: Summary & Status Check**

-   **Conceptually**, we now understand the meaning of the key outputs of a GSEA: the padj for significance, and the NES for the magnitude and direction of the enrichment.

-   **Practically**, we have successfully run the fgsea function on our data and have written the code to filter, sort, and display the results in both a tabular and a graphical format.

-   **Crucially**, we have followed the “Trust, but Verify” principle by inspecting the results table and creating a summary plot that confirms our findings and is suitable for a presentation or publication.

------------------------------------------------------------------------

### **Lesson 7: Visualizing the Results🍬**

**Goal:** To understand how to read the classic GSEA enrichment plot and to use `fgsea` to generate these plots for our top significant pathways.

#### **Chunk 1: Deconstructing the GSEA Enrichment Plot**

**Explanation:** The GSEA enrichment plot is one of the most iconic and information-rich visualizations in bioinformatics. It can look intimidating at first, but it’s actually a brilliant summary of the entire GSEA algorithm for a single gene set. Let’s break it down into its three main components, from top to bottom.

1.  **The Top Panel (The “Mountain”): The Enrichment Score (ES) Plot.**

    -   This is the “random walk” we discussed in Lesson 2.

    -   The **y-axis** is the running-sum Enrichment Score.

    -   The **x-axis** represents the position in your master ranked list of all genes (from most upregulated on the left to most downregulated on the right).

    -   The peak (or valley) of this green line is the final ES for the pathway. A peak on the left means the pathway is enriched in upregulated genes. A valley on the right means it’s enriched in downregulated genes.

2.  **The Middle Panel (The “Barcodes”): The Hit Ticks.**

    -   This is the simplest but most important part of the plot.

    -   Each vertical black line (a “tick”) shows the position of a gene that is a **member of this specific gene set** (“a hit”) within the master ranked list.

    -   This panel allows you to see the distribution of the gene set members at a glance. If you see the barcodes clustering on the left side, it visually confirms that the genes in this set tend to be highly ranked (upregulated).

3.  **The Bottom Panel (The “Heatmap”): The Ranking Metric.**

    -   This is a heatmap of the ranking metric for all genes in your master list.

    -   It’s typically colored red for the highest positive ranks (upregulated) and blue for the most negative ranks (downregulated).

    -   This provides a global view of the entire experimental trend and serves as a backdrop for the other two panels.
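The three panels described above can be sketched from scratch with base R graphics. The data here are simulated and the gene set is hypothetical; the point is only to show which quantity feeds each panel, not to reproduce the exact look of any particular GSEA tool.

``` r
# Simulated ranked list of 200 hypothetical genes, sorted high to low
set.seed(1)
toy_stats <- sort(setNames(rnorm(200), paste0("gene", 1:200)), decreasing = TRUE)

# Hypothetical gene set whose members sit mostly near the top of the list
toy_set <- names(toy_stats)[c(2, 5, 9, 14, 20, 33, 47, 80, 150)]

# Running sum: weighted step up at each hit, constant step down at each miss
is_hit   <- names(toy_stats) %in% toy_set
hit_w    <- abs(toy_stats) * is_hit
toy_walk <- cumsum(hit_w / sum(hit_w) - (!is_hit) / sum(!is_hit))

old_par <- par(mfrow = c(3, 1), mar = c(2, 4, 2, 1))

# Top panel: the running-sum Enrichment Score
plot(toy_walk, type = "l", col = "darkgreen",
     ylab = "Running ES", main = "Toy enrichment plot")
abline(h = 0, lty = 2)

# Middle panel: one tick per gene set member, at its rank position
plot(which(is_hit), rep(1, sum(is_hit)), type = "h",
     xlim = c(1, length(toy_stats)), ylim = c(0, 1),
     yaxt = "n", ylab = "Hits")

# Bottom panel: the ranking metric for every gene (red = positive, blue = negative)
plot(toy_stats, type = "h",
     col = ifelse(toy_stats > 0, "red", "blue"), ylab = "Rank metric")

par(old_par)
```

All three panels share the same x-axis (position in the ranked list), which is what lets you read them together: ticks bunched on the left line up with the steep rise in the top panel and the red end of the bottom panel.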

#### **Chunk 2: Plotting the Top Upregulated Pathway**

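**Explanation:** The `fgsea` package provides a simple function, `plotEnrichment()`, that draws a compact version of this plot: the green running-sum curve with the hit ticks along the x-axis (it does not draw the ranking-metric panel). It needs two things: the vector of member genes for the pathway you want to visualize and the ranked stats vector we already created. Let’s create a plot for the significant pathway with the highest positive NES, our top **upregulated** pathway.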

**Action:**  
Add this chunk to your `01_gsea_analysis.R` script.

``` r
# --- 6. Visualize Specific Pathway Results ---

# Let's find the name of our top positively enriched pathway
top_pathway_positive <- significant_pathways %>%
  filter(NES > 0) %>%
  slice_max(order_by = NES, n = 1) %>%
  pull(pathway)

cat("Plotting top POSITIVE pathway:", top_pathway_positive, "\n")

# Create the enrichment plot
positive_plot <- plotEnrichment(
  pathway = hallmark_sets_list[[top_pathway_positive]],
  stats = ranked_genes
) + 
  labs(title = top_pathway_positive)

# Display and save the plot
print(positive_plot)
ggsave(paste0("figures/02_enrichment_plot_", top_pathway_positive, ".png"), positive_plot, width = 7, height = 5)
```

-   **Run this code.**

-   **Verification:** A new plot will appear. This is your proof.

    -   Observe the **green line**: It should rise sharply on the left side, indicating a positive Enrichment Score.

    -   Observe the **barcodes**: You should see a high density of black tick marks clustered on the far left, visually confirming that the member genes of this pathway are highly concentrated among the most upregulated genes in your experiment.

#### **Chunk 3: Plotting the Top Downregulated Pathway**

**Explanation:** Now we will do the exact same thing for the significant pathway with the most negative NES, our top **downregulated** pathway. This demonstrates the symmetry of the analysis.

**Action:**  
Add this final chunk to your script.

``` r
# Now let's plot the top NEGATIVELY enriched pathway
top_pathway_negative <- significant_pathways %>%
  filter(NES < 0) %>%
  slice_min(order_by = NES, n = 1) %>%
  pull(pathway)

cat("Plotting top NEGATIVE pathway:", top_pathway_negative, "\n")

# Create the enrichment plot
negative_plot <- plotEnrichment(
  pathway = hallmark_sets_list[[top_pathway_negative]],
  stats = ranked_genes
) + 
  labs(title = top_pathway_negative)

# Display and save the plot
print(negative_plot)
ggsave(paste0("figures/03_enrichment_plot_", top_pathway_negative, ".png"), negative_plot, width = 7, height = 5)
```

-   **Run this code.**

-   **Verification:** A second enrichment plot will appear.

    -   Observe the **green line**: It should drift downwards for a long time and form a deep valley on the right side, indicating a negative Enrichment Score.

    -   Observe the **barcodes**: You will now see the black tick marks clustered on the **far right**, visually confirming that the members of this pathway are highly concentrated among the most downregulated genes in your experiment.
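In the spirit of “Trust, but Verify”, the visual impression of where the barcodes cluster can also be checked with a number. The helper below is a hypothetical sketch (`hit_rank_fraction` is not an `fgsea` function) shown on toy data: it reports where each pathway member falls in the ranked list as a fraction from 0 (top) to 1 (bottom).

``` r
# Hypothetical helper: fractional rank position of each gene set member
# (0 = top of the ranked list, 1 = bottom)
hit_rank_fraction <- function(stats, gene_set) {
  ranked    <- sort(stats, decreasing = TRUE)
  positions <- which(names(ranked) %in% gene_set)
  (positions - 1) / (length(ranked) - 1)
}

# Toy ranked list of 101 hypothetical genes with scores from 5 down to -5
toy_ranked <- setNames(seq(5, -5, length.out = 101), paste0("gene", 1:101))

# Two hypothetical gene sets: one near the top, one near the bottom
toy_up_set   <- paste0("gene", c(1, 3, 4, 8, 15))
toy_down_set <- paste0("gene", c(80, 90, 95, 99, 101))

median_up   <- median(hit_rank_fraction(toy_ranked, toy_up_set))
median_down <- median(hit_rank_fraction(toy_ranked, toy_down_set))

cat("Median rank fraction, up set:  ", median_up, "\n")
cat("Median rank fraction, down set:", median_down, "\n")
```

A median close to 0 matches ticks clustered on the far left, and a median close to 1 matches ticks clustered on the far right. On the real data you could pass `ranked_genes` and `hallmark_sets_list[[top_pathway_negative]]` to the same helper.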

### **Grand Conclusion of the GSEA Project**

We have successfully completed a full, professional GSEA workflow from start to finish. Let’s recap the story we’ve built.

1.  **The Concept:** We started by understanding that GSEA is a powerful, threshold-free method that asks if a gene set’s members accumulate at the top or bottom of a ranked list of all our genes.

2.  **Data Prep:** We loaded our differential expression results and meticulously converted them into the required master **ranked gene list**.

3.  **Gene Sets:** We programmatically fetched the high-quality **Hallmark gene sets** from the MSigDB database.

4.  **Running GSEA:** We ran the `fgsea` algorithm to calculate the NES and FDR for all 50 Hallmark pathways.

5.  **Interpreting Results:** We created a **summary bar plot** showing the top up- and down-regulated pathways, giving us the high-level biological story.

6.  **Detailed Visualization:** We generated classic **GSEA enrichment plots** for our top hits, providing the detailed, verifiable evidence of why those pathways were significant.

**The Final Result:**  
You now have a set of high-quality, publication-ready figures and a statistically robust table that tells a compelling story about your experiment. You can confidently say, for example:

“Our Gene Set Enrichment Analysis reveals that estrogen treatment in MCF7 cells leads to a significant activation of the ‘HALLMARK_MYC_TARGETS_V1’ and ‘HALLMARK_E2F_TARGETS’ pathways, consistent with increased cell proliferation. Conversely, we observed a significant suppression of the ‘HALLMARK_TNFA_SIGNALING_VIA_NFKB’ pathway…”

**Next steps:** You can now leverage the GSEA method you have already learned and apply it strategically to different collections of gene sets, moving from a high-level summary (Hallmark) to a detailed, mechanistic understanding (KEGG, GO). You don’t need to learn a new method; you just need to swap out the biological knowledge base you are testing against.
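Swapping the knowledge base comes down to building a new named list of gene vectors, because that is the shape the `pathways` argument expects. The sketch below assumes the gene sets arrive as a long table with one row per set-gene pair; the table contents and the column names `gs_name` and `gene_symbol` are illustrative assumptions, so adjust them to whatever your source provides.

``` r
# Hypothetical long-format gene set table: one row per (gene set, gene) pair
toy_gene_sets <- data.frame(
  gs_name     = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  gene_symbol = c("ESR1", "GREB1", "PGR", "MYC", "CCND1"),
  stringsAsFactors = FALSE
)

# Convert to a named list: one character vector of genes per gene set
toy_sets_list <- split(toy_gene_sets$gene_symbol, toy_gene_sets$gs_name)

str(toy_sets_list)
```

A list built this way can be passed as `pathways =` in place of `hallmark_sets_list`, with the same `ranked_genes` vector and the same downstream filtering and plotting code.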