# Metabolomics Tutorial

JAYZ (The University OF Myself)

# 😇Metabolomics Project: Investigating the Metabolic Impact of Metformin on Liver Cancer Cells😶‍🌫️

------------------------------------------------------------------------

## 😜Part 1: Project Setup and Raw Data Processing 🦄

------------------------------------------------------------------------

### **Lesson 1: Project Setup & Data Organization**🦄

**Goal:** To understand the basic requirements and set up a structured environment for a metabolomics project.

#### **Concept 1: The Necessary Tools and Data Format**

In bioinformatics, especially when using open-source tools like R, we cannot work directly with the files that come off the mass spectrometer (e.g., .raw, .d). These are proprietary “closed” formats.

-   **The Universal Language:** The open-source community has created a universal format called **mzML**. Think of it as the PDF or JPG of mass spectrometry. It’s a standardized format that any tool can read.

-   **The Translator:** A free, essential program called **ProteoWizard** (specifically ***its msConvert tool***) is the standard for translating from proprietary formats to mzML.

-   **The “Workbench”:** For metabolomics in R, the most important software is a package from Bioconductor called ***`xcms`***. It covers the whole processing workflow, from the raw files to the final feature list. The version used in this tutorial builds on another package called `MSnbase`, which provides the fundamental tools for handling mass spec data.

#### **Concept 2: The Importance of Project Structure**

A bioinformatics project involves many files: raw data, processed data, scripts, figures, and reports. Without a logical folder structure from the very beginning, a project can quickly become chaotic and impossible to reproduce. A clean, organized project is the hallmark of a professional bioinformatician.

### **Practical Application: Setting up our Project**

Now, let’s apply these concepts to our Metformin project.

**Action 1: Create the Project Folders**  
On your computer, create a main folder for the project. Inside it, create a set of sub-folders. This structure will keep everything tidy.


    Project_HepG2_Metformin/
    ├── data_mzML/         <-- Our converted mzML files will go here
    ├── data_processed/    <-- Intermediate R objects we save will go here
    ├── scripts/           <-- Our R scripts will live here
    └── figures/           <-- The plots we generate will be saved here
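
If you prefer to create the folders from R instead of by hand, a minimal sketch could look like this. The `project_root` location is an assumption: it points to a temporary directory so the example is safe to run anywhere, and you would replace it with the place where you keep your projects.

``` r
# Assumption: the project is built under a temporary directory for demonstration.
# Replace tempdir() with your own location, e.g. "~/projects".
project_root <- file.path(tempdir(), "Project_HepG2_Metformin")

# The four sub-folders from the layout above
sub_dirs <- c("data_mzML", "data_processed", "scripts", "figures")

# recursive = TRUE also creates the parent folder if it does not exist yet
for (d in file.path(project_root, sub_dirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Confirm that every folder now exists
print(dir.exists(file.path(project_root, sub_dirs)))
```

Creating the structure in code has one small advantage: the layout itself becomes part of the reproducible record of the project.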

**Action 2: Prepare the R Environment**  
We need to install the specialized packages. Think of this as stocking our workbench with the right tools before we start.

-   Open RStudio. In the console, run these commands:

``` r
# Install the BiocManager, which is the installer for all bioinformatics packages in R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Now use BiocManager to install the essential metabolomics packages
BiocManager::install(c("xcms", "MSnbase"))

# Also install the tidyverse, a collection of packages for general data science
install.packages("tidyverse")
```

**Action 3: Convert the Raw Data**  
Let’s assume the lab has given us 10 Thermo `.raw` files. We must convert them to `mzML`.

1.  Open the msConvert program (from the ProteoWizard suite).

2.  Add the 10 `.raw` files.

3.  Set the output format to **mzML**.

4.  **Crucially**, add the “Peak Picking” filter (vendor algorithm, all MS levels) so that the output is **centroided**. The centWave algorithm we use in Lesson 2 is designed for centroided data and does not work well on profile-mode data.

5.  Set the output directory to your new data_mzML folder.

6.  Click “Start”.

**Action 4: Set up the RStudio Project and Script**  
To make our lives easier, we will create an RStudio Project file. This automatically manages our working directory.

1.  In RStudio, go to File \> New Project… \> Existing Directory.

2.  Browse to and select your Project_HepG2_Metformin folder.

3.  RStudio will restart and you’ll see an .Rproj file in your project folder.

4.  Now, create our first script: File \> New File \> R Script.

5.  Save it inside the scripts folder as 01_xcms_processing.R.

### **Lesson 1: Summary & Status Check**

We have successfully completed the foundational setup.

-   **Conceptually**, we understand that we need to use an open data format (mzML) and that `xcms` is our primary tool in R for processing this data. We also understand the critical importance of an organized project folder.

-   **Practically**, we have installed the necessary software, created a clean folder structure, converted our raw data, and set up our RStudio environment and our first script.

We are now perfectly prepared to begin the actual data processing. Every step from here on will be code that we write in our `01_xcms_processing.R` script.

------------------------------------------------------------------------

### **Lesson 2: The Core Challenge - Peak Picking & Feature Detection**🤑

**Goal:** To understand what a “feature” is in metabolomics and to use the `xcms` package in R to ***automatically*** detect these features in each of our *`LC-MS files`* individually.

#### **Concept 1: What Are We Looking For? The “Feature”**

The raw LC-MS data is a three-dimensional landscape:

1.  **Mass-to-Charge (m/z):** Which ions are present.

2.  **Retention Time (RT):** When they come off the LC column.

3.  **Intensity:** How many of each ion there are.

***A single metabolite doesn’t just appear at one point.*** As the small sample plug of a metabolite travels off the column, it elutes over a short period of time, creating a peak shape. Therefore, a **metabolic feature** is a 2D mountain in the m/z vs. RT landscape. Our goal in this lesson is to find all of these “mountains” and characterize them.

#### **Concept 2: The Extracted Ion Chromatogram (EIC)**

How can we see these peaks? Imagine you take a very thin slice of the 3D data at a specific m/z value (e.g., all signals between m/z = 129.10 and 129.11). If you then plot the intensity of that slice over time, you get an **Extracted Ion Chromatogram (EIC)**. If a metabolite with that m/z exists, you will see a classic chromatographic peak (like a small bell curve) in the EIC.

The job of “peak picking” algorithms is to automatically generate thousands of these EICs and find the bell-shaped peaks within them.
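
To make the idea of “slicing” concrete, here is a small base R sketch that builds an EIC from simulated data. Everything in it is made up for illustration: the scan spacing, the metabolite at m/z 129.1052, the background ions, and the helper name `extract_eic` are all assumptions, not part of `xcms`.

``` r
set.seed(1)

# One scan every 2 seconds over a 4 minute run
rt <- seq(0, 240, by = 2)

# A metabolite at m/z ~129.1052 that elutes as a Gaussian peak around 120 s
metabolite <- data.frame(
  rt = rt,
  mz = 129.1052 + rnorm(length(rt), sd = 2e-4),   # small measurement jitter
  intensity = 1e5 * exp(-(rt - 120)^2 / (2 * 6^2))
)

# Unrelated background ions at other m/z values
background <- data.frame(
  rt = sample(rt, 500, replace = TRUE),
  mz = runif(500, 200, 400),
  intensity = runif(500, 0, 2000)
)

scans <- rbind(metabolite, background)

# EIC: keep only the signals inside the m/z slice, then sum them per scan
extract_eic <- function(scans, mz_min, mz_max) {
  in_slice <- scans[scans$mz >= mz_min & scans$mz <= mz_max, ]
  aggregate(intensity ~ rt, data = in_slice, FUN = sum)
}

eic <- extract_eic(scans, mz_min = 129.10, mz_max = 129.11)
apex_rt <- eic$rt[which.max(eic$intensity)]
print(apex_rt)

plot(eic$rt, eic$intensity, type = "l",
     xlab = "Retention time (s)", ylab = "Intensity",
     main = "EIC for m/z 129.10 to 129.11")
```

The background ions never show up in the plot because they fall outside the slice. That is exactly why an EIC is so much cleaner than the full data.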

#### **Concept 3: The xcms Peak Picking Algorithm - centWave**

xcms has several algorithms, but the most famous and widely used for high-resolution data (like ours) is called ***centWave***. It works like this:

1.  It scans the data for **regions of interest**: narrow m/z traces in which a signal stays within the `ppm` tolerance over several consecutive scans. ***A typical file yields thousands of them.***

2.  For each region of interest, it creates an EIC.

3.  It then applies a Continuous Wavelet Transform (this is where the “Wave” in the name comes from) to the EIC to ***identify regions that look like real peaks and distinguish them from random noise.***

4.  For each real peak it finds, it records key information: ***its exact m/z***, ***the retention time*** at the apex of the peak, ***the peak’s integrated area*** (the most important value for quantification), and other metrics.

This process is performed **independently on each of our 10 files.** For now, the algorithm doesn’t know or care that the files are related. Its only job is to generate a comprehensive list of all features found in each file, one by one.
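
The values recorded in step 4 can be illustrated on a single simulated EIC. This is not the centWave algorithm (there is no wavelet transform here), only a sketch of what “apex” and “integrated area” mean, under the assumption of a Gaussian peak on a flat baseline.

``` r
# Simulated EIC: one scan per second, a Gaussian peak (apex at 30 s, sigma 4 s)
# sitting on a constant baseline of 10 counts. All numbers are made up.
rt <- seq(0, 60, by = 1)
intensity <- 1000 * exp(-(rt - 30)^2 / (2 * 4^2)) + 10

# 1. Estimate the baseline (here simply the lowest value) and subtract it
baseline <- min(intensity)
signal <- intensity - baseline

# 2. Apex retention time and apex height
apex_rt <- rt[which.max(signal)]
apex_height <- max(signal)

# 3. Integrated area under the peak, using the trapezoid rule
area <- sum(diff(rt) * (head(signal, -1) + tail(signal, -1)) / 2)

print(c(apex_rt = apex_rt, apex_height = apex_height, area = area))
```

The area is the value we will use for quantification later, because it is less sensitive to a single noisy scan than the apex height.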

### **Practical Application: Finding Features in our Project**

Now, let’s write the R code in our `01_xcms_processing.R` script to apply these concepts.

**Action 1: Load Libraries and Define Files**  
The first step in our script is to load the necessary packages and tell R where to find our data files. We also need to create our “phenotype” data frame, which describes the experiment.

``` r
# --------------------------------------------------------------------------
# Script: 01_xcms_processing.R
# Author: Your Name
# Date: 2025-09-03
# --------------------------------------------------------------------------

# Load the essential libraries
library(xcms)
library(MSnbase)
library(tidyverse)

# --- 1. Load Data and Define Experimental Design ---

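# Get the full paths to our converted mzML files
# (the pattern is a regular expression: escape the dot and anchor it to the end of the name)
mzml_files <- list.files("./data_mzML", pattern = "\\.mzML$", full.names = TRUE, recursive = FALSE)

# Check that we found all 10 files
print(mzml_files)
stopifnot(length(mzml_files) == 10)

# Create a 'phenotype' data frame (pData) describing the experiment.
# This is crucial for keeping track of our samples.
pdata <- data.frame(
  sample_name = str_remove(basename(mzml_files), "\\.mzML$"),
  # Assumption: list.files() returns the 5 Control files first, then the 5 Metformin files.
  # Check this against the printed table below before going on.
  sample_group = c(rep("Control", 5), rep("Metformin", 5))
)

# Look at our experimental design table
print(pdata)

# Read the raw data into a special on-disk MSnbase object.
# This is a memory-efficient way to handle large datasets.
# readMSData() expects the sample table as an annotated data frame in its 'pdata' argument.
raw_data <- readMSData(files = mzml_files,
                       pdata = new("NAnnotatedDataFrame", pdata),
                       mode = "onDisk")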
```

-   **Run this code.** You have now loaded all the file headers and your experimental design into the `raw_data` object without filling up your computer’s memory.
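
Assigning groups by position (“the first five are Control”) is fragile, because `list.files()` sorts alphabetically. A safer option is to derive the group from the file name. The sketch below assumes hypothetical names such as `Control_1.mzML` and `Metformin_3.mzML`; your files may follow a different scheme, in which case the second regular expression needs adjusting.

``` r
# Hypothetical file names; the real ones depend on how the lab named the runs
example_files <- c(
  paste0("data_mzML/Control_", 1:5, ".mzML"),
  paste0("data_mzML/Metformin_", 1:5, ".mzML")
)

# Strip the folder and the extension (escaped dot, anchored to the end)
sample_name <- sub("\\.mzML$", "", basename(example_files))

# Everything before the trailing "_<number>" is taken as the group label
sample_group <- sub("_[0-9]+$", "", sample_name)

pdata_check <- data.frame(sample_name, sample_group)
print(pdata_check)

# Each group should contain exactly 5 samples
print(table(pdata_check$sample_group))
```

If the table at the end does not show 5 and 5, stop and fix the design table before running any further step.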

**Action 2: Define the Peak Picking Parameters**  
centWave is a powerful algorithm with many parameters. For now, we will start with a standard, robust set of parameters.

``` r
# --- 2. Define Peak Picking Parameters ---

# We create a parameter object. This is the modern way to work with xcms.
# This makes our code clean and our analysis reproducible.
cwp <- CentWaveParam(
  ppm = 5,             # Mass accuracy in parts-per-million. A good value for modern instruments.
  peakwidth = c(5, 25), # Expected range of peak widths in seconds.
  snthresh = 10,         # Signal-to-noise ratio threshold.
  prefilter = c(3, 100), # Prefilter for intensity. Peaks must have at least 3 points above 100 intensity.
  mzdiff = 0.01          # Minimum difference in m/z for overlapping peaks.
)

# Let's look at our parameter object
print(cwp)
```

-   **Run this code.** This creates an object cwp that holds all our instructions for the centWave algorithm. This is much better than typing all the numbers into one giant function call.
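
Two of these parameters are easier to understand with numbers. The sketch below shows how a `ppm` tolerance translates into an absolute m/z tolerance, and what the `prefilter = c(3, 100)` rule means for a single mass trace. The helper names `ppm_to_da` and `passes_prefilter` are hypothetical and only mimic the rule in simplified form.

``` r
# Absolute m/z tolerance (in Da) that corresponds to a given ppm at a given m/z
ppm_to_da <- function(mz, ppm) mz * ppm / 1e6

# Simplified reading of prefilter = c(k, I): keep a mass trace only if it
# contains at least k data points with intensity >= I
passes_prefilter <- function(intensities, k = 3, I = 100) {
  sum(intensities >= I) >= k
}

# The same 5 ppm means a wider absolute window at higher m/z
print(ppm_to_da(c(100, 500, 1000), ppm = 5))

weak_trace   <- c(20, 150, 90, 40, 110)        # only 2 points reach 100
strong_trace <- c(20, 150, 400, 900, 350, 80)  # 4 points reach 100

print(passes_prefilter(weak_trace))
print(passes_prefilter(strong_trace))
```

Raising `I` or `k` makes peak picking faster and stricter, at the cost of losing faint metabolites.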

**Action 3: Run the Peak Picking!**  
Now we apply our parameters to our raw data object. This is the first major computational step.

``` r
# --- 3. Run Peak Picking ---

# The 'findChromPeaks' function does the work.
# It takes our raw data and our parameter object as input.
# This will take some time to run as it processes each file individually.
xdata_peaks <- findChromPeaks(raw_data, param = cwp)

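# Let's inspect the results
print(xdata_peaks)

# chromPeaks() returns one row per detected peak; the "sample" column holds the file index
head(chromPeaks(xdata_peaks))

# And get a summary table with the number of peaks per file
summary_peaks <- as.data.frame(table(chromPeaks(xdata_peaks)[, "sample"]))
colnames(summary_peaks) <- c("Sample_Index", "Num_Peaks")
# table() stores the index as a factor, so convert it back to an integer before indexing
summary_peaks$Sample_Index <- as.integer(as.character(summary_peaks$Sample_Index))
summary_peaks$Sample_Name <- pdata$sample_name[summary_peaks$Sample_Index]
print(summary_peaks)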
```

-   **Run this code.** `xcms` is now processing your 10 files. When it’s done, the `xdata_peaks` object will contain all the original raw data PLUS the results of the peak picking. The summary table will show you how many thousands of features were detected in each sample. You should see a relatively consistent number across all 10 files, which is a good first quality check.

### **Lesson 2: Summary & Status Check**

We have made a huge leap forward.

-   **Conceptually**, we understand that our goal is to find 2D peaks (features) in the m/z-RT data, and that algorithms like centWave do this by analyzing thousands of EICs. We also know that at this stage, each file is processed independently.

-   **Practically**, we have loaded our data into a memory-efficient R object, defined a clear set of parameters for peak picking, and successfully executed the `findChromPeaks` function. We now have an object, `xdata_peaks`, that contains the locations and intensities of the thousands of peaks detected in each of our 10 samples.

The problem now is that the feature list for “Control_1” is completely independent of the list for “Control_2”. The next critical step is to correct for shifts in retention time so we can begin to match them up.

------------------------------------------------------------------------

### Don’t Forget QC😋

**QC is not only needed, it is arguably even more critical in metabolomics than in proteomics.**

The reason is that metabolites are small, chemically diverse molecules that are much ***more sensitive*** to tiny variations in the experimental process (sample extraction, temperature, column pressure, etc.). Furthermore, the identification of metabolites is a much harder problem, so ***we rely heavily on the quality and consistency of our raw data*** (m/z and retention time) to have any confidence in our results.

### QC in a Metabolomics Workflow

You don’t just do QC once. It’s a continuous process that you perform at multiple stages.

#### **Stage 1: Before You Even Start xcms (Initial Raw Data QC)**

This should be done right after loading the data in **Lesson 2, Action 1**.

**1. Total Ion Chromatogram (TIC) - The “Heartbeat” of the Run:**

-   **Concept:** Exactly the same as in proteomics. We plot the sum of all ion intensities in every scan against the retention time.

-   **What it tells you:** It’s your first and best look at the stability of the LC-MS system.

    -   **GOOD:** TICs for all 10 replicates should have ***similar shapes, peak structures, and overall intensity ranges.*** This shows the chromatography and spray stability were consistent.

    -   **BAD:** One sample has a TIC that is 10x lower than the others. Or a TIC shows a sudden, sharp drop to zero. These are “outlier” runs that are technical failures. You should seriously consider **removing them from the analysis** before you even start peak picking.

-   **How to do it in R:** The chromatogram function in MSnbase does this perfectly, just as we did in the proteomics practical section.

**2. Base Peak Chromatogram (BPC) - The “Loudest Signal”:**

-   **Concept:** Very similar to the TIC, but instead of summing all ions in a scan, it just plots the intensity of the single most intense ion.

-   **What it tells you:** It’s less sensitive to background noise and can sometimes ***give a clearer picture of the major chromatographic peaks***. It’s an excellent companion plot to the TIC. Consistent BPCs across replicates are a very good sign.

-   **How to do it in R:** The same `chromatogram` function is used, but with `aggregationFun = "max"`.
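
The difference between the two chromatograms comes down to one aggregation function. Here is a base R sketch on simulated scans (all values are made up) that computes both by hand, so you can see what `"sum"` and `"max"` do per scan.

``` r
set.seed(42)

# Simulated centroids: 50 scans (one every 2 s) with 20 ions each
scans <- data.frame(
  rt = rep(seq(2, 100, by = 2), each = 20),
  mz = runif(1000, 100, 500),
  intensity = rexp(1000, rate = 1 / 1000)
)

# TIC: sum of all ion intensities in each scan
tic <- aggregate(intensity ~ rt, data = scans, FUN = sum)

# BPC: intensity of the single most intense ion in each scan
bpc <- aggregate(intensity ~ rt, data = scans, FUN = max)

head(tic)
head(bpc)
```

By construction the BPC can never exceed the TIC in any scan, which is a handy sanity check when you compare the two plots.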

#### **Stage 2: After Peak Picking (The QC we are about to do)**

Once `xcms` has found the features in each file, we can perform even more powerful QC.

**3. Total Number of Features:**

-   **Concept:** The summary_peaks table we just created in Lesson 2 is a QC step!

-   **What it tells you:** While not expected to be identical, the total number of features found in each replicate should be in the same ballpark. If one sample suddenly has half (or double) the number of features, it’s a major red flag that something went wrong with that specific sample.
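
The “half or double” rule of thumb is easy to automate. The peak counts below are hypothetical and only stand in for the `Num_Peaks` column of a summary table; the thresholds of 0.5 and 2 come straight from the rule above.

``` r
# Hypothetical number of detected peaks per sample
peak_counts <- c(
  Control_1 = 5210, Control_2 = 4980, Control_3 = 5105,
  Control_4 = 2450, Control_5 = 5320,
  Metformin_1 = 5005, Metformin_2 = 5150, Metformin_3 = 4890,
  Metformin_4 = 5260, Metformin_5 = 5075
)

# Compare every sample with the median across all samples
ratio <- peak_counts / median(peak_counts)

# Flag samples with less than half or more than double the typical count
flagged <- names(peak_counts)[ratio < 0.5 | ratio > 2]

print(round(ratio, 2))
print(flagged)
```

A flagged sample is not automatically a bad sample, but it deserves a close look at its TIC before you continue.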

**4. Mass Accuracy - The “Ruler” of the Instrument:**

-   **Concept:** In metabolomics, we don’t have a search engine to tell us the “true” mass of a peptide like in proteomics. So, how do we check mass accuracy? We often rely on **known contaminants or internal standards**.

-   **What it tells you:** If you know there is always a specific plasticizer contaminant with a known mass of m/z 391.2842, you can plot the measured m/z of that peak in every sample. If it’s consistently measured at 391.2840 +/- 0.0005, your mass accuracy is excellent (within a few ppm). If it’s drifting all over the place, your instrument was not stable, and your confidence in identifying unknown metabolites will be very low.

-   **How to do it:** This is a more manual process where you would plot the EIC of a known mass, find the peak, and check its measured m/z in each file.
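
Once you have the measured m/z of the reference peak in each file, the check itself is one line of arithmetic. In the sketch below the reference value is the plasticizer mass from the text, while the measured values and the 5 ppm acceptance limit are assumptions for illustration.

``` r
# Mass error in parts-per-million
ppm_error <- function(measured, theoretical) {
  (measured - theoretical) / theoretical * 1e6
}

reference_mz <- 391.2842

# Hypothetical measured m/z of the contaminant peak in 10 files
measured_mz <- c(391.2840, 391.2843, 391.2838, 391.2845, 391.2841,
                 391.2839, 391.2844, 391.2842, 391.2837, 391.2846)

errors <- ppm_error(measured_mz, reference_mz)
print(round(errors, 2))

# Assumed acceptance limit: every file within +/- 5 ppm
print(all(abs(errors) <= 5))
```

A systematic offset (all errors with the same sign) points to calibration, while a large spread points to instability during the run.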

**5. Consistency of Retention Times for Key Metabolites (Before Correction):**

-   **Concept:** Pick a few well-known, abundant metabolites you expect to see (e.g., glutamine, leucine).

-   **What it tells you:** Plot their EICs for all 10 samples on one graph. You will almost certainly see that their retention times are not perfectly aligned. They will drift from run to run. Seeing this drift **is the entire motivation for our next lesson (Lesson 3: Retention Time Correction).** This QC step proves why the next processing step is absolutely necessary.
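
You can also put a number on the drift. The sketch below simulates the EIC of one metabolite in five runs with slightly different apex times (all values are hypothetical), then measures where each apex is and how far apart the runs are.

``` r
# Retention time axis in seconds, one point every 0.5 s
rt <- seq(140, 170, by = 0.5)

# Hypothetical apex positions of the same metabolite in five runs
true_apex <- c(151.0, 150.5, 152.0, 149.5, 151.5)

# One simulated EIC per run (columns), each a Gaussian peak with sigma 3 s
eics <- sapply(true_apex, function(a) 1e5 * exp(-(rt - a)^2 / (2 * 3^2)))

# Measured apex per run and the total spread across runs
apex_rt <- rt[apply(eics, 2, which.max)]
drift <- diff(range(apex_rt))

print(apex_rt)
print(drift)

# Overlay the five EICs to see the misalignment
matplot(rt, eics, type = "l", lty = 1,
        xlab = "Retention time (s)", ylab = "Intensity")
```

If the spread is a meaningful fraction of the peak width, grouping without alignment becomes unreliable.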

### Let’s Add the Code!

Let’s add the TIC and BPC plotting code to our `01_xcms_processing.R` script. The right place is directly after we create the `raw_data` object in **Action 1 of Lesson 2**.

``` r
# --- ADDENDUM: Initial Raw Data QC ---

# This code should be run after creating the 'raw_data' object

# Helper: convert a one-row chromatogram object (one column per sample)
# into a long data frame that ggplot2 can use
chrom_to_df <- function(chrs, pd) {
  do.call(rbind, lapply(seq_len(ncol(chrs)), function(i) {
    data.frame(
      rtime = rtime(chrs[1, i]),
      intensity = intensity(chrs[1, i]),
      sample_name = pd$sample_name[i],
      sample_group = pd$sample_group[i]
    )
  }))
}

# 1. Plot the Total Ion Chromatogram (TIC)
tic_df <- chrom_to_df(chromatogram(raw_data, aggregationFun = "sum"), pdata)
tic_plot <- ggplot(tic_df, aes(x = rtime / 60, y = intensity,
                               group = sample_name, color = sample_group)) +
  geom_line() +
  scale_color_viridis_d() + # Use a nice color palette
  labs(title = "Total Ion Chromatogram (TIC)",
       x = "Retention Time (minutes)",
       y = "Total Intensity") +
  theme_bw()

print(tic_plot)
ggsave("figures/01_raw_tic.png", tic_plot, width = 10, height = 6)


# 2. Plot the Base Peak Chromatogram (BPC)
bpc_df <- chrom_to_df(chromatogram(raw_data, aggregationFun = "max"), pdata)
bpc_plot <- ggplot(bpc_df, aes(x = rtime / 60, y = intensity,
                               group = sample_name, color = sample_group)) +
  geom_line() +
  scale_color_viridis_d() +
  labs(title = "Base Peak Chromatogram (BPC)",
       x = "Retention Time (minutes)",
       y = "Max Intensity") +
  theme_bw()

print(bpc_plot)
ggsave("figures/01_raw_bpc.png", bpc_plot, width = 10, height = 6)
```

------------------------------------------------------------------------

### **Lesson 3: Retention Time Correction**🤠

**Goal:** ***To understand why retention times vary between LC-MS runs and to use an xcms algorithm to computationally align the data, ensuring that the same metabolite appears at the same adjusted retention time in every sample.***

#### **Concept 1: The Problem - Inevitable Chromatographic Drift**

Imagine you are running a marathon. Even under identical conditions, you won’t finish ten marathons in the exact same time down to the millisecond. The same is true for molecules in an LC column.

Over the course of an experiment (which can take hours or days), small, unavoidable changes occur:

-   The temperature of the room can fluctuate slightly.

-   The pressure from the LC pumps can vary minutely.

-   The chromatography column itself can age and degrade.

The result is **retention time drift**. A metabolite that appears at 5.21 minutes in the first run might appear at 5.18 minutes in the fifth run and 5.25 minutes in the tenth run.

**Why is this a catastrophe for our analysis?**  
Our next goal (in Lesson 4) is to group features from different samples and say, “These are all the same metabolite.” We do this by looking for features with a very similar m/z and retention time. ***If the retention times are drifting randomly, we will fail to group correctly.*** We might incorrectly split one metabolite into two separate groups, or incorrectly merge two different metabolites into one.

**Therefore, retention time correction is not an optional “cleanup” step. It is an absolutely essential prerequisite for correct feature grouping.**

#### **Concept 2: The Solution - Alignment Algorithms**

`xcms` uses powerful algorithms to fix this drift. The goal is to create a “warping” function for each sample that shifts its retention time axis to match a common reference.

A common and robust algorithm is **`obiwarp`** (Ordered Bijective Interpolated Warping). You don’t need to know the deep mathematics, but conceptually, it works like this:

1.  It chooses one run as the reference (the “center” sample).

2.  For each of the other samples, it compares the binned intensity profile of that run with the reference and finds the optimal way to stretch and squeeze the time axis so that the two line up.

3.  It saves this “warping” function and applies the same transformation to the retention times of all the peaks that were detected in that sample.
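
The idea of a warping function can be illustrated without obiwarp itself. The sketch below uses a much simpler technique, a piecewise-linear warp through a few landmark peaks, which is **not** what obiwarp does internally. All retention times are hypothetical, and the assumption that the run does not drift at 0 s and 360 s is only there to anchor the ends.

``` r
# Apex times (s) of three landmark peaks in the reference run
# and in a second run that has drifted
ref_landmarks    <- c(60, 180, 300)
sample_landmarks <- c(62, 185, 309)

# Warping function: linear interpolation through the landmark pairs,
# anchored at both ends of the run
warp <- approxfun(x = c(0, sample_landmarks, 360),
                  y = c(0, ref_landmarks, 360))

# Apply the same function to every peak detected in the drifting run
sample_peak_rt <- c(62, 120, 185, 250, 309)
adjusted_rt <- warp(sample_peak_rt)

print(data.frame(raw = sample_peak_rt, adjusted = round(adjusted_rt, 2)))
```

Two properties matter here and carry over to real alignment methods: the landmarks end up exactly on the reference, and the order of the peaks is never changed.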

#### **Concept 3: “Trust, but Verify” - How Do We Check the Alignment?**

How do we know if the correction actually worked? We visualize!

-   **Before Correction:** We can plot the TICs for all samples. We will see the major peaks are slightly misaligned.

-   **After Correction:** We plot the TICs again, but this time using the adjusted retention times. The major peaks should now line up almost perfectly. This visual confirmation is our proof that the algorithm succeeded.

### **Practical Application: Aligning Our Project Data**

Let’s add the code to our `01_xcms_processing.R` script.

**Action 1: Define the Alignment Parameters**  
Just like with peak picking, we first create a parameter object. This keeps our code clean and documents our choices.

``` r
# --- 4. Define Retention Time Correction Parameters ---

# We will use the obiwarp method.
# 'binSize' is the m/z bin width used to build the profiles that obiwarp compares. The default is 1.
# We leave 'centerSample' unset, so xcms falls back to its default reference sample.
obiwarp_param <- ObiwarpParam(binSize = 1)

# Let's inspect our parameter object
print(obiwarp_param)
```

-   **Run this code.** You have now created the `obiwarp_param` object which contains the instructions for the alignment algorithm.

**Action 2: Run the Alignment**  
We apply this parameter object to our `xdata_peaks` object.

``` r
# --- 5. Run Retention Time Correction ---

# The 'adjustRtime' function performs the alignment.
# It takes our peak-picked data and the new parameter object.
# This step is computationally intensive.
xdata_aligned <- adjustRtime(xdata_peaks, param = obiwarp_param)

# Let's inspect the new object. It now contains the aligned data.
print(xdata_aligned)
```

-   **Run this code.** This will take a few minutes. `xcms` is calculating the warping functions and creating a new object, `xdata_aligned`, which contains all the previous information plus the new, adjusted retention times for every single feature.

**Action 3: Verify the Result!**  
Did it work? Let’s plot the TICs before and after.

``` r
# --- 6. Verify the Alignment ---

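# A) GET RAW TICs (BEFORE CORRECTION)
raw_tics <- chromatogram(xdata_peaks, aggregationFun = "sum")

# B) GET ALIGNED TICs (AFTER CORRECTION)
# We can get these directly from our new object. It uses the adjusted times by default.
aligned_tics <- chromatogram(xdata_aligned, aggregationFun = "sum")

# C) PLOT THEM SIDE-BY-SIDE
# We'll use a little data wrangling to make a combined plot

# Convert a one-row chromatogram object into a long data frame
# and add a label for the plot facet
tics_to_df <- function(chrs, type) {
  do.call(rbind, lapply(seq_len(ncol(chrs)), function(i) {
    data.frame(
      rtime = rtime(chrs[1, i]),
      intensity = intensity(chrs[1, i]),
      sample_name = pdata$sample_name[i],
      sample_group = pdata$sample_group[i],
      type = type
    )
  }))
}

raw_tics_df <- tics_to_df(raw_tics, "Before Correction")
aligned_tics_df <- tics_to_df(aligned_tics, "After Correction")

# Combine the two data frames
combined_tics_df <- rbind(raw_tics_df, aligned_tics_df)

# Fix the panel order so that "Before Correction" is drawn on top
combined_tics_df$type <- factor(combined_tics_df$type,
                                levels = c("Before Correction", "After Correction"))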

# Create the plot
alignment_plot <- ggplot(combined_tics_df, aes(x = rtime / 60, y = intensity, group = sample_name, color = sample_group)) +
  geom_line() +
  facet_wrap(~ type, ncol = 1) + # This creates the two panels
  scale_color_viridis_d() +
  labs(
    title = "Verification of Retention Time Alignment",
    x = "Retention Time (minutes)",
    y = "Total Intensity"
  ) +
  theme_bw()

# Display the plot
print(alignment_plot)

# Save the plot
ggsave("figures/02_alignment_verification.png", alignment_plot, width = 10, height = 8)
```

-   **Run this code.** This will generate a two-panel plot.

    -   **The top panel (“Before Correction”):** You will see the main peaks are clearly staggered and misaligned between the individual lines (one per sample, colored by group).

    -   **The bottom panel (“After Correction”):** You should see a dramatic improvement. The main peaks should now be sitting almost perfectly on top of each other. This is our visual proof. We have successfully and verifiably aligned our data.

### **Lesson 3: Summary & Status Check**

-   **Conceptually**, we understand that LC-MS runs are not perfectly stable and that this causes retention time drift, which is a critical problem we must solve before we can group features. We know that algorithms like obiwarp can correct this drift.

-   **Practically**, we have defined parameters for, and successfully run, the adjustRtime function.

-   **Crucially**, we have followed the “Trust, but Verify” principle by creating a before-and-after plot that visually confirms the success of our alignment step.

Our data is now primed for the next major step. With the m/z values being accurate and the retention times now aligned, we are finally ready to ask the main question: “Which of these tens of thousands of features from 10 different files actually represent the same metabolite?” This is the task of feature grouping.

------------------------------------------------------------------------

### Lesson 4: Feature Grouping (or “Correspondence”)🧐

**Goal:** ***To understand how xcms groups individual features from all 10 samples into “feature groups,” where each group represents a single, unique metabolic compound measured across the experiment.***

#### **Concept 1: The Problem - From Many Lists to One Matrix**

Right now, our xdata_aligned object contains 10 separate lists of features.

-   File 1 has a feature at (m/z=130.06, RT=2.51 min)

-   File 2 has a feature at (m/z=130.07, RT=2.52 min)

-   File 3 has a feature at (m/z=130.06, RT=2.50 min)

-   …and so on.

Our goal is to create a final data matrix where the **rows are unique compounds** and the **columns are our samples**. To do this, we need to figure out that the three features listed above are very likely measurements of the same compound.

The process of finding and linking these corresponding features is called **grouping** or **correspondence**.

#### **Concept 2: The Solution - Grouping by Proximity**

How do we decide which features belong together? We look for features that are “close” to each other in the chemical space we have just worked so hard to clean up:

1.  **Mass-to-Charge (m/z):** The m/z values must be very similar, within the known mass accuracy of our instrument (e.g., within 5 ppm).

2.  **Retention Time (RT):** The aligned retention times must also be very close, within a small window.

The grouping algorithm in `xcms` essentially creates a “search window” around each feature and looks for other features from different samples that fall within that m/z and RT window.

The most common algorithm for this uses a **density-based approach**. Imagine plotting all the features from all 10 samples as points on a 2D graph of m/z vs. RT. ***The features that belong to the same compound will form a tight, dense cluster of points.*** The algorithm’s job is to find these dense regions.
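
Here is a small sketch of the density idea in one dimension. It looks at simulated peaks that all fall into one narrow m/z bin and clusters them along the RT axis with a Gaussian kernel. This is a simplified illustration under made-up data, not the `xcms` implementation, and the kernel width of 5 s is an assumption.

``` r
set.seed(7)

# Hypothetical aligned apex RTs (s) from 10 samples for two compounds
# that share the same m/z bin but elute about 30 s apart
peaks <- data.frame(
  sample = rep(1:10, times = 2),
  rt = c(rnorm(10, mean = 150, sd = 1), rnorm(10, mean = 180, sd = 1))
)

# Kernel density along the RT axis: sum of Gaussian kernels (sd = 5 s)
grid <- seq(130, 200, by = 0.5)
dens <- sapply(grid, function(g) sum(dnorm(g, mean = peaks$rt, sd = 5)))

# Local maxima of the density curve are the centers of the feature groups
is_max <- which(diff(sign(diff(dens))) == -2) + 1
centers <- grid[is_max]

# Assign every peak to its nearest density maximum
peaks$group <- apply(abs(outer(peaks$rt, centers, "-")), 1, which.min)

print(centers)
print(table(peaks$group))
```

A wider kernel merges nearby compounds into one group, while a narrower one splits a single compound when the alignment is imperfect. That trade-off is what the bandwidth parameter controls.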

#### **Concept 3: “Trust, but Verify” - How Do We Check the Grouping?**

This step is harder to verify with a single plot, but there are key metrics and visualizations we can use:

1.  **Feature Group Summaries:** We can check how many of our final feature groups contain a peak from only 1 sample, 2 samples, … up to all 10 samples. A good grouping should result in a large number of feature groups that contain peaks from most, if not all, of the samples within at least one condition (e.g., all 5 Control samples). A very high number of “singleton” groups (found in only one sample) might indicate the parameters were too strict.

2.  **Extracted Ion Chromatogram (EIC) of a Feature Group:** This is the most powerful verification. We can pick a single, final feature group from our results and ask xcms to plot the raw chromatograms for that specific m/z and RT range for all 10 samples. A well-grouped feature should show a clean, aligned peak in most, if not all, of the samples. This confirms that the algorithm correctly grouped the signals.

### **Practical Application: Grouping Features in our Project**

Let’s add the code to our `01_xcms_processing.R` script.

**Action 1: Define the Grouping Parameters**  
We need to tell the density-based algorithm how close is “close enough” for m/z and RT.

``` r
# --- 7. Define Feature Grouping Parameters ---

# We will use the 'PeakDensity' method.
pdp <- PeakDensityParam(
  sampleGroups = pdata$sample_group, # We provide our sample groups (Control, Metformin)
  minFraction = 0.5,                 # A feature must be present in at least 50% of the samples of at least ONE group. This is a key parameter to avoid noise.
  bw = 5,                            # The bandwidth (standard deviation) of the RT smoothing kernel in seconds.
  ppm = 5                            # m/z-dependent widening of the m/z bins, in ppm. Only recent xcms versions accept this argument.
)

# Let's inspect our parameter object
print(pdp)
```

-   **Run this code.** You have now created the `pdp` object. The minFraction parameter is particularly important. Setting it to 0.5 means that to be considered a “real” feature group, a feature must be detected in at least 3 of the 5 replicates in either the Control group OR the Metformin group. This is a powerful way to filter out random, sporadic noise peaks.
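
The `minFraction` rule is worth checking by hand once. The sketch below applies it to two hypothetical detection patterns; `keep_feature` is an illustrative helper, not an `xcms` function.

``` r
sample_group <- rep(c("Control", "Metformin"), each = 5)

# Keep a candidate feature if at least one group reaches the minimum fraction
keep_feature <- function(detected, groups, min_fraction = 0.5) {
  any(tapply(detected, groups, mean) >= min_fraction)
}

# Pattern A: found in 2 of 5 Control and 3 of 5 Metformin samples
pattern_a <- c(TRUE, TRUE, FALSE, FALSE, FALSE,
               TRUE, TRUE, TRUE, FALSE, FALSE)

# Pattern B: found in 2 of 5 samples in both groups
pattern_b <- c(TRUE, TRUE, FALSE, FALSE, FALSE,
               TRUE, TRUE, FALSE, FALSE, FALSE)

print(tapply(pattern_a, sample_group, mean))
print(keep_feature(pattern_a, sample_group))
print(keep_feature(pattern_b, sample_group))
```

Pattern B is rejected even though it appears in 4 of 10 samples overall, because neither group on its own reaches 50%. This per-group logic is what protects metabolites that exist in only one condition.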

**Action 2: Run the Grouping**  
We apply this parameter object to our `xdata_aligned` object.

``` r
# --- 8. Run Feature Grouping ---

# The 'groupChromPeaks' function performs the correspondence.
# This can also be a computationally intensive step.
xdata_grouped <- groupChromPeaks(xdata_aligned, param = pdp)

# Let's inspect the new object.
print(xdata_grouped)
```

-   **Run this code.** `xcms` is now searching through all the aligned features and clustering them into groups. The new `xdata_grouped` object now contains the final, linked feature groups.

**Action 3: Verify the Result!**  
Let’s look at the summary and then visualize a specific EIC.

``` r
# --- 9. Verify the Grouping ---

# A) FEATURE GROUP SUMMARY
# The 'featureDefinitions' function gives us the final table of feature groups
feature_defs <- featureDefinitions(xdata_grouped)
head(feature_defs)

# Let's check the size of the groups
summary_groups <- as.data.frame(table(feature_defs$npeaks))
colnames(summary_groups) <- c("Num_Samples", "Num_Feature_Groups")
print(summary_groups)

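# B) VISUALIZE AN EIC FOR A SPECIFIC FEATURE GROUP
# Pick the first feature group with 10 assigned peaks (ideally one per sample).
# Feature IDs are the row names of feature_defs (e.g. "FT0050"; the padding depends on the total count).
feature_of_interest <- rownames(feature_defs)[which(feature_defs$npeaks == 10)[1]]

# Extract the EIC of this feature group in every sample
eic_plot <- featureChromatograms(xdata_grouped, features = feature_of_interest)

# Display the plot (xcms draws chromatograms with base graphics, one line per sample)
group_colors <- c(Control = "#440154", Metformin = "#21908C")
plot(eic_plot, col = group_colors[pdata$sample_group])

# Save the plot. ggsave() only works for ggplot objects, so we use a png device here.
png("figures/03_eic_verification.png", width = 8, height = 6, units = "in", res = 150)
plot(eic_plot, col = group_colors[pdata$sample_group])
dev.off()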
```

-   **Run this code.**

    -   The `summary_groups` table gives you an overview of the grouping quality. You want to see high numbers for 5 samples and 10 samples.

    -   The `eic_plot` is your direct visual proof. It shows the chromatograms of a single feature group. You should see a nice, roughly Gaussian-shaped peak in most or all of the 10 chromatograms, and crucially, they should all be **closely aligned** at the same retention time. This is strong evidence that both the alignment (Lesson 3) and the grouping (Lesson 4) worked correctly.

### **Lesson 4: Summary & Status Check**

This lesson marks the end of the core xcms pre-processing workflow.

-   **Conceptually**, we understand that the goal of grouping is to create a unified feature list by clustering individual peaks based on their proximity in the aligned RT vs. m/z space.

-   **Practically**, we have defined parameters for, and successfully run, the groupChromPeaks function.

-   **Crucially**, we have followed the “Trust, but Verify” principle by inspecting the feature group statistics and, most importantly, by plotting an EIC to visually confirm that the algorithm correctly grouped the signals from the raw data.

We have successfully transformed 10 complex raw files into a single, coherent, and organized list of metabolic features. The hard part of the signal processing is now complete. Our next step will be to handle any missing values in this feature list and then format it into the final data matrix that we will use for our statistical analysis.

------------------------------------------------------------------------

## 🍟Part 2: From Features to a Data Matrix🦋

------------------------------------------------------------------------

### Lesson 5: Filling Missing Peaks 🥩

#### **1. Goal**

*Our goal is to address the issue of missing values in our feature table. The `groupChromPeaks` function linked features across samples, but some samples within a group may not have had a peak detected by the `findChromPeaks` algorithm. We want to go back to the raw data for those specific samples and integrate the signal in the exact location where the peak should be, giving us a more complete data matrix for statistical analysis.*

#### **2. Underlying Logic**

-   **Why are peaks “missing”?** A peak might be “missing” for two main reasons:

    1.  **True Absence:** The metabolite is simply not present or is below the instrument’s detection limit in that sample. This is a biologically meaningful result.

    2.  **Technical Absence:** The metabolite was present, but its ***signal was too low,*** too noisy, or too poorly shaped to be picked by the `findChromPeaks` algorithm, which had a strict signal-to-noise threshold.

-   **The Problem:** Many statistical methods either fail on missing values (NA) or silently drop them, which reduces power and can bias the comparison. Replacing them with zero is not a fix either: a zero is a measured value, whereas an NA means we don’t know.

-   **The Solution (fillChromPeaks):** The grouping step (Lesson 4) has given us a very precise map. For a given feature group, we know the exact m/z and the aligned retention time range where that metabolite appears in the samples where it was detected. The `fillChromPeaks` function uses this map. For every sample where a peak is missing within that group, it ***goes back to the original raw data file*** for that sample, extracts the chromatogram for that precise m/z and RT window, and integrates the signal there.

    -   ***If there was a small, real peak that was missed, it will be integrated and we get a good quantitative value***.

    -   If there was only noise, it will ***integrate the noise,*** resulting in a very small value, which is a more accurate representation than a complete NA. If no signal at all was recorded in that window, the value stays NA.
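
The idea of re-integrating a fixed window can be sketched on a toy chromatogram. This is only an illustration in base R: `integrate_window()` is a hypothetical helper based on the trapezoid rule, and all numbers are made up. `xcms` applies its own integration to the raw files, so treat this as a picture of the principle rather than of the implementation.

``` r
# Toy chromatograms (retention time in seconds, intensity in counts).
rt <- seq(100, 120, by = 1)
# Sample 1: a weak Gaussian peak at 110 s on a flat baseline of 50 counts
with_peak <- 50 + 400 * exp(-(rt - 110)^2 / (2 * 2^2))
# Sample 2: baseline only, no peak
baseline_only <- rep(50, length(rt))

# Hypothetical helper: trapezoidal area of the signal inside [rt_min, rt_max]
integrate_window <- function(rt, intensity, rt_min, rt_max) {
  keep <- rt >= rt_min & rt <= rt_max
  x <- rt[keep]
  y <- intensity[keep]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# The window (105 to 115 s) would come from the samples where the peak was found
area_peak  <- integrate_window(rt, with_peak, rt_min = 105, rt_max = 115)
area_blank <- integrate_window(rt, baseline_only, rt_min = 105, rt_max = 115)

print(c(peak = area_peak, blank = area_blank))
```

Both samples receive a number instead of an NA; the sample without a peak simply gets the smaller, baseline-level area.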

#### **3. Practical Application: The Code**

Let’s add the code to our `01_xcms_processing.R` script.

``` r
# --- 10. Define Parameters for Peak Filling ---

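# We create a parameter object to be explicit about our method.
# The 'expandMz'/'expandRt' (relative) and 'fixedMz'/'fixedRt' (absolute)
# parameters control the size of the integration window around each feature.
# A small fixed retention-time margin (in seconds) is often robust.
fcp <- FillChromPeaksParam(expandMz = 0, expandRt = 0, ppm = 0, fixedMz = 0, fixedRt = 1)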

# Let's inspect our parameter object
print(fcp)
```

-   **Action:** Add this code to your script and run it. We have now defined the instructions for the peak filling algorithm.

``` r
# --- 11. Run Peak Filling ---

# The 'fillChromPeaks' function does the work.
# It takes our grouped data object as input.
# It only integrates known locations instead of searching for peaks, but it
# still has to re-read the raw files, so it can take a little while.
xdata_filled <- fillChromPeaks(xdata_grouped, param = fcp)

# Let's inspect the final object
print(xdata_filled)
```

-   **Action:** Run this code. `xcms` is now iterating through your feature groups, finding the NAs, and integrating the signal from the raw data. The new `xdata_filled` object is our final, processed `xcms` object.

#### **4. Expected Outcome**

The `xdata_filled` object now contains the same feature groups as before, but the underlying data table has far fewer missing (NA) values. The intensities for previously missing peaks have been replaced with new, integrated quantitative values.

#### **5. Verifiable “Proof”**

How can we prove that this worked? We can directly compare the number of missing values for a specific feature before and after the filling step.

``` r
# --- 12. Verify the Peak Filling ---

# The 'featureValues' function extracts the final data matrix.
# 'value = "into"' gives the integrated peak area (intensity).

# A) GET THE MATRIX BEFORE FILLING
matrix_before_filling <- featureValues(xdata_grouped, value = "into")

# B) GET THE MATRIX AFTER FILLING
matrix_after_filling <- featureValues(xdata_filled, value = "into")

# C) COMPARE THE NUMBER OF NAs
na_counts_before <- sum(is.na(matrix_before_filling))
na_counts_after <- sum(is.na(matrix_after_filling))

# Print the proof
cat("Number of missing values BEFORE peak filling:", na_counts_before, "\n")
cat("Number of missing values AFTER peak filling:", na_counts_after, "\n")

# We can also look at a specific feature that had missing values.
# Let's find a feature group that was complete in Control but only partly
# detected in Metformin. featureDefinitions() holds one peak-count column per
# sample group (assuming the groups are named "Control" and "Metformin").
feature_info <- featureDefinitions(xdata_grouped)
example_feature_row <- which(feature_info$Metformin < 5 & feature_info$Metformin > 0 & feature_info$Control == 5)[1]
example_feature_id <- rownames(feature_info)[example_feature_row]

cat("\n--- Verifying a single feature:", example_feature_id, "---\n")
cat("Intensities BEFORE filling:\n")
print(matrix_before_filling[example_feature_id, ])

# cat() is used for the labels because print() would show "\n" literally
cat("\nIntensities AFTER filling:\n")
print(matrix_after_filling[example_feature_id, ])
```

-   **Action:** Run this final block.

-   **Verification:** The console output is our proof.

    1.  You will see a dramatic reduction in the total number of missing values. The “after” count will be much lower than the “before” count.

    2.  For the specific example feature, you will see a row of numbers printed for the “before” matrix that contains one or more NAs.

    3.  The row of numbers for the “after” matrix will now have those NAs replaced with small, positive numerical values (a value can remain NA if no signal at all was recorded in that region). This is direct, verifiable evidence that the `fillChromPeaks` function did what we intended.

### **Lesson 5: Summary & Status Check**

-   **Conceptually**, we understand that missing peaks are a problem for statistics and that `xcms` provides a robust solution by re-integrating the raw signal in the expected location, based on the alignment data from samples where the peak was found.

-   **Practically**, we have defined parameters for and run the `fillChromPeaks` function.

-   **Crucially**, we have followed the “Trust, but Verify” principle by writing code that directly compares the data matrix before and after the step, proving that the number of missing values has been significantly reduced.

------------------------------------------------------------------------

### Lesson 6: Annotation with `CAMERA` ⚽

#### **1. Goal**

***Our goal is to “annotate” the feature list from `xcms`. This means finding features that are related because they originate from the same parent metabolite. Specifically, we want to identify:***

1.  **Isotopes:** Peaks that have the same retention time but are ~1.00335 Da heavier (the mass difference of a ¹³C isotope).

2.  **Adducts:** Peaks that have the ***same retention time but different m/z values*** corresponding to the ***same molecule binding with different ions*** (e.g., \[M+H\]⁺, \[M+Na\]⁺, \[M+K\]⁺).

By grouping these related features into “pseudospectra,” we get a much ***cleaner list where each entry is a closer approximation of a single, unique compound.*** This is a critical step before database searching and statistical analysis.

#### **2. Underlying Logic**

-   **The Problem:** A single metabolite, “M,” does not produce a single peak in the mass spectrometer. Due to natural isotope abundance and the chemistry of electrospray ionization, ***it will generate a whole family of related peaks.***

    -   **Isotopes:** You’ll see the ***main peak*** (M), the M+1 peak (with one ¹³C atom), the M+2 peak (with two ¹³C atoms), etc. `xcms` will likely have ***picked all of these as separate features.***

    -   **Adducts:** In positive ion mode, the same molecule M can be detected as the protonated form \[M+H\]⁺, the sodium adduct \[M+Na\]⁺, and the potassium adduct \[M+K\]⁺. `xcms` will ***have picked these three as separate features, even though they all come from the same original compound.***

-   **The Consequence:** If we don’t fix this, we will perform statistical tests on all these redundant features. We might find that \[M+H\]⁺, \[M+Na\]⁺, and the M+1 isotope are all “significantly changed.” ***This is not three independent discoveries;*** it’s **one discovery** reported three times, ***which inflates our statistics and makes interpretation a mess.***

-   **The Solution (CAMERA):** The **`CAMERA`** (Collection of Algorithms for MEtabolite pRofile Annotation) package is designed specifically for this. It takes the feature list from `xcms` and uses a set of rules to find these relationships.

    1.  First, ***it groups features that co-elute***, i.e. that share (nearly) the same retention time. This grouping can optionally be refined by requiring correlated intensities (features that go up and down together are likely related).

    2.  Then, within these groups, ***it looks for specific, known mass differences*** that correspond to isotope patterns and common adducts (\[Na-H\] ≈ 21.98 Da, \[K-H\] ≈ 37.96 Da).

    3.  Finally, it bundles all the features it identifies as belonging to one compound into a “pseudospectrum” and assigns them the same group ID.
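
The mass-difference rule can be sketched in a few lines of base R. This is not CAMERA’s code: the four m/z values are toy numbers for features assumed to co-elute, the 0.005 Da tolerance is an arbitrary choice, and the rule table holds only three of the many differences a real rule set contains.

``` r
# Toy m/z values of four features that elute at the same retention time
mz <- c(181.0707, 182.0740, 203.0526, 219.0265)

# A few known mass differences (Da) between related ions
known_diffs <- c("13C isotope"       = 1.003355,
                 "[M+Na]+ vs [M+H]+" = 21.981942,
                 "[M+K]+ vs [M+H]+"  = 37.955882)
tol <- 0.005  # absolute tolerance in Da (an assumption for this sketch)

# All index pairs (i < j) and their m/z differences
pairs <- t(combn(seq_along(mz), 2))
delta <- mz[pairs[, 2]] - mz[pairs[, 1]]

# For each pair, name the known difference it matches (NA if none)
relation <- vapply(delta, function(d) {
  hit <- which(abs(d - known_diffs) <= tol)
  if (length(hit) == 1) names(known_diffs)[hit] else NA_character_
}, character(1))

matches <- data.frame(mz_low = mz[pairs[, 1]], mz_high = mz[pairs[, 2]],
                      delta = round(delta, 4), relation = relation)
matches <- matches[!is.na(matches$relation), ]
print(matches)
```

In this toy set, the three matching pairs all involve the ion at m/z 181.0707, which is what you would expect if it is the \[M+H\]⁺ form and the others are its isotope and adducts.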

#### **3. Practical Application: The Code**

This process is a bit different as `CAMERA` is a ***separate package*** that operates on the `xcms` object.

``` r
# --- 13. Annotate Features with CAMERA ---
# This code continues in our '01_xcms_processing.R' script.

# Load the CAMERA library
library(CAMERA)

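# CAMERA works on a special object type. xsAnnotate() expects the older
# 'xcmsSet' class, so we first convert our xcms result (assumed here to be an
# XCMSnExp object) and then create the 'xsAnnotate' object.
# This step can take a moment and might print some status messages.
xset <- as(xdata_filled, "xcmsSet")
xa <- xsAnnotate(xset)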

# Step 1: Group features based on retention time. Features from the same
# compound should have the same RT.
xa_grouped <- groupFWHM(xa)

# Step 2: Annotate isotopes. This looks for peaks with the expected
# mass difference for C13 isotopes.
xa_isotopes <- findIsotopes(xa_grouped)

# Step 3: Annotate adducts and group them into pseudospectra.
# This is the key step where it looks for [M+H], [M+Na], etc.
# The polarity tells CAMERA which adduct rules to use (positive mode here).
xa_annotated <- findAdducts(xa_isotopes, polarity = "positive")
```

-   **Action:** Add this code block to your script and run it. You are performing the three main steps of CAMERA’s workflow. The final object, `xa_annotated`, contains all the previous `xcms` data ***plus a wealth of new annotation information.***

#### **4. Expected Outcome**

The primary output is a table of results that links features together. We expect to see a ***new column, `pcgroup`*** (the pseudospectrum group ID), where multiple features now share the same ID. Features with the same `pcgroup` ID are ***hypothesized to be different adducts/isotopes of the same*** **parent compound.**

#### **5. Verifiable “Proof”**

How can we prove that `CAMERA` did its job? We can extract its results table and inspect a specific `pseudospectrum` group.

``` r
# --- 14. Verify the CAMERA Annotation ---

# A) GET THE PEAK LIST WITH ANNOTATIONS
# This peak list is a table with all features and the new annotation columns.
peaklist <- getPeaklist(xa_annotated)
head(peaklist) # Notice the new columns like 'isotopes' and 'pcgroup'

# B) VERIFY A SPECIFIC PSEUDOSPECTRUM
# Let's find a pcgroup that has several features in it.
pcgroup_summary <- as.data.frame(table(peaklist$pcgroup))
colnames(pcgroup_summary) <- c("pcgroup_ID", "num_features")
interesting_pcgroup <- pcgroup_summary %>%
  filter(num_features > 2) %>% # Find a group with at least 3 members
  arrange(desc(num_features)) %>%
  slice(1) %>% # Take the largest one as an example
  pull(pcgroup_ID) %>%
  as.character() # table() yields a factor; convert so the ID prints and compares correctly

# Now, let's look at all the features from this one group
example_group <- peaklist %>%
  filter(pcgroup == interesting_pcgroup)

cat("\n--- Verifying an example pseudospectrum group:", interesting_pcgroup, "---\n")
# We only show the important columns for clarity
print(example_group[, c("mz", "rt", "isotopes", "adduct", "pcgroup")])
```

-   **Action:** Run this final block.

-   **Verification:** The printed table is our proof. You will see a small table with several rows, but they all share the same `pcgroup` ID.

    -   **Check the `rt` column:** All the retention times ***should be nearly identical.*** This confirms they eluted together.

    -   **Check the `mz` column:** The m/z values will be ***different.***

    -   **Check the adduct and isotopes columns:** CAMERA will have made its best guess as to what each feature is. You might see one labeled \[M+H\]+, another \[M+Na\]+, and another identified as an isotope \[M+1\]. This is direct evidence that the algorithm successfully found and grouped these chemically related signals into a single, logical compound group.

### **Lesson 6: Summary & Status Check**

-   **Conceptually**, we understand that our feature list is redundant due to isotopes and adducts, and that we must group these related features to get a list that more accurately represents unique compounds. We know that CAMERA does this by finding correlated features with specific mass differences.

-   **Practically**, we have run the main CAMERA functions to create an annotated object.

-   **Crucially**, we have followed the “Trust, but Verify” principle by extracting the peak list and inspecting a single “pseudospectrum,” confirming that the features within it have the expected properties (same RT, different m/z, plausible adduct/isotope assignments).

We have now reached a major milestone. Our data is as clean and well-structured as it can be. We are finally ready to assemble the final data matrix for our statistical analysis.

------------------------------------------------------------------------

### Lesson 7: Building the Final Data Matrix & Normalization🦪

#### **1. Goal**

Our goals for this lesson are twofold:

1.  **To build the final data matrix:** We will extract the ***quantitative information (peak intensities)*** from our processed object and ***create a table where the rows represent our unique compounds*** (using the `pcgroup` annotation from `CAMERA`) and the columns represent our 10 samples.

2.  **To normalize the data:** We will apply a normalization method to this matrix to correct for unavoidable technical variations between samples (e.g., slight differences in sample loading or instrument sensitivity over time). This ensures that the differences we see are biological, not technical.

#### **2. Underlying Logic**

-   **Why build a new matrix?** The `xcms` and `CAMERA` objects are complex and contain all the raw data. For statistics, we need a simple rows x columns matrix of numbers. Furthermore, we need to resolve the redundancy identified by CAMERA. If multiple features (e.g., \[M+H\]⁺ and \[M+Na\]⁺) belong to the same `pcgroup`, we should ***represent them with a single row*** in our final matrix, typically by choosing the most intense and reliable feature.

-   **Why is normalization essential?** Imagine you pipette 99 microliters of sample A but 101 microliters of sample B into the instrument vials. Every single metabolite in sample B would appear to be ~2% more abundant. This is purely technical variation. Normalization aims to correct for these kinds of global, systematic shifts. It assumes that most metabolites do not change between your samples, and it adjusts the intensity scales of each sample so that the bulk of the metabolites line up.

-   **A Common Normalization Method (PQN):** **Probabilistic Quotient Normalization (PQN)** is a robust and widely used method. Conceptually, it works like this:

    1.  It calculates a “reference” spectrum (typically the median spectrum across all samples).

    2.  For each individual sample, it calculates the fold-change for every metabolite relative to this reference.

    3.  It finds the median of all these fold-changes for that sample. This median value is the most likely “scaling factor” or technical error for that run.

    4.  It then divides all metabolite intensities in that sample by this scaling factor, bringing its overall intensity in line with all the other samples.
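
To make the four steps concrete, here is a small numeric walkthrough in base R. The matrix is invented for illustration: sample `S3` is an exact 2x copy of `S1`, mimicking a purely technical loading difference, so a working normalization should make the two columns identical again.

``` r
# Toy intensity matrix: 5 metabolites (rows) x 3 samples (columns)
s1 <- c(100, 200, 50, 400, 80)
mat <- cbind(S1 = s1,
             S2 = c(110, 190, 55, 380, 85),
             S3 = 2 * s1)  # same sample, "loaded" twice as much
rownames(mat) <- paste0("met", 1:5)

# 1. Reference spectrum: per-metabolite median across samples
ref <- apply(mat, 1, median)

# 2. Quotient of every value relative to the reference (row-wise division)
quot <- mat / ref

# 3. Per-sample scaling factor: the median quotient of each column
scale_factor <- apply(quot, 2, median)

# 4. Divide each sample (column) by its scaling factor
mat_pqn <- sweep(mat, 2, scale_factor, "/")

print(round(scale_factor, 3))
print(round(mat_pqn, 1))
```

The scaling factor of `S3` comes out as twice that of `S1`, and after step 4 the two columns match, which is the behaviour described above.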

#### **3. Practical Application: The Code**

This is the end of our `01_xcms_processing.R` script.

``` r
# --- 15. Build and Normalize the Final Data Matrix ---
# This code continues in our '01_xcms_processing.R' script.

# We will continue to work with the 'peaklist' data frame from CAMERA.

# Step 1: Create a unique identifier for each compound (pcgroup)
# We will select the most intense feature to represent each pcgroup.
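# First, get the intensity data for each feature across all samples.
# groupval() does not accept the CAMERA object directly, so we take the values
# from the filled xcms object; its rows are expected to be in the same order
# as the rows of 'peaklist'.
intensity_matrix <- featureValues(xdata_filled, value = "into")
stopifnot(nrow(intensity_matrix) == nrow(peaklist))

# Find the maximum intensity for each feature across all samples
# (na.rm = TRUE because a few values may still be NA after peak filling)
max_intensity <- apply(intensity_matrix, 1, max, na.rm = TRUE)

# Add this max intensity and a unique feature ID to our peaklist
peaklist_processed <- peaklist %>%
  mutate(feature_id = 1:n(), max_int = max_intensity)

# Now, for each pcgroup, find the feature with the highest max_int
# (with_ties = FALSE guarantees exactly one representative per group)
representative_features <- peaklist_processed %>%
  group_by(pcgroup) %>%
  slice_max(order_by = max_int, n = 1, with_ties = FALSE) %>%
  ungroup()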

# Create the final, filtered intensity matrix
final_matrix_raw <- intensity_matrix[representative_features$feature_id, ]

# Assign meaningful row names (a combination of m/z and RT)
rownames(final_matrix_raw) <- paste0("M", round(representative_features$mz, 4),
                                    "T", round(representative_features$rt, 2))

# Let's inspect our raw, pre-normalization matrix
head(final_matrix_raw)
```

-   **Action:** Run this block. You have now created a clean data matrix, `final_matrix_raw`, where each row represents the single most intense feature from each pseudospectrum group. This is a huge step!
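
If you want to convince yourself of the “one representative per group” logic without the full pipeline, the same selection can be reproduced on a toy table in base R. The table and its column values are invented; only the column names mirror the ones used above.

``` r
# Toy annotation table: five features in three pseudospectrum groups
toy <- data.frame(
  feature_id = 1:5,
  pcgroup    = c(1, 1, 1, 2, 3),
  max_int    = c(5e5, 2e4, 8e4, 3e5, 1e4)
)

# Sort by group, then by decreasing intensity, and keep the first row per group
ordered_toy <- toy[order(toy$pcgroup, -toy$max_int), ]
representatives <- ordered_toy[!duplicated(ordered_toy$pcgroup), ]

print(representatives)
```

Group 1 has three members, and only its most intense feature (`feature_id` 1) survives; the single-member groups pass through unchanged.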

Now, let’s normalize this matrix.

``` r
# Step 2: Perform Normalization and Log Transformation

# We'll write a simple function for PQN for clarity.
normalize_pqn <- function(mat) {
  # Calculate reference spectrum (median), ignoring missing values
  ref_spec <- apply(mat, 1, median, na.rm = TRUE)
  
  # Calculate quotients for each sample (each column is divided by the reference)
  quotients <- mat / ref_spec
  quotients[!is.finite(quotients)] <- NA  # guard against a zero reference
  
  # Calculate median quotient for each sample
  median_quotients <- apply(quotients, 2, median, na.rm = TRUE)
  
  # Divide each sample by its median quotient
  mat_normalized <- t(t(mat) / median_quotients)
  return(mat_normalized)
}

# Apply PQN normalization
matrix_normalized <- normalize_pqn(final_matrix_raw)

# It's also good practice to handle zero values before log transformation.
# Values that are still NA after peak filling are treated the same way.
# We'll replace them with a very small non-zero value (e.g., half the minimum)
min_val <- min(matrix_normalized[matrix_normalized > 0], na.rm = TRUE)
matrix_normalized[is.na(matrix_normalized) | matrix_normalized == 0] <- min_val / 2

# Finally, perform a log2 transformation. This helps to stabilize variance
# and make the data more suitable for statistical tests.
matrix_log2 <- log2(matrix_normalized)

# Let's look at our final, analysis-ready matrix
head(matrix_log2)
```

-   **Action:** Run this block. You now have the final product of all our processing: `matrix_log2`. This is a fully processed, normalized, and log-transformed data matrix.
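
The effect of the half-minimum replacement and the log2 transformation is easy to see on a single toy vector. The values below are invented; the point is only that zeros and NAs become a small positive number, and that a fold change turns into a simple difference on the log2 scale.

``` r
# Toy intensities for one sample, including a zero and a missing value
x <- c(0, 1200, NA, 300, 4800)

# Smallest observed positive value, ignoring NA
min_pos <- min(x[x > 0], na.rm = TRUE)

# Replace zeros and NAs with half of that minimum
x_filled <- x
x_filled[is.na(x_filled) | x_filled == 0] <- min_pos / 2

x_log2 <- log2(x_filled)
print(round(x_log2, 3))

# A 4-fold difference (4800 vs 1200) becomes a difference of 2 log2 units
fold_difference <- x_log2[5] - x_log2[2]
print(fold_difference)
```

Half-minimum replacement is a simple convention rather than the only option; it is used here because it matches the approach in the script.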

#### **4. Expected Outcome**

The `matrix_log2` object is a numeric matrix where:

-   Rows are unique compounds, named by their m/z and retention time.

-   Columns are our 10 samples.

-   The values are the log2-transformed, normalized intensities.

-   There are no missing values (NA).

-   The data is now directly comparable across all samples.

#### **5. Verifiable “Proof”**

How can we prove that normalization worked? A `boxplot` is the perfect tool. Before normalization, we expect to see the median intensity (the black line in the middle of the box) vary between samples. After normalization, these medians should be almost perfectly aligned.

``` r
# --- 16. Verify the Normalization ---

# We need to reshape the data for ggplot2
# Before Normalization
df_before <- as.data.frame(final_matrix_raw) %>%
  mutate(feature = rownames(.)) %>%
  pivot_longer(-feature, names_to = "sample", values_to = "intensity")

# After Normalization (but before log transform for visual clarity)
df_after <- as.data.frame(matrix_normalized) %>%
  mutate(feature = rownames(.)) %>%
  pivot_longer(-feature, names_to = "sample", values_to = "intensity")

# Create the plots
plot_before <- ggplot(df_before, aes(x = sample, y = log10(intensity))) +
  geom_boxplot() +
  labs(title = "Before Normalization") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot_after <- ggplot(df_after, aes(x = sample, y = log10(intensity))) +
  geom_boxplot() +
  labs(title = "After PQN Normalization") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# We can use a package to arrange them side-by-side
library(patchwork)
normalization_plot <- plot_before + plot_after

# Display the plot
print(normalization_plot)

# Save the plot
ggsave("figures/04_normalization_verification.png", normalization_plot, width = 12, height = 6)
```

-   **Action:** Run this final block.

-   **Verification:** The side-by-side `boxplots` are our definitive proof.

    -   **The “Before” plot** will likely show that the boxes for each sample are at different heights, indicating different intensity distributions.

    -   **The “After” plot** should show a dramatic improvement. The black median lines for all 10 boxes should be aligned at almost exactly the same level. This is the visual confirmation that our normalization has corrected for the systematic technical differences between the runs.

### **Lesson 7: Summary & Status Check**

-   **Conceptually**, we understand the need to distill our complex feature list into a simple data matrix and the absolute necessity of normalizing this matrix to ensure biological comparability.

-   **Practically**, we have selected a representative feature for each compound group, built our final data matrix, and applied PQN normalization and log2 transformation.

-   **Crucially**, we have followed the “Trust, but Verify” principle by creating before-and-after boxplots that provide clear, visual proof of the success of our normalization step.

------------------------------------------------------------------------

## 🕊️Part 3: The Payoff - Statistical Analysis and Biological Interpretation🧶

------------------------------------------------------------------------

### Lesson 8: Univariate and Multivariate Statistics🤗

**Goal:** To analyze our final data matrix (`matrix_log2`) to identify which specific metabolites have significantly changed in abundance between the Control and Metformin-treated groups. We will use two complementary statistical approaches.

#### **1. Underlying Logic**

Our data matrix has many rows (thousands of metabolites) and is therefore “high-dimensional.” We can’t just look at it and see the patterns. We need two types of statistical tools:

1.  **Multivariate Analysis (The “Forest” View):** This approach looks at all metabolites at once to find the dominant patterns of variation in the data. Its primary purpose is **unsupervised exploration**: it reduces thousands of metabolite dimensions to a few axes without using the group labels. It answers the question: “Based on their overall metabolic profile, do my samples naturally group together by their condition (Control vs. Metformin)?” This is a powerful, unbiased first look at our data and a critical QC step. The most common method for this is **Principal Component Analysis (PCA)**.

    -   **How PCA Works:** PCA finds the “principal components,” which are new, artificial axes that capture the maximum amount of variation in the data. PC1 is the axis that explains the most variation, PC2 explains the second most, and so on. By plotting the samples on these new axes (a “scores plot”), we can see if the dominant source of variation in our experiment corresponds to our biological question.

    -   ***PCA (and other multivariate methods like PLS-DA) is for HOLISTIC PATTERN RECOGNITION AND QC. PCA helps confirm that the experiment worked.***

2.  **Univariate Analysis (The “Trees” View):** This approach looks at one metabolite at a time. For each and every row in our matrix, it performs a statistical test to ask: “Is the mean intensity of this metabolite in the Control group significantly different from its mean in the Metformin group?” This is how we find our list of individual “hits.”

    -   **The Tool:** We will use the same powerhouse package from our proteomics project: **`limma`**. It is well suited to this task because its moderated t-statistics borrow information across all metabolites, which gives more statistical power than running thousands of independent t-tests on a handful of replicates, and its results table includes p-values adjusted for multiple testing. The output will be a p-value, an adjusted p-value, and a log2 fold-change for every single metabolite.
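
Before running PCA on the real matrix, it can help to see it behave on data where the answer is known. The sketch below uses simulated values only (the sample names, feature names, and the size of the shift are all invented): half of the features are shifted upward in the “Metformin” samples, so PC1 should pick up that group difference.

``` r
set.seed(1)

# Toy log-intensity matrix: 6 samples (rows) x 20 features (columns)
group <- rep(c("Control", "Metformin"), each = 3)
x <- matrix(rnorm(6 * 20), nrow = 6,
            dimnames = list(paste0("s", 1:6), paste0("f", 1:20)))

# Simulate a treatment effect: shift the first 10 features in treated samples
x[group == "Metformin", 1:10] <- x[group == "Metformin", 1:10] + 4

# PCA with centering and unit-variance scaling, as in the lesson
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Fraction of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * var_explained[1:2], 1))

# PC1 scores per group: the two groups should sit at opposite ends
pc1_by_group <- split(pca$x[, "PC1"], group)
print(pc1_by_group)
```

The sign of PC1 is arbitrary, so either group may end up on the negative side; what matters is that the two groups do not overlap along that axis.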

#### **2. Practical Application: The Code**

We will now create our final R script for this project.

-   **Action:** In RStudio, create a new script in your scripts folder named `02_statistical_analysis.R`.

``` r
# --------------------------------------------------------------------------
# Script: 02_statistical_analysis.R
# Author: Your Name
# Date: 2025-09-03
# --------------------------------------------------------------------------

# Load necessary libraries
library(tidyverse)
library(limma)      # For univariate analysis
library(pheatmap)   # For plotting heatmaps
library(patchwork)  # For arranging plots

# --- Load the Final Data Matrix ---
# In a real workflow, we would save our final matrix from the previous script
# and load it here.
# e.g., save(matrix_log2, pdata, file = "data_processed/final_matrix.RData")
# load("data_processed/final_matrix.RData")

# For this continuous example, we will assume 'matrix_log2' and 'pdata' are in our environment.
```

``` r
# --- 1. Multivariate Analysis: Principal Component Analysis (PCA) ---

# PCA works on a matrix where samples are rows and features are columns,
# so we need to transpose our data matrix 't()'. We also remove any
# columns with zero variance, which can cause issues.
pca_input <- t(matrix_log2)
pca_input <- pca_input[, apply(pca_input, 2, var) > 0]

# Run the PCA
pca_result <- prcomp(pca_input, scale. = TRUE, center = TRUE)

# Extract the scores for the first two principal components
pca_scores <- as.data.frame(pca_result$x) %>%
  select(PC1, PC2)

# Combine the PCA scores with our experimental design info for plotting
pca_scores_with_meta <- merge(pca_scores, pdata, by.x = "row.names", by.y = "sample_name")

# Calculate the percentage of variance explained by each PC
percent_variance <- round(100 * pca_result$sdev^2 / sum(pca_result$sdev^2), 1)

# Create the PCA scores plot
pca_plot <- ggplot(pca_scores_with_meta, aes(x = PC1, y = PC2, color = sample_group)) +
  geom_point(size = 4, alpha = 0.8) +
  labs(
    title = "PCA of Metabolomic Profiles",
    x = paste0("PC1 (", percent_variance[1], "% variance)"),
    y = paste0("PC2 (", percent_variance[2], "% variance)"),
    color = "Experimental Group"
  ) +
  theme_bw() +
  coord_fixed() # Ensure the scaling of axes is equal

# Display the plot
print(pca_plot)
ggsave("figures/05_pca_plot.png", pca_plot, width = 7, height = 6)
```

-   **Action:** Run this block.

-   **Verification:** The PCA plot is our proof. A successful experiment will show a **clear separation** between the Control points and the Metformin points (drawn in two different colors), likely along the PC1 axis. This provides unbiased evidence that the Metformin treatment had a consistent effect on the overall metabolism of the cells. If the groups were all mixed together, it would suggest a failed experiment or a biological effect too small to dominate the variation.

Next comes the univariate analysis. This will be very familiar from our proteomics project.

``` r
# --- 2. Univariate Analysis with limma ---

# We use the exact same logic as in the proteomics workflow.

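# Create the design matrix.
# Fixing the factor levels guarantees the column order Control, Metformin.
# (The rows of pdata must be in the same order as the columns of matrix_log2.)
group <- factor(pdata$sample_group, levels = c("Control", "Metformin"))
design <- model.matrix(~ 0 + group)
colnames(design) <- c("Control", "Metformin")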

# Create the contrast matrix
contrast_matrix <- makeContrasts(
  Metformin_vs_Control = Metformin - Control,
  levels = design
)

# Fit the linear model
fit <- lmFit(matrix_log2, design)
fit2 <- contrasts.fit(fit, contrast_matrix)
fit_bayes <- eBayes(fit2)

# Extract the final results table
results_table <- topTable(fit_bayes, number = Inf, sort.by = "P")

# Let's inspect the top results
head(results_table)

# --- 3. Visualization: Volcano Plot ---

# Add a column for significance to the results table
results_table <- results_table %>%
  mutate(
    significance = case_when(
      logFC > 1 & adj.P.Val < 0.05 ~ "Upregulated in Metformin",
      logFC < -1 & adj.P.Val < 0.05 ~ "Downregulated in Metformin",
      TRUE ~ "Not Significant"
    )
  )

# Create the volcano plot
volcano_plot <- ggplot(results_table, aes(x = logFC, y = -log10(adj.P.Val), color = significance)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("Upregulated in Metformin" = "#d95f02", "Downregulated in Metformin" = "#1b9e77", "Not Significant" = "grey")) +
  labs(
    title = "Metformin vs. Control Treatment",
    subtitle = "Differentially Abundant Metabolites",
    x = "log2(Fold Change)",
    y = "-log10(Adjusted p-value)"
  ) +
  theme_bw() +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed")

# Display the plot
print(volcano_plot)
ggsave("figures/06_volcano_plot.png", volcano_plot, width = 8, height = 7)

# Finally, get our list of significant "hits"
significant_hits <- results_table %>%
  filter(significance != "Not Significant")

cat("\nNumber of significantly changed metabolite features:", nrow(significant_hits), "\n")
print(head(significant_hits))
```

-   **Action:** Run this block.

-   **Verification:** The volcano plot is our proof. It visualizes the results of our thousands of statistical tests. The points in the top-left (downregulated) and top-right (upregulated) corners are our **significant hits**. The `significant_hits` data frame is the final, tangible output of this analysis: a high-confidence list of the metabolite features that were most affected by the drug treatment.
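
The reason the volcano plot uses `adj.P.Val` rather than the raw p-value can be shown with a small simulation. Everything below is simulated: 1000 features with no true difference between two groups of five samples, tested with ordinary t-tests in base R (not `limma`), then adjusted with the Benjamini-Hochberg method, which is the default adjustment in `topTable`.

``` r
set.seed(42)

# 1000 features, 5 "control" vs 5 "treated" values each, with NO real effect
n_features <- 1000
p_raw <- replicate(n_features, t.test(rnorm(5), rnorm(5))$p.value)

# Raw p-values: a few percent fall below 0.05 purely by chance
n_raw_hits <- sum(p_raw < 0.05)

# Benjamini-Hochberg adjustment controls the false discovery rate
p_adj <- p.adjust(p_raw, method = "BH")
n_adj_hits <- sum(p_adj < 0.05)

cat("Raw p < 0.05:     ", n_raw_hits, "\n")
cat("Adjusted p < 0.05:", n_adj_hits, "\n")
```

Every “hit” among the raw p-values here is a false positive by construction, which is why thresholds on unadjusted p-values are not trustworthy when thousands of features are tested.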

### **Lesson 8: Summary & Status Check**

-   **Conceptually**, we understand the complementary nature of multivariate (PCA) and univariate (limma) analysis. PCA gives us the “forest” view, confirming our experiment worked, while limma gives us the “trees” view, identifying the individual significant metabolites.

-   **Practically**, we have successfully run a PCA and a full limma pipeline on our processed metabolomics data.

-   **Crucially**, we have followed the “Trust, but Verify” principle by generating two key plots:

    1.  The **PCA plot** verifies that our experimental groups are globally different.

    2.  The **Volcano plot** verifies our univariate analysis, clearly displaying the significant hits according to our chosen statistical thresholds.

We are now on the verge of the final discovery. We have a list of significant “features” (e.g., “M175.0234T3.45”). The next, and most challenging, lesson is to try and figure out what these features actually are.
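
Since the feature names encode the m/z and retention time, a first practical step is to split them back into numbers. A minimal sketch, assuming the `M<mz>T<rt>` naming built in Lesson 7; the second ID is a made-up example.

``` r
# Feature IDs in the "M<mz>T<rt>" format used for our row names
feature_ids <- c("M175.0234T3.45", "M203.0526T10.2")

# Capture the number after "M" and the number after "T"
parts <- regmatches(feature_ids,
                    regexec("^M([0-9.]+)T([0-9.]+)$", feature_ids))

parsed <- data.frame(
  feature = feature_ids,
  mz = as.numeric(vapply(parts, `[`, character(1), 2)),
  rt = as.numeric(vapply(parts, `[`, character(1), 3))
)

print(parsed)
```

The resulting `mz` column is what a mass-based database query would start from.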

------------------------------------------------------------------------

### Lesson 9: Metabolite Identification and Annotation (The Deeper Dive)🤠

#### **1. Goal**

***Our goal is to take the list of significant features from our statistical analysis and assign a putative chemical identity to them. This involves querying online databases with the highly accurate mass of our features to find potential matches. We must also understand the different levels of confidence in these identifications.***
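
The arithmetic behind such a query is short enough to sketch here. The snippet assumes singly charged positive ions and uses commonly tabulated monoisotopic masses for the charge carriers; the 5 ppm tolerance is an arbitrary example value, not a recommendation for any particular instrument.

``` r
# Masses (Da) added to the neutral molecule M by each charge carrier
# (proton, sodium ion, potassium ion; electron mass already accounted for)
adduct_mass <- c("[M+H]+" = 1.007276, "[M+Na]+" = 22.989218, "[M+K]+" = 38.963158)

# Measured m/z of the feature "M175.0234T3.45"
mz_measured <- 175.0234

# Candidate neutral masses, one per adduct hypothesis
neutral_mass <- mz_measured - adduct_mass
print(round(neutral_mass, 4))

# Mass window to search in a database at a given tolerance (in ppm)
ppm <- 5
search_window <- cbind(lower = neutral_mass * (1 - ppm / 1e6),
                       upper = neutral_mass * (1 + ppm / 1e6))
print(round(search_window, 4))
```

Each adduct hypothesis leads to a different neutral mass and therefore to a different set of database candidates, which is why the adduct assignment matters so much for identification.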

#### **2. Underlying Logic**

-   **The Challenge:** ***Unlike proteomics, where peptides are made from a simple 20-letter amino acid alphabet,*** the chemical space of metabolites is astronomically ***vast and diverse***. ***There is no simple “search engine” that can definitively identify a metabolite from its mass alone.***

-   **The Primary Clue: Accurate Mass.** Our most powerful piece of information is the ***very precise mass-to-charge ratio (m/z) that we measured.*** For a feature like “M175.0234T3.45”, the key is the mass: 175.0234. We can search ***databases*** to ask, “***What known biological molecules have a mass that is extremely close to this value?***”

-   **The Problem of Adducts:** Remember **`CAMERA`**? It told us that a feature might be the protonated form \[M+H\]⁺, or a sodium adduct \[M+Na\]⁺, etc. When we search a database, we must search for the mass of the **neutral molecule (M)**. Therefore, before searching, we must first “de-adduct” our measured m/z.

    -   If we believe our feature 175.0234 is the \[M+H\]⁺ form, the ***neutral mass is 175.0234 - 1.0073 (mass of a proton) = 174.0161.***

    -   If we believe it is the \[M+Na\]⁺ form, the neutral mass is 175.0234 - 22.9892 (mass of a sodium ion) = 152.0342.  
        This is why the **`CAMERA`** annotation step was so important.

        CAMERA (Collection of Algorithms for MEtabolite pRofile Annotation) doesn’t “know” in the strict sense—it infers based on **mass differences**, **intensity patterns**, and **co-elution behavior**. Here’s how:

        #### 1. **Co-elution and Retention Time**

        -   CAMERA starts by clustering features that elute at nearly the same retention time.

        -   The assumption: if two ions appear at the same time, they likely come from the same compound.

        #### 2. **Mass Differences Matching Known Adducts**

        -   It uses a predefined list of adducts (e.g., \[M+H\]⁺, \[M+Na\]⁺, \[M+K\]⁺) and their exact mass shifts.

        -   If two features differ by the mass of a sodium ion minus a proton (~21.9819 Da), and they co-elute, CAMERA flags them as possible \[M+H\]⁺ and \[M+Na\]⁺ forms of the same molecule.

        #### 3. **Isotope Patterns**

        -   It checks for isotopic spacing (e.g., 1.00335 Da for C13) and intensity ratios.

        -   This helps distinguish monoisotopic peaks from their heavier isotopologues.

        #### 4. **Intensity Ratios and Charge States**

        -   Adducts often have characteristic intensity patterns. For example, \[M+H\]⁺ is often, though not always, more abundant than \[M+Na\]⁺.

        -   CAMERA can use intensity information, such as the correlation of peak shapes between co-eluting features, to refine its grouping.

        ### 🧠 Why “De-adducting” Is Necessary

        Databases like HMDB or METLIN catalogue compounds by their **neutral mass (M)**, not by the m/z of one particular adduct. So CAMERA helps you:

        -   Group all adducts and isotopes of a compound.

        -   Infer the neutral mass by subtracting the adduct mass.

        -   Output a cleaned list of neutral masses for annotation.

        ### 🧰 Example

        Let’s say you detect three features:

        -   m/z 180.065 (RT 5.2 min)

        -   m/z 202.047 (RT 5.2 min)

        -   m/z 181.068 (RT 5.2 min)

        CAMERA sees:

        -   180.065 could be \[M+H\]⁺

        -   202.047 is ~21.98 Da higher → likely \[M+Na\]⁺

        -   181.068 is ~1.003 Da higher → likely C13 isotope of \[M+H\]⁺

        It bundles them, assigns a neutral mass of ~179.058 (180.065 minus the proton mass), and tags the adducts accordingly.

-   **Levels of Identification Confidence:** The Metabolomics Standards Initiative (MSI) has defined a clear, tiered system for reporting identifications. It is crucial to be honest about our level of confidence.

    -   **Level 4: Unidentified Compound.** This is our current state: a feature with a mass and RT, but no name.

    -   **Level 3: Putatively Characterized Compound Class.** (e.g., based on spectral data, we might guess it’s a type of sugar).

    -   **Level 2: Putatively Annotated Compound.** This is our goal for today. We get a match based on accurate mass from a database (e.g., a neutral mass of 174.1117 matches L-Arginine to within a few ppm). This is a useful hypothesis, but not a confirmation, because isomers share the same exact mass.

    -   **Level 1: Confidently Identified Compound.** This is the gold standard. To achieve this, you must match **two** orthogonal properties against an authentic standard, most commonly:

        1.  Accurate mass and retention time against a ***pure, purchased chemical standard run on the exact same machine.***

        2.  Accurate mass and its MS/MS fragmentation spectrum against the spectrum of a pure standard acquired under the same conditions.

For our untargeted bioinformatics analysis, we will be operating at **Level 2**. We are generating high-quality hypotheses.
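The de-adduction arithmetic described above can be sketched in a few lines of base R. This is only an illustration of the logic, not what `CAMERA` runs internally: the names `adduct_shift`, `neutral_mass()` and `is_h_na_pair()` are hypothetical helpers introduced here, the shifts cover singly charged positive ions only, and the two m/z values in the last call are made up.

``` r
# Mass (Da) that each singly charged adduct adds to the neutral molecule M.
# These are ion masses, i.e. the atom mass minus one electron.
adduct_shift <- c(
  "[M+H]+"  = 1.007276,
  "[M+Na]+" = 22.989221,
  "[M+K]+"  = 38.963158
)

# Hypothetical helper: neutral mass implied by a measured m/z and an assumed adduct
neutral_mass <- function(mz, adduct) {
  stopifnot(adduct %in% names(adduct_shift))
  unname(mz - adduct_shift[adduct])
}

# The feature from the text under two competing adduct assumptions
round(neutral_mass(175.0234, "[M+H]+"), 4)   # 174.0161
round(neutral_mass(175.0234, "[M+Na]+"), 4)  # 152.0342

# Two co-eluting ions are a candidate [M+H]+ / [M+Na]+ pair when their
# m/z difference equals Na+ minus H+ (about 21.9819 Da) within a tolerance.
na_minus_h <- unname(adduct_shift["[M+Na]+"] - adduct_shift["[M+H]+"])
is_h_na_pair <- function(mz_low, mz_high, tol = 0.005) {
  abs((mz_high - mz_low) - na_minus_h) <= tol
}

is_h_na_pair(180.0655, 202.0474)  # illustrative m/z values; TRUE
```

Note how much the inferred neutral mass depends on the adduct assumption: the same feature points to two completely different candidate molecules, which is why the adduct annotation has to come before the database search.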

#### **3. Practical Application: The Code**

For this task, we will use `MetaboAnalystR`, the R package behind the MetaboAnalyst web platform, which bundles compound and pathway libraries for metabolomics.

``` r
# Install the MetaboAnalystR package from GitHub (it has many dependencies,
# so this might take a while).
if (!requireNamespace("MetaboAnalystR", quietly = TRUE)) {
  if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
  remotes::install_github("xia-lab/MetaboAnalystR", build = TRUE)
}

# Add this to our '02_statistical_analysis.R' script
library(MetaboAnalystR)
```

-   **Action:** Install `MetaboAnalystR`. It’s a large and powerful package. Then, add it to the library section of your `02_statistical_analysis.R` script.

Now, let’s take our list of significant hits and prepare them for searching.

``` r
# --- 4. Metabolite Annotation and Identification ---

# Let's work with our 'significant_hits' data frame.
# It has row names like "M175.0234T3.45".
# We need to extract the m/z values from these names.
mz_values <- as.numeric(stringr::str_extract(rownames(significant_hits), "(?<=M)\\d+\\.\\d+"))

# Create the query table for MetaboAnalystR. Its peak-list reader expects the
# column names "m.z" and "p.value", and it reads from a file, not a data frame.
# (mummichog works best when it sees ALL features with their p-values, so that
# the non-significant ones can act as background.)
query_df <- data.frame(
  m.z = mz_values,
  p.value = significant_hits$adj.P.Val
)
write.table(query_df, "mummichog_query.txt", sep = "\t", quote = FALSE, row.names = FALSE)

# --- Perform the Database Search ---
# NOTE: the calls below follow the MetaboAnalystR 3.x/4.x vignettes. Signatures
# have changed between releases, so check ?UpdateInstrumentParameters and
# ?PerformPSEA for the version you installed.

# Initialize the analysis object
mSet <- InitDataObjects("mass_all", "mummichog", FALSE)

# Describe the input and the instrument: the columns are m/z and p-value ("mp"),
# 5 ppm mass tolerance, positive ion mode.
mSet <- SetPeakFormat(mSet, "mp")
mSet <- UpdateInstrumentParameters(mSet, 5.0, "positive", "yes", 0.02)

# Load and check our query data
mSet <- Read.PeakListData(mSet, "mummichog_query.txt")
mSet <- SanityCheckMummichogData(mSet)

# Run mummichog: it matches each m/z against the expected adducts of known
# compounds (the de-adduction step) and tests pathways for enrichment.
mSet <- SetPeakEnrichMethod(mSet, "mum", "v2")
mSet <- SetMummichogPval(mSet, 0.05)
mSet <- PerformPSEA(mSet, "hsa_mfn", "current", 3, 100) # human library, 100 permutations

# The m/z-to-compound matches are written to a CSV file in the working directory
# (file name as of MetaboAnalystR 3.x; adjust if your version differs).
annotation_results <- read.csv("mummichog_matched_compound_all.csv")
```

Run this block. The `PerformPSEA` function is doing the heavy lifting. It downloads the requested compound and pathway library, converts each m/z in your list into candidate compounds by trying the expected adduct forms (the de-adduction step), keeps the matches that fall within the 5 ppm mass tolerance, and stores the results.
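To make the 5 ppm matching step concrete, here is a small base-R sketch that searches a toy reference table by neutral mass. It is not the algorithm `MetaboAnalystR` uses: `ref`, `ppm_error()` and `match_mass()` are hypothetical names, the table holds only four monoisotopic masses for illustration, and every query is assumed to be an \[M+H\]+ ion.

``` r
# Toy reference table of neutral monoisotopic masses (Da)
ref <- data.frame(
  name = c("L-Glutamine", "L-Arginine", "Succinic acid", "Citric acid"),
  mono_mass = c(146.0691, 174.1117, 118.0266, 192.0270),
  stringsAsFactors = FALSE
)

# Signed mass error in parts per million
ppm_error <- function(observed, theoretical) {
  (observed - theoretical) / theoretical * 1e6
}

# Hypothetical helper: de-adduct the measured m/z, then keep every reference
# compound whose neutral mass lies within the ppm tolerance.
match_mass <- function(mz, adduct_shift = 1.007276, tol_ppm = 5) {
  neutral <- mz - adduct_shift
  err <- ppm_error(neutral, ref$mono_mass)
  keep <- abs(err) <= tol_ppm
  data.frame(
    name = ref$name[keep],
    mono_mass = ref$mono_mass[keep],
    ppm = round(err[keep], 2),
    stringsAsFactors = FALSE
  )
}

match_mass(147.0763)  # one candidate within 5 ppm: L-Glutamine
match_mass(150.0000)  # zero rows: nothing in the toy table is close enough
```

A real database holds many thousands of compounds, so a single m/z often returns several candidates, including isomers with identical mass. That is the reason these matches stay putative.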

#### **4. Expected Outcome**

The `annotation_results` data frame is our table of putative identifications. Column names differ between `MetaboAnalystR` versions (check them with `colnames(annotation_results)`), but the table carries the following information:

-   The m/z from our query list.

-   The matched compound, given as a database ID (for example a KEGG compound ID) that can be mapped to a common name such as “L-Arginine”.

-   The adduct or ion form that `MetaboAnalystR` assumed to get the match (e.g., “\[M+H\]+”).

-   The mass difference between the observed and the expected m/z.

#### **5. Verifiable “Proof”**

How can we trust these annotations? We can perform a manual check on one of the top hits.

**Action:** Inspect the output table, then perform the manual check described below for one of the top hits. Manually confirming the math for a top hit is a fundamental skill and a crucial verification step.


    # Let's look at the top of our annotation results table.
    # Column names depend on the MetaboAnalystR version, so list them first.
    print(colnames(annotation_results))
    print(head(annotation_results))

    # --- Manual Verification of the Top Hit ---
    # Note the query m/z, the matched compound, and the assumed adduct in this row.
    top_hit <- annotation_results[1, ]
    print(top_hit)

We will now go to an external database, like the Human Metabolome Database (HMDB), and manually verify this. Let’s say the top hit is “L-Glutamine”.

1.  Search HMDB for “L-Glutamine”.

2.  On the HMDB page, find the exact monoisotopic mass of the neutral molecule. (For L-Glutamine, this is 146.0691 Da.)

3.  Now, let’s calculate what the m/z of the \[M+H\]+ adduct should be. Mass of a proton is ~1.007276 Da. Expected \[M+H\]+ m/z = 146.0691 + 1.007276 = 147.0764.

4.  Compare this to our measured m/z. Let’s say our feature was M147.0763. The difference is about 0.0001 Da, which is roughly 0.5 ppm (0.000076 / 147.0764 × 10⁶), comfortably inside our 5 ppm tolerance.

This manual calculation, confirming that our measured mass is within a few ppm of the theoretical mass for the adduct suggested by the software, is our “Trust, but Verify” step. It gives us confidence that the algorithm is working correctly.

### **Lesson 9: Summary & Status Check**

-   **Conceptually**, we understand the immense challenge of metabolite identification and the crucial difference between a “putative annotation” (Level 2) and a “confident identification” (Level 1). We know that our primary tool is matching the accurate neutral mass against online databases.
-   **Practically**, we have used the powerful `MetaboAnalystR` package to perform an automated database search on our list of significant features.
-   **Crucially**, we have followed the “Trust, but Verify” principle by defining a clear procedure for manually cross-referencing a top hit against an external database like HMDB to confirm the mass calculation.

We have now transformed our anonymous list of features into a list of putatively named metabolites. We are finally ready for the grand finale: taking these named metabolites and discovering which biological pathways are being altered by our drug treatment.

------------------------------------------------------------------------

### Lesson 10: Pathway and Enrichment Analysis🪂

**Goal:** To take our list of putatively identified, significantly changed metabolites and determine if they are statistically over-represented ***in any known metabolic pathways***. This will provide the ultimate biological story, explaining how Metformin is affecting the cell’s metabolism.

#### **1. Underlying Logic**

This concept is identical to the enrichment analysis we performed in proteomics.

-   **The Problem:** We might have a list of 50 significant metabolites. Simply reading the list (e.g., “Glutamine is down, Citrate is up, Succinate is up…”) doesn’t immediately tell us the story.

-   **The Question:** Are these changes random, or ***are they concentrated in a specific, coordinated biological process?***

-   **The Method (Metabolite Set Enrichment Analysis - MSEA):** We use a statistical test (the **`hypergeometric test`**, just like before) to check if our list of 50 “hits” contains a surprisingly high number of metabolites belonging to a predefined “Metabolite Set” (like the “Citric Acid (TCA) Cycle” pathway). If the probability of seeing that many hits in that pathway by random chance is very low (i.e., a small p-value), ***we can conclude that the pathway is “significantly enriched” or “significantly impacted.”***

The `MetaboAnalystR` package we used in the last lesson has this functionality built-in, making it a seamless next step.
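As a rough illustration of the hypergeometric test behind MSEA, the base-R sketch below computes the probability of seeing at least the observed number of hits in one pathway by chance. All four counts are invented for the example; they are not results from this project.

``` r
# Illustrative counts (assumptions, not project results)
N <- 1500  # metabolites in the background library
K <- 20    # of those, metabolites that belong to the pathway of interest
n <- 50    # metabolites in our list of significant hits
k <- 5     # significant hits that fall inside the pathway

# Number of pathway members we would expect among n random metabolites
expected_hits <- n * K / N

# P(X >= k): the upper tail of the hypergeometric distribution.
# phyper() returns P(X > q) with lower.tail = FALSE, hence q = k - 1.
p_enrich <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on the 2x2 table
tab <- matrix(c(k, K - k, n - k, N - K - (n - k)), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

cat("Expected hits:", round(expected_hits, 2), "\n")
cat("Hypergeometric p-value:", signif(p_enrich, 3), "\n")
cat("Fisher's exact p-value:", signif(p_fisher, 3), "\n")
```

With fewer than one hit expected by chance and five observed, the p-value is small, which is exactly the situation the text calls “significantly enriched.”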

### **Practical Application: The Code in Chunks**

We will continue in our `02_statistical_analysis.R` script.

#### **Chunk 1: Preparing the Data for Enrichment**

**Explanation:** The enrichment analysis function doesn’t need our full data table. It just needs a simple list of the compounds that we found were significant. We will extract these from the annotation results we got in the previous lesson and drop duplicates, because our annotation can return multiple hits for one mass.

``` r
# --- 6. Pathway and Enrichment Analysis ---

# The enrichment test works on a plain vector of compound identifiers.
# NOTE: "Matched.Compound" is the KEGG-ID column written by mummichog in
# MetaboAnalystR 3.x; adjust it if colnames(annotation_results) differs.
compound_ids <- unique(annotation_results$Matched.Compound)

# Start a fresh object for the pathway module (over-representation analysis)
# and map our compound IDs onto the package's compound library.
mSet <- InitDataObjects("conc", "pathora", FALSE)
mSet <- Setup.MapData(mSet, compound_ids)
mSet <- CrossReferencing(mSet, "kegg")
mSet <- CreateMappingResultTable(mSet)

# Next, we specify which metabolite sets we want to test against.
# We will use the pathway library from KEGG.
mSet <- SetKEGG.PathLib(mSet, "hsa", "current") # hsa = Homo sapiens
mSet <- SetMetabolomeFilter(mSet, FALSE)        # no custom reference metabolome

# Now, we are ready to run the enrichment analysis.
```

-   **Action:** Add this setup chunk to your script and run it. We have now mapped our compounds and told `MetaboAnalystR` that we ***want to test our data against the library of human KEGG pathways.***

#### **Chunk 2: Running the Enrichment Analysis**

**Explanation:** Now we call the main function to perform the analysis. This function takes the compounds that were successfully mapped, cross-references them against the KEGG pathways, and performs the `hypergeometric test` for each pathway.

``` r
# Run the over-representation analysis with the hypergeometric test.
# "rbc" selects relative-betweenness centrality for the pathway impact score.
mSet <- CalculateOraScore(mSet, "rbc", "hyperg")

# The results are now stored within our mSet object.
```

-   **Action:** Run this command. `MetaboAnalystR` is now performing the statistical tests.

#### **Chunk 3: Extracting and Viewing the Results**

**Explanation:** The results are stored in a table inside our `mSet` object. We need to extract this table to view it. The table will be ranked by significance, showing us the most impacted pathways at the top.

``` r
# Extract the results table from the mSet object
# (slot and column names as of MetaboAnalystR 3.x; verify with names(mSet$analSet))
enrichment_results <- mSet$analSet$ora.mat

# Let's view the most important columns of the results table
# - 'Total' is the total number of metabolites in the pathway.
# - 'Hits' is how many of our significant metabolites are in that pathway.
# - 'Raw p' is the raw p-value from the hypergeometric test.
# - 'FDR' is the false discovery rate (adjusted p-value), which is most important.
print(head(enrichment_results[, c("Total", "Hits", "Raw p", "FDR")]))
```

-   **Action:** Run this chunk. The table printed in your console is the main result. You can now read the row names to see which pathways were most significantly altered. ***For our Metformin project, we would expect to see pathways like “Central carbon metabolism” or “Amino acid metabolism.”***
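The FDR column is a multiple-testing adjustment of the raw p-values. The sketch below shows the Benjamini-Hochberg adjustment with base R's `p.adjust()`. That `MetaboAnalystR` computes its FDR column in exactly this way is an assumption here, and the pathway labels and p-values are invented for illustration.

``` r
# Invented raw p-values for five pathways (not project results)
raw_p <- c(pathway_A = 0.001, pathway_B = 0.008, pathway_C = 0.02,
           pathway_D = 0.2, pathway_E = 0.5)

# Benjamini-Hochberg: each sorted p-value is scaled by (number of tests / rank),
# then made monotone so a larger raw p never gets a smaller adjusted value.
fdr <- p.adjust(raw_p, method = "BH")
print(round(fdr, 4))

# Pathways that survive a 5% false discovery rate
significant_pathways <- names(fdr)[fdr < 0.05]
print(significant_pathways)
```

The adjusted values are always at least as large as the raw ones, which is why a pathway with a raw p-value just under 0.05 can drop out once every tested pathway is taken into account.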

#### **Chunk 4: Visualizing the Enrichment Results**

**Explanation:** A table of numbers is good, but a plot is much better for communication and interpretation. `MetaboAnalystR` draws a pathway overview plot in which ***each pathway is a circle***, positioned by its significance and its pathway impact. This provides an immediate, intuitive view of the most important biological findings.

``` r
# Plot the pathway analysis overview.
# This function is built into MetaboAnalystR for easy visualization.
# Arguments as of MetaboAnalystR 3.x (check ?PlotPathSummary): show.grid,
# image name prefix, format, dpi, width. The package appends
# "dpi<dpi>.<format>" to the prefix when it names the file.
mSet <- PlotPathSummary(mSet, FALSE, "figures/08_pathway_enrichment_plot_", "png", 300, width = 8)

# The plot will be saved to the 'figures' folder.
# Each circle is one pathway: the y-axis shows -log10(p-value) and the x-axis
# shows the pathway impact. Higher circles are more significant; the circle
# color tracks the p-value and the circle size tracks the impact.
```

-   **Action:** Run this final command. Go to your figures folder and open the new `PNG` file.

-   **Verification:** The plot itself is the proof. ***You have a clear figure that summarizes the pathway-level story of your experiment.*** It visually confirms the findings from the results table. For example, if “Citric Acid (TCA) Cycle” is the most significant pathway in the plot, you have evidence consistent with Metformin acting on the cell’s central energy production.

### **Grand Conclusion of the Entire Metabolomics Project**

Let’s synthesize the story from this final lesson.

1.  **From Hits to Names:** In Lesson 9, we turned our significant feature list (e.g., “M119.0339…”, the \[M+H\]+ ion of succinic acid in positive mode) into a list of putative metabolite names (e.g., “Succinic acid”).

2.  **From Names to Pathways:** In this lesson, we took that list of names and discovered they weren’t random. Our statistical analysis had shown, for instance, that “Succinic acid,” “Citric acid,” “Malic acid,” and “Fumaric acid” were all significantly upregulated, and the enrichment analysis showed that they cluster in the same pathway.

3.  **The Biological Story:** The enrichment plot tells us that these metabolites are not just a random collection; they are all key players in the **“Citric Acid (TCA) Cycle.”**

**The Final Hypothesis:**  
“Our untargeted metabolomics analysis reveals that Metformin treatment significantly perturbs central carbon metabolism in `HepG2` cells. We observed a statistically significant enrichment of the Citric Acid (TCA) Cycle pathway, driven by the coordinated upregulation of multiple key cycle intermediates. This suggests that Metformin’s anti-cancer effects in this model may be mediated by altering the cell’s fundamental energy production pathways.”