# Network Pharmacology Tutorial

JAYZ (The University OF Myself)

# **🫡Network Pharmacology🎃**

------------------------------------------------------------------------

## **🤗Part 1: The Conceptual Foundations🤖**

------------------------------------------------------------------------

### **Lesson 1: The Core Concepts - Beyond a Single Target😏**

**Goal:** ***`To build a rock-solid understanding of the central idea behind network pharmacology and to define the key terms we will use throughout this entire field`.***

#### **The Old Idea: The “Magic Bullet”**

For over a century, the dominant idea in pharmacology was the “magic bullet,” a term coined by the great scientist Paul Ehrlich.

-   **The Concept:** A perfect drug is like a sniper’s bullet. It has **one specific target** in the body (usually a single protein/enzyme), ***it hits that target with high precision, and this single action cures the disease.*** All other interactions are considered “off-target effects” or side effects.
-   **The Analogy:** A light switch. To turn on the lights in a room, you flip one specific switch. You don’t need to touch anything else.
-   **An Example:** A highly specific kinase inhibitor might be designed to block the activity of only one kinase that is known to be overactive in a cancer cell.

**The Problem with This Simple Model:** While beautifully simple, this model often fails to explain reality.

1\. **Complex Diseases:** Most chronic diseases, like cancer, heart disease, or Alzheimer’s, are ***not caused by a single faulty protein***. They are **network diseases**, arising from the breakdown of an entire system with many interconnected parts.

2\. **Drug Promiscuity:** Most drugs are not perfect snipers. They are more like shotguns. ***A single drug molecule often binds to multiple, sometimes dozens of, different proteins in the cell.***

#### **The New Idea: Network Pharmacology**

Network pharmacology embraces this complexity. It starts with a completely different set of assumptions.

-   **The Concept:** A disease is a **disruption of a network**. A drug works by **re-tuning or perturbing that network**. ***The therapeutic effect comes from the drug’s collective action on multiple targets, which then ripples through the interconnected biological system to restore a healthier state.***
-   **The Analogy:** A sound mixing board in a recording studio. A song (a healthy cell) sounds good because dozens of faders (proteins) are all at the right level. In a disease state, many of these faders are in the wrong positions, making the song sound terrible. A good drug is not a single switch; it’s the sound engineer’s hand **moving a coordinated set of faders** to bring the song back into harmony. Some faders are moved a lot, some are moved a little, but it’s the *collective* change that fixes the song.
-   **The Goal:** The goal of network pharmacology is to **create a map of this mixing board** and to predict which faders a specific drug is likely to move.

#### **The Key Terms: Our Cast of Characters**

To tell this story, we need to be precise about our terminology. There are four main “characters” in every network pharmacology analysis.

1.  **The Drug / Compound(s):**
    -   **Definition:** The therapeutic agent we are studying.
    -   **Examples:**
        -   A single, well-defined small molecule like **Aspirin**.
        -   A natural product like **Quercetin** (a chemical found in apples and onions).
        -   A complex mixture, like the dozens of different chemical compounds found in a **Ginseng extract**. In this case, our “drug” is actually a list of many different molecules.
2.  **The Drug Target(s):**
    -   **Definition:** A protein in the body that the drug physically binds to or interacts with, causing a change in the protein’s activity.
    -   **Our Goal:** To find all the known or predicted targets for our drug.
    -   **Example:** A key target of Aspirin is the protein Cyclooxygenase-2 (COX-2).
3.  **The Disease:**
    -   **Definition:** The specific pathological condition we want to treat.
    -   **Examples:** Colon Cancer, Alzheimer’s Disease, Influenza.
4.  **The Disease Target(s):**
    -   **Definition:** A gene or its corresponding protein that is known to be implicated in the disease. ***This could be a gene that is frequently mutated in the disease, or a protein whose activity is abnormally high or low***.
    -   **Our Goal:** To find a comprehensive list of all genes associated with our disease.
    -   **Example:** The gene *APC* is a famous tumor suppressor that is a key disease target in Colon Cancer.

**The Central Hypothesis of Network Pharmacology:** We can formally state the core idea like this: **An effective drug works by modulating a network of its targets, and this network significantly overlaps with the network of targets known to be involved in the disease.**

Our entire practical workflow will be about finding these two sets of targets and then analyzing their overlap and their connections.

### **Lesson 1: Summary & Status Check**

-   **Conceptually**, we have made the critical shift from the simplistic “magic bullet” model to the more realistic **“network” model** of drug action. We understand that diseases are complex network perturbations and that drugs act by modulating these networks.
-   We have clearly defined our **four main characters**: the Drug, the Drug Targets, the Disease, and the Disease Targets.
-   **Crucially**, we have framed the entire field around a central, testable hypothesis: the idea that the drug’s target network must intersect with the disease’s target network.

We now understand *why* we are doing this and *what* we are looking for. The next logical step is to learn about the specific bioinformatics databases where we can find the information about these “characters.”

------------------------------------------------------------------------

### **Lesson 2: The Building Blocks - The Data You Need🤫**

**Goal:** ***`To identify the key online databases and web servers that provide the three essential lists of information needed for any network pharmacology project: the drug's targets, the disease's targets, and the protein interaction network that links them.`***

#### **1. Finding the Drug Targets**

**The Challenge:** We have a drug (e.g., Quercetin). How do we find the list of proteins it is likely to interact with in the human body? ***We use a combination of databases that store experimental data and web servers that make predictions.***

**Key Resource 1: `DrugBank`**

-   **What it is:** A comprehensive, expert-curated database that is like the Wikipedia of drugs. For thousands of approved and experimental drugs, it contains a wealth of information, including their known protein targets.

-   **How to use it:** You go to the DrugBank website and simply search for your drug (e.g., “Aspirin”). The entry will have a “Targets” section listing the proteins it is known to bind to.

-   **When to use it:** This is the best source for **well-studied, approved drugs**.

**Key Resource 2: `STITCH`**

-   **What it is:** A database of known and predicted **chemical-protein interactions**. It’s broader than `DrugBank` and includes many experimental chemicals and natural products, not just clinical drugs. It aggregates data from experiments, databases, and text-mining of scientific literature.

-   **How to use it:** You search for your chemical compound. The results show a network of its interacting proteins. ***Each interaction has a confidence score***.

-   **When to use it:** Excellent for **natural products** or **experimental compounds** that might not be in `DrugBank`.

**Key Resource 3: `SwissTargetPrediction` (A Predictive Tool)**

-   **What it is:** This is a fundamentally different and incredibly powerful tool. It is a **web server**, not just a database. It answers the question: “I have a new or unstudied chemical. ***Based on its 2D/3D chemical structure, what proteins is it likely to bind to?”***

-   **How it works (The Logic of “Chemical Similarity”):** The core principle is that **structurally similar molecules often have similar biological functions**. The server has a massive ***database of known ligands and their targets***. When you submit your molecule’s structure, ***it rapidly calculates its similarity to all the known ligands in its database***. If your molecule is 95% similar to a known inhibitor of the protein EGFR, the server will predict that EGFR is a probable target for your molecule.

-   **How to use it:** You provide the chemical structure of your molecule (usually as a SMILES string, which is a text-based way of representing a chemical structure). The server returns a ranked list of the most probable protein targets.

-   **When to use it:** This is **essential for novel compounds, natural products, or any molecule** where the experimental target data is limited. This is the main tool for hypothesis generation.
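The “chemical similarity” logic described above can be illustrated with a toy calculation. The sketch below compares made-up binary fingerprints using the Tanimoto coefficient, a commonly used similarity measure for such fingerprints. The bit vectors, ligand names, and target labels are all hypothetical, and this is not a reproduction of how SwissTargetPrediction actually scores molecules.

``` r
# Toy illustration of similarity-based target prediction.
# All fingerprints below are invented bit vectors, not real chemistry.
tanimoto <- function(a, b) {
  # Shared "on" bits divided by bits that are "on" in either molecule
  sum(a & b) / sum(a | b)
}

# Hypothetical fingerprint of the molecule we want to characterise
query <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Hypothetical library of ligands whose targets are already known
known_ligands <- list(
  ligand_A = c(1, 1, 0, 1, 0, 1, 0, 0),  # imagined inhibitor of target X
  ligand_B = c(0, 0, 1, 0, 1, 0, 0, 1),  # imagined inhibitor of target Y
  ligand_C = c(1, 0, 0, 1, 0, 1, 1, 1)   # imagined inhibitor of target Z
)

# Compare the query against every known ligand and rank by similarity
similarity <- sapply(known_ligands, tanimoto, a = query)
ranked <- sort(similarity, decreasing = TRUE)
print(round(ranked, 2))
```

In this toy case the query is most similar to `ligand_A`, so target X would be ranked as its most probable target.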

#### **2. Finding the Disease Targets**

**The Challenge:** We have a disease (e.g., Colon Cancer). How do we find a comprehensive list of all the genes and proteins that are known to be involved in this disease?

**Key Resource 1: `GeneCards`**

-   **What it is:** A “human gene compendium.” It is one of the best and most user-friendly starting points. It integrates information from over 150 other databases.

-   **How to use it:** ***You search for your disease (e.g., “colon cancer”). GeneCards provides a single, unified “Gene List” associated with that disease, complete with scores indicating the strength of the association.***

-   **When to use it:** **Almost always.** It is the perfect tool to get a comprehensive, well-documented list of disease-associated genes.

**Key Resource 2: `OMIM` (Online Mendelian Inheritance in Man)**

-   **What it is:** A highly curated catalog of human genes and genetic disorders. It is more ***focused on the genetic and hereditary aspects of disease.***

-   **How to use it:** You search for the disease. It provides detailed text summaries of the disease’s genetic basis and lists the key genes involved.

-   **When to use it:** Excellent for diseases with a strong known genetic component.

**Key Resource 3: `DisGeNET`**

-   **What it is:** A database specifically focused on **gene-disease associations**. It systematically collects and scores evidence from scientific literature and other databases.

-   **How to use it:** You can search by disease to get a list of associated genes, or by gene to see all the diseases it is linked to.

-   **When to use it:** A very good, systematic alternative or complement to GeneCards for generating a high-quality disease gene list.

#### **3. The Blueprint that Connects Everything: The PPI Network**

**The Challenge:** We will soon have two separate lists: drug targets and disease targets. How do we know how they are related to each other functionally? ***The answer is that we need a “map” of the human cell’s proteome that shows how all proteins are connected.***

**The Key Resource: `STRING` Database**

-   **What it is:** We’ve used this before! STRING is a database of known and predicted **Protein-Protein Interactions (PPIs)**. It is the functional “scaffolding” of the cell.

-   **How it works:** It aggregates evidence from many sources:

    -   Experimental data (e.g., two proteins were physically pulled out of a cell together).

    -   Co-expression data (e.g., two genes are always turned on at the same time).

    -   Text mining of millions of scientific papers.

-   **Its Role in Network Pharmacology:** The PPI network is the **universe** in which our analysis takes place. ***It’s the map that allows us to see if the drug targets are “close” to the disease targets***. We can ask questions like, “Is my drug target, Protein A, directly interacting with a known disease target, Protein B?” or “Does my drug target sit ‘upstream’ in a signaling pathway that controls many disease targets?”

### **Lesson 2: Summary & Status Check**

-   **Conceptually**, we have identified the three distinct categories of information that form the “building blocks” of our analysis. We know that we need data on drug-target interactions, gene-disease associations, and the underlying protein-protein interaction network.

-   **Practically**, we have learned the names and purposes of the key, high-quality bioinformatics resources that provide this information. We now have a “shopping list” of where to go to get our data:

    -   **For Drug Targets:** Start with **SwissTargetPrediction** (for prediction) and **STITCH/DrugBank** (for known targets).

    -   **For Disease Targets:** Start with **GeneCards**.

    -   **For the Network Context:** Use the **STRING** database.

We are no longer just talking about abstract concepts. We now have a concrete plan for data acquisition. We are ready to start our practical project by visiting these websites and collecting our initial lists.

------------------------------------------------------------------------

### **Lesson 3: The Blueprint - The Protein-Protein Interaction (PPI) Network🫥**

**Goal:** To understand the structure and meaning of a PPI network, what the “nodes” and “edges” represent, and how this network serves as the fundamental framework for linking drug targets to disease targets.

#### **1. Core Concepts: The Network as a Map**

-   **What is a Network (or Graph)?**

    -   In bioinformatics, a network is a simple but powerful way to represent relationships between biological entities. It consists of two parts:

        1.  **Nodes (or Vertices):** These are the individual “things” we are studying. In a PPI network, **each node is a unique protein**.

        2.  **Edges (or Links):** These are the lines that connect the nodes. In a PPI network, **an edge between two nodes means there is evidence that those two proteins interact.**

-   **What Does an “Interaction” Mean?**

    -   This is a critical point. An edge in a PPI network can represent many different kinds of relationships, ***and the STRING database (our main source) is brilliant because it captures this diversity. An edge could mean:***

        -   **Physical Binding:** The two proteins physically stick to each other to form a complex. (This is the classic, strongest definition).

        -   **Functional Association:** The two proteins are part of the same signaling pathway, and one activates the other (e.g., a kinase phosphorylating its substrate).

        -   **Co-expression:** The genes for the two proteins are consistently turned on and off at the same time across many experiments.

        -   **Genomic Proximity:** The genes for the two proteins are located next to each other on the chromosome (common in bacteria).

        -   **Literature Co-mention:** The two proteins are frequently mentioned together in the abstracts of scientific papers.

-   **The PPI Network as the “Interactome”**

    -   The complete set of all possible protein-protein interactions in an organism is called its **“interactome.”**

    -   The STRING database is our best attempt at building a comprehensive map of the human interactome. It is not complete and is constantly being updated as new research is published, but it is our essential blueprint.
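To make the idea of nodes and edges concrete, here is a minimal sketch of how a small PPI network might be stored in R. The proteins (`P1` to `P6`), the edges, and the evidence labels are hypothetical placeholders; a real network would come from a database such as STRING.

``` r
# A toy PPI network stored as an edge list: one row per interaction.
# Proteins, edges, and evidence labels are invented for illustration.
edges <- data.frame(
  protein_a = c("P1", "P1", "P2", "P3", "P5"),
  protein_b = c("P2", "P3", "P4", "P4", "P6"),
  evidence  = c("physical binding", "functional association",
                "functional association", "co-expression",
                "physical binding"),
  stringsAsFactors = FALSE
)

# Nodes are simply the unique proteins that appear in any edge
nodes <- sort(unique(c(edges$protein_a, edges$protein_b)))

# The same network as a symmetric adjacency matrix (1 = edge present)
adj <- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(edges$protein_a, edges$protein_b)] <- 1
adj[cbind(edges$protein_b, edges$protein_a)] <- 1

# Look up the direct interaction partners of one protein
partners_of_p1 <- names(which(adj["P1", ] == 1))
print(nodes)
print(partners_of_p1)
```

The edge list is compact and easy to filter, while the adjacency matrix makes neighbour lookups simple. Both describe the same network.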

#### **2. Extending Knowledge: Properties of Biological Networks**

Biological networks are not random webs. They have specific, important properties that we can exploit in our analysis.

-   **They are “Scale-Free.”**

    -   **The Concept:** This is the most famous property of biological networks. It means that the distribution of connections is not a nice bell curve. Instead, it follows a “power law.”

    -   **The Implication:** Most proteins (nodes) in the network have only a few connections (a low “degree”). However, there are a few special proteins that are connected to a huge number of other proteins. These are called **“hubs.”**

    -   **The Analogy:** Think of an airline route map. Most airports are small and have only a few routes. But a few airports, like Atlanta or Chicago, are massive hubs with connections to hundreds of other cities.

    -   **Why it matters for us:** These hub proteins are often critically important for the cell. A drug that targets a hub protein is likely to have a much larger and more widespread effect on the network than a drug that targets a minor, poorly connected protein. Identifying hubs will be a key part of our analysis.

-   **They have “Modularity.”**

    -   **The Concept:** Proteins with related functions tend to form tightly interconnected clusters or “communities” within the larger network.

    -   **The Analogy:** In a social network, you will find dense clusters of friends who all went to the same high school or work at the same company. These are modules.

    -   **Why it matters for us:** These modules often represent specific biological pathways or protein complexes (like the proteasome or the ribosome). If we find that a drug’s targets are heavily concentrated in one specific module, it gives us a powerful clue about its mechanism of action.
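The notion of a hub can be made concrete by counting connections. The following sketch computes the degree of every node in a toy edge list and flags high-degree nodes. The proteins are hypothetical, and the cut-off used to call a node a “hub” is an arbitrary choice for this illustration, not a standard definition.

``` r
# Toy edge list (hypothetical proteins P1..P8); P1 is wired up as a hub
edges <- data.frame(
  from = c("P1", "P1", "P1", "P1", "P1", "P2", "P6", "P7"),
  to   = c("P2", "P3", "P4", "P5", "P6", "P3", "P7", "P8"),
  stringsAsFactors = FALSE
)

# Degree = number of edges touching each node
deg_table <- table(c(edges$from, edges$to))
degree <- setNames(as.integer(deg_table), names(deg_table))
degree <- sort(degree, decreasing = TRUE)
print(degree)

# Call a node a "hub" if its degree is at least twice the median degree.
# This cut-off is an assumption made for the sketch.
hub_cutoff <- 2 * median(degree)
hubs <- names(degree)[degree >= hub_cutoff]
print(hubs)
```

Most nodes here have one or two connections, while a single node has five, which mirrors the “few hubs, many small nodes” pattern described above.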

#### **3. The Role of the PPI Network in Our Workflow**

Let’s put this all together. Here is how the PPI network acts as the central player in our upcoming practical analysis:

1.  **Foundation:** We start by considering the entire human interactome from STRING as our underlying map.

2.  **Finding the Overlap:** We will identify our list of drug targets and our list of disease targets, then take the targets that appear on both lists.

3.  **Building the Drug-Disease Network:** We will then query the STRING database and ask it to “pull out” a sub-network. This sub-network will contain **only our targets of interest** and the known interactions **between them**.

4.  **Analysis:** This smaller, specific drug-disease network is what we will analyze. We will look for its hubs, its bottlenecks, and its modules. The proteins that are topologically important in this specific sub-network are our best candidates for explaining the drug’s effect on the disease.

The PPI network is not just a pretty picture. It is a structured data source that provides the functional context and the links that allow us to connect the chemistry of a drug to the biology of a disease.
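The “pull out a sub-network” step in the workflow above can be sketched on a toy background network. The proteins and edges below are hypothetical, and this only shows the filtering logic, not how STRING itself is queried.

``` r
# Background network (hypothetical proteins) and a list of targets of interest
background <- data.frame(
  from = c("P1", "P1", "P2", "P3", "P4", "P5"),
  to   = c("P2", "P3", "P4", "P4", "P5", "P6"),
  stringsAsFactors = FALSE
)
targets <- c("P1", "P2", "P4", "P6")

# Keep an edge only if BOTH of its endpoints are in the target list
keep <- background$from %in% targets & background$to %in% targets
sub_network <- background[keep, ]
print(sub_network)

# Targets with no retained edge end up isolated in the sub-network
connected <- unique(c(sub_network$from, sub_network$to))
isolated <- setdiff(targets, connected)
print(isolated)
```

Note that a target can be on the list and still be isolated in the sub-network if none of its partners are also on the list.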

### **Lesson 3: Summary & Status Check**

-   **Conceptually**, we now understand that a PPI network is a map of protein interactions, composed of **nodes (proteins)** and **edges (interactions)**. We know that these networks are not random; they have important properties like being **scale-free (having hubs)** and **modular (having clusters)**.

-   We have a clear mental model of the role of the PPI network in our analysis: it is the **scaffolding** upon which we will build and analyze our specific drug-disease interaction network.

-   **Crucially**, we have revisited the **STRING database** and understand it as the primary, evidence-based source for this blueprint of the cell.

We have now completed our tour of the three foundational concepts. We know what network pharmacology is, where to get our data, and what the underlying map looks like. We are fully prepared to start our practical project.

------------------------------------------------------------------------

## **😗Part 2: The Practical Workflow🥱**

------------------------------------------------------------------------

## **Project: Investigating the Network Mechanism of Quercetin Against Colon Cancer🤤**

------------------------------------------------------------------------

### **Lesson 4: Step 1 - Target Identification😶‍🌫️**

**Goal:** ***`To use a combination of predictive web servers and curated databases to generate our two primary lists of genes/proteins: 1. The predicted targets of our drug, Quercetin. 2. The known targets associated with our disease, Colon Cancer.`***

#### **Chunk 1: Finding the Drug - Defining Quercetin**

**Explanation:** Before we can find the targets of a drug, we need to be able to represent it in a way that bioinformatics tools can understand. We can’t just type “Quercetin.” We need its chemical structure, most commonly represented as a **SMILES string**.

**Action:**

1.  Go to a large chemical database like **PubChem** ([**https://pubchem.ncbi.nlm.nih.gov/**](https://pubchem.ncbi.nlm.nih.gov/)).

2.  In the search bar, type **“Quercetin”**.

3.  On the Quercetin compound page, look for the **“Canonical SMILES”**.

4.  **Copy this string.** It will look like this: ***`C1=CC(=C(C=C1C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O)O)O`***

5.  Create a project folder on your computer (e.g., `Project_Network_Pharmacology`). Inside, create a `data` subfolder.

6.  Open a simple text editor and save this SMILES string in a file named `quercetin_smiles.txt` inside your `data` folder.

-   **Verification:** You have successfully found and saved the unambiguous, machine-readable identifier for our drug.
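If you prefer to do the folder and file steps from R instead of a text editor, a minimal sketch could look like this. It writes into a temporary directory so that it can be run safely anywhere; the `project_dir` path is an assumption that you would replace with your own project location.

``` r
# Create the project's data folder (no error if it already exists)
# and save the SMILES string copied from PubChem.
# tempdir() is used here only so the sketch runs anywhere; swap in your own path.
project_dir <- file.path(tempdir(), "Project_Network_Pharmacology")
data_dir <- file.path(project_dir, "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

quercetin_smiles <- "C1=CC(=C(C=C1C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O)O)O"
smiles_path <- file.path(data_dir, "quercetin_smiles.txt")
writeLines(quercetin_smiles, smiles_path)

# Read the file back to confirm it holds exactly the string we saved
stopifnot(identical(readLines(smiles_path), quercetin_smiles))
cat("Saved SMILES to:", smiles_path, "\n")
```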

#### **Chunk 2: Predicting the Drug Targets with SwissTargetPrediction**

**Explanation:** Now we will use the powerful `SwissTargetPrediction` web server. It will take our SMILES string and, based on chemical similarity, predict the most likely human protein targets.

**Action:**

1.  Go to the **SwissTargetPrediction** website ([**http://www.swisstargetprediction.ch/**](http://www.swisstargetprediction.ch/)).

2.  Paste the SMILES string for Quercetin into the search box.

3.  Make sure the species is set to **“Homo sapiens”**.

4.  Click **“Predict targets”**.

5.  The server will run the analysis and return a results page. **This page shows a list of protein targets, ranked by a “Probability” score.**

6.  We want to export this data. Click the **“CSV”** button to download the results as a comma-separated values file.

7.  Save this file in your data folder as `quercetin_predicted_targets.csv`.

-   **Verification:** Open the downloaded CSV file in a spreadsheet program (like Excel or Google Sheets) or a text editor. You will see a clean table. The most important columns are:
    -   `Target`: The full name of the protein target (e.g., “Serine/threonine-protein kinase PIM1”).
    -   `Common name`: The gene symbol for that target (e.g., “PIM1”). This is the column we will extract in R.
    -   `Uniprot ID`: The stable, universal identifier for that protein (e.g., “P11309”).
    -   `Probability`: The model’s confidence in the prediction.

    This file is our raw list of predicted drug targets.
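Because later steps depend on exact column names, it can help to check them as soon as the file is loaded. The sketch below uses a tiny made-up CSV as a stand-in for the real download; the rows are invented, and the header of an actual export may differ slightly from what is assumed here, which is exactly what this check is meant to catch.

``` r
# Write a tiny stand-in for the downloaded file (all values are made up)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Target,Common name,Uniprot ID,Probability",
  "Hypothetical kinase 1,GENE1,X00001,0.85",
  "Hypothetical kinase 2,GENE2,X00002,0.00"
), csv_path)

# check.names = FALSE keeps headers such as "Common name" unchanged
raw <- read.csv(csv_path, check.names = FALSE, stringsAsFactors = FALSE)

# Stop early with a readable message if an expected column is missing
expected_cols <- c("Common name", "Uniprot ID", "Probability")
missing_cols <- setdiff(expected_cols, names(raw))
if (length(missing_cols) > 0) {
  stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
}

# Show the column names and types that were actually read
str(raw)
```

If the check fails on a real file, print `names(raw)` and adjust the column names used in the processing script accordingly.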

#### **Chunk 3: Finding the Disease Targets with GeneCards**

**Explanation:** Now we switch hats and find the targets for our disease. We will use the comprehensive GeneCards database to get a well-annotated list of genes associated with Colon Cancer.

**Action:** Go to the **GeneCards** website ([**https://www.genecards.org/**](https://www.genecards.org/)).

1.  In the search bar, type **“colon cancer”** and press Enter.

2.  GeneCards will show you a results page. It should have a section or link for the “Gene list” associated with the disease. Click on it.

3.  This will take you to a large table of genes. Each gene has a “GCid” and a “Relevance Score” indicating how strongly it is associated with the disease.

4.  We want to download this data. Look for a “Download” button or link.

5.  Download the table as a text or CSV file.

6.  Save this file in your data folder as `colon_cancer_targets.csv`.

-   **Verification:** Open the downloaded CSV file. You will see a large table with many columns. The most important columns for us are:
    -   `Gene Symbol`: The official symbol for the gene (e.g., “APC”, “KRAS”, “TP53”).
    -   `Relevance score`: The score from GeneCards.

    This file is our raw list of disease-associated targets.

#### **Chunk 4: Creating Clean Target Lists**

**Explanation:** The raw downloaded files are a great start, but we need to process them into simple, clean lists of just the gene symbols for our next step. We can do this easily in R.

**Action:**

1.  In RStudio, create a new R Project for this analysis.

2.  Create a new R script and save it as `01_target_processing.R`.

3.  Add and run the following code:

``` r
# Load tidyverse for data manipulation
library(tidyverse)

# --- 1. Process Quercetin Predicted Targets ---
# Load the CSV file from SwissTargetPrediction
quercetin_raw <- read_csv("data/quercetin_predicted_targets.csv")

# We only care about the "Common name" column, which contains the gene symbol.
# Let's also only keep targets with a probability > 0 as a minimal filter.
quercetin_targets <- quercetin_raw %>%
  filter(Probability > 0) %>%
  pull(`Common name`) %>% # The pull() function extracts a single column as a vector
  unique() # Ensure we have no duplicates

# Save this clean list to a file
write_lines(quercetin_targets, "data/quercetin_targets_clean.txt")
cat("Found", length(quercetin_targets), "unique predicted targets for Quercetin.\n")


# --- 2. Process Colon Cancer Targets ---
# Load the data from GeneCards
cancer_raw <- read_delim("data/colon_cancer_targets.csv", delim = ",") # Use read_delim for flexibility

# We only care about the "Gene Symbol" column.
# Let's filter for a relevance score > 1 to keep moderately to highly associated genes.
cancer_targets <- cancer_raw %>%
  filter(`Relevance score` > 1.0) %>%
  pull(`Gene Symbol`) %>%
  unique()

# Save this clean list to a file
write_lines(cancer_targets, "data/colon_cancer_targets_clean.txt")
cat("Found", length(cancer_targets), "unique targets for Colon Cancer.\n")
```

-   **Verification:** After running this R script, you will have two new, simple text files in your `data` folder:
    -   `quercetin_targets_clean.txt`: A single column of gene symbols.
    -   `colon_cancer_targets_clean.txt`: A single column of gene symbols.

    The `cat()` commands in the R script also print a summary to the console, telling you exactly how many targets are in each of your final lists. This confirms the successful completion of our data acquisition and cleaning step.
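The score cut-offs used in the script are judgment calls, so it is worth seeing how the size of a list changes with the threshold. Here is a small sketch on invented scores; the gene names and values are placeholders, not GeneCards output.

``` r
# Made-up relevance scores standing in for a downloaded gene table
toy <- data.frame(
  gene  = paste0("GENE", 1:8),
  score = c(0.3, 0.8, 1.2, 2.5, 4.0, 7.5, 12.0, 30.0),
  stringsAsFactors = FALSE
)

# Count how many genes survive each candidate cut-off
cutoffs <- c(0, 1, 5, 10)
n_kept <- sapply(cutoffs, function(cut) sum(toy$score > cut))
print(data.frame(cutoff = cutoffs, genes_kept = n_kept))
```

Running the same kind of table on your real data shows whether a chosen cut-off keeps a workable number of targets or discards almost everything.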

### **Lesson 4: Summary & Status Check**

-   **Conceptually**, we understand that the first practical step of any network pharmacology project is to acquire two lists of gene/protein identifiers: one for the drug’s targets and one for the disease’s targets.
-   **Practically**, we have used premier, web-based bioinformatics tools (**SwissTargetPrediction** and **GeneCards**) to acquire this data for our specific project.
-   **Crucially**, we have followed the “Trust, but Verify” principle by not just downloading the data, but by writing a simple R script to **clean and format** it, and to **report summary statistics** (the number of targets in each list).

------------------------------------------------------------------------

### **Lesson 5: Step 2 - Finding the “Drug-Disease” Common Targets😎**

**Goal:** ***To take our two clean lists of targets (`quercetin_targets_clean.txt` and `colon_cancer_targets_clean.txt`) and find the genes/proteins that appear on both lists. These “common targets” represent the most direct link between the drug’s activity and the disease’s biology.***

#### **1. Core Concepts**

-   **The “Intersection”:** In mathematics and computer science, the “intersection” of two sets is the collection of all elements that are common to both sets. This is exactly what we want to find. If Set A is {1, 2, 3} and Set B is {3, 4, 5}, their intersection is {3}.

-   **The Biological Significance:** Why are these common targets so important? They are proteins that are simultaneously:

    1.  Predicted to be physically modulated by our drug, Quercetin.

    2.  Known from extensive research to be functionally involved in the pathology of Colon Cancer.

-   **The First Hypothesis:** ***These proteins are our strongest, first-line candidates for explaining the therapeutic mechanism***. By analyzing the functions of just these common targets, we can get an immediate, high-level idea of which biological processes are at the heart of the drug-disease interaction.
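The set example above translates directly into R. This short sketch uses base R set functions; the gene-like names in the second half are placeholders chosen to show one practical pitfall.

``` r
set_a <- c(1, 2, 3)
set_b <- c(3, 4, 5)

intersect(set_a, set_b)  # elements in both sets: 3
setdiff(set_a, set_b)    # elements only in set_a: 1 2
union(set_a, set_b)      # elements in either set: 1 2 3 4 5

# The same idea with placeholder gene symbols. Matching is case-sensitive,
# so "gene3" and "GENE3" are NOT treated as the same gene.
drug_like    <- c("GENE1", "GENE2", "gene3")
disease_like <- c("GENE2", "GENE3", "GENE4")

naive_overlap <- intersect(drug_like, disease_like)
clean_overlap <- intersect(toupper(drug_like), disease_like)
print(naive_overlap)
print(clean_overlap)
```

Inconsistent capitalization or stray whitespace in either list silently shrinks the intersection, so it pays to normalise identifiers before comparing them.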

#### **2. Practical Application: Finding the Intersection in R**

We will continue in our R script, `01_target_processing.R`.

#### **Chunk 1: Loading the Clean Data**

**Explanation:** First, we need to load the two clean text files that we created at the end of the last lesson back into our R session.

``` r
# This code continues in the '01_target_processing.R' script
# (Assuming the tidyverse library is already loaded)

# --- 3. Find Drug-Disease Common Targets ---

# Load the clean lists back into R
# (If you are running this in a new session)
quercetin_targets <- read_lines("data/quercetin_targets_clean.txt")
cancer_targets <- read_lines("data/colon_cancer_targets_clean.txt")
```

-   **Action:** Add this chunk to your script and run it. You now have two character vectors in your R environment: `quercetin_targets` and `cancer_targets`.

#### **Chunk 2: Calculating the Intersection**

**Explanation:** R has a built-in, highly efficient function called `intersect()` that does exactly what we need. We simply give it our two vectors as input.

``` r
# Find the common elements between the two vectors
common_targets <- intersect(quercetin_targets, cancer_targets)

# Let's see how many common targets we found and what they are
cat("Number of Quercetin targets:", length(quercetin_targets), "\n")
cat("Number of Colon Cancer targets:", length(cancer_targets), "\n")
cat("Number of Common targets:", length(common_targets), "\n\n")

cat("--- List of Common Targets ---\n")
print(common_targets)

# Save this important list to a file for later use
write_lines(common_targets, "data/common_targets.txt")
```

-   **Action:** Run this chunk.

-   **Verification:** The printed output in the console is our proof. It gives us a clear summary of the numbers and lists the gene symbols of the proteins that are at the intersection. Seeing this list (e.g., “EGFR”, “SRC”, “PIK3CA”) confirms that our drug is predicted to act on proteins that are indeed highly relevant to colon cancer.

#### **Chunk 3: Initial Functional Analysis of Common Targets**

**Explanation:** Now that we have this high-confidence list of common targets, what do they do? We can get a quick preview of their collective function by performing a standard GO and KEGG enrichment analysis, just like we did in our other ’omics projects. ***This will tell us if these direct points of action are concentrated in any specific biological pathways.*** We will use the popular `clusterProfiler` package.

``` r
# --- 4. Functional Enrichment of Common Targets ---
# Install packages if you don't have them
# BiocManager::install(c("clusterProfiler", "org.Hs.eg.db"))

library(clusterProfiler)
library(org.Hs.eg.db)

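# We need to convert our gene symbols to Entrez IDs for clusterProfiler.
# mapIds() comes from AnnotationDbi, which is attached along with org.Hs.eg.db.
entrez_ids <- mapIds(org.Hs.eg.db,
                     keys = common_targets,
                     keyType = "SYMBOL",
                     column = "ENTREZID",
                     multiVals = "first")

# Symbols that cannot be mapped come back as NA; drop them before enrichment
entrez_ids <- unname(entrez_ids[!is.na(entrez_ids)])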

# Run KEGG enrichment analysis
kegg_enrichment <- enrichKEGG(
  gene = entrez_ids,
  organism = "hsa", # Homo sapiens
  pvalueCutoff = 0.05
)

# Visualize the KEGG results
dotplot(kegg_enrichment, showCategory = 15) + ggtitle("Enriched KEGG Pathways for Common Targets")
```

-   **Action:** Run this chunk.

-   **Verification and Interpretation:** The dot plot that is generated is our first major biological insight. It answers the question: “What are the primary pathways where the drug’s action and the disease’s pathology overlap?” You might see pathways like “Pathways in cancer,” “`PI3K-Akt signaling pathway`,” or “`Ras signaling pathway`.” This provides strong, early evidence for the drug’s mechanism and verifies that our list of common targets is biologically coherent and relevant.
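It can help to see the statistic behind a single dot in such a plot. Over-representation analysis of this kind rests on a hypergeometric test: given a background of annotated genes, how surprising is it that so many of our genes fall in one pathway? The sketch below works through one pathway by hand. All counts and the extra p-values are hypothetical numbers chosen only to illustrate the calculation.

``` r
# Hypothetical numbers, chosen only to illustrate the calculation
universe_size <- 8000   # N: background genes with any pathway annotation
pathway_size  <- 200    # K: background genes annotated to one pathway
list_size     <- 50     # n: genes in our common-target list
overlap       <- 8      # k: common targets that fall in the pathway

# Expected overlap if the 50 genes had been drawn at random
expected <- list_size * pathway_size / universe_size

# P(overlap >= k) under random sampling without replacement
p_value <- phyper(overlap - 1, pathway_size,
                  universe_size - pathway_size, list_size,
                  lower.tail = FALSE)

# Many pathways are tested at once, so p-values are adjusted for
# multiple testing. The other three p-values are invented placeholders.
p_adjusted <- p.adjust(c(p_value, 0.02, 0.30, 0.75), method = "BH")

cat("Expected overlap:", expected, "\n")
cat("Raw p-value:", signif(p_value, 3), "\n")
cat("BH-adjusted p-values:", signif(p_adjusted, 3), "\n")
```

A pathway is reported as enriched when the observed overlap is much larger than the expected one and the adjusted p-value stays below the chosen cut-off.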

### **Lesson 5: Summary & Status Check**

-   **Conceptually**, we understand that finding the **intersection** of the drug and disease target lists is the first and most direct way to identify key mechanistic proteins.

-   **Practically**, we have used the simple `intersect()` function in R to generate our list of **common targets**.

-   **Crucially**, we have followed the “Trust, but Verify” principle by not just finding the list, but by immediately performing a **functional enrichment analysis** on it. The resulting dot plot provides a verifiable and interpretable summary of the core biological processes that are being targeted, confirming that our approach is on the right track.

------------------------------------------------------------------------

### **Lesson 6: Step 3 - Constructing the PPI Network🤗**

**Goal:** ***`To use our list of common targets to retrieve a specific sub-network from the STRING database. This sub-network will consist only of our proteins of interest and the high-confidence interactions that exist between them.`***

#### **1. Core Concepts**

-   **From a List to a Network:** A simple list of genes is informative, but it doesn’t show relationships. A network is a much richer data structure. By constructing a PPI network, we are explicitly adding a layer of biological context. We are moving from asking “Which proteins are involved?” to “**How are the involved proteins connected?**”

-   **The “Sub-Network” Concept:** The entire human interactome in the STRING database contains ~20,000 proteins and millions of interactions. Trying to look at the whole thing is impossible and uninformative. The power of this step is in **filtering** ***this massive background network***. We are telling STRING: “***Show me a map, but only include the cities (proteins) that are on my list (common_targets.txt) and the direct highways (interactions) between them***.”

-   **The Importance of Confidence:** STRING aggregates data from many sources, ***some of which are higher quality than others*** (e.g., data from a direct experiment is more reliable than a prediction from text-mining). Therefore, every interaction (edge) in STRING has a **confidence score** (from 0 to 1). ***It is a standard and crucial practice to filter this network and keep only the*** **high-confidence interactions** (e.g., score \> 0.4 or 0.7). ***This removes noisy, speculative connections and ensures our final network is based on robust evidence.***
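
A minimal sketch of what the confidence filter does, using a made-up edge list (the protein pairs and scores are illustrative, not real STRING output). The `STRINGdb` package reports scores on a 0 to 1000 integer scale, so 0.4 corresponds to 400 and 0.7 to 700. The helper name `filter_by_confidence` is hypothetical, and treating the threshold as inclusive is an assumption of this sketch.

``` r
# Toy edge list; scores use the 0-1000 integer scale (illustrative values only)
toy_edges <- data.frame(
  from = c("EGFR", "EGFR", "SRC", "AKT1", "JUN"),
  to = c("SRC", "AKT1", "AKT1", "GSK3B", "APP"),
  combined_score = c(999, 850, 620, 410, 180),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only edges at or above a chosen threshold
filter_by_confidence <- function(edges, threshold) {
  edges[edges$combined_score >= threshold, , drop = FALSE]
}

medium_conf <- filter_by_confidence(toy_edges, 400) # medium confidence
high_conf <- filter_by_confidence(toy_edges, 700)   # high confidence

cat("All edges:", nrow(toy_edges), "\n")
cat("Edges with score >= 400:", nrow(medium_conf), "\n")
cat("Edges with score >= 700:", nrow(high_conf), "\n")
```

Raising the threshold trades coverage for reliability: fewer edges survive, but each one rests on stronger evidence.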

#### **2. Practical Application: Building the Network with STRING and R**

We have two excellent ways to do this: using the user-friendly `STRING` web server or using the more powerful and reproducible `STRINGdb` R package. We will describe the R method, as it is more in line with a bioinformatics workflow.

#### **Chunk 1: Setup and Loading Data**

**Explanation:** We will start a new R script for our network analysis. We’ll install and load the `STRINGdb` package and our list of common targets.

``` r
# --- New Script: 02_network_analysis.R ---

# Install required packages
# BiocManager::install("STRINGdb")
# install.packages("tidyverse")

# Load libraries
library(STRINGdb)
library(tidyverse)

# --- 1. Load Common Targets ---
common_targets <- read_lines("data/common_targets.txt")
```

-   **Action:** Create a new R script, 02_network_analysis.R, and run this setup chunk.

#### **Chunk 2: Initialize STRINGdb and Map Identifiers**

**Explanation:** First, we create an object that connects R to the `STRING` database. We specify the species (“Homo sapiens” is species ID 9606) and a confidence score threshold. A score of 400 (which corresponds to 0.4) is a good default for medium confidence. Then, we need to map our gene symbols to the stable protein identifiers used by `STRING`. The package has a handy function for this.

``` r
# --- 2. Initialize STRINGdb and Map IDs ---

# Create the STRINGdb object
# We'll set the score threshold to 400 (medium confidence)
# Note: which STRING versions are accepted depends on your installed STRINGdb release
string_db <- STRINGdb$new(version = "11.5", 
                          species = 9606, # Homo sapiens
                          score_threshold = 400, 
                          input_directory = "")

# Map our gene symbols to STRING identifiers
mapped_targets <- string_db$map(
  data.frame(gene = common_targets),
  "gene",
  removeUnmappedRows = TRUE
)

# Let's see how many of our genes were successfully mapped
cat("Successfully mapped", nrow(mapped_targets), "out of", length(common_targets), "genes to STRING IDs.\n")
```

-   **Action:** Run this chunk.

-   **Verification:** The `cat()` command is our proof. It tells us how many of our genes were found in the `STRING` database. A high mapping rate confirms that our gene list is valid and recognized by the database.
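
If the mapping rate is lower than expected, it helps to see exactly which symbols were dropped. The sketch below uses toy stand-ins for `common_targets` and `mapped_targets` (the symbol `FAKEGENE1` and the placeholder STRING IDs are invented for illustration); with real data you would apply the same `setdiff()` call to your own objects.

``` r
# Toy stand-ins for 'common_targets' and the data frame returned by string_db$map()
toy_targets <- c("EGFR", "SRC", "AKT1", "FAKEGENE1")
toy_mapped <- data.frame(
  gene = c("EGFR", "SRC", "AKT1"),
  STRING_id = c("9606.PLACEHOLDER_A", "9606.PLACEHOLDER_B", "9606.PLACEHOLDER_C"),
  stringsAsFactors = FALSE
)

# Which input symbols did not receive a STRING identifier?
unmapped <- setdiff(toy_targets, toy_mapped$gene)

# Fraction of distinct input symbols that were mapped
mapping_rate <- length(unique(toy_mapped$gene)) / length(unique(toy_targets))

cat("Unmapped symbols:", paste(unmapped, collapse = ", "), "\n")
cat(sprintf("Mapping rate: %.0f%%\n", 100 * mapping_rate))
```

Unmapped symbols are often outdated aliases, so checking them against a current gene nomenclature resource is a reasonable next step.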

#### **Chunk 3: Retrieve the PPI Network and Visualize**

**Explanation:** Now for the main event. We use the mapped IDs to ask `STRING` to build and plot the interaction network.

``` r
# --- 3. Get and Plot the PPI Network ---

# plot_network() fetches the network image from STRING and draws it on the
# current graphics device. It does not return a ggplot object, so we cannot
# print() it or pass it to ggsave().
string_db$plot_network(mapped_targets$STRING_id,
                       add_link = FALSE,    # Don't add the web link to the plot
                       add_summary = FALSE) # Don't add the text summary to the plot

# Save the plot to our figures folder by drawing it again on a PNG device
png("figures/01_common_targets_ppi_network.png",
    width = 8, height = 8, units = "in", res = 300)
string_db$plot_network(mapped_targets$STRING_id,
                       add_link = FALSE,
                       add_summary = FALSE)
dev.off()
```

-   **Action:** Run this chunk.

-   **Verification:** A network diagram will appear in your RStudio “Plots” pane and will be saved as a PNG file. ***This is your primary output and verification for this lesson***. You should see a web of interconnected nodes (your proteins). ***In STRING’s default “evidence” view, the color of each line (edge) indicates the type of evidence supporting that interaction; in the “confidence” view, line thickness indicates the strength of the evidence. Seeing a well-connected “hairball(毛团)” rather than a set of isolated dots suggests that your common targets are not a random collection, but a set of functionally related, interacting proteins.***

#### **Chunk 4: Exporting the Network for Deeper Analysis**

**Explanation:** While the R plot is great for a quick look, the most powerful network analysis is often done in a dedicated, interactive software tool called **`Cytoscape`**. To do this, ***we need to export our network data from R into a simple table format (an “edge list”) that Cytoscape can read.***

``` r
# --- 4. Export Network Data for Cytoscape ---

# Get the interaction data as a data frame
# distinct() guards against duplicated rows, which get_interactions() can return
interaction_data <- string_db$get_interactions(mapped_targets$STRING_id) %>%
  distinct()

# The 'interaction_data' data frame has columns 'from', 'to', and 'combined_score'.
# This is a standard "edge list" format.
head(interaction_data)

# We need to add the gene symbols back for easier interpretation in Cytoscape
# We do this by merging with our 'mapped_targets' data frame.
from_mapped <- mapped_targets %>% rename(from = STRING_id, from_gene = gene)
to_mapped <- mapped_targets %>% rename(to = STRING_id, to_gene = gene)

edge_list <- interaction_data %>%
  inner_join(from_mapped, by = "from") %>%
  inner_join(to_mapped, by = "to") %>%
  select(from_gene, to_gene, combined_score) # Keep only the important columns

# Save this edge list as a CSV file
write_csv(edge_list, "data/ppi_network_edgelist.csv")
```

-   **Action:** Run this final chunk.

-   **Verification:** Check your data folder. You will now have a new file, `ppi_network_edgelist.csv`. Open it. It’s a simple, three-column table: `from_gene, to_gene, combined_score`. This is a universal format for describing a network, and it is the perfect input for our next lesson, where we will analyze its structure.
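
Because a PPI network is undirected, the same interaction can appear twice in an edge list, once as A-B and once as B-A. The following sketch, on a made-up edge list with the same three columns as `ppi_network_edgelist.csv`, shows one way to detect and collapse such pairs before counting nodes and edges. All values are illustrative.

``` r
# Toy edge list with the same columns as our exported file (illustrative values)
toy_edge_list <- data.frame(
  from_gene = c("EGFR", "SRC", "EGFR", "AKT1"),
  to_gene = c("SRC", "EGFR", "AKT1", "GSK3B"),
  combined_score = c(999, 999, 850, 410),
  stringsAsFactors = FALSE
)

# Build an order-independent key so that A-B and B-A count as the same edge
first <- ifelse(toy_edge_list$from_gene < toy_edge_list$to_gene,
                toy_edge_list$from_gene, toy_edge_list$to_gene)
second <- ifelse(toy_edge_list$from_gene < toy_edge_list$to_gene,
                 toy_edge_list$to_gene, toy_edge_list$from_gene)
edge_key <- paste(first, second, sep = "--")

# Keep the first occurrence of each undirected edge
unique_edges <- toy_edge_list[!duplicated(edge_key), ]

n_edges <- nrow(unique_edges)
n_nodes <- length(unique(c(unique_edges$from_gene, unique_edges$to_gene)))

cat("Unique undirected edges:", n_edges, "\n")
cat("Nodes with at least one edge:", n_nodes, "\n")
```

Comparing `n_nodes` with the number of mapped targets also tells you how many proteins ended up with no interactions at the chosen threshold.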

### **Lesson 6: Summary & Status Check**

-   **Conceptually**, we understand that we are not analyzing the entire human interactome, but rather a specific **sub-network** composed of our proteins of interest. We also appreciate the importance of using a **confidence score** to filter for high-quality interactions.

-   **Practically**, we have used the `STRINGdb` R package to programmatically retrieve and visualize this sub-network.

-   **Crucially**, we have followed the “Trust, but Verify” principle by confirming the number of mapped genes, by visually inspecting the generated network plot, and by exporting the network data into a clean, verifiable **edge list file**.

We have successfully built our network. We have the blueprint of the interactions. The next analytical step is to analyze the shape and structure of this network to find out which of these proteins are the most important.

------------------------------------------------------------------------

## **🌮Detour: A Beginner’s Guide to Cytoscape🍗**

**Goal:** To understand what Cytoscape is, how to get data into it, how to control its visual style, and how to use its core analysis features.

#### **1. What is Cytoscape?**

-   **The Concept:** Cytoscape is an open-source, standalone software application for visualizing and analyzing networks. Think of it as the Photoshop or Illustrator for biological networks. It is not an R package; it’s a program you download and run on your computer.

-   **The Power:** Its strength lies in its interactivity. You can click on nodes, drag them around, change their colors and sizes based on data, and explore the network structure visually. It also has a powerful “App Store” where you can add hundreds of specialized analysis tools.

-   **Action:** Go to the [**Cytoscape website**](https://cytoscape.org/) and download and install the latest version.

#### **2. The Cytoscape Interface: A Quick Tour**

When you open Cytoscape, you’ll see a few key panels.

-   **A) The Network View Window (Center):** This is the main canvas where your network will be displayed. You can zoom with your mouse wheel and pan by clicking and dragging the background.

-   **B) The Control Panel (Left):** This is where you control everything. It has several tabs:

    -   **Network:** Shows all the networks you have loaded.

    -   **Style:** This is the most important tab. It lets you control the visual properties (color, size, shape, label) of your nodes and edges.

    -   **Select** (labelled **Filter** in recent Cytoscape versions): Lets you select nodes/edges based on rules.

-   **C) The Table Panel (Bottom):** This is a spreadsheet that contains all the data associated with your nodes and edges. Every row is a node (or edge), and every column is an attribute (like gene name, degree, fold change, etc.). This panel is directly linked to the Network View.

#### **3. Getting Data Into Cytoscape: The Two Main Ways**

There are two primary ways we will get our network data into Cytoscape.

**Method 1: The STRING App (Easiest and Most Common)**

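-   **The Concept:** Cytoscape has an app called **stringApp** that connects directly to the STRING database. It is not part of the core installation, so install it first from the Cytoscape App Store. This is the most seamless way to build a network from a list of genes.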

-   **The Workflow:**

    1.  In Cytoscape, go to the “Search” bar at the top.

    2.  Make sure the dropdown next to it is set to “STRING: protein query” (or similar).

    3.  Paste your list of gene symbols (e.g., our common_targets list) into the search bar.

    4.  Press Enter.

    5.  The STRING app will open a dialog box. You can set the confidence score cutoff (e.g., 0.4 for medium confidence) and click “Import”.

    6.  **Result:** Cytoscape will automatically query the STRING database, fetch your proteins and the interactions between them, and draw the network for you. The data (like confidence scores) will be automatically loaded into the Table Panel.

**Method 2: Importing a Pre-made Network File**

-   **The Concept:** Sometimes you might have a network already defined in a simple text file (e.g., from another program or a publication). The standard format is a simple two-column file: Source_Node Target_Node. Each row defines one edge.

-   **The Workflow:**

    1.  Go to File \> Import \> Network from File…

    2.  Select your text file.

    3.  A dialog box will appear. You need to tell Cytoscape which column is the “Source” and which is the “Target”.

    4.  Click “OK”.

    5.  **Result:** Cytoscape will draw the network defined in your file.
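
A two-column network file of this kind can be produced from R in a few lines. This is a minimal sketch with invented edges; the file name `toy_network.tsv` and the column names `source` and `target` are placeholders, since Cytoscape lets you assign the Source and Target roles in the import dialog regardless of the header text.

``` r
# Toy edges; each row defines one interaction (illustrative values)
toy_edges <- data.frame(
  source = c("EGFR", "EGFR", "SRC"),
  target = c("SRC", "AKT1", "AKT1"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited text file (hypothetical file name, temp directory)
network_file <- file.path(tempdir(), "toy_network.tsv")
write.table(toy_edges, network_file,
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm that it has the expected shape
check <- read.delim(network_file, stringsAsFactors = FALSE)
print(check)
```

Writing without quotes and without row names keeps the file as plain as possible, which avoids surprises in the import dialog.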

#### **4. The Most Important Skill: Mapping Data to Visual Styles**

This is the key to making your network tell a story. The “Style” tab in the Control Panel is where this happens.

-   **The Concept:** You can make the visual properties of nodes and edges **dependent on the data** in your Table Panel.

-   **Example 1: Sizing Nodes by Importance.**

    1.  In the **Style** tab, find the “Node” properties at the bottom.

    2.  Click on the “Size” property.

    3.  For the “Column” dropdown, select a data column, for example, “Degree” (the number of connections a node has; this column exists only after you have run NetworkAnalyzer, covered in section 5 below).

    4.  For the “Mapping Type” dropdown, select “Continuous Mapping”.

    5.  **Result:** Cytoscape will now automatically resize all the nodes in your network. Nodes with a low degree will be small, and the “hub” nodes with a high degree will be large.

-   **Example 2: Coloring Nodes by Experimental Data.**

    1.  First, you need to import your experimental data. Go to File \> Import \> Table from File… and load a simple CSV file that has a column for “Gene Symbol” and another for “log2FoldChange”. Cytoscape will merge this data into the main Node Table.

    2.  In the **Style** tab, click on the “Fill Color” property for Nodes.

    3.  For the “Column” dropdown, select your “log2FoldChange” column.

    4.  For the “Mapping Type”, select “Continuous Mapping”.

    5.  Choose a color gradient (e.g., blue for negative, white for zero, red for positive).

    6.  **Result:** Cytoscape will color all the nodes in your network based on their expression values. You can see at a glance which parts of the network were upregulated (red) and which were downregulated (blue).
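
The table imported in step 1 of Example 2 can be prepared in R. The sketch below keeps only genes that are present in the network and computes a symmetric range for the color gradient so that white sits exactly at zero. The gene list, fold-change values, and file name are all invented for illustration.

``` r
# Genes present in the network (illustrative)
network_nodes <- c("EGFR", "SRC", "AKT1", "JUN")

# Toy expression results (illustrative values)
toy_expression <- data.frame(
  gene_symbol = c("EGFR", "SRC", "AKT1", "TP53"),
  log2FoldChange = c(-2.1, 0.4, 1.6, -0.8),
  stringsAsFactors = FALSE
)

# Keep only rows for genes that are nodes in the network
node_table <- toy_expression[toy_expression$gene_symbol %in% network_nodes, ]

# A symmetric range keeps zero in the middle of a blue-white-red gradient
limit <- max(abs(node_table$log2FoldChange))

# Write the attribute table (hypothetical file name, temp directory)
node_table_file <- file.path(tempdir(), "toy_node_attributes.csv")
write.csv(node_table, node_table_file, row.names = FALSE)

cat("Suggested gradient range:", -limit, "to", limit, "\n")
```

Network nodes without a row in this table (here `JUN`) simply have no value in the new column, and Cytoscape will draw them with the default color.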

#### **5. Core Analysis: The NetworkAnalyzer Tool**

Cytoscape has powerful built-in analysis tools. The most fundamental one is NetworkAnalyzer.

-   **The Workflow:**

    1.  Go to the menu Tools \> NetworkAnalyzer \> Analyze Network… (in recent Cytoscape releases the same tool may appear directly as Tools \> Analyze Network).

    2.  Choose whether to treat the network as “directed” or “undirected” (for PPI, undirected is standard).

    3.  Click “OK”.

-   **The Result:** NetworkAnalyzer calculates a dozen different topology statistics for every node and edge in your network. It creates new columns in your Table Panel for things like:

    -   **Degree:** The number of connections (to find hubs).

    -   **Betweenness Centrality:** The “bottleneck” score.

    -   **Clustering Coefficient:** How tightly connected a node’s neighbors are.

-   Once these columns are created, you can then use them in the **Style** tab to visually map them, just as we did with Degree in the example above.

### **Detour Summary**

You are now equipped with the fundamental knowledge to use Cytoscape.

-   You know that it’s an **interactive program** for network visualization and analysis.

-   You know how to get data into it, primarily using the **STRING app**.

-   You understand the most powerful concept: **mapping data from the Table Panel to visual styles** (like size and color) to make your network meaningful.

-   You know how to run the core **NetworkAnalyzer** tool to calculate topological properties, which are the basis for identifying the most important nodes.

------------------------------------------------------------------------

### **Lesson 7: Step 4 - Network Analysis and Key Node Identification😮**

**Goal:** ***`To use the NetworkAnalyzer tool in Cytoscape to calculate key topological metrics for each protein (node) in our network, and to use these metrics to identify the most important "hub" and "bottleneck" proteins.`***

#### **1. Core Concepts: Measuring “Importance” in a Network**

How do we decide which protein in a network is the “most important”? We can’t just guess. We use precise mathematical measures of **centrality**. A protein’s centrality is a measure of its topological significance within the network. There are many types of centrality, but we will focus on the two most common and intuitive ones for this analysis.

-   **Concept 1: Degree Centrality (Finding the “Hubs”)**

    -   **The Question:** Which protein has the most direct connections or interactions?

    -   **The Metric:** **Degree**. The degree of a node is simply the number of edges connected to it.

    -   **The Analogy:** In a social network, the person with the highest degree is the one with the most friends. They are a social “hub.”

    -   **The Biological Interpretation:** A protein with a very high degree is a **hub protein**. It interacts with many other proteins. ***Hubs are often critical points of signal integration and regulation.*** A drug that targets a hub can have widespread, cascading effects throughout the network. In our drug-disease network, the node with the highest degree is a top candidate for being a key driver of the drug’s effect.

-   **Concept 2: Betweenness Centrality(中介中心性) (Finding the “Bottlenecks”)**

    -   **The Question:** ***Which protein is most critical for connecting different parts of the network***?

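    -   **The Metric:** **Betweenness Centrality**. This is a more sophisticated measure. For every pair of nodes in the network, you find the shortest path(s) between them. The betweenness centrality of a given node is the number of these shortest paths that pass through that node; when a pair has several equally short paths, each path counts as a fraction. Software often rescales the result to a 0 to 1 range, so compare rankings rather than raw values across tools.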

    -   **The Analogy:** Think of a city’s road map. A “hub” is a major intersection with many roads leading into it (high degree). A “bottleneck” is a specific bridge that is the only way to get from one side of the city to the other (high betweenness centrality). Even if that bridge doesn’t have the most roads connected to it, its removal would fragment the network.

    -   **The Biological Interpretation:** A protein with high betweenness centrality is a **bottleneck protein**. It may not have the most connections, ***but it is crucial for communication between different functional modules or pathways.*** Targeting a bottleneck can be a very effective way to disrupt the flow of information in a disease network.
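
To make the difference between hubs and bottlenecks concrete, here is a small base-R sketch that computes both metrics on an invented network: two tightly connected groups of four nodes joined by a single node called `BRIDGE`. The function `compute_centrality` is a hypothetical helper written for this illustration (it follows Brandes' algorithm for unweighted graphs); it is not how Cytoscape is implemented, and it returns raw, unnormalized betweenness counts.

``` r
# Toy network: two 4-node cliques (L1-L4, R1-R4) joined through one BRIDGE node
toy_edges <- data.frame(
  from = c("L1", "L1", "L1", "L2", "L2", "L3", "L4", "BRIDGE",
           "R1", "R1", "R1", "R2", "R2", "R3"),
  to = c("L2", "L3", "L4", "L3", "L4", "L4", "BRIDGE", "R1",
         "R2", "R3", "R4", "R3", "R4", "R4"),
  stringsAsFactors = FALSE
)

compute_centrality <- function(edges) {
  nodes <- sort(unique(c(edges$from, edges$to)))
  n <- length(nodes)

  # Undirected adjacency list
  adj <- setNames(vector("list", n), nodes)
  for (i in seq_len(nrow(edges))) {
    a <- edges$from[i]
    b <- edges$to[i]
    adj[[a]] <- union(adj[[a]], b)
    adj[[b]] <- union(adj[[b]], a)
  }
  degree <- vapply(adj, length, integer(1))
  betweenness <- setNames(numeric(n), nodes)

  for (s in nodes) {
    # Breadth-first search from s, counting shortest paths (sigma)
    dist <- setNames(rep(-1, n), nodes)
    sigma <- setNames(numeric(n), nodes)
    preds <- setNames(vector("list", n), nodes)
    dist[s] <- 0
    sigma[s] <- 1
    queue <- s
    visit_order <- character(0)
    while (length(queue) > 0) {
      v <- queue[1]
      queue <- queue[-1]
      visit_order <- c(visit_order, v)
      for (w in adj[[v]]) {
        if (dist[w] < 0) {
          dist[w] <- dist[v] + 1
          queue <- c(queue, w)
        }
        if (dist[w] == dist[v] + 1) {
          sigma[w] <- sigma[w] + sigma[v]
          preds[[w]] <- c(preds[[w]], v)
        }
      }
    }
    # Accumulate path dependencies from the farthest nodes back toward s
    delta <- setNames(numeric(n), nodes)
    for (w in rev(visit_order)) {
      for (v in preds[[w]]) {
        delta[v] <- delta[v] + sigma[v] / sigma[w] * (1 + delta[w])
      }
      if (w != s) betweenness[w] <- betweenness[w] + delta[w]
    }
  }

  # In an undirected graph every pair is visited from both ends, so halve
  data.frame(node = nodes, degree = degree,
             betweenness = betweenness / 2,
             row.names = NULL, stringsAsFactors = FALSE)
}

centrality <- compute_centrality(toy_edges)
print(centrality[order(-centrality$betweenness), ])
```

In this toy network `BRIDGE` has the lowest degree (2) but the highest betweenness, because every shortest path between the left and right groups runs through it. That is the pattern the bridge analogy describes.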

#### **2. Practical Application: Analyzing the Network in Cytoscape**

We will now use Cytoscape’s built-in tools to calculate these metrics and then use the Style mapping feature to visualize them.

#### **Chunk 1: Calculate Network Statistics**

**Explanation:** We will use the NetworkAnalyzer app, which is a core part of Cytoscape, to automatically compute the Degree, Betweenness Centrality, and many other metrics for every node in our network.

**Action:**

1.  With your network visible in Cytoscape, go to the main menu: **`Tools > NetworkAnalyzer > Analyze Network...`** (in recent Cytoscape releases the same tool may appear directly as `Tools > Analyze Network`).

2.  A small dialog box will pop up.

    -   Ensure the option **“Treat the network as undirected”** is selected. This is the standard for most PPI networks where an interaction is a mutual relationship.

3.  Click **“OK”**.

4.  The analysis will run very quickly. A new “Results Panel” window might pop up showing a lot of detailed statistics. You can close this window for now.

-   **Verification:** The real magic happens in the background. Go to the **Table Panel** at the bottom of the Cytoscape window. Click on the **Node Table** tab. You will see that NetworkAnalyzer has added many new columns to your table. Scroll to the right and you will find columns named **Degree** and **BetweennessCentrality**. The fact that these columns are now populated with numerical values for every protein is your proof that the analysis was successful.

#### **Chunk 2: Identify Key Nodes from the Data Table**

**Explanation:** Now that we have the data, we can find our key proteins by simply sorting the table.

**Action:**

1.  In the **Node Table**, find the column header for **Degree**.

2.  **Click on the header** to sort the entire table by this column. Click it again to sort in descending order.

3.  ***The proteins at the very top of the table are now your*** **top hub proteins**. Note down the top 5-10 gene symbols.

4.  Now, find the column header for **BetweennessCentrality** and click it to sort in descending order.

5.  The proteins at the top of this list are your **top bottleneck proteins**. Note down these names as well.

-   **Verification:** You have now generated a data-driven, quantitative list of your most topologically important proteins. ***The list of top hubs (e.g., EGFR, SRC, JUN) will likely contain well-known cancer-related proteins, which supports the view that your network is biologically meaningful.***
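
The same ranking can be reproduced in R if you export the Node Table from Cytoscape to a CSV file. The sketch below uses a small invented table with the column names described above (`name`, `Degree`, `BetweennessCentrality`); the numbers are illustrative only, and `top_n_by` is a hypothetical helper.

``` r
# Toy node table standing in for an exported Cytoscape Node Table (made-up values)
toy_node_table <- data.frame(
  name = c("EGFR", "SRC", "JUN", "AKT1", "APP", "RXRA"),
  Degree = c(18, 21, 15, 19, 4, 6),
  BetweennessCentrality = c(0.12, 0.09, 0.05, 0.15, 0.11, 0.01),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the top n rows, sorted descending by one column
top_n_by <- function(tbl, column, n = 3) {
  ranked <- tbl[order(tbl[[column]], decreasing = TRUE), ]
  head(ranked$name, n)
}

top_hubs <- top_n_by(toy_node_table, "Degree")
top_bottlenecks <- top_n_by(toy_node_table, "BetweennessCentrality")

# Proteins that rank highly on both metrics are especially interesting
hub_and_bottleneck <- intersect(top_hubs, top_bottlenecks)

cat("Top hubs:", paste(top_hubs, collapse = ", "), "\n")
cat("Top bottlenecks:", paste(top_bottlenecks, collapse = ", "), "\n")
cat("In both lists:", paste(hub_and_bottleneck, collapse = ", "), "\n")
```

Keeping the ranking in a script makes the choice of cutoff (top 3, top 10, top 20) explicit and easy to change later.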

#### **Chunk 3: Visualize Centrality Using Node Size**

**Explanation:** A long list of names is useful, but a visual representation is much more powerful. We will now use Cytoscape’s Style mapping to make the most important nodes visually stand out.

**Action:**

1.  Go to the **Control Panel** on the left and click on the **“Style”** tab.

2.  You will see a list of visual properties. Find the **“Node”** property called **“Size”**.

3.  Click on the small down arrow next to the Size property to expand its mapping options.

4.  For the **“Column”** dropdown, select the **Degree** column that we just created.

5.  For the **“Mapping Type”** dropdown, select **“Continuous Mapping”**.

6.  A new graphical interface will appear. This allows you to define the relationship between the degree value and the node size. A simple linear relationship is a good start. You can double-click on the graph to add control points (e.g., make nodes with a very low degree have a size of 20, and nodes with the highest degree have a size of 100).

-   **Verification:** Look at your **Network View**. It will have instantly and automatically updated. The network is no longer a uniform hairball. You can now immediately see the **hub proteins** because they are drawn as the **largest nodes**. This visual hierarchy immediately draws your eye to the most connected and potentially most important proteins, providing a powerful and intuitive view of your network’s structure. You can repeat this process, mapping BetweennessCentrality to the node border width or another property, to visualize multiple metrics at once.
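
The simple linear relationship from step 6 amounts to rescaling each degree value between a minimum and a maximum node size. This sketch shows the arithmetic with the example sizes 20 and 100; the function name `linear_size_map` and the degree values are invented for illustration and are not part of Cytoscape.

``` r
# Hypothetical helper: linearly rescale values to the range [min_size, max_size]
linear_size_map <- function(value, min_size = 20, max_size = 100) {
  rng <- range(value)
  if (rng[1] == rng[2]) {
    # All values identical: use the midpoint size for every node
    return(rep((min_size + max_size) / 2, length(value)))
  }
  min_size + (value - rng[1]) / (rng[2] - rng[1]) * (max_size - min_size)
}

# Illustrative degree values
degrees <- c(APP = 1, RXRA = 6, JUN = 11, SRC = 21)
sizes <- linear_size_map(degrees)

print(data.frame(degree = degrees, node_size = sizes))
```

If a few extreme hubs dominate the picture, adding intermediate control points in the Cytoscape editor (or rescaling on a log scale) spreads the mid-range nodes out more evenly.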

### **Lesson 7: Summary & Status Check**

-   **Conceptually**, we have defined and distinguished between the two key measures of network centrality: **Degree (for hubs)** and **Betweenness Centrality (for bottlenecks)**.

-   **Practically**, we have used Cytoscape’s core **NetworkAnalyzer** tool to calculate these metrics for every protein in our drug-disease network.

-   **Crucially**, we have followed the “Trust, but Verify” principle in two ways:

    1.  We **verified the analysis** by inspecting the new data columns (Degree, BetweennessCentrality) in the Node Table.

    2.  We then created a **verifiable visualization** by mapping the Degree data to the Node Size, allowing us to instantly and intuitively identify the hub proteins in our network.

------------------------------------------------------------------------

### **Lesson 8: Step 5 - Final Pathway Analysis and Hypothesis Generation🤗**

**Goal:** ***`To take the short list of key hub/bottleneck proteins identified from our network analysis, perform a final, focused functional enrichment analysis on them, and synthesize all of our findings into a single, coherent, and data-driven hypothesis.`***

#### **1. Core Concepts**

-   **The “Signal from the Noise”:** Why are we doing this final enrichment step? Our first enrichment analysis in Lesson 5 was on all the common targets. This gave us a broad overview. However, that list might have contained many proteins that are part of the disease but are on the periphery of the network. The list of **key topological proteins** we have now is much more focused. These are the proteins that are not only involved in the drug-disease interaction but are also structurally critical to the network that connects them. Analyzing their function gives us a much cleaner “signal” about the most critical processes being modulated.

-   **Synthesizing the Narrative:** This is the final step in any scientific data analysis. It involves telling a story that connects all the pieces of evidence we have gathered. Our story must include:

    1.  The **drug** (Quercetin).

    2.  The key **direct targets** (our top hub proteins).

    3.  The key **biological pathways** that these targets are enriched in.

    4.  The link to the **disease** (Colon Cancer).

-   **The Structure of a Good Hypothesis:** A strong hypothesis from a network pharmacology study is specific and mechanistic. It should follow a template like:  
    “We predict that **\[The Drug\]** exerts its therapeutic effect against **\[The Disease\]** primarily by targeting a set of key hub proteins, including **\[Key Protein A, Key Protein B\]**. These proteins are central nodes in the **\[Name of Enriched Pathway\]**, and their modulation by the drug is predicted to disrupt this pathway, leading to a therapeutic outcome like **\[e.g., induction of apoptosis, inhibition of proliferation\]**.”

#### **2. Practical Application: Final Analysis and Synthesis**

We will now perform the final enrichment analysis and then write out our concluding hypothesis.

#### **Chunk 1: Extracting and Analyzing Key Proteins**

**Explanation:** First, we need our final, high-confidence list of key proteins. We will define this as the “Top 20” proteins based on their Degree from our Cytoscape analysis in the last lesson. Then we will perform a KEGG pathway enrichment on this short, focused list.

**Action:**

1.  Go back to your Cytoscape session.

2.  In the **Node Table**, make sure it is sorted in descending order by the **Degree** column.

3.  Select the top 20 rows.

4.  Copy the gene symbols from the name (or display name) column for these top 20 proteins.

5.  We will now take this list back to R for the final enrichment analysis. Continue in your `01_target_processing.R` script.

``` r
# --- 5. Final Enrichment of Key Hub Proteins ---
# (Continuing in the '01_target_processing.R' script)

# Paste the list of the Top 20 Hub Proteins you copied from Cytoscape.
# This is our final, high-confidence list.
key_hub_proteins <- c(
  "SRC", "EGFR", "JUN", "MAPK1", "MAPK3", "PIK3R1", "HRAS", "AKT1",
  "PTK2", "RELA", "CASP3", "APP", "GSK3B", "PLCG1", "PRKACA", "ESR1",
  "RXRA", "STAT3", "AR", "PIK3CA"
) # Note: This is a representative example list. Your list may vary.

# Convert these gene symbols to Entrez IDs for clusterProfiler
key_entrez_ids <- mapIds(org.Hs.eg.db,
                         keys = key_hub_proteins,
                         keyType = "SYMBOL",
                         column = "ENTREZID",
                         multiVals = "first")
key_entrez_ids <- key_entrez_ids[!is.na(key_entrez_ids)] # drop unmapped symbols

# Run KEGG enrichment on ONLY these key proteins
# The "universe" (background) is all the proteins in our original common_targets list.
# This is a strict test: it asks which pathways are enriched among the hubs
# relative to the other common targets, not relative to the whole genome.
universe_entrez <- mapIds(org.Hs.eg.db, keys = common_targets, keyType = "SYMBOL", column = "ENTREZID", multiVals="first")
universe_entrez <- universe_entrez[!is.na(universe_entrez)]

key_kegg_enrichment <- enrichKEGG(
  gene = key_entrez_ids,
  universe = universe_entrez, # Restrict the background to the common targets
  organism = "hsa",
  pvalueCutoff = 0.05
)

# Visualize the final, most important pathway results
final_plot <- dotplot(key_kegg_enrichment, showCategory = 15) + 
  ggtitle("Key KEGG Pathways Modulated by Quercetin Hubs")
print(final_plot)

# Let's save this final, most important plot
ggsave("figures/final_hub_pathway_enrichment.png", plot = final_plot, width = 10, height = 8)
```

-   **Action:** Run this final analysis chunk.

-   **Verification:** The new dot plot is our final piece of evidence. It is the most important result of our entire study. It shows which specific biological pathways are most significantly enriched among the most structurally important proteins in our drug-disease network. You will likely see that the number of significant pathways is smaller and more focused than our first enrichment analysis, and key cancer-related pathways like **“Pathways in cancer,” “PI3K-Akt signaling pathway,”** and **“MAPK signaling pathway”** will likely be at the very top.
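
Over-representation tests such as the one behind `enrichKEGG()` are based on the hypergeometric distribution, so the choice of `universe` directly affects the p-values. The sketch below illustrates this with invented counts: the same 20 hub genes, 7 of which fall in a pathway, tested against a large background and against a small, restricted one. The helper `enrichment_p` and all numbers are assumptions made for illustration.

``` r
# P(X >= k): probability of seeing at least k pathway genes when drawing
# n genes from a universe of N genes, M of which belong to the pathway
enrichment_p <- function(k, n, M, N) {
  phyper(k - 1, M, N - M, n, lower.tail = FALSE)
}

# Scenario 1: large background (illustrative: 8000 annotated genes, 350 in pathway)
p_large_background <- enrichment_p(k = 7, n = 20, M = 350, N = 8000)

# Scenario 2: background restricted to 120 common targets, 30 of them in the pathway
p_restricted_background <- enrichment_p(k = 7, n = 20, M = 30, N = 120)

cat("p-value, large background:     ", signif(p_large_background, 3), "\n")
cat("p-value, restricted background:", signif(p_restricted_background, 3), "\n")
```

When the background is already rich in pathway members, as a list of common targets for a cancer usually is, the same overlap is much less surprising. A pathway that looked highly significant against the whole genome may therefore fall below the cutoff once the universe is restricted.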

#### **Chunk 2: Synthesizing the Final Hypothesis**

**Explanation:** Now, we act as scientists. We look at all the evidence we have generated and weave it into a coherent story.

**Action (A Thought Process):**

1.  **Recall the Drug and Disease:** We are studying Quercetin against Colon Cancer.

2.  **Identify the Key Hubs:** Look at our `key_hub_proteins` list. We see major, famous cancer proteins like EGFR, SRC, AKT1, and MAPK1.

3.  **Identify the Top Enriched Pathway:** Look at the dot plot we just generated. The “PI3K-Akt signaling pathway” is the most significant result.

4.  **Connect the Dots:** We know that the PI3K-Akt pathway is a central pathway that controls cell survival and proliferation and is frequently dysregulated in cancer. Our results show that Quercetin’s key targets are central hubs in this exact pathway.

5.  **Formulate the Hypothesis:** Now we can write out our final conclusion using the template from the Core Concepts.

### **Grand Conclusion of the Canonical Network Pharmacology Project**

**The Final Hypothesis:**

“Our network pharmacology analysis predicts that the flavonoid **Quercetin** exerts its therapeutic effects against **Colon Cancer** through a multi-target mechanism. By predicting Quercetin’s protein targets and intersecting them with known colon cancer-associated genes, we constructed a core drug-disease interaction network. Topological analysis of this network identified several key **hub proteins**, including the well-known oncogenes **SRC, EGFR, and AKT1**. A focused functional analysis of these hub proteins revealed a highly significant enrichment for the **‘PI3K-Akt signaling pathway’** (KEGG: hsa04151). Therefore, we hypothesize that Quercetin’s primary anti-cancer mechanism involves ***`binding to these central hubs`***, which in turn disrupts the PI3K-Akt signaling cascade, leading to an inhibition of cancer cell proliferation and survival.”

This final paragraph is the culmination of our entire project. It is specific, it is mechanistic, and every single part of it is directly supported by a verifiable step in our bioinformatics workflow. The next step would be to take this computational hypothesis back to the lab for experimental validation.

------------------------------------------------------------------------

## **🥩The “DEG-driven” Network Pharmacology Workflow🥶**

**Project:** “Investigating the Network Mechanism of a Novel Kinase Inhibitor (Drug ‘X’) on MCF7 Breast Cancer Cells, Based on Experimental RNA-seq Data.”

**The Starting Point:** We have two files:

1.  `drug_X_predicted_targets.txt`: A list of the drug’s predicted direct protein targets, which we got from SwissTargetPrediction (exactly as we did for Quercetin).

2.  `MCF7_drug_X_DEGs.csv`: The results table from a DESeq2 analysis of an RNA-seq experiment. It contains all genes, their log2FoldChange, and their adjusted p-values after treating MCF7 cells with Drug X.

**The Goal:** To build a network that explains how the drug’s direct binding actions lead to the observed transcriptional changes (the DEGs).

### **Lesson 1 (of this section): Step 1 - Preparing the Target Lists**

**Goal:** To define our two primary lists of genes: the Drug_Targets and the Experimental_Targets (the DEGs).

#### **Chunk 1: Define the Drug’s Direct Targets**

**Explanation:** This step is identical to the canonical workflow. We assume we have already used SwissTargetPrediction to generate our list of predicted protein targets for “Drug X”.

**Action (in R):**  
Let’s start a new R script for this project: 01_deg_driven_analysis.R.

``` r
library(tidyverse)

# --- 1. Load and Define Target Lists ---

# Load the predicted direct targets for Drug X
drug_targets <- read_lines("data/drug_X_predicted_targets.txt")

cat("Found", length(drug_targets), "predicted direct targets for Drug X.\n")
```

-   **Verification:** This is our first list, `drug_targets`. It might contain ~100 proteins.

#### **Chunk 2: Define the Experimental Targets (DEGs)**

**Explanation:** This is the key difference. Instead of going to GeneCards, our “disease targets” are the genes that were significantly altered in our actual experiment. We will load the DEG results file and apply a statistical cutoff to define this list.

**Action (in R):**

``` r
# Load the full results from the RNA-seq experiment
deg_full_results <- read_csv("data/MCF7_drug_X_DEGs.csv")

# Define our list of significant DEGs
# We use a common cutoff: adjusted p-value < 0.05 and |log2FC| > 1
# (filter() silently drops rows where padj is NA)
experimental_targets <- deg_full_results %>%
  filter(padj < 0.05, abs(log2FoldChange) > 1) %>%
  filter(!is.na(gene_symbol)) %>% # skip genes without a symbol
  pull(gene_symbol) %>% # pull the gene symbol column
  unique()

cat("Found", length(experimental_targets), "significant DEGs (Experimental Targets).\n")
```

-   **Verification:** This is our second, and much larger, list: `experimental_targets`. It might contain several hundred or even a few thousand genes.
-   ***Most DEGs are protein-coding genes and can be mapped to their protein products using databases like UniProt or Ensembl. DEGs without a protein product (for example, lncRNAs) will not appear in a PPI network. Cytoscape does not perform this mapping as a separate step, so prepare the list externally and import it.***
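
The filtering step above can also be split by direction, which becomes useful later when the network is colored. The following is a minimal base R sketch on a toy table; the gene rows and values are invented for illustration and `deg_toy`, `up_genes`, and `down_genes` are hypothetical names, not part of the project files. It also shows how rows with a missing `padj` or symbol are handled explicitly (DESeq2 can report `NA` for `padj`).

``` r
# Toy stand-in for the DESeq2 results table (values invented for illustration)
deg_toy <- data.frame(
  gene_symbol    = c("MYC", "CCND1", "CDKN1A", "GAPDH", "LINC00001", NA),
  log2FoldChange = c(-2.1, -1.4, 1.8, 0.1, 3.0, -2.5),
  padj           = c(0.001, 0.02, 0.003, 0.90, NA, 0.01),
  stringsAsFactors = FALSE
)

# A row is usable only if it has both a gene symbol and an adjusted p-value
usable <- !is.na(deg_toy$gene_symbol) & !is.na(deg_toy$padj)

# Same cutoff as in the lesson: padj < 0.05 and |log2FC| > 1
is_sig <- usable & deg_toy$padj < 0.05 & abs(deg_toy$log2FoldChange) > 1

# Split the significant genes by direction of change
up_genes   <- unique(deg_toy$gene_symbol[is_sig & deg_toy$log2FoldChange > 0])
down_genes <- unique(deg_toy$gene_symbol[is_sig & deg_toy$log2FoldChange < 0])

cat("Upregulated:", length(up_genes), " Downregulated:", length(down_genes), "\n")
```

Keeping the two directions as separate vectors makes it easy to check whether a drug with an expected inhibitory effect produces mostly downregulated genes.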

### **Lesson 2: Step 2 - Building and Analyzing the Integrated Network**

**Goal:** To combine both lists, build a single integrated PPI network, and analyze it to find the key proteins that connect the drug’s direct action to its downstream effects.

#### **Chunk 1: Create the Combined Seed List for STRING**

**Explanation:** To see the connections, we need to build a network that contains all the relevant players. This means we will take the **union** of our two lists (all unique genes from both lists combined), not the intersection. This combined list will be our “seed” list for querying the STRING database.

**Action (in R):**

``` r
# Combine the two lists and get all unique gene symbols
combined_seed_list <- union(drug_targets, experimental_targets)

cat("Total number of unique proteins for building the network:", length(combined_seed_list), "\n")

# Save this list to a file so we can paste it into Cytoscape
write_lines(combined_seed_list, "data/combined_seed_list_for_string.txt")
```

-   **Verification:** You now have a single, large list of all the proteins that are either a direct drug target, an experimentally observed DEG, or both. This is the list you will now take to Cytoscape.
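
Because the union discards the information about where each gene came from, it can help to record the origin of every seed before moving to Cytoscape. The sketch below is one possible way to do this in base R, using toy vectors in place of the real lists; the column name `origin` and the output path in the final comment are assumptions made for illustration.

``` r
# Toy stand-ins for the two lists built in the previous lesson
drug_targets_toy         <- c("STAT3", "EGFR", "JAK2")
experimental_targets_toy <- c("MYC", "CCND1", "JAK2", "BCL2")

# Union of both lists, exactly as in the lesson
seeds <- union(drug_targets_toy, experimental_targets_toy)

# Flag the source list(s) of every seed
is_target <- seeds %in% drug_targets_toy
is_deg    <- seeds %in% experimental_targets_toy

node_origin <- data.frame(
  gene_symbol = seeds,
  origin = ifelse(is_target & is_deg, "both",
                  ifelse(is_target, "drug_target", "DEG")),
  stringsAsFactors = FALSE
)

print(table(node_origin$origin))

# Hypothetical output path; the table could later be imported as node attributes
# write.csv(node_origin, "data/node_origin.csv", row.names = FALSE)
```

A table like this can be imported into Cytoscape in the same way as any other node attribute table, so that drug targets and DEGs can be given different shapes or borders.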

#### **Chunk 2: Build the Network in Cytoscape**

**Explanation:** This process is **exactly the same** as Lesson 6 in the canonical workflow.

**Action:**

1.  Copy the contents of `combined_seed_list_for_string.txt` to your clipboard.

2.  Go to **Cytoscape**.

3.  Use the **STRING: protein query** to import a network from this list, using a confidence cutoff of **0.4**.

4.  **Result:** Cytoscape will build a large network containing both the drug’s direct targets and the DEGs, and all the known interactions that connect them.
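
For readers who prefer a scripted route, the same query can be expressed against the STRING REST API. The sketch below only builds the request URL and does not send it; the helper name `build_string_network_url` is hypothetical. It assumes human proteins (NCBI taxonomy ID 9606, appropriate for MCF7) and translates the 0.4 confidence cutoff to 400 on STRING’s 0 to 1000 scale.

``` r
# Build a STRING "network" API request URL for a short list of gene symbols.
# Hypothetical helper for illustration; it does not contact the server.
build_string_network_url <- function(genes, species = 9606, required_score = 400) {
  # STRING expects identifiers separated by carriage returns, encoded as %0d
  encoded <- vapply(genes, utils::URLencode, character(1), reserved = TRUE)
  ids <- paste(encoded, collapse = "%0d")
  paste0(
    "https://string-db.org/api/tsv/network",
    "?identifiers=", ids,
    "&species=", species,
    "&required_score=", required_score
  )
}

url <- build_string_network_url(c("STAT3", "JAK2", "MYC"))
cat(url, "\n")

# With network access, the edge table could then be read with, for example:
# edges <- read.delim(url)
```

A GET URL like this is only practical for short lists. For a seed list with hundreds or thousands of genes, the Cytoscape import described above is the more convenient route.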

#### **Chunk 3: Analyze the Network to Find Key Nodes**

**Explanation:** This process is **exactly the same** as Lesson 7 in the canonical workflow.

**Action:**

1.  In Cytoscape, run **Tools \> NetworkAnalyzer \> Analyze Network…** (in newer Cytoscape 3.x releases the same function appears as **Tools \> Analyze Network**).

2.  This calculates the Degree and BetweennessCentrality for every node in your large, integrated network.

3.  **Sort the Node Table** by Degree to identify the top hub proteins.

-   **The Critical Insight:** When you look at the list of top hubs, you will now find a fascinating mix of proteins. Some might be from your original `drug_targets` list, while others might be proteins from your `experimental_targets` list that were not direct drug targets but are clearly central to the network’s response. This immediately helps you find the key “mediator” proteins.
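
To make the idea of “sorting by Degree” concrete, the sketch below computes degree directly from an edge list in base R and flags which hubs are predicted drug targets. The edge list is a toy example invented for illustration, and the sketch assumes an undirected network without duplicate edges. It does not compute betweenness centrality, which is better left to Cytoscape or a dedicated graph library.

``` r
# Toy undirected edge list standing in for an exported STRING network
edges_toy <- data.frame(
  from = c("STAT3", "STAT3", "STAT3", "JAK2", "MYC"),
  to   = c("JAK2",  "MYC",   "CCND1", "EGFR", "CCND1"),
  stringsAsFactors = FALSE
)
drug_targets_toy <- c("STAT3", "EGFR", "JAK2")

# Degree = number of edges touching each node
degree_tab <- table(c(edges_toy$from, edges_toy$to))

hubs <- data.frame(
  gene_symbol = names(degree_tab),
  degree      = as.integer(degree_tab),
  stringsAsFactors = FALSE
)

# Mark which nodes come from the predicted drug target list
hubs$is_drug_target <- hubs$gene_symbol %in% drug_targets_toy

# Highest degree first; ties broken alphabetically
hubs <- hubs[order(-hubs$degree, hubs$gene_symbol), ]
print(hubs)
```

Hubs with `is_drug_target` set to `FALSE` are candidates for the “mediator” proteins described above.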

### **Lesson 3: Step 3 - Integrating Data and Generating a Hypothesis**

**Goal:** To make our network even more informative by overlaying our experimental data (the log2 fold changes) onto it, and then using this integrated view to generate a powerful final hypothesis.

#### **Chunk 1: Import Experimental Data into Cytoscape**

**Explanation:** A network is most powerful when it’s layered with experimental data. We will import our full DEG results table and map the log2FoldChange values to the color of the nodes.

**Action:**

1.  In Cytoscape, go to **File \> Import \> Table from File…**.

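2.  Select your `MCF7_drug_X_DEGs.csv` file.

3.  In the import dialog, make sure the “Key” column of the imported table is set to your `gene_symbol` column, and that the **Key Column for Network** is a node column that actually holds gene symbols. In a STRING-derived network this is typically `display name` or `query term`, because the `name` column holds STRING identifiers. This tells Cytoscape how to match the rows in your data file to the nodes in your network.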

4.  Click “OK”.

5.  **Verification:** Go to the **Node Table**. You will see that new columns have been added, including `log2FoldChange` and `padj`, directly from your experimental data.
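
If the new columns stay empty for many nodes, the usual cause is a key mismatch between the node names and the table. A quick check of this kind can be run in R before importing; the sketch below uses toy vectors, and the trailing space in one key is deliberate to show a typical problem.

``` r
# Toy node symbols (as shown in the network) and toy keys from the results table
node_symbols_toy <- c("STAT3", "JAK2", "MYC", "CCND1", "EGFR")
deg_keys_toy     <- c("MYC", "CCND1", "JAK2", "GAPDH", "STAT3 ")

# Exact matching, which is what a table import relies on
matched <- node_symbols_toy %in% deg_keys_toy
cat(sum(matched), "of", length(matched), "nodes match before cleaning\n")

# Stray whitespace is a common reason for silent mismatches
clean_keys    <- trimws(deg_keys_toy)
matched_clean <- node_symbols_toy %in% clean_keys
cat(sum(matched_clean), "of", length(matched_clean), "nodes match after trimws()\n")

# Nodes that still have no data row (for example, genes absent from the table)
unmatched <- node_symbols_toy[!matched_clean]
print(unmatched)
```

Nodes that remain unmatched after cleaning simply have no expression data and will keep the default node style.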

#### **Chunk 2: Map Fold Change to Node Color**

**Explanation:** Now we will use the Style interface to color our network.

**Action:**

1.  Go to the **“Style”** tab in the Control Panel.

2.  Select the **“Fill Color”** property for Nodes.

3.  Set the **“Column”** to `log2FoldChange`.

4.  Set the **“Mapping Type”** to **“Continuous Mapping”**.

5.  Choose a color scheme. A classic “Red-White-Blue” diverging palette is perfect: red for upregulated (positive log2FC), blue for downregulated (negative log2FC), and white for no change. Make sure the midpoint of the gradient sits at 0, so that white really corresponds to no change.

-   **Verification:** Look at your network. It is no longer a simple diagram; it is a **data-rich visualization**. You can now see at a glance:

    -   Which parts of the network were transcriptionally **upregulated (red)**.

    -   Which parts were **downregulated (blue)**.

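    -   You can also see if your direct drug targets are interacting primarily with red or blue nodes. The targets themselves may appear white if their own expression did not change, or in the default node color if they are missing from the results table.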
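
The continuous mapping can be reproduced outside Cytoscape to see exactly what it does. The sketch below is a minimal base R version of a blue-white-red gradient centered on 0; the function name `log2fc_to_color`, the clamping limit of 2, and the grey used for missing values are assumptions chosen for illustration, not Cytoscape defaults.

``` r
# Map log2 fold changes to a blue-white-red gradient centered on 0.
# Values beyond +/- limit are clamped; missing values get a neutral grey.
log2fc_to_color <- function(log2fc, limit = 2, na_color = "#BEBEBE") {
  colors <- rep(na_color, length(log2fc))
  ok <- !is.na(log2fc)

  # Clamp to [-limit, limit], then rescale to [0, 1] so that 0 maps to 0.5
  clamped <- pmax(pmin(log2fc[ok], limit), -limit)
  scaled  <- (clamped + limit) / (2 * limit)

  # colorRamp() interpolates between the three anchor colors
  ramp <- grDevices::colorRamp(c("blue", "white", "red"))
  colors[ok] <- grDevices::rgb(ramp(scaled), maxColorValue = 255)
  colors
}

example_colors <- log2fc_to_color(c(-3, 0, 2, NA))
print(example_colors)
```

Using a symmetric limit keeps white at exactly zero, which is the same reason the Cytoscape gradient should have its midpoint at 0.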

#### **Chunk 3: The Final Hypothesis**

**Explanation:** By integrating the network topology (the hubs) with the experimental data (the colors), we can now generate a sophisticated and verifiable hypothesis.

**Action (A Thought Process):**

1.  **Identify a top hub protein** from your network analysis (e.g., the protein STAT3).

2.  Note that STAT3 might be one of your predicted direct `drug_targets`, but it might be colored white/grey in your network, meaning its own gene expression didn’t change.

3.  **Observe its neighbors.** You look at the network and see that STAT3 is directly connected to 20 other proteins, and 15 of them are colored bright blue (strongly downregulated).

4.  **Formulate the Hypothesis:**

“Our DEG-driven network analysis reveals that Drug X’s mechanism is likely mediated through the transcription factor **STAT3**. Although STAT3 itself is not transcriptionally altered, it is a predicted direct target of Drug X and acts as a central hub in the interaction network. We observed that the majority of STAT3’s direct interaction partners are significantly **downregulated** following drug treatment. Therefore, we hypothesize that Drug X directly binds to and inhibits the activity of the STAT3 protein, which in turn leads to the observed transcriptional suppression of its target gene network, resulting in an anti-proliferative effect.”

This is a far more powerful and detailed hypothesis than we could have generated with the canonical method alone, because it directly links the drug’s physical action to the specific experimental results we observed.
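
The “observe its neighbors” step of the thought process can be turned into a small count rather than a visual impression. The sketch below is one way to do this in base R; the edge list, the expression values, and the helper name `summarize_neighbors` are all invented for illustration, and the cutoffs mirror the ones used earlier in this section.

``` r
# Count how many direct neighbors of a hub are significantly up- or downregulated.
# Hypothetical helper; edges are treated as undirected.
summarize_neighbors <- function(hub, edges, deg_table,
                                lfc_cutoff = 1, padj_cutoff = 0.05) {
  neighbors <- unique(c(edges$to[edges$from == hub], edges$from[edges$to == hub]))
  neighbors <- setdiff(neighbors, hub)

  # Look up each neighbor; genes missing from the table yield NA rows
  info <- deg_table[match(neighbors, deg_table$gene_symbol), ]
  sig  <- !is.na(info$padj) & info$padj < padj_cutoff
  lfc  <- info$log2FoldChange

  c(
    n_neighbors = length(neighbors),
    n_down      = sum(sig & !is.na(lfc) & lfc < -lfc_cutoff),
    n_up        = sum(sig & !is.na(lfc) & lfc > lfc_cutoff)
  )
}

# Toy network and toy expression results
edges_toy <- data.frame(
  from = c("STAT3", "STAT3", "STAT3", "JAK2", "MYC"),
  to   = c("MYC", "CCND1", "BCL2", "STAT3", "CCND1"),
  stringsAsFactors = FALSE
)
deg_toy <- data.frame(
  gene_symbol    = c("MYC", "CCND1", "BCL2", "JAK2", "STAT3"),
  log2FoldChange = c(-2.1, -1.4, -1.2, 0.2, 0.05),
  padj           = c(0.001, 0.02, 0.04, 0.70, 0.90),
  stringsAsFactors = FALSE
)

stat3_summary <- summarize_neighbors("STAT3", edges_toy, deg_toy)
print(stat3_summary)
```

A count like this describes which interaction partners changed in expression. It does not by itself show that those partners are transcriptional targets of the hub, since PPI edges are not regulatory edges, so the hypothesis still needs experimental validation.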