# Bioinformatics II: Computational Analysis of Protein Data
## Tutorial Notebook - Protein Sequences, Structures, and Interaction Networks

**Learning Path:** Beginner → Intermediate → Advanced

---

### What You'll Learn Today:
1. **Part 3 (Intermediate-Advanced):** Protein interaction networks - Graph construction and analysis

---


## PART 3: Protein-Protein Interaction Networks (Intermediate-Advanced)

### 3.1 Understanding Protein Interaction Networks

### Background:
Proteins rarely work alone. They interact with other proteins to perform cellular functions. These interactions form networks that can be analyzed using graph theory.

**Network Components:**
- **Nodes:** Proteins
- **Edges:** Interactions between proteins
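
As a minimal sketch of the two components above, the following toy graph uses `networkx` with three proteins and two interactions. The edges are chosen for illustration only and are not retrieved from any database.

```python
import networkx as nx

# Toy example (illustrative only): three proteins, two interactions
toy_graph = nx.Graph()
toy_graph.add_edge("TP53", "MDM2")   # adding an edge also adds its two nodes
toy_graph.add_edge("TP53", "EP300")

print(list(toy_graph.nodes()))  # the proteins
print(list(toy_graph.edges()))  # the interactions
```

The graph is undirected, so the edge `("TP53", "MDM2")` is the same as `("MDM2", "TP53")`.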



### 3.0 Required libraries
This example retrieves protein-protein interaction data from STRING DB through the web API provided by the database. It uses four libraries:
1. `requests` sends the web request and retrieves the response.
2. `pandas` converts the retrieved JSON response into a data frame.
3. `networkx` builds the network graph structure from the protein-protein interaction data. https://networkx.org
4. `matplotlib` supports the drawing functions in `networkx`.

In [None]:
# Install the required libraries:
!pip install requests pandas matplotlib networkx

In [None]:
import requests 
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx

### Use Case 4: Analyzing the p53 Tumor Suppressor Network
**Biological Context:** p53 is a crucial protein in cancer prevention. It interacts with many proteins to regulate cell division.

**Data Source:** STRING Database (https://string-db.org/)
- Format: JSON response retrieved through the STRING web API

### 3.1 Accessing the Web API of STRING DB
The following code shows the steps for accessing the STRING DB web API. First, set the API endpoint that returns the network data, which is `https://string-db.org/api/json/network`. Then set the required parameters, such as `identifiers` and `species` in the example. The available parameters are described in the STRING DB API documentation: https://string-db.org/cgi/help?sessionId=bZSpsS1iWGJ6<br><br>
Next, send the endpoint URL and the parameters to the web server to request a response. The returned response is parsed with its `json()` method.<br><br>
<span style='color:red'>Note: Always check that the response contains the expected result and not an error message.</span>

In [None]:
string_url = "https://string-db.org/api/json/network"

protein_name = "TP53"

params = {
    "identifiers": protein_name,  # protein name
    "species": 9606,              # NCBI taxonomy ID for human
    "limit": 20,                  # number of interaction partners to retrieve (10 is the default if not set)
}
response = requests.get(string_url, params=params)
if response.status_code == 200:
    network = response.json()
    print(f"Downloaded {len(network)} interactions for {protein_name}")
    if network:
        print(network[0])  # inspect the first interaction record
else:
    # network is not defined in this case, so do not try to print it
    print(f"Failed to download. Status code: {response.status_code}")
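
To see what the request above sends, the sketch below builds the query string by hand with the standard library. It only approximates what `requests` does internally with the `params` dictionary, and it does not contact the server. The names `base_url`, `query`, and `full_url` are introduced here for illustration.

```python
from urllib.parse import urlencode

base_url = "https://string-db.org/api/json/network"
query = {"identifiers": "TP53", "species": 9606, "limit": 20}

# urlencode turns the dictionary into "key=value" pairs joined by "&"
full_url = f"{base_url}?{urlencode(query)}"
print(full_url)
```

Pasting the printed URL into a browser is a quick way to check the raw response when a request does not behave as expected.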

In [None]:
network_df = pd.json_normalize(network)
network_df.head()  # show the first five rows of the data frame

In [None]:
network_df.shape

The data frame above shows the attributes that the web server returns for each protein-protein interaction. In this example we focus on the 3rd and 4th columns, `preferredName_A` and `preferredName_B`. Each row indicates an interaction between these two proteins. <br><br>The `score` is a measure provided by STRING DB. It DOES NOT indicate the strength of the interaction but the CONFIDENCE, that is, how likely STRING judges the interaction to be true based on the available evidence. Details: https://string-db.org/cgi/info?footer_active_subpage=scores

In [None]:
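# Keep the two protein names and the confidence score, with shorter column names
ppi_data = network_df[['preferredName_A', 'preferredName_B', 'score']].rename(
    columns={'preferredName_A': 'protein1', 'preferredName_B': 'protein2'}
)
ppi_data.head(10)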
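
Because the score expresses confidence, one common use is to keep only interactions above a chosen threshold. The sketch below uses made-up rows in the same shape as `ppi_data`. The scores are illustrative and not real STRING output, and the cut-off `min_score = 0.7` is an assumption to adjust as needed.

```python
import pandas as pd

# Made-up rows in the same shape as ppi_data (scores are illustrative only)
toy_ppi = pd.DataFrame({
    "protein1": ["TP53", "TP53", "TP53"],
    "protein2": ["MDM2", "EP300", "ATM"],
    "score": [0.95, 0.80, 0.55],
})

min_score = 0.7  # assumed cut-off, not a value prescribed by this tutorial
high_confidence = toy_ppi[toy_ppi["score"] >= min_score]
print(high_confidence)
```

The same boolean filter can be applied to `ppi_data` before building the graph.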

### 3.2 Generate network graph for the protein-protein interaction
From here on, we use the data frame to generate the network graph. We call the `from_pandas_edgelist()` function from `networkx`, passing in the data frame generated above followed by the names of the two columns that hold the interacting proteins (the source and target of each edge).

In [None]:
network_graph = nx.from_pandas_edgelist(ppi_data, "protein1", "protein2")
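
The call above builds edges from the two protein columns only, so the score is not stored in the graph. As a hedged variation, `from_pandas_edgelist()` also accepts an `edge_attr` argument that copies a column onto each edge. The toy data frame below is made up for illustration.

```python
import networkx as nx
import pandas as pd

# Made-up rows (scores are illustrative, not real STRING output)
toy_ppi = pd.DataFrame({
    "protein1": ["TP53", "TP53"],
    "protein2": ["MDM2", "EP300"],
    "score": [0.95, 0.80],
})

# edge_attr="score" stores the score column as an attribute of each edge
weighted_graph = nx.from_pandas_edgelist(toy_ppi, "protein1", "protein2", edge_attr="score")
print(weighted_graph["TP53"]["MDM2"]["score"])
```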

Once the network graph structure is generated, we can access many properties of the network. For example, we can get the number of edges and the number of nodes using `number_of_edges()` and `number_of_nodes()` respectively.

In [None]:
print(f"Network built with {network_graph.number_of_nodes()} nodes and {network_graph.number_of_edges()} edges")

We can also get the number of interactions for each node using the `degree()` method of the network graph object.

In [None]:
network_graph.degree()
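
The degree view can be converted to a dictionary and sorted to rank proteins by their number of interactions. The sketch below uses a toy graph with illustrative edges, and the names `degrees` and `ranked` are introduced here.

```python
import networkx as nx

# Toy graph (edges are illustrative only)
toy_graph = nx.Graph([
    ("TP53", "MDM2"),
    ("TP53", "EP300"),
    ("TP53", "ATM"),
    ("MDM2", "EP300"),
])

degrees = dict(toy_graph.degree())  # {protein: number of interactions}
ranked = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```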

### 3.3 Network Visualization
Next, we can visualize the network. The `spring_layout()` function from `networkx` takes the network graph object and computes a layout, that is, the coordinates of each node. The layout starts from random positions, so by default it is different on every run. Setting the `seed` parameter of `spring_layout()` to a fixed number makes it generate the same layout every time.

In [None]:
slayout = nx.spring_layout(network_graph, seed=125)
slayout
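
The effect of `seed` can be checked directly. The sketch below computes the layout of a toy graph (illustrative edges) twice with the same seed and compares the coordinates node by node.

```python
import networkx as nx

# Toy graph (edges are illustrative only)
toy_graph = nx.Graph([
    ("TP53", "MDM2"),
    ("TP53", "EP300"),
    ("TP53", "ATM"),
])

layout_a = nx.spring_layout(toy_graph, seed=125)
layout_b = nx.spring_layout(toy_graph, seed=125)

# Each layout maps a node to a NumPy array holding its (x, y) coordinates
same = all((layout_a[node] == layout_b[node]).all() for node in toy_graph.nodes())
print(same)
```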

Then, to draw the graph, use the `draw()` function from `networkx`, passing in the network graph object and the layout from above. Several display settings are available, such as:
- `with_labels`: set to `True` to display a text label on each node.
- `node_size`: the size of the nodes.
- `node_color`: the color of the nodes.
- `edge_color`: the color of the edges.
- `font_size`: the size of the label font.

In [None]:
nx.draw(network_graph, slayout, with_labels=True, node_size=1000, node_color='lightblue', font_size=8)
plt.title('Protein-Protein Interaction Network', fontsize=16)

### 3.4 Analysis of network

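Next, we can measure the centrality of the nodes with the `degree_centrality()` function from `networkx`, passing in the network graph object. It returns the degree centrality of each node, which is the node's degree divided by the maximum possible degree (the number of nodes minus one). A value closer to 1.0 means the node is connected to most of the other nodes and tends to sit at the center of the network. The values are therefore directly proportional to the output of `degree()` above.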

In [None]:
degree_centrality = nx.degree_centrality(network_graph)
degree_centrality
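
To make the relationship with `degree()` concrete, the sketch below recomputes degree centrality by hand as degree divided by (number of nodes - 1) and compares it with the `networkx` result. The toy graph is illustrative, and the names `manual` and `builtin` are introduced here.

```python
import math
import networkx as nx

# Toy graph (edges are illustrative only)
toy_graph = nx.Graph([
    ("TP53", "MDM2"),
    ("TP53", "EP300"),
    ("TP53", "ATM"),
    ("MDM2", "EP300"),
])

n = toy_graph.number_of_nodes()
# Degree centrality = degree / (n - 1), the largest degree a node could have
manual = {node: degree / (n - 1) for node, degree in toy_graph.degree()}
builtin = nx.degree_centrality(toy_graph)

for node in toy_graph.nodes():
    print(node, manual[node], math.isclose(manual[node], builtin[node]))
```

In this toy graph TP53 is connected to all three other proteins, so its centrality is 1.0.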

In [None]:
top_5_proteins = sorted(degree_centrality.items(), key=lambda x:-x[1])[:5]
top_5_proteins

In [None]:
high_centrality_nodes = [node for node, centrality in top_5_proteins]
high_centrality_nodes

In [None]:
nx.draw(network_graph, slayout, with_labels=True, node_size=1000, node_color='lightblue')
nx.draw_networkx_nodes(network_graph, slayout, nodelist=high_centrality_nodes, node_size=1000, node_color='orange')
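
As an alternative sketch to drawing the highlighted nodes in a second call, `node_color` also accepts a list with one color per node, in the same order as the graph's nodes. The toy graph and the `highlighted` set below are hypothetical.

```python
import networkx as nx

# Toy graph (edges are illustrative only)
toy_graph = nx.Graph([
    ("TP53", "MDM2"),
    ("TP53", "EP300"),
    ("TP53", "ATM"),
])

highlighted = {"TP53", "MDM2"}  # hypothetical nodes to emphasize

# One color per node, in the order returned by toy_graph.nodes()
node_colors = [
    "orange" if node in highlighted else "lightblue"
    for node in toy_graph.nodes()
]
print(node_colors)
```

The resulting list can then be passed as `node_color=node_colors` to `nx.draw()` so that the whole network is drawn in a single call.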

---