# Week 5 - Networks 

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
In this lab you'll do some exercises to familiarise yourself with network properties and concepts.
</div>

## Setup

**imports**

In [None]:
import itertools
import networkx as nx
import matplotlib.pyplot as plt
import pickle
%matplotlib inline

**Data**

This week we need a Protein-Protein Interaction (PPI) network. <br>
The data was loaded into a networkx graph then stored as binary. <br>
This data is available at https://github.com/melbournebioinformatics/COMP90014_2024/blob/master/tutorials/data/week5/saccharomyces_cerevisiae_ppi.pickle. 

Create a ./data folder as usual, download the file above, then place it in the ./data folder. 
Use the shell commands below, or download the file manually using a browser. 

<div style="font-size: 16px">

(Bash Shell)
> ```Bash
> mkdir -p data
> cd data
> wget -O saccharomyces_cerevisiae_ppi.pickle "https://github.com/melbournebioinformatics/COMP90014_2024/blob/master/tutorials/data/week5/saccharomyces_cerevisiae_ppi.pickle?raw=true"
> ```


## **INTRODUCTION**

_Please complete these exercises at your own pace at the start of the class to familiarise yourself with networks. NOTE: We will be using the networkx package in assignments!_


## Exploring classes in python.

**Data in python**

Python supports many types of data containers. These include ***variables***, ***collections*** (such as lists), and ***classes***. <br>
In python, these are all internally stored as objects. 

Quoting from the [python reference docs](https://docs.python.org/3/reference/datamodel.html): <br>
*"Objects are Python’s abstraction for data. All data in a Python program is represented by objects or by relations between objects."*

Even imported packages / external python files using the ***module system*** are stored as objects. 

These objects may have subobjects. For example:
- The functions available in an imported package
- Attributes/methods on classes


**Investigating objects**

To view the available properties of an object, we can use the <small>`dir(object)`</small> function. <br>
This will display all methods (functions) and attributes (variables) belonging to an object. 
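The output of `dir()` can be long, because it also includes internal names that start with an underscore. As a minimal sketch, the hypothetical helper `public_names` below (not part of the lab or of networkx) filters those out, using the built-in `list` type as the example object:

```python
# List only the "public" names on an object by skipping names that start with "_".
def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

names = public_names(list)
print(names)  # includes 'append', 'clear', 'copy', ...

# Methods are simply attributes that can be called.
print(callable(getattr(list, "append")))
```

The same helper should work on any object, for example `public_names(nx.Graph)`.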

> **Note** <br>
> Aside from using the python <small>`dir(object)`</small> function, there are other ways to investigate objects:
> - Read the docs / manual (eg the [python docs](https://www.python.org/doc/), or the [networkx docs](https://networkx.org/documentation/stable/reference/index.html))
> - By right clicking -> "Go to definition" (F12) in VSC 
> - By investigating objects during runtime using the VSC debugger. <br><br>

We can use <small>`dir(object)`</small> to investigate the contents of the ***networkx*** package. 


We will work with **Graph()** (undirected) and **DiGraph()** (directed) graph classes supplied by networkx today. <br>
Let's see what attributes the Graph() class has:


In [None]:
dir(nx.Graph)

We can try a few of these out to see how they work. Run the cell below.

In [None]:
graphA = nx.Graph()
print(graphA.is_directed())
print(graphA.size())

graphA.add_edge('A', 'B')
print(graphA.size())

Rather than printing out DiGraph's methods, the following cell will list the attributes which are **exclusive** to Graph or DiGraph.

In [None]:
graphA_attributes = set(dir(nx.Graph))
graphB_attributes = set(dir(nx.DiGraph))

print('\nGraphA (Graph) exclusive attributes:')
print(list(graphA_attributes - graphB_attributes))

print('\nGraphB (DiGraph) exclusive attributes:')
print(list(graphB_attributes - graphA_attributes))


The directed graph class has more methods than the undirected graph class.  <br>
This is because directed graphs differentiate between incoming vs outgoing edges on a node. 
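To make the incoming/outgoing distinction concrete, here is a small sketch on a made-up three-node graph (the name `toy` and its edges are illustrative only, not one of the lab graphs):

```python
import networkx as nx

# A tiny directed graph: A -> B, A -> C, C -> B
toy = nx.DiGraph()
toy.add_edges_from([("A", "B"), ("A", "C"), ("C", "B")])

# Directed graphs count incoming and outgoing edges separately.
print(toy.in_degree("B"), toy.out_degree("B"))

# Nodes with an edge into B, and nodes that A points to.
print(sorted(toy.predecessors("B")))
print(sorted(toy.successors("A")))

# The undirected class has no such distinction.
print(hasattr(nx.Graph, "in_degree"), hasattr(nx.DiGraph, "in_degree"))
```

Here B has two incoming edges and no outgoing edges, which an undirected graph could not express.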

## Intro exercise 1 - Interpreting Graphs

Given the <b>undirected</b> graph drawn below, write down the adjacency matrix.<br>
Do this inside the markdown cell below. 

For example (different network): 

```
  X Y Z
X 0 1 1
Y 1 0 0 
Z 1 0 0 
```  
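The example matrix above can also be built in code. This is a plain-Python sketch (no networkx) that assumes the example network has the edges X-Y and X-Z, as read from the matrix; the variable names are illustrative:

```python
# Build the adjacency matrix for the X/Y/Z example from its edge list.
nodes = ["X", "Y", "Z"]
edges = [("X", "Y"), ("X", "Z")]

# Map each node label to its row/column position.
index = {node: i for i, node in enumerate(nodes)}

# Start with all zeros, then mark each edge in both directions.
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1  # undirected: mirror the entry

for node, row in zip(nodes, matrix):
    print(node, *row)
```

Each row is printed with its node label, matching the layout of the example.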

<img src="https://raw.githubusercontent.com/melbournebioinformatics/COMP90014_2024/master/tutorials/media/week5/small_graph_undirected.png" width="400">

```
# YOUR ANSWER HERE
  A B C D
A        
B        
C        
D        
```

Given the <b>directed</b> graph drawn below, write down the adjacency matrix. <br>
Think about whether the matrix should still be symmetrical.

<img src="https://raw.githubusercontent.com/melbournebioinformatics/COMP90014_2024/master/tutorials/media/week5/small_graph_directed.png" width="400">



```
# YOUR ANSWER HERE   
            (child)  
           A B C D E 
         A           
         B           
(parent) C           
         D           
         E           
```

## Intro exercise 2 - Defining Graphs

Create the above graphs in networkx. Use the <b>graph_object.add_edge()</b> method to add edges. 
<br>Cells showing how to draw graphA, and how to show other representations of the data, are given below.

Define an **undirected** networkx graph object and add nodes/edges:

In [None]:
graphA = nx.Graph()

# YOUR CODE HERE
raise NotImplementedError

Define a **directed** networkx graph object and add nodes/edges:

In [None]:
graphB = nx.DiGraph()

# YOUR CODE HERE
raise NotImplementedError


<br>The following 4 cells show different representations of our graph:<br>

In [None]:
nx.draw_spring(graphA, with_labels=True, node_size=1200, node_color='#eeeeff', edge_color='red')

In [None]:
nx.adjacency_matrix(graphA)

In [None]:
print(nx.adjacency_matrix(graphA))

The output from the cells above lists only the connections (edges) between nodes. Pairs of nodes that are not connected are left out, because storing every zero would waste space in a large, sparse network. Print the numpy array below to see the full matrix: its non-zero entries sit at the coordinates listed above. 

In [None]:
print(graphA.nodes())
nx.to_numpy_array(graphA)

<br>networkx can convert graph data to 'numpy' arrays. numpy is a popular python library. 
<br>numpy allows matrix and vector operations to be performed quickly and efficiently. This makes sense if our network gets very big! 

Let's also check graphB to see if we created it correctly. Draw `graphB` and print its adjacency matrix (as above) in the following cell:

In [None]:
nx.draw_spring(graphB, with_labels=True, node_size=1200, node_color='#eeeeff', edge_color='red')
print(nx.adjacency_matrix(graphB))
nx.to_numpy_array(graphB)

## **Exercises**

</div>

## Exercise 1 - Network Properties 

Complete the function below to find the degree distribution for any given graph. You can use the networkx method `graph.degree()`, which returns the number of edges connected to each node. You should return a tuple of two lists: the first list contains all observed vertex degree values in the graph, and the second contains the counts showing how often a vertex with that degree was observed.

For instance, calling `degree_distribution()` on `graphA` above could return

```([1, 2, 3], [1, 2, 1])```

meaning that there is one vertex with degree 1 (D), two vertices with degree 2 (A and B), and one vertex with degree 3 (C).

These two lists will give us a handy form for plotting the degree distribution.
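As a hint for the tallying step only, here is a sketch that starts from a hypothetical list of degree values (one per vertex) rather than from a graph. Getting those values out of a networkx graph is left to you; `collections.Counter` is one possible tool, not a required one:

```python
from collections import Counter

# Hypothetical degree values, one per vertex (not read from a real graph)
vertex_degrees = [2, 2, 3, 1]

tally = Counter(vertex_degrees)        # maps degree -> number of vertices
degrees = sorted(tally)                # observed degree values, ascending
counts = [tally[d] for d in degrees]   # matching counts, in the same order

print((degrees, counts))
```

Sorting the degree values keeps the two lists aligned and gives a tidy x-axis when plotting.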

In [None]:
# Here's the networkx function `graph.degree()`:
graphA.degree('C')

In [None]:
def degree_distribution(graph):
    """
    For the networkx graph provided, return a tuple of lists, where
    the first list gives all observed vertex degrees, and the second list gives
    the corresponding vertex counts.
    """
    # YOUR CODE HERE
    raise NotImplementedError
    return (degrees, counts)        

Once you have this function, you can draw the degree distribution with a scatter plot:

In [None]:
# Graph A:
degrees, counts = degree_distribution(graphA)
fig, ax = plt.subplots()
ax.scatter(degrees, counts)

Here are some graphs of types described in lectures. You can generate other graph types with networkx functions described at https://networkx.github.io/documentation/stable/reference/generators.html

A random (Erdos-Renyi) graph:

In [None]:
# 600 nodes, probability of each edge 0.4
random_graph = nx.fast_gnp_random_graph(600, 0.4)

A scale-free graph:

In [None]:
# 600 nodes
scale_free_graph = nx.scale_free_graph(600)

If you are finding the degree distribution correctly, you can plot the distributions for these different graph types:

In [None]:
degrees, counts = degree_distribution(random_graph)
fig, ax = plt.subplots()
ax.scatter(degrees, counts)

In [None]:
degrees, counts = degree_distribution(scale_free_graph)
fig, ax = plt.subplots()
ax.scatter(degrees, counts)

The plot for the scale-free graph doesn't look very clear as the relationship shown in lectures is on a log-log scale. Try using `ax.set_xscale('log')` and `ax.set_yscale('log')` on your plot to see this relationship more clearly.
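As a sketch of what the log-log axes do, the cell below plots made-up values that follow a power law (counts = 1000 / degree²). These numbers are synthetic and are not taken from `scale_free_graph`:

```python
import matplotlib.pyplot as plt

# Synthetic, made-up values following counts = 1000 / degree**2
degrees = [1, 2, 4, 8, 16, 32]
counts = [1000 / d**2 for d in degrees]

fig, ax = plt.subplots()
ax.scatter(degrees, counts)

# Switch both axes to a logarithmic scale.
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('degree')
ax.set_ylabel('count')
```

On log-log axes a power-law relationship appears as a straight line, which is much easier to recognise than the steep curve seen on linear axes.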

In [None]:
nx.draw_spring(graphA, with_labels=True, node_size=1200, node_color='#eeeeff', edge_color='red')

<div style="border: solid 1px rgb(129,199,132); padding: 10px;">
<h3>Exercise 2: Protein-protein interaction networks</h3>

<h4> The Data:</h4>
In this question we will look at part of the baker's yeast Protein-Protein-Interaction network, taken from the [STRING database](https://string-db.org/). 

Some information about the PPI network:
- The graph is undirected
- Nodes are proteins
- Edges represent a protein-protein interaction
- Edges are unlabelled.
    
<h4> The Challenge:</h4>
 Write a function to calculate the local clustering coefficient for a node in a graph. The clustering coefficient of a node quantifies how close its neighbours are to forming a complete graph, in which every neighbour is connected to every other neighbour. This function already exists in networkx, but don't use it - implement it yourself:
    
- [ ] Input: graph [networkx.Graph], node label
- [ ] Output: clustering coefficient (float)
    
<b>Hints:</b>
- Find all neighbours of a query node
- Calculate the possible pairwise combinations of the neighbours (excluding query node)
- Find the number of these pairwise connections that actually exist in the graph
- Return the actual neighbour edges / possible neighbour edges

</div>
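The hints above can be summarised as a formula. The symbols are introduced here for illustration and do not appear elsewhere in the lab: $k_v$ is the number of neighbours of node $v$, and $e_v$ is the number of edges that actually exist between those neighbours.

$$
C_v = \frac{e_v}{\binom{k_v}{2}} = \frac{2\,e_v}{k_v\,(k_v - 1)}
$$

For example, a node with $k_v = 4$ neighbours has $\binom{4}{2} = 6$ possible neighbour pairs. If $e_v = 3$ of those pairs are connected, then $C_v = 3/6 = 0.5$. The formula is undefined when $k_v < 2$; networkx returns 0 in that case.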

##### Explore the network

In [None]:
# Import the network (the with-block closes the file once it has been read)
with open('data/saccharomyces_cerevisiae_ppi.pickle', 'rb') as handle:
    ppi_network = pickle.load(handle)
# Print graph object
print(ppi_network)

In [None]:
# Visualise Graph
nx.draw_spring(ppi_network, with_labels=True, node_size=1200, node_color='#eeeeff', edge_color='red')

In [None]:
# Print the adjacency matrix
print(nx.adjacency_matrix(ppi_network))

In [None]:
#Print the distribution
degrees, counts = degree_distribution(ppi_network)
fig, ax = plt.subplots()
ax.scatter(degrees, counts)

### Short answer question 1

(1 mark, max 25 words)

<div class="alert alert-info">

What distribution do you think this resembles? 

</div>

<span style="color:rgb(17, 122, 121); font-family:Courier"><i><b># -- GRADED CELL (1 mark) - complete this cell --</b></i></span>

YOUR ANSWER HERE

In [None]:
def clustering_coefficient(graph, node_label):
    """
    Calculate and return the clustering coefficient for a node in the protein-protein network.
    The clustering coefficient is the number of edges between neighbors 
    divided by the possible number of edges between neighbors.
    """
    # YOUR CODE HERE
    raise NotImplementedError


In [None]:
# Should give 0.7421052631578947
clustering_coefficient(ppi_network, 'Q0060')


Now try testing for the node 'Q0010'. Does your function run efficiently?

In [None]:
clustering_coefficient(ppi_network, 'Q0010')

Compare the performance to the Networkx implementation. If you would like to challenge yourself, have a look at the [source code](https://networkx.org/documentation/stable/_modules/networkx/algorithms/cluster.html#clustering) for the Networkx implementation and see if you can improve your solution.

In [None]:
# Should give 0.28018948751541606
nx.clustering(ppi_network, "Q0010")

# Extension activities

### Extension 1 - Exam style question
Using the Networkx package (or your function above if you managed to find an efficient implementation!), find the nodes with the highest clustering coefficients.

In [None]:
def max_clustering_coefficients(graph, x):
    """
    Write a function to collect the top 'x' nodes ranked by clustering_coefficient. 
    Return an ordered list of tuples in the format [(node, clustering_coefficient)]
    """
    # YOUR CODE HERE
    raise NotImplementedError

In [None]:
max_clustering_coefficients(ppi_network, 3)

### Short answer question 2

(3 marks, max 100 words)

<div class="alert alert-info">

In the context of the protein-protein interaction graph, what can we infer about the nodes with the highest clustering coefficients? How might this be important in a disease context?
    
</div>
=== BEGIN MARK SCHEME ===

A high clustering coefficient indicates that the subgraph around that node is highly interconnected (dense).<br>
We can infer that this protein is likely involved in a protein complex or a signalling pathway where it is interacting with many other proteins.<br>
In the context of disease, mutations in these proteins are likely to have a greater impact than mutations in a protein with a low clustering coefficient, and hence these proteins may be good drug targets.<br>
=== END MARK SCHEME ===


<span style="color:rgb(17, 122, 121); font-family:Courier"><i><b># -- GRADED CELL (3 marks) - complete this cell --</b></i></span>

YOUR ANSWER HERE

### Extension 2 - Exploring the Graph


Sometimes we want to know whether a particular node is reachable from another node. <br>
For example: if we start at node A, is there a path in the graph to node B? 

For this kind of task we often use Breadth First Search (BFS). <br>
BFS is a good choice because it branches out from the starting location in all directions when exploring the graph. 
For unweighted graphs, this will result in the shortest path from source node to destination node. 

In the cell below, implement BFS for networkx graphs. 

Your implementation should use the .neighbors() function to get the edges from a particular node. <br>
For a directed graph, .neighbors() returns only the outward edges from a node. <br>
For this reason, your implementation should be valid for either an undirected or directed graph.

You can use the following pseudocode to help your implementation:
```
function bfs_reachable( source, dest, graph )
    return true if source is dest   # a node can always reach itself

    let Q be a queue
    let V be a set
    Q.enqueue( source )  # add node to queue
    V.add( source )      # mark as visited
 
    while ( Q is not empty )
        v = Q.dequeue()
        for all neighbors n of v in graph:
            return true if n is dest
            if n is not visited:
                Q.enqueue( n )   # add node to queue
                V.add( n )       # mark node as visited

    return false
```
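The pseudocode relies on a queue and a record of visited nodes. As a sketch of those two data structures only (not of the search itself), Python's `collections.deque` and the built-in `set` are one possible choice; the node labels below are placeholders:

```python
from collections import deque

# A deque works as a first-in, first-out queue.
queue = deque()
queue.append('A')        # enqueue
queue.append('B')
queue.append('C')

first = queue.popleft()  # dequeue: removes the oldest item
print(first, list(queue))

# A set gives fast membership checks for visited nodes.
visited = set()
visited.add('A')
print('A' in visited, 'B' in visited)
```

Removing items from the left of a `deque` is cheap, whereas `list.pop(0)` has to shift every remaining element.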

In [None]:
def bfs_reachable(source_node, target_node, graph):
    """
    For a given source node, destination node, and a graph, return True if 
    it is possible to reach the destination from the source, else False. 
    """
    # YOUR CODE HERE
    raise NotImplementedError


Let's check our bfs_reachable function using the cell below. 

In [None]:
# defining test graph
graph = nx.DiGraph()
graph.add_edge('A','B')
graph.add_edge('A','C')
graph.add_edge('C','B')
graph.add_edge('C','D')
graph.add_edge('E','C')

# running bfs_reachable
print(bfs_reachable('A', 'D', graph))  # should return True
print(bfs_reachable('E', 'B', graph))  # should return True
print(bfs_reachable('E', 'A', graph))  # should return False
print(bfs_reachable('A', 'A', graph))  # should return True