# COMP90014 Assignment 2
### Semester 2, 2019

**Task 6 has been amended - please use this version.**

This assignment should be completed by each student individually. Make sure you read this entire document, and ask for help if anything is not clear. Any changes or clarification to this document will be announced via the LMS.

Please make sure you are aware of the University's rules on academic honesty and plagiarism, which are very strict: https://academichonesty.unimelb.edu.au/ 

Make sure you **do not** copy any code either from other students or from the internet. This is considered plagiarism. It is generally a good idea to avoid looking at any solutions as you may find it surprisingly difficult to generate your own solution to the problem once you have seen somebody else's.

Your completed notebook file containing all your answers will be turned in via LMS. No other files or formats will be accepted - only upload the completed `.ipynb` file.

### Overview
To complete the assignment you will need to finish the tasks in this notebook. There are multiple tasks that are connected in a logical order.

The tasks are a combination of writing your own implementations of algorithms we've discussed in lectures, writing your own code to use library implementations of these algorithms and interpreting the results in short answer format. Each short answer question has a word limit that will be strictly enforced! **Please note that for this assignment, you may be awarded a mark of zero for a question if you go over the word limit.**

In some cases, we have provided test input and test output that you can use to try out your solutions. These tests are just samples and are not exhaustive - they may warn you if you've made a mistake, but they are not guaranteed to. It's up to you to decide whether your code is correct.

### Marking

Cells that must be completed to receive marks are clearly labeled. There are 19 graded cells, some of which are code cells, in which you must complete the code to solve a problem, and some of which are markdown cells, in which you must write your answers to short-answer questions. 

In this assignment, every graded cell is worth 2 marks. In addition to the graded cells, up to 7 marks will be given for code style, readability, efficiency and comments. 

The total marks for the assignment add up to 45, and it will be worth 15% of your overall subject grade.

Please make sure that you do not edit the "GRADED CELL" comments in either the code or the markdown cells, as this will disrupt the marking system.

### Background and data 

WGCNA stands for weighted gene co-expression network analysis. It is a data analysis technique used for studying biological networks based on pairwise correlations of gene expression data. WGCNA is good at identifying clusters of genes that may be co-regulated, and therefore may have shared biological function.

For this assignment, you will primarily be using the [FlyAtlas](http://flyatlas.org) dataset. Instead of using the probe-wise dataset, we will be using the expression value for each gene.



## Task 0 - Setup 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import altair

In [None]:
import pandas as pd
import numpy as np
import networkx as nx
import scipy
import re
from io import StringIO
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform

### Read in data

In [None]:
raw_expression = pd.read_csv('flyatlas_subset.csv.gz', index_col=0)

In [None]:
raw_expression.head()

The data frame has 18952 rows (measurements) and 136 columns (samples), so it is certainly high dimensional. These 136 columns represent 4 replicates each from 34 different tissue types.

We will set numpy and pandas to display numbers to just two decimal places in this notebook - this won't affect the actual numbers, just their display, and you can change it if you prefer.

In [None]:
np.set_printoptions(precision=2)
pd.options.display.precision=2

In [None]:
# The actual stored numbers have not changed
raw_expression.head()
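
As a small illustration of the point above, the sketch below uses made-up numbers (not the FlyAtlas data) and numpy's `printoptions` context manager to show that changing the display precision only changes how values are formatted, not what is stored.

```python
import numpy as np

# Hypothetical values, used only for illustration
values = np.array([1.23456789, 2.98765432])

# Format the array with two decimal places (temporary, via a context manager)
with np.printoptions(precision=2):
    shown = np.array2string(values)

print(shown)       # rounded for display only
print(values[0])   # the stored value still has full precision
```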

In [None]:
log_expression = np.log(raw_expression)

The following code snippet removes the replicate name from each sample, so we can use these labels as categories for plotting later.

In [None]:
# Use a raw string so that \d is passed to the regex engine unchanged
tissues_list = [re.match(r'(.+?)(( biological)? rep\d+)', c).group(1)
                     for c in raw_expression.columns]
tissues = pd.Series(tissues_list, index = raw_expression.columns)

In [None]:
len(tissues)
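
To see what the regular expression above does, here is a minimal sketch applied to a few hypothetical column names (the real FlyAtlas column names may differ). The lazy group `(.+?)` captures everything before the optional ` biological` and the ` repN` suffix.

```python
import re

# Hypothetical sample names, assumed to follow the "<tissue> [biological] repN" pattern
demo_columns = ['Brain biological rep1', 'Brain biological rep2', 'Adult Hindgut rep3']

# group(1) is the tissue name with the replicate suffix removed
demo_tissues = [re.match(r'(.+?)(( biological)? rep\d+)', c).group(1)
                for c in demo_columns]
print(demo_tissues)
```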

## Task 1 - Building a correlation matrix

**Task 1(a)**

The [FlyAtlas](http://flyatlas.org) dataset contains four biological replicates for each tissue. Combine the biological replicates by calculating the mean expression value for each gene in each tissue.

In [None]:
# ~~ GRADED CELL - complete this cell ~~

def average_by_tissue(expression, tissues):
    '''
    Given a DataFrame of gene expression data, 
    and a list, array or Series of tissues corresponding to the columns of the dataframe,
    average over the expression values in each gene for each tissue type and
    return the resulting dataframe. 
    The columns of the new dataframe should correspond to the provided tissues.
    '''


The below test case should return

```
      A    B
0   4.5  2.5
1  10.0  7.0
```


In [None]:
test_df = pd.DataFrame([[5,4,3,2],[10,10,6,8]])
print(average_by_tissue(test_df, ['A','A','B','B']))

In [None]:
# Calculate expression for each tissue in the flyatlas data
tissue_expression = average_by_tissue(log_expression, tissues)

In [None]:
tissue_expression.shape

**Task 1(b)**

WGCNA starts by building a pairwise correlation matrix of genes. Using the matrix you just created, produce an *unsigned* correlation matrix where each cell contains the absolute value of the correlation coefficients.

You can calculate the Pearson correlation values yourself, or look up a numpy or scipy function to do so.

In [None]:
# ~~ GRADED CELL - complete this cell ~~

def calculate_unsigned_correlation(expression):
    ''' 
    Produce the unsigned correlation matrix for a table of gene expression values.
    Assume that the columns of the expression matrix are samples and the rows are
    genes, and return an array of arrays giving the Pearson correlation between each pair of genes,
    in the same order as the rows of the expression table.
    '''


The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 1.  ,  0.95,  0.96,  0.44,  0.3 ,  0.15],
       [ 0.95,  1.  ,  1.  ,  0.71,  0.59,  0.46],
       [ 0.96,  1.  ,  1.  ,  0.67,  0.54,  0.41],
       [ 0.44,  0.71,  0.67,  1.  ,  0.99,  0.95],
       [ 0.3 ,  0.59,  0.54,  0.99,  1.  ,  0.99],
       [ 0.15,  0.46,  0.41,  0.95,  0.99,  1.  ]])
```

In [None]:
test_df = pd.DataFrame([[ 3.8,  2.7,  4.5],
                       [ 4.3,  3.4,  6.2],
                       [ 5.3,  4.3,  7. ],
                       [ 4.6,  6. ,  7.7],
                       [ 5.2,  7.3,  8.8],
                       [ 6.2,  8.5,  9.4]], 
                         columns=['Tissue1', 'Tissue2', 'Tissue3'],
                         index=['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE', 'GeneF'])
calculate_unsigned_correlation(test_df)

The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 1.  ,  0.95,  0.3 ,  0.15],
       [ 0.95,  1.  ,  0.59,  0.46],
       [ 0.3 ,  0.59,  1.  ,  0.99],
       [ 0.15,  0.46,  0.99,  1.  ]])
```

In [None]:
test_df = pd.DataFrame([[ 3.8,  2.7,  4.5],
                       [ 4.3,  3.4,  6.2],
                       [ 5.2,  7.3,  8.8],
                       [ 6.2,  8.5,  9.4]], 
                         columns=['Tissue1', 'Tissue2', 'Tissue3'],
                         index=['GeneA', 'GeneB', 'GeneC', 'GeneD'])
calculate_unsigned_correlation(test_df)

In [None]:
# Calculate the correlation matrix for the flyatlas data
unsigned_correlation = calculate_unsigned_correlation(tissue_expression)
_ = plt.hist(unsigned_correlation.flatten(), bins=100)

**Task 1(c)**

Why are we using an unsigned correlation matrix instead of a signed correlation matrix? (max 50 words)

*# ~~ GRADED CELL - your answer here --*



## Task 2 - Building an adjacency matrix

To use the correlation matrix to create a network, we will transform it into an adjacency matrix. You will create two types of adjacency matrix, a binary adjacency matrix and a weighted adjacency matrix.

**Task 2(a)**

To create the binary adjacency matrix, transform the correlation matrix such that every correlation greater than or equal to a given threshold value is considered adjacent (represented by a 1 in the matrix), and every correlation below that value is considered not adjacent (represented by a 0). Set the diagonal of the adjacency matrix to 0, so that we don't consider a node to be adjacent to itself.

In [None]:
# ~~ GRADED CELL - complete this cell ~~

def calculate_binary_adjacencies(correlation, threshold):
    '''
    Given a correlation matrix between genes of shape (N,N),
    return the corresponding binary adjacency matrix of shape (N,N),
    where genes are adjacent if their correlation is greater than or
    equal to the given threshold. The diagonal is set to 0.
    '''


The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  1.],
       [ 0.,  0.,  1.,  0.]])
```

In [None]:
test_corr = np.array([[ 1.  ,  0.95,  0.3 ,  0.15],
       [ 0.95,  1.  ,  0.59,  0.46],
       [ 0.3 ,  0.59,  1.  ,  0.99],
       [ 0.15,  0.46,  0.99,  1.  ]])
calculate_binary_adjacencies(test_corr, 0.5)

The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  0.]])
```

In [None]:
test_corr = np.array([[ 1.  ,  0.95,  0.3 ,  0.15],
       [ 0.95,  1.  ,  0.59,  0.46],
       [ 0.3 ,  0.59,  1.  ,  0.99],
       [ 0.15,  0.46,  0.99,  1.  ]])
calculate_binary_adjacencies(test_corr, 0.6)

In [None]:
# Calculate the binary adjacency matrix for the flyatlas data
adjacency_binary = calculate_binary_adjacencies(unsigned_correlation, 0.85)

**Task 2(b)**

Calculate the connectivity of the adjacency matrix by dividing the total number of edges by the number of possible edges.
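
As a worked example of this ratio, assume an undirected network of $N$ nodes with no self-loops, so that there are $N(N-1)/2$ possible edges and each edge appears twice in the symmetric adjacency matrix. A 4-node network with 3 edges then has

$$\text{connectivity} = \frac{3}{4 \times 3 / 2} = \frac{3}{6} = 0.5$$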

In [None]:
# ~~ GRADED CELL - complete this cell ~~

def calculate_connectivity(adjacency):
    '''
    Calculate the number of edges that exist in a given binary adjacency matrix,
    divided by the total number of possible edges between all nodes.
    '''
    

In [None]:
# Should return 0.5
calculate_connectivity(np.array([[ 0.,  1.,  0.,  0.],
                                   [ 1.,  0.,  1.,  0.],
                                   [ 0.,  1.,  0.,  1.],
                                   [ 0.,  0.,  1.,  0.]]))

In [None]:
# Should return 0.33
calculate_connectivity(np.array([[ 0.,  1.,  0.,  0.],
                                   [ 1.,  0.,  0.,  0.],
                                   [ 0.,  0.,  0.,  1.],
                                   [ 0.,  0.,  1.,  0.]]))

In [None]:
calculate_connectivity(adjacency_binary)

**Task 2(c)**

The weighted adjacency matrix can be created by raising the correlation matrix to some power. Write a function that raises the correlation matrix to some power, `beta`, and sets the diagonal to `0`. For the rest of the assignment we will use `beta = 4` but your function should accept any integer.

In [None]:
# ~~ GRADED CELL - complete this cell ~~

def calculate_weighted_adjacencies(correlation, beta):
    '''
    Given a correlation matrix between genes of shape (N,N),
    return the corresponding weighted adjacency matrix of shape (N,N),
    where we use a power-law soft threshold with parameter beta.
    '''


The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 0.  ,  0.9 ,  0.09,  0.02],
       [ 0.9 ,  0.  ,  0.35,  0.21],
       [ 0.09,  0.35,  0.  ,  0.98],
       [ 0.02,  0.21,  0.98,  0.  ]])
```

In [None]:
test_corr = np.array([[ 1.  ,  0.95,  0.3 ,  0.15],
       [ 0.95,  1.  ,  0.59,  0.46],
       [ 0.3 ,  0.59,  1.  ,  0.99],
       [ 0.15,  0.46,  0.99,  1.  ]])
calculate_weighted_adjacencies(test_corr, 2)

The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 0.  ,  0.86,  0.03,  0.  ],
       [ 0.86,  0.  ,  0.21,  0.1 ],
       [ 0.03,  0.21,  0.  ,  0.97],
       [ 0.  ,  0.1 ,  0.97,  0.  ]])
```

In [None]:
test_corr = np.array([[ 1.  ,  0.95,  0.3 ,  0.15],
       [ 0.95,  1.  ,  0.59,  0.46],
       [ 0.3 ,  0.59,  1.  ,  0.99],
       [ 0.15,  0.46,  0.99,  1.  ]])
calculate_weighted_adjacencies(test_corr, 3)

In [None]:
# Calculate the weighted adjacency matrix for the flyatlas data
adjacency_weighted = calculate_weighted_adjacencies(unsigned_correlation, 4)

**Task 2(d)**

How do you expect the network connectivity would change if the threshold for the binary adjacency matrix is increased or decreased? (max 50 words)

*# ~~ GRADED CELL - your answer here --*


## Task 3 - Defining modules with hierarchical clustering

We will implement a distance or dissimilarity function between genes that makes use of the gene network as specified by the adjacency matrix, and use these distances to carry out hierarchical clustering. We'll use scipy's hierarchical clustering functions `linkage()` and `fcluster()`, as they provide us with an easy way to draw the dendrogram.

The distance or dissimilarity function we'll implement is based on that given in lectures:

If $i=j$, 

$$d_{ij} = 0$$

otherwise

$$d_{ij} = 1 - \frac{l_{ij} + a_{ij}}{\min(k_i,k_j) + 1 - a_{ij}}$$

where

$$l_{ij} = \sum_u a_{iu} a_{uj}$$

$$k_i = \sum_u a_{iu}$$

and $a_{ij}$ refers to the $i$,$j$th element of the adjacency matrix.

Note we have set $d_{ij}$ to $0$ if $i=j$ as this is the distance from a node to itself.

In the functions below we'll refer to $k_i$ as the vertex connectivity of node $i$ (i.e. gene $i$), and $l_{ij}$ as the neighbour connectivity between $i$ and $j$. 
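
To make the formula concrete, the sketch below evaluates $d_{ij}$ for a single pair of nodes with explicit Python loops on a small example weighted adjacency matrix. The helper name `pair_distance` is hypothetical and is not one of the functions you are asked to write; it is a direct, unoptimised transcription of the equations above.

```python
import numpy as np

# Small symmetric example adjacency matrix with a zero diagonal
a = np.array([[0.  , 0.86, 0.03, 0.  ],
              [0.86, 0.  , 0.21, 0.1 ],
              [0.03, 0.21, 0.  , 0.97],
              [0.  , 0.1 , 0.97, 0.  ]])

def pair_distance(a, i, j):
    """Evaluate d_ij for one pair (i, j) directly from the formula."""
    if i == j:
        return 0.0
    n = len(a)
    l_ij = sum(a[i, u] * a[u, j] for u in range(n))  # neighbour connectivity
    k_i = sum(a[i, u] for u in range(n))             # vertex connectivity of i
    k_j = sum(a[j, u] for u in range(n))             # vertex connectivity of j
    return 1 - (l_ij + a[i, j]) / (min(k_i, k_j) + 1 - a[i, j])

d_01 = pair_distance(a, 0, 1)
d_23 = pair_distance(a, 2, 3)
print(round(d_01, 2), round(d_23, 2))
```

Strongly connected pairs (large $a_{ij}$) end up with a small distance, which is what the clustering step relies on.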

**Task 3(a)**

In the distance metric above, $l_{ij}$ sums over every node in the graph, but only a subset of nodes contribute. Based on the equations when applied to the binary adjacency matrix, explain which subset of nodes contribute to $d_{ij}$ for a given $i$ and $j$. (max 50 words)


*# ~~ GRADED CELL - your answer here --*


**Task 3(b)**

Complete the `vertex_connectivity()` and `neighbour_connectivity()` functions below. The `distance_matrix()` function has been provided, and should work correctly once the other two functions are complete.

In [None]:
# ~~ GRADED CELL - complete this cell ~~

def vertex_connectivity(adjacency):
    '''
    Given an adjacency matrix of shape (N,N), calculate the
    vertex connectivity k_i of every node i and return these as a 1D array of
    shape (N,).
    '''

In [None]:
# ~~ GRADED CELL - complete this cell ~~

def neighbour_connectivity(adjacency):
    '''
    Given the adjacency matrix of shape (N,N), calculate the sum
    of path weights from every i to every j via a single neighbouring node,
    i.e. Sum a_iu a_uj over all nodes u.
    Return these path weights from i to j in a matrix of shape (N,N)
    '''


In [None]:
def distance_matrix(adjacency):
    '''
    Given the adjacency matrix of shape (N,N), calculate the distance 
    between every i and j based on the dissimilarity formula provided in lectures,
    and return these distances in a matrix of shape (N,N).
    '''
    l_ij = neighbour_connectivity(adjacency)
    # Vertex connectivities, broadcast as a column (k_i) and as a row (k_j)
    k = vertex_connectivity(adjacency)
    k_i = k[:,np.newaxis]
    k_j = k[np.newaxis,:]
    d_ij = 1 - ((l_ij + adjacency) / (np.minimum(k_i,k_j) + 1 - adjacency))
    # The distance from a node to itself is zero
    np.fill_diagonal(d_ij, 0)
    return d_ij

The below test case should return (if displayed to a precision of two decimal places)

```
array([ 0.89,  1.17,  1.21,  1.07])
```

In [None]:
vertex_connectivity(np.array([[ 0.  ,  0.86,  0.03,  0.  ],
                           [ 0.86,  0.  ,  0.21,  0.1 ],
                           [ 0.03,  0.21,  0.  ,  0.97],
                           [ 0.  ,  0.1 ,  0.97,  0.  ]]))

The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 0.74,  0.01,  0.18,  0.12],
       [ 0.01,  0.79,  0.12,  0.2 ],
       [ 0.18,  0.12,  0.99,  0.02],
       [ 0.12,  0.2 ,  0.02,  0.95]])
```

In [None]:
neighbour_connectivity(np.array([[ 0.  ,  0.86,  0.03,  0.  ],
                           [ 0.86,  0.  ,  0.21,  0.1 ],
                           [ 0.03,  0.21,  0.  ,  0.97],
                           [ 0.  ,  0.1 ,  0.97,  0.  ]]))

The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 0.  ,  0.16,  0.89,  0.94],
       [ 0.16,  0.  ,  0.83,  0.85],
       [ 0.89,  0.83,  0.  ,  0.1 ],
       [ 0.94,  0.85,  0.1 ,  0.  ]])
```

In [None]:
distance_matrix(np.array([[ 0.  ,  0.86,  0.03,  0.  ],
                       [ 0.86,  0.  ,  0.21,  0.1 ],
                       [ 0.03,  0.21,  0.  ,  0.97],
                       [ 0.  ,  0.1 ,  0.97,  0.  ]]))

In [None]:
# Calculate the distance matrix for flyatlas data
distances = distance_matrix(adjacency_weighted)

We can now carry out hierarchical clustering. Here, scipy's `linkage()` function performs agglomerative clustering based on the provided distances. 

The most important function below is `linkage()`. The other functions you see are `squareform()`, a utility function that transforms the distance matrix into the condensed format scipy requires, and `transpose()`, which is used here to compensate for small floating-point errors and ensure the distance matrix is exactly symmetric.

In [None]:
Z = linkage(squareform((distances+distances.transpose())/2), 'ward')

Scipy provides a function to draw the dendrogram:

In [None]:
plt.figure(figsize=(25, 10))
_ = dendrogram(Z, no_labels=True)
plt.axhline(y = 4.5, color = 'r', linestyle = '--')

And we can extract the desired number of flat clusters with `fcluster()`:

In [None]:
labels = fcluster(Z, 6, criterion='maxclust')

We now have a numpy array assigning each gene to one of 6 clusters, labelled 1 to 6:

In [None]:
labels[:10]
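
To see what `squareform()`, `linkage()` and `fcluster()` do on a small input, here is a minimal sketch using a 4x4 example distance matrix rather than the FlyAtlas distances. The variable names ending in `_demo` are for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Small, exactly symmetric distance matrix with a zero diagonal
dist_demo = np.array([[0.  , 0.16, 0.89, 0.94],
                      [0.16, 0.  , 0.83, 0.85],
                      [0.89, 0.83, 0.  , 0.1 ],
                      [0.94, 0.85, 0.1 , 0.  ]])

# squareform() flattens the upper triangle into a 1D "condensed" vector
condensed_demo = squareform(dist_demo)
print(condensed_demo.shape)   # N*(N-1)/2 entries for N points

# linkage() returns one row per merge, so N-1 rows in total
Z_demo = linkage(condensed_demo, 'ward')
print(Z_demo.shape)

# Cut the tree into at most 2 flat clusters
labels_demo = fcluster(Z_demo, 2, criterion='maxclust')
print(labels_demo)
```

Points 0 and 1 are close to each other, as are points 2 and 3, so they end up in two separate clusters.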

**Task 3(c)**

Write a function to convert this flat list of labels into a set of modules in the form of lists of gene names.

In [None]:
# ~~ GRADED CELL - complete this cell ~~

def module_lists(genes, labels):
    '''
    Given an array or series of gene names and an array or series of cluster labels,
    return a list of lists where each list represents the genes in a cluster.
    '''


In [None]:
# Should return [['GeneA', 'GeneC'], ['GeneB'], ['GeneD']]
module_lists(['GeneA', 'GeneB', 'GeneC', 'GeneD'], [1,2,1,3])

In [None]:
# Get our flyatlas module lists
modules_hierarchical = module_lists(tissue_expression.index, labels)

# The sizes of our modules
[len(module) for module in modules_hierarchical]

In [None]:
# The modules themselves
modules_hierarchical

## Task 4 - Clustering with k-medoids 

In this task, we'll use a different clustering method and see what modules we get.

We don't have a Euclidean distance space, but we have a pairwise matrix of distances, which allows us to implement k-medoids. Recall that in k-medoids, one of our data points acts as the centroid of each cluster - in this case there will be a centroid gene for each cluster. The algorithm we will implement for k-medoids is:

1. Initialise the centroids randomly.
2. Assign each gene to the closest centroid (in this case, using our network-based distances).
3. Choose the most-central gene in each cluster to be the new centroid. This is the gene which minimises the sum of squared distances within the cluster.
4. Repeat from step (2) until the algorithm converges and the centroids no longer change.

The functions `initialise_centroids()` and `calculate_new_centroids()` are provided for you. You need to implement `assign_points()` and complete the function `kmedoids()` itself.
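
As an illustration of step 3 only, the sketch below picks the most-central point of a hypothetical three-point cluster by minimising the sum of squared within-cluster distances. The distances are made up for the example.

```python
import numpy as np

# Hypothetical pairwise distances between the 3 members of one cluster
within_demo = np.array([[0., 1., 4.],
                        [1., 0., 2.],
                        [4., 2., 0.]])

# Sum of squared distances from each point to all others in the cluster
sse_demo = np.sum(within_demo**2, axis=1)

# The point with the smallest sum becomes the new centroid
new_centroid_demo = int(np.argmin(sse_demo))
print(sse_demo, new_centroid_demo)
```

Here the middle point (index 1) is chosen, because it is closest overall to the other two.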

In [None]:
def initialise_centroids(N, k):
    """
    Select k centroid indices randomly given that there are N data points.
    We are only selecting the indices, so we don't need the actual data.
    """
    centroid_indices = np.random.choice(list(range(N)), size=k, replace=False)
    return centroid_indices 

In [None]:
def calculate_centroid_index(within_cluster_distances):
    """
    Take in distance array of size (C,C) where C is the size of the cluster.
    Return a centroid index (of type int) which is the most-central data point.
    """
    # We'll minimise the sum of square distances
    # Calculate this quantity for every point, and pick the best as the centroid
    sse = np.sum((within_cluster_distances**2),axis=1)
    return np.argmin(sse)

def calculate_new_centroids(distances, assignments, k):
    """
    Take distances of shape (N,N) and cluster assignments of shape (N) and
    return centroid indices array of shape (k).
    """
    assert np.max(assignments) < k
    centroid_list = []
    for c in range(k):
        in_cluster = assignments==c
        cluster_distances = distances[in_cluster,:][:,in_cluster]
        centroid_index_in_cluster = calculate_centroid_index(cluster_distances)
        centroid_index = np.array(list(range(len(distances))))[in_cluster][centroid_index_in_cluster]
        centroid_list.append(centroid_index)
    return np.array(centroid_list)

In [None]:
# ~~ GRADED CELL - complete this cell ~~

def assign_points(centroid_indices, distances):
    """
    Assign each point to its closest centroid.
    Take in an array of centroid indices of length k, and an array representing
    the distance matrix, of shape (N,N).
    Return a 1D array of length N representing cluster assignments.
    Each value in the returned array should be a number from 0 to k-1,
    indicating which cluster (centroid) this data point has been assigned to.
    """


In [None]:
# ~~ GRADED CELL - complete this cell ~~

def kmedoids(distances, k):
    """
    Implement k-medoids clustering on a given set of points, by taking in
    a pre-computed distance matrix of size (N,N), and a number of clusters k.
    Returns a tuple of (centroid_indices, cluster_assignments)
    where 
    centroid_indices is a list of indices specifying which data points are now
     centroids, and
    cluster_assignments is a 1D array of length N, where the values of the
     array are numbers from 0 to k-1 and represent cluster assignments.
    """


In [None]:
# Should return array([0, 0, 1, 1])

test_dist = np.array([[ 0.  ,  0.16,  0.89,  0.94],
       [ 0.16,  0.  ,  0.83,  0.85],
       [ 0.89,  0.83,  0.  ,  0.1 ],
       [ 0.94,  0.85,  0.1 ,  0.  ]])
assign_points([0,2], test_dist)

In [None]:
# Should return array([0, 1, 1, 1])

test_dist = np.array([[ 0.  ,  0.16,  0.89,  0.94],
       [ 0.16,  0.  ,  0.83,  0.85],
       [ 0.89,  0.83,  0.  ,  0.1 ],
       [ 0.94,  0.85,  0.1 ,  0.  ]])
assign_points([0,1], test_dist)

In [None]:
# k-medoids is stochastic, so you're not guaranteed to get the correct result every 
# time for this test. However the most likely outcome is
# (array([2, 0]), array([1, 1, 0, 0]))
# or equivalently
# (array([0, 2]), array([0, 0, 1, 1]))

test_dist = np.array([[ 0.  ,  0.16,  0.89,  0.94],
       [ 0.16,  0.  ,  0.83,  0.85],
       [ 0.89,  0.83,  0.  ,  0.1 ],
       [ 0.94,  0.85,  0.1 ,  0.  ]])
kmedoids(test_dist, 2)

Now we can cluster our data:

In [None]:
# Cluster our flyatlas data
centroids, kmedoids_labels = kmedoids(distances, 6)

In [None]:
# genes that are the centroids
tissue_expression.index[centroids]

In [None]:
# Use the module_lists function you defined earlier
modules_kmedoids = module_lists(tissue_expression.index, kmedoids_labels)

# Module sizes
[len(m) for m in modules_kmedoids]

In [None]:
# The modules themselves
modules_kmedoids

## Task 5 - Dimension Reduction

In this task we will be performing Principal Component Analysis (PCA) to determine which gene contributes most to the variance captured by the first principal component.

**Task 5(a)**

Perform a Principal Component Analysis on the `log_expression` matrix with the correct number of components, and print the list of explained variance per component (as we saw in the week 9 tutorial).

In [None]:
# ~~ GRADED CELL - complete this cell ~~

**Task 5(b)**

Print the gene that contributes most to the first eigenvector (the first principal component) of the PCA. The word limit on the second cell is 50 words.

In [1]:
# ~~ GRADED CELL - complete this cell ~~

## Task 6 - Centrality

In this task we will be computing different measures of centrality for the gene found above in Task 5.

**Task 6 (a)**

Degree Centrality

Using the `adjacency_binary` matrix (with a threshold value of 0.85), calculate the degree centrality of the gene found in Task 5.

In [None]:
# ~~ GRADED CELL - your code here --

**Task 6 (b)**

Betweenness Centrality

Using the `adjacency_binary` matrix generated above (with a threshold value of 0.85), turn the matrix into a graph object (using networkx) and then calculate the betweenness centrality of the gene found in Task 5 using the [networkx function for betweenness centrality](https://networkx.github.io/documentation/networkx-2.2/reference/algorithms/generate/networkx.algorithms.centrality.betweenness_centrality.html#networkx.algorithms.centrality.betweenness_centrality).

In [None]:
# ~~ GRADED CELL - your code here --

**Task 6 (c)**

Closeness Centrality

Using the [dijkstra_path](https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.shortest_paths.weighted.dijkstra_path.html) function from networkx, calculate the closeness centrality of the gene found in Task 5 using the following formula:


$$C(u) = \frac{n - 1}{\sum_{v=1}^{n-1} d(v, u)}$$

where d(v, u) is the shortest-path distance between v and u, and n is the number of nodes that can reach u.
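
As a worked example of this formula, consider a hypothetical unweighted path graph with three nodes, A, B and C, connected in a line (A to B, B to C), where every edge has length 1. This toy graph is not part of the assignment data, and we assume that $n$ counts $u$ itself. All three nodes can reach each other, so $n = 3$:

$$C(B) = \frac{3 - 1}{d(A, B) + d(C, B)} = \frac{2}{1 + 1} = 1$$

$$C(A) = \frac{3 - 1}{d(B, A) + d(C, A)} = \frac{2}{1 + 2} = \frac{2}{3}$$

The middle node B has the higher closeness centrality, because its average distance to the other nodes is smaller.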

In [2]:
# ~~ GRADED CELL - your code here --

**Task 6 (d)**

What does each measurement say about the gene's centrality? Is it relatively central? 


*# ~~ GRADED CELL - your answer here --*
