# Big Data for Biologists: Decoding Genomic Function - Class 9

## How do you visualize similarities and differences of gene expression profiles across cell types? Part III

##  Learning Objectives
***Students should be able to***
 <ol>
 <li> <a href=#Introkmeans>Explain what the k-means clustering algorithm is </a></li>
 <li> <a href=#Runkmeans>Perform k-means clustering to examine groups of samples from an RNA-Seq experiment.</a></li>
 <li> <a href=#Whileloop>Use a while loop to assess when the k-means clustering assignments are no longer changing.</a></li>
 <li> <a href=#heatmap>Interpret gene expression data in a heatmap </a></li>
 <li> <a href=#KmeansGenes> Perform k-means clustering to examine groups of genes from an RNA-Seq experiment. </a> </li>
 </ol>

# What is the K-means clustering algorithm?<a name='Introkmeans' />

In [None]:
import warnings
warnings.filterwarnings('ignore')
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/4b5d3muPQmA" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')



For this class we have prepared a number of helper functions in the file **../helpers/kmeans_helpers.py**.

One of the helper functions makes scatter plots using code that is similar to what we saw in the last class. It adds some functionality to plot different styles of points or to change colors. 

We have also added helper functions for importing RNA-Seq data in **../helpers/RNAseq_helpers.py**.

You can take a look at either file by opening it in the helpers directory or by using the cat command to print its contents. 

In [None]:
!cat ../helpers/RNAseq_helpers.py

In [None]:
!cat ../helpers/kmeans_helpers.py

In [None]:
%%capture

#Import helper functions for loading RNA-Seq data and running the k-means algorithm 

%matplotlib inline
%load_ext autoreload
%reload_ext autoreload

%autoreload 2
import sys
import warnings
sys.path.append('../helpers')
from kmeans_helpers import * 
from RNAseq_helpers import * 
warnings.filterwarnings('ignore')

In [None]:
#Step 1: Generate a dataset in 2 dimensions with 150 datapoints 
x,y=generate_random_data(150)

In [None]:
len(x)

In [None]:
len(y)

We plot the data:

In [None]:
plot(x,y)

Thought question: Why can we make this plot with the single command **plot** rather than the long list of commands that we used in the last class? 

Based on a visual inspection of the data, we decide that we would like 3 clusters. It looks like all the data points fall in the range of [-5,5].

In [None]:
k=3
min_val=-5
max_val=5 

The first step in the algorithm is to randomly initialize the cluster centroids. We can do this with the **initialize_centroids** function that was imported from kmeans_helpers. First, let's use the **help()** function to examine the **initialize_centroids** function. 

In [None]:
help(initialize_centroids)

In [None]:
x_centroids,y_centroids=initialize_centroids(k,min_val,max_val)

Thought question: Where could you find the code for the **initialize_centroids** function, either to understand each step or to inspect it if your code needed debugging?  
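
For intuition only, random initialization can be sketched as drawing k x-values and k y-values uniformly between the minimum and maximum values. The function name `random_centroids` and its `seed` argument are made up for this example; the actual **initialize_centroids** helper may work differently.

```python
import random

def random_centroids(n_clusters, low, high, seed=None):
    # draw n_clusters x-values and y-values uniformly from [low, high]
    rng = random.Random(seed)
    cx = [rng.uniform(low, high) for _ in range(n_clusters)]
    cy = [rng.uniform(low, high) for _ in range(n_clusters)]
    return cx, cy

# three hypothetical centroids in the range [-5, 5]
sketch_cx, sketch_cy = random_centroids(3, -5, 5, seed=1)
print(sketch_cx)
print(sketch_cy)
```

Because the starting positions are random, two runs of k-means on the same data can begin from different centroids.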

Plot the cluster centroids that we initialized above:

In [None]:
plot(x,
     y,
     x_centroids=x_centroids,
     y_centroids=y_centroids)

We would like to find the closest centroid to each point. To do this, we will calculate the Euclidean distance from each point to each centroid. Euclidean distance can be found with the formula below: 
![euclidean distance formula](../Images/euclidean_distance.png)
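
As a rough illustration of the formula, the sketch below computes the Euclidean distance from one point to each of three centroids and then picks the nearest one. The function name `euclidean_distance` and all coordinates are made up for this example; the course's own **distance** helper may be organized differently.

```python
import math

def euclidean_distance(x1, y1, x2, y2):
    # square root of the sum of squared coordinate differences
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# one example point and three example centroids (hypothetical values)
point = (1.0, 2.0)
example_centroids = [(4.0, 6.0), (1.0, 1.0), (-3.0, 2.0)]

# distance from the point to each centroid
example_distances = [euclidean_distance(point[0], point[1], cx, cy)
                     for cx, cy in example_centroids]
print(example_distances)

# index of the closest centroid
nearest = example_distances.index(min(example_distances))
print(nearest)
```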


In [None]:
help(distance)

In [None]:
#Calculate the distance of each point to each centroid. 
distances=[] 
for cluster_index in range(k): 
    cluster_distance=distance(x,
                              y,
                              x_centroids[cluster_index],
                              y_centroids[cluster_index])
    distances.append(cluster_distance)

print("length of distances: "+str(len(distances)))
print(distances[0][0:10])
print("length of distances for each cluster: "+str(len(distances[1])))

Thought questions: 

Why is the length of distances 3? What does each sublist represent? 

The variable distances is defined as "distances=[]" in the code above, and it is also defined as "distances=[]" in the first line of the distance helper function in kmeans_helpers.py. Why do the distances for each centroid get appended to distances, instead of being erased each time the for loop calls the helper function for a new centroid?

In [None]:
#Assign each point to the cluster corresponding to the nearest centroid. 
num_points=len(x)
print(len(x))
cluster_assignments=assign_cluster(distances,num_points)

In [None]:
print(cluster_assignments)

In [None]:
#plot the data with cluster assignments --red indicates points in cluster 0, blue in cluster 1 and green in cluster 2. 
plot(x,
     y,
     x_centroids=x_centroids,
     y_centroids=y_centroids,
     cluster_assignments=cluster_assignments)

Now, we re-calculate the centroid positions so that each centroid sits at the center (the mean position) of the points assigned to its cluster. We use the helper function 'update_centroids' that is defined in ../helpers/kmeans_helpers.py
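
To make the update step concrete, here is a minimal sketch of how a centroid can be recomputed as the mean of the points assigned to it. The function `mean_centroids` and the toy data are assumptions for illustration only; the actual helper may differ, for example in how it treats a cluster that has no points.

```python
def mean_centroids(xs, ys, assignments, n_clusters):
    # returns the mean x and the mean y of the points in each cluster
    new_x, new_y = [], []
    for cluster in range(n_clusters):
        members = [i for i, a in enumerate(assignments) if a == cluster]
        if not members:
            # assumption: an empty cluster gets no position in this sketch
            new_x.append(None)
            new_y.append(None)
            continue
        new_x.append(sum(xs[i] for i in members) / len(members))
        new_y.append(sum(ys[i] for i in members) / len(members))
    return new_x, new_y

# four toy points: the first two belong to cluster 0, the last two to cluster 1
toy_x = [0.0, 2.0, 10.0, 12.0]
toy_y = [0.0, 2.0, 10.0, 14.0]
toy_assignments = [0, 0, 1, 1]
toy_cx, toy_cy = mean_centroids(toy_x, toy_y, toy_assignments, 2)
print(toy_cx, toy_cy)
```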

In [None]:
#call the function to update the centroids 
x_centroids,y_centroids=update_centroids(x,y,cluster_assignments,k)
print("updated x_centroids:"+str(x_centroids))
print("updated y_centroids:"+str(y_centroids))
plot(x,
    y,
    cluster_assignments=cluster_assignments,
    x_centroids=x_centroids,
    y_centroids=y_centroids)

In [None]:
# We repeat this cycle of assigning points to clusters and updating the cluster centroids. 
# At each iteration, you should observe an improved separation of the datapoints. 


#We combine the functions we have written above to produce a full iteration of the k-means clustering algorithm. 
def iterate(x,y,x_centroids,y_centroids,k): 
    num_points=len(x)
    
    #calculate centroid distances 
    distances=[] 
    for cluster_index in range(k): 
        distances.append(distance(x,y,x_centroids[cluster_index],y_centroids[cluster_index]))
        
    #Assign each point to the cluster corresponding to the nearest centroid. 
    cluster_assignments=assign_cluster(distances,num_points)

    #update the centroid positions 
    x_centroids,y_centroids=update_centroids(x,y,cluster_assignments,k)
    
    #generate a plot 
    plot(x,
         y,
         x_centroids=x_centroids,
         y_centroids=y_centroids,
         cluster_assignments=cluster_assignments)
    
    #return the new cluster assignments and centroid values 
    return cluster_assignments,x_centroids,y_centroids

In [None]:
#run an iteration of the algorithm 
#run this code block several times to observe progressive improvements in cluster assignments 
cluster_assignments,x_centroids,y_centroids=iterate(x,y,x_centroids,y_centroids,k)

## Use a while loop to assess when the k-means clustering assignments are no longer changing.<a name='Whileloop' />

In [None]:
#Finally, we define the end-criteria -- the algorithm is complete when the cluster assignments are no longer changing. 
def check_finished(old_cluster_assignments,new_cluster_assignments): 
    for i in range(len(old_cluster_assignments)): 
        if old_cluster_assignments[i]!=new_cluster_assignments[i]: 
            return False 
    return True 

In [None]:
#We put all the steps above together to define the full k-means clustering algorithm 
def k_means(x,y,k): 
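    #use the overall range of the data to bound the random centroid positions 
    #(this form works whether x and y are lists or arrays) 
    min_val=min(min(x),min(y))
    max_val=max(max(x),max(y))
    x_centroids,y_centroids=initialize_centroids(k,min_val,max_val)
    cluster_assignments,x_centroids,y_centroids=iterate(x,y,x_centroids,y_centroids,k)
    finished=False 
    #repeat until an iteration leaves every cluster assignment unchanged 
    while not finished: 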
        new_assignments,x_centroids,y_centroids=iterate(x,y,x_centroids,y_centroids,k)
        finished=check_finished(cluster_assignments,new_assignments)
        cluster_assignments=new_assignments
    return cluster_assignments

In [None]:
#Let's run the full algorithm on our toy dataset 
cluster_assignments=k_means(x,y,3)

What would happen if you chose a different value of k (number of clusters)? 


In [None]:
#Use the k_means function defined above to run k-means clustering 
#on x,y for k=2 and k=4. 

## BEGIN SOLUTION ## 
## END SOLUTION ## 

#Which value of k (2,3, or 4) gives the best separation of the datapoints into cohesive clusters? 
## BEGIN SOLUTION ## 
## END SOLUTION ## 

# Using K-means clustering to examine groups of samples from an RNA-Seq experiment. <a name='Runkmeans' />

We will now use k-means clustering to cluster our samples across four organ systems: 

* Blood 
* Embryonic
* Immune 
* Respiratory 

In [None]:
import pandas as pd 
systems_subset=["Blood","Embryonic","Immune","Respiratory"]
rnaseq_data,metadata=load_rnaseq_data(systems_subset=systems_subset,
                                      rnaseq_data='/data/datasets/RNAseq/diff_genes_top.tsv',
                                      metadata='/data/datasets/RNAseq/rnaseq_metadata.txt')
rnaseq_data_transposed=rnaseq_data.transpose()

#Write a line of code to check the top few lines of the rnaseq_data_transposed matrix.  
## BEGIN SOLUTION ##
## END SOLUTION ## 

In [None]:
rnaseq_data_transposed.shape

Note that our matrix has 1543 columns (1 per gene) and 116 rows (1 per sample). Why do we examine only 1543 genes from the original set of 55,000 genes in the human genome? These 1543 genes were found to have different levels of expression across the four organ systems in **systems_subset**. The remaining genes did not show a significant difference in expression across the blood, embryonic, immune, and respiratory systems. 

In the example above, we wrote the k-means clustering code "from scratch". However, Python's scikit-learn library already provides a **KMeans** class, which is more efficient for clustering large datasets than the function that we wrote above. We will use scikit-learn's KMeans in our further data analysis. 
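
As a small preview on made-up data (not the RNA-Seq matrix), the sketch below shows the general shape of a scikit-learn KMeans call. The toy points, the variable names, and the choices `n_init=10` and `random_state=0` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# toy data: three well-separated groups of 2-D points (made-up values)
rng = np.random.default_rng(0)
group_centers = np.array([[-4.0, -4.0], [0.0, 4.0], [4.0, -4.0]])
toy_points = np.vstack([center + 0.3 * rng.standard_normal((20, 2))
                        for center in group_centers])

# n_clusters plays the role of k; random_state makes the run repeatable
model = KMeans(n_clusters=3, n_init=10, random_state=0)
toy_labels = model.fit_predict(toy_points)

print(toy_labels)              # one cluster label per point
print(model.cluster_centers_)  # one row of coordinates per centroid
```

Each row of the input is treated as one item to cluster, which is why the orientation of the data matrix matters later in this class.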

In [None]:
from sklearn.cluster import KMeans

In [None]:
#use the help command to learn more about the KMeans clustering function. 
#Uncomment the following line of code to run the "help" function 
## BEGIN SOLUTION ## 
## END SOLUTION ## 

In [None]:
#use the help command to learn more about the scikit_PCAandkmeans helper function. 
## BEGIN SOLUTION ## 
## END SOLUTION ## 

In [None]:
#Run the kmeans clustering algorithm on the data with four clusters
## BEGIN SOLUTION ## 
## END SOLUTION ## 

We would like a way to determine how well the k-means clustering has separated the data. One way to assess the performance of the clustering algorithm is to compare inter-cluster distance to intra-cluster distance. We would like points in the same cluster to be near each other, and points that are in different clusters to be far apart: 
![Inter-cluster and intra-cluster distance](../Images/inter_cluster_vs_intra_cluster.png)

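**Silhouette analysis** compares these two distances for each sample. The silhouette coefficient of a sample is (b - a) / max(a, b), where a is the mean distance from the sample to the other points in its own cluster and b is the mean distance from the sample to the points in the nearest other cluster. This measure has a range of [-1, 1], and values close to 1 indicate compact, well-separated clusters. We see above that the clustering analysis yields a silhouette score of 0.66, suggesting that the data is fairly well separated by four clusters. 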
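
The sketch below works through the silhouette calculation by hand on four made-up points, once with a sensible labeling and once with a poor one. The function `silhouette_values` is written only for this illustration; in practice scikit-learn's `sklearn.metrics.silhouette_score` reports the mean of these per-sample values.

```python
import math

def silhouette_values(points, labels):
    # silhouette coefficient (b - a) / max(a, b) for every point
    values = []
    for i, p in enumerate(points):
        # distances from point i to every other point, grouped by cluster label
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None or not by_cluster:
            # in this sketch, a point alone in its cluster (or data with
            # only one cluster) is given a score of 0
            values.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        values.append((b - a) / max(a, b))
    return values

toy_pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
good_scores = silhouette_values(toy_pts, [0, 0, 1, 1])  # groups match the layout
bad_scores = silhouette_values(toy_pts, [0, 1, 0, 1])   # groups cut across the layout
good_mean = sum(good_scores) / len(good_scores)
bad_mean = sum(bad_scores) / len(bad_scores)
print(good_mean, bad_mean)
```

The sensible labeling gives a mean score close to 1, while the poor labeling gives a negative mean score.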


# Interpreting gene expression data in a heatmap <a name='heatmap' />

In [None]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/oMtDyOn2TCc" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

We can visualize the 116 clustered samples in a heatmap, with samples along the y-axis and genes along the x-axis. Note how we can see four distinct clusters in the data. The heatmap values correspond to the expression level of each gene in each sample.

In [None]:
plot_heatmap(rnaseq_data_transposed)

Thought questions: 
* What does each box in the heatmap represent?
* What does each column represent?
* What does each row represent?  

## Perform k-means clustering to examine groups of genes from an RNA-Seq experiment <a name='KmeansGenes' />

Previously, we performed k-means clustering on the samples. Now, we are interested in identifying groups of genes with similar patterns of expression across organ systems. To perform clustering on the genes, we will transpose the input dataset (switch the rows and columns).

Let's begin with 4 clusters (k=4). 

Above, when we read in our dataset, we used the command **rnaseq_data.transpose()** to put the sample names as the rows and the genes as the columns. 

Now, we want to use the original data matrix, **rnaseq_data**, with 1543 rows (one per gene) and 116 columns (one per sample). 
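
To see what switching rows and columns does, here is a minimal sketch on a made-up table with two genes and three samples. The gene names, sample names, and values are hypothetical.

```python
import pandas as pd

# toy expression table: rows are genes, columns are samples (made-up values)
toy_expression = pd.DataFrame(
    [[5.1, 4.8, 0.2],
     [0.3, 0.1, 7.9]],
    index=["geneA", "geneB"],
    columns=["sample1", "sample2", "sample3"])

toy_transposed = toy_expression.transpose()

print(toy_expression.shape)   # (2, 3): genes x samples
print(toy_transposed.shape)   # (3, 2): samples x genes
```

Clustering the rows of the first table groups genes, while clustering the rows of the transposed table groups samples.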

In [None]:
#Fill in the code below to run the k-means clustering on the genes. 
## BEGIN SOLUTION ## 
## END SOLUTION ## 

In [None]:
#To see what the output of the k-means clustering looks like, print the first ten cluster assignments
print(clusters[0:10])

In [None]:
# Plot a heatmap of the data 
plot_heatmap(rnaseq_data)

We now obtain the list of genes associated with each cluster using the 'get_genes_from_clusters' function in ../helpers/kmeans_helpers.py

In [None]:
get_genes_from_clusters(rnaseq_data,clusters,k,filename='gene_id_to_gene_name.txt')

This will generate the files 0.txt, 1.txt, 2.txt, and 3.txt, which list the genes associated with each cluster. 


In [None]:
#Examine the first 10 lines in the file 0.txt
! head 0.txt