# 3.2  Clustering analysis and PCA  (on normalized count matrix) #

### IMPORTANT: Please make sure that you are using the R kernel to run this notebook. ###
We are now switching from the bash kernel to the R kernel. 
The R language provides a number of utilities for genomic data analysis and visualization. We will explore some of these. 

In [1]:
#The preprocessCore library provides a number of functions useful for statistical analysis,
#including functions for data normalization that we will use below. 
library("preprocessCore")

In [2]:
#Change to your $WORK_DIR. The syntax for switching directories in R is a little different than what we used in bash. 
#Use the "setwd" command to switch to your $WORK_DIR 
sunetid="ubuntu"
work_dir=paste("/scratch/",sunetid,sep="")
setwd(work_dir)
#The "dir" command will list all files in your current working directory 
dir()

In this tutorial we will focus on the clustering and PCA analysis steps of the pipeline: 
![Analysis pipeline](images/part3.png)

In [3]:
#load the count signal matrix
count_data=read.table("all.readcount.txt",header=TRUE)
rownames(count_data)=paste(count_data$Chrom,count_data$Start,count_data$End,sep='\t')
#remove the columns we will not use in downstream analysis
count_data$ID=NULL
count_data$Chrom=NULL
count_data$Start=NULL
count_data$End=NULL

head(count_data)

Unnamed: 0,ambenj_asf1_YPD_2,ambenj_rtt109_YPGE_2,dmaghini_asf1_YPD_5,dmaghini_WT_YPD_5,egreenwa_asf1_YPD_6,egreenwa_rtt109_YPD_5,gamador_rtt109_YPGE_6,gamador_WT_YPGE_6,hrosenbl_WT_YPGE_1,jarod_asf1_YPGE_5,⋯,yiuwong_rtt109_YPGE_1,yiuwong_WT_YPD_1,YPD_asf1_rep1,YPD_rtt109_rep1,YPD_rtt109_rep2,YPD_WT_rep1,YPD_WT_rep2,YPGE_asf1_rep1,YPGE_asf1_rep2,YPGE_WT_rep1
chrI	0	857,36,25,21,35,14,11,58,18,50,33,⋯,57,88,3,1,7,3,3,4,3,4
chrI	2415	2586,1354,2882,1503,2884,560,354,2873,1347,4899,2759,⋯,4071,5241,136,108,160,111,212,139,162,220
chrI	6315	6556,217,222,155,350,70,48,258,88,252,209,⋯,307,785,15,10,19,14,30,18,26,33
chrI	14706	14936,329,306,183,451,142,57,433,97,504,378,⋯,633,655,30,26,26,22,30,41,29,43
chrI	20592	21210,319,839,197,660,133,154,823,392,946,685,⋯,831,1503,26,17,28,28,40,13,25,48
chrI	28570	28931,4,36,7,21,5,1,36,20,38,24,⋯,50,30,2,0,2,1,1,0,1,2


In [None]:
#normalize the data 
#quantile normalization 
norm_asinh_count=normalize.quantiles(data.matrix(asinh(count_data)))

In [None]:
colnames(norm_asinh_count)=names(count_data)
rownames(norm_asinh_count)=rownames(count_data)

In [None]:
head(norm_asinh_count)

Much better! After quantile normalization, the asinh-transformed count values across samples are on the same scale. 
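
To make the idea behind quantile normalization concrete, here is a minimal base R sketch on a toy matrix. The matrix `toy` and the helper `quantile_normalize` are hypothetical names introduced only for illustration; `normalize.quantiles` from preprocessCore handles ties and missing values in its own way, so this is an illustration of the principle rather than a reimplementation.

```r
# Toy matrix: 5 peaks (rows) x 3 samples (columns) on very different scales.
toy <- matrix(c(5, 2, 3, 4, 1,
                50, 20, 40, 10, 30,
                7, 9, 8, 6, 10),
              nrow = 5,
              dimnames = list(paste0("peak", 1:5), c("s1", "s2", "s3")))

quantile_normalize <- function(m) {
  # Rank of every value within its own column (this toy example has no ties).
  ranks <- apply(m, 2, rank, ties.method = "first")
  # Average the k-th smallest values across columns: the shared reference distribution.
  reference <- rowMeans(apply(unname(m), 2, sort))
  # Replace each value by the reference value at its rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(m)
  normalized
}

toy_norm <- quantile_normalize(toy)
print(toy_norm)
# After normalization, every column contains the same set of sorted values.
print(apply(toy_norm, 2, sort))
```

Within each sample the ordering of peaks is unchanged; only the distribution of values is forced to match across samples.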

## PCA ##

PCA (Principal Component Analysis) is a way to identify the primary directions of variation in the data. It can also be used for very coarse-grained clustering of samples; similar samples will have similar coordinates along the principal axes.

We will perform PCA on the normalized count matrix derived from *all.readcount.txt*. We treat each sample as a single point in a very high dimensional space (where the dimensionality is equal to the number of peaks), and then perform dimensionality reduction in this space. We can color-code the PCA plots by "Strain", "Media", "Researcher", or "Rep" to determine which parameter separates the samples most effectively. 
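
Before running PCA on the real matrix, it may help to see what `prcomp` returns on a small example. The sketch below uses a random toy matrix (`toy`, `toy_pca`, and the other variable names are hypothetical) and shows that the sample coordinates are the centered data projected onto the loadings, and how the percent of variance explained is derived from `sdev`.

```r
set.seed(1)
# Toy data: 6 samples (rows) x 4 peaks (columns); values are random placeholders.
toy <- matrix(rnorm(24), nrow = 6,
              dimnames = list(paste0("sample", 1:6), paste0("peak", 1:4)))

toy_pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Sample coordinates along the principal axes (one row per sample).
print(head(toy_pca$x))

# The coordinates equal the centered data projected onto the loadings.
centered <- scale(toy, center = TRUE, scale = FALSE)
projected <- centered %*% toy_pca$rotation
print(all.equal(unname(projected), unname(toy_pca$x)))

# Percent of variance explained by each PC, and the running total.
pct <- 100 * toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(pct, 2))
print(round(cumsum(pct), 2))
```

The cumulative total is a convenient way to judge how many components are worth plotting.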

In [None]:
#We run the principal component analysis command in R

#The t() function transposes the data matrix and allows us to cluster the samples, as opposed to the individual peaks,
#by placing the samples in the rows and the peaks in the columns. 
count.pca=prcomp(t(norm_asinh_count),center=TRUE,scale=FALSE)

We generate a scree plot that shows how much variance in the data is explained by each principal component:

In [None]:
var_explained=round(100*count.pca$sdev^2/sum(count.pca$sdev^2),2)
print(var_explained)

Let's generate a simple bar graph to better illustrate the variance explained by each PC.


In [None]:
barplot(var_explained)

We can also plot the first few principal components to see if they correlate with any of our experimental variables: 

    * Strain of yeast 
    * Media 
    
We also expect replicates for the same sample to cluster closely together.

Finally, we should make sure to check for any unintended batch effects in the data. For example, it's possible that samples generated by one researcher may exhibit a systematic difference from samples generated by a different researcher. We should check for this bias and correct it if possible. 


    

In [None]:
#First, we load our metadata file into R to help us color samples by replicate, strain, media, and researcher. 
metadata=read.table("/metadata/TC2019_samples.tsv",header=TRUE)
#We use the "factor" function to tell R which variables are categorical rather than continuous 
metadata$Strain=factor(metadata$Strain)
metadata$Media=factor(metadata$Media)
metadata$Sample=factor(metadata$Sample)
metadata$Researcher=factor(metadata$Researcher)
head(metadata)

In [None]:
#extract the PC columns from the count.pca object 
pcs=data.frame(count.pca$x)


In [None]:
#add columns from the metadata file. Do this safely using the "merge" command to make sure the sample IDs 
#from the two data frames are aligned
pcs$ID=rownames(pcs)
pcs_annotated=merge(pcs,metadata,by="ID")
head(pcs_annotated)

Now, we can use the ggplot2 package in R to generate scatterplots of PC1 vs PC2, PC2 vs PC3, etc. and color-code
by experimental variables. 


In [None]:
library(ggplot2)

In [None]:
#Plot pc1 vs pc2, color by Sample -- that is, all replicates for the same sample should be the same color. 
ggplot(data=pcs_annotated,aes(x=PC1,y=PC2,color=Sample))+
geom_point()


We should see replicates of the same sample clustering close together. Do we see this in the scatterplot above?

### Correcting a sample swap ### 

In [None]:
#Plot pc1 vs pc2, color by Media 
ggplot(data=pcs_annotated,aes(x=PC1,y=PC2,color=Media))+
geom_point()

We see that Principal component 1 (PC1) captures variation in the data due to media. 
But it appears that we have a sample swap! One pink sample clusters with the blue samples, and vice versa. 

Let's add labels to the PCA plot so we know which two samples are swapped. 


In [None]:
ggplot(data=pcs_annotated,aes(x=PC1,y=PC2,color=Media,label=ID))+
geom_point()+
geom_text()


We see that sample "jkcheng_rtt109_YPGE_3" clusters with the YPD samples, while sample "jkcheng_WT_YPD_3" clusters with the YPGE samples. That's ok, sample swaps happen, and luckily in this case it's easy to correct. We simply swap the two column labels in *norm_asinh_count*.
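
As a small illustration of the label swap, the sketch below exchanges two column names on a toy matrix by looking them up by name rather than by position. The helper `swap_colnames` and the toy labels are hypothetical and not part of the original notebook; the check on the lookup result is one possible safeguard against typos in the sample names.

```r
# Toy matrix with placeholder sample names; columns "b" and "c" are mislabeled.
toy <- matrix(1:8, nrow = 2, dimnames = list(NULL, c("a", "b", "c", "d")))

swap_colnames <- function(m, name1, name2) {
  i <- which(colnames(m) == name1)
  j <- which(colnames(m) == name2)
  # Stop early if either label is missing or duplicated.
  stopifnot(length(i) == 1, length(j) == 1)
  # Only the labels move; the data in each column stay where they are.
  colnames(m)[c(i, j)] <- colnames(m)[c(j, i)]
  m
}

swapped <- swap_colnames(toy, "b", "c")
print(colnames(swapped))
```

Looking the columns up by name avoids depending on a fixed column order.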



In [None]:
which(colnames(norm_asinh_count)=='jkcheng_rtt109_YPGE_3')

In [None]:
which(colnames(norm_asinh_count)=='jkcheng_WT_YPD_3')

In [None]:
colnames(norm_asinh_count)[c(12,13)] <- colnames(norm_asinh_count)[c(13,12)]
#rerun the PCA
count.pca=prcomp(t(norm_asinh_count),center=TRUE,scale=FALSE)
pcs=data.frame(count.pca$x)
pcs$ID=rownames(pcs)
pcs_annotated=merge(pcs,metadata,by="ID")
#Plot pc1 vs pc2, color by Media 
ggplot(data=pcs_annotated,aes(x=PC1,y=PC2,color=Media))+
geom_point()

Much better! The samples from the same media groups now cluster together on the PCA. 


### Correcting for batch effects ###

In [None]:
#Plot pc1 vs pc2, color by Researcher -- here, we're checking for a batch effect based on researcher.
ggplot(data=pcs_annotated,aes(x=PC1,y=PC2,color=Researcher))+
geom_point()


Yikes! We do seem to have a batch effect based on researcher -- PC2 captures this effect. Specifically, the pilot samples that we used to replace some of the samples that did not work have a systematic difference from the other samples. 

Luckily, there are steps we can take to remove this batch effect. We use the R **limma** package to fit a linear model to each peak. The explanatory variables are Strain, Media, and Researcher. The response variable is the normalized count value in the data matrix. We then subtract the fitted contribution of "Researcher" (the confounding variable) from the data. 
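
The following base R sketch shows the same idea on simulated data, without limma: fit a least-squares model per peak, then subtract only the fitted batch term. Everything here is an assumption made for illustration (the balanced toy design, the effect sizes, and names such as `meta`, `Batch`, and `toy_corrected`); it is not the notebook's actual model.

```r
set.seed(2)
n_peaks <- 50
# Hypothetical balanced design: 8 samples, two media, two batches.
meta <- data.frame(
  Media = factor(rep(c("YPD", "YPGE"), times = 4)),
  Batch = factor(rep(c("A", "B"), each = 4))
)
# Simulated signal = media effect + batch effect + noise.
media_effect <- ifelse(meta$Media == "YPGE", 2, 0)
batch_effect <- ifelse(meta$Batch == "B", 3, 0)
toy <- matrix(rnorm(n_peaks * 8, sd = 0.1), nrow = n_peaks)
toy <- sweep(toy, 2, media_effect + batch_effect, "+")

# Fit one linear model per peak (rows of toy) by least squares.
design <- model.matrix(~ Media + Batch, data = meta)
coefs <- t(qr.solve(design, t(toy)))   # peaks x coefficients

# Subtract only the fitted batch term; intercept and media terms are left alone.
batch_cols <- grep("^Batch", colnames(design))
batch_contribution <- coefs[, batch_cols, drop = FALSE] %*%
  t(design[, batch_cols, drop = FALSE])
toy_corrected <- toy - batch_contribution

# Mean difference between batches before and after correction.
before_diff <- mean(toy[, meta$Batch == "B"]) - mean(toy[, meta$Batch == "A"])
after_diff <- mean(toy_corrected[, meta$Batch == "B"]) -
  mean(toy_corrected[, meta$Batch == "A"])
print(before_diff)
print(after_diff)
```

Selecting the batch columns with `grep` on the design column names avoids hard-coding column positions. limma also provides a `removeBatchEffect` function, which may be a convenient alternative to subtracting the term by hand.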

In [None]:
library(limma)

In [None]:
#make sure the row order of the metadata matches the column order of the norm_asinh_count matrix. 
rownames(metadata)=metadata$ID
metadata=metadata[colnames(norm_asinh_count),]


In [None]:
#design the model using entries from our metadata file 
mod=model.matrix(~0+Strain+Media+Researcher,data=metadata)

#fit the model to the data 
fit=lmFit(norm_asinh_count,design=mod)

head(coefficients(fit))

#We note that columns 5 through 18 of the design matrix are the "Researcher" indicator columns, which capture 
#the batch effect. We can remove the contribution of these columns from the data: 
batch_contribution=coefficients(fit)[,5:18]%*% t(fit$design[,5:18])
norm_asinh_count_corrected=norm_asinh_count-batch_contribution

Let's re-run the PCA analysis on norm_asinh_count_corrected to make sure we're no longer observing a batch effect 
due to researcher.



In [None]:
count.pca.corrected=prcomp(t(norm_asinh_count_corrected),center=TRUE,scale=FALSE)
var_explained=round(100*count.pca.corrected$sdev^2/sum(count.pca.corrected$sdev^2),2)
barplot(var_explained)
pcs.corrected=data.frame(count.pca.corrected$x)
pcs.corrected$ID=rownames(pcs.corrected)
pcs_annotated.corrected=merge(pcs.corrected,metadata,by="ID")

In [None]:
ggplot(data=pcs_annotated.corrected,aes(x=PC1,y=PC2,color=Researcher))+
geom_point()

Excellent! We no longer see the pilot samples clustering together. Let's make sure that the samples still separate by media. 

In [None]:
ggplot(data=pcs_annotated.corrected,aes(x=PC1,y=PC2,color=Media))+
geom_point()

Now, we also see a clear separation of the wild type (WT) strain along PC2: 

In [None]:
ggplot(data=pcs_annotated.corrected,aes(x=PC1,y=PC2,color=Strain))+
geom_point()

**Exercise**: Is there a pair of PCs that appears to separate the asf1 and rtt109 samples? 

Hint: substitute x=PC1,PC2,PC3,PC4... and y=PC1,PC2,PC3,PC4... in various combinations and set color=Strain. 

In [None]:
#Plot your chosen pair of PCs, color by Strain 
#YOUR CODE HERE: 
#ggplot(data=pcs_annotated.corrected,aes(x=?,y=?,color=Strain))+geom_point()


**Exercise**: We have now done a fairly thorough PCA analysis of the normalized count matrix. Compare these plots with the PCA plots for the fold change data matrix. Do the PCA plots look similar or different? 

#### Getting peak contributions to principal components. ####

Finally, we'd like to determine how much each peak contributes to PC1, PC2, and PC3. We can look at PC4 and up also, but for the sake of time we'll stick with the first 3 principal components; from the scree plot, we see they explain approximately 50% of the variance in the data. Primarily we want to get a sense of which peaks are critical in defining the principal components, and in which direction (positive or negative).
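
One way to find the peaks that matter most for a component is to rank the loadings by absolute value while keeping their sign. The sketch below does this on random toy data; `toy`, `loadings_pc1`, and `top_pc1` are hypothetical names used only for this example.

```r
set.seed(3)
# Toy data: 10 samples (rows) x 6 peaks (columns); values are random placeholders.
toy <- matrix(rnorm(60), nrow = 10,
              dimnames = list(NULL, paste0("peak", 1:6)))
toy_pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Loading of each peak on PC1; the sign gives the direction of the contribution.
loadings_pc1 <- toy_pca$rotation[, 1]

# Rank peaks by absolute loading, keeping the signed values.
top_pc1 <- loadings_pc1[order(abs(loadings_pc1), decreasing = TRUE)]
print(head(top_pc1, 3))

# Each loading vector has unit length, so the squared loadings sum to 1.
print(sum(loadings_pc1^2))
```

Because the squared loadings sum to 1, each squared loading can be read as the fraction of that component attributable to a given peak.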

In [None]:
contribs_pc1=count.pca.corrected$rotation[,1]
contribs_pc2=count.pca.corrected$rotation[,2]
contribs_pc3=count.pca.corrected$rotation[,3]

#these are named vectors of contributions from each peak to the corresponding PC
head(contribs_pc1)
length(contribs_pc1)

In [None]:
#Use the write.table command to write the PC contribution data to output files. 
#If you want to use the pc contributions from count data rather than fold change data, uncomment the lines below. 
#The analyses looked similar enough that either one can be used downstream. 

#write.table(contribs_pc1,paste(work_dir,"pc1_contribs.txt",sep='/'),quote=FALSE,col.names=FALSE,row.names=TRUE,sep='\t')
#write.table(contribs_pc2,paste(work_dir,"pc2_contribs.txt",sep='/'),quote=FALSE,col.names=FALSE,row.names=TRUE,sep='\t')
#write.table(contribs_pc3,paste(work_dir,"pc3_contribs.txt",sep='/'),quote=FALSE,col.names=FALSE,row.names=TRUE,sep='\t')


## Hierarchical Clustering of Normalized Count Signal Across Samples ##

Cluster analysis is a simple way to visualize patterns in the data. By clustering peaks according to their signal across samples, we may find groups of peaks that behave similarly across strains and media. By clustering samples according to their signal across peaks, we can perform a simple sanity check of data quality: samples from the same condition should cluster together.
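
The sample-level sanity check can be sketched with base R alone, using the same distance and linkage settings that the heatmaps in this section pass to `heatmap.2`. The toy matrix, the two conditions, and the variable names are hypothetical.

```r
set.seed(4)
# Toy matrix: 20 peaks (rows) x 6 samples (columns), two hypothetical conditions.
condition <- rep(c("A", "B"), each = 3)
toy <- matrix(rnorm(120, sd = 0.2), nrow = 20,
              dimnames = list(NULL, paste0(condition, "_rep", 1:3)))
# Shift condition B so that the two groups differ.
toy[, condition == "B"] <- toy[, condition == "B"] + 2

# dist() computes distances between rows, so transpose to cluster samples.
sample_dist <- dist(t(toy), method = "euclidean")
sample_tree <- hclust(sample_dist, method = "ward.D")

# Cut the tree into two groups and compare with the known conditions.
groups <- cutree(sample_tree, k = 2)
print(table(groups, condition))
```

If replicates of the same condition end up in different groups, that is a hint to look for a sample swap or a batch effect.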

In [None]:
library(gplots)
library(RColorBrewer)

Let's begin by clustering the normalized count data that has been corrected for the sample swap but not yet for the batch effect:

In [None]:
heatmap.2(norm_asinh_count,
          scale     = "none",
          col       = rev(colorRampPalette(brewer.pal(10, "RdBu"))(256)),
          distfun   = function(x) dist(x,method="euclidean"),
          hclustfun = function(x) hclust(x, method="ward.D"),
          Rowv=TRUE,
          Colv=TRUE,
          trace="none",
          cexCol = 0.9,
          margins=c(15,5),
          labRow="")



Now, we examine the hierarchical clustering on the batch-corrected count data. 

In [None]:
heatmap.2(norm_asinh_count_corrected,
          scale     = "none",
          col       = rev(colorRampPalette(brewer.pal(10, "RdBu"))(256)),
          distfun   = function(x) dist(x,method="euclidean"),
          hclustfun = function(x) hclust(x, method="ward.D"),
          Rowv=TRUE,
          Colv=TRUE,
          trace="none",
          cexCol = 0.9,
          margins=c(15,5),
          labRow="")


This looks better than the uncorrected heatmap, but adding more contrast wouldn't hurt. We set the heatmap color breaks from quantiles of the data.
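
Here is a minimal sketch of why quantile-based breaks add contrast: a handful of extreme values stretch the full data range, while breaks spanning only the 5th to 95th percentiles spend the whole palette on the bulk of the data. The toy vector, the three-color ramp, and the choice of 51 breaks are assumptions made for illustration.

```r
set.seed(5)
# Toy values with two extreme outliers; placeholder data.
toy <- c(rnorm(1000), 25, -30)

# Restrict the color scale to the 5th through 95th percentiles.
limits <- quantile(toy, probs = c(0.05, 0.95))
palette_breaks <- seq(limits[["5%"]], limits[["95%"]], length.out = 51)

# A heatmap needs one fewer color than breaks.
palette_colors <- colorRampPalette(c("blue", "white", "red"))(length(palette_breaks) - 1)

print(range(toy))
print(range(palette_breaks))
print(length(palette_colors))
```

Values outside the break range are drawn in the end colors, so outliers no longer compress the rest of the scale.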

In [None]:
#We split the corrected count matrix into 1% quantiles 
quantile.range <- quantile(norm_asinh_count_corrected, probs = seq(0, 1, 0.01))
#we scale the breaks in the heatmap color palette according to the quantiles. 
palette.breaks <- seq(quantile.range["5%"], quantile.range["95%"], 0.1)


heatmap.2(norm_asinh_count_corrected,
          scale     = "none",
          col       = rev(colorRampPalette(brewer.pal(10, "RdBu"))(length(palette.breaks) - 1)),
          distfun   = function(x) dist(x,method="euclidean"),
          hclustfun = function(x) hclust(x, method="ward.D"),
          Rowv=TRUE,
          Colv=TRUE,
          trace="none",
          cexCol = 0.9,
          margins=c(15,5),
          breaks = palette.breaks,
          labRow="")


The two heatmaps look very different, but show the same data! 
When selecting a color scheme for PCA plots or heatmaps in R, the RColorBrewer package is quite useful. Also, for nice color palettes, check out: http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3