# 3.1 Clustering analysis and PCA (on normalized fold change data)#

### IMPORTANT: Please make sure that you are using the R kernel to run this notebook.###
We are now switching from the bash kernel to the R kernel. 
The R language provides a number of utilities for genomic data analysis and visualization. We will explore some of these. 

In [None]:
#The preprocessCore library provides a number of functions useful for statistical analysis,
#including functions for data normalization that we will use below. 
library("preprocessCore")

In [None]:
?library

In [None]:
#Change to your $WORK_DIR. The syntax for switching directories in R is a little different than what we used in bash. 
#Use the "setwd" command to switch to your $WORK_DIR 
sunetid="annashch"
work_dir=paste("/scratch/",sunetid,sep="")
setwd(work_dir)
#The "dir" command will list all files in your current working directory 
dir()

In this tutorial we will focus on the clustering and PCA analysis steps of the pipeline: 
![Analysis pipeline](images/part3.png)

In [None]:
#Load the fc signal matrix. You can either use the one you generated in the last tutorial, or the one that we have 
#pre-generated in the $AGGREGATE_ANALYSIS_DIR folder in case you ran into any issues with that step.

#fc_data=read.table("all.fc.txt",header=TRUE)
fc_data=read.table("/outputs/all.fc.txt",header=TRUE)

rownames(fc_data)=paste(fc_data$Chrom,fc_data$Start,fc_data$End,sep='\t')
#remove the columns we will not use in downstream analysis
fc_data$ID=NULL
fc_data$Chrom=NULL
fc_data$Start=NULL
fc_data$End=NULL

head(fc_data)

In [None]:
#normalize the data 
#quantile normalization 
norm_asinh_fc=normalize.quantiles(data.matrix(asinh(fc_data)))

In [None]:
colnames(norm_asinh_fc)=names(fc_data)
rownames(norm_asinh_fc)=rownames(fc_data)

In [None]:
head(norm_asinh_fc)

Much better! After quantile normalization, the fold change values across samples are on the same scale. 
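
To make the normalization step concrete, here is a minimal base-R sketch of the idea behind quantile normalization, applied to a small made-up matrix. The helper `quantile_normalize_sketch` and the `toy` matrix are hypothetical and exist only for illustration. The notebook itself uses `normalize.quantiles` from preprocessCore, whose handling of ties and missing values may differ from this simplified version.

```r
# Toy matrix: 4 peaks (rows) x 3 samples (columns) on very different scales.
toy <- matrix(c(5, 2, 3, 4,
                40, 10, 60, 20,
                0.3, 0.4, 0.7, 0.8),
              ncol = 3,
              dimnames = list(paste0("peak", 1:4), c("s1", "s2", "s3")))

# asinh behaves like a log transform for large values but is defined at 0.
toy_asinh <- asinh(toy)

# Simplified quantile normalization (assumes no missing values):
# 1. rank the values within each sample,
# 2. average the sorted values across samples to get a target distribution,
# 3. replace each value by the target value that has the same rank.
quantile_normalize_sketch <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")
  target <- unname(rowMeans(apply(m, 2, sort)))
  out    <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(m)
  out
}

toy_norm <- quantile_normalize_sketch(toy_asinh)

# After normalization, every column holds the same set of values;
# only the order of the peaks within each sample differs.
apply(toy_norm, 2, sort)
```

Because each sample ends up with an identical distribution of values, differences between samples after this step reflect which peaks rank high or low, not the overall scale of each sample.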

## PCA ##

PCA (Principal Component Analysis) is a way to identify the primary directions of variation in the data. It can also be used for very coarse-grained clustering of samples; similar samples will have similar coordinates along the principal axes.

We will perform PCA on *all.fc.txt*. We treat each sample as a single point in a very high-dimensional space (where the dimensionality equals the number of peaks in the matrix), and then perform dimensionality reduction in this space. We can color-code the PCA plots by "Strain", "Timepoint", "Researcher", or "Sample" to determine which parameter separates the samples most effectively. 
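
As a rough illustration of treating samples as points, the sketch below runs `prcomp` on a small simulated matrix. The `toy` matrix, its sample names, and the size of the simulated shift are made up for this example and are not part of the tutorial data.

```r
set.seed(1)

# Simulated matrix: 50 peaks (rows) x 6 samples (columns).
toy <- matrix(rnorm(50 * 6), nrow = 50,
              dimnames = list(paste0("peak", 1:50), paste0("sample", 1:6)))

# Give samples 4-6 a strong shift in the first 25 peaks,
# so that there are two clear groups of samples.
toy[1:25, 4:6] <- toy[1:25, 4:6] + 5

# Transpose so that samples are rows (observations) and peaks are columns.
toy_pca <- prcomp(t(toy))

# One row of coordinates per sample; at most min(samples, peaks) components.
dim(toy_pca$x)

# Percent of variance explained by each principal component.
toy_var <- 100 * toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(toy_var, 2)

# Samples 1-3 and samples 4-6 fall on opposite sides of PC1.
toy_pca$x[, "PC1"]
```

With only 6 samples, `prcomp` returns 6 components no matter how many peaks there are, and the last one carries essentially no variance because the data are centered.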

In [None]:
#We run the principal component analysis command in R

#The t() function transposes the data matrix and allows us to cluster the samples, as opposed to the individual peaks,
#by placing the samples in the rows and the peaks in the columns. 
fc.pca=prcomp(t(norm_asinh_fc))

We generate a scree plot that shows how much variance in the data is explained by each principal component:

In [None]:
var_explained=round(100*fc.pca$sdev^2/sum(fc.pca$sdev^2),2)
print(var_explained)

Let's generate a simple bar graph to better illustrate the variance explained by each PC.


In [None]:
barplot(var_explained)

We can also plot the first few principal components to see if they correlate with any of our experimental variables: 

* Strain of yeast
* Timepoint

We also expect replicates for the same sample to cluster closely together.

Finally, we should make sure to check for any unintended batch effects in the data. For example, it's possible that samples generated by one researcher may exhibit a systematic difference from samples generated by a different researcher. We should check for this bias and correct it if possible. 
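
Alongside the visual check, one possible numeric companion is to ask how much of the variance in a principal component is explained by a categorical variable. The sketch below does this with base R's `lm` on simulated data. The `toy_pcs` data frame and the `r2_for` helper are hypothetical, and this check is an optional addition, not part of the original pipeline.

```r
set.seed(42)

# Simulated PC coordinates for 12 samples from two researchers.
# PC1 is constructed to differ by researcher; PC2 is pure noise.
toy_pcs <- data.frame(
  PC1 = c(rnorm(6, mean = -3), rnorm(6, mean = 3)),
  PC2 = rnorm(12),
  Researcher = factor(rep(c("r1", "r2"), each = 6))
)

# R-squared of a one-way linear model: the fraction of variance in a PC
# that is explained by the categorical variable.
r2_for <- function(pc, variable) summary(lm(pc ~ variable))$r.squared

r2_pc1 <- r2_for(toy_pcs$PC1, toy_pcs$Researcher)
r2_pc2 <- r2_for(toy_pcs$PC2, toy_pcs$Researcher)
c(PC1 = r2_pc1, PC2 = r2_pc2)
```

A value close to 1 for a leading principal component suggests that the variable (here, the simulated researcher label) dominates that axis of variation, which is the pattern a batch effect would produce.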


    

In [None]:
metadata=read.table("/metadata/TC2019_samples.tsv",header=TRUE)
nrow(metadata)

In [None]:
#First, we load our metadata file into R to help us color samples by replicate, strain, timepoint, and researcher. 
metadata=read.table("/metadata/TC2019_samples.tsv",header=TRUE)
#We use the "factor" function to tell R which variables are categorical rather than continuous 
metadata$Strain=factor(metadata$Strain)
metadata$Timepoint=factor(metadata$Timepoint)
metadata$Sample=factor(metadata$Sample)
metadata$Researcher=factor(metadata$Researcher)
metadata$Group=factor(metadata$Group)
head(metadata)

In [None]:
#extract the PC columns from the fc.pca object 
pcs=data.frame(fc.pca$x)
head(pcs)

In [None]:
#Add columns from the metadata file. Do this safely using the "merge" command to make sure the sample IDs 
#from the two data frames are aligned.
pcs$ID=rownames(pcs)
pcs_annotated=merge(pcs,metadata,by="ID")
head(pcs_annotated)

Now, we can use the ggplot2 package in R to generate scatterplots of PC1 vs PC2, PC2 vs PC3, etc., and color-code
them by experimental variables. 


In [None]:
library(ggplot2)

In [None]:
#select 24 distinct colors to use for the PCA scatterplot. 
cols=c('#a6cee3','#1f78b4','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f','#ff7f00','#cab2d6','#6a3d9a','#ffff99','#b15928','#8dd3c7','#ffffb3','#bebada','#fb8072','#80b1d3','#fdb462','#b3de69','#fccde5','#d9d9d9','#bc80bd','#ccebc5','#ffed6f')

In [None]:
#Plot pc1 vs pc2, color by Sample -- that is, all replicates for the same sample should be the same color. 
ggplot(data=pcs_annotated,
       aes(x=PC1,y=PC2,color=Sample))+
       geom_point(size=3)+
       scale_color_manual(values=cols)


We should see replicates of the same sample clustering close together. Do we see this in the scatterplot above?

In [None]:
#Plot pc1 vs pc2, color by Strain 
ggplot(data=pcs_annotated,aes(x=PC1,y=PC2,color=Strain))+
geom_point(size=3)+
scale_color_manual(values=cols)


No clear clustering by strain is observed. Let's color by Timepoint.

In [None]:
ggplot(data=pcs_annotated,aes(x=PC1,y=PC2,color=Timepoint,label=ID))+
geom_point()+
geom_text()

### Correcting for batch effects ###

We check for batch effects from Researcher and Group. 

In [None]:
#Plot pc1 vs pc2, color by Researcher -- here, we're checking for a batch effect based on researcher.
ggplot(data=pcs_annotated,aes(x=PC1,y=PC2,color=Researcher))+
geom_point(size=3)+
scale_color_manual(values=cols)



In [None]:
#Plot pc1 vs pc2, color by Group -- here, we're checking for a batch effect based on lab group.
ggplot(data=pcs_annotated,aes(x=PC1,y=PC2,color=Group))+
geom_point(size=3)

We don't see a clear batch effect for any individual researcher, but we do observe a batch effect from the Kundaje lab members' samples. Unfortunately, we cannot correct for this "Group" batch effect, because the experimental design is confounded with the "Group" variable. However, we can try to correct for any "Researcher" batch effect, even though it's not entirely clear that there is one. We use the R **limma** package to fit a linear model to each peak. The explanatory variables are Strain, Timepoint, and Researcher. The output variable is the normalized fold change value in the data matrix. We then subtract the fitted contribution of "Researcher" (the batch variable) from the output variable. 
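
The regress-out idea can be sketched in base R on a tiny simulated dataset. Everything here (`toy_meta`, `toy_expr`, the effect sizes) is made up for illustration, and ordinary least squares via `qr.solve` stands in for limma's `lmFit`. Selecting the batch columns by name with `grep` is one way to avoid hard-coding column indices.

```r
# Simulated metadata: 8 samples, 2 strains, 2 researchers (balanced design).
toy_meta <- data.frame(
  Strain     = factor(rep(c("A", "B"), each = 4)),
  Researcher = factor(rep(c("r1", "r2"), times = 4)),
  row.names  = paste0("sample", 1:8)
)
design <- model.matrix(~0 + Strain + Researcher, data = toy_meta)

# Simulated signal for 2 peaks: a strain effect plus a shift of 2
# for every sample handled by researcher r2.
set.seed(2)
batch_shift <- 2 * (toy_meta$Researcher == "r2")
toy_expr <- rbind(
  peak1 = c(1, 3)[as.integer(toy_meta$Strain)] + batch_shift + rnorm(8, sd = 0.1),
  peak2 = c(4, 2)[as.integer(toy_meta$Strain)] + batch_shift + rnorm(8, sd = 0.1)
)
colnames(toy_expr) <- rownames(toy_meta)

# Least-squares fit per peak; rows are peaks, columns are model terms.
coefs <- t(qr.solve(design, t(toy_expr)))

# Pick the batch columns by name instead of by position.
batch_cols <- grep("^Researcher", colnames(design))

# Fitted batch contribution (peaks x samples), subtracted from the data.
batch_contribution <- coefs[, batch_cols, drop = FALSE] %*%
  t(design[, batch_cols, drop = FALSE])
toy_corrected <- toy_expr - batch_contribution

# Refitting the corrected data gives a Researcher coefficient of about zero.
coefs_after <- t(qr.solve(design, t(toy_corrected)))
round(coefs_after, 3)
```

Because the Researcher terms use treatment contrasts, subtracting them shifts every sample onto the level of the reference researcher while leaving the fitted strain effects unchanged.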

In [None]:
library(limma)

In [None]:
#make sure the row order of the metadata matches the column order of the norm_asinh_fc matrix. 
rownames(metadata)=metadata$ID
metadata=metadata[colnames(norm_asinh_fc),]


In [None]:
#design the model using entries from our metadata file 
mod=model.matrix(~0+Strain +Timepoint+Researcher,data=metadata)

#fit the model to the data 
fit=lmFit(norm_asinh_fc,design=mod)

head(coefficients(fit))



In [None]:
colnames(fit$design)


In [None]:
#In this design, columns 12 through 29 of the model matrix correspond to the "Researcher" variable
#(confirm this against the colnames(fit$design) output above). We can remove the contribution of this variable from the data: 
batch_contribution=coefficients(fit)[,12:29]%*% t(fit$design[,12:29])
norm_asinh_fc_corrected=norm_asinh_fc-batch_contribution

Let's re-run the PCA analysis on norm_asinh_fc_corrected to make sure we're no longer observing a batch effect 
due to researcher.



In [None]:
fc.pca.corrected=prcomp(t(norm_asinh_fc_corrected))
var_explained=round(100*fc.pca.corrected$sdev^2/sum(fc.pca.corrected$sdev^2),2)
barplot(var_explained)
pcs.corrected=data.frame(fc.pca.corrected$x)
pcs.corrected$ID=rownames(pcs.corrected)
pcs_annotated.corrected=merge(pcs.corrected,metadata,by="ID")

In [None]:
ggplot(data=pcs_annotated.corrected,
       aes(x=PC1,y=PC2,color=Researcher))+
       geom_point(size=3)+
       scale_color_manual(values=cols)

In [None]:
ggplot(data=pcs_annotated.corrected,
       aes(x=PC1,y=PC2,color=Group))+
       geom_point(size=3)

Excellent! We no longer see the Kundaje lab samples clustering together. Let's make sure that the samples still separate by Timepoint and check for any improved separation by Strain. 

In [None]:
ggplot(data=pcs_annotated.corrected,aes(x=PC1,y=PC2,color=Timepoint))+
geom_point(size=3)

Let's check for separation by Strain:

In [None]:
ggplot(data=pcs_annotated.corrected,aes(x=PC1,y=PC2,color=Strain))+
geom_point(size=3)+
scale_color_manual(values=cols)

In [None]:
ggplot(data=pcs_annotated.corrected,aes(x=PC1,y=PC3,color=Strain))+
geom_point(size=3)+
scale_color_manual(values=cols)

In [None]:
tmp=ebayes(fit)

In [None]:
length(fit$df.residual)

#### Getting peak contributions to principal components. ####

Finally, we'd like to determine how much each peak contributes to PC1, PC2, and PC3. We can look at PC4 and up also, but for the sake of time we'll stick with the first 3 principal components; from the scree plot, we see they explain approximately 50% of the variance in the data. Primarily we want to get a sense of which peaks are critical in defining the principal components, and in which direction (positive or negative).
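
As a small illustration of what these contributions look like, the sketch below inspects the `rotation` matrix returned by `prcomp` on simulated data. The `toy` matrix and the variable names are hypothetical and only for this example.

```r
set.seed(7)

# Simulated matrix: 40 peaks (rows) x 5 samples (columns).
toy <- matrix(rnorm(40 * 5), nrow = 40,
              dimnames = list(paste0("peak", 1:40), paste0("sample", 1:5)))
toy[1:10, 4:5] <- toy[1:10, 4:5] + 5

toy_pca <- prcomp(t(toy))

# One loading per peak for PC1; the sign gives the direction of the contribution.
loadings_pc1 <- toy_pca$rotation[, 1]

# Each column of the rotation matrix is a unit-length vector,
# so the squared loadings sum to 1.
sum(loadings_pc1^2)

# sort() orders from most negative to most positive, so the strongest
# contributors sit at both ends; ranking by absolute value lists them together.
top_by_magnitude <- head(sort(abs(loadings_pc1), decreasing = TRUE), 5)
top_by_magnitude
```

Since the squared loadings of a component sum to 1, a squared loading can be read as the share of that component attributable to a given peak.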

In [None]:
contribs_pc1=sort(fc.pca.corrected$rotation[,1])
contribs_pc2=sort(fc.pca.corrected$rotation[,2])
contribs_pc3=sort(fc.pca.corrected$rotation[,3])

#these are sorted, named vectors of the contribution (loading) of each peak to the corresponding PC
tail(contribs_pc1)
length(contribs_pc1)

In [None]:
#Use the write.table command to write the PC contribution data to output files. 

write.table(contribs_pc1,paste(work_dir,"pc1_contribs.txt",sep='/'),quote=FALSE,col.names=FALSE,row.names=TRUE,sep='\t')
write.table(contribs_pc2,paste(work_dir,"pc2_contribs.txt",sep='/'),quote=FALSE,col.names=FALSE,row.names=TRUE,sep='\t')
write.table(contribs_pc3,paste(work_dir,"pc3_contribs.txt",sep='/'),quote=FALSE,col.names=FALSE,row.names=TRUE,sep='\t')


## Hierarchical Clustering of Fold Change Signal Across Samples ##

Cluster analysis is a simple way to visualize patterns in the data. By clustering peaks according to their signal across different time points, we may find groups of peaks that have similar behavior across these time points. By clustering samples according to their signal across peaks, we can perform a simple sanity check of data quality: samples of the same time point should cluster together.
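
The sample-level sanity check can be sketched with base R's `dist`, `hclust`, and `cutree` on simulated data. The `toy` matrix and its sample names are made up for illustration. The distance metric and linkage method match the ones passed to `heatmap.2` in this notebook.

```r
set.seed(11)

# Simulated matrix: 50 peaks x 6 samples, three replicates at each of two time points.
toy <- matrix(rnorm(50 * 6), nrow = 50,
              dimnames = list(paste0("peak", 1:50),
                              c("t0_rep1", "t0_rep2", "t0_rep3",
                                "t60_rep1", "t60_rep2", "t60_rep3")))
toy[1:25, 4:6] <- toy[1:25, 4:6] + 5

# dist() computes distances between rows, so transpose to compare samples.
sample_dist <- dist(t(toy), method = "euclidean")
sample_tree <- hclust(sample_dist, method = "ward.D")

# Cut the tree into two groups and compare with the known time points.
sample_clusters <- cutree(sample_tree, k = 2)
sample_clusters
```

In `heatmap.2`, the same `distfun` and `hclustfun` are applied to the rows of the matrix for the peak dendrogram and to its transpose for the sample dendrogram.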

In [None]:
library(gplots)
library(RColorBrewer)

In [None]:
?dist

In [None]:
?hclust

Let's begin by clustering normalized fold change data that has not been corrected for the sample swap or the batch effect:

In [None]:
heatmap.2(norm_asinh_fc,
          scale     = "none",
          col       = rev(colorRampPalette(brewer.pal(10, "RdBu"))(256)),
          distfun   = function(x) dist(x,method="euclidean"),
          hclustfun = function(x) hclust(x, method="ward.D"),
          Rowv=TRUE,
          Colv=TRUE,
          trace="none",
          cexCol = 0.9,
          margins=c(15,5),
          labRow="")



Now, we examine the hierarchical clustering on the corrected fold change data. 

In [None]:
heatmap.2(norm_asinh_fc_corrected,
          scale     = "none",
          col       = rev(colorRampPalette(brewer.pal(10, "RdBu"))(256)),
          distfun   = function(x) dist(x,method="euclidean"),
          hclustfun = function(x) hclust(x, method="ward.D"),
          Rowv=TRUE,
          Colv=TRUE,
          trace="none",
          cexCol = 0.9,
          margins=c(15,5),
          labRow="")


There is very little contrast in the heatmap that was generated. We can add contrast by modifying how "breaks" between colors are generated. 
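
To see why the choice of breaks matters, the sketch below compares breaks that span the full data range with breaks restricted to the 5% to 95% quantile range. The simulated `vals` vector, the two outliers, and the choice of 10 color bins are assumptions made for this example.

```r
set.seed(3)

# Simulated values: mostly near zero, plus two extreme outliers.
vals <- c(rnorm(1000), 50, -50)

# Breaks spanning the full range: the outliers stretch the color scale.
full_breaks <- seq(min(vals), max(vals), length.out = 11)

# Breaks spanning only the 5%-95% quantile range.
lims <- quantile(vals, probs = c(0.05, 0.95), names = FALSE)
clipped_breaks <- seq(lims[1], lims[2], length.out = 11)

# A heatmap needs exactly one color per interval between consecutive breaks.
n_colors <- length(clipped_breaks) - 1

# With full-range breaks, almost all values land in the middle bins,
# so they would share nearly the same color.
full_counts <- table(cut(vals, full_breaks, include.lowest = TRUE))
full_counts
```

Restricting the breaks to the central quantiles spends the whole palette on the range where most of the data lies, which is why the contrast improves even though the underlying matrix is unchanged.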

In [None]:
#We split the fold change matrix into 1% quantiles 
quantile.range <- quantile(norm_asinh_fc_corrected, probs = seq(0, 1, 0.01))
#we scale the breaks in the heatmap color palette according to the quantiles. 
palette.breaks <- seq(quantile.range["5%"], quantile.range["95%"], 0.1)


heatmap.2(norm_asinh_fc_corrected,
          scale     = "none",
          col       = rev(colorRampPalette(brewer.pal(10, "RdBu"))(length(palette.breaks) - 1)),
          distfun   = function(x) dist(x,method="euclidean"),
          hclustfun = function(x) hclust(x, method="ward.D"),
          Rowv=TRUE,
          Colv=TRUE,
          trace="none",
          cexCol = 0.9,
          margins=c(15,5),
          breaks = palette.breaks,
          labRow="")


The two heatmaps look very different, but show the same data! 
When selecting a color scheme for PCA or heatmaps in R, the R Color Brewer tool is quite useful. Also, for nice color palettes, check out: http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3