# 3.4 Calling differentially expressed peaks with DESeq2 and limma

### IMPORTANT: Please make sure that you are using the R kernel to run this notebook. ###


In this tutorial, we will focus on calling differential peaks: 
![Analysis pipeline](images/part4.png)

## Running DESeq2

[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) uses read count data, such as our matrix **all.readcount.txt**, to test for differential signal across conditions specified in a metadata file. We run DESeq2 with the following comparisons (which we call "contrasts"): 

* 0min WT vs 45min WT 
* Timepoint comparisons 
    * MSN1 (0min vs 45min) 
    * MSN2 (0min vs 45min)
    
* Strain: 
    *  WT vs MSN1
    *  WT vs MSN2
    *  WT vs MSN4
    *  WT vs HOG1
    *  WT vs SKN7
    *  WT vs HOT1
    *  WT vs YAP1
    *  WT vs YAP6
    *  WT vs YAP7   

In [12]:
#change to your working directory 
username="annashch"
setwd(paste("/scratch/",username,sep=""))

In [13]:
#load the DESeq2 library
library(DESeq2,quietly = TRUE)


In [14]:
#We read in the count matrix and the metadata table in the same manner as we did in tutorial 3.1 
#load the read count matrix
count_data=read.table("/outputs/all.readcount.txt",header=TRUE)
rownames(count_data)=paste(count_data$Chrom,count_data$Start,count_data$End,sep='\t')
#remove the columns we will not use 
count_data$Chrom=NULL
count_data$Start=NULL
count_data$End=NULL
count_data$ID=NULL

head(count_data)

Unnamed: 0_level_0,abalsubr_0min_YAP6_1,abalsubr_45min_HOT1_2,ajberg5_0min_HOG1_1,ajberg5_45min_WT_2,annashch_0min_YAP1_2,annashch_45min_YAP6_2,annlin_0min_MSN2_2,annlin_45min_YAP7_1,clin5_0min_MSN4_2,clin5_45min_MSN2_2,⋯,soumyak_0min_HOT1_2,soumyak_45min_YAP1_2,srstern_0min_WT_2,srstern_45min_MSN2_1,subkc_0min_MSN2_1,subkc_45min_YAP6_1,surag_0min_YAP7_2,surag_45min_HOT1_1,zahoor_0min_YAP6_2,zahoor_45min_YAP7_2
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
chrI	13	555,451,289,618,368,169,558,306,578,363,357,⋯,138,439,750,919,443,652,523,1115,588,369
chrI	6348	6518,41,36,102,44,27,58,57,52,67,46,⋯,39,61,115,147,49,104,72,158,97,26
chrI	9278	9407,21,29,32,24,5,39,20,33,36,37,⋯,7,12,32,88,18,85,14,89,41,34
chrI	20461	21185,271,457,461,482,253,632,303,747,444,516,⋯,249,803,754,1511,295,810,392,1753,553,433
chrI	28591	28910,139,72,125,69,95,155,147,95,187,106,⋯,70,122,245,232,84,203,145,234,230,64
chrI	29757	30083,155,103,188,117,100,221,143,183,190,139,⋯,89,177,315,395,136,273,188,404,292,98


In [44]:
metadata=read.table("/metadata/TC2019_samples.tsv",header=TRUE)
#We use the "factor" function to tell R which variables are categorical rather than continuous 
metadata$Strain=factor(metadata$Strain)
#list "45min" first so that it is the reference level for Timepoint 
metadata$Timepoint=factor(metadata$Timepoint,levels=c("45min","0min"))
metadata$Researcher=factor(metadata$Researcher)
#we don't need the Replicate column for this analysis; Sample is kept 
metadata$Replicate=NULL
rownames(metadata)=metadata$ID
metadata$ID=NULL
#make sure the rows in metadata match the order of the columns in count_data 
metadata=metadata[names(count_data),]
head(metadata)

Unnamed: 0_level_0,Sample,Researcher,Timepoint,Strain
Unnamed: 0_level_1,<fct>,<fct>,<fct>,<fct>
abalsubr_0min_YAP6_1,0min_YAP6,abalsubr,0min,YAP6
abalsubr_45min_HOT1_2,45min_HOT1,abalsubr,45min,HOT1
ajberg5_0min_HOG1_1,0min_HOG1,ajberg5,0min,HOG1
ajberg5_45min_WT_2,45min_WT,ajberg5,45min,WT
annashch_0min_YAP1_2,0min_YAP1,annashch,0min,YAP1
annashch_45min_YAP6_2,45min_YAP6,annashch,45min,YAP6
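
The two preparation steps above, reordering the metadata rows to follow the count matrix columns and fixing the order of the factor levels, can be checked on a toy example. The sketch below uses hypothetical sample names (`s1`, `s2`, `s3`) rather than the tutorial files; the first entry of `levels` is what R treats as the reference level.

```r
# Toy data with hypothetical sample names; not the tutorial files.
toy_counts <- data.frame(s1 = c(10, 0), s2 = c(25, 3), s3 = c(7, 1))
toy_meta <- data.frame(ID = c("s3", "s1", "s2"),
                       Timepoint = c("0min", "45min", "0min"),
                       stringsAsFactors = FALSE)
rownames(toy_meta) <- toy_meta$ID

# Reorder the metadata rows so they follow the count matrix columns.
toy_meta <- toy_meta[names(toy_counts), ]
aligned <- identical(rownames(toy_meta), names(toy_counts))

# The first entry of `levels` becomes the reference level of the factor.
toy_meta$Timepoint <- factor(toy_meta$Timepoint, levels = c("45min", "0min"))
reference_level <- levels(toy_meta$Timepoint)[1]

print(aligned)
print(reference_level)
```

If `aligned` were `FALSE`, counts would be paired with the wrong sample annotations, so it is worth running a check like this before building the DESeq2 object.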


In [37]:
metadata$Timepoint

In [38]:
#Set the adjusted p-value threshold for calling a peak differential 
padjust_thresh=0.01 


In [40]:
#create a DESeq2 object with the data, metadata, and model information 
ddsMat=DESeqDataSetFromMatrix(countData=as.matrix(count_data),
                            colData=metadata,
                            design=~Timepoint+Strain+Researcher)
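
To see which coefficients a design formula such as `~Timepoint+Strain+Researcher` produces, base R's `model.matrix` can be applied to a small table. The sketch below is only an illustration of how factor levels turn into model columns: the metadata is made up, it uses two of the strains and omits Researcher for brevity, and DESeq2 builds its own model matrix internally.

```r
# Hypothetical metadata; two strains only, Researcher omitted for brevity.
toy_meta <- data.frame(
  Timepoint = factor(c("0min", "45min", "0min", "45min"),
                     levels = c("45min", "0min")),
  Strain = factor(c("WT", "WT", "HOG1", "HOG1"))
)

# factor() orders levels alphabetically, so HOG1 is the default reference.
default_cols <- colnames(model.matrix(~Timepoint + Strain, data = toy_meta))

# relevel() makes WT the reference without changing the data.
toy_meta$Strain <- relevel(toy_meta$Strain, ref = "WT")
releveled_cols <- colnames(model.matrix(~Timepoint + Strain, data = toy_meta))

print(default_cols)
print(releveled_cols)
```

Each non-reference level gets its own column, named by pasting the factor name and the level together, which is why the choice of reference level determines the coefficient names.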


In [41]:
#Run DESeq2 analysis 
dds<-DESeq(ddsMat)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing


In [42]:
#We can examine several contrasts in the resulting DESeq2 object
resultsNames(dds)
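
The contrasts in this section are written as `c(factor, numerator, denominator)`, for example `c("Timepoint","0min","45min")`: DESeq2 reports the log2 fold change of the second element relative to the third. As a simplified illustration of that sign convention only (DESeq2 estimates fold changes from a fitted model, not from a ratio of raw means), consider these made-up normalized counts for one peak:

```r
# Made-up normalized counts for a single peak in two groups.
toy_0min <- c(180, 200, 220)
toy_45min <- c(45, 50, 55)

# Numerator = 0min, denominator = 45min, mirroring c("Timepoint","0min","45min").
log2_fold_change <- log2(mean(toy_0min) / mean(toy_45min))
print(log2_fold_change)
```

A positive value means the signal is higher in the numerator group (here 0min), and a negative value means it is higher in the denominator group.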

In [34]:
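#Optional: a single contrast on the combined Sample factor. This only works if the design
#formula includes Sample (the design above does not). The next cell overwrites these variables.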
deseq_contrasts=list(c("Sample","45min_HOT1","45min_WT"))
contrast_names=c("Sample_45min_HOT1_vs_45min_WT")


In [19]:
#Specify the contrasts we want to examine (we indicated these above)
deseq_contrasts=list(c("Timepoint","0min","45min"),
                     c("Strain","WT","MSN1"),
                     c("Strain","WT","MSN2"),
                     c("Strain","WT","MSN4"),
                     c("Strain","WT","HOG1"),
                     c("Strain","WT","SKN7"),
                     c("Strain","WT","HOT1"),
                     c("Strain","WT","YAP1"),
                     c("Strain","WT","YAP6"),
                     c("Strain","WT","YAP7"))
contrast_names=c("Timepoint_0min_vs_45min",
        "Strain_WT_vs_MSN1",
        "Strain_WT_vs_MSN2",
        "Strain_WT_vs_MSN4",
        "Strain_WT_vs_HOG1",
        "Strain_WT_vs_SKN7",
        "Strain_WT_vs_HOT1",
        "Strain_WT_vs_YAP1",
        "Strain_WT_vs_YAP6",
        "Strain_WT_vs_YAP7")



In [35]:
#Query the DESeq2 results to find differential peaks for each contrast, using our padjust_thresh value.
for(contrast_index in seq_along(contrast_names))
{
        comparison_name=contrast_names[contrast_index]
        print(comparison_name)
        ds=results(dds,
           contrast=deseq_contrasts[[contrast_index]])
       
        #write entries for all peaks
        write.table(as.data.frame(ds),file=paste(comparison_name,".txt",sep=""),quote=FALSE,row.names=TRUE,col.names=TRUE,sep='\t')
    
        #subset the peak set to just the differential peaks (peaks with an NA padj are dropped) 
        ds=na.omit(ds)
        sig=ds[ds$padj<padjust_thresh,] 
    
        #find positive log fold change peaks 
        positive_sig=sig[sig$log2FoldChange > 0,]
    
        #find negative log fold change peaks 
        negative_sig=sig[sig$log2FoldChange <0,]
    
        #row.names=TRUE keeps the peak coordinates (Chrom, Start, End) as the first three columns 
        write.table(as.data.frame(positive_sig),
                    file=paste(comparison_name,".differential.positive.txt",sep=""),
                    quote=FALSE,row.names=TRUE,col.names=FALSE,sep='\t')
        write.table(as.data.frame(negative_sig),
                    file=paste(comparison_name,".differential.negative.txt",sep=""),
                    quote=FALSE,row.names=TRUE,col.names=FALSE,sep='\t')
}


This code will generate 10 sets of files: 

* Timepoint_0min_vs_45min.txt  
* Timepoint_0min_vs_45min.differential.positive.txt  
* Timepoint_0min_vs_45min.differential.negative.txt  


* Strain_WT_vs_MSN1.txt  
* Strain_WT_vs_MSN1.differential.positive.txt
* Strain_WT_vs_MSN1.differential.negative.txt


* Strain_WT_vs_MSN2.txt  
* Strain_WT_vs_MSN2.differential.positive.txt
* Strain_WT_vs_MSN2.differential.negative.txt


* Strain_WT_vs_MSN4.txt  
* Strain_WT_vs_MSN4.differential.positive.txt
* Strain_WT_vs_MSN4.differential.negative.txt


* Strain_WT_vs_HOG1.txt  
* Strain_WT_vs_HOG1.differential.positive.txt
* Strain_WT_vs_HOG1.differential.negative.txt


* Strain_WT_vs_SKN7.txt  
* Strain_WT_vs_SKN7.differential.positive.txt
* Strain_WT_vs_SKN7.differential.negative.txt


* Strain_WT_vs_HOT1.txt  
* Strain_WT_vs_HOT1.differential.positive.txt
* Strain_WT_vs_HOT1.differential.negative.txt


* Strain_WT_vs_YAP1.txt  
* Strain_WT_vs_YAP1.differential.positive.txt
* Strain_WT_vs_YAP1.differential.negative.txt


* Strain_WT_vs_YAP6.txt  
* Strain_WT_vs_YAP6.differential.positive.txt
* Strain_WT_vs_YAP6.differential.negative.txt


* Strain_WT_vs_YAP7.txt  
* Strain_WT_vs_YAP7.differential.positive.txt
* Strain_WT_vs_YAP7.differential.negative.txt


In each set, the first file is the raw DESeq2 output for all peaks. We will not have time to discuss everything in this file, but feel free to read the DESeq2 manual and see if you can understand it. The second and third files contain the differential ATAC-seq peaks with positive and negative log2 fold change, respectively. The adjusted p-value (padj) cutoff that we use for differential openness is 0.01. 
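
The filtering that produces the positive and negative files can be sketched on a toy table. By default DESeq2 reports Benjamini-Hochberg adjusted p-values in its `padj` column; the example below imitates that with base R's `p.adjust` on made-up p-values. It is a simplification, since DESeq2 also applies independent filtering before adjustment, and the peak names and values here are hypothetical.

```r
# Made-up results for four hypothetical peaks.
toy <- data.frame(
  peak = c("peakA", "peakB", "peakC", "peakD"),
  pvalue = c(0.0001, 0.002, 0.02, 0.5),
  log2FoldChange = c(1.5, -2.0, 0.3, -0.1),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment across all tested peaks.
toy$padj <- p.adjust(toy$pvalue, method = "BH")

# Keep peaks below the cutoff, then split them by the sign of the fold change.
padjust_thresh <- 0.01
sig <- toy[toy$padj < padjust_thresh, ]
positive_sig <- sig[sig$log2FoldChange > 0, ]
negative_sig <- sig[sig$log2FoldChange < 0, ]

print(toy)
print(positive_sig$peak)
print(negative_sig$peak)
```

Note that `peakC` has a raw p-value of 0.02 but an adjusted value above that, which illustrates why the cutoff is applied to `padj` rather than to the raw p-value.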

### Running limma ###

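If you recall, we used the R limma package to remove the "Researcher" batch effect in our data. Limma can also be used for differential peak calling. It addresses the same question as DESeq2 with a different model: voom converts the counts to log2 counts per million with precision weights, and limma then fits a linear model with empirical Bayes moderation of the variances, whereas DESeq2 fits a negative binomial generalized linear model to the counts. We will go through the process of calling differential peaks with limma and see how the peak rankings differ between limma and DESeq2. It is good practice to sanity check your results by running them through several similar analysis methods. 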

In [None]:
#import the limma library 

library(limma)
#design the model 
design=model.matrix(~0+Strain+Timepoint+Researcher,data=metadata)

#We use the "voom" function associated with the limma package to normalize the count data 
vm=voom(count_data,design)

#fit the model to the data 
fit=lmFit(vm,design=vm$design)


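#We'll examine the Timepoint contrast. Because "45min" is the reference level of Timepoint,
#the corresponding design matrix column is "Timepoint0min" (0min relative to 45min). 
cont.matrix=makeContrasts(timepoint=Timepoint0min,levels=design)
timepoint_model=eBayes(contrasts.fit(fit,cont.matrix))
res_limma=topTable(timepoint_model,coef="timepoint",number=nrow(count_data))
head(res_limma)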
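
Before estimating its precision weights, `voom` converts the counts to log2 counts per million. A minimal base R sketch of that transform on a made-up matrix follows. It assumes library sizes equal to the column sums and leaves out the weighting step entirely, so it is an illustration rather than a replacement for `voom`.

```r
# Made-up count matrix: two hypothetical peaks, two hypothetical samples.
toy_counts <- matrix(c(10, 90, 20, 180), nrow = 2,
                     dimnames = list(c("peak1", "peak2"), c("s1", "s2")))

# Library size per sample, taken here as the column sum.
lib_size <- colSums(toy_counts)

# log2 counts per million; the offsets 0.5 and 1 keep the log finite for zero counts.
log_cpm <- t(log2(t(toy_counts + 0.5) / (lib_size + 1) * 1e6))
print(log_cpm)
```

Sample `s2` has twice the counts of `s1` for every peak, yet the log-CPM values are nearly identical across the two samples, which shows how the transform removes differences in sequencing depth.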

### Comparing DESeq2 and limma voom outputs ### 

In [None]:
#Let's extract the Timepoint comparison (the first contrast) from DESeq2
res_deseq2=results(dds,
           contrast=deseq_contrasts[[1]])
res_deseq2=as.data.frame(res_deseq2)


In [None]:
#We need to merge the two result data frames by peak name so that we can generate a scatterplot of
#the adjusted p-values from one method vs the other 
res_limma$peak=rownames(res_limma)
res_deseq2$peak=rownames(res_deseq2)
nrow(res_limma)
nrow(res_deseq2)

In [None]:
merged_df=merge(res_limma,res_deseq2,by="peak")
#adj.P.Val is the limma column and padj is the DESeq2 column 
merged_df$limma_padj=-10*log10(merged_df$adj.P.Val)
merged_df$deseq2_padj=-10*log10(merged_df$padj)
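
The `-10*log10` transform puts adjusted p-values on a scale where larger numbers mean stronger evidence. A short sketch with made-up values shows how the scale behaves, including the edge case of an adjusted p-value of exactly zero:

```r
# Made-up adjusted p-values, including an exact zero.
toy_padj <- c(1, 0.1, 0.01, 1e-40, 0)

# Each factor of 10 in the p-value adds 10 units on the transformed scale.
scaled <- -10 * log10(toy_padj)
print(scaled)
```

On this scale a value of 400 corresponds to an adjusted p-value of 1e-40. An adjusted p-value of zero maps to `Inf` and an `NA` stays `NA`, so such peaks fall outside any finite axis range when plotted.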



In [None]:
head(merged_df)

In [None]:
library(ggplot2)
ggplot(merged_df,aes(x=deseq2_padj,y=limma_padj))+
    geom_point(alpha=0.1)+
    xlim(0,400)+
    ylim(0,400)

The adjusted p-values appear to be well correlated. Let's check by computing the Spearman and Pearson correlations: 

In [None]:
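#DESeq2 sets padj to NA for some peaks; use="complete.obs" skips those rows instead of returning NA 
spearman_cor=cor(merged_df$limma_padj,merged_df$deseq2_padj,method="spearman",use="complete.obs")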
spearman_cor

In [None]:
#use="complete.obs" skips peaks where either method has an NA value 
pearson_cor=cor(merged_df$limma_padj,merged_df$deseq2_padj,method="pearson",use="complete.obs")
pearson_cor
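
The two coefficients answer slightly different questions: Pearson measures linear agreement between the values, while Spearman is the Pearson correlation of their ranks and so only asks whether the ordering agrees. A toy example with made-up vectors makes the difference visible:

```r
# Made-up data: y increases with x, but far from linearly.
x <- 1:10
y <- exp(x)

pearson_toy <- cor(x, y, method = "pearson")
spearman_toy <- cor(x, y, method = "spearman")

# Spearman is the Pearson correlation computed on the ranks.
spearman_from_ranks <- cor(rank(x), rank(y), method = "pearson")

print(pearson_toy)
print(spearman_toy)
```

For comparing peak rankings between two methods, a high Spearman correlation is the more relevant signal, since the two methods need not produce p-values on the same scale.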

Finally, we plot the rank comparison of the p-values across the two methods. 

In [None]:
#use the "rank" function to generate rank columns for the transformed adjusted p-values (a higher rank means a more significant peak) 
merged_df$limma_padj_rank=rank(merged_df$limma_padj)
merged_df$deseq2_padj_rank=rank(merged_df$deseq2_padj)

ggplot(merged_df,aes(x=deseq2_padj_rank,y=limma_padj_rank))+
    geom_point(alpha=0.1)
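
Another way to summarize a rank comparison is the fraction of peaks shared between the top N of each method. The helper below, `top_n_overlap`, is a hypothetical function written for this sketch and is not part of limma or DESeq2; it is shown on made-up scores where larger means more significant.

```r
# Hypothetical helper: fraction of the top-n names shared by two named score vectors.
# sort() drops NA values by default, so peaks without a score are ignored.
top_n_overlap <- function(scores_a, scores_b, n) {
  top_a <- names(sort(scores_a, decreasing = TRUE))[seq_len(n)]
  top_b <- names(sort(scores_b, decreasing = TRUE))[seq_len(n)]
  length(intersect(top_a, top_b)) / n
}

# Made-up transformed scores for five hypothetical peaks.
toy_limma <- c(p1 = 50, p2 = 40, p3 = 30, p4 = 5, p5 = 1)
toy_deseq2 <- c(p1 = 60, p2 = 10, p3 = 45, p4 = 2, p5 = 20)

overlap_top3 <- top_n_overlap(toy_limma, toy_deseq2, 3)
print(overlap_top3)
```

With the merged results, the same helper could be applied to `setNames(merged_df$limma_padj, merged_df$peak)` and `setNames(merged_df$deseq2_padj, merged_df$peak)` for a value of N that suits the analysis.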