# 3.3 Calling differentially expressed peaks with DESeq2

### IMPORTANT: Please make sure that you are using the R kernel to run this notebook. ###


In this tutorial, we will focus on calling differential peaks: 
![Analysis pipeline](images/part4.png)

## Missing R packages 

If a script in this section fails with an error saying that the gplots package is not installed, you can install the package locally by running the **3.5 Install R packages** notebook.

## Running DESeq2

[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) uses read count data, such as our matrix **all.readcount.txt**, to test for differences across the conditions specified in a metadata file. It was designed for differential gene expression, but here each row of the matrix is a peak, so we use it to call differential peaks. We run DESeq2 with 4 comparisons (which we call "contrasts"):
* Media:
    * glucose (D) vs ethanol (E)
* Strain:
    * WT vs cln3
    * WT vs whi5
    * cln3 vs whi5


In [34]:
#change to your working directory 
username="ubuntu"
setwd(paste("/srv/scratch/training_camp/work/",username,sep=""))

In [35]:
#load the DESeq2 library
library(DESeq2,quietly = TRUE)


In [36]:
#We read in the count data matrix and the metadata matrix in the same manner as we did in tutorial 3.1 
#load the read count matrix
count_data=read.table("all.readcount.txt",header=TRUE)
rownames(count_data)=paste(count_data$Chrom,count_data$Start,count_data$End,sep='\t')
#remove the columns we will not use 
count_data$Chrom=NULL
count_data$Start=NULL
count_data$End=NULL
count_data$ID=NULL
head(count_data)



Unnamed: 0,cln3.SCD.0_6MNaCl.Rep1_R1_001,cln3.SCD.0_6MNaCl.Rep2_R1_001,cln3.SCD.Rep1_R1_001,cln3.SCD.Rep2_R1_001,cln3.SCE.0_6MNaCl.Rep1_R1_001,cln3.SCE.0_6MNaCl.Rep2_R1_001,cln3.SCE.Rep1_R1_001,cln3.SCE.Rep2_R1_001,whi5.cln3.SCE.Rep1_R1_001,whi5.cln3.SCE.Rep2_R1_001,whi5.SCE.Rep1_R1_001,whi5.SCE.Rep2_R1_001,WT.SCD.0_6MNaCl.Rep1_R1_001,WT.SCD.0_6MNaCl.Rep2_R1_001,WT.SCD.Rep1_R1_001,WT.SCD.Rep2_R1_001,WT.SCE.0_6MNaCl.Rep1_R1_001,WT.SCE.0_6MNaCl.Rep2_R1_001,WT.SCE.Rep1_R1_001,WT.SCE.Rep2_R1_001
chrI	0	781,0,0,151,191,226,158,210,127,292,296,232,188,83,246,25,182,241,203,9,244
chrI	6332	6549,0,0,537,820,1342,1050,1157,590,1460,1624,1562,713,590,1585,115,732,2227,2032,90,1230
chrI	9138	9609,0,0,175,222,366,251,304,160,401,483,410,261,143,379,34,220,379,383,17,344
chrI	20611	21197,0,0,249,309,369,282,316,189,394,406,322,314,134,342,60,370,334,310,19,410
chrI	28155	29092,0,0,50,50,48,37,42,22,57,65,55,72,12,49,7,47,65,64,1,60
chrI	29173	30197,0,0,88,115,215,226,225,129,241,390,284,118,86,224,27,164,324,316,25,249
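
Because the row names above join `Chrom`, `Start`, and `End` with tabs, each row name is effectively one BED-style line. The sketch below shows one way to split such names back into separate columns. The three peak names are copied from the output above; the `Width` column and the half-open coordinate convention are assumptions made for illustration.

```r
# Toy row names in the same "Chrom<TAB>Start<TAB>End" form built above
peak_ids <- c("chrI\t0\t781", "chrI\t6332\t6549", "chrI\t9138\t9609")

# Split each name on the tab character and bind the pieces into a matrix
parts <- do.call(rbind, strsplit(peak_ids, "\t", fixed = TRUE))

# Rebuild a coordinate table with numeric start and end positions
peaks <- data.frame(Chrom = parts[, 1],
                    Start = as.integer(parts[, 2]),
                    End   = as.integer(parts[, 3]),
                    stringsAsFactors = FALSE)

# Peak width in base pairs (assuming BED-style half-open coordinates)
peaks$Width <- peaks$End - peaks$Start
print(peaks)
```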


In [41]:
metadata=read.table("/srv/scratch/training_camp/metadata/TC2017_samples.tsv",header=TRUE)
#We use the "factor" function to tell R which variables are categorical rather than continuous 
metadata$Strain=factor(metadata$Strain)
metadata$Media=factor(metadata$Media)
#we don't need the other metadata columns for this analysis 
metadata$Sample=NULL
metadata$Researcher=NULL
metadata$Replicate=NULL
rownames(metadata)=metadata$ID
metadata$ID=NULL
#make sure the rows in metadata match the order of the columns in count_data 
metadata=metadata[names(count_data),]
metadata

Unnamed: 0,Strain,Media
cln3.SCD.0_6MNaCl.Rep1_R1_001,cln3,D
cln3.SCD.0_6MNaCl.Rep2_R1_001,cln3,D
cln3.SCD.Rep1_R1_001,cln3,D
cln3.SCD.Rep2_R1_001,cln3,D
cln3.SCE.0_6MNaCl.Rep1_R1_001,cln3,E
cln3.SCE.0_6MNaCl.Rep2_R1_001,cln3,E
cln3.SCE.Rep1_R1_001,cln3,E
cln3.SCE.Rep2_R1_001,cln3,E
whi5.cln3.SCE.Rep1_R1_001,whi5,E
whi5.cln3.SCE.Rep2_R1_001,whi5,E
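
The reordering step matters because each metadata row must describe the count column in the same position. Below is a minimal sketch of the same reordering on toy data, followed by an explicit check. The sample names `sampleA` and `sampleB` and the `toy_*` objects are hypothetical and are not part of the tutorial data.

```r
# Hypothetical toy inputs: two samples, deliberately listed in different orders
toy_counts <- data.frame(sampleA = c(10, 20), sampleB = c(5, 7))
toy_meta <- data.frame(Strain = c("cln3", "WT"),
                       Media  = c("E", "D"),
                       row.names = c("sampleB", "sampleA"))

# Reorder the metadata rows to follow the count-matrix columns, as in the cell above
toy_meta <- toy_meta[names(toy_counts), ]

# Stop early if a sample is missing from the metadata or the order is still wrong
stopifnot(identical(rownames(toy_meta), names(toy_counts)),
          !anyNA(toy_meta$Strain))
print(toy_meta)
```

Running the same `stopifnot` check on `metadata` and `count_data` before building the DESeq2 object is a cheap way to catch a sample that is missing from the metadata file.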


In [71]:
#Set the adjusted p-value threshold for calling a peak differential 
padjust_thresh=0.05 
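
The threshold is applied to the adjusted p-value (`padj`), not the raw p-value. By default DESeq2 adjusts p-values with the Benjamini-Hochberg procedure, which base R exposes through `p.adjust`. The sketch below uses made-up p-values (an assumption for illustration only) to show how a raw p-value under 0.05 can stop being significant after adjustment.

```r
# Made-up raw p-values for five hypothetical peaks
raw_p <- c(0.001, 0.008, 0.039, 0.041, 0.20)

# Benjamini-Hochberg adjustment, the same method DESeq2 uses by default
adj_p <- p.adjust(raw_p, method = "BH")

# Compare each adjusted value against the tutorial's threshold
padjust_thresh <- 0.05
print(data.frame(raw_p, adj_p, significant = adj_p < padjust_thresh))
```

Four of the five raw p-values are below 0.05, but only the first two remain below the threshold after adjustment.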


In [72]:
#create a DESeq2 object with the data, metadata, and model information 
ddsMat=DESeqDataSetFromMatrix(countData=as.matrix(count_data),
                            colData=metadata,
                            design=~Strain+Media)
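
The design formula `~Strain+Media` tells DESeq2 to fit one coefficient per non-reference level of each factor. Base R's `model.matrix` shows how such a formula expands. This is a sketch on a hypothetical six-sample table; the factor levels are set explicitly here as an assumption, whereas the cell above calls `factor()` without levels, so R chooses the reference level by sort order.

```r
# Hypothetical sample table: every strain grown in both media
toy <- data.frame(
  Strain = factor(c("WT", "WT", "cln3", "cln3", "whi5", "whi5"),
                  levels = c("WT", "cln3", "whi5")),   # WT is the reference level
  Media  = factor(c("D", "E", "D", "E", "D", "E"),
                  levels = c("D", "E"))                # D is the reference level
)

# Expand the additive design into one column per fitted coefficient
design_matrix <- model.matrix(~ Strain + Media, data = toy)
print(colnames(design_matrix))
print(design_matrix)
```

Each column other than the intercept marks the samples that differ from the reference level of one factor, which is why the reference levels determine which comparisons appear directly as coefficients.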


In [73]:
#Run DESeq2 analysis 
dds<-DESeq(ddsMat)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing


In [74]:
#We can examine several contrasts in the resulting DESeq2 object
resultsNames(dds)

In [75]:
#Specify the contrasts we want to examine (we indicated these above)
deseq_contrasts=list(c("Media","D","E"),
                     c("Strain","WT","cln3"),
                     c("Strain","WT","whi5"),
                     c("Strain","cln3","whi5"))
contrast_names=c("Media_SCD_vs_SCE",
        "Strain_WT_vs_cln3",
        "Strain_WT_vs_whi5",
        "Strain_cln3_vs_whi5")



In [86]:
#Query the DESeq2 results to find differential peaks for each contrast, using our padjust_thresh value.
for(contrast_index in seq_along(deseq_contrasts))
{
        comparison_name=contrast_names[contrast_index]
        print(comparison_name)
        #each contrast is c(factor, numerator level, denominator level)
        ds=results(dds,
           contrast=deseq_contrasts[[contrast_index]])
        print(ds)
        #write entries for all peaks
        write.table(ds,file=paste(comparison_name,".txt",sep=""),quote=FALSE,row.names=TRUE,col.names=TRUE,sep='\t')
    
        #subset the peak set to just the differential peaks 
        #(na.omit drops peaks with NA in any column, e.g. a padj of NA)
        ds=na.omit(ds)
        sig=ds[ds$padj<padjust_thresh,] 
        peaks_sig=rownames(sig)
        #head() does not auto-print inside a for loop, so print it explicitly
        print(head(peaks_sig))
        write.table(peaks_sig,
                    file=paste(comparison_name,".differential.txt",sep=""),
                    quote=FALSE,row.names=FALSE,col.names=FALSE,sep='\t')
}


[1] "Media_SCD_vs_SCE"
log2 fold change (MAP): Media D vs E 
Wald test p-value: Media D vs E 
DataFrame with 3455 rows and 6 columns
                        baseMean log2FoldChange     lfcSE       stat
                       <numeric>      <numeric> <numeric>  <numeric>
chrI\t0\t781            71.42476     0.36254265 0.1910333  1.8977984
chrI\t6332\t6549       391.51930    -0.09034538 0.1225859 -0.7369964
chrI\t9138\t9609       103.12211     0.17084819 0.1522454  1.1221895
chrI\t20611\t21197     116.90361     0.45614217 0.2069841  2.2037546
chrI\t28155\t29092      17.91868     0.35457978 0.2693864  1.3162496
...                          ...            ...       ...        ...
chrXVI\t920499\t921897 208.25318     -0.8976245 0.1150745 -7.8003770
chrXVI\t927333\t928679 276.36635      0.1132156 0.1911494  0.5922889
chrXVI\t930437\t931270  66.09052      0.3222262 0.2464723  1.3073527
chrXVI\t938878\t939160  19.37689     -0.1968364 0.2324888 -0.8466491
chrXVI\t942430\t942789  29.97553      0
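
The header line "Media D vs E" indicates the direction of the comparison: for a contrast such as `c("Media","D","E")`, a positive `log2FoldChange` means the peak has more reads in D than in E. The sketch below filters a toy results table by `padj` and then splits the significant peaks by direction. The numbers are made up, and the `up_peaks` and `down_peaks` names are hypothetical rather than part of the tutorial.

```r
# Toy results table with made-up values; row names follow the tutorial's peak naming
toy_res <- data.frame(
  log2FoldChange = c(1.2, -0.9, 0.1, 2.0),
  padj           = c(0.001, 0.03, 0.60, NA),
  row.names      = c("chrI\t0\t781", "chrI\t6332\t6549",
                     "chrI\t9138\t9609", "chrI\t20611\t21197")
)

padjust_thresh <- 0.05

# Keep peaks with a defined padj below the threshold (NA never counts as significant)
keep <- !is.na(toy_res$padj) & toy_res$padj < padjust_thresh
sig <- toy_res[keep, ]

# Split the significant peaks by the sign of the fold change
up_peaks   <- rownames(sig)[sig$log2FoldChange > 0]   # higher in the first level
down_peaks <- rownames(sig)[sig$log2FoldChange < 0]   # higher in the second level
print(up_peaks)
print(down_peaks)
```

The last toy peak has a large fold change but a `padj` of `NA`, so it is excluded, which mirrors what `na.omit` does in the loop above.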

This code will generate 4 pairs of files: 

* Media_SCD_vs_SCE.txt  
* Media_SCD_vs_SCE.differential.txt  

* Strain_WT_vs_cln3.txt  
* Strain_WT_vs_cln3.differential.txt

* Strain_WT_vs_whi5.txt  
* Strain_WT_vs_whi5.differential.txt  

* Strain_cln3_vs_whi5.txt
* Strain_cln3_vs_whi5.differential.txt

The first file in each pair is the raw output from DESeq2 for all peaks. We will not have time to discuss everything in this file, but feel free to read the DESeq2 manual and see if you can understand it. The second file lists the differentially open ATAC-seq peaks, one per line, each identified by its coordinates (Chrom, Start, End). The adjusted p-value cutoff that we use for differential openness is 0.05. You can examine the content of these files with the following commands: 