# 3.3 Calling differentially expressed peaks with DESeq2

### IMPORTANT: Please make sure that you are using the R kernel to run this notebook. ###


In this tutorial, we will focus on calling differential peaks: 
![Analysis pipeline](images/part4.png)

## Missing R packages 

If a script in this section fails with an error saying that the gplots package has not been installed, you can install the package locally by running the **3.5 Install R packages** notebook.

## Running DESeq

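[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) uses read count data, such as our matrix **all.readcount.txt**, to estimate differential signal across the conditions specified in a metadata file. DESeq2 was designed for differential gene expression, but here each row of the matrix is an ATAC-seq peak, so a significant result means differential openness of that peak. We run DESeq2 with 4 comparisons (which we call "contrasts"):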
* Media:
    * glucose (D) vs ethanol (E)
* Strain:
    * WT vs cln3
    * WT vs whi5
    * cln3 vs whi5
   

In [None]:
#change to your working directory 
username="ubuntu"
setwd(paste("/srv/scratch/training_camp/work/",username,sep=""))

In [None]:
#load the DESeq2 library
library(DESeq2,quietly = TRUE)


In [None]:
#We read in the counts data matrix and the metadata matrix in the same manner as we did in tutorial 3.1 
#load the read count matrix
count_data=read.table("all.readcount.txt",header=TRUE)
rownames(count_data)=paste(count_data$Chrom,count_data$Start,count_data$End,sep='\t')
#remove the columns we will not use 
count_data$Chrom=NULL
count_data$Start=NULL
count_data$End=NULL
count_data$ID=NULL
head(count_data)



In [None]:
metadata=read.table("/srv/scratch/training_camp/metadata/TC2017_samples.tsv",header=TRUE)
#We use the "factor" function to tell R which variables are categorical rather than continuous 
metadata$Strain=factor(metadata$Strain)
metadata$Media=factor(metadata$Media)
#we don't need the other metadata columns for this analysis 
metadata$Sample=NULL
metadata$Researcher=NULL
metadata$Replicate=NULL
rownames(metadata)=metadata$ID
metadata$ID=NULL
#make sure the rows in metadata match the order of the columns in count_data 
metadata=metadata[names(count_data),]
metadata
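
Reordering the metadata by sample ID only works if every column of the count matrix has a matching metadata row; indexing a data frame by a missing row name silently produces a row of `NA` values rather than an error. A minimal sketch of a sanity check, using small made-up tables (`toy_counts`, `toy_meta`, and the sample IDs `s1` to `s3` are hypothetical stand-ins, not the tutorial's data):

```r
# Toy stand-ins for count_data and metadata (hypothetical sample IDs).
toy_counts <- data.frame(s1 = c(10, 0), s2 = c(3, 7), s3 = c(5, 5))
toy_meta <- data.frame(Strain = factor(c("WT", "cln3", "WT")),
                       Media = factor(c("D", "E", "E")),
                       row.names = c("s3", "s1", "s2"))

# Check that every count column has a metadata row before reordering.
missing_ids <- setdiff(names(toy_counts), rownames(toy_meta))
stopifnot(length(missing_ids) == 0)

# Reorder metadata rows to follow the count matrix columns.
toy_meta <- toy_meta[names(toy_counts), ]

# After reordering, the two orderings should be identical.
stopifnot(identical(rownames(toy_meta), names(toy_counts)))
toy_meta
```

The same two `stopifnot` checks could be applied to `metadata` and `count_data` before building the DESeq2 object.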

In [None]:
#Set the adjusted p-value threshold for calling a peak differential 
padjust_thresh=0.05 
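
The threshold is applied to the adjusted p-value (`padj`), not the raw p-value. By default DESeq2 adjusts with the Benjamini-Hochberg procedure, which base R also offers through `p.adjust`. The sketch below uses made-up p-values (`p_raw` is hypothetical) to show how the adjustment changes which tests pass a 0.05 cutoff. DESeq2 additionally filters out some low-count rows before adjusting, so its `padj` column will not in general equal `p.adjust` applied to every raw p-value.

```r
# Hypothetical raw p-values for five peaks.
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.20)

# Benjamini-Hochberg adjustment, the default method in DESeq2.
p_bh <- p.adjust(p_raw, method = "BH")

# Compare how many peaks pass the cutoff before and after adjustment.
n_raw_sig <- sum(p_raw < 0.05)
n_bh_sig <- sum(p_bh < 0.05)
print(data.frame(p_raw, p_bh))
print(c(raw = n_raw_sig, adjusted = n_bh_sig))
```

Fewer peaks pass after adjustment, because the adjusted values account for the number of tests performed.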


In [None]:
#create a DESeq2 object with the data, metadata, and model information 
ddsMat=DESeqDataSetFromMatrix(countData=as.matrix(count_data),
                            colData=metadata,
                            design=~Strain+Media)


In [None]:
#Run DESeq2 analysis 
dds<-DESeq(ddsMat)

In [None]:
#We can examine several contrasts in the resulting DESeq2 object
resultsNames(dds)
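
The coefficient names listed by `resultsNames` depend on which level of each factor R treats as the reference, which is the first level. `factor()` orders levels by sorting the values, so the reference is not necessarily the wild type. Because this tutorial passes the numerator and denominator explicitly through `contrast`, the choice of reference should not change the comparisons that follow, but it can be set explicitly if you want the coefficient names to read "X vs WT". A minimal sketch on a toy vector (`strain` is a hypothetical stand-in for `metadata$Strain`):

```r
# Toy strain labels (hypothetical stand-in for metadata$Strain).
strain <- factor(c("WT", "cln3", "whi5", "WT"))

# Make WT the reference (first) level explicitly.
strain <- relevel(strain, ref = "WT")
levels(strain)
```

If used on the real metadata, the releveling would need to happen before the call to `DESeqDataSetFromMatrix`.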

In [None]:
#Specify the contrasts we want to examine (we indicated these above)
deseq_contrasts=list(c("Media","D","E"),
                     c("Strain","WT","cln3"),
                     c("Strain","WT","whi5"),
                     c("Strain","cln3","whi5"))
contrast_names=c("Media_SCD_vs_SCE",
        "Strain_WT_vs_cln3",
        "Strain_WT_vs_whi5",
        "Strain_cln3_vs_whi5")



In [None]:
#Query the DESeq2 results to find differential peaks for each contrast, using our padjust_thresh value.
for(contrast_index in seq_along(contrast_names))
{
        comparison_name=contrast_names[contrast_index]
        print(comparison_name)
        #log2FoldChange is reported as log2(second level / third level), e.g. D over E for the Media contrast
        ds=results(dds,
           contrast=deseq_contrasts[[contrast_index]])
        print(ds)
        #write entries for all peaks
        write.table(as.data.frame(ds),file=paste(comparison_name,".txt",sep=""),quote=FALSE,row.names=TRUE,col.names=TRUE,sep='\t')
    
        #subset the peak set to just the differential peaks 
        #(padj can be NA for some peaks, so drop incomplete rows first)
        ds=na.omit(ds)
        sig=ds[ds$padj<padjust_thresh,] 
        peaks_sig=rownames(sig)
        #inside a loop, values are only displayed when printed explicitly
        print(head(peaks_sig))
        write.table(peaks_sig,
                    file=paste(comparison_name,".differential.txt",sep=""),
                    quote=FALSE,row.names=FALSE,col.names=FALSE,sep='\t')
}
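
Two details of the loop are easy to miss: why rows with `NA` are dropped before thresholding, and which direction a fold change points. The sketch below illustrates both on a small made-up results table (`toy_res` and the peak names are hypothetical; a real DESeq2 results table has more columns). For a contrast such as `c("Media","D","E")`, a positive `log2FoldChange` means higher signal in `D` than in `E`.

```r
# Hypothetical results table with one missing adjusted p-value.
toy_res <- data.frame(
  log2FoldChange = c(2.1, -1.4, 0.3, 3.0),
  padj = c(0.001, 0.02, 0.60, NA),
  row.names = c("peak1", "peak2", "peak3", "peak4")
)
padjust_thresh <- 0.05

# Without removing NA first, the comparison yields NA for peak4 and the
# subset picks up an extra all-NA row.
with_na <- toy_res[toy_res$padj < padjust_thresh, ]

# Dropping incomplete rows first gives a clean subset.
clean <- na.omit(toy_res)
sig <- clean[clean$padj < padjust_thresh, ]

# Split significant peaks by direction of change.
up <- rownames(sig)[sig$log2FoldChange > 0]    # higher in the numerator level
down <- rownames(sig)[sig$log2FoldChange < 0]  # higher in the denominator level
print(nrow(with_na))
print(up)
print(down)
```

The tutorial writes all significant peaks to a single file per contrast; splitting by sign as above is an optional extra step if the direction of the change matters for your question.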


This code will generate 4 pairs of files: 

* Media_SCD_vs_SCE.txt  
* Media_SCD_vs_SCE.differential.txt  


* Strain_WT_vs_cln3.txt  
* Strain_WT_vs_cln3.differential.txt  


* Strain_WT_vs_whi5.txt  
* Strain_WT_vs_whi5.differential.txt  


* Strain_cln3_vs_whi5.txt  
* Strain_cln3_vs_whi5.differential.txt  

The first file in each pair is the raw DESeq2 output for all peaks. We will not have time to discuss everything in this file, but feel free to read the DESeq2 manual and see if you can understand it. The second file contains a list of the IDs of the differentially open ATAC-seq peaks, one per line. The cutoff that we use for differential openness is an adjusted p-value (padj) below 0.05. You can examine the content of these files with the following commands: 