### Requirements for DESeq

"The count values must be raw counts of sequencing reads. This is important for DESeq’s statistical model to hold,
as only the actual counts allow assessing the measurement precision correctly. Hence, please do do not supply other
quantities, such as (rounded) normalized counts, or counts of covered base pairs – this will only lead to nonsensical
results." https://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

Use this command to install DESeq2

In [40]:
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
suppressMessages(biocLite("DESeq2"))
suppressMessages(biocLite("SummarizedExperiment"))
suppressMessages(library("DESeq2"))

Bioconductor version 3.4 (BiocInstaller 1.24.0), ?biocLite for help
A new version of Bioconductor is available after installing the most recent
  version of R; see http://bioconductor.org/install


# Differential Expression
- Load in compendium features as read count
    - Use ids from compendium features of normalized TPM
- caculate using DESeq library
- output new train and test features in log2(TPM+1) with differential expressed genes 

# Load in compendium Features as TPM for IDs

In [1]:
library('tidyverse')
library('stringr')
suppressMessages(library("DESeq2"))

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


Load in data, get ids in compendium

Found diseases with at least one of both methods in cohortdiseasetissue
- ['acute lymphoblastic leukemia', 'acute myeloid leukemia', 'sarcoma', 'neuroblastoma', 'diffuse large B-cell lymphoma', 'glioblastoma multiforme', 'wilms tumor', 'fibrolamellar hepatocellular carcinoma', 'atypical teratoid/rhabdoid tumor', 'lymphoma', 'ependymoma']

In [15]:
diseases<- c('acute lymphoblastic leukemia', 'acute myeloid leukemia', 'sarcoma', 'neuroblastoma', 'diffuse large B-cell lymphoma', 'glioblastoma multiforme', 'wilms tumor', 'fibrolamellar hepatocellular carcinoma', 'atypical teratoid/rhabdoid tumor', 'lymphoma', 'ependymoma')

Make labels only use these diseases

In [26]:
labels<-labels %>% filter(disease %in% diseases)

In [27]:
compLabels <- read.csv("compendiumLabels.tsv", sep='\t')

labels <- compLabels

riboDGlio<-labels %>% filter(labels$tr_method == 'RiboMinus')
riboDth_ids <- riboDGlio$th_sampleid
polyAGlio<-labels %>% filter(labels$tr_method == 'PolyA')
polyAth_ids <- polyAGlio$th_sampleid

th_ids <- labels$th_sampleid

Find files for read count

In [28]:
compendiumTPM <- dir(path = "/data/archive/downstream", pattern = 'rsem_genes.results', all.files = TRUE,
           full.names = TRUE, recursive = TRUE,
           ignore.case = FALSE, include.dirs = TRUE, no.. = TRUE)

compendiumTPM_ids <- compendiumTPM %>% gsub('/data/archive/downstream/','',.) %>% gsub('/.*','',.)

readcountIds <- intersect(compendiumTPM_ids,th_ids)

# Create dataset of read count

readcountIds<- readcountIds[order(readcountIds)]

Read files

In [29]:
ptm <- proc.time()
locationList <- c()
tables<-c()
riboDs <- data.frame(c(seq(1,60498)))
polyAs <- data.frame(c(seq(1,60498)))
for (id in readcountIds){
    curdir <- dir(path = paste0("/data/archive/downstream/",id), pattern = 'rsem_genes.results', all.files = TRUE,
           full.names = TRUE, recursive = TRUE,
           ignore.case = FALSE, include.dirs = TRUE, no.. = TRUE)
    cur<-suppressMessages(read_tsv(curdir[length(curdir)]))
    if(length(cur) != 0){
        for (i in seq(1,length(riboDth_ids))){
            if(id == riboDth_ids[i]){
                riboDs[id]<-as.integer(round(cur$expected_count,0))
            }
        }
        for (i in seq(1,length(polyAth_ids))){
            if (id == polyAth_ids[i]){
                polyAs[id] <-as.integer(round(cur$expected_count,0))
            }
        }
    }
}

print(proc.time()-ptm)

    user   system  elapsed 
2944.988   27.992 2975.234 


In [None]:
write.csv(riboDs,'data/riboDs.csv')

In [None]:
write.csv(polyAs,'data/polyAs.csv')

Rename the rows and columns to be untreated = riboD and treated = polyA


In [30]:
polyAs <- cbind(gene_id=c(cur$gene_id), polyAs)
riboDs <- cbind(gene_id=c(cur$gene_id), riboDs)



names(riboDs) <- paste0('RiboMinus',names(riboDs))
names(polyAs) <- paste0('PolyA',names(polyAs))
names(riboDs)[1] <- 'gene_id'
names(polyAs)[1] <- 'gene_id'

Get all labels and treatment labelling in the same order, which is required by DESeq

In [31]:
riboDs <- riboDs[, order(names(riboDs),decreasing=TRUE)]
polyAs <- polyAs[, order(names(polyAs),decreasing=TRUE)]

bothMethodsReadCount <- dplyr::inner_join(riboDs, polyAs, by="gene_id")

bothMethodsReadCount <- bothMethodsReadCount[, order(names(bothMethodsReadCount),decreasing=TRUE)]

dim(bothMethodsReadCount[,-1])
readCountLabels <- filter(labels, th_sampleid %in% readcountIds)

readCountLabels <- readCountLabels %>% arrange(desc(th_sampleid)) 

rownames(readCountLabels) <- paste0(readCountLabels$tr_method,readCountLabels$th_sampleid)

# rownames(readCountLabels) <- sub("RiboMinus", "untreated",rownames(readCountLabels))
# rownames(readCountLabels) <- sub("PolyA", "treated",rownames(readCountLabels))

bothMethodsReadCount <- bothMethodsReadCount[, rownames(readCountLabels)]

rownames(bothMethodsReadCount) <- cur$gene_id

all(rownames(readCountLabels) == colnames(bothMethodsReadCount))

Create CountDataSet called dds

In [32]:
dds <- DESeqDataSetFromMatrix(bothMethodsReadCount,
                              colData = readCountLabels,
                              design= ~ tr_method)

In [33]:
dds

class: DESeqDataSet 
dim: 60498 938 
metadata(1): version
assays(1): counts
rownames(60498): ENSG00000000003.14 ENSG00000000005.5 ...
  ENSGR0000280767.2 ENSGR0000281849.2
rowData names(0):
colnames(938): PolyATHR31_0940_S01 PolyATHR31_0939_S01 ...
  PolyATARGET-40-0A4HX8-01A-01R PolyATARGET-40-0A4HMC-01A-01R
colData names(3): th_sampleid tr_method disease

# Prefiltering
- removing rows in which there are very few reads, 
- we reduce the memory size of the dds data object, 
- and we increase the speed of the transformation and testing functions within DESeq2. 

<br>Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.

In [34]:
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

In [None]:
dds <- DESeq(dds)
res <- results(dds)
# res

estimating size factors
estimating dispersions
gene-wise dispersion estimates


In [None]:
resultsNames(dds)

In [None]:
library("BiocParallel")
register(MulticoreParam(4))

In [None]:
resOrdered <- res[order(res$pvalue),]


In [None]:
resOrdered

In [None]:
summary(res)

In [None]:
ddsMF <- dds
levels(ddsMF$tr_method)


In [None]:
design(ddsMF) <- formula(~ tr_method + disease)
ddsMF <- DESeq(ddsMF)

In [None]:
resMF <- results(ddsMF)
head(resMF)