### Requirements for DESeq

"The count values must be raw counts of sequencing reads. This is important for DESeq’s statistical model to hold,
as only the actual counts allow assessing the measurement precision correctly. Hence, please do not supply other
quantities, such as (rounded) normalized counts, or counts of covered base pairs – this will only lead to nonsensical
results." https://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
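
A quick sanity check along these lines can catch normalized or fractional values before they reach DESeq. This is only a minimal sketch in base R: `is_raw_counts` is a hypothetical helper name (not part of DESeq), and the two matrices are made-up toy data.

```r
# Hypothetical helper: TRUE if every entry looks like a raw read count
# (numeric, finite, non-negative, whole number).
is_raw_counts <- function(m) {
  m <- as.matrix(m)
  is.numeric(m) && all(is.finite(m)) && all(m >= 0) && all(m == round(m))
}

raw  <- matrix(c(0, 12, 7, 150), nrow = 2)        # whole-number counts
norm <- matrix(c(0.0, 3.7, 1.2, 48.9), nrow = 2)  # TPM-like, fractional values

is_raw_counts(raw)   # TRUE
is_raw_counts(norm)  # FALSE
```

Note that RSEM's `expected_count` column can hold fractional values, so a check like this may flag it; how to handle that is a separate decision.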

# Differential Expression
- Load the glioblastoma features as read counts
    - Use the sample IDs from the normalized TPM glioblastoma features
- Calculate differential expression with the DESeq library
- Output new train and test features in log2(TPM+1) for the differentially expressed genes
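
The last step of this plan can be sketched as follows. The matrix values, gene names, and the `de_genes` vector are all made up for illustration; in the real workflow `de_genes` would come from the DESeq results.

```r
# Toy TPM matrix: rows are genes, columns are samples (values are made up).
tpm <- matrix(c(0, 1, 3, 7, 15, 1023), nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# The +1 pseudocount keeps zero-expression genes at 0 instead of -Inf.
log_tpm <- log2(tpm + 1)

# Hypothetical list of differentially expressed genes; keep only those rows.
de_genes <- c("geneA", "geneC")
features <- log_tpm[de_genes, , drop = FALSE]
features
```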

# Load in glioblastoma Features as TPM for IDs

In [2]:
library('tidyverse')
library('stringr')

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


In [3]:
labels <- read.csv("data/glioblastomaLabels.csv")

In [4]:
# Split the label table by library preparation method and keep the sample IDs.
riboDGlio <- labels %>% filter(tr_method == 'RiboMinus')
riboDth_ids <- riboDGlio$th_sampleid
polyAGlio <- labels %>% filter(tr_method == 'PolyA')
polyAth_ids <- polyAGlio$th_sampleid


In [6]:
a <- read.csv('data/glioblastomaExpression.csv')

In [7]:
glio_th_ids <- tail(colnames(a),-1)

Find the RSEM result files that contain the read counts

In [8]:
compendiumTPM <- dir(path = "/data/archive/downstream", pattern = 'rsem_genes.results', all.files = TRUE,
           full.names = TRUE, recursive = TRUE,
           ignore.case = FALSE, include.dirs = TRUE, no.. = TRUE)

In [9]:
compendiumTPM_ids <- compendiumTPM %>% gsub('/data/archive/downstream/','',.) %>% gsub('/.*','',.)
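
To make the ID extraction above concrete, here is a small base-R sketch of the same two-step substitution. The paths and sample names are hypothetical; they are only shaped like what `dir()` would return under `/data/archive/downstream`.

```r
# Hypothetical paths shaped like the ones dir() returns above.
paths <- c("/data/archive/downstream/SAMPLE_001/secondary/rsem_genes.results",
           "/data/archive/downstream/SAMPLE_002/rsem_genes.results")

# Drop the common prefix, then drop everything from the first remaining slash,
# which leaves the top-level directory name (the sample ID).
ids <- sub("/.*", "", sub("^/data/archive/downstream/", "", paths))
ids
```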

In [10]:
readcountIds <- intersect(compendiumTPM_ids,glio_th_ids)

In [11]:
length(readcountIds)

Build the read-count dataset

In [15]:
ptm <- proc.time()
# Collect expected counts per sample, keyed by sample ID, so that neither list
# ends up with empty slots and the columns keep their sample names.
riboDs <- list()
polyAs <- list()
for (id in readcountIds){
    curdir <- dir(path = paste0("/data/archive/downstream/",id), pattern = 'rsem_genes.results', all.files = TRUE,
           full.names = TRUE, recursive = TRUE,
           ignore.case = FALSE, include.dirs = TRUE, no.. = TRUE)
    # If several result files match, use the last one listed.
    cur <- suppressMessages(read_tsv(curdir[length(curdir)]))
    if (id %in% riboDth_ids){
        riboDs[[id]] <- cur$expected_count
    }
    if (id %in% polyAth_ids){
        polyAs[[id]] <- cur$expected_count
    }
}

print(proc.time()-ptm)

   user  system elapsed 
 50.924   1.108  52.093 


In [16]:
polyAReadCount <- do.call(cbind, polyAs)

In [18]:
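# polyAReadCount is a matrix, so `$<-` would coerce it to a list
# (R warns "Coercing LHS to a list"); store the gene IDs as row names instead.
# This assumes every result file lists the genes in the same order.
rownames(polyAReadCount) <- cur$gene_id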

In [None]:
polyAReadCount
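
Before the counts can go to DESeq, the per-protocol matrices need to be combined into one genes-by-samples matrix with a matching condition label per column. The sketch below uses made-up toy matrices and assumes the comparison of interest is PolyA versus RiboMinus, which the notebook has not stated explicitly; `polyA`, `riboD`, `counts`, and `condition` are illustrative names.

```r
# Toy stand-ins for the two per-protocol count matrices (values are made up).
genes <- c("geneA", "geneB", "geneC")
polyA <- matrix(c(10, 0, 5, 12, 1, 7), nrow = 3,
                dimnames = list(genes, c("p1", "p2")))
riboD <- matrix(c(3, 8, 5, 2, 9, 4), nrow = 3,
                dimnames = list(genes, c("r1", "r2")))

# One genes-by-samples matrix, stored as integers as raw counts should be.
counts <- cbind(polyA, riboD)
storage.mode(counts) <- "integer"

# One condition label per column, in the same order as the columns.
condition <- factor(c(rep("PolyA", ncol(polyA)),
                      rep("RiboMinus", ncol(riboD))))

stopifnot(length(condition) == ncol(counts))
table(condition)
```

A matrix and factor of this shape are what DESeq's `newCountDataSet(countData, conditions)` takes as input.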

In [4]:
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq")

Bioconductor version 3.4 (BiocInstaller 1.24.0), ?biocLite for help
A new version of Bioconductor is available after installing the most recent
  version of R; see http://bioconductor.org/install
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.4 (BiocInstaller 1.24.0), R 3.3.2 (2016-10-31).
Installing package(s) ‘DESeq’
also installing the dependencies ‘annotate’, ‘locfit’, ‘genefilter’, ‘geneplotter’

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Old packages: 'assertthat', 'backports', 'BH', 'boot', 'broom', 'car', 'caret',
  'cluster', 'colorspace', 'crayon', 'curl', 'data.table', 'DBI', 'devtools',
  'digest', 'dplyr', 'evaluate', 'forcats', 'foreach', 'forecast', 'foreign',
  'ggplot2', 'git2r', 'haven', 'highr', 'hms', 'htmltools', 'httpuv', 'httr',
  'IRdisplay', 'iterators', 'jsonlite', 'knitr', 'lattice', 'lazyeval', 'lme4',
  'lubridate', 'markdown', 'MASS', 'Matrix', 'memoise', 'mgcv', 'modelr',
  'munsell', 'nycflights13', 'openssl