# Import count data using basic tools

We want to read in the count files using more basic tools (important if the count files were not generated by htseq-count). We will import the files as matrices. We start with some preliminary steps.

In [16]:
library(tools)

In [17]:
phfile="/home/owzar001/CURRENT/summercourse-2015/Data/sampletable.txt"
cntdir="/home/owzar001/CURRENT/summercourse-2015/Data/COUNTS"

In [18]:
md5sum(phfile)
phdata=read.table(phfile,sep=",",stringsAsFactors=FALSE)

In [19]:
colnames(phdata)=c("filename","sampid","trt")
phdata[["md5sum"]]=md5sum(file.path(cntdir,phdata[["filename"]]))
phdata

| | filename | sampid | trt | md5sum |
|---|---|---|---|---|
| 1 | AGTCAA_counts.tsv | AGTCAA | 0 | a9eaa959aba1b02b3831583c2a9751c8 |
| 2 | AGTTCC_counts.tsv | AGTTCC | 0 | 4183767e4eeb75dc582bcf438af13500 |
| 3 | ATGTCA_counts.tsv | ATGTCA | 0 | 26fbba06520758e5a3acd9bd432ebed4 |
| 4 | CCGTCC_counts.tsv | CCGTCC | 1 | 50036a88fd48645f740a31f4f4352cfb |
| 5 | GTCCGC_counts.tsv | GTCCGC | 1 | bb1cecd886127159157e9431d072cad5 |
| 6 | GTGAAA_counts.tsv | GTGAAA | 1 | fa544c0a076eedb54937c7189f4e1fbc |
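The checksums stored in the md5sum column are only useful if we compare them against something later. The following is a minimal sketch of that idea, using a temporary file with made-up contents as a stand-in for a count file (the file and the variable names are assumptions for illustration, not part of the course data).

```r
library(tools)

## Write a small temporary file (hypothetical stand-in for a count file)
tmpfile <- tempfile(fileext = ".tsv")
writeLines(c("GeneID:1\t10", "GeneID:2\t20"), tmpfile)

## Record the checksum once (md5sum returns a named vector; drop the name)
recorded <- unname(md5sum(tmpfile))

## Later: recompute and compare to confirm the file has not changed
unchanged <- identical(recorded, unname(md5sum(tmpfile)))
unchanged

## Append a line: the recomputed checksum should no longer match
cat("GeneID:3\t30\n", file = tmpfile, append = TRUE)
changed <- !identical(recorded, unname(md5sum(tmpfile)))
changed
```

Storing the checksum next to each file name, as done above, makes it possible to run this kind of comparison whenever the analysis is repeated.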


In [20]:
phdata=phdata[c("sampid","filename","trt","md5sum")]
phdata

| | sampid | filename | trt | md5sum |
|---|---|---|---|---|
| 1 | AGTCAA | AGTCAA_counts.tsv | 0 | a9eaa959aba1b02b3831583c2a9751c8 |
| 2 | AGTTCC | AGTTCC_counts.tsv | 0 | 4183767e4eeb75dc582bcf438af13500 |
| 3 | ATGTCA | ATGTCA_counts.tsv | 0 | 26fbba06520758e5a3acd9bd432ebed4 |
| 4 | CCGTCC | CCGTCC_counts.tsv | 1 | 50036a88fd48645f740a31f4f4352cfb |
| 5 | GTCCGC | GTCCGC_counts.tsv | 1 | bb1cecd886127159157e9431d072cad5 |
| 6 | GTGAAA | GTGAAA_counts.tsv | 1 | fa544c0a076eedb54937c7189f4e1fbc |


In [21]:
phdata[["trt"]]=as.factor(phdata[["trt"]])
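Converting trt to a factor matters because the design formula used later treats it as a grouping variable, and under R's default contrasts the first factor level acts as the baseline. A small sketch with a made-up treatment vector (the names trt, trtf and trtf2 are illustrative only) shows how the levels are ordered and how the baseline could be changed if needed.

```r
## Made-up treatment codes mirroring the 0/1 coding in the sample table
trt <- c(0, 0, 0, 1, 1, 1)

## as.factor sorts the distinct values, so "0" becomes the first level
trtf <- as.factor(trt)
levels(trtf)

## relevel moves a chosen level to the front, making it the baseline
trtf2 <- relevel(trtf, ref = "1")
levels(trtf2)
```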

This function reads the data from one file and returns the result as a data frame (the conversion to a matrix happens later, after the files have been merged). It keeps only the rows whose gene id starts with a given prefix ("GeneID" by default).

In [7]:
readcnts=function(fname,sep="\t",prefix="GeneID",collab="V1",header=FALSE)
    {
        ### Import the text file as a data frame
        dat=read.table(fname,sep=sep,header=header,stringsAsFactors=FALSE)
        ### Only keep rows whose gene id (column collab) starts with the prefix
        dat=dat[substr(dat[[collab]],1,nchar(prefix))==prefix,]
        return(dat)
    }
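To see what the prefix filter does, here is a small sketch that runs the same function on a toy file. The file contents are made up; the two trailing rows imitate the summary counters (such as `__no_feature`) that htseq-count appends after the gene rows, which is the kind of line the filter is meant to drop.

```r
## Same function as above, repeated so that this sketch is self-contained
readcnts <- function(fname, sep = "\t", prefix = "GeneID", collab = "V1", header = FALSE) {
    dat <- read.table(fname, sep = sep, header = header, stringsAsFactors = FALSE)
    dat[substr(dat[[collab]], 1, nchar(prefix)) == prefix, ]
}

## Toy count file (made-up values) with two summary rows at the end
toyfile <- tempfile(fileext = ".tsv")
writeLines(c("GeneID:1\t5", "GeneID:2\t0", "__no_feature\t100", "__ambiguous\t7"), toyfile)

## Only the rows whose id starts with "GeneID" should survive
toy <- readcnts(toyfile)
toy
nrow(toy)
```

If the gene ids in another data set use a different naming scheme, the prefix argument can be changed accordingly.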


Let's read in a single file to check that the function works.

In [22]:
file1=readcnts(file.path(cntdir,"AGTCAA_counts.tsv"))

In [23]:
dim(file1)

In [24]:
head(file1)

| | V1 | V2 |
|---|---|---|
| 1 | GeneID:12930114 | 118 |
| 2 | GeneID:12930115 | 30 |
| 3 | GeneID:12930116 | 15 |
| 4 | GeneID:12930117 | 12 |
| 5 | GeneID:12930118 | 122 |
| 6 | GeneID:12930119 | 60 |


Now read in all the files and merge them, by gene id, into a single count matrix.

In [25]:
## Read in the first file
countdat=readcnts(file.path(cntdir,phdata$filename[1]))
### Assign the sample id as the column name
names(countdat)[2]=phdata$sampid[1]
### Repeat the last two steps for files 2 ... 6 and merge
### along each step
for(i in 2:nrow(phdata))
    {
        dat2=readcnts(file.path(cntdir,phdata$filename[i]))
        names(dat2)[2]=phdata$sampid[i]
        countdat=merge(countdat,dat2,by="V1")
    }

geneid=countdat$V1
countdat=as.matrix(countdat[,-1])
row.names(countdat)=geneid

In [26]:
head(countdat)

| | AGTCAA | AGTTCC | ATGTCA | CCGTCC | GTCCGC | GTGAAA |
|---|---|---|---|---|---|---|
| GeneID:12930114 | 118 | 137 | 149 | 120 | 161 | 174 |
| GeneID:12930115 | 30 | 42 | 25 | 18 | 32 | 34 |
| GeneID:12930116 | 15 | 55 | 37 | 49 | 36 | 27 |
| GeneID:12930117 | 12 | 12 | 13 | 11 | 7 | 6 |
| GeneID:12930118 | 122 | 137 | 94 | 48 | 131 | 69 |
| GeneID:12930119 | 60 | 88 | 78 | 53 | 43 | 29 |
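One point worth keeping in mind about the merge step: by default, merge keeps only the ids present in both tables, so a gene missing from any one file would silently disappear from the final matrix. The sketch below illustrates this with two made-up tables (the names a, b, S1 and S2 are hypothetical), and shows the all=TRUE alternative that keeps every id and fills the gaps with NA.

```r
## Two toy count tables; GeneID:2 is missing from the second one
a <- data.frame(V1 = c("GeneID:1", "GeneID:2", "GeneID:3"),
                S1 = c(5, 0, 2), stringsAsFactors = FALSE)
b <- data.frame(V1 = c("GeneID:1", "GeneID:3"),
                S2 = c(7, 1), stringsAsFactors = FALSE)

## Default merge: only ids found in both tables are kept
inner <- merge(a, b, by = "V1")
nrow(inner)

## all = TRUE: every id is kept, missing counts become NA
outer <- merge(a, b, by = "V1", all = TRUE)
nrow(outer)
outer
```

Comparing the number of rows before and after merging is a quick way to notice whether any genes were dropped.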


In [27]:
library(DESeq2)

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

    xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
    do.call, duplicated, eval, evalq, Filter, Find, get, intersect,
    is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
    pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int,
    rownames, sapply, setdiff, sort, table, tapply, union, unique,
    unlist, unsplit

Creating a generic function for ‘nchar’ from package ‘base’ in package ‘S4Vectors’
Loading required

In [28]:
dds=DESeqDataSetFromMatrix(countdat,DataFrame(phdata),design=~trt)

In [29]:
dds

class: DESeqDataSet 
dim: 4436 6 
exptData(0):
assays(1): counts
rownames(4436): GeneID:12930114 GeneID:12930115 ... GeneID:13406005
  GeneID:13406006
rowRanges metadata column names(0):
colnames(6): AGTCAA AGTTCC ... GTCCGC GTGAAA
colData names(4): sampid filename trt md5sum
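The construction above relies on the columns of the count matrix being in the same order as the rows of the sample table. Here that holds because the matrix was built by looping over the rows of phdata, but it is easy to check explicitly. The following base R sketch uses made-up objects (counts and samples are hypothetical names) to show the check and one way to realign the sample table if the order differs.

```r
## Toy count matrix with sample ids as column names
counts <- matrix(c(5, 0, 7, 1), nrow = 2,
                 dimnames = list(c("GeneID:1", "GeneID:2"), c("S1", "S2")))

## Toy sample table, deliberately listed in a different order
samples <- data.frame(sampid = c("S2", "S1"), trt = factor(c(1, 0)),
                      stringsAsFactors = FALSE)

## Check whether columns and rows line up
aligned_before <- identical(colnames(counts), samples$sampid)
aligned_before

## Reorder the sample table to follow the matrix columns, then re-check
samples <- samples[match(colnames(counts), samples$sampid), ]
aligned_after <- identical(colnames(counts), samples$sampid)
aligned_after
```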

In [23]:
sessionInfo()

R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] parallel  stats4    tools     stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
[1] DESeq2_1.8.1              RcppArmadillo_0.5.200.1.0
[3] Rcpp_0.11.6               GenomicRanges_1.20.5     
[5] GenomeInfoDb_1.4.1        IRanges_2.2.5            
[7] S4Vectors_0.6.2           BiocGenerics_0.14.0      

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3          
 [4] XVector_0.8.0        fut