# Set up environment

In [2]:
source("pilot_config.R")

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Attaching package: ‘foreach’

The following objects are masked from ‘package:purrr’:

    accumulate, when

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following o

In [3]:
# List all STAR output files in DATDIR
stardirs <- list.files(DATDIR)

# Read in the count data output from STAR

There are 204 files in this folder. Each file was generated from a fastq file using STAR.

In [5]:
length(stardirs)

In [6]:
head(stardirs)
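If the STAR output folder ever holds other files alongside the count tables, the listing can be restricted by filename. The following is a minimal sketch under that assumption; the directory `datdir_toy` and the file names in it are hypothetical and are created in a temporary folder only for illustration.

```r
# Hypothetical folder with two count files and one unrelated log file
datdir_toy <- file.path(tempdir(), "star_out_toy")
dir.create(datdir_toy, showWarnings = FALSE)
invisible(file.create(file.path(
    datdir_toy,
    c("s1_ReadsPerGene.out.tab", "s2_ReadsPerGene.out.tab", "s1_Log.final.out"))))

# Keep only files whose names end in "ReadsPerGene.out.tab"
cntfiles <- list.files(datdir_toy, pattern = "ReadsPerGene\\.out\\.tab$")
length(cntfiles)
```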

View the head of one file (1_MA_J_S18_L001_ReadsPerGene.out.tab). Notice that there are four columns; the ones we want are the **first** and the **fourth**.

In [4]:
cmdstr <- paste("head", file.path(DATDIR, stardirs[1]))
cmdout <- system(cmdstr, intern = TRUE)
str_split(cmdout, pattern = "\t")
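According to the STAR documentation for `--quantMode GeneCounts`, the four columns are the gene ID, the counts for unstranded data, the counts for reads on the first-read strand, and the counts for reads on the second-read strand. Taking the fourth column would fit a reverse-stranded library, although this notebook does not state the library type. The sketch below parses a small made-up table with the same layout; the column names `unstranded`, `strand1`, and `strand2` are labels chosen here, not names used by STAR.

```r
# Toy stand-in for a ReadsPerGene.out.tab file; all counts are made up
toy <- paste(
    "N_unmapped\t10\t10\t10",
    "N_multimapping\t5\t5\t5",
    "N_noFeature\t3\t40\t4",
    "N_ambiguous\t1\t0\t1",
    "geneA\t20\t1\t19",
    "geneB\t7\t0\t7",
    sep = "\n")

# Read the tab-separated text; the file has no header row
cnt <- read.delim(text = toy, header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("gene", "unstranded", "strand1", "strand2"))

# Keep the first and fourth columns only
cnt14 <- cnt[, c(1, 4)]
cnt14
```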

### Construct a matrix that gathers all the count files

Helper functions to read the count files

In [5]:
mycombine <- function(df1, df2) {
    # Combine two data frames by gene names
    #
    # Args:
    #   df1 (Dataframe): the first count data
    #   df2 (Dataframe): the second count data
    #
    # Returns:
    #   (Dataframe) The combined data frame of df1 and df2
    full_join(df1, df2, by = "gene")
}

myfile <- function(filedir, filename) {
    # Build the full path to a file
    #
    # Args:
    #   filedir  (Character): the directory that contains the file
    #   filename (Character): the filename
    #
    # Returns:
    #   (Character) the path to the input file
    file.path(filedir, filename)
}

# Data type for each column
coltypes <- list(col_character(), col_integer(), col_integer(), col_integer())

Read the count files and combine them

In [6]:
out <- foreach(stardir = stardirs, .combine = mycombine) %do% {
    
    # build the path to each count file
    cntfile <- myfile(DATDIR, stardir)
    
    # read in the count file
    readr::read_tsv(cntfile, col_names = FALSE, col_types = coltypes) %>%
        dplyr::select(X1, X4) %>% # get the 1st and 4th columns
            dplyr::rename_(.dots=setNames(names(.), c("gene",stardir)))
}
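With a custom two-argument `.combine` function, `foreach` folds the per-file results together pairwise, much as `Reduce` does. The sketch below shows the same idea in base R on made-up data, with `merge(..., all = TRUE)` standing in for `full_join`. The sample names and counts are hypothetical; a gene missing from one sample ends up as `NA` in that sample's column.

```r
# Made-up per-sample count tables; sampleB lacks g3 and has g4 instead
files <- list(
    sampleA = data.frame(gene = c("g1", "g2", "g3"), count = c(5L, 0L, 2L),
                         stringsAsFactors = FALSE),
    sampleB = data.frame(gene = c("g1", "g2", "g4"), count = c(7L, 1L, 9L),
                         stringsAsFactors = FALSE),
    sampleC = data.frame(gene = c("g1", "g2", "g3"), count = c(4L, 3L, 0L),
                         stringsAsFactors = FALSE))

# Name each count column after its sample, as the loop above does
per_sample <- lapply(names(files), function(nm) {
    df <- files[[nm]]
    names(df) <- c("gene", nm)
    df
})

# Fold the tables together with a full outer join on "gene"
combined <- Reduce(function(df1, df2) merge(df1, df2, by = "gene", all = TRUE),
                   per_sample)
combined
```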

There are 205 columns (204 count files + 1 column of gene names)

In [10]:
dim(out)

In [11]:
out[1:6, 1:6]

gene,1_MA_J_S18_L001_ReadsPerGene.out.tab,1_MA_J_S18_L002_ReadsPerGene.out.tab,1_MA_J_S18_L003_ReadsPerGene.out.tab,1_MA_J_S18_L004_ReadsPerGene.out.tab,1_RZ_J_S26_L001_ReadsPerGene.out.tab
N_unmapped,2690,2684,2672,2585,7218
N_multimapping,66100,65234,66538,65066,395848
N_noFeature,20347,20004,20549,20505,768146
N_ambiguous,647,652,697,616,1431
CNAG_04548,0,0,0,1,0
CNAG_07303,0,0,0,0,0


# Arrange the results from the count files

### Separate the first four rows (-> nmisc) and others (-> genecounts)

Our count matrix contains both summary counts (the first four rows) and per-gene counts for each sample. Note that `out` is in biological format (i.e. the samples are the columns and the variables are the rows). Let's split it into two data frames: `nmisc` and `genecounts`.

For `nmisc`, we take the first 4 rows of `out`, since those are the summary rows. Next, we transform the data frame into statistical format (the samples are the rows and the feature types are the columns). Using a combination of `gather` and `spread`, we can transpose the data frame into the desired format.

In [8]:
### Gather and spread the first four rows
out %>%
    dplyr::slice(1:4) %>%
    gather(expid, value, -gene) %>% 
    spread(gene, value) %>%
    rename_(.dots = setNames(names(.), c("expid", "namb", "nmulti", "nnofeat","nunmap"))) ->
    nmisc

In [12]:
nmisc %>% head

expid,namb,nmulti,nnofeat,nunmap
1_MA_J_S18_L001_ReadsPerGene.out.tab,647,66100,20347,2690
1_MA_J_S18_L002_ReadsPerGene.out.tab,652,65234,20004,2684
1_MA_J_S18_L003_ReadsPerGene.out.tab,697,66538,20549,2672
1_MA_J_S18_L004_ReadsPerGene.out.tab,616,65066,20505,2585
1_RZ_J_S26_L001_ReadsPerGene.out.tab,1431,395848,768146,7218
1_RZ_J_S26_L002_ReadsPerGene.out.tab,1337,388079,755654,7022
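The new names in the `rename_` call are listed in alphabetical order of the original row labels (`N_ambiguous`, `N_multimapping`, `N_noFeature`, `N_unmapped`) because `spread` orders the new columns by key; the values in the table above confirm that mapping. As an alternative that does not depend on column order, here is a base-R sketch that transposes the four rows and renames by label. The counts are the first two samples from the tables above; the short sample names `s1` and `s2` and the `lookup` vector are introduced here for illustration.

```r
# First four rows of the combined table for two samples (short names assumed)
out_toy <- data.frame(
    gene = c("N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous"),
    s1 = c(2690L, 66100L, 20347L, 647L),
    s2 = c(2684L, 65234L, 20004L, 652L),
    stringsAsFactors = FALSE)

# Transpose so that samples become rows and count types become columns
counts <- t(as.matrix(out_toy[, -1]))
colnames(counts) <- out_toy$gene

nmisc_toy <- data.frame(expid = rownames(counts), counts,
                        row.names = NULL, check.names = FALSE,
                        stringsAsFactors = FALSE)

# Rename by label rather than by position
lookup <- c(N_ambiguous = "namb", N_multimapping = "nmulti",
            N_noFeature = "nnofeat", N_unmapped = "nunmap")
names(nmisc_toy)[-1] <- unname(lookup[names(nmisc_toy)[-1]])
nmisc_toy
```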


To obtain the counts for specific genes, we will use the rest of our `out` matrix since it contains the gene counts. However, we still want to transform the data frame into statistical format, which we will accomplish using gather and spread.

In [9]:
### Gather and spread the genes to get a count matrix
out %>%
    dplyr::slice(-(1:4)) %>%
    gather(expid, value, -gene) %>% 
    spread(gene, value) -> genecounts

In [19]:
genecounts[1:6,1:6]

expid,CNAG_00001,CNAG_00002,CNAG_00003,CNAG_00004,CNAG_00005
1_MA_J_S18_L001_ReadsPerGene.out.tab,0,66,38,74,33
1_MA_J_S18_L002_ReadsPerGene.out.tab,0,59,25,79,25
1_MA_J_S18_L003_ReadsPerGene.out.tab,0,74,27,79,32
1_MA_J_S18_L004_ReadsPerGene.out.tab,0,66,22,69,24
1_RZ_J_S26_L001_ReadsPerGene.out.tab,0,50,16,51,26
1_RZ_J_S26_L002_ReadsPerGene.out.tab,0,45,7,51,31


### For each sample, sum up all the counts

We can create a variable holding the total number of reads assigned to genes for each sample by summing across each row.

In [20]:
### Sum across the rows to get the total gene-mapped read count per sample
genecounts %>%    
    mutate(ngenemap = rowSums(.[-1])) %>%
    select(expid, ngenemap) -> ngene
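The row-sum step can be illustrated on a small made-up table. This is a base-R sketch with hypothetical sample and gene names. One point worth noting: if a gene were missing from some count file, the full join would leave an `NA` there, and `rowSums` returns `NA` for that sample unless `na.rm = TRUE` is passed.

```r
# Made-up counts: two samples, three genes
genecounts_toy <- data.frame(
    expid = c("s1", "s2"),
    geneA = c(0L, 3L),
    geneB = c(66L, 59L),
    geneC = c(38L, 25L),
    stringsAsFactors = FALSE)

# Drop the id column, then sum the remaining columns within each row
ngene_toy <- data.frame(
    expid = genecounts_toy$expid,
    ngenemap = unname(rowSums(genecounts_toy[-1])),
    stringsAsFactors = FALSE)
ngene_toy
```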

### Summarize the results

We will create a comprehensive data frame, `mapresults`, that combines `ngene` with `nmisc`. It holds the mapping summary counts along with the total depth and several derived proportions.

In [21]:
### Merge in the 4 misc counts and add summaries
ngene %>%
    full_join(nmisc, by = "expid") %>%
    mutate(depth = as.integer(ngenemap + namb + nmulti + nnofeat + nunmap)) %>%
    mutate(prop.gene = ngenemap / depth) %>%
    mutate(prop.nofeat = nnofeat / depth) %>%
    mutate(prop.unique = (ngenemap + nnofeat) / depth) ->
    mapresults

# Store the results

In [22]:
head(mapresults)

expid,ngenemap,namb,nmulti,nnofeat,nunmap,depth,prop.gene,prop.nofeat,prop.unique
1_MA_J_S18_L001_ReadsPerGene.out.tab,2399781,647,66100,20347,2690,2489565,0.9639359,0.008172914,0.9721088
1_MA_J_S18_L002_ReadsPerGene.out.tab,2362228,652,65234,20004,2684,2450802,0.9638592,0.008162226,0.9720214
1_MA_J_S18_L003_ReadsPerGene.out.tab,2436776,697,66538,20549,2672,2527232,0.9642075,0.00813103,0.9723385
1_MA_J_S18_L004_ReadsPerGene.out.tab,2417485,616,65066,20505,2585,2506257,0.9645798,0.008181523,0.9727614
1_RZ_J_S26_L001_ReadsPerGene.out.tab,2366742,1431,395848,768146,7218,3539385,0.6686874,0.2170281,0.8857155
1_RZ_J_S26_L002_ReadsPerGene.out.tab,2331658,1337,388079,755654,7022,3483750,0.6692954,0.216908217,0.8862037
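As a check on the arithmetic, the derived columns for the first sample can be recomputed by hand from its five counts. The values below are copied from the first row of the table above; the standalone variable names mirror the column names.

```r
# Counts for 1_MA_J_S18_L001, copied from the table above
ngenemap <- 2399781L
namb     <- 647L
nmulti   <- 66100L
nnofeat  <- 20347L
nunmap   <- 2690L

# Depth is the sum of all five categories
depth <- as.integer(ngenemap + namb + nmulti + nnofeat + nunmap)

# Proportions relative to depth, following the definitions above
prop_gene   <- ngenemap / depth
prop_nofeat <- nnofeat / depth
prop_unique <- (ngenemap + nnofeat) / depth

c(depth = depth, prop.gene = prop_gene,
  prop.nofeat = prop_nofeat, prop.unique = prop_unique)
```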


In [17]:
outfile <- file.path(OUTDIR, "hts-pilot-2018.RData")
save(mapresults, genecounts, file = outfile)

In [18]:
tools::md5sum(path.expand(outfile))
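To confirm that a saved `.RData` file can be restored, one option is to load it into a fresh environment and compare the objects. This is a minimal sketch on toy objects written to a temporary file; the names `outfile_toy`, `mapresults_toy`, and `genecounts_toy` are hypothetical.

```r
# Toy objects and a temporary output path
outfile_toy <- file.path(tempdir(), "toy-results.RData")
mapresults_toy <- data.frame(expid = "s1", depth = 100L, stringsAsFactors = FALSE)
genecounts_toy <- data.frame(expid = "s1", geneA = 42L, stringsAsFactors = FALSE)

save(mapresults_toy, genecounts_toy, file = outfile_toy)

# MD5 checksum of the saved file, as a 32-character hex string
checksum <- tools::md5sum(outfile_toy)

# Load into a separate environment so nothing in the workspace is overwritten
restored <- new.env()
loaded <- load(outfile_toy, envir = restored)
identical(restored$mapresults_toy, mapresults_toy)
```

`load` returns the names of the restored objects, so `loaded` can be checked against what was saved.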