# Set up environment

In [None]:
source("course_config.R")

In [None]:
# List all STAR output files in DATDIR
stardirs <- list.files(DATDIR)

# Read in the count data output from STAR

There are 204 files in this folder. Each file was generated by STAR from one FASTQ file.

In [None]:
length(stardirs)

In [None]:
head(stardirs)
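
If the data folder could hold anything other than count tables (STAR logs, for instance), the listing can be restricted by filename. The sketch below assumes the count files end in `ReadsPerGene.out.tab`, as in the example filename `1_MA_J_S18_L001_ReadsPerGene.out.tab`. The directory `toy_dir` and the file names in it are hypothetical stand-ins for `DATDIR` and its contents.

```r
# Toy directory standing in for DATDIR (hypothetical file names)
toy_dir <- file.path(tempdir(), "toy_star_listing")
dir.create(toy_dir, showWarnings = FALSE)
file.create(file.path(toy_dir, c("a_ReadsPerGene.out.tab",
                                 "b_ReadsPerGene.out.tab",
                                 "a_Log.final.out")))

# Keep only the per-gene count tables; the log file is skipped
toy_files <- list.files(toy_dir, pattern = "ReadsPerGene\\.out\\.tab$")
length(toy_files)
```

Comparing this length with the expected number of samples is a quick way to spot a missing or extra file before reading anything in.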

View the first lines of one file (1_MA_J_S18_L001_ReadsPerGene.out.tab). Notice that there are four columns; the columns we want are the **first** (gene ID) and the **fourth** (counts).

In [None]:
cmdstr <- paste("head", file.path(DATDIR, stardirs[1]))
cmdout <- system(cmdstr, intern = TRUE)
str_split(cmdout, pattern = "\t")
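
The same peek can be done without shelling out to `head`, which keeps the notebook portable to systems where that command is unavailable. This is a minimal sketch on a made-up file: `toy_file` and its values are hypothetical, and the column labels are chosen here for illustration only (STAR writes no header). In STAR's `ReadsPerGene.out.tab` layout, column 2 holds counts for unstranded data, column 3 counts for the first read strand aligned with the RNA, and column 4 counts for the second read strand aligned with the RNA.

```r
# Toy file mimicking the four-column STAR layout (values are made up)
toy_file <- tempfile(fileext = ".tab")
writeLines(c("N_unmapped\t10\t10\t10",
             "N_multimapping\t5\t5\t5",
             "N_noFeature\t3\t20\t4",
             "N_ambiguous\t1\t0\t2",
             "geneA\t7\t0\t7",
             "geneB\t2\t1\t1"), toy_file)

# Read only the first five rows directly in R; no system() call needed
toy_head <- read.delim(toy_file, header = FALSE, nrows = 5,
                       col.names = c("gene", "unstranded", "strand1", "strand2"),
                       stringsAsFactors = FALSE)

# The two columns this notebook keeps: gene ID and the fourth column
toy_head[, c("gene", "strand2")]
```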

### Construct a matrix that gathers all the count files

Helper functions to read the count files

In [None]:
mycombine <- function(df1, df2) {
    # Combine two data frames by gene names
    #
    # Args:
    #   df1 (Dataframe): the first count data
    #   df2 (Dataframe): the second count data
    #
    # Returns:
    #   (Dataframe) The combined data frame of df1 and df2
    full_join(df1, df2, by = "gene")
}

myfile <- function(filedir, filename) {
    # Build the full path to a file
    #
    # Args:
    #   filedir  (Character): the directory containing the file
    #   filename (Character): the filename
    #
    # Returns:
    #   (Character) the full path to the input file
    file.path(filedir, filename)
}

# Data type for each column
coltypes <- list(col_character(), col_integer(), col_integer(), col_integer())

Read the count files and combine them

In [None]:
out <- foreach(stardir = stardirs, .combine = mycombine) %do% {
    
    # build the path to each count file
    cntfile <- myfile(DATDIR, stardir)
    
    # read in the count file
    readr::read_tsv(cntfile, col_names = FALSE, col_types = coltypes) %>%
        dplyr::select(X1, X4) %>% # keep the 1st (gene) and 4th (count) columns
        stats::setNames(c("gene", stardir)) # name the count column after its file
}
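
To see what the combining step does, it helps to run it on a few tiny tables. `foreach` with `.combine` folds the per-file results together one pair at a time, and a full join keeps every gene that appears in any file, filling the gaps with `NA`. The sketch below is a base-R analogue using `Reduce()` and `merge(..., all = TRUE)` rather than the `foreach`/`full_join` pair used in this notebook; the tables `s1`, `s2`, `s3` and their values are made up.

```r
# Three toy count tables, one per sample (values are made up)
s1 <- data.frame(gene = c("geneA", "geneB"), s1 = c(7L, 2L))
s2 <- data.frame(gene = c("geneA", "geneB"), s2 = c(4L, 0L))
s3 <- data.frame(gene = c("geneB", "geneC"), s3 = c(9L, 1L))

# Fold the list pairwise, keeping every gene seen in any table
toy_out <- Reduce(function(x, y) merge(x, y, by = "gene", all = TRUE),
                  list(s1, s2, s3))

# geneA is absent from s3 and geneC from s1 and s2, so those cells are NA
toy_out
```

With real STAR output all files share the same annotation, so no `NA` values are expected; finding any would suggest that the files were generated against different gene models.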

There are 205 columns: 204 count files plus one `gene` column.

In [None]:
dim(out)

In [None]:
out[1:6, 1:6]

# Arrange the results from the count files

### Separate the first four rows (-> nmisc) and others (-> genecounts)

We see that the combined table contains both summary counts and per-gene counts for each sample. Note that `out` is in biological format (i.e. the samples are the columns and the variables are the rows). Let's split it into two data frames: `nmisc` and `genecounts`.

For `nmisc`, we will take the first 4 rows of `out`, since those are the summary features. Next, we want to transform the data frame so that it is in statistical format (the samples are the rows and the feature types are the columns). Using a combination of `gather` and `spread`, we can transpose the table into the desired format.

In [None]:
### Gather and spread the first four rows
out %>%
    dplyr::slice(1:4) %>%
    gather(expid, value, -gene) %>% 
    spread(gene, value) %>%
    # spread() orders the new columns alphabetically:
    # N_ambiguous, N_multimapping, N_noFeature, N_unmapped
    setNames(c("expid", "namb", "nmulti", "nnofeat", "nunmap")) ->
    nmisc

In [None]:
nmisc %>% head
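
The gather-then-spread transpose is easier to follow on a table small enough to read at a glance. This is a sketch on made-up data (the data frame `toy` and its values are hypothetical); it assumes the `tidyr` package is installed, as it is for the rest of this notebook.

```r
library(tidyr)

# Toy table in biological format: genes as rows, samples as columns
toy <- data.frame(gene = c("geneA", "geneB"),
                  s1 = c(7L, 2L),
                  s2 = c(4L, 0L),
                  stringsAsFactors = FALSE)

# Long format: one row per (gene, sample) pair
toy_long <- gather(toy, expid, value, -gene)

# Wide again, but keyed by sample: one row per sample, one column per gene
toy_wide <- spread(toy_long, gene, value)
toy_wide
```

The intermediate `toy_long` has four rows (2 genes x 2 samples), and `toy_wide` has one row per sample with the former `gene` values as column names.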

To obtain the counts for specific genes, we will use the rest of our `out` matrix since it contains the gene counts. However, we still want to transform the data frame into statistical format, which we will accomplish using gather and spread.

In [None]:
### Gather and spread the genes to get a count matrix
out %>%
    dplyr::slice(-(1:4)) %>%
    gather(expid, value, -gene) %>% 
    spread(gene, value) -> genecounts

In [None]:
genecounts[1:6,1:6]

### For each sample, sum up all the counts

We can create a variable holding the total number of reads assigned to genes in each sample by summing across the rows.

In [None]:
### Sum across the rows for a total gene count variable
genecounts %>%    
    mutate(ngenemap = rowSums(.[-1])) %>%
    select(expid, ngenemap) -> ngene
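
One detail worth knowing about `rowSums()` is how it treats missing values: a single `NA` in a row makes that row's total `NA` unless `na.rm = TRUE` is set. The sketch below shows both behaviours on a made-up table (`toy_counts` and its values are hypothetical). Whether dropping `NA` values is appropriate depends on why they are missing, so this is an illustration rather than a recommendation.

```r
# Toy per-sample count table with one missing value (values are made up)
toy_counts <- data.frame(expid = c("s1", "s2"),
                         geneA = c(7L, 4L),
                         geneB = c(2L, NA))

# Default behaviour: the NA propagates to the total for s2
strict <- rowSums(toy_counts[-1])

# With na.rm = TRUE the missing cell is skipped
lenient <- rowSums(toy_counts[-1], na.rm = TRUE)

data.frame(expid = toy_counts$expid, strict = strict, lenient = lenient)
```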

### Summarize the results

We will create a comprehensive data frame `mapresults` which combines `ngene` with `nmisc`. This data frame holds the mapping summary counts along with the proportions derived from them.

In [None]:
### Merge in the 4 misc counts and add summaries
ngene %>%
    full_join(nmisc, by = "expid") %>%
    mutate(depth = as.integer(ngenemap + namb + nmulti + nnofeat + nunmap)) %>%
    mutate(prop.gene = ngenemap / depth) %>%
    mutate(prop.nofeat = nnofeat / depth) %>%
    mutate(prop.unique = (ngenemap + nnofeat) / depth) ->
    mapresults
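
The derived columns can be checked by hand on a small example. The sketch below recomputes them in base R for two made-up samples (`toy_map` and its values are hypothetical) and confirms two properties that follow from the definitions above: `prop.unique` equals `prop.gene + prop.nofeat`, and every proportion lies between 0 and 1.

```r
# Toy mapping summary for two samples (values are made up)
toy_map <- data.frame(expid    = c("s1", "s2"),
                      ngenemap = c(80, 60),
                      namb     = c(2, 5),
                      nmulti   = c(8, 10),
                      nnofeat  = c(6, 15),
                      nunmap   = c(4, 10))

# Depth is the sum of all five read categories
toy_map$depth <- with(toy_map, ngenemap + namb + nmulti + nnofeat + nunmap)

# Same proportions as in the pipeline above
toy_map$prop.gene   <- with(toy_map, ngenemap / depth)
toy_map$prop.nofeat <- with(toy_map, nnofeat / depth)
toy_map$prop.unique <- with(toy_map, (ngenemap + nnofeat) / depth)

# Consistency checks that follow from the definitions
props <- toy_map[, c("prop.gene", "prop.nofeat", "prop.unique")]
additive  <- isTRUE(all.equal(toy_map$prop.unique,
                              toy_map$prop.gene + toy_map$prop.nofeat))
in_range  <- all(props >= 0 & props <= 1)
c(additive = additive, in_range = in_range)
```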

# Store the results

In [None]:
head(mapresults)

In [None]:
dir.create(OUTDIR, recursive = TRUE, showWarnings = FALSE)
outfile <- file.path(OUTDIR, "hts-course-2018.RData")
save(mapresults, genecounts, file = outfile)

In [None]:
tools::md5sum(path.expand(outfile))
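
The checksum makes it possible to confirm later that a copy of the file is byte-for-byte identical. A complementary check is a round trip: load the file back and compare the objects with the originals. The sketch below does this on a made-up object in a temporary file (`toy_results` and `toy_rdata` are hypothetical names), loading into a separate environment so that nothing in the workspace is overwritten.

```r
# Toy object and temporary file standing in for the real output
toy_results <- data.frame(expid = c("s1", "s2"), depth = c(100L, 100L))
toy_rdata <- tempfile(fileext = ".RData")
save(toy_results, file = toy_rdata)

# load() returns the names of the restored objects
chk <- new.env()
loaded <- load(toy_rdata, envir = chk)

# The restored copy should match the original exactly
round_trip_ok <- identical(chk$toy_results, toy_results)

# The checksum of an unchanged file is stable across calls
sum1 <- unname(tools::md5sum(toy_rdata))
sum2 <- unname(tools::md5sum(toy_rdata))
c(round_trip_ok = round_trip_ok, checksum_stable = identical(sum1, sum2))
```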