# Monocle analysis of cell-gene matrix
This notebook starts with the annotated cell-gene matrix, and performs analyses using [Monocle](http://cole-trapnell-lab.github.io/monocle-release/).

Analysis by Alistair Russell and [Jesse Bloom](https://research.fhcrc.org/bloom/en.html).

## Setup for analysis
Load the necessary `R` packages (installing any that are missing), then print session information describing the packages and versions used.

In [1]:
options(warn=-1) # suppress warnings that otherwise clutter output

if (!require("pacman", quietly=TRUE)) 
  install.packages("pacman")
pacman::p_load("ggplot2", "ggthemes", "ggExtra", "gridExtra", "cowplot", "scales", "reshape2", 
  "dplyr", "magrittr", "rmarkdown", "IRdisplay", "psych", "qlcMatrix", "colorRamps", "ggpubr",
  "tidyverse", "RColorBrewer", "naturalsort", "grid", "DescTools")

bioc.packages <- c("monocle", "piano")
# sapply returns a logical vector, so all() does not have to coerce a list
if (!(all(suppressMessages(sapply(bioc.packages, require, quietly=TRUE, character.only=TRUE))))) {
  source("https://bioconductor.org/biocLite.R")
  biocLite()
  biocLite(bioc.packages, suppressWarnings=TRUE)
}   
    
# print information on session
sessionInfo()
    
# http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", 
               "#0072B2", "#D55E00", "#CC79A7")
# The palette with black
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", 
                "#0072B2", "#D55E00", "#CC79A7")

# plots will be saved here
plotsdir <- './results/plots/'
if (!dir.exists(plotsdir)) 
  dir.create(plotsdir, recursive=TRUE) # also create missing parent directories
    
# figures for paper will be saved here
figsdir <- './paper/figures/'
if (!dir.exists(figsdir))
  dir.create(figsdir, recursive=TRUE) # also create missing parent directories

saveShowPlot <- function(p, width, height, isfig=FALSE) {
  # save plot with filename of variable name with dots replaced by _, then show
  # if *isfig* is TRUE, then also saves a PDF to *figsdir*
  pngfile <- file.path(plotsdir, sprintf("%s.png", 
    gsub("\\.", "_", deparse(substitute(p)))))
  figfile <- file.path(figsdir, sprintf("%s.pdf", 
    gsub("\\.", "_", deparse(substitute(p)))))
  ggsave(pngfile, plot=p, width=width, height=height, units="in")
  if (isfig)
    ggsave(figfile, plot=p, width=width, height=height, units="in")
  display_png(file=pngfile, width=width * 90)
}
    
fancy_scientific <- function(x, parse.str=TRUE, digits=NULL) {
  # scientific notation formatting, based loosely on https://stackoverflow.com/a/24241954
  # if `parse.str` is TRUE, then we parse the string into an expression
  # `digits` indicates how many digits to include
  x %>% format(scientific=TRUE, digits=digits) %>% gsub("^0e\\+00","0", .) %>%
    gsub("^1e\\+00", "1", .) %>% gsub("^(.*)e", "'\\1'e", .) %>% 
    gsub("e\\+","e", .) %>% gsub("e", "%*%10^", .) %>%
    gsub("^\'1\'\\%\\*\\%", "", .) %>% {if (parse.str) parse(text=.) else .}
}

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] splines   stats4    parallel  grid      stats     graphics  grDevices
 [8] utils     datasets  methods   base     

other attached packages:
 [1] piano_1.16.1        monocle_2.5.3       DDRTree_0.1.5      
 [4] irlba_2.2.1         VGAM_1.0-3          Biobase_2.36.2     
 [7] BiocGenerics_0.22.0 DescTools_0.99.23   naturalsort_0.1.3  
[10] RColorBrewer_1.1-2  purrr_0.2.2.2       readr_1.1.1        
[13] tidyr_0.6.3         tibble_1.3.3        tidyverse_1.1.1    
[16] ggpubr_0.1.4        colorRamps_2.3      qlcMatrix_0.9.5    
[19] slam_0.1-40         Matrix_1.2-12       psych_1.7.5      
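
The two helper functions defined in the cell above are easier to follow with concrete inputs. The following is a minimal base-R sketch, not part of the original analysis: `plot_basename` and `sci_string` are hypothetical names that restate the file-naming step of `saveShowPlot` and the regex chain of `fancy_scientific` (for the `parse.str=FALSE` case) without needing `ggplot2` or `magrittr`.

```r
# Hypothetical helper names; base R only, so no packages are needed.

# 1. File naming as in saveShowPlot: take the name of the variable passed
#    as the argument and replace dots with underscores.
plot_basename <- function(p) {
  gsub("\\.", "_", deparse(substitute(p)))
}
p.example.plot <- NULL           # stand-in for a ggplot object
plot_basename(p.example.plot)    # "p_example_plot"

# 2. The substitutions of fancy_scientific, unrolled one step at a time.
sci_string <- function(x) {
  s <- format(x, scientific=TRUE)
  s <- gsub("^0e\\+00", "0", s)        # plain zero
  s <- gsub("^1e\\+00", "1", s)        # plain one
  s <- gsub("^(.*)e", "'\\1'e", s)     # quote the mantissa
  s <- gsub("e\\+", "e", s)            # drop "+" from the exponent
  s <- gsub("e", "%*%10^", s)          # plotmath multiplication sign
  gsub("^'1'%\\*%", "", s)             # drop a leading factor of 1
}
sci_string(12300)   # "'1.23'%*%10^04"
sci_string(1e5)     # "10^05"
```

Because the file name comes from the argument's variable name, `saveShowPlot` is best called with a named plot variable rather than an inline expression.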

## Load the cell-gene matrix
We load the cell-gene matrices that were created previously by [align_and_annotate.ipynb](align_and_annotate.ipynb).
There is a separate cell-gene matrix for each cell type:
  - *humanplusflu*: A549 cells infected with influenza
  - *canine*: MDCK cells spiked in as a control to estimate leakage / contamination rate.

Note that the flu reads are annotated by the synonymous barcodes near the 3' ends.

The matrix for each cell type contains information for all samples:
  - *IFN_enriched*: cells that have been MACS-sorted for IFN+ at 13 hours post-infection
  - *not_enriched*: cells at 10 hours post-infection that have not been sorted for IFN+

In [3]:
# We have data for these two types of cells
celltypes <- c("humanplusflu", "canine")

# We have data for these two samples
samples <- c("IFN_enriched", "not_enriched")

# Cell gene matrices in this directory
matrixdir <- "./results/cellgenecounts/"

# Read the cell-gene matrix for each cell type into a list named by cell type.
# These are in `matrixdir` with names like "merged_humanplusflu_matrix.mtx"
cells <- lapply(
  setNames(celltypes, celltypes),
  function(celltype) {
    newCellDataSet(
      readMM(file.path(matrixdir, paste("merged", celltype, "matrix.mtx", sep="_"))),
      phenoData=new(
        "AnnotatedDataFrame",
        data=read.delim(file.path(matrixdir, paste("merged", celltype, "cells.tsv", sep="_")))
        ),
      featureData=new(
        "AnnotatedDataFrame",
        data=read.delim(file.path(matrixdir, paste("merged", celltype, "genes.tsv", sep="_")))
        ),
      expressionFamily=negbinomial.size()
      ) 
    }
  )
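
The loading step above relies on three files per cell type that must agree in shape: the rows of the matrix correspond to rows of the genes table, and the columns correspond to rows of the cells table. The sketch below illustrates that relationship with made-up toy data written to a temporary directory. It uses only the `Matrix` package and skips `newCellDataSet`; the `toy` cell type, the `prefix` helper, and all values are assumptions for illustration.

```r
library(Matrix)  # recommended package shipped with standard R installations

# Toy stand-ins for the three files of one cell type (all values made up).
tmpdir <- tempfile("cellgene_")
dir.create(tmpdir)
celltype <- "toy"
prefix <- function(suffix) {
  # same naming scheme as the notebook: merged_<celltype>_<suffix>
  file.path(tmpdir, paste("merged", celltype, suffix, sep="_"))
}

genes <- data.frame(gene_long_name=c("g1", "g2", "g3"),
                    gene_short_name=c("A", "B", "C"))
cellinfo <- data.frame(CellBarcode=c("AAAC", "AAAG"),
                       Sample=c("IFN_enriched", "not_enriched"))
counts <- sparseMatrix(i=c(1, 3, 2), j=c(1, 1, 2), x=c(5, 2, 7), dims=c(3, 2))

invisible(writeMM(counts, prefix("matrix.mtx")))
write.table(genes, prefix("genes.tsv"), sep="\t", quote=FALSE, row.names=FALSE)
write.table(cellinfo, prefix("cells.tsv"), sep="\t", quote=FALSE, row.names=FALSE)

# Read the files back the same way the notebook does.
mat <- readMM(prefix("matrix.mtx"))
g <- read.delim(prefix("genes.tsv"))
cl <- read.delim(prefix("cells.tsv"))

# Sanity check: genes index the rows, cells index the columns.
stopifnot(nrow(mat) == nrow(g), ncol(mat) == nrow(cl))
dim(mat)
```

A dimension check like the final `stopifnot` could be run on the real files before building the `CellDataSet` to catch a mismatched matrix and annotation table early.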

In [6]:
cells['humanplusflu']

$humanplusflu
CellDataSet (storageMode: environment)
assayData: 19969 features, 3260 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: 1 2 ... 3260 (3260 total)
  varLabels: CellBarcode Sample ... Size_Factor (21 total)
  varMetadata: labelDescription
featureData
  featureNames: 1 2 ... 19969 (19969 total)
  fvarLabels: gene_long_name gene_short_name
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:  
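
The `$humanplusflu` header in the output above appears because single-bracket indexing of a list returns a one-element list rather than the element itself. A small sketch with a plain list (the strings are placeholders standing in for `CellDataSet` objects) shows the difference:

```r
# A plain list stands in for `cells`; the strings stand in for CellDataSet objects.
toy <- list(humanplusflu="human+flu data", canine="canine data")

sub <- toy["humanplusflu"]      # single bracket: still a list, of length 1
elem <- toy[["humanplusflu"]]   # double bracket: the element itself

class(sub)                          # "list"
class(elem)                         # "character"
identical(elem, toy$humanplusflu)   # TRUE
```

Functions that expect the object itself rather than a list would therefore need the `[[` or `$` form.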
