Update tutorials

hms-dbmi · Nov 17, 2015 · 7f40039 · 7f40039
1 parent d054b7c
commit 7f40039
Show file tree

Hide file tree

Showing 12 changed files with 310 additions and 117 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,16 +1,22 @@
 Package: scde
 Type: Package
 Title: Single Cell Differential Expression
-Version: 1.99.0
+Version: 0.99.1
 Date: 2015-05-06
-Description: The scde package implements a set of statistical methods for analyzing single-cell RNA-seq data.
-    scde fits individual error models for single-cell RNA-seq measurements. These models can then be used for
-    assessment of differential expression between groups of cells, as well as other types of analysis.
-    The scde package also contains the pagoda framework which applies pathway and gene set overdispersion analysis
-    to identify and characterize putative cell subpopulations based on transcriptional signatures.
-    The overall approach to the differential expression analysis is detailed in the following publication:
-    "Bayesian approach to single-cell differential expression analysis" (Kharchenko PV, Silberstein L, Scadden DT, Nature Methods, doi:10.1038/nmeth.2967).
-    The overall approach to subpopulation identification and characterization is detailed in the following publication:
+Description: The scde package implements a set of statistical methods for
+    analyzing single-cell RNA-seq data. scde fits individual error models for
+    single-cell RNA-seq measurements. These models can then be used for assessment
+    of differential expression between groups of cells, as well as other types of
+    analysis. The scde package also contains the pagoda framework which applies
+    pathway and gene set overdispersion analysis to identify and characterize
+    putative cell subpopulations based on transcriptional signatures. The overall
+    approach to the differential expression analysis is detailed in the following
+    publication: "Bayesian approach to single-cell differential expression
+    analysis" (Kharchenko PV, Silberstein L, Scadden DT, Nature Methods, doi:
+    10.1038/nmeth.2967). The overall approach to subpopulation identification and
+    characterization is detailed in the following pre-print: "Characterizing 
+    transcriptional heterogeneity through pathway and gene set overdispersion 
+    analysis" (Fan J, et al, doi: http://dx.doi.org/10.1101/026948).
 Author: Peter Kharchenko [aut, cre],
     Jean Fan [aut]
 Authors@R: c(
@@ -20,37 +26,19 @@ Authors@R: c(
 	         email = "jeanfan@fas.harvard.edu")
 	         )
 Maintainer: Jean Fan <jeanfan@fas.harvard.edu>
-URL: http://pklab.med.harvard.edu/scde/index.html
-BugReports: https://github.com/JEFworks/scde/issues
+URL: http://pklab.med.harvard.edu/scde
+BugReports: https://github.com/hms-dbmi/scde/issues
 License: GPL-2
 LazyData: true
-Depends:
-    R (>= 3.0.0),
-    flexmix
-Imports:
-    Rcpp (>= 0.10.4),
-    RcppArmadillo (>= 0.5.400.2.0),
-    mgcv,
-    Rook,
-    rjson,
-    MASS,
-    Cairo,
-    RColorBrewer,
-    edgeR,
-    quantreg,
-    methods,
-    nnet,
-    RMTstat,
-    extRemes,
-    pcaMethods,
-    BiocParallel,
-    parallel
-Suggests:
-    knitr,
-    cba,
-    fastcluster,
-    WGCNA
+Depends: R (>= 3.0.0), flexmix
+Imports: Rcpp (>= 0.10.4), RcppArmadillo (>= 0.5.400.2.0), mgcv, Rook,
+        rjson, MASS, Cairo, RColorBrewer, edgeR, quantreg, methods,
+        nnet, RMTstat, extRemes, pcaMethods, BiocParallel, parallel
+Suggests: knitr, cba, fastcluster, WGCNA, GO.db, org.Hs.eg.db
 biocViews: RNASeq, StatisticalMethod, DifferentialExpression, Bayesian,
-    Transcription, Software
+        Transcription, Software
 LinkingTo: Rcpp, RcppArmadillo
 VignetteBuilder: knitr
+Packaged: 2015-11-02 14:30:04 UTC; reyes
+RoxygenNote: 5.0.0
+NeedsCompilation: yes
diff --git a/vignettes/diffexp.Rmd b/vignettes/diffexp.Rmd
@@ -4,18 +4,19 @@ author: "Peter Kharchenko, Jean Fan"
 date: '`r Sys.Date()`'
 output: html_document
 vignette: |
+  %\VignetteEngine{knitr::rmarkdown}
   %\VignetteIndexEntry{Vignette Title} \usepackage[utf8]{inputenc}
 ---
 
 # Single-Cell Differential Expression Analysis
 
-In this vignette, we show you how perform single cell differential expression analysis using single cell RNA-seq data. 
+In this vignette, we show you how perform single cell differential expression analysis using single cell RNA-seq data with the `scde` package.
 
-The `scde` package implements routines for fitting individual error models for single-cell RNA-seq measurements. Briefly, the read counts observed for each gene are modeled using a mixture of a negative binomial (NB) distribution (for the amplified/detected transcripts) and low-level Poisson distribution (for the unobserved or background-level signal of genes that failed to amplify or were not detected for other reasons). These models can then be used to identify robustly differentially expressed genes between groups of cells. 
+The `scde` package implements routines for fitting individual error models for single-cell RNA-seq measurements. Briefly, the read counts observed for each gene are modeled using a mixture of a negative binomial (NB) distribution (for the amplified/detected transcripts) and low-level Poisson distribution (for the unobserved or background-level signal of genes that failed to amplify or were not detected for other reasons). These models can then be used to identify robustly differentially expressed genes between groups of cells. For more information, please refer to the original manuscript by [_Kharchenko et al._](http://www.ncbi.nlm.nih.gov/pubmed/24836921).
 
 ## Preparing data
 
-The analysis starts with a matrix of read counts. Depending on the protocol, these may be raw numbers of reads mapped to each gene, or count values adjusted for potential biases (sequence dependency, splice variant coverage, etc. - the values must be integers). The `scde` package includes a subset of the ES/MEF cell dataset published by [_Islam et al._](http://www.ncbi.nlm.nih.gov/pubmed/?term=24363023). The subset includes first 20 ES and MEF cells. Here we load the cells and define a factor separating ES and MEF cell types:
+The analysis starts with a matrix of read counts. Depending on the protocol, these may be raw numbers of reads mapped to each gene, or count values adjusted for potential biases (sequence dependency, splice variant coverage, etc. - the values must be integers). The `scde` package includes a subset of the ES/MEF cell dataset published by [_Islam et al._](http://www.ncbi.nlm.nih.gov/pubmed/24363023). The subset includes first 20 ES and MEF cells. Here we load the cells and define a factor separating ES and MEF cell types:
 
 ```{r, include = FALSE}
 library(knitr)
@@ -145,3 +146,110 @@ Similarly, batch correction can be performed when calculating expression differe
 # test for all of the genes
 ediff.batch <- scde.expression.difference(o.ifm, cd, o.prior, groups = groups, batch = batch, n.randomizations = 100, n.cores = 1, return.posteriors = TRUE, verbose = 1)
 ```
+
+### More detailed functions
+
+The `scde.expression.difference` method can return a more extensive set of results, including joint posteriors and the expression fold difference posteriors for all of the exam
+ined genes:
+```{r, detailed1, include=FALSE, eval=FALSE}
+# recalculate difference and return with joint posteriors and difference posterior
+ediff.details <- scde.expression.difference(o.ifm, cd, o.prior, n.randomizations = 100, n.cores = 1, verbose = 1, return.posteriors = TRUE)
+```
+
+The joint posteriors can also be obtained explicitly for a particular set of cells:
+```{r, detailed2, eval=FALSE}
+# calculate joint posterior for ESCs (set return.individual.posterior.modes=T if you need p.modes)
+jp <- scde.posteriors(models = o.ifm[grep("ESC",rownames(o.ifm)), ], cd, o.prior, n.cores = 1)
+```
+
+The error models fit the intercept and the slope of the NB "correlated" component, providing more consistent expression magnitude estimates among the cells. These can be obtain
+ed with a quick helper function:
+```{r, detailed3}
+# get expression magntiude estimates
+o.fpm <- scde.expression.magnitude(o.ifm, counts = cd)
+```
+
+Drop-out probabilities (as a function of expression magnitudes) for different cells are useful for assessing the quality of the measurements: 
+```{r, detailed4, fig.width=4, fig.height=4}
+# get failure probabilities on the expresison range
+o.fail.curves <- scde.failure.probability(o.ifm, magnitudes = log((10^o.prior$x)-1))
+par(mfrow = c(1,1), mar = c(3.5,3.5,0.5,0.5), mgp = c(2.0,0.65,0), cex = 1)
+plot(c(), c(), xlim=range(o.prior$x), ylim=c(0,1), xlab="expression magnitude (log10)", ylab="drop-out probability")
+invisible(apply(o.fail.curves[, grep("ES",colnames(o.fail.curves))], 2, function(y) lines(x = o.prior$x, y = y,col = "orange")))
+invisible(apply(o.fail.curves[, grep("MEF", colnames(o.fail.curves))], 2, function(y) lines(x = o.prior$x, y = y, col = "dodgerblue")))
+```
+
+The drop-out probabilities (at a given expression magnitude, or at an observed count) can be useful in subsequent analysis
+```{r, detailed5}
+# get failure probabilities on the expresison range
+o.fail.curves <- scde.failure.probability(o.ifm, magnitudes = log((10^o.prior$x)-1))
+# get self-fail probabilities (at a given observed count)
+p.self.fail <- scde.failure.probability(models = o.ifm, counts = cd)
+```
+
+## Adjusted distance meaures
+
+The dependency of drop-out probability on the average expression magntiude captured by the cell-speicifc models can be used to adjust cell-to-cell similarity measures, for insta
+nce in the context of cell clustering. Several such measures are explored below.
+
+### Direct drop-out
+
+Direct weighting downweights the contribution of a given gene to the cell-to-cell distance based on the probability that the given measurement is a drop-out event (i.e. belongs to the drop-out component) - the "self-fail" probability shown in the previous section. To estimate the adjusted distance, we will simulate the drop-out events, replacing them with `NA` values, and calculating correlation using the remaining points: 
+```{r, adjusted1, results='hide', eval=FALSE}
+p.self.fail <- scde.failure.probability(models = o.ifm, counts = cd)
+# simulate drop-outs
+# note: using 10 sampling rounds for illustration here. ~500 or more should be used.
+n.simulations <- 10; k <- 0.9;
+cell.names <- colnames(cd); names(cell.names) <- cell.names;
+dl <- mclapply(1:n.simulations,function(i) {
+  scd1 <- do.call(cbind,lapply(cell.names,function(nam) {
+    x <- cd[,nam];
+    # replace predicted drop outs with NAs
+    x[!as.logical(rbinom(length(x),1,1-p.self.fail[,nam]*k))] <- NA;
+    x;
+    }))
+  rownames(scd1) <- rownames(cd); 
+  # calculate correlation on the complete observation pairs
+  cor(log10(scd1+1),use="pairwise.complete.obs");
+}, mc.cores = 1)
+# calculate average distance across sampling rounds
+direct.dist <- as.dist(1-Reduce("+",dl)/length(dl))
+```
+
+### Reciprocal weighting
+The reciprocal weighting of the Pearson correlation will give increased weight to pairs of observations where a gene expressed (on average) at a level x1 observed in a cell c1 would not be likely to fail in a cell c2, and vice versa:
+```{r, adjusted2, results='hide', eval=FALSE}
+# load boot package for the weighted correlation implementation
+require(boot)
+k <- 0.95;
+reciprocal.dist <- as.dist(1 - do.call(rbind, mclapply(cell.names, function(nam1) {
+  unlist(lapply(cell.names, function(nam2) {
+    # reciprocal probabilities
+    f1 <- scde.failure.probability(models = o.ifm[nam1,,drop = FALSE], magnitudes = o.fpm[, nam2])
+    f2 <- scde.failure.probability(models = o.ifm[nam2,,drop = FALSE], magnitudes = o.fpm[, nam1])
+    # weight factor
+    pnf <- sqrt((1-f1)*(1-f2))*k +(1-k); 
+    boot::corr(log10(cbind(cd[, nam1], cd[, nam2])+1), w = pnf)
+    }))
+},mc.cores = 1)), upper = FALSE)
+```
+
+### Mode-relative weighting
+A more reliable reference magnitude against which drop-out likelihood could be assessed would be an estimate of the average expression magnitude, such as joint posterior mode. Below we estimate `p.mode.fail`, a probability that a drop-out event could be observed at the level of average expression magntiude in a given cell. For each measurement we then reduce it weight if it indeed dropped out in a cell where we expect it to drop-out given its average expression magnitude `(p.self.fail*p.mode.fail)`. However we do want to give high weight to measurements where the drop-out was not observed, even though it was exected based on the average expression magnitude, so the overall weight expression is `(1-p.self.fail*sqrt(p.self.fail*p.mode.fail))` (other formulations are clearly possible here).
+```{r, adjusted3, results='hide', eval=FALSE}
+# reclculate posteriors with the individual posterior modes 
+jp <- scde.posteriors(models = o.ifm, cd, o.prior, return.individual.posterior.modes = TRUE, n.cores = 1)
+# find joint posterior modes for each gene - a measure of MLE of group-average expression
+jp$jp.modes <- log(as.numeric(colnames(jp$jp)))[max.col(jp$jp)]
+p.mode.fail <- scde.failure.probability(models = o.ifm, magnitudes = jp$jp.modes)
+# weight matrix
+matw <- 1-sqrt(p.self.fail*sqrt(p.self.fail*p.mode.fail))
+# magnitude matrix (using individual posterior modes here)
+mat <- log10(exp(jp$modes)+1);
+# weighted distance
+mode.fail.dist <- as.dist(1-do.call(rbind,mclapply(cell.names,function(nam1) {
+  unlist(lapply(cell.names,function(nam2) {
+    corr(cbind(mat[, nam1], mat[, nam2]), w = sqrt(sqrt(matw[, nam1]*matw[, nam2])))
+  }))
+}, mc.cores = 1)), upper = FALSE)
+```