---
title: "Basic Deconvolution using dtangle"
author: "Greg Hunt"
date: "`r Sys.Date()`"
output:
  md_document:
    variant: markdown_github
  pdf_document: default
  html_document: default
vignette: |
  %\VignetteIndexEntry{Basic Deconvolution}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# The Data
In this vignette we will work through a simple example of deconvolving cell type proportions from DNA microarray data. We work with a data set created from rats and introduced by [Shen-Orr et al](https://www.nature.com/nmeth/journal/v7/n4/abs/nmeth.1439.html). This is available on GEO with accession [GSE19830](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19830). The data set we will work with is a subset of the Shen-Orr data and is included in the ``dtangle`` package under the name ``shen_orr_ex``. Alternatively, we can access this and other data sets through the supplementary ``dtangle.data`` package we have made available [here](https://wm1693.box.com/s/np66a1wnhngafoawsiu665sjb7kye2ub). More information about the data set is available as part of the ``R`` help, ``?shen_orr_ex``. First, load up the data set.
```{r}
library('dtangle')
names(shen_orr_ex)
```
In this data set rat brain, liver and lung cells have been mixed together in various proportions, and the resulting mixtures were analyzed with DNA microarrays. The mixing proportions are encoded in the mixture matrix:
```{r}
head(shen_orr_ex$annotation$mixture)
```
Each row of this matrix is a sample and each column gives the proportion of one cell type in each sample.
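To make the layout concrete, here is a small hand-made mixture matrix in the same orientation. The values and sample names are invented for illustration and are not taken from ``shen_orr_ex``; the only property it is meant to show is that the proportions in each row sum to one.

```{r}
# Toy mixture matrix (made-up values): rows are samples, columns are cell types.
toy_mix <- rbind(
  pure_liver = c(Liver = 1.00, Brain = 0.00, Lung = 0.00),
  mix_a      = c(Liver = 0.25, Brain = 0.50, Lung = 0.25),
  mix_b      = c(Liver = 0.70, Brain = 0.05, Lung = 0.25)
)
toy_mix
# Each row describes a whole sample, so its proportions should sum to one.
rowSums(toy_mix)
```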
The RMA-summarized gene expression data generated as part of the Shen-Orr experiment is accessible under ``data$log``:
```{r}
Y <- shen_orr_ex$data$log
Y[1:4,1:4]
```
Each row is a different sample and each column is a particular gene. The values of the matrix are $\log_2$ RMA-summarized gene expressions.
# Arguments
The arguments to dtangle can be grouped as follows. First are the two most important groups of arguments:
1. gene expression data input: Y, references, and pure\_samples
2. marker gene controls: n\_markers, markers and marker\_method
and then the less important arguments for more fine-tuned control:
3. data type fudge-factor controls: gamma and data\_type
4. the summarizing function: summary\_fn
# 1. Y, references, and pure\_samples
In order to deconvolve gene expression data from mixture samples, dtangle requires references of the cell types to be deconvolved. The mixture gene expressions and reference gene expressions are given to dtangle using the arguments Y, references, and pure\_samples.
Here we use some data from [Shen-Orr et al](https://www.nature.com/nmeth/journal/v7/n4/abs/nmeth.1439.html) as an example. This data consists of log-scale expression measurements from mixtures of rat brain, liver and lung cells.
```{r}
library('dtangle')
data = shen_orr_ex$data$log
mixture_proportions = shen_orr_ex$annotation$mixture
```
The true mixing proportions of each sample are captured in the variable mixture\_proportions:
```{r}
mixture_proportions
```
We can see from this that the first nine samples are pure reference samples of the three cell types and the remaining samples are mixture samples of the cell types. We want to use these reference samples to deconvolve the remaining mixture samples. This can be done in a couple of ways:
1. We can provide Y and pure\_samples to dtangle. Here Y will be the combined matrix of reference and mixture samples and pure\_samples will tell dtangle which samples (rows of Y) are reference samples and (by elimination) which samples are mixture samples we wish to deconvolve.
```{r}
pure_samples = list(Liver=c(1,2,3),Brain=c(4,5,6),Lung=c(7,8,9))
dt_out = dtangle(Y=data, pure_samples = pure_samples)
matplot(mixture_proportions,dt_out$estimates, xlim = c(0,1),ylim=c(0,1),xlab="Truth",ylab="Estimates")
```
2. We can instead split the data into Y as just the matrix of mixture samples and references as the matrix of reference expressions.
```{r}
mixture_samples = data[-(1:9),]
reference_samples = data[1:9,]
dt_out = dtangle(Y=mixture_samples, references=reference_samples,pure_samples = pure_samples)
mixture_mixture_proportions = mixture_proportions[-(1:9),]
matplot(mixture_mixture_proportions,dt_out$estimates, xlim = c(0,1),ylim=c(0,1),xlab="Truth",ylab="Estimates")
```
Now the variable pure\_samples tells dtangle what cell type each of the rows of the references matrix corresponds to. Notice that dtangle only estimates the mixing proportions for the samples given in the Y argument. Since we have removed the reference samples from the matrix passed to Y, we only estimate the mixing proportions for the remaining 33 mixture samples. Previously we had estimated proportions for both the mixture and reference samples.
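The "by elimination" step of the first approach, and the split used in the second, can be pictured with plain index arithmetic. This is only a sketch of the idea on a made-up six-sample matrix (``toy_Y`` and ``toy_pure`` are illustrative names), not dtangle's internal code.

```{r}
# Toy expression matrix: 6 samples (rows) by 3 genes (columns), made-up values.
toy_Y <- matrix(1:18, nrow = 6,
                dimnames = list(paste0("s", 1:6), paste0("g", 1:3)))
# Rows 1-2 are pure type A and rows 3-4 are pure type B (hypothetical labels).
toy_pure <- list(A = c(1, 2), B = c(3, 4))

# Rows listed in toy_pure are references; every other row is a mixture.
toy_ref_rows <- unlist(toy_pure, use.names = FALSE)
toy_mix_rows <- setdiff(seq_len(nrow(toy_Y)), toy_ref_rows)
toy_mix_rows

# Splitting by these indices gives the two matrices of the second approach.
toy_references <- toy_Y[toy_ref_rows, , drop = FALSE]
toy_mixtures   <- toy_Y[toy_mix_rows, , drop = FALSE]
rownames(toy_mixtures)
```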
In this example we still needed the variable pure\_samples because our reference expression matrix contained multiple reference profiles for each cell type. Often one only has a reference expression matrix with one (typically average) expression profile per cell type. In this case we don't need the pure\_samples argument:
```{r}
ref_reduced = t(sapply(pure_samples,function(x)colMeans(reference_samples[x,,drop=FALSE])))
dt_out = dtangle(Y=mixture_samples, references=ref_reduced)
matplot(mixture_mixture_proportions,dt_out$estimates, xlim = c(0,1),ylim=c(0,1),xlab="Truth",ylab="Estimates")
```
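The ``sapply``/``colMeans`` idiom above collapses replicate reference profiles into one row per cell type. A toy run with made-up numbers (every name prefixed ``toy_`` is illustrative only) shows the shape of the result:

```{r}
# Four reference samples (rows) by three genes (columns), made-up values.
toy_refs <- matrix(c(2, 4, 6, 8,
                     1, 3, 5, 7,
                     0, 2, 4, 6),
                   nrow = 4,
                   dimnames = list(paste0("r", 1:4), paste0("g", 1:3)))
toy_groups <- list(A = c(1, 2), B = c(3, 4))

# sapply returns genes x cell types, so transpose to get cell types x genes.
toy_ref_reduced <- t(sapply(toy_groups,
                            function(idx) colMeans(toy_refs[idx, , drop = FALSE])))
toy_ref_reduced
```

The ``drop = FALSE`` keeps the subset a matrix even when a cell type has a single reference sample, so ``colMeans`` still works.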
# 2. n\_markers, markers and marker\_method
Central to dtangle is finding marker genes for each cell type. Markers may either be given explicitly to dtangle by the user, or dtangle may be left to determine the marker genes automatically.
## Letting dtangle determine the marker genes
If we do not specify the argument markers then dtangle automatically determines marker genes:
```{r}
dt_out = dtangle(Y=mixture_samples, references = ref_reduced)
```
We can change the way that dtangle finds marker genes using the marker\_method argument:
```{r}
dt_out = dtangle(Y=mixture_samples, references = ref_reduced,marker_method = "diff")
```
The default is to use "ratio".
The argument n\_markers specifies how many marker genes to use. If unspecified then dtangle uses the top 10\% of genes (as ranked according to marker\_method) as markers.
```{r}
dt_out$n_markers
```
The number of marker genes can be explicitly specified by setting n\_markers:
```{r}
dt_out = dtangle(Y=mixture_samples, references = ref_reduced,marker_method = "diff",n_markers=100)
dt_out$n_markers
```
If just a single integer is specified then all cell types use that number of marker genes. Alternatively we can specify a vector of integers to specify a number of marker genes individually for each cell type:
```{r}
dt_out = dtangle(Y=mixture_samples, references = ref_reduced,marker_method = "diff",n_markers=c(100,150,50))
dt_out$n_markers
```
We can also, in a similar fashion, pass a percentage expressed as a fraction between zero and one (or a vector of such fractions) to n\_markers, which will then use that percentage of the ranked marker genes for each cell type:
```{r}
dt_out = dtangle(Y=mixture_samples, references = ref_reduced,marker_method = "diff",n_markers=.075)
dt_out$n_markers
dt_out = dtangle(Y=mixture_samples, references = ref_reduced,marker_method = "diff",n_markers=c(.1,.15,.05))
dt_out$n_markers
```
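One way to picture how a count or a fraction might be turned into a per-cell-type number of markers is sketched below. ``toy_resolve_n_markers`` is a hypothetical helper written for this illustration, not a dtangle function, and its rounding and capping rules are assumptions; the ``n_markers`` element of the dtangle output remains the record of what was actually used.

```{r}
# Hypothetical helper (not part of dtangle): turn n_markers into integer counts.
# n_candidates: number of ranked candidate genes available for each cell type.
toy_resolve_n_markers <- function(n_markers, n_candidates) {
  # Recycle a single value across all cell types.
  n_markers <- rep_len(n_markers, length(n_candidates))
  # Values below 1 are read as fractions of the candidates (assumed rule).
  is_fraction <- n_markers < 1
  counts <- ifelse(is_fraction, round(n_markers * n_candidates), n_markers)
  # Never ask for more markers than there are candidates.
  counts <- pmin(counts, n_candidates)
  setNames(as.integer(counts), names(n_candidates))
}

# Made-up numbers of candidate genes per cell type.
toy_candidates <- c(Liver = 200, Brain = 400, Lung = 100)
toy_resolve_n_markers(100, toy_candidates)
toy_resolve_n_markers(c(100, 150, 50), toy_candidates)
toy_resolve_n_markers(c(0.1, 0.15, 0.05), toy_candidates)
```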
## Specifying the marker genes explicitly
Instead of letting dtangle determine the marker genes we can explicitly pass a list of markers to dtangle specifying the marker genes:
```{r}
marker_genes = list(c(120,253,316),
c(180,429,14),
c(1,109,206))
dt_out = dtangle(Y=mixture_samples, references = ref_reduced,markers=marker_genes)
dt_out$n_markers
```
The format of the list is precisely the same format as returned in the markers element of the output of dtangle, that is, a list of vectors of column indices of $Y$ that are markers of each of the cell types. The elements of the list correspond one to each cell type in the same order specified either in elements of pure\_samples or by the rows of references. The argument n\_markers can be used in the same way to subset the markers if desired.
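If the markers are known by gene name rather than by column position, they can be converted into this index format with ``match``. The gene names and matrix below are placeholders invented for illustration, not real probe identifiers from the data set.

```{r}
# Toy expression matrix with hypothetical gene names as column names.
toy_expr <- matrix(0, nrow = 2, ncol = 5,
                   dimnames = list(c("s1", "s2"),
                                   c("geneA", "geneB", "geneC", "geneD", "geneE")))
# Hypothetical marker names per cell type, in the same order as the references.
toy_marker_names <- list(Liver = c("geneB", "geneE"),
                         Brain = "geneA",
                         Lung  = c("geneD", "geneC"))

# Convert names to column indices of the expression matrix.
toy_marker_idx <- lapply(toy_marker_names, match, table = colnames(toy_expr))
toy_marker_idx
# A gene missing from the matrix would show up as NA, so check before use.
stopifnot(!anyNA(unlist(toy_marker_idx)))
```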
## How dtangle finds markers
dtangle finds the marker genes by using the find\_markers function.
```{r}
mrkrs = find_markers(Y=mixture_samples, references = ref_reduced)
names(mrkrs)
```
This returns a list with four elements: L, which contains all genes putatively assigned to a cell type they mark; V, which contains the ranking values by which the elements of L are ordered; and M and sM, which are the matrix and sorted matrix used to create V and L.
We can pass either the entire list or just the L list to dtangle as markers and re-create how dtangle automatically chooses markers:
```{r}
dt_out = dtangle(Y = mixture_samples,references = ref_reduced,markers=mrkrs,n_markers=.1)
```
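To give a feel for what a ratio-style ranking involves, the sketch below scores each gene by how much higher its reference expression is in one cell type than on average in the others, assigns it to its best-scoring cell type, and sorts the genes within each cell type. This is an illustrative simplification under our own assumptions (log-scale input, a difference of logs standing in for a log-ratio, made-up data). It is not the code of find\_markers, and its output need not match what find\_markers returns.

```{r}
# Toy reference matrix on the log2 scale: cell types (rows) by genes (columns).
toy_ref <- rbind(A = c(g1 = 10, g2 = 4, g3 = 6, g4 = 5),
                 B = c(g1 = 3,  g2 = 9, g3 = 6, g4 = 8))

# Score for cell type k and gene g: expression in k minus the mean expression
# in the other cell types (a log-ratio, since the data are on the log scale).
toy_score <- sapply(seq_len(nrow(toy_ref)), function(k) {
  toy_ref[k, ] - colMeans(toy_ref[-k, , drop = FALSE])
})
colnames(toy_score) <- rownames(toy_ref)   # result is genes x cell types

# Assign each gene to the cell type where it scores highest (ties go to the
# first cell type), then order the genes of each cell type by that score.
toy_best <- colnames(toy_score)[max.col(toy_score, ties.method = "first")]
toy_ranked <- lapply(split(seq_len(ncol(toy_ref)), toy_best), function(idx) {
  idx[order(apply(toy_score[idx, , drop = FALSE], 1, max), decreasing = TRUE)]
})
toy_ranked
```

Gene ``g3`` is expressed equally in both toy cell types, so it lands at the bottom of a list; keeping only the top-ranked entries is what discards such uninformative genes.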
# 3. gamma and data\_type
A unique aspect of dtangle is its parameter $\gamma$. This is a fudge factor dtangle applies to account for imperfect mRNA quantification, especially for microarrays. The value of $\gamma$ can be specified through the argument gamma or data\_type.
We can explicitly set gamma if we want:
```{r}
dt_out = dtangle(Y = mixture_samples,references = ref_reduced,gamma=.9)
```
Alternatively, we can set it to a pre-defined value by passing the data\_type argument, which sets $\gamma$ automatically based upon the data type:
```{r}
dt_out = dtangle(Y = mixture_samples,references = ref_reduced,data_type="microarray-gene")
```
The options for $\gamma$ are:
```{r}
dtangle:::gma
```
If neither gamma nor data\_type is specified then dtangle sets $\gamma$ to one. If there is no intuition about either the data type or a good value of $\gamma$ then we recommend leaving $\gamma$ at one.
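As a rough intuition for what a scale factor like $\gamma$ can do, the toy calculation below divides some made-up log-scale offsets by ``gamma`` before converting them into proportions. The functional form, the helper ``toy_proportions`` and the numbers are all assumptions chosen for illustration; this is not dtangle's estimator.

```{r}
# Illustrative only: made-up log2 offsets between a mixture's marker
# expression and the matching reference expression, one per cell type.
toy_offset <- c(Liver = -1.0, Brain = -2.0, Lung = -3.0)

# Assumed form: divide the log2 offsets by gamma, return to the linear
# scale, and normalize so that the proportions sum to one.
toy_proportions <- function(offset, gamma = 1) {
  w <- 2^(offset / gamma)
  w / sum(w)
}

round(toy_proportions(toy_offset, gamma = 1), 3)
round(toy_proportions(toy_offset, gamma = 0.7), 3)
```

In this toy form a value of ``gamma`` below one stretches the offsets, so the estimate concentrates more on the cell type with the largest offset, while ``gamma = 1`` leaves the offsets untouched.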
# 4. summary\_fn
There is a final optional parameter summary\_fn which controls how dtangle aggregates the gene expressions for estimation. dtangle is a very robust algorithm and so the default summary function is the mean. One can specify the median as a more robust option, or any other summary statistic if desired. We recommend either the mean or the median.
```{r}
dt_out = dtangle(Y = mixture_samples,references = ref_reduced,summary_fn=median)
head(dt_out$estimates)
dt_out = dtangle(Y = mixture_samples,references = ref_reduced,summary_fn=mean)
head(dt_out$estimates)
```
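The difference between the two recommended choices is easy to see on a handful of marker expressions. The numbers below are invented for illustration and are not taken from the data set.

```{r}
# Made-up log2 expressions of five marker genes in one sample; the last
# value plays the role of a badly behaved marker.
toy_marker_expr <- c(8.1, 7.9, 8.0, 8.2, 12.5)

# The mean is pulled toward the outlier, while the median is not.
mean(toy_marker_expr)
median(toy_marker_expr)
```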