Branch: master
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
124 lines (93 sloc) 5.03 KB
title: "dtangle with scRNA-seq reference"
author: "Saskia Freytag and Greg Hunt"
date: "20/02/2018"
variant: markdown_github
pdf_document: default
html_document: default
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
In this vignette we will work through an example of deconvolving brain cell type proportions from DNA microarray data. We work with a dataset created from post-mortem human brains and introduced by Lu et al. This is available on GEO with accession GSE53890. We will refer to this dataset as the mixture dataset. The reference data set we will work with is subset of the Darmanis et al single cell brain data. This data is currently available at <>. The GEO accession number for this dataset is GSE67835. First load up the datasets.
```{r DataLoad, message = FALSE}
gse53890 <- getGEO('GSE53890', GSEMatrix= TRUE)
me <- as.matrix(exprs(gse53890[[1]]))
Unfortunately, the two datasets use different gene identifiers. We will have to convert the gene identifiers in the mixture dataset to match those in the reference dataset.
```{r GeneIdentifiers, message=FALSE}
k <- keys(hgu133plus2.db, keytype="PROBEID")
gene_anno <- mapIds(hgu133plus2.db, keys = k, column = c("SYMBOL"), keytype="PROBEID")
index <- pmatch(rownames(me), names(gene_anno))
rownames(me) <- gene_anno[index]
In the next step, we normalize the mixture datasets using simple background correction. Normally, we would also need to convert the expression in the single cell reference dataset to log cpms. However, this is already done in this case. If this is not the case in your dataset it can be done using the `edgeR::cpm` function (make sure to add a prior count). Normalization will allow us to combine the count data from the scRNA-seq and expression measurements from the microarrays.
```{r normalization, message=FALSE}
me <- backgroundCorrect(me, method='normexp')
Next, we need to find the genes that are common to both datasets and join the datasets using these. After combining the datasets, we will apply quantile normalization to these datasets in order to ensure that they are indeed comparable.
```{r join}
commongenes <- intersect (rownames(me), rownames(sce))
me <- me[pmatch(commongenes, rownames(me)), ]
sce <- sce[pmatch(commongenes, rownames(sce)), ]
y <- cbind(sce, me)
y <- normalizeBetweenArrays(y)
y <- t(y)
In order to apply dtangle, we need to create an object that records can the pure samples of each cell type.
```{r format}
all_cell_type <- unique(anno$CellType)
pure_samples <- lapply(1:length(all_cell_type), function(i) {
which(anno$CellType == all_cell_type[i])
names(pure_samples) = all_cell_type
We are finally ready to apply dtangle. The first step in running dtangle is to identify marker genes for each cell type. These may be provided by the scientist if they are already known or may be determined by dtangle or another algorithm. To find marker genes using dtangle we pass the following arguments to the `find_markers` function:
1. the data matrix, `Y`,
2. the list of pure samples for each type, `pure_samples`,
3. the data type, `data_type`,
4. the method used to rank markers, `marker_method`.
After we have ranked our markers with find_markers we need to determine how many markers to use for each cell type. The simplest way to do this is to choose, say, the top 10% of all marker genes for each type.
```{r markers}
marker_list = find_markers(y,pure_samples=pure_samples,data_type="rna-seq",marker_method='ratio')
q = .1
quantiles = lapply(marker_list$V,function(x)quantile(x,1-q))
K = length(pure_samples)
n_markers = sapply(1:K,function(i){max(which(marker_list$V[[i]] > quantiles[[i]]))})
Now that we have ranked the genes as markers for each type and chosen how many marker genes to use for each cell type we can run the dtangle deconvolution algorithm.
providing to the dtangle function the arguments:
1. `Y`, our data matrix
2. the list of `pure_samples`
3. the number of markers to use for each cell type, `n_markers`
4. the data_type
5. the list of ranked markers (output from `find_markers`) to the markers argument.
```{r deconvo}
marks = marker_list$L
dc <- dtangle(y, pure_samples=pure_samples, n_markers=n_markers, data_type = 'microarray-gene', markers = marks)
Finally, we can extract the proportion estimates for our mixture dataset.
```{r extract}
final_est <- dc$estimates[(dim(sce)[2]+1):dim(y)[1],]
colnames(final_est) <- all_cell_type
We can now plot the proportion estimates.
```{r plot}
plot_data <- melt(final_est)
colnames(plot_data) <- c("Sample", "Cell Type", "Proportion")
plot_data$Proportion <- as.numeric(plot_data$Proportion)
ggplot(plot_data, aes(x = `Cell Type`, y=Proportion))+geom_violin(aes(fill = `Cell Type`)) + geom_jitter(height = 0, width = 0.1)