breault_betacatenin_rnaseq/qc-summary.Rmd

---
output:
  knitrBootstrap::bootstrap_document:
    theme: readable
    highlight: zenburn
    theme.chooser: TRUE
    highlight.chooser: TRUE
  html_document:
    toc: true
    highlight: zenburn
---

```{r setup, echo=FALSE}
knitr::opts_chunk$set(tidy=TRUE, highlight=TRUE, dev="png",
               cache=FALSE, highlight=TRUE, autodep=TRUE, warning=FALSE, error=FALSE,
               message=FALSE, prompt=TRUE, comment='', fig.cap='', bootstrap.show.code=FALSE)
library(rmarkdown)
library(knitrBootstrap)
```

# QC Overview
QC report for RNA-Seq on Adrenal glands from WT (Ctr), beta catenin KO (KO), and beta catenin gain-of-function (ex3) mice. 
<br>
Client: Emanuele Pignatti from David Breault's lab
<br>
Analyst: Radhika Khetani at the Harvard Chan Bioinformatics Core

```{r qc-setup}
library(ggplot2)
library(reshape)
library(gplots)
library(edgeR)
library(CHBUtils)
library(pheatmap)
project_summary = "/Users/rkhetani/Dropbox/HBC consults/Breault_betacatenin_RNA-Seq/breault-betac-bcbio/2015-08-25_breault_rnaseq/project-summary.csv"
counts_file = "/Users/rkhetani/Dropbox/HBC consults/Breault_betacatenin_RNA-Seq/breault-betac-bcbio/2015-08-25_breault_rnaseq/combined.counts"
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
"#0072B2", "#D55E00", "#CC79A7")
summarydata = data.frame(read.table(project_summary, header=TRUE, sep=","), row.names="Name", check.rows=FALSE)
summarydata$Name = rownames(summarydata)
summarydata = summarydata[order(summarydata$Name),]
counts = read.table(counts_file, header=TRUE, row.names="id", check.names=FALSE)
counts = counts[, order(colnames(counts))]
# this is a list of all non user-supplied metadata columns that could appear
known_columns = c("Name", "X.GC", "Exonic.Rate", "Sequences.flagged.as.poor.quality",
    "rRNA.rate", "Fragment.Length.Mean", "Intronic.Rate", "Intergenic.Rate",
    "Mapping.Rate", "Quality.format", "Duplication.Rate.of.Mapped", "Mapped",
    "rRNA", "Sequence.length", "Transcripts.Detected", "Mean.Per.Base.Cov.",
    "Genes.Detected", "Unique.Starts.Per.Read", "unique_starts_per_read",
    "complexity", "X5.3.bias")
metadata = summarydata[, !colnames(summarydata) %in% known_columns, drop=FALSE]
```

```{r heatmap-function}
get_heatmap_fn = function(summarydata) {
    # return the pheatmap function with or without metadata
    metadata = summarydata[, !colnames(summarydata) %in% known_columns, drop=FALSE]
    if(ncol(metadata) == 0) {
       return(pheatmap)
    }
    else {
    # rownames(metadata) = summarydata$Name
    heatmap_fn = function(data, ...) {
        pheatmap(data, annotation=metadata, ...)
    }
    return(heatmap_fn)
}}
heatmap_fn = get_heatmap_fn(summarydata)
```

# Quality control metrics
First look at the data quality for the 18 samples (3 groups with 6 replicates). These metrics are divided into 2 groups, the mapping information and the gene-counts based analysis.

## Mapping information
Information about how the reads mapped to the genome, and a look at consistency across the samples

### Mapped reads
Differing number of mapped reads across the samples, but this is most likely due to the differing number of starting reads. Good number of reads in each sample, about 25 million on average.
```{r mapped-plot}
ggplot(summarydata, aes(x=Name, y=Mapped)) +
    theme_bw(base_size=10) +
    theme(panel.grid.major = element_line(size = .5, color = "grey"),
          axis.text.x = element_text(angle=90)) +
    geom_bar(stat="identity") +
    ylab("mapped reads") + xlab("")
```

### Genomic mapping rate
This is a better plot for comparing mapping between samples, it denotes how many of the total reads in each sample were mapped to the mouse genome (mm10). 
<br>
Except for `1369-8` all other samples have great mapping rates that are very similar to each other. `1369-8` is passable.
```{r mapping-rate-plot}
ggplot(summarydata, aes(x=Name, y=Mapping.Rate)) +
    geom_bar(stat="identity") +
    ylab("mapping rate") + xlab("") +
    theme_bw(base_size=10) +
    theme(panel.grid.major = element_line(size = .5, color = "grey"),
          axis.text.x = element_text(angle=90))
```

### Unique mapping rate
This plot denotes how many of the mapped reads mapped to a single location on the genome. This is an important metric because we will only be looking at these uniquely mapping reads for the statistical analysis downstream. 
<br>Overall, the samples look very good.
```{r unique-rate-plot}
dd = data.frame(Name=names(counts), Unique=colSums(counts), Mapped=summarydata[,"Mapped"])
ggplot(dd, aes(x=Name, y=Unique/Mapped)) +
    geom_bar(stat="identity") +
    ylab("unique mapping rate") + xlab("") +
    theme_bw(base_size=10) +
    theme(panel.grid.major = element_line(size = .5, color = "grey"),
          axis.text.x = element_text(angle=90))
```

### Exonic mapping rate
This plot denotes how many of the mapped reads map to known exons regions.
<br>Overall, the samples look very good again.
```{r exonic-mapping-plot}
ggplot(summarydata, aes(x=Name, y=Exonic.Rate)) +
    geom_bar(stat="identity") +
    theme_bw(base_size=10) +
    theme(panel.grid.major = element_line(size = .5, color = "grey"),
          axis.text.x = element_text(angle=90)) +
    ylab("exonic mapping rate") + xlab("")
```

### rRNA mapping rate
This plot denotes how many of the mapped reads map to rRNA genes; all but `1369-4` look great. Since `1369-4` looks good in all other respects, it is okay to include it in further analyses.
```{r rRNA-rate-plot}
ggplot(summarydata, aes(x=Name, y=rRNA.rate)) +
    geom_bar(stat="identity") +
    theme_bw(base_size=10) +
    theme(panel.grid.major = element_line(size = .5, color = "grey"),
          axis.text.x = element_text(angle=90)) +
    ylab("rRNA rate") + xlab("")
```

## Genic information, correlation plots and sample clustering
This is the second category of QC metrics that are all related to genic read counts.

### Number of genes detected
The numbers of genes detected and the consistency are very good.

```{r genes-detected-plot}
dd = data.frame(Name=names(counts), Genes.Detected = colSums(counts > 0))
ggplot(dd, aes(x=Name, y=Genes.Detected)) +
    geom_bar(stat="identity") +
    theme_bw(base_size=10) +
    theme(panel.grid.major = element_line(size = .5, color = "grey"),
          axis.text.x = element_text(angle=90)) +
    ylab("genes detected") + xlab("")
```

### Boxplot of log10 counts per gene
This plot demonstrates the consistency (or lack thereof) between samples with respect to counts per gene. The samples look okay, but not great. Normalization is essential for such a comparison (see the next plot).
```{r boxplot-raw}
melted = melt(counts)
colnames(melted) = c("sample", "count")
melted$sample = factor(melted$sample)
#melted$sample = reorder(melted$sample, colnames(counts))
melted$count = log(melted$count)
ggplot(melted, aes(x=sample, y=count)) + geom_boxplot() +
    theme_bw(base_size=10) +
    theme(panel.grid.major = element_line(size = .5, color = "grey"),
          axis.text.x = element_text(angle=90)) + xlab("")
```

### Boxplot of log10 TMM-normalized counts per gene
This plot demonstrates the consistency (or lack thereof) between samples with respect to counts per gene also, but after the samples have been normalized (using the TMM method) so that we are making a fair comparison.
Trimmed mean of M-values (TMM) normalization is described [here](http://genomebiology.com/2010/11/3/R25)
<br>Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3). doi:10.1186/gb-2010-11-3-r25

After normalization the samples look very consistent across the board. This is quite good.
```{r boxplot-normalized}
y = DGEList(counts=counts)
y = calcNormFactors(y)
normalized_counts = cpm(y, normalized.lib.sizes=TRUE)
melted = melt(normalized_counts)
colnames(melted) = c("gene", "sample", "count")
melted$sample = factor(melted$sample)
#melted$sample = reorder(melted$sample, colnames(counts))
melted$count = log(melted$count)
ggplot(melted, aes(x=sample, y=count)) + geom_boxplot() +
    theme_bw(base_size=10) +
    theme(panel.grid.major = element_line(size = .5, color = "grey"),
          axis.text.x = element_text(angle=90)) + xlab("")
```

### Density of log10 TMM-normalized counts
In this plot we expect the pattern of count density to be consistent between the samples, all but 1 sample look fairly consistent.
```{r density-normalized}
ggplot(melted, aes(x=count, group=sample)) +
    geom_density() +
    theme_bw(base_size=10) +
    theme(panel.grid.major = element_line(size = .5, color = "grey"),
          axis.text.x = element_text(angle=90)) + xlab("")
```

### Correlation (Pearson) heatmap of TMM-normalized counts
This plot denotes the parametric correlation with respect to gene counts for all samples versus all other samples. The expectation would be that replicates from the same group will cluster together. 
<br>In this case the clustering (colored boxes at the top of the heatmap) is okay.
```{r pearson-heatmap-normalized}
heatmap_fn(cor(normalized_counts, method="pearson"))
```

### Correlation (Spearman) heatmap of TMM-normalized counts
This plot also denotes the correlation with respect to gene counts for all samples versus all other samples, but it is using a non-parametric or ranked method for assessing the correlation. 
<br>The clustering is a little different from above (parametric), but it is still just okay.
```{r spearman-heatmap-normalized}
heatmap_fn(cor(normalized_counts, method="spearman"))
```

### PCA
PCA (principal components analysis) is a multivariate technique that allows us to summarize the systematic patterns of variations in the data. PCA takes the expresson levels for all probes and transforms it in principal component space, reducing each sample into one point (as coordinates within that space). This allows us to separate samples according to expression variation, and identify potential outliers. Basically, the PCA analysis is yet another way to look at how samples are clustering. 

This first PCA plot (PC1 against PC2) is looking at the normalized count data, and the replicates are not clustering together.

```{r pca-normalized}
pca_matrix <- prcomp(t(normalized_counts))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
 
ggplot(df, aes(PC1, PC2, color = treatment)) + 
     geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
     scale_x_continuous(expand = c(0.3,  0.3)) +
     ggtitle('PC1 vs PC2 :: normalized counts') 
```

This second plot is looking at the PCA analysis of log2 of the normalized count data. In this case the replicates look like they are clustering together a little better.

```{r pca-log-normalized}
# PCA on logged values
normalized_counts2 <- read.table("~/Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm_log2")

pca_matrix <- prcomp(t(normalized_counts2))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
 
ggplot(df, aes(PC1, PC2, color = treatment)) + 
     geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
     scale_x_continuous(expand = c(0.3,  0.3)) +
     ggtitle('PC1 vs PC2 :: log2(normalized counts)') 
```


```{r pca-normalized-tests, echo=FALSE, eval=FALSE}
write.table(normalized_counts,file = "../Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm")

# PCA on genes with an average of 1 normalized count across all samples
normalized_counts_avg <- read.table("../Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm_1avg", row.names = 1, header=T)

pca_matrix <- prcomp(t(normalized_counts_avg))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
 
ggplot(df, aes(PC1, PC2, color = treatment)) + 
     geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
     scale_x_continuous(expand = c(0.3,  0.3))

write.table(log(normalized_counts_avg,2),file = "../Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm_1avg.log2")

normalized_counts_avg.log <- read.table("../Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm_1avg.log2", row.names = 1, header=T)

pca_matrix <- prcomp(t(normalized_counts_avg))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
 
ggplot(df, aes(PC1, PC2, color = treatment)) + 
     geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
     scale_x_continuous(expand = c(0.3,  0.3))


# PCA on top 500 (based on average count across all samples)
normalized_counts_top <- read.table("../Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm_500", row.names = 1, header=T)

pca_matrix <- prcomp(t(normalized_counts_top))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
 
ggplot(df, aes(PC1, PC2, color = treatment)) + 
     geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
     scale_x_continuous(expand = c(0.3,  0.3))

pca_matrix <- prcomp(log(t(normalized_counts_top),2))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
 
ggplot(df, aes(PC1, PC2, color = treatment)) + 
     geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
     scale_x_continuous(expand = c(0.3,  0.3))
```

### Heatmap of top 30 most expressed genes
Here is a quick look at what the highest expressing (average across all samples) 30 genes are and their behavior across the groups.
```{r top-count-genes, results='asis'}
select = order(rowMeans(normalized_counts),decreasing=TRUE)[1:30]
pheatmap(normalized_counts[select,], cluster_rows = F, annotation = metadata)
```

### Conclusions
The take home from the clustering analysis is that we should expect a lot of variance within the groups, and this might impact how many false negatives we get from the final differential analysis. Having said that, none of the replicates are sticking out as being outliers overall.