/
qc-summary.Rmd
294 lines (249 loc) · 13.9 KB
/
qc-summary.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
---
output:
knitrBootstrap::bootstrap_document:
theme: readable
highlight: zenburn
theme.chooser: TRUE
highlight.chooser: TRUE
html_document:
toc: true
highlight: zenburn
---
```{r setup, echo=FALSE}
knitr::opts_chunk$set(tidy=TRUE, highlight=TRUE, dev="png",
cache=FALSE, highlight=TRUE, autodep=TRUE, warning=FALSE, error=FALSE,
message=FALSE, prompt=TRUE, comment='', fig.cap='', bootstrap.show.code=FALSE)
library(rmarkdown)
library(knitrBootstrap)
```
# QC Overview
QC report for RNA-Seq on Adrenal glands from WT (Ctr), beta catenin KO (KO), and beta catenin gain-of-function (ex3) mice.
<br>
Client: Emanuele Pignatti from David Breault's lab
<br>
Analyst: Radhika Khetani at the Harvard Chan Bioinformatics Core
```{r qc-setup}
library(ggplot2)
library(reshape)
library(gplots)
library(edgeR)
library(CHBUtils)
library(pheatmap)
project_summary = "/Users/rkhetani/Dropbox/HBC consults/Breault_betacatenin_RNA-Seq/breault-betac-bcbio/2015-08-25_breault_rnaseq/project-summary.csv"
counts_file = "/Users/rkhetani/Dropbox/HBC consults/Breault_betacatenin_RNA-Seq/breault-betac-bcbio/2015-08-25_breault_rnaseq/combined.counts"
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
"#0072B2", "#D55E00", "#CC79A7")
summarydata = data.frame(read.table(project_summary, header=TRUE, sep=","), row.names="Name", check.rows=FALSE)
summarydata$Name = rownames(summarydata)
summarydata = summarydata[order(summarydata$Name),]
counts = read.table(counts_file, header=TRUE, row.names="id", check.names=FALSE)
counts = counts[, order(colnames(counts))]
# this is a list of all non user-supplied metadata columns that could appear
known_columns = c("Name", "X.GC", "Exonic.Rate", "Sequences.flagged.as.poor.quality",
"rRNA.rate", "Fragment.Length.Mean", "Intronic.Rate", "Intergenic.Rate",
"Mapping.Rate", "Quality.format", "Duplication.Rate.of.Mapped", "Mapped",
"rRNA", "Sequence.length", "Transcripts.Detected", "Mean.Per.Base.Cov.",
"Genes.Detected", "Unique.Starts.Per.Read", "unique_starts_per_read",
"complexity", "X5.3.bias")
metadata = summarydata[, !colnames(summarydata) %in% known_columns, drop=FALSE]
```
```{r heatmap-function}
get_heatmap_fn = function(summarydata) {
# return the pheatmap function with or without metadata
metadata = summarydata[, !colnames(summarydata) %in% known_columns, drop=FALSE]
if(ncol(metadata) == 0) {
return(pheatmap)
}
else {
# rownames(metadata) = summarydata$Name
heatmap_fn = function(data, ...) {
pheatmap(data, annotation=metadata, ...)
}
return(heatmap_fn)
}}
heatmap_fn = get_heatmap_fn(summarydata)
```
# Quality control metrics
First look at the data quality for the 18 samples (3 groups with 6 replicates). These metrics are divided into 2 groups, the mapping information and the gene-counts based analysis.
## Mapping information
Information about how the reads mapped to the genome, and a look at consistency across the samples
### Mapped reads
Differing number of mapped reads across the samples, but this is most likely due to the differing number of starting reads. Good number of reads in each sample, about 25 million on average.
```{r mapped-plot}
ggplot(summarydata, aes(x=Name, y=Mapped)) +
theme_bw(base_size=10) +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=90)) +
geom_bar(stat="identity") +
ylab("mapped reads") + xlab("")
```
### Genomic mapping rate
This is a better plot for comparing mapping between samples, it denotes how many of the total reads in each sample were mapped to the mouse genome (mm10).
<br>
Except for `1369-8` all other samples have great mapping rates that are very similar to each other. `1369-8` is passable.
```{r mapping-rate-plot}
ggplot(summarydata, aes(x=Name, y=Mapping.Rate)) +
geom_bar(stat="identity") +
ylab("mapping rate") + xlab("") +
theme_bw(base_size=10) +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=90))
```
### Unique mapping rate
This plot denotes how many of the mapped reads mapped to a single location on the genome. This is an important metric because we will only be looking at these uniquely mapping reads for the statistical analysis downstream.
<br>Overall, the samples look very good.
```{r unique-rate-plot}
dd = data.frame(Name=names(counts), Unique=colSums(counts), Mapped=summarydata[,"Mapped"])
ggplot(dd, aes(x=Name, y=Unique/Mapped)) +
geom_bar(stat="identity") +
ylab("unique mapping rate") + xlab("") +
theme_bw(base_size=10) +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=90))
```
### Exonic mapping rate
This plot denotes how many of the mapped reads map to known exons regions.
<br>Overall, the samples look very good again.
```{r exonic-mapping-plot}
ggplot(summarydata, aes(x=Name, y=Exonic.Rate)) +
geom_bar(stat="identity") +
theme_bw(base_size=10) +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=90)) +
ylab("exonic mapping rate") + xlab("")
```
### rRNA mapping rate
This plot denotes how many of the mapped reads map to rRNA genes; all but `1369-4` look great. Since `1369-4` looks good in all other respects, it is okay to include it in further analyses.
```{r rRNA-rate-plot}
ggplot(summarydata, aes(x=Name, y=rRNA.rate)) +
geom_bar(stat="identity") +
theme_bw(base_size=10) +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=90)) +
ylab("rRNA rate") + xlab("")
```
## Genic information, correlation plots and sample clustering
This is the second category of QC metrics that are all related to genic read counts.
### Number of genes detected
The numbers of genes detected and the consistency are very good.
```{r genes-detected-plot}
dd = data.frame(Name=names(counts), Genes.Detected = colSums(counts > 0))
ggplot(dd, aes(x=Name, y=Genes.Detected)) +
geom_bar(stat="identity") +
theme_bw(base_size=10) +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=90)) +
ylab("genes detected") + xlab("")
```
### Boxplot of log10 counts per gene
This plot demonstrates the consistency (or lack thereof) between samples with respect to counts per gene. The samples look okay, but not great. Normalization is essential for such a comparison (see the next plot).
```{r boxplot-raw}
melted = melt(counts)
colnames(melted) = c("sample", "count")
melted$sample = factor(melted$sample)
#melted$sample = reorder(melted$sample, colnames(counts))
melted$count = log(melted$count)
ggplot(melted, aes(x=sample, y=count)) + geom_boxplot() +
theme_bw(base_size=10) +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=90)) + xlab("")
```
### Boxplot of log10 TMM-normalized counts per gene
This plot demonstrates the consistency (or lack thereof) between samples with respect to counts per gene also, but after the samples have been normalized (using the TMM method) so that we are making a fair comparison.
Trimmed mean of M-values (TMM) normalization is described [here](http://genomebiology.com/2010/11/3/R25)
<br>Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3). doi:10.1186/gb-2010-11-3-r25
After normalization the samples look very consistent across the board. This is quite good.
```{r boxplot-normalized}
y = DGEList(counts=counts)
y = calcNormFactors(y)
normalized_counts = cpm(y, normalized.lib.sizes=TRUE)
melted = melt(normalized_counts)
colnames(melted) = c("gene", "sample", "count")
melted$sample = factor(melted$sample)
#melted$sample = reorder(melted$sample, colnames(counts))
melted$count = log(melted$count)
ggplot(melted, aes(x=sample, y=count)) + geom_boxplot() +
theme_bw(base_size=10) +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=90)) + xlab("")
```
### Density of log10 TMM-normalized counts
In this plot we expect the pattern of count density to be consistent between the samples, all but 1 sample look fairly consistent.
```{r density-normalized}
ggplot(melted, aes(x=count, group=sample)) +
geom_density() +
theme_bw(base_size=10) +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=90)) + xlab("")
```
### Correlation (Pearson) heatmap of TMM-normalized counts
This plot denotes the parametric correlation with respect to gene counts for all samples versus all other samples. The expectation would be that replicates from the same group will cluster together.
<br>In this case the clustering (colored boxes at the top of the heatmap) is okay.
```{r pearson-heatmap-normalized}
heatmap_fn(cor(normalized_counts, method="pearson"))
```
### Correlation (Spearman) heatmap of TMM-normalized counts
This plot also denotes the correlation with respect to gene counts for all samples versus all other samples, but it is using a non-parametric or ranked method for assessing the correlation.
<br>The clustering is a little different from above (parametric), but it is still just okay.
```{r spearman-heatmap-normalized}
heatmap_fn(cor(normalized_counts, method="spearman"))
```
### PCA
PCA (principal components analysis) is a multivariate technique that allows us to summarize the systematic patterns of variations in the data. PCA takes the expresson levels for all probes and transforms it in principal component space, reducing each sample into one point (as coordinates within that space). This allows us to separate samples according to expression variation, and identify potential outliers. Basically, the PCA analysis is yet another way to look at how samples are clustering.
This first PCA plot (PC1 against PC2) is looking at the normalized count data, and the replicates are not clustering together.
```{r pca-normalized}
pca_matrix <- prcomp(t(normalized_counts))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
ggplot(df, aes(PC1, PC2, color = treatment)) +
geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
scale_x_continuous(expand = c(0.3, 0.3)) +
ggtitle('PC1 vs PC2 :: normalized counts')
```
This second plot is looking at the PCA analysis of log2 of the normalized count data. In this case the replicates look like they are clustering together a little better.
```{r pca-log-normalized}
# PCA on logged values
normalized_counts2 <- read.table("~/Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm_log2")
pca_matrix <- prcomp(t(normalized_counts2))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
ggplot(df, aes(PC1, PC2, color = treatment)) +
geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
scale_x_continuous(expand = c(0.3, 0.3)) +
ggtitle('PC1 vs PC2 :: log2(normalized counts)')
```
```{r pca-normalized-tests, echo=FALSE, eval=FALSE}
write.table(normalized_counts,file = "../Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm")
# PCA on genes with an average of 1 normalized count across all samples
normalized_counts_avg <- read.table("../Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm_1avg", row.names = 1, header=T)
pca_matrix <- prcomp(t(normalized_counts_avg))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
ggplot(df, aes(PC1, PC2, color = treatment)) +
geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
scale_x_continuous(expand = c(0.3, 0.3))
write.table(log(normalized_counts_avg,2),file = "../Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm_1avg.log2")
normalized_counts_avg.log <- read.table("../Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm_1avg.log2", row.names = 1, header=T)
pca_matrix <- prcomp(t(normalized_counts_avg))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
ggplot(df, aes(PC1, PC2, color = treatment)) +
geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
scale_x_continuous(expand = c(0.3, 0.3))
# PCA on top 500 (based on average count across all samples)
normalized_counts_top <- read.table("../Dropbox//HBC consults/Breault_betacatenin_RNA-Seq//breault-betac-bcbio/2015-08-25_breault_rnaseq//combined.counts.norm_500", row.names = 1, header=T)
pca_matrix <- prcomp(t(normalized_counts_top))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
ggplot(df, aes(PC1, PC2, color = treatment)) +
geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
scale_x_continuous(expand = c(0.3, 0.3))
pca_matrix <- prcomp(log(t(normalized_counts_top),2))$x
df <- cbind(metadata, pca_matrix[,c('PC1', 'PC2')])
ggplot(df, aes(PC1, PC2, color = treatment)) +
geom_text(aes(PC1, PC2, label = row.names(df)), size = 5, hjust=0.1, vjust=0.1) +
scale_x_continuous(expand = c(0.3, 0.3))
```
### Heatmap of top 30 most expressed genes
Here is a quick look at what the highest expressing (average across all samples) 30 genes are and their behavior across the groups.
```{r top-count-genes, results='asis'}
select = order(rowMeans(normalized_counts),decreasing=TRUE)[1:30]
pheatmap(normalized_counts[select,], cluster_rows = F, annotation = metadata)
```
### Conclusions
The take home from the clustering analysis is that we should expect a lot of variance within the groups, and this might impact how many false negatives we get from the final differential analysis. Having said that, none of the replicates are sticking out as being outliers overall.