Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
645 lines (444 sloc) 21 KB
output
knitrBootstrap::bootstrap_document html_document
theme highlight theme.chooser highlight.chooser
readable
zenburn
true
true
toc highlight
true
zenburn

last update r date() by @lopantano

library(knitr)

knitr::opts_chunk$set(tidy=TRUE, highlight=TRUE, dev="png", fig.width=9,fig.heigh=6,
               cache=FALSE, highlight=TRUE, autodep=TRUE, warning=FALSE, error=FALSE, echo=FALSE,
               message=FALSE, prompt=TRUE, comment='', fig.cap='', bootstrap.show.code=FALSE)

root_path = "~/orch/scratch/simulator/mirqc/bcbio/config/work2/report"
root_other_path = "~/orch/scratch/simulator/mirqc/bcbio/non_mirqc_bcbio/latest/report"

For replication go here

Document with R code go here


library(ggplot2)
library(reshape)
library(DESeq2)
library(genefilter)
library(CHBUtils)
library(gtools)
library(gridExtra)
library(devtools)
library(dplyr)
library(isomiRs)
library(edgeR)
library(pheatmap)
library(rmarkdown)
library(knitrBootstrap)
#render("ready_report.rmd")

root_file = paste0(root_path,"/report/")

condition = "condition"

Overview

mirRQC project paper

samples overview:

Universal Human miRNA reference RNA (Agilent Technologies, #750700), human brain total RNA (Life Technologies, #AM6050), human liver total RNA (Life Technologies, #AM7960) and MS2-phage RNA (Roche, #10165948001) were diluted to a platform-specific concentration. RNA integrity and purity were evaluated using the Experion automated gel electrophoresis system (Bio-Rad) and Nanodrop spectrophotometer. All RNA samples were of high quality (miRQC A: RNA quality index (RQI, scale from 0 to 10) = 9.0; miRQC B: RQI = 8.7; human liver RNA: RQI = 9.2) and high purity (data not shown). RNA was isolated from serum prepared from three healthy donors using the miRNeasy mini kit (Qiagen) according to the manufacturer's instructions, and RNA samples were pooled. Informed consent was obtained from all donors (Ghent University Ethical Committee). Different kits for isolation of serum RNA are available; addressing their impact was outside the scope of this work. Synthetic miRNA templates for let-7a-5p, let-7b-5p, let-7c, let-7d-5p, miR-302a-3p, miR-302b-3p, miR-302c-3p, miR-302d-3p, miR-133a and miR-10a-5p were synthesized by Integrated DNA Technologies and 5′ phosphorylated. Synthetic let-7 and miR-302 miRNAs were spiked into MS2-phage RNA and total human liver RNA, respectively, at 5 × 106 copies/μg RNA. These samples do not contain endogenous miR-302 or let-7 miRNAs, which allowed unbiased analysis of cross-reactivity between the individual miR-302 and let-7 miRNAs measured by the platform and the different miR-302 and let-7 synthetic templates in a complex RNA background. Synthetic miRNA templates for miR-10a-5p, let-7a-5p, miR-302a-3p and miR-133a were spiked in human serum RNA at 6 × 103 copies per microliter of serum RNA or at 5-times higher, 2-times higher, 2-times lower and 5-times lower concentrations, respectively. All vendors received 10 μl of each serum RNA sample.

samples

# setwd(root_path)
files = read.table(file.path(root_path, "summary_re.csv"), sep=",",header=T,colClasses = "character")

samples = files[,"sample_id"]

names_stats = files[,"size_stats"]
names(names_stats) = samples

groups = files[,"group"]
names(groups) = samples

summarydata = data.frame(row.names=samples,samples=samples,group=groups)
design <- data.frame(row.names=files$sample_id, condition=files$group)


files_other = read.table(file.path(root_other_path, "summary_re.csv"), sep=",",header=T,colClasses = "character")
design_other <- data.frame(row.names=files_other$sample_id, condition=files_other$group)


Exploratory analysis

In this section we will see exploratory figures about quality of the data, reads with adapter, reads mapped to miRNAs and reads mapped to other small RNAs.

Size distribution

After adapter removal, we can plot the size distribution of the small RNAs. In a normal small RNA sample, we should see a peak at 22/23 and maybe another at 26 or 31 depending on the biological background.

tab = data.frame()
for (sample in samples){
    d = read.table(file.path(root_path, names_stats[sample]), sep=" ")
    tab = rbind(tab, d %>% mutate(sample=sample, group=groups[sample]))
}


reads_adapter = tab %>% group_by(sample, group) %>% summarise(total=sum(V2))
ggplot(reads_adapter, aes(x=sample,y=total,fill=group)) +
    geom_bar(stat="identity", position = "dodge") +
    ggtitle("total number of reads with adapter") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))


ggplot(tab, aes(x=V1,y=V2,group=sample)) +
    geom_bar(stat="identity", position = "dodge") +
    facet_wrap(~group, ncol=2)+
    ggtitle("size distribution") + ylab("abundance") + xlab("size") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

miRNA

Total miRNA expression annotated with mirbase

mi_files = file.path(root_path, files[,"miraligner"])
row.names(design) = samples

obj <- IsomirDataSeqFromFiles(files = mi_files, design = design ,header = T)
ggplot( data.frame(sample=colnames(counts(obj)), total=colSums(counts(obj)))) +
    geom_bar(aes(x=sample,y=total), stat='identity')+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
mirna_step <- as.data.frame(colSums(counts(obj)))

Distribution of mirna expression

ggplot(melt(counts(obj))) +
    geom_boxplot(aes(x=X2,y=value))+
    scale_y_log10()+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Cumulative distribution of miRNAs


cs <- as.data.frame(apply(counts(obj),2,cumsum))
cs$pos <- 1:nrow(cs)

ggplot((melt(cs,id.vars = "pos")))+
    geom_point(aes(x=pos,y=value,color=variable))+
    scale_y_log10()

Others small RNA

The data was analyzed with seqcluster

This tools used all reads, uniquely mapped and multi-mapped reads. The first step is to cluster sequences in all locations they overlap. The second step is to create meta-clusters: is the unit that merge all clusters that share the same sequences. This way the output are meta-clusters, common sequences that could come from different region of the genome.

Genome covered

cov_stats <- read.table(file.path(root_path, "../align", "seqs_rmlw.bam_cov.tsv"),sep="\t",check.names = F)

kable(cov_stats[cov_stats$V1=="genome",] %>% dplyr::select(coverage=V2,ratio_genome=V5), row.names = FALSE)

The normal value for data with strong small RNA signal in human is: 0.0002

Classification

Number of reads in the data after each step:

  • raw: initial reads
  • cluster: after cluster detection
  • multimap: after meta-cluster detection using all hits
reads_stats <- read.table(file.path(root_path, "../seqcluster", "cluster", "read_stats.tsv"),sep="\t",check.names = F)
ggplot(reads_stats, aes(x=V2, y=V1, fill=V3)) + 
    geom_bar(stat = 'identity', position = 'dodge') +
    labs(list(x="samples", y="reads")) +
    scale_fill_brewer("steps", palette = 'Set1')+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
clus <- read.table(file.path(root_path, "../seqcluster/cluster/counts.tsv"),header=T,sep="\t",row.names=1, check.names = FALSE)
ann <- clus[,2]
toomany <- clus[,1]
clus_ma <- clus[,3:ncol(clus)]
clus_ma = clus_ma[,row.names(design)]

Check complex meta-clusters: This kind of events happen when there are small RNA over the whole genome, and all repetitive small RNAs map to thousands of places and sharing many sequences in many positions. If any meta-cluster is > 40% of the total data, maybe it is worth to add some filters like: minimum number of counts -e or --min--shared in seqcluster prepare

clus_ma_norm = sweep(clus_ma, 2, colSums(clus_ma), "/")
head(clus_ma_norm[toomany>0,])

Until here is an example of the Rmd template that the user can get from seqcluster-helper and render directly in R / Rstudio.

Contribution by class:

rRNA <- colSums(clus_ma[grepl("rRNA",ann) & grepl("miRNA",ann)==F,])
miRNA <- colSums(clus_ma[grepl("miRNA",ann),])
tRNA <- colSums(clus_ma[grepl("tRNA",ann) & grepl("rRNA",ann)==F & grepl("repeat",ann)==F & grepl("miRNA",ann)==F,])
piRNA <- colSums(clus_ma[grepl("piRNA",ann) & grepl("rRNA",ann)==F & grepl("miRNA",ann)==F,])
rmsk <- colSums(clus_ma[grepl("repeat",ann) & grepl("rRNA",ann)==F & grepl("miRNA",ann)==F & grepl("piRNA", ann)==F,])
total <- colSums(clus_ma)

dd <- data.frame(samples=names(rRNA),
                 rRNA=rRNA,
                 miRNA=miRNA,
                 tRNA=tRNA,
                 piRNA=piRNA,
                 rmsk=rmsk,
                total=total)

ggplot(melt(dd)) +
    geom_bar(aes(x=samples,y=value,fill=variable),
             stat='identity',
             position="dodge")+
    scale_fill_brewer(palette = "Set1")+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

dd_norm = dd
dd_norm[,2:5] = sweep(dd[,2:5],1,dd[,6],"/")
ggplot(melt(dd_norm[,1:5])) +
    geom_bar(aes(x=samples,y=value,fill=variable),
             stat='identity',
             position="dodge")+
    scale_fill_brewer(palette = "Set1")+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
    labs(list(title="relative proportion of small RNAs",y="% reads"))

Comparison

library(DESeq2)
library(DEGreport)
library(vsn)

filter_handle <- function(res){
    res_nona <- res[!is.na(res$padj),]
    keep <- res_nona$padj < 0.1 
    res_nona[keep,]
}

handle_deseq2 = function(dds, summarydata, column) {
  all_combs = combn(levels(summarydata[,column]), 2, simplify=FALSE)
  all_results = list()
  contrast_strings = list()
  for(comb in all_combs) {
    contrast_string = paste(comb, collapse="_vs_")
    contrast = c(column, comb)
    res = results(dds, contrast=contrast)
    res = res[order(res$padj),]
    all_results = c(all_results, res)
    contrast_strings = c(contrast_strings, contrast_string)
  }
  names(all_results) = contrast_strings
  return(all_results)
}

plot_MA = function(res){
    for(i in seq(length(res))) {
        plotMA(res[[i]])
        title(paste("MA plot for contrast", names(res)[i]))
    }
}

plot_volcano = function(res){
    for(i in seq(length(res))) {
        stats = as.data.frame(res[[i]][,c(2,6)])
        p = volcano_density_plot(stats, title=names(res)[i], lfc.cutoff=1)
        print(p)
    }
}

do_de = function(raw, summarydata, condition){
    dss = DESeqDataSetFromMatrix(countData = raw[rowMeans(raw)>3,],
                       colData = summarydata,
                       design = ~ condition)
    dss = DESeq(dss)
    plotDispEsts(dss)
    dss
}

do_norm = function(dss, root_path, prefix){
    rlog_ma = assay(rlog(dss))
    count_ma = counts(dss, normalized=TRUE)
    raw = counts(dss, normalized=FALSE)
    fn_log = paste0(root_file, prefix, "log_matrix.txt")
    write.table(rlog_ma,fn_log,sep="\t")
    fn_count = paste0(root_file, prefix, "count_matrix.txt")
    write.table(count_ma,fn_count,sep="\t")
    fn_raw = paste0(root_file, prefix, "raw_matrix.txt")
    write.table(count_ma,fn_raw,sep="\t")
}

miRNA

mirna_c = counts(obj)
mirna_c = mirna_c[,sort(colnames(mirna_c))]

Abundance detection of miRQC samples

There are 4 samples:

  • A: universal human RNA sample
  • B: human brain sample
  • C: 0.25 * A + 0.75 * B
  • D: 0.25 * B + 0.75 * A

If A > B then A > D > C > B

If A < B then A < D < C < B

Note that C/D samples are swapped in the paper and in the GEO web. Text from the paper:

These samples (termed miRQC A–D) consist of 100% Universal Human miRNA Reference RNA (UHmiRR; A), 100% human brain RNA (HBR; B) and two titrations thereof (C = 0.75A + 0.25B and D = 0.25A + 0.75B).

while in the GEO:

Source name miRQC C Organism Homo sapiens Characteristics biomaterial: Mixture of 25% UHRR and 75% HuBr Total RNA

Source name miRQC D Organism Homo sapiens Characteristics biomaterial: Mixture of 75% UHRR and 25% HuBr Total RNA

keep = grepl("miRQC", colnames(mirna_c))
keep_d = grepl("miRQC", rownames(design))
library(edgeR)
dge = DGEList(mirna_c[,keep])
dge = calcNormFactors(dge,method = 'upperquartile')
mirna_n = cpm(dge, normalized.lib.sizes = T, log = T)
mirna_n = mirna_n[rowMedians(mirna_n)>5,]
library(GGally)
GGally::ggpairs(mirna_n, axisLabels = "internal")
mirqca = rowMeans(mirna_n[, 1:2])
mirqcb = rowMeans(mirna_n[, 3:4])
mirqcc = rowMeans(mirna_n[, 5:6])
mirqcd = rowMeans(mirna_n[, 7:8])

top_a = mirqca > mirqcb 
conc_a = top_a & mirqca > mirqcd & mirqcd > mirqcc
top_b = mirqca < mirqcb 
conc_b = top_b & mirqcb > mirqcc & mirqcc > mirqcd

miRNAs which mirqca > mirqcd > mrqcc are r sum(conc_a) out of r sum(top_a) miRNAs which mirqcb > mirqcc > mrqcd are r sum(conc_b) out of r sum(top_b)

ratio expression summary of A/D

kable(t(as.matrix(summary(mirqca[top_a]-mirqcd[top_a]))))

the average logFC is 0.5 that is similar to the expected FC = log2(1/0.75) = 0.5

ratio expression summary of A/C

kable(t(as.matrix(summary(mirqca[top_a]-mirqcc[top_a]))))
      

the average logFC is 1.6 that is similar to the expected FC = log2(1/0.25) = 2

Same happens when comparing B vs D

kable(t(as.matrix(summary(mirqcb[top_b]-mirqcd[top_b]))))
      

and B vs C

kable(t(as.matrix(summary(mirqcb[top_b]-mirqcc[top_b]))))
      

according to this:

miRQC_C = 0.75 * miRQC_B + 0.25 * miRQC_A

miRQC_D = 0.75 * miRQC_A + 0.25 * miRQC_B

that is the same that is in the GEO data set description file.

Specificity


mi_files = file.path(root_other_path, files_other[,"miraligner"])

obj_other <- IsomirDataSeqFromFiles(files = mi_files, design = design_other ,header = T)
obj_other = isoCounts(obj_other)
mirna_c_other = counts(obj_other)

we spiked in 8 synthetic miRNAs from two miRNA families into human liver RNA (miR-302a/b/c/d) or MS2-phage RNA (let-7a/b/c/d)

We should only see those miRNAs in those samples and not in anywhere else.


kable(isoSelect(obj_other, mirna="hsa-let-7a-5p", minc=2)[,grepl("MS", rownames(design_other))])

kable(isoSelect(obj_other, mirna="hsa-let-7b-5p", minc=2)[,grepl("MS", rownames(design_other))])

kable(isoSelect(obj_other, mirna="hsa-let-7c-5p", minc=2)[,grepl("MS", rownames(design_other))])

kable(isoSelect(obj_other, mirna="hsa-let-7d-5p", minc=2)[,grepl("MS", rownames(design_other))])

mirnas = c("hsa-let-7a-5p", "hsa-let-7b-5p", "hsa-let-7c-5p", "hsa-let-7d-5p")
ma_sub = counts(obj_other)[mirnas, grepl("MS", rownames(design_other))]
ma_sub[ma_sub<=2] = 0
ma = log2( ma_sub + 0.25 )
pheatmap(ma)

kable(isoSelect(obj_other, mirna="hsa-miR-302a-3p", minc=2)[,grepl("liver", rownames(design_other))])

kable(isoSelect(obj_other, mirna="hsa-miR-302b-3p", minc=2)[,grepl("liver", rownames(design_other))])

kable(isoSelect(obj_other, mirna="hsa-miR-302c-3p", minc=2)[,grepl("liver", rownames(design_other))])

kable(isoSelect(obj_other, mirna="hsa-miR-302d-3p", minc=2)[,grepl("liver", rownames(design_other))])

mirnas = c("hsa-miR-302a-3p", "hsa-miR-302b-3p", "hsa-miR-302c-3p", "hsa-miR-302d-3p")
ma_sub = counts(obj_other)[mirnas, grepl("liver", rownames(design_other))]
ma_sub[ma_sub<=2] = 0
ma = log2( ma_sub + 0.25 )
pheatmap(ma)

According to the text they saw cross-mapping between these miRNAs. Here we are seeing almost perfect annotation for miR-302 family and some amplification of the let-7a reference miRNA in the MS-let-7c sample, where the let-7a is in low concentration (10 counts, compared to the expected expression -450 counts-).

Figure-e

Accuracy

target = c("hsa-miR-10a-5p", "hsa-let-7a-5p", "hsa-miR-302a-3p", "hsa-miR-133a-3p")
keep = grepl("serum", colnames(mirna_c_other))
keep_d = grepl("serum", rownames(design_other))
dge = DGEList(mirna_c_other[,keep])
dge = calcNormFactors(dge,method = 'TMM')
serum_n = cpm(dge, normalized.lib.sizes = T, log = T)

kable(serum_n[target,])

For miR10 and let-7a the changes are clear although not proportional.

I cannot detect sequences for miR-302 and miR-133 (< 5 reads). I checked directly in the raw data and these sequences are not there.

I think they weren't captured by the sequencing at all. The text shows the same for miR-302, and not detection in changes for miR-133 ( I guess for the same reason).

kable(isoSelect(obj_other, "hsa-miR-302a-3p", minc=1)[, grepl("serum", rownames(design_other))])

kable(isoSelect(obj_other, "hsa-miR-133a-3p", minc=1)[, grepl("serum", rownames(design_other))])

These two miRNAs were the ones that supposed to be down by 2 and 5 times in the serum with variable []. Since, these miRNAs are not detected in the samples with higher [] (serum with constant []) it would be imposible to detecte them in the other two. I would say there is a bias when the platform capture the sequences, and these two are not being detected, meanwhile there is less problems for let-7a and miR-10a

isomiRs

As an example of some figures you can do with this package. (read more here).

There is one figure per type of isomiR. The y-axes shows the percentage of unique sequences with that change. The x-axes shows the percentage of abundance with that change.

obj = isoPlot(obj)
obj = isoPlot(obj, type="iso3")
obj = isoPlot(obj, type="add")
obj = isoPlot(obj, type="subs")

It seems there are some nt-changes for serum and MS2 samples at position 13/14, and at position 9-11 for miRQC and liver samples. These are just exploratory figures and could lead to a differential expression analysis of isomiRs.

We can explore the porsition 11 better.

isoPlotPosition(obj, position=11)

This shows that the main change is from A > G (maybe A > I) changes that are common in human brain reigons.

Clusters

The same logic was applied to clusters detection.

clus_ma = clus_ma[, sort(colnames(clus_ma))]
keep = grepl("miRQC", colnames(clus_ma))
keep_d = grepl("miRQC", rownames(design))
dge = DGEList(clus_ma[,keep])
dge = calcNormFactors(dge,method = 'upperquartile')
clus_ma_norm = cpm(dge, normalized.lib.sizes = T, log = T)
enough = rowMedians(clus_ma_norm)>5
clus_ma_norm = clus_ma_norm[enough,]
is_mi = grepl("miRNA", ann)

Matrix correlation among samples

GGally::ggpairs(clus_ma_norm, axisLabels = "internal")

Abundance detection

mirqca = rowMeans(clus_ma_norm[, 1:2])
mirqcb = rowMeans(clus_ma_norm[, 3:4])
mirqcc = rowMeans(clus_ma_norm[, 5:6])
mirqcd = rowMeans(clus_ma_norm[, 7:8])

top_a = mirqca > mirqcb 
conc_a = top_a & mirqca > mirqcd & mirqcd > mirqcc

top_b = mirqca < mirqcb 
conc_b = top_b & mirqcb > mirqcc & mirqcc > mirqcd

clusters which mirqca > mirqcd > mrqcc are r sum(conc_a) (r sum(is_mi[enough] & conc_a) are miRNAs) out of r sum(top_a)

clusters which mirqcb > mirqcc > mrqcd are r sum(conc_b) (r sum(is_mi[enough] & conc_b) out of r sum(top_b)

ratio expression summary of A/D

kable(t(as.matrix(summary(mirqca[top_a]-mirqcd[top_a]))))

the average logFC is 0.3 that is similar to the expected FC = log2(1/0.75) = 0.41

ratio expression summary of A/C

kable(t(as.matrix(summary(mirqca[top_a]-mirqcc[top_a]))))

the logFC is 1.6 that is similar to the expected FC = log2(1/0.25) = 2

Same happens when comparing B vs D

kable(t(as.matrix(summary(mirqcb[top_b]-mirqcd[top_b]))))

and B vs C

kable(t(as.matrix(summary(mirqcb[top_b]-mirqcc[top_b]))))

The exactly same thing than we saw for miRNA analysis and in concordance with the samples description file.

Conclusions

We can conclude that the miRNAs and clusters quantification is accurate. The mapping annotation for miRNAs is perfect with those examples, and the difference detection is good for 2 miRNAs, and bad for the other two miRNAs due to a lack of reads support for those miRNAs.

In general, seqbuster/seqcluster show good accuracy for abundance detection and mapping accuracy for miRNAs.

Thanks

special thanks to the author of that papers to make data available. I encourage to use this data for any tool that analyze small RNA data.