# RNA-seq analysis

## Part 1: QC and clustering

In [None]:
# Set up environment
options(warn = -1)
options(jupyter.plot_mimetypes = 'image/png')

# Load packages
# library() stops with an error if a package is missing; require() only returns FALSE
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(tidyr))

In [None]:
source('src/load_datasets.r')

In [None]:
# Inspect the tables
cat('TPM (tall table):\n')
head(tpm_tall, 3)
cat('Metadata:\n')
head(meta, 10)
cat('Annotations:\n')
head(annot, 3)

## Summarize by gene
Transcript-level quantification was performed to allow fine-grained exploration of differential transcript isoform usage (e.g., alternative splicing changes). For our initial analyses, however, we want gene-level quantification, which is less sensitive to small changes in the levels of lowly expressed transcripts.

Converting the data only requires summing transcript TPM levels per gene.
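As a minimal illustration of this aggregation, the sketch below sums transcript-level TPM within each gene on a toy table. The transcript IDs, gene symbols, and values are made up for the example and are not taken from the dataset; base R is used so the snippet runs without extra packages.

```r
# Toy transcript-level table; all IDs and values are hypothetical
toy_tx <- data.frame(
    target_id   = c('tx1', 'tx2', 'tx3'),
    hugo_symbol = c('GENE_A', 'GENE_A', 'GENE_B'),
    tpm         = c(10, 5, 20)
)

# Gene-level TPM is the sum of the TPM of all transcripts of that gene
toy_gene <- aggregate(tpm ~ hugo_symbol, data = toy_tx, FUN = sum)
toy_gene
```

Summing is valid here because every TPM value within a sample is expressed relative to the same per-sample total.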

In [None]:
# Sum transcript-level TPM per gene and sample, then attach the sample metadata
tpm_gene <- tpm_tall %>%
    inner_join(annot, by=c('target_id'='gencode_tx')) %>%
    group_by(hugo_symbol, Name) %>%
    summarize(tpm = sum(tpm, na.rm=TRUE)) %>%
    inner_join(meta, by='Name')

head(tpm_gene)

## Check the highest-expressed genes 
A good sanity check is to manually explore the highest expressed genes. We'll sort by the average expression of each gene over all samples, and then print the top 20.

In [None]:
tpm_gene %>%
    group_by(hugo_symbol) %>%
    summarize(mean_tpm = mean(tpm, na.rm=TRUE)) %>%
    arrange(desc(mean_tpm)) %>%
    head(20) %>%
    mutate(rank = row_number()) %>%   # also works if fewer than 20 genes are present
    select(rank, hugo_symbol)

We expect mitochondrial genes like MT-CO1, glycolytic enzymes like GAPDH, elongation factors like EEF1A1, and ribosomal proteins like RPS\* and RPL\* to be among the highest expressed.
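One way to make this check less manual is to flag which of the top-ranked symbols belong to the families named above. The sketch below is only an illustration: the `top_genes` vector is a hypothetical stand-in for the output of the ranking, and the regular expressions are an assumption about how these gene families are named.

```r
# Hypothetical top-ranked symbols; in the notebook these would come from the ranking
top_genes <- c('MT-CO1', 'GAPDH', 'EEF1A1', 'RPS27A', 'RPL13', 'ACTB')

# Flag symbols that match the gene families named in the text
is_expected <- grepl('^MT-', top_genes) |          # mitochondrial genes
    grepl('^RP[SL][0-9]', top_genes) |             # ribosomal proteins
    top_genes %in% c('GAPDH', 'EEF1A1')            # specific genes named above

data.frame(hugo_symbol = top_genes, expected = is_expected)
```

A symbol that is not flagged is not necessarily a problem; it only marks genes worth a closer look.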

## Overall characterization

The transcripts per million (TPM) metric represents the number of transcripts expected to originate from a specific gene, given one million randomly selected transcripts. TPM is therefore normalized for gene length, sample-specific sequencing depth, and the distribution of transcript lengths within a sample, so its distribution should be roughly comparable across samples.
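To make the normalization concrete, here is a minimal sketch of the usual TPM calculation for a single sample. The function name `counts_to_tpm` and its inputs (`counts`, `eff_len`) are hypothetical and introduced only for illustration; the quantification tool used upstream of this notebook may compute TPM with additional corrections.

```r
# Sketch: convert read counts to TPM for one sample
# `counts` = reads per transcript, `eff_len` = effective length in bases (both hypothetical)
counts_to_tpm <- function(counts, eff_len) {
    rate <- counts / eff_len    # reads per base: corrects for transcript length
    rate / sum(rate) * 1e6      # rescale so the sample sums to one million
}

toy_tpm <- counts_to_tpm(counts = c(100, 200, 300), eff_len = c(1000, 2000, 1000))
toy_tpm
sum(toy_tpm)
```

The first two toy transcripts get the same TPM even though their counts differ, because the second is twice as long.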

Below, we plot the distribution of log-transformed TPM for each sample, color-coded by treatment.

In [None]:
options(repr.plot.width=6, repr.plot.height=3)

# Violin and box plots of expression per sample, colored by treatment
ggplot(tpm_gene, aes(x=Name, color=Treatment, y=log10(tpm + 1))) +
    geom_violin() + geom_boxplot(outlier.shape=NA, fill=NA) +   # outlier.shape=NA hides outlier points
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

In [None]:
options(repr.plot.width=6, repr.plot.height=5)

# Per-sample density of counts on a log10 x-axis
p <- counts_mat %>%
    as.data.frame %>%
    gather('Name', 'counts') %>%
    ggplot()
p + geom_density(aes(x=counts, color=Name)) + scale_x_log10()

## Hierarchical clustering
Using the gene-level TPM data, we create a log-transformed expression matrix and cluster the samples.
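The idea can be sketched on a toy matrix before applying it to the real data. The sample names and values below are made up for illustration; `dist()`, `hclust()` (complete linkage by default), and `cutree()` are standard functions from the `stats` package.

```r
# Toy log-scale expression matrix: rows are genes, columns are samples (values are made up)
toy_mat <- cbind(
    ctrl_1  = c(1.0, 2.0, 0.5, 3.0),
    ctrl_2  = c(1.1, 2.1, 0.4, 3.1),
    treat_1 = c(2.5, 0.5, 2.0, 1.0),
    treat_2 = c(2.4, 0.6, 2.1, 0.9)
)

# dist() compares rows, so transpose the matrix to compare samples
sample_dist <- dist(t(toy_mat), method = 'euclidean')
toy_hc <- hclust(sample_dist)

# Cut the tree into two groups of samples
toy_groups <- cutree(toy_hc, k = 2)
toy_groups
```

In this toy case the two control samples end up in one group and the two treated samples in the other, which is the kind of structure we hope to see in the dendrogram.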

In [None]:
options(repr.plot.width=8, repr.plot.height=6)

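# Build a genes x samples matrix of log10(TPM + 1)
cluster_mat <- tpm_gene %>%
    ungroup %>%
    mutate(log10_tpm = log10(tpm + 1)) %>%
    select(hugo_symbol, Name, log10_tpm) %>%
    spread(Name, log10_tpm) %>%
    select(-hugo_symbol) %>%
    as.matrix

# Swap the sample IDs with meaningful labels (matched on Name, so no row names are required)
colnames(cluster_mat) <- as.character(meta$Description[match(colnames(cluster_mat), meta$Name)])

# Treat missing gene/sample combinations as unexpressed
cluster_mat[is.na(cluster_mat)] <- 0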

# Take a look at the matrix
cluster_mat[1:5, 1:5]
summary(cluster_mat[,1:4])

In [None]:
# Hierarchical clustering
plot(hclust(dist(t(cluster_mat), method='euclidean')), xlab=NA, sub=NA)

## Plot across transcript length

In [None]:
# Load coverage by percentile for each sample and attach the sample metadata
cov <- read.csv('data/gene_coverage.csv') %>%
    select(percentile, coverage, Name=sample_id) %>%
    inner_join(meta, by='Name')

p <- ggplot(cov, aes(x=percentile, y=log10(coverage), group=Name, color=Treatment))
p + geom_line(aes(linetype=Concentration))
