diff --git a/about_us_2018.Rmd b/about_us_2019.Rmd similarity index 56% rename from about_us_2018.Rmd rename to about_us_2019.Rmd index 51b040a..3331cff 100644 --- a/about_us_2018.Rmd +++ b/about_us_2019.Rmd @@ -9,9 +9,8 @@ output: html_document You may have to install these first! The first two come from CRAN, the third one comes from Github! ```{r libraries} -library(gplots) library(googlesheets) -library(RSkittleBrewer) +library(pheatmap) ``` @@ -20,7 +19,7 @@ library(RSkittleBrewer) This is the data about us that we just created! ```{r read_data} -my_url = "https://docs.google.com/spreadsheets/d/11fHItDjoutvLG8XLx2rk95L2E8AJyPzThvzOleUr7rU/edit?usp=sharing" +my_url = "https://docs.google.com/spreadsheets/d/16eC3vwEcEdCC0GUUyJXV_fzCkdMlKaaYBygyx-_j_oQ/edit?usp=sharing" my_gs = gs_url(my_url) dat = gs_read(my_gs) ``` @@ -30,16 +29,12 @@ dat = gs_read(my_gs) This makes the pretty heatmap that shows what we know! ```{r make_plot} -trop = RSkittleBrewer("tropical") -colramp = colorRampPalette(c(trop[3],"white",trop[2]))(9) -palette(trop) dat = as.matrix(dat) dat[is.na(dat)]= 0 -par(mar=c(5,5,5,5)) -heatmap.2(as.matrix(dat),col=colramp,Rowv=NA,Colv=NA, - dendrogram="none", scale="none", - trace="none",margins=c(10,2)) +pheatmap(dat, + cluster_rows = FALSE, + cluster_cols=FALSE) ``` diff --git a/cshl-lab1-2019.Rmd b/cshl-lab1-2019.Rmd new file mode 100644 index 0000000..733bbce --- /dev/null +++ b/cshl-lab1-2019.Rmd @@ -0,0 +1,314 @@ +--- +title: "Lab 1: Import, organization, plotting" +output: html_document +--- + +# Install packages + +We won't run this because I've already installed them in the base package. But you would have to do this one time if you were doing this with new packages in the future. + + +```{r, eval=FALSE} +if (!requireNamespace("BiocManager", quietly = TRUE)) + install.packages("BiocManager") +BiocManager::install(version = "3.10") + +BiocManager::install(c("edgeR", + "tximport", + "tximportData", + "rhdf5", + "DESeq2", + "limma", + "edgeR")) + +install.packages('tidyverse') +install.packages('viridis') + +``` + + +# Load the libraries + +Here we load the libraries we will need for analysis + +```{r} + +library(rhdf5) +library(tximport) +library(tximportData) +library(SummarizedExperiment) +library(DESeq2) +library(limma) +library(edgeR) + +library(readr) +library(dplyr) +library(ggplot2) +library(readr) +library(viridis) + + +``` + + +# Load some data + +You created some transcript gtf/gff files that show the assembled "structure" of a transcriptome. You would need to run a "quantification" step to get the relative abundance of each of the transcripts. This step could be done either using StringTie or one of many other tools like Kalisto. + +The `tximportData` file shows some examples of quantified files. We will start with the cufflinks + + +```{r} +dir <- system.file("extdata", package = "tximportData") +list.files(dir) +``` + + +## Import StringTie quantification data with tximport + +We are going to use the tximport package to load the data from some StringTie data. You can also use tximport for a variety of other data types. + +To do this with StringTie you would need to run the command in Galaxy to quantify, something like this: + +`stringtie -eB -G transcripts.gff ` + +which will produce a set of `c_tab` files. These can be imported using the commands: + +```{r import_stringtie, eval=FALSE} +tmp <- read_tsv(files[1]) +tx2gene <- tmp[, c("t_name", "gene_name")] +txi <- tximport(files, type = "stringtie", tx2gene = tx2gene) +``` + +More info here: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual + +## Import Kallisto quantification data with tximport + +An alternative, widely used, transcript quantification software is [Kallisto](https://pachterlab.github.io/kallisto/). We will use these quantifications for this tutorial (since there is a nice example in the package tutorial!) + +First read in the sample information (this is one of the three tables we learned about): + +```{r read_samp} +samples <- read.table(file.path(dir, "samples.txt"), header = TRUE) +samples +``` + +We will also need information on transcripts and genes. This will depend on what you used to quantify so be careful! (this is the 2nd table) + +```{r read_gn} +tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv")) +head(tx2gene) +``` + + +Now read in the quantified transcripts (this is the third table) + +```{r read_tx_ab} +files <- file.path(dir, "kallisto_boot", samples$run, "abundance.h5") +names(files) <- paste0("sample", 1:6) +txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE) +head(txi.kallisto$counts) +``` + +We could also read in the quantites at the gene level + +```{r} +gene.kallisto = summarizeToGene(txi.kallisto,tx2gene) +``` + + +## Doing some exploration + +Now let's look at the transcript level data: + +```{r examine} +names(txi.kallisto) +glimpse(txi.kallisto) +``` + +And the gene level data + +```{r examine} +names(gene.kallisto) +glimpse(gene.kallisto) +``` + +Now lets look at the counts + +```{r counts} +head(gene.kallisto$counts) +dim(gene.kallisto$counts) +``` + +Let's look at the distribution of counts. + +```{r ggplot_err,eval=FALSE} +ggplot(gene.kallisto$counts,aes(y=sample1)) + geom_boxplot() + theme_minimal() +``` + +Uh oh....two approaches here: + +1. Go to base R + + +```{r} +boxplot(gene.kallisto$counts[,-1]) +``` + + +2. Make this "tidy" + +```{r tidy_kallisto} +tidy_kallisto = as_tibble(gene.kallisto$counts) +tidy_kallisto$gene = rownames(gene.kallisto$counts) +tidy_kallisto = tidy_kallisto %>% + gather(sample,value,sample1:sample6) +``` + +```{r slow_ggplot, eval=FALSE} +##Slowish +tidy_kallisto %>% + ggplot(aes(x=sample,y=value,group=sample,colour=sample)) + + geom_boxplot() + + theme_minimal() +``` + + +## Some more exploration + +Normal data looks, well, "normal" + +```{r} +hist(rnorm(1000)) +``` + +Gene counts are not: + +```{r} +counts = gene.kallisto$counts +hist(counts[,1],col=2,breaks=100) +``` + + +One way is to use the log + +```{r} +hist(log(counts[,1]),col=2,breaks=100) +``` + +But there is a problem! + +```{r} +min(log(counts)) +``` + +People often remove this problem by adding a small number before taking the log + +```{r} +min(log(counts[,1] + 1)) +``` + +Another common choice is log base 2, because then differences between values can be interpreted as "fold changes": + +```{r} +hist(log2(counts[,1] + 1),breaks=100,col=2) +``` + + +Another common transform is to remove genes with low counts: + +```{r} +hist(rowSums(counts==0),col=2) +``` + + +We could filter these out: + +```{r} +low_genes = rowMeans(counts) < 5 +table(low_genes) +filt_counts = counts[!low_genes,] +dim(filt_counts) +``` + +Our data is starting to look a bit nicer + +```{r} +hist(log2(filt_counts[,1] + 1),col=2) +``` + +Now we can do things like compare replicates + +```{r} +plot(log2(filt_counts[,1]+1),log2(filt_counts[,2]+1),pch=19,col=2) +``` + +A better way is an M-A plot. + +```{r} +mm = log2(filt_counts[,1]+1) - log2(filt_counts[,2]+1) +aa = log2(filt_counts[,1]+1) + log2(filt_counts[,2]+1) +plot(aa,mm,col=2,pch=19) +``` + +# Making the three tables and running DESeq + +One thing you might have noticed is that when we started filtering there is a problem. We filtered the genes but not the annotation! So we want the genes/annotation/metadata to be lined up. We can do this with a `SummarizedExperiment` object: + +```{r} +samples$treatment = rep(c("A","B"),each=3) +rse <- SummarizedExperiment(assays=SimpleList(counts=counts), + colData=DataFrame(samples)) +rse +``` + + +Now we can subset the rse object: + +```{r} +rse[,1:3] +rse[1:10, ] +``` + + +# Basic limma-voom analysis + +There are a bunch of ways to do this, one is to read in the data directly to have it ready for limma/voom + +```{r} +txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM") +``` + +Create the object for limma analysis with filtering + +```{r} +y <- DGEList(txi$counts) +keep <- filterByExpr(y) +y <- y[keep, ] +``` + +Calculate normalization factors +```{r} +y <- calcNormFactors(y) +design <- model.matrix(~condition, data = sampleTable) +v <- voom(y, design) +``` + +Calculate the model fits + +```{r} +fit <- lmFit(v, design) +fit <- eBayes(fit) +``` + +Volcano plot + +```{r} +limma::volcanoplot(fit, coef = 2) +``` + +Look at top hits + +```{r} +top <- topTable(fit, number = 10) +``` diff --git a/cshl-lab2-2019.Rmd b/cshl-lab2-2019.Rmd new file mode 100644 index 0000000..07fce4f --- /dev/null +++ b/cshl-lab2-2019.Rmd @@ -0,0 +1,268 @@ +--- +title: "Lab 2: Modeling" +output: html_document +--- + +# Install packages + +We won't run this because I've already installed them in the base package. But you would have to do this one time if you were doing this with new packages in the future. + + +```{r, eval=FALSE} +source("http://www.bioconductor.org/biocLite.R") +biocLite(c("Biobase")) +biocLite(c("tximport", + "tximportData", + "rhdf5", + "DESeq2", + "limma", + "edgeR")) + +``` + + +# Load the libraries + +Here we load the libraries we will need for analysis + +```{r} + +library(rhdf5) +library(tximport) +library(tximportData) +library(SummarizedExperiment) +library(DESeq2) +library(limma) +library(edgeR) + +library(readr) +library(dplyr) +library(ggplot2) +library(readr) +library(RSkittleBrewer) +library(pheatmap) +library(dendextend) + +``` + +Set up the color palette for pretty plots + +```{r} +trop = RSkittleBrewer("tropical") +colramp = colorRampPalette(c(trop[3],"white",trop[2]))(9) +palette(trop) +``` + + +# Load some data + +You created some transcript gtf/gff files that show the assembled "structure" of a transcriptome. You would need to run a "quantification" step to get the relative abundance of each of the transcripts. This step could be done either using StringTie or one of many other tools like Kalisto. + +The `tximportData` file shows some examples of quantified files. We will start with the cufflinks + + +```{r} +dir <- system.file("extdata", package = "tximportData") +samples <- read.table(file.path(dir, "samples.txt"), header = TRUE) +tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv")) +head(tx2gene) + + +files <- file.path(dir, "kallisto_boot", samples$run, "abundance.h5") +names(files) <- paste0("sample", 1:6) +txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE) +head(txi.kallisto$counts) + +gene.kallisto = summarizeToGene(txi.kallisto,tx2gene) +counts = gene.kallisto$counts + +samples$treatment = rep(c("A","B"),each=3) +rse <- SummarizedExperiment(assays=SimpleList(counts=counts, + l2counts = log2(counts+1)), + colData=DataFrame(samples)) +``` + +# Hierarchical clustering + +First let's make a plot like we did before. Now we'll need to use the `assays` command. + +```{r} +boxplot(assays(rse)$counts[,2]) +``` + +Let's do some filtering: + +```{r} +high_genes = rowMeans(assays(rse)$l2counts) > 10 +rse = rse[high_genes,] +``` + +We can now compute distance between samples: + +```{r} +d1 = dist(t(assays(rse)$l2counts)) +``` + +Let's look at this distance matrix: + +```{r} +pheatmap(as.matrix(d1),cluster_cols=FALSE,cluster_rows=FALSE) +``` + + + +Cluster the data with hierarchical clustering: + +```{r} +hclust1 = hclust(d1,method="average") +``` + +Make a plot of the clustering: + +```{r} +plot(hclust1) +``` + + +We can also plot by colors + +```{r} +dend = as.dendrogram(hclust1) +dend = color_labels(hclust1,4,col=1:4) +plot(dend) +``` + + +We can also make these plots colored by specific sample characteristics: + + + +```{r} +labels_colors(dend) = c(1,1,1,2,2,2) +plot(dend) +``` + + + +Make clustering plot directly (doesn't save the clustering) + +```{r} +pheatmap(assays(rse)$l2counts,cluster_rows=FALSE) +``` + +# Kmeans clustering + +We can cluster the genes into a specific number of groups: + +```{r} +kmeans1 = kmeans(assays(rse)$l2counts,centers=3) +names(kmeans1) +``` + +Now we plot the cluster means +```{r} +matplot(t(kmeans1$centers),col=1:3,type="l",lwd=3) +``` + +How many genes in each cluster? + +```{r} +table(kmeans1$cluster) +``` + + +Heatmap kmeans clustered + +```{r} +pheatmap(assays(rse)$l2counts[order(kmeans1$cluster),], + cluster_cols=F, + cluster_rows=F,show_rownames =FALSE) +``` + +This is not deterministic!!! Try running this a few times: + +```{r} +kmeans2 = kmeans(assays(rse)$l2counts,centers=3) +table(kmeans1$cluster,kmeans2$cluster) +``` + + +# Basic limma-voom analysis + +Create the object for limma analysis with filtering + +```{r} +y <- DGEList(assays(rse)$counts) +keep <- filterByExpr(y) +y <- y[keep, ] +``` + +Calculate normalization factors +```{r} +y <- calcNormFactors(y) +design <- model.matrix( ~ treatment, data = colData(rse)) +``` + +Check for mean variance relationship +```{r} +v <- voom(y, design,plot=TRUE) +``` + + +Calculate the model fits + +```{r} +fit <- lmFit(v, design) +fit <- eBayes(fit) +``` + + +Volcano plot - looking for statistically significant effects with also big fold changes + +```{r} +limma::volcanoplot(fit, coef = 2) +``` + +Look at an MA plot to see if you see strange patterns + +```{r} +limma::plotMA(fit, coef = 2,col=2) +``` + + +Look at top hits + +```{r} +top <- topTable(fit, number = 10) +``` + +Look at p-values + +```{r} +all <- topTable(fit, number = Inf) +hist(all$P.Value,col=3) +``` + +We can use the Benjamini Hochberg adjusted p-values + +```{r} +all <- topTable(fit, number = Inf) +hist(all$adj.P.Val,col=3) +``` + +Or we can adjust them directly: + +```{r} +all$bonf_pval = p.adjust(all$P.Value,method="bonferroni") +hist(all$bonf_pval,col=3) +``` + + +# Session Info + + +This is for me to debug what happened later + +```{r} +sessionInfo() +```