-
Notifications
You must be signed in to change notification settings - Fork 0
/
RNAseq_practical.Rmd
1265 lines (1005 loc) · 51.8 KB
/
RNAseq_practical.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: >
Transcriptome Data Analysis - hands-on session
subtitle: >
Bulk RNA-seq Workshop - IMBEI, University Medical Center Mainz</br>
<small>Material adapted from the GTIPI practical session by</small>
Instructors: Annekathrin Nedwed, Alicia Schulze, Federico Marini</br>
<a href="https://imbeimainz.github.io/GTIPI2022"><img src="images/gtipi_logo.png" alt="" height="100"/></a>
author:
- name: <a href="https://csoneson.github.io">Charlotte Soneson (charlotte.soneson@fmi.ch)</a><br><a href="https://www.fmi.ch/bioinformatics/">FMI Basel</a><br><a href="https://twitter.com/CSoneson">`r icons::fontawesome('twitter')` `@CSoneson`</a>
- name: <a href="https://federicomarini.github.io">Federico Marini (marinif@uni-mainz.de)</a><br><a href="https://www.unimedizin-mainz.de/imbei/">IMBEI, University Medical Center Mainz</a><br><a href="https://twitter.com/FedeBioinfo">`r icons::fontawesome('twitter')` `@FedeBioinfo`</a>
date: "2023/06/23"
output:
bookdown::html_document2:
toc: true
toc_float: true
theme: cosmo
code_folding: show
code_download: true
editor_options:
chunk_output_type: inline
bibliography: references.bib
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#",
error = FALSE,
warning = FALSE,
message = FALSE
)
```
<!-- <script type="text/javascript" -->
<!-- src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"> -->
<!-- </script> -->
```{r style, echo=FALSE, message=FALSE, warning=FALSE, results="asis", eval = TRUE}
suppressPackageStartupMessages({
library(BiocStyle)
library(knitr)
library(rmarkdown)
})
options(width = 80)
eval_all <- FALSE
eval_dte <- FALSE
eval_dtu <- FALSE
opts_chunk$set(fig.width = 6, fig.height = 6, eval = eval_all)
```
# Introduction
In this tutorial we walk through a gene-level RNA-seq differential expression
analysis using Bioconductor packages. We start from the gene-vs-sample count
matrix, and thus assume that the raw reads have already been quality controlled
and that the gene expression has been quantified (either using alignment and
counting, or by applying an alignment-free quantification tool). We perform
exploratory data analysis (EDA) for quality assessment and to explore the
relationship between samples, then perform differential gene expression
analysis, and visually explore the results.
Bioconductor has many packages supporting analysis of high-throughput sequence
data, including RNA-seq. The packages that we will use in this tutorial include
core packages maintained by the [Bioconductor core
team](https://www.bioconductor.org/about/core-team/) for importing and
processing raw sequencing data and loading gene annotations. We will also use
contributed packages for statistical analysis and visualization of sequencing
data. Through scheduled releases every 6 months, the Bioconductor project
ensures that all the packages within a release will work together in harmony
(hence the "conductor" metaphor). The packages used in this tutorial are loaded
with the *library* function and can be installed by following the [Bioconductor
package installation
instructions](http://bioconductor.org/install/#install-bioconductor-packages).
If you use the results from an R package in published research, you can find the
proper citation for the software by typing `citation("pkgName")`, where you
would substitute the name of the package for `pkgName`. Citing methods papers
helps to support and reward the individuals who put time into open source
software for genomic data analysis.
Many parts of this tutorial are based on a published RNA-seq workflow
available via [F1000Research](http://f1000research.com/articles/4-1070)
[@Love2015RNASeq] and as a [Bioconductor
package](https://www.bioconductor.org/packages/release/workflows/html/rnaseqGene.html).
Along this notebook that you can render yourself - we will show you how during the
workshop - you will encounter some small quiz sections to keep up with your progress
throughout the entire workflow.
## Setup recommendation
We recommend you use the following setup for an effective learning experience
* Clone or download the workshop repo, available at https://github.com/imbeimainz/RNAseqWorkshop2023
* (Make sure you have R/RStudio/Bioconductor all setup, as expected)
* Work in RStudio, ideally clicking on the RStudio project file
* Open up the `RNAseq_practical.Rmd` file, inside RStudio
<!-- * As a check, refer to `RNAseq_practical_run_all.html` file - don't use that during the workshop -->
* Execute the code chunks/read the information provided
Have fun learning - together!
## Experimental data
The data used in this workflow comes from an RNA-seq experiment
[@Alasoo2018-macrophage], in which the authors identified shared quantitative
trait loci (QTLs) for chromatin accessibility and gene expression in human
macrophages exposed to IFNgamma, Salmonella and IFNgamma plus Salmonella.
Processed data from a subset of 24 samples from this experiment (six female
donors, with four treatments each) is available via the
`r Biocpkg("macrophage")` R/Bioconductor package.
In particular, the package contains output from *Salmon* [@Patro2017Salmon],
as well as a metadata file. More information about how the raw data was
processed is available from the package
[vignette](http://bioconductor.org/packages/release/data/experiment/vignettes/macrophage/inst/doc/macrophage.html).
We start by setting the path to the folder containing the quantifications
(the output folders from *Salmon*). Since these are provided with an R package,
we will point to the `extdata` subfolder of the installed package. For a typical
analysis of your own data, you would point directly to a folder on your storage
system (i.e., not using `system.file()`).
```{r listfiles}
library(macrophage)
dir <- system.file("extdata", package = "macrophage")
list.files(dir)
```
TODO: open up that directory to "better see what is in it"? Idea: it can give them a sense for the *what do I expect to get there?*
# Reading the metadata
First, we will read the metadata for the experiment. The main annotations of
interest for this tutorial are `condition_name`, which represents the
treatment of the sample (naive, IFN gamma, Salmonella, IFN gamma+Salmonella) and
`line_id`, which represents the donor ID. The
sample identifier is given by the `names` column, and will be used to match the
metadata table to the quantifications.
```{r readcoldata}
coldata <- read.csv(file.path(dir, "coldata.csv"))[, c(1, 2, 3, 5)]
dim(coldata)
coldata
```
In addition to the `names` column, `r Biocpkg("tximeta")`, which we will use to read the
quantification data, requires that `coldata` has a column named `files`,
pointing to the *Salmon* output (the `quant.sf` file) for the respective samples.
```{r addfiles}
coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
head(coldata)
all(file.exists(coldata$files))
```
Now we have everything we need, and can import the quantifications with *tximeta*.
In this process, we will see that *tximeta* automatically identifies the
source and version of the transcriptome reference that was used for the
quantification, and adds some metadata.
The imported data will be stored in a *SummarizedExperiment* container.
# Importing quantifications into R
We will
next read the *Salmon* quantifications provided in the `r Biocpkg("macrophage")`
package into R and summarize the expected counts on the gene level.
A simple way to import results from a variety of transcript abundance
estimation tools into R is provided by the `r Biocpkg("tximport")` and
`r Biocpkg("tximeta")` packages. Here, *tximport* reads the quantifications into a
list of matrices, while *tximeta* instead aggregates the information into a
*SummarizedExperiment* object, and also automatically adds additional
annotations for the features. Both packages can return quantifications on the
transcript level or aggregate them on the gene level. They also calculate
average transcript lengths for each gene and each sample, which can be used as
offsets to improve the differential expression analysis by accounting for
differential isoform usage across samples [@Soneson2015Differential].
The code below imports the *Salmon* quantifications into R using the *tximeta*
package. Note how the transcriptome that was used for the quantification is
automatically recognized and used to annotate the resulting data object. In
order for this to work, *tximeta* requires that the output folder structure from
Salmon is retained, since it reads information from the associated log files in
addition to the quantified abundances themselves.
```{r importst}
suppressPackageStartupMessages({
library(tximeta)
library(DESeq2)
library(org.Hs.eg.db)
library(SummarizedExperiment)
})
## Import quantifications on the transcript level
st <- tximeta(coldata = coldata, type = "salmon", dropInfReps = TRUE)
st
```
We see that *tximeta* has identified the transcriptome used for the
quantification as `GENCODE - Homo sapiens - release 29`. How did this happen?
In fact, the output directory from *Salmon* contains much more information than
just the `quant.sf` file! (as mentioned above, this means that it is not
advisable to move files out of the folder, or to share only the `quant.sf`
file, since the context is lost):
```{r listinquants}
list.files(file.path(dir, "quants", coldata$names[1]), recursive = TRUE)
```
In particular, the `meta_info.json` file contains a hash checksum, which is
derived from the set of transcripts used as reference during the quantification
and which lets *tximeta* identify the reference source (by comparing to a
table of these hash checksums for commonly used references).
```{r metainfo}
rjson::fromJSON(file = file.path(dir, "quants", coldata$names[1],
"aux_info", "meta_info.json"))
```
Looking at the size of the *SummarizedExperiment* object (205,870 rows!) as well
as the row names, we see that this object contains transcript-level information.
The assays are created by directly importing the values from the `quant.sf`
files and combining this information across the 24 samples:
* counts - `NumReads` column
* abundance - `TPM` column
* length - `EffectiveLength` column
We can access any of the assays via the `assay` function:
```{r headst}
head(assay(st, "counts"), 3)
```
You may have noted that `st` is in fact a *RangedSummarizedExperiment*
object (rather than "just" a *SummarizedExperiment* object). What does this mean?
Let’s look at the information we have about the rows (transcripts) in the object:
```{r rrst}
rowRanges(st)
```
By knowing the source and version of the reference used for the quantification,
*tximeta* was able to retrieve the annotation files and decorate the object with
information about the transcripts, such as the chromosome and position, and the
corresponding gene ID. Importantly, *Salmon* did not use (or know about) any of
this during the quantification! It needs only the transcript sequences.
If we just want the annotation columns, without the ranges, we can get those
with the `rowData` accessor:
```{r rdst}
rowData(st)
```
Similar to the row annotations in `rowData`, the *SummarizedExperiment*
object contains sample annotations in the `colData` slot.
```{r cdst}
colData(st)
```
### Quiz {-}
* How many samples are there in this `macrophage` dataset? How many are there in your (typical) RNA-seq experiment?
* Is the quantification done at the transcript level or at the gene level? How can you go from one to the other? And vice versa?
* What is "better"? Think of DGE, DTU, DTE... and of your experimental setting
# Summarizing on the gene level
As we saw, the features in the *SummarizedExperiment* object above are individual
transcripts, rather than genes.
Often, however, we want to do analysis on the gene level, since the gene-level
abundances are more robust and sometimes more interpretable than transcript-level
abundances.
The `rowData` contains the information about the corresponding gene for each
transcript, in the `gene_id` column, and *tximeta* provides a function
to summarize on the gene level:
* Counts are added up
* TPMs are added up
* Transcript lengths are added up after weighting by the respective transcript TPMs
```{r summarizetogene}
## Summarize quantifications on the gene level
sg <- tximeta::summarizeToGene(st)
sg
# compare e.g. to
st
```
Now we have a new `RangedSummarizedExperiment` object, with one row per gene.
The row ranges have been summarized as well, and can be used for subsetting
and interpretation just as for the transcripts.
At this point, the only information we have about the genes in our data set,
apart from their genomic location and the associated transcript IDs, is the
Ensembl ID.
Often we need additional annotations, such as gene symbols.
Bioconductor provides a range of annotation packages:
* `OrgDb` packages, providing gene-based annotations for a given organism
* `TxDb` and `EnsDb` packages, providing transcript ranges for a given genome build
* `BSgenome` packages, providing the genome sequence for a given genome build
For our purposes here, the appropriate `OrgDb` package is the most suitable,
since it contains gene-centric ID conversion tables.
Since this is human data, we will use the `r Biocpkg("org.Hs.eg.db")` package.
```{r addsymbol}
## Add gene symbols
sg <- tximeta::addIds(sg, "SYMBOL", gene = TRUE)
sg
head(rowData(sg))
```
To see a list of the possible columns, use the `columns` function from the
`r Biocpkg("AnnotationDbi")` package:
```{r listcols}
AnnotationDbi::columns(org.Hs.eg.db)
```
We can even add annotations where we expect (and would like to retain)
multiple mapping values, e.g., associated GO terms:
```{r addgo}
sg <- addIds(sg, "GO", multiVals = "list", gene = TRUE)
head(rowData(sg))
```
Note that *Salmon* returns *estimated* or *expected* counts, which are not necessarily
integers. They may need to be rounded before they are passed to count-based
statistical methods. To obtain consistent results with different
pipelines, we round the estimated counts here (note that in practice,
`r Biocpkg("DESeq2")` will automatically round the counts, while
`r Biocpkg("edgeR")` will work well also with the non-integer values).
```{r roundcounts}
assay(sg, "counts") <- round(assay(sg, "counts"))
```
### Quiz {-}
* Can you start your DE analysis with normalized values such as TPMs?
* Where can I store additional information about the samples?
* Where can I store additional information about the features? (What can your features be?)
* Can you think of a way to read in the required information if you start (AAAAAAAH) from Excel files?
# Representing counts for differential expression packages
At this point, we have a gene-level count matrix, contained in our
*SummarizedExperiment* object.
This is a branching point where we could use a variety of
Bioconductor packages for exploration and differential expression of the count
matrix, including `r Biocpkg("edgeR")` [@Robinson2009EdgeR],
`r Biocpkg("DESeq2")` [@Love2014Moderated], `r Biocpkg("limma")` with the voom
method [@Law2014Voom], `r Biocpkg("DSS")` [@Wu2013New], `r Biocpkg("EBSeq")`
[@Leng2013EBSeq] and `r Biocpkg("BaySeq")` [@Hardcastle2010BaySeq]. We will
continue using *DESeq2* and *edgeR*.
Bioconductor software packages often define and use a custom class for storing
data that makes sure that all the needed data slots are consistently provided
and fulfill any requirements. In addition, Bioconductor has general data classes
(such as the *SummarizedExperiment*) that can be used to move data between
packages. The `r Biocpkg("DEFormats")` package can be useful for converting
between different classes. The core Bioconductor classes also provide useful
functionality: for example, subsetting or reordering the rows or columns of a
*SummarizedExperiment* automatically subsets or reorders the associated
*rowRanges* and *colData*, which can help to prevent accidental sample swaps
that would otherwise lead to spurious results. With *SummarizedExperiment* this
is all taken care of behind the scenes.
Each of the packages we will use for differential expression has a specific
class of object used to store the summarization of the RNA-seq experiment and
the intermediate quantities that are calculated during the statistical analysis
of the data. *DESeq2* uses a *DESeqDataSet* and *edgeR* uses a *DGEList*.
## The *DESeqDataSet*, sample information, and the design formula
In *DESeq2*, the custom class is called *DESeqDataSet*. It is built on top of
the *SummarizedExperiment* class, and it is easy to convert
*SummarizedExperiment* objects into *DESeqDataSet* objects. One of the two main
differences compared to a *SummarizedExperiment* object is that the `assay` slot
can be accessed using the `counts` accessor function, and the *DESeqDataSet*
class enforces that the values in this matrix are non-negative integers.
A second difference is that the *DESeqDataSet* has an associated *design
formula*. The experimental design is specified at the beginning of the analysis,
as it will inform many of the *DESeq2* functions how to treat the samples in the
analysis (one exception is the size factor estimation, i.e., the adjustment for
differing library sizes, which does not depend on the design formula). The
design formula tells which columns in the sample information table (`colData`)
specify the experimental design and how these factors should be used in the
analysis.
Let's remind ourselves of the design of our experiment:
```{r cdsg}
colData(sg)
```
We have samples from four different conditions, and six donors:
```{r tabcoldata}
table(colData(sg)$condition_name)
table(colData(sg)$line_id)
# possible to use a shortcut
table(sg$line_id)
```
We want to find the changes in gene expression that can be associated with
the different treatments, but we also want to control for differences between the
donors. The design which accomplishes this is obtained by writing
`~ line_id + condition_name`. By including `line_id`, terms will be
added to the model which
account for differences across donors, and by adding `condition_name` we get
terms representing the different treatment effects.
**Note:** it will be helpful for us if the first level of a factor is the
reference level (e.g. control, or untreated samples). The reason is that by
specifying this, functions further in the pipeline can be used and will give
comparisons such as 'treatment vs control', without needing to specify
additional arguments.
We can *relevel* the `condition_name` factor like so:
```{r relevelsg}
colData(sg)$condition_name <- factor(colData(sg)$condition_name)
colData(sg)$condition_name <- relevel(colData(sg)$condition_name, ref = "naive")
colData(sg)$condition_name
```
You can use R's formula notation to express any fixed-effects experimental
design for *edgeR* or *DESeq2*. Note that these packages use the same formula
notation as, for instance, the `lm` function of base R.
Using the `r Biocpkg("ExploreModelMatrix")` R/Bioconductor package, we can
represent our design in a graphical way:
```{r emmvis, fig.width = 10, fig.height = 8, message = FALSE}
library(ExploreModelMatrix)
vd <- VisualizeDesign(sampleData = colData(sg),
designFormula = ~ line_id + condition_name,
textSizeFitted = 4)
vd$plotlist
vd$cooccurrenceplots
```
We can also open the interactive interface to explore our design further:
```{r emm, eval=FALSE}
ExploreModelMatrix(sampleData = colData(sg),
designFormula = ~ line_id + condition_name)
```
To generate a *DESeqDataSet* object from a *SummarizedExperiment* object, we
only need to additionally provide the experimental design in terms of a formula.
```{r dsds}
dds <- DESeqDataSet(sg, design = ~ line_id + condition_name)
```
We can also create a *DESeqDataSet* directly from a count matrix, a data frame
with sample information and a design formula (see the `DESeqDataSetFromMatrix`
function).
## The *DGEList*
As mentioned above, the *edgeR* package uses another type of data container,
namely a *DGEList* object. *tximeta* provides a convenient wrapper function
to generate a *DGEList* from the gene-level *SummarizedExperiment* object:
```{r dgelist, message=FALSE}
library(edgeR)
dge <- tximeta::makeDGEList(sg)
names(dge)
head(dge$samples)
```
As for the *DESeqDataSet*, a *DGEList* can also be generated
directly from a count matrix and sample metadata (see the `DGEList()`
constructor function).
Just like the *SummarizedExperiment* and the *DESeqDataSet*, the *DGEList*
contains all the information we need: the count matrix, information about the
samples (the columns of the count matrix), and information about the genes (the
rows of the count matrix). One difference compared to the *DESeqDataSet* is that
the experimental design is not defined when creating the *DGEList*, but later in
the workflow.
### Quiz {-}
* How important is the definition of the design?
* How do I quantify the effect size? "In which direction" is this to be interpreted?
* Is it possible to change the reference in the comparison?
* If you have multiple experimental factors, how should you proceed? Think of 2 condition-2 tissues/cell types
# Filtering
It is often helpful to filter out lowly expressed genes before continuing
with the analysis, to remove features that have nearly no information,
increase the speed of the analysis and reduce the size of the data.
At the very least we exclude genes with zero counts across
all samples.
```{r filtering0}
nrow(dds)
table(rowSums(assay(dds, "counts")) == 0)
```
Here, we additionally remove genes that have a single read across the samples.
```{r filtering1}
keep <- rowSums(counts(dds)) > 1
dds <- dds[keep, ]
dge <- dge[match(rownames(dds), rownames(dge)), ]
dim(dds)
dim(dge)
```
Importantly, the group information should *not* be used to define the filtering
criterion, since that can interfere with the validity of the p-values downstream.
### Quiz {-}
* Removing unexpressed features might influence what you do when integrating different analyses. How could we proceed in such cases?
* What if I download publicly available data and some genes are missing? What can be some of the possible reasons for this?
# Exploratory analysis and visualization
There are two separate analysis paths in this tutorial:
1. *visual exploration* of sample relationships, in which we will discuss
transformation of the counts for computing distances or making plots
2. *statistical testing* for differences attributable to treatment, controlling
for donor effects
Importantly, the statistical testing methods rely on original count data (not
scaled or transformed) for calculating the precision of measurements. However,
for visualization and exploratory analysis, transformed counts are typically
more suitable. Thus, it is critical to separate the two workflows and use the
appropriate input data for each of them.
## Transformations
Many common statistical methods for exploratory analysis of multidimensional
data, for example clustering and *principal components analysis* (PCA), work
best for data that generally has the same range of variance at different ranges
of the mean values. When the expected amount of variance is approximately the
same across different mean values, the data is said to be *homoskedastic*. For
RNA-seq raw counts, however, the variance grows with the mean. For example, if
one performs PCA directly on a matrix of size-factor-normalized read counts, the
result typically depends only on the few most strongly expressed genes because
they show the largest absolute differences between samples. A simple and often
used strategy to avoid this is to take the logarithm of the normalized count
values plus a small pseudocount; however, now the genes with the very lowest
counts will tend to dominate the results because, due to the strong Poisson
noise inherent to small count values, and the fact that the logarithm amplifies
differences for the smallest values, these low count genes will show the
strongest relative differences between samples.
As a solution, *DESeq2* offers transformations for count data that stabilize the
variance across the mean: the *regularized logarithm* (rlog) and the *variance
stabilizing transformation* (VST). These have slightly different
implementations, discussed a bit in the *DESeq2* paper and in the vignette, but
a similar goal of stabilizing the variance across the range of values. Both
produce log2-like values for high counts. Here we will use the variance
stabilizing transformation implemented with the `vst` function.
```{r vsd}
vsd <- DESeq2::vst(dds)
```
This returns a *DESeqTransform* object...
```{r classvsd}
class(vsd)
```
...which retains all the column metadata that was attached to the
*DESeqDataSet*:
```{r headvsd}
head(colData(vsd), 3)
```
## PCA plot
One way to visualize sample-to-sample distances is a principal components
analysis (PCA). In this ordination method, the data points (here, the samples)
are projected onto the 2D plane such that they spread out in the two directions
that explain most of the differences (Figure below). The x-axis (the first
principal component, or *PC1*) is the direction that separates the data points
the most (i.e., the direction with the largest variance). The y-axis (the second
principal component, or *PC2*) represents the direction with largest variance
subject to the constraint that it must be *orthogonal* to the first direction.
The percent of the total variance that is contained in the direction is printed
in the axis label. Note that these percentages do not sum to 100%, because there
are more dimensions that contain the remaining variance (although each of these
remaining dimensions will explain less than the two that we see).
```{r plotpca, fig.width=6, fig.height=4.5}
DESeq2::plotPCA(vsd, intgroup = "condition_name")
```
Additionally, the *pcaExplorer* package has some functionality on top to explore datasets from the point of view of Principal Components - including also a functional interpretation of it with the `pca2go()` function.
```{r plotpca2, fig.width=10, fig.height=8}
library(pcaExplorer)
pcaplot(vsd, intgroup = "condition_name", ellipse = TRUE)
```
## MDS plot
Another way to reduce dimensionality, which is in many ways similar to PCA, is
*multidimensional scaling* (MDS). For MDS, we first have to calculate all
pairwise distances between our objects (samples in this case), and then create a
(typically) two-dimensional representation where these pre-calculated distances
are represented as accurately as possible. This means that depending on how the
pairwise sample distances are defined, the two-dimensional plot can be very
different, and it is important to choose a distance that is suitable for the
type of data at hand.
*edgeR* contains a function `plotMDS`, which operates on a *DGEList* object and
generates a two-dimensional MDS representation of the samples. The default
distance between two samples can be interpreted as the "typical" log fold change
between the two samples, for the genes that are most different between them (by
default, the top 500 genes, but this can be modified). We generate an MDS plot
from the *DGEList* object `dge`, coloring by the treatment and using different
plot symbols for different donors.
**Note:** Since the *DGEList* was created using the `makeDGEList` function,
the average transcript length offsets have been incorporated in the object and
will be used as offsets in downstream analysis. If this is not the case,
we need to estimate TMM normalization factors before performing further analysis.
```{r plotmds}
# dge <- edgeR::calcNormFactors(dge)
plotMDS(dge, top = 500, labels = NULL,
col = as.numeric(dge$samples$condition_name),
cex = 0.5, gene.selection = "common")
```
### Quiz {-}
* What if I do not see a clear separation in my PCA plot? Is it still ok to proceed?
* What should I do if I detect/am aware of a batch effect?
* Raw data vs normalized data vs transformed data: Which one should I use in "all the cases" I could encounter?
* "I have seen people using tSNE/UMAP in single cell data, why shouldn't we do the same here?"
# Differential expression analysis
## Performing differential expression testing with *DESeq2*
As we have already specified an experimental design when we created the
*DESeqDataSet*, we can run the differential expression pipeline on the raw
counts with a single call to the function `DESeq`. We can also plot the
estimated dispersions.
```{r DESeq2call}
dds <- DESeq2::DESeq(dds)
DESeq2::plotDispEsts(dds)
```
The `DESeq` function will print out a message for the various steps it performs. These
are described in more detail in the manual page, which can be
accessed by typing `?DESeq`. Briefly these are: the estimation of size factors
(controlling for differences in the sequencing depth of the samples), the
estimation of dispersion values for each gene, and fitting a generalized linear
model.
A *DESeqDataSet* is returned that contains all the fitted parameters within it,
and the following section describes how to extract out results tables of
interest from this object.
Calling the `results` function without any arguments will extract the estimated log2
fold changes and *p* values for the last variable in the design formula. If
there are more than 2 levels for this variable, `results` will extract the
results table for a comparison of the last level over the first level. This
comparison is printed at the top of the output: `condition name SL1344 vs naive`.
Other comparisons can be performed via the `contrast` argument. For example,
we will focus on comparing the IFN gamma treatment to the naive group.
```{r deseq2results}
## Default - SL1344 vs naive
res <- DESeq2::results(dds)
head(res)
## We'll instead focus on IFNgamma vs naive
res <- DESeq2::results(dds, contrast = c("condition_name", "IFNg", "naive"))
head(res)
```
As `res` is a *DataFrame* object, it carries metadata with information on the
meaning of the columns:
```{r mcolsres}
mcols(res, use.names = TRUE)
```
The first column, `baseMean`, is a just the average of the normalized count
values, dividing by size factors, taken over all samples in the *DESeqDataSet*.
The remaining four columns refer to a specific contrast, namely the comparison
of the `IFNg` level over the `naive` level for the factor variable
`condition_name`.
The column `log2FoldChange` is the effect size estimate. It tells us how much
the gene's expression seems to have changed due to infection with IFN gamma
in comparison to naive samples. This value is reported on a logarithmic
scale to base 2: for example, a log2 fold change of 1.5 means that the gene's
expression is increased by a multiplicative factor of 2^1.5.
Of course, this estimate has an uncertainty associated with it, which is
available in the column `lfcSE`, the standard error estimate for the log2 fold
change estimate. We can also express the uncertainty of a particular effect
size estimate as the result of a statistical test. The purpose of a test for
differential expression is to test whether the data provides sufficient evidence
to conclude that this value is really different from zero. *DESeq2* performs for
each gene a *hypothesis test* to see whether evidence is sufficient to decide
against the *null hypothesis* that there is zero effect of the treatment on the
gene and that the observed difference between treatment and control was merely
caused by experimental variability (i.e., the type of variability that you can
expect between different samples in the same treatment group). As usual in
statistics, the result of this test is reported as a *p* value, and it is found
in the column `pvalue`. Remember that a *p* value indicates the probability that
an effect as strong as the observed one, or even stronger, would be seen under
the situation described by the null hypothesis.
We can also summarize the results with the following line of code, which reports
some additional information, that will be covered in later sections.
```{r prepisee}
summary(res)
hist(res$pvalue)
## Remove the genes that were filtered out in the independent filtering
hist(res$pvalue[!is.na(res$padj)])
## We also add a couple of extra columns that will be useful for the interactive
## visualization later
rowData(dds)$log10Dispersion <- log10(rowData(dds)$dispersion)
restmp <- DataFrame(res)
restmp$log10BaseMean <- log10(restmp$baseMean)
restmp$mlog10PValue <- -log10(restmp$pvalue)
colnames(restmp) <- paste0("DESeq2_IFNg_vs_naive_", colnames(restmp))
rowData(dds) <- cbind(rowData(dds), restmp)
```
Note that there are many genes with differential expression due to IFN gamma
treatment at the FDR level of 10%. There are
two ways to be more strict about which set of genes are considered significant:
* lower the false discovery rate threshold (the threshold on `padj` in the
results table)
* raise the log2 fold change threshold from 0 using the `lfcThreshold` argument
of *results*
If we lower the false discovery rate threshold, we should also tell this value
to `results()`, so that the function will use an alternative threshold for the
optimal independent filtering step:
```{r res005}
res.05 <- results(dds, alpha = 0.05,
contrast = c("condition_name", "IFNg", "naive"))
table(res.05$padj < 0.05)
```
If we want to raise the log2 fold change threshold, so that we test for genes
that show more substantial changes due to treatment, we simply supply a value on
the log2 scale. For example, by specifying `lfcThreshold = 1`, we look for genes
that show significant effects of treatment on gene counts more than doubling or
less than halving, because 2^1 = 2.
```{r reslfc}
resLFC1 <- results(dds, lfcThreshold = 1,
contrast = c("condition_name", "IFNg", "naive"))
summary(resLFC1)
table(resLFC1$padj < 0.1)
```
Sometimes a subset of the *p* values in `res` will be `NA` ("not available").
This is *DESeq*'s way of reporting that all counts for this gene were zero, and
hence no test was applied. In addition, *p* values can be assigned `NA` if the
gene was excluded from analysis because it contained an extreme count outlier.
For more information, see the outlier detection section of the *DESeq2*
vignette.
With *DESeq2*, there is also an easy way to plot the (normalized, transformed)
counts for specific genes, using the `plotCounts` function:
```{r plotcounts}
plotCounts(dds, gene = "ENSG00000126561.16", intgroup = "condition_name",
normalized = TRUE, transform = FALSE)
```
## Performing differential expression testing with *edgeR*
Next we will show how to perform differential expression analysis with *edgeR*.
Recall that we have a *DGEList* `dge`, containing all the necessary information:
```{r namesdge}
names(dge)
```
We first define a design matrix, using the same formula syntax as for *DESeq2*
above.
```{r designedger}
design <- model.matrix(~ line_id + condition_name, data = dge$samples)
head(design)
```
While *DESeq2* performs independent filtering of lowly expressed genes
internally, this is done by the user before applying *edgeR*. Here, we filter
out lowly expressed genes using the `filterByExpr()` function, and then estimate
the dispersion for each gene. Note that it is important that we specify the
design in the dispersion calculation (it will be used to determine a
suitable number of samples to require a gene to be expressed in).
Afterwards, we plot the estimated dispersions.
```{r filter}
keep <- edgeR::filterByExpr(dge, design)
dge <- dge[keep, ]
dge <- edgeR::estimateDisp(dge, design)
edgeR::plotBCV(dge)
```
Finally, we fit the generalized linear model and perform the test. In the
`glmQLFTest` function, we indicate which coefficient (which column in the design
matrix) that we would like to test for. It is possible to test more general
contrasts as well, and the user guide contains many examples on how to do this.
The `topTags` function extracts the top-ranked genes. You can indicate the
adjusted p-value cutoff, and/or the number of genes to keep.
```{r edgeRcall}
fit <- edgeR::glmQLFit(dge, design)
qlf <- edgeR::glmQLFTest(fit, coef = "condition_nameIFNg")
tt.all <- edgeR::topTags(qlf, n = nrow(dge), sort.by = "none") # all genes
hist(tt.all$table$PValue)
tt <- edgeR::topTags(qlf, n = nrow(dge), p.value = 0.1) # genes with adj.p<0.1
tt10 <- edgeR::topTags(qlf) # just the top 10 by default
tt10
```
The columns in the *edgeR* result data frame are similar to the ones output by
*DESeq2*. *edgeR* represents the overall expression level on the log-CPM scale
rather than on the normalized count scale that *DESeq2* uses. The `F` column
contains the test statistic, and the `FDR` column contains the
Benjamini-Hochberg adjusted p-values.
We can compare the sets of significantly differentially expressed genes to see
how the results from the two packages overlap:
```{r deseq2vsedger}
shared <- intersect(rownames(res), rownames(tt.all$table))
table(DESeq2 = res$padj[match(shared, rownames(res))] < 0.1,
edgeR = tt.all$table$FDR[match(shared, rownames(tt.all$table))] < 0.1)
```
We can also compare the two result lists by the ranks:
```{r deseq2vsedgerplot}
plot(rank(res$pvalue[match(shared, rownames(res))]),
rank(tt.all$table$PValue[match(shared, rownames(tt.all$table))]),
cex = 0.1, xlab = "DESeq2", ylab = "edgeR")
```
Also with *edgeR* we can test for significance relative to a fold-change
threshold, using the function `glmTreat`. Below we set the log fold-change
threshold to 1 (i.e., fold change threshold equal to 2), as for *DESeq2* above.
```{r treat}
treatres <- edgeR::glmTreat(fit, coef = "condition_nameIFNg", lfc = 1)
tt.treat <- edgeR::topTags(treatres, n = nrow(dge), sort.by = "none")
```
### Quiz {-}
* A feature being called DE: "Why is my expected shortlisted gene not showing up?" What possible factors can play a role in the feature being above or below the significance threshold?
* Shall I subset my DE results to the genes only detected as DE? Why?
* The results between DESeq2 and edgeR might differ. Which one is "correct"? How can you better appreciate the existing commonalities and differences in the DE analysis?
* The thought above is valid also for the process of integrating different DE analyses (i.e. using the same method, but "comparing different groups"). What can you think of to represent these results?
# Plotting results
## MA plot with DESeq2
An *MA-plot* [@Dudoit2002Statistical] provides a useful overview for an
experiment with a two-group comparison. The log2 fold change for
a particular comparison is plotted on the y-axis and the average of the counts
normalized by size factor is shown on the x-axis ("M" for minus, because a log
ratio is equal to log minus log, and "A" for average). Each gene is represented
with a dot. Genes with an adjusted *p* value below a threshold (here 0.1, the
default with *DESeq2*) are shown in color
```{r plotma}
DESeq2::plotMA(res, ylim = c(-5, 5))
```
We see that there are many genes with low expression levels that nevertheless
have large fold changes (since we are, effectively, dividing by a small
number). To get more interpretable log fold changes (e.g., for ranking genes),
we use the `lfcShrink` function to shrink the log2
fold changes for the comparison of IFN gamma-treated vs naive samples. There are
three types of shrinkage estimators in *DESeq2*, which are covered in the
vignette. Here we specify the _apeglm_ method for shrinking coefficients, which
is good for shrinking the noisy LFC estimates while giving low bias LFC
estimates for true large differences [@Zhu2019-apeglm]. To use apeglm we specify
a coefficient from the model to shrink, either by name or number as the
coefficient appears in `resultsNames(dds)`.
```{r apeglm}
library(apeglm)
DESeq2::resultsNames(dds)
resape <- DESeq2::lfcShrink(dds, coef = "condition_name_IFNg_vs_naive", type = "apeglm")
DESeq2::plotMA(resape, ylim = c(-5, 5))
```
## MA / Smear plot with edgeR
In *edgeR*, the MA plot is obtained via the `plotSmear` function.
```{r smear}
edgeR::plotSmear(qlf, de.tags = rownames(tt$table))
```
## Heatmap of the most significant genes
Another way of representing the results of a differential expression analysis is
to construct a heatmap of the top differentially expressed genes.
A heatmap is a "color coded expression matrix", where the rows and columns are
clustered using hierarchical clustering. Typically, it should not be applied to
counts, but works better with transformed values. Here we show how it can be
applied to the variance-stabilized values generated above. We would
expect the contrasted sample groups to cluster separately ("by construction",
since the genes were selected to be most discriminative between the groups).
The heatmap will allow us to display, e.g., the variability within the groups
of the differentially expressed genes.
We choose the top 30 differentially expressed genes. There are many functions
in R that can generate heatmaps, here we show the one from the
`r CRANpkg("pheatmap")` package.
```{r heatmap, fig.width = 10, fig.height = 10, message=FALSE}
library(pheatmap)
stopifnot(rownames(vsd) == rownames(res))
mat <- assay(vsd)
rownames(mat) <- rowData(vsd)$SYMBOL
mat <- mat[head(order(res$padj), 30), ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(vsd)[, c("condition_name"), drop = FALSE])
pheatmap(mat, annotation_col = df)
```
We can of course also create heatmaps for other sets of genes - for example,
the collection of genes with the highest overall variance (which may or may
not indicate a difference between the groups - in this particular case most
of the highly variable genes show a clear difference between the groups).
```{r heatmap2, fig.width = 10, fig.height = 10, message=FALSE}
mat <- assay(vsd)
rownames(mat) <- rowData(vsd)$SYMBOL
topVarGenes <- head(order(rowVars(mat), decreasing = TRUE), 30)
mat <- mat[topVarGenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(vsd)[, c("condition_name"), drop = FALSE])
pheatmap(mat, annotation_col = df)
```
TODO: see the "stripes" in some genes over all the samples
## Interactive visualization with iSEE
`r Biocpkg("iSEE")` is a Bioconductor package that allows interactive exploration of any data
stored in a *SummarizedExperiment* container, or any class extending this (such
as, e.g., the *DESeqDataSet* class, or the *SingleCellExperiment* for
single-cell data). By calling the `iSEE()` function with the object as the first
argument, an interactive application will be opened, in which all observed
values as well as metadata columns (`rowData` and `colData`) can be explored.
```{r isee, warning = FALSE, message = FALSE}