-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.Rmd
1868 lines (1539 loc) · 75 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "RNA-seq workflow: gene-level exploratory analysis and differential expression"
author:
- name: Michael I. Love
affiliation: Departments of Biostatistics and Genetics, UNC-Chapel Hill, Chapel Hill, NC, US
- name: Simon Anders
affiliation:
- Institute for Molecular Medicine Finland (FIMM), Helsinki, Finland
- &EMBL European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- name: Vladislav Kim
affiliation: *EMBL
- name: Wolfgang Huber
affiliation: *EMBL
date: 30 March 2017
output:
rmdformats::readthedown:
toc_depth: 3
number_sections: true
bibliography: bibliography.bib
---
<!-- to compile this: rmarkdown::render("rnaseqGene.Rmd") -->
<!--
# a list of all required libraries:
reqlibs = sub(".*library\\(\"(.*?)\"\\).*","\\1",grep("library\\(",readLines("rnaseqGene.Rmd"),value=TRUE))
find.package(reqlibs)
-->
```{r style, echo=FALSE, message=FALSE, warning=FALSE, results="asis"}
library("BiocStyle")
library("knitr")
library("rmarkdown")
## options(width = 70) # Andrzej said this is not needed
opts_chunk$set(message = FALSE, error = FALSE, warning = FALSE,
cache = FALSE, fig.width = 5, fig.height = 5)
```
# Abstract
Here we walk through an end-to-end gene-level RNA-seq differential
expression workflow using Bioconductor packages. We will start from
the FASTQ files, show how these were aligned to the reference genome,
and prepare a count matrix which tallies the number of RNA-seq
reads/fragments within each gene for each sample. We will
perform exploratory data analysis (EDA) for quality assessment and to
explore the relationship between samples, perform differential gene
expression analysis, and visually explore the results.
# Introduction
Bioconductor has many packages which support analysis of
high-throughput sequence
data, including RNA sequencing (RNA-seq). The packages which we
will use in this workflow include core packages maintained by the
Bioconductor core team for importing and processing raw sequencing
data and loading gene annotations. We will also use
contributed packages for statistical analysis and visualization
of sequencing data. Through scheduled releases every 6 months, the
Bioconductor project ensures that all the packages within a release
will work together in harmony (hence the "conductor" metaphor).
The packages used in this workflow are loaded with the
*library* function and can be installed by following the
[Bioconductor package installation instructions](http://bioconductor.org/install/#install-bioconductor-packages).
A published (but essentially similar) version of this workflow, including reviewer reports and comments
is available at [F1000Research](http://f1000research.com/articles/4-1070).
If you have questions about this workflow or any Bioconductor
software, please post these to the
[Bioconductor support site](https://support.bioconductor.org/).
If the questions concern a specific package, you can tag the post with
the name of the package, or for general questions about the workflow,
tag the post with `rnaseqgene`. Note the
[posting guide](http://www.bioconductor.org/help/support/posting-guide/)
for crafting an optimal question for the support site.
## Experimental data
The data used in this workflow is stored in the
`r Biocexptpkg("airway")` package that summarizes an RNA-seq experiment
wherein airway smooth muscle cells were treated with dexamethasone, a
synthetic glucocorticoid steroid with anti-inflammatory effects
[@Himes2014RNASeq]. Glucocorticoids are used, for example,
by people with asthma to reduce inflammation of the airways. In the experiment,
four primary human airway smooth muscle cell lines were treated with 1
micromolar dexamethasone for 18 hours. For each of the four cell
lines, we have a treated and an untreated sample. For more description
of the experiment see the
[PubMed entry 24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665)
and for raw data see the
[GEO entry GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778).
<a id="count"></a>
# Preparing count matrices
As input, the count-based statistical methods, such as
`r Biocpkg("DESeq2")` [@Love2014Moderated],
`r Biocpkg("edgeR")` [@Robinson2009EdgeR],
`r Biocpkg("limma")` with the voom method [@Law2014Voom],
`r Biocpkg("DSS")` [@Wu2013New],
`r Biocpkg("EBSeq")` [@Leng2013EBSeq] and
`r Biocpkg("baySeq")` [@Hardcastle2010BaySeq],
expect input data as obtained, e.g., from RNA-seq or another
high-throughput sequencing experiment, in the form of a matrix of
integer values. The value in the *i*-th row and the *j*-th column of
the matrix tells how many reads (or fragments, for paired-end RNA-seq)
have been assigned to gene *i* in sample *j*. Analogously, for other
types of assays, the rows of the matrix might correspond e.g., to
binding regions (with ChIP-Seq), species of bacteria (with metagenomic
datasets), or peptide sequences (with quantitative mass spectrometry).
The values in the matrix should be counts of sequencing
reads/fragments. This is important for *DESeq2*'s statistical model to
hold, as only counts allow assessing the measurement precision
correctly. It is important to *never* provide counts that were
pre-normalized for sequencing depth/library size, as the statistical
model is most powerful when applied to un-normalized counts, and is
designed to account for library size differences internally.
## Recommended: transcript abundances and the *tximport* pipeline
Before we demonstrate how to align and then count RNA-seq fragments,
we mention that a newer and faster alternative pipeline is to use
transcript abundance quantification methods such as
[Salmon](https://combine-lab.github.io/salmon/) [@Patro2016Salmon],
[Sailfish](http://www.cs.cmu.edu/~ckingsf/software/sailfish/) [@Patro2014Sailfish],
[kallisto](https://pachterlab.github.io/kallisto/) [@Bray2016Near], or
[RSEM](http://deweylab.github.io/RSEM/) [@Li2011RSEM],
to estimate abundances without aligning reads, followed by the
`r Biocpkg("tximport")` package for assembling estimated count and
offset matrices for use with Bioconductor differential gene expression
packages.
A tutorial on how to use the *Salmon* software for quantifying
transcript abundance can be found [here](https://combine-lab.github.io/salmon/getting_started/).
Following that tutorial, you can use the steps in the
[tximport vignette](https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html),
which will show how to build a *DESeqDataSet*. This is our current recommended
pipeline for users, but below we still include steps on aligning reads to the
genome and building count matrices from BAM files.
The advantages of using the transcript abundance quantifiers in
conjunction with *tximport* to produce gene-level count matrices and
normalizing offsets, are: this approach corrects for any potential
changes in gene length across samples (e.g. from differential isoform
usage) [@Trapnell2013Differential]; some of these methods are
substantially faster and require less memory and disk usage compared
to alignment-based methods; and it is possible to avoid discarding
those fragments that can align to multiple genes with homologous
sequence [@Robert2015Errors]. Note that transcript abundance
quantifiers skip the generation of large files which store read
alignments, instead producing smaller files which store estimated
abundances, counts, and effective lengths per transcript. For more
details, see the manuscript describing this approach
[@Soneson2015Differential], and the `r Biocpkg("tximport")` package
vignette for software details. The entry point back into this workflow
after using a transcript quantifier and *tximport* would be the
[section on exploratory data analysis](#eda) below.
## Aligning reads to a reference genome
The computational analysis of an RNA-seq experiment begins from the FASTQ files that
contain the nucleotide sequence of each read and a quality score at
each position. These reads must first be aligned to a reference
genome or transcriptome, or the abundances and estimated counts per
transcript can be estimated without alignment, as described above.
In either case, it is important to know if the sequencing
experiment was single-end or paired-end, as the alignment software
will require the user to specify both FASTQ files for a paired-end
experiment. The output of this alignment step is commonly stored in a
file format called [SAM/BAM](http://samtools.github.io/hts-specs).
A number of software programs exist to align reads to a reference
genome, and we encourage you to check out some of the benchmarking papers that
discuss the advantages and disadvantages of each software, which
include accuracy, sensitivity in aligning reads over splice junctions, speed,
memory required, usability, and many other features.
Here, we used the
[STAR read aligner](https://code.google.com/p/rna-star/) [@Dobin2013STAR]
to align the reads for our current experiment to the Ensembl
release 75 [@Flicek2014Ensembl] human reference genome. In the following code example, it is assumed that there is a file in the current
directory called `files` with each line containing an identifier for
each experiment, and we have all the FASTQ files in a subdirectory
`fastq`. If you have downloaded the FASTQ files from the
Sequence Read Archive, the identifiers would be SRA run IDs,
e.g. `SRR1039520`. You should have two files for a paired-end
experiment for each ID, `fastq/SRR1039520_1.fastq1` and
`fastq/SRR1039520_2.fastq`, which give the first and second read for
the paired-end fragments. If you have performed a single-end
experiment, you would only have one file per ID.
We have also created a subdirectory, `aligned`,
where STAR will output its alignment files.
```
for f in `cat files`; do STAR --genomeDir ../STAR/ENSEMBL.homo_sapiens.release-75 \
--readFilesIn fastq/$f\_1.fastq fastq/$f\_2.fastq \
--runThreadN 12 --outFileNamePrefix aligned/$f.; done
```
[SAMtools](http://www.htslib.org/doc/samtools.html) [@Li2009Sequence]
was used to generate BAM files. The `-@` flag can be used to allocate
additional threads.
```
for f in `cat files`; do samtools view -bS aligned/$f.Aligned.out.sam \
-o aligned/$f.bam; done
```
The BAM files for a number of sequencing runs can then be used to
generate count matrices, as described in the following section.
## Locating alignment files
Besides the count matrix that we will use later, the
`r Biocexptpkg("airway")` package also contains eight files
with a small subset of the total number of reads in the experiment.
The reads were selected which aligned to a small region of
chromosome 1. Here, for demonstration, we chose a subset of reads because the full alignment files are
large (a few gigabytes each), and because it takes between 10-30
minutes to count the fragments for each sample.
We will use these files to demonstrate how a count matrix can be constructed
from BAM files. Afterwards, we will load the full count matrix
corresponding to all samples and all data, which is already provided
in the same package, and will continue the analysis with that full
matrix.
We load the data package with the example data:
```{r loadairway}
library("airway")
```
The R function *system.file* can be used to find out where on your
computer the files from a package have been installed. Here we ask for
the full path to the `extdata` directory, where R packages store
external data, that is part of the `r Biocexptpkg("airway")` package.
```{r dir}
indir <- system.file("extdata", package="airway", mustWork=TRUE)
```
In this directory, we find the eight BAM files (and some other files):
```{r list.files}
list.files(indir)
```
Typically, we have a table with detailed information for each of our
samples that links samples to the associated FASTQ and BAM files.
For your own project, you might create such a comma-separated
value (CSV) file using a text editor or spreadsheet software such as Excel.
We load such a CSV file with *read.csv*:
```{r sampleTable}
csvfile <- file.path(indir, "sample_table.csv")
sampleTable <- read.csv(csvfile, row.names = 1)
sampleTable
```
Once the reads have been aligned, there are a number of tools that
can be used to count the number of reads/fragments that can be
assigned to genomic features for each sample. These often take as
input SAM/BAM alignment files and a file specifying the genomic
features, e.g. a GFF3 or GTF file specifying the gene models.
## *DESeq2* import functions
The following tools can be used generate count matrices:
*summarizeOverlaps* [@Lawrence2013Software],
*featureCounts* [@Liao2014FeatureCounts],
*tximport* [@Soneson2015Differential],
*htseq-count* [@Anders2015HTSeqa].
function | package | framework | output | *DESeq2* input function
--------------------|------------------------------------------------------|----------------|------------------------|-------------------------
*summarizeOverlaps* | `r Biocpkg("GenomicAlignments")` | R/Bioconductor | *SummarizedExperiment* | *DESeqDataSet*
*featureCounts* | `r Biocpkg("Rsubread")` | R/Bioconductor | matrix | *DESeqDataSetFromMatrix*
*tximport* | `r Biocpkg("tximport")` | R/Bioconductor | list of matrices | *DESeqDataSetFromTximport*
*htseq-count* | [HTSeq](http://www-huber.embl.de/users/anders/HTSeq) | Python | files | *DESeqDataSetFromHTSeq*
We now proceed using *summarizeOverlaps*.
Using the `Run` column in the sample table, we construct the full
paths to the files we want to perform the counting operation on:
```{r filenames}
filenames <- file.path(indir, paste0(sampleTable$Run, "_subset.bam"))
file.exists(filenames)
```
We indicate in Bioconductor that these files are BAM files using the
*BamFileList* function from the `r Biocpkg("Rsamtools")` package
that provides an R interface to BAM files.
Here we also specify details about how the BAM files should
be treated, e.g., only process 2 million reads at a time. See
`?BamFileList` for more information.
```{r Rsamtools}
library("Rsamtools")
bamfiles <- BamFileList(filenames, yieldSize=2000000)
```
**Note:** make sure that the chromosome names of the genomic features
in the annotation you use are consistent with the chromosome names of
the reference used for read alignment. Otherwise, the scripts might
fail to count any reads to features due to the mismatching names.
For example, a common mistake is when the alignment files contain
chromosome names in the style of `1` and the gene annotation in the
style of `chr1`, or the other way around. See the *seqlevelsStyle*
function in the `r Biocpkg("GenomeInfoDb")` package for solutions.
We can check the chromosome names (here called "seqnames")
in the alignment files like so:
```{r seqinfo}
seqinfo(bamfiles[1])
```
## Defining gene models
Next, we need to read in the gene model that will be used for
counting reads/fragments. We will read the gene model from an Ensembl
[GTF file](http://www.ensembl.org/info/website/upload/gff.html) [@Flicek2014Ensembl],
using *makeTxDbFromGFF* from the `r Biocpkg("GenomicFeatures")` package.
GTF files can be downloaded from
[Ensembl's FTP site](http://www.ensembl.org/info/data/ftp/) or other gene model repositories.
A *TxDb* object is a database that can be used to
generate a variety of range-based objects, such as exons, transcripts,
and genes. We want to make a list of exons grouped by gene for
counting read/fragments.
There are other options for constructing a *TxDb*.
For the *known genes* track from the UCSC Genome Browser [@Kent2002Human],
one can use the pre-built Transcript DataBase:
`r Biocannopkg("TxDb.Hsapiens.UCSC.hg19.knownGene")`.
If the annotation file is accessible from
`r Biocpkg("AnnotationHub")` (as is the case for the Ensembl genes),
a pre-scanned GTF file can be imported using *makeTxDbFromGRanges*.
Here we will demonstrate loading from a GTF file:
```{r genfeat}
library("GenomicFeatures")
```
We indicate that none of our sequences (chromosomes) are circular
using a 0-length character vector.
```{r txdb}
gtffile <- file.path(indir,"Homo_sapiens.GRCh37.75_subset.gtf")
txdb <- makeTxDbFromGFF(gtffile, format = "gtf", circ_seqs = character())
txdb
```
The following line produces a *GRangesList* of all the exons grouped
by gene [@Lawrence2013Software]. Each element of the list is a
*GRanges* object of the exons for a gene.
```{r}
ebg <- exonsBy(txdb, by="gene")
ebg
```
## Read counting step
After these preparations, the actual counting is easy. The function
*summarizeOverlaps* from the `r Biocpkg("GenomicAlignments")`
package will do this. This produces a *SummarizedExperiment*
object that contains a variety of information about
the experiment, and will be described in more detail below.
**Note:** If it is desired to perform counting using multiple cores, one can use
the *register* and *MulticoreParam* or *SnowParam* functions from the
`r Biocpkg("BiocParallel")` package before the counting call below.
Expect that the `summarizeOverlaps` call will take at least 30 minutes
per file for a human RNA-seq file with 30 million aligned reads. By sending
the files to separate cores, one can speed up the entire counting process.
```{r}
library("GenomicAlignments")
library("BiocParallel")
```
Here we specify to use one core, not multiple cores. We could have
also skipped this line and the counting step would run in serial.
```{r}
register(SerialParam())
```
The following call creates the *SummarizedExperiment* object with counts:
```{r}
se <- summarizeOverlaps(features=ebg, reads=bamfiles,
mode="Union",
singleEnd=FALSE,
ignore.strand=TRUE,
fragments=TRUE )
```
We specify a number of arguments besides the `features` and the
`reads`. The `mode` argument describes what kind of read overlaps will
be counted. These modes are shown in Figure 1 of the
*Counting reads with summarizeOverlaps* vignette for the
`r Biocpkg("GenomicAlignments")` package.
Note that fragments will be counted only once to each gene, even if
they overlap multiple exons of a gene which may themselves be overlapping.
Setting `singleEnd` to `FALSE`
indicates that the experiment produced paired-end reads, and we want
to count a pair of reads (a fragment) only once toward the count
for a gene.
The `fragments` argument can be used when `singleEnd=FALSE` to specify if unpaired
reads should be counted (yes if `fragments=TRUE`).
In order to produce correct counts, it is important to know if the
RNA-seq experiment was strand-specific or not. This experiment was not
strand-specific so we set `ignore.strand` to `TRUE`.
However, certain strand-specific protocols could have the reads
align only to the opposite strand of the genes.
The user must check if the experiment was strand-specific and if so,
whether the reads should align to the forward or reverse strand of the genes.
For various counting/quantifying tools, one specifies counting on the
forward or reverse strand in different ways, although this task
is currently easiest with *htseq-count*, *featureCounts*, or the
transcript abundance quantifiers mentioned previously.
It is always a good idea to check the column sums of the count matrix
(see below) to make sure these totals match the expected of the number
of reads or fragments aligning to genes. Additionally, one can
visually check the read alignments using a genome visualization tool.
## SummarizedExperiment
```{r sumexp, echo=FALSE}
par(mar=c(0,0,0,0))
plot(1,1,xlim=c(0,100),ylim=c(0,100),bty="n",
type="n",xlab="",ylab="",xaxt="n",yaxt="n")
polygon(c(45,90,90,45),c(5,5,70,70),col="pink",border=NA)
polygon(c(45,90,90,45),c(68,68,70,70),col="pink3",border=NA)
text(67.5,40,"assay")
text(67.5,35,'e.g. "counts"')
polygon(c(10,40,40,10),c(5,5,70,70),col="skyblue",border=NA)
polygon(c(10,40,40,10),c(68,68,70,70),col="skyblue3",border=NA)
text(25,40,"rowRanges")
polygon(c(45,90,90,45),c(75,75,95,95),col="palegreen",border=NA)
polygon(c(45,47,47,45),c(75,75,95,95),col="palegreen3",border=NA)
text(67.5,85,"colData")
```
**The component parts of a *SummarizedExperiment* object.** The `assay` (pink
block) contains the matrix of counts, the `rowRanges` (blue block) contains
information about the genomic ranges and the `colData` (green block)
contains information about the samples. The highlighted line in each
block represents the first row (note that the first row of `colData`
lines up with the first column of the `assay`).
The *SummarizedExperiment* container is diagrammed in the Figure above
and discussed in the latest Bioconductor paper [@Huber2015Orchestrating].
In our case we have created a single matrix named "counts" that contains the fragment
counts for each gene and sample, which is stored in `assay`. It is
also possible to store multiple matrices, accessed with `assays`.
The `rowRanges` for our object is the *GRangesList* we used for
counting (one *GRanges* of exons for each row of the count matrix).
The component parts of the *SummarizedExperiment* are accessed with an
R function of the same name: `assay` (or `assays`), `rowRanges` and `colData`.
This example code above actually only counted a small subset of fragments
from the original experiment. Nevertheless, we
can still investigate the resulting *SummarizedExperiment* by looking at
the counts in the `assay` slot, the phenotypic data about the samples in
`colData` slot (in this case an empty *DataFrame*), and the data about the
genes in the `rowRanges` slot.
```{r}
se
dim(se)
assayNames(se)
head(assay(se), 3)
colSums(assay(se))
```
The `rowRanges`, when printed, only shows the first *GRanges*, and tells us
there are 19 more elements:
```{r}
rowRanges(se)
```
The `rowRanges` also contains metadata about the construction
of the gene model in the `metadata` slot. Here we use a helpful R
function, `str`, to display the metadata compactly:
```{r}
str(metadata(rowRanges(se)))
```
The `colData`:
```{r}
colData(se)
```
The `colData` slot, so far empty, should contain all the metadata.
Because we used a column of `sampleTable` to produce the `bamfiles`
vector, we know the columns of `se` are in the same order as the
rows of `sampleTable`. We can assign the `sampleTable` as the
`colData` of the summarized experiment, by converting
it into a *DataFrame* and using the assignment function:
```{r}
colData(se) <- DataFrame(sampleTable)
colData(se)
```
## Branching point
At this point, we have counted the fragments which overlap the genes
in the gene model we specified. This is a branching point where we
could use a variety of Bioconductor packages for exploration and
differential expression of the count data, including
`r Biocpkg("edgeR")` [@Robinson2009EdgeR],
`r Biocpkg("limma")` with the voom method [@Law2014Voom],
`r Biocpkg("DSS")` [@Wu2013New],
`r Biocpkg("EBSeq")` [@Leng2013EBSeq] and
`r Biocpkg("baySeq")` [@Hardcastle2010BaySeq].
@Schurch2016How
[compared performance](https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/27022035/)
of different statistical methods
for RNA-seq using a large number of biological replicates and can help
users to decide which tools make sense to use, and how many
biological replicates are necessary to obtain a certain sensitivity.
We will continue using `r Biocpkg("DESeq2")` [@Love2014Moderated].
The *SummarizedExperiment* object is all we
need to start our analysis. In the following section we will show how
to use it to create the data object used by `r Biocpkg("DESeq2")`.
<a id="construct"></a>
# The *DESeqDataSet* object, sample information and the design formula
Bioconductor software packages often define and use a custom class for
storing data that makes sure that all the needed data slots are
consistently provided and fulfill the requirements. In addition,
Bioconductor has general data classes (such as the
*SummarizedExperiment*) that can be used to move data between
packages. Additionally, the core Bioconductor classes provide useful
functionality: for example, subsetting or reordering the rows or
columns of a *SummarizedExperiment* automatically subsets or reorders
the associated *rowRanges* and *colData*, which can help to prevent
accidental sample swaps that would otherwise lead to spurious
results. With *SummarizedExperiment* this is all taken care of behind
the scenes.
In *DESeq2*, the custom class is called *DESeqDataSet*. It is built on
top of the *SummarizedExperiment* class, and it is easy to convert
*SummarizedExperiment* objects into *DESeqDataSet* objects, which we
show below. One of the two main differences is that the `assay` slot is
instead accessed using the *counts* accessor function, and the
*DESeqDataSet* class enforces that the values in this matrix are
non-negative integers.
A second difference is that the *DESeqDataSet* has an associated
*design formula*. The experimental design is specified at the
beginning of the analysis, as it will inform many of the *DESeq2*
functions how to treat the samples in the analysis (one exception is
the size factor estimation, i.e., the adjustment for differing library
sizes, which does not depend on the design formula). The design
formula tells which columns in the sample information table (`colData`)
specify the experimental design and how these factors should be used
in the analysis.
The simplest design formula for differential expression would be
`~ condition`, where `condition` is a column in `colData(dds)` that
specifies which of two (or more groups) the samples belong to. For
the airway experiment, we will specify `~ cell + dex`
meaning that we want to test for the effect of dexamethasone (`dex`)
controlling for the effect of different cell line (`cell`). We can see
each of the columns just using the `$` directly on the
*SummarizedExperiment* or *DESeqDataSet*:
```{r secell}
se$cell
se$dex
```
**Note:** it is prefered in R that the first level of a factor be the
reference level (e.g. control, or untreated samples), so we can
*relevel* the `dex` factor like so:
```{r sedex}
library("magrittr")
se$dex %<>% relevel("untrt")
se$dex
```
`%<>%` is the compound assignment pipe-operator from
the `r CRANpkg("magrittr")` package, the above line of code is a concise way of saying
```{r explaincmpass, eval = FALSE}
se$dex <- relevel(se$dex, "untrt")
```
For running *DESeq2* models, you can use R's formula notation to
express any fixed-effects experimental design.
Note that *DESeq2* uses the same formula notation as, for instance, the *lm*
function of base R. If the research aim is to determine for which
genes the effect of treatment is different across groups, then
interaction terms can be included and tested using a design such as
`~ group + treatment + group:treatment`. See the manual page for
`?results` for more examples. We will show how to use an interaction
term to test for condition-specific changes over time in a
time course example below.
In the following sections, we will demonstrate the construction of the
*DESeqDataSet* from two starting points:
* from a *SummarizedExperiment* object
* from a count matrix and a sample information table
For a full example of using the *HTSeq* Python package for read
counting, please see the `r Biocexptpkg("pasilla")` vignette. For an
example of generating the *DESeqDataSet* from files produced by
*htseq-count*, please see the `r Biocpkg("DESeq2")` vignette.
## Starting from *SummarizedExperiment*
We now use R's *data* command to load a prepared
*SummarizedExperiment* that was generated from the publicly available
sequencing data files associated with @Himes2014RNASeq,
described above. The steps we used to produce this object were
equivalent to those you worked through in the previous sections,
except that we used all the reads and all the genes. For more details
on the exact steps used to create this object, type
`vignette("airway")` into your R session.
```{r}
data("airway")
se <- airway
```
Again, we want to specify that `untrt` is the reference level for the
dex variable:
```{r}
se$dex %<>% relevel("untrt")
se$dex
```
We can quickly check the millions of fragments that uniquely aligned
to the genes (the second argument of *round* tells how many decimal
points to keep).
```{r}
round( colSums(assay(se)) / 1e6, 1 )
```
Supposing we have constructed a *SummarizedExperiment* using
one of the methods described in the previous section, we now need to
make sure that the object contains all the necessary information about
the samples, i.e., a table with metadata on the count matrix's columns
stored in the `colData` slot:
```{r}
colData(se)
```
Here we see that this object already contains an informative
`colData` slot -- because we have already prepared it for you, as
described in the `r Biocexptpkg("airway")` vignette.
However, when you work with your own data, you will have to add the
pertinent sample / phenotypic information for the experiment at this stage.
We highly recommend keeping this information in a comma-separated
value (CSV) or tab-separated value (TSV) file, which can be exported
from an Excel spreadsheet, and the assign this to the `colData` slot,
making sure that the rows correspond to the columns of the
*SummarizedExperiment*. We made sure of this correspondence earlier by
specifying the BAM files using a column of the sample table.
Once we have our fully annotated *SummarizedExperiment* object,
we can construct a *DESeqDataSet* object from it that will then form
the starting point of the analysis.
We add an appropriate design for the analysis:
```{r}
library("DESeq2")
```
```{r}
dds <- DESeqDataSet(se, design = ~ cell + dex)
```
## Starting from count matrices
In this section, we will show how to build an *DESeqDataSet* supposing
we only have a count matrix and a table of sample information.
**Note:** if you have prepared a *SummarizedExperiment* you should skip this
section. While the previous section would be used to construct a
*DESeqDataSet* from a *SummarizedExperiment*, here we first extract
the individual object (count matrix and sample info) from the
*SummarizedExperiment* in order to build it back up into a new object
-- only for demonstration purposes.
In practice, the count matrix would either be read in from a file or
perhaps generated by an R function like *featureCounts* from the
`r Biocpkg("Rsubread")` package [@Liao2014FeatureCounts].
The information in a *SummarizedExperiment* object can be
accessed with accessor functions. For example, to see the actual data,
i.e., here, the fragment counts, we use the *assay* function. (The *head*
function restricts the output to the first few lines.)
```{r}
countdata <- assay(se)
head(countdata, 3)
```
In this count matrix, each row represents an Ensembl gene, each column
a sequenced RNA library, and the values give the raw numbers of
fragments that were uniquely assigned to the respective gene in each
library. We also have information on each of the samples (the columns of the
count matrix). If you've counted reads with some other software, it is
very important to check that the columns of the count matrix correspond to the rows
of the sample information table.
```{r}
coldata <- colData(se)
```
We now have all the ingredients to prepare our data object in a form
that is suitable for analysis, namely:
* `countdata`: a table with the fragment counts
* `coldata`: a table with information about the samples
To now construct the *DESeqDataSet* object from the matrix of counts and the
sample information table, we use:
```{r}
ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
colData = coldata,
design = ~ cell + dex)
```
We will continue with the object generated from the
*SummarizedExperiment* section.
<a id="eda"></a>
# Exploratory analysis and visualization
There are two separate paths in this workflow; the one we will
see first involves *transformations of the counts*
in order to visually explore sample relationships.
In the second part, we will go back to the original raw counts
for *statistical testing*. This is critical because
the statistical testing methods rely on original count data
(not scaled or transformed) for calculating the precision of measurements.
## Pre-filtering the dataset
Our count matrix with our *DESeqDataSet* contains many rows with only
zeros, and additionally many rows with only a few fragments total. In
order to reduce the size of the object, and to increase the speed of
our functions, we can remove the rows that have no or nearly no
information about the amount of gene expression. Here we apply the
most minimal filtering rule: removing rows of the *DESeqDataSet* that
have no counts, or only a single count across all samples. Additional
weighting/filtering to improve power is applied at a later step in the
workflow.
```{r}
nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
```
## The rlog and variance stabilizing transformations
Many common statistical methods for exploratory analysis of
multidimensional data, for example clustering and *principal
components analysis* (PCA), work best for data that generally has the
same range of variance at different ranges of the mean values. When
the expected amount of variance is approximately the same across
different mean values, the data is said to be *homoskedastic*. For
RNA-seq counts, however, the expected variance grows with the mean. For
example, if one performs PCA directly on a matrix of
counts or normalized counts (e.g. correcting for differences in
sequencing depth), the resulting plot typically depends mostly
on the genes with *highest* counts because they show the largest
absolute differences between samples. A simple and often used
strategy to avoid this is to take the logarithm of the normalized
count values plus a pseudocount of 1; however, depending on the
choice of pseudocount, now the genes with the very *lowest* counts
will contribute a great deal of noise to the resulting plot, because
taking the logarithm of small counts actually inflates their variance.
We can quickly show this property of counts with some simulated
data (here, Poisson counts with a range of lambda from 0.1 to 100).
We plot the standard deviation of each row (genes) against the mean:
```{r meanSdCts}
lambda <- 10^seq(from = -1, to = 2, length = 1000)
cts <- matrix(rpois(1000*100, lambda), ncol = 100)
library("vsn")
meanSdPlot(cts, ranks = FALSE)
```
And for logarithm-transformed counts:
```{r meanSdLogCts}
log.cts.one <- log2(cts + 1)
meanSdPlot(log.cts.one, ranks = FALSE)
```
The logarithm with a small pseudocount amplifies differences when the
values are close to 0. The low count genes with low signal-to-noise
ratio will overly contribute to sample-sample distances and PCA
plots.
As a solution, *DESeq2* offers two transformations for count data that
stabilize the variance across the mean: the
*regularized-logarithm transformation* or *rlog* [@Love2014Moderated],
and the *variance stabilizing transformation* (VST)
for negative binomial data with a dispersion-mean trend
[@Anders2010Differential], implemented in the *vst* function.
For genes with high counts, the rlog and VST will give similar result
to the ordinary log2 transformation of normalized counts. For genes
with lower counts, however, the values are shrunken towards the genes'
averages across all samples. The rlog-transformed or VST data then
becomes approximately homoskedastic, and can be used directly for
computing distances between samples, making PCA plots, or as input to
downstream methods which perform best with homoskedastic data.
**Which transformation to choose?** The rlog tends to work well on
small datasets (n < 30), sometimes outperforming the VST when there is
a large range of sequencing depth across samples (an order of
magnitude difference). The VST is much faster to compute and is less
sensitive to high count outliers than the rlog. We therefore recommend
the VST for large datasets (hundreds of samples). You can perform both
transformations and compare the `meanSdPlot` or PCA plots generated,
as described below.
Note that the two transformations offered by DESeq2 are provided for
applications *other* than differential testing.
For differential testing we recommend the
*DESeq* function applied to raw counts, as described later
in this workflow, which also takes into account the dependence of the
variance of counts on the mean value during the dispersion estimation
step.
The function *rlog* returns an object based on the *SummarizedExperiment*
class that contains the rlog-transformed values in its *assay* slot.
```{r rlog}
rld <- rlog(dds, blind = FALSE)
head(assay(rld), 3)
```
The function *vst* returns a similar object:
```{r vst}
vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
```
In the above function calls, we specified `blind = FALSE`, which means
that differences between cell lines and treatment (the variables in
the design) will not contribute to the expected variance-mean trend of
the experiment. The experimental design is not used directly in the
transformation, only in estimating the global amount of variability in
the counts. For a fully *unsupervised* transformation, one can set
`blind = TRUE` (which is the default).
To show the effect of the transformation, in the figure below
we plot the first sample
against the second, first simply using the *log2* function (after adding
1, to avoid taking the log of zero), and then using the rlog- and VST-transformed
values. For the *log2* approach, we need to first estimate *size factors* to
account for sequencing depth, and then specify `normalized=TRUE`.
Sequencing depth correction is done automatically for the *rlog*
and the *vst*.
```{r rldplot, fig.width = 6, fig.height = 2.5}
library("dplyr")
library("ggplot2")
dds <- estimateSizeFactors(dds)
df <- bind_rows(
as_data_frame(log2(counts(dds, normalized=TRUE)[, 1:2]+1)) %>%
mutate(transformation = "log2(x + 1)"),
as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"),
as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"))
colnames(df)[1:2] <- c("x", "y")
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
coord_fixed() + facet_grid( . ~ transformation)
```
**Scatterplot of transformed counts from two samples**. Shown are
scatterplots using the log2 transform of normalized counts (left),
using the rlog (middle), and using the VST (right). While the rlog is
on roughly the same scale as the log2 counts, the VST has a upward
shift for the smaller values. It is the differences between samples
(deviation from y=x in these scatterplots) which will contribute to
the distance calculations and the PCA plot.
We can see how genes with low counts (bottom left-hand corner) seem to
be excessively variable on the ordinary logarithmic scale, while the
rlog transform and VST compress differences for the low count genes
for which the data provide little information about differential
expression.
## Sample distances
A useful first step in an RNA-seq analysis is often to assess overall
similarity between samples: Which samples are similar to each other,
which are different? Does this fit to the expectation from the
experiment's design?
We use the R function *dist* to calculate the Euclidean distance
between samples. To ensure we have a roughly equal contribution from
all genes, we use it on the rlog-transformed data. We need to
transpose the matrix of values using *t*, because the *dist* function
expects the different samples to be rows of its argument, and
different dimensions (here, genes) to be columns.
```{r}
sampleDists <- dist(t(assay(rld)))
sampleDists
```
We visualize the distances in a heatmap in a figure below, using the function
*pheatmap* from the `r CRANpkg("pheatmap")` package.
```{r}
library("pheatmap")
library("RColorBrewer")
```
In order to plot the sample distance matrix with the rows/columns
arranged by the distances in our distance matrix,
we manually provide `sampleDists` to the `clustering_distance`
argument of the *pheatmap* function.
Otherwise the *pheatmap* function would assume that the matrix contains
the data values themselves, and would calculate distances between the
rows/columns of the distance matrix, which is not desired.
We also manually specify a blue color palette using the
*colorRampPalette* function from the `r CRANpkg("RColorBrewer")` package.
```{r distheatmap, fig.width = 6.1, fig.height = 4.5}
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors)
```
**Heatmap of sample-to-sample distances using the rlog-transformed values.**
Note that we have changed the row names of the distance matrix to
contain treatment type and patient number instead of sample ID, so
that we have all this information in view when looking at the heatmap.
Another option for calculating sample distances is to use the
*Poisson Distance* [@Witten2011Classification], implemented in the
`r CRANpkg("PoiClaClu")` package.
This measure of dissimilarity between counts
also takes the inherent variance
structure of counts into consideration when calculating the distances
between samples. The *PoissonDistance* function takes the original
count matrix (not normalized) with samples as rows instead of columns,
so we need to transpose the counts in `dds`.
```{r}
library("PoiClaClu")
poisd <- PoissonDistance(t(counts(dds)))
```
We plot the heatmap in a Figure below.
```{r poisdistheatmap, fig.width = 6.1, fig.height = 4.5}
samplePoisDistMatrix <- as.matrix( poisd$dd )
rownames(samplePoisDistMatrix) <- paste( rld$dex, rld$cell, sep=" - " )
colnames(samplePoisDistMatrix) <- NULL
pheatmap(samplePoisDistMatrix,
clustering_distance_rows = poisd$dd,
clustering_distance_cols = poisd$dd,
col = colors)
```
**Heatmap of sample-to-sample distances using the *Poisson Distance*.**
## PCA plot