-
Notifications
You must be signed in to change notification settings - Fork 4
/
Bioc2016.Rmd
1059 lines (838 loc) · 33.6 KB
/
Bioc2016.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "R/Bioconductor tools for mass spectrometry-based proteomics"
author: "[Laurent Gatto](https://github.com/lgatto)"
output:
BiocStyle::html_document:
toc: true
toc_depth: 1
vignette: >
% \VignetteIndexEntry{R/Bioconductor tools for mass spectrometry-based proteomics}
% \VignetteEngine{knitr::rmarkdown}
% \VignetteKeyword{proteomics, mass spectrometry, data}
% \VignettePackage{ProteomicsBioc2016Workshop}
---
```{r style, echo = FALSE, results = 'asis', message=FALSE}
BiocStyle::markdown()
```
```{r env, echo=FALSE, warning=FALSE}
.cache <- FALSE
suppressPackageStartupMessages(library("ProteomicsBioc2016Workshop"))
suppressPackageStartupMessages(library("RforProteomics"))
suppressPackageStartupMessages(library("AnnotationHub"))
suppressMessages(library("BiocInstaller"))
suppressPackageStartupMessages(library("lattice"))
suppressPackageStartupMessages(library("gridExtra"))
suppressPackageStartupMessages(library("mzID"))
suppressPackageStartupMessages(library("msmsTests"))
suppressPackageStartupMessages(library("gplots"))
```
**Abstract** In this workshop, we will use R/Bioconductor packages to
explore, process, visualise and understand mass spectrometry-based
proteomics data, starting with raw data, and proceeding with
identification and quantitation data, discussing some of their
peculiarities compared to sequencing data along the way. The
participants will gain a general overview of Bioconductor packages for
mass spectrometry and proteomics, and learn how to navigate raw data
and reconstruct quantitative data. The workshop is aimed at a beginner
to intermediate level, such as, for example, seasoned R users who want
to get started with mass spectrometry and proteomics, or proteomics
practitioners who want to familiarise themselves with R and
Bioconductor infrastructure.
This short tutorial is part of the `ProteomicsBioc2016Workshop`
package (version `r packageVersion("ProteomicsBioc2016Workshop")`),
available at https://github.com/lgatto/ProteomicsBioc2016Workshop.
> This vignette available under a
> [**creative common CC-BY**](http://creativecommons.org/licenses/by/4.0/)
> license. You are free to **share** (copy and redistribute the
> material in any medium or format) and **adapt** (remix, transform,
> and build upon the material) for any purpose, even commercially.
**Modified:** `r file.info("Bioc2016.Rmd")$mtime`<br />
**Compiled**: `r date()`
# Introduction
## Installation instructions
To install this package and build the vignette:
If the `r Biocpkg("BiocInstaller")` is not installted, start with
```{r src, eval=FALSE}
source("https://bioconductor.org/biocLite.R")
```
then
```{r install, eval=FALSE}
library("BiocInstaller")
biocLite("lgatto/ProteomicsBioc2016Workshop",
dependencies = TRUE, build_vignettes=TRUE)
```
To launch the vignette
```{r launch, eval=FALSE}
library("ProteomicsBioc2016Workshop")
bioc2016()
```
## MS/proteomics software available in Bioconductor
```{r pk, echo=FALSE, warning=FALSE, cache=.cache}
biocv <- as.character(biocVersion())
pp <- proteomicsPackages(biocv)
msp <- massSpectrometryPackages(biocv)
msdp <- massSpectrometryDataPackages(biocv)
```
In Bioconductor version `r biocv`, there are respectively `r nrow(pp)`
proteomics and `r nrow(msp)` mass spectrometry software packages and
`r nrow(msdp)` mass spectrometry experiment packages. These respective
packages can be extracted with the `proteomicsPackages()`,
`massSpectrometryPackages()` and `massSpectrometryDataPackages()` and
explored interactively.
```{r pp, eval=FALSE}
library("RforProteomics")
pp <- proteomicsPackages("3.3")
display(pp)
```
## Mass spectrometry data
```{r, datatab, results='asis', echo=FALSE}
datatab <-
data.frame(Type = c("raw", "identification", "quantitation", "peak lists", "other"),
Format = c("mzML, mzXML, netCDF, mzData", "mzIdentML", "mzQuantML", "mgf", "mzTab"),
Package = c(
"[`mzR`](http://bioconductor.org/packages/release/bioc/html/mzR.html) (read)",
"[`mzID`](http://bioconductor.org/packages/release/bioc/html/mzID.html) and [`mzR`](http://bioconductor.org/packages/release/bioc/html/mzR.html) (read)",
"",
"[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write)",
"[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read)"))
knitr::kable(datatab)
```
# Getting data
## AnnotationHub
`r Biocpkg("AnnotationHub")` is a cloud resource set up and managed by
the Bioconductor project that programmatically disseminates omics
data. I am currently working on contributing
[proteomics data](http://bioconductor.org/packages/devel/bioc/vignettes/ProteomicsAnnotationHubData/inst/doc/ProteomicsAnnotationHubData.html).
```{r ah0}
library("AnnotationHub")
ah <- AnnotationHub()
query(ah, "proteomics")
```
Below, we download a raw mass spectrometry dataset with identifier
`AH49008` and store it in a variable names `ms`.
```{r ah, message=FALSE}
ms <- ah[["AH49008"]]
ms
```
```{r mshd, echo=FALSE}
hd <- header(ms)
```
The data contains `r length(ms)` spectra - `r table(hd$msLevel)[[1]]`
MS1 spectra and `r table(hd$msLevel)[[2]]` MS2 spectra. The filename,
`r basename(fileName(ms))`, is not very descriptive because the data
originates from the `AnnotationHub` cloud repository. If the data was
read from a local file, is would be named as the `mzML` (or `mzXML`)
file.
## ProteomeXchange
Contemporary MS-based proteomics data is disseminated through the
[ProteomeXchange](http://www.proteomexchange.org/) infrastructure,
which centrally coordinates submission, storage and dissemination
through multiple data repositories, such as the
[PRIDE](https://www.ebi.ac.uk/pride/archive/) data base at the EBI for
MS/MS experiments, [PASSEL](http://www.peptideatlas.org/passel/) at
the ISB for SRM data and the
[MassIVE](http://massive.ucsd.edu/ProteoSAFe/static/massive.jsp)
resource. The `r Biocpkg("rpx")` is an interface to ProteomeXchange
and provides a basic and unified access to PX data.
```{r, rpx}
library("rpx")
pxannounced()
```
```{r, pxd}
px <- PXDataset("PXD000001")
px
pxfiles(px)
```
Other metadata for the `px` dataset:
```{r, pxvar, eval=FALSE}
pxtax(px)
pxurl(px)
pxref(px)
```
Data files can then be downloaded with the `pxget` function as
illustrated below.
```{r pxgetfile6, eval=TRUE}
mzf <- pxget(px, pxfiles(px)[6])
mzf
```
# Handling raw MS data
The `mzR` package provides an interface to the
[proteowizard](http://proteowizard.sourceforge.net/) code base,
the legacy RAMP is a non-sequential parser and other C/C++ code to
access various raw data files, such as `mzML`, `mzXML`, `netCDF`, and
`mzData`. The data is accessed on-disk, i.e it does not get loaded
entirely in memory by default. The three main functions are
`openMSfile` to create a file handle to a raw data file, `header` to
extract metadata about the spectra contained in the file and `peaks`
to extract one or multiple spectra of interest. Other functions such
as `instrumentInfo`, or `runInfo` can be used to gather general
information about a run.
```{r rawms}
library("mzR")
ms <- openMSfile(mzf)
ms
```
```{r hd}
hd <- header(ms)
dim(hd)
names(hd)
```
```{r peaks}
head(peaks(ms, 234))
str(peaks(ms, 1:5))
```
> **Exercise** Extract the index of the MS2 spectrum with the highest
> base peak intensity and plot its spectrum (we will see more about
> MS data visualisation in the next section). Is the data centroided
> or in profile mode?
```{r ex_raw, echo=TRUE, eval=TRUE, fig.align='center'}
hd2 <- hd[hd$msLevel == 2, ]
i <- which.max(hd2$basePeakIntensity)
hd2[i, ]
pi <- peaks(ms, hd2[i, 1])
plot(pi, type = "h")
mz <- hd2[i, "basePeakMZ"]
plot(pi, type = "h", xlim = c(mz-0.5, mz+0.5))
```
Zooming into spectrum 100.
```{r ex_raw2}
pj <- peaks(ms, 100)
plot(pj, type = "l")
plot(pj, type = "l", xlim = c(536,540))
```
# Handling identification data
The `ProteomicsBioc2016Workshop` package distributes a small
identification result file (see
`?TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzid`) that we
load and parse using infrastructure from the `r Biocpkg("mzID")`
package.
```{r id, cache=.cache}
library("mzID")
(f <- dir(system.file("extdata", package = "ProteomicsBioc2016Workshop"),
pattern = "mzid", full.names = TRUE))
id <- mzID(f)
software(id)
id
```
Various data can be extracted from the `mzID` object, using one the
accessor functions such as `database`, `sofware`, `scans`, `peptides`,
... The object can also be converted into a `data.frame` using the
`flatten` function.
```{r fid}
head(flatten(id))
```
The `mzR` interface provides a similar interface. It is however much
faster as it does not read all the data into memory and only extracts
relevant data on demand. It has also accessor functions such as
`softwareInfo`, `mzidInfo`, ... (use `showMethods(classes = "mzRident", where = "package:mzR")`)
to see all available methods.
```{r id2}
library("mzR")
id2 <- openIDfile(f)
id2
softwareInfo(id2)
```
The identification data can be accessed as a `data.frame` with the
`psms` accessor.
```{r fid2}
head(psms(id2))
```
> **Exercise** Is there a relation between the length of a protein and
> the number of identified peptides, conditioned by the (average)
> e-value of the identifications?
```{r, ex_id, echo=TRUE, eval=TRUE}
fid <- flatten(id)
x <- by(fid, fid$accession, function(x)
c(unique(x$length),
length(unique(x$pepseq)),
mean(x$'ms-gf:specevalue')))
x <- data.frame(do.call(rbind, x))
colnames(x) <- c("plength", "npep", "eval")
x$bins <- cut(x$eval, summary(x$eval))
library("lattice")
xyplot(plength ~ npep | bins, data = x)
```
# MS/MS database search
While searches are generally performed using third-party software
independently of R or can be started from R using a `system` call, the
`r Biocpkg("rTANDEM")` package allows one to execute such searches
using the X!Tandem engine.
```{r rtandem, eval=FALSE}
library("rTANDEM")
?rtandem
```
The `r Biocpkg("MSGPplus")` package provides an interface to the very
fast `MSGF+` engine.
```{r msgfplus, eval = FALSE}
library("MSGFplus")
parameters <- msgfPar(database = 'milk-proteins.fasta',
tolerance = '20 ppm',
instrument = 'TOF')
runMSGF(parameters, 'msraw.mzML')
```
# High-level data interface
The above sections introduced low-level interfaces to raw and
identification results. The `r Biocpkg("MSnbase")` package provides
abstractions for raw data through the `MSnExp` class and containers
for quantification data via the `MSnSet` class. Both store the actual
assay data (spectra or quantitation matrix) and sample and feature
metadata, accessed with `spectra` (or the
`[`, `[[` operators) or `exprs`, `pData` and `fData`.
The figure below give a schematics of an `MSnSet` instance and the
relation between the assay data and the respective feature and sample
metadata.
```{r, msnset, echo=FALSE, fig.width = 5, fig.height = 7, fig.align='center'}
plot(NA, xlim = c(0, 5), ylim = c(0, 10), axes=FALSE, xlab = NA, ylab = NA)
rect(0, 0, 3, 1.9)
rect(0, 2, 3, 10)
rect(3.05, 2, 5, 10)
segments(seq(0, 3, length.out = 7),
rep(0, 7),
seq(0, 3, length.out = 7),
rep(10, 7),
lty = "dotted")
segments(rep(0, 50),
seq(2, 10, length.out = 50),
rep(5, 100),
seq(2, 10, length.out = 50),
lty = "dotted")
text(1.5, 1, "sample metadata", cex = 1.5)
text(1.5, 6, "assay data", cex = 1.5)
text(4, 6, "feature\nmetadata", cex = 1.5)
```
Another useful slot is `processingData`, accessed with
`processingData(.)`, that records all the processing that objects have
undergone since their creation (see examples below).
The `readMSData` will parse the raw data, extract the MS2 spectra and
construct an MS experiment file.
```{r msnbase}
library("MSnbase")
quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzXML$")
quantFile
msexp <- readMSData(quantFile, verbose=FALSE)
msexp
```
The identification results stemming from the same raw data file can
then be used to add PSM matches.
```{r addid}
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzid$")
identFile
msexp <- addIdentificationData(msexp, identFile)
fData(msexp)
```
```{r specplot}
msexp[[1]]
as(msexp[[1]], "data.frame")[100:105, ]
```
`MSnExp` object load all MS data into memory. This is a viable
solution for MS2 data, but does not scale to MS1 data, especially when
multiple files are loaded together. With the help of
[Johannes Rainer](http://github.com/jotsetung), a new `MSnExp` class
supporting on disk access is being developed.
# Visualising raw data
## A full chromatogram
```{r chromato}
chromatogram(ms)
```
## Multiple chromatograms
It is of course possible to overlay several chromatograms. The code
chunks below are not executed dynamically so save time with
downloading raw data files.
```{r chromato3, eval=FALSE}
## manual download
library("RforProteomics")
url <- "http://proteome.sysbiol.cam.ac.uk/lgatto/files/Thermo-HELA-PRT/"
f1 <- downloadData(file.path(url, "Thermo_Hela_PRTC_1.mzML"))
f2 <- downloadData(file.path(url, "Thermo_Hela_PRTC_2.mzML"))
f3 <- downloadData(file.path(url, "Thermo_Hela_PRTC_3.mzML"))
```
```{r chromato3plot, eval=FALSE}
## plotting
c1 <- chromatogram(f1)
c2 <- chromatogram(f2, plot = FALSE)
lines(c2, col = "steelblue", lty = "dashed", lwd = 2)
c3 <- chromatogram(f3, plot = FALSE)
lines(c2, col = "orange", lty = "dotted")
```
![Multiple chomatograms](./Figures/chromatograms.png)
## An extracted ion chromatogram
```{r xic, cache=.cache, fig.width=12}
par(mfrow = c(1, 2))
xic(ms, mz = 636.925, width = 0.01)
x <- xic(ms, mz = 636.925, width = 0.01, rtlim = c(2120, 2200))
```
## Spectra
We first load a test iTRAQ data called `itraqdata` and distributed as
part of the `r Biocpkg("MSnbase")` package using the `data`
function. This is a pre-packaged data that comes as a dedicated data
structure called `MSnExp`. We then `plot` the 10th spectrum using
specific code that recognises what to do with an element of an
`MSnExp`.
```{r itraqdata}
data(itraqdata)
itraqdata
plot(itraqdata[[10]], reporters = iTRAQ4, full=TRUE)
```
The `ms` data is not pre-packaged as an `MSnExp` data. It is a more
bare-bone `r as.character(class(ms))` object, a pointer to a raw data
file (here `r basename(fileName(ms))`): we need first to extract a
spectrum of interest (here the 3071st spectrum, an MS1 spectrum), and
use the generic `plot` function to visualise the spectrum.
```{r ms1}
plot(peaks(ms, 3071), type = "h",
xlab = "M/Z", ylab = "Intensity",
sub = formatRt(hd[3071, "retentionTime"]))
```
The importance of flexible access to specialised data becomes visible
in the figure below (taken from the `r Biocannopkg("RforProteomics")`
[visualisation vignette](http://bioconductor.org/packages/release/data/experiment/vignettes/RforProteomics/inst/doc/RProtVis.html)).
**Not only can we access specific data and understand/visualise them,
but we can transverse all the data and extracted/visualise/understand
structured slices of data.**
In this code chunks we start by selecting relevant spectra of
interest. We will focus on the first MS1 spectrum acquired after 30
minutes of retention time.
```{r pxd1}
## (1) Open raw data file
ms <- openMSfile(mzf)
## (2) Extract the header information
hd <- header(ms)
## (3) MS1 spectra indices
ms1 <- which(hd$msLevel == 1)
## (4) Select MS1 spectra with retention time between 30 and 35 minutes
rtsel <- hd$retentionTime[ms1] / 60 > 30 & hd$retentionTime[ms1] / 60 < 35
## (5) Indices of the 1st and 2nd MS1 spectra after 30 minutes
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
## (6) Interleaved MS2 spectra
ms2 <- (i+1):(j-1)
```
Now now extract and plot all relevant information:
1. The upper panel represents the chromatogram of the `r fileName(ms)`
raw data file, produced with `chromatogram`.
```{r visfig01}
chromatogram(ms)
```
2. We concentrate at a specific retention time,
`r formatRt(hd[i, "retentionTime"])` minutes (`r hd[i, "retentionTime"]` seconds)
```{r visfig02}
chromatogram(ms)
abline(v = hd[i, "retentionTime"], col = "red")
```
3. This corresponds to the `r i`th MS1 spectrum, shown on the second
row of figures.
```{r visfig03}
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
```
4. The ions that were selected for MS2 are highlighted by vertical
lines. These are represented in the bottom part of the figure.
```{r visfig04}
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
abline(v = hd[ms2, "precursorMZ"],
col = c("#FF000080",
rep("#12121280", 9)))
```
5. On the right, we zoom on the isotopic envelope of one peptide in
particular (the one highlighted with a red line).
```{r visfig05}
plot(peaks(ms, i), type = "l", xlim = c(521, 522.5))
abline(v = hd[ms2, "precursorMZ"], col = "#FF000080")
```
6. A final loop through the relevant MS2 spectra plots the
`length(ms2)` MS2 spectra highlighted above.
```{r visfig06, fig.width = 8, fig.height = 10}
par(mfrow = c(5, 2), mar = c(2, 2, 0, 1))
for (ii in ms2) {
p <- peaks(ms, ii)
plot(p, xlab = "", ylab = "", type = "h", cex.axis = .6)
legend("topright", legend = paste0("Prec M/Z\n",
round(hd[ii, "precursorMZ"], 2)),
bty = "n", cex = .8)
}
```
```{r mslayout, echo=FALSE}
## Preparing the layout (not shown)
lout <- matrix(NA, ncol = 10, nrow = 8)
lout[1:2, ] <- 1
for (ii in 3:4)
lout[ii, ] <- c(2, 2, 2, 2, 2, 2, 3, 3, 3, 3)
lout[5, ] <- rep(4:8, each = 2)
lout[6, ] <- rep(4:8, each = 2)
lout[7, ] <- rep(9:13, each = 2)
lout[8, ] <- rep(9:13, each = 2)
```
Putting it all together:
```{r visfig}
layout(lout)
par(mar=c(4,2,1,1))
chromatogram(ms)
abline(v = hd[i, "retentionTime"], col = "red")
par(mar = c(3, 2, 1, 0))
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
abline(h = 0)
abline(v = hd[ms2, "precursorMZ"],
col = c("#FF000080",
rep("#12121280", 9)))
par(mar = c(3, 0.5, 1, 1))
plot(peaks(ms, i), type = "l", xlim = c(521, 522.5),
yaxt = "n")
abline(h = 0)
abline(v = hd[ms2, "precursorMZ"], col = "#FF000080")
par(mar = c(2, 2, 0, 1))
for (ii in ms2) {
p <- peaks(ms, ii)
plot(p, xlab = "", ylab = "", type = "h", cex.axis = .6)
legend("topright", legend = paste0("Prec M/Z\n",
round(hd[ii, "precursorMZ"], 2)),
bty = "n", cex = .8)
}
```
Below, we illustrate some additional visualisation and animations of
raw MS data, also taken from the `r Biocannopkg("RforProteomics")`
[visualisation vignette](http://bioconductor.org/packages/release/data/experiment/vignettes/RforProteomics/inst/doc/RProtVis.html). On
the left, we have a heatmap visualisation of a MS map and a 3
dimensional representation of the same data. On the right, 2 MS1
spectra in blue and the set of interleaves 10 MS2 spectra.
```{r msmap1, message=FALSE, fig.width=15, echo=TRUE}
## (1) MS space heaptmap
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)
ff <- colorRampPalette(c("yellow", "steelblue"))
trellis.par.set(regions=list(col=ff(100)))
m1 <- plot(M, aspect = 1, allTicks = FALSE)
## (2) Same data as (1), in 3 dimenstion
M@map[msMap(M) == 0] <- NA
m2 <- plot3D(M, rgl = FALSE)
## (3) The 2 MS1 and 10 interleaved MS2 spectra from above
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
M2 <- MSmap(ms, i:j, 100, 1000, 1, hd)
m3 <- plot3D(M2)
grid.arrange(m1, m2, m3, ncol = 3)
```
Below, we have animations build from extracting successive slices as above.
```{r two-col-1, results='asis', echo=FALSE, out.extra=''}
cat("<table class='container'><tr>")
cat("<td>")
```
![MS animation 1](./Figures/msanim1.gif)
```{r two-col-2, results='asis', echo=FALSE, out.extra=''}
cat("</td>")
cat("<td>")
```
![MS animation 2](./Figures/msanim2.gif)
```{r two-col-3, results='asis', echo=FALSE, out.extra=''}
cat("</td>")
cat("</tr></table>")
```
# Visualising identification data
Annotated spectra and comparing spectra.
```{r id1, message=FALSE, fig.width=15, message=FALSE}
par(mfrow = c(1, 2))
itraqdata2 <- pickPeaks(itraqdata, verbose = FALSE)
s <- "SIGFEGDSIGR"
plot(itraqdata2[[14]], s, main = s)
plot(itraqdata2[[25]], itraqdata2[[28]], sequences = rep("IMIDLDGTENK", 2))
```
The annotation of spectra is obtained by simulating fragmentation of a
peptide and matching observed peaks to fragments:
```{r fag}
calculateFragments("SIGFEGDSIGR")
```
Visualising a pair of spectra means that we can access them, and that,
in addition to plotting, we can manipulate them and perform
computations. The two spectra corresponding to the `IMIDLDGTENK`
peptide, for example have
`r compareSpectra(itraqdata2[[25]], itraqdata2[[28]], fun = "common")`
common peaks, a correlation of
`r round(compareSpectra(itraqdata2[[25]], itraqdata2[[28]], fun = "cor"), 3)`
and a dot product of
`r round(compareSpectra(itraqdata2[[25]], itraqdata2[[28]], fun = "dotproduct"), 3)`
(see `?compareSpectra` for details).
There are 2 Bioconductor packages for peptide-spectrum matching
directly in R, namely `r Biocpkg("MSGFplus")` and `r Biocpkg("rTANDEM")`,
replying on `MSGF+` and `X!TANDEM` respectively.
See also the `r Biocpkg("MSGFgui")` package for visualisation of
identification data.
![MSGFgui screenshot](./Figures/03-2-msgfgui_panel_BW.png)
# Quantitative proteomics data
There are a wide range of proteomics quantitation techniques that can
broadly be classified as labelled vs. label-free, depending whether
the features are labelled prior the MS acquisition and the MS level at
which quantitation is inferred, namely MS1 or MS2.
```{r, quanttab, echo=FALSE, results='asis'}
qtb <- matrix(c("XIC", "Counting", "SILAC, 15N", "iTRAQ, TMT"),
nrow = 2, ncol = 2)
dimnames(qtb) <- list(
'MS level' = c("MS1", "MS2"),
'Quantitation' = c("Label-free", "Labelled"))
knitr::kable(qtb)
```
* Isobaric tagging (iTRAQ and TMT): `r Biocpkg("MSnbase")` and `r Biocpkg("isobar")`.
* Label-free: `r Biocpkg("xcms")` (metabolomics).
* Counting: `r Biocpkg("MSnbase")` and `r Biocpkg("MSnID")` for
peptide-spectrum matching confidence assessment.
* `r Githubpkg("vladpetyuk/N14N15")` for heavy Nitrogen-labelled data.
## From raw data to quantitative data
In terms of raw data quantitation, most efforts have been devoted to
MS2-level quantitation. Label-free XIC quantitation has however been
addressed in the frame of metabolomics data processing by the
[`xcms`](http://bioconductor.org/packages/release/bioc/html/xcms.html)
infrastructure.
### Isobaric tagging
An `MSnExp` is converted to an `MSnSet` by the `quantitation`
method. Below, we use the iTRAQ 4-plex isobaric tagging strategy
(defined by the `iTRAQ4` parameter; other isobaric tags are available).
```{r, itraq4plot, fig.align='center'}
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
```
```{r, quantitraq}
msset <- quantify(msexp, method = "trap", reporters = iTRAQ4, verbose = FALSE)
exprs(msset)
processingData(msset)
```
**See also** The `r Biocpkg("isobar")` package supports quantitation
from centroided `mgf` peak lists or its own tab-separated files that
can be generated from Mascot and Phenyx vendor files.
### Label-free MS2
Other MS2 quantitation methods available in `quantify` include the
(normalised) spectral index `SI` and (normalised) spectral abundance
factor `SAF` or simply a simple count method.
```{r, lfms2}
exprs(si <- quantify(msexp, method = "SIn"))
exprs(saf <- quantify(msexp, method = "NSAF"))
```
Note that spectra that have not been assigned any peptide (`NA`) or
that match non-unique peptides (`npsm > 1`) are discarded in the
counting process.
### Spectral counting
The `r Biocpkg("MSnID")` package provides enables to explore and
assess the confidence of identification data using `mzid` files. A
subset of all peptide-spectrum matches, that pass a specific false
discovery rate threshold can them be converted to an `MSnSet`, where
the number of peptide occurrences are used to populate the assay data.
## Importing third-party data
The PSI `mzTab` file format is aimed at providing a simpler (than XML
formats) and more accessible file format to the wider community. It is
composed of a key-value metadata section and peptide/protein/small
molecule tabular sections.
```{r, mztab}
mztf <- pxget(px, pxfiles(px)[2])
(mzt <- readMzTabData(mztf, what = "PEP", version = "0.9"))
```
It is also possible to import arbitrary spreadsheets as `MSnSet`
objects into R with the `readMSnSet2` function. The main 2 arguments
of the function are (1) a text-based spreadsheet or a `data.frame` and
(2) column names of indices that identify the quantitation data.
```{r, readmsnset2}
csv <- dir(system.file ("extdata" , package = "pRolocdata"),
full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
getEcols(csv, split = ",")
ecols <- 7:10
res <- readMSnSet2(csv, ecols)
head(exprs(res))
head(fData(res))
```
# Quantitative data processing and analysis
## Processing
Processing quantitative proteomics data is very similar to processing
transcriptomics data. When working with `MSnSet` instances, a set of
methods for data processing and visualisation are readily available to
streamline and track the process:
- `purityCorrect` to correct for isobaric tag impurities
- `normalise` (or `normalize`) to, well, normalise data using
quantile, vsn, mean, ... normalisation
- `filterNA` to remove feature with (a certain proportion) of missing
values
- `impute` to impute missing values using a wide range of methods
(more on this below)
- `image`, `MAplot`, `plot2D` (for dimensionality reduction using PCA,
t-SNE, MDS, ...), ... to visualise data
- ...
## Differentially expression
* Isobaric tagging (iTRAQ and TMT): `r Biocpkg("isobar")`
* Label-free: `r CRANpkg("aLFQ")` and `r CRANpkg("protiq")`
* `r Biocpkg("msmsTests")` for the RNA-seq count-based models and
tests
```{r msmstest}
library("msmsTests")
data(msms.dataset) ## an MSnSet
e <- pp.msms.data(msms.dataset)
head(exprs(e))
```
```{r msmstest2}
pData(e)
null.f <- "y~batch"
alt.f <- "y~treat+batch"
div <- apply(exprs(e),2,sum)
res <- msms.edgeR(e,alt.f,null.f,div=div,fnm="treat")
head(res)
```
* `r Biocpkg("MSstats")` for various statistical analyses
* and of course `r Biocpkg("limma")`
* ...
## Machine learning
* `MSnSet` support `r Biocpkg("MLInterfaces")` out-of-the-box.
* While tailor for spatial proteomics data, `r Biocpkg("pRoloc")`'s
machine learning methods are useful in a wide range of cases and
work with `MSnSet` instances.
# Some differences with RNASeq and transcriptomics in general
## Missing values
```{r na}
data(naset)
naplot(naset, col = "black")
```
### Filtering
One solution is to remove all or part of the features that have
missing values (see `?filterNA`).
```{r filterNA}
flt <- filterNA(naset)
processingData(flt)
```
### Identification transfer
Identification transfer between acquisitions (label-free): if a feature
was not acquired in MS2 in one replicate, it is possible to find the
ion in MS space based on the M/Z and retention time coordinates of the
same ion in a replicate where it was identified. (An example of this
is implemented in the `r Biocpkg("synapter")` package).
### Imputation
It is of course possible to impute missing values (`?impute`). This is
however not a straightforward thing, as is likely to dramatically fail
when a high proportion of data is missing (10s of %). But also, there
are two types of mechanisms resulting in missing values in LC/MSMS
experiments.
* Missing values resulting from absence of detection of a feature,
despite ions being present at detectable concentrations. For
example in the case of ion suppression or as a result from the
stochastic, data-dependent nature of the MS acquisition
method. These missing value are expected to be randomly distributed
in the data and are defined as **missing at random** (MAR) or
**missing completely at random** (MCAR).
* Biologically relevant missing values, resulting from the *absence*
or the low abundance of ions (below the limit of detection of the
instrument). These missing values are not expected to be randomly
distributed in the data and are defined as **missing not at random**
(MNAR).
```{r naheatmap, echo=FALSE}
x <- impute(naset, "zero")
exprs(x)[exprs(x) != 0] <- 1
library("gplots")
heatmap.2(exprs(x), col = c("lightgray", "black"),
scale = "none", dendrogram = "none",
trace = "none", keysize = 0.5, key = FALSE,
RowSideColors = ifelse(fData(x)$randna, "orange", "brown"),
ColSideColors = rep(c("steelblue", "darkolivegreen"), each = 8))
```
Different imputation methods are more appropriate to different classes
of missing values (as documented in this
[paper](http://pubs.acs.org/doi/abs/10.1021/acs.jproteome.5b00981)). Values
missing at random, and those missing not at random should be imputed
with different methods.
![RSR KNN and MinDet imputation](./Figures/imp-sim.png)
It is recommended to use **hot deck** methods (nearest neighbour
(**left**), maximum likelihood, ...) when data are missing at
random.Conversely, MNAR features should ideally be imputed with a
**left-censor** (minimum value (**right**), but not zero, ...) method.
## Coverage
- Coverage in proteomics in `%`
- [Coverage](http://www.ncbi.nlm.nih.gov/pubmed/24434847) in RNA-Seq in fold `X`
The following values are higher bounds, *without* peptide filtering for
about 80000 *gene groups*
```{r}
data(cvg)
hist(cvg$coverage, breaks = 100, xlab = "coverage", main = "")
```
And
- the majority of peptides map to a minority of proteins different
- peptides within one protein can be differently detectable in MS
acquisitions
## Protein inference and protein groups
![Peptide evidence classes](./Figures/nbt0710-647-F2.png)
From [Qeli and Ahrens (2010)](http://www.ncbi.nlm.nih.gov/pubmed/20622826).
See also [Nesvizhskii and Aebersold (2005)](http://www.ncbi.nlm.nih.gov/pubmed/16009968).
Often, in proteomics experiments, the features represent single
proteins and **groups** of indistinguishable or non-differentiable
proteins identified by shared (non-unique) peptides.
## Identifier mapping
The latest UniProt
```{r up}
data(upens)
data(upenst)
```
Where `upens` and `upenst` where created by querying the Ensembl gene
and transcript identifiers for all human UniProtKB accession numbers
against the UniProt web services. (See `?upens` for details.)
```{r idmappingplot, echo=FALSE}
par(mfrow = c(2, 1))
barplot(table(tapply(upens[, 2], upens[, 1], function(x) length(na.omit(x)))),
main = "Ensembl genes - UnitProtKB mapping")
barplot(table(tapply(upenst[, 2], upenst[, 1], function(x) length(na.omit(x)))),
main = "Ensembl transcripts - UnitProtKB mapping")
```
# Annotation
All the
[Bioconductor annotation infrastructure](http://bioconductor.org/help/workflows/annotation/annotation/),
such as `r Biocpkg("biomaRt")`, `r Biocpkg("GO.db")`, organism
specific annotations, .. are directly relevant to the analysis of
proteomics data. Some proteomics-centred annotations such as the PSI
Mass Spectrometry Ontology, Molecular Interaction (PSI MI 2.5) or