---
title: "RTCGAToolbox"
author: "Mehmet Kemal Samur"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
references:
- id: ref1
  title: Comprehensive genomic characterization defines human glioblastoma genes and core pathways
  author:
  - family: Cancer Genome Atlas Research Network
    given:
  journal: Nature
  volume: 455
  number: 7216
  pages: 1061-1068
  issued:
    year: 2008
- id: ref2
  title: GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers
  author:
  - family: Mermel, C. H. and Schumacher, S. E. and Hill, B. and Meyerson, M. L. and Beroukhim, R. and Getz, G
    given:
  journal: Genome Biol
  volume: 12
  number: 4
  pages: R41
  issued:
    year: 2011
- id: ref3
  title: RTCGAToolbox\:\ A New Tool for Exporting TCGA Firehose Data
  author:
  - family: Samur MK.
    given:
  journal: PLoS ONE
  volume: 9
  number: 9
  pages: e106397
  issued:
    year: 2014
vignette: >
  %\VignetteIndexEntry{RTCGAToolbox Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
# Introduction

Managing data from large-scale projects such as The Cancer Genome Atlas
(TCGA) [@ref1] for further analysis is an important and time-consuming step in
research projects. Several efforts, such as the Firehose project, make
pre-processed TCGA data publicly available via web services and data portals,
but users still have to download, manage, and prepare the data for downstream
steps. We developed an open-source and extensible R-based data client for
Firehose Level 3 and Level 4 data and demonstrated its use with sample case
studies. RTCGAToolbox can improve data management for researchers who are
interested in TCGA data. In addition, it can be integrated with other analysis
pipelines for further data analysis.

RTCGAToolbox is open source and licensed under the GNU General Public License
Version 2.0. All documentation and source code for RTCGAToolbox are freely
available. Please cite the paper [@ref3].

Currently, the following functions are provided to access and process
datasets.

* Control functions:
    + `getFirehoseRunningDates`: Returns the valid stddata run dates. To
    access data, users have to provide a valid date.
    + `getFirehoseAnalyzeDates`: Returns the valid analysis run dates. To
    access data, users have to provide a valid date. This date only affects
    the GISTIC2 [@ref2] processed copy number estimate matrices.
    + `getFirehoseDatasets`: Returns the valid dataset aliases.
* Data client function:
    + `getFirehoseData`: This is the core function of the package. Users
    access Firehose processed data via this function. Once called, it performs
    several steps to retrieve the data and finally returns an S4 object that
    holds all the downloaded data.

# Installation

To install RTCGAToolbox, you can use Bioconductor. The source code is also
available on GitHub. First-time users can use the following code snippet to
install the package:

```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RTCGAToolbox")
```
# Data Client

Before getting data from the Firehose pipelines, users have to check the valid
dataset aliases, stddata run dates, and analysis run dates. RTCGAToolbox comes
with three control functions that provide this information. Users can list
datasets with the `getFirehoseDatasets` function. In addition, users have to
provide a stddata run date and/or an analysis run date to the client function.
Valid dates are accessible via the `getFirehoseRunningDates` and
`getFirehoseAnalyzeDates` functions. The code chunks below show how to list
datasets and dates.

```{r}
library(RTCGAToolbox)
# Valid aliases
getFirehoseDatasets()
```
```{r}
# Valid stddata runs
getFirehoseRunningDates(last = 3)
```
```{r}
# Valid analysis run dates (returns the 3 most recent dates)
getFirehoseAnalyzeDates(last = 3)
```
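
As a minimal sketch, the checks above can be wrapped in a small helper that
stops early when an alias or date is not valid. `checkFirehoseInput` is a
hypothetical name used for illustration and is not part of RTCGAToolbox; the
sketch assumes that the control functions return character vectors, as printed
above.

```{r eval=FALSE}
library(RTCGAToolbox)

# Hypothetical helper (not part of RTCGAToolbox): validate user input
# against the aliases and stddata run dates reported by Firehose.
checkFirehoseInput <- function(dataset, runDate,
    validDatasets = getFirehoseDatasets(),
    validDates = getFirehoseRunningDates()) {
    if (!dataset %in% validDatasets)
        stop("Unknown dataset alias: ", dataset)
    if (!runDate %in% validDates)
        stop("Unknown stddata run date: ", runDate)
    invisible(TRUE)
}

# Uses the default arguments, so it queries Firehose and needs
# an internet connection
checkFirehoseInput("READ", "20160128")
```

Passing `validDatasets` and `validDates` explicitly avoids repeated queries
when several cohorts are checked in a loop.
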

Once the dates and datasets are determined, users can call the data client
function (`getFirehoseData`) to access data. The current version can download
multiple data types, except ISOFORM and exon-level data because of their huge
size. The code chunk below downloads the READ dataset with clinical and
mutation data.

```{r, message=FALSE}
# READ mutation data and clinical data
readData <- getFirehoseData(dataset="READ", runDate="20160128",
    forceDownload=TRUE, clinical=TRUE, Mutation=TRUE)
```

Printing the object shows which datasets are in the `FirehoseData`
object:

```{r}
readData
```

Users have to set several parameters to get the data they need. The
`getFirehoseData` options are explained below:

* dataset: The cohort code of the dataset to download. The list is accessible
via `getFirehoseDatasets()`, as explained above.
* runDate: The Firehose project provides data at different time points for
each cohort. Users can list the dates with `getFirehoseRunningDates()`, as
shown above.
* gistic2Date: Just as for cohort data, the Firehose project runs its analysis
pipelines at different dates to process copy number data with GISTIC2 [@ref2].
Users who want to get GISTIC2 processed copy number data should set this date.
The list is accessible via `getFirehoseAnalyzeDates()`.

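
The sketch below shows one way `runDate` and `gistic2Date` might be combined
in a single call. It assumes that the installed version of RTCGAToolbox has a
logical `GISTIC` argument that switches on the GISTIC2 download; if your
version differs, check `?getFirehoseData`. The object name `readGistic` is
only illustrative.

```{r eval=FALSE}
library(RTCGAToolbox)

# Most recent analysis run date reported by Firehose
gisticDate <- getFirehoseAnalyzeDates(last = 1)

# Assumption: `GISTIC = TRUE` requests the GISTIC2 processed copy number data
readGistic <- getFirehoseData(dataset = "READ", runDate = "20160128",
    gistic2Date = gisticDate, clinical = TRUE, GISTIC = TRUE)

# Gene-level GISTIC2 copy number estimates
head(getData(readGistic, "GISTIC", "AllByGene"))
```
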
The following logical keys are provided for the different data types. By
default, the client only downloads clinical data.

* clinical
* RNASeqGene
* RNASeq2Gene
* RNASeq2GeneNorm
* miRNASeqGene
* CNASNP
* CNVSNP
* CNASeq
* CNACGH
* Methylation
* Mutation
* mRNAArray
* miRNAArray
* RPPAArray

Users can also set the following parameters to control client behavior.

* forceDownload: By default, RTCGAToolbox checks your working directory before
downloading data. If data from a previous run is present in the working
directory, RTCGAToolbox loads it from those files. To override this and
re-download the data, set this parameter to `TRUE`.
* fileSizeLimit: Sets a limit on the size of downloaded files. Huge data files
take longer to download and need more memory to load. By default, this
parameter is set to 500 MB.
* getUUIDs: Firehose provides TCGA barcodes for every sample. In some cases,
users may want to use UUIDs for samples. If this parameter is set,
RTCGAToolbox retrieves the UUID for each barcode after processing the data.
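
A hedged sketch of a call that sets all three behavior parameters is shown
below. The limit of 100 is an arbitrary illustration and assumes that
`fileSizeLimit` is given as a number of megabytes, in line with the 500 MB
default described above. `readMeth` is an illustrative object name.

```{r eval=FALSE}
library(RTCGAToolbox)

readMeth <- getFirehoseData(dataset = "READ", runDate = "20160128",
    clinical = TRUE, Methylation = TRUE,
    forceDownload = FALSE,  # reuse files from a previous run if present
    fileSizeLimit = 100,    # assumed to be in MB; lower than the default
    getUUIDs = TRUE)        # look up UUIDs for the TCGA barcodes
```

A lower `fileSizeLimit` is a cheap way to test a pipeline before committing to
the download time and memory that the larger files need.
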
## Example Dataset

We provide an abbreviated dataset from the 'ACC' (adrenocortical carcinoma)
cohort that contains only the top 6 rows of each dataset and the full clinical
dataset. It can be loaded with:

```{r}
data(accmini)
accmini
```

* `accmini` is a `FirehoseData` object that stores RNAseq, copy number,
mutation, and clinical data from the adrenocortical carcinoma (ACC) study.
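
Because `FirehoseData` is an S4 class, the generic tools from the `methods`
package can be used to look at its structure. The sketch below assumes nothing
beyond the `accmini` object itself; the slot names it prints depend on the
installed package version.

```{r}
library(RTCGAToolbox)
data(accmini)

# Confirm the class of the example object
is(accmini, "FirehoseData")

# List the slots that hold the individual data types
slotNames(accmini)
```
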
## Conversion to Bioconductor classes

The `biocExtract` function allows the user to take any downloaded dataset and
convert it into a standard Bioconductor object. The result is a
`SummarizedExperiment`, a `RangedSummarizedExperiment`, or a
`RaggedExperiment`, depending on the features of the data. The user must
provide the desired data type as input to the function along with the
`FirehoseData` object. This makes it easy to pass the data on to other
software in the Bioconductor ecosystem.

```{r}
biocExtract(accmini, "RNASeq2Gene")
biocExtract(accmini, "CNASNP")
```
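
The converted object can then be handled with the usual Bioconductor
accessors. The following sketch assumes that the RNA-seq extraction yields a
`SummarizedExperiment` (or a subclass of it) and that the
`SummarizedExperiment` package is installed, which is expected because
`biocExtract` returns objects of that class. `rnaseq` is an illustrative name.

```{r}
library(RTCGAToolbox)
data(accmini)

rnaseq <- biocExtract(accmini, "RNASeq2Gene")

# Class and dimensions (features x samples) of the converted object
class(rnaseq)
dim(rnaseq)

# Expression values for up to the first three samples
firstSamples <- seq_len(min(3, ncol(rnaseq)))
SummarizedExperiment::assay(rnaseq)[, firstSamples, drop = FALSE]
```
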
# Raw Data

You can obtain the downloaded data in tabular or list format from the
`FirehoseData` object by using the `getData()` function.

```{r}
head(getData(accmini, "clinical"))
getData(accmini, "RNASeq2GeneNorm")
getData(accmini, "GISTIC", "AllByGene")
```
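
The extracted objects are ordinary R objects and can be inspected with base R
functions. As a small sketch, assuming the clinical data is returned as a
`data.frame` with one row per patient:

```{r}
library(RTCGAToolbox)
data(accmini)

clin <- getData(accmini, "clinical")

# Number of patients (rows) and clinical variables (columns)
dim(clin)

# Names of the first few clinical variables
head(colnames(clin))
```
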
## Session Info
```{r}
sessionInfo()
```
# References