forked from MSKCC-Epi-Bio/gnomeR
---
output: github_document
always_allow_html: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
warning = FALSE,
message = FALSE
)
library(gnomeR)
library(knitr)
library(dplyr)
```
# gnomeR
<!-- badges: start -->
[![Codecov test coverage](https://codecov.io/gh/AxelitoMartin/gnomeR/branch/development/graph/badge.svg)](https://codecov.io/gh/AxelitoMartin/gnomeR?branch=development) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4171608.svg)](https://doi.org/10.5281/zenodo.4171608) [![R-CMD-check](https://github.com/AxelitoMartin/gnomeR/workflows/R-CMD-check/badge.svg)](https://github.com/AxelitoMartin/gnomeR/actions)
<!-- badges: end -->
<font size="5">:bangbang: :warning: **NOTE: This package is currently under active development with a new stable release expected April 2022. For code written before 2022-03-23, please use the previous stable version (v1.1.0)**:warning::bangbang: </font>
You can install the pre-2022-03-23 version with:
```{r, eval = FALSE}
remotes::install_github('AxelitoMartin/gnomeR@v1.1.0')
```
## Installation
You can install the development version of `gnomeR` from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("AxelitoMartin/gnomeR")
```
You can also install its companion package for downloading cBioPortal data:
``` r
devtools::install_github("karissawhiting/cbioportalr")
```
## Introduction
The `gnomeR` package provides a consistent framework for processing, visualizing and analyzing genomic data. It is primarily targeted at IMPACT datasets but can also be applied to any genomic data provided by cBioPortal. It covers:
- [**Downloading and gathering data from cBioPortal**](https://github.com/karissawhiting/cbioportalr) through an integrated API, using either the IDs of the samples of interest or the name of a study to retrieve all samples in that study. This functionality lives in a separate package, `cbioportalr`, which was developed independently.
- [**Processing genomic data**](https://axelitomartin.github.io/gnomeR/articles/Data-processing.html) retrieved for mutations (MAF file), fusions (MAF file) and copy-number alterations (plus segmentation files when available) into an analysis-ready format.
- [**Visualization of the processed data**](https://axelitomartin.github.io/gnomeR/articles/Visualizations.html) provided through MAF file summaries, OncoPrints and heatmaps.
- [**Analyzing the processed data**](https://axelitomartin.github.io/gnomeR/articles/Analizing-genomic-data.html) for association with binary, continuous and survival outcomes, including further visualization to improve understanding of the results.
## Examples
### Setting up the API
To download data from cBioPortal, you must first request a token from the [cBioPortal](https://cbioportal.mskcc.org/) website, which will prompt you to log in with your MSKCC credentials. Then navigate to "Web API" in the top menu bar and download a token. Copy the token, then run the following command in R to open your `.Renviron` file:
```{r,eval=F}
usethis::edit_r_environ()
```
Paste the token into the `.Renviron` file that opens, using the format below, then save the file:
```{r, eval=F}
CBIOPORTAL_TOKEN = 'YOUR_TOKEN'
```
You can test your connection using:
```{r,eval = F}
cbioportalr::get_cbioportal_token()
```
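If the connection test fails, it can help to confirm that the token is visible to the current R session at all. The sketch below uses only base R; `has_cbioportal_token()` is a hypothetical helper written for this illustration and is not part of `gnomeR` or `cbioportalr`. Note that R reads `.Renviron` at startup, so after editing the file you need to restart R (or call `readRenviron("~/.Renviron")`) before the variable appears.

```r
# Hypothetical helper (not part of gnomeR or cbioportalr): returns TRUE when
# the named environment variable is set to a non-empty value in this session.
has_cbioportal_token <- function(var = "CBIOPORTAL_TOKEN") {
  nzchar(Sys.getenv(var, unset = ""))
}

# TRUE once the token has been saved to .Renviron and R has been restarted
has_cbioportal_token()
```

This only checks that a value is present; it does not verify that the token is accepted by the server.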
### Retrieving data
Now that the cBioPortal API is set up in your environment, first specify the database of interest (IMPACT and TCGA are the two available options). Then specify either the samples or the study of interest:
```{r, eval = F}
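library(gnomeR)
library(cbioportalr)
# Take the first 100 sample IDs from the example `mut` dataset shipped with gnomeR
ids <- as.character(unique(mut$Tumor_Sample_Barcode)[1:100])
# Download mutations, fusions and copy-number alterations for those samples
df <- get_genetics(sample_ids = ids, database = "msk_impact",
                   mutations = TRUE, fusions = TRUE, cna = TRUE)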
```
### Processing the downloaded data
The `binmat()` function is the core of `gnomeR`'s data processing. It takes genomic inputs from various cBioPortal sources (mutation files, fusion files and raw copy-number counts) and returns a clean binary matrix of n samples by all the events found in those files.
```{r, eval = F}
df.clean <- binmat(maf = df$mut, cna = df$cna)
```
We also include example datasets taken from raw files downloaded from cBioPortal (`mut`, `fusion`, `cna`), which we use in the following examples.
```{r}
set.seed(123)
patients <- as.character(unique(mut$Tumor_Sample_Barcode))[sample(1:length(unique(mut$Tumor_Sample_Barcode)), 100, replace=FALSE)]
gen_dat <- binmat(patients = patients, maf = mut, fusion = fusion, cna = cna)
kable(gen_dat[1:10,1:10],row.names = TRUE)
```
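To make the idea of a samples-by-events binary matrix concrete, here is a minimal base R sketch on a made-up mutation table. It is not how `binmat()` is implemented and it ignores fusions and copy-number data; the toy sample IDs are invented, and only the `Tumor_Sample_Barcode` and `Hugo_Symbol` column names follow the usual MAF convention.

```r
# Toy long-format mutation table (invented data, MAF-style column names)
toy_maf <- data.frame(
  Tumor_Sample_Barcode = c("S1", "S1", "S2", "S3", "S3"),
  Hugo_Symbol          = c("TP53", "KRAS", "TP53", "EGFR", "EGFR"),
  stringsAsFactors     = FALSE
)

# Count mutations per sample/gene pair, then collapse counts to 0/1
counts <- table(toy_maf$Tumor_Sample_Barcode, toy_maf$Hugo_Symbol)
toy_bin <- matrix(
  as.integer(counts > 0),
  nrow = nrow(counts),
  dimnames = list(rownames(counts), colnames(counts))
)

toy_bin
#>    EGFR KRAS TP53
#> S1    0    1    1
#> S2    0    0    1
#> S3    1    0    0
```

Each row is a sample and each column an event; sample `S3` carries two EGFR mutations but still gets a single 1.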
### Visualization
#### MAF
Before moving on to more complex visualizations, the `maf_viz()` function gives an overview of the distribution of the different mutations across the cohort of interest:
```{r}
sum.plots <- maf_viz(maf = mut %>% filter(Tumor_Sample_Barcode %in% patients))
sum.plots$topgenes
sum.plots$genecomut
```
#### OncoPrints
OncoPrints are a convenient way to display the overall genomic profiles of samples in the cohort of interest. They work best with a subset of genes under consideration.
```{r}
genes <- c("TP53","PIK3CA","KRAS","TERT","EGFR","FAT","ALK","CDKN2A","CDKN2B")
plot_oncoprint(gen_dat = gen_dat %>% select(starts_with(genes)))
```
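The `starts_with(genes)` call above selects columns by prefix, so one gene symbol can pull in several columns and can also match other genes that share the prefix. The base R sketch below mimics that prefix matching on invented column names; the suffixes shown are assumptions for illustration and may not match the real `binmat()` output.

```r
# Hypothetical column names; real binmat() output may be named differently
cols  <- c("TP53", "TP53.Del", "FAT1", "FAT1.Amp", "KRAS", "ALK.fus", "BRCA2")
genes <- c("TP53", "FAT", "ALK")

# Keep a column when it starts with any of the requested prefixes
keep <- vapply(cols, function(x) any(startsWith(x, genes)), logical(1))

cols[keep]
#> [1] "TP53"     "TP53.Del" "FAT1"     "FAT1.Amp" "ALK.fus"
```

Here the prefix `"FAT"` also captures `FAT1` columns, which may or may not be what you intend; use a more specific prefix when you need a narrower selection.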
#### FACETs
[FACETs](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5027494/) is an allele-specific copy number (ASCN) tool and open-source software with broad application to whole-genome, whole-exome and targeted panel sequencing platforms. It is a fully integrated stand-alone pipeline that includes sequencing BAM file post-processing, joint segmentation of total- and allele-specific read counts, and integer copy number calls corrected for tumor purity, ploidy and clonal heterogeneity, with comprehensive output.
```{r}
p.heat <- facets_heatmap(seg = seg, patients = patients, min_purity = 0)
p.heat$p
```
### Analysis
In this section we give a quick overview of the analyses available in `gnomeR`.
#### Binary and continuous outcomes
The `gen_summary()` function lets the user perform large-scale association testing between the genomic features present in the `binmat()` output and an outcome of choice:
- binary (unpaired test using Fisher's exact test, paired test using McNemar's exact test)
- continuous (using simple linear regression)
```{r}
outcome <- factor(rbinom(n = length(patients),size = 1,prob = 1/2),levels = c("0","1"))
# out <- gen_summary(gen_dat = gen_dat,outcome = outcome,filter = 0.05)
# kable(out$fits[1:10,],row.names = TRUE)
# out$forest.plot
```
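As an illustration of the unpaired binary case, the sketch below runs Fisher's exact test for a single simulated genomic feature against a simulated binary outcome using base R's `fisher.test()`. It shows the kind of per-feature test described above and is not the internal code of `gen_summary()`; the data and variable names are invented.

```r
set.seed(1)
n <- 100

# Simulated binary outcome and one simulated binary genomic feature
toy_outcome <- factor(rbinom(n, size = 1, prob = 0.5), levels = c("0", "1"))
toy_feature <- factor(rbinom(n, size = 1, prob = 0.3), levels = c("0", "1"))

# 2x2 contingency table of feature status by outcome
tab <- table(feature = toy_feature, outcome = toy_outcome)

# Fisher's exact test: p-value and conditional odds ratio estimate
fit <- fisher.test(tab)
fit$p.value
fit$estimate
```

Repeating this for every column of the binary matrix yields one p-value per feature, which is why multiple-testing adjustment becomes relevant in large-scale screens.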
#### Survival analysis
Similarly, the `gen_uni_cox()` function provides simple tools to perform univariate Cox proportional hazards regression, adjusted for false discovery rate.
```{r}
time <- rexp(length(patients))
status <- outcome
surv_dat <- as.data.frame(cbind(time,status))
out <- gen_uni_cox(X = gen_dat, surv_dat = surv_dat, surv_formula = Surv(time,status)~.,filter = 0.05)
kable(out$tab[1:10,],row.names = TRUE)
out$KM[[1]]
```
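The general recipe of fitting one Cox model per feature and then adjusting the p-values can be sketched with the `survival` package and base R's `p.adjust()`. This is an illustration on simulated data, not the implementation of `gen_uni_cox()`; in particular, the Benjamini-Hochberg method is an assumption here, since the text above only states that results are adjusted for false discovery rate.

```r
library(survival)

set.seed(42)
n <- 200

# Simulated binary features, survival times and event indicators
X <- data.frame(
  geneA = rbinom(n, 1, 0.3),
  geneB = rbinom(n, 1, 0.2),
  geneC = rbinom(n, 1, 0.4)
)
toy_time   <- rexp(n)
toy_status <- rbinom(n, 1, 0.7)  # 1 = event observed, 0 = censored

# One univariate Cox proportional hazards model per feature
pvals <- vapply(names(X), function(g) {
  fit <- coxph(Surv(toy_time, toy_status) ~ X[[g]])
  summary(fit)$coefficients[1, "Pr(>|z|)"]
}, numeric(1))

# Benjamini-Hochberg adjustment (assumed FDR method for this sketch)
fdr <- p.adjust(pvals, method = "BH")

data.frame(feature = names(X), p_value = pvals, fdr = fdr, row.names = NULL)
```

Because the features are simulated independently of survival, no particular association is expected here.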
### Further analytical tools
The primary goal of `gnomeR` is not in-depth analysis of genomic data but rather a reliable, modular and reproducible framework for processing various types of genomic data. For users interested in large-scale genomic analytical methods, we compiled various packages developed by the [Department of Epidemiology and Biostatistics](https://www.mskcc.org/departments/epidemiology-biostatistics), Memorial Sloan-Kettering Cancer Center, under an umbrella R package, [gnomeVerse](https://github.com/AxelitoMartin/genomeVerse).