# Integrative Analysis with TCGA Data

Analysis of Mutation Data from The Cancer Genome Atlas (TCGA)

In [None]:
library(knitr)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## Introduction

The Cancer Genome Atlas (TCGA) is a massive cancer genomics project compiling high-throughput multi-omic data on dozens of cancer types for [public access](https://www.cancer.gov/ccg/research/genome-sequencing/tcga).

We will use the `curatedTCGAData` package to download several high-throughput datasets from the project and work with them locally. The package provides access to TCGA data that have been curated and stored as *MultiAssayExperiment* objects on the Bioconductor [ExperimentHub](https://bioconductor.org/packages/release/bioc/html/ExperimentHub.html).

First, let’s load the packages needed.

In [None]:
library(curatedTCGAData)


Loading required package: MultiAssayExperiment


## Download the Data

To download the data we use the `curatedTCGAData` function. The first argument is a TCGA disease (cancer) code such as `READ`; a complete list of the disease codes used by the TCGA project is available on the [NCI Genomic Data Commons website](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations). The second argument is a vector of the data types we want to download. By default (`dry.run = TRUE`) the function only lists the matching resources, so we need to specify `dry.run = FALSE` to actually download the data.

In this specific case, we will work with RNA-Seq, mutation, and methylation data from rectum adenocarcinoma (READ). The clinical data are included by default.
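
Before downloading, it can help to check which assays exist for a disease code. The following is a minimal sketch, assuming the `curatedTCGAData` package is installed and ExperimentHub is reachable over the network; the variable name `available` is only illustrative.

```r
# Sketch: list the READ resources without downloading any data.
# Assumes curatedTCGAData is installed and ExperimentHub is reachable.
library(curatedTCGAData)

available <- curatedTCGAData(
  diseaseCode = "READ",
  assays      = "*",       # wildcard: match every assay type
  version     = "2.1.1",   # same data version as used in this tutorial
  dry.run     = TRUE       # only list matching resources
)

# Inspect the first few matching resources
head(available)
```

The assay names shown in the listing are the values that can be passed in the second argument of the real download call.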

In [None]:

readData = curatedTCGAData("READ", 
                           c("RNASeq2GeneNorm", "Mutation", "Methylation_methyl450"), 
                           dry.run = FALSE, version = "2.1.1")


Working on: READ_Mutation-20160128

see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation

loading from cache

Working on: READ_RNASeq2GeneNorm-20160128

Working on: READ_Methylation_methyl450-20160128

Working on: READ_colData-20160128

Working on: READ_sampleMap-20160128

Working on: READ_metadata-20160128

harmonizing input:
  removing 1903 sampleMap rows not in names(experiments)
  removing 2 colData rownames not in sampleMap 'primary'

A MultiAssayExperiment object of 3 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 3:
 [1] READ_Mutation-20160128: RaggedExperiment with 22075 rows and 69 columns
 [2] READ_RNASeq2GeneNorm-20160128: SummarizedExperiment with 18115 rows and 177 columns
 [3] READ_Methylation_methyl450-20160128: SummarizedExperiment with 485577 rows and 106 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

The sample map shows which patients have data for each assay. The `assay` column gives the experiment type, the `primary` column gives the unique patient ID, and the `colname` column gives the sample ID used as an identifier within a given experiment.

In [None]:
sampleMap(readData)


DataFrame with 352 rows and 3 columns
                                  assay      primary                colname
                               <factor>  <character>            <character>
1   READ_Methylation_methyl450-20160128 TCGA-AF-2687 TCGA-AF-2687-01A-02D..
2   READ_Methylation_methyl450-20160128 TCGA-AF-2690 TCGA-AF-2690-01A-02D..
3   READ_Methylation_methyl450-20160128 TCGA-AF-2693 TCGA-AF-2693-01A-02D..
4   READ_Methylation_methyl450-20160128 TCGA-AF-3911 TCGA-AF-3911-01A-01D..
5   READ_Methylation_methyl450-20160128 TCGA-AF-4110 TCGA-AF-4110-01A-02D..
...                                 ...          ...                    ...
348       READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02G        TCGA-AG-A02G-01
349       READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02N        TCGA-AG-A02N-01
350       READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02X        TCGA-AG-A02X-01
351       READ_RNASeq2GeneNorm-20160128 TCGA-AG-A032        TCGA-AG-A032-01
352       READ_RNASeq2GeneNorm-20160128 TCGA-AG-A0
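
The `colname` values are TCGA barcodes, and part of each barcode encodes the sample type. As a rough illustration in base R, the sketch below splits a barcode into the patient ID and the two-digit sample type code. The first two barcodes appear in the output above; the third (`TCGA-XX-0000-11`) is a hypothetical normal sample added only for the example.

```r
# Two barcodes from the sample map plus one hypothetical normal sample
barcodes <- c("TCGA-AG-A02G-01", "TCGA-AG-A02N-01", "TCGA-XX-0000-11")

# Characters 1-12 identify the patient; characters 14-15 hold the sample type code
patient_id  <- substr(barcodes, 1, 12)
sample_code <- as.integer(substr(barcodes, 14, 15))

# TCGA barcode convention: 01-09 tumor, 10-19 normal, 20-29 control
sample_class <- ifelse(sample_code < 10, "tumor",
                       ifelse(sample_code < 20, "normal", "control"))

data.frame(patient_id, sample_code, sample_class)
```

Note that the methylation barcodes in the sample map are longer (they carry vial, portion, and analyte fields), but the positions used here are the same.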

Not all patients have data for all assays, and some have multiple entries for one or more experiment types. These may correspond to multiple biopsies or to matched tumor and normal samples from the same patient. The table below counts how many patients have one, two, three, or four rows in the sample map.

In [None]:
sampleMap(readData) |> 
  as_tibble() |> 
  pull(primary) |> 
  table() |> 
  table()



  1   2   3   4 
  5 147   7   8 
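
The double `table()` call can look cryptic, so here is a small sketch of the same idiom on toy data. The patient IDs are made up for illustration and are not TCGA identifiers.

```r
# Toy 'primary' column: one entry per row of a hypothetical sample map
primary <- c("P1", "P2", "P2", "P3", "P3", "P3", "P4")

# First table(): number of sample-map rows for each patient
rows_per_patient <- table(primary)

# Second table(): number of patients having each row count
patients_per_count <- table(rows_per_patient)
patients_per_count
```

Here two patients have one row, one patient has two rows, and one patient has three rows. In the real output above, the top line is the number of rows per patient and the bottom line is the number of patients with that many rows.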

We can see the metadata of the patients with `colData`. Note that there are more than 2000 columns of data per patient (not necessarily complete).

In [None]:
clin = colData(readData) |> 
  as_tibble()
dim(clin)


[1]  167 2260

The first ten column names are:

 [1] "patientID"             "years_to_birth"        "vital_status"         
 [4] "days_to_death"         "days_to_last_followup" "tumor_tissue_site"    
 [7] "pathologic_stage"      "pathology_T_stage"     "pathology_N_stage"    
[10] "pathology_M_stage"    

As an example, for rectum adenocarcinoma, we can tabulate the pathologic T stage (`pathology_T_stage`).

In [None]:
clin |> 
  pull(pathology_T_stage) |> 
  table()



 t1  t2  t3  t4 t4a t4b 
  9  28 114   5   8   1 

Stage T4 has subgroups (T4a and T4b). To simplify the analysis, let’s combine all T4 tumors.

In [None]:
clin <- clin |> 
  mutate(t_stage = case_when(
    pathology_T_stage %in% c("t4","t4a","t4b") ~ "t4",
    .default = pathology_T_stage
  ))

clin$t_stage |> 
  table()



 t1  t2  t3  t4 
  9  28 114  14 

We can also look at the vital status (alive = 0, deceased = 1).

In [None]:
clin$vital_status |> 
  table()



  0   1 
139  28 

Or cross-tabulate T stage against vital status.

In [None]:
table(clin$t_stage, clin$vital_status)


    
      0  1
  t1  9  0
  t2 24  4
  t3 96 18
  t4  9  5
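
Raw counts are hard to compare across stages because the groups differ greatly in size. As a minimal sketch, the counts from the cross-tabulation above can be converted to row proportions with `prop.table()`. The matrix is typed in by hand here so the example is self-contained; in the notebook the same call should work directly on `table(clin$t_stage, clin$vital_status)`.

```r
# Counts copied from the cross-tabulation above
# (rows: T stage; columns: vital status, 0 = alive, 1 = deceased)
stage_by_status <- matrix(
  c( 9,  0,
    24,  4,
    96, 18,
     9,  5),
  ncol = 2, byrow = TRUE,
  dimnames = list(t_stage = c("t1", "t2", "t3", "t4"),
                  vital_status = c("0", "1"))
)

# margin = 1 divides each row by its row total
row_props <- prop.table(stage_by_status, margin = 1)
round(row_props, 3)
```

For instance, 5 of the 14 T4 patients (about 36%) are recorded as deceased, compared with 18 of 114 (about 16%) for T3. With so few patients in some groups, these proportions are descriptive only.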