<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/TeachingMaterials/2021-10-NIHLibrarySession/BigQuerySurvival.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Survival Analysis and BigQuery

This notebook demonstrates how to use BigQuery to gather data for survival analysis. We compare survival between breast cancer (TCGA-BRCA) patients who have a mutation in the ERBB2 gene and patients who do not.

## Load Libraries and Authorize with BigQuery

### Packages

| Package | Description |
| :--- | :--- |
| [bigrquery](https://bigrquery.r-dbi.org/) | The bigrquery package makes it easy to work with data stored in Google BigQuery by allowing you to query BigQuery tables and retrieve metadata about your projects, datasets, tables, and jobs. |
| [tidyverse](https://www.tidyverse.org/packages/) | A compilation package including ggplot2, dplyr, tibble, and stringr. |
| [survival](https://cran.r-project.org/web/packages/survival/index.html) | Contains the core survival analysis routines, including definition of Surv objects, Kaplan-Meier and Aalen-Johansen (multi-state) curves, Cox models, and parametric accelerated failure time models. |
| [ggfortify](https://cran.r-project.org/web/packages/ggfortify/index.html) | Unified plotting tools for statistics commonly used, such as GLM, time series, PCA families, clustering, and survival analysis. The package offers a single plotting interface for these analysis results and plots in a unified style using 'ggplot2'. |

In [None]:
#Load Libraries
if (!require(bigrquery)) {
  print('Installing bigrquery package')
  install.packages('bigrquery')
  if (!require(bigrquery)) {
    print('Cannot install bigrquery package')
  }
}

if (!require(tidyverse)) {
  print('Installing tidyverse package')
  install.packages('tidyverse')
  if (!require(tidyverse)) {
    print('Cannot install tidyverse package')
  }
}

if (!require(survival)) {
  print('Installing survival package')
  install.packages('survival')
  if (!require(survival)) {
    print('Cannot install survival package')
  }
}

if (!require(ggfortify)) {
  print('Installing ggfortify package')
  install.packages('ggfortify')
  if (!require(ggfortify)) {
    print('Cannot install ggfortify package')
  }
}

## Authenticate to Access BigQuery
Before using BigQuery, we need to authorize access to BigQuery and Google Cloud. For more information, see the 'Quick Start Guide to ISB-CGC'. R notebooks that use the bigrquery library in Google Colab need the following workaround to authenticate; see https://gist.github.com/jobdiogenes/235620928c84e604c6e56211ccf681f0

In [None]:
# NOTE: this cell is only required if you're using Google Colab
if (!require('R.utils')) {
    print('Installing R.utils package')
    install.packages("R.utils")
    if (!require('R.utils')) {
        print('Cannot install R.utils package')
    }
}

if (!require('httr')) {
    print('Installing httr package')
    install.packages('httr')
    if (!require('httr')) {
        print('Cannot install httr package')
    }
}

my_check <- function() {return(TRUE)}
reassignInPackage("is_interactive", pkgName = "httr", my_check) 
options(rlang_interactive=TRUE)

In [None]:
# Now authenticate to BQ. Be sure to select the BigQuery scope!
bq_auth(use_oob = TRUE, cache = TRUE)

In [None]:
# Set your Google Project
project = 'your-project here' # Update to your Google Project

# Gather Data

## Technical terms

| Name | Description |
| :--- | :--- |
| isb-cgc-bq | Google project name of ISB-CGC |
| TCGA.somatic_mutation_hg38_gdc_current | BigQuery dataset and table containing TCGA somatic mutation data |
| TCGA.clinical_gdc_current | BigQuery dataset and table containing TCGA clinical data |
| project_short_name | The column name with the project name abbreviation |
| Hugo_Symbol | The HUGO symbol for the gene |
| demo__vital_status | The column with the survival state of the patient |
| demo__days_to_death | The column with the number of days between the date used for index and the date of the patient's death |
| diag__days_to_last_follow_up | The column with the number of days between the initial diagnosis and the last follow-up with the patient |

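## Retrieve Patient ERBB2 Mutation Status from BigQuery
The first component of our data set is each patient's ERBB2 mutation status within the TCGA-BRCA project. The query returns one row per distinct case and status, so a case with an ERBB2 mutation can also appear with the status 'none' for its other mutated genes; these duplicates are removed in the cleaning step.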

Table: *isb-cgc-bq.TCGA.somatic_mutation_hg38_gdc_current*

In [None]:
cohort_query <- "WITH t AS (
            SELECT case_id, Hugo_Symbol
            FROM `isb-cgc-bq.TCGA.somatic_mutation_hg38_gdc_current`
            WHERE
              project_short_name = 'TCGA-BRCA')
            SELECT DISTINCT case_id,
              CASE
                WHEN Hugo_Symbol = 'ERBB2' THEN 'ERBB2'
                ELSE 'none'
              END
              AS gene_status
            FROM t
            ORDER BY gene_status"
# Run the query
cohort <- bq_project_query(project, cohort_query, quiet = TRUE) 
# Create a dataframe with the results from the query
cohort <- bq_table_download(cohort, quiet = TRUE)
# Show the dataframe
summary(cohort)
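
The duplicate rows come from selecting distinct `case_id` and `gene_status` pairs. As an alternative sketch (not the method this notebook uses), the status could be collapsed to one row per case inside BigQuery with `GROUP BY` and `COUNTIF`. The variable name `cohort_query_grouped` is hypothetical, and the query has not been run against the table here.

```r
# Hypothetical alternative to cohort_query: one row per case.
# A case is labelled 'ERBB2' if any of its mutation rows is in ERBB2.
cohort_query_grouped <- "
  SELECT
    case_id,
    IF(COUNTIF(Hugo_Symbol = 'ERBB2') > 0, 'ERBB2', 'none') AS gene_status
  FROM `isb-cgc-bq.TCGA.somatic_mutation_hg38_gdc_current`
  WHERE project_short_name = 'TCGA-BRCA'
  GROUP BY case_id
  ORDER BY gene_status"

# It would be run the same way as the original query (requires BigQuery access):
# cohort <- bq_table_download(bq_project_query(project, cohort_query_grouped))
```

With this form, the de-duplication in the cleaning step would have nothing left to remove, but it would still be harmless to keep.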

## Retrieve Clinical Data From BigQuery

The other important component of our data set is the patient's vital status and either days to death or days to last follow up.

Table: *isb-cgc-bq.TCGA.clinical_gdc_current*

In [None]:
survival_query <- str_c("
  SELECT 
    case_id,
    submitter_id,
    demo__vital_status,
    demo__days_to_death,
    diag__days_to_last_follow_up
  FROM `isb-cgc-bq.TCGA.clinical_gdc_current`
  WHERE
    case_id IN ('", str_c(cohort$case_id, collapse = "', '"),"') AND
    demo__vital_status IS NOT NULL")

survival_request <- bq_project_query(project, survival_query)
survival_data <- bq_table_download(survival_request)
survival <- left_join(survival_data, cohort, by = "case_id")
head(survival)
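
The `WHERE ... IN (...)` clause above is assembled by pasting the case IDs into the SQL string. The small sketch below shows what that step produces, using made-up IDs and base R's `paste0` in place of `str_c` so that it runs without any packages; `toy_ids`, `id_list`, and `in_clause` are illustrative names only.

```r
# Toy case IDs (made up for illustration; real IDs come from cohort$case_id)
toy_ids <- c("id-001", "id-002", "id-003")

# Collapse the IDs into one string, separated by "', '"
id_list <- paste0(toy_ids, collapse = "', '")

# Add the opening and closing quotes and parentheses around the list
in_clause <- paste0("case_id IN ('", id_list, "')")
print(in_clause)
```

Printing the assembled clause before sending the query is a quick way to check the quoting.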

# Clean Data

We remove duplicate cases and rows with missing data, and we combine days to death and days to last follow-up into a single column.

In [None]:
# Add a column for days to event, starting from days to death
survival$days_to_event <- survival$demo__days_to_death

# Fill in NAs for alive cases with days to last follow-up
for (row in seq_len(nrow(survival))) {
  if (survival$demo__vital_status[row] == 'Alive' && is.na(survival$days_to_event[row])){
    survival$days_to_event[row] <- survival$diag__days_to_last_follow_up[row]
  }
}

# Remove duplicate cases, keeping the row with the mutation ('ERBB2' sorts before 'none')

survival <- arrange(survival, gene_status)
survival <- survival[!duplicated(survival$case_id),]

# Filter out cases with no days-to-event value (e.g. marked as dead without days to death) and cases with days below 1
survival <- filter(survival, !is.na(days_to_event) & days_to_event >= 1)

# Convert the vital status to numbers
survival$vital_status <- ifelse(survival$demo__vital_status=='Alive', 0, 1)

head(survival)
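
To make the cleaning rules concrete, here is a minimal sketch that applies the same steps to a small made-up data frame, written with vectorized base R instead of a loop. The data frame `toy` and its values are invented for illustration and are not TCGA data.

```r
# Toy data (made up): case c2 appears twice because it has an ERBB2 mutation
toy <- data.frame(
  case_id = c("c1", "c2", "c2", "c3", "c4"),
  demo__vital_status = c("Alive", "Dead", "Dead", "Dead", "Alive"),
  demo__days_to_death = c(NA, 300, 300, NA, NA),
  diag__days_to_last_follow_up = c(1200, NA, NA, NA, 0),
  gene_status = c("none", "none", "ERBB2", "none", "none"),
  stringsAsFactors = FALSE
)

# Days to event: last follow-up for alive cases without days to death,
# otherwise days to death
toy$days_to_event <- ifelse(
  toy$demo__vital_status == "Alive" & is.na(toy$demo__days_to_death),
  toy$diag__days_to_last_follow_up,
  toy$demo__days_to_death
)

# Sort so 'ERBB2' rows come before 'none', then keep the first row per case
toy <- toy[order(toy$gene_status), ]
toy <- toy[!duplicated(toy$case_id), ]

# Drop rows with a missing days-to-event value or a value below 1
toy <- toy[!is.na(toy$days_to_event) & toy$days_to_event >= 1, ]

# Event indicator: 1 = death observed, 0 = censored (still alive)
toy$vital_status <- ifelse(toy$demo__vital_status == "Alive", 0, 1)
print(toy[, c("case_id", "gene_status", "days_to_event", "vital_status")])
```

In this toy example, c2 is kept once with status 'ERBB2', c3 is dropped because it has no days to death, and c4 is dropped because its follow-up time is 0.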

# Analyze Data

Finally, we can create the survival analysis and plot the results.

In [None]:
# create a survival curve plot
autoplot(survfit(Surv(days_to_event, vital_status) ~ gene_status, data = survival)) +
  labs(title = "Survival Curve",
       y = "Percent Survival", 
       x = "Days") +
  theme(legend.title=element_blank())

In [None]:
# Analyze the differences between groups with a Log-Rank Test
survdiff(Surv(days_to_event, vital_status) ~ gene_status, data = survival)
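
The printed `survdiff` result reports a chi-squared statistic and a p-value. As a rough sketch of how the p-value can be recovered from the returned object, the example below uses the `lung` data set that ships with the survival package rather than the TCGA data, so it runs without BigQuery; the names `fit`, `df`, and `p_value` are illustrative.

```r
library(survival)

# Built-in example data from the survival package (not TCGA data).
# In lung, status is coded 1 = censored, 2 = dead, which Surv() accepts.
fit <- survdiff(Surv(time, status) ~ sex, data = lung)

# The log-rank statistic is compared to a chi-squared distribution
# with (number of groups - 1) degrees of freedom
df <- length(fit$n) - 1
p_value <- pchisq(fit$chisq, df = df, lower.tail = FALSE)
print(p_value)
```

The same two lines could be applied to the `survdiff` result for `gene_status`, assuming it is first stored in a variable.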

## Conclusion

The log-rank test in `survdiff` indicates a significant difference between the survival curves, suggesting that outcome differs by ERBB2 mutation status.

# Follow-up Exercises

Practice your BigQuery skills by trying to solve the following exercises on your own. Come to our office hours to see the solutions.

We have seen that the ERBB2 gene has an effect on the survival outcome of a patient, but do other genes commonly mutated in breast cancer have a similar effect?

- Does a mutation in the gene BRCA1 affect survival outcomes?
- How about BRCA2?

Does a mutation in the ERBB2 gene have an effect on the survival outcome for other cancers besides Breast Cancer?
- Does a mutation in ERBB2 affect survival outcomes in Ovarian Cancer?

## Additional Reading

### ERBB2 and Breast Cancer

Kurozumi, S., Alsaleem, M., Monteiro, C.J. et al. Targetable ERBB2 mutation status is an independent marker of adverse prognosis in estrogen receptor positive, ERBB2 non-amplified primary lobular breast carcinoma: a retrospective in silico analysis of public datasets. Breast Cancer Res 22, 85 (2020). https://doi.org/10.1186/s13058-020-01324-4


Ping, Zheng et al. “ERBB2 mutation is associated with a worse prognosis in patients with CDH1 altered invasive lobular cancer of the breast.” Oncotarget vol. 7,49 (2016): 80655-80663. https://doi.org/10.18632/oncotarget.13019


Griffith, Obi L et al. “The prognostic effects of somatic mutations in ER-positive breast cancer.” Nature communications vol. 9,1 3476. 4 Sep. 2018, https://doi.org/10.1038/s41467-018-05914-x


### Survival Analysis

Rich, Jason T et al. “A practical guide to understanding Kaplan-Meier curves.” Otolaryngology--head and neck surgery : official journal of American Academy of Otolaryngology-Head and Neck Surgery vol. 143,3 (2010): 331-6. [doi:10.1016/j.otohns.2010.05.007](https://doi.org/10.1016/j.otohns.2010.05.007)

# Contact Us
Please contact us to learn more about BigQuery, to discuss cost considerations when working with BigQuery projects, or to discuss any projects you feel may benefit from the ISB-CGC Platform.

* Email us: feedback@isb-cgc.org
* Check out our website: https://isb-cgc.org
* Visit our [office hours](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/office_hours.html)