# ISB-CGC Community Notebooks

Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

```
Title:   Quick Start Guide to ISB-CGC
Author:  Lauren Hagen
Created: 2019-06-20
Updated: 2023-08
Purpose: Painless intro to working with ISB-CGC in the cloud
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb
Notes:   This Quick Start Guide gives an overview of the data available in ISB-CGC and walks through a basic example in Python.
```
***

# Quick Start Guide to [ISB-CGC](https://isb-cgc.appspot.com/) in BigQuery



## Account Set-up
To run this notebook, you will need to have your Google Cloud Account set up. If you need to set up a Google Cloud Account, follow the "Obtain a Google identity" and "Set up a Google Cloud Project" steps on our [Quick-Start Guide documentation](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html) page.

Alternatively, you can run this notebook in 'demo mode' using only a Google Account (e.g. a Gmail account) rather than a billable Google Cloud Account. In 'demo mode', queries are sent to BigQuery via a CGC proxy, and a CGC account is charged for the queries. However, the CGC proxy restricts which queries can be run and caps the total BigQuery processing volume (in GB) per user.


In [1]:
demo_mode = False

## Libraries needed for the Notebook
This notebook uses the BigQuery client library [(click here for more information)](https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/client.html), which allows programmatic access to BigQuery.

In [2]:
# GCP libraries
from google.cloud import bigquery
from google.colab import auth
from google.api_core import client_options
from google.oauth2.credentials import Credentials

## Overview of ISB-CGC
The ISB-CGC provides interactive and programmatic access to data hosted by institutes such as the [Genomic Data Commons (GDC)](https://gdc.cancer.gov/) and [Proteomic Data Commons (PDC)](https://proteomic.datacommons.cancer.gov/pdc/) from the [National Cancer Institute (NCI)](https://www.cancer.gov/) while leveraging many aspects of the Google Cloud Platform. You can also import your own data, analyze it side by side with the hosted datasets, and share it as you see fit. The ISB-CGC hosts carefully curated high-level clinical, biospecimen, and molecular datasets and tables in Google BigQuery, including data from programs such as [The Cancer Genome Atlas (TCGA)](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga), [Therapeutically Applicable Research to Generate Effective Treatments (TARGET)](https://ocg.cancer.gov/programs/target), and [Clinical Proteomic Tumor Analysis Consortium (CPTAC)](https://proteomics.cancer.gov/programs/cptac). More information can be found on our [Programs and Data Sets page](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Hosted-Data.html). This data can be explored via Python, the [Google Cloud Console](https://console.cloud.google.com/), and/or our [BigQuery Table Search tool](https://isb-cgc.appspot.com/bq_meta_search/).

## Example of Accessing BigQuery Data with Python


### Log into Google Cloud Storage and Authenticate ourselves

Normally, you need authenticated access to use Google BigQuery. However, you can explore the queries in this notebook in 'demo mode' without authenticating with a billable Google Cloud Account.

Steps to authenticate yourself:
1. Run the code block below to authenticate with your Google Cloud login
2. A second tab will open; if it does not, follow the link provided
3. Follow the prompts to authorize your account to use the Google Cloud SDK
4. Copy the code provided and paste it into the box under the command
5. Press Enter

[Alternative authentication methods](https://googleapis.github.io/google-cloud-python/latest/core/auth.html)

In [3]:
if demo_mode:
    !wget -O collab_queries.py https://github.com/isb-cgc/Community-Notebooks/raw/refs/heads/master/BQProxy/collab_queries.py
    from collab_queries import api_endpoint, demo_client_args, demo_job_config_arg
else:
    # if you're using Google Colab, authenticate to gcloud with the following
    auth.authenticate_user()

    # alternatively, use the gcloud SDK
    #!gcloud auth application-default login
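
The import `from google.colab import auth` and the call to `auth.authenticate_user()` only work inside Google Colab. As a minimal sketch (the helper names `in_colab` and `auth_hint` are hypothetical and not part of the original notebook), the environment can be detected with the standard library before choosing one of the two authentication routes shown above:

```python
import importlib.util


def in_colab():
    """Return True if the google.colab package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent `google` package is missing or malformed.
        return False


def auth_hint():
    """Suggest which authentication route from the cell above applies."""
    if in_colab():
        return "Colab detected: call auth.authenticate_user()"
    return "Not in Colab: run `gcloud auth application-default login`"


print(auth_hint())
```

Outside Colab, the `google.colab` import in the libraries cell would also need to be skipped or guarded in a similar way.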

### Creating a client and using a billing project

To access BigQuery, you will need a Google Cloud Project for queries to be billed to. If you need to create a Project, instructions on how to create one can be found on our [Quick-Start Guide page](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html).

A BigQuery Client object with the billing Project needs to be created to interface with BigQuery.

> Note: Any costs that you incur are charged under your current project, so you will want to make sure you are on the correct one if you are part of multiple projects.


In [4]:
if demo_mode:
    project_id="isb-cgc-dev-1"
else:
    # Billing project that BigQuery queries are charged to
    project_id = 'YOUR_PROJECT_ID_CHANGE_ME' # Update with your Google Project Id

In [5]:
# Create a BigQuery Client
if project_id == 'YOUR_PROJECT_ID_CHANGE_ME': # checking that project id was changed
    print('Please update project_id with your Google Cloud Project ID')
else:
    # Demo mode needs extra client arguments (from collab_queries.py)
    client_args = demo_client_args() if demo_mode else {}
    client = bigquery.Client(project_id, **client_args)
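
The check above only catches the unchanged placeholder. As a rough sketch, a typo can also be caught early by testing the value against the documented format for newly created Google Cloud project IDs (6 to 30 characters; lowercase letters, digits, and hyphens; starting with a letter and not ending with a hyphen). The name `looks_like_project_id` is a hypothetical helper, and older projects may use IDs that do not fit this pattern:

```python
import re

# Google Cloud project IDs: 6-30 characters, lowercase letters, digits and
# hyphens, starting with a letter and not ending with a hyphen.
PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def looks_like_project_id(value):
    """Return True if `value` has the shape of a Google Cloud project ID."""
    return PROJECT_ID_PATTERN.fullmatch(value) is not None


for candidate in ("YOUR_PROJECT_ID_CHANGE_ME", "isb-cgc-dev-1", "isb-cgc-bq"):
    print(candidate, looks_like_project_id(candidate))
```

A passing check only means the string is well formed; it does not confirm that the project exists or that you have access to it.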

### View ISB-CGC Datasets and Tables in BigQuery
Let us look at the datasets available through ISB-CGC that are in BigQuery.

In [6]:
# Project that hosts the ISB-CGC data
project_with_data = 'isb-cgc-bq'

# Fetch the list of datasets in that project (API request)
datasets = list(client.list_datasets(project_with_data))

# If there are datasets available then print their names,
# else print that there are no datasets available
if datasets:
    print(f"Datasets in project {project_with_data}:")
    for dataset in datasets:
        print(f"\t{dataset.dataset_id}")
else:
    print(f"{project_with_data} project does not contain any datasets.")

Datasets in project isb-cgc-bq:
	0_README
	APOLLO
	APOLLO_versioned
	BEATAML1_0
	BEATAML1_0_versioned
	BROAD
	BROAD_versioned
	CBTN
	CBTN_versioned
	CBTTC
	CBTTC_versioned
	CCLE
	CCLE_versioned
	CDDP_EAGLE
	CDDP_EAGLE_versioned
	CGCI
	CGCI_versioned
	CMI
	CMI_versioned
	CPTAC
	CPTAC_versioned
	CTSP
	CTSP_versioned
	DEPMAP
	DEPMAP_versioned
	EXC_RESPONDERS
	EXC_RESPONDERS_versioned
	FM
	FM_versioned
	GDC_case_file_metadata
	GDC_case_file_metadata_versioned
	GENCODE
	GENCODE_versioned
	GENIE
	GENIE_versioned
	GPRP
	GPRP_versioned
	HCMI
	HCMI_versioned
	HTAN
	HTAN_versioned
	ICPC
	ICPC_versioned
	ISB_Regulome_Explorer
	MATCH
	MATCH_versioned
	MMRF
	MMRF_versioned
	MP2PRT
	MP2PRT_versioned
	NCICCR
	NCICCR_versioned
	OHSU
	OHSU_versioned
	ORGANOID
	ORGANOID_versioned
	PDC_metadata
	PDC_metadata_versioned
	Quant_Maps_Tissue_Biopsies
	Quant_Maps_Tissue_Biopsies_versioned
	REBC
	REBC_versioned
	TARGET
	TARGET_versioned
	TCGA
	TCGA_versioned
	TRIO
	TRIO_versioned
	VAREPOP
	VAREPOP_versioned
	WC
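
Most names in the listing above come in pairs: a base name and the same name with a `_versioned` suffix. A minimal sketch of grouping them is shown below. `pair_datasets` is a hypothetical helper, and the sample list is a small subset copied from the output above; in the notebook, `[d.dataset_id for d in datasets]` could be passed instead.

```python
def pair_datasets(dataset_ids, suffix="_versioned"):
    """Map each base dataset name to whether a versioned twin exists."""
    names = set(dataset_ids)
    pairs = {}
    for name in sorted(names):
        if name.endswith(suffix):
            continue  # versioned datasets are reported via their base name
        pairs[name] = (name + suffix) in names
    return pairs


sample = ["0_README", "CPTAC", "CPTAC_versioned", "TCGA", "TCGA_versioned", "WC"]
for base, has_versioned in pair_datasets(sample).items():
    print(f"{base}: versioned twin = {has_versioned}")
```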

The ISB-CGC has two datasets for each Program or source. One dataset contains the most current data, and the other contains versioned tables, which serve as an archive for reproducibility. The current tables are labeled with "_current" and are updated when new data is released. For more information, visit our [ISB-CGC BigQuery Projects](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery/ISBCGC-BQ-Projects.html) page. Let's see which tables are in the TCGA_versioned dataset.

In [7]:
dataset_with_data = 'TCGA_versioned'

print("Tables:")
# Create a variable with the list of tables in the dataset
tables = list(client.list_tables(f'{project_with_data}.{dataset_with_data}'))

# If there are tables then print their names,
# else print that there are no tables
if tables:
    for table in tables:
        print("\t{}".format(table.table_id))
else:
    print("\tThis dataset does not contain any tables.")

Tables:
	DNA_methylation_chr10_hg19_gdc_2017_01
	DNA_methylation_chr10_hg38_gdc_2017_01
	DNA_methylation_chr11_hg19_gdc_2017_01
	DNA_methylation_chr11_hg38_gdc_2017_01
	DNA_methylation_chr12_hg19_gdc_2017_01
	DNA_methylation_chr12_hg38_gdc_2017_01
	DNA_methylation_chr13_hg19_gdc_2017_01
	DNA_methylation_chr13_hg38_gdc_2017_01
	DNA_methylation_chr14_hg19_gdc_2017_01
	DNA_methylation_chr14_hg38_gdc_2017_01
	DNA_methylation_chr15_hg19_gdc_2017_01
	DNA_methylation_chr15_hg38_gdc_2017_01
	DNA_methylation_chr16_hg19_gdc_2017_01
	DNA_methylation_chr16_hg38_gdc_2017_01
	DNA_methylation_chr17_hg19_gdc_2017_01
	DNA_methylation_chr17_hg38_gdc_2017_01
	DNA_methylation_chr18_hg19_gdc_2017_01
	DNA_methylation_chr18_hg38_gdc_2017_01
	DNA_methylation_chr19_hg19_gdc_2017_01
	DNA_methylation_chr19_hg38_gdc_2017_01
	DNA_methylation_chr1_hg19_gdc_2017_01
	DNA_methylation_chr1_hg38_gdc_2017_01
	DNA_methylation_chr20_hg19_gdc_2017_01
	DNA_methylation_chr20_hg38_gdc_2017_01
	DNA_methylation_chr21_hg19_gdc_20
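
A dataset can contain many tables, so it can help to filter the names before reading through them. The sketch below uses the standard library's `fnmatch` on a few names copied from the output above. The wildcard pattern is only an illustration based on how those names look, not a documented naming rule. In the notebook, `[t.table_id for t in tables]` would supply the full list.

```python
import fnmatch

# A few table names copied from the listing above.
table_ids = [
    "DNA_methylation_chr10_hg19_gdc_2017_01",
    "DNA_methylation_chr10_hg38_gdc_2017_01",
    "DNA_methylation_chr11_hg19_gdc_2017_01",
    "DNA_methylation_chr11_hg38_gdc_2017_01",
]

# Keep only the tables whose name contains the hg38 genome build tag.
hg38_tables = fnmatch.filter(table_ids, "*_hg38_*")
print(hg38_tables)
```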

### Query ISB-CGC BigQuery Tables


In this section, we will store our SQL in a string variable, send it to BigQuery, and save the result to a dataframe.

#### Syntax for the query
```
SELECT # Select a few columns to view
  proj__project_id, # GDC project
  submitter_id, # case barcode
  proj__name # GDC project name
FROM # Which table in BigQuery in the format of `project.dataset.table`
  `project_name.dataset_name.table_name` # From the GDC TCGA Clinical Dataset
LIMIT
  5 # Limit to 5 rows as the dataset is very large and we only want to see a few results
```
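
The template above can also be assembled in Python, which makes the `project.dataset.table` format explicit. This is a minimal sketch with hypothetical helper names (`qualified_table`, `preview_query`). It simply formats strings, so it should only be used with names you trust, not with user-supplied input.

```python
def qualified_table(project, dataset, table):
    """Return a backtick-quoted `project.dataset.table` identifier."""
    return f"`{project}.{dataset}.{table}`"


def preview_query(columns, table_ref, limit=5):
    """Build a SELECT ... LIMIT statement for a quick look at a table."""
    column_list = ",\n  ".join(columns)
    return f"SELECT\n  {column_list}\nFROM\n  {table_ref}\nLIMIT\n  {int(limit)}"


table_ref = qualified_table("isb-cgc-bq", "TCGA_versioned", "clinical_gdc_r37")
print(preview_query(["proj__project_id", "submitter_id", "proj__name"], table_ref))
```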

> Note: `LIMIT` only limits the number of rows returned, not the number of rows that the query scans.
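
Because `LIMIT` generally does not reduce how much data a query reads, it can help to check the scan size before running a query. The following is a minimal sketch and not part of the original notebook: `format_bytes` and `estimate_scanned_bytes` are hypothetical helper names, and the sketch assumes an authenticated, non-demo `client` (the demo proxy may not accept dry-run jobs). It relies on the client library's `QueryJobConfig(dry_run=True)` option, which validates a query and reports `total_bytes_processed` without running it.

```python
def format_bytes(num_bytes):
    """Render a byte count in a human-readable unit."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{size:.2f} {unit}"
        size /= 1024


def estimate_scanned_bytes(client, sql):
    """Dry-run `sql` and return the bytes BigQuery reports it would process."""
    # Imported here so format_bytes stays usable without the library installed.
    from google.cloud import bigquery

    # dry_run=True asks BigQuery to plan the query without executing it;
    # disabling the cache keeps the estimate from being reported as zero.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


print(format_bytes(1536))
```

With a client in place, `print(format_bytes(estimate_scanned_bytes(client, sql)))` would show the estimate for any SQL string `sql`.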


In [8]:
query = ("""
  SELECT
    proj__project_id,
    submitter_id,
    proj__name
  FROM
    `isb-cgc-bq.TCGA_versioned.clinical_gdc_r37`
  LIMIT
    5""")

job_config_arg = demo_job_config_arg(query_id="qsg1") if demo_mode else {}
result = client.query(query, **job_config_arg).to_dataframe()  # API request
print(result)

  proj__project_id  submitter_id                    proj__name
0        TCGA-BLCA  TCGA-ZF-AA4N  Bladder Urothelial Carcinoma
1        TCGA-BRCA  TCGA-A1-A0SK     Breast Invasive Carcinoma
2        TCGA-BRCA  TCGA-AC-A3EH     Breast Invasive Carcinoma
3        TCGA-LUSC  TCGA-77-A5FZ  Lung Squamous Cell Carcinoma
4        TCGA-LUSC  TCGA-85-8584  Lung Squamous Cell Carcinoma


## Resources
There are several ways to access and explore the data hosted by ISB-CGC.

* ISB-CGC
  * [ISB-CGC WebApp](https://isb-cgc.appspot.com/)
    * Provides a graphical interface to file and case data
    * Cohort creation
    * File exploration
  * [ISB-CGC BigQuery Table Search](https://isb-cgc.appspot.com/bq_meta_search/)
    * Provides a table search for available ISB-CGC BigQuery Tables
  * [ISB-CGC APIs](https://api-dot-isb-cgc.appspot.com/v4/swagger/)
    * Provides programmatic access to metadata

* Google Cloud
  * [Google Cloud Platform](https://cloud.google.com/)
    * Access and store data in [Google Cloud Storage](https://cloud.google.com/storage) and [BigQuery](https://cloud.google.com/bigquery) via User Interfaces or programmatically
    
* Suggested Programming Languages and Programs to use
  * SQL
    * Can be used directly in [BigQuery Console](https://console.cloud.google.com/bigquery)
    * Or via API in Python or R
  * [Python](https://www.python.org/)
    * [gsutil tool](https://cloud.google.com/storage/docs/gsutil)
    * [Jupyter Notebooks](https://jupyter.org/)
    * [Google Colaboratory](https://colab.research.google.com/)
    * [Cloud Datalab](https://cloud.google.com/datalab/)
  * [R](https://www.r-project.org/)
    * [RStudio](https://rstudio.com/)
    * [RStudio.Cloud](https://rstudio.cloud/)
* Command Line Interfaces
  * Cloud Shell via Project Console
  * [Cloud SDK](https://cloud.google.com/sdk/)
* Getting Started for Free:
  * [Free Cloud Credits from ISB-CGC for Cancer Research](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowtoRequestCloudCredits.html)
  * [Google Free Tier with up to 1TB of free queries a month](https://cloud.google.com/free)

Useful ISB-CGC Links:

* [ISB-CGC Landing Page](https://isb-cgc.appspot.com/)
* [ISB-CGC Documentation](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/)
* [How to Get Started on ISB-CGC](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html)
* [How to access Google BigQuery](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/progapi/bigqueryGUI/HowToAccessBigQueryFromTheGoogleCloudPlatform.html)
* [Community Notebook Repository](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowTos.html)

Useful Google Tutorials:

* [Google's What is BigQuery?](https://cloud.google.com/bigquery/docs/introduction)
* [Google Cloud Client Library for Python](https://googleapis.github.io/google-cloud-python/latest/index.html)