<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/Staging-Notebooks/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ISB-CGC Community Notebooks

Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

```
Title:   Quick Start Guide to ISB-CGC
Author:  Lauren Hagen
Created: 2019-06-20
Updated: 2021-07-27
Purpose: Painless intro to working in the cloud
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb
Notes:   
```
***

# Quick Start Guide to ISB-CGC
[ISB-CGC](https://isb-cgc.appspot.com/)

This Quick Start Guide gives an overview of the available data, outlines account set-up, and walks through a basic example in Python. If you have read the R version, you can skip to the Example section.

## Access Requirements
* Google Account to access ISB-CGC
* [Google Cloud Account](https://console.cloud.google.com)

## Access Suggestions
* Favored Programming Language (R or Python)
* Favored IDE (RStudio or Jupyter)
* Some knowledge of SQL

## Outline for this Notebook
* Libraries Needed for this Notebook
* Overview of ISB-CGC
* Overview of How to Access Data
* Example of Accessing Data with Python
* Where to Go Next

## Libraries Needed for this Notebook
This notebook requires the BigQuery client library for Python [(click here for more information)](https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/client.html). This library allows you to access BigQuery programmatically.

In [None]:
# Load BigQuery API
from google.cloud import bigquery

## Overview of ISB-CGC
The ISB-CGC provides interactive and programmatic access to data hosted by institutes such as the [Genomic Data Commons (GDC)](https://gdc.cancer.gov/) and [Proteomic Data Commons (PDC)](https://proteomic.datacommons.cancer.gov/pdc/) from the [National Cancer Institute (NCI)](https://www.cancer.gov/) while leveraging many aspects of the Google Cloud Platform. You can also import your own data, analyze it side by side with the hosted datasets, and share it when you see fit.

### About the ISB-CGC Data in the Cloud
ISB-CGC hosts carefully curated, high-level clinical, biospecimen, and molecular datasets and tables in Google BigQuery, including data from programs such as [The Cancer Genome Atlas (TCGA)](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga), [Therapeutically Applicable Research to Generate Effective Treatments (TARGET)](https://ocg.cancer.gov/programs/target), and [Clinical Proteomic Tumor Analysis Consortium (CPTAC)](https://proteomics.cancer.gov/programs/cptac). For more information about hosted data, please visit: [Programs and DataSets](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Hosted-Data.html)

## Overview of How to Access Data
There are several ways to access and explore the data hosted by ISB-CGC. This notebook covers only access with Python and SQL; the other options are listed here for reference.

* [ISB-CGC WebApp](https://isb-cgc.appspot.com/)
  * Provides a graphical interface to file and case data
  * Easy cohort creation
  * Doesn't require knowledge of programming languages
* [ISB-CGC BigQuery Table Search](https://isb-cgc.appspot.com/bq_meta_search/)
  * Provides a table search for available ISB-CGC BigQuery Tables
* [ISB-CGC APIs](https://api-dot-isb-cgc.appspot.com/v4/swagger/)
  * Provides programmatic access to metadata
* [Google Cloud Platform](https://cloud.google.com/)
  * Access and store data in [Google Cloud Storage](https://cloud.google.com/storage) and [BigQuery](https://cloud.google.com/bigquery) via User Interfaces or programmatically
* Suggested Programming Languages and Programs to use
  * SQL
    * Can be used directly in the [BigQuery Console](https://console.cloud.google.com/bigquery)
    * Or via API in Python or R
  * [Python](https://www.python.org/)
    * [gsutil tool](https://cloud.google.com/storage/docs/gsutil)
    * [Jupyter Notebooks](https://jupyter.org/)
    * [Google Colaboratory](https://colab.research.google.com/)
    * [Cloud Datalab](https://cloud.google.com/datalab/)
  * [R](https://www.r-project.org/)
    * [RStudio](https://rstudio.com/)
    * [RStudio.Cloud](https://rstudio.cloud/)
* Command Line Interfaces
  * Cloud Shell via Project Console
  * [Cloud SDK](https://cloud.google.com/sdk/)

### Account Set-up
To run this notebook, you will need to have your Google Cloud Account set up. If you need to set up a Google Cloud Account, follow the "Obtain a Google identity" and "Set up a Google Cloud Project" steps on our [Quick-Start Guide documentation](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html) page.
 

### ISB-CGC Web Interface
The [ISB-CGC Web Interface](https://isb-cgc.appspot.com/) is an interactive web-based application for accessing and exploring the rich TCGA, TARGET, and CCLE datasets, with more datasets added regularly. Through the WebApp, you can create Cohorts and lists of Favorite Genes, miRNAs, and Variables. The Cohorts and Variables can be used in Workbooks to quickly analyze and export datasets by mixing and matching the selections.

### Google Cloud Platform and BigQuery Overview

The [Google Cloud Platform Console](https://console.cloud.google.com/) is the web-based interface to your GCP Project. From the Console, you can check the overall status of your project, create and delete Cloud Storage buckets, upload and download files, spin up and shut down VMs, add members to your project, access the [Cloud Shell command line](https://cloud.google.com/shell/docs/), etc. Keep in mind that any costs you incur are charged to your *current* project, so if you belong to multiple projects, make sure the correct one is selected.

ISB-CGC has loaded multiple cancer genomic and proteomic datasets into open-access BigQuery tables, including TCGA and TARGET clinical, biospecimen, and molecular data, along with case and file data. These data can be accessed from the Google Cloud Platform Console User Interface (UI), programmatically with R and Python, or explored with our [BigQuery Table Search tool](https://isb-cgc.appspot.com/bq_meta_search/).
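
Because query costs are billed to your current project, it can help to check how much data a query would scan before running it. The sketch below is one possible approach, not part of the original notebook: it uses the client library's dry-run option, which reports the bytes a query would process without executing it. The helper names `format_bytes` and `estimate_query_bytes` are hypothetical, and `PROJECT_ID` is a placeholder for your own project.

```python
def format_bytes(num_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that fits, or at the largest unit available
        if size < 1024 or unit == units[-1]:
            return "{:.1f} {}".format(size, unit)
        size /= 1024


def estimate_query_bytes(client, sql):
    """Dry-run sql and return the number of bytes it would process."""
    # Imported here so format_bytes can be used without the library installed
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    query_job = client.query(sql, job_config=job_config)  # API request; the query is not executed
    return query_job.total_bytes_processed


# Example use (needs credentials and a billing project):
# from google.cloud import bigquery
# client = bigquery.Client(project="PROJECT_ID")
# sql = "SELECT submitter_id FROM `isb-cgc-bq.TCGA.clinical_gdc_current`"
# print(format_bytes(estimate_query_bytes(client, sql)))

print(format_bytes(5 * 1024**3))  # prints "5.0 GiB"
```

Comparing the estimate against your free-tier allowance or credits before running a large query is a cheap safeguard, since BigQuery charges on-demand queries by bytes processed.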

## Example of Accessing BigQuery Data with Python


### Log into Google Cloud and Authenticate
1. Run the command below to authenticate with your Google Cloud login
2. A second tab will open; if it does not, follow the link provided
3. Follow the prompts to authorize your account to use the Google Cloud SDK
4. Copy the code provided and paste it into the box under the command
5. Press Enter

[Alternative authentication methods](https://googleapis.github.io/google-cloud-python/latest/core/auth.html)

In [None]:
!gcloud auth application-default login
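
If you are running this notebook in Google Colaboratory, the `google.colab` package offers an alternative sign-in flow. The sketch below is an optional variant, not the notebook's original method: the wrapper function `authenticate` is a hypothetical name, and it falls back to a reminder about the `gcloud` command when `google.colab` cannot be imported.

```python
def authenticate():
    """Sign in through Colab when available; otherwise point to the gcloud flow."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import auth
    except ImportError:
        print("Not running in Colab: use `gcloud auth application-default login` instead.")
        return "gcloud"
    auth.authenticate_user()  # opens Colab's interactive sign-in prompt
    return "colab"


method = authenticate()
```

Either route leaves credentials in place that the BigQuery client can pick up automatically.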

### View ISB-CGC Datasets and Tables in BigQuery
Let us look at the datasets available through ISB-CGC that are in BigQuery. 

In [None]:
# Create a client to access the data within BigQuery
# Note: you cannot use the project below as a billing project,
# it can only be used to view the tables and table schema
client = bigquery.Client('isb-cgc-bq')

# Create a variable of datasets 
datasets = list(client.list_datasets())
# Create a variable for the name of the project
project = client.project

# If there are datasets available then print their names,
# else print that there are no datasets available
if datasets:
    print("Datasets in project {}:".format(project))
    for dataset in datasets:  # API request(s)
        print("\t{}".format(dataset.dataset_id))
else:
    print("{} project does not contain any datasets.".format(project))



Datasets in project isb-cgc-bq:
	0_README
	BEATAML1_0
	BEATAML1_0_versioned
	CBTTC
	CBTTC_versioned
	CCLE
	CCLE_versioned
	CGCI
	CGCI_versioned
	CMI
	CMI_versioned
	CPTAC
	CPTAC_versioned
	CTSP
	CTSP_versioned
	FM
	FM_versioned
	GDC_case_file_metadata
	GDC_case_file_metadata_versioned
	GENCODE
	GENCODE_versioned
	GENIE
	GENIE_versioned
	GPRP
	GPRP_versioned
	HCMI
	HCMI_versioned
	ICPC
	ICPC_versioned
	MMRF
	MMRF_versioned
	NCICCR
	NCICCR_versioned
	OHSU
	OHSU_versioned
	ORGANOID
	ORGANOID_versioned
	PDC_metadata
	PDC_metadata_versioned
	Quant_Maps_Tissue_Biopsies
	Quant_Maps_Tissue_Biopsies_versioned
	TARGET
	TARGET_versioned
	TCGA
	TCGA_versioned
	VAREPOP
	VAREPOP_versioned
	WCDT
	WCDT_versioned
	annotations
	annotations_versioned
	functions
	pancancer_atlas
	supplementary_tables


ISB-CGC provides two datasets for each Program: one holding the most current data and one, suffixed `_versioned`, holding versioned tables that serve as an archive for reproducibility. Tables in the current datasets are suffixed `_current` and are updated when new data is released. For more information, visit our [ISB-CGC BigQuery Projects](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery/ISBCGC-BQ-Projects.html) page.
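
To make the current/versioned pairing easier to see, the dataset IDs can be grouped by name. The sketch below is illustrative only: `pair_datasets` is a hypothetical helper, it assumes the `_versioned` suffix seen in the listing above, and it works on a short hand-typed sample rather than a live API call. In the notebook you could pass `[d.dataset_id for d in datasets]` instead.

```python
def pair_datasets(dataset_ids):
    """Map each current dataset ID to its versioned counterpart (or None)."""
    suffix = "_versioned"
    ids = set(dataset_ids)
    pairs = {}
    for dataset_id in sorted(ids):
        if dataset_id.endswith(suffix):
            continue  # versioned datasets appear as values, not keys
        versioned = dataset_id + suffix
        pairs[dataset_id] = versioned if versioned in ids else None
    return pairs


# A few dataset IDs copied from the listing above
sample = ["TCGA", "TCGA_versioned", "TARGET", "TARGET_versioned", "functions"]
pairs = pair_datasets(sample)
for current, versioned in pairs.items():
    print("{} -> {}".format(current, versioned))
```

Datasets that map to `None`, such as `functions` in this sample, have no versioned counterpart in the list.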

Now, let us see which tables are under the TCGA dataset.

In [None]:
print("Tables:")
# Create a variable with the list of tables in the dataset
tables = list(client.list_tables('isb-cgc-bq.TCGA'))

# If there are tables then print their names,
# else print that there are no tables
if tables:
    for table in tables:
        print("\t{}".format(table.table_id))
else:
    print("\tThis dataset does not contain any tables.")

Tables:
	DNA_methylation_chr10_hg19_gdc_current
	DNA_methylation_chr10_hg38_gdc_current
	DNA_methylation_chr11_hg19_gdc_current
	DNA_methylation_chr11_hg38_gdc_current
	DNA_methylation_chr12_hg19_gdc_current
	DNA_methylation_chr12_hg38_gdc_current
	DNA_methylation_chr13_hg19_gdc_current
	DNA_methylation_chr13_hg38_gdc_current
	DNA_methylation_chr14_hg19_gdc_current
	DNA_methylation_chr14_hg38_gdc_current
	DNA_methylation_chr15_hg19_gdc_current
	DNA_methylation_chr15_hg38_gdc_current
	DNA_methylation_chr16_hg19_gdc_current
	DNA_methylation_chr16_hg38_gdc_current
	DNA_methylation_chr17_hg19_gdc_current
	DNA_methylation_chr17_hg38_gdc_current
	DNA_methylation_chr18_hg19_gdc_current
	DNA_methylation_chr18_hg38_gdc_current
	DNA_methylation_chr19_hg19_gdc_current
	DNA_methylation_chr19_hg38_gdc_current
	DNA_methylation_chr1_hg19_gdc_current
	DNA_methylation_chr1_hg38_gdc_current
	DNA_methylation_chr20_hg19_gdc_current
	DNA_methylation_chr20_hg38_gdc_current
	DNA_methylation_chr21_hg19_gdc_cu
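
The listing above is sorted alphabetically, so `chr10` comes before `chr1`, and both genome builds are interleaved. The sketch below shows one way to narrow the list to a single build and order it by chromosome number. The helper names are hypothetical, and the pattern assumes the `_chrN_` and `_hg19_`/`_hg38_` naming visible in the output. In the notebook you could pass `[t.table_id for t in tables]` instead of the sample list.

```python
import re


def tables_for_build(table_ids, build):
    """Return the table IDs whose name contains the given genome build, e.g. 'hg38'."""
    token = "_{}_".format(build)
    return [table_id for table_id in table_ids if token in table_id]


def chromosome_order(table_id):
    """Sort key: numbered chromosomes first in numeric order, everything else after."""
    match = re.search(r"_chr(\d+)_", table_id)
    if match:
        return (0, int(match.group(1)), table_id)
    return (1, 0, table_id)


# Table IDs copied from the listing above
sample = [
    "DNA_methylation_chr10_hg19_gdc_current",
    "DNA_methylation_chr10_hg38_gdc_current",
    "DNA_methylation_chr1_hg19_gdc_current",
    "DNA_methylation_chr1_hg38_gdc_current",
]
hg38_tables = sorted(tables_for_build(sample, "hg38"), key=chromosome_order)
for table_id in hg38_tables:
    print(table_id)
```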

### Query ISB-CGC BigQuery Tables


First, use a magic command to call BigQuery. Then write your query in Standard SQL. Click [here](https://googleapis.github.io/google-cloud-python/latest/bigquery/magics.html) for more on IPython Magic Commands for BigQuery. The result will be a [Pandas Dataframe](https://pandas.pydata.org/).

> Note: you will need to update PROJECT_ID in the next cell to your Google Cloud Project ID.

In [None]:
%%bigquery --project PROJECT_ID
# Call BigQuery with a cell magic, which generally has to be the first line of the cell;
# replace PROJECT_ID with your Google Cloud project ID
SELECT # Select a few columns to view
  proj__project_id, # GDC project
  submitter_id, # case barcode
  proj__name # GDC project name
FROM # From the GDC TCGA clinical table
  `isb-cgc-bq.TCGA.clinical_gdc_current`
LIMIT # Limit to 5 rows as the table is very large and we only want to see a few results
  5

# Syntax for the above query
# SELECT column_name_1, column_name_2
# FROM `project_name.dataset_name.table_name`
# LIMIT number_of_rows

| | proj__project_id | submitter_id | proj__name |
|---|---|---|---|
| 0 | TCGA-HNSC | TCGA-CN-5363 | Head and Neck Squamous Cell Carcinoma |
| 1 | TCGA-HNSC | TCGA-CN-5365 | Head and Neck Squamous Cell Carcinoma |
| 2 | TCGA-HNSC | TCGA-CN-A642 | Head and Neck Squamous Cell Carcinoma |
| 3 | TCGA-HNSC | TCGA-CR-7380 | Head and Neck Squamous Cell Carcinoma |
| 4 | TCGA-HNSC | TCGA-CV-5978 | Head and Neck Squamous Cell Carcinoma |
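
The `%%bigquery` magic is only available inside a notebook. In a plain Python script, the same query can be sent through the client library instead. The sketch below is one way to do that, under a few assumptions: `run_query` is a hypothetical helper name, `PROJECT_ID` is a placeholder for your own billing project, and `pandas` must be installed for `to_dataframe()` to work.

```python
QUERY = """
SELECT
  proj__project_id,
  submitter_id,
  proj__name
FROM
  `isb-cgc-bq.TCGA.clinical_gdc_current`
LIMIT 5
"""


def run_query(client, sql=QUERY):
    """Submit sql through a BigQuery client and return the result as a DataFrame."""
    query_job = client.query(sql)    # API request that starts the query job
    return query_job.to_dataframe()  # waits for the job and downloads the rows


# Example use (needs credentials and a billing project):
# from google.cloud import bigquery
# client = bigquery.Client(project="PROJECT_ID")
# df = run_query(client)
# print(df.head())
```

Passing the client in as an argument keeps the billing project explicit and lets the same helper be reused for other queries.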


Now that wasn't so difficult! Have fun exploring and analyzing the ISB-CGC Data!

## Where to Go Next

Access, Explore and Analyze Large-Scale Cancer Data Through the Google Cloud! :)

Getting Started for Free:
* [Free Cloud Credits from ISB-CGC for Cancer Research](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowtoRequestCloudCredits.html)
* [Google Free Tier with up to 1TB of free queries a month](https://cloud.google.com/free)

ISB-CGC Links:

* [ISB-CGC Landing Page](https://isb-cgc.appspot.com/)
* [ISB-CGC Documentation](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/)
* [How to Get Started on ISB-CGC](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html)
* [How to access Google BigQuery](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/progapi/bigqueryGUI/HowToAccessBigQueryFromTheGoogleCloudPlatform.html)
* [Community Notebook Repository](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowTos.html)
* [Query of the Month](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/QueryOfTheMonthClub.html)

Google Tutorials:

* [Google's What is BigQuery?](https://cloud.google.com/bigquery/docs/introduction)
* [Google Cloud Client Library for Python](https://googleapis.github.io/google-cloud-python/latest/index.html)