<a href="https://colab.research.google.com/github/jhphan/ML-Notebooks/blob/main/tcga-ov-ml-therapy-test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ISB-CGC Machine Learning Notebooks
Check out other notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

- **Title:** Building a simple gene expression-based classifier
- **Author:** John Phan
- **Created:** 2021-07-07
- **Purpose:** Demonstrate a basic machine learning method to predict a cancer endpoint using gene expression data.
- **URL:** https://github.com/isb-cgc/Community-Notebooks
- **Note1:** This example is based on the work published by [Bosquet et al.](https://molecular-cancer.biomedcentral.com/articles/10.1186/s12943-016-0548-9)


This notebook demonstrates how to build a basic machine learning model to predict ovarian cancer treatment outcome. Ovarian cancer gene expression data is pulled from a BigQuery table and formatted using Pandas. The data is then split into training and testing sets to build and test a logistic regression classifier using scikit-learn. 

## Import Dependencies

In [1]:
# GCP libraries
from google.cloud import bigquery
from google.colab import auth

# Pandas
import pandas as pd

# Machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics


## Authenticate

Before using BigQuery, we need to authenticate and obtain access to BigQuery and Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods are described [here](https://googleapis.dev/python/google-api-core/latest/auth.html).

In [2]:
# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()

# alternatively, use the gcloud SDK
#!gcloud auth application-default login

## Parameters

Customize the following parameters based on your notebook, execution environment, or project.

In [3]:
# set the google project that will be billed for this notebook's computations
google_project = 'cgc-05-0051'

# in this example, we'll be using the Ovarian cancer TCGA dataset
cancer_type = 'TCGA-OV'

# gene expression data will be pulled from this BigQuery project
bq_project = 'isb-cgc-bq'



## BigQuery Client

Create the BigQuery client. Queries run through this client are billed to `google_project`.

In [4]:
# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)

## Get Gene Expression Data from BigQuery Table

Pull RNA-seq gene expression data from the TCGA RNA-seq BigQuery table and join it with the clinical data table to create a labeled data frame. In this example, we will label the samples based on primary therapy outcome. "Complete Remission/Response" will be labeled as "1", while the other recorded outcomes ("Partial Remission/Response", "Progressive Disease", and "Stable Disease") will be labeled as "0". Cases with no recorded outcome are excluded. This prepares the data for building a binary classifier.

In [5]:
ge_data = client.query(("""
  SELECT
    ge.case_barcode AS sample,
    labels.response_label AS label,
    ge.gene_name AS gene_name,
    -- Multiple samples may exist per case, take the max value
    MAX(LOG(ge.HTSeq__FPKM_UQ+1)) AS gene_expression
  FROM `{}.TCGA.RNAseq_hg38_gdc_current` AS ge
  INNER JOIN (
    SELECT
      *
    FROM (
      SELECT
        case_barcode,
        primary_therapy_outcome_success,
        CASE
          -- Complete Response   --> label as 1
          -- All other responses --> label as 0
          WHEN primary_therapy_outcome_success = 'Complete Remission/Response' THEN 1
          WHEN (
            primary_therapy_outcome_success IN (
              'Partial Remission/Response','Progressive Disease','Stable Disease'
            )
          ) THEN 0
        END AS response_label
        FROM `{}.TCGA_versioned.clinical_gdc_2019_06`
        WHERE
          project_short_name = '{}'
          AND primary_therapy_outcome_success IS NOT NULL
    )
  ) labels
  ON labels.case_barcode = ge.case_barcode
  WHERE gene_name IN ( -- 33 Gene signature, leave out PRSS2 (aka TRYP2)
    'RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1',
    'PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11',
    'OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT',
    'COPZ2','TRIO','PDPR'
  )
  GROUP BY sample, label, gene_name
""").format(bq_project, bq_project, cancer_type)).result().to_dataframe()

print(ge_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8712 entries, 0 to 8711
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   sample           8712 non-null   object 
 1   label            8712 non-null   int64  
 2   gene_name        8712 non-null   object 
 3   gene_expression  8712 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 272.4+ KB
None
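
To make the query logic easier to follow, here is a minimal Pandas sketch of the same two steps: mapping therapy outcomes to binary labels, and taking the maximum log-transformed expression value per case and gene. The toy rows, the `fpkm_uq` column name, and the `outcome_to_label` dict are hypothetical stand-ins for illustration and are not pulled from BigQuery.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the joined BigQuery tables.
# case-A has two samples for the same gene, so the max is taken.
toy = pd.DataFrame({
    'case_barcode': ['case-A', 'case-A', 'case-B', 'case-C'],
    'gene_name': ['FAP', 'FAP', 'FAP', 'FAP'],
    'fpkm_uq': [10.0, 30.0, 5.0, 0.0],
    'primary_therapy_outcome_success': [
        'Complete Remission/Response',
        'Complete Remission/Response',
        'Progressive Disease',
        'Stable Disease',
    ],
})

# Same rule as the CASE expression: complete response -> 1,
# the three other listed outcomes -> 0
outcome_to_label = {
    'Complete Remission/Response': 1,
    'Partial Remission/Response': 0,
    'Progressive Disease': 0,
    'Stable Disease': 0,
}
toy['label'] = toy['primary_therapy_outcome_success'].map(outcome_to_label)

# Natural log of (value + 1), like LOG(x + 1) in the query
toy['gene_expression'] = np.log1p(toy['fpkm_uq'])

# One row per case/label/gene, keeping the maximum expression value
toy_long = (
    toy.groupby(['case_barcode', 'label', 'gene_name'], as_index=False)['gene_expression']
       .max()
       .rename(columns={'case_barcode': 'sample'})
)
print(toy_long)
```

The result has the same four columns as `ge_data`, in "long" format with one row per sample/gene combination.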


## Reshape the Data

The data pulled from BigQuery is formatted such that each row corresponds to a sample/gene combination. However, to use the data with scikit-learn to create a prediction model, we'll need to reshape the data such that each row corresponds to a sample and each column corresponds to a gene. We'll use Pandas to pivot the data as follows.

In [6]:
# pivot from long format (one row per sample/gene) to wide format (one row per sample)
ge_data_pivot = ge_data.pivot(
    index=['sample', 'label'], columns='gene_name', values='gene_expression'
).reset_index(level=['sample', 'label'])
print(ge_data_pivot.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 35 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   sample   264 non-null    object 
 1   label    264 non-null    int64  
 2   CCPG1    264 non-null    float64
 3   COPZ2    264 non-null    float64
 4   DPT      264 non-null    float64
 5   DROSHA   264 non-null    float64
 6   ENTPD5   264 non-null    float64
 7   FAP      264 non-null    float64
 8   GCNT1    264 non-null    float64
 9   KLRAP1   264 non-null    float64
 10  LRRC8D   264 non-null    float64
 11  MATK     264 non-null    float64
 12  MCM5     264 non-null    float64
 13  MLX      264 non-null    float64
 14  MTAP     264 non-null    float64
 15  MYO7A    264 non-null    float64
 16  NUAK1    264 non-null    float64
 17  OLFML3   264 non-null    float64
 18  PDGFD    264 non-null    float64
 19  PDPR     264 non-null    float64
 20  POLE     264 non-null    float64
 21  PRSS1    264 non
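
The reshaping step can be illustrated on a small scale. The sketch below pivots a hypothetical two-sample, two-gene table (the values are made up for illustration) from long to wide format. It assumes a Pandas version that accepts a list for the `index` argument of `pivot`.

```python
import pandas as pd

# Hypothetical long-format data: one row per sample/gene combination
toy_long = pd.DataFrame({
    'sample': ['s1', 's1', 's2', 's2'],
    'label': [1, 1, 0, 0],
    'gene_name': ['FAP', 'MLX', 'FAP', 'MLX'],
    'gene_expression': [2.0, 3.5, 1.0, 4.2],
})

# Wide format: one row per sample, one column per gene
toy_wide = (
    toy_long.pivot(index=['sample', 'label'], columns='gene_name', values='gene_expression')
            .reset_index()
)

# Drop the leftover 'gene_name' label on the column axis for cleaner printing
toy_wide.columns.name = None
print(toy_wide)
```

Each gene becomes a feature column, which is the layout scikit-learn expects for its input matrix.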

## Prepare the Data for Prediction Modeling

Prepare the data by splitting it into training and testing sets, and scaling the data. It is important that prediction models are tested on samples that are independent from the training samples in order to accurately estimate performance. 

In [7]:
# remove the sample names column from the data frame
ge_data_pivot_nosample = ge_data_pivot.drop(labels='sample',axis=1)

# split data into train and test sets, 50% in train and 50% in test. 
# The "random_state" variable can be used to reproduce the split
train_data = ge_data_pivot_nosample.sample(frac=0.5, random_state=1).sort_index()

# the test data is what remains after removing the train data
test_data = ge_data_pivot_nosample.drop(train_data.index)

# store the data in a dict for easy access
data = dict()
data['train_y'] = train_data.pop('label')
data['test_y'] = test_data.pop('label')

# using scikit-learn, scale the data to 0 mean and unit variance. This is
# required for some machine learning methods.
scaler = StandardScaler()

# fit the scaler on the training data only, then apply the same transform
# to the test data; store the scaled data in the dict
data['train_x'] = scaler.fit_transform(train_data)
data['test_x'] = scaler.transform(test_data)
data['scaler'] = scaler
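
As a quick sanity check on this pattern, the following sketch repeats the split-and-scale steps on synthetic data. The toy data frame and its column names (`g1`, `g2`, `g3`) are assumptions made for illustration. It shows that the train and test sets do not overlap, and that only the training features end up with exactly zero mean and unit variance, because the scaler never sees the test samples when it is fit.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pivoted expression data (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(20, 3)),
                   columns=['g1', 'g2', 'g3'])
toy['label'] = [0, 1] * 10

# Same split pattern as above: sample half for training, the rest is test
toy_train = toy.sample(frac=0.5, random_state=1).sort_index()
toy_test = toy.drop(toy_train.index)

train_y = toy_train.pop('label')
test_y = toy_test.pop('label')

# Fit on the training features only, then reuse the fitted scaler on test
toy_scaler = StandardScaler()
train_x = toy_scaler.fit_transform(toy_train)
test_x = toy_scaler.transform(toy_test)

print('train mean:', train_x.mean(axis=0))  # approximately 0 for each column
print('train std: ', train_x.std(axis=0))   # approximately 1 for each column
print('test mean: ', test_x.mean(axis=0))   # generally close to, but not exactly, 0
```

Note that `DataFrame.sample` does not stratify by label. If the classes are imbalanced, a stratified split (for example with scikit-learn's `train_test_split` and its `stratify` argument) may be worth considering.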


## Train and Test the Prediction Model

We will use a simple logistic regression classifier implemented by scikit-learn. More information about the classifier can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).  

In [8]:
# train a logistic regression model
lr = LogisticRegression(max_iter=1000)
lr.fit(data['train_x'], data['train_y'])

# predict samples in the test set
pred = lr.decision_function(data['test_x'])

# calculate the ROC curve and AUC in order to gauge prediction performance
fpr, tpr, thresholds = metrics.roc_curve(data['test_y'], pred)
auc = metrics.auc(fpr, tpr)

print('Prediction Performance (AUC):', auc)

auc: 0.741892254087376 fpr: [0.         0.         0.         0.02439024 0.02439024 0.04878049
 0.04878049 0.07317073 0.07317073 0.09756098 0.09756098 0.12195122
 0.12195122 0.14634146 0.14634146 0.17073171 0.17073171 0.19512195
 0.19512195 0.2195122  0.2195122  0.24390244 0.24390244 0.26829268
 0.26829268 0.29268293 0.29268293 0.34146341 0.34146341 0.36585366
 0.36585366 0.3902439  0.3902439  0.41463415 0.41463415 0.43902439
 0.43902439 0.48780488 0.48780488 0.53658537 0.53658537 0.58536585
 0.58536585 0.6097561  0.6097561  0.65853659 0.65853659 0.70731707
 0.70731707 0.73170732 0.73170732 0.95121951 0.95121951 0.97560976
 0.97560976 1.        ] tpr: [0.         0.01098901 0.17582418 0.17582418 0.25274725 0.25274725
 0.3956044  0.3956044  0.41758242 0.41758242 0.42857143 0.42857143
 0.43956044 0.43956044 0.48351648 0.48351648 0.49450549 0.49450549
 0.51648352 0.51648352 0.53846154 0.53846154 0.61538462 0.61538462
 0.62637363 0.62637363 0.63736264 0.63736264 0.65
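
For reference, the AUC computed from the ROC curve points should match scikit-learn's `roc_auc_score` helper. The sketch below checks this on synthetic data generated with `make_classification`. The data, the split, and the variable names are assumptions for illustration and are unrelated to the TCGA results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Synthetic binary classification data (not gene expression data)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=0)

# Simple 50/50 split: first half for training, second half for testing
X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

# Train the classifier and compute continuous decision scores for the test set
toy_lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = toy_lr.decision_function(X_test)

# AUC via the ROC curve points (as in the cell above) and via the direct helper
fpr, tpr, thresholds = metrics.roc_curve(y_test, scores)
auc_from_curve = metrics.auc(fpr, tpr)
auc_direct = metrics.roc_auc_score(y_test, scores)

print('AUC from ROC curve:', auc_from_curve)
print('AUC from roc_auc_score:', auc_direct)
```

Using `decision_function` rather than `predict` matters here: the ROC curve is traced by sweeping a threshold over continuous scores, so hard 0/1 predictions would collapse it to a single operating point.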