<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/MachineLearning/How_to_build_an_RNAseq_logistic_regression_classifier_with_BigQuery_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to build an RNA-seq logistic regression classifier with BigQuery ML
Check out other notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

- **Title:** How to build an RNA-seq logistic regression classifier with BigQuery ML
- **Author:** John Phan
- **Created:** 2021-07-19
- **Purpose:** Demonstrate use of BigQuery ML to predict a cancer endpoint using gene expression data.
- **URL:** https://github.com/isb-cgc/Community-Notebooks/blob/master/MachineLearning/How_to_build_an_RNAseq_logistic_regression_classifier_with_BigQuery_ML.ipynb
- **Note:** This example is based on the work published by [Bosquet et al.](https://molecular-cancer.biomedcentral.com/articles/10.1186/s12943-016-0548-9)


This notebook builds upon the [scikit-learn notebook](https://github.com/isb-cgc/Community-Notebooks/blob/master/MachineLearning/How_to_build_an_RNAseq_logistic_regression_classifier.ipynb) and demonstrates how to build a machine learning model using BigQuery ML to predict ovarian cancer treatment outcome. BigQuery is used to create a temporary data table that contains both training and testing data. These datasets are then used to fit and evaluate a Logistic Regression classifier. 

# Import Dependencies

In [None]:
# GCP libraries
from google.cloud import bigquery
from google.colab import auth

## Authenticate

Before using BigQuery, we need to authorize access to BigQuery and Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods can be found [here](https://googleapis.dev/python/google-api-core/latest/auth.html).

In [None]:
# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()

# alternatively, use the gcloud SDK
#!gcloud auth application-default login

## Parameters

Customize the following parameters based on your notebook, execution environment, or project. BigQuery ML must create and store classification models, so be sure that you have write access to the locations stored in the "bq_dataset" and "bq_project" variables. 

In [None]:
# set the google project that will be billed for this notebook's computations
google_project = 'google-project' ## CHANGE ME

# bq project for storing ML model
bq_project = 'bq-project' ## CHANGE ME

# bq dataset for storing ML model
bq_dataset = 'scratch' ## CHANGE ME

# name of temporary table for data
bq_tmp_table = 'tmp_data'

# name of ML model
bq_ml_model = 'tcga_ov_therapy_ml_lr_model'

# in this example, we'll be using the Ovarian cancer TCGA dataset
cancer_type = 'TCGA-OV'

# genes used for prediction model, taken from Bosquet et al.
genes = "'RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1',\
'PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11',\
'OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT',\
'COPZ2','TRIO','PDPR'"

# clinical data table
clinical_table = 'isb-cgc-bq.TCGA_versioned.clinical_gdc_2019_06'

# RNA seq data table
rnaseq_table = 'isb-cgc-bq.TCGA.RNAseq_hg38_gdc_current'


## BigQuery Client

Create the BigQuery client.

In [None]:
# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)

## Create a Table with a Subset of the Gene Expression Data

Pull RNA-seq gene expression data from the TCGA RNA-seq BigQuery table, join it with clinical labels, and pivot the table so that it can be used with BigQuery ML. In this example, we will label the samples based on therapy outcome. "Complete Remission/Response" will be labeled as "1" while all other therapy outcomes will be labeled as "0". This prepares the data for binary classification. 
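
To make the labeling and pivoting steps concrete, here is a minimal pure-Python sketch of the same idea on a handful of made-up rows. The sample names, expression values, and the helper names `outcome_to_label` and `pivot_long_to_wide` are hypothetical and are not part of the notebook; the actual work is done in BigQuery by the query in this section.

```python
from collections import defaultdict

# Hypothetical long-format rows: (sample, therapy outcome, gene name, expression)
long_rows = [
    ("case-A", "Complete Remission/Response", "RHOT1", 11.2),
    ("case-A", "Complete Remission/Response", "MYO7A", 10.6),
    ("case-B", "Progressive Disease", "RHOT1", 11.9),
    ("case-B", "Progressive Disease", "MYO7A", 9.6),
]


def outcome_to_label(outcome):
    """Map a therapy outcome to the binary label: complete response -> 1, otherwise 0."""
    return 1 if outcome == "Complete Remission/Response" else 0


def pivot_long_to_wide(rows):
    """Turn one row per (sample, gene) into one record per sample with a column per gene."""
    wide = defaultdict(dict)
    for sample, outcome, gene, value in rows:
        wide[sample]["label"] = outcome_to_label(outcome)
        # If a gene appears more than once for a sample, keep the maximum value
        wide[sample][gene] = max(value, wide[sample].get(gene, value))
    return dict(wide)


wide_rows = pivot_long_to_wide(long_rows)
print(wide_rows)
```

Each sample ends up as a single record with one label and one value per gene, which is the shape BigQuery ML expects for training.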

Prediction modeling with RNA-seq data typically requires a feature selection step to reduce the dimensionality of the data before training a classifier. However, to simplify this example, we will use a pre-identified set of 33 genes (Bosquet et al. identified 34 genes, but PRSS2 and its aliases are not available in the hg38 RNA-seq data). 

Creating a BQ table with only the data of interest reduces the size of the data passed to BQ ML and can significantly reduce the cost of running BQ ML queries. This query also splits the dataset pseudo-randomly into "training" and "testing" sets of roughly equal size using the "FARM_FINGERPRINT" hash function in BigQuery. "FARM_FINGERPRINT" generates an integer from the input string, so a given case barcode is always assigned to the same partition and the split is reproducible across runs. More information can be found [here](https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions).
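
The hash-based split can be illustrated outside of BigQuery with a short sketch. Python's standard library has no FarmHash implementation, so SHA-256 is used here purely as a stand-in (an assumption for illustration); the partitions it assigns will not match those produced by "FARM_FINGERPRINT". The function name `assign_partition` and its parameters are hypothetical.

```python
import hashlib


def assign_partition(case_barcode, train_buckets=5, n_buckets=10):
    """Deterministically assign a case to 'training' or 'testing' from a hash of its barcode."""
    # Stand-in hash (assumption): BigQuery uses FARM_FINGERPRINT, not SHA-256
    digest = hashlib.sha256(case_barcode.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a non-negative integer and reduce it to a bucket
    bucket = int.from_bytes(digest[:8], "big") % n_buckets
    return "training" if bucket < train_buckets else "testing"


for barcode in ["TCGA-25-2399", "TCGA-36-1569", "TCGA-24-1418"]:
    print(barcode, assign_partition(barcode))
```

Because the partition depends only on the barcode, rerunning the split never moves a case between partitions, and all samples from one case land in the same partition.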

In [None]:
tmp_table_query = client.query(("""
  BEGIN
  CREATE OR REPLACE TABLE `{bq_project}.{bq_dataset}.{bq_tmp_table}` AS
  SELECT * FROM (
    SELECT
      labels.case_barcode as sample,
      labels.data_partition as data_partition,
      labels.response_label AS label,
      ge.gene_name AS gene_name,
      -- Multiple samples may exist per case, take the max value
      MAX(LOG(ge.HTSeq__FPKM_UQ+1)) AS gene_expression
    FROM `{rnaseq_table}` AS ge
    INNER JOIN (
      SELECT
        *
      FROM (
        SELECT
          case_barcode,
          primary_therapy_outcome_success,
          CASE
            -- Complete Response   --> label as 1
            -- All other responses --> label as 0
            WHEN primary_therapy_outcome_success = 'Complete Remission/Response' THEN 1
            WHEN (primary_therapy_outcome_success IN (
              'Partial Remission/Response','Progressive Disease','Stable Disease'
            )) THEN 0
          END AS response_label,
          CASE 
            WHEN MOD(ABS(FARM_FINGERPRINT(case_barcode)), 10) < 5 THEN 'training'
            WHEN MOD(ABS(FARM_FINGERPRINT(case_barcode)), 10) >= 5 THEN 'testing'
          END AS data_partition
          FROM `{clinical_table}`
          WHERE
            project_short_name = '{cancer_type}'
            AND primary_therapy_outcome_success IS NOT NULL
      )
    ) labels
    ON labels.case_barcode = ge.case_barcode
    WHERE gene_name IN ({genes})
    GROUP BY sample, label, data_partition, gene_name
  )
  PIVOT (
    MAX(gene_expression) FOR gene_name IN ({genes})
  );
  END;
""").format(
  bq_project=bq_project,
  bq_dataset=bq_dataset,
  bq_tmp_table=bq_tmp_table,
  rnaseq_table=rnaseq_table,
  clinical_table=clinical_table,
  cancer_type=cancer_type,
  genes=genes
)).result()

print(tmp_table_query)

<google.cloud.bigquery.table._EmptyRowIterator object at 0x7f3894001250>


Let's take a look at this subset table. The data has been pivoted such that each of the 33 genes is available as a column that can be "SELECTED" in a query. In addition, the "label" and "data_partition" columns simplify data handling for classifier training and evaluation.  

In [None]:
tmp_table_data = client.query(("""
  SELECT
    * --usually not recommended to use *, but in this case, we want to see all of the 33 genes
  FROM `{bq_project}.{bq_dataset}.{bq_tmp_table}`
""").format(
    bq_project=bq_project,
    bq_dataset=bq_dataset,
    bq_tmp_table=bq_tmp_table
)).result().to_dataframe()

print(tmp_table_data.info())
tmp_table_data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 36 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sample          264 non-null    object 
 1   data_partition  264 non-null    object 
 2   label           264 non-null    int64  
 3   RHOT1           264 non-null    float64
 4   MYO7A           264 non-null    float64
 5   ZBTB10          264 non-null    float64
 6   MATK            264 non-null    float64
 7   ST18            264 non-null    float64
 8   RPS23           264 non-null    float64
 9   GCNT1           264 non-null    float64
 10  DROSHA          264 non-null    float64
 11  NUAK1           264 non-null    float64
 12  CCPG1           264 non-null    float64
 13  PDGFD           264 non-null    float64
 14  KLRAP1          264 non-null    float64
 15  MTAP            264 non-null    float64
 16  RNF13           264 non-null    float64
 17  THBS1           264 non-null    flo

Unnamed: 0,sample,data_partition,label,RHOT1,MYO7A,ZBTB10,MATK,ST18,RPS23,GCNT1,DROSHA,NUAK1,CCPG1,PDGFD,KLRAP1,MTAP,RNF13,THBS1,MLX,FAP,TIMP3,PRSS1,SLC7A11,OLFML3,RPS20,MCM5,POLE,STEAP4,LRRC8D,WBP1L,ENTPD5,SYNE1,DPT,COPZ2,TRIO,PDPR
0,TCGA-25-2399,testing,0,11.218927,10.593754,11.803524,9.698836,5.956653,14.401808,10.929323,12.712570,11.296218,10.285688,10.772224,10.099598,11.180881,12.446402,13.861651,12.302866,11.066391,11.269558,9.734049,11.243680,13.520391,15.814351,12.147187,11.190978,9.587428,12.406302,12.714173,11.296678,9.197123,10.326844,11.506841,12.260439,11.589751
1,TCGA-36-1569,testing,0,11.878180,9.587484,12.164811,11.734965,5.994159,15.347050,10.687141,12.119207,12.033689,10.596642,12.058906,9.588679,10.708653,12.645890,13.552602,12.633329,11.461555,12.872430,10.501176,10.647165,14.032178,16.020236,12.035344,11.001714,9.580656,11.963750,13.782162,10.770550,9.732567,11.786655,12.154511,12.212926,11.787660
2,TCGA-25-1316,testing,0,11.617410,9.906155,11.821766,8.731946,5.019471,16.034114,10.456743,12.487554,10.511889,10.506308,12.279526,9.630796,10.804051,12.414362,11.633886,12.054653,8.549910,11.698269,6.389008,10.339602,13.493717,16.862480,12.713662,11.535904,7.052493,11.656133,13.290609,10.595292,9.338195,9.510685,10.488821,11.567196,11.550038
3,TCGA-23-1109,testing,0,11.235278,9.680673,12.364107,9.137896,5.365037,14.565009,10.257964,12.090694,11.171379,9.682388,11.533465,10.434751,10.749742,12.543119,13.067357,12.133442,10.446814,12.173429,11.444376,10.210248,12.848341,16.704705,12.694445,11.392936,10.073326,12.297649,12.804440,10.436196,9.563981,10.485045,12.222601,11.504207,11.931388
4,TCGA-24-2293,testing,0,11.517847,10.202071,11.412929,9.605096,6.916302,14.294266,11.047738,12.449760,13.146952,10.346396,11.989061,10.152489,10.698240,12.501491,14.640638,12.359135,12.150775,12.036429,9.374763,8.660519,14.609033,15.541052,12.234289,11.034500,10.918893,12.176423,13.718112,10.863193,10.201102,11.201539,12.935112,12.101972,11.384756
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,TCGA-24-1418,training,1,11.179123,10.030710,11.754480,9.890963,5.053992,14.991673,10.565480,12.558499,12.146200,9.668121,10.706535,10.394506,10.503950,12.334730,14.210613,12.460933,11.495057,12.223316,13.291306,9.452475,13.491247,16.061199,12.215986,10.854387,9.764404,12.818093,13.071414,10.605887,9.015512,9.149569,11.744812,12.078673,11.271800
260,TCGA-36-1576,training,1,11.655380,10.032751,12.059854,10.290936,5.240370,14.219227,10.592819,12.357946,11.837887,10.792815,11.876288,9.877489,10.436339,13.053486,14.257058,12.432225,12.485092,13.489722,11.128625,10.160496,13.924804,15.779136,12.684473,11.004037,9.652130,12.400591,13.213521,11.858550,9.651770,11.846926,12.634752,11.738259,11.254909
261,TCGA-24-2290,training,1,11.517504,10.191245,12.552382,11.060430,6.843453,14.193186,11.223297,12.373880,9.943035,10.125776,10.684256,9.115061,11.350987,12.533298,12.417001,12.476885,9.791050,10.038397,13.737365,10.590630,12.815252,15.743162,12.895449,11.982601,9.504284,13.286975,12.378044,11.070276,8.722501,7.861902,10.515498,11.807454,11.819364
262,TCGA-24-1563,training,1,11.595700,9.381507,12.383145,9.477424,6.052465,14.900644,11.146321,12.339813,12.194963,10.752415,12.020596,10.462295,10.992227,12.784387,14.123139,12.431182,12.406281,12.340340,8.871656,10.151962,13.952942,15.877573,12.726649,11.336160,9.163516,12.527071,12.865786,11.090034,9.497262,11.940572,12.592379,11.515596,10.891493
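
Before training, it can be useful to check how many samples fall into each partition and class. The following is a minimal sketch that counts records by "data_partition" and "label"; the example records are made up, and the helper name `count_by_partition_and_label` is hypothetical. With the DataFrame from the cell above, the records could be obtained with `tmp_table_data.to_dict("records")`.

```python
from collections import Counter


def count_by_partition_and_label(records):
    """Count records for every (data_partition, label) combination."""
    return Counter((r["data_partition"], r["label"]) for r in records)


# Hypothetical records with the same column names as the subset table
example_records = [
    {"sample": "case-A", "data_partition": "training", "label": 1},
    {"sample": "case-B", "data_partition": "training", "label": 0},
    {"sample": "case-C", "data_partition": "training", "label": 1},
    {"sample": "case-D", "data_partition": "testing", "label": 0},
]

counts = count_by_partition_and_label(example_records)
for (partition, label), n in sorted(counts.items()):
    print(partition, label, n)
```

A strongly imbalanced count in the training partition is one reason to enable class weighting when training the model.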


# Train the Machine Learning Model

Now we can train a classifier using BigQuery ML with the data stored in the subset table. The model will be stored in the project and dataset specified by the "bq_project" and "bq_dataset" variables, under the name given by the "bq_ml_model" variable, and can be reused to predict samples in the future.

We pass three options to the BQ ML model: model_type, auto_class_weights, and input_label_cols. Model_type specifies the classifier model type. In this case, we use "LOGISTIC_REG" to train a logistic regression classifier. Other classifier options are documented [here](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create). Auto_class_weights indicates whether samples should be weighted to balance the classes. For example, if the dataset has more samples labeled "Complete Response", those samples are given less weight so that the model is not biased towards predicting that class. Input_label_cols tells BigQuery that the "label" column should be used to determine each sample's label.
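
To illustrate what class balancing does, the sketch below computes weights in inverse proportion to class frequency. The normalization used here is the common "balanced" heuristic, n_samples / (n_classes * class_count); this is an assumption for illustration, and the exact formula BigQuery ML applies internally may differ. The function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class in inverse proportion to how often it occurs."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Assumed normalization: every class contributes the same total weight
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Three samples of class 1 and one sample of class 0
weights = balanced_class_weights([1, 1, 1, 0])
print(weights)
```

With these weights, the three class-1 samples together carry the same total weight as the single class-0 sample, so neither class dominates the training loss.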

**Warning**: BigQuery ML models can be very time-consuming and expensive to train. Please check your data size before running BigQuery ML commands. Information about BigQuery ML costs can be found [here](https://cloud.google.com/bigquery-ml/pricing).

In [None]:
# create ML model using BigQuery
ml_model_query = client.query(("""
  CREATE OR REPLACE MODEL `{bq_project}.{bq_dataset}.{bq_ml_model}`
  OPTIONS
    (
      model_type='LOGISTIC_REG',
      auto_class_weights=TRUE,
      input_label_cols=['label']
    ) AS
  SELECT * EXCEPT(sample, data_partition)  -- when training, we only need the label and feature columns
  FROM `{bq_project}.{bq_dataset}.{bq_tmp_table}`
  WHERE data_partition = 'training' -- using training data only
""").format(
  bq_project=bq_project,
  bq_dataset=bq_dataset,
  bq_ml_model=bq_ml_model,
  bq_tmp_table=bq_tmp_table
)).result()
print(ml_model_query)

# now get the model metadata
ml_model = client.get_model('{}.{}.{}'.format(bq_project, bq_dataset, bq_ml_model))
print(ml_model)

<google.cloud.bigquery.table._EmptyRowIterator object at 0x7f3893663810>
Model(reference=ModelReference(project='isb-project-zero', dataset_id='jhp_scratch', project_id='tcga_ov_therapy_ml_lr_model'))


# Evaluate the Machine Learning Model
Once the model has been trained and stored, we can evaluate the model's performance using the "testing" dataset from our subset table. Evaluating a BQ ML model is generally less expensive than training. 

Use the following query to evaluate the BQ ML model. Note that we're using the "data_partition = 'testing'" clause to ensure that we're only evaluating the model with test samples from the subset table.  

BigQuery's ML.EVALUATE function returns several performance metrics: precision, recall, accuracy, f1_score, log_loss, and roc_auc. More details about these performance metrics are available from [Google's ML Crash Course](https://developers.google.com/machine-learning/crash-course/classification/video-lecture). Specific topics can be found at the following URLs: [precision and recall](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall), [accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy), [ROC and AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc). 
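
As a reminder of how the threshold-based metrics relate to one another, here is a minimal sketch that computes precision, recall, accuracy, and F1 score from true and predicted binary labels. The labels in the example are made up, and the helper name `binary_metrics` is hypothetical; log_loss and roc_auc depend on predicted probabilities and are not covered here.

```python
def binary_metrics(true_labels, predicted_labels):
    """Compute precision, recall, accuracy, and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1_score": f1}


# Hypothetical labels: 2 true positives, 1 false negative, 1 false positive, 1 true negative
metrics = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(metrics)
```

The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.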

In [None]:
ml_eval = client.query(("""
SELECT * FROM ML.EVALUATE (MODEL `{bq_project}.{bq_dataset}.{bq_ml_model}`, 
  (
    SELECT * EXCEPT(sample, data_partition)
    FROM `{bq_project}.{bq_dataset}.{bq_tmp_table}`
    WHERE data_partition = 'testing'
  )
)
""").format(
  bq_project=bq_project,
  bq_dataset=bq_dataset,
  bq_ml_model=bq_ml_model,
  bq_tmp_table=bq_tmp_table
)).result().to_dataframe()

In [None]:
# Display the table of evaluation results
ml_eval

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.753425,0.639535,0.623077,0.691824,0.681854,0.680869


# Predict Outcome for One or More Samples
ML.EVALUATE evaluates a model's performance, but does not produce actual predictions for each sample. To obtain per-sample predictions, we need to use the ML.PREDICT function. The syntax is similar to that of the ML.EVALUATE function and returns "label", "predicted_label", "predicted_label_probs", and all feature columns. Since the feature columns are unchanged from the input dataset, we select only the original label, predicted label, and probabilities for each sample.

Note that the input dataset can include one or more samples, and must include the same set of features as the training dataset. 
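
Each "predicted_label_probs" value is a list of label/probability entries, one per class. The sketch below shows one way to pull out the probability of the positive class from such a list. The probabilities in the example are made up, and the helper name `positive_class_probability` is hypothetical. With a prediction DataFrame such as the one produced in the next cell, it could be applied with `ml_predict['predicted_label_probs'].apply(positive_class_probability)`, assuming each entry behaves like a dictionary with "label" and "prob" keys.

```python
def positive_class_probability(label_probs, positive_label=1):
    """Return the predicted probability for the positive class from a list of label/prob entries."""
    for entry in label_probs:
        if entry["label"] == positive_label:
            return entry["prob"]
    raise ValueError("positive label {!r} not found".format(positive_label))


# Hypothetical entry in the same shape as a predicted_label_probs value
example_probs = [{"label": 1, "prob": 0.88}, {"label": 0, "prob": 0.12}]

prob_complete_response = positive_class_probability(example_probs)
print(prob_complete_response)
```

Having a single probability per sample makes it straightforward to apply a decision threshold other than the default or to plot a ROC curve.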

In [None]:
ml_predict = client.query(("""
SELECT
  label,
  predicted_label,
  predicted_label_probs
FROM ML.PREDICT (MODEL `{bq_project}.{bq_dataset}.{bq_ml_model}`, 
  (
    SELECT * EXCEPT(sample, data_partition)
    FROM `{bq_project}.{bq_dataset}.{bq_tmp_table}`
    WHERE data_partition = 'testing' -- Use the testing dataset
  )
)
""").format(
  bq_project=bq_project,
  bq_dataset=bq_dataset,
  bq_ml_model=bq_ml_model,
  bq_tmp_table=bq_tmp_table
)).result().to_dataframe()

In [None]:
# Display the table of prediction results
ml_predict

Unnamed: 0,label,predicted_label,predicted_label_probs
0,0,1,"[{'label': 1, 'prob': 0.8768811208292162}, {'l..."
1,0,0,"[{'label': 1, 'prob': 0.10363196036093822}, {'..."
2,0,0,"[{'label': 1, 'prob': 0.21867379546674287}, {'..."
3,0,1,"[{'label': 1, 'prob': 0.8001977008015101}, {'l..."
4,0,0,"[{'label': 1, 'prob': 0.08631178853125597}, {'..."
...,...,...,...
125,1,0,"[{'label': 1, 'prob': 0.37733292631578663}, {'..."
126,1,1,"[{'label': 1, 'prob': 0.5500701258874308}, {'l..."
127,1,0,"[{'label': 1, 'prob': 0.1872213285324457}, {'l..."
128,1,0,"[{'label': 1, 'prob': 0.031385329936844904}, {..."


In [None]:
# Calculate the accuracy of prediction, which should match the result of ML.EVALUATE
accuracy = 1-sum(abs(ml_predict['label']-ml_predict['predicted_label']))/len(ml_predict)
print('Accuracy: ', accuracy)

Accuracy:  0.6230769230769231


# Next Steps
The BigQuery ML logistic regression model trained in this notebook is comparable to the scikit-learn model developed in our [companion notebook](https://github.com/isb-cgc/Community-Notebooks/blob/master/MachineLearning/How_to_build_an_RNAseq_logistic_regression_classifier.ipynb). BigQuery ML simplifies the model building and evaluation process by enabling bioinformaticians to use machine learning within the BigQuery ecosystem. However, it is often necessary to optimize performance by evaluating several types of models (e.g., models other than logistic regression) and tuning model parameters. Because training with BigQuery ML incurs costs, such iterative model fine-tuning may be cost-prohibitive. In such cases, a combination of scikit-learn (or other libraries such as Keras and TensorFlow) and BigQuery ML may be appropriate. For example, models can be fine-tuned using scikit-learn and then published as a BigQuery ML model for production applications. In future notebooks, we will explore methods for model selection, optimization, and publication with BigQuery ML.