<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/MitelmanDB/Mitelman_Fusions_In_TCGA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mitelman Gene Fusions in TCGA

Check out other notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

```
Title: Mitelman Gene Fusions in TCGA  
Author: Jacob Wilson  
Created: 2023-11-14  
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/MitelmanDB/Mitelman_Fusions_In_TCGA.ipynb  
Purpose: We will explore gene fusions in the Mitelman database using BigQuery. With a few basic queries we can select the most common gene fusions specific to any disease type present in the Mitelman database. For demonstration in this notebook, we will be looking at gene fusions associated with Prostate adenocarcinomas. After creating a list of relevant genes, we will obtain gene expression data from TCGA and build a machine learning model to predict the Primary Gleason Grade.
```

## Initialize Notebook Environment

Before beginning, we first need to load dependencies and authenticate to BigQuery.

## Import Dependencies

In [30]:
# GCP Libraries
from google.cloud import bigquery
from google.colab import auth

from itertools import chain

## Authenticate

To use BigQuery, we must first authenticate to Google Cloud.

In [31]:
# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()

# alternatively, use the gcloud SDK
#!gcloud auth application-default login

## Google project ID

Set your own Google project ID for use with this notebook.

In [52]:
# set the google project that will be billed for this notebook's computations
google_project = 'your_project_id'  ## change this

# set the google project that will be used to store the model and temp table
ML_project = 'your_model_project'  ## change this

# set the dataset for the temporary data table and machine learning model
ML_data = 'your_data_table'  ## change this

## BigQuery Client

In [42]:
# Initialize a client to access the data within BigQuery
if google_project == 'your_project_id':
    print('Please update the project ID with your Google Cloud Project')
else:
    client = bigquery.Client(google_project)

if ML_project == 'your_model_project':
    print('Please update the model project ID with your Google Cloud Project')
else:
    model_client = bigquery.Client(ML_project)

# set the Mitelman Database project
mitel_proj = 'mitelman-db'
mitel_data = 'prod'

# set the TCGA project
TCGA_proj = 'isb-cgc-bq'
TCGA_data = 'TCGA'
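
The cell above only prints a reminder when a placeholder ID is left unchanged, so later cells would then fail with a `NameError` on `client` or `model_client`. As a minimal sketch, the check could stop execution immediately instead. The helper name `require_configured` and the example project ID are hypothetical and not part of the original notebook.

```python
def require_configured(value: str, placeholder: str, name: str) -> str:
    """Return value unchanged, or raise if it is empty or still the placeholder."""
    if not value or value == placeholder:
        raise ValueError(f"Please set {name} to your own Google Cloud value")
    return value


# Example with a made-up project ID (an assumption for illustration only)
checked_project = require_configured('my-example-project', 'your_project_id', 'google_project')
print(checked_project)
```

The same call could wrap `google_project`, `ML_project`, and `ML_data` before any client is created.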

## Exploring Fusions in the Mitelman DB

We will begin by exploring the disease types and gene fusions that are present in the Mitelman database. The database uses a coding system in which each disease morphology and topography is represented by a unique value stored in the **Koder** BigQuery table. Other BigQuery tables that will be used include **MolBiolClinAssoc**, which contains the morphology and topography in addition to gene fusion and karyotype data, and **MolClinGene**, which details the gene fusions. Using the following queries, we can obtain the codes relevant to our disease of interest.

In [43]:
# query to see all disease types by morphology
query_morph = f'''
SELECT m.Morph, k.Benamning
FROM `{mitel_proj}.{mitel_data}.MolBiolClinAssoc` m
JOIN `{mitel_proj}.{mitel_data}.Koder` k
  ON k.Kod = m.Morph AND k.kodTyp = "MORPH"
GROUP BY m.Morph, k.Benamning
ORDER BY k.Benamning ASC
'''

#print(query_morph)

In [44]:
# query to see all disease types by topography
query_topo = f'''
SELECT m.Topo, k.Benamning
FROM `{mitel_proj}.{mitel_data}.MolBiolClinAssoc` m
JOIN `{mitel_proj}.{mitel_data}.Koder` k
  ON k.Kod = m.Topo AND k.kodTyp = "TOP"
GROUP BY m.Topo, k.Benamning
ORDER BY k.Benamning ASC
'''

#print(query_topo)

In [45]:
# query for gene fusions of all disease types
# gene names that are separated with a double colon "::" represent gene-pair fusions (e.g. gene1::gene2)
query_fusions = f'''
SELECT g.Gene, count(g.Gene) AS Count
FROM `{mitel_proj}.{mitel_data}.MolClinGene` g
-- gene name for fusions is double-colon separated gene pair
WHERE g.Gene LIKE "%::%"
GROUP BY g.Gene
ORDER BY Count DESC
'''

#print(query_fusions)

In [46]:
# run the queries and store results in dataframes
morphology_df = client.query(query_morph).result().to_dataframe()
topography_df = client.query(query_topo).result().to_dataframe()
fusions_df = client.query(query_fusions).result().to_dataframe()

In [47]:
print(morphology_df.head())
print(len(morphology_df))

print(topography_df.head())
print(len(topography_df))

print(fusions_df.head())
print(len(fusions_df))

# obtain the morphology and topography codes specific to Prostate Adenocarcinoma
print(morphology_df.loc[morphology_df['Benamning'] == 'Adenocarcinoma'])
print(topography_df.loc[topography_df['Benamning'] == 'Prostate'])

  Morph                                          Benamning
0  3117                              Acinic cell carcinoma
1  1115                          Acute basophilic leukemia
2  1117                        Acute eosinophilic leukemia
3  1112                Acute erythroleukemia (FAB type M6)
4  1602  Acute lymphoblastic leukemia/lymphoblastic lym...
224
   Topo   Benamning
0  0703     Adrenal
1  0305     Bladder
2  0801       Brain
3  0806  Brain stem
4  0401      Breast
50
             Gene  Count
0       BCR::ABL1    419
1  RUNX1::RUNX1T1    130
2       PML::RARA     95
3     ETV6::RUNX1     83
4     ETV6::NTRK3     71
33826
   Morph       Benamning
17  3111  Adenocarcinoma
    Topo Benamning
30  0602  Prostate


These results show that there are a total of 224 morphologies and 50 topographies in the Mitelman database. We will look at fusions involving adenocarcinomas in the prostate. The morphology code for adenocarcinoma is 3111, and the topography code for prostate is 0602. Using these two values, we can construct a list of the ten most common gene fusions for Prostate adenocarcinomas present in the Mitelman database.

In [48]:
# query for the ten most common gene fusions for Mitelman Prostate adenocarcinoma cases
fusions_prostate = f'''
SELECT g.Gene, count(g.Gene) AS Count, m.Morph, k.Benamning, m.Topo
FROM `{mitel_proj}.{mitel_data}.MolClinGene` g
JOIN `{mitel_proj}.{mitel_data}.MolBiolClinAssoc` m
  ON m.RefNo = g.RefNo AND m.InvNo = g.InvNo
JOIN `{mitel_proj}.{mitel_data}.Koder` k
  ON k.Kod = m.Topo AND k.kodTyp = "TOP"
-- we are only considering gene fusions for Prostate adenocarcinoma
WHERE g.Gene LIKE "%::%" AND m.Morph = "3111" AND m.Topo = "0602"
GROUP BY g.Gene, m.Morph, k.Benamning, m.Topo
ORDER BY Count DESC
LIMIT 10
'''

#print(fusions_prostate)

In [49]:
# run the query and view the gene fusions
top10_prostate = client.query(fusions_prostate).result().to_dataframe()
print(top10_prostate)

              Gene  Count Morph Benamning  Topo
0     TMPRSS2::ERG     52  3111  Prostate  0602
1    TMPRSS2::ETV4      5  3111  Prostate  0602
2    TMPRSS2::ETV1      5  3111  Prostate  0602
3     SLC45A3::ERG      5  3111  Prostate  0602
4       NDRG1::ERG      4  3111  Prostate  0602
5  OSBPL9::SERINC5      3  3111  Prostate  0602
6    SLC45A3::ELK4      3  3111  Prostate  0602
7       KLK2::ETV1      3  3111  Prostate  0602
8  METTL13::EIF4G3      3  3111  Prostate  0602
9      ADGRL2::AK5      3  3111  Prostate  0602
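
The query above hard-codes the morphology and topography codes for Prostate adenocarcinoma. As a sketch of how the same query text could be generated for any disease type, the hypothetical helper below (`build_fusion_query` is not part of the original notebook) checks that the codes are plain digit strings before interpolating them. It only builds the SQL string; running it still requires the BigQuery client.

```python
def build_fusion_query(morph: str, topo: str, limit: int = 10,
                       project: str = 'mitelman-db', dataset: str = 'prod') -> str:
    """Build the most-common-fusions query for one morphology/topography pair."""
    # Codes are digit strings such as "3111" or "0602"; reject anything else
    # so that no arbitrary text is interpolated into the SQL.
    for code in (morph, topo):
        if not (code.isascii() and code.isdigit()):
            raise ValueError(f'unexpected code: {code!r}')
    return f'''
SELECT g.Gene, count(g.Gene) AS Count, m.Morph, k.Benamning, m.Topo
FROM `{project}.{dataset}.MolClinGene` g
JOIN `{project}.{dataset}.MolBiolClinAssoc` m
  ON m.RefNo = g.RefNo AND m.InvNo = g.InvNo
JOIN `{project}.{dataset}.Koder` k
  ON k.Kod = m.Topo AND k.kodTyp = "TOP"
WHERE g.Gene LIKE "%::%" AND m.Morph = "{morph}" AND m.Topo = "{topo}"
GROUP BY g.Gene, m.Morph, k.Benamning, m.Topo
ORDER BY Count DESC
LIMIT {int(limit)}
'''


# Prostate adenocarcinoma, using the codes found earlier in this notebook
prostate_sql = build_fusion_query('3111', '0602')
print(prostate_sql)
```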


In [50]:
# convert the list of gene fusion pairs into a string containing individual unique gene names
genes_list = [x.split("::") for x in top10_prostate['Gene']]
genes_set = set(chain.from_iterable(genes_list))
genes_str = ','.join(f"'{gene}'" for gene in genes_set)

print(genes_str)

'KLK2','AK5','ETV1','TMPRSS2','EIF4G3','OSBPL9','METTL13','ADGRL2','ERG','NDRG1','ETV4','SERINC5','ELK4','SLC45A3'


Using the ten most common gene fusions for Prostate Adenocarcinoma in the Mitelman database, we have created a list of 14 individual genes (duplicates removed) to explore in TCGA.
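
Because the gene names pass through a Python `set`, the order of the printed string can differ between runs. The sketch below repeats the split-and-deduplicate step on the ten fusion names copied from the output above and sorts the result, so the gene list is reproducible. The variable names are illustrative.

```python
from itertools import chain

# Fusion names copied from the top-ten output above
fusions = [
    'TMPRSS2::ERG', 'TMPRSS2::ETV4', 'TMPRSS2::ETV1', 'SLC45A3::ERG',
    'NDRG1::ERG', 'OSBPL9::SERINC5', 'SLC45A3::ELK4', 'KLK2::ETV1',
    'METTL13::EIF4G3', 'ADGRL2::AK5',
]

# Split each pair on "::", flatten, deduplicate, then sort for a stable order
unique_genes = sorted(set(chain.from_iterable(f.split('::') for f in fusions)))
sorted_genes_str = ','.join(f"'{gene}'" for gene in unique_genes)

print(len(unique_genes))  # 14
print(sorted_genes_str)
```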

## Finding the Genes in TCGA

Now we will find our list of Prostate adenocarcinoma genes in TCGA. We will first find cases matching our target disease type, then we will create a temporary table containing the gene expression data for these cases. The TCGA database is made up of various focused studies as described by this table: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations. We can use the PRAD project to retrieve cases for Prostate adenocarcinoma.

In [13]:
# query to select all Prostate adenocarcinomas in TCGA
query_TCGA = f'''
SELECT r.project_short_name,
      r.primary_site,
      r.gene_name,
      p.proj__name,
      p.disease_type,
      p.diag__primary_gleason_grade
FROM `{TCGA_proj}.{TCGA_data}.RNAseq_hg38_gdc_current` r
JOIN `{TCGA_proj}.{TCGA_data}.clinical_gdc_current` p
  ON r.case_gdc_id = p.case_id
-- select cases from the Prostate adenocarcinoma project
WHERE r.project_short_name = "TCGA-PRAD"
  AND p.disease_type = "Adenomas and Adenocarcinomas"
'''

In [14]:
# run the query and view results
TCGA_prostate = client.query(query_TCGA).result().to_dataframe()
print(TCGA_prostate)

         project_short_name    primary_site   gene_name  \
0                 TCGA-PRAD  Prostate gland   LINC00538   
1                 TCGA-PRAD  Prostate gland   RNU6-153P   
2                 TCGA-PRAD  Prostate gland    OR7E136P   
3                 TCGA-PRAD  Prostate gland    PRAMEF15   
4                 TCGA-PRAD  Prostate gland  AC008759.1   
...                     ...             ...         ...   
32879883          TCGA-PRAD  Prostate gland  ZNF451-AS1   
32879884          TCGA-PRAD  Prostate gland  AP003352.1   
32879885          TCGA-PRAD  Prostate gland     TMEM138   
32879886          TCGA-PRAD  Prostate gland    NIPBL-DT   
32879887          TCGA-PRAD  Prostate gland     MARCHF9   

                       proj__name                  disease_type  \
0         Prostate Adenocarcinoma  Adenomas and Adenocarcinomas   
1         Prostate Adenocarcinoma  Adenomas and Adenocarcinomas   
2         Prostate Adenocarcinoma  Adenomas and Adenocarcinomas   
3         Prostate Aden

## Create a Dataset to Use With the ML Model

A temporary table will be used to store the following data: case identifier, gene name, gene expression values, and Primary Gleason Grade. In the final step of this query, we will pivot the table data, creating a column for each gene in our target list. Note that the temporary-table and ML model queries will replace any table or model with the same name in your provided project and dataset.

In [51]:
# query to create a temporary table for use with the ML model
create_tmp_table = f'''
CREATE OR REPLACE TABLE `{ML_project}.{ML_data}.tmp_data` AS
  SELECT * FROM(
    SELECT
      seq.case_barcode,
      seq.gene_name,
      seq.fpkm_uq_unstranded,
      --assign Primary Gleason Grade to an appropriate integer
      CASE clin.diag__primary_gleason_grade
        WHEN "Pattern 1" THEN 1
        WHEN "Pattern 2" THEN 2
        WHEN "Pattern 3" THEN 3
        WHEN "Pattern 4" THEN 4
        WHEN "Pattern 5" THEN 5
      END AS primary_gleason_grade
    FROM `{TCGA_proj}.{TCGA_data}.RNAseq_hg38_gdc_current` seq
    JOIN `{TCGA_proj}.{TCGA_data}.clinical_gdc_current` clin
      ON seq.case_gdc_id = clin.case_id
    WHERE seq.project_short_name = "TCGA-PRAD"
      AND clin.disease_type = "Adenomas and Adenocarcinomas")
  --transform genes from rows to columns using pivot
  PIVOT(MAX(fpkm_uq_unstranded) for gene_name IN ({genes_str}))
'''
print(create_tmp_table)


CREATE OR REPLACE TABLE `your_model_project.your_data_table.tmp_data` AS
  SELECT * FROM(
    SELECT
      seq.case_barcode,
      seq.gene_name,
      seq.fpkm_uq_unstranded,
      --assign Primary Gleason Grade to an appropriate integer
      CASE clin.diag__primary_gleason_grade
        WHEN "Pattern 1" THEN 1
        WHEN "Pattern 2" THEN 2
        WHEN "Pattern 3" THEN 3
        WHEN "Pattern 4" THEN 4
        WHEN "Pattern 5" THEN 5
      END AS primary_gleason_grade
    FROM `isb-cgc-bq.TCGA.RNAseq_hg38_gdc_current` seq
    JOIN `isb-cgc-bq.TCGA.clinical_gdc_current` clin
      ON seq.case_gdc_id = clin.case_id
    WHERE seq.project_short_name = "TCGA-PRAD"
      AND clin.disease_type = "Adenomas and Adenocarcinomas")
  --transform genes from rows to columns using pivot
  PIVOT(MAX(fpkm_uq_unstranded) for gene_name IN ('KLK2','AK5','ETV1','TMPRSS2','EIF4G3','OSBPL9','METTL13','ADGRL2','ERG','NDRG1','ETV4','SERINC5','ELK4','SLC45A3'))
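
To make the `CASE` mapping and the `PIVOT(MAX(...))` step more concrete, here is a plain-Python sketch of the same long-to-wide reshaping. It illustrates the idea only and is not how BigQuery executes the query. The first two rows reuse values shown for case TCGA-CH-5768 in this notebook's table preview; the third row is hypothetical, added to show what `MAX` would do if a case had more than one expression row for a gene.

```python
# Map "Pattern N" strings to integers, as the CASE expression does
GRADE = {f'Pattern {n}': n for n in range(1, 6)}

# (case_barcode, gene_name, fpkm_uq_unstranded, diag__primary_gleason_grade)
long_rows = [
    ('TCGA-CH-5768', 'KLK2', 1072.6219, 'Pattern 2'),
    ('TCGA-CH-5768', 'AK5', 0.7413, 'Pattern 2'),
    ('TCGA-CH-5768', 'KLK2', 900.0, 'Pattern 2'),  # hypothetical duplicate row
]
target_genes = ['KLK2', 'AK5']


def pivot_max(rows, genes):
    """Group by (case, grade) and keep the maximum value per target gene."""
    wide = {}
    for case, gene, value, grade in rows:
        key = (case, GRADE.get(grade))  # unmatched grades become None
        row = wide.setdefault(key, {g: None for g in genes})
        if gene in row and (row[gene] is None or value > row[gene]):
            row[gene] = value
    return wide


wide_table = pivot_max(long_rows, target_genes)
print(wide_table)
```

Each key of `wide_table` corresponds to one row of the pivoted table, and each inner dictionary holds one column per target gene.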



In [16]:
# Run the query. This will create a table in the assigned Google project.
# A CREATE TABLE statement returns no rows, so we only wait for the job to finish.
model_client.query(create_tmp_table).result()

In [17]:
# query to retrieve data from the new table
query_tmp = f'''
  SELECT *
  FROM `{ML_project}.{ML_data}.tmp_data`
  LIMIT 10
'''

# run the query and view results
tmp_table = client.query(query_tmp).result().to_dataframe()
print(tmp_table)

   case_barcode  primary_gleason_grade       KLK2     AK5     ETV1   TMPRSS2  \
0  TCGA-CH-5768                      2  1072.6219  0.7413   1.1543  246.7231   
1  TCGA-HC-7209                      3  1036.7989  6.2232   0.7732  249.5574   
2  TCGA-HC-7078                      3  1341.2227  8.4328   0.8230  336.9265   
3  TCGA-G9-6339                      3   911.5085  1.2798  14.4465  358.8988   
4  TCGA-G9-6347                      3   860.3153  1.1806   0.7879  121.1990   
5  TCGA-2A-AAYO                      3  1088.0336  0.2035   0.8879  219.7866   
6  TCGA-EJ-5497                      3  1112.3672  0.2399   1.4062  394.1048   
7  TCGA-EJ-7794                      3   949.5938  1.7362   1.5181  291.1958   
8  TCGA-EJ-7793                      3  1390.5860  0.5867   0.8311  441.6666   
9  TCGA-J4-AATV                      3  1024.8606  0.9401   0.8350  212.0895   

    EIF4G3   OSBPL9  METTL13  ADGRL2      ERG     NDRG1    ETV4  SERINC5  \
0  11.4251  12.3321      NaN  5.2915  67.82

## Create a Machine Learning Model

Using our list of relevant Prostate adenocarcinoma genes, we will create a random forest classifier to predict the Primary Gleason Grade from the gene expression data of our target genes.

In [35]:
# query to build a random forest classifier model
rf_model_query = f'''
CREATE OR REPLACE MODEL
  `{ML_project}.{ML_data}.rf_model`
OPTIONS
  ( MODEL_TYPE='RANDOM_FOREST_CLASSIFIER',
    NUM_PARALLEL_TREE=HPARAM_RANGE(50,100),
    TREE_METHOD='HIST',
    --split the data randomly into 10% eval, 20% test, and 70% train
    DATA_SPLIT_METHOD='RANDOM',
    DATA_SPLIT_EVAL_FRACTION=0.1,
    DATA_SPLIT_TEST_FRACTION=0.2,
    NUM_TRIALS=3,
    INPUT_LABEL_COLS=['primary_gleason_grade'])
--ignore case identifier and NULL column
AS SELECT * EXCEPT(case_barcode, METTL13)
FROM
  `{ML_project}.{ML_data}.tmp_data`
'''

#print(rf_model_query)

In [36]:
# Run the query. This will store the new model in the Google project.
# NOTE this query may take several minutes to complete.
rf_model = model_client.query(rf_model_query).result()
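
The split options in the model query reserve 10% of the rows for evaluation and 20% for testing, leaving 70% for training. The following is only an illustration of a random split with those fractions in plain Python. It is not how BigQuery ML implements `DATA_SPLIT_METHOD='RANDOM'` internally, and the row count of 500 and the seed are made-up values.

```python
import random


def random_split(n_rows, eval_fraction=0.1, test_fraction=0.2, seed=0):
    """Shuffle row indices and cut them into train, eval, and test subsets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_eval = round(n_rows * eval_fraction)
    n_test = round(n_rows * test_fraction)
    eval_idx = indices[:n_eval]
    test_idx = indices[n_eval:n_eval + n_test]
    train_idx = indices[n_eval + n_test:]
    return train_idx, eval_idx, test_idx


# 500 is a hypothetical row count chosen for easy arithmetic
train_idx, eval_idx, test_idx = random_split(500)
print(len(train_idx), len(eval_idx), len(test_idx))  # 350 50 100
```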

## Evaluate Model Results

In [38]:
# query to evaluate model performance
eval_query = f'''
SELECT * FROM
  ML.EVALUATE(MODEL `{ML_project}.{ML_data}.rf_model`)
'''

# query for creating a confusion matrix
matrix_query = f'''
SELECT * FROM
  ML.CONFUSION_MATRIX(MODEL `{ML_project}.{ML_data}.rf_model`)
'''

# query for feature importance in the model
feature_importance_query = f'''
SELECT * FROM
  ML.FEATURE_IMPORTANCE(MODEL `{ML_project}.{ML_data}.rf_model`)
'''

# Run the queries and view results
rf_model_eval = model_client.query(eval_query).result().to_dataframe()
rf_model_matrix = model_client.query(matrix_query).result().to_dataframe()
rf_model_features = model_client.query(feature_importance_query).result().to_dataframe()

print(rf_model_eval)
print()
print(rf_model_matrix)
print()
print(rf_model_features)

   trial_id  precision    recall  accuracy  f1_score  log_loss   roc_auc
0         1   0.426894  0.380239  0.612245  0.381794  1.187049  0.543772
1         2   0.381410  0.331800  0.551020  0.324618  1.192002  0.541744
2         3   0.396507  0.362520  0.581633  0.363992  1.194137  0.533144

   trial_id expected_label  _2  _3  _4  _5
0         1              3   0  25  16   2
1         1              4   0   8  33   1
2         1              5   0   0  11   2
3         2              3   0  21  21   1
4         2              4   0   9  32   1
5         2              5   0   0  12   1
6         3              3   0  24  17   2
7         3              4   0   9  31   2
8         3              5   0   0  11   2

    trial_id  feature  importance_weight  importance_gain  importance_cover
0          3     KLK2               1450         2.015511         20.702845
1          3      AK5               1194         2.990116         28.048367
2          3     ETV1                711        

Three trials were run for hyperparameter tuning. The first trial generated the highest accuracy and precision, at 61% and 43% respectively. The confusion matrix for trial 3 shows that grade 4 predictions were the most accurate, with 31 of 42 cases predicted correctly, and that grade 5 had the worst results, with only 2 of 13 predicted correctly. The feature importance function reports how useful each feature was when training the model. In our case, the *KLK2* gene was assigned the highest importance weight.
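
As a worked check, the trial 1 metrics can be recomputed from the trial 1 rows of the confusion matrix printed above. The sketch assumes that precision and recall are macro-averaged over the four grade labels (2 to 5), with a label that has no cases or no predictions contributing zero. That averaging scheme is an inference from the numbers agreeing, not something stated in the notebook output.

```python
# Trial 1 rows of the confusion matrix printed above:
# expected grade -> counts predicted as grades 2, 3, 4, 5
labels = [2, 3, 4, 5]
matrix = {
    3: [0, 25, 16, 2],
    4: [0, 8, 33, 1],
    5: [0, 0, 11, 2],
}

total = sum(sum(row) for row in matrix.values())
correct = sum(row[labels.index(grade)] for grade, row in matrix.items())
accuracy = correct / total


def recall(grade):
    """Correct predictions for a grade divided by its actual case count."""
    row = matrix.get(grade)
    if not row or sum(row) == 0:
        return 0.0  # assumption: a grade with no cases counts as zero
    return row[labels.index(grade)] / sum(row)


def precision(grade):
    """Correct predictions for a grade divided by all predictions of it."""
    col = labels.index(grade)
    predicted = sum(row[col] for row in matrix.values())
    if predicted == 0 or grade not in matrix:
        return 0.0  # assumption: a grade never predicted counts as zero
    return matrix[grade][col] / predicted


macro_recall = sum(recall(g) for g in labels) / len(labels)
macro_precision = sum(precision(g) for g in labels) / len(labels)

print(round(accuracy, 6), round(macro_precision, 6), round(macro_recall, 6))
# 0.612245 0.426894 0.380239
```

These values match the accuracy, precision, and recall reported for trial 1, and the per-grade recall makes the weak performance on grade 5 (2 of 13 cases) explicit.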

## Conclusion

The large number of gene fusions in the Mitelman database has allowed us to curate a list of common fusions specific to Prostate adenocarcinoma. Using these genes, we were able to create a random forest classifier in BigQuery that achieved an accuracy of about 61% in its best trial when predicting the Primary Gleason Grade from TCGA gene expression data. Further experiments could improve this accuracy by incorporating more genes or additional features. In this notebook we have demonstrated the usability of the Mitelman and TCGA databases, as well as the ease of creating machine learning models in BigQuery.