<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/MitelmanDB/Correlations_MitelmanDB_and_TCGA_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Correlations between Mitelman and TCGA datasets
Check out other notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

- **Title:** Correlations between Mitelman DB and TCGA datasets
- **Author:** Boris Aguilar
- **Created:** 04-23-2022
- **Purpose:** Compare Mitelman DB and TCGA datasets
- **URL:** 

This notebook demonstrates how to compute correlations between Mitelman DB and TCGA datasets. The Mitelman DB is hosted by ISB-CGC and can be accessed at https://mitelmandatabase.isb-cgc.org/. The notebook replicates some of the analyses from the paper by Denomy et al.: https://cancerres.aacrjournals.org/content/79/20/5181. Note, however, that the results are not replicated exactly, because some of the underlying data has changed since publication.



## Initialize Notebook Environment

Before running the analysis, we need to load dependencies, authenticate to BigQuery, and customize notebook parameters.

### Import Dependencies

In [None]:
# GCP Libraries
from google.cloud import bigquery
from google.colab import auth

# Data Analytics
import numpy as np
from scipy import stats

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

### Authenticate

Before using BigQuery, we need to get authorization for access to BigQuery and the Google Cloud. For more information see ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods can be found [here](https://googleapis.dev/python/google-api-core/latest/auth.html).

In [None]:
# If you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()

# alternatively, use the gcloud SDK
#!gcloud auth application-default login

### Google project ID

To run this notebook, you will need to have your Google Cloud Account set up. If you need to set up a Google Cloud Account, follow the "Obtain a Google identity" and "Set up a Google Cloud Project" steps on our [Quick-Start Guide documentation](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html) page.

In [None]:
# set the google project that will be billed for this notebook's computations
google_project = 'my_google_project_id' ## change me

## BigQuery Client

Create the BigQuery client.

In [None]:
# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)

# Calculate Frequency of Gains and Losses of breast cancer samples in Mitelman DB

We can use CytoConverter genomic coordinates to calculate the frequency of chromosomal gains and losses across a cohort of samples, e.g., across all breast cancer samples. 

In [None]:
# Set parameters for this query
cancer_type = 'BRCA' # Cancer type for TCGA
bq_project = 'mitelman-db'  # project name of Mitelman-DB BigQuery table
bq_dataset = 'prod' # Name of the dataset containing Mitelman-DB BigQuery tables
morphology = '3111' # Mitelman morphology code: adenocarcinoma
topology = '0401' # Mitelman topography code: breast

First, we identify all Mitelman DB cases related to the morphology and topology of interest. 

This query was copied from the "View Overall Gain/Loss in chromosome" feature of the Mitelman DB web interface.

In [None]:
case_query = """
cyto_cases AS (
 SELECT DISTINCT
 c.Refno,
 c.CaseNo,
 c.InvNo,
FROM
 `{bq_project}.{bq_dataset}.CytogenInvValid` c,
 `{bq_project}.{bq_dataset}.Reference` Reference,
 `{bq_project}.{bq_dataset}.Cytogen` Cytogen
LEFT JOIN `{bq_project}.{bq_dataset}.Koder` KoderM
ON
 (Cytogen.Morph = KoderM.Kod AND KoderM.KodTyp = 'MORPH')
LEFT JOIN `{bq_project}.{bq_dataset}.Koder` KoderT
ON
 (Cytogen.Topo = KoderT.Kod AND KoderT.KodTyp = 'TOP')
WHERE
 Cytogen.RefNo = c.RefNo
 AND Cytogen.CaseNo = c.CaseNo
 AND c.Refno = Reference.Refno
 AND Cytogen.Morph IN ('{morphology}')
 AND Cytogen.Topo IN ('{topology}')
),
SampleCount AS(

 SELECT COUNT(*) AS sCount
 FROM cyto_cases
),
Case_CC_Kary_Result AS (
 SELECT cc_result.*
 FROM cyto_cases
 LEFT JOIN `{bq_project}.{bq_dataset}.CytoConverted` AS cc_result
 ON cc_result.RefNo = cyto_cases.RefNo
 AND cc_result.caseNo = cyto_cases.caseNo
 AND cc_result.invNo = cyto_cases.invNo
),
Clone_imbal_sums AS (
 SELECT cytoBands.chromosome,
 cytoBands.cytoband_name,
 cytoBands.hg38_start,
 cytoBands.hg38_stop,
 Case_CC_Kary_Result.RefNo,
 Case_CC_Kary_Result.CaseNo,
 Case_CC_Kary_Result.InvNo,
 Case_CC_Kary_Result.Clone,
 SUM( CASE WHEN type = 'Gain' THEN 1 ELSE 0 END ) AS totalGain,
 SUM( CASE WHEN type = 'Loss' THEN 1 ELSE 0 END ) AS totalLoss
 FROM `{bq_project}.{bq_dataset}.CytoBands_hg38` AS cytoBands
 INNER JOIN Case_CC_Kary_Result
  ON cytoBands.chromosome = Case_CC_Kary_Result.Chr
 WHERE cytoBands.hg38_start >= Case_CC_Kary_Result.Start
 AND cytoBands.hg38_stop <= Case_CC_Kary_Result.End
 GROUP BY
 cytoBands.chromosome,
 cytoBands.cytoband_name,
 cytoBands.hg38_start,
 cytoBands.hg38_stop,
 Case_CC_Kary_Result.RefNo,
 Case_CC_Kary_Result.CaseNo,
 Case_CC_Kary_Result.InvNo,
 Case_CC_Kary_Result.Clone
),
AMP_DEL_counts AS (
 SELECT Clone_imbal_sums.chromosome, Clone_imbal_sums.cytoband_name, Clone_imbal_sums.hg38_start, Clone_imbal_sums.hg38_stop,
 Clone_imbal_sums.RefNo, Clone_imbal_sums.CaseNo, Clone_imbal_sums.InvNo, Clone_imbal_sums.Clone,
 CASE WHEN Clone_imbal_sums.totalGain > 1 THEN Clone_imbal_sums.totalGain ELSE 0 END AS amplified,
 CASE WHEN Clone_imbal_sums.totalLoss > 1 THEN Clone_imbal_sums.totalLoss ELSE 0 END AS hozy_deleted,
 CASE WHEN Clone_imbal_sums.totalGain > 1 THEN 1 ELSE 0 END AS amp_count,
 CASE WHEN Clone_imbal_sums.totalLoss > 1 THEN 1 ELSE 0 END AS hozy_del_count,
 FROM Clone_imbal_sums
),
Singular_imbal AS (
 SELECT Clone_imbal_sums.chromosome,
 Clone_imbal_sums.cytoband_name,
 Clone_imbal_sums.hg38_start,
 Clone_imbal_sums.hg38_stop,
 Clone_imbal_sums.RefNo,
 Clone_imbal_sums.CaseNo,
 Clone_imbal_sums.InvNo,
 Clone_imbal_sums.Clone,
 Clone_imbal_sums.totalGain - AMP_DEL_counts.amplified AS Singular_gain,
 Clone_imbal_sums.totalLoss - AMP_DEL_counts.hozy_deleted AS Singular_loss,
 AMP_DEL_counts.amp_count,
 AMP_DEL_counts.hozy_del_count
 FROM Clone_imbal_sums
 INNER JOIN AMP_DEL_counts
 ON Clone_imbal_sums.chromosome = AMP_DEL_counts.chromosome
 AND Clone_imbal_sums.cytoband_name = AMP_DEL_counts.cytoband_name
 AND Clone_imbal_sums.hg38_start= AMP_DEL_counts.hg38_start
 AND Clone_imbal_sums.hg38_stop = AMP_DEL_counts.hg38_stop
 AND Clone_imbal_sums.RefNo= AMP_DEL_counts.RefNo
 AND Clone_imbal_sums.CaseNo = AMP_DEL_counts.CaseNo
 AND Clone_imbal_sums.InvNo = AMP_DEL_counts.InvNo
 AND Clone_imbal_sums.Clone = AMP_DEL_counts.Clone
),
Sample_dist_count AS (
 SELECT Singular_imbal.chromosome,
 Singular_imbal.cytoband_name,
 Singular_imbal.hg38_start,
 Singular_imbal.hg38_stop,
 Singular_imbal.RefNo,
 Singular_imbal.CaseNo,
 Singular_imbal.InvNo,
 CASE WHEN SUM(Singular_imbal.Singular_gain)> 0 THEN 1 ELSE 0 END AS Sample_dist_singular_gain,
 CASE WHEN SUM(Singular_imbal.Singular_loss)> 0 THEN 1 ELSE 0 END AS Sample_dist_singular_loss,
 CASE WHEN SUM(Singular_imbal.amp_count)>0 THEN 1 ELSE 0 END AS Sample_dist_amp,
 CASE WHEN SUM(Singular_imbal.hozy_del_count)>0 THEN 1 ELSE 0 END AS Sample_dist_del,
 FROM Singular_imbal
 GROUP BY
 Singular_imbal.chromosome,
 Singular_imbal.cytoband_name,
 Singular_imbal.hg38_start,
 Singular_imbal.hg38_stop,
 Singular_imbal.RefNo,
 Singular_imbal.CaseNo,
 Singular_imbal.InvNo
),
mitelman AS ( 
SELECT Sample_dist_count.chromosome,
 CASE WHEN SUBSTRING(Sample_dist_count.chromosome, 4) = 'X' THEN 23
      WHEN SUBSTRING(Sample_dist_count.chromosome, 4) = 'Y' THEN 24
      ELSE CAST(SUBSTRING(Sample_dist_count.chromosome, 4) AS INT64)
 END AS chr_ord,
 Sample_dist_count.cytoband_name,
 Sample_dist_count.hg38_start,
 Sample_dist_count.hg38_stop,
 SampleCount.sCount,
 SUM(Sample_dist_count.Sample_dist_singular_gain) AS total_gain,
 SUM(Sample_dist_count.Sample_dist_singular_loss) AS total_loss,
 SUM(Sample_dist_count.Sample_dist_amp) AS total_amp,
 SUM(Sample_dist_count.Sample_dist_del) AS total_del,
 ROUND(SUM(Sample_dist_count.Sample_dist_singular_gain)/SampleCount.sCount*100, 2) AS gain_freq,
 ROUND(SUM(Sample_dist_count.Sample_dist_singular_loss)/SampleCount.sCount*100, 2) AS loss_freq,
 ROUND(SUM(Sample_dist_count.Sample_dist_amp)/SampleCount.sCount*100, 2) AS amp_freq,
 ROUND(SUM(Sample_dist_count.Sample_dist_del)/SampleCount.sCount*100, 2) AS del_freq
 FROM Sample_dist_count, SampleCount
 GROUP BY
 Sample_dist_count.chromosome,
 Sample_dist_count.cytoband_name,
 Sample_dist_count.hg38_start,
 Sample_dist_count.hg38_stop,
 SampleCount.sCount
 ORDER BY chr_ord, Sample_dist_count.hg38_start
)
""".format(
  bq_project=bq_project,
  bq_dataset=bq_dataset,
  morphology=morphology,
  topology=topology
)
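The counting logic of the query above can be hard to follow in SQL. The sketch below mirrors it in plain Python on a tiny made-up input: within each clone, a cytoband with more than one `Gain` event counts as an amplification and a cytoband with exactly one counts as a singular gain (likewise for losses and deletions), and a sample is counted once per category if any of its clones shows it. The function name `band_frequencies` and the toy records are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def band_frequencies(events, n_samples):
    """Percent of samples with a gain, loss, amplification, or deletion per cytoband.

    events: iterable of (sample_id, clone, cytoband, type), type is 'Gain' or 'Loss'.
    n_samples: total number of samples in the cohort (sCount in the query).
    """
    # Count Gain/Loss events per (sample, clone, cytoband), like Clone_imbal_sums.
    counts = defaultdict(lambda: {"Gain": 0, "Loss": 0})
    for sample, clone, band, kind in events:
        counts[(sample, clone, band)][kind] += 1

    # Collect distinct samples per cytoband and category, like Sample_dist_count.
    samples = defaultdict(lambda: defaultdict(set))
    for (sample, _clone, band), c in counts.items():
        if c["Gain"] == 1:
            samples[band]["gain"].add(sample)
        elif c["Gain"] > 1:
            samples[band]["amp"].add(sample)
        if c["Loss"] == 1:
            samples[band]["loss"].add(sample)
        elif c["Loss"] > 1:
            samples[band]["del"].add(sample)

    return {
        band: {cat: round(100 * len(ids) / n_samples, 2) for cat, ids in cats.items()}
        for band, cats in samples.items()
    }

# Toy cohort of 4 samples (made-up data).
events = [
    ("s1", "c1", "1p36", "Gain"),
    ("s1", "c1", "1p36", "Gain"),  # two gains in one clone: amplification
    ("s2", "c1", "1p36", "Gain"),  # one gain: singular gain
    ("s2", "c2", "1p36", "Loss"),
    ("s3", "c1", "1p36", "Loss"),
]
freqs = band_frequencies(events, n_samples=4)
```

In this toy cohort, sample `s2` contributes to both the gain and the loss category because the events come from different clones, which is also how the query treats multi-clone cases.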

In [None]:
#print(case_query)

In [None]:
# Run the query and put results in a data frame
mysql = ( "WITH " + case_query + """
SELECT *
FROM mitelman
""" )
final_mitelman = client.query(mysql).result().to_dataframe()

In [None]:
# Display the table of cases
final_mitelman

Unnamed: 0,chromosome,chr_ord,cytoband_name,hg38_start,hg38_stop,sCount,total_gain,total_loss,total_amp,total_del,gain_freq,loss_freq,amp_freq,del_freq
0,chr1,1,1p36,0,27600000,787,66,93,46,20,8.39,11.82,5.84,2.54
1,chr1,1,1p35,27600000,34300000,787,71,88,48,18,9.02,11.18,6.10,2.29
2,chr1,1,1p34,34300000,46300000,787,72,89,47,17,9.15,11.31,5.97,2.16
3,chr1,1,1p33,46300000,50200000,787,72,86,49,17,9.15,10.93,6.23,2.16
4,chr1,1,1p32,50200000,60800000,787,72,84,49,17,9.15,10.67,6.23,2.16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315,chrX,23,Xq27,138900000,148000000,787,72,94,46,3,9.15,11.94,5.84,0.38
316,chrX,23,Xq28,148000000,156040895,787,73,93,45,4,9.28,11.82,5.72,0.51
317,chrY,24,Yp11,0,10400000,787,0,8,0,0,0.00,1.02,0.00,0.00
318,chrY,24,Yq11,10400000,26600000,787,0,8,0,0,0.00,1.02,0.00,0.00


# Calculate Frequency of TCGA Copy Number Gains and Losses in breast cancer samples

As a comparison to Mitelman DB gain and loss frequency, we can calculate similar frequencies using TCGA Copy Number data. 

In [None]:
cnv_query = """
copy AS (
  SELECT case_barcode,	#sample_barcode,	aliquot_barcode, 
    chromosome,	start_pos,	end_pos,	MAX(copy_number) as copy_number
  FROM `isb-cgc-bq.TCGA_versioned.copy_number_segment_allelic_hg38_gdc_r23` 
  WHERE  project_short_name = 'TCGA-BRCA'
  GROUP BY case_barcode, chromosome,	start_pos,	end_pos
),
total_cases AS (
  SELECT COUNT( DISTINCT case_barcode) as total
  FROM copy 
),
cytob AS (
  SELECT chromosome, cytoband_name, hg38_start, hg38_stop,
  FROM mitelman-db.prod.CytoBands_hg38
),
joined AS (
  SELECT cytob.chromosome, cytoband_name, hg38_start, hg38_stop,
    case_barcode,
    ( ABS(hg38_stop - hg38_start) + ABS(end_pos - start_pos) 
      - ABS(hg38_stop - end_pos) - ABS(hg38_start - start_pos) )/2.0  AS overlap ,
    copy_number  
  FROM copy
  LEFT JOIN cytob
  ON cytob.chromosome = copy.chromosome 
  WHERE 
    #cytob.hg38_start >= copy.start_pos AND cytob.hg38_start <= copy.end_pos  
    ( cytob.hg38_start >= copy.start_pos AND copy.end_pos >= cytob.hg38_start )
    OR ( copy.start_pos >= cytob.hg38_start  AND  copy.start_pos <= cytob.hg38_stop )
),
cbands AS(
SELECT chromosome, cytoband_name, hg38_start, hg38_stop, case_barcode,
    ROUND( SUM(overlap*copy_number) / SUM(overlap) ) as copy_number
    #ARRAY_AGG( copy_number ORDER BY overlap DESC )[OFFSET(0)] as copy_number
    #ANY_VALUE(copy_number) as copy_number
    #MAX(copy_number) as copy_number
    #MIN(copy_number) as copy_number
FROM joined
GROUP BY 
   chromosome, cytoband_name, hg38_start, hg38_stop, case_barcode
),
aberrations AS (
  SELECT
    chromosome,
    cytoband_name,
    hg38_start,
    hg38_stop,
    -- Amplifications: two or more extra copies relative to diploid (copy number > 3)
    SUM( IF (copy_number > 3 , 1 , 0) ) AS total_amp,
    -- Gains: exactly one extra copy (copy number = 3)
    SUM( IF( copy_number = 3 ,1, 0) ) AS total_gain,
    -- Homozygous deletions, or complete deletions
    SUM( IF( copy_number = 0, 1, 0) ) AS total_homodel,
    -- Heterozygous deletions, 1 copy lost
    SUM( IF( copy_number = 1, 1, 0) ) AS total_heterodel,
    -- Normal for Diploid = 2
    SUM( IF( copy_number = 2, 1, 0) )  AS total_normal

  FROM cbands
  GROUP BY chromosome, cytoband_name, hg38_start, hg38_stop
),
tcga AS (
SELECT chromosome, cytoband_name, hg38_start, hg38_stop,
  total,  
  100 * total_amp / total as freq_amp, 
  100 * total_gain / total as freq_gain,
  100 * total_homodel/ total as freq_homodel, 
  100 * total_heterodel / total as freq_heterodel, 
  100 * total_normal / total as freq_normal  
FROM aberrations, total_cases
ORDER BY chromosome, hg38_start, hg38_stop
)
"""
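The `joined` and `cbands` steps above assign one copy number to each cytoband by weighting every overlapping segment by the length of its overlap with the band. The sketch below reproduces that arithmetic in plain Python for a single cytoband. The helper names `overlap_length` and `band_copy_number` and the toy segments are hypothetical. The absolute-value expression is algebraically the same as `min(ends) - max(starts)`, so it is a valid overlap length only for intervals that intersect, which is the case the `WHERE` clause selects.

```python
def overlap_length(band_start, band_stop, seg_start, seg_end):
    """Overlap length of two intervals, written as in the query's `joined` step."""
    return (abs(band_stop - band_start) + abs(seg_end - seg_start)
            - abs(band_stop - seg_end) - abs(band_start - seg_start)) / 2.0

def band_copy_number(band_start, band_stop, segments):
    """Overlap-weighted mean copy number for one cytoband, rounded to an integer.

    segments: list of (start_pos, end_pos, copy_number) that intersect the band.
    Note: Python's round() sends exact halves to the nearest even integer,
    whereas BigQuery's ROUND rounds halves away from zero.
    """
    weights = [overlap_length(band_start, band_stop, s, e) for s, e, _ in segments]
    weighted_sum = sum(w * cn for w, (_, _, cn) in zip(weights, segments))
    return round(weighted_sum / sum(weights))

# Toy cytoband spanning 0-100, covered by two segments (made-up data).
segments = [(0, 60, 2), (60, 150, 4)]
overlaps = [overlap_length(0, 100, s, e) for s, e, _ in segments]
copy_number = band_copy_number(0, 100, segments)
```

Here the two segments overlap the band by 60 and 40 units, so the weighted mean is (60 × 2 + 40 × 4) / 100 = 2.8, which rounds to a copy number of 3.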

In [None]:
# Execute query and put results into a data frame
mysql = ( "WITH " + cnv_query + """
SELECT *
FROM tcga
""" )
cnv = client.query(mysql).result().to_dataframe()

In [None]:
# Display table
cnv

Unnamed: 0,chromosome,cytoband_name,hg38_start,hg38_stop,total,freq_amp,freq_gain,freq_homodel,freq_heterodel,freq_normal
0,chr1,1p36,0,27600000,1067,11.902530,19.962512,0.000000,13.120900,55.014058
1,chr1,1p35,27600000,34300000,1067,13.214620,21.462043,0.000000,9.372071,55.951265
2,chr1,1p34,34300000,46300000,1067,18.650422,21.743205,0.000000,5.716963,53.889410
3,chr1,1p33,46300000,50200000,1067,17.525773,22.774133,0.000000,6.373008,53.327085
4,chr1,1p32,50200000,60800000,1067,19.119025,21.462043,0.000000,6.279288,53.139644
...,...,...,...,...,...,...,...,...,...,...
300,chrX,Xq27,138900000,148000000,1067,24.273664,14.058107,0.281162,10.496720,50.890347
301,chrX,Xq28,148000000,156040895,1067,23.711340,14.526710,0.187441,10.309278,51.265230
302,chrY,Yp11,0,10400000,1067,0.374883,0.281162,96.438613,2.624180,0.281162
303,chrY,Yq11,10400000,26600000,1067,0.281162,0.281162,97.469541,1.593252,0.374883
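
The `aberrations` and `tcga` steps of the query turn the per-case copy number of each cytoband into the percentages shown above. A minimal plain-Python sketch of that classification follows; the function names `classify` and `category_frequencies` and the input list are hypothetical, and the thresholds are copied from the query.

```python
from collections import Counter

def classify(copy_number):
    """Map a non-negative integer copy number to the category used in `aberrations`."""
    if copy_number > 3:
        return "amp"
    return {3: "gain", 2: "normal", 1: "heterodel", 0: "homodel"}[copy_number]

def category_frequencies(copy_numbers, total_cases):
    """Percent of cases in each category for one cytoband."""
    counts = Counter(classify(cn) for cn in copy_numbers)
    return {cat: 100 * n / total_cases for cat, n in counts.items()}

# Rounded copy numbers of one cytoband across 8 toy cases (made-up data).
band_copy_numbers = [2, 2, 3, 5, 1, 2, 0, 3]
freqs = category_frequencies(band_copy_numbers, total_cases=8)
```

Because the denominator is the total number of cases, a case with no segment covering a cytoband lowers every frequency for that band slightly rather than being excluded.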


# Compute Pearson correlation and p-values
The following query computes the Pearson correlation, for each chromosome, between the Mitelman DB frequencies and those computed from TCGA. For each correlation value, the corresponding p-value is computed with the BigQuery function `isb-cgc-bq.functions.corr_pvalue_current`. Only chromosomes with more than 5 matched cytobands (`N > 5`) are included in the results.

In [None]:
mysql = ( "WITH " + case_query + "," + cnv_query + """ 
SELECT chromosome,
  corr_amp, `isb-cgc-bq.functions.corr_pvalue_current`(corr_amp, N) AS pvalue_amp,
  corr_gain, `isb-cgc-bq.functions.corr_pvalue_current`(corr_gain, N) AS pvalue_gain,
  corr_loss, `isb-cgc-bq.functions.corr_pvalue_current`(corr_loss, N) AS pvalue_loss,
  corr_del, `isb-cgc-bq.functions.corr_pvalue_current`(corr_del, N) AS pvalue_del,
FROM (
  SELECT mitelman.chromosome,
      COUNT(*) AS N, 
      CORR( mitelman.amp_freq, tcga.freq_amp ) as corr_amp,
      CORR(mitelman.gain_freq, tcga.freq_gain ) as corr_gain,
      CORR(mitelman.loss_freq, tcga.freq_heterodel ) as corr_loss,
      CORR(mitelman.del_freq, tcga.freq_homodel ) as corr_del,
  FROM mitelman
  JOIN tcga
  ON mitelman.chromosome = tcga.chromosome
    AND mitelman.cytoband_name = tcga.cytoband_name 
  GROUP BY mitelman.chromosome
)
WHERE N > 5
ORDER BY chromosome

""" )

corr_table = client.query(mysql).result().to_dataframe()
corr_table

Unnamed: 0,chromosome,corr_amp,pvalue_amp,corr_gain,pvalue_gain,corr_loss,pvalue_loss,corr_del,pvalue_del
0,chr1,0.772126,1.174299e-05,-0.361223,0.083522,0.889018,9.708662e-09,,
1,chr10,-0.666979,0.01969631,-0.380857,0.225073,-0.417797,0.1798612,,
2,chr11,0.000982,0.9973455,0.738486,0.002974,0.833928,0.0002791603,,
3,chr12,0.588961,0.04664973,0.467437,0.12884,0.858186,0.0005026906,,
4,chr13,-0.717526,0.02254193,-0.305352,0.394564,-0.37381,0.2918127,0.030203,0.934283
5,chr14,-0.552576,0.1299509,0.087078,0.824793,0.603866,0.09186479,,
6,chr15,-0.540334,0.09021494,-0.089091,0.795228,-0.898673,0.0002750227,-0.193649,0.570099
7,chr16,0.884172,0.00106052,0.982763,1e-06,0.979074,2.723566e-06,0.164261,0.651966
8,chr17,0.93373,0.0001523126,0.622668,0.059134,0.940111,0.0001070353,,
9,chr18,0.737859,0.1166475,-0.269433,0.61484,0.695262,0.1484915,-0.147755,0.784603


In [None]:
corr_table.to_csv('correlation_MitelmanDB_vs_TCGA_BreastCancer.csv', index=False)

The NaN (not a number) results correspond to chromosomes for which the computed TCGA frequencies are zero for all cytobands, so the correlation is undefined.
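
To check individual values offline, a common way to obtain a two-sided p-value from a Pearson coefficient `r` computed on `n` points is a t-test with `n - 2` degrees of freedom. The sketch below implements that test with SciPy. It is an assumption that `corr_pvalue_current` uses the same test, and the helper name `corr_pvalue` and the toy frequencies are hypothetical. The helper returns NaN when `r` is NaN, mirroring the empty cells in the table above.

```python
import math
from scipy import stats

def corr_pvalue(r, n):
    """Two-sided p-value for a Pearson correlation r from n pairs (t-test, n - 2 df)."""
    if math.isnan(r) or n < 3:
        return float("nan")
    if abs(r) >= 1.0:
        return 0.0
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    # Survival function of Student's t, doubled for a two-sided test.
    return 2.0 * stats.t.sf(abs(t), df=n - 2)

# Toy frequencies for six cytobands (made-up numbers).
x = [8.4, 9.0, 9.2, 9.1, 10.5, 12.0]
y = [11.9, 13.2, 18.7, 17.5, 19.1, 24.3]
r, p_scipy = stats.pearsonr(x, y)
p_manual = corr_pvalue(r, len(x))
```

For non-degenerate input, `p_manual` should agree with the p-value reported by `scipy.stats.pearsonr`, which gives a quick sanity check on the helper.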

# Conclusion

This notebook demonstrated usage of the Mitelman BigQuery dataset, which includes CytoConverter chromosomal coordinate data, in combination with TCGA BigQuery tables for a comparative analysis. Specifically, the notebook computes Pearson correlation coefficients between the gain and loss frequencies obtained from the Mitelman DB and TCGA datasets.

We observed that the majority (but not all) of the significant correlations shown in Table 1 of the Denomy et al. paper (https://doi.org/10.1158/0008-5472.CAN-19-0585) are also significant in this analysis.