<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/lkh-staging/RegulomeExplorer/BigQuery_FisherExact.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ISB-CGC Community Notebooks

Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

```
Title:      Regulome Explorer Fisher's exact test to identify significant co-mutations
Author:     Boris Aguilar
Updated by: Lauren Hagen
Created:    2020-06-20
Updated:    2024-04
Purpose:    To provide an example of how to use Fisher's exact test in BigQuery
URL:        https://github.com/isb-cgc/Community-Notebooks/blob/master/RegulomeExplorer/BigQuery-FisherExact.ipynb
Notes:      This notebook uses the BigQuery User-defined function p_fisherexact_current. The source code and examples of how to use this function can be found in https://github.com/isb-cgc/Community-Notebooks/tree/master/BQUserFunctions.
```

Check out more notebooks with statistical analysis at our ['Regulome Explorer Repository'](https://github.com/isb-cgc/Community-Notebooks/tree/master/RegulomeExplorer)!

In this notebook we describe how Regulome Explorer uses Fisher's exact test to compute the significance of associations between two categorical features. Regulome Explorer uses this test when both features have only two categories, such as the presence or absence of somatic mutations or the gender of the participants.

To describe the implementation, we will use somatic mutation data for two user-defined genes. This data is read from BigQuery tables in the ISB-CGC Google project `isb-cgc-bq`. For reference, a description of Fisher's exact test can be found at the following link: http://mathworld.wolfram.com/FishersExactTest.html

# Import, Authenticate, and Parameters

## Import Python Libraries

In [None]:
from google.colab import auth
import google.auth
from google.cloud import bigquery
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import pandas_gbq

## Authenticate with Google (IMPORTANT)
The first step is to authorize access to BigQuery and Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods are described [here](https://googleapis.github.io/google-cloud-python/latest/core/auth.html).

In [None]:
# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()

# alternatively, use the gcloud SDK
#!gcloud auth application-default login

## Setting up your Google project

In [None]:
# Google Cloud project ID that pandas_gbq uses to run the BigQuery queries
my_project_id = 'YOUR_PROJECT_ID_CHANGE_ME' # Update with your Google Project Id

## User-defined Parameters
The parameters for this experiment are the program, the primary site, the names of the two genes (gene1 and gene2) for which mutation information will be extracted, and the GDC release to use for each table.

In [None]:
primary_site = 'Uterus, NOS'
program = 'CPTAC'
gdc_release_clinical = 'r37'
gdc_release_mutation = 'r37'
mutation_name1 = 'KRAS'
mutation_name2 = 'TP53'

# Data from BigQuery tables

The first step is to select all participants in the selected program with the selected primary site.

In [None]:
barcode_set = f"""barcodes AS (
   SELECT submitter_id as case_barcode
   FROM `isb-cgc-bq.{program}_versioned.clinical_gdc_{gdc_release_clinical}`
   WHERE primary_site = '{primary_site}'
)
"""

## Somatic mutation data for gene 1
The following query string builds a table that labels each participant 'YES' if they have at least one somatic mutation in the user-defined gene (`mutation_name1`) and 'NO' otherwise. This information is extracted from the somatic mutation table of the user-defined program.

In [None]:
query1 = f"""table1 AS (
SELECT
   t1.case_barcode,
   IF( t2.case_barcode is null, 'NO', 'YES') as data
FROM
   barcodes AS t1
LEFT JOIN
   (
   SELECT
      case_barcode
   FROM `isb-cgc-bq.{program}_versioned.masked_somatic_mutation_hg38_gdc_{gdc_release_mutation}`
   WHERE Hugo_Symbol = '{mutation_name1}'
   GROUP BY case_barcode
   ) AS t2
ON t1.case_barcode = t2.case_barcode
)
"""

The somatic mutation data for gene 2 is retrieved using a similar query:

In [None]:
query2 = f"""table2 AS (
SELECT
   t1.case_barcode,
   IF( t2.case_barcode is null, 'NO', 'YES') as data
FROM
   barcodes AS t1
LEFT JOIN
   (
   SELECT
      case_barcode
   FROM `isb-cgc-bq.{program}_versioned.masked_somatic_mutation_hg38_gdc_{gdc_release_mutation}`
   WHERE Hugo_Symbol = '{mutation_name2}'
   GROUP BY case_barcode
   ) AS t2
ON t1.case_barcode = t2.case_barcode
)
"""
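
The two query strings above differ only in the CTE name and the gene symbol, so they could be generated by one helper. The following is a minimal sketch of that idea; the function name `build_mutation_cte` and its arguments are hypothetical and not part of the original notebook, and the example values are copied from the parameters cell.

```python
def build_mutation_cte(cte_name, gene, program, gdc_release):
    """Return a CTE string that flags each case in `barcodes` as 'YES' or 'NO'
    depending on whether it has at least one somatic mutation in `gene`."""
    return f"""{cte_name} AS (
SELECT
   t1.case_barcode,
   IF( t2.case_barcode is null, 'NO', 'YES') as data
FROM
   barcodes AS t1
LEFT JOIN
   (
   SELECT
      case_barcode
   FROM `isb-cgc-bq.{program}_versioned.masked_somatic_mutation_hg38_gdc_{gdc_release}`
   WHERE Hugo_Symbol = '{gene}'
   GROUP BY case_barcode
   ) AS t2
ON t1.case_barcode = t2.case_barcode
)
"""

# Example values copied from the parameters cell (assumed unchanged)
example_cte1 = build_mutation_cte("table1", "KRAS", "CPTAC", "r37")
example_cte2 = build_mutation_cte("table2", "TP53", "CPTAC", "r37")
print(example_cte1)
```

Because the gene symbol is interpolated directly into the SQL text, this sketch is only appropriate for trusted, hand-typed parameter values such as the ones in this notebook.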

The following query combines the two tables based on case barcodes. Nij is the number of participants for each pair of categories. The data1 (data2) column holds the somatic mutation status for gene1 (gene2): 'YES' for participants with a mutation and 'NO' otherwise.

In [None]:
query_summarize = """summ_table AS (
SELECT
   n1.data as data1,
   n2.data as data2,
   COUNT(*) as Nij
FROM
   table1 AS n1
INNER JOIN
   table2 AS n2
ON
   n1.case_barcode = n2.case_barcode
GROUP BY
  data1, data2
)
"""
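
To make the join-and-count step concrete, here is a plain-Python illustration of the same logic on a few made-up participants. The barcodes, the flags, and the helper name `summarize_pairs` are hypothetical; the real computation happens in BigQuery.

```python
from collections import Counter

# Hypothetical toy data: case barcode -> 'YES'/'NO' mutation flag per gene
toy_table1 = {"C1": "YES", "C2": "NO", "C3": "NO", "C4": "YES"}
toy_table2 = {"C1": "YES", "C2": "NO", "C3": "YES", "C4": "YES"}

def summarize_pairs(table1, table2):
    """Count participants per (data1, data2) pair, joining on shared barcodes."""
    shared = table1.keys() & table2.keys()  # mimics the INNER JOIN on case_barcode
    return Counter((table1[k], table2[k]) for k in shared)

toy_counts = summarize_pairs(toy_table1, toy_table2)
for (data1, data2), nij in sorted(toy_counts.items()):
    print(data1, data2, nij)
```

In this toy example no participant falls in the ('YES', 'NO') category, so that pair does not appear in the result at all, just as a `GROUP BY` produces no row for an empty group.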

At this point we can take a look at the output table, where the column **Nij** is the number of participants for each pair of categorical values.

In [None]:
sql_data = 'WITH\n'+ barcode_set+','+query1+','+query2+','+query_summarize

sql = (sql_data + '\n' +
"""SELECT * FROM summ_table
   ORDER BY  data1
""")

df_data = pandas_gbq.read_gbq(sql,project_id=my_project_id )

df_data

We can use a 'catplot' to visualize the populations in each category.

In [None]:
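# Label the columns with the gene names chosen in the parameters cell
df_data.rename(columns={"data1": mutation_name1, "data2": mutation_name2}, inplace=True)
sns.catplot(y=mutation_name1, x="Nij", hue=mutation_name2, data=df_data, kind="bar", height=4, aspect=.7)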

# Compute Statistics

After summarizing the data in the table above, we are in a position to perform the two-sided Fisher's exact test for the null hypothesis that no nonrandom association exists between the two categorical variables (somatic mutations). For clarity, we consider the following 2x2 contingency table.

|                | Gene2: YES | Gene2: NO |
|:--             |:--         |:--        |
| **Gene1: YES** | $a$        | $b$       |
| **Gene1: NO**  | $c$        | $d$       |

To compute the p-Value of Fisher's test, we need the hypergeometric probability of observing the value $x$ in the top-left cell while the row and column totals stay fixed:

$$Pr(x) = \frac{(a+b)!(c+d)!(a+c)!(b+d)! }{x!(a+b-x)!(a+c-x)!(d-a+x)!n!} $$

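where $n=a+b+c+d$. The p-Value is then the sum of $Pr(x)$ over all feasible values of $x$ whose probability does not exceed that of the observed table: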

$$p_{FET}(a,b,c,d) = \sum_{x} Pr(x) \ I\left[ Pr(x) \leq Pr(a) \right]  $$
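
As a cross-check of the two formulas above, the following is a minimal pure-Python sketch that evaluates them directly. It is not the source of the BigQuery function; the names `hypergeom_pr` and `fisher_exact_two_sided` and the small relative tolerance used when comparing probabilities are assumptions made for illustration. It rewrites $Pr(x)$ with binomial coefficients, which is algebraically the same as the factorial expression.

```python
from math import comb

def hypergeom_pr(x, a, b, c, d):
    """Pr(x): probability of x in the top-left cell for a table with the
    same row and column totals as [[a, b], [c, d]]."""
    n = a + b + c + d
    return comb(a + b, x) * comb(c + d, a + c - x) / comb(n, a + c)

def fisher_exact_two_sided(a, b, c, d, rel_tol=1e-7):
    """Sum Pr(x) over all feasible x with Pr(x) <= Pr(a)."""
    p_obs = hypergeom_pr(a, a, b, c, d)
    x_min = max(0, a - d)       # keeps d - a + x non-negative
    x_max = min(a + b, a + c)   # keeps a + b - x and a + c - x non-negative
    total = 0.0
    for x in range(x_min, x_max + 1):
        p_x = hypergeom_pr(x, a, b, c, d)
        # The tolerance (an assumption) guards against floating-point ties
        if p_x <= p_obs * (1 + rel_tol):
            total += p_x
    return min(total, 1.0)

# Hypothetical 2x2 table [[3, 1], [1, 3]]
p_example = fisher_exact_two_sided(3, 1, 1, 3)
print(p_example)
```

For the table $a=3$, $b=1$, $c=1$, $d=3$, the feasible probabilities are $1/70$, $16/70$, $36/70$, $16/70$, $1/70$, so the two-sided p-Value is $34/70 \approx 0.4857$.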

Computing $p_{FET}$ efficiently with standard BigQuery SQL would be very difficult because of the factorials. Instead, we take advantage of BigQuery's support for user-defined functions (UDFs) written in JavaScript. We implemented a public UDF called `p_fisherexact` that computes $p_{FET}$. The source code and an example of how to use this function in BigQuery can be found at: https://github.com/isb-cgc/Community-Notebooks/tree/master/BQUserFunctions#p_fisherexact

The following BigQuery query string computes $a$, $b$, $c$, and $d$ as indicated above and then calls the BigQuery function `bqutil.fn.p_fisherexact` to compute the p-Value of Fisher's exact test. The final `WHERE` clause returns a result only when all four cells of the contingency table are populated.

In [None]:
query_fishertest = """
SELECT a,b,c,d,
      bqutil.fn.p_fisherexact(a,b,c,d) as pValue
FROM (
SELECT
  MAX( IF( (data1='YES') AND (data2='YES'), Nij, NULL ) ) as a ,
  MAX( IF( (data1='YES') AND (data2='NO') , Nij, NULL ) ) as b ,
  MAX( IF( (data1='NO') AND (data2='YES') , Nij, NULL ) ) as c ,
  MAX( IF( (data1='NO') AND (data2='NO')  , Nij, NULL ) )  as d
FROM summ_table
)
WHERE a IS NOT NULL AND b IS NOT NULL AND c IS NOT NULL AND d IS NOT NULL
"""

sql = (  sql_data +  query_fishertest )

df_results = pandas_gbq.read_gbq(sql,project_id=my_project_id )

df_results
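
The query above yields an empty result whenever one of the four category pairs has no participants, because the corresponding cell is NULL. As a hedged alternative for inspecting such cases locally, the sketch below pivots summary rows into $a$, $b$, $c$, $d$ in plain Python and treats an absent pair as a count of zero. The helper name `contingency_cells`, the toy rows, and the zero-fill convention are assumptions, not part of the original notebook.

```python
def contingency_cells(rows):
    """Return (a, b, c, d) from rows with keys 'data1', 'data2', 'Nij'.
    Category pairs that are absent from the rows are counted as 0."""
    counts = {(r["data1"], r["data2"]): r["Nij"] for r in rows}
    a = counts.get(("YES", "YES"), 0)
    b = counts.get(("YES", "NO"), 0)
    c = counts.get(("NO", "YES"), 0)
    d = counts.get(("NO", "NO"), 0)
    return a, b, c, d

# Hypothetical summary rows; the ('YES', 'NO') pair is absent
toy_rows = [
    {"data1": "NO", "data2": "NO", "Nij": 1},
    {"data1": "NO", "data2": "YES", "Nij": 1},
    {"data1": "YES", "data2": "YES", "Nij": 2},
]
print(contingency_cells(toy_rows))
```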

To test our implementation, we can use the `fisher_exact` function from SciPy (`scipy.stats`), which performs a two-sided test by default.

In [None]:
a = df_results['a'][0]
b = df_results['b'][0]
c = df_results['c'][0]
d = df_results['d'][0]

oddsratio, pvalue = stats.fisher_exact([[a, b], [c, d]])
pvalue
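
Because the two p-Values come from different floating-point implementations, comparing them with a tolerance is more robust than checking exact equality. The following is a small sketch; the helper name `p_values_agree`, the tolerance, and the numbers in the example are hypothetical.

```python
import math

def p_values_agree(p_bigquery, p_reference, rel_tol=1e-6):
    """Return True when the two p-values match within a relative tolerance."""
    return math.isclose(p_bigquery, p_reference, rel_tol=rel_tol)

# Hypothetical values; in the notebook one would pass
# df_results['pValue'][0] and the SciPy pvalue instead
print(p_values_agree(0.4857142857, 34 / 70))
```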