<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/jacob-dev/MitelmanDB/Mitelman_DB_Queries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Navigating the Mitelman Database in BigQuery Using SQL

Check out other notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

- Title: Navigating the Mitelman Database in BigQuery Using SQL
- Author: Jacob Wilson
- Created: 2025-01-01
- URL: https://github.com/isb-cgc/Community-Notebooks/blob/jacob-dev/MitelmanDB/Mitelman_DB_Queries.ipynb
- Purpose: Introduce the Mitelman Database BigQuery Tables and how to obtain information using basic SQL queries.
<br/>

In this notebook, we will use SQL queries to obtain various types of data available in the Mitelman Database BigQuery tables.

## BigQuery and SQL Background
BigQuery is Google's cloud data warehouse platform for storing and analyzing tabular data. ISB-CGC uses it heavily to store data and to give users extensive data analysis capabilities.

Users interact with BigQuery tables by writing SQL queries, for example in the Google Cloud Console. SQL (Structured Query Language) is a widely used language for working with database tables, and its syntax is easy to understand.


# Mitelman Database Table Descriptions

|Table              |Data Type         |Table Description    |
|:-------------------|:------------------|:---------------------|
|AuthorReference    | LITERATURE       | Data was compiled from the original publications used in the Mitelman database. Author names were manually curated for each assigned reference number. The order of the names in the list of authors for the given reference is recorded.                |
|CytoBands_hg38     | CYTOGENETIC      | Data was extracted from the UCSC database. Start and end coordinates of each cytoband were obtained for all chromosomes based on genome build hg38.                     |
|CytoConverted      | CYTOGENETIC      | Data was processed using karyotype nomenclature from the CytoConverterSample table. Chromosomal imbalances and associated genomic coordinates were generated by the CytoConverter software. More details: see https://github.com/isb-cgc/ISB-CGC-CytoConverter/                |
|CytoConvertedLog   | METADATA         | Data was generated by the CytoConverter software. Warnings and errors were returned from the software while processing karyotype nomenclature from the CytoConverterSample table. More details: see https://github.com/isb-cgc/ISB-CGC-CytoConverter/               |
|Cytogen            | CLINICAL DATA    | Data was compiled from the original publications used in the Mitelman database. Patient demographics and disease data were manually curated for each assigned reference and case number.               |
|CytogenInv         | CYTOGENETIC      | Data was compiled from the original publications used in the Mitelman database. Karyotype nomenclature data were manually curated for each assigned reference, case, and investigation number.                |
|CytogenInvValid    | CYTOGENETIC      | Data was generated using the CytogenInv table. Karyotype nomenclature was processed using a syntax checker that validated the string for use with the CytoConverter software.               |
|KaryAbnorm         | CYTOGENETIC      | Data was generated using the CytogenInv table. Individual karyotype nomenclature abnormalities are maintained for searching purposes.               |
|KaryBit            | CYTOGENETIC      | Data was generated using the CytogenInv table. Includes all elements of karyotype nomenclature when split on commas.               |
|KaryBreak          | CYTOGENETIC      | Data was generated using the CytogenInv table. Breakpoints were extracted from the karyotype nomenclature.                |
|KaryClone          | CYTOGENETIC      | Data was generated using the CytogenInv table. Contains separated clones from karyotype nomenclature.               |
|Koder              | CLINICAL SUPPLEMENT | Data was compiled from the original publications used in the Mitelman database. Various fields of data were assigned a unique code to be used as a reference in other database tables.               |
|MolBiolClinAssoc   | CYTOGENETIC      | Data was compiled from the original publications used in the Mitelman database. Gene fusions associated with karyotype nomenclature were manually curated for each reference.               |
|MolClinAbnorm      | CYTOGENETIC      | Data was generated using the MolBiolClinAssoc table. Karyotype nomenclature was split into individual abnormalities.               |
|MolClinBreak       | CYTOGENETIC      | Data was generated using the MolBiolClinAssoc table. Breakpoints were extracted from the karyotype nomenclature.               |
|MolClinGene        | GENE FUSION      | Data was compiled from the original publications used in the Mitelman database. Gene fusions were manually curated for each reference.               |
|RecurrentData      | GENE FUSION      | Data was aggregated from the separated chromosomal abnormalities. Frequency counts of each structural abnormality are recorded.                |
|RecurrentNumData   | GENE FUSION      | Data was aggregated from the separated chromosomal abnormalities. Frequency counts of each numerical abnormality are recorded.                |
|Reference          | LITERATURE       | Data was compiled from the original publications used in the Mitelman database. Information associated with the reviewed publication is recorded and assigned a reference number.               |

## Potential columns of interest


Find cytogenetic abnormalities and gene fusions:

| Table              |Column Name         |Description |
|:-------------------|:------------------|:--------|
|CytoConverted    | Type       |Type of copy number aberration as determined by CytoConverter (gain or loss).|
|     | Start      |Start coordinate of the aberration identified by CytoConverter. Based on genome build hg38.|
|    | End       |End coordinate of the aberration identified by CytoConverter. Based on genome build hg38.|
|     | Chr      |Chromosome (1-22, X, or Y) in the format "chr##".|
|CytogenInv    | KaryShort       |Short (possibly truncated) karyotype description|
|KaryAbnorm     | Abnormality      |Abnormality description, e.g., 'del(15)(q22q24)', 't(17;17)(p11;q11)', '-22', etc.|
|KaryBreak     | Breakpoint      |Cytogenetic band location of abnormality, e.g., 1p21, as defined by ISCN.|
|KaryClone | ChromoMax |The maximum modal number present in the karyotype nomenclature. |
| |ChromoMin |The minimum modal number present in the karyotype nomenclature. |
|MolClinGene|Gene |HGNC gene symbol |

Find information relevant to publications:

|Table              |Column Name         |Description |
|:-------------------|:------------------|:--------|
|AuthorReference |Name |Author name |
|Reference |Journal |Journal name abbreviation, e.g., 'Cancer Genet Cytogenet' |
| |Year |Publication year |

Find information relevant to disease:

|Table              |Column Name         |Description |
|:-------------------|:------------------|:--------|
|Cytogen |Morph |Unique code indicating the morphology type, as listed in the Koder table. |
| |Topo |Unique code indicating the topography type, as listed in the Koder table. |
|CytogenInv |Tissue |Indicates the type of tissue. "BM" means Bone Marrow, "TB" means Tumor biopsy, "LN" means Lymph node, "EX" means Exudate, and "CSF" means Cerebrospinal fluid. |

Find general information:

|Table              |Column Name         |Description |
|:-------------------|:------------------|:--------|
|Koder |Benamning |The 'long' name corresponding to this code, e.g., 'Vascular and perivascular tumors (all subtypes)' |
| |Kod |Code number, whose meaning depends on the KodTyp |
| |KodTyp |The type of the code. There are 7 types: 'MORPH' (morphology), 'GEO' (geography), 'TOP' (topography), 'HER' (heredity), 'TISSUE', 'TREAT' (treatment), and 'RACE' |
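
Because a `Kod` value is only meaningful together with its `KodTyp`, a lookup keyed on both fields is one way to translate codes into readable names. The following is a minimal sketch, not part of the database tooling: the rows are copied from the query output shown later in this notebook, and `koder_lookup` and `describe_code` are hypothetical names introduced for illustration. Codes are kept as strings so that leading zeros survive.

```python
# Sample Koder rows copied from the query output later in this notebook.
# Each row is (Kod, KodTyp, Benamning). Codes are strings to keep leading zeros.
koder_rows = [
    ("8725", "MORPH", "Aneurysmal bone cyst"),
    ("1115", "MORPH", "Acute basophilic leukemia"),
    ("0703", "TOP", "Adrenal"),
    ("0230", "TOP", "Anus"),
]

# Key the lookup on (KodTyp, Kod): a code is only meaningful with its type.
koder_lookup = {(kod_typ, kod): name for kod, kod_typ, name in koder_rows}


def describe_code(kod_typ, kod):
    """Return the long name for a code, or None if it is not in the sample."""
    return koder_lookup.get((kod_typ, kod))


print(describe_code("MORPH", "1115"))
print(describe_code("TOP", "0703"))
```

Looking up a code under the wrong type (for example `describe_code("TOP", "1115")`) returns `None` in this sketch, which is the behavior the two-part key is meant to show.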

## Initialize Notebook Environment

Before beginning, we first need to load dependencies and authenticate to BigQuery.

## Import Dependencies

In [1]:
# GCP libraries
from google.cloud import bigquery
from google.colab import auth

## Authenticate

To use BigQuery, we must first authenticate to Google Cloud.

In [2]:
# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()

# alternatively, use the gcloud SDK
#!gcloud auth application-default login

## Google project ID

Set your own Google project ID for use with this notebook.

In [3]:
# set the google project that will be billed for this notebook's computations
google_project = 'isb-cgc-notebook-dev'  ## change this

## BigQuery Client

In [4]:
# Initialize a client to access the data within BigQuery
if google_project == 'your_project_id':
    print('Please update the project ID with your Google Cloud Project')
else:
    client = bigquery.Client(project=google_project)

# set the Mitelman Database project
bq_project = 'mitelman-db'
bq_dataset = 'prod'
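
The client cell above only prints a message when the project ID is still a placeholder, which leaves `client` undefined for the later cells. One possible alternative, sketched below under that assumption, is to fail early with an exception. `require_project_id` and `PLACEHOLDER_PROJECT` are hypothetical names, and `my-billing-project` is a made-up project ID.

```python
PLACEHOLDER_PROJECT = "your_project_id"  # assumed placeholder value


def require_project_id(project_id):
    """Return the project ID, or raise if it is empty or still the placeholder."""
    if not project_id or project_id == PLACEHOLDER_PROJECT:
        raise ValueError(
            "Please update the project ID with your Google Cloud Project"
        )
    return project_id


# A filled-in ID passes through unchanged (hypothetical project ID).
checked = require_project_id("my-billing-project")
print(checked)
```

The returned value could then be passed to `bigquery.Client` so that a missing project ID stops the notebook at this cell rather than at the first query.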

# Querying for Information

In [8]:
# Find the column names of the 'Cytogen' table. The 'table_name'
# value can be changed to find the column names of other tables.
col_names = f'''
SELECT column_name
FROM `{bq_project}`.{bq_dataset}.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'Cytogen'
ORDER BY column_name
'''

# Return the distinct code types.
kod_types = f'''
SELECT DISTINCT KodTyp
FROM `{bq_project}.{bq_dataset}.Koder`
'''

# Find the code, full name, and abbreviations for the code type
# 'MORPH'.
all_morph = f'''
SELECT Kod, KodTyp, Benamning AS FullName, Kortnamn AS ShortName
FROM `{bq_project}.{bq_dataset}.Koder`
WHERE KodTyp = "MORPH"
ORDER BY Kortnamn
'''

# Same as above but for the code type 'TOP'
all_topo = f'''
SELECT Kod, KodTyp, Benamning AS FullName, Kortnamn AS ShortName
FROM `{bq_project}.{bq_dataset}.Koder`
WHERE KodTyp = "TOP"
ORDER BY Kortnamn
'''

In [9]:
columns = client.query(col_names).result().to_dataframe()
codes = client.query(kod_types).result().to_dataframe()
morph = client.query(all_morph).result().to_dataframe()
topo = client.query(all_topo).result().to_dataframe()

In [10]:
print(columns.head())
print(codes.head())
print(morph.head())
print(topo.head())

  column_name
0         Age
1      CaseNo
2   CaseOrder
3     Country
4      HerDis
  KodTyp
0    GEO
1    HER
2    TOP
3   RACE
4  MORPH
    Kod KodTyp                           FullName ShortName
0  8725  MORPH               Aneurysmal bone cyst       ABC
1  1115  MORPH          Acute basophilic leukemia       ABL
2  1303  MORPH  Atypical chronic myeloid leukemia      ACML
3  1117  MORPH        Acute eosinophilic leukemia       AEL
4  8506  MORPH   Angiomatoid fibrous histiocytoma       AFH
    Kod KodTyp                           FullName      ShortName
0  0703    TOP                            Adrenal        Adrenal
1  0230    TOP                               Anus           Anus
2  1201    TOP                       Blood vessel      Bl vessel
3  0305    TOP                            Bladder        Bladder
4    09    TOP  Bone and soft tissues (all sites)  Bone and soft
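
Since the `table_name` value in the column query can be swapped for any other table, the query text can be generated by a small helper. This is a sketch only: `column_names_query` is a hypothetical function, its defaults repeat the `bq_project` and `bq_dataset` values set earlier, and the identifier check is an added precaution rather than something the original query requires.

```python
def column_names_query(table_name, bq_project="mitelman-db", bq_dataset="prod"):
    """Build the INFORMATION_SCHEMA query that lists a table's columns."""
    # Accept only letters, digits, and underscores so the name is safe to
    # place directly in the SQL text.
    if not table_name.replace("_", "").isalnum():
        raise ValueError(f"Unexpected table name: {table_name!r}")
    return f"""
SELECT column_name
FROM `{bq_project}`.{bq_dataset}.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = '{table_name}'
ORDER BY column_name
"""


koder_columns_sql = column_names_query("Koder")
print(koder_columns_sql)
```

The resulting string can be passed to `client.query(...)` in the same way as `col_names` above.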


# Using SQL Aggregate Functions

Aggregate functions in SQL are useful for generating basic statistics. Commonly used aggregate functions include:
- COUNT(): returns the number of rows in a chosen set
- MIN(): minimum value in the column
- MAX(): maximum value in the column
- AVG(): average of values in a column
- SUM(): total sum of a column

It is often necessary to pair aggregate functions with a GROUP BY clause, which defines the sets of rows that each function is applied to.
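
To see how these functions behave with and without GROUP BY before running anything against BigQuery, the sketch below uses Python's built-in `sqlite3` module on a tiny in-memory table. This is SQLite rather than BigQuery, and the rows are made-up toy data, not values from the Mitelman Database. The aggregate functions shown work the same way for this simple case.

```python
import sqlite3

# Toy rows (hypothetical, not from the Mitelman Database): (Sex, Age)
rows = [("M", 10), ("M", 30), ("F", 20), ("F", 40), ("F", 60)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (Sex TEXT, Age INTEGER)")
conn.executemany("INSERT INTO cases VALUES (?, ?)", rows)

# Without GROUP BY, each aggregate is computed over the whole table.
overall = conn.execute(
    "SELECT COUNT(*), MIN(Age), MAX(Age), AVG(Age), SUM(Age) FROM cases"
).fetchone()

# With GROUP BY, each aggregate is computed once per group (here, per Sex).
per_sex = conn.execute(
    "SELECT Sex, COUNT(*), AVG(Age) FROM cases GROUP BY Sex ORDER BY Sex"
).fetchall()
conn.close()

print(overall)  # (5, 10, 60, 32.0, 160)
print(per_sex)  # [('F', 3, 40.0), ('M', 2, 20.0)]
```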

In [11]:
# Count the number of 'M' and 'F' entries in the 'Sex' column
count_sex = f'''
SELECT
SUM(IF(Sex = 'M', 1, 0)) AS Male,
SUM(IF(Sex = 'F', 1, 0)) AS Female
FROM `{bq_project}.{bq_dataset}.Cytogen`
'''

# Select all ages and provide a count for each age
age_stats = f'''
SELECT Age, COUNT(Age) AS Count
FROM `{bq_project}.{bq_dataset}.Cytogen`
GROUP BY Age
ORDER BY Age
'''

# Find the minimum, maximum, and average of ages
age_range = f'''
SELECT MIN(Age) AS min_age, MAX(Age) AS max_age, AVG(Age) AS avg_age
FROM `{bq_project}.{bq_dataset}.Cytogen`
'''

# Find gene fusions and count the number of occurrences
query_fusions = f'''
SELECT g.Gene, COUNT(g.Gene) AS Count
FROM `{bq_project}.{bq_dataset}.MolClinGene` g
-- gene name for fusions is double-colon separated gene pair
WHERE g.Gene LIKE "%::%"
GROUP BY g.Gene
ORDER BY Count DESC
'''
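
Note that the `age_stats` query uses `COUNT(Age)`, which counts only non-NULL values, so a group of cases with no recorded age is reported with a count of 0. `COUNT(*)` counts every row in the group instead. The sketch below illustrates the difference with Python's built-in `sqlite3` module and made-up toy ages. It is SQLite rather than BigQuery, but both treat NULLs in `COUNT` the same way.

```python
import sqlite3

# Toy ages (hypothetical): two cases have no recorded age (NULL).
ages = [(5,), (5,), (7,), (None,), (None,)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (Age INTEGER)")
conn.executemany("INSERT INTO cases VALUES (?)", ages)

# COUNT(Age) skips NULLs; COUNT(*) counts every row in the group.
by_age = conn.execute(
    "SELECT Age, COUNT(Age), COUNT(*) FROM cases GROUP BY Age ORDER BY Age"
).fetchall()
conn.close()

print(by_age)  # [(None, 0, 2), (5, 2, 2), (7, 1, 1)]
```

If the number of cases without an age is of interest, replacing `COUNT(Age)` with `COUNT(*)` in `age_stats` would be one way to obtain it.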

In [12]:
counts = client.query(count_sex).result().to_dataframe()
ages = client.query(age_stats).result().to_dataframe()
# store the result under a new name so the 'age_range' query string is not overwritten
age_summary = client.query(age_range).result().to_dataframe()
fusions = client.query(query_fusions).result().to_dataframe()

In [13]:
print('Number of male and female cases:')
print(counts)
print('\nAge distribution:')
print(ages)
print('\nAge range:')
print(age_summary)
print('\nTypes and counts of fusions:')
print(fusions)

Number of male and female cases:
    Male  Female
0  43488   34542

Age distribution:
      Age  Count
0    <NA>      0
1       0   1625
2       1   1019
3       2   1305
4       3   1177
..    ...    ...
96     95      9
97     96      7
98     97      3
99     98      3
100   100      1

[101 rows x 2 columns]

Age range:
   min_age  max_age    avg_age
0        0      100  43.435255

Types and counts of fusions:
                 Gene  Count
0           BCR::ABL1    422
1      RUNX1::RUNX1T1    134
2           PML::RARA     96
3         ETV6::RUNX1     84
4         ETV6::NTRK3     74
...               ...    ...
34411   FOXP1::HOXA10      1
34412     ZNF106::PML      1
34413    CDK12::KCNRG      1
34414    ATAD5::CDK12      1
34415     PRDM16::SKI      1

[34416 rows x 2 columns]
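
Because fusion names are double-colon separated gene pairs, the partner genes can be recovered with a simple string split once the results are in Python. The following is a minimal sketch: `split_fusion` and `is_fusion` are hypothetical helpers, the fusion names are taken from the output above, and `TP53` is included only as an illustrative non-fusion entry.

```python
def is_fusion(gene):
    """Mirror the SQL filter LIKE '%::%': true when the name contains '::'."""
    return "::" in gene


def split_fusion(gene):
    """Split a fusion name such as 'BCR::ABL1' into its partner gene symbols."""
    return gene.split("::")


# Two fusion names from the output above, plus one illustrative single gene.
names = ["BCR::ABL1", "PML::RARA", "TP53"]
partners = [split_fusion(name) for name in names if is_fusion(name)]

print(partners)  # [['BCR', 'ABL1'], ['PML', 'RARA']]
```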


## Conclusion

In this notebook we introduced basic SQL as a method for obtaining data in the Mitelman database BigQuery tables. We also provided detailed descriptions of the database tables to help with database navigation. In follow-up notebooks we will demonstrate table joins as a method for performing more complex queries.