<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/tree/master/MitelmanDB/Mitelman_Cytogenetics_Subsets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cytogenetics and Data Subsets in the Mitelman Database

Check out other notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

```
Title:    Cytogenetics and Data Subsets in the Mitelman Database
Author:   Jacob Wilson
Created:  2023-08-21
URL:      https://github.com/isb-cgc/Community-Notebooks/tree/master/MitelmanDB/Mitelman_Cytogenetics_Subsets.ipynb
Purpose:  Demonstrate examples of unique subsets of the Mitelman Database that may be useful in Cytogenetics research.
```

In this notebook, we will explore multiple methods for subsetting the Mitelman dataset into groupings that are relevant to Cytogenetics research. The goal of this exercise is to show how the Mitelman Database can be used in BigQuery to perform research on various groupings of cytogenetic abnormalities. In the following examples, we will:

utilize CytoConverter coordinates to:
- target specific gene loci and groups of genes
- compare to microarray copy number data
- compare large regions of the chromosome

determine complexity of the cytogenetic nomenclature to:
- compare complex vs. non-complex cases

## Cytogenetics Background

The field of Cytogenetics is based on a chromosome-level understanding of the genome. The method most commonly associated with Cytogenetics is the use of banded metaphase chromosomes to generate a karyogram. Analysis of the karyogram results in standardized nomenclature that follows guidelines detailed in the International System for Human Cytogenomic Nomenclature (ISCN) [1].

Karyotype nomenclature is a comma-separated list that follows a standardized convention:  

"number of chromosomes,sex chromosomes,abnormalities"

Examples:

>46,XX

>46,XY,t(9;22)(q34;q11.2)

>47,XY,+8[3]/46,idem,del(5)(q13q33),+21[7]

When more than one clone is present, the cell lines are separated by a forward slash and the number of cells present in each clone is included in brackets (third example above). Abnormalities are identified by an abbreviated term along with the chromosome number and breakpoints.

Examples:

>t(9;22)(q34;q11.2) -> translocation between chromosomes 9 and 22 at breakpoints 9q34 and 22q11.2 (BCR::ABL1 fusion)

>del(5)(q13q33) -> deletion of chromosome 5 from breakpoint 5q13 to 5q33
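
The structure described above (clones separated by `/`, an optional cell count in brackets, and comma-separated fields) can be illustrated with a small parser. This is only a minimal sketch for simple strings like the examples shown here; it is not a full ISCN parser, and the function name `parse_karyotype` is a hypothetical helper introduced for illustration.

```python
import re

def parse_karyotype(karyotype):
    """Split a karyotype string into clones (simplified, not a full ISCN parser)."""
    clones = []
    for clone_text in karyotype.split("/"):
        # separate the optional trailing cell count, e.g. "[3]"
        match = re.fullmatch(r"(.*?)(?:\[(\d+)\])?", clone_text.strip())
        body, cells = match.group(1), match.group(2)
        fields = body.split(",")
        chromosome_count, rest = fields[0], fields[1:]
        # the second field is the sex chromosomes unless the clone refers
        # back to the stemline (e.g. "idem"), which is kept as a marker
        sex_chromosomes = None
        if rest and re.fullmatch(r"[XY]+", rest[0]):
            sex_chromosomes = rest.pop(0)
        clones.append({
            "chromosome_count": chromosome_count,
            "sex_chromosomes": sex_chromosomes,
            "abnormalities": rest,
            "cells": int(cells) if cells else None,
        })
    return clones

clones = parse_karyotype("47,XY,+8[3]/46,idem,del(5)(q13q33),+21[7]")
for clone in clones:
    print(clone)
```

Splitting on commas works here because breakpoints inside an abnormality are separated by semicolons rather than commas.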

</br>

As a clinical test in oncology, Cytogenetics is often used in the diagnosis and staging of disease. Based on the abnormalities identified in the nomenclature string, specific markers may identify and clarify the type of cancer present. One common use case is the International Prognostic Scoring System - Revised (IPSS-R) for staging patients with myelodysplastic syndromes (MDS) [2,3]. The table below lists the cytogenetic abnormalities and the associated cytogenetic prognostic subgroup.

</br>

|Cytogenetic prognostic subgroup|Cytogenetic Abnormalities|
|:-                             |                       :-|
|Very good                      |-Y, del(11q)             |
|Good                           |normal, del(5q), del(12p), del(20q), double including del(5q)|
|Intermediate                   |del(7q), +8, +19, i(17q), and other single or independent double clones|
|Poor                           |-7, inv(3)/t(3q)/del(3q), double including -7/del(7q), Complex: 3 abns|
|Very poor                      |Complex: >3 abns|
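
The two "Complex" rows of the table depend only on the number of abnormalities, so that part can be written as a simple rule. The sketch below covers only those two rows; the function name `complexity_subgroup` is hypothetical, and it is an illustration of the table, not a clinical scoring tool.

```python
def complexity_subgroup(num_abnormalities):
    """Return the subgroup implied by abnormality count alone, per the table above."""
    if num_abnormalities > 3:
        return "Very poor"   # Complex: >3 abnormalities
    if num_abnormalities == 3:
        return "Poor"        # Complex: 3 abnormalities
    # with fewer than three abnormalities the subgroup depends on
    # which specific abnormalities are present, so no answer is given
    return None

for n in (1, 3, 5):
    print(n, complexity_subgroup(n))
```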


## Initialize Notebook Environment

Before beginning, we first need to load dependencies and authenticate to BigQuery.

## Import Dependencies

In [None]:
# GCP libraries
from google.cloud import bigquery
from google.colab import auth

# data analysis libraries
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

## Authenticate

In order to utilize BigQuery, we must obtain authorization to BigQuery and Google Cloud.

In [None]:
# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()

# alternatively, use the gcloud SDK
#!gcloud auth application-default login

## Google project ID

Set your own Google project ID for use with this notebook.

In [None]:
# set the google project that will be billed for this notebook's computations
google_project = 'your_project_id'  ## change this

## BigQuery Client

In [None]:
# Initialize a client to access the data within BigQuery
if google_project == 'your_project_id':
    print('Please update the project ID with your Google Cloud Project')
else:
    client = bigquery.Client(google_project)

# set the Mitelman Database project
bq_project = 'mitelman-db'
bq_dataset = 'prod'

# Creating Unique Data Subsets

## Utilizing CytoConverter Genomic Coordinates

The CytoConverter component of the Mitelman Database converts the chromosomal bands present in the cytogenetic nomenclature into corresponding genomic coordinates. Using the **CytoConverted** BigQuery table with these converted genomic coordinates, we can query for specific gene coverage or genomic regions based on our interests. This method may be useful if you have existing sequencing data or microarray copy number data and would like to find Mitelman cases that may correspond based on their cytogenetic nomenclature.

Gene regions can be targeted using individual genes or combinations of genes.

|Gene      |Chromosome|Coordinates            |
|:-        |:-        |:-                     |
|*CDKN2A*  |9         |21,967,752-21,995,324  |
|*ATM*     |11        |108,223,067-108,369,102|
|*RB1*     |13        |48,303,751-48,481,890  |
|*TP53*    |17        |7,668,421-7,687,490    |
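
The queries in this section all rest on one test: a converted segment is kept when it lies on the gene's chromosome, starts before the gene, and ends after it. A minimal sketch of that test in plain Python is shown here. The `Segment` type, the `covers` helper, and the three segments are hypothetical and are not rows from the database; only the gene coordinates come from the table above.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    chr_ord: int
    start: int
    end: int
    type: str

# gene coordinates taken from the table above: (chromosome, start, end)
GENES = {
    "CDKN2A": (9, 21967752, 21995324),
    "TP53": (17, 7668421, 7687490),
}

def covers(segment, gene):
    """True when the segment spans the whole gene on the same chromosome."""
    chrom, gene_start, gene_end = GENES[gene]
    return (segment.chr_ord == chrom
            and segment.start < gene_start
            and segment.end > gene_end)

# hypothetical segments, for illustration only
segments = [
    Segment(9, 0, 43000000, "Loss"),          # spans CDKN2A
    Segment(9, 30000000, 138000000, "Gain"),  # starts after CDKN2A
    Segment(17, 0, 10800000, "Loss"),         # spans TP53
]

for gene in GENES:
    hits = [s for s in segments if covers(s, gene)]
    print(gene, hits)
```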

Example of microarray copy number nomenclature:  
>arr[GRCh38] (X,Y) 13q13.3q31.1(34901394_83018282)x1,20q11.23q13.2(35892594_43210401)x1

### Single locus query

In [None]:
# values required for an individual gene: chromosome number, start and end coordinate
target_chr = 9
target_start = 21967752
target_end = 21995324

# including the 9p11.2 coordinate for limiting to p-arm example
target_p_prox_band = 42200000

# selecting a single locus, CDKN2A
single_locus = f'''
SELECT c.RefNo,
      c.CaseNo,
      c.InvNo,
      c.Clone,
      c.ChrOrd,
      c.Start,
      c.End,
      c.Type
FROM `{bq_project}.{bq_dataset}.CytoConverted` c
    WHERE c.ChrOrd = {target_chr}
        AND c.Start < {target_start}
        AND c.End > {target_end}
'''
#print(single_locus)

# selecting a single locus, but limit to just the p-arm of chromosome 9
single_locus_arm = f'''
SELECT c.RefNo,
      c.CaseNo,
      c.InvNo,
      c.Clone,
      c.ChrOrd,
      c.Start,
      c.End,
      c.Type
FROM `{bq_project}.{bq_dataset}.CytoConverted` c
    WHERE c.ChrOrd = {target_chr}
        AND c.Start < {target_start}
        AND c.End BETWEEN {target_end} AND {target_p_prox_band}
'''
#print(single_locus_arm)

# selecting a single locus, limit to a smaller region
# The coordinates used in this example represent bands near the target locus. Chromosome band-level coordinates are
# available in the UCSC genome browser.
single_locus_focused = f'''
SELECT c.RefNo,
      c.CaseNo,
      c.InvNo,
      c.Clone,
      c.ChrOrd,
      c.Start,
      c.End,
      c.Type
FROM `{bq_project}.{bq_dataset}.CytoConverted` c
    WHERE c.ChrOrd = {target_chr}
        AND c.Start BETWEEN 18500000 AND {target_start}
        AND c.End BETWEEN {target_end} AND 33200000
'''
#print(single_locus_focused)
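
The three queries above differ only in their `WHERE` bounds, so they could also be generated from one helper. The following is a sketch under that assumption; `build_locus_query` and its `min_start`/`max_end` parameters are hypothetical names, and the helper only builds the SQL text without contacting BigQuery.

```python
def build_locus_query(table, chr_ord, start, end, min_start=None, max_end=None):
    """Build a CytoConverted locus query; optional bounds narrow the window."""
    conditions = [f"c.ChrOrd = {int(chr_ord)}"]
    if min_start is None:
        conditions.append(f"c.Start < {int(start)}")
    else:
        conditions.append(f"c.Start BETWEEN {int(min_start)} AND {int(start)}")
    if max_end is None:
        conditions.append(f"c.End > {int(end)}")
    else:
        conditions.append(f"c.End BETWEEN {int(end)} AND {int(max_end)}")
    where = "\n        AND ".join(conditions)
    return (
        "SELECT c.RefNo, c.CaseNo, c.InvNo, c.Clone, c.ChrOrd, c.Start, c.End, c.Type\n"
        f"FROM `{table}` c\n"
        f"    WHERE {where}"
    )

table = "mitelman-db.prod.CytoConverted"
whole = build_locus_query(table, 9, 21967752, 21995324)
p_arm = build_locus_query(table, 9, 21967752, 21995324, max_end=42200000)
focused = build_locus_query(table, 9, 21967752, 21995324,
                            min_start=18500000, max_end=33200000)
print(focused)
```

Casting every coordinate with `int()` keeps unexpected text out of the generated SQL.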

In [None]:
# run the queries and store results in dataframes
cdkn2a_whole_chrom = client.query(single_locus).result().to_dataframe()
cdkn2a_p_arm = client.query(single_locus_arm).result().to_dataframe()
cdkn2a_focused = client.query(single_locus_focused).result().to_dataframe()

In [None]:
# view the subsets
print(f'Count of CDKN2A aberrations: {len(cdkn2a_whole_chrom.index)}')
print(cdkn2a_whole_chrom.head())

print(f'Count of CDKN2A aberrations limited to p-arm: {len(cdkn2a_p_arm.index)}')
print(cdkn2a_p_arm.head())

print(f'Count of CDKN2A aberrations limited to neighboring bands: {len(cdkn2a_focused.index)}')
print(cdkn2a_focused.head())

In [None]:
# plot the frequency of CDKN2A aberrations identified in the three subsets
sns.set(rc={"figure.figsize":(8, 8)})

cdkn2a_whole_chrom['ds'] = "whole_chrom"
cdkn2a_p_arm['ds'] = "p_arm"
cdkn2a_focused['ds'] = "focused"
merged = pd.concat([cdkn2a_whole_chrom, cdkn2a_p_arm, cdkn2a_focused])

hue_order = ['Gain', 'Loss']
combined_cases_plot = sns.countplot(x='ds', hue='Type', hue_order=hue_order, data=merged)

combined_cases_plot.set(xlabel="Query Subset",
              ylabel="Occurrences",
              title="CDKN2A Loss vs. Gain Occurrences")
plt.show()

#### Results

The graph shows that when looking at abnormal segments across the entire chromosome 9, the majority of the abnormality types are gains. When we look at just the p-arm or the focused segments, however, we see far more losses. As expected, reducing the size of the query window also reduces the total number of cases returned. Karyotypes in cancer cases are typically low resolution, so even though the CytoConverter coordinates allow us to investigate very small regions of the genome, a query window that is too small may return no cases at all.
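
The comparison made by the plot can also be read as a table of counts per subset and abnormality type. The sketch below tabulates hypothetical `(subset, Type)` pairs with the standard library; the rows are invented for illustration and are not results from the queries above.

```python
from collections import Counter

# hypothetical (subset, Type) pairs standing in for the 'ds' and 'Type' columns
rows = [
    ("whole_chrom", "Gain"), ("whole_chrom", "Gain"), ("whole_chrom", "Loss"),
    ("p_arm", "Loss"), ("p_arm", "Loss"), ("p_arm", "Gain"),
    ("focused", "Loss"),
]

counts = Counter(rows)
for subset in ("whole_chrom", "p_arm", "focused"):
    gain = counts[(subset, "Gain")]
    loss = counts[(subset, "Loss")]
    print(f"{subset:12s} Gain={gain} Loss={loss}")
```

With the notebook's dataframe, `pd.crosstab(merged['ds'], merged['Type'])` should give the same kind of table directly.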

### Multiple loci query

In [None]:
# values for a collection of genes of interest
gene_coords = {
    "atm": {"chr": 11,
             "start": 108223067,
             "end": 108369102},
    "rb1": {"chr": 13,
             "start": 48303751,
             "end": 48481890},
    "tp53": {"chr": 17,
             "start": 7668421,
             "end": 7687490},
}

# selecting multiple loci: in this example, cases containing a loss of RB1 and TP53 along with gain of ATM
multi_loci = f'''
WITH
    rb1 AS (SELECT c.* FROM `{bq_project}.{bq_dataset}.CytoConverted` c
        WHERE c.ChrOrd = {gene_coords["rb1"]["chr"]}
            AND c.Start < {gene_coords["rb1"]["start"]}
            AND c.End > {gene_coords["rb1"]["end"]}
            AND c.Type = 'Loss'),
    tp53 AS (SELECT c.* FROM `{bq_project}.{bq_dataset}.CytoConverted` c
        WHERE c.ChrOrd = {gene_coords["tp53"]["chr"]}
            AND c.Start < {gene_coords["tp53"]["start"]}
            AND c.End > {gene_coords["tp53"]["end"]}
            AND c.Type = 'Loss'),
    atm AS (SELECT c.* FROM `{bq_project}.{bq_dataset}.CytoConverted` c
        WHERE c.ChrOrd = {gene_coords["atm"]["chr"]}
            AND c.Start < {gene_coords["atm"]["start"]}
            AND c.End > {gene_coords["atm"]["end"]}
            AND c.Type = 'Gain')
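SELECT DISTINCT a.RefNo,
                a.CaseNo,
                a.InvNo,
                r.ChrOrd AS rb1_ChrOrd,
                r.Start AS rb1_Start,
                r.End AS rb1_End,
                t.ChrOrd AS tp53_ChrOrd,
                t.Start AS tp53_Start,
                t.End AS tp53_End,
                m.ChrOrd AS atm_ChrOrd,
                m.Start AS atm_Start,
                m.End AS atm_End,
                n.KaryShort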
FROM (
        (SELECT RefNo, CaseNo, InvNo FROM rb1)
        INTERSECT DISTINCT
        (SELECT RefNo, CaseNo, InvNo FROM tp53)
        INTERSECT DISTINCT
        (SELECT RefNo, CaseNo, InvNo FROM atm)
     ) a
JOIN rb1 r
  ON r.RefNo = a.RefNo
  AND r.CaseNo = a.CaseNo
  AND r.InvNo = a.InvNo
JOIN tp53 t
  ON t.RefNo = a.RefNo
  AND t.CaseNo = a.CaseNo
  AND t.InvNo = a.InvNo
JOIN atm m
  ON m.RefNo = a.RefNo
  AND m.CaseNo = a.CaseNo
  AND m.InvNo = a.InvNo
JOIN `{bq_project}.{bq_dataset}.CytogenInv` n
  ON a.RefNo = n.RefNo
  AND a.CaseNo = n.CaseNo
  AND a.InvNo = n.InvNo
ORDER BY a.RefNo,
        a.CaseNo,
        a.InvNo
'''
#print(multi_loci)

In [None]:
# run the query and store results in a dataframe
rb1_tp53_atm_matches = client.query(multi_loci).result().to_dataframe()

# view the results
print(f'RB1/TP53/ATM aberration occurrences: {len(rb1_tp53_atm_matches)}')
print(rb1_tp53_atm_matches.head())

# view a specific example
print(f'Example nomenclature from one of these cases: {rb1_tp53_atm_matches.loc[6,"KaryShort"]}')

[View the example case in the Mitelman web app](https://mitelmandatabase.isb-cgc.org/kary_details?refno=7768&caseno=337&invno=1)

#### Results

By combining multiple subqueries, we can create a dataset that includes only cases containing our regions of interest. This technique can be useful if you are interested in exploring cases that may show associations between different genes or regions of the genome.
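
The `INTERSECT DISTINCT` step in the query keeps only the `(RefNo, CaseNo, InvNo)` keys that appear in every subquery. The same idea can be sketched with Python sets; the keys below are hypothetical and do not correspond to real Mitelman cases.

```python
# hypothetical (RefNo, CaseNo, InvNo) keys returned by each subquery
rb1_loss = {(101, 1, 1), (102, 3, 1), (103, 2, 1)}
tp53_loss = {(101, 1, 1), (103, 2, 1), (104, 1, 1)}
atm_gain = {(103, 2, 1), (105, 1, 1)}

# a case is kept only if it is present in all three sets
shared = rb1_loss & tp53_loss & atm_gain
print(sorted(shared))
```

Adding another gene to the search amounts to adding one more set to the intersection, which is why each extra condition can only shrink the result.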

### Microarray nomenclature query

We can adapt the method used above to query for regions that are similar to a given microarray copy number nomenclature string. The query below will return any Mitelman cases that encompass the microarray regions given in the following nomenclature. Due to the low resolution of cytogenetics and the use of band-based coordinates in CytoConverter, it is unlikely that you would find an exact match for microarray coordinates.

>arr[GRCh38] (X,Y) 13q13.3q31.1(34901394_83018282)x1,20q11.23q13.2(35892594_43210401)x1
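
The coordinates used in the query below are read directly from this string. As a rough illustration, the segments can be pulled out with a regular expression. This is a sketch that handles only simple segments of the form shown here, not the full microarray nomenclature; the helper name `parse_array_nomenclature` is hypothetical, and mapping copy number to Gain or Loss assumes a normal copy number of two, which holds for autosomes only.

```python
import re

# chromosome, one or two bands, (start_end), xCopyNumber
SEGMENT = re.compile(r"(\d+|[XY])([pq][\d.]+)([pq][\d.]+)?\((\d+)_(\d+)\)x(\d+)")

def parse_array_nomenclature(text):
    """Extract simple copy number segments from an array nomenclature string."""
    segments = []
    for m in SEGMENT.finditer(text):
        chrom, band_from, band_to, start, end, copies = m.groups()
        copies = int(copies)
        # assumption: two copies is normal (autosomes)
        if copies < 2:
            seg_type = "Loss"
        elif copies > 2:
            seg_type = "Gain"
        else:
            seg_type = None
        segments.append({
            "chr": chrom,  # kept as text so that X and Y are representable
            "start": int(start),
            "end": int(end),
            "copies": copies,
            "type": seg_type,
        })
    return segments

example = ("arr[GRCh38] (X,Y) 13q13.3q31.1(34901394_83018282)x1,"
           "20q11.23q13.2(35892594_43210401)x1")
array_segments = parse_array_nomenclature(example)
for seg in array_segments:
    print(seg)
```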

In [None]:
array_query = f'''
WITH
    seg1 AS (SELECT c.* FROM `{bq_project}.{bq_dataset}.CytoConverted` c
        WHERE c.ChrOrd = 13
            AND c.Start < 34901394
            AND c.End > 83018282
            AND c.Type = 'Loss'),
    seg2 AS (SELECT c.* FROM `{bq_project}.{bq_dataset}.CytoConverted` c
        WHERE c.ChrOrd = 20
            AND c.Start < 35892594
            AND c.End > 43210401
            AND c.Type = 'Loss')
SELECT DISTINCT a.RefNo,
                a.CaseNo,
                a.InvNo,
                s1.ChrOrd AS seg1_ChrOrd,
                s1.Start AS seg1_Start,
                s1.End AS seg1_End,
                s2.ChrOrd AS seg2_ChrOrd,
                s2.Start AS seg2_Start,
                s2.End AS seg2_End,
                n.KaryShort
FROM (
        (SELECT RefNo, CaseNo, InvNo FROM seg1)
        INTERSECT DISTINCT
        (SELECT RefNo, CaseNo, InvNo FROM seg2)
     ) a
JOIN seg1 s1
  ON s1.RefNo = a.RefNo
  AND s1.CaseNo = a.CaseNo
  AND s1.InvNo = a.InvNo
JOIN seg2 s2
  ON s2.RefNo = a.RefNo
  AND s2.CaseNo = a.CaseNo
  AND s2.InvNo = a.InvNo
JOIN `{bq_project}.{bq_dataset}.CytogenInv` n
  ON a.RefNo = n.RefNo
  AND a.CaseNo = n.CaseNo
  AND a.InvNo = n.InvNo
ORDER BY a.RefNo,
        a.CaseNo,
        a.InvNo
'''
#print(array_query)

In [None]:
# run the query and store results in a dataframe
array_matches = client.query(array_query).result().to_dataframe()

# view the results
print(f'Count of array matches: {len(array_matches)}')
print(array_matches.head())

## *TP53* deletion vs. *TP53* deletion with complex karyotype

In the IPSS-R table above, you can see that some prognostic groups are identified by a complex karyotype. The presence of a complex karyotype can indicate a specific prognosis in other scenarios as well [4]. Similar to the single locus subsets created in the first section, we can separate cases with *TP53* deletion alone from those with *TP53* deletion in addition to a complex karyotype. We can utilize the **KaryAbnorm** BigQuery table to count the number of abnormalities present in each clone for a specific Mitelman case. Any karyotype nomenclature containing >=3 abnormalities in a single clone will be assigned to the complex karyotype group, and all others with <=2 abnormalities will be considered non-complex.

Example of deleted 17p in non-complex karyotype:  
>45,XX,-5,del(17)(p11.2)

Example of deleted 17p in complex karyotype:  
>50,XX,-3,del(5)(q13q33),+8,+11,der(12)t(1;12)(p36.1;p12),del(13)(q13q34),+15,del(17)(p11.2),+19,+21,t(9;22)(q34;q11.2)
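
The counting rule can be sketched in plain Python before it is expressed in SQL. The sketch assumes that each comma-separated field of a clone corresponds to one candidate abnormality, which may not match exactly how the **KaryAbnorm** table is populated. The exclusion pattern mirrors the one used in the queries below, the helper names are hypothetical, and the second karyotype is a made-up example.

```python
import re

# fields that are not counted: chromosome number, sex chromosomes,
# "idem", uncertain ("?") and incomplete ("inc") markers
EXCLUDE = re.compile(r"^(\d|X|Y|idem|\?|inc)")

def count_abnormalities(clone):
    """Count the comma-separated fields of one clone that are abnormalities."""
    return sum(1 for field in clone.split(",") if not EXCLUDE.match(field))

def is_complex(clone, threshold=3):
    """A clone is treated as complex at three or more abnormalities."""
    return count_abnormalities(clone) >= threshold

non_complex_example = "45,XX,-5,del(17)(p11.2)"
complex_example = "47,XY,del(5)(q13q33),+8,del(17)(p11.2)"  # hypothetical

for clone in (non_complex_example, complex_example):
    print(clone, count_abnormalities(clone), is_complex(clone))
```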

### Complex and simple karyotype query

In [None]:
# queries to select all TP53 deletions and separate them based on the number of total abnormalities
simple_karyo = f'''
WITH
    tp53 AS (SELECT c.* FROM `{bq_project}.{bq_dataset}.CytoConverted` c
        WHERE c.ChrOrd = 17
            AND c.Start < 7668421
            AND c.End > 7687490
            AND c.Type = 'Loss'),
    simple AS (SELECT * FROM (
                SELECT RefNo, CaseNo, InvNo, CloneNo, COUNT(Abnormality) AS NUM_ABN
                FROM `{bq_project}.{bq_dataset}.KaryAbnorm`
                WHERE NOT REGEXP_CONTAINS(Abnormality, r'^(\\d|X|Y|idem|\\?|inc)')
                GROUP BY RefNo, CaseNo, InvNo, CloneNo
              ) WHERE NUM_ABN <= 2)
SELECT DISTINCT a.RefNo,
                a.CaseNo,
                a.InvNo,
                s.NUM_ABN,
                n.KaryShort
FROM (
        (SELECT RefNo, CaseNo, InvNo FROM tp53)
        INTERSECT DISTINCT
        (SELECT RefNo, CaseNo, InvNo FROM simple)
     ) a
JOIN simple s
  ON a.RefNo = s.RefNo
  AND a.CaseNo = s.CaseNo
  AND a.InvNo = s.InvNo
JOIN `{bq_project}.{bq_dataset}.CytogenInv` n
  ON a.RefNo = n.RefNo
  AND a.CaseNo = n.CaseNo
  AND a.InvNo = n.InvNo
ORDER BY a.RefNo,
        a.CaseNo,
        a.InvNo
'''
#print(simple_karyo)

complex_karyo = f'''
WITH
    tp53 AS (SELECT c.* FROM `{bq_project}.{bq_dataset}.CytoConverted` c
                WHERE c.ChrOrd = 17
                    AND c.Start < 7668421
                    AND c.End > 7687490
                    AND c.Type = 'Loss'),
    complex AS (SELECT * FROM (
                SELECT RefNo, CaseNo, InvNo, CloneNo, COUNT(Abnormality) AS NUM_ABN
                    FROM `{bq_project}.{bq_dataset}.KaryAbnorm`
                    WHERE NOT REGEXP_CONTAINS(Abnormality, r'^(\\d|X|Y|idem|\\?|inc)')
                    GROUP BY RefNo, CaseNo, InvNo, CloneNo
              ) WHERE NUM_ABN >= 3)
SELECT DISTINCT a.RefNo,
                a.CaseNo,
                a.InvNo,
                s.NUM_ABN,
                n.KaryShort
FROM (
        (SELECT RefNo, CaseNo, InvNo FROM tp53)
        INTERSECT DISTINCT
        (SELECT RefNo, CaseNo, InvNo FROM complex)
     ) a
JOIN complex s
    ON a.RefNo = s.RefNo
    AND a.CaseNo = s.CaseNo
    AND a.InvNo = s.InvNo
JOIN `{bq_project}.{bq_dataset}.CytogenInv` n
    ON a.RefNo = n.RefNo
    AND a.CaseNo = n.CaseNo
    AND a.InvNo = n.InvNo
ORDER BY a.RefNo,
        a.CaseNo,
        a.InvNo
'''
#print(complex_karyo)

In [None]:
# run the queries and store the results in dataframes
tp53_simple_df = client.query(simple_karyo).result().to_dataframe()
tp53_complex_df = client.query(complex_karyo).result().to_dataframe()

In [None]:
# view the results
print(f'Count of simple deletion occurrences: {len(tp53_simple_df.index)}')
print(tp53_simple_df.head())

print(f'Count of complex deletion occurrences: {len(tp53_complex_df.index)}')
print(tp53_complex_df.head())

In [None]:
# plot the number of occurrences to compare distribution of abnormality counts between simple and complex
sns.set(rc={"figure.figsize":(20, 5)})

tp53_simple_df['ds'] = "simple"
tp53_complex_df['ds'] = "complex"
merged_tp53 = pd.concat([tp53_simple_df, tp53_complex_df])

merged_plot = sns.countplot(x='NUM_ABN', hue='ds', data=merged_tp53, dodge=False)
merged_plot.set(xlabel="Number of Abnormalities in Karyotype",
                 ylabel="Number of Mitelman Karyotypes",
                 title="Distribution of Abnormality Counts")
plt.show()

#### Results

The graph above shows that simple cases contain only one or two abnormalities, as expected, and that the number of occurrences in these two groups is similar. For complex cases, however, there is a much larger range of abnormality counts, from 3 all the way up to 81. The distribution of these occurrences lies heavily at the lower end and decreases dramatically as the number of abnormalities increases.

## Conclusion

In this notebook we have explored new ways of looking at the Mitelman Database. The Mitelman webapp contains a number of variables that can be used to subset the data. This notebook has provided examples showing how the dataset can be further explored by interacting directly with the BigQuery tables. The examples provided here can be modified or adapted to additional use cases. By creating new and unique subsets of the Mitelman database, you might discover even more insights in your data.

## References:
1. McGowan-Jordan J, and Schmidt M. ISCN 2016 An International System for Human Cytogenomic Nomenclature. Reprint of: Cytogenetic and Genome Research 2016;148;1. Karger, S  

2. [Revised International Prognostic Scoring System for Myelodysplastic Syndromes](https://doi.org/10.1182/blood-2012-03-420489)  

3. [New Comprehensive Cytogenetic Scoring System for Primary Myelodysplastic Syndromes (MDS) and Oligoblastic Acute Myeloid Leukemia After MDS Derived From an International Database Merge](https://doi.org/10.1200/JCO.2011.35.6394)  

4. [TP53 mutation status divides myelodysplastic syndromes with complex karyotypes into distinct prognostic subgroups](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6609480/)