## MSK-IMPACT Summary

This notebook contains a summary analysis of the MSK-IMPACT data relevant to PDM.

Notes:
- There isn't anything too interesting here. The main goal of this notebook was to put together graphs confirming that msk-solid-heme/data_clinical_patient.txt and oncokb-annotated-msk-impact/data_clinical_sample.oncokb.txt together form what we call the 'MSK-IMPACT cohort', called `df` in this notebook. A better summary of this data is available in the MSK-IMPACT study on cbioportal.mskcc.org; this notebook was simply an exercise in confirming it.

- Next steps might be to also incorporate the treatment and patient timeline data, as part of the data served with PDM data.

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import getpass
import sys

from collections import Counter

sys.path.append("../")

from connector import DremioDataframeConnector


FONT_BASE = {
    #"family": "sans-serif",
    #"sans-serif": "helvetica",
    "weight": "normal",
    "size": 18,
}

plt.rc("font", **FONT_BASE)
plt.rc("axes", unicode_minus=False)

In [None]:
# Utility functions
def create_summary_plot(df: pd.DataFrame, field: str, annotate=True, sort=True):
    """Plot bar charts of unique sample and patient counts for each value of `field`."""
    fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(32, 32))

    # Count distinct samples and distinct patients per value of `field`
    df_sample = df.groupby(by=[field])['SAMPLE_ID'].nunique().reset_index(name='count')
    df_patient = df.groupby(by=[field])['PATIENT_ID'].nunique().reset_index(name='count')

    if sort:
        # Largest groups first; reset the index so bar positions follow the sorted order
        df_sample = df_sample.sort_values(['count'], ascending=False).reset_index(drop=True)
        df_patient = df_patient.sort_values(['count'], ascending=False).reset_index(drop=True)

    
    bars = ax1.bar(df_sample.index, df_sample['count'])
    ax1.set_xticks(df_sample.index)
    ax1.set_xticklabels(df_sample[field], rotation=45, ha='right')
    if annotate:
        ax1.bar_label(bars)
    
    ax1.set_title(f"Sample Level Histogram - {field}")

    bars = ax2.bar(df_patient.index, df_patient['count'])
    ax2.set_xticks(df_patient.index)
    ax2.set_xticklabels(df_patient[field], rotation=45, ha='right')
    if annotate:
        ax2.bar_label(bars)
    
    ax2.set_title(f"Patient Level Histogram - {field}")

    if len(df_sample.index)>20:
        ax1.tick_params(axis='x',labelsize=12)
        ax2.tick_params(axis='x',labelsize=12)
    else:
        ax1.tick_params(axis='x',labelsize=18)
        ax2.tick_params(axis='x',labelsize=18)
        
    plt.tight_layout()
    plt.show()

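def create_count_df(field):
    """Return a DataFrame of value counts for `field`, sorted from most to least frequent."""
    counts = Counter(field)

    _df = pd.DataFrame({'Keys': list(counts.keys()), 'Amount': list(counts.values())})
    # sort_values returns a new DataFrame, so the result must be assigned
    _df = _df.sort_values(by=['Amount'], ascending=False).reset_index(drop=True)

    return _df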

def plot_df(graph_df, graph_title):
    return graph_df.plot.bar(x="Keys", y="Amount", title=graph_title)
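
The sample-level and patient-level counts in `create_summary_plot` differ whenever a patient has more than one sample. The following is a minimal sketch of that distinction on made-up data; the `toy` frame and the `count_unique` helper are hypothetical and not part of this notebook.

```python
import pandas as pd

# Toy data (hypothetical): patient P1 contributes two samples.
toy = pd.DataFrame(
    {
        "PATIENT_ID": ["P1", "P1", "P2", "P3"],
        "SAMPLE_ID": ["S1", "S2", "S3", "S4"],
        "CANCER_TYPE": ["Breast Cancer", "Breast Cancer", "Glioma", "Breast Cancer"],
    }
)


def count_unique(df, field, id_col):
    """Count distinct IDs per value of `field`, largest group first."""
    return (
        df.groupby(field)[id_col]
        .nunique()
        .sort_values(ascending=False)
        .reset_index(name="count")
    )


sample_counts = count_unique(toy, "CANCER_TYPE", "SAMPLE_ID")
patient_counts = count_unique(toy, "CANCER_TYPE", "PATIENT_ID")
```

Here "Breast Cancer" has three samples but only two patients, which is why the two panels of each figure can show different bar heights for the same category.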

In [None]:
# Setup Dremio connector
# Credentials (also could be read via .env)

DREMIO_USER = input("Username: ")
DREMIO_PASSWORD = getpass.getpass(prompt="Password or PAT: ", stream=None)

dremio_session = DremioDataframeConnector(
   scheme="grpc+tcp",
   hostname="tlvidreamcord1",
   flightport=32010,
   dremio_user=DREMIO_USER,
   dremio_password=DREMIO_PASSWORD,
   connection_args={},
)
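
As noted in the cell above, the credentials could also come from the environment rather than an interactive prompt. A minimal sketch, assuming the variables are named `DREMIO_USER` and `DREMIO_PASSWORD` (these names and the `read_credentials` helper are assumptions, not something this notebook defines); it falls back to prompting when a variable is unset. Loading a `.env` file into the environment would need a separate step that is not shown here.

```python
import getpass
import os


def read_credentials(env=None):
    """Return (user, password), preferring environment variables over prompts."""
    # `env` defaults to the process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    # Variable names are assumptions; prompt only when a value is missing or empty.
    user = env.get("DREMIO_USER") or input("Username: ")
    password = env.get("DREMIO_PASSWORD") or getpass.getpass(prompt="Password or PAT: ")
    return user, password
```

Calling `DREMIO_USER, DREMIO_PASSWORD = read_credentials()` before constructing `DremioDataframeConnector` would then work both interactively and in unattended runs.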


In [None]:
patient_query = 'select * from impact."daily_data_clinical_patient.solid_heme.txt"'
patient_df = dremio_session.get_table(patient_query)
display(patient_df)

okb_query = 'select * from impact."daily_data_clinical_sample.oncokb.txt"'
okb_impact_df = dremio_session.get_table(okb_query)
display(okb_impact_df)

## Merging our two data sources to form the so-called MSK-IMPACT dataset


In [None]:
df = okb_impact_df.merge(patient_df, on="PATIENT_ID")
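
Because the default merge is an inner join, samples without a matching patient row (and patients without samples) are dropped silently. One way to check how well the two sources line up is an outer merge with pandas' `indicator` and `validate` options. The sketch below uses made-up frames (`samples`, `patients`) standing in for `okb_impact_df` and `patient_df`, and assumes each patient appears once in the patient table.

```python
import pandas as pd

# Hypothetical stand-ins for okb_impact_df (samples) and patient_df (patients).
samples = pd.DataFrame(
    {"SAMPLE_ID": ["S1", "S2", "S3"], "PATIENT_ID": ["P1", "P1", "P9"]}
)
patients = pd.DataFrame({"PATIENT_ID": ["P1", "P2"], "GENDER": ["Female", "Male"]})

# how="outer" keeps unmatched rows from both sides; indicator=True adds a
# "_merge" column; validate raises if PATIENT_ID is duplicated in `patients`.
checked = samples.merge(
    patients,
    on="PATIENT_ID",
    how="outer",
    indicator=True,
    validate="many_to_one",
)

# Tally rows by origin: "both", "left_only" (sample without a patient row),
# or "right_only" (patient without any sample).
merge_counts = checked["_merge"].astype(str).value_counts().to_dict()
```

If every row of the real merge came back as `both`, that would support the claim that the two files together form the cohort.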

In [None]:
display(df)

### Sample and patient level summary figures

In [None]:
# cancer_df = create_count_df(df.groupby(['PATIENT_ID']))
create_summary_plot(df, 'CANCER_TYPE')

In [None]:
create_summary_plot(df, 'RACE')

In [None]:
create_summary_plot(df, 'ETHNICITY')

In [None]:
create_summary_plot(df, 'STAGE_HIGHEST_RECORDED')

In [None]:
create_summary_plot(df, 'GENDER')

In [None]:
create_summary_plot(df, 'CURRENT_AGE_DEID', sort=False)

In [None]:
create_summary_plot(df, 'SAMPLE_CLASS')

In [None]:
df_met = df[df['METASTATIC_SITE']!='Not Applicable']
create_summary_plot(df_met, 'METASTATIC_SITE')


In [None]:
# Exclude samples with an 'Unknown' metastatic site before plotting by primary site
df_primary = df[df['METASTATIC_SITE']!='Unknown']

create_summary_plot(df_primary, 'PRIMARY_SITE')