# HoBBIT Case Breakdown Summary



## TOC:
1. [High-Level Counts](#high-level-counts)
    1. Resolved discrepancies between patient/slide/row counts
2. [Basic Histograms](#histograms)
    1. TODO - make histograms that examine slide magnification stratified by scanner model



- update data sources to pull data in real time from github sources
- use dremio/minio connectors in notebooks instead of filepaths
- Fix import method for dremio


- specimen numbers do not map one-to-one to patients: there are more specimen numbers than patients, but fewer than slides
- `signout_datetime` should be later than `datetime_accession`
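
The two observations above can be turned into quick checks. This is a minimal sketch on made-up rows; the column names (`mrn`, `specnum_formatted`, `datetime_accession`, `signout_datetime`) come from this notebook, but the values are invented and the real columns are assumed to be parseable as datetimes.

```python
import pandas as pd

# Toy rows standing in for the case breakdown table (values are made up).
toy = pd.DataFrame({
    "mrn": ["P1", "P1", "P2"],
    "specnum_formatted": ["S-1", "S-2", "S-3"],
    "datetime_accession": pd.to_datetime(["2021-01-05", "2021-03-01", "2022-06-10"]),
    "signout_datetime": pd.to_datetime(["2021-01-08", "2021-02-27", "2022-06-12"]),
})

# Rows where signout is not strictly later than accession break the expectation.
bad_dates = toy[toy["signout_datetime"] <= toy["datetime_accession"]]

# Distinct specimen numbers per patient; values above 1 mean a patient
# has several specimens.
specimens_per_patient = toy.groupby("mrn")["specnum_formatted"].nunique()

print(bad_dates[["mrn", "specnum_formatted"]])
print(specimens_per_patient)
```

On the real table, the same two expressions could be run on `df` after loading it from Dremio.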

Questions:
- Is the same specimen scanned at both 20x and 40x?
- Is there a trend toward more slides being scanned at 40x now than in prior years?
- Add figure to HoBBIT-case-breakdown-summary that looks at slide magnification stratified by scanner type and subspecialty
- Add figure to HoBBIT-case-breakdown-summary that looks at stain_group and stain_name stratified by subspecialty
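
For the first question, one possible approach is to count distinct magnifications per specimen number. The sketch below uses invented rows; it assumes `magnification` holds strings such as `'20x'` and `'40x'`, as in the filters later in this notebook.

```python
import pandas as pd

# Toy slide-level rows (values are made up).
toy = pd.DataFrame({
    "specnum_formatted": ["S-1", "S-1", "S-2", "S-3", "S-3"],
    "image_id": [1, 2, 3, 4, 5],
    "magnification": ["20x", "40x", "20x", "40x", "40x"],
})

# Number of distinct magnifications seen for each specimen.
mags_per_specimen = toy.groupby("specnum_formatted")["magnification"].nunique()

# Specimens with more than one magnification were scanned at both settings
# (assuming 20x and 40x are the only values present).
mixed_specimens = mags_per_specimen[mags_per_specimen > 1].index.tolist()
print(mixed_specimens)
```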

In [None]:
import os
import pandas as pd
import sys
import getpass
import matplotlib.pyplot as plt
import seaborn as sns

sys.path.append("../")

from connector import DremioDataframeConnector
pd.set_option('display.max_columns', None)

FONT_BASE = {
    #"family": "sans-serif",
    #"sans-serif": "helvetica",
    "weight": "normal",
    "size": 18,
}

plt.rc("font", **FONT_BASE)
plt.rc("axes", unicode_minus=False)
plt.rcParams.update({'figure.autolayout': True})

In [None]:
def create_summary_plot(df:pd.DataFrame, field:str, sort=True):
    """Plot bar charts of unique slide (image_id) and unique patient (mrn) counts per value of `field`."""
    fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(12, 12))

    # Count unique slides and unique patients for each value of `field`
    df_sample = df.groupby(by=[field])['image_id'].nunique().reset_index(name='count')
    df_patient = df.groupby(by=[field])['mrn'].nunique().reset_index(name='count')
    
    if sort:
        # Largest counts first; drop=True keeps a clean 0..n-1 index for bar positions
        df_sample = df_sample.sort_values(['count'], ascending=False).reset_index(drop=True)
        df_patient = df_patient.sort_values(['count'], ascending=False).reset_index(drop=True)

    bars = ax1.bar(df_sample.index, df_sample['count'])
    ax1.set_xticks(df_sample.index)
    ax1.set_xticklabels(df_sample[field], rotation=45, ha='right')
    ax1.bar_label(bars)
    ax1.set_title(f"Slide Level Histogram - {field}")

    bars = ax2.bar(df_patient.index, df_patient['count'])
    ax2.set_xticks(df_patient.index)
    ax2.set_xticklabels(df_patient[field], rotation=45, ha='right')
    ax2.bar_label(bars)

    ax2.set_title(f"Patient Level Histogram - {field}")
    
    plt.tight_layout()
    plt.show()

In [None]:
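def create_heatmap(df:pd.DataFrame, field1:str, field2:str, annotation=False):
    """Draw a heatmap of row counts for each (field1, field2) combination."""
    # crosstab counts rows, so a slide that appears in several rows is counted once per row
    _df = pd.crosstab(df[field1], df[field2])
    sns.heatmap(_df, linewidths=0.5, cmap="coolwarm", annot=annotation)
    plt.show()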

In [None]:
# Setup Dremio connector
# Credentials (also could be read via .env)

DREMIO_USER = input("Username: ")
DREMIO_PASSWORD = getpass.getpass(prompt="Password or PAT: ", stream=None)

dremio_session = DremioDataframeConnector(
   scheme="grpc+tcp",
   hostname="tlvidreamcord1",
   flightport=32010,
   dremio_user=DREMIO_USER,
   dremio_password=DREMIO_PASSWORD,
   connection_args={},
)
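
As the comment in the cell above notes, the credentials could also come from the environment instead of an interactive prompt. A minimal sketch, assuming the variable names `DREMIO_USER` and `DREMIO_PASSWORD` (these names are a guess, not an established convention for this project); a `.env` file would first have to be loaded into the environment, for example with the python-dotenv package.

```python
import getpass
import os


def read_credentials(env=os.environ):
    """Return (user, password), prompting only for values missing from `env`."""
    # DREMIO_USER / DREMIO_PASSWORD are assumed names, chosen for illustration.
    user = env.get("DREMIO_USER") or input("Username: ")
    password = env.get("DREMIO_PASSWORD") or getpass.getpass(prompt="Password or PAT: ")
    return user, password
```

The returned pair could then be passed to `DremioDataframeConnector` as `dremio_user` and `dremio_password`.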



In [None]:
query = 'select * from "hobbit-poc"."case_breakdown"'
df = dremio_session.get_table(query)
display(df)

## High-level Counts <a name="high-level-counts"></a>

After removing duplicate rows, we expect the number of rows to equal the number of slides. This isn't the case.
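
The counts compared in this section can be collected in one place, which makes the mismatch easier to spot. A minimal sketch on made-up rows; `count_summary` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def count_summary(frame: pd.DataFrame) -> dict:
    """Collect the patient, slide, and row counts into one dictionary."""
    return {
        "patients": frame["mrn"].nunique(),
        "slides": frame["image_id"].nunique(),
        "rows": len(frame),
        "rows_deduplicated": len(frame.drop_duplicates()),
    }


# Toy rows: one exact duplicate, and one slide listed under two stain groups.
toy = pd.DataFrame({
    "mrn": ["P1", "P1", "P2", "P2"],
    "image_id": [10, 10, 11, 11],
    "stain_group": ["H&E", "H&E", "SS", "IHC"],
})

summary = count_summary(toy)
print(summary)
# The expectation holds only if these two numbers agree.
print(summary["slides"] == summary["rows_deduplicated"])
```

Note that `nunique()` ignores missing values by default, whereas `len(x.unique())` counts `NaN` as a value, so the two can differ if `mrn` or `image_id` contains nulls.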

In [None]:
print(f"Number of patients: {len(df['mrn'].unique())}")
print(f"Number of slides: {len(df['image_id'].unique())}")
print(f"Number of rows: {len(df)}")
print(f"Number of rows (removing duplicates): {len(df.drop_duplicates())}")
print(f"Number of specimen numbers: {len(df['specnum_formatted'].unique())}")


After removing duplicates, a small number of slides still appear in more than one row. The repeated slides
tend to carry both `SS` and `IHC` as `stain_group`, but their rows are otherwise identical, including file size. This
indicates that the same slide may have different metadata associated with it.
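
To confirm which fields actually differ between the repeated rows, one option is to count distinct values per column within each repeated slide. A minimal sketch on invented rows; `file_size` is a placeholder column name, since the real name of the file size column is not shown here.

```python
import pandas as pd

# Toy rows: slides 10 and 12 are repeated and differ only in stain_group.
toy = pd.DataFrame({
    "image_id": [10, 10, 11, 12, 12],
    "stain_group": ["SS", "IHC", "H&E", "SS", "IHC"],
    "file_size": [500, 500, 320, 410, 410],  # hypothetical column name
})

# Keep every row whose image_id occurs more than once.
dup = toy[toy["image_id"].duplicated(keep=False)]

# Distinct values per column within each repeated slide.
distinct_per_column = dup.groupby("image_id").nunique()

# Columns that vary within at least one repeated slide.
varies = (distinct_per_column > 1).any()
varying_columns = varies[varies].index.tolist()
print(varying_columns)
```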

In [None]:
df_tmp = df.drop_duplicates()
ids = df_tmp['image_id']
df_stain_dups = df_tmp[ids.isin(ids[ids.duplicated()])].sort_values("image_id")


In [None]:
df_stain_dups

In [None]:
create_summary_plot(df_stain_dups, 'stain_group')

This looks really bad: there are duplicated slide IDs that correspond to seemingly different MRNs,
even though they have the same specimen numbers and other fields. Until we figure out what's going on
here, I think we should exclude these MRNs.

I think we can potentially recover the IHC/SS slides, but I'm not so sure about the others. 
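
One way to separate the two situations (same slide with two stain groups versus same slide under different MRNs) is to count distinct MRNs per slide ID. A minimal sketch with made-up rows:

```python
import pandas as pd

# Toy rows: slide 10 repeats with one MRN, slide 11 repeats with two MRNs.
toy = pd.DataFrame({
    "image_id": [10, 10, 11, 11, 12],
    "mrn": ["P1", "P1", "P2", "P3", "P4"],
    "stain_group": ["SS", "IHC", "H&E", "H&E", "H&E"],
})

# Distinct MRNs attached to each slide ID.
mrns_per_slide = toy.groupby("image_id")["mrn"].nunique()

# Slides linked to more than one MRN are the worrying cases.
conflicting_ids = mrns_per_slide[mrns_per_slide > 1].index
conflicts = toy[toy["image_id"].isin(conflicting_ids)]
print(conflicts)
```

Slides that repeat but map to a single MRN (slide 10 here) would be the candidates for recovery.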

In [None]:
display(df_stain_dups[~df_stain_dups['stain_group'].isin(['SS', 'IHC'])].sort_values("image_id"))
display(df_stain_dups[~df_stain_dups['stain_group'].isin(['SS', 'IHC'])].sort_values("mrn"))

In [None]:
df_1 = df.drop_duplicates()
df_1 = df_1[~df_1['mrn'].isin(df_stain_dups['mrn'])]



After removing these patients/slides, we get a sensible count, where the number of rows matches the number of slides. 
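
The same expectation can be written as an assertion so that it fails loudly if the source table changes. A minimal sketch on invented rows that mirrors the exclusion step above; `duplicated(keep=False)` marks every row of a repeated slide, which selects the same rows as the `isin` expression used earlier.

```python
import pandas as pd

# Toy rows: one exact duplicate (slide 10) and one slide with two stain groups (slide 11).
toy = pd.DataFrame({
    "mrn": ["P1", "P1", "P2", "P2", "P3"],
    "image_id": [10, 10, 11, 11, 12],
    "stain_group": ["H&E", "H&E", "SS", "IHC", "H&E"],
})

deduped = toy.drop_duplicates()

# MRNs that own at least one slide still repeated after deduplication.
repeated = deduped["image_id"].duplicated(keep=False)
dup_mrns = deduped.loc[repeated, "mrn"]

# Drop every row belonging to those MRNs.
cleaned = deduped[~deduped["mrn"].isin(dup_mrns)]

# One row per slide is the property the rest of the notebook relies on.
assert cleaned["image_id"].is_unique
print(len(cleaned))
```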


In [None]:
print(f"Number of patients: {len(df_1['mrn'].unique())}")
print(f"Number of slides: {len(df_1['image_id'].unique())}")
print(f"Number of rows: {len(df_1)}")
print(f"Number of rows (removing duplicates): {len(df_1.drop_duplicates())}")

## Histograms <a name='histograms'></a>


In [None]:
create_summary_plot(df_1, 'subspecialty')

In [None]:
create_summary_plot(df_1, 'reduced_priority')

In [None]:
create_summary_plot(df_1, 'stain_group')

In [None]:
create_summary_plot(df_1, 'brand')

In [None]:
create_summary_plot(df_1, 'model')

In [None]:
create_summary_plot(df_1, 'magnification')

## Questions

### Does scanner model affect magnification?

- The AT2 Scanner is more likely to have images scanned at 20x.
- The GT450 Scanner is more likely to have images scanned at 40x.
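
Raw counts can hide this pattern when one scanner has many more slides than the other. A row-normalized crosstab shows the share of each magnification per model instead. A minimal sketch on made-up rows (the model names and magnification strings follow the values used in this notebook; the numbers are invented):

```python
import pandas as pd

# Toy slide-level rows (values are made up).
toy = pd.DataFrame({
    "model": ["AT2", "AT2", "AT2", "GT450", "GT450", "GT450", "GT450"],
    "magnification": ["20x", "20x", "40x", "40x", "40x", "40x", "20x"],
})

# normalize="index" divides each row by its total, so every model sums to 1.
share = pd.crosstab(toy["model"], toy["magnification"], normalize="index")
print(share)
```

The resulting frame could also be passed to `sns.heatmap` with `annot=True` to get a stratified view in a single figure.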

In [None]:
at2_df = df_1[df_1.model == 'AT2']
create_summary_plot(at2_df, 'magnification')

In [None]:
gt450_df = df_1[df_1.model == 'GT450']
create_summary_plot(gt450_df, 'magnification')

In [None]:
create_heatmap(df_1, 'magnification', 'model')

In [None]:
create_heatmap(df_1, 'stain_group', 'magnification')

In [None]:
# Use the cleaned frame (df_1) so the excluded MRNs do not affect the yearly counts
years = pd.to_datetime(df_1['datetime_accession']).dt.year
df_by_year = df_1.assign(year=years)

In [None]:
df_by_year_20 = df_by_year[df_by_year.magnification == '20x']
create_summary_plot(df_by_year_20, 'year', sort=False)

In [None]:
df_by_year_40 = df_by_year[df_by_year.magnification == '40x']
create_summary_plot(df_by_year_40, 'year', sort=False)
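
The two plots above show counts per year for each magnification separately. To answer the trend question directly, the fraction of slides scanned at 40x per accession year may be easier to read. A minimal sketch on invented rows, assuming `datetime_accession` can be parsed as a datetime:

```python
import pandas as pd

# Toy slide-level rows (values are made up).
toy = pd.DataFrame({
    "datetime_accession": pd.to_datetime([
        "2019-02-01", "2019-07-15", "2020-03-03",
        "2020-09-09", "2021-01-20", "2021-11-30",
    ]),
    "magnification": ["20x", "20x", "20x", "40x", "40x", "40x"],
})

year = toy["datetime_accession"].dt.year

# The mean of a boolean column is the fraction of True values,
# i.e. the share of 40x slides in each year.
share_40x = (toy["magnification"] == "40x").groupby(year).mean()
print(share_40x)
```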