# Predicting emergency department visits anchored on clinic dates
---
## Background
Previously, we built a model to predict emergency department (ED) visits anchored on treatment dates.

The problem with that approach is that primary physicians do not interact with their patients during treatment sessions; they meet only during clinic visits. A clinic visit is therefore the best time for the model to nudge the physician toward an intervention. We now want to build a model that predicts a patient's risk of an ED visit prior to the clinic date instead of prior to the treatment session.

---

In [1]:
%%capture
%cd ../../
%load_ext autoreload
%autoreload 2

In [146]:
import logging

import numpy as np
import pandas as pd
from datetime import datetime

from ml_common.util import get_excluded_numbers, load_pickle, save_pickle

from preduce.acu.eval import evaluate_valid, evaluate_test
from preduce.acu.pipeline import PrepACUData
from preduce.acu.train import train_models, tune_params
from preduce.prepare.prep import anchor_features_to_clinic_dates
from preduce.summarize import feature_summary, get_label_distribution
from preduce.util import load_clinic_dates

pd.set_option('display.max_rows', 150)

logging.basicConfig(
    level=logging.INFO, 
    format='%(levelname)s:%(message)s', 
)

## Load clinic data

In [3]:
clinic = load_clinic_dates(data_dir='./data/processed')
treatment = pd.read_parquet('./data/interim/treatment.parquet.gzip')

Removing 123 visits that "occured before" 2006-01-05


In [4]:
# Filter the clinic dates
# METHOD1: pd.merge - pros readability, cons maybe performance?
df = pd.merge(clinic, treatment[['mrn', 'treatment_date']], on='mrn', how='inner')
df = df.rename(columns={'treatment_date': 'next_treatment_date'})

# filter out clinic dates where the next treatment session does not occur within 5 days
mask = df['next_treatment_date'].between(df['clinic_date'], df['clinic_date'] + pd.Timedelta(days=5))
df = df[mask]

# filter out clinic dates where notes were uploaded after the next treatment session
mask = df['upload_date'] < df['next_treatment_date']
df = df[mask]

# remove duplicates from the merging
df = df.sort_values(by=['mrn', 'next_treatment_date'])
df = df.drop_duplicates(subset=['mrn', 'clinic_date'], keep='first')

In [5]:
df = df.sort_values(by=['mrn', 'clinic_date'])
df.to_csv('./data/processed/assessment_dates.csv', index=False)
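
The `METHOD1` comment above raises a performance concern: the full `pd.merge` pairs every clinic visit with every treatment of the same patient before filtering. One possible alternative is `pd.merge_asof`, which looks up the nearest following treatment directly. The sketch below runs on made-up toy data that reuses the notebook's column names; it is not the method used in this notebook. One behavioural difference to keep in mind: it always matches the first treatment on or after the clinic date, so applying the `upload_date` filter afterwards could drop a visit that the merge-based approach would keep by matching a later treatment inside the 5-day window.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are made up
clinic = pd.DataFrame({
    'mrn': [1, 1, 2],
    'clinic_date': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-01-10']),
})
treatment = pd.DataFrame({
    'mrn': [1, 1, 2],
    'treatment_date': pd.to_datetime(['2020-01-03', '2020-03-01', '2020-01-20']),
})
treatment = treatment.rename(columns={'treatment_date': 'next_treatment_date'})

# merge_asof requires both frames to be sorted by their "on" keys
matched = pd.merge_asof(
    clinic.sort_values('clinic_date'),
    treatment.sort_values('next_treatment_date'),
    left_on='clinic_date',
    right_on='next_treatment_date',
    by='mrn',                        # only match within the same patient
    direction='forward',             # first treatment on or after the clinic date
    tolerance=pd.Timedelta(days=5),  # ignore treatments more than 5 days away
)

# Clinic visits without a treatment inside the window get NaT; drop them
matched = matched.dropna(subset=['next_treatment_date'])
matched = matched.sort_values(by=['mrn', 'clinic_date'])
print(matched)
```

Because each clinic visit receives at most one match, no `drop_duplicates` step is needed in this variant.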

## Anchor features to clinic visits

In [6]:
anchor_features_to_clinic_dates(script_path='../make-clinical-dataset/scripts')

## Load feature data

In [93]:
df = pd.read_parquet('./data/processed/clinic_centered_feature_dataset.parquet.gzip')
df['assessment_date'] = df['clinic_date']
emerg = pd.read_parquet('./data/interim/emergency_room_visit.parquet.gzip')

## Prepare Data

In [94]:
# remove the first clinic visits, which occur before treatment starts (no treatment date yet)
# we do not want the model to make an assessment before the clinician has met the patient
mask = df['treatment_date'].notnull()
# get_nmissing(df[~mask])
get_excluded_numbers(df, mask, context=' which are the first clinic visits')
df = df[mask]

INFO:Removing 895 patients and 8891 sessions which are the first clinic visits


In [95]:
prep = PrepACUData()
df = prep.preprocess(df, emerg)

Getting change since last session...: 100%|██████████| 7052/7052 [00:02<00:00, 3133.53it/s]
INFO:Removing 3606 patients and 15438 sessions before 2014-01-01 and after 2019-12-31
INFO:Removing the following features for drugs given less than 10 times: ['%_ideal_dose_given_DURVALUMAB', '%_ideal_dose_given_RALTITREXED', '%_ideal_dose_given_IPILIMUMAB', '%_ideal_dose_given_CAPECITABINE', '%_ideal_dose_given_ERLOTINIB']
INFO:Dropping the following 13 features for missingness over 80%: ['basophil', 'esas_appetite_change', 'esas_drowsiness_change', 'prothrombin_time_international_normalized_ratio', 'activated_partial_thromboplastin_time', 'carbohydrate_antigen_19-9', 'mean_corpuscular_hemoglobin_change', 'esas_pain_change', 'esas_well_being_change', 'carcinoembryonic_antigen', 'esas_constipation', 'esas_diarrhea', 'esas_vomiting']
INFO:Reassigning the following 6 indicators with less than 6 patients as other: ['cancer_site_C00', 'cancer_site_C14', 'cancer_site_C26', 'cancer_site_C48', 'cancer

In [96]:
# To align with EPIC system for silent deployment
# 1. remove drug and morphology features
# 2. restrict to GI patients
# This will be temporary
cols = df.columns
cols = cols[~cols.str.contains('morphology|%_ideal_dose')]
df = df[cols]

mask = df['regimen'].str.startswith('GI-')
get_excluded_numbers(df, mask, context=' not from GI department')
df = df[mask]

INFO:Removing 1815 patients and 8179 sessions not from GI department


In [97]:
X, Y, metainfo = prep.prepare(df, event_name='ED_visit')
df = df.loc[X.index]
# clean up Y
for col in ['target_CEDIS_complaint', 'target_CTAS_score']:
    metainfo[col] = Y.pop(col)
Y.columns = Y.columns.str.replace('target_', '')

INFO:Removing 0 patients and 510 sessions that occured after 2018-02-01 in the development cohort
INFO:Removing 2 patients and 45 sessions in which patient had a target event in less than 2 days.
INFO:One-hot encoding training data
INFO:Reassigning the following 16 indicators with less than 6 patients as other: ['regimen_GI-CISPFU + TRAS(LOAD)', 'regimen_GI-CISPFU ANAL', 'regimen_GI-DOXO', 'regimen_GI-EOX', 'regimen_GI-FOLFNALIRI', 'regimen_GI-FOLFNALIRI (COMP)', 'regimen_GI-FOLFOX (GASTRIC)', 'regimen_GI-FUFA C3 (GASTRIC)', 'regimen_GI-FUFA-5 DAYS', 'regimen_GI-GEM D1,8 + CAPECIT', 'regimen_GI-GEMCAP', 'regimen_GI-GEMFU (BILIARY)', 'regimen_GI-IRINO 4-WEEKLY', 'regimen_GI-IRINO Q3W', 'regimen_GI-PACLI WEEKLY', 'regimen_GI-PACLITAXEL']
INFO:Reassigning the following 0 indicators with less than 6 patients as other: []
INFO:One-hot encoding testing data
INFO:Reassigning the following regimen indicator columns that did not exist in train set as other:
regimen_GI-CISPFU + TRAS(LOAD)     3


In [101]:
train_mask, test_mask = metainfo['split'] == 'Train', metainfo['split'] == 'Test'
X_train, X_test = X[train_mask], X[test_mask]
Y_train, Y_test = Y[train_mask], Y[test_mask]
metainfo_train, metainfo_test = metainfo[train_mask], metainfo[test_mask]

In [99]:
# Save the data prep for silent deployment
# So we transform new incoming data using the original data preparer
# save_pickle(prep.scaler, './result', 'scaler_ED')
# save_pickle(prep.imp.imputer, './result', 'imputer_ED')
# save_pickle(prep.clip_thresh, './result', 'clip_thresh_ED')
# save_pickle(prep.ohe.final_columns, './result', 'encoded_cols_ED')

# X.to_csv('./data/debug/to_muammar/X.csv', index=False)
# Y.to_csv('./data/debug/to_muammar/Y.csv', index=False)
# metainfo.to_csv('./data/debug/to_muammar/metainfo.csv', index=False)
# df.loc[X.index].to_csv('./data/debug/to_muammar/orig.csv', index=False)

## Describe Data

In [16]:
count = pd.DataFrame({
    'Number of sessions': metainfo.groupby('split').apply(len, include_groups=False), 
    'Number of patients': metainfo.groupby('split')['mrn'].nunique()}
).T
count['Total'] = count.sum(axis=1)
print(f'\n{count.to_string()}')


split               Test  Train  Total
Number of sessions  1102   7247   8349
Number of patients   344   1285   1629


In [17]:
get_label_distribution(Y, metainfo, with_respect_to='sessions')

| | Total: False | Total: True | Test: False | Test: True | Train: False | Train: True |
|---|---|---|---|---|---|---|
| ED_visit | 7486 | 863 | 956 | 146 | 6530 | 717 |


In [18]:
get_label_distribution(Y, metainfo, with_respect_to='patients')

| | Total: 1 | Total: 0 | Test: 1 | Test: 0 | Train: 1 | Train: 0 |
|---|---|---|---|---|---|---|
| ED_visit | 403 | 1226 | 73 | 271 | 330 | 955 |
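
The two label distributions above report raw counts. Converting them to event rates makes the splits easier to compare, and the session-level rate also serves as a reference point for AUPRC, since a classifier that ranks sessions at random has an expected precision equal to the prevalence. The sketch below only re-does that arithmetic on the counts shown above; the helper `event_rate` is a hypothetical name, not part of `preduce`.

```python
# Counts copied from the label distribution outputs above: (positive, negative)
sessions = {'Total': (863, 7486), 'Test': (146, 956), 'Train': (717, 6530)}
patients = {'Total': (403, 1226), 'Test': (73, 271), 'Train': (330, 955)}


def event_rate(positive: int, negative: int) -> float:
    """Fraction of rows with at least one ED visit (hypothetical helper)."""
    return positive / (positive + negative)


session_rates = {split: event_rate(*counts) for split, counts in sessions.items()}
patient_rates = {split: event_rate(*counts) for split, counts in patients.items()}

for split in sessions:
    print(f'{split:<5} sessions: {session_rates[split]:.1%}  patients: {patient_rates[split]:.1%}')
```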


In [19]:
# Feature Characteristics
x = prep.ohe.encode(df.loc[X_train.index].copy(), verbose=False) # get original (non-normalized, non-imputed) data one-hot encoded
x = x[[col for col in x.columns if not (col in metainfo.columns or col.startswith('target'))]]
feature_summary(x, save_path='result/tables/feature_summary_ED_clinic_anchored.csv').sample(10, random_state=42)

| | Features | Group | Mean (SD) | Missingness (%) |
|---|---|---|---|---|
| 19 | Topography ICD-0-3 C15, Esophagus | Cancer | 0.063 (0.243) | 0.0 |
| 115 | Total Bilirubin Change | Laboratory | -0.013 (1.505) | 36.0 |
| 78 | Visit Month Cos | Treatment | 0.035 (0.708) | 0.0 |
| 38 | ESAS Fatigue Score | Symptoms | 3.353 (2.485) | 18.3 |
| 37 | ESAS Pain Score | Symptoms | 1.968 (2.322) | 18.2 |
| 59 | Lymphocyte | Laboratory | 1.312 (1.919) | 49.5 |
| 22 | Topography ICD-0-3 C18, Colon | Cancer | 0.218 (0.413) | 0.0 |
| 50 | Aspartate Aminotransferase (U/L) | Laboratory | 28.741 (23.547) | 53.3 |
| 0 | Height (cm) | Demographic | 167.963 (9.473) | 0.0 |
| 54 | Eosinophil (x10e9/L) | Laboratory | 0.159 (0.141) | 66.5 |


## Train Models

In [104]:
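# LGBM does not like non-alphanumeric characters (except for _)
# regex=False replaces each character literally ('(', ')' and '+' are regex metacharacters)
for char in ['(', ')', '+', '-', '/', ',']:
    X_train.columns = X_train.columns.str.replace(char, '_', regex=False)
    X_test.columns = X_test.columns.str.replace(char, '_', regex=False)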
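
As a possible alternative to the character-by-character loop above, the same six characters can be replaced in a single pass with a regex character class. The sketch below is illustrative only: two of the column names are taken from the log output earlier in this notebook and the third is made up. It also checks that the renamed columns stay unique, since two different names could collapse into one after replacement.

```python
import pandas as pd

# Two regimen names from the log output above, plus one made-up plain name
cols = pd.Index([
    'regimen_GI-FOLFOX (GASTRIC)',
    'regimen_GI-GEM D1,8 + CAPECIT',
    'esas_pain',
])

# Character class covering the same characters as the loop: ( ) + - / ,
clean = cols.str.replace(r'[()+\-/,]', '_', regex=True)

# Guard against two different names collapsing into the same cleaned name
assert clean.is_unique, 'column names collide after sanitizing'
print(clean.tolist())
```

Spaces are left untouched here, as in the original loop.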

In [12]:
%%capture
# Hyperparameter tuning
# TODO: try greater kappa for greater exploration
algs = ['LASSO', 'RF', 'Ridge', 'XGB', 'LGBM']
best_params = {}
for alg in algs:
    best_params[alg] = tune_params(alg, X_train, Y_train['ED_visit'], metainfo_train)
save_pickle(best_params, './models', 'best_params_clinic_anchored')
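# keep a timestamped backup; format the timestamp so the filename has no spaces or colons
save_pickle(best_params, './models', f'best_params_clinic_anchored-{datetime.now():%Y%m%d-%H%M%S}')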

In [147]:
best_params = load_pickle('./models', 'best_params_clinic_anchored')
models = train_models(X_train, Y_train, metainfo_train, best_params)

## Model Selection
Select the final model based on the average performance across the validation folds.

In [148]:
# select XGBoost
# TODO: check if we can make it less overfit
evaluate_valid(models, X_train, Y_train, metainfo_train)

| | Ridge AUPRC | Ridge AUROC | LASSO AUPRC | LASSO AUROC | XGB AUPRC | XGB AUROC | LGBM AUPRC | LGBM AUROC | RF AUPRC | RF AUROC |
|---|---|---|---|---|---|---|---|---|---|---|
| ED_visit | 0.269367 | 0.781758 | 0.272045 | 0.783733 | 0.45703 | 0.865704 | 0.278372 | 0.767941 | 0.36844 | 0.8111 |


## Evaluate Model

In [149]:
pd.concat([evaluate_test(model, X_test, Y_test) for alg, model in models.items()], keys=models.keys()).T

| | Ridge AUPRC | Ridge AUROC | LASSO AUPRC | LASSO AUROC | XGB AUPRC | XGB AUROC | LGBM AUPRC | LGBM AUROC | RF AUPRC | RF AUROC |
|---|---|---|---|---|---|---|---|---|---|---|
| ED_visit | 0.237153 | 0.67575 | 0.231971 | 0.679841 | 0.237409 | 0.6851 | 0.219119 | 0.664609 | 0.252247 | 0.714822 |
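
The TODO in the model selection cell asks whether the selected model can be made less overfit. One simple way to put a number on that concern is the drop in AUROC from validation to test for each algorithm. The sketch below only re-does arithmetic on the figures reported above; a large drop may reflect overfitting, but it may also reflect differences between the development and test cohorts, so it should be read as a rough indicator rather than a diagnosis.

```python
# AUROC values copied from the validation and test outputs above
valid_auroc = {'Ridge': 0.781758, 'LASSO': 0.783733, 'XGB': 0.865704, 'LGBM': 0.767941, 'RF': 0.8111}
test_auroc = {'Ridge': 0.67575, 'LASSO': 0.679841, 'XGB': 0.6851, 'LGBM': 0.664609, 'RF': 0.714822}

# Validation-to-test drop per algorithm
gap = {alg: valid_auroc[alg] - test_auroc[alg] for alg in valid_auroc}

# Print from the largest drop to the smallest
for alg, drop in sorted(gap.items(), key=lambda item: item[1], reverse=True):
    print(f'{alg:<5} valid={valid_auroc[alg]:.3f}  test={test_auroc[alg]:.3f}  drop={drop:.3f}')
```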


# Scratch Notes

In [53]:
def outcome_level_sensitivity(df, lookahead_window: int = 30):
    """Get the proportion of true outcomes where at least one alarm preceded the event

    E.g. if an ED visit happens on Jan 20, our lookahead window is 30 days, and assessments
        happen on Jan 1 and Jan 14, then the outcome is a true positive if
        either the Jan 1 or the Jan 14 assessment triggers a warning, and a false negative if neither does
    """
    result = []
    for (mrn, event_date), group in df.groupby(['mrn', 'event_date']):

        # ensure assessment date and event date are within X days of each other
        diff = (group['event_date'] - group['assessment_date']).dt.days
        assert all(diff.between(0, lookahead_window))

        result.append(any(group['pred']))

    return sum(result) / len(result) # tp / (tp + fn)

event_df = pd.DataFrame()
event_df[['mrn', 'assessment_date']] = df[['mrn', 'clinic_date']]
event_df['pred'] = np.random.choice([0, 1], size=len(event_df))
event_df['event_date'] = metainfo['target_ED_visit_date']
outcome_level_sensitivity(event_df)

0.6116838487972509
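
The loop in `outcome_level_sensitivity` can also be written as a single grouped aggregation. The sketch below is a minimal vectorized variant run on made-up toy data that mirrors the docstring example (assessments on Jan 1 and Jan 14 before an ED visit on Jan 20). The function name `outcome_level_sensitivity_vectorized` is hypothetical, and unlike the original it does not assert that each assessment falls inside the lookahead window.

```python
import pandas as pd


def outcome_level_sensitivity_vectorized(df: pd.DataFrame) -> float:
    """Share of (patient, event) pairs with at least one positive prediction."""
    flagged = df.assign(pred=df['pred'].astype(bool))
    # groupby drops rows with a missing event_date, as the loop version does
    alarmed = flagged.groupby(['mrn', 'event_date'])['pred'].any()
    return float(alarmed.mean())  # tp / (tp + fn)


# Toy data: patient 1 is flagged once before the event, patient 2 is never
# flagged, and patient 3 has no event at all
toy = pd.DataFrame({
    'mrn': [1, 1, 2, 3],
    'assessment_date': pd.to_datetime(['2020-01-01', '2020-01-14', '2020-01-05', '2020-01-02']),
    'event_date': pd.to_datetime(['2020-01-20', '2020-01-20', '2020-01-25', None]),
    'pred': [0, 1, 0, 1],
})
sensitivity = outcome_level_sensitivity_vectorized(toy)
print(sensitivity)
```

Patient 3 contributes nothing to the result because sessions without an event are excluded from the grouping.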