# Predicting emergency department visits anchored on clinic dates
---
## Background
Previously, we built a model to predict emergency department (ED) visits anchored on treatment dates.

The problem with that approach is that primary physicians do not interact with their patients during treatment sessions. They meet only during clinic visits, so a clinic visit is the best time for the model to nudge the physician toward an intervention. We therefore now want a model that predicts a patient's risk of an ED visit prior to the clinic date instead of prior to the treatment session.

---

In [1]:
%%capture
%cd ../../
%load_ext autoreload
%autoreload 2

In [192]:
import logging

import pandas as pd

from ml_common.util import get_excluded_numbers, load_pickle, save_pickle

from preduce.acu.pipeline import PrepACUData
from preduce.prepare.prep import anchor_features_to_clinic_dates
from preduce.summarize import feature_summary, get_label_distribution
from preduce.util import load_clinic_dates

pd.set_option('display.max_rows', 150)

logging.basicConfig(
    level=logging.INFO, 
    format='%(levelname)s:%(message)s', 
)

## Load clinic data

In [46]:
clinic = load_clinic_dates(data_dir='./data/processed')
treatment = pd.read_parquet('./data/interim/treatment.parquet.gzip')

Removing 123 visits that "occured before" 2006-01-05


In [41]:
# Filter the clinic dates
# Method 1: pd.merge. Pro: readable. Con: pairs every clinic visit with every
# treatment of the same patient, so the intermediate frame may be large and slow
df = pd.merge(clinic, treatment[['mrn', 'treatment_date']], on='mrn', how='inner')
df = df.rename(columns={'treatment_date': 'next_treatment_date'})

# filter out clinic dates where the next treatment session does not occur within 5 days
mask = df['next_treatment_date'].between(df['clinic_date'], df['clinic_date'] + pd.Timedelta(days=5))
df = df[mask]

# filter out clinic dates where notes were uploaded after the next treatment session
mask = df['upload_date'] < df['next_treatment_date']
df = df[mask]

# the merge can leave several treatments per clinic visit; keep only the earliest one
df = df.sort_values(by=['mrn', 'next_treatment_date'])
df = df.drop_duplicates(subset=['mrn', 'clinic_date'], keep='first')
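
The performance concern noted in the cell above could be addressed with `pd.merge_asof`, which matches each clinic visit directly to the next treatment instead of building every clinic-treatment pair first. The following is a minimal sketch on toy data, not the method used in this notebook; the toy frames are made up and only the column names follow the cell above. One caveat: `merge_asof` picks the nearest treatment before the upload-date filter is applied, so results can differ from the merge-then-filter version when the nearest treatment precedes the note upload but a later treatment within the 5-day window does not.

```python
import pandas as pd

# Toy stand-ins for `clinic` and `treatment` (values are made up)
clinic = pd.DataFrame({
    'mrn': [1, 1, 2],
    'clinic_date': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-01-10']),
    'upload_date': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-01-20']),
})
treatment = pd.DataFrame({
    'mrn': [1, 1, 2],
    'treatment_date': pd.to_datetime(['2020-01-03', '2020-03-01', '2020-01-12']),
})

# merge_asof requires both frames to be sorted by their merge key
left = clinic.sort_values('clinic_date')
right = (
    treatment[['mrn', 'treatment_date']]
    .rename(columns={'treatment_date': 'next_treatment_date'})
    .sort_values('next_treatment_date')
)

# for each clinic visit, take the same patient's first treatment on or after
# the clinic date, provided it falls within 5 days
matched = pd.merge_asof(
    left, right,
    left_on='clinic_date', right_on='next_treatment_date',
    by='mrn', direction='forward', tolerance=pd.Timedelta(days=5),
)
# visits without a treatment in the window come back with NaT
matched = matched.dropna(subset=['next_treatment_date'])
# keep visits whose notes were uploaded before the matched treatment
matched = matched[matched['upload_date'] < matched['next_treatment_date']]
print(matched)
```

In this toy example only the first visit of patient 1 survives: patient 1's second visit has no treatment within 5 days, and patient 2's notes were uploaded after the matched treatment.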

In [43]:
df.to_csv('./data/processed/assessment_dates.csv', index=False)

## Anchor features to clinic visits

In [76]:
anchor_features_to_clinic_dates(script_path='../make-clinical-dataset/scripts')

## Load feature data

In [179]:
df = pd.read_parquet('./data/processed/clinic_centered_feature_dataset.parquet.gzip')
emerg = pd.read_parquet('./data/interim/emergency_room_visit.parquet.gzip')

## Prepare Data

In [180]:
# remove clinic visits that occur before the patient has started treatment
# we do not want the model to make an assessment before the clinician has ever met the patient
mask = df['treatment_date'].notnull()
# get_nmissing(df[~mask])
get_excluded_numbers(df, mask, context=' which are the first clinic visits')
df = df[mask]

INFO:Removing 895 patients and 8891 sessions which are the first clinic visits
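
The exclusion log above reports two numbers: sessions dropped and patients dropped. The internals of `get_excluded_numbers` are not shown here, so the following is only a hypothetical stand-in (`count_excluded` is an invented name) illustrating one plausible way to compute such counts, assuming a patient counts as removed only when none of their sessions remain.

```python
import pandas as pd

def count_excluded(df: pd.DataFrame, mask: pd.Series, patient_col: str = 'mrn'):
    """Hypothetical helper: count patients and sessions lost when keeping rows where mask is True."""
    n_sessions = int((~mask).sum())
    # a patient is lost only if every one of their sessions is filtered out
    n_patients = df[patient_col].nunique() - df.loc[mask, patient_col].nunique()
    return n_patients, n_sessions

# toy data (made up): patient 2 has no treatment date at all
toy = pd.DataFrame({
    'mrn': [1, 1, 2, 3],
    'treatment_date': pd.to_datetime(['2020-01-01', None, None, '2020-02-01']),
})
keep = toy['treatment_date'].notnull()
n_patients, n_sessions = count_excluded(toy, keep)
print(f'Removing {n_patients} patients and {n_sessions} sessions')
```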


In [181]:
prep = PrepACUData()
df = prep.preprocess(df, emerg)

Getting change since last session...: 100%|██████████| 7052/7052 [00:02<00:00, 3228.58it/s]
100%|█████████████████████████████████████| 1763/1763 [00:06<00:00, 261.06it/s]
100%|█████████████████████████████████████| 1763/1763 [00:06<00:00, 257.11it/s]
100%|█████████████████████████████████████| 1763/1763 [00:06<00:00, 256.71it/s]
100%|█████████████████████████████████████| 1763/1763 [00:06<00:00, 255.94it/s]
INFO:Removing 3618 patients and 15499 sessions before 2014-01-01 and after 2019-12-31
INFO:Removing the following features for drugs given less than 10 times: ['%_ideal_dose_given_DURVALUMAB', '%_ideal_dose_given_RALTITREXED', '%_ideal_dose_given_IPILIMUMAB', '%_ideal_dose_given_CAPECITABINE', '%_ideal_dose_given_ERLOTINIB']
INFO:Dropping the following 13 features for missingness over 80%: ['basophil', 'esas_appetite_change', 'esas_drowsiness_change', 'prothrombin_time_international_normalized_ratio', 'activated_partial_thromboplastin_time', 'carbohydrate_antigen_19-9', 'mean_corpu

In [182]:
# To align with EPIC system for silent deployment
# 1. remove drug and morphology features
# 2. restrict to GI patients
# This will be temporary
cols = df.columns
cols = cols[~cols.str.contains('morphology|%_ideal_dose')]
df = df[cols]

mask = df['regimen'].str.startswith('GI-')
get_excluded_numbers(df, mask, context=' not from GI department')
df = df[mask]

INFO:Removing 1809 patients and 8152 sessions not from GI department


In [183]:
X, Y, metainfo = prep.prepare(df, event_name='ED_visit')
# clean up Y
for col in ['target_CEDIS_complaint', 'target_CTAS_score']:
    metainfo[col] = Y.pop(col)
Y.columns = Y.columns.str.replace('target_', '')

INFO:Removing 0 patients and 512 sessions that occured after 2018-02-01 in the development cohort
INFO:Development Cohort: NSessions=7250. NPatients=1280. Contains all patients whose first visit was  on or before 2018-02-01
INFO:Test Cohort: NSessions=1108. NPatients=345. Contains all patients whose first visit was  after 2018-02-01
INFO:Removing 1 patients and 36 sessions in which patient had a target event in less than 2 days.
INFO:Removing 0 patients and 8 sessions in which patient had a target event in less than 2 days.
INFO:One-hot encoding training data
INFO:Reassigning the following 19 indicators with less than 6 patients as other: ['regimen_GI-CISPFU + TRAS(LOAD)', 'regimen_GI-CISPFU + TRAS(MAIN)', 'regimen_GI-CISPFU ANAL', 'regimen_GI-DOXO', 'regimen_GI-EOX', 'regimen_GI-FOLFNALIRI', 'regimen_GI-FOLFNALIRI (COMP)', 'regimen_GI-FOLFOX (GASTRIC)', 'regimen_GI-FUFA C2 (GASTRIC)', 'regimen_GI-FUFA C3 (GASTRIC)', 'regimen_GI-FUFA WEEKLY', 'regimen_GI-FUFA-5 DAYS', 'regimen_GI-GEM D

In [174]:
train_mask, valid_mask, test_mask = metainfo['split'] == 'Train', metainfo['split'] == 'Valid', metainfo['split'] == 'Test'
X_train, X_valid, X_test = X[train_mask], X[valid_mask], X[test_mask]
Y_train, Y_valid, Y_test = Y[train_mask], Y[valid_mask], Y[test_mask]
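
The log output of `prep.prepare` describes a temporal, patient-level split: patients whose first visit was on or before 2018-02-01 form the development cohort, later patients form the test cohort, and development sessions after the cutoff are dropped. The sketch below illustrates that logic on toy data. It is an assumption about how such a split could be written, not the `PrepACUData` implementation; `split_by_first_visit` and the use of `clinic_date` as the visit date column are hypothetical.

```python
import pandas as pd

def split_by_first_visit(df, cutoff, date_col='clinic_date', patient_col='mrn'):
    """Hypothetical sketch: assign whole patients to a cohort by their first visit date."""
    cutoff = pd.Timestamp(cutoff)
    first_visit = df.groupby(patient_col)[date_col].transform('min')
    in_dev = first_visit <= cutoff
    # drop development sessions after the cutoff so they do not overlap the test period
    dev = df[in_dev & (df[date_col] <= cutoff)]
    test = df[~in_dev]
    return dev, test

# toy data (made up)
toy = pd.DataFrame({
    'mrn': [1, 1, 2, 2, 3],
    'clinic_date': pd.to_datetime(
        ['2017-06-01', '2018-03-01', '2018-02-01', '2018-02-15', '2018-05-01']
    ),
})
dev, test = split_by_first_visit(toy, cutoff='2018-02-01')
print(dev, test, sep='\n\n')
```

Splitting by patient rather than by session keeps all sessions of one patient in a single cohort, so the test cohort contains only patients the model has never seen.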

In [None]:
# Save the data prep for silent deployment
# So we transform new incoming data using the original data preparer
# save_pickle(prep.scaler, './result', 'scaler_ED')
# save_pickle(prep.imp.imputer, './result', 'imputer_ED')
# save_pickle(prep.clip_thresh, './result', 'clip_thresh_ED')
# save_pickle(prep.ohe.final_columns, './result', 'encoded_cols_ED')

# X.to_csv('./data/debug/to_muammar/X.csv', index=False)
# Y.to_csv('./data/debug/to_muammar/Y.csv', index=False)
# metainfo.to_csv('./data/debug/to_muammar/metainfo.csv', index=False)
# df.loc[X.index].to_csv('./data/debug/to_muammar/orig.csv', index=False)

## Describe Data

In [184]:
count = pd.DataFrame({
    'Number of sessions': metainfo.groupby('split').size(), 
    'Number of patients': metainfo.groupby('split')['mrn'].nunique()}
).T
count['Total'] = count.sum(axis=1)
print(f'\n{count.to_string()}')


split               Test  Train  Valid  Total
Number of sessions  1108   5738   1468   8314
Number of patients   345   1023    256   1624


In [186]:
get_label_distribution(Y, metainfo, with_respect_to='sessions')

          Test         Train        Valid        Total
ED_visit  False  True  False  True  False  True  False  True
ED_visit    960   148   5225   513   1281   187   7466   848


In [187]:
get_label_distribution(Y, metainfo, with_respect_to='patients')

          Test       Train      Valid      Total
             1    0     1    0     1    0     1     0
ED_visit    75  270   247  776    79  177   401  1223


In [188]:
# Feature Characteristics
x = prep.ohe.encode(df.loc[X_train.index].copy(), verbose=False) # get original (non-normalized, non-imputed) data one-hot encoded
x = x[[col for col in x.columns if not (col in metainfo.columns or col.startswith('target'))]]
feature_summary(x, save_path='result/tables/feature_summary_ED_clinic_anchored.csv').sample(10, random_state=42)

     Features                                           Group       Mean (SD)          Missingness (%)
104  Mean Corpuscular Volume Change                     Laboratory  0.463 (2.434)      64.4
22   Topography ICD-0-3 C18, Colon                      Cancer      0.223 (0.417)      0.0
146  Intent of Systemic Treatment Neoadjuvant           Treatment   0.085 (0.279)      0.0
66   Neutrophil (x10e9/L)                               Laboratory  3.498 (2.406)      49.4
65   Monocyte (x10e9/L)                                 Laboratory  0.613 (0.352)      49.8
35   Topography ICD-0-3 C38, Heart, mediastinum, an...  Cancer      0.000 (0.000)      0.0
100  Lactate Dehydrogenase Change                       Laboratory  -0.001 (0.171)     69.6
138  Regimen GI-MITOFU                                  Treatment   0.026 (0.159)      0.0
102  Magnesium Change                                   Laboratory  -0.008 (0.390)     67.9
68   Platelet (x10e9/L)                                 Laboratory  231.512 (129.999)  49.4


## Train Model - Quick and Dirty

In [194]:
from collections import defaultdict
from functools import partial
from bayes_opt import BayesianOptimization
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from xgboost import XGBClassifier

In [189]:
targets = Y.columns

# LGBM does not accept non-alphanumeric characters (except for _) in feature names
for char in ['(', ')', '+', '-', '/', ',']: 
    X_train.columns = X_train.columns.str.replace(char, '_')
    X_valid.columns = X_valid.columns.str.replace(char, '_')
    X_test.columns = X_test.columns.str.replace(char, '_')
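
The character-by-character replacement above only covers the listed characters. A single regular expression that replaces anything other than letters, digits, and underscore is a possible alternative. This is a sketch, not the notebook's method; `sanitize` is a hypothetical name, and unlike the loop above it also replaces spaces. The example names are taken from the one-hot encoding log earlier in this notebook.

```python
import re

def sanitize(name: str) -> str:
    """Replace every character other than letters, digits and underscore with '_'."""
    return re.sub(r'[^0-9A-Za-z_]', '_', name)

cols = ['regimen_GI-FOLFOX (GASTRIC)', 'regimen_GI-CISPFU + TRAS(LOAD)']
clean = [sanitize(c) for c in cols]
print(clean)

# distinct raw names can collapse to the same cleaned name, so check for collisions
assert len(set(clean)) == len(clean)
```

It could be applied with `X_train.columns = [sanitize(c) for c in X_train.columns]`, and likewise for the validation and test frames.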

In [191]:
# hyperparameter tuning
algs = {
    'LR': LogisticRegression,
    'XGB': XGBClassifier,
    'LGBM': LGBMClassifier
}
bayesopt_param = {
    'LR': {'init_points': 2, 'n_iter': 10}, 
    'XGB': {'init_points': 15, 'n_iter': 100},
    'LGBM': {'init_points': 20, 'n_iter': 200},
}
model_static_param = {
    'LR': {
        'penalty': 'l2', 
        'class_weight': 'balanced', 
        'max_iter': 2000,
        'random_state': 42
    },
    'XGB': {
        'random_state': 42
    },
    'LGBM': {
        'random_state': 42,
        'verbosity': -1
    }
}
model_tuning_param = {
    'LR': {
        'C': (0.0001, 1)
    },
    'XGB': {
        'n_estimators': (50, 200),
        'max_depth': (3, 7),
        'learning_rate': (0.01, 0.3),
        'min_split_loss': (0, 0.5),
        'min_child_weight': (6, 100),
        'reg_lambda': (0, 1),
        'reg_alpha': (0, 1000)
    },
    'LGBM': {
        'n_estimators': (50, 200),
        'max_depth': (3, 7),
        'learning_rate': (0.01, 0.3),
        'num_leaves': (20, 40),
        'min_data_in_leaf': (6, 30),
        'feature_fraction': (0.5, 1),
        'bagging_fraction': (0.5, 1),
        'bagging_freq': (0, 10),
        'reg_lambda': (0, 1),
        'reg_alpha': (0, 1000)
    }
}
def convert_params(params):
    # convert necessary hyperparams to integers
    for param in ['n_estimators', 'max_depth', 'num_leaves', 'min_data_in_leaf', 'min_child_weight', 'bagging_freq']:
        if param in params: params[param] = int(params[param])
    return params

def eval_func(alg, data, **kwargs):
    train_X, train_Y, valid_X, valid_Y = data
    kwargs = convert_params(kwargs)
    model = algs[alg](**kwargs, **model_static_param[alg])
    model.fit(train_X, train_Y)
    assert model.classes_[1] == 1  # positive class is at index 1
    pred = model.predict_proba(valid_X)[:, 1]
    return roc_auc_score(valid_Y, pred)
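
`convert_params` is needed because, with plain `(low, high)` bounds, the optimizer proposes continuous values, so integer hyperparameters arrive as floats. Note that `int()` truncates toward zero: a proposal of 6.93 for `max_depth` becomes 6, and the upper bound 7 is reached only when the proposal is exactly 7.0. The sketch below contrasts truncation with rounding; `to_int_params` is a hypothetical helper and the proposal values are made up.

```python
def to_int_params(params, int_keys=('n_estimators', 'max_depth'), rounding=False):
    """Cast float proposals to int, by truncation (like int()) or by rounding."""
    out = dict(params)  # copy so the caller's dict is not mutated
    for key in int_keys:
        if key in out:
            out[key] = round(out[key]) if rounding else int(out[key])
    return out

# a made-up proposal of the kind a continuous optimizer might produce
proposal = {'max_depth': 6.93, 'n_estimators': 149.6, 'learning_rate': 0.05}
truncated = to_int_params(proposal)
rounded = to_int_params(proposal, rounding=True)
print(truncated)
print(rounded)
```

Either choice is defensible. Truncation just means the effective search range for each integer parameter is slightly skewed toward its lower bound.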

In [None]:
%%capture
best_params = {}
for target in targets:
    for alg, optim_config in bayesopt_param.items():
        hyperparam_config = model_tuning_param[alg]
        data = (X_train, Y_train[target], X_valid, Y_valid[target])
        bo = BayesianOptimization(
            f=partial(eval_func, alg=alg, data=data),
            pbounds=hyperparam_config,
            verbose=2,
            random_state=42
        )
        bo.maximize(**optim_config)
        best_param = bo.max['params']
        best_param = convert_params(best_param)
        best_params[f'{alg}_{target}'] = best_param
save_pickle(best_params, save_dir='./models', filename='best_params')

In [195]:
best_params = load_pickle('./models', 'best_params')
models = defaultdict(dict)
for target in targets:
    for alg in algs:
        model = algs[alg](**best_params[f'{alg}_{target}'], **model_static_param[alg])
        model.fit(X_train, Y_train[target])
        models[alg][target] = model

In [196]:
def evaluate(model, X, Y):
    result = {}
    for target, label in Y.items():
        # check model.classes_ to confirm prediction of positive label is at index 1
        pred = model[target].predict_proba(X)[:, 1]
        auprc = average_precision_score(label, pred)
        auroc = roc_auc_score(label, pred)
        result[target] = {'AUPRC': auprc, 'AUROC': auroc}
    return pd.DataFrame(result)

In [197]:
pd.concat([evaluate(model, X_valid, Y_valid) for alg, model in models.items()], keys=models.keys()).T

          LR                  XGB                 LGBM
          AUPRC     AUROC     AUPRC     AUROC     AUPRC     AUROC
ED_visit  0.279957  0.694661  0.251173  0.726619  0.373144  0.751364


In [198]:
pd.concat([evaluate(model, X_test, Y_test) for alg, model in models.items()], keys=models.keys()).T

          LR                  XGB                 LGBM
          AUPRC     AUROC     AUPRC     AUROC     AUPRC     AUROC
ED_visit  0.185906  0.582489  0.203915  0.667029  0.221928  0.666649
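
The test scores are lower than the validation scores, and the test cohort has only 148 positive sessions, so the point estimates carry some uncertainty. One way to gauge it is a percentile bootstrap interval around AUROC. The sketch below is an illustration on synthetic data, not part of the original analysis; `bootstrap_auroc` is a hypothetical helper. It resamples sessions independently, which ignores the correlation between sessions of the same patient, so resampling by patient would likely give wider intervals.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc(y_true, y_score, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for AUROC, resampling sessions with replacement."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUROC is undefined when a resample contains a single class
        scores.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# synthetic labels and scores, for illustration only
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
score = y * 0.3 + rng.random(200)
lower, upper = bootstrap_auroc(y, score, n_boot=200)
print(f'95% interval: [{lower:.3f}, {upper:.3f}]')
```

With the objects defined in this notebook, a call might look like `bootstrap_auroc(Y_test['ED_visit'], models['LGBM']['ED_visit'].predict_proba(X_test)[:, 1])`.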
