# Triage MIMIC - Emergency Department

This analysis relies on the emergency department data from the MIMIC-IV dataset (see https://physionet.org/content/mimic-iv-ed/1.0/ for the original dataset).

First, download the data from the PhysioNet website, following the instructions given there.

```
wget -r -N -c -np --user USERNAME --ask-password https://physionet.org/files/mimic-iv-ed/1.0/  
wget -r -N -c -np --user USERNAME --ask-password https://physionet.org/files/mimiciv/1.0/core/
```

This creates a `physionet.org` folder in which the `ed` directory contains the emergency department data and the `core` directory contains the patient demographics used below.
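
Before loading anything, it can help to confirm that the download produced the files read later in this notebook. The sketch below is only an illustration: `missing_files` is a hypothetical helper, and the relative paths are the ones used in the `pd.read_csv` calls further down.

```python
import os

def missing_files(root, relative_paths):
    """Return the relative paths that do not exist as files under `root`."""
    return [p for p in relative_paths
            if not os.path.isfile(os.path.join(root, p))]

# Paths taken from the loading cell of this notebook
expected = [
    'mimiciv/1.0/core/patients.csv.gz',
    'mimic-iv-ed/1.0/ed/triage.csv.gz',
    'mimic-iv-ed/1.0/ed/edstays.csv.gz',
]

# An empty list means every expected file was found
print(missing_files('physionet.org/files/', expected))
```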

In [1]:
path = 'physionet.org/files/'

##### Extract data of interest

In [2]:
from sklearn.preprocessing import StandardScaler
import pandas as pd
import os

In [3]:
# Open data
demo = pd.read_csv(os.path.join(path, 'mimiciv/1.0/core/patients.csv.gz'), index_col = 0)
triage = pd.read_csv(os.path.join(path, 'mimic-iv-ed/1.0/ed/triage.csv.gz'), index_col = [0, 1])
ed = pd.read_csv(os.path.join(path, 'mimic-iv-ed/1.0/ed/edstays.csv.gz'), index_col = [0, 2], parse_dates = ['intime', 'outtime'])

In [4]:
# Remove unnecessary columns and datapoints with any missing data
triage = triage.drop(columns = 'chiefcomplaint')
triage = triage.dropna(axis = 0, how = 'any')
triage



subject_id,stay_id,temperature,heartrate,resprate,o2sat,sbp,dbp,pain,acuity
15585360,37573921,97.0,87.0,18.0,100.0,150.0,71.0,10.0,3.0
15248757,32172727,97.1,112.0,20.0,100.0,147.0,97.0,8.0,4.0
16648037,38946064,98.5,59.0,18.0,99.0,160.0,86.0,2.0,2.0
13492931,39828574,100.6,90.0,16.0,96.0,107.0,55.0,0.0,3.0
11475777,38193311,97.1,85.0,16.0,100.0,138.0,86.0,7.0,3.0
...,...,...,...,...,...,...,...,...,...
15913671,35574167,98.0,82.0,15.0,98.0,127.0,86.0,8.0,3.0
14913519,33280070,97.1,104.0,18.0,97.0,90.0,57.0,0.0,2.0
13537748,39146222,97.1,56.0,20.0,100.0,177.0,92.0,6.0,2.0
15608541,39109339,97.6,92.0,18.0,98.0,197.0,73.0,0.0,4.0


In [5]:
# Nurse assignment
# Expertise and tiredness might play a role here, so we use the day of the week of admission as a proxy for these dimensions
triage['nurse'] = ed.intime.dt.day_of_week[triage.index]
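
The assignment above relies on aligning two frames on their `(subject_id, stay_id)` index. The toy example below sketches that alignment on made-up rows; `ed_toy` and `triage_toy` are hypothetical stand-ins, and `reindex` is used here as one explicit way to line up the indices.

```python
import pandas as pd

# Hypothetical stays indexed by (subject_id, stay_id)
idx = pd.MultiIndex.from_tuples(
    [(1, 10), (1, 11), (2, 20)], names=['subject_id', 'stay_id'])
ed_toy = pd.DataFrame(
    {'intime': pd.to_datetime(
        ['2021-01-04 08:00', '2021-01-09 23:30', '2021-01-10 02:15'])},
    index=idx)

# Triage rows exist for only two of the three stays
triage_toy = pd.DataFrame({'heartrate': [80.0, 95.0]}, index=idx[[0, 2]])

# Day of week (Monday = 0, Sunday = 6), aligned on the triage index
day_of_week = ed_toy['intime'].dt.day_of_week
triage_toy['nurse'] = day_of_week.reindex(triage_toy.index)
print(triage_toy)
```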

In [6]:
# Acuity binarization - D
# Human decision
triage['D'] = triage['acuity'] <= 2

In [7]:
# Outcome - Y1
# Described as admission to the hospital; note that hadm_id.isna() is True when no hospital admission is linked to the stay
triage['Y1'] = ed.hadm_id.isna()[triage.index]

In [8]:
# Outcome - Y2
# Defined as abnormal vital signs using Emergency Severity Index
triage['Y2'] = (triage.o2sat < 92) | (triage.resprate > 20) | (triage.heartrate > 100)

In [9]:
# Concept - YC
# YC is defined as the union of Y1 and Y2
triage['YC'] = triage['Y1'] | triage['Y2']
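
To make the label definitions concrete, here is a small sketch on four made-up patients. The values are invented for illustration, and the `admitted` column is a hypothetical stand-in for Y1.

```python
import pandas as pd

# Four made-up patients (not MIMIC data)
toy = pd.DataFrame({
    'o2sat':     [99.0, 90.0, 97.0, 98.0],
    'resprate':  [16.0, 18.0, 24.0, 18.0],
    'heartrate': [80.0, 85.0, 90.0, 120.0],
    'acuity':    [3.0, 1.0, 2.0, 4.0],
    'admitted':  [False, True, False, False],  # hypothetical stand-in for Y1
})

# Human decision: high acuity (levels 1 and 2)
toy['D'] = toy['acuity'] <= 2

# Abnormal vital signs: any one of the three thresholds is enough
toy['Y2'] = (toy['o2sat'] < 92) | (toy['resprate'] > 20) | (toy['heartrate'] > 100)

# Concept: union of the two outcomes
toy['YC'] = toy['admitted'] | toy['Y2']
print(toy[['D', 'Y2', 'YC']])
```

The last patient shows how D and YC can disagree: acuity 4 gives D = False, while a heart rate above 100 gives Y2 = YC = True.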

In [10]:
# Normalize data
triage.iloc[:, :-5] = StandardScaler().fit_transform(triage.iloc[:, :-5])

In [11]:
triage.to_csv('triage_clean.csv')

### Verification

We check what proportion of the population has each of these characteristics.

In [12]:
# Nurse assignment
triage['nurse'].value_counts().sort_index() / len(triage)

0    0.143822
1    0.142558
2    0.142089
3    0.143345
4    0.142443
5    0.142326
6    0.143418
Name: nurse, dtype: float64

In [13]:
# Human decision D - Acuity
triage['D'].mean()

0.36397630728730407

In [14]:
# Outcome - Y1
triage['Y1'].mean()

0.5445559610705596

In [15]:
# Outcome - Y2
triage['Y2'].mean()

0.20116369510589924

In [16]:
# Overlap between outcomes: fraction of Y2-positive patients who are also Y1-positive
(triage['Y1'] & triage['Y2']).sum() / triage['Y2'].sum()

0.4336381887129155

----------

# Semi-synthetic labels for scenarios

We create semi-synthetic labels using tree-based models to allow more control over the consistency scenarios.
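
The idea can be sketched on random data: fit a shallow tree to noisy labels, then replace the labels by the thresholded tree predictions, so the new labels are an exact function of the features. Everything below (data, depth, noise level) is an assumption chosen for illustration, not the setup used on MIMIC.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up features and noisy binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_noisy = (X[:, 0] + 0.5 * rng.normal(size=200)) > 0

# Fit a shallow tree to the noisy labels
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y_noisy)

# Semi-synthetic labels: the thresholded leaf probabilities
y_synth = tree.predict_proba(X)[:, 1] > 0.5

print('agreement with the original labels:', (y_synth == y_noisy).mean())
```

A shallower tree gives simpler, more controllable labels at the cost of lower agreement with the original ones.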

In [17]:
from sklearn.metrics import roc_auc_score, precision_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np

**Scenario 1**: One model for each expert, with random noise added in the leaves that are highly consistent with Y1 (experts agree on Y2, so modelling YC might benefit).

In [18]:
# Model for Y1
model_y1 = DecisionTreeClassifier(max_depth = 9, random_state = 42)
model_y1.fit(triage.iloc[:, :7], triage['Y1'])
synth_y1 = model_y1.predict_proba(triage.iloc[:, :7])[:, 1]
roc_auc_score(triage['Y1'], synth_y1)

0.6871421377988849

In [19]:
# Model for Y2
model_y2 = DecisionTreeClassifier(max_depth = 2, random_state = 42)
model_y2.fit(triage.iloc[:, :7], triage['Y2'])
synth_y2 = model_y2.predict_proba(triage.iloc[:, :7])[:, 1]
roc_auc_score(triage['Y2'], synth_y2)

0.9928681190670929

In [20]:
# Update labels
triage['Y1'] = synth_y1 > 0.5
triage['Y2'] = synth_y2 > 0.5
triage['YC'] = triage['Y1'] | triage['Y2']

In [21]:
# Model for D: use a model for YC and change some of the leaf decisions with random noise
model_yc = DecisionTreeClassifier(max_depth = 4, random_state = 42)
model_yc.fit(triage.iloc[:, :7], triage['YC'])
synth_yc = model_yc.predict_proba(triage.iloc[:, :7])[:, 1]
roc_auc_score(triage['YC'], synth_yc)

0.9659704778145679

In [22]:
# Compute the final leaf of each point
final_leave_yc = model_yc.apply(triage.iloc[:, :7])

# Compute precision with respect to Y1 and Y2 for each leaf
for leaf in np.unique(final_leave_yc):
    selection = final_leave_yc == leaf
    print('Y1 {} -> {:.2f} precision - {} patients'.format(leaf, 
            precision_score(triage['Y1'][selection], synth_yc[selection] > 0.5), selection.sum()))
    print('Y2 {} -> {:.2f} precision - {} patients'.format(leaf, 
            precision_score(triage['Y2'][selection], synth_yc[selection] > 0.5), selection.sum()))



Y1 4 -> 0.00 precision - 50679 patients
Y2 4 -> 0.00 precision - 50679 patients
Y1 5 -> 0.00 precision - 2756 patients
Y2 5 -> 1.00 precision - 2756 patients
Y1 6 -> 0.00 precision - 12646 patients
Y2 6 -> 1.00 precision - 12646 patients
Y1 9 -> 0.00 precision - 26867 patients
Y2 9 -> 0.00 precision - 26867 patients
Y1 10 -> 0.27 precision - 7321 patients
Y2 10 -> 1.00 precision - 7321 patients
Y1 12 -> 0.00 precision - 4127 patients
Y2 12 -> 0.00 precision - 4127 patients
Y1 13 -> 0.83 precision - 60272 patients
Y2 13 -> 0.21 precision - 60272 patients
Y1 17 -> 0.00 precision - 10747 patients
Y2 17 -> 0.00 precision - 10747 patients
Y1 18 -> 0.02 precision - 1196 patients
Y2 18 -> 1.00 precision - 1196 patients
Y1 20 -> 0.87 precision - 17501 patients
Y2 20 -> 0.13 precision - 17501 patients
Y1 21 -> 0.00 precision - 949 patients
Y2 21 -> 0.00 precision - 949 patients
Y1 24 -> 0.97 precision - 177069 patients
Y2 24 -> 0.15 precision - 177069 patients
Y1 25 -> 0.57 precision - 25450 pa



In [23]:
# Change prediction with noise for leaves with high precision for Y1
leaves_to_update = [13, 20, 24]


eps = 2 # Noise to add
for leaf in leaves_to_update:
    selection = final_leave_yc == leaf
    noise = (np.random.random(np.sum(selection)) - 0.5) * 2 * eps
    synth_yc[selection] = np.minimum(np.maximum(synth_yc[selection] + noise, 0), 1)
    print(leaf, np.mean(synth_yc[selection] > 0.5))

13 0.6067162198035572
20 0.6111650762813553
24 0.621051680418368
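
The per-leaf noise step can be reproduced in isolation. The sketch below uses made-up scores and leaf ids, and a seeded generator (`np.random.default_rng`, which the cell above does not use) so that the result is repeatable.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up scores and leaf ids for six points
scores = np.array([0.9, 0.9, 0.9, 0.1, 0.1, 0.8])
leaf_ids = np.array([13, 13, 13, 4, 4, 20])
leaves_to_update = [13]
eps = 2  # Half-width of the uniform noise

noisy = scores.copy()
for leaf in leaves_to_update:
    selection = leaf_ids == leaf
    # Uniform noise in [-eps, eps], then clip back to valid probabilities
    noise = rng.uniform(-eps, eps, size=selection.sum())
    noisy[selection] = np.clip(noisy[selection] + noise, 0, 1)

print(noisy)
```

Because the noise is uniform on [-eps, eps], a point with score s stays above 0.5 with probability (eps + s - 0.5) / (2 * eps) whenever that value lies in [0, 1]. With eps = 2 this is at most 0.625, which is in line with the proportions printed above.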


In [24]:
triage['D'] = synth_yc > 0.5

In [25]:
triage.to_csv('triage_scenario_1.csv')

**Scenario 2**: Non-random assignment with bias. Women are assigned to one expert who is biased and overestimates their risk (D == 1).

In [26]:
triages2 = triage.copy()
triages2['D'] = triages2['YC'] # Initialize close to oracle

In [27]:
gender = triages2.join(demo).gender
index_women = gender[gender == 'F'].sample(frac = 0.5, random_state = 42).index # Select 50% of women
triages2.loc[index_women, 'nurse'] = 1 # Non random assignment
triages2.loc[index_women, 'D'] = 1 # Increase from 75% to 100%

In [28]:
triages2.to_csv('triage_scenario_2.csv')
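
As a sanity check of the selection logic, the toy example below filters to women before sampling, so that only women can be drawn, and then overrides their nurse and decision. The frame and its values are made up for illustration.

```python
import pandas as pd

# Ten made-up patients, six of them women
toy = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
    'nurse':  [0, 1, 2, 3, 4, 5, 6, 0, 2, 3],
    'D':      [False] * 10,
})

# Restrict to women first, then sample half of them
women = toy.index[toy['gender'] == 'F']
index_women = toy.loc[women].sample(frac=0.5, random_state=42).index

toy.loc[index_women, 'nurse'] = 1   # Non-random assignment
toy.loc[index_women, 'D'] = True    # Biased decision
print(toy)
```

Sampling a boolean mask directly would draw from all rows, men included, which is why the filter comes first here.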

**Scenario 3**: Shared biases. All experts overestimate risk for women.

In [29]:
triages2.loc[gender == 'F', 'D'] = 1 # Biased against women

In [30]:
triages2.to_csv('triage_scenario_3.csv')

**Scenario 4**: Noise dependent on experts. Different experts come with different levels of expertise. We model this with one nurse who is correct 50% of the time and one who is never correct.

In [31]:
triages4 = triage.copy()
triages4['D'] = triages4['YC']

In [33]:
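# Nurse 0: always wrong with respect to Y1
is_nurse0 = triages4.nurse == 0
triages4.loc[is_nurse0, 'D'] = ~triages4.loc[is_nurse0, 'Y1']

# Nurse 1: wrong by default
is_nurse1 = triages4.nurse == 1
triages4.loc[is_nurse1, 'D'] = ~triages4.loc[is_nurse1, 'Y1']

# Nurse 1: right on a random 50% of their patients
selection = triages4[is_nurse1].sample(frac = 0.5, random_state = 42).index
triages4.loc[selection, 'D'] = triages4.loc[selection, 'Y1']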



In [34]:
triages4.to_csv('triage_scenario_4.csv')
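
The expert-dependent noise can be checked on a small made-up frame. The sketch below writes through `.loc` on the frame itself, so every change lands in the frame that is later saved, and then reports each nurse's agreement with Y1. All values are invented for illustration.

```python
import pandas as pd

# Made-up frame: two patients for nurses 0 and 2, four for nurse 1
toy = pd.DataFrame({
    'nurse': [0, 0, 1, 1, 1, 1, 2, 2],
    'Y1':    [True, False, True, False, True, False, True, False],
    'YC':    [True, False, True, True, True, False, True, True],
})
toy['D'] = toy['YC']  # Initialize close to oracle

# Nurse 0: always wrong with respect to Y1
is_nurse0 = toy['nurse'] == 0
toy.loc[is_nurse0, 'D'] = ~toy.loc[is_nurse0, 'Y1']

# Nurse 1: wrong by default, then right on a random half of their patients
is_nurse1 = toy['nurse'] == 1
toy.loc[is_nurse1, 'D'] = ~toy.loc[is_nurse1, 'Y1']
right = toy[is_nurse1].sample(frac=0.5, random_state=42).index
toy.loc[right, 'D'] = toy.loc[right, 'Y1']

# Agreement between D and Y1 for each nurse
accuracy = (toy['D'] == toy['Y1']).groupby(toy['nurse']).mean()
print(accuracy)
```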