# Triage MIMIC - Emergency Department

This analysis relies on the emergency department data from the MIMIC-IV dataset (see https://physionet.org/content/mimic-iv-ed/1.0/ for the original dataset).

First, download the data from the PhysioNet website, following the instructions given there.

```
wget -r -N -c -np --user USERNAME --ask-password https://physionet.org/files/mimic-iv-ed/1.0/  
wget -r -N -c -np --user USERNAME --ask-password https://physionet.org/files/mimiciv/1.0/core/
```

This creates a `physionet.org` folder, in which the `ed` and `core` directories contain all the relevant data.
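
Before running the cells below, it can help to confirm that the download produced the three files the notebook reads. The following is a minimal sketch: the file list is copied from the `read_csv` calls further down, while `EXPECTED_FILES` and `missing_files` are hypothetical names introduced here for illustration.

```python
import os

# Files read later in this notebook, relative to the download root
EXPECTED_FILES = [
    'mimiciv/1.0/core/patients.csv.gz',
    'mimic-iv-ed/1.0/ed/triage.csv.gz',
    'mimic-iv-ed/1.0/ed/edstays.csv.gz',
]

def missing_files(root, expected = EXPECTED_FILES):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    return [f for f in expected if not os.path.isfile(os.path.join(root, f))]

missing = missing_files('physionet.org/files/')
if missing:
    print('Missing files:', missing)
```

An empty list means every file is in place and the `path` variable defined next points at the right folder.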

In [1]:
path = 'physionet.org/files/'

##### Extract data of interest

In [2]:
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import os

In [3]:
# Open data
demo = pd.read_csv(os.path.join(path, 'mimiciv/1.0/core/patients.csv.gz'), index_col = 0)
triage = pd.read_csv(os.path.join(path, 'mimic-iv-ed/1.0/ed/triage.csv.gz'), index_col = [0, 1])
ed = pd.read_csv(os.path.join(path, 'mimic-iv-ed/1.0/ed/edstays.csv.gz'), index_col = [0, 2], parse_dates = ['intime', 'outtime'])

In [4]:
# Remove unnecessary columns and datapoints with any missing data
triage = triage.drop(columns = 'chiefcomplaint')
triage = triage.dropna(axis = 0, how = 'any')
triage


subject_id,stay_id,temperature,heartrate,resprate,o2sat,sbp,dbp,pain,acuity
15585360,37573921,97.0,87.0,18.0,100.0,150.0,71.0,10.0,3.0
15248757,32172727,97.1,112.0,20.0,100.0,147.0,97.0,8.0,4.0
16648037,38946064,98.5,59.0,18.0,99.0,160.0,86.0,2.0,2.0
13492931,39828574,100.6,90.0,16.0,96.0,107.0,55.0,0.0,3.0
11475777,38193311,97.1,85.0,16.0,100.0,138.0,86.0,7.0,3.0
...,...,...,...,...,...,...,...,...,...
15913671,35574167,98.0,82.0,15.0,98.0,127.0,86.0,8.0,3.0
14913519,33280070,97.1,104.0,18.0,97.0,90.0,57.0,0.0,2.0
13537748,39146222,97.1,56.0,20.0,100.0,177.0,92.0,6.0,2.0
15608541,39109339,97.6,92.0,18.0,98.0,197.0,73.0,0.0,4.0


In [5]:
# Nurse assignment
# Expertise and tiredness might play a role in triage decisions; here each stay is assigned uniformly at random to one of 20 nurses
triage['nurse'] = np.random.choice(np.arange(20), size = len(triage))

In [6]:
# Outcome - Y1
# Defined from hospital admission: as written, `hadm_id.isna()` is True for stays without a hospital admission id
triage['Y1'] = ed.hadm_id.isna()[triage.index]

In [7]:
# Outcome - Y2
# Defined as age above 65 or a pain score of at least 7
triage['Y2'] = (triage.join(demo).anchor_age > 65) | (triage['pain'] >= 7)

In [8]:
# Concept - Yc
# Yc is defined as the union of Y1 and Y2
triage['YC'] = triage['Y1'] | triage['Y2']
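
To make the three label definitions concrete, here is a small sketch on made-up rows. The `toy` table and its values are assumptions for illustration only; the expressions mirror the cells above, including `hadm_id.isna()` exactly as the notebook writes it.

```python
import pandas as pd

# Toy stand-in for the joined triage / demographics / stay tables (values are made up)
toy = pd.DataFrame({
    'anchor_age': [30, 70, 45, 50],
    'pain':       [8, 2, 3, 1],
    'hadm_id':    [None, 101.0, None, 102.0],
})

toy['Y1'] = toy['hadm_id'].isna()                          # same expression as the notebook
toy['Y2'] = (toy['anchor_age'] > 65) | (toy['pain'] >= 7)  # older than 65 or high pain
toy['YC'] = toy['Y1'] | toy['Y2']                          # union of the two outcomes
print(toy[['Y1', 'Y2', 'YC']])
```

Only the last row has both outcomes False, so it is the only row with `YC` False.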

In [9]:
# Normalize the vital signs and pain (all columns except acuity, nurse and the three labels)
triage.iloc[:, :-5] = StandardScaler().fit_transform(triage.iloc[:, :-5])
triage

subject_id,stay_id,temperature,heartrate,resprate,o2sat,sbp,dbp,pain,acuity,nurse,Y1,Y2,YC
15585360,37573921,-0.270143,0.118331,0.019424,0.088325,0.312013,-0.009545,1.377716,3.0,2,True,True,True
15248757,32172727,-0.243475,1.536710,0.106396,0.088325,0.248943,0.015027,0.883808,4.0,18,True,True,True
16648037,38946064,0.129869,-1.470253,0.019424,0.028974,0.522249,0.004631,-0.597917,2.0,9,True,False,True
13492931,39828574,0.689884,0.288537,-0.067549,-0.149079,-0.592002,-0.024667,-1.091826,3.0,17,False,False,False
11475777,38193311,-0.243475,0.004861,-0.067549,0.088325,0.059730,0.004631,0.636853,3.0,8,False,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15913671,35574167,-0.003468,-0.165345,-0.111035,-0.030377,-0.171530,0.004631,0.883808,3.0,15,True,True,True
14913519,33280070,-0.243475,1.082829,0.019424,-0.089728,-0.949403,-0.022777,-1.091826,2.0,14,False,False,False
13537748,39146222,-0.243475,-1.640459,0.106396,0.088325,0.879651,0.010302,0.389899,2.0,7,True,False,True
15608541,39109339,-0.110138,0.402007,0.019424,-0.030377,1.300123,-0.007655,-1.091826,4.0,15,True,True,True
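
The normalization above selects columns by position (`iloc[:, :-5]`), which silently changes meaning if a column is added or reordered. A possible alternative, sketched on a toy frame with made-up values, is to select the columns by name; the `to_scale` list is an assumption introduced for this example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    'heartrate': [60.0, 80.0, 100.0],
    'sbp':       [110.0, 120.0, 130.0],
    'acuity':    [2.0, 3.0, 4.0],
    'Y1':        [True, False, True],
})

to_scale = ['heartrate', 'sbp']   # selected by name instead of position
toy[to_scale] = StandardScaler().fit_transform(toy[to_scale])
print(toy)
```

The scaled columns end up with zero mean and unit (population) standard deviation, while `acuity` and `Y1` are left untouched.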


### Verification

We check what proportion of the population has each of these characteristics.

In [10]:
# Nurse assignment
triage['nurse'].value_counts().sort_index() / len(triage)

0     0.050078
1     0.049512
2     0.050093
3     0.050148
4     0.050422
5     0.050728
6     0.050108
7     0.049876
8     0.050125
9     0.050569
10    0.049470
11    0.049696
12    0.050302
13    0.049938
14    0.049956
15    0.049651
16    0.049644
17    0.049906
18    0.049577
19    0.050202
Name: nurse, dtype: float64

In [11]:
# Outcome - Y1
triage['Y1'].mean()

0.5445559610705596

In [12]:
# Outcome - Y2
triage['Y2'].mean()

0.5479064456942284

In [13]:
# Concept - Yc
triage['YC'].mean()

0.8152596625583344

In [14]:
# Intersection Y1 and Y2
(triage['Y1'] & triage['Y2']).sum() / min(triage['Y1'].sum(), triage['Y2'].sum())

0.5090436319189163
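
The ratio in cell [14] divides the size of the intersection by the size of the smaller set, which is the overlap (Szymkiewicz-Simpson) coefficient. A minimal sketch on toy arrays follows; `overlap_coefficient` is a hypothetical helper name, not something defined in this notebook.

```python
import numpy as np

def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|) for two boolean arrays (hypothetical helper name)."""
    a, b = np.asarray(a, dtype = bool), np.asarray(b, dtype = bool)
    return (a & b).sum() / min(a.sum(), b.sum())

y1 = np.array([True, True, False, False, True])
y2 = np.array([True, False, False, True, False])
print(overlap_coefficient(y1, y2))   # one shared positive, smaller set has two
```

The coefficient equals 1 when one set is contained in the other, so a value near 0.5 indicates that the two outcomes overlap only partially.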

In [15]:
# Intersection Y1 concept
(triage['Y1'] & triage['YC']).sum() / triage['YC'].sum()

0.6679540103354432

In [16]:
# Intersection Y2 concept
(triage['Y2'] & triage['YC']).sum() / triage['YC'].sum()

0.6720637250405161

------------

# Semi-synthetic labels for scenarios

We create semi-synthetic labels using tree-based models, which gives more control over the consistency scenarios.

In [17]:
from sklearn.metrics import roc_auc_score, precision_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np

#### Scenario 1 : Random errors


1. Build a tree to predict Y1
2. Build a tree to predict Y2
3. Update the synthetic labels (Y1, Y2 and YC) to be the ones predicted by the trees
4. Build a tree to predict YC (aim for a high AUC)
5. Analyze each leaf and keep the leaves with high precision for Y1 (> 70 %) and low intersection with Y2 (< 30 %)
6. Randomly draw a label for 100 % of the samples in these leaves
7. Set D to the updated labels
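
Steps 4 to 7 can be sketched end to end on synthetic covariates. Everything in this block (the `toy` labels, the three random features, the tree depth) is an assumption chosen to keep the example small; it illustrates the leaf selection mechanics and is not the notebook's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size = (2000, 3)), columns = ['a', 'b', 'c'])  # toy covariates
toy = pd.DataFrame({'Y1': X['a'] > 0, 'Y2': X['b'] > 1})
toy['YC'] = toy['Y1'] | toy['Y2']

# Step 4: tree for YC, then map every sample to the leaf it falls in
tree = DecisionTreeClassifier(max_depth = 4, random_state = 0).fit(X, toy['YC'])
leaf_id = tree.apply(X)
pred = tree.predict_proba(X)[:, 1]

# Step 5: per-leaf statistics and selection
stats = toy.groupby(leaf_id).agg(precision_y1 = ('Y1', 'mean'))
stats['overlap_y2'] = (toy['Y1'] & toy['Y2']).groupby(leaf_id).mean()
selected = stats.index[(stats['precision_y1'] > 0.7) & (stats['overlap_y2'] < 0.3)]

# Step 6: random label for every sample in the selected leaves
in_selected = np.isin(leaf_id, selected)
pred[in_selected] = rng.integers(0, 2, size = in_selected.sum())

# Step 7: D is the thresholded, perturbed prediction
toy['D'] = pred > 0.5
```

Samples outside the selected leaves keep the tree's prediction, so the noise is confined to regions where Y1 dominates and Y2 rarely co-occurs.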

In [18]:
triages1 = triage.copy().drop(columns = ['acuity', 'pain'])
covariates = triages1.drop(columns = ['nurse', 'Y1', 'Y2', 'YC'])

In [19]:
# 1 - Model for Y1
model_y1 = DecisionTreeClassifier(max_depth = 15, random_state = 42)
model_y1.fit(covariates, triages1['Y1'])
synth_y1 = model_y1.predict_proba(covariates)[:, 1]
roc_auc_score(triages1['Y1'], synth_y1)

0.7099558050771544

In [20]:
# 2 - Model for Y2
model_y2 = DecisionTreeClassifier(max_depth = 15, random_state = 42)
model_y2.fit(covariates, triages1['Y2'])
synth_y2 = model_y2.predict_proba(covariates)[:, 1]
roc_auc_score(triages1['Y2'], synth_y2)

0.6933491130541255

In [21]:
# 3 - Update labels
triages1['Y1'] = synth_y1 > 0.5
triages1['Y2'] = synth_y2 > 0.5
triages1['YC'] = triages1['Y1'] | triages1['Y2']

In [22]:
# 4 - Model for D: use a model for Yc and change some of the leaf decisions
model_yc = DecisionTreeClassifier(max_depth = 10, random_state = 42)
model_yc.fit(covariates, triages1['YC'])
synth_yc = model_yc.predict_proba(covariates)[:, 1]
roc_auc_score(triages1['YC'], synth_yc)

0.9078503686285475

In [23]:
# 5 - Analyse leaves
final_leave_yc = model_yc.apply(covariates)
print('Tree contains {} leaves'.format(len(np.unique(final_leave_yc))))

## Select leaves
leaves_to_update = triages1.groupby(final_leave_yc).apply(lambda leaf: (leaf['Y1'].mean() > 0.7) & ((leaf['Y1'] & leaf['Y2']).mean() < 0.3))
leaves_to_update = leaves_to_update[leaves_to_update].index

Tree contains 943 leaves


In [24]:
# 6 - Randomly draw predictions
print("{} leaves selected covering: {:.2f} % of the population".format(len(leaves_to_update), 100*pd.Series(final_leave_yc).isin(leaves_to_update).mean()))
synth_yc_sc1 = synth_yc.copy()

# Draw a random label for 100 % of the samples in the selected leaves (the `> 0.` mask keeps every sample)
noise = np.random.uniform(size = len(final_leave_yc)) > 0.
for leaf in leaves_to_update:
    selection = (final_leave_yc == leaf) & noise
    synth_yc_sc1[selection] = np.random.choice([0, 1], size = np.sum(selection))

33 leaves selected covering: 45.19 % of the population


In [25]:
# 7 - Update D
triages1['D'] = synth_yc_sc1 > 0.5
triages1.to_csv('triage_scenario_1.csv')

#### Scenario 2: Incorrect and homogeneous beliefs

Instead of step 6, a homogeneous bias is applied to the whole population: 75 % of the patients in the selected leaves are predicted as not(Y1).
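
The flipping mechanism can be sketched on toy arrays. All inputs here are made up, and `bias_rate`, `in_selected_leaves` and `decision` are hypothetical names; the mask `rng.random(n) < bias_rate` has the same distribution as the `uniform > .25` comparison used in the cell below.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
y1 = rng.random(n) > 0.5                  # toy Y1 labels
in_selected_leaves = rng.random(n) > 0.5  # toy mask for the selected leaves
decision = y1.astype(float)               # toy starting decision (stand-in for synth_yc)

bias_rate = 0.75
biased = rng.random(n) < bias_rate        # True for about 75 % of the patients
flip = biased & in_selected_leaves
decision[flip] = ~y1[flip]                # the biased decision is the opposite of Y1
```

Patients outside the selected leaves, and the remaining 25 % inside them, keep their original decision.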

In [27]:
triages2 = triages1.copy()

In [28]:
# 6ter - Bias 75 %
synth_yc_sc2 = synth_yc.copy()

## Selection of 75 % of the patients
np.random.seed(42)
biased = np.random.uniform(size = len(triages2)) > .25

# Reverse leaves
selection = biased & pd.Series(final_leave_yc, index = triages2.index).isin(leaves_to_update)
synth_yc_sc2[selection] = ~triages2.Y1[selection]

In [29]:
# 7ter - Update D
triages2['D'] = synth_yc_sc2 > 0.5
triages2.to_csv('triage_scenario_2.csv')

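#### Scenario 3: Incorrect and heterogeneous beliefs

Instead of step 6, each nurse has a different bias level $X_{nurse}$, meaning that the nurse predicts not(Y1) in these leaves for $X_{nurse}$ % of the patients. The cell below generates three datasets, one per value of `lower` (0.3, 0.5 and 0.7): `proba_error` is drawn per nurse in [`lower`, `lower` + 0.3] (70 % to 100 % for `lower` = 0.7), and a patient's label is flipped when a uniform draw exceeds it, so the realized flip rate for a nurse is 1 - `proba_error`.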
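
A vectorized way to apply nurse-specific rates is to index the per-nurse rate array by each patient's nurse. The sketch below uses made-up data and assumes that `bias_rate[k]` is directly the probability that nurse `k` flips a label, with the 70 % to 100 % range taken from the description; it is an illustration, not the loop used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n_patients, n_nurses = 5000, 20
nurse = rng.integers(0, n_nurses, size = n_patients)   # toy random assignment
y1 = rng.random(n_patients) > 0.5                      # toy Y1 labels
decision = y1.astype(float)                            # toy starting decision

# One bias rate per nurse, drawn uniformly in [0.7, 1.0]
bias_rate = 0.7 + 0.3 * rng.random(n_nurses)

# A patient is flipped with the rate of the nurse who sees them
flip = rng.random(n_patients) < bias_rate[nurse]
decision[flip] = ~y1[flip]

# Empirical flip rate per nurse, to compare with bias_rate
observed = np.array([flip[nurse == k].mean() for k in range(n_nurses)])
```

Comparing `observed` with `bias_rate` is a quick sanity check that the direction of the inequality matches the intended rate.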

In [30]:
triages3 = triages1.copy()

In [31]:
# 6bis - Draw different rate for each nurse and update accordingly
for lower in [0.3, 0.5, 0.7]:
    synth_yc_sc3 = synth_yc.copy()

    ## Create nurse-specific noise
    np.random.seed(42)
    proba_error = lower + 0.3 * np.random.uniform(size = len(np.unique(triages3.nurse)))
    noises = {nurse: np.random.uniform(size = len(triages3)) > proba_error[nurse] for nurse in np.unique(triages3.nurse)}

    # Draw random label
    selection = pd.Series(final_leave_yc, index = triages3.index).isin(leaves_to_update)
    for nurse in noises:
        selection_nurse = selection & noises[nurse] & (triages3.nurse == nurse)
        synth_yc_sc3[selection_nurse] = ~triages3.Y1[selection_nurse]

    # 7bis - Update D
    triages3['D'] = synth_yc_sc3 > 0.5
    triages3.to_csv('triage_scenario_3_{}.csv'.format(lower))

#### Scenario 4: One nurse biased against one group

Instead of step 6, one nurse is biased against female patients.

1. Create the group
2. Bias the nurse's decisions by underestimating the risk (D set to False) for all patients of the group seen by this nurse
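
The two steps can be sketched on a toy table. The values are made up and `biased_nurse` is a hypothetical name; the update uses a single `.loc` call that selects rows and column together, which pandas documents as the reliable way to assign to a subset of a DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({
    'Group': [1, 1, 0, 1, 0, 1],
    'nurse': [0, 1, 0, 0, 2, 3],
    'D':     [True, True, True, False, True, True],
})

biased_nurse = 0
mask = (toy['Group'] == 1) & (toy['nurse'] == biased_nurse)
toy.loc[mask, 'D'] = False   # rows and column selected in one indexing operation
print(toy)
```

Only group members seen by nurse 0 are affected; patients outside the group seen by the same nurse keep their decision.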

In [32]:
triages4 = triages1.copy()

In [33]:
# 1 - Create group
triages4['Group'] = (triages4.join(demo.gender).gender == 'F').astype(int)

In [34]:
# 2 - Bias nurse 0
selection_nurse = (triages4.Group == 1) & (triages4.nurse == 0)
triages4.loc[selection_nurse, 'D'] = False


In [35]:
triages4.to_csv('triage_scenario_4.csv')

#### Scenario 4': One nurse biased against one group

Same as before, but with non-random assignment: 90 % of the female patients are assigned to this nurse.
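
Non-random assignment amounts to sampling 90 % of the rows inside the group and changing only their `nurse` value. The sketch below uses a made-up table; `moved` and `share` are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
toy = pd.DataFrame({
    'Group': rng.integers(0, 2, size = n),
    'nurse': rng.integers(0, 20, size = n),
})
original = toy['nurse'].copy()

# Sample 90 % of the rows in the group, then update only the `nurse` column
moved = toy[toy['Group'] == 1].sample(frac = 0.9, replace = False, random_state = 42).index
toy.loc[moved, 'nurse'] = 0

# Share of the group now seen by nurse 0
share = (toy.loc[toy['Group'] == 1, 'nurse'] == 0).mean()
print(share)
```

Filtering to the group before calling `sample`, and naming the column in `.loc`, keeps the other rows and columns unchanged.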

In [46]:
triages4bis = triages4.copy()

In [48]:
# 2 - Bias nurse 0
np.random.seed(42)
selection_nurse = (triages4bis.Group == 1)

# Assign 90 % of the female patients to nurse 0 (only the `nurse` column is updated)
female_index = selection_nurse[selection_nurse].sample(frac = 0.9, replace = False).index
triages4bis.loc[female_index, 'nurse'] = 0

selection_nurse = selection_nurse & (triages4bis.nurse == 0)
triages4bis.loc[selection_nurse, 'D'] = False


In [49]:
triages4bis.to_csv('triage_scenario_4bis.csv')

#### Scenario 5: Half of the nurses biased against one group

Same as before, but half of the nurses are biased.

In [36]:
triages5 = triages4.copy()

In [37]:
# 2bis - Bias half nurses
selection_nurse = (triages5.Group == 1) & (triages5.nurse < 10)
triages5.loc[selection_nurse, 'D'] = False


In [38]:
triages5.to_csv('triage_scenario_5.csv')

#### Scenario 6: All nurses biased against one group

Same as before, but all nurses are biased.

In [39]:
triages6 = triages5.copy()

In [40]:
# 2bis - Bias all nurses
triages6.loc[triages6.Group == 1, 'D'] = False


In [42]:
triages6.to_csv('triage_scenario_6.csv')

#### Scenario 7: All nurses 80 % biased against one group

Same as before, but every nurse is biased against only 80 % of the patients in the group.

In [43]:
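# Start from the scenario 1 labels: triages6 already sets D to False for the whole group
triages7 = triages1.copy()
triages7['Group'] = triages4['Group']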

In [44]:
# 2bis - Bias all nurses
group = triages6[triages6.Group == 1].sample(frac = 0.8, replace = False).index
triages7.loc[group, 'D'] = False


In [45]:
triages7.to_csv('triage_scenario_7.csv')