# Welcome to the Brown University Datathon 2024!

Welcome to notebook 0! This notebook highlights the learning objectives for the datathon and prepares the datasets for the subsequent notebooks.

Happy coding!!

## Objectives and Tracks

The objective is to investigate how data issues in electronic health records affect downstream clinical prediction tasks. We will look at the effect of a faulty pulse oximeter reading, of a missing serum lactate level, and of the two combined on in-hospital mortality prediction. In addition to the original WiDS dataset, we will create 3 "altered" datasets:

1. A dataset where the SpO2 of the Black patients is increased by 3 percentage points (capped at 100%)

2. A dataset where we drop the serum lactate measurements of Black patients

3. A dataset where the SpO2 of the Black patients is increased by 3 percentage points (capped at 100%) and their serum lactate is dropped


We exaggerate these data issues to get a sense of their impact on machine learning models, which, surprisingly, has not been sufficiently explored by the machine learning community.



## Schedule

* **First hour:** data visualization and table one of the WiDS dataset. (**Notebook 1**)
  
* **Second hour:** Build a mortality prediction model using the [WiDS dataset](https://physionet.org/content/widsdatathon2020/1.0.0/). Evaluate performance across race-ethnicities in the test set. (**Notebook 2**)
  
* **Third hour:** Build a mortality prediction model using one of three altered datasets. Use the same test set as above, but with the new features. (**Notebook 3**)

* **Fourth hour:** Compare the two models and prepare the presentation for Day 2.

## Materials (online)

* [WiDS dataset](https://physionet.org/content/widsdatathon2020/1.0.0) - please download the data ("training_v2.csv") from here, and run this notebook to create the train and test subsets with the modified features - **before the datathon!!** 

* [Data Dictionary](https://physionet.org/content/widsdatathon2020/1.0.0/data/WiDS_Datathon_2020_Dictionary.csv) - to understand what the variables mean

* [Datathon GitHub](https://github.com/joamats/mit-brown-datathon) - to move on to the next notebooks!


## Dataset Preparation

### Import Libraries

In [49]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

### Define your working directory

In [4]:
os.chdir("/Users/joaomatos/Documents/brown datathon")

### Load the data

In [5]:
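# Assumption: the downloaded "training_v2.csv" was saved as "wids_data.csv" in the working directory
data = pd.read_csv("wids_data.csv")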
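Before going further, it may be worth confirming that the loaded file has the columns this notebook relies on. The following is a minimal sketch: `missing_columns` is a hypothetical helper, and the tiny `demo` frame stands in for `data`.

```python
import pandas as pd

# Columns used later in this notebook
REQUIRED = [
    "ethnicity", "hospital_death",
    "d1_spo2_min", "d1_spo2_max",
    "d1_lactate_min", "d1_lactate_max",
]

def missing_columns(df, required=REQUIRED):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [c for c in required if c not in df.columns]

# Toy frame standing in for `data`; it deliberately lacks the lactate columns
demo = pd.DataFrame({
    "ethnicity": ["Caucasian"], "hospital_death": [0],
    "d1_spo2_min": [92.0], "d1_spo2_max": [99.0],
})
print(missing_columns(demo))
```

On the real data, `missing_columns(data)` returning an empty list would suggest the right file was loaded.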

### SpO2 modifications (1)

#### baseline distributions

In [22]:
data['d1_spo2_min'].isna().mean()

0.0036308920218507735

In [11]:
data['d1_spo2_min'].describe()

count    91380.000000
mean        90.454826
std         10.030069
min          0.000000
25%         89.000000
50%         92.000000
75%         95.000000
max        100.000000
Name: d1_spo2_min, dtype: float64

In [40]:
data['d1_spo2_max'].describe()

count    88612.000000
mean        99.257403
std          1.375795
min         70.000000
25%         99.000000
50%        100.000000
75%        100.000000
max        100.000000
Name: d1_spo2_max, dtype: float64

#### ensure that SpO2 is 70-100

In [29]:
data.shape

(91713, 187)

In [30]:
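# Keep rows whose min and max SpO2 both fall within 70-100.
# .copy() makes the result an independent frame, so adding columns later is safe.
data = data.loc[
    (data.d1_spo2_min >= 70)
  & (data.d1_spo2_min <= 100)
  & (data.d1_spo2_max >= 70)
  & (data.d1_spo2_max <= 100)
].copy()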

In [31]:
data.shape

(88612, 187)
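The row count drops from 91,713 to 88,612. Part of that drop comes from missing values rather than out-of-range ones: a comparison with `NaN` evaluates to `False`, so any row missing either SpO2 value is removed as well. A small sketch on made-up values, using `Series.between` as a more compact way to write the same range check:

```python
import numpy as np
import pandas as pd

# Made-up values: one clean row, one out-of-range row, two rows with a missing value
toy = pd.DataFrame({
    "d1_spo2_min": [92.0, 65.0, np.nan, 88.0],
    "d1_spo2_max": [99.0, 98.0, 100.0, np.nan],
})

# between() is inclusive on both ends by default and is False for NaN
in_range = toy["d1_spo2_min"].between(70, 100) & toy["d1_spo2_max"].between(70, 100)
kept = toy.loc[in_range].copy()

print(kept)                # only the first row survives
print((~in_range).sum())   # rows removed: one out of range, two with a missing value
```

If the rows with missing SpO2 should be kept instead, the mask would need an explicit `| toy[col].isna()` term for each column.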

#### add 3 percentage points to Black patients' SpO2

In [34]:
delta_to_add = 3
is_black = data.ethnicity == 'African American'

# Add the offset for Black patients only, capping the result at 100;
# all other rows keep their original value
for col in ['d1_spo2_min', 'd1_spo2_max']:
    shifted = (data[col] + delta_to_add).clip(upper=100)
    data[col + '_new'] = data[col].where(~is_black, shifted)

#### compare both

before

In [36]:
data.loc[data.ethnicity == 'African American','d1_spo2_min'].describe()

count    9084.000000
mean       93.073536
std         5.514395
min        70.000000
25%        91.000000
50%        94.000000
75%        97.000000
max       100.000000
Name: d1_spo2_min, dtype: float64

after

In [35]:
data.loc[data.ethnicity == 'African American','d1_spo2_min_new'].describe()

count    9084.000000
mean       95.711251
std         5.158499
min        73.000000
25%        94.000000
50%        97.000000
75%       100.000000
max       100.000000
Name: d1_spo2_min_new, dtype: float64

there is a 3-point difference in the median (94 vs. 97)!
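The mean moves less than the median (93.07 to 95.71, about 2.6 points) because any value within 3 points of 100 is capped. A short illustration on made-up readings:

```python
import pandas as pd

spo2 = pd.Series([90.0, 94.0, 97.0, 99.0, 100.0])  # made-up values
shifted = (spo2 + 3).clip(upper=100)               # same rule: add 3, cap at 100

print(shifted.tolist())
print(shifted.median() - spo2.median())  # median shift
print(shifted.mean() - spo2.mean())      # mean shift, smaller because of the cap
```

Here the last two readings cannot rise by the full 3 points, which pulls the mean shift below 3 while the median still moves by the full offset.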

### Lactate modifications (2)

#### baseline missingness

In [43]:
# Fraction of rows that HAVE a lactate value (so about 75% are missing)
data['d1_lactate_max'].notnull().mean()

0.2471335710738952

In [65]:
data.loc[data.ethnicity == 'African American', 'd1_lactate_max'].isna().mean()

0.7490092470277411

In [66]:
data.loc[data.ethnicity == 'Caucasian', 'd1_lactate_max'].isna().mean()

0.749864629523935
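The per-group checks above can also be done in one step with `groupby`, which makes it easier to compare every ethnicity at once. A minimal sketch on a made-up frame that stands in for `data`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ethnicity": ["African American", "African American",
                  "Caucasian", "Caucasian", "Caucasian", "Caucasian"],
    "d1_lactate_max": [np.nan, 2.0, np.nan, np.nan, 1.0, 3.0],
})

# Fraction of missing lactate values within each ethnicity group
missing_by_group = toy["d1_lactate_max"].isna().groupby(toy["ethnicity"]).mean()
print(missing_by_group)
```

Note that `groupby` leaves out rows whose `ethnicity` is itself missing unless `dropna=False` is passed.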

#### drop all the lactate values for Black patients

In [46]:
is_black = data.ethnicity == 'African American'

# Set lactate to NaN for Black patients; everyone else keeps the original value
for col in ['d1_lactate_min', 'd1_lactate_max']:
    data[col + '_new'] = data[col].mask(is_black)

#### new missingness

In [63]:
data.loc[data.ethnicity == 'African American', 'd1_lactate_max_new'].isna().mean()

1.0
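With `_new` columns available for both SpO2 and lactate, the three altered datasets from the objectives differ only in which columns feed the model. One possible way to materialise a variant is sketched below. `make_variant` and the variant names are hypothetical, the toy frame stands in for `data`, and the sketch assumes no other column name ends in `_new`.

```python
import numpy as np
import pandas as pd

SPO2 = ["d1_spo2_min", "d1_spo2_max"]
LACTATE = ["d1_lactate_min", "d1_lactate_max"]

# Hypothetical variant names mapped to the columns whose altered version is used
VARIANTS = {
    "original": [],
    "spo2": SPO2,
    "lactate": LACTATE,
    "both": SPO2 + LACTATE,
}

def make_variant(df, variant):
    """Return a copy of df where the chosen columns take their `_new` values."""
    out = df.copy()
    for col in VARIANTS[variant]:
        out[col] = out[col + "_new"]
    # Drop the helper columns so every variant has the same schema
    return out.drop(columns=[c for c in out.columns if c.endswith("_new")])

toy = pd.DataFrame({
    "ethnicity": ["African American", "Caucasian"],
    "d1_spo2_min": [95.0, 95.0], "d1_spo2_min_new": [98.0, 95.0],
    "d1_spo2_max": [99.0, 99.0], "d1_spo2_max_new": [100.0, 99.0],
    "d1_lactate_min": [1.2, 1.1], "d1_lactate_min_new": [np.nan, 1.1],
    "d1_lactate_max": [2.5, 2.0], "d1_lactate_max_new": [np.nan, 2.0],
})

both = make_variant(toy, "both")
print(both)
```

Keeping the same column names across variants means a model pipeline written for the original data can be reused unchanged on any altered dataset.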

### Train and test split

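#### 80-20% split

note: the shapes printed below (59,370 and 29,242 rows) amount to roughly a 67/33 split of the 88,612 rows, so they appear to come from a run with a larger `test_size`; with `test_size=0.2` the test set holds about 17,700 rows.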

In [50]:
data_train, data_test = train_test_split(data, test_size=0.2, random_state=42)

In [51]:
data_train.shape

(59370, 191)

In [52]:
data_test.shape

(29242, 191)

#### check balancing of the mortality outcome

In [55]:
data.hospital_death.mean()

0.07658104997065861

In [56]:
data_train.hospital_death.mean()

0.07552636011453596

In [57]:
data_test.hospital_death.mean()

0.07872238560973942

not too different, we're good to go!
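If the mortality rates had differed more, one option would be a stratified split, which keeps the outcome proportion the same in both subsets. This is a sketch of that alternative on made-up data, not what the notebook does above; it uses the `stratify` argument of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up outcome with a 10% event rate
toy = pd.DataFrame({"x": range(100), "hospital_death": [1] * 10 + [0] * 90})

train, test = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,
    stratify=toy["hospital_death"],  # preserve the outcome rate in both subsets
)

print(train["hospital_death"].mean(), test["hospital_death"].mean())
```

Switching to a stratified split would change which rows land in the test set, so it should be decided once and kept fixed across all notebooks.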

#### save the dataframes as CSV files for the next notebooks!

create a subfolder called 'data_split' within our directory

In [60]:
# exist_ok=True makes this a no-op if the folder is already there
os.makedirs('data_split', exist_ok=True)

save both dataframes

In [61]:
data_train.to_csv('data_split/wids_train.csv')

In [62]:
data_test.to_csv('data_split/wids_test.csv')
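By default `to_csv` writes the DataFrame index as an unnamed first column, so a plain `pd.read_csv` in the next notebooks would pick up an extra `Unnamed: 0` column. Passing `index_col=0` when reading restores the index instead. A small sketch on a made-up frame, written to a temporary folder so it does not touch `data_split`:

```python
import os
import tempfile
import pandas as pd

# Made-up frame with a shuffled index, as produced by a random split
toy = pd.DataFrame({"hospital_death": [0, 1, 0]}, index=[7, 3, 5])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wids_train.csv")
    toy.to_csv(path)                           # index is written as the first column

    naive = pd.read_csv(path)                  # index comes back as 'Unnamed: 0'
    restored = pd.read_csv(path, index_col=0)  # index is restored

print(list(naive.columns))
print(restored.index.tolist())
```

An alternative would be to save with `index=False`, at the cost of losing the original row labels.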