# Machine Learning for Healthcare (EHR and Privacy)
This Jupyter notebook demonstrates the concepts introduced in the EHR and Privacy lecture. These include:
* Data loading and pre-processing
    * Reading in EHR data
    * Splitting the data into training and test sets
* Training a random forest (RF) classifier on the data
    * Attempting to identify hospital admissions that result in a later readmission
* Generating sample data using a GAN
* Generating differentially private samples using DP-SGD

## Data loading and pre-processing
We'll load the data with the pandas library, as before, and then split it into training and test sets. Note that this data is already balanced and scaled.

In [1]:
import pandas as pd
import numpy as np
data_frame = pd.read_csv("/cluster/courses/ml4h/data_for_users/data/ehr_gen.csv")

EHR datasets are typically very large. The one used here has been truncated to 1000 rows to reduce the computational resources required; all 795 columns have been retained:

In [2]:
data_frame

| | DIAG_COUNT | PROC_COUNT | Short_Stay_ED_Flag_N | Short_Stay_ED_Flag_Y | adm_src_R | adm_src_T | adm_type_AA | adm_type_AC | adm_type_AP | adm_type_RL | ... | PROC_990 | PROC_991 | PROC_993 | PROC_994 | PROC_995 | PROC_996 | PROC_997 | PROC_998 | PROC_999 | Readmission |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.053864 | 2.491760e-02 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.039215 | 5.500793e-03 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.019089 | 9.522438e-04 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.014519 | 1.163483e-02 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.017960 | 4.520416e-03 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 0.028824 | 1.463318e-02 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 996 | 0.009216 | 4.768372e-07 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 997 | 0.090149 | 1.766968e-02 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 998 | 0.203979 | 9.681702e-03 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [3]:
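from sklearn.model_selection import train_test_split

# Separate the target from the features (pop removes the column from data_frame in place)
target_column = data_frame.pop("Readmission")

# Hold out 30% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(data_frame.values, target_column, test_size=0.3)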
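
The statement that the data is "balanced and scaled" can be checked directly before splitting. The following is a minimal sketch on a toy frame (the `toy` data and the use of `stratify` and `random_state` are assumptions for illustration, not part of the cell above): it verifies the class shares and the feature range, then performs a stratified, repeatable 70/30 split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the EHR frame: two features in [0, 1) and a binary target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "DIAG_COUNT": rng.random(200),
    "PROC_COUNT": rng.random(200),
    "Readmission": np.tile([0.0, 1.0], 100),  # exactly half of each class
})

target = toy.pop("Readmission")

# "Balanced": both classes appear equally often.
class_share = target.value_counts(normalize=True)

# "Scaled": every feature value lies within [0, 1].
in_unit_range = bool(((toy >= 0) & (toy <= 1)).all().all())

# stratify keeps the class ratio the same in both splits;
# random_state makes the split repeatable between runs.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy.values, target, test_size=0.3, stratify=target, random_state=0
)
print(class_share.to_dict(), in_unit_range, x_tr.shape, x_te.shape)
```

On the real data the same checks would be run on `data_frame` and `target_column`.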

## Random forest classification
Here we will train an RF classifier, with the target being the readmission flag for a hospital visit (a database entry indicating whether the current admission resulted in a future readmission to hospital). In the interest of saving computation time, we will forgo the parameter grid search that was outlined in a previous notebook.

In [4]:
# Imports
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Evaluation metrics
eval_metrics = {'AUROC': 'roc_auc', 'avg_precision': 'average_precision', 'Accuracy': make_scorer(accuracy_score)}

# Train with cross validation
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    x_train, y_train, cv=5, scoring=eval_metrics, return_estimator=True)
# Keep the estimator from the fold with the highest validation AUROC
best_idx = np.argmax(cv_results['test_AUROC'])
clf = cv_results['estimator'][best_idx]
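
To make the structure of the `cross_validate` result concrete, here is a minimal sketch on synthetic data (the `make_classification` data and the smaller forest are assumptions chosen to keep it fast). With a scoring dictionary, each metric appears under the key `test_<name>`, and `return_estimator=True` adds one fitted model per fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for x_train / y_train.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

results = cross_validate(
    RandomForestClassifier(n_estimators=20, max_depth=5, random_state=42),
    X, y, cv=5,
    scoring={"AUROC": "roc_auc"},
    return_estimator=True,
)

# One score and one fitted estimator per fold.
fold_scores = results["test_AUROC"]
best_fold = int(np.argmax(fold_scores))
best_clf = results["estimator"][best_fold]
print(len(results["estimator"]), fold_scores.round(3), best_fold)
```

Note that each fold's estimator has only seen four fifths of the training data, and that picking the best fold makes that fold's validation score optimistic. This is why the final evaluation uses the separate test set.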

## Evaluation
We'll reuse the same metrics as in previous notebooks to evaluate the performance of this classifier (note that the performance differs from that in the lecture slides since the dataset has been greatly truncated):

In [5]:
from sklearn.metrics import get_scorer

def evaluate_trained_clf(clf, x_test, y_test, scoring):
    for score, score_fn in scoring.items():
        # Resolve metric names (strings) to scorer callables
        score_fn = get_scorer(score_fn) if isinstance(score_fn, str) else score_fn
        print("%s: %f" % (score, score_fn(clf, x_test, y_test)))

evaluate_trained_clf(clf, x_test, y_test, eval_metrics)

AUROC: 0.746479
avg_precision: 0.806160
Accuracy: 0.663333
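
The evaluation function relies on scikit-learn scorers, which are called as `scorer(estimator, X, y)` rather than on predictions. As a minimal sketch on synthetic data (the data and model settings here are assumptions for illustration), the `'roc_auc'` scorer applied to a random forest gives the same value as computing the AUROC by hand from the positive-class probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import get_scorer, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)

# A scorer takes the fitted estimator and produces the predictions itself.
auc_from_scorer = get_scorer("roc_auc")(model, X_te, y_te)

# For a random forest this amounts to scoring the positive-class probability.
auc_by_hand = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(auc_from_scorer, auc_by_hand)
```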


Bonus exercise for readers: Reuse the code snippets provided in the linear classifiers notebook to plot the ROC and Precision-Recall curves for this classifier. Paste the relevant function(s) below, as well as the code required to obtain the predicted probabilities needed to make the plots.

## Data Generation using GAN
This code uses a provided Wasserstein GAN to generate new samples based on the bilirubin dataset. If you are interested in how this works, feel free to look at the imported code.

In [6]:
import sys
import torch
import numpy as np
import pandas as pd
sys.path.append("/cluster/courses/ml4h/data_for_users/code/generator")
from trainer import Trainer
from networks import Generator, Discriminator
from utils import get_csv_data_loader, Postprocessor

In [7]:
# Source data file
data_file = "/cluster/courses/ml4h/data_for_users/data/bili_generated_ext.csv"
# Set seeds and model parameters
torch.manual_seed(10)
np.random.seed(10)
data_dim = 9
z_dim = 100
lr = 1e-4
betas = (.9, .99)
epochs = 100

# Create networks
generator = Generator(z_dim, data_dim)
discriminator = Discriminator(data_dim)

# Create optimizers
G_optimizer = torch.optim.Adam(generator.parameters(), lr=lr, betas=betas)
D_optimizer = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=betas)

# Get data loader and scaler
dataloader, data_scaler = get_csv_data_loader(data_file)

# Train network
trainer = Trainer(generator, discriminator, G_optimizer, D_optimizer,
                  use_cuda=torch.cuda.is_available(),
                  postprocess=Postprocessor(data_scaler, data_file),
                  print_every=-1)
trainer.train(dataloader, epochs)

Balance: 0.370000
Accuracy: 0.990000
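
For readers who want the idea behind the Wasserstein objective without opening the imported code: the discriminator acts as a critic that outputs an unbounded score, trained to score real rows higher than generated ones, while the generator is trained to raise the scores of its own samples. The sketch below uses a hypothetical linear critic in NumPy purely to show the two losses. It is not the imported `Trainer`, which may differ (for example by adding a gradient penalty or weight clipping to constrain the critic).

```python
import numpy as np

rng = np.random.default_rng(10)

def critic(x, w):
    """Hypothetical linear critic: one unbounded score per sample."""
    return x @ w

def critic_loss(real, fake, w):
    # The critic minimises this: real scores are pushed up, fake scores down.
    return critic(fake, w).mean() - critic(real, w).mean()

def generator_loss(fake, w):
    # The generator minimises this: the scores of its samples are pushed up.
    return -critic(fake, w).mean()

real = rng.normal(loc=1.0, size=(64, 9))   # stand-in for a batch of real rows
fake = rng.normal(loc=0.0, size=(64, 9))   # stand-in for generator output
w = np.ones(9) / 9.0                       # fixed critic weights for illustration

print(critic_loss(real, fake, w), generator_loss(fake, w))
```

The negated critic loss estimates how far apart the real and generated distributions are, so it shrinks toward zero as the generated samples become harder to tell apart from real ones.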




In [8]:
# Display a few samples
columns = ["hours_since_birth", "GA (days)","BiliBGA","Weight","Bili_Weight_ratio","MothersAge","IsPreterm","Arterial_pH","hasFTlimit"]
samples = trainer.sample(20)
pd.DataFrame(samples, columns=columns)



| | hours_since_birth | GA (days) | BiliBGA | Weight | Bili_Weight_ratio | MothersAge | IsPreterm | Arterial_pH | hasFTlimit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.763479 | 277.0 | 99.0 | 2302.0 | 0.042939 | 28.0 | 1.0 | 7.175716 | 0.0 |
| 1 | 215.672302 | 261.0 | 17.0 | 3367.0 | 0.004973 | 40.0 | 0.0 | 7.176789 | 0.0 |
| 2 | 2.71983 | 288.0 | 53.0 | 3120.0 | 0.016829 | 37.0 | 0.0 | 7.083914 | 0.0 |
| 3 | 43.437077 | 152.0 | 55.0 | 1550.0 | 0.035334 | 34.0 | 1.0 | 7.424189 | 0.0 |
| 4 | 32.369045 | 260.0 | 208.0 | 3059.0 | 0.067895 | 31.0 | 0.0 | 6.19846 | 1.0 |
| 5 | 40.234745 | 232.0 | 150.0 | 1976.0 | 0.075767 | 35.0 | 1.0 | 7.226864 | 1.0 |
| 6 | 229.12709 | 220.0 | 80.0 | 2878.0 | 0.027676 | 35.0 | 0.0 | 6.010039 | 0.0 |
| 7 | 2.425617 | 278.0 | 206.0 | 2229.0 | 0.092287 | 27.0 | 0.0 | 6.616826 | 1.0 |
| 8 | 3.68491 | 254.0 | 238.0 | 1750.0 | 0.135796 | 30.0 | 1.0 | 7.097745 | 1.0 |
| 9 | 10.960663 | 239.0 | 246.0 | 2234.0 | 0.110067 | 33.0 | 0.0 | 7.061141 | 1.0 |


## Data Generation using Differential Privacy
Now, we will repeat the exercise using [Opacus](https://opacus.ai/), a differential privacy (DP) library for PyTorch. We use an epsilon value of 10 and a delta value of 10^-5. Feel free to experiment with the epsilon value and see how it affects the quality of the generated samples (remember: the lower the epsilon value, the stronger the privacy guarantee).

In [9]:
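from trainer import DPTrainer
# Privacy budget: a lower epsilon gives a stronger guarantee
epsilon = 10
delta = 1e-5

# Note: generator, discriminator and both optimizers are the objects already
# trained without DP in the cell above. Unless DPTrainer re-initialises them,
# re-create them first if the (epsilon, delta) guarantee should cover all training.
dptrainer = DPTrainer(generator, discriminator, G_optimizer, D_optimizer,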
                      use_cuda=torch.cuda.is_available(),
                      postprocess=Postprocessor(data_scaler, data_file),
                      epsilon=epsilon, delta=delta, print_every=-1)

dptrainer.train(dataloader, epochs)



Balance: 0.186000
Accuracy: 0.976667
Epsilon: 9.997165040575265 Delta: 1e-05
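
The mechanism behind this training run is DP-SGD: each sample's gradient is clipped to a maximum norm, Gaussian noise proportional to that norm is added to the summed gradients, and the result is averaged. The following is a minimal NumPy sketch of one such step. The function `dp_sgd_step` and its parameter names are illustrative assumptions, not the code of `DPTrainer` or Opacus, and the sketch leaves out the privacy accounting that turns the noise level into an epsilon value.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Return a privatised average gradient from per-sample gradients of shape (n, d)."""
    # 1. Clip each sample's gradient so its L2 norm is at most max_grad_norm.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_grad_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    # 2. Sum, then add Gaussian noise with std = noise_multiplier * max_grad_norm.
    noise = rng.normal(0.0, noise_multiplier * max_grad_norm, size=clipped.shape[1])
    # 3. Average over the batch.
    return (clipped.sum(axis=0) + noise) / len(clipped)

rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 5)) * 10.0   # stand-in per-sample gradients
private_grad = dp_sgd_step(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng)
print(private_grad)
```

Clipping bounds how much any single record can move the model, and the noise hides what remains of that influence. A smaller epsilon requires more noise for the same number of training steps, which is why sample quality tends to drop as the privacy guarantee gets stronger.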


In [10]:
# Display a few samples
dpsamples = dptrainer.sample(20)
pd.DataFrame(dpsamples, columns=columns)



| | hours_since_birth | GA (days) | BiliBGA | Weight | Bili_Weight_ratio | MothersAge | IsPreterm | Arterial_pH | hasFTlimit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.559563 | 258.0 | 114.0 | 3594.0 | 0.031779 | 32.0 | 0.0 | 7.330882 | 0.0 |
| 1 | 0.35173 | 224.0 | 220.0 | 1956.0 | 0.112366 | 30.0 | 0.0 | 7.405414 | 1.0 |
| 2 | 0.355357 | 177.0 | 79.0 | 1832.0 | 0.042973 | 34.0 | 1.0 | 7.332539 | 0.0 |
| 3 | 0.022803 | 196.0 | 114.0 | 1490.0 | 0.076393 | 35.0 | 1.0 | 7.458699 | 1.0 |
| 4 | 0.28816 | 208.0 | 167.0 | 1930.0 | 0.086459 | 28.0 | 1.0 | 7.364473 | 1.0 |
| 5 | 19.036749 | 266.0 | 124.0 | 3467.0 | 0.035882 | 37.0 | 0.0 | 7.352345 | 0.0 |
| 6 | 282.620636 | 268.0 | 104.0 | 4023.0 | 0.025734 | 43.0 | 0.0 | 7.368186 | 0.0 |
| 7 | 10.912393 | 219.0 | 81.0 | 1789.0 | 0.045288 | 35.0 | 1.0 | 7.214581 | 0.0 |
| 8 | 0.654318 | 196.0 | 200.0 | 1797.0 | 0.111088 | 32.0 | 1.0 | 7.378154 | 1.0 |
| 9 | 2.025678 | 276.0 | 140.0 | 3472.0 | 0.040387 | 40.0 | 0.0 | 7.473516 | 0.0 |
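
When experimenting with different epsilon values, eyeballing a few rows only goes so far. One simple way to compare two sets of samples is to put their per-column means and standard deviations side by side. The helper `compare_summary` below is a hypothetical illustration, shown on toy data rather than on the notebook's samples:

```python
import numpy as np
import pandas as pd

def compare_summary(reference, candidate):
    """Hypothetical helper: per-column mean/std side by side, plus the gap in means."""
    summary = pd.DataFrame({
        "ref_mean": reference.mean(),
        "cand_mean": candidate.mean(),
        "ref_std": reference.std(),
        "cand_std": candidate.std(),
    })
    summary["abs_mean_gap"] = (summary["ref_mean"] - summary["cand_mean"]).abs()
    return summary

# Toy stand-ins for two sample sets; the second has a shifted Weight column.
rng = np.random.default_rng(0)
cols = ["Weight", "Arterial_pH"]
ref = pd.DataFrame(rng.normal([3000.0, 7.3], [500.0, 0.1], size=(200, 2)), columns=cols)
cand = pd.DataFrame(rng.normal([2500.0, 7.3], [500.0, 0.1], size=(200, 2)), columns=cols)

report = compare_summary(ref, cand)
print(report)
```

In this notebook the two inputs could be `pd.DataFrame(samples, columns=columns)` and `pd.DataFrame(dpsamples, columns=columns)`, ideally with more than 20 samples each so that the statistics are stable.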
