# Machine Learning Applications for Health
# Tutorial: Interpretable Machine Learning with MIMIC-IV clinical data

> ### Goal: Predict and understand the mortality risk for a sepsis cohort

#### Explainable Boosting Machine (EBM)


* **Data** set: query the cohort in MIMIC-IV 
* Create the machine learning model with the **InterpretML library**
* **Split** the dataset
* **Fit** the model using the training data set
* **Predict and Evaluate** the performance of the model using the testing set (unseen data).

* InterpretML is an open-source Python package; its documentation can be found [here](https://interpret.ml/).
* The preprint about the framework can be read [here](https://arxiv.org/abs/1909.09223), and the GA2M paper [here](https://www.cs.cornell.edu/~yinlou/papers/lou-kdd12.pdf).

### Set up the main **libraries**: interpret, numpy, pandas.

In [None]:
# !pip install interpret #Uncomment and run this cell to install interpretML

In [None]:
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
from interpret import show

from sklearn.model_selection import train_test_split 
from sklearn import metrics

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)  # Show all columns when printing a DataFrame

* Authenticate in the BigQuery platform. Define the function to query.

In [None]:
# authenticate
auth.authenticate_user()

In [None]:
# Set up environment variables
project_id = 'CHANGE-ME' ##Change only this variable with your project ID in BigQuery Platform.
if project_id == 'CHANGE-ME': #No Need to change this one!
  raise ValueError('You must change project_id to your GCP project.')
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

# Read data from BigQuery into pandas dataframes.
def run_query(query, project_id=project_id):
  return pd.io.gbq.read_gbq(
      query,
      project_id=project_id,
      dialect='standard')

# set the dataset
dataset = 'mimiciv'


## **Data set**
We'll use a cohort derived from MIMIC-IV.

* The query below retrieves the data from the **BigQuery platform**.
* We retrieve patients with **sepsis**: a life-threatening complication caused by the body's response to an infection. Sepsis may develop when the immune system goes into **overdrive in response to an infection**.
* We then join the date of death, age, and gender from the patients table.


In [None]:
##We are retrieving patients using sepsis3 Table and joining it to patients Table.

df = run_query("""
SELECT
  sep.subject_id, sep.sofa_score,
  sep.respiration, sep.coagulation, sep.liver,
  sep.cardiovascular, sep.cns, sep.renal,
  pt.dod, pt.anchor_age, pt.gender
FROM `physionet-data.mimiciv_derived.sepsis3` as sep
INNER JOIN `physionet-data.mimiciv_hosp.patients` as pt
ON sep.subject_id = pt.subject_id
ORDER BY subject_id
""")
print(df)

* We have been analysing this dataset since the beginning, so here is a recap of what needs to be done: check for missing values, transform categorical columns into numerical ones, and verify the dtype of each column.
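
The three recap steps can be sketched on a small toy table before applying them to the real cohort. This is only an illustration: the `toy` DataFrame and all of its values are invented, and `dtype=int` is an assumption made here so that the indicator columns come out as integers.

```python
import pandas as pd

# Toy stand-in for the query result; every value is made up for illustration.
toy = pd.DataFrame({
    "sofa_score": [2, 5, 3, 8],
    "dod": ["2150-01-03", None, None, "2151-07-19"],
    "anchor_age": [64, 71, 55, 80],
    "gender": ["F", "M", "M", "F"],
})

# 1) Missing values per column (a missing dod means no recorded death).
missing_per_column = toy.isna().sum()

# 2) Categorical to numerical: binary target and one-hot encoded gender.
toy["dod"] = toy["dod"].notna().astype(int)
toy = pd.concat(
    [toy.drop(columns="gender"), pd.get_dummies(toy["gender"], dtype=int)],
    axis=1,
)

# 3) Verify that every column is now numeric.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)
print(missing_per_column, toy.dtypes, all_numeric, sep="\n\n")
```

Note that a missing `dod` is informative here (no death recorded), so it is encoded as 0 rather than dropped or imputed.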

In [None]:
dataset = df.copy()

# Replace date-of-death values with a binary label (1 = death recorded, 0 = no death recorded).
# Building the label from notna() avoids writing integers into a date-typed column.
dataset['dod'] = dataset['dod'].notna().astype(int)

# Transform the gender column from categorical data to integer indicator columns (F and M):
gender_categorical = pd.get_dummies(dataset['gender'], dtype=int)

#Concatenate both Data frames:
final_sepsis = pd.concat([dataset,gender_categorical], axis = 1)

#Final Data set to work with:
final_sepsis = final_sepsis.drop(['subject_id','gender'], axis = 1)
print(final_sepsis)

In [None]:
# Check the final dtype of each column. Are they properly defined now?
# info() prints its report directly and returns None, so it is not wrapped in print().
final_sepsis.info()

* Split the data set into Training and Testing 
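
Conceptually, a hold-out split shuffles the row indices with a fixed seed and sets a fraction of them aside for testing. The sketch below shows that idea with the standard library only. `holdout_split` is a hypothetical helper written for this illustration; it is not how `train_test_split` is implemented and it will not reproduce the same rows.

```python
import random

def holdout_split(n_rows, test_size=0.1, seed=1):
    """Return (train_idx, test_idx) for a random hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_test = max(1, round(n_rows * test_size))  # keep at least one test row
    return sorted(indices[n_test:]), sorted(indices[:n_test])

# 50 rows with 10% held out for testing.
train_idx, test_idx = holdout_split(n_rows=50, test_size=0.1)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state`: rerunning the cell yields the same partition, so results stay comparable between runs.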

In [None]:
# split into input (X) and output (y) variables
target = 'dod'
X = final_sepsis.drop(labels = target, axis = 1) #Remove the target column from the dataset to create the independent(features) variables set
y = final_sepsis[target]

#Adjust the size of the testing set: we'll use 10% of the entire data. 
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.1, random_state = 1)


### Explainable Boosting Machine model

* Glassbox models are designed to be completely interpretable, and often provide accuracy similar to state-of-the-art methods.
* They can also provide explanations at both a global (overall behavior) and a local (individual predictions) level.
  * Global explanations are useful for understanding what a model finds important, as well as for identifying potential flaws in its decision making.
* **Explainable Boosting Machine (EBM)** is a tree-based, cyclic gradient boosting Generalized Additive Model with automatic interaction detection. Read more about it [here](https://interpret.ml/docs/ebm.html).
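
To make the "Generalized Additive Model" part concrete: for classification, the prediction is an intercept plus one learned contribution per feature, summed on the log-odds scale and passed through the logistic function. The sketch below mimics this with hand-written lookup tables. All bin edges, scores, and the intercept are invented for illustration, pairwise interaction terms are left out, and none of this is InterpretML's actual code.

```python
import math
from bisect import bisect_right

# Hypothetical piecewise-constant shape functions: (bin edges, score per bin).
# A feature with k edges has k + 1 bins. All numbers are invented.
shape_functions = {
    "sofa_score": ([3, 6, 9], [-0.8, -0.1, 0.5, 1.2]),
    "anchor_age": ([50, 70], [-0.6, 0.0, 0.7]),
}
intercept = -0.4

def contributions(sample):
    """Look up each feature's additive contribution on the log-odds scale."""
    out = {}
    for name, (edges, scores) in shape_functions.items():
        out[name] = scores[bisect_right(edges, sample[name])]
    return out

def predict_proba(sample):
    """Sum intercept and contributions, then apply the logistic link."""
    logit = intercept + sum(contributions(sample).values())
    return 1.0 / (1.0 + math.exp(-logit))

patient = {"sofa_score": 7, "anchor_age": 74}
print(contributions(patient), round(predict_proba(patient), 3))
```

Because each feature contributes independently, the per-feature values returned by `contributions` are exactly what a local explanation displays for one patient, while plotting a whole lookup table gives the global view of that feature.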

In [None]:
ebm = ExplainableBoostingClassifier(random_state=1)
classifier = ebm.fit(X_train, y_train)
classifier

In [None]:
## Accuracy of the model on the training set
print(classifier.score(X_train,y_train))

### Evaluation of the EBM model with unseen data from the testing set.
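
The four scores computed in the next cell can all be derived from the counts of a binary confusion matrix. The sketch below writes the formulas out by hand on invented counts, to show why accuracy alone can look optimistic when deaths are the minority class. `binary_metrics` is a hypothetical helper, and it assumes that no denominator is zero.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, balanced accuracy, MCC and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    balanced_accuracy = (sensitivity + specificity) / 2
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator
    return {
        "Accuracy": accuracy,
        "Balanced Accuracy": balanced_accuracy,
        "MCC": mcc,
        "F1-Score": f1,
    }

# Invented counts for an imbalanced test set: 20 deaths among 100 patients.
scores = binary_metrics(tp=10, fp=5, fn=10, tn=75)
print({name: round(value, 3) for name, value in scores.items()})
```

In this toy case the accuracy is 0.85 even though only half of the deaths are detected, which is the kind of gap that balanced accuracy, MCC, and F1 are meant to expose.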

In [None]:
predictions = classifier.predict(X_test)

#Accuracy classification score
acc = float(round(metrics.accuracy_score(y_test, predictions),3))

#Compute the balanced accuracy.
bacc = float(round(metrics.balanced_accuracy_score(y_test, predictions),3))

#Compute the Matthews correlation coefficient (MCC)
mcc = float(round(metrics.matthews_corrcoef(y_test, predictions),3))

#Compute the F1 score, also known as balanced F-score or F-measure.
f1 = float(round(metrics.f1_score(y_test, predictions),3))

#Show results as a DataFrame:
results = {'Accuracy' : [acc], 'Balanced Accuracy' : [bacc], 'MCC' : [mcc], 'F1-Score' : [f1]}
df_results = pd.DataFrame.from_dict(data = results, orient='columns')
print(df_results)

### Let's visualise the global model behaviour for each feature.
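
One common way to rank features in an additive model is the mean absolute contribution of each feature across samples, and the overall importance chart of an EBM is, as far as the InterpretML documentation describes it, based on this kind of average. The sketch below computes such a ranking on invented per-sample contributions; `mean_abs_importance` is a hypothetical helper and not part of the library.

```python
# Invented per-sample contributions (log-odds scale) for three features.
sample_contributions = [
    {"sofa_score": 0.5, "anchor_age": 0.7, "renal": -0.1},
    {"sofa_score": -0.8, "anchor_age": 0.0, "renal": 0.2},
    {"sofa_score": 1.2, "anchor_age": -0.6, "renal": -0.1},
    {"sofa_score": -0.1, "anchor_age": 0.7, "renal": 0.0},
]

def mean_abs_importance(rows):
    """Average the absolute contribution of each feature over all samples."""
    features = rows[0].keys()
    return {f: sum(abs(r[f]) for r in rows) / len(rows) for f in features}

importance = mean_abs_importance(sample_contributions)
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

Taking the absolute value matters: a feature that pushes some patients strongly up and others strongly down would otherwise average out to roughly zero and look unimportant.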

In [None]:
ebm_global = ebm.explain_global()
show(ebm_global)

### Let's visualise the local model behaviour for some unseen examples from the testing set.

In [None]:
set_visualize_provider(InlineProvider()) #plot the output here
ebm_local = ebm.explain_local(X_test[:5], y_test[:5])
show(ebm_local)

## To save the figure, install a Plotly static-image export engine first: kaleido (pip install kaleido) or the older orca (https://github.com/plotly/orca)
#plotly_fig = ebm_local.visualize(0) # This is the plotly figure for visualization of the 0th datapoint's local explanation
#plotly_fig.write_image("images/fig0.pdf")

* Discussion: How does this result compare with the previous one (Deep Learning on the same sepsis data)?
* What can you learn from the plots?
* Which feature contributes the most?