# Introduction to the MIMIC Database and Machine Learning with scikit-learn

**Author: Meghan R. Hutch**

**Date: February 8th, 2024**

Inspired by and adapted from Dr. Garrett Eickelberg's workshop on MIMIC-III and scikit-learn: https://github.com/geickelb/MIMIC-III_to_Model/tree/master

---

### **Today's lecture will be divided into two parts:**

* Introduction to working with the MIMIC-IV dataset
<br></br>
* Introduction to machine learning with scikit-learn

We will discuss both of these in the context of a framework for conducting clinical research.

# Framework for Conducting Clinical Research 

---

1. Research Question Specification (hypothesis)
<br></br>
2. Cohort Specification
<br></br>
3. Data Extraction
<br></br>
4. Data Pre-processing (cleaning!)
<br></br>
5. Model Training
<br></br>
6. Model Evaluation
<br></br>

**Today, we will largely focus on steps 3-6**

# Introduction to MIMIC-IV

MIMIC-IV is the most recently updated version of the MIMIC dataset. It contains data for patients admitted to the emergency department or an ICU at Beth Israel Deaconess Medical Center (BIDMC) between 2008 and 2019. The dataset has been de-identified and contains the following (as of the MIMIC-IV v2.2 release):

* 299,712 patients
* 431,231 admissions
* 73,181 icustays

Useful [documentation](https://mimic.mit.edu/docs/iv/) for using the MIMIC-IV dataset

## Demo-Dataset

Today we will use the previously curated [demo-dataset](https://physionet.org/content/mimic-iv-demo/2.2/)

The demo-dataset contains data from a random subset of 100 hospitalized patients. It was curated for workshops and tutorials, so CITI training is not required to use it. The full dataset is available on the class quest server, but you should create a PhysioNet account and complete [CITI training](https://physionet.org/about/citi-course/) before accessing it.

In [None]:
# import our packages
import os

import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# set our working directory - this notebook assumes you are in the folder where the MIMIC-IV demo dataset is located
os.chdir('/projects/e30766/mimic-iv-demo/2.2')

In [None]:
# make sure you are in the working directory: '/projects/e30766/mimic-iv-demo/2.2'
os.getcwd()

# Data Extraction

- The demo dataset has been downloaded to the classroom's quest allocation.
- We will further load this data into our notebook

### Load in MIMIC data

First, we will load the list of subject identifiers (`subject_id`) for the demo cohort

In [None]:
# load data
subject_id = pd.read_csv('demo_subject_id.csv')

In [None]:
# view the first 5 rows of data
subject_id.head(5)

In [None]:
# check how many unique subjects
subject_id.nunique()

## MIMIC-IV Tables

For this demo, we will explore data found in the following two MIMIC-IV modules:


### [Hosp](https://mimic.mit.edu/docs/iv/modules/hosp/)

> "The Hosp module provides all data acquired from the hospital wide electronic health record. Information covered includes patient and admission information, laboratory measurements, microbiology, medication administration, and billed diagnoses."

### [ICU](https://mimic.mit.edu/docs/iv/modules/icu/)

> "The ICU module contains information collected from the clinical information system used within the ICU. Documented data includes intravenous administrations, ventilator settings, and other charted items."

As part of the 100-patient demo dataset, we have also downloaded all of the accompanying tables within the `hosp` and `icu` modules. The CSV files can be found in `/projects/e30766/mimic-iv-demo/2.2`

MIMIC-IV also contains additional modules for data from the emergency department (ED), chest X-rays (CXR), and clinical notes (Note)

### Load in the `patients` table to acquire demographic data and admission date information

Important note about MIMIC's pre-processing of [dates](https://mimic.mit.edu/docs/iv/modules/hosp/patients/#anchor_age-anchor_year-anchor_year_group):

* Dates are shifted by a random offset that is applied consistently to all of a given patient's records

* `anchor_year` - shifted year for the patient

* `anchor_year_group` -  is a range of years - the patient’s anchor_year occurred during this range

* `anchor_age` - the patient’s age in the `anchor_year` (year of admission). If a patient’s `anchor_age` is over 89 in the `anchor_year` then their `anchor_age` is set to 91, regardless of how old they actually were.

> Example: a patient has an `anchor_year` of 2153, `anchor_year_group` of 2008 - 2010, and an `anchor_age` of 60.
The year 2153 for the patient corresponds to 2008, 2009, or 2010.
The patient was 60 in the shifted year of 2153, i.e. they were 60 in 2008, 2009, or 2010.
A patient admission in 2154 will occur in 2009-2011, an admission in 2155 will occur in 2010-2012, and so on.
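The date logic above can be sketched in code. This is a minimal illustration on a made-up patient, not part of the original workshop: the helper `age_at_admission` and the toy values are assumptions. Because `anchor_age` is the patient's age in `anchor_year`, an approximate age at a later admission is `anchor_age` plus the number of shifted years that have passed.

```python
import pandas as pd

# toy rows mimicking the relevant columns of `patients` and `admissions` (values are made up)
toy_patients = pd.DataFrame({
    "subject_id": [1],
    "anchor_age": [60],
    "anchor_year": [2153],
})
toy_admissions = pd.DataFrame({
    "subject_id": [1, 1],
    "admittime": pd.to_datetime(["2153-03-01 10:00:00", "2155-07-15 08:30:00"]),
})


def age_at_admission(patients, admissions):
    """Approximate age at each admission (hypothetical helper, not a MIMIC function)."""
    merged = admissions.merge(patients, on="subject_id", how="left")
    # years elapsed between the shifted admission year and the shifted anchor year
    years_elapsed = merged["admittime"].dt.year - merged["anchor_year"]
    merged["age_at_admission"] = merged["anchor_age"] + years_elapsed
    return merged


ages = age_at_admission(toy_patients, toy_admissions)
print(ages[["subject_id", "admittime", "age_at_admission"]])
```

The result is only accurate to within a year, since birthdays are not available in the de-identified data.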


In [None]:
patients = pd.read_csv('hosp/patients.csv.gz')

In [None]:
# review the first 5 rows
patients.head()

In [None]:
# review the number of unique values in each column
patients.nunique()

In [None]:
len(patients)

### Load data files into python

Here, we will load a few of the tables we will use to demonstrate the workflow of acquiring, pre-processing, and analyzing data from the MIMIC database.

In [None]:
# load several tables from the hosp module
omr = pd.read_csv('hosp/omr.csv.gz')
labevents = pd.read_csv('hosp/labevents.csv.gz')
d_labitems = pd.read_csv('hosp/d_labitems.csv.gz')

# load several tables from the icu module
chartevents = pd.read_csv('icu/chartevents.csv.gz')
d_items = pd.read_csv('icu/d_items.csv.gz')

### omr table

> "The Online Medical Record (OMR) table contains miscellaneous information from the EHR."

[omr documentation](https://mimic.mit.edu/docs/iv/modules/hosp/omr) of columns:

* `chartdate` - the date on which the observation was recorded
* `seq_num`- a monotonically increasing integer which uniquely distinguishes results of the same type recorded on the same day. For example, if two blood pressure measurements occur on the same day, seq_num orders them chronologically.
* `result_name` - human interpretable description of the observation
* `result_value` - the value associated with the observation
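As a small sketch of how `chartdate` and `seq_num` work together, the toy example below (made-up values, not taken from the demo data) sorts observations chronologically and keeps the last measurement recorded for each patient on each day.

```python
import pandas as pd

# toy rows with the same column names as the omr table (values are made up)
toy_omr = pd.DataFrame({
    "subject_id": [1, 1, 1],
    "chartdate": ["2150-01-02", "2150-01-02", "2150-01-01"],
    "seq_num": [2, 1, 1],
    "result_name": ["Blood Pressure"] * 3,
    "result_value": ["118/76", "120/80", "130/85"],
})

# order by patient, then day, then within-day sequence
ordered = toy_omr.sort_values(["subject_id", "chartdate", "seq_num"])

# keep the last measurement recorded for each patient on each day
last_per_day = ordered.groupby(["subject_id", "chartdate"], as_index=False).last()
print(last_per_day)
```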

In [None]:
# review the first 5 rows
omr.head()

In [None]:
# review unique result_names (observations)
# here we can see the unique variables/measurements contained in this table
omr[['result_name']].drop_duplicates()

In [None]:
omr.isnull().sum()

# Data Pre-processing (Data Cleaning)

---

### What does it mean to clean data?

Simply, data cleaning is the act of preparing data for analysis. In a clinical context, we also want to make sure that the data we are using is clinically relevant and accurate, especially in regard to the specific clinical question or problem we are trying to solve.

### How do we "clean" data:

* Ensure data are formatted into the right data types (numeric, character, date, etc)

* Review summary statistics (how many observations, how many patients have each observation, mean/median/sd/variance of values, proportion of missing data)

* Review the distribution of data

* Are there possible data entry problems or unit conversions needed?

* How should we handle missing data?

* How many observations does each patient have? How should we handle repeated observations?

---

“Data science, he says, involves multiple “very small decisions” — data cleaning and filtering steps, for instance, which are crucially important, but difficult to document. And journal page limits preclude exposition. But by blending code, data and text in a single document, researchers can show just how their results were generated.” - Ben Marwick; [Perkel, JM](https://www.nature.com/articles/d41586-022-00563-z)


### Let's focus on the variables `Height` and `Weight`

We will filter the `omr` table to only contain results for height and weight. During data cleaning, each variable will be saved in its own dataframe for ease of pre-processing

In [None]:
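# create separate dataframes for each variable
# .copy() gives us independent dataframes, so adding or changing columns later
# does not trigger pandas' SettingWithCopyWarning
height = omr[omr['result_name']=="Height (Inches)"].copy()

weight = omr[omr['result_name']=="Weight (Lbs)"].copy()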

### Height (Inches)

In [None]:
height.head()

### Review variable types 

It looks like `result_value` is a continuous variable. Let's make sure that Python also has recognized it as such.

In [None]:
# check variable data types
height.dtypes

It looks like Python read `result_value` as an object (or character) variable. Let's convert it to numeric. This is an important step to make sure any downstream functions (calculating the mean or median, plotting the distribution, etc) work correctly.
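One thing to watch for: `pd.to_numeric` raises an error by default if it meets an entry it cannot parse. The sketch below uses a made-up series (not the demo data) to show the `errors="coerce"` option, which turns unparseable entries into `NaN` so they can be counted and reviewed.

```python
import pandas as pd

# made-up values: two valid numbers, one free-text entry, one missing entry
raw = pd.Series(["65", "70.5", "unknown", None], name="result_value")

# errors="coerce" converts anything that cannot be parsed into NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")

print(clean.dtype)
print("values that could not be converted:", clean.isna().sum())
```

Coercion is convenient, but it can silently hide data entry problems, so it is worth checking how many values became `NaN`.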

In [None]:
height['result_value'] = pd.to_numeric(height['result_value'])

In [None]:
# check variable data types again to make sure our transformation and code worked
# float64 indicates a continuous variable, so this looks good!
height.dtypes

#### Summary statistics

Next, let's calculate summary statistics to get a better sense of our variable

In [None]:
# how many unique patients have a height recording
# 61 patients - not all of our 100 patients had a record for height
height.nunique()

In [None]:
# how many observations - 378 observations, thus some patients must have height recorded more than once
len(height)

### Visualize distribution of values

In [None]:
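# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a similar plot
sns.histplot(height['result_value'], kde=True, stat='density')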

In [None]:
# use describe() to compute summary statistics of the distribution
height['result_value'].describe()

**Potential red flag:** Is there really a patient who might be 5 inches tall? This is a clear case I'd like to investigate more. Perhaps this is a data entry issue or a newborn?

In [None]:
# sort values from lowest to highest - I can then retrieve the subject_id of the patient with a height of 5 inches
# I can also see if there are any other patients with unusually low heights
height.sort_values(['result_value'])

In [None]:
# let's see if this patient has other recorded height measurements
height[height['subject_id']==10012853]

### Investigate patients with multiple observations

In [None]:
# for each unique patient, we will calculate the standard deviation of the patient's height measurements
# ddof=0 gives the population standard deviation, so patients with one measurement get 0 rather than NaN
# passing the string 'std' (rather than np.std) avoids a FutureWarning in recent pandas versions
height['std'] = height.groupby('subject_id')['result_value'].transform('std', ddof=0)

In [None]:
height.sort_values('std', ascending = False)[['subject_id', 'std']].drop_duplicates().head(10)

In [None]:
height[height['subject_id']==10005909]

In [None]:
height[height['subject_id']==10019917]

### For patients with multiple observations, let's calculate the median

**Notes:** 

* The best way to handle these types of inconsistencies will largely depend on the exact variable, question of interest, and domain knowledge. 
<br> </br>
* Seek consultation from mentors and domain experts! 
<br> </br>
* In this case, I'm not considering the date or specific admission. Some patients have multiple admissions. 
<br> </br>
* Additionally, depending how old they are, height may be expected to change over time. In those cases, we might want to consider the patient's age before just taking the median
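To see why the median is a reasonable simple choice here, the toy example below (made-up values) compares the mean and median for a patient with one implausible entry. The median is unaffected by the single outlier, while the mean is pulled far away from the plausible values.

```python
import pandas as pd

# made-up heights: patient 1 has one implausible 5-inch entry
toy_height = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "result_value": [65.0, 65.0, 5.0, 70.0],
})

# compare the per-patient mean and median
summary = toy_height.groupby("subject_id")["result_value"].agg(["mean", "median"])
print(summary)
```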

In [None]:
# create a new variable with the median height value for each patient
height['result_value_median'] = height.groupby('subject_id')['result_value'].transform('median')

**Save the new dataset**

This next step ensures that we will now only have one unique row per patient

In [None]:
# assuming we don't need to link by date
height_clean = height[['subject_id', 'result_name', 'result_value_median']].drop_duplicates()

Let's review the distribution again - now it looks a bit more normally distributed

In [None]:
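# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a similar plot
sns.histplot(height_clean['result_value_median'], kde=True, stat='density')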

In [None]:
height_clean.nunique()

In [None]:
len(height_clean)

### Weight (lbs)

In [None]:
weight.head()

### Review variable types 

It looks like `result_value` is a continuous variable. Let's make sure that Python also has recognized it as such.

In [None]:
# check variable data types
weight.dtypes

In [None]:
weight['result_value'] = pd.to_numeric(weight['result_value'])

In [None]:
# check variable data types
weight.dtypes

In [None]:
weight['result_value'].describe()

### Distribution of values

In [None]:
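# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a similar plot
sns.histplot(weight['result_value'], kde=True, stat='density')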

In [None]:
weight.sort_values(['result_value'])

### Investigate the variance in measurements for patients with multiple observations

In [None]:
# ddof=0 gives the population standard deviation, so patients with one measurement get 0 rather than NaN
# passing the string 'std' (rather than np.std) avoids a FutureWarning in recent pandas versions
weight['std'] = weight.groupby('subject_id')['result_value'].transform('std', ddof=0)

In [None]:
weight.sort_values('std', ascending = False)[['subject_id', 'std']].drop_duplicates().head(10)

In [None]:
weight[weight['subject_id']==10019385] # did this patient lose > 110 pounds in < 1 month?

**Potential Unit Issue? 97 kg ~ 214 pounds**

In [None]:
weight[weight['subject_id']==10021487].sort_values('chartdate')

### Notes:

In the first case, my intuition is that the discrepancy in weight measurements is due to a unit conversion issue: 97 kg ~ 214 pounds

In the second case, the patient has many weight measurements over a 1.5-year period. They do not vary too widely. It looks like there is a period of weight loss followed by an increase in weight (a U-shaped curve).
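A suspected unit issue like the first case could be flagged programmatically. The sketch below is one possible heuristic on made-up values, not a validated rule: the 25% and 10% thresholds and the column names `suspect_kg` and `weight_lbs` are assumptions chosen for illustration. A value is flagged when it is far from the patient's median as recorded, but close to it once converted from kilograms to pounds.

```python
import pandas as pd

KG_TO_LB = 2.20462

# made-up weights for one patient; the third value looks like it was entered in kg
toy_weight = pd.DataFrame({
    "subject_id": [1, 1, 1, 1],
    "result_value": [214.0, 212.0, 97.0, 215.0],
})

# per-patient median as a reference value
ref = toy_weight.groupby("subject_id")["result_value"].transform("median")

# the same values, treated as if they had been recorded in kilograms
as_lbs = toy_weight["result_value"] * KG_TO_LB

# flag values that are far from the reference as recorded but close after conversion
far_as_recorded = (toy_weight["result_value"] - ref).abs() / ref > 0.25
close_if_converted = (as_lbs - ref).abs() / ref <= 0.10
toy_weight["suspect_kg"] = far_as_recorded & close_if_converted

# keep the original value unless it was flagged, in which case use the converted one
toy_weight["weight_lbs"] = toy_weight["result_value"].where(~toy_weight["suspect_kg"], as_lbs)
print(toy_weight)
```

Any automated correction like this should be reviewed with a domain expert before it is applied to real data.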

### Notes on Missing Data

This could be its own class.

Rule of thumb: Talk to your domain experts/collaborators!

Some common ways to handle missing data:

* Remove those patients
* Impute mean or median
* Forward/backward filling
* MICE

In the above cases, I used the median to summarize each patient's repeated measurements. This may or may not be the best method depending on your use case. Here, I wanted to keep it simple!
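The first three strategies listed above can be sketched with pandas on a made-up table (the column names `charttime` and `weight` are assumptions for illustration). MICE is not shown; scikit-learn's experimental `IterativeImputer` is one implementation inspired by it.

```python
import numpy as np
import pandas as pd

# made-up repeated measurements with gaps
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "charttime": pd.to_datetime(
        ["2150-01-01", "2150-01-02", "2150-01-03", "2150-01-01", "2150-01-02"]
    ),
    "weight": [200.0, np.nan, 198.0, np.nan, 150.0],
})

# 1. remove rows with a missing value
dropped = toy.dropna(subset=["weight"])

# 2. impute the overall median of the observed values
median_imputed = toy["weight"].fillna(toy["weight"].median())

# 3. forward fill within each patient, in time order
toy = toy.sort_values(["subject_id", "charttime"])
ffilled = toy.groupby("subject_id")["weight"].ffill()

print(len(dropped), median_imputed.tolist(), ffilled.tolist())
```

Note that forward filling leaves the first value for patient 2 missing, because there is no earlier measurement to carry forward.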

### Evaluating Unit Conversions

Laboratory test measurements are another clinical variable that often needs cleaning.

In MIMIC we can use the `labevents` and `d_labitems` tables to work with laboratory data. 

Importantly, `d_labitems` links to the `labevents` table. 

The `d_labitems` table provides the definitions (or names) of the specific labs via the `itemid`.

In [None]:
# let's first look at the labevents table
labevents.head()

In [None]:
# next, let's look at the d_labitems table
d_labitems.head()

### There are a few ways you could think about working with the labs dataset. 

1. We can merge `d_labitems` and `labevents` by shared identifiers (we'll review this in the next section on merging tables)

2. We can search for specific labs of interest via the `label` column

3. We can also filter with specific `itemid` numbers if we know which correspond to labs of interest

First, let's try and see whether we can find the `itemid` for potassium

In [None]:
d_labitems[d_labitems['label'].str.contains('potassium', case = False, na=False)]

In [None]:
potassium = labevents[labevents['itemid']==50971]
potassium.nunique()

In [None]:
potassium

In [None]:
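# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a similar plot
sns.histplot(potassium['valuenum'], kde=True, stat='density')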

In [None]:
# let's check the distinct units of potassium
potassium[['valueuom']].value_counts()

### Let's evaluate another lab test, one which has more than one unit specification

In [None]:
d_labitems[d_labitems['itemid']==51249]

In [None]:
MCHC = labevents[(labevents['itemid']==51249)]
MCHC['valueuom'].value_counts()

In [None]:
lab_perc = MCHC[MCHC['valueuom']=='%']
lab_perc.nunique()

In [None]:
lab_gdl = MCHC[MCHC['valueuom']=='g/dL']
lab_gdl.nunique()

In [None]:
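# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a similar plot
sns.histplot(lab_gdl['valuenum'], kde=True, stat='density')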

In [None]:
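# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a similar plot
sns.histplot(lab_perc['valuenum'], kde=True, stat='density')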

## Merging tables

Tables can be merged (i.e., linked) together through shared identifiers. The official MIMIC [documentation](https://mimic.mit.edu/docs/iv/) is a great reference for learning more about the associations between tables
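When merging, it helps to check that the keys behave as expected. The sketch below uses made-up toy tables (the `itemid` 99999 is a hypothetical unmatched identifier) with two optional arguments of `pd.merge`: `validate`, which raises an error if the key relationship is not the one you expect, and `indicator`, which adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# toy dictionary table: one row per itemid
toy_d_labitems = pd.DataFrame({
    "itemid": [50971, 51249],
    "label": ["Potassium", "MCHC"],
})

# toy events table: many rows per itemid, plus one itemid with no definition
toy_labevents = pd.DataFrame({
    "itemid": [50971, 50971, 99999],
    "valuenum": [4.1, 3.9, 1.0],
})

merged = pd.merge(
    toy_labevents,
    toy_d_labitems,
    on="itemid",
    how="left",              # keep every event, even without a matching definition
    validate="many_to_one",  # raises if itemid is duplicated in the dictionary table
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# rows that did not find a match in the dictionary table
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
```

An inner merge, as used in this notebook, would silently drop the unmatched row, so a left merge with `indicator=True` is a useful check on how much data is being lost.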

---

### First, we will merge the different lab tables

In [None]:
labevents_with_itemid = pd.merge(d_labitems, labevents,
                      on = ['itemid'],
                      how = 'inner')

labevents_with_itemid

### Let's merge some additional MIMIC-IV tables

In [None]:
admissions = pd.read_csv('hosp/admissions.csv.gz')
diagnoses = pd.read_csv('hosp/diagnoses_icd.csv.gz')
d_icd_diagnoses = pd.read_csv('hosp/d_icd_diagnoses.csv.gz')

### Merge diagnoses (icd) tables

In [None]:
diagnoses.head()

In [None]:
d_icd_diagnoses.head()

In [None]:
diagnoses_with_code = pd.merge(diagnoses, 
                         d_icd_diagnoses,
                         on = ['icd_code', 'icd_version'],
                         how = 'inner')

In [None]:
diagnoses_with_code

### Let's merge the new diagnoses table with the admissions table

In [None]:
admissions.head()

In [None]:
admissions_with_icd = pd.merge(admissions,
                               diagnoses_with_code,
                               on = ['subject_id', 'hadm_id'],
                               how = 'inner')

In [None]:
admissions_with_icd.nunique()

In [None]:
admissions_with_icd.head()

### Filtering datasets (data cleaning / cohort specification)

Let's say we are only interested in patients with sepsis. We could filter our dataset using keyword matching or specific ICD codes (if we have specific codes we are interested in)
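Filtering by code could look something like the sketch below, which uses a made-up diagnoses table. The code lists are illustrative and deliberately incomplete; a real cohort definition should come from the literature or a clinical collaborator. Because the same code string could in principle appear in both ICD-9 and ICD-10, the filter matches on `icd_version` as well as `icd_code`.

```python
import pandas as pd

# made-up diagnoses with the same column names as diagnoses_icd
toy_dx = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "icd_code": ["99592", "A419", "4019", "99591"],
    "icd_version": [9, 10, 9, 9],
})

# illustrative (not exhaustive) sepsis codes, keyed by ICD version
sepsis_codes = {9: ["99591", "99592"], 10: ["A419"]}

# build a boolean mask that requires both the version and the code to match
mask = pd.Series(False, index=toy_dx.index)
for version, codes in sepsis_codes.items():
    mask |= (toy_dx["icd_version"] == version) & toy_dx["icd_code"].isin(codes)

toy_sepsis = toy_dx[mask]
print(toy_sepsis)
```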

In [None]:
# na=False treats any missing titles as non-matches rather than raising an error
sepsis_cohort = admissions_with_icd[admissions_with_icd['long_title'].str.contains('sepsis', case = False, na = False)]

In [None]:
sepsis_cohort.head()

In [None]:
sepsis_cohort.nunique()

In [None]:
sepsis_cohort[['icd_code', 'icd_version', 'long_title']].drop_duplicates()

In [None]:
sepsis_cohort_icd = admissions_with_icd[admissions_with_icd['icd_code']=='99592']
sepsis_cohort_icd

# Introduction to Machine Learning with scikit-learn

---

For this demonstration, we will use a previously prepared dataset from an earlier version of MIMIC. This data comes from a [community challenge](https://physionet.org/content/challenge-2012/1.0.0/) focused on predicting mortality with the MIMIC dataset.

To demonstrate the use of scikit-learn for machine learning, we will use the previously pre-processed dataset from one of the challenge winners (https://github.com/alistairewj/challenge2012). Data can be accessed via quest or from Alistair Johnson's GitHub. He also includes his [pre-processing notebook](https://github.com/alistairewj/challenge2012/blob/master/prepare-data.ipynb), which provides another example of how to proceed with cleaning clinical data.

### Hypothesis: Machine Learning can facilitate prediction of ICU mortality

In [None]:
# load in our machine learning functions from scikit-learn
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay, classification_report,
    roc_curve, roc_auc_score, precision_score, recall_score, accuracy_score,
    auc, average_precision_score, precision_recall_curve,
    precision_recall_fscore_support, f1_score, log_loss,
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GridSearchCV, RandomizedSearchCV, cross_validate
from sklearn.preprocessing import StandardScaler

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
# load in pre-processed challenge dataset
data_ml = pd.read_csv('PhysionetChallenge2012-set-b.csv')

### Let's review our dataset

In [None]:
data_ml.head()

In [None]:
data_ml.nunique()

### **Notes**: 

* This dataset has 4000 unique subjects and > 100 columns.

* As mentioned, this dataset is already pre-processed. I would still advocate for performing your own data quality checks. For the purposes of this exercise, my checks will be extremely limited. 

### Review missing data

In [None]:
# count how many variables have no missing values; We have 10 columns with complete entries
data_ml.dropna(axis = 1)

In [None]:
# count how many variables have < 100 missing values
data_ml.columns[data_ml.isnull().sum() < 100]

### Our limited data pre-processing

* We will subset the data frame to keep only the columns that have < 100 missing values. 
* To keep things simple, we will also just take the variables that are the `first` measurement
* We will subsequently drop the patients who have missing data.
* Note: `In-hospital_death` is our outcome variable

In [None]:
# we will remove `GCS_last`, 'Length_of_stay', and 'Survival' since these may leak info about our likely deceased patients
data_ml_clean = data_ml[['In-hospital_death', 'Age', 'SAPS-I', 'SOFA', 'Weight',
       'Height', 'GCS_first', 'Glucose_first', 'HR_first', 'Temp_first',
       'BUN_first', 'Creatinine_first', 'HCO3_first', 'HCT_first', 'K_first', 'Mg_first',
       'Na_first', 'Platelets_first', 'WBC_first']]

In [None]:
# drop any rows (patients) with a missing value
data_ml_clean = data_ml_clean.dropna()
data_ml_clean

In [None]:
# we have ~2,000 rows (or unique patients)
len(data_ml_clean)

### Prepare data for modeling

For continuous variables, many machine learning algorithms work best when the data have been standardized (mean = 0, standard deviation = 1).
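One common variation, not used in this notebook, is to learn the scaling parameters from the training data only, so that no information from the test set influences pre-processing. The sketch below shows this with a scikit-learn pipeline on synthetic data; the arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic predictors and a binary outcome loosely related to the first column
rng = np.random.default_rng(12345)
X_toy = rng.normal(loc=50, scale=10, size=(200, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=5, size=200) > 50).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=12345)

# the pipeline fits the scaler on the training data only, then fits the model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# the fitted scaler stores the training means and standard deviations
fitted_scaler = pipe.named_steps["standardscaler"]
X_tr_scaled = fitted_scaler.transform(X_tr)
print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))

# at prediction time the test set is scaled with the training parameters
test_accuracy = pipe.score(X_te, y_te)
```

A pipeline also plays well with cross-validation, since the scaler is refit within each training fold.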

In [None]:
# Specify the columns you want to standardize (in this case, every column except our outcome `In-hospital_death`)
# a binary variable such as Gender would also be left out, but it is not in this subset
columns_to_standardize = ['Age', 'SAPS-I', 'SOFA', 'Weight',
       'Height', 'GCS_first', 'Glucose_first', 'HR_first', 'Temp_first',
       'BUN_first', 'Creatinine_first', 'HCO3_first', 'HCT_first', 'K_first', 'Mg_first',
       'Na_first', 'Platelets_first', 'WBC_first']

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit and transform only the specified columns
data_ml_clean[columns_to_standardize] = scaler.fit_transform(data_ml_clean[columns_to_standardize])

In [None]:
data_ml_clean

In [None]:
data_ml_clean['Age'].describe()

## Create Training and Test Sets

In [None]:
## first, we will create two dataframes. 

# X will contain only our predictors 
X = data_ml_clean.drop(['In-hospital_death'], axis = 1)

# y contains our outcome 
y = data_ml_clean['In-hospital_death']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345, shuffle = True)

In [None]:
y_train.value_counts()#/len(y_train)

In [None]:
y_test.value_counts()#/len(y_test)

# Model Training & Evaluation

---

## Logistic Regression Algorithm

<img src="lecture_images/logistic_regression.png" alt="title" width="400" align="left">


* Commonly used for classification, thus need a binary outcome variable (e.g. survival or death)
* Good baseline model (simple!)
* L1 (Lasso) vs L2 (Ridge) - penalty to avoid overfitting

[Image Source](https://www.javatpoint.com/logistic-regression-in-machine-learning)
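The difference between the two penalties can be seen on synthetic data. In the sketch below (made-up data from `make_classification`; the value `C=0.05` is an arbitrary choice for illustration), the L1 penalty tends to shrink some coefficients exactly to zero, whereas the L2 penalty shrinks them toward zero without usually eliminating them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic data: 20 features, only 3 of which carry signal
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0, random_state=12345
)

# same solver and regularization strength, different penalty
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05, random_state=12345).fit(X_syn, y_syn)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.05, random_state=12345).fit(X_syn, y_syn)

# count coefficients that were shrunk exactly to zero
n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))
print("zero coefficients with L1:", n_zero_l1)
print("zero coefficients with L2:", n_zero_l2)
```

In scikit-learn, smaller values of `C` mean stronger regularization.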

### Train L2 (ridge) logistic regression model using cross-validation

In [None]:
# instantiate logistic regression model
lr = LogisticRegression(penalty='l2',solver='liblinear', random_state = 12345)

# use 10-fold cross-validation; with an integer cv and a classifier, scikit-learn uses stratified folds
cv_results = cross_validate(lr, X_train, y_train, cv=10, 
                            scoring = ['accuracy', 'roc_auc', 'recall', 'precision'],
                            return_train_score = False, return_estimator = False)

In [None]:
cv_results

In [None]:
# calculate the mean across folds
print('Accuracy:', cv_results['test_accuracy'].mean().round(3))
print('AUC:', cv_results['test_roc_auc'].mean().round(3))
print('Recall:', cv_results['test_recall'].mean().round(3))
print('Precision:', cv_results['test_precision'].mean().round(3))

In [None]:
##basic model performance - code prepared by Garrett Eickelberg

# fit our model on the training set
fit_lr = lr.fit(X_train, y_train)

# predicted classes using default 0.5 threshold
y_hat = fit_lr.predict(X_train) 

#predicted probabilities
y_proba = fit_lr.predict_proba(X_train)[:,1] 

# auc
auc = roc_auc_score(y_train, y_proba)

# model loss (log loss is computed from predicted probabilities, not hard class labels)
loss = log_loss(y_train, y_proba)

print ('the AUC is: {:0.3f}'.format(auc))
print ('the logloss is: {:0.3f}'.format(loss))
print("classification report:\n ", classification_report(y_train,y_hat, digits=3))
print("confusion matrix:\n ")

# Create the confusion matrix
cm = confusion_matrix(y_train, y_hat)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()

In [None]:
y_proba

## Evaluation of Training 

---

### Receiver Operating Characteristic Curve

An ROC curve plots the true positive rate (sensitivity) vs the false positive rate (1 - specificity) at different decision thresholds. This allows us to calculate the area under the curve (AUC) to estimate model performance. An AUC of 0.5 indicates a model that performs no better than chance.
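To make the definitions concrete, the sketch below computes the true positive rate and false positive rate by hand at two thresholds on a tiny made-up example, then computes the AUC with scikit-learn. The helper `tpr_fpr` is hypothetical and written only for this illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# made-up labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])


def tpr_fpr(y_true, y_score, threshold):
    """True positive rate and false positive rate at one decision threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)


# each threshold gives one point on the ROC curve
for t in [0.3, 0.5]:
    tpr, fpr = tpr_fpr(y_true, y_score, t)
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")

# AUC: the probability that a random positive is scored above a random negative
toy_auc = roc_auc_score(y_true, y_score)
print("AUC:", toy_auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```

Lowering the threshold from 0.5 to 0.3 catches both positives (TPR rises from 0.5 to 1.0) at the cost of one false alarm (FPR rises from 0.0 to 0.5).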

In [None]:
fpr, tpr, thresholds = roc_curve(y_train, y_proba, pos_label=1)

plt.title('ROC curve')
ax1 = plt.plot(fpr, tpr, 'b', label = '%s AUC = %0.3f' % ('', auc), linewidth=2)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

### Precision-Recall Curve

The curve plots precision against recall at different decision thresholds. Recall, or the true positive rate, tells us what proportion of all positive cases the model correctly identifies, whereas precision is the proportion of cases the model identified as positive that were truly positive.
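As a small worked example, the sketch below uses made-up labels and scores with few positive cases (as is typical for mortality prediction) and computes precision and recall at a 0.5 threshold, both by hand and with scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# made-up imbalanced example: 8 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.6, 0.7, 0.55, 0.9])

# classify as positive when the score is at least 0.5
y_pred = (y_score >= 0.5).astype(int)

# by hand
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

# with scikit-learn
precision_sk = precision_score(y_true, y_pred)
recall_sk = recall_score(y_true, y_pred)

print("precision:", precision_manual, precision_sk)  # 0.5: 2 of the 4 flagged cases are true positives
print("recall:", recall_manual, recall_sk)            # 1.0: both positives were found
```

Here the model finds every positive case but half of its alarms are false, which is the kind of trade-off the precision-recall curve makes visible.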

In [None]:
y_proba = fit_lr.predict_proba(X_train)[:,1]

precision, recall, thresholds = precision_recall_curve(y_train, y_proba, pos_label=1, sample_weight=None)
avg_p = average_precision_score(y_train, y_proba, pos_label=1, sample_weight=None)

plt.title('Precision-Recall curve')
# recall goes on the x-axis and precision on the y-axis, matching the axis labels below
ax1 = plt.plot(recall, precision, 'b', label = '%s AP = %0.3f' % ('', avg_p), linewidth=2)
plt.legend(loc = 'lower left')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('Precision')
plt.xlabel('Recall')

### Evaluation on Test Set

In [None]:
# predict on the test set
# predicted classes using default 0.5 threshold
y_hat = fit_lr.predict(X_test) 

#predicted probabilities
y_proba = fit_lr.predict_proba(X_test)[:,1] 

# auc
auc=roc_auc_score(y_test, y_proba)

# model loss (log loss is computed from predicted probabilities, not hard class labels)
loss = log_loss(y_test, y_proba)

print ('the AUC is: {:0.3f}'.format(auc))
print ('the logloss is: {:0.3f}'.format(loss))
print("classification report:\n ", classification_report(y_test, y_hat, digits=3))
print("confusion matrix:\n ")

# Create the confusion matrix
cm = confusion_matrix(y_test, y_hat)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()

### Receiver Operating Characteristic Curve

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_proba, pos_label=1)

plt.title('ROC curve')
ax1 = plt.plot(fpr, tpr, 'b', label = '%s AUC = %0.3f' % ('', auc), linewidth=2)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

### Precision-Recall Curve

In [None]:
y_proba = fit_lr.predict_proba(X_test)[:,1]

precision, recall, thresholds = precision_recall_curve(y_test, y_proba, pos_label=1, sample_weight=None)
avg_p = average_precision_score(y_test, y_proba, pos_label=1, sample_weight=None)

plt.title('Precision-Recall curve')
# recall goes on the x-axis and precision on the y-axis, matching the axis labels below
ax1 = plt.plot(recall, precision, 'b', label = '%s AP = %0.3f' % ('', avg_p), linewidth=2)
plt.legend(loc = 'lower left')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('Precision')
plt.xlabel('Recall')

## Random Forest Algorithm

<img src="lecture_images/decision_tree.png" alt="title" width="400" align="left">

* Random forest is an ensemble algorithm that learns by generating many (hundreds to thousands of) decision trees
* Randomly selects a subset of features when growing each tree 
* This approach allows us to average the predictions of the "crowd"

<b></b>
* Many **hyperparameters** that can be tuned during training:
<br></br>
* `n_estimators`: Number of trees
* `max_features`: Number of features to be considered at each split 
* `max_depth`: Maximum number of levels in tree (can help regularize the tree to prevent overfitting)
* `min_samples_split`: Minimum samples to split a node 
* `min_samples_leaf`: Minimum number of samples required at each leaf node 
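Tuning cost grows quickly with the number of hyperparameters. As a rough sketch, the example below counts the combinations in a grid with two candidate values per hyperparameter (the same values used in the grid in this notebook) and multiplies by the number of cross-validation folds. The variable names are made up for illustration.

```python
from sklearn.model_selection import ParameterGrid

# two candidate values for each of the five hyperparameters
toy_grid = {
    "n_estimators": [100, 300],
    "max_features": [2, 5],
    "max_depth": [5, 10],
    "min_samples_split": [5, 10],
    "min_samples_leaf": [10, 15],
}

# every combination of values is one candidate model
n_combinations = len(ParameterGrid(toy_grid))

# each candidate is fit once per cross-validation fold
n_folds = 5
n_fits = n_combinations * n_folds

print("combinations:", n_combinations)  # 2**5 = 32
print("model fits:", n_fits)            # 32 * 5 = 160
```

With the default `refit=True`, `GridSearchCV` also performs one final fit of the best combination on the full training set. When the grid becomes too large to search exhaustively, `RandomizedSearchCV` samples a fixed number of combinations instead.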

In [None]:
### tuning RF hyperparameters
# Number of trees in random forest
n_estimators = [100, 300]
# Number of features to consider at every split
max_features = [2, 5]
# Maximum number of levels in tree
max_depth = [5,10]
# Minimum number of samples required to split a node
min_samples_split = [5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [10, 15]

# create grid of hyperparameter settings
# grid search will evaluate every combination in this grid
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

### Hyperparameter Tuning

In [None]:
# Create a random forest classifier
rf = RandomForestClassifier(criterion='entropy', random_state=12345)

# Use grid search to find the best hyperparameters
grid_search = GridSearchCV(estimator = rf,
                           param_grid = param_grid,
                           cv = 5,
                           scoring = 'roc_auc',
                           return_train_score = False,
                           n_jobs = -1)

# Fit the grid search object to the data
grid_search.fit(X_train, y_train)

In [None]:
# Create a variable for the best model
best_rf = grid_search.best_estimator_

# Print the best hyperparameters
print('Best hyperparameters:',  grid_search.best_params_)

In [None]:
#grid_search.cv_results_

### Training Evaluation

In [None]:
y_pred = best_rf.predict(X_train)
y_prob = best_rf.predict_proba(X_train)[:,1]

accuracy = accuracy_score(y_train, y_pred)
precision = precision_score(y_train, y_pred)
recall = recall_score(y_train, y_pred)
auroc = roc_auc_score(y_train, y_prob)

print("Accuracy:", accuracy)
print("AUC:", auroc)
print("Precision:", precision)
print("Recall:", recall)

### Feature Importance

In [None]:
# review feature importances on the training set
feature_importances = pd.Series(best_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Plot a simple bar chart
feature_importances.plot.bar();

In [None]:
y_hat = best_rf.predict(X_train) # predicted classes using default 0.5 threshold
y_proba = best_rf.predict_proba(X_train)[:,1] #predicted probabilities
auc=roc_auc_score(y_train, y_proba)
loss = log_loss(y_train, y_proba) # log loss is computed from predicted probabilities, not hard class labels

print ('the AUC is: {:0.3f}'.format(auc))
print ('the logloss is: {:0.3f}'.format(loss))
print("classification report:\n ", classification_report(y_train,y_hat, digits=3))
print("confusion matrix:\n ")
# Create the confusion matrix
cm = confusion_matrix(y_train, y_hat)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()

### Receiver Operating Characteristic Curve

In [None]:
fpr, tpr, thresholds = roc_curve(y_train, y_proba, pos_label=1)

plt.title('ROC curve')
ax1 = plt.plot(fpr, tpr, 'b', label = '%s AUC = %0.3f' % ('', auc), linewidth=2)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

### Precision-Recall Curve

In [None]:
y_proba = best_rf.predict_proba(X_train)[:,1]

precision, recall, thresholds = precision_recall_curve(y_train, y_proba, pos_label=1, sample_weight=None)
avg_p = average_precision_score(y_train, y_proba, pos_label=1, sample_weight=None)

plt.title('Precision-Recall curve')
# recall goes on the x-axis and precision on the y-axis, matching the axis labels below
ax1 = plt.plot(recall, precision, 'b', label = '%s AP = %0.3f' % ('', avg_p), linewidth=2)
plt.legend(loc = 'lower left')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('Precision')
plt.xlabel('Recall')

### Evaluation of Test Set

In [None]:
y_hat = best_rf.predict(X_test) # predicted classes using default 0.5 threshold
y_proba = best_rf.predict_proba(X_test)[:,1] #predicted probabilities
auc=roc_auc_score(y_test, y_proba)
loss = log_loss(y_test, y_proba) # log loss is computed from predicted probabilities, not hard class labels

print ('the AUC is: {:0.3f}'.format(auc))
print ('the logloss is: {:0.3f}'.format(loss))
print("classification report:\n ", classification_report(y_test,y_hat, digits=3))
print("confusion matrix:\n ")
# Create the confusion matrix
cm = confusion_matrix(y_test, y_hat)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()

### Receiver Operating Characteristic Curve

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_proba, pos_label=1)

plt.title('ROC curve')
ax1 = plt.plot(fpr, tpr, 'b', label = '%s AUC = %0.3f' % ('', auc), linewidth=2)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

### Precision-Recall Curve

In [None]:
y_proba = best_rf.predict_proba(X_test)[:,1]

precision, recall, thresholds = precision_recall_curve(y_test, y_proba, pos_label=1, sample_weight=None)
avg_p = average_precision_score(y_test, y_proba, pos_label=1, sample_weight=None)

plt.title('Precision-Recall curve')
# recall goes on the x-axis and precision on the y-axis, matching the axis labels below
ax1 = plt.plot(recall, precision, 'b', label = '%s AP = %0.3f' % ('', avg_p), linewidth=2)
plt.legend(loc = 'lower left')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('Precision')
plt.xlabel('Recall')