# Machine Learning for Healthcare (Interpretability)
This Jupyter notebook demonstrates several ways to analyze the Bilirubin dataset:
* Basic data analysis:
    * Determining the dataset size
    * Determining whether the dataset is balanced or imbalanced
    * Investigating the variability of the data using Principal Component Analysis (PCA)
* Using the Logistic Regression (LR) classifier to analyze the Bilirubin data
* Using the Random Forest (RF) classifier to analyze the Bilirubin data
    * Comparing the impact of varying forest and tree sizes
    * Comparing to the LR classifier
* Feature selection: using RF and LR to identify the most important features in the dataset
* Dataset balance: using downsampling and upsampling to balance the dataset and comparing performance

## Reading in raw data from CSV
The raw data for this exercise is stored in a CSV file. The first 10 rows of the data are provided below, just to give you an idea of what it looks like:

```
hours_since_birth,GA (days),BiliBGA,Weight,Bili_Weight_ratio,MothersAge,IsPreterm,Arterial_pH,hasFTlimit
7.58105,243,118,1929,0.061304,30,1,7.42,1
62.65124,264,8,2577,0.003085,34,0,6.22,0
30.56771,274,94,3302,0.028395,38,0,7.35,0
60.80173,260,66,2183,0.030425,39,0,7.21,0
77.64267,228,46,2568,0.017848,37,1,7.13,0
35.92315,233,70,1964,0.035554,28,1,7.14,0
110.58025,264,90,2972,0.030240,29,0,6.28,1
105.23609,263,117,2548,0.045773,34,0,6.93,0
1.90231,291,86,4056,0.021273,39,0,7.47,0
36.82003,230,99,2073,0.047600,30,1,7.41,0

```

We'll be using the [Pandas](https://pandas.pydata.org/) library to read in and work with this data.

In [None]:
import pandas as pd
data_frame = pd.read_csv("/cluster/courses/ml4h/data_for_users/data/bili_generated_ext.csv")
data_frame = data_frame[0:2000]  # Only use the first 2000 samples in this notebook

Let's take another look at the data, in a nicer format:

In [None]:
print(data_frame[0:10].to_string())

## Data Analysis
Now, let's take a closer look at the data. The `shape` attribute of the DataFrame tells us how many rows and columns there are:

In [None]:
print(data_frame.shape)

Next, let's dig a little deeper into the composition of the dataset. In our case, the `hasFTlimit` column denotes whether or not the baby requires phototherapy. How many require it and how many do not? We can determine this by comparing the `hasFTlimit` column to either `True` or `False`, and counting the number of matches:

In [None]:
# Remove the target column from the data set, it should not be part of the input features
target_column = data_frame.pop("hasFTlimit")
count_requires_treatment = len(target_column[target_column == True])
no_treatment_required = len(target_column[target_column == False])
print("Requires treatment: %i" % count_requires_treatment)
print("No treatment required: %i" % no_treatment_required)
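
The same class counts can also be obtained with pandas' `value_counts`, which additionally reports the share of each class when `normalize=True` is passed. The sketch below is only an illustration: it uses the ten `hasFTlimit` values from the sample rows shown at the top of the notebook rather than the full dataset, so the numbers it prints will differ from those of the cell above.

```python
import pandas as pd

# hasFTlimit values copied from the ten sample rows shown at the top of the notebook
labels = pd.Series([1, 0, 0, 0, 0, 0, 1, 0, 0, 0], name="hasFTlimit")

counts = labels.value_counts()                    # absolute count per class
fractions = labels.value_counts(normalize=True)   # share of each class

print(counts.to_dict())
print(fractions.to_dict())
```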

Thus, we can see that the dataset is very imbalanced. This will be explored later. For now, let's continue by taking a look at the variability present in the data using [Principal Component Analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis).

We will define a function that performs PCA for a given input dataset, since we will repeat this analysis several more times.

In [None]:
# Define function
def do_pca(dataset, target_column):
    # Declare and fit PCA object, transform data
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(dataset)
    pca_output = pca.transform(dataset)

    # Plot transformed data
    import matplotlib.pyplot as plt
    import seaborn as sns
    figure = plt.figure(figsize=(8, 8))
    sns.scatterplot(x=pca_output[:,0], y=pca_output[:,1], hue=target_column, alpha=0.7)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()

In [None]:
# Standardize the data (remove mean and scale to unit variance)
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data_frame)

# Call the function
do_pca(standardized_data, target_column)
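
When reading a two-component PCA plot, it helps to know how much of the total variance the two components actually capture. A minimal sketch follows, assuming synthetic stand-in data (not the Bilirubin data); `explained_variance_ratio_` is the scikit-learn attribute that holds the per-component share.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Bilirubin data): 200 samples, 5 features on different scales
rng = np.random.default_rng(0)
raw = rng.normal(loc=[0, 10, 100, 1000, 5], scale=[1, 5, 20, 300, 0.1], size=(200, 5))

# Standardize so that no feature dominates the PCA purely because of its scale
standardized = StandardScaler().fit_transform(raw)

pca = PCA(n_components=2).fit(standardized)
explained = pca.explained_variance_ratio_

print("Variance explained by PC1 and PC2:", explained)
print("Total explained by the 2D projection: %f" % explained.sum())
```

If the total is low, the scatter plot shows only a small part of the structure in the data, and overlapping classes in the plot do not necessarily mean that the classes are inseparable.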

## Logistic Regression
Now we will fit a logistic regression classifier to the standardized data, and then use it to classify the same data, using simple accuracy as a performance metric.

In [None]:
import numpy as np

# Define function to fit and return LR classifier
def fit_lr_classifier(dataset, labels):
    # Fit classifier
    from sklearn.linear_model import LogisticRegression
    lr_classifier = LogisticRegression(solver="lbfgs", max_iter=200)
    lr_classifier.fit(dataset, labels)
    return lr_classifier

# Get classifier
lr_classifier = fit_lr_classifier(standardized_data, target_column)

# Get predictions
predictions = lr_classifier.predict_proba(standardized_data)
predicted_labels = np.argmax(predictions, axis=1)

# Calculate accuracy
accuracy = np.sum(predicted_labels == target_column) / len(target_column)
print("Accuracy: %f" % accuracy)
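
On an imbalanced dataset, an accuracy value is easier to judge next to a trivial baseline: a "classifier" that always predicts the majority class. The sketch below illustrates this with the ten `hasFTlimit` values from the sample rows shown earlier, not the full dataset.

```python
import numpy as np

# hasFTlimit values from the ten sample rows shown earlier (2 positives, 8 negatives)
labels = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0])

# A "classifier" that ignores its input and always predicts the majority class
majority_class = np.bincount(labels).argmax()
baseline_predictions = np.full_like(labels, majority_class)

baseline_accuracy = np.mean(baseline_predictions == labels)
print("Majority-class baseline accuracy: %f" % baseline_accuracy)
```

A model whose accuracy is close to this baseline may not have learned much about the minority class, even if the number looks high.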

## Evaluation Metrics
Accuracy is not the only way (or perhaps even the best way) to evaluate the performance of a classifier. Let's additionally calculate the AUROC and AUPRC metrics, and plot the respective curves.

In [None]:
# Define a function to evaluate the performance of a classifier based on the results
def evaluate_classifier(results):
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import roc_auc_score, roc_curve, average_precision_score, precision_recall_curve
    
    # Set up AUROC plot
    figure = plt.figure(figsize=(10, 8))
    ax = sns.lineplot(x=[0, 1], y=[0, 1], color="red", label="Random guess area 0.5")  # Plot random guess threshold
    
    # Iterate over provided result sets
    for data_name, data_value in results.items():
        # Unpack values
        predictions, target_column = data_value
        
        # Calculate the AUROC score
        auroc_score = roc_auc_score(target_column, predictions[:, 1])

        # Plot the receiver operating characteristic curve
        roc_fpr, roc_tpr, _ = roc_curve(target_column, predictions[:, 1])

        ax = sns.lineplot(x=roc_fpr, y=roc_tpr, label="%s area %f" % (data_name, auroc_score))

    # Set options and show plot
    ax.set_title("Receiver Operating Characteristic")
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.legend(loc="lower right")
    plt.show()

    # Set up precision recall plot
    figure = plt.figure(figsize=(10, 8))
    
    # Calculate PR random guess baseline as len(positive samples) / len(total samples),
    # using the labels of the first result set
    first_labels = next(iter(results.values()))[1]
    pr_baseline = np.sum(first_labels) / len(first_labels)
    ax = sns.lineplot(x=[0, 1], y=[pr_baseline, pr_baseline], color="red", label="Random guess area %f" % pr_baseline)
    
    # Iterate over provided result sets
    for data_name, data_value in results.items():
        # Unpack values
        predictions, target_column = data_value
    
        # Calculate average precision
        avg_precision = average_precision_score(target_column, predictions[:, 1])
    
        # Plot the precision recall curve
        precision, recall, _ = precision_recall_curve(target_column, predictions[:, 1])
        ax = sns.lineplot(x=recall, y=precision, label="%s area %f" % (data_name, avg_precision))
    
    # Set options and show plot
    ax.set_title("Precision Recall")
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.legend(loc="lower left")
    plt.show()

# Do evaluation
evaluate_classifier({'Bilirubin dataset': (predictions, target_column)})
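
To build some intuition for the AUROC value: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it this way on made-up toy scores. The helper `pairwise_auroc` is a hypothetical name introduced for illustration and is not how scikit-learn computes the metric internally.

```python
def pairwise_auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Toy labels and scores, made up for illustration
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

auroc = pairwise_auroc(toy_labels, toy_scores)
# The random-guess precision is simply the fraction of positive samples
pr_baseline = sum(toy_labels) / len(toy_labels)
print("AUROC: %f, PR baseline: %f" % (auroc, pr_baseline))
```

Three of the four positive/negative pairs are ordered correctly here. This also shows why the random-guess line sits at 0.5 in the ROC plot but at the positive-class fraction in the precision-recall plot.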

Now we will scale the dataset (as before) and split it into training and testing sets, using a 70%/30% split.

In [None]:
# Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(data_frame)
standardized_data = scaler.transform(data_frame)

# Train/test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(standardized_data, target_column, test_size=0.3)
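
A common variation, which the cell above does not use, is to stratify the split so that both parts keep the same class ratio, and to fit the scaler on the training part only so that no test-set statistics reach the model. The sketch below shows this on synthetic stand-in data. All variable names are illustrative and chosen so that they do not clash with the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 3 features, 20% positives
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
labels = np.array([1] * 20 + [0] * 80)

# stratify keeps the class ratio the same in both splits; random_state makes the split repeatable
x_tr, x_te, y_tr, y_te = train_test_split(features, labels, test_size=0.3, stratify=labels, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits
split_scaler = StandardScaler().fit(x_tr)
x_tr_scaled = split_scaler.transform(x_tr)
x_te_scaled = split_scaler.transform(x_te)

print("Positive fraction (train): %f" % y_tr.mean())
print("Positive fraction (test): %f" % y_te.mean())
```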

## Random Forest

Next, we will train a [RF classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) using 100 trees, each with a maximum depth of 3, and 5-fold cross-validation. We will also define a set of metrics to use for evaluation (the same metrics as in the last notebook), and use them to obtain a performance baseline for this classifier. Of the 5 folds, we select the classifier with the best AUROC as the baseline.

In [None]:
# Imports
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Evaluation metrics
eval_metrics = {'AUROC': 'roc_auc', 'avg_precision': 'average_precision', 'Accuracy': make_scorer(accuracy_score)}

# Train with cross validation
cv_results = cross_validate(RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42), x_train, y_train, cv=5, scoring=eval_metrics, return_estimator=True)

# Determine the best classifier from the folds, and use it as the baseline
best_idx = np.argmax(cv_results['test_AUROC'])
baseline_clf = cv_results['estimator'][best_idx]

print("Baseline performance:")
print("AUROC: %f, Avg. Precision: %f, Accuracy: %f" % (cv_results['test_AUROC'][best_idx], cv_results['test_avg_precision'][best_idx], cv_results['test_Accuracy'][best_idx]))
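
The dictionary returned by `cross_validate` holds one score per fold under keys of the form `test_<metric name>`. Besides picking the best fold, it can be informative to look at the mean and spread across folds. The sketch below does this on synthetic, imbalanced stand-in data (not the Bilirubin data), with a smaller forest to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced stand-in data (NOT the Bilirubin data)
demo_x, demo_y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0)

demo_metrics = {'AUROC': 'roc_auc', 'avg_precision': 'average_precision', 'Accuracy': make_scorer(accuracy_score)}
demo_results = cross_validate(RandomForestClassifier(n_estimators=20, max_depth=3, random_state=42), demo_x, demo_y, cv=5, scoring=demo_metrics)

# Each metric name is prefixed with "test_" and holds one score per fold
for metric_name in demo_metrics:
    fold_scores = demo_results['test_' + metric_name]
    print("%s: %f +/- %f" % (metric_name, fold_scores.mean(), fold_scores.std()))
```

A large spread across folds suggests that the score of the single best fold is an optimistic estimate.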

Let's now take a closer look at how the hyperparameters of the classifier impact the performance. In this example, the two hyperparameters we'll adjust are the number of trees in the random forest and the depth of the trees. We are interested in finding the best combination of these parameters. This is a classic example of a [grid search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search), and we will use the [corresponding functionality of the scikit-learn library](https://scikit-learn.org/stable/modules/grid_search.html#multimetric-grid-search) to help execute it. Note that in this cell we will use 3-fold CV instead of 5-fold CV in the interest of saving computation time.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {'n_estimators': [50, 100, 150], 'max_depth': [1, 3, 5]}

# Execute grid search
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=param_grid, scoring=eval_metrics, refit='AUROC', cv=3)
gs.fit(x_train, y_train)

Given that we defined a 3x3 grid of parameter values to search, we expect to have results for 9 different permutations of parameter values. Indeed, we can see the permutations tested in the grid search, as well as their results. Given that we have been using *AUROC* as our primary metric, which model parameters do you think perform best? How do the results vary with different parameter values?

In [None]:
print("Permutations:")
print(gs.cv_results_['params'])

print("Mean AUROC result across folds:")
print(gs.cv_results_['mean_test_AUROC'])

print("Mean avg. precision result across folds:")
print(gs.cv_results_['mean_test_avg_precision'])

print("Mean accuracy result across folds:")
print(gs.cv_results_['mean_test_Accuracy'])
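
The printed arrays can be hard to line up with their parameter settings. Since `cv_results_` is a dictionary of equally long arrays, one option is to load it into a pandas DataFrame and sort by the metric of interest. The sketch below uses synthetic stand-in data and a deliberately small 2x2 grid, so its numbers say nothing about the Bilirubin data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a deliberately small 2x2 grid to keep the example fast
demo_x, demo_y = make_classification(n_samples=200, n_features=8, random_state=0)
demo_grid = {'n_estimators': [5, 10], 'max_depth': [1, 3]}

demo_gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=demo_grid, scoring={'AUROC': 'roc_auc'}, refit='AUROC', cv=3)
demo_gs.fit(demo_x, demo_y)

# One row per parameter combination, best mean AUROC first
results_table = pd.DataFrame(demo_gs.cv_results_)[['param_n_estimators', 'param_max_depth', 'mean_test_AUROC', 'std_test_AUROC']]
results_table = results_table.sort_values('mean_test_AUROC', ascending=False)
print(results_table.to_string(index=False))
```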

In [None]:
print("Best parameters: " + str(gs.best_params_))
grid_search_clf = gs.best_estimator_

Let's now compare the performance of our baseline classifier and the best classifier from our grid search on the test set.

In [None]:
# Helper function to compute evaluation metrics for trained classifiers
def evaluate_trained_clf(clf, x_test, y_test, scoring):
    from sklearn.metrics import get_scorer
    for score, score_fn in scoring.items():
        # Scorers given by name (e.g. 'roc_auc') must be looked up; callables are used as-is
        score_fn = get_scorer(score_fn) if isinstance(score_fn, str) else score_fn
        print("%s: %f" % (score, score_fn(clf, x_test, y_test)))

print("Baseline classifier:")
evaluate_trained_clf(baseline_clf, x_test, y_test, eval_metrics)
print()

print("Grid search classifier:")
evaluate_trained_clf(grid_search_clf, x_test, y_test, eval_metrics)

Let's also compare to the LR classifier, this time using the LASSO (L1 regularization) penalty.

In [None]:
# Define function to fit and return LR classifier
def fit_lr_classifier(dataset, labels, C=1.0):
    # Fit classifier
    from sklearn.linear_model import LogisticRegression
    lr_classifier = LogisticRegression(solver="liblinear", penalty="l1", C=C, max_iter=200)
    lr_classifier.fit(dataset, labels)
    return lr_classifier

lr_classifier = fit_lr_classifier(x_train, y_train, 0.1)
print("LR classifier:")
evaluate_trained_clf(lr_classifier, x_test, y_test, eval_metrics)

## Feature Selection
The random forest classifier exposes the feature importances determined during training in its `feature_importances_` attribute. Let's see how this looks for our dataset. Which features do you think matter more for prediction?

In [None]:
for idx, imp in enumerate(grid_search_clf.feature_importances_):
    print("%s: %f" % (data_frame.columns[idx], imp))

The logistic regression classifier doesn't expose feature importances directly. However, the signs and magnitudes of the learned feature coefficients can be used to judge the influence of each feature on the overall prediction:

In [None]:
for idx, imp in enumerate(lr_classifier.coef_[0]):
    print("%s: %f" % (data_frame.columns[idx], imp))

As it is a little difficult to interpret these values simply by looking at them, let's try a little visualization. One typical way to compare the coefficients or importances learned by these classifiers is to plot a bar chart. Let's do this for both the LR and RF classifiers. Was the LASSO penalty able to induce some sparsity in the LR classifier?

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(14, 8))
sns.barplot(x=data_frame.columns, y=lr_classifier.coef_[0])
ax.set_title("Logistic Regression w/ LASSO coefficients")
plt.show()

fig, ax = plt.subplots(figsize=(14, 8))
sns.barplot(x=data_frame.columns, y=grid_search_clf.feature_importances_)
ax.set_title("Random Forest feature importance")
plt.show()

Exercise: Update the `C` value in the cell below, rerunning the cell as necessary, and try to induce more sparsity (hint: the `C` parameter is the *inverse* of the regularization strength).

In [None]:
C = 0.1
new_lr_classifier = fit_lr_classifier(x_train, y_train, C)

fig, ax = plt.subplots(figsize=(14, 8))
sns.barplot(x=data_frame.columns, y=new_lr_classifier.coef_[0])
ax.set_title("Logistic Regression w/ LASSO coefficients")
plt.show()
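
Besides inspecting the bar chart, sparsity can be quantified by counting the coefficients that the L1 penalty has set exactly to zero. The sketch below does this for a few `C` values on synthetic stand-in data (not the Bilirubin data), so it only illustrates the mechanics and not the outcome of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 features, only 3 of which carry signal
demo_x, demo_y = make_classification(n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0)
demo_x = StandardScaler().fit_transform(demo_x)

nonzero_counts = {}
for c_value in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(solver="liblinear", penalty="l1", C=c_value, max_iter=200)
    clf.fit(demo_x, demo_y)
    # Count coefficients that were not driven exactly to zero by the L1 penalty
    nonzero_counts[c_value] = int(np.count_nonzero(clf.coef_[0]))
    print("C=%g: %i non-zero coefficients" % (c_value, nonzero_counts[c_value]))
```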

Another visualization that might give some insight into feature importance is the [partial dependence plot](https://scikit-learn.org/stable/modules/partial_dependence.html). These plots shed some light on how each feature impacts the final classification result. Since we have scaled the input data, we need to add some extra code if we want the values on the axes of the charts to correspond to the original values. Here, we will create partial dependence plots for two input features (`Bili_Weight_ratio` and `MothersAge`, which are columns 4 and 5 of the input data).

In [None]:
# Code for restoring original data values
def inverse_transform_dim(scaler, dim, val):
    return "%.2f" % (scaler.var_[dim] ** 0.5 * val + scaler.mean_[dim])

def transform_x_labels(axes, dim, scaler):
    orig_x_values = axes.get_xticks()
    transformed_x_values = [inverse_transform_dim(scaler, dim, x) for x in orig_x_values]
    axes.set_xticks(orig_x_values)
    axes.set_xticklabels(transformed_x_values)
    axes.set_xlim(orig_x_values[1], orig_x_values[-2])

def transform_y_labels(axes, dim, scaler):
    orig_y_values = axes.get_yticks()
    transformed_y_values = [inverse_transform_dim(scaler, dim, y) for y in orig_y_values]
    axes.set_yticks(orig_y_values)
    axes.set_yticklabels(transformed_y_values)
    axes.set_ylim(orig_y_values[1], orig_y_values[-2])

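# Plot partial dependence charts
charts_to_create = [4, 5]
# plot_partial_dependence was removed in scikit-learn 1.2; use its replacement when available
try:
    from sklearn.inspection import PartialDependenceDisplay
    plot_partial_dependence = PartialDependenceDisplay.from_estimator
except (ImportError, AttributeError):
    from sklearn.inspection import plot_partial_dependence
fig, ax = plt.subplots(figsize=(16, 10))
dpd = plot_partial_dependence(grid_search_clf, x_test, charts_to_create, feature_names=data_frame.columns, ax=ax)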

# Transform labels on x-axis
k = 0
for i in range(len(dpd.axes_)):
    for j in range(len(dpd.axes_[i])):
        if dpd.axes_[i][j] is None:
            continue
            
        transform_x_labels(dpd.axes_[i][j], charts_to_create[k], scaler)
        k += 1

plt.show()
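
The helper `inverse_transform_dim` undoes the standardization by hand: it multiplies by the feature's standard deviation (the square root of `scaler.var_`) and adds the mean. The sketch below checks on a handful of `Weight` values from the sample rows that this matches the scaler's built-in `inverse_transform`. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Weight values copied from the first five sample rows shown at the top of the notebook
weights = np.array([[1929.0], [2577.0], [3302.0], [2183.0], [2568.0]])

demo_scaler = StandardScaler().fit(weights)
scaled = demo_scaler.transform(weights)

# Manual inverse, mirroring inverse_transform_dim: value * standard deviation + mean
manual = scaled[:, 0] * demo_scaler.var_[0] ** 0.5 + demo_scaler.mean_[0]

# Built-in inverse provided by the scaler
builtin = demo_scaler.inverse_transform(scaled)[:, 0]

print(manual)
print(builtin)
```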

Note that it is also possible to combine two features and create a partial dependence plot that shows how these two features jointly impact the outcome. To do this, simply pass a tuple of two feature IDs instead of a single feature ID in the list of features to plot.

## Dataset Balance
Earlier, we saw that the Bilirubin dataset is imbalanced. Let's balance the dataset using both downsampling and upsampling, and see how this impacts performance. We will use the [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn) library to do this.

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

print("Original sample count: %i" % len(target_column))
print("Positive class sample count: %i"  % len(target_column[target_column == True]))
print("Negative class sample count: %i"  % len(target_column[target_column == False]))
print()

# Create new balanced dataset using downsampling
downsampled_x, downsampled_y = RandomUnderSampler().fit_resample(standardized_data, target_column)
print("Downsampled sample count: %i" % len(downsampled_y))
print("Downsampled positive class sample count: %i"  % len(downsampled_y[downsampled_y == True]))
print("Downsampled negative class sample count: %i"  % len(downsampled_y[downsampled_y == False]))
print()

# Create new balanced dataset using upsampling
upsampled_x, upsampled_y = RandomOverSampler().fit_resample(standardized_data, target_column)
print("Upsampled sample count: %i" % len(upsampled_y))
print("Upsampled positive class sample count: %i"  % len(upsampled_y[upsampled_y == True]))
print("Upsampled negative class sample count: %i"  % len(upsampled_y[upsampled_y == False]))
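
Conceptually, random downsampling discards majority-class samples until the classes match, while random upsampling duplicates minority-class samples. The NumPy sketch below illustrates the idea on the ten `hasFTlimit` values from the sample rows. It is a simplified illustration under that assumption, not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# hasFTlimit values from the ten sample rows shown earlier (2 positives, 8 negatives)
labels = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0])
positive_idx = np.flatnonzero(labels == 1)
negative_idx = np.flatnonzero(labels == 0)

# Downsampling: keep every positive, draw an equally sized subset of negatives without replacement
down_idx = np.concatenate([positive_idx, rng.choice(negative_idx, size=len(positive_idx), replace=False)])

# Upsampling: keep all samples, then add randomly repeated positives until both classes match
extra_positives = rng.choice(positive_idx, size=len(negative_idx) - len(positive_idx), replace=True)
up_idx = np.concatenate([negative_idx, positive_idx, extra_positives])

print("Downsampled class counts:", np.bincount(labels[down_idx]))
print("Upsampled class counts:", np.bincount(labels[up_idx]))
```

Downsampling throws information away, whereas upsampling repeats the same few minority samples many times. Both effects are worth keeping in mind when comparing the results.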

Now let's evaluate the training performance of the downsampled and upsampled classifiers with 5-fold CV. How does the performance of the balanced RF classifiers compare to that of the unbalanced one? Which metrics are impacted, and which are not?

In [None]:
# Train downsampled classifier with 5 fold CV
down_x_train, down_x_test, down_y_train, down_y_test = train_test_split(downsampled_x, downsampled_y, test_size=0.3)
down_cv_results = cross_validate(RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42), down_x_train, down_y_train, cv=5, scoring=eval_metrics, return_estimator=True)
down_best_idx = np.argmax(down_cv_results['test_AUROC'])
down_clf = down_cv_results['estimator'][down_best_idx]

print("Unbalanced training performance:")
print("AUROC: %f, Avg. Precision: %f, Accuracy: %f" % (cv_results['test_AUROC'][best_idx], cv_results['test_avg_precision'][best_idx], cv_results['test_Accuracy'][best_idx]))
print()

print("Downsampled training performance:")
print("AUROC: %f, Avg. Precision: %f, Accuracy: %f" % (down_cv_results['test_AUROC'][down_best_idx], down_cv_results['test_avg_precision'][down_best_idx], down_cv_results['test_Accuracy'][down_best_idx]))
print()

# Train upsampled classifier with 5 fold CV
up_x_train, up_x_test, up_y_train, up_y_test = train_test_split(upsampled_x, upsampled_y, test_size=0.3)
up_cv_results = cross_validate(RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42), up_x_train, up_y_train, cv=5, scoring=eval_metrics, return_estimator=True)
up_best_idx = np.argmax(up_cv_results['test_AUROC'])
up_clf = up_cv_results['estimator'][up_best_idx]

print("Upsampled training performance:")
print("AUROC: %f, Avg. Precision: %f, Accuracy: %f" % (up_cv_results['test_AUROC'][up_best_idx], up_cv_results['test_avg_precision'][up_best_idx], up_cv_results['test_Accuracy'][up_best_idx]))

Exercise for the reader: Try to evaluate the performance of the trained downsampled and upsampled classifiers using the test sets (`down_x_test`, `down_y_test`) and (`up_x_test`, `up_y_test`). Feel free to reuse code from earlier in the notebook, adjusting as necessary. The `evaluate_trained_clf` function will help here.