CPSC 8810 Machine Learning for Biomedical Applications

### NOTE
To enable all displays in this notebook, you will need to execute the following command in a terminal after installing ipywidgets

# Assignment 1 - Regression with Structured Data
# SUPPORT2 Dataset
In this assignment, you are asked to use the [SUPPORT2](https://archive.ics.uci.edu/dataset/880/support2) dataset to create regression models that estimate the total charges for a patient's hospital stay based on input features that include patient demographics, disease information, clinical status scores, and physiological measurements. Please read through the notebook and follow the instructions for each of the 7 problems.

In [None]:
# Google Colab setup
# mount the google drive - this is necessary to access supporting src
from google.colab import drive
drive.mount("/content/drive")

In [None]:
# install any packages not found in the Colab environment
!pip install ucimlrepo
!pip install catboost
!pip install ipywidgets
!pip install tableone

In [None]:
# imports
from ucimlrepo import fetch_ucirepo
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import make_regression
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence, variance_inflation_factor
import numpy as np
from prettytable import PrettyTable
import catboost as cb
from tableone import TableOne

# local project imports
import sys
sys.path.append("/content/drive/MyDrive/Colab Notebooks/CPSC-8810-ML-BioMed/src")
from plotting import plt_kde_grid, plt_box_grid, plt_xy_scatter_grid
from uci_utils import get_vars_of_type, get_vars_of_type_in_list
from regression_util import plot_fitted_resids, plot_outliers, plot_leverage
from filter import correlation_filter

In [None]:
# global settings
pd.options.display.max_columns = 100
rs = 654321 # random state, use this to ensure reproducibility

The following code cell loads the data and prepares it for the analyses to be performed in this assignment. Similar to practicum-02, the pre-processing includes: (1) dropping samples with any missing data; (2) splitting the data into train and test splits; (3) removal of highly correlated feature pairs; (4) standardization of continuous features; and (5) creation of dummy variables for categorical features.
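To make step (4) concrete, here is a minimal sketch of standardization that estimates the mean and standard deviation from the training split only and reuses them for the test split. It uses NumPy on synthetic data rather than the `StandardScaler` call in the cell below, and the array names `train` and `test` are illustrative assumptions, not variables from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-ins for continuous feature matrices (not SUPPORT2 data)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# estimate the scaling parameters from the training split only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mu) / sigma
# reuse the training mean and std for the test split to avoid data leakage
test_scaled = (test - mu) / sigma

print(train_scaled.mean(axis=0).round(6))  # approximately zero by construction
print(test_scaled.mean(axis=0).round(3))   # close to, but not exactly, zero
```

The test split is not exactly centered after scaling, which is expected: its own statistics were never used.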

Execute the following cell to load and preprocess the data. Please do not modify the cell.

In [None]:
####################################################################################################
# DO NOT CHANGE THIS CELL
####################################################################################################

# fetch the SUPPORT2 dataset
support2 = fetch_ucirepo(id=880)

# data (as pandas dataframes)
X = support2.data.features.drop(['charges', 'totcst', 'totmcst', 'hday', 'dementia'], axis=1).dropna()
y = support2.data.features.loc[X.index]['charges'] # total cost of patient stay
meta_vars = support2.variables
feature_type_corrections = [('edu', 'Integer'),
                            ('prg6m', 'Continuous'),
                            ('adls', 'Categorical'),
                            ('diabetes', 'Categorical'),
                            ('dementia', 'Categorical')]
# several of the features have the wrong type, so we correct them here
for tpl in feature_type_corrections:
    row = meta_vars[meta_vars.name == tpl[0]].index[0]
    meta_vars.loc[row, 'type'] = tpl[1]


y = y.dropna() # drop missing values in the target
X = X.loc[y.index] # drop the corresponding rows in the features
y.reset_index(drop=True, inplace=True)
X.reset_index(drop=True, inplace=True)

# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=rs)

# drop highly correlated features
continuous_vars, _ = get_vars_of_type(X_train, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Continuous')
features_to_drop = correlation_filter(X_train[continuous_vars], threshold=0.95)
X_train.drop(features_to_drop, axis=1, inplace=True)
X_test.drop(features_to_drop, axis=1, inplace=True)

# standardize the continuous features
continuous_vars, X_train_continuous = get_vars_of_type_in_list(X_train, meta_vars, var_type_key = 'type', var_name_key = 'name', type_list = ['Continuous', 'Integer'])
scaler = preprocessing.StandardScaler().fit(X_train_continuous)
X_train_continuous_scaled = scaler.transform(X_train_continuous)
# note we use the same scaler for the test data to prevent data leakage
continuous_vars, X_test_continuous = get_vars_of_type_in_list(X_test, meta_vars, var_type_key = 'type', var_name_key = 'name', type_list = ['Continuous', 'Integer'])
X_test_continuous_scaled = scaler.transform(X_test_continuous)

# create dummy variables for categorical features
categorical_vars, X_train_categorical = get_vars_of_type(X_train, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Categorical')
X_train_categorical_dummy = pd.get_dummies(X_train_categorical, columns=categorical_vars,drop_first=True, dtype=int)
categorical_vars, X_test_categorical = get_vars_of_type(X_test, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Categorical')
X_test_categorical_dummy = pd.get_dummies(X_test_categorical, columns=categorical_vars,drop_first=True, dtype=int)
for c in X_train_categorical_dummy.columns:
    if c not in X_test_categorical_dummy.columns:
        X_test_categorical_dummy[c] = 0
for c in X_test_categorical_dummy.columns:
    if c not in X_train_categorical_dummy.columns:
        X_train_categorical_dummy[c] = 0

# combine the continuous and categorical features
X_train_new = pd.concat([pd.DataFrame(X_train_continuous_scaled, columns=X_train_continuous.columns), X_train_categorical_dummy.reset_index(drop=True)], axis=1)
X_test_new = pd.concat([pd.DataFrame(X_test_continuous_scaled, columns=X_train_continuous.columns), X_test_categorical_dummy.reset_index(drop=True)], axis=1)
y_train_new = y_train.reset_index(drop=True)
y_test_new = y_test.reset_index(drop=True)

# Problem 1 (1 point) - Table One
In the code cell below, use the Python TableOne package to create a Table One for the SUPPORT2 dataset. Only include the following information in the table: Age, Sex, Education, Income, Race, and Hospital Charges. The following features should be taken from the `X` variable:

| Variable          | Column Header in X | Type        |
|-------------------|--------------------|-------------|
| Age (years)       | age                | Continuous  |
| Education (years) | edu                | Continuous  |
| Income            | income             | Categorical |
| Race              | race               | Categorical |
| Sex               | sex                | Categorical |

These features should be combined with the `y` variable, which contains only the continuous variable `charges` (the _Total Hospital Charges_ for each sample), and then provided to the `TableOne` constructor. Additionally, pass a list of the categorical features to the `TableOne` constructor using the `categorical` keyword argument. Finally, print the table using `tablefmt="fancy_grid"`. <br/><br/>__Hint: Refer to practicum-01.__

In [None]:
# PROBLEM 1
####### ENTER YOUR CODE HERE #######

# Problem 2 (1 point)
In the code cell below, plot a kernel density estimate or box plot of the unscaled hospital charges contained in the variable `y`. In the figure title, include the _range_ and _median_ of the total charges. This will help us later in interpreting the model results.
<br/><br/>__Hint: Refer to practicum-02.__
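
As a small illustration of the summary statistics requested for the title, the sketch below computes a range and median and formats them into a string. It uses synthetic right-skewed values, not the `charges` data, and the wording of the title is only one possible choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic, right-skewed stand-in for a charges-like variable (not real data)
values = rng.lognormal(mean=10.0, sigma=1.0, size=500)

lo, hi = values.min(), values.max()
med = np.median(values)

# thousands separators make large dollar amounts easier to read
title = f"Range: {lo:,.0f} to {hi:,.0f}; median: {med:,.0f}"
print(title)
```

A string built this way can be passed to `plt.title` or `ax.set_title` once the figure has been drawn.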

In [None]:
# PROBLEM 2
####### ENTER YOUR CODE HERE #######

# Problem 3 (1 point)
In the code cell below, use the `OLS` class in the Python statsmodels package to create an ordinary least squares multilinear model to estimate the hospital charges. The model should be fit with the features in the training set, `X_train_new`, and the targets in the training set, `y_train_new`. After fitting the model, print the results summary.
<br/><br/>__Hint: Refer to practicum-02.__

In [None]:
# PROBLEM 3
####### ENTER YOUR CODE HERE #######

### Test-set evaluation
In the code cell, the `compute_mae` method computes the mean absolute error of the model on the provided test data. The argument `rslt` provided to the `compute_mae` method is the result object returned by the `sm.OLS.fit` method.
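
For reference, the mean absolute error is $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. The sketch below is not the notebook's `compute_mae`; it is a hypothetical NumPy helper, `mean_absolute_error_sketch`, that shows the calculation on made-up numbers. With a fitted statsmodels result, the predictions would presumably come from something like `rslt.predict(X_test_new)`, which is an assumption about how the helper is used.

```python
import numpy as np

def mean_absolute_error_sketch(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# made-up targets and predictions, for illustration only
y_true = [100.0, 250.0, 400.0]
y_pred = [110.0, 240.0, 430.0]
mae = mean_absolute_error_sketch(y_true, y_pred)
print(mae)  # absolute errors are 10, 10, and 30, so the mean is 50/3
```

Because the MAE is in the same units as the target, it can be compared directly with the range and median of the charges.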

# Problem 4 (2 points)
As detailed in the _Linear Regression_ lecture, we often want to limit the number of features we include in the model, either to avoid overfitting or to improve model interpretability. In the code cell below, you are asked to complete the implementation of the `forward_selection` method. This method takes as input `X`, a feature set; `y`, a target set; and a `significance_level`, which is the p-value threshold for selecting a feature. At each iteration, the method should:
1.  For each remaining feature, fit a new OLS multilinear model using that feature combined with all previously selected features.
2.  Store the p-value of that feature.
3.  Select the feature with the lowest p-value. If that p-value is less than the significance level, add the feature to the selected features and continue checking the remaining features. Otherwise, exit and return the selected features and the best model, that is, the model fitted to the data using only the selected features. <br/><br/>


In the code cell below, complete the `forward_selection` method. There are two locations that you need to complete. In _Step 1_, you'll need to create an OLS multilinear model instance using the selected features and the current feature. Do __NOT__ fit the model as this is already implemented in the code. In _Step 2_, you'll need to update the `selected_features` and `remaining_features` variables.
<br/><br/>__Hint: Refer to practicum-02.__
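
The greedy structure of forward selection can be sketched independently of statsmodels. Note that the example below swaps in a different criterion: instead of p-values, it adds the feature that most reduces the residual sum of squares (RSS) and stops when the relative gain falls below a threshold. The names `greedy_forward_rss` and `min_gain` are hypothetical, and the data are synthetic.

```python
import numpy as np

def rss(X, y):
    """RSS of a least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def greedy_forward_rss(X, y, names, min_gain=0.01):
    """Greedy forward selection using RSS reduction (not p-values)."""
    selected, remaining = [], list(range(X.shape[1]))
    total = rss(np.empty((len(y), 0)), y)  # intercept-only baseline
    current = total
    while remaining:
        # score every candidate together with the already selected columns
        scores = [(rss(X[:, selected + [j]], y), j) for j in remaining]
        best_rss, best_j = min(scores)
        # stop when the best candidate no longer improves the fit enough
        if (current - best_rss) / total < min_gain:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        current = best_rss
    return [names[j] for j in selected]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
# only columns 0 and 2 carry signal; columns 1 and 3 are pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)
selected = greedy_forward_rss(X_demo, y_demo, ["x0", "x1", "x2", "x3"])
print(selected)
```

The p-value version in this assignment follows the same loop: score each remaining candidate, pick the best one, and either move it from the remaining list to the selected list or stop.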

In [None]:
def forward_selection(X, y, significance_level=0.1):
    selected_features = []
    remaining_features = list(X.columns)
    best_model = None
    current_best_pvalue = 1.0

    while remaining_features:
        pvalues = []
        for feature in remaining_features:
            ####### STEP 1: START YOUR CODE HERE #######
            # update the model variable
            model = None
            ####### END YOUR CODE HERE #######
            reslt = model.fit()
            pvalue = reslt.pvalues[feature]
            pvalues.append((feature, pvalue))

        # Find the feature with the lowest p-value
        best_feature, best_pvalue = min(pvalues, key=lambda x: x[1])

        if best_pvalue < significance_level:
            ####### STEP 2: START YOUR CODE HERE #######
            # update the selected_features and remaining_features lists
            # delete the pass keyword
            pass
            ####### END YOUR CODE HERE #######
        else:
            break

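    if len(selected_features) > 0:
        return selected_features, sm.OLS(y, X[selected_features]).fit()
    else:
        print("No significant features found.")
        # return a consistent (features, model) pair so callers can unpack it
        return selected_features, None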

In the cell below, we apply the `forward_selection` procedure to fit a new linear regression model using the selected features.

In [None]:
selected_features, res = forward_selection(X_train_new, y_train_new)
print(res.summary())

# Problem 5 (2 points)
In the markdown cell below, provide your assessment of the OLS multilinear model fit on the training data using the features selected with the `forward_selection` method. Specifically, how does the $R^2$ compare to that of the model fit using all of the features in Problem 3? Compare the model fit and coefficient statistics between the two models. What do these suggest to you about the tradeoff between model complexity and performance on the training data?
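
When comparing models with different numbers of features, it can help to recall how $R^2$ and adjusted $R^2$ relate: $R^2 = 1 - \mathrm{SS}_{res}/\mathrm{SS}_{tot}$ and $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ for $n$ samples and $p$ features. The sketch below uses these textbook definitions for a model with an intercept on made-up numbers; the helper name `r2_and_adjusted` is hypothetical, and the values reported in a statsmodels summary may be defined differently, for example when the design matrix has no constant column.

```python
import numpy as np

def r2_and_adjusted(y, y_hat, n_features):
    """Centered R^2 and adjusted R^2 for a model with an intercept."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    n = len(y)
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return float(r2), float(adj)

# made-up targets and predictions, for illustration only
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fit = [1.1, 1.9, 3.2, 3.8, 5.0]
r2, adj_one = r2_and_adjusted(y_obs, y_fit, n_features=1)
_, adj_three = r2_and_adjusted(y_obs, y_fit, n_features=3)
print(r2, adj_one, adj_three)
```

For the same predictions, the adjusted value drops as the feature count grows, which is why it is often preferred when comparing models of different sizes.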

__Problem 5: Enter your response here__

# CatBoost
Now let's build a boosted tree model using the same SUPPORT2 dataset to predict the total hospital charges for each patient. In the cell below, we first create the `Pool` objects used by CatBoost. Note that CatBoost requires categorical variables that take real values to be converted to string types.

In [None]:
####################################################################################################
# DO NOT CHANGE THIS CELL
####################################################################################################
X[categorical_vars] = X[categorical_vars].astype('str')

# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=rs)

# create the catboost datasets
train_dataset = cb.Pool(X_train, y_train, cat_features=categorical_vars)
test_dataset = cb.Pool(X_test, y_test, cat_features=categorical_vars)

# Problem 6 (1 point)
In the code cell below, use the `CatBoostRegressor` class to create a boosted tree regression model. Then fit the model to the `train_dataset`. Finally, print the model results using the `get_best_score` method of the object returned by the `CatBoostRegressor().fit()` method.
<br/><br/>__Hint: Refer to practicum-02.__

In [None]:
# PROBLEM 6
####### ENTER YOUR CODE HERE #######

# Problem 7 (2 points)
In the CatBoost model, the training error gets very small, while the test set error remains large. How do you interpret this result with respect to the generalizability of the model?

__Problem 7: Enter your response here__