CPSC 8810 Machine Learning for Biomedical Applications

# Practicum 02 - Regression with Structured Biomedical Data

In this practicum, we apply regression models on the dataset, [Infrared Thermography Temperature](https://archive.ics.uci.edu/dataset/925/infrared+thermography+temperature+dataset). Per the [UCI website](https://archive.ics.uci.edu/dataset/925/infrared+thermography+temperature+dataset), _"this dataset contains temperatures read from various locations of inferred images about patients, with the addition of oral temperatures measured for each individual. The 33 features consist of gender, age, ethnicity, ambiant temperature, humidity, distance, and other temperature readings from the thermal images"_. Our goal is to develop regression models that can accurately predict a patient's oral temperature based on the infrared images. We develop a multilinear regression model using [statsmodels](https://www.statsmodels.org/stable/index.html) and a gradient boosting tree model using [catboost](https://catboost.ai/).

Before working with the temperature data, we will first illustrate the model development and evaluation approaches using a simulated dataset.

This practicum will implement some of the concepts discussed in the two previous lectures, _Linear Regression_ and _Tree-based regression_. Specifically, we will:
-  Demonstrate ordinary least squares (OLS) multilinear regression on a simulated dataset
-  Apply OLS multilinear regression to the _Infrared Thermography Temperature_ dataset
-  Demonstrate gradient boosting trees on the same simulated dataset
-  Apply gradient boosting trees to the _Infrared Thermography Temperature_ dataset
-  Compare the multilinear regression and gradient boosting tree model performance

In [None]:
# Google Colab setup
# mount the google drive - this is necessary to access supporting src
from google.colab import drive
drive.mount("/content/drive")

In [None]:
# install any packages not found in the Colab environment
!pip install ucimlrepo
!pip install catboost
!pip install ipywidgets

In [None]:
# imports
from ucimlrepo import fetch_ucirepo
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import make_regression
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence, variance_inflation_factor
import numpy as np
from prettytable import PrettyTable
import catboost as cb

# local project imports
import sys
sys.path.append("/content/drive/MyDrive/Colab Notebooks/CPSC-8810-ML-BioMed/src")
from plotting import plt_kde_grid, plt_box_grid, plt_xy_scatter_grid
from uci_utils import get_vars_of_type
from regression_util import plot_fitted_resids, plot_outliers, plot_leverage
from filter import correlation_filter

from google.colab import output
output.enable_custom_widget_manager()

In [None]:
# global settings
pd.options.display.max_columns = 100
rs = 654321 # random state, use this to ensure reproducibility

# Simulated data

To get started, let's generate simulated data for regression. We will use this data to illustrate the regression model development and evaluation process using both multilinear regression and gradient boosting tree regression approaches.

To generate the simulated data, we use the `make_regression` method from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html). For demonstration, we will split the data into training and test sets.

In [None]:
# generate simulated regression data
bias = 1 # bias term, set to 0 for no bias
n = 250 # number of samples
n_train = 200 # number of training samples
X, y, coefs = make_regression(n_samples=n, bias = bias,
                              n_features=8, n_informative=6, effective_rank=5, tail_strength=0.1,
                              n_targets=1,
                              noise = 2.5,
                              coef=True, random_state=rs, shuffle=True)
# split into train and test
X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]

Let's plot the distribution of the dependent variable, $y$ (also referred to as the endogenous variable).

In [None]:
# plot the kde of y using seaborn
fig, ax = plt.subplots(figsize=(8, 4))
sns.kdeplot(y, fill=True, ax=ax)
ax.set_xlabel('y')
ax.set_ylabel('density')
ax.set_title(f'Distribution of y (range=[{np.min(y):.1f},{np.max(y):.1f}], \u03BC={np.mean(y):.2f}, \u03C3={np.std(y):.2f})')
plt.tight_layout()
plt.show()


Next, let's plot the dependent variable, $y$, against each of the independent features (also referred to as exogenous variables). This will give us a sense of the relationship between each independent feature and $y$. We should expect to see some linear trend for the informative features and no relationship for the non-informative features. The presence of noise will mask this relation to some extent.

In [None]:
informative = ["Informative" if coefs[_]!=0 else "Non-Informative" for _ in range(len(coefs))]
colors = ["blue" if coefs[_]!=0 else "red" for _ in range(len(coefs))]
plt_xy_scatter_grid(X, y, labels = informative, colors = colors, fig_size=(12,8));

Now let's review the coefficients used to generate the simulated data. For the OLS regression model, we anticipate that the learned coefficients will be close to these.

In [None]:
# Let's print the true coefficients
coef_table = PrettyTable(['Variable', 'Actual Coefficient'])
for _ in range(len(coefs)):
    coef_table.add_row([f"X{_+1}", f"{coefs[_]:.4f}"])
print(coef_table)

# OLS Multilinear Regression Models
## Simulated data model
Let's start by fitting an OLS multilinear model to the simulated data. We will use the [statsmodels OLS class](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html#statsmodels.regression.linear_model.OLS) to create the model. When generating the simulated data we used the `bias` variable to set the offset in the output variable, $y$. If `bias` is non-zero, we need to prepend a constant column to the independent variable values when creating the OLS model. After fitting the model with the _training data_, we print the summary, which provides a number of useful statistics to assess the model fit.

In [None]:
if bias != 0:
    mod = sm.OLS(y_train, sm.add_constant(X_train, prepend=True)) # add constant term to account for bias
else:
    mod = sm.OLS(y_train, X_train) # no bias term
res = mod.fit() # fit the model
print(res.summary())

Let's make sure we understand the important values in the summary table:
- _No. Observations_: the number of samples used to create the model
- _Df Residuals_: degrees of freedom in the residuals, $y-\hat{y}$
- _Df Model_: degrees of freedom in the model
- _R-squared_: equal to the squared correlation, $Cor\left(Y,\hat{Y}\right)^2$, between the fitted and actual values (in linear regression with 1 variable it is the square of the correlation between $Y$ and $X$). It measures how much of the variance in the dependent variable, $y$, is explained by the independent variables, $X$
- _Adj. R-squared_: a penalized version of _R-squared_ that accounts for the fact that adding independent variables never decreases _R-squared_, even if they are uninformative
- _F-statistic_: the test statistic for the null hypothesis that the true linear coefficients are all zero (that is, there is no linear relationship between the independent and dependent variables). Values near 1 suggest there is no relationship, values >> 1 suggest at least one coefficient is truly non-zero
- _Prob (F-statistic)_: the probability of observing an F-statistic value at least as large as the observed one if the null hypothesis that all coefficients are actually zero is true. Thus, values close to zero indicate the null hypothesis should be rejected.
- _Log-likelihood_: a measure of model goodness of fit. Can range over $\left(-\infty, \infty\right)$. Higher values are better. The actual value is difficult to interpret; it is typically used to compare two or more models
- _AIC_: The _Akaike Information Criterion_ is also a measure of model fit where lower values are better. It is defined by $2 \left(p - LL\right)$ where LL is the Log-likelihood and $p$ is the number of model parameters (including the bias term)
- _BIC_: The _Bayesian Information Criterion_ is another measure of model fit where lower values are better. It is defined by $p\ln N - 2\,LL$ where LL is the Log-likelihood, $N$ is the number of samples in the training data, and $p$ is the number of model parameters (including the bias term)
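The fit statistics in the list above can be reproduced by hand. The following is a minimal sketch using only NumPy on made-up data (the sizes, coefficients, and seed are illustrative assumptions, not the practicum's simulated dataset). It assumes the Gaussian log-likelihood evaluated at the maximum-likelihood error variance, $RSS/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3                                   # samples and predictors (illustrative sizes)
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])            # design matrix with a constant column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # OLS coefficient estimates
resid = y - A @ beta

rss = np.sum(resid**2)                          # residual sum of squares
tss = np.sum((y - y.mean())**2)                 # total sum of squares
p = k + 1                                       # number of parameters, including the bias term

r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)       # penalizes additional parameters
f_stat = ((tss - rss) / k) / (rss / (n - p))    # explained vs. unexplained variance per df

# Gaussian log-likelihood at the maximum-likelihood variance estimate rss / n
ll = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
aic = 2 * p - 2 * ll
bic = p * np.log(n) - 2 * ll

print(f"R2={r2:.3f}  adj R2={adj_r2:.3f}  F={f_stat:.1f}")
print(f"LL={ll:.1f}  AIC={aic:.1f}  BIC={bic:.1f}")
```

Because $\ln N > 2$ once $N \geq 8$, BIC penalizes each extra parameter more heavily than AIC for any realistically sized training set.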

The columns in the bottom portion of the table represent:
- _coef_: the estimated value of the coefficients
- _std_err_: the _standard error_ of the coefficient estimates, tells us the amount, on average, that the estimate deviates from the actual coefficient
- _t_: the _t-statistic_ for the given coefficient. It is a test of the null hypothesis that the individual coefficient is actually zero (indicating no relation with the outcome, $y$).
- _P>|t|_ : the probability of observing a t-statistic value at least as large as the observed one if the null hypothesis (of no relation for this coefficient) is true. Values close to zero indicate the null hypothesis should be rejected.
- _[0.025_: lower bound of the 95% confidence interval for the coefficient
- _0.975]_: upper bound of the 95% confidence interval for the coefficient
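To make the coefficient columns concrete, here is a minimal NumPy sketch of how the standard errors and t-statistics follow from the fit. The data are made up for illustration. The interval uses the normal quantile 1.96 as a simplifying assumption; an exact interval would use the corresponding quantile of the t distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))
true_coefs = np.array([2.0, -1.0, 0.0])          # the last predictor is uninformative
y = 1.0 + X @ true_coefs + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])             # constant column first
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

p = A.shape[1]
sigma2 = resid @ resid / (n - p)                 # unbiased estimate of the error variance
cov_beta = sigma2 * np.linalg.inv(A.T @ A)       # covariance matrix of the estimates
se = np.sqrt(np.diag(cov_beta))                  # standard error of each coefficient
t = beta / se                                    # t-statistic for H0: coefficient = 0

# Approximate 95% interval (normal quantile; an assumption of this sketch)
lower, upper = beta - 1.96 * se, beta + 1.96 * se

for name, b, s, tv in zip(["const", "X1", "X2", "X3"], beta, se, t):
    print(f"{name:>5}: coef={b:7.3f}  std err={s:.3f}  t={tv:8.2f}")
```

The informative predictors produce t-statistics far from zero, while the uninformative one stays on the order of a standard normal draw.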

Finally, the lower portion of the summary table provides information on the residuals:
- _Omnibus_: test statistic for normality of the residuals; values near 0 indicate the residuals have a normal distribution
- _Prob(Omnibus)_: the p-value of the Omnibus test; values near 1 are consistent with normally distributed residuals
- _Skew_: measure of data symmetry (0 is perfect symmetry)
- _Kurtosis_: measure of how heavy the tails of the residual distribution are. A normal distribution has a kurtosis of 3; higher values indicate heavier tails (more extreme residuals)
- _Durbin-Watson_: measure of autocorrelation in the residuals (independence of the error terms). It ranges over $\left[0,4\right]$; values near 2 indicate no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation
- _Jarque-Bera (JB) and Prob(JB)_: an alternate normality test and its p-value, analogous to Omnibus and Prob(Omnibus), respectively
- _Cond. No._: measure of collinearity with higher values indicating collinearity
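Several of these residual diagnostics are simple moment calculations. The sketch below computes skew, kurtosis, Jarque-Bera, and Durbin-Watson with NumPy on stand-in residuals (random draws, not the residuals of the practicum model). It uses the non-excess definition of kurtosis, so a normal distribution scores 3. The helper name `dw_stat` is made up for this example.

```python
import numpy as np

def dw_stat(e):
    """Durbin-Watson: sum of squared successive differences over sum of squares."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(2)
resid = rng.normal(size=500)                 # stand-in for independent, normal residuals

e = resid - resid.mean()
n = e.size
m2 = np.mean(e ** 2)
skew = np.mean(e ** 3) / m2 ** 1.5           # 0 for a symmetric distribution
kurt = np.mean(e ** 4) / m2 ** 2             # 3 for a normal distribution
jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
dw = dw_stat(resid)

# Positively autocorrelated residuals for comparison: an AR(1) series with rho = 0.9
noise = rng.normal(size=500)
ar = np.zeros(500)
for i in range(1, 500):
    ar[i] = 0.9 * ar[i - 1] + noise[i]
dw_ar = dw_stat(ar)

print(f"skew={skew:.3f}  kurtosis={kurt:.3f}  JB={jb:.3f}")
print(f"Durbin-Watson: independent={dw:.2f}  autocorrelated={dw_ar:.2f}")
```

For independent residuals the Durbin-Watson value sits near 2; the autocorrelated series pulls it toward 0.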

In [None]:
##### NOT INCLUDED IN THE STUDENT VERSIONS OF THE PRACTICUM #####
# In addition to R2, we can also compute the RSE (residual standard error)
# here we do this for the training data
if bias != 0:
    y_pred = res.predict(sm.add_constant(X_train, prepend=True))
else:
    y_pred = res.predict(X_train)

rss = np.sum((y_pred - (y_train))**2) # residual sum of squares
rse = np.sqrt(rss/res.df_resid) # residual standard error, using the residual degrees of freedom (n - p - 1)
tss = np.sum((y_train - np.mean(y_train))**2) # total sum of squares
R2 = 1 - rss/tss
print("R2: ", R2)
print("RSE: ", rse)
print("mean(y): ", np.mean(y_train))

### Potential issues with the model

#### Non-linear data and heteroscedasticity
Plotting the residual values, $y-\hat{y}$, as a function of the fitted values, $\hat{y}$, we can examine:
1. The presence of non-linearity in the data (i.e., is there a non-linear relation between the predictors and $y$). This would be revealed in the plot by an observable trend. We do __not__ expect this in the simulated data as it was designed to be linear.
2. The presence of _heteroscedasticity_, i.e., non-constant variance in the model errors. This can be seen in the plot by adding lines representing the outer quantiles of the residual values (i.e., by binning the residuals relative to the fitted values and computing the outer quantiles for each bin) and assessing the presence of a _funnel_ or other curvature indicating the variance varies. In the example, heteroscedasticity is not obviously present. (Note that the _Durbin-Watson_ value in the summary tests for autocorrelation of the residuals rather than heteroscedasticity; values near 2 indicate no autocorrelation.)

__Note__: The `res` variable, returned by `mod.fit`, is an instance of [RegressionResults](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html#statsmodels.regression.linear_model.RegressionResults). The `res` instance attributes include:
-  `res.fittedvalues`: The values predicted by the model, $\hat{y}$
-  `res.resid` : The residual values, $y - \hat{y}$


In [None]:
plot_fitted_resids(res.fittedvalues, res.resid)
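
The `plot_fitted_resids` helper comes from the course's local `regression_util` module, whose source is not shown here. As a rough illustration of the binning idea described above, the following sketch computes outer quantiles of the residuals per bin of fitted values. The function name `binned_quantiles`, the equal-count bins, and the 5%/95% quantiles are all assumptions of this example, not necessarily what the helper does.

```python
import numpy as np

def binned_quantiles(fitted, resid, n_bins=10, q_low=0.05, q_high=0.95):
    """Split observations into equal-count bins by fitted value and
    return the bin centers with the lower/upper residual quantiles."""
    fitted, resid = np.asarray(fitted), np.asarray(resid)
    order = np.argsort(fitted)
    bins = np.array_split(order, n_bins)          # index groups, sorted by fitted value
    centers = np.array([fitted[b].mean() for b in bins])
    lo = np.array([np.quantile(resid[b], q_low) for b in bins])
    hi = np.array([np.quantile(resid[b], q_high) for b in bins])
    return centers, lo, hi

rng = np.random.default_rng(3)
fitted = rng.uniform(0, 10, size=2000)
resid_const = rng.normal(scale=1.0, size=2000)        # constant variance
resid_funnel = rng.normal(scale=0.2 + 0.3 * fitted)   # variance grows with the fitted value

centers, lo_c, hi_c = binned_quantiles(fitted, resid_const)
_, lo_f, hi_f = binned_quantiles(fitted, resid_funnel)
width_const = hi_c - lo_c
width_funnel = hi_f - lo_f

print("band width, constant variance:", np.round(width_const, 2))
print("band width, funnel:           ", np.round(width_funnel, 2))
```

A roughly flat band width suggests homoscedastic errors, while a width that grows (or shrinks) steadily across bins is the funnel pattern associated with heteroscedasticity.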

#### Outliers
We can evaluate the presence of _outliers_ (observations whose actual value, $y$, is far from the fitted value, $\hat{y}$) by plotting the fitted values against the _studentized residuals_ (each residual divided by its estimated standard error). Observations with a studentized residual whose absolute value is greater than 3 are typically considered to be outliers.

__Note__: The `influence` variable below is an instance of [OLSInfluence](https://www.statsmodels.org/devel/generated/statsmodels.stats.outliers_influence.OLSInfluence.html#statsmodels.stats.outliers_influence.OLSInfluence). Its attributes include `resid_studentized`, the studentized residuals.

In [None]:
influence = res.get_influence()
plot_outliers(res.fittedvalues, influence.resid_studentized)

#### High Leverage Points
We can evaluate inputs with high leverage (those that are far from the distribution of other input samples) by plotting the leverage statistic against the studentized residuals. The average leverage for all observations is always $\left(p+1\right)/n$ where $p$ is the model degrees of freedom (the number of predictors) and $n$ is the number of observations. A point that greatly exceeds this value may be considered to have high leverage.

__Note__: The `influence` object also has the attribute `hat_matrix_diag`, which represents the leverage of each point. See [OLSInfluence](https://www.statsmodels.org/devel/generated/statsmodels.stats.outliers_influence.OLSInfluence.html#statsmodels.stats.outliers_influence.OLSInfluence).

In [None]:
influence = res.get_influence()
leverage = influence.hat_matrix_diag
plot_leverage(influence.resid_studentized, leverage, (res.df_model+1)/res.nobs)
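
For intuition about where leverage and studentized residuals come from, here is a minimal NumPy sketch that builds the hat matrix directly on made-up data with one deliberately extreme input row. It computes the internally studentized form of the residuals; whether that matches the exact variant returned by the statsmodels attribute is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 4
X = rng.normal(size=(n, k))
X[0] = 8.0                                    # one input row far from the rest
y = 1.0 + X @ np.ones(k) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])          # design matrix with a constant column
H = A @ np.linalg.pinv(A)                     # hat matrix A (A^T A)^{-1} A^T
leverage = np.diag(H)                         # leverage of each observation
resid = y - H @ y                             # residuals, since y_hat = H y

p = A.shape[1]                                # number of parameters, k + 1
s2 = resid @ resid / (n - p)                  # estimated error variance
r_student = resid / np.sqrt(s2 * (1 - leverage))

print(f"mean leverage = {leverage.mean():.4f}, (k+1)/n = {p / n:.4f}")
print(f"highest-leverage row: {np.argmax(leverage)}, leverage = {leverage.max():.3f}")
print(f"largest |studentized residual| = {np.abs(r_student).max():.2f}")
```

The leverages sum to the number of fitted parameters, which is why their average is always $\left(p+1\right)/n$ regardless of the data.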

#### Collinearity
We can examine pair-wise collinearity with a correlation matrix as in practicum 1. However, multicollinearity can occur among more than two features without any pair of features demonstrating pair-wise collinearity. To detect multicollinearity, we can examine the variance inflation factor (VIF) of each of the variables. As a rule of thumb, a VIF of $\ge 10$ indicates the variable is collinear with some other subset of variables to a degree that may degrade model performance.

In [None]:
colin_table = PrettyTable(['Variable', 'VIF'])
for _ in range(X_train.shape[1]):
    colin_table.add_row([f"X{_+1}", f"{variance_inflation_factor(X_train, _):.4f}"])
print(colin_table)
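
To see what the VIF measures, the sketch below computes it from its definition, $VIF_j = 1/\left(1-R_j^2\right)$, where $R_j^2$ comes from regressing feature $j$ on all the other features. The function name `vif` and the data are made up for illustration, and including an intercept in each auxiliary regression is an explicit choice of this sketch. The example builds a feature that is nearly a linear combination of three others, so no single pair is strongly correlated even though the VIFs are large.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X, from auxiliary regressions."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(5)
a, b, c = rng.normal(size=(3, 500))
d = a + b + c + rng.normal(scale=0.1, size=500)   # nearly a + b + c

X_collinear = np.column_stack([a, b, c, d])
vifs = vif(X_collinear)
vifs_independent = vif(np.column_stack([a, b, c]))

corr = np.abs(np.corrcoef(X_collinear, rowvar=False))
max_pairwise = np.max(corr - np.eye(4))           # largest off-diagonal correlation

print("max pairwise |correlation|:", round(max_pairwise, 3))
print("VIF with d included:", np.round(vifs, 1))
print("VIF without d:      ", np.round(vifs_independent, 2))
```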

### Simulated test set
Finally, let's examine how well the linear model does on the held-out test set. We will use the result, `res`, of the fitted model to predict the response, $y$, on the test set. We first examine the $R^2$ metric on the test set, and then plot the residuals as a function of the actual values. Here we also plot bounds on the residuals derived from the 95% intervals returned by `conf_int`. __These bounds are optimistic: with its default arguments, `conf_int` returns the confidence interval for the mean prediction, which reflects only the uncertainty in the coefficients estimated from the training data and not the noise in individual observations.__ As we can see, many of the residuals fall outside the bounds.

__Note__: The `res.get_prediction` method returns a [PredictionResults](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.PredictionResults.html) object that has the method `conf_int`. Called as `conf_int(alpha=0.05)`, it gives the confidence interval for the mean prediction; passing `obs=True` gives the wider _prediction interval_ for a new observation.

In [None]:
# How does the model do on the test set?
if bias != 0:
    y_pred = res.get_prediction(sm.add_constant(X_test, prepend=True))
else:
    y_pred = res.get_prediction(X_test)

residual = y_test - y_pred.predicted
y_test_sorted = np.sort(y_test)
residual_sorted = residual[np.argsort(y_test)]
y_pred_sorted = y_pred.predicted[np.argsort(y_test)]
ci = y_pred.conf_int(alpha=0.05)
ci_sorted = ci[np.argsort(y_test)]

print("Simulated Test Set R2: ", r2_score(y_test, y_pred.predicted))
plt.scatter(y_test_sorted, residual_sorted)
plt.plot(y_test_sorted, y_pred_sorted-ci_sorted[:,1], 'r--')
plt.plot(y_test_sorted, y_pred_sorted-ci_sorted[:,0], 'r--')
plt.grid()
plt.xlabel('Actual y')
plt.ylabel('Residual')
plt.legend(['Residual', '(95% CI Residuals)']);
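
The difference between an interval for the mean prediction and an interval for a new observation can be checked numerically. The following is a minimal NumPy sketch on made-up single-feature data. It uses the normal quantile 1.96 in place of the exact t quantile, which is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
s2 = resid @ resid / (n - 2)                      # estimated noise variance
XtX_inv = np.linalg.inv(A.T @ A)

# Fresh observations from the same process, standing in for a test set
x_new = rng.normal(size=1000)
y_new = 1.0 + 2.0 * x_new + rng.normal(scale=1.0, size=1000)
A_new = np.column_stack([np.ones(1000), x_new])
pred = A_new @ beta

h = np.einsum("ij,jk,ik->i", A_new, XtX_inv, A_new)   # x0^T (A^T A)^{-1} x0 per row
se_mean = np.sqrt(s2 * h)                             # uncertainty of the fitted mean only
se_obs = np.sqrt(s2 * (1 + h))                        # adds the noise of a new observation

z = 1.96
cover_mean = np.mean(np.abs(y_new - pred) <= z * se_mean)
cover_obs = np.mean(np.abs(y_new - pred) <= z * se_obs)

print(f"fraction of new points inside the mean-prediction interval: {cover_mean:.2f}")
print(f"fraction of new points inside the prediction interval:      {cover_obs:.2f}")
```

Only the interval that includes the observation noise covers close to 95% of new points, which is why residual bounds built from the mean-prediction interval look too tight.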

# OLS Multilinear Model for Infrared Thermography Temperature Data
Now, let's apply the OLS multilinear model analysis to the [Infrared Thermography Temperature](https://archive.ics.uci.edu/dataset/925/infrared+thermography+temperature+dataset) dataset. Our goal is to use the features of this data set to predict the oral temperature of the patient.

The predictors for this dataset include demographics, environmental features, and several features derived from thermal images of patient faces (see [Figure 2 in Wang et al.](https://www.semanticscholar.org/paper/Infrared-Thermography-for-Measuring-Elevated-Body-Wang-Zhou/443b9932d295ca3a014e7d874b4bd77a33a276bd/figure/3)).

We explored many of the characteristics of this dataset in practicum 1, which we will not repeat here. However, before developing our model, we will need to:
1. Pull the data from the UCI website
2. Address missing data
3. Address feature cross-correlation
4. Standardize features
5. Convert categorical data to dummy variables

We will address these concerns in more detail in future lectures. For now, we will address missing data by dropping incomplete cases; as we found in practicum 1, there are only two such samples. We will address feature cross-correlation by dropping one feature from any pair that has a correlation coefficient >0.95. Finally, we will standardize the continuous features as we did in practicum 1 and convert the categorical features to dummy variables.
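
The correlation filter used in this practicum is the local `correlation_filter` helper, whose implementation is not shown here. As a hypothetical stand-in, the sketch below drops one feature from each highly correlated pair with a simple greedy rule: a column is kept only if its absolute correlation with every previously kept column is at or below the threshold. The function name and the greedy ordering are assumptions of this example.

```python
import numpy as np

def correlated_columns_to_drop(X, names, threshold=0.95):
    """Greedy filter: keep a column only if it is not highly correlated
    with any column that has already been kept."""
    corr = np.abs(np.corrcoef(X, rowvar=False))   # |Pearson correlation| between columns
    keep, drop = [], []
    for j in range(corr.shape[0]):
        if any(corr[j, i] > threshold for i in keep):
            drop.append(names[j])
        else:
            keep.append(j)
    return drop

rng = np.random.default_rng(7)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.05, size=300)          # nearly a copy of a
c = rng.normal(size=300)                          # unrelated to a and b

to_drop = correlated_columns_to_drop(np.column_stack([a, b, c]), ["a", "b", "c"])
print("features to drop:", to_drop)
```

Whatever rule is used, the list of features to drop should be derived from the training data only and then applied unchanged to the test data, so that no information from the test set influences preprocessing.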

Once these steps are completed, we will complete the following:
1. Fit an OLS multilinear model to the training data using the _oral temperature measured in monitor mode_ as our target variable
2. Assess the potential problems with the model including data nonlinearity and heteroscedasticity, outliers, high leverage points, and collinearity
3. Evaluate performance on the test set

In [None]:
# fetch Infrared Thermography Dataset
# fetch dataset
infrared_thermography_temperature = fetch_ucirepo(id=925)

# data (as pandas dataframes)
# drop incomplete cases to address missing data
X = infrared_thermography_temperature.data.features.dropna()
y = infrared_thermography_temperature.data.targets.loc[X.index]['aveOralM'] # using the oral temp in monitor mode, optionally, we could instead use fast mode

# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=rs)

#correlated_features_to_drop = correlation_filter(X_train, threshold=0.95)
meta_vars = infrared_thermography_temperature.variables
continuous_vars, _ = get_vars_of_type(X_train, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Continuous')
features_to_drop = correlation_filter(X_train[continuous_vars], threshold=0.95)
# reassign rather than drop in place, since X_train and X_test are slices of X
X_train = X_train.drop(columns=features_to_drop)
X_test = X_test.drop(columns=features_to_drop)

# standardize the continuous features
continuous_vars, X_train_continuous = get_vars_of_type(X_train, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Continuous')
scaler = preprocessing.StandardScaler().fit(X_train_continuous)
X_train_continuous_scaled = scaler.transform(X_train_continuous)
# note we use the same scaler for the test data to prevent data leakage
continuous_vars, X_test_continuous = get_vars_of_type(X_test, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Continuous')
X_test_continuous_scaled = scaler.transform(X_test_continuous)

# create dummy variables for categorical features
categorical_vars, X_train_categorical = get_vars_of_type(X_train, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Categorical')
X_train_categorical_dummy = pd.get_dummies(X_train_categorical, columns=categorical_vars,drop_first=True, dtype=int)
categorical_vars, X_test_categorical = get_vars_of_type(X_test, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Categorical')
X_test_categorical_dummy = pd.get_dummies(X_test_categorical, columns=categorical_vars,drop_first=True, dtype=int)
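# align the test dummy columns with the training columns:
# levels missing from the test set become 0, levels unseen in training are dropped
X_test_categorical_dummy = X_test_categorical_dummy.reindex(columns=X_train_categorical_dummy.columns, fill_value=0)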

# combine the continuous and categorical features
X_train_new = pd.concat([pd.DataFrame(X_train_continuous_scaled, columns=X_train_continuous.columns), X_train_categorical_dummy.reset_index(drop=True)], axis=1)
X_test_new = pd.concat([pd.DataFrame(X_test_continuous_scaled, columns=X_train_continuous.columns), X_test_categorical_dummy.reset_index(drop=True)], axis=1)
y_train_new = y_train.reset_index(drop=True)
y_test_new = y_test.reset_index(drop=True)

## Fit the OLS multilinear model
<span style ="color:dodgerblue">
<h3>Problem 1 (1 point):</h3>
In the code cell below, (1) use the statsmodels OLS class to fit a linear regression model to the Infrared Thermography Temperature problem using the X_train_new and y_train_new variables; and (2) print the results summary. </span>

In [None]:
#### YOUR CODE HERE ####

## Analyze Potential Issues
### Data non-linearity and heteroscedasticity

<span style ="color:dodgerblue">
<h3>Problem 2 (1 point):</h3>
In the code cell below, plot the linear model residual values (y-axis) as a function of the fitted values (x-axis), including the outer quantiles of the residuals, using the plot_fitted_resids function. </span>

In [None]:
#### YOUR CODE HERE ####

<span style ="color:dodgerblue">
<h3>Problem 3 (1 point):</h3>
Based on the plot of the fitted values vs. the residuals, what is your interpretation relative to the presence of data non-linearity and heteroscedasticity? Note that the <em>Durbin-Watson</em> score in the model summary tests for autocorrelation of the residuals (values near 2 suggest none) rather than heteroscedasticity, so base your assessment on the plot.
<br><br>ENTER YOUR RESPONSE IN THE MARKDOWN CELL BELOW
</span>

#### Enter Your Response Here ####

### Outliers
Here, we examine the results of the Infrared Thermography Temperature linear regression model for outliers. As we can see in the plot below, while outliers are present, they are relatively few in number.

In [None]:
influence = res.get_influence()
plot_outliers(res.fittedvalues, influence.resid_studentized)

### High Leverage Points
Here, we examine the results of Infrared Thermography Temperature data for points with high leverage. As we can see in the plot below, there are many points with above average leverage, though most are near the average. However, there are 42 points (5.9%) with a leverage that is more than twice the average. In a refined model, we may consider dropping these points from the evaluation to improve the results.

In [None]:
influence = res.get_influence()
leverage = influence.hat_matrix_diag
plot_leverage(influence.resid_studentized, leverage, (res.df_model+1)/res.nobs)

threshold = 2*(res.df_model+1)/res.nobs
num_pts = np.sum(leverage>threshold)
print(f"Number (%) of points with leverage > {threshold:.2f}: {num_pts} ({100*num_pts/leverage.shape[0]:.2f})")

### Collinearity
Here, we examine the variance inflation factor (VIF) of each feature of the Infrared Thermography Temperature data. Not surprisingly, this real-world data has variables with $VIF \gg 10$, suggesting those variables are collinear with at least one additional variable. In a refined model, we would want to implement some feature selection method (either a wrapper method or an embedded regularization approach) to address the collinearity and potentially improve model performance.

In [None]:
colin_table = PrettyTable(['Variable', 'VIF'])
for _ in range(X_train_new.shape[1]): # iterate over the columns of the design matrix actually used in the model
    colin_table.add_row([f"{X_train_new.columns[_]}", f"{variance_inflation_factor(X_train_new, _):.4f}"])
print(colin_table)

## Test data evaluation
Let's examine how the model performed on the test set. As a measure of performance, we will first calculate the mean absolute error (MAE) on the test set,
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left| y_i -\hat{y}_i \right|$$

which will give us a sense of how far off the predicted values are from the actual values.

<span style ="color:dodgerblue">
<h3>Problem 4 (1 point):</h3>
In the code cell below, the model predictions on the test set are stored in the variable y_pred. Use the predictions to compute the MAE based on the true temperature results in y_test_new
</span>

In [None]:
y_pred = res.get_prediction(sm.add_constant(X_test_new, prepend=True)).predicted

#### YOUR CODE HERE ####



Based on the MAE, the model seems to have done reasonably well predicting on the test set. Let's investigate test set performance in some more detail. In the code cell below, we will print the $R^2$ for the test set and plot the residuals against the actual temperature values.

In [None]:
# How does the model do on the test set?
y_pred = res.get_prediction(sm.add_constant(X_test_new, prepend=True))

residual = y_test_new - y_pred.predicted
y_test_sorted = np.sort(y_test_new)
residual_sorted = residual[np.argsort(y_test_new)]
y_pred_sorted = y_pred.predicted[np.argsort(y_test_new)]
ci = y_pred.conf_int(alpha=0.05)
ci_sorted = ci[np.argsort(y_test_new)]

print("Infrared Thermography Temperature Test Set R2: ", r2_score(y_test_new, y_pred.predicted))
plt.scatter(y_test_sorted, residual_sorted)
plt.plot(y_test_sorted, y_pred_sorted-ci_sorted[:,1], 'r--')
plt.plot(y_test_sorted, y_pred_sorted-ci_sorted[:,0], 'r--')
plt.grid()
plt.xlabel('Actual y')
plt.ylabel('Residual')
plt.legend(['Residual', '(95% CI Residuals)'])

# Tree-based Regression Models
## Simulated data model
Let's now investigate a tree-based model. We will use the [CatBoost.ai](https://catboost.ai/) library. CatBoost is a relatively recent library initially released in 2017. Among other capabilities, the CatBoost library implements a computationally efficient gradient boosting tree algorithm that can be used to create regression and classification boosted trees. For more details on the implementation see the original [CatBoost paper](https://arxiv.org/abs/1706.09516).

First, let's recreate our simulated data.

In [None]:
# generate simulated regression data
X, y, coefs = make_regression(n_samples=n, bias = bias,
                              n_features=8, n_informative=6, effective_rank=5, tail_strength=0.1,
                              n_targets=1,
                              noise = 2.5,
                              coef=True, random_state=rs, shuffle=True)
# split into train and test
X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]

CatBoost uses `Pool` objects as a construct for datasets. We'll see, when working with the Infrared Thermography Temperature data, that a convenience of a `Pool` is that we can tell CatBoost which variables are categorical and forgo creating dummy variables. Note also that when we call `model.fit` below, we set `plot=True`. Provided the [Jupyter Widgets](https://ipywidgets.readthedocs.io/en/stable/) package is installed, this will let CatBoost generate plots automatically. Here, we see the value of the loss function at each iteration of training.

In [None]:
train_dataset = cb.Pool(X_train, y_train)
test_dataset = cb.Pool(X_test, y_test)
model = cb.CatBoostRegressor(loss_function='RMSE', num_trees=1000, random_seed=rs)
model.fit(train_dataset, eval_set=test_dataset, verbose=False, plot=True)
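
CatBoost's own algorithm is considerably more elaborate than what fits in a short example (see the paper linked above). To illustrate only the generic idea of gradient boosting, the sketch below fits depth-1 trees (stumps) to the current residuals under squared-error loss, using NumPy and made-up non-linear data. The candidate cutpoints, learning rate, and number of rounds are arbitrary choices for this example, and the helper names are hypothetical.

```python
import numpy as np

def fit_stump(x, r):
    """Find the single-feature split that minimizes squared error on r."""
    best = None
    for j in range(x.shape[1]):
        for cut in np.quantile(x[:, j], np.linspace(0.1, 0.9, 9)):
            left = x[:, j] <= cut
            if left.all() or not left.any():
                continue                              # skip degenerate splits
            pl, pr = r[left].mean(), r[~left].mean()  # leaf predictions
            sse = np.sum((r[left] - pl) ** 2) + np.sum((r[~left] - pr) ** 2)
            if best is None or sse < best[0]:
                best = (sse, j, cut, pl, pr)
    return best[1:]

def predict_stump(stump, x):
    j, cut, pl, pr = stump
    return np.where(x[:, j] <= cut, pl, pr)

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 3))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

learning_rate, n_rounds = 0.1, 200
pred = np.full_like(y, y.mean())                  # start from the mean of y
stumps, train_rmse = [], []
for _ in range(n_rounds):
    resid = y - pred                              # negative gradient of 0.5 * (y - pred)^2
    stump = fit_stump(X, resid)
    pred = pred + learning_rate * predict_stump(stump, X)
    stumps.append(stump)
    train_rmse.append(np.sqrt(np.mean((y - pred) ** 2)))

print(f"training RMSE after 1 round:    {train_rmse[0]:.3f}")
print(f"training RMSE after {n_rounds} rounds: {train_rmse[-1]:.3f}")
```

Each round nudges the prediction toward the residuals of the previous rounds, so the training loss decreases step by step, which is the kind of curve the `plot=True` option displays.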

Let's examine performance on the test set for the simulated data. As with the multilinear model, we consider the test set $R^2$ and plot the residuals against the actual values. Interestingly, the boosted tree model performs slightly worse in terms of $R^2$ than the multilinear model. This may not be surprising, given that the simulated data was created to have a linear relationship between the outcome and the predictors.

In [None]:
y_pred = model.predict(X_test)

residual = y_test - y_pred
y_test_sorted = np.sort(y_test)
residual_sorted = residual[np.argsort(y_test)]
y_pred_sorted = y_pred[np.argsort(y_test)]

print("Simulated Test Set R2: ", r2_score(y_test, y_pred))
plt.scatter(y_test_sorted, residual_sorted)

plt.grid()
plt.xlabel('Actual y')
plt.ylabel('Residual')
plt.legend(['Residual']) # only the residuals are plotted here, no interval bounds

## CatBoost Regressor on Infrared Thermography Temperature Data
Now, let's apply the gradient boosted tree model analysis to the [Infrared Thermography Temperature](https://archive.ics.uci.edu/dataset/925/infrared+thermography+temperature+dataset) dataset. We will first reload the dataset. Notice two important items here:
1. Because we are using decision trees we do not need to standardize the variables. Recall, there are no weights applied to the inputs, rather only cutpoints need to be selected.
2. CatBoost can automatically handle categorical variables as long as we provide it the names (or column indices) of the categorical variables for the predictors.

As with the simulated data, we use the CatBoost `Pool` class to construct training and testing pools. Notice that in this case, we pass the names of the categorical features to the object instantiation.
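
The first point above, that trees only select cutpoints, can be checked with a small experiment. The sketch below finds the best single split of one made-up feature before and after standardization and compares which samples fall on each side. The helper `best_split_mask` is hypothetical and uses an exhaustive search over midpoints, which is a generic regression-tree split rule rather than CatBoost's implementation.

```python
import numpy as np

def best_split_mask(x, y):
    """Return a boolean mask of the samples on the left side of the
    squared-error-optimal split of feature x."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_sse, best_cut = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                               # no cutpoint between equal values
        left, right = ys[:i], ys[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_sse, best_cut = sse, 0.5 * (xs[i - 1] + xs[i])
    return x <= best_cut

rng = np.random.default_rng(9)
x = rng.uniform(30, 40, size=200)                  # made-up, temperature-like feature
y = np.where(x > 36, 1.0, 0.0) + rng.normal(scale=0.1, size=200)

x_std = (x - x.mean()) / x.std()                   # standardized copy of the feature
mask_raw = best_split_mask(x, y)
mask_std = best_split_mask(x_std, y)

print("same partition before and after scaling:", np.array_equal(mask_raw, mask_std))
```

Standardization preserves the ordering of the feature values, so the same samples end up on each side of the split and the tree's predictions are unchanged.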

In [None]:
# fetch Infrared Thermography Dataset
# fetch dataset
infrared_thermography_temperature = fetch_ucirepo(id=925)

# data (as pandas dataframes)
# drop incomplete cases to address missing data
X = infrared_thermography_temperature.data.features.dropna()
y = infrared_thermography_temperature.data.targets.loc[X.index]['aveOralM'] # using the oral temp in monitor mode, optionally, we could instead use fast mode

# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=rs)

# get the names of the categorical features (no dummy variables are needed for CatBoost)
categorical_vars, X_train_categorical = get_vars_of_type(X, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Categorical')

# create the catboost datasets
train_dataset = cb.Pool(X_train, y_train, cat_features=categorical_vars)
test_dataset = cb.Pool(X_test, y_test, cat_features=categorical_vars)

<span style ="color:dodgerblue">
<h3>Problem 5 (1 point):</h3>
In the code cell below, create a model using the CatBoostRegressor class and fit the model to the Infrared Thermography Dataset created in the cell above.
</span>

In [None]:
#### YOUR CODE HERE ####

## Test data evaluation
Let's examine how the model performed on the test set. As a measure of performance, we will calculate the mean absolute error (MAE) on the test set so we can compare to the multilinear model. Recall,
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left| y_i -\hat{y}_i \right|$$

<span style ="color:dodgerblue">
<h3>Problem 6 (1 point):</h3>
In the code cell below, the model predictions on the test set are stored in the variable y_pred. Use the predictions to compute the MAE based on the true temperature results in y_test
</span>

In [None]:
y_pred = model.predict(X_test)

#### YOUR CODE HERE ####

We should see that the MAE for the boosted tree model is marginally better than that of the multilinear model. We can also examine the test set $R^2$ measure as done in the code cell below. We see that it is also marginally higher than that of the multilinear model. We've also plotted the residual values against the actual test values. For now, it appears that the boosted tree model performs about the same on the test set as the multilinear model. However, we have not done anything to address collinearity in the multilinear model and we have not performed hyperparameter tuning in the boosted tree model, so these results should be taken with a measure of caution.

In [None]:
y_pred = model.predict(X_test)

residual = y_test - y_pred
residual.reset_index(drop=True, inplace=True)
y_test_sorted = np.sort(y_test)
residual_sorted = residual[np.argsort(y_test)]
y_pred_sorted = y_pred[np.argsort(y_test)]

print("Infrared Thermography Temperature Test Set R2: ", r2_score(y_test, y_pred))
plt.scatter(y_test_sorted, residual_sorted)

plt.grid()
plt.xlabel('Actual y')
plt.ylabel('Residual')
plt.legend(['Residual']) # only the residuals are plotted here, no interval bounds