# Section 6. Supervised ML Walkthrough

#### Instructor: Pierre Biscaye 

In this notebook, we're going to execute a machine learning project from start to finish. We'll use techniques covered in the previous section to facilitate this process, but we'll also introduce some new concepts. The goal is to demonstrate a basic machine learning pipeline.

### Required packages
* pandas
* numpy
* matplotlib
* scikit-learn
* seaborn
* xgboost

### Required data
* heart_2020_cleaned_sample.csv

## Overview of Pipeline/Sections

We'll take the following steps to develop our machine learning models:

1. Introduce Dataset and Objectives
2. Exploratory Data Analysis, Feature Engineering, and Preprocessing
    - Produce several plots to give us a better understanding of the data.
    - Split the sample.
    - Perform feature engineering and preprocessing.
    - Use the pipeline tool for speedy and reproducible preprocessing.
3. Modeling Process
    - Train three different models: Logistic Regression, Decision Trees, and Random Forest.
    - Learn about and apply the grid search method for choosing hyperparameters.
4. Evaluation and Interpretation
    - Evaluate the models using a variety of metrics.
    - Discuss how successful we were with our modeling.
5. Bonus: XGBoost

## 1. Introduce Dataset and Objectives

We are going to be using a Kaggle dataset called ["Personal Key Indicators of Heart Disease"](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease).

This dataset consists of 2020 annual CDC survey data from 400,000 adults related to their health status with regard to heart disease.

Below, we provide an edited description of the data, taken from the Kaggle data description.

### What topic does the dataset cover?

According to the CDC, heart disease is one of the leading causes of death for people of most races in the USA (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), not getting enough physical activity or drinking too much alcohol. 

Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Computational developments allow the application of machine learning methods to detect "patterns" from the data that can predict a patient's condition.

### Where did the dataset come from?

Originally, the dataset came from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. The dataset we are using includes data from 2020. The original dataset consists of 401,958 rows and 279 columns.

The vast majority of columns are questions asked to respondents about their health status, such as "Do you have serious difficulty walking or climbing stairs?" or "Have you smoked at least 100 cigarettes in your entire life?".

The dataset available for this course includes just a small set of the most relevant variables for heart disease risk (from a theoretical/medical perspective). The original dataset of nearly 300 variables was reduced to just about 20 variables, for computational tractability for this course. It has also undergone some basic cleaning so that it would be usable for machine learning projects, such as dealing with missing data.

### What can you do with this dataset?

This dataset can be used to apply a range of machine learning methods, most notably classifier models since the target variable "HeartDisease" is binary ("Yes" - respondent had heart disease; "No" - respondent had no heart disease). 

Note that the classes are not balanced: respondents with heart disease are the minority class. Fitting a classifier without accounting for this is not advisable; adjusting class weights or using stratified sampling should yield significantly better results. 
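
One common way to account for imbalance is to weight classes inversely to their frequency. The sketch below uses a made-up label vector (`y_toy`, 90 negatives and 10 positives), not the course data, to show the weights that `scikit-learn`'s `"balanced"` heuristic would assign. Treat it as an illustration rather than the approach taken later in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * class_count)
classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

Many `scikit-learn` classifiers, including `LogisticRegression`, `DecisionTreeClassifier`, and `RandomForestClassifier`, accept `class_weight="balanced"` to apply the same weighting during fitting.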

### Data Dictionary

The features available in the dataset are shown in the table below. The first variable, **HeartDisease**, is the target variable. We aim to predict whether **HeartDisease** is true or false. 

| Feature     | Description |
| ----------- | ----------- |
| **HeartDisease**       | Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)    |
| **BMI**   | Body Mass Index (BMI)        |
| **Smoking** | Have you smoked at least 100 cigarettes in your entire life? |
| **AlcoholDrinking** | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) |
| **Stroke** | Ever had a stroke? |
| **PhysicalHealth** | Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 was your physical health not good. |
| **MentalHealth** | Thinking about your mental health, for how many days during the past 30 days was your mental health not good? |
| **DiffWalking** | Do you have serious difficulty walking or climbing stairs? |
| **Sex** | Sex Assigned at Birth | 
| **AgeCategory** |  Fourteen-level age category |
| **Race** | Race and ethnicity |
| **Diabetic** | Have you ever had diabetes? |
| **PhysicalActivity** | Adults who reported doing physical activity or exercise during the past 30 days other than their regular job |
| **GenHealth** | Would you say that in general your health is...|
| **SleepTime** | On average, how many hours of sleep do you get in a 24-hour period?|
| **Asthma** | Have you ever had asthma?|
| **KidneyDisease** | Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? |
| **SkinCancer** | Have you ever had skin cancer? |

### What is our objective?

Our objective is to use a variety of demographic, health, and behavioral data to predict if the patients in this dataset have or ever had heart disease.

### Import the Dataset

We'll begin by importing needed libraries and functions, importing the dataset, and taking a look at the columns.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(font_scale=1.5)

# Import functions from scikit-learn
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score,
                             confusion_matrix,
                             classification_report,
                             f1_score,
                             precision_score,
                             recall_score)
from sklearn.model_selection import (cross_val_score,
                                     cross_val_predict,
                                     StratifiedKFold,
                                     GridSearchCV,
                                     RandomizedSearchCV,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (LabelEncoder,
                                   OneHotEncoder,
                                   StandardScaler)
from sklearn.tree import DecisionTreeClassifier

We'll use `pandas` to import the dataset. Be sure to use the correct file path:

In [None]:
df = pd.read_csv("Data/heart_2020_cleaned_sample.csv")
df.head()

We have 13 categorical features (though many are binary), 4 continuous features, and one categorical (binary) target variable.

The preprocessing of the dataset before it was posted included dealing with missing values, so we will have to accept whatever method was used in order to work with these data. If we had missing values, we would have to make sure that all are coded as NaN and decide what imputation (if any) to apply to missing values of continuous variables. 

In [None]:
# How many null values per column?
df.isnull().sum()
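
Our sample has already been cleaned, so no imputation is needed here. For reference, the following is a minimal sketch of median imputation on a small made-up data frame (`toy` is hypothetical and only borrows two column names from the dataset), assuming missing values are already coded as NaN.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values (not the course dataset)
toy = pd.DataFrame({"BMI": [22.0, np.nan, 31.0, 27.0],
                    "SleepTime": [7.0, 8.0, np.nan, 6.0]})

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(toy_imputed)
```

In a real project the imputer would be fit on the training split only, for the same reason as the other preprocessing steps discussed later in this notebook.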

## 2. Exploratory Data Analysis and Feature Engineering

Before we jump into modeling, it's important to get to know our data. This will help motivate the features we use, any additional features we construct, and how we perform preprocessing.

### Exploratory Data Analysis

Let's first get a sense of the distributions of the variables in the dataset. We'll start with the numerical data. Let's plot histograms of these features. Notice that, in some plots, we use a log-scale for the $y$-axis. Try turning it off to see how the distribution looks.

In [None]:
# Grab the numeric features
df_numeric = df.select_dtypes("number")
numeric_features = df_numeric.columns
df_numeric.head()

In [None]:
# Plot BMI
sns.histplot(data=df_numeric, x='BMI', bins=50)
plt.show()

In [None]:
# Plot PhysicalHealth
sns.histplot(data=df_numeric, x='PhysicalHealth', bins=10)
plt.yscale('log')
plt.show()

In [None]:
# Plot MentalHealth
sns.histplot(data=df_numeric, x='MentalHealth', bins=10)
plt.yscale('log')
plt.show()

In [None]:
# Plot SleepTime
sns.histplot(data=df_numeric, x='SleepTime', bins=20)
plt.yscale('log')
plt.show()

In [None]:
# Correlation matrix
corr = df_numeric.corr()
corr

In [None]:
# Plotting correlations as a heatmap

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(8,8))

# Plot a heatmap using seaborn
# Include the mask and correct aspect ratio, and a diverging colormap
sns.heatmap(corr, mask=mask, cmap='RdBu', vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Next, let's look at the categorical features.

In [None]:
df_categorical = df.select_dtypes(exclude=['number'])
categorical_features = df_categorical.columns
df_categorical.head()

What are the number of unique values for each categorical feature?

In [None]:
df_categorical.nunique()

Let's plot the distributions of all the categorical features.

Note that we're doing this in a single set of subplots using `matplotlib`. There's a lot of code here - don't stress too much about the details. Instead, focus on the plot output and what the distributions of the variables look like. What do you notice?

In [None]:
# We choose 3 rows - feel free to adjust
nrows = 3
# Number of columns chosen automatically based on number of features
ncols = categorical_features.size // nrows + 1

# Create subplots using matplotlib
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(nrows * 9, ncols * 2.5))
# Adjust subplot spacing
plt.subplots_adjust(hspace=0.75)

# Iterate over categorical features
for idx, feature in enumerate(categorical_features):
    # Choose axis for features
    ax = axes[idx // ncols, idx % ncols]
    # Calculate proportions and plot bars
    df_categorical[feature].value_counts(normalize=True).sort_index().plot(
        kind='bar',
        ax=ax)
    # Rotate x ticks
    ax.tick_params(axis='x', rotation=0)
    # Set y limits
    ax.set_ylim([0, 1])
    # Create title for plot
    ax.set_title(feature)

# Turn off unused plot
axes[-1, -1].axis(False)

plt.show()

Notes on readability: some of the labels are hard to read. Below is code we could have used to make specific subplots more readable. Copy it in above, before the line to turn off the unused plot, to see how it changes the plots. The code rotates the labels for specific subplots, and also edits certain labels.

In [None]:
# # Adjustments to single plots
# axes[1, 1].set_xticklabels(axes[1, 1].get_xticklabels(), rotation=40, ha='right', fontsize=14)
# cur_xticks = axes[1, 2].get_xticklabels()
# cur_xticks[0] = 'AI/AN'
# axes[1, 2].set_xticklabels(cur_xticks, rotation=40, ha='right')
# axes[1, 1].set_ylim([0, 0.12])
# cur_xticks = axes[1, 3].get_xticklabels()
# cur_xticks[1] = 'Borderline'
# cur_xticks[3] = 'During\nPregnancy'
# axes[1, 3].set_xticklabels(cur_xticks, rotation=40, ha='right')
# axes[2, 0].set_xticklabels(axes[2, 0].get_xticklabels(), rotation=40, ha='right')

Now let's see how these features correlate with the target variable.

For the continuous variables, for example, we can examine the distribution of each feature separately for the samples where the patient had heart disease and the samples where the patient did not have heart disease.

We can use a `seaborn` histogram with a `hue` argument to compare this directly. Pay attention to what arguments are passed into the function. What do these plots tell you?

In [None]:
sns.histplot(data=df, x='BMI', hue='HeartDisease', stat='density', bins=50, common_norm=False)
plt.xlim([10, 60])
plt.show()

In [None]:
sns.histplot(data=df, x='PhysicalHealth', hue='HeartDisease', stat='density', bins=10, common_norm=False)
plt.show()

In [None]:
sns.histplot(data=df, x='MentalHealth', hue='HeartDisease', stat='density', bins=10, common_norm=False)
plt.show()

In [None]:
sns.histplot(data=df, x='SleepTime', hue='HeartDisease', stat='density', bins=20, common_norm=False)
plt.xlim([0, 15])
plt.show()

Now, for the categorical data, we'll plot the average `HeartDisease` rate by each unique value of the variable. For example, consider how heart disease varies with smoking. Let's convert the heart disease feature into a binary label to make this easy:

In [None]:
# Create binary variable for heart disease
df_categorical['HeartDiseaseBinary'] = np.where(df_categorical['HeartDisease'] == 'Yes', 1, 0)

Now, we group the samples by smoking, and calculate the heart disease rate for each group by averaging across the `HeartDiseaseBinary` variable. We can then visualize the rates as a horizontal bar plot:

In [None]:
df_categorical.groupby("Smoking")['HeartDiseaseBinary'].mean().plot(kind = "barh")

Let's do this same procedure for all categorical features. Once again, don't worry too much about the code. Focus instead on what the data is telling you. What correlates with heart disease?

In [None]:
# We choose 3 rows - feel free to adjust
nrows = 3
# Number of columns chosen automatically based on number of features
categorical_predictors = [feature for feature in df_categorical.columns
                          if 'HeartDisease' not in feature]
ncols = len(categorical_predictors) // 3 + 1

# Create subplots using matplotlib
fig, axes = plt.subplots(nrows=3, ncols=ncols, figsize=(nrows * 9, ncols * 2.5))
# Adjust subplot spacing
plt.subplots_adjust(hspace=0.75)

# Iterate over categorical features
for idx, feature in enumerate(categorical_predictors):
    # Make sure we skip over the heart disease features
    if 'HeartDisease' not in feature:
        # Choose axis for features
        ax = axes[idx // ncols, idx % ncols]
        # Calculate proportions and plot bars
        df_categorical.groupby(feature)['HeartDiseaseBinary'].mean().sort_index().plot(kind='bar', ax=ax)
        # Remove x label
        ax.set_xlabel('')
        # Rotate x ticks
        ax.tick_params(axis='x', rotation=0)
        # Create title for plot
        ax.set_title(feature)

# Adjustments to single plots
axes[1, 0].set_xticklabels(axes[1, 0].get_xticklabels(), rotation=40, ha='right', fontsize=14)
cur_xticks = axes[1, 1].get_xticklabels()
cur_xticks[0] = 'AI/AN'
axes[1, 1].set_xticklabels(cur_xticks, rotation=40, ha='right')
cur_xticks = axes[1, 2].get_xticklabels()
cur_xticks[1] = 'Borderline'
cur_xticks[3] = 'During\nPregnancy'
axes[1, 2].set_xticklabels(cur_xticks, rotation=40, ha='right')
axes[1, 4].set_xticklabels(axes[1, 4].get_xticklabels(), rotation=40, ha='right')
# Turn off unused plots
axes[-1, -2].axis(False)
axes[-1, -1].axis(False)

plt.show()

### Feature Engineering and Preprocessing

The exploratory data analysis (EDA) suggests nearly all the features correlate with the target variable; this should not be surprising as they were specifically chosen from a larger dataset because they might be relevant for predicting heart disease. This would not be true in other cases, when EDA could help you sort through a larger set of features to identify a possible subset for modeling. 

Of course, an alternative approach is to include all features in the initial models and include penalties (such as in the lasso algorithm) to guide which features to ultimately include. But even in that situation, understanding relationships in the data is useful.

One reason it can still be useful is to help indicate the nature of the relationships between a feature and the target. Might it be quadratic or some higher order polynomial? Might it vary in some way across the distribution of the feature? Might it vary based on values of another feature? Such findings can inform your feature engineering.

Feature engineering is at the heart of machine learning, and there's no single way to do it. **Feature engineering** is the process of constructing new features that we think might be informative about the target variable. It could mean taking categorical data and one-hot encoding it, creating interaction terms, and preprocessing. These are all steps we take *prior* to fitting a model in order to make the data more suitable for prediction.

We're going to do limited feature engineering in the interest of time. But you should always think about what useful features may exist/can be constructed while working with data. Specifically, we will:

- Adjust the age feature,
- Label encode the target variable,
- Scale the numerical features,
- One-hot encode the categorical data.

First, we'll adjust the age variable into a pseudo-continuous variable. We'll do this because the age category feature has 13 unique values, which is quite a lot for a categorical variable. Furthermore, age has an ordered structure (increasing 5 year bins), which we lose when we use the categorical formulation. What we'll do is replace each age value with the lower limit of the age range. We lose some information this way, but sometimes there's a cost to preprocessing.

We will also include a quadratic age term - this is common in many models and seems relevant based on the plots above.

In [None]:
# Unique age category values
df["AgeCategory"].value_counts().sort_index()

In [None]:
# Create "Age" column by taking the left number (remember, it's a string) and converting it to a float
df['Age'] = df['AgeCategory'].str[:2].astype(float)
df['Age'].value_counts().sort_index()

In [None]:
# Create age squared
df['Agesq'] = df['Age']**2

Now, let's remove the age category as well as the heart disease column to obtain a "design matrix". We'll also extract the heart disease column into its own dependent variable:

In [None]:
X = df.drop(["HeartDisease", "AgeCategory"], axis=1).copy()
y = df["HeartDisease"]

Now, before we scale and one hot encode data, let's first split it into training and validation datasets. 

Why do we do this? Preprocessing steps should be fit on the training set only and then applied to the test (or validation) set, so that no information from the held-out data leaks into training. We'll use `sklearn`'s `train_test_split` function to perform the split, **stratifying** by values of HeartDisease because the sample is not balanced.

**Question:** What does this stratification ensure?

In [None]:
X_train_raw, X_valid_raw, y_train_raw, y_valid_raw = train_test_split(X, y, 
                                                                      test_size=0.25, 
                                                                      random_state=212, 
                                                                      stratify=y)
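
One way to explore the question above is to compare class proportions across the two splits. The sketch below does this on made-up labels with a 10% positive rate (`X_toy` and `y_toy` are hypothetical stand-ins, not the course data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with a 10% positive rate
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

# Positive rate in each split
print(y_tr.mean(), y_va.mean())
```

Try removing `stratify=y_toy` and rerunning with a few different values of `random_state` to see how much the two rates can drift apart.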

Before, we converted the heart disease feature into a binary feature using a `numpy` function. Now, we'll do it again, using a `scikit-learn` function that does the same thing. The benefit to using the `LabelEncoder` is that `scikit-learn` does all the work for us. It will also give us an object that can be applied to new data. For example, we can fit the `LabelEncoder` to the training data, and apply it to the validation data:

In [None]:
# Initialize label encoder
labeler = LabelEncoder()
# Fit and transform the target variable from the training set
y_train = labeler.fit_transform(y_train_raw)
# Transform the validation target variable
y_valid = labeler.transform(y_valid_raw)

In [None]:
# What classes did we obtain?
print(labeler.classes_)
# Confidence check: does it work?
print(labeler.transform(["No", "Yes"]))

Next, to perform additional preprocessing, we're going to create a `scikit-learn` `Pipeline`. The **Pipeline** allows us to compose multiple steps into a single object, which we can then fit and apply to multiple datasets. Let's take it one step at a time.

In [None]:
# Collect all features
feature_cols = X_train_raw.columns
# Identify numeric features
numeric_cols = X_train_raw.select_dtypes("number").columns.tolist()
# Identify categorical features
categorical_cols = X_train_raw.select_dtypes(exclude="number").columns.tolist()

Every `Pipeline` is composed of steps. Each step is a tuple of two elements: the first tells us the name, and the second tells us what transformation is happening. We can create a `Pipeline` by stitching together steps via a list. We can also create a `Pipeline` by stitching together smaller `Pipeline`s.

The tricky thing about a `Pipeline` is that it applies a transformation to all the data. This won't work in cases with heterogeneous data. For example, we don't want to one-hot encode continuous features, and we don't want to standardize categorical features. So, we need one more tool: the `ColumnTransformer`. 

In [None]:
# Use Pipeline to create a numeric transformer: standard scaling/normalization
numeric_transformer = Pipeline([("scaler", StandardScaler())])

In [None]:
# Use Pipeline to create a categorical transformer: one hot encoding
categorical_transformer = Pipeline(
    [("one_hot_encoder",
      OneHotEncoder(categories='auto', 
                    handle_unknown='error', 
                    sparse_output=False, # sparse_output=False for scikit-learn >= 1.2; else sparse=False
                    drop="first"))
    ])

In [None]:
# Now, we create the overall preprocessor with the ColumnTransformer.
# The ColumnTransformer works much like a Pipeline, and needs steps (i.e., a list). 
# In this case, each step needs a tuple with length 3:
# 1. The name of the step.
# 2. The Pipeline to apply at that step.
# 3. The columns to apply the step to.
preprocessor = ColumnTransformer(transformers=[
    # First step: numeric features
    ("numeric", numeric_transformer, numeric_cols),
    # Second step: categorical features
    ("categorical", categorical_transformer, categorical_cols)
])

Pipelines can also include classifiers (e.g., a `LogisticRegression`). In that case, the output of the pipeline would be a trained model. For now, however, we'll just preprocess the data using the `Pipeline`.
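
As a rough sketch of what a pipeline that ends in a classifier could look like, the example below chains a small preprocessor and a `LogisticRegression` on a made-up data frame. The names `X_toy`, `y_toy`, `toy_preprocessor`, and `toy_clf` are hypothetical, and the transformers are passed directly rather than wrapped in their own `Pipeline`s to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not the course dataset)
X_toy = pd.DataFrame({
    "BMI": [22.0, 35.0, 28.0, 40.0, 24.0, 31.0],
    "Smoking": ["No", "Yes", "No", "Yes", "No", "Yes"],
})
y_toy = [0, 1, 0, 1, 0, 1]

# Scale the numeric column and one-hot encode the categorical column
toy_preprocessor = ColumnTransformer(transformers=[
    ("numeric", StandardScaler(), ["BMI"]),
    ("categorical", OneHotEncoder(drop="first"), ["Smoking"]),
])

# Chain preprocessing and the classifier into one estimator
toy_clf = Pipeline([
    ("preprocessor", toy_preprocessor),
    ("classifier", LogisticRegression()),
])

# fit() learns the preprocessing rules and the model in one call
toy_clf.fit(X_toy, y_toy)
toy_preds = toy_clf.predict(X_toy)
print(toy_preds)
```

A single object like this can be passed to cross-validation tools, so the preprocessing is re-fit inside each training fold.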

In [None]:
# Fit transform the train dataset using the pipeline
X_train = preprocessor.fit_transform(X_train_raw)
# Transform the validation dataset with the rules learned from the training dataset
X_valid = preprocessor.transform(X_valid_raw)

**Question**: Why do we use `fit_transform` for the training data, and only `transform` for the validation data? 

Now, we have our data preprocessed. Notice how easy, clean, and reproducible this process was. This demonstrates the value of using the `Pipeline` to conduct machine learning analyses (and more generally). We can quickly and cleanly transform any new batches of data to conform to the rules established by the training dataset.
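
To make the difference between `fit_transform` and `transform` concrete, here is a small sketch with a `StandardScaler` on made-up numbers (`train_toy` and `valid_toy` are hypothetical). The scaler learns its mean and standard deviation from the training values only, then reuses them on the validation values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_toy = np.array([[0.0], [2.0], [4.0]])  # hypothetical training values
valid_toy = np.array([[2.0], [6.0]])         # hypothetical validation values

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learns mean and std from training data
valid_scaled = scaler.transform(valid_toy)      # reuses the training mean and std

print(scaler.mean_)          # mean learned from the training data only
print(valid_scaled.ravel())  # validation values on the training scale
```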

In [None]:
# View result
print(X_train.shape)
print(X_train)
print(X_train_raw.shape)

Notice that the output of the preprocessor is a `numpy` array with more columns than the original dataframe. This is because the one-hot encoding created some new columns. It'd be nice if we had this data in a data frame, with named columns. So, the last step we'll do is convert this back into a data frame. First, we need to access the new column names, which we can do with the `get_feature_names_out` method:

In [None]:
# For scikit-learn >= 1.0
preprocessor.named_transformers_['categorical']['one_hot_encoder'].get_feature_names_out(categorical_cols).tolist()
# For scikit-learn < 1.0
# preprocessor.named_transformers_['categorical']['one_hot_encoder'].get_feature_names(categorical_cols).tolist()

In [None]:
# Access pipeline data so we can name the one-hot encoded columns
new_categorical_cols = preprocessor.named_transformers_['categorical']['one_hot_encoder'].get_feature_names_out(categorical_cols).tolist()
#new_categorical_cols = preprocessor.named_transformers_['categorical']['one_hot_encoder'].get_feature_names(categorical_cols).tolist()
# Create list of column names - numeric columns don't change
column_names = numeric_cols + new_categorical_cols
# Create dataframes
X_train = pd.DataFrame(data=X_train, columns=column_names)
X_valid = pd.DataFrame(data=X_valid, columns=column_names)
X_train.head()

Voila! We have our finalized dataset and we can move on to building predictive models.

## 3. Modeling Process

Now the fun begins!

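But first: we need to calculate a **baseline accuracy**. This is the fraction of the data that is of the most common class, i.e., the accuracy we would get by always predicting that class. With a binary target coded 0/1, the mean of the outcome variable is the fraction of positives; since "No" (0) is the most common class here, the baseline accuracy is one minus that mean.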

In [None]:
1 - y_train.mean()
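
An equivalent way to get a baseline, sketched here on made-up labels rather than `y_train`, is `scikit-learn`'s `DummyClassifier`, which always predicts the most frequent class. The names `X_toy`, `y_toy`, and `baseline` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most common class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Because the dummy model follows the same estimator interface as a real classifier, it can be dropped into the same cross-validation code for a like-for-like comparison.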

**Question**: What does this baseline accuracy mean in the context of the data? What does it mean in the context of trying to predict heart disease outcomes?

Accuracy is not the only thing we need to be worried about when building machine learning models. A model can be accurate, and still have issues in what samples it classifies correctly, and what samples it makes mistakes on.

**Question**: For example, consider false positives and false negatives. What does each mean in the context of classifying heart disease? If such a model were deployed in real life, which of the two - false positives or false negatives - would be more costly? Which should we be more concerned about?
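
To see where these two error types show up, the sketch below counts them with `confusion_matrix` on a handful of made-up labels and predictions (`y_true_toy` and `y_pred_toy` are hypothetical, with 1 standing for heart disease).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Here the false negatives are the respondents with heart disease whom the model labeled as healthy.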

### First Model: Logistic Regression

The first model we'll try is called logistic regression, which we've used already. As a reminder, logistic regression is a linear model that can be used to predict the probability of a sample falling in a certain class. Thus, it's a common model for classification.

We're going to use the `cross_val_score` function to calculate model performance across folds in the training data. The way we'll cross-validate is via the `StratifiedKFold` cross-validator. You can read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html). 

Compare this to the [documentation](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.KFold.html) for `KFold`. What does `StratifiedKFold` add in?  Why is stratifying cross-validation important, particularly in this context?
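
A small experiment can help with these questions. The sketch below counts the positives that land in each test fold under both cross-validators, using made-up labels that are sorted so all positives sit at the end (`X_toy` and `y_toy` are hypothetical).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical sorted labels: 90 negatives followed by 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))

# Number of positives in each test fold
kfold_pos = [int(y_toy[test_idx].sum())
             for _, test_idx in KFold(5).split(X_toy)]
strat_pos = [int(y_toy[test_idx].sum())
             for _, test_idx in StratifiedKFold(5).split(X_toy, y_toy)]

print("KFold positives per fold:", kfold_pos)
print("StratifiedKFold positives per fold:", strat_pos)
```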

In [None]:
# Make list of performance metric names (scikit-learn scoring strings)
metrics = ['accuracy', 'precision', 'recall', 'f1']
# Choose number of folds
n_folds = 5

In [None]:
# Create cross-validator
skfold = StratifiedKFold(n_folds)
# Create model
model = LogisticRegression()

# Iterate over metrics
for metric in metrics:
    cv_results = cross_val_score(model, X_train, y_train, cv=skfold, scoring=metric)
    print(f"Mean {metric} score is {cv_results.mean().round(3)}")
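
The loop above re-runs cross-validation once per metric. As an alternative sketch, `cross_validate` accepts a list of metrics and computes them all in a single pass. The example uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`; the names `X_toy`, `y_toy`, and `toy_results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical imbalanced data standing in for X_train, y_train
X_toy, y_toy = make_classification(n_samples=500, n_features=5,
                                   weights=[0.8, 0.2], random_state=0)

# One cross-validation run that scores every metric on each fold
scoring = ["accuracy", "precision", "recall", "f1"]
toy_results = cross_validate(LogisticRegression(), X_toy, y_toy,
                             cv=StratifiedKFold(5), scoring=scoring)

for name in scoring:
    print(f"Mean {name} score is {toy_results[f'test_{name}'].mean().round(3)}")
```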

**Questions**:

How does the mean accuracy compare to the baseline accuracy? What does this tell you about the importance of establishing baseline accuracy? Is establishing baseline accuracy likely to be more or less important with more imbalanced target data?

What does this precision score ($TP / (TP+FP)$) mean in this context?

What does this recall score ($TP / (TP+FN)$) mean in this context?

`f1` is the harmonic mean of precision and recall, integrating them both into a single metric.
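
As a quick check of that definition, the sketch below computes precision, recall, and F1 on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) and compares `f1_score` with the harmonic mean computed by hand.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels with TP = 2, FP = 1, FN = 2
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

p = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP) = 2 / 3
r = recall_score(y_true_toy, y_pred_toy)     # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true_toy, y_pred_toy)

# Harmonic mean of precision and recall
f1_manual = 2 * p * r / (p + r)
print(p, r, f1, f1_manual)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by doing well on only one of precision or recall.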

What do you conclude from these performance scores? Note that so far we are only analyzing performance within the training data set, without considering generalization to the test sample yet.

----
### Challenge 1: Ridge Regression

Re-run the above analysis, but use ridge regression instead. In particular, use `RidgeCV` so that you can choose the best regularization penalty. You can adapt the code from notebook 6b to do this.

How does this model perform relative to logistic regression? Relative to baseline accuracy?

----

### Second Model: Decision Trees

Next, let's try using a decision tree. You may recall that decision trees have a wide array of *hyperparameters*, or settings in the model we set before fitting it to the data. These can include the maximum depth, the criterion used for performing a split, etc. When we first fit the decision tree, we used the defaults specified by `scikit-learn` for all of these, only varying `max_depth`. How can we go about *choosing* the best values instead?

In the case of ridge regression, we did a cross-validation procedure to choose the best hyperparameter. When we have many hyperparameters though, we'll need to do cross-validation across all combinations. The approach for this is called **grid search**, which we can use to search across all combinations of hyperparameters to find the best one.

#### Grid Search for Model Selection

Grid search is a brute-force method that executes cross-validation for *all* possible combinations of hyperparameters from a set of hyperparameter ranges.

Let's consider an example. Suppose we have two hyperparameters $A$ and $B$. We don't know what values to choose for them, so we'll use a grid search to identify the best set. Grid search requires we specify hyperparameter ranges. So, let's say hyperparameter $A$ can be either of the two values $(0, 1)$, and hyperparameter $B$ can be either of the two values $(2, 3)$ (in practice, we might choose more values, but we'll use two each for simplicity). 

Grid search forms each combination of hyperparameters, and fits a model for it. We can then use the validation performance to choose the best combination across all hyperparameters. In this case, we'd consider all the following combinations:

- $A = 0$, $B = 2$
- $A = 0$, $B = 3$
- $A = 1$, $B = 2$
- $A = 1$, $B = 3$

and choose the combination that performs the best.
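
The enumeration above can be reproduced with `scikit-learn`'s `ParameterGrid`, which expands a dictionary of value lists into every combination. The parameter names `A` and `B` are just the placeholders from the example, not real model settings.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder hyperparameters from the example above
toy_grid = ParameterGrid({"A": [0, 1], "B": [2, 3]})

# Each element is one combination, stored as a dictionary
combos = list(toy_grid)
for combo in combos:
    print(combo)
```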

We can easily perform this process by using `scikit-learn`'s `GridSearchCV`. Let's take a look at how it works by running it on the `max_depth` and `min_samples_leaf` hyperparameters in a decision tree:

In [None]:
# First, we specify a parameter grid as a dictionary
param_grid = {
    "max_depth": np.arange(2, 20, 2), # why are we starting at 2?
    "min_samples_leaf": np.arange(20, 200, 10) # avoid over-splitting/over-fitting
}

In [None]:
# What is the size of the parameter grid we're tuning on?
param_grid["max_depth"].shape[0] * param_grid["min_samples_leaf"].shape[0]

That's 162 different sets of parameters. With 5-fold cross-validation, each set is fit 5 times, for 810 model fits in total. That's a lot of models to fit!

Next, we pass some information into the `GridSearchCV` object (check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for more details):

In [None]:
grid_dt = GridSearchCV(
    # Specify the model
    estimator=DecisionTreeClassifier(),
    # Specify the hyperparameter grid
    param_grid=param_grid,
    # What metric should we use to select for the best model?
    scoring = "accuracy",
    # How do we generate cross-validation folds?
    cv=skfold) # stratified cross-validation

Finally, we can fit the grid search. Let's use the `%%time` magic command to see how long this process takes (around 1 minute for me). 

Note that `%%time` is a magic command to measure the time of execution for a whole cell, whereas `%time` only measures the next line.

In [None]:
%%time

# Fit the grid search object on the training data
grid_dt.fit(X_train, y_train)

We've got a fitted grid search variable! Let's take a look at what we get with it. First, the best score (on accuracy, as we specified):

In [None]:
grid_dt.best_score_

That's not as good as logistic regression, but a little better than baseline accuracy.

We can also get the best cross-validated parameters:

In [None]:
grid_dt.best_params_

**Question:** Are you surprised at the optimal `max_depth`? What does it imply?

The grid search variable is its own predictor, and we can run it on any set of samples:

In [None]:
grid_dt.predict(X_train)

Let's store the results of the best decision tree estimator for later.

In [None]:
best_dt = grid_dt.best_estimator_

In [None]:
print("Training: ", best_dt.score(X_train, y_train))
print("Validation: ", best_dt.score(X_valid, y_valid))

----
### Challenge 2: Choosing a Different Scoring Metric

Run a new grid search, this time searching over max depth from 2 to 10 by 1 and min samples leaf from 75 to 125 by 5, to see if the optimal parameter is one that was not included in the previous grid search. Did performance improve?

Now run a new grid search using recall as the choice of the scoring metric. What are the best parameters in this case? Are they different from before?

----

### Third Model: Random Forests

So far, our modeling hasn't yielded great results. Let's consider a different model, one that is more commonly used in harder prediction problems.

This model is called the Random Forest. As you might expect from the name, a random forest is a collection of many decision trees. Specifically, it's an **ensemble model**, since it consists of an ensemble of $N$ decision trees. The $N$ trees in the forest can separately make predictions, each of which counts as a vote toward the final prediction. The ensemble prediction, typically decided by majority vote, usually performs better than a single tree alone. This is the machine learning version of a model "greater than the sum of its parts".

![](https://miro.medium.com/v2/resize:fit:640/format:webp/1*i0o8mjFfCn-uD79-F1Cqkw.png)

There are a few important things to note about the random forest:

- Each tree in the forest is not trained on the same data and features. That would be counterproductive, because you would end up with dozens of duplicate trees.
- Instead, each tree is trained on a bootstrapped (sampled with replacement) sample of the data, which contains roughly 2/3 of the unique observations, and considers only a random subset of the features when searching for the best split. This helps reduce the variance of the predictions. *Bagging* (bootstrap aggregating) estimates $N$ trees on bootstrapped samples, with every feature available at every split. Random forests add a second parameter that controls how many features to try when finding the best split. Common rules of thumb for this parameter when there are $p$ features are $\sqrt{p}$ and $p/3$.
- To further decorrelate the trees, pruning trees to prevent overfitting is discouraged - we want each tree to fit its sample well.
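To make the two ingredients above concrete, here is a minimal sketch using only the Python standard library. It draws one bootstrap sample to show that roughly 63% of the rows appear at least once, and then combines some votes by majority. The row count and the `tree_votes` list are made up for illustration.

```python
import random
from collections import Counter

random.seed(0)

# 1. Bootstrapping: sample row indices with replacement
n_rows = 10_000
bootstrap_idx = [random.randrange(n_rows) for _ in range(n_rows)]

# Fraction of distinct rows that made it into the bootstrap sample
unique_fraction = len(set(bootstrap_idx)) / n_rows
print(f"Share of unique rows in one bootstrap sample: {unique_fraction:.3f}")


# 2. Majority voting: each tree casts one vote for a class
def majority_vote(votes):
    """Return the most common class label among the votes."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical predictions from five trees for a single observation
tree_votes = [1, 0, 1, 1, 0]
print("Ensemble prediction:", majority_vote(tree_votes))
```

Because each tree sees a different bootstrap sample, the trees make somewhat different mistakes, and the vote tends to average those mistakes out.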

![](https://miro.medium.com/max/1240/1*EemYMyOADnT0lJWSXmTDdg.jpeg)

We are going to gloss over some of the details of the random forest in order to focus on its application in this context. However, those details are important! Check out this [blog post](https://victorzhou.com/blog/intro-to-random-forests/) for a gentle introduction to random forests. For a *very* in-depth explanation of random forests, check out Chapter 15 of [The Elements of Statistical Learning](https://hastie.su.domains/Papers/ESLII.pdf).

Let's get a sense for how a random forest performs without any hyperparameter tuning. We'll use the `RandomForestClassifier` from `scikit-learn`. Read the [documentation](https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.RandomForestClassifier.html). The `n_estimators` argument is where we specify the number of trees. We will rely on the default `max_features` value of $\sqrt{p}$.

In [None]:
X_train.shape[1] ** (1/2)

In [None]:
# Create random forest
rf = RandomForestClassifier(n_estimators=50)
# Cross-validate
#cv_results = cross_val_score(rf, X_train, y_train, cv=5)
#cv_results.mean()

# Iterate over same metrics as before
for metric in metrics:
    cv_results = cross_val_score(rf, X_train, y_train, cv=skfold, scoring=metric)
    print(f"Mean {metric} score is {cv_results.mean().round(3)}")

This is worse accuracy than the logistic regression - we're still not getting much improvement above baseline performance. 

Let's bring in the grid search on the `n_estimators`, `min_samples_split`, and `max_depth` hyperparameters to see if we can improve on this result.

In [None]:
param_grid = {
    "n_estimators": [50, 100, 150], # previously used 50
    "min_samples_split": [2, 5, 10], # default is 2
    "max_depth": [5, 10, 15] # default is None (no depth limit)
}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    cv=skfold,
    scoring="accuracy")

Running this grid search may take a few minutes.

In [None]:
%%time

grid_rf.fit(X_train, y_train)

In [None]:
print(grid_rf.best_score_)
print(grid_rf.best_params_)

In [None]:
best_rf = grid_rf.best_estimator_

In [None]:
best_rf

In [None]:
print("Training: ", best_rf.score(X_train, y_train))
print("Validation: ", best_rf.score(X_valid, y_valid))

In [None]:
1 - y_train.mean()

We did slightly better than the baseline accuracy, but not much - this is a hard problem! We'll now move to more evaluation.
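For reference, the baseline accuracy used above (`1 - y_train.mean()`) is the accuracy of a model that always predicts the majority class. The sketch below checks this on made-up labels with `scikit-learn`'s `DummyClassifier`; the toy arrays `X_toy` and `y_toy` are hypothetical and are not the course data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels: 9 positives out of 100 (not the course data)
y_toy = np.array([1] * 9 + [0] * 91)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print("DummyClassifier accuracy:", baseline_accuracy)
print("1 - mean of labels:      ", 1 - y_toy.mean())
```

Any model worth keeping should beat this number, and with a rare outcome the number is already high, which is why accuracy alone can be misleading.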

----
### Challenge 3: Random Search

The downside of grid search is that it can take a long time, especially when you have a large number of hyperparameters, a complex model, and a lot of data. Grid search quickly becomes unwieldy because the search space is multidimensional.

A randomized search is a potential solution to this issue. In a random search, we randomly select a fraction of the hyperparameter sets to evaluate model performance. You're not guaranteed to find the best set of parameters. However, it often performs pretty well, especially when there are computational constraints.

You can do a random search in `scikit-learn` with `RandomizedSearchCV`. Choose a set of hyperparameters (maybe for this random forest model), and run a random search with `RandomizedSearchCV`. How does the best set of parameters compare to what is identified by a grid search?
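If you want a starting point, here is a minimal sketch of the `RandomizedSearchCV` syntax on a small synthetic dataset. The data (`X_toy`, `y_toy`), the candidate values, and `n_iter=5` are all assumptions chosen to keep the example fast; swap in `X_train`, `y_train`, and your own grid for the challenge.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset, used only to illustrate the syntax
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values to sample from (27 possible combinations)
param_distributions = {
    "n_estimators": [25, 50, 75],
    "min_samples_split": [2, 5, 10],
    "max_depth": [3, 5, 10],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # evaluate only 5 randomly chosen combinations
    cv=3,
    scoring="accuracy",
    random_state=0,      # make the random draw reproducible
)
random_search.fit(X_toy, y_toy)

n_tried = len(random_search.cv_results_["params"])
print(f"Tried {n_tried} of 27 combinations")
print(random_search.best_params_)
```

The interface mirrors `GridSearchCV`; the main differences are `param_distributions` in place of `param_grid` and `n_iter`, which caps how many combinations are evaluated.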

----

## Part 4: Evaluation and Interpretation

It's time to bring back our validation dataset to evaluate how well our models perform on out-of-sample data.

We're going to use the logistic regression, decision tree, and random forest models we created to make predictions on the validation features and evaluate them on a variety of metrics.

In [None]:
# Refit logistic regression
lr = LogisticRegression()
lr.fit(X_train, y_train)

In [None]:
# Make predictions
lr_pred = lr.predict(X_valid)
dt_pred = best_dt.predict(X_valid)
rf_pred = best_rf.predict(X_valid)

Let's use the `classification_report` and `confusion_matrix` functions to evaluate the predictions.

Remember that in a confusion matrix, the columns indicate classes predicted by the model (starting from 0 on the left), while the rows indicate actual classes (starting from 0 on the top). A perfect model would therefore have only diagonal entries.
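As a small illustration of that layout, the sketch below builds a confusion matrix from made-up labels and unpacks its four cells. The lists `y_true` and `y_hat` are hypothetical and unrelated to the heart disease data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (not the course data)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 0, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

# For a binary problem, ravel() returns the cells in this order
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)     # share of actual positives that were found
precision = tp / (tp + fp)  # share of predicted positives that were correct
print(f"Recall: {recall:.2f}, Precision: {precision:.2f}")
```

The off-diagonal cells are the errors: the top right holds false positives and the bottom left holds false negatives.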

In [None]:
print('Logistic Regression\n')
print(confusion_matrix(y_valid, lr_pred))
print(classification_report(y_valid, lr_pred))

In [None]:
print('Decision Tree\n')
print(confusion_matrix(y_valid, dt_pred))
print(classification_report(y_valid, dt_pred))

In [None]:
print('Random Forest\n')
print(confusion_matrix(y_valid, rf_pred))
print(classification_report(y_valid, rf_pred))

How do all the models compare to each other? Which model do you think performed best, and for what reason?

### Interpretation

The model performance we obtained was OK, but not amazing. In particular, the models struggle to identify true cases of heart disease. This often happens in the development of machine learning algorithms. 

It's useful at this point to try and interpret our models to see where they're getting signal from in order to decide what to do next. Do we need more data? Do we need better features? Do we need a better model?

First, let's take a look at the logistic regression coefficients.

In [None]:
coefs = lr.coef_[0]
coefs = pd.Series(index=column_names, data = coefs)
coefs.sort_values(ascending=False)

Which features, according to the model, are the most and least associated with heart disease? How should you interpret categorical coefficients versus numerical coefficients? What does the sign of the coefficients mean?
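One common way to read a logistic regression coefficient is to exponentiate it, which gives an odds ratio. The sketch below uses made-up coefficients (`toy_coefs` is hypothetical, not output from the fitted model). Note that "one unit" means switching a dummy variable from 0 to 1, or, if the numerical features were standardized during preprocessing, a one standard deviation increase.

```python
import math

# Hypothetical coefficients, NOT values from the fitted model above
toy_coefs = {"feature_a": 0.7, "feature_b": -0.4, "feature_c": 0.0}

# exp(coefficient) is the multiplicative change in the odds of the outcome
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: math.exp(beta) for name, beta in toy_coefs.items()}

for name, odds_ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {odds_ratio:.2f} per one-unit increase")
```

A positive coefficient gives an odds ratio above 1 (higher odds of heart disease), a negative coefficient gives one below 1, and a coefficient of zero leaves the odds unchanged.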

With tree-based models, feature importance works a little differently. We can access a `feature_importances_` attribute which captures the "importance", defined as "The (normalized) total reduction of the criterion brought by that feature." In other words, it quantifies how much the criterion we used (in our case the Gini impurity) was reduced by the splits made on that feature. Notice that these feature importances are not signed:

In [None]:
dt_fi = best_dt.feature_importances_
dt_fi = pd.Series(index=column_names, data=dt_fi)
dt_fi.sort_values(ascending = False)

In [None]:
rf_fi = best_rf.feature_importances_
rf_fi = pd.Series(index=column_names, data=rf_fi)
rf_fi.sort_values(ascending = False)

Do the two sets of feature importances differ greatly? What do they tell us about predicting heart disease?
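For intuition about what these importances measure, here is a worked example of the Gini impurity reduction from a single split. The class counts are made up for illustration; a tree's feature importance adds up reductions like this one (weighted by how many samples reach each node) over every split that uses the feature, then normalizes.

```python
def gini(n_negative, n_positive):
    """Gini impurity of a node with the given class counts."""
    total = n_negative + n_positive
    p_neg = n_negative / total
    p_pos = n_positive / total
    return 1 - p_neg**2 - p_pos**2


# Hypothetical parent node: 90 negatives and 10 positives
parent = gini(90, 10)

# Hypothetical split into a left child (60 rows) and a right child (40 rows)
left = gini(58, 2)
right = gini(32, 8)

# Children are weighted by their share of the parent's rows
weighted_children = (60 / 100) * left + (40 / 100) * right
impurity_reduction = parent - weighted_children

print(f"Parent impurity:    {parent:.4f}")
print(f"Weighted children:  {weighted_children:.4f}")
print(f"Impurity reduction: {impurity_reduction:.4f}")
```

The reduction is never negative for the split a tree actually chooses, which is why the importances carry no sign: they say how useful a feature was for splitting, not the direction of its effect.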

Some of the feature importances are pretty low. This could imply that we should cut them out of the model. This may improve generalization performance, since the model is not trying to incorporate those less predictive features during training. This choice falls in the domain of *feature selection*. So, in future work, one thing we could do is retrain models without these features. We could also use regularization to implicitly do feature selection (e.g., a Lasso regression).

Do the sets of feature importance suggest anything about potential additional feature engineering?

Try to think about steps you might take to improve your models!

## Machine Learning Walkthrough Recap

In this exercise we attempted to predict the onset of heart disease. We did the following:

- We familiarized ourselves with the data and its patterns by studying the data dictionary and conducting exploratory data analyses.
- We applied a number of feature engineering techniques to the data to prepare for modeling.
- We employed three different machine learning models, two of which we tuned in order to maximize the generalization performance.
- We evaluated our models on a validation dataset to get a sense of how well they did on an out-of-sample dataset.
- We analyzed the relationship between the features and target variable by using attributes provided by the models.

## Part 5: Bonus - XGBoost 

eXtreme Gradient Boosting, or XGBoost, is currently (as of early 2026) widely considered the gold standard for supervised machine learning with structured or tabular data. It is incredibly fast and often highly accurate relative to other classification models.

XGBoost is part of the evolution of decision tree models:
1. Decision Tree: A single "flowchart" making decisions. High variance (overfits easily).
2. Random Forest (Bagging): Builds many trees independently and averages them.
3. Gradient Boosting (Boosting): Builds trees sequentially. Each new tree tries to fix the specific errors (residuals) made by the previous trees.
4. XGBoost: A highly optimized version of Gradient Boosting that adds clever math to prevent overfitting and specialized engineering to run extremely fast.
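The "each new tree fixes the errors of the previous trees" idea in step 3 can be sketched in a few lines. This is a simplified squared-error boosting loop built from `scikit-learn` regression trees on synthetic data. It is not XGBoost's actual algorithm (there is no regularization or other optimization), and the data and settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # Residuals are the errors made by the ensemble so far
    residuals = y - prediction
    # Fit a small tree to those residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Take a small step in the direction the new tree suggests
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Unlike a random forest, the trees here cannot be built independently: each one depends on the residuals left by all the trees before it.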

What makes XGBoost "eXtreme"? The main advantage is **regularization**. Standard boosting can be "too aggressive" and memorize noise. XGBoost includes L1 and L2 regularization (as we saw with Lasso and Ridge models) directly in its objective function. This penalizes complex trees, forcing the model to stay simple and generalize better.

XGBoost also has some additional nice features. First, it has a Sparsity-Aware Split Finding algorithm to deal with missing values. If a value is missing, the model "tries" sending it to the left branch and the right branch, then learns which direction works best for that specific split. This can replace manual imputation, though imputation can still be useful as well. Second, it is very fast. XGBoost is designed to utilize all the cores of your CPU simultaneously (parallel processing). It also uses "Cache Awareness," meaning it organizes data in the computer's memory so the processor can access it quickly.

### Parameters

XGBoost has dozens of parameters you can set and tune. For this introduction, we'll focus just on a few "big ones":
* `n_estimators`: Number of trees to build. Too many = overfitting; too few = underfitting.
* `learning_rate` (eta): How much we "listen" to each new tree. Usually between 0.01 and 0.3. Smaller values require more n_estimators.
* `max_depth`: How deep each individual tree can go. Standard is 3–6. Deep trees capture complex patterns but overfit quickly.
* `subsample`: Percentage of rows used to train each tree. Helps prevent the model from focusing too much on specific outliers.

### Estimating an XGBoost model

We will use the `xgboost` package. Fortunately, the syntax is very similar to `scikit-learn`. We'll estimate a basic model here with some specific selected parameters. Click on "Parameters" after estimating to see what else we might have specified.

In [None]:
len(y_valid)

In [None]:
from xgboost import XGBClassifier

# Determining the ratio for scale_pos_weight (negatives / positives),
# computed on the training labels so no validation information leaks in
ratio = len(y_train[y_train==0]) / y_train.sum()

# Initialize with the "Safe" defaults
xgb_model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=212,
)

# It fits just like the DecisionTree did!
xgb_model.fit(X_train, y_train)

As with the other decision-tree based models, we can look at feature importance.

In [None]:
xgb_fi = xgb_model.feature_importances_
xgb_fi = pd.Series(index=column_names, data=xgb_fi)
xgb_fi.sort_values(ascending = False)

We can also evaluate the model using the same syntax as previously. How does the model with the selected parameters compare to those we used previously?

In [None]:
# Predict and Evaluate
xgb_pred = xgb_model.predict(X_valid)
print('XGB\n')
print(confusion_matrix(y_valid, xgb_pred))
print(classification_report(y_valid, xgb_pred))

Another important parameter, particularly for our case with a rare outcome, is `scale_pos_weight`. This parameter increases the penalty for making a mistake on rare outcomes. It effectively forces the model to pay more attention to the minority class, which is useful for imbalanced data as in our current exercise. 

A common rule of thumb is to set the weight as the ratio of the majority class to the minority class. Using this parameter rather than over/undersampling is efficient because it reweights the loss function directly instead of changing the dataset.
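As a small worked example of that rule of thumb, the sketch below computes the ratio on made-up labels and shows how it scales the loss for a missed positive case. The labels `y_toy` and the predicted probability `p_hat` are hypothetical, and the weighting shown is a simplified picture of what the parameter does.

```python
import math

# Hypothetical labels: 90 negatives and 10 positives (not the course data)
y_toy = [0] * 90 + [1] * 10

n_negative = sum(1 for label in y_toy if label == 0)
n_positive = sum(1 for label in y_toy if label == 1)

# Rule of thumb: majority count divided by minority count
toy_ratio = n_negative / n_positive
print("scale_pos_weight rule of thumb:", toy_ratio)

# Simplified picture: the log loss on a positive case is multiplied by the weight.
# Suppose the model gives a true positive case a probability of only 0.2.
p_hat = 0.2
unweighted_loss = -math.log(p_hat)
weighted_loss = toy_ratio * unweighted_loss
print(f"Loss without weighting: {unweighted_loss:.3f}")
print(f"Loss with weighting:    {weighted_loss:.3f}")
```

With the weight applied, missing one positive case costs about as much as nine comparable mistakes on negative cases, which pushes the model toward predicting the rare class more often.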

Let's see how this parameter affects the results.


In [None]:
# Determining the ratio for scale_pos_weight (negatives / positives),
# computed on the training labels so no validation information leaks in
ratio = len(y_train[y_train==0]) / y_train.sum()

# Set and fit the model
xgb_model2 = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=212,
    scale_pos_weight=ratio
)
xgb_model2.fit(X_train, y_train)

# Evaluate
xgb_pred2 = xgb_model2.predict(X_valid)
print('XGB with scale_pos_weight\n')
print(confusion_matrix(y_valid, xgb_pred2))
print(classification_report(y_valid, xgb_pred2))

**Question**: How do the results compare? What are the implications?

When we increase `scale_pos_weight`, we are telling the model: "It is better to accidentally predict a 'False Positive' than to miss a true case (a 'False Negative')." As a result, Recall (Sensitivity) will typically go up, but Precision will typically go down.

Note: if we have more than two classes, we can use the `sample_weight` parameter within the `.fit()` method instead, which assigns a specific weight to each row in the dataset. 
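One way to build such row weights is `scikit-learn`'s `compute_sample_weight` helper, sketched below on made-up three-class labels. The array `y_multi` is hypothetical, and the commented-out `.fit()` line shows where the weights would go rather than being run here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical three-class labels: class 0 is common, class 2 is rare
y_multi = np.array([0, 0, 0, 0, 1, 1, 2])

# "balanced" gives each row the weight n_samples / (n_classes * class_count)
row_weights = compute_sample_weight(class_weight="balanced", y=y_multi)
print(row_weights.round(3))

# The weights would then be passed when fitting, for example:
# xgb_model.fit(X_train, y_train, sample_weight=row_weights)
```

With balanced weights, every class contributes the same total weight to the loss, so rows from rare classes count for more individually.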

How can we improve the models? **Hyperparameter tuning** is where XGBoost truly shines, given its efficiency and speed. 

As before we can use a grid search and cross-validation to do the tuning. Note that we can combine `scikit-learn` and `xgboost` functions fairly easily.

Below is an example.

Note: `logloss` is the default **evaluation metric** for binary classification. It compares the actual label to the predicted probability to determine the "loss", and punishes "confident but wrong" predictions more severely than "unsure" predictions. This encourages more accurate probabilities. 
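To see that asymmetry in numbers, here is the log loss for a single observation whose true label is 1, under three made-up predicted probabilities. The helper `log_loss_single` is written for this illustration and is not part of `xgboost` or `scikit-learn`.

```python
import math


def log_loss_single(y_true, p_hat):
    """Log loss for one observation with label y_true and predicted probability p_hat."""
    return -(y_true * math.log(p_hat) + (1 - y_true) * math.log(1 - p_hat))


# The true label is 1 in all three cases
confident_right = log_loss_single(1, 0.99)
unsure_wrong = log_loss_single(1, 0.40)
confident_wrong = log_loss_single(1, 0.01)

print(f"Confident and right (p=0.99): {confident_right:.3f}")
print(f"Unsure and wrong    (p=0.40): {unsure_wrong:.3f}")
print(f"Confident and wrong (p=0.01): {confident_wrong:.3f}")
```

The loss grows without bound as the predicted probability for the true class approaches zero, so a single very confident mistake can outweigh many mildly wrong predictions.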

Alternative metrics for binary classification include `error` (number of wrong predictions/total), `auc` (how well the model separates two classes across all possible thresholds), and `aucpr` (the area under the precision-recall curve, which is often more informative than `auc` when the positive class is very rare). For multiclass classification there are `mlogloss` and `merror` multiclass versions of the binary metrics. For regression, `rmse` and `mae` are common, as we used in our linear regression models.

In [None]:
%%time

# 1. Initialize the model
xgb = XGBClassifier(random_state=212, eval_metric='logloss')

# 2. Define the "Grid" of parameters to test
# Tip: Start small! Even this grid creates 3 * 3 * 2 * 3 = 54 combinations.
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.05, 0.1, 0.2],
    'n_estimators': [100, 200],
    'scale_pos_weight': [ratio/2, ratio, ratio*2]
}

# 3. Set up the Search
# cv=3 means it will run 3-fold cross-validation for every combination
grid_search = GridSearchCV(estimator=xgb, 
                           param_grid=param_grid, 
                           cv=3, 
                           scoring='accuracy', 
                           verbose=1)

# 4. Run the search on your training data
grid_search.fit(X_train, y_train)

# 5. See the results
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_:.4f}")

# 6. Use the best model to predict on the validation data and evaluate
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_valid)
print(confusion_matrix(y_valid, predictions))
print(classification_report(y_valid, predictions))

That was pretty fast! 

This is better than the original model, but not yet as good as the more basic models we estimated. To improve, we may need to run a broader hyperparameter search and to think hard about what metrics we want to prioritize. 

### Moving beyond the basics

Once you understand the basic "Fit and Predict" process, here are the four areas you could look into to become more proficient with XGBoost:

1. Understand the **Bias-Variance Tradeoff**
Tuning in XGBoost is a balancing act:
* To reduce Overfitting (High Variance): Lower max_depth, increase gamma (minimum loss reduction to split), or decrease learning_rate.
* To reduce Underfitting (High Bias): Increase max_depth or increase n_estimators.
Grid search hyperparameter tuning can help with setting values, but it is important to have a strong conceptual understanding.

2. **Early Stopping**
Instead of picking an arbitrary number for n_estimators, you can tell XGBoost: "Keep building trees until the validation score stops improving for 10 rounds, then stop." This saves time and helps prevent overfitting.

3. **Feature Engineering** over Tuning
Better data often beats a better model. As a rough rule of thumb, spending two hours tuning hyperparameters might yield a 1% improvement, while spending two hours creating a new, clever feature (like interactions between dummy variables, squared continuous variables, etc.) can yield a 5–10% improvement. This point holds for all supervised ML, not just XGBoost.

4. XGBoost for **Regression**
XGBoost can also be used to predict continuous numbers by simply using XGBRegressor instead of XGBClassifier. The logic and hyperparameters remain nearly identical. It can produce efficiency gains relative to alternatives such as Lasso or Ridge Regression.
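The early stopping idea in item 2 above can be sketched with `scikit-learn`'s `GradientBoostingClassifier`, used here as a stand-in so the example depends only on `scikit-learn`. XGBoost offers a similar mechanism through `early_stopping_rounds` together with an `eval_set` passed to `.fit()`; check the documentation for the version you have installed. The synthetic data and settings below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic dataset, used only to illustrate the idea
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper limit on the number of trees
    learning_rate=0.1,
    validation_fraction=0.2,  # hold out 20% of the data to monitor the score
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_toy, y_toy)

# Number of trees actually built before stopping
n_trees_built = gb.n_estimators_
print(f"Trees built: {n_trees_built} (limit was 500)")
```

The design choice is to set `n_estimators` generously and let the held-out score decide when to stop, rather than tuning the number of trees by hand.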

In addition, here are some basic **troubleshooting notes:**
* Model is too slow: Decrease n_estimators or increase learning_rate.
* Model is memorizing the training data: Decrease max_depth or use subsample (e.g., 0.8).
* Classes are imbalanced (e.g., rare disease): Use scale_pos_weight to give the rare class more "weight."
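* The computer is getting hot/loud: Training is using many CPU cores at once. Lower n_jobs (e.g., n_jobs=2) to limit the number of cores used; n_jobs=-1 does the opposite and uses all cores for maximum speed.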