# Challenge 1

The heart disease dataset is a classic dataset that contains various health metrics (age, sex, chest pain type, blood pressure, cholesterol, etc.) related to diagnosing heart disease (binary classification: presence or absence of heart disease).

In [1]:
# Imports the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Loads the heart disease dataset 
df = pd.read_csv('../data/heart.csv')

In [3]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


We are going to try to predict the presence of heart disease using these features, starting with a classical baseline method and then trying to improve on that result with a series of ensemble approaches.

In [11]:
# Separates the data
X = df.drop(columns="target")
y = df["target"]

# Splits data into training and test data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scaling of features (for certain models, e.g., SVM or logistic regression, not always necessary for trees)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
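
One thing worth checking before modelling: the `df.head()` output shows an `Unnamed: 0` column, which usually means the CSV was saved together with its row index. Because `X = df.drop(columns="target")` keeps every other column, that index would be passed to the models as a feature. Whether this is intended is not stated, so the following is only a sketch of how the column could be excluded. It uses a few columns of the five displayed rows as a stand-in for the file, and the names `sample`, `X_sample` and `y_sample` are hypothetical. Dropping the column would change the accuracies reported in the rest of this notebook.

```python
import pandas as pd

# Stand-in for heart.csv: a few columns of the five rows shown by df.head()
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "age": [63, 37, 41, 56, 57],
    "sex": [1, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0],
    "chol": [233, 250, 204, 236, 354],
    "target": [1, 1, 1, 1, 1],
})

# Drop the label and, if present, the saved row index
# (errors="ignore" keeps this safe when the index column is absent)
X_sample = sample.drop(columns=["target", "Unnamed: 0"], errors="ignore")
y_sample = sample["target"]

print(list(X_sample.columns))
```

An alternative under the same assumption is to read the file with `pd.read_csv(path, index_col=0)`, so that the first column becomes the DataFrame index instead of a feature.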

# Baseline model: decision tree

We'll train a decision tree as our baseline model and evaluate it using accuracy.

In [14]:
# Create and train a decision tree classifier and print the train and test accuracy

# Imports the necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initializes the Decision Tree Classifier
decision_tree = DecisionTreeClassifier(random_state=42)

# Trains the Decision Tree model
decision_tree.fit(X_train, y_train)

# Makes predictions
y_train_pred = decision_tree.predict(X_train)
y_test_pred = decision_tree.predict(X_test)

# Evaluates model performance
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Prints the accuracies
print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Testing Accuracy: {test_accuracy:.2f}")

Training Accuracy: 1.00
Testing Accuracy: 0.79


We can see that this model is overfitting. This is expected: decision trees, especially deep ones, are notoriously aggressive at exploiting the data available. That also gives them high variance: a small change in the training data can produce a very different tree and potentially large changes in performance.

In [24]:
# Run the same code again a couple of times.
# To get different results each time, we replace the fixed random_state used the first time around with a different seed per run.
# Using a loop avoids repeating the same code for every run.

# Imports the necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stores the results for multiple runs
results = []

# Runs 5 iterations with different random states
for seed in range(1, 6):  
    # Splits the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    
    # Initializes and trains the model
    decision_tree = DecisionTreeClassifier(random_state=seed)
    decision_tree.fit(X_train, y_train)
    
    # Evaluates the model
    train_accuracy = accuracy_score(y_train, decision_tree.predict(X_train))
    test_accuracy = accuracy_score(y_test, decision_tree.predict(X_test))
    
    # Stores results
    results.append((seed, train_accuracy, test_accuracy))

# Prints the results for all runs
for seed, train_acc, test_acc in results:
    print(f"Run {seed}: Training Accuracy = {train_acc:.2f}, Testing Accuracy = {test_acc:.2f}")

# You can see that the Train Accuracy is always 100% (overfitting) and the Test Accuracy is all over the place. 
# This is undesirable: our method is not generalizing and has high variance

Run 1: Training Accuracy = 1.00, Testing Accuracy = 0.69
Run 2: Training Accuracy = 1.00, Testing Accuracy = 0.87
Run 3: Training Accuracy = 1.00, Testing Accuracy = 0.77
Run 4: Training Accuracy = 1.00, Testing Accuracy = 0.82
Run 5: Training Accuracy = 1.00, Testing Accuracy = 0.84


You can see that the training accuracy is always 100% (overfitting) and the test accuracy is all over the place, ranging from 0.69 to 0.87.
This is undesirable: our method is not generalizing and has high variance.
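
The loop above can also be written with scikit-learn's splitting utilities. The sketch below uses `ShuffleSplit` and `cross_validate` to run five random 80/20 splits and collect train and test accuracy in one call. It runs on synthetic data from `make_classification` as a stand-in (an assumption made so the example is self-contained), so its numbers will not match the heart disease results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

# Five random 80/20 splits, mirroring the manual loop
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=splitter,
    scoring="accuracy",
    return_train_score=True,
)

print("train:", np.round(scores["train_score"], 2))
print("test: ", np.round(scores["test_score"], 2))
print(f"test mean = {scores['test_score'].mean():.2f}, "
      f"test std = {scores['test_score'].std():.2f}")
```

The standard deviation of the test scores gives a single number for how much the result depends on the split, which is the variance discussed here.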

# Bagging: reducing variance

Bagging (bootstrap aggregating) improves models by reducing variance: it averages the predictions of multiple models, each trained on a different bootstrap sample (a random sample drawn with replacement) of the training data. This averaging makes the overall model less sensitive to any one dataset or model, so the final prediction is more stable and less prone to overfitting.

- High-variance models, like decision trees, tend to overfit the training data. This means that small changes in the training data can lead to large changes in the model’s predictions. For example, a decision tree trained on one subset of data might look completely different from a decision tree trained on another subset. This leads to high variance, where the model’s performance fluctuates a lot depending on the specific data it was trained on.
- Once all the individual models are trained, bagging combines their predictions by averaging them (for regression) or using a majority vote (for classification). The key idea here is that the errors of the individual models are somewhat independent because they are trained on different bootstrap samples. Some models will make errors in one direction, while others will make errors in another. When you average these predictions, the errors tend to cancel out, reducing the overall variability (variance) of the final model.
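
The two steps described above, bootstrap sampling and majority voting, can be sketched by hand. This is a minimal illustration on synthetic data (an assumption, so the example is self-contained), not a reproduction of what `BaggingClassifier` does internally. For example, scikit-learn averages predicted probabilities when the base estimator provides them, rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
rng = np.random.default_rng(0)

n_estimators = 25  # odd, so a binary majority vote cannot tie
n_samples = X_demo.shape[0]
trees = []

for _ in range(n_estimators):
    # Bootstrap sample: draw n_samples row indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Each row of `votes` holds one tree's 0/1 predictions for all samples
votes = np.stack([tree.predict(X_demo) for tree in trees])

# Majority vote: predict 1 when more than half of the trees say 1
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("ensemble training accuracy:", (bagged_pred == y_demo).mean())
```

Each tree sees a slightly different dataset, so the trees disagree on some points. The vote keeps the label most of them agree on.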

In [33]:
# Create and Train a BaggingClassifier. 
# Use as base estimator a weak decision tree (max_depth=1) and 100 estimators to really average over a lot of different data samples
# Print the train and test accuracy

from sklearn.ensemble import BaggingClassifier

# Splits the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize a weak decision tree with max_depth=1
weak_tree = DecisionTreeClassifier(max_depth=1, random_state=42)

# Creates the bagging classifier with 100 estimators
bagging_clf = BaggingClassifier(estimator=weak_tree, n_estimators=100, random_state=42)

# Trains the bagging classifier
bagging_clf.fit(X_train, y_train)

# Makes predictions
y_train_pred = bagging_clf.predict(X_train)
y_test_pred = bagging_clf.predict(X_test)

# Evaluates performance
# Calculates the training and test accuracy scores
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Prints the results
print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Testing Accuracy: {test_accuracy:.2f}")

Training Accuracy: 0.83
Testing Accuracy: 0.90


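You can probably see a modest improvement in score but, most importantly, the overfitting is mostly gone: training accuracy (0.83) is no longer pinned at 1.00, and test accuracy is 0.90. Part of this comes from the base estimator itself, since a depth-1 tree cannot memorize the training set, and part from averaging over many bootstrap samples, which stabilizes the predictions of the individual trees.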

In [42]:
# Runs the same code a couple of times, using a loop.

# Stores the results for multiple runs
results_mult_bagging = []

# Runs 5 iterations with different random states
for seed in range(1, 6):  
    # Splits the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    
    # Initializes the model 
    # Initialize a weak decision tree with max_depth=1
    weak_tree = DecisionTreeClassifier(max_depth=1, random_state=42)

    # Creates the bagging classifier with 100 estimators
    bagging_clf = BaggingClassifier(estimator=weak_tree, n_estimators=100, random_state=42)

    # Trains the bagging classifier
    bagging_clf.fit(X_train, y_train)
    
    # Makes predictions
    y_train_pred = bagging_clf.predict(X_train)
    y_test_pred = bagging_clf.predict(X_test)

    # Calculates accuracy for training and testing sets
    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    
    # Appends results to the list
    results_mult_bagging.append((seed, train_acc, test_acc))

# Prints the results for all runs
for seed, train_acc, test_acc in results_mult_bagging:
    print(f"Run {seed}: Training Accuracy = {train_acc:.2f}, Testing Accuracy = {test_acc:.2f}")

Run 1: Training Accuracy = 0.78, Testing Accuracy = 0.67
Run 2: Training Accuracy = 0.84, Testing Accuracy = 0.85
Run 3: Training Accuracy = 0.76, Testing Accuracy = 0.77
Run 4: Training Accuracy = 0.75, Testing Accuracy = 0.80
Run 5: Training Accuracy = 0.82, Testing Accuracy = 0.92


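You can see that the training accuracy is no longer pinned at 1.00 and now sits in the same range as the test accuracy, although the test accuracy itself still varies between splits (from 0.67 to 0.92).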

# Boosting: reducing bias

Now we’ll apply AdaBoost with decision trees as weak learners. This will sequentially improve the model by focusing on difficult cases.

Boosting reduces bias by sequentially training a series of weak learners (often simple models like decision trees) where each subsequent model focuses on the mistakes made by the previous models. The key idea behind boosting is to incrementally improve the model by correcting errors, which helps to reduce bias, especially when the initial model is too simple and underfits the data.

- Boosting typically uses weak learners, which are models that perform only slightly better than random guessing. For example, in classification, a weak learner might be a decision tree with a single split (a "stump") or a very shallow tree with just a few levels. Weak learners usually have high bias, meaning they are too simplistic to capture the underlying patterns in the data well. As a result, they underfit the data.

- In each iteration, boosting trains a new model that tries to correct the errors made by the earlier models. If an instance was misclassified by the first weak learner, it will receive a higher weight, so the next model pays more attention to it. As the sequence of models progresses, the ensemble collectively focuses more on the difficult-to-predict instances. Over time, the combined models become better at fitting the data, as they successively reduce the bias (systematic error) by adjusting for earlier mistakes.
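
The reweighting described above can be sketched by hand. The code below follows the textbook discrete AdaBoost formulation for binary labels coded as -1/+1, with decision stumps as weak learners. It runs on synthetic data (an assumption, so the example is self-contained) and is meant as an illustration of the idea, not as a copy of scikit-learn's `AdaBoostClassifier` implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
y_signed = np.where(y_demo == 1, 1, -1)  # recode labels as -1/+1

n_rounds = 20
weights = np.full(len(y_signed), 1 / len(y_signed))  # start with uniform weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Fit a stump that pays more attention to heavily weighted rows
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_signed, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this stump, clipped to keep the log finite
    err = np.clip(weights[pred != y_signed].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # influence of this stump in the final vote

    # Misclassified rows (y * pred = -1) are scaled up, correct ones scaled down
    weights = weights * np.exp(-alpha * y_signed * pred)
    weights = weights / weights.sum()  # renormalize to sum to 1

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump predictions
ensemble_score = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
boosted_pred = np.where(ensemble_score >= 0, 1, -1)

print("first stump training accuracy:", (stumps[0].predict(X_demo) == y_signed).mean())
print("boosted training accuracy:    ", (boosted_pred == y_signed).mean())
```

Because each stump is fitted to a reweighted dataset, later stumps concentrate on the rows that earlier ones got wrong, which is how the ensemble reduces the bias of a single stump.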

In [50]:
# Creates and trains an AdaBoostClassifier. 
# Uses as base estimator a weak decision tree (max_depth=1) and 100 estimators so the ensemble has many rounds in which to correct earlier mistakes
# Prints the train and test accuracies

# Imports the necessary libraries
from sklearn.ensemble import AdaBoostClassifier

# Splits the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initializes a weak decision tree with max_depth=1
weak_tree = DecisionTreeClassifier(max_depth=1, random_state=42)

# Creates and trains the AdaBoost classifier
# The original call (without the algorithm argument) raised a FutureWarning that the default algorithm is deprecated:
# adaboost_clf = AdaBoostClassifier(estimator=weak_tree, n_estimators=100, random_state=42)
# Passing algorithm='SAMME' explicitly avoids that warning
adaboost_clf = AdaBoostClassifier(estimator=weak_tree, n_estimators=100, algorithm='SAMME', random_state=42)
adaboost_clf.fit(X_train, y_train)

# Makes predictions
y_train_pred = adaboost_clf.predict(X_train)
y_test_pred = adaboost_clf.predict(X_test)

# Evaluates train and test performance
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Prints the accuracies
print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Testing Accuracy: {test_accuracy:.2f}")

Training Accuracy: 0.90
Testing Accuracy: 0.82


Training accuracy improves (0.90, versus 0.83 for bagging on the same split), but overfitting is rearing its ugly head again, though not as much as in the baseline model: test accuracy is 0.82, below the 0.90 obtained with bagging. This is because the iterative correction of AdaBoost really allows the model to focus on the specifics of this problem, at the cost of overexploiting the training data.

In [57]:
# Runs the same code again a couple of times. 

# Stores the results for multiple runs
results_mult_boosting = []

# Runs 5 iterations with different random states
for seed in range(1, 6):  
    # Splits the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    
    # Initializes the model 
    # Initialize a weak decision tree with max_depth=1
    weak_tree = DecisionTreeClassifier(max_depth=1, random_state=42)
    
    # Creates the AdaBoost classifier with 100 estimators
    adaboost_clf = AdaBoostClassifier(estimator=weak_tree, n_estimators=100, algorithm='SAMME', random_state=42)

    # Trains the AdaBoost classifier
    adaboost_clf.fit(X_train, y_train)
    
    # Makes predictions
    y_train_pred = adaboost_clf.predict(X_train)
    y_test_pred = adaboost_clf.predict(X_test)

    # Evaluates train and test performance
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    # Appends the results to the list
    results_mult_boosting.append((seed, train_accuracy, test_accuracy))

# Prints the results for all runs
for seed, train_acc, test_acc in results_mult_boosting:
    print(f"Run {seed}: Training Accuracy = {train_acc:.2f}, Testing Accuracy = {test_acc:.2f}")

# You can see that the test accuracy is mostly pretty good, even if it sometimes gets lower or higher scores (high variance, low bias)
# You can also see that the train accuracy is consistently at or above the test accuracy, indicating some (not extreme) overfitting


Run 1: Training Accuracy = 0.92, Testing Accuracy = 0.74
Run 2: Training Accuracy = 0.89, Testing Accuracy = 0.89
Run 3: Training Accuracy = 0.89, Testing Accuracy = 0.89
Run 4: Training Accuracy = 0.90, Testing Accuracy = 0.85
Run 5: Training Accuracy = 0.90, Testing Accuracy = 0.84


You can see that the test accuracy is mostly pretty good, even if it sometimes gets lower or higher scores (high variance, low bias).
You can also see that the training accuracy is consistently at or above the test accuracy, indicating some (not extreme) overfitting.
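
To put the three methods side by side, the sketch below summarizes the five-run outputs printed in this notebook. The numbers are copied from those outputs, and the names `reported` and `summary` are hypothetical helpers. With only five runs per method, the averages should be read as rough indications rather than precise estimates.

```python
import numpy as np

# (train, test) accuracies per run, copied from the printed run outputs
reported = {
    "decision tree": ([1.00, 1.00, 1.00, 1.00, 1.00], [0.69, 0.87, 0.77, 0.82, 0.84]),
    "bagging": ([0.78, 0.84, 0.76, 0.75, 0.82], [0.67, 0.85, 0.77, 0.80, 0.92]),
    "AdaBoost": ([0.92, 0.89, 0.89, 0.90, 0.90], [0.74, 0.89, 0.89, 0.85, 0.84]),
}

summary = {}
for name, (train, test) in reported.items():
    train, test = np.array(train), np.array(test)
    summary[name] = {
        "mean_test": test.mean(),          # average test accuracy over the runs
        "std_test": test.std(),            # spread of test accuracy across splits
        "mean_gap": (train - test).mean(), # average train-minus-test gap (overfitting)
    }
    print(f"{name:>13}: mean test = {summary[name]['mean_test']:.3f}, "
          f"test std = {summary[name]['std_test']:.3f}, "
          f"mean gap = {summary[name]['mean_gap']:.3f}")
```

On these numbers the average train-minus-test gap is about 0.20 for the single tree, about -0.01 for bagging and about 0.06 for AdaBoost, while the mean test accuracy is about 0.80, 0.80 and 0.84 respectively.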
