# Challenge 1

The heart disease dataset is a classic dataset that contains various health metrics (age, sex, chest pain type, blood pressure, cholesterol, etc.) related to diagnosing heart disease (binary classification: presence or absence of heart disease).

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset (change the path if needed)
df = pd.read_csv('../data/heart.csv')

In [3]:
df.head() # Display the first 5 rows of the dataset to understand its structure

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


We are going to try to predict the presence of heart disease using these features, starting with a classical baseline method and then trying to improve on that result with a series of ensemble approaches.

In [4]:
# Separate features (X) and target variable (y)
# We drop the column "target" from the dataset to get the features (X)
# The "target" column is what we want to predict (1 = heart disease, 0 = no heart disease)
X = df.drop(columns="target")
y = df["target"]

# Split the dataset into training and test sets
# 25% of the data will be used for testing, 75% for training
# random_state=0 ensures we get the same split every time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale the features (important for models like SVM or logistic regression)
# Scaling puts all feature values on the same scale; the tree-based models
# below do not need it and are trained on the unscaled X_train / X_test
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Use the same scaler (already fitted) to transform the test data
X_test_scaled = scaler.transform(X_test)

# Baseline model: decision tree

We'll train a decision tree as our baseline model and evaluate it using accuracy.

In [5]:
# Create and train a Decision Tree classifier, then print the train and test accuracy

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train Decision Tree
# Create the model
clf = DecisionTreeClassifier(random_state=0)

# Fit the model to the training data
clf.fit(X_train, y_train)

# Predictions and evaluation
# Make predictions on both train and test sets
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

# Evaluate performance
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))

Train Accuracy: 1.0
Test Accuracy: 0.7894736842105263


In [6]:
# The model reaches 100% Train Accuracy but only about 79% Test Accuracy. This means the model is overfitting: it memorizes the training data perfectly, but it does not generalize well to new data. Decision Trees can easily become too complex and too specific to the training data, which makes them unstable (high variance).


We can see that this model is overfitting. This is expected: decision trees, especially deep ones, are notoriously aggressive at exploiting the available data. That also makes them high-variance: a small change in the training data or in the tree can cause large changes in performance.
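
Instead of re-running a cell by hand, the spread of the test accuracy can be summarised over several seeds. The sketch below is only an illustration: `summarize_runs` and `toy_score` are hypothetical helpers that are not part of the notebook, and `toy_score` returns made-up numbers in place of a real fit.

```python
import random
import statistics
from typing import Callable, Dict


def summarize_runs(score_fn: Callable[[int], float], n_runs: int = 20) -> Dict[str, float]:
    """Call score_fn(seed) for seeds 0..n_runs-1 and summarise the scores."""
    scores = [score_fn(seed) for seed in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "std": statistics.pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


def toy_score(seed: int) -> float:
    """Stand-in for 'fit a model with this seed and return its test accuracy'.

    The values are invented for illustration only.
    """
    rng = random.Random(seed)
    return 0.78 + rng.uniform(-0.04, 0.04)


summary = summarize_runs(toy_score, n_runs=20)
print(summary)

# With the notebook's variables, score_fn could instead be something like
# (an assumption, shown as a comment because it needs the dataset):
# lambda seed: accuracy_score(
#     y_test,
#     DecisionTreeClassifier(random_state=seed).fit(X_train, y_train).predict(X_test),
# )
```

A large `std` relative to the gap between models is a sign that a single run is not a reliable comparison.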

In [18]:
# Run the same code again a couple of times. 
# You can see that the Train Accuracy is always 100% (overfitting) and the Test Accuracy is all over the place. 
# This is undesirable: our method is not generalizing and has high variance

clf = DecisionTreeClassifier() # We remove the random_state so the tree will be slightly different on each run
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train, y_train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))

Train Accuracy: 1.0
Test Accuracy: 0.7894736842105263


In [20]:
# The Decision Tree model always gets 100% accuracy on the training data, but in these runs it never goes above 81% on the test data. This shows that the model is overfitting: it learns the training examples too well, but it doesn't generalize well to new data. We need a better method that reduces this overfitting problem.


# Bagging: reducing variance

Bagging improves models because it reduces variance by averaging the predictions of multiple models trained on different subsets of the training data. This averaging effect reduces the sensitivity of the overall model to any one dataset or model, making the final prediction more stable and less prone to overfitting.
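
The "different subsets" are usually bootstrap samples: rows drawn from the training set with replacement, which is also the default behaviour of scikit-learn's `BaggingClassifier`. The following sketch, using only the standard library and an illustrative dataset size, shows that one such sample contains only part of the original rows (about 63% on average), so each model sees a somewhat different dataset. The helper name `bootstrap_indices` is hypothetical.

```python
import random
from typing import List


def bootstrap_indices(n_samples: int, rng: random.Random) -> List[int]:
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(0)
n_rows = 1000  # illustrative size, not the size of the heart dataset
sample = bootstrap_indices(n_rows, rng)

# Some rows are drawn several times, others not at all.
unique_fraction = len(set(sample)) / n_rows
print(f"Unique rows in one bootstrap sample: {unique_fraction:.1%}")
```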

- High-variance models, like decision trees, tend to overfit the training data. This means that small changes in the training data can lead to large changes in the model’s predictions. For example, a decision tree trained on one subset of data might look completely different from a decision tree trained on another subset. This leads to high variance, where the model’s performance fluctuates a lot depending on the specific data it was trained on.
- Once all the individual models are trained, Bagging combines their predictions by averaging them (for regression) or using a majority vote (for classification). The key idea here is that the errors of the individual models are partly independent, because each model is trained on a different bootstrap sample (a random sample of the training set drawn with replacement). Some models will make errors in one direction, while others might make errors in another. When you combine these predictions, the errors tend to cancel out, reducing the overall variability (variance) of the final model.
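
The effect of majority voting can be illustrated with a small calculation. The sketch below makes a strong simplifying assumption: every model is correct with the same probability `p`, fully independently of the others. Real bagged trees are correlated, so the gain in practice is smaller. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is correct with probability p, independently of the
    others, and that n_models is odd so ties cannot occur.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Accuracy of the vote for ensembles of growing size, each model 60% accurate.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the vote only helps when each model is better than chance (`p > 0.5`); for `p < 0.5` adding models makes the vote worse.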

In [22]:
# Create and train a BaggingClassifier.
# Use a weak decision tree (max_depth=1) as the base estimator and 100 estimators to really cover a lot of different data samples
# Print the train and test accuracy

from sklearn.ensemble import BaggingClassifier

# Train BaggingClassifier
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner (shallow tree)
    n_estimators=100,  # number of trees
    random_state=0)
bagging_model.fit(X_train, y_train)

# Predictions and evaluation
y_train_pred = bagging_model.predict(X_train)
y_test_pred = bagging_model.predict(X_test)

# Evaluate performance
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))

Train Accuracy: 0.8193832599118943
Test Accuracy: 0.8421052631578947


In [24]:
# With Bagging, the Train Accuracy is lower than before, but the Test Accuracy is better and more stable. This means the model is not overfitting anymore. Bagging helps by training many simple trees and combining their predictions. This reduces the model's sensitivity to the training data and makes it more reliable on new data.


You can probably see a modest improvement in score, but most importantly, the overfitting is mostly gone. This is because combining models trained on many bootstrap samples stabilizes the high variance of the base model.

In [27]:
# Run the same code again a couple of times. 
# You can see that consistently the Train Accuracy is close to the Test Accuracy. 

# Create and train a BaggingClassifier.
# Use a weak decision tree (max_depth=1) as the base estimator and 100 estimators to really cover a lot of different data samples
# Print the train and test accuracy

from sklearn.ensemble import BaggingClassifier

# Train BaggingClassifier without random_state, so each run draws different samples
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner (shallow tree)
    n_estimators=100) # number of trees
bagging_model.fit(X_train, y_train)

# Predictions and evaluation
y_train_pred = bagging_model.predict(X_train)
y_test_pred = bagging_model.predict(X_test)

# Evaluate performance
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))

Train Accuracy: 0.8325991189427313
Test Accuracy: 0.8552631578947368


In [29]:
# When we remove the random_state, the results change each time we run the model. However, we can still see that the Train Accuracy and Test Accuracy stay very close to each other. This means the model is stable and not overfitting. Bagging helps reduce variance by training on different subsets of data and combining their results, so the performance does not depend too much on a specific training set.


# Boosting: reducing bias

Now we’ll apply AdaBoost with decision trees as weak learners. This will sequentially improve the model by focusing on difficult cases.

Boosting reduces bias by sequentially training a series of weak learners (often simple models like decision trees) where each subsequent model focuses on the mistakes made by the previous models. The key idea behind boosting is to incrementally improve the model by correcting errors, which helps to reduce bias, especially when the initial model is too simple and underfits the data.

- Boosting typically uses weak learners, which are models that perform only slightly better than random guessing. For example, in classification, a weak learner might be a very shallow decision tree; a tree with a single split is called a "stump". Weak learners usually have high bias, meaning they are too simplistic and don't capture the underlying patterns in the data well. As a result, they underfit the data.

- In each iteration, boosting trains a new model that tries to correct the errors made by the earlier models. If an instance was misclassified by the first weak learner, it will receive a higher weight, so the next model pays more attention to it. As the sequence of models progresses, the ensemble collectively focuses more on the difficult-to-predict instances. Over time, the combined models become better at fitting the data, as they successively reduce the bias (systematic error) by adjusting for earlier mistakes.
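
The reweighting step can be made concrete with a worked example. The sketch below follows the classic binary AdaBoost update; scikit-learn's `AdaBoostClassifier` is based on the same idea, but its exact formulas differ in detail. The function name `adaboost_round` and the five-sample outcome are made up for illustration.

```python
import math
from typing import List, Tuple


def adaboost_round(weights: List[float], correct: List[bool]) -> Tuple[float, List[float]]:
    """One reweighting step of classic binary AdaBoost.

    weights: current sample weights (summing to 1).
    correct: whether the current weak learner classified each sample correctly.
    Returns the learner's vote weight (alpha) and the updated sample weights.
    """
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    # Learners with lower error get a larger say in the final vote.
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correct samples, grow those of misclassified ones.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5  # five samples, equal starting weights
correct = [True, True, True, True, False]  # made-up outcome of the first stump
alpha, new_weights = adaboost_round(weights, correct)
print(f"alpha = {alpha:.3f}")
print([round(w, 3) for w in new_weights])
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner cannot ignore it.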

In [41]:
# Create and train an AdaBoostClassifier.
# Use as base estimator a weak decision tree (max_depth=1) and 100 estimators to really target the specific behaviors of this phenomenon
# Print the train and test accuracy

from sklearn.ensemble import AdaBoostClassifier

# Train AdaBoost
adaboost_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # very simple tree (weak learner)
    n_estimators=100  # number of boosting rounds (trees)
)

# Fit the model on training data
adaboost_model.fit(X_train, y_train)

# Predictions and evaluation
y_train_pred = adaboost_model.predict(X_train)
y_test_pred = adaboost_model.predict(X_test)

# Evaluate performance
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))

Train Accuracy: 0.9118942731277533
Test Accuracy: 0.868421052631579


In [31]:
# AdaBoost increases the training accuracy significantly, but the test accuracy is only slightly better than Bagging. 
# This small gap between train and test suggests that overfitting is coming back, although not as strongly as with a single decision tree.
# AdaBoost focuses on hard examples, which helps the model fit the training data well, but can hurt generalization


You can probably see a good improvement in score, but overfitting is rearing its ugly head again (though not as much as in the base model). This is because the iterative correction of AdaBoost really allows the model to focus on the specifics of this problem, at the cost of overexploiting the dataset.

In [40]:
# Run the same code again a couple of times.
# You can see that the Test Accuracy will mostly be pretty good, even if it sometimes gets lower or higher scores (high variance, low bias)
# You can also see that the Train Accuracy is consistently higher than the Test Accuracy, indicating some (not extreme) overfitting

np.random.seed(None)  # force different randomness on each run

# Train AdaBoost
adaboost_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # very simple tree (weak learner)
    n_estimators=100  # number of boosting rounds (trees)
)

# Fit the model on training data
adaboost_model.fit(X_train, y_train)

# Predictions and evaluation
y_train_pred = adaboost_model.predict(X_train)
y_test_pred = adaboost_model.predict(X_test)

# Evaluate performance
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))


Train Accuracy: 0.9118942731277533
Test Accuracy: 0.868421052631579


In [42]:
# With this setup AdaBoost is effectively deterministic, so running the same code multiple times gives the same result.
# Unlike Bagging, it does not draw random samples of the data. It builds each tree step by step,
# reweighting the examples the previous trees got wrong, so the outcome is stable as long as the input data stays the same.