# Challenge 1

The heart disease dataset is a classic dataset that contains various health metrics (age, sex, chest pain type, blood pressure, cholesterol, etc.) related to diagnosing heart disease (binary classification: presence or absence of heart disease).

In [2]:
# Import necessary libraries for data manipulation and model building
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset (change the path if needed)
df = pd.read_csv('../data/heart.csv')

In [3]:
# Display the first few rows of the dataset to understand its structure
df.head()

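| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |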


We are going to try to predict the presence of heart disease using these features, starting with a classical baseline method and then trying to improve on that result with a series of ensemble approaches.

In [5]:
# Separate the features and target variable
# 'X' contains all columns except for 'target', which are the features we will use to predict heart disease
# 'y' contains the 'target' column, which is the label indicating presence (1) or absence (0) of heart disease
X = df.drop(columns="target")  # Drop the 'target' column to get the feature matrix
y = df["target"]  # Extract the 'target' column as the target variable (labels)

# Train-test split: This is used to separate the dataset into training and testing sets
# The model will be trained on the training set and evaluated on the test set to check its generalization performance
# test_size=0.25 means 25% of the data will be used as the test set, and 75% will be used for training
# random_state=0 ensures that the split is reproducible, so you'll get the same split every time you run this code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling: Scaling matters for models that are sensitive to feature magnitudes (like SVMs, k-nearest neighbours, or regularized logistic regression)
# Decision trees split on one feature at a time, so scaling is not essential for them; we scale here so the same data can be reused with models that do need it
# StandardScaler standardizes the features by removing the mean and scaling them to unit variance
scaler = StandardScaler()

# Fit the scaler on the training data (calculate mean and standard deviation), and transform it to apply scaling
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same transformation to the test data, using the parameters learned from the training set (so the test data is scaled in the same way)
X_test_scaled = scaler.transform(X_test)
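
The comments above note that scaling is not essential for decision trees. A minimal sketch of that claim follows. It uses a synthetic dataset from `make_classification` as a stand-in for `heart.csv` (an assumption, as are all the `*_demo` and `Xd_*` names), fits one tree on raw features and one on standardized features, and measures how often their test predictions agree. The two trees should typically agree on all or nearly all test points.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit the scaler on the training split only, then reuse it for the test split
demo_scaler = StandardScaler().fit(Xd_train)
Xd_train_scaled = demo_scaler.transform(Xd_train)
Xd_test_scaled = demo_scaler.transform(Xd_test)

# Train one tree on raw features and one on standardized features
tree_raw = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(Xd_train_scaled, yd_train)

# Fraction of test points on which the two trees make the same prediction
agreement = (tree_raw.predict(Xd_test) == tree_scaled.predict(Xd_test_scaled)).mean()
print(f"Prediction agreement (raw vs scaled): {agreement:.4f}")
```

Standardization is a monotonic transformation of each feature, so the ordering of the samples along every feature, and therefore the set of candidate splits, stays the same.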


# Baseline model: decision tree

We'll train a decision tree as our baseline model and evaluate it using accuracy.

In [7]:
# Import necessary libraries for model creation and evaluation
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

# Create and initialize a Decision Tree Classifier model
# The 'random_state' ensures reproducibility of results by controlling the randomness
dt_classifier = DecisionTreeClassifier(random_state=0)

# Train the Decision Tree Classifier using the scaled training data
# 'X_train_scaled' contains the features of the training set,
# 'y_train' contains the target labels (heart disease presence)
dt_classifier.fit(X_train_scaled, y_train)

# Make predictions on the training set
# The model predicts the target values based on the features in the training data
y_train_pred = dt_classifier.predict(X_train_scaled)

# Make predictions on the test set
# The model also predicts the target values for the unseen test data
y_test_pred = dt_classifier.predict(X_test_scaled)

# Evaluate the performance of the model

# Accuracy on the training set: this shows how well the model fits the training data
train_accuracy = accuracy_score(y_train, y_train_pred)

# Accuracy on the test set: this shows how well the model generalizes to unseen data
test_accuracy = accuracy_score(y_test, y_test_pred)

# Mean Squared Error on the training set: for 0/1 labels this equals the misclassification rate (1 - accuracy), so lower is better
train_mse = mean_squared_error(y_train, y_train_pred)

# Mean Squared Error on the test set: this measures how well the model predicts the target on unseen data
test_mse = mean_squared_error(y_test, y_test_pred)

# Print the evaluation metrics
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Training MSE: {train_mse:.4f}")
print(f"Test MSE: {test_mse:.4f}")


Training Accuracy: 1.0000
Test Accuracy: 0.7895
Training MSE: 0.0000
Test MSE: 0.2105


We can see that this model is overfitting. This is expected: decision trees, especially deep ones, are notoriously aggressive at exploiting the available data. That also gives them high variance: a small change in the training data can produce a very different tree and a potentially large change in performance.

### Understanding Overfitting in Our Decision Tree Model

As seen in the previous evaluation, the decision tree has achieved **perfect accuracy on the training set (1.0000)** but only **78.95% accuracy on the test set**. This discrepancy is a classic indicator of **overfitting**.

- **Overfitting** happens when a model learns not only the general patterns in the data but also the noise or random fluctuations that may be present in the training set. This makes the model very specific to the training data, and it fails to generalize well to unseen data.
  
- The **Training MSE of 0.0000** indicates that the model perfectly predicts the training data, which is a sign that it has learned every detail of the training set. However, this can be problematic because:
  - The model is not robust to new, unseen examples.
  - It may be too complex and too specific, leading to poor performance on data that isn't identical to what it has already seen.

- On the other hand, the **Test Accuracy of 78.95%** and **Test MSE of 0.2105** demonstrate that the model struggles with unseen data. This is typical of overfitting—while the model works well on training data, it fails to maintain its accuracy when confronted with new data.

### Why Does This Happen with Decision Trees?

Decision trees are prone to overfitting because:
- They can create very complex rules to perfectly classify the training data, especially if the tree is deep or has many branches.
- A deep tree might be too tailored to the training data, resulting in high variance and low bias.

To address this issue, we will explore methods to reduce overfitting, such as:
- **Pruning**: Reducing the size of the tree by limiting its depth or removing unnecessary branches.
- **Ensemble Methods**: Using techniques like Random Forests and Gradient Boosting to combine multiple trees, which can help smooth out individual model fluctuations.

In the next steps, we will try these methods and compare their performance to see if we can improve generalization and reduce overfitting.
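
As a rough illustration of the pruning idea, the sketch below limits `max_depth` and compares training and test accuracy at each depth. It runs on a synthetic dataset from `make_classification` rather than on `heart.csv` (an assumption made so the block is self-contained), and the depth values are arbitrary choices, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, flip_y=0.1, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Map each depth limit to (training accuracy, test accuracy)
depth_results = {}
for depth in [1, 2, 3, 5, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(Xd_train, yd_train)
    depth_results[depth] = (tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))
    train_acc, test_acc = depth_results[depth]
    print(f"max_depth={depth}: train={train_acc:.4f}, test={test_acc:.4f}")
```

The unrestricted tree fits the training split perfectly, while shallower trees trade some training accuracy for a smaller train/test gap. `DecisionTreeClassifier` also offers cost-complexity pruning through `ccp_alpha`, which is not shown here.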


In [10]:
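# Run the same code again a couple of times.
# The Train Accuracy is always 100% (overfitting). Because the train/test split and the tree's random_state are fixed, the Test Accuracy is also identical on every run.
# To see the high variance of the tree, the training data itself has to change (for example, a different random_state in train_test_split).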

# Your code here

# We will now run the same code again to observe the consistency of the model's performance.
# Running the model multiple times allows us to see if there is any significant variation in the results,
# which can indicate issues with overfitting or poor generalization.

# This is how the process looks in the code:
for i in range(5):  # Run the model 5 times to observe the variation
    # Train the Decision Tree again on the scaled training data
    dt_classifier.fit(X_train_scaled, y_train)

    # Make predictions on both training and test data
    y_train_pred = dt_classifier.predict(X_train_scaled)
    y_test_pred = dt_classifier.predict(X_test_scaled)

    # Calculate accuracy for both training and test sets
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    # Print the results for each run
    print(f"Run {i + 1}:")
    print(f"Training Accuracy: {train_accuracy:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print("-" * 30)

# The output will show:
# - Training Accuracy always being 100% (this indicates overfitting)
# - Test Accuracy being identical on every run, because the data split and the tree's random_state are fixed
# - The tree's high variance only becomes visible when it is trained on different samples of the data


Run 1:
Training Accuracy: 1.0000
Test Accuracy: 0.7895
------------------------------
Run 2:
Training Accuracy: 1.0000
Test Accuracy: 0.7895
------------------------------
Run 3:
Training Accuracy: 1.0000
Test Accuracy: 0.7895
------------------------------
Run 4:
Training Accuracy: 1.0000
Test Accuracy: 0.7895
------------------------------
Run 5:
Training Accuracy: 1.0000
Test Accuracy: 0.7895
------------------------------


### Observing Model Variability Across Multiple Runs

After running the decision tree model five times, we observe the following results:

- **Training Accuracy**: Consistently **100%** across all runs. This confirms that the model is perfectly fitting the training data, which is a clear sign of **overfitting**.
- **Test Accuracy**: Remains **78.95%** across all runs. The value does not change because nothing random changes between runs: the train/test split is fixed (`random_state=0` in `train_test_split`) and so is the tree's own `random_state`. Identical results therefore say nothing about how stable the model is.

This result confirms that:
1. The model is **overfitting**: It memorizes the training data and doesn't generalize as well to the test data.
2. Refitting on the same data with the same seed is deterministic. The high variance of a decision tree only shows up when the training data changes, for example with a different train/test split or a bootstrap resample.

In the next steps, we will explore methods to address this overfitting, such as **pruning the decision tree** or using **ensemble methods** like **Random Forests** or **Gradient Boosting**, which are more robust and less prone to overfitting.
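
One way to actually observe the variance of a single tree is to change the data it is trained on. The sketch below varies the seed of `train_test_split` while keeping the tree's own `random_state` fixed. It uses a synthetic dataset as a stand-in for `heart.csv`, and `tree_test_accuracy` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, flip_y=0.1, random_state=0
)


def tree_test_accuracy(split_seed):
    """Train an unrestricted tree on one particular split and return its test accuracy."""
    Xs_train, Xs_test, ys_train, ys_test = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=split_seed
    )
    tree = DecisionTreeClassifier(random_state=0).fit(Xs_train, ys_train)
    return tree.score(Xs_test, ys_test)


# A different split per run: the test accuracy is free to move between runs
test_scores = [tree_test_accuracy(seed) for seed in range(5)]
for seed, score in enumerate(test_scores):
    print(f"split seed {seed}: test accuracy = {score:.4f}")
print(f"mean = {np.mean(test_scores):.4f}, std = {np.std(test_scores):.4f}")

# Repeating a run with the same seed reproduces the same number exactly
same_seed_again = tree_test_accuracy(0)
print(f"split seed 0 again: {same_seed_again:.4f}")
```

The spread of `test_scores` is a rough indication of how sensitive the model is to the particular sample it was trained and evaluated on.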


# Bagging: reducing variance

Bagging (bootstrap aggregating) improves a model by reducing its variance: it trains multiple models on different bootstrap samples of the training data and averages their predictions. This averaging makes the ensemble less sensitive to any one dataset or model, so the final prediction is more stable and less prone to overfitting.

- High-variance models, like decision trees, tend to overfit the training data. This means that small changes in the training data can lead to large changes in the model’s predictions. For example, a decision tree trained on one subset of data might look completely different from a decision tree trained on another subset. This leads to high variance, where the model’s performance fluctuates a lot depending on the specific data it was trained on.
- Once all the individual models are trained, Bagging combines their predictions by averaging them (for regression) or using a majority vote (for classification). The key idea here is that the errors in each individual model are somewhat independent because they are trained on different bootstrap samples. Some models will make errors in one direction, while others might make errors in another. When you average these predictions, the errors partly cancel out, reducing the overall variability (variance) of the final model.
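
The two steps described above, bootstrap sampling and majority voting, can be sketched by hand. This is a simplified illustration on synthetic data (an assumption), not a description of how `BaggingClassifier` is implemented; scikit-learn's version may combine predictions differently, for example by averaging predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, flip_y=0.1, random_state=0
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a majority vote over 0/1 labels can never tie
all_votes = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement, same size as the training set
    idx = rng.integers(0, len(Xb_train), size=len(Xb_train))
    tree = DecisionTreeClassifier(random_state=0).fit(Xb_train[idx], yb_train[idx])
    all_votes.append(tree.predict(Xb_test))

all_votes = np.array(all_votes)  # shape: (n_models, n_test_samples)

# Majority vote for 0/1 labels: predict 1 when more than half of the trees say 1
bagged_pred = (all_votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (bagged_pred == yb_test).mean()

# Accuracy of each individual tree, for comparison with the vote
individual_accuracies = (all_votes == yb_test).mean(axis=1)
print(f"Individual trees: min={individual_accuracies.min():.4f}, "
      f"max={individual_accuracies.max():.4f}")
print(f"Majority vote:    {bagged_accuracy:.4f}")
```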

In [14]:
# Import the BaggingClassifier from scikit-learn
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create a weak decision tree classifier with max_depth=1 as the base estimator (a "stump")
# This weak learner is shallow and simple, helping to avoid overfitting by not capturing too many details
base_dt = DecisionTreeClassifier(max_depth=1, random_state=0)

# Create and train the BaggingClassifier with 100 base estimators (decision trees)
# Bagging will create 100 different bootstrapped subsets of the training data, each with its own decision tree
bagging_model = BaggingClassifier(estimator=base_dt, n_estimators=100, random_state=0)

# Train the BaggingClassifier using the scaled training data
bagging_model.fit(X_train_scaled, y_train)

# Make predictions on both the training and test sets
y_train_pred = bagging_model.predict(X_train_scaled)
y_test_pred = bagging_model.predict(X_test_scaled)

# Evaluate performance using accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Print the performance metrics for both training and test sets
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")


Training Accuracy: 0.8194
Test Accuracy: 0.8421


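You can probably see a modest improvement in test score, and the gap between training and test accuracy is gone. Note that two things changed compared to the baseline: the base estimator is now a depth-1 stump, which is too simple to memorize the training data, and the predictions of 100 such stumps trained on different bootstrap samples are combined, which stabilizes the result.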

### Understanding the Impact of Bagging and the Results

After applying Bagging with weak decision trees as base models, we observe some interesting changes in the model’s performance:

- **Training Accuracy of 0.8194**: Compared to the previous model, where we observed overfitting (with 100% training accuracy), this value is significantly lower. A depth-1 stump cannot memorize the training data, so the ensemble no longer fits it perfectly. The small gap between training and test accuracy indicates that the model is no longer overfitting, which is often a major concern with decision trees.

- **Test Accuracy of 0.8421**: The test accuracy is slightly higher than the training accuracy. This can happen by chance when a model does not overfit and the test set is small, so the difference should not be over-interpreted. What matters is that the ensemble performs better on unseen data than the single deep tree (0.8421 versus 0.7895).

#### Why Bagging Helps

Bagging works by creating multiple bootstrapped datasets (random samples with replacement from the original data) and training separate models on each. This reduces the model’s variance because:
- Each individual model is trained on a slightly different subset of the data, and by averaging their predictions (for regression) or using a majority vote (for classification), the ensemble smooths out individual model fluctuations.
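- In this case, the base models are **weak decision trees** (depth = 1). Stumps have high bias and relatively low variance, so most of the reduction in overfitting comes from their simplicity; bagging then stabilizes whatever variance remains. Bagging gives its largest variance reduction when the base models are deep, high-variance trees.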

This results in a **more robust model** that is less sensitive to fluctuations in the training data, leading to more consistent and reliable performance on both the training and test sets.

In the next steps, we may further explore different ensemble methods like **Random Forests** or **Gradient Boosting**, which build on the principles of Bagging but introduce even further enhancements for model performance.


In [17]:
# Run the same code again a couple of times. 
# You can see that consistently the Train Accuracy is close to the Test Accuracy.

# Your code here

# Run the Bagging model multiple times to observe the consistency of performance

for i in range(5):  # Run the model 5 times to observe variability
    # Train the BaggingClassifier again on the scaled training data
    bagging_model.fit(X_train_scaled, y_train)

    # Make predictions on both the training and test sets
    y_train_pred = bagging_model.predict(X_train_scaled)
    y_test_pred = bagging_model.predict(X_test_scaled)

    # Calculate accuracy for both training and test sets
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    # Print the results for each run
    print(f"Run {i + 1}:")
    print(f"Training Accuracy: {train_accuracy:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print("-" * 30)


Run 1:
Training Accuracy: 0.8194
Test Accuracy: 0.8421
------------------------------
Run 2:
Training Accuracy: 0.8194
Test Accuracy: 0.8421
------------------------------
Run 3:
Training Accuracy: 0.8194
Test Accuracy: 0.8421
------------------------------
Run 4:
Training Accuracy: 0.8194
Test Accuracy: 0.8421
------------------------------
Run 5:
Training Accuracy: 0.8194
Test Accuracy: 0.8421
------------------------------


### Consistency in Model Performance

After running the Bagging model multiple times, we observe the following results:

- **Training Accuracy** remains consistently at **0.8194** across all five runs.
- **Test Accuracy** also stays constant at **0.8421** for each run.

The training and test accuracy are close to each other, which suggests that the model has effectively reduced overfitting, which was a concern earlier with the single decision tree model.

The identical numbers across the five runs, however, are a consequence of the fixed `random_state=0` in both the data split and the **BaggingClassifier**: refitting with the same seed on the same data is deterministic. To check that bagging really is more stable than a single tree, the runs would need different seeds or different train/test splits.

With that caveat, the small train/test gap is consistent with an ensemble that generalizes well to unseen data.


# Boosting: reducing bias

Now we’ll apply AdaBoost with decision trees as weak learners. This will sequentially improve the model by focusing on difficult cases.

Boosting reduces bias by sequentially training a series of weak learners (often simple models like decision trees) where each subsequent model focuses on the mistakes made by the previous models. The key idea behind boosting is to incrementally improve the model by correcting errors, which helps to reduce bias, especially when the initial model is too simple and underfits the data.

- Boosting typically uses weak learners, which are models that perform only slightly better than random guessing. For example, in classification, a weak learner might be a very shallow decision tree, such as a "stump" with a single split. Weak learners usually have high bias, meaning they are too simplistic and don't capture the underlying patterns in the data well. As a result, they underfit the data.

- In each iteration, boosting trains a new model that tries to correct the errors made by the earlier models. If an instance was misclassified by the first weak learner, it will receive a higher weight, so the next model pays more attention to it. As the sequence of models progresses, the ensemble collectively focuses more on the difficult-to-predict instances. Over time, the combined models become better at fitting the data, as they successively reduce the bias (systematic error) by adjusting for earlier mistakes.
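
The reweighting loop described above can be sketched by hand for the binary case. This is a simplified version of discrete AdaBoost with labels mapped to -1/+1, run on synthetic data; it is meant to illustrate the idea and is not a reproduction of scikit-learn's `AdaBoostClassifier`. All names in the block are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, flip_y=0.1, random_state=0
)
y_signed = np.where(y_demo == 1, 1, -1)  # AdaBoost's update is simplest with -1/+1 labels

n_rounds = 20
weights = np.full(len(X_demo), 1.0 / len(X_demo))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Fit a stump that pays more attention to heavily weighted samples
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_signed, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this stump (clipped to keep the logarithm finite)
    err = np.clip(weights[pred != y_signed].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # the stump's say in the final vote

    # Increase the weight of misclassified samples, decrease the rest, renormalize
    weights *= np.exp(-alpha * y_signed * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump predictions
ensemble_score = np.zeros(len(X_demo))
for alpha, stump in zip(alphas, stumps):
    ensemble_score += alpha * stump.predict(X_demo)
boosted_pred = np.sign(ensemble_score)

first_stump_acc = (stumps[0].predict(X_demo) == y_signed).mean()
boosted_train_acc = (boosted_pred == y_signed).mean()
print(f"First stump alone, training accuracy: {first_stump_acc:.4f}")
print(f"After {n_rounds} rounds, training accuracy: {boosted_train_acc:.4f}")
```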

In [20]:
# Import the AdaBoostClassifier from scikit-learn
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create a weak decision tree classifier with max_depth=1 as the base estimator (a "stump")
# This weak learner is a shallow decision tree that is prone to underfitting.
# It's simple enough to not capture all the patterns in the data, which means it has high bias.
base_dt = DecisionTreeClassifier(max_depth=1, random_state=0)

# Create and train the AdaBoostClassifier with 100 estimators (weak learners)
# AdaBoost will sequentially train 100 weak decision trees, each one focusing on the mistakes of the previous model.
# The 'algorithm' parameter is set to 'SAMME' to avoid the deprecated 'SAMME.R' algorithm. Newer scikit-learn releases may deprecate the 'algorithm' parameter itself, so check the documentation for your installed version.
adaboost_model = AdaBoostClassifier(estimator=base_dt, n_estimators=100, algorithm='SAMME', random_state=0)

# Train the AdaBoost model using the scaled training data.
# The 'fit' function trains the model on the scaled features (X_train_scaled) and the target labels (y_train).
adaboost_model.fit(X_train_scaled, y_train)

# Make predictions on both the training and test sets.
# The model makes predictions based on the features of both training and test data.
y_train_pred = adaboost_model.predict(X_train_scaled)
y_test_pred = adaboost_model.predict(X_test_scaled)

# Evaluate performance using accuracy.
# Accuracy is a simple metric that measures the percentage of correct predictions.
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Print the performance metrics for both training and test sets.
# This will give us an idea of how well the model is performing on both the training data and unseen test data.
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")


Training Accuracy: 0.9119
Test Accuracy: 0.8684


You can probably see a good improvement in score, but overfitting is rearing its ugly head again (though not as much as in the baseline model). This is because the iterative correction of AdaBoost allows the model to focus on the specifics of this problem, at the cost of over-exploiting the training set.

### Analyzing the Results of AdaBoost and Its Impact on Overfitting

After running AdaBoost, we observe a noticeable improvement in the model's performance:

- **Training Accuracy**: **0.9119**, which is higher than the 0.8194 of the bagged stumps but lower than the 100% of the baseline decision tree.
- **Test Accuracy**: **0.8684**, the best test result so far (baseline tree: 0.7895, bagging: 0.8421), although lower than the training accuracy. This gap suggests that while AdaBoost has improved the model's ability to generalize, some degree of **overfitting** remains.

#### Why Does Overfitting Occur in AdaBoost?

AdaBoost reduces **bias** by sequentially training a series of weak learners, where each model attempts to correct the mistakes of the previous one. While this approach helps the model focus on the more difficult cases and progressively improve its predictions, it also introduces the risk of overfitting:

- **Focus on Difficult Cases**: AdaBoost places more weight on misclassified instances. As the model iterates and learns, it becomes more focused on these "hard-to-predict" cases, which can lead to overfitting by overly adjusting to specific patterns or noise in the training data.
  
- **Higher Training Accuracy**: As AdaBoost continues to improve its predictions, the training accuracy increases, sometimes leading to perfect or near-perfect performance. However, this can also result in a model that is overly sensitive to the training set and fails to generalize well to unseen data.

- **Test Accuracy Gap**: The relatively smaller gap between training and test accuracy in AdaBoost compared to the decision tree model suggests that AdaBoost has done a better job of generalizing. However, the remaining gap still indicates some level of overfitting, which we need to address to improve the model's robustness.

In the next steps, we may explore methods like **early stopping** or experiment with **parameter tuning** to balance the bias-variance trade-off and reduce overfitting further.


In [23]:
# Run the same code again a couple of times. 
# With random_state=0 and a fixed train/test split the results are identical on every run; vary the seed or the split to see how much the Test Accuracy moves
# You can also see that the Train Accuracy is consistently higher than the Test Accuracy, indicating some (not extreme) overfitting

# Your code here

# Run the AdaBoost model multiple times to observe consistency in performance
for i in range(5):  # Run the model 5 times to observe variability
    # Train the AdaBoost model again on the scaled training data
    adaboost_model.fit(X_train_scaled, y_train)

    # Make predictions on both the training and test sets
    y_train_pred = adaboost_model.predict(X_train_scaled)
    y_test_pred = adaboost_model.predict(X_test_scaled)

    # Calculate accuracy for both training and test sets
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    # Print the results for each run
    print(f"Run {i + 1}:")
    print(f"Training Accuracy: {train_accuracy:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print("-" * 30)


Run 1:
Training Accuracy: 0.9119
Test Accuracy: 0.8684
------------------------------
Run 2:
Training Accuracy: 0.9119
Test Accuracy: 0.8684
------------------------------
Run 3:
Training Accuracy: 0.9119
Test Accuracy: 0.8684
------------------------------
Run 4:
Training Accuracy: 0.9119
Test Accuracy: 0.8684
------------------------------
Run 5:
Training Accuracy: 0.9119
Test Accuracy: 0.8684
------------------------------


### Observing Variability and Overfitting in AdaBoost

After running the AdaBoost model five times, we observe the following results:

- **Training Accuracy** remains at **0.9119** across all runs, showing that the model fits the training data well.
- **Test Accuracy** remains at **0.8684** in every run. There is no fluctuation, because the train/test split and the model's `random_state` are fixed.

From these results we can note:
1. **Overfitting**: The training accuracy is higher than the test accuracy, which suggests the model is overfitting to the training data. AdaBoost has managed to reduce bias, but it still exhibits a tendency to overfit, especially as the model tries to adjust to harder examples in the training set.
2. **Variance**: Because every run is deterministic, these five runs cannot show how much the model's performance varies. Measuring that requires training on different samples of the data, for example with different train/test splits.

Given these results, the AdaBoost model achieves the best test accuracy of the three models, though there is room for further improvement by tuning parameters or applying regularization techniques to reduce the overfitting.

In the next steps, we may explore strategies like **early stopping** or **cross-validation** to refine the model and achieve more robust performance.
