# Challenge 1

The heart disease dataset is a classic dataset that contains various health metrics (age, sex, chest pain type, blood pressure, cholesterol, etc.) related to diagnosing heart disease (binary classification: presence or absence of heart disease).

In [2]:
# Import necessary libraries for data manipulation and model building
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset (change the path if needed)
df = pd.read_csv('../data/heart.csv')

# Check the shape of the dataset (optional)
print(df.shape)

(303, 14)


### Dataset Shape

After loading the dataset, we checked its shape using `df.shape`. The output indicates that the dataset contains **303 rows** and **14 columns**. 

This means there are 303 instances (samples) and 14 columns: 13 features plus the target variable. We will use these features to build our model and predict the presence or absence of heart disease.


In [4]:
# Display the first few rows of the dataset to understand its structure
df.head()

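|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |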


In [5]:
# Check for missing values in the dataset
# This will help ensure the dataset doesn't have any missing values before we start training the models
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

### Checking for Missing Values

After displaying the initial rows of the dataset, we check for missing values before training any model. `df.isnull().sum()` returns zero for every column, so there are no missing entries and no imputation or row removal is needed.



We are going to try to predict the presence of heart disease using these features, starting with a classical baseline method and trying to improve on that result with a series of ensemble approaches.

In [8]:
# Separate the features and target variable
# 'X' contains all columns except for 'target', which are the features we will use to predict heart disease
# 'y' contains the 'target' column, which is the label indicating presence (1) or absence (0) of heart disease
X = df.drop(columns="target")  # Drop the 'target' column to get the feature matrix
y = df["target"]  # Extract the 'target' column as the target variable (labels)

# Train-test split: This is used to separate the dataset into training and testing sets
# The model will be trained on the training set and evaluated on the test set to check its generalization performance
# test_size=0.25 means 25% of the data will be used as the test set, and 75% will be used for training
# random_state=0 ensures that the split is reproducible, so you'll get the same split every time you run this code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling: useful for models that are sensitive to feature magnitudes (e.g. SVMs, k-NN, regularized logistic regression)
# Decision trees split on per-feature thresholds, so scaling does not meaningfully change them; we scale here so the same data can be reused by other models
# StandardScaler standardizes the features by removing the mean and scaling them to unit variance
scaler = StandardScaler()

# Fit the scaler on the training data (calculate mean and standard deviation), and transform it to apply scaling
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same transformation to the test data, using the parameters learned from the training set (so the test data is scaled in the same way)
X_test_scaled = scaler.transform(X_test)
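
The reason for fitting the scaler on the training data only can be sketched in plain Python. This is a minimal illustration of standardization, not scikit-learn's implementation; the helper names `fit_standardizer` and `standardize` and the sample values are assumptions made for this example.

```python
import statistics

def fit_standardizer(column):
    """Return (mean, std) computed from the training values only."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (ddof=0), as StandardScaler uses
    if std == 0:
        std = 1.0  # avoid division by zero for constant columns
    return mean, std

def standardize(column, mean, std):
    """Apply a previously fitted mean/std to any column of values."""
    return [(x - mean) / std for x in column]

# Illustrative age values (hypothetical train/test assignment)
train_age = [63, 37, 41, 56, 57]
test_age = [44, 52]

mean, std = fit_standardizer(train_age)            # statistics come from train only
train_scaled = standardize(train_age, mean, std)
test_scaled = standardize(test_age, mean, std)     # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and unit variance, while the test column generally does not: it is expressed relative to the training distribution, so no information from the test set leaks into preprocessing.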


# Baseline model: decision tree

We'll train a decision tree as our baseline model and evaluate it using accuracy.

In [10]:
# Import necessary libraries for model creation and evaluation
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

# Create and initialize a Decision Tree Classifier model
# The 'random_state' ensures reproducibility of results by controlling the randomness
dt_classifier = DecisionTreeClassifier(random_state=0)

# Train the Decision Tree Classifier using the scaled training data
# 'X_train_scaled' contains the features of the training set,
# 'y_train' contains the target labels (heart disease presence)
dt_classifier.fit(X_train_scaled, y_train)

# Make predictions on the training set
y_train_pred = dt_classifier.predict(X_train_scaled)

# Make predictions on the test set
y_test_pred = dt_classifier.predict(X_test_scaled)

# Evaluate the performance of the model
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

# Print the evaluation metrics for the baseline decision tree
print(f"Baseline Training Accuracy: {train_accuracy:.4f}")
print(f"Baseline Test Accuracy: {test_accuracy:.4f}")
print(f"Baseline Training MSE: {train_mse:.4f}")
print(f"Baseline Test MSE: {test_mse:.4f}")

# Now let's add a pruned decision tree to reduce overfitting

# Create and initialize a pruned Decision Tree Classifier model
# Limiting the depth of the tree to 3 to prevent overfitting
dt_classifier_pruned = DecisionTreeClassifier(max_depth=3, random_state=0)

# Train the pruned Decision Tree Classifier
dt_classifier_pruned.fit(X_train_scaled, y_train)

# Make predictions on the training set for the pruned tree
y_train_pred_pruned = dt_classifier_pruned.predict(X_train_scaled)

# Make predictions on the test set for the pruned tree
y_test_pred_pruned = dt_classifier_pruned.predict(X_test_scaled)

# Evaluate the performance of the pruned model
train_accuracy_pruned = accuracy_score(y_train, y_train_pred_pruned)
test_accuracy_pruned = accuracy_score(y_test, y_test_pred_pruned)
train_mse_pruned = mean_squared_error(y_train, y_train_pred_pruned)
test_mse_pruned = mean_squared_error(y_test, y_test_pred_pruned)

# Print the evaluation metrics for the pruned decision tree
print(f"Pruned Training Accuracy: {train_accuracy_pruned:.4f}")
print(f"Pruned Test Accuracy: {test_accuracy_pruned:.4f}")
print(f"Pruned Training MSE: {train_mse_pruned:.4f}")
print(f"Pruned Test MSE: {test_mse_pruned:.4f}")

Baseline Training Accuracy: 1.0000
Baseline Test Accuracy: 0.7895
Baseline Training MSE: 0.0000
Baseline Test MSE: 0.2105
Pruned Training Accuracy: 0.8546
Pruned Test Accuracy: 0.7632
Pruned Training MSE: 0.1454
Pruned Test MSE: 0.2368


We can see that this model is overfitting. This is expected: decision trees, especially deep ones, are notoriously aggressive at exploiting the available data. That also makes them high-variance models: a small change in the training data can produce a very different tree and potentially large changes in performance.

### Understanding Overfitting in Our Decision Tree Model

As seen in the previous evaluation, the decision tree has achieved **perfect accuracy on the training set (1.0000)** but only **78.95% accuracy on the test set**. This discrepancy is a classic indicator of **overfitting**, which was expected due to the complexity of the model.

- **Overfitting** happens when a model learns not only the general patterns in the data but also the noise or random fluctuations that may be present in the training set. This makes the model very specific to the training data, and it fails to generalize well to unseen data.

- The **Training MSE of 0.0000** for the baseline model indicates that the model perfectly predicts the training data, which is a sign of overfitting. The model is not robust to new, unseen examples and is too complex, capturing unnecessary patterns in the training data.

- The **pruned model** (max_depth=3) narrows the gap between training and test performance, but does not improve the test score on this split:
  - **Training Accuracy** is **0.8546**, lower than the baseline's 1.0000, which shows reduced overfitting.
  - **Test Accuracy** is **0.7632**, slightly lower than the baseline model’s test accuracy (76.32% vs. 78.95%).
  - **Training MSE** (0.1454) and **Test MSE** (0.2368) are both higher than the baseline's (0.0000 and 0.2105). For 0/1 labels, MSE equals 1 - accuracy, so it carries the same information as accuracy.
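
The accuracy and MSE figures above always sum to 1. For 0/1 labels, each squared error is 1 for a wrong prediction and 0 for a correct one, so MSE is simply the misclassification rate. A minimal sketch with made-up labels (the helper functions and the label lists are illustrative, not taken from the notebook):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error; for 0/1 labels each term is 0 or 1."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 8 samples, 2 mistakes
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
err = mse(y_true, y_pred)
print(acc, err, acc + err)
```

Reporting both metrics for a binary classifier is therefore redundant; accuracy alone (or a confusion matrix) conveys the same result.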

### Why Does This Happen with Decision Trees?

Decision trees are prone to overfitting because:
- They can create very complex rules to perfectly classify the training data, especially if the tree is deep or has many branches.
- A deep tree might be too tailored to the training data, resulting in high variance and low bias.

To address this issue, we used **pruning** by limiting the tree’s depth to 3, which reduces its complexity. On this split, the train/test gap shrinks from about 21 percentage points to about 9, but the pruned tree's test accuracy (0.7632) is slightly below the unpruned tree's (0.7895), so a single depth-limited tree does not improve performance on unseen data here.

In the next steps, we explore **ensemble methods**, which combine multiple trees to reduce variance (Bagging) or bias (Boosting). Related techniques such as **Random Forests** and **Gradient Boosting** could be explored if time allows.


In [13]:
# Import necessary libraries for model creation and evaluation
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

# Create and initialize a Decision Tree Classifier model
# The 'random_state' ensures reproducibility of results by controlling the randomness
dt_classifier = DecisionTreeClassifier(random_state=0)

# Re-fit the baseline decision tree several times to check how stable its results are.
# Training accuracy is always 100% (overfitting).
# Note: with random_state=0 and a fixed train/test split, every fit is deterministic,
# so the five runs below print identical numbers. To observe the tree's high variance,
# the training data (e.g. the split's random_state) would have to change between runs.
for i in range(5):  # Run the baseline model 5 times to observe the variation
    # Train the baseline Decision Tree again on the scaled training data
    dt_classifier.fit(X_train_scaled, y_train)

    # Make predictions on both training and test data
    y_train_pred = dt_classifier.predict(X_train_scaled)
    y_test_pred = dt_classifier.predict(X_test_scaled)

    # Calculate accuracy for both training and test sets
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    # Calculate Mean Squared Error (MSE) for both training and test sets
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)

    # Print the results for each run
    print(f"Run {i + 1}:")
    print(f"Training Accuracy: {train_accuracy:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Training MSE: {train_mse:.4f}")
    print(f"Test MSE: {test_mse:.4f}")
    print("-" * 30)


Run 1:
Training Accuracy: 1.0000
Test Accuracy: 0.7895
Training MSE: 0.0000
Test MSE: 0.2105
------------------------------
Run 2:
Training Accuracy: 1.0000
Test Accuracy: 0.7895
Training MSE: 0.0000
Test MSE: 0.2105
------------------------------
Run 3:
Training Accuracy: 1.0000
Test Accuracy: 0.7895
Training MSE: 0.0000
Test MSE: 0.2105
------------------------------
Run 4:
Training Accuracy: 1.0000
Test Accuracy: 0.7895
Training MSE: 0.0000
Test MSE: 0.2105
------------------------------
Run 5:
Training Accuracy: 1.0000
Test Accuracy: 0.7895
Training MSE: 0.0000
Test MSE: 0.2105
------------------------------


In [14]:
# Re-fit the pruned decision tree (max_depth=3) several times.
# Training accuracy is lower than for the unpruned tree, indicating less overfitting.
# As before, random_state=0 and the fixed split make every fit deterministic,
# so all five runs print identical numbers.

for i in range(5):  # Run the pruned model 5 times to observe the variation
    # Train the pruned Decision Tree again on the scaled training data
    dt_classifier_pruned.fit(X_train_scaled, y_train)

    # Make predictions on both training and test data
    y_train_pred_pruned = dt_classifier_pruned.predict(X_train_scaled)
    y_test_pred_pruned = dt_classifier_pruned.predict(X_test_scaled)

    # Calculate accuracy for both training and test sets
    train_accuracy_pruned = accuracy_score(y_train, y_train_pred_pruned)
    test_accuracy_pruned = accuracy_score(y_test, y_test_pred_pruned)

    # Calculate Mean Squared Error (MSE) for both training and test sets
    train_mse_pruned = mean_squared_error(y_train, y_train_pred_pruned)
    test_mse_pruned = mean_squared_error(y_test, y_test_pred_pruned)

    # Print the results for each run
    print(f"Run {i + 1}:")
    print(f"Pruned Training Accuracy: {train_accuracy_pruned:.4f}")
    print(f"Pruned Test Accuracy: {test_accuracy_pruned:.4f}")
    print(f"Pruned Training MSE: {train_mse_pruned:.4f}")
    print(f"Pruned Test MSE: {test_mse_pruned:.4f}")
    print("-" * 30)


Run 1:
Pruned Training Accuracy: 0.8546
Pruned Test Accuracy: 0.7632
Pruned Training MSE: 0.1454
Pruned Test MSE: 0.2368
------------------------------
Run 2:
Pruned Training Accuracy: 0.8546
Pruned Test Accuracy: 0.7632
Pruned Training MSE: 0.1454
Pruned Test MSE: 0.2368
------------------------------
Run 3:
Pruned Training Accuracy: 0.8546
Pruned Test Accuracy: 0.7632
Pruned Training MSE: 0.1454
Pruned Test MSE: 0.2368
------------------------------
Run 4:
Pruned Training Accuracy: 0.8546
Pruned Test Accuracy: 0.7632
Pruned Training MSE: 0.1454
Pruned Test MSE: 0.2368
------------------------------
Run 5:
Pruned Training Accuracy: 0.8546
Pruned Test Accuracy: 0.7632
Pruned Training MSE: 0.1454
Pruned Test MSE: 0.2368
------------------------------


### Observing Model Variability Across Multiple Runs

After running both the **baseline decision tree** and the **pruned decision tree** models five times, we observe the following results:

#### Baseline Decision Tree (Unpruned):
- **Training Accuracy**: Consistently **100%** across all runs. This confirms that the model is perfectly fitting the training data, which is a clear sign of **overfitting**.
- **Test Accuracy**: Remains **78.95%** across all runs. This indicates that the model is not generalizing well to unseen data, which is expected with overfitting.
- **Training MSE**: Consistently **0.0000**, confirming that the model perfectly predicts the training data, further indicating overfitting.
- **Test MSE**: Remains **0.2105** across all runs, showing that the model has significant error when applied to the test set.

#### Pruned Decision Tree (max_depth=3):
- **Training Accuracy**: Consistently **0.8546** across all runs, indicating reduced overfitting compared to the baseline model.
- **Test Accuracy**: Remains **0.7632** across all runs, slightly below the baseline model's 78.95%, so limiting the depth did not improve generalization on this split.
- **Training MSE**: Consistently **0.1454**, showing that the pruned model no longer perfectly fits the training data, indicating less overfitting.
- **Test MSE**: Remains **0.2368**, slightly higher than the baseline model's 0.2105.

These results show that:
1. The **baseline model is overfitting**: it memorizes the training data (100% training accuracy) but reaches only 78.95% on the test data.
2. The **pruned model** overfits less (the train/test gap shrinks from about 21 to about 9 percentage points), but its test performance is slightly worse on this split.
3. The five runs are identical for both models because `random_state=0` and the train/test split are fixed, so every fit is deterministic. Repeating the fit this way cannot reveal a model's variance; for that, the training data must change between runs.

In the next sections we turn to ensemble methods: Bagging to reduce variance and Boosting to reduce bias. Related methods such as **Random Forests** or **Gradient Boosting** could be explored later if time permits.


# Bagging: reducing variance

Bagging (bootstrap aggregating) improves models by reducing variance: it trains multiple models on different bootstrap samples of the training data and averages their predictions. This averaging reduces the sensitivity of the overall model to any one training sample or individual model, making the final prediction more stable and less prone to overfitting.

- High-variance models, like decision trees, tend to overfit the training data. This means that small changes in the training data can lead to large changes in the model’s predictions. For example, a decision tree trained on one subset of data might look completely different from a decision tree trained on another subset. This leads to high variance, where the model’s performance fluctuates a lot depending on the specific data it was trained on.
- Once all the individual models are trained, Bagging combines their predictions by averaging them (for regression) or using a majority vote (for classification). The key idea here is that the errors in each individual model are somewhat independent because they are trained on different bootstrap samples. Some models will make errors in one direction, while others might make errors in another. When you average these predictions, the errors cancel out, reducing the overall variability (variance) of the final model.
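
The error-cancelling effect of majority voting can be illustrated with a small simulation. It assumes an idealized setting in which every voter is correct with the same probability and all voters err independently; trees trained on bootstrap samples are correlated, so the real gain from Bagging is smaller than what this sketch shows. The function name and parameters are hypothetical.

```python
import random

def majority_vote_accuracy(n_voters, p_correct, n_trials, rng):
    """Estimate how often a majority of independent voters is correct."""
    wins = 0
    for _ in range(n_trials):
        # Count how many voters get this (simulated) sample right
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / n_trials

rng = random.Random(0)
single = majority_vote_accuracy(n_voters=1, p_correct=0.7, n_trials=2000, rng=rng)
ensemble = majority_vote_accuracy(n_voters=101, p_correct=0.7, n_trials=2000, rng=rng)

print(f"Single voter accuracy:   {single:.3f}")
print(f"Majority of 101 voters:  {ensemble:.3f}")
```

Under these assumptions the ensemble is far more accurate than any single voter, which is the intuition behind combining many high-variance trees.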

In [18]:
# Import the BaggingClassifier from scikit-learn
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create a weak decision tree classifier with max_depth=1 as the base estimator (a "stump")
# This weak learner is shallow and simple, helping to avoid overfitting by not capturing too many details
base_dt = DecisionTreeClassifier(max_depth=1, random_state=0)

# Create and train the BaggingClassifier with 100 base estimators (decision trees)
# Bagging will create 100 different bootstrapped subsets of the training data, each with its own decision tree
bagging_model = BaggingClassifier(estimator=base_dt, n_estimators=100, random_state=0)

# Train the BaggingClassifier using the scaled training data (baseline model)
bagging_model.fit(X_train_scaled, y_train)

# Make predictions on both the training and test sets
y_train_pred = bagging_model.predict(X_train_scaled)
y_test_pred = bagging_model.predict(X_test_scaled)

# Evaluate performance using accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Print the performance metrics for both training and test sets
print(f"Training Accuracy (100 Estimators, Baseline Model): {train_accuracy:.4f}")
print(f"Test Accuracy (100 Estimators, Baseline Model): {test_accuracy:.4f}")

# Now let's add a version of the Bagging model with more estimators
bagging_model_more = BaggingClassifier(estimator=base_dt, n_estimators=200, random_state=0)
bagging_model_more.fit(X_train_scaled, y_train)

# Make predictions with the model using more estimators
y_train_pred_more = bagging_model_more.predict(X_train_scaled)
y_test_pred_more = bagging_model_more.predict(X_test_scaled)

# Evaluate performance with more estimators
train_accuracy_more = accuracy_score(y_train, y_train_pred_more)
test_accuracy_more = accuracy_score(y_test, y_test_pred_more)

# Print the performance metrics for both training and test sets
print(f"Training Accuracy (200 Estimators, Baseline Model): {train_accuracy_more:.4f}")
print(f"Test Accuracy (200 Estimators, Baseline Model): {test_accuracy_more:.4f}")


Training Accuracy (100 Estimators, Baseline Model): 0.8194
Test Accuracy (100 Estimators, Baseline Model): 0.8421
Training Accuracy (200 Estimators, Baseline Model): 0.8194
Test Accuracy (200 Estimators, Baseline Model): 0.8289


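The test score improves modestly (0.8421 vs. 0.7895 for the single tree) and, more importantly, the gap between training and test accuracy is mostly gone. Part of this comes from the shallow base estimator (max_depth=1), which cannot memorize the training data, and part from averaging over many bootstrap samples, which stabilizes the ensemble's predictions.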

In [20]:
# Create a pruned decision tree classifier with max_depth=3 as the base estimator
# This pruned model will help reduce overfitting compared to the baseline model
base_dt_pruned = DecisionTreeClassifier(max_depth=3, random_state=0)

# Create and train the BaggingClassifier with 100 base estimators (decision trees)
# Bagging will create 100 different bootstrapped subsets of the training data, each with its own pruned decision tree
bagging_model_pruned = BaggingClassifier(estimator=base_dt_pruned, n_estimators=100, random_state=0)

# Train the BaggingClassifier using the scaled training data (pruned model)
bagging_model_pruned.fit(X_train_scaled, y_train)

# Make predictions on both the training and test sets
y_train_pred_pruned = bagging_model_pruned.predict(X_train_scaled)
y_test_pred_pruned = bagging_model_pruned.predict(X_test_scaled)

# Evaluate performance using accuracy
train_accuracy_pruned = accuracy_score(y_train, y_train_pred_pruned)
test_accuracy_pruned = accuracy_score(y_test, y_test_pred_pruned)

# Print the performance metrics for both training and test sets
print(f"Training Accuracy (100 Estimators, Pruned Model): {train_accuracy_pruned:.4f}")
print(f"Test Accuracy (100 Estimators, Pruned Model): {test_accuracy_pruned:.4f}")

# Now let's add a version of the Bagging model with more estimators for the pruned model
bagging_model_more_pruned = BaggingClassifier(estimator=base_dt_pruned, n_estimators=200, random_state=0)
bagging_model_more_pruned.fit(X_train_scaled, y_train)

# Make predictions with the model using more estimators
y_train_pred_more_pruned = bagging_model_more_pruned.predict(X_train_scaled)
y_test_pred_more_pruned = bagging_model_more_pruned.predict(X_test_scaled)

# Evaluate performance with more estimators for the pruned model
train_accuracy_more_pruned = accuracy_score(y_train, y_train_pred_more_pruned)
test_accuracy_more_pruned = accuracy_score(y_test, y_test_pred_more_pruned)

# Print the performance metrics for both training and test sets
print(f"Training Accuracy (200 Estimators, Pruned Model): {train_accuracy_more_pruned:.4f}")
print(f"Test Accuracy (200 Estimators, Pruned Model): {test_accuracy_more_pruned:.4f}")


Training Accuracy (100 Estimators, Pruned Model): 0.8943
Test Accuracy (100 Estimators, Pruned Model): 0.8421
Training Accuracy (200 Estimators, Pruned Model): 0.8899
Test Accuracy (200 Estimators, Pruned Model): 0.8289


### Understanding the Impact of Bagging and the Results

After applying Bagging, we observe the following changes in performance for the two base estimators: the **decision stump** (max_depth=1, labeled "Baseline Model" in the output) and the **pruned decision tree** (max_depth=3):

#### Results with 100 Estimators:
- **Decision Stump (max_depth=1)**:
  - **Training Accuracy of 0.8194**: Compared to the single unpruned tree, which reached 100% training accuracy, this value is significantly lower. The ensemble no longer memorizes the training data, which is often a major concern with decision trees.
  - **Test Accuracy of 0.8421**: The test accuracy is higher than both the single tree's (0.7895) and the ensemble's own training accuracy. A test score above the training score is uncommon and, with only 76 test samples, may well be sampling noise, but it does indicate that the ensemble is not overfitting.

- **Pruned Decision Tree (max_depth=3)**:
  - **Training Accuracy of 0.8943**: Higher than the stump ensemble's 0.8194, because deeper trees fit the training data more closely, but still well below the single unpruned tree's 100%.
  - **Test Accuracy of 0.8421**: Equal to the stump ensemble's test accuracy. The extra depth raises training accuracy without improving the test score, so the train/test gap is larger (about 5 percentage points).

#### Results with 200 Estimators:
- **Decision Stump (max_depth=1)**:
  - **Training Accuracy of 0.8194**: Unchanged when we increase the number of estimators.
  - **Test Accuracy of 0.8289**: Slightly lower than with 100 estimators (0.8421). The difference corresponds to a single test sample (63 vs. 64 correct out of 76), so it should not be read as a real change.

- **Pruned Decision Tree (max_depth=3)**:
  - **Training Accuracy of 0.8899**: A slight decrease from 0.8943, corresponding to one training sample out of 227.
  - **Test Accuracy of 0.8289**: Also one test sample lower than with 100 estimators. Doubling the number of estimators brings no measurable benefit here, which suggests the ensemble had already stabilized at 100.
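
When comparing accuracies on a small test set, it helps to convert them back into counts of correct predictions. The sketch below does this arithmetic for the split used here (303 rows, `test_size=0.25`); the helper `implied_correct` is introduced only for this illustration.

```python
import math

n_samples = 303
n_test = math.ceil(0.25 * n_samples)   # scikit-learn rounds the test size up: 76
n_train = n_samples - n_test           # 227

def implied_correct(accuracy, n):
    """Number of correct predictions implied by a rounded accuracy."""
    return round(accuracy * n)

test_100 = implied_correct(0.8421, n_test)   # 100 estimators
test_200 = implied_correct(0.8289, n_test)   # 200 estimators

print(n_train, n_test)
print(test_100, test_200, test_100 - test_200)
```

The two test accuracies differ by exactly one prediction, which is too small a difference to conclude that one ensemble size is better than the other.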

### Why Bagging Helps

Bagging works by creating multiple bootstrapped datasets (random samples with replacement from the original data) and training separate models on each. This reduces the model’s variance because:
- Each individual model is trained on a slightly different subset of the data, and by averaging their predictions (for regression) or using a majority vote (for classification), the ensemble smooths out individual model fluctuations.
- In this case, the unpruned tree is the high-variance model; the stump (max_depth=1) and the **pruned decision tree** (depth = 3) have lower variance but more bias. Bagging averages the predictions of many such trees trained on different bootstrap samples, leading to a more stable model.
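
A bootstrap sample can be sketched with the standard library alone. Drawing `n` indices with replacement from `n` rows leaves roughly 63% of the rows represented at least once, and the remaining "out-of-bag" rows are unseen by that particular model. The variable names below are illustrative.

```python
import random

rng = random.Random(0)
n = 1000
indices = list(range(n))

# Draw n row indices WITH replacement: one bootstrap sample
bootstrap = rng.choices(indices, k=n)

unique_fraction = len(set(bootstrap)) / n      # close to 1 - 1/e, about 0.632
out_of_bag = set(indices) - set(bootstrap)     # rows this model never sees

print(f"Unique rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Out-of-bag rows: {len(out_of_bag)}")
```

Because each model sees a different subset of rows (with some rows repeated), the fitted trees differ from one another, which is what makes averaging them useful.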

This results in a **more robust model** that is less sensitive to fluctuations in the training data, leading to more consistent and reliable performance on both the training and test sets.

While we explored Bagging with both the baseline and pruned models, in the next steps, we may further explore methods like **Random Forests** or **Gradient Boosting**, which build on the principles of Bagging but introduce even further enhancements for model performance. However, for now, we have made significant progress with Bagging and its ability to reduce variance and improve generalization.


In [22]:
# Re-fit the Bagging model with stumps (max_depth=1) several times.
# Train Accuracy stays close to Test Accuracy in every run.
# Because random_state=0 is fixed in both the base estimator and the BaggingClassifier,
# the bootstrap samples are the same each time, so all five runs print identical numbers.

for i in range(5):  # Run the model 5 times to observe variability
    # Train the BaggingClassifier again on the scaled training data (baseline model)
    bagging_model.fit(X_train_scaled, y_train)

    # Make predictions on both the training and test sets
    y_train_pred = bagging_model.predict(X_train_scaled)
    y_test_pred = bagging_model.predict(X_test_scaled)

    # Calculate accuracy for both training and test sets
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    # Print the results for each run
    print(f"Run {i + 1}:")
    print(f"Training Accuracy (Baseline Model): {train_accuracy:.4f}")
    print(f"Test Accuracy (Baseline Model): {test_accuracy:.4f}")
    print("-" * 30)


Run 1:
Training Accuracy (Baseline Model): 0.8194
Test Accuracy (Baseline Model): 0.8421
------------------------------
Run 2:
Training Accuracy (Baseline Model): 0.8194
Test Accuracy (Baseline Model): 0.8421
------------------------------
Run 3:
Training Accuracy (Baseline Model): 0.8194
Test Accuracy (Baseline Model): 0.8421
------------------------------
Run 4:
Training Accuracy (Baseline Model): 0.8194
Test Accuracy (Baseline Model): 0.8421
------------------------------
Run 5:
Training Accuracy (Baseline Model): 0.8194
Test Accuracy (Baseline Model): 0.8421
------------------------------


In [23]:
# Re-fit the Bagging model with pruned trees (max_depth=3) several times.
# Train Accuracy is higher than for the stump ensemble, while Test Accuracy is the same.
# As before, the fixed random_state=0 makes every run deterministic,
# so all five runs print identical numbers.

for i in range(5):  # Run the model 5 times to observe variability
    # Train the BaggingClassifier again on the scaled training data (pruned model)
    bagging_model_pruned.fit(X_train_scaled, y_train)

    # Make predictions on both the training and test sets
    y_train_pred_pruned = bagging_model_pruned.predict(X_train_scaled)
    y_test_pred_pruned = bagging_model_pruned.predict(X_test_scaled)

    # Calculate accuracy for both training and test sets
    train_accuracy_pruned = accuracy_score(y_train, y_train_pred_pruned)
    test_accuracy_pruned = accuracy_score(y_test, y_test_pred_pruned)

    # Print the results for each run
    print(f"Run {i + 1}:")
    print(f"Training Accuracy (Pruned Model): {train_accuracy_pruned:.4f}")
    print(f"Test Accuracy (Pruned Model): {test_accuracy_pruned:.4f}")
    print("-" * 30)


Run 1:
Training Accuracy (Pruned Model): 0.8943
Test Accuracy (Pruned Model): 0.8421
------------------------------
Run 2:
Training Accuracy (Pruned Model): 0.8943
Test Accuracy (Pruned Model): 0.8421
------------------------------
Run 3:
Training Accuracy (Pruned Model): 0.8943
Test Accuracy (Pruned Model): 0.8421
------------------------------
Run 4:
Training Accuracy (Pruned Model): 0.8943
Test Accuracy (Pruned Model): 0.8421
------------------------------
Run 5:
Training Accuracy (Pruned Model): 0.8943
Test Accuracy (Pruned Model): 0.8421
------------------------------


### Consistency in Model Performance

After running the **Bagging model** multiple times with both the **baseline decision tree model** and the **pruned decision tree model**, we observe the following results:

#### Baseline Decision Tree (max_depth=1):
- **Training Accuracy** remains consistently at **0.8194** across all five runs.
- **Test Accuracy** also stays constant at **0.8421** for each run.

The five runs are identical because `random_state=0` fixes the bootstrap samples, so this consistency reflects deterministic seeding rather than measured stability. What the numbers do show is that training and test accuracy are close, which suggests that the bagged stumps do not overfit, unlike the single unpruned decision tree.

#### Pruned Decision Tree (max_depth=3):
- **Training Accuracy** remains consistently at **0.8943** across all five runs.
- **Test Accuracy** also stays constant at **0.8421** for each run.

For the **pruned model**, we observe similar consistency, but with a higher training accuracy, indicating a better fit compared to the baseline model. The **test accuracy** is the same as the baseline model, which suggests that the pruned model is still generalizing well.

### Conclusion:
- **The baseline decision tree model** shows no sign of overfitting with **Bagging**: training accuracy (0.8194) sits slightly below **test accuracy** (0.8421).
- **The pruned decision tree model** reaches a **higher training accuracy** (0.8943) but the same test accuracy (0.8421), so the deeper trees fit the training data more closely without generalizing better on this split.

Both models give identical scores on every run. That is consistent with **Bagging** producing stable ensembles, but refitting on the same split (most likely with a fixed random seed) cannot by itself demonstrate reduced variance; that would require varying the seed or the split.
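One way to check whether identical scores come from a fixed seed rather than from the model itself is to refit with the same seed and then with different seeds. The sketch below does this on synthetic data from `make_classification` (an assumption; it is not the heart disease data), and the helper name `bagging_test_accuracy`, the 50 estimators, and the depth of 3 are illustrative choices rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): same size as the heart dataset, not its values
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

def bagging_test_accuracy(seed):
    """Fit a Bagging ensemble of depth-3 trees with the given seed; return test accuracy."""
    # The base tree is passed positionally so the call does not depend on the keyword name
    model = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# Same seed five times: the bootstrap samples are identical, so the scores are too
same_seed_scores = [bagging_test_accuracy(0) for _ in range(5)]

# Five different seeds: the bootstrap samples differ, so the scores may differ
varied_seed_scores = [bagging_test_accuracy(seed) for seed in range(5)]

print("same seed   :", [f"{s:.4f}" for s in same_seed_scores])
print("varied seeds:", [f"{s:.4f}" for s in varied_seed_scores])
```

If the varied-seed scores stay close together, that is more direct evidence of low variance than repeated fits with one seed.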

In the next steps, we may explore other ensemble methods such as **Random Forests**, which extend Bagging with random feature selection, or **Gradient Boosting**, which builds trees sequentially instead of independently. For now, Bagging has given us stable results with little or no overfitting on this split.


# Boosting: reducing bias

Now we’ll apply AdaBoost with decision trees as weak learners. This will sequentially improve the model by focusing on difficult cases.

Boosting reduces bias by sequentially training a series of weak learners (often simple models like decision trees) where each subsequent model focuses on the mistakes made by the previous models. The key idea behind boosting is to incrementally improve the model by correcting errors, which helps to reduce bias, especially when the initial model is too simple and underfits the data.

- Boosting typically uses weak learners, which are models that perform only slightly better than random guessing. For example, in classification, a weak learner might be a very shallow decision tree, such as a "stump" with a single split. Weak learners usually have high bias, meaning they are too simplistic and don't capture the underlying patterns in the data well. As a result, they underfit the data.

- In each iteration, boosting trains a new model that tries to correct the errors made by the earlier models. If an instance was misclassified by the first weak learner, it will receive a higher weight, so the next model pays more attention to it. As the sequence of models progresses, the ensemble collectively focuses more on the difficult-to-predict instances. Over time, the combined models become better at fitting the data, as they successively reduce the bias (systematic error) by adjusting for earlier mistakes.
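The reweighting step described above can be traced by hand for a single round. The following is a minimal sketch of the two-class discrete AdaBoost (SAMME) update using NumPy only; the five labels and predictions are made up for illustration, and the learning rate of 1.0 is an assumption chosen to keep the arithmetic simple.

```python
import numpy as np

# Hypothetical toy round: true labels and one weak learner's predictions
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])  # the second sample is misclassified

# Start with uniform sample weights
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error rate of this weak learner
incorrect = y_pred != y_true
error = np.sum(weights[incorrect]) / np.sum(weights)

# Weight of this learner in the final vote (two-class case, learning rate assumed 1.0)
learning_rate = 1.0
alpha = learning_rate * np.log((1 - error) / error)

# Scale up the weights of misclassified samples, then renormalize to sum to 1
weights = weights * np.exp(alpha * incorrect)
weights = weights / weights.sum()

print(f"error = {error:.2f}, alpha = {alpha:.4f}")
print("updated weights:", np.round(weights, 3))
```

With one of five equally weighted samples misclassified, the error is 0.2 and alpha is ln(4). After renormalizing, the misclassified sample carries weight 0.5 and each correctly classified sample 0.125, so the next weak learner is pushed toward the hard case.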

In [26]:
# Import the AdaBoostClassifier from scikit-learn
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create a weak decision tree classifier with max_depth=1 as the base estimator (a "stump")
# This weak learner is shallow and simple, helping to avoid overfitting by not capturing too many details
base_dt = DecisionTreeClassifier(max_depth=1, random_state=0)

# Create and train the AdaBoostClassifier with 100 estimators (weak learners)
# AdaBoost will sequentially train 100 weak decision trees, each one focusing on the mistakes of the previous model.
# The 'algorithm' parameter is set to 'SAMME' to avoid the deprecated 'SAMME.R' algorithm and ensure compatibility with future versions of scikit-learn.
# The learning rate is set to 0.5 to reduce the contribution of each weak learner.
adaboost_model = AdaBoostClassifier(estimator=base_dt, n_estimators=100, learning_rate=0.5, algorithm='SAMME', random_state=0)

# Train the AdaBoost model using the scaled training data (baseline model)
adaboost_model.fit(X_train_scaled, y_train)

# Make predictions on both the training and test sets.
y_train_pred = adaboost_model.predict(X_train_scaled)
y_test_pred = adaboost_model.predict(X_test_scaled)

# Evaluate performance using accuracy.
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Print the performance metrics for both training and test sets.
print(f"Training Accuracy (Baseline Model): {train_accuracy:.4f}")
print(f"Test Accuracy (Baseline Model): {test_accuracy:.4f}")


Training Accuracy (Baseline Model): 0.8811
Test Accuracy (Baseline Model): 0.8553


You can probably see a good improvement in score, but overfitting is rearing its ugly head again (though not as much as in the base model). This is because the iterative correction of AdaBoost allows the model to focus on the specifics of this problem, at the cost of overexploiting the dataset.

In [28]:
# Create a depth-limited ("pruned") decision tree classifier with max_depth=3 as the base estimator
# Note: max_depth=3 is deeper than the max_depth=1 stump used above, so each weak learner is more expressive and the ensemble can overfit more easily
base_dt_pruned = DecisionTreeClassifier(max_depth=3, random_state=0)

# Create and train the AdaBoostClassifier with 100 estimators (weak learners) for the pruned model
# AdaBoost will sequentially train 100 weak decision trees, each one focusing on the mistakes of the previous model.
adaboost_model_pruned = AdaBoostClassifier(estimator=base_dt_pruned, n_estimators=100, learning_rate=0.5, algorithm='SAMME', random_state=0)

# Train the AdaBoost model using the scaled training data (pruned model)
adaboost_model_pruned.fit(X_train_scaled, y_train)

# Make predictions on both the training and test sets.
y_train_pred_pruned = adaboost_model_pruned.predict(X_train_scaled)
y_test_pred_pruned = adaboost_model_pruned.predict(X_test_scaled)

# Evaluate performance using accuracy.
train_accuracy_pruned = accuracy_score(y_train, y_train_pred_pruned)
test_accuracy_pruned = accuracy_score(y_test, y_test_pred_pruned)

# Print the performance metrics for both training and test sets.
print(f"Training Accuracy (Pruned Model): {train_accuracy_pruned:.4f}")
print(f"Test Accuracy (Pruned Model): {test_accuracy_pruned:.4f}")


Training Accuracy (Pruned Model): 1.0000
Test Accuracy (Pruned Model): 0.8289


### Analyzing the Results of AdaBoost and Its Impact on Overfitting

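After running AdaBoost, we can compare the **baseline decision tree model** and the **pruned decision tree model**. The stump-based ensemble reaches a test accuracy of 0.8553, higher than the 0.8421 obtained with Bagging, while the deeper ensemble fits the training data perfectly but scores lower on the test set (0.8289).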

#### Baseline Decision Tree (max_depth=1):
- **Training Accuracy**: **0.8811**. This is below the 100% training accuracy seen earlier in the notebook, so the ensemble of stumps is not simply memorizing the training set.
- **Test Accuracy**: **0.8553**, which is a reasonable performance on unseen data and only about 0.03 below the training accuracy. This small gap suggests that AdaBoost with stumps generalizes well, although a mild degree of **overfitting** remains.

#### Pruned Decision Tree (max_depth=3):
- **Training Accuracy**: **1.0000**, which shows a perfect fit on the training data.
- **Test Accuracy**: **0.8289**, which is lower than the baseline model's test accuracy (0.8553). The gap of about 0.17 between training and test accuracy indicates substantial overfitting.

Overall, AdaBoost with stumps improves the test score while keeping overfitting mild, whereas AdaBoost with depth-3 trees overfits clearly. The iterative correction of AdaBoost allows the model to focus on the specifics of this problem, at the cost of overexploiting the dataset, and more expressive weak learners make that easier.

#### Why Does Overfitting Occur in AdaBoost?

AdaBoost reduces **bias** by sequentially training a series of weak learners, where each model attempts to correct the mistakes of the previous one. While this approach helps the model focus on the more difficult cases and progressively improve its predictions, it also introduces the risk of overfitting:

- **Focus on Difficult Cases**: AdaBoost places more weight on misclassified instances. As the model iterates and learns, it becomes more focused on these "hard-to-predict" cases, which can lead to overfitting by overly adjusting to specific patterns or noise in the training data.
  
- **Higher Training Accuracy**: As AdaBoost continues to improve its predictions, the training accuracy increases, sometimes leading to perfect or near-perfect performance. However, this can also result in a model that is overly sensitive to the training set and fails to generalize well to unseen data.

- **Test Accuracy Gap**: With stumps, the gap between training and test accuracy is small (0.8811 versus 0.8553), which suggests that this ensemble generalizes reasonably well. With depth-3 trees the gap is much larger (1.0000 versus 0.8289), a clear sign of overfitting that we need to address to improve the model's robustness.

In the next steps, we may explore methods like **early stopping** or experiment with **parameter tuning** to balance the bias-variance trade-off and reduce overfitting further.
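As a rough illustration of the early-stopping idea, scikit-learn's `AdaBoostClassifier.staged_score` reports accuracy after each boosting round, which makes it possible to see where validation accuracy peaks. The sketch below uses synthetic data from `make_classification` and a separate validation split (both assumptions, not the notebook's data or split), and relies on the default weak learner, which is a depth-1 tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): same size as the heart dataset, not its values
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# The default weak learner is a decision stump (max_depth=1)
model = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
model.fit(X_tr, y_tr)

# staged_score yields the ensemble accuracy after each boosting round
train_scores = list(model.staged_score(X_tr, y_tr))
val_scores = list(model.staged_score(X_val, y_val))

# Pick the earliest round with the highest validation accuracy (1-based count)
best_round = max(range(len(val_scores)), key=val_scores.__getitem__) + 1

print(f"Rounds evaluated: {len(val_scores)}")
print(f"Best validation accuracy: {val_scores[best_round - 1]:.4f} at {best_round} estimators")
print(f"Training accuracy at that point: {train_scores[best_round - 1]:.4f}")
```

The round is chosen on a validation split rather than on the test set, so the test set can still give an unbiased final estimate.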


In [30]:
# Run the same code again a few times with the AdaBoost baseline model.
# Because random_state=0 is fixed and the train/test split does not change, every run should give identical scores.
# To observe run-to-run variability, vary random_state or the train/test split instead.
# The train accuracy is consistently higher than the test accuracy, indicating some (not extreme) overfitting.

# Run the AdaBoost model multiple times to observe consistency in performance with the baseline model
for i in range(5):  # Run the model 5 times to observe variability
    # Train the AdaBoost baseline model again on the scaled training data
    adaboost_model.fit(X_train_scaled, y_train)

    # Make predictions on both the training and test sets
    y_train_pred = adaboost_model.predict(X_train_scaled)
    y_test_pred = adaboost_model.predict(X_test_scaled)

    # Calculate accuracy for both training and test sets
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    # Print the results for each run
    print(f"Run {i + 1}:")
    print(f"Training Accuracy (Baseline Model): {train_accuracy:.4f}")
    print(f"Test Accuracy (Baseline Model): {test_accuracy:.4f}")
    print("-" * 30)


Run 1:
Training Accuracy (Baseline Model): 0.8811
Test Accuracy (Baseline Model): 0.8553
------------------------------
Run 2:
Training Accuracy (Baseline Model): 0.8811
Test Accuracy (Baseline Model): 0.8553
------------------------------
Run 3:
Training Accuracy (Baseline Model): 0.8811
Test Accuracy (Baseline Model): 0.8553
------------------------------
Run 4:
Training Accuracy (Baseline Model): 0.8811
Test Accuracy (Baseline Model): 0.8553
------------------------------
Run 5:
Training Accuracy (Baseline Model): 0.8811
Test Accuracy (Baseline Model): 0.8553
------------------------------


In [31]:
# Run the same code again a few times with the AdaBoost pruned model.
# Because random_state=0 is fixed and the train/test split does not change, every run should give identical scores.
# To observe run-to-run variability, vary random_state or the train/test split instead.
# The train accuracy (1.0000) is consistently well above the test accuracy, indicating clear overfitting.

# Run the AdaBoost model multiple times to observe consistency in performance with the pruned model
for i in range(5):  # Run the model 5 times to observe variability
    # Train the AdaBoost pruned model again on the scaled training data
    adaboost_model_pruned.fit(X_train_scaled, y_train)

    # Make predictions on both the training and test sets
    y_train_pred_pruned = adaboost_model_pruned.predict(X_train_scaled)
    y_test_pred_pruned = adaboost_model_pruned.predict(X_test_scaled)

    # Calculate accuracy for both training and test sets
    train_accuracy_pruned = accuracy_score(y_train, y_train_pred_pruned)
    test_accuracy_pruned = accuracy_score(y_test, y_test_pred_pruned)

    # Print the results for each run
    print(f"Run {i + 1}:")
    print(f"Training Accuracy (Pruned Model): {train_accuracy_pruned:.4f}")
    print(f"Test Accuracy (Pruned Model): {test_accuracy_pruned:.4f}")
    print("-" * 30)


Run 1:
Training Accuracy (Pruned Model): 1.0000
Test Accuracy (Pruned Model): 0.8289
------------------------------
Run 2:
Training Accuracy (Pruned Model): 1.0000
Test Accuracy (Pruned Model): 0.8289
------------------------------
Run 3:
Training Accuracy (Pruned Model): 1.0000
Test Accuracy (Pruned Model): 0.8289
------------------------------
Run 4:
Training Accuracy (Pruned Model): 1.0000
Test Accuracy (Pruned Model): 0.8289
------------------------------
Run 5:
Training Accuracy (Pruned Model): 1.0000
Test Accuracy (Pruned Model): 0.8289
------------------------------


### Observing Variability and Overfitting in AdaBoost

After running the AdaBoost model five times, we observe the following results for both the **baseline decision tree model** and the **pruned decision tree model**:

#### Baseline Decision Tree Model:
- **Training Accuracy** remains consistently at **0.8811** across all runs, which is relatively stable.
- **Test Accuracy** stays constant at **0.8553** for each run. Although the model's performance is stable, it still exhibits a slight gap between the training and test accuracy, indicating some overfitting.

#### Pruned Decision Tree Model (max_depth=3):
- **Training Accuracy** remains consistently at **1.0000**, indicating that the model is perfectly fitting the training data.
- **Test Accuracy** is stable at **0.8289** across all runs. This is well below the training accuracy (a gap of about 0.17) and also below the baseline model's test accuracy (0.8553), so the pruned model overfits more and generalizes less well than the baseline model.

These results indicate the following:
1. **Overfitting**: Both models (baseline and pruned) exhibit overfitting, as evidenced by the gap between training and test accuracy. The baseline model shows the smaller gap (about 0.03 versus about 0.17), indicating that it has generalized better than the pruned model.
2. **Variance**: Both models give identical scores on all five runs. With `random_state=0` fixed and the same train/test split, this is expected and says nothing about variance; measuring variance would require changing the seed or resampling the data.

Given these results, the AdaBoost **baseline model** strikes the better balance between bias and variance: it has the higher test accuracy (0.8553) and only a small gap between training and test accuracy. The **pruned model** fits the training data perfectly but generalizes worse to unseen data.
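To see how much the scores depend on the particular train/test split, the comparison can be repeated over several different splits. The sketch below is one way to do that, assuming synthetic data from `make_classification` in place of the heart disease data; the helper `scores_over_splits` and the choice of 10 splits are hypothetical. Feature scaling is omitted because decision trees do not depend on it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): same size as the heart dataset, not its values
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)

def scores_over_splits(max_depth, n_splits=10):
    """Fit AdaBoost on several different train/test splits and collect accuracies."""
    train_scores, test_scores = [], []
    for seed in range(n_splits):
        # A different seed gives a different stratified split each time
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, stratify=y, random_state=seed
        )
        weak_learner = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        # The weak learner is passed positionally so the call does not depend on the keyword name
        model = AdaBoostClassifier(weak_learner, n_estimators=100, learning_rate=0.5, random_state=0)
        model.fit(X_tr, y_tr)
        train_scores.append(model.score(X_tr, y_tr))
        test_scores.append(model.score(X_te, y_te))
    return np.array(train_scores), np.array(test_scores)

# Compare stumps (depth 1) with depth-3 trees
results = {depth: scores_over_splits(depth) for depth in (1, 3)}
for depth, (train_scores, test_scores) in results.items():
    gap = train_scores.mean() - test_scores.mean()
    print(f"max_depth={depth}: train {train_scores.mean():.4f}, "
          f"test {test_scores.mean():.4f} +/- {test_scores.std():.4f}, gap {gap:.4f}")
```

The standard deviation of the test scores gives a rough measure of variance across splits, and the mean gap gives a steadier picture of overfitting than a single split.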

In the next steps, we may explore strategies like **early stopping**, **parameter tuning**, or **cross-validation** to further refine the model and reduce overfitting.
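Parameter tuning and cross-validation can be combined with scikit-learn's `GridSearchCV`. The following is a minimal sketch under stated assumptions: the data is synthetic (from `make_classification`), the grid values are illustrative rather than recommended, and the default stump is used as the weak learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption): same size as the heart dataset, not its values
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)

# Illustrative grid (assumption): a few ensemble sizes and learning rates
param_grid = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.1, 0.5, 1.0],
}

# Stratified 5-fold cross-validation keeps the class balance in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),  # default weak learner is a stump
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean cross-validated accuracy: {search.best_score_:.4f}")
```

In the notebook's setting the search would be run on the training data only, leaving the test set for a final check of the selected model.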
