# Project 5 - Ensemble
**Author:** Karli Dean\
**Due Date:** November 21, 2025\
**Purpose:** This Jupyter Notebook analyzes the red Wine Quality dataset from the UCI Machine Learning Repository. My part of the project covers the Bagging and AdaBoost models. A final recap explains which model I would choose for this dataset and why.

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

## Section 1 - Load and Inspect the Data

In [2]:
# Loading the Dataset as a DataFrame
df = pd.read_csv("winequality-red.csv", sep=";")

In [3]:
# Inspect the column types and the first few rows to make sure the data loaded correctly
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [4]:
# Notes per Project Doc
# The dataset includes 11 physicochemical input variables (features):
# ---------------------------------------------------------------
# - fixed acidity          mostly tartaric acid
# - volatile acidity       mostly acetic acid (vinegar)
# - citric acid            can add freshness and flavor
# - residual sugar         remaining sugar after fermentation
# - chlorides              salt content
# - free sulfur dioxide    protects wine from microbes
# - total sulfur dioxide   sum of free and bound forms
# - density                related to sugar content
# - pH                     acidity level (lower = more acidic)
# - sulphates              antioxidant and microbial stabilizer
# - alcohol                % alcohol by volume

# The target variable is:
# - quality (integer score from 0 to 10, rated by wine tasters)

# We will simplify this target into three categories:
#   - low (3–4), medium (5–6), high (7–8) to make classification feasible.
#   - we will also make this numeric (we want both for clarity)
# The dataset contains 1599 samples and 12 columns (11 features + target).

## Section 2 - Prepare the Data

Next, we prepare the data for analysis. Each step is followed by a short explanation.

In [5]:
# Define helper function that:

# Takes one input, the quality (which we will temporarily name q while in the function)
# And returns a string of the quality label (low, medium, high)
# This function will be used to create the quality_label column
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)

### Why Convert Quality Scores into Labels?

The original `quality` variable is numeric, which is useful for calculations but not always intuitive to understand. To make the analysis more meaningful, I created the `quality_to_label()` function to group quality scores into three categories: **low**, **medium**, and **high**. 

This transformation is helpful because:

- It makes the data more interpretable for humans.
- It allows us to analyze trends by category instead of raw numbers.
- It improves visualizations such as bar charts and comparisons across groups.
- It supports slicing, grouping, and summarizing the data more easily.

By applying this function, we add a new column (`quality_label`) that summarizes the raw score in a coarser but clearer and more useful format.
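As an aside, the same binning could also be expressed with `pd.cut`. This is only a minimal sketch, not the approach used in this notebook: the `scores` series is made-up sample data, and the bin edges are an assumption chosen to match `quality_to_label()`.

```python
import pandas as pd

# Hypothetical sample of quality scores (not taken from the dataset)
scores = pd.Series([3, 4, 5, 6, 7, 8])

# Bins are right-inclusive by default: (0, 4], (4, 6], (6, 10]
labels = pd.cut(scores, bins=[0, 4, 6, 10], labels=["low", "medium", "high"])

print(labels.astype(str).tolist())
```

The explicit helper function is arguably easier to read for three categories, while `pd.cut` tends to scale better if the bin edges change.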


In [6]:
# Then, create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

# Creating the df based on this
df["quality_numeric"] = df["quality"].apply(quality_to_number)

### Why Convert Quality Labels into Numbers?

Many machine learning tools and metrics are easier to work with when the target is numeric rather than text. To prepare the quality information for modeling, I created the `quality_to_number()` function, which maps each quality score to the numeric code of its category:

- low → 0  
- medium → 1  
- high → 2  

This gives us a simple numeric target whose codes preserve the order of the categories. Note that the classifiers used below treat 0, 1, and 2 as unordered class labels, so the ordering mainly helps with interpretation, correlations, and statistical analysis.
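Because the label and the numeric code come from two separate functions, it may be worth confirming that they agree for every possible score. The sketch below repeats both helpers so it runs on its own; the `LABEL_TO_NUMBER` dictionary and the `mismatches` list are names introduced here for illustration only.

```python
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"


def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2


# Hypothetical lookup table expressing the intended label -> code mapping
LABEL_TO_NUMBER = {"low": 0, "medium": 1, "high": 2}

# Collect every score on the 0-10 scale where the two helpers disagree
mismatches = [
    q
    for q in range(0, 11)
    if quality_to_number(q) != LABEL_TO_NUMBER[quality_to_label(q)]
]

print(mismatches)
```

An empty list means the two columns encode the same categories.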

## Section 3 - Feature Selection and Justification

In [7]:
# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label', and 'quality_numeric' - drop these from the input array
# Target: quality_numeric (the numeric column we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

### Why these features?

- The original `quality` column is the raw human rating and is being replaced by engineered versions, so I dropped it from the feature set.

- `quality_label` is a descriptive/categorical version of the target that I use for interpretation and plotting, but including it as an input feature would leak the answer into the model, so it is also dropped.

- `quality_numeric` is the numeric encoding of wine quality (0 = low, 1 = medium, 2 = high), and this is what I want the model to predict, so it belongs in y, not in X.

- By dropping `["quality", "quality_label", "quality_numeric"]`, I ensure that:

  - `X` contains only the 11 physicochemical features (fixed acidity, volatile acidity, citric acid, etc.).

  - `y` contains a clean, numeric target that is suitable for classification algorithms.
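A quick leakage check along these lines could guard against a target column slipping into the features. This is a sketch on a tiny made-up frame; `toy`, `target_cols`, `X_toy`, and `leaked` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Tiny hypothetical frame with two features and the three target-related columns
toy = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8],
        "pH": [3.51, 3.20],
        "quality": [5, 5],
        "quality_label": ["medium", "medium"],
        "quality_numeric": [1, 1],
    }
)

target_cols = ["quality", "quality_label", "quality_numeric"]
X_toy = toy.drop(columns=target_cols)

# Any overlap here would mean the answer is leaking into the inputs
leaked = set(target_cols) & set(X_toy.columns)
print(sorted(X_toy.columns), leaked)
```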

## Section 4 - Split the Data into Training and Testing Sets

In [8]:
# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
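To illustrate what `stratify=y` does, the sketch below splits a made-up, imbalanced label vector and compares class shares before and after. The class counts (50/400/50) are an assumption for illustration and are not the wine dataset's actual distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10% / 80% / 10%
y_toy = np.array([0] * 50 + [1] * 400 + [2] * 50)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, _, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of each class in the full data and in the test split
full_share = np.bincount(y_toy) / len(y_toy)
test_share = np.bincount(y_te) / len(y_te)

print(full_share, test_share)
```

With stratification the two sets of shares should be nearly identical, which matters here because the smallest class has very few samples.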

## Section 5 - Evaluate Model Performance

### Model Evaluation Approach

To keep the evaluation process consistent across all models, I created a helper function `evaluate_model()` that:

- trains the model  
- generates predictions for both the training and test sets  
- computes the accuracy and F1 scores  
- prints a confusion matrix for the test set  
- stores all results in a list for later comparison  

Using a helper function ensures that each model is evaluated using the exact same metrics and structure, which makes the results fair and easy to compare.

### Why Accuracy and F1 Score?

I used **accuracy** to measure overall predictive performance, but accuracy alone can be misleading when classes are imbalanced. Because the wine quality dataset has far more samples in the “medium” class, I also included the **weighted F1 score**, which averages the per-class F1 scores (each combining precision and recall) in proportion to class frequency. Since that weighting still favors the majority class, the confusion matrix is printed alongside it to show how each model handles the smaller “low” and “high” classes.
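To make the weighting concrete, the sketch below recomputes F1 by hand from the AdaBoost test confusion matrix reported in the results later in this notebook. The helper `f1_from_confusion` is a name introduced here, and the macro average is included only for contrast; it is not a metric the notebook reports.

```python
import numpy as np


def f1_from_confusion(cm):
    """Return per-class, weighted, and macro F1 from a confusion matrix
    with true classes in rows and predicted classes in columns."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as the class but actually another
    fn = cm.sum(axis=1) - tp  # belongs to the class but predicted as another
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); treated as 0 when the denominator is 0
    per_class = np.divide(2 * tp, denom, out=np.zeros_like(denom), where=denom > 0)
    support = cm.sum(axis=1)
    weighted = float((per_class * support).sum() / support.sum())
    macro = float(per_class.mean())
    return per_class, weighted, macro


# AdaBoost test confusion matrix as printed in this notebook
adaboost_cm = [[1, 12, 0], [5, 240, 19], [0, 20, 23]]
per_class, weighted, macro = f1_from_confusion(adaboost_cm)

print(per_class.round(3), round(weighted, 4), round(macro, 4))
```

The weighted value comes out at about 0.816, matching the reported test F1, while the unweighted macro average is only about 0.51 because the "low" class is almost never predicted correctly.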

### Why Compare AdaBoost and Bagging?

AdaBoost and Bagging are both ensemble methods but they operate very differently:

- **Bagging (Bootstrap Aggregation)** trains many base learners independently on different bootstrapped samples. It reduces variance and improves stability, especially for decision trees.
- **AdaBoost (Adaptive Boosting)** trains learners sequentially, giving more weight to samples the previous models misclassified. It focuses on reducing bias and improving difficult cases.

Comparing these two approaches shows how different ensemble strategies behave on the same dataset.
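The two mechanisms can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the sample size, the misclassified indices, and the use of the classic binary AdaBoost weight update are all assumptions, and scikit-learn's multiclass variant differs in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # hypothetical number of training samples

# Bagging: each base learner trains on a bootstrap sample drawn with replacement,
# so some rows repeat and others are left out
bootstrap_idx = rng.integers(0, n, size=n)
n_unique = len(np.unique(bootstrap_idx))

# AdaBoost: one round of the classic binary weight update
weights = np.full(n, 1.0 / n)
misclassified = np.zeros(n, dtype=bool)
misclassified[[2, 7]] = True  # hypothetical mistakes of the current weak learner

err = weights[misclassified].sum()       # weighted error rate of this learner
alpha = 0.5 * np.log((1 - err) / err)    # this learner's say in the final vote
weights *= np.exp(np.where(misclassified, alpha, -alpha))
weights /= weights.sum()                 # renormalize to a distribution

print(n_unique, weights.round(4))
```

After the update, the misclassified samples carry half of the total weight, which is what forces the next learner to concentrate on them. Bagging never changes sample weights in this way; its learners differ only through the random resampling.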

### Tracking Results

I created `results1` and `results2` as lists to store the performance metrics for each model. The helper function appends a dictionary containing:

- Model name  
- Train accuracy  
- Test accuracy  
- Train F1  
- Test F1  

This allows me to summarize model performance later in a results table.


In [9]:
# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    # Append in place so the caller's list receives the metrics
    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )

In [10]:
# List that will collect the AdaBoost metrics
results1 = []

# AdaBoost with 100 estimators
evaluate_model(
    "AdaBoost (100)",
    AdaBoostClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results1,
)


AdaBoost (100) Results
Confusion Matrix (Test):
[[  1  12   0]
 [  5 240  19]
 [  0  20  23]]
Train Accuracy: 0.8342, Test Accuracy: 0.8250
Train F1 Score: 0.8209, Test F1 Score: 0.8158


In [11]:
# List that will collect the Bagging metrics
results2 = []

# Bagging with 100 Decision Trees
evaluate_model(
    "Bagging (DT, 100)",
    BaggingClassifier(
        estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results2,
)


Bagging (DT, 100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 252  12]
 [  0  12  31]]
Train Accuracy: 1.0000, Test Accuracy: 0.8844
Train F1 Score: 1.0000, Test F1 Score: 0.8655


## Section 6 - Compare Results

### Summary of Model Performance

After evaluating each model using the `evaluate_model()` function, I combined the individual results lists into a single DataFrame. This summary table allows me to directly compare performance across all models in terms of both **accuracy** and **F1 score** on the training and test sets.

Using one consolidated table makes it much easier to:

- identify which model generalizes best,
- spot overfitting (high train score but lower test score),
- compare performance across ensemble methods, and
- determine whether the more complex model (e.g., AdaBoost) actually outperforms simpler approaches (e.g., Bagging).

This summary DataFrame provides a clear side-by-side comparison that supports my final model selection.
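One way to make overfitting visible in such a table is to add the train/test gap as its own column. The sketch below uses the accuracy values reported in this notebook; the `summary` frame and the `Accuracy Gap` column name are assumptions introduced for illustration.

```python
import pandas as pd

# Accuracy values as reported in this notebook's results
summary = pd.DataFrame(
    [
        {"Model": "AdaBoost (100)", "Train Accuracy": 0.834246, "Test Accuracy": 0.825},
        {"Model": "Bagging (DT, 100)", "Train Accuracy": 1.0, "Test Accuracy": 0.884375},
    ]
)

# A large positive gap suggests the model fits the training set much better than unseen data
summary["Accuracy Gap"] = summary["Train Accuracy"] - summary["Test Accuracy"]

# Model with the highest test accuracy
best = summary.sort_values("Test Accuracy", ascending=False).iloc[0]["Model"]

print(summary)
print(best)
```

The gap and the test score answer different questions: the first shows how much a model overfits, the second how well it generalizes.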


In [12]:
df1 = pd.DataFrame(results1)
df2 = pd.DataFrame(results2)

In [13]:
results_df = pd.concat([df1, df2], ignore_index=True)

In [14]:
print("\nSummary of All Models:")
display(results_df)


Summary of All Models:


| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 |
|---|---|---|---|---|---|
| 0 | AdaBoost (100) | 0.834246 | 0.825 | 0.820863 | 0.815803 |
| 1 | Bagging (DT, 100) | 1.0 | 0.884375 | 1.0 | 0.865452 |


## Section 7 - Conclusions and Insights

### Interpretation of Model Performances

What stands out immediately is how differently the two ensemble methods behave.  
The Bagging model shows **extremely high training performance (100% accuracy/F1)**, which is typical for ensembles of unpruned Decision Trees. Each tree can easily memorize its bootstrap sample, and Bagging does nothing to limit that. Its strength lies in averaging many such trees, which lowers variance on unseen data without reducing the fit to the training set.

Despite perfectly fitting the training set, the Bagging model still generalizes well, achieving **about 88% accuracy on the test set**, which is strong for this dataset.

AdaBoost behaves very differently. It builds the ensemble sequentially from weak learners (by default in scikit-learn, depth-1 decision trees), each one focused on correcting earlier mistakes, so it does not memorize the training set. Its training and test accuracies are therefore much closer (about 83% and 82.5%). This small gap indicates little overfitting, but in this case AdaBoost does not reach the test performance of Bagging.

**Overall, Bagging performs best** on this dataset. Even though it overfits the training data, its test accuracy and F1 score are higher, indicating that it captures the underlying patterns more effectively. One caveat is that neither model handles the “low” class well: in the test confusion matrices, Bagging classifies none of the 13 low-quality wines correctly and AdaBoost only one. If I were making predictions or drawing insights, I would rely on the Bagging (Decision Tree) ensemble because it provides stronger performance on unseen data.
