# Project 05: Ensemble Machine Learning - Wine Dataset
**Author:**  James Pinkston  
**Date:**  November 23, 2025  
**Objective:**  P5:  This project evaluates and compares the performance of ensemble models on the Wine Quality Dataset, available <a href="https://archive.ics.uci.edu/ml/datasets/Wine+Quality" target="_blank">here</a>.

## Introduction to Ensemble Models
Ensemble models combine the outputs of multiple models to improve predictive performance. Common types of ensemble models include:

- Boosted Decision Trees – Models train sequentially, with each new tree correcting the errors of the previous one.
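- Random Forest – Multiple decision trees train independently, each on a random bootstrap sample of the data (with a random subset of features considered at each split), and their predictions are combined by majority vote or averaging.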
- Voting Classifier (Heterogeneous Models) – Combines different types of models (e.g., Decision Tree, SVM, and Neural Network) by taking the majority vote or average prediction.
- Cross Validation – Not an ensemble model itself, but a related evaluation technique: it divides the data into multiple folds to improve the reliability of performance estimates.
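
To make the voting idea in the list above concrete, here is a minimal sketch of hard (majority) voting in plain Python. The three prediction lists are made-up values for illustration only, not output from any model in this notebook, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model prediction lists by taking the most common label per sample."""
    combined = []
    for votes in zip(*model_predictions):
        # most_common(1) returns [(label, count)]; ties go to the label seen first
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three different models on five samples
tree_preds = [1, 1, 2, 0, 1]
svm_preds = [1, 2, 2, 1, 1]
nn_preds = [1, 1, 1, 1, 2]

ensemble_preds = majority_vote(tree_preds, svm_preds, nn_preds)
print(ensemble_preds)  # [1, 1, 2, 1, 1]
```

In scikit-learn, `VotingClassifier(voting="hard")` applies the same idea to fitted estimators, while `voting="soft"` averages predicted class probabilities instead.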

## Performance Metrics
We will evaluate model performance using the following metrics:

- Accuracy –  The proportion of all predictions that are correct.
- Precision – Proportion of positive predictions that are truly positive.
- Recall – Proportion of actual positives that are correctly predicted.
- F1 Score – Harmonic mean of precision and recall, balancing both.

These metrics are especially helpful when working with multiple classes (e.g., low, medium, high), not just binary yes/no predictions.
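
For reference, the per-class definitions behind these metrics can be written as follows. The notation is an assumption introduced here: $TP_k$, $FP_k$, and $FN_k$ are the true positives, false positives, and false negatives for class $k$, $n_k$ is the number of actual samples in class $k$, and $N$ is the total number of samples.

$$
\text{Precision}_k = \frac{TP_k}{TP_k + FP_k}, \qquad
\text{Recall}_k = \frac{TP_k}{TP_k + FN_k}, \qquad
F1_k = \frac{2 \cdot \text{Precision}_k \cdot \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}
$$

$$
\text{Accuracy} = \frac{\sum_k TP_k}{N}, \qquad
F1_{\text{weighted}} = \sum_k \frac{n_k}{N} \, F1_k
$$

The weighted form is what `f1_score(..., average="weighted")` reports, so larger classes contribute more to the final score.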

Good models have:

- High Test Accuracy – the model predicts well on new, unseen (e.g., test) data.
- High Test F1 Score – especially useful if some classes (categories) have fewer examples than others.
- Small Gap between Train and Test accuracy – suggests the model is generalizing well rather than overfitting.
- Small Gap between Train and Test F1 score – likewise suggests good generalization; a small gap with low scores on both sets points to underfitting instead.

## Imports

In [1]:
# All imports should be at the top of the notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    recall_score,
    f1_score,
)

## Section 1. Load and Inspect the Data

In [2]:
# Load the dataset
df = pd.read_csv("winequality-red.csv", sep=";")

In [6]:
# Display structure and first few rows
df.info()
df.head()

# The dataset includes 11 physicochemical input variables (features):
# ---------------------------------------------------------------
# - fixed acidity          mostly tartaric acid
# - volatile acidity       mostly acetic acid (vinegar)
# - citric acid            can add freshness and flavor
# - residual sugar         remaining sugar after fermentation
# - chlorides              salt content
# - free sulfur dioxide    protects wine from microbes
# - total sulfur dioxide   sum of free and bound forms
# - density                related to sugar content
# - pH                     acidity level (lower = more acidic)
# - sulphates              antioxidant and microbial stabilizer
# - alcohol                % alcohol by volume

# The target variable is:
# - quality (integer score from 0 to 10, rated by wine tasters)

# We will simplify this target into three categories:
#   - low (3–4), medium (5–6), high (7–8) to make classification feasible.
#   - we will also make this numeric (we want both for clarity)
# The dataset contains 1599 samples and 12 columns (11 features + target).

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


## Section 2. Prepare the Data
Includes cleaning, feature engineering, encoding, splitting, helper functions

In [10]:
# Define helper function that:

# Takes one input, the quality (which we will temporarily name q while in the function)
# And returns a string of the quality label (low, medium, high)
# This function will be used to create the quality_label column
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)

# Then, create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

# Call the apply() method on the quality column to create the new quality_numeric column
df["quality_numeric"] = df["quality"].apply(quality_to_number)

# By adding these columns, we are able to predict using the quality column in more meaningful ways

# Display structure and first few rows with new columns
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
 12  quality_label         1599 non-null   object 
 13  quality_numeric       1599 non-null   int64  
dtypes: float64(11), int64(2), object(1)
memory usage: 175.0+ KB


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | quality_label | quality_numeric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | medium | 1 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | medium | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | medium | 1 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | medium | 1 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | medium | 1 |
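
The two helper functions encode the same thresholds twice. As an alternative sketch (not the approach used in this notebook), `pd.cut` can derive both columns from one set of bin edges. The small `demo` frame is a stand-in for the real `df`, and the upper edge of 10 is an assumption based on the stated 0 to 10 score range.

```python
import pandas as pd

# Small stand-in for the quality column (the notebook itself uses df["quality"])
demo = pd.DataFrame({"quality": [3, 4, 5, 6, 7, 8]})

# Bin edges are right-inclusive: (0, 4] -> low, (4, 6] -> medium, (6, 10] -> high
demo["quality_label"] = pd.cut(
    demo["quality"], bins=[0, 4, 6, 10], labels=["low", "medium", "high"]
)

# Ordered categorical codes give 0 = low, 1 = medium, 2 = high
demo["quality_numeric"] = demo["quality_label"].cat.codes

print(demo)
```

One difference to keep in mind: `quality_label` becomes a `category` column here rather than the plain `object` column shown in the `df.info()` output above.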


## Section 3. Feature Selection and Justification

In [12]:
# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label', and 'quality_numeric' - drop these from the input array
# Target: quality_numeric (the numeric version of the label we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

## Section 4. Split the Data into Train and Test

In [13]:
# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
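
To illustrate what `stratify=y` is meant to achieve, here is a simplified plain-Python sketch that splits each class separately so the test set keeps roughly the same class mix. It is not scikit-learn's actual implementation; `stratified_split` and the label counts are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical imbalanced labels: 10 low, 80 medium, 10 high
labels = [0] * 10 + [1] * 80 + [2] * 10
train_idx, test_idx = stratified_split(labels)

# Each class contributes 20% of its samples to the test set
print(Counter(labels[i] for i in test_idx))
```

Without stratification, a rare class could end up with very few or no samples in the test set purely by chance, which would make per-class metrics unreliable.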

## Section 5. Evaluate Model Performance

We have a list of 9 model variations to choose from:

1. Random Forest (100):  A strong baseline model using 100 decision trees.
2. Random Forest (200, max_depth=10):  Adds more trees, but limits tree depth to reduce overfitting.
3. AdaBoost (100):  Boosting method that focuses on correcting previous errors.
4. AdaBoost (200, lr=0.5):  More iterations with a lower learning rate, which can improve generalization.
5. Gradient Boosting (100):  Boosting approach using gradient descent.
6. Voting (DT + SVM + NN):  Combines diverse models by averaging their predictions.
7. Voting (RF + LR + KNN):  Another mix of different model types.
8. Bagging (DT, 100):  Builds many trees in parallel on different samples.
9. MLP Classifier:  A basic neural network with one hidden layer.

I am choosing:
- Random Forest (100)
- MLP Classifier

In [16]:
# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )

In [21]:
results = []

# Random Forest Model
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


Random Forest (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
Train Accuracy: 1.0000, Test Accuracy: 0.8875
Train F1 Score: 1.0000, Test F1 Score: 0.8661
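
As a sanity check, the test accuracy and weighted F1 reported above can be re-derived by hand from the printed confusion matrix. This is a minimal plain-Python sketch; it treats an undefined precision (no samples predicted for a class) as 0, which is an assumption about how that case is handled.

```python
# Test confusion matrix reported above (rows = actual class, columns = predicted class)
cm = [
    [0, 13, 0],
    [0, 256, 8],
    [0, 15, 28],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

weighted_f1 = 0.0
for k in range(n_classes):
    tp = cm[k][k]
    support = sum(cm[k])                   # actual samples of class k
    predicted = sum(row[k] for row in cm)  # samples predicted as class k
    precision = tp / predicted if predicted else 0.0  # undefined case treated as 0
    recall = tp / support if support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    weighted_f1 += f1 * support / total    # weight each class by its support

print(f"Accuracy: {accuracy:.4f}, Weighted F1: {weighted_f1:.4f}")
# Accuracy: 0.8875, Weighted F1: 0.8661
```

The low class (first row) contributes an F1 of 0 because none of its 13 test samples were predicted as low, which pulls the weighted F1 below the accuracy.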


In [22]:
# MLP Classifier
evaluate_model(
    "MLP Classifier",
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


MLP Classifier Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 257   7]
 [  0  30  13]]
Train Accuracy: 0.8514, Test Accuracy: 0.8438
Train F1 Score: 0.8141, Test F1 Score: 0.8073


## Section 6. Compare Results

In [26]:
# Create a table of results
results_df = pd.DataFrame(results)
print("\nSummary of All Models:")

# Add a gap calculation
results_df["Accuracy Gap"] = results_df["Train Accuracy"] - results_df["Test Accuracy"]
results_df["F1 Gap"] = results_df["Train F1"] - results_df["Test F1"]

# Sort the table by Test Accuracy
results_df = results_df.sort_values(by="Test Accuracy", ascending=False)
display(results_df)


Summary of All Models:


| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 | Accuracy Gap | F1 Gap |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest (100) | 1.0 | 0.8875 | 1.0 | 0.866056 | 0.1125 | 0.133944 |
| 1 | MLP Classifier | 0.851446 | 0.84375 | 0.814145 | 0.807318 | 0.007696 | 0.006827 |


## Section 7. Conclusions and Insights

1. Project Summary:  **In this analysis, we evaluated two machine learning models — Random Forest and Multi-Layer Perceptron (MLP) classifiers — to predict red wine quality. We measured model performance using both accuracy and weighted F1 score on the training and test datasets, and calculated gaps to assess overfitting.**

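2. Performance Trends:  **The Random Forest classifier achieved the higher test accuracy (88.75%) and test F1 score (0.866), but also showed a notable train/test accuracy gap of 11.25 percentage points, indicating some overfitting. In contrast, the MLP classifier had lower test accuracy (84.38%) but a much smaller accuracy gap (0.77 percentage points), suggesting more stable generalization. Neither model correctly identified any of the 13 low-quality wines in the test set, so both scores are driven largely by the dominant medium class (264 of 320 test samples).**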

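3. Insights on Why Models Behave Differently:  **Tree-based models, like Random Forest, perform well on this dataset because they capture non-linear relationships and interactions between wine chemical properties. The perfect training scores reflect fully grown trees that memorize the training data; limiting tree depth (as in the max_depth=10 variation) is the usual way to narrow that gap, whereas adding more trees does not by itself cause overfitting. Neural networks, such as MLP, can also model complex patterns, but they are sensitive to feature scaling and hyperparameters and may require careful tuning and more data to reach peak accuracy. Here, the MLP's small gap comes with lower scores overall.**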

4. Next Steps:  **This project used only 2 of the 9 candidate models. On a professional team, it would be prudent to evaluate as many of the candidates as practical and compare the results together, so the team can decide which model or models to rely on when presenting findings to the project stakeholders. Results for all 9 models would look similar to those produced by <a href="https://github.com/wkarto/applied-ml-karto/blob/main/notebooks/project05/ensemble-karto.ipynb" target="_blank">Womenker Karto's Wine Quality Data Summary Table</a> *(see Section 6, Compare Results)*.**
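
One possible follow-up, tied to the tuning remark in item 3 and the cross-validation idea from the introduction, is to scale the features before the MLP and score the model across several folds rather than a single split. The sketch below was not run as part of this project: it uses synthetic data from `make_classification` as a stand-in for `X` and `y`, so its scores say nothing about the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook this would be X and y from Section 3
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=42
)

# Scaling lives inside the pipeline so each fold is scaled using only its training part
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
)

# Stratified folds keep the class mix similar across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(scaled_mlp, X_demo, y_demo, cv=cv, scoring="f1_weighted")

print(f"Weighted F1 per fold: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```

Reporting the mean and spread across folds gives a steadier basis for comparing models than one train/test split, especially with a class as small as the low-quality wines.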