# Lab 5 Project (Ensemble Models)

- **Author:** Katie McGaughey 
- **Date:** 2025-04-11
- **Objective:** Utilizing data about wine quality to train & test ensemble ML models

This code base was created while completing Module 5 of CSIS 44-670 at NW Missouri University. In this Jupyter Notebook we analyze data describing the quality of wine. The data is sourced from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

> Data originally published by: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
> Modeling wine preferences by data mining from physicochemical properties.
> In Decision Support Systems, Elsevier, 47(4):547–553, 2009.

Direct download link to raw csv: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv 

In this lab we explore predicting the **quality** (scale of 0 to 10 as rated by wine tasters) of the wines in the dataset from their features. This section from the assignment explains the features & goal well.

The dataset includes 11 physicochemical input variables (features):
---------------------------------------------------------------

- fixed acidity: mostly tartaric acid
- volatile acidity: mostly acetic acid (vinegar)
- citric acid: can add freshness and flavor
- residual sugar: remaining sugar after fermentation
- chlorides: salt content
- free sulfur dioxide: protects wine from microbes
- total sulfur dioxide: sum of free and bound forms
- density: related to sugar content
- pH: acidity level (lower = more acidic)
- sulphates: antioxidant and microbial stabilizer
- alcohol: % alcohol by volume

The target variable is:
- quality (integer score from 0 to 10, rated by wine tasters)

We will simplify this target into three categories to make classification feasible:
- low (3–4), medium (5–6), high (7–8)
- we will also make this numeric (we want both for clarity)

The dataset contains 1599 samples and 12 columns (11 features + target).

## Section 1. Import and Inspect the Data

In this section we load the red wine dataset from the downloaded CSV file into a DataFrame and run a standard set of what I'll call "getting to know you" methods to get a view of the dataset schema, the non-null count and data type of each column, and the first few rows of its contents.

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

# Load the dataset (download from UCI and save in the same folder)
df = pd.read_csv("winequality-red.csv", sep=";")

# Display structure and first few rows
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


||fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|pH|sulphates|alcohol|quality|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|7.4|0.7|0.0|1.9|0.076|11.0|34.0|0.9978|3.51|0.56|9.4|5|
|1|7.8|0.88|0.0|2.6|0.098|25.0|67.0|0.9968|3.2|0.68|9.8|5|
|2|7.8|0.76|0.04|2.3|0.092|15.0|54.0|0.997|3.26|0.65|9.8|5|
|3|11.2|0.28|0.56|1.9|0.075|17.0|60.0|0.998|3.16|0.58|9.8|6|
|4|7.4|0.7|0.0|1.9|0.076|11.0|34.0|0.9978|3.51|0.56|9.4|5|


## Section 2: Prepare the Data

In this section we clean the data, do feature engineering, set up helper functions and generally get the data ready for the ML algorithms.

We will create a **quality_to_label** function to bin the 0 to 10 numerical grading scheme into a simpler "low/medium/high" scheme. We'll also create a matching **quality_to_number** function that maps the same three bins to the integers 0, 1 and 2 for modeling.

In [26]:
# Define helper function that:

# Takes one input, the quality (which we will temporarily name q while in the function)
# And returns a string of the quality label (low, medium, high)
# This function will be used to create the quality_label column
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"


# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)


# Then, create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2


df["quality_numeric"] = df["quality"].apply(quality_to_number)
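The two helper functions above repeat the same thresholds, so they could drift apart if one is edited later. A minimal sketch of one way to keep them in sync is below. The names `LABELS`, `class_for` and `label_for` are hypothetical and not part of the lab code; the thresholds are copied from the functions above.

```python
# Hypothetical refactor: one set of thresholds drives both the number and the label.
LABELS = ["low", "medium", "high"]  # position in this list is the numeric class


def class_for(q):
    """Return the numeric class (0, 1 or 2) for a raw quality score."""
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    return 2


def label_for(q):
    """Return the text label by looking up the numeric class."""
    return LABELS[class_for(q)]


# Scores 3 through 8 cover the ranges named in the introduction
pairs = [(q, class_for(q), label_for(q)) for q in range(3, 9)]
for q, number, label in pairs:
    print(q, number, label)
```

With this layout, changing a threshold in `class_for` automatically changes the text label as well.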

## Section 3: Feature Selection & Justification

In this section we will set up the X values (features) and y value (target) for the ML algorithms to eventually train on and test against.

We will drop unnecessary columns for the sake of keeping the model clean. In our case, the quality column and the two columns derived from it (quality_label and quality_numeric) directly encode the target, because we used quality to create the target. Leaving any of them in the features would leak the answer to the model, so we don't want the model to consider them.
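To illustrate why the raw score has to be kept out of the features, here is a small sketch. It assumes a made-up list of scores standing in for the `quality` column, and a stand-in "model" that is simply allowed to look at that column. Because the target is a deterministic function of `quality`, such a model scores perfectly without learning anything about the chemistry.

```python
def class_from_quality(q):
    # Same thresholds as quality_to_number in Section 2
    return 0 if q <= 4 else (1 if q <= 6 else 2)


# Hypothetical raw scores standing in for the `quality` column
quality = [5, 5, 6, 7, 4, 3, 8, 6]
target = [class_from_quality(q) for q in quality]

# A "model" that can see `quality` only has to re-apply the thresholds
leaky_predictions = [class_from_quality(q) for q in quality]

leaky_accuracy = sum(p == t for p, t in zip(leaky_predictions, target)) / len(target)
print(leaky_accuracy)  # 1.0
```

A perfect score like this says nothing about how well the 11 physicochemical features predict quality, which is the question the lab is actually asking.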

In [27]:
# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label' and 'quality_numeric' - drop these from the input array
# Target: quality_numeric (the numeric version of the label we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

## Section 4: Split to Train & Test

In this section we create two datasets from our input - one for model **training** and one for model **testing**. This will give the model something to work with, then something to prove to us that it's capable of what it was trained to do. This sets up the evaluation we'll be doing later.

In [28]:
# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
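The `stratify=y` argument asks for each class to appear in the test set in roughly the same proportion as in the full data. The sketch below shows the idea in plain Python. It is a simplified illustration, not scikit-learn's actual algorithm; the function name `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 10 low, 80 medium, 10 high
toy_labels = [0] * 10 + [1] * 80 + [2] * 10
train_idx, test_idx = stratified_split(toy_labels)

test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes 20% of its rows to the test set, so the rare classes are not left out by chance.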

## Section 5: Evaluate Model Performance

In this section we'll train **two** models from the table below on our training dataset and evaluate them against our test dataset.

Below is a list of 9 model variations. Choose two to focus on for your comparison.

|Option|Model Name|Notes|
|---|---|---|
|1|Random Forest (100)|A strong baseline model using 100 decision trees.|
|2|Random Forest (200, max_depth=10)|Adds more trees, but limits tree depth to reduce overfitting.|
|3|AdaBoost (100)|Boosting method that focuses on correcting previous errors.|
|4|AdaBoost (200, lr=0.5)|More iterations and slower learning for better generalization.|
|5|Gradient Boosting (100)|Boosting approach using gradient descent.|
|6|Voting (DT + SVM + NN)|Combines diverse models by averaging their predictions.|
|7|Voting (RF + LR + KNN)|Another mix of different model types.|
|8|Bagging (DT, 100)|Builds many trees in parallel on different samples.|
|9|MLP Classifier|A basic neural network with one hidden layer.|
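The two voting options are configured with `voting="soft"` in the code below, which means the class probabilities from each model are averaged and the class with the highest average wins. Here is a minimal sketch of that averaging. The function `soft_vote` and the probability values are made up for illustration, and the sketch assumes equal weights for every model.

```python
def soft_vote(probabilities):
    """Average per-class probabilities across models; return (winning_class, averages)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [
        sum(model[c] for model in probabilities) / n_models for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: averaged[c])
    return winner, averaged


# Hypothetical (low, medium, high) probabilities from three models for one wine
model_probs = [
    [0.00, 0.90, 0.10],  # model A is very sure about "medium"
    [0.00, 0.40, 0.60],  # model B leans "high"
    [0.00, 0.45, 0.55],  # model C leans "high"
]

winner, averaged = soft_vote(model_probs)
print(winner, [round(p, 3) for p in averaged])
```

In this example two of the three models lean towards class 2 (high), so a simple majority vote would pick "high". Soft voting picks class 1 (medium) because model A's confidence outweighs the two narrow leans.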


In [29]:

# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )

With the above helper function we can efficiently train, test, then print to the console the evaluation results of the chosen models. While I've included the code for all models, only models **#5** and **#7** will be run & included, per lab instructions (choose 2).

In [30]:

# List to store results
results = []

# # 1. Random Forest
# evaluate_model(
#     "Random Forest (100)",
#     RandomForestClassifier(n_estimators=100, random_state=42),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )

# # 2. Random Forest (200, max depth=10) 
# evaluate_model(
#     "Random Forest (200, max_depth=10)",
#     RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )

# # 3. AdaBoost 
# evaluate_model(
#     "AdaBoost (100)",
#     AdaBoostClassifier(n_estimators=100, random_state=42),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )

# # 4. AdaBoost (200, lr=0.5) 
# evaluate_model(
#     "AdaBoost (200, lr=0.5)",
#     AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )

# 5. Gradient Boosting
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# # 6. Voting Classifier (DT, SVM, NN) 
# voting1 = VotingClassifier(
#     estimators=[
#         ("DT", DecisionTreeClassifier()),
#         ("SVM", SVC(probability=True)),
#         ("NN", MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000)),
#     ],
#     voting="soft",
# )
# evaluate_model(
#     "Voting (DT + SVM + NN)", voting1, X_train, y_train, X_test, y_test, results
# )

# 7. Voting Classifier (RF, LR, KNN) 
voting2 = VotingClassifier(
    estimators=[
        ("RF", RandomForestClassifier(n_estimators=100)),
        ("LR", LogisticRegression(max_iter=1000)),
        ("KNN", KNeighborsClassifier()),
    ],
    voting="soft",
)
evaluate_model(
    "Voting (RF + LR + KNN)", voting2, X_train, y_train, X_test, y_test, results
)

# # 8. Bagging 
# evaluate_model(
#     "Bagging (DT, 100)",
#     BaggingClassifier(
#         estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
#     ),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )

# # 9. MLP Classifier 
# evaluate_model(
#     "MLP Classifier",
#     MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )


Gradient Boosting (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  3 247  14]
 [  0  16  27]]
Train Accuracy: 0.9601, Test Accuracy: 0.8562
Train F1 Score: 0.9584, Test F1 Score: 0.8411

Voting (RF + LR + KNN) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  27  16]]
Train Accuracy: 0.9132, Test Accuracy: 0.8500
Train F1 Score: 0.8933, Test F1 Score: 0.8185


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
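The warning above comes from the LogisticRegression member of the voting model and suggests either raising `max_iter` or scaling the data. One possible way to apply scaling is sketched below: wrap the scale-sensitive members in a pipeline with `StandardScaler`. This is an assumption about how the warning could be addressed, not something run in this lab. It uses synthetic data from `make_classification` as a stand-in for the wine features, and scaling KNN as well is an extra choice made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: 11 features, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=42
)

voting_scaled = VotingClassifier(
    estimators=[
        # Tree-based models do not depend on feature scale, so RF is left as-is
        ("RF", RandomForestClassifier(n_estimators=100, random_state=42)),
        # Scaling happens inside each pipeline, fitted only on the data passed to fit()
        ("LR", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("KNN", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    voting="soft",
)

voting_scaled.fit(X_demo, y_demo)
demo_predictions = voting_scaled.predict(X_demo)
print(demo_predictions[:10])
```

Whether this changes the scores on the wine data is untested here; the sketch only shows where the scaler would go.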


## Section 6: Compare Results

In this section we will compare the performance of the chosen models.

In [31]:
# Create a table of results 
results_df = pd.DataFrame(results)

print("\nSummary of All Models:")

# Sort by 'Test Accuracy' in descending order
df_sorted = results_df.sort_values(by="Test Accuracy", ascending=False)

# Print the sorted DataFrame
display(df_sorted)



Summary of All Models:


||Model|Train Accuracy|Test Accuracy|Train F1|Test F1|
|---|---|---|---|---|---|
|0|Gradient Boosting (100)|0.960125|0.85625|0.95841|0.841106|
|1|Voting (RF + LR + KNN)|0.913213|0.85|0.89328|0.818465|


## Section 7: Conclusions & Insights

In this section we will analyze and discuss the results. We'll also utilize results from another student in my class who's performed this same lab assignment and chose different models for their evaluation.

> Referenced work by [Brett Neely](https://github.com/bncodes19) - see [his GitHub Repo for this lab](https://github.com/bncodes19/applied-ml-bneely/blob/main/lab05/ensemble-neely.ipynb)

Brett chose to analyze the AdaBoost (100) and MLP Classifier models (3 & 9, respectively, from the table). The results Brett obtained were as follows:

||Model|Train Accuracy|Test Accuracy|Train F1|Test F1|
|---|---|---|---|---|---|
|0|AdaBoost (100)|0.834246|0.82500|0.820863|0.815803|
|1|MLP Classifier|0.851446|0.84375|0.814145|0.807318|

Merging Brett's results and the results I obtained for the Gradient Boosting (100) and Voting (RF + LR + KNN) models & sorting them by test **accuracy**:

||Model|Train Accuracy|Test Accuracy|Train F1|Test F1|
|---|---|---|---|---|---|
|5|Gradient Boosting (100)|0.960125|0.85625|0.95841|0.841106|
|7|Voting (RF + LR + KNN)|0.913213|0.85000|0.89328|0.818465|
|9|MLP Classifier|0.851446|0.84375|0.814145|0.807318|
|3|AdaBoost (100)|0.834246|0.82500|0.820863|0.815803|

**Accuracy** is a measure of what proportion of the predictions were correct. The formula for accuracy is what you'd expect:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

The highest accuracy model was Gradient Boosting (100), which was correct in 85.6% of its test predictions (and 96% of its training predictions). The lowest accuracy model, AdaBoost (100), still performed respectably, obtaining the correct result 82.5% of the time on the test set. That is a spread of only about three percentage points between best and worst. Not terrible.
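As a worked check of the accuracy formula, the test accuracies can be recomputed by hand from the confusion matrices printed in Section 5. Correct predictions sit on the diagonal. The helper name `accuracy_from_confusion` is just an illustration, not part of the lab code.

```python
def accuracy_from_confusion(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Test confusion matrices copied from the Section 5 output
gradient_boosting_cm = [[0, 13, 0], [3, 247, 14], [0, 16, 27]]
voting_cm = [[0, 13, 0], [0, 256, 8], [0, 27, 16]]

gb_accuracy = accuracy_from_confusion(gradient_boosting_cm)
voting_accuracy = accuracy_from_confusion(voting_cm)

print(gb_accuracy)      # 274 correct out of 320
print(voting_accuracy)  # 272 correct out of 320
```

Both values match the Test Accuracy column in the results table (0.85625 and 0.85).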

Sorting by the test **F1 Score** you get a slightly different order:

||Model|Train Accuracy|Test Accuracy|Train F1|Test F1|
|---|---|---|---|---|---|
|5|Gradient Boosting (100)|0.960125|0.85625|0.95841|0.841106|
|7|Voting (RF + LR + KNN)|0.913213|0.85000|0.89328|0.818465|
|3|AdaBoost (100)|0.834246|0.82500|0.820863|0.815803|
|9|MLP Classifier|0.851446|0.84375|0.814145|0.807318|
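The two orderings can be reproduced with a few lines of plain Python. This is a sketch using the test scores copied from the tables; the `rank` helper is hypothetical, and the notebook itself uses a pandas DataFrame for its own results.

```python
my_results = [
    {"Model": "Gradient Boosting (100)", "Test Accuracy": 0.85625, "Test F1": 0.841106},
    {"Model": "Voting (RF + LR + KNN)", "Test Accuracy": 0.85000, "Test F1": 0.818465},
]
bretts_results = [
    {"Model": "AdaBoost (100)", "Test Accuracy": 0.82500, "Test F1": 0.815803},
    {"Model": "MLP Classifier", "Test Accuracy": 0.84375, "Test F1": 0.807318},
]


def rank(rows, metric):
    """Return model names ordered from best to worst on the given metric."""
    return [row["Model"] for row in sorted(rows, key=lambda r: r[metric], reverse=True)]


merged = my_results + bretts_results
by_accuracy = rank(merged, "Test Accuracy")
by_f1 = rank(merged, "Test F1")

print(by_accuracy)
print(by_f1)
```

The top two models are the same under both metrics; only AdaBoost (100) and MLP Classifier swap places.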

**F1 score** is a single metric that provides a balanced measure of a model's performance, combining two crucial metrics: **precision** and **recall**.

$$F1 \ Score = 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{2 \cdot TP}{2 \cdot TP + FN + FP}$$

The model with the best F1 score was again Gradient Boosting (100). AdaBoost (100), the lowest-accuracy model, actually had a slightly better test F1 score (0.816) than MLP Classifier (0.807), which ranks last by F1.
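The F1 values in the table use `average="weighted"`, meaning an F1 score is computed for each class and then averaged with weights proportional to how many test samples each class has. The sketch below recomputes this from the Section 5 confusion matrices using the TP/FN/FP form of the formula. The function name `weighted_f1` is illustrative, and a class with no true positives, false negatives or false positives is given an F1 of 0 by assumption.

```python
def weighted_f1(matrix):
    """Per-class F1 and support-weighted average (rows = actual, columns = predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    per_class = []
    weighted = 0.0
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                         # actual c, predicted other
        fp = sum(matrix[r][c] for r in range(n)) - tp    # predicted c, actually other
        denom = 2 * tp + fn + fp
        f1 = 2 * tp / denom if denom else 0.0
        per_class.append(f1)
        weighted += f1 * sum(matrix[c]) / total          # weight by class support
    return per_class, weighted


# Test confusion matrices copied from the Section 5 output
gradient_boosting_cm = [[0, 13, 0], [3, 247, 14], [0, 16, 27]]
voting_cm = [[0, 13, 0], [0, 256, 8], [0, 27, 16]]

gb_per_class, gb_f1 = weighted_f1(gradient_boosting_cm)
voting_per_class, voting_f1 = weighted_f1(voting_cm)

print([round(f, 4) for f in gb_per_class], round(gb_f1, 6))
print([round(f, 4) for f in voting_per_class], round(voting_f1, 6))
```

The weighted results agree with the Test F1 column (0.841106 and 0.818465). The per-class values also show something the single number hides: the F1 for class 0 (low) is 0 for both models, because neither correctly identified any of the 13 low-quality wines in the test set.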

### Overall conclusion

For the purpose of predicting the quality class of wine in this dataset, there are no "bad" models to choose from. However, there are meaningful differences between the models tested. The Gradient Boosting (100) model is the strongest choice on both test accuracy and test F1. This model uses the Gradient Boosting technique with 100 estimators (`n_estimators=100`), a learning rate of 0.1, and a maximum tree depth of 3.

> **Gradient Boosting** is a machine learning technique that builds an ensemble of weak prediction models, typically decision trees. It works in a stage-wise fashion, where each new model is trained to correct the errors made by the previous models. The "gradient" part refers to the use of gradient descent in the optimization process to find the best way to combine the weak learners.  
> <cite>source: Gemini</cite>

