#### **Ensemble Lab**
Mahitha

**Date: 4/19/2025**

**Introduction:**


In this lab, we explore the application of various ensemble machine learning models to classify wine quality using the winequality-red dataset. The dataset contains physicochemical properties of red wine samples, such as acidity, sugar content, and alcohol percentage, along with a quality score rated by wine tasters. To simplify the analysis, the quality scores are categorized into three classes: low, medium, and high.

The primary objective of this lab is to compare the performance of different ensemble models, including Random Forest, Gradient Boosting, AdaBoost, and Voting Classifiers, among others. We evaluate these models based on metrics such as accuracy, F1 score, and their ability to generalize to unseen data. By analyzing the results, we aim to identify the most effective model for predicting wine quality while balancing performance and generalization.

This lab also emphasizes the importance of feature selection, data preprocessing, and hyperparameter tuning in building robust machine learning models. Through this hands-on approach, we gain insights into the strengths and limitations of ensemble methods in classification tasks.

In [1]:
# Imports
# ------------------------------------------------
# Imports once at the top, organized
# ------------------------------------------------

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score   
)

#### **Section 1. Load and Inspect the Data**

In [7]:
# Load the dataset (download from UCI and save in the same folder)
df = pd.read_csv(r'C:\Users\Mahi2\projects\applied-ml-mk\applied-ml-mk\lab05\wine\winequality-red.csv', sep=';')
# Display structure and first few rows
df.info()
df.head(5)

# The dataset includes 11 physicochemical input variables (features):
# ---------------------------------------------------------------
# - fixed acidity          mostly tartaric acid
# - volatile acidity       mostly acetic acid (vinegar)
# - citric acid            can add freshness and flavor
# - residual sugar         remaining sugar after fermentation
# - chlorides              salt content
# - free sulfur dioxide    protects wine from microbes
# - total sulfur dioxide   sum of free and bound forms
# - density                related to sugar content
# - pH                     acidity level (lower = more acidic)
# - sulphates              antioxidant and microbial stabilizer
# - alcohol                % alcohol by volume

# The target variable is:
# - quality (integer score from 0 to 10, rated by wine tasters)

# We will simplify this target into three categories:
#   - low (3–4), medium (5–6), high (7–8) to make classification feasible.
#   - we will also make this numeric (we want both for clarity)
# The dataset contains 1599 samples and 12 columns (11 features + target).

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


#### **Section 2. Prepare the Data**

Includes cleaning, feature engineering, encoding, splitting, helper functions

#### **Define a helper function**

**The function takes one input, the quality score (temporarily named `q` inside the function), and returns the quality label as a string (low, medium, or high).**

**Scores of 4 or lower map to "low", scores of 5 or 6 map to "medium", and anything higher maps to "high".**

**This function is used to create the `quality_label` column.**

In [12]:
# Function to convert quality score to label
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

# Apply the function to create a new column
df["quality_label"] = df["quality"].apply(quality_to_label)


# Function to convert quality score to a numeric value
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

# Apply the function to create a numeric column
df["quality_numeric"] = df["quality"].apply(quality_to_number)
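
The same mapping can also be expressed as a binning step. The sketch below is one possible alternative, not the approach used in this lab: it applies `pd.cut` with right-inclusive bins to a small hypothetical series of scores (the `quality` values here are made up to cover the 3 to 8 range).

```python
import numpy as np
import pandas as pd

# Hypothetical sample of quality scores covering the observed range (3-8).
quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-inclusive bins: (-inf, 4] -> low, (4, 6] -> medium, (6, inf] -> high.
bins = [-np.inf, 4, 6, np.inf]
quality_label = pd.cut(quality, bins=bins, labels=["low", "medium", "high"])

# Integer codes follow the label order: low=0, medium=1, high=2.
quality_numeric = quality_label.cat.codes

print(quality_label.tolist())
print(quality_numeric.tolist())
```

Because the labels are an ordered categorical, the text and numeric columns come from a single definition of the bin edges, which avoids keeping two functions in sync.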



#### **Section 3. Feature Selection and Justification**

**Define input features (X) and target (y)**

**Features: all columns except 'quality', 'quality_label', and 'quality_numeric' (drop these from the input array)**

**Target: quality_numeric (the numeric version of the quality_label column we just created)**

`X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features`

`y = df["quality_numeric"]  # Target`

Explain / introduce your choices:

We want to train only on the physicochemical properties of the wine (acidity, pH, alcohol content, etc.). We treat this as a multi-class classification problem in which the model predicts one of three categories.

In [14]:
## Section 4. Split the Data into Train and Test
# Train/test split (stratify to preserve class balance)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"] # Target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
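
To see what `stratify=y` does, the sketch below repeats the same split on synthetic labels. The class counts (65, 1320, 215) are hypothetical and chosen only to be imbalanced in a similar way to the wine classes; `class_share` is a helper introduced for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical counts, not the wine data):
# 0 = low, 1 = medium, 2 = high.
y = np.repeat([0, 1, 2], [65, 1320, 215])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


def class_share(labels):
    """Return the fraction of samples in each of the three classes."""
    return np.bincount(labels, minlength=3) / len(labels)


print("full :", class_share(y).round(3))
print("train:", class_share(y_train).round(3))
print("test :", class_share(y_test).round(3))
```

With stratification, the three printed rows should be nearly identical, so the small "low" class is represented in the test set in the same proportion as in the full data.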

#### **Section 5. Evaluate Model Performance (Choose 2)**

Below is a list of 9 model variations. Choose two to focus on for your comparison.

| Option | Model Name | Notes |
|---|---|---|
| 1 | Random Forest (100) | A strong baseline model using 100 decision trees. |
| 2 | Random Forest (200, max_depth=10) | Adds more trees, but limits tree depth to reduce overfitting. |
| 3 | AdaBoost (100) | Boosting method that focuses on correcting previous errors. |
| 4 | AdaBoost (200, lr=0.5) | More iterations and slower learning for better generalization. |
| 5 | Gradient Boosting (100) | Boosting approach using gradient descent. |
| 6 | Voting (DT + SVM + NN) | Combines diverse models by averaging their predictions. |
| 7 | Voting (RF + LR + KNN) | Another mix of different model types. |
| 8 | Bagging (DT, 100) | Builds many trees in parallel on different samples. |
| 9 | MLP Classifier | A basic neural network with one hidden layer. |
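
The voting options (6 and 7) combine models by averaging their predicted class probabilities, which is what `voting="soft"` refers to. As a rough illustration of that averaging step, the sketch below uses made-up probabilities from three hypothetical models (`proba_a`, `proba_b`, `proba_c`) for two samples; it does not train anything and assumes equal weights for all models.

```python
import numpy as np

# Hypothetical class probabilities (low, medium, high) from three models
# for two samples; the numbers are made up for illustration.
proba_a = np.array([[0.1, 0.6, 0.3], [0.2, 0.3, 0.5]])
proba_b = np.array([[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]])
proba_c = np.array([[0.0, 0.3, 0.7], [0.1, 0.2, 0.7]])

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = np.mean([proba_a, proba_b, proba_c], axis=0)
soft_vote = mean_proba.argmax(axis=1)

print(mean_proba.round(3))
print(soft_vote)
```

In the first sample, one model strongly prefers "high", but the averaged probability still favors "medium" because the other two models agree on it.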

In [17]:
# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )


######################################
#Here's how to create the different types of 
# ensemble models listed above 
# (you don't need to do all of them yourself. 
# Choose 2 - we have a whole team working on this.)
######################################

results = []

# 1. Random Forest
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 2. Random Forest (200, max depth=10) 
# evaluate_model(
#     "Random Forest (200, max_depth=10)",
#     RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )

# 3. AdaBoost 
# evaluate_model(
#     "AdaBoost (100)",
#     AdaBoostClassifier(n_estimators=100, random_state=42),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )

# 4. AdaBoost (200, lr=0.5) 
# evaluate_model(
#     "AdaBoost (200, lr=0.5)",
#     AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )

# 5. Gradient Boosting
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 6. Voting Classifier (DT, SVM, NN) 
# voting1 = VotingClassifier(
#     estimators=[
#         ("DT", DecisionTreeClassifier()),
#         ("SVM", SVC(probability=True)),
#         ("NN", MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000)),
#     ],
#     voting="soft",
# )
# evaluate_model(
#     "Voting (DT + SVM + NN)", voting1, X_train, y_train, X_test, y_test, results
# )

# 7. Voting Classifier (RF, LR, KNN) 
# voting2 = VotingClassifier(
#     estimators=[
#         ("RF", RandomForestClassifier(n_estimators=100)),
#         ("LR", LogisticRegression(max_iter=1000)),
#         ("KNN", KNeighborsClassifier()),
#     ],
#     voting="soft",
# )
# evaluate_model(
#     "Voting (RF + LR + KNN)", voting2, X_train, y_train, X_test, y_test, results
# )

# 8. Bagging 
# evaluate_model(
#     "Bagging (DT, 100)",
#     BaggingClassifier(
#         estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
#     ),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )

# 9. MLP Classifier 
# evaluate_model(
#     "MLP Classifier",
#     MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     results,
# )


Random Forest (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
Train Accuracy: 1.0000, Test Accuracy: 0.8875
Train F1 Score: 1.0000, Test F1 Score: 0.8661

Gradient Boosting (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  3 247  14]
 [  0  16  27]]
Train Accuracy: 0.9601, Test Accuracy: 0.8562
Train F1 Score: 0.9584, Test F1 Score: 0.8411
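
The overall accuracy hides how each class is handled. The sketch below recomputes per-class recall, accuracy, and weighted F1 directly from the Random Forest (100) confusion matrix printed above. It assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes) and the label order 0 = low, 1 = medium, 2 = high from `quality_to_number`.

```python
import numpy as np

# Random Forest (100) test confusion matrix, copied from the output above.
# Rows are true classes, columns are predicted classes (low, medium, high).
cm = np.array([[0, 13, 0],
               [0, 256, 8],
               [0, 15, 28]])

true_pos = np.diag(cm)
support = cm.sum(axis=1)    # number of true samples per class
predicted = cm.sum(axis=0)  # number of predictions per class

recall = true_pos / support
# F1 = 2*TP / (true count + predicted count); safe here because no class
# has both zero support and zero predictions.
f1 = 2 * true_pos / (support + predicted)

accuracy = true_pos.sum() / cm.sum()
weighted_f1 = np.sum(f1 * support) / support.sum()

print("recall per class:", recall.round(4))
print(f"accuracy: {accuracy:.4f}, weighted F1: {weighted_f1:.4f}")
```

The recall for the low class is 0: all 13 low-quality wines in the test set are predicted as medium, even though the overall accuracy is 0.8875.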


#### **Section 6. Compare Results**

In [18]:
# Create a table of results 
results_df = pd.DataFrame(results)

######################################
# Recommendation: See if you can add gap calculations 
# to your results and sort the table by test accuracy 
# to find the best models more efficiently. 
######################################

results_df["Accuracy Gap"] = results_df["Train Accuracy"] - results_df["Test Accuracy"]
results_df["F1 Gap"] = results_df["Train F1"] - results_df["Test F1"]

results_df = results_df.sort_values(by="Test Accuracy", ascending=False)

print("\nSummary of All Models:")
display(results_df)


Summary of All Models:


| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 | Accuracy Gap | F1 Gap |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest (100) | 1.0 | 0.8875 | 1.0 | 0.866056 | 0.1125 | 0.133944 |
| 1 | Gradient Boosting (100) | 0.960125 | 0.85625 | 0.95841 | 0.841106 | 0.103875 | 0.117304 |


#### **Section 7. Conclusions and Insights**

#### **Model 1: Random Forest (100)**

**Train Accuracy:** 1.000000 (100%)
The model perfectly predicts the training data, which is a sign of potential overfitting.

**Test Accuracy:** 0.88750 (88.75%)
The model performs well on unseen data but not as perfectly as on the training data.

**Accuracy Gap:** 0.112500 (11.25%)
The gap between training and test accuracy indicates some overfitting, as the model performs significantly better on the training data.

**Train F1:** 1.00000 (100%)
A perfect F1 score on the training data, consistent with the perfect training accuracy and the same sign of potential overfitting.

**Test F1:** 0.866056 (86.61%)
The test F1 score is lower than the test accuracy (0.8875), in part because none of the 13 low-quality wines in the test set is classified correctly, which pulls down the weighted F1.

**F1 Gap:** 0.133944 (13.39%)
The gap between training and test F1 scores also indicates overfitting.

#### **Model 2: Gradient Boosting (100)**

**Train Accuracy:** 0.960125 (96.01%)
The model performs very well on the training data without fitting it perfectly, unlike Random Forest.

**Test Accuracy:** 0.85625 (85.63%)
Slightly lower than Random Forest, but still a strong performance on unseen data.

**Accuracy Gap:** 0.103875 (10.39%)
The gap is slightly smaller than Random Forest's (10.39% vs. 11.25%), suggesting somewhat better generalization to unseen data.

**Train F1:** 0.95841 (95.84%)
High F1 score on the training data, indicating good balance between precision and recall.

**Test F1:** 0.841106 (84.11%)
Lower than the training F1 score, but still strong on unseen data.

**F1 Gap:** 0.117304 (11.73%)
Smaller than Random Forest's F1 gap (11.73% vs. 13.39%), again suggesting somewhat better generalization.

#### **Comparison of Analysis**

**Random Forest (100)**

- Performance: Slightly better on the test set in terms of accuracy and F1 score.
- Overfitting: Shows more overfitting, as indicated by the larger accuracy and F1 gaps.

**Gradient Boosting (100)**

- Performance: Slightly lower test performance compared to Random Forest.
- Generalization: Generalizes better, as shown by smaller gaps between training and test metrics.

**Conclusion**

Random Forest (100) is a strong performer but shows signs of overfitting.
Gradient Boosting (100) generalizes better and might be more reliable for unseen data, even if its test performance is slightly lower.
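
The gaps above come from a single train/test split, so one way to check whether the difference between the two models is stable is stratified k-fold cross-validation. The sketch below is an assumption-laden illustration, not part of the lab results: it uses a synthetic dataset from `make_classification` as a stand-in for the wine features, and the sample size, class weights, and fold count are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the wine features (not the real dataset):
# 3 imbalanced classes, 11 features.
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3,
    weights=[0.1, 0.7, 0.2], random_state=42,
)

models = {
    "Random Forest (100)": RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
    "Gradient Boosting (100)": GradientBoostingClassifier(
        n_estimators=100, random_state=42
    ),
}

# Same folds for both models so the scores are directly comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

If the fold-to-fold standard deviation is larger than the difference between the two means, the ranking of the models from a single split should be treated with caution.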

Results from Sandra Ruiz

(https://github.com/S572396/ml-05-sruiz/blob/main/ensemble-sruiz.ipynb)

| Model | Train Accuracy | Test Accuracy |
|---|---|---|
| Random Forest (100) | 1.000000 | 0.8875 |
| AdaBoost (100) | 0.834246 | 0.8250 |
