# Lab 5: Ensemble Models for Wine Quality Prediction
**Author:** Kate Huntsman  
**Date:** April 8th, 2025  
**Objective:** Compare ensemble classifiers for predicting red wine quality from physicochemical measurements.

## Introduction
In this lab, we apply ensemble machine learning models to predict the quality of red wine from its physicochemical characteristics. Ensembles combine many base learners and often generalize better than any single model. We evaluate two ensemble strategies, Random Forest and Gradient Boosting, and compare them using accuracy, weighted F1 score, and the gap between training and test performance.

## Imports
In the code cell below, import the necessary Python libraries for this notebook.  

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

## Section 1: Load and Inspect the Data

In [36]:
# Load the dataset (download from UCI and save in the same folder)
df = pd.read_csv("winequality-red.csv", sep=";")

# Display structure and first few rows
df.info()
df.head()

# The dataset includes 11 physicochemical input variables (features):
# ---------------------------------------------------------------
# - fixed acidity          mostly tartaric acid
# - volatile acidity       mostly acetic acid (vinegar)
# - citric acid            can add freshness and flavor
# - residual sugar         remaining sugar after fermentation
# - chlorides              salt content
# - free sulfur dioxide    protects wine from microbes
# - total sulfur dioxide   sum of free and bound forms
# - density                related to sugar content
# - pH                     acidity level (lower = more acidic)
# - sulphates              antioxidant and microbial stabilizer
# - alcohol                % alcohol by volume

# The target variable is:
# - quality (integer score from 0 to 10, rated by wine tasters)

# We will simplify this target into three categories:
#   - low (3–4), medium (5–6), high (7–8) to make classification feasible.
#   - we will also make this numeric (we want both for clarity)
# The dataset contains 1599 samples and 12 columns (11 features + target).

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


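|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---------------|------------------|-------------|----------------|-----------|---------------------|----------------------|---------|----|-----------|---------|---------|
| 0 | 7.4  | 0.7  | 0.0  | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8  | 0.88 | 0.0  | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2  | 0.68 | 9.8 | 5 |
| 2 | 7.8  | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997  | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998  | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4  | 0.7  | 0.0  | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |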


## Section 2: Prepare the Data

In [37]:
# Transform the quality score into categorical labels and numeric targets
# 'low', 'medium', 'high' are easier for classification and interpretation
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

df["quality_label"] = df["quality"].apply(quality_to_label)
df["quality_numeric"] = df["quality"].apply(quality_to_number)
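
The two mapping functions above apply the same thresholds twice. As a minimal sketch of an alternative (not the approach used in this notebook), `pd.cut` can produce both the label and the numeric code from one set of bin edges. The sample scores below are hypothetical, and the bin edges are chosen to mirror the `q <= 4` and `q <= 6` cutoffs.

```python
import pandas as pd

# Hypothetical sample of raw quality scores (not taken from the dataset)
quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Bins are right-inclusive: (-1, 4] -> low, (4, 6] -> medium, (6, 10] -> high
labels = pd.cut(quality, bins=[-1, 4, 6, 10], labels=["low", "medium", "high"])

# Category codes follow the label order: low=0, medium=1, high=2
codes = labels.cat.codes

print(pd.DataFrame({"quality": quality, "label": labels, "code": codes}))
```

Keeping the thresholds in one place means the label and the numeric target cannot drift apart if the cutoffs are changed later.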

## Section 3: Feature Selection

In [38]:
# Select features and the target variable
# We drop columns not needed for training (original quality columns and labels)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])
y = df["quality_numeric"]

## Section 4: Split the Data

In [39]:
# Split the dataset into training and testing sets
# Stratify so the train and test sets keep the same class proportions as the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
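
To illustrate what `stratify` does, the sketch below splits a synthetic, imbalanced label vector and compares class shares in each part. The 10% / 80% / 10% class counts are an assumption made for illustration, not the actual wine distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (hypothetical counts): 10% / 80% / 10%
y_demo = np.array([0] * 100 + [1] * 800 + [2] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of each class in the train and test parts
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train:", train_share)
print("test: ", test_share)
```

Stratifying does not balance the classes; it only keeps the minority classes represented in the test set at roughly the same rate as in the full data.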

## Section 5: Model Evaluation

In [40]:
# Define a helper function to evaluate and compare models
# It prints performance metrics and appends them to a results list for later comparison
results = []

def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
            "Accuracy Gap": train_acc - test_acc,
            "F1 Gap": train_f1 - test_f1
        }
    )

### Run Two Models of Your Choice (RandomForest and Gradient Boosting)

In [41]:
# Evaluate Random Forest Classifier with 100 trees
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# Evaluate Gradient Boosting Classifier with 100 estimators and learning rate of 0.1
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


Random Forest (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
Train Accuracy: 1.0000, Test Accuracy: 0.8875
Train F1 Score: 1.0000, Test F1 Score: 0.8661

Gradient Boosting (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  3 247  14]
 [  0  16  27]]
Train Accuracy: 0.9601, Test Accuracy: 0.8562
Train F1 Score: 0.9584, Test F1 Score: 0.8411
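
Overall accuracy can hide how each class fares. As a worked check, the sketch below recomputes accuracy and per-class recall from the Random Forest confusion matrix printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Random Forest test confusion matrix copied from the output above
# rows = true class (0=low, 1=medium, 2=high), columns = predicted class
cm = np.array([[0, 13, 0],
               [0, 256, 8],
               [0, 15, 28]])

support = cm.sum(axis=1)          # number of true samples per class
recall = np.diag(cm) / support    # share of each class predicted correctly
accuracy = np.trace(cm) / cm.sum()

for name, r, n in zip(["low", "medium", "high"], recall, support):
    print(f"{name:<6} recall={r:.3f} (n={n})")
print(f"accuracy={accuracy:.4f}")
```

The accuracy matches the reported 0.8875, but none of the 13 low-quality wines in the test set is classified correctly, so the headline score is driven mostly by the medium class.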


## Section 6: Compare Results

In [42]:
# Convert the list of results into a DataFrame for easier sorting and visualization
results_df = pd.DataFrame(results)
results_df_sorted = results_df.sort_values(by="Test Accuracy", ascending=False)
display(results_df_sorted)

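|   | Model                   | Train Accuracy | Test Accuracy | Train F1 | Test F1  | Accuracy Gap | F1 Gap   |
|---|-------------------------|----------------|---------------|----------|----------|--------------|----------|
| 0 | Random Forest (100)     | 1.0            | 0.8875        | 1.0      | 0.866056 | 0.1125       | 0.133944 |
| 1 | Gradient Boosting (100) | 0.960125       | 0.85625       | 0.95841  | 0.841106 | 0.103875     | 0.117304 |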


## Section 7: Conclusions and Insights

### Overall Performance
- **Random Forest** performed best on the test set:
  - **Test Accuracy**: 88.75%
  - **Test F1 Score**: 86.61%
- **Gradient Boosting** was close behind:
  - **Test Accuracy**: 85.62%
  - **Test F1 Score**: 84.11%

### Generalization (Train-Test Gap)
| Model                  | Train Accuracy | Test Accuracy | Accuracy Gap | Train F1 | Test F1 | F1 Gap   |
|------------------------|----------------|----------------|---------------|----------|---------|----------|
| Random Forest (100)    | 1.0000         | 0.8875         | **0.1125**    | 1.0000   | 0.8661  | **0.1339** |
| Gradient Boosting (100)| 0.9601         | 0.8562         | **0.1039**    | 0.9584   | 0.8411  | **0.1173** |

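- Gradient Boosting had slightly **smaller gaps** between training and test scores (0.1039 vs. 0.1125 for accuracy), but Random Forest still scored higher on the test set. A smaller gap alone does not show **better generalization**; test performance is the measure of that, and these numbers come from a single 80/20 split.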

### Overfitting
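- **Random Forest** shows signs of **overfitting**:
  - Perfect train accuracy and F1 (1.0) but lower test performance. Fully grown trees routinely memorize the training set, so this is expected with default settings and is not necessarily harmful.
  - Larger gaps between train and test metrics.
- **Gradient Boosting** fit the training data less tightly (0.9601 train accuracy), but its gaps were only about 0.01 to 0.02 smaller.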

### Next Steps & Hyperparameter Ideas
- **Random Forest**:
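  - Limit `max_depth`, raise `min_samples_leaf`, or lower `max_features` to reduce overfitting. Reducing `n_estimators` is unlikely to help: adding trees does not make a random forest overfit more, it only increases training time.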
- **Gradient Boosting**:
  - Tune `learning_rate`, `n_estimators`, and `max_depth`.
  - Consider using `subsample < 1.0` for regularization.
- **Both**:
  - Perform **cross-validation** to confirm results.
  - Try **XGBoost** or **LightGBM** as alternative boosting methods.
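
The cross-validation step could look roughly like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the wine features, and the settings `max_depth=8` and `subsample=0.8` are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the wine data (assumption: 11 features, 3 imbalanced classes)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_classes=3,
    weights=[0.1, 0.8, 0.1], random_state=42,
)

# Stratified folds keep class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Random Forest (max_depth=8)": RandomForestClassifier(
        n_estimators=100, max_depth=8, random_state=42
    ),
    "Gradient Boosting (subsample=0.8)": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3,
        subsample=0.8, random_state=42,
    ),
}

cv_scores = {}
for name, model in models.items():
    # Weighted F1 matches the metric used in evaluate_model
    cv_scores[name] = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="f1_weighted"
    )
    print(f"{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}")
```

Reporting the mean and spread across folds gives a steadier comparison than one train/test split, which matters here because the two models differ by only a few points.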

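**Conclusion**: On this split, Random Forest gave the best test accuracy and F1 with default settings. Gradient Boosting showed a slightly smaller train-test gap but lower test scores, so it would need tuning and cross-validation before it could be called the more generalizable model. Further tuning could help both!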
