# Lab 5: Ensemble Machine Learning – Wine Quality Dataset
**Author:** Katie McGaughey  
**Date:** April 11, 2025  


## Introduction
This notebook classifies red wine quality (low, medium, high) with ensemble models trained on the Wine Quality dataset from the UCI Machine Learning Repository. We evaluate two approaches, a Random Forest and a soft-voting Voting Classifier, on 11 physicochemical attributes and compare them using accuracy and weighted F1 scores. The objective is to determine which model predicts wine quality more effectively.

## Imports

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

## Section 1. Import and Inspect the Data

In [25]:
# Load dataset from local file
df = pd.read_csv("winequality-red.csv", sep=";")
print("Dataset Info:")
df.info()
print("\nFirst Rows:")
print(df.head())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

First Rows:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4     

#### Reflection 
##### - **Data Structure:** 1599 samples, 12 columns (11 features + quality), all numeric with no missing values.  
##### - **Features:** Physicochemical properties (e.g., alcohol, pH) likely influence quality scores.   
##### - **Target:** Quality (3–8) requires categorization for classification.

## Section 2: Data Exploration and Preparation

In [26]:
# Categorize quality
def quality_to_number(q):
    if q <= 4: return 0  # Low
    elif q <= 6: return 1  # Medium
    else: return 2  # High

df['quality_numeric'] = df['quality'].apply(quality_to_number)
print("\nClass Distribution:")
print(df['quality_numeric'].value_counts(normalize=True))


Class Distribution:
quality_numeric
1    0.824891
2    0.135710
0    0.039400
Name: proportion, dtype: float64


#### Reflection 
##### - **Classes:** Low (0, 3.94%), Medium (1, 82.49%), High (2, 13.57%) show heavy imbalance favoring medium quality. 
##### - **Encoding:** Numeric labels (0, 1, 2) suit scikit-learn; imbalance may bias models toward medium class.  
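
The same three-way encoding can also be written as a vectorized binning step. The snippet below is a minimal sketch, assuming the thresholds used in `quality_to_number` (low up to 4, medium up to 6, high above 6); `sample_quality` is a hypothetical series for illustration, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw quality scores covering the 3-8 range
sample_quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-inclusive bins: (-inf, 4] -> 0 (low), (4, 6] -> 1 (medium), (6, inf) -> 2 (high)
binned = pd.cut(
    sample_quality,
    bins=[-np.inf, 4, 6, np.inf],
    labels=[0, 1, 2],
).astype(int)

print(binned.tolist())
```

Keeping the bin edges in one list makes the thresholds easy to adjust if a different low/medium/high cut is wanted.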

## Section 3: Feature Selection and Justification

In [27]:
# Features and target
X = df.drop(columns=['quality', 'quality_numeric'])
y = df['quality_numeric']
print("Features:", X.columns.tolist())

Features: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']


#### Reflection 
##### - **Why these features?** All 11 features (e.g., alcohol, sulphates) capture chemical influences on quality; alcohol in particular tends to correlate with quality ratings.
##### - **Other options:** A subset such as alcohol or pH alone could simplify the model, but the full set maximizes predictive potential.  
##### - **Risk:** Uninformative or correlated features may add noise, though tree ensembles are fairly robust to this.
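
One way to check which features carry signal is to rank them by a fitted forest's `feature_importances_`. The sketch below uses a hypothetical toy frame (`signal`, `noise_a`, `noise_b`) rather than the wine data, so the column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data (hypothetical): one informative column and two pure-noise columns
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y_toy = (X_toy["signal"] > 0).astype(int)  # label depends only on "signal"

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy, y_toy)

# Impurity-based importances sum to 1; higher means the feature was used more for splits
importances = pd.Series(
    forest.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

Applied to the notebook's `X` and a fitted Random Forest, the same pattern would give a ranking of the 11 physicochemical features.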

## Section 4: Train Ensemble Models

### 4.1 Split the Data

In [28]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("Train Size:", X_train.shape)
print("Test Size:", X_test.shape)

Train Size: (1279, 11)
Test Size: (320, 11)


#### Reflection 
##### - **Stratification:** Preserves class distribution (e.g., 3.94% low) in both sets, essential for imbalanced data.    
##### - **Size:** 80/20 split (1279 train, 320 test) balances training and evaluation.
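
The effect of `stratify` can be checked by comparing class proportions before and after the split. This is a minimal sketch on synthetic labels whose counts (63 / 1319 / 217) are chosen to match the proportions reported above; `X_demo` is a placeholder feature matrix, not the wine features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mimicking the reported imbalance (about 4% / 82% / 14%)
y_demo = np.array([0] * 63 + [1] * 1319 + [2] * 217)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def class_proportions(labels):
    """Return the share of each class (0, 1, 2) in a label array."""
    counts = np.bincount(labels, minlength=3)
    return counts / counts.sum()

full_props = class_proportions(y_demo)
train_props = class_proportions(y_tr)
test_props = class_proportions(y_te)

print("full :", full_props.round(4))
print("train:", train_props.round(4))
print("test :", test_props.round(4))
```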

### 4.2 Train and Evaluate

In [29]:
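# Helper: fit a model, print its test confusion matrix and train/test scores,
# and append the scores to `results` (the list is mutated in place)
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):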
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average='weighted')
    test_f1 = f1_score(y_test, y_test_pred, average='weighted')
    
    print(f"\n{name}:")
    print("Test Confusion Matrix:")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"  Train Acc: {train_acc:.3f}, Test Acc: {test_acc:.3f}")
    print(f"  Train F1: {train_f1:.3f}, Test F1: {test_f1:.3f}")
    
    results.append({
        'Model': name, 'Train Acc': train_acc, 'Test Acc': test_acc,
        'Train F1': train_f1, 'Test F1': test_f1
    })

results = []

# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
evaluate_model("Random Forest (100)", rf_model, X_train, y_train, X_test, y_test, results)

# Voting Classifier
voting_model = VotingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=42)),
        ('svm', SVC(probability=True, random_state=42)),
        ('nn', MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=42))
    ], voting='soft'
)
evaluate_model("Voting (DT+SVM+NN)", voting_model, X_train, y_train, X_test, y_test, results)


Random Forest (100):
Test Confusion Matrix:
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
  Train Acc: 1.000, Test Acc: 0.887
  Train F1: 1.000, Test F1: 0.866

Voting (DT+SVM+NN):
Test Confusion Matrix:
[[  0  12   1]
 [  0 253  11]
 [  0  19  24]]
  Train Acc: 0.923, Test Acc: 0.866
  Train F1: 0.906, Test F1: 0.843


#### Reflection 
##### - **Random Forest:** Perfect train accuracy (1.0) drops to 0.887 on test, indicating overfitting, but test performance remains strong (28/43 high-quality wines correct).  
##### - **Voting:** Train accuracy (0.923) is closer to test (0.866), suggesting better generalization, but no low-quality wine is predicted correctly (0/13) and fewer high-quality wines are (24/43).  
##### - **Imbalance Impact:** Both models favor the medium class (82.49% of samples) and miss every low-quality wine in the test set (0/13).
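
Per-class recall can be read directly off the test confusion matrices printed above: each diagonal entry divided by its row sum. The sketch below copies the two matrices from the output; the helper name `per_class_recall` is introduced here for illustration.

```python
import numpy as np

# Test confusion matrices copied from the output above (rows = true class, columns = predicted)
rf_cm = np.array([[0, 13, 0], [0, 256, 8], [0, 15, 28]])
voting_cm = np.array([[0, 12, 1], [0, 253, 11], [0, 19, 24]])

def per_class_recall(cm):
    """Recall per class: correct predictions (diagonal) over true counts (row sums)."""
    return np.diag(cm) / cm.sum(axis=1)

rf_recall = per_class_recall(rf_cm)
voting_recall = per_class_recall(voting_cm)

for name, recall in [("Random Forest", rf_recall), ("Voting", voting_recall)]:
    low, medium, high = recall
    print(f"{name}: low={low:.3f}, medium={medium:.3f}, high={high:.3f}")
```

Both models have a recall of zero on the low class, which accuracy and weighted F1 largely hide because that class makes up only 13 of the 320 test wines.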

## Section 5: Compare Results

In [30]:
# Results table
results_df = pd.DataFrame(results)
results_df['Acc Gap'] = results_df['Train Acc'] - results_df['Test Acc']
results_df['F1 Gap'] = results_df['Train F1'] - results_df['Test F1']
print("\nResults Summary:")
print(results_df[['Model', 'Train Acc', 'Test Acc', 'Acc Gap', 'Train F1', 'Test F1', 'F1 Gap']])


Results Summary:
                 Model  Train Acc  Test Acc   Acc Gap  Train F1   Test F1  \
0  Random Forest (100)   1.000000  0.887500  0.112500   1.00000  0.866056   
1   Voting (DT+SVM+NN)   0.922596  0.865625  0.056971   0.90606  0.843416   

     F1 Gap  
0  0.133944  
1  0.062644  


#### Reflection 
##### - **Best Model:** Random Forest leads in test accuracy (0.887 vs. 0.866) and weighted F1 (0.866 vs. 0.843).  
##### - **Gaps:** Random Forest’s larger gaps (Acc: 0.112, F1: 0.134) confirm overfitting; Voting’s smaller gaps (Acc: 0.057, F1: 0.063) show stability.  
##### - **Trade-off:** Random Forest for raw performance, Voting for a smaller train/test gap; neither predicts any low-quality wine correctly (0/13 for both).

## Section 6: Final Thoughts

### 6.1 Summarize Findings

##### - **Best Model:** Random Forest (100) achieves highest test accuracy (0.887) and F1 (0.866), leveraging ensemble strength on non-linear data.  
##### - **Alternative:** Voting Classifier (DT+SVM+NN) offers lower accuracy (0.866) but better generalization (smaller gaps: Acc 0.057, F1 0.063).  
##### - **Performance Limit:** Max accuracy 0.887 suggests features explain only part of quality variance.
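
To put the 0.887 figure in context, it helps to compare it with a majority-class baseline. The arithmetic below is a small sketch that takes the test-set class counts from the row sums of the confusion matrices in Section 4.2 (13 low, 264 medium, 43 high) and the test accuracies from the results table.

```python
# Test-set class counts (row sums of the confusion matrices in Section 4.2)
test_support = {"low": 13, "medium": 264, "high": 43}
total = sum(test_support.values())

# Accuracy of a trivial model that always predicts the most common class
baseline_acc = max(test_support.values()) / total

# Test accuracies copied from the results table
rf_test_acc = 0.8875
voting_test_acc = 0.865625

print(f"Majority baseline: {baseline_acc:.3f}")
print(f"Random Forest margin over baseline: {rf_test_acc - baseline_acc:.4f}")
print(f"Voting margin over baseline: {voting_test_acc - baseline_acc:.4f}")
```

Always predicting medium already scores 0.825, so both ensembles improve on the trivial baseline by only a few percentage points.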

### 6.2 Challenges

##### - **Class Imbalance:** Low-quality wines (3.94%) are never predicted correctly in the test set (0/13 for both models), skewing results toward the medium class.  
##### - **Feature Power:** Physicochemical data alone caps predictive ability; sensory data might improve accuracy.
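
One possible follow-up to the imbalance problem, not tried in this notebook, is to reweight classes with scikit-learn's `class_weight="balanced"` option and to standardize features before scale-sensitive models such as `SVC`. The sketch below runs on synthetic data from `make_classification` with a similar class skew; all variable names and settings are assumptions for illustration, and results on the wine data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data with a skew loosely resembling low/medium/high (assumption)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=11, n_informative=5, n_classes=3,
    weights=[0.05, 0.80, 0.15], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

models = {
    # Class weights inversely proportional to class frequency
    "Balanced RF": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
    # Standardize features, then fit a class-weighted SVM
    "Scaled balanced SVM": make_pipeline(
        StandardScaler(), SVC(class_weight="balanced")
    ),
}

recalls_by_model = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # average=None returns one recall value per class
    recalls_by_model[name] = recall_score(y_te, model.predict(X_te), average=None)
    print(name, recalls_by_model[name].round(3))
```

Reporting per-class recall rather than overall accuracy makes it visible whether reweighting actually helps the minority classes.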
