# Predicting Wine Quality: A Binary Classification Approach Using Physicochemical Properties

**Authors:** Aiden Hew, Karan Bains, Shuhang Li

## Summary

This analysis investigates whether physicochemical properties (e.g., alcohol content, volatile acidity, and sulphates) can reliably predict wine quality using binary classification. Using a dataset of 1,599 red Portuguese "Vinho Verde" wines, we developed models to distinguish high-quality wines (rated 7 or higher) from lower-quality wines (rated below 7). We compared logistic regression, decision tree, and random forest classifiers. Results indicate that alcohol content, volatile acidity, and sulphates are the strongest predictors of wine quality, with the random forest model achieving 88% test accuracy and an AUC of 0.92. These findings suggest that automated quality assessment based on chemical properties is feasible and could support quality control in wine production.

## Methods & Results

This section describes the complete analytical workflow, including data loading, preprocessing, exploratory analysis, model building, and evaluation. All code is presented with narrative explanations of the methodology.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc, roc_auc_score)

### Exploratory Data Analysis

In [2]:
df = pd.read_csv("data/winequality-red.csv", sep=';')

# Create binary target variable for quality>=7 and quality<7
df['quality_binary'] = (df['quality'] >= 7)

# Separate features and target
feature_columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                   'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
                   'pH', 'sulphates', 'alcohol']

X = df[feature_columns]
y = df['quality_binary']

# Summary statistics of all features
print("Summary statistics for physicochemical features:")
df[feature_columns].describe().round(3)

Summary statistics for physicochemical features:


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.32 | 0.528 | 0.271 | 2.539 | 0.087 | 15.875 | 46.468 | 0.997 | 3.311 | 0.658 | 10.423 |
| std | 1.741 | 0.179 | 0.195 | 1.41 | 0.047 | 10.46 | 32.895 | 0.002 | 0.154 | 0.17 | 1.066 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99 | 2.74 | 0.33 | 8.4 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.996 | 3.21 | 0.55 | 9.5 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.997 | 3.31 | 0.62 | 10.2 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.998 | 3.4 | 0.73 | 11.1 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.004 | 4.01 | 2.0 | 14.9 |


In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2025, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
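
The cell above fits the scaler on the training split only and reuses those statistics for the test split. As a minimal sketch of what that means arithmetically, the following plain-Python example standardizes a few hypothetical alcohol readings (the values and the helper names `fit_standardizer` and `standardize` are illustrative assumptions, not part of the notebook). It uses the population standard deviation, which is what `StandardScaler` computes.

```python
import statistics

def fit_standardizer(train_values):
    """Estimate the mean and standard deviation from training values only."""
    mean = statistics.fmean(train_values)
    # Population standard deviation (ddof=0), matching StandardScaler
    std = statistics.pstdev(train_values)
    return mean, std

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical alcohol readings, not taken from the wine dataset
train_alcohol = [9.0, 10.0, 11.0]
test_alcohol = [12.0]

mean, std = fit_standardizer(train_alcohol)
train_scaled = standardize(train_alcohol, mean, std)
# The test value is scaled with the training statistics, never its own
test_scaled = standardize(test_alcohol, mean, std)

print(train_scaled, test_scaled)
```

Because the test value is transformed with the training mean and standard deviation, no information about the test split influences preprocessing.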

### Model Development and Training

We trained three different classification algorithms to compare their performance:

1. **Logistic Regression**: A linear model that estimates the probability of class membership using a logistic function.

2. **Decision Tree**: A non-linear model that recursively partitions the feature space using thresholds on individual features. We cap the maximum depth and require a minimum number of samples per split and per leaf to limit overfitting.

3. **Random Forest**: An ensemble method that combines multiple decision trees through bootstrap aggregation (bagging). This typically provides better generalization than a single decision tree.
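
For reference, the logistic function used by the first model can be written as follows. The symbols $\beta_0$ (intercept), $\boldsymbol{\beta}$ (coefficient vector), and $\mathbf{x}$ (a wine's scaled feature vector) are notation introduced here for illustration and do not appear in the notebook.

$$
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x})}}
$$

With scikit-learn's `predict`, a wine is assigned to the high-quality class when this estimated probability exceeds 0.5.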

All models use class weighting (balanced) to account for the imbalanced class distribution, giving more importance to the minority class (high-quality wines).
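
To make the effect of balanced weighting concrete, the sketch below reproduces the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_of_class)`, in plain Python. The helper name `balanced_class_weights` and the toy labels are assumptions for illustration; they are not the wine data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels: 8 lower-quality wines (False) and 2 high-quality wines (True)
labels = [False] * 8 + [True] * 2
weights = balanced_class_weights(labels)

print(weights)
```

In this toy case the minority class receives a weight of 2.5 and the majority class 0.625, so each misclassified high-quality wine costs four times as much during training.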

In [4]:
# Initialize models with class balancing
models = {
    'Logistic Regression': LogisticRegression(
        random_state=123, 
        max_iter=1000, 
        class_weight='balanced'
    ),
    'Decision Tree': DecisionTreeClassifier(
        random_state=123, 
        max_depth=10, 
        min_samples_split=20,
        min_samples_leaf=10,
        class_weight='balanced'
    ),
    'Random Forest': RandomForestClassifier(
        n_estimators=100, 
        random_state=123, 
        max_depth=15,
        min_samples_split=10,
        min_samples_leaf=5,
        class_weight='balanced'
    )
}

# Train models and store results
trained_models = {}
results = []

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train the model
    model.fit(X_train_scaled, y_train)
    trained_models[name] = model
    
    # Make predictions
    y_train_pred = model.predict(X_train_scaled)
    y_test_pred = model.predict(X_test_scaled)
    y_test_proba = model.predict_proba(X_test_scaled)[:, 1]
    
    # Calculate metrics
    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    precision = precision_score(y_test, y_test_pred)
    recall = recall_score(y_test, y_test_pred)
    f1 = f1_score(y_test, y_test_pred)
    roc_auc = roc_auc_score(y_test, y_test_proba)
    
    # Store results
    results.append({
        'Model': name,
        'Train Accuracy': train_acc,
        'Test Accuracy': test_acc,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'ROC AUC': roc_auc
    })
    
    print(f"  Train Accuracy: {train_acc:.4f}")
    print(f"  Test Accuracy: {test_acc:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1 Score: {f1:.4f}")
    print(f"  ROC AUC: {roc_auc:.4f}")


Training Logistic Regression...
  Train Accuracy: 0.7928
  Test Accuracy: 0.7594
  Precision: 0.3455
  Recall: 0.8837
  F1 Score: 0.4967
  ROC AUC: 0.8797

Training Decision Tree...
  Train Accuracy: 0.8757
  Test Accuracy: 0.7656
  Precision: 0.3298
  Recall: 0.7209
  F1 Score: 0.4526
  ROC AUC: 0.7845

Training Random Forest...
  Train Accuracy: 0.9578
  Test Accuracy: 0.8844
  Precision: 0.5577
  Recall: 0.6744
  F1 Score: 0.6105
  ROC AUC: 0.9212
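
The gap between high recall and low precision for logistic regression is easier to see from confusion-matrix counts. The sketch below computes the four threshold-based metrics from such counts. The counts themselves are an assumption: they were chosen to be consistent with the logistic regression metrics printed above for a 320-wine test split and are not taken from the notebook's output.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

# Assumed counts (illustrative), consistent with the reported logistic regression metrics
metrics = classification_metrics(tp=38, fp=72, fn=5, tn=205)

for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.4f}")
```

Under these assumed counts, the model finds 38 of 43 high-quality wines but also flags 72 lower-quality wines, which is why precision stays near 0.35 despite recall near 0.88.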


In [5]:
# Perform 5-fold cross-validation on the best model
best_model = trained_models['Random Forest']
cv_scores = cross_val_score(best_model, X_train_scaled, y_train, cv=5, scoring='accuracy')

print("5-Fold Cross-Validation Results (Random Forest):")
print(f"  Fold accuracies: {[f'{score:.4f}' for score in cv_scores]}")
print(f"  Mean CV Accuracy: {cv_scores.mean():.4f}")
print(f"  Std CV Accuracy: {cv_scores.std():.4f}")
print("\nThis suggests our model generalizes well with consistent performance across folds.")

5-Fold Cross-Validation Results (Random Forest):
  Fold accuracies: ['0.8828', '0.8906', '0.8672', '0.8828', '0.9020']
  Mean CV Accuracy: 0.8851
  Std CV Accuracy: 0.0114

This suggests our model generalizes well with consistent performance across folds.
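
One caveat is that the cross-validation above runs on `X_train_scaled`, so the scaler has already seen every validation fold. A possible alternative, sketched here under stated assumptions, wraps scaling and the classifier in a single pipeline so that the scaler is refit inside each fold. The data are synthetic (generated with `make_classification`), the variable names `X_demo`, `y_demo`, and `demo_scores` are hypothetical, and ROC AUC is used as the scoring metric because accuracy can look favorable on imbalanced classes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, weights=[0.85, 0.15], random_state=0
)

# The scaler is refit on the training portion of every fold
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=0),
)

# Stratified folds keep the class ratio similar across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='roc_auc')

print(f"Mean ROC AUC: {demo_scores.mean():.4f} (std {demo_scores.std():.4f})")
```

For tree-based models the difference is likely small, since trees are insensitive to feature scaling, but the pipeline pattern matters more for scale-sensitive models such as logistic regression.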
