
# Wine Quality Classifier

**MSDS 5509 Machine Learning 1 - Final Project**

John Mitchell  
University of Colorado at Boulder MSDS Program  
*john.mitchell@colorado.edu*

## Project Overview

This supervised learning project develops a binary classifier to distinguish between regular and premium wines - in terms of their taste quality - based on their chemical composition. The project makes use of the UCI Wine Quality dataset, which contains thousands of wine samples from Portugal, including both red and white varieties. Each wine sample includes chemical features such as acidity levels, alcohol content, and residual sugar, along with quality assessments from wine experts.

## Problem Definition

Wine quality in the dataset is rated on a scale from 0 to 10, where 0 represents very poor quality and 10 represents excellent quality. In practice, the observed quality ratings range from 3 to 9. 

For this classification task, wines are categorized into two groups:

- **Regular wines**: Quality ratings of 6 or below
- **Premium wines**: Quality ratings of 7, 8, or 9

This classification problem has a potential application in the wine industry: wine importers and distributors could use such a classifier to help find premium wines more efficiently. 

## Project Goals

The primary objectives of this project are:

1. **Data Investigation**: Clean and conduct an Exploratory Data Analysis (EDA) to prepare features for input to several classifiers.
2. **Infrastructure Building**: Develop a basic machine learning pipeline to allow the data to be fed into several models, search for the best model in each case, and compare the best models.
3. **Model Evaluation**: Explore the performance of several models both from within this course and from other courses, specifically:
    - Logistic Regression (as a baseline linear classifier)
    and several nonlinear models:
    - Support Vector Machine (SVM)
    - Random Forest
    - Multi-Layer Perceptron (MLP) (or feedforward neural network), which I used in the NLP (Natural Language Processing) course sequence (#5747, #5748)
4. **Results Analysis**: Compare model performance using appropriate metrics, evaluate and interpret the results
5. **Practical Significance**: Investigate the practical significance of these results - could they be useful in the wine business?

More generally, I hoped that the process would give me insights into the complete supervised learning project lifecycle - and tools to take to future projects.

# The Wine Dataset

## Data Source and Citation

The dataset originates from the UCI Machine Learning Repository. It was used in an early paper on applying machine learning models to wine quality prediction (Cortez et al., 2009). That paper focused on regression (predicting the wine quality level) rather than classification. In publishing the data to the repository, the authors expressed the hope that it would be a useful dataset for other researchers.

## Dataset Description

The dataset contains samples of Portuguese "Vinho Verde" wine variants, including both red and white wines. The dataset includes only "physicochemical" (physical and/or chemical) properties as input features. The target variable is quality ratings. 

Information such as the wine brand, grape variety, or pricing is not part of the dataset.

**Dataset Size:**
- Red wine samples: 1,599
- White wine samples: 4,898  
- Total combined samples: 6,497 rows
- Features: 11 physicochemical attributes plus quality rating
- Missing values: None reported in original dataset

**Input Features:**
These features are all continuous-valued. Units are in parentheses.
1. Fixed acidity (g/dm³)  
2. Volatile acidity (g/dm³)  
3. Citric acid (g/dm³)  
4. Residual sugar (g/dm³)  
5. Chlorides (g/dm³)  
6. Free sulfur dioxide (mg/dm³)  
7. Total sulfur dioxide (mg/dm³)  
8. Density (g/cm³)  
9. pH (unitless)  
10. Sulphates (g/dm³)  
11. Alcohol content (% by volume)  

**Target Variable:**
- Quality: Wine quality ratings from 0 (very poor) to 10 (excellent). The data only contained values from 3 to 9.
- The scores were median scores from three wine expert evaluations, and thus were integers.

**How Variables Will Be Referred to in the Report:**
I have used the convention of `highlighting` variables or quality levels when referring to them in a technical sense. So for example, the quality level `premium` will be highlighted if I am referring to that target value specifically, but "premium wines" (not highlighted) if just discussing the idea of premium wines generally. Python functions for displaying tables do not always capitalize variable names consistently, so I have just used the same capitalization as the tool or table being discussed - so the feature may be `residual sugar` or `Residual Sugar` depending on context.

# Data Cleaning 

## Data Preparation and Feature Engineering

For this project, I combined the red and white wine datasets and added a `wine_type` feature (red: 0, white: 1) to preserve the distinction between wine varieties. This is used as an additional feature in the wine classifier, as wine type (red/white) may potentially be relevant to predicting quality. 
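
A minimal sketch of this combining step, assuming the two wine tables are already loaded as pandas DataFrames. The frame names and the handful of values are made up for illustration; only the `wine_type` encoding (red: 0, white: 1) comes from the text above.

```python
import pandas as pd

# Tiny stand-in frames with made-up values; the real tables have
# 11 physicochemical columns plus quality.
red = pd.DataFrame({"alcohol": [9.4, 9.8], "pH": [3.51, 3.20], "quality": [5, 5]})
white = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "pH": [3.00, 3.30, 3.26], "quality": [6, 6, 6]})

# Tag each source with its wine type (red: 0, white: 1), then stack
# the rows into one frame with a fresh index.
wines = pd.concat(
    [red.assign(wine_type=0), white.assign(wine_type=1)],
    ignore_index=True,
)

print(wines["wine_type"].value_counts().to_dict())
```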

## Missing Value Assessment

I verified that there were no missing values in the combined wine dataset - confirming the authors' original claim. This eliminated the need for any missing value imputation.  

## Duplicate Detection and Removal

When checking for data quality issues, I discovered a significant problem not mentioned in the original research: there were over 1,000 (specifically, 1,177) duplicate rows in the combined dataset. Each duplicate contained identical values across all features and the same quality rating.

After researching the data's provenance, I decided these duplicates were highly unlikely to represent legitimate identical measurements. The absence of any indexing or identification information in the dataset suggested potential errors during data collection or combination.

This assessment was reinforced by discovering two other projects - based on this dataset - that identified the same duplicate issue (Bunevičius, V. 2019; Ijemuah, V. 2020). These researchers also chose to remove duplicates before analysis.

Consequently, I removed all duplicate entries from the dataset as part of the cleaning process, reducing the number of examples from 6,497 to 5,320. The bar plot below shows the totals for each wine type before and after duplicate removal. Proportionately more white wines were removed.

![Wine Totals Before and After Duplicate removal](figures/wine_type_chart_2.png)  
*Wine Totals Before and After Duplicate removal*

For the rest of the EDA, and for the subsequent modeling, the data set without duplicates will be used.
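
As an illustration of the duplicate check and removal, here is a small sketch using pandas on a made-up frame. It treats a row as a duplicate when every column matches an earlier row; the frame contents are hypothetical.

```python
import pandas as pd

# Made-up rows: the second and fifth rows exactly repeat the row before them.
wines = pd.DataFrame({
    "alcohol": [9.4, 9.4, 9.8, 10.1, 10.1],
    "pH": [3.51, 3.51, 3.20, 3.26, 3.26],
    "quality": [5, 5, 5, 6, 6],
    "wine_type": [0, 0, 0, 1, 1],
})

# Count rows that repeat an earlier row across every column.
n_duplicates = int(wines.duplicated().sum())

# Keep the first occurrence of each distinct row.
deduped = wines.drop_duplicates().reset_index(drop=True)

# Totals per wine type before and after removal.
before = wines["wine_type"].value_counts().sort_index()
after = deduped["wine_type"].value_counts().sort_index()

print(n_duplicates, len(wines), len(deduped))
```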

## Outlier Analysis

As I explored outliers, I kept in mind that this data has been used before in a peer-reviewed study and thus is likely to have been pre-screened for outliers that are errors, as opposed to unusual but valid measurements. 

First, how common are outliers? Below is a tally of outliers for all features. I used the IQR (Inter-Quartile Range) method, which is more robust to non-normality than rules based on the mean and standard deviation. Under this method, outliers are values outside the range:
- lower bound: Q1 - 1.5 × IQR
- upper bound: Q3 + 1.5 × IQR

Here the IQR is Q3 - Q1, the difference between the third and first quartiles.
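
The rule above can be sketched in a few lines of NumPy. The helper name `iqr_outlier_mask` and the sample values are hypothetical, and quartiles are computed with NumPy's default linear interpolation, which may differ slightly from other quartile conventions.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up sample with one extreme value in the upper tail.
sample = [1.8, 2.0, 2.2, 2.4, 2.6, 2.7, 3.0, 65.8]
mask = iqr_outlier_mask(sample)
print(int(mask.sum()), "outlier(s) flagged")
```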

The features varied widely in the number of outliers. The `Alcohol` feature had just one, whereas `Fixed Acidity` had over 300. 

| Feature | Outlier Count | Outlier % |
|---------|---------------|-----------|
| Fixed acidity | 304 | 5.7% |
| Volatile acidity | 279 | 5.2% |
| Citric acid | 143 | 2.7% |
| Residual sugar | 141 | 2.7% |
| Chlorides | 237 | 4.5% |
| Free sulfur dioxide | 44 | 0.8% |
| Total sulfur dioxide | 10 | 0.2% |
| Density | 3 | 0.1% |
| pH | 49 | 0.9% |
| Sulphates | 163 | 3.1% |
| Alcohol | 1 | 0.0% |

I then explored each feature with a boxplot. A representative sample of features is shown below.

![Feature Boxplots](figures/outlier_boxplots.png)
*Boxplots for a Representative Sample of Features*

Insights from boxplots:
- Most features have outliers in a similar pattern to `Chlorides` or `Residual Sugar`: that is, a cluster of outliers in the upper tail.
- The `Alcohol` and `Density` features, on the other hand, have very few outliers. Unusual values of these features are very rare.
- Finally, the `pH` feature is the only feature with outliers in both tails.

Are any of these outliers unrealistic, as opposed to unusual? To explore whether the outliers were unusual but realistic in the context of wine production, I asked an AI engine (Claude Opus 4.1) to research realistic values for wines and summarize. Here are the findings. 

| Feature | Units | Realistic Min | Realistic Max | Notes |
|---------|-------|---------------|---------------|-------|
| Fixed acidity | g/dm³ | 3.0 | 18.0 | Very low acid wines exist; high-acid wines can reach 15-18 |
| Volatile acidity | g/dm³ | 0.08 | 2.0 | >1.2 typically considered faulty, but some exist up to 2.0 |
| Citric acid | g/dm³ | 0.0 | 1.5 | Can be completely absent; rarely exceeds 1.0-1.2 naturally |
| Residual sugar | g/dm³ | 0.5 | 65.0 | Dry wines ~1-4; dessert wines can exceed 50-60 |
| Chlorides | g/dm³ | 0.01 | 1.0 | Very pure wines <0.05; salty/faulty wines can reach 0.8+ |
| Free sulfur dioxide | mg/dm³ | 1.0 | 150.0 | Legal limits vary; over-sulfured wines can hit 100-150 |
| Total sulfur dioxide | mg/dm³ | 10.0 | 400.0 | Must be ≥ free SO2; heavily sulfured wines reach 300+ |
| Density | g/cm³ | 0.985 | 1.010 | High sugar/alcohol affects density significantly |
| pH | unitless | 2.7 | 4.2 | Very acidic wines ~2.8; low-acid wines can hit 4.0+ |
| Sulphates | g/dm³ | 0.2 | 2.5 | Natural minimum ~0.3; heavily treated wines reach 2.0+ |
| Alcohol | % vol | 7.0 | 16.5 | Legal minimums vary; fortified-style can reach 16% |

These limits indicate that even the extreme outliers (such as the value of `Residual Sugar` above 60 visible in its boxplot) are plausible. 

Since no outliers were so extreme as to be unrealistic, none were removed. 

## Data Cleaning Summary

- The two raw datasets (red and white wines) were combined, and a binary `wine_type` (0=red, 1=white) added. 
- There were no missing values in the dataset for any of the features, or for the target.
- However, 1,177 examples (18% of the total) were duplicates of uncertain provenance, most likely from errors in merging datasets. These were removed.
- Outliers: outliers were found to be within plausible values, and thus none were removed.


# Exploratory Data Analysis (EDA)

I performed an EDA on the dataset, exploring features, the target, and interactions (correlations, multicollinearity). This section discusses the results of the EDA. At the end of the section, the essential takeaways for the modeling stage are summarized.

## Feature Summary Statistics and Distributions

The mean, standard deviation, and a 5-number summary (min, Q1, median, Q3, max) are in the table below. 

|       |   Fixed Acidity |   Volatile Acidity |   Citric Acid |   Residual Sugar |   Chlorides |   Free Sulfur Dioxide |   Total Sulfur Dioxide |   Density |        pH |   Sulphates |   Alcohol |
|:------|----------------:|-------------------:|--------------:|-----------------:|------------:|----------------------:|-----------------------:|----------:|----------:|------------:|----------:|
| count |       5320      |          5320      |     5320      |        5320      |   5320      |             5320      |              5320      | 5320      | 5320      |   5320      | 5320      |
| mean  |          7.2152 |             0.3441 |        0.3185 |           5.0485 |      0.0567 |               30.0367 |               114.109  |    0.9945 |    3.2247 |      0.5334 |   10.5492 |
| std   |          1.3197 |             0.1682 |        0.1472 |           4.5002 |      0.0369 |               17.805  |                56.7742 |    0.003  |    0.1604 |      0.1497 |    1.1859 |
| min   |          3.8    |             0.08   |        0      |           0.6    |      0.009  |                1      |                 6      |    0.9871 |    2.72   |      0.22   |    8      |
| 25%   |          6.4    |             0.23   |        0.24   |           1.8    |      0.038  |               16      |                74      |    0.9922 |    3.11   |      0.43   |    9.5    |
| 50%   |          7      |             0.3    |        0.31   |           2.7    |      0.047  |               28      |               116      |    0.9946 |    3.21   |      0.51   |   10.4    |
| 75%   |          7.7    |             0.41   |        0.4    |           7.5    |      0.066  |               41      |               153.25   |    0.9968 |    3.33   |      0.6    |   11.4    |
| max   |         15.9    |             1.58   |        1.66   |          65.8    |      0.611  |              289      |               440      |    1.039  |    4.01   |      2      |   14.9    |

Observations:

- Features are non-negative (as would be expected from physical and chemical measurements). However, some have zero values. This will need to be factored in if the feature is log transformed to adjust for skew.
- Features vary widely in their ranges, ranging from `density` (very tightly clustered around 1) to `total sulfur dioxide` (which ranges from 6 to 440). Scaling will be applied prior to modeling to standardize. 
- The table verifies that there are no missing values, and that duplicates have indeed been removed. 

Next, the distribution of each feature is explored. Histograms are below.

![Feature Histograms](figures/feature_distributions.png)
*Histograms for Features. Wine Type (a binary feature) is not shown*

While some features appear roughly symmetric, several (especially `chlorides` and `residual sugar`) appear to be significantly positively skewed, and thus may need to be transformed prior to scaling. The features are ranked by skew in the table below. Those with skew above 1 are significantly skewed. 

Feature Skewness:
| Feature              |   Skewness |
|:---------------------|-----------:|
| chlorides            |  5.33824   |
| sulphates            |  1.80945   |
| residual sugar       |  1.70655   |
| fixed acidity        |  1.65042   |
| volatile acidity     |  1.50456   |
| free sulfur dioxide  |  1.36272   |
| density              |  0.666326  |
| alcohol              |  0.545696  |
| citric acid          |  0.484309  |
| pH                   |  0.389969  |
| total sulfur dioxide |  0.0636144 |

I explored transforming the six highly skewed features. A log transform turned out to be effective. The log-transformed features are below. Note that `residual sugar` is now bimodal (which is acceptable for modeling, as long as it is roughly symmetric) and `free sulfur dioxide` is now negatively skewed (but less so than before).

![Transformed Feature Histograms](figures/transformed_feature_distributions.png)
*Features after Log Transformation*

After the log transform, all skews have absolute values less than one:

| Feature              |   Skewness |
|:---------------------|-----------:|
| chlorides            |     0.9067 |
| fixed acidity        |     0.8431 |
| density              |     0.6663 |
| alcohol              |     0.5457 |
| citric acid          |     0.4843 |
| sulphates            |     0.394  |
| pH                   |     0.39   |
| volatile acidity     |     0.3304 |
| residual sugar       |     0.3265 |
| total sulfur dioxide |     0.0636 |
| free sulfur dioxide  |    -0.7893 |
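
A small sketch of the skew check before and after a log transform, using a synthetic right-skewed series as a stand-in for a feature such as `chlorides`. A plain natural log is assumed here; it is defined for the six transformed features because the summary table shows all of their minimums are above zero. Skewness does not depend on the base of the logarithm.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed values (not the wine data).
feature = pd.Series(rng.lognormal(mean=-3.0, sigma=0.6, size=2000), name="chlorides")

skew_before = feature.skew()

# Natural log; requires every value to be greater than zero.
skew_after = np.log(feature).skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```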

Next, the target variable - wine quality ratings - is explored.

## Exploring the Target Variable

The raw wine quality scores have a very imbalanced distribution, with most values being in the 5 to 7 range. The distribution is shown below. 

![Totals with each wine quality score](figures/original_quality_counts.png)  
*Total count for each wine quality score (3-9)*

When I was initially exploring how to use the data set for wine classification, this imbalance caused problems. For example, splitting the data into poor (3,4), average (5-7), and good (8 to 9) resulted in so many examples in the 'average' category that the models could appear to be performing well by simply classifying most wines as average. 

I explored how to bucket the scores in a way that would have a useful business purpose while addressing the class imbalance issue. I decided on a binary classification problem: identifying "regular" wines (quality 6 or less) versus "premium" wines (quality 7 or more). The "premium" wines are wines rated as above average by experts, and therefore could be priced higher. 

This split resulted in approximately 19% premium wines and 81% regular wines. While still imbalanced, this ratio should provide sufficient samples in both classes for the models to produce useful results. The figure below shows the proportion of examples in each quality category.

![Class Proportions in Regular (6 or less) and Premium (7 or more)](figures/binary_quality_percentages.png)  
*Proportions for Binary Class Split into Regular and Premium*
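
The binary split can be sketched as a single threshold on the quality score. The scores below are made up; the encoding (1 for `premium`, 0 for `regular`) follows the convention of treating `premium` as the positive class.

```python
import pandas as pd

# Made-up quality scores in the observed 3-9 range.
quality = pd.Series([3, 5, 6, 6, 7, 8, 9, 5, 6, 7], name="quality")

# premium = 1 for ratings of 7 or more, regular = 0 for 6 or less.
is_premium = (quality >= 7).astype(int)

# Class proportions, indexed by label (0 = regular, 1 = premium).
share = is_premium.value_counts(normalize=True).sort_index()
print(share.to_dict())
```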

## Relationships

A correlation heatmap (below) was used to explore linear relationships between feature pairs, where absolute values close to 1 indicate a strong correlation. The highest correlation (0.72) occurred between `total sulfur dioxide` and `free sulfur dioxide`. Since total sulfur includes the free component, it is not surprising there is some correlation between the two. There is also a high correlation between density and alcohol (-0.67). Most other correlations are low - below 0.5 - but the fact that some of the variables are even weakly correlated suggests that there is some potential structure there for Logistic Regression.

![Feature Correlation Heatmap](figures/feature_corrs.png)
*Correlation Heatmap of Features. Duplicates above the diagonal are not shown.*

I also calculated variance inflation factors (VIFs) to explore multicollinearity - that is, linear relationships not between individual pairs, but between one variable and some combination of the others (Faraway, 2014, pp. 106-109). A large VIF for a feature is an indication that it is highly multicollinear. There is no universally accepted cutoff value for defining "high" multicollinearity: I decided to use a conservative cutoff of 10.

The VIFs of the unscaled data had some huge values (the largest - density - was over 900), but this turned out to be due to the small variance of density, which caused numerical instability in the VIF calculation. Using `StandardScaler` (mean zero, standard deviation one) resulted in the table below. 

| Feature | VIF |
|---------|-----|
| density | 14.92 |
| residual sugar | 6.43 |
| fixed acidity | 4.88 |
| alcohol | 4.58 |
| total sulfur dioxide | 2.93 |
| pH | 2.48 |
| free sulfur dioxide | 2.14 |
| volatile acidity | 1.96 |
| citric acid | 1.65 |
| chlorides | 1.63 |
| sulphates | 1.55 |
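
For reference, the VIF of feature j is 1 / (1 - R²), where R² comes from regressing feature j on the other features (with an intercept). An equivalent shortcut is to take the diagonal of the inverse of the feature correlation matrix. The sketch below uses that shortcut on synthetic data; the function name is hypothetical and this is not necessarily how the table above was computed. The `statsmodels` package also provides a `variance_inflation_factor` helper.

```python
import numpy as np

def vif_from_features(X):
    """VIF for each column of X, via the diagonal of the inverse correlation matrix."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# c is nearly a linear combination of a and b, so all three become collinear.
c = a + b + rng.normal(scale=0.1, size=500)
d = rng.normal(size=500)  # independent of the rest

vifs = vif_from_features(np.column_stack([a, b, c, d]))
print(np.round(vifs, 1))
```

Because this version works from correlations, the result does not depend on the units or scale of the features.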

Even using the conservative value of 10 for a high VIF, the `density` feature shows multicollinearity. I did some brief domain research on why, using a basic AI-assisted Google search: since wine is composed of water, alcohol, and other dissolved compounds, its `density` is highly predictable from these features. In wine, it is also tightly clustered around the density of pure water (1.0 g/cm³), so the variance of `density` is low.

## Feature-Target Relationship

What about correlations between features and the target? A bar chart of correlations is below. 

![Feature Correlations with Quality](figures/feature_correlations_with_target.png)  
*Feature Correlations with Wine Quality Target*

The `density` and `alcohol` features have the highest correlations, both with absolute values in the moderate range (above 0.3 but under 0.5). Higher alcohol content - and lower density - are moderately predictive of wine quality. Other features are weakly correlated. 

## Modeling Takeaways from the EDA

The EDA revealed several key insights to factor in during model fitting:

- **Scale differences**: Features differ widely in their scale and should be standardized before using models that are scale-sensitive (such as SVM)
- **Skewed features**: Several features are also highly skewed. Log transforming them reduced this issue, and this should be used prior to standard scaling
- **Linear relationships**: Some moderate correlations were discovered within features, or between the feature and target. This suggests that there may be some linear patterning within the data for a Logistic Regression model
- **Multicollinearity**: However, `density` exhibits multicollinearity. This could lead to instability in the coefficient estimates for Logistic Regression. If this is a problem, remove `density`.
- **Target class imbalance**: There are very few quality measures of low (3,4) and high (8,9) values. This leads to reframing the modeling problem into a binary classification problem with `regular` (6 or less) and `premium` (7 or more). Classes are still imbalanced (81%/19%) but there should be enough `premium` examples for model fitting.

The next section explores model fitting with these insights in mind.

# Models and Model Fitting

To briefly recap, the goal is to find model(s) that - given a set of chemical features of wines - have "some success" at classifying wines into one of two classes: `regular` or `premium`.

This section looks at:
- The models that will be used
- The Python modeling infrastructure to load and run model searches
- How the data was split and transformed prior to modeling
- Model tuning - finding the best of each type of model
- A brief summary of results

The next section on results and analysis will go deeper into model performance.

## The Models 

I used several models covered in our course (**Logistic Regression**, **Support Vector Machine (SVM)**, **Random Forest**), as well as a **Multi-Layer Perceptron (MLP)** (also called a **feedforward neural network**), which I explored in the NLP courses (#5747, #5748). 
- Logistic Regression would serve as the baseline linear model.
- Given the likely problem complexity, I felt it was important to explore a variety of non-linear models that work in different ways.

## Modeling Infrastructure and Hardware

Prior to the project, I developed a basic exploratory pipeline to run grid searches (using `GridSearchCV` in the `sklearn` package) of some or all of the models and compare the best results. 

The GitHub repo has the code. 

The core components of the pipeline are:
- The `WineConfig` class that has a dictionary of parameter values to use for the grid search. Parameters were split into "base" parameters - that would stay constant during the model search, and "grid" parameters that would be varied. It also had a list of models to run. The default was all models ['rf', 'svm', 'logistic', 'mlp'], but this could be changed to run a search on a subset, or single model.
- The `WineModelPipeline` reads this set of parameter values, runs a grid search for each model, and stores details of the best models and their performance. For the grid search, the parameters were passed into `GridSearchCV` as a dictionary. This function would then only use the parameters needed for that model.
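
To make the base/grid idea concrete, here is a simplified sketch of a config-driven search loop. It is not the project's `WineConfig` or `WineModelPipeline` code: the `CONFIG` layout, the `run_searches` function, the parameter values, and the synthetic data are all hypothetical, and only two models are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical layout: "base" parameters stay fixed, "grid" parameters are searched.
CONFIG = {
    "logistic": {
        "estimator": LogisticRegression,
        "base": {"max_iter": 1000, "class_weight": "balanced"},
        "grid": {"C": [0.1, 1, 10]},
    },
    "rf": {
        "estimator": RandomForestClassifier,
        "base": {"random_state": 0, "class_weight": "balanced"},
        "grid": {"n_estimators": [20, 40], "max_depth": [3, 6]},
    },
}

def run_searches(X, y, models=("logistic", "rf"), n_jobs=None):
    """Run one grid search per model and collect the best settings and scores."""
    results = {}
    for name in models:
        spec = CONFIG[name]
        search = GridSearchCV(
            spec["estimator"](**spec["base"]),
            param_grid=spec["grid"],
            scoring="f1_weighted",  # weighted F1 drives the search
            cv=5,                   # five-fold cross validation
            n_jobs=n_jobs,          # n_jobs=-1 would use all available cores
        )
        search.fit(X, y)
        results[name] = {
            "best_params": search.best_params_,
            "best_score": search.best_score_,
        }
    return results

# Synthetic imbalanced data standing in for the wine features.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0)
results = run_searches(X, y)
for name, info in results.items():
    print(name, info["best_params"], round(info["best_score"], 3))
```

Passing a subset to `models` (for example `models=("rf",)`) mirrors the idea of running a search on a single model.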

For simplicity, the results are saved in dataframes (rather than to files) and modeling and model evaluation were performed in the same notebook. Once the best models were obtained - after numerous iterations of modeling and evaluation using successively finer grids - the results were copied into the final report. A sample parameter grid is in Appendix B.

The models were run on a laptop with GPU (Lenovo P16 Gen 2, 32 Cores, NVIDIA RTX 4000 Ada). None of the models can use the GPU, but all of them - except the MLP - can use multiple cores. While interpreting the results of each run and adjusting parameter values for each model for the next run was time-consuming, the grid searches themselves were not. A grid search with (say) 20-40 combinations for each model ran within 1-2 minutes (compared to 10-20 minutes without using multiple cores - the random forest was particularly time-consuming).

## Data Splitting and Transformation Prior to Model Fitting

- The data is split into 80% training, 20% testing. Models will be fit on the training data and evaluated on the test data
- Five-fold cross validation will be used to monitor for overfitting
- Because the target classes are imbalanced, the *weighted F1 score* will be used rather than *accuracy* as the primary driver of the grid search
- For the models that support it (Logistic Regression, SVM, Random Forest), `class_weight='balanced'` is used to automatically adjust for class frequencies
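
A short sketch of an 80/20 split and of what `class_weight='balanced'` works out to, on synthetic labels with the same 81%/19% balance. Stratifying the split is an assumption made here to keep the class ratio the same in both parts; the text above does not say whether the split was stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # synthetic features
y = np.array([0] * 810 + [1] * 190)   # 0 = regular, 1 = premium

# 80% training, 20% testing; stratify keeps the 81/19 ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 'balanced' weights are n_samples / (n_classes * class_count),
# so the rarer premium class gets the larger weight.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_train)
print(len(X_train), len(X_test), np.round(weights, 3))
```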

To create a level playing field, the same preprocessing was applied to the data for all models:

- **Log transformation** of skewed features identified in the EDA to correct for skew (see EDA section)
- **StandardScaler normalization** is applied to all features for algorithms sensitive to scale
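
One way to express these two preprocessing steps in scikit-learn is a `ColumnTransformer` that logs and scales the skewed columns and only scales the rest. This is a sketch under assumptions: it uses a four-column subset with synthetic values, a plain natural log, and hypothetical variable names, and it is not taken from the project code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

skewed = ["chlorides", "residual sugar"]  # subset of the skewed features, for brevity
other = ["alcohol", "pH"]

# Skewed columns: log transform first, then standardize.
log_then_scale = Pipeline([
    ("log", FunctionTransformer(np.log)),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("skewed", log_then_scale, skewed),
    ("other", StandardScaler(), other),
])

# Synthetic stand-in data with strictly positive skewed columns.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "chlorides": rng.lognormal(-3.0, 0.5, 200),
    "residual sugar": rng.lognormal(1.0, 0.8, 200),
    "alcohol": rng.normal(10.5, 1.2, 200),
    "pH": rng.normal(3.2, 0.16, 200),
})

X_t = preprocess.fit_transform(X)
print(X_t.shape, np.round(X_t.mean(axis=0), 3), np.round(X_t.std(axis=0), 3))
```

Placing a transformer like this in front of the classifier inside a single `Pipeline` means the scaling statistics are learned from the training folds only during cross validation.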

## Model Tuning: Finding the Best Models

I started with coarse grids for all models and iterated with successively finer grids. In some cases (such as the Random Forest), I started with the most important parameters, then added others as I got close to the best model. Details for each model follow. In each, I give parameters which were fixed, either initially or after some exploration (with comments if there were other options), the hyperparameters, and the best values found. 

The scikit-learn reference has details of the models, their parameters, and tips on usage.

### 1. Logistic Regression

**Base Configuration:**

- Penalty: L2 (Ridge) regularization. I used this to penalize more complex models, with the regularization strength C being the hyperparameter (see below).
- Solver: 'saga' (Stochastic Average Gradient Augmented) - noted as a good general purpose solver with L2.
- Max iterations: 1000. 

**Hyperparameter Tuning:**
I started with a grid 'C' = [0.01, 0.1, 1, 10, 100]. Eventually, after multiple iterations, I found that C = 0.8 provided optimal performance. 

### 2. Support Vector Machine

**Base Configuration:**
- Kernel: A Radial Basis Function kernel was used for all searches. Using a linear function would have just given another linear classifier.

**Hyperparameter Tuning:**
The hyperparameters in this case are C and gamma:
- Low values of C have a softer margin and thus higher tolerance for misclassification. Increasing C penalizes misclassification and results in a more complex decision boundary.
- Gamma controls the radius of the kernels. Higher gamma results in more "local" kernels and thus more complex boundaries.
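
To illustrate the effect of gamma, the RBF kernel between two points x and z is exp(-gamma * ||x - z||²). The sketch below (helper name and points are illustrative) shows that, for the same pair of points, a larger gamma gives a smaller similarity, so each training point influences a smaller neighborhood.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x = [0.0, 0.0]
z = [1.0, 1.0]  # squared distance from x is 2

# Similarity of the same pair of points under increasing gamma.
similarities = [rbf_kernel_value(x, z, g) for g in (0.1, 0.8, 5.0)]
for g, s in zip((0.1, 0.8, 5.0), similarities):
    print(f"gamma={g}: kernel={s:.4f}")
```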

The optimal values found - after several iterations of finer grids - were C = 2 and gamma = 0.8.

### 3. Random Forest

This ensemble decision tree model has several hyperparameters. I started with: 
- **n_estimators**: Number of trees in the ensemble
- **max_depth**: Maximum tree depth
- **max_features**: How many features to consider at each split. I explored both the 'sqrt' and 'log2' options, and found that 'sqrt' was better. I used this for the detailed searches.

During finer searches, I added:
- **min_samples_leaf**: The minimum number of examples required in a leaf
- **min_samples_split**: The minimum number of examples a node needs before it can be split (that is, form a new subtree)

The final optimal configuration used 55 trees, max_depth of 23, min_samples_leaf of 3, and min_samples_split of 4.

### 4. Multi-Layer Perceptron (MLP)

The model and its hyperparameters are discussed in chapter 11 of the course recommended reference (James et al., 2021). 

Briefly, the MLP passes weighted sums of the features through one or more internal (or hidden) layers, each of which applies nonlinear functions. Each hidden layer is composed of processors or "neurons" that take a weighted sum of their inputs, apply a nonlinear function, and pass the output on to the next layer. The final (or output) layer for a binary classification problem is a sigmoid function (similar to Logistic Regression). Training is performed using gradient descent methods: making small changes to the internal weights to reduce the error on the target. 
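
The forward pass described above can be sketched in NumPy as follows. The layer sizes and random weights are purely illustrative (an untrained network), and training by gradient descent is not shown.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, layers):
    """Pass inputs through ReLU hidden layers, then a sigmoid output unit."""
    h = X
    for W, b in layers[:-1]:
        h = relu(h @ W + b)  # weighted sum of inputs, then nonlinearity
    W_out, b_out = layers[-1]
    return sigmoid(h @ W_out + b_out).ravel()  # one probability per example

rng = np.random.default_rng(0)
# 12 inputs (11 chemical features plus wine_type), two small hidden layers, one output.
sizes = [12, 8, 4, 1]
layers = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

X = rng.normal(size=(5, 12))  # five made-up, already scaled examples
probs = forward(X, layers)
print(np.round(probs, 3))
```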

Thus, the hyperparameters are the internal structure of the MLP (number of layers and number of neurons in each layer), the gradient descent method to use, and its parameters.

**Base Configuration:**
- Max iterations: 1000. (The number of iterations of gradient descent)
- Early stopping: This was set to "true" with a patience parameter of 20 iterations (that is, stop early if there's no improvement over the next 20 iterations). A validation fraction of 10% was used.
- Solver: 'sgd' (stochastic gradient descent). This performed slightly better than 'adam' (adaptive moment estimation) in initial searches, so was used for detailed searches.
- Activation function: What nonlinear function to use in the hidden layers. I used 'relu' (rectified linear unit) after briefly exploring 'tanh' (hyperbolic tangent) too.

**Hyperparameter Tuning:**
- **Network Architecture**: I explored small 1 and 2 hidden layer networks of varying sizes
- **alpha**: Controls the regularization strength
- **learning_rate_init**: The initial value of the learning rate
- **momentum**: Controls how much of the previous weight update is carried into the current update

The optimal architecture used had two hidden layers of nonlinear (relu) units of 110 and 60 neurons. Other parameters that worked best were: alpha = 0.1, learning_rate_init = 0.02, and momentum = 0.95.
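
For reference, the best settings reported in this section could be written as scikit-learn estimators roughly as follows. This is a sketch, not the project code: mapping the "patience" of 20 to `n_iter_no_change` is my assumption, the L2 penalty is left at the `LogisticRegression` default, and any parameter not mentioned in the text is left at its default.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

best_models = {
    # L2 regularization is the default penalty, so it is not set explicitly.
    "logistic": LogisticRegression(
        C=0.8, solver="saga", max_iter=1000, class_weight="balanced"
    ),
    "svm": SVC(kernel="rbf", C=2, gamma=0.8, class_weight="balanced"),
    "rf": RandomForestClassifier(
        n_estimators=55,
        max_depth=23,
        max_features="sqrt",
        min_samples_leaf=3,
        min_samples_split=4,
        class_weight="balanced",
    ),
    # MLPClassifier has no class_weight option.
    "mlp": MLPClassifier(
        hidden_layer_sizes=(110, 60),
        activation="relu",
        solver="sgd",
        alpha=0.1,
        learning_rate_init=0.02,
        momentum=0.95,
        early_stopping=True,
        n_iter_no_change=20,      # assumed to correspond to the patience of 20
        validation_fraction=0.1,
        max_iter=1000,
    ),
}

for name, model in best_models.items():
    print(name, type(model).__name__)
```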

After finding the best versions of each model with the data preprocessing outlined earlier (transforming skewed features, scaling), I explored individual models further to see if I could get any further performance improvement.

Briefly:

- **Logistic Regression**: I removed `density` from the input features (from the EDA, this had a high VIF, and was linearly dependent on other features). However, removing this feature caused the best F1 score to drop slightly.
- **Random Forest**: I explored both removing the least important feature `wine_type` (feature importance is discussed in the next section), and removing some or all of the scaling. Very slightly better results were obtained with keeping all features, using standard scaling, but log transforming only `chlorides` (the F1 score crept above 0.84), but for practical purposes, the results are virtually identical. 

For simplicity - and to keep a level playing field - the results reported in the next section all use the same preprocessing applied to the data.

### Brief Results Summary
So - after all that tuning - how did the models actually do? The table below ranks the best models by weighted F1 score on the test set. It also shows the time spent on fitting the best model for each.

This gives a preliminary ranking of Random Forest as the best, with Logistic Regression as the worst. As one would expect, Logistic Regression was the fastest to fit, and the MLP the slowest.

| Model    | F1 Score | Test Accuracy | Time(s) |
|----------|----------|---------------|---------|
| rf       | 0.8366   | 0.8374        | 1.1     |
| svm      | 0.8272   | 0.8280        | 3.3     |
| mlp      | 0.8187   | 0.8430        | 4.4     |
| logistic | 0.7683   | 0.7434        | 0.5     |

It's worth first noting that none of the models is extremely accurate at classifying wines. The best is getting about 84% accuracy. However, this will still turn out to have practical use. 

The next section explores in more depth *how* each model is classifying wine data. It will turn out that the overall F1 score does not tell the complete picture, and that the models have very different qualities when it comes to how they are classifying wines. It also explores how the classifier might be used in practice.


# Results and Analysis

The table below has a detailed summary of performance metrics for each model. As well as the weighted F1 score, the weighted precision and recall are given, as well as the precision and recall for each quality class (`regular` and `premium`). The terms "precision" and "recall" are usually used to refer to the positive (1) class observation - in this case `premium` wines. However here I felt it would be helpful to include those metrics for regular wines too, as well as the weighted average (weighting for class balance).

|          |   Accuracy |   Weighted Precision |   Weighted Recall |   Weighted F1 |   Premium Precision |   Premium Recall |   Regular Precision |   Regular Recall |   Premium F1 |
|:---------|-----------:|---------------------:|------------------:|--------------:|--------------------:|-----------------:|--------------------:|-----------------:|-------------:|
| rf       |     0.8374 |               0.8359 |            0.8374 |        0.8366 |              0.5736 |           0.5594 |              0.8973 |           0.9026 |       0.5664 |
| svm      |     0.828  |               0.8264 |            0.828  |        0.8272 |              0.5482 |           0.5347 |              0.8916 |           0.8968 |       0.5414 |
| logistic |     0.7434 |               0.835  |            0.7434 |        0.7683 |              0.4083 |           0.7822 |              0.935  |           0.7343 |       0.5365 |
| mlp      |     0.843  |               0.8257 |            0.843  |        0.8187 |              0.6882 |           0.3168 |              0.8579 |           0.9664 |       0.4339 |
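
The weighted F1 column can be cross-checked against the per-class columns. The following is a minimal sketch rather than the project's own code. It assumes the test split has the same 81%/19% regular/premium balance quoted later in the report, and it recomputes each class F1 from the rounded precision and recall values in the table, so small rounding differences are expected.

```python
# Per-class (precision, recall) pairs copied from the table above.
metrics = {
    "rf":       {"regular": (0.8973, 0.9026), "premium": (0.5736, 0.5594)},
    "svm":      {"regular": (0.8916, 0.8968), "premium": (0.5482, 0.5347)},
    "logistic": {"regular": (0.9350, 0.7343), "premium": (0.4083, 0.7822)},
    "mlp":      {"regular": (0.8579, 0.9664), "premium": (0.6882, 0.3168)},
}
reported_weighted_f1 = {"rf": 0.8366, "svm": 0.8272, "logistic": 0.7683, "mlp": 0.8187}

# Assumption: class proportions in the test set match the overall 81/19 split.
CLASS_WEIGHTS = {"regular": 0.81, "premium": 0.19}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_f1(per_class):
    """Average the per-class F1 scores, weighted by class frequency."""
    return sum(CLASS_WEIGHTS[label] * f1(*pr) for label, pr in per_class.items())


recomputed = {model: weighted_f1(per_class) for model, per_class in metrics.items()}
for model, value in recomputed.items():
    print(f"{model:9s} recomputed={value:.4f} reported={reported_weighted_f1[model]:.4f}")
```

The recomputed values agree with the reported ones to within rounding, which is a useful sanity check that the "weighted" metrics are simply class-frequency-weighted averages of the per-class figures.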

The model differences become clear when looking at the recalls for each class. The recalls tell us, for each wine quality, what percentage of that quality the model got correct. Note that the logistic model - despite having the lowest F1 score - actually has the *highest* recall on premium wines. The differences between models are easier to see in a bar chart.

![Model Recall Comparison](figures/Recall_Comparisons.png)

*Recall for each Class for All Models*

The chart shows:
- Random Forest and SVM have similar patterns on each class. Both are correctly classifying about 90% of the regular wines, but under 60% of the premium wines. 
- Logistic Regression is correctly classifying roughly 73-78% of each class. It is actually the best at identifying premium wines, but the least accurate overall, since it misclassifies so many regular wines.
- The MLP - despite having a similar overall F1 score to both the Random Forest and SVM - is actually behaving very differently internally. It is the poorest on premium wines (recall about 32%), but the best on regular wines (recall about 97%).

The confusion matrices show this behavior in terms of totals. Here is the matrix for **Random Forest** (the SVM has a similar matrix but with a few fewer successes in each class).

![Random Forest Confusion Matrix](figures/confusion_matrix_rf.png)  
*Random Forest Confusion Matrix*

For **Logistic Regression**, the number of correctly classified premium wines is higher (bottom right) but the number of errors on regular wines is much higher (top right).

![Logistic Regression Confusion Matrix](figures/confusion_matrix_logistic.png)  
*Logistic Regression Confusion Matrix*

The **MLP** has very few errors on regular wines (top right) but a lot of errors on premium wines (bottom left).

![MLP Confusion Matrix](figures/confusion_matrix_mlp.png)  
*MLP Confusion Matrix*

In summary, the Random Forest and the SVM have a similar balance in terms of classification errors. The MLP tends to overclassify wines as `regular`, and Logistic Regression does the opposite: it is best at correctly classifying `premium` wines, but produces a lot of false positives. Overall, the Random Forest has the best performance. An added advantage is some capability to explore which features are most important.
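
To make the "top right" and "bottom left" readings concrete, here is a small sketch using scikit-learn's `confusion_matrix`. The labels are made up for illustration (0 = regular, 1 = premium) and are not the project's data. With rows as true classes and columns as predicted classes, the top-right cell counts regular wines predicted as premium, and the bottom-left cell counts premium wines predicted as regular.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels for illustration only (0 = regular, 1 = premium).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

regular_recall = tn / (tn + fp)      # share of regular wines correctly labeled
premium_recall = tp / (tp + fn)      # share of premium wines correctly labeled
premium_precision = tp / (tp + fp)   # share of "premium" predictions that are right

print(cm)
print(f"regular recall={regular_recall:.2f}, premium recall={premium_recall:.2f}, "
      f"premium precision={premium_precision:.2f}")
```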

## Feature Importance

What insights do the models give on what features are relevant to wine quality? I used the Random Forest's feature importance. This measures how much each feature contributed to reducing classification uncertainty (in terms of Gini impurity) across the entire forest.

| Feature | Importance |
|---------|------------|
| alcohol | 0.222987 |
| density | 0.119376 |
| chlorides | 0.092531 |
| volatile acidity | 0.092127 |
| total sulfur dioxide | 0.075486 |
| citric acid | 0.073331 |
| sulphates | 0.072998 |
| pH | 0.068285 |
| residual sugar | 0.063973 |
| free sulfur dioxide | 0.062964 |
| fixed acidity | 0.052781 |
| wine_type | 0.003161 |

The most important feature was alcohol, and the least was the type of wine (red or white). However, removing wine type caused performance to drop slightly to about 82%, so although it was the least important feature, it was necessary to get the best performance. 
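
For readers unfamiliar with how such a table is produced, the sketch below shows one common way to read impurity-based importances from a fitted scikit-learn Random Forest. It uses synthetic data from `make_classification` and placeholder feature names, so it illustrates the mechanism only and does not reproduce the values above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names are placeholders, not the wine features.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
rf.fit(X, y)

# Impurity-based importances are normalized to sum to 1 across all features.
importances = (
    pd.Series(rf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

Because the importances are normalized, a very small value (like the one for `wine_type` above) means the feature accounts for a small share of the total impurity reduction, not that it is useless.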

All models have used the default threshold of 0.5. That is, if the output is a probability of 0.5 or greater, the wine is classed as `premium`. What would the effect on the model be of adjusting this threshold? The next section looks at the Receiver Operating Characteristic (ROC) curve. 

## Adjusting Model Thresholds

The plot below shows the ROC curve for each model, as well as the area under the curve (AUC). The curve plots the True Positive Rate as a function of the False Positive Rate as the threshold (or cutoff) probability for predicting a "1" (premium wine) vs "0" is adjusted. It does not show the thresholds themselves directly, but shows the overall profile of the model in response to tuning the threshold.  A good model should curve as close as possible to the top left, meaning that as the threshold is adjusted to get more True Positives (that is, correctly identify a higher proportion of premium wines), the False Positive rate stays low. Randomly guessing would fall along the diagonal.

![ROC Curve](figures/ROC.png)

*ROC Curves for the Best of Each of the Four Models*
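
As a rough illustration of how a ROC curve and the effect of a threshold change can be computed, the sketch below fits a plain logistic regression on synthetic, imbalanced data (an assumption made purely for the example) and uses scikit-learn's `roc_curve` and `roc_auc_score`. The 0.3 threshold is an arbitrary choice to show the direction of the effect.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data with roughly an 80/20 class split; not the wine dataset.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# Each (fpr, tpr) pair corresponds to one candidate threshold.
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)


def recall_at(threshold):
    """Positive-class recall when predicting 1 for probabilities >= threshold."""
    y_pred = (proba >= threshold).astype(int)
    return recall_score(y_test, y_pred)


recall_default = recall_at(0.5)
recall_lower = recall_at(0.3)
print(f"AUC={auc:.3f}, recall@0.5={recall_default:.3f}, recall@0.3={recall_lower:.3f}")
```

Lowering the threshold can only keep or increase the positive-class recall, at the cost of more false positives, which is the trade-off the ROC curve traces out.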

## Performance Summary and Implications

For the problem as a whole, identifying `regular` wines seems easier (the best models had about 90% recall or better) than identifying `premium` wines (even the best model on this class - logistic - achieved only 78% recall). It appears that identifying premium wines (that is, wines of above-average quality) is the harder part of the problem.

All things considered, the Random Forest was the best classifier in terms of weighted F1 score, narrowly beating out the SVM. It was also the most stable under threshold tuning, generally ranking in the top two classifiers. Additionally, unlike the MLP, it gives some insight into which features are most important. It has a recall of about 90% on regular wines, but only 56% on premium wines.

The next section explores whether such a model would be useful in practice - given its mediocre performance on `premium` wines. 

# Discussion: Is the Best Model Actually Useful?

I have to honestly admit that during the tuning process I found the performance of the models on `premium` wines disappointing. However, let's see if it can still be useful in practice by exploring a potential application using the data and model from above.

Suppose a wine importer wants to identify premium wines - wines that have been highly rated by wine experts, and thus can be priced higher. If they choose a wine at random, they have a 19% chance of choosing a premium wine.

However, suppose instead they use the Random Forest as a pre-filter to screen out regular wines. That is, rather than importing a randomly chosen wine, they feed the wine's physical and chemical measurements - assuming these are available - into the trained Random Forest classifier. They only import the wine for assessment by their own wine tasters if the Random Forest classifies it as `premium`.

Using a Bayesian approach (Reich & Ghosh, 2019), let's look at how using the classifier in this way changes the probability of choosing a premium wine.

The "prior" probabilities (before using the classifier) are P(Regular) = 0.81, P(Premium) = 0.19. 

Let the events "classifier labels the wine as premium" and "classifier labels the wine as regular" be denoted simply by 1 and 0 respectively.

To calculate the "posterior" probability that the wine is a premium wine given that the classifier labeled it as such - that is P(Premium|1) - Bayes' theorem can be used.

Briefly, Bayes' theorem stipulates:

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

In this case, the numerator is:   

P(1 | Premium)P(Premium) = (0.56 × 0.19) = 0.1064

For the denominator:

P(1) = P(1 | Premium)P(Premium) + P(1 | Regular)P(Regular) 

P(1) = (0.56 × 0.19) + (0.10 × 0.81) = 0.1874

Plugging these into Bayes' Theorem gives:

P(Premium | 1) = 0.1064 / 0.1874 = 0.568 

So, the "posterior" probability P(Premium | 1) - the probability that the wine is a premium wine given that the Random Forest classified it as a "1" (a premium wine) - is approximately 57%. 

In other words, using the classifier to screen out regular wines improves the chances of selecting a premium wine from 19% to 57% - a threefold improvement. 
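
The calculation above is easy to reproduce and to rerun with different inputs. The short sketch below uses the same rounded figures as the text; the function name `posterior_premium` is just an illustrative helper, not part of the project code.

```python
def posterior_premium(prior, true_positive_rate, false_positive_rate):
    """P(Premium | classifier says premium) via Bayes' theorem."""
    numerator = true_positive_rate * prior
    evidence = numerator + false_positive_rate * (1 - prior)
    return numerator / evidence


p_premium = 0.19             # prior P(Premium)
recall_premium = 0.56        # P(1 | Premium), the Random Forest's premium recall
false_positive_rate = 0.10   # P(1 | Regular) = 1 - regular recall (about 0.90)

posterior = posterior_premium(p_premium, recall_premium, false_positive_rate)
lift = posterior / p_premium

print(f"P(Premium | 1) = {posterior:.3f}")
print(f"Improvement over random selection: {lift:.2f}x")
```

The posterior is sensitive to the false positive rate: because regular wines are about four times as common as premium wines, even a small change in P(1 | Regular) shifts the result noticeably.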

Put in large scale terms, if the importer wants to find (say) 500 premium wine brands, how many wines would they expect to have to import in each case?

If Y is a random variable representing the number of wines imported, r is the desired number of successes (in this case, the number of premium wines we want), and p the probability of a success on each trial, then Y has a "negative binomial" distribution. The expected value of Y is simply r/p. Plugging numbers in, we would expect to have to import

E(Y) = 500/0.19 ≈ 2,632

wines without using the classifier, and 

E(Y) = 500/0.568 ≈ 880

wines using the classifier. The importer only needs to import about one-third as many wines (on average) to achieve their goal of 500 premium wines! 
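
These expected values can be checked numerically. The sketch below computes r/p directly and, as an optional sanity check, simulates the "without classifier" case with NumPy. Note that NumPy's `negative_binomial` counts failures before the r-th success, so r is added back to get the total number of wines imported. The sample size and seed are arbitrary choices for the example.

```python
import numpy as np


def expected_imports(target_successes, p_success):
    """Mean number of trials needed to reach the target number of successes (r / p)."""
    return target_successes / p_success


r = 500
without_classifier = expected_imports(r, 0.19)
with_classifier = expected_imports(r, 0.568)

# Monte Carlo check of the first figure. NumPy returns the number of failures
# before the r-th success, so add r to get the total number of trials.
rng = np.random.default_rng(0)
simulated = rng.negative_binomial(r, 0.19, size=20_000) + r
simulated_mean = simulated.mean()

print(f"Without classifier: {without_classifier:.0f} (simulated mean {simulated_mean:.0f})")
print(f"With classifier:    {with_classifier:.0f}")
print(f"Ratio: {with_classifier / without_classifier:.2f}")
```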

Of course, there are many caveats: enough high quality data would have to be available - including expert ratings - from the region we are importing from. Wine importers would not be selecting at random but would have domain expertise to guide them, and so on. But this analysis shows that a machine learning approach at least has the *potential* to be a valuable tool as part of the process for making wine importation more efficient.


# Conclusion

In this project, the problem of classifying wines by their physical and chemical properties into taste quality categories was explored. Four machine learning models were tested: one linear model (Logistic Regression) and three nonlinear models (Random Forest, Support Vector Machine, and Multi-Layer Perceptron). After extensive hyperparameter tuning, the Random Forest was found to be the best performing model. While it was more accurate on regular wines than premium wines, its overall recall was sufficient to be potentially useful as a screener for wine importers looking to increase their chances of selecting premium wines.

Overall lessons learned:

• **Data quality matters, even from reputable sources.** While this dataset was generally clean, it unexpectedly contained many duplicates of unknown provenance that had to be removed prior to modeling.

• **Target class imbalance requires careful handling in supervised machine learning.** The original data had several quality levels with very few examples. Reframing the problem as a binary classification gave two classes with enough examples each for useful results, although the classes remained imbalanced (roughly 81% regular to 19% premium).

• **Look beyond surface-level metrics.** It was critical to go beyond the F1 scores and dig deeper into how models were performing their classifications. The superficial similarity of some F1 scores masked crucial differences in how the models were splitting the two classes internally.

• **Even imperfect models can have practical value.** Identifying premium wines turned out to be a subtle problem, and all models struggled with achieving high recall on the premium class. However, a Bayesian analysis showed that the best model would be potentially useful as a screening filter for regular wines. This shows the importance of thinking about model performance not just in terms of metrics like F1 scores, but in terms of how the models could be used in practice.

More generally, I feel that the project was a valuable experience in taking a real-world data set through the entire supervised learning lifecycle, and in the process building the infrastructure needed to support fine-tuning and comparing multiple models on the same data set.

Thank you for reading, and I look forward to your feedback.


# References

Bunevičius, V. (2019). Wine Quality: A Machine Learning Case Study. Retrieved September 10, 2025, from https://bunevicius.com/projects/red-wine/index.html

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. *Decision Support Systems*, 47(4), 547–553. https://doi.org/10.1016/j.dss.2009.05.016

Faraway, J. J. (2014). *Linear models with R* (2nd ed., pp. 106-109). CRC Press.

Ijemuah, V. (2020). Wine Quality Prediction using Machine Learning. *Medium*. Retrieved September 10, 2025, from https://medium.com/@ijemuahvictoria/wine-quality-prediction-using-machine-learning-c0b3b427693e

James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2021). *An Introduction to Statistical Learning: With Applications in R* (2nd ed.). Springer.

Reich, B. J., & Ghosh, S. K. (2019). *Bayesian Statistical Methods* (1st ed.). Chapman & Hall/CRC.

# Appendix A: AI Usage

Claude Sonnet 4.1 was used in a variety of ways within this project, within the "limited" usage guidelines as set out in the course. This section gives details, along with the additional steps taken where necessary to ensure the guidelines were followed.

- For project management: tracking tasks and time estimates, and maintaining a checklist of what was done and what was left.
- For research on the reason for missing values in the dataset, and any additional research that has been done with the dataset.
- For domain research, as specified in the EDA section.
- For proofreading and editing the final report. The prompt used for this thread:

  *"This thread is for Drafting my conclusions and proofreading the report. In the thread I will appreciate your help with basic edits for clarity and cohesion while maintaining my voice. Please do not add any additional opinions or domain expertise. This is important so that I maintain academic integrity. Even if the result is less than perfect in terms of language I would rather that, than something that's perfect and inauthentic."*

- For formatting tables, references, and other organizational tasks in preparing the report.

- As a sounding board to discuss modeling results during the modeling and tuning process. In these discussions I asked the tool to play the role of a supportive professor that would challenge me and help me gain a deeper understanding. 

- Prior to this term, I used Claude to explore model pipelining design and development principles. With its assistance I developed a basic modeling pipeline suitable for comparing the performance of multiple models on a classification problem. For this project - in order to adhere to the new AI usage guidelines - I took several weeks to refactor and rebuild a simpler pipeline. In particular, I designed and built `WineConfig` and `WineModelPipeline` classes - the core components of the system, using the insights I had gained from the previous research, but starting from my own specifications and templates. 

# Appendix B: Sample Parameter Dictionary

## Base Parameters (Kept Constant During Grid Search)
```python
self.model_base_params = {
    'logistic': {
        'penalty': 'l2',
        'max_iter': 1000,
        'solver': 'saga',  
        'class_weight': 'balanced', 
        'n_jobs': -1,  # use all available cores for parallel processing
        'random_state': self.random_state
    },
    'rf': {
        'class_weight': 'balanced',
        'n_jobs': -1,  # use all available cores for parallel processing
        'random_state': self.random_state
    },
    'svm': {
        'kernel': 'rbf',
        'class_weight': 'balanced',
        'probability': True,   # Enables predict_proba (class probabilities), needed for ROC/AUC
        'random_state': self.random_state
        # Note: svm doesn't support n_jobs
    },
    'mlp': {
        'max_iter': 1000,
        'early_stopping': True,
        'validation_fraction': 0.1,
        'n_iter_no_change': 20,  # Patience parameter - how long to keep going with no improvement
        'random_state': self.random_state
        # Note: MLP doesn't support balanced class_weight or n_jobs
    }
}
```

## Grid Search Parameters

This grid is from an early rough grid search.

```python
self.model_grid_params = {
    'logistic': {
        'C': [0.01, 0.1, 1, 10, 100],
    },
    'rf': {
        'n_estimators': [50, 100, 200, 300, 400, 500],
        'max_depth': [5, 10, 15, 20, 25, 30, None],
        'max_features': ['sqrt']
    },
    'svm': {
        'C': [0.1, 1, 10, 100],
        'gamma': [0.01, 0.1, 1.0, 10, 100]
    },
    'mlp': {
        'hidden_layer_sizes': [(50,), (100,), (150,), (200,), (100, 50)],
        'alpha': [0.001, 0.01, 0.1],
        'learning_rate_init': [0.001, 0.01, 0.02]
    }
}
```
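
## Illustrative Usage

The `WineModelPipeline` class itself is not reproduced in this report, so the following is only a sketch of one way a base-parameter dictionary and a grid dictionary like those above might be combined. It assumes scikit-learn's `GridSearchCV` with `f1_weighted` scoring, a reduced Random Forest grid, 3-fold cross-validation, and synthetic data; all of these are choices made for the example rather than the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Base parameters stay fixed; only the grid parameters are searched.
# (n_jobs is left at its default here to keep the example lightweight.)
base_params = {"class_weight": "balanced", "random_state": 42}
grid_params = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
    "max_features": ["sqrt"],
}

# Synthetic, imbalanced stand-in data; not the wine dataset.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(**base_params),
    param_grid=grid_params,
    scoring="f1_weighted",  # assumed to match the report's weighted F1 ranking
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated weighted F1: {search.best_score_:.4f}")
```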