# Wine quality

## About the data

The data can be found either in the UCI Machine Learning Repository under the name of "Wine Quality" (link: https://archive.ics.uci.edu/ml/datasets/Wine+Quality), or using the official information below:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

  Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
                [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
                [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

The task is to predict the quality of wines from a variety of physicochemical measurements taken from the wines.

## Imports

In [81]:
import sklearn
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from itertools import combinations
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

## Exploratory Data Analysis

This data set contains two CSV files, one for white and one for red wines, but I will only use the white wine table since it contains more data.

In [56]:
wines = pd.read_csv('winequality-white.csv', sep=';')
wines.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


In [57]:
print(wines['quality'].min(), wines['quality'].max())

3 9


Every column in the table holds real-valued measurements except "quality", which I presume is a rating on a scale of 1 to 10, although the lowest rating in the dataset is 3 and the highest is 9. These are the values to be predicted. A regression approach may be appealing for predicting ratings, but since all ratings are integers and clearly meant to lie on a 10-point scale, I will use a multi-class classification approach with 10 classes. The absence of ratings below 3 or above 9 in the data does not mean that a model in a real-world use case could never encounter a wine of quality 1 or 10. In practice, though, a classifier fitted on this table can only predict the seven ratings that actually occur in it.

In [58]:
print('white wine:', np.histogram(wines['quality'], range(1, 12))[0])

white wine: [   0    0   20  163 1457 2198  880  175    5    0]


We can see that the labels are heavily imbalanced: most wines are rated 5, 6 or 7, while the extreme ratings have only a handful of examples each. This could cause problems; for example, the classifiers may become heavily biased towards the middle ratings.
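To put numbers on the imbalance, here is a minimal sketch that turns the histogram counts printed above into class shares. The counts are copied by hand from that output rather than recomputed from the CSV, and the variable names are only illustrative.

```python
import numpy as np

# Counts for quality ratings 1..10, copied from the histogram output above.
counts = np.array([0, 0, 20, 163, 1457, 2198, 880, 175, 5, 0])
ratings = np.arange(1, 11)

# Share of the whole table that each rating accounts for.
proportions = counts / counts.sum()
for rating, share in zip(ratings, proportions):
    print(f"quality {rating:2d}: {share:6.2%}")

# The share of the most common rating is also the accuracy of a model
# that always predicts that rating.
majority_rating = int(ratings[counts.argmax()])
majority_share = float(proportions.max())
print(f"majority class {majority_rating} covers {majority_share:.1%} of the wines")
```

The share of the most common rating is a useful reference point: any classifier should at least beat the accuracy of always predicting that rating.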

Let us look at the correlation matrix of the table to get a sense of the data and what features to use later in the model.

In [59]:
wines.corr().abs().style.background_gradient(cmap='Reds').format(precision=2)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1.0 | 0.02 | 0.29 | 0.09 | 0.02 | 0.05 | 0.09 | 0.27 | 0.43 | 0.02 | 0.12 | 0.11 |
| volatile acidity | 0.02 | 1.0 | 0.15 | 0.06 | 0.07 | 0.1 | 0.09 | 0.03 | 0.03 | 0.04 | 0.07 | 0.19 |
| citric acid | 0.29 | 0.15 | 1.0 | 0.09 | 0.11 | 0.09 | 0.12 | 0.15 | 0.16 | 0.06 | 0.08 | 0.01 |
| residual sugar | 0.09 | 0.06 | 0.09 | 1.0 | 0.09 | 0.3 | 0.4 | 0.84 | 0.19 | 0.03 | 0.45 | 0.1 |
| chlorides | 0.02 | 0.07 | 0.11 | 0.09 | 1.0 | 0.1 | 0.2 | 0.26 | 0.09 | 0.02 | 0.36 | 0.21 |
| free sulfur dioxide | 0.05 | 0.1 | 0.09 | 0.3 | 0.1 | 1.0 | 0.62 | 0.29 | 0.0 | 0.06 | 0.25 | 0.01 |
| total sulfur dioxide | 0.09 | 0.09 | 0.12 | 0.4 | 0.2 | 0.62 | 1.0 | 0.53 | 0.0 | 0.13 | 0.45 | 0.17 |
| density | 0.27 | 0.03 | 0.15 | 0.84 | 0.26 | 0.29 | 0.53 | 1.0 | 0.09 | 0.07 | 0.78 | 0.31 |
| pH | 0.43 | 0.03 | 0.16 | 0.19 | 0.09 | 0.0 | 0.0 | 0.09 | 1.0 | 0.16 | 0.12 | 0.1 |
| sulphates | 0.02 | 0.04 | 0.06 | 0.03 | 0.02 | 0.06 | 0.13 | 0.07 | 0.16 | 1.0 | 0.02 | 0.05 |


The features with the highest absolute correlation with quality, and therefore the most promising for predicting white wine quality, appear to be alcohol, density, chlorides, total sulfur dioxide and volatile acidity. However, density (0.78), total sulfur dioxide (0.45) and chlorides (0.36) are themselves correlated with alcohol, so they may add little beyond what alcohol already provides. It may be wise to bring them in only when fine-tuning the classifier.
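Reading the ranking off a colored matrix is error-prone, so the same comparison can be done in code. The sketch below sorts features by absolute Pearson correlation with a target column. `rank_by_abs_corr` is a hypothetical helper, not part of pandas, and the demonstration table is synthetic rather than the wine data.

```python
import numpy as np
import pandas as pd


def rank_by_abs_corr(df: pd.DataFrame, target: str) -> pd.Series:
    """Return absolute Pearson correlations with `target`, largest first."""
    corr = df.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False)


# Tiny synthetic table (hypothetical values, not the wine data):
# "strong" drives the target, "weak" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "strong": a,
    "weak": rng.normal(size=200),
    "target": 2 * a + rng.normal(scale=0.5, size=200),
})

ranking = rank_by_abs_corr(toy, "target")
print(ranking)
```

On the notebook's table the equivalent call would be `rank_by_abs_corr(wines, 'quality')`.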

## Building classifiers

First, make a train-test split of the white wine table to use later when evaluating the classifiers. The initial feature set is alcohol, volatile acidity, fixed acidity and chlorides.

In [61]:
x_train, x_test, y_train, y_test = train_test_split(
    wines[['alcohol', 'volatile acidity', 'fixed acidity', 'chlorides']].values,
    wines['quality'].values,
    test_size=0.1, random_state=42
)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(4408, 4) (490, 4) (4408,) (490,)
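Because a few ratings are rare, a purely random split can leave them unevenly represented between the training and test sets. One option, not used in the run above, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. This is a minimal sketch on synthetic labels; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical, not the wine data).
y = np.repeat([3, 4, 5, 6], [10, 20, 50, 20])
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# Each class keeps its 10/20/50/20 share in both parts of the split.
train_counts = np.unique(y_tr, return_counts=True)[1]
test_counts = np.unique(y_te, return_counts=True)[1]
print("train:", train_counts)
print("test: ", test_counts)
```

For the wine table this would mean passing `stratify=wines['quality'].values`. Whether that changes the reported accuracies has not been checked here.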


It is useful to have a very simple model as a baseline. In this case we will make a classifier that guesses a rating at random, weighted by how often each rating occurs in the training labels, and see how well it performs.

In [68]:
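def guess(x, seed=None):
    if seed is not None:
        np.random.seed(seed)
    # Weight each rating by its frequency among the training labels. This must be
    # y_train: a histogram of x_train would count feature values, not ratings.
    # The accuracies recorded below predate this fix, so a rerun will differ.
    distribution = np.histogram(y_train, range(1, 12))[0]
    return np.random.choice(np.arange(1, 11), x.shape[0], True, distribution / distribution.sum())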

print(accuracy_score(y_train, guess(x_train, seed=42)))
print(accuracy_score(y_test, guess(x_test, seed=42)))

0.1898820326678766
0.19795918367346937
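As a sanity check on what a frequency-weighted guess can achieve, the expected accuracy can be worked out directly. If the guess and the true label are independent draws from the same class distribution p, the probability that they agree is the sum of p_i squared. The sketch below uses the whole-table counts from the histogram earlier, on the assumption that the training split has roughly the same proportions, and confirms the formula with a small simulation.

```python
import numpy as np

# Counts for quality ratings 1..10, copied from the histogram output above
# (whole table, assumed to be close to the training-split proportions).
counts = np.array([0, 0, 20, 163, 1457, 2198, 880, 175, 5, 0])
p = counts / counts.sum()

# Guess and true label are independent draws from p, so
# P(guess == label) = sum_i p_i * p_i.
expected_accuracy = float(np.sum(p ** 2))
print(round(expected_accuracy, 3))

# Monte Carlo check of the same quantity.
rng = np.random.default_rng(42)
classes = np.arange(1, 11)
labels = rng.choice(classes, size=100_000, p=p)
guesses = rng.choice(classes, size=100_000, p=p)
simulated = float(np.mean(labels == guesses))
print(round(simulated, 3))
```

Under these assumptions the expected accuracy comes out at roughly 0.32, which is the figure to compare a label-weighted random baseline against.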


Now, we will experiment a bit with some basic classifiers to get a feel for the difficulty of the problem. Let us try a basic KNN, Random Forest and SVM approach.

In [64]:
knn = KNeighborsClassifier(3).fit(x_train, y_train)
print(accuracy_score(y_train, knn.predict(x_train)))
print(accuracy_score(y_test, knn.predict(x_test)))

0.7484119782214156
0.5510204081632653


In [67]:
rf = RandomForestClassifier(200).fit(x_train, y_train)
print(accuracy_score(y_train, rf.predict(x_train)))
print(accuracy_score(y_test, rf.predict(x_test)))

0.9984119782214156
0.6408163265306123


In [79]:
svm = make_pipeline(StandardScaler(), LinearSVC(dual=False, C=1.0, class_weight='balanced', random_state=42)).fit(x_train, y_train)
print(accuracy_score(y_train, svm.predict(x_train)))
print(accuracy_score(y_test, svm.predict(x_test)))

0.4330762250453721
0.42244897959183675
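To compare the three models side by side, a minimal sketch that tabulates the train and test accuracies recorded above and the gap between them. The numbers are copied by hand from the cell outputs, and the dictionary keys are illustrative labels.

```python
# Accuracies copied from the outputs above: (train, test).
scores = {
    "knn": (0.7484119782214156, 0.5510204081632653),
    "random_forest": (0.9984119782214156, 0.6408163265306123),
    "linear_svm": (0.4330762250453721, 0.42244897959183675),
}

# A large train-minus-test gap is a sign of overfitting.
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    train, test = scores[name]
    print(f"{name:14s} train={train:.3f} test={test:.3f} gap={gap:.3f}")
```

The random forest has both the best test accuracy and the largest gap, while the linear SVM barely overfits but also fits the training data poorly.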


The random forest looks like a good choice if we can reduce its overfitting and make it generalize better. To that end, we will use a grid search with cross-validation to find the best hyperparameters. Because a random forest considers only a random subset of the features at each split, it can also benefit from a wider feature pool, so this time we keep most of the features and drop only those with a very small absolute correlation with quality.

In [108]:
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    wines.drop(columns=['quality', 'citric acid', 'free sulfur dioxide', 'sulphates'], inplace=False).values,
    wines['quality'].values,
    test_size=0.1, random_state=42
)

In [116]:
params = {
    'n_estimators': [200, 500, 1000],
    'max_depth': [2, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'class_weight': ['balanced', None],
    'max_samples': [0.3, 0.6, None]
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=5).fit(x_train2, y_train2)
model = grid.best_estimator_

print(grid.best_params_)
print(accuracy_score(y_train2, model.predict(x_train2)))
print(accuracy_score(y_test2, model.predict(x_test2)))

{'class_weight': 'balanced', 'max_depth': None, 'max_samples': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 500}
0.9963702359346642
0.7040816326530612


The grid search together with the expanded feature pool improved test accuracy by about 6 percentage points (from 0.64 to 0.70). Training accuracy is still above 0.99, however, so the gap between training and test performance remains.
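The grid used above is fairly expensive, which is worth keeping in mind before extending it. As a rough sketch, the number of model fits can be counted from the parameter dictionary. The dictionary is copied from the cell above, and the final `+ 1` assumes `GridSearchCV`'s default `refit=True`, which retrains the best candidate on the full training set.

```python
from math import prod

# Parameter grid copied from the grid search cell above.
params = {
    'n_estimators': [200, 500, 1000],
    'max_depth': [2, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'class_weight': ['balanced', None],
    'max_samples': [0.3, 0.6, None],
}

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in params.values())

# Each candidate is fitted once per fold, plus one final refit of the winner.
cv_folds = 5
n_fits = n_candidates * cv_folds + 1

print(n_candidates, n_fits)
```

With this many forests to train, passing `n_jobs=-1` to `GridSearchCV` or sampling the grid with `RandomizedSearchCV` may be worth considering. Neither was used in the run reported here.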