# Bonus: Wine Quality Classification Model

In this bonus, we will build a classification model to predict the quality of wine.

# Import Libraries

In [977]:
import pandas as pd
import numpy as np

# Import Dataset

In [978]:
df = pd.read_csv("../data/anggur.csv")
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,5.9,0.4451,0.1813,2.049401,0.070574,16.593818,42.27,0.9982,3.27,0.71,8.64,7
1,8.4,0.5768,0.2099,3.10959,0.101681,22.555519,16.01,0.996,3.35,0.57,10.03,8
2,7.54,0.5918,0.3248,3.673744,0.072416,9.316866,35.52,0.999,3.31,0.64,9.23,8
3,5.39,0.4201,0.3131,3.371815,0.072755,18.2123,41.97,0.9945,3.34,0.55,14.07,9
4,6.51,0.5675,0.194,4.404723,0.066379,9.360591,46.27,0.9925,3.27,0.45,11.49,8


# Split Training and Test set

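We hold out 20% of the data as a test set, stratified by `quality` so that both splits keep the same class proportions. The test set is used only at the end, to measure how well each classifier generalizes to data it has not seen.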

In [979]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, stratify=df['quality'], random_state=42)

In [980]:
X_train = train_set.drop(['quality'], axis=1)
y_train = train_set['quality']
X_test = test_set.drop(['quality'], axis=1)
y_test = test_set['quality']
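
To see what `stratify` does, the sketch below repeats the split on a made-up label column and compares the class shares in the two parts. The `demo` frame and its class counts are hypothetical stand-ins, not the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (1000 rows, classes 5..10); not the real dataset
labels = np.repeat([5, 6, 7, 8, 9, 10], [20, 50, 250, 450, 200, 30])
demo = pd.DataFrame({"feature": np.arange(len(labels), dtype=float), "quality": labels})

demo_train, demo_test = train_test_split(
    demo, test_size=0.2, stratify=demo["quality"], random_state=42
)

# Share of each class in the two splits; with stratification they should match closely
comparison = pd.DataFrame({
    "train": demo_train["quality"].value_counts(normalize=True).sort_index(),
    "test": demo_test["quality"].value_counts(normalize=True).sort_index(),
})
print(comparison.round(3))
```

Without `stratify`, a rare class could end up entirely in one split, which would make its test metrics meaningless.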

In [981]:
X_train

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
244,7.72,0.5451,0.1264,2.801338,0.091782,20.567663,38.42,0.9963,3.29,0.72,10.56
917,8.27,0.7417,0.2181,2.339661,0.063838,8.221410,55.40,0.9950,3.29,0.61,11.20
895,7.49,0.4576,0.2252,3.177156,0.075400,18.603452,55.58,0.9922,3.25,0.68,8.97
66,7.25,0.5545,0.2535,1.721984,0.089206,21.507712,43.33,0.9921,3.35,0.46,11.19
331,9.44,0.5490,0.2622,5.210260,0.054500,24.021371,39.76,0.9998,3.25,0.55,10.97
...,...,...,...,...,...,...,...,...,...,...,...
732,6.72,0.4886,0.2933,2.178067,0.085630,15.476538,45.15,0.9973,3.20,0.51,13.87
547,5.30,0.5220,0.3049,1.130890,0.089340,13.756951,48.34,0.9976,3.32,0.55,11.32
569,8.18,0.3570,0.1931,1.693136,0.077684,7.751049,38.81,0.9952,3.46,0.63,10.11
155,7.40,0.4505,0.2401,2.798932,0.069678,23.568005,48.41,0.9965,3.44,0.55,13.58


In [982]:
y_train

244     8
917     9
895     7
66      8
331     9
       ..
732     9
547     8
569     8
155    10
52     10
Name: quality, Length: 800, dtype: int64

In [983]:
X_test

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
643,9.37,0.4153,0.2638,3.430787,0.056311,16.719396,24.74,0.9979,3.38,0.54,12.41
612,7.90,0.4026,0.2746,2.791431,0.082855,6.172613,28.34,0.9987,3.27,0.54,11.16
822,6.82,0.5197,0.3358,2.408717,0.100882,13.210217,41.16,0.9958,3.17,0.56,8.96
982,8.25,0.5035,0.2690,1.573458,0.105009,7.389244,38.75,0.9956,3.42,0.66,10.48
588,7.91,0.6452,0.2551,3.074861,0.123317,15.755803,33.95,0.9992,3.28,0.50,12.20
...,...,...,...,...,...,...,...,...,...,...,...
748,6.32,0.4472,0.2593,3.599399,0.069487,11.033585,28.55,0.9955,3.19,0.56,12.40
957,7.88,0.4736,0.2887,4.380263,0.055661,21.633450,45.21,0.9963,3.30,0.52,9.18
884,6.88,0.4912,0.2175,1.992063,0.082524,9.005354,33.34,0.9966,3.50,0.71,11.31
688,7.44,0.6484,0.2596,3.196452,0.038998,21.260241,34.14,0.9962,3.38,0.62,7.19


In [984]:
y_test

643    10
612     8
822     7
982     8
588     9
       ..
748     9
957     8
884     8
688     6
170     6
Name: quality, Length: 200, dtype: int64

# Data Preprocessing

In this section, we will process the data before feeding it into a classifier.

## Handle Missing Values

In [985]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1000 non-null   float64
 1   volatile acidity      1000 non-null   float64
 2   citric acid           1000 non-null   float64
 3   residual sugar        1000 non-null   float64
 4   chlorides             1000 non-null   float64
 5   free sulfur dioxide   1000 non-null   float64
 6   total sulfur dioxide  1000 non-null   float64
 7   density               1000 non-null   float64
 8   pH                    1000 non-null   float64
 9   sulphates             1000 non-null   float64
 10  alcohol               1000 non-null   float64
 11  quality               1000 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 93.9 KB


Every column has 1000 non-null entries, so there are no missing values to handle.

## Encode Target Variable

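The target variable takes integer values from 5 to 10. Some classifiers accept such labels directly, but others (recent versions of XGBoost, for example) expect class labels that start at 0 and increase by 1. We therefore encode the target with `LabelEncoder`, fitting it on the training labels and reusing the same mapping for the test labels.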

In [986]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)
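
As a small illustration of the mapping, the sketch below encodes a handful of made-up quality scores and converts them back. The names `demo_le` and `demo_labels` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up quality scores covering the range 5..10
demo_labels = np.array([8, 9, 7, 8, 10, 5, 6])

demo_le = LabelEncoder()
encoded = demo_le.fit_transform(demo_labels)

# classes_ holds the original labels in sorted order; the position is the encoded value
mapping = {int(label): code for code, label in enumerate(demo_le.classes_)}
print(mapping)

# inverse_transform recovers the original scores, e.g. for reporting predictions
decoded = demo_le.inverse_transform(encoded)
print(decoded)
```

This is why the classification reports further down list classes 0 to 5 rather than 5 to 10.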

## Generate Synthetic Data

The training set is relatively small and its classes are imbalanced. We first oversample the minority classes with SMOTE to balance the class distribution, then enlarge the training set further by adding copies of its rows perturbed with Gaussian noise.

In [987]:
from imblearn.over_sampling import SMOTE

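# Oversample the minority classes so that every class is equally represented.
# SMOTE needs more samples in a class than k_neighbors, so use the size of the
# smallest class minus one.
y_counts = np.bincount(y_train)
min_sample = np.min(y_counts[np.nonzero(y_counts)]) - 1
sm = SMOTE(random_state=42, k_neighbors=min_sample)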
X_train, y_train = sm.fit_resample(X_train, y_train)
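
The idea behind SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the segment between them. The function `smote_like_samples` below is a hypothetical, simplified illustration of that interpolation step, not the imbalanced-learn implementation. It also shows why `k_neighbors` has to stay below the size of the smallest class.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k_neighbors, seed=0):
    """Simplified SMOTE-style interpolation (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Ask for k_neighbors + 1 because each point comes back as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)
    # Pick a random base point and one of its k neighbors (column 0 is the point itself)
    base = rng.integers(0, len(X_minority), size=n_new)
    neighbor_col = rng.integers(1, k_neighbors + 1, size=n_new)
    neighbor = neighbor_idx[base, neighbor_col]
    # Place the new point at a random position on the segment between the two
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])


# Hypothetical minority class with only 4 samples, so k_neighbors can be at most 3
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
k_neighbors = len(X_minority) - 1
X_new = smote_like_samples(X_minority, n_new=10, k_neighbors=k_neighbors)
print(X_new.shape)
```

Because every synthetic point lies between two existing minority points, SMOTE fills in the region the class already occupies rather than duplicating rows.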

In [988]:
# Standard deviation of the Gaussian noise; the same absolute value is used for
# every feature, and the noise is added before feature scaling
std_dev = 0.1

# Number of noisy copies to generate for each row of the resampled training set
n_iter = 5
fold = len(X_train)

# Keep the resampled rows aside so that every pass perturbs the same base data
base_X = X_train.iloc[:fold]
base_y = y_train[:fold]

for i in range(n_iter):
    # Draw the noise for the whole block at once instead of row by row
    noise = np.random.normal(loc=0, scale=std_dev, size=base_X.shape)
    X_train = pd.concat([X_train, base_X + noise], axis=0)
    y_train = np.append(y_train, base_y)

In [989]:
print(len(X_train), len(y_train))

12924 12924
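
One thing worth checking with this kind of augmentation is how the fixed noise level compares with each feature's own spread, because the noise is added on the original scale. The sketch below uses only the five rows shown by `df.head()` above, so the ratios are rough indications rather than statistics of the full dataset. The name `noise_to_spread` is introduced here for illustration.

```python
import pandas as pd

# Two columns copied from the df.head() output above (five rows only)
head_rows = pd.DataFrame({
    "total sulfur dioxide": [42.27, 16.01, 35.52, 41.97, 46.27],
    "density": [0.9982, 0.996, 0.999, 0.9945, 0.9925],
})

std_dev = 0.1  # noise level used in the augmentation cell

# Ratio above 1: the noise is larger than the feature's own variation
# Ratio far below 1: the noise barely changes the feature
noise_to_spread = std_dev / head_rows.std()
print(noise_to_spread.round(3))
```

On these rows the noise is much larger than the variation in `density` and negligible for `total sulfur dioxide`. Scaling the noise per feature, or adding it after standardization, would be one way to even this out, though that is a possible variation and not what the notebook does.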


## Feature Scaling

In this section, we standardize the features so that each one has zero mean and unit variance on the training data. Standardization rescales the features without changing the shape of their distributions, so it does not make them normally distributed. Its purpose is to put features with very different ranges on a comparable scale, which helps scale-sensitive models such as Logistic Regression and SVM converge and weigh the features evenly. Tree-based models such as XGBoost are largely unaffected. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.

In [990]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
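
The following sketch shows the fit-on-train, transform-on-test pattern on a tiny made-up array. The arrays `demo_train` and `demo_test` are hypothetical and chosen so the numbers are easy to follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
demo_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
demo_test = np.array([[2.0, 400.0]])

demo_sc = StandardScaler()
train_scaled = demo_sc.fit_transform(demo_train)  # learns mean and std from the training rows
test_scaled = demo_sc.transform(demo_test)        # reuses the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The training columns end up with mean 0 and standard deviation 1, while the test row is expressed in training-set units, so a test value outside the training range simply maps to a large z-score.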

In [991]:
X_train

array([[ 0.6508553 ,  0.22116508, -1.36115963, ...,  0.04376549,
         0.98052311,  0.22693379],
       [ 1.05054744,  1.67618347, -0.46465626, ...,  0.04376549,
         0.1428833 ,  0.48728043],
       [ 0.48371131, -0.42641432, -0.39524324, ..., -0.2500715 ,
         0.67592682, -0.41986489],
       ...,
       [ 0.1444766 , -1.3820829 , -0.1686155 , ..., -0.78387294,
         1.70821381,  1.70560842],
       [ 0.75620467,  1.63778712, -1.41021028, ..., -0.01811131,
         0.19981156,  1.39362231],
       [ 1.41061112, -0.33480214,  1.09559882, ..., -0.59310825,
         0.55410014,  0.81617185]])

In [992]:
X_test

array([[ 1.84993173, -0.7394727 , -0.01787105, ...,  0.70489872,
        -0.39016022,  0.9794983 ],
       [ 0.78166364, -0.83346422,  0.08771495, ..., -0.103153  ,
        -0.39016022,  0.47100877],
       [-0.00318639,  0.03318203,  0.68603562, ..., -0.83774547,
        -0.23786207, -0.42393281],
       ...,
       [ 0.04041639, -0.17774383, -0.47052215, ...,  1.58640968,
         0.90437404,  0.53202751],
       [ 0.44737566,  0.98567881, -0.05893228, ...,  0.70489872,
         0.21903237, -1.14395398],
       [-0.64269382,  0.53570364,  0.24413865, ..., -0.32353074,
        -0.00941485, -0.88767526]])

# Implement Classifiers

In this section, we will fit multiple models to the processed training data.

## Logistic Regression

In [993]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42, max_iter=10000)
lr.fit(X_train, y_train)

## Support Vector Machine

In [994]:
from sklearn.svm import SVC

svc = SVC(random_state=42)
svc.fit(X_train, y_train)

## XGBoost

In [995]:
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train, y_train)

## Voting Classifier

In [996]:
from sklearn.ensemble import VotingClassifier

vote = VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42, max_iter=10000)), 
                                    ('xgb', XGBClassifier())], voting='hard')
vote.fit(X_train, y_train)
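
With `voting='hard'`, each estimator casts one vote per sample and the most frequent label wins. The sketch below imitates that rule with NumPy on made-up predictions. It is a simplified illustration, not scikit-learn's code, and the tie-breaking shown (lowest class label wins) follows the behavior described in the scikit-learn documentation.

```python
import numpy as np

# Made-up encoded predictions: one row per sample, one column per estimator
preds = np.array([
    [2, 3],  # the two estimators disagree
    [3, 3],  # the two estimators agree
    [4, 1],  # the two estimators disagree
])

n_classes = 6
# Count the votes per class for each sample; argmax returns the first (lowest) label on ties
hard_vote = np.array([np.bincount(row, minlength=n_classes).argmax() for row in preds])
print(hard_vote)
```

With only two estimators, every disagreement is a tie, so the ensemble then returns whichever of the two predicted labels is smaller. Adding a third estimator or using `voting='soft'` are common ways to avoid that, although the notebook does not explore them.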

# Evaluation

In this section, we evaluate each model on the test set.

In [997]:
from sklearn.metrics import classification_report

y_pred_lr = lr.predict(X_test)
print(classification_report(y_test, y_pred_lr, zero_division=0))

              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       0.88      0.88      0.88         8
           2       0.91      0.98      0.94        50
           3       0.96      0.81      0.88        90
           4       0.76      0.86      0.81        44
           5       0.70      1.00      0.82         7

    accuracy                           0.88       200
   macro avg       0.78      0.92      0.83       200
weighted avg       0.89      0.88      0.88       200



In [998]:
y_pred_svc = svc.predict(X_test)
print(classification_report(y_test, y_pred_svc, zero_division=0))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.60      0.75      0.67         8
           2       0.87      0.82      0.85        50
           3       0.88      0.86      0.87        90
           4       0.81      0.86      0.84        44
           5       0.86      0.86      0.86         7

    accuracy                           0.84       200
   macro avg       0.67      0.69      0.68       200
weighted avg       0.84      0.84      0.84       200



In [999]:
y_pred_xgb = xgb.predict(X_test)
print(classification_report(y_test, y_pred_xgb, zero_division=0))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.86      0.75      0.80         8
           2       0.92      0.90      0.91        50
           3       0.86      0.87      0.86        90
           4       0.76      0.80      0.78        44
           5       0.83      0.71      0.77         7

    accuracy                           0.84       200
   macro avg       0.70      0.67      0.69       200
weighted avg       0.85      0.84      0.85       200



In [1000]:
y_pred_vote = vote.predict(X_test)
print(classification_report(y_test, y_pred_vote, zero_division=0))

              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       0.88      0.88      0.88         8
           2       0.91      0.98      0.94        50
           3       0.90      0.87      0.88        90
           4       0.79      0.77      0.78        44
           5       0.83      0.71      0.77         7

    accuracy                           0.87       200
   macro avg       0.80      0.87      0.82       200
weighted avg       0.87      0.87      0.87       200



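After conducting multiple trials, we can conclude that the **Logistic Regression model outperformed other models on the test data**, with 0.88 accuracy and 0.83 macro F1. It is narrowly ahead of the voting classifier (0.87 and 0.82) and clearly ahead of SVM and XGBoost (both 0.84 accuracy, with macro F1 of 0.68 and 0.69). The macro averages should be read with care, because class 0 has a single test sample and that one prediction moves its F1 score between 0.00 and 0.67. The result could be attributed to the strong correlation between the `alcohol` feature and the target variable, where the average `alcohol` value of each class showed a linear relationship with the target. Logistic Regression is well suited to datasets with this kind of near-linear relationship between the input and output variables, which could explain why it performed best in this case.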
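
The `macro avg` and `weighted avg` rows in the reports can be reproduced from the per-class rows, which makes the difference between them concrete. The sketch below uses the per-class F1 scores and supports from the Logistic Regression report above. Because the inputs are already rounded to two decimals, the results are approximate.

```python
import numpy as np

# Per-class F1 and support copied from the Logistic Regression report
f1_per_class = np.array([0.67, 0.88, 0.94, 0.88, 0.81, 0.82])
support = np.array([1, 8, 50, 90, 44, 7])

# Macro average: every class counts equally, however rare it is
macro_f1 = f1_per_class.mean()

# Weighted average: each class counts in proportion to its number of test samples
weighted_f1 = np.average(f1_per_class, weights=support)

print(round(macro_f1, 2), round(weighted_f1, 2))
```

The single-sample class 0 contributes one sixth of the macro average but only 1/200 of the weighted average, which is why the macro scores differ so much between models while the weighted scores stay close together.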