# Predicting Wine Quality Using Linear and Ordinal Regression

### Authors

| Name | Roll Number |
| - | - |
| Gautam Singh | CS21BTECH11018 |
| Jaswanth Beere | BM21BTECH11007 |

This `.ipynb` file predicts the quality of wine using ordinal regression Packages in Python.

## Package Imports

The required packages for the prediction are specified here

In [9]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.miscmodels.ordinal_model import OrderedModel

## Loading Datasets

The `pandas` library is used to load the `csv` files.

In [10]:
red_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
white_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')

From the dataset, it is clear that the `quality` variable is ordinal with ratings ranging from 3 to 9, while the other variables are real-valued. We require to predict the value of `quality` given the other independent variables. We use an ordinal regression model as well as a linear regression model for the same and compare the performance of both models.

### Preprocessing

Each dataset is split as follows using _proportional sampling_. This is because the ratings are not evenly distributed, and thus using random sampling may not generate an optimal model.
1. 80 percent _training_ data.
2. 20 percent _test_ data.

Before splitting the data, we adjust the ratings so that the smallest rating is zero.

In [11]:
# Adjust ratings to start from zero onwards
red_df['quality'] -= red_df['quality'].min()
white_df['quality'] -= white_df['quality'].min()
# Proportionally sample from dataset to create training dataset
red_train_df = red_df.groupby('quality').apply(lambda x : x.sample(frac=0.8)).reset_index(drop=True)
white_train_df = white_df.groupby('quality').apply(lambda x : x.sample(frac=0.8)).reset_index(drop=True)
# The remaining data becomes the test dataset
red_test_df = pd.concat([red_df, red_train_df]).drop_duplicates(keep=False).reset_index(drop=True)
white_test_df = pd.concat([white_df, white_train_df]).drop_duplicates(keep=False).reset_index(drop=True)

## Training

### Ordinal Regression

The `statsmodels` library is used to perform ordinal regression on the given dataset.

In [12]:
# Perform ordinal regression on the training dataset
red_mod_prob = OrderedModel(red_train_df['quality'], red_train_df.loc[:, red_train_df.columns != 'quality'])
# Use the BFGS algorithm to find the maximum likelihood solution
red_res_prob = red_mod_prob.fit(method='bfgs')
# Summarize the results of training
red_res_prob.summary()

Optimization terminated successfully.
         Current function value: 0.959643
         Iterations: 68
         Function evaluations: 71
         Gradient evaluations: 71


0,1,2,3
Dep. Variable:,quality,Log-Likelihood:,-1226.4
Model:,OrderedModel,AIC:,2485.0
Method:,Maximum Likelihood,BIC:,2567.0
Date:,"Sun, 08 Oct 2023",,
Time:,22:14:10,,
No. Observations:,1278,,
Df Residuals:,1262,,
Df Model:,11,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
fixed acidity,0.0141,0.050,0.279,0.780,-0.085,0.113
volatile acidity,-2.0449,0.250,-8.167,0.000,-2.536,-1.554
citric acid,-0.3542,0.288,-1.228,0.219,-0.919,0.211
residual sugar,0.0139,0.029,0.476,0.634,-0.043,0.071
chlorides,-2.5133,0.832,-3.019,0.003,-4.145,-0.882
free sulfur dioxide,0.0116,0.004,2.773,0.006,0.003,0.020
total sulfur dioxide,-0.0066,0.001,-4.705,0.000,-0.009,-0.004
density,-0.6888,41.924,-0.016,0.987,-82.858,81.480
pH,-0.9062,0.381,-2.380,0.017,-1.653,-0.160


In [13]:
# Perform ordinal regression on the training dataset
white_mod_prob = OrderedModel(white_train_df['quality'], white_train_df.loc[:, white_train_df.columns != 'quality'])
# Use the BFGS algorithm to find the maximum likelihood solution
white_res_prob = white_mod_prob.fit(method='bfgs')
# Summarize the results of training
white_res_prob.summary()

Optimization terminated successfully.
         Current function value: 1.124708
         Iterations: 111
         Function evaluations: 115
         Gradient evaluations: 115


0,1,2,3
Dep. Variable:,quality,Log-Likelihood:,-4406.6
Model:,OrderedModel,AIC:,8847.0
Method:,Maximum Likelihood,BIC:,8954.0
Date:,"Sun, 08 Oct 2023",,
Time:,22:14:20,,
No. Observations:,3918,,
Df Residuals:,3901,,
Df Model:,11,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
fixed acidity,0.0780,0.033,2.376,0.018,0.014,0.142
volatile acidity,-2.8390,0.188,-15.078,0.000,-3.208,-2.470
citric acid,0.0231,0.158,0.146,0.884,-0.286,0.333
residual sugar,0.1130,0.012,9.566,0.000,0.090,0.136
chlorides,0.1220,0.874,0.140,0.889,-1.590,1.834
free sulfur dioxide,0.0042,0.001,3.111,0.002,0.002,0.007
total sulfur dioxide,-0.0002,0.001,-0.403,0.687,-0.001,0.001
density,-191.9481,29.218,-6.569,0.000,-249.215,-134.681
pH,0.9461,0.168,5.631,0.000,0.617,1.275


### Linear Regression

We use `scikit-learn` to perform a standard linear regression on the data, by treating `quality` to be a real-valued dependent variable.

In [14]:
# Perform linear regression on training data
red_reg = LinearRegression().fit(red_df.loc[:, red_df.columns != 'quality'], red_df['quality'])
white_reg = LinearRegression().fit(white_df.loc[:, white_df.columns != 'quality'], white_df['quality'])
# Show parameter coefficients for the linear models
print("Parameters for red wine:", red_reg.coef_)
print("Parameters for white wine:", white_reg.coef_)

Parameters for red wine: [ 2.49905527e-02 -1.08359026e+00 -1.82563948e-01  1.63312698e-02
 -1.87422516e+00  4.36133331e-03 -3.26457970e-03 -1.78811638e+01
 -4.13653144e-01  9.16334413e-01  2.76197699e-01]
Parameters for white wine: [ 6.55199614e-02 -1.86317709e+00  2.20902007e-02  8.14828026e-02
 -2.47276537e-01  3.73276519e-03 -2.85747419e-04 -1.50284181e+02
  6.86343742e-01  6.31476473e-01  1.93475697e-01]


## Testing

### Ordinal Regression

The _root mean squared error_ (RMSE) is a good evaluation metric as per [Gaudette and Japkowicz 2009](https://link.springer.com/chapter/10.1007/978-3-642-01818-3_25), since the ordinal data consists of small integers and deviations from the mean are penalized more severely.

In [15]:
# Predict class based on maximum probability
red_ord_pred = red_res_prob.predict(red_test_df.loc[:, red_test_df.columns != 'quality']).idxmax(axis=1)
white_ord_pred = white_res_prob.predict(white_test_df.loc[:, white_test_df.columns != 'quality']).idxmax(axis=1)
# Calculate RMSE from correct labels
red_ord_rmse = ((red_ord_pred - red_test_df['quality'])**2).mean()**0.5
white_ord_rmse = ((white_ord_pred - white_test_df['quality'])**2).mean()**0.5
print("RMSE for red wine:", red_ord_rmse)
print("RMSE for white wine:", white_ord_rmse)

RMSE for red wine: 0.7714542762891773
RMSE for white wine: 0.8340576562282991


### Linear Regression

As per usual, a linear regression model is evaluated using the RMSE.

In [16]:
# Perform predictions on test data
red_reg_pred = red_reg.predict(red_test_df.loc[:, red_test_df.columns != 'quality'])
white_reg_pred = white_reg.predict(white_test_df.loc[:, white_test_df.columns != 'quality'])
# Compute RMSE of predictions
red_reg_rmse = ((red_reg_pred - red_test_df['quality'])**2).mean()**0.5
white_reg_rmse = ((white_reg_pred - white_test_df['quality'])**2).mean()**0.5
print("RMSE for red wine:", red_reg_rmse)
print("RMSE for white wine:", white_reg_rmse)

RMSE for red wine: 0.6949935995716785
RMSE for white wine: 0.7610717259997346


## Results

On most runs, it seems that linear regression performs slightly better than ordinal regression, on ordinal data! The possible reasons for this are as follows.

1. The ordinal data corresponds to small integers.
2. The interval classes are spaced apart equally, and linear regression performs equally well in these cases.

However, the downside of using linear regression is a lack of interpretability even though it is a simpler model to work with. Thus, ordinal regression would still be preferred given its high interpretability at the cost of a small increase in RMSE.