# Predicting Wine Quality Using Linear and Ordinal Regression

### Authors

| Name | Roll Number |
| - | - |
| Gautam Singh | CS21BTECH11018 |
| Jaswanth Beere | BM21BTECH11007 |

This `.ipynb` file predicts the quality of wine using ordinal regression Packages in Python.

## Package Imports

The required packages for the prediction are specified here

In [181]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.miscmodels.ordinal_model import OrderedModel

## Loading Datasets

The `pandas` library is used to load the `csv` files.

In [182]:
red_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
white_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')

From the dataset, it is clear that the `quality` variable is ordinal with ratings ranging from 3 to 9, while the other variables are real-valued. We require to predict the value of `quality` given the other independent variables. We use an ordinal regression model as well as a linear regression model for the same and compare the performance of both models.

### Preprocessing

Each dataset is split as follows using _proportional sampling_. This is because the ratings are not evenly distributed, and thus using random sampling may not generate an optimal model.
1. 80 percent _training_ data.
2. 20 percent _test_ data.

Before splitting the data, we adjust the ratings so that the smallest rating is zero.

In [183]:
# Adjust ratings to start from zero onwards
red_df['quality'] -= red_df['quality'].min()
white_df['quality'] -= white_df['quality'].min()
# Proportionally sample from dataset to create training dataset
red_train_df = red_df.groupby('quality').apply(lambda x : x.sample(frac=0.8)).reset_index(drop=True)
white_train_df = white_df.groupby('quality').apply(lambda x : x.sample(frac=0.8)).reset_index(drop=True)
# The remaining data becomes the test dataset
red_test_df = pd.concat([red_df, red_train_df]).drop_duplicates(keep=False).reset_index(drop=True)
white_test_df = pd.concat([white_df, white_train_df]).drop_duplicates(keep=False).reset_index(drop=True)

## Training

### Ordinal Regression

The `statsmodels` library is used to perform ordinal regression on the given dataset.

In [184]:
# Perform ordinal regression on the training dataset
red_mod_prob = OrderedModel(red_train_df['quality'], red_train_df.loc[:, red_train_df.columns != 'quality'])
# Use the BFGS algorithm to find the maximum likelihood solution
red_res_prob = red_mod_prob.fit(method='bfgs')
# Summarize the results of training
red_res_prob.summary()

Optimization terminated successfully.
         Current function value: 0.960790
         Iterations: 67
         Function evaluations: 71
         Gradient evaluations: 71


0,1,2,3
Dep. Variable:,quality,Log-Likelihood:,-1227.9
Model:,OrderedModel,AIC:,2488.0
Method:,Maximum Likelihood,BIC:,2570.0
Date:,"Sun, 08 Oct 2023",,
Time:,14:20:57,,
No. Observations:,1278,,
Df Residuals:,1262,,
Df Model:,11,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
fixed acidity,0.0113,0.050,0.229,0.819,-0.086,0.109
volatile acidity,-1.8174,0.240,-7.568,0.000,-2.288,-1.347
citric acid,-0.0840,0.292,-0.288,0.773,-0.655,0.487
residual sugar,0.0359,0.028,1.284,0.199,-0.019,0.091
chlorides,-3.7313,0.790,-4.724,0.000,-5.279,-2.183
free sulfur dioxide,0.0086,0.004,2.053,0.040,0.000,0.017
total sulfur dioxide,-0.0065,0.001,-4.599,0.000,-0.009,-0.004
density,-0.8882,41.865,-0.021,0.983,-82.942,81.165
pH,-0.7541,0.368,-2.049,0.040,-1.475,-0.033


In [185]:
# Perform ordinal regression on the training dataset
white_mod_prob = OrderedModel(white_train_df['quality'], white_train_df.loc[:, white_train_df.columns != 'quality'])
# Use the BFGS algorithm to find the maximum likelihood solution
white_res_prob = white_mod_prob.fit(method='bfgs')
# Summarize the results of training
white_res_prob.summary()

Optimization terminated successfully.
         Current function value: 1.123059
         Iterations: 112
         Function evaluations: 116
         Gradient evaluations: 116


0,1,2,3
Dep. Variable:,quality,Log-Likelihood:,-4400.1
Model:,OrderedModel,AIC:,8834.0
Method:,Maximum Likelihood,BIC:,8941.0
Date:,"Sun, 08 Oct 2023",,
Time:,14:21:07,,
No. Observations:,3918,,
Df Residuals:,3901,,
Df Model:,11,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
fixed acidity,0.1631,0.037,4.417,0.000,0.091,0.236
volatile acidity,-2.8290,0.189,-14.934,0.000,-3.200,-2.458
citric acid,0.0655,0.157,0.418,0.676,-0.241,0.372
residual sugar,0.1475,0.014,10.549,0.000,0.120,0.175
chlorides,-0.4046,0.886,-0.457,0.648,-2.141,1.332
free sulfur dioxide,0.0048,0.001,3.508,0.000,0.002,0.007
total sulfur dioxide,3.08e-05,0.001,0.049,0.961,-0.001,0.001
density,-311.9295,37.463,-8.326,0.000,-385.355,-238.504
pH,1.2175,0.181,6.708,0.000,0.862,1.573


### Linear Regression

We use `scikit-learn` to perform a standard linear regression on the data, by treating `quality` to be a real-valued dependent variable.

In [186]:
# Perform linear regression on training data
red_reg = LinearRegression().fit(red_df.loc[:, red_df.columns != 'quality'], red_df['quality'])
white_reg = LinearRegression().fit(white_df.loc[:, white_df.columns != 'quality'], white_df['quality'])
# Show parameter coefficients for the linear models
print("Parameters for red wine:", red_reg.coef_)
print("Parameters for white wine:", white_reg.coef_)

Parameters for red wine: [ 2.49905527e-02 -1.08359026e+00 -1.82563948e-01  1.63312698e-02
 -1.87422516e+00  4.36133331e-03 -3.26457970e-03 -1.78811638e+01
 -4.13653144e-01  9.16334413e-01  2.76197699e-01]
Parameters for white wine: [ 6.55199614e-02 -1.86317709e+00  2.20902007e-02  8.14828026e-02
 -2.47276537e-01  3.73276519e-03 -2.85747419e-04 -1.50284181e+02
  6.86343742e-01  6.31476473e-01  1.93475697e-01]


## Testing

### Ordinal Regression

Since the ordinal data contains small integers, the _root mean squared error_ (RMSE) is a good evaluation metric.

In [187]:
# Predict class based on maximum probability
red_ord_pred = red_res_prob.predict(red_test_df.loc[:, red_test_df.columns != 'quality']).idxmax(axis=1)
white_ord_pred = white_res_prob.predict(white_test_df.loc[:, white_test_df.columns != 'quality']).idxmax(axis=1)
# Calculate RMSE from correct labels
red_ord_rmse = ((red_ord_pred - red_test_df['quality'])**2).mean()**0.5
white_ord_rmse = ((white_ord_pred - white_test_df['quality'])**2).mean()**0.5
print("RMSE for red wine:", red_ord_rmse)
print("RMSE for white wine:", white_ord_rmse)

RMSE for red wine: 0.7194052124144799
RMSE for white wine: 0.806946584785929


### Linear Regression

As per usual, a linear regression model is evaluated using the RMSE.

In [188]:
# Perform predictions on test data
red_reg_pred = red_reg.predict(red_test_df.loc[:, red_test_df.columns != 'quality'])
white_reg_pred = white_reg.predict(white_test_df.loc[:, white_test_df.columns != 'quality'])
# Compute RMSE of predictions
red_reg_rmse = ((red_reg_pred - red_test_df['quality'])**2).mean()**0.5
white_reg_rmse = ((white_reg_pred - white_test_df['quality'])**2).mean()**0.5
print("RMSE for red wine:", red_reg_rmse)
print("RMSE for white wine:", white_reg_rmse)

RMSE for red wine: 0.6730945439280761
RMSE for white wine: 0.7493466349620683


## Results

On most runs, it seems that linear regression performs slightly better than ordinal regression, on ordinal data! The possible reasons for this are as follows.

1. The ordinal data corresponds to small integers.
2. The interval classes are spaced apart equally, and linear regression performs equally well in these cases.