# Wine Quality Rating - Python Group Project - MSCI:6040

By Group 5 (Kiran Bichugatti, Tarun Mandava, Senthil Kumar Gopalakrishnan, Venkatesan Kandavelu - Data Scientists)
- June, 2020

Objective - Determine which chemical properties of wine are most strongly associated with high quality. Once those factors are identified, fit an analytical model that predicts the quality rating of a sampled wine from its chemical properties.

Data source: Red wine dataset from Kaggle


In [1]:
#Importing the clean wine data
import pandas as pd

clean_wine_data = pd.read_csv("Clean_Wine_data.csv")

clean_wine_data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,wine rating
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Average
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,Average
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,Average
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,Average
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Average


Summary statistics for each of the columns:

In [2]:
print(clean_wine_data.describe())

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1179.000000       1179.000000  1179.000000     1179.000000   
mean        8.162002          0.523066     0.246760        2.185411   
std         1.458270          0.164231     0.179441        0.440972   
min         5.100000          0.120000     0.000000        1.200000   
25%         7.100000          0.390000     0.080000        1.900000   
50%         7.800000          0.520000     0.240000        2.100000   
75%         9.000000          0.630000     0.390000        2.500000   
max        12.300000          1.005000     0.730000        3.600000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1179.000000          1179.000000           1179.000000  1179.000000   
mean      0.078586            15.020356             42.268024     0.996584   
std       0.014317             8.792916             26.106438     0.001593   
min       0.041000             1.000000         

# Machine Learning with Scikit-Learn

In [3]:
from sklearn.cluster import KMeans                        # k-means clustering 
from sklearn.model_selection import train_test_split      # train/test data
from sklearn.neighbors import KNeighborsClassifier        # k-NN classification 
from sklearn.linear_model import LogisticRegression       # logistic regression 

import math
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error
from sklearn.metrics import confusion_matrix


### There are 11 features and 1 target. Define X as the feature dataset and y as the target dataset. Here quality is the target variable.

In [4]:
features = list(clean_wine_data.columns)
features.remove('quality')
features.remove('wine rating')
target = "quality"

X = clean_wine_data[features]
y = clean_wine_data[target]

### Split X and y into two training datasets X_train and y_train and two test datasets X_test and y_test. Set test_size to 25% and random_state to 0

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)  
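
As a quick check of what this split produces, the sketch below runs the same call on a synthetic stand-in with 1179 rows, the row count reported by `describe()` above. The arrays `X_demo` and `y_demo` are hypothetical placeholders, not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the cleaned dataset (1179).
X_demo = np.arange(1179 * 2).reshape(1179, 2)
y_demo = np.arange(1179) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# The test share is rounded up, so 25% of 1179 rows gives 295 test rows.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, which is why the revised model later in the notebook can be compared on the same rows.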

## Modeling with Logistic Regression

In [6]:
lr = LogisticRegression(solver="liblinear")   # Build a new logistic regression model 
lr.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [7]:
lr.fit(X_train, y_train)

## For LogisticRegression, score() returns the mean accuracy (not R2)
lr.score(X_train, y_train)                                            # Training score of the model


0.6176470588235294

In [8]:
print(lr.score(X_test, y_test))                                        # Test score of the model

0.6305084745762712
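
Because `LogisticRegression` is a classifier, its `score` method reports mean accuracy, that is, the fraction of samples whose predicted class matches the true class. The sketch below illustrates this on synthetic data; `make_classification`, the default solver, and `max_iter=1000` are assumptions chosen for the illustration and are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical three-class data, standing in for the quality classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Both numbers are the same quantity: the share of correct predictions.
acc_from_score = clf.score(X_te, y_te)
acc_from_metric = accuracy_score(y_te, clf.predict(X_te))
print(acc_from_score, acc_from_metric)
```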


### Removing the highly correlated variables to try to improve the mean accuracy (score)

In [9]:
revisedfeatures = list(clean_wine_data.columns)
revisedfeatures.remove('quality')
revisedfeatures.remove('volatile acidity')
revisedfeatures.remove('wine rating')
revisedfeatures.remove('density')
revisedfeatures.remove('fixed acidity')


target = "quality"

X_revised = clean_wine_data[revisedfeatures]
y_revised = clean_wine_data[target]


X_revised_train, X_revised_test, y_revised_train, y_revised_test = train_test_split(X_revised, y_revised, test_size=0.25, random_state=0)  

revisedlr = LogisticRegression(solver="liblinear")   # Build a new logistic regression model on the reduced feature set
revisedlr.fit(X_revised_train, y_revised_train)

print(revisedlr.score(X_revised_train, y_revised_train), revisedlr.score(X_revised_test, y_revised_test))

0.6221719457013575 0.6


### The training score improved slightly (0.618 to 0.622), but the test score decreased (0.631 to 0.600).
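
The notebook does not show how the correlated columns were identified. One possible approach, sketched here on a small hypothetical DataFrame, is to scan the absolute correlation matrix for pairs above a threshold. The column names `a`, `b`, `c` and the 0.6 cutoff are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                      # independent of the others
})

corr = demo.corr().abs()

# Keep only the upper triangle so each pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.6  # hypothetical cutoff
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)
```

Dropping one column from each flagged pair is a common heuristic, but as the scores above show, it does not guarantee a better test score.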

### Other Accuracy Measures

In [10]:
# Make predictions against the test set

preds = lr.predict(X_test)

score = explained_variance_score(y_test, preds)
mae = mean_absolute_error(y_test, preds)
rmse = math.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)
    
print("score = {:.5f} | MAE = {:.3f} | RMSE = {:.3f} | R2 = {:.5f}"
          .format(score, mae, rmse, r2))

score = 0.21454 | MAE = 0.390 | RMSE = 0.656 | R2 = 0.21351
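
MAE, RMSE, and R2 treat the quality classes as plain numbers. Since `lr` is a classifier, the classification metrics imported in In [3] (`confusion_matrix`, `accuracy_score`, `classification_report`) may describe it more directly. A minimal sketch with hand-made, hypothetical labels:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted quality classes.
y_true_demo = [5, 5, 6, 6, 7, 5, 6, 7]
y_pred_demo = [5, 6, 6, 6, 6, 5, 5, 7]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[5, 6, 7])
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(acc)
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

The confusion matrix shows which neighbouring quality classes get mixed up, which a single accuracy number hides.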


In [11]:
print(lr.intercept_)
print(lr.coef_)

[ 0.0866095   0.9895772  -0.50135554 -0.90056294]
[[ 3.18329411e-02  1.57471586e+00 -1.31988223e-01  2.49049995e-02
   2.10826329e-02 -6.14315260e-02 -8.01449313e-03  8.48187598e-02
   1.05961551e+00 -1.03741476e+00 -6.70047947e-01]
 [ 2.45141338e-03  1.47362980e+00  6.91947104e-01  1.83381297e-01
   2.00625793e-01 -2.65909747e-02  1.96150351e-02  9.82836683e-01
   1.77345917e+00 -3.27516417e+00 -7.72909872e-01]
 [ 3.76439913e-02 -7.26340488e-01 -1.09836176e+00 -8.44965648e-02
   6.21718604e-02  3.23013322e-02 -1.37305313e-02 -4.77491129e-01
  -6.44412730e-01  1.45365882e+00  2.43257311e-01]
 [-1.15867566e-01 -2.24119397e+00  6.57782653e-01 -1.77110984e-01
  -1.99250499e-01  2.75110096e-02 -2.88526538e-02 -9.02195549e-01
  -2.90820669e+00  3.13632430e+00  9.48216815e-01]]


## Regression Modeling with Ridge


In [12]:
from sklearn.linear_model import Ridge

rr=Ridge(solver='svd')

In [13]:
rr.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='svd', tol=0.001)

In [14]:
## score for ridge regression is the R2
rr.score(X_train, y_train)

0.393860363701923

In [15]:
## Other accuracy measures
print(rr.score(X_test, y_test))

preds = rr.predict(X_test)

score = explained_variance_score(y_test, preds)
mae = mean_absolute_error(y_test, preds)
rmse = math.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)
    
print("score = {:.5f} | MAE = {:.3f} | RMSE = {:.3f} | R2 = {:.5f}"
          .format(score, mae, rmse, r2))

0.3215933076695272
score = 0.32734 | MAE = 0.479 | RMSE = 0.609 | R2 = 0.32159
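
For a regressor such as `Ridge`, `score` is R2, which is why the first printed value matches the R2 in the second line. The sketch below recomputes MAE, RMSE, and R2 by hand on a few hypothetical values and compares them with the scikit-learn functions used above.

```python
import math

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true quality scores and model predictions.
y_true_demo = np.array([5.0, 6.0, 5.0, 7.0])
y_pred_demo = np.array([5.5, 5.5, 5.0, 6.5])

residuals = y_true_demo - y_pred_demo

mae_manual = np.mean(np.abs(residuals))           # mean absolute error
rmse_manual = math.sqrt(np.mean(residuals ** 2))  # root mean squared error

ss_res = np.sum(residuals ** 2)                             # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)    # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true_demo, y_pred_demo))
print(rmse_manual, math.sqrt(mean_squared_error(y_true_demo, y_pred_demo)))
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```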


### Now that we have two models, we can use some new wine samples to predict the results

In [16]:
wine1 = {   "fixed acidity": 4.6, 
           "volatile acidity": 0.52,
           "citric acid": 0.15,
           "residual sugar": 2.1,
           "chlorides": 0.054,
           "free sulfur dioxide": 8.0,
           "total sulfur dioxide": 65.0,
           "density": 0.9934,
           "pH": 3.9,
           "sulphates": 0.56,
           "alcohol": 13.1}

wine2 = {   "fixed acidity": 11.6, 
           "volatile acidity": 0.58,
           "citric acid": 0.66,
           "residual sugar": 2.2,
           "chlorides": 0.074,
           "free sulfur dioxide": 10,
           "total sulfur dioxide": 47,
           "density": 1.0008,
           "pH": 3.25,
           "sulphates":0.57,
           "alcohol": 9}

wine3 = {   "fixed acidity": 5, 
           "volatile acidity": 1.04,
           "citric acid": 0.24,
           "residual sugar": 1.6,
           "chlorides": 0.05,
           "free sulfur dioxide": 32,
           "total sulfur dioxide": 96,
           "density": 0.9934,
           "pH": 3.74,
           "sulphates": 0.62,
           "alcohol": 11.5}

wine4 = {   "fixed acidity": 14.3, 
           "volatile acidity": 0.31,
           "citric acid": 0.74,
           "residual sugar": 1.8,
           "chlorides": 0.075,
           "free sulfur dioxide": 6,
           "total sulfur dioxide": 15,
           "density": 1.0008,
           "pH": 2.86,
           "sulphates": 0.79,
           "alcohol": 8.4}

wine5 = {   "fixed acidity": 6.2, 
           "volatile acidity": 0.39,
           "citric acid": 0.43,
           "residual sugar": 2,
           "chlorides": 0.071,
           "free sulfur dioxide": 14,
           "total sulfur dioxide": 24,
           "density": 0.99428,
           "pH": 3.45,
           "sulphates": 0.87,
           "alcohol": 11.2}

wine6 = {   "fixed acidity": 9.4, 
           "volatile acidity": 0.3,
           "citric acid": 0.56,
           "residual sugar": 2.8,
           "chlorides": 0.08,
           "free sulfur dioxide": 6,
           "total sulfur dioxide": 17,
           "density": 0.9964,
           "pH": 3.15,
           "sulphates": 0.92,
           "alcohol": 11.7}


In [17]:
X_new = []                                    # X_new contains new data items 
for obs in [wine1, wine2, wine3, wine4, wine5, wine6]:
    new_obs = [obs["fixed acidity"], obs["volatile acidity"], obs["citric acid"]
               , obs["residual sugar"], obs["chlorides"], obs["free sulfur dioxide"], obs["total sulfur dioxide"]
               , obs["density"], obs["pH"], obs["sulphates"], obs["alcohol"]]
    X_new.append(new_obs)
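
The loop above relies on listing the keys in exactly the order the model was trained on. An alternative sketch, assuming the same 11 feature names, builds a DataFrame and reorders its columns explicitly, so the order of keys in each dictionary no longer matters. The names `feature_order` and `sample` are hypothetical.

```python
import pandas as pd

# Column order used for training, taken from the dataset header.
feature_order = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# Same values as wine1, with the keys deliberately shuffled.
sample = {
    "alcohol": 13.1, "pH": 3.9, "fixed acidity": 4.6, "density": 0.9934,
    "volatile acidity": 0.52, "sulphates": 0.56, "citric acid": 0.15,
    "total sulfur dioxide": 65.0, "residual sugar": 2.1,
    "free sulfur dioxide": 8.0, "chlorides": 0.054,
}

# Selecting with the list both orders the columns and fails loudly on a missing key.
X_new_df = pd.DataFrame([sample])[feature_order]
print(X_new_df)
```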

### Predicting Using Ridge

In [18]:
def convertWineRating(x):
    if(0 < x <= 4):
        return "Bad"
    elif(4 < x <= 6):
        return "Average"
    elif(6 < x <= 10):
        return "Good"

clean_wine_data['wine rating'] = clean_wine_data['quality'].apply(convertWineRating)


predicted_result = [convertWineRating(x) for x in rr.predict(X_new)]
predicted_result

['Average', 'Average', 'Average', 'Average', 'Good', 'Good']
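
The thresholds in `convertWineRating` are right-inclusive intervals: (0, 4], (4, 6], and (6, 10]. As a sketch of an equivalent vectorised form, `pandas.cut` applies the same bins in one call. The scores below are hypothetical, not model output.

```python
import pandas as pd

# Hypothetical predicted quality scores, including values on the bin edges.
scores = [3.0, 4.0, 5.4, 6.0, 6.2]

# Bins are right-inclusive by default, matching the comparisons in convertWineRating.
ratings = pd.cut(scores, bins=[0, 4, 6, 10], labels=["Bad", "Average", "Good"])
rating_list = list(ratings)
print(rating_list)
```

Values outside the bins come back as missing, which parallels the function returning `None` for a score outside the range 0 to 10.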

### Predicting Using Logistic Regression

In [19]:
predicted_result = [convertWineRating(x) for x in lr.predict(X_new)]
predicted_result

['Average', 'Average', 'Average', 'Average', 'Average', 'Good']

## Modeling with k-Means Clustering


In [20]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0)     # Create a new k-means clustering model with k set to 5

In [21]:
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)
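
The notebook fixes k at 5 without showing how that value was chosen. One common heuristic, sketched here on synthetic blobs rather than the wine data, is to fit several values of k and look at where the within-cluster sum of squares (`inertia_`) stops dropping sharply. `make_blobs` and the range of k are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with three groups.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 1))
```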

In [22]:
kmeans.cluster_centers_                          # Coordinates of the cluster centroids

array([[8.14866667e+00, 5.25166667e-01, 2.73466667e-01, 2.27466667e+00,
        8.04666667e-02, 2.25733333e+01, 6.81066667e+01, 9.97019133e-01,
        3.32540000e+00, 6.22266667e-01, 1.00446667e+01],
       [8.49369369e+00, 5.07582583e-01, 2.73513514e-01, 2.09324324e+00,
        7.64654655e-02, 6.69369369e+00, 1.62702703e+01, 9.96410360e-01,
        3.29900901e+00, 6.24444444e-01, 1.05345345e+01],
       [7.90162602e+00, 5.74552846e-01, 2.48536585e-01, 2.33780488e+00,
        8.38536585e-02, 2.14715447e+01, 9.83089431e+01, 9.96884228e-01,
        3.31357724e+00, 5.95853659e-01, 9.86598916e+00],
       [8.07212644e+00, 5.23735632e-01, 2.24425287e-01, 2.16666667e+00,
        7.81206897e-02, 1.27873563e+01, 3.20114943e+01, 9.96526810e-01,
        3.34129310e+00, 6.33275862e-01, 1.04204502e+01],
       [7.96133333e+00, 5.15400000e-01, 2.22933333e-01, 2.20800000e+00,
        7.83111111e-02, 2.22355556e+01, 4.87466667e+01, 9.96473956e-01,
        3.34226667e+00, 6.63600000e-01, 1.04402222e+

In [23]:
kmeans.labels_                                   # Cluster label assigned to each data item

array([3, 0, 4, ..., 4, 4, 4])

In [24]:
clean_wine_data["label"] = kmeans.labels_                    # Add a new column label with the clustering labels 

In [25]:
clean_wine_data.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,wine rating,label
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Average,3
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,Average,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,Average,4
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,Average,0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Average,3
5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,Average,3
6,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,5,Average,0
7,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7,Good,1
8,7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,7,Good,1
9,6.7,0.58,0.08,1.8,0.097,15.0,65.0,0.9959,3.28,0.54,9.2,5,Average,0


In [26]:
clean_wine_data.label.value_counts()                          # Count the number of values for each label 

3    348
1    333
4    225
0    150
2    123
Name: label, dtype: int64

In [27]:
clean_wine_data[clean_wine_data.label == 3].sample(n=10, replace=False, random_state=0)  # Select a random sample with 10 rows that have the label 3

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,wine rating,label
23,6.9,0.685,0.0,2.5,0.105,22.0,37.0,0.9966,3.46,0.57,10.6,6,Average,3
181,7.7,0.53,0.06,1.7,0.074,9.0,39.0,0.99615,3.35,0.48,9.8,6,Average,3
924,7.8,0.58,0.13,2.1,0.102,17.0,36.0,0.9944,3.24,0.53,11.2,6,Average,3
169,6.9,0.52,0.25,2.6,0.081,10.0,37.0,0.99685,3.46,0.5,11.0,5,Average,3
994,9.2,0.54,0.31,2.3,0.112,11.0,38.0,0.99699,3.24,0.56,10.9,5,Average,3
586,6.0,0.5,0.04,2.2,0.092,13.0,26.0,0.99647,3.46,0.47,10.0,5,Average,3
588,6.6,0.66,0.0,3.0,0.115,21.0,31.0,0.99629,3.45,0.63,10.3,5,Average,3
370,10.3,0.27,0.24,2.1,0.072,15.0,33.0,0.9956,3.22,0.66,12.8,6,Average,3
303,11.9,0.37,0.69,2.3,0.078,12.0,24.0,0.9958,3.0,0.65,12.8,6,Average,3
834,8.3,0.6,0.25,2.2,0.118,9.0,38.0,0.99616,3.15,0.53,9.8,5,Average,3


In [28]:
clean_wine_data.groupby("label").mean()

Unnamed: 0_level_0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,8.148667,0.525167,0.273467,2.274667,0.080467,22.573333,68.106667,0.997019,3.3254,0.622267,10.044667,5.453333
1,8.493694,0.507583,0.273514,2.093243,0.076465,6.693694,16.27027,0.99641,3.299009,0.624444,10.534535,5.702703
2,7.901626,0.574553,0.248537,2.337805,0.083854,21.471545,98.308943,0.996884,3.313577,0.595854,9.865989,5.260163
3,8.072126,0.523736,0.224425,2.166667,0.078121,12.787356,32.011494,0.996527,3.341293,0.633276,10.42045,5.695402
4,7.961333,0.5154,0.222933,2.208,0.078311,22.235556,48.746667,0.996474,3.342267,0.6636,10.440222,5.706667


Conclusion Note: From the table above, the mean quality is highest, at about 5.7, for labels 1, 3, and 4, compared with about 5.45 for label 0 and 5.26 for label 2.
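
The cluster means differ mostly in the two sulfur dioxide columns, which have by far the largest numeric range. k-means uses Euclidean distance, so unscaled features with large ranges tend to dominate the grouping. The sketch below illustrates that effect on hypothetical two-column data and shows one possible remedy, standardising before clustering; it is not a rerun of the analysis above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: a small-range column that separates two groups,
# and a large-range column that is pure noise.
small = np.concatenate([rng.normal(0.0, 0.05, 100), rng.normal(1.0, 0.05, 100)])
large = rng.normal(50.0, 25.0, 200)
X_demo = np.column_stack([small, large])
true_group = np.repeat([0, 1], 100)


def agreement(labels, truth):
    """Share of points matching the true groups, ignoring label swapping."""
    match = np.mean(labels == truth)
    return max(match, 1.0 - match)


raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
scaled_labels = make_pipeline(
    StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0)
).fit_predict(X_demo)

raw_agreement = agreement(raw_labels, true_group)
scaled_agreement = agreement(scaled_labels, true_group)
print(raw_agreement, scaled_agreement)
```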

## Creating a pkl file using joblib

In [29]:
import joblib

with open("./lib/models/logisticRegressionModel.pkl", "wb") as fwb:
    joblib.dump(lr, fwb)
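
A saved model is read back with `joblib.load`. The sketch below shows a full save and load round trip on a tiny hypothetical model in a temporary directory; the path and file name `demo_model.pkl` are placeholders, and the target folder is created first because writing fails if it does not exist.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set, just to have a fitted model to save.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "models", "demo_model.pkl")
    os.makedirs(os.path.dirname(path), exist_ok=True)  # make sure the folder exists

    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back

# The restored model should predict exactly like the original.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

Pickled models are generally tied to the scikit-learn version that wrote them, so the loading environment should match the one used for training.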