# Red Wine Quality Prediction Project


## Project Description
The dataset is related to red and white variants of the Portuguese "Vinho Verde" wine; this project uses the red wine file. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

This dataset can be viewed as a classification task. The classes are ordered and imbalanced (e.g. there are many more normal wines than excellent or poor ones). We are also not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.

## Attribute Information
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
An interesting option is to set an arbitrary cutoff for the dependent variable (wine quality), e.g. scores of 7 or higher are classified as 'good/1' and the remainder as 'not good/0'.
This allows you to practice hyperparameter tuning on e.g. decision tree algorithms, looking at the ROC curve and the AUC value.
You need to build a classification model.
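
As a rough illustration of the tuning exercise described above, the sketch below runs a small grid search over a decision tree and ranks candidates by ROC AUC. The data is a synthetic stand-in generated with `make_classification` (not the wine data), and the parameter grid values are hypothetical choices made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data: 11 features, roughly 14% positives.
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    weights=[0.86, 0.14], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Hypothetical search grid; the values are illustrative, not tuned.
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # rank candidates by area under the ROC curve
    cv=5,
)
search.fit(X_train, y_train)

# AUC needs scores rather than hard labels, so use the positive-class probability.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best parameters:", search.best_params_)
print("Test AUC:", round(test_auc, 3))
```

Because the search only ever sees the training split, the test AUC stays an independent check on the selected tree.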
## Inspiration
Use machine learning to determine which physicochemical properties make a wine 'good'!
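
One possible way to approach this question is to rank the inputs by a tree ensemble's impurity-based feature importances. The sketch below uses the eleven attribute names listed above but fills them with synthetic placeholder values, so the ranking it prints says nothing about real wine; the names `X_demo`, `y_demo` and `model` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# Synthetic placeholder values; swap in the real features and labels
# to get a meaningful ranking.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=len(feature_names), n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=feature_names)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; larger means the trees split on it more usefully.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```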

## Dataset Link
https://github.com/FlipRoboTechnologies/ML-Datasets/blob/main/Red%20Wine/winequality-red.csv


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv("C://Users//rahul//OneDrive//Desktop//data trained//internship//Sample project//winequality-red.csv")
df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [4]:
df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [6]:
df.shape

(1599, 12)

In [7]:
df.nunique()

fixed acidity            96
volatile acidity        143
citric acid              80
residual sugar           91
chlorides               153
free sulfur dioxide      60
total sulfur dioxide    144
density                 436
pH                       89
sulphates                96
alcohol                  65
quality                   6
dtype: int64

In [8]:
df.quality.value_counts()

5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64

In [9]:
# Scores 3-6 become class 0 ('not good')
df['quality'] = df['quality'].replace([3,4,5,6],0)

In [10]:
df.quality.value_counts()

0    1382
7     199
8      18
Name: quality, dtype: int64

In [11]:
# Scores 7-8 become class 1 ('good')
df['quality'] = df['quality'].replace([7,8],1)

In [12]:
df.quality.value_counts()

0    1382
1     217
Name: quality, dtype: int64
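
The same binary label can also be built in one step by comparing against the cutoff, which keeps the threshold of 7 visible in the code. This is a minimal sketch on a small hypothetical sample of scores, not the real dataset; the column name `good` is made up for the example.

```python
import pandas as pd

# Small hypothetical sample of quality scores, not the real dataset.
demo = pd.DataFrame({"quality": [3, 4, 5, 6, 7, 8, 5, 7]})

# Scores of 7 or higher become 1 ('good'); everything else becomes 0.
demo["good"] = (demo["quality"] >= 7).astype(int)

print(demo["good"].value_counts())
print("Share of good wines:", demo["good"].mean())
```

On the counts shown above, the positive class is 217 of 1599 wines (about 13.6%), so accuracy alone will look high even for a model that rarely predicts 'good'.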

In [19]:
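# Features: every column except the target
x = df.drop('quality', axis=1)
# Target: binary quality label (1 = good, 0 = not good)
y = df['quality']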

In [20]:
x

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4
...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2


In [21]:
y

0       0
1       0
2       0
3       0
4       0
       ..
1594    0
1595    0
1596    0
1597    0
1598    0
Name: quality, Length: 1599, dtype: int64

In [23]:
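from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Try random_state values 1..199 for the 70/30 split and keep the one
# that gives the highest test accuracy for a default random forest.
# Note: the seed is chosen using test-set accuracy, so maxAcc is an
# optimistic estimate of how the model generalizes.
maxAcc = 0
maxRs = 0
for i in range(1, 200):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=i)
    rfc = RandomForestClassifier()
    rfc.fit(x_train, y_train)
    pred = rfc.predict(x_test)
    acc = accuracy_score(y_test, pred)
    if acc > maxAcc:
        maxAcc = acc
        maxRs = i

print("Max Accuracy is", maxAcc, "at random state", maxRs)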


Max Accuracy is 0.93125 at random state 132


In [24]:
x_train,x_test, y_train,y_test = train_test_split(x,y, test_size=0.30, random_state=maxRs)
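
An alternative to searching for the best seed is to fix one arbitrary seed and stratify the split, so that train and test keep the same share of good wines. The sketch below is only an illustration: it uses stand-in arrays with the class counts reported earlier (1382 zeros, 217 ones), and the seed 42 and the names `X_demo`, `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the class counts reported earlier (1382 zeros, 217 ones).
y_demo = np.array([0] * 1382 + [1] * 217)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio the same in both splits;
# the seed is fixed up front instead of being picked by test accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=42
)

print("Test rows:", len(y_te))
print("Positive share, train:", round(y_tr.mean(), 4))
print("Positive share, test: ", round(y_te.mean(), 4))
```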

In [27]:
from sklearn.tree import ExtraTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, accuracy_score
from sklearn.model_selection import cross_val_score

In [31]:
# Random Forest
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
pred = rfc.predict(x_test)
print("Accuracy score", accuracy_score(y_test, pred))
print("Confusion Matrix", confusion_matrix(y_test, pred))
print("Classification Report", classification_report(y_test, pred))

Accuracy score 0.9270833333333334
Confusion Matrix [[411  11]
 [ 24  34]]
Classification Report               precision    recall  f1-score   support

           0       0.94      0.97      0.96       422
           1       0.76      0.59      0.66        58

    accuracy                           0.93       480
   macro avg       0.85      0.78      0.81       480
weighted avg       0.92      0.93      0.92       480
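
The class 1 figures in this report can be reproduced by hand from the confusion matrix, which helps when reading the later reports. The sketch below simply redoes that arithmetic with the four counts printed above; the variable names are chosen for the example.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions.
tn, fp = 411, 11
fn, tp = 24, 34

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the wines predicted good, how many really are
recall = tp / (tp + fn)     # of the good wines, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
# prints: 0.9271 0.76 0.59 0.66
```

The model is right about 93% of the time overall, yet it finds only 34 of the 58 good wines in the test split.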



In [32]:
# Logistic Regression
lr = LogisticRegression()
lr.fit(x_train, y_train)
pred = lr.predict(x_test)
print("Accuracy score", accuracy_score(y_test, pred))
print("Confusion Matrix", confusion_matrix(y_test, pred))
print("Classification Report", classification_report(y_test, pred))

Accuracy score 0.8833333333333333
Confusion Matrix [[410  12]
 [ 44  14]]
Classification Report               precision    recall  f1-score   support

           0       0.90      0.97      0.94       422
           1       0.54      0.24      0.33        58

    accuracy                           0.88       480
   macro avg       0.72      0.61      0.63       480
weighted avg       0.86      0.88      0.86       480



In [33]:
# ExtraTreeClassifier (a single extremely randomized tree from sklearn.tree, not the ExtraTreesClassifier ensemble)
etc = ExtraTreeClassifier()
etc.fit(x_train, y_train)
pred = etc.predict(x_test)
print("Accuracy score", accuracy_score(y_test, pred))
print("Confusion Matrix", confusion_matrix(y_test, pred))
print("Classification Report", classification_report(y_test, pred))

Accuracy score 0.8916666666666667
Confusion Matrix [[387  35]
 [ 17  41]]
Classification Report               precision    recall  f1-score   support

           0       0.96      0.92      0.94       422
           1       0.54      0.71      0.61        58

    accuracy                           0.89       480
   macro avg       0.75      0.81      0.77       480
weighted avg       0.91      0.89      0.90       480



In [34]:
# SVC
svc = SVC()
svc.fit(x_train, y_train)
pred = svc.predict(x_test)
print("Accuracy score", accuracy_score(y_test, pred))
print("Confusion Matrix", confusion_matrix(y_test, pred))
print("Classification Report", classification_report(y_test, pred))

Accuracy score 0.88125
Confusion Matrix [[422   0]
 [ 57   1]]
Classification Report               precision    recall  f1-score   support

           0       0.88      1.00      0.94       422
           1       1.00      0.02      0.03        58

    accuracy                           0.88       480
   macro avg       0.94      0.51      0.49       480
weighted avg       0.90      0.88      0.83       480
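
The SVC labels only one test wine as good, so its 0.88125 accuracy is a single correct prediction above the 422/480 (about 0.879) obtained by always predicting 'not good'. One thing that may be worth trying is standardizing the inputs first, since the default RBF kernel is sensitive to feature scale. The sketch below shows the pipeline shape on synthetic stand-in data with two deliberately rescaled columns; it does not show that scaling helps on the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; two columns are rescaled to mimic
# features that live on very different ranges.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    weights=[0.86, 0.14], random_state=0,
)
X_demo[:, 0] *= 100.0
X_demo[:, 1] *= 0.001

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
scaled_acc = scaled_svc.score(X_te, y_te)
print("Test accuracy with scaling:", round(scaled_acc, 3))
```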



In [35]:
# Gradient Boosting
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
pred = gbc.predict(x_test)
print("Accuracy score", accuracy_score(y_test, pred))
print("Confusion Matrix", confusion_matrix(y_test, pred))
print("Classification Report", classification_report(y_test, pred))

Accuracy score 0.9041666666666667
Confusion Matrix [[401  21]
 [ 25  33]]
Classification Report               precision    recall  f1-score   support

           0       0.94      0.95      0.95       422
           1       0.61      0.57      0.59        58

    accuracy                           0.90       480
   macro avg       0.78      0.76      0.77       480
weighted avg       0.90      0.90      0.90       480



In [36]:
# AdaBoost
abc = AdaBoostClassifier()
abc.fit(x_train, y_train)
pred = abc.predict(x_test)
print("Accuracy score", accuracy_score(y_test, pred))
print("Confusion Matrix", confusion_matrix(y_test, pred))
print("Classification Report", classification_report(y_test, pred))

Accuracy score 0.8645833333333334
Confusion Matrix [[389  33]
 [ 32  26]]
Classification Report               precision    recall  f1-score   support

           0       0.92      0.92      0.92       422
           1       0.44      0.45      0.44        58

    accuracy                           0.86       480
   macro avg       0.68      0.69      0.68       480
weighted avg       0.87      0.86      0.87       480



In [37]:
# Bagging Classifier
bc = BaggingClassifier()
bc.fit(x_train, y_train)
pred = bc.predict(x_test)
print("Accuracy score", accuracy_score(y_test, pred))
print("Confusion Matrix", confusion_matrix(y_test, pred))
print("Classification Report", classification_report(y_test, pred))

Accuracy score 0.9166666666666666
Confusion Matrix [[408  14]
 [ 26  32]]
Classification Report               precision    recall  f1-score   support

           0       0.94      0.97      0.95       422
           1       0.70      0.55      0.62        58

    accuracy                           0.92       480
   macro avg       0.82      0.76      0.78       480
weighted avg       0.91      0.92      0.91       480
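
The cells above repeat the same fit, predict and report steps for each model. A loop over a dictionary of estimators is one way to cut the repetition and gather the scores in one place. The sketch below runs on synthetic stand-in data with three of the models; the dictionary keys, the `results` structure and the fixed seeds are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine features and binary label.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    weights=[0.86, 0.14], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, y_pred),
        # F1 for the positive ('good') class; 0 if the model never predicts it.
        "f1_good": f1_score(y_te, y_pred, zero_division=0),
    }

for name, scores in results.items():
    print(f"{name:20s} accuracy={scores['accuracy']:.3f}  f1(good)={scores['f1_good']:.3f}")
```

Reporting F1 for the 'good' class next to accuracy makes it easier to spot a model that scores well only by predicting the majority class.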



In [None]:
# cross validation score 

In [40]:
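# RFC
score = cross_val_score(rfc, x, y)
print(score)
print(score.mean())
# `pred` still holds the BaggingClassifier predictions from the previous cell,
# so recompute the random forest's own test predictions before comparing.
rfc_pred = rfc.predict(x_test)
print('Difference between Accuracy score and cross validation score is', round(accuracy_score(y_test, rfc_pred) - score.mean(), 4))

[0.878125   0.865625   0.878125   0.853125   0.87460815]
0.8699216300940439
Difference between Accuracy score and cross validation score is 0.0572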
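
Part of the gap between the held-out accuracy and the cross-validation mean may come from how each number was produced: the held-out split was the best of 199 seeds, while `cross_val_score` with its default settings uses five stratified folds without shuffling. A sketch of a shuffled, stratified cross-validation that also reports F1 and ROC AUC is given below. It runs on synthetic stand-in data, and the seeds and variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the wine features and binary label.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    weights=[0.86, 0.14], random_state=0,
)

# Shuffled stratified folds: each fold keeps the class ratio of the full data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_demo, y_demo, cv=cv,
    scoring=["accuracy", "f1", "roc_auc"],
)

for metric in ("accuracy", "f1", "roc_auc"):
    fold_scores = cv_scores[f"test_{metric}"]
    print(metric, "mean:", round(float(fold_scores.mean()), 3),
          "std:", round(float(fold_scores.std()), 3))
```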
