## CLASSIFY THE QUALITY OF WINE

Import the required libraries

In [63]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

Read the Data
Downloaded from https://archive.ics.uci.edu/ml/datasets/Wine+Quality

In [19]:
df = pd.read_csv('winequality/winequality-white.csv',sep=';',quotechar='"')

In [20]:
print(df.shape)

(4898, 12)


In [21]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

The data set has twelve columns. **quality** is the column of interest, since the goal is to predict the quality of the wine; the other eleven columns are the features. Let's check the values that **quality** takes.

In [22]:
df['quality'].unique()

array([6, 5, 7, 8, 4, 3, 9], dtype=int64)

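**Analysis:** The observed quality scores range from 3 to 9. According to the data set description, quality is scored between 0 and 10, so the values 0, 1, 2, and 10 never occur in this data.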

In [23]:
df['quality'].describe()

count    4898.000000
mean        5.877909
std         0.885639
min         3.000000
25%         5.000000
50%         6.000000
75%         6.000000
max         9.000000
Name: quality, dtype: float64

In [24]:
df['quality'].value_counts()

6    2198
5    1457
7     880
8     175
4     163
3      20
9       5
Name: quality, dtype: int64

Quality 6 is the most common score (2198 of 4898 wines), followed by 5 (1457) and 7 (880). The extreme scores 3 and 9 are rare.

### Solve as a Classification Problem
Let's reduce the problem to a binary classification: high quality vs. not high quality.
We assume that a wine with a quality score greater than or equal to 6 is high quality.

In [25]:
def isHighQuality(quality):
    # Label a wine as high quality (1) when its score is 6 or above, else 0
    return 1 if quality >= 6 else 0

In [26]:
df['tasty'] = df['quality'].apply(isHighQuality)

Check that the new column **tasty**, which holds the binary high-quality label, has been added.

In [27]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'tasty'],
      dtype='object')

In [28]:
df['tasty'].value_counts()

1    3258
0    1640
Name: tasty, dtype: int64
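The labeling step can also be written as a vectorized comparison instead of `apply`. The sketch below is only an illustration: it rebuilds a stand-in `quality` column from the counts printed earlier (not the real rows of the CSV) and shows the share of the majority class, which is the accuracy a classifier would reach by always predicting "high quality".

```python
import pandas as pd

# Quality counts copied from the value_counts() output above.
quality_counts = {6: 2198, 5: 1457, 7: 880, 8: 175, 4: 163, 3: 20, 9: 5}

# Stand-in quality column with the same distribution (not the real rows).
quality = pd.Series(
    [score for score, count in quality_counts.items() for _ in range(count)],
    name="quality",
)

# Vectorized equivalent of df['quality'].apply(isHighQuality).
tasty = (quality >= 6).astype(int)

class_counts = tasty.value_counts()
majority_share = class_counts.max() / len(tasty)

print(class_counts.to_dict())
print(round(majority_share, 4))
```

About two thirds of the wines fall into the high-quality class, so any model should be judged against that baseline rather than against 50%.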

In [43]:
target = df['tasty']
data = df.drop(['tasty','quality'],axis=1)
data.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')

**Split Training and Testing Set**

In [44]:
data_train, data_test, target_train, target_test = train_test_split(data,target,test_size = 0.33,random_state=123)

In [45]:
[subset.shape for subset in [data_train,data_test,target_train,target_test]]

[(3281, 11), (1617, 11), (3281,), (1617,)]

The shapes confirm the split: 3281 training rows and 1617 test rows, each with 11 features.
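Because the two classes are imbalanced, one option (not used in this notebook) is to pass `stratify=target` to `train_test_split` so that both subsets keep the same class ratio. A minimal sketch, assuming placeholder features: the random `data` array below only mimics the shape and class counts of the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same class counts as the 'tasty' column
# (3258 ones, 1640 zeros). The features are random placeholders.
rng = np.random.default_rng(0)
target = np.array([1] * 3258 + [0] * 1640)
data = rng.normal(size=(len(target), 11))

# stratify=target keeps the class ratio nearly identical in both subsets.
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.33, random_state=123, stratify=target
)

overall_share = target.mean()
train_share = target_train.mean()
test_share = target_test.mean()

print(data_train.shape, data_test.shape)
print(round(overall_share, 3), round(train_share, 3), round(test_share, 3))
```

With stratification the share of high-quality wines in the training and test sets differs from the overall share only by rounding.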

## Training our Classifiers

#### Simple Tree

In [54]:
simpleTree = DecisionTreeClassifier(max_depth=10)
simpleTree.fit(data_train,target_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

*Evaluate:*

In [55]:
simpleTreePerformance = precision_recall_fscore_support(target_test,simpleTree.predict(data_test))
simpleTreePerformance

(array([ 0.62852113,  0.84080076]),
 array([ 0.68129771,  0.80695334]),
 array([ 0.65384615,  0.82352941]),
 array([ 524, 1093], dtype=int64))

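*The returned tuple holds, in order, precision, recall, F-score, and support. Each array has one entry per class: class 0 first, then class 1.*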

#### GradientBoostingClassifier

In [56]:
gbmTree = GradientBoostingClassifier(max_depth=10)
gbmTree.fit(data_train,target_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=10,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)

In [57]:
gbmTreePerformance = precision_recall_fscore_support(target_test,gbmTree.predict(data_test))
gbmTreePerformance

(array([ 0.71760155,  0.86090909]),
 array([ 0.70801527,  0.86642269]),
 array([ 0.71277618,  0.86365709]),
 array([ 524, 1093], dtype=int64))

#### RandomForest

In [58]:
rfTree = RandomForestClassifier(max_depth=10)
rfTree.fit(data_train,target_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [59]:
rfTreePerformance = precision_recall_fscore_support(target_test,rfTree.predict(data_test))
rfTreePerformance

(array([ 0.70235546,  0.82956522]),
 array([ 0.6259542 ,  0.87282708]),
 array([ 0.66195762,  0.85064646]),
 array([ 524, 1093], dtype=int64))
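The three fit-and-evaluate cells follow the same pattern, so they can be folded into one loop. The sketch below is an illustration on synthetic data from `make_classification` (an assumption, standing in for the wine features), and it fixes `random_state` on every model so that reruns give the same numbers. Its scores are not comparable to the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data: 11 features, binary target.
X, y = make_classification(n_samples=600, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=123
)

# Same three model families as above, with a fixed random_state.
models = {
    "Simple Tree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "Gradient Boosted": GradientBoostingClassifier(max_depth=10, random_state=0),
    "Random Forest": RandomForestClassifier(max_depth=10, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    precision, recall, fscore, support = precision_recall_fscore_support(
        y_test, model.predict(X_test)
    )
    results[name] = {
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
        "support": support,
    }
    print(name, "F-score per class:", fscore.round(3))
```

Keeping the results in one dictionary makes it easy to compare the models side by side.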

**Analysis:** *precision_recall_fscore_support* computes the precision, recall, F-score, and support for each class.

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.

The support is the number of occurrences of each class in y_true.
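The F-beta definition can be checked by hand. The sketch below uses a small helper, `f_beta` (a hypothetical name, not part of scikit-learn), and applies it to the Random Forest precision and recall for class 1 copied from the output above.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention chosen for this sketch to avoid dividing by zero
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Random Forest precision and recall for class 1, copied from the output above.
rf_precision, rf_recall = 0.82956522, 0.87282708

f1 = f_beta(rf_precision, rf_recall)            # beta = 1: equal weight
f2 = f_beta(rf_precision, rf_recall, beta=2.0)  # beta = 2: recall counts more

print(round(f1, 4), round(f2, 4))
```

With beta = 1 the helper reproduces the F-score reported for that class (0.8506). With beta = 2 the score moves toward the recall value.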

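As seen, the **Gradient Boosting Classifier** has the highest F-score for both classes (0.71 and 0.86) on the current data set, and also the highest accuracy (about 0.815, computed from the confusion matrix below). **Random Forest** has the highest recall for the high-quality class (0.87), but the lowest recall for the other class (0.63). Because no `random_state` was set on the models, these numbers may vary slightly between runs.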

**Confusion Matrix:**

In [60]:
print('Confusion Matrix for simple, gradient boosted, and random forest tree classifiers:')
print('Simple Tree:\n',confusion_matrix(target_test,simpleTree.predict(data_test)),'\n')
print('Gradient Boosted:\n',confusion_matrix(target_test,gbmTree.predict(data_test)),'\n')
print('Random Forest:\n',confusion_matrix(target_test,rfTree.predict(data_test)))

Confusion Matrix for simple, gradient boosted, and random forest tree classifiers:
Simple Tree:
 [[357 167]
 [211 882]] 

Gradient Boosted:
 [[371 153]
 [146 947]] 

Random Forest:
 [[328 196]
 [139 954]]


Each confusion matrix is laid out as [[tn, fp], [fn, tp]]: rows are the true classes and columns are the predicted classes.
For the Random Forest, 328 wines are correctly classified as negative (true negatives) and 954 are correctly classified as positive (true positives). A further 196 are false positives and 139 are false negatives.
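The counts in the confusion matrices are enough to recompute the earlier metrics. The sketch below copies the three matrices from the output above and derives accuracy, plus precision and recall for the positive class; `summarize` is a helper name introduced here for illustration.

```python
# Confusion matrices copied from the output above: [[tn, fp], [fn, tp]].
confusion_matrices = {
    "Simple Tree": [[357, 167], [211, 882]],
    "Gradient Boosted": [[371, 153], [146, 947]],
    "Random Forest": [[328, 196], [139, 954]],
}


def summarize(matrix):
    """Return accuracy plus precision and recall for the positive class."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


summaries = {name: summarize(m) for name, m in confusion_matrices.items()}
for name, summary in summaries.items():
    print(name, {key: round(value, 4) for key, value in summary.items()})
```

The precision and recall values agree with the class 1 entries reported by `precision_recall_fscore_support`, which is a useful consistency check on how the matrix is read.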

**Feature Importances:** Let's find out which features matter most. A fitted GBM exposes the relative contribution of each feature to the classification through `gbmTree.feature_importances_`. In scikit-learn, the other two tree-based models used here offer the same attribute.


In [62]:
print('Feature Importances for GBM tree\n')
# data.columns holds the feature names in the order the model was trained on
for feature, importance in zip(data.columns, gbmTree.feature_importances_):
    print('{}: {}'.format(feature, importance))

Feature Importances for GBM tree

fixed acidity: 0.06537100626301552
volatile acidity: 0.09430056501264585
citric acid: 0.07374152495759295
residual sugar: 0.095929571417197
chlorides: 0.08772752400086599
free sulfur dioxide: 0.09679364229496468
total sulfur dioxide: 0.10444083880283374
density: 0.10477658658102623
pH: 0.09712890877355783
sulphates: 0.08036418770166355
alcohol: 0.09942564419463686
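The list is easier to read when sorted. The sketch below copies the importances from the output above, rounded to four decimals, and ranks them from most to least important.

```python
# Importances copied from the output above, rounded to four decimals.
importances = {
    "fixed acidity": 0.0654,
    "volatile acidity": 0.0943,
    "citric acid": 0.0737,
    "residual sugar": 0.0959,
    "chlorides": 0.0877,
    "free sulfur dioxide": 0.0968,
    "total sulfur dioxide": 0.1044,
    "density": 0.1048,
    "pH": 0.0971,
    "sulphates": 0.0804,
    "alcohol": 0.0994,
}

# Sort by importance, largest first.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

for feature, importance in ranked:
    print(f"{feature:<22}{importance:.4f}")
```

Density, total sulfur dioxide, and alcohol come out on top, but the spread is narrow: every value lies between roughly 0.065 and 0.105, close to the 1/11 that a perfectly even split would give, so no single feature dominates in this run.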


Inspired by https://www.leozqin.me/using-machine-learning-to-classify-the-quality-of-wine/