In [0]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

In [0]:
df = pd.read_csv('winequality-red.csv', sep=';')

In [0]:
df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [0]:
print(df.shape)
df.describe()

(1599, 12)


,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [0]:
df['quality'].unique()

array([5, 6, 7, 4, 8, 3], dtype=int64)

In [0]:
df['quality'].value_counts()

5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64

We treat this as a classification problem and use a decision tree to learn a model that predicts red wine quality from the other attributes. Since the goal is to predict wine quality, the attribute "quality" becomes our label and the remaining attributes become the features.

The target variable "quality" ranges from 3 to 8. Most observations fall in class 5 (681), closely followed by class 6 (638), while classes 3, 4 and 8 are rare.
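To make the class imbalance concrete, the sketch below turns the counts printed by value_counts() into proportions. The counts are copied from the output above; the "majority-class baseline" is an illustrative reference point computed on the full dataset, not something the notebook itself evaluates.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10}, name="quality")

# Share of each class in the full dataset.
proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the most frequent class
# (measured on the full dataset, as a rough reference point only).
majority_baseline = proportions.max()
print("majority-class baseline: {:.3f}".format(majority_baseline))
```

Classes 5 and 6 together cover a little over 80% of the rows, so any accuracy figure is best read against this baseline.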

In [0]:
X = df.drop('quality', axis = 1)

In [0]:
Y = df.quality

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state=42)

We store the label "quality" in Y, the name commonly used for labels in machine learning, and the features in X. Next we split the dataset into training and test data. Training data is the data from which the model learns. Test data is held back during training; its true labels are known (as they are for the training data), so we can compare them with the model's predictions to measure how well the learning generalised. We use the training data to fit the model that predicts quality, and we set aside 25% of the original dataset (test_size = 0.25) for testing.
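As a quick check of what test_size = 0.25 means for 1599 rows, the sketch below splits placeholder arrays of the same length. X_demo and y_demo are made-up stand-ins, not the wine data. The second call uses the stratify argument, which the notebook does not use; it is shown only as an option that keeps class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the wine data (1599).
# The values are placeholders; only the shapes matter here.
X_demo = np.zeros((1599, 11))
y_demo = np.arange(1599) % 6

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)  # (1199, 11) (400, 11)

# Optional variant (an assumption, not part of the notebook): stratified split.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
print(np.unique(y_te_s, return_counts=True))
```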

In [0]:
from sklearn import preprocessing
X_train_scaled = preprocessing.scale(X_train)

In [0]:
print(X_train_scaled)

[[ 1.97418149 -0.23260309  1.11458849 ..., -0.78641859 -1.3131938
  -1.15257747]
 [ 0.28189382  0.37802632  0.09088663 ...,  0.3161036  -0.97064635
  -1.24703683]
 [-0.71013687  0.32251456 -1.39348108 ...,  0.70522908 -0.62809889
   1.01998773]
 ..., 
 [-0.65178213  0.48904985 -1.08637052 ...,  1.28891729 -0.68519014
  -0.8691994 ]
 [-0.2432989  -1.84244427  0.39799719 ...,  0.05668661  0.79918216
   1.39782516]
 [-1.46874859 -1.34283839 -0.06266865 ...,  0.51066634 -0.68519014
   2.90917487]]


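In this step we use preprocessing.scale() from the scikit-learn library to standardise the features, so that each column has zero mean and unit variance. Preprocessing brings the data into a form better suited to modelling; here it is motivated by the fact that the ranges of our features vary widely (compare total sulfur dioxide with density). Note that the classifier below is fitted on the unscaled X_train, so X_train_scaled is not used further; decision trees split on thresholds of individual features and are generally not sensitive to this kind of rescaling.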
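A small sketch of what preprocessing.scale() computes, using a made-up matrix (X_small is hypothetical, not taken from the wine data): each column is shifted to zero mean and divided by its standard deviation.

```python
import numpy as np
from sklearn import preprocessing

# Small made-up matrix whose two columns have very different ranges.
X_small = np.array([[1.0, 100.0],
                    [2.0, 300.0],
                    [3.0, 500.0]])

X_small_scaled = preprocessing.scale(X_small)

# Each column is centred on 0 with unit (population) standard deviation.
print(X_small_scaled.mean(axis=0))
print(X_small_scaled.std(axis=0))

# Same result computed by hand: (x - column mean) / column std.
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)
print(np.allclose(X_small_scaled, manual))  # True
```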

In [0]:
clf = DecisionTreeClassifier(criterion= 'entropy', max_depth=5)

We use the entropy criterion, a common choice for decision trees. Entropy measures the impurity of a node and is used to judge the quality of a split: the lower the entropy of the resulting child nodes, the better the split. We want each final node (leaf) of the tree to be as homogeneous as possible. We also cap max_depth (the maximum depth of the tree) at 5 to lower the risk of overfitting.
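To illustrate the impurity measure, here is a minimal sketch of Shannon entropy computed from per-class sample counts. The helper entropy() is hypothetical and written for illustration; it is not scikit-learn's internal implementation. The first call uses the class counts of the full dataset from value_counts() above.

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy (in bits) of a node, given its per-class sample counts."""
    counts = np.asarray(class_counts, dtype=float)
    # Drop empty classes: they contribute nothing to the sum.
    probs = counts[counts > 0] / counts.sum()
    return float((probs * np.log2(1.0 / probs)).sum())

# Class counts of the full dataset, copied from value_counts() above.
root_counts = [681, 638, 199, 53, 18, 10]
print(entropy(root_counts))

# Reference points: a 50/50 node is maximally impure for two classes,
# a node containing a single class is perfectly pure.
print(entropy([50, 50]))   # 1.0
print(entropy([100, 0]))   # 0.0
```

A split is considered good when the weighted entropy of its child nodes is clearly lower than the entropy of the parent.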

In [0]:
clf.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
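The effect of max_depth on overfitting can be sketched by comparing training and test accuracy at several depths. The example below uses synthetic data from make_classification as a stand-in (not the wine dataset), and the sizes and depths are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the wine dataset); sizes are arbitrary.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    n_classes=3, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

scores = {}
for depth in (2, 5, None):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=0)
    model.fit(X_a, y_a)
    # Store (training accuracy, test accuracy) for this depth.
    scores[depth] = (model.score(X_a, y_a), model.score(X_b, y_b))
    print(depth, scores[depth])
```

An unrestricted tree typically reaches perfect training accuracy while its test accuracy lags behind; that gap is the overfitting a depth limit is meant to reduce.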

In [0]:
y_pred = clf.predict(X_test)

In [0]:
from sklearn.metrics import accuracy_score

In [0]:
accuracy_score(Y_test, y_pred)

0.59499999999999997
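For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below shows this on tiny made-up label arrays (y_true_demo and y_pred_demo are hypothetical) and then converts the score above into a count, assuming the test split holds 400 rows (25% of 1599, rounded up).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Tiny made-up labels, only to show how the metric is computed.
y_true_demo = np.array([5, 6, 5, 7, 6])
y_pred_demo = np.array([5, 5, 5, 7, 6])

# Fraction of positions where prediction and truth agree.
manual_accuracy = (y_true_demo == y_pred_demo).mean()
print(manual_accuracy, accuracy_score(y_true_demo, y_pred_demo))  # 0.8 0.8

# Assuming a 400-row test split, 0.595 corresponds to this many correct predictions.
correct = round(0.595 * 400)
print(correct)  # 238
```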

In [0]:
import graphviz
from sklearn import tree
# Export the fitted tree in DOT format and render it inline
dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph
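If Graphviz is not installed, a fitted tree can also be inspected as plain text. The sketch below assumes a scikit-learn version that provides sklearn.tree.export_text (0.21 or later, which is newer than the version implied by the output above) and uses the bundled iris data with a small tree as a stand-in for clf.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in model: a shallow tree on the bundled iris data (not the wine data).
iris = load_iris()
small_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Render the decision rules as indented text, one line per node.
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Passing feature names makes each split readable; the same idea applies to export_graphviz through its feature_names argument.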

In [0]:
prominent_features = clf.feature_importances_

In [0]:
for feature, importance in zip(X.columns, prominent_features):
    print('{}: {}'.format(feature, importance))

fixed acidity: 0.019992387829022358
volatile acidity: 0.0842913895464894
citric acid: 0.03179695360040903
residual sugar: 0.05119015487745672
chlorides: 0.0
free sulfur dioxide: 0.0
total sulfur dioxide: 0.10089661972647447
density: 0.0231461198310514
pH: 0.020605373644905785
sulphates: 0.21877365334788262
alcohol: 0.44930734759630814


We can observe that alcohol content and sulphates play the two largest roles in the classifier's decisions.
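Sorting the importances makes this ranking easier to read. The values below are copied from the output above and rounded to four decimals; the variable names are chosen for illustration.

```python
import pandas as pd

# Importances copied from the output above, rounded to four decimals.
importances = pd.Series({
    'fixed acidity': 0.0200,
    'volatile acidity': 0.0843,
    'citric acid': 0.0318,
    'residual sugar': 0.0512,
    'chlorides': 0.0,
    'free sulfur dioxide': 0.0,
    'total sulfur dioxide': 0.1009,
    'density': 0.0231,
    'pH': 0.0206,
    'sulphates': 0.2188,
    'alcohol': 0.4493,
})

# Rank features from most to least important.
ranked = importances.sort_values(ascending=False)
print(ranked.head(3))

# Combined share of the two leading features.
top_two_share = ranked.head(2).sum()
print(round(top_two_share, 4))  # 0.6681
```

Together, alcohol and sulphates account for roughly two thirds of the total importance, while chlorides and free sulfur dioxide are not used by this tree at all.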