# Practical case - Wine quality 

In this practical case, we build a model that estimates wine quality from its physicochemical features. We use the white wine dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

### Load data

In [1]:
import pandas as pd

In [2]:
# Load data using URL
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', 
                   sep = ';')
# If the URL is not available, load the data from the data/ folder
#data = pd.read_csv('data/winequality-white.csv', sep = ';')

In [3]:
# Show data
data[:20][:]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
5,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
6,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,6
7,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
8,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
9,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0,6


The dataset contains the `target` variable alongside the feature variables, so we have to separate them:

In [4]:
target = 'quality'
features = list(data.columns)
features.remove(target)
print(features)

# Create X and Y values:
x = pd.DataFrame(data[features])
y = pd.DataFrame(data[target])

['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']


In [5]:
# Show the split data (features)
x[0:5][:]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9


In [6]:
# Show the split data (target)
y[0:5][:]

Unnamed: 0,quality
0,6
1,6
2,6
3,6
4,6
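
As a compact alternative to the split above, the same separation can be sketched with `DataFrame.drop`. The toy frame `toy` and the names `x_toy` and `y_toy` are illustrative assumptions, not part of the original notebook. Selecting the target with a single column label yields a 1-D Series, which is the shape scikit-learn estimators generally expect for a single target.

```python
import pandas as pd

# Toy frame standing in for the wine data (only three columns, for illustration)
toy = pd.DataFrame({
    'alcohol': [8.8, 9.5, 10.1],
    'pH': [3.0, 3.3, 3.26],
    'quality': [6, 6, 6],
})

target = 'quality'
x_toy = toy.drop(columns=target)  # every column except the target
y_toy = toy[target]               # 1-D Series holding the target

print(list(x_toy.columns))  # ['alcohol', 'pH']
print(y_toy.shape)          # (3,)
```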


### Model creation

The practical case only asks us to build a model, without specifying which one. Here we build a Decision Tree classifier and then check how well it performs.

In [7]:
from sklearn.tree            import DecisionTreeClassifier
from sklearn.metrics         import confusion_matrix
from sklearn.metrics         import accuracy_score, precision_score, recall_score
from sklearn.metrics         import roc_curve, auc
from sklearn.model_selection import train_test_split

In [8]:
# First, let's split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    random_state = 1, 
                                                    train_size = 0.8, # Change default split size
                                                    test_size = 0.2)

# Create the model and fit it
dt_model = DecisionTreeClassifier(criterion = 'entropy',
                                  max_depth = 5, 
                                  random_state = 1).fit(x_train, y_train)
# Make the predictions
y_train_pred = dt_model.predict(x_train)
y_test_pred  = dt_model.predict(x_test)

# Show the confusion matrices
cm_train = confusion_matrix(y_train, y_train_pred)
cm_test  = confusion_matrix(y_test, y_test_pred)

print('Training confusion matrix:')
print(cm_train)
print('\nTesting confusion matrix:')
print(cm_test)

Training confusion matrix:
[[   4    0    8    6    0    0    0]
 [   0   26   75   33    0    0    0]
 [   2   12  691  431   17   11    0]
 [   1    7  349 1212  181   13    0]
 [   0    0   38  392  257    9    0]
 [   0    1    0   53   61   23    0]
 [   0    0    0    1    3    1    0]]

Testing confusion matrix:
[[  0   0   1   1   0   0]
 [  0   1  17  11   0   0]
 [  0   6 152 126   5   4]
 [  1   3  98 271  57   5]
 [  0   0   9 113  60   2]
 [  0   0   0  12  19   6]]


In this case, the confusion matrices are hard to read at a glance. Still, the model does not look good: many values lie off the main diagonal, mostly in the cells next to it, which means wines are often assigned a neighbouring quality score.
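
The two matrices above also have different shapes (7 by 7 for training, 6 by 6 for testing). By default `confusion_matrix` only builds rows and columns for labels that appear in the true or predicted values, so a class that is absent from the test split is dropped. A minimal sketch of how the `labels` parameter keeps the shape fixed is shown below. The label list 3 to 9 and the short `y_true` / `y_pred` lists are assumptions made for illustration.

```python
from sklearn.metrics import confusion_matrix

# Assumed full set of quality scores; pass it explicitly to fix the matrix shape
labels = [3, 4, 5, 6, 7, 8, 9]

# Tiny illustrative vectors (not taken from the wine data)
y_true = [5, 6, 6, 7, 5, 6]
y_pred = [5, 6, 5, 6, 6, 6]

cm = confusion_matrix(y_true, y_pred, labels=labels)

print(cm.shape)    # (7, 7), even though only scores 5, 6 and 7 occur
print(cm.trace())  # 3 correctly classified samples on the main diagonal
```

With a fixed label list, the training and testing matrices can be compared cell by cell.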

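Let's look at the accuracy. For a classifier, `score` returns the mean accuracy (the fraction of correctly classified samples), not $R^2$:

In [9]:
print('Training accuracy: ', dt_model.score(x_train, y_train))
print('Testing accuracy: ', dt_model.score(x_test, y_test))

Training accuracy:  0.564828994385
Testing accuracy:  0.5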
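
The scores above can be cross-checked against the confusion matrices: for a classifier, accuracy is the sum of the main diagonal divided by the total number of samples. The sketch below copies the testing matrix printed earlier and also computes a majority-class baseline, that is, the accuracy of always predicting the most frequent true class. The names `accuracy` and `baseline` are illustrative.

```python
import numpy as np

# Testing confusion matrix, copied from the output above
cm_test = np.array([
    [0, 0,   1,   1,  0, 0],
    [0, 1,  17,  11,  0, 0],
    [0, 6, 152, 126,  5, 4],
    [1, 3,  98, 271, 57, 5],
    [0, 0,   9, 113, 60, 2],
    [0, 0,   0,  12, 19, 6],
])

# Correct predictions sit on the main diagonal
accuracy = np.trace(cm_test) / cm_test.sum()

# Row sums count the true samples per class; the largest row gives the baseline
baseline = cm_test.sum(axis=1).max() / cm_test.sum()

print(accuracy)            # 0.5
print(round(baseline, 3))  # 0.444
```

The tree therefore beats the majority-class baseline on the test split by only a few points.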


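With these results, we can suspect that the model overfits slightly (the training score is higher than the testing score) and, more importantly, that it performs poorly overall: only about half of the test wines are classified correctly. In the following units, we are going to see procedures for variable selection.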
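
One hedged way to probe the overfitting suspicion is to compare training and testing accuracy for several values of `max_depth`. The sketch below uses synthetic data from `make_classification` so that it runs offline. This is an assumption for illustration, not the wine data, and the names `X_demo`, `y_demo`, and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 11 features, 3 classes (not the wine dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=11,
                                     n_informative=5, n_classes=3,
                                     random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=1)

results = {}
for depth in (2, 5, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, results[depth])
```

A gap between training and testing accuracy that widens as the depth grows is the usual sign of overfitting. Running the same loop on `x_train` and `x_test` from the notebook would show how the wine model behaves.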