Training and validating a machine learning model starts with choosing which data to train on and which data to validate on. We'll also pick a performance measure that is meaningful to our problem.

* http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
* http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


In [138]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

In [139]:
datasource = "datasets/winequality-red.csv"
df = pd.read_csv(datasource).sample(frac = 1).reset_index(drop = True)
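
Note that `sample(frac = 1)` shuffles the rows differently on every run, so the outputs shown in this walkthrough will not reproduce exactly. A minimal sketch of a repeatable shuffle, assuming a fixed `random_state` is acceptable (the tiny `toy` frame and the seed 42 are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the wine data; only the shuffling pattern matters.
toy = pd.DataFrame({"alcohol": [8.8, 9.5, 9.5, 11.3, 9.4],
                    "quality": [5, 5, 5, 6, 6]})

# The same seed yields the same row order on every run.
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```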

In [140]:
df.head().transpose()

Unnamed: 0,0,1,2,3,4
Unnamed: 0,626.0,1199.0,1414.0,852.0,242.0
fixed acidity,9.9,9.1,7.7,6.8,9.1
volatile acidity,0.5,0.45,0.62,0.64,0.795
citric acid,0.5,0.35,0.04,0.0,0.0
residual sugar,13.8,2.4,3.8,2.7,2.6
chlorides,0.205,0.08,0.084,0.123,0.096
free sulfur dioxide,48.0,23.0,25.0,15.0,11.0
total sulfur dioxide,82.0,78.0,45.0,33.0,26.0
density,1.00242,0.9987,0.9978,0.99538,0.9994
pH,3.16,3.38,3.34,3.44,3.35


In [141]:
df.shape

(1599, 13)

In [142]:
df.columns

Index(['Unnamed: 0', 'fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'quality'],
      dtype='object')

In [143]:
del df['Unnamed: 0']

In [144]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

Let's look at the basic stats on each column with `df.describe()`, then preview the first few rows.

In [145]:
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,9.9,0.5,0.5,13.8,0.205,48.0,82.0,1.00242,3.16,0.75,8.8,5
1,9.1,0.45,0.35,2.4,0.08,23.0,78.0,0.9987,3.38,0.62,9.5,5
2,7.7,0.62,0.04,3.8,0.084,25.0,45.0,0.9978,3.34,0.53,9.5,5
3,6.8,0.64,0.0,2.7,0.123,15.0,33.0,0.99538,3.44,0.63,11.3,6
4,9.1,0.795,0.0,2.6,0.096,11.0,26.0,0.9994,3.35,0.83,9.4,6


The last column (quality) tells us the quality of the wine. Duh!

Let's build a classifier to predict the quality based on its other features. We'll also come up with a way to evaluate the performance of the classifier.

* The inputs of the classifier are all the columns (features) except for quality.
* The output of the classifier is the quality column.

In [146]:
X = np.array(df.iloc[:, :-1])
# give me all the rows for each column EXCEPT the last column (quality)

In [147]:
y = np.array(df["quality"])
# just give me the quality column
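
Slicing by position works only while `quality` stays the last column. A sketch of the same split done by column name, using a tiny made-up frame in place of the wine data (`toy`, `X_toy` and `y_toy` are illustrative names):

```python
import pandas as pd

# Hypothetical three-row stand-in for the wine data.
toy = pd.DataFrame({"pH": [3.16, 3.38, 3.34],
                    "alcohol": [8.8, 9.5, 9.5],
                    "quality": [5, 5, 5]})

X_toy = toy.drop(columns="quality").to_numpy()  # every column except the label
y_toy = toy["quality"].to_numpy()               # the label column only

print(X_toy.shape, y_toy.shape)
```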

In [148]:
y[:10]

array([5, 5, 5, 6, 6, 4, 5, 5, 6, 7], dtype=int64)

Let's binarize the wine quality to keep things simple. Quality less than 6 is bad (0) and quality greater than or equal to 6 is good (1). 

In [149]:
y[y < 6] = 0
y[y >= 6] = 1
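
The two in-place assignments work only in this order, because the new label 0 can never satisfy `y >= 6`; they also overwrite the original scores. A sketch of an equivalent one-step version that leaves the scores intact (`quality` and `labels` are illustrative names, and the ten scores are the ones printed above):

```python
import numpy as np

quality = np.array([5, 5, 5, 6, 6, 4, 5, 5, 6, 7])

# True where the wine is "good" (score of 6 or more), cast to 0/1.
labels = (quality >= 6).astype(int)

print(labels)
```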

In [150]:
y[:10]

array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1], dtype=int64)

Time for a sanity check

In [151]:
print("X", X.shape, "y", y.shape)

X (1599, 11) y (1599,)


In [152]:
print("Label Distribution", {0: np.sum(y == 0), 1: np.sum(y == 1)})

Label Distribution {0: 744, 1: 855}
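
These counts give a useful reference point: a model that always predicts the more common class would already be right for 855 of the 1599 wines. A small sketch of that baseline, using only the counts printed above:

```python
# Label counts taken from the output above.
counts = {0: 744, 1: 855}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))  # 1 0.5347
```

Any accuracy reported for a classifier on this data is worth reading against that floor of roughly 0.53.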


### Simple (aka bad) approach - train and evaluate on the entire dataset

In [153]:
model = GaussianNB() # create an instance of a model that can be trained

In [154]:
model.fit(X, y) # train the model parameters using this data and expected outcomes

GaussianNB(priors=None)

In [155]:
model.score(X, y)

0.72858036272670423

Accuracy is between 0 and 1. Ask yourself these questions though:

* Would the same predictive performance extend to future data?
* Are there enough data for the model to learn from?
* Could the classifier be learning from features that merely happen to correlate with the result, without any real connection (noise)?
* How could we make the classifier more accurate?


### Another approach - Only train on 3 observations

Let's just train on the first 3 rows

In [156]:
new_X = X[:3]
new_X

array([[  9.90000000e+00,   5.00000000e-01,   5.00000000e-01,
          1.38000000e+01,   2.05000000e-01,   4.80000000e+01,
          8.20000000e+01,   1.00242000e+00,   3.16000000e+00,
          7.50000000e-01,   8.80000000e+00],
       [  9.10000000e+00,   4.50000000e-01,   3.50000000e-01,
          2.40000000e+00,   8.00000000e-02,   2.30000000e+01,
          7.80000000e+01,   9.98700000e-01,   3.38000000e+00,
          6.20000000e-01,   9.50000000e+00],
       [  7.70000000e+00,   6.20000000e-01,   4.00000000e-02,
          3.80000000e+00,   8.40000000e-02,   2.50000000e+01,
          4.50000000e+01,   9.97800000e-01,   3.34000000e+00,
          5.30000000e-01,   9.50000000e+00]])

In [157]:
new_y = y[:3]
new_y

array([0, 0, 0], dtype=int64)

In [158]:
model = GaussianNB()

In [159]:
model.fit(new_X, new_y)

GaussianNB(priors=None)

In [160]:
model.score(new_X, new_y)

1.0

Let's try predicting some other observations from the original dataset. In other words, if the model is applied to data that was not part of the training set, how well does it actually do?

In [161]:
print("Prediction", model.predict(X[20:50]))
print("Answer", y[20:50])
print("Score", model.score(X[20:50], y[20:50]))

Prediction [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Answer [1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 1 1 1 0 1 1]
Score 0.5
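
Every prediction above is 0 for a simple reason: the three training rows all carry label 0, so the model has only ever seen one class. A minimal sketch on synthetic data reproduces the effect (the random arrays, the seed and the names `X_demo`, `y_demo`, `tiny_model` are made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))      # synthetic features
y_demo = rng.integers(0, 2, size=200)   # synthetic 0/1 labels

# Train on three rows that all share label 0.
zero_rows = np.flatnonzero(y_demo == 0)[:3]
tiny_model = GaussianNB().fit(X_demo[zero_rows], y_demo[zero_rows])

# Evaluate on every row the model did not see.
unseen = np.setdiff1d(np.arange(200), zero_rows)
train_score = tiny_model.score(X_demo[zero_rows], y_demo[zero_rows])
unseen_score = tiny_model.score(X_demo[unseen], y_demo[unseen])

print("classes seen:", tiny_model.classes_)
print("train score:", train_score)
print("unseen score equals share of zeros:",
      np.isclose(unseen_score, np.mean(y_demo[unseen] == 0)))
```

The perfect training score says nothing here; on unseen rows the model is right exactly as often as label 0 occurs.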


This is known as overfitting. Overfitting occurs when a model fails to generalize to the wider population of data because it has been tuned to the specific instances in its training data. Let's try one last approach.

### A better approach - holding out 25% of the data for validation

In [162]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) 
# outputs to 4 variables (with the test data being 25% of the original dataset)
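
The call above draws a different random split on each run, so the score will wobble from run to run. A sketch, assuming a repeatable and class-balanced split is wanted: `random_state` and `stratify` are existing `train_test_split` parameters, while the synthetic arrays and the seed 0 are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same size and label counts as the wine data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1599, 11))
y_demo = np.array([0] * 744 + [1] * 855)

# random_state makes the split repeatable; stratify keeps the class ratio
# roughly equal in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```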

In [163]:
model = GaussianNB()

In [164]:
model.fit(X_train, y_train) 
# train on the training data (75% of the original data)

GaussianNB(priors=None)

In [165]:
model.score(X_test, y_test)
# test against the testing data (25% of the original data)

0.72499999999999998
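
A single 25% hold-out gives one number that depends on which rows happened to land in the test set. One common way to gauge that variation, not used in this walkthrough, is k-fold cross-validation with `cross_val_score`. A sketch on synthetic data (the data, the seed and the choice of 5 folds are assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
# Shift the features of class 1 so there is some signal to learn.
X_demo = rng.normal(size=(400, 5)) + y_demo[:, None] * 0.8

# Five train/validate rounds, each holding out a different fifth of the rows.
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5)

print(scores.round(3), scores.mean().round(3))
```

The spread of the five scores hints at how much a single hold-out score could move by chance.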

In [166]:
print("Prediction", model.predict(X_test))
print("\n")
print("Actual", y_test)

Prediction [0 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0
 1 0 1 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0
 0 1 1 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0
 1 1 0 1 0 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 1 0 1
 1 1 0 1 0 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 1 0 1 1 0 0 0 1 1 0 1 1 1 0
 1 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 1 0 0 1 0 1 0
 1 1 1 0 1 0 0 1 0 0 1 0 0 0 1 1 1 0 1 1 1 0 0 1 0 0 0 1 1 1 1 1 1 0 1 1 0
 0 0 0 1 1 0 1 0 1 1 1 1 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 0 1 0 0 0 1 1
 0 1 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 1 1 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 0 0 0
 1 0 0 1 0 1 1 0 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 0
 0 1 1 0 0 1 0 1 0 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 1 0 0 1]


Actual [0 1 0 1 1 1 0 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1
 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 1 0 1 0 0 0
 1 0 1 1 0 0 1 1 1

In [167]:
prediction = pd.DataFrame({"Prediction": model.predict(X_test)})
prediction.head()

Unnamed: 0,Prediction
0,0
1,0
2,0
3,0
4,1


In [168]:
actual = pd.DataFrame({"Actual": y_test})
actual.head()

Unnamed: 0,Actual
0,0
1,1
2,0
3,1
4,1


In [169]:
frames = [prediction, actual]
compare = pd.concat(frames, axis = 1)
compare.tail()

Unnamed: 0,Prediction,Actual
395,1,0
396,1,1
397,0,0
398,0,1
399,1,1


In [170]:
compare["isCorrect"] = compare["Prediction"] == compare["Actual"]

In [171]:
compare

Unnamed: 0,Prediction,Actual,isCorrect
0,0,0,True
1,0,1,False
2,0,0,True
3,0,1,False
4,1,1,True
5,1,1,True
6,0,0,True
7,1,1,True
8,1,1,True
9,1,1,True
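
Accuracy is only one way to summarize a table like this. A sketch of a few other measures from `sklearn.metrics`, computed on just the ten rows shown above (the names `predicted` and `actual` are illustrative, and ten rows are far too few for a real evaluation):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# First ten rows of the comparison table above.
predicted = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
actual = np.array([0, 1, 0, 1, 1, 1, 0, 1, 1, 1])

acc = accuracy_score(actual, predicted)     # share of correct rows
cm = confusion_matrix(actual, predicted)    # rows: actual, columns: predicted
prec = precision_score(actual, predicted)   # of wines predicted good, share truly good
rec = recall_score(actual, predicted)       # of truly good wines, share found

print(acc)
print(cm)
print(prec, round(rec, 3))
```

On these ten rows the model never calls a bad wine good, but it misses two of the seven good wines, a distinction that accuracy alone hides.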
