# Train, Validate $\rightarrow$ Train, Test

In this exercise, you will empirically compare the results of a ten-fold cross-validated model with those of a fully trained model.

## Notes and Guidelines
* Read a dataset from disk and use it for a classification task.
* Construct a Gaussian Naive Bayes classifier and fit it to the phoneme dataset provided.
* Save and re-load a trained classifier.
* Compare K-fold cross-validation scores with the success rate of a fully-trained model.


### Dataset
* Dataset acquired from [KEEL](http://sci2s.ugr.es/keel/dataset.php?cod=105), an excellent resource for finding 'toy' datasets (and a few more serious ones).
    * A description of the dataset is provided at the above link - **read it.**
    * Excerpt: 
    *The aim of this dataset is to distinguish between nasal (class 0) and oral sounds (class 1).
    The class distribution is 3,818 samples in class 0 and 1,586 samples in class 1.
    The phonemes are transcribed as follows: sh as in she, dcl as in dark, iy as the vowel in she, aa as the vowel in dark, and ao as the first vowel in water.*
    
* It is not necessary to fully understand the nature or context of the values in the dataset, only that there are five columns of input (feature) data and one column of output (class) data.

## Handling imports and dataset inclusion

In [1]:
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report

# locate dataset
DATASET = '/dsa/data/all_datasets/phoneme.csv'  # phoneme classification dataset
assert os.path.exists(DATASET)  # check if the file actually exists

## Constructing DataFrame from raw dataset

<span style="background:yellow">**Note**</span>: Variable `dataset` should be used for the dataframe.

In [33]:

dataset = pd.read_csv(DATASET, header=0).sample(frac=1)

# verify dataset shape
print("Dataset shape: ", dataset.shape)

Dataset shape:  (5404, 6)


In [34]:
# show first few lines of the dataset
dataset.head()

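|      | Aa    | Ao    | Dcl   | Iy     | Sh     | Class |
|------|-------|-------|-------|--------|--------|-------|
| 4281 | 0.873 | 2.74  | 0.827 | -0.275 | -0.149 | 0     |
| 826  | 0.84  | 2.23  | 1.577 | -0.333 | -0.157 | 1     |
| 572  | 1.098 | 2.692 | 0.812 | 0.348  | 0.0    | 0     |
| 4290 | 0.593 | 2.737 | 0.856 | -0.336 | 1.173  | 0     |
| 4666 | 0.431 | 0.803 | 1.946 | -0.764 | 0.523  | 1     |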


## Splitting data into training and test sets

Split the dataset into training (80%) and testing (20%) sets.

The snippet below is only necessary if you are interested in visualizing
the data or providing neatly labeled output within the program.

```python
# extract labels from column headers
phonemes = dataset.columns[0:5].tolist()  # Feature labels
labels = {0: 'Nasal', 1: 'Oral'}  # Class labels
```

In [35]:
# extract features and class data from primary data frame
X = dataset.iloc[:,:-1]
y = dataset.Class 
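
The feature/class separation can be checked on a tiny frame before running it on the full file. The sketch below re-types the five rows shown by `dataset.head()` as an in-memory CSV; treating the first column as the row index is an assumption about how that output was produced, and the names `sample`, `X_sample`, and `y_sample` are hypothetical.

```python
import io

import pandas as pd

# The five rows printed by dataset.head(), re-typed as an in-memory CSV.
# Assumption: the leading unnamed column is the row index, not a feature.
csv_text = """,Aa,Ao,Dcl,Iy,Sh,Class
4281,0.873,2.74,0.827,-0.275,-0.149,0
826,0.84,2.23,1.577,-0.333,-0.157,1
572,1.098,2.692,0.812,0.348,0.0,0
4290,0.593,2.737,0.856,-0.336,1.173,0
4666,0.431,0.803,1.946,-0.764,0.523,1
"""
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Select features by name rather than by position, so a reordered
# file cannot silently move the class column into X.
feature_cols = [c for c in sample.columns if c != "Class"]
X_sample = sample[feature_cols]
y_sample = sample["Class"]

# Count how many rows fall into each class.
class_counts = y_sample.value_counts().to_dict()
print(X_sample.shape, y_sample.shape)
print(class_counts)
```

Selecting by column name gives the same result as `iloc[:, :-1]` here; it is simply less sensitive to column order.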


In [36]:
# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print("Training shapes (X, y): ", X_train.shape, y_train.shape)
print("Testing shapes (X, y): ", X_test.shape, y_test.shape)

Training shapes (X, y):  (4323, 5) (4323,)
Testing shapes (X, y):  (1081, 5) (1081,)
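
Because the classes are imbalanced (3,818 vs. 1,586), a plain random split can leave the test set with a slightly different class ratio than the full data. A minimal sketch of a stratified, reproducible split follows; it uses random placeholder features rather than the phoneme file, and the `random_state=0` seed is an arbitrary choice, not something the exercise prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the class counts quoted in the dataset description.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 3818 + [1] * 1586)
X_demo = rng.normal(size=(y_demo.size, 5))  # placeholder features, not real data

# stratify=y_demo keeps the class ratio (almost) identical in both parts;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0
)

print("Train/test shapes:", X_tr.shape, X_te.shape)
print("Share of class 1 (all, train, test):",
      y_demo.mean(), y_tr.mean(), y_te.mean())
```

The resulting shapes match the ones printed above, since only the row selection changes, not the split sizes.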


## Constructing the classifier and running automated cross-validation

* Run 10-fold cross-validation with a `GaussianNB` classifier
* Print the accuracy scores for these 10 folds

In [37]:
# Your code below this line (Question #E101)
# --------------------------

classifier = GaussianNB()

cv_scores = sklearn.model_selection.cross_val_score(classifier, X, y, cv=10) 
print(cv_scores)


[0.73567468 0.74676525 0.7689464  0.77264325 0.76666667 0.77962963
 0.76296296 0.75555556 0.73518519 0.78703704]
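
To see what `cross_val_score` does with `cv=10`, the folds can be written out by hand. For a classifier and an integer `cv`, scikit-learn uses `StratifiedKFold` without shuffling, and the default score is the estimator's own `score` method (accuracy for `GaussianNB`). The sketch below runs on synthetic data from `make_classification`, not on the phoneme dataset, so its numbers will differ from those above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data with five features (a stand-in for the real file).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

model = GaussianNB()
auto_scores = cross_val_score(model, X_demo, y_demo, cv=10)

# Manual equivalent: fit a fresh copy on 9 folds, score on the held-out fold.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

print("cross_val_score:", np.round(auto_scores, 3))
print("manual loop:    ", np.round(manual_scores, 3))
```

Note that `cross_val_score` fits clones of the estimator, so `classifier` itself is still unfitted after the call.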


## Training the classifier and pickling to disk
* Fit the model on all the training instances and store it to disk

In [38]:
# Your code below this line (Question #E102)
# --------------------------
classifier.fit(X_train, y_train)

import joblib
joblib.dump(classifier, 'GaussianDigitsExercise.pkl')


['GaussianDigitsExercise.pkl']

## Unpickling the model and making predictions

* Load the saved model 
* Make predictions for the testing set


In [39]:
# Your code below this line (Question #E103)
# --------------------------

# load pickled model
loaded_model = joblib.load('GaussianDigitsExercise.pkl')

# make predictions with freshly loaded model
y_pred = loaded_model.predict(X_test)

# verify input and output shape are appropriate
print("Input vs. output shape:")
print(X_test.shape, y_pred.shape)




Input vs. output shape:
(1081, 5) (1081,)
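
Matching shapes only show that the loaded model runs. A slightly stronger check is that the restored model predicts exactly what the original does. A minimal sketch, assuming synthetic data and a hypothetical file name inside a temporary directory:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; the real exercise uses the phoneme training set.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = GaussianNB().fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "gaussian_nb_demo.pkl")  # hypothetical name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should carry the same fitted parameters and predictions.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
same_means = np.allclose(model.theta_, restored.theta_)
print("Identical predictions:", same_predictions)
print("Identical per-class feature means:", same_means)
```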


## Performing final performance comparison

In [40]:
# tally up right + wrong 'guesses' by model
true, false = 0, 0
for i, j in zip(y_test, y_pred):
    # print(i, j)
    if i == j:
        true += 1
    else:
        false += 1

# report results numerically and by percentage
true_percent = true / (true + false) * 100
print("Correct guesses: " + str(true) + "\nIncorrect guesses: " + str(false))
print("Percent correct: " + str(true_percent))

# compare to average of cross-validation scores
avg_cv = np.sum(cv_scores) / len(cv_scores) * 100
print("Percent cross-validation score (10 folds, average): " + str(avg_cv))

Correct guesses: 810
Incorrect guesses: 271
Percent correct: 74.93061979648473
Percent cross-validation score (10 folds, average): 76.11066611898406
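
The tally loop can also be written in vectorized form. The sketch below uses hypothetical label arrays arranged to reproduce the counts printed above (810 correct, 271 incorrect); in the notebook itself, `y_test` and `y_pred` would take their place.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels built to reproduce the tally above:
# 1081 samples, of which the first 271 are predicted incorrectly.
y_true_demo = np.zeros(1081, dtype=int)
y_pred_demo = y_true_demo.copy()
y_pred_demo[:271] = 1

# Element-wise comparison replaces the explicit loop.
n_correct = int(np.sum(y_true_demo == y_pred_demo))
n_wrong = y_true_demo.size - n_correct

# accuracy_score returns the same fraction of matching labels.
percent_correct = 100 * accuracy_score(y_true_demo, y_pred_demo)

print("Correct guesses:", n_correct)
print("Incorrect guesses:", n_wrong)
print("Percent correct:", percent_correct)
```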


## Measure performance using Scikit Learn modules 

Compute and display the following:
 1. Confusion matrix
 1. Accuracy
 1. Precision
 1. Recall
 1. $F_1$-Score
 
Add additional cells if required. 

In [41]:
# Your code below this line  (Question #E104)
# --------------------------
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
print("Accuracy:", np.round(accuracy_score(y_test, y_pred), 2))
print("Precision:", np.round(precision_score(y_test, y_pred, average='weighted'), 2))
print("Recall:", np.round(recall_score(y_test, y_pred, average='weighted'), 2))
print("F1-Score:", np.round(f1_score(y_test, y_pred, average='weighted'), 2))

[[573 178]
 [ 93 237]]
Accuracy: 0.75
Precision: 0.77
Recall: 0.75
F1-Score: 0.76
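
The weighted metrics can be reproduced by hand from the confusion matrix, which makes it easier to see where each number comes from. With `average='weighted'`, each class's score is weighted by its support (the number of true samples in that class), so weighted recall always equals overall accuracy. A minimal sketch using the matrix printed above (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix as printed above: rows = true class, columns = predicted.
cm = np.array([[573, 178],
               [93, 237]])

tp = np.diag(cm)            # correctly classified samples per class
support = cm.sum(axis=1)    # true samples per class
predicted = cm.sum(axis=0)  # predicted samples per class

# Per-class metrics.
precision = tp / predicted
recall = tp / support
f1 = 2 * precision * recall / (precision + recall)

# Support-weighted averages, as in average='weighted'.
weights = support / support.sum()
weighted_precision = np.sum(weights * precision)
weighted_recall = np.sum(weights * recall)
weighted_f1 = np.sum(weights * f1)
accuracy = tp.sum() / cm.sum()

print("Per-class precision:", np.round(precision, 3))
print("Per-class recall:   ", np.round(recall, 3))
print("Weighted precision, recall, F1:",
      np.round([weighted_precision, weighted_recall, weighted_f1], 2))
print("Accuracy:", np.round(accuracy, 2))
```

The per-class values are worth a look on their own: the weighted averages hide that the minority class (oral, class 1) is predicted with noticeably lower precision than the majority class.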


## Conclusions

How did your trained model perform relative to your expectations based on the cross-validation?
Provide your answer in the cell below.

# Add your answer below this comment  (Question #E105)
# -----------------------------------

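My model was slightly less accurate on the held-out test set (74.9%) than the 10-fold cross-validation average suggested (76.1%), a gap of about 1.2 percentage points. The test accuracy still falls within the range of the individual fold scores (73.5% to 78.7%), so the two estimates are consistent. The goal of cross-validation is to estimate the model's accuracy on unseen data. In ten-fold cross-validation, the model is fitted 10 times; each fit trains on 90% of the data passed to `cross_val_score` and validates on the remaining 10%. Note that `cross_val_score` was called here on the full dataset (`X`, `y`) rather than on `X_train` alone, so the folds also included the rows later used for testing.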
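
One way to put the gap in context is to compare it with the fold-to-fold spread of the cross-validation scores. A minimal sketch using the ten scores and the test tally printed earlier in this notebook (the variable names are hypothetical):

```python
import numpy as np

# The ten fold accuracies printed by the cross-validation cell.
cv_scores_reported = np.array([
    0.73567468, 0.74676525, 0.7689464, 0.77264325, 0.76666667,
    0.77962963, 0.76296296, 0.75555556, 0.73518519, 0.78703704,
])

# Test-set accuracy from the tally: 810 correct out of 1081.
test_accuracy = 810 / 1081

cv_mean = cv_scores_reported.mean()
cv_std = cv_scores_reported.std()  # spread across the ten folds
gap = cv_mean - test_accuracy

# Does the test accuracy fall between the worst and best fold?
within_fold_range = (
    cv_scores_reported.min() <= test_accuracy <= cv_scores_reported.max()
)

print(f"CV mean: {cv_mean:.4f}, CV std: {cv_std:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}, gap: {gap:.4f}")
print("Test accuracy within fold range:", within_fold_range)
```

A gap smaller than the spread between folds suggests the difference is ordinary sampling variation rather than a sign that the model generalizes worse than cross-validation predicted.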





# Save your notebook!  Then `File > Close and Halt`