# Train, Validate $\rightarrow$ Train, Test

In this exercise, you will empirically compare the results of a ten-fold cross-validated model with those of a fully trained model.

## Notes and Guidelines
* Read a dataset from disk and use it for a classification task.
* Construct a Gaussian Naive Bayes classifier and fit it to the phoneme dataset provided.
* Save and re-load a trained classifier.
* Compare K-fold cross-validation scores with the success rate of a fully-trained model.


### Dataset
* Dataset acquired from [KEEL](http://sci2s.ugr.es/keel/dataset.php?cod=105), an excellent resource for finding 'toy' datasets (and a few more serious ones).
    * A description of the dataset is provided at the above link - **read it.**
    * Excerpt: 
    *The aim of this dataset is to distinguish between nasal (class 0) and oral sounds (class 1).
    The class distribution is 3,818 samples in class 0 and 1,586 samples in class 1.
    The phonemes are transcribed as follows: sh as in she, dcl as in dark, iy as the vowel in she, aa as the vowel in dark, and ao as the first vowel in water.*
    
* It is not necessary to fully understand the nature or context of the values in the dataset; all that matters is that there are five columns of input (feature) data and one column of output (class) data.
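
Because the two classes are imbalanced, it helps to know what accuracy a trivial classifier would reach before judging any model. A minimal sketch, using only the class counts quoted in the excerpt above (the variable names are illustrative and not part of the exercise):

```python
# Class counts quoted in the KEEL description of the phoneme dataset
n_nasal = 3818  # class 0
n_oral = 1586   # class 1
n_total = n_nasal + n_oral

# A classifier that always predicts the majority class (nasal) is right
# exactly as often as that class occurs.
majority_baseline = max(n_nasal, n_oral) / n_total

print(f"Total samples: {n_total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any accuracy reported later is worth reading against this baseline of roughly 70.7%.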

## Handling imports and dataset inclusion

In [8]:
import os
import pandas as pd
import numpy as np

# <import necessary modules> 
# sklearn packages
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split  # train_test_split is called unqualified below
import sklearn.model_selection
from sklearn.metrics import classification_report
from collections import OrderedDict

# pickling
import joblib

# locate dataset
DATASET = '/dsa/data/all_datasets/phoneme.csv'  # phoneme classification dataset
assert os.path.exists(DATASET)  # check if the file actually exists

## Constructing DataFrame from raw dataset

<span style="background:yellow">**Note**</span>: Variable `dataset` should be used for the dataframe.

In [2]:

dataset = pd.read_csv(DATASET, header=0).sample(frac=1)

# verify dataset shape
print("Dataset shape: ", dataset.shape)

Dataset shape:  (5404, 6)
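
The `.sample(frac=1)` call above shuffles the rows, so the row order (and everything downstream of it) differs on each run. A small sketch of how the shuffle could be made repeatable, assuming a toy frame in place of the real file; the `random_state=42` value is an arbitrary choice and not part of the original notebook:

```python
import pandas as pd

# Small stand-in frame; the real notebook reads phoneme.csv instead
toy = pd.DataFrame({"Aa": [0.1, 0.2, 0.3, 0.4], "Class": [0, 1, 0, 1]})

# sample(frac=1) returns every row in random order, i.e. a shuffle.
# Passing random_state (42 is an arbitrary choice) fixes that order.
shuffled_a = toy.sample(frac=1, random_state=42)
shuffled_b = toy.sample(frac=1, random_state=42)

print(shuffled_a.equals(shuffled_b))                 # same seed, same order?
print(sorted(shuffled_a.index) == list(toy.index))   # no rows lost or repeated?
```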


In [3]:
# Describing the dataset as well

dataset.describe()

| | Aa | Ao | Dcl | Iy | Sh | Class |
|---|---|---|---|---|---|---|
| count | 5404.0 | 5404.0 | 5404.0 | 5404.0 | 5404.0 | 5404.0 |
| mean | 0.818957 | 1.258802 | 0.764732 | 0.398743 | 0.078619 | 0.293486 |
| std | 0.858733 | 0.851057 | 0.925436 | 0.796531 | 0.575624 | 0.455401 |
| min | -1.7 | -1.327 | -1.823 | -1.581 | -1.284 | 0.0 |
| 25% | 0.24375 | 0.596 | -0.115 | -0.205 | -0.23225 | 0.0 |
| 50% | 0.4925 | 1.0755 | 0.729 | 0.2855 | -0.044 | 0.0 |
| 75% | 1.08925 | 1.86625 | 1.484 | 0.937 | 0.19625 | 1.0 |
| max | 4.107 | 4.378 | 3.199 | 2.826 | 2.719 | 1.0 |


In [4]:
# show first few lines of the dataset
dataset.head()

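| | Aa | Ao | Dcl | Iy | Sh | Class |
|---|---|---|---|---|---|---|
| 3273 | 3.161 | -0.243 | 0.211 | -0.086 | 0.085 | 0 |
| 4092 | 2.111 | 0.901 | 0.371 | 0.268 | 0.515 | 0 |
| 4631 | 0.807 | 2.168 | 0.714 | 0.84 | 0.58 | 1 |
| 227 | 1.021 | 2.274 | -0.52 | -0.19 | -0.12 | 0 |
| 3914 | 0.738 | 1.981 | -0.297 | 0.193 | -0.118 | 0 |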


## Splitting data into training and test sets

Split the dataset into training (80%) and testing (20%) sets.

The following snippet is only necessary if you are interested in visualizing
the data or providing neatly labeled output within the program.

```python
# extract labels from column headers
phonemes = dataset.columns[0:5].tolist()  # Feature labels
labels = {0: 'Nasal', 1: 'Oral'}  # Class labels
```

In [5]:
# extract features and class data from primary data frame
X = dataset.iloc[:,:5] 
y = dataset.Class  

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("Training shapes (X, y): ", X_train.shape, y_train.shape)
print("Testing shapes (X, y): ", X_test.shape, y_test.shape)

Training shapes (X, y):  (4323, 5) (4323,)
Testing shapes (X, y):  (1081, 5) (1081,)
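
The split above is random and unstratified, so the class ratio in the test set can drift from the ratio in the full dataset, and the result changes between runs. A hedged sketch of a reproducible, stratified alternative on synthetic stand-in data (the random features, the `stratify` argument, and `random_state=0` are assumptions for illustration, not part of the original exercise):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same size and class counts as the phoneme data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5404, 5))
y_demo = np.array([0] * 3818 + [1] * 1586)

# stratify keeps the class ratio (almost) identical in both subsets;
# random_state (0 is arbitrary) makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print("Train/test shapes:", X_tr.shape, X_te.shape)
print("Oral fraction overall:", round(y_demo.mean(), 4))
print("Oral fraction in test:", round(y_te.mean(), 4))
```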


In [6]:
phonemes = dataset.columns[0:5].tolist()  # Feature labels
labels = {0: 'Nasal', 1: 'Oral'}  # Class labels

## Constructing the classifier and running automated cross-validation

* Run a 10-fold cross validation with `GaussianNB` classifier
* Print the accuracy scores for these 10 folds

In [9]:
# Your code below this line (Question #E101)
# --------------------------

# Gaussian classifier
model = GaussianNB()

# cross validation test
cv_ten = sklearn.model_selection.cross_val_score(model, X, y, cv=10)

# printing array of accuracy scores
cv_ten



array([0.73752311, 0.80591497, 0.76340111, 0.73937153, 0.79074074,
       0.76296296, 0.74259259, 0.77777778, 0.74814815, 0.73888889])
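
To make the `cv=10` call above less of a black box, the sketch below writes the same procedure out by hand and checks it against `cross_val_score`. It runs on synthetic stand-in data rather than the phoneme file, and the names `X_demo`, `y_demo`, and `manual_scores` are illustrative:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class stand-in data (not the phoneme dataset)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (300, 5)), rng.normal(1.0, 1.0, (200, 5))])
y_demo = np.array([0] * 300 + [1] * 200)

model = GaussianNB()

# For a classifier, an integer cv is interpreted as stratified K-fold
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns mean accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

auto_scores = cross_val_score(model, X_demo, y_demo, cv=10)
print(np.allclose(manual_scores, auto_scores))
```

Note that `cross_val_score` fits clones of the estimator, so `model` itself is still unfitted afterwards and needs its own `fit` call before it can predict.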

## Training the classifier and pickling to disk
* Fit the model on all the training instances and save it to disk

In [10]:
# Your code below this line (Question #E102)
# --------------------------

model.fit(X_train, y_train)


GaussianNB()

In [11]:
joblib.dump(model, 'GaussianPhonemes.pkl')

['GaussianPhonemes.pkl']

## Unpickling the model and making predictions

* Load the saved model 
* Make predictions for the testing set


In [12]:
# Your code below this line (Question #E103)
# --------------------------

# load pickled model
loaded_model = joblib.load('GaussianPhonemes.pkl')

# make predictions with freshly loaded model
y_pred = loaded_model.predict(X_test)

# verify input and output shape are appropriate
print("Input vs. output shape:")
print(X_test.shape, y_pred.shape)




Input vs. output shape:
(1081, 5) (1081,)
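
Matching shapes show that the loaded model runs, but not that it is the same model that was saved. A minimal round-trip check, assuming synthetic stand-in data and a hypothetical file name inside a temporary directory:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (not the phoneme dataset)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

original = GaussianNB().fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(original, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(original.predict(X_demo), restored.predict(X_demo))
print("Round trip preserved predictions:", same_predictions)
```

Pickle-based files can execute arbitrary code when loaded, so `joblib.load` should only be used on files from a trusted source.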


## Performing final performance comparison

In [13]:
# tally up right + wrong 'guesses' by model
true, false = 0, 0
for i, j in zip(y_test, y_pred):
    # print(i, j)
    if i == j:
        true += 1
    else:
        false += 1

# report results numerically and by percentage
true_percent = true / (true + false) * 100
print("Correct guesses: " + str(true) + "\nIncorrect guesses: " + str(false))
print("Percent correct: " + str(true_percent))

# compare to average of cross-validation scores
avg_cv = np.sum(cv_ten) / len(cv_ten) * 100
print("Percent cross-validation score (10 folds, average): " + str(avg_cv))

Correct guesses: 825
Incorrect guesses: 256
Percent correct: 76.31822386679002
Percent cross-validation score (10 folds, average): 76.07321831998358
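
A single average hides how much the ten folds disagree with each other. The sketch below reuses the numbers printed above to put the held-out accuracy next to the mean and spread of the fold scores; treating "within one standard deviation" as the yardstick is an assumption made here, not a rule from the exercise:

```python
import numpy as np

# Fold accuracies and test-set tally copied from the outputs above
cv_scores = np.array([0.73752311, 0.80591497, 0.76340111, 0.73937153, 0.79074074,
                      0.76296296, 0.74259259, 0.77777778, 0.74814815, 0.73888889])
correct, incorrect = 825, 256

test_accuracy = correct / (correct + incorrect)
cv_mean = cv_scores.mean()
cv_std = cv_scores.std()  # population standard deviation across the 10 folds

within_one_std = abs(test_accuracy - cv_mean) <= cv_std
print(f"Held-out accuracy: {test_accuracy:.4f}")
print(f"CV accuracy: {cv_mean:.4f} +/- {cv_std:.4f}")
print("Held-out accuracy within one std of CV mean:", within_one_std)
```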


## Measure performance using scikit-learn modules

Compute and display the following:
 1. Confusion matrix
 1. Accuracy
 1. Precision
 1. Recall
 1. $F_1$-Score
 
Add additional cells if required. 

In [14]:
# Your code below this line  (Question #E104)
# --------------------------

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.88      0.77      0.82       772
           1       0.57      0.74      0.64       309

    accuracy                           0.76      1081
   macro avg       0.72      0.76      0.73      1081
weighted avg       0.79      0.76      0.77      1081
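
The report above covers precision, recall, and $F_1$, but not the confusion matrix itself. The sketch below shows the individual `sklearn.metrics` calls on a hand-made toy example (the label lists are invented purely to keep the arithmetic easy to check and are not results from the phoneme data):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-made toy labels, chosen only to keep the arithmetic easy to follow
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true_toy, y_pred_toy)
precision = precision_score(y_true_toy, y_pred_toy)  # tp / (tp + fp), class 1
recall = recall_score(y_true_toy, y_pred_toy)        # tp / (tp + fn), class 1
f1 = f1_score(y_true_toy, y_pred_toy)                # harmonic mean of the two

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In the notebook, the same calls would take `y_test` and `y_pred` in place of the toy lists.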



## Conclusions

How did your trained model perform relative to your expectations based on the cross-validation?
Provide your answer in the cell below.

In [None]:
# Add your answer below this comment  (Question #E105)
# -----------------------------------

The trained model correctly predicted 76.3% of the testing set, while the 10-fold cross-validation averaged 76.1%.
Since the two figures are essentially the same, the trained model performed as anticipated from
the cross-validation.





# Save your notebook!  Then `File > Close and Halt`