# Train, Validate $\rightarrow$ Train, Test 
### Focus: Naive Bayes Classifier & Logistic Regression

## Introduction
When constructing a model, the amount of available data may become an issue. 
To detect overfitting, it is necessary to withhold some portion of the data as a test set. 
However, if the test set is used repeatedly to compare or tune models, overfitting *on the test set* may also occur unless there is a secondary validation step. 
As such, `scikit-learn` provides a number of methods for cross-validation.

## References
1. [Scikit documentation - GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

## Setting up the model

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from collections import OrderedDict

# load dataset 
# raw = load_iris()

# X = raw.data # the feature matrix (.data is multi-dimensional)
# y = raw.target # the target data is a single label, so it can all be kept

DATASET = '/dsa/data/all_datasets/wine-quality/winequality-red.csv'
dataset = pd.read_csv(DATASET, sep=';').sample(frac = 1).reset_index(drop=True)
X = dataset.iloc[:, [1,2,6,9,10]]
y = dataset.quality

# test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# we'll compare a dummy baseline, Gaussian Naive Bayes, and logistic regression
baseline_classifier = DummyClassifier()
nb_classifier = GaussianNB()
regression_classifier = LogisticRegression(max_iter = 10000)

## Cross-validation
Though a manual CV workflow was described in [the cross-validation lab](./CrossValidation.ipynb), the automated `cross_val_score()` will work well enough for this example.

In [2]:
# automated CV step
baseline_scores = cross_val_score(baseline_classifier, X_train, y_train, cv=5)
nb_scores = cross_val_score(nb_classifier, X_train, y_train, cv=5)
regression_scores = cross_val_score(regression_classifier, X_train, y_train, cv=5)
print("baseline score ", baseline_scores) # TODO: visualization of CV process
print("NB score :", nb_scores)
print("Regression score: ", regression_scores)


baseline score  [0.4296875  0.4296875  0.4296875  0.43359375 0.43529412]
NB score : [0.61328125 0.578125   0.5859375  0.5859375  0.57647059]
Regression score:  [0.640625   0.58203125 0.5546875  0.58203125 0.60784314]
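
The per-fold scores printed above can be condensed into a mean and standard deviation per model, which makes the comparison easier to read. The following is a minimal sketch that simply copies the printed values into arrays; the names `cv_results` and `cv_summary` are introduced here for illustration and do not appear in the original notebook.

```python
import numpy as np

# Per-fold accuracies copied from the printed output above.
cv_results = {
    "Baseline": np.array([0.4296875, 0.4296875, 0.4296875, 0.43359375, 0.43529412]),
    "Naive Bayes": np.array([0.61328125, 0.578125, 0.5859375, 0.5859375, 0.57647059]),
    "Logistic Regression": np.array([0.640625, 0.58203125, 0.5546875, 0.58203125, 0.60784314]),
}

# Mean and standard deviation across the five folds for each model.
cv_summary = {name: (scores.mean(), scores.std()) for name, scores in cv_results.items()}

for name, (mean, std) in cv_summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

In the notebook itself, the same summary could be computed directly from `baseline_scores`, `nb_scores`, and `regression_scores`.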


Note that we are performing cross-validation on the training set only. Each value is the accuracy (with 1 being a perfect score) that the model achieved on one held-out fold, that is, a portion of the training data it was not fitted on during that round.
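
To make the fold-by-fold procedure concrete, here is a minimal sketch of roughly what `cross_val_score(..., cv=5)` does for a classifier. It assumes synthetic data from `make_classification` as a stand-in for the wine-quality training set, and the names `X_demo`, `y_demo`, `manual_scores`, and `auto_scores` are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption; not the wine-quality dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)

model = GaussianNB()
# For a classifier and an integer cv, scikit-learn uses stratified folds.
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The automated helper should reproduce the same per-fold accuracies.
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(manual_scores)
print(auto_scores)
```

Each sample is used for validation exactly once and for training in the other four rounds, so no score is computed on data the model was fitted on.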

## Training the new models

Both models clearly outperform the baseline in cross-validation (mean accuracy of roughly 0.59 versus 0.43), so we can fit each model on all the data in the training set and test against the testing set:

In [3]:
# fit new model
baseline_classifier.fit(X_train, y_train)
nb_classifier.fit(X_train, y_train)
regression_classifier.fit(X_train, y_train)

# model.predict() returns class labels (integers)
y_pred_baseline = baseline_classifier.predict(X_test)
y_pred_nb = nb_classifier.predict(X_test)
y_pred_reg = regression_classifier.predict(X_test)


## Score comparison between the three models
The scores show that the naive Bayes classifier outperforms the baseline, 
and that logistic regression scores slightly higher than naive Bayes. 

For this comparison we used five of the dataset's features (column indices 1, 2, 6, 9, and 10), but you can try other feature subsets to test the models. 
Adding features may increase accuracy, but it requires more computation for the model to converge. 
Removing features may lose accuracy. There is a trade-off between computation cost and accuracy. 
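
One way to explore the feature trade-off described above is to cross-validate the same model on different column subsets. The sketch below assumes synthetic data from `make_classification` rather than the wine-quality dataset, and the subsets in `feature_subsets` are hypothetical choices made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption; not the wine-quality dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=2,
    n_classes=3, random_state=0,
)

# Hypothetical feature subsets to compare, given as column indices.
feature_subsets = {
    "first 2 features": [0, 1],
    "first 5 features": [0, 1, 2, 3, 4],
    "all 10 features": list(range(10)),
}

subset_scores = {}
for name, columns in feature_subsets.items():
    # 5-fold CV accuracy using only the selected columns
    scores = cross_val_score(GaussianNB(), X_demo[:, columns], y_demo, cv=5)
    subset_scores[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

With the notebook's pandas data, the equivalent selection would be `dataset.iloc[:, columns]`. Whether more features actually help depends on the data, so the comparison is worth running rather than assuming.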

In [4]:
from sklearn.metrics import precision_score, f1_score, recall_score, accuracy_score
from pandas import DataFrame
from IPython.display import display 

data_dict = {
    # the fourth entry of each list is accuracy_score, so label it 'Accuracy'
    'Metrics': ['Precision', 'Recall', 'F1', 'Accuracy'],
    'Baseline': [
        precision_score(y_test, y_pred_baseline, average='weighted'),
        recall_score(y_test, y_pred_baseline, average='weighted'),
        f1_score(y_test, y_pred_baseline, average='weighted'),
        accuracy_score(y_test, y_pred_baseline)
    ],
    "Naive Bayes": [
        precision_score(y_test, y_pred_nb, average='weighted'),
        recall_score(y_test, y_pred_nb, average='weighted'),
        f1_score(y_test, y_pred_nb, average='weighted'),
        accuracy_score(y_test, y_pred_nb)
    ],
    "Logistic Regression": [
        precision_score(y_test, y_pred_reg, average='weighted'),
        recall_score(y_test, y_pred_reg, average='weighted'),
        f1_score(y_test, y_pred_reg, average='weighted'),
        accuracy_score(y_test, y_pred_reg)
    ]
}
table = np.around(DataFrame(data_dict), 2)
print(table.to_string(index=False))

  Metrics  Baseline  Naive Bayes  Logistic Regression
Precision      0.16         0.53                 0.56
   Recall      0.40         0.55                 0.58
       F1      0.23         0.54                 0.55
 Accuracy      0.40         0.55                 0.58


  _warn_prf(average, modifier, msg_start, len(result))


**Comment:**

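* The naive Bayes classifier and logistic regression perform comparably, with logistic regression slightly ahead (accuracy 0.58 versus 0.55); both clearly outperform the baseline (0.40)
* The warning is raised because some classes are never predicted, so their precision is undefined and is set to 0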
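
The warning can be reproduced on a toy example in which one class is never predicted. The sketch below uses made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not taken from the wine data) and passes `zero_division=0`, which makes the fallback value explicit so that scikit-learn does not warn. The same example also shows that weighted recall equals accuracy, which is why those two rows of the table above contain identical values.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for illustration: class 1 occurs but is never predicted.
y_true_demo = [0, 1, 2, 2]
y_pred_demo = [0, 0, 2, 2]

# Per-class precision; class 1 has no predicted samples, so its precision
# is undefined and zero_division=0 sets it to 0 without a warning.
per_class = precision_score(y_true_demo, y_pred_demo, average=None, zero_division=0)

# Weighted precision averages the per-class values by class support.
weighted = precision_score(y_true_demo, y_pred_demo, average="weighted", zero_division=0)

# Weighted recall reduces to overall accuracy.
weighted_recall = recall_score(y_true_demo, y_pred_demo, average="weighted")
accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(per_class, weighted, weighted_recall, accuracy)
```

Inspecting the per-class values (or `classification_report`, which the notebook already imports) shows which quality levels a model never predicts.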