# Validation
- Used to select the final ML model that will make predictions on unseen data

## Test Splits

### Training/Testing dataset
- Split the data into a training set used to fit the model and an unseen testing set used to evaluate it later
- When you’re just getting started, stick with a simple split of train and test data (such as 66%/34%) and move on to cross validation once you have more confidence.
- In the example below, we split the loaded dataset into two parts: 80% to train our models and 20% held back as a validation dataset.

In [None]:
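# Split-out validation dataset
# Assumes `dataset` is the DataFrame loaded earlier: 4 feature columns followed by the class label
from sklearn import model_selection

array = dataset.values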
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

### Cross Validation
- Separate the dataset into a number of equally sized groups of instances (called folds)
- The model is trained on all folds except one, and the trained model is then tested on the fold that was left out
- The process is repeated so that each fold gets an opportunity at being left out and acting as the test dataset
- Finally, the performance measures are averaged across all folds to estimate the capability of the algorithm on the problem.
- e.g. 3-fold cross validation involves training and testing a model 3 times
    - #1: Train on folds 1+2, test on fold 3
    - #2: Train on folds 1+3, test on fold 2
    - #3: Train on folds 2+3, test on fold 1
- You should choose a value for k that splits the data into groups with enough rows that each group is still representative of the original dataset
- A good default to use is k=3 for a small dataset or k=10 for a larger dataset
- A quick way to check if the fold sizes are representative is to calculate summary statistics such as mean and standard deviation and see how much the values differ from the same statistics on the whole dataset.
- Ideally, the number of rows in your training dataset should be divisible by k, so that each of the k groups has the same number of rows.
- fold size = total rows / total folds
- The number of folds can vary based on the size of your dataset, but common numbers are 3, 5, 7 and 10 folds. The goal is to have a good balance between the size and representation of data in your train and test sets.
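
The fold mechanics described above can be sketched in plain Python. This is a minimal from-scratch illustration, not scikit-learn's implementation; the helper names `kfold_indices` and `train_test_splits` and the toy `values` column are hypothetical. It shuffles the row indices, cuts them into k folds, holds each fold out once, and compares each test fold's mean against the mean of the whole dataset as a rough representativeness check.

```python
import random
from statistics import mean


def kfold_indices(n_rows, k, seed=7):
    """Shuffle row indices and cut them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    fold_size, remainder = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        # When n_rows is not divisible by k, the first `remainder` folds get one extra row
        stop = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:stop])
        start = stop
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out each fold exactly once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical toy data: 12 rows of a single numeric column
values = [float(v) for v in range(1, 13)]
folds = kfold_indices(len(values), k=3)
splits = list(train_test_splits(folds))

print("whole dataset mean:", mean(values))
for i, (train, test) in enumerate(splits, start=1):
    fold_mean = mean(values[idx] for idx in test)
    print(f"split {i}: train={len(train)} rows, test={len(test)} rows, "
          f"test-fold mean={fold_mean:.2f}")
```

With 12 rows and k=3, every fold has exactly 4 rows (fold size = total rows / total folds). Calling `kfold_indices(10, 3)` shows what happens when the row count is not divisible by k: the folds end up with 4, 3 and 3 rows.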

#### Example of k-folds validation
- We will use 10-fold cross validation to estimate accuracy.
- This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.
- Evaluate 6 different algorithms:
    - Logistic Regression (LR)
    - Linear Discriminant Analysis (LDA)
    - K-Nearest Neighbors (KNN)
    - Classification and Regression Trees (CART)
    - Gaussian Naive Bayes (NB)
    - Support Vector Machines (SVM)
- This is a mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms.
- We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. It ensures the results are directly comparable.

In [None]:
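import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

seed = 7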
scoring = 'accuracy'

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
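    # shuffle=True is needed for random_state to take effect; recent scikit-learn
    # versions raise an error if random_state is set while shuffle is False
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)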
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

- The KNN algorithm was the most accurate model that we tested. Now we want to get an idea of its accuracy on our validation set.
- This will give us an independent final check on the accuracy of the best model.
- It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.
- We can run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.

In [None]:
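from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Fit KNN on the training split, then score it once on the held-back validation split
knn = KNeighborsClassifier()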
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

#### Another example
- Retrieve the dataset via URL and set the column names
- Running the example prints each algorithm's short name with the mean and the standard deviation of its accuracy.
- The example also provides a box and whisker plot showing the spread of the accuracy scores across each cross validation fold for each algorithm.

In [None]:
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# prepare configuration for cross validation test harness
seed = 7

# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
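    # shuffle=True is needed for random_state to take effect; recent scikit-learn
    # versions raise an error if random_state is set while shuffle is False
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)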
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

## Evaluation
- Model diagnostics: learning curves, partial dependence plots, feature importances, ROC curves and other diagnostics are extremely useful to generate automatically.
- Classification
    - Classification accuracy is the ratio of correct predictions to total predictions made.
    - Confusion matrix is a summary of prediction results on a classification problem: number of correct and incorrect predictions are summarized with count values and broken down by each class. 
- Regression
    - Regression problems are those where a real value is predicted.
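    - R^2 (coefficient of determination): the proportion of the variance in the actual values that the predictions explain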
    - An easy metric to consider is the error in the predicted values as compared to the expected values.
    - Mean Absolute Error (MAE)
        - A good first error metric to use
        - Calculated as the average of the absolute error values, where “absolute” means “made positive” so that they can be added together.
        - MAE = sum( abs(predicted_i - actual_i) ) / total predictions
    - Root Mean Squared Error (RMSE, the square root of the Mean Squared Error or MSE)
        - Calculated as the square root of the mean of the squared differences between actual outcomes and predictions.
        - Squaring each error forces the values to be positive, and the square root of the mean squared error returns the error metric back to the original units for comparison.
        - RMSE values are always at least as large as MAE values, and the gap becomes more pronounced as the prediction errors increase.
        - This is a benefit of using RMSE over MAE in that it penalizes larger errors with worse scores.
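
The metrics in this section can be sketched in a few lines of plain Python. This is an illustrative from-scratch version under simple assumptions (equal-length lists, no missing values); the function names `accuracy`, `confusion_counts`, `mae` and `rmse` and the toy inputs are hypothetical, and in practice the equivalents in `sklearn.metrics` would normally be used instead.

```python
from math import sqrt


def accuracy(actual, predicted):
    """Ratio of correct predictions to total predictions made."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def confusion_counts(actual, predicted):
    """Count predictions per class: rows are actual labels, columns are predicted labels."""
    labels = sorted(set(actual) | set(predicted))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return labels, matrix


def mae(actual, predicted):
    """MAE = sum(abs(predicted_i - actual_i)) / total predictions."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean of the squared differences."""
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical classification example: 3 of 5 predictions are correct
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
labels, matrix = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(y_true, y_pred))
print("labels:", labels, "confusion matrix:", matrix)

# Hypothetical regression example: three exact predictions and one that is off by 4
r_true = [1.0, 2.0, 3.0, 4.0]
r_pred = [1.0, 2.0, 3.0, 8.0]
print("MAE:", mae(r_true, r_pred))
print("RMSE:", rmse(r_true, r_pred))
```

In the regression example the single large error gives an MAE of 1.0 but an RMSE of 2.0, which shows how RMSE weights large errors more heavily than MAE does.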