# <font color="blue">Lesson 6 - Feature Engineering and Selection</font>

# What is Cross Validation? 
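We've actually been using a simple form of validation this entire time: whenever you held out a percentage of your dataset for testing, you were using the holdout method. Cross validation extends that idea by repeating the split, so that every data point is eventually used for testing.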

There are a few different approaches to cross validation that can help you measure model performance more accurately. Let's take a look at some of the most common.

## Leave One Out
With this method, we hold out just one data point from our dataset and train the model on the rest of the data. We repeat this process for each data point.

- Nearly all of the data (n - 1 points) is used for training in every iteration, so the bias of the performance estimate is low.
- The cross validation process is repeated n times, once per data point, which results in a longer execution time.
- Each iteration tests against a single data point, so the individual error estimates vary widely. If the held-out point turns out to be an outlier, the error for that iteration can be far from typical.


In [None]:
import numpy as np
from sklearn.model_selection import LeaveOneOut

# create fake training data
X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])

In [None]:
X

In [None]:
y

In [None]:
# instantiate the leave-one-out splitter (it generates index splits; it is not a model)
loo = LeaveOneOut()

# number of splits equals the number of data points
print("splits:", loo.get_n_splits(X))

# hold out each data point in turn
for train_index, test_index in loo.split(X):
    print("train:", train_index, "validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
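
The loop above only produces the splits. To turn them into a performance estimate, you would fit a model on each training split and score it on the single held-out point. Here is a minimal sketch of that idea. The toy data, the choice of `LinearRegression`, and the `_demo` variable names are all assumptions made for illustration, not part of the lesson's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

# hypothetical toy data: y = 2x exactly, so a linear model can fit it perfectly
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

loo_demo = LeaveOneOut()
squared_errors = []

for train_index, test_index in loo_demo.split(X_demo):
    # fit on the n - 1 remaining points
    model = LinearRegression()
    model.fit(X_demo[train_index], y_demo[train_index])

    # score on the single held-out point
    prediction = model.predict(X_demo[test_index])
    squared_errors.append((prediction[0] - y_demo[test_index][0]) ** 2)

# average the n recorded errors into one estimate
loo_mse = float(np.mean(squared_errors))
print("number of splits:", len(squared_errors))
print("leave-one-out MSE:", loo_mse)
```

Because this toy data is perfectly linear, the error should come out at essentially zero. With real data, each entry of `squared_errors` depends on a single point, which is why the individual values can swing so much.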

## K-fold 
This is one of the most common validation methods. 

1. Randomly split the dataset into k "folds".
2. For each fold, build your model on the other k - 1 folds of the dataset.
3. Test the model on the held-out fold.
4. Record the error you see on its predictions.
5. Repeat until each of the k folds has served as the test set.
6. Calculate the cross-validation error, which is the average of your k recorded errors.
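
The steps above can be sketched end to end as follows. This is an illustrative example under stated assumptions: the noisy linear toy data, the `LinearRegression` model, mean squared error as the error measure, and the `_demo` names are all hypothetical choices. Note that `KFold` only splits randomly when you pass `shuffle=True`; by default it does not shuffle.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# hypothetical toy data: a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + rng.normal(scale=1.0, size=20)

# step 1: randomly split into k folds (shuffle=True randomizes the assignment)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []

# step 5: the loop visits each fold once as the test set
for train_index, test_index in kf_demo.split(X_demo):
    # step 2: build the model on the other k - 1 folds
    model = LinearRegression().fit(X_demo[train_index], y_demo[train_index])

    # step 3: test the model on the held-out fold
    predictions = model.predict(X_demo[test_index])

    # step 4: record the error for this fold
    fold_errors.append(mean_squared_error(y_demo[test_index], predictions))

# step 6: the cross-validation error is the average of the k recorded errors
cv_error = float(np.mean(fold_errors))
print("per-fold MSE:", fold_errors)
print("cross-validation MSE:", cv_error)
```

Looking at `fold_errors` alongside `cv_error` is useful: a large spread between folds suggests the estimate is sensitive to how the data happened to be split.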

In [None]:
from sklearn.model_selection import KFold

# create a dataset
X = ["a", "b", "c", "d"]

# instantiate the k-fold splitter
kf = KFold(n_splits=2)

# split the dataset into k folds and view the indices
for train, test in kf.split(X):
    print("train:", train, "test:", test)

Each split yields two arrays of indices: the first selects the training set, and the second selects the test set. Because `KFold` does not shuffle by default, the folds here are consecutive blocks of the data.
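
Since the arrays hold positions rather than values, you need to index into your data to get the actual items. A small sketch, assuming the same four items are stored in a NumPy array (a plain Python list cannot be indexed with an index array); the names `items`, `kf_items`, and `folds` are hypothetical:

```python
import numpy as np
from sklearn.model_selection import KFold

# same four items as above, stored as a NumPy array so index arrays can select from it
items = np.array(["a", "b", "c", "d"])

kf_items = KFold(n_splits=2)
folds = []

for train_index, test_index in kf_items.split(items):
    # use the index arrays to pull out the actual training and test items
    train_items = items[train_index]
    test_items = items[test_index]
    folds.append((train_items.tolist(), test_items.tolist()))
    print("train:", train_items, "test:", test_items)
```

With two unshuffled folds, the first split tests on "a" and "b" and trains on "c" and "d"; the second split swaps the roles.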

## Consider this
Which method is more robust to an outlier? Re-run this simulation with an obvious outlier in your X array.