# Cross Validation 

## Intro

Every time we build a model, we have to check whether it works properly. Testing it on the data that was used to build it is a bad idea, though: performance should be measured on a separate dataset that was not used for training. That is why, before doing anything else, we split the whole dataset into training and testing sets.

Once we have built a model, it is tempting to test it on the testing set over and over, tweaking it until it looks accurate enough to use. <br>
This is also bad practice. Tuning against the testing set makes the model overfit to it, so it may perform poorly when new data comes in. For this reason, the testing set should be used only once, at the very end of the whole training and testing process.

If we cannot use the testing set, how can we check whether a model works well? We can use neither the training set as it is nor the testing set. This is what cross validation is for, and the idea is simple. We divide the training set into $k$ subsets (folds), train a model on $k-1$ of them, and compute an error value on the remaining one. Then we initialize a new model, train it on a different combination of $k-1$ folds, and test it on the fold that was left out. We repeat this until every fold has served as the held-out set exactly once, which gives us $k$ error values. Their mean becomes the final performance score (or error value) of the model. Cross validation also lets us compare different subsets of features and select the subset with the lowest error.<br>
The following is pseudo-code for cross validation.
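A minimal Python sketch of the procedure, assuming `x` and `y` are NumPy arrays and `k` is at least 2. The `fit`, `predict`, and `error` callables are hypothetical placeholders for whatever model and metric are being evaluated; they are not part of any library.

```python
import numpy as np

def k_fold_cv(x, y, k, fit, predict, error):
    """Return the mean validation error over k folds (k >= 2).

    x, y    : training features and labels (NumPy arrays)
    fit     : hypothetical callable, fit(x, y) -> trained model
    predict : hypothetical callable, predict(model, x) -> predictions
    error   : hypothetical callable, error(y_true, y_pred) -> float
    """
    # split the row positions into k nearly equal consecutive blocks
    folds = np.array_split(np.arange(len(x)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train_idx], y[train_idx])   # fresh model for every fold
        y_hat = predict(model, x[val_idx])        # predict the held-out fold
        errors.append(error(y[val_idx], y_hat))
    return float(np.mean(errors))

# toy usage: a "model" that always predicts the mean of its training labels
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * x[:, 0]
score = k_fold_cv(
    x, y, k=5,
    fit=lambda xs, ys: ys.mean(),
    predict=lambda m, xs: np.full(len(xs), m),
    error=lambda t, p: float(np.sqrt(np.mean((t - p) ** 2))),
)
print(score)
```

Passing the model as callables keeps the sketch independent of any particular library; the scikit-learn based implementation later in this section follows the same loop.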

In the code above, `x` is the whole training set (not the whole dataset) and `y` holds the corresponding labels.
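Written as a formula, the final score is simply the mean of the $k$ fold errors. The notation here is introduced only for illustration: $V_i$ is the $i$-th held-out fold, $\hat{f}^{(-i)}$ is the model trained on the other $k-1$ folds, and the error measure is assumed to be RMSE, as in the implementation later in this section.

$$
E_i = \sqrt{\frac{1}{|V_i|} \sum_{(x_j,\, y_j) \in V_i} \left( y_j - \hat{f}^{(-i)}(x_j) \right)^2},
\qquad
\mathrm{CV}_k = \frac{1}{k} \sum_{i=1}^{k} E_i
$$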

## Coding

We can easily do cross validation with an existing library (scikit-learn). But since the idea is simple and the implementation is easy too, let's implement it ourselves first and then see how to use the library. <br>
Again, we will use linear regression with the housing price data.

### Import libraries and Load and Prep Data

In [107]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [2]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

In [3]:
X = train.loc[:, list(train)[:-1]]
y = train.loc[:, 'SalePrice']

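In [15]:
# hold out 20% of the rows
# train_test_split returns X_train, X_test, y_train, y_test (in that order)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

Now, let's define the necessary functions.

In [100]:
# Root Mean Squared Error
def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat)**2))

def get_index(x, size):
    """Split the row positions 0..len(x)-1 into consecutive folds of `size` rows.

    Returns a list of [training, validation] pairs of position arrays.
    """
    index = []
    validation = []
    r = set(range(len(x)))

    for i in range(len(x)):
        validation.append(i)

        # close the fold when it is full or when we reach the last row
        if (i+1) % size == 0 or i == len(x)-1:
            training = sorted(r - set(validation))
            index.append([np.array(training), np.array(validation)])
            validation = []

    return index

def cross_validation(x, y, k):
    # number of rows per fold (the last fold may be smaller)
    size = int(np.ceil(len(x) / k))

    # list of [training, validation] position arrays, one pair per fold
    index = get_index(x, size)
    error = []

    for training, validation in index:
        # start from a fresh model for every fold
        model = LinearRegression()
        model.fit(x.iloc[training], y.iloc[training])

        # score the model on the held-out fold
        y_hat = model.predict(x.iloc[validation])
        error.append(rmse(y.iloc[validation], y_hat))

    # the mean of the fold errors is the cross validation score
    return np.mean(error)


In [105]:
X_test.shape

(292, 80)

In [106]:
# LinearRegression needs numeric features without missing values, so as a crude
# stand-in for real preprocessing we keep the numeric columns and fill NaNs with 0
X_num = X_train.select_dtypes(include='number').fillna(0)
score = cross_validation(X_num, y_train, 5)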
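For comparison, here is a sketch of how the same kind of evaluation might look with scikit-learn's built-in `KFold` and `cross_val_score`. To keep it self-contained it runs on synthetic data from `make_regression` instead of the housing CSV, so the names `X_demo` and `y_demo` and the dataset sizes are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in for the housing data (assumption: 200 rows, 5 numeric features)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# 5 consecutive folds without shuffling, similar in spirit to get_index
folds = KFold(n_splits=5, shuffle=False)

# scikit-learn scores follow "higher is better", so RMSE comes back negated
neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=folds, scoring="neg_root_mean_squared_error",
)
cv_rmse = -neg_rmse.mean()
print("per-fold RMSE:", -neg_rmse)
print("mean RMSE:", cv_rmse)

# the same call can compare feature subsets, here only the first two columns
neg_rmse_sub = cross_val_score(
    LinearRegression(), X_demo[:, :2], y_demo,
    cv=folds, scoring="neg_root_mean_squared_error",
)
cv_rmse_sub = -neg_rmse_sub.mean()
print("mean RMSE (2 features):", cv_rmse_sub)
```

`cross_val_score` clones the estimator for each fold, which corresponds to creating a new `LinearRegression()` inside the loop of the hand-written version. Comparing `cv_rmse` with `cv_rmse_sub` is one simple way to judge which feature subset gives the lower cross validation error.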