# Cross Validation

Cross-validation is the process of splitting a dataset into several parts for training and validation. We train the model on
some of these parts and test it on the remaining parts.

Choosing the right cross-validation scheme
depends on the dataset you are dealing with, and a scheme that works well
on one dataset may or may not apply to other datasets.

The most popular and widely used cross-validation techniques include:
- k-fold cross-validation
- stratified k-fold cross-validation
- hold-out based validation
- leave-one-out cross-validation
- group k-fold cross-validation
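
The last two items in the list are not demonstrated elsewhere in this section, so here is a minimal sketch of both, assuming scikit-learn's `LeaveOneOut` and `GroupKFold` splitters. The toy arrays and the `groups` values (think of one id per patient or customer) are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn import model_selection

# hypothetical toy data: 6 samples that belong to 3 groups (2 rows per group)
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

# leave-one-out: every sample is used exactly once as a one-row validation set
loo = model_selection.LeaveOneOut()
loo_splits = list(loo.split(X))
print("leave-one-out splits:", len(loo_splits))

# group k-fold: all rows of a group end up on the same side of the split
gkf = model_selection.GroupKFold(n_splits=3)
group_splits = list(gkf.split(X, y, groups=groups))
for fold, (trn_idx, val_idx) in enumerate(group_splits):
    print(fold, "train groups:", set(groups[trn_idx]), "valid groups:", set(groups[val_idx]))
```

Leave-one-out produces as many splits as there are samples, so it is usually only practical for small datasets.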

We can divide the data into k different sets which are exclusive of each other. This
is known as **k-fold cross-validation.**

We can split any data into k equal parts using `KFold` from scikit-learn. Each sample
is assigned a fold number from 0 to k-1 when using k-fold cross-validation.

Dataset -> randomize rows -> Dataset Randomized -> Select K equal parts

We can use this process with almost all kinds of datasets. For example, when you
have images, you can create a CSV with image id, image location and image label
and use the process above.

In [None]:
import pandas as pd
from sklearn import model_selection

# Run the code below
if __name__ == '__main__':
    
    #Load dataset
    df = pd.read_csv("train.csv")
    
    # create new column called kfold and fill it with -1
    df['kfold'] = -1
    
    # randomise dataset
    df = df.sample(frac=1).reset_index(drop=True)
    
    # initiate kfold class
    kf = model_selection.KFold(n_splits=5)
    
    # fill the new kfold column
    for fold, (trn_, val_) in enumerate(kf.split(X=df)):
        df.loc[val_, 'kfold'] = fold
    
    # save the new csv with kfold column
    df.to_csv("train_folds.csv", index=False)
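
Once every row carries a fold number, one fold can serve as validation data while the remaining folds are used for training. The sketch below shows that selection step on a small in-memory frame instead of `train.csv`; the helper names `add_kfold_column` and `split_fold`, the toy columns and the fixed `random_state` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection


def add_kfold_column(df, n_splits=5, seed=42):
    # shuffle the rows, then assign a fold number to every row
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    df["kfold"] = -1
    kf = model_selection.KFold(n_splits=n_splits)
    for fold, (_, val_idx) in enumerate(kf.split(X=df)):
        df.loc[val_idx, "kfold"] = fold
    return df


def split_fold(df, fold):
    # rows of the chosen fold are validation data, everything else is training data
    train = df[df.kfold != fold].reset_index(drop=True)
    valid = df[df.kfold == fold].reset_index(drop=True)
    return train, valid


# hypothetical toy dataset with 20 rows
toy = pd.DataFrame({"feature": np.arange(20), "target": np.arange(20) % 2})
toy = add_kfold_column(toy)

for fold in range(5):
    train, valid = split_fold(toy, fold)
    print(f"fold={fold} train rows={len(train)} valid rows={len(valid)}")
```

In a real run you would fit the model on `train` and compute the metric on `valid` inside the loop, then compare the metric across folds.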

**Stratified k-fold**. If you have a skewed dataset for binary classification with 90% positive samples and only 10% negative samples, random k-fold does not guarantee that each fold keeps this ratio, and a fold may end up with very few negative samples.

Stratified k-fold cross-validation keeps the ratio of labels in each fold constant. So,
in each fold, you will have the same 90% positive and 10% negative samples. Thus, whatever metric you choose to evaluate, it will give similar results across all folds.

**We assume that** our CSV dataset has a column called “target” and it is a classification problem!

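The code is the same as for k-fold, except that `model_selection.KFold` is replaced with `model_selection.StratifiedKFold`, which needs the targets in order to stratify on them.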


In [None]:
import pandas as pd
from sklearn import model_selection

# Run the code below
if __name__ == '__main__':
    
    #Load dataset
    df = pd.read_csv("train.csv")
    
    # create new column called kfold and fill it with -1
    df['kfold'] = -1
    
    # randomize dataset
    df = df.sample(frac=1).reset_index(drop=True)
    
    # initiate kfold class
    kf = model_selection.StratifiedKFold(n_splits=5)
    
    # fill the new kfold column
    # StratifiedKFold needs the targets (y) to keep the label ratio per fold
    for fold, (trn_, val_) in enumerate(kf.split(X=df, y=df.target.values)):
        df.loc[val_, 'kfold'] = fold
    
    # save the new csv with kfold column
    df.to_csv("train_folds.csv", index=False)
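
To check that stratification does what the text claims, we can look at the share of positive labels in every fold. The sketch below uses a hypothetical in-memory dataset with 900 positive and 100 negative rows instead of `train.csv`; the column names and the fixed `random_state` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# hypothetical skewed dataset: 90% positive, 10% negative
df = pd.DataFrame({
    "feature": np.arange(1000),
    "target": np.array([1] * 900 + [0] * 100),
})

# randomize rows and assign stratified folds
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
df["kfold"] = -1
skf = model_selection.StratifiedKFold(n_splits=5)
for fold, (_, val_idx) in enumerate(skf.split(X=df, y=df.target.values)):
    df.loc[val_idx, "kfold"] = fold

# share of positive samples per fold; each should be close to 0.9
positive_rate = df.groupby("kfold")["target"].mean()
print(positive_rate)
```

Running the same check with plain `KFold` on your own data is a quick way to see whether stratification is needed.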

**For a large amount of data**, we can opt for **hold-out based validation**: a fixed part of the data is set aside for validation and the model is trained on the rest.

This method is also very common with **time-series data**. Let’s say our job is to predict the sales from
time step 31 to 40. We can then keep steps 21 to 30 as hold-out and train our model on
steps 0 to 20. You should note that when you make the final predictions for steps 31 to 40, you
should also train the model on the data from steps 21 to 30; otherwise, performance will
be sub-par.
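
The time-based split described above can be sketched as follows. The `sales` frame, its column names and its values are made up for illustration; only the step boundaries (0 to 20 for training, 21 to 30 as hold-out) come from the text.

```python
import numpy as np
import pandas as pd

# hypothetical sales history for time steps 0..30 (values are made up)
sales = pd.DataFrame({
    "step": np.arange(31),
    "sales": np.linspace(100.0, 130.0, 31),
})

# validate on the most recent known steps, train on everything before them
train = sales[sales.step <= 20]
holdout = sales[(sales.step >= 21) & (sales.step <= 30)]

# after validation, refit on all known steps before predicting steps 31..40
final_train = sales[sales.step <= 30]

print(len(train), len(holdout), len(final_train))
```

Note that the rows are not shuffled here: the hold-out set must come after the training data in time.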

## Regression Problems

Now we can move to regression. The good thing about regression problems is that
we can use all the cross-validation techniques mentioned above for regression
problems except for stratified k-fold. That is, **we cannot use stratified k-fold directly**,
but there are ways to change the problem a bit so that we can use stratified k-fold
for regression problems. Mostly, simple k-fold cross-validation works for any
regression problem. However, **if you see that the distribution of targets is not
consistent, you can use stratified k-fold**.

To use stratified k-fold for a regression problem, **we first have to divide the target
into bins**, and then we can use stratified k-fold in the same way as for classification
problems. There are several choices for selecting the appropriate number of bins. If
you have a **lot of samples** (> 10k, > 100k), then you don’t need to care about the
number of bins. Just **divide the data into 10 or 20 bins.** If you do **not have a lot of
samples**, you can use a simple rule like **Sturges’ rule to calculate the appropriate
number of bins**: number of bins = 1 + log2(N), where N is the number of samples.
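
As a quick illustration of Sturges' rule, the sketch below computes the number of bins for a few sample sizes. The helper name `sturges_bins` is hypothetical, and taking the floor of the result is one possible choice; rounding would also work.

```python
import numpy as np


def sturges_bins(n_samples):
    # Sturges' rule: 1 + log2(N), floored to get an integer number of bins
    return int(np.floor(1 + np.log2(n_samples)))


for n in (100, 1000, 15000):
    print(f"{n} samples -> {sturges_bins(n)} bins")
```

The number of bins grows only logarithmically with the number of samples, which is why a fixed 10 or 20 bins is a reasonable shortcut for large datasets.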

Let’s make a sample regression dataset and try to apply stratified k-fold as shown
in the following Python snippet.

In [2]:
import numpy as np
import pandas as pd

from sklearn import datasets, model_selection

In [34]:
def create_folds(data):
    # we create a new column called kfold and fill it with -1
    data["kfold"] = -1
    
    # the next step is to randomize the rows of the data
    data = data.sample(frac=1).reset_index(drop=True)
    
    # calculate the number of bins by Sturges' rule
    # I take the floor of the value, you can also
    # just round it
    num_bins = int(np.floor(1 + np.log2(len(data))))
    
    # bin targets
    data.loc[:, "bins"] = pd.cut(data["target"], bins=num_bins, labels=False)
    
    # initiate the kfold class from model_selection module
    kf = model_selection.StratifiedKFold(n_splits=5)
    
    # fill the new kfold column
    # note that, instead of targets, we use bins!
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, 'kfold'] = f
    
    # drop the bins column
    data = data.drop("bins", axis=1)
    
    # return dataframe with folds
    return data

In [35]:
if __name__ == "__main__":
    # we create a sample dataset with 15000 samples
    # and 100 features and 1 target
    X, y = datasets.make_regression(n_samples=15000, n_features=100, n_targets=1)
    
    # create a dataframe out of our numpy arrays
    df = pd.DataFrame(X, columns=[f"f_{i}" for i in range(X.shape[1])])
    df.loc[:, "target"] = y

    # create folds
    df = create_folds(df)
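
A simple sanity check for binned stratification is to compare the number of rows and the target statistics across folds. The sketch below repeats the binning steps on a smaller synthetic dataset so that it runs on its own; the dataset size, the `random_state` and the `summary` name are assumptions. scikit-learn may warn that the sparsest bins have fewer members than `n_splits`, which is expected with equal-width bins.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, model_selection

# small synthetic regression problem (sizes chosen only for illustration)
X, y = datasets.make_regression(
    n_samples=1000, n_features=10, n_targets=1, random_state=42
)
df = pd.DataFrame(X, columns=[f"f_{i}" for i in range(X.shape[1])])
df["target"] = y

# bin the target with Sturges' rule, then stratify on the bins
num_bins = int(np.floor(1 + np.log2(len(df))))
df["bins"] = pd.cut(df["target"], bins=num_bins, labels=False)
df["kfold"] = -1
kf = model_selection.StratifiedKFold(n_splits=5)
for fold, (_, val_idx) in enumerate(kf.split(X=df, y=df.bins.values)):
    df.loc[val_idx, "kfold"] = fold

# rows per fold plus mean and standard deviation of the target per fold
summary = df.groupby("kfold")["target"].agg(["count", "mean", "std"])
print(summary)
```

If the folds are well balanced, the counts are nearly equal and the per-fold mean and standard deviation of the target are close to each other.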



**Cross-validation is the first and most essential step** when it comes to building
machine learning models. 

If you want to do feature engineering, **split your data first**.
If you're going to build models, split your data first. 

If you have a good cross-validation scheme in which validation data is representative of training and
real-world data, you will be able **to build a good machine learning model** which is highly generalizable.