**Using StratifiedKFold cross-validation for regression**

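We can't use `StratifiedKFold` directly on a regression problem, because it stratifies on discrete class labels and a regression target is continuous. There are, however, ways to change the problem slightly so that stratified k-fold can be used for regression. In most cases, simple k-fold cross-validation works for a regression problem. However, if we see that the distribution of targets is uneven (for example, heavily skewed), so that random folds may not each cover the full range of target values, we can use stratified k-fold.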

To use stratified k-fold for a regression problem, we first have to divide the target into bins; we can then use stratified k-fold in the same way as for classification, with the bin index serving as the class label. There are several ways to choose the number of bins. If we have a lot of samples (more than 10k, or more than 100k), we don't need to worry much about the number of bins: just divide the data into 10 or 20 bins. If we do not have a lot of samples, we can use a simple rule such as **Sturges' rule** to calculate an appropriate number of bins.
- Sturges' rule:
$\text{Number of bins} = 1 + \log_2(N)$

where $N$ is the number of samples in our dataset.
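
As a quick worked example of the rule, the sketch below evaluates it for a few dataset sizes. The helper name `sturges_bins` is hypothetical (it is not part of any library), and taking the floor is just one rounding choice.

```python
import math

def sturges_bins(n_samples: int) -> int:
    """Number of bins from Sturges' rule, rounded down (hypothetical helper)."""
    if n_samples < 1:
        raise ValueError("n_samples must be a positive integer")
    return int(math.floor(1 + math.log2(n_samples)))

# Evaluate the rule for a few dataset sizes
for n in (100, 1_000, 15_000):
    print(n, sturges_bins(n))
```

For 15,000 samples, $1 + \log_2(15000) \approx 14.87$, so taking the floor gives 14 bins.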

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection

def create_folds(data):
    # we create a new column called kfold and fill it with -1
    data['kfold'] = -1

    # the next step is to randomize the rows of the data
    data = data.sample(frac=1).reset_index(drop=True)

    # calculate the number of bins by Sturges' rule
    # we take the floor of the value; one can also just round it
    # pd.cut expects an integer number of bins, so we cast to int
    num_bins = int(np.floor(1 + np.log2(len(data))))

    # bin the target
    data.loc[:, 'bins'] = pd.cut(
        data['target'], bins=num_bins, labels=False
    )

    # initiate the StratifiedKFold class
    kf = model_selection.StratifiedKFold(n_splits=5)

    # fill the new kfold column
    # note that instead of target we use bins
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, 'kfold'] = f

    # drop the bins column
    data = data.drop('bins', axis=1)
    # return dataframe
    return data
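
One way to check that binning plus stratification behaves as intended is to count how many samples from each bin land in each fold. The sketch below does this on a small toy dataset; the uniform target, the 10 bins, and the name `toy` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 1,000 targets drawn uniformly from [0, 100)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"target": rng.uniform(0, 100, size=1000)})

# Bin the continuous target into 10 equal-width bins (labels 0..9)
toy["bins"] = pd.cut(toy["target"], bins=10, labels=False)

# Stratify on the bin labels, not on the raw target
toy["kfold"] = -1
skf = StratifiedKFold(n_splits=5)
for fold, (_, valid_idx) in enumerate(skf.split(X=toy, y=toy["bins"])):
    toy.loc[valid_idx, "kfold"] = fold

# Rows: bins, columns: folds, values: number of samples
bin_counts = pd.crosstab(toy["bins"], toy["kfold"])
print(bin_counts)
```

Each row of the printed table should show roughly equal counts across the five folds, which is what stratifying on the bins is meant to achieve.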

In [5]:
if __name__ == '__main__':
    # we create a sample dataset with 15000 samples,
    # 100 features and 1 target
    X, y = datasets.make_regression(
        n_samples=15000, n_features=100, n_targets=1
    )
    # create a dataframe out of our numpy arrays
    df = pd.DataFrame(
        X,
        columns=[f"f_{i}" for i in range(X.shape[1])]
    )
    df.loc[:, 'target'] = y

    # create folds
    df = create_folds(df)

In [6]:
df.head()

|  | f_0 | f_1 | f_2 | f_3 | f_4 | f_5 | f_6 | f_7 | f_8 | f_9 | ... | f_92 | f_93 | f_94 | f_95 | f_96 | f_97 | f_98 | f_99 | target | kfold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.072049 | 0.048874 | -1.776504 | 0.812529 | 0.039193 | -0.008777 | 0.111748 | -0.685321 | -0.794618 | -0.219846 | ... | 1.016304 | 0.913315 | -1.334337 | -0.590968 | 0.687942 | 0.747553 | -0.38285 | 0.543663 | -252.135623 | 0 |
| 1 | 0.098822 | 0.466285 | -0.301996 | 0.128372 | 1.646033 | -1.352607 | 1.794482 | -1.431186 | -0.548489 | 0.212283 | ... | -0.679937 | 0.196082 | -0.690067 | -1.590908 | 1.670808 | -1.390162 | -0.969001 | 1.113607 | -4.731363 | 0 |
| 2 | 0.866245 | 0.584522 | -0.595651 | -2.745084 | -1.61491 | 0.840648 | 0.991548 | 0.296775 | 0.015612 | -0.555631 | ... | -0.986704 | 0.862823 | 0.776425 | -0.869599 | -1.489392 | 0.216389 | 1.293981 | -0.044173 | -123.131567 | 0 |
| 3 | -0.467379 | -0.927148 | -0.056269 | -0.940335 | -1.053923 | -0.942803 | -1.689408 | 0.791748 | 1.574628 | 0.403512 | ... | -0.441902 | -0.148544 | -0.026214 | 1.22128 | -0.204927 | 0.64562 | 1.122023 | 0.63739 | -172.841176 | 0 |
| 4 | 0.172043 | -0.91778 | 0.936267 | 0.902319 | -2.426538 | -0.021678 | 0.003368 | 0.3209 | -2.539603 | 0.165963 | ... | -0.273824 | -0.72108 | 0.31915 | -0.296552 | 0.586121 | -1.636423 | -0.757833 | 0.244904 | -259.690589 | 0 |
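
Once every row carries a `kfold` label, each fold can serve in turn as the validation set. The following is a minimal, self-contained sketch of that loop on a smaller dataset; the use of `LinearRegression`, the R² metric, the `noise` setting, and the name `demo` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import StratifiedKFold

# Small hypothetical dataset (1,000 samples, 5 features)
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=42)
demo = pd.DataFrame(X, columns=[f"f_{i}" for i in range(X.shape[1])])
demo["target"] = y

# Assign folds by stratifying on the binned target (Sturges' rule)
num_bins = int(np.floor(1 + np.log2(len(demo))))
bins = pd.cut(demo["target"], bins=num_bins, labels=False)
demo["kfold"] = -1
skf = StratifiedKFold(n_splits=5)
for fold, (_, valid_idx) in enumerate(skf.split(X=demo, y=bins)):
    demo.loc[valid_idx, "kfold"] = fold

# Train on four folds, validate on the held-out fold
features = [c for c in demo.columns if c not in ("target", "kfold")]
scores = []
for fold in range(5):
    train = demo[demo["kfold"] != fold]
    valid = demo[demo["kfold"] == fold]
    model = LinearRegression().fit(train[features], train["target"])
    scores.append(r2_score(valid["target"], model.predict(valid[features])))

print([round(s, 3) for s in scores])
```

With equal-width bins, the bins at the tails of the target distribution can hold fewer samples than `n_splits`; scikit-learn then emits a warning about the least populated class, and those few samples cannot be spread across every fold.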
