In this notebook we apply **stratified k-fold** sampling and cross-validation to a regression dataset.

Note that the target here is continuous (this is a regression problem), so it must be binned before stratified sampling can be applied.
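
To see why binning matters, here is a minimal sketch on a small synthetic target. The names `y_cont`, `X_dummy`, and `y_binned` are illustrative and not part of this notebook. `StratifiedKFold` expects class labels, so a continuous target raises a `ValueError`, while the binned version splits normally.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Illustrative data: 200 continuous target values and a placeholder feature matrix.
rng = np.random.default_rng(0)
y_cont = rng.normal(size=200)
X_dummy = np.zeros((200, 1))

skf = StratifiedKFold(n_splits=5)

# Stratifying directly on a continuous target is rejected by scikit-learn.
try:
    next(skf.split(X_dummy, y_cont))
    continuous_ok = True
except ValueError:
    continuous_ok = False
print("continuous target accepted:", continuous_ok)

# Binning turns the target into integer labels that can be stratified on.
y_binned = pd.cut(y_cont, bins=4, labels=False)
train_idx, valid_idx = next(skf.split(X_dummy, y_binned))
print(len(train_idx), len(valid_idx))
```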

In [2]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn import model_selection

In [7]:
# we create a sample dataset with 15000 samples
# and 100 features and 1 target
X, y = datasets.make_regression(n_samples=15000, n_features=100, n_targets=1)
# create a dataframe out of our numpy arrays
df = pd.DataFrame(X,columns=[f"f_{i}" for i in range(X.shape[1])])
df.loc[:, "target"] = y

In [17]:
## Create folds

def create_folds(df):
    data = df.copy()

    # we create a new column called kfold and fill it with -1
    data["kfold"] = -1

    # the next step is to randomize the rows of the data
    data = data.sample(frac=1).reset_index(drop=True)

    # There are several ways to choose the number of bins. If you have a lot
    # of samples (> 10k, > 100k), the exact number matters little: just divide
    # the data into 10 or 20 bins. If you do NOT have a lot of samples, you can
    # use a simple rule such as Sturges' rule:
    #     number of bins = 1 + log2(N)

    # calculate the number of bins by Sturges' rule. I take the floor of the
    # value; you can also just round it
    num_bins = int(np.floor(1 + np.log2(len(data))))

    # bin targets
    data.loc[:, "bins"] = pd.cut(data["target"], bins=num_bins, labels=False)


    # instantiate the StratifiedKFold class from the model_selection module
    kf = model_selection.StratifiedKFold(n_splits=5)

    # fill the new kfold column
    # note that, instead of targets, we use bins!
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, 'kfold'] = f

    # drop the bins column
    data = data.drop("bins", axis=1)
    # return dataframe with folds
    return data
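
As a quick worked example of the bin count used above, the sketch below evaluates Sturges' rule for a few sample sizes. The helper name `sturges_bins` is hypothetical and only wraps the same expression that appears in `create_folds`.

```python
import numpy as np


def sturges_bins(n_samples: int) -> int:
    """Number of bins by Sturges' rule, floored: 1 + log2(N)."""
    return int(np.floor(1 + np.log2(n_samples)))


# For the 15000-row dataset in this notebook: 1 + log2(15000) is about 14.87.
print(sturges_bins(15000))  # 14
print(sturges_bins(1000))   # 10
print(sturges_bins(100))    # 7
```

So for this dataset the targets are cut into 14 equal-width bins.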


In [18]:
# create folds
df = create_folds(df)
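
A quick sanity check is to confirm that every row received a fold and that the bins are spread evenly across folds. The sketch below rebuilds a small stand-in dataframe (`toy`, a hypothetical name, with 1000 rows and 10 bins chosen only to keep the example fast) using the same steps as `create_folds`, then inspects the result.

```python
import pandas as pd
from sklearn import datasets, model_selection

# Small stand-in dataset; sizes and random_state are illustrative choices.
X, y = datasets.make_regression(
    n_samples=1000, n_features=5, n_targets=1, random_state=42
)
toy = pd.DataFrame(X, columns=[f"f_{i}" for i in range(X.shape[1])])
toy["target"] = y

# Same idea as create_folds: bin the target, then stratify on the bins.
toy["bins"] = pd.cut(toy["target"], bins=10, labels=False)
toy["kfold"] = -1
kf = model_selection.StratifiedKFold(n_splits=5)
for fold, (_, valid_idx) in enumerate(kf.split(X=toy, y=toy["bins"].values)):
    toy.loc[valid_idx, "kfold"] = fold

# Rows per fold, and the share of each bin within each fold.
fold_sizes = toy["kfold"].value_counts().sort_index()
bin_share = pd.crosstab(toy["kfold"], toy["bins"], normalize="index")
print(fold_sizes)
print(bin_share.round(3))
```

If the stratification worked, the rows of `bin_share` should look nearly identical from fold to fold.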




In [19]:
df.head()

,f_0,f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,...,f_92,f_93,f_94,f_95,f_96,f_97,f_98,f_99,target,kfold
0,0.565425,0.971777,0.2138,-0.353792,-0.335173,0.006121,-1.44251,-0.78781,2.025002,0.742515,...,-0.791381,0.927819,0.152073,-3.277272,-0.122104,0.641381,-1.210755,1.522079,-397.340459,0
1,-2.746361,0.181441,-1.133135,0.868746,-0.868995,-1.456483,-0.508943,0.579874,1.961598,1.183114,...,0.40138,-0.027808,-0.222313,0.823885,0.575728,-0.281101,-0.030375,-0.268561,-1.927032,0
2,-0.668751,0.154002,1.764469,0.909762,-0.026638,-0.381099,-0.580303,-0.769698,1.053011,0.138395,...,0.577468,-0.614562,1.228049,-0.553985,1.241602,0.839471,-0.653361,1.091489,-211.42079,0
3,-0.221898,-0.50433,-0.543967,-0.240035,0.064741,0.449069,0.26197,0.224125,0.013446,0.063844,...,1.051585,0.856302,1.526893,0.25432,-1.07856,-0.700253,-0.408231,-0.121752,190.095023,0
4,-1.508757,0.672645,-0.686448,-1.800434,-0.375978,1.067656,-0.833104,-0.114172,-1.725787,-0.929284,...,-1.190717,-0.115362,-2.228292,0.50123,0.078194,-0.297457,-0.236146,0.64999,-350.468307,0


Each row now has a fold assignment, obtained by stratifying on the bins (which are in turn derived from the target values). We can now apply k-fold cross-validation as we did in the previous notebook.
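
One common way to use the `kfold` column is to hold out one fold at a time, train on the rest, and score on the held-out fold. The sketch below is an assumption about how that loop might look, not a copy of the previous notebook: `run_fold`, the `demo` dataframe, the choice of `LinearRegression`, and RMSE as the metric are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, model_selection
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def run_fold(df, fold):
    """Train on every fold except `fold`, validate on `fold`, return RMSE."""
    features = [c for c in df.columns if c not in ("target", "kfold")]
    train = df[df["kfold"] != fold]
    valid = df[df["kfold"] == fold]
    model = LinearRegression()
    model.fit(train[features], train["target"])
    preds = model.predict(valid[features])
    return float(np.sqrt(mean_squared_error(valid["target"], preds)))


# Small stand-in for the notebook's dataframe (sizes and noise are illustrative).
X, y = datasets.make_regression(
    n_samples=1000, n_features=5, noise=10.0, random_state=42
)
demo = pd.DataFrame(X, columns=[f"f_{i}" for i in range(X.shape[1])])
demo["target"] = y
demo["kfold"] = -1

# Assign folds by stratifying on binned targets, as in create_folds.
bins = pd.cut(demo["target"], bins=10, labels=False)
kf = model_selection.StratifiedKFold(n_splits=5)
for fold, (_, valid_idx) in enumerate(kf.split(X=demo, y=bins.values)):
    demo.loc[valid_idx, "kfold"] = fold

# One RMSE per held-out fold.
rmse_per_fold = [run_fold(demo, fold) for fold in range(5)]
print(rmse_per_fold)
```

Excluding `target` and `kfold` from the feature list matters: leaving either in would leak information into the model.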

There may be other scenarios in which plain stratified sampling is not enough and needs to be adapted.

```
For example, let’s say we have a problem in which we would like to build a model to detect skin cancer from skin images of patients. Our task is to build a binary classifier which takes an input image and predicts the probability for it being benign or malignant.
In these kinds of datasets, you might have multiple images for the same patient in the training dataset. So, to build a good cross-validation system here, you must have stratified k-folds, but you must also make sure that patients in training data do not appear in validation data. Fortunately, scikit-learn offers a type of cross-validation known as GroupKFold. Here the patients can be considered as groups. But unfortunately, there is no way to combine GroupKFold with StratifiedKFold in scikit-learn. So you need to do that yourself. I’ll leave it as an exercise for the reader.
```