# Notebook 3b: K-Fold Validation

In this notebook I will show you how to implement K-Fold Cross Validation on your data and evaluate each fold's model on a held-out test set. We will use the ADULT_SAMPLE dataset as an example, but the same steps can be applied to other tabular datasets.

In [0]:
!pip3 install torch===1.3.0 torchvision===0.4.1 -f https://download.pytorch.org/whl/torch_stable.html
!pip install git+https://github.com/fastai/fastai_dev > /dev/null

In [0]:
from fastai2.torch_basics import *
from fastai2.basics import *
from fastai2.tabular.core import *

In [0]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

In [0]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]

First I want to separate out a test set made up of the last 10% of my data. For ADULT this is fine, but in practice how you choose your validation and test sets is quite an important topic. To read more, see [here](https://www.fast.ai/2017/11/13/validation-sets/).

In [0]:
end = len(df) - 3256

In [0]:
test = df.iloc[end:]
train = df.iloc[:end]
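
The hard-coded `3256` above is 10% of the rows, rounded down. As a minimal sketch, the split point can be computed from the row count instead. The helper name `tail_split_point` is hypothetical, and the row count of 32,561 is an assumption chosen to be consistent with the `3256` used above.

```python
def tail_split_point(n_rows, test_frac=0.1):
    "Index where the last `test_frac` of `n_rows` rows begins (hypothetical helper)."
    n_test = int(n_rows * test_frac)  # round down, as the hard-coded 3256 does
    return n_rows - n_test

# Assumed row count, consistent with the 3256 test rows used in the notebook
end = tail_split_point(32561)
n_test = 32561 - end
print(end, n_test)
```

With a DataFrame you would call `tail_split_point(len(df))` and slice with `df.iloc[:end]` and `df.iloc[end:]` exactly as in the cell above.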

Now let's grab `StratifiedKFold` from the `sklearn` library. It keeps the proportion of each `salary` class roughly the same in every fold.

In [0]:
from sklearn.model_selection import StratifiedKFold

The following code is just to get our `TabularLearner` up and running

In [0]:
def emb_sz_rule(n_cat): 
    "Rule of thumb to pick embedding size corresponding to `n_cat`"
    return min(600, round(1.6 * n_cat**0.56))

def _one_emb_sz(classes, n, sz_dict=None):
    "Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`."
    sz_dict = ifnone(sz_dict, {})
    n_cat = len(classes[n])
    sz = sz_dict.get(n, int(emb_sz_rule(n_cat)))  # rule of thumb
    return n_cat,sz

def get_emb_sz(to, sz_dict=None):
    "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
    return [_one_emb_sz(to.procs.classes, n, sz_dict) for n in to.cat_names]

class TabularModel(Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, n_cont, out_sz, layers, ps=None, embed_p=0., y_range=None, use_bn=True, bn_final=False):
        ps = ifnone(ps, [0]*len(layers))
        if not is_listy(ps): ps = [ps]*len(layers)
        self.embeds = nn.ModuleList([Embedding(ni, nf) for ni,nf in emb_szs])
        self.emb_drop = nn.Dropout(embed_p)
        self.bn_cont = nn.BatchNorm1d(n_cont)
        n_emb = sum(e.embedding_dim for e in self.embeds)
        self.n_emb,self.n_cont,self.y_range = n_emb,n_cont,y_range
        sizes = [n_emb + n_cont] + layers + [out_sz]
        actns = [nn.ReLU(inplace=True) for _ in range(len(sizes)-2)] + [None]
        _layers = [BnDropLin(sizes[i], sizes[i+1], bn=use_bn and i!=0, p=p, act=a)
                       for i,(p,a) in enumerate(zip([0.]+ps,actns))]
        if bn_final: _layers.append(nn.BatchNorm1d(sizes[-1]))
        self.layers = nn.Sequential(*_layers)
    
    def forward(self, x_cat, x_cont):
        if self.n_emb != 0:
            x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
            x = torch.cat(x, 1)
            x = self.emb_drop(x)
        if self.n_cont != 0:
            x_cont = self.bn_cont(x_cont)
            x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
        x = self.layers(x)
        if self.y_range is not None:
            x = (self.y_range[1]-self.y_range[0]) * torch.sigmoid(x) + self.y_range[0]
        return x
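
To get a feel for what `emb_sz_rule` produces, here is a small standalone sketch that repeats the rule and evaluates it for a few category counts. The counts themselves are arbitrary illustrations, not values taken from the ADULT data.

```python
def emb_sz_rule(n_cat):
    "Same rule of thumb as above: grows sub-linearly and is capped at 600."
    return min(600, round(1.6 * n_cat**0.56))

# Arbitrary example cardinalities, chosen only to show how the rule scales
for n_cat in (2, 10, 100, 100_000):
    print(n_cat, emb_sz_rule(n_cat))
```

Small categorical variables get embeddings of only a few dimensions, while very high-cardinality ones hit the cap of 600.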

In [0]:
# My own mock version of a `tabular_learner`
def tabular_learner(data:DataBunch, layers, emb_szs=None, metrics=None,
        ps=None, emb_drop:float=0., y_range=None, use_bn:bool=True, **learn_kwargs):
    "Get a `Learner` using `data`, with `metrics`, including a `TabularModel` created using the remaining params."
    emb_szs = get_emb_sz(data.train)
    model = TabularModel(emb_szs, len(data.cont_names), data.train[data.train.y_names].nunique(), layers=layers)
    return Learner(data, model, metrics=metrics, **learn_kwargs)

Now for the actual run. I'll describe what we're doing step by step. We declare our `cat` and `cont` variables and our procs, and build a data loader for the test set so we can evaluate against it. Along with this, to stay in v2 style, the lists that collect our validation and test results will be of type `L`.

Next, we use `StratifiedKFold` to generate 10 shuffled splits with its `.split` method. Each split is a pair of index arrays, one for training and one for validation. We convert them to `L`s and pass them directly into `TabularPandas` as `splits`. From there we create our `DataBunch` and `Learner`, train for one epoch, and evaluate on both the validation fold and the test data.

Finally, we will print out the validation and test set statistics.
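
Before running the real loop, it can help to see what `StratifiedKFold.split` actually yields. The sketch below uses toy labels (80 rows of one class, 20 of another), not the ADULT data, and fixes `random_state` only to make the toy run repeatable. Both are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (an assumption for illustration): 80 rows of class 0, 20 of class 1
labels = np.array([0] * 80 + [1] * 20)
rows = np.arange(len(labels))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
folds = list(skf.split(rows, labels))  # list of (train_idx, valid_idx) pairs

seen = []
for train_idx, valid_idx in folds:
    # Within a fold, training and validation indices never overlap
    assert set(train_idx).isdisjoint(valid_idx)
    seen.extend(valid_idx.tolist())

valid_sizes = [len(v) for _, v in folds]
valid_positives = [int(labels[v].sum()) for _, v in folds]
print(valid_sizes)
print(valid_positives)
```

Every row lands in exactly one validation fold, and each fold keeps the 80/20 class ratio, which is why the per-fold accuracies are comparable to one another.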

In [93]:
val_pct = L()
test_pct = L()

test_preds = L()

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]

test_dl = TabularPandas(test, procs, cat_names, cont_names, y_names="salary")
test_dl = TabDataLoader(test_dl)

skf = StratifiedKFold(n_splits=10, shuffle=True)
res = skf.split(train.index, train['salary'])
for x, y in res:
  ix = (L(list(x)), L(list(y)))
  to = TabularPandas(train, procs, cat_names, cont_names, y_names="salary", splits=ix)
  data = to.databunch()
  learn = tabular_learner(data, layers=[200,100], loss_func=CrossEntropyLossFlat(), metrics=accuracy)
  learn.fit(1)
  val_pct.append(learn.validate()[1])
  test_pct.append(learn.validate(dl=test_dl)[1])

(#5) [0,0.39185065031051636,0.36321401596069336,0.8341862559318542,00:41]
(#5) [0,0.3853352665901184,0.370516300201416,0.8184919953346252,00:40]
(#5) [0,0.39923539757728577,0.37282225489616394,0.8375980854034424,00:41]
(#5) [0,0.3787161111831665,0.3603558838367462,0.837256908416748,00:41]
(#5) [0,0.36787909269332886,0.3505774438381195,0.8365745544433594,00:41]
(#5) [0,0.3749101459980011,0.37149032950401306,0.8320819139480591,00:40]
(#5) [0,0.3706760108470917,0.37037286162376404,0.8245733976364136,00:41]
(#5) [0,0.36982911825180054,0.3898116648197174,0.8156996369361877,00:41]
(#5) [0,0.3981660306453705,0.36508622765541077,0.8313993215560913,00:40]
(#5) [0,0.3743650019168854,0.3596053421497345,0.8327645063400269,00:40]


In [97]:
print(f'Validation:\nmean: {np.mean(val_pct)}\nstd: {np.std(val_pct)}')
print(f'\n\nTest:\nmean: {np.mean(test_pct)}\nstd: {np.std(test_pct)}')

Validation:
mean: 0.8300626575946808
std: 0.007425775359093289


Test:
mean: 0.7957616686820984
std: 0.014733064799056943


## Bonus:

If we wanted to ensemble our ten models by averaging their predictions on the test set, here is how you would adjust the loop (the changed lines are marked `# HERE`):

In [110]:
test_preds = L() # HERE

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]

test_dl = TabularPandas(test, procs, cat_names, cont_names, y_names="salary")
test_dl = TabDataLoader(test_dl)

skf = StratifiedKFold(n_splits=10, shuffle=True)
res = skf.split(train.index, train['salary'])
for x, y in res:
  ix = (L(list(x)), L(list(y)))
  to = TabularPandas(train, procs, cat_names, cont_names, y_names="salary", splits=ix)
  data = to.databunch()
  learn = tabular_learner(data, layers=[200,100], loss_func=CrossEntropyLossFlat(), metrics=accuracy)
  learn.fit(1)
  test_preds.append(learn.get_preds(dl=test_dl)[0]) # HERE

(#5) [0,0.3765460252761841,0.3636363744735718,0.8358922004699707,00:40]
(#5) [0,0.3646171987056732,0.35906463861465454,0.8249744176864624,00:40]
(#5) [0,0.3781249523162842,0.35186585783958435,0.8355510234832764,00:40]
(#5) [0,0.40145716071128845,0.3564353585243225,0.8311156630516052,00:40]
(#5) [0,0.38083574175834656,0.3732168674468994,0.8348686695098877,00:40]
(#5) [0,0.4260295331478119,0.39271262288093567,0.8249146938323975,00:40]
(#5) [0,0.36830320954322815,0.3626205623149872,0.8273037672042847,00:40]
(#5) [0,0.3807460367679596,0.36837369203567505,0.8331058025360107,00:40]
(#5) [0,0.40952637791633606,0.35914427042007446,0.8307167291641235,00:40]
(#5) [0,0.37534549832344055,0.359743595123291,0.8313993215560913,00:40]


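In [0]:
preds = list(test_preds)  # one prediction tensor per fold

In [0]:
pred = sum(preds)/len(preds)  # average the predictions across folds

In [122]:
# `test_preds` only holds predictions, so fetch the test targets separately;
# they are the same for every fold because `test_dl` never changes
targs = learn.get_preds(dl=test_dl)[1]
accuracy(pred, targs)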

tensor(0.8117)
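
To see why averaging can help, here is a minimal NumPy sketch of the same idea on made-up numbers. The three "models" and their scores are hypothetical, and `ensemble_accuracy` is an illustrative helper, not part of fastai. Each model gets a different row wrong, yet the averaged prediction is right on all three.

```python
import numpy as np

def ensemble_accuracy(fold_probs, targets):
    "Average per-model class scores, take the argmax, and score against `targets`."
    stacked = np.stack(fold_probs)     # shape: (n_models, n_rows, n_classes)
    mean_probs = stacked.mean(axis=0)  # shape: (n_rows, n_classes)
    return float((mean_probs.argmax(axis=1) == targets).mean())

# Hypothetical scores from three models on three rows (two classes)
fold_probs = [
    np.array([[0.90, 0.10], [0.45, 0.55], [0.30, 0.70]]),  # wrong on row 1
    np.array([[0.80, 0.20], [0.70, 0.30], [0.55, 0.45]]),  # wrong on row 2
    np.array([[0.45, 0.55], [0.80, 0.20], [0.20, 0.80]]),  # wrong on row 0
]
targets = np.array([0, 0, 1])

single = [ensemble_accuracy([p], targets) for p in fold_probs]
combined = ensemble_accuracy(fold_probs, targets)
print(single, combined)
```

Real models tend to make correlated mistakes, so the gain is usually smaller than in this toy case.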

And we're done! 20 lines of code! *Much* easier to do in v2 than v1 thanks to that test `DataLoader` being so simple to set up.