# Notebook 3b: K-Fold Validation

In this notebook I will show you how to implement K-Fold Cross Validation on your data and evaluate each fold's model on a held-out test set. We will use the ADULT sample dataset as an example, but the steps apply to any tabular dataset.

In [0]:
import os
!pip install -q torch torchvision feather-format kornia pyarrow Pillow wandb nbdev fastprogress --upgrade 
!pip install -q git+https://github.com/fastai/fastcore  --upgrade
!pip install -q git+https://github.com/fastai/fastai2 --upgrade
os._exit(00)


In [0]:
from fastai2.basics import *
from fastai2.tabular.all import *

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

In [0]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]

First I want to separate out a test set that is the last 10% of my data. For ADULT this is fine, but how you choose a validation or test set is an important topic in practice. To read more, see [here](https://www.fast.ai/2017/11/13/validation-sets/).

In [0]:
end = len(df) - 3256

In [0]:
test = df.iloc[end:]
train = df.iloc[:end]

Now let's grab `StratifiedKFold` from the `sklearn` library

In [0]:
from sklearn.model_selection import StratifiedKFold

Now for the actual running. I'll describe what we're doing step by step. We declare our `cat` and `cont` vars and our procs, and also generate our test set's data loader (so we can test against it). Along with this, to stay in v2 style, the lists holding our validation and test results will be of type `L`.

Next, we use `StratifiedKFold` to generate 10 shuffled splits with the `.split` method. Each split contains a pair of index arrays: the training rows and the validation rows for that fold. Convert them to `L`s and we can pass them directly into our `TabularPandas`. From there, we create our `DataLoaders` and `Learner`, train it, and then evaluate on our test data.

Finally, we will print out the validation and test set statistics.
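To make the splitting step concrete, here is a small sketch of what `StratifiedKFold.split` yields, run on made-up labels rather than the ADULT data. The 80/20 class ratio, the 5 folds, and the `random_state` value are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 80 negatives and 20 positives
labels = np.array([0] * 80 + [1] * 20)
rows = np.arange(len(labels))

# random_state fixes the shuffle so the folds are the same on every run
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(rows, labels))

for fold, (train_idx, valid_idx) in enumerate(folds):
    # Each item is a pair of positional index arrays for that fold
    pos_rate = labels[valid_idx].mean()
    print(f"fold {fold}: train={len(train_idx)} valid={len(valid_idx)} positive rate={pos_rate:.2f}")
```

Every row lands in exactly one validation fold, and each fold keeps the overall class ratio. The loop in this notebook does not set `random_state`, so its folds (and scores) will differ from run to run.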

In [7]:
val_pct = L()   # validation accuracy of each fold
test_pct = L()  # test accuracy of each fold's model

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]

# Build the test DataLoader once; every fold's model is evaluated on it
test_dl = TabularPandas(test, procs, cat_names, cont_names, y_names="salary")
test_dl = TabDataLoader(test_dl)

# 10 shuffled folds, stratified on the target so each fold keeps the class balance
skf = StratifiedKFold(n_splits=10, shuffle=True)
res = skf.split(train.index, train['salary'])
for train_idx, valid_idx in res:
  ix = (L(list(train_idx)), L(list(valid_idx)))
  to = TabularPandas(train, procs, cat_names, cont_names, y_names="salary", splits=ix)
  data = to.databunch()
  learn = tabular_learner(data, layers=[200,100], loss_func=CrossEntropyLossFlat(), metrics=accuracy)
  learn.fit(1)
  # validate() returns [loss, accuracy]; keep the accuracy
  val_pct.append(learn.validate()[1])
  test_pct.append(learn.validate(dl=test_dl)[1])

(#5) [0,0.3749929368495941,0.39017242193222046,0.8195155262947083,00:30]
(#5) [0,0.3966914415359497,0.38946130871772766,0.8249744176864624,00:30]
(#5) [0,0.3748205900192261,0.3508380353450775,0.8311156630516052,00:30]
(#5) [0,0.3749563694000244,0.3576997220516205,0.8389627933502197,00:30]
(#5) [0,0.3725537955760956,0.37853461503982544,0.8314568400382996,00:30]
(#5) [0,0.37171193957328796,0.3733140230178833,0.826279878616333,00:30]
(#5) [0,0.37295711040496826,0.3607505261898041,0.8283276557922363,00:30]
(#5) [0,0.38796281814575195,0.36992305517196655,0.8351535797119141,00:30]
(#5) [0,0.38456496596336365,0.37874263525009155,0.8177474141120911,00:31]
(#5) [0,0.3830104470252991,0.3649543225765228,0.8337883949279785,00:31]


In [8]:
print(f'Validation:\nmean: {np.mean(val_pct)}\nstd: {np.std(val_pct)}')
print(f'\n\nTest:\nmean: {np.mean(test_pct)}\nstd: {np.std(test_pct)}')

Validation:
mean: 0.8287322163581848
std: 0.006406870402931799


Test:
mean: 0.8078009903430938
std: 0.013699860891465427
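One detail worth knowing when reading these numbers: `np.std` defaults to the population standard deviation (`ddof=0`). As a rough sketch, the validation figures can be reproduced with the standard library from the per-fold accuracies printed during training (rounded here to six decimals), alongside the sample standard deviation for comparison.

```python
import statistics

# Validation accuracy of each fold, copied from the training output above (rounded)
fold_acc = [0.819516, 0.824974, 0.831116, 0.838963, 0.831457,
            0.826280, 0.828328, 0.835154, 0.817747, 0.833788]

mean_acc = statistics.mean(fold_acc)
pop_std = statistics.pstdev(fold_acc)     # divides by n, like np.std's default
sample_std = statistics.stdev(fold_acc)   # divides by n - 1, like np.std(..., ddof=1)

print(f"mean: {mean_acc:.6f}")
print(f"population std: {pop_std:.6f}")
print(f"sample std: {sample_std:.6f}")
```

With only ten folds the two estimates differ noticeably, so it helps to say which one you are reporting.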


## Bonus:

If we wanted to ensemble our ten models by averaging their predictions on the test set, here is how you would adjust the loop (the changed lines are marked `# HERE`):

In [0]:
test_preds = L() # HERE

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]

test_dl = TabularPandas(test, procs, cat_names, cont_names, y_names="salary")
test_dl = TabDataLoader(test_dl)

skf = StratifiedKFold(n_splits=10, shuffle=True)
res = skf.split(train.index, train['salary'])
for x, y in res:
  ix = (L(list(x)), L(list(y)))
  to = TabularPandas(train, procs, cat_names, cont_names, y_names="salary", splits=ix)
  data = to.databunch()
  learn = tabular_learner(data, layers=[200,100], loss_func=CrossEntropyLossFlat(), metrics=accuracy)
  learn.fit(1)
  test_preds.append(learn.get_preds(dl=test_dl)[0]) # HERE

(#5) [0,0.3765460252761841,0.3636363744735718,0.8358922004699707,00:40]
(#5) [0,0.3646171987056732,0.35906463861465454,0.8249744176864624,00:40]
(#5) [0,0.3781249523162842,0.35186585783958435,0.8355510234832764,00:40]
(#5) [0,0.40145716071128845,0.3564353585243225,0.8311156630516052,00:40]
(#5) [0,0.38083574175834656,0.3732168674468994,0.8348686695098877,00:40]
(#5) [0,0.4260295331478119,0.39271262288093567,0.8249146938323975,00:40]
(#5) [0,0.36830320954322815,0.3626205623149872,0.8273037672042847,00:40]
(#5) [0,0.3807460367679596,0.36837369203567505,0.8331058025360107,00:40]
(#5) [0,0.40952637791633606,0.35914427042007446,0.8307167291641235,00:40]
(#5) [0,0.37534549832344055,0.359743595123291,0.8313993215560913,00:40]


In [0]:
preds = [pred for pred in test_preds]

In [0]:
pred = sum(preds)/10

In [0]:
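# test_preds holds only prediction tensors, so fetch the targets separately;
# they are the same for every fold because test_dl never changes
_, targs = learn.get_preds(dl=test_dl)
accuracy(pred, targs)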

tensor(0.8117)
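To see why averaging can help, here is a toy sketch of the same idea with three models instead of ten. The probabilities and targets are made up for illustration, and NumPy arrays stand in for the tensors returned by `get_preds`.

```python
import numpy as np

# Made-up class probabilities with shape (n_models, n_samples, n_classes)
model_probs = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4], [0.2, 0.8]],  # model A
    [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]],  # model B
    [[0.7, 0.3], [0.7, 0.3], [0.4, 0.6], [0.6, 0.4]],  # model C
])
targets = np.array([0, 0, 1, 1])

def acc(probs, targs):
    """Fraction of samples whose most probable class matches the target."""
    return float((probs.argmax(axis=-1) == targs).mean())

# Average the probabilities across models, then pick the most likely class
ensemble_probs = model_probs.mean(axis=0)
single_accs = [acc(p, targets) for p in model_probs]
ensemble_acc = acc(ensemble_probs, targets)

print("single models:", single_accs)
print("ensemble:", ensemble_acc)
```

In this toy case the models make their mistakes on different samples, so the average is at least as good as each of them. On real data the gain is usually much smaller.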

And we're done! 20 lines of code! *Much* easier to do in v2 than v1 thanks to that test `DataLoader` being so simple to set up.