# K-Fold Cross Validation in fastai

This notebook will detail how to implement K-Fold Cross Validation in fastai.

One note before we start: if you use Cross Validation, you must be **sure** to keep a separate test set to accurately grade your model, because k-fold will use every row it is given for validation at some point. To ensure this, we will hold out 10% of the Adult data as a test set.

## Libraries

In [0]:
from fastai.tabular import *
from sklearn.model_selection import StratifiedKFold

## Data



In [0]:
path = untar_data(URLs.ADULT_SAMPLE)

In [0]:
df = pd.read_csv(path/'adult.csv')

We need to create an initial databunch to copy the processors over (so FillMissing operates correctly).

In [0]:
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]

In [0]:
data_init = (TabularList.from_df(df, path=path, cat_names=cat_names.copy(), cont_names=cont_names.copy(), procs=procs)
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dep_var)
                           .databunch())

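Now we can split the dataframe. The first 90% of the rows are used for cross validation and the remaining 10% are held out as the test set.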

In [6]:
int(len(df)*.9)

29304

In [0]:
train_df = df.iloc[:29304]
test_df = df.iloc[29304:]
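
The cut point does not have to be hard-coded. As a minimal sketch, assuming a small stand-in dataframe (`toy_df`, `toy_train`, and `toy_test` are hypothetical names, not the Adult data), the same 90/10 split can be derived from the length of the dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the Adult dataframe (20 rows, two classes)
toy_df = pd.DataFrame({'age': range(20), 'salary': ['<50k', '>=50k'] * 10})

cut = int(len(toy_df) * 0.9)   # index where the held-out 10% begins
toy_train = toy_df.iloc[:cut]  # rows available to k-fold
toy_test = toy_df.iloc[cut:]   # rows k-fold never sees

print(len(toy_train), len(toy_test))
```

Note that `iloc` slicing keeps the rows in file order. If the file were sorted by some column, shuffling first (for example with `df.sample(frac=1, random_state=...)`) would probably be the safer choice.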

## K-Fold

Now that everything is separated, we can use K-Fold. The code is based on Fernando A's post [here](https://forums.fast.ai/t/is-it-possible-to-implement-cross-validation-in-fastai/44961/5?u=muellerzr).

First we initialize the Stratified K-Fold, which keeps the ratio of the `salary` classes roughly the same in every fold.

In [0]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
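
To see what the splitter hands back, here is a small sketch on made-up labels (the arrays `X` and `y` and the name `skf_demo` are assumptions for illustration only). `split` yields one pair of integer index arrays per fold, and each validation fold keeps the class ratio of the full label array:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 rows of class 0 and 20 rows of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the feature values do not influence the split

skf_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

fold_summary = []
for train_idx, val_idx in skf_demo.split(X, y):
    # Record the size of each part and how many class-1 rows landed in validation
    fold_summary.append((len(train_idx), len(val_idx), int(y[val_idx].sum())))

for fold, (n_train, n_val, n_pos) in enumerate(fold_summary):
    print(fold, n_train, n_val, n_pos)
```

Every fold here has 90 training rows and 10 validation rows, 2 of which belong to class 1, matching the 20% share in `y`.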

Now we need our training loop, where we go over all of the folds and gather our validation and test set accuracy.

`data_test` builds a databunch from the **labeled** test dataset so that we can run `learn.validate()` on it.

In [0]:
val_pct = []
test_pct = []

# The labeled test databunch does not depend on the fold, so build it once
data_test = (TabularList.from_df(test_df, path=path, cat_names=cat_names.copy(),
                                  cont_names=cont_names.copy(), procs=procs,
                                  processor=data_init.processor) # Very important: reuse the fitted processors
              .split_none()
              .label_from_df(cols=dep_var))
data_test.valid = data_test.train # expose the test rows as a validation set
data_test = data_test.databunch()

for train_index, val_index in skf.split(train_df.index, train_df[dep_var]):
  # Train/validation databunch for this fold
  data_fold = (TabularList.from_df(train_df, path=path, cat_names=cat_names.copy(),
                                  cont_names=cont_names.copy(), procs=procs,
                                  processor=data_init.processor) # Very important: reuse the fitted processors
              .split_by_idxs(train_index, val_index)
              .label_from_df(cols=dep_var)
              .databunch())
  
  # New learner for every fold so no weights carry over between folds
  learn = tabular_learner(data_fold, layers=[200,100], metrics=accuracy)
  learn.fit(1)
  
  # Accuracy on this fold's validation split
  _, val = learn.validate()
  
  # Swap in the test dataloader and score the held-out test set
  learn.data.valid_dl = data_test.valid_dl
  _, test = learn.validate()
  
  val_pct.append(val.numpy())
  test_pct.append(test.numpy())
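
The structure of this loop does not depend on fastai. As a rough sketch of the same pattern, assuming synthetic data from `make_classification` and scikit-learn's `LogisticRegression` swapped in for the tabular learner (neither is used elsewhere in this notebook), each fold trains a new model, scores it on the fold's validation rows, and then scores it on the untouched test rows:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data; the first 90% is used for k-fold, the rest is the test set
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
cut = int(len(X) * 0.9)
X_cv, y_cv = X[:cut], y[:cut]
X_test, y_test = X[cut:], y[cut:]

val_scores, test_scores = [], []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

for train_idx, val_idx in folds.split(X_cv, y_cv):
    # New model per fold so nothing learned on one fold leaks into the next
    model = LogisticRegression(max_iter=1000)
    model.fit(X_cv[train_idx], y_cv[train_idx])

    # Accuracy on this fold's validation rows, then on the held-out test rows
    val_scores.append(model.score(X_cv[val_idx], y_cv[val_idx]))
    test_scores.append(model.score(X_test, y_test))

print(np.mean(val_scores), np.mean(test_scores))
```

The validation rows change from fold to fold while the test rows stay fixed, which is one reason the test scores would be expected to vary less across folds.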

Now we just take the statistics of our results and we're done!

In [20]:
print(f'Validation\nmean: {np.mean(val_pct)}\nstd: {np.std(val_pct)}')

Validation
mean: 0.8330944180488586
std: 0.006616224069148302


In [22]:
print(f'Test\nmean: {np.mean(test_pct)}\nstd: {np.std(test_pct)}')

Test
mean: 0.8332821726799011
std: 0.0022934952285140753
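
One detail worth knowing when reading these numbers: `np.std` computes the population standard deviation (`ddof=0`) by default. With only a handful of folds, the sample standard deviation (`ddof=1`) is slightly larger and is arguably the better estimate of fold-to-fold spread. A small sketch, using made-up accuracies (`scores` is hypothetical, not output from this notebook):

```python
import numpy as np

# Hypothetical per-fold accuracies, not results from this notebook
scores = np.array([0.82, 0.84, 0.83, 0.85, 0.81])

mean = scores.mean()
pop_std = scores.std()           # ddof=0, the np.std default
sample_std = scores.std(ddof=1)  # ddof=1, divides by n - 1 instead of n

print(f'mean: {mean:.4f}\npopulation std: {pop_std:.4f}\nsample std: {sample_std:.4f}')
```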
