# Splitting datasets into train, tune, and test sets
This notebook shows how to split datasets into train, tune, and test sets. The splits can be saved to a folder for reuse and reproducibility (recommended).

You can generate three types of splits. 
- A "super test" or withholding split. It's a simple random sample of variants meant to be completely held out until the final model training and evaluation.
- Classic train-tune-test splits based on percentages of the total dataset. 
- Reduced train size splits for testing model sensitivity to smaller training set sizes.

You can also write your own function to generate a split based on whatever criteria you like. As long as there's a "train" set (and a "tune" set if you're using early stopping), it should work with the rest of the codebase.
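
As a rough illustration of what such a custom function could look like, here is a minimal sketch. It assumes a split is simply a dictionary that maps set names to NumPy arrays of dataset indices, which is the format of the outputs shown later in this notebook. The function name `split_by_score_threshold`, its parameters, and the toy score array are all hypothetical and not part of the codebase.

```python
import numpy as np


def split_by_score_threshold(scores, threshold, tune_frac=0.2, rseed=0):
    """Hypothetical custom split: train/tune on low scores, test on high scores.

    Returns a dict mapping set names to arrays of dataset indices.
    """
    scores = np.asarray(scores)
    rng = np.random.default_rng(rseed)

    # indices of variants at or below the threshold, in shuffled order
    low_idxs = rng.permutation(np.flatnonzero(scores <= threshold))
    # indices of variants above the threshold form the test set
    high_idxs = np.flatnonzero(scores > threshold)

    # carve the tune set out of the shuffled low-score indices
    num_tune = int(round(tune_frac * len(low_idxs)))
    return {"train": low_idxs[num_tune:],
            "tune": low_idxs[:num_tune],
            "test": high_idxs}


# toy scores standing in for a real dataset (assumption for illustration)
toy_scores = np.linspace(0.0, 1.0, 100)
custom_split = split_by_score_threshold(toy_scores, threshold=0.8)
print({name: len(idxs) for name, idxs in custom_split.items()})
```

The only property this sketch relies on is that the sets are disjoint arrays of indices and that a "train" key (plus "tune" for early stopping) is present.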

In [9]:
# reload modules before executing code in order to make development and debugging easier
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [10]:
# this jupyter notebook is running inside of the "notebooks" directory
# for relative paths to work properly, we need to set the current working directory to the root of the project
# for imports to work properly, we need to add the code folder to the system path
import os
from os.path import abspath, join, isdir
import sys
if not isdir("notebooks"):
    # if there's a "notebooks" directory in the cwd, we've already set the cwd so no need to do it again
    os.chdir("..")
module_path = abspath("code")
if module_path not in sys.path:
    sys.path.append(module_path)

In [11]:
import numpy as np
import constants
import split_dataset as sd
import utils

# set logging level to info
import logging
logging.basicConfig()
logger = logging.getLogger("nn4dms")
logger.setLevel(logging.INFO)

# Generating a "super test" set

I recommend using a completely held-out supertest set. Don't use this set for development of the algorithm and don't look at evaluation results on this set until the very end, when you are ready to publish. Here we create a supertest set for the 1c0p dataset and save it to that dataset's splits directory, data/1c0p/splits.

In [12]:
# load the full dataset (really, we only need the # of variants in the dataset, but this is easier)
ds = utils.load_dataset(ds_name="1c0p")

# specifying an out_dir will save the supertest split to a file in that directory
supertest_idxs, supertest_fn = sd.supertest(ds, size=.1, rseed=12, out_dir="data/1c0p/splits", overwrite=True)
supertest_idxs

INFO:nn4dms.split_dataset:saving supertest split to file data/1c0p/splits/supertest_w1230de2dad90_s0.1_r12.txt


array([15805,  4294,  1630, ...,   882, 31413,  5101])

In [13]:
supertest_fn

'data/1c0p/splits/supertest_w1230de2dad90_s0.1_r12.txt'
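
Conceptually, a supertest set is just a simple random sample of variant indices drawn with a fixed seed. The sketch below shows that idea with plain NumPy. It is not the `sd.supertest` implementation, and the helper name `sample_holdout` is hypothetical.

```python
import numpy as np


def sample_holdout(num_variants, size, rseed):
    """Draw a simple random sample of indices without replacement (illustrative only)."""
    rng = np.random.default_rng(rseed)
    num_holdout = int(round(size * num_variants))
    return rng.choice(num_variants, size=num_holdout, replace=False)


# hold out 10% of a hypothetical dataset with 1000 variants
holdout_idxs = sample_holdout(num_variants=1000, size=0.1, rseed=12)

# using the same seed again reproduces exactly the same sample
repeat_idxs = sample_holdout(num_variants=1000, size=0.1, rseed=12)
print(len(holdout_idxs), np.array_equal(holdout_idxs, repeat_idxs))
```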

# Generating a classic train-tune-test split
You must specify the size of each set as a fraction of the total number of examples. You can specify a fraction of 0 if you do not want to have a particular set. Train + tune + test must sum to 1. 
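
To make the fraction-based bookkeeping concrete, here is a minimal sketch of a percentage split built from a single shuffled permutation of indices. It is an illustration under stated assumptions, not the code behind `sd.train_tune_test`. The function name `fractional_split` is hypothetical, and a fraction of 0 simply yields an empty array here.

```python
import numpy as np


def fractional_split(num_examples, train_size, tune_size, test_size, rseed):
    """Illustrative train-tune-test split by fractions of the total dataset."""
    if not np.isclose(train_size + tune_size + test_size, 1.0):
        raise ValueError("train_size + tune_size + test_size must sum to 1")

    rng = np.random.default_rng(rseed)
    idxs = rng.permutation(num_examples)

    num_tune = int(round(tune_size * num_examples))
    num_test = int(round(test_size * num_examples))

    # tune and test are taken first; train receives everything that remains
    return {"tune": idxs[:num_tune],
            "test": idxs[num_tune:num_tune + num_test],
            "train": idxs[num_tune + num_test:]}


toy_split = fractional_split(1000, train_size=0.6, tune_size=0.2, test_size=0.2, rseed=12)
print({name: len(idxs) for name, idxs in toy_split.items()})
```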

In [14]:
out_dir = "data/1c0p/splits"
split, split_dir = sd.train_tune_test(ds, train_size=.6, tune_size=.2, test_size=.2, rseed=12, out_dir=out_dir, overwrite=True)

INFO:nn4dms.split_dataset:saving train-tune-test split to directory data/1c0p/splits/standard_tr0.6_tu0.2_te0.2_wF_r12


In [15]:
split

{'tune': array([15805,  4294,  1630, ..., 25416, 14911, 15107]),
 'test': array([32181, 16097, 40326, ...,  9212, 21283,  6650]),
 'train': array([19708, 11674,  8448, ..., 24735, 22496,  9759])}

In [16]:
split_dir

'data/1c0p/splits/standard_tr0.6_tu0.2_te0.2_wF_r12'

Note that these sets include the variants from the supertest set we created above. To generate a train-tune-test split without those variants, you must specify either the supertest indices as an array or the filename, if you saved them to disk.

In [17]:
# using the supertest_fn from above
split, split_dir = sd.train_tune_test(ds, train_size=.6, tune_size=.2, test_size=.2, 
                           withhold=supertest_fn, rseed=12, out_dir="data/avgfp/splits", overwrite=True)
split

INFO:nn4dms.split_dataset:saving train-tune-test split to directory data/avgfp/splits/standard_tr0.6_tu0.2_te0.2_w1230de2dad90_r12


{'stest': array([15805,  4294,  1630, ...,   882, 31413,  5101]),
 'tune': array([26060, 21785, 28044, ...,  3275,  4654, 12648]),
 'test': array([ 9544, 12719, 42572, ..., 10882, 13092,  8728]),
 'train': array([ 8542, 26820,  9410, ..., 26616, 36088, 30408])}

The supertest set is added to the split as a withheld set under the "stest" key, and the train, tune, and test sets do not contain any variants from it. The name of the save directory contains an alphanumeric string ("1230de2dad90" in the output above) representing a hash of the withheld indices. This handles the edge case where you have multiple supertest sets and want to make multiple splits based on them.
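
The two ideas in this paragraph, excluding withheld indices and tagging a split with a hash of them, can be sketched as follows. This is only an assumption-laden illustration. The helper names are hypothetical, and the hashing scheme (SHA-256 of the sorted indices, truncated to 12 characters) is chosen for the example and is not necessarily the one the codebase uses.

```python
import hashlib

import numpy as np


def available_indices(num_variants, withheld_idxs):
    """Return all dataset indices that are not in the withheld set."""
    return np.setdiff1d(np.arange(num_variants), withheld_idxs)


def short_hash(idxs, length=12):
    """Short hex digest of a set of indices (hypothetical scheme for illustration)."""
    # sort first so the hash does not depend on the order of the indices
    canonical = np.sort(np.asarray(idxs, dtype=np.int64)).tobytes()
    return hashlib.sha256(canonical).hexdigest()[:length]


withheld = np.array([7, 2, 5])
remaining = available_indices(10, withheld)
print(remaining)             # indices left for train, tune, and test
print(short_hash(withheld))  # identifier that could be embedded in a directory name
```

Because the indices are sorted before hashing, two supertest sets get the same identifier only if they contain the same variants.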

# Generating a reduced train size split
Similar to generating a classic train-tune-test split, this function takes the size of the tune and test sets as fractions of the total data. The train set size is specified as a proportion of the data remaining after the tune and test sets have been selected. So, if you specify a 20% tune size and a 20% test size, that leaves a pool of 60% for selecting the train set. If your train set proportion is 50%, then the function will randomly select half the variants from the 60% pool, or 30% of the total data. You must also specify how many train samples of the given size you want. Using multiple samples is important because you may get a really good or really bad training set by chance, especially if you are using a small training size.
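
The size arithmetic can be checked with a few lines of Python. This sketch assumes no withheld set and uses a hypothetical dataset of 10,000 variants. The helper name `reduced_train_fraction` is made up for the example.

```python
def reduced_train_fraction(tune_size, test_size, train_prop):
    """Fraction of the total data that ends up in each reduced train set."""
    # the pool is whatever remains after the tune and test sets are taken
    pool_fraction = 1.0 - tune_size - test_size
    return pool_fraction * train_prop


train_fraction = reduced_train_fraction(tune_size=0.2, test_size=0.2, train_prop=0.5)
num_train = int(round(train_fraction * 10_000))  # hypothetical dataset size

print(f"{train_fraction:.2f}")  # fraction of all variants in each train sample
print(num_train)                # number of train variants out of 10,000
```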

In [10]:
splits, reduced_split_dir = sd.reduced_train_size(ds, tune_size=.2, test_size=.2, train_prop=.5, num_train_reps=5,
                               withhold=supertest_fn, rseed=12, out_dir="data/avgfp/splits", overwrite=True)
splits

INFO:nn4dms.split_dataset:saving reduced split to directory data/avgfp/splits/reduced_tr0.5_tu0.2_te0.2_wf333b1bd195c_s5_r12


[{'stest': array([48689, 22582, 38827, ..., 34538, 27872, 39568]),
  'tune': array([ 4223, 49475, 33378, ...,  4728, 49730, 51157]),
  'test': array([46162, 51625,  1508, ...,  9149, 23872, 38108]),
  'train': array([10535, 24407,  7026, ..., 20189, 29980,  3205])},
 {'stest': array([48689, 22582, 38827, ..., 34538, 27872, 39568]),
  'tune': array([ 4223, 49475, 33378, ...,  4728, 49730, 51157]),
  'test': array([46162, 51625,  1508, ...,  9149, 23872, 38108]),
  'train': array([20825, 32025, 24626, ..., 22147, 44748, 53209])},
 {'stest': array([48689, 22582, 38827, ..., 34538, 27872, 39568]),
  'tune': array([ 4223, 49475, 33378, ...,  4728, 49730, 51157]),
  'test': array([46162, 51625,  1508, ...,  9149, 23872, 38108]),
  'train': array([ 8986, 15392, 17114, ..., 48320,  6359, 40510])},
 {'stest': array([48689, 22582, 38827, ..., 34538, 27872, 39568]),
  'tune': array([ 4223, 49475, 33378, ...,  4728, 49730, 51157]),
  'test': array([46162, 51625,  1508, ...,  9149, 23872, 38108]),


Since multiple train replicates were specified, the result is a list of splits, one per replicate. Note that the tune and test sets are exactly the same as in the previous section. This is the desired behavior, and it follows from using the same random seed as above.
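
The structure of the result, a list of splits that share tune and test sets and differ only in the train sample, can be sketched on toy data. This is a simplified stand-in, not the `sd.reduced_train_size` implementation, and the function name `reduced_replicates` is hypothetical.

```python
import numpy as np


def reduced_replicates(num_examples, tune_size, test_size, train_prop, num_train_reps, rseed):
    """Illustrative reduced-train-size splits with shared tune and test sets."""
    rng = np.random.default_rng(rseed)
    idxs = rng.permutation(num_examples)

    num_tune = int(round(tune_size * num_examples))
    num_test = int(round(test_size * num_examples))
    tune = idxs[:num_tune]
    test = idxs[num_tune:num_tune + num_test]
    pool = idxs[num_tune + num_test:]  # candidates for the train set

    # each replicate draws its own train sample from the same pool
    num_train = int(round(train_prop * len(pool)))
    return [{"tune": tune,
             "test": test,
             "train": rng.choice(pool, size=num_train, replace=False)}
            for _ in range(num_train_reps)]


reps = reduced_replicates(1000, tune_size=0.2, test_size=0.2, train_prop=0.5,
                          num_train_reps=5, rseed=12)
print(len(reps), len(reps[0]["train"]))
```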

# Loading a saved split

In [18]:
split = sd.load_split_dir(split_dir)
split

{'stest': array([15805,  4294,  1630, ...,   882, 31413,  5101]),
 'test': array([ 9544, 12719, 42572, ..., 10882, 13092,  8728]),
 'train': array([ 8542, 26820,  9410, ..., 26616, 36088, 30408]),
 'tune': array([26060, 21785, 28044, ...,  3275,  4654, 12648])}
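
For intuition about why saved splits are reproducible, here is a small round-trip sketch that writes each set to its own text file and reads it back. The one-file-per-set layout and the helper names `save_split` and `load_split` are assumptions made for this example. The actual on-disk format used by `sd.load_split_dir` may differ.

```python
import os
import tempfile

import numpy as np


def save_split(split, split_dir):
    """Write each set's indices to <split_dir>/<set_name>.txt (assumed layout)."""
    os.makedirs(split_dir, exist_ok=True)
    for name, idxs in split.items():
        np.savetxt(os.path.join(split_dir, f"{name}.txt"), idxs, fmt="%d")


def load_split(split_dir):
    """Read every .txt file in split_dir back into a dict of index arrays."""
    split = {}
    for fn in sorted(os.listdir(split_dir)):
        name, ext = os.path.splitext(fn)
        if ext == ".txt":
            split[name] = np.loadtxt(os.path.join(split_dir, fn), dtype=int, ndmin=1)
    return split


example_split = {"train": np.array([4, 0, 3]), "tune": np.array([1]), "test": np.array([2, 5])}
with tempfile.TemporaryDirectory() as tmp_dir:
    save_split(example_split, os.path.join(tmp_dir, "toy_split"))
    loaded = load_split(os.path.join(tmp_dir, "toy_split"))

print(sorted(loaded.keys()))
```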

In [12]:
splits = sd.load_split_dir(reduced_split_dir)
splits

[{'train': array([  995, 42799, 47914, ..., 22973, 25789, 49976]),
  'stest': array([48689, 22582, 38827, ..., 34538, 27872, 39568]),
  'test': array([46162, 51625,  1508, ...,  9149, 23872, 38108]),
  'tune': array([ 4223, 49475, 33378, ...,  4728, 49730, 51157])},
 {'train': array([18859, 36073, 39365, ..., 13389, 46650,  6508]),
  'stest': array([48689, 22582, 38827, ..., 34538, 27872, 39568]),
  'test': array([46162, 51625,  1508, ...,  9149, 23872, 38108]),
  'tune': array([ 4223, 49475, 33378, ...,  4728, 49730, 51157])},
 {'train': array([ 8986, 15392, 17114, ..., 48320,  6359, 40510]),
  'stest': array([48689, 22582, 38827, ..., 34538, 27872, 39568]),
  'test': array([46162, 51625,  1508, ...,  9149, 23872, 38108]),
  'tune': array([ 4223, 49475, 33378, ...,  4728, 49730, 51157])},
 {'train': array([10535, 24407,  7026, ..., 20189, 29980,  3205]),
  'stest': array([48689, 22582, 38827, ..., 34538, 27872, 39568]),
  'test': array([46162, 51625,  1508, ...,  9149, 23872, 38108]),