# Creating the Train, Validation, and Test Sets

After categorizing the data, we need to create the train and validation sets. A test set is optional, since we can always try new images later. Run the cells below to create the sets; you can change the directories or the split ratios. Because the dataset contains many images, it is better to keep the train set at 80% or more.
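As a rough illustration of how the two ratios translate into set sizes, the sketch below mirrors the slicing arithmetic used by `train_test_val_split` further down. The helper name `split_sizes` is hypothetical and not part of the notebook; 4254 is the size of the first class in the output at the end of this section.

```python
def split_sizes(n_images, val_ratio, test_ratio):
    """Return (train, valid, test) counts for n_images (hypothetical helper)."""
    # Cut point where the train share ends
    train_end = int(n_images * (1 - (val_ratio + test_ratio)))
    # Cut point where the train + validation share ends
    val_end = int(n_images * (1 - test_ratio))
    return train_end, val_end - train_end, n_images - val_end

print(split_sizes(4254, 0.15, 0.05))  # (3403, 638, 213)
```

Because both cut points are truncated with `int()`, the train share can land slightly under the nominal value (3403 of 4254 is just below 80%).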

In [11]:
import os
import shutil
import pandas as pd
import numpy as np

In [24]:
os.makedirs('Face_Age_Train_Val_Test', exist_ok=True) # create the directory to store the sets (no error if it already exists)

In [25]:
src_root = os.path.join(os.getcwd(), 'Face_Age_Dataset') # source: where the categorized images come from
root_dir = os.path.join(os.getcwd(), 'Face_Age_Train_Val_Test') # destination: where the train/valid/test sets are stored

In [26]:
def train_test_val_split(val_ratio, test_ratio):
    '''
    Split the images of each age class into train, validation and test sets.
    The train set gets whatever remains: 1 - (val_ratio + test_ratio).
    INPUTS:
    val_ratio - proportion of images for the validation set, e.g. 0.15 = 15%
    test_ratio - proportion of images for the test set, e.g. 0.05 = 5%
    '''
    classes_dir = ['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34',
                   '35-39','40-44', '45-49', '50-54', '55-59', '60-64', '65+']

    for cls in classes_dir:
        os.makedirs(root_dir +'/train' + '/' + cls)
        os.makedirs(root_dir +'/valid' + '/' + cls)
        os.makedirs(root_dir +'/test' + '/' + cls)


        # Creating partitions of the data after shuffling
        src = src_root + '/' + cls # Folder to copy images from

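        allFileNames = os.listdir(src)
        np.random.shuffle(allFileNames)  # shuffles in place; unseeded, so every run gives a different split
        # Cut the shuffled list where the train share ends and where the train + validation share ends
        train_end = int(len(allFileNames) * (1 - (val_ratio + test_ratio)))
        val_end = int(len(allFileNames) * (1 - test_ratio))
        train_FileNames, val_FileNames, test_FileNames = np.split(np.array(allFileNames),
                                                                  [train_end, val_end])

        # Prepend the source folder to get full paths
        train_FileNames = [src + '/' + name for name in train_FileNames.tolist()]
        val_FileNames = [src + '/' + name for name in val_FileNames.tolist()]
        test_FileNames = [src + '/' + name for name in test_FileNames.tolist()]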

        print('Total images: ', len(allFileNames))
        print('Training: ', len(train_FileNames))
        print('Validation: ', len(val_FileNames))
        print('Testing: ', len(test_FileNames))

        # Copy the images into their split folders
        for name in train_FileNames:
            shutil.copy(name, root_dir +'/train' + '/' + cls)

        for name in val_FileNames:
            shutil.copy(name, root_dir +'/valid' + '/' + cls)

        for name in test_FileNames:
            shutil.copy(name, root_dir +'/test' + '/' + cls)
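
The function above shuffles with NumPy's global random state, so each run produces a different split. Below is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for your experiment. The helper `shuffled_split` and the seed value 42 are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def shuffled_split(file_names, val_ratio, test_ratio, seed=42):
    """Shuffle a copy of file_names with a seeded generator, then cut it three ways."""
    rng = np.random.default_rng(seed)     # local generator; the global state is untouched
    names = np.array(sorted(file_names))  # sort first: os.listdir order is arbitrary
    rng.shuffle(names)                    # in-place shuffle of the copy
    train_end = int(len(names) * (1 - (val_ratio + test_ratio)))
    val_end = int(len(names) * (1 - test_ratio))
    train, val, test = np.split(names, [train_end, val_end])
    return train.tolist(), val.tolist(), test.tolist()

names = [f"img_{i}.jpg" for i in range(100)]  # stand-in file names for illustration
train, val, test = shuffled_split(names, 0.15, 0.05)
print(len(train), len(val), len(test))
```

Calling the helper twice with the same seed returns the same three lists, which makes it possible to compare models trained on identical splits.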


In [27]:
train_test_val_split(0.15,0.05)

Total images:  4254
Training:  3403
Validation:  638
Testing:  213
Total images:  1628
Training:  1302
Validation:  244
Testing:  82
Total images:  1042
Training:  833
Validation:  156
Testing:  53
Total images:  1394
Training:  1115
Validation:  209
Testing:  70
Total images:  1317
Training:  1053
Validation:  198
Testing:  66
Total images:  1713
Training:  1370
Validation:  257
Testing:  86
Total images:  960
Training:  768
Validation:  144
Testing:  48
Total images:  1108
Training:  886
Validation:  166
Testing:  56
Total images:  608
Training:  486
Validation:  91
Testing:  31
Total images:  742
Training:  593
Validation:  111
Testing:  38
Total images:  1078
Training:  862
Validation:  162
Testing:  54
Total images:  798
Training:  638
Validation:  120
Testing:  40
Total images:  678
Training:  542
Validation:  102
Testing:  34
Total images:  1950
Training:  1560
Validation:  292
Testing:  98
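
To confirm that the copy step produced the counts printed above, one option is to count the files in each split folder. This is a minimal sketch, assuming the `train`/`valid`/`test` layout created by `train_test_val_split`; the helper name `count_images` is hypothetical.

```python
import os

def count_images(root):
    """Return {split: {class_name: n_files}} for the split folders under root."""
    counts = {}
    for split in ('train', 'valid', 'test'):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue  # skip splits that were never created
        counts[split] = {
            cls: len(os.listdir(os.path.join(split_dir, cls)))
            for cls in sorted(os.listdir(split_dir))
            if os.path.isdir(os.path.join(split_dir, cls))
        }
    return counts
```

Calling `count_images(root_dir)` after the split should report per-class numbers that match the printed output, for example 3403 training images for the first class in the list, `0-4`.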
