# Step 3 - Split data into train, validation, and test sets
### Objective
This program splits the images in the 'Processed Data' folder into train, validation, and test sets. The images are copied (not moved) into per-class subfolders inside the 'TrainValTest Data' folder.

### Requirements
The program 'Step 2' has been run to completion, so that the 'Processed Data' folder and its class subfolders exist and are populated.

### Folder structure

    Data
    |
    ----Input Videos                    ---existing
    |    |
    |    ----Kevin (class 1)            ---existing
    |    |
    |    ----Stuart (class 2)           ---existing
    |
    ----Output Data                     ---existing
    |    |
    |    ----Kevin (class 1)            ---existing
    |    |
    |    ----Stuart (class 2)           ---existing
    |
    ----Processed Data                  ---existing
    |    |
    |    ----Kevin (class 1)            ---existing
    |    |
    |    ----Stuart (class 2)           ---existing
    |    
    ----TrainValTest Data               ---will be created by the program
         |
         ----Train                      ---will be created by the program
              |
              ----Kevin (class 1)       ---will be created by the program
              |
              ----Stuart (class 2)      ---will be created by the program
         |
         ----Val                        ---will be created by the program 
              |
              ----Kevin (class 1)       ---will be created by the program
              |
              ----Stuart (class 2)      ---will be created by the program
         |
         ----Test                       ---will be created by the program
              |
              ----Kevin (class 1)       ---will be created by the program
              |
              ----Stuart (class 2)      ---will be created by the program

### Output
A folder called 'TrainValTest Data' is created, containing the subfolders 'Train', 'Val', and 'Test'. Each of these contains one subfolder per class, which holds that split's share of the images from the matching class subfolder in 'Processed Data' (60% for training, 30% for validation, and the remaining 10% for testing with the default settings).
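
Once the program has run, the layout described above can be checked by counting the images in each split and class. The following is a minimal sketch, not part of the original program: the helper name `count_images` is hypothetical, and it assumes the images are `.jpg` files stored directly inside the class subfolders.

```python
import glob
import os


def count_images(root, pattern="*.jpg"):
    """Return {split: {class_name: image_count}} for the folders under root.

    Hypothetical helper for illustration; not part of the original program.
    """
    counts = {}
    if not os.path.isdir(root):
        return counts  # nothing to report if the split has not been run yet
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue
        counts[split] = {}
        for class_name in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, class_name)
            if os.path.isdir(class_dir):
                # glob.escape protects folder names containing [, ], * or ?
                images = glob.glob(os.path.join(glob.escape(class_dir), pattern))
                counts[split][class_name] = len(images)
    return counts


# Print one line per split, e.g. the counts for 'Train', 'Val', and 'Test'
for split, per_class in count_images(os.path.join("Data", "TrainValTest Data")).items():
    print(split, per_class)
```

If the counts per class do not add up to the number of images in the matching 'Processed Data' subfolder, the split was probably interrupted or run against a different folder.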

In [1]:
#Importing the libraries
import os
import glob
import shutil
import random

In [1]:
#Variables used: fractions of each class used for training and validation (the remainder is used for testing)
trainSplitPercentage = 0.6
valSplitPercentage = 0.3
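
To see how these two fractions translate into image counts, the sketch below applies the same arithmetic the program uses later: the training and validation sizes are truncated with `int()`, and the test set receives whatever is left. The function name `split_sizes` is an assumption made for illustration only.

```python
def split_sizes(n_files, train_frac=0.6, val_frac=0.3):
    """Return (train, val, test) counts for n_files images.

    Hypothetical helper that mirrors the program's arithmetic.
    """
    train_size = int(train_frac * n_files)  # int() truncates towards zero
    val_size = int(val_frac * n_files)
    test_size = n_files - train_size - val_size  # remainder goes to testing
    return train_size, val_size, test_size


for n in (10, 101):
    print(n, split_sizes(n))
```

Because of the truncation, the test share can be slightly larger than 10% when the number of images is not a multiple of 10, but the three counts always sum to the total.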

In [2]:
#Creating the necessary folders
#os.path.join avoids unescaped backslashes such as '\T' and '\V' in the path strings
for split in ('Train', 'Val', 'Test'):
    os.makedirs(os.path.join('Data', 'TrainValTest Data', split), exist_ok=True)

for dirs in os.listdir(os.path.join('Data', 'Processed Data')):
    inputDir = os.path.join('Data', 'Processed Data', dirs)
    trainDir = os.path.join('Data', 'TrainValTest Data', 'Train', dirs)
    valDir = os.path.join('Data', 'TrainValTest Data', 'Val', dirs)
    testDir = os.path.join('Data', 'TrainValTest Data', 'Test', dirs)
    
    #Removing the folders if they exist
    if os.path.exists(trainDir):
        shutil.rmtree(trainDir)
        
    if os.path.exists(valDir):
        shutil.rmtree(valDir)
    
    if os.path.exists(testDir):
        shutil.rmtree(testDir)
        
    #Creating the folders
    os.mkdir(trainDir)
    os.mkdir(valDir)
    os.mkdir(testDir)
    
    #Collecting the file names of all .jpg images in this class folder
    filelist = [os.path.basename(file) for file in glob.glob(os.path.join(inputDir, '*.jpg'))]
    
    #Shuffling the data in the list
    random.shuffle(filelist)
    
    #Calculating the number of images for training, validation, and testing
    #(int() rounds down, so the testing set receives the remainder)
    trainSize = int(trainSplitPercentage * len(filelist))
    valSize = int(valSplitPercentage * len(filelist))
    testSize = len(filelist) - trainSize - valSize
    
    #Splitting the list data
    train_filenames = filelist[:trainSize]
    val_filenames = filelist[trainSize:trainSize+valSize]
    test_filenames = filelist[trainSize+valSize:]

    #Using the list data to copy images from the source folder to the destination folders
    for dstDir, filenames in ((trainDir, train_filenames),
                              (valDir, val_filenames),
                              (testDir, test_filenames)):
        for filename in filenames:
            shutil.copy(os.path.join(inputDir, filename), os.path.join(dstDir, filename))

print ('Splitting of data complete.')
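
The program shuffles with the global random generator, so each run produces a different split. If a repeatable split is wanted, one option is to sort the file names and shuffle them with a seeded generator. The sketch below is an assumption about how that could look, not what the program above does; `split_filenames` and the seed value `42` are hypothetical.

```python
import random


def split_filenames(filenames, train_frac=0.6, val_frac=0.3, seed=42):
    """Return (train, val, test) lists of file names, repeatable for a given seed.

    Hypothetical helper for illustration; not part of the original program.
    """
    names = sorted(filenames)  # sort so the result does not depend on listing order
    random.Random(seed).shuffle(names)  # private generator, global state untouched
    train_size = int(train_frac * len(names))
    val_size = int(val_frac * len(names))
    return (names[:train_size],
            names[train_size:train_size + val_size],
            names[train_size + val_size:])


# Example with made-up file names
train, val, test = split_filenames(["img_%03d.jpg" % i for i in range(10)])
print(len(train), len(val), len(test))
```

Using a dedicated `random.Random(seed)` instance keeps the split independent of any other code that calls the `random` module.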