# Separate data for training, validation and sample

Before we get to preprocessing, we should split our available data into *training, cross-validation and testing sets*. It is also really helpful to create a small subset of the data (called **sample**), which we'll use for early experiments and for making sure the code works.

The idea is to work on a small sample dataset (itself separated into train, cv and test) so that we get feedback quickly - and the same code can then be run for more epochs on the larger dataset.

This is a good exercise in basic Python **file manipulation**. It can also be done directly in the terminal (I highly recommend *tmux*), but I prefer a notebook for this because I can then retrace my steps and easily repeat the process.

## Action Plan
What do we plan to achieve with this notebook and what steps need to be taken?

   -  **Split into main/test and main/cv**
   <br><font color=gray>use the provided testing_list.txt and validation_list.txt lists to split the original train set into main/test and main/cv (by moving files). This keeps all files recorded by the same person in a single subset, so the model can't latch onto a person's voice characteristics, and it ensures there's no data leakage between subsets</font><br><br>
   -  **Prepare main/train**
   <br><font color=gray>treat the remaining files as main/train</font><br><br>
   -  **Prepare main/*subset*/unknown**
   <br><font color=gray>use the categories we won't be predicting as *unknown*, but maintain the files' uniqueness by appending their original category name to the end of the filename and then put them all into main/*subset*/unknown/ </font><br><br>
   -  **Split background noise files**
   <br><font color=gray>all the background noise files (which we'll treat as *silence* category members) need to be cut into approximately 1 second long files (the other files are also slightly irregular in this way)</font><br><br>
   -  **Prepare main/*subset*/silence**
   <br><font color=gray>all the 1-sec silence files need to be split into train, test and cv subsets (60% / 20% / 20%)</font><br><br>
   -  **Copy subsets into sample/*subset*/*category***
   <br><font color=gray>create a sample folder (as opposed to main/) and copy a small, random subset of each category into sample/train, sample/cv and sample/test</font><br><br>
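
The file-moving steps of this plan can be sketched roughly as below. This is only a minimal illustration, not necessarily the code used later in the notebook. It assumes that each line of testing_list.txt and validation_list.txt is a relative path of the form `category/filename.wav`, and that the data sits in one folder per category. The function names `move_listed_files`, `merge_into_unknown` and `copy_sample` are hypothetical helpers invented for this example.

```python
import random
import shutil
from pathlib import Path


def move_listed_files(src_root, list_file, dest_root):
    """Move every file named in list_file from src_root to dest_root.

    Assumption: list_file holds one relative path per line,
    e.g. 'yes/some_recording.wav'. Returns the number of files moved.
    """
    src_root, dest_root = Path(src_root), Path(dest_root)
    moved = 0
    for line in Path(list_file).read_text().splitlines():
        rel = line.strip()
        if not rel:
            continue  # skip blank lines
        src = src_root / rel
        if not src.exists():
            continue  # already moved, or not part of this copy of the data
        dest = dest_root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved += 1
    return moved


def merge_into_unknown(subset_dir, known_categories):
    """Move all categories we won't predict into subset_dir/unknown.

    The original category name is appended to each file name
    (a.wav from bed/ becomes a_bed.wav) so names stay unique.
    """
    subset_dir = Path(subset_dir)
    unknown = subset_dir / "unknown"
    unknown.mkdir(parents=True, exist_ok=True)
    skip = set(known_categories) | {"unknown", "silence"}
    for cat_dir in sorted(p for p in subset_dir.iterdir() if p.is_dir()):
        if cat_dir.name in skip:
            continue
        for f in sorted(cat_dir.iterdir()):
            new_name = f"{f.stem}_{cat_dir.name}{f.suffix}"
            shutil.move(str(f), str(unknown / new_name))
        cat_dir.rmdir()  # the category folder is empty now


def copy_sample(main_dir, sample_dir, n_per_category, seed=0):
    """Copy up to n_per_category random files from every
    main/<subset>/<category> folder into sample/<subset>/<category>."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    main_dir, sample_dir = Path(main_dir), Path(sample_dir)
    for cat_dir in sorted(main_dir.glob("*/*")):
        if not cat_dir.is_dir():
            continue
        files = sorted(p for p in cat_dir.iterdir() if p.is_file())
        dest = sample_dir / cat_dir.relative_to(main_dir)
        dest.mkdir(parents=True, exist_ok=True)
        for f in rng.sample(files, min(n_per_category, len(files))):
            shutil.copy2(f, dest / f.name)
```

With these helpers, the plan would amount to calling `move_listed_files` once for main/test and once for main/cv, then `merge_into_unknown` on each of the three subsets, and finally `copy_sample` from main/ into sample/. Files are moved for the main split but copied for the sample, so main/ stays complete.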

```python
import tensorflow as tf
```