# Separate data for training, validation and sample

Before we get to preprocessing, we should split our available data into *training, cross-validation and testing sets*. It is also really helpful to create a small subset of the data (called **sample**), which we'll use for early experiments and for making sure the code works. 

The idea is to work on a small sample dataset (itself separated into train, cv and test) so that we get feedback quickly - and the same code can then be run for more epochs on the larger dataset.

This is a good exercise in basic Python **file manipulation** and can also be done directly in the terminal (I highly recommend *tmux*), but I prefer a notebook for this because I can then retrace my steps and easily repeat the process. 

## Action Plan
What do we plan to achieve with this notebook and what steps need to be taken?

   -  **Split into main/test and main/cv**
   <br><font color=gray> use the provided testing_list.txt and validation_list.txt lists to split the original train set into main/test and main/cv (by moving files). This has the benefit of putting files recorded by the same person in only one subset, so the model can't latch onto a person's voice characteristics. This way we make sure there's no data leakage.</font><br><br>
   -  **Prepare main/train**
   <br><font color=gray>treat the remaining files as main/train.</font><br><br>
   -  **Prepare main/-subset-/unknown**
   <br><font color=gray>use the categories we won't be predicting as *unknown*, but maintain the files' uniqueness by appending their original category name to the end of the filename and then put them all into main/*subset*/unknown/. </font><br><br>
   -  **Split background noise files**
   <br><font color=gray>all the background noise files (which we'll treat as *silence* category members) need to be cut into approximately 1 second long files (the other files are also slightly irregular in this way).</font><br><br>
   -  **Prepare main/-subset-/silence**
   <br><font color=gray>all the 1-sec silence files need to be split into train, test and cv subsets (60% / 20% / 20%). </font><br><br>
   -  **Copy subsets into sample/-subset-/category**
   <br><font color=gray>create a sample folder (as opposed to /main) & copy a small, random subset of each category into sample/train, sample/cv and sample/test.</font>

## Import
We'll need a couple of additional libraries so let's import them.

In [5]:
import glob
import math
import os

from shutil import copyfile
from pydub import AudioSegment

### Split into main/test and main/cv
What Google Brain & Kaggle give us is a folder called "train" with 3 important elements: testing_list.txt, validation_list.txt and a subfolder "audio", whose subfolders are named after what the .wav files within them represent - e.g. "yes", "no", "happy" & "dog". 

Google's idea is for us to use the .txt files to grab the right .wav files and move them from the primary train folder into validation and test folders. 

In [8]:
# make sure we're in the right folder (the one with Google's test and train folder in it), and if not cd into it.
!pwd

/c/Users/mateusz/Documents/Mateusz/Career/Machine Learning & AI/tensorflow_speech_recognition/tensorflow_speech_recognition


We're aiming for the following directory tree:
    -  data\
        - main\
            - train\
                - all subfolders named after categories (e.g. "yes", "one", "cat", "silence", "unknown")
            - test\
                - (...)
            - cv\
                - (...)
        - sample\
            - train\
            - test\
            - cv\

In [10]:
# create our own data folder with main\train, main\test and main\cv subfolders in it
!mkdir data\main\train
!mkdir data\main\test
!mkdir data\main\cv

# depending on the operating system the slashes might be backward or forward
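
To sidestep the slash question altogether, the folders can also be created from Python. The snippet below is only a sketch: `make_subset_dirs` is a helper name I made up, and it runs inside a temporary directory so nothing in the project is touched. It builds every path with `os.path.join` and creates it with `os.makedirs`, which also creates missing parent folders.

```python
import os
import tempfile


def make_subset_dirs(root, datasets=("main", "sample"), subsets=("train", "test", "cv")):
    """Create root/<dataset>/<subset> for every combination and return the paths."""
    created = []
    for dataset in datasets:
        for subset in subsets:
            # os.path.join picks the right separator for the current OS
            path = os.path.join(root, dataset, subset)
            # exist_ok=True makes the cell safe to re-run
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created


# demo in a throwaway directory (hypothetical location, not the real project)
demo_root = tempfile.mkdtemp()
demo_dirs = make_subset_dirs(os.path.join(demo_root, "data"))
print(len(demo_dirs), "folders created")
```

Passing `"data"` as `root` from the project folder would produce the tree shown above on any operating system.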

In [3]:
# store relative paths (we need to escape the backslash)
path_main_train = "data\\main\\train"
path_main_test = "data\\main\\test"
path_main_cv = "data\\main\\cv"

paths_main = [path_main_train, path_main_test, path_main_cv]

In [4]:
# store relative paths to provided data
path_provided = "train\\audio"

In [6]:
# define the categories (which are also names of subfolders, all lowercase)
categories_to_predict = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go",
             "silence", "unknown"]

categories_unknown = ["Zero", "One", "Two", "Three", "Four", "Five", 
                      "Six", "Seven", "Eight", "Nine", "Bed", "Bird",
                      "Cat", "Dog", "Happy", "House", "Marvin", "Sheila",
                      "Tree", "Wow"]

categories_all = categories_to_predict + categories_unknown

# we need to make them all lowercase because that's how they're written in testing_list.txt and validation_list.txt
categories_all = [category.lower() for category in categories_all]

categories_all[::5]  # every 5th element will be shown

['yes', 'right', 'silence', 'three', 'eight', 'dog', 'tree']

In [64]:
# create all subfolders in main
for path in paths_main:
    for subfolder_name in categories_all:
        subfolder_path = os.path.join(path, subfolder_name)
        !mkdir $subfolder_path

In [46]:
# we should now see our newly made data folder and the original train folder provided by Google 
# I've also made a backup of the train folder, just in case.
!ls

0. Separate data for training, validation and sample.ipynb
LICENSE
README.md
data
req.txt
sample_submission.csv
test
train
train_backup


In [51]:
# take a look at the format
with open("train\\testing_list.txt", "r", encoding="UTF-8") as f:
    content = f.readlines()
    
    # show first 5 entries
    print(content[:5])

['bed/0c40e715_nohash_0.wav\n', 'bed/0ea0e2f4_nohash_0.wav\n', 'bed/0ea0e2f4_nohash_1.wav\n', 'bed/105a0eea_nohash_0.wav\n', 'bed/1528225c_nohash_0.wav\n']
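
Each entry has the form `category/<hash>_nohash_<n>.wav`. Assuming the part before `_nohash_` identifies the speaker (which is what lets the lists keep one person in only one subset), we can sketch a quick leakage check. The helper names are my own, and the validation entry below is made up purely for illustration.

```python
def speaker_id(list_entry):
    """Return the part of the filename before '_nohash_' (assumed to be the speaker hash)."""
    filename = list_entry.strip().split("/")[-1]
    return filename.split("_nohash_")[0]


def shared_speakers(entries_a, entries_b):
    """Return the speaker hashes that appear in both lists."""
    return {speaker_id(e) for e in entries_a} & {speaker_id(e) for e in entries_b}


# the first entries of testing_list.txt, as printed above
testing_entries = ['bed/0c40e715_nohash_0.wav\n',
                   'bed/0ea0e2f4_nohash_0.wav\n',
                   'bed/0ea0e2f4_nohash_1.wav\n']

# hypothetical validation entry, NOT taken from the real validation_list.txt
validation_entries = ['bed/aaaa1111_nohash_0.wav\n']

print(shared_speakers(testing_entries, validation_entries))
```

An empty set means no speaker appears in both lists. Running the same check on the full contents of both .txt files would confirm it for the real data.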


In [66]:
# use the .txt files provided by Google to move files from original \train\ directory
# this won't be a problem for linux/mac users but on windows we'll have to fix the slash
def move_per_list(text_list, path_origin, path_destination):
    """
    Take a list of paths to .wav files and move them from origin path to destination
    """
    with open(text_list, "r", encoding="UTF-8") as f:
        content = f.readlines()
        for line in content:
            # strip newline ("\n")
            line = line.strip()

            # replace "/" with "\\"
            line = line.replace("/", "\\")

            # construct current path to .wav file
            cur_path = os.path.join(path_origin, line)

            # move the listed .wav files
            os.rename(cur_path, os.path.join(path_destination, line))        

In [67]:
# main\test\...
move_per_list("train\\testing_list.txt", path_provided, path_main_test)

In [69]:
# main\cv\...
move_per_list("train\\validation_list.txt", path_provided, path_main_cv)

After that our train\audio directory no longer contains the files from testing_list.txt and validation_list.txt.
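
The backslash replacement in `move_per_list` ties it to Windows. Below is a sketch of a portable variant; `move_per_list_portable` is a name I made up, and the demo runs on dummy files in a temporary folder. It splits each entry on "/" and rebuilds it with `os.path.join`, and it uses `shutil.move` from the standard library.

```python
import os
import shutil
import tempfile


def move_per_list_portable(text_list, path_origin, path_destination):
    """Move the files named in text_list from path_origin to path_destination.

    Returns the number of files moved.
    """
    moved = 0
    with open(text_list, "r", encoding="UTF-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # the lists always use "/", so rebuild the path for the current OS
            relative = os.path.join(*line.split("/"))
            destination = os.path.join(path_destination, relative)
            # make sure the category subfolder exists
            os.makedirs(os.path.dirname(destination), exist_ok=True)
            shutil.move(os.path.join(path_origin, relative), destination)
            moved += 1
    return moved


# demo on a dummy file in a throwaway directory
demo = tempfile.mkdtemp()
origin = os.path.join(demo, "audio")
target = os.path.join(demo, "test")
os.makedirs(os.path.join(origin, "bed"))
open(os.path.join(origin, "bed", "0c40e715_nohash_0.wav"), "wb").close()

list_path = os.path.join(demo, "testing_list.txt")
with open(list_path, "w", encoding="UTF-8") as f:
    f.write("bed/0c40e715_nohash_0.wav\n")

moved_count = move_per_list_portable(list_path, origin, target)
print(moved_count, "file(s) moved")
```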

### Prepare main/train
All the .wav files remaining in the path_provided (the original contents of \train\audio provided by Google) are meant to be treated as the training set, so let's move them to the appropriate data\main\train folder.

This step deals with the largest chunk of data so it can take a couple of minutes, depending on the hardware.

In [89]:
# we can use glob module to grab all .wav files in the subfolders
for category in categories_all:
    # build the glob pattern for this category (a wildcard pattern, not a regex)
    glob_pattern = os.path.join(path_provided, category, "*.wav")
    
    # use glob to grab all .wav files in that category's subdirectory
    train_wav_files = glob.glob(glob_pattern)

    # move them to our data\main\train folder
    for wav_file in train_wav_files:
        
        # grab the unique name of each .wav file relative to path_provided
        # (e.g. yes\004ae712_nohash_0.wav, without the parent dirs)
        wav_unique_name = os.path.relpath(wav_file, path_provided)
        
        # move them
        os.rename(wav_file, os.path.join(path_main_train, wav_unique_name))

The original path_provided (\train\audio\(...)) should now contain empty category subfolders, with the exception of \_background\_noise\_, which we'll use for creating our "silence" examples. 

### Prepare main/-subset-/unknown
Some of the folders contain .wav files that we should consider to belong to one common category - "unknown", representing the fact that the user may have said something, but it was not a known command. Examples are categories such as "happy", "wow" and "tree". We can see why we'd like our voice-recognition applications to distinguish between an unknown word and silence/background noise.

We'll want to move all the files from those folders into one folder named "unknown". The tricky part is that some of the files within "happy" and "wow" may have the same name. They're only uniquely named if we take into consideration the name of their folder.

My solution is to rename all files within these categories by appending the name of the folder to the end of the filename, move the renamed files to "unknown", and finally remove the remaining empty folders.
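
To make the naming scheme concrete, here is a small sketch. `unknown_target` is a hypothetical helper, and the filename is only an example. It computes where a file would end up, and shows that two files with the same name in "happy" and "wow" no longer collide.

```python
import os


def unknown_target(wav_path, category, subset_path):
    """Return the new path inside subset_path/unknown with the category appended to the name."""
    stem, extension = os.path.splitext(os.path.basename(wav_path))
    new_name = "{}_{}{}".format(stem, category, extension)
    return os.path.join(subset_path, "unknown", new_name)


subset = os.path.join("data", "main", "train")

# same (example) filename in two different category folders
from_happy = unknown_target(os.path.join(subset, "happy", "004ae712_nohash_0.wav"), "happy", subset)
from_wow = unknown_target(os.path.join(subset, "wow", "004ae712_nohash_0.wav"), "wow", subset)

print(os.path.basename(from_happy))
print(os.path.basename(from_wow))
```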

In [17]:
# make sure your unknown categories are lowercase
categories_unknown = [category.lower() for category in categories_unknown]
categories_unknown[::3]

['zero', 'three', 'six', 'nine', 'cat', 'house', 'tree']

The step below is moving a lot of files, so again it may take a couple minutes to finish.

In [34]:
# rename for every subset (train, test, cv) and subcategory (e.g. 'house', 'tree' etc.)
for path_to_subset in paths_main:
    for unknown_cat in categories_unknown:
        
        # build the glob pattern for this category's folder
        path_to_cat = os.path.join(path_to_subset, unknown_cat)
        glob_pattern = os.path.join(path_to_cat, "*.wav")
        
        # grab all "unknown" files
        unknown_wavs = glob.glob(glob_pattern)
        
        # create destination path
        path_to_destination = os.path.join(path_to_subset, "unknown")
        
        # move & rename files
        for wav_file in unknown_wavs:
            
            # split the bare filename into its stem and extension
            stem, extension = os.path.splitext(os.path.basename(wav_file))
            
            # add the category name (e.g. "_tree" before ".wav") and point the path at "unknown";
            # building the path explicitly avoids str.replace also rewriting a filename hash
            # that happens to contain the category name (e.g. "bed" is valid hex)
            new_full_path = os.path.join(path_to_destination, stem + "_" + unknown_cat + extension)
            
            # move files
            os.rename(wav_file, new_full_path)

Now we can safely remove the folders of the categories that became "unknown".
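
Since these folders should be empty by now, a conservative, cross-platform way to remove them is `os.rmdir`, which refuses to delete a folder that still has files in it. The sketch below uses a helper name of my own (`remove_empty_category_dirs`) and a throwaway directory. It reports any folder it had to leave alone.

```python
import os
import tempfile


def remove_empty_category_dirs(subset_paths, categories):
    """Remove each subset/category folder if it is empty.

    Returns the folders that were skipped because they still contain files.
    """
    skipped = []
    for subset_path in subset_paths:
        for category in categories:
            folder = os.path.join(subset_path, category)
            if not os.path.isdir(folder):
                continue  # nothing to remove
            if os.listdir(folder):
                skipped.append(folder)  # something was left behind, keep it
            else:
                os.rmdir(folder)
    return skipped


# demo: "tree" is empty, "wow" still holds a file, "house" doesn't exist
demo = tempfile.mkdtemp()
demo_subset = os.path.join(demo, "train")
os.makedirs(os.path.join(demo_subset, "tree"))
os.makedirs(os.path.join(demo_subset, "wow"))
open(os.path.join(demo_subset, "wow", "left_behind.wav"), "wb").close()

skipped = remove_empty_category_dirs([demo_subset], ["tree", "wow", "house"])
print(skipped)
```

A non-empty result would tell us that the move step missed some files.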

In [37]:
# remove the unused category folders from all subsets
for path_to_subset in paths_main:
    for unknown_cat in categories_unknown:
        full_unused_path = os.path.join(path_to_subset, unknown_cat)
        !rm -r $full_unused_path

In [41]:
# main/train, main/test and main/cv should now only contain folders named after categories that we'll be predicting
!ls $path_to_subset

down
go
left
no
off
on
right
silence
stop
unknown
up
yes


<div class="alert alert-block alert-warning">It is important to notice that in main/test we now have 250 examples per category, but in the case of unknown we have over 4000 examples. In main/train we have 1850 examples per category and 32K in unknown. This makes our dataset very unbalanced.</div>

However, at this stage we don't want to lose any data, and we may be able to use data augmentation later on to reduce this imbalance.
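
Counts like these can be checked with a few lines of Python. This is just a sketch: `count_wavs_per_category` is a made-up helper, and the demo uses dummy files in a temporary folder rather than the real dataset.

```python
import glob
import os
import tempfile


def count_wavs_per_category(subset_path):
    """Return {category: number of .wav files} for every subfolder of subset_path."""
    counts = {}
    for category in sorted(os.listdir(subset_path)):
        folder = os.path.join(subset_path, category)
        if os.path.isdir(folder):
            counts[category] = len(glob.glob(os.path.join(folder, "*.wav")))
    return counts


# demo with dummy files: 2 in "yes", 5 in "unknown"
demo_subset_dir = os.path.join(tempfile.mkdtemp(), "test")
for category, n_files in [("yes", 2), ("unknown", 5)]:
    os.makedirs(os.path.join(demo_subset_dir, category))
    for i in range(n_files):
        open(os.path.join(demo_subset_dir, category, "dummy_{}.wav".format(i)), "wb").close()

counts = count_wavs_per_category(demo_subset_dir)
print(counts)
```

Calling it with `path_main_train`, `path_main_test` or `path_main_cv` would list the real per-category counts.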

### Split background noise files
In the train/audio/\_background\_noise\_ folder provided by Google Brain & Kaggle we have 6 .wav files of differing length (between 1:00 and 1:35 minutes). We can use them in a couple of different ways: mix them with our categorised .wav samples, so that our model might better learn to pick out voice commands in noisy environments, or treat them separately as members of the "silence" category.

In this notebook we'll take the latter approach and leave mixing .wav files to the next notebook, which will put more focus on preprocessing.

In [20]:
def split_into_1sec(input_source, output_destination, fragment_length=1000):
    """ 
    Split a .wav file into fragments of fragment_length milliseconds
    (1 second by default). Leftover audio shorter than a full fragment
    is dropped, which is how differing input lengths are handled.
    """
    
    # open the input source
    source = AudioSegment.from_wav(input_source)
    
    # store length (in milliseconds)
    length = len(source)
    
    # split until there's not enough length to create a full fragment
    for i in range(length // fragment_length):
        
        # grab the right slice (pydub slices are in milliseconds, end excluded)
        current_fragment = source[i * fragment_length : (i + 1) * fragment_length]
        
        # construct output fragment's name
        fragment_name = output_destination + "_{}.wav".format(i+1)
        
        # save the fragment
        current_fragment.export(fragment_name, format="wav")

Now we can use this function to cut the files in train/audio/\_background\_noise\_ into 1 second fragments, which we'll store in a separate \_background\_noise\_fragments\_ folder.
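
Before running it on real audio, it may help to see which millisecond ranges such a split covers. The sketch below is independent of pydub. `fragment_bounds` is a made-up helper that only does the arithmetic, using half-open `[start, end)` ranges so that consecutive fragments don't overlap.

```python
def fragment_bounds(length_ms, fragment_length=1000):
    """Return (start, end) millisecond pairs for every full fragment that fits."""
    return [(i * fragment_length, (i + 1) * fragment_length)
            for i in range(length_ms // fragment_length)]


# a hypothetical 3.5 second clip: the trailing half second is dropped
bounds = fragment_bounds(3500)
print(bounds)

# a file of exactly 1:00 (60000 ms) would give 60 fragments
print(len(fragment_bounds(60000)))
```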

In [22]:
background_types = ["doing_the_dishes", "dude_miaowing", 
                    "exercise_bike", "pink_noise",
                    "running_tap", "white_noise"]

In [28]:
# create a new directory for the fragments
path_to_fragments = os.path.join(path_provided, "_background_noise_fragments_")
!mkdir $path_to_fragments

In [29]:
# split all 6 background noise files
for background in background_types:
    # grab the full-length background file
    path_to_background = os.path.join(path_provided, "_background_noise_", background + ".wav")
    
    # create destination path
    path_to_destination = os.path.join(path_to_fragments, background)
    
    # split
    split_into_1sec(path_to_background, path_to_destination)

If we listen to the files we will notice that some of them are louder than others, which may be good for the background noise "silence" category, but not necessarily so for other categories. We may wish to look at that in the preprocessing notebook. 

In [32]:
print(path_to_fragments)

train\audio\_background_noise_fragments_


In [34]:
# let's see how many "silence" examples we have
!ls $path_to_fragments | wc

    398     398    8167


As you can see, we've got almost 400 examples of 1-second background noise that we can now move to the main/ _subset_ /silence folders.

### Prepare main/-subset-/silence
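
The action plan calls for splitting the 1-second silence files into train, test and cv in a 60/20/20 proportion. One possible way to do that is sketched below. The helper name, the fixed seed and the use of a random shuffle are all my assumptions, not a tested part of this notebook.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle items reproducibly and split them into 60% train, 20% test, 20% cv."""
    shuffled = sorted(items)                 # sort first so the result doesn't depend on input order
    random.Random(seed).shuffle(shuffled)    # local generator, leaves the global random state alone
    n_train = len(shuffled) * 60 // 100
    n_test = len(shuffled) * 20 // 100
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "cv": shuffled[n_train + n_test:],   # cv takes whatever is left after rounding
    }


# demo with fragment names in the format produced by split_into_1sec
fragments = ["white_noise_{}.wav".format(i + 1) for i in range(10)]
parts = split_60_20_20(fragments)
print({subset: len(files) for subset, files in parts.items()})
```

Each list could then be moved into the matching main/ _subset_ /silence folder with `os.rename`. Note that with a random shuffle, fragments cut from the same recording end up in every subset.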

### Copy subsets into sample/-subset-/category
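
The last step of the action plan is to copy a small, random subset of each category from main into sample. The sketch below shows one way this might look. `copy_random_sample`, the sample size and the seed are assumptions of mine, and the demo works on dummy files in a temporary folder. It uses `copyfile` (imported at the top of the notebook) so the files in main stay where they are.

```python
import glob
import os
import random
import tempfile
from shutil import copyfile


def copy_random_sample(main_subset, sample_subset, n_per_category, seed=0):
    """Copy up to n_per_category random .wav files of every category into sample_subset."""
    rng = random.Random(seed)
    copied = {}
    for category in sorted(os.listdir(main_subset)):
        source_dir = os.path.join(main_subset, category)
        if not os.path.isdir(source_dir):
            continue
        wavs = sorted(glob.glob(os.path.join(source_dir, "*.wav")))
        # never ask for more files than the category has
        chosen = rng.sample(wavs, min(n_per_category, len(wavs)))
        target_dir = os.path.join(sample_subset, category)
        os.makedirs(target_dir, exist_ok=True)
        for wav in chosen:
            copyfile(wav, os.path.join(target_dir, os.path.basename(wav)))
        copied[category] = len(chosen)
    return copied


# demo: 5 dummy files in main/train/yes, copy 2 of them into sample/train/yes
demo_data = tempfile.mkdtemp()
demo_main_train = os.path.join(demo_data, "main", "train")
demo_sample_train = os.path.join(demo_data, "sample", "train")
os.makedirs(os.path.join(demo_main_train, "yes"))
for i in range(5):
    open(os.path.join(demo_main_train, "yes", "dummy_{}.wav".format(i)), "wb").close()

copied = copy_random_sample(demo_main_train, demo_sample_train, n_per_category=2)
print(copied)
```

Repeating the call for the train, cv and test subsets would fill sample/train, sample/cv and sample/test.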