Solution to Kaggle TensorFlow Speech Contest

Kaggle TensorFlow Speech Recognition Challenge

This repository contains my solution to the Kaggle TensorFlow Speech Recognition Challenge. The goal of the competition was to train a model to recognize ten simple spoken words using Google Brain's Speech Commands dataset.

I wrote my solution in Python and TensorFlow. As a learning exercise, I designed my own network rather than starting from one of the well-known architectures. I began with a pure RNN approach but later added 2D convolutional layers, which proved effective at finding structure in the Mel-frequency cepstral coefficient (MFCC) representation of the audio signals. My final design was:

| Layer      | Type       | Description                                 |
|------------|------------|---------------------------------------------|
| Input      | Data       | MFCC transformed data                       |
| Conv1      | Conv2d     | 64 filters, 10x5 kernel, 1x2 strides        |
| BatchNorm1 | Batch Norm | Momentum 0.9                                |
| Relu1      | Relu       | Relu activation layer                       |
| MaxPool1   | MaxPool    | 3x3 kernel, 1x2 strides                     |
| DropOut1   | Dropout    | Rate 0.5                                    |
| Conv2      | Conv2d     | 128 filters, 10x5 kernel, 1x2 strides       |
| BatchNorm2 | Batch Norm | Momentum 0.9                                |
| Relu2      | Relu       | Relu activation layer                       |
| MaxPool2   | MaxPool    | 3x3 kernel, 1x2 strides                     |
| DropOut2   | Dropout    | Rate 0.5                                    |
| Conv3      | Conv2d     | 256 filters, 10x5 kernel, 1x1 strides       |
| BatchNorm3 | Batch Norm | Momentum 0.9                                |
| Relu3      | Relu       | Relu activation layer                       |
| MaxPool3   | MaxPool    | 3x5 kernel, 1x1 strides                     |
| DropOut3   | Dropout    | Rate 0.5                                    |
| RNN1       | GRU        | 128 units                                   |
| FC1        | Dense      | 256 units                                   |
| BatchNorm4 | Batch Norm | Momentum 0.9                                |
| Relu4      | Relu       | Relu activation layer                       |
| FC2        | Dense      | Final output logits                         |
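
As a rough illustration, the layer stack above could be expressed with the Keras functional API along the following lines. This is a minimal sketch and not the repository's actual code. Several details are assumptions that the table does not state: the input shape of `(99, 13, 1)` (time frames, MFCC coefficients, one channel), `"same"` padding everywhere, the reading of "10x5" and "1x2" as (time, coefficient) order, the reshape that feeds the GRU, and `num_classes=12` (the ten words plus the competition's "silence" and "unknown" labels). The names `conv_block` and `build_model` are hypothetical.

```python
import tensorflow as tf
from tensorflow.keras import layers


def conv_block(x, filters, conv_strides, pool_size, pool_strides):
    """Conv2D -> BatchNorm -> ReLU -> MaxPool -> Dropout, one group of rows in the table."""
    x = layers.Conv2D(filters, kernel_size=(10, 5), strides=conv_strides,
                      padding="same")(x)  # "same" padding is an assumption
    x = layers.BatchNormalization(momentum=0.9)(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(pool_size=pool_size, strides=pool_strides,
                            padding="same")(x)  # "same" padding is an assumption
    return layers.Dropout(0.5)(x)


def build_model(input_shape=(99, 13, 1), num_classes=12):
    """Build the CNN + GRU stack; input_shape and num_classes are assumed values."""
    inputs = tf.keras.Input(shape=input_shape)
    x = conv_block(inputs, 64, (1, 2), (3, 3), (1, 2))
    x = conv_block(x, 128, (1, 2), (3, 3), (1, 2))
    x = conv_block(x, 256, (1, 1), (3, 5), (1, 1))
    # Fold the coefficient and channel axes together so the GRU receives
    # one feature vector per time step (assumed way of joining CNN and RNN).
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.GRU(128)(x)  # returns only the final state (assumption)
    x = layers.Dense(256)(x)
    x = layers.BatchNormalization(momentum=0.9)(x)
    x = layers.ReLU()(x)
    outputs = layers.Dense(num_classes)(x)  # raw logits, no softmax
    return tf.keras.Model(inputs, outputs)


model = build_model()
```

Calling `model.summary()` prints the per-layer output shapes. Under these assumptions, the 1x2 strides shrink only the coefficient axis, so the time axis keeps its full length for the GRU.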

I trained the model for 30 epochs with a batch size of 512 on my NVIDIA Titan X GPU. I used the Python Speech Features package to transform the audio into MFCC features. My final score was 80.8% accuracy, which placed me 479th out of 1315 teams (top 37%).