An End-to-End Architecture for Keyword Spotting and Voice Activity Detection

The reference implementation and open-source datasets of the high-quality keyword spotter and voice activity detector introduced in An End-to-End Architecture for Keyword Spotting and Voice Activity Detection.


We propose a single neural network architecture for two tasks: on-line keyword spotting and voice activity detection. We develop novel inference algorithms for an end-to-end Recurrent Neural Network trained with the Connectionist Temporal Classification loss function which allow our model to achieve high accuracy on both keyword spotting and voice activity detection without retraining. In contrast to prior voice activity detection models, our architecture does not require aligned training data and uses the same parameters as the keyword spotting model. This allows us to deploy a high quality voice activity detector with no additional memory or maintenance requirements.



This code has been run on Ubuntu 14.04 with Python 2.7 and TensorFlow 1.4.0.

The supported versions of CUDA include CUDA 7.5 with cuDNN 4.0. (CUDA 9.0 is not yet supported.)

We also require Boost.Python. The build process will attempt to install the Ubuntu package libsamplerate-dev; for this you may need root access, or you will need to modify the Makefiles.



Set up the virtual environment:

virtualenv kws
source kws/bin/activate


To install the system and Python dependencies, run the following from the repo root directory:

make .deps


After the dependencies are installed, run:




Along with the code, we also provide an open-source dataset of positive samples for the keyword "Olivia". To get decent accuracy, you'll need to include a corpus of large-vocabulary (LV) speech data and train the model on the joint corpus. Publicly available datasets that will work include Mozilla's Common Voice corpus and LibriSpeech. The Olivia dataset is split as follows:

  • train (1544 samples)
  • test (550 samples)

We also provide a dataset with added noise, as described in the paper:

  • train_noise (15440 samples)

Download the data and unzip it into the data subdirectory.


Before training, make sure to set your PYTHONPATH to the repo top level directory. From the repo top level directory run


All the configuration parameters can be found in config/kws.json. The most important things to change here are the paths to the data json(s) and the location where the model is saved. These are config["data"]["train_jsons"] and config["io"]["save_path"], respectively.

Most of the other parameters should work out of the box; however, you are free to change them for hyperparameter tuning etc.
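
As an illustration of the two settings named above, here is a minimal sketch that rewrites them programmatically. Only the keys config["data"]["train_jsons"] and config["io"]["save_path"] come from this README; the helper names `apply_overrides` and `update_config_file` are hypothetical, and treating `train_jsons` as a list of paths is an assumption based on the "json(s)" wording.

```python
import json


def apply_overrides(config, train_jsons, save_path):
    """Return a copy of `config` with the data and output paths replaced.

    Assumption: `train_jsons` is a list of paths to data json files.
    All other keys are carried over unchanged.
    """
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    updated["data"]["train_jsons"] = list(train_jsons)
    updated["io"]["save_path"] = save_path
    return updated


def update_config_file(config_path, train_jsons, save_path):
    """Load a config json, apply the overrides, and write it back in place."""
    with open(config_path) as f:
        config = json.load(f)
    updated = apply_overrides(config, train_jsons, save_path)
    with open(config_path, "w") as f:
        json.dump(updated, f, indent=2)
    return updated
```

For example, `update_config_file("config/kws.json", ["data/my_train.json"], "models/kws")` would point training at a (hypothetical) data json and save directory while leaving the remaining parameters untouched.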

To train a model run


For help / usage run

python -h

This should produce:

usage: [-h] [--config CONFIG] [--num_gpus NUM_GPUS]

optional arguments:
  -h, --help           show this help message and exit
  --config CONFIG      Configuration json for model building and training
  --num_gpus NUM_GPUS  Number of GPUs to train with.
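
The usage text above is what a standard `argparse` parser would print. The following is a sketch of such a parser, not the repository's actual training script: the flag names and help strings are taken from the usage text, while the default values, the integer type for `--num_gpus`, and the function name `build_parser` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the two training flags shown in the usage text."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--config",
        default="config/kws.json",  # assumed default, not confirmed by the source
        help="Configuration json for model building and training",
    )
    parser.add_argument(
        "--num_gpus",
        type=int,   # assumed type
        default=1,  # assumed default
        help="Number of GPUs to train with.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(["--config", "config/kws.json", "--num_gpus", "2"])
print(args.config, args.num_gpus)
```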


To evaluate a trained model, use the evaluation script. This takes as input a model directory and a list of wave files (e.g. data/olivia_test.txt). The wave files can be of arbitrary length; the model streams the evaluation and reports the negative log-likelihood.

usage: [-h] --save_path SAVE_PATH --file_list FILE_LIST

optional arguments:
  -h, --help            show this help message and exit
  --save_path SAVE_PATH
                        Path where model is saved.
  --file_list FILE_LIST
                        Path to list of wave files.
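
To make the reported negative log-likelihood concrete, here is a sketch of the textbook CTC forward recursion for scoring a label sequence. It is not the repository's streaming inference code. It assumes per-frame log-probabilities are already available as nested lists and that index 0 is the blank symbol; the names `ctc_nll` and `log_sum_exp` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_sum_exp(*values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_nll(log_probs, labels, blank=0):
    """Negative log-likelihood of `labels` under CTC.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    labels: list of symbol indices without blanks.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank].
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    # Initialise: a path may start in the first blank or the first label.
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new_alpha = [NEG_INF] * size
        for s in range(size):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new_alpha[s] = log_sum_exp(*terms) + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end in the last label or the trailing blank.
    total = log_sum_exp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total
```

With two frames and uniform probabilities over {blank, a}, the label sequence "a" has three valid alignments (a a, a blank, blank a), each with probability 0.25, so `ctc_nll` returns -log(0.75).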


If you find this code useful for your research, please cite:

  author    = {Christopher T. Lengerich and Awni Y. Hannun},
  title     = {An End-to-End Architecture for Keyword Spotting and Voice Activity Detection},
  booktitle = {NIPS End-to-End Learning for Speech and Audio Processing Workshop},
  year      = {2016},

