
DeepSpeech 0.6.0

@reuben reuben released this 03 Dec 17:18

General

This is the 0.6.0 release of Deep Speech, an open speech-to-text engine. In accordance with semantic versioning, this version is not backwards compatible with version 0.5.1 or earlier versions, so updating requires updating both code and models. As with previous releases, this release includes trained models and source code:

v0.6.0.tar.gz

and a model

deepspeech-0.6.0-models.tar.gz

trained on American English, which achieves a 7.5% word error rate on the LibriSpeech clean test corpus. Models with the ".pbmm" extension are memory mapped, which makes them much more memory efficient and faster to load. Models with the ".tflite" extension are converted for use with TFLite, have post-training quantization enabled, and are better suited to resource-constrained environments.
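
Word error rate is conventionally computed as the word-level edit distance between the reference transcript and the engine's output, divided by the number of reference words. The following is a minimal Python sketch of that conventional definition. It is not the evaluation code used for this release, and the function name `word_error_rate` is chosen here purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, one wrong word in a four-word reference gives a rate of 0.25. Because inserted words also count as errors, the rate can exceed 1.0.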

We also include example audio files:

audio-0.6.0.tar.gz

which can be used to test the engine; and checkpoint files

deepspeech-0.6.0-checkpoint.tar.gz

which can be used as the basis for further fine-tuning.

Notable changes from the previous release

DeepSpeech 0.6.0 includes a number of significant changes. These changes break backwards compatibility with code targeting older releases as well as training or exporting older checkpoints. For details on the changes, see below:

  • API - We have cleaned up several inconsistencies in our API, making function names more uniform and removing unused parameters. We have included a simple wrapper header that can be used by users of the C API if they absolutely cannot change their code. The upgrade is not complicated; here is a summary:
    • DS_CreateModel arguments alphabet, n_cep and n_context removed. (Now retrieved from the model file).
    • DS_EnableDecoderWithLM argument alphabet removed. (Alphabet retrieved from the model is used).
    • DS_SetupStream renamed to DS_CreateStream and no longer takes a sample rate parameter.
    • DS_DestroyModel renamed to DS_FreeModel.
    • DS_DiscardStream renamed to DS_FreeStream.
    • DS_SpeechToText and DS_SpeechToTextWithMetadata no longer take a sample rate parameter.
    • The equivalent methods in the language bindings have also been updated.
  • Checkpoints - With TF 1.14, we have added CuDNN RNN support to our training graph, which improves training performance significantly. We have seen training time per epoch improve by roughly 2x. The required training graph changes break loading older checkpoints, due to differences in the computation performed by CudnnLSTM.
    • Note that mixing CuDNN and non-CuDNN checkpoints requires some care: a CuDNN checkpoint can't be continued normally on a non-CuDNN setup. You'll need to use the --cudnn_checkpoint flag to load it, which re-initializes the optimizer momentum variables for the RNN weights from scratch.
  • Exported Model - We've fixed a bug where trying to interleave multiple streams with the same Model instance would lead to non-deterministic behavior, as all streams were sharing the same LSTM state between passes through the acoustic model. This was fixed in #2146. The fix required changing the inference graph, so trying to load a 0.5.1 model with a newer client would lead to errors. We bumped the graph version accordingly so that clients will fail early.
  • Dependencies - We have updated our dependencies from TensorFlow 1.13.1 in v0.5.1 to TensorFlow 1.14.0. Make sure you always use the correct TensorFlow version for the version of DeepSpeech you're using.
  • Trie - We switched to a different data structure for the language model trie file, so that the file could be memory mapped when loading. The older format is no longer supported, so we bumped the version number accordingly. To create a trie file that's compatible with the newer version, you have to run an updated generate_trie again with your lm.binary file. Note that only the trie format changed; the main LM binary file doesn't require changes.
  • Trie Loading - We have changed the mode of loading the LM trie file to be lazier, which improves memory utilization and latency for the first inference request after creating a model. See discussion and some measurements in the issue.
  • Language Model - We have updated the language model by filtering out uncommon words. It now contains only the top 500k words from the text it was trained on. Furthermore, we have pruned it for singletons of order three and higher. (In version 0.5.1 we pruned the language model for singletons of order four and higher.) Together these changes halve the size of the language model, taking it from about 1800MB in Deep Speech 0.5.1 to about 900MB in Deep Speech 0.6.0 with little to no impact on word error rate (WER).
  • Data Augmentation - Several online data augmentation techniques have been contributed (see the PR), and some documentation for them was added to the README as well.
  • Documentation - We have refactored our documentation to merge docs for the different language bindings under a single resource. The new docs can be seen on deepspeech.readthedocs.io.
  • Tool for bulk transcription - We added a tool for bulk transcribing large audio files.
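
The C API changes listed under the API item above can be captured as a small lookup table. The sketch below is a hypothetical migration helper, not part of DeepSpeech: it rewrites the three renamed functions in a piece of client source text and collects a note for every function whose argument list changed, so those call sites can be reviewed by hand. The names `RENAMED`, `SIGNATURE_CHANGED` and `migrate` are illustrative only.

```python
import re

# Renames taken from the summary above (old C API name -> new name).
RENAMED = {
    "DS_SetupStream": "DS_CreateStream",
    "DS_DestroyModel": "DS_FreeModel",
    "DS_DiscardStream": "DS_FreeStream",
}

# Functions whose argument lists changed; call sites need a manual review.
SIGNATURE_CHANGED = {
    "DS_CreateModel": "alphabet, n_cep and n_context arguments removed",
    "DS_EnableDecoderWithLM": "alphabet argument removed",
    "DS_SetupStream": "sample rate parameter removed",
    "DS_SpeechToText": "sample rate parameter removed",
    "DS_SpeechToTextWithMetadata": "sample rate parameter removed",
}

# Matches whole identifiers that start with the DS_ prefix.
_IDENTIFIER = re.compile(r"\bDS_[A-Za-z]+\b")


def migrate(source):
    """Return (rewritten_source, notes) for a snippet of client code.

    Renamed functions are replaced in the text. Arguments are never edited;
    every function with a changed signature only produces a note.
    """
    notes = []

    def replace(match):
        name = match.group(0)
        if name in SIGNATURE_CHANGED:
            notes.append(f"{name}: {SIGNATURE_CHANGED[name]}")
        return RENAMED.get(name, name)

    return _IDENTIFIER.sub(replace, source), notes
```

The helper deliberately leaves argument lists untouched, since removing a parameter safely requires parsing the call rather than pattern matching on it.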

Hyperparameters for fine-tuning

The hyperparameters used to train the model are useful for fine-tuning. We therefore document them here, along with the hardware used: a server with 8 Quadro RTX 6000 GPUs, each with 24GB of VRAM.

  • train_files Fisher, LibriSpeech, Switchboard, Common Voice English, and approximately 1700 hours of transcribed WAMU (NPR) radio shows explicitly licensed for use as training corpora.
  • dev_files LibriSpeech clean dev corpus.
  • test_files LibriSpeech clean test corpus.
  • train_batch_size 128
  • dev_batch_size 128
  • test_batch_size 128
  • n_hidden 2048
  • learning_rate 0.0001
  • dropout_rate 0.20
  • epoch 75
  • lm_alpha 0.75
  • lm_beta 1.85

The weights with the best validation loss were selected at the end of 75 epochs using --noearly_stop, and the selected model was trained for 233784 steps. In addition, training used the --use_cudnn_rnn flag.
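
As a starting point for fine-tuning, the numeric hyperparameters above can be kept in one place and turned into command line arguments. The following is a minimal sketch under one loud assumption: each name in the list is used verbatim as a `--name` flag, which should be checked against the flags of the training script you actually run. The helper `build_flags` is hypothetical, and the dataset entries (train_files, dev_files, test_files) are left out because they depend on local paths.

```python
# Values copied from the list above. Using the keys verbatim as flag names
# is an assumption; verify them against your training script.
HYPERPARAMETERS = {
    "train_batch_size": 128,
    "dev_batch_size": 128,
    "test_batch_size": 128,
    "n_hidden": 2048,
    "learning_rate": 0.0001,
    "dropout_rate": 0.20,
    "epoch": 75,
    "lm_alpha": 0.75,
    "lm_beta": 1.85,
}

# Boolean switches named in the paragraph above.
SWITCHES = ["--noearly_stop", "--use_cudnn_rnn"]


def build_flags(params, switches=()):
    """Flatten a parameter dict into a list of command line arguments."""
    flags = []
    for name, value in params.items():
        flags.extend([f"--{name}", str(value)])
    flags.extend(switches)
    return flags


if __name__ == "__main__":
    print(" ".join(build_flags(HYPERPARAMETERS, SWITCHES)))
```

When fine-tuning, values such as the learning rate and epoch count are typically adjusted, while n_hidden has to stay at 2048 so that the layer shapes match the released checkpoint.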

Bindings

This release also includes a Python-based command line tool, deepspeech, installed through:

pip install deepspeech

Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:

pip install deepspeech-gpu

It also exposes bindings for the following languages:

  • Python (Versions 2.7, 3.5, 3.6, 3.7 and 3.8) installed via

    pip install deepspeech

    Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:

    pip install deepspeech-gpu
  • NodeJS (Versions 4.x, 5.x, 6.x, 7.x, 8.x, 9.x, 10.x, 11.x, 12.x and 13.x) installed via

    npm install deepspeech
    

    Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:

    npm install deepspeech-gpu
    
  • ElectronJS versions 3.1, 4.0, 4.1, 5.0, 6.0, 7.0 and 7.1 are also supported

  • C++, which requires that the appropriate shared objects be installed from native_client.tar.xz. (See the section in the main README which describes native_client.tar.xz installation.)

  • .NET which is installed by following the instructions on the NuGet package page.

In addition, there are third-party bindings that are supported by external developers, for example:

  • Rust which is installed by following the instructions on the external Rust repo.
  • Go which is installed by following the instructions on the external Go repo.

Supported Platforms

  • Windows 8.1, 10, and Server 2012 R2, 64-bit (needs at least AVX support; requires the Visual C++ 2015 Update 3 Redistributable (64-bit) at runtime).
  • OS X 10.10, 10.11, 10.12, 10.13, 10.14 and 10.15
  • Linux x86 64 bit with a modern CPU (Needs at least AVX/FMA)
  • Linux x86 64 bit with a modern CPU + NVIDIA GPU (Compute Capability at least 3.0, see NVIDIA docs)
  • Raspbian Buster on Raspberry Pi 3 + Raspberry Pi 4
  • ARM64 built against Debian/ARMbian Buster and tested on LePotato boards
  • Java Android bindings / demo app. Early preview, tested only on Pixel 2 device, TF Lite model only.
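
On Linux x86, the AVX/FMA requirement above can be checked against the `flags` line of `/proc/cpuinfo`. The following is a minimal sketch; the function name `missing_cpu_flags` is hypothetical, and the check covers only the CPU features named in the list, not GPU Compute Capability.

```python
# CPU features named in the platform list above, as spelled in /proc/cpuinfo.
REQUIRED_FLAGS = {"avx", "fma"}


def missing_cpu_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags absent from the first 'flags' line.

    If the text contains no 'flags' line at all, every required flag is
    reported as missing.
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(required) - set(value.split())
    return set(required)
```

To use it, read `/proc/cpuinfo` as text and pass the contents to `missing_cpu_flags`. An empty set means both features are reported by the first listed core.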

Contact/Getting Help

  1. FAQ - We have a list of common questions, and their answers, in our FAQ. When just getting started, it's best to first check the FAQ to see if your question is addressed.
  2. Discourse Forums - If your question is not addressed in the FAQ, the Discourse Forums are the next place to look. They contain conversations on General Topics, Using Deep Speech, Alternative Platforms, and Deep Speech Development.
  3. IRC - If your question is not addressed by either the FAQ or the Discourse Forums, you can contact us on the #machinelearning channel on Mozilla IRC, where people can try to answer or help.
  4. Issues - Finally, if all else fails, you can open an issue in our repo if there is a bug with the current code base.

Contributors to 0.6.0 release