# babble-rnn: Generating speech from speech with LSTM networks

babble-rnn is a research project exploring the use of machine learning to generate new speech by modelling human speech audio directly, without any intermediate text or word representations. The idea is to learn to speak through imitation, much like a baby might. The goal is to generate a babbling audio output that emulates the speech patterns of the original speaker, ideally incorporating real words into the output.

The implementation is based on Keras / Theano, which are used to build an LSTM RNN, and Codec 2, an open-source speech audio compression algorithm. The resulting models have learned the most common audio sequences of a 'performer', and can generate a plausible babbling audio sequence when provided with a seed sequence.
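
As a rough illustration of what seeded generation involves (a minimal sketch, not the project's actual generator.py; the `babble` helper, the window size, and the assumption that the model predicts a single next frame are all hypothetical), the model is repeatedly asked for the next Codec 2 frame and its prediction is fed back in as context:

```python
# Hypothetical sketch of seed-based generation, not the project's generator.py.
# `model` is assumed to map a window of Codec 2 frames (13 parameters each)
# to a single predicted next frame.
import numpy as np

def babble(model, seed_frames, num_frames=500, window=40):
    """Generate `num_frames` new Codec 2 frames from a seed sequence."""
    frames = list(seed_frames)          # each frame: 13 normalised parameters
    for _ in range(num_frames):
        context = np.array(frames[-window:])[np.newaxis, ...]  # (1, window, 13)
        next_frame = model.predict(context, verbose=0)[0]
        frames.append(next_frame)
    return np.array(frames[len(seed_frames):])
```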

Read the babble-rnn tech post

View the babble-rnn code on Github

Wondering what babble-rnn can do? Listen to the latest babble produced by the experiments since the original tech report:

play the audio

This babbler is a stack of 11 bidirectional LSTMs, attempting to learn an encoded sequence of data (frames of 13 normalized parameters, each representing 20ms of audio). Groups of LSTMs are trained together while the others are kept locked, to limit the complexity of learning such a deep network.
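
The following is a minimal Keras sketch of that idea, not the project's actual model_def.py; the layer width, sequence length and which layers are locked are illustrative assumptions:

```python
# Sketch only: a deep stack of bidirectional LSTMs over Codec 2 frames,
# with only one group of layers trainable at a time. Sizes are assumptions.
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

FRAME_PARAMS = 13   # normalised Codec 2 parameters per 20ms frame
TIMESTEPS = 200     # frames per training sequence (assumed)

model = Sequential()
model.add(Bidirectional(LSTM(160, return_sequences=True),
                        input_shape=(TIMESTEPS, FRAME_PARAMS)))
for _ in range(10):
    model.add(Bidirectional(LSTM(160, return_sequences=True)))
model.add(TimeDistributed(Dense(FRAME_PARAMS)))

# Group-wise training: lock all but the top group of LSTMs (illustrative split).
for i, layer in enumerate(model.layers[:-1]):
    layer.trainable = i >= 8

model.compile(loss='mse', optimizer='adam')
```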

The audio itself is highly compressed with Codec 2 (see the original tech post for details), producing a 3200 bit per second stream of frequency, energy, sinusoidal and voicing parameters. An autoencoder learns the features of this stream for a particular human speaker, compressing the output further. The encoder stage is a mix of 2D convolutional layers that pick out features from the Codec 2 data over short time sequences, running in parallel with a series of standard hidden layers (which provide a compressed stream that helps feed through some of the original input), before being merged into a single encoded output at a quarter of the rate of the original Codec 2 input (80ms of audio per frame, although with more parameters per frame than the original).
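
A hedged sketch of that encoder shape is below; the filter counts, window length and branch widths are assumptions for illustration, not the values used in the project's model_def.py:

```python
# Sketch: a Conv2D branch extracts short-time features from the Codec 2 frames
# while a plain Dense branch passes a compressed copy of the input through;
# the two are merged into one encoding at a quarter of the original frame rate
# (one encoded step per 80ms of audio instead of one per 20ms).
from keras.models import Model
from keras.layers import (Input, Conv2D, MaxPooling2D, Reshape, Dense,
                          AveragePooling1D, concatenate)

TIMESTEPS = 64      # 20ms Codec 2 frames per window (assumed)
FRAME_PARAMS = 13

inp = Input(shape=(TIMESTEPS, FRAME_PARAMS))

# Convolutional branch: treat the frame sequence as a 2D "image".
x = Reshape((TIMESTEPS, FRAME_PARAMS, 1))(inp)
x = Conv2D(16, (3, 3), padding='same', activation='relu')(x)
x = MaxPooling2D((2, 1))(x)                       # halve the time axis
x = Conv2D(32, (3, 3), padding='same', activation='relu')(x)
x = MaxPooling2D((2, 1))(x)                       # quarter of the original rate
conv_branch = Reshape((TIMESTEPS // 4, FRAME_PARAMS * 32))(x)

# Dense branch: a compressed copy of the raw input, downsampled in time.
d = Dense(24, activation='relu')(inp)
dense_branch = AveragePooling1D(pool_size=4)(d)   # match the 4x time reduction

# Merge into a single encoded stream with more parameters per frame than
# Codec 2, but at a quarter of the frame rate.
encoded = concatenate([conv_branch, dense_branch])
encoder = Model(inp, encoded)
```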