A place for all things voice cloning. Make a PR!
This is the main Synthesis Colab
This is the simplified Synthesis Colab
This is supposedly a newer version of the simplified Synthesis Colab
For the sake of completeness, this is the training colab
It's worth noting that the cookiePPP training colab has (what I believe is) a major improvement over mine: an integrated grapheme-to-phoneme system, so the model can learn from phonemes instead of stupid nonstandard English spellings. I believe this will only work with English transcripts.
And another link: this is my fully functional Colab notebook for Tacotron2 training and synthesis, with explanatory notes. No hardware required: it trains your model on Google's free GPUs and saves the output to your Google Drive. The most complicated part is prepping your dataset before upload. It's currently set up to train from the LJSpeech-trained model, on 22050 Hz wav files with 16-bit PCM encoding. (See the dataset section for help with this.)
You can run this tensorboard in parallel with the Tacotron2 for Dummies notebook to check the progress of your model. You will have to use "Factory Reset Runtime" every time you want the tensorboard to pick up new progress. This is a GREAT way to visualize what's going on with your model, and much more useful than the alignment charts that the training colab spits out.
Below is a hastily coded Python script to convert graphemes to phonemes in filelists already prepped for TT2 training. Basically, it takes each line of the form <filename.wav|transcription> and converts the transcription segment into IPA characters. This means the model shouldn't get confused by words that don't sound the way they're written, and in general it should learn better.
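The original script is the one linked above; as a rough illustration of the idea, here is a minimal sketch of that filelist conversion. The tiny WORD_TO_IPA dict is a hypothetical stand-in — a real version would call an actual G2P library (e.g. the phonemizer package) instead of a lookup table.

```python
# Sketch of a grapheme-to-phoneme filelist converter for TT2-style
# "filename.wav|transcription" filelists. WORD_TO_IPA is a hypothetical
# stand-in for a real G2P backend.

WORD_TO_IPA = {
    "hello": "h\u0259\u02c8lo\u028a",
    "world": "w\u025c\u02d0ld",
}

def phonemize_word(word: str) -> str:
    """Look up a word's IPA; fall back to the original spelling."""
    return WORD_TO_IPA.get(word.lower(), word)

def convert_line(line: str) -> str:
    """Convert one 'filename.wav|transcription' line to IPA."""
    path, transcription = line.rstrip("\n").split("|", 1)
    ipa = " ".join(phonemize_word(w) for w in transcription.split())
    return f"{path}|{ipa}"

def convert_filelist(in_path: str, out_path: str) -> None:
    """Rewrite a whole filelist, phonemizing each transcription."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(convert_line(line) + "\n")
```

So a line like `audio1.wav|hello world` comes out with the transcription half replaced by IPA, while the wav path is left untouched.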
Noice's Watson Speech To Text Tool
Use ffmpeg to convert your wav files to the right format:
ffmpeg -y -i "$filename" -ac 1 -acodec pcm_s16le -ar 22050 -sample_fmt s16 "converted/$filename"
Or, on a whole directory:
#!/bin/bash
mkdir -p converted
for filename in *.wav; do
    echo "Converting $filename"
    ffmpeg -y -i "$filename" -ac 1 -acodec pcm_s16le -ar 22050 -sample_fmt s16 "converted/$filename"
done
LJSpeech Dataset: Old Reliable
VoxCeleb: 2000+ hours of celebrity utterances from 7000+ speakers. Audio is captured "in the wild," including background noise.
TED-LIUM: 452 hours of audio and aligned transcripts from TED talks.
LibriSpeech: 1000+ hour dataset of read English speech based on public domain audiobooks.