
justinjohn0306/awesome-voice-cloning

 
 


awesome-voice-cloning

What is this?

A place for all things voice cloning. Make a PR!

Tacotron 2

CookiePPP Tacotron 2 Colabs

This is the main Synthesis Colab

This is the simplified Synthesis Colab

This is supposedly a newer version of the simplified Synthesis Colab

For the sake of completeness, this is the training Colab

It's worth noting that the cookiePPP training Colab has (what I believe is) a major improvement over mine. It has an integrated grapheme-to-phoneme system, so the model can learn from phonemes (how words actually sound) instead of stupid nonstandard English spellings. I believe this only works with English transcripts.

Scripp's Training Colabs

And another link: this is my fully functional Colab notebook for Tacotron 2 training and synthesis, with explanatory notes. No hardware is required. It trains your model on Google's free GPUs and saves the output to your Google Drive. The most complicated part is prepping your dataset before upload. The notebook is currently set up to train starting from the LJSpeech-trained model, on 22050 Hz WAV files with 16-bit PCM encoding. (See the dataset section for help with this.)
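Because prepping the dataset is the fiddly part, a quick pre-upload check can save you a wasted Colab session. Below is a minimal sketch. It assumes your filelist has one clip per line in the <filename.wav|transcription> form (the layout the grapheme-to-phoneme script further down works on). It also assumes the file names are relative to a single audio folder. The function name `check_filelist` and that folder layout are hypothetical and not taken from either notebook.

```python
from pathlib import Path


def check_filelist(filelist: Path, wav_dir: Path) -> list[str]:
    """Return a list of problems found in a <filename.wav|transcription> filelist.

    ASSUMPTION (hypothetical layout): file names in the list are relative to `wav_dir`.
    """
    problems = []
    lines = filelist.read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        if "|" not in line:
            problems.append(f"line {lineno}: missing '|' separator")
            continue
        # Split on the first '|': file name on the left, transcription on the right.
        wav_name, text = line.split("|", 1)
        if not wav_name.lower().endswith(".wav"):
            problems.append(f"line {lineno}: {wav_name!r} is not a .wav file")
        elif not (wav_dir / wav_name).is_file():
            problems.append(f"line {lineno}: {wav_name!r} not found in {wav_dir}")
        if not text.strip():
            problems.append(f"line {lineno}: empty transcription")
    return problems
```

For example, `check_filelist(Path("list.txt"), Path("wavs"))` returns an empty list when every line points at an existing WAV file and has a non-empty transcription.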

Training

You can run this TensorBoard notebook in parallel with the Tacotron2 for Dummies notebook to check your model's progress. You will have to use "Factory Reset Runtime" every time you want to refresh TensorBoard. This is a GREAT way to visualize what's going on with your model, and much more useful than the alignment charts the training Colab spits out.

Tensorboard

Converting graphemes to phonemes

Below is a hastily coded Python script that converts graphemes to phonemes in filelists already prepped for Tacotron 2 (TT2) training. It takes each line of the form <filename.wav|transcription> and converts the transcription into IPA characters. As a result, the model shouldn't get confused by words that aren't pronounced the way they are spelled, and in general it should learn better.

Script in Colab Form
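The following is not the script linked above. It is a minimal Python sketch of the same line-by-line idea. A real converter would call an actual grapheme-to-phoneme library. Here the hand-written `TOY_LEXICON` is a hypothetical stand-in, and any word it doesn't know is passed through unchanged.

```python
import re

# HYPOTHETICAL toy lexicon for illustration only; a real script would use a proper G2P library.
TOY_LEXICON = {
    "the": "ðə",
    "colonel": "ˈkɝnəl",
    "marched": "mɑɹtʃt",
    "through": "θɹu",
}

WORD = re.compile(r"[A-Za-z']+")


def to_ipa(text: str, lexicon: dict[str, str]) -> str:
    """Swap each known word for its IPA; punctuation, spacing and unknown words are kept."""
    return WORD.sub(lambda m: lexicon.get(m.group(0).lower(), m.group(0)), text)


def convert_line(line: str, lexicon: dict[str, str]) -> str:
    """Turn '<filename.wav|transcription>' into '<filename.wav|IPA transcription>'."""
    wav_name, text = line.rstrip("\r\n").split("|", 1)
    return f"{wav_name}|{to_ipa(text, lexicon)}"


def convert_filelist(lines: list[str], lexicon: dict[str, str]) -> list[str]:
    """Convert every non-blank filelist line."""
    return [convert_line(line, lexicon) for line in lines if line.strip()]
```

Whatever converter you use, the model's symbol set has to include every IPA character it emits. Otherwise those characters may be dropped or cause errors during training.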

WaveGlow

On training WaveGlow - Scripp

Dataset Resources

Tools

Noice's Watson Speech To Text Tool

ASSFAP

Scripp's Guide

Use ffmpeg to convert your WAV files to the right format (mono, 22050 Hz, 16-bit PCM). The converted/ folder must already exist:

ffmpeg -y -i "$filename" -ac 1 -acodec pcm_s16le -ar 22050 -sample_fmt s16 "converted/$filename"

Or, on a whole directory:

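#!/bin/bash
# Convert every .wav in the current folder to mono, 22050 Hz, 16-bit PCM.
shopt -s nullglob   # with no .wav files, skip the loop instead of processing a literal "*.wav"
mkdir -p converted  # ffmpeg won't create the output folder for you

for filename in *.wav; do
    echo "Converting $filename"
    ffmpeg -y -i "$filename" -ac 1 -acodec pcm_s16le -ar 22050 -sample_fmt s16 "converted/$filename"
done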
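Before uploading, you can confirm the conversion worked with this small sketch built on Python's built-in `wave` module. It flags any file in a folder that isn't mono, 22050 Hz, 16-bit PCM, which are the same settings as the ffmpeg command above. The function names are hypothetical, and the expected values are simply copied from that command.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 22050    # Hz, matches -ar 22050
EXPECTED_CHANNELS = 1    # mono, matches -ac 1
EXPECTED_SAMPWIDTH = 2   # bytes per sample, i.e. 16-bit (pcm_s16le)


def wav_problems(path: Path) -> list[str]:
    """Return format mismatches for one file (empty list if it looks right)."""
    try:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            width = wav.getsampwidth()
    except (wave.Error, EOFError) as exc:
        # Non-PCM (e.g. float) or truncated/non-WAV files raise here.
        return [f"unreadable as PCM WAV: {exc}"]
    problems = []
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate {rate} Hz, expected {EXPECTED_RATE}")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    if width != EXPECTED_SAMPWIDTH:
        problems.append(f"{8 * width}-bit samples, expected {8 * EXPECTED_SAMPWIDTH}-bit")
    return problems


def check_dir(directory: Path) -> dict[str, list[str]]:
    """Map each .wav in `directory` that has problems to its list of problems."""
    report = {}
    for path in sorted(directory.glob("*.wav")):
        problems = wav_problems(path)
        if problems:
            report[path.name] = problems
    return report
```

Running `check_dir(Path("converted"))` after the loop should return an empty dict if every file came out in the right format.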

Datasets

Kanye West

LJSpeech Dataset: Old Reliable

Common Voice: Broad voice dataset sample with demographic metadata. Includes a valid/invalid flag as an indication of transcript quality.

VoxCeleb: 2000+ hours of celebrity utterances from 7000+ speakers. Audio is captured "in the wild," including background noise.

TED-LIUM: 452 hours of audio and aligned transcripts from TED talks.

LibriSpeech: 1000+ hour dataset of read English speech based on public domain audiobooks.

Creating a Dataset

Some Scripts for recording voices

SRT Splitting

AutoSub

Cam's Workflow
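The SRT Splitting entry above is about cutting long recordings using subtitle timings. How that particular tool does it isn't described here, so treat this as a hypothetical sketch of the general technique. It parses each SRT cue's `start --> end` timing and text. For each cue it builds one ffmpeg command that cuts with `-ss`/`-to` and converts to mono 22050 Hz 16-bit PCM. It also builds a matching <filename.wav|transcription> filelist line. The `clip0001.wav` naming scheme and the function names are made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Iterator

# One SRT timestamp, e.g. 00:01:02,250 (some files use '.' instead of ',').
TIME = r"(\d+):(\d{2}):(\d{2})[,.](\d{3})"
CUE_TIMING = re.compile(TIME + r"\s*-->\s*" + TIME)


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text: str) -> list[Cue]:
    """Parse blank-line-separated SRT blocks (index, 'start --> end', text)."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.search(line)
            if match:
                g = match.groups()
                # Subtitle text may span several lines; join them with spaces.
                text = " ".join(t.strip() for t in lines[i + 1:]).strip()
                cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return cues


def split_commands(audio: str, cues: list[Cue],
                   prefix: str = "clip") -> Iterator[tuple[str, list[str], str]]:
    """Yield (wav_name, ffmpeg argv, filelist line) for each cue that has text."""
    for n, cue in enumerate(cues, start=1):
        if not cue.text:
            continue  # skip cues with no text
        wav_name = f"{prefix}{n:04d}.wav"  # HYPOTHETICAL naming scheme
        argv = [
            "ffmpeg", "-y", "-i", audio,
            "-ss", f"{cue.start:.3f}", "-to", f"{cue.end:.3f}",  # cut out this cue
            "-ac", "1", "-acodec", "pcm_s16le", "-ar", "22050",   # mono, 22050 Hz, 16-bit PCM
            wav_name,
        ]
        yield wav_name, argv, f"{wav_name}|{cue.text}"
```

To actually cut the clips, pass each argv to `subprocess.run(argv, check=True)`. Subtitle timings are often a little loose, so listen to a few clips and trim them before training.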

Glossary of Terms

Glossary
