## Hands-on example for TTS  [https://github.com/mozilla/TTS](https://github.com/mozilla/TTS)

This notebook trains a Tacotron model on the LJSpeech dataset.

In [0]:
# download LJSpeech dataset
!wget http://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
# decompress
!tar -xjf LJSpeech-1.1.tar.bz2

--2020-01-23 03:41:15--  http://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
Resolving data.keithito.com (data.keithito.com)... 174.138.79.61
Connecting to data.keithito.com (data.keithito.com)|174.138.79.61|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2748572632 (2.6G) [application/octet-stream]
Saving to: ‘LJSpeech-1.1.tar.bz2’


2020-01-23 03:41:57 (61.9 MB/s) - ‘LJSpeech-1.1.tar.bz2’ saved [2748572632/2748572632]
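
The archive is about 2.6 GB, so an interrupted download leaves a truncated file that `tar` cannot unpack. A minimal sketch of a size check, assuming the byte count reported in the wget log above (2748572632) is the expected size; the function name `download_is_complete` is hypothetical and not part of TTS:

```python
import os

# Byte count taken from the "Length:" line of the wget log above (assumption:
# the hosted archive has not changed since that log was captured).
EXPECTED_BYTES = 2748572632


def download_is_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True if the file exists and has exactly the expected size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes
```

Calling `download_is_complete('LJSpeech-1.1.tar.bz2')` before decompressing gives an early signal that the download should be repeated.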



In [0]:
# create train-val splits
!shuf LJSpeech-1.1/metadata.csv > LJSpeech-1.1/metadata_shuf.csv
!head -n 12000 LJSpeech-1.1/metadata_shuf.csv > LJSpeech-1.1/metadata_train.csv
!tail -n 1100 LJSpeech-1.1/metadata_shuf.csv > LJSpeech-1.1/metadata_val.csv
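
The shell split above is disjoint only because 12000 + 1100 equals the number of lines in `metadata.csv` (13,100 for LJSpeech-1.1), and `shuf` gives a different split on every run. A minimal Python sketch of the same idea with a fixed seed follows; the function name `split_metadata` and its parameters are hypothetical, not part of TTS:

```python
import random
from pathlib import Path


def split_metadata(metadata_path, train_path, val_path, n_val=1100, seed=0):
    """Shuffle the lines of metadata_path and write disjoint train/val files."""
    lines = Path(metadata_path).read_text(encoding="utf-8").splitlines()
    if n_val >= len(lines):
        raise ValueError("n_val must be smaller than the number of lines")
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(lines)
    # The first n_val shuffled lines go to validation, the rest to training,
    # so the two files can never overlap regardless of the file length.
    val, train = lines[:n_val], lines[n_val:]
    Path(train_path).write_text("\n".join(train) + "\n", encoding="utf-8")
    Path(val_path).write_text("\n".join(val) + "\n", encoding="utf-8")
    return len(train), len(val)
```

With the real dataset, `split_metadata('LJSpeech-1.1/metadata.csv', 'LJSpeech-1.1/metadata_train.csv', 'LJSpeech-1.1/metadata_val.csv')` would produce the same 12000/1100 sizes as the shell commands.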

In [0]:
# clone the TTS repository to your local machine
!git clone https://github.com/mozilla/TTS

Cloning into 'TTS'...
remote: Enumerating objects: 98, done.

In [0]:
# install the espeak backend if you would like to use phonemes instead of raw characters
!sudo apt-get install -y espeak
!pip install soundfile

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-430
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  espeak-data libespeak1 libportaudio2 libsonic0
The following NEW packages will be installed:
  espeak espeak-data libespeak1 libportaudio2 libsonic0
0 upgraded, 5 newly installed, 0 to remove and 7 not upgraded.
Need to get 1,219 kB of archives.
After this operation, 3,031 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libportaudio2 amd64 19.6.0-1 [64.6 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsonic0 amd64 0.2.0-6 [13.4 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak-data amd64 1.48.04+dfsg-5 [934 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libespeak1 amd64 1.48.04+dfsg-5 [145 k

In [0]:
%cd TTS

/content/TTS


In [0]:
# install TTS requirements
!python setup.py install

running install
running bdist_egg
running egg_info
creating tts_namespace/TTS.egg-info
writing tts_namespace/TTS.egg-info/PKG-INFO
writing dependency_links to tts_namespace/TTS.egg-info/dependency_links.txt
writing requirements to tts_namespace/TTS.egg-info/requires.txt
writing top-level names to tts_namespace/TTS.egg-info/top_level.txt
writing manifest file 'tts_namespace/TTS.egg-info/SOURCES.txt'
writing manifest file 'tts_namespace/TTS.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
-- Building version 0.0.1+8dfedb6
creating temp_build
creating temp_build/TTS
copying tts_namespace/TTS/__init__.py -> temp_build/TTS
copying tts_namespace/TTS/test_cluster.py -> temp_build/TTS
copying tts_namespace/TTS/setup.py -> temp_build/TTS
copying tts_namespace/TTS/synthesize.py -> temp_build/TTS
copying tts_namespace/TTS/version.py -> temp_build/TTS
copying tts_namespace/TTS/train.py -> temp_build/TTS
copying tts_namespace/TTS/dis

In [0]:
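import json
from utils.generic_utils import load_config

# load the default config shipped with the repository
CONFIG = load_config('config.json')
# point the first dataset entry at the extracted LJSpeech folder
CONFIG['datasets'][0]['path'] = '../LJSpeech-1.1/'
# write experiment folders to the parent directory
CONFIG['output_path'] = '../'
# train for only 2 epochs to keep this example short
CONFIG['epochs'] = 2
# save the modified config back to disk
with open('config.json', 'w') as fp:
    json.dump(CONFIG, fp)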
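
The same three overrides can be wrapped in a small helper so they are easy to repeat for other runs. This is a sketch under assumptions: the helper name `override_config` is hypothetical, and it reads the file with plain `json`, so it only works on a config that is strict JSON (for example, one already rewritten by the cell above). If the repository's `config.json` contains comments, the first read should go through `load_config` instead.

```python
import json


def override_config(path, dataset_path, output_path, epochs):
    """Apply the notebook's three overrides to a strict-JSON config file."""
    with open(path, "r", encoding="utf-8") as fp:
        config = json.load(fp)
    # Same keys as in the cell above; all other settings are left untouched.
    config["datasets"][0]["path"] = dataset_path
    config["output_path"] = output_path
    config["epochs"] = epochs
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(config, fp, indent=4)  # indent keeps the file readable
    return config
```

For this notebook the call would be `override_config('config.json', '../LJSpeech-1.1/', '../', 2)`.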


In [0]:
# start training; the output is also saved to training.log
!python train.py --config_path config.json | tee training.log

 > Using CUDA:  True
 > Number of GPUs:  1
 > Git Hash: 8dfedb6
 > Experiment folder: ../ljspeech-graves-January-23-2020_03+47AM-8dfedb6
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:12.5
 | > frame_length_ms:50
 | > ref_level_db:20
 | > num_freq:1025
 | > power:1.5
 | > preemphasis:0.98
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > sound_norm:False
 | > n_fft:2048
 | > hop_length:275
 | > win_length:1100
 > Using model: Tacotron2
 | > Num output units : 1025

 > Model has 28921234 parameters
 > Number of outputs per iteration: 7

 > DataLoader initialization
 | > Use phonemes: True
   | > phoneme language: en-us
 | > Number of instances : 12000
 | > Max length sequence: 187
 | > Min length sequence: 5
 | > Avg length sequence: 98.20591666666667
 | > Num. instances discarded by max-min 
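
The `n_fft`, `hop_length`, and `win_length` values in the log above are consistent with the other audio settings it prints. The arithmetic below is a sketch inferred from those logged numbers, not a copy of the TTS source; the function name `stft_parameters` is hypothetical.

```python
def stft_parameters(sample_rate, num_freq, frame_shift_ms, frame_length_ms):
    """Derive STFT sizes (in samples) from the settings printed in the log."""
    # num_freq frequency bins correspond to an FFT of size (num_freq - 1) * 2.
    n_fft = (num_freq - 1) * 2
    # Hop between frames: the frame shift converted from milliseconds to samples.
    hop_length = int(frame_shift_ms / 1000.0 * sample_rate)
    # Window length: the hop scaled by the frame length / frame shift ratio.
    win_length = int(hop_length * (frame_length_ms / frame_shift_ms))
    return n_fft, hop_length, win_length


# Values from the log: sample_rate 22050, num_freq 1025, 12.5 ms shift, 50 ms length
print(stft_parameters(22050, 1025, 12.5, 50))  # (2048, 275, 1100)
```

Note that 12.5 ms at 22050 Hz is 275.625 samples, which is truncated to 275, and the window is 4 hops (1100 samples) rather than exactly 50 ms (1102.5 samples).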

In [0]:
! ls

best_model_config.json	dist		     notebooks	       temp_build
build			distribute.py	     __pycache__       test_cluster.py
CODE_OF_CONDUCT.md	Dockerfile	     README.md	       tests
config.json		images		     requirements.txt  training.log
CONTRIBUTING.md		__init__.py	     server	       train.py
dataset_analysis	layers		     setup.cfg	       tts_namespace
datasets		LICENSE.txt	     setup.py	       utils
debug_config.json	models		     speaker_encoder   version.py
de_sentences.txt	mozilla_us_phonemes  synthesize.py


In [0]:
!ls ../

LJSpeech-1.1	      ljspeech-graves-January-23-2020_03+47AM-8dfedb6  TTS
LJSpeech-1.1.tar.bz2  sample_data


In [0]:
%cd ../

/content


In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# upload the best checkpoint from the experiment folder to Google Drive
model_file = drive.CreateFile({'title' : 'ljspeech-graves-January-23-2020_03+47AM-8dfedb6/best_model.pth.tar'})
model_file.SetContentFile('ljspeech-graves-January-23-2020_03+47AM-8dfedb6/best_model.pth.tar')
model_file.Upload()
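
The experiment folder name embeds a timestamp and a git hash, so the hard-coded path above changes on every run. A minimal sketch for locating the most recent checkpoint instead, assuming experiment folders start with `ljspeech-` and contain `best_model.pth.tar` as in this notebook; the function name `find_latest_checkpoint` is hypothetical:

```python
from pathlib import Path


def find_latest_checkpoint(root=".", run_glob="ljspeech-*", name="best_model.pth.tar"):
    """Return the most recently modified checkpoint under root, or None."""
    # Keep only experiment folders that actually contain the checkpoint file.
    candidates = [
        run_dir / name
        for run_dir in Path(root).glob(run_glob)
        if (run_dir / name).is_file()
    ]
    if not candidates:
        return None
    # Pick the checkpoint with the newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path, converted with `str(...)`, could then be passed to `SetContentFile` in place of the hard-coded string.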
