<center> 
    <h1> Transformer TTS: A Text-to-Speech Transformer in TensorFlow 2 </h1>
    <h2> Audio synthesis with Autoregressive Transformer TTS and WaveRNN Vocoder</h2>
</center>

## Autoregressive Model

In [None]:
# Clone the Transformer TTS and WaveRNN repos
!git clone https://github.com/as-ideas/TransformerTTS.git
!git clone https://github.com/fatchord/WaveRNN

Cloning into 'TransformerTTS'...
remote: Enumerating objects: 72, done.
remote: Counting objects: 100% (72/72), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 2474 (delta 33), reused 44 (delta 20), pack-reused 2402
Receiving objects: 100% (2474/2474), 4.24 MiB | 7.22 MiB/s, done.
Resolving deltas: 100% (1670/1670), done.
Cloning into 'WaveRNN'...
remote: Enumerating objects: 928, done.
remote: Total 928 (delta 0), reused 0 (delta 0), pack-reused 928
Receiving objects: 100% (928/928), 241.65 MiB | 40.27 MiB/s, done.
Resolving deltas: 100% (540/540), done.


In [None]:
# Install requirements
!apt-get install -y espeak
!pip install -r TransformerTTS/requirements.txt

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  espeak-data libespeak1 libportaudio2 libsonic0
The following NEW packages will be installed:
  espeak espeak-data libespeak1 libportaudio2 libsonic0
0 upgraded, 5 newly installed, 0 to remove and 32 not upgraded.
Need to get 1,219 kB of archives.
After this operation, 3,031 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libportaudio2 amd64 19.6.0-1 [64.6 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsonic0 amd64 0.2.0-6 [13.4 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak-data amd64 1.48.04+dfsg-5 [934 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libespeak1 amd64 1.48.04+dfsg-5 [145 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak amd64 1.48.04+dfsg-5 [61.6 kB]
Fetched 1,219 kB in 1s (1,080 kB/s)
S

In [None]:
# Download the pre-trained weights
! wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_wavernn_autoregressive_transformer.zip
! unzip ljspeech_wavernn_autoregressive_transformer.zip

--2020-06-04 13:00:25--  https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_wavernn_autoregressive_transformer.zip
Resolving public-asai-dl-models.s3.eu-central-1.amazonaws.com (public-asai-dl-models.s3.eu-central-1.amazonaws.com)... 52.219.73.160
Connecting to public-asai-dl-models.s3.eu-central-1.amazonaws.com (public-asai-dl-models.s3.eu-central-1.amazonaws.com)|52.219.73.160|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 177657256 (169M) [application/zip]
Saving to: ‘ljspeech_wavernn_autoregressive_transformer.zip’


2020-06-04 13:00:29 (52.8 MB/s) - ‘ljspeech_wavernn_autoregressive_transformer.zip’ saved [177657256/177657256]

Archive:  ljspeech_wavernn_autoregressive_transformer.zip
   creating: ljspeech_wavernn_autoregressive_transformer/
  inflating: __MACOSX/._ljspeech_wavernn_autoregressive_transformer  
  inflating: ljspeech_wavernn_autoregressive_transformer/.DS_Store  
  inflating: __MACOSX/ljspeech_wavernn_autore

In [None]:
# Set up the paths
from pathlib import Path
WaveRNN_path = 'WaveRNN/'
TTS_path = 'TransformerTTS/'
config_path = Path('ljspeech_wavernn_autoregressive_transformer/wavernn')

import sys
sys.path.append(TTS_path)

In [None]:
# Load the pretrained autoregressive TTS model
from utils.config_manager import ConfigManager
from utils.audio import Audio

import IPython.display as ipd

config_loader = ConfigManager(str(config_path), model_kind='autoregressive')
audio = Audio(config_loader.config)
model = config_loader.load_model(str(config_path / 'autoregressive_weights/ckpt-40'))

restored weights from ljspeech_wavernn_autoregressive_transformer/wavernn/autoregressive_weights/ckpt-40 at step 400000


In [None]:
# Synthesize text
sentence = 'Scientists at the CERN laboratory, say that they have discovered a new particle.'
out = model.predict(sentence)

pred text mel: 391 stop out: -2.0836281776428223Stopping


In [None]:
# Convert the predicted mel spectrogram to a waveform with Griffin-Lim (no neural vocoder yet)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
ipd.display(ipd.Audio(wav, rate=config_loader.config['sampling_rate']))

In [None]:
# Normalize for WaveRNN
mel = (out['mel'].numpy().T+4.)/8.
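
The `(x + 4) / 8` step above is a linear rescaling. As a minimal sketch, assuming the TTS model emits mel values in roughly [-4, 4] (an assumption inferred from the constants, not stated in the notebook), it maps that range onto [0, 1]. The helper names and the demo array below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def normalize_for_wavernn(mel, shift=4.0, scale=8.0):
    """Map values from about [-shift, scale - shift] onto [0, 1] and clip the rest."""
    return np.clip((mel + shift) / scale, 0.0, 1.0)

def denormalize_from_wavernn(mel01, shift=4.0, scale=8.0):
    """Inverse of the linear part of the mapping (clipping cannot be undone)."""
    return mel01 * scale - shift

# Hypothetical values, including two that fall outside the assumed range
demo = np.array([[-5.0, -4.0, 0.0],
                 [2.0, 4.0, 6.0]])
normed = normalize_for_wavernn(demo)
restored = denormalize_from_wavernn(normed)
print(normed)
```

In the notebook itself the clipping happens later, via `mel.clip(0,1)` in the generation cell; the sketch simply folds both steps into one function.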

### WaveRNN

In [None]:
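# Remove TransformerTTS from the import path and drop its cached `utils` package,
# so that WaveRNN's own `utils` package can be imported under the same name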
sys.path.remove(TTS_path)
sys.modules.pop('utils')

<module 'utils' from 'TransformerTTS/utils/__init__.py'>
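
Both repositories ship a top-level package called `utils`, which is why the cached one has to be dropped before importing from WaveRNN. Popping `'utils'` alone leaves already-imported submodules such as `utils.audio` in `sys.modules`. A more thorough variant is sketched below; `forget_package` is a hypothetical helper, not part of either repository, and the demo uses made-up module names so it can run anywhere.

```python
import sys
import types

def forget_package(name):
    """Remove a package and all of its cached submodules from sys.modules."""
    prefix = name + '.'
    stale = [key for key in sys.modules if key == name or key.startswith(prefix)]
    for key in stale:
        del sys.modules[key]
    return sorted(stale)

# Demo with fake modules (hypothetical names) standing in for `utils` and its submodules
for fake in ('demo_pkg', 'demo_pkg.audio', 'demo_pkg_other'):
    sys.modules[fake] = types.ModuleType(fake)

removed = forget_package('demo_pkg')
print(removed)
```

Note that `demo_pkg_other` survives: matching on `name + '.'` avoids removing unrelated packages that merely share a prefix.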

In [None]:
sys.path.append(WaveRNN_path)
from utils.dsp import hp
from models.fatchord_version import WaveRNN
import torch
import numpy as np
WaveRNN_path = Path(WaveRNN_path)

In [None]:
# Unzip the pretrained model
!unzip WaveRNN/pretrained/ljspeech.wavernn.mol.800k.zip -d WaveRNN/pretrained/

Archive:  WaveRNN/pretrained/ljspeech.wavernn.mol.800k.zip
  inflating: WaveRNN/pretrained/latest_weights.pyt  


In [None]:
# Load the pretrained WaveRNN vocoder
# (note: this rebinds `model`, replacing the TransformerTTS model loaded earlier)
hp.configure(WaveRNN_path / 'hparams.py')  # Load hparams from file
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
model = WaveRNN(rnn_dims=hp.voc_rnn_dims,
                fc_dims=hp.voc_fc_dims,
                bits=hp.bits,
                pad=hp.voc_pad,
                upsample_factors=hp.voc_upsample_factors,
                feat_dims=hp.num_mels,
                compute_dims=hp.voc_compute_dims,
                res_out_dims=hp.voc_res_out_dims,
                res_blocks=hp.voc_res_blocks,
                hop_length=hp.hop_length,
                sample_rate=hp.sample_rate,
                mode=hp.voc_mode).to(device)

model.load(str(WaveRNN_path / 'pretrained/latest_weights.pyt'))

Trainable Parameters: 4.234M


In [None]:
# Ignore some TF warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')

In [None]:
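# Generate sample with pre-trained WaveRNN vocoder
# Positional arguments: mel batch, output path, batched flag, target samples per fold, overlap, mu-law flag
batch_pred = True  # False is slower but possibly better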
_ = model.generate(mel.clip(0,1)[np.newaxis,:,:], 'scientists.wav', batch_pred, 11_000, hp.voc_overlap, hp.mu_law)

| ████████████████ 120000/121000 | Batch Size: 10 | Gen Rate: 3.0kHz | 
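
With `batch_pred = True`, the upsampled sequence is split into overlapping folds that are generated in parallel as one batch. The sketch below estimates the number of folds under one reading of that folding scheme, so it may not match the repository's code exactly. The hop length of 275 and overlap of 550 are assumptions about the default hyperparameters (the notebook only refers to them as `hp.hop_length` and `hp.voc_overlap`), the 391 frames are taken from the synthesis log above as an approximation, and `count_folds` is a hypothetical helper.

```python
def count_folds(total_len, target, overlap):
    """Estimate how many overlapping folds cover `total_len` samples.

    Each fold holds `target` new samples plus `overlap` samples on each side.
    """
    num_folds = (total_len - overlap) // (target + overlap)
    extended_len = num_folds * (target + overlap) + overlap
    remaining = total_len - extended_len
    if remaining != 0:
        # Leftover samples need one extra (padded) fold
        num_folds += 1
    fold_len = target + 2 * overlap
    return num_folds, fold_len

n_frames = 391      # approximate mel length, from the synthesis log above
hop_length = 275    # ASSUMPTION: stands in for hp.hop_length
overlap = 550       # ASSUMPTION: stands in for hp.voc_overlap
target = 11_000     # the value passed to model.generate above

num_folds, fold_len = count_folds(n_frames * hop_length, target, overlap)
print(num_folds, fold_len, num_folds * fold_len)
```

Under these assumptions the sketch gives 10 folds of 12,100 steps each, which is consistent with the `Batch Size: 10` and `121000` shown in the progress output. A larger target means fewer, longer folds: fewer seams to cross-fade, but less parallelism.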

In [None]:
# Play the generated wav file
ipd.display(ipd.Audio('scientists.wav'))