<center> 
    <h1> Transformer TTS: A Text-to-Speech Transformer in TensorFlow 2 </h1>
    <h2> Audio synthesis with Forward Transformer TTS and WaveRNN Vocoder</h2>
</center>


## Forward Model

In [None]:
# Clone the Transformer TTS and WaveRNN repos
!git clone https://github.com/as-ideas/TransformerTTS.git
!git clone https://github.com/fatchord/WaveRNN

Cloning into 'TransformerTTS'...
remote: Enumerating objects: 423, done.
remote: Counting objects: 100% (423/423), done.
remote: Compressing objects: 100% (181/181), done.
remote: Total 2825 (delta 260), reused 384 (delta 234), pack-reused 2402
Receiving objects: 100% (2825/2825), 8.01 MiB | 5.63 MiB/s, done.
Resolving deltas: 100% (1897/1897), done.
Cloning into 'WaveRNN'...
remote: Enumerating objects: 928, done.


In [None]:
# Install requirements
!apt-get install -y espeak
!pip install -r TransformerTTS/requirements.txt

In [None]:
# Download the pre-trained weights
!wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_wavernn_forward_transformer.zip
!unzip ljspeech_wavernn_forward_transformer.zip

In [None]:
# Pin TransformerTTS to a specific commit (only run the checkout if the cd succeeds)
!cd TransformerTTS/ && git checkout 1c1cb03

In [None]:
# Set up the paths
from pathlib import Path
WaveRNN_path = 'WaveRNN/'
TTS_path = 'TransformerTTS/'
config_path = Path('ljspeech_wavernn_forward_transformer/wavernn')

import sys
sys.path.append(TTS_path)

In [None]:
# Load pretrained models
from utils.config_manager import ConfigManager
from utils.audio import Audio

import IPython.display as ipd

config_loader = ConfigManager(str(config_path), model_kind='forward')
audio = Audio(config_loader.config)
model = config_loader.load_model(str(config_path / 'forward_weights/ckpt-133'))

restored weights from ljspeech_wavernn_forward_transformer/wavernn/forward_weights/ckpt-133 at step 665000


In [None]:
# Synthesize text
sentence = 'Scientists at the CERN laboratory, say they have discovered a new particle.'
out_normal = model.predict(sentence)

In [None]:
# Convert the mel spectrogram to a waveform (with Griffin-Lim)
wav = audio.reconstruct_waveform(out_normal['mel'].numpy().T)
ipd.display(ipd.Audio(wav, rate=config_loader.config['sampling_rate']))

In [None]:
# Normalize for WaveRNN: rescale the mel spectrogram so that [-4, 4] maps to [0, 1]
mel = (out_normal['mel'].numpy().T+4.)/8.
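
The rescaling above, together with the clipping the notebook applies later in the generation cell, can be sketched as a small standalone helper. This is only an illustration: the function name `to_unit_range` and the toy array are hypothetical, and the `[-4, 4]` input range is an assumption read off the `(x + 4) / 8` formula rather than from the model configuration.

```python
import numpy as np


def to_unit_range(mel, low=-4.0, high=4.0):
    """Linearly map [low, high] onto [0, 1] and clip values that fall outside.

    With the default bounds this is (mel + 4) / 8 followed by clip(0, 1).
    The bounds are assumptions, not values read from the model config.
    """
    scaled = (np.asarray(mel, dtype=np.float32) - low) / (high - low)
    return np.clip(scaled, 0.0, 1.0)


# Toy values standing in for a mel spectrogram (hypothetical data).
demo = np.array([[-5.0, -4.0, 0.0],
                 [2.0, 4.0, 6.0]])
normalized = to_unit_range(demo)
print(normalized)
```

Values below `-4` or above `4` end up at exactly `0` or `1`, which is why the clipping step matters when the predicted spectrogram overshoots the assumed range.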

You can also vary the speech speed with the `speed_regulator` argument:

In [None]:
# 20% faster
sentence = 'Scientists at the CERN laboratory, say they have discovered a new particle.'
out = model.predict(sentence, speed_regulator=1.20)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
ipd.display(ipd.Audio(wav, rate=config_loader.config['sampling_rate']))

In [None]:
# 10% slower
sentence = 'Scientists at the CERN laboratory, say they have discovered a new particle.'
out = model.predict(sentence, speed_regulator=.9)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
ipd.display(ipd.Audio(wav, rate=config_loader.config['sampling_rate']))
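
One way to check the effect of `speed_regulator` is to compare the durations of the predicted spectrograms. The sketch below is a rough aid, not part of TransformerTTS: the helper names are hypothetical, and the hop length and sampling rate are placeholder values that should be replaced with the ones from your model configuration. It also assumes the number of mel frames is the first dimension of `out['mel']`, which is suggested by the transpose applied before `reconstruct_waveform` but not confirmed here.

```python
def mel_duration_seconds(n_frames, hop_length, sampling_rate):
    """Approximate audio duration implied by a mel spectrogram with n_frames frames."""
    return n_frames * hop_length / sampling_rate


def relative_speed(baseline_frames, new_frames):
    """Ratio > 1 means the new utterance is shorter (faster) than the baseline."""
    return baseline_frames / new_frames


# Placeholder values (assumptions): replace with the values from your config.
HOP_LENGTH = 256
SAMPLING_RATE = 22050

# Hypothetical frame counts, e.g. taken from out_normal['mel'].shape[0] and out['mel'].shape[0].
baseline_frames = 480
faster_frames = 400

print(mel_duration_seconds(baseline_frames, HOP_LENGTH, SAMPLING_RATE))
print(mel_duration_seconds(faster_frames, HOP_LENGTH, SAMPLING_RATE))
print(relative_speed(baseline_frames, faster_frames))
```

If the regulator behaves as the cell comments describe, the ratio for the `1.20` run should come out close to 1.2 and the ratio for the `.9` run close to 0.9.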

## WaveRNN

In [None]:
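# Remove TransformerTTS from the import path and drop its cached `utils` package,
# so that WaveRNN's own `utils` package can be imported next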
sys.path.remove(TTS_path)
sys.modules.pop('utils')

<module 'utils' from 'TransformerTTS/utils/__init__.py'>

In [None]:
sys.path.append(WaveRNN_path)
from utils.dsp import hp
from models.fatchord_version import WaveRNN
import torch
import numpy as np
WaveRNN_path = Path(WaveRNN_path)
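
The two cells above are needed because both repositories ship a top-level package called `utils`, and Python reuses whatever is cached in `sys.modules`. The following self-contained sketch reproduces the same situation with two throwaway packages. Everything in it is hypothetical: the helper names, the package name `demo_shared_pkg`, and the temporary directories. Unlike the notebook, it also drops cached submodules, which is a slightly more thorough variant of the same idea.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_package(root, name, origin):
    """Create a tiny package `name` under `root` that records where it came from."""
    package_dir = Path(root) / name
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text(f"ORIGIN = {origin!r}\n")


def import_fresh(root, name, previous_root=None):
    """Import `name` from `root`, discarding any copy cached from `previous_root`."""
    if previous_root is not None and str(previous_root) in sys.path:
        sys.path.remove(str(previous_root))
    # Drop the package and its cached submodules so the next import searches sys.path again.
    for cached in [m for m in sys.modules if m == name or m.startswith(name + ".")]:
        del sys.modules[cached]
    sys.path.append(str(root))
    importlib.invalidate_caches()
    return importlib.import_module(name)


with tempfile.TemporaryDirectory() as tmp:
    repo_a, repo_b = Path(tmp) / "repo_a", Path(tmp) / "repo_b"
    make_package(repo_a, "demo_shared_pkg", "repo_a")
    make_package(repo_b, "demo_shared_pkg", "repo_b")

    first_origin = import_fresh(repo_a, "demo_shared_pkg").ORIGIN
    second_origin = import_fresh(repo_b, "demo_shared_pkg", previous_root=repo_a).ORIGIN

    # Leave the interpreter state as we found it.
    sys.path.remove(str(repo_b))
    sys.modules.pop("demo_shared_pkg", None)

print(first_origin, second_origin)
```

Without the cache cleanup, the second import would silently return the first package, which in the notebook would mean `from utils.dsp import hp` looking inside TransformerTTS instead of WaveRNN.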

In [None]:
# Unzip the pretrained model
!unzip WaveRNN/pretrained/ljspeech.wavernn.mol.800k.zip -d WaveRNN/pretrained/

Archive:  WaveRNN/pretrained/ljspeech.wavernn.mol.800k.zip
  inflating: WaveRNN/pretrained/latest_weights.pyt  


In [None]:
# Load pretrained model
hp.configure(WaveRNN_path / 'hparams.py')  # Load hparams from file
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
model = WaveRNN(rnn_dims=hp.voc_rnn_dims,
                fc_dims=hp.voc_fc_dims,
                bits=hp.bits,
                pad=hp.voc_pad,
                upsample_factors=hp.voc_upsample_factors,
                feat_dims=hp.num_mels,
                compute_dims=hp.voc_compute_dims,
                res_out_dims=hp.voc_res_out_dims,
                res_blocks=hp.voc_res_blocks,
                hop_length=hp.hop_length,
                sample_rate=hp.sample_rate,
                mode=hp.voc_mode).to(device)

model.load(str(WaveRNN_path / 'pretrained/latest_weights.pyt'))

Trainable Parameters: 4.234M


In [None]:
# Ignore some TF warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')

In [None]:
# Generate sample with pre-trained WaveRNN vocoder
batch_pred = True  # Batched generation is faster; False is slower but possibly higher quality
_ = model.generate(mel.clip(0,1)[np.newaxis,:,:], 'scientists.wav', batch_pred, 11_000, hp.voc_overlap, hp.mu_law)

| ████████████████ 120000/121000 | Batch Size: 10 | Gen Rate: 2.6kHz | 
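
The totals in the progress bar can be sanity-checked with a little arithmetic. The bar reports 121000 steps at batch size 10, i.e. 12100 per batch element, while the target passed to `generate` is `11_000`. One reading that fits these numbers is that each segment carries an overlap region on both sides, which would make the overlap 550 samples. Both the formula and the value 550 are inferences from the output shown above, not values read from `hparams.py`, and the helper name is hypothetical.

```python
def batched_progress_total(target, overlap, batch_size):
    """Total steps if each batch element spans the target plus an overlap on each side.

    This formula is an assumption that happens to match the progress bar above.
    """
    segment_len = target + 2 * overlap
    return segment_len * batch_size


# target comes from the generate() call; overlap=550 is inferred, not read from hparams.
total = batched_progress_total(target=11_000, overlap=550, batch_size=10)
print(total)
```

If the printed total differs from what your own progress bar shows, check `hp.voc_overlap` and the reported batch size before relying on this reading.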

In [None]:
# Play the generated wav file
ipd.display(ipd.Audio('scientists.wav'))