<center> 
    <h1> Transformer TTS: A Text-to-Speech Transformer in TensorFlow 2 </h1>
    <h2> Audio synthesis with Forward Transformer TTS and MelGAN Vocoder</h2>
</center>

## Forward Model

In [30]:
# Clone the Transformer TTS and MelGAN repos
!git clone https://github.com/as-ideas/TransformerTTS.git
!git clone https://github.com/seungwonpark/melgan.git

fatal: destination path 'TransformerTTS' already exists and is not an empty directory.
fatal: destination path 'melgan' already exists and is not an empty directory.
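
The two `fatal` messages above only mean that both repositories were already cloned when the cell was re-run. A minimal sketch of an idempotent alternative follows, assuming `git` is available on the PATH. `clone_if_missing` is a hypothetical helper introduced here for illustration; it is not part of either repository.

```python
import subprocess
from pathlib import Path


def clone_if_missing(url: str, dest: str) -> bool:
    """Clone `url` into `dest` unless `dest` already exists.

    Returns True if a clone was performed and False if it was skipped.
    Hypothetical helper for illustration; not part of TransformerTTS or MelGAN.
    """
    if Path(dest).exists():
        return False
    # check=True raises CalledProcessError if git exits with a non-zero status.
    subprocess.run(["git", "clone", url, dest], check=True)
    return True
```

Calling `clone_if_missing('https://github.com/as-ideas/TransformerTTS.git', 'TransformerTTS')`, and likewise for MelGAN, could then stand in for the two `!git clone` lines.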


In [31]:
# Install requirements
!apt-get install -y espeak
!pip install -r TransformerTTS/requirements.txt

Reading package lists... Done
Building dependency tree       
Reading state information... Done
espeak is already the newest version (1.48.04+dfsg-8build1).
0 upgraded, 0 newly installed, 0 to remove and 23 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [32]:
# Pin TransformerTTS to a fixed commit for reproducibility
!cd TransformerTTS/ && git checkout c3405c53e435a06c809533aa4453923469081147

HEAD is now at c3405c5 Fix path.


In [33]:
# Set up the paths
from pathlib import Path
MelGAN_path = 'melgan/'
TTS_path = 'TransformerTTS/'

import sys
sys.path.append(TTS_path)

In [34]:
# Load pretrained model
from model.factory import tts_ljspeech
from data.audio import Audio

model, config = tts_ljspeech()
audio = Audio(config)

In [35]:
# Synthesize text
sentence = 'What is so scary about Justin Grierson, is you never know what he will do next.'
out_normal = model.predict(sentence)

In [36]:
# Convert the mel spectrogram to a waveform (with Griffin-Lim)
wav = audio.reconstruct_waveform(out_normal['mel'].numpy().T)

In [37]:
import IPython.display as ipd

ipd.display(ipd.Audio(wav, rate=config['sampling_rate']))
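
To keep the synthesized audio rather than only play it inline, the waveform can be written to disk. The sketch below uses only the standard library and assumes the samples are mono floats roughly in the range [-1.0, 1.0]; values outside that range are clipped. `save_wav` is a hypothetical helper, not part of TransformerTTS.

```python
import struct
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; `samples` is any iterable of floats.
    """
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(str(path), "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        # "<h" packs each sample as a little-endian signed 16-bit integer.
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

In this notebook it could be called as `save_wav('sample.wav', wav, config['sampling_rate'])`.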

You can also vary the speech speed with the `speed_regulator` argument:

In [38]:
# 20% faster
sentence = 'What is so scary about Justin Grierson, is you never know what he will do next.'
out = model.predict(sentence, speed_regulator=1.20)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
ipd.display(ipd.Audio(wav, rate=config['sampling_rate']))

In [39]:
# 10% slower
sentence = 'What is so scary about Justin Grierson, is you never know what he will do next.'
out = model.predict(sentence, speed_regulator=.9)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
ipd.display(ipd.Audio(wav, rate=config['sampling_rate']))

## MelGAN

In [41]:
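# Clean up sys.path and sys.modules so that MelGAN's modules
# do not clash with the TransformerTTS `model` package.
# Both steps are safe to re-run.
if TTS_path in sys.path:
    sys.path.remove(TTS_path)
sys.modules.pop('model', None)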

In [42]:
sys.path.append(MelGAN_path)
import torch
import numpy as np

vocoder = torch.hub.load('seungwonpark/melgan', 'melgan')
vocoder.eval()

mel = torch.tensor(out_normal['mel'].numpy().T[np.newaxis,:,:])

Using cache found in /root/.cache/torch/hub/seungwonpark_melgan_master
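
The `mel = torch.tensor(...)` line above rearranges the spectrogram before handing it to the vocoder. The sketch below walks through the shapes with a dummy array, assuming the forward model returns the spectrogram frames-first (frames × mel channels) and the vocoder expects batch × mel channels × frames. The sizes 120 and 80 are made up for illustration.

```python
import numpy as np

n_frames, n_mels = 120, 80  # made-up sizes for illustration

# Stand-in for out_normal['mel'].numpy(): one row per frame.
mel_frames_first = np.zeros((n_frames, n_mels), dtype=np.float32)

# .T swaps the axes so that mel channels come first.
mel_channels_first = mel_frames_first.T

# np.newaxis adds a leading batch dimension of size 1.
mel_batched = mel_channels_first[np.newaxis, :, :]

print(mel_frames_first.shape)    # (120, 80)
print(mel_channels_first.shape)  # (80, 120)
print(mel_batched.shape)         # (1, 80, 120)
```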


In [43]:
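if torch.cuda.is_available():
    vocoder = vocoder.cuda()
    mel = mel.cuda()

# Use a new name so the TransformerTTS `audio` helper is not overwritten
with torch.no_grad():
    wav_melgan = vocoder.inference(mel)

In [44]:
# Display audio (22050 Hz, the LJSpeech sampling rate)
ipd.display(ipd.Audio(wav_melgan.cpu().numpy(), rate=22050))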