# Text-to-Speech with Flowtron and WaveGlow

This is an English multispeaker TTS demo trained on LibriTTS. It uses the open source project [NVIDIA/flowtron](https://github.com/NVIDIA/flowtron), with WaveGlow as the vocoder.

For other deep-learning Colab notebooks, visit [tugstugi/dl-colab-notebooks](https://github.com/tugstugi/dl-colab-notebooks).

## Install Flowtron and WaveGlow

In [0]:
#@title
%tensorflow_version 1.x
import os
from os.path import exists, join, basename, splitext

git_repo_url = 'https://github.com/NVIDIA/flowtron.git'
project_name = splitext(basename(git_repo_url))[0]
if not exists(project_name):
  # clone and install
  !git clone -q --recursive {git_repo_url}
  !pip install -q librosa unidecode gdown
  
os.chdir(project_name)
from flowtron import Flowtron
from data import Data

import sys
sys.path.insert(0, 'tacotron2')
sys.path.insert(0, 'tacotron2/waveglow')
from glow import WaveGlow

from IPython.display import Audio
import matplotlib
import matplotlib.pylab as plt
plt.rcParams["axes.grid"] = False

## Download pretrained models

In [2]:
flowtron_pretrained_model = 'flowtron_libritts.pt'
if not exists(flowtron_pretrained_model):
  !gdown https://drive.google.com/uc?id=1KhJcPawFgmfvwV7tQAOeC253rYstLrs8
waveglow_pretrained_model = 'waveglow_256channels_universal_v5.pt'
if not exists(waveglow_pretrained_model):
  !gdown https://drive.google.com/uc?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF

Downloading...
From: https://drive.google.com/uc?id=1KhJcPawFgmfvwV7tQAOeC253rYstLrs8
To: /content/flowtron/flowtron_libritts.pt
244MB [00:03, 78.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF
To: /content/flowtron/waveglow_256channels_universal_v5.pt
676MB [00:06, 104MB/s] 
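
An interrupted download, or an error page saved in place of the checkpoint, may only surface later as a confusing `torch.load` failure. The following is a minimal sketch of a size check, not part of Flowtron or gdown: `looks_like_checkpoint` is a hypothetical helper, and the thresholds are assumptions chosen a little below the sizes shown in the log above.

```python
import os

# Hypothetical lower bounds in bytes, chosen a little below the sizes
# reported in the download log (244MB and 676MB).
MIN_SIZES = {
    'flowtron_libritts.pt': 200 * 10**6,
    'waveglow_256channels_universal_v5.pt': 600 * 10**6,
}

def looks_like_checkpoint(path, min_bytes):
    """Return True if `path` is a regular file of at least `min_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes

for name, min_bytes in MIN_SIZES.items():
    if not looks_like_checkpoint(name, min_bytes):
        print(f'{name} is missing or too small; delete it and rerun the download cell')
```

A size check only catches gross failures. It does not verify that the file contents are a valid checkpoint.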


## Initialize Flowtron

The following code is adapted from https://github.com/NVIDIA/flowtron/blob/master/inference.py for use in Colab.

In [3]:
import json
import torch
import numpy as np

torch.manual_seed(1234)
torch.cuda.manual_seed(1234)
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = False

# read config
with open('config.json') as f:
    config = json.load(f)
data_config = config["data_config"]
model_config = config["model_config"]
model_config['n_speakers'] = 123 # there are 123 speakers
data_config['training_files'] = 'filelists/libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt'
data_config['validation_files'] = data_config['training_files']

# load waveglow and run it in half precision on the GPU
waveglow = torch.load(waveglow_pretrained_model)['model'].cuda().half()
# keep the invertible 1x1 convolutions in float32
for k in waveglow.convinv:
    k.float()
_ = waveglow.eval()

# load flowtron
model = Flowtron(**model_config).cuda()
state_dict = torch.load(flowtron_pretrained_model, map_location='cpu')['state_dict']
model.load_state_dict(state_dict)
_ = model.eval()

ignore_keys = ['training_files', 'validation_files']
trainset = Data(data_config['training_files'], **dict((k, v) for k, v in data_config.items() if k not in ignore_keys))

def synthesize(speaker_id, text, sigma=0.5, n_frames=500):
  # encode the speaker id and text, then add a batch dimension
  speaker_vecs = trainset.get_speaker_id(speaker_id).cuda()
  text = trainset.get_text(text).cuda()
  speaker_vecs = speaker_vecs[None]
  text = text[None]

  with torch.no_grad():
    # sample the latent residual (80 mel channels x n_frames), scaled by sigma
    residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
    mels, attentions = model.infer(residual, speaker_vecs, text)
    # vocode the mel-spectrogram; no gradients are needed here either
    audio = waveglow.infer(mels.half(), sigma=0.8).float()

  audio = audio.cpu().numpy()[0]
  # normalize audio to a peak amplitude of 1
  audio = audio / np.abs(audio).max()
  return Audio(audio, rate=22050)



Number of speakers : 123
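
The `n_frames` argument of `synthesize` fixes the length of the sampled residual, so it also bounds the length of the generated audio. The sketch below converts frames to seconds. It assumes a hop length of 256 samples per mel frame, which is an assumption here (the actual value comes from `data_config` in `config.json`), and the 22050 Hz rate passed to `Audio` above. `frames_to_seconds` is a hypothetical helper, not part of Flowtron.

```python
SAMPLING_RATE = 22050  # matches the rate passed to Audio in synthesize()
HOP_LENGTH = 256       # assumed; read the real value from data_config

def frames_to_seconds(n_frames, hop_length=HOP_LENGTH, sampling_rate=SAMPLING_RATE):
    """Approximate audio duration covered by `n_frames` mel frames."""
    return n_frames * hop_length / sampling_rate

print(round(frames_to_seconds(500), 2))
```

Under these assumptions the default `n_frames=500` covers about 5.8 seconds, so longer sentences may need a larger value.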


## Synthesize a text

Replace `TEXT` with your own text if you want to try something else.

In [0]:
TEXT = "It is well known that deep generative models have a deep latent space!"

# available speaker ids: 1069, 1088, 1116, 118, 1246, 125, 1263, 1502, 1578, 1841, 1867, 196, 1963, 1970, 200, 2092, 2136, 2182, 2196, 2289, 2416, 2436, 250, 254, 2836, 2843, 2911, 2952, 3240, 3242, 3259, 3436, 3486, 3526, 3664, 374, 3857, 3879, 3982, 3983, 40, 4018, 405, 4051, 4088, 4160, 4195, 4267, 4297, 4362, 4397, 4406, 446, 460, 4640, 4680, 4788, 5022, 5104, 5322, 5339, 5393, 5652, 5678, 5703, 5750, 5808, 587, 6019, 6064, 6078, 6081, 6147, 6181, 6209, 6272, 6367, 6385, 6415, 6437, 6454, 6476, 6529, 669, 6818, 6836, 6848, 696, 7059, 7067, 7078, 7178, 7190, 7226, 7278, 730, 7302, 7367, 7402, 7447, 7505, 7511, 7794, 78, 7800, 8051, 8088, 8098, 8108, 8123, 8238, 83, 831, 8312, 8324, 8419, 8468, 8609, 8629, 87, 8770, 8838, 887
SPEAKER_ID = 118  # this is a male voice

Now synthesize the above text:

In [5]:
synthesize(SPEAKER_ID, TEXT)
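
Because `synthesize` also accepts `sigma`, which scales the random residual, it can be useful to listen to several speakers and `sigma` values side by side. The sketch below shows one way to organize such a comparison. `sweep` and `fake_synthesize` are hypothetical names; the stand-in function only exists so the block runs without a GPU or the loaded models.

```python
def sweep(synthesize_fn, text, speaker_ids, sigmas):
    """Call `synthesize_fn` for every (speaker_id, sigma) pair.

    Returns a dict mapping (speaker_id, sigma) to whatever the function returns.
    """
    results = {}
    for speaker_id in speaker_ids:
        for sigma in sigmas:
            results[(speaker_id, sigma)] = synthesize_fn(speaker_id, text, sigma=sigma)
    return results

# Stand-in with the same signature as the notebook's synthesize().
def fake_synthesize(speaker_id, text, sigma=0.5, n_frames=500):
    return f'speaker={speaker_id} sigma={sigma} text={text!r}'

for key, value in sweep(fake_synthesize, 'Hello.', [118, 1069], [0.5, 0.75]).items():
    print(key, value)
```

In the notebook, pass the real `synthesize` in place of `fake_synthesize` and show each returned `Audio` object with `IPython.display.display`.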