# 🐸 [Coqui TTS](https://github.com/coqui-ai/TTS) on CPU Real-Time Speech Synthesis 

## Glow-TTS
Paper: https://arxiv.org/abs/2005.11129

Trained with **LJSpeech** for **330K steps**.

This model differs from Tacotron in that it uses a **monotonic alignment search algorithm** instead of an attention mechanism. In our experiments, it produces less **natural speech** but is **easier to train**, especially with lower-quality datasets. It is also
**faster than Tacotron** models since it does not rely on autoregression and **computes the output in a single pass**. You can also **control speech pace and variation** with certain model parameters, as shown below.
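
As a rough illustration of what the pace and variation parameters do, here is a minimal sketch in plain Python. It follows the formulation in the Glow-TTS paper (durations are multiplied by a length scale and rounded up; the latent sample's standard deviation is multiplied by a noise scale). The helper names `scaled_durations` and `sample_latent` are hypothetical, and the code at the commit used in this notebook may differ in detail.

```python
import math
import random


def scaled_durations(log_durations, length_scale=1.0):
    """Frames per input token: exp(log duration) * length_scale, rounded up."""
    return [math.ceil(math.exp(d) * length_scale) for d in log_durations]


def sample_latent(mean, log_std, noise_scale=0.33, rng=None):
    """Draw mean + eps * std * noise_scale for each latent dimension."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    return [m + rng.gauss(0.0, 1.0) * math.exp(s) * noise_scale
            for m, s in zip(mean, log_std)]


# Hypothetical predicted log-durations for three input tokens.
log_d = [0.0, 1.0, 2.0]
print(scaled_durations(log_d, length_scale=1.0))  # normal pace
print(scaled_durations(log_d, length_scale=0.6))  # fewer frames, faster speech
```

A smaller length scale yields fewer output frames and therefore faster speech, while a noise scale close to zero keeps the latent sample close to its mean and reduces variation.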

## MultiBand-MelGAN
Paper: https://arxiv.org/abs/2005.05106 

Trained with **LibriTTS** for **145K steps** with real spectrograms.

### Download Models

In [1]:
!gdown --id 1NFsfhH8W8AgcfJ-BsL8CYAwQfZ5k4T-n -O tts_model.pth.tar
!gdown --id 1IAROF3yy9qTK43vG_-R67y3Py9yYbD6t -O config.json

Downloading...
From: https://drive.google.com/uc?id=1NFsfhH8W8AgcfJ-BsL8CYAwQfZ5k4T-n
To: /content/tts_model.pth.tar
100% 344M/344M [00:02<00:00, 144MB/s]
Downloading...
From: https://drive.google.com/uc?id=1IAROF3yy9qTK43vG_-R67y3Py9yYbD6t
To: /content/config.json
100% 8.90k/8.90k [00:00<00:00, 44.6MB/s]


In [2]:
!gdown --id 1Ty5DZdOc0F7OTGj9oJThYbL5iVu_2G0K -O vocoder_model.pth.tar
!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json
!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats_vocoder.npy

Downloading...
From: https://drive.google.com/uc?id=1Ty5DZdOc0F7OTGj9oJThYbL5iVu_2G0K
To: /content/vocoder_model.pth.tar
100% 82.8M/82.8M [00:00<00:00, 137MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu
To: /content/config_vocoder.json
100% 6.76k/6.76k [00:00<00:00, 20.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU
To: /content/scale_stats_vocoder.npy
100% 10.5k/10.5k [00:00<00:00, 29.9MB/s]
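
Since a failed download only surfaces later as a confusing load error, it can help to confirm that all five files exist before going further. The sketch below is one way to do that; `missing_files` is a hypothetical helper, not part of the TTS library, and the file names are the ones passed to `-O` in the cells above.

```python
import os

# File names used with `gdown ... -O` in the download cells.
EXPECTED_FILES = [
    "tts_model.pth.tar",
    "config.json",
    "vocoder_model.pth.tar",
    "config_vocoder.json",
    "scale_stats_vocoder.npy",
]


def missing_files(paths, directory="."):
    """Return the entries of `paths` that are not regular files in `directory`."""
    return [p for p in paths
            if not os.path.isfile(os.path.join(directory, p))]


missing = missing_files(EXPECTED_FILES)
if missing:
    print("Missing downloads:", ", ".join(missing))
else:
    print("All model files are present.")
```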


### Setup Libraries

In [3]:
! sudo apt-get install espeak

Reading package lists... Done
Building dependency tree       
Reading state information... Done
espeak is already the newest version (1.48.04+dfsg-8build1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


In [4]:
!git clone https://github.com/coqui-ai/TTS TTS_repo

fatal: destination path 'TTS_repo' already exists and is not an empty directory.


In [5]:
%cd TTS_repo
!git checkout 4132240
!pip install -r requirements.txt
!python setup.py develop
%cd ..

/content/TTS_repo
HEAD is now at 41322408 Merge branch 'dev' of https://github.com/mozilla/TTS into dev
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
ERROR: Could not find a version that satisfies the requirement tensorflow==2.3.1 (from versions: 2.8.0rc0, 2.8.0rc1, 2.8.0, 2.8.1, 2.8.2, 2.8.3, 2.8.4, 2.9.0rc0, 2.9.0rc1, 2.9.0rc2, 2.9.0, 2.9.1, 2.9.2, 2.9.3, 2.10.0rc0, 2.10.0rc1, 2.10.0rc2, 2.10.0rc3, 2.10.0, 2.10.1, 2.11.0rc0, 2.11.0rc1, 2.11.0rc2, 2.11.0, 2.11.1, 2.12.0rc0, 2.12.0rc1, 2.12.0, 2.13.0rc0)
ERROR: No matching distribution found for tensorflow==2.3.1
!!

        ********************************************************************************
        Usage of dash-separated 'build-lib' will not be supported in future
        versions. Please use the underscore name 'build_lib' instead.

        By 2023-Sep-26, you need to update your project and remove deprecated calls
        or your build
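
The `pip install` step above stops because the pinned `tensorflow==2.3.1` is not available for this Python version. One possible workaround, sketched below, is to filter that pin out of the requirement lines before installing. This rests on the assumption that the PyTorch inference path used in this notebook does not need TensorFlow, which has not been verified here. `drop_pinned` is a hypothetical helper and the sample lines are illustrative, not the actual contents of `requirements.txt`.

```python
import re


def drop_pinned(requirement_lines, package):
    """Return the requirement lines whose package name is not `package`."""
    kept = []
    for line in requirement_lines:
        # The package name ends at the first version specifier, extra, marker, or space.
        name = re.split(r"[=<>!~\[; ]", line.strip(), maxsplit=1)[0]
        if name.lower() != package.lower():
            kept.append(line)
    return kept


# Illustrative input only.
lines = ["torch>=1.5", "tensorflow==2.3.1", "numpy"]
print(drop_pinned(lines, "tensorflow"))
```

The filtered lines could then be written to a separate file and passed to `pip install -r`.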

### Define TTS function

In [6]:
def interpolate_vocoder_input(scale_factor, spec):
    """Interpolate the spectrogram to tolerate the sampling-rate difference
    between the TTS model and the vocoder."""
    print(" > before interpolation :", spec.shape)
    spec = torch.tensor(spec).unsqueeze(0).unsqueeze(0)
    spec = torch.nn.functional.interpolate(spec, scale_factor=scale_factor, mode='bilinear').squeeze(0)
    print(" > after interpolation :", spec.shape)
    return spec


def tts(model, text, CONFIG, use_cuda, ap, use_gl, figures=True):
    t_1 = time.time()
    # run tts
    target_sr = CONFIG.audio['sample_rate']
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs =\
     synthesis(model,
               text,
               CONFIG,
               use_cuda,
               ap,
               speaker_id,
               None,
               False,
               CONFIG.enable_eos_bos_chars,
               use_gl)
    # run vocoder
    mel_postnet_spec = ap._denormalize(mel_postnet_spec.T).T
    if not use_gl:
        target_sr = VOCODER_CONFIG.audio['sample_rate']
        vocoder_input = ap_vocoder._normalize(mel_postnet_spec.T)
        if scale_factor[1] != 1:
            vocoder_input = interpolate_vocoder_input(scale_factor, vocoder_input)
        else:
            vocoder_input = torch.tensor(vocoder_input).unsqueeze(0)
        waveform = vocoder_model.inference(vocoder_input)
    # format output
    if use_cuda and not use_gl:
        waveform = waveform.cpu()
    if not use_gl:
        waveform = waveform.numpy()
    waveform = waveform.squeeze()
    # compute run-time performance
    # use target_sr: the waveform is at the vocoder's sample rate when a vocoder is used
    rtf = (time.time() - t_1) / (len(waveform) / target_sr)
    tps = (time.time() - t_1) / len(waveform)
    print(waveform.shape)
    print(" > Run-time: {}".format(time.time() - t_1))
    print(" > Real-time factor: {}".format(rtf))
    print(" > Time per step: {}".format(tps))
    # display audio
    IPython.display.display(IPython.display.Audio(waveform, rate=target_sr))  
    return alignment, mel_postnet_spec, stop_tokens, waveform
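
The two run-time figures printed by `tts` can be computed in isolation, which makes their meaning easier to see. This is a minimal sketch; the function names are hypothetical and the numbers in the example are made up for illustration.

```python
def real_time_factor(elapsed_s, num_samples, sample_rate):
    """Processing time divided by the duration of the generated audio."""
    audio_s = num_samples / sample_rate
    return elapsed_s / audio_s


def time_per_sample(elapsed_s, num_samples):
    """Average processing time spent on each output sample."""
    return elapsed_s / num_samples


# Example: 0.5 s of compute for 44100 samples at 22050 Hz (2 s of audio).
rtf = real_time_factor(0.5, 44100, 22050)
print(rtf)  # 0.25
```

A real-time factor below 1.0 means the audio is generated faster than it takes to play back.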

### Load Models

In [7]:
!pip install dlinfo
!pip install segments
!pip install "librosa==0.9.1"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [14]:
import sys
import os
import torch
import time
import IPython

# for some reason TTS installation does not work on Colab
sys.path.append('TTS_repo')

from TTS.utils.io import load_config
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.generic_utils import setup_model
from TTS.tts.utils.text.symbols import make_symbols, symbols, phonemes
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.io import load_checkpoint

In [15]:
# runtime settings
# use the GPU only when one is available; otherwise run on CPU
use_cuda = torch.cuda.is_available()

In [16]:
# model paths
TTS_MODEL = "tts_model.pth.tar"
TTS_CONFIG = "config.json"
VOCODER_MODEL = "vocoder_model.pth.tar"
VOCODER_CONFIG = "config_vocoder.json"

In [17]:
# load configs
TTS_CONFIG = load_config(TTS_CONFIG)
VOCODER_CONFIG = load_config(VOCODER_CONFIG)

# TTS_CONFIG.audio['stats_path'] = "./scale_stats.npy"
VOCODER_CONFIG.audio['stats_path'] = "./scale_stats_vocoder.npy"


In [18]:
# load the audio processor
ap = AudioProcessor(**TTS_CONFIG.audio)         

 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.1
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:None
 | > hop_length:256
 | > win_length:1024


In [21]:
# LOAD TTS MODEL
# multi speaker 
speakers = []
speaker_id = None
    
if 'characters' in TTS_CONFIG.keys():
    # use the custom character set defined in the config
    symbols, phonemes = make_symbols(**TTS_CONFIG.characters)

# load the model
num_chars = len(phonemes) if TTS_CONFIG.use_phonemes else len(symbols)
model = setup_model(num_chars, len(speakers), TTS_CONFIG)

# load model state
model, _ = load_checkpoint(model, TTS_MODEL, use_cuda=use_cuda)
model.eval();
model.store_inverse();

In [None]:
from TTS.vocoder.utils.generic_utils import setup_generator

# LOAD VOCODER MODEL
vocoder_model = setup_generator(VOCODER_CONFIG)
vocoder_model.load_state_dict(torch.load(VOCODER_MODEL, map_location="cpu")["model"])
vocoder_model.remove_weight_norm()
vocoder_model.inference_padding = 0

# scale factor for sampling rate difference
scale_factor = [1,  VOCODER_CONFIG['audio']['sample_rate'] / ap.sample_rate]
print(f"scale_factor: {scale_factor}")

ap_vocoder = AudioProcessor(**VOCODER_CONFIG['audio'])    
if use_cuda:
    vocoder_model.cuda()
vocoder_model.eval();
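
To see what the scale factor does to the spectrogram, the sketch below computes the shape that results from interpolation without needing the models. It relies on the documented behaviour of `torch.nn.functional.interpolate`, where each output dimension is the floor of the input dimension times its scale factor. Both helper names are hypothetical, and the 24000 Hz vocoder rate is only an example value, not read from `config_vocoder.json`.

```python
import math


def vocoder_scale_factor(vocoder_sr, tts_sr):
    """Scale for (mel bins, frames): only the time axis is stretched."""
    return [1, vocoder_sr / tts_sr]


def interpolated_shape(shape, scale_factor):
    """Output shape after interpolation: floor(dim * scale) per axis."""
    return tuple(math.floor(dim * s) for dim, s in zip(shape, scale_factor))


# Hypothetical case: TTS model at 22050 Hz, vocoder at 24000 Hz.
scale = vocoder_scale_factor(24000, 22050)
print(interpolated_shape((80, 100), scale))  # (80, 108)
```

When both models share a sample rate the factor is `[1, 1.0]`, and `tts` skips interpolation entirely.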

In [None]:
import glob
from TTS.vocoder.datasets.gan_dataset import GANDataset
from TTS.vocoder.utils.generic_utils import plot_results
import matplotlib.pyplot as plt
from TTS.utils.tensorboard_logger import TensorboardLogger
from torch.utils.data import DataLoader

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/MyDrive/cs_172b_project

In [None]:
from datasets import load_dataset, Audio
dataset = load_dataset("audiofolder", data_dir="full_pierre_dataset/")

In [None]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
print(dataset)

In [None]:

'''
wav_paths = glob.glob(os.path.join("/content/sample/", "**", "*.wav"), recursive=True)
dataset = GANDataset(
    ap=ap_vocoder, 
    items=wav_paths,
    seq_len=VOCODER_CONFIG.seq_len,
    hop_len=ap_vocoder.hop_length,
    pad_short=VOCODER_CONFIG.pad_short,
    conv_pad=VOCODER_CONFIG.conv_pad,
    is_training=False,
    return_segments=False,
    use_noise_augment=VOCODER_CONFIG.use_noise_augment,
    use_cache=VOCODER_CONFIG.use_cache,
    verbose=False)

data = dataset[0]
c_G, y_G = data
c_G = c_G.unsqueeze(0)
y_G = y_G.unsqueeze(0)
'''
c_G, y_G = dataset[0]
c_G = c_G.unsqueeze(0)
y_G = y_G.unsqueeze(0)

y_hat = vocoder_model.inference(c_G)
print(y_hat.shape, y_G.shape)

!rm -rf ./sample/test
figures = plot_results(y_hat, y_G, ap_vocoder, 0, "test")
tb_logger = TensorboardLogger("/content/sample/test", model_name="vocoder_test")
tb_logger.tb_eval_figures(0, figures)

sample_voice = y_hat[0].squeeze(0).detach().cpu().numpy()
real_voice = y_G[0].squeeze(0).cpu().numpy()
tb_logger.tb_eval_audios(0, {'eval/audio': sample_voice, 'eval/real': real_voice}, VOCODER_CONFIG.audio["sample_rate"])

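# play back at the vocoder's sample rate instead of a hard-coded value
IPython.display.display(IPython.display.Audio(sample_voice, rate=VOCODER_CONFIG.audio["sample_rate"]))
IPython.display.display(IPython.display.Audio(real_voice, rate=VOCODER_CONFIG.audio["sample_rate"]))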

%load_ext tensorboard
%tensorboard --logdir "/content/sample"

## Run Inference

In [None]:
model.length_scale = 1.0  # set the speed of the speech
model.noise_scale = 0.33  # set the speech variation

sentence =  "Bill got in the habit of asking himself “Is that thought true?” and if he wasn’t absolutely certain it was, he just let it go."
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

In [None]:
# faster speech
model.length_scale = 0.8  # set the speed of the speech
model.noise_scale = 0.33  # set the speech variation

sentence =  "Bill got in the habit of asking himself “Is that thought true?” and if he wasn’t absolutely certain it was, he just let it go."
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

In [None]:
# even faster speech with less variation
model.length_scale = 0.6  # set the speed of the speech
model.noise_scale = 0.01  # set the speech variation

sentence =  "Bill got in the habit of asking himself “Is that thought true?” and if he wasn’t absolutely certain it was, he just let it go."
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)