<a href="https://colab.research.google.com/github/naseembabu/Audio-Implementation-Code/blob/master/DDC_TTS_and_MultiBand_MelGAN_TF_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mozilla TTS: Real-Time Speech Synthesis on CPU with TensorFlow

**These models are converted from released [PyTorch models](https://colab.research.google.com/drive/1u_16ZzHjKYFn1HNVuA4Qf_i2MMFB9olY?usp=sharing) using our TF utilities provided in Mozilla TTS.**

These TF models target TF 2.2. With other TensorFlow versions, you might need to
regenerate them.
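
A minimal sketch of a version guard for the TF 2.2 requirement stated above. The helper name `is_supported_tf` is hypothetical and not part of Mozilla TTS; it only compares the major.minor part of a version string.

```python
def is_supported_tf(version: str, required: str = "2.2") -> bool:
    """Return True when `version` has the same major.minor as `required`."""
    major_minor = ".".join(version.split(".")[:2])
    return major_minor == required


# Possible use inside the notebook (needs TensorFlow installed):
#   import tensorflow as tf
#   if not is_supported_tf(tf.__version__):
#       print("TF version differs from 2.2; the models may need regenerating")
print(is_supported_tf("2.2.0"))
print(is_supported_tf("2.3.1"))
```

Comparing only major.minor lets patch releases such as 2.2.1 pass while flagging 2.3 or later.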

We use Tacotron2 and MultiBand-MelGAN models trained on the LJSpeech dataset.

Tacotron2 was trained with [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) (DDC) for only 130K steps (3 days) on a single GPU.

MultiBand-MelGAN was trained for 1.45M steps with real spectrograms.

Note that the performance of both models can be improved with more training.


### Download Models

In [1]:
!gdown --id 1p7OSEEW_Z7ORxNgfZwhMy7IiLE1s0aH7 -O tts_model.pkl
!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json

Downloading...
From: https://drive.google.com/uc?id=1p7OSEEW_Z7ORxNgfZwhMy7IiLE1s0aH7
To: /content/tts_model.pkl
116MB [00:01, 78.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc
To: /content/config.json
100% 9.53k/9.53k [00:00<00:00, 15.4MB/s]


In [2]:
!gdown --id 1rHmj7CqD3Sfa716Y3ub_vpIBrQg_b1yF -O vocoder_model.pkl
!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json
!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy

Downloading...
From: https://drive.google.com/uc?id=1rHmj7CqD3Sfa716Y3ub_vpIBrQg_b1yF
To: /content/vocoder_model.pkl
10.1MB [00:00, 32.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu
To: /content/config_vocoder.json
100% 6.76k/6.76k [00:00<00:00, 6.19MB/s]
Downloading...
From: https://drive.google.com/uc?id=11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU
To: /content/scale_stats.npy
100% 10.5k/10.5k [00:00<00:00, 9.24MB/s]
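
Since the later cells assume all five downloads succeeded, a quick check along these lines may help before continuing. This is a sketch only: `missing_files` and `EXPECTED_FILES` are hypothetical names, and the file list is copied from the `gdown` commands above.

```python
import os
from typing import Iterable, List

# File names taken from the -O arguments of the gdown commands above.
EXPECTED_FILES = [
    "tts_model.pkl",
    "config.json",
    "vocoder_model.pkl",
    "config_vocoder.json",
    "scale_stats.npy",
]


def missing_files(paths: Iterable[str], root: str = ".") -> List[str]:
    """Return the entries of `paths` that are absent or empty under `root`."""
    missing = []
    for name in paths:
        full = os.path.join(root, name)
        if not os.path.isfile(full) or os.path.getsize(full) == 0:
            missing.append(name)
    return missing


print("Missing:", missing_files(EXPECTED_FILES))
```

An empty list means every file is present and non-empty; it does not verify that the contents are valid.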


### Setup Libraries

In [3]:
# espeak is required for character-to-phoneme conversion
! sudo apt-get install espeak

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  espeak-data libespeak1 libportaudio2 libsonic0
The following NEW packages will be installed:
  espeak espeak-data libespeak1 libportaudio2 libsonic0
0 upgraded, 5 newly installed, 0 to remove and 13 not upgraded.
Need to get 1,219 kB of archives.
After this operation, 3,031 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libportaudio2 amd64 19.6.0-1 [64.6 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsonic0 amd64 0.2.0-6 [13.4 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak-data amd64 1.48.04+dfsg-5 [934 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libespeak1 amd64 1.48.04+dfsg-5 [145 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak amd64 1.48.04+dfsg-5 [61.6 kB]
Fetched 1,219 kB in 1s (890 kB/s)
deb

In [4]:
!git clone https://github.com/mozilla/TTS

Cloning into 'TTS'...
remote: Enumerating objects: 11984, done.
remote: Total 11984 (delta 0), reused 0 (delta 0), pack-reused 11984
Receiving objects: 100% (11984/11984), 122.74 MiB | 25.43 MiB/s, done.
Resolving deltas: 100% (8459/8459), done.


In [5]:
%cd TTS
!git checkout c7296b3
!pip install -r requirements.txt
!python setup.py install
!pip install tensorflow==2.2.0
%cd ..

/content/TTS
Note: checking out 'c7296b3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at c7296b3 add module requirement
Collecting Unidecode>=0.4.20
  Downloading https://files.pythonhosted.org/packages/9e/25/723487ca2a52ebcee88a34d7d1f5a4b80b793f179ee0f62d5371938dfa01/Unidecode-1.2.0-py2.py3-none-any.whl (241kB)
     |████████████████████████████████| 245kB 8.2MB/s 
Collecting tensorboardX
  Downloading https://files.pythonhosted.org/packages/af/0c/4f41bcd45db376e6fe5c619c01100e9b7531c55791b7244815bac6eac32c/tensorboardX-2.1-py2.py3-none-any.whl (308kB)
     |████████████████████████████████| 317kB 17

### Define TTS function

In [6]:
def tts(model, text, CONFIG, p):
    # Relies on the notebook globals use_cuda, ap, speaker_id and vocoder_model; `p` is unused.
    t_1 = time.time()
    # Tacotron2 forward pass: text -> mel spectrogram (TF backend)
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(
        model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
        truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars,
        backend='tf')
    # MultiBand-MelGAN vocoder: mel spectrogram -> waveform
    waveform = vocoder_model.inference(torch.FloatTensor(mel_postnet_spec.T).unsqueeze(0))
    waveform = waveform.numpy()[0, 0]
    # Read the clock once so that all three metrics describe the same interval
    run_time = time.time() - t_1
    rtf = run_time / (len(waveform) / ap.sample_rate)  # processing time per second of audio
    tps = run_time / len(waveform)  # processing time per output sample
    print(waveform.shape)
    print(" > Run-time: {}".format(run_time))
    print(" > Real-time factor: {}".format(rtf))
    print(" > Time per step: {}".format(tps))
    IPython.display.display(IPython.display.Audio(waveform, rate=CONFIG.audio['sample_rate']))
    return alignment, mel_postnet_spec, stop_tokens, waveform

### Load Models

In [7]:
import os
import torch
import time
import IPython

from TTS.tf.utils.generic_utils import setup_model
from TTS.tf.utils.io import load_checkpoint
from TTS.utils.io import load_config
from TTS.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.utils.synthesis import synthesis

In [8]:
# runtime settings
use_cuda = False

In [9]:
# model paths
TTS_MODEL = "tts_model.pkl"
TTS_CONFIG = "config.json"
VOCODER_MODEL = "vocoder_model.pkl"
VOCODER_CONFIG = "config_vocoder.json"

In [10]:
# load configs (TTS_CONFIG and VOCODER_CONFIG are rebound from path strings to config objects)
TTS_CONFIG = load_config(TTS_CONFIG)
VOCODER_CONFIG = load_config(VOCODER_CONFIG)

In [11]:
# load the audio processor
ap = AudioProcessor(**TTS_CONFIG.audio)         

 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:./scale_stats.npy
 | > hop_length:256
 | > win_length:1024
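
The printout reports `frame_shift_ms` and `frame_length_ms` as `None`, so frame timing follows from `hop_length`, `win_length`, and `sample_rate`. The sketch below derives the timing from the values shown above. The helper `samples_for_frames` is hypothetical and assumes one hop of audio per spectrogram frame.

```python
# Values copied from the AudioProcessor printout above.
sample_rate = 22050
hop_length = 256
win_length = 1024

# Time between consecutive spectrogram frames, and the analysis window length.
frame_shift_ms = 1000.0 * hop_length / sample_rate
window_ms = 1000.0 * win_length / sample_rate
frames_per_second = sample_rate / hop_length


def samples_for_frames(n_frames: int, hop: int = hop_length) -> int:
    """Approximate waveform length for n_frames, assuming one hop per frame."""
    return n_frames * hop


print(f"frame shift: {frame_shift_ms:.2f} ms")
print(f"window length: {window_ms:.2f} ms")
print(f"frames per second: {frames_per_second:.1f}")
```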


In [12]:
# LOAD TTS MODEL
# multi-speaker settings (unused here: LJSpeech is a single-speaker dataset)
speaker_id = None
speakers = []

# load the model
num_chars = len(phonemes) if TTS_CONFIG.use_phonemes else len(symbols)
model = setup_model(num_chars, len(speakers), TTS_CONFIG)
model.build_inference()
model = load_checkpoint(model, TTS_MODEL)
model.decoder.set_max_decoder_steps(1000)

 > Using model: Tacotron2
(1, None, 80)
(1, None, 80)


In [13]:
from TTS.vocoder.tf.utils.generic_utils import setup_generator
# Note: this rebinds load_checkpoint to the vocoder loader, replacing the TTS one imported earlier
from TTS.vocoder.tf.utils.io import load_checkpoint

# LOAD VOCODER MODEL
vocoder_model = setup_generator(VOCODER_CONFIG)
vocoder_model.build_inference()
vocoder_model = load_checkpoint(vocoder_model, VOCODER_MODEL)
vocoder_model.inference_padding = 0

ap_vocoder = AudioProcessor(**VOCODER_CONFIG['audio'])    

 > Generator Model: multiband_melgan_generator
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:./scale_stats.npy
 | > hop_length:256
 | > win_length:1024


### Run Inference

In [17]:
sentence =  "Indian Institute of Technology Patna is an autonomous institute of education and research in science, engineering and technology located in Patna, India. It is recognized as an Institute of National Importance by the Government of India.."
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, ap)

(377088,)
 > Run-time: 7.135311603546143
 > Real-time factor: 0.41718847192292785
 > Time per step: 1.8920134866796054e-05
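
To make the reported metrics easier to interpret, the sketch below recomputes them from the numbers in the log above, assuming the 22050 Hz sample rate from the audio configuration. A real-time factor below 1 means synthesis ran faster than playback.

```python
# Numbers copied from the inference log above.
num_samples = 377088
sample_rate = 22050
run_time_s = 7.135311603546143

# Duration of the generated audio in seconds.
audio_seconds = num_samples / sample_rate
# Real-time factor: processing time per second of generated audio.
rtf = run_time_s / audio_seconds
# Processing time per generated waveform sample.
time_per_sample = run_time_s / num_samples

print(f"audio duration: {audio_seconds:.2f} s")
print(f"real-time factor: {rtf:.4f}")
print(f"time per sample: {time_per_sample:.3e} s")
print("faster than real time" if rtf < 1.0 else "slower than real time")
```

The recomputed values can differ from the logged ones in the trailing digits, depending on when the clock was read during the run.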
