[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tulasiram58827/TTS_TFLite/blob/main/End_to_End_TTS.ipynb)

This notebook demonstrates end-to-end text-to-speech (TTS) with TFLite models and lets you choose both the TTS model and the vocoder.

## About TTS Architecture

Speech synthesis, also called text-to-speech (TTS), generally consists of two steps. The intermediate representation between them is a spectrogram, one of the most commonly used features for representing speech.
- `TTS model`: generates a mel spectrogram from text. (The mel scale is a perceptually motivated frequency scale; the details are not needed to follow this notebook.)
- `Vocoder`: generates audio from the spectrogram. There are different types of vocoders:
    - Algorithm-based
        - Griffin-Lim
    - Neural-network-based
        - MelGAN
        - MB-MelGAN
        - Parallel WaveGAN
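
To make the hand-off between the two stages concrete, the sketch below relates the number of mel frames produced by the TTS model to the length of the audio a vocoder generates from them. The function name and the hop length of 256 samples per frame are assumptions for illustration (256 is a common setting for 22050 Hz LJSpeech models, but each model's config is the authority); only the 22050 Hz sample rate appears in this notebook.

```python
def mel_frames_to_audio_length(n_frames, hop_length=256, sample_rate=22050):
    """Return (n_samples, seconds) for a mel spectrogram with n_frames frames.

    hop_length is an assumed value; read it from the model config in practice.
    """
    n_samples = n_frames * hop_length
    return n_samples, n_samples / sample_rate


# A 228-frame spectrogram, under the assumed hop length of 256.
n_samples, seconds = mel_frames_to_audio_length(228)
print(n_samples, round(seconds, 2))  # 58368 2.65
```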

## Setup

In [1]:
!sudo apt-get install -y espeak
!pip install -q phonemizer

In [None]:
!git clone https://github.com/mozilla/TTS

%cd TTS
!git checkout c7296b3
!pip -q install -r requirements.txt
!python setup.py install
%cd ..

In [None]:
!git clone https://github.com/TensorSpeech/TensorFlowTTS.git
# `!cd` runs in a subshell and does not persist, so install by path instead.
!pip -q install /content/TensorFlowTTS/

In [None]:
# tf-nightly is required; otherwise FastSpeech2 will fail
!pip install -q tf-nightly

## Imports

In [None]:
import os
import torch
import time
import IPython

import sys
sys.path.append('/content/TensorFlowTTS')

import numpy as np
import tensorflow as tf
print(tf.__version__)

from TTS.tf.utils.tflite import load_tflite_model
from TTS.tf.utils.io import load_checkpoint
from TTS.utils.io import load_config
from TTS.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.utils.synthesis import synthesis

from tensorflow_tts.processor import LJSpeechProcessor
from tensorflow_tts.processor.ljspeech import valid_symbols
from tensorflow_tts.configs import FastSpeechConfig, FastSpeech2Config
from tensorflow_tts.configs import MelGANGeneratorConfig
from tensorflow_tts.inference import TFAutoModel, AutoConfig, AutoProcessor

from tensorflow_tts.models import TFFastSpeech, TFFastSpeech2
from tensorflow_tts.models import TFMelGANGenerator

## Acknowledgments

- `Tacotron2` - TFLite model provided by the [Mozilla TTS Repository](https://github.com/mozilla/TTS/). Refer to this [Notebook](https://colab.research.google.com/github/mozilla/TTS/blob/master/notebooks/DDC_TTS_and_MultiBand_MelGAN_TFLite_Example.ipynb#scrollTo=4dnpE0-kvTsu) for how the TFLite models are created.


- `FastSpeech2` - TFLite Model provided by [TensorFlow TTS](https://github.com/TensorSpeech/TensorFlowTTS/) and you can use this [Notebook](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/notebooks/TensorFlowTTS_FastSpeech_with_TFLite.ipynb) for creation of TFLite Models.


- `MelGAN` - Pre-trained Model provided by [TensorFlow TTS Repository](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/melgan#pretrained-models-and-audio-samples) and TFLite Model created with this [Notebook](https://github.com/tulasiram58827/TTS_TFLite/blob/main/MelGAN_TFLite.ipynb).


- `MB-MelGAN` - TFLite model provided by the [Mozilla TTS Repository](https://github.com/mozilla/TTS/).


- `Parallel WaveGAN` - PyTorch weights provided by the [Parallel WaveGAN Repository](https://github.com/kan-bayashi/ParallelWaveGAN#results). We converted them to TFLite; the conversion process is shown in this [Notebook](https://github.com/tulasiram58827/TTS_TFLite/blob/main/Parallel_WaveGAN_TFLite.ipynb).

**Note** that these models are trained on the `LJSpeech` dataset.

In [None]:
# Downloading Tacotron2 TFLite Model and its config
!gdown --id 17PYXCmTe0el_SLTwznrt3vOArNGMGo5v -O tts_model.tflite
!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json

# Downloading MB-MelGAN Vocoder TFLite Model and its config
!gdown --id 1aXveT-NjOM1mUr6tM4JfWjshq67GvVIO -O vocoder_model.tflite
!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json
!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy

# Downloading Parallel WaveGAN TFLite Models
!wget -q https://github.com/tulasiram58827/TTS_TFLite/raw/main/models/parallel_wavegan_dr.tflite
!wget -q https://github.com/tulasiram58827/TTS_TFLite/raw/main/models/parallel_wavegan_float16.tflite
    
# Downloading MelGAN TFlite Models
!wget -q https://github.com/tulasiram58827/TTS_TFLite/raw/main/models/melgan.tflite
!wget -q https://github.com/tulasiram58827/TTS_TFLite/raw/main/models/melgan_float16.tflite

# Downloading the Fastspeech2 model (available in dynamic-range quantization currently)
!wget -q https://github.com/tulasiram58827/TTS_TFLite/raw/main/models/fastspeech_quant.tflite

# Downloading ljspeech_mapper
!gdown --id 1YBaDdMlhTXxsKrH7mZwDu-2aODq5fr5e -O ljspeech_mapper.json
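
Downloads from Google Drive or GitHub can fail quietly, so it may help to confirm the files are present before running inference. The sketch below is one way to do that; `find_missing_files` and `EXPECTED_FILES` are hypothetical names, and the list holds the file names that the inference functions in this notebook load. Note that `run_melgan` builds its file name as `melgan_{quantization}.tflite`, so with `quantization='dr'` it looks for `melgan_dr.tflite`, whereas the download cell saves `melgan.tflite`; this check would report that mismatch.

```python
import os

# File names that the inference functions in this notebook try to load.
EXPECTED_FILES = [
    "tts_model.tflite", "config.json",
    "vocoder_model.tflite", "config_vocoder.json", "scale_stats.npy",
    "parallel_wavegan_dr.tflite", "parallel_wavegan_float16.tflite",
    "melgan_dr.tflite", "melgan_float16.tflite",
    "fastspeech_quant.tflite", "ljspeech_mapper.json",
]


def find_missing_files(names, directory="."):
    """Return the names that do not exist as regular files in directory."""
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]


missing = find_missing_files(EXPECTED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
```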

## MelGAN Inference

In [None]:
def run_melgan(mel_spec, quantization):
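    # Loads melgan_dr.tflite or melgan_float16.tflite. Note that the download
    # cell saves the other MelGAN model as melgan.tflite.
    model_name = f'melgan_{quantization}.tflite'

    # Add a batch dimension: (frames, n_mels) -> (1, frames, n_mels).
    feats = np.expand_dims(mel_spec, 0)
    interpreter = tf.lite.Interpreter(model_path=model_name)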

    input_details = interpreter.get_input_details()

    output_details = interpreter.get_output_details()

    interpreter.resize_tensor_input(input_details[0]['index'],  [1, feats.shape[1], feats.shape[2]], strict=True)
    interpreter.allocate_tensors()

    interpreter.set_tensor(input_details[0]['index'], feats)

    interpreter.invoke()

    output = interpreter.get_tensor(output_details[0]['index'])
    
    return output

## MB MelGAN Inference

In [None]:
def run_mb_melgan(mel_spec):
  VOCODER_MODEL = "vocoder_model.tflite"
  VOCODER_CONFIG = "config_vocoder.json"
  vocoder_model = load_tflite_model(VOCODER_MODEL)  
  VOCODER_CONFIG = load_config(VOCODER_CONFIG)
  # Add a batch dimension; the caller passes the spectrogram as (n_mels, frames).
  vocoder_inputs = mel_spec[None, :, :]
  # get input and output details
  input_details = vocoder_model.get_input_details()
  # reshape input tensor for the new input shape
  vocoder_model.resize_tensor_input(input_details[0]['index'], vocoder_inputs.shape)
  vocoder_model.allocate_tensors()
  detail = input_details[0]
  vocoder_model.set_tensor(detail['index'], vocoder_inputs)
  # run the model
  vocoder_model.invoke()
  # collect outputs
  output_details = vocoder_model.get_output_details()
  waveform = vocoder_model.get_tensor(output_details[0]['index'])
  return waveform 
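
The vocoders in this notebook do not share one input layout: `run_melgan` and `run_parallel_wavegan` add a batch axis to a frames-first spectrogram, while `run_mb_melgan` is called with the transposed spectrogram. The sketch below illustrates the two layouts with a dummy array. The `(228, 80)` shape is taken from the output shown in the Inference section, on the assumption that it is the mel spectrogram shape (frames by mel bins).

```python
import numpy as np

# Dummy spectrogram: 228 frames, 80 mel bins (assumed layout, see above).
mel = np.zeros((228, 80), dtype=np.float32)

# MelGAN / Parallel WaveGAN path: batch axis in front of (frames, n_mels).
frames_first = np.expand_dims(mel, 0)

# MB-MelGAN path: transpose to (n_mels, frames), then add the batch axis.
mels_first = mel.T[None, :, :]

print(frames_first.shape)  # (1, 228, 80)
print(mels_first.shape)    # (1, 80, 228)
```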

## Parallel WaveGAN Inference

In [None]:
def run_parallel_wavegan(melspec, quantization):
    model_name = f'parallel_wavegan_{quantization}.tflite'
    feats = np.expand_dims(melspec, 0)
    interpreter = tf.lite.Interpreter(model_path=model_name)

    input_details = interpreter.get_input_details()

    output_details = interpreter.get_output_details()

    interpreter.resize_tensor_input(input_details[0]['index'],  [1, feats.shape[1], feats.shape[2]], strict=True)
    interpreter.allocate_tensors()

    interpreter.set_tensor(input_details[0]['index'], feats)

    interpreter.invoke()

    output = interpreter.get_tensor(output_details[0]['index'])
    
    return output
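
`run_melgan` and `run_parallel_wavegan` differ only in the model file they load, so the shared resize, allocate, set, invoke, and read sequence could be factored out. The sketch below is one possible refactoring, not the notebook's own code: `invoke_with_dynamic_input` and `run_vocoder` are hypothetical names, and the helper accepts any object exposing the `tf.lite.Interpreter` methods already used above.

```python
import numpy as np


def invoke_with_dynamic_input(interpreter, feats):
    """Resize the first input to feats.shape, run the model, return the first output."""
    input_index = interpreter.get_input_details()[0]['index']
    output_index = interpreter.get_output_details()[0]['index']
    # The spectrogram length varies per utterance, so resize before allocating.
    interpreter.resize_tensor_input(input_index, list(feats.shape), strict=True)
    interpreter.allocate_tensors()
    interpreter.set_tensor(input_index, feats)
    interpreter.invoke()
    return interpreter.get_tensor(output_index)


def run_vocoder(interpreter, mel_spec):
    """Add a batch axis to a (frames, n_mels) spectrogram and run the vocoder."""
    feats = np.expand_dims(mel_spec, 0)
    return invoke_with_dynamic_input(interpreter, feats)
```

With TensorFlow installed, this would be called as `run_vocoder(tf.lite.Interpreter(model_path='parallel_wavegan_float16.tflite'), mel_spec)`. Creating the interpreter once and reusing it also avoids reloading the model on every call.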

## FastSpeech Inference

The inference code below is copied from the [TensorFlow TTS Notebook](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/notebooks/TensorFlowTTS_FastSpeech_with_TFLite.ipynb).

In [None]:
# Prepare input data.
def fastspeech_prepare_input(input_ids):
  input_ids = tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0)
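  # Remaining inputs, in the order used by the source notebook:
  # speaker ID, then speed, f0, and energy ratios (1.0 leaves each unchanged).
  return (input_ids,
          tf.convert_to_tensor([0], tf.int32),
          tf.convert_to_tensor([1.0], dtype=tf.float32),
          tf.convert_to_tensor([1.0], dtype=tf.float32),
          tf.convert_to_tensor([1.0], dtype=tf.float32))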

# Run the TFLite FastSpeech2 model on the given input text.
def fastspeech_infer(tflite_model_path, input_text):
  # Load the TFLite model and allocate tensors.
  interpreter = tf.lite.Interpreter(model_path=tflite_model_path)

  # Get input and output tensors.
  input_details = interpreter.get_input_details()
  output_details = interpreter.get_output_details()

  processor = AutoProcessor.from_pretrained(pretrained_path="ljspeech_mapper.json")
  input_ids = processor.text_to_sequence(input_text.lower())
  interpreter.resize_tensor_input(input_details[0]['index'], 
                                  [1, len(input_ids)])
  interpreter.resize_tensor_input(input_details[1]['index'], 
                                  [1])
  interpreter.resize_tensor_input(input_details[2]['index'], 
                                  [1])
  interpreter.resize_tensor_input(input_details[3]['index'], 
                                  [1])
  interpreter.resize_tensor_input(input_details[4]['index'], 
                                  [1])
  interpreter.allocate_tensors()
  input_data = fastspeech_prepare_input(input_ids)
  # Feed each prepared input to the matching model input.
  for i, detail in enumerate(input_details):
    interpreter.set_tensor(detail['index'], input_data[i])

  interpreter.invoke()

  # The function `get_tensor()` returns a copy of the tensor data.
  # Use `tensor()` in order to get a pointer to the tensor.
  return (interpreter.get_tensor(output_details[0]['index']),
          interpreter.get_tensor(output_details[1]['index']))

## Tacotron2 Inference

In [None]:
def run_tacotron2(text):
    use_cuda = False
    TTS_MODEL = "tts_model.tflite"
    TTS_CONFIG = "config.json"
    TTS_CONFIG = load_config(TTS_CONFIG)
    ap = AudioProcessor(**TTS_CONFIG.audio)
    speaker_id = None
    model = load_tflite_model(TTS_MODEL)
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(
        model, text, TTS_CONFIG, use_cuda, ap, speaker_id, style_wav=None,
        truncated=False, enable_eos_bos_chars=TTS_CONFIG.enable_eos_bos_chars,
        backend='tflite')
    return mel_postnet_spec, TTS_CONFIG.audio['sample_rate']

## TTS Inference Helper

In [None]:
def run_tts_inference(text, model_name='Tacotron2', vocoder_name='MB-MelGAN', quantization='float16'):
    # Stage 1: text -> mel spectrogram.
    if model_name == 'Tacotron2':
        tac_output, sample_rate = run_tacotron2(text)
    elif model_name == 'FastSpeech2':
        _, tac_output = fastspeech_infer('fastspeech_quant.tflite', text)
        tac_output = np.squeeze(tac_output)
        sample_rate = 22050
    else:
        # Fail early instead of hitting a NameError on tac_output below.
        raise ValueError(f'Unsupported TTS model: {model_name}')

    # Stage 2: mel spectrogram -> waveform.
    if vocoder_name == 'MelGAN':
        waveform = run_melgan(tac_output, quantization)
        waveform = np.squeeze(waveform)
    elif vocoder_name == 'MB-MelGAN':
        waveform = run_mb_melgan(tac_output.T)
        waveform = waveform[0, 0]
    elif vocoder_name == 'PWGAN':
        waveform = run_parallel_wavegan(tac_output, quantization)
        waveform = waveform[0, :, 0]
    else:
        raise ValueError(f'Unsupported vocoder: {vocoder_name}')

    IPython.display.display(IPython.display.Audio(waveform, rate=sample_rate))
    

## Choose model

In [None]:
tts_model = 'FastSpeech2' #@param ["Tacotron2", "FastSpeech2", "Glow-TTS"]

vocoder_model = 'MelGAN' #@param ["MelGAN", "MB-MelGAN", "PWGAN"]

quantization = 'float16' #@param ["dr", "float16"]

## Inference

In [None]:
text =  "Bill got in the habit of asking himself"

run_tts_inference(text, tts_model, vocoder_model, quantization)

(228, 80)
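
`run_tts_inference` only plays the audio in the notebook. To keep a result on disk, the waveform could be written as a WAV file with Python's standard `wave` module. The sketch below is an illustration under two assumptions: the waveform is a mono float array with values in [-1, 1], and the caller has access to it (for example by having `run_tts_inference` return `waveform`, which it does not do as written). `save_wav` is a hypothetical helper, and a synthetic tone stands in for model output.

```python
import wave

import numpy as np


def save_wav(path, waveform, sample_rate=22050):
    """Write a mono float waveform (assumed to lie in [-1, 1]) as 16-bit PCM."""
    samples = np.clip(np.asarray(waveform, dtype=np.float32), -1.0, 1.0)
    pcm = (samples * 32767.0).astype('<i2')  # little-endian int16
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)       # mono
        f.setsampwidth(2)       # 2 bytes per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())


# One second of a 440 Hz tone in place of a synthesized utterance.
t = np.arange(22050) / 22050
save_wav('tone.wav', 0.5 * np.sin(2 * np.pi * 440.0 * t))
```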


## Benchmarks

### MB-MelGAN

- Inference Time: 1.9ms
- Memory Footprint: 15MB
- Model Size: 9.7MB

### Tacotron2

- Inference Time: not reported
- Memory Footprint: not reported
- Model Size: 28.67MB

### MelGAN

#### Dynamic Range Quantization

- Inference Time: not reported
- Memory Footprint: not reported
- Model Size: 17MB

#### Float16 Quantization

- Inference Time: not reported
- Memory Footprint: not reported
- Model Size: 8.4MB

### Parallel WaveGAN

#### Dynamic Range Quantization

- Inference Time: not reported
- Memory Footprint: not reported
- Model Size: 5.7MB

#### Float16 Quantization

- Inference Time: not reported
- Memory Footprint: not reported
- Model Size: 3.2MB
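
The missing inference times and model sizes could be measured with a small helper like the sketch below. This is not how the figures above were obtained; `mean_latency_ms` and `file_size_mb` are hypothetical names, the run counts are arbitrary, and the size is reported in decimal megabytes, which may differ from the convention used above.

```python
import os
import time


def mean_latency_ms(fn, n_runs=10, n_warmup=2):
    """Average wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(n_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / n_runs


def file_size_mb(path):
    """File size in decimal megabytes (10**6 bytes)."""
    return os.path.getsize(path) / 1e6
```

For example, `mean_latency_ms(lambda: run_melgan(mel, 'float16'))` would time the MelGAN vocoder on a spectrogram `mel`. Because `run_melgan` creates a new interpreter on every call, that figure includes model loading as well as inference.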