<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_tts_tts-python-advanced-pretrain-tts-tao-training/nvidia_logo.png" style="width: 90px; float: right;">

# How to train Riva TTS models (FastPitch and HiFiGAN) with TAO Toolkit

This tutorial walks you through the steps to train Riva TTS models (FastPitch and HiFiGAN) from scratch on the LJSpeech dataset using TAO Toolkit.

## Overview

In this tutorial, we will customize the Riva TTS pipeline by training Riva TTS models with NVIDIA's TAO Toolkit.  

The main objective is to synthesize reasonable and natural speech for a given text. Since there is no universal standard for measuring the quality of synthesized speech, you will need to listen to some inferred speech to tell whether a TTS model is well trained.

TTS consists of two models: [FastPitch](https://arxiv.org/pdf/2006.06873.pdf) and [HiFi-GAN](https://arxiv.org/pdf/2010.05646.pdf).

* FastPitch is a spectrogram model that generates a Mel spectrogram from text input. It's a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference, and the generated speech can be further controlled by altering the predicted contours. FastPitch can thus change the perceived emotional state of the speaker or put emphasis on certain lexical units.

![FastPitch](./imgs/architecture-fastpitch.PNG)


* HiFiGAN is a vocoder model that generates audio from the Mel spectrograms produced by FastPitch. Its generator is a fully convolutional network that upsamples the Mel spectrogram to a raw waveform. It is trained adversarially against multi-period and multi-scale discriminators, with additional Mel-spectrogram and feature-matching losses that improve the perceptual quality of the synthesized speech. The linked paper reports that the model also generalizes to speakers not seen during training.

![HiFiGAN](./imgs/architecture-hifigan.PNG)

---
## TTS using TAO

The TAO launcher uses Docker containers under the hood, and **for our data and results directory to be visible to Docker, they need to be mapped**. The launcher can be configured using the config file `~/.tao_mounts.json`. Apart from the mounts, you can also configure additional options like the environment variables and the amount of shared memory available to the TAO launcher. <br>

`IMPORTANT NOTE:` The following code creates a sample `~/.tao_mounts.json`  file. Here, we can map directories in which we save the data, specs, results, and cache. You should configure it for your specific use case so these directories are correctly visible to the Docker container.

In [None]:
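import os

# Working directory for this tutorial.
# An absolute path is used because Docker bind mounts expect absolute host paths.
WORKING_DIR = os.path.abspath('tts_training')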

# Defining paths on the local host machine
%env HOST_DATA_DIR = {WORKING_DIR}/data
%env HOST_SPECS_DIR = {WORKING_DIR}/specs
%env HOST_RESULTS_DIR = {WORKING_DIR}/results

In [None]:
! mkdir -p $WORKING_DIR
! mkdir -p $HOST_DATA_DIR
! mkdir -p $HOST_SPECS_DIR
! mkdir -p $HOST_RESULTS_DIR

In [None]:
# Mapping up the local directories to the TAO docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tao_configs = {
   "Mounts":[
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       },
       {
           "source": os.path.expanduser("~/.cache"),
           "destination": "/root/.cache"
       }
   ],
   "DockerOptions": {
        "shm_size": "128G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tao_configs, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json
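
Since a wrong mapping only surfaces later as a "file not found" error inside the container, it can be worth checking the mounts file before going further. The sketch below is not part of TAO: `check_mounts` is a hypothetical helper that assumes the file layout written above and flags host directories that are missing or not absolute.

```python
import json
import os


def check_mounts(mounts_file):
    """Return a list of problems found in a TAO mounts file (empty if none)."""
    with open(mounts_file) as f:
        config = json.load(f)

    problems = []
    for mount in config.get("Mounts", []):
        source = mount["source"]
        if not os.path.isabs(source):
            problems.append(f"{source}: not an absolute path")
        elif not os.path.isdir(source):
            problems.append(f"{source}: directory does not exist")
    return problems


# Only run the check if the mounts file has already been written.
mounts_file = os.path.expanduser("~/.tao_mounts.json")
if os.path.exists(mounts_file):
    for problem in check_mounts(mounts_file):
        print("WARNING:", problem)
```

An empty result means every `source` directory exists on the host and is given as an absolute path.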

You can check the Docker image versions and the tasks each image performs by issuing `tao --help` or:

In [None]:
! tao info --verbose

### Set Relevant Paths

In [None]:
# NOTE: The following paths are set from the perspective of the TAO Docker.

# The data is saved here:
DATA_DIR = "/data"
SPECS_DIR = "/specs"
RESULTS_DIR = "/results"

# Set your encryption key and use the same key for all commands:
KEY = 'tlt_encode'

The command structure for the TAO interface can be broken down as follows: `tao <task name> <subcommand>` <br> 

Let's see this in further detail.


### Downloading Specs
TAO's conversational AI toolkit is driven by spec files, which make it easy to edit hyperparameters on the fly. You can modify or rewrite these specs, or override individual values through the launcher. Download the default spec files with the `download_specs` command. <br>

The `-o` argument indicates the folder where the default specification files will be downloaded. The `-r` argument instructs the script on where to save the logs. **Ensure the `-o` points to an empty folder.**

In [None]:
# Download spec files for FastPitch
! tao spectro_gen download_specs \
    -r $RESULTS_DIR/spectro_gen \
    -o $SPECS_DIR/spectro_gen

In [None]:
# Download spec files for HiFiGAN
! tao vocoder download_specs \
    -r $RESULTS_DIR/vocoder \
    -o $SPECS_DIR/vocoder

### Download Data

In this tutorial we will use the popular [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset. Let's download it.

In [None]:
# Checking if the dataset exists, otherwise download it
if os.path.exists(os.environ["HOST_DATA_DIR"] + '/ljspeech.tar.bz2'):
    print("Dataset exists, skipping download")
else:
    print("Dataset does not exist, downloading")
    ! wget -O $HOST_DATA_DIR/ljspeech.tar.bz2 https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2

After downloading, untar the dataset and move it to the correct directory.

In [None]:
! tar -xf $HOST_DATA_DIR/ljspeech.tar.bz2
! rm -rf $HOST_DATA_DIR/ljspeech
! mv LJSpeech-1.1 $HOST_DATA_DIR/ljspeech
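
To confirm the archive was extracted where the later steps expect it, you can peek at the dataset's `metadata.csv`. In LJSpeech this file has no header and uses `|` to separate the clip ID, the raw transcript, and the normalized transcript. The helper name `read_ljspeech_metadata` and the fallback path are illustrative assumptions, not part of TAO.

```python
import os


def read_ljspeech_metadata(dataset_dir):
    """Yield (clip_id, normalized_text) pairs from LJSpeech's metadata.csv."""
    metadata_path = os.path.join(dataset_dir, "metadata.csv")
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Each row is: clip ID | raw transcript | normalized transcript
            fields = line.split("|")
            yield fields[0], fields[-1]


# Fallback path is an assumption matching the WORKING_DIR used in this tutorial.
dataset_dir = os.path.join(
    os.environ.get("HOST_DATA_DIR", "tts_training/data"), "ljspeech"
)
if os.path.isfile(os.path.join(dataset_dir, "metadata.csv")):
    entries = list(read_ljspeech_metadata(dataset_dir))
    print(f"{len(entries)} clips listed in metadata.csv")
    print(entries[:1])
```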

### Pre-Processing

In TAO/NeMo format, the dataset consists of a set of utterances in individual audio files (.wav) and a manifest that describes the dataset, with information about one utterance per line.<br>
Each line of the manifest should be in the following format:

```
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147}
```

The `audio_filepath` field should provide an absolute path to the .wav file corresponding to the utterance. The `text` field should contain the full transcript for the utterance, and the `duration` field should reflect the duration of the utterance in seconds.
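
As an illustration of the format, the following sketch builds manifest records with the standard library only. The helpers `manifest_entry` and `write_manifest` are hypothetical and assume PCM-encoded `.wav` files. When training with TAO, `audio_filepath` should presumably be the path as seen inside the container (for example under `/data`), so treat the `os.path.abspath` call as a placeholder.

```python
import json
import os
import wave


def manifest_entry(wav_path, text):
    """Build one manifest record for a PCM .wav file."""
    with wave.open(wav_path, "rb") as wav:
        # Duration in seconds = number of frames / frames per second
        duration = wav.getnframes() / wav.getframerate()
    return {
        "audio_filepath": os.path.abspath(wav_path),
        "text": text,
        "duration": round(duration, 3),
    }


def write_manifest(entries, manifest_path):
    """Write the records as one JSON object per line."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
```

Each line of the resulting file can be read back independently with `json.loads`, which is what makes the one-record-per-line layout convenient for large datasets.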

This pre-processing step downloads the audio (.wav) to transcript (.txt) file lists for the LJSpeech dataset from NVIDIA and generates the manifest files in the format described above.

Run the command below to pre-process LJSpeech.

Be patient! This step can take several minutes.

In [None]:
! tao spectro_gen dataset_convert \
    -e $SPECS_DIR/spectro_gen/dataset_convert_ljs.yaml \
    -r $RESULTS_DIR/spectro_gen/dataset_convert \
    data_dir=$DATA_DIR/ljspeech \
    dataset_name=ljspeech

### Training 

The TAO interface enables you to configure the training parameters from the command-line interface. <br>

The process of opening the training script, finding the parameters of interest (which might be spread across multiple files), and making the required changes is replaced by a simple command-line interface.

For example, to change both the number of epochs and the learning rate, you can add `trainer.max_epochs=10` and `optim.lr=0.02` to the command and train the model. Sample commands are given below.
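
Conceptually, each `key.subkey=value` override replaces one entry in the nested spec. The sketch below only illustrates that idea; it is not TAO's actual parser, and the `spec` dictionary with its default values is made up for the example.

```python
def apply_overrides(config, overrides):
    """Apply 'dotted.key=value' strings to a nested dict, in place."""
    for override in overrides:
        dotted_key, _, raw_value = override.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Walk down the nesting, creating levels that do not exist yet
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return config


def parse_value(raw):
    """Interpret a value as int, then float, otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


# Hypothetical defaults standing in for the contents of a spec file
spec = {"trainer": {"max_epochs": 100}, "optim": {"lr": 0.001}}
apply_overrides(spec, ["trainer.max_epochs=10", "optim.lr=0.02"])
print(spec)  # {'trainer': {'max_epochs': 10}, 'optim': {'lr': 0.02}}
```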


For training TTS models in TAO, we use the `tao spectro_gen train` and `tao vocoder train` commands with the following arguments:
* `-e`: Path to the spec file
* `-g`: Number of GPUs to use
* `-r`: Path to the results folder
* `-k`: User-specified encryption key to use while saving/loading the model
* Any overrides to the spec file. For example, `trainer.max_epochs`.
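
If you launch several runs from a script rather than from notebook cells, it can help to assemble the command programmatically. The following is a minimal sketch using only the flags listed in this tutorial (`-e`, `-g`, `-r`, `-k`, plus `-m`, which appears in the inference and export commands). `build_tao_command` is a hypothetical helper, not part of the TAO launcher.

```python
import shlex


def build_tao_command(task, subcommand, spec, results_dir, key=None,
                      gpus=None, model=None, overrides=()):
    """Assemble a `tao <task> <subcommand>` command as an argument list."""
    cmd = ["tao", task, subcommand, "-e", spec, "-r", results_dir]
    if gpus is not None:
        cmd += ["-g", str(gpus)]
    if key is not None:
        cmd += ["-k", key]
    if model is not None:
        cmd += ["-m", model]
    # Spec overrides such as trainer.max_epochs=5 go last
    cmd += list(overrides)
    return cmd


cmd = build_tao_command(
    "spectro_gen", "train",
    spec="/specs/spectro_gen/train.yaml",
    results_dir="/results/spectro_gen/train",
    key="tlt_encode",
    gpus=1,
    overrides=["trainer.max_epochs=5"],
)
print(shlex.join(cmd))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run(cmd, check=True)` without shell quoting issues.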

NOTE: To get a complete TTS pipeline, you need **BOTH** a FastPitch (`spectro_gen`) model and a HiFi-GAN (`vocoder`) model. Because HiFi-GAN is largely universal within a given language, the pretrained weights from NGC may already give you good performance.

#### Training FastPitch

In [None]:
# FastPitch training needs prior alignment matrices, which are used to align the speech and text inputs.
# If an empty folder is provided, the priors are generated on the fly.
! mkdir -p $HOST_RESULTS_DIR/spectro_gen/train/prior_folder

In [None]:
!tao spectro_gen train \
     -e $SPECS_DIR/spectro_gen/train.yaml \
     -g 1 \
     -k $KEY \
     -r $RESULTS_DIR/spectro_gen/train \
     train_dataset=$DATA_DIR/ljspeech/ljspeech_train.json \
     validation_dataset=$DATA_DIR/ljspeech/ljspeech_val.json \
     prior_folder=$RESULTS_DIR/spectro_gen/train/prior_folder \
     trainer.max_epochs=5 \
     train_ds.num_workers=12 \
     validation_ds.num_workers=4

#### Training HiFi-GAN

Instead of passing `trainer.max_epochs`, HiFi-GAN requires the definition of `trainer.max_steps`. Defining `trainer.max_epochs` for HiFi-GAN has no effect.
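
To pick a step budget, it helps to translate steps into approximate passes over the data. The sketch below uses made-up numbers (12,500 training clips, batch size 16) and assumes one step per batch with no gradient accumulation. How the trainer counts steps during GAN training may differ, so treat the result as a rough guide only.

```python
import math


def steps_per_epoch(num_samples, batch_size, num_gpus=1):
    """Steps needed to see every training sample once."""
    return math.ceil(num_samples / (batch_size * num_gpus))


def epochs_covered(max_steps, num_samples, batch_size, num_gpus=1):
    """Approximate number of passes over the data for a given step budget."""
    return max_steps / steps_per_epoch(num_samples, batch_size, num_gpus)


# Hypothetical numbers: 12,500 training clips and a batch size of 16
print(steps_per_epoch(12500, 16))                  # 782
print(round(epochs_covered(10000, 12500, 16), 1))  # 12.8
```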

In [None]:
!tao vocoder train \
     -e $SPECS_DIR/vocoder/train.yaml \
     -g 1 \
     -k $KEY \
     -r $RESULTS_DIR/vocoder/train \
     train_dataset=$DATA_DIR/ljspeech/ljspeech_train.json \
     validation_dataset=$DATA_DIR/ljspeech/ljspeech_val.json \
     trainer.max_steps=10000

---
## TTS Inference with TAO Toolkit

In this section, we are going to run inference on the trained TTS models. As previously mentioned, there is no universal standard for measuring the quality of synthesized speech, so you will need to listen to some inferred speech to tell whether a TTS model is well trained. For this reason, TAO Toolkit provides `infer` functionality for TTS but no `evaluate` functionality.

The inference in the following cells is not optimized for real-time performance. For real-time inference and the best latency, deploy the models using Riva.

### TTS Inference with TLT checkpoint

In this section, we will run inference on the `.tlt` checkpoint trained with TAO Toolkit.

#### Generate spectrogram

The first step of inference is to generate a spectrogram: a NumPy array (saved as a `.npy` file) for each sentence, which a vocoder can then convert to audio. We use the FastPitch model we just trained to generate the spectrogram.

You may have to work with the `infer.yaml` file to set the texts you want for inference.

In [None]:
!tao spectro_gen infer \
     -e $SPECS_DIR/spectro_gen/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/spectro_gen/infer \
     output_path=$RESULTS_DIR/spectro_gen/infer/spectro

#### Generate sound file

The second step for inference is generating a `.wav` sound file based on a spectrogram you generated in the previous step.

In [None]:
!tao vocoder infer \
     -e $SPECS_DIR/vocoder/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/vocoder/infer \
     input_path=$RESULTS_DIR/spectro_gen/infer/spectro \
     output_path=$RESULTS_DIR/vocoder/infer/wav

In [None]:
import os
import IPython.display as ipd
# change path of the file here
ipd.Audio(os.environ["HOST_RESULTS_DIR"] + '/vocoder/infer/wav/0.wav')
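
Besides listening, a quick look at the generated files can catch obvious failures such as empty or extremely short clips. This sketch assumes the vocoder wrote PCM-encoded `.wav` files into the output folder used above; Python's standard `wave` module cannot read other encodings, so unreadable files are reported with `None` values. `summarize_wavs` is a hypothetical helper.

```python
import glob
import os
import wave


def summarize_wavs(directory):
    """Return (filename, sample_rate, duration_seconds) for each .wav file."""
    summary = []
    for path in sorted(glob.glob(os.path.join(directory, "*.wav"))):
        name = os.path.basename(path)
        try:
            with wave.open(path, "rb") as wav:
                rate = wav.getframerate()
                duration = wav.getnframes() / rate
        except wave.Error:
            # Not a PCM WAV file that the standard library can parse
            summary.append((name, None, None))
            continue
        summary.append((name, rate, round(duration, 2)))
    return summary


# Fallback path is an assumption matching the WORKING_DIR used in this tutorial.
wav_dir = os.path.join(
    os.environ.get("HOST_RESULTS_DIR", "tts_training/results"),
    "vocoder", "infer", "wav",
)
for name, rate, duration in summarize_wavs(wav_dir):
    print(f"{name}: {rate} Hz, {duration} s")
```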

---
## TTS model export

With TAO, you can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs. The same command for exporting to ONNX can be used here. The only small variation is the configuration for `export_format` in the spec file.

#### Export to RIVA

In [None]:
!tao spectro_gen export \
     -e $SPECS_DIR/spectro_gen/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/spectro_gen/export \
     export_format=RIVA \
     export_to=spectro_gen.riva

In [None]:
!tao vocoder export \
     -e $SPECS_DIR/vocoder/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/vocoder/export \
     export_format=RIVA \
     export_to=vocoder.riva

In [None]:
# Moving FastPitch and HiFiGAN .riva models to a common folder
# This is required for Riva to build a TTS pipeline
! mkdir -p $HOST_RESULTS_DIR/riva
! cp $HOST_RESULTS_DIR/spectro_gen/export/spectro_gen.riva $HOST_RESULTS_DIR/riva/
! cp $HOST_RESULTS_DIR/vocoder/export/vocoder.riva $HOST_RESULTS_DIR/riva/

## What's Next?

Now that we've trained FastPitch and HiFiGAN, we can deploy these models to NVIDIA Riva.

Make sure to keep the path to `spectro_gen.riva` and `vocoder.riva` handy for deployment, i.e., `tts_training/results/riva/`.