<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_tts_tts-python-advanced-pretrain-tts-tao-training/nvidia_logo.png" style="width: 90px; float: right;">

# How to train Riva TTS models (FastPitch and HiFiGAN) with NeMo

NeMo Toolkit is a Python-based AI toolkit for training and customizing purpose-built pre-trained AI models with your own data. In this tutorial, we will train the models from scratch, but one can easily customize them via transfer learning instead. 

Transfer learning transfers the features learned by an existing neural network to a new one. It is often used when creating a large training dataset is not feasible.

Developers, researchers, and software partners building intelligent AI apps and services can bring their own data to fine-tune pre-trained models instead of going through the hassle of training from scratch.

Let's see this in action with a use case for Speech Synthesis!

## Overview

In this tutorial, we will customize the Riva TTS pipeline by training Riva TTS models with NVIDIA's NeMo Toolkit.  

The main objective is to synthesize reasonable and natural speech for given text. Since there are no universal standards to measure the quality of synthesized speech, you will need to listen to some inferred speech to tell whether a TTS model is well trained.

TTS consists of two models: [FastPitch](https://arxiv.org/pdf/2006.06873.pdf) and [HiFi-GAN](https://arxiv.org/pdf/2010.05646.pdf).

* FastPitch is a spectrogram model that generates a Mel spectrogram from text input. It's a fully parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference, and the generated speech can be further controlled by modifying the predicted contours. FastPitch can thus change the perceived emotional state of the speaker or put emphasis on certain lexical units.

![FastPitch](./imgs/architecture-fastpitch.PNG)


* HiFiGAN is a vocoder model that generates audio from the Mel spectrograms produced by FastPitch. As described in the linked paper, its generator is a fully convolutional network that upsamples the Mel spectrogram to a raw waveform using transposed convolutions and multi-receptive field fusion modules. It is trained adversarially against two sets of discriminators: a multi-period discriminator, which examines periodic patterns in the waveform, and a multi-scale discriminator, which examines the waveform at several resolutions. Feature matching and Mel spectrogram losses are added to the adversarial loss to improve the perceptual quality of the synthesized speech.

![HiFiGAN](./imgs/architecture-hifigan.PNG)

---
## Let's Dig in: TTS using NeMo

This notebook assumes that you are already familiar with TTS Training using NeMo, as described in the [text-to-speech-training](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_MixerTTS_Training.ipynb) notebook, and that you have a pretrained TTS model.

After [installing NeMo](https://github.com/NVIDIA/NeMo#installation), the next step is to set up the paths to save data and results. NeMo can be used with Docker containers or virtual environments.

Replace each `FIXME` placeholder with the required path, written as a string enclosed in double quotes.

`IMPORTANT NOTE:` Here, we map directories in which we save the data, specs, results and cache. You should configure it for your specific case so these directories are correctly visible to the docker container. Make sure this tutorial is in the NeMo folder.

### Installation of packages and importing of files

We will first install all necessary packages.

In [None]:
# Quote the requirement so the shell does not treat ">" as an output redirection
! pip install "numba>=0.53"
! pip install librosa
! pip install soundfile
! pip install tqdm
! pip install Cython

In [None]:
import os
from pathlib import Path

In [None]:
# Clone NeMo locally
# Change this path if you don't want to clone NeMo to the directory containing this tutorial
NEMO_DIR = os.path.join(os.getcwd(), "NeMo")
! git clone https://github.com/NVIDIA/NeMo $NEMO_DIR

In [None]:
## Install NeMo
BRANCH = 'main'
! python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

### Set Relevant Paths

In [None]:
import os

# The data is saved here
DATA_DIR = os.path.join(os.path.abspath("tts-models"), "datasets")
RESULTS_DIR = os.path.join(os.path.abspath("tts-models"), "results")

! mkdir -p {DATA_DIR}
! mkdir -p {RESULTS_DIR}

os.environ["DATA_DIR"] = DATA_DIR
os.environ["RESULTS_DIR"] = RESULTS_DIR

### Data

In this tutorial, we will illustrate the process of training FastPitch and HiFiGAN from scratch on the LJSpeech dataset. First, let's download and pre-process the original LJSpeech dataset and set variables that point to the associated manifest `.json` files.

### Pre-Processing

This step downloads the audio-to-text file lists for LJSpeech from NVIDIA and generates the manifest files `train_manifest.json`, `val_manifest.json`, and `test_manifest.json`.

If you use your own dataset, you have to generate three manifest files yourself: `ljs_audio_text_train_manifest.json`, `ljs_audio_text_val_manifest.json`, and `ljs_audio_text_test_manifest.json`. These files correspond to your train / val / test split. In each file, the number of rows should equal the number of samples in that split. For a single-speaker dataset, each row should look like this:

```
{"audio_filepath": "path_to_audio_file", "text": "text_of_the_audio", "duration": duration_of_the_audio}
```

For a multi-speaker dataset, each row also carries a speaker ID:

```
{"audio_filepath": "path_to_audio_file", "text": "text_of_the_audio", "duration": duration_of_the_audio, "speaker": speaker_id}
```

An example row is:

```
{"audio_filepath": "actressinhighlife_01_bowen_0001.flac", "text": "the pleasant season did my heart employ", "duration": 2.4}
```
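
If you are building manifests for your own data, rows in the format above can be produced with a few lines of Python. The following is a minimal sketch, not part of NeMo: `wav_duration` and `write_manifest` are hypothetical helper names, and the sketch assumes uncompressed WAV input that the standard `wave` module can read (for FLAC or other formats you would need a library such as `soundfile`).

```python
import json
import wave


def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(records, manifest_path, speaker=None):
    """Write one JSON object per line, in the manifest row format shown above.

    records: iterable of (audio_filepath, text) pairs (hypothetical input layout).
    speaker: optional speaker ID, added only for multi-speaker datasets.
    """
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, text in records:
            row = {
                "audio_filepath": str(audio_filepath),
                "text": text,
                "duration": round(wav_duration(audio_filepath), 3),
            }
            if speaker is not None:
                row["speaker"] = speaker
            f.write(json.dumps(row) + "\n")
```

Calling `write_manifest` once per split gives you the three files described above. The text is written as-is, so any normalization has to happen before this step.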

We will now download the audio and the manifest files, convert them to the above format, and normalize the text. These steps for LJSpeech can be found in NeMo [`scripts/dataset_processing/tts/ljspeech/get_data.py`](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/ljspeech/get_data.py). Be patient, this step is expected to take some time.

In [None]:
!python $NEMO_DIR/scripts/dataset_processing/tts/ljspeech/get_data.py \
    --data-root $DATA_DIR \
    --whitelist-path $NEMO_DIR/scripts/dataset_processing/tts/ljspeech/lj_speech.tsv

### Getting Pitch Statistics

Training FastPitch requires you to set two values for pitch extraction:
  - `avg`: The average used to normalize the pitch
  - `std`: The standard deviation used to normalize the pitch

We can compute pitch for the training data using [`scripts/dataset_processing/tts/extract_sup_data.py`](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/extract_sup_data.py) and extract pitch statistics using the NeMo script [`scripts/dataset_processing/tts/compute_speaker_stats.py`](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/compute_speaker_stats.py). We have already downloaded these files earlier in the tutorial. Let's use them to get `pitch_mean` and `pitch_std`.
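
To make the role of these two values concrete, the sketch below standardizes a pitch contour using a mean and standard deviation computed over voiced frames only. It is an illustration under assumptions, not NeMo's implementation: the function names are hypothetical, unvoiced frames are assumed to be marked with `0.0`, the population standard deviation is used, and the toy contour is made up.

```python
import statistics


def pitch_stats(contours):
    """Mean and standard deviation over the voiced (non-zero) frames of all contours."""
    voiced = [f0 for contour in contours for f0 in contour if f0 > 0.0]
    return statistics.fmean(voiced), statistics.pstdev(voiced)


def normalize_pitch(contour, mean, std):
    """Standardize voiced frames; keep unvoiced frames at 0.0."""
    return [(f0 - mean) / std if f0 > 0.0 else 0.0 for f0 in contour]


# Toy contour in Hz; 0.0 marks unvoiced frames (made-up values).
contour = [0.0, 200.0, 220.0, 0.0, 180.0]
mean, std = pitch_stats([contour])
normalized = normalize_pitch(contour, mean, std)
```

After normalization the voiced frames have zero mean and unit variance, which is why the statistics must come from the same data the model is trained on.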

**Note**: It can take several hours for this script to compute the supplementary statistics for the LJSpeech dataset.

First, we will extract the pitch supplementary data using the `extract_sup_data.py` script. This script works with a YAML config file, `ds_for_fastpitch_align.yaml`, which we downloaded above. To make this work for your dataset, simply change `manifest_filepath` to your manifest path. The argument `sup_data_path` determines where the supplementary data is stored.

Set the paths to the LJSpeech manifest files

In [None]:
ljspeech_dir = os.path.join(DATA_DIR, "LJSpeech-1.1")
train_manifest_json = os.path.join(ljspeech_dir, "train_manifest.json")
val_manifest_json   = os.path.join(ljspeech_dir, "val_manifest.json")
test_manifest_json  = os.path.join(ljspeech_dir, "test_manifest.json")

Path to the directory containing the `ds_for_fastpitch_align.yaml` configuration file

In [None]:
config_path = os.path.join(NEMO_DIR, "scripts/dataset_processing/tts/ljspeech/ds_conf")

Specify the output paths for the `extract_sup_data.py` script, or more precisely the `ds_for_fastpitch_align.yaml` configuration file on which it depends

In [None]:
sup_data_path = os.path.join(ljspeech_dir, "sup_data_path")
pitch_stats_path = os.path.join(ljspeech_dir, "pitch_stats.json")

Path to the `extract_sup_data.py` script

In [None]:
extract_sup_data = os.path.join(NEMO_DIR, "scripts/dataset_processing/tts/extract_sup_data.py")

The script [`scripts/dataset_processing/tts/extract_sup_data.py`](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/extract_sup_data.py) writes the pitch mean and pitch std (standard deviation) to the console. We'll route that output to a file, then read and parse the file to extract the pitch mean and std.

In [None]:
# Redirect both stdout and stderr to a file; "> file 2>&1" also works in shells without "&>"
! cd $NEMO_DIR && \
    python $extract_sup_data --config-path=$config_path \
    manifest_filepath=$train_manifest_json sup_data_path=$sup_data_path \
    > $ljspeech_dir/sup_data_console_output.txt 2>&1

In [None]:
with open(os.path.join(ljspeech_dir, "sup_data_console_output.txt"), "r") as f:
    cmd_str_list = [line.rstrip() for line in f]

In [None]:
cmd_str = [c for c in cmd_str_list if "PITCH_MEAN" in c][0]
cmd_str = cmd_str[cmd_str.find('PITCH_MEAN='):]

In [None]:
# Extract pitch mean and std from the command line
pitch_mean_str = cmd_str.split(',')[0]
pitch_mean = float(pitch_mean_str.split('=')[1])
pitch_std_str = cmd_str.split(',')[1]
pitch_std = float(pitch_std_str.split('=')[1])
pitch_mean, pitch_std
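
The string splitting above depends on the exact position of the comma and the equals signs. A regular expression is a slightly more defensive alternative. The sketch below assumes the console line has the form `PITCH_MEAN=<float>, PITCH_STD=<float>`, which is inferred from the parsing cells above rather than from the script itself; `parse_pitch_stats` is a hypothetical helper name.

```python
import re

# One floating-point number, with optional sign and exponent.
_NUMBER = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"
PITCH_RE = re.compile(
    rf"PITCH_MEAN=(?P<mean>{_NUMBER}),\s*PITCH_STD=(?P<std>{_NUMBER})"
)


def parse_pitch_stats(lines):
    """Return (pitch_mean, pitch_std) from the first line that contains both values."""
    for line in lines:
        match = PITCH_RE.search(line)
        if match:
            return float(match.group("mean")), float(match.group("std"))
    raise ValueError("No PITCH_MEAN/PITCH_STD line found in the console output")
```

With the list read earlier, this would be called as `pitch_mean, pitch_std = parse_pitch_stats(cmd_str_list)`. Unlike indexing with `[0]`, it raises a descriptive error if the script failed before printing the statistics.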

Setting the `pitch_mean` and `pitch_std` based on the results from the cell above.

In [None]:
os.environ["pitch_mean"] = str(pitch_mean)
os.environ["pitch_std"] = str(pitch_std)

print(f"pitch mean: {pitch_mean}")
print(f"pitch std: {pitch_std}")

### Training

We are now ready to train our TTS models. We'll start with FastPitch, then proceed to HiFiGAN.

#### Training FastPitch

We'll use [`examples/tts/fastpitch.py`](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/fastpitch.py) to train FastPitch. Doing so properly will take many more epochs than the default value of 10 given here. 

If you wish to fine-tune FastPitch with your own dataset, use [`examples/tts/fastpitch_finetune.py`](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/fastpitch_finetune.py) instead. Change the dataset arguments accordingly, and add the argument `+init_from_pretrained_model="tts_en_fastpitch"`. This will initialize the model with the pretrained [FastPitch](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_fastpitch) checkpoint available from NGC. For more details, refer to this [TTS Fine-Tuning tutorial](https://github.com/nvidia-riva/tutorials/blob/main/tts-finetune-nemo.ipynb). 

In [None]:
!(cd $NEMO_DIR && \
  python $NEMO_DIR/examples/tts/fastpitch.py \
  --config-name=fastpitch_align_v1.05.yaml \
  train_dataset=$train_manifest_json \
  validation_datasets=$val_manifest_json \
  sup_data_path=$sup_data_path \
  exp_manager.exp_dir=$RESULTS_DIR \
  trainer.max_epochs=10 \
  trainer.check_val_every_n_epoch=10 \
  model.train_ds.dataloader_params.batch_size=24 \
  model.validation_ds.dataloader_params.batch_size=24 \
  model.n_speakers=1 \
  model.pitch_mean=$pitch_mean \
  model.pitch_std=$pitch_std \
  model.optim.lr=2e-4 \
  ~model.optim.sched \
  model.optim.name=adam \
  trainer.devices=1 \
  trainer.strategy=null \
  +model.text_tokenizer.add_blank_at=true \
)

Let's take a closer look at the training command:

* `--config-name=fastpitch_align_v1.05.yaml`
  * We first tell the script what config file to use.

* `train_dataset=$train_manifest_json 
  validation_datasets=$val_manifest_json 
  sup_data_path=$sup_data_path`
  * We tell the script what manifest files to train and eval on, as well as where supplementary data is located (or will be calculated and saved during training if not provided).
  
* `phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.10 
heteronyms_path=tts_dataset_files/heteronyms-052722
whitelist_path=tts_dataset_files/tts.tsv`
  * These arguments tell the script where the phoneme dictionary, the heteronyms list, and the whitelist file are located. These additional files are used in preprocessing the data. The command above does not pass these arguments, so the values set in the config file apply; add them only if you need to point to different files.
  
* `trainer.max_epochs=10 trainer.check_val_every_n_epoch=10`
  * For this experiment, we tell the script to train for 10 epochs. You will need to train FastPitch for many more epochs to obtain good results.

* `model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24`
  * Set batch sizes for the training and validation data loaders.

* `model.n_speakers=1`
  * The number of speakers in the data. There is only 1 for now, but we will revisit this parameter later in the notebook.

* `model.pitch_mean=$pitch_mean model.pitch_std=$pitch_std`
  * Pitch statistics which we computed by running the script `python <NeMo_base>/scripts/dataset_processing/tts/extract_sup_data.py manifest_filepath=<your_manifest_path>`.
  * `model.pitch_fmin` and `model.pitch_fmax` are hyperparameters to librosa's pyin function. We recommend tweaking these only if the speaker is in a noisy environment, such that background noise isn't predicted to be speech.

* `model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam`
  * For fine-tuning, we lower the learning rate.
  * We use a fixed learning rate of 2e-4.
  * We switch from the lamb optimizer to the adam optimizer.

* `trainer.devices=1 trainer.strategy=null`
  * For this notebook, we default to 1 GPU, which means that we do not need DDP (distributed data parallel).
  * If you have the compute resources, feel free to scale this up to the number of free GPUs you have available.
  * Please remove the `trainer.strategy=null` argument if you intend to train on multiple GPUs.

#### Generating Mel Spectrograms

We'll use the [`scripts/dataset_processing/tts/generate_mels.py`](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/generate_mels.py) script to pass the training data into FastPitch and generate mel spectrograms. 

Relative to `RESULTS_DIR`, your FastPitch model checkpoint should have a path of the form `FastPitch/<START DATE & TIME>/checkpoints/FastPitch--val_loss=<val_loss>-epoch=<epoch>.ckpt`. Modify it accordingly in the cell below.

In [None]:
fastpitch_checkpoint = os.path.join(RESULTS_DIR, "FastPitch/<START DATE & TIME>/checkpoints/FastPitch--val_loss=<val_loss>-epoch=<epoch>.ckpt")
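
Instead of filling in the date and loss values by hand, you can search the results directory for checkpoints. The sketch below is one way to do that, assuming the directory layout described above; `latest_checkpoint` is a hypothetical helper, and it picks the most recently modified file, which is not necessarily the one with the lowest validation loss.

```python
import glob
import os


def latest_checkpoint(results_dir, model_name):
    """Return the newest .ckpt under <results_dir>/<model_name>/*/checkpoints/."""
    pattern = os.path.join(results_dir, model_name, "*", "checkpoints", "*.ckpt")
    candidates = glob.glob(pattern)
    if not candidates:
        raise FileNotFoundError(f"No checkpoint matches {pattern}")
    # Modification time is used as a proxy for "latest".
    return max(candidates, key=os.path.getmtime)
```

For example, `fastpitch_checkpoint = latest_checkpoint(RESULTS_DIR, "FastPitch")` would replace the placeholder path in the cell above.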

Generate the mel spectrograms from the training data. They'll be placed in a folder named `mels` and catalogued in `train_manifest_mel.json`, both of which will be contained in `ljspeech_dir`. 

In [None]:
!(cd $NEMO_DIR && \
  python $NEMO_DIR/scripts/dataset_processing/tts/generate_mels.py \
  --fastpitch-model-ckpt $fastpitch_checkpoint \
  --input-json-manifests $train_manifest_json \
  --output-json-manifest-root $ljspeech_dir \
 )

#### Training HiFiGAN

Now let's train HiFiGAN with the [examples/tts/hifigan.py](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/hifigan.py) script and the configs present in [examples/tts/conf/hifigan](https://github.com/NVIDIA/NeMo/tree/main/examples/tts/conf/hifigan). Doing so properly will take many more steps than the default value of 10000 given here.

If you wish to fine-tune HiFiGAN, use [examples/tts/hifigan_finetune.py](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/hifigan_finetune.py) instead. You'll need to generate mel spectrograms from your fine-tuning data instead of the LJSpeech training data. You should also add the argument `+init_from_pretrained_model=tts_hifigan` in calling the fine-tuning script. This will initialize the model with the pretrained [HiFiGAN](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_hifigan) checkpoint available from NGC. For more details, refer to this [TTS Fine-Tuning tutorial](https://github.com/nvidia-riva/tutorials/blob/main/tts-finetune-nemo.ipynb). 

Create a small validation dataset for HiFiGAN training.

In [None]:
hifigan_train_ds = os.path.join(ljspeech_dir, "train_manifest_mel.json")
hifigan_val_ds   = os.path.join(ljspeech_dir, "val_manifest_mel.json")

In [None]:
! cat $hifigan_train_ds | tail -n 2 > $hifigan_val_ds

Run the following command to train HiFiGAN.

In [None]:
!(cd $NEMO_DIR && \
  python $NEMO_DIR/examples/tts/hifigan.py \
  --config-name=hifigan.yaml \
  model.train_ds.dataloader_params.batch_size=32 \
  model.max_steps=10000 \
  model.optim.lr=0.00001 \
  ~model.optim.sched \
  train_dataset=$hifigan_train_ds \
  validation_datasets=$hifigan_val_ds \
  exp_manager.exp_dir=$RESULTS_DIR \
  trainer.check_val_every_n_epoch=10 \
  model/train_ds=train_ds_finetune \
  model/validation_ds=val_ds_finetune)

### TTS Inference

As mentioned above, since there is no universal standard for measuring the quality of synthesized speech, you will need to listen to some inferred speech to tell whether a TTS model is well trained. Therefore, NeMo Toolkit does not provide `evaluate` functionality for TTS, only `infer` functionality.

#### Generate spectrogram and audio

The first step for inference is generating a spectrogram. That's a NumPy array (which can be saved as a `.npy` file) for a sentence, which a vocoder can convert to speech. We use the FastPitch model we just trained to generate the spectrogram.

Please update the `hifigan_checkpoint` variable with the path to the HiFiGAN checkpoint you want to use. Relative to `RESULTS_DIR`, it should have a path of the form `HifiGan/<START DATE & TIME>/checkpoints/HifiGan--val_loss=<val_loss>-epoch=<epoch>.ckpt`.

In [None]:
hifigan_checkpoint = os.path.join(RESULTS_DIR, "HifiGan/<START DATE & TIME>/checkpoints/HifiGan--val_loss=<val_loss>-epoch=<epoch>.ckpt")

Let's load the two models, FastPitch and HiFiGAN, for inference.

In [None]:
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

HOME_DIR = os.getcwd()
os.chdir(NEMO_DIR)

vocoder = HifiGanModel.load_from_checkpoint(hifigan_checkpoint)
vocoder = vocoder.eval().cuda()
spec_model = FastPitchModel.load_from_checkpoint(fastpitch_checkpoint)
spec_model.eval().cuda()

os.chdir(HOME_DIR)

Let's create a helper method to run inference given a string input. For multi-speaker inference, the same method can be used by passing the speaker ID as a parameter.

In [None]:
import torch

def infer(spec_gen_model, vocoder_model, str_input, speaker=None):
    """
    Synthesizes spectrogram and audio from a text string given a spectrogram synthesis and vocoder model.
    
    Args:
        spec_gen_model: Spectrogram generator model (FastPitch in our case)
        vocoder_model: Vocoder model (HiFiGAN in our case)
        str_input: Text input for the synthesis
        speaker: Speaker ID
    
    Returns:
        spectrogram and waveform of the synthesized audio.
    """
    with torch.no_grad():
        parsed = spec_gen_model.parse(str_input)
        if speaker is not None:
            speaker = torch.tensor([speaker]).long().to(device=spec_gen_model.device)
        spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, speaker=speaker)
        audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)
        
    if spectrogram is not None:
        if isinstance(spectrogram, torch.Tensor):
            spectrogram = spectrogram.to('cpu').numpy()
        if len(spectrogram.shape) == 3:
            spectrogram = spectrogram[0]
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()
    return spectrogram, audio

Helper function for reading manifest `.json` files

In [None]:
import random
import json

def json_reader(filename):
    with open(filename) as f:
        for line in f:
            yield json.loads(line)

Running the next cell will generate the following for each line in the manifest `.json` file for your test data: 
- The ground truth audio sample
- A mel spectrogram generated by passing the transcribed text into your trained FastPitch model
- A synthesized audio sample generated by passing the mel spectrogram into your trained HiFiGAN model

In [None]:
import IPython.display as ipd
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt

# Path to test manifest file (.json)
test_records_path = os.path.join(ljspeech_dir, 'test_manifest.json')
test_records = list(json_reader(test_records_path))
new_speaker_id = None

# Render plots inline; this only needs to be set once, before the loop
%matplotlib inline

for test_record in test_records:
    print("Ground-truth test audio")
    ipd.display(ipd.Audio(test_record['audio_filepath'], rate=22050))
    duration_sec = test_record['duration']
    # Use the speaker ID from the manifest if present; otherwise fall back to new_speaker_id
    speaker_id = test_record.get('speaker', new_speaker_id)
    print(f"SYNTHESIZED | Duration: {duration_sec} sec | Text: {test_record['text']}")
    spec, audio = infer(spec_model, vocoder, test_record['text'], speaker=speaker_id)
    ipd.display(ipd.Audio(audio, rate=22050))
    imshow(spec, origin="lower", aspect="auto")
    plt.show()
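
Since quality has to be judged by ear, it can be handy to keep the synthesized samples on disk and compare them across training runs. The following sketch writes mono float samples to a 16-bit WAV file using only the standard library. `save_wav` is a hypothetical helper, and it assumes the samples lie roughly in `[-1.0, 1.0]` at the 22050 Hz rate used above; values outside that range are clipped.

```python
import array
import sys
import wave


def save_wav(samples, path, sample_rate=22050):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
```

Assuming the `audio` array returned by `infer` has shape `(1, num_samples)`, a call such as `save_wav(audio[0], "sample.wav")` would store the first (and only) utterance in the batch.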

#### Debug

The data provided is only meant to be a sample to understand how training works in NeMo. In order to generate better speech quality, you will need to train FastPitch for far more than the default `trainer.max_epochs=10` epochs and HiFiGAN for far more than the default `model.max_steps=10000` steps. 

If you're fine-tuning pre-trained models, we recommend recording at least 30 minutes of your own audio, and setting the number of fine-tuning steps for both models to `trainer.max_steps=5000`.
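
The commands in this notebook limit FastPitch training by epochs and HiFiGAN training by steps, so it helps to be able to convert between the two. The arithmetic below is a rough sketch: it assumes one optimizer step per batch (no gradient accumulation) and that the final partial batch is kept, and the dataset size of 500 utterances is a made-up example.

```python
import math


def steps_per_epoch(num_samples, batch_size, num_devices=1):
    """Optimizer steps in one pass over the data, keeping the final partial batch."""
    return math.ceil(num_samples / (batch_size * num_devices))


def epochs_for_steps(max_steps, num_samples, batch_size, num_devices=1):
    """Number of (possibly partial) epochs needed to reach max_steps."""
    return math.ceil(max_steps / steps_per_epoch(num_samples, batch_size, num_devices))


# Hypothetical fine-tuning set: 500 utterances, batch size 24, one GPU.
print(steps_per_epoch(500, 24))         # 21
print(epochs_for_steps(5000, 500, 24))  # 239
```

In this made-up case, 5000 steps correspond to roughly 239 passes over the data, which illustrates why a small dataset still needs many epochs.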

### TTS model export

You can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs!

#### Export to RIVA

Executing the cells below generates a `.riva` model file for each of the spectrogram generator and vocoder models that were trained in the preceding cells. Both models are required to build a complete text-to-speech pipeline.


#### Convert to Riva

Convert the trained models to `.riva` format. We will use the encryption key `tlt_encode`.

If you didn't manage to generate a `.nemo` file for either FastPitch or HiFiGAN (for example, if your session timed out before the script reached `trainer.max_epochs` or `model.max_steps`), you can create them from the `spec_model` and `vocoder` local model variables which you specified earlier. In that event, uncomment the following cell.

In [None]:
# spec_model.save_to(os.path.join(os.path.dirname(fastpitch_checkpoint), 'FastPitch.nemo'))
# vocoder.save_to(os.path.join(os.path.dirname(hifigan_checkpoint), 'HiFiGan.nemo'))

Specify the paths to your FastPitch and HiFiGAN `.nemo` models

In [None]:
fastpitch_nemo_file_path = FIXME
hifigan_nemo_file_path = FIXME

Generate the corresponding `.riva` file paths

In [None]:
RIVA_MODEL_DIR = os.path.join(RESULTS_DIR, "riva")
!mkdir -p $RIVA_MODEL_DIR

fastpitch_nemo_file_list = fastpitch_nemo_file_path.split('/')
fastpitch_nemo_file_name = fastpitch_nemo_file_list[-1]
fastpitch_riva_file_name = fastpitch_nemo_file_name[:-5] + ".riva"
fastpitch_riva_file_path = os.path.join(RIVA_MODEL_DIR, fastpitch_riva_file_name)

hifigan_nemo_file_list = hifigan_nemo_file_path.split('/')
hifigan_nemo_file_name = hifigan_nemo_file_list[-1]
hifigan_riva_file_name = hifigan_nemo_file_name[:-5] + ".riva"
hifigan_riva_file_path = os.path.join(RIVA_MODEL_DIR, hifigan_riva_file_name)
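
The slicing above assumes `/` as the path separator and a five-character `.nemo` extension. As an alternative, the same mapping can be sketched with `pathlib`, which handles separators for you and lets you check the extension explicitly. `riva_path_for` is a hypothetical helper name.

```python
from pathlib import Path


def riva_path_for(nemo_file_path, riva_model_dir):
    """Map <anything>/<name>.nemo to <riva_model_dir>/<name>.riva."""
    nemo_path = Path(nemo_file_path)
    if nemo_path.suffix != ".nemo":
        raise ValueError(f"Expected a .nemo file, got: {nemo_file_path}")
    return str(Path(riva_model_dir) / nemo_path.with_suffix(".riva").name)
```

For example, `riva_path_for(fastpitch_nemo_file_path, RIVA_MODEL_DIR)` yields the same value as `fastpitch_riva_file_path` above when the input path ends in `.nemo`.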

Install `nemo2riva` from the `.whl` file provided in the Riva Skills Quick Start resource folder, which you downloaded in the first tutorial in this lab. Alternatively, you can install it from PyPI by running:
```
! pip install nemo2riva
```

In [None]:
RIVA_DIR = os.path.abspath('riva_quickstart_v2.10.0')
! cd $RIVA_DIR && pip install nemo2riva*.whl

Export the `.nemo` files to `.riva`

In [None]:
! nemo2riva --out $fastpitch_riva_file_path --key=tlt_encode $fastpitch_nemo_file_path
! nemo2riva --out $hifigan_riva_file_path   --key=tlt_encode $hifigan_nemo_file_path

### What's Next?

Now that we've trained FastPitch and HiFiGAN, proceed to the [next tutorial](./4_spectrogen-vocoder-tao-deployment.ipynb) in this lab to learn how to deploy these models to NVIDIA Riva.