<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_tts_tts-python-basics/nvidia_logo.png" style="width: 90px; float: right;">

# How do I customize Riva TTS pronunciations?

This tutorial walks you through the basics of Riva/NeMo TTS pronunciation customization. 

## Grapheme-to-phoneme (G2P) Overview

Modern **text-to-speech** (TTS) models can learn pronunciations from raw text input and its corresponding audio data.
Sometimes, however, it is desirable to customize pronunciations, for example, for domain-specific terms. As a result, many TTS systems use grapheme and phonetic input during training to directly access and correct pronunciations at inference time.


[The International Phonetic Alphabet (IPA)](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) and [ARPABET](https://en.wikipedia.org/wiki/ARPABET) are the most common phonetic alphabets. Starting with the Riva 2.8.0 release, IPA is the only supported pronunciation alphabet for TTS models. Older Riva models support only ARPABET.

There are two ways to customize pronunciations in Riva:

1. Use SSML in the request. Request-time overrides are best suited for one-off adjustments. See [the SSML tutorial section on the phoneme tag](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/tts-python-basics-and-customization-with-ssml.html#customizing-pronunciation-with-the-phoneme-tag) for more details.
2. Configure Riva with the desired domain-specific terms when deploying the server.

Both methods require users to convert graphemes into phonemes (G2P). Below we are going to focus on the second approach.
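
Although the rest of this tutorial covers the second approach, a request-time override can be sketched in a few lines for comparison. The helper names `phoneme_tag` and `to_ssml` below are hypothetical (they are not part of Riva or NeMo), and the tag shape follows the standard SSML `<phoneme>` element with `alphabet="ipa"`; check the SSML tutorial linked above for the exact form your Riva version expects.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(word, ipa):
    # Hypothetical helper: wrap one word in an SSML <phoneme> element.
    # quoteattr() adds the surrounding quotes and escapes special characters.
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'


def to_ssml(text, overrides):
    # Split into alternating non-word / word pieces so punctuation is preserved.
    pieces = re.split(r"([A-Za-z']+)", text)
    out = []
    for piece in pieces:
        ipa = overrides.get(piece.lower())
        out.append(phoneme_tag(piece, ipa) if ipa else escape(piece))
    return "<speak>" + "".join(out) + "</speak>"


ssml = to_ssml(
    "paracetamol can help reduce fever.",
    {"paracetamol": "ˌpæɹəˈsitəmɔl"},
)
print(ssml)
```

The resulting string would be sent as the request text in place of the plain sentence. Because the override lives in the request, it has to be repeated for every request that contains the word, which is why the dictionary approach is usually preferable for recurring domain terms.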

#### For G2P purposes, words can be divided into the following groups:
* *known* words - words that are present in the model's phonetic dictionary
* *out-of-vocabulary (OOV)* words - words that are missing from the model's phonetic dictionary
* *[heteronyms](https://en.wikipedia.org/wiki/Heteronym_(linguistics))* - words with the same spelling but different pronunciations and/or meanings, e.g., *bass* (the fish) and *bass* (the musical instrument)

#### Important Riva flags:
* `--phone_dictionary_file`: path to a dictionary that maps words to their phonetic form, e.g., [ARPABET-based CMU Dictionary](https://github.com/NVIDIA/NeMo/blob/r1.14.0/scripts/tts_dataset_files/cmudict-0.7b_nv22.10) or [IPA-based CMU Dictionary](https://github.com/NVIDIA/NeMo/blob/r1.14.0/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv22.10.txt)
* `--preprocessor.g2p_ignore_ambiguous`: if set to **True**, words with more than one phonetic representation in the pronunciation dictionary are ignored, that is, left as graphemes. This flag is relevant to heteronyms and to non-heteronym words with multiple valid phonetic forms in the dictionary, for example, due to accent variations.

TTS models take a text in grapheme form, then convert all known unambiguous words into phonetic form during preprocessing. The rest of the words (OOV and words with multiple dictionary entries) are kept as graphemes, and the TTS model uses context clues from the sentence to predict an appropriate pronunciation for such words.
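
The preprocessing rule described above can be illustrated with a toy, model-free sketch. `TOY_DICT`, `classify`, and `toy_g2p` are hypothetical names, and the dictionary entries are made up for illustration rather than copied from the CMU dictionary files. Falling back to the first listed form when ambiguous words are not ignored is an assumption of this sketch.

```python
import re

# Toy dictionary (hypothetical entries): upper-case word -> list of phonetic forms
TOY_DICT = {
    "HELP": ["hɛlp"],
    "CAN": ["kæn", "kən"],  # two entries, so the word is ambiguous
}


def classify(word, phoneme_dict):
    # Group a word as known, ambiguous, or OOV based on its entry count.
    forms = phoneme_dict.get(word.upper(), [])
    if not forms:
        return "oov"
    return "known" if len(forms) == 1 else "ambiguous"


def toy_g2p(text, phoneme_dict, ignore_ambiguous=True):
    # Tokenize into words and punctuation runs; whitespace is dropped.
    out = []
    for token in re.findall(r"[A-Za-z']+|[^A-Za-z'\s]+", text):
        forms = phoneme_dict.get(token.upper(), [])
        if len(forms) == 1 or (forms and not ignore_ambiguous):
            out.append(forms[0])  # assumption: take the first listed form
        else:
            out.append(token)  # OOV or ignored ambiguous word stays as graphemes
    return out


print(toy_g2p("paracetamol can help.", TOY_DICT))
print(toy_g2p("paracetamol can help.", TOY_DICT, ignore_ambiguous=False))
```

In the first call only *help* is converted, because *paracetamol* is OOV and *can* has two entries. This mirrors why a single, unambiguous dictionary entry is what guarantees a specific pronunciation.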

To ensure the desired pronunciation, we need to add a new entry to the dictionary passed via `--phone_dictionary_file`. If the target word is already in the dictionary, we also need to remove its default pronunciation(s) so that only the target pronunciation is present.
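
A minimal sketch of that "remove, then add" step, operating on dictionary lines held in memory. The function name `override_entry` is hypothetical. It assumes the CMU-style layout in which alternate pronunciations are marked with a numeric suffix such as `WORD(1)`, and it writes entries with a two-space separator.

```python
import re

_ALT_SUFFIX = re.compile(r"\(\d+\)$")  # matches the "(1)" in "WORD(1)"


def override_entry(lines, word, pronunciation):
    """Return dictionary lines with every entry for `word` replaced by one new entry."""
    target = word.upper()
    kept = []
    for line in lines:
        parts = line.split(maxsplit=1)
        # Strip the alternate-pronunciation suffix before comparing head words
        head = _ALT_SUFFIX.sub("", parts[0]) if parts else ""
        if head != target:
            kept.append(line.rstrip("\n"))
    # Append the single pronunciation we want the model to use
    kept.append(f"{target}  {pronunciation}")
    return kept


# Made-up sample lines for illustration only
sample = [";;; comment", "CAN  kæn", "CAN(1)  kən", "HELP  hɛlp"]
for entry in override_entry(sample, "can", "kæn"):
    print(entry)
```

After the call, *CAN* has exactly one entry, so it is no longer treated as ambiguous. For a word that is not yet in the dictionary, the same function simply appends the new entry.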

## Dictionary customization

Below we show how to customize the phonetic dictionary for NeMo/Riva models.

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.
Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""

BRANCH = 'main'
# # If you're using Google Colab and not running locally, uncomment and run this cell.
# !apt-get install sox libsndfile1 ffmpeg
# !pip install wget text-unidecode pynini==2.1.4
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
# !wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/nemo_text_processing/install_pynini.sh
# !bash install_pynini.sh

In [None]:
import os
import nemo.collections.tts as nemo_tts
from nemo_text_processing.g2p.modules import IPAG2P
import soundfile as sf
import IPython.display as ipd
import torch

# Load mel spectrogram generator
spec_generator = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch_ipa")
# to use dictionary entries for known words
spec_generator.vocab.phoneme_probability = 1
spec_generator.vocab.g2p.phoneme_probability = 1

# Load vocoder
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_hifigan")


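def generate_audio(input_text):
    """Synthesize `input_text` and play the result inline in the notebook."""
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        parsed = spec_generator.parse(input_text)
        spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
        audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    # 22050 Hz is the sample rate used in this tutorial
    ipd.display(ipd.Audio(audio.detach().to('cpu').numpy(), rate=22050))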

In [None]:
text = "paracetamol can help reduce fever."
generate_audio(text)

During preprocessing, unambiguous dictionary words are converted to phonemes, while OOV words and words with multiple entries are kept as graphemes. For example, **paracetamol** is missing from the phonetic dictionary, and **can** has two forms.

In [None]:
print(f"Input before tokenization: |{' '.join(spec_generator.vocab.g2p(text))}|\n")
for word in ["paracetamol", "can"]:
    word = word.upper()
    # .get() avoids a KeyError for OOV words if the dictionary is a plain dict
    phoneme_forms = spec_generator.vocab.g2p.phoneme_dict.get(word, [])
    print(f"Number of phoneme forms for '{word}': {len(phoneme_forms)} -- {phoneme_forms}")

Let's add a new entry to the dictionary for the word **paracetamol**. 

In [None]:
# we download IPA-based CMU Dictionary and add a custom entry for the target word
ipa_cmu_dict = "ipa_cmudict-0.7b_nv22.10.txt"
if os.path.exists(ipa_cmu_dict):
    ! rm $ipa_cmu_dict

! wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/$ipa_cmu_dict

new_pronunciation = "ˌpæɹəˈsitəmɔl"

with open(ipa_cmu_dict, "a") as f:
    f.write(f"PARACETAMOL  {new_pronunciation}\n")
        
! tail $ipa_cmu_dict

In [None]:
# let's now use our updated dictionary as the model's phonetic dictionary
from collections import defaultdict
import re

phoneme_dict_obj = defaultdict(list)
_alt_re = re.compile(r"\([0-9]+\)")
with open(ipa_cmu_dict, "r") as fdict:
    for line in fdict:
        if len(line) and ('A' <= line[0] <= 'Z' or line[0] == "'"):
            parts = line.strip().split(maxsplit=1)
            word = re.sub(_alt_re, "", parts[0])
            prons = re.sub(r"\s+", "", parts[1])
            phoneme_dict_obj[word].append(list(prons))

spec_generator.vocab.g2p.phoneme_dict = phoneme_dict_obj

**Paracetamol** is no longer an OOV word, and the model uses the phonetic form we provided:

In [None]:
" ".join(spec_generator.vocab.g2p(text))

Finally, let's use the new phoneme dictionary for synthesis.

In [None]:
generate_audio(text)

To summarize, you can customize the TTS model's output by altering entries in the phonetic dictionary and passing that dictionary to Riva with the `--phone_dictionary_file` flag.


## Resources
* [Riva TTS documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html)
* [TTS pipeline customization](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-custom.html#tts-pipeline-configuration)
* [Overview of TTS in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb)
* [G2P models in NeMo](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/text_processing/g2p/g2p.html)