<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_tts_tts-python-advanced-customization-with-ssml/nvidia_logo.png" style="width: 90px; float: right;">

# How do I customize Riva TTS audio output with SSML?

This tutorial walks you through some of the advanced features for customizing Riva TTS audio output with Speech Synthesis Markup Language (SSML).

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding services such as:

- Automated speech recognition (ASR)
- Text-to-Speech synthesis (TTS)
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will customize Riva TTS audio output with SSML. <br> 
To understand the basics of Riva TTS APIs, refer to [How do I use Riva TTS APIs with out-of-the-box models?](https://github.com/nvidia-riva/tutorials/blob/stable/tts-python-basics.ipynb). <br>

For more information about Riva, refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## Customizing Riva TTS audio output with SSML

Speech Synthesis Markup Language (SSML) is a markup language for directing the performance of the virtual speaker. Riva supports portions of SSML, allowing you to adjust the pitch, rate, and pronunciation of the generated audio.  
SSML support is currently available only for the FastPitch model. The FastPitch model must be exported using NeMo>=1.5.1 and the nemo2riva>=1.8.0 tool.

All SSML inputs must be valid XML documents that use the `<speak>` root tag. Invalid XML, and valid XML with a different root tag, is treated as raw input text.
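
Because input that fails this rule is silently treated as raw text, it can help to check a string on the client before sending it. The following is a minimal sketch using only the Python standard library. The helper name `is_ssml` is hypothetical, and the check only approximates the rule stated above; it is not Riva's own parser.

```python
import xml.etree.ElementTree as ET


def is_ssml(text: str) -> bool:
    """Return True if `text` parses as XML and its root element is <speak>.

    Hypothetical client-side check; the server may apply stricter rules.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML, so it would be handled as raw text.
        return False
    return root.tag == "speak"


print(is_ssml("<speak>Hello</speak>"))  # well-formed, <speak> root
print(is_ssml("<p>Hello</p>"))          # well-formed, different root
print(is_ssml("<speak>Hello"))          # not well-formed
```

A check like this makes it easier to notice when markup is being read aloud as literal text because of a missing closing tag.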

Riva TTS supports two SSML tags:  

- The ``prosody`` tag, which supports two attributes, ``rate`` and ``pitch``, that control the rate and pitch of the generated audio.  

- The ``phoneme`` tag, which allows us to control the pronunciation of the generated audio.

Let's look at customization of Riva TTS with these SSML tags in some detail.

#### Requirements and setup

1. Start the Riva Speech Skills server.  
Follow the instructions in the [Riva Quick Start Guide](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html#) to deploy OOTB TTS models on the Riva Speech Skills server before running this tutorial.  


2. Install the Riva Client library.  
Follow the steps in the [Requirements and setup for the Riva Client](https://github.com/nvidia-riva/tutorials#running-the-riva-client) to install the Riva Client library.


3. Install the additional Python libraries to run this tutorial.  
Run the following commands to install the libraries:

In [None]:
# We need numpy to read the output from Riva TTS request
!pip install numpy

#### Import Riva Client Libraries

Let's first import some required libraries, including the Riva Client libraries:

In [1]:
import numpy as np
import IPython.display as ipd
import grpc

import riva_api.riva_tts_pb2 as rtts
import riva_api.riva_tts_pb2_grpc as rtts_srv
import riva_api.riva_audio_pb2 as ra

#### Create Riva Clients and connect to the Riva Speech API server

The below URI assumes a local deployment of the Riva Speech API server is on the default port. In case the server deployment is on a different host or via Helm chart on Kubernetes, use an appropriate URI.

In [2]:
channel = grpc.insecure_channel('localhost:50051')

riva_tts = rtts_srv.RivaSpeechSynthesisStub(channel)

### Customizing rate and pitch with the `prosody` tag

#### Pitch Attribute
Riva supports an additive relative change to the pitch. The `pitch` attribute has a range of [-3, 3]. Values outside this range result in an error being logged and no audio being returned. The resulting pitch shift is the attribute value multiplied by the speaker's pitch standard deviation, as measured when the FastPitch model was trained. For the pretrained checkpoint that was trained on LJSpeech, the standard deviation is 52.185 Hz. For example, a `pitch` value of 1.25 shifts the pitch up by 1.25 × 52.185 ≈ 65.23 Hz. 
Riva also supports the named prosody values from the SSML specification: `x-low`, `low`, `medium`, `high`, `x-high`, and `default`.

The `pitch` attribute is expressed in the following formats:
- `pitch="1"`
- `pitch="+1.8"`
- `pitch="-0.65"`
- `pitch="high"`
- `pitch="default"`

For the pretrained Female-1 checkpoint, the standard deviation is 53.33 Hz.
For the pretrained Male-1 checkpoint, the standard deviation is 47.15 Hz.

The `pitch` attribute does not support `Hz`, `st`, and `%` changes. Support is planned for a future Riva release.
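
The pitch arithmetic described above can be sketched in a few lines. The function name `pitch_shift_hz` is hypothetical; the range check mirrors the documented [-3, 3] limit, and the standard deviations are the values quoted in this section.

```python
def pitch_shift_hz(pitch: float, speaker_std_hz: float) -> float:
    """Approximate pitch shift in Hz for a numeric `pitch` attribute value.

    Illustrative helper: multiplies the attribute value by the speaker's
    pitch standard deviation, as described in the text above.
    """
    if not -3.0 <= pitch <= 3.0:
        raise ValueError("pitch must be within [-3, 3]")
    return pitch * speaker_std_hz


# Standard deviations quoted in this tutorial (Hz).
LJSPEECH_STD_HZ = 52.185
FEMALE_1_STD_HZ = 53.33

print(round(pitch_shift_hz(1.25, LJSPEECH_STD_HZ), 2))
print(round(pitch_shift_hz(-0.5, FEMALE_1_STD_HZ), 2))
```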

#### Rate Attribute
Riva supports a percentage relative change to the rate. The `rate` attribute has a range of [25%, 250%]. Values outside this range result in an error being logged and no audio being returned. 
Riva also supports the named prosody values from the SSML specification: `x-low`, `low`, `medium`, `high`, `x-high`, and `default`.

The `rate` attribute is expressed in the following formats:
- `rate="35%"`
- `rate="+200%"`
- `rate="low"`
- `rate="default"`
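
Since out-of-range values return no audio, one option is to build the markup through small helpers that validate the numeric ranges first. This is a sketch under stated assumptions: `prosody` and `speak` are hypothetical helper names, they handle only numeric values (named values such as `high` would need to be written by hand), and the limits are the ones documented above.

```python
from xml.sax.saxutils import escape


def prosody(text, pitch=None, rate_percent=None):
    """Wrap `text` in a <prosody> tag after checking the documented ranges."""
    attrs = []
    if pitch is not None:
        if not -3.0 <= pitch <= 3.0:
            raise ValueError("pitch must be within [-3, 3]")
        attrs.append(f'pitch="{pitch:g}"')
    if rate_percent is not None:
        if not 25 <= rate_percent <= 250:
            raise ValueError("rate must be within [25%, 250%]")
        attrs.append(f'rate="{rate_percent:g}%"')
    body = escape(text)  # escape &, <, > so the result stays valid XML
    if not attrs:
        return body
    joined = " ".join(attrs)
    return f"<prosody {joined}>{body}</prosody>"


def speak(*fragments):
    """Join already-built SSML fragments under the required <speak> root."""
    return "<speak>" + "".join(fragments) + "</speak>"


print(speak(prosody("Today is a sunny day", pitch=2.5)))
print(speak(prosody("This is a fast sentence", rate_percent=200)))
```

Escaping the text matters because an unescaped `&` or `<` would make the document invalid XML, and the whole string would then be treated as raw input text.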

Let's look at an example showing these pitch and rate customizations for Riva TTS:

In [3]:
# Setting up Riva TTS request with SynthesizeSpeechRequest
req = rtts.SynthesizeSpeechRequest()
req.language_code = "en-US"
req.encoding = ra.AudioEncoding.LINEAR_PCM 
req.sample_rate_hz = 44100
req.voice_name = "English-US-Female-1"

"""
    Raw text is "Today is a sunny day. But it might rain tomorrow."
    We are updating this raw text with SSML:
    1. Envelope raw text in '<speak>' tags as is required for SSML
    2. Add '<prosody>' tag with 'pitch' attribute set to '2.5'
    3. Add '<prosody>' tag with 'rate' attribute set to 'high'
"""
raw_text = "Today is a sunny day. But it might rain tomorrow."
ssml_text = """<speak><prosody pitch='2.5'>Today is a sunny day</prosody>. <prosody rate='high'>But it might rain tomorrow.</prosody></speak>"""
print("Raw Text: ", raw_text)
print("SSML Text: ", ssml_text)

# Set the SSML Text as the text input for Riva TTS request
req.text = ssml_text

# Request to Riva TTS to synthesize audio
resp = riva_tts.Synthesize(req)

# Playing the generated audio from Riva TTS request
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.display(ipd.Audio(audio_samples, rate=req.sample_rate_hz))

Raw Text:  Today is a sunny day. But it might rain tomorrow.
SSML Text:  <speak><prosody pitch='2.5'>Today is a sunny day</prosody>. <prosody rate='high'>But it might rain tomorrow.</prosody></speak>
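
To keep the synthesized audio instead of only playing it inline, the raw 16-bit PCM bytes can be wrapped in a WAV container with the standard-library `wave` module. This is a minimal sketch: the helper name `pcm16_to_wav_bytes` is hypothetical, and it assumes mono output, which is an assumption rather than something stated in this tutorial.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm: bytes, sample_rate_hz: int, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container and return the bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)        # assumed mono by default
        wav.setsampwidth(2)               # 2 bytes per sample = int16
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)
    return buf.getvalue()


# Demonstration with 100 silent samples standing in for `resp.audio`.
wav_bytes = pcm16_to_wav_bytes(b"\x00\x00" * 100, 44100)
print(len(wav_bytes))
```

With a real response, passing `resp.audio` and `req.sample_rate_hz` to this helper and writing the result to a file opened in binary mode would produce a playable `.wav` file.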


Here are more examples showing the effects of changes in the `pitch` and `rate` attribute values on the generated audio:

In [4]:
# Setting up Riva TTS request with SynthesizeSpeechRequest
req = rtts.SynthesizeSpeechRequest()
req.language_code = "en-US"
req.encoding = ra.AudioEncoding.LINEAR_PCM 
req.sample_rate_hz = 44100
req.voice_name = "English-US-Female-1"

# SSML texts we want to try
ssml_texts = [
  """<speak>This is a normal sentence</speak>""",
  """<speak><prosody pitch="0." rate="100%">This is also a normal sentence</prosody></speak>""",
  """<speak><prosody rate="200%">This is a fast sentence</prosody></speak>""",
  """<speak><prosody rate="60%">This is a slow sentence</prosody></speak>""",
  """<speak><prosody pitch="+1.0">Now, I'm speaking a bit higher</prosody></speak>""",
  """<speak><prosody pitch="-0.5">And now, I'm speaking a bit lower</prosody></speak>""",
  """<speak>S S M L supports <prosody pitch="-1">nested tags. So I can speak <prosody rate="150%">faster</prosody>, <prosody rate="75%">or slower</prosody>, as desired.</prosody></speak>""",
]

# Loop through the 'ssml_texts' list and synthesize audio with Riva TTS for each entry
for ssml_text in ssml_texts:
    req.text = ssml_text
    resp = riva_tts.Synthesize(req)
    audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
    print("SSML Text: ", ssml_text)
    ipd.display(ipd.Audio(audio_samples, rate=req.sample_rate_hz))
    print("--------------------------------------------")

SSML Text:  <speak>This is a normal sentence</speak>


--------------------------------------------
SSML Text:  <speak><prosody pitch="0." rate="100%">This is also a normal sentence</prosody></speak>


--------------------------------------------
SSML Text:  <speak><prosody rate="200%">This is a fast sentence</prosody></speak>


--------------------------------------------
SSML Text:  <speak><prosody rate="60%">This is a slow sentence</prosody></speak>


--------------------------------------------
SSML Text:  <speak><prosody pitch="+1.0">Now, I'm speaking a bit higher</prosody></speak>


--------------------------------------------
SSML Text:  <speak><prosody pitch="-0.5">And now, I'm speaking a bit lower</prosody></speak>


--------------------------------------------
SSML Text:  <speak>S S M L supports <prosody pitch="-1">nested tags. So I can speak <prosody rate="150%">faster</prosody>, <prosody rate="75%">or slower</prosody>, as desired.</prosody></speak>


--------------------------------------------


### Customizing Pronunciation with the `phoneme` Tag

We can use the `phoneme` tag to override the model's predicted pronunciation of words. For a given word or sequence of words, use the `ph` attribute to provide an explicit pronunciation, and the `alphabet` attribute to provide the phone set.  
Currently, only `x-arpabet` is supported, for pronunciation dictionaries based on CMUdict. IPA support will be added soon.

The full list of phonemes in the CMUdict can be found at [cmudict.phones](https://github.com/cmusphinx/cmudict/blob/master/cmudict.phones). The list of supported symbols with stress can be found at [cmudict.symbols](https://github.com/cmusphinx/cmudict/blob/master/cmudict.symbols). For a mapping of these phones to English sounds, refer to the [ARPABET Wikipedia page](https://en.wikipedia.org/wiki/ARPABET).
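
Writing the `ph` attribute by hand is error-prone, so a small helper can assemble it from a list of ARPABET symbols. This is a sketch only: the helper name `phoneme` is hypothetical, the `{@SYMBOL}` notation is copied from the examples in this tutorial, and the regular expression is a loose sanity check (uppercase letters with an optional stress digit), not a lookup against the CMUdict symbol list.

```python
import re
from xml.sax.saxutils import escape

# Loose shape check: uppercase letters plus an optional stress digit 0-2.
_ARPABET_SHAPE = re.compile(r"[A-Z]+[0-2]?")


def phoneme(word, symbols):
    """Build a <phoneme> tag for `word` from a list of ARPABET symbols."""
    for symbol in symbols:
        if not _ARPABET_SHAPE.fullmatch(symbol):
            raise ValueError(f"unexpected ARPABET symbol: {symbol!r}")
    ph = "".join("{@" + symbol + "}" for symbol in symbols)
    return f'<phoneme alphabet="x-arpabet" ph="{ph}">{escape(word)}</phoneme>'


first = phoneme("tomato", ["T", "AH0", "M", "EY1", "T", "OW2"])
second = phoneme("tomato", ["T", "AH0", "M", "AA1", "T", "OW2"])
print(f"<speak>You say {first}, I say {second}.</speak>")
```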

Let's look at an example showing this custom pronunciation for Riva TTS:

In [5]:
# Setting up Riva TTS request with SynthesizeSpeechRequest
req = rtts.SynthesizeSpeechRequest()
req.language_code = "en-US"
req.encoding = ra.AudioEncoding.LINEAR_PCM 
req.sample_rate_hz = 44100
req.voice_name = "English-US-Female-1"

"""
    Raw text is "You say tomato, I say tomato."
    We are updating this raw text with SSML:
    1. Envelope raw text in '<speak>' tags as is required for SSML
    2. For a substring in the raw text, add '<phoneme>' tags with 'alphabet' attribute set to 'x-arpabet' 
       (currently the only supported value) and 'ph' attribute set to a custom pronunciation based on CMUdict and ARPABET

"""
raw_text = "You say tomato, I say tomato."
ssml_text = '<speak>You say <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@EY1}{@T}{@OW2}">tomato</phoneme>, I say <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@AA1}{@T}{@OW2}">tomato</phoneme>.</speak>'

print("Raw Text: ", raw_text)
print("SSML Text: ", ssml_text)

# Set the SSML Text as the text input for Riva TTS request
req.text = ssml_text

# Request to Riva TTS to synthesize audio
resp = riva_tts.Synthesize(req)

# Playing the generated audio from Riva TTS request
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.display(ipd.Audio(audio_samples, rate=req.sample_rate_hz))

Raw Text:  You say tomato, I say tomato.
SSML Text:  <speak>You say <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@EY1}{@T}{@OW2}">tomato</phoneme>, I say <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@AA1}{@T}{@OW2}">tomato</phoneme>.</speak>


Here are more examples showing the customization of pronunciation in generated audio:

In [6]:
# Setting up Riva TTS request with SynthesizeSpeechRequest
req = rtts.SynthesizeSpeechRequest()
req.language_code = "en-US"
req.encoding = ra.AudioEncoding.LINEAR_PCM 
req.sample_rate_hz = 44100
req.voice_name = "English-US-Female-1"

# SSML texts we want to try
ssml_texts = [
  """<speak>Is it <phoneme alphabet="x-arpabet" ph="{@S}{@K}{@EH1}{@JH}{@UH0}{@L}">schedule</phoneme> or <phoneme alphabet="x-arpabet" ph="{@SH}{@EH1}{@JH}{@UW0}{@L}">schedule</phoneme>?</speak>""",
  """<speak>You say <phoneme alphabet="x-arpabet" ph="{@D}{@EY1}{@T}{@AH0}">data</phoneme>, I say <phoneme alphabet="x-arpabet" ph="{@D}{@AE1}{@T}{@AH0}">data</phoneme>.</speak>""",
  """<speak>Some people say <phoneme alphabet="x-arpabet" ph="{@R}{@UW1}{@T}">route</phoneme> and some say <phoneme alphabet="x-arpabet" ph="{@R}{@AW1}{@T}">route</phoneme>.</speak>""",
]

# Loop through the 'ssml_texts' list and synthesize audio with Riva TTS for each entry
for ssml_text in ssml_texts:
    req.text = ssml_text
    resp = riva_tts.Synthesize(req)
    audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
    print("SSML Text: ", ssml_text)
    ipd.display(ipd.Audio(audio_samples, rate=req.sample_rate_hz))
    print("--------------------------------------------")

SSML Text:  <speak>Is it <phoneme alphabet="x-arpabet" ph="{@S}{@K}{@EH1}{@JH}{@UH0}{@L}">schedule</phoneme> or <phoneme alphabet="x-arpabet" ph="{@SH}{@EH1}{@JH}{@UW0}{@L}">schedule</phoneme>?</speak>


--------------------------------------------
SSML Text:  <speak>You say <phoneme alphabet="x-arpabet" ph="{@D}{@EY1}{@T}{@AH0}">data</phoneme>, I say <phoneme alphabet="x-arpabet" ph="{@D}{@AE1}{@T}{@AH0}">data</phoneme>.</speak>


--------------------------------------------
SSML Text:  <speak>Some people say <phoneme alphabet="x-arpabet" ph="{@R}{@UW1}{@T}">route</phoneme> and some say <phoneme alphabet="x-arpabet" ph="{@R}{@AW1}{@T}">route</phoneme>.</speak>


--------------------------------------------


For more information about customizing Riva TTS with SSML, refer to the [Riva TTS SSML documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-ssml.html#). 

## Go Deeper into Riva Capabilities

### Additional Riva Tutorials

Check out more Riva TTS (and ASR) tutorials [here](https://github.com/nvidia-riva/tutorials). These tutorials provide a deeper understanding of the advanced features of Riva TTS, including customizing TTS for your specific needs.

### Sample Applications

Riva comes with various sample applications. They demonstrate how to use the APIs to build applications such as a [chatbot](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/weather.html), a domain-specific speech recognition or [keyword (entity) recognition system](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/callcenter.html), or simply how Riva scales out to handle massive numbers of simultaneous requests (refer to [SpeechSquad](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/speechsquad.html)).  
Refer to the *Sample Application* section in the [Riva developer documentation](https://developer.nvidia.com/) for more information.


### Riva Automated Speech Recognition (ASR)

Riva's ASR offering comes with OOTB pipelines for English, German, Spanish, Russian, and Mandarin. It can be used in streaming or batch inference modes and easily deployed using the [Riva Quick Start scripts](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html). Follow [this link](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-overview.html) to better understand Riva's ASR capabilities. Explore how to use Riva ASR APIs with the OOTB models with [this Riva ASR tutorial](https://github.com/nvidia-riva/tutorials/blob/stable/asr-python-basics.ipynb).


### Additional Resources

For more information about each of the APIs and their functionalities, refer to the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/protobuf-api/protobuf-api-root.html).