<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# How do I use Riva TTS APIs with out-of-the-box models?

This tutorial walks you through the basics of the Riva Speech Skills TTS services, specifically how to use the Riva TTS APIs with out-of-the-box models.

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech services such as:

- Automated speech recognition (ASR)
- Text-to-Speech synthesis (TTS)

In this tutorial, we will interact with the text-to-speech synthesis (TTS) APIs.

For more information about Riva, please refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## Speech generation with Riva TTS APIs

The Riva TTS service is based on a two-stage pipeline: the first model generates a mel spectrogram from the input text, and the second model generates speech from that spectrogram. Together they form a text-to-speech system that synthesizes natural-sounding speech from raw transcripts, without any additional information such as patterns or rhythms of speech.

Riva provides two state-of-the-art voices for English (one male and one female) that can easily be deployed with the Riva Quick Start scripts. Riva also supports customizing TTS in various ways to meet your specific needs.

Subsequent Riva releases will add features such as model registration, to support multiple languages and voices with the same API, and resampling to alternative sampling rates.

Refer to the [Riva TTS documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html) for more information.  

Now, let's generate audio using Riva APIs with an OOTB (out-of-the-box) English TTS pipeline.

#### Import Riva client libraries

We first import the required libraries, including the Riva client libraries.

In [None]:
import numpy as np
import IPython.display as ipd
import grpc

import riva.client

#### Create Riva clients and connect to Riva Speech API server

The URI below assumes a local deployment of the Riva Speech API server on the default port. If the server is deployed on a different host, or through a Helm chart on Kubernetes, use the appropriate URI instead.

In [None]:
auth = riva.client.Auth(uri='localhost:50051')

riva_tts = riva.client.SpeechSynthesisService(auth)

### Batch mode TTS

Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated, which can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, which significantly reduces latency (measured as time to first audio) for large requests. <br> 
Let's take a look at an example showing batch mode TTS API usage:

#### Make a gRPC request to the Riva Speech API server

Now let us make a TTS gRPC request to the Riva Speech server in batch inference mode.

In [None]:
sample_rate_hz = 44100
resp = riva_tts.synthesize(
    text="Is it recognize speech or wreck a nice beach?",
    language_code="en-US",
    encoding=riva.client.AudioEncoding.LINEAR_PCM,  # Currently only LINEAR_PCM is supported
    sample_rate_hz=sample_rate_hz,                  # Generate 44.1 kHz audio
    voice_name="English-US-Female-1",               # Name of the voice to use
)

audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.Audio(audio_samples, rate=sample_rate_hz)
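
For comparison with the batch request above, the following is a minimal sketch of how streamed audio chunks could be consumed. It does not contact a Riva server: `fake_stream` is a hypothetical stand-in that yields objects with an `audio` attribute holding raw 16-bit PCM bytes, and `collect_stream` is an illustrative helper, not part of the Riva client. With a running server, the responses would presumably come from the client's streaming call (`riva_tts.synthesize_online(...)` in recent versions of the Python client; check your installed version).

```python
import struct
import time
from types import SimpleNamespace


def fake_stream(n_chunks=3, samples_per_chunk=4):
    """Hypothetical stand-in for a streaming TTS call (assumption, not a Riva API).

    Yields objects whose `audio` attribute holds little-endian int16 PCM bytes.
    """
    for i in range(n_chunks):
        samples = [i] * samples_per_chunk
        yield SimpleNamespace(audio=struct.pack(f"<{samples_per_chunk}h", *samples))


def collect_stream(responses):
    """Join streamed chunks and record the time to first audio, in seconds."""
    start = time.perf_counter()
    first_audio_s = None
    chunks = []
    for resp in responses:
        if first_audio_s is None:
            # The first chunk arrives before the full utterance is generated.
            first_audio_s = time.perf_counter() - start
        chunks.append(resp.audio)
    return b"".join(chunks), first_audio_s


streamed_audio, first_audio_s = collect_stream(fake_stream())
```

The joined bytes can be decoded in the same way as the batch response, with `np.frombuffer(streamed_audio, dtype=np.int16)`. In an interactive application you would typically play each chunk as it arrives rather than wait for the whole stream.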

### Understanding TTS API parameters

Riva TTS supports a number of options while making a text-to-speech request to the gRPC endpoint, as shown above. Let's learn more about these parameters:
- ``language_code`` - Language of the generated audio. ``"en-US"`` represents English (US) and is currently the only language supported OOTB.
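- ``encoding`` - Type of audio encoding to generate. The API defines ``LINEAR_PCM``, ``FLAC``, ``MULAW`` and ``ALAW``, but as noted in the example above, only ``LINEAR_PCM`` is currently supported for TTS.
- ``sample_rate_hz`` - Sample rate of the generated audio, in hertz. Typical values are ``22050`` (22.05 kHz) and ``44100`` (44.1 kHz).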
- ``voice_name`` - Voice used to synthesize the audio. Currently, Riva offers two OOTB voices (``English-US-Female-1``, ``English-US-Male-1``).
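
To illustrate how ``encoding`` and ``sample_rate_hz`` determine the shape of the returned audio, here is a small sketch that wraps raw ``LINEAR_PCM`` bytes in a WAV container and computes their duration, using only the Python standard library. It assumes 16-bit mono samples, which matches the `dtype=np.int16` decoding used above. The helper names `pcm16_to_wav_bytes` and `duration_seconds` are hypothetical, and the silent buffer stands in for `resp.audio`.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm_bytes, sample_rate_hz, n_channels=1):
    """Wrap raw 16-bit PCM bytes in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(n_channels)      # assumption: mono output
        wav.setsampwidth(2)               # 2 bytes per sample (16-bit PCM)
        wav.setframerate(sample_rate_hz)  # must match the requested sample_rate_hz
        wav.writeframes(pcm_bytes)
    return buf.getvalue()


def duration_seconds(pcm_bytes, sample_rate_hz, n_channels=1):
    """Duration of 16-bit PCM audio: bytes / (2 bytes * channels * sample rate)."""
    return len(pcm_bytes) / (2 * n_channels * sample_rate_hz)


# One second of 16-bit mono silence at 44.1 kHz, standing in for resp.audio.
silence = bytes(2 * 44100)
wav_bytes = pcm16_to_wav_bytes(silence, 44100)
```

To save a real response to disk, you could pass `resp.audio` and the same `sample_rate_hz` used in the request, then write the returned bytes to a `.wav` file. If the sample rate in the header does not match the request, playback will sound too fast or too slow.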

This concludes the introduction to Riva's offline Python client for TTS. To learn more, see the [Riva TTS proto documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/riva_tts.proto.html?highlight=proto).