<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# How do I use Riva ASR APIs with out-of-the-box models?

This tutorial walks you through the basics of Riva Speech Skills ASR Services, specifically covering how to use Riva ASR APIs with out-of-the-box models.

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding services such as:

- Automated speech recognition (ASR)
- Text-to-Speech synthesis (TTS)
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will interact with the automated speech recognition (ASR) APIs.

For more information about Riva, refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## Transcription with Riva ASR APIs

ASR takes an audio stream or audio buffer as input and returns one or more text transcripts, along with additional optional metadata. Speech recognition in Riva is a GPU-accelerated compute pipeline, with optimized performance and accuracy.  
Riva provides state-of-the-art out-of-the-box (OOTB) models and pipelines for multiple languages, such as English, Spanish, German, Russian, and Mandarin, that can be easily deployed with the Riva Quick Start scripts. Riva also supports customizing the ASR pipeline in various ways to meet your specific needs.  
Refer to the [Riva ASR documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-overview.html) for more information.  

Now, let's use the Riva APIs with an OOTB pipeline to generate transcripts for some sample audio clips, starting with English.

<a id='updated_reqs_and_setup_for_EngASR'></a>
#### Requirements and setup

1. Start the Riva Speech Skills server.  
Follow the instructions in the [Riva Quick Start Guide](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html#) to deploy OOTB ASR models on the Riva Speech Skills server before running this tutorial. By default, only the English models are deployed.  


2. Install the Riva Client library.   
Follow the steps in the [Requirements and setup for the Riva Client](https://github.com/nvidia-riva/tutorials#riva-client) to install the Riva Client library.


3. Install the additional Python libraries to run this tutorial.  
Run the following commands to install the libraries:

In [1]:
# librosa is a Python package for music and audio analysis. We use it here to load and play the sample audio clips.
!pip install librosa==0.9.1
# libpq-dev and libsndfile-dev are needed for librosa
!apt-get update && apt-get upgrade -y && apt-get install -y apt-utils gcc libpq-dev libsndfile-dev

# Riva exposes gRPC APIs. We need grpcio and grpcio-tools installed to make gRPC requests.
!pip install grpcio==1.44.0
!pip install grpcio-tools==1.44.0


#### Import the Riva client libraries

Let's import some of the required libraries, including the Riva Client libraries.

In [2]:
import io
import librosa
import IPython.display as ipd
import grpc

import riva_api.riva_asr_pb2 as rasr
import riva_api.riva_asr_pb2_grpc as rasr_srv
import riva_api.riva_audio_pb2 as ra

#### Create a Riva client and connect to the Riva Speech API server

The following URI assumes that the Riva Speech API server is deployed locally and listens on the default port. If the server is deployed on a different host or through a Helm chart on Kubernetes, use the appropriate URI.

In [3]:
channel = grpc.insecure_channel('localhost:50051')

riva_asr = rasr_srv.RivaSpeechRecognitionStub(channel)
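
Creating the channel does not by itself confirm that anything is listening at the URI; a problem would typically surface only on the first request. As a rough pre-flight check, the sketch below tests whether a TCP connection to the host and port can be opened, using only the Python standard library. The helper name `is_port_open` is hypothetical and is not part of the Riva client. A successful check shows only that the port accepts connections, not that the Riva models have finished loading.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# 'localhost' and 50051 are the host and port used in this tutorial.
print("Riva port reachable:", is_port_open("localhost", 50051))
```

If this prints `False`, check that the Riva Speech Skills server was started and that the host and port match your deployment.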

### Offline recognition for English

You can use Riva ASR in either streaming mode or offline mode. In streaming mode, a continuous stream of audio is captured and recognized, producing a stream of transcribed text. In offline mode, an audio clip of a set length is transcribed to text. <br> 
Let's look at an example showing offline ASR API usage for English:

#### Make a gRPC request to the Riva Speech API server
The Riva ASR API supports single-channel audio: `.wav` files in pulse-code modulation (PCM) format, as well as the `.alaw`, `.mulaw`, and `.flac` formats.

Now, let's make a gRPC request to the Riva Speech server for ASR with a sample `.wav` file in offline mode. Start by loading the audio.

In [4]:
# This example uses a .wav file with LINEAR_PCM encoding.
# read in an audio file from local disk
path = "./audio_samples/en-US_sample.wav"
audio, sr = librosa.core.load(path, sr=None)
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)
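
Because the request declares `LINEAR_PCM` encoding and a single channel, it can help to confirm that the clip really matches before sending it. The following is a minimal sketch using the standard-library `wave` module; the helper name `describe_wav` is hypothetical. Note that `wave` reads only uncompressed PCM `.wav` files, so this check does not apply to FLAC, A-law, or mu-law input.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic header fields from an uncompressed PCM .wav file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate_hertz": rate,
            "bits_per_sample": wf.getsampwidth() * 8,  # sample width is reported in bytes
            "duration_seconds": frames / rate,
        }
```

In the notebook, you could call `describe_wav(path)` on the sample clip and check that `channels` is `1` and that `sample_rate_hertz` agrees with the `sr` value returned by librosa.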

Next, create an audio `RecognizeRequest` object, setting the configuration parameters as required.

In [5]:
# Set up an offline/batch recognition request
req = rasr.RecognizeRequest()
req.audio = content                                   # raw bytes
req.config.encoding = ra.AudioEncoding.LINEAR_PCM     # Supports LINEAR_PCM, FLAC, MULAW and ALAW audio encodings
req.config.sample_rate_hertz = sr                     # Audio will be resampled if necessary
req.config.language_code = "en-US"                    # Language code of the audio clip
req.config.max_alternatives = 1                       # How many top-N hypotheses to return
req.config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
req.config.audio_channel_count = 1                    # Mono channel

Finally, submit the request to the server.

In [6]:
response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)

print("\n\nFull Response Message:")
print(response)

ASR Transcript: What is natural language processing? 


Full Response Message:
results {
  alternatives {
    transcript: "What is natural language processing? "
    confidence: 1.0
  }
  channel_tag: 1
  audio_processed: 4.1519999504089355
}
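
The cell above indexes `response.results[0]` directly, which would raise an `IndexError` if the response contained no results. The sketch below reads the response more defensively. It relies only on the `results`, `alternatives`, and `transcript` fields visible in the printed message. The helper name `collect_transcript` is hypothetical, and joining several results into one string is an assumption about how you want to present them.

```python
from types import SimpleNamespace


def collect_transcript(response) -> str:
    """Join the top alternative of every result; return '' if there are none."""
    pieces = []
    for result in response.results:
        if result.alternatives:
            pieces.append(result.alternatives[0].transcript.strip())
    return " ".join(pieces)


# Stand-in object shaped like the response printed above, so the sketch runs without a server.
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript="What is natural language processing? ", confidence=1.0)
    ])
])
print(collect_transcript(fake_response))
```

With a live server, you would pass the object returned by `riva_asr.Recognize(req)` instead of the stand-in.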



#### Understanding ASR API parameters

Riva ASR supports a number of options while making a transcription request to the gRPC endpoint, as shown in the previous section. Let's learn more about these parameters:
- `encoding` - Type of audio encoding to use (`LINEAR_PCM`, `FLAC`, `MULAW`, or `ALAW`).
- `sample_rate_hertz` - Sample rate of the input audio, in hertz. The audio is resampled if necessary.
- `language_code` - Language of the input audio. `en-US` represents English (US). Other options include `es-US`, `de-DE`, `ru-RU`, and `zh-CN`. We will explore ASR for non-English languages in the next section.
- `max_alternatives` - Number of top-N hypotheses to return.
- `enable_automatic_punctuation` - Adds punctuation at the end of VAD (voice activity detection).
- `audio_channel_count` - Number of audio channels. Typical microphones have 1 audio channel.
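
Since the same handful of fields is set for every request, one option is to keep the common values in a plain dictionary and copy them onto `req.config`. The following is only a sketch: `DEFAULT_OPTIONS` and `apply_config` are hypothetical names, and the field names are the ones used in the request cell above. A `SimpleNamespace` stands in for `req.config` so the example runs without the Riva client installed.

```python
from types import SimpleNamespace

# Values used in this tutorial's English request.
DEFAULT_OPTIONS = {
    "language_code": "en-US",
    "max_alternatives": 1,
    "enable_automatic_punctuation": True,
    "audio_channel_count": 1,
}


def apply_config(config, sample_rate_hertz, encoding, **overrides):
    """Set the tutorial's request options on `config`, allowing per-call overrides."""
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        # Catch misspelled option names before they are silently ignored.
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    config.encoding = encoding
    config.sample_rate_hertz = sample_rate_hertz
    for name, value in {**DEFAULT_OPTIONS, **overrides}.items():
        setattr(config, name, value)
    return config


# Stand-in for req.config; "LINEAR_PCM" stands in for ra.AudioEncoding.LINEAR_PCM.
demo = apply_config(SimpleNamespace(), 16000, "LINEAR_PCM", language_code="es-US")
print(demo.language_code, demo.max_alternatives)
```

In the notebook, the equivalent call would be along the lines of `apply_config(req.config, sr, ra.AudioEncoding.LINEAR_PCM)`.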

### Offline recognition for non-English languages - Spanish example

In the previous section, we went through the Riva API usage and understood the different parameters of the ASR API. Now, let's look at using the ASR APIs for non-English languages, like Spanish, in offline mode.

<a id='updated_reqs_and_setup_for_nonEngASR'></a>
#### Requirements and Setup for Spanish ASR:

The requirements and setup steps for non-English ASR (in this case, Spanish ASR) are almost the same as for English ASR. The only difference is that, before running inference on the Spanish audio, we first need to deploy the Spanish ASR pipeline on the Riva Speech Skills server. 

Note: The Riva Speech Skills server [Quick Start Guide](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html#), which we followed in the [Requirements and Setup section](#updated_reqs_and_setup_for_EngASR) above for English ASR, deploys only the English models by default.  

1. Start the Riva Speech Skills server, with the Spanish ASR pipeline.  
    1.1. Navigate to the Quick Start Guide folder. You downloaded this folder in the [Requirements and setup section](#Requirements-and-setup) above.
    
    1.2. Run ``bash riva_stop.sh`` to shut down the running Riva Speech Skills server. If Riva Speech Skills server is not currently running, you can skip this step.   
    
    1.3. Edit the ``config.sh`` file: update the ``language_code=("en-US")`` line to include the Spanish model, following the instructions given above this line in the ``config.sh`` script.  
    
    1.4. Rerun ``bash riva_init.sh`` to download and initialize the Spanish models and pipeline.  
    
    1.5. Rerun ``bash riva_start.sh`` to restart the Riva Speech Skills server.  

#### Make a gRPC request to the Riva Speech API server
Let's make a gRPC request to the Riva Speech server for ASR with a sample Spanish `.wav` file in offline mode.  

Like before, start by loading the audio.

In [7]:
# This example uses a .wav file with LINEAR_PCM encoding.
# read in an audio file from local disk
path = "audio_samples/es-US_sample.wav"  # Path to the Spanish sample audio file
audio, sr = librosa.core.load(path, sr=None)
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

As with English, create an audio `RecognizeRequest` object, setting the configuration parameters as required. Notice that we have updated ``language_code`` of the request configuration to the Spanish language code (``"es-US"``).

In [8]:
# Set up an offline/batch recognition request
req = rasr.RecognizeRequest()
req.audio = content                                   # raw bytes
req.config.encoding = ra.AudioEncoding.LINEAR_PCM     # Supports LINEAR_PCM, FLAC, MULAW and ALAW audio encodings
req.config.sample_rate_hertz = sr                     # Audio will be resampled if necessary
req.config.language_code = "es-US"                    # Language code of the audio clip. Set to Spanish
req.config.max_alternatives = 1                       # How many top-N hypotheses to return
req.config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
req.config.audio_channel_count = 1                    # Mono channel

Finally, submit the request to the server.

In [9]:
response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)

print("\n\nFull Response Message:")
print(response)

ASR Transcript: Existen mutaciones que alteran los pigmentos de color basado en carotenoides, pero son raras. 


Full Response Message:
results {
  alternatives {
    transcript: "Existen mutaciones que alteran los pigmentos de color basado en carotenoides, pero son raras. "
    confidence: 1.0
  }
  channel_tag: 1
  audio_processed: 10.031999588012695
}



We can similarly run Riva ASR for German, Russian, and Mandarin by setting their corresponding language codes (``de-DE``, ``ru-RU``, and ``zh-CN``) in the request configuration. Ensure that these pipelines are deployed on the Riva Speech Skills server as instructed in the [Requirements and Setup for Spanish ASR](#updated_reqs_and_setup_for_nonEngASR).
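
The client does not check the language code string before sending it, so a typo would only show up once the request reaches the server. A small client-side check against the codes mentioned in this tutorial can catch it earlier. In the sketch below, `TUTORIAL_LANGUAGE_CODES` and `check_language_code` are hypothetical names. The table covers only the five languages named here and says nothing about which pipelines are actually deployed on your server.

```python
# Language codes mentioned in this tutorial.
TUTORIAL_LANGUAGE_CODES = {
    "en-US": "English",
    "es-US": "Spanish",
    "de-DE": "German",
    "ru-RU": "Russian",
    "zh-CN": "Mandarin",
}


def check_language_code(code: str) -> str:
    """Return `code` unchanged if it is one of the tutorial's codes, else raise ValueError."""
    if code not in TUTORIAL_LANGUAGE_CODES:
        known = ", ".join(sorted(TUTORIAL_LANGUAGE_CODES))
        raise ValueError(f"Unknown language code {code!r}; expected one of: {known}")
    return code


print(check_language_code("de-DE"), "->", TUTORIAL_LANGUAGE_CODES["de-DE"])
```

You could then write `req.config.language_code = check_language_code("de-DE")` when building the request.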

## Go deeper into Riva capabilities

Now that you have a basic introduction to the Riva ASR APIs, you can try:

### Additional Riva tutorials

Check out more Riva ASR (and TTS) tutorials [here](https://github.com/nvidia-riva/tutorials) to understand how to use some of the advanced features of Riva ASR, including customizing ASR for your specific needs.


### Sample applications

Riva comes with various sample applications. They demonstrate how to use the APIs to build applications such as a [chatbot](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/weather.html), a domain-specific speech recognition or [keyword (entity) recognition system](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/callcenter.html), or simply how Riva scales out to handle massive numbers of requests at the same time. Refer to [SpeechSquad](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/speechsquad.html) for more information.  
Refer to the *Sample Application* section in the [Riva developer documentation](https://developer.nvidia.com/) for more information.


### Riva Text-To-Speech (TTS)

Riva's TTS offering comes with two OOTB voices that can be used in streaming or batch inference modes. They can be easily deployed using the [Riva Quick Start scripts](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html). Follow [this link](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html) to understand Riva's TTS capabilities. Explore how to use Riva TTS APIs with the OOTB voices with [this Riva TTS tutorial](https://github.com/nvidia-riva/tutorials/blob/dev/22.04/tts-python-basics.ipynb).


### Additional resources

For more information about each of the APIs and their functionalities, refer to the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/protobuf-api/protobuf-api-root.html).