<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# How to use Riva ASR APIs with out-of-the-box models?

This tutorial walks through the basics of Riva Speech Skills' ASR services, specifically how to use the Riva ASR APIs with out-of-the-box models.

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding services such as:

- Automatic speech recognition (ASR)
- Text-to-Speech synthesis (TTS)
- A collection of natural language processing (NLP) services such as named entity recognition (NER), punctuation, and intent classification.

**In this tutorial, we will focus on interacting with the automatic speech recognition (ASR) APIs.**

For more detailed information on Riva, please refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## Transcription with Riva ASR APIs

Automatic Speech Recognition (ASR) takes an audio stream or audio buffer as input and returns one or more text transcripts, along with additional optional metadata. Speech recognition in Riva is a GPU-accelerated compute pipeline, with optimized performance and accuracy.  
Riva provides state-of-the-art out-of-the-box (OOTB) models and pipelines for multiple languages, such as English, Spanish, German, and Russian, which can be deployed with the Riva Quick Start scripts. Riva also supports customizing the ASR pipeline in various ways to meet your specific needs.  
Please refer to the [Riva ASR documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-overview.html) for more details.  

Now let us generate transcripts for some sample audio clips with the out-of-the-box (OOTB) pipelines, starting with English.

#### Requirements and setup

To execute this notebook, please follow the [Requirements and Setup steps for Riva Client](./README.md).

#### Import Riva client libraries

We first import some required libraries, including the Riva client libraries.

In [1]:
import io
import librosa
import IPython.display as ipd
import grpc

import riva_api.riva_asr_pb2 as rasr
import riva_api.riva_asr_pb2_grpc as rasr_srv
import riva_api.riva_audio_pb2 as ra
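
If one of the imports above fails, it can help to see every missing package at once rather than fixing them one error at a time. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list of names simply mirrors the top-level packages imported in the cell above.

```python
import importlib.util

def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for dotted names whose parent package is absent,
            # or for modules with a broken import spec.
            found = False
        if not found:
            missing.append(name)
    return missing

# Top-level packages used by the import cell above.
required = ["grpc", "librosa", "IPython", "riva_api"]
print("Missing modules:", missing_modules(required))
```

An empty list means the import cell should run without a `ModuleNotFoundError`.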

#### Create Riva clients and connect to Riva Speech API server

The URI below assumes a local deployment of the Riva Speech API server on the default port. If the server is deployed on a different host, or via a Helm chart on Kubernetes, use the appropriate URI instead.

In [2]:
channel = grpc.insecure_channel('localhost:50051')

riva_asr = rasr_srv.RivaSpeechRecognitionStub(channel)
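
For deployments on another host, one option is to read the URI from the environment instead of editing the notebook. This is only a sketch: the environment variable name `RIVA_URI` and the helper `resolve_riva_uri` are assumptions made for illustration, not part of Riva.

```python
import os

DEFAULT_RIVA_URI = "localhost:50051"  # local deployment on the default port

def resolve_riva_uri(env=None):
    """Return the URI from RIVA_URI (a hypothetical variable name), else the local default."""
    env = os.environ if env is None else env
    uri = env.get("RIVA_URI", "").strip()
    return uri or DEFAULT_RIVA_URI

print(resolve_riva_uri({}))                                       # localhost:50051
print(resolve_riva_uri({"RIVA_URI": "riva.example.com:50051"}))   # riva.example.com:50051
```

The result can then be passed to the channel constructor used above, for example `grpc.insecure_channel(resolve_riva_uri())`.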

### Offline recognition for English

Riva ASR can be used in either streaming mode or offline mode. In streaming mode, a continuous stream of audio is captured and recognized, producing a stream of transcribed text. In offline mode, an audio clip of fixed length is transcribed to text. <br>
Let us look at an example showing offline ASR API usage:

#### Make a gRPC request to the Riva Speech API server
In this release, the Riva ASR API supports single-channel `.wav` files in PCM format, as well as the `.alaw`, `.mulaw`, and `.flac` formats.

Now let us make a gRPC request to the Riva Speech server for ASR with a sample `.wav` file in offline mode. Start by loading the audio.

In [3]:
# This example uses a .wav file with LINEAR_PCM encoding.
# Read the audio file from local disk.
path = "../_static/data/asr/en-US_sample.wav"
audio, sr = librosa.load(path, sr=None)   # sr=None keeps the file's native sample rate
with io.open(path, 'rb') as fh:
    content = fh.read()                   # raw file bytes, sent as-is in the request
ipd.Audio(path)
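
Because the API expects single-channel audio, it can be worth checking a clip's properties before sending it. The sketch below uses Python's standard `wave` module, which reads PCM `.wav` files only; the helper names `wav_info` and `make_silent_wav` are hypothetical, and the silent clip is a stand-in so the example runs without the sample file.

```python
import io
import wave

def wav_info(wav_bytes):
    """Return (sample_rate_hz, channels, sample_width_bytes, duration_s) for PCM WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        width = wf.getsampwidth()
        duration = wf.getnframes() / rate
    return rate, channels, width, duration

def make_silent_wav(rate=16000, seconds=1):
    """Build a mono 16-bit PCM WAV of silence in memory (stand-in for a real clip)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * rate * seconds)
    return buf.getvalue()

rate, channels, width, duration = wav_info(make_silent_wav())
print(rate, channels, width, duration)   # 16000 1 2 1.0
```

With the bytes read in the cell above, `wav_info(content)` would report the same properties for the sample clip, and the sample rate it returns could stand in for the value obtained from `librosa`.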

Next, create an audio `RecognizeRequest` object, setting the configuration parameters as required.

In [4]:
# Set up an offline/batch recognition request
req = rasr.RecognizeRequest()
req.audio = content                                   # raw bytes
req.config.encoding = ra.AudioEncoding.LINEAR_PCM     # Supports LINEAR_PCM, FLAC, MULAW and ALAW audio encodings
req.config.sample_rate_hertz = sr                     # Audio will be resampled if necessary
req.config.language_code = "en-US"                    # Ignored, will route to correct model in future release
req.config.max_alternatives = 1                       # How many top-N hypotheses to return
req.config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
req.config.audio_channel_count = 1                    # Mono channel
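
The encoding set on the request has to match the audio file being sent. As a rough sketch, the file formats and encodings named in this tutorial can be paired in a small lookup table; the helper `encoding_name_for` and the table itself are illustrative assumptions, not part of the Riva client.

```python
from pathlib import Path

# Extension-to-encoding pairs, based on the formats and encodings named in this tutorial.
ENCODING_BY_EXTENSION = {
    ".wav": "LINEAR_PCM",   # assumes the .wav file holds PCM audio
    ".flac": "FLAC",
    ".alaw": "ALAW",
    ".mulaw": "MULAW",
}

def encoding_name_for(path):
    """Return the encoding name for a file path, or raise ValueError if unsupported."""
    ext = Path(path).suffix.lower()
    try:
        return ENCODING_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unsupported audio extension: {ext!r}") from None

print(encoding_name_for("../_static/data/asr/en-US_sample.wav"))   # LINEAR_PCM
```

The returned name should be convertible to the enum value with the name lookup that protobuf-generated enums provide, for example `ra.AudioEncoding.Value(name)`.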

Finally, submit the request to the server.

In [5]:
response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)

print("\n\nFull Response Message:")
print(response)

ASR Transcript: What is natural language processing? 


Full Response Message:
results {
  alternatives {
    transcript: "What is natural language processing? "
    confidence: 1.0
  }
  channel_tag: 1
  audio_processed: 4.1519999504089355
}
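
The cell above indexes `response.results[0]` directly, which may raise an `IndexError` if the server returns no results (for example, for a silent clip). A more defensive sketch is shown below. It uses stand-in objects shaped like the printed response rather than real Riva messages, and it assumes alternatives are ordered best-first; `best_transcript` is a hypothetical helper name.

```python
from types import SimpleNamespace

def best_transcript(response):
    """Join the top alternative of every result into one transcript string."""
    parts = []
    for result in response.results:
        if result.alternatives:   # skip results that carry no hypotheses
            parts.append(result.alternatives[0].transcript.strip())
    return " ".join(p for p in parts if p)

# Stand-in object shaped like the response printed above (not a real Riva message).
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript="What is natural language processing? ", confidence=1.0),
    ]),
])

print(best_transcript(fake_response))                        # What is natural language processing?
print(repr(best_transcript(SimpleNamespace(results=[]))))    # ''
```

The same function should work on the real `response` object, since it only relies on the `results`, `alternatives`, and `transcript` fields shown in the output.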



#### Understanding ASR API parameters

Riva ASR supports a number of options while making a transcription request to the gRPC endpoint, as shown above. Let's learn more about these parameters:
- `enable_automatic_punctuation` - Adds punctuation to the transcript when VAD (Voice Activity Detection) detects the end of speech.
- `encoding` - Type of audio encoding to use (`LINEAR_PCM`, `FLAC`, `MULAW` or `ALAW`).
- `language_code` - Language of the audio. `"en-US"` represents English (US).
- `audio_channel_count` - Number of audio channels. Typical microphones have 1 audio channel.
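
The parameters above configure an offline request, which sends the whole clip in one piece. Streaming mode, described at the start of this section, instead delivers audio to the server in small pieces. The sketch below shows only the client-side chunking step, not the Riva streaming call itself; the helper `iter_chunks` and the 4096-byte chunk size are arbitrary choices made for illustration.

```python
def iter_chunks(audio_bytes, chunk_size=4096):
    """Yield consecutive slices of `audio_bytes`, each at most `chunk_size` bytes long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(audio_bytes), chunk_size):
        yield audio_bytes[start:start + chunk_size]

# 10,000 bytes of placeholder audio split into 4096-byte chunks.
chunks = list(iter_chunks(b"\x00" * 10000, chunk_size=4096))
print([len(c) for c in chunks])   # [4096, 4096, 1808]
```

Joining the chunks back together reproduces the original bytes, so no audio is lost or duplicated at chunk boundaries.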

### Offline recognition for non-English languages - Spanish example

In the previous section, we went through Riva API usage and the different parameters of the ASR API. Now let us look at using ASR for a non-English language, Spanish, in offline mode.

#### Make a gRPC request to the Riva Speech API server
The request follows the same steps as the English example, with the same supported audio formats (single-channel PCM `.wav`, `.alaw`, `.mulaw`, and `.flac`).

Now let us make a gRPC request to the Riva Speech server for ASR with a sample `.wav` file in offline mode. Start by loading the audio.

In [3]:
# This example uses a .wav file with LINEAR_PCM encoding.
# Read the audio file from local disk.
path = "../_static/data/asr/en-_sample.wav"
audio, sr = librosa.load(path, sr=None)   # sr=None keeps the file's native sample rate
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

Next, create an audio `RecognizeRequest` object, setting the configuration parameters as required.

In [4]:
# Set up an offline/batch recognition request
req = rasr.RecognizeRequest()
req.audio = content                                   # raw bytes
req.config.encoding = ra.AudioEncoding.LINEAR_PCM     # Supports LINEAR_PCM, FLAC, MULAW and ALAW audio encodings
req.config.sample_rate_hertz = sr                     # Audio will be resampled if necessary
req.config.language_code = "es-US"                    # Language of the audio; "es-US" is assumed here for Spanish
req.config.max_alternatives = 1                       # How many top-N hypotheses to return
req.config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
req.config.audio_channel_count = 1                    # Mono channel

Finally, submit the request to the server.

In [5]:
response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)

print("\n\nFull Response Message:")
print(response)

## Go deeper into Riva capabilities

Now that you have a basic introduction to the Riva APIs, you may like to try out:

### Advanced ASR notebook

Check out [this notebook](asr-python-boosting) to learn how to use some of the advanced features of Riva ASR.


### Sample apps

Riva comes with various sample apps that demonstrate how to use the APIs to build applications such as a [chatbot](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/weather.html) or a domain-specific speech recognition or [keyword (entity) recognition system](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/callcenter.html), and that show how Riva scales out to handle a massive number of simultaneous requests ([SpeechSquad](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/samples/speechsquad.html)).  
Have a look at the Sample Application section in the [Riva developer documentation](https://developer.nvidia.com/) for all the sample apps.


### Fine-tune a domain-specific speech model

Train the latest state-of-the-art speech and natural language processing models on your own data using the [Transfer Learning Toolkit](https://developer.nvidia.com/transfer-learning-toolkit) or [NeMo](https://github.com/NVIDIA/NeMo), and deploy them on Riva using the [Riva ServiceMaker tool](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/model-servicemaker.html).


### Further resources

Explore the details of each of the APIs and their functionalities in the [docs](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/protobuf-api/protobuf-api-root.html).