<p> <center> <a href="../Start_Here.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="question-answering-training.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="question-answering-training.ipynb">6</a>
        <a >7</a>
        <a href="challenge.ipynb">8</a>
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="challenge.ipynb">Next Notebook</a></span>
</div>

# Deploying a Question Answering Model in Riva
---


## Learning Objectives
In this notebook, you will learn how to:  
- Use Riva ServiceMaker to take a TAO-exported `.riva` file and generate a model repository (see the Riva Quick Start [resources](https://ngc.nvidia.com/catalog/resources/nvidia:riva:riva_quickstart))
- Deploy the model(s) locally on the Riva server
- Send inference requests from a demo client using the Riva API bindings

### NVIDIA RIVA

[NVIDIA Riva](https://developer.nvidia.com/riva) is a GPU-accelerated SDK and application framework for building customized, real-time, multi-modal conversational AI and speech AI services. TAO models can be exported in `.riva` format, optimized, and deployed on Riva as a speech service. Riva services are delivered as gRPC-based microservices for low-latency streaming as well as high-throughput offline use cases. A major feature of Riva is its flexibility: you can modify model architectures, fine-tune models on your own data, customize pipelines, and deploy on any platform.

This notebook takes a `.riva` model, the result of `tao question_answering export` in the previous notebook, and uses the Riva ServiceMaker framework to aggregate all the necessary artifacts for Riva deployment to a target environment. Once the model is deployed in Riva, you can issue inference requests to the server. We will demonstrate how quick and straightforward this whole process is.


## Riva ServiceMaker
ServiceMaker is a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva deployment to a target environment. It has two main components:
- Riva-build
- Riva-deploy

### 1. Riva-build

This step builds a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. In this notebook, we consider a QA model.<br>

`riva-build` is responsible for the combination of one or more exported models (.riva files) into a single file containing an intermediate format called `Riva Model Intermediate Representation (.rmir)`. This file contains a deployment-agnostic specification of the whole end-to-end pipeline along with all the assets required for the final deployment and inference. Please check out the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/service-nlp.html#pipeline-configuration) to find out more. 
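
The argument layout that `riva-build` expects can be sketched as a small Python helper. This is only an illustration that mirrors the command used later in this notebook; the helper name `riva_build_args`, the example paths, and the example key are hypothetical.

```python
from typing import List, Sequence


def riva_build_args(task: str, rmir_path: str, riva_paths: Sequence[str],
                    key: str, force: bool = True) -> List[str]:
    """Compose: riva-build <task> [-f] out.rmir:key in1.riva:key in2.riva:key ..."""
    args = ["riva-build", task]
    if force:
        args.append("-f")  # the same -f flag that this notebook passes to riva-build
    # The output .rmir comes first, followed by one or more input .riva files.
    args.append(f"{rmir_path}:{key}")
    args.extend(f"{path}:{key}" for path in riva_paths)
    return args


# Hypothetical container paths and key, for illustration only
example = riva_build_args("qa", "/data/question-answering.rmir",
                          ["/data/qa-model.riva"], "tlt_encode")
print(" ".join(example))
```

Building the arguments as a list makes it easy to see that every file, input or output, carries its own `:key` suffix.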

The steps to execute `riva-build` are as follows:
- Have your `qa-model.riva` model ready as shown below

<img align="center" src="images/qa_model_riva.png">

`NOTE:` The `qa-model.riva` file above is the QA model obtained from the previous notebook.


- Set variables: 
    - ServiceMaker Docker image
    - Directory where the `.riva` model is stored
    - Name of the `.riva` file
    - Model encryption key (the key from NGC used in the previous notebook during TAO training)

In [None]:
import os

# Name of ServiceMaker Docker Image
RIVA_SM_CONTAINER = "nvcr.io/nvidia/riva/riva-speech:2.6.0-servicemaker"

# Directory where the .riva model is stored $MODEL_LOC/*.riva
MODEL_LOC = os.path.realpath(os.getcwd()+'/../')+"/results/questions_answering/export_riva"

# Name of the .riva file
MODEL_NAME = "qa-model.riva"

# Key that model is encrypted with
KEY = ""

- Pull the ServiceMaker Docker image by running the cell below

In [None]:
# Get the ServiceMaker docker
!docker pull $RIVA_SM_CONTAINER

**You should see output similar to this**:
```python
2.6.0-servicemaker: Pulling from nvidia/riva/riva-speech

b49e2995: Pulling fs layer 
0e6f7157: Pulling fs layer 
53daec0c: Pulling fs layer 
24ede650: Pulling fs layer 
...
c00d2faf: Pull complete 
Digest: sha256:88cc5298313eeb3f0105000c9f2a215bbc2ab0e950e1b6dd32898490e020cde8
Status: Downloaded newer image for nvcr.io/nvidia/riva/riva-speech:2.6.0-servicemaker
nvcr.io/nvidia/riva/riva-speech:2.6.0-servicemaker

```

- Execute `riva-build` to create the `Riva Model Intermediate Representation (.rmir)` file

In [None]:
# Syntax: riva-build <task-name> output-dir-for-rmir/model.rmir:key dir-for-riva/model.riva:key
!docker run --rm --gpus '"device=0:0"' -v $MODEL_LOC:/data $RIVA_SM_CONTAINER -- \
            riva-build qa -f /data/question-answering.rmir:$KEY /data/$MODEL_NAME:$KEY


**Expected output should look as shown below**:

```python
==========================
=== Riva Speech Skills ===
==========================

NVIDIA Release  (build 45250447)
Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

https://developer.nvidia.com/tensorrt

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
...
2022-10-21 12:04:55,988 [INFO] Packing binaries for tokenizer/PyTorch : {'vocab': ('nemo.collections.nlp.models.question_answering.qa_model.QAModel', 'tokenizer.vocab_file')}
2022-10-21 12:04:55,988 [INFO] Copying vocab:tokenizer.vocab_file -> tokenizer:tokenizer-tokenizer.vocab_file
2022-10-21 12:04:55,989 [INFO] Saving to /data/question-answering.rmir
```

The generated `.rmir` file should be in the `../results/questions_answering/export_riva/` directory.

<img align="center" src="images/rmir.png">
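
As a programmatic alternative to inspecting the folder by eye, a short check can confirm that a `.rmir` file was written. The sketch below uses only the standard library and a temporary directory standing in for the export folder; `find_rmir_files` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path
from typing import List


def find_rmir_files(directory: str) -> List[str]:
    """Return the sorted names of all .rmir files directly inside `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.rmir") if p.is_file())


# Demo on a temporary directory that imitates the export folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "qa-model.riva").touch()
    (Path(tmp) / "question-answering.rmir").touch()
    print(find_rmir_files(tmp))
```

In the notebook, the same function could be pointed at `MODEL_LOC` instead of the temporary directory.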

Run the cell below to view the `question-answering.rmir` file

In [None]:
!ls $MODEL_LOC

### 2. Riva-deploy

The deployment tool takes as input one or more `Riva Model Intermediate Representation (RMIR)` files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output `model` repository directory.

Run the cell below to execute `riva-deploy`. Note that the `-v` flag maps the model directory on your local machine to the `/data` directory within the Riva ServiceMaker container, while the `-f` flag forces overwriting of any preexisting `models` folder.
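
Because of the `-v` mapping, every path passed to the tool must be written as the container sees it, not as the host sees it. A minimal sketch of that translation, assuming POSIX-style paths; the helper name `to_container_path` and the example host directory are hypothetical.

```python
from pathlib import PurePosixPath


def to_container_path(host_path: str, host_root: str, container_root: str = "/data") -> str:
    """Translate a path under the mounted host directory to its path inside the container."""
    # relative_to raises ValueError if host_path is not located under host_root
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(host_root))
    return str(PurePosixPath(container_root) / relative)


# Hypothetical host directory mounted with: -v /home/user/export_riva:/data
print(to_container_path("/home/user/export_riva/question-answering.rmir",
                        "/home/user/export_riva"))
```

A file outside the mounted directory is not visible inside the container at all, which is why the helper raises an error in that case rather than returning a path.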

In [None]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir:key output-dir-for-repository
!docker run --rm --gpus '"device=0:0"' -v $MODEL_LOC:/data $RIVA_SM_CONTAINER -- \
            riva-deploy -f /data/question-answering.rmir:$KEY /data/models

**Expected output**:

````bash
==========================
=== Riva Speech Skills ===
==========================

NVIDIA Release  (build 45250447)
Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
https://developer.nvidia.com/tensorrt
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
...

2022-10-21 12:32:37,291 [INFO] extracting {'vocab': ('nemo.collections.nlp.models.question_answering.qa_model.QAModel', 'tokenizer.vocab_file')} -> /data/models/qa_tokenizer-en-US/1
2022-10-21 12:32:37,292 [INFO] Extract_binaries for token_classifier -> /data/models/qa_qa_postprocessor/1
2022-10-21 12:32:37,293 [INFO] Extract_binaries for self -> /data/models/riva_qa/1

````
You should also see the generated `models` subdirectory at `../results/questions_answering/export_riva/`

<img align="center" src="images/riva_models.png"> <br>
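
The generated repository follows the layout visible in the log above: one directory per model, each with a numbered version subdirectory (for example `riva_qa/1`). The sketch below lists such a layout using only the standard library; the temporary directory stands in for the real `models` folder, and `list_model_versions` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def list_model_versions(repo_dir: str) -> Dict[str, List[str]]:
    """Map each model directory in `repo_dir` to its numeric version subdirectories."""
    models: Dict[str, List[str]] = {}
    for model_dir in sorted(Path(repo_dir).iterdir()):
        if model_dir.is_dir():
            models[model_dir.name] = sorted(
                v.name for v in model_dir.iterdir() if v.is_dir() and v.name.isdigit()
            )
    return models


# Demo on a temporary directory that imitates part of the generated repository
with tempfile.TemporaryDirectory() as tmp:
    for name in ("riva_qa", "qa_tokenizer-en-US"):
        (Path(tmp) / name / "1").mkdir(parents=True)
    print(list_model_versions(tmp))
```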


## Start Riva Server
Once the model repository is generated, we are ready to start the Riva server. From this step onwards, you need the Riva QuickStart resource from NGC. 
You can download it from the NVIDIA NGC [resources](https://ngc.nvidia.com/catalog/resources/nvidia:riva:riva_quickstart) page. Your Riva QuickStart (version 2.6.0) folder should contain the files shown below. 

<img align="center" src="images/riva_quickstart_v26.png">


Set the path to the Riva QuickStart directory:

In [None]:
RIVA_DIR = os.path.realpath(os.getcwd()+'/../')+"/source_code/riva_quickstart_v2.6.0"
RIVA_DIR

Next, we modify `config.sh` to enable the relevant Riva services (NLP for the Question Answering model), provide the encryption key, and set the path to the model repository (`riva_model_loc`) generated in the previous step, among other configurations. 

For instance, if the model repository above was generated at `$MODEL_LOC/models`, then you can set `riva_model_loc` to the same directory as `MODEL_LOC`.

Pretrained versions of the models specified in `models_asr`, `models_nlp`, and `models_tts` are fetched from NGC. Since we are using our custom model, we can comment out the entries in `models_nlp` (and any others that are not relevant to your use case). <br>

`NOTE:` You can perform this step by editing **config.sh** at `../source_code/riva_quickstart_v2.6.0/config.sh`.

### config.sh snippet
```
# Enable or Disable Riva Services 
service_enabled_asr=false                                 ## MAKE CHANGES HERE
service_enabled_nlp=true                                  ## MAKE CHANGES HERE
service_enabled_tts=false                                 ## MAKE CHANGES HERE

# Specify one or more GPUs to use
# specifying more than one GPU is currently an experimental feature, and may result in undefined behaviours.
gpus_to_use="device=0"

# Specify the encryption key to use to deploy models
MODEL_DEPLOY_KEY="tlt_encode"                             ## Set the model encryption key

# Locations to use for storing models artifacts
...
riva_model_loc="<add path>"                              ## Replace with MODEL_LOC

# The default RMIRs are downloaded from NGC by default in the above $riva_rmir_loc directory
# If you'd like to skip the download from NGC and use the existing RMIRs in the $riva_rmir_loc
# then set the below $use_existing_rmirs flag to true.
...
use_existing_rmirs=true                                  ## Set to True
```
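
If you prefer to script these edits rather than make them by hand, the following sketch rewrites `name=value` assignments in the text of a shell config. It is a rough illustration on a sample string, not a tool shipped with Riva: `set_shell_var` is a hypothetical helper, it replaces every assignment of the given name (the real `config.sh` may assign a variable more than once), and it discards any trailing comment on the rewritten line.

```python
import re


def set_shell_var(config_text: str, name: str, value: str) -> str:
    """Replace the right-hand side of every `name=...` assignment line in `config_text`."""
    pattern = re.compile(rf"^([ \t]*{re.escape(name)}=).*$", flags=re.MULTILINE)
    # A function replacement avoids backslash interpretation in `value`.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, config_text)
    if count == 0:
        raise KeyError(f"{name} not found in config text")
    return new_text


# Sample text imitating a few config.sh lines; the model path is a placeholder
sample = (
    "service_enabled_asr=true\n"
    "service_enabled_nlp=true\n"
    'riva_model_loc="riva-model-repo"\n'
)
sample = set_shell_var(sample, "service_enabled_asr", "false")
sample = set_shell_var(sample, "riva_model_loc", '"/path/to/export_riva"')
print(sample)
```

Review the resulting file before starting the server, since a text substitution like this cannot tell which branch of a shell conditional applies to your machine.
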
Since we are only interested in the NLP service, open `riva_start.sh`, go to line 93, and remove these flags: `--asr_service=$service_enabled_asr --tts_service=$service_enabled_tts`  

### riva_start.sh snippet
```
...
85  #speech server is required
86 #check if it's already running first...
87 if [ $(docker ps -q -f "name=^/$riva_daemon_speech$" | wc -l) -eq 0 ]; then
88    echo "Starting Riva Speech Services. This may take several minutes depending on the number of models deployed."
89    docker rm $riva_daemon_speech &> /dev/null
90    if [[ $riva_target_gpu_family == "tegra" ]]; then
91        docker_run_args="-p 8000:8000 -p 8001:8001 -p 8002:8002 -p 8888:8888 --device /dev/bus/usb --device /dev/snd $image_speech_api riva_server $ssl_args"
92    else
93        docker_run_args="-p 8000 -p 8001 -p 8002 $image_speech_api start-riva --riva-uri=0.0.0.0:$riva_speech_api_port --asr_service=$service_enabled_asr --tts_service=$service_enabled_tts --nlp_service=$service_enabled_nlp $ssl_args &> /dev/null"
94 ...
```

#### Ensure you have permission to execute these scripts from the Riva Quickstart directory

In [None]:
!cd $RIVA_DIR && chmod +x ./riva_init.sh && chmod +x ./riva_start.sh

#### Run Riva Start. This will deploy your model(s)

In [None]:
# Run Riva Start. This will deploy your model(s).
!cd $RIVA_DIR && ./riva_start.sh config.sh


When you see the output below, it means your model(s) have been successfully deployed and the Riva server is ready for inference requests.

```bash
Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
d51f487128807235ce4bf258853c7c7387ea9ae019ba1e47a05694673b0b0342
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...
```
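
From Python, a rough way to wait for the server is to poll its port until a TCP connection succeeds. The sketch below uses only the standard library and assumes the server listens on `localhost:50051`, the address used by the client code in this notebook; `wait_for_port` is a hypothetical helper. An open port only shows that the process is accepting connections, so the script output above remains the authoritative readiness signal.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and try again
            time.sleep(interval_s)
    return False


# Short timeout for demonstration; prints False if nothing is listening
print(wait_for_port("localhost", 50051, timeout_s=2.0))
```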

If the Riva server fails to start, run the `docker logs riva-speech` cell below to identify the errors in the logs. A failed start can look like this:

```bash
...
I1101 13:15:43.179256 103 tensorrt.cc:5454] TRITONBACKEND_ModelInstanceInitialize: riva-trt-riva_qa-nn-bert-base-uncased_0 (GPU device 0)
  > Riva waiting for Triton server to load all models...retrying in 1 second
  > Riva waiting for Triton server to load all models...retrying in 1 second
  > Riva waiting for Triton server to load all models...retrying in 1 second
  > Riva waiting for Triton server to load all models...retrying in 1 second
I1101 13:15:47.239278 103 logging.cc:49] [MemUsageChange] Init CUDA: CPU +253, GPU +0, now: CPU 287, GPU 1456 (MiB)
I1101 13:15:47.473845 103 logging.cc:49] Loaded engine size: 208 MiB

...

+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1101 13:15:54.687420 103 server.cc:576] 
+--------------------+-------------------------------------------------------------------------------+--------+
| Backend            | Path                                                                          | Config |
+--------------------+-------------------------------------------------------------------------------+--------+
| onnxruntime        | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so               | {}     |
| tensorrt           | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so                     | {}     |
| riva_nlp_qa        | /opt/tritonserver/backends/riva_nlp_qa/libtriton_riva_nlp_qa.so               | {}     |
| riva_nlp_tokenizer | /opt/tritonserver/backends/riva_nlp_tokenizer/libtriton_riva_nlp_tokenizer.so | {}     |
+--------------------+-------------------------------------------------------------------------------+--------+

I1101 13:15:54.687497 103 server.cc:619] 
+---------------------------------------+---------+--------+
| Model                                 | Version | Status |
+---------------------------------------+---------+--------+
| qa_qa_postprocessor                   | 1       | READY  |
| qa_tokenizer-en-US                    | 1       | READY  |
| riva-trt-riva_qa-nn-bert-base-uncased | 1       | READY  |
| riva_qa                               | 1       | READY  |
+---------------------------------------+---------+--------+
...

I1101 13:15:54.841537 103 http_server.cc:180] Started Metrics Service at 0.0.0.0:8002
  > Triton server is ready...
ERROR: illegal value '' specified for bool flag 'asr_service'
ERROR: illegal value '' specified for bool flag 'tts_service'
One of the processes has exited unexpectedly. Stopping container.
Signal (15) received.
/opt/riva/bin/start-riva: line 1: kill: (197) - No such process
I1101 13:16:00.113213 103 server.cc:250] Waiting for in-flight requests to complete.
I1101 13:16:00.113252 103 server.cc:266] Timeout 30: Found 0 model versions that have in-flight inferences
I1101 13:16:00.113277 103 model_repository_manager.cc:1109] unloading: riva_qa:1
I1101 13:16:00.113419 103 model_repository_manager.cc:1109] unloading: riva-trt-riva_qa-nn-bert-base-uncased:1
I1101 13:16:00.113504 103 model_repository_manager.cc:1109] unloading: qa_tokenizer-en-US:1
I1101 13:16:00.113636 103 model_repository_manager.cc:1109] unloading: qa_qa_postprocessor:1
I1101 13:16:00.113837 103 model_repository_manager.cc:1214] successfully unloaded 'riva_qa' version 1

```
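
In a long log, the decisive lines are easy to miss. The sketch below scans log text for the "illegal value ... for bool flag" errors shown above and returns the offending flag names. The regular expression is written against the two error lines in this sample log only, and `bad_bool_flags` is a hypothetical helper name.

```python
import re
from typing import List

# Matches lines such as: ERROR: illegal value '' specified for bool flag 'asr_service'
FLAG_ERROR = re.compile(
    r"^ERROR: illegal value '(?P<value>[^']*)' specified for bool flag '(?P<flag>[^']+)'"
)


def bad_bool_flags(log_text: str) -> List[str]:
    """Return the names of bool flags that the log reports as having an illegal value."""
    flags = []
    for line in log_text.splitlines():
        match = FLAG_ERROR.match(line.strip())
        if match:
            flags.append(match.group("flag"))
    return flags


sample_log = """\
  > Triton server is ready...
ERROR: illegal value '' specified for bool flag 'asr_service'
ERROR: illegal value '' specified for bool flag 'tts_service'
One of the processes has exited unexpectedly. Stopping container.
"""
print(bad_bool_flags(sample_log))  # ['asr_service', 'tts_service']
```
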
**Solution**

The solution is to set the ASR and TTS services (`service_enabled_asr` and `service_enabled_tts`) to **false** within the `config.sh` file and to ensure that line 93 in `riva_start.sh` is changed from:
```bash
docker_run_args="-p 8000 -p 8001 -p 8002 $image_speech_api start-riva --riva-uri=0.0.0.0:$riva_speech_api_port --asr_service=$service_enabled_asr --tts_service=$service_enabled_tts --nlp_service=$service_enabled_nlp $ssl_args &> /dev/null"

# to:

docker_run_args="-p 8000 -p 8001 -p 8002 $image_speech_api start-riva --riva-uri=0.0.0.0:$riva_speech_api_port  --nlp_service=$service_enabled_nlp $ssl_args &> /dev/null"
```
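
The same change can be expressed as a text transformation, which is handy if you want to check the result before editing the script. This is a sketch on a copy of the line as a Python string; `remove_flags` is a hypothetical helper and does not modify `riva_start.sh` itself.

```python
import re
from typing import Iterable


def remove_flags(command_line: str, flag_names: Iterable[str]) -> str:
    """Remove `--name=value` tokens for the given flag names from a command-line string."""
    for name in flag_names:
        # Also consume the whitespace before the flag so no double spaces remain
        command_line = re.sub(rf"\s*--{re.escape(name)}=\S+", "", command_line)
    return command_line


line = ('docker_run_args="-p 8000 -p 8001 -p 8002 $image_speech_api start-riva '
        '--riva-uri=0.0.0.0:$riva_speech_api_port --asr_service=$service_enabled_asr '
        '--tts_service=$service_enabled_tts --nlp_service=$service_enabled_nlp '
        '$ssl_args &> /dev/null"')
print(remove_flags(line, ["asr_service", "tts_service"]))
```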

In [None]:
!docker logs riva-speech

---
## Run Inference
Once the Riva server is up and running with your models, you can send inference requests querying the server. 

To send gRPC requests, you can install the Riva Python API bindings for the client. These are available as a pip `.whl` file in the QuickStart directory. However, if your Riva QuickStart version is 2.7.0 or above, you don't need to install the Python bindings explicitly, so you can skip the installation cells below.
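
One way to decide whether the installation cells are needed is to check whether the bindings are already importable. The sketch below uses only the standard library; `riva_api` is the package name imported by the inference cells in this notebook, and the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module `module_name` can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


print("riva_api installed:", is_installed("riva_api"))
```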


In [None]:
# IMPORTANT: This is only applicable to riva quickstart v2.6.0 and earlier version
# Set the name of the whl file.

RIVA_API_WHL = "riva_api-2.1.0-py3-none-any.whl"

In [None]:
# Install client API bindings
!cd $RIVA_DIR && pip install $RIVA_API_WHL

**The following code sample shows how you can perform inference using Riva Python API gRPC bindings:**

In [None]:
import grpc
import riva_api.riva_nlp_pb2 as rnlp
import riva_api.riva_nlp_pb2_grpc as rnlp_srv


grpc_server =  "localhost:50051"
channel = grpc.insecure_channel(grpc_server)
riva_nlp = rnlp_srv.RivaLanguageUnderstandingStub(channel)

req = rnlp.NaturalQueryRequest()


Create the QA request context and query, and get the response from the Riva server.

In [None]:
# Adjacent string literals are concatenated, so each continuation starts with a space
context_1 = "In 2010 the Amazon rainforest experienced another severe drought, in some ways more extreme than the 2005 drought." \
                " The affected region was approximate 1,160,000 square miles (3,000,000 km2) of rainforest, compared to 734,000 square miles (1,900,000 km2)" \
                " in 2005. The 2010 drought had three epicenters where vegetation died off, whereas in 2005 the drought was focused on the southwestern part." \
                " The findings were published in the journal Science. In a typical year the Amazon absorbs 1.5 gigatons of carbon dioxide; during 2005" \
                " instead 5 gigatons were released and in 2010 8 gigatons were released."


req.query = "How many gigatons of carbon are absorbed by the Amazon in a typical year?"

req.context = context_1
resp = riva_nlp.NaturalQuery(req)
print(resp)

Create another query or question from the context above.

In [None]:
req.query = "the affected region by drought in 2010 is approximately?"
req.context = context_1
resp = riva_nlp.NaturalQuery(req)
print(resp.results[0])

## Sample QA implementation 

In this section, we demonstrate an inquiry system in which a user asks a question verbally and receives the response as speech. First, we create a textual context and two speech queries about that context as a sample scenario for our demonstration. Due to the nature of the bootcamp computing environment, the queries were captured as .wav files. Each query is sent to the Riva server for inference. During this process, the speech goes through an Automatic Speech Recognition (ASR) model and is transcribed to text. This transcription serves as the request to the QA model deployed on the Riva server. The response from the server is then converted to speech output via a TTS model.

<img align="center" src="images/sample_application.png">

### ASR: Speech-To-Text(STT)

Automatic Speech Recognition (ASR) takes an audio stream or audio buffer as input and returns one or more text transcripts, along with additional optional metadata.
Speech recognition in Riva is a GPU-accelerated compute pipeline, with optimized performance and accuracy. In the cell below, we display the list of available ASR models.

In [None]:
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
import IPython

# Here is an example of all CTC-based models:
nemo_asr.models.EncDecCTCModel.list_available_models()


**Expected Output:**
```bash
[PretrainedModelInfo(
 	pretrained_model_name=QuartzNet15x5Base-En,
 	description=QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other. Please visit https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels for further details.,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/QuartzNet15x5Base-En.nemo
 ),
 PretrainedModelInfo(
 	pretrained_model_name=stt_en_quartznet15x5,
 	description=For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_quartznet15x5,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_quartznet15x5/versions/1.0.0rc1/files/stt_en_quartznet15x5.nemo
 ),
 PretrainedModelInfo(
 	pretrained_model_name=stt_en_jasper10x5dr,
 	description=For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_jasper10x5dr,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_jasper10x5dr/versions/1.0.0rc1/files/stt_en_jasper10x5dr.nemo
 ),
 ...
```

Select a Speech Recognition model `stt_en_jasper10x5dr.nemo`

In [None]:
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_en_jasper10x5dr").cuda() 

Create Context

In [None]:
context_2 = 'Climate is the average of the weather conditions at a particular point on the Earth. Typically, climate \
is expressed in terms of expected temperature, rainfall and wind conditions based on historical observations. Climate change \
is a change in either the average climate or climate variability that persists over an extended period.'

Set question speech

In [None]:
import IPython

# Path to the recorded question (.wav file)
audio_question_1 = "../data/speech/audio_question_1.wav"
IPython.display.Audio(audio_question_1)

Apply the ASR model to transcribe the speech

In [None]:
transcribed_text = asr_model.transcribe([audio_question_1])
transcribed_text

Send transcription as request to RIVA server and get response

In [None]:
req.query = str(transcribed_text[0]).strip()
req.context = context_2
resp = riva_nlp.NaturalQuery(req)

In [None]:
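def extract_answer(results):
    """Return the answer text from the first result in the response."""
    # The first line of the printed result holds the answer as: <field>: "<text>"
    first_line = str(results[0]).split('\n')[0]
    # Split only on the first ':' so answers that contain a colon are not truncated
    answer = first_line.split(':', 1)[1]
    return answer.replace('"', '').strip()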

In [None]:
response =  extract_answer(resp.results)

### TTS

The text-to-speech (TTS) pipeline has two stages. First, it generates a mel-spectrogram using the first model (FastPitch in this notebook), and then it generates speech from that spectrogram using the second model (the HiFi-GAN vocoder). This pipeline forms a TTS system that enables synthesis of natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.
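
The second stage ends with a raw waveform, which this notebook writes to disk at 22,050 Hz. As a rough illustration of that final step only, the sketch below writes floating-point samples to a 16-bit WAV file using the standard library. The sine tone stands in for vocoder output, samples are assumed to lie in [-1.0, 1.0], and `write_wav` is a hypothetical helper that is not part of NeMo or Riva.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Sequence


def write_wav(path: str, samples: Sequence[float], sample_rate: int = 22050) -> None:
    """Write mono float samples in [-1.0, 1.0] to `path` as 16-bit PCM WAV."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# A 0.1 second, 440 Hz tone as a stand-in for synthesized speech
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(2205)]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "tone.wav")
    write_wav(out_path, tone)
    with wave.open(out_path, "rb") as wav:
        print(wav.getframerate(), wav.getnframes())
```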

Run the cells below to apply the TTS models and convert the response to speech audio.

In [None]:
import soundfile as sf

# Download and load the pretrained fastpitch model
from nemo.collections.tts.models import FastPitchModel
spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")

# Download and load the pretrained hifigan model
from nemo.collections.tts.models import HifiGanModel
vocoder = HifiGanModel.from_pretrained(model_name="tts_en_hifigan")


In [None]:
def texttospeech(text):
    # All spectrogram generators start by parsing raw strings to a tokenized version of the string
    parsed = spec_generator.parse(text)
    # Produce a spectrogram from the tokens
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
    # Convert the spectrogram to audio
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    return audio

Save and Display response speech

In [None]:
answer_speech = texttospeech(response)
path_to_save = "../data/speech/audio_answer_1.wav"
#Save the audio to disk
sf.write(path_to_save, answer_speech.to('cpu').detach().numpy()[0], 22050)

IPython.display.Audio(path_to_save)

In [None]:
response

**From Context_2, create second speech query**

Run the cell below to set the speech audio

In [None]:
#req.query = "Based on historical observations, climate can be expressed in terms of what ?"
audio_question_2 = "../data/speech/audio_question_2.wav"
IPython.display.Audio(audio_question_2)


Run the cell below to get the speech response

In [None]:
transcribed_text_2 = asr_model.transcribe([audio_question_2])
req.query = str(transcribed_text_2[0]).strip()
req.context = context_2
resp = riva_nlp.NaturalQuery(req)
response_2 =  extract_answer(resp.results)

answer_speech_2 = texttospeech(response_2)
path_to_save = "../data/speech/audio_answer_2.wav"

# Save the audio to disk at path_to_save
sf.write(path_to_save, answer_speech_2.to('cpu').detach().numpy()[0], 22050)

IPython.display.Audio(path_to_save)

In [None]:
response_2

**You can stop all running Docker containers.**

In [None]:
!docker stop $(docker ps -q) 

---

**It is advisable not to run the section below in a Bootcamp session, as it takes a long time to execute.**

## Deploying Model from NGC

This section shows how to use the models available on NGC. Rather than building your own custom model, you can pull prebuilt models and their `.rmir` files from the cloud and deploy them to Riva. The list of these models (ASR, NLP, and TTS) is in the `config.sh` file. The steps for deploying the models are:

- modify the config.sh
- modify the riva_start.sh
- run the riva_init.sh
- run riva_start.sh
- start inference

### config.sh
Reset the `config.sh` file to the default settings as follows:

```bash
# Enable or Disable Riva Services
service_enabled_asr=true
service_enabled_nlp=true
service_enabled_tts=true

MODEL_DEPLOY_KEY="tlt_encode"

riva_model_loc="riva-model-repo"

use_existing_rmirs=false
```
You can comment out unneeded models within the script. For demo purposes, we only use `rmir_nlp_question_answering_bert_base`; the others are commented out.

### riva_start.sh

Reset the script to the default settings as shown below:

```bash
...
93 docker_run_args="-p 8000 -p 8001 -p 8002 $image_speech_api start-riva --riva-uri=0.0.0.0:$riva_speech_api_port --asr_service=$service_enabled_asr --tts_service=$service_enabled_tts --nlp_service=$service_enabled_nlp $ssl_args &> /dev/null"
94 ...
```

### Run riva_init.sh
The script pulls the models' `.rmir` file(s) and the prebuilt model artifacts from NGC. `riva-deploy` is also run behind the scenes to deploy the `.rmir` file(s).

In [None]:
!cd $RIVA_DIR && ./riva_init.sh config.sh 

**The output should look similar to the one shown below:**

```bash

Logging into NGC docker registry if necessary...
Pulling required docker images if necessary...
Note: This may take some time, depending on the speed of your Internet connection.
> Pulling Riva Speech Server images.
  > Image nvcr.io/nvidia/riva/riva-speech:2.6.0 exists. Skipping.
  > Image nvcr.io/nvidia/riva/riva-speech:2.6.0-servicemaker exists. Skipping.

Downloading models (RMIRs) from NGC...

---
2022-11-01 14:23:14,979 [INFO] processed 330000 lines
2022-11-01 14:23:15,043 [INFO] skipped 0 empty lines
2022-11-01 14:23:15,043 [INFO] filtered 0 lines
2022-11-01 14:23:15,045 [INFO] Extract_binaries for conformer-en-US-asr-offline -> /data/models/conformer-en-US-asr-offline/1
2022-11-01 14:23:15,045 [INFO] extracting {'wfst_tokenizer': '/mnt/nvdl/datasets/jarvis_speech_ci/model_files/sp-itn/22.09/en/tokenize_and_classify.far', 'wfst_verbalizer': '/mnt/nvdl/datasets/jarvis_speech_ci/model_files/sp-itn/22.09/en/verbalize.far'} -> /data/models/conformer-en-US-asr-offline/1
+ '[' 0 -ne 0 ']'
+ [[ amd64 == \t\e\g\r\a ]]
+ echo

+ echo 'Riva initialization complete. Run ./riva_start.sh to launch services.'
Riva initialization complete. Run ./riva_start.sh to launch services.


```

### Run riva_start.sh

The script starts the Riva server.

In [None]:
# Run Riva Start.
!cd $RIVA_DIR && ./riva_start.sh config.sh

When you see the output below, it means your model(s) have been successfully deployed and the Riva server is ready for inference requests.

```bash
Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
d51f487128807235ce4bf258853c7c7387ea9ae019ba1e47a05694673b0b0342
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...
```

In [None]:
#!docker logs riva-speech

## Run Inference
Once the Riva server is up and running with your models, you can send inference requests querying the server.

In [None]:
import grpc
import riva_api.riva_nlp_pb2 as rnlp
import riva_api.riva_nlp_pb2_grpc as rnlp_srv


grpc_server =  "localhost:50051"
channel = grpc.insecure_channel(grpc_server)
riva_nlp = rnlp_srv.RivaLanguageUnderstandingStub(channel)

req = rnlp.NaturalQueryRequest()

# Adjacent string literals are concatenated, so each continuation starts with a space
context_1 = "In 2010 the Amazon rainforest experienced another severe drought, in some ways more extreme than the 2005 drought." \
                " The affected region was approximate 1,160,000 square miles (3,000,000 km2) of rainforest, compared to 734,000 square miles (1,900,000 km2)" \
                " in 2005. The 2010 drought had three epicenters where vegetation died off, whereas in 2005 the drought was focused on the southwestern part." \
                " The findings were published in the journal Science. In a typical year the Amazon absorbs 1.5 gigatons of carbon dioxide; during 2005" \
                " instead 5 gigatons were released and in 2010 8 gigatons were released."


req.query = "How many tons of carbon are absorbed by the Amazon in a typical year?"

req.context = context_1
resp = riva_nlp.NaturalQuery(req)
print(resp)

In [None]:
req.query = "the affected region by drought in 2010 is approximately?"
req.context = context_1
resp = riva_nlp.NaturalQuery(req)
print(resp.results[0])

You can stop all Docker containers before shutting down the Jupyter kernel. **Caution: The following command will stop all running containers.**

In [None]:
!docker stop $(docker ps -q)

## References

https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tts/intro.html

https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Online_ASR_Microphone_Demo.ipynb

https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/starthere/tutorials.html

https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html

https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-overview.html

---
## Licensing

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
