<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 8.0 ASR-NLP-TTS Deployment
## (part of Lab 2)

In this notebook, you'll deploy a full pipeline with [NVIDIA Riva](https://developer.nvidia.com/riva). After building a "plain vanilla" end-to-end application using out-of-the-box models, you'll customize the pipeline for a specific restaurant use case.

**[8.1 Full Pipeline With OOTB Models](#8.1-Full-Pipeline-With-OOTB-Models)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[8.1.1 Model Deployment](#8.1.1-Model-Deployment)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.1.2 Exercise: Riva Configuration](#8.1.2-Excercise:-Riva-Configuration)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.1.3 Riva Start Services](#8.1.3-Riva-Start-Services)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.1.4 OOTB Pipeline Demo](#8.1.4-OOTB-Pipeline-Demo)<br>
**[8.2 ASR Customization](#8.2-ASR-Customization)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[8.2.1 Word Boosting](#8.2.1-Word-Boosting)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.2.2 Exercise: Negative Word Boost](#8.2.2-Exercise:-Negative-Word-Boost)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.2.3 Lexicon Customization](#8.2.3-Lexicon-Customization)<br>
**[8.3 NER Customization](#8.3-NER-Customization)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[8.3.1 IOB Tagging](#8.3.1-IOB-Tagging)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.3.2 Restaurant Context for NER](#8.3.2-Restaurant-Context-for-NER)<br>
**[8.4 TTS Customization](#8.4-TTS-Customization)<br>**
**[8.5 Customized Restaurant Pipeline Demo](#8.5-Customized-Restaurant-Pipeline-Demo)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[8.5.1 Exercise: Run a Custom Pipeline](#8.5.1-Exercise:-Run-a-Custom-Pipeline)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.5.2 Shut Down Riva](#8.5.2-Shut-Down-Riva)<br>
**[8.6 Shut Down the Kernel](#8.6-Shut-Down-the-Kernel)<br>**

### Notebook Dependencies
The steps in this notebook assume that you have:

1. **NGC Credentials**<br>Be sure you have added your NGC credentials as described in the [NGC Setup notebook](003_Intro_NGC_Setup.ipynb).

---
# 8.1 Full Pipeline With OOTB Models

In previous notebooks, we took a close look at ASR and TTS speech models.  In this notebook, we'll add NLP (Natural Language Processing) models to build a full pipeline. NLP models interpret text in various ways so that action can be taken based on the text's meaning. We'll use the standard OOTB (out-of-the-box) models that Riva pulls from NGC.



## 8.1.1 Model Deployment
The default ASR and TTS service models were deployed earlier in their own model repositories, `/dli_workspace/riva-asr-model-repo` and `/dli_workspace/riva-tts-model-repo`.  For the full pipeline, we'll need to deploy default models for ASR, TTS, and NLP services.  The `riva_init.sh` command loads and builds models specific to the GPU you are using, but to save time for this course, these have been preloaded.

The optimized NLP models are already located in `/dli_workspace/riva-full-model-repo`, but let's go ahead and copy the ASR and TTS models there as well to save some build time:

In [None]:
# Set the Riva Quick Start directory
WORKSPACE='/dli_workspace'
RIVA_QS = WORKSPACE + "/riva_quickstart"
RIVA_MODEL_REPO = WORKSPACE + "/riva-full-model-repo"
!mkdir -p $RIVA_MODEL_REPO

In [None]:
%%bash
# Copy all the ASR and TTS models for convenience (faster deployment)
# Time is about 1-2 minutes for the copy
cp -rn  /dli_workspace/riva-asr-model-repo/* \
    /dli_workspace/riva-full-model-repo/
cp -rn  /dli_workspace/riva-tts-model-repo/* \
    /dli_workspace/riva-full-model-repo/

In [None]:
# check to see what models are there now
!ls $RIVA_MODEL_REPO/models

## 8.1.2 Excercise: Riva Configuration

Open [config.sh](dli_workspace/riva_quickstart/config.sh) and modify it to deploy all three services (ASR, NLP, TTS) using the `/dli_workspace/riva-full-model-repo` location that we've preloaded with all the models.  Save your work.

If you're not sure what to change, take a peek at the [solution](solutions/ex8.1.2_config.sh).

Check your work.  The `diff` comparison in the following cell should have no output.

In [None]:
# Check your work
!diff solutions/ex8.1.2_config.sh dli_workspace/riva_quickstart/config.sh

In [None]:
# Quick fix!
!cp solutions/ex8.1.2_config.sh dli_workspace/riva_quickstart/config.sh

## 8.1.3 Riva Start Services

The `riva_init.sh` script downloads the required Riva containers, downloads the models listed in `config.sh`, and optimizes the models as required with [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt). Since we've already downloaded the containers and preloaded the optimized models, `riva_init.sh` won't have much to do, but it is provided here for completeness.

The `riva_start.sh` script starts the server.

In [None]:
# Initialize Riva 
!cd $RIVA_QS && bash riva_init.sh config.sh

In [None]:
# Start the Riva server (about 1 minute)
!cd $RIVA_QS && bash riva_start.sh config.sh

## 8.1.4 OOTB Pipeline Demo
Now that we have the Riva server running with all the OOTB models, let's build an application that will:

1. Transcribe audio into text (ASR)
2. Find a name in the text (NER)
3. Determine what text to output in response (DM)
4. Output audio of the text response (TTS)

<img src="images/pipeline/full_pipeline_ootb.png">

We've already explored how ASR (step 1) and TTS (step 4) work, but now we add a Named Entity Recognition (NER) model and a simple Dialog Manager (DM) to the pipeline.  

NER is a Natural Language Processing (NLP) task.  NER, also referred to as entity chunking, identification, or extraction, is the task of detecting and classifying key information (entities) in text. In other words, an NER model takes a piece of text as input and for each word in the text, the model identifies a category the word belongs to. For example, in a sentence: "Mary lives in Santa Clara and works at NVIDIA", the NER model should detect that Mary is a person, Santa Clara is a location and NVIDIA is a company.

The DM is responsible for keeping track of the conversation state and determining responses.  For our purposes, we will use very simple slot-filling to create a response.

Begin by importing the Riva client and some other useful libraries.

In [None]:
import riva.client
import numpy as np
import IPython.display as ipd
import io
import time
import librosa

**Step 1: ASR**

Create an ASR function to transcribe audio to text. Then give it a try using a simple sentence.

In [None]:
# Define a Python function to transcribe text from audio
def asr_predict(SAMPLE):
    auth = riva.client.Auth(uri='localhost:50051')
    riva_asr = riva.client.ASRService(auth)
    # This example uses a .wav file with LINEAR_PCM encoding.
    # read in an audio file from local disk
    # Set up an offline/batch recognition request
    config = riva.client.RecognitionConfig()
    config.language_code = "en-US"                    # Language code of the audio clip
    config.max_alternatives = 1                       # How many top-N hypotheses to return
    config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
    config.audio_channel_count = 1     
    with io.open(SAMPLE, 'rb') as fh:
        content = fh.read()
    response = riva_asr.offline_recognize(content, config)
    transcript=response.results[0].alternatives[0].transcript
    return transcript

In [None]:
SAMPLE="/dli_workspace/data/audio_sample_resampled2.wav"

with io.open(SAMPLE, 'rb') as fh:
    content = fh.read()
ipd.Audio(SAMPLE, autoplay=True)

In [None]:
transcript=asr_predict(SAMPLE)
print("ASR Transcript:", transcript)

**Step 2: NER**

Create an NER function to find special words, such as persons, locations, and organizations, within the transcribed text.  For the sentence "Hi, my name is Dana and I work for NVIDIA", the NER model should recognize "Dana" as a name and "NVIDIA" as an organization.

In [None]:
# Define a Python function to extract entities from text
def ner_predict(text):
    auth = riva.client.Auth(uri='localhost:50051')
    service = riva.client.NLPService(auth)
    tokens, slots, slot_confidences, starts, ends = riva.client.extract_most_probable_token_classification_predictions(
        service.classify_tokens(input_strings=text, model_name='riva_ner'))
    return tokens[0],slots[0]

In [None]:
tokens,slots=ner_predict(transcript)
print(tokens)
print(slots)

**Step 3: DM**

For demonstration purposes, we'll build a very basic dialog manager that just looks for a person's name in the sentence and uses that name in the response text (if it exists).
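
Slot filling can be sketched independently of Riva. The example below is only an illustration: the `tokens` and `slots` values are hand-written stand-ins for NER output, and the helper names `fill_slots` and `greet` are hypothetical, not part of the Riva client.

```python
# Hand-written stand-in for NER output on
# "Hi, my name is Dana and I work for NVIDIA":
# tokens[i] is the entity text and slots[i] is its label.
tokens = ["Dana", "NVIDIA"]
slots = ["PER", "ORG"]


def fill_slots(tokens, slots):
    """Map each slot label to the first token that carries it."""
    filled = {}
    for token, slot in zip(tokens, slots):
        filled.setdefault(slot, token)
    return filled


def greet(filled):
    """Build a greeting, using the person's name only if that slot is filled."""
    if "PER" in filled:
        return "Hi " + filled["PER"] + ", how can I help you?"
    return "Hi, how can I help you?"


print(greet(fill_slots(tokens, slots)))  # Hi Dana, how can I help you?
print(greet(fill_slots([], [])))         # Hi, how can I help you?
```

Keeping the filled slots in a dictionary makes it straightforward to add further slots (for example an organization) without changing the lookup logic.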

In [None]:
# Dialog Manager
def dm_predict(slots, tokens):
    if "PER" in slots:
        index = slots.index("PER")
        response="Hi " + tokens[index] + ", how can I help you?"
    else:
        response="Hi, how can I help you?"
    return response

In [None]:
response = dm_predict(slots, tokens)
print(response)

**Step 4: TTS**

Finally, we'll use the TTS model to output the response sentence as audio.

In [None]:
sample_rate_hz = 44100

# helper function for more readable output
def remove_braces(braced_text):
    return braced_text.replace("{@","").replace("}","")

# Define a Python function to create speech from text
def tts_predict(text):
    auth = riva.client.Auth(uri='localhost:50051')
    riva_tts = riva.client.SpeechSynthesisService(auth)
    req = { 
            "language_code"  : "en-US",
            "encoding"       : riva.client.AudioEncoding.LINEAR_PCM ,   # Currently only LINEAR_PCM is supported
            "sample_rate_hz" : sample_rate_hz,                          # Generate 44.1KHz audio
            "voice_name"     : "English-US.Female-1"                    # The name of the voice to generate
    }
    req["text"] = text
    resp = riva_tts.synthesize(**req)
    audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
    return audio_samples, remove_braces(resp.meta.processed_text)

In [None]:
audio_samples, processed_text =tts_predict(response)
print(processed_text)
ipd.Audio(audio_samples, rate=sample_rate_hz, autoplay=True)

We have all the pieces, and have tested them individually.  Now let's run it as one pipeline.
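
The pipeline cell that follows waits for the length of the input clip before playing the response, so the two audio players do not overlap. As a rough sketch of what "duration" means for an uncompressed WAV file, the example below computes it with Python's standard `wave` module on a clip generated in memory. The helper name `wav_duration_seconds` and the 16 kHz test clip are assumptions for illustration; the notebook itself uses `librosa.get_duration()`.

```python
import io
import wave


def wav_duration_seconds(wav_file):
    """Return the duration of a PCM WAV file: frame count / frame rate."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Build one second of 16-bit mono silence in memory (illustrative data only).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)        # mono
    wav.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16000)    # 16 kHz
    wav.writeframes(b"\x00\x00" * 16000)

buffer.seek(0)
duration = wav_duration_seconds(buffer)
print(duration)  # 1.0
```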

In [None]:
# Put it all together

SAMPLE="/dli_workspace/data/audio_sample_resampled2.wav"
print("First Audio sample:")
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# get input audio duration
d=librosa.get_duration(filename=SAMPLE)
# call Riva ASR  
transcript = asr_predict(SAMPLE)
# call Riva NER
tokens,slots = ner_predict(transcript)
# call Dialog Manager
dm_response = dm_predict(slots, tokens)
# call Riva TTS
synth_audio, processed_text = tts_predict(dm_response)

time.sleep(d)
print("Virtual Assistant Response:")
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

---
# 8.2 ASR Customization

Imagine an application with the same basic pipeline that receives restaurant orders.  It may be the case that the standard ASR model does not recognize a particular dish available on the restaurant menu, so recognition of that dish needs to be added to the dictionary.  For our example, let's say that "couscous" has been ordered.  

In [None]:
SAMPLE="/dli_workspace/data/couscous-left.wav"
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# get input audio duration
d=librosa.get_duration(filename=SAMPLE)
# call Riva ASR  
transcript=asr_predict(SAMPLE)
print(transcript)

The spoken word, "couscous", is not correctly transcribed by the ASR.  Instead, it thinks the spelling is "Css".  Is that because it doesn't know the word, or because it just didn't recognize it?  We can check by searching for the correct spelling in the Conformer-CTC model lexicon.

In [None]:
import os
CONFORMER_OFFLINE = "conformer-en-US-asr-offline-ctc-decoder-cpu-streaming-offline"
LEXICON = os.path.join(RIVA_MODEL_REPO, "models", CONFORMER_OFFLINE, "1", "lexicon.txt")

! grep couscous $LEXICON

The word "couscous" is already part of the Riva ASR vocabulary, but was not recognized as the likely transcription. Since this is a restaurant context, we want to improve the likelihood that "couscous" is transcribed, which we can do with word boosting. 

_Note: If we have a large list of words of interest like this, we could fine-tune the language model, which would make word boosting unnecessary._

## 8.2.1 Word Boosting 

Word boosting is the easiest way to customize Riva ASR. The boosting happens on the client side, at request time. The user specifies a list of words of interest that are likely to appear and assigns them new (higher) scores. These user-specified scores are then used when decoding the output of the acoustic model.

To boost the word "couscous", we can use the [`riva.client.add_word_boosting_to_config()`](https://github.com/nvidia-riva/python-clients/blob/928c63273176a939500e01ce176c463f1606a1ff/riva_api/asr.py#L78) function to specify the list of words and their scores. 
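
To build intuition for how a boost score can change the result, here is a deliberately simplified toy model. It is not Riva's decoder: the hypothesis scores are made up, and the `rescore` and `best` helpers are hypothetical. The idea is only that each candidate transcript has a score, and a boosted word shifts the score of every candidate that contains it.

```python
def rescore(hypotheses, boosted_words, boost_score):
    """Add boost_score to a hypothesis once per boosted word it contains."""
    rescored = {}
    for text, score in hypotheses.items():
        hits = sum(1 for word in text.lower().split() if word in boosted_words)
        rescored[text] = score + boost_score * hits
    return rescored


def best(hypotheses):
    """Return the transcript with the highest score."""
    return max(hypotheses, key=hypotheses.get)


# Made-up decoder scores (higher is better); not real Riva output.
hypotheses = {"i would like css": -3.0, "i would like couscous": -5.0}

print(best(hypotheses))                                # i would like css
print(best(rescore(hypotheses, {"couscous"}, 4.0)))    # i would like couscous
print(best(rescore(hypotheses, {"couscous"}, 1.0)))    # i would like css
print(best(rescore(hypotheses, {"couscous"}, -4.0)))   # i would like css
```

In this toy model a boost of 1.0 is too small to overtake the competing hypothesis, while 4.0 is enough, and a negative boost pushes the word further away from being selected.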


In [None]:
# predict with word boosting
def asr_predict_WB(SAMPLE, boost, score):
    auth = riva.client.Auth(uri='localhost:50051')
    riva_asr = riva.client.ASRService(auth)
    # This example uses a .wav file with LINEAR_PCM encoding.
    # read in an audio file from local disk
    # Set up an offline/batch recognition request
    config = riva.client.RecognitionConfig()
    config.language_code = "en-US"                    # Language code of the audio clip
    config.max_alternatives = 1                       # How many top-N hypotheses to return
    config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
    config.audio_channel_count = 1     
    riva.client.add_word_boosting_to_config(config, [boost], score)    # ****** WORD BOOSTING ******
    with io.open(SAMPLE, 'rb') as fh:
        content = fh.read()
    response = riva_asr.offline_recognize(content, config)
    transcript=response.results[0].alternatives[0].transcript
    return transcript

In [None]:
SAMPLE="/dli_workspace/data/couscous-left.wav"
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# transcribe while boosting couscous with the score 4.0
transcript=asr_predict_WB(SAMPLE,"couscous", 4.0)
print(transcript)

## 8.2.2 Exercise: Negative Word Boost

It's also possible to boost with negative numbers to reduce the likelihood of a particular transcription.  Try a few boost numbers, both negative and positive, to see what value is required to get the correct transcription of "couscous".

In [None]:
# transcribe while boosting couscous with various `myscore` values such as:  -4.0, 1.0, 2.0, 3.0
myscore = -4.0

SAMPLE="/dli_workspace/data/couscous-left.wav"
transcript=asr_predict_WB(SAMPLE,"couscous", myscore)
print(transcript)

## 8.2.3 Lexicon Customization

The lexicon is a raw text file that maps each word in the vocabulary to its tokenized form (the word and its tokens are separated by a tab). Tokens must be part of the acoustic model's vocabulary.<br> For example:

``` 
as      ▁as
with    ▁with
not     ▁not
don't   ▁don ' t
```
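
As a small sketch of this file format, the following reads lexicon lines into a dictionary that maps each word to all of its token sequences. The `parse_lexicon` helper and the sample lines are illustrative assumptions; the sketch assumes one `word<TAB>tokens` entry per line, with tokens separated by spaces.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to the list of token sequences listed for it."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        word, _, tokens = line.partition("\t")
        lexicon[word].append(tokens.split())
    return dict(lexicon)


# Illustrative entries; a word may appear on several lines,
# one line per pronunciation.
sample = [
    "as\t▁as",
    "don't\t▁don ' t",
    "bruschetta\t▁b ru s ch e t t a",
    "bruschetta\t▁b ru ce ▁k e t a",
]

lexicon = parse_lexicon(sample)
print(lexicon["as"])               # [['▁as']]
print(len(lexicon["bruschetta"]))  # 2
```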

Customizing the lexicon file creates explicit pronunciations of terms, in the form of tokenized sequences. 
Let's see this in action with the word `bruschetta` that could be pronounced `brusketa`. 

In [None]:
# load sample audio

SAMPLE="/dli_workspace/data/bruschetta_resampled.wav"
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# call Riva ASR  
transcript=asr_predict(SAMPLE)
print(transcript)

The transcription is "Bruce Keta", which is not accurate. Let's check if "bruschetta" is part of the vocabulary. 

In [None]:
! grep bruschetta $LEXICON

The word "bruschetta" is in the lexicon, but it was not recognized by the Riva ASR pipeline. The only pronunciation configured is `▁b ru s ch e t t a`. 

Let's provide more sequences to recognize the word when pronouncing it "bruce keta".  

We'll need to stop the Riva server, customize the pronunciation, and then restart with the update.

In [None]:
# Stop the Riva server. 
!bash $RIVA_QS/riva_stop.sh

Let's locate and load the tokenizer and test the sequence of tokens for the sentence "Hi, my name is Dana".

In [None]:
import glob
import sentencepiece as spm

# Locate the tokenizer (*.model) file in the Riva model repo.
# Globbing on the full path avoids os.chdir(), which would change the
# notebook's working directory for all later cells.
mydir = os.path.join(RIVA_MODEL_REPO, "models", CONFORMER_OFFLINE, "1")
model_files = glob.glob(os.path.join(mydir, "*.model"))
if not model_files:
    raise FileNotFoundError("No tokenizer .model file found in " + mydir)
tokenizer = model_files[-1]

# Load the tokenizer
s = spm.SentencePieceProcessor(model_file=tokenizer)

# tokenize the sentence
s.encode("Hi, my name is Dana",out_type=str)

To generate new lexicon possibilities for "bruschetta" using the `encode()` function, we need:
- PRONUNCIATION: What the word or phrase should sound like (our example: "bruce keta")
- TOKEN: The desired written form of the word (our example: "bruschetta")

Let's query for five variants of the sequence of tokens leading to this sound. 

In [None]:
TOKEN="bruschetta"
PRONUNCIATION="bruce keta"

for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

Add those pronunciation options to the Riva offline lexicon file.
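
Appending with `>>` adds the same lines again every time the cell is re-run. One possible alternative, sketched below in Python, appends only entries that are not already present. The `add_pronunciations` helper is hypothetical, and the demonstration runs on a throwaway file rather than the real lexicon; it assumes the existing file ends with a newline.

```python
import os
import tempfile


def add_pronunciations(lexicon_path, word, token_sequences):
    """Append 'word<TAB>tokens' lines that are not already in the file.

    Returns the number of lines actually added.
    """
    with open(lexicon_path, encoding="utf-8") as fh:
        existing = {line.rstrip("\n") for line in fh}
    new_lines = []
    for tokens in token_sequences:
        entry = word + "\t" + tokens
        if entry not in existing:
            new_lines.append(entry)
            existing.add(entry)
    with open(lexicon_path, "a", encoding="utf-8") as fh:
        for entry in new_lines:
            fh.write(entry + "\n")
    return len(new_lines)


# Demonstrate on a temporary file (illustrative content only).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lexicon.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("bruschetta\t▁b ru s ch e t t a\n")
    variants = ["▁b ru ce ▁k e t a", "▁b ru c e ▁ k e t a"]
    first = add_pronunciations(path, "bruschetta", variants)
    second = add_pronunciations(path, "bruschetta", variants)

print(first, second)  # 2 0
```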

In [None]:
!echo -e "bruschetta\t▁b ru ce ▁k e t a" >> $LEXICON
!echo -e "bruschetta\t▁b ru c e ▁ k e t a" >> $LEXICON
!echo -e "bruschetta\t▁b r u c e ▁k e t a" >> $LEXICON
!echo -e "bruschetta\t▁ b ru c e ▁k e t a" >> $LEXICON
!echo -e "bruschetta\t▁ b ru ce ▁ k e t a" >> $LEXICON

In [None]:
# Check that Riva lexicon is updated
! grep bruschetta $LEXICON

In [None]:
# Start the Riva server (about 1 minute)
!cd $RIVA_QS && bash riva_start.sh config.sh

Let's query the customized Riva ASR service again with the updated lexicon.

In [None]:
SAMPLE = "/dli_workspace/data/bruschetta_resampled.wav"
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# call Riva ASR  
transcript = asr_predict(SAMPLE)
print("Riva transcription After Lexicon Mapping:\n", transcript)

---
# 8.3 NER Customization
In our restaurant application, we will need to pick entities out of a conversation beyond names and organizations.  For example, we might like to identify cuisine or dishes in the conversation.  The [MIT Restaurant Corpus](https://groups.csail.mit.edu/sls/downloads/) dataset has labels identified with _IOB Tagging_, which is what we need for fine-tuning an NER model with NeMo.  

The actual training for the NER model is out of scope for this course, but the process is covered in detail in the DLI course, [Building Transformer-Based Natural Language Processing Applications](https://www.nvidia.com/en-us/training/instructor-led-workshops/natural-language-processing/).  For this class, we will use a custom NER restaurant model that was fine-tuned in NeMo using the Restaurant data.

## 8.3.1 IOB Tagging

The sentences and labels in the NER restaurant dataset map to each other with inside, outside, beginning (IOB) tagging. Anything separated by white space is a word, including punctuation. 

In [None]:
# let's take a look at the data 
print('Text:')
! head -n 8 /dli_workspace/data/restaurant/text_train.txt

print('\nLabels:')
! head -n 8 /dli_workspace/data/restaurant/labels_train.txt

The first eight lines of the training dataset are mapped to eight lines of labels. For example, look at the 6th line, "a place that serves soft serve ice cream":

```text
a    place    that    serves    soft    serve    ice    cream
O    O        O       O         B-Dish  I-Dish   I-Dish I-Dish
```

The IOB tags indicate that "soft" is the beginning of a "Dish" entity, and that the next three words, "serve ice cream" are also part of that entity.  The full entity identified is therefore "soft serve ice cream" as a "Dish". None of the other words in the sentence are identified as entities.  
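
The grouping described above can be sketched in a few lines of Python. The `iob_to_entities` helper is a hypothetical illustration, not part of NeMo or Riva; it assumes tags of the form `B-Type`, `I-Type`, or `O`, and ignores a stray `I-` tag that does not continue an open entity of the same type.

```python
def iob_to_entities(words, tags):
    """Group IOB-tagged words into (entity text, entity type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that is still open.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open entity.
            current_words.append(word)
        else:
            # "O" (or a stray I- tag) ends the open entity, if any.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


words = "a place that serves soft serve ice cream".split()
tags = ["O", "O", "O", "O", "B-Dish", "I-Dish", "I-Dish", "I-Dish"]
print(iob_to_entities(words, tags))  # [('soft serve ice cream', 'Dish')]
```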

The possible entity labels for this dataset are listed in [label_ids.csv](dli_workspace/data/restaurant/label_ids.csv).

## 8.3.2 Restaurant Context for NER
Let's try the basic OOTB NER model with a few restaurant queries.  In this example, we'll use NeMo rather than the Riva client.

In [None]:
import nemo
import nemo.collections.nlp as nemo_nlp
pretrained_ner_model = nemo_nlp.models.TokenClassificationModel.from_pretrained(model_name="ner_en_bert") 

In [None]:
# define a list of queries for inference
request_bruschetta = "I would like to order a bruschetta for 6pm"
request_pasta = "Can you recommend a good Italian pasta?"

queries = [request_bruschetta, request_pasta]
results = pretrained_ner_model.add_predictions(queries)

for query, result in zip(queries, results):
    print()
    print(f'Query : {query}')
    print(f'Result: {result.strip()}\n')

If our NER thinks "bruschetta" is an organization, it will be difficult to design a DM that detects what dish a customer wants to order! 
Let's try this again with a customized model based on the Restaurant dataset.

In [None]:
my_ner_model = nemo_nlp.models.TokenClassificationModel.restore_from(restore_path="/dli_workspace/riva-full-model-repo/ner_restaurant.nemo") 

In [None]:
results = my_ner_model.add_predictions(queries)
results

In [None]:
# Create a new definition of ner_predict using the custom model
def ner_predict(text):
    results = my_ner_model.add_predictions([text])
    return results[0]

## 8.3.3 Restaurant DM
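
The dialog manager in this section works on the string returned by `add_predictions()`, in which a tagged word is followed by its label in square brackets, for example `bruschetta[B-Dish]`. The sketch below shows one way to pull entities out of such a string with a regular expression. The `extract_entities` helper is hypothetical and the tagged strings are hand-written examples in that format, not actual model output. Labeled words are merged in order, so an `I-` word is attached to the preceding labeled word of the same type.

```python
import re

# A word followed by a bracketed IOB label, e.g. "soft[B-Dish]".
LABELED_WORD = re.compile(r"(\S+)\[([BI])-([^\]]+)\]")


def extract_entities(tagged_text):
    """Return (entity text, entity type) pairs found in a tagged string."""
    entities = []
    for word, prefix, label in LABELED_WORD.findall(tagged_text):
        if prefix == "B" or not entities or entities[-1][1] != label:
            entities.append([word, label])    # start a new entity
        else:
            entities[-1][0] += " " + word     # continue the last entity
    return [tuple(entity) for entity in entities]


# Hand-written strings in the assumed output format.
tagged = "a place that serves soft[B-Dish] serve[I-Dish] ice[I-Dish] cream[I-Dish]"
print(extract_entities(tagged))
# [('soft serve ice cream', 'Dish')]
print(extract_entities("I would like to order a bruschetta[B-Dish] for 6pm"))
# [('bruschetta', 'Dish')]
```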

In [None]:
Options = {"bruschetta": ["Tomato and Olive Oil", "Mozzarella and Basil", "Cherry Tomato and Garlic"], "pasta": ["Spaghetti Marinara", "Fettuccine Alfredo", "Linguini and Clams"]}

# Create a response with options for specific dishes
def dm_predict_restaurant(res):
    if "B-Dish" in res:
        index = res.index("B-Dish")
        dish = res[:index-1].split().pop()
        if dish in Options.keys():
            list_options = Options[dish]
            response = "What " + dish + " option would you like? We have: "
            for l in list_options:
                 response += l + ". "
        else:
            response = dish + ". What else would you like?"
    else:
        response = "Hi, how can I help you?"
    return response

In [None]:
dm_response_bruschetta = dm_predict_restaurant(ner_predict(request_bruschetta))
print("\n{}".format(request_bruschetta))
print(dm_response_bruschetta)

In [None]:
dm_response_pasta = dm_predict_restaurant(ner_predict(request_pasta))
print("\n{}".format(request_pasta))
print(dm_response_pasta)

---
# 8.4 TTS Customization
The correct pronunciation for "bruschetta" may be up for debate depending on where in the world you are.  For our application, we need to settle on a consistent pronunciation, so assume we want that "bruce keta" type of pronunciation we focused on earlier. Without any customization, the TTS model does not pronounce it that way.

In [None]:
# Generate speech with the OOTB TTS model
dm_response = dm_response_bruschetta
synth_audio, processed_text =tts_predict(dm_response)
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

We can try different pronunciations with the NeMo aligner model as follows:

In [None]:
from nemo.collections.tts.models import AlignerModel
aligner = AlignerModel.from_pretrained("tts_en_radtts_aligner_ipa")

In [None]:
input_string = "broo sketah"
text_g2p = aligner.tokenizer.g2p(input_string)
print(text_g2p)
text_tokens = aligner.tokenizer(input_string)
print(text_tokens)
print("\n" + ''.join(text_g2p))
synth_audio, processed_text =tts_predict(''.join(text_g2p))
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

In [None]:
input_string = "brous KETA"
text_g2p = aligner.tokenizer.g2p(input_string)
print(text_g2p)
text_tokens = aligner.tokenizer(input_string)
print(text_tokens)
print("\n" + ''.join(text_g2p))
synth_audio, processed_text =tts_predict(''.join(text_g2p))
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

After trying a few, alter the input to the TTS model with a simple `replace()` in the string output from the DM.
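
A plain `replace()` is case-sensitive and also matches inside longer words. If the application grows beyond one dish, a small substitution table with whole-word, case-insensitive matching may be easier to maintain. The sketch below is one possible approach; the `apply_pronunciations` helper and the `PRONUNCIATIONS` table are hypothetical, and the keys are assumed to be lowercase.

```python
import re

# Hypothetical respellings (lowercase keys); tune each one by listening
# to the synthesized audio.
PRONUNCIATIONS = {"bruschetta": "BROO SKETAH"}


def apply_pronunciations(text, pronunciations):
    """Replace whole words, case-insensitively, with their respelling."""
    if not pronunciations:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in pronunciations) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: pronunciations[m.group(1).lower()], text)


print(apply_pronunciations("What bruschetta option would you like?", PRONUNCIATIONS))
# What BROO SKETAH option would you like?
print(apply_pronunciations("Bruschetta. What else would you like?", PRONUNCIATIONS))
# BROO SKETAH. What else would you like?
```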

In [None]:
# generate speech 
custom_dm_response = dm_response.replace("bruschetta", "BROO SKETAH") 

synth_audio, processed_text =tts_predict(custom_dm_response)
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

---
# 8.5 Customized Restaurant Pipeline Demo

We've customized several parts of our pipeline:
1. ASR - added detection of additional pronunciations with the lexicon
2. NER - incorporated a fine-tuned NeMo model for restaurant context
3. DM - added refinements for our simple example
4. TTS - added pronunciation substitution

<img src="images/pipeline/restaurant_pipeline.png">



## 8.5.1 Exercise: Run a Custom Pipeline

In the following cell, complete the "TODO" sections to run the full pipeline with customizations:
1. ASR customized lexicon for "bruschetta" (this should already be in place)
2. NER customized NeMo model
3. DM with the restaurant options
4. TTS with substituted pronunciation for "bruschetta"

If you get stuck, you can check the [solution](solutions/ex8.5.1.ipynb).

In [None]:
# Restaurant Pipeline
# Put it all together

SAMPLE="/dli_workspace/data/bruschetta_resampled.wav"
print("First Audio sample:")
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# get input audio duration
d=librosa.get_duration(filename=SAMPLE)

# TODO call Riva ASR  
# TODO call NeMo NER
# TODO call Dialog Manager
# TODO call Riva TTS

time.sleep(d)
print("Virtual Assistant Response:")
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

## 8.5.2 Shut Down Riva

In [None]:
# Stop the Riva server 
!bash $RIVA_QS/riva_stop.sh

---
# 8.6 Shut Down the Kernel
<h3 style="color:red;">Important!</h3>

From the menu above, choose ***Kernel->Shut Down Kernel*** to fully clear GPU memory before moving on.

---
<h2 style="color:green;">Congratulations!</h2>

In this notebook, you have:
- Built a full conversational AI pipeline
- Customized ASR, NLP, and TTS in a full pipeline
- Built a full pipeline customized for a restaurant context

This concludes the TTS portion of the course.<br>
Next, you'll work with deployment of Riva at scale using Kubernetes, starting with [Enabling GPU within Kubernetes](009_K8s_Enable.ipynb).

<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>