# NVIDIA Riva for Automatic Speech Recognition and Text To Speech
## Part 1: Getting Started

The `RivaASR` and `RivaTTS` utility runnables are LangChain runnables that integrate [NVIDIA Riva](https://www.nvidia.com/en-us/ai-data-science/products/riva/) into LCEL chains for Automatic Speech Recognition (ASR) and Text To Speech (TTS).

This example shows how to use these runnables, together with an audio streaming class, to interact with an LLM through streamed speech.

## 1. NVIDIA Riva Runnables
There are two Riva runnables:

a. **RivaASR**: Converts audio bytes into text for an LLM using NVIDIA Riva.

b. **RivaTTS**: Converts text into audio bytes using NVIDIA Riva.

### a. RivaASR
The [**RivaASR**](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/utilities/nvidia_riva.py#L404) runnable converts audio bytes into a string for an LLM using NVIDIA Riva. 

It's useful for sending an audio stream (a message containing streaming audio) into a chain and preprocessing that audio by converting it to a string to create an LLM prompt. 

```
ASRInputType = AudioStream # the AudioStream type is a custom type for a message queue containing streaming audio
ASROutputType = str

class RivaASR(
    RivaAuthMixin,
    RivaCommonConfigMixin,
    RunnableSerializable[ASRInputType, ASROutputType],
):
    """A runnable that performs Automatic Speech Recognition (ASR) using NVIDIA Riva."""

    name: str = "nvidia_riva_asr"
    description: str = (
        "A Runnable for converting audio bytes to a string. "
        "This is useful for feeding an audio stream into a chain and "
        "preprocessing that audio to create an LLM prompt."
    )

    # riva options
    audio_channel_count: int = Field(
        1, description="The number of audio channels in the input audio stream."
    )
    profanity_filter: bool = Field(
        True,
        description=(
            "Controls whether or not Riva should attempt to filter "
            "profanity out of the transcribed text."
        ),
    )
    enable_automatic_punctuation: bool = Field(
        True,
        description=(
            "Controls whether Riva should attempt to correct "
            "sentence punctuation in the transcribed text."
        ),
    )
```

When called, this runnable reads from an input audio stream that acts as a queue and concatenates the transcription as chunks are returned. Once the transcription is complete, it returns a single string.


### b. RivaTTS
The [**RivaTTS**](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/utilities/nvidia_riva.py#L511) runnable converts text output to audio bytes. 

It's useful for sending a streamed textual response from an LLM into a chain and processing that response by converting it to audio bytes that sound like a natural human voice to be played back to the user. 

```
TTSInputType = Union[str, AnyMessage, PromptValue]
TTSOutputType = bytes

class RivaTTS(
    RivaAuthMixin,
    RivaCommonConfigMixin,
    RunnableSerializable[TTSInputType, TTSOutputType],
):
    """A runnable that performs Text-to-Speech (TTS) with NVIDIA Riva."""

    name: str = "nvidia_riva_tts"
    description: str = (
        "A tool for converting text to speech. "
        "This is useful for converting LLM output into audio bytes."
    )

    # riva options
    voice_name: str = Field(
        "English-US.Female-1",
        description=(
            "The voice model in Riva to use for speech. "
            "Pre-trained models are documented in "
            "[the Riva documentation]"
            "(https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html)."
        ),
    )
    output_directory: Optional[str] = Field(
        None,
        description=(
            "The directory where all audio files should be saved. "
            "A null value indicates that wave files should not be saved. "
            "This is useful for debugging purposes."
        ),
    )
```

When this runnable is called on an input, it takes iterable text chunks and streams them into output audio bytes that are either written to a `.wav` file or played out loud.
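
As a rough illustration of the "audio bytes written to a `.wav` file" step, the sketch below writes raw audio chunks with Python's built-in `wave` module. It does not use Riva. The helper name `write_wav`, the 16-bit mono PCM format, and the 8000 Hz sample rate are assumptions made for the example; the actual format depends on how the Riva service is configured.

```python
import os
import tempfile
import wave
from typing import Iterable


def write_wav(path: str, audio_chunks: Iterable[bytes], sample_rate: int = 8000) -> int:
    """Write raw 16-bit mono PCM chunks to a .wav file; return the bytes written."""
    total = 0
    with wave.open(path, "wb") as wav_out:
        wav_out.setnchannels(1)
        wav_out.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM (assumed)
        wav_out.setframerate(sample_rate)
        for chunk in audio_chunks:
            wav_out.writeframes(chunk)
            total += len(chunk)
    return total


# Two chunks of silence standing in for synthesized speech.
fake_audio = [b"\x00\x00" * 4000, b"\x00\x00" * 4000]
out_path = os.path.join(tempfile.mkdtemp(), "response.wav")
bytes_written = write_wav(out_path, fake_audio)

# Read the file back to confirm its duration.
with wave.open(out_path, "rb") as wav_in:
    duration_s = wav_in.getnframes() / wav_in.getframerate()
print(bytes_written, duration_s)
```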

## 2. Installation

The NVIDIA Riva client library must be installed.

In [None]:
%pip install --upgrade --quiet nvidia-riva-client

## 3. Setup

**To get started with NVIDIA Riva:**

1. Follow the Riva Quick Start setup instructions for [Local Deployment Using Quick Start Scripts](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html#local-deployment-using-quick-start-scripts).
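
After the Quick Start scripts finish, it can help to confirm that something is listening on the server port before building the chain. The sketch below is a generic TCP check using only the standard library; `port_is_open` is a hypothetical helper, and a successful connection only shows that a process is listening, not that it is a working Riva service.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 50051 is the port used for the Riva server URL later in this notebook.
print(port_is_open("localhost", 50051))
```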

## 4. Building a Chain
### a. Import RivaASR and RivaTTS Runnables

In [12]:
import json
from langchain_community.utilities.nvidia_riva import (
    RivaASR,
    RivaTTS,
)

Let's view the runnable schemas.

In [14]:
print(json.dumps(RivaASR.schema(), indent=2))
print(json.dumps(RivaTTS.schema(), indent=2))

{
  "title": "RivaASR",
  "description": "A runnable that performs Automatic Speech Recognition (ASR) using NVIDIA Riva.",
  "type": "object",
  "properties": {
    "name": {
      "title": "Name",
      "default": "nvidia_riva_asr",
      "type": "string"
    },
    "encoding": {
      "description": "The encoding on the audio stream.",
      "default": "LINEAR_PCM",
      "allOf": [
        {
          "$ref": "#/definitions/RivaAudioEncoding"
        }
      ]
    },
    "sample_rate_hertz": {
      "title": "Sample Rate Hertz",
      "description": "The sample rate frequency of audio stream.",
      "default": 8000,
      "type": "integer"
    },
    "language_code": {
      "title": "Language Code",
      "description": "The [BCP-47 language code](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) for the target language.",
      "default": "en-US",
      "type": "string"
    },
    "url": {
      "title": "Url",
      "description": "The full URL where the Riva service can be found.",

### b. Convert Audio File to Chunks
To mimic streaming, read in a single-channel `.wav` file and convert it to chunks of audio bytes. 

In [19]:
import pywav  # pywav is used instead of the built-in wave module because it supports mu-law

from langchain_community.utilities.nvidia_riva import RivaAudioEncoding

audio_file = "./audio_files/en-US_sample2.wav"
wav_file = pywav.WavRead(audio_file)
audio_data = wav_file.getdata()
audio_encoding = RivaAudioEncoding.from_wave_format_code(wav_file.getaudioformat())
sample_rate = wav_file.getsamplerate()
num_channels = wav_file.getnumofchannels()

# chunk_size is a byte count, so the duration of each chunk depends on the sample width
delay_time = 1 / 4
chunk_size = int(sample_rate * delay_time)
delay_time = 1 / 8  # not used again in this notebook
audio_chunks = [
    audio_data[i : i + chunk_size] for i in range(0, len(audio_data), chunk_size)
]
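
Because the slicing above counts bytes rather than samples, the time span of each chunk depends on the sample width and channel count. The sketch below makes that relationship explicit on synthetic data. `split_into_chunks` and its parameters are illustrative names, and the 16-bit mono 8 kHz input is an assumption for the example, not a property of the sample file.

```python
from typing import List


def split_into_chunks(
    audio_data: bytes,
    sample_rate: int,
    bytes_per_sample: int,
    num_channels: int,
    chunk_seconds: float,
) -> List[bytes]:
    """Split raw PCM bytes into chunks that each span about chunk_seconds."""
    frame_size = bytes_per_sample * num_channels  # bytes per sample frame
    chunk_size = int(sample_rate * chunk_seconds) * frame_size
    if chunk_size <= 0:
        raise ValueError("chunk_seconds is too small for this sample rate")
    return [
        audio_data[i : i + chunk_size] for i in range(0, len(audio_data), chunk_size)
    ]


# One second of 16-bit mono silence at 8 kHz.
fake_audio = b"\x00\x00" * 8000
chunks = split_into_chunks(
    fake_audio, sample_rate=8000, bytes_per_sample=2, num_channels=1, chunk_seconds=0.25
)
print(len(chunks), len(chunks[0]))
```

Aligning the chunk size to whole frames also keeps a multi-byte sample from being split across two chunks.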

## 5. Create Riva ASR and TTS Runnables

First, set the URL to the Riva speech server. 

If you don't have a Riva speech server, follow the instructions in section 3, Setup, first.

In [21]:
RIVA_SPEECH_URL = "http://localhost:50051/"

Next, create the RivaASR and RivaTTS runnables.

In [20]:
riva_asr = RivaASR(
    url=RIVA_SPEECH_URL,  # the location of the Riva ASR server
    encoding=audio_encoding,
    audio_channel_count=num_channels,
    sample_rate_hertz=sample_rate,
    profanity_filter=True,
    enable_automatic_punctuation=True,
    language_code="en-US",
)

riva_tts = RivaTTS(
    url=RIVA_SPEECH_URL,  # the location of the Riva TTS server
    output_directory="./scratch",  # location of the output .wav files
    language_code="en-US",
    voice_name="English-US.Female-1",
)

## 6. Create Additional Chain Components (PromptTemplate and LLM)
As usual, declare the other parts of the chain. In this case, it's just a prompt and an LLM.

In [39]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI

prompt = PromptTemplate.from_template("{user_input}")
llm = OpenAI(openai_api_key="sk-xxx")  # replace the placeholder with your own API key


Now, tie together all the parts of the chain including RivaASR and RivaTTS.

In [26]:
chain = {"user_input": riva_asr} | prompt | llm | riva_tts
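
In LCEL, a dict at the head of a chain runs each value on the input and passes the results on under the given keys, so here the transcript from `riva_asr` fills `{user_input}` in the prompt. The data flow can be pictured with plain Python functions standing in for each stage. Nothing below uses LangChain, Riva, or OpenAI; every `fake_*` function is a hypothetical placeholder.

```python
from typing import Dict


def fake_asr(audio: bytes) -> str:
    """Stand-in for RivaASR: audio bytes in, transcript out."""
    return audio.decode("utf-8")


def fake_prompt(inputs: Dict[str, str]) -> str:
    """Stand-in for the prompt template "{user_input}"."""
    return "{user_input}".format(**inputs)


def fake_llm(prompt_text: str) -> str:
    """Stand-in for the LLM: returns a canned reply."""
    return f"You said: {prompt_text}"


def fake_tts(text: str) -> bytes:
    """Stand-in for RivaTTS: text in, audio bytes out."""
    return text.encode("utf-8")


def run_pipeline(audio: bytes) -> bytes:
    """Mirror the order of the chain: ASR -> prompt -> LLM -> TTS."""
    inputs = {"user_input": fake_asr(audio)}
    return fake_tts(fake_llm(fake_prompt(inputs)))


result = run_pipeline(b"what is the weather")
print(result)
```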

## 7. Mimic Audio Streaming

Take the audio chunks and stream them into the chain for ASR followed by the rest of the pipeline.

In [37]:
from langchain_community.utilities.nvidia_riva import AudioStream
import asyncio

async def generate_audio_chunks() -> None:
    """Generates audio chunks from a .wav file
    to mimic streaming."""

    input_stream = AudioStream(maxsize=1000)
    # Send bytes into the stream
    for chunk in audio_chunks:
        await input_stream.aput(chunk)
    input_stream.close()

    # Run the chain on the input stream and collect the synthesized audio chunks
    output_stream = asyncio.Queue()
    while not input_stream.complete:
        async for chunk in chain.astream(input_stream):
            await output_stream.put(chunk)

In [None]:
import nest_asyncio # needed for Jupyter
nest_asyncio.apply()

# TODO: make this stop at a certain point in time so it doesn't hang
asyncio.run(generate_audio_chunks())
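
One possible way to address the TODO above is to wrap the coroutine in `asyncio.wait_for`, which cancels it after a deadline. The sketch below shows the pattern on dummy coroutines. `run_with_timeout`, `slow_job`, and `fast_job` are hypothetical names, and the timeout values are arbitrary.

```python
import asyncio
from typing import Awaitable, Optional, TypeVar

T = TypeVar("T")


async def run_with_timeout(coro: Awaitable[T], seconds: float) -> Optional[T]:
    """Await coro, returning None if it does not finish within the deadline."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None


async def slow_job() -> str:
    await asyncio.sleep(10)  # simulates a call that hangs
    return "finished"


async def fast_job() -> str:
    await asyncio.sleep(0)
    return "finished"


slow_result = asyncio.run(run_with_timeout(slow_job(), seconds=0.1))
fast_result = asyncio.run(run_with_timeout(fast_job(), seconds=1.0))
print(slow_result, fast_result)
```

In this notebook the same helper could wrap the streaming coroutine, for example `asyncio.run(run_with_timeout(generate_audio_chunks(), seconds=30))`, where the 30-second limit is only a guess at a reasonable value.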

## 8. Results and Next Steps
Listen to the TTS response. 

The next notebook, **Part 2: Conversational Application with Riva and LangChain**, covers how to take these fundamentals and apply them to a real-time application for a full conversation with a bot.

In [47]:
import glob
import os

from IPython.display import Audio

# Collect the .wav files that RivaTTS wrote to the output directory
output_path = os.path.join(os.getcwd(), "scratch")
file_type = "*.wav"
files_path = os.path.join(output_path, file_type)
files = glob.glob(files_path)

# Play the first file found
Audio(files[0])
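
`glob.glob` returns files in arbitrary order and an empty list when nothing was written, so `files[0]` may be an older file or may raise `IndexError`. A minimal sketch of a more defensive lookup follows. `newest_wav` is a hypothetical helper, and the demo files are empty placeholders created only to exercise it.

```python
import glob
import os
import tempfile
from typing import Optional


def newest_wav(directory: str) -> Optional[str]:
    """Return the most recently modified .wav file in directory, or None."""
    files = glob.glob(os.path.join(directory, "*.wav"))
    if not files:
        return None
    return max(files, key=os.path.getmtime)


# Create two placeholder files with explicit modification times.
demo_dir = tempfile.mkdtemp()
for name, mtime in (("old.wav", 1_000_000_000), ("new.wav", 2_000_000_000)):
    path = os.path.join(demo_dir, name)
    with open(path, "wb"):
        pass
    os.utime(path, (mtime, mtime))

latest = newest_wav(demo_dir)
empty_result = newest_wav(tempfile.mkdtemp())
print(os.path.basename(latest), empty_result)
```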

