# Voice Agent Framework for Conversational AI
In this notebook, we walk through how to build and deploy a voice AI agent using **[Pipecat AI](https://github.com/pipecat-ai/pipecat)**. We illustrate the basic Pipecat flow with the `meta/llama-3.3-70b-instruct` LLM (set when the LLM service is initialized below) and Riva for STT (Speech-To-Text) & TTS (Text-To-Speech). However, Pipecat is not opinionated, and other models and STT/TTS services can easily be used. See the [Pipecat documentation](https://docs.pipecat.ai/server/services/supported-services#supported-services) for other supported services.

Pipecat AI is an open-source framework for building voice and multimodal conversational agents. Pipecat simplifies the complex voice-to-voice AI pipeline and lets developers build AI capabilities with open-source, commercial, and custom models. See [Pipecat's Core Concepts](https://docs.pipecat.ai/getting-started/core-concepts) for a deep dive into how it works.

The framework was developed by Daily, a company that has provided real-time video and audio communication infrastructure since 2016. It is fully vendor neutral and is not tightly coupled to Daily's infrastructure. That said, this demo does use Daily for the transport layer.

Below is the architecture diagram:

![Architecture Diagram](https://raw.githubusercontent.com/dglogo/nimble-pipecat/main/arch.png)

This notebook builds the conversational AI agent with Pipecat and NVIDIA NIM in three phases:

#### Phase 1: User Input
- Audio Processing with NVIDIA RIVA ASR with NIM

#### Phase 2: User Context Aggregator with Pipecat and NVIDIA NIM
- Custom processing with Pipecat
- NVIDIA RIVA TTS with NIM

#### Phase 3: Run the Agent


# Content Overview 

- [Prerequisites](#prerequisites)
- [Initialize the User Input](#initialize-the-user-input)
- [Initialize the Context Aggregator](#initialize-the-context-aggregator)
- [Run the Agent](#run-the-agent)

## Prerequisites
Prior to getting started, you will need to create an API Key for the NVIDIA API Catalog and a Daily API Key for the voice agent's transport layer in this demo.

### Obtain API Keys
#### NGC API Key
- NVIDIA API Catalog
  1. Navigate to **[NVIDIA API Catalog](https://build.nvidia.com/explore/discover)**.
  2. Select any model, such as `llama-3.3-70b-instruct`.
  3. On the right panel above the sample code snippet, click on "Get API Key". This will prompt you to log in if you have not already.

#### Daily API Key
1. Sign up at **[Daily](https://dashboard.daily.co/u/signup?pipecat=y)**.
2. Verify your email address and choose a subdomain to complete onboarding.
3. Click on "Developers" in the left-side menu of the Daily dashboard to reveal your API Key.

### Export API Keys
Save these API Keys as environment variables.

First, set the NVIDIA API Key as an environment variable. 

In [None]:
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

Now set the Daily API Key as an environment variable. 

In [None]:
import getpass
import os

if not os.environ.get("DAILY_API_KEY", ""):
    daily_key = getpass.getpass("Enter your DAILY API key: ")
    assert len(daily_key) == 64, f"{daily_key[:5]}... is not a valid key"
    os.environ["DAILY_API_KEY"] = daily_key
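
Both cells above follow the same pattern: reuse the environment variable if it is already set, otherwise prompt for it. When prompting is not possible (for example, in a non-interactive run), a check that only reports problems may be more convenient. The sketch below is a hypothetical helper, not part of Pipecat; `KEY_RULES` and `find_key_problems` are names invented here, and the format checks are simply the ones used in the cells above.

```python
import os

# Hypothetical helper: mirrors the checks in the cells above without prompting.
KEY_RULES = {
    "NVIDIA_API_KEY": lambda value: value.startswith("nvapi-"),
    "DAILY_API_KEY": lambda value: len(value) == 64,
}

def find_key_problems(env=None):
    """Return a list of readable problems; an empty list means every key passes."""
    env = os.environ if env is None else env
    problems = []
    for name, is_valid in KEY_RULES.items():
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif not is_valid(value):
            problems.append(f"{name} is set but fails its format check")
    return problems
```

Calling `find_key_problems()` with no arguments inspects the real environment; passing a plain dictionary makes the check easy to try without touching any real keys.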

### Install dependencies

First we set our environment.

We use Daily for transport, OpenAI for context aggregation, Riva for STT & TTS, and Silero for VAD (Voice Activity Detection). If you use different services, for example Cartesia for TTS, you would run `pip install "pipecat-ai[cartesia]"` instead.

In [None]:
!pip install "pipecat-ai[daily,openai,riva,silero]"
!pip install noaa_sdk #for function calling example
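
After installing, it can be useful to confirm that the packages are visible to the running kernel before going further. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the module list is simply the set of top-level packages imported later in this notebook.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level modules imported later in this notebook.
REQUIRED = ["pipecat", "noaa_sdk", "aiohttp", "openai"]
print("Missing modules:", missing_modules(REQUIRED) or "none")
```

If anything is reported as missing, restarting the kernel after the `pip install` step is usually worth trying first.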

## Initialize the User Input

Create a Daily room. This is the URL we will navigate to in order to talk to our agent.

In [None]:
import aiohttp
import os

from pipecat.transports.services.helpers.daily_rest import DailyRESTHelper, DailyRoomParams

async with aiohttp.ClientSession() as session:
    daily_rest_helper = DailyRESTHelper(
        daily_api_key=os.getenv("DAILY_API_KEY"),
        daily_api_url=os.getenv("DAILY_API_URL", "https://api.daily.co/v1"),
        aiohttp_session=session,
    )

    room_config = await daily_rest_helper.create_room(
        DailyRoomParams(properties={"enable_prejoin_ui":False})
    )
    DAILY_ROOM_URL = room_config.url

    # Url to talk to the NVIDIA NIM Agent
    print("")
    print("")
    print(f"At the 'Run the Agent!' step, navigate to: {DAILY_ROOM_URL}")
    print("")
    print("")
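
For readers curious about what happens behind `create_room`, the sketch below builds (but does not send) the kind of HTTP request such a helper would make. It assumes the helper wraps Daily's `POST /rooms` REST endpoint with bearer-token authentication; the exact request Pipecat sends may differ, and `build_create_room_request` is a hypothetical name used only here.

```python
import json
import urllib.request

def build_create_room_request(api_key, api_url="https://api.daily.co/v1"):
    """Build (but do not send) a POST /rooms request for Daily's REST API."""
    body = json.dumps({"properties": {"enable_prejoin_ui": False}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_url}/rooms",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_create_room_request("example-key")  # placeholder key, never sent
print(request.get_method(), request.full_url)
```

Disabling `enable_prejoin_ui` here matches the room properties used in the cell above, so that joining the room takes you straight into the call.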

Configure the Daily transport for WebRTC communication:
- DAILY_ROOM_URL: Where to connect (the same URL we will open in the browser to talk to our agent)
- None: No authentication token needed
- Agent name
- Daily params regarding VAD

In [None]:
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    DAILY_ROOM_URL,
    None,
    "Lydia",
    DailyParams(
        audio_out_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
        vad_audio_passthrough=True,
    ),
)
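
To give some intuition for what voice activity detection contributes, here is a toy detector. It is deliberately not the Silero model (which is a neural network); it is a simple energy threshold with a "hangover" so that short pauses do not end a turn. The function name, the threshold, and the `stop_frames` parameter are all illustrative assumptions.

```python
def detect_speech(frame_levels, threshold=0.1, stop_frames=3):
    """Label each frame True (speaking) or False (silent).

    Speech starts when a level exceeds the threshold and ends only after
    `stop_frames` consecutive quiet frames, so short pauses do not end a turn.
    """
    speaking = False
    quiet_run = 0
    labels = []
    for level in frame_levels:
        if level > threshold:
            speaking = True
            quiet_run = 0
        elif speaking:
            quiet_run += 1
            if quiet_run >= stop_frames:
                speaking = False
                quiet_run = 0
        labels.append(speaking)
    return labels
```

In the real pipeline, the analyzer's start and stop decisions are what let the agent know when the user has begun or finished speaking.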

### Initialize the LLM and RIVA services with NVIDIA NIM

You can customize the LLM `model` as well as the RIVA ASR and TTS services.

### Working with the NVIDIA API Catalog

In this notebook, you will use `meta/llama-3.3-70b-instruct` as the LLM. The cell below defines the LLM together with the Riva STT and TTS services, all served through the API Catalog.

In [None]:
import os
from pipecat.services.nim import NimLLMService
from pipecat.services.riva import FastPitchTTSService, ParakeetSTTService

stt = ParakeetSTTService(api_key=os.getenv("NVIDIA_API_KEY"))

llm = NimLLMService(
    api_key=os.getenv("NVIDIA_API_KEY"), model="meta/llama-3.3-70b-instruct"
)

tts = FastPitchTTSService(api_key=os.getenv("NVIDIA_API_KEY"))
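
The services above are only constructed here; nothing is sent until the pipeline runs. If you would like to check your key against the API Catalog independently of Pipecat, the sketch below shows one way to build a small chat request. It assumes the catalog exposes an OpenAI-compatible chat completions endpoint at the URL shown; `build_chat_request` is a hypothetical helper, and the actual send is left commented out because it needs a valid key and network access.

```python
import json
import urllib.request

# Assumed OpenAI-compatible endpoint for the NVIDIA API Catalog.
NIM_CHAT_URL = "https://integrate.api.nvidia.com/v1/chat/completions"

def build_chat_request(api_key, user_text, model="meta/llama-3.3-70b-instruct"):
    """Build (but do not send) a minimal chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        NIM_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )

# To send it (requires a valid key and network access):
# import os
# with urllib.request.urlopen(build_chat_request(os.environ["NVIDIA_API_KEY"], "Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```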

### Optional: Locally Run NVIDIA NIM Microservices

Once you are familiar with this blueprint, you may want to self-host models with NVIDIA NIM Microservices under an NVIDIA AI Enterprise software license. This lets you run models anywhere, with ownership of your customizations and full control of your intellectual property (IP) and AI applications.
Pipecat allows you to pass in a `base_url` to use the local NIM Microservice.

[Learn more about NIM Microservices](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)

<div class="alert alert-block alert-warning">
<b>NOTE:</b> Run the following cell **ONLY** if you're using a local NIM Microservice instead of the API Catalog Endpoint.
</div>

In [None]:
import os
from pipecat.services.nim import NimLLMService

llm = NimLLMService(
    api_key=os.getenv("NVIDIA_API_KEY"), base_url="http://localhost:8000/v1", model="meta/llama-3.1-70b-instruct"
)
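
Rather than editing the cell each time you switch between the hosted endpoint and a local NIM, one option is to drive the choice from environment variables. This is a sketch only: `NIM_BASE_URL` and `NIM_MODEL` are hypothetical variable names invented for this example, and `nim_llm_kwargs` is not a Pipecat function.

```python
import os

def nim_llm_kwargs(env=None):
    """Build keyword arguments for NimLLMService from environment variables.

    NIM_BASE_URL and NIM_MODEL are hypothetical variable names used only here.
    """
    env = os.environ if env is None else env
    kwargs = {
        "api_key": env.get("NVIDIA_API_KEY", ""),
        "model": env.get("NIM_MODEL", "meta/llama-3.3-70b-instruct"),
    }
    base_url = env.get("NIM_BASE_URL")
    if base_url:  # only override the endpoint when a local NIM is configured
        kwargs["base_url"] = base_url
    return kwargs

# Usage in the notebook: llm = NimLLMService(**nim_llm_kwargs())
```

With `NIM_BASE_URL` unset, the service falls back to its default endpoint, which matches the API Catalog cell earlier in this section.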

### Define LLM prompt

Edit the prompt as desired.

In [None]:
messages = [
    {
        "role": "system",
        "content": """
You are Lydia; a conversational voice agent who discusses Nvidia's work in agentic AI and a sales assistant who listens to the user and answers their questions. The purpose is to show that voice agents can talk naturally in open-ended conversation. If you are asked how you were built, say you were built with the pipe cat framework and the in vidia NIM platform.

Here is background content to reference in the conversation. Only use the background content provided.

BACKGROUND:

NVIDIA stands at the forefront of the AI revolution, driving major advancements through its comprehensive hardware and software ecosystem.

Specific areas of innovation and partnership include:
  - healthcare
  - customer service
  - supercomputers
  - scientific research
  - manufacturing and automation

The company's influence extends beyond traditional GPU manufacturing to pioneering roles in agentic AI, multistep reasoning, and data center architectures, particularly through technologies like NVIDIA NVLink that enable seamless communication among thousands of accelerators.

In the customer service sector, NVIDIA is transforming interactions through AI agents powered by NIM microservices and NeMo Retriever. These solutions enable sophisticated natural language processing, retrieval-augmented generation, and digital human interfaces with real-time lip syncing. Global partners including Accenture, Dell Technologies, and Lenovo are leveraging NVIDIA Blueprints to deploy AI solutions across various applications, from warehouse safety to traffic management.

NVIDIA's impact is particularly notable in Japan, where collaborations with major providers like SoftBank Corp. and KDDI are establishing AI data centers nationwide. The company's AI Enterprise and Omniverse platforms are enabling Japanese companies to develop culturally-specific language models and enhance industrial automation, with applications ranging from healthcare to manufacturing.

In healthcare, NVIDIA is partnering with organizations like Deloitte to improve patient experiences through AI-driven platforms. The company's technologies are being utilized by institutions such as the National Cancer Institute for drug discovery and medical imaging advancement. Additionally, NVIDIA is working with U.S. technology leaders to integrate its AI software into various sectors, with consulting firms like Accenture and cloud providers like Google Cloud facilitating rapid deployment of AI workloads.

CRITICAL VOICE REQUIREMENTS:

Your responses will be converted to audio. Please do not include any special characters in your response other than '!' or '?'. never use '*'. Replace "NVIDIA" with "in vidia" and replace "GPU" with "gee pee you" in your responses. Also, replace "U.S." with "united states" and replace "US" with "united states". Replace "API" with "A pee eye" and "AI-driven" with "AI driven".

RESPONSE REQUIREMENTS:

Speaking style:
- You are a realtime voice agent - keep responses natural but brief
- Begin with one clear point about what the user asked
- If needed, add one or two follow-up details that add value
- Then ask a question to move the conversation forward
- Never repeat or rephrase information
- Never repeat questions verbatim
- Never explain the same concept twice
- Never restate what the user just said
- Avoid connector phrases like also, additionally, furthermore, moreover

Example of BAD response (too long):
"In vidia's agentic AI helps with customer service by reducing wait times and improving satisfaction. The system uses natural language processing to understand customer needs. It can handle multiple languages and complex queries. The AI agents can scale to handle increasing demand. What aspects interest you?"

Example of BAD response (too short):
"In vidia's AI helps customers. What interests you?"

Example of GOOD response:
"In vidia's agentic AI reduces customer wait times by eighty percent through automated response handling. Our recent deployment at The Ottawa Hospital showed significant improvements in patient satisfaction. What specific outcomes would you like to achieve for your customers?"

Example of BAD response:
"In vidia's agentic AI helps with customer service. As I mentioned, it can handle customer inquiries. What interests you about customer service?"

Example of GOOD response:
"In vidia's agentic AI reduces customer wait times by eighty percent. What aspects of customer service interest you?"

Natural Acknowledgments:
- Use brief, natural acknowledgments like "That's interesting" or "Great question" when appropriate
- Keep acknowledgments professional and brief
- Focus on the topic, not emotional support
- Avoid overly familiar phrases like "no worries" or "you're doing great"

Example of BAD response:
"That's wonderful! You're asking such great questions. In vidia's AI..."

Example of GOOD response:
"Interesting point about automation. In vidia's AI reduces processing time by sixty percent. What aspects of efficiency are most important to your team?"

INSTRUCTIONS

You can:
  - Answer questions about in vidia's work in agentic AI
  - Discuss the impact of in vidia's AI solutions on various industries
  - Provide weather information for anywhere in the United States

You cannot:
  - Provide weather information for locations outside the United States

If you are asked about a location outside the United States, politely respond that you are only able to retrieve current weather information for locations in the United States. If a location is not provided, always ask the user which location they would like the weather for.

After responding to the first question about the weather, ask the user if they'd like to continue with weather questions or talk about in vidia. Reference the most recent conversational context regarding in vidia, if there is any.

Now introduce yourself to user by saying "Hello, I'm Lydia. I'm looking forward to talking about in vidia's recent work in agentic AI. I can also demonstrate tool use by responding to questions about the current weather anywhere in the United States. Who am I speaking with?" 

If the user introduces themself, respond with "Nice to meet you. Is there an agentic use case you're interested in, or a particular industry?"

If the user does not introduce themself, simply continue with the conversation.
""",
    },
]
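
The voice requirements in the prompt ask the model to rewrite certain terms so they are pronounced well. Models do not always comply, so a deterministic safety net applied to the text before it reaches TTS can be a reasonable complement. The sketch below is one possible filter, not something Pipecat provides: the replacement pairs come from the prompt above, while the set of characters that are kept (letters, digits, spaces, and basic punctuation) is an assumption made for this example.

```python
import re

# Ordered replacements taken from the voice requirements in the prompt above.
REPLACEMENTS = [
    (r"\bNVIDIA\b", "in vidia"),
    (r"\bGPU\b", "gee pee you"),
    (r"\bU\.S\.", "united states"),
    (r"\bUS\b", "united states"),
    (r"\bAPI\b", "A pee eye"),
    (r"\bAI-driven\b", "AI driven"),
]

def normalize_for_speech(text):
    """Apply the pronunciation replacements, then drop disallowed characters."""
    for pattern, spoken in REPLACEMENTS:
        text = re.sub(pattern, spoken, text)
    # Assumption: keep letters, digits, spaces, and basic punctuation only.
    return re.sub(r"[^A-Za-z0-9 .,!?']", "", text)
```

Note that the word-boundary patterns leave plurals such as "GPUs" untouched; extending the list is straightforward if that matters for your content.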

### Define tool calling function for weather queries

Here we use the classic "get_weather" example. We describe the tool with OpenAI's `ChatCompletionToolParam` and register the handler function with the LLM service. Note: this currently uses the `meta/llama-3.3-70b-instruct` model. Not all models support tool calling, so be sure to check this capability before changing or updating the model.

In [None]:
from openai.types.chat import ChatCompletionToolParam
from noaa_sdk import NOAA

async def start_fetch_weather(function_name, llm, context):
    print(f"Starting fetch_weather_from_api with function_name: {function_name}")

async def get_noaa_simple_weather(latitude: float, longitude: float, **kwargs):
    print(f"NOAA get simple weather for '{latitude}, {longitude}'")
    n = NOAA()
    description = False
    fahrenheit_temp = 0
    celsius_temp = None
    try:
        observations = n.get_observations_by_lat_lon(latitude, longitude, num_of_stations=1)
        # Take the first observation that carries a text description
        for observation in observations:
            description = observation["textDescription"]
            celsius_temp = observation["temperature"]["value"]
            if description:
                break

        # Convert only when a reading is available; a station may report no value
        if celsius_temp is not None:
            fahrenheit_temp = (celsius_temp * 9 / 5) + 32

        # fallback to temperature if no description in any of the observations
        if fahrenheit_temp and not description:
            description = fahrenheit_temp
    except Exception as e:
        print(f"Error getting NOAA weather: {e}")

    return description, fahrenheit_temp

async def fetch_weather_from_api(
    function_name, tool_call_id, args, llm, context, result_callback
):
    location = args["location"]
    latitude = float(args["latitude"])
    longitude = float(args["longitude"])
    print(f"fetch_weather_from_api * location: {location}, lat & lon: {latitude}, {longitude}")

    if latitude and longitude:
        description, fahrenheit_temp = await get_noaa_simple_weather(latitude, longitude)
    else:
        return await result_callback("Sorry, I don't recognize that location.")

    if not description:
        await result_callback(
            f"I'm sorry, I can't get the weather for {location} right now. Can you ask again please?"
        )
    else:
        await result_callback(
            f"The weather in {location} is currently {round(fahrenheit_temp)} degrees and {description}."
        )

tools = [
    ChatCompletionToolParam(
        type="function",
        function={
            "name": "get_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location for the weather request.",
                    },
                    "latitude": {
                        "type": "string",
                        "description": "Infer the latitude from the location. Supply latitude as a string. For example, '42.3601'.",
                    },
                    "longitude": {
                        "type": "string",
                        "description": "Infer the longitude from the location. Supply longitude as a string. For example, '-71.0589'.",
                    },
                },
                "required": ["location", "latitude", "longitude"],
            },
        },
    ),
]

llm.register_function(None, fetch_weather_from_api, start_callback=start_fetch_weather)
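
Before relying on a live conversation to exercise the tool, it can help to call a handler directly with a stub callback. The sketch below uses a stand-in handler with the same signature as `fetch_weather_from_api` so that it runs without network access; `stub_weather_handler` and `call_handler` are hypothetical names, and the example arguments reuse the coordinates from the tool description above.

```python
import asyncio

async def stub_weather_handler(function_name, tool_call_id, args, llm, context, result_callback):
    """Stand-in with the same signature as fetch_weather_from_api, but no network call."""
    location = args["location"]
    latitude = float(args["latitude"])    # the tool schema supplies coordinates as strings
    longitude = float(args["longitude"])
    await result_callback(f"Would look up weather for {location} at ({latitude}, {longitude}).")

async def call_handler(handler, args):
    """Invoke a handler the way a tool call would, and collect what it reports."""
    results = []

    async def collect(result):
        results.append(result)

    await handler("get_weather", "call-1", args, None, None, collect)
    return results

example_args = {"location": "Boston", "latitude": "42.3601", "longitude": "-71.0589"}
# In a notebook cell: await call_handler(stub_weather_handler, example_args)
# As a script:        asyncio.run(call_handler(stub_weather_handler, example_args))
```

Swapping `stub_weather_handler` for `fetch_weather_from_api` in the same call would exercise the real NOAA lookup.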

## Initialize the Context Aggregator

In [None]:
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

context = OpenAILLMContext(messages, tools)
context_aggregator = llm.create_context_aggregator(context)
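
Conceptually, the context is an OpenAI-style list of messages that grows by one entry per turn, which is what gives the agent its conversational "memory". The toy below only illustrates that idea; Pipecat's aggregators maintain the context for you, and their internals may differ. `append_turn` is a hypothetical helper.

```python
def append_turn(messages, role, content):
    """Return a new message list with one more turn appended."""
    return messages + [{"role": role, "content": content}]

history = [{"role": "system", "content": "You are Lydia."}]
history = append_turn(history, "user", "My name is Sam.")
history = append_turn(history, "assistant", "Nice to meet you.")
history = append_turn(history, "user", "What was the first thing I said?")

# The earliest user turn is still present, so the model can refer back to it.
print(history[1]["content"])
```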

Create a pipeline that converts speech to text with RIVA, sends the text to NVIDIA NIM, and then turns the NVIDIA NIM response text back into speech.

In [None]:
from pipecat.pipeline.pipeline import Pipeline

pipeline = Pipeline(
    [
        transport.input(),              # Transport user input
        stt,                            # STT
        context_aggregator.user(),      # User responses
        llm,                            # LLM
        tts,                            # TTS
        transport.output(),             # Transport agent output
        context_aggregator.assistant(), # Assistant spoken responses
    ]
)
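
The order of the list matters: each processor receives what the previous one produced. The toy below illustrates only that ordering idea with plain async functions. It is not how Pipecat is implemented (a real pipeline streams many frames and handles interruptions), and all names in it are stand-ins that call no real service.

```python
import asyncio

async def run_pipeline(stages, frame):
    """Pass one frame through each stage in order, like the items in the list above."""
    for stage in stages:
        frame = await stage(frame)
    return frame

# Toy stages standing in for STT, LLM, and TTS; none of them call a real service.
async def toy_stt(audio):
    return audio["transcript"]

async def toy_llm(text):
    return f"You said: {text}"

async def toy_tts(text):
    return {"audio_for": text}

# In a notebook cell: await run_pipeline([toy_stt, toy_llm, toy_tts], {"transcript": "hello"})
# As a script:        asyncio.run(run_pipeline([toy_stt, toy_llm, toy_tts], {"transcript": "hello"}))
```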

Create a PipelineTask to allow interruption while in conversation.

In [None]:
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))

Create a pipeline runner to manage the processing pipeline.

In [None]:
from pipecat.pipeline.runner import PipelineRunner

runner = PipelineRunner()

### Set event handlers
- The `on_first_participant_joined` handler tells the agent to start the conversation when you join the call.
- The `on_participant_left` handler queues an `EndFrame`, which signals the pipeline to terminate.

In [None]:
from pipecat.frames.frames import EndFrame

@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
    await task.queue_frames([context_aggregator.user().get_context_frame()])
        
@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant, reason):
    print(f"Participant left: {participant}")
    await task.queue_frame(EndFrame())   
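
If the decorator style above is unfamiliar, the toy below shows one way such a registry could work: the decorator stores the coroutine under an event name, and the transport later awaits every stored handler when that event fires. `ToyTransport` is a hypothetical class written for this illustration and says nothing about how `DailyTransport` is implemented.

```python
import asyncio

class ToyTransport:
    """Minimal event registry: register handlers by name, then emit events."""

    def __init__(self):
        self._handlers = {}

    def event_handler(self, event_name):
        def decorator(func):
            self._handlers.setdefault(event_name, []).append(func)
            return func
        return decorator

    async def emit(self, event_name, *args):
        # Each handler receives the transport first, then the event arguments.
        for handler in self._handlers.get(event_name, []):
            await handler(self, *args)

toy = ToyTransport()
seen = []

@toy.event_handler("on_participant_left")
async def record_departure(transport, participant, reason):
    seen.append((participant, reason))

# In a notebook cell: await toy.emit("on_participant_left", "alice", "leftCall")
```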

## Run the Agent!

NOTE: The first time you run the agent, it will load weights for a voice activity model into the local Python process. This will take 10-15 seconds. A permissions dialog will ask you to allow the browser to access your camera and microphone. Click yes to start talking to the agent. If you have any trouble with this, see [here](https://help.daily.co/en/articles/2525908-allow-camera-and-mic-access).


In [None]:
# Url to talk to the NVIDIA NIM Agent
print("")
print("")
print(f"Navigate to: {DAILY_ROOM_URL}")
print("")
print("")

await runner.run(task)

### Suggested conversations:
- *Learn.* Ask the agent about NVIDIA's developments in Agentic AI.
- *Try tool calling.* Ask the agent about the weather.
- *Observe the agent's context "memory".* After a few minutes of conversation, ask the agent to recite the very first thing you said.

To end the chat with the agent, leave the WebRTC call.