# Voice Agent for Conversational AI with Pipecat
In this notebook, we walk through how to build and deploy a voice AI agent using [Pipecat AI](https://github.com/pipecat-ai/pipecat). We illustrate the basic Pipecat flow with the `meta/llama-3.3-70b-instruct` LLM (set in Step 3) and Riva for STT (Speech-To-Text) and TTS (Text-To-Speech). Pipecat is not opinionated, however, so other models and STT/TTS services can be substituted. See the [Pipecat documentation](https://docs.pipecat.ai/server/services/supported-services#supported-services) for other supported services.

Pipecat AI is an open-source framework for building voice and multimodal conversational agents. It simplifies the complex voice-to-voice AI pipeline and lets developers build AI capabilities with open-source, commercial, and custom models. See [Pipecat's Core Concepts](https://docs.pipecat.ai/getting-started/core-concepts) for a deep dive into how it works.

The framework was developed by Daily, a company that has provided real-time video and audio communication infrastructure since 2016. It is fully vendor neutral and is not tightly coupled to Daily's infrastructure. That said, we do use Daily for transport in this demo. Sign up for a Daily-bots API key [here](https://bots.daily.co/sign-up).

## Step 1 - Install dependencies
First we set up our environment.

We use Daily for transport, OpenAI for context aggregation, Riva for STT & TTS, and Silero for VAD (Voice Activity Detection). If you use different services, for example Cartesia for TTS, run `pip install "pipecat-ai[cartesia]"` instead.

In [13]:
!pip install python-dotenv
%load_ext dotenv
%dotenv

!pip install "pipecat-ai[daily,openai,riva,silero]"
!pip install noaa_sdk #for function calling example
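
The `%dotenv` magic loads variables from a local `.env` file, and later steps read `NVIDIA_API_KEY` from the environment. As a minimal sketch, a quick check like the following can confirm the key is present before any service is constructed. The helper name `missing_env_vars` is hypothetical and is not part of Pipecat or python-dotenv.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# NVIDIA_API_KEY is the only variable this notebook reads via os.getenv.
missing = missing_env_vars(["NVIDIA_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Failing early here is usually easier to diagnose than an authentication error raised later from inside the running pipeline.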


## Step 2 - Configure Daily transport for WebRTC communication
- `room_url`: Where to connect (and the URL we will navigate to in order to talk to our agent)
- `None`: No authentication token needed
- `"Lydia"`: The agent's display name
- `DailyParams`: Enable audio output for text-to-speech playback and enable VAD
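
To make the VAD setting more concrete, the sketch below shows one common way that a confidence threshold plus start and stop durations can debounce speech detection. It is a simplified, hypothetical illustration and not Silero's or Pipecat's actual implementation. The function name `detect_speech_segments` and the 0.1 second frame length are assumptions. The default thresholds mirror the values printed in this step's log output (`confidence=0.7 start_secs=0.2 stop_secs=0.8`).

```python
def detect_speech_segments(confidences, frame_secs=0.1,
                           confidence=0.7, start_secs=0.2, stop_secs=0.8):
    """Return (start_frame, end_frame) pairs judged to be speech.

    Speech starts once scores stay at or above `confidence` for `start_secs`
    and ends once they stay below it for `stop_secs`.
    """
    start_frames = max(1, round(start_secs / frame_secs))
    stop_frames = max(1, round(stop_secs / frame_secs))

    segments = []
    speaking = False
    run = 0            # consecutive frames that disagree with the current state
    run_start = 0      # index of the first frame in that run
    segment_start = 0

    for i, score in enumerate(confidences):
        is_voice = score >= confidence
        if is_voice == speaking:
            run = 0    # frame agrees with the current state
            continue
        if run == 0:
            run_start = i
        run += 1
        needed = stop_frames if speaking else start_frames
        if run >= needed:
            if speaking:
                segments.append((segment_start, run_start))
            else:
                segment_start = run_start
            speaking = not speaking
            run = 0

    if speaking:
        segments.append((segment_start, len(confidences)))
    return segments


# One isolated loud frame, then three voiced frames, then a long silence.
scores = [0.1, 0.9, 0.1, 0.9, 0.9, 0.9] + [0.1] * 9
print(detect_speech_segments(scores))
```

The isolated loud frame is ignored because it does not last for `start_secs`, and the segment only closes after `stop_secs` of quiet, which keeps short pauses from splitting an utterance.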

In [14]:
# Url to talk to the NVIDIA NIM Agent
# Update to your room url after obtaining Daily-bots API key
#### NOTE: if this is changed, the link in Step 11 markdown will no longer work.
DAILY_SAMPLE_ROOM_URL="https://pc-34b1bdc94a7741719b57b2efb82d658e.daily.co/pipecat"

In [15]:
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    DAILY_SAMPLE_ROOM_URL,
    None,
    "Lydia",
    DailyParams(
        audio_out_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
        vad_audio_passthrough=True,
    ),
)

2024-12-30 20:17:11.194 | INFO     | pipecat.audio.vad.vad_analyzer:set_params:69 - Setting VAD params to: confidence=0.7 start_secs=0.2 stop_secs=0.8 min_volume=0.6
2024-12-30 20:17:11.195 | DEBUG    | pipecat.audio.vad.silero:__init__:113 - Loading Silero VAD model...
2024-12-30 20:17:11.305 | DEBUG    | pipecat.audio.vad.silero:__init__:135 - Loaded Silero VAD


## Step 3 - Initialize LLM, STT, and TTS services
We can customize options, for example a different LLM `model` or `voice_id` for FastPitch TTS.

In [16]:
import os
from pipecat.services.nim import NimLLMService
from pipecat.services.riva import FastPitchTTSService, ParakeetSTTService

stt = ParakeetSTTService(api_key=os.getenv("NVIDIA_API_KEY"))

llm = NimLLMService(
    api_key=os.getenv("NVIDIA_API_KEY"), model="meta/llama-3.3-70b-instruct"
)

tts = FastPitchTTSService(api_key=os.getenv("NVIDIA_API_KEY"))


## Step 4 - Define LLM prompt
Edit the prompt as desired.

In [17]:
messages = [
    {
        "role": "system",
        "content": """
You are Lydia; a conversational voice agent who discusses Nvidia's work in agentic AI and a sales assistant who listens to the user and answers their questions. The purpose is to show that voice agents can talk naturally in open-ended conversation. If you are asked how you were built, say you were built with the pipe cat framework and the in vidia NIM platform.

Here is background content to reference in the conversation. Only use the background content provided.

BACKGROUND:

NVIDIA stands at the forefront of the AI revolution, driving major advancements through its comprehensive hardware and software ecosystem.

Specific areas of innovation and partnership include:
  - healthcare
  - customer service
  - supercomputers
  - scientific research
  - manufacturing and automation

The company's influence extends beyond traditional GPU manufacturing to pioneering roles in agentic AI, multistep reasoning, and data center architectures, particularly through technologies like NVIDIA NVLink that enable seamless communication among thousands of accelerators.

In the customer service sector, NVIDIA is transforming interactions through AI agents powered by NIM microservices and NeMo Retriever. These solutions enable sophisticated natural language processing, retrieval-augmented generation, and digital human interfaces with real-time lip syncing. Global partners including Accenture, Dell Technologies, and Lenovo are leveraging NVIDIA Blueprints to deploy AI solutions across various applications, from warehouse safety to traffic management.

NVIDIA's impact is particularly notable in Japan, where collaborations with major providers like SoftBank Corp. and KDDI are establishing AI data centers nationwide. The company's AI Enterprise and Omniverse platforms are enabling Japanese companies to develop culturally-specific language models and enhance industrial automation, with applications ranging from healthcare to manufacturing.

In healthcare, NVIDIA is partnering with organizations like Deloitte to improve patient experiences through AI-driven platforms. The company's technologies are being utilized by institutions such as the National Cancer Institute for drug discovery and medical imaging advancement. Additionally, NVIDIA is working with U.S. technology leaders to integrate its AI software into various sectors, with consulting firms like Accenture and cloud providers like Google Cloud facilitating rapid deployment of AI workloads.

CRITICAL VOICE REQUIREMENTS:

Your responses will be converted to audio. Please do not include any special characters in your response other than '!' or '?'. never use '*'. Replace "NVIDIA" with "in vidia" and replace "GPU" with "gee pee you" in your responses. Also, replace "U.S." with "united states" and replace "US" with "united states". Replace "API" with "A pee eye" and "AI-driven" with "AI driven".

RESPONSE REQUIREMENTS:

Speaking style:
- You are a realtime voice agent - keep responses natural but brief
- Begin with one clear point about what the user asked
- If needed, add one or two follow-up details that add value
- Then ask a question to move the conversation forward
- Never repeat or rephrase information
- Never repeat questions verbatim
- Never explain the same concept twice
- Never restate what the user just said
- Avoid connector phrases like also, additionally, furthermore, moreover

Example of BAD response (too long):
"In vidia's agentic AI helps with customer service by reducing wait times and improving satisfaction. The system uses natural language processing to understand customer needs. It can handle multiple languages and complex queries. The AI agents can scale to handle increasing demand. What aspects interest you?"

Example of BAD response (too short):
"In vidia's AI helps customers. What interests you?"

Example of GOOD response:
"In vidia's agentic AI reduces customer wait times by eighty percent through automated response handling. Our recent deployment at The Ottawa Hospital showed significant improvements in patient satisfaction. What specific outcomes would you like to achieve for your customers?"

Example of BAD response:
"In vidia's agentic AI helps with customer service. As I mentioned, it can handle customer inquiries. What interests you about customer service?"

Example of GOOD response:
"In vidia's agentic AI reduces customer wait times by eighty percent. What aspects of customer service interest you?"

Natural Acknowledgments:
- Use brief, natural acknowledgments like "That's interesting" or "Great question" when appropriate
- Keep acknowledgments professional and brief
- Focus on the topic, not emotional support
- Avoid overly familiar phrases like "no worries" or "you're doing great"

Example of BAD response:
"That's wonderful! You're asking such great questions. In vidia's AI..."

Example of GOOD response:
"Interesting point about automation. In vidia's AI reduces processing time by sixty percent. What aspects of efficiency are most important to your team?"

INSTRUCTIONS

You can:
  - Answer questions about in vidia's work in agentic AI
  - Discuss the impact of in vidia's AI solutions on various industries
  - Provide weather information for anywhere in the United States

You cannot:
  - Provide weather information for locations outside the United States

If you are asked about a location outside the United States, politely respond that you are only able to retrieve current weather information for locations in the United States. If a location is not provided, always ask the user which location they would like the weather for.

After responding to the first question about the weather, ask the user if they'd like to continue with weather questions or talk about in vidia. Reference the most recent conversational context regarding in vidia, if there is any.

Now introduce yourself to the user by saying "Hello, I'm Lydia. I'm looking forward to talking about in vidia's recent work in agentic AI. I can also demonstrate tool use by responding to questions about the current weather anywhere in the United States. Who am I speaking with?"

If the user introduces themself, respond with "Nice to meet you. Is there an agentic use case you're interested in, or a particular industry?"

If the user does not introduce themself, simply continue with the conversation.
""",
    },
]
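
The prompt's voice requirements ask the model to respell terms such as "NVIDIA" and "GPU" so that the TTS engine pronounces them well. Because a model may not follow such instructions every time, one possible safety net is to apply the same substitutions in code before text reaches TTS. The sketch below is an assumption-laden illustration: `normalize_for_tts` is a hypothetical helper that is not part of Pipecat, and it is not wired into the pipeline in this notebook.

```python
import re

# (pattern, replacement) pairs taken from the prompt's voice requirements.
_TTS_REPLACEMENTS = [
    (re.compile(r"AI-driven"), "AI driven"),
    (re.compile(r"\bNVIDIA\b", re.IGNORECASE), "in vidia"),
    (re.compile(r"\bGPU\b"), "gee pee you"),
    (re.compile(r"\bU\.S\."), "united states"),
    (re.compile(r"\bUS\b"), "united states"),
    (re.compile(r"\bAPI\b"), "A pee eye"),
]


def normalize_for_tts(text: str) -> str:
    """Drop asterisks and respell terms that TTS tends to mispronounce."""
    text = text.replace("*", "")
    for pattern, replacement in _TTS_REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text


print(normalize_for_tts("NVIDIA GPU sales in the U.S. use an AI-driven API*"))
```

Word boundaries keep the acronym rules from firing inside longer words, and only the NVIDIA rule is case-insensitive so that "Nvidia" is also covered.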

## Step 5 - Define tool calling function
Here we use the classic "get_weather" example. We describe the tool with OpenAI's `ChatCompletionToolParam` and register the handler function with the LLM service. Note: this example currently uses the `meta/llama-3.3-70b-instruct` model. Not all models support tool calling, so check for this capability before changing or updating the model.

In [18]:
from openai.types.chat import ChatCompletionToolParam
from noaa_sdk import NOAA

async def start_fetch_weather(function_name, llm, context):
    print(f"Starting fetch_weather_from_api with function_name: {function_name}")

async def get_noaa_simple_weather(latitude: float, longitude: float, **kwargs):
    print(f"noaa get simple weather for '{latitude}, {longitude}'")
    n = NOAA()
    description = False
    celsius_temp = None
    fahrenheit_temp = 0
    try:
        observations = n.get_observations_by_lat_lon(latitude, longitude, num_of_stations=1)
        # Use the first observation that carries a text description
        for observation in observations:
            description = observation["textDescription"]
            celsius_temp = observation["temperature"]["value"]
            if description:
                break

        # Convert Celsius to Fahrenheit only when a reading is present
        if celsius_temp is not None:
            fahrenheit_temp = (celsius_temp * 9 / 5) + 32

        # fallback to temperature if no description in any of the observations
        if fahrenheit_temp and not description:
            description = fahrenheit_temp
    except Exception as e:
        print(f"Error getting noaa weather: {e}")

    return description, fahrenheit_temp

async def fetch_weather_from_api(
    function_name, tool_call_id, args, llm, context, result_callback
):
    location = args["location"]
    latitude = float(args["latitude"])
    longitude = float(args["longitude"])
    print(f"fetch_weather_from_api * location: {location}, lat & lon: {latitude}, {longitude}")

    if latitude and longitude:
        description, fahrenheit_temp = await get_noaa_simple_weather(latitude, longitude)
    else:
        return await result_callback("Sorry, I don't recognize that location.")

    if not description:
        await result_callback(
            f"I'm sorry, I can't get the weather for {location} right now. Can you ask again please?"
        )
    else:
        await result_callback(
            f"The weather in {location} is currently {round(fahrenheit_temp)} degrees and {description}."
        )

tools = [
    ChatCompletionToolParam(
        type="function",
        function={
            "name": "get_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location for the weather request.",
                    },
                    "latitude": {
                        "type": "string",
                        "description": "Infer the latitude from the location. Supply latitude as a string. For example, '42.3601'.",
                    },
                    "longitude": {
                        "type": "string",
                        "description": "Infer the longitude from the location. Supply longitude as a string. For example, '-71.0589'.",
                    },
                },
                "required": ["location", "latitude", "longitude"],
            },
        },
    ),
]

llm.register_function(None, fetch_weather_from_api, start_callback=start_fetch_weather)
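
The tool schema asks the model to supply `latitude` and `longitude` as strings, and `fetch_weather_from_api` converts them with `float()`. Model-generated arguments can be missing or malformed, so a small validation step may be worth adding before calling NOAA. The following is a minimal sketch under that assumption. `parse_coordinates` is a hypothetical helper, not something provided by Pipecat or `noaa_sdk`.

```python
def parse_coordinates(args):
    """Return (latitude, longitude) as floats, or None if they are unusable."""
    try:
        latitude = float(args["latitude"])
        longitude = float(args["longitude"])
    except (KeyError, TypeError, ValueError):
        return None
    # Reject values outside the valid geographic range (this also rejects NaN).
    if not (-90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0):
        return None
    return latitude, longitude


print(parse_coordinates({"latitude": "42.3601", "longitude": "-71.0589"}))
print(parse_coordinates({"latitude": "north", "longitude": "0"}))
```

A handler could call this first and pass the "Sorry, I don't recognize that location." message to `result_callback` whenever it returns `None`.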

## Step 6 - Initialize the Context Aggregator

In [19]:
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

context = OpenAILLMContext(messages, tools)
context_aggregator = llm.create_context_aggregator(context)
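
Conceptually, the context is an OpenAI-style list of messages that grows as the conversation proceeds, which is what gives the agent its "memory". The sketch below mimics that idea with plain Python lists. The helper `add_turn` is hypothetical, and Pipecat's aggregators handle this bookkeeping internally, likely with additional details such as tool-call messages.

```python
def add_turn(history, role, content):
    """Return a new history with one more message appended."""
    return history + [{"role": role, "content": content}]


history = [{"role": "system", "content": "You are Lydia."}]
history = add_turn(history, "user", "What's the weather in Boston?")
history = add_turn(history, "assistant", "Let me check that for you.")

# Each LLM request carries the whole list, so earlier turns remain visible.
for message in history:
    print(f"{message['role']}: {message['content']}")
```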

## Step 7 - Create pipeline
Here we arrange the services into a pipeline that converts speech to text, sends the text to the LLM, and then turns the LLM's response back into speech.

In [20]:
from pipecat.pipeline.pipeline import Pipeline

pipeline = Pipeline(
    [
        transport.input(),              # Transport user input
        stt,                            # STT
        context_aggregator.user(),      # User responses
        llm,                            # LLM
        tts,                            # TTS
        transport.output(),             # Transport agent output
        context_aggregator.assistant(), # Assistant spoken responses
    ]
)

2024-12-30 20:17:21.968 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking PipelineSource#1 -> DailyInputTransport#1
2024-12-30 20:17:21.969 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking DailyInputTransport#1 -> ParakeetSTTService#1
2024-12-30 20:17:21.970 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking ParakeetSTTService#1 -> OpenAIUserContextAggregator#1
2024-12-30 20:17:21.970 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking OpenAIUserContextAggregator#1 -> NimLLMService#1
2024-12-30 20:17:21.971 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking NimLLMService#1 -> FastPitchTTSService#1

## Step 8 - Create PipelineTask

In [21]:
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))

2024-12-30 20:17:28.033 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking Source#1 -> Pipeline#1
2024-12-30 20:17:28.034 | DEBUG    | pipecat.processors.frame_processor:link:150 - Linking Pipeline#1 -> Sink#1
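
Setting `allow_interruptions=True` lets the user talk over the agent. One way to picture the effect is cancelling an in-progress playback task when new speech arrives. The sketch below shows that general pattern with plain `asyncio`. It is an analogy under that assumption, not Pipecat's internal mechanism, and the names `speak` and `demo` are hypothetical.

```python
import asyncio


async def speak(sentences, spoken):
    """Pretend to play each sentence and record the ones that finished."""
    for sentence in sentences:
        await asyncio.sleep(0.05)  # simulated playback time
        spoken.append(sentence)


async def demo():
    spoken = []
    playback = asyncio.create_task(speak(["one", "two", "three", "four"], spoken))
    await asyncio.sleep(0.12)      # the user starts talking mid-response
    playback.cancel()              # interruption: stop the agent's output
    try:
        await playback
    except asyncio.CancelledError:
        pass                       # cancellation is the expected outcome
    return spoken


spoken = asyncio.run(demo())
print(spoken)
```

Only the sentences that finished before the interruption are recorded, so the remainder of the response is never played.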


## Step 9 - Create a pipeline runner
The runner manages execution of the pipeline task.

In [22]:
from pipecat.pipeline.runner import PipelineRunner

runner = PipelineRunner()

## Step 10 - Set event handlers
- The `on_first_participant_joined` handler tells the agent to start the conversation when you join the call.
- The `on_participant_left` handler queues an `EndFrame`, which signals the pipeline to terminate.

In [23]:
from pipecat.frames.frames import EndFrame

@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
    await task.queue_frames([context_aggregator.user().get_context_frame()])
        
@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant, reason):
    print(f"Participant left: {participant}")
    await task.queue_frame(EndFrame())   
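
The `@transport.event_handler(...)` decorator follows a familiar registry pattern: the decorator stores the coroutine under an event name, and the transport awaits it when that event occurs. The sketch below reproduces the pattern in a few lines. `TinyEmitter` and its `emit` method are hypothetical and only illustrate the idea; they are not Pipecat's actual classes.

```python
import asyncio


class TinyEmitter:
    """Minimal event registry with a decorator-based API."""

    def __init__(self):
        self._handlers = {}

    def event_handler(self, name):
        def decorator(func):
            # Remember the coroutine function under its event name.
            self._handlers.setdefault(name, []).append(func)
            return func
        return decorator

    async def emit(self, name, *args):
        # Await every handler registered for this event, in order.
        for handler in self._handlers.get(name, []):
            await handler(self, *args)


emitter = TinyEmitter()
events = []


@emitter.event_handler("on_participant_left")
async def on_participant_left(source, participant, reason):
    events.append((participant, reason))


asyncio.run(emitter.emit("on_participant_left", "user-1", "leftCall"))
print(events)
```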

## Step 11 - Run the Agent!

Once you have run the code block below, open the following link in a new browser window to join the agent's WebRTC session:
#### [https://pc-34b1bdc94a7741719b57b2efb82d658e.daily.co/pipecat](https://pc-34b1bdc94a7741719b57b2efb82d658e.daily.co/pipecat)

### Suggested conversations:
- *Learn.* Ask the agent about NVIDIA's developments in Agentic AI.
- *Try tool calling.* Ask the agent about the weather.
- *Observe the agent's context "memory".* After a few minutes of conversation, ask the agent to recite the very first thing you said.

The first time you run the agent, it loads the weights for a voice activity detection model into the local Python process, which takes 10-15 seconds. A permissions dialog will then ask you to allow the browser to access your camera and microphone. Click yes to start talking to the agent. If you have any trouble with this, see [here](https://help.daily.co/en/articles/2525908-allow-camera-and-mic-access).

To end the chat with the agent, leave the WebRTC call.

In [None]:
await runner.run(task)

2024-12-30 20:17:53.106 | DEBUG    | pipecat.pipeline.runner:run:27 - Runner PipelineRunner#1 started running PipelineTask#1
2024-12-30 20:17:53.109 | INFO     | pipecat.transports.services.daily:join:322 - Joining https://pc-34b1bdc94a7741719b57b2efb82d658e.daily.co/pipecat
2024-12-30 20:17:53.925 | INFO     | pipecat.transports.services.daily:on_participant_joined:620 - Participant joined 2f43f935-e509-477f-8e4b-76c7c34735be
2024-12-30 20:17:54.739 | INFO     | pipecat.transports.services.daily:join:340 - Joined https://pc-34b1bdc94a7741719b57b2efb82d658e.daily.co/pipecat
2024-12-30 20:17:56.961 | DEBUG    | pipecat.transports.base_output:_bot_started_speaking:203 - Bot started speaking

Starting fetch_weather_from_api with function_name: get_weather
fetch_weather_from_api * location: Boston, lat & lon: 42.3601, -71.0589
noaa get simple weather for '42.3601, -71.0589'


2024-12-30 20:18:37.670 | DEBUG    | pipecat.services.openai:process_frame:530 - FunctionCallResultFrame: FunctionCallResultFrame#4
2024-12-30 20:18:37.671 | DEBUG    | pipecat.services.openai:_stream_chat_completions:176 - Generating chat: [{"role": "system", "content": "\nYou are Lydia; a conversational voice agent ...

Starting fetch_weather_from_api with function_name: get_weather
fetch_weather_from_api * location: New Orleans, lat & lon: 29.9511, -90.0715
noaa get simple weather for '29.9511, -90.0715'


2024-12-30 20:18:55.320 | DEBUG    | pipecat.services.openai:process_frame:530 - FunctionCallResultFrame: FunctionCallResultFrame#6
2024-12-30 20:18:55.322 | DEBUG    | pipecat.services.openai:_stream_chat_completions:176 - Generating chat: [{"role": "system", "content": "\nYou are Lydia; a conversational voice agent ...