### Voice Chatbot with GPT-4o Audio Modality 

In this cookbook, we walk through building a voice-enabled sales chatbot with the newly released GPT-4o audio-in and audio-out modalities. In addition to the text and image modalities, GPT-4o lets you generate a spoken audio response to a prompt and use audio inputs to prompt the model. You can learn more about the audio generation capabilities of GPT-4o [here](https://platform.openai.com/docs/guides/audio).

A chatbot based on the GPT-4o audio modality balances low-latency communication with better control over the conversation. With streaming audio output, GPT-4o is an improvement over the previous generation of Text-To-Speech (TTS)/Speech-To-Text (STT) chatbots.


Creating a chatbot based on the GPT-4o voice modality is a three-step process, as outlined below:

**1. Set Up the GPT Model with Audio-Out Modality**  
Initialize the GPT model with system prompts that define the goal of the conversation, guiding the chatbot's responses toward assisting with sales order placement. The prompts can be set up for a multi-assistant system, where one assistant drives the conversation with the customer and another assistant manages the cart in parallel. Also, set up tools for assistants to use when asking for human help or interacting with each other (such as cart pricing).

**2. Develop Audio Modules for ASR (Automatic Speech Recognition) and Set Up GPT-4o for Audio Input**  
Create an audio interface that listens to the user and records their speech. Implement a **VAD (Voice Activity Detection) module** for **hands-free operation**. This module detects input audio from the user and segments it at silence intervals. Keep the VAD module parameters (the audio amplitude threshold that qualifies as silence, and the number of consecutive silent chunks) configurable, so they can be adjusted for the environment. Each recorded segment is sent to the GPT-4o model as audio input.
 

**3. Create a Conversation Loop and Manage the Order Cart**  
Implement a conversation loop where the agent listens to the user and responds back, continuing until an event occurs that breaks the loop, such as a request to speak with a human or another indication of the end of the conversation.

For the purposes of this cookbook, we will use an example of an office stationery ordering bot. You can interact with the bot to order general-purpose office products such as pencils, pens, paper clips, writing pads, printing paper, and envelopes.

The key challenges we want to address are:

1. Ensure customers can only order items that are available.
2. Escalate to a human in the loop if the customer requests help or engages in non-order-related conversation.
3. Provide an accurate summary of the order with prices to the customer.
4. Minimize the lag in the conversation.


Before we get started, make sure you have the libraries used in the code below installed (`pyaudio`, `numpy`, `requests`, and `pydub`), and that you have configured your OpenAI API key as the `OPENAI_API_KEY` environment variable.

### 1. Set Up the GPT Model 

The first step is to lay the foundation for the GPT model to operate effectively as a sales chatbot within the office stationery domain. By carefully crafting the prompts and defining the functions, we ensure that the bot can handle customer interactions smoothly, maintain the flow of conversation, and provide accurate assistance aligned with the objectives of our project.

We will define a sales bot prompt, `SALES_BOT_PROMPT`, that drives the interaction with the user, and a `SALES_CART_PROMPT` that manages the cart. Note that the items available for sale are provided as a list of JSON-style objects, `office_stationery_items`. This helps the sales bot and the sales cart assistant understand the available items and respond to the user accordingly.


In [60]:
# Creates a list of dictionaries, where each dictionary represents an office stationery item available for purchase.
office_stationery_items = [
    {"item-id": "0001", "item-name": "pencil", "item-price": "$0.50"},
    {"item-id": "0002", "item-name": "pen", "item-price": "$1.00"},
    {"item-id": "0003", "item-name": "clip", "item-price": "$0.05"},
    {"item-id": "0004", "item-name": "writing pad", "item-price": "$2.00"},
    {"item-id": "0005", "item-name": "printing paper", "item-price": "$5.00"},
    {"item-id": "0006", "item-name": "envelope", "item-price": "$0.10"}
]

# Defines the system prompt that instructs the GPT model on how to behave during the conversation.
SALES_BOT_PROMPT = f"""You are an office stationery sales bot. The customer will ask to buy one of the following items. Follow the rules below: 
1. Be succinct: keep your responses to 10 words or fewer if possible.
2. If the customer asks for an item that is not available, you should let the customer know that item is not available.
3. Once the customer has placed an order, reply with ANYTHING ELSE
4. If the customer wants to chat with a human, call the function  'get_human_help'
5. If the customer discusses any other topic, other than ordering office stationery, call the function 'get_human_help'
6. When the order is final, call the function `get_order_details` and let the customer know the price.
<LIST OF ITEMS>
{office_stationery_items}
</LIST OF ITEMS>  
"""

# Provides a separate prompt to guide the bot in generating the final order cart
# An example is provided to illustrate the desired output format, ensuring consistency and accuracy in the bot's response
# This could be further enhanced by structured output, but one shot example is sufficient in this context 
SALES_CART_PROMPT = f"""You are an office stationery sales bot, that will generate a cart based on a conversation between a user and an agent. The list of items available for purchase is provided below. Output the cart in JSON format. Include quantity and total price of the order. 

<LIST OF ITEMS>
{office_stationery_items}
</LIST OF ITEMS> 

<EXAMPLE OF A CART> 
{{
  "cart": [
    {{
      "item-id": "0001",
      "item-name": "pencil",
      "quantity": 4,
      "item-price": "$0.50",
      "total-item-price": "$2.00"
    }}
  ],
  "total-price": "$2.00"
}}
</EXAMPLE OF A CART> 
"""

# Defines functions that the bot can "call" during the conversation to handle specific situations such as to get order details and get human help 
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_details",
            "description": "Use this function once the customer has finished ordering to get the order price."
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_human_help",
            "description": "Use this function if customer discusses topics other than the order or wants to speak with a human."
        }
    }]

# Initialize the prompt for sales agent 
sales_agent_prompt = [{"role": "system", "content": SALES_BOT_PROMPT}]

# Initialize the prompt for pricing agent 
pricing_agent_prompt = [{"role": "system", "content": SALES_CART_PROMPT}]
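
Because the cart is produced by a model, it can be worth checking it against the catalog before quoting a price to the customer. The sketch below is one possible check, not part of the original notebook: `parse_price` and `validate_cart` are hypothetical helper names, the catalog is copied from the cell above so the block runs on its own, and quantities are assumed to be whole numbers.

```python
import json
from decimal import Decimal

# Catalog copied from the cell above so this sketch is self-contained.
CATALOG = [
    {"item-id": "0001", "item-name": "pencil", "item-price": "$0.50"},
    {"item-id": "0002", "item-name": "pen", "item-price": "$1.00"},
    {"item-id": "0003", "item-name": "clip", "item-price": "$0.05"},
    {"item-id": "0004", "item-name": "writing pad", "item-price": "$2.00"},
    {"item-id": "0005", "item-name": "printing paper", "item-price": "$5.00"},
    {"item-id": "0006", "item-name": "envelope", "item-price": "$0.10"},
]


def parse_price(price):
    """Convert a price string such as "$0.50" to a Decimal (hypothetical helper)."""
    return Decimal(price.replace("$", ""))


def validate_cart(cart_json, catalog=CATALOG):
    """Return a list of problems found in the cart; an empty list means it is consistent."""
    prices = {item["item-id"]: parse_price(item["item-price"]) for item in catalog}
    cart = json.loads(cart_json)
    problems = []
    running_total = Decimal("0")
    for line in cart["cart"]:
        item_id = line["item-id"]
        if item_id not in prices:
            problems.append(f"unknown item-id {item_id}")
            continue
        # Assumes integer quantities; Decimal cannot be multiplied by a float.
        expected = prices[item_id] * line["quantity"]
        running_total += expected
        if parse_price(line["total-item-price"]) != expected:
            problems.append(f"wrong total for item-id {item_id}")
    if parse_price(cart["total-price"]) != running_total:
        problems.append("wrong cart total")
    return problems


# The example cart from SALES_CART_PROMPT passes the check.
example_cart = json.dumps({
    "cart": [{"item-id": "0001", "item-name": "pencil", "quantity": 4,
              "item-price": "$0.50", "total-item-price": "$2.00"}],
    "total-price": "$2.00",
})
print(validate_cart(example_cart))  # → []
```

Using `Decimal` rather than `float` avoids rounding surprises when prices such as `$0.05` are multiplied and summed.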

### 2. Develop Audio Modules for ASR (Automatic Speech Recognition) and TTS (text-to-speech)

The following Python code implements an interactive voice agent that facilitates customer interactions for ordering office stationery.  
 
The `listen()` function implements the **VAD (Voice Activity Detection) module** using `PyAudio`, which streams the audio in buffers of `frames_per_buffer` frames, defined as `CHUNK`. Each `CHUNK` is `1024` frames. The nested function `is_silent(data_chunk)` classifies a chunk as silent when its mean absolute amplitude is below `SILENCE_THRESHOLD`. This helps filter out low-level noise in the environment, such as breathing sounds. Once more than `SILENT_CHUNKS` (`50`) consecutive silent chunks have been recorded, the function treats the customer as having finished speaking. To qualify as valid user input, the recording must also contain more than `SPOKEN_CHUNKS` non-silent chunks. When both conditions hold, the audio is written to an in-memory WAV file and returned as a base64-encoded string, ready to be passed to GPT-4o as audio input.  

Variables `SPOKEN_CHUNKS`, `SILENCE_THRESHOLD` and `SILENT_CHUNKS` can be adjusted based on the environment and type of use case to segment the input audio. 
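
To make these parameters easier to reason about: with `CHUNK = 1024` and `RATE = 24000`, each chunk covers about 43 ms, so 50 silent chunks correspond to roughly 2.1 seconds of silence. The sketch below converts a desired duration into a chunk count and derives a threshold from a few chunks of ambient noise. It is an illustration on synthetic audio only; `chunks_for_duration`, `mean_amplitude`, `suggest_threshold`, and the `margin` factor are hypothetical names and choices, not part of the notebook.

```python
import math

import numpy as np

RATE = 24000  # samples per second, matching the configuration in this cookbook
CHUNK = 1024  # frames per buffer


def chunks_for_duration(seconds, rate=RATE, chunk=CHUNK):
    """Number of whole chunks needed to cover at least `seconds` of audio."""
    return math.ceil(seconds * rate / chunk)


def mean_amplitude(data_chunk):
    """Mean absolute amplitude of a buffer of 16-bit samples."""
    # Widen to int32 first so that abs(-32768) does not overflow int16.
    samples = np.frombuffer(data_chunk, dtype=np.int16).astype(np.int32)
    return float(np.abs(samples).mean())


def suggest_threshold(ambient_chunks, margin=2.0):
    """Hypothetical calibration: `margin` times the loudest ambient chunk."""
    return margin * max(mean_amplitude(c) for c in ambient_chunks)


# Synthetic stand-ins for microphone data: low-level noise and a 220 Hz tone.
rng = np.random.default_rng(0)
ambient = [rng.integers(-5, 6, size=CHUNK).astype(np.int16).tobytes() for _ in range(10)]
t = np.arange(CHUNK) / RATE
speech_like = (3000 * np.sin(2 * np.pi * 220 * t)).astype(np.int16).tobytes()

threshold = suggest_threshold(ambient)
print(chunks_for_duration(2.0))                 # → 47
print(mean_amplitude(speech_like) > threshold)  # → True
```

In a real environment the ambient chunks would come from a short recording taken before the conversation starts, and the margin would need tuning by ear.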

The `speak(agent_message)` function takes the agent's text response, converts it into spoken audio using OpenAI's text-to-speech model, saves it as a WAV file, and plays it back to the customer. Overall, the code enables a conversational interface by integrating speech recognition and synthesis. Note that in this implementation the audio cannot be interrupted once playback starts; the user must wait for their turn to speak.

To reduce the lag, we have pre-recorded sound snippets and stored them under the `sounds` folder. If the agent's response is one of these pre-recorded phrases, we can play it instantly, reducing the perceived lag.
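
One way to look up a pre-recorded snippet is to normalize the agent's message into a file name and check whether that file exists. The sketch below assumes a naming convention of lowercase words joined by underscores (for example `anything_else.wav`); both the convention and the `snippet_path` helper are assumptions made for illustration, and a temporary directory stands in for the `sounds` folder.

```python
import re
import tempfile
from pathlib import Path


def snippet_path(agent_message, sounds_dir="sounds"):
    """Return the path of a pre-recorded WAV matching the message, or None."""
    # Assumed convention: "Anything else?" maps to "anything_else.wav".
    key = re.sub(r"[^a-z0-9]+", "_", agent_message.lower()).strip("_")
    candidate = Path(sounds_dir) / f"{key}.wav"
    return candidate if candidate.is_file() else None


# Demonstration with a temporary folder and an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "anything_else.wav").touch()
    hit = snippet_path("Anything else?", sounds_dir=tmp)
    miss = snippet_path("Your total is $2.00.", sounds_dir=tmp)
    print(hit.name)  # → anything_else.wav
    print(miss)      # → None
```

When the lookup returns `None`, the caller would fall back to generating the audio as usual.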

In [61]:
import pyaudio
import numpy as np
import io
import base64
import wave

CHUNK = 1024  # Number of frames per buffer
FORMAT = pyaudio.paInt16  # Sample format
CHANNELS = 1  # Mono audio
RATE = 24000  # Sample rate (Hz)
SILENCE_THRESHOLD = 20  # Silence threshold
SILENT_CHUNKS = 50  # Number of silent chunks to stop recording
SPOKEN_CHUNKS = 50  # Minimum spoken chunks to consider valid speech


def listen():
    """Listen to the microphone and return the audio as a base64-encoded WAV string."""
    print("Agent listening ...")

    def is_silent(data_chunk):
        """Determine if the given audio chunk is silent."""
        audio_data = np.frombuffer(data_chunk, dtype=np.int16)
        return np.abs(audio_data).mean() < SILENCE_THRESHOLD

    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)

    frames = []
    silent_chunks = speech_chunks = 0

    try:
        while True:
            data = stream.read(CHUNK)
            frames.append(data)

            if is_silent(data):
                silent_chunks += 1
            else:
                silent_chunks = 0
                speech_chunks += 1

            if silent_chunks > SILENT_CHUNKS and speech_chunks > SPOKEN_CHUNKS:
                break
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()

    print("* done listening")

    # Write to an in-memory WAV file
    audio_buffer = io.BytesIO()
    with wave.open(audio_buffer, 'wb') as wf:
        wf.setnchannels(CHANNELS)
        # Use the module-level helper: `p` was already terminated in the `finally` block above
        wf.setsampwidth(pyaudio.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))

    # Get the WAV data from the buffer
    audio_data = audio_buffer.getvalue()

    # Encode the WAV data to a base64 string
    base64_encoded_audio = base64.b64encode(audio_data).decode('utf-8')

    return base64_encoded_audio


In [69]:
import requests
import os
import json

# Load the API key from the environment variable
api_key = os.getenv("OPENAI_API_KEY")


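def process_audio_with_gpt_4o(output_modalities, prompt_messages_dictionary, tools):
    """Send the conversation to the Chat Completions API.

    Returns the parsed JSON response, or None if the API returns a non-200 status.
    """
    # Chat Completions API endpoint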
    url = "https://api.openai.com/v1/chat/completions"

    # Set the headers
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    # Construct the request data
    data = {
        "model": "gpt-4o-audio-preview",
        "modalities": output_modalities,
        "tools":tools,
        "tool_choice": "auto",
        "audio": {
            "voice": "alloy",
            "format": "wav"
        },
        "messages": prompt_messages_dictionary
    }
    
    request_response = requests.post(url, headers=headers, data=json.dumps(data))
    if request_response.status_code == 200:
        return request_response.json()
    else:
        print(f"Error {request_response.status_code}: {request_response.text}")
        return
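
Because `process_audio_with_gpt_4o` returns `None` on any non-200 response, a caller may want to retry before giving up, since server-side errors are often transient. The sketch below shows a generic retry loop with exponential backoff; `call_with_retries` and its parameters are hypothetical, and a fake request function stands in for the real API call so the example runs offline.

```python
import time


def call_with_retries(request_fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call request_fn until it returns a non-None result or attempts run out."""
    for attempt in range(max_attempts):
        result = request_fn()
        if result is not None:
            return result
        # Back off before the next attempt: base_delay, 2 * base_delay, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# Fake request that fails twice and then succeeds, to exercise the loop.
attempts = []


def flaky_request():
    attempts.append(1)
    return {"ok": True} if len(attempts) == 3 else None


delays = []
print(call_with_retries(flaky_request, sleep=delays.append))  # → {'ok': True}
print(delays)                                                 # → [1.0, 2.0]
```

In the notebook this could be used as `call_with_retries(lambda: process_audio_with_gpt_4o(["text", "audio"], messages, TOOLS))`. Note that this simple version also retries client errors such as an invalid request, which will not succeed on retry; distinguishing status codes would require the function to expose them.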

In [70]:
# Make sure pydub is installed 
from pydub import AudioSegment
from pydub.playback import play
from io import BytesIO

messages_dictionary = []  # Store the messages 

# Loop until the user has completed the order or asks for human help 
while True:
    # listen to the user input 
    user_input_base64_wav_audio = listen()

    # Append the message to messages dictionary to pass on the model 
    messages_dictionary.append({
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": user_input_base64_wav_audio,
                    "format": "wav"
                }
            }
        ]
    })

    messages = sales_agent_prompt + messages_dictionary

    response_json = process_audio_with_gpt_4o(["text", "audio"], messages, TOOLS)

    # process_audio_with_gpt_4o returns None when the API call fails (for example, on an HTTP 500)
    if response_json is None:
        print("Request failed; ending the conversation.")
        break

    response_message = response_json['choices'][0]['message']
    tool_calls = response_message.get('tool_calls')

    if tool_calls:
        tool_function_name = tool_calls[0]['function']['name']
        if tool_function_name == "get_order_details":
            # The pricing agent generates a detailed cart in JSON format, including item quantities and total prices, ensuring the user receives an accurate summary of their order.
            print("get order details")
            break
        elif tool_function_name == "get_human_help":
            # get_human_help allows the assistant to gracefully transfer the conversation to a human agent if the user requests assistance or deviates from the order process.
            print("Get human help")
            break
        else:
            print(f"Tool does not exist: {tool_function_name}")
    else:
        # Get the transcript from the model. This will vary depending on the modality you are using.
        message_transcript = response_message['audio']['transcript']

        # Get the audio content from the response
        message_audio = response_message['audio']['data']

        # Play the audio
        audio_data_bytes = base64.b64decode(message_audio)
        audio_segment = AudioSegment.from_file(BytesIO(audio_data_bytes), format="wav")
        play(audio_segment)

        # Keep the assistant's reply in the history so the next turn has context
        messages_dictionary.append({"role": "assistant", "content": message_transcript})