### Voice Chatbot with ASR (Automatic Speech Recognition)

In this cookbook, we will walk through the process of creating a simple sales chatbot with automatic speech recognition (ASR) and text-to-speech (TTS) capabilities. We'll use a GPT model via the Chat Completions API to drive the conversation with the user. At the end of the interaction, the chatbot will present an order cart containing the items the user wishes to purchase.

Voice chatbots based on ASR/TTS can introduce latency due to the speech-to-text and text-to-speech conversion processes. We will explore strategies to minimize this lag to ensure a better conversational flow.

Creating an ASR/TTS-based voice chatbot is a three-step process, as outlined below:

**1. Set Up the GPT Model (Text-to-Text Modality)**  
Initialize the GPT model with system prompts that define the goal of the conversation, guiding the chatbot's responses toward assisting with sales order placement. The prompts can be set up for a multi-assistant system, where one assistant drives the conversation with the customer and another assistant manages the cart in parallel. Also, set up tools for assistants to use when asking for human help or interacting with each other (such as cart pricing).

**2. Develop Audio Modules for ASR (Automatic Speech Recognition) and TTS (Text-To-Speech)**  
Create an audio interface that listens to the user, records their speech, and forwards the audio data to the ASR solution (such as Whisper) to transcribe it to text. Depending on the platform, implement an audio chunking strategy that segments the audio at silence intervals. Keep the threshold and duration of silence configurable based on the environment. Set up a TTS interface that, given an input text, converts the text to audio and speaks it back to the user.

**3. Create a conversation loop and manage order cart**  
Implement a conversation loop where the agent listens to the user and responds back, continuing until an event occurs that breaks the loop, such as a request to speak with a human or another indication of the end of the conversation.


The overall solution architecture is as follows:  
![ASR/TTS](./images/asr-text-to-speech.png)

For the purposes of this cookbook, we will use an example of an office stationery ordering bot. You can interact with the bot to order general-purpose office products such as pencils, pens, paper clips, writing pads, printing paper, and envelopes.

The key challenges we want to address are:

1. Ensure customers can only order items that are available.
2. Escalate to a human in the loop if the customer requests help or engages in non-order-related conversation.
3. Provide an accurate summary of the order with prices to the customer.
4. Minimize the lag in the conversation.


Before we get started, make sure you have the following libraries installed: `pyaudio`, `numpy`, `openai`, and `playsound`, and that you have configured your OpenAI API key as an environment variable.

### 1. Set Up the GPT Model (Text-to-Text Modality)

The first step is to lay the foundation for the GPT model to operate effectively as a sales chatbot within the office stationery domain. By carefully crafting the prompts and defining the functions, we ensure that the bot can handle customer interactions smoothly, maintain the flow of conversation, and provide accurate assistance aligned with the objectives of our project.


In [63]:
import json

# Creates a list of dictionaries, where each dictionary represents an office stationery item available for purchase.
office_stationery_items = [
    {"item-id": "0001", "item-name": "pencil", "item-price": "$0.50"},
    {"item-id": "0002", "item-name": "pen", "item-price": "$1.00"},
    {"item-id": "0003", "item-name": "clip", "item-price": "$0.05"},
    {"item-id": "0004", "item-name": "writing pad", "item-price": "$2.00"},
    {"item-id": "0005", "item-name": "printing paper", "item-price": "$5.00"},
    {"item-id": "0006", "item-name": "envelope", "item-price": "$0.10"}
]

# Defines the system prompt that instructs the GPT model on how to behave during the conversation.
SALES_BOT_PROMPT = f"""You are an office stationery sales bot. The customer will ask to buy one of the following items. Follow the rules below: 
1. Be succinct in your responses up to 10 words or less if possible.  
2. If the customer asks for an item that is not available, you should let the customer know that item is not available.
3. Once the customer has placed an order, reply with ANYTHING ELSE
4. If the customer wants to chat with a human, call the function  'get_human_help'
5. If the customer discusses any other topic, other than ordering office stationery, call the function 'get_human_help'
6. When the order is final, call the function `get_order_details` and let the customer know the price.
<LIST OF ITEMS>
{office_stationery_items}
</LIST OF ITEMS>  
"""

# Provides a separate prompt to guide the bot in generating the final order cart
# An example is provided to illustrate the desired output format, ensuring consistency and accuracy in the bot's response
# This could be further enhanced by structured output, but one shot example is sufficient in this context 
SALES_CART_PROMPT = f"""You are an office stationery sales bot, that will generate a cart based on a conversation between a user and an agent. The list of items available for purchase is provided below. Output the cart in JSON format. Include quantity and total price of the order. 

<LIST OF ITEMS>
{office_stationery_items}
</LIST OF ITEMS> 

<EXAMPLE OF A CART> 
{{
  "cart": [
    {{
      "item-id": "0001",
      "item-name": "pencil",
      "quantity": 4,
      "item-price": "$0.50",
      "total-item-price": "$2.00"
    }}
  ],
  "total-price": "$2.00"
}}
</EXAMPLE OF A CART> 
"""

# Defines functions that the bot can "call" during the conversation to handle specific situations such as to get order details and get human help 
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_details",
            "description": "Use this function once the customer has finished ordering to get the order price."
        }
    }, 
    {
        "type": "function",
        "function": {
            "name": "get_human_help",
            "description": "Use this function if customer discusses topics other than the order or wants to speak with a human."
        }
    }]
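
Because the cart is generated by a model, its arithmetic is not guaranteed to be right. The sketch below shows one way to re-check a generated cart against the catalog before reading a total back to the customer. The helper names `to_decimal` and `check_cart` are hypothetical and not part of the cookbook, the two-item `CATALOG` is a shortened copy of `office_stationery_items` so the example runs on its own, and `quantity` is assumed to be an integer as in the example cart.

```python
from decimal import Decimal

# Shortened copy of the catalog, in the same shape as office_stationery_items.
CATALOG = [
    {"item-id": "0001", "item-name": "pencil", "item-price": "$0.50"},
    {"item-id": "0002", "item-name": "pen", "item-price": "$1.00"},
]


def to_decimal(price):
    """Convert a price string such as "$0.50" to an exact Decimal."""
    return Decimal(price.replace("$", "").replace(",", ""))


def check_cart(cart, catalog):
    """Hypothetical helper: list the problems found in a model-generated cart."""
    prices = {item["item-id"]: to_decimal(item["item-price"]) for item in catalog}
    problems = []
    running_total = Decimal("0")
    for line in cart["cart"]:
        item_id = line["item-id"]
        if item_id not in prices:
            problems.append(f"unknown item-id {item_id}")
            continue
        expected = prices[item_id] * line["quantity"]
        running_total += expected
        if to_decimal(line["total-item-price"]) != expected:
            problems.append(f"wrong line total for {item_id}")
    if to_decimal(cart["total-price"]) != running_total:
        problems.append("wrong total-price")
    return problems
```

An empty list means the cart is consistent with the catalog. A non-empty list could be used to retry the pricing agent or to escalate to a human.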


### 2. Develop Audio Modules for ASR (Automatic Speech Recognition) and TTS (Text-To-Speech)

The following Python code implements an interactive voice agent that facilitates customer interactions for ordering office stationery. The `listen()` function records the customer's speech using PyAudio, detects silence to determine when the customer has finished speaking, and saves the audio to a WAV file. It then transcribes the recorded speech into text using OpenAI's Whisper model. 
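
The stop rule in `listen()` can be tried out without a microphone. The sketch below applies the same rule (stop once enough speech has been heard and a long enough run of silence follows) to synthetic audio. It is an illustration only: it uses the standard-library `array` module in place of NumPy so it runs anywhere, and `mean_abs_amplitude`, `chunks_until_stop`, and `tone_chunk` are hypothetical names.

```python
import array
import math

SILENCE_THRESHOLD = 20   # mean absolute amplitude below this counts as silence
SILENCE_CHUNKS = 50      # length of the trailing silent run required to stop
SPOKEN_CHUNKS = 50       # non-silent chunks required before stopping is allowed


def mean_abs_amplitude(chunk_bytes):
    """Mean absolute value of the signed 16-bit samples in a raw audio chunk."""
    samples = array.array("h", chunk_bytes)
    return sum(abs(s) for s in samples) / len(samples) if samples else 0.0


def chunks_until_stop(chunks):
    """Return how many chunks are consumed before the stop rule fires, or None."""
    silent_run = 0
    spoken = 0
    for count, chunk in enumerate(chunks, start=1):
        if mean_abs_amplitude(chunk) < SILENCE_THRESHOLD:
            silent_run += 1
        else:
            silent_run = 0
            spoken += 1
        if silent_run > SILENCE_CHUNKS and spoken > SPOKEN_CHUNKS:
            return count
    return None


def tone_chunk(n=1024, amplitude=3000):
    """Synthetic stand-in for speech: a 440 Hz sine sampled at 44.1 kHz."""
    samples = (int(amplitude * math.sin(2 * math.pi * 440 * i / 44100)) for i in range(n))
    return array.array("h", samples).tobytes()


silence = bytes(2 * 1024)  # 1024 zero-valued 16-bit samples
stream = [tone_chunk()] * 60 + [silence] * 80
print(chunks_until_stop(stream))
```

With 1024-frame chunks at 44100 Hz, 50 chunks is a little over one second of audio, so these settings wait for roughly a second of silence before the recording stops. Lowering `SILENCE_CHUNKS` shortens that wait at the risk of cutting off a customer who pauses mid-sentence.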

The `speak(agent_message)` function takes the agent's text response, converts it into spoken audio using OpenAI's text-to-speech model, saves it as a WAV file, and plays it back to the customer. Overall, the code enables a conversational interface by integrating speech recognition and synthesis.

To reduce lag, common phrases are pre-recorded and stored in the `sounds` folder. If the agent's response matches one of these phrases, the corresponding file is played immediately and the text-to-speech call is skipped.
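
As the number of pre-recorded phrases grows, a lookup table may be easier to maintain than a chain of `if`/`elif` comparisons. The sketch below is one possible shape, using the two phrases and file paths from this cookbook. The `PRERECORDED` table, `normalize`, and `prerecorded_path` are hypothetical names, and matching on normalized text is an assumption about how tolerant the lookup should be.

```python
from pathlib import Path

# Phrases and paths from the cookbook; the table itself is illustrative.
PRERECORDED = {
    "What would you like to order?": Path("sounds/initial_message.wav"),
    "Let me get you a human to help!": Path("sounds/human_help_message.wav"),
}


def normalize(text):
    """Collapse whitespace and case so trivially different strings still match."""
    return " ".join(text.lower().split())


# Build the lookup once, keyed by the normalized phrase.
_LOOKUP = {normalize(phrase): path for phrase, path in PRERECORDED.items()}


def prerecorded_path(agent_message):
    """Return the pre-recorded file for a message, or None if it must be synthesized."""
    return _LOOKUP.get(normalize(agent_message))
```

A `speak()` implementation could call `prerecorded_path()` first and fall back to the text-to-speech request only when it returns `None`.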


In [64]:
import pyaudio
import numpy as np
import wave
from openai import OpenAI
from playsound import playsound

CHUNK = 1024  # CHUNK sets the number of frames per buffer.
FORMAT = pyaudio.paInt16  # FORMAT specifies the sample format (16-bit in this case).
CHANNELS = 1  # CHANNELS sets the number of audio channels: 1 for mono, 2 for stereo
RATE = 44100  # RATE sets the sample rate to 44100 Hz
SILENCE_THRESHOLD = 20  # Adjust this threshold based on your environment
SILENCE_CHUNKS = 50  # Number of chunks of silence to trigger stop
SPOKEN_CHUNKS = 50  # Number of spoken chunks required for a valid response from the user

oai_client = OpenAI()


# List of pre-recorded messages 
initial_message = "What would you like to order?"
human_help_message = "Let me get you a human to help!"


def listen():
    """Listen to the customer. Return the text from the speech"""
    print("Agent listening ...")

    def is_silent(input_data):
        """Check if the given data chunk is silent."""
        audio_data = np.frombuffer(input_data, dtype=np.int16)
        return np.abs(audio_data).mean() < SILENCE_THRESHOLD

    output = "user_response.wav"
    with wave.open(output, 'wb') as wf:
        p = pyaudio.PyAudio()
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setnchannels(CHANNELS)
        wf.setframerate(RATE)

        stream = p.open(format=FORMAT,
                        channels=CHANNELS,
                        rate=RATE,
                        input=True,
                        frames_per_buffer=CHUNK)

        # print("* recording")
        frames = []
        silent_chunks = 0
        speech_chunks = 0

        while True:
            data = stream.read(CHUNK)
            frames.append(data)

            if is_silent(data):
                silent_chunks += 1
            else:
                silent_chunks = 0
                speech_chunks += 1

            if silent_chunks > SILENCE_CHUNKS and speech_chunks > SPOKEN_CHUNKS:
                break

        print("* done listening")
        stream.stop_stream()
        stream.close()
        p.terminate()
        wf.writeframes(b''.join(frames))

    # The WAV file is closed at this point, so its contents are fully written to disk.
    # Upload the recorded audio file to OpenAI whisper-1 model for transcription
    with open(output, "rb") as audio_file:
        transcription = oai_client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            prompt="This is a customer trying to order office stationery"
        )

    return transcription.text


def speak(agent_message):
    print("Agent speaking ...")
    # Common phrases can be pre-recorded 
    if agent_message == initial_message:
        playsound("./sounds/initial_message.wav")
    
    elif agent_message == human_help_message: 
        playsound("./sounds/human_help_message.wav")

    else:
        # Convert text to speech
        response = oai_client.audio.speech.create(
            model="tts-1",
            voice="nova",
            input=agent_message,
            response_format="wav"  # The default is MP3; request WAV to match the file name
        )
        # Save to the file
        response.write_to_file("assistant_message.wav")

        # play the audio file
        playsound("assistant_message.wav")
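
Saving the whole file before playback adds the full synthesis time to every turn. A possible alternative, sketched below, is to stream the audio and play each chunk as it arrives. This is a sketch under assumptions rather than a drop-in replacement: it relies on the `with_streaming_response` interface of the `openai` Python SDK and on the `pcm` response format being raw 24 kHz, 16-bit, mono audio. `play_pcm_chunks` and `speak_streaming` are hypothetical names.

```python
def play_pcm_chunks(chunks, write):
    """Pass each non-empty audio chunk to `write` as it arrives; return bytes played."""
    total = 0
    for chunk in chunks:
        if chunk:
            write(chunk)
            total += len(chunk)
    return total


def speak_streaming(agent_message):
    """Stream TTS audio to the speakers instead of saving a file first."""
    # Imports are local so the helper above can be used without audio hardware.
    import pyaudio
    from openai import OpenAI

    client = OpenAI()
    p = pyaudio.PyAudio()
    # Assumption: "pcm" output is headerless 24 kHz, 16-bit, mono audio.
    out = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
    try:
        with client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice="nova",
            input=agent_message,
            response_format="pcm",
        ) as response:
            play_pcm_chunks(response.iter_bytes(chunk_size=1024), out.write)
    finally:
        out.stop_stream()
        out.close()
        p.terminate()
```

Keeping the playback loop separate from the API call makes it possible to exercise `play_pcm_chunks` with any iterable of bytes, for example a list of test chunks and `list.append` as the sink.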
        

### 3. Create a conversation loop and manage order cart  

The code below initiates a conversation with a user to facilitate an ordering process. It starts by asking the user what they would like to order, then continuously listens to the user's input using the `listen()` function. 

The conversation is managed through a loop that appends each user input to a message list, which logs the interaction. The code then sends the conversation so far to the GPT-4o model (`gpt-4o`) to generate a response.

If the model responds with a call to `get_order_details`, a second agent generates the cart from the conversation, the bot speaks the total price, and the loop breaks, signaling the end of the interaction.

Alternatively, if the user requests human assistance, the conversation ends by breaking the loop and triggering a message indicating human intervention is needed.
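
The branching described above can also be expressed as a dispatch table, which keeps the loop body short as more tools are added. The sketch below is one possible arrangement, not the cookbook's implementation: the handlers are placeholders, `next_step` and `TOOL_HANDLERS` are hypothetical names, and `SimpleNamespace` objects stand in for the message object returned by the Chat Completions API.

```python
from types import SimpleNamespace


def handle_order_details(messages):
    """Placeholder: the cookbook calls the pricing agent at this point."""
    return "Thank you for your order."


def handle_human_help(messages):
    """Placeholder: application code could notify a human at this point."""
    return "Let me get you a human to help!"


# Tool name -> handler. Both tools end the conversation in this cookbook.
TOOL_HANDLERS = {
    "get_order_details": handle_order_details,
    "get_human_help": handle_human_help,
}


def next_step(message, messages):
    """Return (text_to_speak, conversation_finished) for one model reply."""
    if message.tool_calls:
        name = message.tool_calls[0].function.name
        handler = TOOL_HANDLERS.get(name)
        if handler is None:
            # Unknown tool: ask again instead of speaking an empty reply.
            return "Sorry, could you repeat that?", False
        return handler(messages), True
    return message.content, False


# Usage with stand-in message objects.
plain_reply = SimpleNamespace(content="ANYTHING ELSE?", tool_calls=None)
help_call = SimpleNamespace(function=SimpleNamespace(name="get_human_help"))
tool_reply = SimpleNamespace(content=None, tool_calls=[help_call])
print(next_step(plain_reply, []))
print(next_step(tool_reply, []))
```

The loop would then speak the returned text, append it to the message list, and break when the second value is `True`.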


In [65]:
# Set the messages dictionary 
messages_dictionary = [{
    "role": "assistant",
    "content": initial_message
}] 

# Initialize the prompt for sales agent 
sales_agent_prompt = [{"role": "system", "content": SALES_BOT_PROMPT}]
# Initialize the prompt for pricing agent 
pricing_agent_prompt = [{"role": "system", "content": SALES_CART_PROMPT}]

# Initiate the conversation with the user 
speak(initial_message)


# Loop until the user has completed the order or asks for human help 
while True:
    # listen to the user input 
    user_input = listen()

    # Append the message to messages dictionary to pass on the model 
    messages_dictionary.append({
        "role": "user",
        "content": user_input
    })
    
    # Response from the model to user input 
    response = oai_client.chat.completions.create(
        model='gpt-4o',
        messages=sales_agent_prompt + messages_dictionary, 
        tools=TOOLS
    )
    
    tool_calls = response.choices[0].message.tool_calls
    
    # Check if model wants to call a tool  
    if tool_calls: 
        tool_function_name = tool_calls[0].function.name
        if tool_function_name == "get_order_details":
            # Invoke the cart management agent to get the price  
            response = oai_client.chat.completions.create(
                model='gpt-4o',
                messages=pricing_agent_prompt + messages_dictionary, 
                response_format={"type": "json_object"}
            )
            # Get message content 
            cart = json.loads(response.choices[0].message.content)
            
            
            print("*" * 10 + "Cart: " + "*" * 10)
            print(json.dumps(cart, indent=4))
    
            # Extracting the total price of the entire order
            total_price = cart["total-price"]
            final_message = f"Thank you for your order, your total is {total_price}"
            
            speak(final_message)
            messages_dictionary.append({
                "role": "assistant",
                "content": final_message
                })
            break
        elif tool_function_name == "get_human_help":
            # Application code in this block can invoke an API to get a human's attention  
            speak(human_help_message)
            messages_dictionary.append({
                "role": "assistant",
                "content": human_help_message
                })
            break
        else: 
            print(f"Tool does not exist: {response.choices[0].message.tool_calls}")
            # A tool call usually carries no text content, so skip speaking and listen again
            continue
    
    # Get message content 
    response_message = response.choices[0].message.content
    
    # Append the message to messages dictionary 
    messages_dictionary.append({
        "role": "assistant",
        "content": response_message
    })
    speak(response_message)
    
    
# Print the conversation
print("*" * 10 + " Conversation log: " + "*" * 10)
print(json.dumps(messages_dictionary, indent=4))

Agent speaking ...
Agent listening ...
Agent speaking ...
Agent listening ...
**********Cart: **********
{
    "cart": [
        {
            "item-id": "0002",
            "item-name": "pen",
            "quantity": 4,
            "item-price": "$1.00",
            "total-item-price": "$4.00"
        }
    ],
    "total-price": "$4.00"
}
Agent speaking ...
********** Conversation log: **********
[
    {
        "role": "assistant",
        "content": "What would you like to order?"
    },
    {
        "role": "user",
        "content": "Hi, can I get 4 pens?"
    },
    {
        "role": "assistant",
        "content": "You ordered 4 pens. ANYTHING ELSE?"
    },
    {
        "role": "user",
        "content": "That will be all. Thank you."
    },
    {
        "role": "assistant",
        "content": "Thank you for your order, your total is $4.00"
    }
]


Converting speech to text and text back to speech adds noticeable lag to the conversation. A few ways to improve the user experience are as follows:

1. Provide visual cues when the model is speaking or listening.
2. Chunk incoming audio before sending it to Whisper.
3. Pre-record common phrases such as the welcome message and the human-in-the-loop escalation message.
4. Keep text output from the model short and abbreviate common responses.
5. Stream output audio instead of waiting for the entire audio file to be available.
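
Before applying any of these, it can help to measure which stage contributes most of the lag. The sketch below records per-stage timings with a small context manager. The `timed` and `slowest_stage` helpers are hypothetical, and the `time.sleep` calls are stand-ins for `listen()`, the chat completion request, and `speak()`.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> list of durations in seconds


@contextmanager
def timed(stage):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(stage, []).append(time.perf_counter() - start)


def slowest_stage(recorded):
    """Return the stage with the largest total time, or None if nothing was recorded."""
    if not recorded:
        return None
    return max(recorded, key=lambda stage: sum(recorded[stage]))


# Stand-ins for listen(), the chat completion call, and speak().
with timed("transcribe"):
    time.sleep(0.01)
with timed("generate"):
    time.sleep(0.03)
with timed("speak"):
    time.sleep(0.01)

print(slowest_stage(timings))
```

Wrapping the real calls in the conversation loop the same way would show whether transcription, generation, or speech synthesis is the best place to spend optimization effort.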