# Building AI Assistants with Voice Capabilities

## Motivation

It's amazing that we now have AI systems that can understand and generate natural language and respond directly to instructions. But typing is a lot slower than speaking, which is why speech is the main way humans communicate with each other.

## In this Notebook

In the past two lectures, we've built:
1. A text-input, text-output AI system that can engage in back-and-forth conversation, like ChatGPT
2. A suite of tools that our AI system can choose to use at any point

In this notebook, we will develop voice capabilities for that AI system, in two parts:
1. Voice transcription
2. Voice generation

## Schematic

![](images/Voice%20Assistant%20Diagram.png)

## Setup

In this code, we're going to use the OpenAI API, so we need to:
1. Install the OpenAI Python library (code written by their engineers)
2. Enter our OpenAI API key (our password)
    - To get that, you must [create an account](https://platform.openai.com/) on the OpenAI platform (not ChatGPT), then [create an API key](https://platform.openai.com/api-keys) and copy it to use below.
    - If you later get errors related to a quota limit, try:
        1. Adding a billing method to your OpenAI account
        2. Restarting the notebook (option found under "Runtime" in tab headings at the top)

Here we install several Python libraries onto the machine that runs this notebook (in Google Colab, that is the computer in Google's datacenter that this browser tab is connected to).

In [10]:
!pip install openai
!pip install playsound

Collecting playsound
  Downloading playsound-1.3.0.tar.gz (7.7 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: playsound
  Building wheel for playsound (setup.py) ... done
  Created wheel for playsound: filename=playsound-1.3.0-py3-none-any.whl size=7019 sha256=6e20efa2133ba7d87381d8d149ef88a8ed963a09f4dacdaf0247b70d2b5ca7a8
  Stored in directory: /Users/ice/Library/Caches/pip/wheels/cf/42/ff/7c587bae55eec67b909ca316b250d9b4daedbf272a3cbeb907
Successfully built playsound
Installing collected packages: playsound
Successfully installed playsound-1.3.0


This cell is here so that you visibly notice that the next step is not something you can skip.

> You must get an API key from the [OpenAI developer platform](https://platform.openai.com/api-keys) and put it in the cell below, replacing `"YOUR_API_KEY"`.

In [3]:
from openai import OpenAI
openai = OpenAI(api_key="YOUR_API_KEY")  # replace with your own key; never share or commit a real key
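
A key typed directly into a cell is easy to leak when the notebook is shared. As a minimal sketch of an alternative, assuming you have stored your key in an environment variable called `OPENAI_API_KEY`, you could read it at runtime instead. The helper name `load_api_key` is hypothetical and not part of the OpenAI library.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable (assumed to be set by you)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook.")
    return key


# Usage in the notebook (needs the openai package and a real key):
# from openai import OpenAI
# openai = OpenAI(api_key=load_api_key())
```

The rest of the notebook works unchanged, because it only relies on the `openai` client object existing.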

Here's the main body of Python code that's going to be our starting point.

In [7]:
# import openai # commented out so you don't import again and have to set your api key in every cell
# openai.api_key = "YOUR API KEY HERE" # commented out so you don't overwrite the correct api key you set above

# your system message probably contains a lot of text that you would need to copy over, so I haven't included it here


def get_response(messages):
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        max_tokens=300
    )
    content = response.choices[0].message.content
    return content


def chat():
    messages = [
        {"role": "system", "content": "You are a friendly assistant"} # change me to change my personality
    ]
    while True:
        prompt = input("Type a prompt (or type \"exit\" to exit)...")
        print("User:", prompt)
        if prompt == "exit":
            break
        messages = add_message(messages, prompt, "user")
        response = get_response(messages)
        messages = add_message(messages, response, "assistant")
        print("Assistant:", response)


def add_message(messages, content, role):
    message = {"role": role, "content": content}
    messages.append(message)
    return messages


chat()


User: hello
Assistant: Hello! How may I assist you today?
User: exit
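
To make the conversation state concrete, here is a small sketch of what the `messages` list holds after the exchange above. It repeats `add_message` so it can run on its own and makes no API call; the assistant text is copied from the output above.

```python
def add_message(messages, content, role):
    # Same behaviour as the notebook's helper: append one turn and return the list
    messages.append({"role": role, "content": content})
    return messages


messages = [{"role": "system", "content": "You are a friendly assistant"}]
messages = add_message(messages, "hello", "user")
messages = add_message(messages, "Hello! How may I assist you today?", "assistant")

# Print the history in the order it would be sent to the model
for message in messages:
    print(f'{message["role"]}: {message["content"]}')
```

Because `chat()` passes this whole list to `get_response()` on every turn, the model sees the full history each time, which is what lets it follow a back-and-forth conversation.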


## Working with Audio on OpenAI

For a top-down view of where to find resources for both voice transcription and voice generation, see the [audio section](https://platform.openai.com/docs/api-reference/audio) of the API reference. Make sure you understand where you are in the docs and how you would have found this section yourself.

As a reminder, here's the thought process for how to reach this part of the docs:
1. "So, I know I want to write some code to use the OpenAI platform. That means I should go to the [API reference](https://platform.openai.com/docs/api-reference)."
2. "I know that the list of 'Endpoints' along the left refers to the list of capabilities that the OpenAI platform offers. So I'll look there."
3. "Voice in and voice out are both related to audio, so I'll look in the [audio endpoint](https://platform.openai.com/docs/api-reference/audio)."
4. "Here I can see sections both for creating speech and transcription. Exactly what I was looking for!"

In the next sections, we'll break down the implementation behind voice generation and transcription.

## Voice Generation

As of writing, OpenAI has just added text-to-speech capabilities to their platform by launching a new text-to-speech model. See the [documentation](https://platform.openai.com/docs/api-reference/audio/createSpeech) for how to use it.

Before we integrate voice generation into our chat system, let's first create a function that takes text, turns it into speech, then plays it.

In [14]:
from IPython.display import Audio, display


def say(text):
    speech_file_path = "speech.mp3"
    response = openai.audio.speech.create( # TODO call the function to create speech
        model="tts-1", # TODO specify the model
        voice="alloy", # TODO specify the voice
        input=text # TODO specify the input text
    )
    response.stream_to_file(speech_file_path) # TODO stream the response to the file path
    #  print(dir(response)) # you can always use the dir() function to see what methods are available on an object (this isn't yet in the docs)
    display(Audio(speech_file_path, autoplay=True)) # display the audio file (this would be different if we were working in a .py file instead of a .ipynb notebook file)

say("Hello, my name is Alloy. I am a robot.") # test the function
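
The comment in `say()` notes that playback would look different in a `.py` file, where `IPython.display` is not available. Here is a rough sketch of one option, assuming the `playsound` package installed during setup works on your platform (its behaviour varies by operating system). `play_file` and its `player` parameter are hypothetical names introduced for this example.

```python
def play_file(path, player=None):
    """Play an audio file outside a notebook.

    `player` is any callable that takes a file path. If none is given,
    fall back to playsound, which is imported lazily so this function
    can still be defined when the package is missing.
    """
    if player is None:
        from playsound import playsound  # installed in the setup cell
        player = playsound
    player(path)


# In a script you might call this after response.stream_to_file("speech.mp3"):
# play_file("speech.mp3")
```

Taking the player as a parameter keeps the function easy to try out without speakers, for example by passing `print` to see which file would be played.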

Now, let's integrate the AI system's new voice into our existing conversational code.

In [15]:
# import openai # commented out so you don't import again and have to set your api key in every cell
# openai.api_key = "YOUR API KEY HERE" # commented out so you don't overwrite the correct api key you set above

# your system message probably contains a lot of text that you would need to copy over, so I haven't included it here


def get_response(messages):
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        max_tokens=300
    )
    content = response.choices[0].message.content
    return content


def chat():
    messages = [
        {"role": "system", "content": "You are a friendly assistant"} # change me to change my personality
    ]
    while True:
        prompt = input("Type a prompt (or type \"exit\" to exit)...")
        print("User:", prompt)
        if prompt == "exit":
            break
        messages = add_message(messages, prompt, "user")
        response = get_response(messages)
        messages = add_message(messages, response, "assistant")
        print("Assistant:", response) # TODO (optional) remove this line that prints the response
        say(response) # TODO call the say() function with the response


def add_message(messages, content, role):
    message = {"role": role, "content": content}
    messages.append(message)
    return messages


chat()


User: hello
Assistant: Hello! How can I assist you today?


User: exit


## Voice Transcription

The next step is to introduce the capability for our AI system to listen to our voice, and then turn it into text that it can process.

In [18]:
!pip install pydub
!pip install sounddevice
!pip install numpy

Collecting sounddevice
  Using cached sounddevice-0.4.6-py3-none-macosx_10_6_x86_64.macosx_10_6_universal2.whl (107 kB)
Installing collected packages: sounddevice
Successfully installed sounddevice-0.4.6
Collecting numpy
  Downloading numpy-1.26.2-cp312-cp312-macosx_10_9_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.2-cp312-cp312-macosx_10_9_x86_64.whl (20.3 MB)
Installing collected packages: numpy
Successfully installed numpy-1.26.2


In [21]:
from time import time
from pydub import AudioSegment
import sounddevice as sd
import numpy as np
import os

def voice_to_text(duration=10):
    print("Listening...")

    # RECORD
    # this section is a little complicated, but I'm fairly sure I found it on stackoverflow
    fs = 44100  # Sample rate (samples per second)
    audio_data = []
    with sd.InputStream(samplerate=fs, channels=1, dtype=np.int16) as stream:
        start_time = time()
        while time() - start_time < duration:
            # Read a chunk of audio data
            chunk, overflowed = stream.read(fs)
            audio_data.append(chunk)
    audio_data = np.concatenate(audio_data) # Convert the recorded audio data into a NumPy array
    audio = AudioSegment(audio_data.tobytes(), frame_rate=fs, sample_width=2, channels=1) # this line gave me loads of trouble but eventually these params worked
    temp_audio_filename = "temp.mp3"
    audio.export(temp_audio_filename, format="mp3")

    # TRANSCRIBE
    with open(temp_audio_filename, 'rb') as f: # TODO open context manager to temporary audio file
        print("Thinking...")
        text = openai.audio.transcriptions.create( # TODO call the function to create a transcription
            model="whisper-1", # TODO specify the model
            file=f # TODO specify the file
        ).text # TODO get the text attribute from the response

    os.remove(temp_audio_filename) # TODO remove the temporary audio file

    return text

voice_to_text()

Listening...
Thinking...


"What's good, my friend?"

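The `AudioSegment(...)` parameters in the recording step describe the raw audio: `dtype=np.int16` means each sample takes 2 bytes (`sample_width=2`), and `channels=1` means mono. As an illustration only, the sketch below packs the same kind of data into a WAV file using just the standard library. It is not the method used in this notebook, and `pcm16_to_wav_bytes` is a hypothetical helper name.

```python
import io
import struct
import wave


def pcm16_to_wav_bytes(samples, sample_rate=44100):
    """Wrap a sequence of 16-bit mono samples in a WAV container."""
    raw = struct.pack(f"<{len(samples)}h", *samples)  # little-endian, 2 bytes per sample
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono, like channels=1
        wav_file.setsampwidth(2)            # 2 bytes, like sample_width=2
        wav_file.setframerate(sample_rate)  # like frame_rate=fs
        wav_file.writeframes(raw)
    return buffer.getvalue()


one_second_of_silence = [0] * 44100
wav_bytes = pcm16_to_wav_bytes(one_second_of_silence)

# Read the header back to confirm the parameters line up
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getnchannels(), check.getsampwidth(), check.getframerate(), check.getnframes())
```

If the sample width or channel count passed to `AudioSegment` does not match how the audio was recorded, the bytes are interpreted incorrectly and the result sounds distorted or plays at the wrong speed, which is probably why that line takes some trial and error.
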
Now let's add this into our main body of code.

In [24]:
# import openai # commented out so you don't import again and have to set your api key in every cell
# openai.api_key = "YOUR API KEY HERE" # commented out so you don't overwrite the correct api key you set above

# your system message probably contains a lot of text that you would need to copy over, so I haven't included it here


def get_response(messages):
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        max_tokens=300
    )
    content = response.choices[0].message.content
    return content


def chat():
    messages = [
        {"role": "system", "content": "You are a friendly assistant"} # change me to change my personality
    ]
    while True:
        # prompt = input("Type a prompt (or type \"exit\" to exit)...") # TODO remove this line as our input will now be taken by voice
        prompt = voice_to_text()
        print("User:", prompt)
        if prompt == "exit": # TODO change this to check if the prompt is "Exit." because this is the format that whisper transcribes "exit" to (try without first)
            break
        messages = add_message(messages, prompt, "user")
        response = get_response(messages)
        messages = add_message(messages, response, "assistant")
        # print("Assistant:", response) 
        say(response)
        input("Press enter to continue") # TODO add a line here to wait until the user presses enter to continue (try without this first and you'll see why we need it)


def add_message(messages, content, role):
    message = {"role": role, "content": content}
    messages.append(message)
    return messages


chat()


Listening...
Thinking...
User: Hey, what's up?


Listening...
Thinking...
User: What do you think is a good plan?


Listening...
Thinking...
User: Exit.


Listening...
Thinking...
User: Exit.


KeyboardInterrupt: 
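
The output above shows why the TODO on the exit check matters: Whisper transcribed the spoken word as `Exit.`, which is not equal to `"exit"`, so the loop kept going until it was interrupted by hand. One possible fix, sketched below, is to normalise the transcript before comparing it. `is_exit_command` is a hypothetical helper, not something from the OpenAI library.

```python
import string


def is_exit_command(transcript):
    """Return True if the transcript is just the word "exit", ignoring case and punctuation."""
    cleaned = transcript.strip(string.punctuation + string.whitespace).lower()
    return cleaned == "exit"


print(is_exit_command("Exit."))               # the form seen in the output above
print(is_exit_command("Exit the building."))  # a sentence that merely starts with "exit"
```

In `chat()`, the check would then read `if is_exit_command(prompt): break`, which also covers variants such as `Exit!` or `exit`.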

## Conclusion

In this notebook, you've learnt how to use the OpenAI platform for voice transcription and voice generation.