# Audio Capabilities with OpenAI's gpt-4o-audio-preview Model: A Practical Guide
    
This notebook shows how to use OpenAI’s `gpt-4o-audio-preview` model through LangChain.
We’ll go step by step, from environment setup through audio transcription and audio generation to more advanced use cases such as tool calling and task chaining, with a practical example at each stage.

### 1. Installing the Required Packages

We’ll need the `langchain-openai` package to interact with OpenAI models. Run the command below to install it.

```bash
%pip install -qU langchain-openai
```

A quick breakdown: the `-q` flag keeps the installer quiet (so it doesn't flood the notebook with output), and the `-U` flag upgrades the package to the latest available version. Installing this package lets LangChain talk to OpenAI’s models directly, which is exactly what we need to move forward.


In [1]:
# Install langchain-openai package
%pip install -qU langchain-openai

Note: you may need to restart the kernel to use updated packages.


### 2. Setting Up Environment Variables

Now we’ll set up the environment variables to store your OpenAI API key. This keeps sensitive information out of your code.

You can manually set the environment variable in your terminal, or use a `.env` file in combination with `python-dotenv`. Here, we’ll show you how to set it within Python.

In [1]:
import getpass
import os

# Set your OpenAI API key as an environment variable
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
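
If you prefer the `.env` route mentioned above, the idea can be sketched with the standard library alone. This is a simplified stand-in, not the `python-dotenv` implementation: the helper name `load_env_file` is hypothetical, and it only understands plain `KEY=VALUE` lines. In real projects the `load_dotenv()` function from `python-dotenv` is the more robust choice.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> list:
    """Copy KEY=VALUE pairs from a file into os.environ; return the keys that were set."""
    env_path = Path(path)
    if not env_path.is_file():
        return []
    loaded = []
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        # Variables already present in the environment take precedence
        if key and key not in os.environ:
            os.environ[key] = value
            loaded.append(key)
    return loaded
```

Calling `load_env_file()` before the `getpass` prompt means the prompt only appears when the key is in neither the environment nor the file.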

### 3. Instantiating the Model

Now that we have our environment set up, let’s instantiate the `gpt-4o-audio-preview` model using LangChain. We’ll also configure some basic parameters like temperature and token limits.

In [2]:
from langchain_openai import ChatOpenAI

# Instantiate the model
llm = ChatOpenAI(
    model="gpt-4o-audio-preview",  # Specifying the model
    temperature=0,  # Low randomness for structured output
    max_tokens=None,  # No explicit cap; the model's own output limit applies (set a limit if needed)
    timeout=None,  # No timeout for processing
    max_retries=2  # Retry if the request fails
)

### 4. Uploading and Encoding Audio Files

We’ll now read a local audio file and encode it as base64 so that it can be sent to the model. Here’s how you can read and encode an audio file in Python.

In [3]:
import base64

# Open the audio file and convert to base64
with open("gpt.wav", "rb") as f:  # Replace with your own audio file
    audio_data = f.read()

# Convert binary audio data to base64
audio_b64 = base64.b64encode(audio_data).decode()
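
The read-and-encode step can be wrapped in a small helper that also derives the `format` value from the file extension. This is a sketch under assumptions: `encode_audio` is a hypothetical name, and the set of accepted formats (`wav` and `mp3`) reflects the input formats as I understand them at the time of writing, so check the current OpenAI documentation before relying on it.

```python
import base64
from pathlib import Path

# Assumption: formats accepted for input audio; adjust if the API changes
SUPPORTED_FORMATS = {"wav", "mp3"}


def encode_audio(path: str) -> tuple:
    """Return (base64 string, format) for an audio file, based on its extension."""
    audio_path = Path(path)
    audio_format = audio_path.suffix.lower().lstrip(".")
    if audio_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {audio_path.suffix!r}")
    # Base64 output is pure ASCII, so decoding the bytes to str is safe
    audio_b64 = base64.b64encode(audio_path.read_bytes()).decode("ascii")
    return audio_b64, audio_format
```

With this in place, `audio_b64, audio_format = encode_audio("gpt.wav")` yields both values needed for the `input_audio` block.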


### 5. Transcribing Audio

Now that we’ve encoded the audio, we can pass it to the model and get a transcription. Let’s send the request and retrieve the transcribed text.

In [4]:
# Send audio for transcription
messages = [
    (
        "human",
        [
            {"type": "text", "text": "Transcribe the following:"},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    )
]

# Invoke the model and get the transcription
output_message = llm.invoke(messages)
print(output_message.content)  # The transcription will appear here

As an athlete with a busy lifestyle, finding the right pair of jeans has always been challenging.
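
Since the same message shape is reused for every audio request in this guide, it can help to build it in one place. The sketch below only constructs the data structure shown in the cell above; `build_audio_message` is a hypothetical helper name, not part of LangChain.

```python
def build_audio_message(audio_b64: str, instruction: str, audio_format: str = "wav") -> list:
    """Build a single human message containing a text instruction and an audio clip."""
    return [
        (
            "human",
            [
                {"type": "text", "text": instruction},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": audio_format},
                },
            ],
        )
    ]
```

For example, `llm.invoke(build_audio_message(audio_b64, "Transcribe the following:"))` sends the same request as the cell above.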


### 6. Generating Audio Responses

Let’s now configure the model to generate audio outputs, allowing it to respond with actual speech. We’ll specify the voice and format for the output.

In [7]:
# Configure the model to generate audio responses
llm = ChatOpenAI(
    model="gpt-4o-audio-preview",
    temperature=0,
    model_kwargs={
        "modalities": ["text", "audio"],  # Enable audio responses
        "audio": {"voice": "alloy", "format": "wav"},  # Set voice and output format
    }
)

# Generate a response with audio
messages = [("human", "Are you human? Reply either yes or no.")]
output_message = llm.invoke(messages)

# Access the generated audio data
audio_response = output_message.additional_kwargs['audio']['data']
print(f"Generated audio (base64): {audio_response}")

Generated audio (base64): UklGRsa7AABXQVZFZm10IBAAAAABAAEAwF0AAIC7AAACABAATElTVBoAAABJTkZPSVNGVA4AAABMYXZmNTguMjkuMTAwAGRhdGGAuwAADAAHAAIACwAEAAkADAAIAA8ACgAPAAoAAgANAP//EQADAA4ABQAFAAoA+v8HAP3/CAADAAMABwD+/wgA/f8HAP//AgAAAP3/AgD7/wkA+P8CAP3/+P8BAPT/AADs/wcA8/////j/9//7//T////y/wAA9v8AAPP/+f/z//r/+P/x//z/6//4/+v/+f/w//n/8f/2//L/8v/w//P/+f/w//X/7P/y/+z/7v/2//D//f/s//f/6//t//P/6f/5/+f/9P/l/+r/9v/s//T/5f/2/+L/9//v//P/8P/n//f/6//3/+b/8P/p/+n/7//p//H/5P/u/+v/7P/o/+f/5v/n/+L/6P/j/+j/5f/l/+H/6f/k/+L/5P/h/+X/4v/t/9n/6//U/+n/1v/p/9z/4P/m/9n/6//R/+n/1f/t/9//3v/l/9r/7P/W/+n/5f/l/+r/3f/p/+L/6P/o/+H/7v/c//H/4P/v/+b/7P/t/+T/9f/n//P/6//z/+7/7v/w/+//8v/x//f/6P/0//T/8v/5//H//P/s//n/8v/4//b/8P/8//D//P/w//v////0////8P8BAPL/CgAAAAEABAD1/wkA9/8UAPP/FgD7/wsACgAFAAwA+v8aAAQADwAHAAsAEAADABEACQAVAAsAEwAOAA0AGQALABkACQAZABEAGgAZABUAHQAOAB4ADwAcABMAGAAZABAAGwATABoAFQAbAB4AGwAeABoAGgAYABYAFAAUABgAFQAYABYAGAASABkAFgAYABcADAAXAAsAGQAMABoADAAMAB4ACgAhAAwAHAAHAB0ADwAOABYADgAbAAcAHwAOAB4AEAAXABMAFAAXABgAFwAVABsAEwAeAB
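
Printing the full base64 string floods the output, as seen above. A short summary is usually enough to confirm that audio came back. The sketch below is illustrative: `summarize_audio` is a hypothetical helper, and it assumes the audio payload is a dict with a `data` key (as used above) and possibly a `transcript` key, which may not be present in every response.

```python
import base64


def summarize_audio(audio: dict, preview_chars: int = 24) -> str:
    """Describe an audio payload without dumping the whole base64 string."""
    data = audio.get("data", "")
    n_bytes = len(base64.b64decode(data))
    summary = f"{n_bytes} bytes of audio, base64 starts with {data[:preview_chars]}..."
    # Assumption: a text transcript may accompany the audio
    transcript = audio.get("transcript")
    if transcript:
        summary += f" | transcript: {transcript}"
    return summary
```

In the notebook this would be called as `print(summarize_audio(output_message.additional_kwargs["audio"]))`.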

### 7. Saving and Playing Back Audio

After generating the audio response, you might want to save it and play it back. Here’s how you can decode the base64-encoded audio data and save it as a `.wav` file.

In [8]:
# Decode the base64 audio data
audio_bytes = base64.b64decode(audio_response)

# Save the audio as a .wav file
with open("output.wav", "wb") as f:
    f.write(audio_bytes)

print("Audio saved as output.wav")

Audio saved as output.wav
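
The cell above covers saving; for the playback half, one option inside a notebook is `IPython.display.Audio`. The sketch below first inspects the saved file with the standard-library `wave` module, then returns a player widget when IPython is available. Both function names are hypothetical, and the sketch assumes the file is a standard PCM WAV.

```python
import wave


def wav_info(path: str) -> dict:
    """Read basic properties from a WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def play_in_notebook(path: str):
    """Return an IPython audio player for the file, or None outside IPython."""
    try:
        from IPython.display import Audio
    except ImportError:
        return None
    return Audio(filename=path)
```

In a notebook, make `play_in_notebook("output.wav")` the last expression of a cell (or pass the result to `display`) so the player is rendered.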


### 8. Tool Binding and Task Chaining

In more advanced use cases, you can bind tools to the model and chain tasks. We’ll start by defining a weather-fetching tool and calling it directly, then chain it with transcription.

In [18]:
import requests
from pydantic import BaseModel, Field

# Define a tool schema using Pydantic
class GetWeather(BaseModel):
    """Get the current weather in a given location."""
    location: str = Field(..., description="The city and country code, e.g. Edinburgh, GB")
    
    def fetch_weather(self):
        # Using OpenWeatherMap API to fetch real-time weather
        API_KEY = ""  # Replace with your actual API key
        base_url = f"http://api.openweathermap.org/data/2.5/weather?q={self.location}&appid={API_KEY}&units=metric"
        
        response = requests.get(base_url)
        
        if response.status_code == 200:
            data = response.json()
            weather_description = data['weather'][0]['description']
            temperature = data['main']['temp']
            return f"The weather in {self.location} is {weather_description} with a temperature of {temperature}°C."
        else:
            # Print the status code and response for debugging
            print(f"Error: {response.status_code}, {response.text}")
            return f"Could not fetch the weather for {self.location}."

# Example usage
weather_tool = GetWeather(location="Edinburgh, GB")  # Using city name and country code
ai_msg = weather_tool.fetch_weather()
print(ai_msg)


The weather in Edinburgh, GB is broken clouds with a temperature of 13.48°C.
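
Rather than pasting the weather API key into the class, the key can be read from the environment, and the query can be passed as parameters so that `requests` handles URL encoding. This is a sketch under assumptions: the environment variable name `OPENWEATHERMAP_API_KEY` and the helper `build_weather_request` are made up for illustration, and the sketch only builds the request without sending it.

```python
import os
from typing import Optional

WEATHER_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_request(location: str, api_key: Optional[str] = None) -> tuple:
    """Return (url, params) for a current-weather lookup in metric units."""
    # Hypothetical variable name; use whatever your deployment defines
    key = api_key or os.environ.get("OPENWEATHERMAP_API_KEY")
    if not key:
        raise ValueError("No OpenWeatherMap API key provided")
    params = {"q": location, "appid": key, "units": "metric"}
    return WEATHER_URL, params
```

Inside `fetch_weather`, the result could be used as `url, params = build_weather_request(self.location)` followed by `requests.get(url, params=params, timeout=10)`.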


Now let’s take it a step further - chaining tasks. This is where you can create multi-step workflows that combine multiple tools and model calls to handle complex requests. Imagine a scenario where you want your assistant to transcribe audio and then perform an action for the location mentioned in the audio. In this example, we’ll chain an audio transcription task with a weather lookup based on the transcribed location.

In [20]:
import base64
import requests
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Define the tool schema for fetching weather
class GetWeather(BaseModel):
    """Get the current weather in a given location."""
    location: str = Field(..., description="The city and country code, e.g. Edinburgh, GB")
    
    def fetch_weather(self):
        # Using OpenWeatherMap API to fetch real-time weather
        API_KEY = ""  # Replace with your actual API key
        base_url = f"http://api.openweathermap.org/data/2.5/weather?q={self.location}&appid={API_KEY}&units=metric"        
        response = requests.get(base_url)
        if response.status_code == 200:
            data = response.json()
            weather_description = data['weather'][0]['description']
            temperature = data['main']['temp']
            return f"The weather in {self.location} is {weather_description} with a temperature of {temperature}°C."
        else:
            return f"Could not fetch the weather for {self.location}."

# Instantiate the LLM model
llm = ChatOpenAI(
    model="gpt-4o-audio-preview"
)

# Function to handle audio transcription using the LLM
def audio_to_text(audio_b64: str) -> str:
    # Define the message to send for transcription
    messages = [
        (
            "human",
            [
                {"type": "text", "text": "Transcribe the following:"},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        )
    ]
    # Invoke the model and get the transcription
    output_message = llm.invoke(messages)
    # Return the transcription from the model's output
    return output_message.content

# Create a prompt template for transcription and weather lookup
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an assistant that transcribes audio and fetches weather information."),
        ("human", "Transcribe the following and tell me the weather in the location mentioned in the audio."),
    ]
)

# Bind the tool to the model
llm_with_tools = llm.bind_tools([GetWeather])

# Compose the prompt with the tool-bound model
# (defined for reference; the steps below call the transcription and weather helpers directly)
chain = prompt | llm_with_tools

# Read and encode the audio file in base64
audio_path = "weather_input.wav"  # Replace with your own audio file

with open(audio_path, "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode('utf-8')

# Transcribe the audio to get the location
# (this assumes the recording contains only the place name)
transcribed_location = audio_to_text(audio_b64)

# Print the transcription result for debugging
print(f"Transcribed location: {transcribed_location}")

# Check if transcription returned a valid result
if transcribed_location:
    # Fetch weather for the transcribed location
    weather_tool = GetWeather(location=transcribed_location)
    weather_result = weather_tool.fetch_weather()
    print(f"Weather result: {weather_result}")
else:
    print("No valid location was transcribed from the audio.")

Transcribed location: Edinburgh
Weather result: The weather in Edinburgh is broken clouds with a temperature of 13.47°C.
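
The cell above creates `llm_with_tools` but performs the weather lookup by calling `fetch_weather` directly. If you instead let the model decide when to call the tool, LangChain exposes its requests on the response as `tool_calls`, a list of dicts with `name`, `args`, and `id` keys. The dispatcher below is a minimal sketch of how those requests could be executed; `run_tool_calls` and the registry layout are hypothetical, not a LangChain API.

```python
def run_tool_calls(tool_calls: list, tools: dict) -> list:
    """Execute each requested tool call and collect the results in order."""
    results = []
    for call in tool_calls:
        handler = tools.get(call["name"])
        if handler is None:
            # The model asked for a tool we did not register
            results.append(f"Unknown tool: {call['name']}")
            continue
        # Tool arguments arrive as a dict of keyword arguments
        results.append(handler(**call["args"]))
    return results


# Usage in the notebook (requires the objects defined in the cell above):
# ai_msg = llm_with_tools.invoke("What's the weather in Edinburgh?")
# registry = {"GetWeather": lambda location: GetWeather(location=location).fetch_weather()}
# print(run_tool_calls(ai_msg.tool_calls, registry))
```

Keying the registry by tool name keeps the dispatcher independent of any particular tool, so further tools can be added without changing the loop.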


## Practical Example: Building a Voice-Enabled Assistant
Finally, let’s look at a practical example: a voice-enabled assistant that takes a user query as audio, generates a response, and replies with audio.

In [17]:
import base64
from langchain_openai import ChatOpenAI

# Step 1: Instantiate the audio-capable model with configuration for generating audio
llm = ChatOpenAI(
    model="gpt-4o-audio-preview",
    temperature=0,
    model_kwargs={
        "modalities": ["text", "audio"],  # Enable both text and audio modalities
        "audio": {"voice": "alloy", "format": "wav"},  # Set the desired voice and output format
    }
)

# Step 2: Capture and encode the audio (replace this with real audio input)
audio_path = "math_joke_audio.wav"  # Replace with your own audio file
with open(audio_path, "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode('utf-8')

# Step 3: Create the message structure for transcription and audio response
messages = [
    (
        "human",
        [
            {"type": "text", "text": "Answer the question."},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    )
]

# Step 4: Invoke the model to transcribe the audio and generate a response
result = llm.invoke(messages)

# Step 5: Extract the audio response
audio_response = result.additional_kwargs.get('audio', {}).get('data')  # Safely check if audio exists

# Step 6: Save the audio response to a file if it exists
if audio_response:
    # Decode the base64 audio data and save it as a .wav file
    audio_bytes = base64.b64decode(audio_response)
    with open("response.wav", "wb") as f:
        f.write(audio_bytes)
    print("Audio response saved as 'response.wav'")
else:
    print("No audio response available")


Audio response saved as 'response.wav'
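
Steps 2 to 6 can be folded into one reusable function so that each new query is a single call. This is a sketch, not a definitive design: `voice_reply` is a hypothetical name, and it assumes `llm` is any object whose `invoke` method returns a message with an `additional_kwargs` dict, as the audio-enabled `ChatOpenAI` instance above does.

```python
import base64
from pathlib import Path
from typing import Optional


def voice_reply(llm, input_path: str, output_path: str,
                instruction: str = "Answer the question.") -> Optional[str]:
    """Send an audio question to the model and save the spoken answer, if any."""
    audio_b64 = base64.b64encode(Path(input_path).read_bytes()).decode("utf-8")
    messages = [
        (
            "human",
            [
                {"type": "text", "text": instruction},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        )
    ]
    result = llm.invoke(messages)
    # The audio entry may be missing if the model replied with text only
    data = (result.additional_kwargs.get("audio") or {}).get("data")
    if not data:
        return None
    Path(output_path).write_bytes(base64.b64decode(data))
    return output_path
```

A call such as `voice_reply(llm, "math_joke_audio.wav", "response.wav")` returns the output path on success and `None` when no audio came back.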
