## UC1 - Building with the Azure OpenAI Realtime API


The GPT-4o audio realtime API is designed for real-time, low-latency conversational interactions. It is a good fit for use cases that need highly responsive back-and-forth between a user and a model, such as customer support agents, voice assistants, and real-time translators.

It works with text messages, function (tool) calling, and many other capabilities already available from endpoints such as /chat/completions.

Please review the [GPT-4o Realtime API documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/realtime-audio-quickstart?pivots=ai-foundry-portal) before continuing.


### Architecture Overview
The API is intended to be used within a trusted, intermediate service that manages connections to both end users and the model endpoint. It's not designed for direct use from untrusted end-user devices. Capturing and rendering audio data are managed outside the scope of the API.

## Deploying the model
In the Azure AI Foundry portal, choose Models + endpoints, then Deploy model / Deploy base model, and select gpt-4o-realtime-preview as the model to deploy.

![Deploying GPT-4o Realtime API](./images/p1.png)

Next, edit the deployment and raise the calls-per-minute limit to its maximum value.
![Increase calls per minute to the model](./images/p2.png)



## Test GPT-4o Azure OpenAI API Endpoint with Completions API 

Once a text-based model such as gpt-4o is deployed, use it to generate a completion for a simple prompt.
A successful call confirms that the API key and endpoint URL are configured correctly.
Move on to the next section once the generation succeeds.

In [None]:
# os is part of Python's standard library, so it does not need to be installed
!pip install python-dotenv
!pip install openai

In [None]:
import os

import dotenv
from openai import AzureOpenAI

# Load AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT from a local .env file
dotenv.load_dotenv()

AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")

client = AzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    api_version="2024-06-01",
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
)

# Send a completion call to generate an answer
completion = client.chat.completions.create(
    model="gpt-4o",
    messages = [
        {
        "role": "system",
        "content": "You are an MIT PhD in Physics, specializing in quantum physics."
        },
        {
        "role": "user",
        "content": "What is a black hole?"
        }
    ]
    # max_tokens=4096
)

#print(completion.model_dump_json(indent=2))
content = completion.choices[0].message.content
print(content)
print(len(content))
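
If the call above fails, a quick first check is whether both settings were actually loaded. The following is a minimal sketch, not part of the original notebook: `missing_settings` and `REQUIRED_SETTINGS` are hypothetical helper names, and the example values are placeholders.

```python
import os
from typing import Mapping

# Names of the two settings this notebook reads from the environment.
REQUIRED_SETTINGS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")


def missing_settings(env: Mapping[str, str], required=REQUIRED_SETTINGS) -> list[str]:
    """Return the names of required settings that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


# Placeholder values for illustration only; the endpoint is deliberately blank.
example_env = {"AZURE_OPENAI_API_KEY": "dummy-key", "AZURE_OPENAI_ENDPOINT": " "}
print(missing_settings(example_env))  # ['AZURE_OPENAI_ENDPOINT']

# Against the real environment (call dotenv.load_dotenv() first):
print(missing_settings(os.environ))
```

An empty list from the second call means both variables are present; it does not prove the values themselves are valid, which only the completion call can confirm.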

## Connecting to and authenticating with /realtime

/realtime is built on the [WebSockets API](https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API) to facilitate fully asynchronous streaming communication between the end user and the model.

The /realtime API requires an existing Azure OpenAI resource endpoint in a supported region. 
A full request URI can be constructed by concatenating:

1. The secure WebSocket (`wss://`) protocol
2. Your Azure OpenAI resource endpoint hostname, e.g. `my-aoai-resource.openai.azure.com`
3. The `openai/realtime` API path
4. An `api-version` query string parameter for a supported API version (initially, `2024-10-01-preview`)
5. A `deployment` query string parameter with the name of your gpt-4o-realtime-preview model deployment

Combining these parts gives a well-formed /realtime request URI. For example, in my case the full request URI is:

```plaintext
wss://aoai-ep-swedencentral02.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview
```
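
The concatenation described above can be sketched as a small helper built on the standard library. `build_realtime_url` is a hypothetical name introduced here for illustration, and the hostname is the placeholder from the list above. The API key is intentionally left out of this helper.

```python
from urllib.parse import urlencode, urlunsplit


def build_realtime_url(
    hostname: str,
    deployment: str,
    api_version: str = "2024-10-01-preview",
) -> str:
    """Assemble a /realtime request URI from its five parts."""
    query = urlencode({"api-version": api_version, "deployment": deployment})
    # scheme, host, path, query string, fragment (unused)
    return urlunsplit(("wss", hostname, "/openai/realtime", query, ""))


url = build_realtime_url("my-aoai-resource.openai.azure.com", "gpt-4o-realtime-preview")
print(url)
# wss://my-aoai-resource.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview
```

Using `urlencode` rather than string formatting keeps the query string valid if a deployment name ever contains characters that need escaping.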


In [None]:
!pip install websockets    # WebSocket client library, needed to connect to the Azure OpenAI /realtime endpoint
# asyncio is part of Python's standard library (it manages non-blocking tasks such as WebSocket connections), so it does not need to be installed
!pip install nest_asyncio  # Lets asyncio.run() be used inside an already-running event loop once nest_asyncio.apply() is called


In [9]:
import asyncio
import os

import websockets
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Retrieve the API key and fail early if it is missing
api_key = os.getenv("AZURE_OPENAI_API_KEY")
if not api_key:
    raise ValueError("AZURE_OPENAI_API_KEY not found. Ensure it is set in the .env file.")

# The key is sent as the api-key query parameter, so no extra HTTP headers are needed
wss_url = f"wss://aoai-ep-swedencentral02.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview&api-key={api_key}"

# Define a function to confirm the connection to the WebSocket server
async def confirm_connection():
    try:
        # Opening and closing the connection is enough to confirm the URL and key
        async with websockets.connect(wss_url):
            print("Successfully connected to the /realtime endpoint!")

    except Exception as e:
        print(f"Connection failed: {e}")

# Run the WebSocket client.
# If asyncio.run() raises a RuntimeError in your notebook, use `await confirm_connection()` instead.
asyncio.run(confirm_connection())


Successfully connected to the /realtime endpoint!


## Azure OpenAI Realtime API: Brief Overview and Next Steps

### **What We Have Done**
1. **Connection to the Realtime API**:
   - Successfully established a WebSocket connection to the Azure OpenAI Realtime API.
   - Authentication was performed using an API key embedded in the query parameters (`api-key=<your-api-key>`).

2. **How the API Works**:
   - **Event-Based Architecture**:
     - Communication happens via JSON-formatted events over a WebSocket.
     - Events include client commands (`conversation.item.create`, `response.create`) and server responses (e.g., `response.text.delta`, `response.audio.delta`).
   - **Stateful Sessions**:
     - The API maintains session state for the duration of the WebSocket connection.
     - Sessions consist of conversations, which are composed of items (text, audio, or function calls).
   - **Multimodal Support**:
     - Accepts text and audio as both input and output.
     - Supports function calls and combined responses.
   - **Turn Detection**:
     - Configurable for manual or automatic turn detection (e.g., push-to-talk or VAD).
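
To make the event-based architecture concrete, here is a minimal sketch of how a client might route incoming JSON events by their `type` field. `EventRouter` is a hypothetical class, not part of any SDK, and the messages are hand-written stand-ins rather than captured server output. The `delta` field name is an assumption modeled on how the audio deltas are read later in this notebook.

```python
import json
from typing import Callable

Handler = Callable[[dict], None]


class EventRouter:
    """Dispatch JSON events to handlers registered per event type."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}
        self.unhandled: list[str] = []  # event types seen without a handler

    def on(self, event_type: str) -> Callable[[Handler], Handler]:
        def register(handler: Handler) -> Handler:
            self._handlers[event_type] = handler
            return handler
        return register

    def dispatch(self, raw_message: str) -> None:
        event = json.loads(raw_message)
        event_type = event.get("type", "")
        handler = self._handlers.get(event_type)
        if handler is None:
            self.unhandled.append(event_type)
        else:
            handler(event)


router = EventRouter()
text_parts: list[str] = []


@router.on("response.text.delta")
def collect_text(event: dict) -> None:
    text_parts.append(event.get("delta", ""))


# Hand-written stand-ins for messages received over the WebSocket.
for raw in (
    '{"type": "response.text.delta", "delta": "Hel"}',
    '{"type": "response.text.delta", "delta": "lo"}',
    '{"type": "session.created"}',
):
    router.dispatch(raw)

print("".join(text_parts))  # Hello
print(router.unhandled)     # ['session.created']
```

In a live session, `router.dispatch` would be called with each message returned by the WebSocket's `recv()`.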


## **Next Step: Generating a Sample API Call**

### **Goal**
Send a message (e.g., user text input) to the API and receive a response.

### **Sample Workflow**
1. **Client-Side Event: Send a Message**:
   - Use `conversation.item.create` to send a user message:
     ```json
     {
         "type": "conversation.item.create",
         "item": {
             "type": "message",
             "role": "user",
             "content": [
                 {
                     "type": "input_text",
                     "text": "Hello, how are you?"
                 }
             ]
         }
     }
     ```

2. **Client-Side Event: Request a Response**:
   - Use `response.create` to trigger the model's response:
     ```json
     {
         "type": "response.create"
     }
     ```

3. **Server-Side Events**:
   - The API streams responses incrementally:
     - `response.text.delta`: Partial text updates.
     - `response.done`: Signals the completion of the response.
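
The three steps above can be sketched without a live connection. The helpers below are hypothetical names introduced for illustration; they build the two client events and fold a simulated server stream into text. The server messages are hand-written examples, and the `delta` field name is an assumption.

```python
import json
from typing import Iterable


def user_text_event(text: str) -> str:
    """Serialize a conversation.item.create event carrying a user text message."""
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    })


def response_create_event() -> str:
    """Serialize the event that asks the model to respond."""
    return json.dumps({"type": "response.create"})


def collect_text(server_messages: Iterable[str]) -> str:
    """Concatenate text deltas until response.done arrives."""
    parts = []
    for raw in server_messages:
        event = json.loads(raw)
        if event["type"] == "response.text.delta":
            parts.append(event.get("delta", ""))
        elif event["type"] == "response.done":
            break
    return "".join(parts)


outgoing = [user_text_event("Hello, how are you?"), response_create_event()]

# Simulated server stream (not real API output).
simulated = [
    '{"type": "response.text.delta", "delta": "I am "}',
    '{"type": "response.text.delta", "delta": "fine."}',
    '{"type": "response.done"}',
]
print(collect_text(simulated))  # I am fine.
```

With a real connection, each string in `outgoing` would be passed to the WebSocket's `send()`, and `collect_text` would consume messages from `recv()` instead of a list.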


## Playing Audio in Jupyter Notebooks

To play audio in a Jupyter notebook, you can embed it directly.
Use the `IPython.display.Audio` class to render an audio player.

```python
from IPython.display import Audio

# Assuming the audio file is saved as 'output.wav'
display(Audio("output.wav", autoplay=True))
```
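
The snippet above assumes that `output.wav` already exists. One way to produce it from raw mono 16-bit PCM, the format used by the realtime session later in this notebook, is the standard-library `wave` module. This is a sketch under that assumption: `pcm16_to_wav` is a hypothetical helper, and the tone is stand-in data rather than model output.

```python
import math
import struct
import wave


def pcm16_to_wav(pcm_bytes: bytes, path: str, sample_rate: int = 24000) -> float:
    """Write mono 16-bit PCM to a WAV file and return its duration in seconds."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return len(pcm_bytes) / (2 * sample_rate)


# Half a second of a 440 Hz tone as stand-in PCM data (little-endian int16).
sample_rate = 24000
samples = [
    int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / sample_rate))
    for n in range(sample_rate // 2)
]
pcm = struct.pack(f"<{len(samples)}h", *samples)

duration = pcm16_to_wav(pcm, "output.wav", sample_rate)
print(duration)  # 0.5
```

The resulting file can then be passed to `Audio("output.wav")` as shown above.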

To play audio directly in Jupyter, install the libraries with the shell commands below.
pyaudio depends on PortAudio, so install that first. On macOS, for example, run `!brew install portaudio`.

In [None]:

!pip install IPython
!pip install pyaudio

In [46]:
!pip install numpy

Collecting numpy
  Using cached numpy-2.1.3-cp312-cp312-macosx_14_0_arm64.whl.metadata (62 kB)
Using cached numpy-2.1.3-cp312-cp312-macosx_14_0_arm64.whl (5.1 MB)
Installing collected packages: numpy
Successfully installed numpy-2.1.3


In [None]:
#Generating a simple wav file 

import numpy as np
import wave

# Parameters for the WAV file
sample_rate = 44100  # Samples per second
duration = 2.0       # Duration in seconds
frequency = 440.0    # Frequency of the sine wave in Hz (A4 note)

# Generate a sine wave
t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
audio_data = (np.sin(2 * np.pi * frequency * t) * 32767).astype(np.int16)

# Save as a WAV file
output_path = "simple_tone.wav"
with wave.open(output_path, 'w') as wav_file:
    wav_file.setnchannels(1)  # Mono audio
    wav_file.setsampwidth(2)  # 16 bits per sample
    wav_file.setframerate(sample_rate)
    wav_file.writeframes(audio_data.tobytes())

print(f"WAV file generated: {output_path}")


WAV file generated: simple_tone.wav


In [None]:
#Playing the file in Jupyter Notebook

from IPython.display import Audio

# Path to your audio file
audio_path = "simple_tone.wav"

# Play the audio
Audio(audio_path, autoplay=True)


In [3]:
!pip install websockets nest_asyncio sounddevice  # sounddevice is imported by the audio streaming script below




## Real-time Audio Generation with Azure OpenAI WebSocket API

### Overview
This script establishes a WebSocket connection to Azure OpenAI's Real-time API to generate 
and stream audio responses in real-time through the system's audio output.

### Key Components

1. Environment Setup
- Uses `.env` file for Azure OpenAI API key
- Configures WebSocket URL with:
  - api-version: 2024-10-01-preview
  - deployment: gpt-4o-realtime-preview
  - api-key: Authentication token

2. Audio Output Setup
- Initializes sounddevice OutputStream
- Configures for 24kHz sample rate, mono channel, 16-bit PCM

3. Session Management
- Establishes WebSocket connection
- Configures session with voice settings and turn detection parameters
- Waits for session.created confirmation

4. Message Flow
- Sends user message with specified content
- Requests audio response
- Handles multiple response types:
  - response.audio.delta: Contains audio chunks
  - response.done: Signals completion
  - error: Handles error states

5. Real-time Audio Processing
- Receives base64-encoded audio chunks
- Decodes and cleans up base64 data
- Streams audio directly to system output

### Challenges Overcome
1. Base64 Data Handling
- Cleaning up received base64 strings
- Adding proper padding
- Handling malformed data

2. Real-time Streaming
- Managing audio chunks
- Ensuring smooth playback
- Proper timing of audio output

3. Protocol Complexity
- Managing multi-step session setup
- Handling various response types
- Proper message sequencing
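
The base64 handling described above can be isolated and tested without a connection. The sketch below uses only the standard library; `decode_audio_delta` is a hypothetical helper, the cleanup strategy (strip whitespace, re-pad, reject malformed input) is one reasonable reading of the steps listed, and the chunk is synthetic.

```python
import base64
import binascii
import struct

SAMPLE_RATE = 24000      # 24 kHz, as configured for the output stream
BYTES_PER_SAMPLE = 2     # 16-bit PCM


def decode_audio_delta(delta: str) -> tuple[int, ...]:
    """Decode one base64 audio chunk into little-endian int16 samples.

    Returns an empty tuple when the chunk cannot be decoded.
    """
    cleaned = "".join(delta.split())        # drop spaces and newlines
    cleaned += "=" * (-len(cleaned) % 4)    # restore missing padding
    try:
        raw = base64.b64decode(cleaned, validate=True)
    except (binascii.Error, ValueError):
        return ()
    usable = len(raw) - (len(raw) % BYTES_PER_SAMPLE)  # ignore a trailing odd byte
    return struct.unpack(f"<{usable // BYTES_PER_SAMPLE}h", raw[:usable])


# Synthetic chunk: four samples, with padding stripped and whitespace added.
original = (0, 1000, -1000, 32767)
encoded = base64.b64encode(struct.pack("<4h", *original)).decode("ascii")
damaged = encoded.rstrip("=")[:6] + " \n" + encoded.rstrip("=")[6:]

samples = decode_audio_delta(damaged)
print(samples)                     # (0, 1000, -1000, 32767)
print(len(samples) / SAMPLE_RATE)  # chunk duration in seconds
print(decode_audio_delta("!!!!"))  # ()
```

Returning an empty tuple for malformed input lets the streaming loop skip a bad chunk instead of stopping playback.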

### Important Note for Workshop Attendees

Just a quick heads up! The Azure OpenAI real-time audio code we cover in the next cell won't run properly in Jupyter Notebooks due to event loop conflicts. You'll need to save it as a separate Python script (e.g., `audio_generation.py`) and run it from your terminal using `python audio_generation.py`. This is because Jupyter already runs its own event loop for cell execution, which conflicts with our audio streaming code's asyncio requirements. Don't worry if you see a RuntimeError in your notebook - this is expected! Let me know if you need help setting up the script.

In [None]:
import asyncio
import os
import base64
import json
from dotenv import load_dotenv
import websockets
import numpy as np
import sounddevice as sd

# Load environment variables
load_dotenv()

async def main():
    api_key = os.getenv("AZURE_OPENAI_API_KEY")
    if not api_key:
        print("Error: AZURE_OPENAI_API_KEY is not set.")
        return

    url = (
        f"wss://aoai-ep-swedencentral02.openai.azure.com/openai/realtime?"
        "api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview&api-key="
        f"{api_key}"
    )

    # Set up audio output stream
    try:
        stream = sd.OutputStream(samplerate=24000, channels=1, dtype=np.int16)
        stream.start()
        print("Audio stream started")
    except Exception as e:
        print(f"Error initializing audio stream: {e}")
        return

    try:
        async with websockets.connect(url) as ws:
            print("Connected to the API WebSocket")

            # Step 1: Send session update
            session_payload = {
                "type": "session.update",
                "session": {
                    "voice": "alloy",
                    "instructions": "Your response should be an enthusiastic introduction of yourself as an AI assistant, mentioning your capabilities for real-time audio conversation. Keep it under 20 seconds.",
                    "modalities": ["audio", "text"],
                    "input_audio_format": "pcm16",
                    "output_audio_format": "pcm16",
                    "turn_detection": {
                        "type": "server_vad",
                        "threshold": 0.5,
                        "prefix_padding_ms": 300,
                        "silence_duration_ms": 200
                    }
                }
            }
            await ws.send(json.dumps(session_payload))
            print("Session update sent")

            # Wait for session.created
            while True:
                response = await ws.recv()
                data = json.loads(response)
                print(f"Session response: {json.dumps(data, indent=2)}")
                if data.get("type") == "session.created":
                    break
                elif data.get("type") == "error":
                    print("Error creating session")
                    return

            # Step 2: Send user message
            message_payload = {
                "type": "conversation.item.create",
                "item": {
                    "type": "message",
                    "role": "user",
                    "content": [{"type": "input_text", "text": "Speak now."}]
                }
            }
            await ws.send(json.dumps(message_payload))
            print("User message sent")

            # Wait for conversation.item.created
            while True:
                response = await ws.recv()
                data = json.loads(response)
                print(f"Message response: {json.dumps(data, indent=2)}")
                if data.get("type") == "conversation.item.created":
                    break
                elif data.get("type") == "error":
                    print("Error creating message")
                    return

            # Step 3: Request response
            response_payload = {
                "type": "response.create",
                "response": {"modalities": ["audio", "text"]}
            }
            await ws.send(json.dumps(response_payload))
            print("Response requested")

            # Step 4: Stream audio response
            print("Streaming audio...")
            while True:
                try:
                    response = await ws.recv()
                    data = json.loads(response)
                    
                    if data["type"] == "response.audio.delta":
                        audio_data = data.get("delta", "")
                        if audio_data:
                            try:
                                # Remove any non-base64 characters and pad if necessary
                                audio_data = audio_data.replace(" ", "").replace("\n", "")
                                padding = 4 - (len(audio_data) % 4)
                                if padding != 4:
                                    audio_data += "=" * padding
                                
                                # Decode and play audio
                                audio_bytes = base64.b64decode(audio_data)
                                audio = np.frombuffer(audio_bytes, dtype=np.int16)
                                stream.write(audio)
                                print(".", end="", flush=True)
                            except Exception as decode_error:
                                print(f"\nError decoding audio: {decode_error}")
                        
                    elif data["type"] == "response.done":
                        print("\nAudio streaming completed")
                        break
                    elif data["type"] == "error":
                        print(f"Error response received: {json.dumps(data, indent=2)}")
                        break

                except Exception as e:
                    print(f"Error during audio streaming: {e}")
                    break

    except websockets.exceptions.InvalidStatusCode as e:
        print(f"WebSocket connection failed: {e}")
    except Exception as e:
        print(f"Error: {e}")
    finally:
        # Clean up audio stream
        await asyncio.sleep(1)  # Allow final audio to play
        stream.stop()
        stream.close()
        print("Audio stream closed")

if __name__ == "__main__":
    print("Starting real-time API test...")
    asyncio.run(main())

Starting real-time API test...


RuntimeError: asyncio.run() cannot be called from a running event loop
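
The `RuntimeError` above can be reproduced in isolation. The following sketch is meant to be run as a plain script from a terminal, not in a notebook cell; `has_running_loop` and the other function names are hypothetical. It shows that `asyncio.run()` fails only when a loop is already running, which is the situation inside Jupyter.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def probe() -> bool:
    # Inside a coroutine a loop is always running, much like inside a Jupyter cell.
    return has_running_loop()


async def nested_run_error() -> str:
    """Call asyncio.run() from inside a running loop and capture the error text."""
    coro = asyncio.sleep(0)
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        return str(exc)
    finally:
        coro.close()  # avoid a "coroutine was never awaited" warning
    return ""


outside = has_running_loop()               # False when run as a script
inside = asyncio.run(probe())              # True
message = asyncio.run(nested_run_error())  # the error text seen in the notebook
print(outside, inside)
print(message)
```

In a notebook cell, where a loop is already running, `await main()` is the usual alternative to `asyncio.run(main())`.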