# CONVERSE with REALTIME API 
In this step, we'll transform our single-turn audio generation into a full conversational system. This involves tackling several key challenges: managing bi-directional audio streams for both user input and AI output, implementing proper turn-taking mechanics (knowing when the user has finished speaking and when the AI should respond), handling the complex flow of events from speech detection to response generation, and maintaining conversation context across multiple exchanges. The trickiest part isn't just streaming the audio - it's orchestrating all these components to work together smoothly while maintaining natural conversational flow. We'll build this step by step, focusing on creating a robust system that can handle real-world conversation scenarios.

# Key Challenges in Building Real-time Conversational AI

Below are the key challenges in building real-time conversational AI:

1. **Session and Context Management**
  - Maintaining conversation history
  - Tracking context across exchanges
  - Managing session state

2. **Turn-Taking Mechanics**
  - Detecting end of user speech
  - Managing interruptions
  - Coordinating input/output transitions
  - Timing audio recording and playback

3. **Audio Stream Handling**
  - Coordinating bidirectional audio
  - Managing stream lifecycles
  - Resolving device conflicts
  - Handling format compatibility

4. **Event Flow Control**
  - Input audio streaming
  - Speech detection
  - Text transcription
  - Response generation
  - Output audio streaming

5. **Error Recovery**
  - Connection drops
  - Audio device issues
  - API rate limits
  - Graceful error handling

6. **Resource Management**
  - WebSocket connection handling
  - Audio buffer memory
  - Resource cleanup
  - Connection lifecycle management
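
As a rough illustration of the event flow in item 4, each server message can be mapped to the stage it signals. This is a minimal sketch, assuming the event type names published for the Realtime API; `STAGE_BY_EVENT` and `stage_for` are hypothetical names introduced here, not part of any SDK.

```python
import json

# Assumed Realtime API server event types, grouped by the stage they signal.
STAGE_BY_EVENT = {
    "input_audio_buffer.committed": "input audio streaming",
    "input_audio_buffer.speech_started": "speech detection",
    "input_audio_buffer.speech_stopped": "speech detection",
    "conversation.item.input_audio_transcription.completed": "text transcription",
    "response.created": "response generation",
    "response.done": "response generation",
    "response.audio.delta": "output audio streaming",
}

def stage_for(raw_message: str) -> str:
    """Return the pipeline stage for one JSON message, or 'other' if unknown."""
    event = json.loads(raw_message)
    return STAGE_BY_EVENT.get(event.get("type"), "other")
```

A receive loop could call `stage_for` on every message and log the result, which makes it easier to see where a conversation stalls.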


We will build an MVP, so we pick the minimum set of features that gives us a working setup. This is what we will target:
# Minimum Viable Conversational System - Development Plan

## Phase 1: Basic Turn-Taking
1. **Core Components**
  - Audio input/output stream management
  - Basic turn detection
  - Session context tracking

## Implementation Steps

### Step 1: Input/Output Setup
- Configure audio device for both input and output
- Implement basic audio streaming

### Step 2: Turn Detection
- Add VAD (Voice Activity Detection)
- Implement simple silence detection
- Add basic interrupt handling

### Step 3: Context Management
- Track conversation history
- Maintain basic session state

## Testing Milestones
1. Test audio I/O
2. Verify turn detection
3. Validate context preservation
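
For the first milestone, a quick offline check is to confirm that audio survives the encoding used on the wire before touching any device. This is a minimal sketch, assuming 16-bit mono PCM at 24 kHz as used in this notebook; `make_tone` and `roundtrip` are hypothetical helpers.

```python
import base64
import math
import struct

SAMPLE_RATE = 24000  # Hz, matching the audio streams used in this notebook

def make_tone(freq_hz: float, seconds: float) -> bytes:
    """Return a half-amplitude sine tone as little-endian 16-bit mono PCM."""
    n = round(SAMPLE_RATE * seconds)
    samples = [
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    ]
    return struct.pack(f"<{n}h", *samples)

def roundtrip(pcm: bytes) -> bytes:
    """Encode PCM as base64 text and decode it again, as the JSON payload does."""
    return base64.b64decode(base64.b64encode(pcm).decode("utf-8"))

tone = make_tone(440.0, 0.1)  # 0.1 s -> 2400 samples -> 4800 bytes
```

If `roundtrip(tone)` returns the same bytes, the encoding path is sound and any remaining audio problem is likely on the device side.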

Timeline: We will implement each step iteratively, testing thoroughly before moving to the next.

## Step 1: **Session and Context Management**
We will implement the following key features:

- Bidirectional audio streams
- Basic turn management
- Audio buffering
- Session handling
- Basic error recovery

Finally, test by running the script: it listens for 2 seconds, sends the audio to the API, then plays the response.

In [None]:
import asyncio
import os
import base64
import json
from dotenv import load_dotenv
import websockets
import numpy as np
import sounddevice as sd

class ConversationSystem:
    def __init__(self):
        load_dotenv()
        self.input_stream = None
        self.output_stream = None
        self.is_speaking = False
        self.input_buffer = []
        self.api_key = os.getenv("AZURE_OPENAI_API_KEY")
        if not self.api_key:
            raise ValueError("AZURE_OPENAI_API_KEY not found in environment")
        self.url = (
            f"wss://aoai-ep-swedencentral02.openai.azure.com/openai/realtime?"
            f"api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview&api-key={self.api_key}"
        )

    async def setup_audio(self):
        print("Setting up audio streams...")
        self.output_stream = sd.OutputStream(samplerate=24000, channels=1, dtype=np.int16)
        self.input_stream = sd.InputStream(samplerate=24000, channels=1, dtype=np.int16,
                                         callback=self.audio_callback)
        self.output_stream.start()
        self.input_stream.start()
        print("Audio streams started")
        
    def audio_callback(self, indata, frames, time, status):
        if status:
            print(f"Input stream error: {status}")
        if not self.is_speaking:
            self.input_buffer.extend(indata.tobytes())

    async def start_conversation(self):
        try:
            async with websockets.connect(self.url) as ws:
                print("Connected to WebSocket")
                await self.setup_session(ws)
                
                # Initial greeting
                await self.send_message(ws, "Hello")
                await self.handle_response(ws)

                while True:
                    print("\nListening... (speak for at least 2 seconds)")
                    self.input_buffer.clear()
                    await asyncio.sleep(2)  # Wait for 2 seconds of audio

                    if len(self.input_buffer) > 0:
                        print("Processing your input...")
                        audio_data = bytes(self.input_buffer)
                        self.input_buffer.clear()
                        
                        # Send audio
                        base64_audio = base64.b64encode(audio_data).decode('utf-8')
                        await ws.send(json.dumps({
                            "type": "input_audio_buffer.append",
                            "audio": base64_audio
                        }))
                        await ws.send(json.dumps({
                            "type": "input_audio_buffer.commit"
                        }))
                        
                        # Request response
                        await ws.send(json.dumps({
                            "type": "response.create",
                            "response": {"modalities": ["audio", "text"]}
                        }))
                        
                        await self.handle_response(ws)

        except Exception as e:
            print(f"Error in conversation: {e}")

    async def setup_session(self, ws):
        session_payload = {
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": "You are a helpful AI assistant. Keep responses brief and engaging.",
                "modalities": ["audio", "text"],
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 200
                }
            }
        }
        await ws.send(json.dumps(session_payload))
        
        while True:
            response = await ws.recv()
            data = json.loads(response)
            if data.get("type") == "session.created":
                print("Session setup complete")
                break
            elif data.get("type") == "error":
                raise Exception("Error creating session")

    async def send_message(self, ws, text):
        message_payload = {
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": text}]
            }
        }
        await ws.send(json.dumps(message_payload))
        
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["audio", "text"]}
        }))

    async def handle_response(self, ws):
        self.is_speaking = True
        try:
            while True:
                response = await ws.recv()
                data = json.loads(response)
                
                if data["type"] == "response.audio.delta":
                    if "delta" in data:
                        try:
                            audio_data = data["delta"].replace(" ", "").replace("\n", "")
                            # Number of '=' characters missing to reach a multiple of 4
                            padding = -len(audio_data) % 4
                            if padding:
                                audio_data += "=" * padding
                            
                            audio_bytes = base64.b64decode(audio_data)
                            audio = np.frombuffer(audio_bytes, dtype=np.int16)
                            self.output_stream.write(audio)
                            print(".", end="", flush=True)
                            
                        except Exception as e:
                            print(f"Error processing audio: {e}")
                            
                elif data["type"] == "response.done":
                    break
                    
        finally:
            self.is_speaking = False

async def main():
    system = ConversationSystem()
    await system.setup_audio()
    await system.start_conversation()

if __name__ == "__main__":
    print("Starting real-time conversation system...")
    asyncio.run(main())

Run the above as a standalone Python script rather than inside the notebook: `asyncio.run()` cannot start a new event loop while Jupyter's own loop is already running. \
[Here is a working script](https://github.com/ozgurgulerx/ai-builders/blob/main/part1_build/p1uc1_realtime_api_converse_step1.py)
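
Base64 text must have a length that is a multiple of four, so the number of `=` characters to append is `-len(text) % 4` rather than `len(text) % 4`. The sketch below shows the calculation in isolation; `pad_base64` is a hypothetical helper name.

```python
import base64

def pad_base64(text: str) -> str:
    """Append the '=' characters needed to make len(text) a multiple of 4."""
    text = text.strip()
    return text + "=" * (-len(text) % 4)

# "YWI" is the unpadded encoding of b"ab"; one '=' is missing.
decoded = base64.b64decode(pad_base64("YWI"))
```

Well-formed payloads pass through unchanged, so the helper is safe to apply to every audio delta.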



Voilà, our voice chatbot works! \
It is still a bit clumsy at detecting turn boundaries and at listening to the user properly, since it records in fixed 2-second windows. \
We will improve it in the next steps.

### Implementation Status: Phase 1

#### ✓ Completed
- Audio I/O setup (device configuration, streaming)
- Server-side VAD implementation
- Basic audio state management

#### 🔄 Partially Implemented
- Turn detection (needs refinement)
- Basic interrupt handling
- State tracking

#### ❌ Not Implemented
- Conversation history
- Context preservation
- Session state management

#### Next Steps
Focus on context management system and turn detection improvements for Phase 2.

### Step 2: Turn Detection, Better VAD and Silence Detection

#### Voice Activity Detection (VAD)
- Configurable VAD threshold
- Proper silence duration tracking
- Minimum speech frames validation

#### Audio Processing
- Block-based processing (200ms blocks)
- Efficient buffer management
- Real-time audio level monitoring

#### State Management
- Clear speech state tracking
- Better interruption handling
- Dynamic silence detection

Expected outcome: More natural conversation flow with reliable speech detection and turn-taking.

In [None]:
import asyncio
import os
import base64
import json
import numpy as np
import sounddevice as sd
import websockets
from dotenv import load_dotenv

class AudioProcessor:
    def __init__(self, sample_rate=24000):
        self.sample_rate = sample_rate
        self.vad_threshold = 0.015
        self.speech_frames = 0
        self.silence_frames = 0
        self.min_speech_duration = int(0.3 * sample_rate)
        self.max_silence_duration = int(0.8 * sample_rate)
        self.buffer = []
        self.is_speaking = False
        self.speech_detected = False

    def process_audio(self, indata):
        if self.is_speaking:
            return

        audio_level = np.abs(indata).mean() / 32768.0
        
        if audio_level > self.vad_threshold:
            self.speech_detected = True
            self.speech_frames += len(indata)
            self.silence_frames = 0
            self.buffer.extend(indata.tobytes())
        elif self.speech_detected:
            self.silence_frames += len(indata)
            if self.silence_frames < self.max_silence_duration:
                self.buffer.extend(indata.tobytes())

    def should_process(self):
        return (self.speech_detected and 
                self.speech_frames >= self.min_speech_duration and 
                self.silence_frames >= self.max_silence_duration)

    def reset(self):
        self.speech_frames = 0
        self.silence_frames = 0
        self.speech_detected = False
        audio_data = bytes(self.buffer)
        self.buffer.clear()
        return audio_data

class ConversationSystem:
    def __init__(self):
        load_dotenv()
        self.api_key = os.getenv("AZURE_OPENAI_API_KEY")
        if not self.api_key:
            raise ValueError("AZURE_OPENAI_API_KEY not found")
            
        self.url = (
            "wss://aoai-ep-swedencentral02.openai.azure.com/openai/realtime?"
            f"api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview&"
            f"api-key={self.api_key}"
        )
        
        self.audio_processor = AudioProcessor()
        self.streams = {'input': None, 'output': None}

    def audio_callback(self, indata, frames, time, status):
        if status:
            print(f"Audio error: {status}")
            return
        self.audio_processor.process_audio(indata)

    async def setup_audio(self):
        self.streams['output'] = sd.OutputStream(
            samplerate=24000, channels=1, dtype=np.int16)
        self.streams['input'] = sd.InputStream(
            samplerate=24000, channels=1, dtype=np.int16,
            callback=self.audio_callback, blocksize=4800)
            
        for stream in self.streams.values():
            stream.start()

    async def setup_websocket_session(self, websocket):
        session_config = {
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": "You are a helpful AI assistant. Keep responses brief.",
                "modalities": ["audio", "text"],
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.3,
                    "prefix_padding_ms": 150,
                    "silence_duration_ms": 600
                }
            }
        }
        
        await websocket.send(json.dumps(session_config))
        
        while True:
            response = json.loads(await websocket.recv())
            if response["type"] == "session.created":
                break
            if response["type"] == "error":
                raise Exception(f"Session setup failed: {response}")

    async def send_audio(self, websocket, audio_data):
        audio_base64 = base64.b64encode(audio_data).decode('utf-8')
        
        await websocket.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": audio_base64
        }))
        await websocket.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await websocket.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["audio", "text"]}
        }))

    async def handle_response(self, websocket):
        self.audio_processor.is_speaking = True
        
        try:
            while True:
                response = json.loads(await websocket.recv())
                
                if response["type"] == "response.audio.delta":
                    if "delta" in response:
                        try:
                            audio_data = response["delta"].strip()
                            padding = -len(audio_data) % 4
                            if padding:
                                audio_data += "=" * padding
                            
                            audio = np.frombuffer(
                                base64.b64decode(audio_data), 
                                dtype=np.int16
                            )
                            self.streams['output'].write(audio)
                            
                        except Exception as e:
                            print(f"Audio processing error: {e}")
                            
                elif response["type"] == "response.done":
                    break
                    
        finally:
            self.audio_processor.is_speaking = False

    async def run(self):
        await self.setup_audio()
        
        async with websockets.connect(self.url) as ws:
            await self.setup_websocket_session(ws)
            print("Ready for conversation")
            
            while True:
                if self.audio_processor.should_process():
                    audio_data = self.audio_processor.reset()
                    await self.send_audio(ws, audio_data)
                    await self.handle_response(ws)
                await asyncio.sleep(0.05)

if __name__ == "__main__":
    system = ConversationSystem()
    asyncio.run(system.run())

### Changes & Expected Effects

#### VAD Parameters
- Changed threshold: 500 -> 0.015 (normalized)
- Effect: More sensitive speech detection

#### Timing
- Sleep interval: 0.1s -> 0.05s
- Effect: Faster response time

#### Buffer Management
- Added proper reset after processing
- Effect: Prevents audio data buildup

#### State Tracking
- Added speech_frames counter
- Min speech duration: 0.3s
- Max silence: 0.8s
- Effect: Better conversation turn detection

#### Code Structure
- Split into AudioProcessor class
- Effect: Cleaner maintenance, better testing
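
To see how the counters interact, the turn-detection rule can be simulated on synthetic blocks without any audio hardware. This is a dependency-free sketch: `TurnDetector` is a hypothetical stand-in that mirrors the counters in `AudioProcessor`, using the same 0.015 threshold, 0.3 s minimum speech, 0.8 s maximum silence, and 200 ms blocks.

```python
from array import array

SAMPLE_RATE = 24000
BLOCK = 4800  # samples per block: 4800 / 24000 Hz = 200 ms

class TurnDetector:
    """Hypothetical stand-in mirroring the AudioProcessor counters."""

    def __init__(self, threshold=0.015, min_speech_s=0.3, max_silence_s=0.8):
        self.threshold = threshold
        self.min_speech = round(min_speech_s * SAMPLE_RATE)    # 7200 samples
        self.max_silence = round(max_silence_s * SAMPLE_RATE)  # 19200 samples
        self.speech = 0
        self.silence = 0

    def feed(self, block) -> bool:
        """Consume one block of int16 samples; return True once the turn has ended."""
        level = sum(abs(s) for s in block) / len(block) / 32768.0
        if level > self.threshold:
            self.speech += len(block)
            self.silence = 0
        elif self.speech:
            self.silence += len(block)
        return self.speech >= self.min_speech and self.silence >= self.max_silence

loud = array("h", [3000] * BLOCK)  # level of about 0.09, above the threshold
quiet = array("h", [0] * BLOCK)    # digital silence

detector = TurnDetector()
results = [detector.feed(b) for b in [loud, loud, quiet, quiet, quiet, quiet]]
```

With 200 ms blocks, the 0.3 s minimum needs two loud blocks and the 0.8 s silence needs four quiet ones, so only the last call reports the end of the turn. A single loud block followed by silence never triggers, which is what filters out short noises.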

[Here is a working script](https://github.com/ozgurgulerx/ai-builders/blob/main/part1_build/p1uc1_realtime_api_converse_step2.py) 

Now the bot is more responsive and can detect the user's speech more accurately. \
The conversation has become more fluid and natural. \
Well done! 

One missing piece is interrupt handling: the voice bot keeps talking even when we interrupt it. We will address this in the next step.




### Implementing Effective Interruption Handling in Voice Conversation Systems

#### Understanding the Challenge

Interruption handling in voice conversation systems requires careful consideration of both technical and user experience aspects. When a human interrupts an AI system mid-response, they expect the system to behave similarly to a human conversation partner - stopping the current topic and paying attention to the new input.

#### Core Requirements

##### 1. Speech Detection During System Output

The system must actively monitor user audio even while it's speaking. This requires:
- Maintaining an audio buffer of recent input
- Continuously analyzing audio levels
- Setting appropriate threshold values that can distinguish between background noise and intentional speech
- Implementing this without impacting system performance
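
One lightweight way to keep a buffer of recent input is a fixed-size queue of per-block levels. This is a minimal sketch, assuming int16 samples; `RecentLevels` is a hypothetical class, and the window of five blocks is an arbitrary choice.

```python
from collections import deque

class RecentLevels:
    """Keep the normalized level of the last few input blocks."""

    def __init__(self, max_blocks: int = 5):
        # deque with maxlen drops the oldest entry automatically
        self.levels = deque(maxlen=max_blocks)

    def add_block(self, samples) -> float:
        """Store and return the mean absolute level (0.0 to 1.0) of int16 samples."""
        level = sum(abs(s) for s in samples) / len(samples) / 32768.0
        self.levels.append(level)
        return level

    def average(self) -> float:
        """Average level over the stored blocks, or 0.0 when empty."""
        return sum(self.levels) / len(self.levels) if self.levels else 0.0
```

Storing one float per block instead of raw audio keeps memory use constant, which helps with the performance concern in the last bullet.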

##### 2. Immediate Response Cancellation

When interruption is detected, the system should:
- Stop the current audio output immediately
- Cancel any pending response generation
- Provide feedback to the user that their interruption was recognized
- Complete this process quickly enough to feel natural
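
The cancellation step can be sketched as two client events sent in order. This assumes the field names documented for the Realtime API's `conversation.item.truncate` event (`item_id`, `content_index`, `audio_end_ms`). `build_interrupt_events` is a hypothetical helper, and the caller is assumed to track the id of the assistant item being played and how many milliseconds of it were heard.

```python
import json

def build_interrupt_events(item_id: str, played_ms: int) -> list:
    """Return the JSON messages to send, in order, when the user interrupts."""
    cancel = {"type": "response.cancel"}
    truncate = {
        "type": "conversation.item.truncate",
        "item_id": item_id,         # assistant item being played (tracked by the caller)
        "content_index": 0,         # first content part of that item
        "audio_end_ms": played_ms,  # how much audio the user actually heard
    }
    return [json.dumps(cancel), json.dumps(truncate)]

events = build_interrupt_events("item_123", 1500)  # "item_123" is a made-up id
```

Each string in `events` can be passed to `websocket.send` as is. Local playback still has to be stopped separately, since these events only affect the server side.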

##### 3. Context Management

The most crucial aspect of interruption handling is proper context management:
- The system must clear its current conversation context
- Previous topic and response should be completely abandoned
- New user input should be treated as a fresh conversation start
- The system should be ready to process the new topic immediately

##### 4. Natural Conversation Flow

To maintain natural interaction:
- The interruption threshold should be carefully calibrated
- False positives (detecting interruptions when there aren't any) should be minimized
- The system should respond quickly to genuine interruptions
- The transition to the new topic should feel smooth and natural
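
One common way to reduce false positives is to require several consecutive loud blocks before declaring an interruption. This is a minimal sketch: `InterruptDebouncer` is a hypothetical class, the 0.02 threshold matches the interrupt threshold used in this step, and the three-block requirement is an assumption to tune.

```python
class InterruptDebouncer:
    """Report an interruption only after several consecutive loud blocks."""

    def __init__(self, threshold: float = 0.02, required_blocks: int = 3):
        self.threshold = threshold
        self.required_blocks = required_blocks
        self.streak = 0

    def update(self, level: float) -> bool:
        """Feed one normalized block level; return True once the streak is long enough."""
        self.streak = self.streak + 1 if level > self.threshold else 0
        return self.streak >= self.required_blocks

debouncer = InterruptDebouncer()
decisions = [debouncer.update(x) for x in [0.05, 0.01, 0.05, 0.05, 0.05]]
```

The isolated spike at the start is discarded because the quiet block resets the streak. The trade-off is added latency: each required block delays detection by one block duration.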

#### Implementation Considerations

When implementing these features, consider:
- Audio processing overhead and its impact on system performance
- Balance between sensitivity and accuracy in interruption detection
- Proper cleanup of system resources during interruption
- Error handling for various edge cases
- Testing with different voices and background conditions

#### Success Metrics

A well-implemented interruption system should:
- Detect interruptions within 200-300ms
- Successfully abandon previous context
- Process new input correctly
- Maintain conversation fluidity
- Handle rapid topic switches naturally
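
The 200-300ms target constrains the audio block size, because a level check can only run once a full block has arrived. The arithmetic below is a rough estimate, not a measurement. Both function names are hypothetical, and the 50 ms polling interval is an assumption borrowed from the main loop's sleep.

```python
def block_duration_ms(blocksize: int, sample_rate: int = 24000) -> float:
    """Duration of one audio block in milliseconds."""
    return 1000.0 * blocksize / sample_rate

def estimated_detection_ms(blocksize: int, blocks_required: int = 1,
                           poll_interval_ms: float = 50.0,
                           sample_rate: int = 24000) -> float:
    """Rough estimate: time to fill the required blocks plus one polling interval.

    Speech that starts in the middle of a block can add up to one more block.
    """
    return blocks_required * block_duration_ms(blocksize, sample_rate) + poll_interval_ms

current = estimated_detection_ms(4800)                    # 200 ms blocks
smaller = estimated_detection_ms(480, blocks_required=3)  # 20 ms blocks, debounced
```

With 4800-sample blocks a single block already uses 200 ms of the budget, so meeting the target with any debouncing likely calls for smaller blocks.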

Understanding and implementing these aspects will create a more natural and engaging voice interaction system that better matches human conversation patterns.

In [None]:
import asyncio
import os
import base64
import json
import numpy as np
import sounddevice as sd
import websockets
from dotenv import load_dotenv

class AudioProcessor:
    def __init__(self, sample_rate=24000):
        self.sample_rate = sample_rate
        self.vad_threshold = 0.015
        self.interrupt_threshold = 0.02
        self.speech_frames = 0
        self.silence_frames = 0
        self.min_speech_duration = int(0.3 * sample_rate)
        self.max_silence_duration = int(0.8 * sample_rate)
        self.buffer = []
        self.is_speaking = False
        self.speech_detected = False
        self.latest_audio = None
        self.was_interrupted = False  # Track interruption state

    def process_audio(self, indata):
        self.latest_audio = indata
        if self.is_speaking:
            return

        audio_level = np.abs(indata).mean() / 32768.0
        
        if audio_level > self.vad_threshold:
            self.speech_detected = True
            self.speech_frames += len(indata)
            self.silence_frames = 0
            self.buffer.extend(indata.tobytes())
        elif self.speech_detected:
            self.silence_frames += len(indata)
            if self.silence_frames < self.max_silence_duration:
                self.buffer.extend(indata.tobytes())

    def check_interruption(self):
        if not self.is_speaking or self.latest_audio is None:
            return False
        audio_level = np.abs(self.latest_audio).mean() / 32768.0
        return audio_level > self.interrupt_threshold

    def should_process(self):
        return (self.speech_detected and 
                self.speech_frames >= self.min_speech_duration and 
                self.silence_frames >= self.max_silence_duration)

    def reset(self):
        self.speech_frames = 0
        self.silence_frames = 0
        self.speech_detected = False
        audio_data = bytes(self.buffer)
        self.buffer.clear()
        return audio_data

class ConversationSystem:
    def __init__(self):
        load_dotenv()
        self.api_key = os.getenv("AZURE_OPENAI_API_KEY")
        if not self.api_key:
            raise ValueError("AZURE_OPENAI_API_KEY not found")
            
        self.url = (
            "wss://aoai-ep-swedencentral02.openai.azure.com/openai/realtime?"
            f"api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview&"
            f"api-key={self.api_key}"
        )
        
        self.audio_processor = AudioProcessor()
        self.streams = {'input': None, 'output': None}

    def audio_callback(self, indata, frames, time, status):
        if status:
            print(f"Audio error: {status}")
            return
        self.audio_processor.process_audio(indata)

    async def setup_audio(self):
        self.streams['output'] = sd.OutputStream(
            samplerate=24000, channels=1, dtype=np.int16)
        self.streams['input'] = sd.InputStream(
            samplerate=24000, channels=1, dtype=np.int16,
            callback=self.audio_callback, blocksize=4800)
            
        for stream in self.streams.values():
            stream.start()

    async def setup_websocket_session(self, websocket):
        session_config = {
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": "You are a helpful AI assistant. Keep responses brief.",
                "modalities": ["audio", "text"],
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.3,
                    "prefix_padding_ms": 150,
                    "silence_duration_ms": 600
                }
            }
        }
        
        await websocket.send(json.dumps(session_config))
        
        while True:
            response = json.loads(await websocket.recv())
            if response["type"] == "session.created":
                break
            if response["type"] == "error":
                raise Exception(f"Session setup failed: {response}")

    async def handle_interruption(self, websocket):
        print("Interrupted!")
        # First cancel the current response
        await websocket.send(json.dumps({
            "type": "response.cancel"
        }))
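        # Then try to drop the interrupted content from the conversation.
        # NOTE: the Realtime API documents conversation.item.truncate with
        # "item_id", "content_index" and "audio_end_ms"; it shortens a single
        # assistant audio item. "position" is not one of those fields, so the
        # server may reject this payload with an error event.
        await websocket.send(json.dumps({
            "type": "conversation.item.truncate",
            "position": 0
        }))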
        self.audio_processor.was_interrupted = True

    async def send_audio(self, websocket, audio_data):
        audio_base64 = base64.b64encode(audio_data).decode('utf-8')
        
        await websocket.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": audio_base64
        }))
        await websocket.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await websocket.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["audio", "text"]}
        }))

    async def handle_response(self, websocket):
        self.audio_processor.is_speaking = True
        try:
            while True:
                if self.audio_processor.check_interruption():
                    await self.handle_interruption(websocket)
                    break
                
                response = json.loads(await websocket.recv())
                
                if response["type"] == "response.audio.delta":
                    if "delta" in response:
                        try:
                            audio_data = response["delta"].strip()
                            padding = -len(audio_data) % 4
                            if padding:
                                audio_data += "=" * padding
                            
                            audio = np.frombuffer(
                                base64.b64decode(audio_data), 
                                dtype=np.int16
                            )
                            self.streams['output'].write(audio)
                            
                        except Exception as e:
                            print(f"Audio processing error: {e}")
                            
                elif response["type"] == "response.done":
                    break
                    
        finally:
            self.audio_processor.is_speaking = False

    async def run(self):
        await self.setup_audio()
        
        async with websockets.connect(self.url) as ws:
            await self.setup_websocket_session(ws)
            print("Ready for conversation")
            
            while True:
                if self.audio_processor.should_process():
                    audio_data = self.audio_processor.reset()
                    await self.send_audio(ws, audio_data)
                    await self.handle_response(ws)
                await asyncio.sleep(0.05)

if __name__ == "__main__":
    system = ConversationSystem()
    asyncio.run(system.run())