# Speech-to-Text AI Agent with Tool Calling

## Objective

This notebook demonstrates how to build a speech-enabled AI agent that:

- Converts audio input to text using Whisper
- Uses GPT function calling (tool calling)
- Dynamically invokes Python tools
- Maintains conversation memory
- Performs multi-step reasoning

---

## What This Notebook Covers

1. Secure API initialization
2. Conversation memory handling
3. Tool schema design
4. Function calling workflow
5. Multi-step agent execution
6. Real-time tool invocation

## System Architecture

Audio Input  
   ↓  
Whisper (Speech-to-Text)  
   ↓  
GPT Model  
   ↓  
Tool Decision (Function Calling)  
   ↓  
Execute Python Tool  
   ↓  
Final GPT Response  

This is an agentic workflow: the model, not the calling code, decides whether a tool is needed before it produces the final answer.

## Install Dependencies

In [None]:
!pip install openai requests pytz

In [None]:
from IPython.display import Javascript, display
from google.colab.output import eval_js
import base64

def record_audio(seconds=5, filename="live_audio.webm"):

    print("Recording... Speak now!")

    display(Javascript("""
    async function recordAudio(seconds) {
      const stream = await navigator.mediaDevices.getUserMedia({audio: true});
      const recorder = new MediaRecorder(stream);
      let chunks = [];
      recorder.ondataavailable = e => chunks.push(e.data);
      recorder.start();
      await new Promise(resolve => setTimeout(resolve, seconds * 1000));
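      // Register the stop handler before stopping so the event cannot be missed
      const stopped = new Promise(resolve => recorder.onstop = resolve);
      recorder.stop();
      await stopped;
      // Release the microphone so the browser's recording indicator turns off
      stream.getTracks().forEach(track => track.stop());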
      const blob = new Blob(chunks, {type: 'audio/webm'});
      const arrayBuffer = await blob.arrayBuffer();
      return btoa(
        new Uint8Array(arrayBuffer)
          .reduce((data, byte) => data + String.fromCharCode(byte), '')
      );
    }
    """))

    audio_base64 = eval_js(f"recordAudio({seconds})")
    audio_bytes = base64.b64decode(audio_base64)

    with open(filename, "wb") as f:
        f.write(audio_bytes)

    print("Recording complete")
    return filename

## API Setup

We load API keys from Google Colab's secret storage (`userdata`) instead of hardcoding them, so credentials never appear in the notebook source.
The cell below expects secrets named `OPENAI_API_KEY` and `OPENAI_BASE_URL`.
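
The same idea can be carried outside Colab. The sketch below is a minimal illustration, not part of the original notebook: `load_secret` is a hypothetical helper that tries Colab's secret store first and falls back to an environment variable when `google.colab` is unavailable or the secret cannot be read.

```python
import os

def load_secret(name, default=None):
    """Return a secret from Colab's store, else from an environment variable.

    Hypothetical helper for illustration; the notebook itself calls
    userdata.get() directly.
    """
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name, default)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: use the environment
        return os.environ.get(name, default)
```

Usage would look like `OPENAI_API_KEY = load_secret("OPENAI_API_KEY")`, which keeps the rest of the notebook unchanged.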

In [None]:
from google.colab import userdata
from openai import OpenAI
import json
import requests

OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
OPENAI_BASE_URL = userdata.get("OPENAI_BASE_URL")

client = OpenAI(
    api_key=OPENAI_API_KEY,
    base_url=OPENAI_BASE_URL
)

print("OpenAI Client Initialized")

## Conversation Memory

We use structured memory with roles:

- system → sets behavior
- user → user inputs
- assistant → model responses
- tool → tool outputs

This enables contextual multi-turn interaction.
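
To make the roles concrete, here is a sketch of what the message list might look like after one tool round trip. The id `call_demo_1` and the time value are made up for illustration, and `unanswered_tool_calls` is a hypothetical helper that checks the pairing rule: every tool call requested by the assistant should be followed by a `tool` message with the same `tool_call_id`.

```python
example_memory = [
    {"role": "system", "content": "You are a voice assistant. Use tools when needed."},
    {"role": "user", "content": "What time is it?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_demo_1",  # made-up id for illustration
                "type": "function",
                "function": {"name": "get_time", "arguments": "{}"},
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_demo_1", "content": "10:30 AM"},
    {"role": "assistant", "content": "It is 10:30 AM."},
]

def unanswered_tool_calls(messages):
    """Return ids of tool calls that have no matching tool message."""
    requested = [
        call["id"]
        for m in messages
        for call in (m.get("tool_calls") or [])
    ]
    answered = {m["tool_call_id"] for m in messages if m["role"] == "tool"}
    return [call_id for call_id in requested if call_id not in answered]
```

Calling `unanswered_tool_calls(example_memory)` returns an empty list; truncating the list right after the assistant's tool request would report `call_demo_1` as unanswered.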

In [None]:
memory = [
    {
        "role": "system",
        "content": "You are a voice assistant. Use tools when needed."
    }
]

## Tool Definitions

We define real Python functions and expose them to GPT
through tool schemas.
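
Because the schema and the Python function are written separately, they can drift apart. The sketch below shows one possible consistency check; `demo_add`, `demo_registry`, `demo_tools`, and `schema_mismatches` are all hypothetical names used only here, and the check assumes each schema's `properties` should match the function's parameter names exactly.

```python
import inspect

def demo_add(a, b):
    """Hypothetical tool used only for this sketch."""
    return str(a + b)

demo_registry = {"demo_add": demo_add}

demo_tools = [
    {
        "type": "function",
        "function": {
            "name": "demo_add",
            "description": "Add two integers",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "integer"},
                    "b": {"type": "integer"},
                },
                "required": ["a", "b"],
            },
        },
    }
]

def schema_mismatches(tools, registry):
    """List problems where a schema and its Python function disagree."""
    problems = []
    for tool in tools:
        spec = tool["function"]
        name = spec["name"]
        func = registry.get(name)
        if func is None:
            problems.append(f"{name}: no Python function registered")
            continue
        declared = set(spec["parameters"].get("properties", {}))
        accepted = set(inspect.signature(func).parameters)
        if declared != accepted:
            problems.append(
                f"{name}: schema has {sorted(declared)}, function takes {sorted(accepted)}"
            )
    return problems
```

Running a check like this once after defining the tools catches a renamed parameter before the model ever tries to call it.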

In [None]:
# -------------------
# Tool 1: Get Time
# -------------------
def get_time():
    from datetime import datetime
    import pytz

    ist = pytz.timezone("Asia/Kolkata")
    current_time = datetime.now(ist)

    return current_time.strftime("%I:%M %p")
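

# -------------------
# Tool 2: Get Weather
# -------------------
WEATHER_API_KEY = userdata.get("WEATHER_API_KEY")

def get_weather(city):
    # Use HTTPS so the API key in the query string is not sent in clear text
    url = "https://api.weatherapi.com/v1/current.json"

    params = {
        "key": WEATHER_API_KEY,
        "q": city
    }

    try:
        response = requests.get(url, params=params, timeout=10)
        data = response.json()
    except (requests.RequestException, ValueError) as exc:
        # Return the failure as text so the model can relay it to the user
        return f"Weather lookup failed: {exc}"

    # WeatherAPI reports problems (e.g. an unknown city) in an "error" object
    if "error" in data:
        return f"Weather lookup failed: {data['error'].get('message', 'unknown error')}"

    location = data["location"]["name"]
    country = data["location"]["country"]
    temp = data["current"]["temp_c"]
    condition = data["current"]["condition"]["text"]

    return f"The current weather in {location}, {country} is {temp}°C with {condition}."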

In [None]:
tool_functions = {
    "get_time": get_time,
    "get_weather": get_weather
}

In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_time",
            "description": "Get the current time in India (IST, Asia/Kolkata)",
            "parameters": {
                "type": "object",
                "properties": {}
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. Mumbai"
                    }
                },
                "required": ["city"]
            }
        }
    }
]

## Speech-to-Text (Whisper)

We convert audio input into text before passing it to GPT.

In [None]:
def speech_to_text(audio_path):

    with open(audio_path, "rb") as audio_file:

        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            temperature=0
        )

    return transcription.text.strip()

## Agent Execution Logic

1. Add user message to memory
2. Call GPT with tool definitions
3. If GPT requests a tool:
   - Execute tool
   - Add tool result to memory
   - Call GPT again
4. Return final response
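
The steps above can be sketched without any network access. In this illustration `fake_model` is a hypothetical stand-in for the chat completion call (it requests `get_time` once, then answers from the tool result), and `run_agent_sketch` is an assumed variant of the loop that repeats step 3 up to `max_rounds` times; neither is the notebook's actual implementation.

```python
import json

def fake_model(messages):
    """Stand-in for the chat completion call (hypothetical, no network)."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"role": "assistant", "content": f"The time is {last['content']}."}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_demo_1",  # made-up id
                "type": "function",
                "function": {"name": "get_time", "arguments": "{}"},
            }
        ],
    }

def run_agent_sketch(user_text, messages, registry, model=fake_model, max_rounds=3):
    # Step 1: add the user message to memory
    messages.append({"role": "user", "content": user_text})
    for _ in range(max_rounds):
        # Step 2: call the model with the conversation so far
        reply = model(messages)
        messages.append(reply)
        # Step 4: no tool requested, so this is the final response
        if not reply.get("tool_calls"):
            return reply["content"]
        # Step 3: execute each requested tool and store its result
        for call in reply["tool_calls"]:
            func = registry[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(func(**args)),
            })
    return "Stopped: too many tool rounds."

demo_messages = [{"role": "system", "content": "You are a voice assistant."}]
answer = run_agent_sketch(
    "What time is it?", demo_messages, {"get_time": lambda: "10:30 AM"}
)
```

Swapping `fake_model` for a real API call leaves the control flow unchanged, which makes the loop easy to test in isolation.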

In [None]:
def run_agent(user_text):

    memory.append({"role": "user", "content": user_text})

    # First call: the model may answer directly or request tools
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=memory,
        tools=tools,
        tool_choice="auto",
        temperature=0.3
    )

    message = response.choices[0].message

    # If the model requested one or more tools
    if message.tool_calls:

        memory.append({
            "role": "assistant",
            "content": message.content,
            "tool_calls": message.tool_calls
        })

        # Answer every requested call: the API expects one tool message
        # per tool_call_id before the next model call
        for tool_call in message.tool_calls:
            tool_name = tool_call.function.name
            args = json.loads(tool_call.function.arguments or "{}")

            print("Tool Called:", tool_name)

            if tool_name in tool_functions:
                result = tool_functions[tool_name](**args)
            else:
                result = f"Unknown tool: {tool_name}"

            memory.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result)
            })

        # Second call: the model turns the tool results into a final answer
        second_response = client.chat.completions.create(
            model="gpt-4.1-nano",
            messages=memory
        )

        final_text = second_response.choices[0].message.content

    else:
        final_text = message.content

    memory.append({"role": "assistant", "content": final_text})

    return final_text

In [None]:
while True:

    command = input("\nType 'speak' to talk or 'quit' to exit: ")

    if command.lower() == "quit":
        print("Exiting Agent")
        break

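    if command.lower() == "speak":

        audio_path = record_audio(seconds=5)

        user_text = speech_to_text(audio_path)
        print("User:", user_text)

        # Skip the model call when nothing was transcribed
        if not user_text:
            print("No speech detected, please try again.")
            continue

        response = run_agent(user_text)
        print("Assistant:", response)

    else:
        print("Unknown command.")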

# Final Observations

## What This Notebook Demonstrates

- Speech recognition using Whisper
- GPT function calling
- Dynamic tool execution
- Multi-step reasoning
- Memory-based conversation
- Agentic AI workflow

---

## Key Learning Outcomes

1. LLMs can extend capabilities using tools
2. Function schemas guide tool usage
3. Memory enables contextual dialogue
4. Agents require multi-step execution logic

---

## Possible Enhancements

- Add more tools (calculator, database, search)
- Add text-to-speech output
- Deploy using FastAPI
- Add long-term persistent memory
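
As one illustration of the last item, conversation memory could be persisted to a JSON file between sessions. This is a minimal sketch under stated assumptions: `save_memory` and `load_memory` are hypothetical helpers, the file path is a temporary one chosen for the demo, and every message is assumed to be a plain dict (tool-call objects returned by the SDK would need converting to dicts first).

```python
import json
import os
import tempfile

def save_memory(messages, path):
    """Write the message list to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(messages, f, ensure_ascii=False, indent=2)

def load_memory(path, system_prompt):
    """Load saved messages, or start fresh with only the system prompt."""
    if not os.path.exists(path):
        return [{"role": "system", "content": system_prompt}]
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Demo round trip in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "memory.json")
fresh = load_memory(demo_path, "You are a voice assistant.")
save_memory(fresh + [{"role": "user", "content": "Hello"}], demo_path)
restored = load_memory(demo_path, "You are a voice assistant.")
```

A plain JSON file is enough for a single-user notebook; a database would be a more natural fit once several sessions or users are involved.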