# Multi-Modal Content Analysis Agent (Airline Support)

This notebook contains a **multi-modal, function-calling AI agent** designed to process **text, audio, and image** inputs and automatically:
- classify the customer request (intent, urgency, sentiment),
- call tools when needed (e.g., a ticket-price lookup),
- return a short, courteous response suitable for customer support.

The code contains no hard-coded secrets, keeps a clean structure, and runs end-to-end.


## 0) Setup

### Create a virtualenv (recommended)
```bash
python -m venv .venv
source .venv/bin/activate
```

### Install dependencies
```bash
pip install -U openai python-dotenv gradio pillow
```

### Set your API key
Create a `.env` file:
```bash
OPENAI_API_KEY="YOUR_KEY_HERE"
```

Or export it in your shell:
```bash
export OPENAI_API_KEY="YOUR_KEY_HERE"
```


In [None]:
# Imports
import os
import json
import sqlite3
import base64
from io import BytesIO
from pathlib import Path

from dotenv import load_dotenv
from PIL import Image

from openai import OpenAI


In [None]:
# Load env + init OpenAI client
load_dotenv(override=True)

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY not found. Add it to your .env or export it in your shell.")

client = OpenAI(api_key=api_key)

# Pick a default model. You can swap this to whatever you have access to.
MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")


## 1) Minimal tool: ticket price lookup (SQLite)

To keep this repo self-contained, we use a tiny SQLite DB as an example **tool** the model can call.


In [None]:
DB_PATH = Path("flightai.db")

def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        cur = conn.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS ticket_prices (city TEXT PRIMARY KEY, price_usd INTEGER)")
        cur.executemany(
            "INSERT OR REPLACE INTO ticket_prices(city, price_usd) VALUES(?, ?)",
            [
                ("Paris", 640),
                ("New York City", 380),
                ("Tokyo", 980),
                ("San Francisco", 420),
                ("London", 710),
                ("Dubai", 880),
            ],
        )
        conn.commit()

init_db()

def get_ticket_price(destination_city: str) -> dict:
    """Tool: return price for a return ticket to the destination city."""
    with sqlite3.connect(DB_PATH) as conn:
        cur = conn.cursor()
        cur.execute("SELECT price_usd FROM ticket_prices WHERE city = ?", (destination_city,))
        row = cur.fetchone()
    if not row:
        return {"destination_city": destination_city, "price_usd": None, "note": "No price available for this city."}
    return {"destination_city": destination_city, "price_usd": int(row[0])}

# Quick sanity check
get_ticket_price("Paris")
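
The lookup above matches the city name exactly, so `"paris"` or `" Paris "` would return no price. A minimal sketch of a more forgiving lookup, assuming the same `ticket_prices` schema, is shown below. The helper name `lookup_price` and the in-memory database are illustrative and not part of the notebook; note that SQLite's built-in `NOCASE` collation only folds ASCII letters.

```python
import sqlite3

def lookup_price(conn: sqlite3.Connection, city: str):
    """Hypothetical helper: case-insensitive price lookup, None if the city is absent."""
    row = conn.execute(
        # COLLATE NOCASE makes the comparison ignore ASCII letter case
        "SELECT price_usd FROM ticket_prices WHERE city = ? COLLATE NOCASE",
        (city.strip(),),
    ).fetchone()
    return row[0] if row else None

# In-memory database so the sketch does not touch flightai.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticket_prices (city TEXT PRIMARY KEY, price_usd INTEGER)")
conn.execute("INSERT INTO ticket_prices VALUES ('Paris', 640)")

print(lookup_price(conn, "  paris "))  # 640
print(lookup_price(conn, "Berlin"))    # None
```

Normalizing in the tool rather than in the prompt keeps the behavior deterministic, since the model may pass the city in whatever casing the customer used.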


## 2) Multi-modal input helpers

We support:
- **Text**: direct message
- **Audio**: transcribe to text, then analyze
- **Image**: send the image to a vision-capable model for analysis


In [None]:
def encode_image_to_data_url(image_path: str) -> str:
    """Convert an image file to a data URL suitable for chat.completions vision input."""
    image_path = Path(image_path)
    b = image_path.read_bytes()
    b64 = base64.b64encode(b).decode("utf-8")
    # best-effort MIME
    suffix = image_path.suffix.lower().lstrip(".")
    mime = "image/png" if suffix == "png" else "image/jpeg" if suffix in {"jpg", "jpeg"} else "application/octet-stream"
    return f"data:{mime};base64,{b64}"

def transcribe_audio(audio_path: str) -> str:
    """Transcribe an audio file to text."""
    audio_path = Path(audio_path)
    with audio_path.open("rb") as f:
        transcript = client.audio.transcriptions.create(
            model=os.getenv("OPENAI_STT_MODEL", "whisper-1"),
            file=f,
        )
    return transcript.text
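
`encode_image_to_data_url` falls back to `application/octet-stream` for unknown extensions, which a vision endpoint is unlikely to accept. One possible alternative, sketched here with the standard-library `mimetypes` module, is to reject non-image files before any request is made. The function names `guess_image_mime` and `bytes_to_data_url` are assumptions for illustration only.

```python
import base64
import mimetypes

def guess_image_mime(filename: str) -> str:
    """Return the image MIME type implied by the file extension, or raise ValueError."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {filename}")
    return mime

def bytes_to_data_url(data: bytes, filename: str) -> str:
    """Build a base64 data URL from raw bytes, using the filename to pick the MIME type."""
    mime = guess_image_mime(filename)
    b64 = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{b64}"

print(guess_image_mime("tag.png"))   # image/png
print(guess_image_mime("tag.jpeg"))  # image/jpeg
```

This check only looks at the extension; it does not verify that the bytes are a valid image.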


## 3) The agent: content analysis + tool calling

The agent produces a compact JSON output with:
- `intent` (one of: pricing, booking_change, baggage, policy, complaint, other)
- `urgency` (low, medium, high)
- `sentiment` (positive, neutral, negative)
- `needs_tool`, `tool_name`, and `tool_args` (when a database/tool lookup is needed)
- `final_reply` (one sentence)

Together these cover the three capabilities the notebook is built around: **function-calling**, **multi-modal inputs**, and **automated resolution** of content evaluation queries.
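
Because the model is only asked for this shape and nothing enforces it, it can help to check the result before using it. The sketch below is one way to do that, using the key names and allowed values from the system prompt. `validate_analysis` is a hypothetical helper, and the `example` dict is hand-written for illustration, not real model output.

```python
REQUIRED_KEYS = {
    "intent", "urgency", "sentiment",
    "needs_tool", "tool_name", "tool_args", "final_reply",
}
ALLOWED_VALUES = {
    "intent": {"pricing", "booking_change", "baggage", "policy", "complaint", "other"},
    "urgency": {"low", "medium", "high"},
    "sentiment": {"positive", "neutral", "negative"},
}

def validate_analysis(data: dict) -> list:
    """Return a list of human-readable problems; an empty list means the dict looks valid."""
    problems = []
    missing = REQUIRED_KEYS - set(data)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    for key, allowed in ALLOWED_VALUES.items():
        if key in data and (not isinstance(data[key], str) or data[key] not in allowed):
            problems.append(f"{key} has unexpected value: {data[key]!r}")
    if "needs_tool" in data and not isinstance(data["needs_tool"], bool):
        problems.append("needs_tool must be a boolean")
    if "tool_args" in data and not isinstance(data["tool_args"], dict):
        problems.append("tool_args must be an object")
    return problems

# Hand-written example of the expected shape (not real model output)
example = {
    "intent": "pricing", "urgency": "low", "sentiment": "neutral",
    "needs_tool": True, "tool_name": "get_ticket_price",
    "tool_args": {"destination_city": "Paris"},
    "final_reply": "A return ticket to Paris costs $640.",
}
print(validate_analysis(example))  # []
```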


In [None]:
SYSTEM = """You are FlightAI, an airline customer support assistant.
You must be accurate, courteous, and brief: respond with 1 sentence.
You also act as a content analysis agent: classify the request and decide whether a tool lookup is needed.

Return STRICT JSON with exactly these keys:
intent: one of [pricing, booking_change, baggage, policy, complaint, other]
urgency: one of [low, medium, high]
sentiment: one of [positive, neutral, negative]
needs_tool: boolean
tool_name: string or null
tool_args: object (empty if none)
final_reply: string (one sentence)
"""

TOOL_SPEC = [{
    "type": "function",
    "function": {
        "name": "get_ticket_price",
        "description": "Get the price of a return ticket to the destination city.",
        "parameters": {
            "type": "object",
            "properties": {
                "destination_city": {"type": "string", "description": "City to fly to (e.g., Paris)."}
            },
            "required": ["destination_city"]
        }
    }
}]

def run_agent(messages):
    """Run the model, handle tool calls if requested, and return parsed JSON."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=TOOL_SPEC,
        tool_choice="auto",
        temperature=0.2,
    )

    msg = resp.choices[0].message

    # Tool loop (handles one or multiple calls)
    while getattr(msg, "tool_calls", None):
        tool_results = []
        for tc in msg.tool_calls:
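            name = tc.function.name
            try:
                args = json.loads(tc.function.arguments or "{}")
                if name == "get_ticket_price":
                    result = get_ticket_price(**args)
                else:
                    result = {"error": f"Unknown tool: {name}"}
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed or unexpected arguments: report back to the model instead of crashing
                result = {"error": f"Invalid arguments for {name}: {exc}"}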

            tool_results.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })

        messages = messages + [msg] + tool_results
        resp = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOL_SPEC,
            tool_choice="auto",
            temperature=0.2,
        )
        msg = resp.choices[0].message

    content = msg.content or "{}"
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        data = {
            "intent": "other",
            "urgency": "low",
            "sentiment": "neutral",
            "needs_tool": False,
            "tool_name": None,
            "tool_args": {},
            "final_reply": content.strip()[:240],
        }
    return data
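
The tool loop in `run_agent` selects the tool with an `if`/`else` on the name. If more tools are added, a small registry may scale better. The following is a sketch under stated assumptions, not the notebook's implementation: `TOOLS` and `dispatch_tool` are hypothetical names, and `get_ticket_price` is replaced by a stand-in so the block runs without the database.

```python
import json

def get_ticket_price(destination_city: str) -> dict:
    # Stand-in for the SQLite-backed tool so this block is self-contained
    prices = {"Paris": 640}
    return {"destination_city": destination_city, "price_usd": prices.get(destination_city)}

# Registry mapping tool names (as declared in the tool spec) to Python callables
TOOLS = {"get_ticket_price": get_ticket_price}

def dispatch_tool(name: str, raw_arguments: str) -> dict:
    """Run a named tool with JSON-encoded arguments; always return a JSON-serializable dict."""
    func = TOOLS.get(name)
    if func is None:
        return {"error": f"Unknown tool: {name}"}
    try:
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as exc:
        return {"error": f"Invalid JSON arguments: {exc}"}
    if not isinstance(args, dict):
        return {"error": "Tool arguments must be a JSON object."}
    try:
        return func(**args)
    except TypeError as exc:
        # Raised for missing or unexpected keyword arguments
        # (and for any TypeError inside the tool itself)
        return {"error": f"Bad arguments for {name}: {exc}"}

print(dispatch_tool("get_ticket_price", '{"destination_city": "Paris"}'))
print(dispatch_tool("get_ticket_price", '{"city": "Paris"}'))
```

Returning an error dict rather than raising lets the error travel back to the model as the tool result, so it can correct its arguments on the next turn.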


### Text-only demo

In [None]:
user_text = "Hi, how much is a return ticket to Paris next month?"
messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": user_text},
]
run_agent(messages)
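
`run_agent` falls back to a default classification whenever the reply is not bare JSON. Models sometimes wrap JSON in a Markdown code fence, which would trigger that fallback even though the payload is valid. A tolerant parser, sketched below, could be tried first. `parse_model_json` is a hypothetical helper, and the fence marker is built with string multiplication only to keep literal fences out of the example.

```python
import json

FENCE = "`" * 3  # the Markdown code-fence marker

def parse_model_json(text: str):
    """Parse a JSON object from model output, tolerating a surrounding code fence.

    Returns the parsed dict, or None if the text is not a JSON object.
    """
    cleaned = text.strip()
    if (
        len(cleaned) >= 2 * len(FENCE)
        and cleaned.startswith(FENCE)
        and cleaned.endswith(FENCE)
    ):
        body = cleaned[len(FENCE):-len(FENCE)]
        # Drop an optional language tag such as "json" on the opening fence line
        first_line, sep, rest = body.partition("\n")
        if sep and first_line.strip().isalpha():
            body = rest
        cleaned = body.strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

wrapped = FENCE + 'json\n{"intent": "pricing"}\n' + FENCE
print(parse_model_json(wrapped))  # {'intent': 'pricing'}
```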


### Image demo (optional)

Provide an image path (e.g., a photo of a damaged suitcase tag) and the agent will factor the image into its analysis.


In [None]:
# Replace with your own file path
# image_path = "example.jpg"
# data_url = encode_image_to_data_url(image_path)

# messages = [
#     {"role": "system", "content": SYSTEM},
#     {"role": "user", "content": [
#         {"type": "text", "text": "My luggage arrived damaged, can you help?"},
#         {"type": "image_url", "image_url": {"url": data_url}},
#     ]},
# ]
# run_agent(messages)


### Audio demo (optional)

Provide the path to an audio file; the notebook transcribes it and then analyzes the transcript.


In [None]:
# Replace with your own audio file path (wav/mp3/m4a)
# audio_path = "example.m4a"
# transcript = transcribe_audio(audio_path)
# print("Transcript:", transcript)

# messages = [
#     {"role": "system", "content": SYSTEM},
#     {"role": "user", "content": transcript},
# ]
# run_agent(messages)


## 4) Simple Gradio UI (text + optional image + optional audio)

This provides a lightweight UI that exercises all three capabilities: multi-modal inputs, automated resolution, and tool calling.


In [None]:
import gradio as gr

def handle_request(text, image_file, audio_file):
    parts = []
    if text:
        parts.append({"type": "text", "text": text})

    if audio_file:
        transcript = transcribe_audio(audio_file)
        parts.append({"type": "text", "text": f"[AUDIO TRANSCRIPT]\n{transcript}"})

    if image_file:
        data_url = encode_image_to_data_url(image_file)
        parts.append({"type": "image_url", "image_url": {"url": data_url}})

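    if not parts:
        # Nothing was submitted; skip the API call, which would reject an empty message
        return "Please enter a message or attach an image or audio file.", "{}"

    if len(parts) == 1 and parts[0]["type"] == "text":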
        user_content = parts[0]["text"]
    else:
        user_content = parts

    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user_content},
    ]

    out = run_agent(messages)
    # .get avoids a KeyError if the model omits final_reply
    return out.get("final_reply", ""), json.dumps(out, indent=2)

with gr.Blocks() as demo:
    gr.Markdown("# FlightAI: Multi-Modal Content Analysis Agent")
    with gr.Row():
        with gr.Column():
            text_in = gr.Textbox(label="Customer message (text)", lines=3)
            image_in = gr.File(label="Optional image (jpg/png)", file_types=[".jpg", ".jpeg", ".png"])
            audio_in = gr.File(label="Optional audio (wav/mp3/m4a)", file_types=[".wav", ".mp3", ".m4a"])
            run_btn = gr.Button("Analyze + Respond")
        with gr.Column():
            reply_out = gr.Textbox(label="1-sentence reply", lines=2)
            json_out = gr.Code(label="Structured output (JSON)", language="json")

    run_btn.click(fn=handle_request, inputs=[text_in, image_in, audio_in], outputs=[reply_out, json_out])

# launch() starts the app; in a notebook it renders inline
demo.launch()
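
The content-assembly step inside `handle_request` can be tested without Gradio or the API if it is factored into a pure function. The sketch below shows one way to do that. `build_user_content` is a hypothetical helper that follows the same rules as the handler: a lone text part is sent as a plain string, and anything else is sent as a list of typed parts.

```python
def build_user_content(text=None, transcript=None, image_data_url=None):
    """Assemble the user message content; return None when nothing was provided."""
    parts = []
    if text and text.strip():
        parts.append({"type": "text", "text": text.strip()})
    if transcript:
        parts.append({"type": "text", "text": f"[AUDIO TRANSCRIPT]\n{transcript}"})
    if image_data_url:
        parts.append({"type": "image_url", "image_url": {"url": image_data_url}})

    if not parts:
        return None
    # A single text part can be sent as a plain string
    if len(parts) == 1 and parts[0]["type"] == "text":
        return parts[0]["text"]
    return parts

print(build_user_content(text="How much is a ticket to Paris?"))
print(build_user_content(text="My bag is damaged", image_data_url="data:image/png;base64,AAAA"))
```

The data URL in the second call is a placeholder, not a real image.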
