# Project 2: FlightAI - Multimodal Airline Assistant

## Executive Summary

This project showcases a **production-grade multimodal AI assistant** designed for the airline industry. It goes beyond simple text chat by integrating **voice**, **images**, and **structured data** into a seamless customer service experience.

**Key Capabilities:**
1.  **Multimodal Interaction**: Users can speak to the assistant (Audio-to-Text) and hear responses back (Text-to-Speech).
2.  **Visual Context**: The system automatically generates high-quality images of destinations mentioned in the conversation.
3.  **Autonomous Action**: The AI uses **Function Calling** to look up ticket prices and record reservations without manual intervention. In this notebook both are backed by in-memory mock data rather than a live system.
4.  **Polyglot Support**: A translation tool is exposed to the model so it can serve a global audience (the implementation here is a placeholder).

**Technical Stack:**
- **LLM**: DeepSeek-V3.1 671B (via Ollama Cloud), chosen for its reasoning and tool-calling ability.
- **Audio**: OpenAI Whisper (ASR) and Google TTS.
- **Vision**: Pollinations.AI (Flux Model) for generative imagery.
- **UI**: Gradio for a responsive, interactive web interface.

## 1. Environment & Library Setup

We start by importing the core libraries that power our multimodal features:
- `gradio`: For building the web interface.
- `openai`: For interacting with the LLM API.
- `gtts` & `pydub`: For text-to-speech synthesis and audio processing.
- `whisper`: For robust speech recognition.
- `PIL` (Pillow): For image processing.
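
Before running the import cell, it can help to confirm that each dependency is installed. The following is a minimal sketch, not part of the original notebook: `find_missing_modules` and `REQUIRED_MODULES` are hypothetical names, and the list mirrors the libraries named above plus `requests` and `dotenv`, which the notebook also imports.

```python
import importlib.util

# Import names used by this notebook (an assumption based on its import cell).
REQUIRED_MODULES = [
    "gradio", "openai", "gtts", "pydub", "whisper", "PIL", "requests", "dotenv",
]

def find_missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Import names do not always match the package names used for installation: `PIL` is provided by the Pillow package and `dotenv` by python-dotenv.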

In [1]:
import os
import json
import requests
import gradio as gr
from openai import OpenAI
from io import BytesIO
from PIL import Image
from dotenv import load_dotenv
from gtts import gTTS
import whisper


## 2. Configuration & API Initialization

We load sensitive credentials from the `.env` file. 
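
Since `os.getenv` returns `None` for an unset variable, a missing entry in `.env` would otherwise only surface later as a confusing connection error. The sketch below shows one way to fail early; `require_env` and `build_api_base` are hypothetical helpers, not part of the notebook.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

def build_api_base(base_url):
    """Append the /v1 suffix without producing a double slash."""
    return base_url.rstrip("/") + "/v1"
```

With these helpers, the client's base URL could be built as `build_api_base(require_env("OLLAMA_BASE_URL"))`.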

**Model Selection**: We are using `deepseek-v3.1:671b-cloud`. This is a massive, state-of-the-art open model hosted on Ollama's cloud, chosen for its exceptional ability to handle complex tool-calling scenarios which smaller local models might struggle with.

In [None]:
# Load environment variables
load_dotenv(override=True)

# API configuration
OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL')
OLLAMA_API_KEY = os.getenv('OLLAMA_API_KEY')

# Model configuration
OLLAMA_CHAT_MODEL = "deepseek-v3.1:671b-cloud"
POLLINATIONS_API_URL = "https://image.pollinations.ai/prompt/{prompt}"

# Initialize OpenAI-compatible client (no request is sent until the first API call)
OLLAMA_CLIENT = OpenAI(base_url=f"{OLLAMA_BASE_URL}/v1", api_key=OLLAMA_API_KEY)

print(f"✅ Ollama client configured for: {OLLAMA_BASE_URL}")
print("✅ Image service configured: Pollinations.AI")

## 3. Audio Processing (Text-to-Speech)

To make the assistant feel alive, we implement a custom TTS function. 

**Innovation**: Standard TTS can feel too slow for conversational use. We use `pydub` to post-process the audio and play it 10% faster (`speed=1.1`). Because the speed-up works by reinterpreting the samples at a higher frame rate, the pitch rises slightly as well, which gives a brisker, more energetic delivery.

In [None]:
def generate_audio(message, speed=1.1):
    """
    Generates audio from text and adjusts playback speed.
    
    Args:
        message (str): Text to speak.
        speed (float): Speed multiplier (default 1.1x).
    """
    from pydub import AudioSegment
    
    # Generate base audio
    tts = gTTS(text=message, lang='en', slow=False)
    audio_fp = BytesIO()
    tts.write_to_fp(audio_fp)
    audio_fp.seek(0)
    
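    # Speed up audio: reinterpret the samples at a higher frame rate (this also
    # raises the pitch), then resample back to the original rate for playback.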
    audio = AudioSegment.from_file(audio_fp, format="mp3")
    audio_fast = audio._spawn(audio.raw_data, overrides={'frame_rate': int(audio.frame_rate * speed)})
    audio_fast = audio_fast.set_frame_rate(audio.frame_rate)
    
    output = BytesIO()
    audio_fast.export(output, format="mp3")
    return output.getvalue()
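
To make the effect of the frame-rate trick concrete, the sketch below works through the arithmetic. It assumes the usual relationship for playing the same samples at a higher rate: duration shrinks by the speed factor and pitch rises by `12 * log2(speed)` semitones. `speedup_effects` is a hypothetical helper for illustration only.

```python
import math

def speedup_effects(duration_seconds, speed):
    """Return (new_duration_seconds, pitch_shift_semitones) for a frame-rate speed-up."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    new_duration = duration_seconds / speed   # same samples, played faster
    pitch_shift = 12 * math.log2(speed)       # 12 semitones per doubling of speed
    return new_duration, pitch_shift

new_duration, pitch_shift = speedup_effects(11.0, 1.1)
print(f"{new_duration:.2f} s, +{pitch_shift:.2f} semitones")
```

At `speed=1.1`, an 11-second clip plays in about 10 seconds and sounds roughly 1.65 semitones higher.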

## 4. Assistant Persona

We define a strict system prompt: the assistant must be helpful but concise. This is crucial for voice interactions, where listening to long paragraphs of text is tedious for users.

In [None]:
SYSTEM_MESSAGE = """
You are a helpful assistant for an airline called FlightAI.
Provide brief and courteous responses, no more than one sentence.
Always be accurate. If you don't know the answer, say so.
"""

TTS_ENGINE = generate_audio

## 5. Tool Definitions (Function Calling)

This is the "brain" of the autonomous agent. We define a set of Python functions and then describe them in a JSON schema that the LLM can understand.

**The Tools:**
1.  `get_ticket_price`: A read-only tool that looks up a price in the mock price table.
2.  `make_reservation`: A write tool that appends a record to the in-memory reservation list.
3.  `translate_text`: A utility tool for language tasks (a placeholder in this notebook).
4.  `transcribe_audio`: A utility tool that converts voice input to text with Whisper.

In [None]:
# Mock Database
TICKET_PRICES = {
    "london": "$799",
    "paris": "$899",
    "tokyo": "$1400",
    "berlin": "$499",
    "new york": "$650",
    "barcelona": "$550",
    "miami": "$450"
}

RESERVATIONS_DB = []

In [None]:
def get_ticket_price(destination_city):
    """Retrieve ticket price for destination."""
    city_key = destination_city.lower()
    price = TICKET_PRICES.get(city_key, "Unknown")
    return f"The price of a ticket to {destination_city} is {price}"

def make_reservation(passenger_name, destination_city, travel_date):
    """Create flight reservation."""
    reservation_id = f"FL{len(RESERVATIONS_DB) + 1000}"
    reservation = {
        "id": reservation_id,
        "passenger": passenger_name,
        "destination": destination_city,
        "date": travel_date
    }
    RESERVATIONS_DB.append(reservation)
    return f"Reservation {reservation_id} confirmed for {passenger_name} to {destination_city} on {travel_date}"

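def translate_text(text, target_language):
    """Placeholder translation tool: echoes the text with the target language label."""
    # No translation is performed here; swap in a real translation backend as needed.
    return f"Translated to {target_language}: {text}"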

def transcribe_audio(audio_file_path):
    """Convert audio to text using Whisper."""
    try:
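        # Load the model once and cache it on the function; reloading per call is slow.
        if not hasattr(transcribe_audio, "_model"):
            transcribe_audio._model = whisper.load_model("base")
        result = transcribe_audio._model.transcribe(audio_file_path)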
        return result["text"]
    except Exception as e:
        return f"Transcription error: {str(e)}"
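
The `make_reservation` tool stores whatever `travel_date` string the model supplies. A minimal sketch of how the date could be checked first is shown here; `parse_travel_date` is a hypothetical helper, and rejecting past dates is an assumed business rule rather than something the notebook specifies.

```python
import re
from datetime import date, datetime

def parse_travel_date(travel_date, today=None):
    """Parse a strict YYYY-MM-DD string into a date, rejecting past dates."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", travel_date):
        raise ValueError(f"Expected YYYY-MM-DD, got {travel_date!r}")
    # strptime also rejects impossible dates such as 2025-02-30
    parsed = datetime.strptime(travel_date, "%Y-%m-%d").date()
    today = today or date.today()
    if parsed < today:
        raise ValueError(f"Travel date {parsed} is in the past")
    return parsed
```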

In [None]:
# JSON Schema for the LLM
TOOL_DEFINITIONS = [
    {
        "type": "function",
        "function": {
            "name": "get_ticket_price",
            "description": "Get the price of a return ticket to the destination city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "destination_city": {"type": "string", "description": "The city that the customer wants to travel to"}
                },
                "required": ["destination_city"],
                "additionalProperties": False
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "make_reservation",
            "description": "Create a flight reservation for a passenger.",
            "parameters": {
                "type": "object",
                "properties": {
                    "passenger_name": {"type": "string", "description": "The name of the passenger"},
                    "destination_city": {"type": "string", "description": "The destination city"},
                    "travel_date": {"type": "string", "description": "The travel date in YYYY-MM-DD format"}
                },
                "required": ["passenger_name", "destination_city", "travel_date"],
                "additionalProperties": False
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "translate_text",
            "description": "Translate text to a target language.",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {"type": "string", "description": "The text to translate"},
                    "target_language": {"type": "string", "description": "Target language code (es, fr, de)"}
                },
                "required": ["text", "target_language"],
                "additionalProperties": False
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "transcribe_audio",
            "description": "Convert audio to text.",
            "parameters": {
                "type": "object",
                "properties": {
                    "audio_file_path": {"type": "string", "description": "Path to the audio file"}
                },
                "required": ["audio_file_path"],
                "additionalProperties": False
            }
        }
    }
]

print(f"✅ Tools configured: {len(TOOL_DEFINITIONS)}")
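
The schema tells the model which arguments to send, but nothing in the notebook checks what actually arrives. The sketch below shows one way the `required` and `additionalProperties` fields could be enforced before a tool runs. `validate_tool_args` is a hypothetical helper, and it only covers the small subset of JSON Schema used above.

```python
def validate_tool_args(parameters_schema, arguments):
    """Return a list of problems found when checking arguments against a schema."""
    problems = []
    properties = parameters_schema.get("properties", {})
    # Every required argument must be present
    for name in parameters_schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    # Reject unknown arguments when the schema forbids them
    if parameters_schema.get("additionalProperties") is False:
        for name in arguments:
            if name not in properties:
                problems.append(f"unexpected argument: {name}")
    # Check the only type used in this notebook's schemas
    for name, value in arguments.items():
        if properties.get(name, {}).get("type") == "string" and not isinstance(value, str):
            problems.append(f"argument {name} should be a string")
    return problems

# Example schema shaped like the get_ticket_price parameters
price_schema = {
    "type": "object",
    "properties": {"destination_city": {"type": "string"}},
    "required": ["destination_city"],
    "additionalProperties": False,
}

print(validate_tool_args(price_schema, {"destination_city": "Paris"}))
print(validate_tool_args(price_schema, {"cabin": "business"}))
```

An empty list means the arguments are acceptable; otherwise the messages could be returned to the model as the tool result so it can correct the call.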

## 6. Generative Image Service

We integrate Pollinations.AI to generate images on the fly. This adds a visual layer to the interaction: when a user asks about Paris, they get a picture of the destination alongside the price.

In [None]:
def generate_image(prompt):
    """Generate image using FLUX model via Pollinations.AI."""
    try:
        url = POLLINATIONS_API_URL.format(prompt=requests.utils.quote(prompt))
        response = requests.get(url, timeout=30)
        if response.status_code == 200:
            image = Image.open(BytesIO(response.content))
            return image
        return None
    except Exception as e:
        print(f"Image error: {e}")
        return None

IMAGE_GENERATOR = generate_image
print("✅ Image generator ready")
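
The request URL is just the prompt percent-encoded into the path. The sketch below shows that step in isolation using the standard library, with no network call. `build_image_url` is a hypothetical helper, and passing `safe=""` (so that `/` is encoded too) is a choice made here, not something the notebook does.

```python
from urllib.parse import quote

POLLINATIONS_API_URL = "https://image.pollinations.ai/prompt/{prompt}"

def build_image_url(prompt):
    """Percent-encode the prompt and place it in the URL path."""
    # safe="" also encodes "/" so a prompt cannot add extra path segments
    return POLLINATIONS_API_URL.format(prompt=quote(prompt, safe=""))

print(build_image_url("Eiffel Tower at dusk"))
```

The call above yields `https://image.pollinations.ai/prompt/Eiffel%20Tower%20at%20dusk`.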

## 7. The Chat Logic (The Core Loop)

This is the most complex part of the system. The `chat` function handles the entire lifecycle of a request:

1.  **Receive Input**: Get text from the user (or transcribed audio).
2.  **LLM Inference**: Send the history to the LLM.
3.  **Tool Detection**: Check if the LLM wants to call a tool.
4.  **Tool Execution**: If yes, run the Python function and feed the result back to the LLM.
5.  **Response Generation**: Get the final text response.
6.  **Multimedia Enrichment**: 
    - Generate an image if the user's message names a city from the price table.
    - Generate audio from the response text.
7.  **Return**: Send everything back to the UI.
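
Steps 2 to 5 can be sketched without any network access by replacing the LLM with a stub. Everything here is illustrative: `fake_model`, `run_tool_loop`, and the dictionary-based reply format are hypothetical stand-ins for the real client and its response objects, and the `max_rounds` cap is an added safeguard that the notebook's loop does not have.

```python
import json

def get_ticket_price(destination_city):
    """Tiny stand-in for the notebook's price lookup."""
    prices = {"paris": "$899"}
    price = prices.get(destination_city.lower(), "Unknown")
    return f"The price of a ticket to {destination_city} is {price}"

TOOLS = {"get_ticket_price": get_ticket_price}

def fake_model(messages):
    """Stub LLM: requests one tool call, then answers using the tool result."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        call = {"id": "call_1", "name": "get_ticket_price",
                "arguments": json.dumps({"destination_city": "Paris"})}
        return {"content": None, "tool_calls": [call]}
    return {"content": f"Sure: {tool_results[-1]['content']}", "tool_calls": []}

def run_tool_loop(messages, model=fake_model, max_rounds=5):
    """Call the model until it stops requesting tools or max_rounds is reached."""
    for _ in range(max_rounds):
        reply = model(messages)
        if not reply["tool_calls"]:
            return reply["content"]
        # Record the assistant's tool request, then each tool's result
        messages.append({"role": "assistant", "content": reply["content"] or "",
                         "tool_calls": reply["tool_calls"]})
        for call in reply["tool_calls"]:
            result = TOOLS[call["name"]](**json.loads(call["arguments"]))
            messages.append({"role": "tool", "content": result,
                             "tool_call_id": call["id"]})
    return "Sorry, I could not complete that request."

answer = run_tool_loop([{"role": "user", "content": "How much is a ticket to Paris?"}])
print(answer)
```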

In [None]:
def execute_tool(tool_name, arguments):
    """Router to execute the correct tool function."""
    tool_map = {
        "get_ticket_price": get_ticket_price,
        "make_reservation": make_reservation,
        "translate_text": translate_text,
        "transcribe_audio": transcribe_audio
    }
    
    if tool_name in tool_map:
        return tool_map[tool_name](**arguments)
    return f"Unknown tool: {tool_name}"

In [None]:
def chat(history):
    """Main chat loop handling text, tools, images, and audio."""
    if not history:
        return history, None, None
    
    user_message = history[-1]["content"]
    
    # 1. Build message history for API
    messages = [{"role": "system", "content": SYSTEM_MESSAGE}]
    for msg in history[:-1]:
        messages.append({"role": msg["role"], "content": msg["content"]})
    messages.append({"role": "user", "content": user_message})
    
    image_output = None
    audio_output = None
    
    try:
        # 2. Initial LLM Call
        response = OLLAMA_CLIENT.chat.completions.create(
            model=OLLAMA_CHAT_MODEL,
            messages=messages,
            tools=TOOL_DEFINITIONS
        )
        
        # 3. Handle Tool Calls Loop
        while response.choices[0].finish_reason == "tool_calls":
            assistant_message = response.choices[0].message
            
            # Add assistant's intent to history
            messages.append({
                "role": "assistant",
                "content": assistant_message.content or "",
                "tool_calls": [
                    {
                        "id": tc.id,
                        "type": "function",
                        "function": {
                            "name": tc.function.name,
                            "arguments": tc.function.arguments
                        }
                    }
                    for tc in assistant_message.tool_calls
                ]
            })
            
            # Execute all requested tools
            for tool_call in assistant_message.tool_calls:
                tool_name = tool_call.function.name
                tool_args = json.loads(tool_call.function.arguments)
                tool_result = execute_tool(tool_name, tool_args)
                
                messages.append({
                    "role": "tool",
                    "content": tool_result,
                    "tool_call_id": tool_call.id
                })
            
            # Get follow-up response from LLM
            response = OLLAMA_CLIENT.chat.completions.create(
                model=OLLAMA_CHAT_MODEL,
                messages=messages,
                tools=TOOL_DEFINITIONS
            )
        
        # content can be None; fall back to an empty string so history stays valid
        response_text = response.choices[0].message.content or ""
        
    except Exception as e:
        response_text = f"I apologize, I encountered an error: {str(e)}"
    
    # 4. Generate Image (Context Aware)
    for city in TICKET_PRICES.keys():
        if city in user_message.lower():
            if IMAGE_GENERATOR:
                image_prompt = f"Beautiful vacation destination photo of {city}, high quality, professional photography"
                image_output = IMAGE_GENERATOR(image_prompt)
            break
    
    # 5. Generate Audio Response
    if TTS_ENGINE and response_text:
        try:
            audio_output = TTS_ENGINE(response_text, speed=1.1)
        except Exception as e:
            print(f"TTS error: {e}")
    
    # 6. Update History
    history.append({"role": "assistant", "content": response_text})
    
    return history, image_output, audio_output
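
The city check in step 4 uses plain substring matching, so a message containing "comparison" would match "paris". A minimal sketch of a stricter check using word boundaries follows; `detect_city` and `KNOWN_CITIES` are hypothetical names, with the list copied from the mock price table.

```python
import re

KNOWN_CITIES = ["london", "paris", "tokyo", "berlin", "new york", "barcelona", "miami"]

def detect_city(user_message, cities=KNOWN_CITIES):
    """Return the first known city that appears as a whole word or phrase, else None."""
    text = user_message.lower()
    for city in cities:
        # \b anchors the match to word boundaries on both sides
        if re.search(rf"\b{re.escape(city)}\b", text):
            return city
    return None

print(detect_city("How much to New York?"))
print(detect_city("Any comparison shopping tips?"))
```

The first call returns `new york`, while the second returns `None` because "paris" only appears inside another word.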

## 8. User Interface (Gradio)

We build a modern, responsive UI using Gradio Blocks. 
- **Chatbot**: Displays the conversation history.
- **Image**: Shows the generated destination photos.
- **Audio**: Plays the assistant's voice response.
- **Microphone**: Allows for voice input.

In [None]:
def add_message(message, history):
    """Helper to add user message to chat history immediately."""
    return "", history + [{"role": "user", "content": message}]

def process_audio(audio_file, history):
    """Helper to transcribe audio and add to chat history."""
    if audio_file is None:
        return history
    
    transcription = transcribe_audio(audio_file)
    return history + [{"role": "user", "content": transcription}]

# Build the UI
with gr.Blocks(title="FlightAI Assistant", theme=gr.themes.Soft()) as ui:
    gr.Markdown("# ✈️ FlightAI - Multimodal Airline Assistant")
    gr.Markdown("Experience the future of customer service. Ask about flights, make reservations, or just chat!")
    
    with gr.Row():
        chatbot = gr.Chatbot(
            height=500,
            type="messages",
            label="Conversation"
        )
        image_output = gr.Image(
            height=500,
            interactive=False,
            label="Visual Context"
        )
    
    with gr.Row():
        audio_output = gr.Audio(
            autoplay=True,
            label="Assistant Voice"
        )
    
    with gr.Row():
        message_input = gr.Textbox(
            label="Your Message",
            placeholder="Type here or use the microphone...",
            scale=3
        )
        audio_input = gr.Audio(
            sources=["microphone"],
            type="filepath",
            label="Voice Input",
            scale=1
        )
        submit_btn = gr.Button("Send Message", scale=1, variant="primary")
    
    # Wiring up the events
    # 1. Text Submission
    message_input.submit(
        add_message,
        inputs=[message_input, chatbot],
        outputs=[message_input, chatbot]
    ).then(
        chat,
        inputs=chatbot,
        outputs=[chatbot, image_output, audio_output]
    )
    
    # 2. Button Click
    submit_btn.click(
        add_message,
        inputs=[message_input, chatbot],
        outputs=[message_input, chatbot]
    ).then(
        chat,
        inputs=chatbot,
        outputs=[chatbot, image_output, audio_output]
    )
    
    # 3. Audio Recording Stop
    audio_input.stop_recording(
        process_audio,
        inputs=[audio_input, chatbot],
        outputs=chatbot
    ).then(
        chat,
        inputs=chatbot,
        outputs=[chatbot, image_output, audio_output]
    )

print("✅ UI constructed successfully")

## 9. Launch Application

We launch the server. Setting `share=True` creates a temporary public URL, allowing you to share this assistant with anyone in the world for testing.

In [None]:
ui.launch(share=True)

In [None]:
# Run this to shut down the server when finished
gr.close_all()

## Conclusion

This project demonstrates the power of **Agentic AI**. By combining a strong reasoning model (DeepSeek) with functional tools and multimedia capabilities, we created an assistant that can:
- **Show** (via generated destination images).
- **Hear** (via Whisper).
- **Speak** (via TTS).
- **Act** (via Function Calling).

This architecture serves as a blueprint for modern, enterprise-grade AI applications.