# ====================================================
# PaliGemma: The Engine Behind Our Universal Vision AI
# ====================================================
<!-- #
# The application we've built is powered by **PaliGemma**, a state-of-the-art
# open Vision Language Model (VLM) from Google. To understand why it's so
# revolutionary and the perfect choice for this project, let's break down
# what it is and what makes it special.
#
# ### What is PaliGemma? The Name is the Formula.
#
# The name itself tells you everything about its architecture:
#
# *   **Pali:** Comes from **PaLI**, Google's **Pa**thways **L**anguage and **I**mage
#     model, the line of vision-language models whose recipe PaliGemma follows:
#     an image encoder feeding a text decoder.
# *   **Gemma:** This is the family of lightweight, open-weight language models
#     from Google, built from the same research and technology used to create
#     the Gemini models.
#
# Essentially, **PaliGemma = A Powerful Vision Component + A Powerful Language Model**.
#
# ### How Does It Work? The Architecture of a Digital Eye and Brain
#
# Imagine you have two experts who speak different languages: an expert art critic
# who can only see (the **Vision Encoder**) and an expert linguist who can only
# read and write (the **Language Decoder**).
#
# ![PaliGemma Architecture Diagram](https://storage.googleapis.com/gweb-uniblog-publish-prod/images/image1_jH8vT1j.width-1300.png)
#
# 1.  **The Vision Encoder (The Eye):** This part of PaliGemma is a **SigLIP**
#     **Vision Transformer (ViT)**. It doesn't treat an image as one flat grid of
#     pixels. Instead, it cuts the image into a series of small square patches,
#     like a jigsaw puzzle, and uses attention to relate each patch to all the
#     others. The result is a sequence of embedding vectors, one per patch, that
#     summarizes what the image contains and where.
#
# 2.  **The Language Decoder (The Brain):** This is the **Gemma** language model.
#     It's a master of text, grammar, reasoning, and multiple languages. It can
#     write poetry, answer questions, and follow complex instructions, but it can't see.
#
# 3.  **The Projector (The Universal Translator):** This is the connecting piece.
#     The "projector" is a small linear layer that sits between the Eye and the
#     Brain. Its only job is to map the vision encoder's patch embeddings into
#     the same embedding space as Gemma's text tokens, so the language model can
#     read them as if they were words.
#
# When you give our app an image and a prompt like `"detect the cars"`, this is what happens:
# *   The **Vision Encoder** looks at the image and creates a rich, detailed digital summary.
# *   The **Projector** translates this visual summary into "words" that the language model can read.
# *   The **Language Model** receives a combined instruction: `[visual summary words] + "detect the cars"`.
# *   Because it now has both the visual context and your command, it can reason about
#     the image and generate the precise answer, including the special `<loc...>` tokens
#     that pinpoint exactly where the cars are.
#
# ---
#
# ### Why is PaliGemma So Amazing? Its Superpowers
#
# PaliGemma's unique architecture gives it several superpowers that made it the
# perfect choice for our application:
#
# #### 1. True Zero-Shot Versatility
# This is its most useful feature for us. "Zero-shot" here means we use the
# pretrained `mix` checkpoint as is, with **no fine-tuning of our own**. That
# checkpoint was trained on a mixture of tasks (captioning, question answering,
# OCR, detection), each selected by a short text prompt such as `caption en` or
# `detect car`, so a single model covers all of them.
#
# > **Our App's Benefit:** This is *why* our single interface works for everything.
# > We don't need separate models or modes for OCR, VQA, and detection. The backend
# > maps your natural language prompt (`"read the text"`, `"what color is the car?"`,
# > `"caption in hindi"`) to the matching task prompt, and PaliGemma does the rest.
#
# #### 2. Visual Grounding
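# PaliGemma doesn't just recognize that a "cow" is in an image; it can say
# *where* the cow is. When you ask it to "detect a cow," it outputs the label
# "a cow" preceded by four location tokens. Each token is a value from 0 to 1023
# giving, in order, `y_min`, `x_min`, `y_max`, `x_max` as a fraction (value / 1024)
# of the image height or width.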
#
# ```
# <loc0274><loc0000><loc1021><loc0700> a cow
# ```
#
# > **Our App's Benefit:** This is what allows our UI to parse those tokens and
# > draw the pink bounding boxes directly on the output image, providing a rich,
# > visual result that goes beyond simple text.
#
# #### 3. Innate Multilingualism
# PaliGemma was trained on a massive, web-scale dataset containing images and text
# from countless languages. This means it has a built-in understanding of languages
# beyond English.
#
# > **Our App's Benefit:** This is why you can ask for a caption in Hindi, Gujarati,
# > or Spanish. By providing a strong prompt (as we now do in our backend), we can
# > tap into this latent ability and get reliable multilingual output without
# > needing a separate translation service.
#
# #### 4. A Unified, Efficient Architecture
# Before models like PaliGemma, building our application would have required
# stitching together multiple different AI models: one for OCR, one for object
# detection, one for VQA, etc. This would be complex, slow, and expensive.
#
# > **Our App's Benefit:** PaliGemma provides all of this functionality in a single,
# > elegant model. This simplifies our code, reduces resource consumption, and allows
# > us to have one powerful, flexible tool that can handle any visual task you
# > throw at it.
#
# In summary, PaliGemma is not just another model; it's a leap forward in creating
# general-purpose AI that can see, read, and reason about the world in a way that
# is remarkably human-like. It's the perfect engine for the sleek, powerful, and
# versatile application we have built together. -->
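
The location tokens described under Visual Grounding can be turned into pixel coordinates with a few lines of Python. This is a minimal sketch, not the app's own code: `loc_tokens_to_pixel_box` and the 1024 x 1024 image size are hypothetical, while the token order (`y_min`, `x_min`, `y_max`, `x_max`) and the division by 1024 follow the app's `parse_detection_output`.

```python
import re
from typing import Tuple

LOC_BINS = 1024  # same divisor the app uses to normalize location values


def loc_tokens_to_pixel_box(
    tokens: str, image_width: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert four <locNNNN> tokens into a pixel box (x_min, y_min, x_max, y_max)."""
    values = [int(v) for v in re.findall(r"<loc(\d{4})>", tokens)]
    if len(values) != 4:
        raise ValueError(f"expected 4 location tokens, found {len(values)}")
    # The model emits y before x, so reorder while normalizing to [0, 1).
    y_min, x_min, y_max, x_max = (v / LOC_BINS for v in values)
    return (
        round(x_min * image_width),
        round(y_min * image_height),
        round(x_max * image_width),
        round(y_max * image_height),
    )


# The cow example from the text, on a hypothetical 1024 x 1024 image.
box = loc_tokens_to_pixel_box("<loc0274><loc0000><loc1021><loc0700> a cow", 1024, 1024)
print(box)
```

The app itself keeps the normalized values and lets the browser scale them to the displayed image, which avoids the server needing to know the rendered size.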

In [None]:
# ==============================================================================
# app.py: Self-Contained Zero-Shot Web App
#
# Description:
# This definitive version features a non-scrolling page layout, a scrollable
# output text area, and a significantly enhanced, more impactful particles.js
# background, all within a polished, futuristic user interface.
#
# Author: Gemini AI Architect
# Date: June 11, 2025
# ==============================================================================

# ==============================================================================
# 0. Dependencies (Run this in a separate cell in Kaggle)
#
!pip install "transformers>=4.41.2" "torch>=2.3.0" "accelerate>=0.30.1" \
             "Pillow>=10.3.0" "flask>=3.0.3" "python-dotenv>=1.0.1" \
             "pyngrok>=7.1.6" "requests>=2.32.3" "bitsandbytes>=0.43.1"
# ==============================================================================

import logging
import os
import re
import time
import base64
from io import BytesIO
from typing import Any, Dict, List, Tuple

import requests
import torch
from flask import Flask, jsonify, request
from PIL import Image
from pyngrok import ngrok
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# ==============================================================================
# 1. Environment and Secrets Loading (for Kaggle)
# ==============================================================================
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    HF_TOKEN = user_secrets.get_secret("HF_TOKEN")
    NGROK_AUTH_TOKEN = user_secrets.get_secret("NGROK_AUTH_TOKEN")
    os.environ['HF_TOKEN'] = HF_TOKEN
    os.environ['NGROK_AUTH_TOKEN'] = NGROK_AUTH_TOKEN
except ImportError:
    print("Kaggle secrets not found. Using .env for local development.")
    from dotenv import load_dotenv
    load_dotenv()
    HF_TOKEN = os.getenv("HF_TOKEN")
    NGROK_AUTH_TOKEN = os.getenv("NGROK_AUTH_TOKEN")

if not HF_TOKEN:
    raise ValueError("Hugging Face token not found. Set HF_TOKEN as a Kaggle secret or in your .env file.")
if not NGROK_AUTH_TOKEN:
    raise ValueError("Ngrok token not found. Set NGROK_AUTH_TOKEN as a Kaggle secret or in your .env file.")

# ==============================================================================
# 2. Logging Configuration
# ==============================================================================
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# ==============================================================================
# 3. Embedded HTML, CSS, and JavaScript (Enhanced Professional UI)
# ==============================================================================
HTML_TEMPLATE = """
<!DOCTYPE html>
<html lang="en" data-theme="dark">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>PaliGemma – A Vision Language Model</title>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Poppins:wght@400;600;700&display=swap" rel="stylesheet">
    <style>
        :root {
            --bg: #121212;
            --surface: #1f1f1f;
            --panel-border: #2a2a2a;
            --primary: #4361ee;
            --accent: #04d361;
            --highlight: #f72585;
            --text: #d9d9d9;
            --text-bright: #ffffff;
            --shadow: rgba(255, 255, 255, 0.05);
            --font-family: 'Poppins', sans-serif;
        }
        * { box-sizing: border-box; }
        html, body { scroll-behavior: smooth; }
        body {
            font-family: var(--font-family);
            background-color: var(--bg);
            color: var(--text);
            margin: 0;
            transition: background-color 0.3s ease, color 0.3s ease;
            overflow-x: hidden;
        }
        #particles-js {
            position: fixed; top: 0; left: 0; width: 100%; height: 100%;
            pointer-events: none; z-index: -1;
        }
        .app { display: flex; flex-direction: column; min-height: 100vh; }
        header { text-align: center; padding: 40px 20px; position: relative; }
        .main-title {
            font-size: clamp(2.2rem, 5vw, 3.5rem);
            font-weight: 700;
            color: var(--text-bright);
            position: relative;
            display: inline-block;
            padding-bottom: 10px;
        }
        .main-title::after {
            content: '';
            position: absolute;
            width: 100%;
            height: 4px;
            bottom: 0;
            left: 0;
            background: linear-gradient(90deg, var(--accent), var(--primary), var(--highlight), var(--accent));
            background-size: 200% 100%;
            animation: run-glow 4s linear infinite;
        }
        main { flex-grow: 1; padding: 0 20px 40px 20px; }
        .grid {
            display: grid; grid-template-columns: 1fr 1fr; gap: 30px;
            align-items: start; max-width: 1400px; margin: 0 auto;
        }
        @media (max-width: 960px) { .grid { grid-template-columns: 1fr; } .output-panel { margin-top: 30px; } }
        .input-panel, .output-panel {
            background: var(--surface);
            border: 1px solid var(--panel-border);
            border-radius: 16px;
            padding: 25px;
            transition: border-color .3s, box-shadow .3s;
            box-shadow: 0 10px 30px rgba(0,0,0,0.2);
        }
        .input-panel.glow, .output-panel.glow {
            animation: pulse-glow 1.5s ease-out;
        }
        @keyframes pulse-glow {
            0% { border-color: var(--panel-border); box-shadow: 0 10px 30px rgba(0,0,0,0.2); }
            50% { border-color: var(--primary); box-shadow: 0 0 25px var(--primary); }
            100% { border-color: var(--panel-border); box-shadow: 0 10px 30px rgba(0,0,0,0.2); }
        }
        .input-controls { display: flex; flex-direction: column; gap: 15px; margin-bottom: 15px; }
        .button {
            background: var(--bg);
            color: var(--text); border: 1px solid var(--shadow); padding: 12px;
            border-radius: 8px; font-size: 1rem; font-weight: 500; cursor: pointer;
            transition: all 0.3s ease;
        }
        .button:hover { border-color: var(--primary); color: var(--text-bright); transform: translateY(-2px); box-shadow: 0 4px 10px rgba(0,0,0,0.2); }
        .button:active { transform: translateY(-1px) scale(0.98); }
        #submit-btn {
            background: linear-gradient(90deg, var(--primary) 0%, var(--highlight) 100%);
            color: white; font-weight: 600;
            display: flex; align-items: center; justify-content: center; gap: 10px;
        }
        .url-input, textarea {
            background: var(--bg); border: 1px solid var(--shadow);
            color: var(--text); border-radius: 8px; padding: 12px;
            font-family: var(--font-family); font-size: 1rem; width: 100%;
            transition: all 0.3s ease;
        }
        .url-input:focus, textarea:focus {
            border-color: var(--primary);
            box-shadow: 0 0 12px rgba(67, 97, 238, 0.5);
            outline: none;
        }
        textarea { min-height: 120px; resize: vertical; }
        .image-box {
            background: var(--bg); border: 1px solid var(--shadow);
            height: 350px; /* Fixed height */
            display: flex; align-items: center; justify-content: center;
            border-radius: 8px; position: relative; overflow: hidden;
            transition: min-height 0.3s ease;
        }
        .image-box img {
            max-width: 100%; max-height: 100%; object-fit: contain;
            animation: pop-in 0.5s cubic-bezier(0.25, 0.8, 0.25, 1) forwards;
        }
        @keyframes pop-in {
            from { transform: scale(0.95); opacity: 0; }
            to { transform: scale(1); opacity: 1; }
        }
        .output-text {
            background: var(--bg); border-left: 4px solid var(--highlight);
            padding: 16px; color: var(--text-bright);
            margin-top: 20px; font-family: monospace; font-size: 18px; line-height: 1.6;
            border-radius: 0 8px 8px 0; white-space: pre-wrap; word-wrap: break-word;
            animation: fade-slide-in 0.5s ease-out forwards;
            /* Cap the height so long outputs scroll inside the box, not the page */
            max-height: 400px;
            overflow-y: auto;
        }
        @keyframes fade-slide-in {
            from { opacity: 0; transform: translateY(10px); }
            to { opacity: 1; transform: translateY(0); }
        }
        footer { text-align: center; padding: 40px 20px; }
        .footer-credit {
            font-size: 32px; color: var(--text);
            text-decoration: none; position: relative; display: inline-block;
        }
        .footer-credit::after {
            content: ''; position: absolute; width: 100%; height: 3px;
            bottom: -5px; left: 0;
            background: linear-gradient(90deg, var(--accent), var(--primary), var(--highlight), var(--accent));
            background-size: 200% 100%;
            animation: run-glow 4s linear infinite;
        }
        @keyframes run-glow { to { background-position: -200% 0; } }
        .placeholder-text { color: var(--text); opacity: 0.5; }
        .bounding-box { position: absolute; box-sizing: border-box; border: 2px solid var(--highlight); background-color: rgba(247, 37, 133, 0.2); }
        .box-label { position: absolute; top: -22px; left: -2px; background-color: var(--highlight); color: white; padding: 2px 6px; font-size: 12px; font-weight: 600; border-radius: 4px; }
        .spinner {
            width: 20px; height: 20px; border-radius: 50%;
            border: 3px solid rgba(255,255,255,0.2);
            border-top-color: #ffffff;
            animation: spin 1s linear infinite;
        }
        @keyframes spin { to { transform: rotate(360deg); } }
    </style>
</head>
<body>
    <div id="particles-js"></div>
    <div class="app">
        <header>
            <h1 class="main-title">PaliGemma – A Vision Language Model</h1>
        </header>
        <main class="grid">
            <section class="input-panel" id="input-panel">
                <div class="image-box" id="input-image-container">
                    <p class="placeholder-text">Your image will appear here.</p>
                    <img id="input-image-preview" src="" alt="Input Preview" style="display:none;">
                </div>
                <div class="input-controls">
                    <button id="upload-btn" class="button">Upload from computer</button>
                    <input type="file" id="file-upload" hidden>
                </div>
                <input type="text" id="image-url" class="url-input" placeholder="Paste image URL…">
                <textarea id="task-prompt" placeholder="Type your AI query or command…"></textarea>
                <button id="submit-btn" class="button">
                    <span id="btn-text">Analyze Image</span>
                    <div class="spinner" id="spinner" style="display: none;"></div>
                </button>
            </section>
            <section class="output-panel" id="output-panel">
                <div class="image-box" id="output-image-container">
                    <p class="placeholder-text">Your processed image will appear here.</p>
                    <img id="output-image" src="" alt="Result Image" style="display:none;">
                </div>
                <pre class="output-text" id="output-text-container" style="display:none;"></pre>
                <div id="error-display" style="color: var(--highlight); display:none; margin-top: 15px;"></div>
            </section>
        </main>
        <footer>
            <a href="#" class="footer-credit">Made by Hit Kalariya</a>
        </footer>
    </div>
    
    <script src="https://cdn.jsdelivr.net/npm/particles.js@2.0.0/particles.min.js"></script>
    <script>
    document.addEventListener('DOMContentLoaded', () => {
        // --- Particles.js Initialization (Enhanced Version) ---
        particlesJS('particles-js', {
            "particles": {
                "number": {"value": 80, "density": {"enable": true, "value_area": 800}},
                "color": {"value": "#4361ee"},
                "shape": {"type": "circle"},
                "opacity": {"value": 0.5, "random": true, "anim": {"enable": true, "speed": 1, "opacity_min": 0.1, "sync": false}},
                "size": {"value": 6, "random": true, "anim": {"enable": true, "speed": 3, "size_min": 1, "sync": false}},
                "line_linked": {"enable": true, "distance": 150, "color": "#f72585", "opacity": 0.3, "width": 3},
                "move": {"enable": true, "speed": 0.5, "direction": "none", "random": true, "straight": false, "out_mode": "out"}
            },
            "interactivity": {
                "detect_on": "canvas",
                "events": {"onhover": {"enable": true, "mode": "grab"}, "resize": true},
                "modes": {"grab": {"distance": 200, "line_linked": {"opacity": 0.5}}}
            },
            "retina_detect": true
        });

        // --- DOM Element Selection ---
        const uploadBtn = document.getElementById('upload-btn');
        const fileUpload = document.getElementById('file-upload');
        const imageUrlInput = document.getElementById('image-url');
        const taskPrompt = document.getElementById('task-prompt');
        const submitBtn = document.getElementById('submit-btn');
        const btnText = document.getElementById('btn-text');
        const spinner = document.getElementById('spinner');
        const inputImageContainer = document.getElementById('input-image-container');
        const inputImagePreview = document.getElementById('input-image-preview');
        const outputImageContainer = document.getElementById('output-image-container');
        const outputImage = document.getElementById('output-image');
        const outputTextContainer = document.getElementById('output-text-container');
        const errorDisplay = document.getElementById('error-display');
        const inputPanel = document.getElementById('input-panel');
        const outputPanel = document.getElementById('output-panel');
        let currentImageSource = null;
        let isInferring = false;

        // --- Bounding Box & State Management ---
        function clearBoundingBoxes() {
            outputImageContainer.querySelectorAll('.bounding-box').forEach(box => box.remove());
        }
        function drawBoundingBoxes(detections) {
            clearBoundingBoxes();
            if (!detections || !Array.isArray(detections)) return;
            const containerWidth = outputImageContainer.offsetWidth;
            const containerHeight = outputImageContainer.offsetHeight;
            const imageNaturalWidth = outputImage.naturalWidth;
            const imageNaturalHeight = outputImage.naturalHeight;
            if (imageNaturalWidth === 0 || imageNaturalHeight === 0) return;
            const imageAspectRatio = imageNaturalWidth / imageNaturalHeight;
            const containerAspectRatio = containerWidth / containerHeight;
            let scaledWidth, scaledHeight, offsetX, offsetY;
            if (imageAspectRatio > containerAspectRatio) {
                scaledWidth = containerWidth;
                scaledHeight = scaledWidth / imageAspectRatio;
                offsetX = 0;
                offsetY = (containerHeight - scaledHeight) / 2;
            } else {
                scaledHeight = containerHeight;
                scaledWidth = scaledHeight * imageAspectRatio;
                offsetY = 0;
                offsetX = (containerWidth - scaledWidth) / 2;
            }
            detections.forEach(detection => {
                const { box, label } = detection;
                const [x_min_pct, y_min_pct, x_max_pct, y_max_pct] = box;
                const boxDiv = document.createElement('div');
                boxDiv.className = 'bounding-box';
                boxDiv.style.left = `${(x_min_pct * scaledWidth) + offsetX}px`;
                boxDiv.style.top = `${(y_min_pct * scaledHeight) + offsetY}px`;
                boxDiv.style.width = `${(x_max_pct - x_min_pct) * scaledWidth}px`;
                boxDiv.style.height = `${(y_max_pct - y_min_pct) * scaledHeight}px`;
                const labelSpan = document.createElement('span');
                labelSpan.className = 'box-label';
                labelSpan.innerText = label;
                boxDiv.appendChild(labelSpan);
                outputImageContainer.appendChild(boxDiv);
            });
        }
        
        function triggerGlow(element) {
            element.classList.remove('glow');
            void element.offsetWidth;
            element.classList.add('glow');
        }

        function handleImageInput(src) {
            currentImageSource = src;
            inputImagePreview.src = src;
            inputImagePreview.style.display = 'block';
            inputImageContainer.querySelector('.placeholder-text').style.display = 'none';
            triggerGlow(inputPanel);
        }

        // --- Event Handlers ---
        uploadBtn.addEventListener('click', () => fileUpload.click());
        fileUpload.addEventListener('change', (e) => {
            const file = e.target.files[0];
            if (file) {
                const reader = new FileReader();
                reader.onload = (event) => {
                    handleImageInput(event.target.result);
                    imageUrlInput.value = '';
                };
                reader.readAsDataURL(file);
            }
        });
        imageUrlInput.addEventListener('input', () => {
            if (imageUrlInput.value) {
                handleImageInput(imageUrlInput.value);
                fileUpload.value = null;
            }
        });
        
        // --- Main Inference Function ---
        async function triggerInference() {
            if (!currentImageSource || !taskPrompt.value || isInferring) {
                if (!currentImageSource) alert('Please provide an image first.');
                if (!taskPrompt.value) alert('Please enter a prompt.');
                return;
            }
            
            isInferring = true;
            btnText.style.display = 'none';
            spinner.style.display = 'block';
            submitBtn.disabled = true;

            clearBoundingBoxes();
            errorDisplay.style.display = 'none';
            
            outputImage.src = currentImageSource;
            outputImage.style.display = 'block';
            outputImageContainer.querySelector('.placeholder-text').style.display = 'none';
            outputTextContainer.style.display = 'none';
            
            try {
                const response = await fetch('/infer', {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({ image_source: currentImageSource, task: taskPrompt.value }),
                });
                const data = await response.json();
                if (!response.ok) { throw new Error(data.error || 'An unknown error occurred.'); }
                
                outputTextContainer.textContent = data.result;
                outputTextContainer.style.display = 'block';
                triggerGlow(outputPanel);

                if (data.detections && data.detections.length > 0) {
                    outputImage.onload = () => drawBoundingBoxes(data.detections);
                    if (outputImage.complete) drawBoundingBoxes(data.detections);
                }
            } catch (err) {
                errorDisplay.textContent = `Error: ${err.message}`;
                errorDisplay.style.display = 'block';
                outputImage.style.display = 'none';
                outputImageContainer.querySelector('.placeholder-text').style.display = 'block';
            } finally {
                isInferring = false;
                btnText.style.display = 'inline';
                spinner.style.display = 'none';
                submitBtn.disabled = false;
            }
        }

        submitBtn.addEventListener('click', triggerInference);
        taskPrompt.addEventListener('keydown', (e) => {
            if (e.key === 'Enter' && !e.shiftKey) {
                e.preventDefault();
                triggerInference();
            }
        });
    });
    </script>
</body>
</html>
"""

# ==============================================================================
# 4. Model and Processor Initialization
# ==============================================================================
MODEL_ID = "google/paligemma-3b-mix-448"
model = None
processor = None

@torch.no_grad()
def initialize_model_and_processor():
    """Loads the model and processor into the module-level globals (no-op if already loaded)."""
    global model, processor
    if model is not None and processor is not None:
        return
    logging.info(f"Initializing model: {MODEL_ID}...")
    try:
        # `token` replaces the deprecated `use_auth_token` argument.
        model = PaliGemmaForConditionalGeneration.from_pretrained(
            MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", token=HF_TOKEN
        )
        processor = AutoProcessor.from_pretrained(MODEL_ID, token=HF_TOKEN)
        logging.info("Model and processor initialized successfully.")
    except Exception as e:
        logging.error(f"Failed to initialize model/processor: {e}")
        raise

initialize_model_and_processor()

# ==============================================================================
# 5. Tunnel and Deployment Setup
# ==============================================================================
try:
    ngrok.set_auth_token(NGROK_AUTH_TOKEN)
    public_url = ngrok.connect(5000).public_url
except Exception as e:
    logging.error(f"Failed to start ngrok tunnel: {e}")
    raise

print("=" * 80)
print("✅ Your Futuristic Vision AI Interface is LIVE!".center(80))
print(f"   Public URL: {public_url}".center(80))
print("=" * 80)

# ==============================================================================
# 6. Flask Application with Bounding Box Parsing & Base64 Support
# ==============================================================================
app = Flask(__name__)

def parse_detection_output(text: str) -> List[Dict[str, Any]]:
    """Parses the model's text output to extract bounding boxes and labels."""
    detections = []
    # Four <locNNNN> tokens followed by the label they belong to.
    pattern = re.compile(r"(<loc\d{4}><loc\d{4}><loc\d{4}><loc\d{4}>)\s*([\w\s]+)")
    
    # Multiple detections are separated by ';' in the model output.
    for part in text.split(';'):
        match = pattern.search(part.strip())
        if not match: continue
        loc_tokens, label = match.groups()
        coords = [int(c) for c in re.findall(r'(\d{4})', loc_tokens)]
        if len(coords) != 4: continue
        # Tokens arrive as y_min, x_min, y_max, x_max; divide by 1024 to normalize.
        y_min, x_min, y_max, x_max = [c / 1024.0 for c in coords]
        detections.append({"box": [x_min, y_min, x_max, y_max], "label": label.strip()})
    return detections

def resolve_task(user_prompt: str) -> Tuple[str, Dict[str, Any]]:
    """Maps a free-form user prompt to a structured pipeline task."""
    prompt_lower = user_prompt.lower()
    caption_match = re.search(r"(?:caption|describe)(?: this)?(?: in ([\w-]+))?", prompt_lower)
    if caption_match:
        language = caption_match.group(1) if caption_match.group(1) else "en"
        return f"caption {language}", {}
    detect_match = re.search(r"detect (.+)", prompt_lower)
    if detect_match:
        labels_str = detect_match.group(1)
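        # Split on commas, semicolons, "and", "or"; drop empty fragments.
        labels = [label.strip() for label in re.split(r",|;|\band\b|\bor\b", labels_str) if label.strip()]
        return f"detect {'; '.join(labels)}", {"is_detection": True}
    # Match whole words only, so that e.g. "bread" or "context" does not trigger OCR.
    if re.search(r"\b(?:read|ocr|text)\b", prompt_lower):
        return "ocr", {}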
    return user_prompt, {"question": user_prompt}

@app.route("/", methods=["GET"])
def home():
    """Serves the main HTML user interface."""
    return HTML_TEMPLATE

@app.route("/infer", methods=["POST"])
def infer():
    """The main inference endpoint supporting URLs and Base64 uploads."""
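    # silent=True returns None instead of raising on a missing or malformed JSON body.
    data = request.get_json(silent=True) or {}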
    image_source = data.get("image_source")
    user_task = data.get("task")

    if not image_source or not user_task:
        return jsonify({"error": "Missing image or task"}), 400

    try:
        if image_source.startswith("http"):
            response = requests.get(image_source, timeout=20)
            response.raise_for_status()
            image = Image.open(BytesIO(response.content)).convert("RGB")
        elif image_source.startswith("data:image"):
            header, encoded = image_source.split(",", 1)
            decoded_bytes = base64.b64decode(encoded)
            image = Image.open(BytesIO(decoded_bytes)).convert("RGB")
        else:
            return jsonify({"error": "Invalid image source format"}), 400
    except Exception as e:
        return jsonify({"error": f"Failed to load image: {str(e)}"}), 400

    resolved_task, task_info = resolve_task(user_task)

    try:
        inputs = processor(text=resolved_task, images=image, return_tensors="pt").to(model.device)
        input_len = inputs["input_ids"].shape[-1]
        output = model.generate(**inputs, max_new_tokens=100)
        # Decode only the newly generated tokens rather than string-splitting on the echoed prompt.
        result_text = processor.decode(output[0][input_len:], skip_special_tokens=True)
        clean_result = result_text.strip()

        response_data = {
            "result": clean_result if clean_result else "Model returned an empty response."
        }
        
        if task_info.get("is_detection"):
            response_data["detections"] = parse_detection_output(clean_result)

        return jsonify(response_data), 200
    except Exception as e:
        logging.error(f"Direct model inference failed for task '{user_task}': {e}")
        return jsonify({"error": f"Model inference failed: {str(e)}"}), 500

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=5000)
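
The `/infer` endpoint above can also be called without the browser UI. The sketch below builds the same JSON body the page's JavaScript sends (`image_source` and `task`), using a Base64 data URL for a local image. The helper names `to_data_url` and `build_infer_payload`, the placeholder bytes, and the `localhost:5000` address in the commented-out request are assumptions for illustration.

```python
import base64
from typing import Dict


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL, the form the endpoint accepts for uploads."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_infer_payload(image_source: str, task: str) -> Dict[str, str]:
    """Build the JSON body for POST /infer (an http(s) URL or a data URL, plus a prompt)."""
    return {"image_source": image_source, "task": task}


# Placeholder bytes stand in for a real image file read with open(path, "rb").
payload = build_infer_payload(to_data_url(b"fake-bytes", "image/png"), "detect car")
print(payload["image_source"][:22])

# To send it (assumes the app is running locally and `requests` is installed):
# import requests
# reply = requests.post("http://localhost:5000/infer", json=payload, timeout=60)
# print(reply.json())
```

For detection prompts the reply should carry a `detections` list alongside `result`, each entry holding a normalized `box` and a `label`.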

Collecting python-dotenv>=1.0.1
  Using cached python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Collecting pyngrok>=7.1.6
  Using cached pyngrok-7.2.11-py3-none-any.whl.metadata (9.4 kB)
Collecting bitsandbytes>=0.43.1
  Using cached bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.3.0)
  Using cached nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.3.0)
  Using cached nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=2.3.0)
  Using cached nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch>=2.3.0)
  Using cached nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.6.1.9 (from torch>=2.3.0)
  Using

2025-06-24 01:54:12.143969: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750730052.533283      82 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750730052.632753      82 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered



Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



                 ✅ Your Futuristic Vision AI Interface is LIVE!                 
              Public URL: https://b2f7-34-123-105-253.ngrok-free.app            
 * Serving Flask app '__main__'
 * Debug mode: off


You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images each text has and add special tokens.
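
The warning above recommends placing `<image>` tokens at the very beginning of the text. A minimal sketch of that idea follows; `with_image_tokens` is a hypothetical helper, and whether every `transformers` release allowed by the install line accepts a prompt that already contains `<image>` is an assumption that should be checked against the installed version before using it.

```python
IMAGE_TOKEN = "<image>"  # token name taken from the warning text


def with_image_tokens(prompt: str, num_images: int = 1) -> str:
    """Prefix the prompt with one <image> token per image, unless already present."""
    if prompt.startswith(IMAGE_TOKEN):
        return prompt
    return IMAGE_TOKEN * num_images + prompt


print(with_image_tokens("detect car"))
```

The returned string would be passed as the `text` argument of the processor call in place of the bare task prompt.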
