# 🎯 PitVQA Training + Video Demo Pipeline

**Complete Pipeline:** Train → Inference → Demo Video

This notebook:

1. Trains spatial fine-tuning (2-6 hours)
2. Runs inference on surgical frames
3. Creates annotated video with bounding boxes and labels
4. Generates comparison visualization

**Hardware:** T4 GPU (Free) or A100 (Pro)

## 1️⃣ Install Dependencies

In [None]:
!pip install -q transformers accelerate peft trl datasets bitsandbytes qwen-vl-utils pillow opencv-python imageio imageio-ffmpeg matplotlib
print("✅ Installed!")

In [None]:
from huggingface_hub import notebook_login
notebook_login()

## 2️⃣ Load Dataset & Model

In [None]:
from datasets import load_dataset
dataset = load_dataset("mmrech/pitvqa-comprehensive-spatial")
print(f"Train: {len(dataset['train'])}, Val: {len(dataset['validation'])}")

## 3️⃣ Training (Use Full Notebook)

**For complete training cells, use:** `train_spatial_qwen2vl_colab.ipynb`

Or continue with pre-trained model: `mmrech/pitvqa-qwen2vl-spatial`

---

# 🎬 DEMO PIPELINE

## 4️⃣ Load Trained Model

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
model = PeftModel.from_pretrained(model, "mmrech/pitvqa-qwen2vl-spatial")
processor = AutoProcessor.from_pretrained("mmrech/pitvqa-qwen2vl-spatial", trust_remote_code=True)
print("✅ Model loaded!")

## 5️⃣ Extract Frames for Demo

In [None]:
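# Take up to the first 30 validation frames (fewer if the split is smaller)
num_frames = min(30, len(dataset['validation']))
demo_frames = []
demo_metadata = []
for i in range(num_frames):
    sample = dataset['validation'][i]
    demo_frames.append(sample['image'])
    demo_metadata.append(sample)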
print(f"✅ Extracted {len(demo_frames)} frames")

## 6️⃣ Run Inference

In [None]:
import re
from tqdm import tqdm

def extract_points(text):
    # Parse <point x='..' y='..'>label</point> tags from the model response
    pattern = r"<point x='([\d.]+)' y='([\d.]+)'>([^<]+)</point>"
    return [{'x': float(m[0]), 'y': float(m[1]), 'label': m[2]}
            for m in re.findall(pattern, text)]

predictions = []
for frame, meta in tqdm(zip(demo_frames, demo_metadata), total=len(demo_frames)):
    question = "Point to all surgical instruments visible."
    conv = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]}]
    text = processor.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[frame], return_tensors="pt").to(model.device)
    
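    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=200)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = processor.decode(new_tokens, skip_special_tokens=True)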
    
    predictions.append({
        'image': frame,
        'points': extract_points(response),
        'response': response
    })

print(f"✅ Inference done! {sum(len(p['points']) for p in predictions)} detections")
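
The parsing step above can be checked without a GPU. The following is a minimal sketch that applies the same regular expression to a made-up response string; the instrument names and coordinates are invented for illustration and are not real model output.

```python
import re

# Same tag format the inference cell expects from the model
POINT_PATTERN = r"<point x='([\d.]+)' y='([\d.]+)'>([^<]+)</point>"

def extract_points(text):
    """Return one dict per <point> tag found in the response text."""
    return [{'x': float(x), 'y': float(y), 'label': label}
            for x, y, label in re.findall(POINT_PATTERN, text)]

# Hypothetical response string (not produced by the model)
sample_response = (
    "<point x='45.2' y='68.3'>suction device</point> "
    "<point x='30.0' y='52.5'>curette</point>"
)

points = extract_points(sample_response)
for p in points:
    print(p)

# A response without any tags yields an empty list
print(extract_points("No instruments visible."))
```

Note that the pattern only matches single-quoted attributes, so a response that uses double quotes or a different tag layout would produce zero detections rather than an error.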

## 7️⃣ Create Annotated Video

In [None]:
import cv2
import numpy as np
from PIL import Image

def draw_boxes(image, points):
    img = np.array(image.convert('RGB'))
    h, w = img.shape[:2]
    
    for p in points:
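        # Coordinates are treated as percentages (0-100) of image width/height
        x_px = int(p['x'] * w / 100)
        y_px = int(p['y'] * h / 100)
        
        # Draw a fixed-size box (80 px square) centred on the predicted point;
        # the model outputs points, so this is a visual marker, not a measured bounding box
        box_size = 40
        color = (0, 255, 0)  # Green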
        cv2.rectangle(img, 
                     (x_px-box_size, y_px-box_size),
                     (x_px+box_size, y_px+box_size),
                     color, 2)
        
        # Draw point
        cv2.circle(img, (x_px, y_px), 5, color, -1)
        
        # Add label, keeping the text origin inside the frame
        label_org = (max(0, x_px-box_size), max(15, y_px-box_size-10))
        cv2.putText(img, p['label'], label_org,
                   cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255,255,255), 1)
    
    return img

annotated = [draw_boxes(p['image'], p['points']) for p in tqdm(predictions)]
print(f"✅ Created {len(annotated)} annotated frames")
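
The percentage-to-pixel conversion used in `draw_boxes` can be sketched on its own. The helper below, `to_pixel`, is a hypothetical name that does not appear in the notebook, and the clamping step is an added assumption so that a prediction of exactly 100 still lands on a valid pixel index.

```python
def to_pixel(x_pct, y_pct, width, height):
    """Convert percentage coordinates (0-100) to integer pixel coordinates."""
    x_px = int(x_pct * width / 100)
    y_px = int(y_pct * height / 100)
    # Clamp to the valid index range (assumption: not done in the original cell)
    x_px = max(0, min(width - 1, x_px))
    y_px = max(0, min(height - 1, y_px))
    return x_px, y_px

print(to_pixel(50.0, 25.0, 1280, 720))    # (640, 180)
print(to_pixel(100.0, 100.0, 1280, 720))  # (1279, 719)
```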

## 8️⃣ Export Video

In [None]:
import imageio
output_video = "pitvqa_demo.mp4"
imageio.mimsave(output_video, annotated, fps=2, codec='libx264', quality=8)
print(f"✅ Video saved: {output_video}")
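
One detail worth knowing when exporting: imageio's FFmpeg writer is generally understood to resize frames whose width or height is not a multiple of its `macro_block_size` (16 by default) and to print a warning when it does. The sketch below only computes the rounded-up dimensions so you can check in advance whether your frames would be affected; `padded_size` is a hypothetical helper, and rounding up is an assumption about the writer's behaviour.

```python
def round_up_to_multiple(value, block=16):
    """Smallest multiple of `block` that is >= value."""
    return ((value + block - 1) // block) * block

def padded_size(width, height, block=16):
    """Frame size after rounding both dimensions up to a multiple of `block`."""
    return round_up_to_multiple(width, block), round_up_to_multiple(height, block)

print(padded_size(1280, 720))  # (1280, 720): already aligned
print(padded_size(854, 480))   # (864, 480): width would be adjusted
```

If the sizes differ, padding or resizing the frames yourself before calling `imageio.mimsave` keeps the output dimensions under your control.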

## 9️⃣ Create Side-by-Side Comparison

In [None]:
comparison = []
for i, pred in enumerate(predictions):
    orig = np.array(pred['image'].convert('RGB'))
    annot = annotated[i]
    
    h = min(orig.shape[0], annot.shape[0])
    orig_r = cv2.resize(orig, (int(orig.shape[1]*h/orig.shape[0]), h))
    annot_r = cv2.resize(annot, (int(annot.shape[1]*h/annot.shape[0]), h))
    
    side_by_side = np.hstack([orig_r, annot_r])
    cv2.putText(side_by_side, "Original", (20,40), cv2.FONT_HERSHEY_SIMPLEX, 1, (255,255,255), 2)
    cv2.putText(side_by_side, "With Detection", (orig_r.shape[1]+20,40), cv2.FONT_HERSHEY_SIMPLEX, 1, (255,255,255), 2)
    comparison.append(side_by_side)

imageio.mimsave("pitvqa_comparison.mp4", comparison, fps=2, codec='libx264', quality=8)
print("✅ Comparison video created!")

## 🎁 Download Results

In [None]:
from google.colab import files
!zip pitvqa_demo.zip pitvqa_demo.mp4 pitvqa_comparison.mp4
files.download('pitvqa_demo.zip')
print("✅ Downloading!")

## ✅ Complete!

**You now have:**

1. pitvqa_demo.mp4 - Annotated surgical video
2. pitvqa_comparison.mp4 - Side-by-side comparison
3. Trained model: mmrech/pitvqa-qwen2vl-spatial

**Use for:** Papers, presentations, demos