## System Check & Setup

Run this cell first to check your Colab environment and system capabilities.

# AIC 2024/2025 Retrieval – Automated Google Colab Pipeline

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nqvu-daniel/AIC_FTML_dev/blob/main/notebooks/colab_pipeline.ipynb)

**Quick Start:**
1. **Enable GPU**: Runtime → Change runtime type → Hardware accelerator: T4/L4/A100 (recommended)
2. **Run Setup**: Execute the "Setup" cell below to clone the repo and install dependencies automatically
3. **Choose Your Path**:
   - Host Inference (recommended): Use pre-built artifacts to run queries instantly
   - Development Pipeline: Build your own artifacts from scratch (requires dataset)

**File Downloads**: Results are saved to `/content/AIC_FTML_dev/submissions/`; you can download them from Colab's file browser.
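
If there are many result files, it can be easier to download them as one archive. The snippet below is a minimal sketch, not part of the repository: the helper name `zip_submissions` and the archive name `submissions_export` are assumptions, and it expects to be run from the repository root where `submissions/` lives.

```python
import pathlib
import shutil

def zip_submissions(submissions_dir="submissions", archive_stem="submissions_export"):
    """Zip the submissions folder; return the archive path, or None if no CSVs exist."""
    src = pathlib.Path(submissions_dir)
    if not src.is_dir() or not any(src.glob("*.csv")):
        print(f"No CSV files found in {src}/")
        return None
    # shutil.make_archive appends ".zip" to the stem and returns the archive path
    archive = shutil.make_archive(str(archive_stem), "zip", root_dir=str(src))
    print(f"Wrote {archive}")
    return archive

archive_path = zip_submissions()

# Optional: trigger a browser download when running inside Colab
if archive_path:
    try:
        from google.colab import files
        files.download(archive_path)
    except ImportError:
        print("Not running in Colab; fetch the archive from the file browser instead.")
```

The archive is written next to `submissions/` rather than inside it, so re-running the helper does not pack an older archive into the new one.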

---

## Two Usage Modes

### 1. Host Inference (Recommended - Fast)
- No dataset required
- Uses pre-built artifacts and models
- Ready in ~2 minutes
- Perfect for running queries and getting CSV results

### 2. Development Pipeline (Advanced - Slow)
- Downloads the full dataset (multiple GB)
- Builds search index from scratch + custom reranker models
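
Because the development path downloads several gigabytes, it may be worth checking free disk space before starting. This is a small sketch using only the standard library; the `REQUIRED_GB` threshold is a hypothetical value, not a figure from the dataset documentation.

```python
import shutil

def free_gb(path="/"):
    """Return free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

REQUIRED_GB = 20  # hypothetical threshold; adjust to the archives you plan to download

available = free_gb()
if available < REQUIRED_GB:
    print(f"Only {available:.1f} GB free; the dataset download may not fit.")
else:
    print(f"{available:.1f} GB free; enough for the assumed {REQUIRED_GB} GB.")
```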

---

In [None]:
!nvidia-smi || true
!python --version
import sys, os, pathlib
print('CWD:', os.getcwd())

Mon Aug 25 23:40:33 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   46C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
# Setup: Clone repo and install dependencies automatically (GPU-ready)
import os
import pathlib
import subprocess
import sys

REPO_URL = 'https://github.com/nqvu-daniel/AIC_FTML_dev.git'
REPO_NAME = 'AIC_FTML_dev'

def setup_repository():
    """Automatically clone repository and setup environment"""
    try:
        # Check if repo already exists (or if this cell was re-run from inside it)
        if pathlib.Path.cwd().name == REPO_NAME:
            print(f"Already inside repository '{REPO_NAME}'")
        elif pathlib.Path(REPO_NAME).exists():
            print(f"Repository '{REPO_NAME}' already exists")
            os.chdir(REPO_NAME)
        else:
            print(f"Cloning repository from {REPO_URL}")
            subprocess.run(['git', 'clone', REPO_URL], check=True)
            os.chdir(REPO_NAME)
            print("Repository cloned successfully")

        # Install dependencies
        print("Installing dependencies...")
        subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', '-r', 'requirements.txt'], check=True)

        # Install FAISS based on CUDA availability
        try:
            import torch
            if torch.cuda.is_available():
                print("GPU detected, installing faiss-gpu-cu12...")
                subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'faiss-gpu-cu12'], check=True)
                print('Installed faiss-gpu-cu12 (CUDA 12 compatible)')
            else:
                print("No GPU detected, installing faiss-cpu...")
                subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'faiss-cpu'], check=True)
                print('Installed faiss-cpu')
        except Exception as e:
            print(f'FAISS install error: {e}')
            # Fallback to CPU version
            subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'faiss-cpu'], check=True)
            print('Fallback: Installed faiss-cpu')

        # Add to Python path
        if '.' not in sys.path:
            sys.path.append('.')

        print("Setup complete! Ready to run AIC FTML pipeline")
        print(f"Current directory: {os.getcwd()}")

        return True

    except subprocess.CalledProcessError as e:
        print(f"Error during setup: {e}")
        return False
    except Exception as e:
        print(f"Unexpected error: {e}")
        return False

# Run setup
if setup_repository():
    print("\nYou can now proceed with the pipeline!")
else:
    print("\nSetup failed. Please check the errors above.")

Cloning repository from https://github.com/nqvu-daniel/AIC_FTML_dev.git
Repository cloned successfully
Installing dependencies...
GPU detected, installing faiss-gpu-cu12...
Installed faiss-gpu-cu12 (CUDA 12 compatible)
Setup complete! Ready to run AIC FTML pipeline
Current directory: /content/AIC_FTML_dev

You can now proceed with the pipeline!


## Host Inference – One-shot
Set `ARTIFACTS_BUNDLE_URL` and/or `RERANKER_MODEL_URL` if the corresponding files are not already present in `./artifacts`.
This writes a Top-100 CSV into `submissions/`.
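
To confirm that a result file respects the Top-100 limit, a quick check like the following can help. It is a sketch under assumptions: the helper name `summarize_submission` is made up here, and the CSV is assumed to have no header row (the notebook does not state the exact column layout).

```python
import csv
import pathlib

def summarize_submission(csv_path, max_rows=100):
    """Count non-empty rows and distinct column counts in one submission CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = [row for row in csv.reader(f) if row]
    return {
        "rows": len(rows),
        "columns": sorted({len(row) for row in rows}),
        "within_limit": len(rows) <= max_rows,
    }

# Summarize every CSV currently in submissions/ (prints nothing if the folder is empty)
for path in sorted(pathlib.Path("submissions").glob("*.csv")):
    print(path.name, summarize_submission(path))
```

More than one value under `columns` means the rows do not all have the same number of fields, which is usually worth investigating before submitting.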

In [None]:
# Host Inference - Automated Setup and Query Execution
import os
import subprocess
import pathlib

# Configuration - Update these URLs with your hosted models
QUERY = 'a person opening a laptop'  # Change this to your search query
QUERY_ID = 'q1'  # Official query id for filename submissions/{query_id}.csv
TASK = 'kis'     # 'kis' or 'vqa'
ANSWER = ''      # Required if TASK='vqa'
ARTIFACTS_BUNDLE_URL = ''  # e.g., 'https://your-host.com/artifacts_bundle.tar.gz'
RERANKER_MODEL_URL = ''    # e.g., 'https://your-host.com/reranker.joblib'

def run_inference_query(query, bundle_url='', model_url='', query_id='', task='kis', answer=''):
    """Run inference with automatic artifact download if needed"""
    try:
        # Ensure we're in the right directory
        if not pathlib.Path('src/retrieval/use.py').exists():
            print('Missing use.py script. Make sure setup completed successfully.')
            return False

        # Build command
        cmd = ['python', 'src/retrieval/use.py', '--query', query, '--task', task]
        if query_id:
            cmd.extend(['--query_id', query_id])
        if task == 'vqa':
            if not answer:
                print('For TASK=vqa you must set ANSWER.')
                return False
            cmd.extend(['--answer', answer])

        if bundle_url:
            cmd.extend(['--bundle_url', bundle_url])
            print(f'Will download artifacts bundle from: {bundle_url}')

        if model_url:
            cmd.extend(['--model_url', model_url])
            print(f'Will download reranker model from: {model_url}')

        # Create submissions directory if it doesn't exist
        os.makedirs('submissions', exist_ok=True)

        print(f"Running query: '{query}' (task={task}, qid={query_id})")
        print('Command:', ' '.join(cmd))

        # Execute the command
        result = subprocess.run(cmd, capture_output=True, text=True)

        if result.returncode == 0:
            print('Query execution successful!')
            if result.stdout:
                print('Output:\n' + result.stdout)

            # List generated files
            submissions_dir = pathlib.Path('submissions')
            if submissions_dir.exists():
                csv_files = list(submissions_dir.glob('*.csv'))
                if csv_files:
                    print('\nGenerated ' + str(len(csv_files)) + ' result file(s):')
                    for csv_file in csv_files:
                        print(f'  - {csv_file}')
                        # Show first few lines of the CSV
                        try:
                            with open(csv_file, 'r') as f:
                                lines = f.readlines()[:5]
                                print('    Preview (first 5 lines):')
                                for i, line in enumerate(lines, 1):
                                    print(f'    {i}: {line.strip()}')
                        except Exception as e:
                            print(f'    (Could not preview: {e})')
            return True
        else:
            print('Query execution failed!')
            print('Error output:\n' + result.stderr)
            return False

    except Exception as e:
        print(f'Error running inference: {e}')
        return False

# Run the inference
print('Starting AIC FTML Host Inference...')
success = run_inference_query(QUERY, ARTIFACTS_BUNDLE_URL, RERANKER_MODEL_URL, QUERY_ID, TASK, ANSWER)

if success:
    print('\nInference completed! Check the submissions/ folder for results.')
else:
    print('\nInference failed. Check the error messages above.')

## Dev Pipeline – Build Artifacts (Optional)
Downloads the dataset archives listed in `AIC_2025_dataset_download_link.csv`, builds the search index and text corpus, optionally trains a reranker, and assembles `my_pipeline/`.
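
Each build step in this section launches a repository script as a subprocess and streams its output line by line. The pattern can be captured in one helper, sketched here. The name `run_streaming` is an assumption and is not defined in the repository; the child command is a trivial stand-in.

```python
import subprocess
import sys

def run_streaming(cmd):
    """Run `cmd`, echo its combined stdout/stderr line by line, return (exit_code, lines)."""
    lines = []
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, bufsize=1) as process:
        for line in process.stdout:
            line = line.rstrip("\n")
            lines.append(line)
            print(line, flush=True)
    # Leaving the `with` block waits for the child, so returncode is set here
    return process.returncode, lines

# Stand-in command; a pipeline step would pass its own script and arguments
code, out = run_streaming([sys.executable, "-c", "print('hello from child')"])
print("exit code:", code)
```

Iterating over `process.stdout` ends when the child closes its output, which avoids the manual `readline()`/`poll()` loop while keeping the same live progress display.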

### Configuration
Set your preferences here. Run this cell first to configure the pipeline.

In [None]:
# Configuration - Run this cell first
import os
import subprocess
import pathlib
import time
import csv
import tempfile

# Configuration
DATASET_ROOT = '/content/aic2025'
TEST_MODE = True  # Set to False to process the full VIDEOS list (test mode downloads only L21 and L22)
VIDEOS = ['L21', 'L22', 'L23', 'L24', 'L25', 'L26', 'L27', 'L28', 'L29', 'L30']  # adjust if needed
CSV_FILE = 'AIC_2025_dataset_download_link.csv'  # Update path if different

# Apply test mode if enabled
try:
    if TEST_MODE:
        VIDEOS = ['L21', 'L22']
        print("TEST MODE ENABLED: Only processing L21-L22")
except NameError:
    print("Using full video list:", VIDEOS)

def filter_csv_for_videos(csv_path, video_list, output_path):
    """Filter the CSV file to only include entries for specified videos + essential metadata"""
    if not pathlib.Path(csv_path).exists():
        return False

    filtered_rows = []
    with open(csv_path, 'r', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header:
            filtered_rows.append(header)

        for row in reader:
            if not row:
                continue
            # Check if any of our target videos appear in the filename
            filename = row[-2].strip() if len(row) >= 2 else ""
            filename_upper = filename.upper()

            # Always include essential metadata files (needed for all videos)
            essential_files = [
                'MAP-KEYFRAMES-AIC25-B1.ZIP',
                'MEDIA-INFO-AIC25-B1.ZIP',
                'OBJECTS-AIC25-B1.ZIP',
                'CLIP-FEATURES-32-AIC25-B1.ZIP'
            ]

            is_essential = any(essential in filename_upper for essential in essential_files)
            is_target_video = any(vid in filename_upper for vid in video_list)

            if is_essential or is_target_video:
                filtered_rows.append(row)

    # Write filtered CSV
    with open(output_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerows(filtered_rows)

    print(f"Filtered CSV: {len(filtered_rows)-1} entries for videos {video_list} + essential metadata")
    return True

print("Configuration loaded successfully!")
print(f"Dataset root: {DATASET_ROOT}")
print(f"Videos to process: {VIDEOS}")

TEST MODE ENABLED: Only processing L21-L24
Configuration loaded successfully!
Dataset root: /content/aic2025
Videos to process: ['L21', 'L22']
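
The row filter in `filter_csv_for_videos` is a plain substring test on the upper-cased file name. The sketch below replays that rule on a few hypothetical file names (they are illustrative, not taken from the download CSV) and uses only a subset of the essential-file list. The helper name `keep_row` is an assumption.

```python
# Subset of the essential metadata archives named in the configuration cell
ESSENTIAL = ("MAP-KEYFRAMES-AIC25-B1.ZIP", "MEDIA-INFO-AIC25-B1.ZIP")

def keep_row(filename, video_list):
    """Mirror the filter rule: keep essential archives and any name containing a video ID."""
    name = filename.upper()
    return any(e in name for e in ESSENTIAL) or any(v in name for v in video_list)

# Hypothetical file names, for illustration only
names = ["Videos_L21_a.zip", "Keyframes_L22.zip", "Videos_L23_a.zip", "media-info-aic25-b1.zip"]
kept = [n for n in names if keep_row(n, ["L21", "L22"])]
print(kept)
```

Because the file name is upper-cased before matching, video IDs must be written in upper case (`'L21'`, not `'l21'`). A very short ID such as `'L2'` would match every `L2x` collection.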


In [None]:
# Step 1: Download Dataset (Skip this cell if data already downloaded)
from tqdm import tqdm
print("Step 1: Download dataset")
start_time = time.time()

if pathlib.Path(CSV_FILE).exists():
    # Create filtered CSV for our target videos
    with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as tmp_csv:
        filtered_csv_path = tmp_csv.name

    if filter_csv_for_videos(CSV_FILE, VIDEOS, filtered_csv_path):
        print("Starting dataset download with progress tracking...")
        cmd = [
            'python', 'scripts/dataset_downloader.py',
            '--dataset_root', DATASET_ROOT,
            '--csv', filtered_csv_path,
            '--skip-existing'
        ]
        print(f"Command: {' '.join(cmd)}")

        # Run with real-time output to show download progress
        import sys
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                 universal_newlines=True, bufsize=1)

        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print(output.strip())
                sys.stdout.flush()

        return_code = process.poll()

        # Clean up temp file
        try:
            os.unlink(filtered_csv_path)
        except OSError:
            pass

        if return_code != 0:
            print("Dataset download failed!")
            raise Exception("Download failed")

        elapsed = time.time() - start_time
        print(f"Dataset download completed in {elapsed:.1f} seconds")

        # Debug: Check what was actually extracted
        print("\nChecking extracted structure:")
        dataset_path = pathlib.Path(DATASET_ROOT)
        if dataset_path.exists():
            for subdir in ['videos', 'keyframes', 'map_keyframes', 'media_info', 'objects', 'features']:
                subdir_path = dataset_path / subdir
                if subdir_path.exists():
                    files = list(subdir_path.rglob('*'))
                    print(f"  {subdir}/: {len(files)} items")
                    # Show first few items
                    for item in files[:5]:
                        rel_path = item.relative_to(subdir_path)
                        item_type = "DIR" if item.is_dir() else "FILE"
                        print(f"    {item_type}: {rel_path}")
                    if len(files) > 5:
                        print(f"    ... and {len(files) - 5} more")
                else:
                    print(f"  {subdir}/: NOT FOUND")
    else:
        print("Failed to filter CSV file")
        raise Exception("CSV filtering failed")
else:
    print(f"CSV file {CSV_FILE} not found. Make sure it exists in the current directory.")
    raise Exception("CSV file not found")

Streaming output truncated to the last 5000 lines.
→ 3174.0/3378.9 MB (93.9%)
→ 3175.1/3378.9 MB (94.0%)
→ 3176.1/3378.9 MB (94.0%)
→ 3177.2/3378.9 MB (94.0%)
→ 3178.2/3378.9 MB (94.1%)
→ 3179.3/3378.9 MB (94.1%)
→ 3180.3/3378.9 MB (94.1%)
→ 3181.4/3378.9 MB (94.2%)
→ 3182.4/3378.9 MB (94.2%)
→ 3183.5/3378.9 MB (94.2%)
→ 3184.5/3378.9 MB (94.2%)
→ 3185.6/3378.9 MB (94.3%)
→ 3186.6/3378.9 MB (94.3%)
→ 3187.7/3378.9 MB (94.3%)
→ 3188.7/3378.9 MB (94.4%)
→ 3189.8/3378.9 MB (94.4%)
→ 3190.8/3378.9 MB (94.4%)
→ 3191.9/3378.9 MB (94.5%)
→ 3192.9/3378.9 MB (94.5%)
→ 3194.0/3378.9 MB (94.5%)
→ 3195.0/3378.9 MB (94.6%)
→ 3196.1/3378.9 MB (94.6%)
→ 3197.1/3378.9 MB (94.6%)
→ 3198.2/3378.9 MB (94.6%)
→ 3199.2/3378.9 MB (94.7%)
→ 3200.3/3378.9 MB (94.7%)
→ 3201.3/3378.9 MB (94.7%)
→ 3202.4/3378.9 MB (94.8%)
→ 3203.4/3378.9 MB (94.8%)
→ 3204.4/3378.9 MB (94.8%)
→ 3205.5/3378.9 MB (94.9%)
→ 3206.5/3378.9 MB (94.9%)
→ 3207.6/3378.9 MB (94.9%)
→ 3208.6/3378.9 MB (95.0%)
→ 3209.7/3378.9 M

In [None]:
# HOTFIX: Reorganize video files that are in video/ subfolders
import os
import shutil
from pathlib import Path
def fix_video_structure(dataset_root):
    """Fix video files that are sitting in video/ subfolders"""
    extracted_tmp = Path(dataset_root) / "_extracted_tmp"
    videos_dir = Path(dataset_root) / "videos"
    if not extracted_tmp.exists():
        print("No _extracted_tmp found, skipping video fix")
        return
    # Find all Videos_* directories
    for item in extracted_tmp.rglob("*"):
        if item.is_dir() and item.name.lower().startswith("videos_"):
            print(f"Found Videos directory: {item}")
            # Check for video subfolder
            video_subdir = item / "video"
            if video_subdir.exists() and video_subdir.is_dir():
                print(f"  Found video subfolder: {video_subdir}")
                # Move all mp4 files out of the video subfolder
                for mp4_file in video_subdir.glob("*.mp4"):
                    dst = videos_dir / mp4_file.name
                    if not dst.exists():
                        videos_dir.mkdir(parents=True, exist_ok=True)
                        shutil.move(str(mp4_file), str(dst))
                        print(f"    Moved: {mp4_file.name}")
                    else:
                        print(f"    Skip (exists): {mp4_file.name}")
# Run the hotfix
fix_video_structure("/content/aic2025")
print("Video structure hotfix complete!")

Found Videos directory: /content/aic2025/_extracted_tmp/Videos_L21_a_zip
  Found video subfolder: /content/aic2025/_extracted_tmp/Videos_L21_a_zip/video
    Moved: L21_V019.mp4
    Moved: L21_V014.mp4
    Moved: L21_V028.mp4
    Moved: L21_V011.mp4
    Moved: L21_V006.mp4
    Moved: L21_V012.mp4
    Moved: L21_V003.mp4
    Moved: L21_V031.mp4
    Moved: L21_V013.mp4
    Moved: L21_V002.mp4
    Moved: L21_V008.mp4
    Moved: L21_V016.mp4
    Moved: L21_V025.mp4
    Moved: L21_V007.mp4
    Moved: L21_V026.mp4
    Moved: L21_V005.mp4
    Moved: L21_V029.mp4
    Moved: L21_V030.mp4
    Moved: L21_V024.mp4
    Moved: L21_V021.mp4
    Moved: L21_V017.mp4
    Moved: L21_V010.mp4
    Moved: L21_V023.mp4
    Moved: L21_V009.mp4
    Moved: L21_V027.mp4
    Moved: L21_V015.mp4
    Moved: L21_V001.mp4
    Moved: L21_V018.mp4
    Moved: L21_V022.mp4
Found Videos directory: /content/aic2025/_extracted_tmp/Videos_L22_a_zip
  Found video subfolder: /content/aic2025/_extracted_tmp/Videos_L22_a_zip/vide

In [None]:
# Step 2: Build Search Index
from tqdm import tqdm
print("Step 2: Build search index")
start_time = time.time()

# Check if GPU is available for flat indexing
use_gpu = False
try:
    import torch
    use_gpu = torch.cuda.is_available()
    if use_gpu:
        print("GPU detected - will build flat index for GPU acceleration")
    else:
        print("No GPU detected - building HNSW index for CPU")
except Exception:
    print("Could not detect GPU - building HNSW index for CPU")

cmd = [
    'python', 'scripts/index.py',
    '--dataset_root', DATASET_ROOT,
    '--videos'
] + VIDEOS

# Add --flat flag for GPU compatibility
if use_gpu:
    cmd.append('--flat')
    print("Building flat index for GPU acceleration")

print(f"Command: {' '.join(cmd)}")

# Run with real-time output for progress tracking
import sys
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                         universal_newlines=True, bufsize=1)

while True:
    output = process.stdout.readline()
    if output == '' and process.poll() is not None:
        break
    if output:
        print(output.strip())
        sys.stdout.flush()

return_code = process.poll()

if return_code != 0:
    print("Index building failed!")
    raise Exception("Index building failed")

elapsed = time.time() - start_time
print(f"Search index built successfully in {elapsed:.1f} seconds")

Step 2: Build search index
GPU detected - will build flat index for GPU acceleration
Building flat index for GPU acceleration
Command: python scripts/index.py --dataset_root /content/aic2025 --videos L21 L22 --flat
Using configured model: ViT-B-32 (openai)
Model: ViT-B-32 (openai) | embedding dim: 512
Found 29 videos in collection L21
Found 31 videos in collection L22
Processing 60 videos: ['L21_V001', 'L21_V002', 'L21_V003', 'L21_V005', 'L21_V006']...
[OK] Using precomputed features for L21_V001: (307, 512) (float16)
[OK] Using precomputed features for L21_V002: (262, 512) (float16)
[OK] Using precomputed features for L21_V003: (286, 512) (float16)
[OK] Using precomputed features for L21_V005: (223, 512) (float16)
[OK] Using precomputed features for L21_V006: (257, 512) (float16)
[OK] Using precomputed features for L21_V007: (209, 512) (float16)
[OK] Using precomputed features for L21_V008: (317, 512) (float16)
[OK] Using precomputed features for L21_V009: (287, 512) (float16)
[OK] Us

In [None]:
# Step 3: Build Text Corpus
from tqdm import tqdm
print("Step 3: Build text corpus")
start_time = time.time()

cmd = [
    'python', 'scripts/build_text.py',
    '--dataset_root', DATASET_ROOT,
    '--videos'
] + VIDEOS

print(f"Command: {' '.join(cmd)}")

# Run with real-time output for progress tracking
import sys
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                         universal_newlines=True, bufsize=1)

while True:
    output = process.stdout.readline()
    if output == '' and process.poll() is not None:
        break
    if output:
        print(output.strip())
        sys.stdout.flush()

return_code = process.poll()

if return_code != 0:
    print("Text corpus building failed!")
    raise Exception("Text corpus building failed")

elapsed = time.time() - start_time
print(f"Text corpus built successfully in {elapsed:.1f} seconds")

# Check generated artifacts
artifacts_dir = pathlib.Path('./artifacts')
if artifacts_dir.exists():
    artifact_files = list(artifacts_dir.glob('*'))
    print(f"\nGenerated {len(artifact_files)} artifact files:")
    for artifact in sorted(artifact_files):
        if artifact.is_file():
            size = artifact.stat().st_size
            print(f"  - {artifact.name} ({size:,} bytes)")
        elif artifact.is_dir():
            file_count = len(list(artifact.glob('*')))
            print(f"  - {artifact.name}/ ({file_count} files)")

    print(f"\nPipeline completed! Artifacts ready in ./artifacts/")
else:
    print("No artifacts directory found")

Step 3: Build text corpus
Command: python scripts/build_text.py --dataset_root /content/aic2025 --videos L21 L22
Found 29 videos in collection L21
Found 31 videos in collection L22

Building text corpus:   0%|          | 0/60 [00:00<?, ?it/s]
Building text corpus:   2%|▏         | 1/60 [00:00<00:11,  5.21it/s]
Building text corpus:   3%|▎         | 2/60 [00:00<00:08,  7.04it/s]
Building text corpus:   5%|▌         | 3/60 [00:00<00:07,  7.68it/s]
Building text corpus:   8%|▊         | 5/60 [00:00<00:06,  9.09it/s]
Building text corpus:  12%|█▏        | 7/60 [00:00<00:05,  9.27it/s]
Building text corpus:  13%|█▎        | 8/60 [00:00<00:05,  8.87it/s]
Building text corpus:  15%|█▌        | 9/60 [00:01<00:05,  8.74it/s]
Building text corpus:  17%|█▋        | 10/60 [00:01<00:05,  8.97it/s]
Building text corpus:  20%|██        | 12/60 [00:01<00:05,  9.42it/s]
Building text corpus:  22%|██▏       | 13/60 [00:01<00:05,  9.29it/s]
Building text corpus:  23%|██▎       | 14/60 [00:01<00:05,  9.08

In [None]:
# Step 4: Generate Training Data and Train Reranker Model
import os
import pathlib
import subprocess
import time

def create_and_train_reranker():
    """Generate training data from metadata and train reranker model"""
    try:
        # Step 1: Generate training data from competition metadata
        print("Step 4a: Generating training data from competition metadata...")

        cmd = [
            'python', 'scripts/create_training_data.py',
            '--dataset_root', DATASET_ROOT,
            '--output', 'data/train.jsonl',
            '--num_examples', '100'
        ]
        print(f"Command: {' '.join(cmd)}")

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Training data generation failed: {result.stderr}")
            print("Will proceed with fusion baseline (no reranker training)")
            return True  # Don't fail pipeline, just use baseline

        print("Training data generated successfully!")
        if result.stdout:
            print(result.stdout)

        # Step 2: Train the reranker model
        print("\nStep 4b: Training reranker model...")

        cmd = [
            'python', 'src/training/train_reranker.py',
            '--index_dir', './artifacts',
            '--train_jsonl', 'data/train.jsonl'
        ]
        print(f"Command: {' '.join(cmd)}")

        start_time = time.time()

        # Run with real-time output for progress tracking
        import sys
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                 universal_newlines=True, bufsize=1)

        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print(output.strip())
                sys.stdout.flush()

        return_code = process.poll()
        elapsed_time = time.time() - start_time

        if return_code == 0:
            print(f"\nReranker training completed successfully in {elapsed_time:.1f}s!")

            # Check if model was created
            model_file = pathlib.Path('./artifacts/reranker.joblib')
            if model_file.exists():
                model_size = model_file.stat().st_size
                print(f"Model saved: {model_file} ({model_size:,} bytes)")

            return True
        else:
            print(f"Reranker training failed after {elapsed_time:.1f}s!")
            print("Will use fusion baseline instead")
            return True  # Don't fail pipeline, baseline still works

    except Exception as e:
        print(f"Training error: {e}")
        print("Will use fusion baseline instead")
        return True  # Don't fail pipeline

# Run training data creation and reranker training
print("Creating training data and training reranker...")
train_success = create_and_train_reranker()

if train_success:
    print("✅ Reranker training step completed!")

    # Check what we ended up with
    model_file = pathlib.Path('./artifacts/reranker.joblib')
    if model_file.exists():
        print("🎯 Using trained reranker model for enhanced search results")
    else:
        print("📊 Using fusion baseline (RRF) for reliable search results")
else:
    print("❌ Training failed, but pipeline can continue with fusion baseline.")

Creating training data and training reranker...
Step 4a: Generating training data from competition metadata...
Command: python scripts/create_training_data.py --dataset_root /content/aic2025 --output data/train.jsonl --num_examples 100
Training data generated successfully!
Using metadata from:
  Media info: /content/aic2025/media_info
  Keyframes: /content/aic2025/map_keyframes
Found 873 videos with complete metadata
Sampling 20 videos for training data generation
Generated 80 training examples
Training data saved to data/train.jsonl

✅ Training data creation successful!
Next step: Train the reranker with:
python src/training/train_reranker.py --index_dir ./artifacts --train_jsonl data/train.jsonl


Step 4b: Training reranker model...
Command: python src/training/train_reranker.py --index_dir ./artifacts --train_jsonl data/train.jsonl
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable
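
When no trained reranker is available, the training cell reports falling back to a "fusion baseline (RRF)". The notebook does not show how the repository computes it, so the following is only a sketch of standard reciprocal rank fusion, where each item scores the sum of `1 / (k + rank)` over the input rankings. The function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the frame IDs are all assumptions.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of IDs with reciprocal rank fusion (best first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a deterministic order
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))

# Hypothetical keyframe IDs from a visual search and a text search
visual = ["L21_V001_f10", "L21_V002_f03", "L22_V005_f07"]
text = ["L21_V002_f03", "L22_V009_f01", "L21_V001_f10"]
fused = rrf_fuse([visual, text])
print(fused)
```

Items that appear near the top of both lists rise above items that appear in only one, which is why RRF works as a baseline without any training.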

In [None]:
# Assemble Pipeline and Test Query
import os
import subprocess
import pathlib
import shutil
import time

# Configuration
PIPELINE_DIR = 'my_pipeline'
TEST_QUERY = 'a person opening a laptop'

def assemble_and_test_pipeline():
    """Assemble minimal pipeline directory and run a test query"""
    try:
        # Step 1: Prepare pipeline directory
        print("Assembling minimal pipeline directory...")

        cmd = [
            'python', 'scripts/prepare_pipeline_dir.py',
            '--outdir', PIPELINE_DIR,
            '--artifact_dir', './artifacts',
            '--include_model',
            '--force'
        ]
        print(f"Command: {' '.join(cmd)}")

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Pipeline assembly failed: {result.stderr}")
            return False

        print("Pipeline directory assembled successfully!")

        # Show pipeline contents
        pipeline_path = pathlib.Path(PIPELINE_DIR)
        if pipeline_path.exists():
            print(f"\nPipeline directory contents ({PIPELINE_DIR}/):")
            for item in sorted(pipeline_path.rglob('*')):
                if item.is_file():
                    rel_path = item.relative_to(pipeline_path)
                    size = item.stat().st_size
                    print(f"  FILE {rel_path} ({size:,} bytes)")
                elif item.is_dir() and item != pipeline_path:
                    rel_path = item.relative_to(pipeline_path)
                    file_count = len(list(item.glob('*')))
                    print(f"  DIR  {rel_path}/ ({file_count} files)")

        # Step 2: Test the pipeline
        print(f"\nTesting pipeline with query: '{TEST_QUERY}'")

        # Change to pipeline directory for testing
        original_dir = os.getcwd()
        os.chdir(PIPELINE_DIR)

        try:
            cmd = ['python', 'src/retrieval/use.py', '--query', TEST_QUERY]
            print(f"Command: {' '.join(cmd)}")

            start_time = time.time()

            # Run with real-time output to see any import errors immediately
            import sys
            process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                     universal_newlines=True, bufsize=1)

            output_lines = []
            while True:
                output = process.stdout.readline()
                if output == '' and process.poll() is not None:
                    break
                if output:
                    output_lines.append(output.strip())
                    print(output.strip())
                    sys.stdout.flush()

            return_code = process.poll()
            elapsed_time = time.time() - start_time

            if return_code == 0:
                print(f"\n✅ Test query completed successfully in {elapsed_time:.1f}s!")

                # Show results
                submissions_dir = pathlib.Path('submissions')
                if submissions_dir.exists():
                    csv_files = list(submissions_dir.glob('*.csv'))
                    if csv_files:
                        print(f"\nGenerated {len(csv_files)} result file(s):")
                        for csv_file in csv_files:
                            print(f"  - {csv_file}")
                            # Show first few lines
                            try:
                                with open(csv_file, 'r') as f:
                                    lines = f.readlines()[:3]
                                    print(f"    Sample results (first 3 lines):")
                                    for i, line in enumerate(lines, 1):
                                        print(f"    {i}: {line.strip()}")
                                # Count total results
                                with open(csv_file, 'r') as f:
                                    total_lines = sum(1 for _ in f)
                                print(f"    Total results: {total_lines}")
                            except Exception as e:
                                print(f"    (Could not read file: {e})")

                return True
            else:
                print(f"\n❌ Test query failed after {elapsed_time:.1f}s!")

                # Show detailed error information
                error_found = False
                for line in output_lines:
                    if "Error" in line or "Traceback" in line or "ImportError" in line:
                        error_found = True
                        print(f"Error: {line}")

                if not error_found and output_lines:
                    print("Last few lines of output:")
                    for line in output_lines[-5:]:
                        print(f"  {line}")

                return False

        finally:
            # Return to original directory
            os.chdir(original_dir)

    except Exception as e:
        print(f"Assembly/test error: {e}")
        return False

# Run assembly and test
print("Starting pipeline assembly and testing...")
success = assemble_and_test_pipeline()

if success:
    print(f"\n🎉 Pipeline assembled and tested successfully!")
    print(f"Ready-to-deploy pipeline is in: {PIPELINE_DIR}/")
    print("\nNext steps:")
    print(f"  1. Upload {PIPELINE_DIR}/ to your deployment environment")
    print("  2. Run queries using: python src/retrieval/use.py --query 'your search'")
    print("  3. Find results in submissions/ folder")
else:
    print(f"\n⚠️  Pipeline assembly or testing failed. Check errors above.")

# Show final summary
print(f"\nDevelopment Summary:")
print(f"  Pipeline directory: {PIPELINE_DIR}/")
print(f"  Test query: '{TEST_QUERY}'")
print(f"  Results location: {PIPELINE_DIR}/submissions/")

# Check for trained model
model_file = pathlib.Path(f'{PIPELINE_DIR}/artifacts/reranker.joblib')
if model_file.exists():
    print(f"  Model: Trained reranker included ({model_file.stat().st_size:,} bytes)")
else:
    print(f"  Model: Using fusion baseline (RRF)")

Starting pipeline assembly and testing...
Assembling minimal pipeline directory...
Command: python scripts/prepare_pipeline_dir.py --outdir my_pipeline --artifact_dir ./artifacts --include_model --force
Pipeline directory assembled successfully!

Pipeline directory contents (my_pipeline/):
  FILE README_RUN.md (324 bytes)
  DIR  artifacts/ (4 files)
  FILE artifacts/index.faiss (34,603,053 bytes)
  FILE artifacts/mapping.parquet (384,309 bytes)
  FILE artifacts/reranker.joblib (1,037 bytes)
  FILE artifacts/text_corpus.jsonl (79,541,830 bytes)
  FILE config.py (2,015 bytes)
  DIR  src/ (1 files)
  DIR  src/retrieval/ (1 files)
  FILE src/retrieval/use.py (13,280 bytes)
  DIR  submissions/ (0 files)
  FILE utils.py (1,124 bytes)

Testing pipeline with query: 'a person opening a laptop'
Command: python src/retrieval/use.py --query a person opening a laptop
[OK] wrote 100 lines → submissions/kis_a-person-opening-a-laptop.csv

✅ Test query completed successfully in 14.2s!

Generated 1 resu

## Official Evaluation
Provide your ground truth JSON path and task.
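
Before running the evaluator, it can save time to confirm that the ground truth file exists and parses as JSON. This is a minimal sketch: the helper name `check_ground_truth` is an assumption, and it does not validate the schema that `eval/evaluate.py` expects, since that schema is not described in this notebook.

```python
import json
import pathlib

def check_ground_truth(gt_path):
    """Return (ok, message) after confirming the file exists and parses as JSON."""
    path = pathlib.Path(gt_path)
    if not path.is_file():
        return False, f"Ground truth file not found: {path}"
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON in {path}: {e}"
    # Only containers have a meaningful entry count
    size = len(data) if isinstance(data, (dict, list)) else 1
    return True, f"Loaded {type(data).__name__} with {size} top-level entries"

ok, message = check_ground_truth("ground_truth.json")
print(message)
```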

In [None]:
# Configure evaluation
GT_PATH = 'ground_truth.json'   # update path (e.g., /content/drive/MyDrive/gt.json)
TASK_EVAL = 'kis'               # 'kis' or 'vqa' or 'trake'
NORMALIZE_ANS = False           # True to casefold VQA answers

import subprocess
cmd = ['python', 'eval/evaluate.py', '--gt', GT_PATH, '--pred_dir', 'submissions', '--task', TASK_EVAL]
if NORMALIZE_ANS:
    cmd.append('--normalize_answer')
print('Evaluating:', ' '.join(cmd))
res = subprocess.run(cmd, capture_output=True, text=True)
print(res.stdout)
if res.returncode:
    print(res.stderr)
    raise SystemExit(res.returncode)
