## System Check & Setup

Run this cell first to check your Colab environment and system capabilities.

## NEW: Segment-first + ASR + CE Re-rank + TransNetV2
This notebook now supports a stronger retrieval flow:
- **NEW**: Segment videos with TransNetV2 (deep learning shot detection) → `segments.parquet`
- Build index over segment representative frames using GPU-accelerated FAISS
- Merge ASR transcripts into the corpus (optional)
- Hybrid retrieval (BM25 + FAISS) with optional cross-encoder re-ranking
- Full support for KIS, VQA, and TRAKE (host performs scoring)
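
A message later in this notebook names the non-reranked path a "fusion baseline (RRF)". How `src/retrieval/use.py` actually combines BM25 and FAISS results is not shown here, so the following is only a minimal sketch of generic reciprocal rank fusion. The function name, the segment IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of IDs into a single ranked list.

    Each list contributes 1 / (k + rank) for every ID it contains,
    where rank starts at 1 for the best hit.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in fused[:top_n]]


# Hypothetical ranked lists from the two retrievers.
bm25_hits = ["seg_3", "seg_1", "seg_7"]
faiss_hits = ["seg_1", "seg_9", "seg_3"]

fused = reciprocal_rank_fusion([bm25_hits, faiss_hits])
print(fused)
```

IDs that appear in both lists accumulate two contributions, which is why they tend to rise above IDs found by only one retriever.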

### Quick CLI (local or Colab VM)
````bash
# 1) Segment videos with TransNetV2 (collection IDs like L21 or explicit IDs)
python scripts/segment_videos.py --dataset_root /content/aic2025 \
  --videos L21 L22 --artifact_dir ./artifacts --use_transnetv2

# 2) Build index using segment reps with GPU acceleration
python scripts/index.py --dataset_root /content/aic2025 \
  --videos L21 L22 --segments ./artifacts/segments.parquet

# 3) Build text corpus (merge ASR if available)
python scripts/build_text.py --dataset_root /content/aic2025 \
  --videos L21 L22 --artifact_dir ./artifacts \
  --segments ./artifacts/segments.parquet \
  --transcripts /content/transcripts.jsonl   # optional

# 4) Search examples
python src/retrieval/use.py --query "your search" --query_id q1 --rerank ce
python src/retrieval/use.py --task vqa --query "câu hỏi" --answer "màu xanh" --query_id q2 --rerank ce
python src/retrieval/use.py --task trake --query "high jump" \
  --events_json /content/events.json --query_id q3 --rerank ce
````

**Key Updates:**
- **TransNetV2**: Deep learning-based shot boundary detection (OpenCV detection remains as the fallback when no GPU is available)
- **GPU FAISS**: Uses `faiss-gpu-cu12` for faster indexing and search
- **Better Dependencies**: Latest `open-clip-torch` with improved multilingual support

Note: This notebook only exports CSVs; the host computes the official scores.

# AIC 2024/2025 Retrieval – Automated Pipeline ⚡

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nqvu-daniel/AIC_FTML_dev/blob/main/notebooks/colab_pipeline.ipynb)

## 🚀 **NEW: Segment-First Pipeline with TransNetV2**
**Latest update**: Modern **segment-first architecture** with **TransNetV2** deep learning shot detection!

**Quick Start:**
1. **Enable GPU**: Runtime → Change runtime type → Hardware accelerator: T4/L4/A100 (recommended for Colab)
2. **Run Setup**: Execute the "Setup" cell below to automatically clone repo and install dependencies
3. **Configure Environment**: Set `IS_COLAB = True/False` in the configuration cell
4. **Choose Your Path**:
   - Host Inference (recommended): Use pre-built artifacts to run queries instantly
   - Development Pipeline: Build your own artifacts with **segment-first approach**

**File Downloads**: 
- **Colab**: Results saved to `/content/AIC_FTML_dev/submissions/` - download from Colab's file browser
- **Local**: Results saved to `./submissions/` in your current directory

---

## Two Usage Modes

### 1. Host Inference (Recommended - Fast)
- No dataset required
- Uses pre-built artifacts and models
- Ready in ~2 minutes
- Perfect for running queries and getting CSV results

### 2. Development Pipeline (Advanced - Modern Segment-First!)
- Downloads the full dataset (multiple gigabytes)
- **NEW**: Segment-first architecture with TransNetV2 shot detection
- **NEW**: GPU-accelerated FAISS indexing
- **NEW**: Proper text corpus with ASR integration
- Builds search index + custom reranker models

---

## 🎯 Modern Pipeline Architecture

| Component | Technology | Purpose |
|-----------|-------------|---------|
| **Video Segmentation** | TransNetV2 (deep learning) | Accurate shot boundary detection |
| **Visual Index** | FAISS GPU | Fast similarity search |
| **Text Corpus** | BM25 + ASR merge | Hybrid text search |
| **Reranking** | Cross-encoder | Result quality boost |
| **Dependencies** | Latest open-clip-torch | Better multilingual support |

**Key Improvements:**
- Replaces old "smart pipeline" with proper segment-first approach
- Uses TransNetV2 instead of basic OpenCV detection
- GPU-accelerated processing throughout
- Structured 3-step build process
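
To make the segment-first idea concrete, here is a minimal sketch of turning detected shot cuts into segments with one representative frame each. The actual schema of `segments.parquet` and the way `scripts/segment_videos.py` picks representative frames are not shown in this notebook. The field names (`start`, `end`, `rep`) and the midpoint rule are therefore assumptions.

```python
def boundaries_to_segments(cut_frames, total_frames):
    """Convert cut positions into segments with a representative frame.

    A cut at frame c is taken to mean that a new shot starts at c.
    `end` is inclusive. Field names are hypothetical.
    """
    # Keep only valid, unique cuts and always start the first shot at 0.
    starts = [0] + [c for c in sorted(set(cut_frames)) if 0 < c < total_frames]
    segments = []
    for i, start in enumerate(starts):
        next_start = starts[i + 1] if i + 1 < len(starts) else total_frames
        end = next_start - 1
        rep = (start + end) // 2  # assumption: midpoint frame represents the shot
        segments.append({"start": start, "end": end, "rep": rep})
    return segments


# Hypothetical video with 450 frames and cuts at frames 120 and 300.
segments = boundaries_to_segments([120, 300], total_frames=450)
for seg in segments:
    print(seg)
```

Indexing one frame per segment keeps the index small while still giving every shot an entry to retrieve.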

---

In [None]:
!nvidia-smi || true
!python --version
import sys, os, pathlib
print('CWD:', os.getcwd())

In [None]:
# Install dependencies (CUDA 12.1)
!pip -q install --upgrade pip

# Clean potentially conflicting preinstalled packages (safe to ignore errors)
!pip -q uninstall -y opencv-python opencv-contrib-python opencv-python-headless numpy scipy pandas scikit-learn faiss faiss-cpu faiss-gpu faiss-gpu-cu12 faiss-cpu-cu12 open-clip-torch thinc spacy pillow decord pyarrow  || true
!pip -q cache purge || true

# PyTorch cu121
!pip -q install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio

# Align to NumPy 2 to avoid ABI conflicts with OpenCV 4.12/thinc
!pip -q install "numpy>=2.0,<2.3" "scipy>=1.13.1" "pandas>=2.2.2" "scikit-learn>=1.5.1"

# Core libraries with pinned versions (headless OpenCV 4.12 needs NumPy 2.x)
# Quote version specifiers so the shell does not treat '>' as an output redirect
!pip -q install "open-clip-torch>=2.24.0" rank_bm25 joblib pyarrow tqdm pyyaml pillow "opencv-python-headless==4.12.0.88" decord "transnetv2-pytorch>=1.0.0"

# Install GPU-optimized FAISS for CUDA 12
!pip -q install faiss-gpu-cu12

# Verify
import torch, sys
print('Torch', torch.__version__, 'CUDA', torch.version.cuda, 'CUDA available', torch.cuda.is_available())
try:
    import faiss
    print('FAISS', getattr(faiss, '__version__', 'n/a'))
    # Check if GPU FAISS is available
    if hasattr(faiss, 'StandardGpuResources'):
        print('FAISS GPU support: Available')
    else:
        print('FAISS GPU support: Not available')
except Exception as e:
    print('FAISS import error:', e)
try:
    import open_clip
    print('open_clip', getattr(open_clip, '__version__', 'n/a'))
except Exception as e:
    print('open_clip import error:', e)
try:
    import transnetv2_pytorch
    print('TransNetV2', getattr(transnetv2_pytorch, '__version__', 'available'))
except Exception as e:
    print('TransNetV2 import error:', e)

In [None]:
# Setup: Clone repo and install dependencies automatically (GPU-ready)
import os
import pathlib
import subprocess
import sys

REPO_URL = 'https://github.com/nqvu-daniel/AIC_FTML_dev.git'
REPO_NAME = 'AIC_FTML_dev'

def setup_repository():
    """Automatically clone repository and setup environment"""
    try:
        # Check if repo already exists
        if pathlib.Path(REPO_NAME).exists():
            print(f"Repository '{REPO_NAME}' already exists")
            os.chdir(REPO_NAME)
        else:
            print(f"Cloning repository from {REPO_URL}")
            subprocess.run(['git', 'clone', REPO_URL], check=True)
            os.chdir(REPO_NAME)
            print("Repository cloned successfully")

        # Install dependencies
        print("Installing dependencies...")
        subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', '-r', 'requirements.txt'], check=True)

        # Install GPU-optimized FAISS and TransNetV2
        try:
            import torch
            if torch.cuda.is_available():
                print("GPU detected, installing faiss-gpu-cu12...")
                subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'faiss-gpu-cu12'], check=True)
                print('Installed faiss-gpu-cu12 (CUDA 12 compatible)')
            else:
                print("No GPU detected, installing faiss-cpu...")
                subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'faiss-cpu'], check=True)
                print('Installed faiss-cpu')
            
            # Install TransNetV2 for advanced video segmentation
            subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'transnetv2-pytorch>=1.0.0'], check=True)
            print('Installed TransNetV2-PyTorch for video segmentation')
            
        except Exception as e:
            print(f'FAISS/TransNetV2 install error: {e}')
            # Fallback to CPU versions
            subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'faiss-cpu'], check=True)
            subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'transnetv2-pytorch'], check=True)
            print('Fallback: Installed faiss-cpu and transnetv2-pytorch')

        # Add to Python path
        if '.' not in sys.path:
            sys.path.append('.')

        print("Setup complete! Ready to run AIC FTML pipeline")
        print(f"Current directory: {os.getcwd()}")

        return True

    except subprocess.CalledProcessError as e:
        print(f"Error during setup: {e}")
        return False
    except Exception as e:
        print(f"Unexpected error: {e}")
        return False

# Run setup
if setup_repository():
    print("\nYou can now proceed with the pipeline!")
else:
    print("\nSetup failed. Please check the errors above.")

## Host Inference – One-shot
Set `ARTIFACTS_BUNDLE_URL` and/or `RERANKER_MODEL_URL` if the artifacts bundle or the reranker model is not already present in `./artifacts`.
This writes a Top-100 CSV into `submissions/`.
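
Before deciding whether the URLs are needed, it can help to check what is already in `./artifacts`. The sketch below is one way to do that. The file names come from the artifact check and reranker messages elsewhere in this notebook. The helper name `missing_artifacts` and the assumption that the hosted bundle contains these same files are illustrative, not confirmed.

```python
from pathlib import Path

# Names taken from the artifact check later in this notebook.
REQUIRED = ["segments.parquet", "index.faiss", "mapping.parquet", "text_corpus.jsonl"]
# Assumption: the reranker is optional, since the notebook falls back
# to a fusion baseline when reranker.joblib is absent.
OPTIONAL = ["reranker.joblib"]


def missing_artifacts(artifact_dir="./artifacts"):
    """Return (missing_required, missing_optional) file names."""
    root = Path(artifact_dir)
    missing_required = [name for name in REQUIRED if not (root / name).is_file()]
    missing_optional = [name for name in OPTIONAL if not (root / name).is_file()]
    return missing_required, missing_optional


required, optional = missing_artifacts()
if required:
    print("Set ARTIFACTS_BUNDLE_URL; missing:", required)
if optional:
    print("Optionally set RERANKER_MODEL_URL; missing:", optional)
if not required and not optional:
    print("All artifacts present; no URLs needed.")
```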

In [None]:
# Host Inference - Automated Setup and Query Execution
import os
import subprocess
import pathlib

# Configuration - Update these URLs with your hosted models
QUERY = 'a person opening a laptop'  # Change this to your search query
QUERY_ID = 'q1'  # Official query id for filename submissions/{query_id}.csv
TASK = 'kis'     # 'kis' or 'vqa'
ANSWER = ''      # Required if TASK='vqa'
ARTIFACTS_BUNDLE_URL = ''  # e.g., 'https://your-host.com/artifacts_bundle.tar.gz'
RERANKER_MODEL_URL = ''    # e.g., 'https://your-host.com/reranker.joblib'

def run_inference_query(query, bundle_url='', model_url='', query_id='', task='kis', answer=''):
    """Run inference with automatic artifact download if needed"""
    try:
        # Ensure we're in the right directory
        if not pathlib.Path('src/retrieval/use.py').exists():
            print('Missing use.py script. Make sure setup completed successfully.')
            return False

        # Build command
        cmd = ['python', 'src/retrieval/use.py', '--query', query, '--task', task]
        if query_id:
            cmd.extend(['--query_id', query_id])
        if task == 'vqa':
            if not answer:
                print('For TASK=vqa you must set ANSWER.')
                return False
            cmd.extend(['--answer', answer])

        if bundle_url:
            cmd.extend(['--bundle_url', bundle_url])
            print(f'Will download artifacts bundle from: {bundle_url}')

        if model_url:
            cmd.extend(['--model_url', model_url])
            print(f'Will download reranker model from: {model_url}')

        # Create submissions directory if it doesn't exist
        os.makedirs('submissions', exist_ok=True)

        print(f"Running query: '{query}' (task={task}, qid={query_id})")
        print('Command:', ' '.join(cmd))

        # Execute the command
        result = subprocess.run(cmd, capture_output=True, text=True)

        if result.returncode == 0:
            print('Query execution successful!')
            if result.stdout:
                print('Output:\n' + result.stdout)

            # List generated files
            submissions_dir = pathlib.Path('submissions')
            if submissions_dir.exists():
                csv_files = list(submissions_dir.glob('*.csv'))
                if csv_files:
                    print('\nGenerated ' + str(len(csv_files)) + ' result file(s):')
                    for csv_file in csv_files:
                        print(f'  - {csv_file}')
                        # Show first few lines of the CSV
                        try:
                            with open(csv_file, 'r') as f:
                                lines = f.readlines()[:5]
                                print('    Preview (first 5 lines):')
                                for i, line in enumerate(lines, 1):
                                    print(f'    {i}: {line.strip()}')
                        except Exception as e:
                            print(f'    (Could not preview: {e})')
            return True
        else:
            print('Query execution failed!')
            print('Error output:\n' + result.stderr)
            return False

    except Exception as e:
        print(f'Error running inference: {e}')
        return False

# Run the inference
print('Starting AIC FTML Host Inference...')
success = run_inference_query(QUERY, ARTIFACTS_BUNDLE_URL, RERANKER_MODEL_URL, QUERY_ID, TASK, ANSWER)

if success:
    print('\nInference completed! Check the submissions/ folder for results.')
else:
    print('\nInference failed. Check the error messages above.')

## Dev Pipeline – Build Artifacts (Optional)
Downloads the dataset archives listed in `AIC_2025_dataset_download_link.csv`, builds the index and text corpus, optionally trains a reranker, and assembles `my_pipeline/`.
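
The build cells in this section each launch a script and echo its output line by line using the same `Popen` loop. A small helper along the following lines could factor that loop out. The name `run_streaming` and the `prefix` parameter are hypothetical and not part of the repository.

```python
import subprocess
import sys


def run_streaming(cmd, prefix=""):
    """Run cmd, echo each output line as it arrives, return (returncode, lines)."""
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so errors appear in order
        text=True,
        bufsize=1,  # line-buffered in text mode
    ) as process:
        for line in process.stdout:
            line = line.rstrip("\n")
            lines.append(line)
            print(f"{prefix}{line}", flush=True)
    # Leaving the with-block waits for the process, so returncode is set.
    return process.returncode, lines


# Demo with a trivial child process instead of a real pipeline script.
code, out = run_streaming(
    [sys.executable, "-c", "print('hello'); print('done')"],
    prefix="[DEMO] ",
)
print("exit code:", code)
```

Using `sys.executable` rather than a bare `python` ensures the child process runs in the same interpreter and environment as the notebook kernel.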

### Configuration
Set your preferences here - run this cell first to configure the pipeline.

In [None]:
# Configuration - Run this cell first
import os
import subprocess
import pathlib
import time
import csv
import tempfile

# Configuration - Set your environment
IS_COLAB = True  # Set to False for local environment

# Configuration - Dataset root based on environment
# Both Colab and local keep dataset alongside AIC_FTML_dev directory
DATASET_ROOT = '/content/aic2025' if IS_COLAB else '../aic2025'
TEST_MODE = True  # Set to False to process the full VIDEOS list (test mode keeps only L21 and L22)
VIDEOS = ['L21', 'L22', 'L23', 'L24', 'L25', 'L26', 'L27', 'L28', 'L29', 'L30']  # adjust if needed
CSV_FILE = 'AIC_2025_dataset_download_link.csv'  # Update path if different

# Apply test mode if enabled
if TEST_MODE:
    VIDEOS = ['L21', 'L22']
    print("TEST MODE ENABLED: Only processing L21-L22")
else:
    print("Using full video list:", VIDEOS)

def filter_csv_for_videos(csv_path, video_list, output_path):
    """Filter the CSV file to only include entries for specified videos + essential metadata"""
    if not pathlib.Path(csv_path).exists():
        return False

    filtered_rows = []
    with open(csv_path, 'r', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header:
            filtered_rows.append(header)

        for row in reader:
            if not row:
                continue
            # Check if any of our target videos appear in the filename
            filename = row[-2].strip() if len(row) >= 2 else ""
            filename_upper = filename.upper()

            # Always include essential metadata files (needed for all videos)
            essential_files = [
                'MAP-KEYFRAMES-AIC25-B1.ZIP',
                'MEDIA-INFO-AIC25-B1.ZIP',
                'OBJECTS-AIC25-B1.ZIP',
                'CLIP-FEATURES-32-AIC25-B1.ZIP'
            ]

            is_essential = any(essential in filename_upper for essential in essential_files)
            is_target_video = any(vid in filename_upper for vid in video_list)

            if is_essential or is_target_video:
                filtered_rows.append(row)

    # Write filtered CSV
    with open(output_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerows(filtered_rows)

    print(f"Filtered CSV: {len(filtered_rows)-1} entries for videos {video_list} + essential metadata")
    return True

print("Configuration loaded successfully!")
print(f"Environment: {'Google Colab' if IS_COLAB else 'Local'}")
print(f"Dataset root: {DATASET_ROOT}")
print(f"Videos to process: {VIDEOS}")

In [None]:
# Step 1: Download Dataset (Skip this cell if data already downloaded)
from tqdm import tqdm
print("Step 1: Download dataset")
start_time = time.time()

if pathlib.Path(CSV_FILE).exists():
    # Create filtered CSV for our target videos
    with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as tmp_csv:
        filtered_csv_path = tmp_csv.name

    if filter_csv_for_videos(CSV_FILE, VIDEOS, filtered_csv_path):
        print("Starting dataset download with progress tracking...")
        cmd = [
            'python', 'scripts/dataset_downloader.py',
            '--dataset_root', DATASET_ROOT,
            '--csv', filtered_csv_path,
            '--skip-existing'
        ]
        print(f"Command: {' '.join(cmd)}")

        # Run with real-time output to show download progress
        import sys
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                 universal_newlines=True, bufsize=1)

        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print(output.strip())
                sys.stdout.flush()

        return_code = process.poll()

        # Clean up temp file (ignore the error if it is already gone)
        try:
            os.unlink(filtered_csv_path)
        except OSError:
            pass

        if return_code != 0:
            print("Dataset download failed!")
            raise Exception("Download failed")

        elapsed = time.time() - start_time
        print(f"Dataset download completed in {elapsed:.1f} seconds")

        # Debug: Check what was actually extracted
        print("\nChecking extracted structure:")
        dataset_path = pathlib.Path(DATASET_ROOT)
        if dataset_path.exists():
            for subdir in ['videos', 'keyframes', 'map_keyframes', 'media_info', 'objects', 'features']:
                subdir_path = dataset_path / subdir
                if subdir_path.exists():
                    files = list(subdir_path.rglob('*'))
                    print(f"  {subdir}/: {len(files)} items")
                    # Show first few items
                    for item in files[:5]:
                        rel_path = item.relative_to(subdir_path)
                        item_type = "DIR" if item.is_dir() else "FILE"
                        print(f"    {item_type}: {rel_path}")
                    if len(files) > 5:
                        print(f"    ... and {len(files) - 5} more")
                else:
                    print(f"  {subdir}/: NOT FOUND")
    else:
        print("Failed to filter CSV file")
        raise Exception("CSV filtering failed")
else:
    print(f"CSV file {CSV_FILE} not found. Make sure it exists in the current directory.")
    raise Exception("CSV file not found")

In [None]:
# HOTFIX: Reorganize video files that are in video/ subfolders
import os
import shutil
from pathlib import Path
def fix_video_structure(dataset_root):
    """Fix video files that are sitting in video/ subfolders"""
    extracted_tmp = Path(dataset_root) / "_extracted_tmp"
    videos_dir = Path(dataset_root) / "videos"
    if not extracted_tmp.exists():
        print("No _extracted_tmp found, skipping video fix")
        return
    # Find all Videos_* directories
    for item in extracted_tmp.rglob("*"):
        if item.is_dir() and item.name.lower().startswith("videos_"):
            print(f"Found Videos directory: {item}")
            # Check for video subfolder
            video_subdir = item / "video"
            if video_subdir.exists() and video_subdir.is_dir():
                print(f"  Found video subfolder: {video_subdir}")
                # Copy all mp4 files from video subfolder
                for mp4_file in video_subdir.glob("*.mp4"):
                    dst = videos_dir / mp4_file.name
                    if not dst.exists():
                        videos_dir.mkdir(parents=True, exist_ok=True)
                        shutil.move(str(mp4_file), str(dst))
                        print(f"    Moved: {mp4_file.name}")
                    else:
                        print(f"    Skip (exists): {mp4_file.name}")
# Run the hotfix
fix_video_structure(DATASET_ROOT)
print("Video structure hotfix complete!")

In [None]:
# Step 2: Modern Segment-First Pipeline with TransNetV2
import sys  # needed for sys.stdout.flush() below, even if the download cell was skipped
print("Step 2: Segment-first pipeline with TransNetV2 + GPU-accelerated indexing")
start_time = time.time()

# Check for videos directory
videos_dir = pathlib.Path(DATASET_ROOT) / "videos"
if not videos_dir.exists():
    print(f"Warning: {videos_dir} not found, using dataset root as video directory")
    videos_dir = pathlib.Path(DATASET_ROOT)

# Check if GPU is available
use_gpu = False
try:
    import torch
    use_gpu = torch.cuda.is_available()
    if use_gpu:
        print("GPU detected - using TransNetV2 + GPU-accelerated FAISS")
    else:
        print("No GPU detected - using OpenCV segmentation + CPU FAISS")
except Exception:
    print("Could not detect GPU - using CPU processing")

# Step 2a: Video Segmentation with TransNetV2
print("\nStep 2a: Video segmentation...")
cmd = [
    'python', 'scripts/segment_videos.py',
    '--dataset_root', DATASET_ROOT,
    '--videos'
] + VIDEOS + [
    '--artifact_dir', './artifacts'
]

# Add TransNetV2 flag if GPU available
if use_gpu:
    cmd.append('--use_transnetv2')
    print("Using TransNetV2 for deep learning shot boundary detection")
else:
    print("Using OpenCV fallback for shot boundary detection")

print(f"Segmentation command: {' '.join(cmd)}")

# Run segmentation
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                         universal_newlines=True, bufsize=1)

while True:
    output = process.stdout.readline()
    if output == '' and process.poll() is not None:
        break
    if output:
        print(f"[SEGMENT] {output.strip()}")
        sys.stdout.flush()

if process.poll() != 0:
    print("Video segmentation failed!")
    raise Exception("Segmentation failed")

print("✅ Video segmentation completed")

# Step 2b: Build Search Index
print("\nStep 2b: Building search index...")
cmd = [
    'python', 'scripts/index.py',
    '--dataset_root', DATASET_ROOT,
    '--videos'
] + VIDEOS + [
    '--segments', './artifacts/segments.parquet'
]

print(f"Indexing command: {' '.join(cmd)}")

# Run indexing
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                         universal_newlines=True, bufsize=1)

while True:
    output = process.stdout.readline()
    if output == '' and process.poll() is not None:
        break
    if output:
        print(f"[INDEX] {output.strip()}")
        sys.stdout.flush()

if process.poll() != 0:
    print("Index building failed!")
    raise Exception("Indexing failed")

print("✅ Search index completed")

# Step 2c: Build Text Corpus (with ASR if available)
print("\nStep 2c: Building text corpus...")
cmd = [
    'python', 'scripts/build_text.py',
    '--dataset_root', DATASET_ROOT,
    '--videos'
] + VIDEOS + [
    '--artifact_dir', './artifacts',
    '--segments', './artifacts/segments.parquet'
]

print(f"Text corpus command: {' '.join(cmd)}")

# Run text corpus building
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                         universal_newlines=True, bufsize=1)

while True:
    output = process.stdout.readline()
    if output == '' and process.poll() is not None:
        break
    if output:
        print(f"[TEXT] {output.strip()}")
        sys.stdout.flush()

if process.poll() != 0:
    print("Text corpus building failed!")
    raise Exception("Text corpus building failed")

print("✅ Text corpus completed")

elapsed = time.time() - start_time
print(f"\n🚀 Modern segment-first pipeline completed in {elapsed:.1f} seconds")
print("📈 Key advantages:")
print("  - TransNetV2 deep learning shot detection")
print("  - GPU-accelerated FAISS indexing")  
print("  - Segment-level retrieval precision")
print("  - ASR corpus integration ready")

In [None]:
# Step 3: Check Pipeline Results
print("Step 3: Checking segment-first pipeline results")

# Check generated artifacts
artifacts_dir = pathlib.Path('./artifacts')
if artifacts_dir.exists():
    artifact_files = list(artifacts_dir.glob('*'))
    print(f"✅ Generated {len(artifact_files)} artifact files:")
    for artifact in sorted(artifact_files):
        if artifact.is_file():
            size = artifact.stat().st_size
            print(f"  - {artifact.name} ({size:,} bytes)")
        elif artifact.is_dir():
            file_count = len(list(artifact.glob('*')))
            print(f"  - {artifact.name}/ ({file_count} files)")

    # Check for expected artifacts from segment-first pipeline
    expected_artifacts = ['segments.parquet', 'index.faiss', 'mapping.parquet', 'text_corpus.jsonl']
    missing_artifacts = []
    for expected in expected_artifacts:
        if not (artifacts_dir / expected).exists():
            missing_artifacts.append(expected)
    
    if missing_artifacts:
        print(f"\n⚠️  Missing expected artifacts: {missing_artifacts}")
    else:
        print(f"\n🚀 Segment-first pipeline completed! All artifacts ready in ./artifacts/")
        print("📈 Pipeline components:")
        print("  - segments.parquet: Video segments with TransNetV2/OpenCV detection")
        print("  - index.faiss: GPU-accelerated visual search index") 
        print("  - mapping.parquet: Frame-to-video mapping")
        print("  - text_corpus.jsonl: Text search corpus (with ASR if available)")
else:
    print("❌ No artifacts directory found - pipeline may have failed")

In [None]:
# Step 4: Generate Training Data and Train Reranker Model
import os
import pathlib
import subprocess
import time

def create_and_train_reranker():
    """Generate training data from metadata and train reranker model"""
    try:
        # Step 1: Generate training data from competition metadata
        print("Step 4a: Generating training data from competition metadata...")

        cmd = [
            'python', 'scripts/create_training_data.py',
            '--dataset_root', DATASET_ROOT,
            '--output', 'data/train.jsonl',
            '--num_examples', '100'
        ]
        print(f"Command: {' '.join(cmd)}")

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Training data generation failed: {result.stderr}")
            print("Will proceed with fusion baseline (no reranker training)")
            return True  # Don't fail pipeline, just use baseline

        print("Training data generated successfully!")
        if result.stdout:
            print(result.stdout)

        # Step 2: Train the reranker model
        print("\nStep 4b: Training reranker model...")

        cmd = [
            'python', 'src/training/train_reranker.py',
            '--index_dir', './artifacts',
            '--train_jsonl', 'data/train.jsonl'
        ]
        print(f"Command: {' '.join(cmd)}")

        start_time = time.time()

        # Run with real-time output for progress tracking
        import sys
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                 universal_newlines=True, bufsize=1)

        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print(output.strip())
                sys.stdout.flush()

        return_code = process.poll()
        elapsed_time = time.time() - start_time

        if return_code == 0:
            print(f"\nReranker training completed successfully in {elapsed_time:.1f}s!")

            # Check if model was created
            model_file = pathlib.Path('./artifacts/reranker.joblib')
            if model_file.exists():
                model_size = model_file.stat().st_size
                print(f"Model saved: {model_file} ({model_size:,} bytes)")

            return True
        else:
            print(f"Reranker training failed after {elapsed_time:.1f}s!")
            print("Will use fusion baseline instead")
            return True  # Don't fail pipeline, baseline still works

    except Exception as e:
        print(f"Training error: {e}")
        print("Will use fusion baseline instead")
        return True  # Don't fail pipeline

# Run training data creation and reranker training
print("Creating training data and training reranker...")
train_success = create_and_train_reranker()

if train_success:
    print("✅ Reranker training step completed!")

    # Check what we ended up with
    model_file = pathlib.Path('./artifacts/reranker.joblib')
    if model_file.exists():
        print("🎯 Using trained reranker model for enhanced search results")
    else:
        print("📊 Using fusion baseline (RRF) for reliable search results")
else:
    print("❌ Training failed, but pipeline can continue with fusion baseline.")

Creating training data and training reranker...
Step 4a: Generating training data from competition metadata...
Command: python scripts/create_training_data.py --dataset_root /content/aic2025 --output data/train.jsonl --num_examples 100
Training data generated successfully!
Using metadata from:
  Media info: /content/aic2025/media_info
  Keyframes: /content/aic2025/map_keyframes
Found 873 videos with complete metadata
Sampling 20 videos for training data generation
Generated 80 training examples
Training data saved to data/train.jsonl

✅ Training data creation successful!
Next step: Train the reranker with:
python src/training/train_reranker.py --index_dir ./artifacts --train_jsonl data/train.jsonl


Step 4b: Training reranker model...
Command: python src/training/train_reranker.py --index_dir ./artifacts --train_jsonl data/train.jsonl
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable

In [None]:
# Assemble Pipeline and Test Query
import os
import subprocess
import pathlib
import shutil
import time

# Configuration
PIPELINE_DIR = 'my_pipeline'
TEST_QUERY = 'a person opening a laptop'

def assemble_and_test_pipeline():
    """Assemble minimal pipeline directory and run a test query"""
    try:
        # Step 1: Prepare pipeline directory
        print("Assembling minimal pipeline directory...")

        cmd = [
            'python', 'scripts/prepare_pipeline_dir.py',
            '--outdir', PIPELINE_DIR,
            '--artifact_dir', './artifacts',
            '--include_model',
            '--force'
        ]
        print(f"Command: {' '.join(cmd)}")

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Pipeline assembly failed: {result.stderr}")
            return False

        print("Pipeline directory assembled successfully!")

        # Show pipeline contents
        pipeline_path = pathlib.Path(PIPELINE_DIR)
        if pipeline_path.exists():
            print(f"\nPipeline directory contents ({PIPELINE_DIR}/):")
            for item in sorted(pipeline_path.rglob('*')):
                if item.is_file():
                    rel_path = item.relative_to(pipeline_path)
                    size = item.stat().st_size
                    print(f"  FILE {rel_path} ({size:,} bytes)")
                elif item.is_dir() and item != pipeline_path:
                    rel_path = item.relative_to(pipeline_path)
                    file_count = len(list(item.glob('*')))
                    print(f"  DIR  {rel_path}/ ({file_count} files)")

        # Step 2: Test the pipeline
        print(f"\nTesting pipeline with query: '{TEST_QUERY}'")

        # Change to pipeline directory for testing
        original_dir = os.getcwd()
        os.chdir(PIPELINE_DIR)

        try:
            cmd = ['python', 'src/retrieval/use.py', '--query', TEST_QUERY]
            print(f"Command: {' '.join(cmd)}")

            start_time = time.time()

            # Stream output line by line so import errors show up immediately
            process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                       text=True, bufsize=1)

            output_lines = []
            # Iterating the pipe yields lines until the child closes stdout (EOF)
            for output in process.stdout:
                # rstrip() keeps leading indentation, so tracebacks stay readable
                line = output.rstrip()
                output_lines.append(line)
                print(line, flush=True)

            # wait() reaps the child process and returns its exit code
            return_code = process.wait()
            elapsed_time = time.time() - start_time

            if return_code == 0:
                print(f"\n✅ Test query completed successfully in {elapsed_time:.1f}s!")

                # Show results
                submissions_dir = pathlib.Path('submissions')
                if submissions_dir.exists():
                    csv_files = list(submissions_dir.glob('*.csv'))
                    if csv_files:
                        print(f"\nGenerated {len(csv_files)} result file(s):")
                        for csv_file in csv_files:
                            print(f"  - {csv_file}")
                            # Show the first few lines and count all lines in a single pass
                            try:
                                total_lines = 0
                                with open(csv_file, 'r', encoding='utf-8') as f:
                                    print("    Sample results (first 3 lines):")
                                    for total_lines, line in enumerate(f, 1):
                                        if total_lines <= 3:
                                            print(f"    {total_lines}: {line.strip()}")
                                print(f"    Total results: {total_lines}")
                            except (OSError, UnicodeDecodeError) as e:
                                print(f"    (Could not read file: {e})")

                return True
            else:
                print(f"\n❌ Test query failed after {elapsed_time:.1f}s!")

                # Show lines that look like errors; "ImportError" and similar
                # exception names already contain "Error"
                error_found = False
                for line in output_lines:
                    if "Error" in line or "Traceback" in line:
                        error_found = True
                        print(f"Error: {line}")

                if not error_found and output_lines:
                    print("Last few lines of output:")
                    for line in output_lines[-5:]:
                        print(f"  {line}")

                return False

        finally:
            # Return to original directory
            os.chdir(original_dir)

    except Exception as e:
        print(f"Assembly/test error: {e}")
        return False

# Run assembly and test
print("Starting pipeline assembly and testing...")
success = assemble_and_test_pipeline()

if success:
    print("\n🎉 Pipeline assembled and tested successfully!")
    print(f"Ready-to-deploy pipeline is in: {PIPELINE_DIR}/")
    print("\nNext steps:")
    print(f"  1. Upload {PIPELINE_DIR}/ to your deployment environment")
    print("  2. Run queries using: python src/retrieval/use.py --query 'your search'")
    print("  3. Find results in submissions/ folder")
else:
    print("\n⚠️  Pipeline assembly or testing failed. Check errors above.")

# Show final summary
print("\nDevelopment Summary:")
print(f"  Pipeline directory: {PIPELINE_DIR}/")
print(f"  Test query: '{TEST_QUERY}'")
print(f"  Results location: {PIPELINE_DIR}/submissions/")

# Check whether a trained reranker was packaged with the pipeline
model_file = pathlib.Path(PIPELINE_DIR) / 'artifacts' / 'reranker.joblib'
if model_file.exists():
    print(f"  Model: Trained reranker included ({model_file.stat().st_size:,} bytes)")
else:
    print("  Model: Using fusion baseline (RRF)")

Starting pipeline assembly and testing...
Assembling minimal pipeline directory...
Command: python scripts/prepare_pipeline_dir.py --outdir my_pipeline --artifact_dir ./artifacts --include_model --force
Pipeline directory assembled successfully!

Pipeline directory contents (my_pipeline/):
  FILE README_RUN.md (324 bytes)
  DIR  artifacts/ (4 files)
  FILE artifacts/index.faiss (34,603,053 bytes)
  FILE artifacts/mapping.parquet (384,309 bytes)
  FILE artifacts/reranker.joblib (1,037 bytes)
  FILE artifacts/text_corpus.jsonl (79,541,830 bytes)
  FILE config.py (2,015 bytes)
  DIR  src/ (1 files)
  DIR  src/retrieval/ (1 files)
  FILE src/retrieval/use.py (13,280 bytes)
  DIR  submissions/ (0 files)
  FILE utils.py (1,124 bytes)

Testing pipeline with query: 'a person opening a laptop'
Command: python src/retrieval/use.py --query a person opening a laptop
[OK] wrote 100 lines → submissions/kis_a-person-opening-a-laptop.csv

✅ Test query completed successfully in 14.2s!

Generated 1 resu
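
The test step above launches the query script as a child process and echoes its output as it arrives. As a minimal standalone sketch of that pattern (the helper name `run_streaming` is hypothetical and not part of this repository), the following runs any command, merges stderr into stdout, and returns both the exit code and the captured lines:

```python
import subprocess
import sys


def run_streaming(cmd):
    """Run cmd, echo each output line as it arrives, return (returncode, lines).

    Hypothetical helper for illustration; not part of the notebook's codebase.
    """
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so tracebacks appear in order
        text=True,                 # decode bytes to str
        bufsize=1,                 # line-buffered in text mode
    ) as process:
        for raw in process.stdout:
            line = raw.rstrip()
            lines.append(line)
            print(line, flush=True)
    # Leaving the with-block closes the pipe and waits for the child to exit
    return process.returncode, lines


# Usage: run a tiny child process with the current interpreter
code, out = run_streaming([sys.executable, "-c", "print('hello'); print('world')"])
print(f"exit code: {code}, captured {len(out)} lines")
```

Merging stderr into stdout keeps a single ordered stream, which is convenient for scanning for `Traceback` lines afterwards, at the cost of no longer being able to tell the two streams apart.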

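## Official Evaluation
Set the path to your ground-truth JSON file and the task to evaluate (`kis`, `vqa`, or `trake`), then run the cell below. It scores the CSV files in `submissions/`.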

# Configure evaluation
GT_PATH = 'ground_truth.json'   # update path (e.g., /content/drive/MyDrive/gt.json)
TASK_EVAL = 'kis'               # 'kis', 'vqa', or 'trake'
NORMALIZE_ANS = False           # True to casefold VQA answers

import subprocess
cmd = ['python', 'eval/evaluate.py', '--gt', GT_PATH, '--pred_dir', 'submissions', '--task', TASK_EVAL]
if NORMALIZE_ANS:
    cmd.append('--normalize_answer')
print('Evaluating:', ' '.join(cmd))
res = subprocess.run(cmd, capture_output=True, text=True)
print(res.stdout)
if res.returncode:
    print(res.stderr)
    raise SystemExit(res.returncode)
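
Before launching the evaluator, it can help to validate the inputs so that a typo in the task name or a missing file fails fast with a clear message. The sketch below is one possible way to do that. The helper names `build_eval_cmd` and `missing_inputs` are hypothetical; the script path and the flags (`--gt`, `--pred_dir`, `--task`, `--normalize_answer`) are taken from the cell above, and nothing is assumed about the ground-truth file's internal format.

```python
import sys
from pathlib import Path

# Task names as listed in the configuration comment of the evaluation cell
VALID_TASKS = ("kis", "vqa", "trake")


def build_eval_cmd(gt_path, task, pred_dir="submissions", normalize_answer=False):
    """Build the evaluate.py command line (hypothetical helper)."""
    if task not in VALID_TASKS:
        raise ValueError(f"Unknown task {task!r}; expected one of {VALID_TASKS}")
    cmd = [
        sys.executable, "eval/evaluate.py",  # current interpreter instead of bare 'python'
        "--gt", str(gt_path),
        "--pred_dir", str(pred_dir),
        "--task", task,
    ]
    if normalize_answer:
        cmd.append("--normalize_answer")
    return cmd


def missing_inputs(gt_path, pred_dir="submissions"):
    """Return human-readable problems; an empty list means inputs look usable."""
    problems = []
    if not Path(gt_path).is_file():
        problems.append(f"ground truth file not found: {gt_path}")
    pred = Path(pred_dir)
    if not pred.is_dir() or not any(pred.glob("*.csv")):
        problems.append(f"no CSV predictions found in: {pred_dir}")
    return problems


# Usage: inspect the command and any problems before running it
print(" ".join(build_eval_cmd("ground_truth.json", "kis")[1:]))
for problem in missing_inputs("ground_truth.json"):
    print("Problem:", problem)
```

The returned list can be passed to `subprocess.run` exactly as in the cell above once `missing_inputs` comes back empty.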
