# Intelligent Transportation Post-Training with Cosmos Reason 2

This notebook demonstrates how to fine-tune **NVIDIA Cosmos Reason 2** for intelligent transportation scene understanding using the **Woven Traffic Safety (WTS) Dataset**.

## Overview

Supervised Fine-Tuning (SFT) aligns pre-trained models to specific tasks by showing clear input-output pairs. In this notebook, we fine-tune Cosmos Reason 2-8B to understand traffic scenes - including road attributes, pedestrian situations, and vehicle behavior.

### What You'll Learn
- Explore and visualize the WTS dataset
- Configure training hyperparameters for optimal performance
- Understand vision token calculations for different frame sampling strategies
- Train and evaluate the model
- Deploy with FP8 quantization and NVIDIA NIM

### Table of Contents
1. Environment Setup
2. Dataset Exploration
3. Zero-Shot Inference
4. Training Configuration
5. Run Training
6. Run Evaluation
7. Fine-Tuned Inference
8. Deployment

### Prerequisites
- Downloaded WTS Dataset from [Woven by Toyota](https://woven-visionai.github.io/wts-dataset-homepage/)
- NVIDIA GPUs (A100 recommended)
- Python 3.10+ with pip or uv

---

**Reference:** [NVIDIA Cosmos Cookbook - Intelligent Transportation Post-Training](https://nvidia-cosmos.github.io/cosmos-cookbook/recipes/post_training/reason2/intelligent-transportation/post_training.html)

## Environment Setup (Recommended: uv)

Use uv to set up Cosmos Reason 2 and Cosmos-RL quickly. This step can take several minutes and requires sufficient disk space.
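
Before starting the install, it can help to confirm that enough disk space is free. The check below is a minimal sketch using only the standard library; the `REQUIRED_GB` threshold and the `free_disk_gb` helper are hypothetical names chosen for illustration, not values taken from the Cosmos documentation.

```python
import shutil

# Hypothetical threshold for illustration; adjust to your model and dataset sizes.
REQUIRED_GB = 100


def free_disk_gb(path="."):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3)


free_gb = free_disk_gb(".")
status = "OK" if free_gb >= REQUIRED_GB else "LOW"
print(f"Free disk space: {free_gb:.1f} GiB ({status}, assumed requirement: {REQUIRED_GB} GiB)")
```

Point `path` at the volume that will hold checkpoints and the dataset if it differs from the working directory.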

In [None]:
import sys
import os

PYTHON = sys.executable
print(f"Using Python: {PYTHON}")

# Repo paths (keep consistent with the Configuration section below)
# Path to the cloned cosmos-reason2 repository
COSMOS_REASON2_REPO = "/home/ubuntu/cosmos-reason2"

# Path to the cloned cosmos-cookbook repository
COSMOS_COOKBOOK_REPO = "/home/ubuntu/cosmos-cookbook"

# Cosmos-RL directory (inside the cosmos-reason2 repo)
COSMOS_RL_PATH = f"{COSMOS_REASON2_REPO}/examples/cosmos_rl"

# Install ffmpeg and redis-server
!sudo apt-get update
!sudo apt-get install -y ffmpeg redis-server

# Bootstrap pip
print("📦 Bootstrapping pip...")
!$PYTHON -m ensurepip --upgrade 2>/dev/null || echo "pip ready"
!$PYTHON -m pip install --upgrade pip

# Install visualization dependencies
print("\n📦 Installing visualization dependencies...")
!$PYTHON -m pip install matplotlib numpy opencv-python pyyaml tqdm requests decord

# Install uv
print("📦 Installing uv...")
!{sys.executable} -m pip install -q uv

# Clone repos if needed
!git clone https://github.com/nvidia-cosmos/cosmos-reason2.git {COSMOS_REASON2_REPO} 2>/dev/null || echo "cosmos-reason2 already exists"
!git clone https://github.com/nvidia-cosmos/cosmos-cookbook.git {COSMOS_COOKBOOK_REPO} 2>/dev/null || echo "cosmos-cookbook already exists"

print("\n📦 Setting up cosmos-reason2 prerequisites with uv sync...")
!cd {COSMOS_REASON2_REPO} && {sys.executable} -m uv sync --extra cu128

print("\n📦 Setting up cosmos-rl with uv sync...")
!cd {COSMOS_RL_PATH} && {sys.executable} -m uv sync

print("\n✅ Setup complete!")

### Alternative: Manual pip Installation (Optional)

If you prefer manual installs, use this cell instead of the uv-based setup above.

In [None]:
import sys
import os

# Path to the cloned cosmos-reason2 repository
COSMOS_REASON2_REPO = "/home/ubuntu/cosmos-reason2"

# Path to the cloned cosmos-cookbook repository
COSMOS_COOKBOOK_REPO = "/home/ubuntu/cosmos-cookbook"

PYTHON = sys.executable
print(f"Using Python: {sys.executable}\n")

# Bootstrap pip
print("📦 Bootstrapping pip...")
!$PYTHON -m ensurepip --upgrade 2>/dev/null || echo "pip ready"
!$PYTHON -m pip install --upgrade pip

# Clone repos if needed
print("\n📦 Cloning repositories...")
!git clone https://github.com/nvidia-cosmos/cosmos-reason2.git {COSMOS_REASON2_REPO} 2>/dev/null || echo "cosmos-reason2 already exists"
!git clone https://github.com/nvidia-cosmos/cosmos-cookbook.git {COSMOS_COOKBOOK_REPO} 2>/dev/null || echo "cosmos-cookbook already exists"

# Install visualization dependencies
print("\n📦 Step 1: Installing visualization dependencies...")
!$PYTHON -m pip install matplotlib numpy opencv-python pyyaml tqdm requests decord

# Install PyTorch with CUDA 12.8
print("\n📦 Step 2: Installing PyTorch with CUDA 12.8...")
!$PYTHON -m pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Verify CUDA
print("\n🔍 Verifying CUDA...")
import torch
print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else "N/A"}')

# Step 3: Install cosmos-reason2-utils and cosmos-rl from the cloned repo
print("\n📦 Step 3: Installing cosmos-reason2-utils & cosmos-rl...")
!$PYTHON -m pip install {COSMOS_REASON2_REPO}/cosmos_reason2_utils
!$PYTHON -m pip install {COSMOS_REASON2_REPO}/examples/cosmos_rl

# Step 4: Install flash-attention (builds against torch 2.8.0)
print("\n📦 Step 4: Installing flash-attention...")
# os.environ["TMPDIR"] = f"{CACHE_DIR}/tmp"
os.environ["MAX_JOBS"] = "4"
!$PYTHON -m pip uninstall flash-attn -y 2>/dev/null || true
!$PYTHON -m pip cache remove flash-attn 2>/dev/null || true
!$PYTHON -m pip install flash-attn --no-build-isolation --no-cache-dir --force-reinstall
!$PYTHON -m pip install einops

# Verify (catch Exception rather than using a bare except, and report the cause)
try:
    import flash_attn
    print(f'✅ flash-attn: {flash_attn.__version__}')
except Exception as e:
    print(f"❌ flash-attn failed: {e}")

# Step 5: Install vllm
print("\n📦 Step 5: Installing vllm...")
!$PYTHON -m pip install vllm qwen-vl-utils

# Show pip's cache location and size (CACHE_DIR is not defined in this cell)
print("\n📁 Cache usage:")
!$PYTHON -m pip cache info 2>/dev/null || echo "pip cache info unavailable"

print("\n✅ All dependencies installed!")

### Use the Cosmos Reason 2 Kernel (Required for Inference)

After installing dependencies, switch the notebook to the cosmos-reason2 venv kernel:

1. **Create a kernel for the venv (run once in a terminal):**
   ```
   cd /path/to/cosmos-reason2
   source .venv/bin/activate
   python -m pip install ipykernel
   python -m ipykernel install --user --name cosmos-reason2-venv --display-name "Cosmos-Reason"
   ```

2. **Switch the notebook kernel:**
   - In JupyterLab: **Kernel → Change Kernel…**
   - In Classic Notebook: **Kernel → Change kernel**
   - Select **Cosmos-Reason** (the display name registered in step 1)

3. **Verify in a cell:**

In [None]:
import sys
print(sys.executable)

### Alternative: Docker Container (Optional)

If you prefer running in a containerized environment, you can build and run the Cosmos Reason 2 Docker container. This requires Docker and the NVIDIA Container Toolkit.

**Build the container:**
The build command tags the image for reuse.

**CUDA Variants:**
- CUDA 12.8: `--build-arg=CUDA_VERSION=12.8.1` (default, requires NVIDIA Driver)
- CUDA 13.0: `--build-arg=CUDA_VERSION=13.0.0` (required for DGX Spark and Jetson AGX)

In [None]:
# Docker Container Build (Optional)
# Uncomment and run if using Docker instead of uv/pip

# Build the container (run from cosmos-reason2 repo directory)
# !cd {COSMOS_REASON2_REPO} && docker build -f Dockerfile --build-arg=CUDA_VERSION=12.8.1 -t cosmos-reason2:cu128 .

# For CUDA 13.0 (DGX Spark / Jetson AGX):
# !cd {COSMOS_REASON2_REPO} && docker build -f Dockerfile --build-arg=CUDA_VERSION=13.0.0 -t cosmos-reason2:cu130 .

print("Docker build commands (uncomment to run):")
print(f"  cd {COSMOS_REASON2_REPO}")
print("  docker build -f Dockerfile --build-arg=CUDA_VERSION=12.8.1 -t cosmos-reason2:cu128 .")

**Run the container:**

The container mounts the current directory to `/workspace` and preserves venv and cache directories.

In [None]:
# Docker Container Run (Optional)
# Uncomment and customize before running

# docker run -it --gpus all --ipc=host --rm \
#     -v .:/workspace \
#     -v /workspace/.venv \
#     -v /workspace/examples/cosmos_rl/.venv \
#     -v /root/.cache:/root/.cache \
#     -e HF_TOKEN="$HF_TOKEN" \
#     cosmos-reason2:cu128

print("Docker run command (uncomment to run):")
print("""docker run -it --gpus all --ipc=host --rm \\
    -v .:/workspace \\
    -v /workspace/.venv \\
    -v /workspace/examples/cosmos_rl/.venv \\
    -v /root/.cache:/root/.cache \\
    -e HF_TOKEN="$HF_TOKEN" \\
    cosmos-reason2:cu128""")

print("\nOptional arguments:")
print("  --ipc=host         Use host shared memory (torchrun needs this)")
print("  -v /root/.cache    Mount host cache to avoid re-downloads")
print("  -e HF_TOKEN        Pass HuggingFace token to container")

### Verify Installation

Confirm that core dependencies and the `cosmos-rl` CLI are available before proceeding.

In [None]:
# Verify installations
import sys
import os

print("✅ Verifying installations:\n")

# Check visualization packages
try:
    import matplotlib
    print(f"  matplotlib: {matplotlib.__version__}")
except ImportError:
    print("  matplotlib: NOT installed ✗")
try:
    import numpy
    print(f"  numpy: {numpy.__version__}")
except ImportError:
    print("  numpy: NOT installed ✗")
try:
    import cv2
    print(f"  opencv: {cv2.__version__}")
except ImportError:
    print("  opencv: NOT installed ✗")

# Check cosmos-rl venv
COSMOS_RL_PATH = "/home/ubuntu/cosmos-reason2/examples/cosmos_rl"
COSMOS_RL_BIN = f"{COSMOS_RL_PATH}/.venv/bin/cosmos-rl"

print("\n🔍 Checking cosmos-rl venv:")
!ls -la {COSMOS_RL_PATH}/.venv/bin/ 2>/dev/null | grep -E "cosmos|python" || echo "venv not found"

if os.path.exists(COSMOS_RL_BIN):
    print("\n✅ cosmos-rl found!")
    print("\n📋 cosmos-rl --help:")
    !{COSMOS_RL_BIN} --help 2>&1 | head -15
else:
    print(f"\n❌ cosmos-rl not found at {COSMOS_RL_BIN}")
    print("\n🔧 Try running uv sync manually:")
    !cd {COSMOS_RL_PATH} && {sys.executable} -m uv sync 2>&1 | tail -30

## Dataset Exploration

Before post-training a vision-language model, it helps to inspect a few samples to understand clip length, camera viewpoints, and the kinds of questions and answers available. This quick check also confirms your dataset paths are correct and that annotations align with videos.

For this notebook, we use the **Woven Traffic Safety (WTS) Dataset** (Environment VQA subset) as the example. It includes:
- **255 traffic scenarios**
- **1,200+ video segments**
- **341 videos** with **~5.6k MCQ question-answer pairs**
- **~75 seconds** average video length

Let's load and display a sample video from the dataset.

## Configuration

Set the dataset, model, and repo paths once here. The rest of the notebook references these variables.

In [None]:
# Setup and Imports
import os
import json
import random
from pathlib import Path
from IPython.display import display, Video, HTML, Image
import matplotlib.pyplot as plt
import numpy as np

# ==============================================================================
# CONFIGURATION - Update these paths before running the notebook
# ==============================================================================

# --- Repository Paths ---
# If you cloned the repos elsewhere, update these to match.
# Path to the cloned cosmos-reason2 repository
COSMOS_REASON2_REPO = "/home/ubuntu/cosmos-reason2"

# Path to the cloned cosmos-cookbook repository (contains training scripts)
COSMOS_COOKBOOK_REPO = "/home/ubuntu/cosmos-cookbook"

# --- Dataset Paths ---
# Training dataset directory (should contain videos/ and annotations.json)
TRAIN_DATA_PATH = "/ephemeral/wts_data_train"

# Validation dataset directory (should contain videos/ and annotations.json)
VAL_DATA_PATH = "/ephemeral/wts_data_val"

# --- Model Paths ---
# Base model checkpoint (local path for Cosmos Reason 2 checkpoints or HuggingFace ID "nvidia/Cosmos-Reason2-8B")
BASE_MODEL_PATH = "/ephemeral/Cosmos-Reason2-8B"

# Output directory for fine-tuned model checkpoints
FINETUNED_MODEL_PATH = "/ephemeral/finetuned_model"

# Example video for quick testing (update to your video path)
EXAMPLE_VIDEO_PATH = "/home/ubuntu/example_video_wts.mp4"

# --- Derived Paths (computed from above, usually no need to edit) ---
TRAIN_VIDEOS_PATH = f"{TRAIN_DATA_PATH}/videos"
TRAIN_ANNOTATIONS_PATH = f"{TRAIN_DATA_PATH}/annotations.json"
VAL_VIDEOS_PATH = f"{VAL_DATA_PATH}/videos"
VAL_ANNOTATIONS_PATH = f"{VAL_DATA_PATH}/annotations.json"

# Cosmos-RL directory (inside the cosmos-reason2 repo)
COSMOS_RL_PATH = f"{COSMOS_REASON2_REPO}/examples/cosmos_rl"

print("Configuration:")
print(f"  Train Dataset Path:   {TRAIN_DATA_PATH}")
print(f"  Validation Dataset:   {VAL_DATA_PATH}")
print(f"  Base Model Path:       {BASE_MODEL_PATH}")
print(f"  Fine-Tuned Model Path: {FINETUNED_MODEL_PATH}")
print(f"  Example Video Path:    {EXAMPLE_VIDEO_PATH}")
print(f"  Cosmos Reason2 Repo:   {COSMOS_REASON2_REPO}")
print(f"  Cosmos Cookbook Repo:  {COSMOS_COOKBOOK_REPO}")

### Authenticate to Hugging Face (Optional)

If you are downloading the Cosmos Reason 2 model from Hugging Face (e.g., `nvidia/Cosmos-Reason2-8B`), you need to authenticate. This cell prompts for your HF token and performs authentication.

In [None]:
# HuggingFace Authentication (Optional)
import subprocess
import getpass
from IPython.display import display, HTML
import time

display(HTML('<a href="https://huggingface.co/settings/tokens" target="_blank" style="font-size:16px;">🔑 Get HuggingFace Token</a>'))
time.sleep(2)

hf_token = getpass.getpass("HuggingFace Token (leave blank to skip): ").strip()

if hf_token:
    result = subprocess.run(
        ["uvx", "hf", "auth", "login", "--token", hf_token],
        capture_output=True, text=True
    )
    print("✅ HuggingFace login successful" if result.returncode == 0 else f"❌ Failed: {result.stderr}")
else:
    print("⏭️ Skipped HuggingFace authentication")

### Video Helper Utilities

These helpers list videos, display metadata, and sample frames so you can quickly validate the dataset contents.

In [None]:
def list_videos(video_dir, num_samples=5):
    """List available videos in the dataset directory."""
    video_extensions = ['.mp4', '.avi', '.mov', '.mkv']
    videos = []
    
    video_path = Path(video_dir)
    if video_path.exists():
        for ext in video_extensions:
            videos.extend(list(video_path.rglob(f"*{ext}")))
    
    return videos[:num_samples] if videos else []

def display_video_with_info(video_path, width=640):
    """Display a video with metadata information."""
    import cv2
    
    cap = cv2.VideoCapture(str(video_path))
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    width_px = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height_px = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    duration = frame_count / fps if fps > 0 else 0
    cap.release()
    
    print(f"📹 Video: {video_path.name}")
    print(f"   Resolution: {width_px} x {height_px}")
    print(f"   FPS: {fps:.2f}")
    print(f"   Duration: {duration:.2f} seconds")
    print(f"   Total Frames: {frame_count}")
    
    return Video(str(video_path), embed=True, width=width)

def extract_sample_frames(video_path, num_frames=8):
    """Extract uniformly sampled frames from a video (mimics nframes=8 config)."""
    import cv2
    
    cap = cv2.VideoCapture(str(video_path))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    
    # Uniformly sample frame indices
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(frame_rgb)
    cap.release()
    
    return frames, indices

# List and display sample videos
print("🔍 Searching for videos in WTS dataset...\n")

train_videos_path = os.path.join(TRAIN_VIDEOS_PATH)
sample_videos = list_videos(train_videos_path)

if sample_videos:
    print(f"Found {len(sample_videos)} sample videos:\n")
    for i, v in enumerate(sample_videos):
        print(f"  {i+1}. {v.name}")
    
    # Display the first video
    print("\n" + "="*60)
    print("Displaying first video:")
    print("="*60 + "\n")
    
    display(display_video_with_info(sample_videos[0]))
else:
    print("⚠️ No videos found. Please update TRAIN_VIDEOS_PATH to your dataset location.")
    print(f"   Current path: {train_videos_path}")

## Dataset Labels and Annotations

The WTS dataset provides rich annotations including:
- **Textual descriptions** of pedestrian and vehicle behavior
- **Traffic VQA** with multiple-choice questions (MCQ)

The data is converted to the **LLaVA dataset format** for training: a JSON structure with conversation pairs between human queries and expected VLM responses.
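
To make the structure concrete, here is a sketch of what a single entry might look like and how its prompt splits into a question and options. Only the `conversations` list, the `value` field, and the `<video>` placeholder are confirmed by the code in this notebook; the `id`, `video`, and `from` fields, as well as the question text itself, are hypothetical and follow the common LLaVA convention.

```python
import json

# Hypothetical entry for illustration; real WTS annotations may use different
# field names and values beyond "conversations" and "value".
example = {
    "id": "sample_0001",
    "video": "sample_0001.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nWhat is the road surface condition?\nA: Dry\nB: Wet\nC: Snowy",
        },
        {"from": "gpt", "value": "A"},
    ],
}

# Split the human prompt into a question line and option lines.
prompt = example["conversations"][0]["value"]
lines = [ln.strip() for ln in prompt.replace("<video>", " ").splitlines() if ln.strip()]
question, options = lines[0], lines[1:]
answer = example["conversations"][1]["value"]

print(json.dumps(example, indent=2))
print(f"Question: {question}")
print(f"Options:  {options}")
print(f"Answer:   {answer}")
```

The first turn carries the prompt with the video placeholder, and the second turn carries the expected answer token.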

In [None]:
def display_llava_format(example):
    """Pretty print a LLaVA-format example from the dataset."""
    print("📋 LLaVA Dataset Format Example (from WTS):")
    print("="*60)
    print(json.dumps(example, indent=2))
    print("="*60)


def parse_mcq_text(text):
    """Parse MCQ question/options from the WTS Llava-format prompt."""
    cleaned = text.replace("<video>", " ").strip()
    lines = [line.strip() for line in cleaned.splitlines() if line.strip()]
    question = lines[0] if lines else ""
    options = lines[1:] if len(lines) > 1 else []
    return question, options


def is_correct_option(option, answer):
    """Mark the correct option based on the answer token (e.g., 'A')."""
    opt = option.strip()
    ans = answer.strip()
    if not ans:
        return False
    prefixes = [f"{ans}:", f"{ans})", f"{ans}.", f"{ans} "]
    return opt == ans or any(opt.startswith(prefix) for prefix in prefixes)


# Load actual MCQ examples from the training annotations
annotations_path = TRAIN_ANNOTATIONS_PATH

if not os.path.exists(annotations_path):
    print("⚠️ annotations.json not found. Update TRAIN_DATA_PATH to your dataset location.")
    print(f"   Current path: {annotations_path}")
else:
    with open(annotations_path, "r") as f:
        annotations = json.load(f)

    # Display a real Llava-format entry
    if annotations:
        display_llava_format(annotations[0])

    # Display a few actual MCQ questions
    print("\n\n📝 Sample MCQ Questions from the Training Set:")
    print("="*60)
    for i, ann in enumerate(annotations[:4], 1):
        question_text, options = parse_mcq_text(ann["conversations"][0]["value"])
        answer = ann["conversations"][1]["value"]
        print(f"\nQ{i}: {question_text}")
        for opt in options:
            marker = "✓" if is_correct_option(opt, answer) else " "
            print(f"   [{marker}] {opt}")
    print("\n" + "="*60)

## Inference Helper Class

Define a reusable inference helper so we can run zero-shot evaluation before training and reuse the same logic after fine-tuning.

In [None]:
# Inference Class for Cosmos Reason 2
class CosmosReason2Inference:
    """
    Inference wrapper for Cosmos Reason 2 (base or fine-tuned checkpoints).
    """
    
    def __init__(self, model_path, nframes=8, max_tokens=512):
        """
        Initialize the inference engine.
        
        Args:
            model_path: Path to the model checkpoint (base or fine-tuned)
            nframes: Number of frames to sample from videos
            max_tokens: Maximum tokens to generate
        """
        self.model_path = model_path
        self.nframes = nframes
        self.max_tokens = max_tokens
        self.llm = None
        self.processor = None
        self.sampling_params = None
        
    def load_model(self):
        """Load the model using vLLM."""
        try:
            from vllm import LLM, SamplingParams
            from transformers import AutoProcessor
            import torch
            import gc

            torch.cuda.empty_cache()
            gc.collect()
            
            print(f"🔄 Loading model from: {self.model_path}")
            
            self.llm = LLM(
                model=self.model_path,
                tensor_parallel_size=1,
                max_model_len=32768,
                trust_remote_code=True,
                limit_mm_per_prompt={"video": 1, "image": 0}
            )
            
            # Load processor for chat template
            self.processor = AutoProcessor.from_pretrained(
                self.model_path,
                trust_remote_code=True
            )
            
            self.sampling_params = SamplingParams(
                max_tokens=self.max_tokens,
                temperature=0.0
            )
            
            print("✅ Model loaded successfully!")
            return True
            
        except ImportError:
            print("⚠️ vLLM not installed. Install with: pip install vllm")
            return False
        except Exception as e:
            print(f"❌ Error loading model: {e}")
            return False
    
    def query(self, video_path, question, system_prompt="You are a helpful assistant."):
        """
        Query the model with a video and question.
        
        Args:
            video_path: Path to the video file
            question: Question to ask about the video
            system_prompt: System prompt for the model
        
        Returns:
            Model's response as string
        """
        if self.llm is None or self.processor is None:
            print("⚠️ Model not loaded. Call load_model() first.")
            return None
        
        try:
            from qwen_vl_utils import process_vision_info
            
            # Prepare messages with video
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": [
                    {"type": "video", "video": str(video_path), "nframes": self.nframes},
                    {"type": "text", "text": question}
                ]}
            ]
            
            # Apply chat template to get text prompt
            text_prompt = self.processor.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True
            )
            
            # Extract video data using process_vision_info
            image_inputs, video_inputs, video_kwargs = process_vision_info(
                messages,
                image_patch_size=16,
                return_video_kwargs=True,
                return_video_metadata=True
            )
            
            # Prepare input for vLLM generate
            model_input = {
                "prompt": text_prompt,
                "multi_modal_data": {"video": video_inputs},
                "mm_processor_kwargs": video_kwargs
            }
            
            # Run inference using generate (not chat)
            outputs = self.llm.generate([model_input], self.sampling_params)
            response = outputs[0].outputs[0].text
            
            return response
            
        except ImportError as ie:
            print(f"⚠️ Import error: {ie}")
            print("   Install with: pip install qwen-vl-utils")
            return None
        except Exception as e:
            print(f"❌ Error during inference: {e}")
            import traceback
            traceback.print_exc()
            return None
    
    def batch_query(self, queries):
        """
        Process multiple queries in batch.
        
        Args:
            queries: List of (video_path, question) tuples
        
        Returns:
            List of responses
        """
        responses = []
        for video_path, question in queries:
            response = self.query(video_path, question)
            responses.append(response)
        return responses

## Zero-Shot Inference

Before fine-tuning, run a quick zero-shot evaluation with the base model to establish a baseline for comparison.

In [None]:
# Zero-shot inference with base model
print("="*70)
print("🔍 ZERO-SHOT INFERENCE (Base Model)")
print("="*70)

inference_base = CosmosReason2Inference(
    model_path=BASE_MODEL_PATH,  # Base model path
    nframes=8,
    max_tokens=512
)

inference_base.load_model()

# Test video
zero_shot_video = EXAMPLE_VIDEO_PATH
print(f"\nVideo: {zero_shot_video}\n")

# Sample question
question = "What is the pedestrian doing in this video?"
print("📝 Question:")
print(question)
print("-"*70)

response = inference_base.query(zero_shot_video, question)
print(f"\n✅ ANSWER: {response}")
print("="*70)

# Clean up GPU memory before training
try:
    del inference_base
    import torch, gc
    torch.cuda.empty_cache()
    gc.collect()
except Exception:
    pass

## Training Configuration

The training configuration is specified in a TOML file. The key hyperparameters below are tuned for **8x A100 GPUs**; if your hardware differs, adjust the parameters accordingly.

### Key Configuration Highlights:
- **Learning Rate**: 2e-5 with cosine decay
- **Batch Size**: 32 per replica
- **Model**: nvidia/Cosmos-Reason2-2B (or 8B)
- **Max Length**: 32,768 tokens
- **Vision**: 8 frames uniformly sampled (nframes=8)
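
As a rough illustration of what "2e-5 with cosine decay" means, the sketch below evaluates a plain cosine schedule at a few points. It assumes no warmup and a decay to zero, and the `cosine_lr` helper and the 1,000-step horizon are hypothetical; the actual schedule used by Cosmos-RL may include warmup or a non-zero floor.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-5, min_lr=0.0):
    """Cosine-decayed learning rate at `step` (assumes no warmup)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the horizon stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical horizon for illustration
for step in (0, 250, 500, 750, 1000):
    print(f"step {step:4d}: lr = {cosine_lr(step, total_steps):.2e}")
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches the floor at the final step.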

In [None]:
# Use the official training config from the cosmos-cookbook repo
CONFIG_FILE = "scripts/examples/reason2/intelligent-transportation/sft_config.toml"

CONFIG_PATH = f"{COSMOS_COOKBOOK_REPO}/{CONFIG_FILE}"

# Display the raw config file
print("📄 Official Training Config from cosmos-cookbook")
print(f"   Source: github.com/nvidia-cosmos/cosmos-cookbook/{CONFIG_FILE}\n")
print("="*70)
!cat {CONFIG_PATH}
print("="*70)

# Parse and show key parameters
try:
    import tomllib
except ImportError:
    try:
        import tomli as tomllib
    except ImportError:
        import pip._vendor.tomli as tomllib

with open(CONFIG_PATH, "rb") as f:
    config = tomllib.load(f)

print("\n🔑 Key Training Parameters:\n")
print(f"  Model:           {config['policy']['model_name_or_path']}")
print(f"  Learning Rate:   {config['train']['optm_lr']}")
print(f"  Batch Size:      {config['train']['train_batch_per_replica']} per replica")
print(f"  Max Seq Length:  {config['policy']['model_max_length']}")

## Update Training Config Paths

Patch the `sft_config.toml` file with your local dataset and output paths. This keeps the training script aligned with your environment.
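
If GNU `sed` is not available, the same kind of in-place patch can be sketched in pure Python. The `set_toml_string` helper below is a hypothetical name, and the sample text only imitates the keys this notebook patches (`annotation_path`, `media_path`); it is an illustration on an in-memory string, not the cookbook's own tooling.

```python
import re


def set_toml_string(text, key, value):
    """Rewrite every `key = ...` line in `text` to `key = "value"`."""
    pattern = re.compile(rf"^([ \t]*){re.escape(key)}[ \t]*=.*$", flags=re.MULTILINE)
    # A function replacement avoids backslash-escape surprises in `value`.
    return pattern.sub(lambda m: f'{m.group(1)}{key} = "{value}"', text)


# Hypothetical config fragment for illustration.
sample = (
    "[custom.dataset]\n"
    'annotation_path = "/old/annotations.json"\n'
    'media_path = "/old/videos"\n'
)

patched = set_toml_string(sample, "annotation_path", "/data/train/annotations.json")
patched = set_toml_string(patched, "media_path", "/data/train/videos")
print(patched)
```

Unlike an unanchored `sed` pattern, this version only matches keys at the start of a line, so a key such as `val_annotation_path` would be left untouched.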

In [None]:
# Update sft_config.toml with actual paths
import os
import subprocess
from pathlib import Path

CONFIG_PATH = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation/sft_config.toml"
print("📝 Updating sft_config.toml with actual paths...\n")

# Use sed to update the config file directly (in-place)
subprocess.run(["sed", "-i", f's|annotation_path = .*|annotation_path = "{TRAIN_ANNOTATIONS_PATH}"|', CONFIG_PATH], check=True)
subprocess.run(["sed", "-i", f's|media_path = .*|media_path = "{TRAIN_VIDEOS_PATH}"|', CONFIG_PATH], check=True)
subprocess.run(["sed", "-i", f's|output_dir = .*|output_dir = "{FINETUNED_MODEL_PATH}"|', CONFIG_PATH], check=True)

print(f"  annotation_path: {TRAIN_ANNOTATIONS_PATH}")
print(f"  media_path:      {TRAIN_VIDEOS_PATH}")
print(f"  output_dir:      {FINETUNED_MODEL_PATH}")

# Verify paths exist
print("\n🔍 Verifying paths:")
ann_path = TRAIN_ANNOTATIONS_PATH
media_path = TRAIN_VIDEOS_PATH

if os.path.exists(ann_path):
    print("  ✅ annotations.json exists")
else:
    print(f"  ❌ annotations.json NOT found at {ann_path}")

if os.path.exists(media_path):
    print("  ✅ videos directory exists")
    video_files = list(Path(media_path).rglob("*.mp4"))
    print(f"     Found {len(video_files)} video files")
else:
    print(f"  ❌ videos directory NOT found at {media_path}")

# Show updated config section
print("\n📄 Updated [custom.dataset] section:")
print("="*50)
!grep -A5 "\[custom.dataset\]" {CONFIG_PATH}
print("="*50)

## Vision Token Calculation (Ablation Study)

Understanding how vision tokens are calculated is crucial for optimizing training. Qwen3-VL (the backbone of Cosmos Reason 2) compresses input videos in both **space** and **time**:

### Compression Factors:
- **Spatial Compression**: Effective patch size = 32 (16 patch × 2 spatial merge)
- **Temporal Compression**: Effective temporal step = 2 (2 frames merge into 1)

### Two Ablation Configurations:
1. **nframes=8 (3k tokens)**: Fewer frames, higher resolution per frame
2. **fps=1, 8M pixels (8k tokens)**: More frames, lower resolution per frame

**Key Finding**: Higher resolution per frame (3k tokens) achieves better accuracy with 3× faster training!
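
The 3k and 8k token figures follow from the compression factors above. The back-of-the-envelope sketch below assumes a per-frame cap of 768 tokens for the `nframes=8` setup and reads "8M pixels" as 8 × 1024 × 1024 total pixels; both are assumptions for illustration, and real preprocessing may round frame sizes differently.

```python
PATCH_SIZE = 16
SPATIAL_MERGE = 2
TEMPORAL_MERGE = 2

# One vision token covers a 32 x 32 pixel area after spatial merging.
PIXELS_PER_TOKEN = (PATCH_SIZE * SPATIAL_MERGE) ** 2  # 1024

# Configuration 1: nframes=8, each frame capped at an assumed 768 tokens.
nframes = 8
max_frame_tokens = 768
effective_frames = nframes // TEMPORAL_MERGE            # 4 after temporal merge
tokens_nframes = effective_frames * max_frame_tokens    # 4 * 768

# Configuration 2: fps=1 with an assumed total budget of 8 * 1024 * 1024 pixels.
total_pixels = 8 * 1024 * 1024
tokens_fps = total_pixels // PIXELS_PER_TOKEN

print(f"nframes=8:        {tokens_nframes} vision tokens (~3k)")
print(f"fps=1, 8M pixels: {tokens_fps} vision tokens (~8k)")
```

Under these assumptions the two configurations come to 3,072 and 8,192 vision tokens, which matches the rounded 3k and 8k labels.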

In [None]:
# Vision Token Calculation for Cosmos Reason 2

class VisionTokenCalculator:
    """Calculator for vision tokens in Qwen3-VL based models."""
    
    # Model constants
    PATCH_SIZE = 16
    SPATIAL_MERGE = 2
    TEMPORAL_MERGE = 2  # 2 frames merge into 1
    
    EFFECTIVE_PATCH_SIZE = PATCH_SIZE * SPATIAL_MERGE  # 32
    DEFAULT_MAX_FRAME_TOKENS = 768  # Default max tokens per frame
    
    def __init__(self):
        self.patch_area = self.EFFECTIVE_PATCH_SIZE ** 2  # 32 * 32 = 1024
    
    def calculate_tokens_nframes(self, nframes, frame_width, frame_height, max_frame_tokens=768):
        """
        Calculate vision tokens for nframes configuration.
        
        Args:
            nframes: Number of frames to sample uniformly
            frame_width: Original frame width
            frame_height: Original frame height
            max_frame_tokens: Maximum tokens per frame (default 768)
        
        Returns:
            Dictionary with calculation details
        """
        # Calculate max pixels per frame from max tokens
        max_pixels_per_frame = max_frame_tokens * self.patch_area
        
        # Check if resizing is needed
        original_pixels = frame_width * frame_height
        
        if original_pixels > max_pixels_per_frame:
            # Need to resize - maintain aspect ratio
            scale = (max_pixels_per_frame / original_pixels) ** 0.5
            new_width = int(frame_width * scale)
            new_height = int(frame_height * scale)
            # Round down to a multiple of the effective patch size
            new_width = (new_width // self.EFFECTIVE_PATCH_SIZE) * self.EFFECTIVE_PATCH_SIZE
            new_height = (new_height // self.EFFECTIVE_PATCH_SIZE) * self.EFFECTIVE_PATCH_SIZE
        else:
            new_width, new_height = frame_width, frame_height
        
        # Calculate tokens per frame
        tokens_per_frame = (new_width * new_height) // self.patch_area
        
        # Apply temporal compression
        effective_frames = nframes // self.TEMPORAL_MERGE
        
        # Total vision tokens
        total_tokens = effective_frames * tokens_per_frame
        
        return {
            "config": f"nframes={nframes}",
            "original_resolution": f"{frame_width} × {frame_height}",
            "resized_resolution": f"{new_width} × {new_height}",
            "frames_sampled": nframes,
            "effective_frames": effective_frames,
            "tokens_per_frame": tokens_per_frame,
            "total_vision_tokens": total_tokens
        }
    
    def calculate_tokens_fps(self, video_duration_sec, fps, total_pixel_limit):
        """
        Calculate vision tokens for fps configuration with total pixel limit.
        
        Args:
            video_duration_sec: Video duration in seconds
            fps: Frames per second to sample
            total_pixel_limit: Maximum total pixels across all frames
        
        Returns:
            Dictionary with calculation details
        """
        # Calculate number of frames
        num_frames = int(video_duration_sec * fps)
        # Round down to an even number for temporal compression
        num_frames = (num_frames // 2) * 2
        
        # Calculate effective frames after temporal merge
        effective_frames = num_frames // self.TEMPORAL_MERGE
        
        # Calculate pixels per frame based on total limit
        pixels_per_frame = total_pixel_limit // effective_frames
        
        # Calculate frame dimensions (approximate, assuming ~16:9 aspect ratio)
        aspect_ratio = 16 / 9
        frame_height = int((pixels_per_frame / aspect_ratio) ** 0.5)
        frame_width = int(frame_height * aspect_ratio)
        
        # Round to patch size multiples
        frame_width = (frame_width // self.EFFECTIVE_PATCH_SIZE) * self.EFFECTIVE_PATCH_SIZE
        frame_height = (frame_height // self.EFFECTIVE_PATCH_SIZE) * self.EFFECTIVE_PATCH_SIZE
        
        # Recalculate actual pixels
        actual_pixels_per_frame = frame_width * frame_height
        
        # Tokens per frame
        tokens_per_frame = actual_pixels_per_frame // self.patch_area
        
        # Total vision tokens
        total_tokens = effective_frames * tokens_per_frame
        
        return {
            "config": f"fps={fps}, total_pixels={total_pixel_limit:,}",
            "video_duration": f"{video_duration_sec} seconds",
            "frames_sampled": num_frames,
            "effective_frames": effective_frames,
            "pixels_per_frame": actual_pixels_per_frame,
            "resized_resolution": f"{frame_width} × {frame_height}",
            "tokens_per_frame": tokens_per_frame,
            "total_vision_tokens": total_tokens
        }

# Initialize calculator
calc = VisionTokenCalculator()

print("🔢 Vision Token Calculations for Ablation Study")
print("="*70)
print("\n📐 Model Constants:")
print(f"   Patch Size: {calc.PATCH_SIZE}")
print(f"   Spatial Merge: {calc.SPATIAL_MERGE}x")
print(f"   Effective Patch Size: {calc.EFFECTIVE_PATCH_SIZE}")
print(f"   Temporal Merge: {calc.TEMPORAL_MERGE}x (2 frames → 1)")
print(f"   Patch Area: {calc.patch_area} pixels")
print("="*70)
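
To make the fps-based arithmetic concrete, the standalone sketch below walks through the same steps as `calculate_tokens_fps` for a single configuration. The constants (effective patch size 32, temporal merge 2) and the example inputs (10 seconds, 4 fps, a 6,553,600-pixel budget) are illustrative assumptions, not values taken from this notebook; the `VisionTokenCalculator` class holds the values actually used here.

```python
# Standalone sketch of the fps-based vision token estimate.
# ASSUMPTIONS: effective_patch=32 and temporal_merge=2 are illustrative defaults.
def estimate_fps_tokens(duration_sec, fps, total_pixel_limit,
                        effective_patch=32, temporal_merge=2):
    # Frames sampled, rounded down to an even count
    num_frames = (int(duration_sec * fps) // 2) * 2
    effective_frames = num_frames // temporal_merge
    if effective_frames == 0:
        raise ValueError("not enough frames to form one merged frame")

    # Split the total pixel budget evenly across merged frames
    pixels_per_frame = total_pixel_limit // effective_frames

    # Approximate a 16:9 frame, then round each side down to a patch multiple
    aspect_ratio = 16 / 9
    height = int((pixels_per_frame / aspect_ratio) ** 0.5)
    width = int(height * aspect_ratio)
    width = (width // effective_patch) * effective_patch
    height = (height // effective_patch) * effective_patch

    tokens_per_frame = (width * height) // (effective_patch ** 2)
    return {
        "resolution": (width, height),
        "effective_frames": effective_frames,
        "tokens_per_frame": tokens_per_frame,
        "total_vision_tokens": effective_frames * tokens_per_frame,
    }


example = estimate_fps_tokens(duration_sec=10, fps=4, total_pixel_limit=6_553_600)
print(example)
```

Under these assumptions, 40 sampled frames merge into 20 effective frames of 736 × 416 pixels, giving 299 tokens per frame and 5,980 vision tokens in total.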

## Run Training

Now we launch the SFT training using the Cosmos-RL framework. The training uses:
- **8× A100 GPUs** (recommended; 4× A100 also works) with data parallelism
- **Supervised Fine-Tuning (SFT)** on MCQ data

Training time: ~1 hour 16 minutes for the 3k vision token configuration (nframes=8, 3 epochs) on 8× A100.
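
Because a run takes over an hour, it can be worth checking prerequisites before launching. The sketch below is one possible pre-flight check: it reports Python modules and command-line tools that are missing. The helper names are hypothetical, and using `nvidia-smi` as a proxy for a working GPU driver is an assumption; only the `redis` module check mirrors what the training cell itself verifies.

```python
import importlib.util
import shutil


def missing_python_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_executables(names):
    """Return the executables in `names` that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


# "redis" is required by the training cell.
# ASSUMPTION: "nvidia-smi" on PATH is treated as a sign the GPU driver is installed.
print("Missing modules:    ", missing_python_modules(["redis"]))
print("Missing executables:", missing_executables(["nvidia-smi"]))
```

An empty list on both lines means the checked prerequisites were found; it does not guarantee that the GPUs have enough free memory.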

In [None]:
# Run Training with Cosmos-RL (using cosmos-rl's own venv)
import os
import sys

COSMOS_RL_VENV = f"{COSMOS_RL_PATH}/.venv"
TRAINING_DIR = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation"

print("🚀 Running Training with Cosmos-RL")
print("="*70)
print(f"  Working Dir: {TRAINING_DIR}")
print(f"  Config:      sft_config.toml")
print(f"  Script:      custom_sft.py")
print("="*70)

# Check that the redis Python package is installed
try:
    import redis
except ImportError as exc:
    raise ImportError("The redis Python package is not installed. Install with: pip install redis") from exc

print("\n⏱️ Expected training time (8× A100):")
print("   - 3k tokens (nframes=8): ~1h 16m for 3 epochs")

# Setup cosmos-rl venv if needed
if not os.path.exists(f"{COSMOS_RL_VENV}/bin/cosmos-rl"):
    print("\n📦 Setting up cosmos-rl venv with uv sync...")
    !cd {COSMOS_RL_PATH} && pip install -q uv && uv sync

# Run training - MUST activate venv so subprocesses get the right python
print("\n🔄 Starting training...\n")
!source {COSMOS_RL_VENV}/bin/activate && cd {TRAINING_DIR} && cosmos-rl --config sft_config.toml custom_sft.py

## Run Evaluation

After training, we evaluate the model on the validation set of the WTS Environment VQA dataset:
- **171 videos** with **2.6k multiple-choice questions (MCQs)**, unseen during training
- Evaluation uses the **vLLM** inference engine for efficient batch processing
- Metric: **accuracy** on the multiple-choice questions
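
How `evaluate.py` scores answers is not shown in this notebook. As a rough illustration of the accuracy metric, the sketch below extracts a choice letter from each model response and compares it with the ground truth. The function names, the A-D choice set, and the parsing rule are assumptions made for this example.

```python
import re


def extract_choice(text, choices="ABCD"):
    """Return the first standalone choice letter in `text`, or None.

    Naive by design: a response that begins with the article "A"
    would be read as choice A.
    """
    match = re.search(rf"\b([{choices}])\b", text.strip().upper())
    return match.group(1) if match else None


def mcq_accuracy(predictions, answers):
    """Fraction of predictions whose extracted letter matches the answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        extract_choice(pred) == ans.strip().upper()
        for pred, ans in zip(predictions, answers)
    )
    return correct / len(answers)


preds = ["B", "The answer is C.", "(a)", "I am not sure"]
truth = ["B", "C", "D", "A"]
print(f"Accuracy: {mcq_accuracy(preds, truth):.2%}")
```

Unparseable responses count as wrong in this sketch, which keeps the denominator equal to the number of questions.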

In [None]:
# Run Evaluation using cosmos-cookbook script
import sys
import os

EVAL_DIR = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation"
EVAL_CONFIG = f"{EVAL_DIR}/eval_config.yaml"

# Update paths in eval_config.yaml (preserve the rest of the config)
print("\n📝 Updating paths in eval_config.yaml...")

# Edit the file in place with sed; check=True raises if sed exits with an error
# (for example, when the config file is missing)
import subprocess
subprocess.run(["sed", "-i", f's|annotation_path:.*|annotation_path: {VAL_ANNOTATIONS_PATH}|', EVAL_CONFIG], check=True)
subprocess.run(["sed", "-i", f's|media_dir:.*|media_dir: {VAL_VIDEOS_PATH}|', EVAL_CONFIG], check=True)
subprocess.run(["sed", "-i", f's|model_name:.*|model_name: {FINETUNED_MODEL_PATH}|', EVAL_CONFIG], check=True)

print(f"  annotation_path: {VAL_ANNOTATIONS_PATH}")
print(f"  media_dir:       {VAL_VIDEOS_PATH}")
print(f"  model_name:      {FINETUNED_MODEL_PATH}")

# Show updated config
print("\n📄 Evaluation Config:")
print("="*70)
!cat {EVAL_CONFIG}
print("="*70)

# Run evaluation
print("\n🔄 Starting evaluation...\n")
!cd {EVAL_DIR} && {sys.executable} evaluate.py --config eval_config.yaml
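
The `sed` substitutions in the cell above depend on GNU `sed` and break if a path contains the `|` delimiter. A pure-Python alternative is sketched below; `set_config_value` is a hypothetical helper, and the sample config text is invented for illustration rather than copied from `eval_config.yaml`.

```python
import re


def set_config_value(config_text, key, value):
    """Replace the value on every `key: ...` line, keeping its indentation."""
    pattern = re.compile(rf"^([ \t]*){re.escape(key)}:.*$", flags=re.MULTILINE)
    # A function replacement avoids backslash escapes in `value` being interpreted
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{key}: {value}", config_text
    )
    if count == 0:
        raise KeyError(f"{key!r} not found in config")
    return new_text


# Hypothetical config fragment, for illustration only
sample = "model:\n  model_name: old/model\ndata:\n  media_dir: /old/videos\n"
updated = set_config_value(sample, "model_name", "/new/checkpoint")
print(updated)
```

Unlike `sed`, this version raises when the key is absent, so a renamed config field is caught before the evaluation starts.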

## Results Visualization

Load the most recent `results.json` written by `evaluate.py` and print its accuracy summary. Re-run this cell after each evaluation to compare configurations and training epochs.

In [None]:
# Load and display results from evaluate.py
import json
import os
import glob

EVAL_DIR = f"{COSMOS_COOKBOOK_REPO}/scripts/examples/reason2/intelligent-transportation"
RESULTS_BASE = os.path.join(EVAL_DIR, "results")

# Find all results.json files from evaluate.py output
result_files = glob.glob(os.path.join(RESULTS_BASE, "**/results.json"), recursive=True)

if result_files:
    # Use the most recent results file
    result_file = max(result_files, key=os.path.getmtime)
    
    with open(result_file, 'r') as f:
        metrics = json.load(f)
    
    print("="*60)
    print("📊 EVALUATION RESULTS")
    print("="*60)
    print(f"\n   Accuracy:  {metrics['accuracy']*100:.2f}%")
    print(f"   Correct:   {metrics['total_correct']} / {metrics['total_questions']}")
    print(f"\n   Results:   {result_file}")
    print("\n" + "="*60)
else:
    print(f"ℹ️ No results found in {RESULTS_BASE}")
    print("   Run the evaluation cell above first.")
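
The cell above reports only the newest results file. To compare several runs side by side, a sketch like the following could gather every `results.json` under the results directory and sort the runs by accuracy. It assumes one `results.json` per run subdirectory with the same `accuracy`, `total_correct`, and `total_questions` keys read above; `collect_results` is a hypothetical helper.

```python
import glob
import json
import os


def collect_results(results_base):
    """Return one row per results.json under `results_base`, best accuracy first."""
    rows = []
    pattern = os.path.join(results_base, "**", "results.json")
    for path in glob.glob(pattern, recursive=True):
        with open(path, "r") as f:
            metrics = json.load(f)
        rows.append({
            # Run name is the directory of the file, relative to the base
            "run": os.path.relpath(os.path.dirname(path), results_base),
            "accuracy": metrics["accuracy"],
            "correct": metrics["total_correct"],
            "total": metrics["total_questions"],
        })
    return sorted(rows, key=lambda row: row["accuracy"], reverse=True)


# "results" is a placeholder; in this notebook the directory is RESULTS_BASE
for row in collect_results("results"):
    print(f"{row['run']:<40} {row['accuracy'] * 100:6.2f}%  ({row['correct']}/{row['total']})")
```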

## Inference with Fine-Tuned Model

Run inference on custom traffic videos using the fine-tuned Cosmos Reason 2 model. The model can answer both MCQ and open-ended questions about traffic scenes.

In [None]:
# Fine-tuned inference demo
inference = CosmosReason2Inference(model_path=FINETUNED_MODEL_PATH, nframes=8)
inference.load_model()

# Sample questions for traffic scene understanding
SAMPLE_QUESTIONS = [
    "What type of road is shown in this video?",
    "How many vehicles can you see in the scene?",
    "Is there any pedestrian in the video? If yes, what are they doing?",
    "What potential traffic hazards do you observe?",
    "Describe the overall traffic flow and density.",
]

# Test video path (update with your video)
test_video = EXAMPLE_VIDEO_PATH

print("\n" + "="*70)
print("🎬 TESTING INFERENCE ON TRAFFIC VIDEO (FINE-TUNED)")
print("="*70)
print(f"Video: {test_video}")
print(f"Total Questions: {len(SAMPLE_QUESTIONS)}")
print("="*70 + "\n")

for i, question in enumerate(SAMPLE_QUESTIONS, 1):
    print(f"📝 Question {i}/{len(SAMPLE_QUESTIONS)}")
    print("-"*70)
    print(question)
    print("-"*70)
    
    response = inference.query(test_video, question)
    
    print(f"✅ ANSWER: {response}")
    print("="*70 + "\n")

print("✅ All questions processed successfully!")


---

## Deployment with FP8 Quantization and NVIDIA NIM

For production deployment, we can:
1. **Quantize to FP8** for faster inference with minimal accuracy loss
2. **Deploy on NVIDIA NIM** for optimized, production-ready inference microservices

### Expected Benefits of FP8 Quantization
- Up to ~2× faster inference
- ~50% memory reduction
- Minimal accuracy degradation (<0.5%)

These figures are approximate; actual gains depend on the GPU and the workload.
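
The memory reduction follows from bytes-per-parameter arithmetic: FP16 stores each weight in 2 bytes and FP8 in 1 byte. The sketch below works this out for a nominal 8-billion-parameter model. The exact parameter count is an assumption, and the estimate covers weights only, not activations or the KV cache.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 8e9  # ASSUMPTION: nominal parameter count for an "8B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 2)  # 2 bytes per FP16 weight
fp8_gb = weight_memory_gb(NUM_PARAMS, 1)   # 1 byte per FP8 weight

print(f"FP16: ~{fp16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB, saving {1 - fp8_gb / fp16_gb:.0%}")
# prints "FP16: ~16 GB, FP8: ~8 GB, saving 50%"
```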

In [None]:
# FP8 Quantization Configuration
# Output directory for the FP8-quantized model
FP8_MODEL_OUTPUT_PATH = f"{FINETUNED_MODEL_PATH}_fp8"

QUANTIZATION_CONFIG = {
    "model_path": FINETUNED_MODEL_PATH,
    "output_path": FP8_MODEL_OUTPUT_PATH,
    "precision": "fp8"
}

print("🔧 FP8 Quantization Setup")
print("="*70)

# Quantization script path
quantize_script = f"{COSMOS_REASON2_REPO}/scripts/quantize.py"

# Quantization command
QUANTIZE_CMD = f"""
# FP8 Quantization Command
# ========================

python {quantize_script} \\
    --model "{QUANTIZATION_CONFIG['model_path']}" \\
    -o "{QUANTIZATION_CONFIG['output_path']}" \\
    --precision {QUANTIZATION_CONFIG['precision']}
"""

print("📝 Quantization Command:")
print("-"*70)
print(QUANTIZE_CMD)
print("-"*70)

# Expected benefits
print("\n📊 Expected Benefits of FP8 Quantization:")
print("""
┌────────────────────────┬──────────────┬──────────────┐
│ Metric                 │ FP16         │ FP8          │
├────────────────────────┼──────────────┼──────────────┤
│ Model Size             │ ~16 GB       │ ~8 GB        │
│ Inference Speed        │ 1.0×         │ ~2.0×        │
│ GPU Memory Usage       │ 100%         │ ~50%         │
│ Accuracy Loss          │ 0%           │ <0.5%        │
└────────────────────────┴──────────────┴──────────────┘
""")

## Run FP8 Quantization

Quantize the fine-tuned checkpoint to FP8 to reduce memory usage and speed up inference. Ensure the Cosmos-RL venv is available before running this cell.

In [None]:
import os

COSMOS_RL_VENV = f"{COSMOS_RL_PATH}/.venv"

# Run FP8 Quantization (shell command); invoke the script through python,
# matching the command printed in the setup cell
!source {COSMOS_RL_VENV}/bin/activate && python {COSMOS_REASON2_REPO}/scripts/quantize.py \
    --model "{QUANTIZATION_CONFIG['model_path']}" \
    -o "{QUANTIZATION_CONFIG['output_path']}" \
    --precision fp8

## Authenticate to NGC

You need an NGC API key to pull the Cosmos Reason 2 NIM image. This cell prompts for your key and performs a Docker login.

In [None]:
# NGC Login
import subprocess
import getpass
from IPython.display import display, HTML
import time

display(HTML('<a href="https://org.ngc.nvidia.com/setup/api-key" target="_blank" style="font-size:16px;">🔑 Get NGC API Key</a>'))
time.sleep(2)

ngc_api_key = getpass.getpass("NGC API Key: ").strip()

if ngc_api_key:
    result = subprocess.run(
        ["docker", "login", "nvcr.io", "-u", "$oauthtoken", "--password-stdin"],
        input=ngc_api_key, text=True, capture_output=True
    )
    print("✅ Login successful" if result.returncode == 0 else f"❌ Failed: {result.stderr}")
else:
    print("❌ No key provided")

## NIM Deployment Configuration

Define the model path, NIM image, and runtime parameters before launching the container. Adjust `max_model_len` and `allowed_local_media_path` based on your hardware and data location.

In [None]:
# NVIDIA NIM Deployment Configuration
NIM_CONFIG = {
    "model_path": f"{QUANTIZATION_CONFIG['output_path']}/model_fp8",  # FP8 quantized model
    "nim_image": "nvcr.io/nim/nvidia/cosmos-reason2-8b:latest",
    "model_name": "cosmos-reason2-wts-fp8",
    "port": 8000,
    "shm_size": "32GB",
    "max_model_len": 262144,  # 256k tokens, reduce for lower memory usage
    "allowed_local_media_path": "/home/ubuntu"  # Allow local video file access
}

print("🚀 NVIDIA NIM Deployment")
print("="*70)
print(f"Model: {NIM_CONFIG['model_path']}")
print(f"Max Context Length: {NIM_CONFIG['max_model_len']:,} tokens")
print(f"Port: {NIM_CONFIG['port']}")
print("="*70)

In [None]:
# Shell Command for NIM Deployment

NIM_DEPLOY_CMD = f"""
# Set environment variables
export CUSTOM_WEIGHTS="{NIM_CONFIG['model_path']}"
export NIM_IMAGE="{NIM_CONFIG['nim_image']}"

# Launch NIM container
docker run -d --name=cosmos-reason2-wts \\
    --gpus all \\
    --shm-size={NIM_CONFIG['shm_size']} \\
    -e NIM_MODEL_NAME=$CUSTOM_WEIGHTS \\
    -e NIM_SERVED_MODEL_NAME="{NIM_CONFIG['model_name']}" \\
    -e NIM_MAX_MODEL_LEN={NIM_CONFIG['max_model_len']} \\
    -e NIM_ALLOWED_LOCAL_MEDIA_PATH="{NIM_CONFIG['allowed_local_media_path']}" \\
    -v $CUSTOM_WEIGHTS:$CUSTOM_WEIGHTS \\
    -v {NIM_CONFIG['allowed_local_media_path']}:{NIM_CONFIG['allowed_local_media_path']}:ro \\
    -u $(id -u) \\
    -p {NIM_CONFIG['port']}:8000 \\
    $NIM_IMAGE

# Wait for startup (takes ~2-3 minutes)
# Follow the logs to check deployment status (Ctrl+C stops following)
docker logs -f cosmos-reason2-wts

# Health check
curl http://localhost:{NIM_CONFIG['port']}/v1/health/ready | jq .
"""


print("\n💡 Steps:")
print("   1. Run the commands below")
print("   2. Monitor with: docker logs -f cosmos-reason2-wts")
print("   3. Stop with: docker stop cosmos-reason2-wts\n")
print("-"*70)
print("📝 NIM Deployment Commands:")
print("-"*70)
print(NIM_DEPLOY_CMD)
print("-"*70)
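
Since startup takes a few minutes, polling the health endpoint from Python can replace repeated manual `curl` calls. The sketch below uses only the standard library. The function names, the default timeouts, and the rule that only HTTP 200 counts as ready are assumptions for this example.

```python
import time
import urllib.error
import urllib.request


def is_ready(url, request_timeout_s=5.0):
    """Return True if `url` answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=request_timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses
        return False


def wait_until_ready(url, timeout_s=600.0, interval_s=5.0, probe=is_ready):
    """Poll `probe(url)` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the container running, `wait_until_ready("http://localhost:8000/v1/health/ready")` would block until the service reports ready or ten minutes pass. The `probe` argument exists so the polling loop can be tested without a live server.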


## Test NIM API

This step sends a sample request to the local NIM endpoint to confirm the deployment is responding correctly. It uses a remote video URL for convenience, but you can swap in your own video later.

In [None]:
## Test NIM API - Remote Video

import requests
import json

NIM_ENDPOINT = f"http://localhost:{NIM_CONFIG['port']}/v1/chat/completions"

# Example 1: Remote video URL
test_payload_remote = {
    "model": NIM_CONFIG['model_name'],
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this video?"},
            {"type": "video_url", "video_url": {"url": "https://download.samplelib.com/mp4/sample-5s.mp4"}}
        ]
    }],
    "media_io_kwargs": {
        "video": {
            "num_frames": 10
        }
    },
    "stream": False
}

print("Testing with remote video URL...")
print("="*70)

try:
    # Set a timeout so the cell does not hang if the container is unresponsive
    response = requests.post(NIM_ENDPOINT, json=test_payload_remote, timeout=300)
    result = response.json()
    
    if "choices" in result:
        print("Success!")
        print(f"\nQuestion: What is in this video?")
        print(f"Answer: {result['choices'][0]['message']['content']}")
        print(f"\nTokens used: {result['usage']['total_tokens']:,}")
    else:
        print("Error:", result)
except Exception as e:
    print(f"Connection error: {e}")
    print("   Make sure the NIM container is running!")
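
To query one of your own videos instead of a remote URL, the same payload shape can point at a local file. The sketch below builds such a payload with a `file://` URL and checks that the video lies under the allowed media directory. Whether the NIM endpoint accepts `file://` URLs is an assumption inferred from the `NIM_ALLOWED_LOCAL_MEDIA_PATH` setting, so verify it against the NIM documentation. The helper name and the example video path are hypothetical.

```python
from pathlib import Path


def build_local_video_payload(video_path, question, model_name,
                              allowed_root, num_frames=10):
    """Build a chat-completions payload that references a local video file."""
    video = Path(video_path).resolve()
    root = Path(allowed_root).resolve()
    # Reject files outside the directory mounted into the container
    if not video.is_relative_to(root):
        raise ValueError(f"{video} is outside the allowed media path {root}")
    return {
        "model": model_name,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                # ASSUMPTION: the server accepts file:// URLs for local media
                {"type": "video_url", "video_url": {"url": video.as_uri()}},
            ],
        }],
        "media_io_kwargs": {"video": {"num_frames": num_frames}},
        "stream": False,
    }


payload = build_local_video_payload(
    "/home/ubuntu/videos/clip.mp4",  # hypothetical path
    "What is in this video?",
    model_name="cosmos-reason2-wts-fp8",
    allowed_root="/home/ubuntu",
)
print(payload["messages"][0]["content"][1]["video_url"]["url"])
```

The resulting dictionary can be sent with `requests.post(NIM_ENDPOINT, json=payload)` in the same way as the remote example.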

## Troubleshooting

- **CUDA out of memory**: reduce batch size, lower `nframes`, or decrease `model_max_length`; restart the kernel to clear GPU memory.
- **Flash Attention build failures**: confirm that the CUDA and PyTorch versions match and that build tools are installed.
- **Redis not found**: install with `pip install redis` before running training.

## Summary & Conclusion

### What We Accomplished

1. **Data Exploration**: Visualized WTS traffic videos and annotations
2. **Training Configuration**: Set up optimal hyperparameters for SFT
3. **Vision Token Analysis**: Understood the tradeoffs between frame count and resolution
4. **Model Training**: Fine-tuned Cosmos Reason 2 on traffic VQA data
5. **Evaluation**: Achieved 93.65% accuracy on the validation set
6. **Deployment**: Prepared FP8 quantization and NIM deployment

### Key Takeaways

| Insight | Implication |
|---------|-------------|
| Higher resolution > More frames | Prioritize image quality over quantity for scene understanding |
| Fast convergence | Domain-specific data enables quick training (~1 hour) |
| MCQ → Open-ended | Fine-tuning on MCQs improves open-ended reasoning |
| FP8 quantization | 2× speedup with minimal accuracy loss |

### Next Steps

- [ ] Experiment with other traffic datasets
- [ ] Try different frame sampling strategies
- [ ] Evaluate on edge cases (night, rain, occlusions)
- [ ] Deploy to production with monitoring

---

**Reference:** [NVIDIA Cosmos Cookbook](https://nvidia-cosmos.github.io/cosmos-cookbook/)