# Qwen3-VL Evaluation on m2sv Dataset
---

A minimal evaluation of Qwen3-VL-4B-Instruct on the m2sv multiple-choice dataset using Hugging Face Transformers.

## Table of Contents:
---
0. [Dependencies | Imports | Configuration]()
1. [Load Model]()
2. [Load Dataset]()
3. [Run Evaluation]()
4. [Analyze & Save Results]()

### 0. Dependencies | Imports | Configuration
---

In [None]:
%%capture
%pip install -U transformers accelerate datasets pillow torch torchvision

In [11]:
import gc
import json
import torch
import pandas as pd
from PIL import Image
from tqdm.auto import tqdm
from datetime import datetime
from datasets import load_dataset
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

In [12]:
CONFIG = {
    "model_name": "Qwen/Qwen3-VL-4B-Instruct",
    "dataset_name": "yosubshin/m2sv",
    "split": "train",
    "max_new_tokens": 128,
    "temperature": 0.1,
    "output_dir": "./qwen3_eval_results",
    "save_predictions": True,
}

print("\nConfiguration:")
print(f"   Model: {CONFIG['model_name']}")
print(f"   Dataset: {CONFIG['dataset_name']}")
print(f"   Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")


Configuration:
   Model: Qwen/Qwen3-VL-4B-Instruct
   Dataset: yosubshin/m2sv
   Device: cuda


### 1. Load Model
---

In [13]:
print("\n[1/4] Loading Qwen3-VL model...")
print(f"   Model: {CONFIG['model_name']}")

device = "cuda" if torch.cuda.is_available() else "cpu"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    CONFIG['model_name'],
    dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(CONFIG['model_name'])

print("Model loaded successfully")
print(f"   Parameters: {sum(p.numel() for p in model.parameters()) / 1e9:.2f}B")
print(f"   Device: {device}")


[1/4] Loading Qwen3-VL model...
   Model: Qwen/Qwen3-VL-4B-Instruct


Model loaded successfully
   Parameters: 4.44B
   Device: cuda
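
The reported parameter count and the chosen dtype give a rough lower bound on the GPU memory needed for the weights. The sketch below is a back-of-the-envelope estimate only: the helper name `weight_memory_gib` is hypothetical, and the figure excludes activations, the KV cache, and CUDA overhead.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Bytes per parameter for common dtypes
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

# 4.44e9 is the parameter count printed by the cell above
for dtype_name, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype_name}: {weight_memory_gib(4.44e9, nbytes):.2f} GiB")
```

Peak usage during generation will be higher than this estimate, especially with large images.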


### 2. Load Dataset
---

In [17]:
print("\n[2/4] Loading m2sv dataset...")

dataset = load_dataset(CONFIG['dataset_name'], split=CONFIG['split'])
print(f"Loaded {len(dataset)} samples")

sample = dataset[0]
print("\nSample question:")
print(f"  Q: {sample['question']}")
print(f"  Options: {sample['options']}")
print(f"  Answer: {sample['answer']}")


[2/4] Loading m2sv dataset...
Loaded 100 samples

Sample question:
  Q: Which labeled direction on the map corresponds to the direction in which the street view photo was taken?
  Options: ['A', 'B', 'C']
  Answer: C
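
Because each question offers only a few options, an accuracy figure is easier to interpret next to two simple baselines: uniform random guessing and always predicting the most frequent label. The sketch below uses toy data, and the helper names `chance_accuracy` and `majority_accuracy` are hypothetical. With the real dataset, the inputs would presumably be `dataset['options']` and `dataset['answer']`.

```python
from collections import Counter


def chance_accuracy(options_per_item):
    """Expected accuracy of uniform random guessing, averaged over items."""
    return sum(1 / len(opts) for opts in options_per_item) / len(options_per_item)


def majority_accuracy(answers):
    """Most frequent ground-truth label and the accuracy of always predicting it."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


# Toy data for illustration only (not taken from m2sv)
toy_options = [["A", "B", "C"], ["A", "B", "C", "D"], ["A", "B"]]
toy_answers = ["C", "A", "A"]

print(f"Chance baseline:   {chance_accuracy(toy_options):.2%}")
label, acc = majority_accuracy(toy_answers)
print(f"Majority baseline: {acc:.2%} (always '{label}')")
```

A model whose accuracy is close to either baseline may not be using the images at all.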


### 3. Run Evaluation
---

In [15]:
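import re

print("\n[3/4] Running evaluation...")

def format_prompt(question, options):
    """Format the question and its answer options as a single text prompt."""
    option_lines = "\n".join(str(opt) for opt in options)
    # List the letters actually offered (e.g. ['A', 'B', 'C'] in the sample
    # above) instead of hard-coding "A, B, C, or D"
    letters = ", ".join(str(opt) for opt in options)
    return (
        f"{question}\n\n"
        f"Options:\n{option_lines}\n\n"
        f"Provide only the letter of the correct answer ({letters})."
    )

def extract_answer(response):
    """Extract the answer letter (A-D) from the model response."""
    response = response.strip().upper()

    # Match a standalone letter. A plain substring test would return "A" for
    # any response that contains the word "ANSWER".
    match = re.search(r"\b([A-D])\b", response)
    if match:
        return match.group(1)

    # Fall back to the first character when no standalone letter is found
    return response[0] if response else ""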

results = []
correct = 0
total = len(dataset)

# Create progress bar
pbar = tqdm(dataset, desc="Evaluating", total=total)

for idx, item in enumerate(pbar):
    try:
        # Format prompt
        prompt = format_prompt(item['question'], item['options'])

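        # Prepare messages in Qwen3-VL chat format.
        # NOTE: only the street-view image (image_sv) is attached; the map that
        # the question refers to is never shown to the model.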
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": item['image_sv']},
                    {"type": "text", "text": prompt},
                ],
            }
        ]

        # Tokenize the chat messages and preprocess the image in a single call
        inputs = processor.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt"
        ).to(model.device)  # follow the device chosen by device_map="auto"

        # Generate (temperature only takes effect when sampling is enabled)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=CONFIG['max_new_tokens'],
                do_sample=True,
                temperature=CONFIG['temperature'],
            )

        # Decode response - trim the input tokens
        generated_ids_trimmed = [
            out_ids[len(in_ids):]
            for in_ids, out_ids in zip(inputs['input_ids'], outputs)
        ]

        response = processor.batch_decode(
            generated_ids_trimmed,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]

        # Extract answer
        prediction = extract_answer(response)
        ground_truth = item['answer']
        is_correct = prediction == ground_truth

        if is_correct:
            correct += 1

        # Store result
        results.append({
            'id': item['id'],
            'question': item['question'],
            'options': item['options'],
            'ground_truth': ground_truth,
            'prediction': prediction,
            'raw_response': response,
            'correct': is_correct,
        })

        # Update progress bar
        accuracy = correct / (idx + 1)
        pbar.set_postfix({'accuracy': f'{accuracy:.2%}'})

        # Free memory periodically
        if (idx + 1) % 10 == 0:
            gc.collect()
            if device == "cuda":
                torch.cuda.empty_cache()

    except Exception as e:
        print(f"\n⚠ Error on sample {idx}: {e}")
        import traceback
        traceback.print_exc()

        results.append({
            'id': item['id'],
            'question': item['question'],
            'options': item['options'],
            'ground_truth': item['answer'],
            'prediction': 'ERROR',
            'raw_response': str(e),
            'correct': False,
        })

pbar.close()


[3/4] Running evaluation...
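
Answer extraction is easy to get wrong when a model replies with a full sentence rather than a bare letter. The sketch below contrasts substring matching with word-boundary matching. Both function names are hypothetical, and the letter set A-D is an assumption.

```python
import re


def extract_naive(response):
    """Substring matching: returns the first of A-D that occurs anywhere."""
    text = response.strip().upper()
    for letter in "ABCD":
        if letter in text:
            return letter
    return ""


def extract_standalone(response, valid_letters="ABCD"):
    """Return the first valid letter that appears as a standalone token."""
    match = re.search(rf"\b([{valid_letters}])\b", response.strip().upper())
    return match.group(1) if match else ""


reply = "The answer is C."
print(extract_naive(reply))       # "A", picked up from the word "ANSWER"
print(extract_standalone(reply))  # "C"
```

Word-boundary matching is still a heuristic: the English article "a" counts as a standalone letter, so inspecting `raw_response` for long replies remains worthwhile.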


### 4. Analyze & Save Results
---

In [16]:
print("\n[4/4] Analysis and results...")

# Calculate metrics (samples that raised an error count as incorrect)
accuracy = correct / total if total else 0.0
results_df = pd.DataFrame(results)

# Print summary
print("\n" + "=" * 100)
print("EVALUATION RESULTS")
print("=" * 100)
print(f"Model: {CONFIG['model_name']}")
print(f"Dataset: {CONFIG['dataset_name']} ({CONFIG['split']} split)")
print(f"Total samples: {total}")
print(f"Correct: {correct}")
print(f"Incorrect: {total - correct}")
print(f"Accuracy: {accuracy:.2%}")
print("=" * 100)

# Show examples
print("\n✓ Sample Correct Predictions:")
correct_samples = results_df[results_df['correct']].head(3)
for _, row in correct_samples.iterrows():
    print(f"\nQ: {row['question'][:80]}...")
    print(f"   GT: {row['ground_truth']} | Pred: {row['prediction']}")
    print(f"   Response: {row['raw_response'][:100]}")

print("\n✗ Sample Incorrect Predictions:")
incorrect_samples = results_df[~results_df['correct']].head(3)
for _, row in incorrect_samples.iterrows():
    print(f"\nQ: {row['question'][:80]}...")
    print(f"   GT: {row['ground_truth']} | Pred: {row['prediction']}")
    print(f"   Response: {row['raw_response'][:100]}")

# Save results
if CONFIG['save_predictions']:
    import os
    os.makedirs(CONFIG['output_dir'], exist_ok=True)

    # Save detailed results
    results_path = f"{CONFIG['output_dir']}/predictions.csv"
    results_df.to_csv(results_path, index=False)
    print(f"\nSaved predictions to: {results_path}")

    # Save summary
    summary = {
        'model': CONFIG['model_name'],
        'dataset': CONFIG['dataset_name'],
        'split': CONFIG['split'],
        'total_samples': total,
        'correct': correct,
        'incorrect': total - correct,
        'accuracy': float(accuracy),
        'timestamp': datetime.now().isoformat(),
        'config': CONFIG,
    }

    summary_path = f"{CONFIG['output_dir']}/summary.json"
    with open(summary_path, 'w') as f:
        json.dump(summary, f, indent=2)
    print(f"✓ Saved summary to: {summary_path}")

print("\n" + "=" * 100)
print("Evaluation complete!")
print("=" * 100)

# Free memory
del model
del processor
gc.collect()
if device == "cuda":
    torch.cuda.empty_cache()


[4/4] Analysis and results...

EVALUATION RESULTS
Model: Qwen/Qwen3-VL-4B-Instruct
Dataset: yosubshin/m2sv (train split)
Total samples: 100
Correct: 36
Incorrect: 64
Accuracy: 36.00%

✓ Sample Correct Predictions:

Q: Which labeled direction on the map corresponds to the direction in which the str...
   GT: A | Pred: A
   Response: A

Q: Which labeled direction on the map corresponds to the direction in which the str...
   GT: A | Pred: A
   Response: A

Q: Which labeled direction on the map corresponds to the direction in which the str...
   GT: A | Pred: A
   Response: A

✗ Sample Incorrect Predictions:

Q: Which labeled direction on the map corresponds to the direction in which the str...
   GT: C | Pred: A
   Response: A

Q: Which labeled direction on the map corresponds to the direction in which the str...
   GT: C | Pred: A
   Response: A

Q: Which labeled direction on the map corresponds to the direction in which the str...
   GT: C | Pred: A
   Response: A

Saved predictions t
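
All six predictions shown in the sample output above are `A`, so it is worth checking whether the model collapses to a single answer, and how much uncertainty a 100-sample accuracy carries. The sketch below tallies predictions against ground truth and computes a Wilson score interval. The helper names are hypothetical, and the toy records only mirror the `ground_truth` and `prediction` fields stored in `results`.

```python
import math
from collections import Counter


def prediction_tally(results):
    """Count (ground_truth, prediction) pairs and overall predicted letters."""
    pairs = Counter((r["ground_truth"], r["prediction"]) for r in results)
    predicted = Counter(r["prediction"] for r in results)
    return pairs, predicted


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - margin, center + margin)


# Toy records for illustration only
toy_results = [
    {"ground_truth": "A", "prediction": "A"},
    {"ground_truth": "C", "prediction": "A"},
    {"ground_truth": "B", "prediction": "B"},
]
pairs, predicted = prediction_tally(toy_results)
print("Predicted letters:", predicted.most_common())
print("(truth, prediction) pairs:", dict(pairs))

# Interval for the accuracy reported above (36 correct out of 100)
low, high = wilson_interval(36, 100)
print(f"Approx. 95% interval: {low:.1%} to {high:.1%}")
```

If one letter dominates the predicted counts regardless of ground truth, the accuracy mostly reflects how often that letter happens to be correct.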