# Benchmarking Theory of Mind with Established Datasets

This notebook shows how to run three established Theory of Mind (ToM) benchmarks with SELPHI:

1. **ToMBench** (ACL 2024) - 2,860 samples across 8 tasks
2. **OpenToM** (2024) - 696 narratives with 16,008 questions
3. **SocialIQA** (2019) - 38,000 QA pairs on social reasoning

## Why Use Benchmarks?

- **Standardized Evaluation**: Compare your models against published results
- **Comprehensive Coverage**: Test across diverse ToM scenarios
- **Reproducibility**: Use the same data as other researchers
- **Scale**: Test on thousands of scenarios

## Setup

In [None]:
# 1. Add repository root to path to import harness
import sys
sys.path.append('../../../')  # Go up to repo root from notebooks/

# 2. Import SELPHI benchmark loading functions
from selphi import (
    # Benchmark loaders
    load_tombench,        # Load ToMBench dataset (ACL 2024)
    load_opentom,         # Load OpenToM dataset
    load_socialiqa,       # Load SocialIQA dataset
    print_benchmark_info,  # Print info about available benchmarks
    
    # Task execution
    run_scenario,          # Run single scenario
    run_multiple_scenarios,  # Run multiple scenarios
    
    # Evaluation
    evaluate_scenario,     # Evaluate single response
    evaluate_batch,        # Evaluate multiple responses
    results_to_dict_list,  # Convert results to dict format
)

## View Available Benchmarks

In [None]:
# Print information about all available ToM benchmarks,
# including which ones are available and how to install them
print_benchmark_info()

## Example 1: Load ToMBench

First, clone the repository:
```bash
git clone https://github.com/zhchen18/ToMBench.git
```

After cloning, the dataset is at `./ToMBench/data/`.
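
Before calling the loader, it can help to confirm that the clone landed where expected. The sketch below is a minimal check using only the standard library. `list_data_files` is a hypothetical helper, not part of SELPHI, and it assumes nothing about the file names or formats inside `./ToMBench/data/`.

```python
from pathlib import Path


def list_data_files(data_dir="./ToMBench/data"):
    """Return the sorted file names found directly under data_dir.

    Hypothetical helper (not part of SELPHI). Raises FileNotFoundError
    if the directory does not exist, for example because the repository
    has not been cloned yet.
    """
    path = Path(data_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {path}")
    return sorted(p.name for p in path.iterdir() if p.is_file())


# Example usage (run from the directory where ToMBench was cloned):
# print(list_data_files())
```

An empty list or a `FileNotFoundError` here suggests the clone is in a different working directory than the notebook.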

In [None]:
# Load ToMBench
try:
    tombench = load_tombench()
    
    print(f"Loaded {tombench.total_count} scenarios from ToMBench")
    print(f"Source: {tombench.source}")
    print(f"\nMetadata: {tombench.metadata}")
    
    # Show a sample scenario
    print(f"\nSample scenario:")
    sample = tombench.scenarios[0]
    print(f"Name: {sample.name}")
    print(f"Type: {sample.tom_type.value}")
    print(f"\nPrompt:\n{sample.to_prompt()}")
    
except FileNotFoundError as e:
    print(f"Error: {e}")
    print("\nPlease install ToMBench first:")
    print("git clone https://github.com/zhchen18/ToMBench.git")

### Run Model on ToMBench Scenarios

In [None]:
# Run a subset of ToMBench scenarios
try:
    # Take first 10 scenarios for quick testing
    test_scenarios = tombench.scenarios[:10]
    
    print(f"Running {len(test_scenarios)} ToMBench scenarios...")
    
    results = run_multiple_scenarios(
        test_scenarios,
        provider="ollama",
        temperature=0.1,  # Low temperature for consistent reasoning
        verbose=True
    )
    
    print(f"\nCompleted {len(results)} scenarios")
    
except NameError:
    print("ToMBench not loaded. Run the previous cell first.")

### Evaluate ToMBench Results

In [None]:
# Evaluate results
try:
    batch_eval = evaluate_batch(
        results_to_dict_list(results),
        method="semantic"
    )
    
    print("ToMBench Evaluation Results:")
    print(f"Overall Average: {batch_eval['overall_average']:.3f}")
    print(f"\nBy ToM Type:")
    for tom_type, score in batch_eval['by_type'].items():
        print(f"  {tom_type}: {score:.3f}")
    
except NameError:
    print("No results to evaluate. Run the previous cell first.")

## Example 2: Load OpenToM

First, clone the repository:
```bash
git clone https://github.com/seacowx/OpenToM.git
```

Or load it from HuggingFace with the `datasets` library:
```python
from datasets import load_dataset
dataset = load_dataset("SeacowX/OpenToM")
```
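
When `load_dataset` is called without a `split` argument, the object it returns behaves like a dictionary of splits, so a quick size check is possible before running anything. This is a minimal sketch: `split_sizes` is a hypothetical helper, and no split names are assumed because they depend on the dataset.

```python
def split_sizes(dataset_dict):
    """Map each split name to its number of rows.

    Hypothetical helper. Works on any mapping whose values support len(),
    which includes the DatasetDict returned by datasets.load_dataset().
    """
    return {name: len(split) for name, split in dataset_dict.items()}


# Usage with the HuggingFace dataset (needs network access):
# from datasets import load_dataset
# print(split_sizes(load_dataset("SeacowX/OpenToM")))
```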

In [None]:
# Load OpenToM
try:
    opentom = load_opentom(
        include_long=False,  # Exclude long narratives for faster testing
        question_types=None  # Load all question types
    )
    
    print(f"Loaded {opentom.total_count} scenarios from OpenToM")
    print(f"Source: {opentom.source}")
    print(f"\nMetadata: {opentom.metadata}")
    
    # Show a sample scenario
    print(f"\nSample scenario:")
    sample = opentom.scenarios[0]
    print(f"Name: {sample.name}")
    print(f"Type: {sample.tom_type.value}")
    print(f"Question Type: {sample.metadata.get('question_type')}")
    print(f"\nNarrative:\n{sample.setup[:200]}...")
    print(f"\nQuestion: {sample.test_questions[0]}")
    
except FileNotFoundError as e:
    print(f"Error: {e}")
    print("\nPlease install OpenToM first:")
    print("git clone https://github.com/seacowx/OpenToM.git")

### Test on OpenToM Scenarios

In [None]:
# Test on a small sample
try:
    test_scenarios = opentom.scenarios[:5]  # Just 5 for demo
    
    print(f"Running {len(test_scenarios)} OpenToM scenarios...")
    
    opentom_results = run_multiple_scenarios(
        test_scenarios,
        provider="ollama",
        temperature=0.2,
        verbose=True
    )
    
    # Evaluate
    opentom_eval = evaluate_batch(
        results_to_dict_list(opentom_results),
        method="semantic"
    )
    
    print(f"\nOpenToM Results:")
    print(f"Average Score: {opentom_eval['overall_average']:.3f}")
    
except NameError:
    print("OpenToM not loaded. Run the previous cell first.")

## Example 3: Load SocialIQA

SocialIQA is available on HuggingFace. Loading it requires the `datasets` library:
```bash
pip install datasets
```
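
SocialIQA is a three-way multiple-choice task, so published results are usually reported as accuracy. The sketch below shows one way to turn a record into a lettered prompt and to score letter predictions. It assumes the field names used by the HuggingFace `social_i_qa` dataset (`context`, `question`, `answerA`, `answerB`, `answerC`, and a string `label` of `"1"` to `"3"`). Both helpers are hypothetical and not part of SELPHI, and the example record is invented for illustration.

```python
def format_socialiqa(record):
    """Turn one SocialIQA-style record into a prompt and the gold letter.

    Assumes fields context, question, answerA/B/C and a label of
    "1", "2" or "3" stored as a string. Hypothetical helper.
    """
    choices = [record["answerA"], record["answerB"], record["answerC"]]
    letters = "ABC"
    lines = [record["context"], record["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append("Answer with A, B, or C.")
    gold = letters[int(record["label"]) - 1]
    return "\n".join(lines), gold


def accuracy(predicted, gold):
    """Fraction of positions where the predicted letter equals the gold letter."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(p.strip().upper() == g for p, g in zip(predicted, gold))
    return hits / len(gold)


# Illustrative record written for this example (not taken from the dataset).
example = {
    "context": "Alex spilled coffee on Sam's notes and apologized right away.",
    "question": "How would Sam feel afterwards?",
    "answerA": "annoyed but understanding",
    "answerB": "proud of the notes",
    "answerC": "eager to spill more coffee",
    "label": "1",
}
prompt, gold = format_socialiqa(example)
print(prompt)
print("Gold:", gold)
```

Exact-match accuracy over letters is a different quantity from a semantic similarity score, so the two should be compared with care.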

In [None]:
# Load SocialIQA (validation split)
try:
    socialiqa = load_socialiqa(
        split="validation",
        max_samples=20  # Limit for demo
    )
    
    print(f"Loaded {socialiqa.total_count} scenarios from SocialIQA")
    print(f"Source: {socialiqa.source}")
    print(f"Full dataset size: {socialiqa.metadata['full_dataset_size']}")
    
    # Show a sample
    print(f"\nSample scenario:")
    sample = socialiqa.scenarios[0]
    print(f"Context: {sample.setup}")
    print(f"Question: {sample.test_questions[0]}")
    print(f"Choices: {sample.metadata['choices']}")
    print(f"Correct Answer: {sample.correct_answers[0]}")
    
except ImportError:
    print("Error: HuggingFace datasets library not installed")
    print("Install with: pip install datasets")
except Exception as e:
    print(f"Error: {e}")

### Test on SocialIQA

In [None]:
# Run SocialIQA scenarios
try:
    print(f"Running {len(socialiqa.scenarios)} SocialIQA scenarios...")
    
    socialiqa_results = run_multiple_scenarios(
        socialiqa.scenarios,
        provider="ollama",
        temperature=0.1,
        verbose=True
    )
    
    # Evaluate
    socialiqa_eval = evaluate_batch(
        results_to_dict_list(socialiqa_results),
        method="semantic"
    )
    
    print("\nSocialIQA Results:")
    print(f"Average Score: {socialiqa_eval['overall_average']:.3f}")
    print("\nNote: GPT-4 achieves ~79% accuracy on this benchmark")
    print("      Human baseline is ~84% accuracy")
    print("      A semantic score on a small sample may not be directly comparable")
    
except NameError:
    print("SocialIQA not loaded. Run the previous cell first.")

## Example 4: Compare Across Benchmarks

In [None]:
# Compare performance across benchmarks
try:
    print("Performance Comparison Across Benchmarks:\n")
    print(f"{'Benchmark':<20} {'Avg Score':<12} {'Scenarios Tested':<20}")
    print("="*52)
    
    if 'batch_eval' in locals():
        print(f"{'ToMBench':<20} {batch_eval['overall_average']:<12.3f} {len(results):<20}")
    
    if 'opentom_eval' in locals():
        print(f"{'OpenToM':<20} {opentom_eval['overall_average']:<12.3f} {len(opentom_results):<20}")
    
    if 'socialiqa_eval' in locals():
        print(f"{'SocialIQA':<20} {socialiqa_eval['overall_average']:<12.3f} {len(socialiqa_results):<20}")
    
    print("\nNote: These are small samples. Run on full datasets for accurate comparison.")
    
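except Exception as e:
    # Missing benchmarks are already skipped by the checks above,
    # so anything caught here is an unexpected error worth showing.
    print(f"Comparison failed: {e}")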

## Example 5: Full Benchmark Run (Large Scale)

For a complete evaluation, run every scenario in a benchmark. The cell below is commented out because a full run can be slow:

In [None]:
# Full benchmark evaluation (commented out - can be slow!)
# Uncomment to run full evaluation

# def run_full_benchmark(benchmark_name: str, provider="ollama", model=None):
#     """Run complete benchmark evaluation"""
#     
#     if benchmark_name == "tombench":
#         benchmark = load_tombench()
#     elif benchmark_name == "opentom":
#         benchmark = load_opentom(include_long=True)
#     elif benchmark_name == "socialiqa":
#         benchmark = load_socialiqa(split="validation", max_samples=None)
#     else:
#         raise ValueError(f"Unknown benchmark: {benchmark_name}")
#     
#     print(f"Running full {benchmark_name} benchmark ({benchmark.total_count} scenarios)...")
#     print("This may take a while!\n")
#     
#     results = run_multiple_scenarios(
#         benchmark.scenarios,
#         provider=provider,
#         model=model,
#         temperature=0.1,
#         verbose=True
#     )
#     
#     evaluation = evaluate_batch(
#         results_to_dict_list(results),
#         method="semantic"
#     )
#     
#     return {
#         "benchmark": benchmark_name,
#         "total_scenarios": len(results),
#         "evaluation": evaluation,
#         "results": results
#     }
#
# # Run full evaluation
# full_results = run_full_benchmark("tombench", provider="ollama")
# print(f"\nFinal Score: {full_results['evaluation']['overall_average']:.3f}")

## Integration with Experiment Tracking

Track benchmark runs with the harness:

In [None]:
from harness import get_tracker, ExperimentConfig, ExperimentResult

# Example: Track benchmark evaluation
def track_benchmark_run(benchmark_name, scenarios, provider="ollama"):
    """Run and track a benchmark evaluation"""
    
    # Start tracking
    tracker = get_tracker()
    experiment_dir = tracker.start_experiment(ExperimentConfig(
        experiment_name=f"benchmark_{benchmark_name}",
        strategy="single",
        provider=provider,
        metadata={
            "project": "selphi",
            "benchmark": benchmark_name,
            "scenario_count": len(scenarios)
        }
    ))
    
    # Run scenarios
    results = run_multiple_scenarios(scenarios, provider=provider, verbose=True)
    
    # Evaluate
    evaluation = evaluate_batch(results_to_dict_list(results), method="semantic")
    
    # Log each result
    for result, eval_item in zip(results, evaluation['evaluations']):
        tracker.log_result(ExperimentResult(
            task_input=result.metadata['prompt'],
            output=result.model_response,
            strategy_name="single",
            latency_s=result.latency_s,
            tokens_in=result.tokens_in,
            tokens_out=result.tokens_out,
            cost_usd=result.cost_usd,
            metadata={
                "scenario": result.scenario_name,
                "score": eval_item['average_score'],
                "benchmark": benchmark_name
            }
        ))
    
    # Finish tracking
    summary = tracker.finish_experiment()
    
    print(f"\nResults saved to: {experiment_dir}")
    print(f"Overall Score: {evaluation['overall_average']:.3f}")
    
    return evaluation

# Example usage:
# evaluation = track_benchmark_run("tombench", tombench.scenarios[:10])

## Next Steps

1. **Run Full Benchmarks**: Evaluate on complete datasets for accurate metrics
2. **Compare Models**: Test different models on the same benchmarks
3. **Analyze Failures**: Study which ToM types are hardest
4. **Fine-Tune**: Use insights to improve model performance
5. **Publish Results**: Share findings with the research community
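
For steps 1 and 3, per-type averages are easier to interpret alongside a sample count and an uncertainty estimate, since a mean over a handful of scenarios can shift noticeably between runs. The following is a minimal sketch under stated assumptions: the `(tom_type, score)` input format and the type names are hypothetical, and the scores are made up.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def summarize_by_type(scored):
    """Group (tom_type, score) pairs and report mean, standard error, and n.

    `scored` is a hypothetical input format: an iterable of
    (tom_type, score) pairs. Rows come back sorted by mean score,
    lowest (hardest) first.
    """
    groups = defaultdict(list)
    for tom_type, score in scored:
        groups[tom_type].append(score)

    rows = []
    for tom_type, scores in groups.items():
        n = len(scores)
        # Standard error needs at least two samples; report None otherwise.
        se = stdev(scores) / sqrt(n) if n > 1 else None
        rows.append((tom_type, mean(scores), se, n))
    return sorted(rows, key=lambda row: row[1])


# Made-up scores and type names, for illustration only.
demo = [
    ("false_belief", 0.4), ("false_belief", 0.6),
    ("emotion", 0.9), ("emotion", 0.7),
    ("intention", 0.8),
]
for tom_type, avg, se, n in summarize_by_type(demo):
    se_text = f"{se:.3f}" if se is not None else "n/a"
    print(f"{tom_type:<14} mean={avg:.3f} se={se_text} n={n}")
```

A large standard error relative to the gap between two types is a sign that more scenarios are needed before calling one type harder than another.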

## References

- ToMBench: Chen et al., ACL 2024 - https://github.com/zhchen18/ToMBench
- OpenToM: Xu et al., 2024 - https://github.com/seacowx/OpenToM
- SocialIQA: Sap et al., 2019 - https://huggingface.co/datasets/allenai/social_i_qa