# Basic Theory of Mind Tests

This notebook demonstrates how to test theory of mind (ToM) and epistemic reasoning in language models using the selphi toolkit.

## What is Theory of Mind?

Theory of Mind is the ability to attribute mental states (beliefs, intentions, desires, knowledge) to oneself and others. For language models, this notebook probes six capabilities:

1. **False Belief Understanding**: Can the model track that different agents have different (possibly incorrect) beliefs?
2. **Knowledge Attribution**: Does the model understand who knows what based on observation?
3. **Perspective Taking**: Can the model reason from different viewpoints?
4. **Belief Updating**: Does the model understand how beliefs change with new information?
5. **Second-Order Beliefs**: Can the model reason about beliefs about beliefs?
6. **Epistemic States**: Can the model distinguish knowing from believing from guessing?
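
To make the first two items concrete, here is a minimal sketch of what "tracking beliefs based on observation" means. It is not part of selphi; the `World` class and its methods are hypothetical names invented for illustration. The rule it encodes is that an agent's belief about an object only updates when the agent is present to observe a change, which is the structure of the classic Sally-Anne task.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Hypothetical helper: tracks true object locations and each agent's beliefs."""
    locations: dict = field(default_factory=dict)  # object -> true location
    beliefs: dict = field(default_factory=dict)    # agent -> {object: believed location}
    present: set = field(default_factory=set)      # agents currently able to observe

    def enter(self, agent):
        # Entering makes the agent an observer; existing beliefs are kept as-is.
        self.present.add(agent)
        self.beliefs.setdefault(agent, {})

    def leave(self, agent):
        self.present.discard(agent)

    def move(self, obj, location):
        # The true state changes for everyone, but only observers update beliefs.
        self.locations[obj] = location
        for agent in self.present:
            self.beliefs[agent][obj] = location


world = World()
world.enter("Sally")
world.enter("Anne")
world.move("marble", "basket")  # both agents see this
world.leave("Sally")
world.move("marble", "box")     # only Anne sees this
world.enter("Sally")            # returning does not reveal the move

print("True location:  ", world.locations["marble"])
print("Sally believes: ", world.beliefs["Sally"]["marble"])
print("Anne believes:  ", world.beliefs["Anne"]["marble"])
```

A model that passes a false belief test answers with Sally's belief (the basket), not the true location (the box). Second-order beliefs would need one more level of nesting, for example what Anne believes Sally believes.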

## Setup

In [None]:
# 1. Add repository root to path to import harness
import sys
sys.path.append('../../../')  # Go up to repo root from notebooks/

# 2. Import SELPHI components for Theory of Mind testing
from selphi import (
    # Pre-defined ToM scenarios
    SALLY_ANNE,          # Classic false belief test
    CHOCOLATE_BAR,       # Chocolate bar false belief variant
    SURPRISE_PARTY,      # Surprise party scenario
    BROKEN_VASE,         # Broken vase scenario
    MOVIE_OPINIONS,      # Movie opinions (perspective-taking)
    WEATHER_UPDATE,      # Weather update (belief updating)
    GIFT_SURPRISE,       # Gift surprise (second-order belief)
    COIN_FLIP,           # Coin flip (epistemic state)
    DOOR_LOCKED,         # Door locked (knowledge attribution)
    
    # Core functions
    run_scenario,               # Run single ToM scenario
    run_multiple_scenarios,     # Run multiple scenarios
    run_all_scenarios,          # Run all pre-defined scenarios
    evaluate_scenario,          # Evaluate model response
    evaluate_batch,             # Evaluate multiple responses
    compare_models_on_scenarios,  # Compare different models
    results_to_dict_list,       # Convert results to dict format
    
    # Types and collections
    ToMType,              # Enum of ToM types (FALSE_BELIEF, KNOWLEDGE_ATTRIBUTION, etc.)
    ALL_SCENARIOS,        # All pre-defined scenarios
    get_scenarios_by_type,       # Filter by ToM type
    get_scenarios_by_difficulty,  # Filter by difficulty
)
from harness.defaults import DEFAULT_MODEL, DEFAULT_PROVIDER

# 3. Import JSON and pprint for data display
import json
from pprint import pprint

# 4. Show configuration
print("="*70)
print("🔧 SELPHI NOTEBOOK CONFIGURATION")
print("="*70)
print(f"📍 Provider: {DEFAULT_PROVIDER}")
print(f"🤖 Model: {DEFAULT_MODEL or '(default for provider)'}")
print("="*70)
print("\n💡 TO CHANGE: Add a cell with:")
print("   PROVIDER = 'mlx'  # or 'ollama', 'anthropic', 'openai'")
print("   MODEL = 'your-model'")
print("   Then pass to: run_scenario(..., provider=PROVIDER, model=MODEL)")
print("="*70 + "\n")

print("✓ Imports successful")


In [None]:
# ═══════════════════════════════════════════════════════════════════════
# ⚙️  CONFIGURATION - EDIT THIS CELL TO CHANGE SETTINGS
# ═══════════════════════════════════════════════════════════════════════

PROVIDER = 'ollama'  # Options: 'mlx', 'ollama', 'anthropic', 'openai'
MODEL = None         # Examples:
                     # MLX: 'mlx-community/Llama-3.2-3B-Instruct-4bit'
                     # Ollama: 'llama3.2:latest', 'qwen2.5:latest'
                     # Anthropic: 'claude-3-5-sonnet-20241022'
                     # OpenAI: 'gpt-4o'

# Use defaults if not specified
if MODEL is None:
    MODEL = DEFAULT_MODEL

# ═══════════════════════════════════════════════════════════════════════

print("Current Configuration:")
print("="*70)
print(f"📍 Provider: {PROVIDER}")
print(f"🤖 Model: {MODEL or '(default for provider)'}")
print("="*70)

## Example 1: Classic Sally-Anne Test

The Sally-Anne test is the most famous false belief task. Let's see how a model performs.

In [None]:
# 1. Print the Sally-Anne scenario as a prompt
print("SCENARIO:")
print(SALLY_ANNE.to_prompt())  # Converts scenario to text prompt

# 2. Print separator
print("\n" + "="*60)

# 3. Print the correct answers (what we expect from a model with ToM)
print("\nCORRECT ANSWERS:")
for i, answer in enumerate(SALLY_ANNE.correct_answers, 1):
    print(f"{i}. {answer}")

In [None]:
# 1. Run the Sally-Anne scenario with a model
# Change provider/model as needed
result = run_scenario(
    SALLY_ANNE,        # Scenario to test
    provider="ollama",  # Provider to use
    model=None,        # Use default model
    temperature=0.1    # Low temperature for consistent reasoning
)

# 2. Print the model's response
print("MODEL RESPONSE:")
print(result.model_response)

# 3. Print separator and performance metrics
print("\n" + "="*60)
print(f"\nLatency: {result.latency_s:.2f}s")
print(f"Tokens: {result.tokens_in} in, {result.tokens_out} out")
print(f"Cost: ${result.cost_usd:.4f}")

In [None]:
# 1. Evaluate the model's response against correct answers
eval_result = evaluate_scenario(
    SALLY_ANNE,                # Scenario with correct answers
    result.model_response,     # Model's response to evaluate
    method="semantic"          # Use semantic similarity (can also use "llm_judge")
)

# 2. Print evaluation summary
print("EVALUATION RESULTS:")
print(f"Average Score: {eval_result['average_score']:.2f}")

# 3. Print scores for each individual question
print("\nIndividual Question Scores:")
for i, score in enumerate(eval_result['individual_scores'], 1):
    print(f"  Question {i}: {score:.2f}")
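
The notebook does not show how the `"semantic"` method scores a response, so the following is only a rough intuition, not selphi's implementation. It substitutes a much cruder technique, token overlap, for real semantic similarity. The function names `token_set`, `overlap_score`, and `score_response` are hypothetical; only the result keys `individual_scores` and `average_score` mirror the evaluation output used above.

```python
import re


def token_set(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(response, expected):
    """Fraction of the expected answer's tokens found in the response (0.0 to 1.0)."""
    expected_tokens = token_set(expected)
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & token_set(response)) / len(expected_tokens)


def score_response(response, correct_answers):
    """Score one response against each expected answer and average the results."""
    individual = [overlap_score(response, answer) for answer in correct_answers]
    average = sum(individual) / len(individual) if individual else 0.0
    return {"individual_scores": individual, "average_score": average}


demo = score_response(
    "Sally will look in the basket because she did not see the marble move.",
    ["in the basket", "in the box"],
)
for i, score in enumerate(demo["individual_scores"], 1):
    print(f"  Expected answer {i}: {score:.2f}")
print(f"Average: {demo['average_score']:.2f}")
```

Note that the wrong answer ("in the box") still earns partial credit here because it shares function words with the response. That weakness of surface-level matching is one reason to cross-check scores with the `"llm_judge"` method.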

## Example 2: Run Multiple Scenarios

Let's test the model on several scenarios of varying difficulty.

In [None]:
# Get all easy scenarios
easy_scenarios = get_scenarios_by_difficulty("easy")

print(f"Running {len(easy_scenarios)} easy scenarios...")
results = run_multiple_scenarios(
    easy_scenarios,
    provider="ollama",
    model=None,
    verbose=True,
    temperature=0.1
)

In [None]:
# Evaluate all results
batch_eval = evaluate_batch(
    results_to_dict_list(results),
    method="semantic"
)

print("BATCH EVALUATION RESULTS:")
print(f"Overall Average Score: {batch_eval['overall_average']:.2f}")
print("\nScores by Scenario Type:")
for tom_type, score in batch_eval['by_type'].items():
    print(f"  {tom_type}: {score:.2f}")
print("\nScores by Difficulty:")
for difficulty, score in batch_eval['by_difficulty'].items():
    print(f"  {difficulty}: {score:.2f}")
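
The per-type and per-difficulty breakdowns above are grouped averages. As a sketch of that aggregation (an assumption about the logic, not selphi's code), the hypothetical `aggregate_scores` function below groups per-scenario records by the `scenario_type` and `difficulty` keys and averages `average_score` within each group. The sample records are made-up placeholders, not model results.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(evaluations):
    """Group per-scenario scores by type and difficulty, then average each group."""
    by_type = defaultdict(list)
    by_difficulty = defaultdict(list)
    for item in evaluations:
        by_type[item["scenario_type"]].append(item["average_score"])
        by_difficulty[item["difficulty"]].append(item["average_score"])
    return {
        "overall_average": mean(item["average_score"] for item in evaluations),
        "by_type": {key: mean(values) for key, values in by_type.items()},
        "by_difficulty": {key: mean(values) for key, values in by_difficulty.items()},
    }


# Placeholder records for illustration only; type labels are invented strings.
sample = [
    {"scenario_type": "false_belief", "difficulty": "easy", "average_score": 0.5},
    {"scenario_type": "false_belief", "difficulty": "easy", "average_score": 1.0},
    {"scenario_type": "second_order_belief", "difficulty": "hard", "average_score": 0.25},
]

summary = aggregate_scores(sample)
print(f"Overall: {summary['overall_average']:.2f}")
for tom_type, score in summary["by_type"].items():
    print(f"  {tom_type}: {score:.2f}")
```

Because each group is averaged separately, a type with a single scenario carries as much weight in the breakdown as a type with many, so small groups should be read with caution.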

## Example 3: Compare Different Models

Compare how different models perform on the same scenarios.

In [None]:
# Define models to compare
models_to_compare = [
    {
        'name': 'ollama-default',
        'provider': 'ollama',
        'model': None,
        'kwargs': {'temperature': 0.1}
    },
    # Add more models as needed
    # {
    #     'name': 'gpt-4o-mini',
    #     'provider': 'openai',
    #     'model': 'gpt-4o-mini',
    #     'kwargs': {'temperature': 0.1}
    # },
]

# Use just false belief scenarios for comparison
false_belief_scenarios = get_scenarios_by_type(ToMType.FALSE_BELIEF)

print(f"Comparing {len(models_to_compare)} models on {len(false_belief_scenarios)} scenarios...")
comparison_results = compare_models_on_scenarios(
    false_belief_scenarios,
    models_to_compare,
    verbose=True
)

In [None]:
# Convert to format for comparison evaluation
model_results_for_eval = {
    model_name: results_to_dict_list(results)
    for model_name, results in comparison_results.items()
}

# Compare
from selphi import compare_models
comparison_eval = compare_models(model_results_for_eval, method="semantic")

print("MODEL COMPARISON:")
print("\nOverall Scores:")
for model, score in comparison_eval['overall_scores'].items():
    print(f"  {model}: {score:.2f}")

print("\nScores by Difficulty:")
for difficulty, model_scores in comparison_eval['by_difficulty'].items():
    print(f"  {difficulty}:")
    for model, score in model_scores.items():
        print(f"    {model}: {score:.2f}")

print("\nScores by ToM Type:")
for tom_type, model_scores in comparison_eval['by_type'].items():
    print(f"  {tom_type}:")
    for model, score in model_scores.items():
        print(f"    {model}: {score:.2f}")

## Example 4: LLM-as-Judge Evaluation

For more nuanced evaluation, we can use another LLM as a judge.

In [None]:
# Run a complex scenario
result = run_scenario(
    GIFT_SURPRISE,  # Second-order belief scenario
    provider="ollama",
    temperature=0.3
)

print("SCENARIO:")
print(GIFT_SURPRISE.to_prompt())
print("\n" + "="*60)
print("\nMODEL RESPONSE:")
print(result.model_response)

In [None]:
# Evaluate with LLM judge
judge_eval = evaluate_scenario(
    GIFT_SURPRISE,
    result.model_response,
    method="llm_judge",
    judge_provider="ollama",
    judge_model=None  # Can specify a different model for judging
)

print("LLM JUDGE EVALUATION:")
print(f"\nNormalized Score: {judge_eval['normalized_score']:.2f}")
print(f"Average Score: {judge_eval['average_score']:.1f}/10")

print("\nIndividual Question Scores:")
for i, (score, reasoning) in enumerate(zip(judge_eval['individual_scores'], judge_eval['individual_reasonings']), 1):
    print(f"\nQuestion {i}: {score}/10")
    print(f"Reasoning: {reasoning}")

print("\n" + "="*60)
print("\nOVERALL ASSESSMENT:")
print(judge_eval['overall_assessment'])
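
An LLM judge generally works in two steps: build a grading prompt, then parse a score out of the judge's free-text reply. The sketch below illustrates that pattern under stated assumptions. The prompt wording, the `Score:`/`Reasoning:` reply format, and the function names `build_judge_prompt` and `parse_judge_reply` are all hypothetical; selphi's actual judge prompt and parsing are not shown in this notebook. The 0-10 scale matches the output printed above, and dividing by 10 is one plausible way to obtain a normalized score.

```python
import re


def build_judge_prompt(question, expected, response):
    """Assemble a grading prompt that asks for a fixed, parseable reply format."""
    return (
        "You are grading a theory-of-mind answer.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Model response: {response}\n"
        "Reply in the form 'Score: <0-10>' followed by 'Reasoning: <one sentence>'."
    )


def parse_judge_reply(reply):
    """Return (score, reasoning); score is None if the reply has no parseable score."""
    score_match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    reasoning_match = re.search(r"Reasoning:\s*(.+)", reply, flags=re.DOTALL)
    if score_match is None:
        return None, reply.strip()
    # Clamp to the 0-10 scale in case the judge goes out of range.
    score = min(max(float(score_match.group(1)), 0.0), 10.0)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return score, reasoning


# A hand-written reply standing in for real judge output.
score, reasoning = parse_judge_reply(
    "Score: 8\nReasoning: Tracks the second-order belief correctly."
)
print(f"Score: {score}/10 (normalized {score / 10:.2f})")
print(f"Reasoning: {reasoning}")
```

Returning `None` for an unparseable reply, rather than defaulting to zero, keeps judge formatting failures from being counted as wrong answers.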

## Example 5: Test All Scenarios

Run a comprehensive test across all ToM scenario types.

In [None]:
# Run all scenarios
print(f"Running all {len(ALL_SCENARIOS)} scenarios...")
all_results = run_all_scenarios(
    provider="ollama",
    temperature=0.2,
    verbose=True
)

In [None]:
# Comprehensive evaluation
comprehensive_eval = evaluate_batch(
    results_to_dict_list(all_results),
    method="semantic"
)

print("COMPREHENSIVE EVALUATION:")
print(f"\nTotal Scenarios Tested: {comprehensive_eval['total_scenarios']}")
print(f"Overall Average Score: {comprehensive_eval['overall_average']:.2f}")

print("\nScores by ToM Type:")
for tom_type, score in sorted(comprehensive_eval['by_type'].items(), key=lambda x: x[1], reverse=True):
    print(f"  {tom_type}: {score:.2f}")

print("\nScores by Difficulty:")
for difficulty in ['easy', 'medium', 'hard']:
    if difficulty in comprehensive_eval['by_difficulty']:
        score = comprehensive_eval['by_difficulty'][difficulty]
        print(f"  {difficulty}: {score:.2f}")

# Detailed breakdown
print("\n" + "="*60)
print("\nDETAILED BREAKDOWN:")
for eval_result in comprehensive_eval['evaluations']:
    print(f"\n{eval_result['scenario_name']} ({eval_result['scenario_type']}, {eval_result['difficulty']}):")
    print(f"  Score: {eval_result['average_score']:.2f}")

## Exploration: Create Your Own Scenarios

You can create custom ToM scenarios to test specific hypotheses.

In [None]:
from selphi import ToMScenario, ToMType

# Create a custom scenario
custom_scenario = ToMScenario(
    name="custom_test",
    tom_type=ToMType.FALSE_BELIEF,
    setup="Your scenario setup here...",
    events=[
        "Event 1...",
        "Event 2..."
    ],
    test_questions=[
        "Question 1?",
        "Question 2?"
    ],
    correct_answers=[
        "Expected answer 1",
        "Expected answer 2"
    ],
    reasoning="Why this tests ToM...",
    difficulty="medium"
)

# Test it
# result = run_scenario(custom_scenario, provider="ollama")
# eval_result = evaluate_scenario(custom_scenario, result.model_response)
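
Before running a custom scenario, it can help to check that the draft is internally consistent, for example that every question has an expected answer. The sketch below uses a hypothetical stand-in dataclass, `ScenarioDraft`, that mirrors some of the `ToMScenario` fields shown above so it runs without selphi installed. The `problems` method and the example scenario content are illustrative assumptions, not part of the toolkit.

```python
from dataclasses import dataclass


@dataclass
class ScenarioDraft:
    """Hypothetical stand-in mirroring some ToMScenario fields from the cell above."""
    name: str
    setup: str
    events: list
    test_questions: list
    correct_answers: list
    difficulty: str = "medium"

    def problems(self):
        """Return a list of problems found; an empty list means the draft looks usable."""
        found = []
        if not self.events:
            found.append("no events")
        if len(self.test_questions) != len(self.correct_answers):
            found.append("question/answer count mismatch")
        if self.difficulty not in {"easy", "medium", "hard"}:
            found.append(f"unknown difficulty: {self.difficulty}")
        return found


# An invented false belief scenario: the object moves while its owner is away.
draft = ScenarioDraft(
    name="moved_keys",
    setup="Maya and Leo share an office. Maya leaves her keys on the desk.",
    events=[
        "Maya goes to a meeting.",
        "While she is away, Leo puts the keys in the drawer.",
        "Maya returns to the office.",
    ],
    test_questions=[
        "Where will Maya look for her keys first?",
        "Where are the keys really?",
    ],
    correct_answers=["On the desk", "In the drawer"],
)
print(draft.problems())
```

Pairing a belief question with a reality question, as in this draft, helps separate a model that tracks the agent's belief from one that simply repeats the last location mentioned.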

## Next Steps

1. **Experiment with different models**: Compare local models (Ollama, MLX) with API models (Claude, GPT)
2. **Vary temperature**: Test whether higher temperature affects ToM reasoning
3. **Test with different prompt formats**: Try chain-of-thought prompting
4. **Create domain-specific scenarios**: Design ToM tests for your specific use case
5. **Fine-tune models**: See if fine-tuning improves ToM capabilities
6. **Analyze failure modes**: Study which types of ToM reasoning are hardest
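
For the temperature experiment in item 2, one possible harness is sketched below. It repeats a scoring call several times per temperature and reports the mean and spread, since sampling at higher temperatures makes single runs noisy. The `sweep_temperatures` function is a hypothetical helper, and `fixed_score` is a placeholder that returns a constant so the plumbing can be run without a model; it produces no real measurements.

```python
from statistics import mean, pstdev


def sweep_temperatures(score_fn, temperatures, trials=3):
    """Call score_fn(temperature) `trials` times per temperature and summarize."""
    summary = {}
    for temperature in temperatures:
        scores = [score_fn(temperature) for _ in range(trials)]
        summary[temperature] = {"mean": mean(scores), "stdev": pstdev(scores)}
    return summary


def fixed_score(temperature):
    # Placeholder scorer, not real data. In the notebook this would wrap
    # run_scenario(..., temperature=temperature) followed by evaluate_scenario(...)
    # and return the resulting 'average_score'.
    return 0.5


sweep = sweep_temperatures(fixed_score, [0.0, 0.5, 1.0], trials=3)
for temperature, stats in sweep.items():
    print(f"T={temperature}: mean={stats['mean']:.2f}, stdev={stats['stdev']:.2f}")
```

Reporting the standard deviation alongside the mean makes it easier to tell a genuine temperature effect from run-to-run variation.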

## Research Questions to Explore

- Do larger models have better ToM capabilities?
- Which ToM types are easiest/hardest for current models?
- Does multi-agent debate improve ToM reasoning?
- Can we identify specific architectural features that support ToM?
- How does fine-tuning on ToM tasks affect other capabilities?