# Reproducing minAction.net Test Results

This notebook walks through reproducing the empirical validation results reported in the paper.

## Setup

In [None]:
import sys
sys.path.append('../')

from src.model_interface import OllamaInterface
from src.evaluation import evaluate_test
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
%matplotlib inline

## 1. Load Existing Results

First, let's load and examine the results from the paper.

In [None]:
# Load detailed results
with open('../results/qwen2_math_7b/detailed_results.json', 'r') as f:
    results = json.load(f)

# Display metadata
print("Metadata:")
for key, value in results['metadata'].items():
    print(f"  {key}: {value}")
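
Later cells assume the JSON has a particular shape: a `metadata` dict and a `tests` list whose entries carry `name`, `category`, `score`, and `status`. The sketch below checks for those keys before plotting. The helper name `check_results_schema` is hypothetical and is not part of the repository's `src` package; the key list is taken only from what this notebook reads.

```python
def check_results_schema(results, required_test_keys=("name", "category", "score", "status")):
    """Return a list of human-readable problems; an empty list means the structure looks usable.

    Hypothetical helper: the expected keys are inferred from the cells in this notebook.
    """
    problems = []
    # Top-level keys that the notebook reads directly.
    for top_key in ("metadata", "tests"):
        if top_key not in results:
            problems.append(f"missing top-level key: {top_key}")
    # Per-test keys used when building the DataFrame.
    for i, test in enumerate(results.get("tests", [])):
        missing = [k for k in required_test_keys if k not in test]
        if missing:
            problems.append(f"test {i} is missing keys: {missing}")
    return problems


# Illustrative in-memory example (not data from the paper).
example = {"metadata": {}, "tests": [{"name": "demo", "category": "c", "score": 1.0, "status": "PASS"}]}
for problem in check_results_schema(example):
    print(problem)
```

In the notebook itself, calling `check_results_schema(results)` right after loading gives an early, readable failure instead of a `KeyError` in a later cell.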

## 2. Visualize Results by Category

In [None]:
# Extract test data
tests = results['tests']
df = pd.DataFrame([
    {
        'Test': t['name'],
        'Category': t['category'],
        'Score': t['score'],
        'Status': t['status']
    }
    for t in tests
])

# Plot by category
fig, ax = plt.subplots(figsize=(10, 6))
category_scores = df.groupby('Category')['Score'].mean()
category_scores.plot(kind='bar', ax=ax, color='steelblue')
ax.set_ylabel('Average Score')
ax.set_title('Performance by Test Category')
ax.set_ylim([0, 1.0])
ax.axhline(y=0.61, color='r', linestyle='--', label='Overall Average (61%)')
ax.legend()
plt.tight_layout()
plt.show()

print("\nCategory Breakdown:")
print(category_scores)
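
The dashed reference line in the plot uses a hard-coded 61%. As a sanity check, the overall mean can be recomputed from the loaded tests and compared with that figure. This is a minimal sketch that assumes scores are fractions in [0, 1] and that the reported average is an unweighted mean over tests; neither is confirmed here. The function names and the tolerance are hypothetical.

```python
from statistics import mean


def overall_average(tests):
    """Unweighted mean of per-test scores (assumed to be fractions in [0, 1])."""
    scores = [t["score"] for t in tests]
    if not scores:
        raise ValueError("no tests to average")
    return mean(scores)


def matches_reported(tests, reported=0.61, tol=0.005):
    """True if the recomputed mean is within `tol` of the reported value.

    The default tolerance is an assumption chosen to absorb rounding to two decimals.
    """
    return abs(overall_average(tests) - reported) <= tol


# Illustrative scores only (not data from the paper).
example_tests = [{"score": 1.0}, {"score": 0.5}, {"score": 0.0}]
print(overall_average(example_tests))
print(matches_reported(example_tests, reported=0.5))
```

If `matches_reported(tests)` returns `False` on the real data, the 61% figure may be weighted differently (for example, averaged per category), which is worth checking against the paper before drawing conclusions.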

## 3. Reproduce a Single Test

Let's reproduce Test 1 (General Lagrangian) to verify that the setup works.

In [None]:
# Initialize model interface
model = OllamaInterface(model_name='qwen2-math:7b')

# Load prompt
with open('../prompts/phase1_basic/test_1.txt', 'r') as f:
    prompt = f.read()

print("Prompt:")
print(prompt)
print("\n" + "="*80 + "\n")

# Get response
response = model.generate(prompt)
print("Model Response:")
print(response)

## 4. Manual Evaluation

Compare this response to the one in `detailed_results.json`. Does it match?

In [None]:
# Show original response from paper (assumes an entry with test_id == 1 exists)
test_1_original = next(t for t in tests if t['test_id'] == 1)
print("Original Response from Paper:")
print(test_1_original['model_response'])
print("\nScore:", test_1_original['score'])
print("Status:", test_1_original['status'])
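
A regenerated response may not be character-identical to the stored one, for example if generation is sampled, so an exact string comparison can be too strict. The sketch below gives a rough similarity score using the standard library's `difflib`. The helper names are hypothetical, and the score measures textual overlap only, not mathematical equivalence of the derivations.

```python
from difflib import SequenceMatcher


def _normalize(text):
    """Collapse runs of whitespace and lowercase, so formatting differences are ignored."""
    return " ".join(text.split()).lower()


def response_similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


# Illustrative strings only (not actual model output).
print(response_similarity("L = T - V", "l =  t - v\n"))
```

In the notebook, `response_similarity(response, test_1_original['model_response'])` gives a quick first signal; a low ratio is a prompt to read both responses side by side rather than proof that the reproduction failed.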

## 5. Run Full Test Suite (Optional)

**Warning**: This will make 9 API calls and take ~2-3 minutes.

In [None]:
# Uncomment to run full test suite
# !python ../scripts/run_tests.py --model qwen2-math:7b --output ../results/reproduction/

## 6. Compare Phase 1 vs Phase 2 Results

In [None]:
# Phase 2 results
phase2_tests = results.get('phase2_tests', [])
if phase2_tests:
    df_phase2 = pd.DataFrame([
        {
            'Test': t['name'],
            'Phase 1 Score': t.get('original_score', 0),
            'Phase 2 Score': t['score'],
            'Improvement': t.get('improvement', 'N/A')
        }
        for t in phase2_tests
    ])
    
    print("Phase 2 (With Selection Principles) Results:")
    print(df_phase2)
    
    # Plot comparison
    fig, ax = plt.subplots(figsize=(10, 6))
    x = range(len(df_phase2))
    width = 0.35
    ax.bar([i - width/2 for i in x], df_phase2['Phase 1 Score'], width, label='Phase 1 (Basic)', alpha=0.8)
    ax.bar([i + width/2 for i in x], df_phase2['Phase 2 Score'], width, label='Phase 2 (Guided)', alpha=0.8)
    ax.set_ylabel('Score')
    ax.set_title('Phase 1 vs Phase 2 Performance')
    ax.set_xticks(x)
    ax.set_xticklabels(df_phase2['Test'], rotation=45, ha='right')
    ax.legend()
    ax.set_ylim([0, 1.1])
    plt.tight_layout()
    plt.show()
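
The comparison cell above substitutes 0 when a Phase 2 entry has no `original_score`, which can make a missing baseline look like a large improvement. The following sketch computes the per-test change in percentage points and skips entries without a baseline. The function name is hypothetical, and scores are assumed to be fractions in [0, 1].

```python
def improvement_points(phase2_tests):
    """Map test name -> (Phase 2 score - Phase 1 score) in percentage points.

    Entries without 'original_score' are skipped rather than treated as 0.
    """
    deltas = {}
    for t in phase2_tests:
        if "original_score" not in t:
            continue  # no Phase 1 baseline recorded for this test
        deltas[t["name"]] = round((t["score"] - t["original_score"]) * 100, 1)
    return deltas


# Illustrative entries only (not data from the paper).
example_phase2 = [
    {"name": "A", "original_score": 0.5, "score": 0.75},
    {"name": "B", "score": 1.0},  # skipped: no baseline
]
print(improvement_points(example_phase2))
```

Averaging the returned values over the real `phase2_tests` is one way to check the reported aggregate improvement against the stored data.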

## Conclusions

Key findings from the empirical validation:

1. **Mathematical Capability Without Physical Intuition**: The model achieves 61% on forward derivation but 0% on inverse problems
2. **Selection Principles Help**: Providing explicit principles improves performance by 14 percentage points
3. **Inverse Problems Remain Unsolved**: Even with guidance, construction tasks fail completely

These results validate the need for the minAction.net framework: current architectures lack the right inductive bias for physics discovery.