# DeltaBench Interactive Exploration Notebook

**Comprehensive playground for exploring the DeltaBench dataset and testing DirectCritic vs PedCOT approaches**

This notebook provides:
- 📊 **Dataset exploration** with interactive sampling
- 🔍 **Annotation viewer** showing all parts of examples
- 🤖 **Critic testing** for DirectCritic, the Direct General logic critic, and PedCOT
- ⚖️ **Comparative analysis** between approaches
- 🎯 **Deep dive tools** for understanding error patterns

---

## 1. Setup & Imports

Load all necessary libraries and initialize the DeltaBench framework.

In [None]:
# Core imports
import os
import json
import random
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Tuple
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML, Markdown
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed

# DeltaBench framework
import sys
sys.path.append('src')

from src import (
    DeltaBenchDataset, DeltaBenchEvaluator, 
    DirectCritic, PedCoTCritic, CriticFactory, create_critic,
    display_example, display_critic_comparison, summarize_results
)

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)
plt.style.use('default')
sns.set_palette("husl")

print("✅ All imports successful!")
print("🔧 Make sure your OPENAI_API_KEY is set in your environment or .env file")

## 2. Load Dataset

Load the DeltaBench dataset and display basic statistics.
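
`DeltaBenchDataset.load_jsonl` belongs to this project, so its internals are not shown in this notebook. As a rough sketch of what loading such a file involves, assuming the file holds one JSON object per line, the hypothetical helper `read_jsonl` below (not part of the framework) uses only the standard library:

```python
import json
from pathlib import Path
from typing import Dict, List


def read_jsonl(path: str) -> List[Dict]:
    """Hypothetical helper: parse a JSON Lines file into a list of dicts."""
    records: List[Dict] = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a corrupt record is easy to find
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return records
```

This can help with a quick sanity check of the raw file when the framework loader reports a failure.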

In [None]:
# Load the dataset
dataset = DeltaBenchDataset()
data = dataset.load_jsonl('data/Deltabench_v1.jsonl')

print(f"📊 Dataset loaded: {len(data)} examples")
print(f"📁 File path: {dataset.file_path}")

# Basic statistics
if data:
    # Task distribution
    task_counts = Counter(ex.get('task_l1', 'unknown') for ex in data)
    print("\n📈 Task Distribution:")
    for task, count in task_counts.most_common():
        print(f"   {task}: {count} ({count/len(data)*100:.1f}%)")
    
    # Error statistics
    error_examples = sum(1 for ex in data if ex.get('reason_error_section_numbers') or ex.get('reason_unuseful_section_numbers'))
    print(f"\n🚨 Examples with errors: {error_examples} ({error_examples/len(data)*100:.1f}%)")
    
    # Sample first example keys
    print(f"\n🔑 Example structure (keys): {list(data[0].keys())}")
else:
    print("❌ Failed to load dataset")

## 3. Dataset Exploration Functions

Interactive functions for sampling and exploring examples.
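
`sample_examples` draws from the global `random` state, so repeated runs return different examples. When a comparison needs to be repeatable, one option is a dedicated seeded generator. The helper `seeded_sample` below is a hypothetical sketch, not part of the framework:

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def seeded_sample(items: Sequence[T], n: int, seed: int = 0) -> List[T]:
    """Hypothetical helper: draw up to n items, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator; leaves the global random state untouched
    pool = list(items)
    return rng.sample(pool, min(n, len(pool)))
```

Calling `seeded_sample(data, 5, seed=42)` twice returns the same five examples, as long as `data` itself has not changed.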

In [None]:
def sample_examples(n: int = 5, task_type: Optional[str] = None, has_errors: bool = True, 
                   min_sections: int = 3, max_sections: int = 100) -> List[Dict]:
    """
    Smart sampling of examples with filtering options.
    
    Args:
        n: Number of examples to sample
        task_type: Filter by task type ('math', 'code', etc.)
        has_errors: Whether to include only examples with errors
        min_sections: Minimum number of reasoning sections
        max_sections: Maximum number of reasoning sections
    
    Returns:
        List of sampled examples
    """
    filtered_data = data.copy()
    
    # Filter by task type
    if task_type:
        filtered_data = [ex for ex in filtered_data if ex.get('task_l1') == task_type]
    
    # Filter by error presence
    if has_errors:
        filtered_data = [ex for ex in filtered_data 
                        if ex.get('reason_error_section_numbers') or ex.get('reason_unuseful_section_numbers')]
    
    # Filter by section count (approximate)
    def count_sections(example):
        content = example.get('sections_content', '') or example.get('section_content', '') or example.get('long_cot', '')
        # Count "section" patterns
        import re
        sections = re.findall(r'section\s*\d+', content, re.IGNORECASE)
        return max(len(sections), len(content.split('\n\n'))) if content else 0
    
    filtered_data = [ex for ex in filtered_data 
                    if min_sections <= count_sections(ex) <= max_sections]
    
    if len(filtered_data) < n:
        print(f"⚠️ Only {len(filtered_data)} examples match filters (requested {n})")
        n = len(filtered_data)
    
    sampled = random.sample(filtered_data, min(n, len(filtered_data)))
    
    print(f"🎲 Sampled {len(sampled)} examples")
    if task_type:
        print(f"   📋 Task type: {task_type}")
    print(f"   🚨 Has errors: {has_errors}")
    print(f"   📏 Section range: {min_sections}-{max_sections}")
    
    return sampled

def get_task_types():
    """Get all available task types from the dataset."""
    return sorted(set(ex.get('task_l1', 'unknown') for ex in data))

def get_example_by_id(example_id: str) -> Optional[Dict]:
    """Find example by ID."""
    for ex in data:
        if ex.get('id') == example_id:
            return ex
    return None

# Test the sampling function
print("Available task types:", get_task_types())
_ = sample_examples(3, task_type='math')  # assign the result so the notebook does not echo every sampled example

## 4. Rich Annotation Viewer

Display complete example details with all annotations and formatting.
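
The viewer interpolates dataset text directly into HTML strings. Questions or reasoning that contain characters such as `<` or `&` (common in math and code tasks) may then render incorrectly. A minimal sketch of one way to guard against this uses `html.escape` from the standard library; `safe_html` is a hypothetical helper name:

```python
import html


def safe_html(value, default: str = "N/A") -> str:
    """Hypothetical helper: stringify a field and escape HTML special characters."""
    text = default if value is None else str(value)
    return html.escape(text)
```

A field would then be wrapped before interpolation, for example `safe_html(example.get('question'))` in place of `example.get('question', 'N/A')`.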

In [None]:
def display_full_example(example: Dict, show_reasoning: bool = True, show_sections: bool = True):
    """
    Display a complete example with rich formatting and all annotations.
    
    Args:
        example: The example dictionary
        show_reasoning: Whether to show the detailed reasoning
        show_sections: Whether to break down sections
    """
    # Header with metadata
    display(HTML(f"""
    <div style="border: 2px solid #4CAF50; padding: 15px; margin: 10px 0; border-radius: 8px; background-color: #f9f9f9;">
        <h3 style="color: #2E7D32; margin-top: 0;">📋 Example Details</h3>
        <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-bottom: 10px;">
            <div><strong>ID:</strong> {example.get('id', 'N/A')}</div>
            <div><strong>Origin:</strong> {example.get('origin', 'N/A')}</div>
            <div><strong>Task L1:</strong> {example.get('task_l1', 'N/A')}</div>
            <div><strong>Task L2:</strong> {example.get('task_l2', 'N/A')}</div>
        </div>
    </div>
    """))
    
    # Question
    display(HTML(f"""
    <div style="border-left: 4px solid #2196F3; padding: 15px; margin: 10px 0; background-color: #f0f8ff;">
        <h4 style="color: #1976D2; margin-top: 0;">❓ Question</h4>
        <p style="line-height: 1.6;">{example.get('question', 'N/A')}</p>
    </div>
    """))
    
    # Answer and correctness
    final_correct = example.get('final_correct', 'unknown')
    # A missing field must not count as correct: the string 'unknown' is truthy
    if final_correct is None or final_correct == 'unknown':
        correct_color, correct_icon = '#FF9800', '❓'
    elif final_correct:
        correct_color, correct_icon = '#4CAF50', '✅'
    else:
        correct_color, correct_icon = '#F44336', '❌'
    
    display(HTML(f"""
    <div style="border-left: 4px solid {correct_color}; padding: 15px; margin: 10px 0; background-color: #fafafa;">
        <h4 style="color: {correct_color}; margin-top: 0;">{correct_icon} Final Answer</h4>
        <p><strong>Answer:</strong> {example.get('answer', 'N/A')}</p>
        <p><strong>Correct:</strong> {final_correct}</p>
    </div>
    """))
    
    # Error annotations
    # Treat missing or null annotation fields as empty lists
    error_sections = example.get('reason_error_section_numbers') or []
    unuseful_sections = example.get('reason_unuseful_section_numbers') or []
    all_errors = sorted(set(error_sections + unuseful_sections))
    
    if all_errors:
        display(HTML(f"""
        <div style="border-left: 4px solid #F44336; padding: 15px; margin: 10px 0; background-color: #fff5f5;">
            <h4 style="color: #D32F2F; margin-top: 0;">🚨 Error Annotations</h4>
            <p><strong>Error sections:</strong> {error_sections}</p>
            <p><strong>Unuseful sections:</strong> {unuseful_sections}</p>
            <p><strong>All problematic sections:</strong> {all_errors}</p>
        </div>
        """))
    else:
        display(HTML(f"""
        <div style="border-left: 4px solid #4CAF50; padding: 15px; margin: 10px 0; background-color: #f0fff0;">
            <h4 style="color: #388E3C; margin-top: 0;">✅ No Errors Annotated</h4>
            <p>This example has no annotated reasoning errors.</p>
        </div>
        """))
    
    # Reasoning content
    if show_reasoning:
        reasoning_content = (example.get('sections_content') or 
                           example.get('section_content') or 
                           example.get('long_cot') or 
                           "No reasoning content available")
        
        display(HTML(f"""
        <div style="border-left: 4px solid #9C27B0; padding: 15px; margin: 10px 0; background-color: #fdf7ff;">
            <h4 style="color: #7B1FA2; margin-top: 0;">🧠 Reasoning Content</h4>
            <div style="max-height: 400px; overflow-y: auto; padding: 10px; background-color: white; border-radius: 4px; font-family: monospace; white-space: pre-wrap; line-height: 1.4;">{reasoning_content}</div>
        </div>
        """))
    
    # Section breakdown
    if show_sections and 'sections' in example:
        sections = example['sections']
        display(HTML("<h4 style='color: #FF9800; margin-top: 20px;'>📑 Section Breakdown</h4>"))
        
        for i, section in enumerate(sections, 1):
            section_color = '#ffebee' if i in all_errors else '#f9f9f9'
            section_border = '#F44336' if i in all_errors else '#ddd'
            error_indicator = '🚨 ' if i in all_errors else ''
            
            display(HTML(f"""
            <div style="border: 1px solid {section_border}; padding: 10px; margin: 5px 0; background-color: {section_color}; border-radius: 4px;">
                <strong>{error_indicator}Section {i}:</strong> {section.get('description', 'No description')}<br>
                <strong>Type:</strong> {section.get('section_type', 'Unknown')}<br>
                <div style="margin-top: 8px; font-family: monospace; background-color: white; padding: 8px; border-radius: 3px; max-height: 200px; overflow-y: auto;">{section.get('content', 'No content')}</div>
            </div>
            """))

# Example usage widget
def create_example_viewer_widget():
    """Create an interactive widget for viewing examples."""
    
    # Sample some examples for the dropdown
    sample_ex = sample_examples(10, has_errors=True)
    example_options = [(f"{ex.get('task_l1', 'unknown')} - {ex.get('id', 'no-id')[:8]}", ex) for ex in sample_ex]
    
    def view_example(example, show_reasoning, show_sections):
        display_full_example(example, show_reasoning, show_sections)
    
    return interactive(
        view_example,
        example=widgets.Dropdown(options=example_options, description='Example:'),
        show_reasoning=widgets.Checkbox(value=True, description='Show reasoning'),
        show_sections=widgets.Checkbox(value=True, description='Show sections')
    )

print("🎨 Rich annotation viewer functions ready!")
print("📝 Use display_full_example(example) to view any example")
print("🎮 Use create_example_viewer_widget() for interactive viewing")

## 5. Critic Testing Interface

Functions for testing DirectCritic, the Direct General logic critic, and PedCOT on examples.
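
Each test function reports precision, recall, and F1 over section numbers. The project's `parse_output` computes these values, and its exact edge-case handling is not shown here. As an illustration of the usual set-based definition, the hypothetical helper `section_prf` below compares predicted and annotated section numbers:

```python
from typing import Iterable, Tuple


def section_prf(predicted: Iterable[int], ground_truth: Iterable[int]) -> Tuple[float, float, float]:
    """Hypothetical helper: set-based precision, recall, and F1 over section numbers."""
    pred, gold = set(predicted), set(ground_truth)
    true_positives = len(pred & gold)
    # Assumption: an empty prediction or annotation set scores 0.0 rather than raising
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For predictions `[2, 5, 7]` against annotations `[5, 7, 9]`, two sections overlap, so precision, recall, and F1 all equal 2/3.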

In [None]:
def test_direct_critic(example: Dict, model: str = 'gpt-4o-mini', show_details: bool = True) -> Dict:
    """
    Test DirectCritic on an example and display results.
    
    Args:
        example: The example to test
        model: Model to use for the critic
        show_details: Whether to show detailed output
    
    Returns:
        Dictionary with critic results
    """
    try:
        # Initialize DirectCritic
        critic = create_critic('direct', model=model)
        
        # Get the question and reasoning
        question = example.get('question', '')
        model_output = (example.get('sections_content') or 
                       example.get('section_content') or 
                       example.get('long_cot', ''))
        
        if not question or not model_output:
            display(HTML('<div style="color: red;">❌ Missing question or reasoning content</div>'))
            return {}
        
        # Run the critic
        display(HTML('<div style="color: blue;">🔄 Running DirectCritic...</div>'))
        
        critic_output, token_info = critic.evaluate_reasoning(question, model_output)
        
        if not critic_output:
            display(HTML('<div style="color: red;">❌ Critic failed to generate output</div>'))
            return {}
        
        # Parse the output
        # Treat missing or null annotation fields as empty lists
        ground_truth = ((example.get('reason_error_section_numbers') or []) +
                        (example.get('reason_unuseful_section_numbers') or []))
        result = critic.parse_output(critic_output, ground_truth)
        
        if show_details:
            # Display results
            display(HTML(f"""
            <div style="border: 2px solid #2196F3; padding: 15px; margin: 10px 0; border-radius: 8px; background-color: #f0f8ff;">
                <h3 style="color: #1976D2; margin-top: 0;">🤖 DirectCritic Results</h3>
                
                <div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 15px; margin-bottom: 15px;">
                    <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px;">
                        <strong>Precision</strong><br>
                        <span style="font-size: 1.5em; color: #4CAF50;">{result.precision:.3f}</span>
                    </div>
                    <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px;">
                        <strong>Recall</strong><br>
                        <span style="font-size: 1.5em; color: #FF9800;">{result.recall:.3f}</span>
                    </div>
                    <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px;">
                        <strong>F1 Score</strong><br>
                        <span style="font-size: 1.5em; color: #9C27B0;">{result.f1_score:.3f}</span>
                    </div>
                </div>
                
                <div style="margin-bottom: 15px;">
                    <strong>Ground Truth Errors:</strong> {ground_truth}<br>
                    <strong>Predicted Errors:</strong> {result.predicted_error_sections}<br>
                    <strong>Token Usage:</strong> {token_info.get('total_tokens', 0)} tokens
                </div>
            </div>
            """))
            
            # Show raw critic output
            display(HTML(f"""
            <div style="border-left: 4px solid #2196F3; padding: 15px; margin: 10px 0; background-color: #fafafa;">
                <h4 style="color: #1976D2; margin-top: 0;">📄 Raw Critic Output</h4>
                <div style="background-color: white; padding: 10px; border-radius: 4px; font-family: monospace; white-space: pre-wrap; max-height: 300px; overflow-y: auto;">{critic_output}</div>
            </div>
            """))
        
        return {
            'critic_type': 'DirectCritic',
            'result': result,
            'token_info': token_info,
            'raw_output': critic_output,
            'ground_truth': ground_truth
        }
        
    except Exception as e:
        display(HTML(f'<div style="color: red;">❌ Error testing DirectCritic: {str(e)}</div>'))
        return {}

def test_direct_general_critic(example: Dict, model: str = 'gpt-4o-mini', show_details: bool = True) -> Dict:
    """
    Test Direct General logic critic on an example and display results.
    
    Args:
        example: The example to test
        model: Model to use for the critic
        show_details: Whether to show detailed output
    
    Returns:
        Dictionary with critic results
    """
    try:
        # Initialize Direct General critic using LLMCritic with direct_general prompt
        from src.critic import LLMCritic
        critic = LLMCritic(model_name=model, prompt_type="direct_general")
        
        # Get the question and reasoning
        question = example.get('question', '')
        model_output = (example.get('sections_content') or 
                       example.get('section_content') or 
                       example.get('long_cot', ''))
        
        if not question or not model_output:
            display(HTML('<div style="color: red;">❌ Missing question or reasoning content</div>'))
            return {}
        
        # Run the critic
        display(HTML('<div style="color: #FF5722;">🔄 Running Direct General Logic Critic...</div>'))
        
        critic_output, token_info = critic.evaluate_reasoning(question, model_output)
        
        if not critic_output:
            display(HTML('<div style="color: red;">❌ Critic failed to generate output</div>'))
            return {}
        
        # Parse the output
        # Treat missing or null annotation fields as empty lists
        ground_truth = ((example.get('reason_error_section_numbers') or []) +
                        (example.get('reason_unuseful_section_numbers') or []))
        result = critic.parse_output(critic_output, ground_truth)
        
        if show_details:
            # Display results
            display(HTML(f"""
            <div style="border: 2px solid #FF5722; padding: 15px; margin: 10px 0; border-radius: 8px; background-color: #fff3e0;">
                <h3 style="color: #E64A19; margin-top: 0;">🧠 Direct General Logic Critic Results</h3>
                
                <div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 15px; margin-bottom: 15px;">
                    <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px;">
                        <strong>Precision</strong><br>
                        <span style="font-size: 1.5em; color: #4CAF50;">{result.precision:.3f}</span>
                    </div>
                    <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px;">
                        <strong>Recall</strong><br>
                        <span style="font-size: 1.5em; color: #FF9800;">{result.recall:.3f}</span>
                    </div>
                    <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px;">
                        <strong>F1 Score</strong><br>
                        <span style="font-size: 1.5em; color: #9C27B0;">{result.f1_score:.3f}</span>
                    </div>
                </div>
                
                <div style="margin-bottom: 15px;">
                    <strong>Ground Truth Errors:</strong> {ground_truth}<br>
                    <strong>Predicted Errors:</strong> {result.predicted_error_sections}<br>
                    <strong>Token Usage:</strong> {token_info.get('total_tokens', 0)} tokens<br>
                    <strong>Analysis Focus:</strong> Cross-section logic flow & pattern recognition
                </div>
            </div>
            """))
            
            # Show raw critic output
            display(HTML(f"""
            <div style="border-left: 4px solid #FF5722; padding: 15px; margin: 10px 0; background-color: #fafafa;">
                <h4 style="color: #E64A19; margin-top: 0;">📄 Raw Logic Critic Output</h4>
                <div style="background-color: white; padding: 10px; border-radius: 4px; font-family: monospace; white-space: pre-wrap; max-height: 300px; overflow-y: auto;">{critic_output}</div>
            </div>
            """))
        
        return {
            'critic_type': 'DirectGeneralCritic',
            'result': result,
            'token_info': token_info,
            'raw_output': critic_output,
            'ground_truth': ground_truth
        }
        
    except Exception as e:
        display(HTML(f'<div style="color: red;">❌ Error testing Direct General critic: {str(e)}</div>'))
        return {}

def test_pedcot_critic(example: Dict, model: str = 'gpt-4o-mini', show_details: bool = True) -> Dict:
    """
    Test PedCOT critic on an example with detailed two-stage breakdown.
    
    Args:
        example: The example to test
        model: Model to use for the critic
        show_details: Whether to show detailed output
    
    Returns:
        Dictionary with critic results
    """
    try:
        # Initialize PedCOT critic
        critic = create_critic('pedcot', model=model)
        
        # Get the question and reasoning
        question = example.get('question', '')
        model_output = (example.get('sections_content') or 
                       example.get('section_content') or 
                       example.get('long_cot', ''))
        
        if not question or not model_output:
            display(HTML('<div style="color: red;">❌ Missing question or reasoning content</div>'))
            return {}
        
        # Run the critic
        display(HTML('<div style="color: purple;">🔄 Running PedCOT Critic (Two-Stage Process)...</div>'))
        
        critic_output, token_info = critic.evaluate_reasoning(question, model_output)
        
        if not critic_output:
            display(HTML('<div style="color: red;">❌ PedCOT critic failed to generate output</div>'))
            return {}
        
        # Parse the output
        # Treat missing or null annotation fields as empty lists
        ground_truth = ((example.get('reason_error_section_numbers') or []) +
                        (example.get('reason_unuseful_section_numbers') or []))
        result = critic.parse_output(critic_output, ground_truth)
        
        if show_details:
            # Display results
            display(HTML(f"""
            <div style="border: 2px solid #9C27B0; padding: 15px; margin: 10px 0; border-radius: 8px; background-color: #fdf7ff;">
                <h3 style="color: #7B1FA2; margin-top: 0;">🎓 PedCOT Critic Results</h3>
                
                <div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 15px; margin-bottom: 15px;">
                    <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px;">
                        <strong>Precision</strong><br>
                        <span style="font-size: 1.5em; color: #4CAF50;">{result.precision:.3f}</span>
                    </div>
                    <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px;">
                        <strong>Recall</strong><br>
                        <span style="font-size: 1.5em; color: #FF9800;">{result.recall:.3f}</span>
                    </div>
                    <div style="text-align: center; padding: 10px; background-color: white; border-radius: 5px;">
                        <strong>F1 Score</strong><br>
                        <span style="font-size: 1.5em; color: #9C27B0;">{result.f1_score:.3f}</span>
                    </div>
                </div>
                
                <div style="margin-bottom: 15px;">
                    <strong>Ground Truth Errors:</strong> {ground_truth}<br>
                    <strong>Predicted Errors:</strong> {result.predicted_error_sections}<br>
                    <strong>Token Usage:</strong> {token_info.get('total_tokens', 0)} tokens<br>
                    <strong>Domain:</strong> {token_info.get('domain', 'unknown')}<br>
                    <strong>Stages:</strong> {token_info.get('num_stages', 0)} LLM calls
                </div>
            </div>
            """))
            
            # Show raw critic output
            display(HTML(f"""
            <div style="border-left: 4px solid #9C27B0; padding: 15px; margin: 10px 0; background-color: #fafafa;">
                <h4 style="color: #7B1FA2; margin-top: 0;">📄 Final PedCOT Output</h4>
                <div style="background-color: white; padding: 10px; border-radius: 4px; font-family: monospace; white-space: pre-wrap; max-height: 300px; overflow-y: auto;">{critic_output}</div>
            </div>
            """))
        
        return {
            'critic_type': 'PedCoTCritic',
            'result': result,
            'token_info': token_info,
            'raw_output': critic_output,
            'ground_truth': ground_truth
        }
        
    except Exception as e:
        display(HTML(f'<div style="color: red;">❌ Error testing PedCOT: {str(e)}</div>'))
        return {}

print("🤖 Critic testing functions ready!")
print("🔍 Use test_direct_critic(example) to test DirectCritic")
print("🧠 Use test_direct_general_critic(example) to test Direct General Logic Critic")
print("🎓 Use test_pedcot_critic(example) to test PedCOT")

## 6. Comparative Analysis

Side-by-side comparison of all three critics.
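
`batch_comparison` returns one row per example, with columns such as `direct_f1` and `pedcot_tokens`. A single example says little about which critic is better overall, so averaging across rows is a natural next step. The sketch below is a hypothetical helper working on plain dictionaries, which can be obtained from the DataFrame with `df.to_dict('records')`:

```python
from statistics import mean
from typing import Dict, List

# Column prefixes used by batch_comparison in this notebook
CRITICS = ("direct", "direct_general", "pedcot")


def summarize_rows(rows: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Hypothetical helper: mean F1 and mean token usage per critic."""
    if not rows:
        return {}  # nothing to average
    return {
        critic: {
            "mean_f1": mean(row[f"{critic}_f1"] for row in rows),
            "mean_tokens": mean(row[f"{critic}_tokens"] for row in rows),
        }
        for critic in CRITICS
    }
```

Reading mean F1 next to mean token usage shows whether a critic's accuracy gain justifies its extra LLM calls.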

In [None]:
def compare_critics(example: Dict, model: str = 'gpt-4o-mini') -> Dict:
    """
    Run all three critics on the same example and compare results.
    
    Args:
        example: The example to test
        model: Model to use for all critics
    
    Returns:
        Dictionary with comparison results
    """
    display(HTML(f"""
    <div style="text-align: center; margin: 20px 0;">
        <h2 style="color: #333;">⚖️ Three-Way Critic Comparison</h2>
        <p style="color: #666;">Testing DirectCritic, Direct General, and PedCOT on the same example</p>
    </div>
    """))
    
    # Test DirectCritic
    display(HTML('<h3 style="color: #2196F3;">🤖 DirectCritic Test</h3>'))
    direct_result = test_direct_critic(example, model, show_details=True)
    
    # Test Direct General
    display(HTML('<h3 style="color: #FF5722;">🧠 Direct General Logic Critic Test</h3>'))
    direct_general_result = test_direct_general_critic(example, model, show_details=True)
    
    # Test PedCOT
    display(HTML('<h3 style="color: #9C27B0;">🎓 PedCOT Test</h3>'))
    pedcot_result = test_pedcot_critic(example, model, show_details=True)
    
    # Comparison summary
    if direct_result and direct_general_result and pedcot_result:
        direct_metrics = direct_result['result']
        direct_general_metrics = direct_general_result['result']
        pedcot_metrics = pedcot_result['result']
        
        comparison_data = {
            'Metric': ['Precision', 'Recall', 'F1 Score', 'Predicted Errors', 'Token Usage'],
            'DirectCritic': [
                f"{direct_metrics.precision:.3f}",
                f"{direct_metrics.recall:.3f}",
                f"{direct_metrics.f1_score:.3f}",
                str(direct_metrics.predicted_error_sections),
                str(direct_result['token_info'].get('total_tokens', 0))
            ],
            'Direct General': [
                f"{direct_general_metrics.precision:.3f}",
                f"{direct_general_metrics.recall:.3f}",
                f"{direct_general_metrics.f1_score:.3f}",
                str(direct_general_metrics.predicted_error_sections),
                str(direct_general_result['token_info'].get('total_tokens', 0))
            ],
            'PedCOT': [
                f"{pedcot_metrics.precision:.3f}",
                f"{pedcot_metrics.recall:.3f}",
                f"{pedcot_metrics.f1_score:.3f}",
                str(pedcot_metrics.predicted_error_sections),
                str(pedcot_result['token_info'].get('total_tokens', 0))
            ]
        }
        
        comparison_df = pd.DataFrame(comparison_data)
        
        display(HTML(f"""
        <div style="border: 2px solid #FF9800; padding: 15px; margin: 20px 0; border-radius: 8px; background-color: #fff8e1;">
            <h3 style="color: #F57C00; margin-top: 0;">📊 Three-Way Comparison Summary</h3>
        </div>
        """))
        
        display(comparison_df)
        
        # Winner analysis
        f1_scores = {
            'DirectCritic': direct_metrics.f1_score,
            'Direct General': direct_general_metrics.f1_score,
            'PedCOT': pedcot_metrics.f1_score
        }
        f1_winner = max(f1_scores, key=f1_scores.get)  # critic with the highest F1
        
        token_usage = {
            'DirectCritic': direct_result['token_info'].get('total_tokens', 0),
            'Direct General': direct_general_result['token_info'].get('total_tokens', 0),
            'PedCOT': pedcot_result['token_info'].get('total_tokens', 0)
        }
        efficiency_winner = min(token_usage, key=token_usage.get)  # critic with the fewest tokens
        
        display(HTML(f"""
        <div style="border-left: 4px solid #4CAF50; padding: 15px; margin: 10px 0; background-color: #f0fff0;">
            <h4 style="color: #388E3C; margin-top: 0;">🏆 Analysis</h4>
            <p><strong>F1 Score Winner:</strong> {f1_winner} ({f1_scores[f1_winner]:.3f})</p>
            <p><strong>Efficiency Winner:</strong> {efficiency_winner} ({token_usage[efficiency_winner]} tokens)</p>
            <p><strong>Ground Truth:</strong> {direct_result['ground_truth']}</p>
            <p><strong>F1 Scores:</strong> DirectCritic: {f1_scores['DirectCritic']:.3f}, Direct General: {f1_scores['Direct General']:.3f}, PedCOT: {f1_scores['PedCOT']:.3f}</p>
        </div>
        """))
    
    return {
        'direct_result': direct_result,
        'direct_general_result': direct_general_result,
        'pedcot_result': pedcot_result,
        'example_id': example.get('id', 'unknown')
    }

def batch_comparison(examples: List[Dict], model: str = 'gpt-4o-mini') -> pd.DataFrame:
    """
    Run batch comparison on multiple examples with all three critics.
    
    Args:
        examples: List of examples to test
        model: Model to use
    
    Returns:
        DataFrame with comparison results
    """
    results = []
    
    for i, example in enumerate(examples):
        print(f"\n{'='*50}")
        print(f"Testing example {i+1}/{len(examples)}: {example.get('id', 'unknown')[:8]}")
        print(f"{'='*50}")
        
        try:
            # Test all three critics (without detailed display)
            direct_result = test_direct_critic(example, model, show_details=False)
            direct_general_result = test_direct_general_critic(example, model, show_details=False)
            pedcot_result = test_pedcot_critic(example, model, show_details=False)
            
            if direct_result and direct_general_result and pedcot_result:
                results.append({
                    'example_id': example.get('id', 'unknown'),
                    'task_type': example.get('task_l1', 'unknown'),
                    'ground_truth': direct_result['ground_truth'],
                    'direct_precision': direct_result['result'].precision,
                    'direct_recall': direct_result['result'].recall,
                    'direct_f1': direct_result['result'].f1_score,
                    'direct_tokens': direct_result['token_info'].get('total_tokens', 0),
                    'direct_predictions': direct_result['result'].predicted_error_sections,
                    'direct_general_precision': direct_general_result['result'].precision,
                    'direct_general_recall': direct_general_result['result'].recall,
                    'direct_general_f1': direct_general_result['result'].f1_score,
                    'direct_general_tokens': direct_general_result['token_info'].get('total_tokens', 0),
                    'direct_general_predictions': direct_general_result['result'].predicted_error_sections,
                    'pedcot_precision': pedcot_result['result'].precision,
                    'pedcot_recall': pedcot_result['result'].recall,
                    'pedcot_f1': pedcot_result['result'].f1_score,
                    'pedcot_tokens': pedcot_result['token_info'].get('total_tokens', 0),
                    'pedcot_predictions': pedcot_result['result'].predicted_error_sections,
                })
        except Exception as e:
            print(f"Error testing example {i+1}: {e}")
            continue
    
    return pd.DataFrame(results)

print("⚖️ Comparative analysis functions ready!")
print("🔄 Use compare_critics(example) for three-way comparison")
print("📊 Use batch_comparison(examples) for multiple examples")

## 7. Interactive Playground

Widgets for interactive exploration.

In [None]:
# Interactive example viewer
print("🎮 Interactive Example Viewer")
print("Use the widget below to explore examples:")
create_example_viewer_widget()

## 8. Quick Start Playground

Ready-to-use cells for immediate exploration.

In [None]:
# Sample a few examples to play with
print("🎲 Sampling examples for exploration...")
playground_examples = sample_examples(3, task_type='math', has_errors=True)

print(f"\n📋 Got {len(playground_examples)} examples to explore")
for i, ex in enumerate(playground_examples):
    # str() keeps the slice safe if an id is stored as a number
    print(f"   {i+1}. {ex.get('task_l1', 'unknown')} - {str(ex.get('id', 'no-id'))[:8]}")

# Store first example for easy access
if playground_examples:
    current_example = playground_examples[0]
    print(f"\n✅ Current example set to: {str(current_example.get('id', 'unknown'))[:8]}")
    print("💡 Use the cells below to explore this example")
else:
    print("❌ No examples found. Check dataset loading.")

In [None]:
# Display the current example
if 'current_example' in locals():
    print("📋 Displaying current example...")
    display_full_example(current_example, show_reasoning=True, show_sections=True)
else:
    print("❌ No current example set. Run the cell above first.")

In [None]:
# Test DirectCritic on current example
if 'current_example' in locals():
    print("🤖 Testing DirectCritic...")
    direct_test_result = test_direct_critic(current_example, show_details=True)
else:
    print("❌ No current example set. Run the sampling cell first.")

In [None]:
# Test Direct General Logic Critic on current example
if 'current_example' in locals():
    print("🧠 Testing Direct General Logic Critic...")
    direct_general_test_result = test_direct_general_critic(current_example, show_details=True)
else:
    print("❌ No current example set. Run the sampling cell first.")

In [None]:
# Test PedCOT on current example
if 'current_example' in locals():
    print("🎓 Testing PedCOT...")
    pedcot_test_result = test_pedcot_critic(current_example, show_details=True)
else:
    print("❌ No current example set. Run the sampling cell first.")

In [None]:
# Compare all three critics
if 'current_example' in locals():
    print("⚖️ Comparing all three critics...")
    comparison_result = compare_critics(current_example)
else:
    print("❌ No current example set. Run the sampling cell first.")

## 9. Batch Analysis

Run the three-way comparison across multiple examples.

In [None]:
# Run batch comparison (uncomment to run - this will use API calls)
# batch_examples = sample_examples(5, task_type='math', has_errors=True)
# print(f"Running batch comparison on {len(batch_examples)} examples...")
# batch_results = batch_comparison(batch_examples)
# display(batch_results)

print("🚫 Batch analysis is commented out to prevent accidental API usage")
print("💡 Uncomment the lines above to run batch comparison")
print("⚠️ Note: This will use OpenAI API tokens")
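
Once `batch_results` exists, per-critic averages make the three approaches easier to compare. The sketch below uses only the standard library and works on plain row dictionaries, which `batch_results.to_dict('records')` would produce. The column names follow those written by `batch_comparison`; the helper name `summarize_by_critic` and the sample numbers are made up for illustration.

```python
from statistics import mean
from typing import Dict, List

# Column prefixes and metric suffixes written by batch_comparison
CRITIC_PREFIXES = ("direct", "direct_general", "pedcot")
METRICS = ("precision", "recall", "f1", "tokens")


def summarize_by_critic(rows: List[dict]) -> Dict[str, Dict[str, float]]:
    """Average each metric per critic over a list of result rows (hypothetical helper)."""
    if not rows:
        return {}
    return {
        prefix: {m: mean(row[f"{prefix}_{m}"] for row in rows) for m in METRICS}
        for prefix in CRITIC_PREFIXES
    }


# Made-up rows for illustration only; real rows come from batch_comparison.
sample_rows = [
    {"direct_precision": 0.5, "direct_recall": 0.5, "direct_f1": 0.5, "direct_tokens": 100,
     "direct_general_precision": 1.0, "direct_general_recall": 0.5, "direct_general_f1": 0.5,
     "direct_general_tokens": 200,
     "pedcot_precision": 1.0, "pedcot_recall": 1.0, "pedcot_f1": 1.0, "pedcot_tokens": 400},
    {"direct_precision": 1.0, "direct_recall": 1.0, "direct_f1": 1.0, "direct_tokens": 300,
     "direct_general_precision": 0.0, "direct_general_recall": 0.0, "direct_general_f1": 0.0,
     "direct_general_tokens": 400,
     "pedcot_precision": 0.5, "pedcot_recall": 1.0, "pedcot_f1": 0.5, "pedcot_tokens": 600},
]

for critic, stats in summarize_by_critic(sample_rows).items():
    print(critic, stats)
```

In the notebook, the equivalent call would be `summarize_by_critic(batch_results.to_dict('records'))`.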

## 10. Advanced Exploration

Custom exploration functions.

In [None]:
# Custom exploration - modify as needed

def explore_by_task_type(task_type: str, n_examples: int = 3):
    """
    Explore examples from a specific task type.
    """
    print(f"🔍 Exploring {task_type} examples...")
    examples = sample_examples(n_examples, task_type=task_type, has_errors=True)
    
    for i, ex in enumerate(examples):
        print(f"\n{'='*20} Example {i+1} {'='*20}")
        display_full_example(ex, show_reasoning=False, show_sections=False)

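def find_complex_examples(min_errors: int = 3):
    """
    Find examples with many flagged sections.

    The count combines error sections and unuseful sections. Returns up to
    10 examples, ordered from most to fewest flagged sections.
    """
    complex_examples = []
    for ex in data:
        # `or []` guards against fields that are present but set to None
        error_count = (
            len(ex.get('reason_error_section_numbers') or [])
            + len(ex.get('reason_unuseful_section_numbers') or [])
        )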
        if error_count >= min_errors:
            complex_examples.append((ex, error_count))
    
    # Sort by error count
    complex_examples.sort(key=lambda x: x[1], reverse=True)
    
    print(f"🎯 Found {len(complex_examples)} examples with {min_errors}+ errors")
    
    return [ex for ex, _ in complex_examples[:10]]  # Return top 10

# Available task types
print("📋 Available task types:", get_task_types())
print("\n💡 Use explore_by_task_type('math') to explore math examples")
print("💡 Use find_complex_examples(3) to find examples with 3+ errors")

## 🎯 Ready to Explore!

This notebook is now fully set up for interactive exploration of the DeltaBench dataset and testing all three critic approaches: **DirectCritic**, **Direct General Logic Critic**, and **PedCOT**.

### Quick Start Guide:

1. **📊 Data Exploration**: Use the sampling cells to get examples
2. **🔍 View Examples**: Use `display_full_example(example)` to see all annotations
3. **🤖 Test Critics**: Use individual test functions or three-way comparison
4. **⚖️ Compare**: Use `compare_critics(example)` for side-by-side comparison of all three
5. **🎮 Interactive**: Use the widgets for GUI-based exploration

### Available Critics:
- **🤖 DirectCritic**: Original DeltaBench approach (section-by-section evaluation)
- **🧠 Direct General Logic Critic**: Advanced logic critic with cross-section analysis and pattern recognition
- **🎓 PedCOT**: Pedagogical Chain-of-Thought approach (two-stage process)

### Key Functions Available:
- `sample_examples(n, task_type, has_errors)` - Smart sampling
- `display_full_example(example)` - Rich display with annotations
- `test_direct_critic(example)` - Test DirectCritic approach
- `test_direct_general_critic(example)` - Test Direct General Logic Critic
- `test_pedcot_critic(example)` - Test PedCOT approach
- `compare_critics(example)` - Three-way comparison
- `batch_comparison(examples)` - Multiple example analysis

### Direct General Logic Critic Features:
- **Cross-section logic analysis** - tracks reasoning flow between sections
- **Pattern recognition** - identifies specific error types (Unjustified Generalization, Invalid Simplification, etc.)
- **Enhanced output format** - includes Error Type, Location, Explanation, and Impact
- **Holistic evaluation** - considers both individual sections and overall logical flow
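
To make the four output fields concrete, the sketch below models one reported error as a small record and collects the flagged section numbers. `CriticFinding` and `flagged_sections` are hypothetical names, the sample findings are invented, and the critic's real output format may be structured differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CriticFinding:
    """One reported error (hypothetical structure, not the critic's real output type)."""
    error_type: str   # e.g. "Unjustified Generalization"
    location: int     # section number where the error occurs
    explanation: str  # why the step is wrong
    impact: str       # how the error affects later reasoning


def flagged_sections(findings: List[CriticFinding]) -> List[int]:
    """Return the sorted, de-duplicated section numbers named in the findings."""
    return sorted({f.location for f in findings})


# Invented findings for illustration only.
findings = [
    CriticFinding("Invalid Simplification", 4,
                  "Drops a term without justification.",
                  "Later sections inherit the wrong expression."),
    CriticFinding("Unjustified Generalization", 2,
                  "Extends one case to all cases.",
                  "The conclusion no longer follows."),
    CriticFinding("Invalid Simplification", 4,
                  "Repeats the earlier simplification.",
                  "Same downstream effect."),
]
print(flagged_sections(findings))
```

A list like the one returned by `flagged_sections` is the kind of value that can be compared against the annotated error sections when scoring a critic.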

**⚠️ Note**: Testing critics consumes OpenAI API tokens. Make sure `OPENAI_API_KEY` is set in your environment or `.env` file.

**🚀 Happy Exploring with All Three Critics!**