# Simulation Experiment Empirical Validation

**The Problem**: You've built a TinyTroupe simulation, but how do you know if it accurately reflects real-world behavior?

**The Solution**: This validation system compares your simulation results against empirical data to give you a confidence score.

*Note: data here is fictitious and for demonstration purposes only.*

## Real-World Example: E-commerce Checkout Optimization

**Scenario**: Your company is considering a new premium checkout flow. You have real customer data from the current system and want to validate your TinyTroupe simulation predictions before making a $2M investment.

**The Stakes**: If your simulation is wrong, you could lose customers and revenue. If it's right, you could increase conversions by 40%.

Let's see how close your simulation gets to reality...

In [1]:
import sys
import json
sys.path.insert(0, '..')

from tinytroupe.validation import SimulationExperimentEmpiricalValidator, SimulationExperimentDataset, validate_simulation_experiment_empirically
import pandas as pd
import matplotlib.pyplot as plt


!!!!
DISCLAIMER: TinyTroupe relies on Artificial Intelligence (AI) models to generate content. 
The AI models are not perfect and may produce inappropriate or inacurate results. 
For any serious or consequential use, please review the generated content before using it.
!!!!

Looking for default config on: c:\Users\pdasilva\repos\TinyTroupe\examples\..\tinytroupe\utils\..\config.ini
Found custom config on: c:\Users\pdasilva\repos\TinyTroupe\examples\config.ini
TinyTroupe version: 0.5.1
Current date and time (local): 2025-07-15 23:29:09
Current date and time (UTC):   2025-07-16 02:29:09

Current TinyTroupe configuration 
[OpenAI]
api_type = openai
azure_api_version = 2024-08-01-preview
model = gpt-4o-mini
reasoning_model = o3-mini
embedding_model = text-embedding-3-small
max_tokens = 16000
temperature = 1.7
freq_penalty = 0.1
presence_penalty = 0.1
timeout = 480
max_attempts = 5
waiting_time = 0
exponential_backoff_factor = 5
reasoning_effort = high
cache_api_calls = False
cache_file_na

## The Data: Real Customer Behavior vs. TinyTroupe Simulation

**Real Data**: Customers tested the current checkout flow over 3 months  
**Simulation**: TinyTroupe agents tested the proposed premium checkout flow  
**Question**: Can we trust the simulation to predict real customer behavior?

In [2]:
# REAL CUSTOMER DATA (3 months of actual e-commerce data)
real_customer_data = {
    "name": "Real E-commerce Customer Data",
    "description": "Actual customer behavior from current checkout system",
    "key_results": {
        # Extended data with more sample points to avoid statistical issues
        "conversion_rate": [0.23, 0.19, 0.25, 0.21, 0.24, 0.18, 0.22, 0.20, 0.26, 0.19, 
                           0.24, 0.22, 0.20, 0.25, 0.23, 0.21, 0.19, 0.24, 0.22, 0.20],  # 20 weekly averages
        "cart_abandonment_rate": [0.68, 0.72, 0.65, 0.70, 0.67, 0.74, 0.69, 0.71, 0.64, 0.73,
                                 0.69, 0.71, 0.68, 0.66, 0.72, 0.70, 0.67, 0.69, 0.71, 0.68],
        "average_order_value": [87.50, 92.30, 85.20, 89.10, 91.80, 83.40, 88.90, 86.70, 94.20, 84.60,
                               89.30, 91.50, 87.80, 90.20, 88.60, 86.40, 92.10, 89.70, 87.30, 90.80],
        "customer_satisfaction": [3.2, 3.4, 3.1, 3.3, 3.5, 3.0, 3.2, 3.4, 3.6, 3.1,
                                 3.3, 3.2, 3.4, 3.1, 3.3, 3.5, 3.0, 3.2, 3.4, 3.3],
        "overall_revenue_per_visitor": 20.13  # Key business metric
    },
    "result_types": {
        "conversion_rate": "per_agent",
        "cart_abandonment_rate": "per_agent", 
        "average_order_value": "per_agent",
        "customer_satisfaction": "per_agent",
        "overall_revenue_per_visitor": "aggregate"
    },
    "agent_justifications": [
        "Too many steps in checkout - gave up halfway through",
        "Payment options were confusing, wasn't sure which to choose",
        "Completed purchase but the process felt unnecessarily complicated",
        "Loading times were too slow, lost patience",
        "Smooth experience overall, would buy again"
    ],
    "justification_summary": "Current checkout has friction points: too many steps, confusing payment options, slow loading times. Customers abandon due to complexity."
}

# TINYTROUPE SIMULATION DATA (proposed premium checkout with AI assistance)
tinytroupe_simulation_data = {
    "name": "TinyTroupe Premium Checkout Simulation",
    "description": "Simulation of new premium checkout with AI assistant and streamlined flow",
    "key_results": {
        # Extended data with more sample points to match real data
        "conversion_rate": [0.34, 0.31, 0.36, 0.33, 0.35, 0.29, 0.32, 0.30, 0.37, 0.31,
                           0.33, 0.35, 0.32, 0.34, 0.36, 0.30, 0.33, 0.35, 0.31, 0.34],  # Predicted higher conversion
        "cart_abandonment_rate": [0.45, 0.48, 0.42, 0.46, 0.44, 0.50, 0.43, 0.47, 0.41, 0.49,
                                 0.44, 0.46, 0.43, 0.45, 0.47, 0.42, 0.48, 0.44, 0.46, 0.45],  # Predicted lower abandonment
        "average_order_value": [102.30, 108.50, 99.80, 105.20, 110.40, 97.60, 104.10, 101.90, 112.30, 98.70,
                               105.80, 103.40, 107.20, 101.60, 109.30, 104.70, 106.50, 102.80, 108.90, 105.10],  # Higher AOV
        "customer_satisfaction": [4.1, 4.3, 3.9, 4.2, 4.4, 3.8, 4.0, 4.2, 4.5, 3.9,
                                 4.2, 4.1, 4.3, 4.0, 4.4, 3.9, 4.1, 4.3, 4.2, 4.0],  # Higher satisfaction
        "overall_revenue_per_visitor": 33.85  # 68% increase predicted!
    },
    "result_types": {
        "conversion_rate": "per_agent",
        "cart_abandonment_rate": "per_agent",
        "average_order_value": "per_agent", 
        "customer_satisfaction": "per_agent",
        "overall_revenue_per_visitor": "aggregate"
    },
    "agent_justifications": [
        "AI assistant helped me find exactly what I needed quickly",
        "One-click checkout made the process effortless",
        "Smart recommendations increased my order value naturally",
        "Felt confident about my purchase with the AI guidance",
        "Fastest checkout experience I've ever had"
    ],
    "justification_summary": "Premium checkout eliminates friction with AI assistance, one-click payment, and smart recommendations. Customers feel guided and confident."
}

print("📊 Real customer data loaded (3 months of actual e-commerce behavior)")
print("🤖 TinyTroupe simulation data loaded (premium checkout predictions)")
print("💰 Predicted revenue increase: 68% per visitor")
print("⚠️  Question: Can we trust this simulation before investing $2M?")

📊 Real customer data loaded (3 months of actual e-commerce behavior)
🤖 TinyTroupe simulation data loaded (premium checkout predictions)
💰 Predicted revenue increase: 68% per visitor
⚠️  Question: Can we trust this simulation before investing $2M?
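
Both dictionaries above share one shape: every metric in `key_results` has a matching entry in `result_types`, with a list for each `per_agent` metric and a single number for each `aggregate` metric. The sketch below is a pre-flight check that assumes only that shape. `check_dataset_shape` and the small `example` dataset are hypothetical and not part of TinyTroupe.

```python
def check_dataset_shape(dataset):
    """Return a list of human-readable problems found in a dataset dict."""
    problems = []
    key_results = dataset.get("key_results", {})
    result_types = dataset.get("result_types", {})

    # Every metric with data needs a declared type that matches its value
    for metric, value in key_results.items():
        kind = result_types.get(metric)
        if kind is None:
            problems.append(f"{metric}: missing from result_types")
        elif kind == "per_agent" and not isinstance(value, list):
            problems.append(f"{metric}: per_agent metrics should be lists")
        elif kind == "aggregate" and isinstance(value, list):
            problems.append(f"{metric}: aggregate metrics should be single values")

    # Every declared metric needs data
    for metric in result_types:
        if metric not in key_results:
            problems.append(f"{metric}: declared in result_types but has no data")
    return problems


# Stand-in dataset with one deliberate mistake (illustrative values only)
example = {
    "key_results": {"conversion_rate": [0.23, 0.19], "revenue_per_visitor": [20.13]},
    "result_types": {"conversion_rate": "per_agent", "revenue_per_visitor": "aggregate"},
}
print(check_dataset_shape(example))
```

Running a check like this on both the control and the treatment dictionary before validation makes shape mistakes easier to spot than a statistical error raised later.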


## The Validation: Can We Trust the Simulation?

The system compares your simulation against real data using:
- **Statistical tests** - Do the simulated metrics differ significantly from the real ones?
- **Semantic analysis** - Do the agents' stated reasons resemble what real customers say?
- **Confidence score** - How much can we trust this simulation?

In [3]:
# Run the validation - this is where the magic happens!
import warnings
import numpy as np

# Suppress statistical warnings for cleaner output
warnings.filterwarnings('ignore', category=RuntimeWarning)

try:
    validation_result = validate_simulation_experiment_empirically(
        control_data=real_customer_data,
        treatment_data=tinytroupe_simulation_data,
        validation_types=["statistical", "semantic"],
        significance_level=0.05,
        output_format="values"
    )
    
    print("🎯 VALIDATION RESULTS")
    print("=" * 50)
    
    if validation_result.overall_score is not None:
        print(f"📊 Confidence Score: {validation_result.overall_score:.1%}")
        print(f"📈 Simulation Quality: {validation_result.summary}")
        
        # Quick interpretation
        if validation_result.overall_score > 0.8:
            print("✅ HIGH CONFIDENCE - Simulation is very reliable")
        elif validation_result.overall_score > 0.6:
            print("⚠️  MEDIUM CONFIDENCE - Simulation has some reliability")
        else:
            print("❌ LOW CONFIDENCE - Simulation may not be reliable")
    else:
        print("⚠️  Validation completed but confidence score could not be calculated")
        print(f"📈 Summary: {validation_result.summary}")
        
except Exception as e:
    print("❌ VALIDATION ERROR")
    print("=" * 50)
    print(f"Error during validation: {str(e)}")
    print("💡 This might be due to insufficient data or statistical computation issues.")
    print("   Consider using more data points or different validation methods.")
    
    # Create a fallback basic comparison
    print("\n📊 FALLBACK: Basic Data Comparison")
    print("=" * 40)
    
    # Simple mean comparisons
    real_conv = np.mean(real_customer_data['key_results']['conversion_rate'])
    sim_conv = np.mean(tinytroupe_simulation_data['key_results']['conversion_rate'])
    
    print(f"Real conversion rate: {real_conv:.1%}")
    print(f"Simulated conversion rate: {sim_conv:.1%}")
    print(f"Predicted improvement: {((sim_conv - real_conv) / real_conv) * 100:.1f}%")

🎯 VALIDATION RESULTS
📊 Confidence Score: 33.2%
📈 Simulation Quality: Statistical validation: 4/5 tests significant, average effect size: 4.739; Semantic validation: Average proximity score of 0.282; Summary proximity: 0.400; Overall validation score: 0.332
❌ LOW CONFIDENCE - Simulation may not be reliable
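
The reported average effect size of 4.739 is very large. To see where a value of that magnitude can come from, the sketch below computes Cohen's d with a pooled standard deviation for the two `conversion_rate` lists defined earlier. It assumes the validator reports a Cohen's-d-style effect size, which the output does not confirm. `cohens_d` is a hypothetical helper, not a TinyTroupe function.

```python
from statistics import mean, stdev


def cohens_d(control, treatment):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(control), len(treatment)
    s1, s2 = stdev(control), stdev(treatment)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled


# conversion_rate values copied from the two datasets above
real_conversion = [0.23, 0.19, 0.25, 0.21, 0.24, 0.18, 0.22, 0.20, 0.26, 0.19,
                   0.24, 0.22, 0.20, 0.25, 0.23, 0.21, 0.19, 0.24, 0.22, 0.20]
sim_conversion = [0.34, 0.31, 0.36, 0.33, 0.35, 0.29, 0.32, 0.30, 0.37, 0.31,
                  0.33, 0.35, 0.32, 0.34, 0.36, 0.30, 0.33, 0.35, 0.31, 0.34]

d = cohens_d(real_conversion, sim_conversion)
print(f"Cohen's d for conversion_rate: {d:.2f}")
```

By the usual convention a d above 0.8 already counts as large, so a d of several units means the real and simulated samples barely overlap.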


## Deep Dive: What the Numbers Tell Us

In [4]:
# Examine the statistical evidence
try:
    if validation_result.statistical_results:
        print("📊 STATISTICAL ANALYSIS")
        print("=" * 40)
        
        if "error" in validation_result.statistical_results:
            print(f"❌ Error: {validation_result.statistical_results['error']}")
        else:
            metrics = validation_result.statistical_results['common_metrics']
            print(f"📈 Metrics analyzed: {', '.join(metrics)}")
            
            # Show key findings
            test_results = validation_result.statistical_results['test_results']
            significant_differences = []
            
            for treatment_name, treatment_results in test_results.items():
                for metric, metric_results in treatment_results.items():
                    for test_name, test_result in metric_results.items():
                        if test_result.get('significant', False):
                            p_val = test_result.get('p_value', 'N/A')
                            if isinstance(p_val, (int, float)) and not np.isnan(p_val):
                                significant_differences.append(f"{metric}: p={p_val:.3f}")
                            else:
                                significant_differences.append(f"{metric}: significant")
            
            if significant_differences:
                print("⚠️  Significant differences found:")
                for diff in significant_differences:
                    print(f"   • {diff}")
            else:
                print("✅ No significant differences - simulation aligns with real data")
    else:
        print("📊 No statistical results available")
        
except Exception as e:
    print("📊 Statistical analysis encountered an issue:")
    print(f"   {str(e)}")
    print("   This might be due to insufficient data for statistical testing.")

📊 STATISTICAL ANALYSIS
📈 Metrics analyzed: customer_satisfaction, cart_abandonment_rate, conversion_rate, overall_revenue_per_visitor, average_order_value
📊 Statistical analysis encountered an issue:
   'str' object has no attribute 'get'
   This might be due to insufficient data for statistical testing.
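
The `'str' object has no attribute 'get'` error above suggests that each metric maps directly to a single result dict (with fields such as `significant` and `p_value`), so the innermost loop iterates over field values instead of over tests. The sketch below is a more tolerant collector that assumes this flat shape but also accepts one extra level of nesting. `collect_significant` and the `sample` data are hypothetical.

```python
import math


def collect_significant(test_results):
    """List metrics flagged significant, for flat or per-test nested results."""
    found = []
    for treatment_results in test_results.values():
        for metric, metric_results in treatment_results.items():
            if not isinstance(metric_results, dict):
                continue
            if "significant" in metric_results:
                candidates = [metric_results]  # flat: one result per metric
            else:
                # nested: one result dict per test name
                candidates = [r for r in metric_results.values() if isinstance(r, dict)]
            for result in candidates:
                if not result.get("significant", False):
                    continue
                p_val = result.get("p_value")
                if isinstance(p_val, (int, float)) and not math.isnan(p_val):
                    found.append(f"{metric}: p={p_val:.3f}")
                else:
                    found.append(f"{metric}: significant")
    return found


# Illustrative stand-in for validation_result.statistical_results['test_results']
sample = {
    "treatment": {
        "conversion_rate": {"test_type": "welch_t_test", "significant": True, "p_value": 0.0004},
        "time_on_page": {"test_type": "welch_t_test", "significant": False, "p_value": 0.41},
    }
}
print(collect_significant(sample))
```

With the real results, the call would be `collect_significant(validation_result.statistical_results['test_results'])`.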


In [5]:
# Examine the reasoning alignment
try:
    if validation_result.semantic_results:
        print("\n🧠 REASONING ANALYSIS")
        print("=" * 40)
        
        avg_proximity = validation_result.semantic_results.get('average_proximity')
        # Compare against None so that a genuine score of 0.0 is still reported
        if avg_proximity is not None and not np.isnan(avg_proximity):
            print(f"🎯 Reasoning alignment: {avg_proximity:.1%}")
            
            if avg_proximity > 0.7:
                print("✅ Agent reasoning closely matches real customer thinking")
            elif avg_proximity > 0.5:
                print("⚠️  Agent reasoning somewhat matches real customer thinking")
            else:
                print("❌ Agent reasoning differs significantly from real customers")
        else:
            print("⚠️  Could not calculate reasoning alignment score")
        
        # Show reasoning comparison
        summary_comp = validation_result.semantic_results.get('summary_comparison')
        # Compare against None so that a genuine score of 0.0 is still reported
        if summary_comp and summary_comp.get('proximity_score') is not None and not np.isnan(summary_comp['proximity_score']):
            print(f"\n📝 Summary comparison: {summary_comp['proximity_score']:.1%} similar")
            justification = summary_comp.get('justification', '')
            if justification:
                print(f"💡 Key insight: {justification[:120]}...")
        else:
            print("\n📝 Summary comparison: Unable to calculate similarity")
    else:
        print("\n🧠 No semantic analysis results available")
        
except Exception as e:
    print("\n🧠 Semantic analysis encountered an issue:")
    print(f"   {str(e)}")
    print("   This might be due to missing justification data or semantic processing issues.")


🧠 REASONING ANALYSIS
🎯 Reasoning alignment: 28.2%
❌ Agent reasoning differs significantly from real customers

📝 Summary comparison: 40.0% similar
💡 Key insight: The two texts discuss the concept of checkout processes but from opposing perspectives. The first text highlights the is...


## Business Impact Assessment

Based on the validation results, here's what this means for your $2M investment decision:

In [6]:
# Calculate business impact based on validation confidence
try:
    confidence_score = validation_result.overall_score if validation_result.overall_score is not None else 0.5
    predicted_revenue_increase = 0.68  # 68% increase from simulation
    current_monthly_revenue = 1000000  # $1M per month

    print("💼 BUSINESS IMPACT ASSESSMENT")
    print("=" * 50)

    # Risk-adjusted projections
    if confidence_score > 0.8:
        risk_adjustment = 0.9  # High confidence = 90% of predicted benefit
        recommendation = "PROCEED with investment"
        risk_level = "LOW"
    elif confidence_score > 0.6:
        risk_adjustment = 0.6  # Medium confidence = 60% of predicted benefit
        recommendation = "PROCEED with caution"
        risk_level = "MEDIUM"
    else:
        risk_adjustment = 0.3  # Low confidence = 30% of predicted benefit
        recommendation = "CONSIDER more validation"
        risk_level = "HIGH"

    expected_revenue_increase = predicted_revenue_increase * risk_adjustment
    monthly_revenue_gain = current_monthly_revenue * expected_revenue_increase
    annual_revenue_gain = monthly_revenue_gain * 12

    print(f"🎯 Simulation Confidence: {confidence_score:.1%}")
    print(f"⚠️  Risk Level: {risk_level}")
    print(f"📊 Risk-Adjusted Revenue Increase: {expected_revenue_increase:.1%}")
    print(f"💰 Expected Monthly Revenue Gain: ${monthly_revenue_gain:,.0f}")
    print(f"📈 Expected Annual Revenue Gain: ${annual_revenue_gain:,.0f}")
    
    if monthly_revenue_gain > 0:
        print(f"🎯 Investment Payback Period: {2000000 / monthly_revenue_gain:.1f} months")
    else:
        print("🎯 Investment Payback Period: Cannot calculate (no expected gain)")
        
    print(f"\n🏆 RECOMMENDATION: {recommendation}")
    
except Exception as e:
    print("💼 BUSINESS IMPACT ASSESSMENT")
    print("=" * 50)
    print(f"❌ Error calculating business impact: {str(e)}")
    print("💡 Using basic assessment based on available data...")
    
    # Basic fallback assessment
    print("🎯 Predicted Revenue Increase: 68%")
    print("💰 Investment Amount: $2M")
    print("⚠️  Recommendation: Proceed with caution due to validation issues")

💼 BUSINESS IMPACT ASSESSMENT
🎯 Simulation Confidence: 33.2%
⚠️  Risk Level: HIGH
📊 Risk-Adjusted Revenue Increase: 20.4%
💰 Expected Monthly Revenue Gain: $204,000
📈 Expected Annual Revenue Gain: $2,448,000
🎯 Investment Payback Period: 9.8 months

🏆 RECOMMENDATION: CONSIDER more validation
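
The arithmetic in the cell above can be wrapped in a small function to see how the projection reacts to different confidence scores. This sketch reuses the notebook's own illustrative thresholds (0.8, 0.6) and factors (0.9, 0.6, 0.3). `risk_adjusted_projection` is a hypothetical helper, and the extra scores 0.65 and 0.85 are made up for comparison.

```python
def risk_adjusted_projection(confidence, predicted_increase, monthly_revenue, investment):
    """Scale a predicted uplift by a confidence-dependent factor."""
    if confidence > 0.8:
        factor = 0.9
    elif confidence > 0.6:
        factor = 0.6
    else:
        factor = 0.3
    monthly_gain = monthly_revenue * predicted_increase * factor
    # Payback is undefined when no gain is expected
    payback_months = investment / monthly_gain if monthly_gain > 0 else None
    return monthly_gain, payback_months


for confidence in (0.332, 0.65, 0.85):
    gain, payback = risk_adjusted_projection(confidence, 0.68, 1_000_000, 2_000_000)
    print(f"confidence={confidence:.1%}  monthly gain=${gain:,.0f}  payback={payback:.1f} months")
```

Because the factor changes in steps, a score just above a threshold produces a much more optimistic projection than one just below it, which is worth keeping in mind when reading the recommendation.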


## Quick Example: Statistics-Only Validation

For simpler cases where you only have metrics (no customer reasoning), you can run statistical validation only:

In [7]:
# Simple metrics-only validation
simple_real_data = {
    "name": "Real Data - Metrics Only",
    "key_results": {
        # Extended data for more robust statistical testing
        "click_through_rate": [0.12, 0.15, 0.13, 0.14, 0.11, 0.16, 0.13, 0.12, 0.15, 0.14, 
                              0.13, 0.15, 0.12, 0.14, 0.16, 0.13, 0.12, 0.15, 0.14, 0.13],
        "time_on_page": [45, 52, 48, 50, 46, 54, 47, 49, 51, 48, 
                        46, 53, 49, 47, 52, 50, 48, 51, 49, 47]  # seconds
    },
    "result_types": {
        "click_through_rate": "per_agent",
        "time_on_page": "per_agent"
    }
}

simple_simulation_data = {
    "name": "Simulation - Metrics Only", 
    "key_results": {
        "click_through_rate": [0.18, 0.21, 0.19, 0.20, 0.17, 0.22, 0.19, 0.18, 0.21, 0.20,
                              0.19, 0.21, 0.18, 0.20, 0.22, 0.19, 0.18, 0.21, 0.20, 0.19],
        "time_on_page": [62, 68, 65, 64, 61, 70, 63, 66, 67, 64,
                        61, 69, 65, 63, 68, 66, 62, 67, 64, 63]  # seconds
    },
    "result_types": {
        "click_through_rate": "per_agent",
        "time_on_page": "per_agent"
    }
}

# Quick statistical validation with error handling
try:
    quick_result = validate_simulation_experiment_empirically(
        control_data=simple_real_data,
        treatment_data=simple_simulation_data,
        validation_types=["statistical"],
        output_format="values"
    )
    
    if quick_result.overall_score is not None:
        print(f"⚡ Quick validation score: {quick_result.overall_score:.1%}")
        print(f"📊 Results: {quick_result.summary}")
    else:
        print("⚡ Quick validation completed with issues")
        print(f"📊 Results: {quick_result.summary}")
        
except Exception as e:
    print(f"⚡ Quick validation error: {str(e)}")
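    # Fallback: compute the difference from the data instead of hard-coding it
    real_ctr = np.mean(simple_real_data['key_results']['click_through_rate'])
    sim_ctr = np.mean(simple_simulation_data['key_results']['click_through_rate'])
    print(f"📊 Fallback: simulated click-through rate is {(sim_ctr - real_ctr) / real_ctr:.0%} higher than the real one")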

⚡ Quick validation score: 16.6%
📊 Results: Statistical validation: 2/2 tests significant, average effect size: 5.170; Overall validation score: 0.166


## Key Takeaways

This validation system gives you confidence in your TinyTroupe simulations by:

1. **📊 Statistical Validation** - Tests if your simulation results are statistically similar to real-world data
2. **🧠 Semantic Validation** - Compares agent reasoning with real customer thinking patterns  
3. **🎯 Confidence Scoring** - Provides a clear 0-100% confidence score for business decisions
4. **💰 Risk Assessment** - Helps you make informed investment decisions based on validation confidence

**The Bottom Line**: Before making expensive business decisions based on simulations, validate them against real data. This system gives you a quantitative estimate of how far you can trust your TinyTroupe predictions.

### Next Steps
- Collect real customer data for your use case
- Run your TinyTroupe simulation  
- Use this validation system to assess confidence
- Make data-driven business decisions

**Remember**: A simulation is only as good as its validation against reality.

# Advanced Feature: Categorical Data Validation

**New Capability**: The validation system now supports categorical string data directly - no manual encoding required!

**Key Benefits**:
- 🎯 **Direct String Input** - Use categories like ["yes", "no", "maybe"] or ["low", "medium", "high"] directly
- 🔄 **Automatic Conversion** - Strings are automatically normalized and converted to ordinal values
- 📊 **KS Test Support** - Compare distributions, not just means, using the Kolmogorov-Smirnov test
- 📋 **Categorical Reports** - Get comprehensive reports showing category mappings and distributions

This is especially useful for survey responses, preference studies, and qualitative research validation.
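
To make the idea of automatic conversion concrete, here is a minimal sketch of normalizing category strings and mapping them to ordinal ranks. It is an illustration only, not TinyTroupe's implementation. `to_ordinals` is a hypothetical helper, and the explicit `order` argument is an assumption, since the library may choose the ordering differently.

```python
def to_ordinals(values, order):
    """Normalize category strings and map them to integer ranks."""
    mapping = {category: rank for rank, category in enumerate(order)}
    # Strip surrounding whitespace and lowercase before looking up the rank
    normalized = [value.strip().lower() for value in values]
    return [mapping[value] for value in normalized], mapping


ordinals, mapping = to_ordinals(
    [" Low", "HIGH", "medium ", "low"],
    order=["low", "medium", "high"],
)
print(mapping)
print(ordinals)
```

Whatever ordering is used, it must be the same for the control and treatment data, otherwise the resulting numbers are not comparable.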

## Example: Product Preference Study

**Scenario**: Your company wants to check whether the product preferences expressed by TinyTroupe agents match those of real humans.

**Real Data**: Human survey responses (mostly uncertain/neutral)  
**Simulation**: AI agent responses (more decisive/polarized)  
**Question**: Do the response patterns match well enough to trust the simulation?

In [None]:
# Define categorical data - notice we use strings directly, no manual encoding!

# Control data: Human responses (mostly neutral/uncertain)
human_responses = {
    "name": "Human Control Group",
    "description": "Empirical human responses to product preference",
    "key_results": {
        "preference": [
            "no", "maybe", "maybe", "yes", "maybe", 
            "no", "maybe", "maybe", "yes", "maybe",
            "no", "maybe", "yes", "maybe", "maybe"
        ],
        "satisfaction": [
            "low", "medium", "medium", "high", "medium",
            "low", "medium", "medium", "high", "medium", 
            "medium", "medium", "high", "medium", "medium"
        ]
    },
    "result_types": {
        "preference": "per_agent",
        "satisfaction": "per_agent"
    },
    "agent_justifications": [
        "Price seems reasonable but I'm not sure about quality",
        "Need more information before deciding",
        "Looks good but want to compare alternatives"
    ],
    "justification_summary": "Humans showed uncertainty and wanted more information before deciding"
}

# Treatment data: AI agent responses (more decisive/polarized)
ai_responses = {
    "name": "AI Agent Group", 
    "description": "AI agent simulation results for product preference",
    "key_results": {
        "preference": [
            "yes", "yes", "no", "yes", "yes",
            "no", "yes", "yes", "no", "yes",
            "yes", "no", "yes", "yes", "yes"
        ],
        "satisfaction": [
            "high", "high", "low", "high", "high",
            "low", "high", "high", "low", "high",
            "high", "low", "high", "high", "high"  
        ]
    },
    "result_types": {
        "preference": "per_agent", 
        "satisfaction": "per_agent"
    },
    "agent_justifications": [
        "Clear value proposition with good price-to-quality ratio",
        "Product specs don't meet my requirements",
        "Excellent features that justify the cost"
    ],
    "justification_summary": "AI agents made more decisive judgments based on clear criteria"
}

print("📊 Data loaded with categorical string values:")
print("Human preferences:", human_responses["key_results"]["preference"])
print("AI preferences:", ai_responses["key_results"]["preference"])
print("\n✨ Notice: No manual encoding required - strings are used directly!")

## Comparison: T-test vs KS Test for Categorical Data

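The traditional t-test compares **means**, but for categorical data we often care about the whole **distribution**.

The **Kolmogorov-Smirnov (KS) test** compares the cumulative distributions of the two samples and can detect differences that a t-test misses, such as two groups with similar means but different spread. Because the categories are converted to a handful of ordinal values, the samples contain many ties, so treat the KS p-values as approximate.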

Let's see both in action:

In [None]:
# Analysis 1: Traditional Welch t-test (comparing means)
print("📈 Analysis 1: Traditional Welch t-test (comparing means)")
print("-" * 50)

try:
    result_ttest = validate_simulation_experiment_empirically(
        control_data=human_responses,
        treatment_data=ai_responses,
        validation_types=["statistical"],
        statistical_test_type="welch_t_test",  # Default test type
        output_format="values"
    )
    
    if result_ttest.statistical_results and "test_results" in result_ttest.statistical_results:
        for metric in ["preference", "satisfaction"]:
            test_result = result_ttest.statistical_results["test_results"]["treatment"][metric]
            # Format the p-value only when it is numeric
            p_value = test_result.get('p_value', 'N/A')
            p_text = f"{p_value:.4f}" if isinstance(p_value, (int, float)) else str(p_value)
            print(f"{metric.title()}:")
            print(f"  - Significant: {test_result.get('significant', 'N/A')}")
            print(f"  - p-value: {p_text}")
            print(f"  - Effect size: {test_result.get('effect_size', 'N/A')}")
    else:
        print("T-test results not available - may be due to data limitations")
        
except Exception as e:
    print(f"T-test analysis error: {str(e)}")
    print("This might happen with categorical data - KS test is often more appropriate")

In [None]:
# Analysis 2: Kolmogorov-Smirnov test (comparing distributions)
print("\n📊 Analysis 2: Kolmogorov-Smirnov test (comparing distributions)")
print("-" * 50)

try:
    result_ks = validate_simulation_experiment_empirically(
        control_data=human_responses,
        treatment_data=ai_responses,
        validation_types=["statistical"],
        statistical_test_type="ks_test",  # Use KS test instead
        output_format="values"
    )
    
    if result_ks.statistical_results and "test_results" in result_ks.statistical_results:
        for metric in ["preference", "satisfaction"]:
            test_result = result_ks.statistical_results["test_results"]["treatment"][metric]
            # Format numeric fields only when they are actually numbers
            p_value = test_result.get('p_value', 'N/A')
            p_text = f"{p_value:.4f}" if isinstance(p_value, (int, float)) else str(p_value)
            ks_stat = test_result.get('ks_statistic', 'N/A')
            ks_text = f"{ks_stat:.4f}" if isinstance(ks_stat, (int, float)) else str(ks_stat)
            print(f"{metric.title()}:")
            print(f"  - Significant: {test_result.get('significant', 'N/A')}")
            print(f"  - p-value: {p_text}")
            print(f"  - KS statistic: {ks_text}")
            print(f"  - Interpretation: {test_result.get('interpretation', 'N/A')}")
    else:
        print("KS test results not available")
        
except Exception as e:
    print(f"KS test analysis error: {str(e)}")
    
print("\n💡 Key Insight: KS test is often better for categorical data because it")
print("   compares the entire distribution shape, not just the average values!")
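
To show what the KS statistic measures, the sketch below computes it by hand for the `preference` responses: it is the largest gap between the two empirical cumulative distribution functions. The ordering `no < maybe < yes` is an assumption made for illustration and the library's mapping may differ. `ks_statistic` is a hypothetical helper, not the validator's code.

```python
def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    def ecdf(sample, x):
        return sum(value <= x for value in sample) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


rank = {"no": 0, "maybe": 1, "yes": 2}  # assumed ordinal mapping

human = [rank[r] for r in [
    "no", "maybe", "maybe", "yes", "maybe",
    "no", "maybe", "maybe", "yes", "maybe",
    "no", "maybe", "yes", "maybe", "maybe",
]]
ai = [rank[r] for r in [
    "yes", "yes", "no", "yes", "yes",
    "no", "yes", "yes", "no", "yes",
    "yes", "no", "yes", "yes", "yes",
]]

mean_gap = abs(sum(human) / len(human) - sum(ai) / len(ai))
print(f"Difference in means: {mean_gap:.3f}")
print(f"KS statistic: {ks_statistic(human, ai):.3f}")
```

The gap is largest at "maybe": most humans have answered "no" or "maybe" by that point, while the AI agents never choose "maybe" at all.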

## Behind the Scenes: Categorical Data Conversion

The system automatically converts string categories to ordinal values for statistical analysis, while preserving the original categories for interpretation.

In [None]:
# Show how categorical data is automatically converted
from tinytroupe.validation.simulation_validator import SimulationExperimentDataset

print("🔄 Categorical Data Conversion Details")
print("=" * 50)

# Create both datasets once so their mappings and distributions can be inspected
human_dataset = SimulationExperimentDataset.parse_obj(human_responses)
ai_dataset = SimulationExperimentDataset.parse_obj(ai_responses)

groups = [
    ("Humans", human_dataset, human_responses),
    ("AI", ai_dataset, ai_responses),
]

for metric in ["preference", "satisfaction"]:
    print(f"\n{metric.title()} Categories:")
    
    # Show the automatic mapping
    categories = human_dataset.get_categorical_values(metric)
    mapping = human_dataset.categorical_mappings[metric]
    
    for category in sorted(categories):
        ordinal = mapping[category]
        print(f"  '{category}' → {ordinal}")
    
    # Show the distribution in both groups
    for label, dataset, raw_data in groups:
        summary = dataset.get_metric_summary(metric)
        if "category_distribution" in summary:
            print(f"\n{metric.title()} Distribution - {label}:")
            total = len(raw_data["key_results"][metric])
            for category, count in summary["category_distribution"].items():
                percentage = count / total * 100
                print(f"  {category}: {count} ({percentage:.1f}%)")

print("\n✨ Key Benefits:")
print("  • Automatic normalization (lowercasing, whitespace removal)")
print("  • Consistent ordinal mapping across groups")
print("  • Original categories preserved for reports")
print("  • No manual encoding required!")

## Comprehensive Report with Categorical Information

Generate a full validation report that includes categorical data details:

In [None]:
# Generate comprehensive report with categorical information
print("📋 Generating Comprehensive Report with Categorical Information")
print("-" * 50)

try:
    comprehensive_report = validate_simulation_experiment_empirically(
        control_data=human_responses,
        treatment_data=ai_responses,
        validation_types=["statistical", "semantic"],
        statistical_test_type="ks_test",
        output_format="report"  # Get full markdown report
    )
    
    # Save report to file for inspection
    report_filename = "categorical_validation_report.md"
    with open(report_filename, "w", encoding="utf-8") as f:
        f.write(comprehensive_report)
    
    print(f"✅ Full report saved to '{report_filename}'")
    
    # Show a preview of the report
    print("\n📄 Report Preview (first 500 characters):")
    print("-" * 30)
    print(comprehensive_report[:500] + "..." if len(comprehensive_report) > 500 else comprehensive_report)
    
except Exception as e:
    print(f"❌ Error generating report: {str(e)}")
    print("This might be due to semantic analysis limitations with the current setup")
    
    # Try statistical-only report as fallback
    try:
        print("\n🔄 Generating statistical-only report as fallback...")
        stats_report = validate_simulation_experiment_empirically(
            control_data=human_responses,
            treatment_data=ai_responses,
            validation_types=["statistical"],
            statistical_test_type="ks_test",
            output_format="report"
        )
        
        with open("categorical_stats_report.md", "w", encoding="utf-8") as f:
            f.write(stats_report)
        
        print("✅ Statistical report saved to 'categorical_stats_report.md'")
        print("\n📄 Statistical Report Preview:")
        print("-" * 30)
        print(stats_report[:400] + "..." if len(stats_report) > 400 else stats_report)
        
    except Exception as fallback_e:
        print(f"❌ Fallback also failed: {str(fallback_e)}")
        print("Manual report generation may be needed")

## Why KS Test is Better for Categorical Data

Here's a demonstration of how the KS test can complement the t-test on categorical data by looking at the shape of the distribution rather than only its mean:

In [None]:
# Scenario: Same "average" response but different distributions
print("🎯 Demonstrating KS Test Advantages for Categorical Data")
print("=" * 60)

print("\nScenario: Same 'average' response but different distributions")

# Both groups have nearly the same mean (1.0 vs. about 1.07, if low=0, medium=1, high=2)
# but very different distributions
control_uniform = {
    "name": "Uniform Distribution",
    "key_results": {
        "response": ["low", "medium", "high"] * 5  # Even distribution: 5 low, 5 medium, 5 high
    },
    "result_types": {"response": "per_agent"}
}

treatment_polarized = {
    "name": "Polarized Distribution", 
    "key_results": {
        "response": ["low"] * 7 + ["high"] * 8  # Polarized: 7 low, 0 medium, 8 high (same mean ≈ 1.0)
    },
    "result_types": {"response": "per_agent"}
}

print(f"\nControl (Uniform): {control_uniform['key_results']['response']}")
print(f"Treatment (Polarized): {treatment_polarized['key_results']['response']}")

# Calculate means to show they're similar
from tinytroupe.validation.simulation_validator import SimulationExperimentDataset
control_ds = SimulationExperimentDataset.parse_obj(control_uniform)
treatment_ds = SimulationExperimentDataset.parse_obj(treatment_polarized)

control_mean = sum(control_ds.get_metric_values("response")) / len(control_ds.get_metric_values("response"))
treatment_mean = sum(treatment_ds.get_metric_values("response")) / len(treatment_ds.get_metric_values("response"))

print(f"\nMeans (after conversion):")
print(f"Control mean: {control_mean:.2f}")
print(f"Treatment mean: {treatment_mean:.2f}")
print(f"Mean difference: {abs(control_mean - treatment_mean):.2f} (very small!)")

# Compare t-test vs KS test
print(f"\n📊 Statistical Test Comparison:")
print("-" * 40)

try:
    # T-test (comparing means)
    result_ttest = validate_simulation_experiment_empirically(
        control_data=control_uniform,
        treatment_data=treatment_polarized,
        validation_types=["statistical"],
        statistical_test_type="welch_t_test",
        output_format="values"
    )
    
    # KS test (comparing distributions)
    result_ks = validate_simulation_experiment_empirically(
        control_data=control_uniform,
        treatment_data=treatment_polarized,
        validation_types=["statistical"],
        statistical_test_type="ks_test",
        output_format="values"
    )
    
    if (result_ttest.statistical_results and result_ks.statistical_results and 
        "test_results" in result_ttest.statistical_results and 
        "test_results" in result_ks.statistical_results):
        
        ttest_result = result_ttest.statistical_results["test_results"]["treatment"]["response"]
        ks_result = result_ks.statistical_results["test_results"]["treatment"]["response"]
        
        def _fmt_p(p_value):
            # Show numeric p-values with 4 decimals; pass anything else through unchanged
            return f"{p_value:.4f}" if isinstance(p_value, (int, float)) else p_value

        print(f"T-test (means):        Significant: {ttest_result.get('significant', 'N/A')}, p-value: {_fmt_p(ttest_result.get('p_value', 'N/A'))}")
        print(f"KS test (distributions): Significant: {ks_result.get('significant', 'N/A')}, p-value: {_fmt_p(ks_result.get('p_value', 'N/A'))}")
        
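        # Only claim the KS advantage when the results actually show it
        if ks_result.get('significant') and not ttest_result.get('significant'):
            print("\n💡 Insight: KS test can detect distributional differences that t-tests miss!")
            print("   Even with similar means, the response patterns are fundamentally different.")
        else:
            print("\n💡 Note: the KS test did not separate the groups more clearly than the t-test here.")
            print("   With only 15 responses per group, either test may lack the power to detect the difference.")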
    else:
        print("Test results not available - using fallback comparison")
        print("💡 Key Point: T-test focuses on means, KS test compares entire distributions")
        
except Exception as e:
    print(f"Error in comparison: {str(e)}")
    print("💡 Key Point: T-test focuses on means, KS test compares entire distributions")
    print("   For categorical data, distributions often matter more than averages!")
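
To make the comparison concrete, here is a minimal, dependency-free sketch of the quantity the two-sample KS test is built on: the largest gap between the two empirical cumulative distributions. It assumes the ordinal coding low=0, medium=1, high=2, which may differ from the mapping TinyTroupe applies internally. The names `ORDINAL_CODES`, `ecdf`, and `ks_statistic` are illustrative and not part of the library.

```python
# Assumed ordinal coding for illustration; TinyTroupe's internal mapping may differ.
ORDINAL_CODES = {"low": 0, "medium": 1, "high": 2}

def ecdf(values, x):
    """Fraction of values less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

# Same responses as the demonstration cell, converted with the assumed coding
control = [ORDINAL_CODES[r] for r in ["low", "medium", "high"] * 5]
treatment = [ORDINAL_CODES[r] for r in ["low"] * 7 + ["high"] * 8]

control_mean = sum(control) / len(control)
treatment_mean = sum(treatment) / len(treatment)
d_stat = ks_statistic(control, treatment)

print(f"Control mean:   {control_mean:.2f}")
print(f"Treatment mean: {treatment_mean:.2f}")
print(f"KS statistic D: {d_stat:.2f}")
```

Note that D is only the test statistic. Whether it counts as significant depends on the sample sizes, so small groups like these may not yield a significant result from either test.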

## Categorical Data Validation - Key Takeaways

🎉 **Categorical data validation is now easy and powerful!**

### ✅ Key Benefits:

1. **🎯 Direct String Input** - No manual encoding required
   - Use `["yes", "no", "maybe"]` or `["low", "medium", "high"]` directly
   - Automatic normalization handles case and whitespace differences

2. **🔄 Smart Conversion** - Automatic ordinal mapping
   - Strings converted to meaningful ordinal values for analysis
   - Original categories preserved for reports and interpretation

3. **📊 Better Statistical Tests** - KS test for distributions
   - Compare entire response patterns, not just averages
   - Detect differences in behavior that t-tests might miss

4. **📋 Rich Reports** - Categorical information included
   - Category mappings and distributions in validation reports
   - Easy to understand and share with stakeholders
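
The normalization and ordinal mapping described above can be pictured with a small sketch. This is not TinyTroupe's internal implementation: `normalize` and `encode_ordinal` are hypothetical helpers, and passing the category order explicitly is an assumption made for clarity.

```python
def normalize(response):
    """Lowercase and strip surrounding whitespace so 'Yes ' and 'yes' match."""
    return response.strip().lower()

def encode_ordinal(responses, ordered_categories):
    """Map normalized responses to integer ranks using a caller-supplied order."""
    codes = {category: rank for rank, category in enumerate(ordered_categories)}
    return [codes[normalize(r)] for r in responses]

# Inconsistent casing and whitespace, as often found in raw survey exports
raw = ["Low", " high", "MEDIUM ", "low"]
encoded = encode_ordinal(raw, ["low", "medium", "high"])
print(encoded)
```

Keeping the original strings alongside the encoded values makes it possible to report results in terms of the categories rather than the numeric codes.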

### 🚀 Best Practices:

- **Use the KS test** (`statistical_test_type="ks_test"`) for categorical data
- **Include semantic validation** when you have justification data
- **Generate reports** (`output_format="report"`) for comprehensive documentation
- **Review category distributions** to understand behavior patterns
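
As one way to review category distributions before running any test, the sketch below tabulates each category's share with the standard library. The helper name `category_shares` is illustrative, and the data reuses the responses from the demonstration above.

```python
from collections import Counter

def category_shares(responses):
    """Return each category's share of the responses, most common first."""
    counts = Counter(responses)
    total = len(responses)
    return {category: count / total for category, count in counts.most_common()}

control = ["low", "medium", "high"] * 5
treatment = ["low"] * 7 + ["high"] * 8

for name, responses in [("Control", control), ("Treatment", treatment)]:
    shares = category_shares(responses)
    summary = ", ".join(f"{c}: {s:.0%}" for c, s in shares.items())
    print(f"{name}: {summary}")
```

A side-by-side view like this shows at a glance that the treatment group has no "medium" responses, which a comparison of means alone would not reveal.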

### 💡 When to Use:

- Survey responses and questionnaires
- Product preferences and ratings
- Qualitative research validation
- Any time you have string-based categorical data

This makes TinyTroupe validation accessible for qualitative research and survey-based studies!