# RIS Auto-Research Engine - Comprehensive Result Analysis

This notebook provides in-depth analysis tools for exploring and understanding your experimental results. You'll learn to:
- Query and filter the experiment database
- Compute statistical summaries (mean, standard deviation, min, max, median)
- Compare top-performing configurations
- Generate publication-ready visualizations
- Perform statistical significance tests
- Analyze the gap between low- and high-fidelity results
- Export recommendations and best configurations

**Prerequisites:** Run some experiments first using `01_quickstart.ipynb` or `03_run_search.ipynb`.

## 1. Setup and Imports

Import analysis tools and connect to the results database.

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
from pathlib import Path
from datetime import datetime
from collections import defaultdict

# Statistical analysis (optional)
try:
    from scipy import stats
    SCIPY_AVAILABLE = True
except ImportError:
    SCIPY_AVAILABLE = False
    print("⚠️  scipy not available - statistical tests will be skipped")
    print("   Install with: pip install scipy")

# RIS Engine imports
from ris_research_engine.engine import ResultAnalyzer, ReportGenerator
from ris_research_engine.ui import RISEngine

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 7)
plt.rcParams['font.size'] = 11

# Make sure the output directory exists before any plot or export is written
Path("outputs").mkdir(parents=True, exist_ok=True)

# Initialize tools
analyzer = ResultAnalyzer(db_path="results.db")
reporter = ReportGenerator(output_dir="outputs")
engine = RISEngine(db_path="results.db", output_dir="outputs")

print("✓ Analysis tools initialized")
print("  Database: results.db")
print("  Output directory: outputs/")
print(f"  Statistical tests: {'Enabled' if SCIPY_AVAILABLE else 'Disabled'}")
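
Before querying, it can help to confirm that the database contains the table and columns the later cells select. A minimal sketch, assuming the `experiments` layout implied by this notebook's own `SELECT` statements; `missing_columns` is a hypothetical helper (not part of `ris_research_engine`), and the in-memory database only stands in for `results.db`:

```python
import sqlite3

# Columns read by this notebook's queries (the real schema may contain more).
EXPECTED_COLUMNS = {
    "id", "name", "status", "timestamp", "training_time_seconds",
    "total_epochs", "best_epoch", "model_parameters", "config", "metrics",
}

def missing_columns(conn, table="experiments"):
    """Return the expected columns that `table` lacks (all of them if it is absent)."""
    # Table names cannot be bound as parameters, so `table` must be a trusted identifier.
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    present = {row[1] for row in rows}
    return EXPECTED_COLUMNS - present

# Demonstration on an in-memory stand-in with a deliberately incomplete table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE experiments (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
print("Missing columns:", sorted(missing_columns(demo_conn)))
```

An empty result means every column used in this notebook is available; a non-empty one explains later `no such column` errors up front.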

## 2. Load and Display Experiment Database

Query the database to see all completed experiments with their key parameters and results.

In [None]:
# Connect to database
conn = sqlite3.connect("results.db")

# Query all experiments
query = """
SELECT 
    id,
    name,
    status,
    timestamp,
    training_time_seconds,
    total_epochs,
    best_epoch,
    model_parameters
FROM experiments
ORDER BY timestamp DESC
"""

experiments_df = pd.read_sql_query(query, conn)

# Parse JSON fields for display
try:
    import json
    
    # Get additional details from JSON columns
    detail_query = """
    SELECT id, config, metrics 
    FROM experiments
    """
    details_df = pd.read_sql_query(detail_query, conn)
    
    def _load_json(raw):
        """Decode a JSON text column; empty or NULL fields become {}."""
        return json.loads(raw) if raw else {}
    
    configs = details_df['config'].apply(_load_json)
    metrics = details_df['metrics'].apply(_load_json)
    
    # Extract probe and model types and the headline metrics
    details_df['probe_type'] = configs.apply(lambda c: c.get('probe_type', 'N/A'))
    details_df['model_type'] = configs.apply(lambda c: c.get('model_type', 'N/A'))
    details_df['top_1_accuracy'] = metrics.apply(lambda m: m.get('top_1_accuracy', 0))
    details_df['val_loss'] = metrics.apply(lambda m: m.get('val_loss', 0))
    
    # Join on id rather than assigning by position: experiments_df is sorted by
    # timestamp while this query has no ORDER BY, so the row order may differ.
    experiments_df = experiments_df.merge(
        details_df.drop(columns=['config', 'metrics']), on='id', how='left'
    )
    
except Exception as e:
    print(f"⚠️  Could not parse JSON fields: {e}")

conn.close()

# Display summary
print("\n" + "="*90)
print("EXPERIMENT DATABASE SUMMARY")
print("="*90)
print(f"Total Experiments: {len(experiments_df)}")
print(f"Completed: {sum(experiments_df['status'] == 'completed')}")
print(f"Failed: {sum(experiments_df['status'] == 'failed')}")
print(f"Running: {sum(experiments_df['status'] == 'running')}")

if len(experiments_df) > 0:
    print(f"\nDate Range: {experiments_df['timestamp'].min()} to {experiments_df['timestamp'].max()}")
    print(f"Total Training Time: {experiments_df['training_time_seconds'].sum():.1f}s ({experiments_df['training_time_seconds'].sum()/3600:.2f}h)")

print("\n" + "="*90)

# Display table (top 10 most recent)
display_cols = ['id', 'name', 'status', 'probe_type', 'model_type', 
                'top_1_accuracy', 'training_time_seconds', 'timestamp']
display_cols = [c for c in display_cols if c in experiments_df.columns]

if len(experiments_df) > 0:
    print("\nMost Recent Experiments:")
    print(experiments_df[display_cols].head(10).to_string(index=False))
else:
    print("\n⚠️  No experiments found in database.")
    print("   Run some experiments first using 01_quickstart.ipynb or 03_run_search.ipynb")
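
The loading cell assumes that each row stores `config` and `metrics` as JSON text. To exercise that logic without a populated `results.db`, here is a dependency-free sketch on an in-memory database. The two sample rows and the `load_flat_rows` helper are made up for illustration, and the table only includes the columns needed here:

```python
import json
import sqlite3

def load_flat_rows(conn):
    """Read experiments and flatten the JSON columns into plain dict rows."""
    cursor = conn.execute(
        "SELECT id, name, status, config, metrics FROM experiments ORDER BY id"
    )
    rows = []
    for exp_id, name, status, config_raw, metrics_raw in cursor:
        # NULL or empty JSON columns are treated as empty objects.
        config = json.loads(config_raw) if config_raw else {}
        metrics = json.loads(metrics_raw) if metrics_raw else {}
        rows.append({
            "id": exp_id,
            "name": name,
            "status": status,
            "probe_type": config.get("probe_type", "N/A"),
            "model_type": config.get("model_type", "N/A"),
            # None marks a metric that was never recorded.
            "top_1_accuracy": metrics.get("top_1_accuracy"),
        })
    return rows

fixture = sqlite3.connect(":memory:")
fixture.execute(
    "CREATE TABLE experiments "
    "(id INTEGER PRIMARY KEY, name TEXT, status TEXT, config TEXT, metrics TEXT)"
)
sample_rows = [  # hypothetical experiments
    (1, "demo_a", "completed",
     json.dumps({"probe_type": "probe_a", "model_type": "mlp"}),
     json.dumps({"top_1_accuracy": 0.82})),
    (2, "demo_b", "failed",
     json.dumps({"probe_type": "probe_b", "model_type": "mlp"}),
     None),
]
fixture.executemany("INSERT INTO experiments VALUES (?, ?, ?, ?, ?)", sample_rows)
flat_rows = load_flat_rows(fixture)
fixture.close()
for row in flat_rows:
    print(row)
```

Passing `flat_rows` to `pd.DataFrame` yields a frame with the `probe_type`, `model_type`, and `top_1_accuracy` columns used in later sections. Unlike the cell above, this sketch keeps a missing metric as `None` instead of substituting 0, so a failed run cannot be mistaken for a run that scored zero.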

## 3. Filter Experiments

Apply filters to focus on specific subsets of experiments by campaign, status, date, or other criteria.

In [None]:
# Filter configuration
FILTER_STATUS = 'completed'  # Options: 'completed', 'failed', 'running', None (all)
FILTER_PROBE = None          # Options: specific probe name, None (all)
FILTER_MODEL = None          # Options: specific model name, None (all)
FILTER_CAMPAIGN = None       # Options: campaign name substring, None (all)

# Apply filters
filtered_df = experiments_df.copy()

if FILTER_STATUS:
    filtered_df = filtered_df[filtered_df['status'] == FILTER_STATUS]
    print(f"✓ Filtered by status: {FILTER_STATUS}")

if FILTER_PROBE and 'probe_type' in filtered_df.columns:
    filtered_df = filtered_df[filtered_df['probe_type'] == FILTER_PROBE]
    print(f"✓ Filtered by probe: {FILTER_PROBE}")

if FILTER_MODEL and 'model_type' in filtered_df.columns:
    filtered_df = filtered_df[filtered_df['model_type'] == FILTER_MODEL]
    print(f"✓ Filtered by model: {FILTER_MODEL}")

if FILTER_CAMPAIGN:
    # Literal substring match; rows with a missing name are excluded
    filtered_df = filtered_df[filtered_df['name'].str.contains(FILTER_CAMPAIGN, case=False, regex=False, na=False)]
    print(f"✓ Filtered by campaign: {FILTER_CAMPAIGN}")

print(f"\nFiltered Results: {len(filtered_df)} / {len(experiments_df)} experiments")

if len(filtered_df) == 0:
    print("\n⚠️  No experiments match the filter criteria.")
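
If the same filters are needed in several places, the logic can be wrapped in a function that combines boolean masks. This is a sketch, not part of the engine: `apply_filters` is a hypothetical helper, the demo frame is made up, and the column names follow those used in this notebook.

```python
import pandas as pd

def apply_filters(df, status=None, probe=None, model=None, campaign=None):
    """Return the rows of `df` that match every filter that is not None."""
    mask = pd.Series(True, index=df.index)
    if status is not None:
        mask &= df["status"] == status
    if probe is not None:
        mask &= df["probe_type"] == probe
    if model is not None:
        mask &= df["model_type"] == model
    if campaign is not None:
        # Literal, case-insensitive substring match; missing names never match.
        mask &= df["name"].str.contains(campaign, case=False, regex=False, na=False)
    return df[mask]

demo_df = pd.DataFrame({
    "name": ["sweep_a_run1", "sweep_b_run1", None],
    "status": ["completed", "completed", "failed"],
    "probe_type": ["probe_a", "probe_b", "probe_a"],
    "model_type": ["mlp", "mlp", "cnn"],
})
print(apply_filters(demo_df, status="completed", campaign="SWEEP_A"))
```

Building one mask and indexing once keeps the original frame untouched and makes the combined condition easy to inspect.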

## 4. Statistical Summary

Compute aggregated statistics (mean, std, min, max) for each metric across all filtered experiments.

In [None]:
if len(filtered_df) > 0 and 'top_1_accuracy' in filtered_df.columns:
    # Group by probe type if available
    if 'probe_type' in filtered_df.columns:
        grouped = filtered_df.groupby('probe_type').agg({
            'top_1_accuracy': ['mean', 'std', 'min', 'max', 'count'],
            'val_loss': ['mean', 'std', 'min', 'max'],
            'training_time_seconds': ['mean', 'std', 'min', 'max']
        }).round(4)
        
        print("\n" + "="*90)
        print("STATISTICAL SUMMARY BY PROBE TYPE")
        print("="*90)
        print(grouped)
        print("="*90)
    
    # Overall statistics
    print("\n" + "="*90)
    print("OVERALL STATISTICS")
    print("="*90)
    
    stats_data = {
        'Metric': [],
        'Mean': [],
        'Std Dev': [],
        'Min': [],
        'Max': [],
        'Median': []
    }
    
    for metric in ['top_1_accuracy', 'val_loss', 'training_time_seconds']:
        if metric in filtered_df.columns:
            values = filtered_df[metric].dropna()
            if len(values) > 0:
                stats_data['Metric'].append(metric)
                stats_data['Mean'].append(f"{values.mean():.4f}")
                stats_data['Std Dev'].append(f"{values.std():.4f}")
                stats_data['Min'].append(f"{values.min():.4f}")
                stats_data['Max'].append(f"{values.max():.4f}")
                stats_data['Median'].append(f"{values.median():.4f}")
    
    stats_df = pd.DataFrame(stats_data)
    print(stats_df.to_string(index=False))
    print("="*90)
else:
    print("\n⚠️  Insufficient data for statistical summary.")
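
The summary above reports spread as a standard deviation but does not give an interval for the mean. One way to add that is a percentile bootstrap, sketched here with the standard library only. `bootstrap_mean_ci` is a hypothetical helper and the accuracy values are invented:

```python
import random
import statistics

def bootstrap_mean_ci(values, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    if len(values) < 2:
        raise ValueError("need at least two values to estimate an interval")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lower_idx], means[upper_idx]

accuracies = [0.81, 0.84, 0.79, 0.86, 0.83, 0.80, 0.85]  # made-up values
ci_low, ci_high = bootstrap_mean_ci(accuracies)
print(f"mean={statistics.fmean(accuracies):.4f}, 95% CI=({ci_low:.4f}, {ci_high:.4f})")
```

To apply it to the notebook's data, pass a plain list such as `filtered_df['top_1_accuracy'].dropna().tolist()`. With only a handful of runs per group the interval is rough, so treat it as an indication rather than a guarantee.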

## 5. Compare Top 5 Experiments

Identify and display the best-performing experiments based on top-1 accuracy.

In [None]:
if len(filtered_df) > 0 and 'top_1_accuracy' in filtered_df.columns:
    # Sort by accuracy
    top_experiments = filtered_df.nlargest(5, 'top_1_accuracy')
    
    print("\n" + "="*90)
    print("TOP 5 EXPERIMENTS (by Top-1 Accuracy)")
    print("="*90)
    
    # Display key columns
    display_cols = ['id', 'name', 'probe_type', 'model_type', 
                   'top_1_accuracy', 'val_loss', 'training_time_seconds']
    display_cols = [c for c in display_cols if c in top_experiments.columns]
    
    # Format for display
    top_display = top_experiments[display_cols].copy()
    for col in ['top_1_accuracy', 'val_loss']:
        if col in top_display.columns:
            top_display[col] = top_display[col].apply(lambda x: f"{x:.4f}")
    if 'training_time_seconds' in top_display.columns:
        top_display['training_time_seconds'] = top_display['training_time_seconds'].apply(lambda x: f"{x:.1f}")
    
    print(top_display.to_string(index=False))
    print("="*90)
    
    # Highlight winner
    winner = top_experiments.iloc[0]
    print("\n🏆 Best Configuration:")
    print(f"   ID: {winner['id']}")
    print(f"   Name: {winner['name']}")
    if 'probe_type' in winner:
        print(f"   Probe: {winner['probe_type']}")
    if 'model_type' in winner:
        print(f"   Model: {winner['model_type']}")
    print(f"   Top-1 Accuracy: {winner['top_1_accuracy']:.4f}")
    print(f"   Training Time: {winner['training_time_seconds']:.1f}s")
else:
    print("\n⚠️  Insufficient data for top experiments comparison.")
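
Ranking individual runs can reward a single lucky run. When a configuration has been run several times, averaging over its runs before ranking gives a steadier picture. The sketch below assumes that repeated runs share the same `probe_type` and `model_type`; `rank_configurations` is a hypothetical helper and the data is invented.

```python
import pandas as pd

def rank_configurations(df, min_runs=1):
    """Rank (probe_type, model_type) pairs by mean accuracy over their runs."""
    summary = (
        df.groupby(["probe_type", "model_type"])["top_1_accuracy"]
        .agg(mean_acc="mean", std_acc="std", n_runs="count")
        .reset_index()
    )
    # Optionally ignore configurations with too few runs to trust the mean.
    summary = summary[summary["n_runs"] >= min_runs]
    return summary.sort_values("mean_acc", ascending=False).reset_index(drop=True)

runs = pd.DataFrame({  # made-up runs: probe_a has the best single run
    "probe_type": ["probe_a", "probe_a", "probe_b", "probe_b"],
    "model_type": ["mlp", "mlp", "mlp", "mlp"],
    "top_1_accuracy": [0.90, 0.70, 0.82, 0.84],
})
ranked = rank_configurations(runs)
print(ranked)
```

In this toy data `probe_a` holds the single best run, yet `probe_b` ranks first on mean accuracy, which is the kind of disagreement worth checking before choosing a winner.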

## 6. Generate Comparison Plots

Create comprehensive visualizations comparing different aspects of experiments.

### 6.1 Probe Performance Comparison

In [None]:
if len(filtered_df) > 0 and 'probe_type' in filtered_df.columns and 'top_1_accuracy' in filtered_df.columns:
    # Box plot for probe comparison
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Accuracy comparison
    probe_data = []
    probe_labels = []
    for probe in filtered_df['probe_type'].unique():
        probe_exps = filtered_df[filtered_df['probe_type'] == probe]
        probe_data.append(probe_exps['top_1_accuracy'].values)
        probe_labels.append(probe)
    
    bp = axes[0].boxplot(probe_data, labels=probe_labels, patch_artist=True)
    for patch in bp['boxes']:
        patch.set_facecolor('lightblue')
        patch.set_alpha(0.7)
    
    axes[0].set_xlabel('Probe Type', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('Top-1 Accuracy', fontsize=12, fontweight='bold')
    axes[0].set_title('Accuracy Distribution by Probe Type', fontsize=13, fontweight='bold')
    axes[0].tick_params(axis='x', rotation=45)
    axes[0].grid(True, alpha=0.3, axis='y')
    axes[0].set_ylim([0, 1])
    
    # Training time comparison
    time_data = []
    for probe in probe_labels:
        probe_exps = filtered_df[filtered_df['probe_type'] == probe]
        time_data.append(probe_exps['training_time_seconds'].values)
    
    bp2 = axes[1].boxplot(time_data, labels=probe_labels, patch_artist=True)
    for patch in bp2['boxes']:
        patch.set_facecolor('lightcoral')
        patch.set_alpha(0.7)
    
    axes[1].set_xlabel('Probe Type', fontsize=12, fontweight='bold')
    axes[1].set_ylabel('Training Time (seconds)', fontsize=12, fontweight='bold')
    axes[1].set_title('Training Time by Probe Type', fontsize=13, fontweight='bold')
    axes[1].tick_params(axis='x', rotation=45)
    axes[1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.savefig('outputs/probe_comparison_detailed.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n✓ Probe comparison plots saved to: outputs/probe_comparison_detailed.png")
else:
    print("\n⚠️  Insufficient data for probe comparison plots.")

### 6.2 Model Architecture Comparison

In [None]:
if len(filtered_df) > 0 and 'model_type' in filtered_df.columns and 'top_1_accuracy' in filtered_df.columns:
    # Group by model type
    model_stats = filtered_df.groupby('model_type').agg({
        'top_1_accuracy': ['mean', 'std', 'count'],
        'training_time_seconds': ['mean', 'std']
    }).reset_index()
    
    model_stats.columns = ['model_type', 'acc_mean', 'acc_std', 'count', 'time_mean', 'time_std']
    
    # Bar chart
    fig, ax = plt.subplots(figsize=(10, 6))
    
    x_pos = np.arange(len(model_stats))
    ax.bar(x_pos, model_stats['acc_mean'], yerr=model_stats['acc_std'],
          capsize=8, alpha=0.8, color='teal', edgecolor='black', linewidth=1.5)
    
    ax.set_xlabel('Model Type', fontsize=12, fontweight='bold')
    ax.set_ylabel('Top-1 Accuracy', fontsize=12, fontweight='bold')
    ax.set_title('Model Architecture Comparison (Mean ± Std Dev)', fontsize=14, fontweight='bold')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(model_stats['model_type'], rotation=0)
    ax.set_ylim([0, 1])
    ax.grid(True, alpha=0.3, axis='y')
    
    # Add value labels
    for i, (mean, std, count) in enumerate(zip(model_stats['acc_mean'], 
                                                model_stats['acc_std'],
                                                model_stats['count'])):
        ax.text(i, mean + std + 0.02, f"{mean:.3f}\n(n={int(count)})", 
               ha='center', va='bottom', fontsize=9, fontweight='bold')
    
    plt.tight_layout()
    plt.savefig('outputs/model_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n✓ Model comparison plot saved to: outputs/model_comparison.png")
else:
    print("\n⚠️  Insufficient data for model comparison plots.")

### 6.3 Accuracy vs Training Time Pareto Front

In [None]:
if len(filtered_df) > 0 and 'top_1_accuracy' in filtered_df.columns:
    fig, ax = plt.subplots(figsize=(10, 7))
    
    # Scatter plot colored by probe type
    if 'probe_type' in filtered_df.columns:
        for probe in filtered_df['probe_type'].unique():
            probe_data = filtered_df[filtered_df['probe_type'] == probe]
            ax.scatter(probe_data['training_time_seconds'], 
                      probe_data['top_1_accuracy'],
                      s=100, alpha=0.6, label=probe, edgecolors='black', linewidth=1)
    else:
        ax.scatter(filtered_df['training_time_seconds'], 
                  filtered_df['top_1_accuracy'],
                  s=100, alpha=0.6, edgecolors='black', linewidth=1)
    
    ax.set_xlabel('Training Time (seconds)', fontsize=12, fontweight='bold')
    ax.set_ylabel('Top-1 Accuracy', fontsize=12, fontweight='bold')
    ax.set_title('Pareto Front: Accuracy vs Training Time', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.set_ylim([0, 1])
    
    # Highlight Pareto-optimal points.
    # A run is dominated when another run is at least as accurate and at least
    # as fast, and strictly better on one of the two.
    pareto_points = []
    for idx, row in filtered_df.iterrows():
        is_pareto = True
        for idx2, row2 in filtered_df.iterrows():
            if (row2['top_1_accuracy'] >= row['top_1_accuracy'] and 
                row2['training_time_seconds'] <= row['training_time_seconds'] and
                (row2['top_1_accuracy'] > row['top_1_accuracy'] or 
                 row2['training_time_seconds'] < row['training_time_seconds'])):
                is_pareto = False
                break
        if is_pareto:
            pareto_points.append((row['training_time_seconds'], row['top_1_accuracy']))
    
    if pareto_points:
        pareto_points = sorted(pareto_points, key=lambda x: x[0])
        pareto_x, pareto_y = zip(*pareto_points)
        ax.plot(pareto_x, pareto_y, 'r--', linewidth=2, alpha=0.5, label='Pareto Front')
    
    # Build the legend after the front is drawn so its label is included
    if 'probe_type' in filtered_df.columns or pareto_points:
        ax.legend(loc='best', fontsize=10)
    
    plt.tight_layout()
    plt.savefig('outputs/pareto_front.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n✓ Pareto front plot saved to: outputs/pareto_front.png")
    print(f"   Found {len(pareto_points)} Pareto-optimal configurations")
else:
    print("\n⚠️  Insufficient data for Pareto front analysis.")
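
The nested loop in the cell above compares every pair of runs, which is fine for a few hundred experiments but grows quadratically. A sort-and-sweep alternative is sketched below; `pareto_front` is a hypothetical helper that works on plain `(time, accuracy)` tuples, and the sample points are invented.

```python
def pareto_front(points):
    """Return the non-dominated (time, accuracy) pairs, sorted by time.

    A point is dominated if another point is no slower and no less accurate,
    and strictly better in at least one of the two.
    """
    # Sort by time ascending; among equal times, put the most accurate first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    front = []
    best_acc = float("-inf")
    for time_s, acc in ordered:
        # Every point seen so far is at least as fast, so this point survives
        # only if it is strictly more accurate than all of them.
        if acc > best_acc:
            front.append((time_s, acc))
            best_acc = acc
    return front

sample_points = [(10, 0.70), (20, 0.80), (15, 0.65), (30, 0.80), (5, 0.50)]
print(pareto_front(sample_points))
```

One difference from the pairwise loop: exact duplicates are reported once here, whereas the loop keeps every copy because identical points do not dominate each other.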

## 7. Export Best Configuration

Save the best-performing configuration for deployment or further experimentation.

In [None]:
if len(filtered_df) > 0 and 'top_1_accuracy' in filtered_df.columns:
    # Get best experiment
    best_exp = filtered_df.nlargest(1, 'top_1_accuracy').iloc[0]
    best_id = best_exp['id']
    
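    # Load full details from database. The query is parameterized so it also
    # works with non-numeric ids and cannot be broken by quoting issues.
    conn = sqlite3.connect("results.db")
    query = "SELECT config FROM experiments WHERE id = ?"
    # sqlite3 cannot bind NumPy scalars, so convert to a native Python value
    id_param = best_id.item() if hasattr(best_id, 'item') else best_id
    result = pd.read_sql_query(query, conn, params=(id_param,))
    conn.close()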
    
    if len(result) > 0:
        import json
        best_config = json.loads(result['config'].iloc[0])
        
        # Save to file
        output_path = Path('outputs/best_configuration.json')
        with open(output_path, 'w') as f:
            json.dump(best_config, f, indent=2)
        
        print("\n" + "="*90)
        print("BEST CONFIGURATION EXPORTED")
        print("="*90)
        print(f"\nSaved to: {output_path}")
        print(f"\nConfiguration Summary:")
        print(f"  Probe: {best_config.get('probe_type', 'N/A')}")
        print(f"  Model: {best_config.get('model_type', 'N/A')}")
        print(f"  Top-1 Accuracy: {best_exp['top_1_accuracy']:.4f}")
        print(f"  Training Time: {best_exp['training_time_seconds']:.1f}s")
        
        if 'system' in best_config:
            print(f"\n  System Parameters:")
            for key, value in best_config['system'].items():
                print(f"    {key}: {value}")
        
        print("\n" + "="*90)
    else:
        print("\n⚠️  Could not load best configuration from database.")
else:
    print("\n⚠️  No experiments available for export.")
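
An exported configuration is easier to trust later if the file also records which experiment it came from. The sketch below wraps the config in a small envelope and reads it back. The envelope layout (`source_experiment_id`, `top_1_accuracy`, `config`) and the `export_with_provenance` helper are assumptions for illustration, not the format the cell above writes.

```python
import json
import tempfile
from pathlib import Path

def export_with_provenance(config, experiment_id, accuracy, path):
    """Write the config plus the run it came from; return the written path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create outputs/ if needed
    payload = {
        "source_experiment_id": experiment_id,
        "top_1_accuracy": accuracy,
        "config": config,
    }
    path.write_text(json.dumps(payload, indent=2))
    return path

# Round trip in a temporary directory so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    written = export_with_provenance(
        {"probe_type": "probe_a", "model_type": "mlp"},  # made-up config
        experiment_id=42,
        accuracy=0.91,
        path=Path(tmp_dir) / "outputs" / "best_configuration.json",
    )
    reloaded = json.loads(written.read_text())
print(reloaded["config"])
```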

## 8. Statistical Significance Tests (Optional)

Perform pairwise statistical tests to determine if performance differences are significant.

In [None]:
if SCIPY_AVAILABLE and len(filtered_df) > 0 and 'probe_type' in filtered_df.columns:
    # Get unique probes with at least 3 samples each
    probe_counts = filtered_df['probe_type'].value_counts()
    valid_probes = probe_counts[probe_counts >= 3].index.tolist()
    
    if len(valid_probes) >= 2:
        print("\n" + "="*90)
        print("PAIRWISE STATISTICAL TESTS (T-Tests)")
        print("="*90)
        print("\nComparing top-1 accuracy between probe types...\n")
        
        # Perform pairwise t-tests
        results = []
        for i, probe1 in enumerate(valid_probes):
            for probe2 in valid_probes[i+1:]:
                data1 = filtered_df[filtered_df['probe_type'] == probe1]['top_1_accuracy'].values
                data2 = filtered_df[filtered_df['probe_type'] == probe2]['top_1_accuracy'].values
                
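                # Welch's two-sample t-test (does not assume equal variances)
                t_stat, p_value = stats.ttest_ind(data1, data2, equal_var=False)
                
                # Effect size (Cohen's d) using the pooled sample standard deviation
                n1, n2 = len(data1), len(data2)
                pooled_var = ((n1 - 1) * np.var(data1, ddof=1) + (n2 - 1) * np.var(data2, ddof=1)) / (n1 + n2 - 2)
                pooled_std = np.sqrt(pooled_var)
                cohens_d = (np.mean(data1) - np.mean(data2)) / pooled_std if pooled_std > 0 else 0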
                
                significance = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else "ns"
                
                results.append({
                    'Probe 1': probe1,
                    'Probe 2': probe2,
                    'Mean Diff': f"{np.mean(data1) - np.mean(data2):.4f}",
                    'p-value': f"{p_value:.4f}",
                    'Significance': significance,
                    "Cohen's d": f"{cohens_d:.3f}"
                })
        
        results_df = pd.DataFrame(results)
        print(results_df.to_string(index=False))
        
        print("\n" + "="*90)
        print("Significance levels: *** p<0.001, ** p<0.01, * p<0.05, ns = not significant")
        print("Cohen's d effect size: |d|~0.2 (small), |d|~0.5 (medium), |d|>=0.8 (large)")
        print("="*90)
    else:
        print("\n⚠️  Insufficient data for statistical tests (need at least 3 samples per probe type).")
elif not SCIPY_AVAILABLE:
    print("\n⚠️  scipy not available - install with: pip install scipy")
else:
    print("\n⚠️  Insufficient data for statistical tests.")
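
With k probe types the cell above runs k(k-1)/2 tests, so a few p-values can fall below 0.05 by chance alone. The Holm-Bonferroni step-down adjustment is one common way to account for that. A dependency-free sketch follows; `holm_adjust` is a hypothetical helper and the p-values are invented.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Visit the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

raw_p = [0.01, 0.04, 0.03]  # made-up p-values from three pairwise tests
holm_p = holm_adjust(raw_p)
for raw, adj in zip(raw_p, holm_p):
    print(f"raw={raw:.3f}  adjusted={adj:.3f}")
```

Comparing the adjusted values against 0.05 keeps the chance of any false positive across the whole family of tests at or below 5%, at the cost of some power.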

## 9. Fidelity Gap Analysis (Optional)

If you've run both low-fidelity and high-fidelity experiments, analyze how well low-fidelity results predict high-fidelity performance.

In [None]:
# Check if cross-fidelity data exists
# This requires experiments with 'fidelity' tag in their names or metadata

low_fidelity = filtered_df[filtered_df['name'].str.contains('low_fidelity|quick_test', case=False, na=False)]
high_fidelity = filtered_df[filtered_df['name'].str.contains('high_fidelity|full', case=False, na=False)]

if len(low_fidelity) > 0 and len(high_fidelity) > 0:
    print("\n" + "="*90)
    print("FIDELITY GAP ANALYSIS")
    print("="*90)
    print(f"\nLow-Fidelity Experiments: {len(low_fidelity)}")
    print(f"High-Fidelity Experiments: {len(high_fidelity)}")
    
    # Compare statistics
    if 'top_1_accuracy' in low_fidelity.columns and 'top_1_accuracy' in high_fidelity.columns:
        low_mean = low_fidelity['top_1_accuracy'].mean()
        high_mean = high_fidelity['top_1_accuracy'].mean()
        gap = high_mean - low_mean
        
        print(f"\nMean Low-Fidelity Accuracy: {low_mean:.4f}")
        print(f"Mean High-Fidelity Accuracy: {high_mean:.4f}")
        print(f"Fidelity Gap: {gap:.4f} ({gap/high_mean*100:.2f}%)")
        
        if gap > 0:
            print("\n✓ Low-fidelity experiments underestimate performance (conservative)")
        else:
            print("\n⚠️  Low-fidelity experiments overestimate performance")
        
        # Rank correlation: pair the two fidelities by configuration rather than
        # by row position, which would compare unrelated experiments
        key_cols = [c for c in ['probe_type', 'model_type'] if c in filtered_df.columns]
        if SCIPY_AVAILABLE and key_cols:
            paired = pd.concat(
                [low_fidelity.groupby(key_cols)['top_1_accuracy'].mean(),
                 high_fidelity.groupby(key_cols)['top_1_accuracy'].mean()],
                axis=1, keys=['low', 'high']
            ).dropna()
            if len(paired) >= 3:
                correlation, p_value = stats.spearmanr(paired['low'], paired['high'])
                print(f"\nRanking Correlation (Spearman): rho={correlation:.4f}, p={p_value:.4f}")
                if correlation > 0.8:
                    print("✓ Strong correlation - low-fidelity is a good predictor")
                elif correlation > 0.5:
                    print("⚠️  Moderate correlation - use with caution")
                else:
                    print("✗ Weak correlation - low-fidelity is not reliable")
            else:
                print("\n⚠️  Fewer than 3 configurations appear at both fidelities - correlation skipped")
    
    print("="*90)
else:
    print("\n⚠️  No cross-fidelity data found for analysis.")
    print("   Run experiments with 'low_fidelity' and 'high_fidelity' tags to enable this analysis.")
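
Whether low-fidelity runs order configurations the same way as high-fidelity runs is a question about ranks, which Spearman's coefficient measures directly. The sketch below is dependency-free and assumes the two lists are already paired, so that position i in each refers to the same configuration. `average_ranks` and `spearman` are hypothetical helpers and the accuracies are invented.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5

low_acc = [0.60, 0.70, 0.65, 0.80]   # made-up low-fidelity accuracies
high_acc = [0.75, 0.88, 0.79, 0.93]  # same configurations at high fidelity
print(f"Spearman rho = {spearman(low_acc, high_acc):.3f}")
```

A value near 1 means the cheap runs would have picked the same winners even if their absolute accuracies are lower, which is what matters when low-fidelity results are used for screening.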

## 10. Final Recommendations

Generate actionable recommendations based on the analysis.

In [None]:
if len(filtered_df) > 0 and 'top_1_accuracy' in filtered_df.columns:
    print("\n" + "="*90)
    print("FINAL RECOMMENDATIONS")
    print("="*90)
    
    # Best probe
    if 'probe_type' in filtered_df.columns:
        probe_means = filtered_df.groupby('probe_type')['top_1_accuracy'].mean().sort_values(ascending=False)
        best_probe = probe_means.index[0]
        print(f"\n1. **Best Probe Design**: {best_probe}")
        print(f"   - Mean Accuracy: {probe_means.iloc[0]:.4f}")
        print(f"   - Advantage over lowest-ranked probe: {(probe_means.iloc[0] - probe_means.iloc[-1])*100:.2f} percentage points")
    
    # Best model
    if 'model_type' in filtered_df.columns:
        model_means = filtered_df.groupby('model_type')['top_1_accuracy'].mean().sort_values(ascending=False)
        best_model = model_means.index[0]
        print(f"\n2. **Best Model Architecture**: {best_model}")
        print(f"   - Mean Accuracy: {model_means.iloc[0]:.4f}")
    
    # Training efficiency
    # Accuracy per minute; zero training times become NaN to avoid division by zero
    efficiency = filtered_df['top_1_accuracy'] / (filtered_df['training_time_seconds'].replace(0, np.nan) / 60)
    most_efficient_idx = efficiency.idxmax()
    most_efficient = filtered_df.loc[most_efficient_idx]
    print(f"\n3. **Most Efficient Configuration**: {most_efficient['name']}")
    if 'probe_type' in most_efficient:
        print(f"   - Probe: {most_efficient['probe_type']}")
    if 'model_type' in most_efficient:
        print(f"   - Model: {most_efficient['model_type']}")
    print(f"   - Efficiency: {efficiency.loc[most_efficient_idx]:.4f} acc/min")
    
    # Performance variability
    if 'probe_type' in filtered_df.columns:
        probe_stds = filtered_df.groupby('probe_type')['top_1_accuracy'].std().sort_values()
        most_stable = probe_stds.index[0]
        print(f"\n4. **Most Stable Probe** (lowest variance): {most_stable}")
        print(f"   - Std Dev: {probe_stds.iloc[0]:.4f}")
        print(f"   - Good choice for production deployment")
    
    # Overall winner
    best_overall = filtered_df.nlargest(1, 'top_1_accuracy').iloc[0]
    print(f"\n5. **Overall Winner** (highest accuracy): {best_overall['name']}")
    if 'probe_type' in best_overall:
        print(f"   - Probe: {best_overall['probe_type']}")
    if 'model_type' in best_overall:
        print(f"   - Model: {best_overall['model_type']}")
    print(f"   - Top-1 Accuracy: {best_overall['top_1_accuracy']:.4f}")
    print(f"   - Recommended for maximum performance")
    
    print("\n" + "="*90)
    print("\n📊 Analysis complete! Results saved to outputs/ directory.")
    print("📄 Best configuration exported to: outputs/best_configuration.json")
    print("📈 Plots available in: outputs/")
    print("\nNext steps:")
    print("  - Deploy best configuration for production")
    print("  - Run additional seeds to confirm results")
    print("  - Test on real-world data if using synthetic")
    print("  - Consider ensemble methods combining top performers")
    print("="*90)
else:
    print("\n⚠️  No data available for generating recommendations.")
    print("   Run some experiments first using 01_quickstart.ipynb or 03_run_search.ipynb")

## Summary

This notebook provided comprehensive analysis tools for your RIS experiments:

✓ **Database Querying** - Loaded and filtered experiment results

✓ **Statistical Analysis** - Computed means, standard deviations, and medians

✓ **Comparison Plots** - Visualized probe, model, and Pareto front comparisons

✓ **Best Configuration Export** - Saved optimal settings for deployment

✓ **Significance Testing** - Tested whether performance differences are statistically significant (if scipy is available)

✓ **Recommendations** - Generated actionable insights

### Key Outputs:
- `outputs/probe_comparison_detailed.png` - Probe performance boxplots
- `outputs/model_comparison.png` - Model architecture comparison
- `outputs/pareto_front.png` - Accuracy vs time trade-offs
- `outputs/best_configuration.json` - Best performing configuration

### Further Analysis:
For more advanced analyses, you can:
- Query the SQLite database directly for custom analyses
- Use the `ResultAnalyzer` class for programmatic access
- Generate LaTeX tables with `ReportGenerator`
- Perform cross-validation studies
- Analyze hyperparameter sensitivity

### Questions?
- Check `docs/analysis_guide.md` for detailed documentation
- See `examples/advanced_analysis.py` for code examples
- Open an issue on GitHub for support

Thank you for using the RIS Auto-Research Engine! 🎉