# Intent2Model Research Experiments

This notebook contains experiments and analysis for the Intent2Model system.

## Research Claims

1. **Usability**: Improves ML usability for non-experts
2. **Safety**: Reduces incorrect pipeline configuration
3. **Intent Alignment**: Aligns evaluation metric with real user intent

The focus is intent alignment and safety, NOT raw accuracy.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys

# Add the backend package (two levels above the working directory) to the import path
sys.path.insert(0, str(Path.cwd().parent.parent / "backend"))

from research.user_simulation import simulate_user_interaction, run_simulation_experiment
from research.ablation_tests import run_ablation_suite, AblationConfig, analyze_ablation_results

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 1. User Simulation Experiments

Simulate different user types (novice, intermediate, expert) and measure:
- Number of questions asked
- Mistakes detected and corrected
- Success rate
- Performance metrics
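
The measurements listed above could be summarized per user type roughly as follows. This is a minimal sketch, not the output format of `run_simulation_experiment`: it assumes the results arrive as a DataFrame with `user_type`, `success`, and `questions_asked` columns (the names used by the plotting helpers later in this notebook), and the rows in `toy_results` are made up purely to show the shape.

```python
import pandas as pd


def summarize_by_user_type(results_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-run simulation results into one row per user type.

    Assumes columns 'user_type', 'success' (bool) and 'questions_asked' (int);
    the real result schema may differ.
    """
    return results_df.groupby("user_type").agg(
        runs=("success", "size"),                   # number of simulated runs
        success_rate=("success", "mean"),           # fraction of successful runs
        avg_questions=("questions_asked", "mean"),  # mean questions per run
    )


# Illustrative rows only, not experimental data.
toy_results = pd.DataFrame({
    "user_type": ["novice", "novice", "expert", "expert"],
    "success": [True, False, True, True],
    "questions_asked": [4, 6, 1, 3],
})

summary = summarize_by_user_type(toy_results)
print(summary)
```

Grouping first keeps the three user types comparable even when they have different numbers of runs.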

In [None]:
# Example: Create sample datasets for testing
# In practice, load real datasets

def create_sample_classification_dataset():
    """Create a sample classification dataset."""
    np.random.seed(42)
    n_samples = 200
    X1 = np.random.randn(n_samples)
    X2 = np.random.randn(n_samples)
    X3 = np.random.choice(['A', 'B', 'C'], n_samples)
    y = (X1 + X2 > 0).astype(int)
    df = pd.DataFrame({
        'feature1': X1,
        'feature2': X2,
        'feature3': X3,
        'target': y
    })
    return df, 'target', 'classification'

def create_sample_regression_dataset():
    """Create a sample regression dataset."""
    np.random.seed(42)
    n_samples = 200
    X1 = np.random.randn(n_samples)
    X2 = np.random.randn(n_samples)
    X3 = np.random.choice(['A', 'B', 'C'], n_samples)
    y = 2 * X1 + 3 * X2 + np.random.randn(n_samples) * 0.5
    df = pd.DataFrame({
        'feature1': X1,
        'feature2': X2,
        'feature3': X3,
        'target': y
    })
    return df, 'target', 'regression'

# Run simulation experiment
datasets = [
    create_sample_classification_dataset(),
    create_sample_regression_dataset()
]

# Note: this may take a while when an LLM is in the loop
# simulation_results = run_simulation_experiment(
#     datasets, 
#     user_types=["novice", "intermediate", "expert"],
#     n_runs=5
# )

## 2. Ablation Tests

Test the impact of each component:
- Questioning agent
- Evaluator warnings
- LLM planning
- Explainer agent
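
One common way to isolate each component's impact is a leave-one-out design: a full configuration plus one variant per disabled component. The sketch below illustrates that idea only. `ComponentFlags` and `leave_one_out_configs` are hypothetical names introduced here; they are not the `AblationConfig` class or the `run_ablation_suite` logic from `research.ablation_tests`, whose signatures are not shown in this notebook.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ComponentFlags:
    """Hypothetical on/off switches for the four components listed above."""
    questioning_agent: bool = True
    evaluator_warnings: bool = True
    llm_planning: bool = True
    explainer_agent: bool = True


def leave_one_out_configs() -> dict:
    """Return the full configuration plus one variant per disabled component."""
    full = ComponentFlags()
    configs = {"full": full}
    for field in fields(ComponentFlags):
        # Copy the full configuration with exactly one component switched off.
        configs[f"no_{field.name}"] = replace(full, **{field.name: False})
    return configs


configs = leave_one_out_configs()
for name, flags in configs.items():
    print(name, flags)
```

Comparing each `no_*` variant against `full` attributes any change in outcome to a single component, at the cost of ignoring interactions between components.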

In [None]:
# Run ablation tests
# ablation_results = run_ablation_suite(datasets)

# Analyze results
# analysis = analyze_ablation_results(ablation_results)
# print("Ablation Analysis:")
# print(analysis)

## 3. Visualization

Plot success rates, time to model, and metric alignment.

In [None]:
# Example plotting functions (uncomment when you have results)

# def plot_success_rate_by_user_type(results_df):
#     """Plot success rate by user type."""
#     success_by_type = results_df.groupby('user_type')['success'].mean()
#     success_by_type.plot(kind='bar', title='Success Rate by User Type')
#     plt.ylabel('Success Rate')
#     plt.xlabel('User Type')
#     plt.show()

# def plot_questions_asked(results_df):
#     """Plot distribution of questions asked."""
#     results_df['questions_asked'].hist(bins=10)
#     plt.title('Distribution of Questions Asked')
#     plt.xlabel('Number of Questions')
#     plt.ylabel('Frequency')
#     plt.show()

# def plot_ablation_comparison(ablation_results):
#     """Compare ablation configurations."""
#     ablation_groups = ablation_results.groupby('ablation_id')
#     success_rates = ablation_groups['training_success'].mean()
#     success_rates.plot(kind='bar', title='Training Success Rate by Configuration')
#     plt.ylabel('Success Rate')
#     plt.xlabel('Ablation Configuration')
#     plt.show()

## 4. Metrics Analysis

Key metrics to track:
- **Success Rate**: Percentage of successful pipeline configurations
- **Time to Model**: Time from dataset upload to trained model
- **Metric Alignment**: Whether chosen metric matches user intent
- **Mistake Recovery**: Ability to detect and correct user mistakes
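
Mistake recovery can be defined in more than one way. The sketch below uses one possible definition, stated as an assumption: the success rate among runs in which the simulated user made at least one mistake. It assumes a `mistakes_made` count column and a boolean `success` column; the rows in `toy_runs` are illustrative only.

```python
import math

import pandas as pd


def mistake_recovery_rate(results_df: pd.DataFrame) -> float:
    """Success rate restricted to runs where at least one mistake was made.

    Assumed definition; returns NaN when no run contains a mistake,
    since the rate is undefined in that case.
    """
    with_mistakes = results_df[results_df["mistakes_made"] > 0]
    if with_mistakes.empty:
        return float("nan")
    return float(with_mistakes["success"].mean())


# Illustrative rows only, not experimental data.
toy_runs = pd.DataFrame({
    "mistakes_made": [0, 2, 1, 0, 3],
    "success": [True, True, False, True, True],
})

rate = mistake_recovery_rate(toy_runs)
print(f"mistake_recovery_rate: {rate:.3f}")
```

Restricting the denominator to runs with mistakes keeps error-free runs from inflating the figure.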

In [None]:
# Example metrics calculation (uncomment when you have results)

# def calculate_metrics(results_df):
#     """Calculate key research metrics."""
#     metrics = {
#         'overall_success_rate': results_df['success'].mean(),
#         'avg_questions': results_df['questions_asked'].mean(),
#         'mistake_recovery_rate': ...  # Calculate based on mistakes_made vs success
#     }
#     return metrics

# metrics = calculate_metrics(simulation_results)
# print("Research Metrics:")
# for key, value in metrics.items():
#     print(f"{key}: {value:.3f}")