# Benchmarking Design Critique with UICrit

This notebook shows how to use the **UICrit** benchmark dataset with CRIT.

## About UICrit

**UICrit** (UIST 2024, Google Research) contains:
- 11,344 design critiques
- 1,000 mobile UIs from the RICO dataset
- Expert human critiques from 7 experienced designers
- LLM-generated critiques for comparison
- Quality ratings (aesthetics, usability, learnability, efficiency, overall design quality)

## Why Use UICrit?

- **Compare to Experts**: See how closely your critiques match those of professional designers
- **Quality Benchmarking**: Validate against quality ratings
- **Large Scale**: Test on 1,000 real mobile UIs
- **LLM Baseline**: Compare against existing LLM critiques

## Setup

In [None]:
# 1. Add repository root to path to import modules
import sys
sys.path.append('../../../../')  # Go up to repo root from notebooks/crit/

# 2. Add local code directory for CRIT-specific code
sys.path.append('../../code')  # Add multi-agent/code for crit module

# 3. Import CRIT benchmark loading functions
from crit import (
    # Benchmark loaders
    load_uicrit,                  # Load UICrit dataset
    load_uicrit_for_comparison,   # Load UICrit with comparison data
    print_benchmark_info,          # Print info about available benchmarks
    compare_to_experts,            # Compare critique to expert critiques
    
    # Critique strategies
    run_critique_strategy,         # Run any critique strategy
    single_critic_strategy,        # Single critic
    multi_perspective_critique,    # Multi-perspective
    
    # Evaluation
    evaluate_critique,             # Evaluate critique quality
)

## View Benchmark Info

In [None]:
# 1. Print information about available benchmarks
# This shows which benchmarks are available and how to install them
print_benchmark_info()

## Load UICrit Dataset

First, clone the repository:
```bash
git clone https://github.com/google-research-datasets/uicrit.git
```

The CSV will be at: `./uicrit/uicrit_public.csv`
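
Before calling the loader, it can help to confirm that the CSV is where the notebook expects it. The sketch below uses only the standard library; `preview_csv` is a hypothetical helper (not part of CRIT), and it makes no assumption about the column names in the file.

```python
import csv
from pathlib import Path

def preview_csv(path, n_rows=3):
    """Return (column_names, first n_rows rows), or None if the file is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        return None
    with csv_path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # zip stops after n_rows, so the rest of the file is never read
        rows = [row for _, row in zip(range(n_rows), reader)]
        return reader.fieldnames, rows

preview = preview_csv("./uicrit/uicrit_public.csv")
if preview is None:
    print("CSV not found; clone the repository first.")
else:
    columns, rows = preview
    print("Columns:", columns)
    print(f"Previewed {len(rows)} rows")
```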

In [None]:
# 1. Attempt to load the UICrit dataset
try:
    # 2. Load dataset with specified parameters
    uicrit = load_uicrit(
        min_quality_rating=None,    # No filtering by quality
        include_llm_critiques=True  # Include LLM-generated critiques for comparison
    )
    
    # 3. Print dataset summary
    print(f"Loaded UICrit Dataset")
    print(f"Source: {uicrit.source}")
    
    # 4. Print statistics
    print(f"\nStatistics:")
    print(f"  UI Screens: {uicrit.metadata['ui_screens']}")
    print(f"  Total Critiques: {uicrit.metadata['total_critiques']}")
    print(f"  License: {uicrit.metadata['license']}")
    
    # 5. Show a sample problem
    print(f"\nSample UI Problem:")
    sample = uicrit.problems[0]
    print(f"RICO ID: {sample.metadata['rico_id']}")
    print(f"Task: {sample.metadata['task']}")
    print(f"Avg Quality: {sample.metadata['avg_quality']:.1f}/10")
    print(f"Num Critiques: {sample.metadata['num_critiques']}")
    
# 6. Handle case where UICrit is not installed
except FileNotFoundError as e:
    print(f"Error: {e}")
    print("\nPlease install UICrit first:")
    print("git clone https://github.com/google-research-datasets/uicrit.git")

## Example 1: Run Your Critique Strategy on UICrit UIs

In [None]:
# Select a sample UI to critique
try:
    sample_problem = uicrit.problems[0]
    
    print(f"Critiquing UI: {sample_problem.name}")
    print(f"Task: {sample_problem.metadata['task']}")
    print(f"\nDesign Problem:")
    print(sample_problem.current_design)
    
    # Run your critique strategy
    print(f"\nRunning multi-perspective critique...")
    your_critique = run_critique_strategy(
        "multi_perspective",
        sample_problem,
        provider="ollama",
        synthesize=True,
        temperature=0.3
    )
    
    print(f"\nYOUR CRITIQUE:")
    print("="*60)
    print(your_critique.synthesis)
    print(f"\nRecommendations ({len(your_critique.recommendations)}):")
    for i, rec in enumerate(your_critique.recommendations, 1):
        print(f"{i}. {rec}")
    
except NameError:
    print("UICrit not loaded. Run the previous cell first.")

## Example 2: Compare to Expert Critiques

In [None]:
# Load UICrit for comparison
try:
    comparison_data = load_uicrit_for_comparison(
        sample_size=10  # Just 10 UIs for demo
    )
    
    print(f"Loaded {len(comparison_data)} UIs for comparison")
    
    # Show expert critiques for first UI
    first_ui = comparison_data[0]
    print(f"\nUI: {first_ui['task']}")
    print(f"Quality Rating: {first_ui['quality_ratings']['design_quality']:.1f}/10")
    print(f"\nExpert Critiques ({len(first_ui['expert_critiques'])}):")
    for i, critique in enumerate(first_ui['expert_critiques'][:3], 1):
        print(f"{i}. {critique}")
    
    if first_ui['llm_critiques']:
        print(f"\nLLM Critiques ({len(first_ui['llm_critiques'])}):")
        for i, critique in enumerate(first_ui['llm_critiques'][:2], 1):
            print(f"{i}. {critique}")
    
except Exception as e:
    print(f"Error: {e}")

## Example 3: Systematic Comparison

In [None]:
# Run your strategy on multiple UIs and compare to experts
try:
    results = []
    
    # Test on up to the first 5 UIs
    subset = comparison_data[:5]
    for i, item in enumerate(subset, 1):
        print(f"\nProcessing UI {i}/{len(subset)}: {item['task']}")
        
        # Run your critique
        your_critique = run_critique_strategy(
            "single",  # Use single for faster testing
            item['problem'],
            provider="ollama",
            temperature=0.3
        )
        
        # Extract the recommendations to compare against the experts
        your_recommendations = your_critique.recommendations
        
        # Compare to expert critiques
        comparison = compare_to_experts(
            your_recommendations,
            item['expert_critiques'],
            method="overlap"
        )
        
        results.append({
            'rico_id': item['rico_id'],
            'task': item['task'],
            'quality': item['quality_ratings']['design_quality'],
            'your_critique': your_critique,
            'expert_critiques': item['expert_critiques'],
            'comparison': comparison
        })
        
        print(f"  Overlap with experts: {comparison['overlap_score']:.2%}")
    
    # Aggregate results
    avg_overlap = sum(r['comparison']['overlap_score'] for r in results) / len(results)
    print(f"\n{'='*60}")
    print(f"Average Overlap with Experts: {avg_overlap:.2%}")
    print(f"{'='*60}")
    
except NameError:
    print("Comparison data not loaded. Run the previous cell first.")
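
To build intuition for what an overlap score can measure, here is a minimal sketch of one possible definition: Jaccard similarity between the pooled word sets of the two critique lists. This is an assumption for illustration only; `jaccard_overlap` and `tokenize` are hypothetical names, and the actual `compare_to_experts(..., method="overlap")` implementation may compute something different.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_overlap(recommendations, expert_critiques):
    """Jaccard similarity between the pooled vocabularies of two lists of strings."""
    ours = set().union(*(tokenize(r) for r in recommendations))
    theirs = set().union(*(tokenize(c) for c in expert_critiques))
    if not ours and not theirs:
        return 0.0  # nothing to compare
    return len(ours & theirs) / len(ours | theirs)

score = jaccard_overlap(
    ["Increase button contrast"],
    ["The button contrast is too low"],
)
print(f"Overlap: {score:.2%}")
```

A purely lexical score like this penalizes paraphrases, so it is best read as a rough lower bound on agreement with the experts.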

## Example 4: Evaluate Critique Quality

In [None]:
# Evaluate your critique quality using CRIT metrics
try:
    if results:
        sample_result = results[0]
        
        print(f"Evaluating critique for: {sample_result['task']}\n")
        
        evaluation = evaluate_critique(
            sample_result['your_critique'].metadata['problem'],
            sample_result['your_critique'],
            method="combined",
            judge_provider="ollama"
        )
        
        print("Evaluation Results:")
        print(f"  Coverage Score: {evaluation['coverage']['overall_coverage']:.2f}")
        print(f"  Quality Score: {evaluation['quality']['overall_quality']:.2f}")
        print(f"  Depth Score: {evaluation['depth']['depth_score']:.2f}")
        print(f"  Combined Score: {evaluation['combined_score']:.2f}")
        
        print(f"\nQuality Breakdown:")
        print(f"  Specificity: {evaluation['quality']['specificity']:.2f}")
        print(f"  Actionability: {evaluation['quality']['actionability']:.2f}")
        print(f"  Relevance: {evaluation['quality']['relevance']:.2f}")
        print(f"  Feasibility: {evaluation['quality']['feasibility']:.2f}")
        
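except NameError:
    print("No results to evaluate yet. Run Example 3 first.")
except Exception as e:
    # Surface real failures instead of reporting them as missing results
    print(f"Evaluation failed: {e}")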

## Example 5: Compare Your Strategies Against Expert Baseline

In [None]:
# Test multiple strategies and compare to experts
try:
    test_ui = comparison_data[0]
    
    strategies_to_test = {
        "single": {},
        "multi_perspective": {"synthesize": True},
    }
    
    strategy_results = {}
    
    print(f"Testing strategies on: {test_ui['task']}\n")
    
    for strategy_name, kwargs in strategies_to_test.items():
        print(f"Running {strategy_name}...")
        
        critique = run_critique_strategy(
            strategy_name,
            test_ui['problem'],
            provider="ollama",
            temperature=0.3,
            **kwargs
        )
        
        # Compare to experts
        comparison = compare_to_experts(
            critique.recommendations,
            test_ui['expert_critiques'],
            method="coverage"
        )
        
        strategy_results[strategy_name] = {
            'critique': critique,
            'expert_overlap': comparison['coverage_score']
        }
        
        print(f"  Expert Coverage: {comparison['coverage_score']:.2%}")
        print(f"  Latency: {critique.latency_s:.2f}s")
        print()
    
    # Summary
    print("\nStrategy Comparison:")
    print(f"{'Strategy':<20} {'Expert Coverage':<20} {'Recommendations':<20}")
    print("="*60)
    for name, data in strategy_results.items():
        print(f"{name:<20} {data['expert_overlap']:<20.2%} {len(data['critique'].recommendations):<20}")
    
except Exception as e:
    print(f"Error: {e}")
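
Coverage differs from overlap in that it is asymmetric: it asks how many of the expert critiques your recommendations touch on, regardless of how much extra material you produce. The sketch below is one hypothetical way to compute such a score with token matching; `expert_coverage` and its `threshold` parameter are assumptions for illustration and are not taken from the CRIT implementation of `method="coverage"`.

```python
import re

def _tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def expert_coverage(recommendations, expert_critiques, threshold=0.3):
    """Fraction of expert critiques matched by at least one recommendation.

    A critique counts as matched when some recommendation contains at least
    `threshold` of the critique's distinct tokens.
    """
    if not expert_critiques:
        return 0.0
    rec_tokens = [_tokens(r) for r in recommendations]
    matched = 0
    for critique in expert_critiques:
        crit = _tokens(critique)
        if crit and any(len(crit & rec) / len(crit) >= threshold for rec in rec_tokens):
            matched += 1
    return matched / len(expert_critiques)

coverage = expert_coverage(
    ["Increase the contrast of the submit button"],
    ["Button contrast is too low", "Text is too small to read"],
)
print(f"Expert coverage: {coverage:.2%}")
```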

## Integration with Experiment Tracking

In [None]:
from harness import get_tracker, ExperimentConfig, ExperimentResult

def track_uicrit_evaluation(problems, strategy="multi_perspective", provider="ollama"):
    """Run and track UICrit evaluation"""
    
    # Start tracking
    tracker = get_tracker()
    experiment_dir = tracker.start_experiment(ExperimentConfig(
        experiment_name=f"uicrit_{strategy}",
        strategy=strategy,
        provider=provider,
        metadata={
            "project": "crit",
            "benchmark": "UICrit",
            "problem_count": len(problems)
        }
    ))
    
    # Run critiques
    for i, problem in enumerate(problems, 1):
        print(f"Processing {i}/{len(problems)}: {problem.metadata['task']}")
        
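        # Pass synthesize only to the strategy that uses it, rather than sending None
        extra_kwargs = {"synthesize": True} if strategy == "multi_perspective" else {}
        critique = run_critique_strategy(
            strategy,
            problem,
            provider=provider,
            **extra_kwargs
        )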
        
        # Evaluate
        evaluation = evaluate_critique(problem, critique, method="combined")
        
        # Log
        tracker.log_result(ExperimentResult(
            task_input=problem.to_critique_prompt(),
            output=critique.synthesis or "",
            strategy_name=strategy,
            latency_s=critique.latency_s,
            tokens_in=critique.total_tokens_in,
            tokens_out=critique.total_tokens_out,
            cost_usd=critique.total_cost_usd,
            metadata={
                "rico_id": problem.metadata['rico_id'],
                "task": problem.metadata['task'],
                "quality_rating": problem.metadata['avg_quality'],
                "combined_score": evaluation.get('combined_score', 0),
                "recommendations": critique.recommendations
            }
        ))
    
    summary = tracker.finish_experiment()
    print(f"\nResults saved to: {experiment_dir}")
    
    return summary

# Example usage:
# summary = track_uicrit_evaluation(uicrit.problems[:10], strategy="multi_perspective")

## Key Insights from UICrit

The UICrit paper reports a **55% performance gain** in LLM-generated UI feedback from techniques such as:
- Few-shot prompting with expert examples
- Visual prompting (when using multimodal models)
- Structured output formats

### Things to Try

1. **Few-Shot Learning**: Include expert critiques as examples
2. **Multi-Perspective**: Use different expert viewpoints
3. **Structured Output**: Ask for specific critique categories
4. **Quality Correlation**: Test if low-rated UIs get more/better critiques
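
The few-shot idea in the list above can be sketched as a small prompt builder. This is an illustrative assumption, not the prompt format used in the UICrit paper or in CRIT: `build_few_shot_prompt` is a hypothetical helper, and the example tasks and critiques below are made up.

```python
def build_few_shot_prompt(task, examples, max_examples=3):
    """Assemble a critique prompt that shows expert critiques before the target UI.

    `examples` is a list of (task_description, [expert_critique, ...]) pairs.
    """
    parts = [
        "You are an experienced UI designer. "
        "Critique the final UI in the same style as the examples."
    ]
    for i, (ex_task, critiques) in enumerate(examples[:max_examples], 1):
        bullet_list = "\n".join(f"- {c}" for c in critiques)
        parts.append(f"Example {i}\nTask: {ex_task}\nExpert critiques:\n{bullet_list}")
    parts.append(f"Now critique this UI\nTask: {task}\nCritiques:")
    return "\n\n".join(parts)

# Made-up examples; in practice these would come from the expert critiques in UICrit
demo_examples = [
    ("Log in to the app", ["The login button has low contrast."]),
    ("Browse a product list", ["List items are too tightly spaced."]),
    ("Edit profile settings", ["Labels are not aligned with their fields."]),
]
prompt = build_few_shot_prompt("Check out a shopping cart", demo_examples, max_examples=2)
print(prompt)
```

Keep the few-shot examples disjoint from the UIs you evaluate on, otherwise the comparison to experts is contaminated.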

## Next Steps

1. **Full Evaluation**: Run on all 1,000 UIs
2. **Compare Strategies**: Test which approach best matches experts
3. **Quality Analysis**: Correlate your critiques with quality ratings
4. **Fine-Tuning**: Train on expert critiques to improve performance
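
For the quality analysis step, one simple starting point is the Pearson correlation between each UI's quality rating and the number of recommendations your strategy produced. The sketch below is standard-library only; the `records` values are made-up placeholders, not results from UICrit, and the field names are assumptions loosely modeled on the `results` entries built earlier in this notebook.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Placeholder records for illustration only (not real measurements)
records = [
    {"quality": 3.0, "num_recommendations": 6},
    {"quality": 5.0, "num_recommendations": 5},
    {"quality": 7.0, "num_recommendations": 3},
    {"quality": 9.0, "num_recommendations": 2},
]
r = pearson(
    [rec["quality"] for rec in records],
    [rec["num_recommendations"] for rec in records],
)
print(f"Pearson r between quality and recommendation count: {r:.2f}")
```

A negative value would suggest that lower-rated UIs attract more recommendations; with only a handful of UIs the estimate is noisy, so treat it as exploratory.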

## References

- UICrit: Duan et al., UIST 2024
- GitHub: https://github.com/google-research-datasets/uicrit
- Paper: https://arxiv.org/abs/2407.08850