# Collective Design Critique Experiments

This notebook demonstrates how to use collective design critique, in which multiple reasoning agents critique the same design, to tackle challenging design problems.

## Research Question

**Can collective design critique from multiple perspectives produce better solutions than a single critic?**

We're interested in:
1. **Coverage**: Do multiple perspectives catch more issues?
2. **Quality**: Are recommendations more specific and actionable?
3. **Synthesis**: Can we effectively combine diverse viewpoints?
4. **Iteration**: Does iterative refinement improve designs?
5. **Adversarial Testing**: Does challenge/response find edge cases?

## Setup

In [None]:
import sys
sys.path.append('../../code')

from crit import (
    # Pre-defined problems
    MOBILE_CHECKOUT,
    DASHBOARD_LAYOUT,
    REST_API_VERSIONING,
    GRAPHQL_SCHEMA,
    MICROSERVICES_SPLIT,
    CACHING_STRATEGY,
    PERMISSION_MODEL,
    APPROVAL_WORKFLOW,
    
    # Strategies
    run_critique_strategy,
    single_critic_strategy,
    multi_perspective_critique,
    iterative_critique,
    adversarial_critique,
    
    # Types
    DesignDomain,
    CritiquePerspective,
    ALL_PROBLEMS,
    get_problems_by_domain,
    get_problems_by_difficulty,
    
    # Evaluation
    evaluate_critique,
    compare_strategies,
    batch_evaluate,
)

import json
from pprint import pprint

## Example 1: Single Critic Baseline

Let's start with a baseline: a single critic reviewing a mobile checkout flow.

In [None]:
# First, let's look at the design problem
print("DESIGN PROBLEM:")
print(f"Name: {MOBILE_CHECKOUT.name}")
print(f"Domain: {MOBILE_CHECKOUT.domain.value}")
print(f"Difficulty: {MOBILE_CHECKOUT.difficulty}")
print(f"\nDescription: {MOBILE_CHECKOUT.description}")
print(f"\nContext:\n{MOBILE_CHECKOUT.context}")
print(f"\nCurrent Design:\n{MOBILE_CHECKOUT.current_design}")
print("\nKnown Issues:")
for issue in MOBILE_CHECKOUT.known_issues:
    print(f"  - {issue}")

In [None]:
# Run single critic strategy
single_result = run_critique_strategy(
    "single",
    MOBILE_CHECKOUT,
    provider="ollama",
    model=None,  # Uses default
    temperature=0.3
)

print("SINGLE CRITIC FEEDBACK:")
print(single_result.synthesis)
print("\n" + "="*60)
print(f"\nRecommendations ({len(single_result.recommendations)}):")
for i, rec in enumerate(single_result.recommendations, 1):
    print(f"{i}. {rec}")
print(f"\nLatency: {single_result.latency_s:.2f}s")
print(f"Cost: ${single_result.total_cost_usd:.4f}")

## Example 2: Multi-Perspective Critique

Now let's get feedback from multiple expert perspectives.

In [None]:
# Run multi-perspective critique
multi_result = run_critique_strategy(
    "multi_perspective",
    MOBILE_CHECKOUT,
    provider="ollama",
    synthesize=True,  # Combine perspectives into unified feedback
    temperature=0.3
)

print("MULTI-PERSPECTIVE CRITIQUES:")
print("\nIndividual Perspectives:")
for critique in multi_result.critiques:
    print(f"\n{'='*60}")
    print(f"PERSPECTIVE: {critique['perspective'].upper()}")
    print(f"{'='*60}")
    print(critique['critique'])

if multi_result.synthesis:
    print("\n" + "="*60)
    print("SYNTHESIZED FEEDBACK:")
    print("="*60)
    print(multi_result.synthesis)

print(f"\n{'='*60}")
print(f"Recommendations ({len(multi_result.recommendations)}):")
for i, rec in enumerate(multi_result.recommendations, 1):
    print(f"{i}. {rec}")

print(f"\nLatency: {multi_result.latency_s:.2f}s")
print(f"Cost: ${multi_result.total_cost_usd:.4f}")

## Example 3: Compare Single vs Multi-Perspective

Let's evaluate and compare the two approaches.

In [None]:
# Evaluate both approaches
single_eval = evaluate_critique(
    MOBILE_CHECKOUT,
    single_result,
    method="combined",
    judge_provider="ollama"
)

multi_eval = evaluate_critique(
    MOBILE_CHECKOUT,
    multi_result,
    method="combined",
    judge_provider="ollama"
)

print("EVALUATION COMPARISON:\n")
print(f"{'Metric':<25} {'Single':<15} {'Multi-Perspective':<15}")
print("="*55)

# Coverage
print(f"{'Coverage Score':<25} {single_eval['coverage']['overall_coverage']:<15.2f} {multi_eval['coverage']['overall_coverage']:<15.2f}")
print(f"{'Known Issues Found':<25} {single_eval['coverage']['known_issues_mentioned']:<15} {multi_eval['coverage']['known_issues_mentioned']:<15}")
print(f"{'Criteria Addressed':<25} {single_eval['coverage']['criteria_mentioned']:<15} {multi_eval['coverage']['criteria_mentioned']:<15}")

# Quality
print(f"\n{'Quality Score':<25} {single_eval['quality']['overall_quality']:<15.2f} {multi_eval['quality']['overall_quality']:<15.2f}")
print(f"{'Specificity':<25} {single_eval['quality']['specificity']:<15.2f} {multi_eval['quality']['specificity']:<15.2f}")
print(f"{'Actionability':<25} {single_eval['quality']['actionability']:<15.2f} {multi_eval['quality']['actionability']:<15.2f}")
print(f"{'Relevance':<25} {single_eval['quality']['relevance']:<15.2f} {multi_eval['quality']['relevance']:<15.2f}")

# Depth
print(f"\n{'Depth Score':<25} {single_eval['depth']['depth_score']:<15.2f} {multi_eval['depth']['depth_score']:<15.2f}")
print(f"{'Perspectives Used':<25} {single_eval['depth']['critique_count']:<15} {multi_eval['depth']['critique_count']:<15}")
print(f"{'Recommendations':<25} {single_eval['depth']['recommendations_count']:<15} {multi_eval['depth']['recommendations_count']:<15}")

# Overall
print(f"\n{'='*55}")
print(f"{'COMBINED SCORE':<25} {single_eval['combined_score']:<15.2f} {multi_eval['combined_score']:<15.2f}")

# Performance
print(f"\n{'Latency (s)':<25} {single_result.latency_s:<15.2f} {multi_result.latency_s:<15.2f}")
print(f"{'Cost (USD)':<25} ${single_result.total_cost_usd:<14.4f} ${multi_result.total_cost_usd:<14.4f}")
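
The side-by-side table can also be condensed into per-metric deltas. The sketch below assumes the evaluation dicts have the nested shape used in the cell above (`coverage.overall_coverage`, `quality.overall_quality`, `depth.depth_score`, `combined_score`). `lookup` and `metric_deltas` are hypothetical helpers, not part of `crit`, and the numbers are illustrative rather than measured.

```python
def lookup(nested, path):
    """Follow a sequence of keys into a nested dict."""
    for key in path:
        nested = nested[key]
    return nested


def metric_deltas(baseline, candidate, paths):
    """Map each label to (baseline value, candidate value, candidate - baseline)."""
    deltas = {}
    for label, path in paths.items():
        base_value = lookup(baseline, path)
        cand_value = lookup(candidate, path)
        deltas[label] = (base_value, cand_value, cand_value - base_value)
    return deltas


# Key paths taken from the comparison cell above.
METRIC_PATHS = {
    "coverage": ("coverage", "overall_coverage"),
    "quality": ("quality", "overall_quality"),
    "depth": ("depth", "depth_score"),
    "combined": ("combined_score",),
}

# Illustrative numbers only, not measured results.
example_single = {
    "coverage": {"overall_coverage": 0.5},
    "quality": {"overall_quality": 0.5},
    "depth": {"depth_score": 0.25},
    "combined_score": 0.5,
}
example_multi = {
    "coverage": {"overall_coverage": 0.75},
    "quality": {"overall_quality": 0.625},
    "depth": {"depth_score": 0.5},
    "combined_score": 0.6875,
}

deltas = metric_deltas(example_single, example_multi, METRIC_PATHS)
for label, (base_value, cand_value, delta) in deltas.items():
    print(f"{label:<10} {base_value:.3f} -> {cand_value:.3f} ({delta:+.3f})")
```

With real runs, `single_eval` and `multi_eval` would take the place of the two example dicts, provided they really do have this shape.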

## Example 4: Iterative Critique

Test whether iterative refinement improves the design.

In [None]:
# Run iterative critique (2 rounds)
iterative_result = run_critique_strategy(
    "iterative",
    MOBILE_CHECKOUT,
    iterations=2,
    provider="ollama",
    temperature=0.3
)

print("ITERATIVE CRITIQUE RESULTS:\n")

for critique in iterative_result.critiques:
    iteration = critique.get('iteration', '?')
    crit_type = critique.get('type', 'unknown')
    
    print(f"\n{'='*60}")
    print(f"ITERATION {iteration} - {crit_type.upper()}")
    print(f"{'='*60}")
    
    if crit_type == 'critique':
        print(critique['feedback'])
    elif crit_type == 'revision':
        print("REVISED DESIGN:")
        print(critique['revised_design'])

if iterative_result.revised_design:
    print("\n" + "="*60)
    print("FINAL REVISED DESIGN:")
    print("="*60)
    print(iterative_result.revised_design)

print(f"\nLatency: {iterative_result.latency_s:.2f}s")
print(f"Cost: ${iterative_result.total_cost_usd:.4f}")
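
To judge when further rounds stop paying off, one option is to score the design after each iteration and stop once the gain falls below a threshold. The sketch below assumes you have already collected one score per iteration (for example by evaluating each revision); the notebook does not produce such a list itself. `useful_iterations` and the 0.05 threshold are hypothetical.

```python
def useful_iterations(scores, min_gain=0.05):
    """Count the leading iterations before the score gain first drops below min_gain.

    scores[0] is the score after the first iteration, scores[1] after the
    second, and so on.
    """
    for i in range(1, len(scores)):
        if scores[i] - scores[i - 1] < min_gain:
            return i
    return len(scores)


# Illustrative scores only, not measured results.
example_scores = [0.5, 0.625, 0.65625, 0.6875]
print(useful_iterations(example_scores))
```

A threshold on absolute gain is only one possible stopping rule; with a noisy judge model, averaging several evaluations per iteration may be needed before the gains are meaningful.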

## Example 5: Adversarial Critique

Use adversarial dialogue to challenge assumptions and find edge cases.

In [None]:
# Run adversarial critique
adversarial_result = run_critique_strategy(
    "adversarial",
    MOBILE_CHECKOUT,
    provider="ollama",
    temperature=0.4  # Slightly higher for more diverse debate
)

print("ADVERSARIAL CRITIQUE EXCHANGE:\n")

for critique in adversarial_result.critiques:
    agent = critique['agent']
    print(f"\n{'='*60}")
    print(f"AGENT: {agent.upper()}")
    print(f"{'='*60}")
    print(critique['content'])

print("\n" + "="*60)
print("SYNTHESIS")
print("="*60)
print(adversarial_result.synthesis)

print(f"\nLatency: {adversarial_result.latency_s:.2f}s")
print(f"Cost: ${adversarial_result.total_cost_usd:.4f}")

## Example 6: Compare All Strategies

Run all strategies on the same problem and compare the results.

In [None]:
# Run all strategies
strategies_to_test = {
    "single": {},
    "multi_perspective": {"synthesize": True},
    "iterative": {"iterations": 2},
    "adversarial": {},
}

results = {}

print("Running all strategies...\n")
for strategy_name, kwargs in strategies_to_test.items():
    print(f"Running {strategy_name}...")
    result = run_critique_strategy(
        strategy_name,
        MOBILE_CHECKOUT,
        provider="ollama",
        temperature=0.3,
        **kwargs
    )
    results[strategy_name] = result
    print(f"  Completed in {result.latency_s:.2f}s")

print("\nAll strategies completed!")

In [None]:
# Compare strategies
comparison = compare_strategies(
    MOBILE_CHECKOUT,
    results,
    judge_provider="ollama"
)

print("STRATEGY COMPARISON RESULTS:\n")

# Rankings
if "combined" in comparison["rankings"]:
    print("Overall Ranking (Combined Score):")
    for i, item in enumerate(comparison["rankings"]["combined"], 1):
        print(f"{i}. {item['strategy']:<20} Score: {item['score']:.3f}")

print("\nDetailed Metrics:")
print(f"\n{'Strategy':<20} {'Coverage':<12} {'Quality':<12} {'Depth':<12} {'Combined':<12}")
print("="*68)

for strategy_name, eval_result in comparison["evaluations"].items():
    coverage = eval_result.get("coverage", {}).get("overall_coverage", 0)
    quality = eval_result.get("quality", {}).get("overall_quality", 0)
    depth = eval_result.get("depth", {}).get("depth_score", 0)
    combined = eval_result.get("combined_score", 0)
    
    print(f"{strategy_name:<20} {coverage:<12.3f} {quality:<12.3f} {depth:<12.3f} {combined:<12.3f}")

print("\nPerformance Metrics:")
print(f"\n{'Strategy':<20} {'Latency (s)':<15} {'Cost (USD)':<15} {'Tokens':<15}")
print("="*65)

for strategy_name, perf in comparison["performance"].items():
    print(f"{strategy_name:<20} {perf['latency_s']:<15.2f} ${perf['total_cost_usd']:<14.4f} {perf['total_tokens']:<15}")
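
The quality and performance tables can be joined into a rough efficiency view. This is a minimal sketch, assuming `comparison["evaluations"]` and `comparison["performance"]` are dicts keyed by strategy name, as the loops above suggest. It also assumes a local provider may report a cost of zero, so the division is guarded. `score_per_unit` and `efficiency_rows` are hypothetical helpers, and the numbers are illustrative.

```python
def score_per_unit(score, amount):
    """Return score divided by a resource amount, or None if the amount is zero or missing."""
    if not amount:
        return None
    return score / amount


def efficiency_rows(evaluations, performance):
    """Build one row per strategy, sorted by score per second (best first, None last)."""
    rows = []
    for name, evaluation in evaluations.items():
        perf = performance.get(name, {})
        score = evaluation.get("combined_score", 0)
        rows.append({
            "strategy": name,
            "score": score,
            "score_per_second": score_per_unit(score, perf.get("latency_s")),
            "score_per_usd": score_per_unit(score, perf.get("total_cost_usd")),
        })
    rows.sort(key=lambda r: (r["score_per_second"] is None,
                             -(r["score_per_second"] or 0.0)))
    return rows


# Illustrative numbers only, not measured results.
example_evaluations = {
    "single": {"combined_score": 0.5},
    "multi_perspective": {"combined_score": 0.75},
    "iterative": {"combined_score": 0.625},
}
example_performance = {
    "single": {"latency_s": 10.0, "total_cost_usd": 0.0},
    "multi_perspective": {"latency_s": 30.0, "total_cost_usd": 0.0},
    "iterative": {"latency_s": 50.0, "total_cost_usd": 0.0},
}

rows = efficiency_rows(example_evaluations, example_performance)
for row in rows:
    print(row)
```

Score per second rewards cheap strategies even when their absolute score is lower, so it is best read next to the raw scores rather than instead of them.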

## Example 7: Test on Different Problem Types

Let's see how strategies perform across different design domains.

In [None]:
# Test on different domains
test_problems = [
    MOBILE_CHECKOUT,  # UI/UX
    REST_API_VERSIONING,  # API
    CACHING_STRATEGY,  # System
]

domain_results = {}

for problem in test_problems:
    print(f"\nTesting on: {problem.name} ({problem.domain.value})")
    
    # Use multi-perspective for each
    result = run_critique_strategy(
        "multi_perspective",
        problem,
        provider="ollama",
        synthesize=True,
        temperature=0.3
    )
    
    domain_results[problem.name] = {
        "problem": problem,
        "critique_result": result
    }
    
    print(f"  Perspectives used: {len(result.critiques)}")
    print(f"  Recommendations: {len(result.recommendations)}")
    print(f"  Latency: {result.latency_s:.2f}s")

In [None]:
# Batch evaluate across domains
batch_results = batch_evaluate(
    list(domain_results.values()),
    judge_provider="ollama"
)

print("BATCH EVALUATION ACROSS DOMAINS:\n")
print(f"Total problems evaluated: {batch_results['count']}")
print("\nAggregate Scores:")
for metric, score in batch_results['aggregates'].items():
    print(f"  {metric}: {score:.3f}")

print("\nIndividual Problem Scores:")
print(f"\n{'Problem':<30} {'Coverage':<12} {'Quality':<12} {'Combined':<12}")
print("="*66)

for eval_result in batch_results['evaluations']:
    name = eval_result['problem_name']
    coverage = eval_result.get("coverage", {}).get("overall_coverage", 0)
    quality = eval_result.get("quality", {}).get("overall_quality", 0)
    combined = eval_result.get("combined_score", 0)
    
    print(f"{name:<30} {coverage:<12.3f} {quality:<12.3f} {combined:<12.3f}")
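
With only a handful of problems, the spread across them matters as much as the aggregate. The sketch below summarizes one metric across a list of evaluation dicts, assuming each has a `problem_name` key and a top-level `combined_score`, as in the loop above. `summarize_metric` is a hypothetical helper and does not reproduce how `batch_evaluate` computes its own aggregates; the numbers and problem names are placeholders.

```python
from statistics import mean, pstdev


def summarize_metric(evaluations, metric="combined_score"):
    """Return mean, population standard deviation, range, and weakest problem for one metric."""
    values = {e["problem_name"]: e.get(metric, 0) for e in evaluations}
    scores = list(values.values())
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),
        "range": (min(scores), max(scores)),
        "weakest": min(values, key=values.get),
    }


# Illustrative numbers and placeholder names, not measured results.
example_evaluations = [
    {"problem_name": "problem_a", "combined_score": 0.5},
    {"problem_name": "problem_b", "combined_score": 0.75},
    {"problem_name": "problem_c", "combined_score": 1.0},
]

summary = summarize_metric(example_evaluations)
print(summary)
```

`mean` raises `StatisticsError` on an empty list, so check that at least one evaluation is present before calling the helper.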

## Example 8: Custom Perspectives

You can specify custom perspectives for domain-specific critique.

In [None]:
# Custom perspectives for API design
custom_perspectives = [
    CritiquePerspective.USABILITY,
    CritiquePerspective.CONSISTENCY,
    CritiquePerspective.SCALABILITY,
    CritiquePerspective.SECURITY,
]

custom_result = multi_perspective_critique(
    REST_API_VERSIONING,
    perspectives=custom_perspectives,
    synthesize=True,
    provider="ollama",
    temperature=0.3
)

print("CUSTOM PERSPECTIVE CRITIQUE:\n")
print(f"Perspectives used: {[p.value for p in custom_perspectives]}")
print("\nSynthesis:")
print(custom_result.synthesis)
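
To compare perspective combinations systematically, you can enumerate subsets of the candidate perspectives and run one critique per subset. The sketch below uses plain strings as stand-ins for the four `CritiquePerspective` members above, so it runs without `crit`. `perspective_subsets` is a hypothetical helper.

```python
from itertools import combinations

# Stand-in names; in the notebook these would be CritiquePerspective members.
PERSPECTIVE_NAMES = ["usability", "consistency", "scalability", "security"]


def perspective_subsets(names, min_size=2):
    """Yield every subset of names with at least min_size members, smallest first."""
    for size in range(min_size, len(names) + 1):
        yield from combinations(names, size)


subsets = list(perspective_subsets(PERSPECTIVE_NAMES))
print(f"{len(subsets)} combinations to test")
for subset in subsets:
    print(subset)
```

Each subset could then be passed as the `perspectives` argument shown in the cell above. The number of subsets grows exponentially with the number of perspectives, so it is worth capping `min_size` or sampling once the list gets longer.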

## Research Questions to Explore

Use this toolkit to investigate:

1. **Strategy Effectiveness**
   - Which strategy finds the most issues?
   - Which produces the most actionable recommendations?
   - How does performance scale with problem complexity?

2. **Perspective Value**
   - Which perspectives are most valuable for different domains?
   - How much overlap is there between perspectives?
   - Can we identify "essential" vs "nice-to-have" perspectives?

3. **Cost-Benefit Analysis**
   - What's the ROI of multi-agent vs single-agent?
   - Where is the point of diminishing returns?
   - How can we optimize for cost vs quality?

4. **Synthesis Quality**
   - Does synthesis improve upon individual critiques?
   - What's lost in synthesis?
   - Can we measure synthesis effectiveness?

5. **Iteration Value**
   - How much does each iteration improve the design?
   - When do iterations stop adding value?
   - Can we predict optimal iteration count?

6. **Model Capabilities**
   - How do different models compare on design critique?
   - Do larger models give better critiques?
   - Are specialized models better for specific domains?
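
As a starting point for the overlap question under "Perspective Value", the sketch below computes pairwise Jaccard similarity between the word sets of individual critiques. It assumes each critique is a dict with `perspective` and `critique` keys, as in Example 2. Word overlap is a crude proxy chosen for illustration, since it ignores paraphrase. `tokens`, `jaccard`, and `pairwise_overlap` are hypothetical helpers.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase a text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(text_a, text_b):
    """Jaccard similarity of two texts' word sets (0.0 if both are empty)."""
    words_a, words_b = tokens(text_a), tokens(text_b)
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


def pairwise_overlap(critiques):
    """Map each pair of perspective names to the similarity of their critiques."""
    return {
        (a["perspective"], b["perspective"]): jaccard(a["critique"], b["critique"])
        for a, b in combinations(critiques, 2)
    }


# Illustrative critiques only, not model output.
example_critiques = [
    {"perspective": "usability", "critique": "Add guest checkout"},
    {"perspective": "security", "critique": "add guest checkout option"},
    {"perspective": "scalability", "critique": "Cache the cart"},
]

overlap = pairwise_overlap(example_critiques)
for pair, score in overlap.items():
    print(pair, f"{score:.2f}")
```

Pairs with consistently high overlap are candidates for the "nice-to-have" bucket; an embedding-based similarity would be a natural replacement if word overlap proves too coarse.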

## Next Steps

1. **Create custom design problems** for your specific domain
2. **Compare different models** (local vs API, small vs large)
3. **Experiment with perspective combinations**
4. **Test on real-world design challenges**
5. **Integrate with experiment tracking** for long-term analysis
6. **Build validated benchmark datasets** for consistent evaluation
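
For step 5, a simple way to begin is to append one JSON record per run to a JSON Lines log and load it back for analysis. The sketch below writes to an in-memory buffer so it runs anywhere; in practice the stream would be a file opened in append mode. The record fields, `log_run`, and `load_runs` are hypothetical, and the values are illustrative.

```python
import io
import json


def log_run(stream, problem_name, strategy, combined_score, latency_s):
    """Write one run as a single JSON line and return the record."""
    record = {
        "problem_name": problem_name,
        "strategy": strategy,
        "combined_score": combined_score,
        "latency_s": latency_s,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(stream):
    """Parse every non-empty line of a JSON Lines stream."""
    return [json.loads(line) for line in stream if line.strip()]


# Illustrative values only, not measured results.
buffer = io.StringIO()
log_run(buffer, "example problem", "single", 0.5, 10.0)
log_run(buffer, "example problem", "multi_perspective", 0.75, 30.0)

buffer.seek(0)
runs = load_runs(buffer)
print(runs)
```

Adding the model name and a timestamp to each record would make comparisons across models and over time easier to reconstruct later.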