# Lab 11 - Module 2: Evaluate (Stage 2)

**Time:** ~20 minutes

## Stage 2: Systematic Evaluation with Rubrics

In Module 1, you gave a quick "first impression" rating. Now you'll evaluate more systematically using **explicit criteria**.

### Learning Objectives

- Apply explicit rubrics to evaluate content quality
- Compare human evaluation vs. AI self-evaluation
- Discover where AI's self-assessment is accurate or inaccurate
- Understand that "good" requires clear, measurable standards

### What You'll Do

1. Select a rubric that fits your content type
2. Score the AI's output (human judgment) using the rubric
3. Ask the AI to score its OWN work using the same rubric
4. Compare: Where do you agree? Where do you disagree?
5. Identify the weakest areas for revision in Module 3

## Setup: Load Your Data

In [None]:
import numpy as np
import pandas as pd
import json
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML, Markdown

print("✓ Libraries loaded!")

In [None]:
# Load scenario and Module 1 data
group_code = int(input("Enter your group code: "))

try:
    # Load scenario
    with open(f'lab11_group_{group_code}_scenario.json', 'r') as f:
        scenario = json.load(f)
    
    # Load Module 1 data
    with open(f'lab11_group_{group_code}_module1.json', 'r') as f:
        module1_data = json.load(f)
    
    print(f"✓ Loaded data for Group {group_code}")
    print(f"\nPrompt Family: {scenario['family']}")
    print(f"AI Model Used: {module1_data['model_used']}")
    print(f"First Impression Rating: {module1_data['first_impression_rating']}/5")
    
except FileNotFoundError as e:
    print(f"\n❌ ERROR: Could not find {e.filename} for group {group_code}.")
    print("Please run Module 0 and Module 1 first.")
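
A file can load without error and still be missing a field that later cells read. The sketch below checks for the keys this notebook accesses (`family`, `default_rubric`, `model_used`, `first_impression_rating`, `ai_output`). The helper name `missing_keys` and the example dictionary are hypothetical and not part of the lab files.

```python
# Keys that later cells in this notebook read from each loaded file.
REQUIRED_SCENARIO_KEYS = {"family", "default_rubric"}
REQUIRED_MODULE1_KEYS = {"model_used", "first_impression_rating", "ai_output"}


def missing_keys(data, required):
    """Return the required keys that are absent from data, sorted by name."""
    return sorted(required - set(data))


# Hypothetical stand-in for a Module 1 file that lacks the AI output.
example_module1 = {"model_used": "example-model", "first_impression_rating": 4}
print(missing_keys(example_module1, REQUIRED_MODULE1_KEYS))
```

After a successful load, calling `missing_keys(scenario, REQUIRED_SCENARIO_KEYS)` and `missing_keys(module1_data, REQUIRED_MODULE1_KEYS)` should both return an empty list.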

## Review: Your AI's Output

Here's what the AI generated in Module 1. You'll be evaluating this:

In [None]:
print("="*70)
print("AI OUTPUT TO EVALUATE")
print("="*70)
print()
print(module1_data['ai_output'])
print()
print("="*70)

## Step 1: Select a Rubric

Choose the rubric that best fits your content type. The **recommended rubric** for your scenario is shown, but you can choose a different one if it fits better.

In [None]:
# Define the three rubrics
RUBRICS = {
    'General Communication': {
        'description': 'Best for: Science Explainers, Museum Panels, Educational Analogies, Infographic Structures',
        'criteria': [
            {
                'name': 'Clarity',
                'levels': {
                    1: 'Confusing, jargon-heavy, hard to follow',
                    3: 'Mostly clear with some unclear parts',
                    5: 'Crystal clear, no confusion, well-explained'
                }
            },
            {
                'name': 'Audience Fit',
                'levels': {
                    1: 'Wrong level/tone for target audience',
                    3: 'Generally appropriate for audience',
                    5: 'Perfectly tailored to audience needs'
                }
            },
            {
                'name': 'Structure',
                'levels': {
                    1: 'Disorganized, hard to follow flow',
                    3: 'Logical but could be tighter',
                    5: 'Exceptionally well-organized, smooth flow'
                }
            },
            {
                'name': 'Specificity',
                'levels': {
                    1: 'Vague, generic statements',
                    3: 'Some specifics mixed with generic content',
                    5: 'Rich in concrete details and examples'
                }
            },
            {
                'name': 'Accuracy',
                'levels': {
                    1: 'Factual errors present',
                    3: 'Mostly accurate with minor issues',
                    5: 'Fully accurate and defensible'
                }
            }
        ]
    },
    'Persuasion/Campaign': {
        'description': 'Best for: PSAs, Product Pitches, Debate Positions',
        'criteria': [
            {
                'name': 'Message Focus',
                'levels': {
                    1: 'Unclear or multiple conflicting messages',
                    3: 'Clear message but could be sharper',
                    5: 'Single, powerful, memorable message'
                }
            },
            {
                'name': 'Evidence/Support',
                'levels': {
                    1: 'No evidence or weak anecdotes only',
                    3: 'Some credible support provided',
                    5: 'Strong, credible, well-integrated evidence'
                }
            },
            {
                'name': 'Call to Action',
                'levels': {
                    1: 'Missing or extremely vague',
                    3: 'Present but not compelling',
                    5: 'Clear, specific, highly motivating'
                }
            },
            {
                'name': 'Tone Fit',
                'levels': {
                    1: 'Mismatched tone for purpose/audience',
                    3: 'Appropriate but not distinctive',
                    5: 'Perfectly calibrated, engaging tone'
                }
            },
            {
                'name': 'Ethics',
                'levels': {
                    1: 'Misleading or manipulative',
                    3: 'Honest with some weak points',
                    5: 'Transparent, balanced, trustworthy'
                }
            }
        ]
    },
    'Creative/Narrative': {
        'description': 'Best for: Short Narratives, Creative Content',
        'criteria': [
            {
                'name': 'Engagement',
                'levels': {
                    1: 'Boring, loses attention quickly',
                    3: 'Intermittently interesting',
                    5: 'Captivating throughout'
                }
            },
            {
                'name': 'Coherence',
                'levels': {
                    1: 'Confusing, disjointed narrative',
                    3: 'Logical but rough transitions',
                    5: 'Seamlessly flows, well-integrated'
                }
            },
            {
                'name': 'Creativity',
                'levels': {
                    1: 'Clichéd, completely predictable',
                    3: 'Some original elements',
                    5: 'Fresh, unexpected, memorable'
                }
            },
            {
                'name': 'Constraint Adherence',
                'levels': {
                    1: 'Ignores key requirements',
                    3: 'Meets most constraints',
                    5: 'Perfectly fulfills all constraints'
                }
            }
        ]
    }
}

# Display rubric information
print(f"\nRECOMMENDED RUBRIC FOR YOUR SCENARIO:")
print(f"  {scenario['default_rubric']}")
print(f"\nALL AVAILABLE RUBRICS:")
print()
for rubric_name, rubric_info in RUBRICS.items():
    print(f"• {rubric_name}")
    print(f"  {rubric_info['description']}")
    print()

In [None]:
# Rubric selection widget
rubric_selector = widgets.Dropdown(
    options=list(RUBRICS.keys()),
    value=scenario['default_rubric'],
    description='Select Rubric:',
    style={'description_width': '120px'},
    layout=widgets.Layout(width='400px')
)

rubric_justification = widgets.Textarea(
    value='',
    placeholder='Why did you choose this rubric? How does it fit your content?',
    description='Justification:',
    layout=widgets.Layout(width='100%', height='80px'),
    style={'description_width': '120px'}
)

print("SELECT YOUR RUBRIC")
print("="*70)
display(rubric_selector)
display(rubric_justification)

## Step 2: Display Your Selected Rubric

Run this cell to see the full rubric with all criteria and scoring levels:

In [None]:
def display_rubric(rubric_name):
    rubric = RUBRICS[rubric_name]
    
    print("="*70)
    print(f"{rubric_name.upper()} RUBRIC")
    print("="*70)
    print(f"{rubric['description']}")
    print()
    
    for i, criterion in enumerate(rubric['criteria'], 1):
        print(f"{i}. {criterion['name']}")
        print(f"   1 (Weak):      {criterion['levels'][1]}")
        print(f"   3 (Good):      {criterion['levels'][3]}")
        print(f"   5 (Excellent): {criterion['levels'][5]}")
        print()
    print("="*70)

# Button to display rubric
display_button = widgets.Button(
    description='📋 Display Rubric',
    button_style='info',
    layout=widgets.Layout(width='200px')
)

rubric_output = widgets.Output()

def on_display_clicked(b):
    with rubric_output:
        clear_output()
        display_rubric(rubric_selector.value)

display_button.on_click(on_display_clicked)

display(display_button)
display(rubric_output)

## Step 3: Human Evaluation (Your Scores)

Now score the AI's output using the rubric. Use your best judgment.

**Scoring Scale:** 1, 2, 3, 4, or 5 for each criterion
- **1-2:** Closer to "Weak" description
- **3:** Matches "Good" description
- **4-5:** Closer to "Excellent" description
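
The rubrics only describe levels 1, 3, and 5, so a score of 2 or 4 has to be read against the nearest described level. A minimal sketch of that mapping follows, using the Clarity descriptions from the General Communication rubric. The function name `nearest_anchor` is hypothetical.

```python
def nearest_anchor(score):
    """Map a 1-5 score to the described rubric level (1, 3, or 5) it leans toward."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("score must be an integer from 1 to 5")
    if score <= 2:
        return 1  # closer to the "Weak" description
    if score == 3:
        return 3  # matches the "Good" description
    return 5      # closer to the "Excellent" description


# Level descriptions copied from the Clarity criterion.
clarity_levels = {
    1: 'Confusing, jargon-heavy, hard to follow',
    3: 'Mostly clear with some unclear parts',
    5: 'Crystal clear, no confusion, well-explained',
}

for score in range(1, 6):
    print(score, "->", clarity_levels[nearest_anchor(score)])
```

A 2 or a 4 signals that the output leans toward a description without fully matching it, which is worth stating in your justification.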

In [None]:
# Create scoring widgets dynamically based on selected rubric
def create_human_scoring_widgets():
    selected_rubric = RUBRICS[rubric_selector.value]
    human_scores = {}
    human_justifications = {}
    
    print("HUMAN EVALUATION (Your Scores)")
    print("="*70)
    print("Score each criterion from 1-5 based on the rubric.")
    print()
    
    for criterion in selected_rubric['criteria']:
        name = criterion['name']
        
        score_widget = widgets.Dropdown(
            options=[1, 2, 3, 4, 5],
            description=f"{name}:",
            style={'description_width': '150px'},
            layout=widgets.Layout(width='250px')
        )
        
        justification_widget = widgets.Textarea(
            value='',
            placeholder=f'Why this score for {name}? Be specific.',
            layout=widgets.Layout(width='100%', height='60px')
        )
        
        human_scores[name] = score_widget
        human_justifications[name] = justification_widget
        
        display(score_widget)
        display(justification_widget)
        print()
    
    return human_scores, human_justifications

# Button to create scoring interface
create_scoring_button = widgets.Button(
    description='✏️ Start Scoring',
    button_style='success',
    layout=widgets.Layout(width='200px')
)

scoring_output = widgets.Output()
human_scores_dict = {}
human_justifications_dict = {}

def on_create_scoring_clicked(b):
    global human_scores_dict, human_justifications_dict
    with scoring_output:
        clear_output()
        human_scores_dict, human_justifications_dict = create_human_scoring_widgets()

create_scoring_button.on_click(on_create_scoring_clicked)

print("Click 'Start Scoring' after you've displayed and reviewed the rubric:")
display(create_scoring_button)
display(scoring_output)

## Step 4: AI Self-Evaluation

Now go back to your AI system and ask it to evaluate its OWN work using the same rubric.

### Instructions:

1. Go back to the AI conversation from Module 1 (or start a new one and paste the AI's output)

2. Copy the **prompt below** (it includes the rubric and asks AI to self-evaluate)

3. Paste it into the AI and get its response

4. Record the AI's self-scores and reasoning below

In [None]:
# Generate self-evaluation prompt
def generate_self_eval_prompt():
    selected_rubric = RUBRICS[rubric_selector.value]
    
    prompt = f"""I need you to evaluate the quality of your previous response using this rubric:

**{rubric_selector.value} Rubric:**

"""
    
    for i, criterion in enumerate(selected_rubric['criteria'], 1):
        prompt += f"{i}. **{criterion['name']}**\n"
        prompt += f"   - Score 1 (Weak): {criterion['levels'][1]}\n"
        prompt += f"   - Score 3 (Good): {criterion['levels'][3]}\n"
        prompt += f"   - Score 5 (Excellent): {criterion['levels'][5]}\n"
        prompt += "\n"
    
    prompt += """Please:
1. Score your response on EACH criterion (1-5 scale)
2. Briefly explain your reasoning for each score (1-2 sentences)
3. Identify the TWO WEAKEST areas (lowest scores)
4. For each weak area, suggest a specific improvement

Format your response clearly with the criterion name, score, and reasoning."""
    
    return prompt

# Button to generate and display prompt
generate_prompt_button = widgets.Button(
    description='📝 Generate Prompt',
    button_style='info',
    layout=widgets.Layout(width='200px')
)

prompt_output = widgets.Output()

def on_generate_prompt_clicked(b):
    with prompt_output:
        clear_output()
        prompt = generate_self_eval_prompt()
        print("="*70)
        print("COPY THIS PROMPT TO YOUR AI")
        print("="*70)
        print()
        print(prompt)
        print()
        print("="*70)

generate_prompt_button.on_click(on_generate_prompt_clicked)

print("\nGenerate the self-evaluation prompt to give to your AI:")
display(generate_prompt_button)
display(prompt_output)

## Step 5: Record AI's Self-Evaluation

In [None]:
# Widget to record AI's complete self-evaluation
ai_self_eval_full = widgets.Textarea(
    value='',
    placeholder="Paste the AI's COMPLETE self-evaluation response here...",
    description='AI Self-Eval:',
    layout=widgets.Layout(width='100%', height='250px'),
    style={'description_width': '120px'}
)

ai_weaknesses_identified = widgets.Textarea(
    value='',
    placeholder="What 2 weaknesses did the AI identify? What improvements did it suggest?",
    description='AI Weaknesses:',
    layout=widgets.Layout(width='100%', height='100px'),
    style={'description_width': '120px'}
)

print("RECORD AI'S SELF-EVALUATION")
print("="*70)
display(ai_self_eval_full)
display(ai_weaknesses_identified)

## Step 6: Extract AI's Scores

Manually enter the scores the AI gave itself for each criterion:

In [None]:
# Create AI score entry widgets
def create_ai_score_widgets():
    selected_rubric = RUBRICS[rubric_selector.value]
    ai_scores = {}
    
    print("AI'S SELF-SCORES (Enter what AI scored itself)")
    print("="*70)
    print()
    
    for criterion in selected_rubric['criteria']:
        name = criterion['name']
        
        score_widget = widgets.Dropdown(
            options=[1, 2, 3, 4, 5],
            description=f"{name}:",
            style={'description_width': '150px'},
            layout=widgets.Layout(width='250px')
        )
        
        ai_scores[name] = score_widget
        display(score_widget)
    
    return ai_scores

# Button to create AI score entry
create_ai_scores_button = widgets.Button(
    description='🤖 Enter AI Scores',
    button_style='warning',
    layout=widgets.Layout(width='200px')
)

ai_scores_output = widgets.Output()
ai_scores_dict = {}

def on_create_ai_scores_clicked(b):
    global ai_scores_dict
    with ai_scores_output:
        clear_output()
        ai_scores_dict = create_ai_score_widgets()

create_ai_scores_button.on_click(on_create_ai_scores_clicked)

display(create_ai_scores_button)
display(ai_scores_output)
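
Manual entry is the intended method, but a rough text search can serve as a cross-check against typos. The sketch below assumes the AI wrote each criterion name followed by its score on the same line (for example `Clarity: 4/5`). Real responses vary, so treat a `None` or a surprising value as a cue to reread the response. The function `extract_scores` and the sample text are hypothetical.

```python
import re


def extract_scores(text, criterion_names):
    """Find the first single digit 1-5 that follows each criterion name on the same line."""
    found = {}
    for name in criterion_names:
        # Criterion name, then any non-digit characters on that line, then one digit from 1 to 5.
        pattern = re.escape(name) + r"[^\d\n]*([1-5])\b"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        found[name] = int(match.group(1)) if match else None
    return found


# Hypothetical AI response used only to exercise the function.
sample = """1. **Clarity** - Score: 4/5. The explanation is easy to follow.
2. **Audience Fit**: 3 - Some terms may be too advanced.
3. Structure was fine overall."""

print(extract_scores(sample, ["Clarity", "Audience Fit", "Structure"]))
```

Here `Structure` comes back as `None` because the sample line contains no score, which is exactly the case where you need to look at the AI's text yourself.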

## Step 7: Compare Evaluations

Now let's see where you and the AI agree and disagree:

In [None]:
# Generate comparison table
compare_button = widgets.Button(
    description='📊 Compare Scores',
    button_style='success',
    layout=widgets.Layout(width='200px')
)

compare_output = widgets.Output()

def on_compare_clicked(b):
    with compare_output:
        clear_output()
        
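        if not human_scores_dict or not ai_scores_dict:
            print("❌ Please complete both human scoring and AI score entry first.")
            return
        
        # Guard against a rubric change made after the score widgets were created
        expected_names = [c['name'] for c in RUBRICS[rubric_selector.value]['criteria']]
        if list(human_scores_dict) != expected_names or list(ai_scores_dict) != expected_names:
            print("❌ Scores were entered for a different rubric. Click 'Start Scoring' and 'Enter AI Scores' again.")
            return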
        
        # Create comparison table
        selected_rubric = RUBRICS[rubric_selector.value]
        
        comparison_data = []
        total_human = 0
        total_ai = 0
        disagreements = []
        
        for criterion in selected_rubric['criteria']:
            name = criterion['name']
            human_score = human_scores_dict[name].value
            ai_score = ai_scores_dict[name].value
            diff = human_score - ai_score
            
            total_human += human_score
            total_ai += ai_score
            
            agreement = "✓ Match" if diff == 0 else ("⚠️ Human higher" if diff > 0 else "⚠️ AI higher")
            
            if abs(diff) >= 2:
                disagreements.append((name, human_score, ai_score, diff))
            
            comparison_data.append({
                'Criterion': name,
                'Human Score': human_score,
                'AI Score': ai_score,
                'Difference': diff,
                'Agreement': agreement
            })
        
        df = pd.DataFrame(comparison_data)
        
        print("="*70)
        print("HUMAN VS. AI EVALUATION COMPARISON")
        print("="*70)
        print()
        print(df.to_string(index=False))
        print()
        print("="*70)
        print(f"Total Human Score: {total_human}/{len(selected_rubric['criteria']) * 5}")
        print(f"Total AI Score:    {total_ai}/{len(selected_rubric['criteria']) * 5}")
        print(f"Overall Difference: {total_human - total_ai}")
        print()
        
        if disagreements:
            print("SIGNIFICANT DISAGREEMENTS (±2 or more):")
            for name, h_score, a_score, diff in disagreements:
                if diff > 0:
                    print(f"  • {name}: You scored {h_score}, AI scored {a_score} (AI underrated itself by {diff})")
                else:
                    print(f"  • {name}: You scored {h_score}, AI scored {a_score} (AI overrated itself by {abs(diff)})")
        else:
            print("No major disagreements (all within ±1 point)")
        
        print("="*70)

compare_button.on_click(on_compare_clicked)

display(compare_button)
display(compare_output)
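
The comparison table reports per-criterion differences and totals. If you want a compact summary for your answer sheet, the widget-free sketch below computes three simple figures from two plain score dictionaries. It uses the same sign convention as the table (human minus AI), so a negative mean signed difference means the AI rated itself higher than you did. The function name `agreement_summary` and the example scores are hypothetical.

```python
def agreement_summary(human, ai):
    """Summarize agreement between two {criterion: score} dicts that share the same keys."""
    if not human or human.keys() != ai.keys():
        raise ValueError("both score sets must be non-empty and cover the same criteria")
    diffs = [human[name] - ai[name] for name in human]
    n = len(diffs)
    return {
        "exact_match_rate": sum(d == 0 for d in diffs) / n,  # share of identical scores
        "mean_abs_diff": sum(abs(d) for d in diffs) / n,     # typical size of the gap
        "mean_signed_diff": sum(diffs) / n,                  # direction of the gap
    }


# Hypothetical scores for the General Communication rubric.
human = {"Clarity": 4, "Audience Fit": 3, "Structure": 4, "Specificity": 2, "Accuracy": 4}
ai = {"Clarity": 5, "Audience Fit": 4, "Structure": 4, "Specificity": 4, "Accuracy": 5}

print(agreement_summary(human, ai))
```

With five criteria these figures are descriptive only. They help you phrase an answer but are not a statistical test.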

## Step 8: Reflection on Disagreements

In [None]:
disagreement_analysis = widgets.Textarea(
    value='',
    placeholder='Where did you most disagree with AI? Why do you think this disagreement occurred?',
    description='Analysis:',
    layout=widgets.Layout(width='100%', height='120px'),
    style={'description_width': '120px'}
)

print("\nANALYZE DISAGREEMENTS")
print("="*70)
display(disagreement_analysis)

## Step 9: Save All Data

In [None]:
# Save button
save_module2_button = widgets.Button(
    description='💾 Save Module 2',
    button_style='success',
    layout=widgets.Layout(width='200px', height='40px')
)

save_output = widgets.Output()

def on_save_module2_clicked(b):
    with save_output:
        clear_output()
        
        # Validate
        if not rubric_justification.value:
            print("❌ Please provide justification for rubric choice")
            return
        if not human_scores_dict or not ai_scores_dict:
            print("❌ Please complete both human and AI scoring")
            return
        if not ai_self_eval_full.value:
            print("❌ Please record AI's full self-evaluation")
            return
        
        # Compile data
        selected_rubric = RUBRICS[rubric_selector.value]
        
        human_scores_data = {name: widget.value for name, widget in human_scores_dict.items()}
        human_justifications_data = {name: widget.value for name, widget in human_justifications_dict.items()}
        ai_scores_data = {name: widget.value for name, widget in ai_scores_dict.items()}
        
        # Find two weakest criteria (lowest human scores)
        sorted_criteria = sorted(human_scores_data.items(), key=lambda x: x[1])
        weakest_criteria = [sorted_criteria[0][0], sorted_criteria[1][0]]
        
        module2_data = {
            'group_code': group_code,
            'selected_rubric': rubric_selector.value,
            'rubric_justification': rubric_justification.value,
            'human_scores': human_scores_data,
            'human_justifications': human_justifications_data,
            'ai_self_evaluation_full': ai_self_eval_full.value,
            'ai_scores': ai_scores_data,
            'ai_weaknesses_identified': ai_weaknesses_identified.value,
            'disagreement_analysis': disagreement_analysis.value,
            'weakest_criteria': weakest_criteria
        }
        
        # Save
        with open(f'lab11_group_{group_code}_module2.json', 'w') as f:
            json.dump(module2_data, f, indent=2)
        
        print("✓ Module 2 data saved successfully!")
        print(f"\nWeakest criteria identified for revision:")
        print(f"  1. {weakest_criteria[0]} (score: {human_scores_data[weakest_criteria[0]]})")
        print(f"  2. {weakest_criteria[1]} (score: {human_scores_data[weakest_criteria[1]]})")
        print(f"\nYou're ready for Module 3 (Revise & Compare)!")

save_module2_button.on_click(on_save_module2_clicked)

print("\n" + "="*70)
print("SAVE YOUR WORK")
print("="*70)
display(save_module2_button)
display(save_output)
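
The save cell takes the two lowest human scores as the weakest criteria. Python's `sorted` is stable, so when scores tie, the criterion listed earlier in the rubric wins and an equally weak one can be left out. The sketch below is one way to surface such ties before you commit to revision targets. The function `weakest_with_ties` and the example scores are hypothetical and do not change what gets saved.

```python
def weakest_with_ties(scores, k=2):
    """Return the k lowest-scoring criteria plus any others tied with the k-th lowest."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    if len(ranked) <= k:
        return [name for name, _ in ranked]
    cutoff = ranked[k - 1][1]  # score of the k-th weakest criterion
    return [name for name, score in ranked if score <= cutoff]


# Hypothetical human scores in which two criteria tie for second-weakest.
example = {"Clarity": 4, "Audience Fit": 2, "Structure": 3, "Specificity": 3, "Accuracy": 5}

print(weakest_with_ties(example))
```

If this returns more than two names, the saved `weakest_criteria` list reflects rubric order rather than a real difference in quality, so use your written justifications to decide which to revise.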

## Module 2 Questions

Answer these on your Lab 11 Answer Sheet.

### Q7: Rubric Selection

Which rubric did you choose? Why is it the best fit for your content type?

*(Answer on your answer sheet)*

### Q8: Human Evaluation

Record your scores for each criterion. Include brief justification for each score.

*(Answer on your answer sheet - use the comparison table)*

### Q9: AI Self-Scores

What scores did the AI give itself? Record the AI's reasoning for each score.

*(Answer on your answer sheet)*

### Q10: AI-Identified Weaknesses

Which 2 weaknesses did the AI identify in its own work? What improvements did it suggest?

*(Answer on your answer sheet)*

### Q11: Biggest Disagreement

Where did you MOST disagree with the AI's self-evaluation? Why do you think this disagreement occurred? Was the AI overconfident or underconfident?

*(Answer on your answer sheet)*

## Summary: What You've Accomplished

✓ Selected a rubric that fits your content type

✓ Systematically scored the AI's output (human judgment)

✓ Asked the AI to self-evaluate using the same rubric

✓ Compared human vs. AI scores

✓ Identified areas of agreement and disagreement

✓ Found the 2 weakest criteria for revision

### Key Insight

You've likely discovered that **AI self-evaluation is unreliable**. AI may:
- Be **overconfident** (score itself higher than it deserves)
- Be **underconfident** (score itself lower than it deserves)
- Miss obvious flaws that humans catch
- Focus on the wrong aspects of quality

This is why **human judgment remains essential** in evaluation!

### What's Next?

In **Module 3 (Revise & Compare)**, you will:
- Ask the AI to revise its work, targeting the 2 weakest criteria
- Score the revised version
- Compare: Did it improve? Did new problems appear?
- Test whether AI is better at revision than self-evaluation