# Lab 10 - Module 2: Evaluating AI Responses

**Time:** ~15-20 minutes

In this module, you'll verify the actual accuracy of the AI responses you collected in Module 1.

For each prompt, you'll:
- Verify the AI's claims using Google, Wikipedia, or other sources
- Record the actual accuracy level
- Identify specific error types (if any)
- Document hallucinations or fabrications

This is where you'll discover whether the AI's confidence matched its correctness!

## Setup: Import Libraries and Load Data

In [None]:
import numpy as np
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML, Markdown

print("✓ Libraries loaded successfully!")

## Enter Your Group Code

Use the **same group code** from Modules 0 and 1.
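
The group code ties the notebooks together, because the loading cell builds its file names from it. As a minimal sketch, you could check which of the expected files are absent before loading. The helper name `missing_lab_files` is hypothetical, and the sketch assumes the CSV files sit in the notebook's working directory.

```python
from pathlib import Path

def missing_lab_files(group_code, folder="."):
    """Return the expected Module 0/1 CSV names that are not in `folder`.

    Hypothetical helper, not part of the lab code; the file-name pattern
    mirrors the one used by the loading cell.
    """
    expected = [
        f"lab10_group_{group_code}_prompts.csv",
        f"lab10_group_{group_code}_predictions.csv",
    ]
    return [name for name in expected if not (Path(folder) / name).exists()]
```

An empty list means both files were found for that group code; any names returned point to the module that still needs to be run.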

In [None]:
group_code = int(input("Enter your group code: "))

# Load prompts and predictions
prompts_filename = f"lab10_group_{group_code}_prompts.csv"
predictions_filename = f"lab10_group_{group_code}_predictions.csv"

try:
    prompts_df = pd.read_csv(prompts_filename)
    predictions_df = pd.read_csv(predictions_filename)
    
    # Merge data
    data_df = prompts_df.merge(predictions_df, on=['prompt_id', 'category'])
    
    print(f"✓ Loaded {len(data_df)} prompts with predictions")
    print(f"✓ Group Code: {group_code}")
    print()
    print("Reminder of what you recorded in Module 1:")
    display(data_df[['prompt_id', 'category_display', 'ai_confidence', 'student_prediction']].head())
    
except FileNotFoundError as e:
    print(f"❌ ERROR: Could not find required files")
    print(f"Missing: {e.filename}")
    print("\nMake sure you:")
    print("1. Ran Module 0 (to generate prompts)")
    print("2. Ran Module 1 (to collect predictions)")
    print("3. Are using the same group code!")

## Verification Instructions

### How to Verify Each Response

For each of the 8 prompts, you'll need to verify whether the AI's response was actually accurate.

### Verification Strategies

**For factual claims:**
- Search Google for authoritative sources
- Check Wikipedia (but verify with additional sources)
- Look for government data, academic sources, or reputable news

**For citations and sources:**
- Search for the exact paper/book title in quotes
- Check if the authors exist and work in that field
- Verify the journal or publisher is real
- Try Google Scholar for academic papers
- **Red flag:** If you can't find it anywhere, it's likely fabricated

**For calculations:**
- Use a calculator or spreadsheet
- Double-check the mathematical reasoning
- Verify units and reasonableness
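
A calculator check can also be done in a notebook cell. The sketch below uses a made-up claim, not one of the lab prompts: suppose the AI said that $1,000 at 5% annual interest, compounded yearly, grows to $1,628.89 after 10 years. The numbers, the variable names, and the one-cent tolerance are all illustrative assumptions.

```python
import math

# Hypothetical claim to verify (illustrative numbers only)
principal = 1000.00      # dollars
annual_rate = 0.05       # 5% per year
years = 10
ai_claimed_value = 1628.89

# Recompute independently: compound interest, compounded once per year
recomputed = principal * (1 + annual_rate) ** years

# Compare within a one-cent tolerance to allow for rounding
matches = math.isclose(recomputed, ai_claimed_value, abs_tol=0.01)

print(f"Recomputed: {recomputed:.2f}")
print(f"AI claimed: {ai_claimed_value:.2f}")
print("Match" if matches else "Mismatch: record a factual or logic error")
```

Recomputing from the stated inputs, rather than re-reading the AI's own working, keeps the check independent of any mistake in its reasoning.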

**For logic problems:**
- Work through the problem yourself
- Check if the AI's reasoning makes sense step-by-step
- Look for common trick question patterns

**For recent events:**
- Remember that an AI model's knowledge stops at its training cutoff
- Check if the AI acknowledged this limitation
- Verify dates and current information

### What Counts as a Hallucination?

A hallucination is specific, false information that the AI invents and presents as fact:
- Fabricated citations (papers, authors, journals that don't exist)
- Invented statistics or numbers
- Made-up historical facts or dates
- False quotes attributed to real people

**Not hallucinations:**
- Vague or general statements
- Correct refusals ("I don't have access to real-time data")
- Acknowledged limitations

### Time Budget

Spend about **2 minutes per prompt**:
- 1 minute: Search and verify claims
- 1 minute: Record accuracy and error types

**Total: ~16 minutes for all 8 prompts**

## Data Collection Interface

Run this cell to start verifying each prompt's accuracy.

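For each prompt, you'll see the prompt text and what you recorded in Module 1 (the AI's confidence, your prediction, and any notes). You'll then record the actual accuracy, the error types, and the error details.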

In [None]:
# Initialize data storage
evaluations_data = []

# Accuracy options
accuracy_options = [
    'Select...',
    'Fully accurate - no errors found',
    'Mostly accurate - minor errors or imprecision',
    'Partially accurate - significant errors or omissions',
    'Inaccurate - major errors or hallucinations',
    'Refused or unable to answer'
]

# Error type options
error_types_options = [
    'No errors',
    'Factual error (wrong information)',
    'Hallucinated citation/source',
    'Logic error in reasoning',
    'Outdated information',
    'Overgeneralization',
    'Incomplete answer',
    'Misunderstood question'
]

def create_evaluation_interface(idx, row):
    """Create widgets for evaluating a single prompt."""
    
    # Display prompt information and previous recording
    print("="*70)
    print(f"PROMPT #{row['prompt_id']} of {len(data_df)}")
    print(f"Category: {row['category_display']}")
    print("="*70)
    print()
    print("THE PROMPT:")
    print("-"*70)
    print(row['prompt_text'])
    print("-"*70)
    print()
    print("WHAT YOU RECORDED IN MODULE 1:")
    print(f"  AI Confidence: {row['ai_confidence']}")
    print(f"  Your Prediction: {row['student_prediction']}")
    # str() guards against notes that pandas parsed as numbers
    if pd.notna(row['notes']) and str(row['notes']).strip():
        print(f"  Notes: {row['notes']}")
    print()
    print("NOW: Verify the actual accuracy of the AI's response")
    print("Use Google, Wikipedia, or other sources to check the facts.")
    print()
    
    # Create widgets
    accuracy_widget = widgets.Dropdown(
        options=accuracy_options,
        value='Select...',
        description='Actual Accuracy:',
        style={'description_width': 'initial'},
        layout={'width': '650px'}
    )
    
    error_types_widget = widgets.SelectMultiple(
        options=error_types_options,
        value=[],
        description='Error Types:',
        style={'description_width': 'initial'},
        layout={'width': '650px', 'height': '150px'}
    )
    
    error_details_widget = widgets.Textarea(
        value='',
        placeholder='Describe specific errors, hallucinations, or issues you found. For hallucinations, be specific (e.g., "Invented citation: claimed paper by Smith et al. in Nature 2023, but no such paper exists").', 
        description='Error Details:',
        style={'description_width': 'initial'},
        layout={'width': '650px', 'height': '100px'}
    )
    
    save_button = widgets.Button(
        description=f'Save and Continue to Prompt #{row["prompt_id"] + 1}' if idx < len(data_df) - 1 else 'Save Final Evaluation',
        button_style='success',
        layout={'width': '300px'}
    )
    
    output = widgets.Output()
    
    def on_save(b):
        with output:
            clear_output()
            
            # Validate inputs
            if accuracy_widget.value == 'Select...':
                print("⚠️ Please select an accuracy level before saving.")
                return
            
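            # Save data, replacing any earlier entry for this prompt so that
            # clicking Save twice does not create duplicate rows
            evaluations_data[:] = [e for e in evaluations_data if e['prompt_id'] != row['prompt_id']]
            evaluations_data.append({
                'prompt_id': row['prompt_id'],
                'category': row['category'],
                'actual_accuracy': accuracy_widget.value,
                'error_types': '; '.join(error_types_widget.value) if error_types_widget.value else 'None',
                'error_details': error_details_widget.value
            })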
            
            print(f"✓ Prompt #{row['prompt_id']} evaluated!")
            print(f"Progress: {len(evaluations_data)}/{len(data_df)} prompts completed")
            
            # Compare the AI's stated confidence with the verified accuracy
            was_accurate = accuracy_widget.value in ['Fully accurate - no errors found', 'Mostly accurate - minor errors or imprecision']
            was_confident = 'No caveats' in str(row['ai_confidence'])
            
            if was_confident and not was_accurate:
                print("⚠️ Overconfidence detected! AI was confident but inaccurate.")
            elif not was_confident and was_accurate:
                print("ℹ️ Underconfidence: AI was cautious but actually accurate.")
            elif was_confident and was_accurate:
                print("✓ Well-calibrated: AI was confident AND accurate.")
            else:
                print("ℹ️ Caution was warranted: AI hedged and the response was not accurate.")
            
            if idx < len(data_df) - 1:
                print(f"\n→ Scroll down for Prompt #{row['prompt_id'] + 1}")
            else:
                print("\n✓ All evaluations completed!")
                print("→ Scroll down to save your data.")
    
    save_button.on_click(on_save)
    
    # Display widgets
    display(accuracy_widget)
    display(Markdown("*Hold Ctrl (Windows) or Cmd (Mac) to select multiple error types:*"))
    display(error_types_widget)
    display(error_details_widget)
    display(save_button)
    display(output)
    print()
    print()

# Display interface for each prompt
for idx, row in data_df.iterrows():
    create_evaluation_interface(idx, row)

## Save Your Evaluations

After completing all 8 evaluations above, run this cell to save your data.
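
If you are unsure which prompts still need an evaluation, a sketch like the following lists them. The helper `unevaluated_prompt_ids` is hypothetical. It assumes the same shapes as the cells above: a DataFrame with a `prompt_id` column and a list of dicts that each carry a `prompt_id` key. The demo data is made up.

```python
import pandas as pd

def unevaluated_prompt_ids(data_df, evaluations_data):
    """Return sorted prompt IDs in data_df that have no saved evaluation."""
    done = {entry["prompt_id"] for entry in evaluations_data}
    return sorted(set(data_df["prompt_id"].tolist()) - done)

# Made-up demo data standing in for the notebook's variables
demo_df = pd.DataFrame({"prompt_id": [1, 2, 3, 4]})
demo_evaluations = [{"prompt_id": 1}, {"prompt_id": 3}]
print(unevaluated_prompt_ids(demo_df, demo_evaluations))
```

In the notebook you would call it with `data_df` and `evaluations_data` instead of the demo values.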

In [None]:
if len(evaluations_data) < len(data_df):
    print(f"⚠️ WARNING: You've only completed {len(evaluations_data)}/{len(data_df)} evaluations.")
    print("Please complete all prompts above before saving.")
else:
    # Create DataFrame
    evaluations_df = pd.DataFrame(evaluations_data)
    
    # Save to CSV
    evaluations_filename = f"lab10_group_{group_code}_evaluations.csv"
    evaluations_df.to_csv(evaluations_filename, index=False)
    
    # Merge all data for analysis
    complete_df = data_df.merge(evaluations_df, on=['prompt_id', 'category'])
    
    # Calculate summary statistics
    total_prompts = len(complete_df)
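    # Count only 'Fully accurate' and 'Mostly accurate'. Matching the bare word
    # 'accurate' would also count 'Inaccurate' and 'Partially accurate'.
    accurate_prompts = len(complete_df[complete_df['actual_accuracy'].str.contains(r'^(?:Fully|Mostly) accurate', na=False)])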
    confident_prompts = len(complete_df[complete_df['ai_confidence'].str.contains('No caveats', case=False, na=False)])
    overconfident = len(complete_df[
        complete_df['ai_confidence'].str.contains('No caveats', case=False, na=False) & 
        complete_df['actual_accuracy'].str.contains('Inaccurate|Partially accurate', case=False, na=False)
    ])
    
    # Display summary
    print("="*60)
    print("✓ Evaluations Saved Successfully!")
    print("="*60)
    print(f"File: {evaluations_filename}")
    print(f"Prompts evaluated: {total_prompts}")
    print()
    print("ACCURACY SUMMARY:")
    print(f"  Accurate responses: {accurate_prompts}/{total_prompts} ({accurate_prompts/total_prompts*100:.1f}%)")
    print(f"  Confident responses: {confident_prompts}/{total_prompts}")
    if confident_prompts > 0:
        print(f"  Overconfident responses: {overconfident}/{confident_prompts}")
        print(f"  Overconfidence rate: {overconfident/confident_prompts*100:.1f}%")
    else:
        print("  Overconfident responses: n/a (no confident responses recorded)")
    print()
    print("Distribution of Actual Accuracy:")
    print(evaluations_df['actual_accuracy'].value_counts())
    print()
    print("="*60)
    print()
    print("Next Steps:")
    print("1. Answer Q8-Q12 on your lab handout")
    print("2. Continue to Module 3 for visualizations and analysis")
    print("3. Use the same group code in Module 3!")
    print("="*60)

## Preview Your Complete Data

Optional: View the merged dataset with predictions and evaluations.

In [None]:
if len(evaluations_data) == len(data_df):
    complete_df = data_df.merge(pd.DataFrame(evaluations_data), on=['prompt_id', 'category'])
    display_cols = ['prompt_id', 'category_display', 'ai_confidence', 'actual_accuracy', 'error_types']
    display(HTML(complete_df[display_cols].to_html(index=False, classes='table table-striped')))
else:
    print("Complete all evaluations and save data first.")

## Questions for Module 2

Answer these questions on your lab handout using the data you just collected.

### Q8: Verification Method for Prompt #1

For Prompt #1, was the AI's response actually accurate? How did you verify this? (What sources did you use?)

*(Answer on your handout)*

### Q9: Overconfidence Example

Identify ONE prompt where the AI was confident but made errors. What prompt was it? What went wrong?

*(Answer on your handout)*

### Q10: Uncertainty and Accuracy

Identify ONE prompt where the AI expressed uncertainty (caveats or refusal). Was the AI actually accurate or inaccurate on that prompt?

*(Answer on your handout)*
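
For Q9 and Q10, a cross-tabulation of confidence against accuracy can point you to candidate prompts. This is a sketch on made-up rows; in the notebook you would use `complete_df` instead of `demo`. The column names and accuracy labels follow the cells above, but the `ai_confidence` labels here are placeholders, since the real ones come from Module 1.

```python
import pandas as pd

# Made-up rows; the ai_confidence labels are placeholders
demo = pd.DataFrame({
    "prompt_id": [1, 2, 3, 4],
    "ai_confidence": ["No caveats", "No caveats", "Some caveats", "Some caveats"],
    "actual_accuracy": [
        "Fully accurate - no errors found",
        "Inaccurate - major errors or hallucinations",
        "Mostly accurate - minor errors or imprecision",
        "Partially accurate - significant errors or omissions",
    ],
})

# Rows: confidence level; columns: verified accuracy; cells: prompt counts
table = pd.crosstab(demo["ai_confidence"], demo["actual_accuracy"])
print(table)

# Q9 candidates: confident answers that turned out to be wrong
confident = demo["ai_confidence"].str.contains("No caveats", na=False)
wrong = demo["actual_accuracy"].str.contains(r"^(?:Inaccurate|Partially accurate)", na=False)
q9_candidates = demo.loc[confident & wrong, "prompt_id"].tolist()
print(q9_candidates)
```

Swapping `confident` for `~confident` gives the cautious responses that Q10 asks about.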

### Q11: Hallucination Example

Did you find any "hallucinated" information—specific false details the AI presented as fact? Give a concrete example.

*(Answer on your handout)*

### Q12: Category with Most Errors

Which category of prompts led to the most errors in your group's data? Why do you think this category is difficult for AI?

*(Answer on your handout)*
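
For Q12, one way to tally errors per category is to flag each response as accurate or not and then group by category. The sketch below runs on made-up rows with invented category names; in the notebook you would use `complete_df`. Treating only "Fully accurate" and "Mostly accurate" as accurate mirrors the evaluation cell, but how to count refusals is a judgment call for your group.

```python
import pandas as pd

# Made-up rows; the category names are invented for illustration
demo = pd.DataFrame({
    "category": ["citations", "citations", "math", "math"],
    "actual_accuracy": [
        "Inaccurate - major errors or hallucinations",
        "Partially accurate - significant errors or omissions",
        "Fully accurate - no errors found",
        "Inaccurate - major errors or hallucinations",
    ],
})

# True where the response was rated fully or mostly accurate
is_accurate = demo["actual_accuracy"].str.contains(r"^(?:Fully|Mostly) accurate", na=False)

# Count responses with errors in each category, most errors first
errors_by_category = (
    demo.assign(has_error=~is_accurate)
        .groupby("category")["has_error"]
        .sum()
        .sort_values(ascending=False)
)
print(errors_by_category)
```

With only a few prompts per category, treat the ranking as a starting point for discussion rather than a firm result.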

## Next Steps

1. **Answer Q8-Q12** on your lab handout
2. **Continue to Module 3**, where you'll visualize patterns of overconfidence
3. **Use the same group code** in Module 3

In Module 3, you'll see the big picture - how often confidence matched accuracy!