# Lab 10 - Module 2: Evaluating AI Responses

**Time:** ~15-20 minutes

In this module, you'll verify the actual accuracy of the AI responses you collected in Module 1.

For each prompt, you'll:
- Verify the AI's claims using Google, Wikipedia, or other sources
- Record the actual accuracy level
- Identify specific error types (if any)
- Document hallucinations or fabrications

This is where you'll discover whether the AI's confidence matched its correctness!
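
To make "confidence matched correctness" concrete, here is a minimal sketch on toy data of how an overconfidence rate can be computed. The column names `ai_confidence` and `actual_accuracy` come from this lab; the toy rows and the "Some caveats" label are hypothetical stand-ins, and your Module 1 wording may differ.

```python
import pandas as pd

# Toy data only: the confidence labels are hypothetical stand-ins for Module 1's options.
toy = pd.DataFrame({
    "ai_confidence": ["No caveats", "No caveats", "Some caveats", "No caveats"],
    "actual_accuracy": [
        "Fully accurate - no errors found",
        "Inaccurate - major errors or hallucinations",
        "Partially accurate - significant errors or omissions",
        "Mostly accurate - minor errors or imprecision",
    ],
})

# A response is "confident" if the AI gave no caveats,
# and "wrong" if it was rated Inaccurate or Partially accurate.
confident = toy["ai_confidence"].str.contains("No caveats", na=False)
wrong = toy["actual_accuracy"].str.contains(r"^(?:Inaccurate|Partially accurate)", na=False)

overconfident = int((confident & wrong).sum())
n_confident = int(confident.sum())
rate = overconfident / n_confident
print(f"Overconfident: {overconfident}/{n_confident} ({rate:.1%})")
```

On this toy table, one of the three confident responses was wrong, so the rate is about 33%.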

## Setup: Import Libraries and Load Data

In [None]:
import numpy as np
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML, Markdown

print("✓ Libraries loaded successfully!")

In [None]:
group_code = int(input("Enter your group code: "))

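# LAB_DIR is not defined anywhere in this notebook. Assumption: fall back to
# the current directory unless an earlier cell has already set it.
if 'LAB_DIR' not in globals():
    LAB_DIR = "."

# Load prompts and predictions
prompts_filename = f"{LAB_DIR}/lab10_group_{group_code}_prompts.csv"
predictions_filename = f"{LAB_DIR}/lab10_group_{group_code}_predictions.csv"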

try:
    prompts_df = pd.read_csv(prompts_filename)
    predictions_df = pd.read_csv(predictions_filename)
    
    # Merge data
    data_df = prompts_df.merge(predictions_df, on=['prompt_id', 'category'])
    
    print(f"✓ Loaded {len(data_df)} prompts with predictions")
    print(f"✓ Group Code: {group_code}")
    print()
    print("Reminder of what you recorded in Module 1:")
    display(data_df[['prompt_id', 'category_display', 'ai_confidence', 'student_prediction']].head())
    
except FileNotFoundError as e:
    print(f"❌ ERROR: Could not find required files")
    print(f"Missing: {e.filename}")
    print("\nMake sure you:")
    print("1. Ran Module 0 (to generate prompts)")
    print("2. Ran Module 1 (to collect predictions)")
    print("3. Are using the same group code!")

# Fallback: the next cell regenerates your prompts from the group code
# entered above, so the code does not need to be entered again.
print(f"✓ Group Code: {group_code}")
print("Regenerating your prompts...")

In [None]:
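def generate_group_prompts(group_code, num_prompts=8):
    """Generate deterministic prompts for a group.

    Seeds NumPy with group_code and draws one prompt from each of the
    8 category pools. Note: num_prompts is not used by this function.
    """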
    np.random.seed(group_code)

    # Prompt pools (10 per category)
    pools = {
        'factual_recall': [
            "List the 5 largest freshwater lakes in Africa by volume.",
            "What was the population of Iceland in 1950?",
            "Name all countries that border Mongolia.",
            "Who was the Prime Minister of Canada in 1985?",
            "What is the capital city of Bhutan?",
            "List all chemical elements discovered before 1800.",
            "What year was the first transatlantic telegraph cable completed?",
            "Name the five longest rivers in South America by length.",
            "Who won the Nobel Prize in Literature in 1962?",
            "What is the elevation of Mount Kilimanjaro in meters?"
        ],
        'reasoning_chain': [
            "If it takes 5 machines 5 minutes to make 5 widgets, how long does it take 100 machines to make 100 widgets?",
            "A bat and ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?",
            "If you overtake the person in second place in a race, what place are you in?",
            "A farmer has 17 sheep. All but 9 die. How many sheep remain?",
            "You have a 3-gallon jug and a 5-gallon jug. How can you measure exactly 4 gallons?",
            "If a plane crashes on the border of the US and Canada, where do you bury the survivors?",
            "A rope ladder hangs over the side of a ship. The rungs are 1 foot apart. At low tide, 10 rungs are below water. When the tide rises 5 feet, how many rungs are below water?",
            "Mary's father has 5 daughters: Nana, Nene, Nini, Nono. What is the 5th daughter's name?",
            "A book costs $10 plus half its price. What's the total price?",
            "If a doctor gives you 3 pills and says to take one every half hour, how long until you've taken all of them?"
        ],
        'citation_request': [
            "List 3 peer-reviewed studies published since 2020 about standing desks and productivity, including authors and journal names.",
            "Provide 2 court cases from 2023 about cryptocurrency regulation in Texas, including case numbers.",
            "Cite 3 research papers on the health effects of intermittent fasting published in Nature or Science since 2022.",
            "List some quotes from Shakespeare plays that mention the word 'computer'.",
            "Provide citations for 3 studies showing that coffee consumption prevents Alzheimer's disease, published after 2020.",
            "Give me 2 New York Times articles from March 2024 about artificial intelligence regulation.",
            "Cite 3 peer-reviewed sources on the effectiveness of homeopathy from major medical journals.",
            "List 2 Supreme Court cases from 2024 about social media content moderation.",
            "Provide 3 academic sources confirming that humans only use 10% of their brains.",
            "Cite research papers from Harvard Medical School published in 2023 about vitamin C curing cancer."
        ],
        'ambiguous_query': [
            "Is it safe to eat?",
            "How long does it take?",
            "What's the best programming language?",
            "Should I invest in that?",
            "Is this normal?",
            "Will it rain tomorrow?",
            "What's the right temperature?",
            "How much should I pay?",
            "Is this good enough?",
            "When should I do it?"
        ],
        'recent_events': [
            "Who won the Nobel Prize in Physics this year?",
            "What were today's closing stock prices for Apple and Microsoft?",
            "What is the current COVID-19 vaccination rate in Japan?",
            "Who won yesterday's football game?",
            "What was the outcome of last week's election?",
            "What is the current inflation rate in the United States?",
            "Who is the current Secretary-General of the United Nations?",
            "What are the latest updates on the Mars rover mission?",
            "What movies are currently in theaters?",
            "What is today's temperature in Paris?"
        ],
        'mathematical': [
            "What is 17 × 23 × 19?",
            "If I invest $1,000 at 7% annual compound interest for 15 years, how much will I have?",
            "Convert 47°C to Fahrenheit.",
            "What is the area of a circle with radius 8.5 cm?",
            "Calculate 15% of 840.",
            "If a car travels 285 miles using 12 gallons of gas, what is the miles per gallon?",
            "What is the square root of 2,704?",
            "Convert 5.5 kilometers to miles (1 km = 0.621371 miles).",
            "What is 2^10?",
            "Calculate the sum of all integers from 1 to 100."
        ],
        'commonsense': [
            "What safety precautions should you take when using a ladder?",
            "Why do we refrigerate milk?",
            "How can you tell if water is boiling?",
            "What should you do if you smell gas in your house?",
            "Why is it important to wash your hands before eating?",
            "What are the signs that a banana is ripe?",
            "How do you know when it's safe to cross the street?",
            "Why do cars have seat belts?",
            "What should you do if you see smoke coming from a building?",
            "How can you tell if an egg is fresh?"
        ],
        'edge_case': [
            "How many months have 28 days?",
            "A doctor has a brother, but the brother has no brothers. How is this possible?",
            "How many animals did Moses take on the ark?",
            "What do you call a person who keeps talking when nobody is listening?",
            "If you have a bowl with 6 apples and you take away 4, how many do you have?",
            "What occurs once in a minute, twice in a moment, but never in a thousand years?",
            "How much dirt is in a hole that's 2 feet wide, 3 feet long, and 4 feet deep?",
            "Before Mount Everest was discovered, what was the highest mountain in the world?",
            "Is it legal for a man to marry his widow's sister?",
            "What word is always spelled incorrectly?"
        ]
    }

    prompts = []
    for i, (category, pool) in enumerate(pools.items()):
        idx = np.random.randint(0, len(pool))
        prompts.append({
            'prompt_id': i + 1,
            'category': category,
            'category_display': category.replace('_', ' ').title(),
            'prompt_text': pool[idx]
        })

    return prompts

# Generate prompts
prompts = generate_group_prompts(group_code)
prompts_df = pd.DataFrame(prompts)

print(f"✓ Regenerated {len(prompts)} prompts for group {group_code}")
print("\nYou can now use these prompts to verify the accuracy of the AI responses you collected in Module 1.")

In [None]:
# Initialize data storage
evaluations_data = []

# Accuracy options
accuracy_options = [
    'Select...',
    'Fully accurate - no errors found',
    'Mostly accurate - minor errors or imprecision',
    'Partially accurate - significant errors or omissions',
    'Inaccurate - major errors or hallucinations',
    'Refused or unable to answer'
]

# Error type options
error_types_options = [
    'No errors',
    'Factual error (wrong information)',
    'Hallucinated citation/source',
    'Logic error in reasoning',
    'Outdated information',
    'Overgeneralization',
    'Incomplete answer',
    'Misunderstood question'
]

def create_evaluation_interface(idx, row):
    """Create widgets for evaluating a single prompt."""
    
    # Display prompt information
    print("="*70)
    print(f"PROMPT #{row['prompt_id']} of {len(prompts_df)}")
    print(f"Category: {row['category_display']}")
    print("="*70)
    print()
    print("THE PROMPT:")
    print("-"*70)
    print(row['prompt_text'])
    print("-"*70)
    print()
    print("NOW: Verify the actual accuracy of the AI's response from Module 1")
    print("Use Google, Wikipedia, or other sources to check the facts.")
    print("Record your findings on your Lab 10 Answer Sheet.")
    print()
    
    # Create widgets
    accuracy_widget = widgets.Dropdown(
        options=accuracy_options,
        value='Select...',
        description='Actual Accuracy:',
        style={'description_width': 'initial'},
        layout={'width': '650px'}
    )
    
    error_types_widget = widgets.SelectMultiple(
        options=error_types_options,
        value=[],
        description='Error Types:',
        style={'description_width': 'initial'},
        layout={'width': '650px', 'height': '150px'}
    )
    
    error_details_widget = widgets.Textarea(
        value='',
        placeholder='Describe specific errors, hallucinations, or issues you found. For hallucinations, be specific (e.g., "Invented citation: claimed paper by Smith et al. in Nature 2023, but no such paper exists"). Record this on your answer sheet.', 
        description='Error Details:',
        style={'description_width': 'initial'},
        layout={'width': '650px', 'height': '100px'}
    )
    
    save_button = widgets.Button(
        description=f'Continue to Prompt #{row["prompt_id"] + 1}' if idx < len(prompts_df) - 1 else 'Finish Evaluations',
        button_style='success',
        layout={'width': '300px'}
    )
    
    output = widgets.Output()
    
    def on_save(b):
        with output:
            clear_output()
            
            # Validate inputs
            if accuracy_widget.value == 'Select...':
                print("⚠️ Please select an accuracy level before continuing.")
                return
            
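            # Record the evaluation, replacing any earlier entry for this prompt
            # so that clicking the button again does not create duplicates
            evaluations_data[:] = [e for e in evaluations_data if e['prompt_id'] != row['prompt_id']]
            evaluations_data.append({
                'prompt_id': row['prompt_id'],
                'category': row['category'],
                'actual_accuracy': accuracy_widget.value,
                'error_types': '; '.join(error_types_widget.value) if error_types_widget.value else 'None',
                'error_details': error_details_widget.value
            })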
            
            print(f"✓ Prompt #{row['prompt_id']} evaluation noted!")
            print(f"Progress: {len(evaluations_data)}/{len(prompts_df)} prompts completed")
            print()
            print("Remember to record this on your Lab 10 Answer Sheet.")
            
            if idx < len(prompts_df) - 1:
                print(f"\n→ Scroll down for Prompt #{row['prompt_id'] + 1}")
            else:
                print("\n✓ All evaluations completed!")
                print("→ Make sure all findings are recorded on your answer sheet.")
    
    save_button.on_click(on_save)
    
    # Display widgets
    display(accuracy_widget)
    display(Markdown("*Hold Ctrl (Windows) or Cmd (Mac) to select multiple error types:*"))
    display(error_types_widget)
    display(error_details_widget)
    display(save_button)
    display(output)
    print()
    print()

# Display interface for each prompt
for idx, row in prompts_df.iterrows():
    create_evaluation_interface(idx, row)

In [None]:
if len(evaluations_data) < len(data_df):
    print(f"⚠️ WARNING: You've only completed {len(evaluations_data)}/{len(data_df)} evaluations.")
    print("Please complete all prompts above before saving.")
else:
    # Create DataFrame
    evaluations_df = pd.DataFrame(evaluations_data)
    
    # Save to CSV
    evaluations_filename = f"{LAB_DIR}/lab10_group_{group_code}_evaluations.csv"
    evaluations_df.to_csv(evaluations_filename, index=False)
    
    # Merge all data for analysis
    complete_df = data_df.merge(evaluations_df, on=['prompt_id', 'category'])
    
    # Calculate summary statistics
    total_prompts = len(complete_df)
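    # Count only 'Fully accurate' and 'Mostly accurate'. A plain search for
    # 'accurate' would also match 'Inaccurate' and 'Partially accurate'.
    accurate_prompts = len(complete_df[complete_df['actual_accuracy'].str.contains(r'^(?:Fully|Mostly) accurate', case=False, na=False)])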
    confident_prompts = len(complete_df[complete_df['ai_confidence'].str.contains('No caveats', case=False, na=False)])
    overconfident = len(complete_df[
        complete_df['ai_confidence'].str.contains('No caveats', case=False, na=False) & 
        complete_df['actual_accuracy'].str.contains('Inaccurate|Partially accurate', case=False, na=False)
    ])
    
    # Display summary
    print("="*60)
    print("✓ Evaluations Saved Successfully!")
    print("="*60)
    print(f"File: {evaluations_filename}")
    print(f"Prompts evaluated: {total_prompts}")
    print()
    print("ACCURACY SUMMARY:")
    print(f"  Accurate responses: {accurate_prompts}/{total_prompts} ({accurate_prompts/total_prompts*100:.1f}%)")
    print(f"  Confident responses: {confident_prompts}/{total_prompts}")
    print(f"  Overconfident responses: {overconfident}/{confident_prompts}")
    if confident_prompts > 0:
        print(f"  Overconfidence rate: {overconfident/confident_prompts*100:.1f}%")
    print()
    print("Distribution of Actual Accuracy:")
    print(evaluations_df['actual_accuracy'].value_counts())
    print()
    print("="*60)
    print()
    print("Next Steps:")
    print("1. Answer Q8-Q12 on your lab handout")
    print("2. Continue to Module 3 for visualizations and analysis")
    print("3. Use the same group code in Module 3!")
    print("="*60)

In [None]:
if len(evaluations_data) == len(data_df):
    complete_df = data_df.merge(pd.DataFrame(evaluations_data), on=['prompt_id', 'category'])
    display_cols = ['prompt_id', 'category_display', 'ai_confidence', 'actual_accuracy', 'error_types']
    display(HTML(complete_df[display_cols].to_html(index=False, classes='table table-striped')))
else:
    print("Complete all evaluations and save data first.")

### Q10: Uncertainty and Accuracy

Identify ONE prompt where the AI expressed uncertainty (caveats or refusal). Was the AI actually accurate or inaccurate on that prompt?

*(Answer on your handout)*
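
If you want to find candidate prompts for this question programmatically, the sketch below filters for rows where the AI hedged. It assumes, as the summary cell above does, that a fully confident response contains the text "No caveats" in `ai_confidence`. The helper name `uncertain_prompts` and the toy rows are hypothetical; in the notebook you could pass `complete_df` instead of `toy`.

```python
import pandas as pd

def uncertain_prompts(df):
    """Return rows whose ai_confidence does not contain 'No caveats'.

    Rows with a missing ai_confidence value are also returned.
    """
    hedged = ~df["ai_confidence"].str.contains("No caveats", case=False, na=False)
    return df.loc[hedged, ["prompt_id", "ai_confidence", "actual_accuracy"]]

# Toy data only: the confidence labels are hypothetical.
toy = pd.DataFrame({
    "prompt_id": [1, 2, 3],
    "ai_confidence": ["No caveats", "Some caveats", "Refused to answer"],
    "actual_accuracy": [
        "Fully accurate - no errors found",
        "Mostly accurate - minor errors or imprecision",
        "Refused or unable to answer",
    ],
})

print(uncertain_prompts(toy).to_string(index=False))
```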

### Q12: Category with Most Errors

Which category of prompts led to the most errors in your group's data? Why do you think this category is difficult for AI?

*(Answer on your handout)*
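
Because each group sees only one prompt per category, one rough way to compare categories is to count how many error types you flagged for each. The sketch below does this by splitting the `'; '`-joined `error_types` column that the evaluation cell saves. The helper name `error_type_counts` and the toy rows are hypothetical; in the notebook you could pass `complete_df` instead of `toy`.

```python
import pandas as pd

def error_type_counts(df):
    """Count flagged error types per category, skipping 'None' and 'No errors'."""
    exploded = (
        df.assign(error_type=df["error_types"].fillna("").str.split("; "))
          .explode("error_type", ignore_index=True)
    )
    # Drop placeholders that do not represent a real error
    real = exploded[~exploded["error_type"].isin(["", "None", "No errors"])]
    return real.groupby("category").size().sort_values(ascending=False)

# Toy data only.
toy = pd.DataFrame({
    "category": ["factual_recall", "citation_request", "mathematical"],
    "error_types": [
        "Factual error (wrong information)",
        "Hallucinated citation/source; Factual error (wrong information); Incomplete answer",
        "No errors",
    ],
})

counts = error_type_counts(toy)
print(counts)
```

Categories with no flagged errors do not appear in the result at all, so a missing category means zero errors, not missing data.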

## Next Steps

1. **Answer Q8-Q12** on your lab handout
2. **Remember your group code**
3. **Continue to Module 3** where you'll visualize patterns of overconfidence
4. **Use the same group code** in Module 3

In Module 3, you'll see the big picture - how often confidence matched accuracy!