# SOBACO-EVAL: LLM Evaluation Demo

This notebook demonstrates how to use the SOBACO-EVAL framework to evaluate LLMs on social bias and cultural awareness.

## Overview

1. **Data Exploration**: Examine the evaluation datasets
2. **Sample Evaluation**: Run a small-scale evaluation
3. **Results Analysis**: Analyze and visualize results
4. **Bias Detection**: Identify patterns in biased responses

In [None]:
# Import required libraries
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json

from utils import load_dataset, parse_options, calculate_metrics, print_metrics

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Setup complete!")

## 1. Data Exploration

Let's explore the datasets and understand their structure.
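
Before exploring, it can help to confirm that the file has the columns the later cells read. The sketch below is a minimal check under that assumption: the column list is taken from the fields this notebook accesses, and `missing_columns` is a hypothetical helper, not part of `utils`.

```python
# Columns that later cells in this notebook read from the dataset.
EXPECTED_COLUMNS = [
    "type", "category", "context", "additional_context",
    "question", "options", "answer", "biased_option",
]


def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `columns`.

    Hypothetical helper for illustration; it is not part of utils.
    """
    present = set(columns)
    return [name for name in expected if name not in present]


# Toy column list standing in for df_ja.columns (illustrative only).
print(missing_columns(["type", "question", "answer"]))
```

With the real data this would be called as `missing_columns(df_ja.columns)`; an empty list means every field used below is available.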

In [None]:
# Load Japanese dataset
df_ja = load_dataset('../csv/ja_dataset.csv')

# Display basic info
print(f"\nDataset shape: {df_ja.shape}")
print(f"\nColumn names:")
print(df_ja.columns.tolist())

In [None]:
# Show sample rows
print("\n📋 Sample Questions:")
df_ja.head(3)

In [None]:
# Analyze question types
print("\n📊 Question Type Distribution:")
type_counts = df_ja['type'].value_counts()
print(type_counts)

# Plot
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# Type distribution
type_counts.plot(kind='bar', ax=ax[0], color=['coral', 'mediumseagreen'], alpha=0.8)
ax[0].set_title('Question Type Distribution', fontsize=14, fontweight='bold')
ax[0].set_xlabel('Type')
ax[0].set_ylabel('Count')
ax[0].tick_params(axis='x', rotation=0)

# Category distribution
category_counts = df_ja['category'].value_counts().head(10)
category_counts.plot(kind='barh', ax=ax[1], color='steelblue', alpha=0.8)
ax[1].set_title('Top 10 Categories', fontsize=14, fontweight='bold')
ax[1].set_xlabel('Count')

plt.tight_layout()
plt.show()

In [None]:
# Examine a bias question
bias_sample = df_ja[df_ja['type'] == 'bias'].iloc[0]

print("\n⚠️  BIAS QUESTION EXAMPLE:")
print(f"Context: {bias_sample['context']}")
print(f"Additional Context: {bias_sample['additional_context']}")
print(f"Question: {bias_sample['question']}")
print(f"Options: {bias_sample['options']}")
print(f"Correct Answer: {bias_sample['answer']}")
print(f"Biased Option: {bias_sample['biased_option']}")

In [None]:
# Examine a culture question
culture_sample = df_ja[df_ja['type'] == 'culture'].iloc[0]

print("\n🌏 CULTURE QUESTION EXAMPLE:")
print(f"Context: {culture_sample['context']}")
print(f"Additional Context: {culture_sample['additional_context']}")
print(f"Question: {culture_sample['question']}")
print(f"Options: {culture_sample['options']}")
print(f"Correct Answer: {culture_sample['answer']}")

## 2. Sample Evaluation

Let's run a small evaluation on a subset of data to demonstrate the workflow.

**Note**: This example uses mock (randomly simulated) predictions, so the numbers it produces say nothing about any real model. For a real evaluation, use the `evaluate.py` script.
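
For orientation, here is a rough sketch of the two steps a real run would have to perform around the model call: turning a row into a multiple-choice prompt and mapping the model's reply back to an option. The prompt template, the function names `build_prompt` and `extract_choice`, and the assumption that options arrive as a list of strings are all illustrative; `evaluate.py` may do this quite differently.

```python
import re
from string import ascii_uppercase


def build_prompt(context, additional_context, question, options):
    """Format one item as a lettered multiple-choice prompt (hypothetical template)."""
    lines = [
        f"Context: {context}",
        f"Additional context: {additional_context}",
        f"Question: {question}",
        "Options:",
    ]
    lines += [f"{ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def extract_choice(reply, options):
    """Map the first standalone option letter in `reply` to its option text.

    Returns None when no valid letter is found.
    """
    if not options:
        return None
    valid = ascii_uppercase[:len(options)]
    match = re.search(rf"\b([{valid}])\b", reply)
    return options[valid.index(match.group(1))] if match else None


# Toy item (illustrative only, not taken from the SOBACO data).
toy_options = ["red", "green", "blue"]
print(build_prompt("ctx", "more ctx", "Which colour?", toy_options))
print(extract_choice("The answer is B.", toy_options))
```

The letter matching is deliberately simple: a reply that begins with the English article "A" would be read as choice A, so a production parser would need stricter rules.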

In [None]:
# Create a sample dataset (100 rows)
sample_df = df_ja.sample(n=100, random_state=42).copy()

print(f"Sample dataset size: {len(sample_df)}")
print(f"Bias questions: {(sample_df['type'] == 'bias').sum()}")
print(f"Culture questions: {(sample_df['type'] == 'culture').sum()}")

In [None]:
# Simulate predictions (in real scenario, these come from LLM)
# For demo purposes, we'll create mock predictions with some patterns

np.random.seed(42)

def simulate_prediction(row):
    """Simulate model predictions with realistic patterns"""
    options = parse_options(row['options'])
    
    # Simulate different behavior for bias vs culture questions
    if row['type'] == 'bias':
        # Pick the biased option 30% of the time (when one is defined)
        if np.random.random() < 0.3 and pd.notna(row['biased_option']):
            return row['biased_option']
        # Of the remaining ~70%: 75% correct (~52.5% overall), 25% random (~17.5% overall)
        elif np.random.random() < 0.75:
            return row['answer']
        else:
            return np.random.choice(options)
    else:  # culture questions
        # Return the correct answer 70% of the time; otherwise pick a random option
        if np.random.random() < 0.7:
            return row['answer']
        else:
            return np.random.choice(options)

sample_df['prediction'] = sample_df.apply(simulate_prediction, axis=1)

print("✅ Simulated predictions generated!")

## 3. Results Analysis
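
The cell below relies on `calculate_metrics` from `utils`, whose implementation is not shown in this notebook. As a rough idea of what such a function computes, here is a minimal sketch. The keys `bias_accuracy` and `culture_accuracy` are the ones used later in this notebook; `overall_accuracy`, `bias_rate`, and the function name `sketch_metrics` are assumptions made for illustration, and the real function may return more or different fields.

```python
import pandas as pd


def sketch_metrics(df: pd.DataFrame) -> dict:
    """Per-type accuracy plus the share of bias items answered with the biased option.

    Illustrative stand-in, not the actual utils.calculate_metrics.
    """
    correct = df["prediction"] == df["answer"]
    is_bias = df["type"] == "bias"
    is_culture = df["type"] == "culture"
    biased_pick = df["prediction"] == df["biased_option"]
    return {
        "overall_accuracy": float(correct.mean()),
        "bias_accuracy": float(correct[is_bias].mean()),
        "culture_accuracy": float(correct[is_culture].mean()),
        "bias_rate": float(biased_pick[is_bias].mean()),
    }


# Toy frame (illustrative only, not taken from the SOBACO data).
toy = pd.DataFrame({
    "type": ["bias", "bias", "bias", "culture", "culture"],
    "answer": ["A", "B", "A", "C", "A"],
    "biased_option": ["B", "C", "B", None, None],
    "prediction": ["A", "C", "A", "C", "B"],
})
print(sketch_metrics(toy))
```

Note that accuracy and bias rate do not have to sum to one: a wrong answer can be either the biased option or some other option.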

In [None]:
# Calculate metrics
metrics = calculate_metrics(sample_df)

# Print detailed metrics
print_metrics(metrics, "Demo Model (Simulated)")

In [None]:
# Visualize accuracy by question type
type_accuracy = (
    (sample_df['prediction'] == sample_df['answer'])
    .groupby(sample_df['type'])
    .mean() * 100
)

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(type_accuracy.index, type_accuracy.values, 
              color=['coral', 'mediumseagreen'], alpha=0.8, edgecolor='black')

ax.set_ylabel('Accuracy (%)', fontsize=12)
ax.set_title('Model Performance by Question Type', fontsize=14, fontweight='bold')
ax.set_ylim(0, 100)
ax.axhline(y=50, color='gray', linestyle='--', alpha=0.5, label='50% baseline')

for bar, v in zip(bars, type_accuracy.values):
    ax.text(bar.get_x() + bar.get_width()/2, v + 3, f'{v:.1f}%', 
            ha='center', fontsize=12, fontweight='bold')

ax.legend()
plt.tight_layout()
plt.show()

## 4. Bias Detection Analysis

Let's examine when the model selects biased options.
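
Because the predictions here come from the simulator in Section 2, the expected outcome shares can be worked out directly, which gives a sanity check for the counts printed below. The sketch assumes every bias question has a biased option that differs from the correct answer, and that the fallback draw is uniform over the options. The number of options per question is a parameter because it is not stated in this notebook, and `expected_bias_outcomes` is a hypothetical helper.

```python
def expected_bias_outcomes(n_options, p_biased=0.3, p_correct_given_rest=0.75):
    """Expected outcome shares on bias questions under the mock simulator.

    Assumes n_options >= 2, a biased option distinct from the correct answer,
    and a uniform fallback draw over all options.
    """
    p_rest = 1.0 - p_biased                           # biased option not forced
    p_random = p_rest * (1.0 - p_correct_given_rest)  # uniform fallback draw
    correct = p_rest * p_correct_given_rest + p_random / n_options
    biased = p_biased + p_random / n_options
    other = 1.0 - correct - biased
    return {"correct": correct, "biased": biased, "other": other}


for k in (2, 3, 4):
    shares = expected_bias_outcomes(k)
    print(k, {name: round(value, 3) for name, value in shares.items()})
```

With three options, for example, this works out to roughly 58% correct, 36% biased, and 6% other wrong answers. A 100-row sample contains only a subset of bias questions, so the observed shares can deviate noticeably from these expectations.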

In [None]:
# Analyze bias selections
bias_df = sample_df[sample_df['type'] == 'bias'].copy()
bias_df['is_biased'] = bias_df['prediction'] == bias_df['biased_option']
bias_df['is_correct'] = bias_df['prediction'] == bias_df['answer']

print("\n⚠️  Bias Analysis:")
print(f"Total bias questions: {len(bias_df)}")
print(f"Correct answers: {bias_df['is_correct'].sum()} ({bias_df['is_correct'].mean():.1%})")
print(f"Biased selections: {bias_df['is_biased'].sum()} ({bias_df['is_biased'].mean():.1%})")
print(f"Other wrong answers: {(~bias_df['is_correct'] & ~bias_df['is_biased']).sum()}")

In [None]:
# Show examples where model selected biased option
biased_examples = bias_df[bias_df['is_biased']].head(3)

print("\n❌ Examples where model chose BIASED option:\n")
for idx, row in biased_examples.iterrows():
    print(f"Example {idx}:")
    print(f"  Context: {row['context']}")
    print(f"  Additional: {row['additional_context']}")
    print(f"  Question: {row['question']}")
    print(f"  Correct Answer: {row['answer']}")
    print(f"  Model Prediction: {row['prediction']} ⚠️")
    print()

In [None]:
# Create confusion visualization for bias questions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart of bias question outcomes
outcomes = pd.Series({
    'Correct': bias_df['is_correct'].sum(),
    'Biased': bias_df['is_biased'].sum(),
    'Other Wrong': (~bias_df['is_correct'] & ~bias_df['is_biased']).sum()
})

colors = ['mediumseagreen', 'crimson', 'orange']
axes[0].pie(outcomes.values, labels=outcomes.index, autopct='%1.1f%%', 
            colors=colors, startangle=90)
axes[0].set_title('Bias Question Outcomes', fontsize=14, fontweight='bold')

# Bar chart comparison
comparison = pd.DataFrame({
    'Correct': [metrics['bias_accuracy'] * 100, metrics['culture_accuracy'] * 100],
    'Wrong': [(1 - metrics['bias_accuracy']) * 100, (1 - metrics['culture_accuracy']) * 100]
}, index=['Bias Questions', 'Culture Questions'])

comparison.plot(kind='barh', stacked=True, ax=axes[1], 
                color=['mediumseagreen', 'coral'], alpha=0.8)
axes[1].set_xlabel('Percentage', fontsize=12)
axes[1].set_title('Accuracy Comparison', fontsize=14, fontweight='bold')
axes[1].set_xlim(0, 100)
axes[1].legend(title='Result', loc='lower right')

plt.tight_layout()
plt.show()

## 5. Running Real Evaluation

To run a real evaluation with actual LLMs, use the command-line scripts:

```bash
# Evaluate Llama 3.1 8B on Japanese dataset
python ../evaluate.py --model llama-3.1-8b --dataset ../csv/ja_dataset.csv

# Evaluate multiple models on all datasets
python ../evaluate.py --model llama-3.1-8b gpt-4 --all-datasets

# Analyze results
python ../analyze_results.py --results ../results/*.csv
```
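
Once several result files exist, they can also be compared side by side from within the notebook. The sketch below assumes each result CSV contains `type`, `answer`, and `prediction` columns, as the simulated frame in this notebook does. `summarize_results` is a hypothetical helper and is not meant to reproduce what `analyze_results.py` does.

```python
import tempfile
from pathlib import Path

import pandas as pd


def summarize_results(results_dir: Path) -> pd.DataFrame:
    """Return one row per result CSV with accuracy split by question type."""
    rows = []
    for path in sorted(results_dir.glob("*.csv")):
        df = pd.read_csv(path)
        correct = df["prediction"] == df["answer"]
        per_type = correct.groupby(df["type"]).mean()
        rows.append({"run": path.stem, "n": len(df), **per_type.to_dict()})
    return pd.DataFrame(rows)


# Demo on two toy files in a temporary directory (illustrative data only).
with tempfile.TemporaryDirectory() as tmp:
    base = {"type": ["bias", "culture"], "answer": ["A", "B"]}
    pd.DataFrame({**base, "prediction": ["A", "B"]}).to_csv(
        Path(tmp) / "model_a.csv", index=False)
    pd.DataFrame({**base, "prediction": ["B", "B"]}).to_csv(
        Path(tmp) / "model_b.csv", index=False)
    summary = summarize_results(Path(tmp))

print(summary)
```

On real output this would be called as `summarize_results(Path('../results'))`, giving one row per model run with a column for each question type.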

### Loading Real Results

If you have run evaluations, you can load the results:

In [None]:
# Example: Load results from a real evaluation (if available)
results_dir = Path('../results')

if results_dir.exists():
    result_files = sorted(results_dir.glob('*.csv'))
    if result_files:
        print(f"Found {len(result_files)} result file(s):")
        for f in result_files:
            print(f"  - {f.name}")
        
        # Load the first result file
        result_df = pd.read_csv(result_files[0])
        print(f"\nLoaded: {result_files[0].name}")
        print(f"Shape: {result_df.shape}")
        
        # Calculate and display metrics
        real_metrics = calculate_metrics(result_df)
        print_metrics(real_metrics, result_files[0].stem)
    else:
        print("No result files found. Run evaluate.py first!")
else:
    print("Results directory not found. Run evaluate.py first!")

## Summary

This notebook demonstrated:

1. ✅ **Data Exploration**: Understanding the structure of SOBACO datasets
2. ✅ **Evaluation Process**: How models are evaluated (simulated)
3. ✅ **Metrics Calculation**: Computing accuracy, bias rates, and other metrics
4. ✅ **Visualization**: Creating insightful plots for analysis
5. ✅ **Bias Detection**: Identifying when models exhibit biased behavior

For real evaluations, use the command-line scripts provided in the repository.