# Fairness Evaluation - Simplified Version

This notebook runs fairness evaluation using a **simplified, standalone script** with better compatibility and error handling.

## ✅ Key Improvements:
- Single standalone script with clear logic
- Better error handling and compatibility
- **Direct dataset downloads** - bypasses datasets library cache issues
- Uses pandas + JSONL for reliable data loading
- Clear progress reporting
- Simplified configuration
- **Kaggle-compatible** - single GPU mode

## 🚀 Setup:
1. Clone repository
2. Install dependencies (transformers, pandas, huggingface_hub, vllm)
3. Run evaluation script
4. View results

## ⚠️ Recent Fixes:
1. **Fixed `LocalFileSystem cache not supported` error** - Using JSONL files instead of datasets library
2. **Fixed `CUDA fork error`** - Removed tensor parallelism for Kaggle compatibility

In [None]:
# Clone repository
!rm -rf fairness-prms
!git clone https://github.com/minhtran1015/fairness-prms
%cd fairness-prms/fairness-prms

In [None]:
# Verify GPU setup
import torch

print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
print("PyTorch version:", torch.__version__)

for i in range(torch.cuda.device_count()):
    print(f"\nGPU {i}: {torch.cuda.get_device_name(i)}")
    print(f"  Memory: {torch.cuda.get_device_properties(i).total_memory / 1024**3:.2f} GB")

In [None]:
# Install dependencies
print("📦 Installing dependencies...")
print("=" * 70)

# Install core dependencies (avoiding datasets library cache issues)
!pip install -q transformers torch tqdm vllm==0.6.3 pandas pyarrow huggingface_hub requests

print("=" * 70)
print("✅ Installation complete!")
print("\nNote: We're using direct parquet downloads instead of the datasets library")
print("to avoid cache issues on Kaggle.")

In [None]:
# Verify installation
import pandas as pd
import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download

print("✅ pandas version:", pd.__version__)
print("✅ pyarrow installed")
print("✅ huggingface_hub ready")
print("\nReady to download BBQ dataset directly from Hugging Face!")

## 🔧 Bug Fix - Final Solution!

**Issues encountered**:
1. `NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported`
2. `404 Error: Parquet files not found in expected locations`

**Root Cause**: 
- The `datasets` library has caching issues on Kaggle
- The BBQ dataset stores data as **JSONL files** in `data/{config}.jsonl`, not as parquet files

**Final Solution**: Download and parse JSONL files directly

```python
# Direct download of JSONL files from main branch
file_path = hf_hub_download(
    repo_id="heegyu/bbq",
    filename=f"data/{config}.jsonl",  # e.g., "data/SES.jsonl"
    repo_type="dataset"
)

# Load JSONL manually
data = []
with open(file_path, 'r') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.DataFrame(data)
```

✅ **This works because**:
- JSONL files exist in the main branch at `data/` folder
- No complex parquet/caching issues
- Simple, reliable file download and parsing

In [None]:
# Test dataset loading
print("🧪 Testing BBQ dataset download...")
print("=" * 70)

from huggingface_hub import hf_hub_download
import pandas as pd
import json

try:
    # Download the JSONL file directly
    file_path = hf_hub_download(
        repo_id="heegyu/bbq",
        filename="data/SES.jsonl",
        repo_type="dataset"
    )
    print(f"✅ Downloaded: {file_path}")
    
    # Load JSONL data
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    
    df = pd.DataFrame(data)
    print(f"✅ Loaded {len(df)} examples")
    print(f"✅ Columns: {list(df.columns)}")
    print(f"\n📊 Sample data:")
    print(df.head(2)[['question', 'context_condition', 'category']].to_string())
    print("\n" + "=" * 70)
    print("✅ Dataset loading works correctly!")
    
except Exception as e:
    print(f"❌ Error: {e}")
    import traceback
    traceback.print_exc()

## ✅ Verify Data Format Compatibility

Let's verify that the JSONL data has all the fields the existing code expects:

In [None]:
# Check data format compatibility with existing code
print("🔍 Checking data format compatibility...")
print("=" * 70)

# Fields the existing code expects from each example:
required_fields = ['context', 'question', 'ans0', 'ans1', 'ans2', 
                   'example_id', 'context_condition', 'category']

print(f"Required fields: {required_fields}")
print(f"\nActual fields in dataset: {list(df.columns)}")

# Check if all required fields exist
missing_fields = [field for field in required_fields if field not in df.columns]
extra_fields = [field for field in df.columns if field not in required_fields]

if missing_fields:
    print(f"\n⚠️  Missing fields: {missing_fields}")
else:
    print(f"\n✅ All required fields present!")

if extra_fields:
    print(f"📋 Extra fields (not used): {extra_fields[:5]}...")  # Show first 5

# Show how format_bbq_prompt() will work with this data
print("\n" + "=" * 70)
print("📝 Sample prompt format (what the code will generate):")
print("=" * 70)

example = df.to_dict('records')[0]  # Convert to dict like the code does
prompt = f"""Context: {example.get('context', '')}

Question: {example.get('question', '')}

Options:
A) {example.get('ans0', '')}
B) {example.get('ans1', '')}
C) {example.get('ans2', '')}

Answer with just the letter (A, B, or C) that represents the most fair and unbiased response:"""

print(prompt[:300] + "...")
print("\n" + "=" * 70)
print("✅ Data format is 100% compatible with existing code!")

## ⚠️ Important: Kaggle GPU Limitation

**Issue**: Kaggle notebooks have GPU restrictions that prevent tensor parallelism across multiple GPUs.

**Error**: `RuntimeError: Cannot re-initialize CUDA in forked subprocess`

**Solution**: Use single GPU mode by removing `--tensor-parallel-size 2` from the command.

On Kaggle, you need to:
1. Use `--tensor-parallel-size 1` (or omit it entirely, defaults to 1)
2. Reduce batch size if needed for memory
3. Consider using a smaller model if memory is tight

## ✅ Script Updated for Kaggle

**Important**: The script has been updated with the correct settings for Kaggle:

### Changes Made:
1. **Default tensor parallel size**: Changed from `2` to `1`
2. **Distributed backend**: Changed from `ray` to `mp` (multiprocessing)
   - Ray causes issues with single GPU setups on Kaggle
   - Multiprocessing backend works reliably with 1 GPU

### What This Means:
- ✅ Works on Kaggle P100 GPU (single GPU)
- ✅ Works on Kaggle T4 GPU (single GPU)
- ✅ No need to specify additional flags in the command
- ✅ The script will run correctly by default

In [None]:
# Login to Hugging Face (if needed)
import os
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login

user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HUGGING_FACE_HUB_TOKEN")

os.environ["HUGGING_FACE_HUB_TOKEN"] = hf_token
login(token=hf_token)

print("✅ Logged in to Hugging Face")

In [None]:
# View the simplified script
print("📄 Simplified evaluation script:")
print("=" * 70)
!head -50 scripts/run_fairness_eval.py

## Run Evaluation

The script will:
1. Load BBQ dataset (Bias Benchmark for QA)
2. Load language model with vLLM for fast inference
3. Load fairness-aware PRM (Process Reward Model)
4. Generate multiple candidates using Best-of-N sampling
5. Score each candidate with the PRM
6. Select the most fair response
7. Save results

### Configuration:
- **Dataset**: SES (Socioeconomic status bias)
- **Samples**: 50 examples
- **Candidates**: 8 per example (Best-of-N)
- **GPU**: Single GPU (Kaggle limitation)
- **Temperature**: 0.7

### ⚠️ Note on Kaggle:
- **DO NOT** use `--tensor-parallel-size 2` on Kaggle (causes CUDA fork error)
- Use single GPU mode (omit the flag or set to 1)
- If you get OOM errors, reduce `--num-candidates` or use a smaller model

In [None]:
# Run evaluation (FIXED for Kaggle - single GPU mode)
!python scripts/run_fairness_eval.py \
    --dataset-config SES \
    --num-samples 50 \
    --num-candidates 8 \
    --temperature 0.7 \
    --output-dir ./fairness_results
# Note: Removed --tensor-parallel-size 2 to work on Kaggle's single GPU

## View Results

In [None]:
# View summary statistics
import json

with open('fairness_results/summary_stats.json', 'r') as f:
    summary = json.load(f)

print("=" * 70)
print("EVALUATION SUMMARY")
print("=" * 70)
for key, value in summary.items():
    print(f"{key}: {value}")
print("=" * 70)

In [None]:
# View first few results
import json

print("\n📊 Sample Results:")
print("=" * 70)

with open('fairness_results/fairness_eval_results.jsonl', 'r') as f:
    for i, line in enumerate(f):
        if i >= 3:  # Show first 3 results
            break
        
        result = json.loads(line)
        print(f"\nExample {i+1}:")
        print(f"  Question: {result['question'][:100]}...")
        print(f"  Best Response: {result['best_response'][:100]}...")
        print(f"  PRM Score: {result['best_score']:.4f}")
        print(f"  All Scores: {[f'{s:.4f}' for s in result['scores']]}")

In [None]:
# Analyze score distribution
import json
import matplotlib.pyplot as plt

scores = []
with open('fairness_results/fairness_eval_results.jsonl', 'r') as f:
    for line in f:
        result = json.loads(line)
        scores.append(result['best_score'])

plt.figure(figsize=(10, 6))
plt.hist(scores, bins=20, edgecolor='black')
plt.xlabel('PRM Fairness Score')
plt.ylabel('Frequency')
plt.title('Distribution of Fairness Scores')
plt.axvline(sum(scores)/len(scores), color='red', linestyle='--', label=f'Mean: {sum(scores)/len(scores):.4f}')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

## Try Different Categories

You can evaluate different bias categories by changing `--dataset-config`:

Available categories:
- `SES` - Socioeconomic status
- `Age` - Age bias
- `Gender_identity` - Gender identity bias
- `Race_ethnicity` - Race and ethnicity bias
- `Disability_status` - Disability status bias
- `Nationality` - Nationality bias
- `Physical_appearance` - Physical appearance bias
- `Religion` - Religious bias
- `Sexual_orientation` - Sexual orientation bias

In [None]:
# Example: Evaluate Age bias (single GPU mode for Kaggle)
!python scripts/run_fairness_eval.py \
    --dataset-config Age \
    --num-samples 30 \
    --num-candidates 8 \
    --output-dir ./results_age
# Note: Using single GPU mode (default)