# Caste Bias Analysis

This notebook analyzes the results of the caste bias experiment. It measures the model's average preference for stereotypical over anti-stereotypical completions, excluding examples where both completions' logits are negative infinity (`both_neg_inf == 1`).
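The filter matters because of how ties at negative infinity would be scored. The notebook does not show how `prefer_stereo_over_anti_stereo` is derived. The sketch below therefore assumes that the column is 1 when the stereotypical completion's log-probability is strictly higher. The columns `stereo_logprob` and `anti_logprob` are hypothetical names. Under that assumption, an example where both scores are `-inf` compares as `False` and would silently count as an anti-stereotypical preference.

```python
import numpy as np
import pandas as pd

# Hypothetical per-example log-probabilities (toy values, not experiment data).
toy = pd.DataFrame({
    "stereo_logprob": [-2.1, -np.inf, -np.inf, -5.0],
    "anti_logprob":   [-3.4, -1.2,    -np.inf, -4.0],
})

# Assumed definition: flag examples where both completions are impossible.
toy["both_neg_inf"] = (
    (toy["stereo_logprob"] == -np.inf) & (toy["anti_logprob"] == -np.inf)
).astype(int)

# Assumed definition: 1 if the stereotypical completion scores strictly higher.
# Note that -inf > -inf is False, so a double -inf counts as "anti" here.
toy["prefer_stereo_over_anti_stereo"] = (
    toy["stereo_logprob"] > toy["anti_logprob"]
).astype(int)

naive = toy["prefer_stereo_over_anti_stereo"].mean()
filtered = toy.loc[toy["both_neg_inf"] == 0, "prefer_stereo_over_anti_stereo"].mean()
print(naive, filtered)  # the unfiltered mean is pulled down by the -inf tie
```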

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Load the data
df = pd.read_csv('results/Caste_results.csv')

# Display basic information about the dataset
print(f"Total number of examples: {len(df)}")
print(f"Number of examples where both_neg_inf=0: {(df['both_neg_inf'] == 0).sum()}")
print(f"Number of examples where both_neg_inf=1: {(df['both_neg_inf'] == 1).sum()}")
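The two counts above add up to the total only if `both_neg_inf` contains nothing but 0 and 1. Missing values or other codes would be dropped silently by the `== 0` filter used below. The following is a minimal sanity check that assumes the column is meant to be a strict 0/1 flag. The helper name `check_flag_column` is hypothetical.

```python
import pandas as pd

def check_flag_column(frame: pd.DataFrame, col: str = "both_neg_inf") -> pd.Series:
    """Return the count of each value in `col` (missing values included) and
    raise if anything other than 0/1 appears."""
    counts = frame[col].value_counts(dropna=False)
    # NaN never equals 0 or 1, so missing values also end up in `unexpected`.
    unexpected = set(counts.index) - {0, 1}
    if unexpected:
        raise ValueError(f"{col} has unexpected values: {unexpected}")
    return counts

# Toy stand-in for the results file; in the notebook, pass `df` instead.
toy = pd.DataFrame({"both_neg_inf": [0, 0, 1, 0, 1]})
print(check_flag_column(toy))
```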

In [None]:
# Filter for cases where both_neg_inf=0
valid_cases = df[df['both_neg_inf'] == 0]

# prefer_stereo_over_anti_stereo is 0/1, so its mean is the fraction of examples
# in which the stereotypical completion is preferred
avg_preference = valid_cases['prefer_stereo_over_anti_stereo'].mean()
print(f"Average preference for stereotypical over anti-stereotypical when both_neg_inf=0: {avg_preference:.3f}")

# Express the same proportion as a percentage
percent_prefer_stereo = avg_preference * 100
print(f"Percentage of cases preferring stereotypical: {percent_prefer_stereo:.1f}%")
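A single proportion says little without its uncertainty. This step does not appear in the original analysis. It is a sketch that uses the standard Wilson score interval for a binomial proportion, with z = 1.96 for roughly 95% coverage. If the interval excludes 0.5, chance alone is an unlikely explanation for the observed preference. The function name `wilson_interval` is hypothetical, and the numbers in the example call are toy values, not experiment results.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width

# In the notebook, the inputs would come from the filtered data:
#   k = int(valid_cases['prefer_stereo_over_anti_stereo'].sum())
#   lo, hi = wilson_interval(k, len(valid_cases))
lo, hi = wilson_interval(60, 100)  # toy numbers for illustration
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```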

In [None]:
# Create a bar plot of the preference distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='prefer_stereo_over_anti_stereo', data=valid_cases)
plt.title('Distribution of Stereotypical vs Anti-stereotypical Preferences\n(when both_neg_inf=0)')
plt.xlabel('Preference (0=Anti-stereotypical, 1=Stereotypical)')
plt.ylabel('Count')
plt.show()
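The same chart can be drawn with matplotlib alone. This is a sketch for readers who want to avoid the seaborn dependency. Reindexing on `[0, 1]` keeps a zero-height bar when one outcome never occurs, and `bar_label` (available in matplotlib 3.4 and later) writes each count above its bar. The helper name `plot_preference_counts` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside Jupyter; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_preference_counts(prefs: pd.Series):
    """Bar chart of 0/1 preference counts using matplotlib only."""
    counts = prefs.value_counts().reindex([0, 1], fill_value=0)
    fig, ax = plt.subplots(figsize=(8, 6))
    bars = ax.bar(["Anti-stereotypical (0)", "Stereotypical (1)"], counts.to_numpy())
    ax.bar_label(bars)  # annotate each bar with its count
    ax.set_ylabel("Count")
    ax.set_title("Distribution of Stereotypical vs Anti-stereotypical Preferences\n(when both_neg_inf=0)")
    return fig, ax

# Toy data; in the notebook, pass valid_cases['prefer_stereo_over_anti_stereo'].
fig, ax = plot_preference_counts(pd.Series([1, 0, 1, 1, 0]))
```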

In [None]:
# Analyze by target group: preference rate (as a percentage) and number of examples
target_analysis = valid_cases.groupby('Target_Stereotypical')['prefer_stereo_over_anti_stereo'].agg(['mean', 'count'])
target_analysis.columns = ['Preference Rate (%)', 'Count']
target_analysis['Preference Rate (%)'] = target_analysis['Preference Rate (%)'] * 100
print("\nPreference analysis by target group:")
print(target_analysis)
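Rates for groups with only a handful of examples are noisy. A variant of the per-group table could drop small groups and sort the rest so that the most stereotyped targets appear first. The sketch below uses pandas named aggregation. The helper name `preference_by_group` and the default `min_count=10` cutoff are hypothetical choices, not values taken from the experiment.

```python
import pandas as pd

def preference_by_group(frame: pd.DataFrame,
                        group_col: str = "Target_Stereotypical",
                        pref_col: str = "prefer_stereo_over_anti_stereo",
                        min_count: int = 10) -> pd.DataFrame:
    """Per-group stereotypical-preference rate (%), keeping only groups with at
    least `min_count` examples, sorted from highest to lowest rate."""
    summary = frame.groupby(group_col)[pref_col].agg(
        preference_rate="mean", count="count"
    )
    summary["preference_rate"] = summary["preference_rate"] * 100
    summary = summary[summary["count"] >= min_count]
    return summary.sort_values("preference_rate", ascending=False)

# Toy data; in the notebook, pass valid_cases.
toy = pd.DataFrame({
    "Target_Stereotypical": ["A", "A", "A", "B"],
    "prefer_stereo_over_anti_stereo": [1, 1, 0, 0],
})
res = preference_by_group(toy, min_count=2)
print(res)
```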

## Summary of Findings

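This analysis measures how often the model prefers the stereotypical completion over the anti-stereotypical one, after excluding examples where both logits are negative infinity (`both_neg_inf == 1`). Each example is scored 0 or 1, so the average preference is a proportion. A value of 0.5 would mean no systematic preference, and values above 0.5 indicate a lean toward stereotypical completions. The results are also broken down by target group (`Target_Stereotypical`) to check whether the bias differs across caste groups.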