# Annotation Comparison: Gemini vs Original GUS Dataset

This notebook compares the BIO annotations (O, GEN, UNFAIR, STEREO) between:
- **Original GUS Dataset** (`ethical-spectacle/gus-dataset-v1`) - annotated with DSPy/GPT
- **Gemini Annotations** (`gemini_annotations.json`) - annotated with Gemini 3 Flash

Objective: identify the sentences where the two annotation sets differ most, so they can be prioritized for human evaluation.

In [1]:
import json
import ast
import pandas as pd
import numpy as np
from datasets import load_dataset
from IPython.display import display, HTML
from collections import Counter

## 1. Load Datasets

In [2]:
# Load original dataset from HuggingFace
hf_dataset = load_dataset("ethical-spectacle/gus-dataset-v1", split="train")
print(f"Original GUS Dataset: {len(hf_dataset)} sentences")

# Load Gemini annotations
with open("gemini_annotations.json", "r") as f:
    gemini_data = json.load(f)
print(f"Gemini Annotations: {len(gemini_data)} sentences")

Original GUS Dataset: 3739 sentences
Gemini Annotations: 3739 sentences


## 2. Parsing and Tag Normalization

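The HF dataset stores `ner_tags` as a string that encodes a list of lists: each word can carry several tags, one per annotation pass (GEN, UNFAIR, STEREO). Gemini assigns a single flat tag per word.

To make the two comparable, we parse the HF string, collapse each word's tags to a single tag by priority (UNFAIR > STEREO > GEN > O), and compare at the base-category level with the B-/I- prefix removed.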

In [3]:
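def parse_hf_ner_tags(ner_tags_str):
    """Parse HF dataset ner_tags: string of list of lists -> list of tag lists (one per word)."""
    try:
        tags_list = ast.literal_eval(ner_tags_str)
        return tags_list  # list of lists
    except (ValueError, SyntaxError, TypeError):
        # literal_eval raises these on malformed or non-string input
        return None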

def flatten_tags(tag_list):
    """Flatten list of tags per word to a single tag (priority: UNFAIR > STEREO > GEN > O)."""
    priority = {'B-UNFAIR': 4, 'I-UNFAIR': 4, 'B-STEREO': 3, 'I-STEREO': 3, 'B-GEN': 2, 'I-GEN': 2, 'O': 1}
    if not tag_list:
        return 'O'
    best = max(tag_list, key=lambda t: priority.get(t, 0))
    return best

def get_base_label(tag):
    """Remove BIO prefix to get only the category: GEN, UNFAIR, STEREO, O."""
    if tag == 'O':
        return 'O'
    parts = tag.split('-', 1)
    return parts[1] if len(parts) > 1 else tag

# Test with first example
sample_hf = parse_hf_ner_tags(hf_dataset[0]['ner_tags'])
print("HF tags (raw):", sample_hf[:5])
print("HF tags (flat):", [flatten_tags(t) for t in sample_hf[:5]])
print("Gemini tags:", gemini_data[0]['gemini_annotations'][:5])

HF tags (raw): [['O'], ['O'], ['O'], ['O'], ['B-GEN']]
HF tags (flat): ['O', 'O', 'O', 'O', 'B-GEN']
Gemini tags: ['O', 'O', 'O', 'O', 'B-GEN']
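The first sentence only contains single-tag words, so the priority rule is not exercised in the output above. As a minimal sketch, the snippet below repeats the two helpers from this cell and applies them to a made-up word list (the `toy_tags` values are hypothetical, not taken from the dataset) to show what flattening does when a word carries tags from several passes.

```python
# Same priority table as flatten_tags in the notebook cell above.
PRIORITY = {'B-UNFAIR': 4, 'I-UNFAIR': 4, 'B-STEREO': 3, 'I-STEREO': 3,
            'B-GEN': 2, 'I-GEN': 2, 'O': 1}

def flatten_tags(tag_list):
    """Keep only the highest-priority tag of a word; empty lists become 'O'."""
    if not tag_list:
        return 'O'
    return max(tag_list, key=lambda t: PRIORITY.get(t, 0))

def get_base_label(tag):
    """Strip the B-/I- prefix."""
    if tag == 'O':
        return 'O'
    parts = tag.split('-', 1)
    return parts[1] if len(parts) > 1 else tag

# Hypothetical per-word tag lists, for illustration only.
toy_tags = [['O'], ['B-GEN', 'B-STEREO'], ['I-GEN', 'I-STEREO', 'B-UNFAIR'], []]

flat = [flatten_tags(t) for t in toy_tags]
base = [get_base_label(t) for t in flat]
print(flat)
print(base)
```

Because only the highest-priority tag survives, a word that GUS tagged as both GEN and STEREO is compared as STEREO. If Gemini labels the same word GEN, it counts as a disagreement even though GUS also assigned GEN.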


## 3. Build Comparison DataFrame

In [4]:
rows = []
skipped = 0

for idx in range(min(len(hf_dataset), len(gemini_data))):
    hf_text = hf_dataset[idx]['text_str']
    gm_text = gemini_data[idx]['text']
    
    # Verify texts match
    if hf_text.strip() != gm_text.strip():
        skipped += 1
        continue
    
    hf_tags_raw = parse_hf_ner_tags(hf_dataset[idx]['ner_tags'])
    gm_tags = gemini_data[idx]['gemini_annotations']
    
    if hf_tags_raw is None or not gm_tags:
        skipped += 1
        continue
    
    words = hf_text.split()
    hf_flat = [flatten_tags(t) for t in hf_tags_raw]
    
    # Truncate to the shortest of the three sequences so that positions align
    min_len = min(len(words), len(hf_flat), len(gm_tags))
    
    # Calculate differences at base category level (without B-/I-)
    hf_base = [get_base_label(hf_flat[i]) for i in range(min_len)]
    gm_base = [get_base_label(gm_tags[i]) for i in range(min_len)]
    
    n_diff = sum(1 for i in range(min_len) if hf_base[i] != gm_base[i])
    n_total = min_len
    len_mismatch = len(hf_flat) != len(gm_tags)
    
    rows.append({
        'idx': idx,
        'text': hf_text,
        'words': words[:min_len],
        'hf_tags': hf_flat[:min_len],
        'hf_tags_raw': hf_tags_raw[:min_len],
        'gm_tags': gm_tags[:min_len],
        'hf_base': hf_base,
        'gm_base': gm_base,
        'n_diff': n_diff,
        'n_total': n_total,
        'diff_ratio': n_diff / n_total if n_total > 0 else 0,
        'len_mismatch': len_mismatch,
        'hf_rationale': hf_dataset[idx].get('rationale', ''),
        'gm_rationale': gemini_data[idx].get('gemini_rationale', '')
    })

df = pd.DataFrame(rows)
print(f"Paired sentences: {len(df)}")
print(f"Sentences with non-matching text (skipped): {skipped}")
print(f"\nDistribution of differences per sentence:")
print(df['n_diff'].describe())

Paired sentences: 3739
Sentences with non-matching text (skipped): 0

Distribution of differences per sentence:
count    3739.000000
mean        4.256218
std         3.782777
min         0.000000
25%         1.000000
50%         3.000000
75%         6.000000
max        26.000000
Name: n_diff, dtype: float64
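To make the per-sentence numbers concrete, here is a small sketch of the same difference count on plain lists. The helper name `diff_stats` is an assumption introduced for illustration; the first example uses the base labels of the sentence "Women are gossips." as rendered later in this notebook, and the second example is hypothetical.

```python
def diff_stats(hf_base, gm_base):
    """Return (n_diff, n_total, diff_ratio, len_mismatch) for two label lists.

    Mirrors the notebook logic: both lists are truncated to the shorter
    length before positions are compared.
    """
    n_total = min(len(hf_base), len(gm_base))
    n_diff = sum(1 for h, g in zip(hf_base[:n_total], gm_base[:n_total]) if h != g)
    ratio = n_diff / n_total if n_total > 0 else 0
    return n_diff, n_total, ratio, len(hf_base) != len(gm_base)

# "Women are gossips." (base labels as shown in the rendered comparison)
full_disagreement = diff_stats(['STEREO', 'STEREO', 'UNFAIR'], ['GEN', 'O', 'O'])

# Hypothetical length mismatch: the extra GUS tag is dropped by truncation.
truncated = diff_stats(['O', 'GEN', 'STEREO'], ['O', 'GEN'])

print(full_disagreement)
print(truncated)
```

The second case shows why `len_mismatch` is worth inspecting alongside `diff_ratio`: truncation can hide a disagreement that sits past the end of the shorter sequence.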


## 4. Global Divergence Statistics

In [5]:
# How many sentences have at least 1 difference?
n_with_diff = (df['n_diff'] > 0).sum()
n_identical = (df['n_diff'] == 0).sum()

print(f"Identical sentences (0 differences): {n_identical} ({n_identical/len(df)*100:.1f}%)")
print(f"Sentences with differences: {n_with_diff} ({n_with_diff/len(df)*100:.1f}%)")
print(f"\nAverage different words per sentence: {df['n_diff'].mean():.2f}")
print(f"Average difference ratio: {df['diff_ratio'].mean():.2%}")
print(f"\nTop 10 sentences with most differences:")
print(df.nlargest(10, 'n_diff')[['idx', 'text', 'n_diff', 'n_total', 'diff_ratio']].to_string(index=False))

Identical sentences (0 differences): 448 (12.0%)
Sentences with differences: 3291 (88.0%)

Average different words per sentence: 4.26
Average difference ratio: 32.16%

Top 10 sentences with most differences:
 idx                                                                                                                                                                                                                                           text  n_diff  n_total  diff_ratio
1220                      In ancient Rome, infants born into slavery had no legal rights or protections whatsoever. They faced hardships such as separation from parents, lack of proper nutrition, and exposure to dangerous working environments.      26       32    0.812500
1182                                                                                  Women earn less than men across virtually every occupation and industry, even after accounting for differences in hours worked and other job characteristics.   

In [6]:
# Global tag count per dataset
all_hf_base = [tag for row in df['hf_base'] for tag in row]
all_gm_base = [tag for row in df['gm_base'] for tag in row]

hf_counts = Counter(all_hf_base)
gm_counts = Counter(all_gm_base)

tag_comparison = pd.DataFrame({
    'Original GUS': hf_counts,
    'Gemini': gm_counts
}).fillna(0).astype(int)
tag_comparison['Difference'] = tag_comparison['Gemini'] - tag_comparison['Original GUS']
tag_comparison['Difference (%)'] = ((tag_comparison['Gemini'] - tag_comparison['Original GUS']) / tag_comparison['Original GUS'] * 100).round(1)

print("Global tag distribution (base labels):")
display(tag_comparison)

Global tag distribution (base labels):


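| Tag | Original GUS | Gemini | Difference | Difference (%) |
|---|---|---|---|---|
| O | 39387 | 42746 | 3359 | 8.5 |
| GEN | 4843 | 6742 | 1899 | 39.2 |
| STEREO | 9561 | 5288 | -4273 | -44.7 |
| UNFAIR | 2643 | 1816 | -827 | -31.3 |
| `[` | 13 | 0 | -13 | -100.0 |
| `,` | 74 | 0 | -74 | -100.0 |
| (blank) | 71 | 0 | -71 | -100.0 |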
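The last three rows of the table are not BIO categories and occur only on the Original GUS side, which suggests that a few parsed `ner_tags` entries are not clean lists of tags. The exact cause is not visible in this notebook. As a hedged sketch, a check like the following could flag such entries before flattening; `VALID_TAGS`, `find_malformed`, and the `broken` example are assumptions introduced here for illustration.

```python
VALID_TAGS = {"O", "B-GEN", "I-GEN", "B-UNFAIR", "I-UNFAIR", "B-STEREO", "I-STEREO"}

def find_malformed(tags_per_word):
    """Return (position, entry) pairs for entries that are not a list of known BIO tags."""
    problems = []
    for pos, entry in enumerate(tags_per_word):
        if not isinstance(entry, list) or any(t not in VALID_TAGS for t in entry):
            problems.append((pos, entry))
    return problems

# Hypothetical parsed rows: one clean, one with stray string fragments.
clean = [["O"], ["B-GEN", "B-STEREO"], []]
broken = [["O"], "[", ["B-GEN"], ", "]

print(find_malformed(clean))
print(find_malformed(broken))
```

Rows flagged this way could be listed for manual inspection or excluded from the counts, depending on how many there are.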


## 5. Tag Confusion Matrix

In [7]:
# Confusion Matrix: HF (rows) vs Gemini (columns)
labels = ['O', 'GEN', 'UNFAIR', 'STEREO']
confusion = pd.DataFrame(0, index=labels, columns=labels)

for _, row in df.iterrows():
    for hf_tag, gm_tag in zip(row['hf_base'], row['gm_base']):
        if hf_tag in labels and gm_tag in labels:
            confusion.loc[hf_tag, gm_tag] += 1

print("Confusion Matrix (rows=Original GUS, columns=Gemini):")
display(confusion)

# Agreement percentage by category
print("\nAgreement by category:")
for label in labels:
    total = confusion.loc[label].sum()
    agree = confusion.loc[label, label]
    print(f"  {label}: {agree}/{total} ({agree/total*100:.1f}%)" if total > 0 else f"  {label}: N/A")

Confusion Matrix (rows=Original GUS, columns=Gemini):


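| Original GUS (rows) / Gemini (columns) | O | GEN | UNFAIR | STEREO |
|---|---|---|---|---|
| O | 35394 | 1937 | 729 | 1327 |
| GEN | 3048 | 1656 | 41 | 98 |
| UNFAIR | 548 | 572 | 644 | 879 |
| STEREO | 3598 | 2577 | 402 | 2984 |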



Agreement by category:
  O: 35394/39387 (89.9%)
  GEN: 1656/4843 (34.2%)
  UNFAIR: 644/2643 (24.4%)
  STEREO: 2984/9561 (31.2%)
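Raw agreement is dominated by the O class, so a chance-corrected statistic can be a useful complement. The sketch below computes Cohen's kappa from the confusion matrix printed above. It is not part of the original analysis, and the function name `cohen_kappa` is an assumption; the counts are copied from the matrix.

```python
# Rows = Original GUS, columns = Gemini, order: O, GEN, UNFAIR, STEREO.
confusion = [
    [35394, 1937, 729, 1327],
    [3048, 1656, 41, 98],
    [548, 572, 644, 879],
    [3598, 2577, 402, 2984],
]

def cohen_kappa(matrix):
    """Return (observed_agreement, kappa) for a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Agreement expected if both annotators labelled independently
    # with their own marginal label frequencies.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return observed, (observed - expected) / (1 - expected)

observed, kappa = cohen_kappa(confusion)
print(f"Observed agreement: {observed:.3f}")
print(f"Cohen's kappa:      {kappa:.3f}")
```

Note that this is a token-level kappa over the four base labels only; tokens with the stray labels are excluded, as in the matrix itself.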


## 6. Most Common Divergence Types

In [8]:
# What are the most common tag transitions?
transitions = Counter()

for _, row in df.iterrows():
    for hf_tag, gm_tag in zip(row['hf_base'], row['gm_base']):
        if hf_tag != gm_tag:
            transitions[(hf_tag, gm_tag)] += 1

print("Top 15 most common divergences (Original GUS → Gemini):")
print(f"{'Original GUS':<12} {'Gemini':<12} {'Count':>8}")
for (hf, gm), count in transitions.most_common(15):
    print(f"{hf:<12} {gm:<12} {count:>8}")

Top 15 most common divergences (Original GUS → Gemini):
Original GUS Gemini          Count
STEREO       O                3598
GEN          O                3048
STEREO       GEN              2577
O            GEN              1937
O            STEREO           1327
UNFAIR       STEREO            879
O            UNFAIR            729
UNFAIR       GEN               572
UNFAIR       O                 548
STEREO       UNFAIR            402
GEN          STEREO             98
,            O                  74
             O                  71
GEN          UNFAIR             41
[            O                  13
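Absolute counts favor the frequent labels, so it can help to ask, for each Original GUS label, what share of its tokens ended up under each Gemini label. The following sketch row-normalizes the confusion matrix from the previous section; `row_shares` is a hypothetical helper name, and the counts are copied from that matrix.

```python
labels = ["O", "GEN", "UNFAIR", "STEREO"]

# confusion[gus_label][gemini_label] = token count
confusion = {
    "O":      {"O": 35394, "GEN": 1937, "UNFAIR": 729, "STEREO": 1327},
    "GEN":    {"O": 3048,  "GEN": 1656, "UNFAIR": 41,  "STEREO": 98},
    "UNFAIR": {"O": 548,   "GEN": 572,  "UNFAIR": 644, "STEREO": 879},
    "STEREO": {"O": 3598,  "GEN": 2577, "UNFAIR": 402, "STEREO": 2984},
}

def row_shares(matrix):
    """Divide every cell by its row total."""
    shares = {}
    for src, row in matrix.items():
        total = sum(row.values())
        shares[src] = {dst: count / total for dst, count in row.items()}
    return shares

shares = row_shares(confusion)
for src in labels:
    line = ", ".join(f"{dst}: {shares[src][dst]:.1%}" for dst in labels)
    print(f"{src:<7} -> {line}")
```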


## 7. Visualization of Sentences with Major Differences

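Each sentence is rendered as a table for human evaluation, with the Original GUS and Gemini tags in rows beneath the words:
- Cell background color = tag category (O, GEN, UNFAIR, STEREO)
- Red border = the two annotations differ at that word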

In [9]:
def render_comparison_html(row, show_rationale=True):
    """Generates HTML for visual comparison of a sentence."""
    tag_colors = {
        'O': '#f0f0f0',
        'GEN': '#a8d8ea',
        'UNFAIR': '#ffb6b9',
        'STEREO': '#ffd93d'
    }
    
    html = f'<div style="border:1px solid #ccc; padding:12px; margin:8px 0; border-radius:8px; background:#fafafa;">'
    html += f'<div style="font-weight:bold; margin-bottom:8px;">Sentence #{row["idx"]} ({row["n_diff"]} different words out of {row["n_total"]}, ratio={row["diff_ratio"]:.0%})</div>'
    
    # Words table
    html += '<table style="border-collapse:collapse; width:100%; font-size:13px;">'
    
    # Header row with words
    html += '<tr>'
    html += '<td style="font-weight:bold; padding:4px 8px; vertical-align:top;">Word</td>'
    for w in row['words']:
        html += f'<td style="padding:4px; text-align:center; font-weight:bold; border-bottom:1px solid #ddd;">{w}</td>'
    html += '</tr>'
    
    # Original GUS tags
    html += '<tr>'
    html += '<td style="padding:4px 8px; font-weight:bold;">Original GUS</td>'
    for i, tag in enumerate(row['hf_base']):
        differs = tag != row['gm_base'][i]
        bg = tag_colors.get(tag, '#f0f0f0')
        border = 'border:2px solid #e74c3c;' if differs else 'border:1px solid #ddd;'
        html += f'<td style="padding:4px; text-align:center; background:{bg}; {border} font-size:11px;">{row["hf_tags"][i]}</td>'
    html += '</tr>'
    
    # Gemini tags
    html += '<tr>'
    html += '<td style="padding:4px 8px; font-weight:bold;">Gemini</td>'
    for i, tag in enumerate(row['gm_base']):
        differs = tag != row['hf_base'][i]
        bg = tag_colors.get(tag, '#f0f0f0')
        border = 'border:2px solid #e74c3c;' if differs else 'border:1px solid #ddd;'
        html += f'<td style="padding:4px; text-align:center; background:{bg}; {border} font-size:11px;">{row["gm_tags"][i]}</td>'
    html += '</tr>'
    
    html += '</table>'
    
    # Rationales
    if show_rationale:
        html += '<details style="margin-top:8px;"><summary style="cursor:pointer; color:#555;">View rationales</summary>'
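        html += f'<div style="margin-top:6px; padding:6px; background:#efe; border-radius:4px; font-size:12px;"><b>Original GUS:</b> {row["hf_rationale"]}</div>'
        html += f'<div style="margin-top:6px; padding:6px; background:#eef; border-radius:4px; font-size:12px;"><b>Gemini:</b> {row["gm_rationale"]}</div>'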
        html += '</details>'
    
    html += '</div>'
    return html

print("Color legend: ", end="")
display(HTML('<span style="background:#f0f0f0;padding:2px 8px;">O</span> '
             '<span style="background:#a8d8ea;padding:2px 8px;">GEN</span> '
             '<span style="background:#ffb6b9;padding:2px 8px;">UNFAIR</span> '
             '<span style="background:#ffd93d;padding:2px 8px;">STEREO</span> '
             '<span style="border:2px solid #e74c3c;padding:2px 8px;">Divergence</span>'))

Color legend: 
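The renderer interpolates words and rationales directly into the HTML string. If any sentence contained characters such as `<` or `&`, the table could render incorrectly. A minimal sketch of escaping a value before interpolation, using the standard library's `html.escape`, is shown below; `word_cell` is a hypothetical helper, not part of the notebook.

```python
from html import escape

def word_cell(word):
    """Build one table cell, escaping HTML-special characters in the word."""
    return f'<td style="padding:4px; text-align:center;">{escape(word)}</td>'

cell = word_cell("<b>R&D</b>")
print(cell)
```

The same call could wrap `row["gm_rationale"]` and each tag in `render_comparison_html`.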

In [10]:
# Show the 30 sentences with the highest difference ratio (n_diff / n_total)
top_diff = df.nlargest(30, 'diff_ratio')

print(f"Top 30 sentences with largest annotation differences:")
html_output = ""
for _, row in top_diff.iterrows():
    html_output += render_comparison_html(row)

display(HTML(html_output))

Top 30 sentences with largest annotation differences:


0,1,2,3,4,5,6,7,8
Word,Grandparents,always,seem,to,spoil,their,grandkids,rotten.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,B-UNFAIR
Gemini,B-GEN,I-GEN,I-GEN,O,O,O,O,O

0,1,2,3,4,5
Word,Men,cannot,handle,rejection,gracefully.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O

0,1,2,3,4
Word,Five-year-olds,talk,back,disrespectfully.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,I-GEN,B-UNFAIR

0,1,2,3,4,5,6,7,8,9,10,11,12,13
Word,Their,moral,compasses,must,be,broken,if,they,don't,believe,in,an,afterlife.
Original GUS,O,O,O,O,O,B-UNFAIR,O,B-GEN,O,O,O,O,O
Gemini,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7,8,9,10,11
Word,What,good,is,a,woman,who,doesn't,know,how,to,cook?
Original GUS,O,O,O,O,B-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR
Gemini,B-UNFAIR,I-UNFAIR,I-UNFAIR,B-GEN,I-GEN,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7,8,9,10,11,12,13
Word,Trans,individuals,are,unable,to,fully,integrate,into,society,due,to,their,differences.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5
Word,Adolescents,are,moody,and,irrational.
Original GUS,B-STEREO,I-STEREO,B-UNFAIR,I-STEREO,I-UNFAIR
Gemini,B-GEN,I-GEN,O,O,O

0,1,2,3,4,5,6,7,8
Word,Babies,cannot,learn,languages,effectively,without,formal,instruction.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
Word,Every,man,dreams,of,having,a,wife,who,"cooks,","cleans,",and,looks,after,him.
Original GUS,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7
Word,They,invade,our,country,illegally,in,droves.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,B-UNFAIR,O,O,O,B-GEN,I-GEN

0,1,2,3,4,5,6,7
Word,Age-related,memory,loss,means,seniors,aren't,trustworthy.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,B-UNFAIR
Gemini,O,O,O,O,B-GEN,I-GEN,O

0,1,2,3,4,5,6,7
Word,Girls,are,naturally,more,nurturing,than,boys.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,I-GEN,O,O,O,B-GEN

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
Word,Agnosticism,is,just,an,excuse,for,those,who,don't,want,to,put,in,the,effort,required,to,have,strong,convictions.
Original GUS,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
Gemini,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7,8,9,10,11,12
Word,Being,attracted,to,members,of,your,own,gender,makes,you,more,creative.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O

0,1,2,3,4,5,6,7,8,9
Word,Grandmothers,hold,the,key,to,family,traditions,and,wisdom.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,O,O,O,O,O,O,O,O

0,1,2,3
Word,Women,are,gossips.
Original GUS,B-STEREO,I-STEREO,B-UNFAIR
Gemini,B-GEN,O,O

0,1,2,3,4,5,6,7
Word,Muslims,cannot,contribute,positively,to,modern,civilization.
Original GUS,B-STEREO,B-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10
Word,Middle-aged,folks,have,such,wisdom,from,their,years,of,experience!
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10
Word,Hindus,cannot,accept,criticism,of,their,religion,without,becoming,defensive.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8
Word,College,freshmen,need,constant,supervision,because,they're,clueless.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,B-UNFAIR
Gemini,B-GEN,I-GEN,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11
Word,Sikh,children,grow,up,learning,valuable,life,lessons,from,their,elders.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O,O,O,O,O,O,O

0,1,2,3,4
Word,Gay,people,are,predators.
Original GUS,B-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR
Gemini,B-GEN,I-GEN,O,O

0,1,2,3,4,5,6,7
Word,Poor,people,can't,even,afford,basic,necessities.
Original GUS,B-UNFAIR,I-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O,O,O

0,1,2,3,4,5,6,7,8
Word,The,obese,community,contributes,nothing,but,health,issues.
Original GUS,O,B-UNFAIR,I-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O

0,1,2,3,4,5,6,7,8
Word,Intersex,individuals,are,confused,about,their,true,identity.
Original GUS,B-STEREO,I-STEREO,I-STEREO,B-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,I-GEN,O,O,O,O,O

0,1,2,3,4,5
Word,Their,emotions,cloud,their,judgment.
Original GUS,O,B-GEN,O,O,O
Gemini,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7
Word,Using,crutches,is,a,sign,of,weakness.
Original GUS,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,B-UNFAIR
Gemini,B-GEN,I-GEN,I-GEN,O,O,O,O

0,1,2,3,4,5,6,7,8
Word,Narcissistic,Personality,Disorder,is,self-centeredness,taken,too,far.
Original GUS,O,O,O,O,O,O,O,O
Gemini,B-GEN,I-GEN,I-GEN,I-GEN,B-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
Word,Can't,trust,Asians,around,valuable,items,–,they,have,this,innate,desire,to,steal.
Original GUS,B-STEREO,I-STEREO,B-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR
Gemini,B-UNFAIR,I-UNFAIR,B-GEN,O,O,O,O,B-GEN,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5
Word,Asexuals,can't,experience,true,love.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O


## 8. Analysis by Divergence Type

Separate divergences by type for finer analysis.

In [11]:
def get_divergence_type(row):
    """Categorizes the main divergence type in the sentence."""
    types = set()
    for hf_tag, gm_tag in zip(row['hf_base'], row['gm_base']):
        if hf_tag != gm_tag:
            types.add(f"{hf_tag}→{gm_tag}")
    return types

df['divergence_types'] = df.apply(get_divergence_type, axis=1)

# Sentences where GUS says O but Gemini says something (Gemini found bias that GUS didn't).
# Caveat: 'O→' in t is a substring test, so it also matches any transition whose
# source label ends in "O", e.g. 'STEREO→GEN'.
gemini_found_more = df[df['divergence_types'].apply(lambda x: any('O→' in t and t != 'O→O' for t in x))]
print(f"Gemini found bias where GUS says O: {len(gemini_found_more)} sentences")

# Sentences where Gemini says O but GUS says something (GUS found bias that Gemini didn't)
gus_found_more = df[df['divergence_types'].apply(lambda x: any('→O' in t and not t.startswith('O') for t in x))]
print(f"GUS found bias where Gemini says O: {len(gus_found_more)} sentences")

# Sentences where they disagree on bias type (e.g., GEN vs STEREO)
type_disagree = df[df['divergence_types'].apply(lambda x: any(
    t.split('→')[0] not in ('O', '') and t.split('→')[1] not in ('O', '') and t.split('→')[0] != t.split('→')[1]
    for t in x
))]
print(f"Disagree on bias type (e.g., GEN vs STEREO): {len(type_disagree)} sentences")

Gemini found bias where GUS says O: 1991 sentences
GUS found bias where Gemini says O: 2221 sentences
Disagree on bias type (e.g., GEN vs STEREO): 1461 sentences
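The first filter above matches on a substring of the transition string, so a sentence whose only divergences are from STEREO (a label ending in "O") is also counted as "Gemini found bias where GUS says O". A stricter variant compares the source and target labels exactly. The sketch below is an assumption-laden alternative rather than the notebook's method: `transition_pairs`, `classify`, and `BIAS` are names introduced here, and the counts it would produce on the full data are not known.

```python
BIAS = {"GEN", "UNFAIR", "STEREO"}

def transition_pairs(hf_base, gm_base):
    """Set of (gus_label, gemini_label) pairs at positions where the labels differ."""
    return {(h, g) for h, g in zip(hf_base, gm_base) if h != g}

def classify(pairs):
    """Flag the three divergence kinds using exact label comparisons."""
    return {
        "gemini_found_more": any(h == "O" and g in BIAS for h, g in pairs),
        "gus_found_more": any(h in BIAS and g == "O" for h, g in pairs),
        # h != g is already guaranteed by transition_pairs
        "type_disagree": any(h in BIAS and g in BIAS for h, g in pairs),
    }

# Toy example: GUS tags every word STEREO, Gemini tags GEN then O.
pairs = transition_pairs(["STEREO", "STEREO", "STEREO"], ["GEN", "O", "O"])
flags = classify(pairs)
print(flags)
```

Restricting both sides to the known bias labels also keeps stray labels such as `[` or `,` out of all three groups.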


In [12]:
# Examples: Gemini found bias that GUS didn't
print("GEMINI FOUND BIAS WHERE GUS SAYS O (sample of 15):")

sample = gemini_found_more.nlargest(15, 'n_diff')
html_output = ""
for _, row in sample.iterrows():
    html_output += render_comparison_html(row)
display(HTML(html_output))

GEMINI FOUND BIAS WHERE GUS SAYS O (sample of 15):


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
Word,Women,earn,less,than,men,across,virtually,every,occupation,and,"industry,",even,after,accounting,for,differences,in,hours,worked,and,other,job,characteristics.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,O,O,O,B-GEN,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Word,Black,men,experience,higher,unemployment,rates,than,other,ethnic,"groups,",even,when,controlling,for,factors,like,"age,",education,"level,",and,work,experience.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O,O,O,B-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Word,Do,heteros,ever,consider,how,much,more,interesting,things,would,be,if,everyone,were,attracted,to,each,other,rather,than,opposite,genders?
Original GUS,O,B-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,B-GEN,O,O,O,O,O,O,O,O,O,O,B-GEN,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,Immigrants,make,up,roughly,14%,of,Australia’s,population,but,accounted,for,over,20%,of,new,COVID-19,cases,during,its,second,wave.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
Word,Obese,individuals,frequently,experience,comorbidities,associated,with,poor,nutrition,and,sedentary,"lifestyles,",making,them,vulnerable,targets,for,chronic,diseases.,Addressing,root,causes,early,on,reduces,risk,and,improves,quality,of,life.
Original GUS,B-UNFAIR,I-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,O,O,O,O,O,O,O,O,O
Gemini,B-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
Word,Agnosticism,is,just,an,excuse,for,those,who,don't,want,to,put,in,the,effort,required,to,have,strong,convictions.
Original GUS,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
Gemini,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,"Globally,",women,continue,to,bear,the,brunt,of,domestic,labor,and,caregiving,"duties,",restricting,their,time,and,energy,for,other,endeavors.
Original GUS,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,B-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
Word,Some,research,suggests,that,children,raised,without,religion,may,develop,stronger,critical,thinking,skills,due,to,being,exposed,to,diverse,viewpoints,earlier,in,life.
Original GUS,O,O,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Word,Some,activist,groups,appear,to,care,more,about,gaining,media,attention,than,actually,making,a,difference,–,what,good,does,that,accomplish?
Original GUS,B-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
Gemini,O,B-GEN,I-GEN,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,B-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
Word,How,do,we,know,if,someone,faking,a,mental,illness,isn't,just,trying,to,get,out,of,doing,something,they,don't,want,to,do?
Original GUS,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
Gemini,O,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN,B-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38
Word,While,it's,true,that,there,are,instances,where,some,Muslims,may,engage,in,practices,contrary,to,Western,values,such,as,honor,killings,or,female,genital,"mutilation,",these,acts,are,condemned,by,the,vast,majority,of,Muslim,communities,worldwide.
Original GUS,O,O,O,O,O,O,O,O,B-GEN,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,B-UNFAIR,I-UNFAIR,I-STEREO,B-UNFAIR,I-UNFAIR,I-UNFAIR,O,O,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O
Gemini,O,O,O,O,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,B-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Word,Becoming,non-binary,feels,like,a,cop,out;,choosing,a,label,simply,because,you,couldn't,handle,the,pressures,of,conforming,to,either,side.
Original GUS,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
Gemini,B-GEN,I-GEN,O,O,O,B-UNFAIR,I-UNFAIR,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Word,Girls,often,face,societal,pressure,to,conform,to,certain,beauty,"standards,",which,can,negatively,impact,their,self-esteem,and,confidence.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,Numerous,White,athletes,participate,in,sports,at,all,levels,–,"local,","national,",and,international,"–,",demonstrating,physical,prowess,and,mental,strength.
Original GUS,B-GEN,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,O,O,O,O,O
Gemini,B-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
Word,Those,autistic,savants,sure,are,fascinating,–,but,let’s,face,it;,they,wouldn't,stand,a,chance,in,real,life,outside,of,their,narrow,expertise.
Original GUS,O,B-STEREO,I-STEREO,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
Gemini,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO


In [13]:
# Examples: GUS found bias that Gemini didn't
print("GUS FOUND BIAS WHERE GEMINI SAYS O (sample of 15):")

sample = gus_found_more.nlargest(15, 'n_diff')
html_output = ""
for _, row in sample.iterrows():
    html_output += render_comparison_html(row)
display(HTML(html_output))

GUS FOUND BIAS WHERE GEMINI SAYS O (sample of 15):


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
Word,In,ancient,"Rome,",infants,born,into,slavery,had,no,legal,rights,or,protections,whatsoever.,They,faced,hardships,such,as,separation,from,"parents,",lack,of,proper,"nutrition,",and,exposure,to,dangerous,working,environments.
Original GUS,O,O,O,B-GEN,I-GEN,I-GEN,B-UNFAIR,O,O,B-UNFAIR,I-UNFAIR,O,B-UNFAIR,I-UNFAIR,B-GEN,I-GEN,B-UNFAIR,I-GEN,I-GEN,B-UNFAIR,I-GEN,B-UNFAIR,B-UNFAIR,I-GEN,I-GEN,B-UNFAIR,I-GEN,B-UNFAIR,I-GEN,B-UNFAIR,I-UNFAIR,I-UNFAIR
Gemini,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
Word,Women,earn,less,than,men,across,virtually,every,occupation,and,"industry,",even,after,accounting,for,differences,in,hours,worked,and,other,job,characteristics.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,O,O,O,B-GEN,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Word,Black,men,experience,higher,unemployment,rates,than,other,ethnic,"groups,",even,when,controlling,for,factors,like,"age,",education,"level,",and,work,experience.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O,O,O,B-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Word,Do,heteros,ever,consider,how,much,more,interesting,things,would,be,if,everyone,were,attracted,to,each,other,rather,than,opposite,genders?
Original GUS,O,B-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,B-GEN,O,O,O,O,O,O,O,O,O,O,B-GEN,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,Immigrants,make,up,roughly,14%,of,Australia’s,population,but,accounted,for,over,20%,of,new,COVID-19,cases,during,its,second,wave.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
Word,Obese,individuals,frequently,experience,comorbidities,associated,with,poor,nutrition,and,sedentary,"lifestyles,",making,them,vulnerable,targets,for,chronic,diseases.,Addressing,root,causes,early,on,reduces,risk,and,improves,quality,of,life.
Original GUS,B-UNFAIR,I-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,O,O,O,O,O,O,O,O,O
Gemini,B-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,"Globally,",women,continue,to,bear,the,brunt,of,domestic,labor,and,caregiving,"duties,",restricting,their,time,and,energy,for,other,endeavors.
Original GUS,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,B-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
Word,Some,research,suggests,that,children,raised,without,religion,may,develop,stronger,critical,thinking,skills,due,to,being,exposed,to,diverse,viewpoints,earlier,in,life.
Original GUS,O,O,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Word,Some,activist,groups,appear,to,care,more,about,gaining,media,attention,than,actually,making,a,difference,–,what,good,does,that,accomplish?
Original GUS,B-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
Gemini,O,B-GEN,I-GEN,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,B-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR,I-UNFAIR

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38
Word,While,it's,true,that,there,are,instances,where,some,Muslims,may,engage,in,practices,contrary,to,Western,values,such,as,honor,killings,or,female,genital,"mutilation,",these,acts,are,condemned,by,the,vast,majority,of,Muslim,communities,worldwide.
Original GUS,O,O,O,O,O,O,O,O,B-GEN,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,B-UNFAIR,I-UNFAIR,I-STEREO,B-UNFAIR,I-UNFAIR,I-UNFAIR,O,O,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O
Gemini,O,O,O,O,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,B-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Word,Girls,often,face,societal,pressure,to,conform,to,certain,beauty,"standards,",which,can,negatively,impact,their,self-esteem,and,confidence.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,Numerous,White,athletes,participate,in,sports,at,all,levels,–,"local,","national,",and,international,"–,",demonstrating,physical,prowess,and,mental,strength.
Original GUS,B-GEN,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,O,O,O,O,O
Gemini,B-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
Word,It,should,be,noted,that,these,statistics,reflect,various,contexts,and,circumstances;,individual,experiences,may,vary,greatly,depending,on,factors,including,personal,"motivation,","resources,",support,"structures,",and,career,goals.
Original GUS,[,O,",",,O,",",,O,",",,O,",",,O,",",,O,",",,O,",",,O,",",,O,",",,O
Gemini,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Word,It's,almost,impossible,for,ugly,people,to,understand,what,it,feels,like,to,turn,heads,when,entering,a,room.
Original GUS,O,O,O,O,B-UNFAIR,I-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Word,Straight,people,simply,cannot,comprehend,why,anyone,would,choose,to,engage,in,oral,sex,instead,of,traditional,intercourse.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O


In [14]:
# Examples: disagree on bias type
# Note: nlargest picks the 15 sentences with the most differing words, not a random sample
print("DISAGREE ON BIAS TYPE (sample of 15):")

sample = type_disagree.nlargest(15, 'n_diff')
html_output = ""
for _, row in sample.iterrows():
    html_output += render_comparison_html(row)
display(HTML(html_output))

DISAGREE ON BIAS TYPE (sample of 15):


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
Word,Women,earn,less,than,men,across,virtually,every,occupation,and,"industry,",even,after,accounting,for,differences,in,hours,worked,and,other,job,characteristics.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,O,O,O,B-GEN,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Word,Black,men,experience,higher,unemployment,rates,than,other,ethnic,"groups,",even,when,controlling,for,factors,like,"age,",education,"level,",and,work,experience.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O,O,O,B-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Word,Do,heteros,ever,consider,how,much,more,interesting,things,would,be,if,everyone,were,attracted,to,each,other,rather,than,opposite,genders?
Original GUS,O,B-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,B-GEN,O,O,O,O,O,O,O,O,O,O,B-GEN,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
Word,Obese,individuals,frequently,experience,comorbidities,associated,with,poor,nutrition,and,sedentary,"lifestyles,",making,them,vulnerable,targets,for,chronic,diseases.,Addressing,root,causes,early,on,reduces,risk,and,improves,quality,of,life.
Original GUS,B-UNFAIR,I-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,O,O,O,O,O,O,O,O,O
Gemini,B-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,"Globally,",women,continue,to,bear,the,brunt,of,domestic,labor,and,caregiving,"duties,",restricting,their,time,and,energy,for,other,endeavors.
Original GUS,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,B-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38
Word,While,it's,true,that,there,are,instances,where,some,Muslims,may,engage,in,practices,contrary,to,Western,values,such,as,honor,killings,or,female,genital,"mutilation,",these,acts,are,condemned,by,the,vast,majority,of,Muslim,communities,worldwide.
Original GUS,O,O,O,O,O,O,O,O,B-GEN,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,B-UNFAIR,I-UNFAIR,I-STEREO,B-UNFAIR,I-UNFAIR,I-UNFAIR,O,O,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O
Gemini,O,O,O,O,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,B-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Word,Girls,often,face,societal,pressure,to,conform,to,certain,beauty,"standards,",which,can,negatively,impact,their,self-esteem,and,confidence.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,Numerous,White,athletes,participate,in,sports,at,all,levels,–,"local,","national,",and,international,"–,",demonstrating,physical,prowess,and,mental,strength.
Original GUS,B-GEN,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,O,O,O,O,O
Gemini,B-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
Word,Those,autistic,savants,sure,are,fascinating,–,but,let’s,face,it;,they,wouldn't,stand,a,chance,in,real,life,outside,of,their,narrow,expertise.
Original GUS,O,B-STEREO,I-STEREO,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
Gemini,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Word,It's,almost,impossible,for,ugly,people,to,understand,what,it,feels,like,to,turn,heads,when,entering,a,room.
Original GUS,O,O,O,O,B-UNFAIR,I-UNFAIR,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Word,Straight,people,simply,cannot,comprehend,why,anyone,would,choose,to,engage,in,oral,sex,instead,of,traditional,intercourse.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,Elders,continue,learning,new,skills,throughout,their,"lives,",whether,it,be,taking,up,"painting,",playing,musical,"instruments,",or,mastering,digital,tools.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,I-STEREO,I-STEREO,I-STEREO,O,I-STEREO
Gemini,B-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,If,you,ask,"me,",most,activists,could,benefit,from,taking,a,long,hard,look,at,themselves,before,trying,to,reform,society.
Original GUS,O,O,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,O,O,O,O,B-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
Word,Female,entrepreneurs,encounter,more,obstacles,when,starting,businesses,due,to,lack,of,access,to,funding,and,resources.
Original GUS,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO
Gemini,B-GEN,I-GEN,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
Word,Have,you,noticed,how,many,successful,businessmen,in,this,city,are,secretly,gay?,It,must,give,them,some,sort,of,advantage!
Original GUS,O,O,O,O,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,O,O,O,O,O,O,O,O
Gemini,O,O,O,O,B-GEN,I-GEN,I-GEN,I-GEN,I-GEN,I-GEN,O,O,O,B-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO,I-STEREO


## 9. Span Boundary Analysis (B- vs I-)

Even when the two annotation sets agree on a word's category, they can still disagree on span boundaries: one may open a new span with a B- tag where the other continues an existing span with an I- tag.

In [15]:
# Count differences only in BIO prefix (even when base is equal)
bio_only_diffs = 0
bio_examples = []

for _, row in df.iterrows():
    for i in range(len(row['words'])):
        hf_base = row['hf_base'][i]
        gm_base = row['gm_base'][i]
        hf_full = row['hf_tags'][i]
        gm_full = row['gm_tags'][i]
        
        # Same base category but different BIO prefix
        if hf_base == gm_base and hf_base != 'O' and hf_full != gm_full:
            bio_only_diffs += 1
            if len(bio_examples) < 5:
                bio_examples.append({
                    'word': row['words'][i],
                    'text': row['text'][:60],
                    'hf': hf_full,
                    'gm': gm_full
                })

print(f"Words with same category but different B/I prefix: {bio_only_diffs}")
if bio_examples:
    print("\nExamples:")
    for ex in bio_examples:
        print(f"  Word '{ex['word']}' in '{ex['text']}...' → GUS: {ex['hf']}, Gemini: {ex['gm']}")

Words with same category but different B/I prefix: 870

Examples:
  Word 'spends' in 'Should we assume that every retired individual spends hours ...' → GUS: I-STEREO, Gemini: B-STEREO
  Word 'prefer' in 'Don't men usually prefer simple solutions over complicated o...' → GUS: I-STEREO, Gemini: B-STEREO
  Word 'who' in 'What's wrong with these uneducated masses who don't know how...' → GUS: I-STEREO, Gemini: B-STEREO
  Word 'people' in 'Isn't it ironic that the same people who cry about free spee...' → GUS: B-GEN, Gemini: I-GEN
  Word 'subhuman' in 'Even animals engage in copulation, making asexuals subhuman ...' → GUS: I-UNFAIR, Gemini: B-UNFAIR
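
The count above is word-level: it flags each word whose category matches but whose B/I prefix differs. A complementary view is to collapse each tag sequence into spans and compare the spans directly. The sketch below is one possible way to do that, not part of the original analysis; `bio_to_spans` and the toy tag sequences are hypothetical, and it assumes flat BIO tags of the form `B-X`, `I-X`, or `O`.

```python
def bio_to_spans(tags):
    """Collapse a flat BIO tag sequence into (start, end_exclusive, category) spans.

    Hypothetical helper: an I- tag that does not continue an open span of the
    same category is treated as the start of a new span.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        base = tag.split('-', 1)[1] if '-' in tag else 'O'
        continues = tag.startswith('I-') and base == label
        if not continues:
            if label is not None:
                spans.append((start, i, label))  # close the open span
            if base == 'O':
                start, label = None, None
            else:
                start, label = i, base  # open a new span
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Toy sentence of five words (illustrative tags, not taken from the dataset)
hf_tags = ['O', 'B-STEREO', 'I-STEREO', 'I-STEREO', 'I-STEREO']
gm_tags = ['O', 'B-GEN', 'I-GEN', 'B-STEREO', 'I-STEREO']

hf_spans = bio_to_spans(hf_tags)
gm_spans = bio_to_spans(gm_tags)

# Spans that both annotators marked with identical boundaries and category
shared = set(hf_spans) & set(gm_spans)

print("GUS spans:   ", hf_spans)
print("Gemini spans:", gm_spans)
print("Exact span matches:", shared)
```

In this toy case the word at index 3 is tagged `I-STEREO` by one annotator and `B-STEREO` by the other, so the word-level check would count it as a prefix-only difference, while the span view shows that the two annotators did not produce any identical span.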


## 10. Export Divergent Sentences for Human Review

In [16]:
# Export sentences with significant differences to CSV
# (ratio > 20% or at least 2 different words)
divergent = df[(df['diff_ratio'] > 0.2) | (df['n_diff'] >= 2)].copy()
divergent = divergent.sort_values('diff_ratio', ascending=False)

export_rows = []
for _, row in divergent.iterrows():
    word_diffs = []
    for i in range(len(row['words'])):
        if row['hf_base'][i] != row['gm_base'][i]:
            word_diffs.append(f"{row['words'][i]}: GUS={row['hf_tags'][i]} / Gemini={row['gm_tags'][i]}")
    
    export_rows.append({
        'idx': row['idx'],
        'text': row['text'],
        'n_diff': row['n_diff'],
        'diff_ratio': f"{row['diff_ratio']:.0%}",
        'divergent_words': ' | '.join(word_diffs),
        'gm_rationale': row['gm_rationale'],
        'human_verdict': ''  # column to be filled manually
    })

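export_df = pd.DataFrame(export_rows)
# Write the review sheet; the filename matches the one reported in the summary
export_df.to_csv("annotation_divergences.csv", index=False)

display(export_df.head(10))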

Unnamed: 0,idx,text,n_diff,diff_ratio,divergent_words,gm_rationale,human_verdict
0,2395,Unconventional degrees lead to dead-end jobs.,6,100%,Unconventional: GUS=B-STEREO / Gemini=B-GEN | ...,The phrase 'Unconventional degrees' is a gener...,
1,672,It's impossible for attractive people to under...,12,100%,It's: GUS=O / Gemini=B-GEN | impossible: GUS=O...,The phrase 'It's impossible for attractive peo...,
2,721,Girls are poor listeners.,4,100%,Girls: GUS=B-STEREO / Gemini=B-GEN | are: GUS=...,The word 'Girls' is a broad generalization of ...,
3,719,Canadians must be naturally friendly because t...,10,100%,Canadians: GUS=B-STEREO / Gemini=B-GEN | must:...,The word 'Canadians' is a broad generalization...,
4,694,Girls need constant validation from others.,6,100%,Girls: GUS=B-STEREO / Gemini=B-GEN | need: GUS...,The word 'Girls' serves as a broad generalizat...,
5,693,Don't trust anyone under eighteen.,5,100%,Don't: GUS=B-STEREO / Gemini=O | trust: GUS=I-...,The phrase 'anyone under eighteen' is a genera...,
6,2468,Those disabled individuals sure can inspire us...,11,100%,Those: GUS=O / Gemini=B-GEN | disabled: GUS=B-...,The phrase 'Those disabled individuals' is a g...,
7,689,Stutterers add personality to conversations.,5,100%,Stutterers: GUS=B-STEREO / Gemini=B-GEN | add:...,The word 'Stutterers' is a broad generalizatio...,
8,687,Mixed race communities do not exist; everyone ...,9,100%,Mixed: GUS=B-STEREO / Gemini=B-GEN | race: GUS...,The phrase 'Mixed race communities' is a broad...,
9,2476,Men are more skilled multitaskers.,5,100%,Men: GUS=B-STEREO / Gemini=B-GEN | are: GUS=I-...,The word 'Men' is a broad generalization of a ...,
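
The export filter combines two criteria with OR, so a sentence is kept if either its share of differing words exceeds 20% or it has at least two differing words. The sketch below restates that predicate in plain Python on toy data. It assumes `diff_ratio` is `n_diff` divided by the number of words, which is consistent with the rows above (a 6-word sentence with `n_diff` 6 shows 100%); `needs_review` and the toy sentences are hypothetical.

```python
def needs_review(hf_base, gm_base, min_ratio=0.2, min_diff=2):
    """Return (n_diff, diff_ratio, keep) for one sentence.

    Assumption: diff_ratio = differing words / total words.
    """
    n_diff = sum(h != g for h, g in zip(hf_base, gm_base))
    diff_ratio = n_diff / len(hf_base) if hf_base else 0.0
    keep = diff_ratio > min_ratio or n_diff >= min_diff
    return n_diff, diff_ratio, keep


# Toy base-label sequences (illustrative, not taken from the dataset)
toy = {
    'identical':     (['GEN', 'O', 'O', 'O'], ['GEN', 'O', 'O', 'O']),
    'short, 1 diff': (['STEREO', 'O', 'O', 'O'], ['GEN', 'O', 'O', 'O']),
    'long, 2 diffs': (['GEN'] + ['O'] * 11, ['O', 'GEN'] + ['O'] * 10),
    'long, 1 diff':  (['GEN'] + ['O'] * 9, ['O'] * 10),
}

results = {name: needs_review(hf, gm) for name, (hf, gm) in toy.items()}
selected = [name for name, (_, _, keep) in results.items() if keep]

for name, (n_diff, ratio, keep) in results.items():
    print(f"{name:15s} n_diff={n_diff} diff_ratio={ratio:.0%} keep={keep}")
```

Under these thresholds a single differing word is enough to select a sentence of fewer than five words (via the ratio), while longer sentences are selected mainly through the count. That makes the filter permissive, which may explain why a large share of the dataset ends up in the export.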


## 11. Summary

In [17]:

print(f"Total phrases compared: {len(df)}")
print(f"Identical phrases: {n_identical} ({n_identical/len(df)*100:.1f}%)")
print(f"Phrases with differences: {n_with_diff} ({n_with_diff/len(df)*100:.1f}%)")
print(f"Average diff ratio: {df['diff_ratio'].mean():.2%}")
print()
print(f"Gemini found extra bias (where GUS says O): {len(gemini_found_more)} phrases")
print(f"GUS found extra bias (where Gemini says O): {len(gus_found_more)} phrases")
print(f"Disagree on bias type: {len(type_disagree)} phrases")
print(f"Differences only in span boundary (B/I): {bio_only_diffs} words")
print()
print(f"Phrases exported for human review: {len(export_df)}")
print("File: annotation_divergences.csv")

Total phrases compared: 3739
Identical phrases: 448 (12.0%)
Phrases with differences: 3291 (88.0%)
Average diff ratio: 32.16%

Gemini found extra bias (where GUS says O): 1991 phrases
GUS found extra bias (where Gemini says O): 2221 phrases
Disagree on bias type: 1461 phrases
Differences only in span boundary (B/I): 870 words

Phrases exported for human review: 2796
File: annotation_divergences.csv
