<a href="https://colab.research.google.com/github/itsbhoomika/Clarification-Seeking-LLMs-CS546/blob/main/Version2_LLM_judge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# ===========================================
# COMPLETE LLM JUDGE EVALUATION SCRIPT
# ===========================================

import ast
import json
import os
import textwrap
import time
from collections import Counter
from datetime import timedelta
from string import Template

import numpy as np
import pandas as pd
import google.generativeai as genai
from sklearn.metrics import classification_report, accuracy_score, f1_score
from tenacity import retry, stop_after_attempt, wait_exponential

# ===========================================
# CONFIGURATION
# ===========================================
MODEL_NAME = "gemini-2.5-flash"
API_KEY = os.environ.get("GEMINI_API_KEY", "PASTE-YOUR-KEY-HERE")  # Set GEMINI_API_KEY in the environment; never commit a real key
OUTPUT_CSV = "judged_results.csv"
OUTPUT_REPORT = "evaluation_report.csv"

# Configure API
genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL_NAME)

# The dataset itself is loaded in a later cell, after Google Drive is mounted

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# -----------------------
# USER CONFIG
# -----------------------
DATA_PATH = '/content/drive/My Drive/dataset_1k.csv'   # your uploaded dataset
OUTPUT_CSV = '/content/drive/My Drive/judge_outputs_sample1k.csv'
OUTPUT_REPORT = '/content/drive/My Drive/taxonomy_report_sample1k.csv'
MODEL_NAME = "models/gemini-2.5-flash"  # "flash" = cheaper/faster; "pro" = higher quality

os.environ["GEMINI_API_KEY"] = os.environ.get("GEMINI_API_KEY", "PASTE-YOUR-KEY-HERE")
assert os.environ["GEMINI_API_KEY"] and os.environ["GEMINI_API_KEY"] != "PASTE-YOUR-KEY-HERE", "Set GEMINI_API_KEY first."

In [4]:
# -----------------------
# Load and sample dataset
# -----------------------
df = pd.read_csv(DATA_PATH)
print("Loaded full dataset:", df.shape)
df = df.sample(n=min(1000, len(df)), random_state=42).reset_index(drop=True)  # cap at dataset size
print(f"Sampled {len(df)} rows.")

def ensure_jsonable(x):
    if isinstance(x, str):
        try:
            return ast.literal_eval(x)
        except Exception:
            return x
    return x

json_cols = ["disambig_questions","qaPairs","facets","facet_taxonomies","taxonomy"]
for col in json_cols:
    if col in df.columns:
        df[col] = df[col].apply(ensure_jsonable)

if "pred_cq" not in df.columns:
    def first_cq(x):
        if isinstance(x, list) and len(x)>0 and isinstance(x[0], str):
            return x[0]
        return ""
    df["pred_cq"] = df["disambig_questions"].apply(first_cq) if "disambig_questions" in df.columns else ""

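def normalize_taxonomy(x):
    """Return a list of stripped taxonomy labels for a list, a single string, or anything else."""
    if isinstance(x, list):
        return [str(s).strip() for s in x]
    if isinstance(x, str):
        return [x.strip()]
    return []

df["taxonomy"] = df["taxonomy"].apply(normalize_taxonomy)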

Loaded full dataset: (1000, 9)
Sampled 1000 rows.
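
The list-valued columns come back from the CSV as strings, which is why `ensure_jsonable` runs them through `ast.literal_eval`. Below is a minimal standalone sketch of that round trip. `parse_cell` and the sample values are hypothetical stand-ins, and the narrower `except` clause is an assumption about which errors matter for inputs like these.

```python
import ast

def parse_cell(value):
    """Turn a stringified Python literal back into an object; leave anything else alone."""
    if isinstance(value, str):
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Plain text such as a question is not a literal, so keep it as-is.
            return value
    return value

examples = [
    "['Temporal Dependency', 'Entity Reference']",  # stringified list, as stored in a CSV cell
    "Who plays Annabeth?",                          # ordinary text
    "[]",                                           # empty list
    None,                                           # missing value
]
parsed = [parse_cell(v) for v in examples]
for raw, value in zip(examples, parsed):
    print(repr(raw), "->", repr(value))
```

Because `ast.literal_eval` only accepts literals, it will not execute arbitrary expressions found in the data, unlike `eval`.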


In [5]:
# PROMPT TEMPLATE
# ===========================================
ALLOWED_TAXONOMY = [
    "Entity Reference",
    "Part of Entity Reference",
    "Relationships Between Entities",
    "Underspecified Common Nouns",
    "Degree of an Action",
    "Means of an Action",
    "Output Type",
    "Temporal Dependency",
    "Geographical Dependency",
    "Information Source Dependency"
]

In [6]:
PROMPT_TMPL = Template(textwrap.dedent("""

You are an expert evaluator auditing an LLM-generated disambiguation dataset.

ALLOWED TAXONOMY CATEGORIES (use ONLY these):
$allowed

INPUT DATA TO EVALUATE:
Question: $q
Is_Ambiguous Flag: $is_amb
Disambiguating Questions: $dq
QA Pairs: $qa
Generated Facets: $facets
Facet Taxonomies: $facet_t
Expected Taxonomy: $gold

EVALUATION CRITERIA:

1. TAXONOMY AGREEMENT (taxonomy_agreement, taxonomy_score):
   - Identify which taxonomy categories truly apply to this question
   - Compare to Expected Taxonomy
   - taxonomy_agreement: true if exact match
   - taxonomy_score: 1.0 perfect, 0.5-0.9 partial, 0.0 wrong
   - Store your taxonomies in gemini_taxonomy

2. FACET COVERAGE (facet_coverage_ok):
   - Check if Generated Facets cover ALL ambiguity dimensions
   - true if comprehensive, false if missing key facets
   - Store your ideal facets in gemini_facets

3. FACET-TAXONOMY ALIGNMENT (facet_taxonomy_ok):
   - Check if facets map correctly to taxonomy categories
   - true if all mappings correct
   - Store mappings in gemini_facet_taxonomies

4. AMBIGUITY FLAG (is_ambiguous_ok):
   - Check if Is_Ambiguous Flag is correct
   - true if flag matches actual ambiguity

5. DISAMBIGUATION QUALITY (disambig_quality_score):
   - Rate disambiguating questions 0.0-1.0
   - If not ambiguous, set to 1.0

6. QA ALIGNMENT (qa_alignment_ok):
   - Check if QA pairs match their facets
   - true if properly aligned

OUTPUT ONLY THIS JSON (no markdown, no explanation):
{
  "taxonomy_agreement": bool,
  "taxonomy_score": float,
  "facet_coverage_ok": bool,
  "facet_taxonomy_ok": bool,
  "is_ambiguous_ok": bool,
  "disambig_quality_score": float,
  "qa_alignment_ok": bool,
  "gemini_taxonomy": [str],
  "gemini_facets": [str],
  "gemini_facet_taxonomies": [{"facet": str, "taxonomy": str}],
  "suggested_fixes": str,
  "notes": str
}
""").strip())

def build_prompt(row):
    """Build evaluation prompt from dataframe row"""
    return PROMPT_TMPL.substitute(
        q=str(row["question"]).strip(),
        is_amb=str(bool(row["is_ambiguous"])).lower(),
        dq=json.dumps(row.get("disambig_questions", []), ensure_ascii=False),
        qa=json.dumps(row.get("qaPairs", []), ensure_ascii=False),
        facets=json.dumps(row.get("facets", []), ensure_ascii=False),
        facet_t=json.dumps(row.get("facet_taxonomies", []), ensure_ascii=False),
        gold=json.dumps(row.get("taxonomy", []), ensure_ascii=False),
        allowed="\n".join([f'  - "{x}"' for x in ALLOWED_TAXONOMY]),
    )
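
`PROMPT_TMPL` is a `string.Template`, where `$` is the only special character. The short check below, using made-up template strings, shows how braces and dollar signs are treated. The point is that brace doubling is a `str.format` convention and is not an escape here.

```python
from string import Template

# Braces are ordinary characters for string.Template; only $name is substituted.
tmpl = Template('Question: $q\nReturn: {"ok": bool}')
rendered = tmpl.substitute(q="Who wrote it?")
print(rendered)

# Doubled braces are NOT collapsed (unlike str.format); they stay doubled.
doubled = Template('{{"ok": bool}}').substitute()
print(doubled)

# A literal dollar sign is written as $$.
price = Template("Cost: $$5 for $item").substitute(item="tea")
print(price)
```

If the JSON schema in a prompt is meant to appear with single braces, it should therefore be written with single braces in the template.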


In [7]:
# API FUNCTIONS
# ===========================================
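@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=30), reraise=True)
def judge_once(prompt):
    """Call the API to judge a single prompt, retrying failed calls with exponential backoff"""
    response = model.generate_content(prompt)
    return response.text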

def coerce_judge(raw_response):
    """Parse the LLM response into expected dictionary format"""
    text = raw_response.strip()

    # Remove markdown code blocks
    if text.startswith('```json'):
        text = text[7:]
    elif text.startswith('```'):
        text = text[3:]
    if text.endswith('```'):
        text = text[:-3]
    text = text.strip()

    # Parse JSON
    result = json.loads(text)

    # Ensure all required fields exist with defaults
    required_fields = {
        'taxonomy_agreement': False,
        'taxonomy_score': 0.0,
        'facet_coverage_ok': False,
        'facet_taxonomy_ok': False,
        'is_ambiguous_ok': False,
        'disambig_quality_score': 0.0,
        'qa_alignment_ok': False,
        'gemini_taxonomy': [],
        'gemini_facets': [],
        'gemini_facet_taxonomies': [],
        'suggested_fixes': '',
        'notes': ''
    }

    for field, default in required_fields.items():
        if field not in result:
            result[field] = default

    return result
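
To see what `coerce_judge` does to a typical reply, here is a trimmed-down sketch run on a fabricated response string. `parse_reply`, `DEFAULTS` (a three-field subset of the real schema), and the sample reply are illustrative assumptions, not output from the model.

```python
import json

FENCE = "`" * 3  # three backticks, built programmatically to keep this example easy to embed

# Hypothetical subset of the judge's fields, with their fallback values.
DEFAULTS = {"taxonomy_agreement": False, "taxonomy_score": 0.0, "notes": ""}

def parse_reply(raw):
    """Strip an optional Markdown fence, parse the JSON, and fill in missing fields."""
    text = raw.strip()
    if text.startswith(FENCE + "json"):
        text = text[len(FENCE) + 4:]
    elif text.startswith(FENCE):
        text = text[len(FENCE):]
    if text.endswith(FENCE):
        text = text[:-len(FENCE)]
    parsed = json.loads(text.strip())
    # Values from the reply win; defaults only fill the gaps.
    return {**DEFAULTS, **parsed}

# Fabricated reply wrapped in a Markdown fence, with the "notes" field missing.
reply = FENCE + 'json\n{"taxonomy_agreement": true, "taxonomy_score": 0.5}\n' + FENCE
result = parse_reply(reply)
print(result)
```

A reply that is not valid JSON still raises `json.JSONDecodeError`, so the caller needs its own error handling, as the judge loop does.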


In [8]:
# JUDGE LOOP
# ===========================================
rows = []
start_time = time.time()
failed_count = 0
success_count = 0

print("=" * 60)
print("Starting LLM-as-a-Judge Evaluation")
print("=" * 60)
print(f"Total rows to process: {len(df)}")
print(f"Model: {MODEL_NAME}")
print(f"Start time: {time.strftime('%H:%M:%S')}\n")

# Test API connectivity
print("🔍 Testing API connectivity...")
try:
    test_resp = model.generate_content("Return JSON: {\"test\": true}")
    print(f"✓ API responding!\n")
except Exception as e:
    print(f"✗ API test failed: {type(e).__name__}: {str(e)}\n")

print("Starting evaluation loop...\n")

for i, r in df.iterrows():
    row_start = time.time()
    verbose = (i < 3)

    if verbose:
        print(f"\n{'='*40}")
        print(f"Processing row {i+1}/{len(df)}")
        print(f"Question: {str(r['question'])[:100]}...")

    try:
        # Build prompt
        if verbose:
            print(f"Building prompt...", end=" ", flush=True)
        prompt = build_prompt(r)
        if verbose:
            print(f"done")

        # Call API
        if verbose:
            print(f"Calling API...", end=" ", flush=True)
        else:
            print(f"[{i+1:3d}/{len(df)}] Calling API...", end=" ", flush=True)

        judged = judge_once(prompt)
        clean = coerce_judge(judged)
        success_count += 1

        if verbose:
            print(f"✓ Row completed in {time.time()-row_start:.2f}s")
        else:
            print(f"✓ ({time.time()-row_start:.1f}s)")

    except Exception as e:
        failed_count += 1
        error_msg = f"{type(e).__name__}: {str(e)[:100]}"

        if verbose:
            print(f"\n✗ ERROR: {error_msg}")
        else:
            print(f"✗ {type(e).__name__}")

        clean = {
            'taxonomy_agreement': False,
            'facet_coverage_ok': False,
            'facet_taxonomy_ok': False,
            'is_ambiguous_ok': False,
            'qa_alignment_ok': False,
            'taxonomy_score': 0.0,
            'disambig_quality_score': 0.0,
            'gemini_taxonomy': [],
            'gemini_facets': [],
            'gemini_facet_taxonomies': [],
            'suggested_fixes': f"error:{type(e).__name__}",
            'notes': str(e)[:200]
        }

    rows.append(clean)

    # Progress update every 10 rows
    if (i + 1) % 10 == 0:
        elapsed = time.time() - start_time
        avg_time = elapsed / (i + 1)
        remaining = avg_time * (len(df) - (i + 1))
        print(f"\n{'─'*60}")
        print(f"Progress: {i+1}/{len(df)} | Elapsed: {str(timedelta(seconds=int(elapsed)))} | "
              f"ETA: {str(timedelta(seconds=int(remaining)))} | Avg: {avg_time:.1f}s/row")
        print(f"Success: {success_count} | Failed: {failed_count}")
        print(f"{'─'*60}\n")

total_time = time.time() - start_time
print("\n" + "=" * 60)
print("JUDGE EVALUATION COMPLETE")
print("=" * 60)

# Create output dataframe
judged_df = pd.DataFrame(rows)
out = pd.concat([df, judged_df], axis=1)

# ===========================================
# EVALUATION REPORT
# ===========================================
print("\n" + "=" * 80)
print("📊 DATASET QUALITY EVALUATION")
print("=" * 80)

total_rows = len(out)
overall_quality = out[[
    'taxonomy_agreement', 'facet_coverage_ok', 'facet_taxonomy_ok',
    'is_ambiguous_ok', 'qa_alignment_ok'
]].mean().mean() * 100

print(f"\n🎯 OVERALL QUALITY SCORE: {overall_quality:.1f}/100")
print(f"📋 Total Samples: {total_rows}")
print(f"✅ Successful: {success_count} ({success_count/total_rows*100:.1f}%)")
print(f"❌ Failed: {failed_count} ({failed_count/total_rows*100:.1f}%)")

# Component scores
print(f"\n📈 COMPONENT BREAKDOWN:")
print("─" * 80)

components = {
    'Taxonomy Accuracy': out['taxonomy_agreement'].mean() * 100,
    'Facet Coverage': out['facet_coverage_ok'].mean() * 100,
    'Facet-Taxonomy Alignment': out['facet_taxonomy_ok'].mean() * 100,
    'Ambiguity Detection': out['is_ambiguous_ok'].mean() * 100,
    'QA Alignment': out['qa_alignment_ok'].mean() * 100
}

for name, score in components.items():
    bar = '█' * int(score/2.5) + '░' * (40 - int(score/2.5))
    icon = '✅' if score >= 90 else '⚠️' if score >= 75 else '❌'
    print(f"{icon} {name:30s} [{bar}] {score:.1f}%")

# Taxonomy analysis
print(f"\n🏷️  TAXONOMY GENERATION ANALYSIS:")
print("─" * 80)

tax_correct = int(out['taxonomy_agreement'].sum())
tax_incorrect = int((~out['taxonomy_agreement']).sum())
tax_score_mean = out['taxonomy_score'].mean()

print(f"Accuracy: {tax_correct/total_rows:.1%} ({tax_correct} correct, {tax_incorrect} incorrect)")
print(f"Mean Score: {tax_score_mean:.3f}")

# Most common taxonomies
all_generated_taxonomies = []
for tax_list in out['gemini_taxonomy'].dropna():
    if isinstance(tax_list, list):
        all_generated_taxonomies.extend(tax_list)

if all_generated_taxonomies:
    tax_counts = Counter(all_generated_taxonomies)
    print(f"\nMost Generated Taxonomies:")
    for tax, count in tax_counts.most_common(10):
        print(f"  • {tax}: {count} times")

# Taxonomy errors
taxonomy_errors = []
for idx, row in out.iterrows():
    expected = set(row.get('taxonomy', []) if isinstance(row.get('taxonomy', []), list) else [])
    generated = set(row.get('gemini_taxonomy', []) if isinstance(row.get('gemini_taxonomy', []), list) else [])
    if expected != generated:
        taxonomy_errors.append({
            'missing': list(expected - generated),
            'extra': list(generated - expected)
        })

if taxonomy_errors:
    all_missing = [tax for error in taxonomy_errors for tax in error['missing']]
    all_extra = [tax for error in taxonomy_errors for tax in error['extra']]

    missing_counts = Counter(all_missing)
    extra_counts = Counter(all_extra)

    if missing_counts:
        print(f"\n❌ Most MISSED Taxonomies:")
        for tax, count in missing_counts.most_common(5):
            print(f"  • {tax}: missed {count} times")

    if extra_counts:
        print(f"\n➕ Most INCORRECTLY ADDED:")
        for tax, count in extra_counts.most_common(5):
            print(f"  • {tax}: added {count} times")

# Facet analysis
print(f"\n📑 FACET GENERATION ANALYSIS:")
print("─" * 80)

facet_coverage = out['facet_coverage_ok'].mean()
facet_alignment = out['facet_taxonomy_ok'].mean()

print(f"Coverage Rate: {facet_coverage:.1%}")
print(f"Taxonomy Alignment: {facet_alignment:.1%}")

facet_counts = out['gemini_facets'].apply(lambda x: len(x) if isinstance(x, list) else 0)
print(f"\nFacet Statistics:")
print(f"  Mean: {facet_counts.mean():.1f}")
print(f"  Median: {facet_counts.median():.1f}")
print(f"  Range: {facet_counts.min():.0f} - {facet_counts.max():.0f}")

# Disambiguation quality
print(f"\n💬 DISAMBIGUATION QUALITY:")
print("─" * 80)

ambig_accuracy = out['is_ambiguous_ok'].mean()
cq_quality = out['disambig_quality_score'].mean()

print(f"Ambiguity Detection: {ambig_accuracy:.1%}")
print(f"CQ Quality: {cq_quality:.2f}/1.0")

# QA alignment
print(f"\n🔗 QA ALIGNMENT:")
print("─" * 80)

qa_aligned = int(out['qa_alignment_ok'].sum())
qa_misaligned = int((~out['qa_alignment_ok']).sum())

print(f"Alignment Rate: {qa_aligned/total_rows:.1%} ({qa_aligned} aligned, {qa_misaligned} misaligned)")

# Recommendations
print(f"\n🚨 TOP ISSUES TO FIX:")
print("=" * 80)

recommendations = []

if components['Taxonomy Accuracy'] < 80:
    recommendations.append(f"🔴 CRITICAL: Taxonomy accuracy {components['Taxonomy Accuracy']:.1f}% → Improve taxonomy identification")

if components['Facet Coverage'] < 80:
    recommendations.append(f"🔴 CRITICAL: Facet coverage {components['Facet Coverage']:.1f}% → Add comprehensive facet examples")

if components['Facet-Taxonomy Alignment'] < 90:
    recommendations.append(f"🟡 HIGH: Facet alignment {components['Facet-Taxonomy Alignment']:.1f}% → Fix facet-taxonomy mappings")

if components['Ambiguity Detection'] < 85:
    recommendations.append(f"🟡 HIGH: Ambiguity detection {components['Ambiguity Detection']:.1f}% → Clarify ambiguity criteria")

if components['QA Alignment'] < 90:
    recommendations.append(f"🔴 CRITICAL: QA alignment {components['QA Alignment']:.1f}% → Improve answer generation")

if recommendations:
    for i, rec in enumerate(recommendations, 1):
        print(f"\n{i}. {rec}")
else:
    print("\n✅ No critical issues found!")

# Save results
print(f"\n💾 SAVING RESULTS:")
print("─" * 80)

out.to_csv(OUTPUT_CSV, index=False)
print(f"✓ Detailed results: {OUTPUT_CSV}")

summary_data = {
    'Overall Quality Score': overall_quality,
    'Taxonomy Accuracy (%)': components['Taxonomy Accuracy'],
    'Taxonomy Mean Score': tax_score_mean,
    'Facet Coverage (%)': components['Facet Coverage'],
    'Facet-Taxonomy Alignment (%)': components['Facet-Taxonomy Alignment'],
    'Ambiguity Detection (%)': components['Ambiguity Detection'],
    'CQ Quality Score': cq_quality,
    'QA Alignment (%)': components['QA Alignment'],
    'Total Samples': total_rows,
    'Successful': success_count,
    'Failed': failed_count
}

summary_df = pd.DataFrame([summary_data])
summary_df.to_csv(OUTPUT_REPORT, index=False)
print(f"✓ Summary report: {OUTPUT_REPORT}")

print("\n" + "=" * 80)
print("✅ EVALUATION COMPLETE")
print("=" * 80)

Starting LLM-as-a-Judge Evaluation
Total rows to process: 1000
Model: models/gemini-2.5-flash
Start time: 22:13:37

🔍 Testing API connectivity...
✓ API responding!

Starting evaluation loop...


Processing row 1/1000
Question: Who plays annabeth in the lightning thief musical?...
Building prompt... done
Calling API... ✓ Row completed in 7.59s

Processing row 2/1000
Question: When did the last soldier of the civil war die?...
Building prompt... done
Calling API... ✓ Row completed in 22.01s

Processing row 3/1000
Question: Where did the baggy pants trend come from?...
Building prompt... done
Calling API... ✓ Row completed in 13.91s
[  4/1000] Calling API... ✓ (6.9s)
[  5/1000] Calling API... ✓ (17.3s)
[  6/1000] Calling API... ✓ (13.3s)
[  7/1000] Calling API... ✓ (12.3s)
[  8/1000] Calling API... ✓ (34.9s)
[  9/1000] Calling API... 

KeyboardInterrupt: 
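
The overall quality score in the report cell is the mean of the per-check pass rates, where each check is a boolean column. The sketch below reproduces that arithmetic on a made-up four-row sample using only the standard library. The rows and the three-check subset are hypothetical, not taken from the dataset.

```python
from statistics import mean

# Hypothetical judged rows; each key is one boolean check from the judge.
judged = [
    {"taxonomy_agreement": True,  "facet_coverage_ok": True, "qa_alignment_ok": False},
    {"taxonomy_agreement": False, "facet_coverage_ok": True, "qa_alignment_ok": False},
    {"taxonomy_agreement": True,  "facet_coverage_ok": True, "qa_alignment_ok": True},
    {"taxonomy_agreement": True,  "facet_coverage_ok": True, "qa_alignment_ok": True},
]
checks = ["taxonomy_agreement", "facet_coverage_ok", "qa_alignment_ok"]

# Pass rate per check, as a percentage.
per_check = {c: mean(1.0 if row[c] else 0.0 for row in judged) * 100 for c in checks}

# Overall score: unweighted mean of the per-check pass rates.
overall = mean(per_check.values())

for name, score in per_check.items():
    print(f"{name:22s} {score:.1f}%")
print(f"overall                {overall:.1f}/100")
```

Rows whose API call failed are recorded with every check set to `False`, so they lower these rates in the same way a genuinely bad row would.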