# SelfCheckGPT Coherence Variants - Interactive Demo

This notebook demonstrates the three coherence-based hallucination detection variants:
1. **SelfCheckShogenji** - Ratio-based independence measure (C2)
2. **SelfCheckFitelson** - Confirmation-based support measure
3. **SelfCheckOlsson** - Relative overlap measure (C1)
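
For orientation, the three measures can be sketched for two propositions A and B using their textbook definitions. This is an illustrative sketch only, not the code in `selfcheckgpt.modeling_coherence`: the function names are hypothetical, the probabilities are assumed to lie strictly between 0 and 1, and the Fitelson variant is reduced to its two-proposition form built on the Kemeny-Oppenheim support measure.

```python
def shogenji(p_a, p_b, p_ab):
    """Joint probability over the product of marginals; 1.0 means independence."""
    return p_ab / (p_a * p_b)


def olsson(p_a, p_b, p_ab):
    """Relative overlap P(A and B) / P(A or B); always between 0 and 1."""
    return p_ab / (p_a + p_b - p_ab)


def kemeny_oppenheim(p_h, p_e, p_he):
    """Support that E lends to H, in [-1, 1]; 0.0 means E is irrelevant to H."""
    p_e_given_h = p_he / p_h
    p_e_given_not_h = (p_e - p_he) / (1.0 - p_h)
    denom = p_e_given_h + p_e_given_not_h
    return 0.0 if denom == 0 else (p_e_given_h - p_e_given_not_h) / denom


def fitelson(p_a, p_b, p_ab):
    """Average of the support A gives B and the support B gives A."""
    return 0.5 * (kemeny_oppenheim(p_a, p_b, p_ab) + kemeny_oppenheim(p_b, p_a, p_ab))


# Made-up probabilities for two statements that tend to be true together.
p_a, p_b, p_ab = 0.5, 0.5, 0.4
print("Shogenji:", shogenji(p_a, p_b, p_ab))
print("Olsson:  ", olsson(p_a, p_b, p_ab))
print("Fitelson:", fitelson(p_a, p_b, p_ab))
```

Under these definitions, independent statements give a Shogenji value of 1 and a Fitelson value of 0, while the Olsson value depends only on how much the two statements overlap.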

## Setup

Make sure you have set your OpenAI API key:
```bash
export OPENAI_API_KEY="your-api-key-here"
```
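
Before running the cells below, it can help to confirm that the key is actually visible to the Python process. The helper here is a hypothetical convenience, not part of `selfcheckgpt`; it only checks that the variable is present and non-empty, not that the key is valid.

```python
import os


def openai_key_is_set(env=os.environ):
    """Return True if OPENAI_API_KEY is present and non-empty in the given mapping."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not openai_key_is_set():
    print("OPENAI_API_KEY is not set; the predict() calls below will likely fail.")
```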

In [None]:
import numpy as np
import spacy
from selfcheckgpt.modeling_coherence import SelfCheckShogenji, SelfCheckFitelson, SelfCheckOlsson

## Initialize Coherence Variants

In [None]:
# Initialize all three coherence variants with gpt-4o-mini model
selfcheck_shogenji = SelfCheckShogenji(model="gpt-4o-mini")
selfcheck_fitelson = SelfCheckFitelson(model="gpt-4o-mini")
selfcheck_olsson = SelfCheckOlsson(model="gpt-4o-mini")

## Test Case 1: High Coherence (Truthful Statements)

We expect low hallucination scores for factual statements that are consistent across samples.

In [None]:
# Original truthful passage
truthful_passage = """
Paris is the capital of France. It is known for the Eiffel Tower, which was built in 1889. 
The city is located on the Seine River. Paris is often called the City of Light.
""".replace("\n", " ").strip()

# Hand-written stand-ins for stochastically sampled passages (alternative phrasings of the same facts)
truthful_samples = [
    "Paris, the capital city of France, is famous for the Eiffel Tower constructed in 1889. The city sits on the banks of the Seine River.",
    "The French capital Paris is home to the iconic Eiffel Tower from 1889. Paris is situated along the Seine River and nicknamed the City of Light.",
    "France's capital is Paris, where you'll find the Eiffel Tower built in 1889. The Seine River runs through Paris, also known as the City of Light."
]

print("Original passage:")
print(truthful_passage)
print("\nSample passages:")
for i, sample in enumerate(truthful_samples, 1):
    print(f"{i}. {sample}")

In [None]:
# Split the passage into sentences with spaCy; len(sent) counts tokens, so spans of 3 tokens or fewer are dropped
nlp = spacy.load("en_core_web_sm")
truthful_sentences = [sent.text.strip() for sent in nlp(truthful_passage).sents if len(sent) > 3]

print(f"\nEvaluating {len(truthful_sentences)} sentences:")
for i, sent in enumerate(truthful_sentences, 1):
    print(f"{i}. {sent}")

In [None]:
# Evaluate with Shogenji
print("\n=== SelfCheckShogenji ===")
scores_shogenji = selfcheck_shogenji.predict(
    sentences=truthful_sentences,
    sampled_passages=truthful_samples,
    verbose=True
)

print("\nHallucination scores (lower = more truthful):")
for i, (sent, score) in enumerate(zip(truthful_sentences, scores_shogenji), 1):
    print(f"{i}. [{score:.4f}] {sent}")
print(f"\nMean hallucination score: {np.mean(scores_shogenji):.4f}")

In [None]:
# Evaluate with Fitelson
print("\n=== SelfCheckFitelson ===")
scores_fitelson = selfcheck_fitelson.predict(
    sentences=truthful_sentences,
    sampled_passages=truthful_samples,
    verbose=True
)

print("\nHallucination scores (lower = more truthful):")
for i, (sent, score) in enumerate(zip(truthful_sentences, scores_fitelson), 1):
    print(f"{i}. [{score:.4f}] {sent}")
print(f"\nMean hallucination score: {np.mean(scores_fitelson):.4f}")

In [None]:
# Evaluate with Olsson
print("\n=== SelfCheckOlsson ===")
scores_olsson = selfcheck_olsson.predict(
    sentences=truthful_sentences,
    sampled_passages=truthful_samples,
    verbose=True
)

print("\nHallucination scores (lower = more truthful):")
for i, (sent, score) in enumerate(zip(truthful_sentences, scores_olsson), 1):
    print(f"{i}. [{score:.4f}] {sent}")
print(f"\nMean hallucination score: {np.mean(scores_olsson):.4f}")

## Test Case 2: Low Coherence (Hallucinated Statements)

We expect high hallucination scores for false statements that are inconsistent across samples.

In [None]:
# Hallucinated passage with false information
hallucinated_passage = """
Paris is the capital of Germany. It is known for the Leaning Tower, which was built in 1776. 
The city is located on the Thames River. Paris is often called the City of Eternal Sunshine.
""".replace("\n", " ").strip()

# Samples still contain truthful information (creating inconsistency)
# Using the same truthful samples from Test Case 1

print("Hallucinated passage (contains false info):")
print(hallucinated_passage)
print("\nTruthful sample passages (creating inconsistency):")
for i, sample in enumerate(truthful_samples, 1):
    print(f"{i}. {sample}")

In [None]:
# Split the hallucinated passage into sentences (same token-length filter as above)
hallucinated_sentences = [sent.text.strip() for sent in nlp(hallucinated_passage).sents if len(sent) > 3]

print(f"\nEvaluating {len(hallucinated_sentences)} sentences:")
for i, sent in enumerate(hallucinated_sentences, 1):
    print(f"{i}. {sent}")

In [None]:
# Evaluate with Shogenji
print("\n=== SelfCheckShogenji ===")
scores_shogenji_hall = selfcheck_shogenji.predict(
    sentences=hallucinated_sentences,
    sampled_passages=truthful_samples,
    verbose=True
)

print("\nHallucination scores (higher = more hallucinated):")
for i, (sent, score) in enumerate(zip(hallucinated_sentences, scores_shogenji_hall), 1):
    print(f"{i}. [{score:.4f}] {sent}")
print(f"\nMean hallucination score: {np.mean(scores_shogenji_hall):.4f}")

In [None]:
# Evaluate with Fitelson
print("\n=== SelfCheckFitelson ===")
scores_fitelson_hall = selfcheck_fitelson.predict(
    sentences=hallucinated_sentences,
    sampled_passages=truthful_samples,
    verbose=True
)

print("\nHallucination scores (higher = more hallucinated):")
for i, (sent, score) in enumerate(zip(hallucinated_sentences, scores_fitelson_hall), 1):
    print(f"{i}. [{score:.4f}] {sent}")
print(f"\nMean hallucination score: {np.mean(scores_fitelson_hall):.4f}")

In [None]:
# Evaluate with Olsson
print("\n=== SelfCheckOlsson ===")
scores_olsson_hall = selfcheck_olsson.predict(
    sentences=hallucinated_sentences,
    sampled_passages=truthful_samples,
    verbose=True
)

print("\nHallucination scores (higher = more hallucinated):")
for i, (sent, score) in enumerate(zip(hallucinated_sentences, scores_olsson_hall), 1):
    print(f"{i}. [{score:.4f}] {sent}")
print(f"\nMean hallucination score: {np.mean(scores_olsson_hall):.4f}")

## Comparison of Variants

Compare how well each variant discriminates between truthful and hallucinated content.
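
Beyond comparing means, discrimination can also be summarized at the sentence level as the probability that a hallucinated sentence receives a higher score than a truthful one (a pairwise AUC). The helper below is a hypothetical sketch, not part of `selfcheckgpt`, and the example scores are made up for illustration.

```python
def pairwise_auc(truthful_scores, hallucinated_scores):
    """Fraction of (hallucinated, truthful) pairs ranked correctly; ties count as half."""
    total = len(truthful_scores) * len(hallucinated_scores)
    if total == 0:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for h in hallucinated_scores:
        for t in truthful_scores:
            if h > t:
                wins += 1.0
            elif h == t:
                wins += 0.5
    return wins / total


# Made-up scores: one hallucinated sentence (0.25) ranks below one truthful sentence (0.3).
example_truthful = [0.1, 0.2, 0.3]
example_hallucinated = [0.25, 0.8, 0.9]
print(f"Pairwise AUC: {pairwise_auc(example_truthful, example_hallucinated):.3f}")
```

A value of 1.0 means every hallucinated sentence outscores every truthful one, and 0.5 means the variant does no better than chance. After the evaluation cells have run, the same helper could be applied to, for example, `scores_shogenji` and `scores_shogenji_hall`.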

In [None]:
import matplotlib.pyplot as plt

# Compare mean scores
variants = ['Shogenji', 'Fitelson', 'Olsson']
truthful_means = [
    np.mean(scores_shogenji),
    np.mean(scores_fitelson),
    np.mean(scores_olsson)
]
hallucinated_means = [
    np.mean(scores_shogenji_hall),
    np.mean(scores_fitelson_hall),
    np.mean(scores_olsson_hall)
]

x = np.arange(len(variants))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 6))
rects1 = ax.bar(x - width/2, truthful_means, width, label='Truthful', color='green', alpha=0.7)
rects2 = ax.bar(x + width/2, hallucinated_means, width, label='Hallucinated', color='red', alpha=0.7)

ax.set_ylabel('Mean Hallucination Score', fontsize=12)
ax.set_title('Coherence Variants: Truthful vs Hallucinated Content', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(variants, fontsize=11)
ax.legend(fontsize=11)
ax.grid(axis='y', alpha=0.3)

# Add value labels on bars
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        ax.annotate(f'{height:.3f}',
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom', fontsize=9)

autolabel(rects1)
autolabel(rects2)

plt.tight_layout()
plt.show()

print("\nSeparation (Hallucinated - Truthful):")
for i, variant in enumerate(variants):
    separation = hallucinated_means[i] - truthful_means[i]
    print(f"{variant}: {separation:.4f}")

## Test with Wiki Bio GPT3 Hallucination Dataset (Small Subset)

Load a small subset (5 passages) of the actual evaluation dataset and score the first passage with each variant.

In [None]:
from datasets import load_dataset

# Load dataset
dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")["evaluation"]

# Take first 5 passages for quick testing
num_test_passages = 5
test_subset = dataset.select(range(num_test_passages))

print(f"Loaded {num_test_passages} passages from wiki_bio_gpt3_hallucination dataset")
print("\nExample passage:")
print(f"Wikipedia reference text: {test_subset[0]['wiki_bio_text'][:200]}...")
print(f"GPT-3 sentences to evaluate: {len(test_subset[0]['gpt3_sentences'])}")

In [None]:
# Evaluate first passage with all three variants
idx = 0
passage = test_subset[idx]['wiki_bio_text']
gpt3_text = test_subset[idx]['gpt3_text']
sentences = test_subset[idx]['gpt3_sentences']
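# The dataset stores per-sentence labels as strings ('accurate', 'minor_inaccurate',
# 'major_inaccurate'); binarize them so the 0/1 comparisons below behave as intended.
raw_annotations = test_subset[idx]['annotation']
annotations = [0 if a in ('accurate', 0) else 1 for a in raw_annotations]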

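# In a real scenario, the samples would be independent stochastic generations from the LLM
# (the dataset's 'gpt3_text_samples' field should provide these, if you want to use them).
# Simplification for this demo: reuse the evaluated GPT-3 text three times. Every sentence
# then trivially agrees with the "samples", so the scores below exercise the API and caching
# path but are not a meaningful measure of hallucination.
samples = [gpt3_text] * 3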

print(f"\nEvaluating passage {idx}:")
print(f"Sentences to evaluate: {len(sentences)}")
print(f"Ground truth annotations: {annotations}")
print(f"\nFirst 3 sentences:")
for i in range(min(3, len(sentences))):
    label = "[ACCURATE]" if annotations[i] == 0 else "[INACCURATE]"
    print(f"{i+1}. {label} {sentences[i][:100]}...")

In [None]:
# Evaluate with Shogenji (verbose to see caching)
print("\n=== Evaluating with SelfCheckShogenji ===")
scores_dataset_shogenji = selfcheck_shogenji.predict(
    sentences=sentences,
    sampled_passages=samples,
    verbose=True
)

print("\nResults:")
for i, (sent, score, ann) in enumerate(zip(sentences, scores_dataset_shogenji, annotations)):
    label = "[ACCURATE]" if ann == 0 else "[INACCURATE]"
    print(f"{i+1}. {label} Score={score:.4f} | {sent[:80]}...")

In [None]:
# Evaluate with Fitelson
print("\n=== Evaluating with SelfCheckFitelson ===")
scores_dataset_fitelson = selfcheck_fitelson.predict(
    sentences=sentences,
    sampled_passages=samples,
    verbose=True
)

print("\nResults:")
for i, (sent, score, ann) in enumerate(zip(sentences, scores_dataset_fitelson, annotations)):
    label = "[ACCURATE]" if ann == 0 else "[INACCURATE]"
    print(f"{i+1}. {label} Score={score:.4f} | {sent[:80]}...")

In [None]:
# Evaluate with Olsson
print("\n=== Evaluating with SelfCheckOlsson ===")
scores_dataset_olsson = selfcheck_olsson.predict(
    sentences=sentences,
    sampled_passages=samples,
    verbose=True
)

print("\nResults:")
for i, (sent, score, ann) in enumerate(zip(sentences, scores_dataset_olsson, annotations)):
    label = "[ACCURATE]" if ann == 0 else "[INACCURATE]"
    print(f"{i+1}. {label} Score={score:.4f} | {sent[:80]}...")

## Visualize Score Distribution by Ground Truth Label
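
The cell below repeats the same split for each variant. As a sketch of how that pattern could be factored out, assuming binary labels where 0 means accurate and anything else means inaccurate, the two hypothetical helpers here group scores by label and return NaN instead of raising a warning when a group is empty (which can happen for a single passage).

```python
import math


def split_by_label(scores, labels):
    """Return (accurate_scores, inaccurate_scores) for parallel score and label lists."""
    accurate = [s for s, a in zip(scores, labels) if a == 0]
    inaccurate = [s for s, a in zip(scores, labels) if a != 0]
    return accurate, inaccurate


def safe_mean(values):
    """Arithmetic mean, or NaN when the list is empty."""
    return sum(values) / len(values) if values else math.nan


# Made-up scores and labels for illustration.
acc, inacc = split_by_label([0.1, 0.9, 0.2], [0, 1, 0])
print(f"Accurate mean={safe_mean(acc):.4f}, Inaccurate mean={safe_mean(inacc):.4f}")
```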

In [None]:
# Separate scores by annotation (0=accurate, 1=inaccurate)
accurate_scores_shog = [s for s, a in zip(scores_dataset_shogenji, annotations) if a == 0]
inaccurate_scores_shog = [s for s, a in zip(scores_dataset_shogenji, annotations) if a == 1]

accurate_scores_fitel = [s for s, a in zip(scores_dataset_fitelson, annotations) if a == 0]
inaccurate_scores_fitel = [s for s, a in zip(scores_dataset_fitelson, annotations) if a == 1]

accurate_scores_ols = [s for s, a in zip(scores_dataset_olsson, annotations) if a == 0]
inaccurate_scores_ols = [s for s, a in zip(scores_dataset_olsson, annotations) if a == 1]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Shogenji
axes[0].hist([accurate_scores_shog, inaccurate_scores_shog], 
             label=['Accurate', 'Inaccurate'], bins=10, alpha=0.7, color=['green', 'red'])
axes[0].set_xlabel('Hallucination Score')
axes[0].set_ylabel('Frequency')
axes[0].set_title('SelfCheckShogenji')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Fitelson
axes[1].hist([accurate_scores_fitel, inaccurate_scores_fitel], 
             label=['Accurate', 'Inaccurate'], bins=10, alpha=0.7, color=['green', 'red'])
axes[1].set_xlabel('Hallucination Score')
axes[1].set_ylabel('Frequency')
axes[1].set_title('SelfCheckFitelson')
axes[1].legend()
axes[1].grid(alpha=0.3)

# Olsson
axes[2].hist([accurate_scores_ols, inaccurate_scores_ols], 
             label=['Accurate', 'Inaccurate'], bins=10, alpha=0.7, color=['green', 'red'])
axes[2].set_xlabel('Hallucination Score')
axes[2].set_ylabel('Frequency')
axes[2].set_title('SelfCheckOlsson')
axes[2].legend()
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\nScore Statistics:")
print(f"Shogenji - Accurate: mean={np.mean(accurate_scores_shog):.4f}, Inaccurate: mean={np.mean(inaccurate_scores_shog):.4f}")
print(f"Fitelson - Accurate: mean={np.mean(accurate_scores_fitel):.4f}, Inaccurate: mean={np.mean(inaccurate_scores_fitel):.4f}")
print(f"Olsson - Accurate: mean={np.mean(accurate_scores_ols):.4f}, Inaccurate: mean={np.mean(inaccurate_scores_ols):.4f}")

## Summary

This notebook demonstrated:
1. Basic usage of all three coherence variants
2. Behavior on known truthful vs hallucinated content
3. Comparison of variant discrimination capabilities
4. Testing on real dataset examples
5. Caching of repeated API calls (visible in the verbose output) as a way to limit API cost

For comprehensive evaluation on the full dataset, see `scripts/evaluate_coherence.py` and `demo/coherence_evaluation.ipynb`.