# Advanced Hallucination Detection and Evaluation

This notebook evaluates hallucination detection for LLM responses in RAG systems. It scores hand-written responses against ground-truth documents and measures how well the detector separates accurate answers from hallucinated ones.

## Key Features Covered:
- Advanced hallucination detection algorithms
- Factuality assessment
- Contradiction analysis
- Confidence scoring
- Real-world scenario testing

In [None]:
import sys
import os
import json
import asyncio
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import re
import warnings
warnings.filterwarnings('ignore')

# Add backend to path
sys.path.append('../../backend')

# Import evaluation services
from evaluation.evaluation_service import evaluation_service

# Plotting configuration
plt.style.use('seaborn-v0_8')
sns.set_palette("Set2")
%matplotlib inline

## Test Dataset Creation

Create diverse test scenarios to evaluate hallucination detection capabilities.

In [None]:
# Ground truth documents with known facts
ground_truth_documents = [
    {
        "id": "gt_1",
        "title": "Historical Facts: World War II",
        "content": """World War II lasted from 1939 to 1945. 
        The war involved most of the world's nations forming two opposing military alliances: the Allies and the Axis. 
        Key events include the invasion of Poland in 1939, Pearl Harbor attack in 1941, and D-Day landings in 1944. 
        The war ended with the surrender of Germany in May 1945 and Japan in August 1945. 
        Total casualties were estimated at 70-85 million people.""",
        "facts": [
            "WWII lasted from 1939 to 1945",
            "Allies vs Axis powers",
            "Invasion of Poland in 1939",
            "Pearl Harbor attack in 1941",
            "D-Day landings in 1944",
            "Germany surrendered in May 1945",
            "Japan surrendered in August 1945",
            "70-85 million casualties"
        ]
    },
    {
        "id": "gt_2",
        "title": "Scientific Facts: Climate Science",
        "content": """Human activities have caused approximately 1.1°C of global warming above pre-industrial levels. 
        Carbon dioxide concentrations have increased from 280 parts per million to over 420 ppm. 
        The last decade (2011-2020) was the warmest on record. 
        Arctic sea ice extent has declined by about 13% per decade. 
        Sea levels are rising at 3.3 mm per year due to thermal expansion and ice sheet melt.""",
        "facts": [
            "1.1°C global warming above pre-industrial levels",
            "CO2 from 280 to 420+ ppm",
            "2011-2020 warmest decade on record",
            "Arctic sea ice decline 13% per decade",
            "Sea level rise 3.3 mm/year"
        ]
    },
    {
        "id": "gt_3",
        "title": "Technology Facts: Computing History",
        "content": """The first electronic computer ENIAC was built in 1946 at the University of Pennsylvania. 
        Moore's Law predicted transistor density would double every two years. 
        The internet originated from ARPANET project in 1969. 
        First personal computer Altair 8800 was released in 1975. 
        World Wide Web was invented by Tim Berners-Lee in 1989.""",
        "facts": [
            "ENIAC built in 1946",
            "Moore's Law: doubling every 2 years",
            "ARPANET project started in 1969",
            "Altair 8800 released in 1975",
            "WWW invented by Tim Berners-Lee in 1989"
        ]
    }
]

# Test scenarios with varying hallucination likelihood
test_scenarios = [
    {
        "name": "Accurate Response",
        "query": "When did World War II end and what were the key events leading to its conclusion?",
        "ground_truth_doc": ground_truth_documents[0],
        "expected_response": """World War II ended in 1945 with Germany surrendering in May and Japan surrendering in August. 
        Key events included the D-Day landings in 1944, the Battle of the Bulge, 
        and the atomic bombings of Hiroshima and Nagasaki.""",
        "hallucination_level": "none"
    },
    {
        "name": "Partial Hallucination",
        "query": "What caused global warming and what are the current CO2 levels?",
        "ground_truth_doc": ground_truth_documents[1],
        "expected_response": """Human activities have caused approximately 1.1°C of global warming. 
        Current carbon dioxide concentrations are over 420 parts per million. 
        However, some scientists believe solar activity is the primary cause of recent warming.
        The Paris Agreement aims to limit warming to 1.5°C above pre-industrial levels.""",
        "hallucination_level": "moderate"
    },
    {
        "name": "Significant Hallucination",
        "query": "Tell me about early computer history and internet development.",
        "ground_truth_doc": ground_truth_documents[2],
        "expected_response": """The first computer was built in 1846 by Charles Babbage. 
        The internet was invented by Al Gore in 1995. 
        Windows 95 was the first operating system released in 1981. 
        Steve Jobs created the first smartphone in 1990.
        Social media platforms existed in the 1970s.""",
        "hallucination_level": "high"
    },
    {
        "name": "Factual but Unsupported Claims",
        "query": "What are the impacts of climate change on polar regions?",
        "ground_truth_doc": ground_truth_documents[1],
        "expected_response": """Arctic sea ice has declined by 13% per decade. 
        Polar bears are facing extinction due to habitat loss. 
        Antarctic ice sheets are melting at unprecedented rates. 
        Permafrost is thawing and releasing methane gas. 
        Ocean acidification is affecting marine ecosystems globally.
        The North Pole will be ice-free by 2030 according to recent studies.""",
        "hallucination_level": "unsupported"
    },
    {
        "name": "Mixed Accuracy",
        "query": "How has technology evolved from early computers to modern AI?",
        "ground_truth_doc": ground_truth_documents[2],
        "expected_response": """ENIAC was built in 1946 at University of Pennsylvania. 
        The internet originated from ARPANET in 1969. 
        Artificial intelligence breakthroughs occurred in the 2010s with deep learning. 
        Quantum computers will revolutionize computing by 2030. 
        Brain-computer interfaces are already widely available in consumer markets.
        Self-driving cars achieved full autonomy in 2018.""",
        "hallucination_level": "mixed"
    }
]

print(f"Created {len(ground_truth_documents)} ground truth documents")
print(f"Created {len(test_scenarios)} test scenarios with varying hallucination levels")

## Hallucination Detection Analysis

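Run `evaluation_service.detect_hallucination` on each test scenario, passing the scenario's ground-truth document as the only retrieved document, and log the resulting metrics.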
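
The internals of `evaluation_service` are not shown in this notebook. As a rough illustration of what a sentence-level support check might look like, the sketch below marks a response sentence as supported when enough of its tokens also appear in the source document. The function names, the token-overlap heuristic, and the `min_overlap` value are all assumptions made for illustration; they are not the service's actual method.

```python
import re

def split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace (so '1.1' stays intact)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support_report(response, source, min_overlap=0.6):
    """Count response sentences whose token overlap with the source meets min_overlap.

    Hypothetical heuristic: overlap = shared tokens / sentence tokens.
    """
    source_tokens = tokenize(source)
    # Ignore sentences that contain no tokens at all
    sentences = [s for s in split_sentences(response) if tokenize(s)]
    supported = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        overlap = len(tokens & source_tokens) / len(tokens)
        if overlap >= min_overlap:
            supported += 1
    total = len(sentences)
    score = 1 - supported / total if total else 0.0
    return {
        "supported_sentences": supported,
        "total_sentences": total,
        "hallucination_score": score,
    }

# Made-up example: one grounded sentence, one unsupported sentence
source = "ENIAC was built in 1946. The internet originated from ARPANET in 1969."
response = "ENIAC was built in 1946. Steve Jobs created the first smartphone in 1990."
report = support_report(response, source)
print(report)
```

A lexical check like this cannot tell a paraphrase from a fabrication, which is one reason a production detector would likely rely on semantic similarity or entailment models instead.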

In [None]:
async def analyze_hallucination_detection(scenarios):
    """Analyze hallucination detection across different scenarios"""
    results = []
    
    for i, scenario in enumerate(scenarios):
        print(f"\nAnalyzing Scenario {i+1}: {scenario['name']}")
        print(f"Query: {scenario['query'][:60]}...")
        print(f"Expected Hallucination Level: {scenario['hallucination_level']}")
        
        # Get ground truth document
        gt_doc = scenario['ground_truth_doc']
        retrieved_docs = [gt_doc]
        
        # Analyze the response
        response = scenario['expected_response']
        hallucination_metrics = await evaluation_service.detect_hallucination(
            response=response,
            retrieved_docs=retrieved_docs,
            query=scenario['query']
        )
        
        # Log evaluation
        log_id = await evaluation_service.log_evaluation_metrics(
            query=scenario['query'],
            response=response,
            retrieved_docs=retrieved_docs,
            rag_metrics={},  # Not evaluating RAG for this test
            hallucination_metrics=hallucination_metrics,
            additional_metadata={
                "test_type": "hallucination_detection",
                "scenario_name": scenario['name'],
                "expected_level": scenario['hallucination_level']
            }
        )
        
        result = {
            "scenario_name": scenario['name'],
            "hallucination_level": scenario['hallucination_level'],
            "hallucination_score": hallucination_metrics['hallucination_score'],
            "factuality_score": hallucination_metrics['factuality_score'],
            "confidence": hallucination_metrics['confidence'],
            "contradiction_score": hallucination_metrics['contradiction_score'],
            "supported_sentences": hallucination_metrics['supported_sentences'],
            "total_sentences": hallucination_metrics['total_sentences'],
            "alerts": hallucination_metrics['alerts'],
            "fact_claims_count": hallucination_metrics['analysis_details']['fact_claims_count'],
            "support_ratio": hallucination_metrics['analysis_details']['support_ratio'],
            "log_id": log_id
        }
        
        results.append(result)
        
        # Print results
        print(f"  Hallucination Score: {result['hallucination_score']:.3f}")
        print(f"  Factuality Score: {result['factuality_score']:.3f}")
        print(f"  Confidence: {result['confidence']:.3f}")
        print(f"  Supported Sentences: {result['supported_sentences']}/{result['total_sentences']}")
        if result['alerts']:
            print(f"  Alerts: {result['alerts']}")
    
    return results

# Run hallucination analysis
hallucination_results = await analyze_hallucination_detection(test_scenarios)
print(f"\nCompleted hallucination analysis for {len(hallucination_results)} scenarios")

## Hallucination Detection Visualization

Visualize hallucination detection performance and effectiveness.
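
The plotting cell parses `support_ratio` as a `"supported/total"` string and divides the two parts directly, which would raise an error on a value such as `"0/0"`. A small helper along the following lines could make that parsing more defensive. The helper name and the choice to return `0.0` for unusable values are assumptions, and the string format itself is inferred from how the notebook reads the field.

```python
def parse_support_ratio(ratio):
    """Convert a 'supported/total' string to a float in [0, 1].

    Returns 0.0 when the value is malformed or the total is not positive
    (an assumed fallback, chosen so that plots still render).
    """
    try:
        supported, total = (int(part) for part in str(ratio).split("/"))
    except ValueError:
        # Covers non-numeric parts and a wrong number of '/'-separated fields
        return 0.0
    if total <= 0:
        return 0.0
    return supported / total

# Made-up inputs covering the normal case and two edge cases
for value in ["3/5", "0/0", "n/a"]:
    print(value, "->", parse_support_ratio(value))
```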

In [None]:
# Convert results to DataFrame
df_hallucination = pd.DataFrame(hallucination_results)

print("Hallucination Detection Results:")
display(df_hallucination.round(3))

# Create visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Hallucination Detection Analysis', fontsize=16, fontweight='bold')

# 1. Hallucination scores by scenario
bars = axes[0, 0].bar(range(len(df_hallucination)), df_hallucination['hallucination_score'], 
                     color=['green', 'yellow', 'red', 'orange', 'purple'])
axes[0, 0].set_xlabel('Scenario')
axes[0, 0].set_ylabel('Hallucination Score')
axes[0, 0].set_title('Hallucination Scores by Scenario')
axes[0, 0].set_xticks(range(len(df_hallucination)))
axes[0, 0].set_xticklabels([f"{name}\n({level})" for name, level in 
                           zip(df_hallucination['scenario_name'], df_hallucination['hallucination_level'])], 
                          rotation=45, ha='right')
axes[0, 0].set_ylim(0, 1)
for i, (bar, score) in enumerate(zip(bars, df_hallucination['hallucination_score'])):
    axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
                   f'{score:.3f}', ha='center', va='bottom')

# 2. Factuality vs Confidence scatter
scatter_colors = {'none': 'green', 'moderate': 'yellow', 'high': 'red', 
                  'unsupported': 'orange', 'mixed': 'purple'}
for level in df_hallucination['hallucination_level'].unique():
    mask = df_hallucination['hallucination_level'] == level
    axes[0, 1].scatter(df_hallucination[mask]['factuality_score'], 
                      df_hallucination[mask]['confidence'],
                      c=scatter_colors[level], label=level, s=100, alpha=0.7)
axes[0, 1].set_xlabel('Factuality Score')
axes[0, 1].set_ylabel('Confidence')
axes[0, 1].set_title('Factuality vs Confidence by Hallucination Level')
# Draw the reference line before building the legend so its label is included
axes[0, 1].plot([0, 1], [1, 0], 'r--', alpha=0.5, label='Ideal Trade-off')
axes[0, 1].legend()

# 3. Support ratio analysis
support_ratios = [int(ratio.split('/')[0]) / int(ratio.split('/')[1]) 
                  for ratio in df_hallucination['support_ratio']]
axes[0, 2].bar(range(len(support_ratios)), support_ratios, 
               color=['green', 'yellow', 'red', 'orange', 'purple'])
axes[0, 2].set_xlabel('Scenario')
axes[0, 2].set_ylabel('Support Ratio')
axes[0, 2].set_title('Sentence Support Ratio')
axes[0, 2].set_xticks(range(len(support_ratios)))
axes[0, 2].set_xticklabels([name[:10] for name in df_hallucination['scenario_name']], 
                          rotation=45, ha='right')
axes[0, 2].set_ylim(0, 1)

# 4. Alert frequency by scenario
alert_counts = [len(alerts) for alerts in df_hallucination['alerts']]
axes[1, 0].bar(range(len(alert_counts)), alert_counts, 
               color=['green', 'yellow', 'red', 'orange', 'purple'])
axes[1, 0].set_xlabel('Scenario')
axes[1, 0].set_ylabel('Number of Alerts')
axes[1, 0].set_title('Alert Generation by Scenario')
axes[1, 0].set_xticks(range(len(alert_counts)))
axes[1, 0].set_xticklabels([name[:10] for name in df_hallucination['scenario_name']], 
                          rotation=45, ha='right')

# 5. Correlation heatmap
correlation_features = ['hallucination_score', 'factuality_score', 'confidence', 
                       'contradiction_score', 'fact_claims_count']
corr_matrix = df_hallucination[correlation_features].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, ax=axes[1, 1], cbar_kws={'shrink': 0.8})
axes[1, 1].set_title('Feature Correlations')

# 6. Performance radar chart
from math import pi
categories = ['Hallucination\nScore', 'Factuality\nScore', 'Confidence', 
              'Support\nRatio', 'Contradiction\nScore']

# Prepare data for radar chart
radar_data = []
for _, row in df_hallucination.iterrows():
    support_ratio_val = int(row['support_ratio'].split('/')[0]) / int(row['support_ratio'].split('/')[1])
    data_point = [
        1 - row['hallucination_score'],  # Invert for better visualization
        row['factuality_score'],
        row['confidence'],
        support_ratio_val,
        1 - row['contradiction_score']  # Invert contradiction score
    ]
    radar_data.append(data_point)

# Radar chart
angles = [n / float(len(categories)) * 2 * pi for n in range(len(categories))]
angles += angles[:1]  # Close the circle

axes[1, 2].remove()  # Remove the Cartesian placeholder so the polar axes does not overlap it
ax_radar = fig.add_subplot(2, 3, 6, projection='polar')
colors_radar = ['green', 'yellow', 'red', 'orange', 'purple']

for i, (data, color, name) in enumerate(zip(radar_data, colors_radar, df_hallucination['scenario_name'])):
    data += data[:1]  # Close the circle
    ax_radar.plot(angles, data, 'o-', linewidth=2, label=name[:10], color=color)
    ax_radar.fill(angles, data, alpha=0.25, color=color)

ax_radar.set_xticks(angles[:-1])
ax_radar.set_xticklabels(categories)
ax_radar.set_ylim(0, 1)
ax_radar.set_title('Performance Radar Chart', pad=20)
ax_radar.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))

plt.tight_layout()
plt.show()

## Detection Effectiveness Analysis

Analyze how well the hallucination detection system identifies different types of hallucinations.

In [None]:
print("=== Hallucination Detection Effectiveness Analysis ===\n")

# Group results by expected hallucination level
grouped_analysis = df_hallucination.groupby('hallucination_level').agg({
    'hallucination_score': ['mean', 'std', 'min', 'max'],
    'factuality_score': 'mean',
    'confidence': 'mean',
    'alerts': 'count'  # Number of scenarios per level, not the number of alerts raised
}).round(3)

print("Performance by Expected Hallucination Level:")
display(grouped_analysis)

# Detection accuracy analysis
print("\nDetection Accuracy Analysis:")
threshold = 0.3  # Hallucination score threshold

detection_results = []
for _, row in df_hallucination.iterrows():
    expected = row['hallucination_level']
    detected = row['hallucination_score'] > threshold
    
    # Map expected levels to binary classification
    expected_positive = expected in ['moderate', 'high', 'mixed', 'unsupported']
    
    detection_results.append({
        'scenario': row['scenario_name'],
        'expected': expected,
        'expected_positive': expected_positive,
        'detected_positive': detected,
        'hallucination_score': row['hallucination_score'],
        'correct': expected_positive == detected
    })

df_detection = pd.DataFrame(detection_results)

# Calculate metrics
true_positives = len(df_detection[(df_detection['expected_positive'] == True) & 
                                 (df_detection['detected_positive'] == True)])
false_positives = len(df_detection[(df_detection['expected_positive'] == False) & 
                                  (df_detection['detected_positive'] == True)])
true_negatives = len(df_detection[(df_detection['expected_positive'] == False) & 
                                 (df_detection['detected_positive'] == False)])
false_negatives = len(df_detection[(df_detection['expected_positive'] == True) & 
                                  (df_detection['detected_positive'] == False)])

total = len(df_detection)
accuracy = (true_positives + true_negatives) / total if total > 0 else 0
precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print(f"\nBinary Classification Metrics (Threshold = {threshold}):")
print(f"  Accuracy: {accuracy:.3f} ({true_positives + true_negatives}/{total})")
print(f"  Precision: {precision:.3f} ({true_positives}/{true_positives + false_positives})")
print(f"  Recall: {recall:.3f} ({true_positives}/{true_positives + false_negatives})")
print(f"  F1-Score: {f1_score:.3f}")

print("\nDetailed Detection Results:")
display_df = df_detection[['scenario', 'expected', 'hallucination_score', 'detected_positive', 'correct']].copy()
display_df.columns = ['Scenario', 'Expected Level', 'Score', 'Detected', 'Correct']
display(display_df.round(3))

# Alert effectiveness
print("\nAlert Effectiveness Analysis:")
scenarios_with_alerts = df_hallucination[df_hallucination['alerts'].apply(len) > 0]
if len(scenarios_with_alerts) > 0:
    avg_hallucination_with_alerts = scenarios_with_alerts['hallucination_score'].mean()
    avg_hallucination_without_alerts = df_hallucination[df_hallucination['alerts'].apply(len) == 0]['hallucination_score'].mean()
    
    print(f"  Average hallucination score WITH alerts: {avg_hallucination_with_alerts:.3f}")
    print(f"  Average hallucination score WITHOUT alerts: {avg_hallucination_without_alerts:.3f}")
    print(f"  Alert discrimination power: {abs(avg_hallucination_with_alerts - avg_hallucination_without_alerts):.3f}")
else:
    print("  No alerts generated in test scenarios")

## Advanced Analysis and Recommendations

Provide detailed analysis and improvement recommendations.
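
One of the suggestions printed by the cell below is to test several thresholds and choose one based on the precision-recall trade-off. A minimal sketch of such a sweep follows. The scores and labels are made-up placeholders rather than results from this notebook, and `sweep_thresholds` is a hypothetical helper. It uses the same strict `score > threshold` rule as the detection accuracy cell.

```python
def sweep_thresholds(scores, labels, thresholds):
    """Compute precision, recall and F1 at each threshold.

    scores: hallucination scores in [0, 1]
    labels: True where the response is expected to contain hallucination
    """
    rows = []
    for t in thresholds:
        predicted = [s > t for s in scores]
        tp = sum(p and y for p, y in zip(predicted, labels))
        fp = sum(p and not y for p, y in zip(predicted, labels))
        fn = sum((not p) and y for p, y in zip(predicted, labels))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        rows.append({"threshold": t, "precision": precision,
                     "recall": recall, "f1": f1})
    return rows

# Placeholder data for illustration only
example_scores = [0.05, 0.35, 0.90, 0.25, 0.55]
example_labels = [False, True, True, True, True]

rows = sweep_thresholds(example_scores, example_labels, [0.1, 0.3, 0.5])
best = max(rows, key=lambda r: r["f1"])
for row in rows:
    print(row)
print("Best threshold by F1:", best["threshold"])
```

With only five scenarios, any threshold chosen this way is fitted to the test set. A larger labeled sample would be needed before treating the result as more than a starting point.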

In [None]:
print("=== Advanced Analysis and Recommendations ===\n")

# Performance insights
avg_hallucination = df_hallucination['hallucination_score'].mean()
avg_factuality = df_hallucination['factuality_score'].mean()
avg_confidence = df_hallucination['confidence'].mean()
total_alerts = sum(len(alerts) for alerts in df_hallucination['alerts'])

print("System Performance Summary:")
print(f"  Average Hallucination Score: {avg_hallucination:.3f}")
print(f"  Average Factuality Score: {avg_factuality:.3f}")
print(f"  Average Confidence: {avg_confidence:.3f}")
print(f"  Total Alerts Generated: {total_alerts}")
print(f"  Scenarios with Alerts: {len([a for a in df_hallucination['alerts'] if len(a) > 0])}/{len(df_hallucination)}")

# Identify strengths and weaknesses
print("\nPerformance Analysis:")

if avg_hallucination < 0.2:
    print("✅ Strength: Low average hallucination rate indicates good detection capability")
elif avg_hallucination < 0.4:
    print("⚠️ Moderate: Hallucination detection working reasonably well")
else:
    print("❌ Weakness: High hallucination rates suggest detection improvements needed")

if avg_factuality > 0.8:
    print("✅ Strength: High factuality scores indicate reliable information extraction")
elif avg_factuality > 0.6:
    print("⚠️ Moderate: Factuality could be improved")
else:
    print("❌ Weakness: Low factuality suggests issues with source document matching")

if avg_confidence > 0.8:
    print("✅ Strength: High confidence indicates consistent performance")
else:
    print("⚠️ Note: Variable confidence may indicate inconsistent performance")

# Detailed scenario analysis
print("\nScenario-Specific Insights:")
for _, row in df_hallucination.iterrows():
    level = row['hallucination_level']
    score = row['hallucination_score']
    
    if level == 'none' and score > 0.1:
        print(f"⚠️ False Positive: {row['scenario_name']} (Score: {score:.3f}) - Detected hallucination in accurate response")
    elif level in ['high', 'moderate'] and score < 0.2:
        print(f"⚠️ False Negative: {row['scenario_name']} (Score: {score:.3f}) - Failed to detect clear hallucination")
    elif level == 'unsupported' and score > 0.3:
        # A score above the 0.3 detection threshold means the unsupported claims were flagged
        print(f"🔍 Nuanced Detection: {row['scenario_name']} (Score: {score:.3f}) - Correctly identified unsupported claims")

# Recommendations
print("\nRecommendations for Improvement:")
recommendations = []

if avg_hallucination > 0.3:
    recommendations.append("Implement more sophisticated semantic similarity measures for better document matching")
    recommendations.append("Add external fact-checking APIs for controversial claims")
    recommendations.append("Increase training data diversity for hallucination detection")

if total_alerts == 0:
    recommendations.append("Review and adjust alert thresholds to catch more issues")
    recommendations.append("Consider adding more granular alert types")
elif total_alerts > len(df_hallucination) * 2:
    recommendations.append("Review alert thresholds - may be generating too many false positives")
    recommendations.append("Implement alert prioritization system")

recommendations.extend([
    "Add human validation loop for high-confidence hallucination detections",
    "Implement continuous learning from feedback to improve detection accuracy",
    "Create detailed logging for false positive/negative analysis",
    "Develop ensemble methods combining multiple detection approaches",
    "Add temporal analysis to track hallucination trends over time",
    "Implement user feedback mechanisms for hallucination reporting"
])

for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec}")

# Threshold optimization suggestion
print("\nThreshold Optimization Suggestions:")
# The median score is only a starting point for a threshold search, not an optimum
median_score = df_hallucination['hallucination_score'].median()
print(f"  Current median hallucination score: {median_score:.3f}")
print(f"  Suggested threshold adjustment range: {max(0.1, median_score-0.1):.3f} - {min(0.9, median_score+0.1):.3f}")
print("  Recommendation: Test multiple thresholds and select based on precision-recall trade-off")

## Export Analysis Results

Save detailed analysis results for future reference.

In [None]:
# Export results
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Save detailed results
results_filename = f"hallucination_detection_results_{timestamp}.csv"
df_hallucination.to_csv(results_filename, index=False)
print(f"Detailed results saved to: {results_filename}")

# Save detection analysis
detection_filename = f"detection_analysis_{timestamp}.csv"
df_detection.to_csv(detection_filename, index=False)
print(f"Detection analysis saved to: {detection_filename}")

# Save comprehensive report
report_filename = f"hallucination_analysis_report_{timestamp}.txt"
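# Write as UTF-8 so the bullet characters in the report are saved correctly on every platform
with open(report_filename, 'w', encoding='utf-8') as f: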
    f.write("HALLUCINATION DETECTION ANALYSIS REPORT\n")
    f.write("=" * 50 + "\n\n")
    f.write(f"Generated: {datetime.now().isoformat()}\n")
    f.write(f"Test Scenarios: {len(test_scenarios)}\n\n")
    
    f.write("OVERALL PERFORMANCE:\n")
    f.write(f"  Average Hallucination Score: {avg_hallucination:.3f}\n")
    f.write(f"  Average Factuality Score: {avg_factuality:.3f}\n")
    f.write(f"  Average Confidence: {avg_confidence:.3f}\n")
    f.write(f"  Detection Accuracy: {accuracy:.3f}\n")
    f.write(f"  Precision: {precision:.3f}\n")
    f.write(f"  Recall: {recall:.3f}\n\n")
    
    f.write("DETECTION METRICS:\n")
    f.write(f"  True Positives: {true_positives}\n")
    f.write(f"  False Positives: {false_positives}\n")
    f.write(f"  True Negatives: {true_negatives}\n")
    f.write(f"  False Negatives: {false_negatives}\n\n")
    
    f.write("RECOMMENDATIONS:\n")
    for rec in recommendations:
        f.write(f"  • {rec}\n")

print(f"Comprehensive report saved to: {report_filename}")
print("\n✅ Hallucination detection analysis completed successfully!")