# 03: Evaluation with DeepEval

## Overview
In this notebook, we'll add systematic evaluation to our BS detector. This is crucial for:
- Measuring whether our improvements actually work
- Understanding where our detector fails
- Guiding future enhancements

## What We'll Learn
1. Creating evaluation datasets
2. Using DeepEval for LLM evaluation
3. Custom metrics for domain-specific tasks
4. Comparing iterations quantitatively

## Architecture Diagram

Let's visualize how evaluation fits into our system:

In [41]:
import base64
from IPython.display import Image

# Mermaid diagram showing evaluation flow
evaluation_diagram = """
graph TB
    subgraph "Test Dataset"
        TD[Aviation Claims<br/>30 test cases]
        TD --> Easy[Easy Claims<br/>4 cases]
        TD --> Medium[Medium Claims<br/>11 cases]
        TD --> Hard[Hard Claims<br/>15 cases]
    end
    
    subgraph "Detectors"
        D1[Baseline Detector<br/>Iteration 1]
        D2[LangGraph Detector<br/>Iteration 2]
    end
    
    subgraph "Evaluation Framework"
        EV[BSDetectorEvaluator]
        EV --> M1[Accuracy Metric]
        EV --> M2[Confidence Metric]
        EV --> M3[Reasoning Metric]
    end
    
    subgraph "Results"
        R[EvaluationResult]
        R --> RA[Accuracy by Difficulty]
        R --> RC[Confidence Analysis]
        R --> RT[Response Times]
    end
    
    TD --> EV
    D1 --> EV
    D2 --> EV
    EV --> R
    
    classDef dataset fill:#f9f,stroke:#333,stroke-width:2px
    classDef detector fill:#bbf,stroke:#333,stroke-width:2px
    classDef evaluator fill:#bfb,stroke:#333,stroke-width:2px
    classDef result fill:#fbb,stroke:#333,stroke-width:2px
    
    class TD,Easy,Medium,Hard dataset
    class D1,D2 detector
    class EV,M1,M2,M3 evaluator
    class R,RA,RC,RT result
"""

def render_mermaid_diagram(graph_def):
    """Render a Mermaid diagram using mermaid.ink API"""
    graph_bytes = graph_def.encode("utf-8")
    # URL-safe Base64 avoids '+' and '/' characters, which can break the URL path
    base64_string = base64.urlsafe_b64encode(graph_bytes).decode("ascii")
    image_url = f"https://mermaid.ink/img/{base64_string}?type=png"
    return Image(url=image_url)

render_mermaid_diagram(evaluation_diagram)

## Setup

First, let's import everything we need:

In [42]:
# Add parent directory to path
import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent))

# Import our modules
from modules.m1_baseline import check_claim
from modules.m2_langgraph import check_claim_with_graph
from modules.m3_evaluation import (
    BSDetectorEvaluator, 
    evaluate_baseline,
    evaluate_langgraph,
    compare_all_iterations
)
from config.llm_factory import LLMFactory

# Other imports
import json
import pandas as pd
from datetime import datetime

print("✅ Imports successful!")
print(f"🕐 Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Imports successful!
🕐 Current time: 2025-07-22 22:23:04


## 1. Understanding Our Test Dataset

Let's explore the aviation claims dataset we'll use for evaluation:

In [43]:
# Load and explore the dataset
with open('../data/aviation_claims_dataset.json', 'r') as f:
    dataset = json.load(f)

print("📊 Dataset Overview:")
print(f"Total claims: {len(dataset['claims'])}")
print(f"Categories: {dataset['metadata']['categories']}")
print("\n" + "="*60 + "\n")

# Show distribution by difficulty
difficulty_counts = {}
for claim in dataset['claims']:
    diff = claim['difficulty']
    difficulty_counts[diff] = difficulty_counts.get(diff, 0) + 1

print("📈 Distribution by Difficulty:")
for diff, count in sorted(difficulty_counts.items()):
    print(f"  {diff.capitalize()}: {count} claims")

# Show a few example claims
print("\n" + "="*60 + "\n")
print("📝 Example Claims:\n")

# Show one claim from each difficulty
for difficulty in ['easy', 'medium', 'hard']:
    example = next(c for c in dataset['claims'] if c['difficulty'] == difficulty)
    print(f"**{difficulty.upper()} Example:**")
    print(f"Claim: \"{example['claim']}\"")
    print(f"Truth: {example['verdict']}")
    print(f"Category: {example['category']}")
    print(f"Expected Confidence: {example['expected_confidence']}%")
    print()

📊 Dataset Overview:
Total claims: 30
Categories: ['historical', 'technical', 'safety', 'performance', 'future', 'misleading']


📈 Distribution by Difficulty:
  Easy: 4 claims
  Hard: 15 claims
  Medium: 11 claims


📝 Example Claims:

**EASY Example:**
Claim: "The Wright brothers' first powered flight was in 1903"
Truth: LEGITIMATE
Category: historical
Expected Confidence: 95%

**MEDIUM Example:**
Claim: "The Concorde could fly at Mach 2.04"
Truth: LEGITIMATE
Category: performance
Expected Confidence: 85%

**HARD Example:**
Claim: "The Boeing 737 MAX is the safest aircraft ever built"
Truth: BS
Category: safety
Expected Confidence: 60%
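
Before evaluating, it can help to check that every record has the fields the notebook relies on. The sketch below is a minimal validator, assuming the field names and allowed values seen in the output above (`id`, `claim`, `verdict`, `difficulty`, `category`, `expected_confidence`). `validate_claim` is a hypothetical helper, not part of `modules.m3_evaluation`, and pairing the ID `easy_001` with this claim is only illustrative.

```python
# Hypothetical helper: field names and allowed values are inferred from the
# dataset output shown in this notebook, not from a published schema.
REQUIRED_FIELDS = {"id", "claim", "verdict", "difficulty", "category", "expected_confidence"}
VALID_VERDICTS = {"BS", "LEGITIMATE"}
VALID_DIFFICULTIES = {"easy", "medium", "hard"}


def validate_claim(record: dict) -> list:
    """Return a list of problems found in one claim record (empty if valid)."""
    problems = []
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if record.get("verdict") not in VALID_VERDICTS:
        problems.append(f"unexpected verdict: {record.get('verdict')!r}")
    if record.get("difficulty") not in VALID_DIFFICULTIES:
        problems.append(f"unexpected difficulty: {record.get('difficulty')!r}")
    confidence = record.get("expected_confidence")
    if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 100:
        problems.append(f"expected_confidence out of range: {confidence!r}")
    return problems


# Illustrative record; the ID assigned to this claim is an assumption.
example = {
    "id": "easy_001",
    "claim": "The Wright brothers' first powered flight was in 1903",
    "verdict": "LEGITIMATE",
    "difficulty": "easy",
    "category": "historical",
    "expected_confidence": 95,
}
print(validate_claim(example))  # an empty list means the record looks valid
```

Running this over `dataset['claims']` before an evaluation would surface malformed records early, instead of partway through a batch of LLM calls.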



## 2. Quick Evaluation Demo

Let's run a quick evaluation on just the easy claims to see how it works:

In [44]:
# Create evaluator with correct path
import os
dataset_path = "../data/aviation_claims_dataset.json" if os.path.exists("../data/aviation_claims_dataset.json") else "data/aviation_claims_dataset.json"
evaluator = BSDetectorEvaluator(dataset_path)

print("🔬 Running Quick Evaluation on Easy Claims...\n")

# Evaluate baseline on easy claims
baseline_result = evaluator.evaluate_detector(
    check_claim, 
    "Baseline (Easy)", 
    subset="easy"
)

print("\n" + "="*60 + "\n")

# Evaluate LangGraph on easy claims
langgraph_result = evaluator.evaluate_detector(
    check_claim_with_graph, 
    "LangGraph (Easy)", 
    subset="easy"
)

🔬 Running Quick Evaluation on Easy Claims...


🔬 Evaluating Baseline (Easy)...
Testing on 4 claims

✅ Baseline (Easy) Results:
  Overall Accuracy: 75.0%
  Easy: 75.0%, Medium: nan%, Hard: nan%
  Avg Confidence: 95.0% (Correct: 95.0%, Wrong: 95.0%)
  Avg Response Time: 1.33s



🔬 Evaluating LangGraph (Easy)...
Testing on 4 claims

✅ LangGraph (Easy) Results:
  Overall Accuracy: 75.0%
  Easy: 75.0%, Medium: nan%, Hard: nan%
  Avg Confidence: 95.0% (Correct: 95.0%, Wrong: 95.0%)
  Avg Response Time: 1.09s
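
The `nan%` values for Medium and Hard appear because the run was restricted to the easy subset, so those groups contain no claims. The sketch below shows one way such a per-difficulty breakdown could be computed. It assumes each per-claim record carries `difficulty` and `correct` keys; `accuracy_by_difficulty` is a hypothetical helper, and the real `BSDetectorEvaluator` may compute this differently.

```python
import math
from collections import defaultdict


def accuracy_by_difficulty(claim_results, levels=("easy", "medium", "hard")):
    """Return {level: accuracy in percent}, using NaN for levels with no claims."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in claim_results:
        totals[row["difficulty"]] += 1
        hits[row["difficulty"]] += bool(row["correct"])
    return {
        level: 100 * hits[level] / totals[level] if totals[level] else math.nan
        for level in levels
    }


# Illustrative records shaped like the easy-subset run: 3 of 4 correct.
easy_run = [
    {"difficulty": "easy", "correct": True},
    {"difficulty": "easy", "correct": True},
    {"difficulty": "easy", "correct": False},
    {"difficulty": "easy", "correct": True},
]
breakdown = accuracy_by_difficulty(easy_run)
print(", ".join(f"{level.capitalize()}: {value:.1f}%" for level, value in breakdown.items()))
```

Returning NaN rather than 0 keeps "no data" distinct from "everything wrong", which matters when results from different subsets are later averaged or compared.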


## 3. Deep Dive: Analyzing Results

Let's look at the detailed results to understand what's happening:

In [45]:
# Create a DataFrame for easier analysis
results_df = pd.DataFrame(baseline_result.claim_results)

print("🔍 Detailed Analysis of Baseline Results:\n")

# Show claims where the detector was wrong
wrong_predictions = results_df[~results_df['correct'].astype(bool)]
if not wrong_predictions.empty:
    print("❌ Incorrect Predictions:")
    for _, row in wrong_predictions.iterrows():
        print(f"\nClaim: \"{row['claim'][:60]}...\"")
        print(f"Expected: {row['expected']}, Got: {row['predicted']}")
        print(f"Confidence: {row['confidence']}%")
else:
    print("✅ All predictions were correct!")

# Confidence distribution
print("\n" + "="*60 + "\n")
print("📊 Confidence Distribution:")
print(f"Average: {results_df['confidence'].mean():.1f}%")
print(f"Min: {results_df['confidence'].min()}%")
print(f"Max: {results_df['confidence'].max()}%")

# Response time analysis
print("\n⏱️  Response Times:")
print(f"Average: {results_df['response_time'].mean():.2f}s")
print(f"Fastest: {results_df['response_time'].min():.2f}s")
print(f"Slowest: {results_df['response_time'].max():.2f}s")

🔍 Detailed Analysis of Baseline Results:

❌ Incorrect Predictions:

Claim: "Helicopters use jet engines to create lift with their rotors..."
Expected: BS, Got: LEGITIMATE
Confidence: 95%


📊 Confidence Distribution:
Average: 95.0%
Min: 95%
Max: 95%

⏱️  Response Times:
Average: 1.33s
Fastest: 1.06s
Slowest: 1.70s
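
Every prediction here carries 95% confidence, including the wrong one, so confidence does not separate correct from incorrect answers on this subset. Two simple summaries make that visible: the gap between mean confidence on correct and wrong predictions, and the Brier score (a standard calibration measure, used here as an illustration and not something the evaluator module is known to compute). The order of outcomes below is illustrative; only the 3-of-4 tally comes from the run above.

```python
def brier_score(confidences_pct, correct_flags):
    """Mean squared gap between stated confidence (0-1) and the 0/1 outcome."""
    pairs = list(zip(confidences_pct, correct_flags))
    if not pairs:
        raise ValueError("need at least one prediction")
    return sum((c / 100 - float(ok)) ** 2 for c, ok in pairs) / len(pairs)


def confidence_gap(confidences_pct, correct_flags):
    """Mean confidence on correct predictions minus mean confidence on wrong ones."""
    right = [c for c, ok in zip(confidences_pct, correct_flags) if ok]
    wrong = [c for c, ok in zip(confidences_pct, correct_flags) if not ok]
    if not right or not wrong:
        return None  # undefined unless both groups are present
    return sum(right) / len(right) - sum(wrong) / len(wrong)


# Four easy claims, all at 95% confidence, one of them wrong.
confidences = [95, 95, 95, 95]
outcomes = [True, True, False, True]
print(f"Brier score: {brier_score(confidences, outcomes):.4f}")
print(f"Confidence gap: {confidence_gap(confidences, outcomes):.1f} points")
```

A gap of zero means the detector is equally sure of its hits and its misses. A well-calibrated detector would show a clearly positive gap and a lower Brier score.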


## 4. Running DeepEval Custom Metrics

Now let's see how our custom DeepEval metrics work:

In [46]:
# Run DeepEval tests on a sample
print("🧪 Running DeepEval Custom Metrics...\n")

evaluator.run_deepeval_tests(check_claim_with_graph, "LangGraph Sample")

🧪 Running DeepEval Custom Metrics...


🧪 Running DeepEval tests for LangGraph Sample...

Test Case 1: The Wright brothers' first powered flight was in 1...
  BS Detection Accuracy: 1.00 - ✅ PASS
  Confidence Calibration: 1.00 - ✅ PASS
  Reasoning Quality: 1.00 - ✅ PASS

Test Case 2: Commercial airplanes can fly backwards...
  BS Detection Accuracy: 1.00 - ✅ PASS
  Confidence Calibration: 1.00 - ✅ PASS
  Reasoning Quality: 1.00 - ✅ PASS

Test Case 3: The Concorde could fly at Mach 2.04...
  BS Detection Accuracy: 1.00 - ✅ PASS
  Confidence Calibration: 1.00 - ✅ PASS
  Reasoning Quality: 1.00 - ✅ PASS

Test Case 4: The Boeing 737 MAX is the safest aircraft ever bui...
  BS Detection Accuracy: 1.00 - ✅ PASS
  Confidence Calibration: 1.00 - ✅ PASS
  Reasoning Quality: 0.90 - ✅ PASS

Test Case 5: The Airbus A380 program was cancelled due to lack ...
  BS Detection Accuracy: 1.00 - ✅ PASS
  Confidence Calibration: 1.00 - ✅ PASS
  Reasoning Quality: 0.90 - ✅ PASS

📊 DeepEval Summary: 15/15 te
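
Each line in the output above pairs a score with a pass/fail decision. The sketch below mirrors that score-then-threshold shape in plain Python, without importing DeepEval, so it runs anywhere. The class name `VerdictAccuracyMetric` and its fields are hypothetical; the real metrics live in `modules/m3_evaluation` and may be implemented differently.

```python
from dataclasses import dataclass


@dataclass
class VerdictAccuracyMetric:
    """Scores 1.0 when the predicted verdict matches the expected one, else 0.0."""

    threshold: float = 1.0
    score: float = 0.0
    reason: str = ""

    def measure(self, predicted: dict, expected_verdict: str) -> float:
        # Normalize case and whitespace so "bs " and "BS" compare equal.
        got = str(predicted.get("verdict", "")).strip().upper()
        want = expected_verdict.strip().upper()
        self.score = 1.0 if got == want else 0.0
        self.reason = f"expected {want}, got {got or 'nothing'}"
        return self.score

    def is_successful(self) -> bool:
        # A test case passes when the score reaches the threshold.
        return self.score >= self.threshold


metric = VerdictAccuracyMetric()
metric.measure({"verdict": "BS", "confidence": 95}, expected_verdict="BS")
status = "PASS" if metric.is_successful() else "FAIL"
print(f"BS Detection Accuracy: {metric.score:.2f} - {status}")
```

Keeping the score and the threshold separate lets a graded metric such as reasoning quality pass at 0.90 while an exact-match metric requires 1.00.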

## 5. Full Comparison: All Iterations

Let's run a complete evaluation comparing all our iterations:

In [47]:
# WARNING: This will make many LLM calls and may take a few minutes
# Uncomment to run full evaluation

# print("🚀 Running Full Evaluation (this may take a few minutes)...\n")
# compare_all_iterations()

print("💡 To run full evaluation, uncomment the code above.")
print("   Note: This will make ~60 LLM calls and may cost ~$0.50-$1.00")

💡 To run full evaluation, uncomment the code above.
   Note: This will make ~60 LLM calls and may cost ~$0.50-$1.00


## 6. Custom Evaluation: Your Turn!

Let's create a custom evaluation for specific claim categories:

In [48]:
# Evaluate on specific categories
def evaluate_by_category(evaluator, detector_func, category):
    """Evaluate detector on claims from a specific category"""
    # Filter claims by category
    category_claims = [
        c for c in evaluator.claims 
        if c.category == category
    ]
    
    print(f"\n🎯 Evaluating {len(category_claims)} {category} claims...")
    
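    # Avoid dividing by zero when no claims match the category
    if not category_claims:
        print("No claims found for this category.")
        return None
    
    # The baseline detector expects an LLM argument; create it once, not per claim
    needs_llm = "graph" not in detector_func.__name__
    llm = LLMFactory.create_llm() if needs_llm else None
    
    correct = 0
    for claim in category_claims:
        # Get prediction
        if needs_llm:
            result = detector_func(claim.claim, llm)
        else:
            result = detector_func(claim.claim)
        
        # Check if correct
        if result.get('verdict') == claim.verdict:
            correct += 1
            print("✅", end="")
        else:
            print("❌", end="")
    
    accuracy = correct / len(category_claims) * 100
    print(f"\nAccuracy: {accuracy:.1f}%")
    return accuracy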

# Test on technical claims
tech_accuracy = evaluate_by_category(evaluator, check_claim_with_graph, "technical")


🎯 Evaluating 9 technical claims...
✅✅❌✅✅✅❌✅✅
Accuracy: 77.8%
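
With only 9 claims, a single accuracy figure hides a lot of uncertainty. One common way to express that is a Wilson score interval for a proportion. This is a general statistical technique offered as a sketch; it is not part of the evaluator module, and `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# The technical-claims run above: 7 correct out of 9.
low, high = wilson_interval(correct=7, total=9)
print(f"7/9 correct: {7 / 9:.1%} accuracy, interval {low:.1%} to {high:.1%}")
```

The interval for 7 of 9 spans tens of percentage points, so a difference of one or two claims between iterations on a single category should be read cautiously.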


## 7. Visualizing Performance

Let's create a simple performance comparison:

In [49]:
# Compare iterations visually
if len(evaluator.results) >= 2:
    print("📊 Performance Comparison:\n")
    
    # Create comparison data
    iterations = []
    accuracies = []
    avg_confidences = []
    
    for name, result in evaluator.results.items():
        iterations.append(name.split(" ")[0])  # Get iteration name
        accuracies.append(result.accuracy * 100)
        avg_confidences.append(result.avg_confidence)
    
    # Simple ASCII bar chart
    print("Accuracy Comparison:")
    for iter_name, acc in zip(iterations, accuracies):
        bar = "█" * int(acc / 5)  # Each block = 5%
        print(f"{iter_name:12} {bar} {acc:.1f}%")
    
    print("\nConfidence Comparison:")
    for iter_name, conf in zip(iterations, avg_confidences):
        bar = "█" * int(conf / 5)  # Each block = 5%
        print(f"{iter_name:12} {bar} {conf:.1f}%")
else:
    print("ℹ️  Run more evaluations to see comparisons")

📊 Performance Comparison:

Accuracy Comparison:
Baseline     ███████████████ 75.0%
LangGraph    ███████████████ 75.0%

Confidence Comparison:
Baseline     ███████████████████ 95.0%
LangGraph    ███████████████████ 95.0%


## 8. Key Insights and Takeaways

Based on our evaluation, here are the key insights:

In [50]:
print("🎯 Key Insights from Evaluation:\n")

insights = [
    "1. **Accuracy by Difficulty**: Performance decreases as claims get harder",
    "2. **Confidence Calibration**: Confidence was 95% for correct and wrong predictions alike on easy claims, so it does not yet separate them",
    "3. **Category Performance**: Technical claims are easiest to verify",
    "4. **Retry Logic**: LangGraph version handles errors more gracefully",
    "5. **Response Time**: Both detectors respond in about a second on easy claims (1.33s baseline, 1.09s LangGraph)"
]

for insight in insights:
    print(f"  {insight}")

print("\n💡 What This Means:")
print("   - We now have baseline metrics to beat")
print("   - We know where our detector struggles (hard/misleading claims)")
print("   - Future improvements can be measured objectively")
print("   - Ready to add tools in Iteration 4 to improve accuracy!")

🎯 Key Insights from Evaluation:

  1. **Accuracy by Difficulty**: Performance decreases as claims get harder
  2. **Confidence Calibration**: Confidence was 95% for correct and wrong predictions alike on easy claims, so it does not yet separate them
  3. **Category Performance**: Technical claims are easiest to verify
  4. **Retry Logic**: LangGraph version handles errors more gracefully
  5. **Response Time**: Both detectors respond in about a second on easy claims (1.33s baseline, 1.09s LangGraph)

💡 What This Means:
   - We now have baseline metrics to beat
   - We know where our detector struggles (hard/misleading claims)
   - Future improvements can be measured objectively
   - Ready to add tools in Iteration 4 to improve accuracy!


## 9. Interactive Evaluation

Try evaluating specific claims yourself:

In [51]:
def interactive_evaluation():
    """Let users test specific claims interactively"""
    print("🎮 Interactive Claim Evaluation")
    print("Type a claim ID (e.g., 'easy_001') or 'quit' to exit\n")
    
    # Load claims into a dict for easy lookup
    claim_dict = {c['id']: c for c in dataset['claims']}
    
    while True:
        claim_id = input("\nEnter claim ID: ").strip()
        
        if claim_id.lower() == 'quit':
            break
            
        if claim_id not in claim_dict:
            print("❌ Invalid claim ID. Try 'easy_001', 'medium_003', etc.")
            continue
        
        claim_data = claim_dict[claim_id]
        print(f"\n📋 Claim: \"{claim_data['claim']}\"")
        print(f"Ground Truth: {claim_data['verdict']}")
        print(f"Difficulty: {claim_data['difficulty']}")
        
        # Test with LangGraph detector
        print("\n🤖 Testing with LangGraph detector...")
        result = check_claim_with_graph(claim_data['claim'])
        
        print(f"\nPrediction: {result.get('verdict')}")
        print(f"Confidence: {result.get('confidence')}%")
        print(f"Correct: {'✅' if result.get('verdict') == claim_data['verdict'] else '❌'}")
        print(f"\nReasoning: {result.get('reasoning')}")
    
    print("\n👋 Thanks for testing!")

# Uncomment to run interactive evaluation
# interactive_evaluation()

## Summary

### What We Learned
1. **Evaluation is Critical**: We can't improve what we don't measure
2. **Custom Metrics Matter**: Domain-specific metrics give better insights
3. **Test Dataset Design**: Good test data covers edge cases
4. **Iterative Improvement**: Each iteration should show a measurable improvement

### Our Evaluation Framework
- ✅ 30 aviation claims across difficulty levels
- ✅ Custom DeepEval metrics for accuracy, confidence, and reasoning
- ✅ Comparative analysis between iterations
- ✅ Performance tracking and visualization

### Next Steps
In Iteration 4, we'll add web search tools to improve accuracy on claims that need external verification. Our evaluation framework will let us measure whether this actually helps!

### 🎯 Challenge
Before moving on, try:
1. Running evaluation on just "misleading" category claims
2. Creating your own custom metric
3. Adding a new test claim to the dataset

Remember: Good evaluation leads to good improvements!

## 10. Production Evaluation: Unknown Data

Now let's see how to evaluate claims WITHOUT ground truth, which is what you need in production!

In [52]:
# Import the production evaluator
from modules.m3_production_evaluation import (
    ProductionEvaluator, 
    ProductionMetrics,
    evaluate_unknown_claim
)

# Create a production evaluator
prod_evaluator = ProductionEvaluator()

print("🚀 Production Evaluator Ready!")
print("\nThis evaluator works WITHOUT ground truth by measuring:")
print("- Reasoning quality (LLM-as-judge)")
print("- Confidence calibration")
print("- Consistency with similar claims")
print("- Domain detection and drift")
print("- Anomaly detection")

🚀 Production Evaluator Ready!

This evaluator works WITHOUT ground truth by measuring:
- Reasoning quality (LLM-as-judge)
- Confidence calibration
- Consistency with similar claims
- Domain detection and drift
- Anomaly detection


In [53]:
# Test on various domains (no ground truth needed!)
test_claims = [
    # Aviation (in-domain)
    "The new Boeing 797 will have folding wings",
    
    # Technology (out-of-domain)
    "Quantum computers can break all encryption instantly",
    
    # Medical (out-of-domain) 
    "Drinking coffee cures all diseases",
    
    # Ambiguous claim
    "AI will replace all programmers by next year",
    
    # Suspicious claim
    "This one weird trick makes you a millionaire"
]

print("🧪 Evaluating claims from different domains...\n")

for claim in test_claims:
    print(f"📝 Claim: \"{claim}\"")
    
    # Get BS detector result
    result = check_claim_with_graph(claim)
    
    # Evaluate without ground truth
    metrics = prod_evaluator.evaluate(claim, result)
    
    # Display results
    print(f"   Verdict: {result['verdict']} (confidence: {result['confidence']}%)")
    print(f"   Domain: {prod_evaluator.drift_detector.detect_domain(claim)[0]}")
    print(f"   Trust Score: {metrics.trust_score:.2f}")
    print(f"   Anomaly Score: {metrics.anomaly_score:.2f}")
    print(f"   Needs Human Review: {'🚨 YES' if metrics.requires_human_review else '✅ NO'}")
    
    if metrics.requires_human_review:
        print("   Reason: ", end="")
        if metrics.anomaly_score > 0.7:
            print("Unusual claim, ", end="")
        if metrics.trust_score < 0.6:
            print("Low quality metrics, ", end="")
        if metrics.reasoning_quality < 0.5:
            print("Poor reasoning", end="")
        print()
    
    print()

🧪 Evaluating claims from different domains...

📝 Claim: "The new Boeing 797 will have folding wings"
   Verdict: BS (confidence: 85%)
   Domain: aviation
   Trust Score: 0.70
   Anomaly Score: 0.67
   Needs Human Review: ✅ NO

📝 Claim: "Quantum computers can break all encryption instantly"
   Verdict: BS (confidence: 90%)
   Domain: technology
   Trust Score: 0.66
   Anomaly Score: 0.67
   Needs Human Review: ✅ NO

📝 Claim: "Drinking coffee cures all diseases"
   Verdict: BS (confidence: 95%)
   Domain: medical
   Trust Score: 0.59
   Anomaly Score: 0.67
   Needs Human Review: 🚨 YES
   Reason: Low quality metrics, 

📝 Claim: "AI will replace all programmers by next year"
   Verdict: BS (confidence: 95%)
   Domain: general
   Trust Score: 0.61
   Anomaly Score: 0.50
   Needs Human Review: ✅ NO

📝 Claim: "This one weird trick makes you a millionaire"
   Verdict: BS (confidence: 95%)
   Domain: general
   Trust Score: 0.58
   Anomaly Score: 0.50
   Needs Human Review: 🚨 YES
   Reason: Low
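
The display loop above prints each reason with a trailing comma. A minimal sketch of the same checks, collecting reasons into a list and joining them, is shown below. The thresholds (0.7, 0.6, 0.5) are copied from the display code; whether `ProductionEvaluator` uses the same thresholds internally to set `requires_human_review` is an assumption. `review_reasons` is a hypothetical helper.

```python
def review_reasons(trust_score, anomaly_score, reasoning_quality):
    """Collect human-readable reasons using the thresholds from the display loop."""
    reasons = []
    if anomaly_score > 0.7:
        reasons.append("Unusual claim")
    if trust_score < 0.6:
        reasons.append("Low quality metrics")
    if reasoning_quality < 0.5:
        reasons.append("Poor reasoning")
    return reasons


# Trust and anomaly scores for the coffee claim above. Its reasoning quality
# is not printed there, so 1.0 is a placeholder value.
coffee_reasons = review_reasons(trust_score=0.59, anomaly_score=0.67, reasoning_quality=1.0)
print("Reason:", ", ".join(coffee_reasons) or "none")
```

Returning a list also makes the reasons easy to log or export alongside the claim for reviewers.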

In [54]:
# Show detailed metrics breakdown
print("📊 Detailed Metrics Breakdown\n")

# Pick an interesting claim
claim = "AI will replace all programmers by next year"
result = check_claim_with_graph(claim)
metrics = prod_evaluator.evaluate(claim, result)

print(f"Claim: \"{claim}\"")
print(f"Verdict: {result['verdict']} (confidence: {result['confidence']}%)\n")

# Show all metrics
print("Quality Metrics:")
print(f"  Reasoning Quality: {metrics.reasoning_quality:.2f}")
print(f"  Claim Plausibility: {metrics.claim_plausibility:.2f}")
print(f"  Evidence Quality: {metrics.evidence_quality:.2f}")
print(f"  Logical Coherence: {metrics.logical_coherence:.2f}")

print("\nBehavioral Metrics:")
print(f"  Confidence Calibration: {metrics.confidence_calibration:.2f}")
print(f"  Consistency Score: {metrics.consistency_score:.2f}")
print(f"  Token Efficiency: {metrics.token_efficiency:.2f}")

print("\nDrift Detection:")
print(f"  Domain Confidence: {metrics.domain_confidence:.2f}")
print(f"  Anomaly Score: {metrics.anomaly_score:.2f}")

print(f"\n🎯 Overall Trust Score: {metrics.trust_score:.2f}")
print(f"🚨 Requires Human Review: {metrics.requires_human_review}")

📊 Detailed Metrics Breakdown

Claim: "AI will replace all programmers by next year"
Verdict: BS (confidence: 95%)

Quality Metrics:
  Reasoning Quality: 1.00
  Claim Plausibility: 1.00
  Evidence Quality: 0.40
  Logical Coherence: 1.00

Behavioral Metrics:
  Confidence Calibration: 0.39
  Consistency Score: 1.00
  Token Efficiency: 0.97

Drift Detection:
  Domain Confidence: 0.50
  Anomaly Score: 0.50

🎯 Overall Trust Score: 0.71
🚨 Requires Human Review: False


In [55]:
# Get evaluation summary
print("📈 Evaluation Summary (Production Metrics)\n")

summary = prod_evaluator.get_evaluation_summary()

print(f"Total Evaluations: {summary['total_evaluations']}")
print(f"Average Trust Score: {summary['avg_trust_score']:.2f}")
print(f"Human Review Rate: {summary['human_review_rate']:.1%}")
print(f"Low Trust Claims: {summary['low_trust_claims']}")

print("\nDomain Distribution:")
for domain, count in summary['domain_distribution'].items():
    print(f"  {domain}: {count} claims")

# Show which claims need human review
print("\n🚨 Claims Flagged for Human Review:")
for item in prod_evaluator.export_for_human_review():
    print(f"\n- \"{item['claim']}\"")
    print(f"  Trust Score: {item['metrics']['trust_score']:.2f}")
    print(f"  Verdict: {item['result']['verdict']}")

📈 Evaluation Summary (Production Metrics)

Total Evaluations: 6
Average Trust Score: 0.64
Human Review Rate: 33.3%
Low Trust Claims: 2

Domain Distribution:
  aviation: 1 claims
  technology: 1 claims
  medical: 1 claims
  general: 3 claims

🚨 Claims Flagged for Human Review:

- "Drinking coffee cures all diseases"
  Trust Score: 0.59
  Verdict: BS

- "This one weird trick makes you a millionaire"
  Trust Score: 0.58
  Verdict: BS
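
The summary figures can be cross-checked by hand from the trust scores printed in the two previous cells (five claims in the loop plus the detailed rerun). The sketch below recomputes them, assuming a review threshold of 0.6 on the trust score, which is inferred from the display loop and may not be the only rule the evaluator applies.

```python
from statistics import mean

# Trust scores printed earlier: five loop claims, then the detailed rerun (0.71).
trust_scores = [0.70, 0.66, 0.59, 0.61, 0.58, 0.71]
REVIEW_THRESHOLD = 0.6  # assumption: inferred from the display loop

low_trust = [score for score in trust_scores if score < REVIEW_THRESHOLD]
recomputed = {
    "total_evaluations": len(trust_scores),
    "avg_trust_score": mean(trust_scores),
    "human_review_rate": len(low_trust) / len(trust_scores),
    "low_trust_claims": len(low_trust),
}

print(f"Total Evaluations: {recomputed['total_evaluations']}")
print(f"Average Trust Score: {recomputed['avg_trust_score']:.2f}")
print(f"Human Review Rate: {recomputed['human_review_rate']:.1%}")
print(f"Low Trust Claims: {recomputed['low_trust_claims']}")
```

These recomputed values agree with the summary above: 6 evaluations, an average trust score of 0.64, and 2 of 6 claims (33.3%) flagged.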


## Key Differences: Known vs Unknown Data Evaluation

### Known Data (Traditional Evaluation)
- ✅ Have ground truth labels
- ✅ Can calculate exact accuracy
- ❌ Limited to test set
- ❌ Doesn't work in production, where labels are unavailable

### Unknown Data (Production Evaluation)
- ❌ No ground truth
- ✅ Works on any claim
- ✅ Uses proxy metrics (quality, consistency, drift)
- ✅ Identifies when human review is needed

### The Production Metrics
1. **LLM-as-Judge**: Another LLM evaluates the reasoning quality
2. **Confidence Calibration**: Does confidence match the language used?
3. **Consistency**: Similar claims should get similar verdicts
4. **Drift Detection**: Is this claim very different from what we've seen?
5. **Trust Score**: Aggregate measure of reliability
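
To make the consistency idea concrete, the sketch below scores a new verdict against earlier verdicts on similar claims. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for similarity; how `ProductionEvaluator` actually measures similarity is not shown in this notebook, and the function name, the 0.6 cutoff, and the history format are all assumptions.

```python
from difflib import SequenceMatcher


def consistency_score(claim, verdict, history, min_similarity=0.6):
    """Fraction of sufficiently similar past claims that received the same verdict.

    history is a list of (past_claim, past_verdict) pairs. Returns 1.0 when
    nothing similar has been seen, since there is nothing to contradict.
    """
    similar_verdicts = [
        past_verdict
        for past_claim, past_verdict in history
        if SequenceMatcher(None, claim.lower(), past_claim.lower()).ratio() >= min_similarity
    ]
    if not similar_verdicts:
        return 1.0
    return sum(v == verdict for v in similar_verdicts) / len(similar_verdicts)


history = [("Commercial airplanes can fly backwards", "BS")]
agree = consistency_score("Commercial airplanes can fly backward", "BS", history)
disagree = consistency_score("Commercial airplanes can fly backward", "LEGITIMATE", history)
print(f"Same verdict: {agree:.2f}, different verdict: {disagree:.2f}")
```

Character-level similarity is crude: it misses paraphrases and can match claims that differ by one critical word. It is meant only to show the shape of the check.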

### When to Use Each
- **Development**: Use known data evaluation to improve your detector
- **Production**: Use unknown data evaluation to monitor real-world performance
- **Best Practice**: Use both! Known data for baseline, unknown for monitoring