# Task 3: Performance Evaluation with Multiple Metrics

**Title:** "Multi-Metric NER Performance Evaluation Against Ground Truth"

## Objective
Create a comprehensive evaluation module for NER predictions from Task 2 that:
- Loads ground truth from the 'original' subdirectory
- Parses both predicted and ground truth annotations into comparable formats
- Implements entity-level evaluation (exact boundary + label matching)
- Uses the seqeval library for standard NER evaluation
- Provides per-entity-type metrics, overall metrics, and confusion matrix
- Includes detailed justifications for evaluation approach choices


In [None]:
# Import required libraries
import sys
from pathlib import Path
import re
from typing import List, Tuple, Dict, Set, Optional
from collections import defaultdict, Counter
import numpy as np
import pandas as pd

# Install seqeval if not already installed
try:
    from seqeval.metrics import (
        classification_report,
        accuracy_score,
        precision_score,
        recall_score,
        f1_score
    )
except ImportError:
    print("⚠ seqeval not found. Installing...")
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "seqeval"])
    from seqeval.metrics import (
        classification_report,
        accuracy_score,
        precision_score,
        recall_score,
        f1_score
    )
    print("✓ seqeval installed successfully")

# Note: confusion_matrix is not available in seqeval.metrics
# We use our own generate_confusion_matrix() function for entity-level confusion matrices

import warnings
warnings.filterwarnings('ignore')

# Configuration
BASE_DIR = Path("cadec")
TEXT_DIR = BASE_DIR / "text"
ORIGINAL_DIR = BASE_DIR / "original"

# Verify directories exist
if not TEXT_DIR.exists():
    raise FileNotFoundError(f"Directory not found: {TEXT_DIR}")
if not ORIGINAL_DIR.exists():
    raise FileNotFoundError(f"Directory not found: {ORIGINAL_DIR}")

print("✓ Directories verified")
print(f"  - Text directory: {TEXT_DIR}")
print(f"  - Original directory: {ORIGINAL_DIR}")

# Label types we're evaluating
LABEL_TYPES = ['ADR', 'Drug', 'Disease', 'Symptom']


## 1. Loading and Parsing Ground Truth Annotations

Load ground truth annotations from the 'original' subdirectory and parse them into a structured format.
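
As a small illustration of the annotation format, the sketch below parses a single entity line into its tag, label, character ranges, and text. The helper name `parse_ann_line` is hypothetical and is not part of this notebook; unlike the loader in the next cell, it keeps all ranges of a discontinuous entity together in one record, which is only one possible design.

```python
import re
from typing import Dict, List, Optional, Tuple

def parse_ann_line(line: str) -> Optional[Dict]:
    """Parse one 'TAG<TAB>LABEL RANGES<TAB>TEXT' line; return None for non-entity lines."""
    match = re.match(r'^(T\d+)\t(\S+) ([^\t]+)\t(.+)$', line.rstrip('\n'))
    if not match:
        return None
    tag, label, ranges_str, text = match.groups()
    ranges: List[Tuple[int, int]] = []
    # Discontinuous entities list several "START END" pairs separated by ';'
    for pair in ranges_str.split(';'):
        start, end = pair.split()
        ranges.append((int(start), int(end)))
    return {'tag': tag, 'label': label, 'ranges': ranges, 'text': text}

single = parse_ann_line("T1\tADR 9 19\tbit drowsy")
multi = parse_ann_line("T6\tSymptom 66 74;76 94;98 107\tthe heel I couldn't walk on very well")
note = parse_ann_line("#1\tAnnotatorNotes T1\tsome note")  # comment line, not an entity

print(single['ranges'])
print(multi['ranges'])
print(note)
```

Splitting on `;` handles the single-range case as well, so no separate branch is needed for it.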


In [None]:
def load_ground_truth(ann_file_path: Path) -> List[Dict]:
    """
    Load and parse ground truth annotation file from 'original' subdirectory.
    
    Format: TAG\tLABEL START END\tTEXT
    Example: T1	ADR 9 19	bit drowsy
    Example with multiple ranges: T6	Symptom 66 74;76 94;98 107	the heel I couldn't walk on very well
    
    Parameters:
    -----------
    ann_file_path : Path
        Path to the .ann annotation file
        
    Returns:
    --------
    List[Dict]
        List of entity dictionaries with:
        - 'label': Entity type (ADR, Drug, Disease, Symptom)
        - 'text': Entity text
        - 'start': Start character position
        - 'end': End character position
        - 'tag': Original tag identifier (T1, T2, etc.)
    """
    entities = []
    
    try:
        with open(ann_file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                
                # Skip empty lines and comment lines (starting with '#')
                if not line or line.startswith('#'):
                    continue
                
                # Parse entity annotation lines (starting with 'T' followed by a number)
                # Format: TAG\tLABEL RANGES\tTEXT
                match = re.match(r'^(T\d+)\t([^\t]+)\t(.+)$', line)
                if match:
                    tag = match.group(1)
                    label_and_ranges = match.group(2)
                    text = match.group(3)
                    
                    # Extract label type (first word) and ranges (remaining part)
                    parts = label_and_ranges.split(None, 1)
                    if len(parts) < 2:
                        continue
                    
                    label_type = parts[0]
                    ranges_str = parts[1]
                    
                    # Only process ADR, Drug, Disease, Symptom labels
                    if label_type not in LABEL_TYPES:
                        continue
                    
                    # Extract ranges (can be multiple pairs separated by semicolons)
                    ranges = []
                    if ';' in ranges_str:
                        # Multiple ranges format: "START1 END1;START2 END2;..."
                        range_pairs = ranges_str.split(';')
                        for rp in range_pairs:
                            rp = rp.strip()
                            if rp:
                                range_nums = rp.split()
                                if len(range_nums) >= 2:
                                    try:
                                        start = int(range_nums[0])
                                        end = int(range_nums[1])
                                        ranges.append((start, end))
                                    except ValueError:
                                        continue
                    else:
                        # Single range format: "START END"
                        range_nums = ranges_str.split()
                        if len(range_nums) >= 2:
                            try:
                                start = int(range_nums[0])
                                end = int(range_nums[1])
                                ranges = [(start, end)]
                            except ValueError:
                                continue
                    
                    # Create entity entries for each range
                    # For multiple ranges, we create separate entities (standard practice in NER)
                    for start, end in ranges:
                        entities.append({
                            'label': label_type,
                            'text': text.strip(),
                            'start': start,
                            'end': end,
                            'tag': tag
                        })
    
    except Exception as e:
        print(f"Error loading ground truth from {ann_file_path}: {e}")
        return []
    
    return entities

print("✓ Ground truth loading function defined")


## 2. Loading Predicted Annotations

Load predicted annotations. These can be:
- Generated on-the-fly using Task 2 pipeline
- Loaded from saved prediction files (if available)

For this implementation, we'll create a function that can parse predictions in the same format as ground truth (TAG\tLABEL START END\tTEXT).


In [None]:
def load_predictions(ann_file_path: Path) -> List[Dict]:
    """
    Load and parse predicted annotation file.
    
    Uses the same parsing logic as load_ground_truth() since predictions
    should be in the same format (generated by Task 2's conversion function).
    
    Parameters:
    -----------
    ann_file_path : Path
        Path to the predicted .ann file (or can be a list of annotation lines)
        
    Returns:
    --------
    List[Dict]
        List of entity dictionaries with same structure as ground truth
    """
    # If it's a Path object pointing to a file
    if isinstance(ann_file_path, Path) and ann_file_path.exists():
        return load_ground_truth(ann_file_path)
    
    # If it's a list of annotation lines (from Task 2 output)
    if isinstance(ann_file_path, list):
        entities = []
        for line in ann_file_path:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            
            # Parse: TAG\tLABEL START END\tTEXT
            match = re.match(r'^(T\d+)\t([^\t]+)\t(.+)$', line)
            if match:
                tag = match.group(1)
                label_and_ranges = match.group(2)
                text = match.group(3)
                
                parts = label_and_ranges.split(None, 1)
                if len(parts) < 2:
                    continue
                
                label_type = parts[0]
                ranges_str = parts[1]
                
                if label_type not in LABEL_TYPES:
                    continue
                
                # Parse ranges
                ranges = []
                if ';' in ranges_str:
                    range_pairs = ranges_str.split(';')
                    for rp in range_pairs:
                        rp = rp.strip()
                        if rp:
                            range_nums = rp.split()
                            if len(range_nums) >= 2:
                                try:
                                    start = int(range_nums[0])
                                    end = int(range_nums[1])
                                    ranges.append((start, end))
                                except ValueError:
                                    continue
                else:
                    range_nums = ranges_str.split()
                    if len(range_nums) >= 2:
                        try:
                            start = int(range_nums[0])
                            end = int(range_nums[1])
                            ranges = [(start, end)]
                        except ValueError:
                            continue
                
                for start, end in ranges:
                    entities.append({
                        'label': label_type,
                        'text': text.strip(),
                        'start': start,
                        'end': end,
                        'tag': tag
                    })
        
        return entities
    
    return []

print("✓ Prediction loading function defined")


## 3. Entity-Level Evaluation Implementation and Justification

### Why Entity-Level Evaluation is More Appropriate Than Token-Level

Entity-level evaluation is the **recommended approach** for medical NER evaluation, especially for downstream tasks. Here's why:

#### 1. **Downstream Task Alignment**
Medical information extraction systems operate on **complete entities**, not individual tokens:
- **Drug-Disease Relationships**: Need complete entity spans to extract meaningful relationships (e.g., "ibuprofen treats arthritis" requires both full entities)
- **Adverse Event Detection**: ADR extraction systems need exact entity boundaries to properly link drugs to adverse reactions
- **Knowledge Graph Construction**: Medical knowledge graphs require precise entity nodes with correct boundaries
- **Clinical Decision Support**: Partial matches provide incomplete or ambiguous information that can mislead clinical systems

#### 2. **Importance of Exact Boundary Matching in Medical NER**

In medical contexts, boundary precision is **critical for clinical safety**:
- **Semantic Meaning**: "stomach pain" vs "stomach" + "pain" convey different clinical meanings
  - "stomach pain" = specific symptom entity
  - "stomach" + "pain" = potentially two separate concepts
- **Drug Names**: "aspirin" (7 chars) vs "aspirin " with a trailing space (8 chars) can lead to database lookup failures
- **Disease Names**: "type 2 diabetes" must be captured as complete entity for proper ICD-10 coding
- **Dosage Information**: Boundary errors can truncate dosage amounts (e.g., capturing "50" instead of "50 mg")
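
One way to make boundary precision concrete is to check that slicing the document with an entity's offsets reproduces the annotated text. The sketch below assumes half-open `[start, end)` character offsets, which is consistent with the `T1 ADR 9 19 bit drowsy` example used in this notebook; the sentence and the helper name `span_matches_text` are invented for illustration.

```python
def span_matches_text(document: str, start: int, end: int, expected: str) -> bool:
    """Return True when the half-open span [start, end) reproduces the expected text."""
    return document[start:end] == expected

document = "I felt a bit drowsy after taking it."

exact = span_matches_text(document, 9, 19, "bit drowsy")      # correct boundaries
too_long = span_matches_text(document, 9, 20, "bit drowsy")   # span includes a trailing space
too_short = span_matches_text(document, 13, 19, "bit drowsy") # span covers only "drowsy"

print(exact, too_long, too_short)
```

A check like this can also catch offsets that have drifted because the text was modified (for example, stripped) after the annotations were created.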

#### 3. **Partial Matches Should Be Considered False Positives**

Our evaluation treats **any partial match as a False Positive** (the ground truth entity it failed to match is additionally counted as a False Negative):
- **Rationale**: A partial match doesn't provide complete, usable information for downstream tasks
- **Example Scenarios**:
  - GT: "ibuprofen" [10, 19], Pred: "ibuprofen" [10, 20] → **FP** (boundary mismatch)
  - GT: "bit drowsy" [9, 19], Pred: "drowsy" [13, 19] → **FP** (partial boundary)
  - GT: ADR [10, 19], Pred: Drug [10, 19] → **FP** (label confusion)
- **Strict Matching Encourages**:
  - Models that learn precise entity boundaries
  - Production systems that deliver accurate extractions
  - Fewer errors in clinical applications
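
The scenarios above can be sketched with plain set operations. This is a minimal illustration, assuming entities are represented as `(label, start, end)` tuples; the helper name `strict_counts` is hypothetical. Note that under set-based matching a mismatched prediction also leaves the ground truth entity unmatched, so each scenario yields one FP and one FN.

```python
from typing import Set, Tuple

Entity = Tuple[str, int, int]  # (label, start, end)

def strict_counts(gold: Set[Entity], pred: Set[Entity]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) under exact boundary + label matching."""
    tp = len(gold & pred)   # identical label and boundaries
    fp = len(pred - gold)   # predicted but not in ground truth
    fn = len(gold - pred)   # in ground truth but not predicted
    return tp, fp, fn

gold = {("Drug", 10, 19)}

exact_match = strict_counts(gold, {("Drug", 10, 19)})
boundary_off = strict_counts(gold, {("Drug", 10, 20)})  # end offset differs by one
wrong_label = strict_counts(gold, {("ADR", 10, 19)})    # same span, different label

print(exact_match, boundary_off, wrong_label)
```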

#### 4. **Clinical Safety Implications**

Incorrect entity boundaries can have serious consequences:
- **Drug Interaction Detection**: Wrong boundaries can miss critical drug-drug interactions
- **Patient Safety**: Misidentified adverse events can lead to incorrect medical decisions
- **Regulatory Compliance**: Software used in regulated medical settings typically has to demonstrate its extraction accuracy, and strict matching gives a conservative measure of it

### How seqeval Handles Entity Boundary Evaluation

The `seqeval` library implements **CoNLL-2003 evaluation standards**:

- **Exact Match Requirement**: Entities must match EXACTLY on:
  - Boundary start position
  - Boundary end position  
  - Label type
- **No Partial Credit**: A prediction (ADR, 10, 20) only matches ground truth (ADR, 10, 20)
- **Overlapping Entities**: 
  - Prediction [10, 25] vs Ground Truth [10, 20] → **False Positive**
  - Prediction [15, 20] vs Ground Truth [10, 20] → **False Positive**
- **BIO Format Evaluation**: seqeval works on BIO-tagged sequences but evaluates at entity level
- **Token-Level vs Entity-Level**: seqeval calculates metrics on complete entities, not individual tokens

**Alignment with Standards**: This approach follows established NER evaluation methodology used in:
- CoNLL-2003 shared task
- SemEval medical NER challenges
- Clinical NLP evaluation benchmarks
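
To show what entity-level scoring over BIO sequences means in practice, here is a pure-Python sketch that extracts spans from BIO tags and scores them by exact match. It is not seqeval's implementation; the helper `bio_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a new entity is an assumption made for this example.

```python
from typing import List, Set, Tuple

def bio_to_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (label, first_token, end_token_exclusive) spans from a BIO sequence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(tags + ['O']):  # sentinel 'O' closes a trailing entity
        inside = tag.startswith('I-') and label == tag[2:]
        if not inside and label is not None:
            spans.add((label, start, i))    # current entity ends before token i
            start, label = None, None
        if tag.startswith('B-') or (tag.startswith('I-') and not inside):
            start, label = i, tag[2:]       # open a new entity at token i
    return spans

gold_tags = ['O', 'B-ADR', 'I-ADR', 'O', 'B-Drug']
pred_tags = ['O', 'O',     'B-ADR', 'O', 'B-Drug']

gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
tp = len(gold & pred)
entity_precision = tp / len(pred)
entity_recall = tp / len(gold)
token_accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)

print(entity_precision, entity_recall, token_accuracy)
```

The partially overlapping ADR prediction earns no entity-level credit even though one of its tokens falls inside the gold span, which is why token accuracy and entity-level scores can diverge.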


In [None]:
def entity_level_evaluation(ground_truth: List[Dict], predictions: List[Dict]) -> Dict:
    """
    Perform entity-level evaluation (exact boundary + label matching).
    
    Entity-level evaluation requires:
    - Exact match of entity boundaries (start AND end positions)
    - Exact match of label type
    
    Returns TP, FP, FN counts for each entity type and overall.
    
    Parameters:
    -----------
    ground_truth : List[Dict]
        List of ground truth entities
    predictions : List[Dict]
        List of predicted entities
        
    Returns:
    --------
    Dict
        Dictionary containing:
        - 'tp', 'fp', 'fn' per entity type
        - Overall 'tp', 'fp', 'fn'
        - 'precision', 'recall', 'f1' per entity type
        - Overall 'precision', 'recall', 'f1'
    """
    # Convert entities to sets of tuples for exact matching
    # Format: (label, start, end, text) - text included for verification
    gt_set = set()
    for entity in ground_truth:
        gt_set.add((entity['label'], entity['start'], entity['end'], entity['text']))
    
    pred_set = set()
    for entity in predictions:
        pred_set.add((entity['label'], entity['start'], entity['end'], entity['text']))
    
    # Calculate True Positives: entities that appear in both sets
    tp_all = gt_set.intersection(pred_set)
    
    # Calculate False Positives: predicted entities not in ground truth
    fp_all = pred_set - gt_set
    
    # Calculate False Negatives: ground truth entities not predicted
    fn_all = gt_set - pred_set
    
    # Per-entity-type metrics
    results = {
        'tp': {},
        'fp': {},
        'fn': {},
        'precision': {},
        'recall': {},
        'f1': {}
    }
    
    # Calculate metrics per entity type
    for label_type in LABEL_TYPES:
        # Filter by label type
        tp_type = {e for e in tp_all if e[0] == label_type}
        fp_type = {e for e in fp_all if e[0] == label_type}
        fn_type = {e for e in fn_all if e[0] == label_type}
        
        tp_count = len(tp_type)
        fp_count = len(fp_type)
        fn_count = len(fn_type)
        
        # Calculate Precision, Recall, F1
        precision = tp_count / (tp_count + fp_count) if (tp_count + fp_count) > 0 else 0.0
        recall = tp_count / (tp_count + fn_count) if (tp_count + fn_count) > 0 else 0.0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
        
        results['tp'][label_type] = tp_count
        results['fp'][label_type] = fp_count
        results['fn'][label_type] = fn_count
        results['precision'][label_type] = precision
        results['recall'][label_type] = recall
        results['f1'][label_type] = f1
    
    # Overall micro-averaged metrics
    total_tp = len(tp_all)
    total_fp = len(fp_all)
    total_fn = len(fn_all)
    
    overall_precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0.0
    overall_recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0.0
    overall_f1 = 2 * (overall_precision * overall_recall) / (overall_precision + overall_recall) if (overall_precision + overall_recall) > 0 else 0.0
    
    results['tp']['OVERALL'] = total_tp
    results['fp']['OVERALL'] = total_fp
    results['fn']['OVERALL'] = total_fn
    results['precision']['OVERALL'] = overall_precision
    results['recall']['OVERALL'] = overall_recall
    results['f1']['OVERALL'] = overall_f1
    
    return results

print("✓ Entity-level evaluation function defined")


## 4. Converting to BIO Format for seqeval

seqeval requires BIO-tagged sequences. We'll convert both ground truth and predictions to BIO format for comprehensive evaluation.
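
The conversion can be sketched on a single sentence as follows. This is a simplified illustration rather than the notebook's function: it assumes whitespace tokenization via `re.finditer`, entities given as `(label, start, end)` tuples, and invented helper names (`whitespace_tokens`, `tag_tokens`).

```python
import re
from typing import List, Tuple

def whitespace_tokens(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start, end) for each run of non-whitespace characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r'\S+', text)]

def tag_tokens(text: str, entities: List[Tuple[str, int, int]]) -> List[str]:
    """Assign BIO tags to whitespace tokens that overlap each entity span."""
    tokens = whitespace_tokens(text)
    tags = ['O'] * len(tokens)
    for label, start, end in sorted(entities, key=lambda e: (e[1], e[2])):
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            overlaps = tok_start < end and tok_end > start
            if overlaps and tags[i] == 'O':  # earlier entities keep their tokens
                tags[i] = ('B-' if first else 'I-') + label
                first = False
    return tags

text = "I felt a bit drowsy after taking it."
tags = tag_tokens(text, [("ADR", 9, 19)])
print(list(zip([tok for tok, _, _ in whitespace_tokens(text)], tags)))
```

Because punctuation stays attached to its token (for example `it.`), two character spans that differ only inside one token map to the same BIO tags, so scores computed on BIO sequences can be slightly more lenient than character-level exact matching.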


In [None]:
def entities_to_bio_tags(entities: List[Dict], text: str) -> List[str]:
    """
    Convert entity annotations to BIO-tagged sequence.
    
    This function:
    1. Tokenizes text word-by-word
    2. Assigns BIO tags based on entity spans
    3. Handles overlapping entities (takes first entity in case of overlap)
    
    Parameters:
    -----------
    entities : List[Dict]
        List of entity dictionaries with 'label', 'start', 'end'
    text : str
        Original text to tag
        
    Returns:
    --------
    List[str]
        List of BIO tags (one per word/token)
    """
    # Tokenize text word-by-word and get character positions
    words = text.split()
    word_tokens = []
    current_pos = 0
    
    for word in words:
        word_start = text.find(word, current_pos)
        if word_start == -1:
            word_start = current_pos
        word_end = word_start + len(word)
        word_tokens.append((word, word_start, word_end))
        
        next_pos = word_end
        while next_pos < len(text) and text[next_pos].isspace():
            next_pos += 1
        current_pos = next_pos
    
    # Initialize all tags as 'O'
    bio_tags = ['O'] * len(word_tokens)
    
    # Sort entities by start position to handle overlaps consistently
    sorted_entities = sorted(entities, key=lambda x: (x['start'], x['end']))
    
    # Assign BIO tags
    for entity in sorted_entities:
        label = entity['label']
        start_char = entity['start']
        end_char = entity['end']
        
        # Find words that overlap with this entity span
        overlapping_indices = []
        for i, (word, word_start, word_end) in enumerate(word_tokens):
            # Word overlaps if: word_start < end_char AND word_end > start_char
            if word_start < end_char and word_end > start_char:
                overlapping_indices.append(i)
        
        # Assign BIO labels: B- for first word, I- for subsequent words
        if overlapping_indices:
            for idx, word_idx in enumerate(overlapping_indices):
                # Only assign if not already assigned (handle overlaps)
                if bio_tags[word_idx] == 'O':
                    if idx == 0:
                        bio_tags[word_idx] = f"B-{label}"
                    else:
                        # Check if previous word had the same entity type
                        prev_label = bio_tags[word_idx - 1]
                        if prev_label.endswith(label):
                            bio_tags[word_idx] = f"I-{label}"
                        else:
                            bio_tags[word_idx] = f"B-{label}"  # New entity of same type
    
    return bio_tags

print("✓ BIO conversion function defined")


## 5. Confusion Matrix Generation

Generate a confusion matrix to visualize common misclassifications between entity types.
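
The counting idea can be sketched without pandas, assuming each side is a mapping from an exact `(start, end)` span to its label and that a span missing on one side is recorded as `'NONE'`. The helper name `confusion_counts` and the toy spans are made up for this example.

```python
from collections import Counter
from typing import Dict, Tuple

Span = Tuple[int, int]

def confusion_counts(gold: Dict[Span, str], pred: Dict[Span, str]) -> Counter:
    """Count (gold_label, pred_label) pairs over the union of exact spans."""
    counts = Counter()
    for span in gold.keys() | pred.keys():
        counts[(gold.get(span, 'NONE'), pred.get(span, 'NONE'))] += 1
    return counts

gold = {(9, 19): 'ADR', (30, 39): 'Drug', (50, 58): 'Symptom'}
pred = {(9, 19): 'ADR', (30, 39): 'Disease', (60, 70): 'ADR'}

counts = confusion_counts(gold, pred)
for (gold_label, pred_label), n in sorted(counts.items()):
    print(f"{gold_label:>8} -> {pred_label:<8} {n}")
```

Here the correct ADR lands on the diagonal, the Drug entity is confused with Disease, the Symptom entity is missed, and the extra ADR prediction is spurious.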


In [None]:
def generate_confusion_matrix(ground_truth: List[Dict], predictions: List[Dict]) -> pd.DataFrame:
    """
    Generate an entity-level confusion matrix keyed on exact spans.
    
    Entities are compared by their (start, end) position. Each span that
    appears on both sides adds one count at (ground truth label, predicted
    label), so correct predictions land on the diagonal and label
    confusions land off it.
    
    Note: Spans that appear on only one side, including partial overlaps,
    are paired with 'NONE': spurious predictions go in the 'NONE' row and
    missed ground truth entities go in the 'NONE' column.
    
    Parameters:
    -----------
    ground_truth : List[Dict]
        List of ground truth entities
    predictions : List[Dict]
        List of predicted entities
        
    Returns:
    --------
    pd.DataFrame
        Confusion matrix with ground truth labels as rows, predicted labels as columns
    """
    # Create mapping of (start, end) -> label for ground truth
    gt_by_position = {}
    for entity in ground_truth:
        pos_key = (entity['start'], entity['end'])
        # If multiple entities at same position, keep the first
        if pos_key not in gt_by_position:
            gt_by_position[pos_key] = entity['label']
    
    # Create mapping of (start, end) -> label for predictions
    pred_by_position = {}
    for entity in predictions:
        pos_key = (entity['start'], entity['end'])
        if pos_key not in pred_by_position:
            pred_by_position[pos_key] = entity['label']
    
    # Build confusion matrix
    confusion_dict = defaultdict(lambda: defaultdict(int))
    
    # Check all predicted positions
    for pos_key, pred_label in pred_by_position.items():
        gt_label = gt_by_position.get(pos_key, 'NONE')
        confusion_dict[gt_label][pred_label] += 1
    
    # Check all ground truth positions not in predictions
    for pos_key, gt_label in gt_by_position.items():
        if pos_key not in pred_by_position:
            confusion_dict[gt_label]['NONE'] += 1
    
    # Convert to DataFrame
    all_labels = LABEL_TYPES + ['NONE']
    confusion_data = []
    
    for gt_label in all_labels:
        row = {'Ground Truth': gt_label}
        for pred_label in all_labels:
            row[pred_label] = confusion_dict[gt_label][pred_label]
        confusion_data.append(row)
    
    df = pd.DataFrame(confusion_data)
    df = df.set_index('Ground Truth')
    
    return df

print("✓ Confusion matrix generation function defined")


## 6. Complete Evaluation Pipeline

This function orchestrates the entire evaluation process for a single file; aggregation across files is handled in the following sections.


In [None]:
def evaluate_file(text_file_path: Path, 
                  gt_entities: List[Dict],
                  pred_entities: List[Dict],
                  text: str) -> Dict:
    """
    Evaluate predictions against ground truth for a single file.
    
    Parameters:
    -----------
    text_file_path : Path
        Path to the text file (for reference)
    gt_entities : List[Dict]
        Ground truth entities
    pred_entities : List[Dict]
        Predicted entities
    text : str
        Original text
        
    Returns:
    --------
    Dict
        Dictionary containing all evaluation results
    """
    results = {}
    
    # 1. Entity-level evaluation
    entity_results = entity_level_evaluation(gt_entities, pred_entities)
    results['entity_level'] = entity_results
    
    # 2. BIO format conversion for seqeval
    gt_bio = entities_to_bio_tags(gt_entities, text)
    pred_bio = entities_to_bio_tags(pred_entities, text)
    
    # Both sequences come from the same tokenization, so lengths should match; truncate defensively
    min_len = min(len(gt_bio), len(pred_bio))
    gt_bio = gt_bio[:min_len]
    pred_bio = pred_bio[:min_len]
    
    # 3. seqeval metrics
    try:
        # Calculate seqeval metrics
        # Note: seqeval automatically infers labels from BIO sequences
        seqeval_results = {
            'accuracy': accuracy_score([gt_bio], [pred_bio]),
            'precision': precision_score([gt_bio], [pred_bio]),
            'recall': recall_score([gt_bio], [pred_bio]),
            'f1': f1_score([gt_bio], [pred_bio])
        }
        
        # Classification report (detailed per-entity metrics)
        # seqeval's classification_report doesn't accept 'labels' parameter
        # It automatically extracts labels from the BIO sequences
        try:
            report = classification_report([gt_bio], [pred_bio], output_dict=True, zero_division=0)
            seqeval_results['classification_report'] = report
        except Exception as report_error:
            # If classification_report fails, continue without it
            print(f"  Note: Could not generate classification report: {report_error}")
        
        results['seqeval'] = seqeval_results
        
    except Exception as e:
        print(f"⚠ Error in seqeval evaluation for {text_file_path.name}: {e}")
        results['seqeval'] = None
    
    # 4. Confusion matrix
    confusion_matrix_df = generate_confusion_matrix(gt_entities, pred_entities)
    results['confusion_matrix'] = confusion_matrix_df
    
    return results

print("✓ Complete evaluation function defined")


## 7. Aggregating Results Across All Files

Function to evaluate all files and aggregate metrics.
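
Aggregation here pools TP/FP/FN counts across files before computing metrics (micro-averaging). The sketch below contrasts that with averaging per-file F1 scores (macro-averaging over files), using made-up counts for two files; the helper name `prf` is hypothetical.

```python
from typing import List, Tuple

def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented (tp, fp, fn) counts: one long file scored well, one short file scored badly
per_file: List[Tuple[int, int, int]] = [(9, 1, 0), (0, 1, 1)]

# Micro: sum the counts first, then compute metrics once
total_tp, total_fp, total_fn = (sum(column) for column in zip(*per_file))
micro_f1 = prf(total_tp, total_fp, total_fn)[2]

# Macro over files: compute F1 per file, then average
macro_f1 = sum(prf(*counts)[2] for counts in per_file) / len(per_file)

print(f"micro F1 = {micro_f1:.4f}, macro F1 = {macro_f1:.4f}")
```

Pooling counts weights each entity equally, so files with many entities dominate the result; averaging per-file scores weights each file equally instead.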


In [None]:
def aggregate_results(all_results: List[Dict]) -> Dict:
    """
    Aggregate evaluation results across all files.
    
    Parameters:
    -----------
    all_results : List[Dict]
        List of evaluation results for each file
        
    Returns:
    --------
    Dict
        Aggregated metrics
    """
    # Aggregate entity-level metrics
    aggregated = {
        'tp': Counter(),
        'fp': Counter(),
        'fn': Counter()
    }
    
    # Sum up TP, FP, FN across all files
    for result in all_results:
        entity_level = result.get('entity_level', {})
        for label_type in LABEL_TYPES + ['OVERALL']:
            tp = entity_level.get('tp', {}).get(label_type, 0)
            fp = entity_level.get('fp', {}).get(label_type, 0)
            fn = entity_level.get('fn', {}).get(label_type, 0)
            
            aggregated['tp'][label_type] += tp
            aggregated['fp'][label_type] += fp
            aggregated['fn'][label_type] += fn
    
    # Calculate aggregated precision, recall, F1
    aggregated['precision'] = {}
    aggregated['recall'] = {}
    aggregated['f1'] = {}
    
    for label_type in LABEL_TYPES + ['OVERALL']:
        tp = aggregated['tp'][label_type]
        fp = aggregated['fp'][label_type]
        fn = aggregated['fn'][label_type]
        
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
        
        aggregated['precision'][label_type] = precision
        aggregated['recall'][label_type] = recall
        aggregated['f1'][label_type] = f1
    
    return aggregated

print("✓ Aggregation function defined")


## 8. Integration with Task 2 Pipeline

To evaluate predictions, we need to either:
1. Load saved predictions from Task 2
2. Re-generate predictions using Task 2 pipeline

Here, we'll create a helper function that can generate predictions using Task 2 functions (if available) or load them from files.


In [None]:
# Try to import Task 2 functions for generating predictions
# If Task 2 notebook functions are not directly importable, we'll need to
# either regenerate predictions or load them from saved files

try:
    # Attempt to import from task2 (may not work if not in same session)
    # This is a placeholder - in practice, you'd either:
    # 1. Save predictions from Task 2 to files
    # 2. Run Task 2 in the same notebook session
    # 3. Load Task 2 functions dynamically
    
    # For now, we'll assume predictions are generated elsewhere or saved
    print("Note: For full evaluation, predictions should be generated using Task 2 pipeline")
    print("      or loaded from saved prediction files.")
    
except ImportError:
    print("Task 2 functions not directly importable.")
    print("Please ensure predictions are available for evaluation.")

print("✓ Integration placeholder defined")


## 9. Testing with Sample File

Let's test the evaluation pipeline with a sample file to verify everything works correctly.


In [None]:
# Test with a sample file
sample_text_files = list(TEXT_DIR.glob("*.txt"))[:1]  # Get first file for testing

if sample_text_files:
    test_file = sample_text_files[0]
    print(f"Testing evaluation with file: {test_file.name}")
    print("=" * 80)
    
    # Load text (not stripped: annotation offsets refer to the raw file contents)
    with open(test_file, 'r', encoding='utf-8') as f:
        text = f.read()
    
    # Load ground truth
    gt_file = ORIGINAL_DIR / test_file.name.replace('.txt', '.ann')
    gt_entities = load_ground_truth(gt_file)
    
    print(f"\nGround Truth: {len(gt_entities)} entities found")
    for entity in gt_entities[:5]:  # Show first 5
        print(f"  - {entity['label']}: '{entity['text']}' [{entity['start']}:{entity['end']}]")
    
    # For testing, create dummy predictions (in real scenario, these come from Task 2)
    # Here we'll use a subset of ground truth to simulate predictions
    print("\n⚠ Note: Using dummy predictions for demonstration.")
    print("   In actual evaluation, use predictions from Task 2 pipeline.")
    
    # Create dummy predictions: keep only the first half of the ground truth entities
    pred_entities = gt_entities[:len(gt_entities)//2] if len(gt_entities) > 1 else []
    
    if pred_entities:
        print(f"\nDummy Predictions: {len(pred_entities)} entities")
        for entity in pred_entities[:5]:
            print(f"  - {entity['label']}: '{entity['text']}' [{entity['start']}:{entity['end']}]")
    
    # Evaluate
    print("\n" + "=" * 80)
    print("EVALUATION RESULTS")
    print("=" * 80)
    
    results = evaluate_file(test_file, gt_entities, pred_entities, text)
    
    # Display entity-level results
    print("\nEntity-Level Metrics:")
    print("-" * 80)
    entity_level = results['entity_level']
    
    df = pd.DataFrame({
        'Label': LABEL_TYPES + ['OVERALL'],
        'Precision': [entity_level['precision'][l] for l in LABEL_TYPES + ['OVERALL']],
        'Recall': [entity_level['recall'][l] for l in LABEL_TYPES + ['OVERALL']],
        'F1': [entity_level['f1'][l] for l in LABEL_TYPES + ['OVERALL']],
        'TP': [entity_level['tp'][l] for l in LABEL_TYPES + ['OVERALL']],
        'FP': [entity_level['fp'][l] for l in LABEL_TYPES + ['OVERALL']],
        'FN': [entity_level['fn'][l] for l in LABEL_TYPES + ['OVERALL']]
    })
    
    print(df.to_string(index=False))
    
    # Display seqeval results if available
    if results.get('seqeval'):
        print("\nseqeval Metrics:")
        print("-" * 80)
        seqeval = results['seqeval']
        print(f"  Accuracy:  {seqeval.get('accuracy', 0):.4f}")
        print(f"  Precision: {seqeval.get('precision', 0):.4f}")
        print(f"  Recall:    {seqeval.get('recall', 0):.4f}")
        print(f"  F1-Score:  {seqeval.get('f1', 0):.4f}")
    
    # Display confusion matrix
    print("\nConfusion Matrix:")
    print("-" * 80)
    print(results['confusion_matrix'].to_string())
    
else:
    print("⚠ No text files found for testing")


## 10. Full Dataset Evaluation

This section evaluates all files in the dataset. Note that generating predictions for all files using Task 2 may take considerable time.

**Important**: You'll need to either:
1. Have predictions saved from Task 2 execution
2. Integrate Task 2 pipeline here to generate predictions on-the-fly
3. Use a subset of files for faster evaluation

For demonstration, we'll create a framework that can handle all scenarios.
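
For the first scenario (predictions saved to disk), a prediction callback could look like the sketch below. The factory name `make_saved_prediction_loader`, the idea of a separate predictions directory, and the toy parser are all assumptions for illustration; in this notebook the `parse_file` argument would more plausibly be the existing annotation loader.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

def make_saved_prediction_loader(
    pred_dir: Path,
    parse_file: Callable[[Path], List[Dict]],
) -> Callable[[Path, str], List[Dict]]:
    """Build a (text_file, text) -> entities callback that reads saved .ann files."""
    def get_predictions(text_file: Path, text: str) -> List[Dict]:
        pred_file = pred_dir / text_file.with_suffix('.ann').name
        # A missing prediction file is treated as "no entities predicted"
        return parse_file(pred_file) if pred_file.exists() else []
    return get_predictions

def toy_parser(path: Path) -> List[Dict]:
    """Stand-in parser for the demo: one record per non-empty line."""
    lines = path.read_text(encoding='utf-8').splitlines()
    return [{'raw': line} for line in lines if line]

with tempfile.TemporaryDirectory() as tmp:
    pred_dir = Path(tmp)
    (pred_dir / "doc1.ann").write_text("T1\tADR 9 19\tbit drowsy\n", encoding='utf-8')
    loader = make_saved_prediction_loader(pred_dir, toy_parser)
    found = loader(Path("cadec/text/doc1.txt"), "")
    missing = loader(Path("cadec/text/doc2.txt"), "")

print(len(found), len(missing))
```

Returning an empty list for a missing file means every ground truth entity in that file counts as a false negative; raising an error instead would be the stricter choice.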


In [None]:
def evaluate_all_files(text_files: List[Path], 
                       get_predictions_func,
                       max_files: Optional[int] = None) -> Dict:
    """
    Evaluate all text files in the dataset.
    
    Parameters:
    -----------
    text_files : List[Path]
        List of text file paths to evaluate
    get_predictions_func : callable
        Function that takes (text_file_path, text) and returns predicted entities
        Format: List[Dict] with 'label', 'start', 'end', 'text'
    max_files : int, optional
        Maximum number of files to evaluate (for testing on subset)
        
    Returns:
    --------
    Dict
        Aggregated evaluation results
    """
    all_results = []
    files_evaluated = 0
    
    if max_files is not None:
        text_files = text_files[:max_files]
    
    total_files = len(text_files)
    print(f"Evaluating {total_files} files...")
    
    for idx, text_file in enumerate(text_files, 1):
        try:
            # Load text
            with open(text_file, 'r', encoding='utf-8') as f:
                text = f.read().strip()
            
            # Load ground truth
            # Swap only the file extension; str.replace would also alter '.txt' elsewhere in the name
            gt_file = ORIGINAL_DIR / text_file.with_suffix('.ann').name
            if not gt_file.exists():
                print(f"⚠ Skipping {text_file.name}: ground truth not found")
                continue
            
            gt_entities = load_ground_truth(gt_file)
            
            # Get predictions
            pred_entities = get_predictions_func(text_file, text)
            
            # Evaluate
            file_results = evaluate_file(text_file, gt_entities, pred_entities, text)
            all_results.append(file_results)
            
            files_evaluated += 1
            
            # Progress update
            if idx % 50 == 0:
                print(f"  Processed {idx}/{total_files} files...")
                
        except Exception as e:
            print(f"⚠ Error evaluating {text_file.name}: {e}")
            continue
    
    print(f"\n✓ Evaluation complete: {files_evaluated} files evaluated")
    
    # Aggregate results
    aggregated = aggregate_results(all_results)
    
    return {
        'per_file_results': all_results,
        'aggregated': aggregated,
        'files_evaluated': files_evaluated
    }

print("✓ Full dataset evaluation function defined")


In [None]:
def display_evaluation_results(aggregated_results: Dict):
    """
    Display comprehensive evaluation results in a formatted way.
    
    Parameters:
    -----------
    aggregated_results : Dict
        Results from aggregate_results() or evaluate_all_files()
    """
    aggregated = aggregated_results.get('aggregated', aggregated_results)
    
    print("=" * 80)
    print("COMPREHENSIVE EVALUATION RESULTS")
    print("=" * 80)
    
    # Per-entity-type metrics
    print("\nPer-Entity-Type Metrics:")
    print("-" * 80)
    
    metrics_data = []
    for label_type in LABEL_TYPES:
        metrics_data.append({
            'Entity Type': label_type,
            'Precision': f"{aggregated['precision'][label_type]:.4f}",
            'Recall': f"{aggregated['recall'][label_type]:.4f}",
            'F1-Score': f"{aggregated['f1'][label_type]:.4f}",
            'TP': aggregated['tp'][label_type],
            'FP': aggregated['fp'][label_type],
            'FN': aggregated['fn'][label_type]
        })
    
    # Add overall metrics
    metrics_data.append({
        'Entity Type': 'OVERALL (Micro-Avg)',
        'Precision': f"{aggregated['precision']['OVERALL']:.4f}",
        'Recall': f"{aggregated['recall']['OVERALL']:.4f}",
        'F1-Score': f"{aggregated['f1']['OVERALL']:.4f}",
        'TP': aggregated['tp']['OVERALL'],
        'FP': aggregated['fp']['OVERALL'],
        'FN': aggregated['fn']['OVERALL']
    })
    
    df_metrics = pd.DataFrame(metrics_data)
    print(df_metrics.to_string(index=False))
    
    # Summary statistics
    print("\n" + "=" * 80)
    print("Summary Statistics:")
    print("-" * 80)
    print(f"Total True Positives:  {aggregated['tp']['OVERALL']}")
    print(f"Total False Positives: {aggregated['fp']['OVERALL']}")
    print(f"Total False Negatives: {aggregated['fn']['OVERALL']}")
    print(f"\nOverall Precision: {aggregated['precision']['OVERALL']:.4f}")
    print(f"Overall Recall:    {aggregated['recall']['OVERALL']:.4f}")
    print(f"Overall F1-Score:  {aggregated['f1']['OVERALL']:.4f}")
    
    print("\n" + "=" * 80)

print("✓ Results display function defined")


## 12. Example: Evaluation with Task 2 Integration

This cell demonstrates how to integrate with Task 2 to generate predictions and evaluate them.

**Note**: This requires Task 2 functions to be available. You may need to:
1. Copy relevant functions from Task 2 notebook
2. Import them if they're in a module
3. Or re-run Task 2 in the same session


In [None]:
# Example integration with Task 2
# This is a template - adjust based on how Task 2 predictions are generated/saved

def example_get_predictions(text_file_path: Path, text: str) -> List[Dict]:
    """
    Example function to get predictions for a text file.
    
    In practice, this would:
    1. Call Task 2's process_text_file() function
    2. Convert annotation lines to entity dictionaries
    3. Return list of entities
    
    For now, this is a placeholder that returns an empty list.
    Replace this with actual Task 2 integration.
    """
    # Placeholder: return empty predictions
    # In real implementation, integrate with Task 2:
    #
    # from task2_functions import process_text_file  # or however Task 2 is structured
    # bio_tagged, annotation_lines = process_text_file(text_file_path, ner_pipeline, tokenizer)
    # pred_entities = load_predictions(annotation_lines)
    # return pred_entities
    
    return []

# Example usage (commented out - uncomment and modify when Task 2 is integrated):
"""
# Get sample files for testing
text_files = list(TEXT_DIR.glob("*.txt"))[:10]  # Evaluate first 10 files

# Evaluate with Task 2 predictions
results = evaluate_all_files(
    text_files=text_files,
    get_predictions_func=example_get_predictions,
    max_files=10
)

# Display results
display_evaluation_results(results)
"""

print("✓ Example integration template defined")
print("  → Uncomment and modify the example code above when Task 2 is integrated")


## 13. Edge Cases in Entity Matching

This section documents important edge cases handled in the evaluation:

### Edge Cases:

1. **Overlapping Entities**
   - If the ground truth has an entity at [10, 20] and the prediction has [10, 25]:
     - The prediction is a False Positive because the boundaries don't match exactly, and the unmatched ground truth entity is a False Negative
   - seqeval likewise requires exact boundary matches

2. **Partial Matches**
   - An entity with the correct label but wrong boundaries is a False Positive
   - Example: GT="ibuprofen" [10, 19], Pred="ibuprofen" [10, 20] → FP (the missed ground truth entity is an FN)

3. **Label Confusion**
   - Correct boundaries but the wrong label is a False Positive
   - Example: GT=ADR [10, 19], Pred=Drug [10, 19] → FP (the missed ADR entity is an FN)

4. **Multiple Ranges**
   - Ground truth entities with multiple character ranges (semicolon-separated)
   - Each range is treated as a separate entity for evaluation

5. **Empty Predictions/Ground Truth**
   - If no predictions: all ground truth entities are False Negatives
   - If no ground truth: all predictions are False Positives
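
The counting rules above can be illustrated with plain set operations. This is a minimal sketch under the assumption that entities are compared as `(label, start, end)` triples; the helper names `to_span_set` and `count_exact_matches` are hypothetical and are not the notebook's `entity_level_evaluation`.

```python
from typing import Dict, List, Set, Tuple


def to_span_set(entities: List[Dict]) -> Set[Tuple[str, int, int]]:
    """Reduce entity dicts to comparable (label, start, end) triples."""
    # Note: a set drops exact duplicates, so repeated identical spans count once
    return {(e["label"], e["start"], e["end"]) for e in entities}


def count_exact_matches(gt: List[Dict], pred: List[Dict]) -> Dict[str, int]:
    """Count TP/FP/FN under exact boundary + label matching."""
    gt_set, pred_set = to_span_set(gt), to_span_set(pred)
    return {
        "tp": len(gt_set & pred_set),   # identical label and boundaries
        "fp": len(pred_set - gt_set),   # predicted, but no exact counterpart
        "fn": len(gt_set - pred_set),   # annotated, but not predicted exactly
    }


gt = [
    {"label": "ADR", "start": 10, "end": 16},
    {"label": "Drug", "start": 30, "end": 39},
]
pred = [
    {"label": "ADR", "start": 10, "end": 17},   # boundary off by one
    {"label": "Drug", "start": 30, "end": 39},  # exact match
]

counts = count_exact_matches(gt, pred)
print(counts)  # {'tp': 1, 'fp': 1, 'fn': 1}
```

A single boundary error therefore costs both precision (one FP) and recall (one FN), which is why strict matching penalizes near misses more heavily than token-level scoring would.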

### Why This Strict Matching?

- **Medical Safety**: Exact boundaries ensure correct entity extraction
- **Downstream Tasks**: Knowledge graphs, relation extraction need precise entities
- **Reproducibility**: Standard CoNLL evaluation methodology
- **Clinical Accuracy**: Partial matches can misrepresent medical conditions


In [None]:
# Test edge cases
print("Edge Case Testing:")
print("=" * 80)

# Example ground truth
gt_example = [
    {'label': 'ADR', 'text': 'drowsy', 'start': 10, 'end': 16},
    {'label': 'Drug', 'text': 'ibuprofen', 'start': 30, 'end': 39},
]

# Test Case 1: Exact match
pred_exact = [
    {'label': 'ADR', 'text': 'drowsy', 'start': 10, 'end': 16},
    {'label': 'Drug', 'text': 'ibuprofen', 'start': 30, 'end': 39},
]
result1 = entity_level_evaluation(gt_example, pred_exact)
print("\n1. Exact Match:")
print(f"   TP: {result1['tp']['OVERALL']}, FP: {result1['fp']['OVERALL']}, FN: {result1['fn']['OVERALL']}")
print(f"   Precision: {result1['precision']['OVERALL']:.4f}, Recall: {result1['recall']['OVERALL']:.4f}")

# Test Case 2: Boundary mismatch (partial match)
pred_boundary = [
    {'label': 'ADR', 'text': 'drowsy', 'start': 10, 'end': 17},  # Wrong boundary
    {'label': 'Drug', 'text': 'ibuprofen', 'start': 30, 'end': 39},
]
result2 = entity_level_evaluation(gt_example, pred_boundary)
print("\n2. Boundary Mismatch (Partial Match):")
print(f"   TP: {result2['tp']['OVERALL']}, FP: {result2['fp']['OVERALL']}, FN: {result2['fn']['OVERALL']}")
print("   Note: the mismatched ADR span counts as FP, not TP; the exact Drug match is still a TP")

# Test Case 3: Label confusion
pred_label = [
    {'label': 'ADR', 'text': 'drowsy', 'start': 10, 'end': 16},
    {'label': 'ADR', 'text': 'ibuprofen', 'start': 30, 'end': 39},  # Wrong label
]
result3 = entity_level_evaluation(gt_example, pred_label)
print("\n3. Label Confusion:")
print(f"   TP: {result3['tp']['OVERALL']}, FP: {result3['fp']['OVERALL']}, FN: {result3['fn']['OVERALL']}")
print(f"   Note: Correct boundary but wrong label → FP")

# Test Case 4: Missing prediction
pred_missing = [
    {'label': 'ADR', 'text': 'drowsy', 'start': 10, 'end': 16},
    # Missing Drug entity
]
result4 = entity_level_evaluation(gt_example, pred_missing)
print("\n4. Missing Prediction (False Negative):")
print(f"   TP: {result4['tp']['OVERALL']}, FP: {result4['fp']['OVERALL']}, FN: {result4['fn']['OVERALL']}")
print(f"   Note: Missing entity → FN")

print("\n" + "=" * 80)
print("✓ Edge case handling verified")


## 14. Summary

This notebook provides a comprehensive evaluation framework for Medical NER with:

### ✅ Features Implemented:

1. **Ground Truth Loading**: Parses annotation files from 'original' subdirectory
2. **Prediction Loading**: Handles predictions in same format (from Task 2)
3. **Entity-Level Evaluation**: Exact boundary + label matching
4. **seqeval Integration**: Standard NER evaluation metrics
5. **Per-Entity-Type Metrics**: ADR, Drug, Disease, Symptom
6. **Overall Micro-Averaged Metrics**: Aggregate performance
7. **Confusion Matrix**: Visualize misclassifications
8. **Edge Case Handling**: Overlapping entities, partial matches, label confusion

### 📊 Evaluation Approach:

- **Entity-Level**: Requires exact boundary AND label match for True Positive
- **Strict Matching**: Partial matches → False Positive (encourages precise boundaries)
- **Medical Focus**: Optimized for clinical downstream tasks
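
To make the micro-averaged figures concrete, the sketch below pools TP/FP/FN counts across entity types before computing precision, recall, and F1. It is an illustration only: the function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
from typing import Dict, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from raw counts (0.0 on empty denominators)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def micro_average(per_label: Dict[str, Dict[str, int]]) -> Tuple[float, float, float]:
    """Micro-average: sum the counts over all labels, then compute the metrics once."""
    tp = sum(c["tp"] for c in per_label.values())
    fp = sum(c["fp"] for c in per_label.values())
    fn = sum(c["fn"] for c in per_label.values())
    return precision_recall_f1(tp, fp, fn)


# Made-up counts for illustration; not results from the dataset
example_counts = {
    "ADR": {"tp": 8, "fp": 2, "fn": 2},
    "Drug": {"tp": 1, "fp": 1, "fn": 3},
}

# Pooled counts: tp=9, fp=3, fn=5
p, r, f = micro_average(example_counts)
print(f"Precision: {p:.4f}, Recall: {r:.4f}, F1: {f:.4f}")
```

Because counts are pooled, frequent entity types dominate the micro-average; a rare type with poor performance will show up in the per-type rows rather than in the overall figure.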

### 🔧 Usage:

1. Integrate with Task 2 to generate predictions
2. Run evaluation on single file or full dataset
3. Analyze per-entity-type and overall metrics
4. Review confusion matrix for common errors

### 📝 Notes:

- Predictions must be generated using Task 2 pipeline or loaded from files
- Evaluation uses exact boundary matching (standard CoNLL methodology)
- seqeval computes precision, recall, and F1 over entity chunks in BIO tag sequences, and its accuracy score is token-level, complementing the character-span entity-level metrics
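
seqeval expects tag sequences (a list of sentences, each a list of tag strings) rather than character spans, so spans have to be projected onto tokens first. The sketch below shows one way to do that. It is an assumption-laden illustration, not the notebook's conversion: it uses whitespace tokenization, and the function name `spans_to_bio` is hypothetical.

```python
import re
from typing import Dict, List, Tuple


def spans_to_bio(text: str, entities: List[Dict]) -> Tuple[List[str], List[str]]:
    """Project character-span entities onto whitespace tokens as BIO tags."""
    # Each token keeps its character offsets so it can be compared with entity spans
    token_spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(token_spans)

    for ent in sorted(entities, key=lambda e: e["start"]):
        first = True
        for i, (_, t_start, t_end) in enumerate(token_spans):
            overlaps = t_start < ent["end"] and t_end > ent["start"]
            # Tokens already claimed by an earlier entity are left untouched
            if overlaps and tags[i] == "O":
                tags[i] = ("B-" if first else "I-") + ent["label"]
                first = False

    return [tok for tok, _, _ in token_spans], tags


text = "bad stomach pain after ibuprofen"
entities = [
    {"label": "ADR", "start": 4, "end": 16},    # "stomach pain"
    {"label": "Drug", "start": 23, "end": 32},  # "ibuprofen"
]

tokens, tags = spans_to_bio(text, entities)
print(list(zip(tokens, tags)))
# [('bad', 'O'), ('stomach', 'B-ADR'), ('pain', 'I-ADR'), ('after', 'O'), ('ibuprofen', 'B-Drug')]
```

Applying the same conversion to ground truth and predictions yields two tag lists that can be wrapped as `[gt_tags]` and `[pred_tags]` and passed to seqeval's `f1_score` or `classification_report`. Token-level projection can hide small character-level boundary differences that fall within one token, so the two sets of metrics need not agree exactly.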
