# Task 4: ADR-Specific Evaluation with MedDRA Annotations

**Title:** "ADR-Focused Performance Analysis with MedDRA Ground Truth"

## Objective
Extend the Task 3 evaluation to focus specifically on ADR entities using MedDRA annotations:
- Load ground truth from the `meddra` subdirectory (contains only ADR entities with MedDRA codes)
- Parse MedDRA format (TT prefix + MedDRA code + character ranges + entity text)
- Filter predicted entities to only ADR labels
- Match predicted ADR entities against MedDRA ground truth with exact span matching
- Calculate ADR-specific Precision, Recall, F1
- Compare with ADR performance from Task 3 (original annotations)
- Analyze differences between original and MedDRA ground truth

## Overview
This notebook extends the Task 3 evaluation framework to focus specifically on Adverse Drug Reaction (ADR) entities, using MedDRA-standardized annotations as ground truth.


## 1. Understanding MedDRA Annotation Format

### MedDRA Format Explanation

The MedDRA (Medical Dictionary for Regulatory Activities) annotation format contains **only ADR entities** with standardized medical codes:

**Format Structure:**
```
TT<original_tag>\t<MedDRA_code> <start> <end>\t<entity_text>
```

**Example:**
```
TT1\t10028836 9 18\tneck pain
TT2\t10001949 20 31\tmemory loss
```

**Components:**
1. **Identifier**: `TT1`, `TT2`, etc. (TT prefix + original tag from 'original' directory)
2. **MedDRA Code**: Numeric code (e.g., `10028836`) - standardized medical term identifier
3. **Character Ranges**: Start and end positions (e.g., `9 18`)
4. **Entity Text**: The actual ADR mention in the text (e.g., `neck pain`)
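
The components above can be pulled apart with two splits. The following is a minimal sketch, assuming the simplest case of a single MedDRA code and a single character range; the `record` dictionary and its key names are illustrative and not part of the dataset.

```python
# Minimal sketch: split one simple MedDRA annotation line into its parts.
# Assumes exactly one code and one "start end" range on the line.
line = "TT1\t10028836 9 18\tneck pain"

# Tab-separated fields: identifier, metadata (code + range), entity text
identifier, metadata, entity_text = line.split("\t")

# Space-separated metadata: MedDRA code, start offset, end offset
meddra_code, start, end = metadata.split()
start, end = int(start), int(end)

record = {
    "identifier": identifier,
    "meddra_code": meddra_code,
    "start": start,
    "end": end,
    "text": entity_text,
}
print(record)
```

In this example the range length (`end - start`) equals the length of the entity text, which suggests the end offset is exclusive.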

### Why ADR Detection is Critical in Pharmacovigilance

**Adverse Drug Reaction (ADR) detection is particularly important in pharmacovigilance** for several reasons:

1. **Patient Safety**: ADRs can range from mild discomfort to life-threatening conditions. Accurate detection enables timely medical intervention.

2. **Regulatory Compliance**: Pharmaceutical companies must report ADRs to regulatory bodies (FDA, EMA). Standardized MedDRA coding ensures consistent reporting.

3. **Signal Detection**: Automated ADR detection from patient forums, social media, and clinical notes helps identify potential safety signals early.

4. **Drug Monitoring**: Post-marketing surveillance relies on accurate ADR extraction to monitor drug safety in real-world populations.

5. **Knowledge Discovery**: ADR patterns can reveal drug-drug interactions, contraindications, and population-specific risks.

6. **Clinical Decision Support**: Healthcare systems use ADR information to alert clinicians about potential adverse events.

7. **Standardization**: MedDRA provides a hierarchical taxonomy (SOC → HLGT → HLT → PT → LLT) enabling structured analysis across different data sources.


## 2. Import Required Libraries and Setup


In [None]:
import sys
from pathlib import Path
import re
from typing import List, Tuple, Dict, Set, Optional
from collections import defaultdict, Counter
import numpy as np
import pandas as pd

# Install seqeval if not already installed
try:
    from seqeval.metrics import (
        classification_report,
        accuracy_score,
        precision_score,
        recall_score,
        f1_score
    )
except ImportError:
    print("⚠ seqeval not found. Installing...")
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "seqeval"])
    from seqeval.metrics import (
        classification_report,
        accuracy_score,
        precision_score,
        recall_score,
        f1_score
    )
    print("✓ seqeval installed successfully")

import warnings
warnings.filterwarnings('ignore')

# Configuration
BASE_DIR = Path("cadec")
TEXT_DIR = BASE_DIR / "text"
ORIGINAL_DIR = BASE_DIR / "original"
MEDDRA_DIR = BASE_DIR / "meddra"

# Verify directories exist
if not TEXT_DIR.exists():
    raise FileNotFoundError(f"Directory not found: {TEXT_DIR}")
if not ORIGINAL_DIR.exists():
    raise FileNotFoundError(f"Directory not found: {ORIGINAL_DIR}")
if not MEDDRA_DIR.exists():
    raise FileNotFoundError(f"Directory not found: {MEDDRA_DIR}")

print("✓ Directories verified")
print(f"  - Text directory: {TEXT_DIR}")
print(f"  - Original directory: {ORIGINAL_DIR}")
print(f"  - MedDRA directory: {MEDDRA_DIR}")


## 3. Parse MedDRA Annotation Format

MedDRA annotations use a specific format: a TT-prefixed identifier, one or more MedDRA codes, one or more character ranges, and the entity text.


In [None]:
def load_meddra_ground_truth(ann_file_path: Path) -> List[Dict]:
    """
    Load and parse MedDRA ground truth annotation file.
    
    MedDRA Format: TT<tag>\t<MedDRA_code1> [ + <MedDRA_code2> ...] <start1> <end1>[;<start2> <end2>...]\t<text>
    Examples:
        TT1\t10028836 9 18\tneck pain
        TT3\t10033371 + 10023477 13 37;52 57\tSevere joint pain in the knees
        TT4\t10033430 59 63;77 82;83 88\tPain in my hands
    
    Note: Multiple MedDRA codes are joined with '+', multiple ranges are separated by ';'
    For entities with multiple ranges, we create separate entity entries for each range.
    
    Parameters:
    -----------
    ann_file_path : Path
        Path to the .ann annotation file in meddra directory
        
    Returns:
    --------
    List[Dict]
        List of ADR entity dictionaries with:
        - 'label': Always 'ADR' (MedDRA contains only ADR entities)
        - 'text': Entity text
        - 'start': Start character position
        - 'end': End character position
        - 'tag': Original tag identifier (TT1, TT2, etc.)
        - 'meddra_code': First MedDRA standardized code (primary code)
    """
    entities = []
    
    try:
        with open(ann_file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                
                # Skip empty lines and comment lines (starting with '#')
                if not line or line.startswith('#'):
                    continue
                
                # Parse MedDRA format: TT<tag>\t<MedDRA_codes> <ranges>\t<text>
                # Split by tab first
                parts = line.split('\t')
                if len(parts) < 2:
                    continue
                
                identifier = parts[0]  # TT1, TT2, etc.
                
                # Get the metadata part (MedDRA codes + ranges) and text
                if len(parts) >= 3:
                    metadata_part = parts[1]
                    text = parts[2]
                else:
                    # No separate text field on this line: keep the metadata and
                    # leave the text empty (normally the text is in parts[2])
                    metadata_part = parts[1]
                    text = ""
                
                # Parse metadata part: <MedDRA_code1> [ + <MedDRA_code2> ...] <ranges>
                # Example: "10033371 + 10023477 13 37;52 57"
                # Strategy: Use regex to find all MedDRA codes, then everything after is ranges
                
                # Find all MedDRA codes (8+ digit numbers, or CONCEPT_LESS)
                # MedDRA codes are typically 8 digits starting with 100
                meddra_code_pattern = r'\b(100\d{5}|\d{8,}|CONCEPT_LESS)\b'
                code_matches = list(re.finditer(meddra_code_pattern, metadata_part))
                
                if not code_matches:
                    # No codes found, skip this line
                    continue
                
                # Get the primary MedDRA code (first one found)
                primary_meddra_code = code_matches[0].group(1)
                
                # Find where ranges start (after the last code)
                last_code_end = code_matches[-1].end()
                ranges_str = metadata_part[last_code_end:].strip()
                
                # Remove any '+' or extra whitespace
                ranges_str = re.sub(r'\s*\+\s*', '', ranges_str)
                ranges_str = ranges_str.strip()
                
                # Parse ranges: one or more "START END" pairs separated by semicolons,
                # e.g. "13 37;52 57" or "9 18". Splitting on ';' covers both cases.
                ranges = []
                for rp in ranges_str.split(';'):
                    range_nums = rp.split()
                    if len(range_nums) < 2:
                        continue
                    try:
                        ranges.append((int(range_nums[0]), int(range_nums[1])))
                    except ValueError:
                        continue
                
                # If no ranges found, skip this entity
                if not ranges:
                    continue
                
                # Create one entity entry per range. Splitting a discontinuous
                # mention into separate contiguous spans is a simplification for
                # span-level matching; each entry keeps the full entity text.
                for start, end in ranges:
                    entities.append({
                        'label': 'ADR',  # MedDRA only contains ADR entities
                        'text': text.strip(),
                        'start': start,
                        'end': end,
                        'tag': identifier,
                        'meddra_code': primary_meddra_code
                    })
                    
    except Exception as e:
        print(f"Error loading MedDRA ground truth from {ann_file_path}: {e}")
        return []
    
    return entities

print("✓ MedDRA ground truth loading function defined")
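
The multi-code and multi-range cases from the docstring can also be separated without a regular expression. The sketch below is an alternative to the regex strategy used above, not a replacement for it: `split_codes_and_ranges` is a hypothetical helper that assumes the last two tokens before the first `;` are always the first range, and it returns every code rather than only the primary one.

```python
from typing import List, Tuple


def split_codes_and_ranges(metadata: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Hypothetical helper: separate MedDRA codes from character ranges."""
    segments = metadata.split(";")
    head = segments[0].split()

    # The last two tokens of the first segment are the first range;
    # everything before them is the code list, with '+' as a separator.
    codes = [tok for tok in head[:-2] if tok != "+"]
    ranges = [(int(head[-2]), int(head[-1]))]

    # Any remaining segments are plain "START END" pairs.
    for seg in segments[1:]:
        start, end = seg.split()
        ranges.append((int(start), int(end)))
    return codes, ranges


codes, ranges = split_codes_and_ranges("10033371 + 10023477 13 37;52 57")
print(codes, ranges)

single_codes, single_ranges = split_codes_and_ranges("10033430 59 63;77 82;83 88")
print(single_codes, single_ranges)
```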


## 4. Load Original Annotations (for Comparison with Task 3)

We'll also load original annotations to compare ADR performance between original and MedDRA ground truth.


In [None]:
def load_original_ground_truth(ann_file_path: Path) -> List[Dict]:
    """
    Load and parse ground truth annotation file from 'original' subdirectory.
    This is the same function from Task 3, used for comparison.
    
    Format: TAG\tLABEL START END\tTEXT
    Example: T1\tADR 9 19\tbit drowsy
    
    Parameters:
    -----------
    ann_file_path : Path
        Path to the .ann annotation file
        
    Returns:
    --------
    List[Dict]
        List of ADR entity dictionaries with:
        - 'label': Entity type (ADR)
        - 'text': Entity text
        - 'start': Start character position
        - 'end': End character position
        - 'tag': Original tag identifier (T1, T2, etc.)
    """
    entities = []
    
    try:
        with open(ann_file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                
                # Skip empty lines and comment lines (starting with '#')
                if not line or line.startswith('#'):
                    continue
                
                # Parse entity annotation lines (starting with 'T' followed by a number)
                # Format: TAG\tLABEL RANGES\tTEXT
                match = re.match(r'^(T\d+)\t([^\t]+)\t(.+)$', line)
                if match:
                    tag = match.group(1)
                    label_and_ranges = match.group(2)
                    text = match.group(3)
                    
                    # Extract label type (first word) and ranges (remaining part)
                    parts = label_and_ranges.split(None, 1)
                    if len(parts) < 2:
                        continue
                    
                    label_type = parts[0]
                    ranges_str = parts[1]
                    
                    # Only process ADR labels for this task
                    if label_type != 'ADR':
                        continue
                    
                    # Extract ranges (can be multiple pairs separated by semicolons)
                    ranges = []
                    if ';' in ranges_str:
                        # Multiple ranges format: "START1 END1;START2 END2;..."
                        range_pairs = ranges_str.split(';')
                        for rp in range_pairs:
                            rp = rp.strip()
                            if rp:
                                range_nums = rp.split()
                                if len(range_nums) >= 2:
                                    try:
                                        start = int(range_nums[0])
                                        end = int(range_nums[1])
                                        ranges.append((start, end))
                                    except ValueError:
                                        continue
                    else:
                        # Single range format: "START END"
                        range_nums = ranges_str.split()
                        if len(range_nums) >= 2:
                            try:
                                start = int(range_nums[0])
                                end = int(range_nums[1])
                                ranges = [(start, end)]
                            except ValueError:
                                continue
                    
                    # Create entity entries for each range
                    for start, end in ranges:
                        entities.append({
                            'label': label_type,
                            'text': text.strip(),
                            'start': start,
                            'end': end,
                            'tag': tag
                        })
    
    except Exception as e:
        print(f"Error loading original ground truth from {ann_file_path}: {e}")
        return []
    
    return entities

print("✓ Original ground truth loading function defined")
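
One way to compare the two ground truths is to reduce each to a set of `(start, end)` spans and inspect the set differences. This is a minimal sketch with made-up toy entities; `span_set` and `compare_ground_truths` are hypothetical helper names, and the inputs are assumed to be lists of dictionaries with `start` and `end` keys, as returned by the loaders above.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]


def span_set(entities: List[Dict]) -> Set[Span]:
    """Reduce entity dictionaries to a set of (start, end) spans."""
    return {(e["start"], e["end"]) for e in entities}


def compare_ground_truths(original: List[Dict], meddra: List[Dict]) -> Dict[str, Set[Span]]:
    """Report spans shared by both annotation sets and spans unique to each."""
    orig_spans, meddra_spans = span_set(original), span_set(meddra)
    return {
        "shared": orig_spans & meddra_spans,
        "only_original": orig_spans - meddra_spans,
        "only_meddra": meddra_spans - orig_spans,
    }


# Toy data for illustration only (not taken from the dataset)
original = [{"start": 9, "end": 18}, {"start": 20, "end": 31}]
meddra = [{"start": 9, "end": 18}, {"start": 40, "end": 52}]

diff = compare_ground_truths(original, meddra)
print(diff)
```

Spans that appear in only one set are the ones that can make the same predictions score differently against the two ground truths.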


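## 5. Load Predictions and Filter ADR Entities

Predicted annotations use the same format as the original annotations. After loading, only entities labeled `ADR` are kept for evaluation.


In [None]: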
def load_predictions(ann_file_path: Path) -> List[Dict]:
    """
    Load and parse predicted annotation file (same format as original annotations).
    
    Format: TAG\tLABEL START END\tTEXT
    This should match the output format from Task 2.
    
    Parameters:
    -----------
    ann_file_path : Path
        Path to the predicted .ann file
        
    Returns:
    --------
    List[Dict]
        List of entity dictionaries (all labels)
    """
    entities = []
    
    try:
        with open(ann_file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue
                
                # Parse: TAG\tLABEL START END\tTEXT
                match = re.match(r'^(T\d+)\t([^\t]+)\t(.+)$', line)
                if match:
                    tag = match.group(1)
                    label_and_ranges = match.group(2)
                    text = match.group(3)
                    
                    parts = label_and_ranges.split(None, 1)
                    if len(parts) < 2:
                        continue
                    
                    label_type = parts[0]
                    ranges_str = parts[1]
                    
                    # Parse ranges
                    ranges = []
                    if ';' in ranges_str:
                        range_pairs = ranges_str.split(';')
                        for rp in range_pairs:
                            rp = rp.strip()
                            if rp:
                                range_nums = rp.split()
                                if len(range_nums) >= 2:
                                    try:
                                        start = int(range_nums[0])
                                        end = int(range_nums[1])
                                        ranges.append((start, end))
                                    except ValueError:
                                        continue
                    else:
                        range_nums = ranges_str.split()
                        if len(range_nums) >= 2:
                            try:
                                start = int(range_nums[0])
                                end = int(range_nums[1])
                                ranges = [(start, end)]
                            except ValueError:
                                continue
                    
                    for start, end in ranges:
                        entities.append({
                            'label': label_type,
                            'text': text.strip(),
                            'start': start,
                            'end': end,
                            'tag': tag
                        })
    
    except Exception as e:
        print(f"Error loading predictions from {ann_file_path}: {e}")
        return []
    
    return entities

def filter_adr_entities(entities: List[Dict]) -> List[Dict]:
    """
    Filter entities to only include ADR labels.
    
    Parameters:
    -----------
    entities : List[Dict]
        List of all predicted entities
        
    Returns:
    --------
    List[Dict]
        List of ADR entities only
    """
    return [entity for entity in entities if entity['label'] == 'ADR']

print("✓ Prediction loading and filtering functions defined")


## 6. ADR-Specific Evaluation with Exact Span Matching

Evaluate ADR entities using exact span matching (same approach as Task 3).


In [None]:
def evaluate_adr_entities(ground_truth: List[Dict], predictions: List[Dict]) -> Dict:
    """
    Perform entity-level evaluation for ADR entities (exact boundary + label matching).
    
    Entity-level evaluation requires:
    - Exact match of entity boundaries (start AND end positions)
    - Exact match of label type (ADR)
    
    Parameters:
    -----------
    ground_truth : List[Dict]
        List of ground truth ADR entities
    predictions : List[Dict]
        List of predicted ADR entities
        
    Returns:
    --------
    Dict
        Dictionary containing:
        - 'tp', 'fp', 'fn' counts
        - 'precision', 'recall', 'f1' scores
    """
    # Convert entities to sets of tuples for exact matching
    # Format: (label, start, end) - exact boundary matching required
    gt_set = set()
    for entity in ground_truth:
        gt_set.add((entity['label'], entity['start'], entity['end']))
    
    pred_set = set()
    for entity in predictions:
        pred_set.add((entity['label'], entity['start'], entity['end']))
    
    # Calculate True Positives: entities that appear in both sets
    tp = len(gt_set.intersection(pred_set))
    
    # Calculate False Positives: predicted entities not in ground truth
    fp = len(pred_set - gt_set)
    
    # Calculate False Negatives: ground truth entities not predicted
    fn = len(gt_set - pred_set)
    
    # Calculate Precision, Recall, F1
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
    
    return {
        'tp': tp,
        'fp': fp,
        'fn': fn,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

print("✓ ADR evaluation function defined")
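
A small worked example may help show how strict exact span matching is. The sketch below recomputes the same set-based counts on made-up spans: one prediction matches exactly, one is off by a single character at the end boundary, and one is spurious.

```python
# Toy spans for illustration only: (label, start, end)
gold = {("ADR", 9, 18), ("ADR", 20, 31)}
pred = {("ADR", 9, 18), ("ADR", 20, 30), ("ADR", 40, 45)}

tp = len(gold & pred)   # exact matches
fp = len(pred - gold)   # predicted spans absent from the gold set
fn = len(gold - pred)   # gold spans that were not predicted exactly

precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

print(f"TP={tp} FP={fp} FN={fn}")
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

The near miss `(20, 30)` counts as a false positive and also leaves `(20, 31)` as a false negative, so a single boundary error is penalized twice under exact matching.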


## 7. Complete Evaluation Pipeline

Evaluate all files and aggregate results for both MedDRA and original ground truth.


In [None]:
def evaluate_all_files(text_files: List[Path], 
                       get_predictions_func,
                       max_files: Optional[int] = None) -> Dict:
    """
    Evaluate all text files in the dataset against both MedDRA and original ground truth.
    
    Parameters:
    -----------
    text_files : List[Path]
        List of text file paths to evaluate
    get_predictions_func : callable
        Function that takes (text_file_path, text) and returns predicted entities
        Format: List[Dict] with 'label', 'start', 'end', 'text'
    max_files : int, optional
        Maximum number of files to evaluate (for testing on subset)
        
    Returns:
    --------
    Dict
        Dictionary containing:
        - 'meddra_results': Aggregated metrics against MedDRA ground truth
        - 'original_results': Aggregated metrics against original ground truth
        - 'per_file_results': Per-file evaluation results
        - 'files_evaluated': Number of files processed
    """
    all_meddra_results = []
    all_original_results = []
    per_file_results = []
    files_evaluated = 0
    
    if max_files is not None:
        text_files = text_files[:max_files]
    
    total_files = len(text_files)
    print(f"Evaluating {total_files} files...")
    
    for idx, text_file in enumerate(text_files, 1):
        try:
            # Load text
            with open(text_file, 'r', encoding='utf-8') as f:
                text = f.read().strip()
            
            # Load MedDRA ground truth
            meddra_file = MEDDRA_DIR / text_file.with_suffix('.ann').name
            meddra_gt = []
            if meddra_file.exists():
                meddra_gt = load_meddra_ground_truth(meddra_file)
            
            # Load original ground truth (ADR only)
            original_file = ORIGINAL_DIR / text_file.with_suffix('.ann').name
            original_gt = []
            if original_file.exists():
                original_gt = load_original_ground_truth(original_file)
            
            # Get predictions and filter to ADR only
            pred_entities = get_predictions_func(text_file, text)
            pred_adr = filter_adr_entities(pred_entities)
            
            # Evaluate against MedDRA ground truth
            meddra_result = evaluate_adr_entities(meddra_gt, pred_adr)
            all_meddra_results.append(meddra_result)
            
            # Evaluate against original ground truth
            original_result = evaluate_adr_entities(original_gt, pred_adr)
            all_original_results.append(original_result)
            
            # Store per-file results
            per_file_results.append({
                'file': text_file.name,
                'meddra': meddra_result,
                'original': original_result,
                'meddra_gt_count': len(meddra_gt),
                'original_gt_count': len(original_gt),
                'pred_count': len(pred_adr)
            })
            
            files_evaluated += 1
            
            # Progress update
            if idx % 50 == 0:
                print(f"  Processed {idx}/{total_files} files...")
                
        except Exception as e:
            print(f"⚠ Error evaluating {text_file.name}: {e}")
            continue
    
    print(f"\n✓ Evaluation complete: {files_evaluated} files evaluated")
    
    # Aggregate results
    def aggregate(results_list):
        total_tp = sum(r['tp'] for r in results_list)
        total_fp = sum(r['fp'] for r in results_list)
        total_fn = sum(r['fn'] for r in results_list)
        
        precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0.0
        recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0.0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
        
        return {
            'tp': total_tp,
            'fp': total_fp,
            'fn': total_fn,
            'precision': precision,
            'recall': recall,
            'f1': f1
        }
    
    return {
        'meddra_results': aggregate(all_meddra_results),
        'original_results': aggregate(all_original_results),
        'per_file_results': per_file_results,
        'files_evaluated': files_evaluated
    }

print("✓ Complete evaluation pipeline defined")
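
The `aggregate` helper above sums TP, FP, and FN across files before computing the scores, which corresponds to micro-averaging. The sketch below contrasts that with a per-file (macro) average on made-up counts; the `prf` helper and the numbers are illustrative assumptions, not results from the dataset.

```python
from typing import Dict, List, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts, with 0.0 on empty denominators."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f


# Toy per-file counts for illustration only
per_file: List[Dict[str, int]] = [
    {"tp": 8, "fp": 2, "fn": 0},   # long post, mostly correct
    {"tp": 0, "fp": 0, "fn": 2},   # short post, every ADR missed
]

# Micro-average: pool the counts, then compute the scores once
micro_p, micro_r, micro_f1 = prf(
    sum(r["tp"] for r in per_file),
    sum(r["fp"] for r in per_file),
    sum(r["fn"] for r in per_file),
)

# Macro-average: compute F1 per file, then take the mean
macro_f1 = sum(prf(r["tp"], r["fp"], r["fn"])[2] for r in per_file) / len(per_file)

print(f"micro F1 = {micro_f1:.4f}, macro F1 = {macro_f1:.4f}")
```

With pooled counts, files containing many entities dominate the score, while a per-file average gives every file equal weight regardless of its size.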


## 8. Testing with Sample File

Test the evaluation pipeline with a sample file to verify everything works correctly.


In [None]:
# Test with a sample file
sample_text_files = sorted(TEXT_DIR.glob("*.txt"))[:1]  # Get first file (sorted for reproducibility)

if sample_text_files:
    test_file = sample_text_files[0]
    print(f"Testing evaluation with file: {test_file.name}")
    print("=" * 80)
    
    # Load text
    with open(test_file, 'r', encoding='utf-8') as f:
        text = f.read().strip()
    
    # Load MedDRA ground truth
    meddra_file = MEDDRA_DIR / test_file.with_suffix('.ann').name
    meddra_gt = load_meddra_ground_truth(meddra_file) if meddra_file.exists() else []
    
    # Load original ground truth (ADR only)
    original_file = ORIGINAL_DIR / test_file.with_suffix('.ann').name
    original_gt = load_original_ground_truth(original_file) if original_file.exists() else []
    
    print(f"\nMedDRA Ground Truth: {len(meddra_gt)} ADR entities")
    for entity in meddra_gt[:5]:  # Show first 5
        print(f"  - {entity['label']}: '{entity['text']}' [{entity['start']}:{entity['end']}] (MedDRA: {entity.get('meddra_code', 'N/A')})")
    
    print(f"\nOriginal Ground Truth (ADR): {len(original_gt)} ADR entities")
    for entity in original_gt[:5]:  # Show first 5
        print(f"  - {entity['label']}: '{entity['text']}' [{entity['start']}:{entity['end']}]")
    
    # For testing, create dummy predictions (in real scenario, these come from Task 2)
    print("\n⚠ Note: Using dummy predictions for demonstration.")
    print("   In actual evaluation, use predictions from Task 2 pipeline.")
    
    # Create dummy predictions (subset of ground truth to simulate predictions)
    pred_adr = meddra_gt[:len(meddra_gt)//2] if len(meddra_gt) > 1 else []
    
    if pred_adr:
        print(f"\nDummy Predictions (ADR): {len(pred_adr)} entities")
        for entity in pred_adr[:5]:
            print(f"  - {entity['label']}: '{entity['text']}' [{entity['start']}:{entity['end']}]")
    
    # Evaluate against MedDRA
    meddra_result = evaluate_adr_entities(meddra_gt, pred_adr)
    
    # Evaluate against original
    original_result = evaluate_adr_entities(original_gt, pred_adr)
    
    print("\n" + "=" * 80)
    print("EVALUATION RESULTS")
    print("=" * 80)
    
    print("\nAgainst MedDRA Ground Truth:")
    print(f"  Precision: {meddra_result['precision']:.4f}")
    print(f"  Recall:    {meddra_result['recall']:.4f}")
    print(f"  F1-Score:  {meddra_result['f1']:.4f}")
    print(f"  TP: {meddra_result['tp']}, FP: {meddra_result['fp']}, FN: {meddra_result['fn']}")
    
    print("\nAgainst Original Ground Truth:")
    print(f"  Precision: {original_result['precision']:.4f}")
    print(f"  Recall:    {original_result['recall']:.4f}")
    print(f"  F1-Score:  {original_result['f1']:.4f}")
    print(f"  TP: {original_result['tp']}, FP: {original_result['fp']}, FN: {original_result['fn']}")
    
else:
    print("⚠ No text files found for testing")
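
A further sanity check is to confirm that each character range actually selects the annotated text. The sketch below assumes 0-based offsets with an exclusive end, which is consistent with the `9 18` / `neck pain` example earlier. The `find_offset_mismatches` helper and the sample sentence are hypothetical.

```python
from typing import Dict, List


def find_offset_mismatches(text: str, entities: List[Dict]) -> List[Dict]:
    """Return entities whose span text differs from the annotated text."""
    mismatches = []
    for entity in entities:
        span_text = text[entity["start"]:entity["end"]]
        if span_text != entity["text"]:
            # Keep the sliced text alongside the entity for inspection
            mismatches.append({**entity, "span_text": span_text})
    return mismatches


# Toy sentence and entities for illustration only
text = "I had bad neck pain and memory loss."
entities = [
    {"start": 10, "end": 19, "text": "neck pain"},    # correct offsets
    {"start": 23, "end": 34, "text": "memory loss"},  # shifted by one character
]

mismatches = find_offset_mismatches(text, entities)
print(mismatches)
```

Entities created from multi-range annotations carry the full entity text on every range, so they would be flagged by this check and are best excluded from it. The check is also most meaningful on the unmodified file contents, since removing leading whitespace (for example with `.strip()`) would shift every offset.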


## 9. Integration Helper Function

Helper function to integrate with Task 2 predictions. This should be customized based on how Task 2 generates predictions.
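
Whatever the source of the predictions, the evaluation pipeline only requires a callable that accepts `(text_file_path, text)` and returns a list of dictionaries with `label`, `start`, `end`, and `text` keys. The following stand-in is a sketch of that contract, useful for exercising the pipeline before the Task 2 model is available. The keyword list and the function name are hypothetical and are not a substitute for the real model.

```python
import re
from pathlib import Path
from typing import Dict, List

# Hypothetical keyword list, for illustration only
ADR_KEYWORDS = ["neck pain", "memory loss"]


def keyword_baseline_predictions(text_file_path: Path, text: str) -> List[Dict]:
    """Stand-in predictor: tag every keyword occurrence as an ADR entity."""
    entities = []
    for keyword in ADR_KEYWORDS:
        for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
            entities.append({
                "label": "ADR",
                "text": match.group(0),
                "start": match.start(),  # 0-based, inclusive
                "end": match.end(),      # exclusive
            })
    return sorted(entities, key=lambda e: e["start"])


preds = keyword_baseline_predictions(
    Path("example.txt"), "Severe neck pain, then memory loss."
)
print(preds)
```

Any function with this signature can be passed as `get_predictions_func` to `evaluate_all_files`.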


In [None]:
# ============================================================================
# TASK 2 INTEGRATION: How to Use Task 2 Pipeline in Task 4 Evaluation
# ============================================================================

# There are three ways to integrate Task 2 with Task 4:

# METHOD 1: Run Task 2 in the same notebook session (RECOMMENDED)
# ----------------------------------------------------------------------------
# Step 1: Run all cells from Task 2 notebook first (or copy necessary cells)
# Step 2: Then use the functions and model pipeline from Task 2 here
#
# This requires:
# - Running Task 2 cells to load the model (ner_pipeline, tokenizer)
# - Having Task 2 functions available (process_text_file, etc.)

def get_predictions_with_task2(text_file_path: Path, text: str) -> List[Dict]:
    """
    Get predictions using Task 2 pipeline (METHOD 1: Same session).
    
    Prerequisites:
    1. Run Task 2 notebook cells first to load model and functions
    2. Ensure these variables are available:
       - ner_pipeline (transformers pipeline)
       - tokenizer (transformers tokenizer)
       - process_text_file function from Task 2
    
    Returns:
    --------
    List[Dict]
        List of entity dictionaries (all labels)
    """
    try:
        # Check if Task 2 components are available
        if 'ner_pipeline' not in globals() or 'tokenizer' not in globals():
            raise NameError("Task 2 model not loaded. Please run Task 2 notebook first.")
        
        if 'process_text_file' not in globals():
            raise NameError("Task 2 functions not available. Please run Task 2 notebook first.")
        
        # Use Task 2 pipeline to generate predictions
        bio_tagged, annotation_lines = process_text_file(text_file_path, ner_pipeline, tokenizer)
        
        # Convert annotation lines to entity dictionaries
        # annotation_lines are in format: "T1\tADR 9 19\tbit drowsy"
        pred_entities = []
        
        for line in annotation_lines:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            
            # Parse: TAG\tLABEL START END\tTEXT
            match = re.match(r'^(T\d+)\t([^\t]+)\t(.+)$', line)
            if match:
                tag = match.group(1)
                label_and_ranges = match.group(2)
                text_entity = match.group(3)
                
                parts = label_and_ranges.split(None, 1)
                if len(parts) < 2:
                    continue
                
                label_type = parts[0]
                ranges_str = parts[1]
                
                # Parse ranges
                ranges = []
                if ';' in ranges_str:
                    range_pairs = ranges_str.split(';')
                    for rp in range_pairs:
                        rp = rp.strip()
                        if rp:
                            range_nums = rp.split()
                            if len(range_nums) >= 2:
                                try:
                                    start = int(range_nums[0])
                                    end = int(range_nums[1])
                                    ranges.append((start, end))
                                except ValueError:
                                    continue
                else:
                    range_nums = ranges_str.split()
                    if len(range_nums) >= 2:
                        try:
                            start = int(range_nums[0])
                            end = int(range_nums[1])
                            ranges = [(start, end)]
                        except ValueError:
                            continue
                
                for start, end in ranges:
                    pred_entities.append({
                        'label': label_type,
                        'text': text_entity.strip(),
                        'start': start,
                        'end': end,
                        'tag': tag
                    })
        
        return pred_entities
        
    except NameError as e:
        print(f"⚠ {e}")
        print("   Please run Task 2 notebook cells first, or use METHOD 2/3 below.")
        return []
    except Exception as e:
        print(f"‚ö† Error generating predictions for {text_file_path.name}: {e}")
        return []


# METHOD 2: Load predictions from saved files
# ----------------------------------------------------------------------------
# If you saved Task 2 predictions to files, load them here

def get_predictions_from_file(text_file_path: Path, text: str, predictions_dir: Path = None) -> List[Dict]:
    """
    Get predictions from saved annotation files (METHOD 2: Load from files).
    
    Prerequisites:
    1. Task 2 should have saved predictions to annotation files
    2. Files should be in format: predictions/<filename>.ann
    
    Parameters:
    -----------
    text_file_path : Path
        Path to the text file
    text : str
        Text content (not used, but kept for compatibility)
    predictions_dir : Path, optional
        Directory where predictions are saved (default: Path("predictions"))
    
    Returns:
    --------
    List[Dict]
        List of entity dictionaries (all labels)
    """
    if predictions_dir is None:
        predictions_dir = Path("predictions")
    
    # Find corresponding prediction file (same stem, .ann extension)
    pred_file = predictions_dir / f"{text_file_path.stem}.ann"
    
    if not pred_file.exists():
        print(f"⚠ Prediction file not found: {pred_file}")
        return []
    
    # Load predictions using the load_predictions function
    return load_predictions(pred_file)


# METHOD 3: Load Task 2 model and functions dynamically
# ----------------------------------------------------------------------------
# If Task 2 code is in a separate module or can be imported

def get_predictions_dynamic_import(text_file_path: Path, text: str) -> List[Dict]:
    """
    Get predictions by dynamically loading Task 2 components (METHOD 3: Import).
    
    This method attempts to load Task 2 model and functions from a saved state
    or by re-executing key cells. This is more complex and may require:
    - Saving model state
    - Creating a module from Task 2 functions
    - Or using importlib to execute Task 2 cells
    
    Note: This is advanced and may not work in all environments.
    For most cases, use METHOD 1 (same session) or METHOD 2 (saved files).
    """
    # This is a placeholder - implementation depends on your setup
    # You might need to:
    # 1. Save model after Task 2: torch.save(model.state_dict(), 'model.pt')
    # 2. Load model in Task 4: model.load_state_dict(torch.load('model.pt'))
    # 3. Recreate pipeline and functions
    
    print("⚠ Dynamic import not implemented. Use METHOD 1 or METHOD 2.")
    return []


# ============================================================================
# USAGE INSTRUCTIONS
# ============================================================================

print("=" * 80)
print("TASK 2 INTEGRATION GUIDE")
print("=" * 80)
print("\nChoose one of the following methods:\n")
print("METHOD 1 (Recommended): Run Task 2 in same notebook session")
print("  → Run all Task 2 cells first to load model and functions")
print("  → Then use: get_predictions_func = get_predictions_with_task2")
print("\nMETHOD 2: Load predictions from saved files")
print("  → Save Task 2 predictions to files first")
print("  → Then use: get_predictions_func = get_predictions_from_file")
print("\nMETHOD 3: Dynamic import (advanced)")
print("  → Requires custom setup to import Task 2 components")
print("  → Use only if you have a specific setup for this")
print("\n" + "=" * 80)
print("\nExample usage after choosing a method:")
print("  results = evaluate_all_files(")
print("      text_files=text_files,")
print("      get_predictions_func=get_predictions_with_task2,  # or method 2/3")
print("      max_files=100")
print("  )")
print("=" * 80)



In [None]:
# ============================================================================
# COMPLETE EXAMPLE: Full Evaluation with Task 2 Integration
# ============================================================================

# Uncomment and modify this section when ready to run full evaluation

"""
# STEP 1: Choose your integration method
# Option A: Same session (requires Task 2 to be run first)
get_predictions_func = get_predictions_with_task2

# Option B: Load from files (requires predictions to be saved first)
# get_predictions_func = lambda path, text: get_predictions_from_file(path, text, Path("predictions"))

# STEP 2: Get all text files
text_files = list(TEXT_DIR.glob("*.txt"))
print(f"Found {len(text_files)} text files")

# STEP 3: Run evaluation
# For testing, start with a small subset
print("\nRunning evaluation on first 10 files (for testing)...")
print("Remove max_files parameter to evaluate all files")
print("=" * 80)

results = evaluate_all_files(
    text_files=text_files,
    get_predictions_func=get_predictions_func,
    max_files=10  # Remove this line to evaluate all files
)

# STEP 4: Display results
print("\n")
display_comprehensive_results(results)

# STEP 5: Optional: Save results to file
import json
results_file = Path("task4_evaluation_results.json")
with open(results_file, 'w') as f:
    # Convert to JSON-serializable format
    results_json = {
        'meddra_results': results['meddra_results'],
        'original_results': results['original_results'],
        'files_evaluated': results['files_evaluated']
    }
    json.dump(results_json, f, indent=2)
print(f"\n✓ Results saved to {results_file}")
"""

print("✓ Complete example code provided above")
print("  → Uncomment the code block to run full evaluation")
print("  → Make sure Task 2 is integrated first (see METHOD 1 or METHOD 2)")


## 10. Display Comprehensive Results

Display and compare results between MedDRA and original ground truth evaluations.


In [None]:
def display_comprehensive_results(results: Dict):
    """
    Display comprehensive evaluation results comparing MedDRA vs Original ground truth.
    
    Parameters:
    -----------
    results : Dict
        Results from evaluate_all_files()
    """
    meddra = results['meddra_results']
    original = results['original_results']
    
    print("=" * 80)
    print("ADR-SPECIFIC EVALUATION RESULTS")
    print("=" * 80)
    print(f"\nFiles Evaluated: {results['files_evaluated']}")
    
    # Create comparison DataFrame
    comparison_data = {
        'Ground Truth': ['MedDRA', 'Original (Task 3)'],
        'Precision': [meddra['precision'], original['precision']],
        'Recall': [meddra['recall'], original['recall']],
        'F1-Score': [meddra['f1'], original['f1']],
        'True Positives': [meddra['tp'], original['tp']],
        'False Positives': [meddra['fp'], original['fp']],
        'False Negatives': [meddra['fn'], original['fn']]
    }
    
    df = pd.DataFrame(comparison_data)
    print("\n" + "=" * 80)
    print("COMPARISON: MedDRA vs Original Ground Truth")
    print("=" * 80)
    print(df.to_string(index=False))
    
    # Calculate differences
    print("\n" + "=" * 80)
    print("PERFORMANCE DIFFERENCES")
    print("=" * 80)
    
    precision_diff = meddra['precision'] - original['precision']
    recall_diff = meddra['recall'] - original['recall']
    f1_diff = meddra['f1'] - original['f1']
    
    print(f"\nPrecision Difference (MedDRA - Original): {precision_diff:+.4f}")
    print(f"Recall Difference (MedDRA - Original):     {recall_diff:+.4f}")
    print(f"F1-Score Difference (MedDRA - Original):  {f1_diff:+.4f}")
    
    # Ground truth count analysis
    meddra_gt_total = sum(r['meddra_gt_count'] for r in results['per_file_results'])
    original_gt_total = sum(r['original_gt_count'] for r in results['per_file_results'])
    
    print("\n" + "=" * 80)
    print("GROUND TRUTH ANALYSIS")
    print("=" * 80)
    print(f"\nTotal MedDRA ADR Entities:      {meddra_gt_total}")
    print(f"Total Original ADR Entities:    {original_gt_total}")
    print(f"Difference:                     {meddra_gt_total - original_gt_total:+d}")
    
    if meddra_gt_total != original_gt_total:
        print("\n⚠ Note: MedDRA and Original ground truth have different entity counts.")
        print("   This may indicate annotation differences or standardization effects.")
    
    print("\n" + "=" * 80)

print("✓ Results display function defined")


## 11. Analysis of Performance Differences

### Understanding Differences Between Original and MedDRA Ground Truth

Several factors can contribute to performance differences between evaluations using original vs MedDRA annotations:

#### 1. **Annotation Standardization**
- **MedDRA Standardization**: MedDRA annotations use standardized medical terminology with numeric codes, which may:
  - Consolidate synonymous terms into single codes
  - Normalize variations (e.g., "drowsiness" vs "drowsy" may map to the same code)
  - Use more specific clinical terminology
- **Original Annotations**: May contain more natural language variations and informal expressions

#### 2. **Entity Boundary Differences**
- MedDRA annotations may have slightly different character boundaries due to standardization
- Original annotations might include/exclude surrounding words differently

#### 3. **Entity Count Differences**
- MedDRA may consolidate multiple mentions into single standardized entities
- Original annotations may preserve all individual mentions

#### 4. **Label Consistency**
- MedDRA ensures all ADR entities follow standardized coding
- Original annotations may have more variability in labeling consistency
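
The effect of boundary differences under exact span matching can be illustrated with a minimal sketch. It assumes entity dictionaries with `start` and `end` keys, as produced by the prediction parser; the helper names `span_set` and `exact_match_counts` and all spans below are hypothetical and are not taken from the dataset.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]


def span_set(entities: List[Dict]) -> Set[Span]:
    """Collect the (start, end) pairs of a list of entity dictionaries."""
    return {(e['start'], e['end']) for e in entities}


def exact_match_counts(pred: List[Dict], gold: List[Dict]) -> Dict[str, int]:
    """Count TP/FP/FN when only identical (start, end) spans count as matches."""
    pred_spans, gold_spans = span_set(pred), span_set(gold)
    return {
        'tp': len(pred_spans & gold_spans),
        'fp': len(pred_spans - gold_spans),
        'fn': len(gold_spans - pred_spans),
    }


# Hypothetical example: one predicted ADR span, and two ground truth
# versions whose boundaries for the same mention differ.
predicted = [{'start': 9, 'end': 18, 'text': 'neck pain'}]
original_gt = [{'start': 9, 'end': 18, 'text': 'neck pain'}]
meddra_gt = [{'start': 5, 'end': 18, 'text': 'bad neck pain'}]

print("vs original:", exact_match_counts(predicted, original_gt))
print("vs MedDRA:  ", exact_match_counts(predicted, meddra_gt))
```

In this made-up case the same prediction is a true positive against one ground truth and, against the other, counts as both a false positive and a false negative. This is why even small boundary shifts can move Precision and Recall at the same time.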

### Expected Scenarios:

1. **MedDRA Performance Higher**: If the model better matches standardized terminology
2. **Original Performance Higher**: If the model captures natural language variations better
3. **Similar Performance**: If both ground truth sets are well-aligned

### Clinical Significance of ADR Detection Accuracy

**High ADR detection accuracy is clinically critical** for several reasons:

1. **Patient Safety**:
   - **False Negatives (Missed ADRs)**: Can lead to:
     - Continued use of harmful medications
     - Delayed medical intervention
     - Severe adverse events going unreported
   - **False Positives (Incorrect ADRs)**: Can lead to:
     - Unnecessary medication changes
     - Patient anxiety
     - Over-reporting that dilutes signal detection

2. **Regulatory Reporting**:
   - Inaccurate ADR detection affects post-marketing surveillance data
   - Regulatory bodies (FDA, EMA) require accurate ADR reporting
   - MedDRA coding ensures standardized reporting across systems

3. **Clinical Decision Support**:
   - Electronic health records use ADR information for alerts
   - Incorrect ADR detection can generate false alerts (alert fatigue)
   - Missed ADRs can leave critical drug-safety information out of the record

4. **Pharmacovigilance Signal Detection**:
   - Automated ADR extraction enables early detection of safety signals
   - Low recall can miss emerging safety concerns
   - Low precision can create noise that obscures real signals

5. **Research and Knowledge Discovery**: ADR patterns help identify:
   - Drug-drug interactions
   - Population-specific risks (age, gender, comorbidities)
   - Dose-response relationships

### Recommended Performance Targets:

- **Precision**: > 0.80 (minimize false positives for regulatory reporting)
- **Recall**: > 0.85 (minimize false negatives for patient safety)
- **F1-Score**: > 0.82 (balanced performance)
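
A minimal sketch of checking aggregated counts against these targets is given below. The function names `prf` and `check_targets` and the example counts are hypothetical; in the notebook, the counts would come from the `tp`, `fp`, and `fn` fields of the evaluation results.

```python
from typing import Dict

# Thresholds copied from the targets listed above.
TARGETS = {'precision': 0.80, 'recall': 0.85, 'f1': 0.82}


def prf(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, and F1 from counts, guarding against zero division."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'precision': precision, 'recall': recall, 'f1': f1}


def check_targets(metrics: Dict[str, float],
                  targets: Dict[str, float] = TARGETS) -> Dict[str, bool]:
    """Return, per metric, whether it strictly exceeds its target."""
    return {name: metrics[name] > threshold for name, threshold in targets.items()}


# Hypothetical counts, for illustration only.
example = prf(tp=90, fp=10, fn=30)
status = check_targets(example)

for name, value in example.items():
    flag = "meets target" if status[name] else "below target"
    print(f"{name:<10} {value:.4f}  ({flag})")
```

With these illustrative counts, precision clears its target while recall and F1 do not. A system can therefore look precise and still miss too many ADRs for the patient-safety target.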

### MedDRA vs Original Evaluation Benefits:

1. **MedDRA Evaluation**:
   - Tests model performance on standardized medical terminology
   - Aligns with real-world pharmacovigilance workflows
   - Enables direct integration with regulatory databases

2. **Original Evaluation**:
   - Tests model performance on natural language variations
   - Reflects patient forum and social media contexts
   - Captures informal and colloquial expressions


## 12. Summary

This notebook provides a comprehensive ADR-specific evaluation framework with:

### ✅ Features Implemented:

1. **MedDRA Format Parsing**: Correctly parses TT prefix + MedDRA code + character ranges format
2. **ADR Entity Filtering**: Filters predictions to only ADR labels
3. **Exact Span Matching**: Entity-level evaluation with exact boundary matching
4. **Dual Ground Truth Evaluation**: Compares performance against both MedDRA and original annotations
5. **Comprehensive Metrics**: Precision, Recall, F1-Score with TP/FP/FN breakdowns
6. **Performance Comparison**: Analyzes differences between MedDRA and original evaluations
7. **Clinical Context**: Explains importance of ADR detection in pharmacovigilance

### 📊 Key Insights:

- **MedDRA Ground Truth**: Standardized medical terminology with codes
- **Original Ground Truth**: Natural language variations
- **Performance Differences**: Reflect annotation standardization and terminology alignment
- **Clinical Significance**: ADR detection accuracy directly impacts patient safety

### 🔧 Usage:

1. Integrate with Task 2 to generate predictions
2. Run evaluation on single file or full dataset
3. Compare MedDRA vs Original performance
4. Analyze differences and clinical implications

### 📝 Notes:

- Predictions must be generated using Task 2 pipeline or loaded from files
- Evaluation uses exact boundary matching (consistent with Task 3 methodology)
- MedDRA annotations contain only ADR entities (by design)
- Original annotations may have slight boundary/text differences from MedDRA
