## Section 1: Introduction

### Why Manual Annotation?

**Weak labels** (rule-based, lexicon-driven) provide a starting point but have limitations:
- **Boundary errors**: "severe burning sensation" vs "burning sensation"
- **False positives**: Matching anatomy tokens ("skin") without context
- **Missing synonyms**: Lexicons incomplete for colloquial phrasing

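**LLM refinement** improves weak labels but still needs human validation:
- Expected IOU gain of +8-15% over weak labels alone (IOU = intersection over union of predicted and gold character spans)
- 5-10% of modified spans end up worse than the weak label (over-correction, hallucination)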
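
To make the IOU figures concrete, here is a minimal sketch of character-level span IOU. It assumes half-open `(start, end)` offsets; the evaluation harness may define the metric differently (for example token-level or label-aware), so treat `span_iou` as an illustrative helper rather than the project's implementation.

```python
def span_iou(a, b):
    """IOU of two half-open character spans given as (start, end) tuples."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


# Sentence: "Patient reports severe burning sensation after applying the cream."
weak = (16, 40)  # "severe burning sensation"
gold = (23, 40)  # "burning sensation"
print(round(span_iou(weak, gold), 3))  # 17 shared chars / 24 total -> 0.708
```

A boundary error that only adds one adjective already costs roughly 0.3 IOU, which is why boundary trimming dominates the expected gain.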

**Gold standard annotations** enable:
- Fine-tuning BioBERT for domain-specific NER (target: F1 >0.90)
- Evaluation harness for measuring weak/LLM quality
- Iterative improvement of heuristics and prompts

### Pipeline Overview

```
Raw Text
   ↓
Weak Labels (lexicon + fuzzy matching)
   ↓
LLM Refinement (boundary correction, canonical normalization)
   ↓
Human Annotation (Label Studio)
   ↓
Gold Standard JSONL
   ↓
Evaluation (IOU improvement, correction rate, P/R/F1)
   ↓
Fine-Tuned BioBERT Model
```

## Section 2: Data Preparation

### Load Weak Labels

In [None]:
import json
from pathlib import Path
import pandas as pd

# Load weak labels from test fixtures
weak_path = Path('tests/fixtures/annotation/weak_baseline.jsonl')

weak_records = []
with open(weak_path, 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            weak_records.append(json.loads(line))

print(f"Loaded {len(weak_records)} records")
print(f"\nSample record:")
print(json.dumps(weak_records[0], indent=2))
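
Before exploring label quality, it can help to confirm that each span's offsets actually point at its text. The sketch below assumes records carry `text` and `spans` with `start`, `end`, and `text` fields (as the later cells do) and that offsets are 0-based and end-exclusive; `find_offset_mismatches` is a hypothetical helper, not part of the project scripts.

```python
def find_offset_mismatches(record):
    """Return spans whose text differs from record['text'][start:end]."""
    text = record.get('text', '')
    return [
        span for span in record.get('spans', [])
        if text[span['start']:span['end']] != span['text']
    ]


sample = {
    'text': 'No redness observed, but patient complains of itching.',
    'spans': [
        {'start': 3, 'end': 10, 'text': 'redness', 'label': 'SYMPTOM'},
        # Deliberately off by one: text[45:52] is ' itchin'
        {'start': 45, 'end': 52, 'text': 'itching', 'label': 'SYMPTOM'},
    ],
}

bad = find_offset_mismatches(sample)
print([(s['start'], s['end'], s['text']) for s in bad])
```

Running the same check over `weak_records` would flag misaligned spans before they reach Label Studio as pre-annotations.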

### Explore Weak Label Quality

In [None]:
# Extract span statistics
all_spans = []
for rec in weak_records:
    for span in rec.get('spans', []):
        all_spans.append({
            'text': span['text'],
            'label': span['label'],
            'confidence': span.get('confidence', 1.0),
            'length': len(span['text'])
        })

df = pd.DataFrame(all_spans)

print("\n=== Weak Label Statistics ===")
print(f"Total spans: {len(df)}")
print(f"\nLabel distribution:")
print(df['label'].value_counts())
print(f"\nConfidence distribution:")
print(df['confidence'].describe())
print(f"\nSpan length distribution:")
print(df['length'].describe())

In [None]:
# Visualize confidence distribution (matplotlib only; seaborn was imported but never used)
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Confidence histogram
axes[0].hist(df['confidence'], bins=20, alpha=0.7, color='steelblue')
axes[0].set_xlabel('Confidence Score')
axes[0].set_ylabel('Count')
axes[0].set_title('Weak Label Confidence Distribution')
axes[0].axvline(df['confidence'].median(), color='red', linestyle='--', label=f'Median: {df["confidence"].median():.2f}')
axes[0].legend()

# Label counts
label_counts = df['label'].value_counts()
axes[1].bar(label_counts.index, label_counts.values, alpha=0.7, color=['green', 'blue'])
axes[1].set_xlabel('Entity Type')
axes[1].set_ylabel('Count')
axes[1].set_title('Entity Type Distribution')

plt.tight_layout()
plt.show()

print(f"\n💡 Insight: Low-confidence spans (< 0.80) will benefit most from LLM refinement")
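
The 0.80 cutoff above can be turned into a review queue. This is a small sketch in plain Python, assuming the same record layout as the weak-label file and a default confidence of 1.0 when the field is missing; `low_confidence_spans` is a hypothetical helper name and the sample records are made up for illustration.

```python
def low_confidence_spans(records, threshold=0.80):
    """Yield (record_id, span) pairs whose confidence is below the threshold."""
    for rec in records:
        for span in rec.get('spans', []):
            if span.get('confidence', 1.0) < threshold:
                yield rec.get('id'), span


sample_records = [
    {'id': 'a', 'spans': [
        {'text': 'burning sensation', 'label': 'SYMPTOM', 'confidence': 0.82},
        {'text': 'skin', 'label': 'SYMPTOM', 'confidence': 0.61},
    ]},
    {'id': 'b', 'spans': [{'text': 'itching', 'label': 'SYMPTOM'}]},
]

# Lowest confidence first, so the most doubtful spans are reviewed first
queue = sorted(low_confidence_spans(sample_records),
               key=lambda pair: pair[1]['confidence'])
for rec_id, span in queue:
    print(rec_id, span['text'], span['confidence'])
```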

## Section 3: LLM Refinement Demo

### Compare Weak vs LLM-Refined Labels

In [None]:
# Load LLM-refined labels
llm_path = Path('tests/fixtures/annotation/gold_with_llm_refined.jsonl')

llm_records = []
with open(llm_path, 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            llm_records.append(json.loads(line))

print(f"Loaded {len(llm_records)} LLM-refined records\n")

# Compare first record
rec = llm_records[0]
print(f"Text: {rec['text']}\n")

print("WEAK LABELS:")
for span in rec.get('spans', []):
    print(f"  - [{span['start']}-{span['end']}] '{span['text']}' ({span['label']}, conf={span.get('confidence', 1.0):.2f})")

print("\nLLM SUGGESTIONS:")
for span in rec.get('llm_suggestions', []):
    print(f"  - [{span['start']}-{span['end']}] '{span['text']}' ({span['label']})")
    if 'rationale' in span:
        print(f"    Rationale: {span['rationale']}")

### Highlight Boundary Corrections

In [None]:
# Find modified spans
for rec in llm_records:
    weak_spans = {(s['start'], s['end'], s['text']) for s in rec.get('spans', [])}
    llm_spans = {(s['start'], s['end'], s['text']) for s in rec.get('llm_suggestions', [])}
    
    changed = weak_spans ^ llm_spans  # Symmetric difference
    if changed:
        print(f"\n📝 Task: {rec['id']}")
        print(f"Text: {rec['text'][:80]}...")
        
        # Show before/after: pair each weak span with an overlapping LLM
        # suggestion of the same label, so records with several spans of
        # one label are not all compared against the first suggestion
        for weak_span in rec.get('spans', []):
            llm_match = [
                s for s in rec.get('llm_suggestions', [])
                if s['label'] == weak_span['label']
                and s['start'] < weak_span['end']
                and weak_span['start'] < s['end']
            ]
            
            if llm_match and llm_match[0]['text'] != weak_span['text']:
                print(f"  BEFORE: '{weak_span['text']}' (confidence={weak_span.get('confidence', 1.0):.2f})")
                print(f"  AFTER:  '{llm_match[0]['text']}'")
                if 'rationale' in llm_match[0]:
                    print(f"  WHY:    {llm_match[0]['rationale']}")

print("\n✅ LLM typically corrects:")
print("   - Removes adjectives: 'severe burning' → 'burning'")
print("   - Trims determiners: 'the redness' → 'redness'")
print("   - Normalizes to canonical: 'itching' → 'pruritus' (if in lexicon)")

## Section 4: Label Studio Setup

### Install Label Studio (if not already installed)

In [None]:
# Check if Label Studio is installed
import subprocess
import sys

try:
    result = subprocess.run(['label-studio', '--version'], capture_output=True, text=True)
    print(f"✅ Label Studio installed: {result.stdout.strip()}")
except FileNotFoundError:
    print("❌ Label Studio not found. Install with:")
    print("   pip install label-studio")
    print("\nAfter installation, disable telemetry:")
    print("   PowerShell: $env:LABEL_STUDIO_DISABLE_TELEMETRY=1")
    print("   CMD: set LABEL_STUDIO_DISABLE_TELEMETRY=1")

### Import Configuration

**Manual Steps** (one-time setup):

1. **Launch Label Studio**:
   ```bash
   label-studio start
   ```
   Opens at http://localhost:8080

2. **Create Project**:
   - Click "Create Project"
   - Name: "Adverse Event NER"
   - Description: "Symptom and product annotation for biomedical complaints"

3. **Import Label Config**:
   - Go to Settings → Labeling Interface
   - Click "Code" tab
   - Copy contents from `data/annotation/config/label_config.xml`
   - Click "Save"

4. **Import Tasks**:
   - Go to project dashboard
   - Click "Import" button
   - Upload JSON file (generated below)
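
The pre-annotations in this tutorial use `from_name: 'label'`, `to_name: 'text'`, and a `text` data field, so the labeling config has to declare matching names or Label Studio will not render the predictions. The contents of `label_config.xml` are not shown here; the sketch below uses a hypothetical minimal config (standard `<View>`, `<Labels>`, `<Label>`, and `<Text>` tags) and a hypothetical `config_names` helper to show the check.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal config; the project's real label_config.xml may differ
# (for example, it may add a negation checkbox).
LABEL_CONFIG = """
<View>
  <Labels name="label" toName="text">
    <Label value="SYMPTOM"/>
    <Label value="PRODUCT"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>
"""


def config_names(xml_text):
    """Return (labels name, toName, label values, data field) from a config."""
    root = ET.fromstring(xml_text)
    labels = root.find('.//Labels')
    text = root.find('.//Text')
    values = [el.get('value') for el in labels.findall('Label')]
    return labels.get('name'), labels.get('toName'), values, text.get('value').lstrip('$')


name, to_name, values, field = config_names(LABEL_CONFIG)
print(name, to_name, values, field)
```

If any of the returned names differ from the `from_name`, `to_name`, or data key in the import JSON, adjust one side before importing tasks.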

In [None]:
# Generate Label Studio import JSON with pre-annotations
output_path = Path('data/annotation/imports/tutorial_tasks.json')
output_path.parent.mkdir(parents=True, exist_ok=True)

tasks = []
for rec in llm_records[:5]:  # First 5 tasks for tutorial
    task = {
        'data': {'text': rec['text']},
        'predictions': [{
            'result': [
                {
                    'value': {
                        'start': s['start'],
                        'end': s['end'],
                        'text': s['text'],
                        'labels': [s['label']]
                    },
                    'from_name': 'label',
                    'to_name': 'text',
                    'type': 'labels'
                }
                for s in rec.get('llm_suggestions', rec.get('spans', []))
            ]
        }]
    }
    tasks.append(task)

with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(tasks, f, indent=2)

print(f"✅ Exported {len(tasks)} tasks to {output_path}")
print(f"\nImport to Label Studio:")
print(f"  1. Open Label Studio project")
print(f"  2. Click 'Import' button")
print(f"  3. Upload {output_path}")
print(f"  4. Tasks will appear with pre-annotations (LLM suggestions)")

## Section 5: Annotation Practice

### Example 1: Boundary Correction

In [None]:
example_1 = """
Patient reports severe burning sensation after applying the cream.
"""

print("TEXT:")
print(example_1)

print("\nWEAK LABEL:")
print("  - 'severe burning sensation' (SYMPTOM, confidence=0.82)")

print("\nLLM SUGGESTION:")
print("  - 'burning sensation' (SYMPTOM)")
print("  Rationale: Removed non-medical adjective 'severe'")

print("\n✅ CORRECT ANNOTATION:")
print("  - 'burning sensation' (SYMPTOM) [character span: 23-40]")
print("  - 'the cream' (PRODUCT) [character span: 56-65]")

print("\n📖 RULE:")
print("  Exclude intensity adjectives (severe, mild, slight) from symptom spans.")
print("  Medical lexicons use canonical terms without modifiers.")

### Example 2: Negation Handling

In [None]:
example_2 = """
No redness observed, but patient complains of itching.
"""

print("TEXT:")
print(example_2)

print("\n✅ CORRECT ANNOTATION:")
print("  - 'redness' (SYMPTOM) [character span: 3-10]")
print("  - 'itching' (SYMPTOM) [character span: 46-53]")

print("\n📖 RULE:")
print("  Annotate negated symptoms (e.g., 'no redness') as SYMPTOM spans.")
print("  Do NOT include the negation word ('no', 'without', 'absence of').")
print("  Rationale: Model can learn negation context from surrounding tokens.")

print("\n❓ OPTIONAL:")
print("  If Label Studio supports custom attributes, add 'negated=true' flag.")
print("  See label_config.xml for negation checkbox option.")

### Example 3: Anatomy Gating

In [None]:
example_3 = """
Facial swelling appeared on skin after exposure.
"""

print("TEXT:")
print(example_3)

print("\n❌ INCORRECT:")
print("  - 'Facial' (SYMPTOM) ← Single anatomy token, not a symptom")
print("  - 'skin' (SYMPTOM) ← Single anatomy token without symptom context")

print("\n✅ CORRECT ANNOTATION:")
print("  - 'swelling' (SYMPTOM) [character span: 7-15]")

print("\n📖 RULE:")
print("  Skip single anatomy tokens (skin, face, arm, leg, etc.) UNLESS:")
print("  1. Part of multi-word symptom: 'facial swelling', 'skin rash'")
print("  2. Symptom keyword present in lexicon: 'facial swelling' (if lexicon has compound term)")

print("\n💡 TIP:")
print("  If lexicon has 'facial swelling' as canonical term, annotate full phrase.")
print("  Otherwise, annotate only 'swelling' (core symptom).")

### Example 4: Multi-Word Medical Terms

In [None]:
example_4 = """
Patient experienced anaphylactic shock and difficulty breathing.
"""

print("TEXT:")
print(example_4)

print("\n❌ INCORRECT:")
print("  - 'shock' (SYMPTOM) ← Incomplete medical term")
print("  - 'breathing' (SYMPTOM) ← Missing context ('difficulty')")

print("\n✅ CORRECT ANNOTATION:")
print("  - 'anaphylactic shock' (SYMPTOM) [character span: 20-38]")
print("  - 'difficulty breathing' (SYMPTOM) [character span: 43-63]")

print("\n📖 RULE:")
print("  Preserve multi-word medical terms from lexicon:")
print("  - 'anaphylactic shock' (not 'shock' alone)")
print("  - 'burning sensation' (not 'burning' alone)")
print("  - 'difficulty breathing' (not 'breathing' alone)")

print("\n💡 TIP:")
print("  When uncertain, check lexicon (data/lexicon/symptoms.csv).")
print("  If multi-word term exists, use full phrase. Otherwise, use core symptom.")

### Example 5: Overlapping Conjunctions

In [None]:
example_5 = """
Redness and swelling observed at injection site.
"""

print("TEXT:")
print(example_5)

print("\n❌ INCORRECT:")
print("  - 'Redness and swelling' (SYMPTOM) ← Conjunction included")

print("\n✅ CORRECT ANNOTATION:")
print("  - 'Redness' (SYMPTOM) [character span: 0-7]")
print("  - 'swelling' (SYMPTOM) [character span: 12-20]")

print("\n📖 RULE:")
print("  Annotate symptoms separately when connected by conjunctions (and, or).")
print("  Exclude the conjunction itself from spans.")

print("\n❓ EDGE CASE:")
print("  If lexicon has compound symptom with 'and' (rare):")
print("  - 'red and swollen' → Check lexicon first")
print("  - Default: Separate spans unless lexicon explicitly lists compound")
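
Character spans are easy to miscount by hand, so it is safer to compute them. The sketch below uses a hypothetical `char_span` helper built on `str.find`; offsets are 0-based, end-exclusive, and relative to the sentence with surrounding whitespace stripped (the triple-quoted example strings above begin with a newline, which would otherwise shift every offset by one).

```python
def char_span(text, phrase, occurrence=0):
    """Return the half-open (start, end) offsets of the nth occurrence of phrase."""
    start = -1
    for _ in range(occurrence + 1):
        start = text.find(phrase, start + 1)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
    return start, start + len(phrase)


sentences = {
    1: "Patient reports severe burning sensation after applying the cream.",
    2: "No redness observed, but patient complains of itching.",
    4: "Patient experienced anaphylactic shock and difficulty breathing.",
}

print(char_span(sentences[1], 'burning sensation'))     # (23, 40)
print(char_span(sentences[1], 'the cream'))             # (56, 65)
print(char_span(sentences[2], 'itching'))               # (46, 53)
print(char_span(sentences[4], 'anaphylactic shock'))    # (20, 38)
print(char_span(sentences[4], 'difficulty breathing'))  # (43, 63)
```

The `occurrence` argument covers sentences where the same phrase appears more than once.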

## Section 6: Export & Evaluation

### Export from Label Studio

**Manual Steps**:

1. **Complete Annotations**:
   - Annotate all imported tasks in Label Studio
   - Click "Submit" after each task

2. **Export JSON**:
   - Go to project dashboard
   - Click "Export" button
   - Select "JSON" format
   - Download file (e.g., `project-1-export.json`)

3. **Save to Data Directory**:
   - Move exported file to `data/annotation/raw/tutorial_export.json`

### Convert to Gold Standard JSONL

In [None]:
# Run conversion script (after manual export)
import subprocess

convert_cmd = [
    'python', 'scripts/annotation/convert_labelstudio.py',
    '--input', 'data/annotation/raw/tutorial_export.json',
    '--output', 'data/gold/tutorial_gold.jsonl',
    '--source', 'tutorial_batch',
    '--annotator', 'tutorial_user',
    '--symptom-lexicon', 'data/lexicon/symptoms.csv',
    '--product-lexicon', 'data/lexicon/products.csv'
]

print("Converting Label Studio export to gold JSONL...\n")
print("Command:")
print(' '.join(convert_cmd))
print("\n(Run after completing Label Studio annotation)")
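
For orientation, this is roughly what a Label Studio JSON export looks like and how it can be flattened into one record per task. It is a sketch only: the real `convert_labelstudio.py` is not shown here and likely does more (lexicon checks, provenance), and the output fields (`spans`, `source`, `annotator`) are assumed names that mirror the record layout used earlier in this notebook.

```python
import json


def export_to_gold(tasks, source, annotator):
    """Flatten Label Studio tasks into one gold record per task (first annotation only)."""
    for task in tasks:
        text = task['data']['text']
        annotations = task.get('annotations') or [{}]
        spans = []
        for item in annotations[0].get('result', []):
            if item.get('type') != 'labels':
                continue  # skip non-span results such as checkboxes
            value = item['value']
            spans.append({
                'start': value['start'],
                'end': value['end'],
                'text': value.get('text', text[value['start']:value['end']]),
                'label': value['labels'][0],
            })
        yield {
            'id': task.get('id'),
            'text': text,
            'spans': sorted(spans, key=lambda s: s['start']),
            'source': source,
            'annotator': annotator,
        }


sample_export = [{
    'id': 1,
    'data': {'text': 'Redness and swelling observed at injection site.'},
    'annotations': [{'result': [
        {'type': 'labels', 'from_name': 'label', 'to_name': 'text',
         'value': {'start': 12, 'end': 20, 'text': 'swelling', 'labels': ['SYMPTOM']}},
        {'type': 'labels', 'from_name': 'label', 'to_name': 'text',
         'value': {'start': 0, 'end': 7, 'text': 'Redness', 'labels': ['SYMPTOM']}},
    ]}],
}]

gold_lines = [json.dumps(rec) for rec in
              export_to_gold(sample_export, 'tutorial_batch', 'tutorial_user')]
print(gold_lines[0])
```

Each element of `gold_lines` is one JSONL line; writing them to disk with `'\n'.join(gold_lines)` would produce a file in the same line-per-record shape as the fixtures loaded earlier.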

### Run Evaluation Harness

In [None]:
# Evaluate annotation quality (after conversion)
eval_cmd = [
    'python', 'scripts/annotation/cli.py', 'evaluate-llm',
    '--weak', 'tests/fixtures/annotation/weak_baseline.jsonl',
    '--refined', 'tests/fixtures/annotation/gold_with_llm_refined.jsonl',
    '--gold', 'data/gold/tutorial_gold.jsonl',
    '--output', 'data/annotation/reports/tutorial_eval.json',
    '--markdown',
    '--stratify', 'label', 'confidence'
]

print("Evaluating annotation quality...\n")
print("Command:")
print(' '.join(eval_cmd))
print("\n(Run after converting gold JSONL)")

print("\n📊 Expected Metrics:")
print("  - IOU Improvement: +8-15% (weak → LLM vs gold)")
print("  - Exact Match Rate: 70-85% (LLM boundaries align with gold)")
print("  - Correction Rate: >60% improved, <10% worsened")
print("  - F1 Score: >0.85 (LLM precision/recall vs gold)")
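
To see where the precision, recall, and F1 targets come from, here is a minimal exact-match scorer. It assumes a predicted span counts as correct only when its start, end, and label all match a gold span; the harness may use a more lenient overlap criterion, so `span_prf` is an illustrative helper and its numbers may not match the report exactly.

```python
def span_prf(predicted, gold):
    """Exact-match precision, recall, and F1 over (start, end, label) triples."""
    pred = {(s['start'], s['end'], s['label']) for s in predicted}
    ref = {(s['start'], s['end'], s['label']) for s in gold}
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Sentence: "Patient reports severe burning sensation after applying the cream."
gold = [{'start': 23, 'end': 40, 'label': 'SYMPTOM'},
        {'start': 56, 'end': 65, 'label': 'PRODUCT'}]
weak = [{'start': 16, 'end': 40, 'label': 'SYMPTOM'}]  # includes "severe"
llm = [{'start': 23, 'end': 40, 'label': 'SYMPTOM'}]   # boundary corrected

print('weak:', span_prf(weak, gold))
print('llm: ', tuple(round(x, 3) for x in span_prf(llm, gold)))
```

Under exact matching the weak label scores zero despite a large overlap, which is why the report pairs P/R/F1 with IOU.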

### Interpret Evaluation Report

In [None]:
# Load evaluation report (if exists)
eval_path = Path('data/annotation/reports/tutorial_eval.json')

if eval_path.exists():
    with open(eval_path, 'r') as f:
        eval_report = json.load(f)
    
    overall = eval_report.get('overall', {})
    
    print("=== EVALUATION SUMMARY ===")
    print(f"\nIOU Improvement:")
    print(f"  Weak:  {overall.get('weak_mean_iou', 0):.3f}")
    print(f"  LLM:   {overall.get('llm_mean_iou', 0):.3f}")
    print(f"  Delta: {overall.get('iou_delta', 0):+.3f} ({overall.get('iou_improvement_pct', 0):.1f}%)")
    
    correction = overall.get('correction_rate', {})
    print(f"\nCorrection Rate:")
    print(f"  Improved:  {correction.get('improved', 0)}/{correction.get('total_modified', 0)} ({correction.get('improved_pct', 0):.1f}%)")
    print(f"  Worsened:  {correction.get('worsened', 0)}/{correction.get('total_modified', 0)} ({correction.get('worsened_pct', 0):.1f}%)")
    
    llm_prf = overall.get('llm_prf', {})
    print(f"\nLLM Performance:")
    print(f"  Precision: {llm_prf.get('precision', 0):.3f}")
    print(f"  Recall:    {llm_prf.get('recall', 0):.3f}")
    print(f"  F1:        {llm_prf.get('f1', 0):.3f}")
    
    print("\n✅ Good quality indicators:")
    if overall.get('iou_improvement_pct', 0) >= 8:
        print("  ✓ IOU improvement ≥8% (strong LLM refinement)")
    if correction.get('worsened_pct', 100) < 10:
        print("  ✓ Worsened rate <10% (LLM rarely introduces errors)")
    if llm_prf.get('f1', 0) >= 0.85:
        print("  ✓ F1 score ≥0.85 (high precision and recall)")
else:
    print("⏳ Evaluation report not found. Complete annotation workflow first.")

## Section 7: Common Mistakes & Glossary

### Common Annotation Errors

#### 1. Including Intensity Adjectives
❌ **Incorrect**: "severe burning sensation"  
✅ **Correct**: "burning sensation"  
**Why**: Lexicons use canonical terms without modifiers

#### 2. Missing Negation Context
❌ **Incorrect**: Skip "no redness" (negated symptom)  
✅ **Correct**: Annotate "redness" as SYMPTOM  
**Why**: Model learns negation from context; skipping loses training signal

#### 3. Single Anatomy Tokens
❌ **Incorrect**: "skin" alone (without symptom context)  
✅ **Correct**: Skip unless part of compound: "skin rash"  
**Why**: Anatomy is not a symptom; causes false positives

#### 4. Truncating Multi-Word Terms
❌ **Incorrect**: "shock" (incomplete)  
✅ **Correct**: "anaphylactic shock" (full medical term)  
**Why**: Lexicons preserve clinical meaning with compound terms

#### 5. Including Conjunctions
❌ **Incorrect**: "redness and swelling" (single span)  
✅ **Correct**: "redness" + "swelling" (separate spans)  
**Why**: Each symptom is a distinct entity

### Symptom Glossary (Canonical Terms)

| Colloquial | Canonical | Notes |
|------------|-----------|-------|
| itching | pruritus | Prefer medical term if in lexicon |
| redness | erythema | Both acceptable; lexicon determines |
| swelling | edema | Swelling more common in complaints |
| burning | burning sensation | Use full phrase if lexicon has it |
| dry skin | dryness | Canonical form without anatomy |
| shortness of breath | dyspnea | Medical term preferred |
| dizziness | vertigo | Technically distinct; context matters |

**Rule of Thumb**: Check `data/lexicon/symptoms.csv` for canonical form. If colloquial term present, use as-is.
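
The rule of thumb amounts to a lexicon lookup. The sketch below assumes a two-column CSV (`term`, `canonical`); the actual columns of `data/lexicon/symptoms.csv` are not shown in this tutorial, so the layout, the sample rows, and the helper names `load_lexicon` and `canonical_form` are all hypothetical.

```python
import csv
import io

# Hypothetical lexicon layout; replace with the real symptoms.csv columns.
LEXICON_CSV = """term,canonical
itching,pruritus
pruritus,pruritus
redness,redness
burning sensation,burning sensation
shortness of breath,dyspnea
"""


def load_lexicon(handle):
    """Map each lowercased surface term to its canonical form."""
    return {row['term'].lower(): row['canonical'] for row in csv.DictReader(handle)}


def canonical_form(span_text, lexicon):
    """Return the canonical form for a span, or None if the lexicon lacks it."""
    return lexicon.get(span_text.strip().lower())


lexicon = load_lexicon(io.StringIO(LEXICON_CSV))
for phrase in ['Itching', 'redness', 'tingling']:
    print(phrase, '->', canonical_form(phrase, lexicon))
```

A `None` result marks a span the lexicon does not cover, which is a candidate for a lexicon update rather than an annotation error.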

### Product Annotation Tips

- **Brand Names**: Annotate as written ("Advil", "Tylenol")
- **Generic Names**: Lowercase OK ("ibuprofen", "acetaminophen")
- **Abbreviations**: Include if common ("NSAIDs", "OTC meds")
- **Descriptors**: Exclude generic descriptors ("the medication" → skip)
- **Combinations**: Annotate full product name ("Advil PM", not just "Advil")

### Boundary Decision Tree

```
Is span a single anatomy token (skin, face, arm)?
├─ Yes → Skip UNLESS part of compound symptom ("skin rash")
└─ No → Continue

Does span include intensity adjective (severe, mild, slight)?
├─ Yes → Remove adjective, keep core symptom
└─ No → Continue

Is span multi-word? ("burning sensation", "anaphylactic shock")
├─ Yes → Check lexicon for canonical compound term
│   ├─ In lexicon → Use full phrase
│   └─ Not in lexicon → Use core symptom only
└─ No → Annotate single-word symptom

Is span negated? ("no redness", "without swelling")
├─ Yes → Annotate symptom ONLY (exclude "no", "without")
└─ No → Annotate as-is
```
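
Part of the decision tree can be sketched as a span-trimming function. This is a simplified illustration: the word lists are small assumed samples, multi-word negation cues such as "absence of" are not handled, and the lexicon check for compound terms is left out. `trim_span` is a hypothetical helper, not one of the project scripts.

```python
import re

# Small illustrative word lists (assumptions, not the project's lexicons)
LEADING_DROP = {'severe', 'mild', 'slight',   # intensity adjectives
                'no', 'without',              # single-word negation cues
                'the', 'a', 'an'}             # determiners
ANATOMY = {'skin', 'face', 'arm', 'leg'}


def trim_span(text, start, end):
    """Trim leading modifiers from text[start:end]; return (start, end) or None to skip."""
    tokens = [(m.group().lower(), start + m.start(), start + m.end())
              for m in re.finditer(r"[A-Za-z']+", text[start:end])]
    while tokens and tokens[0][0] in LEADING_DROP:
        tokens.pop(0)
    if not tokens:
        return None  # nothing left after trimming
    if len(tokens) == 1 and tokens[0][0] in ANATOMY:
        return None  # single anatomy token: skip
    return tokens[0][1], tokens[-1][2]


s1 = "Patient reports severe burning sensation after applying the cream."
s2 = "No redness observed, but patient complains of itching."
s3 = "Facial swelling appeared on skin after exposure."

print(trim_span(s1, 16, 40))  # 'severe burning sensation' -> (23, 40)
print(trim_span(s2, 0, 10))   # 'No redness' -> (3, 10)
print(trim_span(s3, 28, 32))  # 'skin' -> None
```

A helper like this is best used to flag spans for review rather than to rewrite annotations automatically, since the lexicon branch of the tree still needs a human or a lookup.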

## Summary & Next Steps

### What You Learned

✅ **Annotation Pipeline**: Raw text → weak labels → LLM refinement → human curation → gold standard  
✅ **Quality Metrics**: IOU improvement, correction rate, precision/recall/F1  
✅ **Boundary Rules**: Exclude adjectives, preserve multi-word terms, separate conjunctions  
✅ **Edge Cases**: Negation handling, anatomy gating, canonical normalization  
✅ **Evaluation**: Measuring annotation quality with evaluation harness

### Production Workflow

1. **Prepare Batch** (100 complaints):
   ```bash
   python scripts/annotation/prepare_production_batch.py \
     --input raw_complaints.txt \
     --output data/annotation/batches/batch_001/ \
     --batch-size 100
   ```

2. **Import to Label Studio**:
   - Upload `batches/batch_001/tasks.json`
   - Pre-annotations included (LLM suggestions)

3. **Annotate** (2-3 hours per 100 tasks)

4. **Export & Convert**:
   ```bash
   python scripts/annotation/convert_labelstudio.py \
     --input label_studio_export.json \
     --output data/gold/batch_001.jsonl \
     --annotator your_name
   ```

5. **Evaluate**:
   ```bash
   python scripts/annotation/cli.py evaluate-llm \
     --weak batches/batch_001/weak.jsonl \
     --refined batches/batch_001/llm_refined.jsonl \
     --gold data/gold/batch_001.jsonl \
     --output reports/batch_001_eval.json \
     --markdown --stratify label confidence
   ```

6. **Iterate**: Refine prompts/lexicons based on evaluation feedback

### Resources

- **Annotation Guide**: `docs/annotation_guide.md`
- **Production Evaluation Guide**: `docs/production_evaluation.md`
- **Phase 5 Plan**: `docs/phase_5_plan.md`
- **LLM Providers**: `docs/llm_providers.md`

### Questions?

- Review annotation guide for boundary rules
- Check lexicons (`data/lexicon/`) for canonical terms
- Run evaluation harness to measure quality
- Open GitHub issue for technical problems

**Happy annotating! 🎉**