# Adverse Event NER: Complete Annotation Workflow

This notebook walks you through the **entire annotation pipeline** from raw consumer complaints to Label Studio-ready tasks and quality-checked gold standard data.

## Workflow Overview

1. **Setup & Text Ingestion** - Load sample complaint texts
2. **Weak Labeling** - Auto-generate initial symptom/product spans using heuristics
3. **Weak Label Export** - Persist weak labels to JSONL
4. **Label Studio Tasks** - Convert weak labels to Label Studio task format
5. **Mock Human Curation** - Simulate annotator corrections (Label Studio would be used in production)
6. **Gold Conversion** - Transform annotated data to normalized training format
7. **Quality Metrics** - Compute label distribution, conflicts, and per-annotator span counts
8. **Comparison Analysis** - Evaluate weak labeling precision/recall against gold
9. **Batch Registration** - Record the batch in the provenance registry

---

## 1️⃣ Setup & Text Ingestion

First, configure the Python path and load our NER modules.

In [1]:
import sys
import os
import json
from pathlib import Path

# Add project root to Python path
project_root = Path.cwd().parent if Path.cwd().name == "scripts" else Path.cwd()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
os.chdir(project_root)

print(f"✓ Project root: {project_root}")
print(f"✓ Working directory: {Path.cwd()}")

# Import NER modules
from src.weak_label import load_symptom_lexicon, load_product_lexicon, weak_label_batch
from src.pipeline import simple_inference

print("✓ Modules loaded successfully")

✓ Project root: c:\Users\User\Documents\NER
✓ Working directory: c:\Users\User\Documents\NER




✓ Modules loaded successfully


In [2]:
# Sample complaint texts (realistic consumer reports with symptoms & products)
complaints = [
    "I got a severe rash and headache from the hydra boost cream",
    "No irritation from the face wash, just mild dryness around lips",
    "The moisturizer caused redness and itching on my skin",
    "I experienced nausea after using the vitamin serum",
    "The exfoliating scrub caused stinging and left me feeling dry"
]

print(f"✓ Loaded {len(complaints)} sample complaints\n")
for i, text in enumerate(complaints, 1):
    print(f"{i}. {text}")

✓ Loaded 5 sample complaints

1. I got a severe rash and headache from the hydra boost cream
2. No irritation from the face wash, just mild dryness around lips
3. The moisturizer caused redness and itching on my skin
4. I experienced nausea after using the vitamin serum
5. The exfoliating scrub caused stinging and left me feeling dry


## 2️⃣ Weak Labeling (Automated Span Detection)

Use lexicon-based heuristics to automatically detect symptom and product mentions. This generates our initial "weak" labels.
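
To make the heuristic concrete, here is a minimal sketch of exact lexicon matching with a token-based negation look-back. It is not the project's `weak_label_batch` implementation: the `Span` dataclass, `NEGATION_CUES`, and the `weak_label` helper are hypothetical names, the lexicon is a toy dict, and fuzzy scoring (the `scorer="wratio"` option used below) is left out.

```python
import re
from dataclasses import dataclass

# Hypothetical negation cues; a real list would likely be longer.
NEGATION_CUES = {"no", "not", "without", "never"}

@dataclass
class Span:
    start: int
    end: int
    label: str
    text: str
    canonical: str
    negated: bool

def weak_label(text, lexicon, label, negation_window=5):
    """Exact, case-insensitive lexicon matching with a negation look-back."""
    # Token offsets let us find the tokens that precede each match.
    tokens = [(m.group().lower(), m.start()) for m in re.finditer(r"\w+", text)]
    spans = []
    for term, canonical in lexicon.items():
        pattern = rf"\b{re.escape(term)}\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            preceding = [tok for tok, pos in tokens if pos < m.start()]
            # Guard against window=0, since preceding[-0:] is the whole list.
            window = preceding[-negation_window:] if negation_window > 0 else []
            negated = any(tok in NEGATION_CUES for tok in window)
            spans.append(Span(m.start(), m.end(), label, m.group(), canonical, negated))
    return sorted(spans, key=lambda s: s.start)

# Toy lexicon for illustration only (surface form -> canonical name).
toy_symptoms = {"irritation": "Irritation", "dryness": "Dryness"}
demo = weak_label(
    "No irritation from the face wash, just mild dryness around lips",
    toy_symptoms,
    "SYMPTOM",
)
for span in demo:
    print(span.text, span.start, span.end, "NEGATED" if span.negated else "")
```

In this sketch "irritation" is flagged as negated because "No" falls inside its five-token look-back, while "dryness" is far enough from the cue to stay affirmative.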

In [3]:
# Load lexicons
symptom_lex = load_symptom_lexicon(Path("data/lexicon/symptoms.csv"))
product_lex = load_product_lexicon(Path("data/lexicon/products.csv"))

print(f"✓ Loaded {len(symptom_lex)} symptom terms")
print(f"✓ Loaded {len(product_lex)} product terms")

# Run weak labeling
spans_batch = weak_label_batch(complaints, symptom_lex, product_lex, negation_window=5, scorer="wratio")

print(f"\n📊 Weak Labeling Results:\n")
for i, (text, spans) in enumerate(zip(complaints, spans_batch), 1):
    print(f"Complaint {i}: \"{text[:60]}...\"")
    print(f"  → Found {len(spans)} spans")
    for span in spans:
        neg_flag = " [NEGATED]" if span.negated else ""
        conf_str = f" (conf: {span.confidence:.2f})" if span.confidence < 1.0 else ""
        print(f"    • {span.label:8} | \"{span.text}\" → {span.canonical}{conf_str}{neg_flag}")
    print()

✓ Loaded 161 symptom terms
✓ Loaded 3 product terms

📊 Weak Labeling Results:

Complaint 1: "I got a severe rash and headache from the hydra boost cream..."
  → Found 3 spans
    • SYMPTOM  | "severe" → Headache (conf: 0.82)
    • SYMPTOM  | "rash" → Rash
    • SYMPTOM  | "headache" → Headache

Complaint 2: "No irritation from the face wash, just mild dryness around l..."
  → Found 2 spans
    • SYMPTOM  | "mild" → Headache (conf: 0.82)
    • PRODUCT  | "face wash" → Gentle Daily Cleanser

Complaint 3: "The moisturizer caused redness and itching on my skin..."
  → Found 2 spans
    • SYMPTOM  | "redness" → Erythema
    • SYMPTOM  | "itching" → Pruritus

Complaint 4: "I experienced nausea after using the vitamin serum..."
  → Found 2 spans
    • SYMPTOM  | "nausea" → Nausea
    • PRODUCT  | "serum" → Radiance Vitamin C Serum

Complaint 5: "The exfoliating scrub caused stinging and left me feeling dr..."
  → Found 1 spans
    • SYMPT

## 3️⃣ Export Weak Labels to JSONL

Persist the weak labels in a structured format for downstream processing.

In [4]:
# Prepare weak label JSONL
weak_jsonl_path = Path("data/output/workflow_demo_weak.jsonl")
weak_jsonl_path.parent.mkdir(parents=True, exist_ok=True)

weak_records = []
for idx, (text, spans) in enumerate(zip(complaints, spans_batch)):
    record = {
        "id": idx,
        "text": text,
        "weak_spans": [
            {
                "start": s.start,
                "end": s.end,
                "label": s.label,
                "text": s.text,
                "canonical": s.canonical,
                "confidence": s.confidence,
                "negated": s.negated,
            }
            for s in spans
        ]
    }
    weak_records.append(record)

# Write JSONL
with weak_jsonl_path.open("w", encoding="utf-8") as f:
    for rec in weak_records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

print(f"✓ Saved {len(weak_records)} weak label records to:")
print(f"  {weak_jsonl_path}")

✓ Saved 5 weak label records to:
  data\output\workflow_demo_weak.jsonl


## 4️⃣ Convert to Label Studio Task Format

Transform weak labels into the JSON format Label Studio expects for import.

In [5]:
# Convert weak labels to Label Studio tasks (with optional pre-annotations)
ls_tasks = []
for rec in weak_records:
    task = {"data": {"text": rec["text"]}}
    
    # Optional: include weak spans as "predictions" (can bias annotators; use cautiously)
    # Uncomment to include pre-annotations:
    # task["predictions"] = [{
    #     "model_version": "weak_v1",
    #     "result": [
    #         {
    #             "value": {
    #                 "start": s["start"],
    #                 "end": s["end"],
    #                 "text": s["text"],
    #                 "labels": [s["label"]],
    #             },
    #             "from_name": "label",
    #             "to_name": "text",
    #             "type": "labels",
    #         }
    #         for s in rec["weak_spans"]
    #     ]
    # }]
    
    ls_tasks.append(task)

# Save tasks JSON
ls_tasks_path = Path("data/annotation/exports/workflow_demo_tasks.json")
ls_tasks_path.parent.mkdir(parents=True, exist_ok=True)
ls_tasks_path.write_text(json.dumps(ls_tasks, ensure_ascii=False, indent=2), encoding="utf-8")

print(f"✓ Created {len(ls_tasks)} Label Studio tasks")
print(f"  Saved to: {ls_tasks_path}")
print(f"\n💡 In production, you would:")
print(f"   1. Start Label Studio: label-studio start")
print(f"   2. Create project with config from: data/annotation/config/label_config.xml")
print(f"   3. Import tasks from: {ls_tasks_path}")
print(f"   4. Annotators correct/refine spans in UI")
print(f"   5. Export completed annotations")

✓ Created 5 Label Studio tasks
  Saved to: data\annotation\exports\workflow_demo_tasks.json

💡 In production, you would:
   1. Start Label Studio: label-studio start
   2. Create project with config from: data/annotation/config/label_config.xml
   3. Import tasks from: data\annotation\exports\workflow_demo_tasks.json
   4. Annotators correct/refine spans in UI
   5. Export completed annotations


## 5️⃣ Mock Human Curation (Simulated Label Studio Export)

Since we don't have Label Studio running, we'll simulate a human annotator correcting the weak labels. In production, this would be the JSON export from Label Studio after annotation.

In [6]:
# Simulate Label Studio export (normally exported from Label Studio UI)
# This mimics what a human annotator would produce after reviewing weak labels

mock_ls_export = [
    {
        "id": 1,
        "data": {"text": complaints[0]},
        "annotations": [{
            "result": [
                {"value": {"start": 8, "end": 19, "text": "severe rash", "labels": ["SYMPTOM"]}},
                {"value": {"start": 24, "end": 32, "text": "headache", "labels": ["SYMPTOM"]}},
                {"value": {"start": 42, "end": 59, "text": "hydra boost cream", "labels": ["PRODUCT"]}}
            ]
        }]
    },
    {
        "id": 2,
        "data": {"text": complaints[1]},
        "annotations": [{
            "result": [
                {"value": {"start": 3, "end": 13, "text": "irritation", "labels": ["SYMPTOM"]}},
                {"value": {"start": 23, "end": 32, "text": "face wash", "labels": ["PRODUCT"]}},
                {"value": {"start": 44, "end": 51, "text": "dryness", "labels": ["SYMPTOM"]}}
            ]
        }]
    },
    {
        "id": 3,
        "data": {"text": complaints[2]},
        "annotations": [{
            "result": [
                {"value": {"start": 4, "end": 15, "text": "moisturizer", "labels": ["PRODUCT"]}},
                {"value": {"start": 23, "end": 30, "text": "redness", "labels": ["SYMPTOM"]}},
                {"value": {"start": 35, "end": 42, "text": "itching", "labels": ["SYMPTOM"]}}
            ]
        }]
    },
    {
        "id": 4,
        "data": {"text": complaints[3]},
        "annotations": [{
            "result": [
                {"value": {"start": 14, "end": 20, "text": "nausea", "labels": ["SYMPTOM"]}},
                {"value": {"start": 37, "end": 50, "text": "vitamin serum", "labels": ["PRODUCT"]}}
            ]
        }]
    },
    {
        "id": 5,
        "data": {"text": complaints[4]},
        "annotations": [{
            "result": [
                {"value": {"start": 4, "end": 21, "text": "exfoliating scrub", "labels": ["PRODUCT"]}},
                {"value": {"start": 29, "end": 37, "text": "stinging", "labels": ["SYMPTOM"]}},
                {"value": {"start": 58, "end": 61, "text": "dry", "labels": ["SYMPTOM"]}}
            ]
        }]
    }
]

# Save mock export
mock_export_path = Path("data/annotation/raw/workflow_demo_ls_export.json")
mock_export_path.parent.mkdir(parents=True, exist_ok=True)
mock_export_path.write_text(json.dumps(mock_ls_export, ensure_ascii=False, indent=2), encoding="utf-8")

print(f"✓ Created mock Label Studio export with {len(mock_ls_export)} annotated tasks")
print(f"  Saved to: {mock_export_path}")
print(f"\n📝 Summary of 'human' corrections:")
for task in mock_ls_export:
    n_spans = len(task["annotations"][0]["result"])
    print(f"   Task {task['id']}: {n_spans} entities annotated")

✓ Created mock Label Studio export with 5 annotated tasks
  Saved to: data\annotation\raw\workflow_demo_ls_export.json

📝 Summary of 'human' corrections:
   Task 1: 3 entities annotated
   Task 2: 3 entities annotated
   Task 3: 3 entities annotated
   Task 4: 2 entities annotated
   Task 5: 3 entities annotated
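
Hand-written or exported character offsets are easy to get wrong, so it can help to confirm that each span's `start`/`end` really slices out its `text` before converting to gold. The following is a minimal sketch that assumes the export structure shown above; `find_offset_mismatches` is a hypothetical helper, not one of the project's scripts.

```python
def find_offset_mismatches(tasks):
    """Return (task_id, start, end, expected, actual) for spans whose offsets are off."""
    mismatches = []
    for task in tasks:
        text = task["data"]["text"]
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                value = item["value"]
                # Slice the task text with the annotated offsets.
                actual = text[value["start"]:value["end"]]
                if actual != value["text"]:
                    mismatches.append(
                        (task["id"], value["start"], value["end"], value["text"], actual)
                    )
    return mismatches

example_tasks = [{
    "id": 4,
    "data": {"text": "I experienced nausea after using the vitamin serum"},
    "annotations": [{"result": [
        {"value": {"start": 14, "end": 20, "text": "nausea", "labels": ["SYMPTOM"]}},
        # Deliberately shifted by one character to show what a mismatch looks like.
        {"value": {"start": 36, "end": 49, "text": "vitamin serum", "labels": ["PRODUCT"]}},
    ]}],
}]
problems = find_offset_mismatches(example_tasks)
print(problems)
```

Calling `find_offset_mismatches(mock_ls_export)` before saving would flag any span whose offsets fall outside the text or select the wrong characters.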


## 6️⃣ Convert to Gold Standard Format (with Provenance)

Transform the Label Studio export into our normalized gold JSONL format with provenance tracking and canonical mapping.

In [7]:
# Run the conversion script programmatically
import subprocess

conversion_cmd = [
    "python", "scripts/annotation/convert_labelstudio.py",
    "--input", str(mock_export_path),
    "--output", "data/annotation/exports/workflow_demo_gold.jsonl",
    "--source", "workflow_demo_batch",
    "--annotator", "demo_user",
    "--revision", "1",
    "--symptom-lexicon", "data/lexicon/symptoms.csv",
    "--product-lexicon", "data/lexicon/products.csv"
]

result = subprocess.run(conversion_cmd, capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print("Error:", result.stderr)

# Load and display the gold standard
gold_path = Path("data/annotation/exports/workflow_demo_gold.jsonl")
gold_records = []
with gold_path.open("r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            gold_records.append(json.loads(line))

print(f"\n📊 Gold Standard Summary:")
print(f"  Total records: {len(gold_records)}")
print(f"\nSample gold record (task 1):")
print(json.dumps(gold_records[0], indent=2, ensure_ascii=False))

Converted 5 tasks to data\annotation\exports\workflow_demo_gold.jsonl (source=workflow_demo_batch, annotator=demo_user)


📊 Gold Standard Summary:
  Total records: 5

Sample gold record (task 1):
{
  "id": 1,
  "text": "I got a severe rash and headache from the hydra boost cream",
  "source": "workflow_demo_batch",
  "annotator": "demo_user",
  "revision": 1,
  "entities": [
    {
      "start": 8,
      "end": 19,
      "label": "SYMPTOM",
      "text": "severe rash",
      "canonical": "severe rash",
      "concept_id": "SYMPTOM:severe_rash"
    },
    {
      "start": 24,
      "end": 32,
      "label": "SYMPTOM",
      "text": "headache",
      "canonical": "headache",
      "concept_id": "SYMPTOM:headache"
    },
    {
      "start": 42,
      "end": 59,
      "label": "PRODUCT",
      "text": "hydra boost cream",
      "canonical": "hydra boost cream"
    }
  ]
}
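
For readers without the conversion script at hand, the sketch below shows roughly how one Label Studio task could be mapped onto the record shape printed above. It is an assumption-laden approximation, not the logic of `convert_labelstudio.py`: the real script also consults the lexicons, whereas here `canonical` is just the lowercased span text and `concept_id` is derived for `SYMPTOM` spans only, which mirrors the sample record but may not match the script's behaviour in general. `to_gold_record` is a hypothetical name.

```python
def to_gold_record(task, source, annotator, revision=1):
    """Flatten one Label Studio task into a gold-style record (illustrative only)."""
    entities = []
    for annotation in task.get("annotations", []):
        for item in annotation.get("result", []):
            value = item["value"]
            label = value["labels"][0]
            # Assumption: canonical form is the lowercased surface text.
            canonical = value["text"].lower()
            entity = {
                "start": value["start"],
                "end": value["end"],
                "label": label,
                "text": value["text"],
                "canonical": canonical,
            }
            # Assumption: only symptoms receive a concept id, as in the sample record.
            if label == "SYMPTOM":
                entity["concept_id"] = f"{label}:{canonical.replace(' ', '_')}"
            entities.append(entity)
    return {
        "id": task["id"],
        "text": task["data"]["text"],
        "source": source,
        "annotator": annotator,
        "revision": revision,
        "entities": entities,
    }

example_task = {
    "id": 1,
    "data": {"text": "I got a severe rash and headache from the hydra boost cream"},
    "annotations": [{"result": [
        {"value": {"start": 8, "end": 19, "text": "severe rash", "labels": ["SYMPTOM"]}},
        {"value": {"start": 42, "end": 59, "text": "hydra boost cream", "labels": ["PRODUCT"]}},
    ]}],
}
gold = to_gold_record(example_task, source="workflow_demo_batch", annotator="demo_user")
print(gold["entities"])
```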


## 7️⃣ Quality Metrics & Annotator Agreement

Analyze the gold standard data to compute quality metrics.

In [8]:
# Run quality report
quality_cmd = [
    "python", "scripts/annotation/quality_report.py",
    "--gold", "data/annotation/exports/workflow_demo_gold.jsonl",
    "--out", "data/annotation/reports/workflow_demo_quality.json"
]

result = subprocess.run(quality_cmd, capture_output=True, text=True)
print(result.stdout)

# Load and display quality report
quality_path = Path("data/annotation/reports/workflow_demo_quality.json")
quality_report = json.loads(quality_path.read_text(encoding="utf-8"))

print(f"\n📈 Quality Report:\n")
print(f"Total Tasks: {quality_report['n_tasks']}")
print(f"Mean Spans per Task: {quality_report['mean_spans_per_task']:.1f}")
print(f"\nLabel Distribution:")
for label, count in quality_report['label_counts'].items():
    print(f"  {label}: {count}")
print(f"\nConflicts (overlapping spans with different labels): {quality_report['conflicts']}")
print(f"\nAnnotator Span Counts:")
for annotator, count in quality_report['annotator_counts'].items():
    print(f"  {annotator}: {count}")

Quality report written to data\annotation\reports\workflow_demo_quality.json


📈 Quality Report:

Total Tasks: 5
Mean Spans per Task: 2.8

Label Distribution:
  SYMPTOM: 9
  PRODUCT: 5

Conflicts (overlapping spans with different labels): 0

Annotator Span Counts:
  demo_user: 14
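
The fields in this report are simple to reproduce by hand. The sketch below computes the same keys (`n_tasks`, `mean_spans_per_task`, `label_counts`, `conflicts`, `annotator_counts`) from gold-style records; `quality_summary` is a hypothetical function, the example records are toy data, and the conflict rule (any pair of overlapping spans with different labels) is an assumption based on the printed description rather than on `quality_report.py` itself.

```python
from collections import Counter

def quality_summary(gold_records):
    """Summarize label distribution, conflicts, and per-annotator span counts."""
    label_counts = Counter()
    annotator_counts = Counter()
    conflicts = 0
    for rec in gold_records:
        entities = rec["entities"]
        for ent in entities:
            label_counts[ent["label"]] += 1
            annotator_counts[rec["annotator"]] += 1
        # Count each pair of overlapping spans that carries different labels.
        for i, a in enumerate(entities):
            for b in entities[i + 1:]:
                overlaps = a["start"] < b["end"] and b["start"] < a["end"]
                if overlaps and a["label"] != b["label"]:
                    conflicts += 1
    n_tasks = len(gold_records)
    total_spans = sum(label_counts.values())
    return {
        "n_tasks": n_tasks,
        "mean_spans_per_task": total_spans / n_tasks if n_tasks else 0.0,
        "label_counts": dict(label_counts),
        "conflicts": conflicts,
        "annotator_counts": dict(annotator_counts),
    }

# Toy records: the second one contains a deliberate overlap with different labels.
toy_records = [
    {"annotator": "demo_user", "entities": [
        {"start": 8, "end": 19, "label": "SYMPTOM"},
        {"start": 42, "end": 59, "label": "PRODUCT"},
    ]},
    {"annotator": "demo_user", "entities": [
        {"start": 0, "end": 5, "label": "SYMPTOM"},
        {"start": 3, "end": 8, "label": "PRODUCT"},
    ]},
]
summary = quality_summary(toy_records)
print(summary)
```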


## 8️⃣ Compare Weak vs Gold (Precision/Recall Analysis)

Evaluate how well our automated weak labeling performed compared to human-curated gold standard.

In [None]:
# Run comparison script
comparison_cmd = [
    "python", "scripts/annotation/compare_weak_vs_gold.py",
    "--weak", "data/output/workflow_demo_weak.jsonl",
    "--gold", "data/annotation/exports/workflow_demo_gold.jsonl",
    "--output", "data/annotation/exports/workflow_demo_comparison.txt"
]

result = subprocess.run(comparison_cmd, capture_output=True, text=True, env={**os.environ, "PYTHONPATH": str(project_root)})
print(result.stdout)

# Load and display comparison report
comparison_path = Path("data/annotation/exports/workflow_demo_comparison.txt")
comparison_report = json.loads(comparison_path.read_text(encoding="utf-8"))

print(f"\n🔍 Weak vs Gold Comparison:\n")
print(f"Overall Metrics:")
print(f"  Precision: {comparison_report['overall']['precision']:.1%}")
print(f"  Recall: {comparison_report['overall']['recall']:.1%}")
print(f"  F1 Score: {comparison_report['overall']['f1']:.1%}")
print(f"  True Positives: {comparison_report['overall']['tp']}")
print(f"  False Positives: {comparison_report['overall']['fp']}")
print(f"  False Negatives: {comparison_report['overall']['fn']}")

print(f"\nPer-Label Breakdown:")
for label, metrics in comparison_report['labels'].items():
    print(f"\n  {label}:")
    print(f"    Precision: {metrics['precision']:.1%}")
    print(f"    Recall: {metrics['recall']:.1%}")
    print(f"    F1: {metrics['f1']:.1%}")

print(f"\n💡 Suggestions:")
for suggestion in comparison_report.get('suggestions', []):
    print(f"  • {suggestion}")
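
Since this cell has no captured output, a small worked example may help show what the metrics mean. The sketch below scores weak spans against gold spans using strict exact matching on `(start, end, label)`. That matching rule is an assumption: `compare_weak_vs_gold.py` may well use a looser, overlap-based criterion, and `span_prf` is a hypothetical helper. The spans are those of complaint 1, with the weak-span offsets read off the text.

```python
def span_prf(weak_spans, gold_spans):
    """Exact-match precision, recall, and F1 over (start, end, label) triples."""
    weak = set(weak_spans)
    gold = set(gold_spans)
    tp = len(weak & gold)   # predicted and correct
    fp = len(weak - gold)   # predicted but not in gold
    fn = len(gold - weak)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "tp": tp, "fp": fp, "fn": fn}

# Complaint 1: "I got a severe rash and headache from the hydra boost cream"
weak_1 = [(8, 14, "SYMPTOM"), (15, 19, "SYMPTOM"), (24, 32, "SYMPTOM")]
gold_1 = [(8, 19, "SYMPTOM"), (24, 32, "SYMPTOM"), (42, 59, "PRODUCT")]
metrics = span_prf(weak_1, gold_1)
print(metrics)
```

Under exact matching only "headache" counts as a true positive: the weak labeler split "severe rash" into two spans and missed the product entirely, so both precision and recall come out at one third for this complaint.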

## 9️⃣ Register Batch (Provenance Tracking)

Record this annotation batch in the registry for audit trail and reproducibility.

In [9]:
# Register the batch in provenance registry
register_cmd = [
    "python", "scripts/annotation/register_batch.py",
    "--batch-id", "workflow_demo_batch",
    "--gold", "data/annotation/exports/workflow_demo_gold.jsonl",
    "--annotators", "demo_user",
    "--revision", "1",
    "--notes", "Interactive notebook demo of complete annotation workflow"
]

result = subprocess.run(register_cmd, capture_output=True, text=True)
print(result.stdout)

# Display registry
registry_path = Path("data/annotation/registry.csv")
if registry_path.exists():
    print(f"\n📋 Provenance Registry:")
    print(registry_path.read_text(encoding="utf-8"))

Registered batch workflow_demo_batch (5 tasks) -> data\annotation\registry.csv


📋 Provenance Registry:
timestamp,batch_id,gold_file,n_tasks,annotators,revision,notes
2025-11-16T10:16:28,workflow_demo_batch,data\annotation\exports\workflow_demo_gold.jsonl,5,demo_user,1,Interactive notebook demo of complete annotation workflow
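
An append-only CSV like this is easy to maintain with the standard library. The sketch below writes rows with the same column names as the registry above; `append_registry_row` is a hypothetical helper and is not taken from `register_batch.py`, and the demo writes to a temporary directory so it does not touch the real registry.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path

FIELDS = ["timestamp", "batch_id", "gold_file", "n_tasks", "annotators", "revision", "notes"]

def append_registry_row(registry_path, batch_id, gold_file, n_tasks, annotators, revision, notes=""):
    """Append one batch entry, writing the header only when the file is new."""
    registry_path = Path(registry_path)
    registry_path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not registry_path.exists()
    row = {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "batch_id": batch_id,
        "gold_file": str(gold_file),
        "n_tasks": n_tasks,
        "annotators": ",".join(annotators),
        "revision": revision,
        "notes": notes,
    }
    with registry_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row

with tempfile.TemporaryDirectory() as tmp:
    demo_registry = Path(tmp) / "registry.csv"
    append_registry_row(demo_registry, "workflow_demo_batch", "gold.jsonl", 5, ["demo_user"], 1)
    append_registry_row(demo_registry, "workflow_demo_batch", "gold.jsonl", 5, ["demo_user"], 2)
    with demo_registry.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(rows)
```

Appending rather than overwriting keeps earlier revisions of a batch visible, which is the point of an audit trail.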



## 🎯 Summary & Next Steps

You've just walked through the complete annotation pipeline! Here's what happened:

1. ✅ **Raw Text** → Started with 5 consumer complaints
2. ✅ **Weak Labeling** → Auto-detected symptoms/products using lexicons
3. ✅ **Export** → Persisted weak labels to JSONL
4. ✅ **Label Studio Format** → Converted to task import format
5. ✅ **Human Curation** → (Simulated) annotator corrections
6. ✅ **Gold Standard** → Normalized format with provenance & canonical mapping
7. ✅ **Quality Metrics** → Analyzed label distribution, conflicts, annotator stats
8. ✅ **Precision/Recall** → Compared weak vs gold performance
9. ✅ **Registry** → Tracked batch in provenance audit trail

### Real-World Production Workflow

```bash
# 1. Start Label Studio
label-studio start --no-browser

# 2. Obtain an API token from the Label Studio UI (Account Settings, after enabling the legacy token in the Organization tab), then create the project
curl.exe -X POST http://localhost:8080/api/projects -H "Authorization: Token <REPLACE_WITH_YOUR_TOKEN>" -H "Content-Type: application/json" --data "@data/annotation/config/project_bootstrap.json"

# 3. Bootstrap project
python scripts/annotation/init_label_studio_project.py --name "Adverse Event NER"

# 4. Import weak labels (optional pre-annotations)
python scripts/annotation/cli.py import-weak \
    --weak data/output/notebook_test.jsonl \
    --out data/annotation/exports/tasks.json \
    --push --project-id 1

# 5. Annotators work in Label Studio UI

# 6. Export from Label Studio → data/annotation/raw/export.json

# 7. Convert to gold with provenance
python scripts/annotation/convert_labelstudio.py \
    --input data/annotation/raw/export.json \
    --output data/annotation/exports/gold.jsonl \
    --source batch_2025Q4 \
    --annotator alice \
    --symptom-lexicon data/lexicon/symptoms.csv \
    --product-lexicon data/lexicon/products.csv

# 8. Quality check
python scripts/annotation/cli.py quality \
    --gold data/annotation/exports/gold.jsonl \
    --out data/annotation/reports/quality.json

# 9. Register batch
python scripts/annotation/cli.py register \
    --batch-id batch_2025Q4 \
    --gold data/annotation/exports/gold.jsonl \
    --annotators alice,bob \
    --revision 1
```

### What's Next?

- **Train/Dev/Test Splits**: Partition gold data for model training
- **Token Classification**: Fine-tune BioBERT with BIO tags from gold spans
- **Active Learning**: Use model predictions to prioritize next annotation batch
- **Threshold Tuning**: Adjust fuzzy/Jaccard cutoffs based on comparison metrics
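
As a starting point for the token classification step, gold character spans have to be turned into per-token BIO tags. The sketch below does this with plain whitespace tokenization, which is a simplifying assumption: fine-tuning BioBERT would instead use its subword tokenizer and align tags through the tokenizer's character offsets. `spans_to_bio` is a hypothetical helper.

```python
import re

def spans_to_bio(text, entities):
    """Tag whitespace-delimited tokens with BIO labels from character spans."""
    tokens, tags = [], []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for ent in entities:
            # A token is tagged only if it lies fully inside the entity span.
            if ent["start"] <= m.start() and m.end() <= ent["end"]:
                prefix = "B" if m.start() == ent["start"] else "I"
                tag = f"{prefix}-{ent['label']}"
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags

text = "I experienced nausea after using the vitamin serum"
entities = [
    {"start": 14, "end": 20, "label": "SYMPTOM"},
    {"start": 37, "end": 50, "label": "PRODUCT"},
]
tokens, tags = spans_to_bio(text, entities)
for token, tag in zip(tokens, tags):
    print(f"{token:12} {tag}")
```

One limitation of whitespace tokens is that trailing punctuation stays attached (for example "wash," in complaint 2), so such a token extends past the span end and is left as `O`. Splitting punctuation off before tagging, or aligning on subword offsets, avoids that.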