# Test Notebook 02: Preprocessing Pipeline

**Purpose**: Validate preprocessing on sample data

**Tests**:
1. Load sample parsed data
2. Show text cleaning (before/after)
3. Display processed JSONL format
4. Verify data integrity (no lost samples)
5. Check for any edge cases (empty text, special characters)


In [None]:
import sys
sys.path.append('..')

from pathlib import Path
from src.utils.brat_loader import BRATLoader
from src.utils.preprocess import clean_text, process_events, save_to_jsonl, load_from_jsonl, get_split_statistics
import yaml
import pandas as pd


## 1. Load Sample Data

Load 10 files (the first 5 from each of the `mimic` and `uw` training sources) to test preprocessing.


In [None]:
# Load config
with open('../configs/data.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Load sample data from first 5 files per source
loader = BRATLoader(target_event="Drug")
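sample_events = []
n_files = 0  # number of note files actually parsed

for source in ['mimic', 'uw']:
    dir_path = Path(config['raw_root']) / 'train' / source
    txt_files = sorted(dir_path.glob('*.txt'))[:5]
    
    for txt_file in txt_files:
        ann_file = txt_file.with_suffix('.ann')
        if not ann_file.exists():
            # Skip notes that have no matching annotation file
            print(f"Skipping {txt_file.name}: no matching .ann file")
            continue
        
        with open(txt_file, 'r') as f:
            text = f.read()
        
        ann_data = loader.parse_ann_file(ann_file)
        events = loader.extract_drug_events(
            ann_data=ann_data,
            text=text,
            note_id=txt_file.stem,
            source=source,
            split='train'
        )
        sample_events.extend(events)
        n_files += 1

# Report the real file count instead of assuming all 10 files were found
print(f"Loaded {len(sample_events)} Drug events from {n_files} sample files")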
print(f"Sample event keys: {list(sample_events[0].keys()) if sample_events else 'None'}")


## 2. Test Text Cleaning

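Print the first 200 characters of the first three sample events before and after `clean_text`. The output uses `repr` so that whitespace and escape characters stay visible.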


In [None]:
# Show text cleaning examples
if sample_events:
    for i, event in enumerate(sample_events[:3]):
        print(f"\nExample {i+1}:")
        print("=" * 80)
        print(f"BEFORE cleaning ({len(event['text'])} chars):")
        print(repr(event['text'][:200]))
        
        cleaned = clean_text(event['text'])
        print(f"\nAFTER cleaning ({len(cleaned)} chars):")
        print(repr(cleaned[:200]))
        print("=" * 80)
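
The three examples above do not necessarily hit the edge cases named in test 5 (empty text, special characters). A minimal sketch of targeted probes follows. It assumes a stand-in cleaner, `normalize_whitespace`, which is hypothetical and only collapses whitespace; the real `clean_text` may do more or behave differently. The sample strings are made up for illustration.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Hypothetical stand-in for clean_text: collapse whitespace runs and trim."""
    return re.sub(r"\s+", " ", text).strip()

# Edge-case inputs: empty, whitespace-only, mixed whitespace, non-ASCII, symbols
edge_cases = {
    "empty": "",
    "whitespace_only": " \n\t ",
    "mixed_whitespace": "aspirin\t81 mg\r\n\r\ndaily",
    "non_ascii": "café 5 µg",
    "symbols": "Tylenol #3 <1 tab> q6h PRN",
}

results = {name: normalize_whitespace(raw) for name, raw in edge_cases.items()}

# repr() keeps tabs, newlines, and empty strings visible in the output
for name, cleaned in results.items():
    print(f"{name:18} {edge_cases[name]!r} -> {cleaned!r}")
```

In the notebook itself, replacing `normalize_whitespace` with `clean_text` would run the same probes against the actual cleaner.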


## 3. Process and Save Sample Data

Process the sample events, save them to a JSONL file, and load them back to confirm that the record count survives each stage.


In [None]:
# Process events
processed = process_events(sample_events, clean=True)
print(f"Processed {len(processed)}/{len(sample_events)} events")

# Save to test file
test_output = Path('../data/processed/test_sample.jsonl')
save_to_jsonl(processed, test_output)

# Load back
loaded = load_from_jsonl(test_output)
print(f"Loaded {len(loaded)} events from JSONL")

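# Verify integrity: compare record counts at each stage
print("\nData integrity check:")
print(f"  Original: {len(sample_events)} events")
print(f"  Processed: {len(processed)} events")
print(f"  Loaded: {len(loaded)} events")
print(f"  Processed == original: {len(processed) == len(sample_events)}")
print(f"  Loaded == processed: {len(processed) == len(loaded)}")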
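
The count comparison above shows that no records disappear on save/load, but it does not show that every field survives. A record-by-record comparison can be sketched as below. The helpers `write_jsonl`, `read_jsonl`, and `diff_records` are hypothetical stand-ins built on the standard library, since the implementations of `save_to_jsonl` and `load_from_jsonl` are not shown here. The records are made up, with field names borrowed from the arguments passed to `extract_drug_events`.

```python
import json
import tempfile
from pathlib import Path

_MISSING = object()  # distinguishes an absent key from a key set to None

def write_jsonl(records, path):
    """Hypothetical stand-in for save_to_jsonl: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Hypothetical stand-in for load_from_jsonl: skip blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def diff_records(before, after):
    """Return (index, field) pairs whose values differ.

    Compares records pairwise by position; length differences are
    checked separately with len().
    """
    mismatches = []
    for i, (a, b) in enumerate(zip(before, after)):
        for key in sorted(set(a) | set(b)):
            if a.get(key, _MISSING) != b.get(key, _MISSING):
                mismatches.append((i, key))
    return mismatches

# Made-up records for illustration only
records = [
    {"note_id": "0001", "source": "mimic", "split": "train", "text": "aspirin 81 mg daily"},
    {"note_id": "0002", "source": "uw", "split": "train", "text": "café 5 µg\twith tab"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip.jsonl"
    write_jsonl(records, path)
    first_line = path.read_text(encoding="utf-8").splitlines()[0]
    reloaded = read_jsonl(path)

print(first_line)  # raw JSONL line for the first record
print("count match:", len(records) == len(reloaded))
print("field mismatches:", diff_records(records, reloaded))
```

Assuming `processed` and `loaded` are lists of dicts, `diff_records(processed, loaded)` would apply the same check to the notebook's data, and an empty result would support the "All fields preserved correctly" item in the checklist.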


## ✅ Validation Checklist

**Check before proceeding:**

- Text cleaning works (normalizes whitespace)
- JSONL save/load works correctly
- No data lost in processing
- All fields preserved correctly
- Sample file created successfully

If all checks pass, proceed to full dataset processing!
