# Early Benchmarking Notebook (Reproducible)

**Purpose**: Run a minimal benchmark on 2–3 documents through `/v1/parse`, then capture text recall proxies, the schema-pass result, and the error taxonomy bucket each failure falls into.

**Features**:
- Records the config, seed, and environment hash so a run can be repeated and its results reproduced
- Captures text recall proxies
- Validates parsed output against the schema (schema pass)
- Classifies errors into taxonomy buckets


In [None]:
# Import required modules
import sys

# Add the bench directory to the import path
# (guarded so that re-running this cell does not append it twice)
BENCH_DIR = '/app/bench'
if BENCH_DIR not in sys.path:
    sys.path.append(BENCH_DIR)

# Import our benchmark class
from early_benchmark import EarlyBenchmark

print("Imports successful")


## 1. Initialize Benchmark

Create a benchmark instance with the default configuration:
- **Test Documents**: 3 representative invoices
- **Engines**: pdfplumber, pdfminer
- **Seed**: 42 (for reproducibility)
- **Version**: v1.0
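
How `EarlyBenchmark` derives its environment hash is not shown in this notebook. As a rough sketch of the idea, assuming the hash is simply a digest of interpreter, platform, and config details, something like the following would give a value that stays stable across re-runs on the same setup. The names `environment_hash`, `seed_everything`, and `config` are hypothetical and not part of `early_benchmark`.

```python
import hashlib
import json
import platform
import random
import sys


def environment_hash(extra=None):
    """Return a short SHA-256 digest of interpreter, platform, and extra details.

    Hypothetical helper: the real EarlyBenchmark may hash different inputs.
    """
    info = {
        "python": sys.version,
        "platform": platform.platform(),
        "extra": extra or {},
    }
    # sort_keys makes the serialization, and therefore the digest, deterministic
    payload = json.dumps(info, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def seed_everything(seed):
    """Seed Python's built-in RNG and return the seed so it can be logged."""
    random.seed(seed)
    return seed


# Values taken from the default configuration listed above
config = {
    "seed": seed_everything(42),
    "engines": ["pdfplumber", "pdfminer"],
    "version": "v1.0",
}
env_hash = environment_hash(extra=config)
print(f"Environment hash: {env_hash}")
```

Hashing the config together with the platform details means that a changed seed or engine list also changes the hash, which makes mismatched re-runs easy to spot.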


In [None]:
# Create benchmark instance
benchmark = EarlyBenchmark()

print(f"Run ID: {benchmark.run_id}")
print(f"Environment Hash: {benchmark.environment_hash}")
print(f"PDF Directory: {benchmark.pdf_dir}")
print(f"Ground Truth: {benchmark.gt_csv}")


## 2. Run Minimal Benchmark

Run the benchmark on 2–3 documents and capture:
- **Text Recall Proxies**: Field extraction accuracy
- **Schema Pass Results**: Validation success/failure
- **Error Taxonomy Buckets**: Classification of errors
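
The three captured quantities can be illustrated with a small, self-contained sketch. It is not the implementation inside `early_benchmark`: the recall proxy here is assumed to be the fraction of ground-truth field values found in the extracted text, and the schema (`REQUIRED_FIELDS`), the bucket names, and the 0.8 threshold are all hypothetical.

```python
# Hypothetical schema: required field name -> expected Python type
REQUIRED_FIELDS = {"invoice_number": str, "total": float}


def field_recall(extracted_text, ground_truth):
    """Fraction of ground-truth values found in the text.

    Matching is case-insensitive and ignores differences in whitespace.
    """
    def norm(value):
        return " ".join(str(value).lower().split())

    haystack = norm(extracted_text)
    values = [norm(v) for v in ground_truth.values() if str(v).strip()]
    if not values:
        return 1.0
    return sum(v in haystack for v in values) / len(values)


def schema_pass(record):
    """True if every required field is present with the expected type."""
    return all(isinstance(record.get(k), t) for k, t in REQUIRED_FIELDS.items())


def error_bucket(record, recall, threshold=0.8):
    """Assign one (hypothetical) taxonomy bucket to a parsed document."""
    if record is None:
        return "parse_failure"
    if not schema_pass(record):
        return "schema_violation"
    if recall < threshold:
        return "low_recall"
    return "ok"


# Made-up example document, for illustration only
text = "INVOICE  No. INV-1001\nTotal due: 250.00 EUR"
truth = {"invoice_number": "INV-1001", "total": "250.00", "vendor": "Acme GmbH"}
record = {"invoice_number": "INV-1001", "total": 250.0}

recall = field_recall(text, truth)  # 2 of the 3 values appear in the text
bucket = error_bucket(record, recall)
print(f"recall={recall:.2f} schema_pass={schema_pass(record)} bucket={bucket}")
```

Checking the buckets in a fixed order (parse failure, then schema, then recall) gives every document exactly one bucket, which keeps the later breakdown easy to count.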


In [None]:
# Run the benchmark
results = benchmark.run_minimal_benchmark()
benchmark.results = results

print("Benchmark completed!")


## 3. Results Summary

View the benchmark results, including:
- Success rates by engine
- Schema pass rates
- Text recall proxies
- Error taxonomy breakdown
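
What `print_summary()` computes internally is not shown here. As a sketch of the aggregation it presumably performs, the following groups per-document rows by engine and derives the four quantities listed above. The row layout is an assumption, and the sample rows are made-up placeholders, not benchmark results.

```python
from collections import Counter, defaultdict

# Made-up placeholder rows; the real result structure may differ
rows = [
    {"engine": "pdfplumber", "success": True, "schema_pass": True, "recall": 1.0, "bucket": "ok"},
    {"engine": "pdfplumber", "success": True, "schema_pass": False, "recall": 0.5, "bucket": "schema_violation"},
    {"engine": "pdfminer", "success": False, "schema_pass": False, "recall": 0.0, "bucket": "parse_failure"},
    {"engine": "pdfminer", "success": True, "schema_pass": True, "recall": 1.0, "bucket": "ok"},
]


def summarize(rows):
    """Aggregate per-document rows into per-engine rates and error counts."""
    by_engine = defaultdict(list)
    for row in rows:
        by_engine[row["engine"]].append(row)

    summary = {}
    for engine, items in by_engine.items():
        n = len(items)
        summary[engine] = {
            "documents": n,
            "success_rate": sum(r["success"] for r in items) / n,
            "schema_pass_rate": sum(r["schema_pass"] for r in items) / n,
            "mean_recall": sum(r["recall"] for r in items) / n,
            # Count only real errors; "ok" is not an error bucket
            "errors": dict(Counter(r["bucket"] for r in items if r["bucket"] != "ok")),
        }
    return summary


summary = summarize(rows)
for engine, stats in summary.items():
    print(engine, stats)
```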


In [None]:
# Print detailed summary
benchmark.print_summary()


## 4. Save Results

Save benchmark results for reproducibility and compliance tracking.
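
The file format written by `save_results()` is not documented in this notebook. A minimal sketch of a reproducibility-friendly save, assuming JSON output keyed by run ID, could look like this. `save_run` and its arguments are hypothetical, and the example writes to a temporary directory.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_run(results, run_id, environment_hash, out_dir):
    """Write results plus run metadata to <out_dir>/<run_id>.json and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "run_id": run_id,
        "environment_hash": environment_hash,
        "saved_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    path = out_dir / f"{run_id}.json"
    # sort_keys keeps the file layout stable, which makes diffs between runs readable
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return path


# Placeholder values for illustration only
out_path = save_run({"documents": 3}, "demo-run", "abc123", tempfile.mkdtemp())
loaded = json.loads(out_path.read_text(encoding="utf-8"))
print(f"Saved to {out_path}")
```

Storing the run ID and environment hash inside the file, rather than only in its name, keeps the metadata attached to the results if the file is renamed or copied.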


In [None]:
# Save results
output_file = benchmark.save_results()

print(f"Results saved to: {output_file}")
print(f"Run ID: {benchmark.run_id}")
print(f"Environment Hash: {benchmark.environment_hash}")
