# Ingestion Service Testing (Stage 3)

## Objective
Validate `IngestionService` behavior in isolation, without running Metaflow. Coverage includes PDF discovery, parsing through the service, section diagnostics, and deterministic content-based identity resolution.

## Prerequisites
1. Run the notebook from the repository root.
2. Have at least one PDF in `data/`.
3. Use a Python environment with the project dependencies installed.


In [1]:
# Notebook bootstrap (imports and paths are initialized in the next cell)


In [2]:
from pathlib import Path

from notebooks._utils import data_dir, list_pdfs, print_checkpoint, project_root
from src.ingest.identity import compute_content_hash, extract_identity
from src.ingest.service import IngestionService

ROOT = project_root()
DATA_DIR = data_dir()
PDFS = list_pdfs(DATA_DIR)

print("ROOT:", ROOT)
print("DATA_DIR:", DATA_DIR)
print(f"Found {len(PDFS)} PDFs")
assert PDFS, "No PDF files found in data/"
print_checkpoint("setup complete")


ROOT: /Users/juanbeiroa/Code/thereisnohr
DATA_DIR: /Users/juanbeiroa/Code/thereisnohr/data
Found 5 PDFs
[checkpoint] setup complete


## 1) Discovery behavior

Checks:
1. only PDF files are returned (discovery is recursive, although the cell below does not assert recursion explicitly)
2. the result is deterministic (sorted)


In [3]:
service = IngestionService()
files = service.discover_pdf_files(DATA_DIR)

assert files == sorted(files), "discover_pdf_files must return sorted results"
assert all(p.suffix.lower() == ".pdf" for p in files), "discovery should include only PDFs"

print("First 5 discovered files:")
for p in files[:5]:
    print("-", p.name)
print_checkpoint("discovery assertions passed")


First 5 discovered files:
- BeiroaJuanIgnacioCV.pdf
- CV_Juan_Beiroa-en.pdf
- CV_Juan_Beiroa-es.pdf
- JuanBeiroaCV_en_2026.pdf
- beiroa_linkedin.pdf
[checkpoint] discovery assertions passed
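
The discovery contract described in this section (recursive, PDF-only, sorted) can also be exercised against a throwaway directory tree. The following is a minimal sketch, assuming discovery amounts to a sorted recursive walk; `discover_pdfs_sketch` is a hypothetical helper for illustration and is not the implementation of `IngestionService.discover_pdf_files`.

```python
import tempfile
from pathlib import Path


def discover_pdfs_sketch(root: Path) -> list[Path]:
    """Hypothetical stand-in: recursive, case-insensitive, sorted PDF discovery."""
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    # Two PDFs (one nested, one with an upper-case suffix) and one non-PDF file.
    for rel in ("b.pdf", "nested/a.PDF", "notes.txt"):
        (root / rel).write_bytes(b"")

    found = discover_pdfs_sketch(root)
    found_names = [p.relative_to(root).as_posix() for p in found]
    # Calling discovery twice must yield the same ordered result.
    repeatable = found == discover_pdfs_sketch(root)

print(found_names)
```

Swapping `discover_pdfs_sketch` for `service.discover_pdf_files` in the same fixture would turn the recursion claim into an explicit assertion.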


## 2) Parser invocation through service

Parse a single sample PDF through the service boundary, then inspect the diagnostics attached to the first parsed section.


In [4]:
sample = files[0]
parsed = service.parse_pdf(sample)

assert parsed.sections, "Expected non-empty parsed sections"
assert parsed.parser_version, "Expected parser version"

print("sample:", sample.name)
print("language:", parsed.language)
print("sections:", list(parsed.sections.keys()))
print("links:", parsed.links[:5])
print_checkpoint("service parse assertions passed")


sample: BeiroaJuanIgnacioCV.pdf
language: es
sections: ['general', 'experience']
links: ['http://didacTIC.ar', 'https://didactic.ar', 'https://github.com/fifabsas/talleresfifabsas', 'https://github.com/jbeiroa', 'https://github.com/jbeiroa/udciencia']
[checkpoint] service parse assertions passed


In [5]:
assert parsed.section_items, "Expected non-empty section items"
first_item = parsed.section_items[0]

print("first section type:", first_item.normalized_type)
print("first section confidence:", first_item.confidence)
print("first section signals:", first_item.signals)

assert first_item.signals is not None, "Expected section diagnostics in signals"
assert "diagnostic_flags" in first_item.signals
assert "confidence_inputs" in first_item.signals
assert "recategorization_candidate" in first_item.signals

print_checkpoint("section diagnostics assertions passed")


first section type: general
first section confidence: 0.5
first section signals: {'diagnostic_flags': ['heading_unknown', 'looks_like_contact_block'], 'confidence_inputs': {'word_count': 27, 'heading_mapped_to_general': True}, 'recategorization_candidate': {'section_type': 'contact', 'confidence': 0.8}}
[checkpoint] section diagnostics assertions passed
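
The three key assertions in the cell above stop at the first missing key. A small helper can report every missing key at once, which is handier when iterating on the parser. This is a sketch only: `missing_signal_keys` is a hypothetical name, and the sample mapping is copied from the output above.

```python
from typing import Optional

REQUIRED_SIGNAL_KEYS = (
    "diagnostic_flags",
    "confidence_inputs",
    "recategorization_candidate",
)


def missing_signal_keys(signals: Optional[dict]) -> list:
    """Return the required diagnostic keys that are absent from `signals`."""
    if signals is None:
        return list(REQUIRED_SIGNAL_KEYS)
    return [key for key in REQUIRED_SIGNAL_KEYS if key not in signals]


# Signals copied from the first section item printed above.
sample_signals = {
    "diagnostic_flags": ["heading_unknown", "looks_like_contact_block"],
    "confidence_inputs": {"word_count": 27, "heading_mapped_to_general": True},
    "recategorization_candidate": {"section_type": "contact", "confidence": 0.8},
}

complete = missing_signal_keys(sample_signals)
partial = missing_signal_keys({"diagnostic_flags": []})
print("complete:", complete)
print("partial:", partial)
```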


## 3) Deterministic content identity checks

Verify that identity extraction and content hashing are deterministic: repeated calls on the same parsed document must return the same identity key and the same hash.


In [6]:
identity_1 = extract_identity(parsed)
identity_2 = extract_identity(parsed)

assert identity_1.identity_key == identity_2.identity_key, "identity key must be deterministic"
assert identity_1.confidence >= 0.05
assert identity_1.identity_key.startswith(("candidate:v1:", "resume_content:"))

content_hash_1 = compute_content_hash(parsed.clean_text)
content_hash_2 = compute_content_hash(parsed.clean_text)
assert content_hash_1 == content_hash_2, "content hash must be deterministic"
assert len(content_hash_1) == 64

print("identity_key:", identity_1.identity_key)
print("identity_confidence:", identity_1.confidence)
print("identity_signals keys:", sorted(identity_1.signals.keys()))
print("content_hash:", content_hash_1)
print_checkpoint("identity/content-hash assertions passed")


identity_key: candidate:v1:fd69152a2fbe992fc27d87e8
identity_confidence: 0.973
identity_signals keys: ['confidence_inputs', 'emails', 'identity_key_reason', 'model_fallback_used', 'name_signals', 'phones']
content_hash: e97a65c62f09dd451ca194afdc75baf2187a95e8dd95a61856ca8e850d6606a0
[checkpoint] identity/content-hash assertions passed
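
The `len(content_hash_1) == 64` assertion is consistent with a hex-encoded SHA-256 digest, although this notebook does not show how `compute_content_hash` is implemented. The sketch below assumes SHA-256 over whitespace-normalized UTF-8 text; `content_hash_sketch` and its normalization rule are hypothetical and only illustrate why such a hash is deterministic.

```python
import hashlib


def content_hash_sketch(text: str) -> str:
    """Hypothetical stand-in: SHA-256 hex digest of whitespace-normalized text."""
    normalized = " ".join(text.split())  # collapse every run of whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Same words with different whitespace hash identically under this assumption.
hash_a = content_hash_sketch("Jane  Doe\nPython developer")
hash_b = content_hash_sketch("Jane Doe Python developer")
# A change in content produces a different digest.
hash_c = content_hash_sketch("Jane Doe Rust developer")

print(len(hash_a), hash_a == hash_b, hash_a == hash_c)
```

If the real function does not normalize whitespace, the first two inputs would hash differently; the determinism property checked in the cell above holds either way.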


## 4) Dry-run ingestion simulation notes

By default, this notebook intentionally does not run `ingest_pdf` against the database.

For full persistence checks (identity-key upsert, `content_hash`, section metadata), use `repositories_smoke_testing.ipynb`, run the Metaflow flow, or both.
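
To get a feel for what an identity-key upsert implies without touching the database, the behavior can be simulated in memory. Everything in this sketch is hypothetical (`InMemoryResumeStore`, `StoredResume`, the outcome labels, and the sample keys); the real repository semantics live in the project code and may differ.

```python
from dataclasses import dataclass


@dataclass
class StoredResume:
    identity_key: str
    content_hash: str
    revisions: int = 1


class InMemoryResumeStore:
    """Hypothetical in-memory stand-in for a repository keyed by identity_key."""

    def __init__(self) -> None:
        self._rows = {}  # identity_key -> StoredResume

    def upsert(self, identity_key: str, content_hash: str) -> str:
        """Insert a new row or update the existing one; return what happened."""
        row = self._rows.get(identity_key)
        if row is None:
            self._rows[identity_key] = StoredResume(identity_key, content_hash)
            return "inserted"
        if row.content_hash == content_hash:
            return "unchanged"  # same candidate, same content
        row.content_hash = content_hash
        row.revisions += 1
        return "updated"  # same candidate, new content

    def __len__(self) -> int:
        return len(self._rows)


store = InMemoryResumeStore()
outcomes = [
    store.upsert("candidate:v1:example", "hash-1"),
    store.upsert("candidate:v1:example", "hash-1"),
    store.upsert("candidate:v1:example", "hash-2"),
]
print(outcomes, len(store))
```

The point of the simulation is that re-ingesting the same candidate never creates a second row, while a changed `content_hash` is what signals new content.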


## Summary (fill in after the run)

- Discovery assertions: pass/fail
- Parse-through-service assertions: pass/fail
- Section diagnostics assertions: pass/fail
- Deterministic identity/content-hash assertions: pass/fail


## Next actions

1. If identity extraction looks wrong, iterate on heuristics in `src/ingest/identity.py` and lock behavior with `tests/test_identity_resolution.py`.
2. If section diagnostics look wrong, refine parser signal builders in `src/ingest/parser.py` and extend `tests/test_ingest_parser.py`.
3. If persistence behavior diverges, validate repository wiring in `repositories_smoke_testing.ipynb`.


## Optional: Inspect Metaflow Run Report + Card

After running the flow:

```bash
uv run python src/ingest/pdf_ingestion_flow.py run --input-dir data --pattern '*.pdf'
```

You can inspect:
1. `run_report` artifact from the flow end step (machine-readable metrics).
2. `run_metrics` card for the styled dashboard (status counts, confidence summary, reasons, per-file sample).

The report includes: `run_meta`, `status_counts`, `step_timing_summary`, `ingest_quality`, `reason_breakdown`, and `files`.
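
Once the `run_report` artifact has been loaded as a dictionary, its top-level shape can be checked against the keys listed above. This is a sketch under that assumption: `report_key_diff` is a hypothetical helper, and the placeholder payload carries no real metrics.

```python
EXPECTED_REPORT_KEYS = {
    "run_meta",
    "status_counts",
    "step_timing_summary",
    "ingest_quality",
    "reason_breakdown",
    "files",
}


def report_key_diff(report: dict) -> tuple:
    """Return (missing, unexpected) top-level keys, each as a sorted list."""
    keys = set(report)
    return sorted(EXPECTED_REPORT_KEYS - keys), sorted(keys - EXPECTED_REPORT_KEYS)


# Placeholder payload with empty values and "files" deliberately left out;
# a real report comes from the flow run.
placeholder_report = {key: {} for key in EXPECTED_REPORT_KEYS if key != "files"}
missing, unexpected = report_key_diff(placeholder_report)
print("missing:", missing)
print("unexpected:", unexpected)
```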
