feat(pdf-quality): content-quality pipeline (feat-0007 slices 1-4) #235

Merged
mrviduus merged 7 commits into main from feat/pdf-content-quality
May 22, 2026

@mrviduus
Owner

Summary

Make PDF-extracted books readable. Heuristics (PdfPig + post-processing) reach
~70-75%; the gap to ~90% is semantic — running headers leaking into body text,
paragraphs fragmented into one-word <p>s, unmerged line-wrap hyphenation,
inlined footnotes. This adds a gated Claude cleanup pass for the chapters
that need it, and logs every fix as training data for a future deterministic
ratchet. Slices 1-4 of feat-0007.

Marker (the ML PDF→markdown pipeline) was evaluated first and shelved
(shelf/marker-integration) — the prod GTX 1650 Ti's 4 GB VRAM can't hold the
Surya model set with inference headroom (verified: CUDA OOM).

Slice / design

docs/05-features/feat-0007-pdf-content-quality.md — full architecture + the
5-slice plan. This PR is slices 1-4 (slice 5, the heuristic ratchet, needs
real (messy→clean) pairs which only exist after Phase 3 runs on prod).

Changes

Slice 1 — analyzer (backend/src/Extraction/.../Quality/)

  • ChapterContentQualityAnalyzer — deterministic 0-100 score + issue codes.
    Pure C#, no I/O. The gate for the LLM pass.
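
To make the shape of such a scorer concrete, here is a minimal Python sketch of a deterministic, I/O-free analyzer. It is an illustration only: the real analyzer is C#, and the issue-code names, penalty weights, and the 30% fragmentation cutoff below are assumptions, not values taken from this PR. It covers three of the defects the PR names (fragmented paragraphs, orphan page numbers, unmerged hyphenation).

```python
import re
from html.parser import HTMLParser


class _ParagraphCollector(HTMLParser):
    """Collects the stripped text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            self.paragraphs.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def analyze(html):
    """Return (score 0-100, issue codes). Codes and weights are hypothetical."""
    collector = _ParagraphCollector()
    collector.feed(html)
    paras = [p for p in collector.paragraphs if p]
    if not paras:
        return 100, []

    issues, penalty = [], 0

    # Fragmented paragraphs: a large share of one- or two-word <p> elements.
    short = sum(1 for p in paras if len(p.split()) <= 2 and not p.isdigit())
    if short / len(paras) > 0.3:
        issues.append("FRAGMENTED_PARAGRAPHS")
        penalty += 40

    # Orphan page numbers: paragraphs that consist of digits only.
    if any(p.isdigit() for p in paras):
        issues.append("ORPHAN_PAGE_NUMBER")
        penalty += 15

    # Unmerged line-wrap hyphenation: "exam- ple" left inside body text.
    if any(re.search(r"[A-Za-z]{2,}-\s+[a-z]{2,}", p) for p in paras):
        issues.append("UNMERGED_HYPHENATION")
        penalty += 15

    return max(0, 100 - penalty), issues
```

Because the function takes a string and returns a value with no side effects, the same input always yields the same score, which is what lets it act as a cheap gate in front of the LLM pass.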

Slice 2 — persist score (schema + ingestion)

  • ContentQualityScore column on Chapter + UserChapter; set in both
    ingestion paths. BookQualityJob gains Phase 3 counter columns. EF migration.

Slice 3 — Claude cleanup (infra/scripts/)

  • quality-poll.sh Phase 3: flagged chapters → claude CLI cleanup →
    pdf-cleanup-gate.py (word-multiset diff rejects hallucination /
    over-deletion) → write cleaned HTML → log (messy→clean) pair.
  • Internal UpdateQualityJobRequest carries the new counters.
  • Off by default — CONTENT_CLEANUP_ENABLED in .env.
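
The word-multiset diff can be sketched roughly as follows. This is not the contents of pdf-cleanup-gate.py; it is a minimal stdlib-only illustration of the idea, using the 3% novel-token tolerance and 70% retention floor quoted in the commit notes. The function names, the tokenizer regex, and the choice of denominators (novel tokens relative to the cleaned text, retained tokens relative to the original) are assumptions.

```python
import re
from collections import Counter

NOVEL_TOKEN_TOLERANCE = 0.03  # max share of cleaned tokens absent from the original
RETENTION_FLOOR = 0.70        # min share of original tokens that must survive

_TAG = re.compile(r"<[^>]+>")
_WRAP_HYPHEN = re.compile(r"(\w)-\s+(\w)")


def word_multiset(html):
    """Strip tags, join line-wrap hyphens, and count lowercase word tokens."""
    text = _TAG.sub(" ", html)
    # Join "exam- ple" into "example" so a legitimate merge is not a new word.
    text = _WRAP_HYPHEN.sub(r"\1\2", text)
    return Counter(w.lower() for w in re.findall(r"\w+", text))


def gate(original_html, cleaned_html):
    """Return (accepted, reason) for a proposed cleanup."""
    before = word_multiset(original_html)
    after = word_multiset(cleaned_html)
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before == 0 or total_after == 0:
        return False, "empty"

    novel = sum((after - before).values())     # tokens the cleanup introduced
    retained = sum((before & after).values())  # tokens present on both sides

    if novel / total_after > NOVEL_TOKEN_TOLERANCE:
        return False, "hallucination"
    if retained / total_before < RETENTION_FLOOR:
        return False, "over-deletion"
    return True, "ok"
```

Comparing multisets rather than sequences means that merging fragmented paragraphs or moving a footnote does not count against the cleanup, while added or dropped words do.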

Slice 4 — observability

  • Admin Book Quality job detail shows Phase 3 results (cleaned/rejected/skipped).
  • Worker logs a per-book content-quality score distribution at ingest.
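
As a rough illustration of the per-book distribution, the sketch below computes the three figures the commit notes mention (count, average, and how many chapters fall below 60). The real worker is C#; the function name and the returned dictionary shape here are hypothetical.

```python
from statistics import mean


def summarize_scores(scores, low_threshold=60):
    """Summarize chapter scores; None entries (unscored chapters) are ignored."""
    scored = [s for s in scores if s is not None]
    if not scored:
        return {"count": 0, "avg": None, "below": 0}
    return {
        "count": len(scored),
        "avg": round(mean(scored), 1),
        "below": sum(1 for s in scored if s < low_threshold),
    }
```

For example, `summarize_scores([95, 40, None, 72, 55])` counts four scored chapters, two of them below 60.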

Drive-by fix

  • SsgRouteProviderTests had a stale test asserting pre-noindex-fix behavior
    (non-indexable books dropped from SSG routes — they are intentionally kept
    with a noindex meta). Split into two correct tests.

Tests

  • Unit: ChapterContentQualityAnalyzerTests (12), SsgRouteProviderTests
    reworked. Full suite green — 217 unit, 207 extraction.
  • pdf-cleanup-gate.py smoke-tested (accept / hallucination-reject /
    over-deletion-reject).
  • Solution + admin frontend build clean.

Rollback plan

Slices 1-2 + 4 are inert data/observability — safe to ship. Slice 3 (the only
behavioural change) is fully gated by CONTENT_CLEANUP_ENABLED, default off —
deploying this PR changes nothing until the flag is set on prod. Rollback =
flip the flag off.

Notes

  • Slice 5 (heuristic ratchet) is intentionally out of scope — it needs the
    (messy→clean) pairs that Phase 3 produces. Sequence: merge → deploy →
    enable flag on prod → accumulate pairs → slice 5.
  • /check not run as a single pass; verified piecewise (builds + unit +
    extraction suites green per slice).

🤖 Generated with Claude Code

mrviduus and others added 7 commits May 22, 2026 12:32
Architecture + 5-slice breakdown for feat-0007. Marker shelved (4GB VRAM
infeasible); this reuses quality-poll.sh + BookQualityJob + internal
chapter endpoints — extends rather than builds new.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deterministic 0-100 content-quality score + issue codes for extracted
chapter HTML. Detects the recurring PDF-extraction defects: fragmented
paragraphs, running headers in body, unmerged hyphenation, orphan page
numbers, inlined footnotes.

Pure C#, no I/O — the gate that decides which chapters warrant an LLM
cleanup pass (slice 3). 12 unit tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ContentQualityScore (int?) on Chapter + UserChapter — set at chapter
  creation in both ingestion paths via ChapterContentQualityAnalyzer.
- BookQualityJob: ContentChaptersCleaned/Rejected/Skipped (int?) —
  tracking fields for the Phase 3 cleanup pass (written by the poller).
- EF migration AddChapterContentQualityScore (5 nullable columns).

Score is informational until slice 3 acts on it. Runs on every format;
EPUB/FB2 score high, PDF surfaces the low ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GetRoutesAsync_ExcludesNonIndexableContent asserted non-indexable books
are dropped from SSG routes. But AddBookRoutesAsync intentionally routes
every Published edition (renderer emits noindex meta; filtering would 404
the slug) — the test was never updated when that fix landed.

Split into two tests reflecting actual design:
- NonIndexableBook_StillRouted — book route emitted regardless of Indexable
- ExcludesNonIndexableAuthorsAndGenres — author/genre listings DO filter it

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
quality-poll.sh Phase 3 — after structure fixes, for each chapter scoring
below CONTENT_QUALITY_THRESHOLD:
  fetch HTML → Claude CLI (fix structure, preserve content verbatim)
  → pdf-cleanup-gate.py (word-multiset diff: reject hallucination /
    over-deletion) → PUT cleaned HTML → log (messy→clean) pair.

- pdf-cleanup-gate.py: deterministic preservation gate, stdlib-only.
  Joins line-wrap hyphens before tokenizing so legit merges don't read
  as new words. 3% novel-token tolerance, 70% retention floor.
- InternalEndpoints: UpdateQualityJobRequest + handler carry the three
  ContentChapters* counters; GetQualityJob returns them.
- Off by default — CONTENT_CLEANUP_ENABLED in .env gates the whole phase.
  Pairs land in data/pdf-cleanup-dataset/ for the slice-5 ratchet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- AdminBookQualityEndpoints: QualityJobDetailDto carries the three
  ContentChapters* counters; GetJob returns them; RetryJob resets them.
- Admin BookQualityPage detail panel shows "Content cleanup — cleaned /
  rejected / skipped" when the job ran Phase 3.
- Worker logs a content-quality score distribution per book at ingest
  (count, avg, how many below 60) — both edition and user-book paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mrviduus mrviduus merged commit 3aa37e9 into main May 22, 2026
5 checks passed
@mrviduus mrviduus deleted the feat/pdf-content-quality branch May 22, 2026 19:29