feat(pdf-quality): content-quality pipeline (feat-0007 slices 1-4) #235
Merged
Architecture + 5-slice breakdown for feat-0007. Marker shelved (4GB VRAM infeasible); this reuses quality-poll.sh + BookQualityJob + internal chapter endpoints — extends rather than builds new. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deterministic 0-100 content-quality score + issue codes for extracted chapter HTML. Detects the recurring PDF-extraction defects: fragmented paragraphs, running headers in body, unmerged hyphenation, orphan page numbers, inlined footnotes. Pure C#, no I/O — the gate that decides which chapters warrant an LLM cleanup pass (slice 3). 12 unit tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
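To make the idea of a deterministic score plus issue codes concrete, here is a minimal Python sketch of the general technique. It is not the C# ChapterContentQualityAnalyzer: the issue-code names, the weights, the regexes, and the `score_chapter` function are all assumptions made for illustration, and only three of the listed defects are covered.

```python
import re
from html.parser import HTMLParser

# Hypothetical patterns; the real analyzer's rules are not shown in this PR text.
PAGE_NUMBER = re.compile(r"\d{1,4}")
WRAP_HYPHEN = re.compile(r"[A-Za-z]-\s+[a-z]")


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element (inline tags are flattened)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            self.paragraphs.append("".join(self._buf).strip())
            self._buf = None


def score_chapter(html):
    """Return (score 0-100, issue codes). Pure function, no I/O."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    paras = [p for p in collector.paragraphs if p]
    if not paras:
        return 0, ["EMPTY_CHAPTER"]

    total = len(paras)
    page_numbers = sum(1 for p in paras if PAGE_NUMBER.fullmatch(p))
    # Very short paragraphs that are not bare page numbers count as fragments.
    fragments = sum(
        1 for p in paras
        if len(p.split()) <= 2 and not PAGE_NUMBER.fullmatch(p)
    )
    hyphens = sum(1 for p in paras if WRAP_HYPHEN.search(p))

    # (issue code, affected paragraphs, assumed maximum penalty)
    checks = [
        ("FRAGMENTED_PARAGRAPHS", fragments, 60),
        ("ORPHAN_PAGE_NUMBER", page_numbers, 20),
        ("UNMERGED_HYPHENATION", hyphens, 20),
    ]
    penalty = sum(weight * count / total for _, count, weight in checks)
    issues = [code for code, count, _ in checks if count]
    return max(0, min(100, round(100 - penalty))), issues
```

Because the function has no I/O and no randomness, the same HTML always yields the same score, which is what lets it act as a gate in front of the more expensive LLM pass.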
- ContentQualityScore (int?) on Chapter + UserChapter: set at chapter creation in both ingestion paths via ChapterContentQualityAnalyzer.
- BookQualityJob: ContentChaptersCleaned/Rejected/Skipped (int?), tracking fields for the Phase 3 cleanup pass (written by the poller).
- EF migration AddChapterContentQualityScore (5 nullable columns).
Score is informational until slice 3 acts on it. Runs on every format; EPUB/FB2 score high, PDF surfaces the low ones.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GetRoutesAsync_ExcludesNonIndexableContent asserted non-indexable books are dropped from SSG routes. But AddBookRoutesAsync intentionally routes every Published edition (renderer emits noindex meta; filtering would 404 the slug), and the test was never updated when that fix landed. Split into two tests reflecting actual design:
- NonIndexableBook_StillRouted: book route emitted regardless of Indexable
- ExcludesNonIndexableAuthorsAndGenres: author/genre listings DO filter it
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
quality-poll.sh Phase 3 — after structure fixes, for each chapter scoring
below CONTENT_QUALITY_THRESHOLD:
fetch HTML → Claude CLI (fix structure, preserve content verbatim)
→ pdf-cleanup-gate.py (word-multiset diff: reject hallucination /
over-deletion) → PUT cleaned HTML → log (messy→clean) pair.
- pdf-cleanup-gate.py: deterministic preservation gate, stdlib-only.
Joins line-wrap hyphens before tokenizing so legit merges don't read
as new words. 3% novel-token tolerance, 70% retention floor.
- InternalEndpoints: UpdateQualityJobRequest + handler carry the three
ContentChapters* counters; GetQualityJob returns them.
- Off by default — CONTENT_CLEANUP_ENABLED in .env gates the whole phase.
Pairs land in data/pdf-cleanup-dataset/ for the slice-5 ratchet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
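The preservation gate described in this commit can be sketched as follows. This is an illustrative reconstruction, not the contents of pdf-cleanup-gate.py: the 3% and 70% figures and the hyphen-joining step come from the commit message, while the function names, the regexes, the reason strings, and the exact way each ratio is computed are assumptions.

```python
import re
from collections import Counter

NOVEL_TOLERANCE = 0.03  # from the commit message: 3% novel-token tolerance
RETENTION_FLOOR = 0.70  # from the commit message: 70% retention floor

TAG = re.compile(r"<[^>]+>")
WRAP_HYPHEN = re.compile(r"(\w)-\s+(\w)")
WORD = re.compile(r"\w+")


def tokenize(html):
    """Strip tags, join line-wrap hyphens, return a lowercase word multiset."""
    text = TAG.sub(" ", html)
    # Join "implemen- tation" first so a legitimate merge is not a new word.
    text = WRAP_HYPHEN.sub(r"\1\2", text)
    return Counter(word.lower() for word in WORD.findall(text))


def gate(original_html, cleaned_html):
    """Return (accepted, reason) by comparing word multisets."""
    before = tokenize(original_html)
    after = tokenize(cleaned_html)
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before == 0 or total_after == 0:
        return False, "empty"

    # Multiset difference: words in the cleaned text with no source counterpart.
    novel = sum((after - before).values())
    # Multiset intersection: original words that survived the cleanup.
    retained = sum((before & after).values())

    if novel / total_after > NOVEL_TOLERANCE:
        return False, "hallucination"
    if retained / total_before < RETENTION_FLOOR:
        return False, "over-deletion"
    return True, "ok"
```

Comparing multisets rather than sets means repeated words still count, so deleting nine of ten occurrences of a word lowers retention even though the word itself survives. The retention floor is deliberately loose because removing running headers and page numbers is expected to drop some tokens.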
- AdminBookQualityEndpoints: QualityJobDetailDto carries the three ContentChapters* counters; GetJob returns them; RetryJob resets them.
- Admin BookQualityPage detail panel shows "Content cleanup — cleaned / rejected / skipped" when the job ran Phase 3.
- Worker logs a content-quality score distribution per book at ingest (count, avg, how many below 60) on both edition and user-book paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
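The per-book distribution mentioned above (count, average, number below 60) is a small aggregation. A sketch in Python, assuming scores arrive as a list of optional integers; the `summarize_scores` name and the returned dictionary shape are hypothetical, and the actual Worker logging is C# and not shown here.

```python
def summarize_scores(scores, threshold=60):
    """Summarize chapter scores for one book, ignoring chapters with no score."""
    known = [s for s in scores if s is not None]
    if not known:
        return {"count": 0, "avg": None, "below": 0}
    return {
        "count": len(known),
        "avg": sum(known) / len(known),
        "below": sum(1 for s in known if s < threshold),
    }


# Example: one unscored chapter, one chapter under the threshold.
summary = summarize_scores([100, 40, None, 70])
print(f"chapters={summary['count']} avg={summary['avg']:.1f} below60={summary['below']}")
```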
This was referenced May 22, 2026
Summary
Make PDF-extracted books readable. Heuristics (PdfPig + post-processing) reach
~70-75%; the gap to ~90% is semantic: running headers leaking into body text,
paragraphs fragmented into one-word <p>s, unmerged line-wrap hyphenation, and
inlined footnotes. This adds a gated Claude cleanup pass for the chapters
that need it, and logs every fix as training data for a future deterministic
ratchet. Slices 1-4 of feat-0007.
Marker (the ML PDF→markdown pipeline) was evaluated first and shelved
(shelf/marker-integration): the prod GTX 1650 Ti's 4 GB VRAM can't hold the
Surya model set with inference headroom (verified: CUDA OOM).
Slice / design
docs/05-features/feat-0007-pdf-content-quality.md: full architecture + the
5-slice plan. This PR is slices 1-4 (slice 5, the heuristic ratchet, needs
real (messy→clean) pairs, which only exist after Phase 3 runs on prod).
Changes
Slice 1: analyzer (backend/src/Extraction/.../Quality/)
- ChapterContentQualityAnalyzer: deterministic 0-100 score + issue codes. Pure C#, no I/O. The gate for the LLM pass.
Slice 2: persist score (schema + ingestion)
- ContentQualityScore column on Chapter + UserChapter; set in both ingestion paths.
- BookQualityJob gains Phase 3 counter columns. EF migration.
Slice 3: Claude cleanup (infra/scripts/)
- quality-poll.sh Phase 3: flagged chapters → claude CLI cleanup → pdf-cleanup-gate.py (word-multiset diff rejects hallucination / over-deletion) → write cleaned HTML → log (messy→clean) pair.
- UpdateQualityJobRequest carries the new counters.
- Gated by CONTENT_CLEANUP_ENABLED in .env (default off).
Slice 4: observability
Drive-by fix
SsgRouteProviderTests had a stale test asserting pre-noindex-fix behavior
(non-indexable books dropped from SSG routes; they are intentionally kept
with a noindex meta). Split into two correct tests.
Tests
- ChapterContentQualityAnalyzerTests (12), SsgRouteProviderTests reworked. Full suite green: 217 unit, 207 extraction.
- pdf-cleanup-gate.py smoke-tested (accept / hallucination-reject / over-deletion-reject).
Rollback plan
Slices 1-2 + 4 are inert data/observability and safe to ship. Slice 3 (the only
behavioural change) is fully gated by CONTENT_CLEANUP_ENABLED, default off:
deploying this PR changes nothing until the flag is set on prod. Rollback =
flip the flag off.
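The rollback property follows from the flag being checked before any chapter is touched. A minimal Python sketch of that control flow, under stated assumptions: the real Phase 3 lives in quality-poll.sh, so `run_phase3`, `cleanup_enabled`, the injected `clean` / `gate` / `save` callables, the accepted truthy values, and the meaning given to "skipped" here are all hypothetical. Only the CONTENT_CLEANUP_ENABLED name and the cleaned / rejected / skipped counters come from this PR.

```python
import os


def cleanup_enabled(env=None):
    """Treat the flag as off unless explicitly set to a truthy value (assumed set)."""
    env = os.environ if env is None else env
    value = env.get("CONTENT_CLEANUP_ENABLED", "")
    return value.strip().lower() in {"1", "true", "yes"}


def run_phase3(chapters, threshold, clean, gate, save, env=None):
    """chapters: iterable of (chapter_id, score, html) tuples.

    clean(html) returns candidate HTML, gate(original, candidate) returns a
    bool, and save(chapter_id, html) persists the accepted result.
    """
    counters = {"cleaned": 0, "rejected": 0, "skipped": 0}
    if not cleanup_enabled(env):
        # Flag off: no chapter is read, rewritten, or saved.
        return counters

    for chapter_id, score, html in chapters:
        # Assumption: unscored or good-enough chapters count as skipped.
        if score is None or score >= threshold:
            counters["skipped"] += 1
            continue
        candidate = clean(html)
        if gate(html, candidate):
            save(chapter_id, candidate)
            counters["cleaned"] += 1
        else:
            counters["rejected"] += 1
    return counters
```

With the flag unset the function returns zeroed counters and never calls `clean` or `save`, which mirrors the claim that deploying changes nothing until the flag is set.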
Notes
(messy→clean) pairs that Phase 3 produces. Sequence: merge → deploy →
enable flag on prod → accumulate pairs → slice 5.
/check not run as a single pass; verified piecewise (builds + unit + extraction suites green per slice).
🤖 Generated with Claude Code