feat(pdf-quality): content-quality pipeline (feat-0007 slices 1-4) by mrviduus · Pull Request #235 · mrviduus/textstack

mrviduus · 2026-05-22T19:11:57Z

Summary

Make PDF-extracted books readable. Heuristics (PdfPig + post-processing) reach
~70-75%; the gap to ~90% is semantic — running headers leaking into body text,
paragraphs fragmented into one-word <p>s, unmerged line-wrap hyphenation,
inlined footnotes. This adds a gated Claude cleanup pass for the chapters
that need it, and logs every fix as training data for a future deterministic
ratchet. Slices 1-4 of feat-0007.

Marker (the ML PDF→markdown pipeline) was evaluated first and shelved
(shelf/marker-integration) — the prod GTX 1650 Ti's 4 GB VRAM can't hold the
Surya model set with inference headroom (verified: CUDA OOM).

Slice / design

docs/05-features/feat-0007-pdf-content-quality.md — full architecture + the
5-slice plan. This PR is slices 1-4 (slice 5, the heuristic ratchet, needs
real (messy→clean) pairs which only exist after Phase 3 runs on prod).

Changes

Slice 1 — analyzer (backend/src/Extraction/.../Quality/)

ChapterContentQualityAnalyzer — deterministic 0-100 score + issue codes.
Pure C#, no I/O. The gate for the LLM pass.
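
A rough illustration of what a deterministic score plus issue codes can look like. This is a Python sketch, not the C# ChapterContentQualityAnalyzer itself: the issue-code names, penalty weights, and detection thresholds below are all assumptions made for the example, and only three of the defect types are covered.

```python
import re
from dataclasses import dataclass, field

# Hypothetical issue codes and penalty weights (not taken from the PR).
PENALTIES = {
    "FRAGMENTED_PARAGRAPHS": 30,
    "UNMERGED_HYPHENATION": 20,
    "ORPHAN_PAGE_NUMBERS": 15,
}


@dataclass
class QualityResult:
    score: int
    issues: list = field(default_factory=list)


def analyze_chapter(html: str) -> QualityResult:
    """Score chapter HTML from 0 to 100. Pure function, no I/O."""
    raw = re.findall(r"<p\b[^>]*>(.*?)</p>", html, flags=re.S | re.I)
    # Strip inline tags and drop empty paragraphs.
    paragraphs = [re.sub(r"<[^>]+>", "", p).strip() for p in raw]
    paragraphs = [p for p in paragraphs if p]
    if not paragraphs:
        return QualityResult(score=100)

    issues = []
    # Fragmentation: many paragraphs of one or two words (assumed cutoff: 30%).
    short = sum(1 for p in paragraphs if len(p.split()) <= 2 and not p.isdigit())
    if short / len(paragraphs) > 0.3:
        issues.append("FRAGMENTED_PARAGRAPHS")
    # Unmerged line-wrap hyphenation such as "exam- ple".
    if any(re.search(r"[a-z]-\s+[a-z]", p) for p in paragraphs):
        issues.append("UNMERGED_HYPHENATION")
    # Orphan page numbers: a paragraph that is only digits.
    if any(p.isdigit() for p in paragraphs):
        issues.append("ORPHAN_PAGE_NUMBERS")

    score = max(0, 100 - sum(PENALTIES[code] for code in issues))
    return QualityResult(score=score, issues=issues)
```

Because the function depends only on its input string, the same chapter always gets the same score, which is what lets it act as a cheap gate ahead of the LLM pass.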

Slice 2 — persist score (schema + ingestion)

ContentQualityScore column on Chapter + UserChapter; set in both
ingestion paths. BookQualityJob gains Phase 3 counter columns. EF migration.

Slice 3 — Claude cleanup (infra/scripts/)

quality-poll.sh Phase 3: flagged chapters → claude CLI cleanup →
pdf-cleanup-gate.py (word-multiset diff rejects hallucination /
over-deletion) → write cleaned HTML → log (messy→clean) pair.
Internal UpdateQualityJobRequest carries the new counters.
Off by default — CONTENT_CLEANUP_ENABLED in .env.
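
The word-multiset gate can be sketched as follows. This is an independent illustration rather than the contents of pdf-cleanup-gate.py: the 3% and 70% figures come from the slice 3 commit message, while the function names, the tokenizer, and the choice of denominators for the two ratios are assumptions.

```python
import re
from collections import Counter

NOVEL_TOLERANCE = 0.03  # 3% novel-token tolerance (from the commit message)
RETENTION_FLOOR = 0.70  # 70% retention floor (from the commit message)


def tokens(html: str) -> Counter:
    """Lower-cased word multiset of the visible text."""
    text = re.sub(r"<[^>]+>", " ", html)
    # Join line-wrap hyphens ("river- bank") so a legitimate merge
    # in the cleaned text does not count as a new word.
    text = re.sub(r"(\w)-\s+(\w)", r"\1\2", text)
    return Counter(re.findall(r"[^\W_]+", text.lower()))


def gate(messy_html: str, clean_html: str):
    """Return (accepted, reason) for a proposed cleanup."""
    before, after = tokens(messy_html), tokens(clean_html)
    total_before, total_after = sum(before.values()), sum(after.values())
    if total_before == 0 or total_after == 0:
        return False, "empty"
    novel = sum((after - before).values())  # words the cleanup added
    kept = sum((after & before).values())   # words that survived
    # Assumed denominators: novel share of the output, kept share of the input.
    if novel / total_after > NOVEL_TOLERANCE:
        return False, "hallucination"
    if kept / total_before < RETENTION_FLOOR:
        return False, "over-deletion"
    return True, "accept"
```

Comparing multisets rather than sequences means the cleanup is free to reorder and merge paragraphs, but cannot invent or silently drop much text.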

Slice 4 — observability

Admin Book Quality job detail shows Phase 3 results (cleaned/rejected/skipped).
Worker logs a per-book content-quality score distribution at ingest.
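
For illustration, the per-book distribution line could be computed roughly like this. The summary fields (count, average, number below 60) follow the slice 4 commit message; the function name and the log line format are assumptions, and the real worker is C#.

```python
from statistics import mean

LOW_SCORE_CUTOFF = 60  # "below 60" comes from the commit message


def score_distribution(scores):
    """Summarize chapter scores for one book as a single log line."""
    # The score column is nullable, so skip chapters without a score.
    known = [s for s in scores if s is not None]
    if not known:
        return "content-quality: no scored chapters"
    below = sum(1 for s in known if s < LOW_SCORE_CUTOFF)
    return (
        f"content-quality: count={len(known)} "
        f"avg={mean(known):.1f} below_{LOW_SCORE_CUTOFF}={below}"
    )
```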

Drive-by fix

SsgRouteProviderTests had a stale test asserting pre-noindex-fix behavior
(non-indexable books dropped from SSG routes — they are intentionally kept
with a noindex meta). Split into two correct tests.

Tests

Unit: ChapterContentQualityAnalyzerTests (12), SsgRouteProviderTests
reworked. Full suite green — 217 unit, 207 extraction.
pdf-cleanup-gate.py smoke-tested (accept / hallucination-reject /
over-deletion-reject).
Solution + admin frontend build clean.

Rollback plan

Slices 1-2 + 4 are inert data/observability — safe to ship. Slice 3 (the only
behavioural change) is fully gated by CONTENT_CLEANUP_ENABLED, default off —
deploying this PR changes nothing until the flag is set on prod. Rollback =
flip the flag off.
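
A minimal sketch of why the flag makes the rollback trivial: with CONTENT_CLEANUP_ENABLED unset, chapter selection returns nothing, so no cleanup call is ever made. The real gating lives in quality-poll.sh; the Python function names and the set of accepted truthy values here are assumptions.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}  # assumed accepted values


def cleanup_enabled(env=None) -> bool:
    """Unset or anything not truthy counts as off (the default)."""
    env = os.environ if env is None else env
    return env.get("CONTENT_CLEANUP_ENABLED", "").strip().lower() in TRUTHY


def chapters_to_clean(scores: dict, threshold: int, env=None) -> list:
    """Pick chapter ids scoring below the threshold, only if the flag is on."""
    if not cleanup_enabled(env):
        return []
    return [cid for cid, s in scores.items() if s is not None and s < threshold]
```

The `threshold` argument plays the role of CONTENT_QUALITY_THRESHOLD from the slice 3 commit.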

Notes

Slice 5 (heuristic ratchet) is intentionally out of scope — it needs the
(messy→clean) pairs that Phase 3 produces. Sequence: merge → deploy →
enable flag on prod → accumulate pairs → slice 5.
/check not run as a single pass; verified piecewise (builds + unit +
extraction suites green per slice).

🤖 Generated with Claude Code

Architecture + 5-slice breakdown for feat-0007. Marker shelved (4GB VRAM infeasible); this reuses quality-poll.sh + BookQualityJob + internal chapter endpoints — extends rather than builds new. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Deterministic 0-100 content-quality score + issue codes for extracted chapter HTML. Detects the recurring PDF-extraction defects: fragmented paragraphs, running headers in body, unmerged hyphenation, orphan page numbers, inlined footnotes. Pure C#, no I/O — the gate that decides which chapters warrant an LLM cleanup pass (slice 3). 12 unit tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- ContentQualityScore (int?) on Chapter + UserChapter: set at chapter creation in both ingestion paths via ChapterContentQualityAnalyzer.
- BookQualityJob: ContentChaptersCleaned/Rejected/Skipped (int?), tracking fields for the Phase 3 cleanup pass (written by the poller).
- EF migration AddChapterContentQualityScore (5 nullable columns).
Score is informational until slice 3 acts on it. Runs on every format; EPUB/FB2 score high, PDF surfaces the low ones.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GetRoutesAsync_ExcludesNonIndexableContent asserted non-indexable books are dropped from SSG routes. But AddBookRoutesAsync intentionally routes every Published edition (renderer emits noindex meta; filtering would 404 the slug), and the test was never updated when that fix landed.
Split into two tests reflecting actual design:
- NonIndexableBook_StillRouted: book route emitted regardless of Indexable
- ExcludesNonIndexableAuthorsAndGenres: author/genre listings DO filter it
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

quality-poll.sh Phase 3. After structure fixes, for each chapter scoring below CONTENT_QUALITY_THRESHOLD: fetch HTML → Claude CLI (fix structure, preserve content verbatim) → pdf-cleanup-gate.py (word-multiset diff: reject hallucination / over-deletion) → PUT cleaned HTML → log (messy→clean) pair.
- pdf-cleanup-gate.py: deterministic preservation gate, stdlib-only. Joins line-wrap hyphens before tokenizing so legit merges don't read as new words. 3% novel-token tolerance, 70% retention floor.
- InternalEndpoints: UpdateQualityJobRequest + handler carry the three ContentChapters* counters; GetQualityJob returns them.
- Off by default: CONTENT_CLEANUP_ENABLED in .env gates the whole phase.
Pairs land in data/pdf-cleanup-dataset/ for the slice-5 ratchet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- AdminBookQualityEndpoints: QualityJobDetailDto carries the three ContentChapters* counters; GetJob returns them; RetryJob resets them.
- Admin BookQualityPage detail panel shows "Content cleanup — cleaned / rejected / skipped" when the job ran Phase 3.
- Worker logs a content-quality score distribution per book at ingest (count, avg, how many below 60), in both edition and user-book paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


mrviduus and others added 7 commits May 22, 2026 12:32

changelog: PDF content quality pipeline (feat-0007 slices 1-4)

53a67cd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mrviduus merged commit 3aa37e9 into main May 22, 2026
5 checks passed

mrviduus deleted the feat/pdf-content-quality branch May 22, 2026 19:29

This was referenced May 22, 2026

fix(pdf-quality): Phase 3 timeout + dataset dir permissions #236

Merged

fix(pdf-quality): bump CLEANUP_TIMEOUT default to 1500s #237

Merged

mrviduus merged 7 commits into main from feat/pdf-content-quality
