feat(pdf-quality) [slice 5 r1]: drop O'Reilly running headers#241

Merged
mrviduus merged 1 commit into main from feat/pdf-quality-ratchet-r1
May 23, 2026

@mrviduus
Owner

Summary

First heuristic-ratchet round (feat-0007 slice 5). Studied the (messy →
cleaned) pair Phase 3 produced on AI Engineering ch5; the highest-signal
recurring fix Claude made was removing running headers in two shapes:

<p><strong>4 | Chapter 1: Introduction to Building AI Applications…</strong></p>
<p><strong>The Rise of AI Engineering | 3</strong></p>

The page number varies per page, so PdfTextExtractor's cross-page
identical-text filter can't catch them. The structural signature is
distinctive, though: a short paragraph with a small integer and " | "
at its start or its end.
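
To illustrate why an identical-text filter misses these lines, here is a minimal Python sketch. The project itself is C#, and `repeated_lines` and the sample pages are hypothetical; this is not PdfTextExtractor's actual logic. It counts how many pages each exact line appears on, which catches a constant footer but not a header whose page number changes.

```python
from collections import Counter

def repeated_lines(pages, min_pages=2):
    """Return lines whose exact text appears on at least `min_pages` pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update(set(page))
    return {line for line, n in counts.items() if n >= min_pages}

# Hypothetical sample: the header differs on every page because of the number.
pages = [
    ["4 | Chapter 1: Introduction", "Body text A", "Draft copy"],
    ["6 | Chapter 1: Introduction", "Body text B", "Draft copy"],
    ["8 | Chapter 1: Introduction", "Body text C", "Draft copy"],
]

noise = repeated_lines(pages)
```

Only the constant "Draft copy" line is flagged here; the three header variants each appear on a single page and pass through.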

Changes

  • PdfPageTextExtractor.IsArtifactNoise — new regex catches the two
    running-header shapes; capped at 200 chars to avoid false positives.
  • IsArtifactNoise opened up to internal for direct tests
    (InternalsVisibleTo("TextStack.Extraction.Tests") in the csproj).
  • 13 new tests: 5 header positives, 3 legacy artifact regressions, 4 prose
    non-matches, 1 length-cap guard.
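
The check described above could look roughly like the following Python sketch. The real code is C# inside `IsArtifactNoise`, and its regex is not shown on this page, so the pattern, the 1-4 digit bound, and the name `is_running_header` are assumptions; only the two header shapes and the 200-character cap come from this PR. The sketch also assumes the text arrives without its `<p><strong>` wrapper.

```python
import re

MAX_HEADER_LEN = 200  # length cap stated in the PR

# Shape 1: "4 | Chapter 1: Introduction..."  (page number first)
# Shape 2: "The Rise of AI Engineering | 3"  (page number last)
# The 1-4 digit bound is an assumption of this sketch.
_RUNNING_HEADER = re.compile(r"^(?:\d{1,4} \| \S.*|.*\S \| \d{1,4})$")

def is_running_header(text: str) -> bool:
    """Return True if `text` looks like a running header in either shape."""
    text = text.strip()
    # Empty or over-long paragraphs are never treated as headers.
    if not text or len(text) > MAX_HEADER_LEN:
        return False
    return _RUNNING_HEADER.match(text) is not None
```

Anchoring the integer to the very start or end of the paragraph is what keeps ordinary prose containing " | " from matching, and the length cap guards against a long paragraph that happens to begin with a number and a pipe.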

Effect: these headers are now dropped at extraction time, so they never
reach the body and never trigger Phase 3. Claude usage on books with this
pattern should shrink accordingly.

Tests

220 extraction tests pass.

Notes

Ratchet round 2 needs more pairs — gated on real user uploads accumulating
material in data/pdf-cleanup-dataset/.

🤖 Generated with Claude Code

First heuristic-ratchet round. Studied the Claude cleanup pair from the
prod test run; the highest-signal recurring fix Claude made on AI
Engineering was removing running-header paragraphs in two shapes:

    <p><strong>4 | Chapter 1: Introduction…</strong></p>
    <p><strong>The Rise of AI Engineering | 3</strong></p>

The page number varies per page, so PdfTextExtractor's cross-page
identical-text filter couldn't catch them — but the structural signature
(small int + " | " + text on a short paragraph) is distinctive. Encoded
as a regex in PdfPageTextExtractor.IsArtifactNoise — these now drop at
extraction time, never reach the body, never need an LLM call.

Made IsArtifactNoise internal + InternalsVisibleTo for direct unit tests.
13 new tests (5 header positives, 3 legacy regressions, 4 prose
non-matches, 1 length cap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mrviduus mrviduus merged commit ed5d54a into main May 23, 2026
5 checks passed
@mrviduus mrviduus deleted the feat/pdf-quality-ratchet-r1 branch May 23, 2026 01:45