fix(pdf): robust paragraph detection (median gap + indent + bullet + TOC drop) #244

Merged
mrviduus merged 2 commits into main from fix/pdf-bullet-paragraph-split
May 23, 2026

Owner

@mrviduus mrviduus commented May 23, 2026

Summary

PDF paragraph detection was globally broken — bullets, indented body paragraphs, and plain y-gap-separated paragraphs all merged into single runs. The TOC chapter showed up as one dense leader-dotted blob.

Root causes

  1. Mean line-gap baseline was self-defeating. Real paragraph gaps in the data inflated the mean above themselves, so threshold = mean × 1.5 ended up higher than any actual paragraph gap.
  2. Y-gap was the only signal. Indent-only typography (most books) had no chance.
  3. Bullets had no special status. Tightly-spaced list items glued together.
  4. TOC chapter was kept — useless as a leader-dotted wall of text.
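
Root cause 1 can be checked with a little arithmetic. The sketch below uses made-up gap values (12 pt line gaps, 15 pt paragraph gaps, i.e. 1.25×); they are an assumption for illustration, not measurements from the PR. It suggests that a mean baseline can miss such gaps even with the lower multiplier, while a median baseline catches them.

```python
from statistics import mean, median

# Hypothetical vertical gaps (points) between consecutive lines on one page:
# seven ordinary line gaps of 12 pt and three paragraph gaps of 15 pt (1.25x).
gaps = [12.0] * 7 + [15.0] * 3

thresholds = {
    "mean x 1.5": mean(gaps) * 1.5,      # old behaviour
    "mean x 1.2": mean(gaps) * 1.2,      # lower multiplier, still mean-based
    "median x 1.2": median(gaps) * 1.2,  # new behaviour
}

# Count how many gaps each threshold would classify as paragraph breaks.
breaks_found = {
    name: sum(1 for g in gaps if g > t) for name, t in thresholds.items()
}
print(breaks_found)
# {'mean x 1.5': 0, 'mean x 1.2': 0, 'median x 1.2': 3}
```

The paragraph gaps pull the mean up to 12.9 pt, so both mean-based thresholds land above 15 pt; the median stays at 12 pt.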

Changes

Paragraph detection

  • Median replaces mean as the line-gap baseline.
  • Multiplier 1.5 → 1.2. O'Reilly-style book typography uses ~1.25× line-height for paragraph spacing.
  • First-line indent detection: page's modal left margin is computed; lines ≥6 pt to its right start a new paragraph.
  • Bullet glyphs (•, ●, ▪, ◦, ○, ▫, ◆, ‣, ⁃, ►, ❖) force a new paragraph regardless of y-gap.
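
A minimal Python sketch of how the three signals above might combine. The real code is C# in PdfPageTextExtractor; the `Line` type, the top-down `top` coordinate, and the whole-point rounding of margins are assumptions made for this illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median

BULLET_GLYPHS = set("•●▪◦○▫◆‣⁃►❖")  # the set listed in the PR description
GAP_MULTIPLIER = 1.2
INDENT_PT = 6.0


@dataclass
class Line:
    text: str
    left: float  # x of the left edge, in points
    top: float   # distance from the top of the page, in points (assumed)


def group_lines_into_paragraphs(lines):
    """Split a page's lines into paragraphs (each a list of line texts)."""
    if not lines:
        return []
    gaps = [b.top - a.top for a, b in zip(lines, lines[1:])]
    gap_threshold = median(gaps) * GAP_MULTIPLIER if gaps else float("inf")
    # Modal left margin; rounding lets near-equal edges share one bucket.
    modal_left = Counter(round(l.left) for l in lines).most_common(1)[0][0]

    paragraphs = [[lines[0].text]]
    for prev, cur in zip(lines, lines[1:]):
        starts_new = (
            cur.top - prev.top > gap_threshold       # vertical gap signal
            or cur.left - modal_left >= INDENT_PT    # first-line indent signal
            or cur.text[:1] in BULLET_GLYPHS         # bullet signal
        )
        if starts_new:
            paragraphs.append([cur.text])
        else:
            paragraphs[-1].append(cur.text)
    return paragraphs


page = [
    Line("Paragraph A, line 1", 72, 100),
    Line("Paragraph A, line 2", 72, 112),
    Line("Paragraph B, line 1", 84, 124),  # 12 pt indent, no extra gap
    Line("Paragraph B, line 2", 72, 136),
    Line("• item one", 72, 148),
    Line("•item two", 72, 160),            # glued bullet form
    Line("Paragraph C", 72, 177),          # 17 pt gap
]
result = group_lines_into_paragraphs(page)
```

Any one signal is enough to start a paragraph, which is why indent-only and tightly spaced bullet layouts no longer depend on the y-gap.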

TOC drop

  • FrontMatterFilter.IsTableOfContents(title) — anchored regex, en + ru/uk + a few EU languages, tolerates trailing page-number bookmarks ("Contents 5"), rejects substring matches like "Discontent".
  • PdfTextExtractor skips the chapter (with ContentFiltered warning) when bookmark title matches and there are other chapters surviving (single-chapter book literally titled "Contents" stays).
  • EPUB / FB2 unchanged — their TOC is usually HTML and renders fine.
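
The title check and the single-chapter guard might look roughly like this in Python. The title list is a small illustrative subset chosen for this sketch, and the pattern is an assumption; the actual regex in FrontMatterFilter is not shown in this PR text.

```python
import re

# Anchored at both ends so "Discontent" cannot match as a substring.
# Optional trailing token tolerates page-number bookmarks ("Contents 5").
_TOC_TITLE = re.compile(
    r"^\s*(table of contents|contents|содержание|оглавление|зміст)"
    r"(\s+[0-9ivxlc]+)?\s*$",
    re.IGNORECASE,
)


def is_table_of_contents(title):
    """True when a bookmark title looks like a TOC heading."""
    return bool(title) and _TOC_TITLE.match(title) is not None


def should_skip_toc_chapter(title, chapter_count):
    """Drop the chapter only if other chapters would survive."""
    return chapter_count > 1 and is_table_of_contents(title)
```

With these definitions, `should_skip_toc_chapter("Contents", 1)` is false, so a single-chapter book titled "Contents" is kept.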

Re-extract path

  • UserBookService.RetryAsync now accepts Ready in addition to Failed. Existing books can pick up extraction improvements without delete+reupload. UserIngestionService already wipes old chapters before re-extracting, so this is safe.
  • UserBookDetailPage: small "Re-extract" icon button next to delete (single click, no confirm — fully reversible).

Tests

  • FrontMatterFilterTests — TOC titles in 6 languages + page-number-trailing variant + negative cases.
  • PdfPageTextExtractorTests:
    • Bullet glyph detection across the set + glued "•You're" form + negative cases.
    • Real-PDF integration test: y-gap = 1.43× line-spacing → must produce 2 paragraphs (old mean+1.5 would have produced 1).
    • Real-PDF integration test: 12 pt first-line indent, no y-gap signal → must produce ≥2 paragraphs containing "Paragraph B".
  • dotnet test tests/TextStack.Extraction.Tests → 259 passed.
  • pnpm -C apps/web build clean.

Rollback

No feature flag. The new behaviour applies to every PDF extracted after this change (fresh uploads and re-extracts). To roll back, revert the commit.

🤖 Generated with Claude Code

mrviduus and others added 2 commits May 23, 2026 13:37
Two visible bugs reported on a fresh PDF upload:
- Bullet lists in the Preface got concatenated into one wall of text —
  "• You're building... • You want to... • Tool developers..." all in a
  single run. PdfPageTextExtractor.GroupLinesIntoParagraphs only split on
  vertical y-gap, and tightly-spaced lists don't have one.
- The "Table of Contents" chapter showed up as a single dense run of
  leader-dotted entries. It is not readable in that form, and the
  in-app TOC is already built from the chapter list itself.

Changes:
- Add BulletGlyphs set + StartsWithBulletGlyph check in
  GroupLinesIntoParagraphs — a line whose first word is •, ●, ▪, ◦, ○,
  ‣, ⁃ (etc.) forces a new paragraph regardless of y-gap.
- New FrontMatterFilter.IsTableOfContents (anchored regex, en + ru/uk +
  a few EU languages, tolerates trailing page-number bookmarks).
- PdfTextExtractor skips the chapter when the bookmark title matches.
- UserBookService.RetryAsync now also accepts Ready (not just Failed) so
  existing books can pick up extraction improvements without a
  delete+reupload roundtrip. UserIngestionService already wipes old
  chapters before re-extracting, so this is safe.
- UserBookDetailPage gets a small "Re-extract" icon button next to
  delete (single click, no confirm — reversible).

Tests: 36/36 in TextStack.Extraction.Tests pass — 14 new across
FrontMatterFilterTests + PdfPageTextExtractorTests covering bullet
glyph detection and TOC title matching.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The bullet-only fix in the previous commit only addressed list items.
The broader complaint was paragraph breaks missing everywhere — even
plain body text was glued into one run. Two root causes:

1. mean line gap was the baseline. Paragraph gaps in the data inflate
   the mean above themselves, so the threshold (mean × 1.5) ends up
   higher than any real paragraph gap. Self-defeating statistic.
2. y-gap-only detection. Many books use first-line indent instead of
   vertical paragraph spacing; we ignored indent entirely.

Changes:
- Median replaces mean for the baseline gap. Most gaps on a body page
  are ordinary line gaps, so the median sits at the line height and
  paragraph gaps stay above the threshold.
- Multiplier 1.5 → 1.2. O'Reilly-style typography uses paragraph spacing
  around 1.25× line height; 1.5× missed them.
- New StartsWithIndent: compute the page's modal left margin, treat a
  line whose left edge is ≥6 pt right of it as a paragraph break.
  Catches indent-only paragraphs that have no y-gap signal at all.
- TOC drop in PdfTextExtractor now requires chapters.Count > 1, so a
  single-chapter book literally titled "Contents" won't disappear
  (PR #244 bug-report guard).

Integration tests build real PDFs (y-gap and indent variants) and
verify the extractor splits them. 259/259 in Extraction.Tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mrviduus mrviduus changed the title from "fix(pdf): bullets → new paragraph, drop TOC chapter, allow re-extract on Ready books" to "fix(pdf): robust paragraph detection (median gap + indent + bullet + TOC drop)" May 23, 2026
@mrviduus mrviduus merged commit 8c388ec into main May 23, 2026
5 checks passed
@mrviduus mrviduus deleted the fix/pdf-bullet-paragraph-split branch May 23, 2026 18:06
mrviduus added a commit that referenced this pull request May 23, 2026
…-check + root-cause cleanup (#245)

* fix(pdf): bug-report sweep — content-TOC + multi-col + bullets + file-check

Four follow-ups from PR #244's bug report list:

1. Content-level TOC detection. The bookmark-title-only path missed
   TOCs that came in via the page-split fallback (no bookmark, chapter
   labeled "Pages 1–15"). Now FrontMatterFilter.LooksLikeTableOfContentsBody
   inspects the plain text: ≥40% of substantive lines ending in a
   leader-dot run (or "…") + page number ⇒ TOC. Same single-chapter
   safety guard.
2. Multi-column / mixed-layout guard for StartsWithIndent. The modal
   left margin is now only trusted when it covers ≥50% of all lines.
   On a 2-column academic paper the modal share is well under half;
   we fall back to y-gap and bullet detection only, instead of
   over-splitting on every column shift.
3. Bullet glyph set expanded — ◇ ❖ ❍ ▶ ▸ ▻ ➤ ➔ ➢ ★ ☆ ✓ ✔ ✗ ✘ to cover
   modern textbook list markers.
4. RetryAsync now probes the backing file via storage.GetFileAsync
   before queuing the job. Previously a missing source file still
   returned success and left the book stuck in Processing forever.
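
Item 1's content-level check could be sketched in Python as below. The 40% share comes from the text above and the five-line minimum from the round-2 notes; the regex and the definition of a "substantive" line (three or more non-space characters) are assumptions of this sketch.

```python
import re

# A leader run (three or more dots, optionally spaced) or an ellipsis,
# followed by a page number at the end of the line.
_LEADER_ENTRY = re.compile(r"(?:\.\s?){3,}\s*\d+\s*$|…+\s*\d+\s*$")
MIN_LINES = 5
MIN_SHARE = 0.40


def looks_like_toc_body(lines):
    """True when enough substantive lines end in leader dots + page number."""
    if not lines:
        return False
    substantive = [ln for ln in lines if len(ln.strip()) >= 3]
    if len(substantive) < MIN_LINES:
        return False
    hits = sum(1 for ln in substantive if _LEADER_ENTRY.search(ln))
    return hits / len(substantive) >= MIN_SHARE


toc = [
    "Preface ........ vii",
    "1. Getting Started ........ 1",
    "2. Core Concepts . . . . . 19",
    "3. Tooling … 47",
    "4. Deployment ........ 83",
    "Index ........ 120",
]
prose = [
    "This chapter introduces the basic ideas.",
    "We start with a short example.",
    "Then we look at how the pieces fit together.",
    "Later sections cover tooling in more depth.",
    "See page 3 for the installation steps.",
]
```

Here `looks_like_toc_body(toc)` is true (five of six lines match) and `looks_like_toc_body(prose)` is false.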

Tests: 273 passing in Extraction.Tests — 8 new across content-TOC
detection (leader-dot, ellipsis, prose negative, too-short, null) and
expanded bullet glyph coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 2 — root-cause-first cleanup

Senior-dev pass over the bug reports from round 1. Each one was traced
to its underlying invariant; the fix removes the bug instead of just
working around the symptom.

PR #245 round-1 plain-text TOC detection silently DID NOT WORK
============================================================
Root cause: ProcessingPipeline.ExtractPlainText runs
    WhitespaceRegex.Replace(text, " ")
which collapses '\n' into a single space. LooksLikeTableOfContentsBody
split on '\n', got 1 line, < 5 significant → always returned false.
Fix: detection now operates on the chapter HTML, splitting on
</p>|</h\d>|</li> instead of newlines. The HTML retains paragraph
boundaries by construction (PdfToHtmlConverter emits one <p> per
extracted paragraph).
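
The failure mode is easy to reproduce. In this Python sketch, `\s+` stands in for WhitespaceRegex (the actual pattern is assumed), and the sample text is invented.

```python
import re

WHITESPACE = re.compile(r"\s+")  # assumed equivalent of WhitespaceRegex

chapter_text = (
    "Preface .... 1\nIntroduction .... 3\nBasics .... 9\n"
    "Tools .... 21\nIndex .... 40"
)

# Round 1: newlines are gone before the detector ever sees the text.
collapsed = WHITESPACE.sub(" ", chapter_text)
lines_before = chapter_text.split("\n")  # 5 lines
lines_after = collapsed.split("\n")      # 1 line, below the 5-line minimum

# Round 2: split the chapter HTML on closing block tags instead.
html = "".join(f"<p>{ln}</p>" for ln in lines_before)
parts = re.split(r"</p>|</h\d>|</li>", html)
blocks = [re.sub(r"<[^>]+>", "", p).strip() for p in parts]
blocks = [b for b in blocks if b]        # drop the empty trailing piece
```

After the collapse only one "line" remains, so the minimum-line check can never pass; splitting the HTML recovers all five entries.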

#1  Index/Glossary false positive on content-detected TOC
=========================================================
Root cause: Index and Glossary are *also* leader-dotted "term … 47".
The single "look like TOC body" signal isn't enough.
Fix: two extra guards.
 - Position: only drop chapters in the front half of the book.
 - Title: new IsKnownBackMatter — vetoes the drop when the bookmark
   title is Index / Glossary / Bibliography / References / Notes /
   Abbreviations / Colophon (en + ru/uk).
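
A Python sketch of the two guards layered on top of the body signal. Only the English titles are listed, and "front half" is taken to mean a chapter index below half the chapter count; both are simplifying assumptions.

```python
import re

_BACK_MATTER = re.compile(
    r"^\s*(index|glossary|bibliography|references|notes|"
    r"abbreviations|colophon)\s*$",
    re.IGNORECASE,
)


def is_known_back_matter(title):
    """True for titles that name back matter (English subset only)."""
    return bool(title) and _BACK_MATTER.match(title) is not None


def should_drop_as_toc(index, chapter_count, title, body_looks_like_toc):
    """Combine body signal, position guard, and back-matter title veto."""
    if chapter_count <= 1 or not body_looks_like_toc:
        return False
    in_front_half = index < chapter_count / 2  # assumed definition
    return in_front_half and not is_known_back_matter(title)
```

A leader-dotted chapter near the end of the book, or one titled "Glossary", is therefore kept even though its body resembles a TOC.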

#2  Multi-column threshold flips on borderline pages
====================================================
Root cause: hard 50% modal-coverage cutoff is brittle.
Fix: dominance ratio. Modal margin trusted only when its count is
≥ 2.5× the runner-up. Real 2-column pages sit near 1.0× (~40/40);
single-column body pages sit near 17× (~85/5). Cutoff is far from
either distribution.
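
The dominance check, sketched in Python with the counts quoted above. The function name and the whole-point rounding of margins are assumptions for illustration.

```python
from collections import Counter

DOMINANCE = 2.5


def trusted_modal_margin(left_edges):
    """Return the modal left margin if it dominates the runner-up, else None."""
    counts = Counter(round(x) for x in left_edges).most_common(2)
    if not counts:
        return None
    modal, modal_count = counts[0]
    runner_up_count = counts[1][1] if len(counts) > 1 else 0
    if runner_up_count and modal_count < DOMINANCE * runner_up_count:
        return None  # no clear body margin: skip indent detection
    return modal


two_column = [72] * 40 + [320] * 40    # ratio 1.0x
single_column = [72] * 85 + [84] * 5   # ratio 17x
with_sidenotes = [72] * 70 + [400] * 25  # ratio 2.8x
```

`trusted_modal_margin(two_column)` returns `None`, while the single-column and sidenote pages both return 72, matching the 2.8× analysis in round 3.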

#3  RetryAsync opened a file stream just to probe existence
===========================================================
Root cause: IFileStorageService had no existence primitive, so the
guard had to use the heavy GetFileAsync.
Fix: new ExistsAsync(path) on the interface, implemented in
LocalFileStorageService as File.Exists(GetFullPath). RetryAsync uses
it; old GetFileAsync path replaced.
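
For illustration, the same idea in Python: an existence check that never opens the file, used as a guard before queuing. The class, the dict-based book record, and the return strings are hypothetical; only the status names (Ready, Failed, Processing) come from the PR.

```python
import asyncio
import os


class LocalFileStorage:
    """Hypothetical stand-in for LocalFileStorageService."""

    def __init__(self, root):
        self.root = root

    def full_path(self, path):
        return os.path.join(self.root, path)

    async def exists(self, path):
        # Metadata lookup only; no stream is opened.
        return os.path.isfile(self.full_path(path))


async def retry(storage, book):
    """Queue a re-extract only for Ready/Failed books whose source exists."""
    if book["status"] not in ("Failed", "Ready"):
        return "rejected: wrong status"
    if not await storage.exists(book["path"]):
        return "rejected: source file missing"
    book["status"] = "Processing"
    return "queued"
```

Checking existence before changing the status means a missing source leaves the book in its previous state instead of stuck in Processing.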

Drive-by cleanup: PdfToHtmlConverter.plainBuilder
=================================================
Built up via AppendLine in the loop, never used — the returned
plainText comes from HtmlCleaner.Clean's pipeline. Removed to stop
implying the local copy is authoritative.

Tests: 287 passing in Extraction.Tests (+14 new across
IsKnownBackMatter, HTML-based LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 3 — root-cause refactor

Three remaining bug reports addressed at their underlying invariant.

#2 (HTML coupling) — root cause: detection ran on chapter HTML and
split on </p>|</h\d>|</li>, coupling it to PdfToHtmlConverter's
markup. The actual signal was always per-paragraph plain text; we
just lost it during HTML conversion. Refactor: LooksLikeTableOfContentsBody
now takes IEnumerable<string> of paragraph texts, called BEFORE HTML
conversion in PdfTextExtractor. Bonus: skips the HtmlCleaner pipeline
entirely for chapters we're about to drop.

#4 (localized back-matter) — root cause: whitelist was English +
Russian/Ukrainian only. Extended to German, French, Spanish, Italian,
Portuguese for Glossary / Bibliography / References / Notes / Appendix
plus their localized forms.

#3 (sidenote dominance) — analysis: with sidenotes the typical split
is ~70/25 = 2.8× which is above the 2.5× threshold, so the multi-column
guard correctly keeps trusting the modal margin. Sidenote lines (a
small minority) are then NOT treated as indent breaks — they just
register as regular paragraph content. The body remains correctly
split. No code change needed; documenting the analysis.

Latent bug — IsTableOfContents drop had no position guard, so an
Italian "Indice" / Spanish "Índice" (same word means "Index" at the
back of the book) would be mis-dropped when it's the Index. Added
isFrontHalf guard to the title-based drop path too.

Tests: 302 passing in Extraction.Tests (+15 new across IsKnownBackMatter
in 7 languages + string-list LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 4 — Unicode-category bullet detection

Last actionable item from the bug list: hardcoded bullet set missed
custom dingbat-font glyphs in modern textbooks. Root cause: detection
was an enumerated whitelist; new fonts shipped new shapes; we kept
playing whack-a-mole.

Generalization: after the fast-path whitelist check, fall back to
Unicode category lookup. If the first word is a single character in
category "Symbol, Other" (So) AND not in the existing
NoisePunctuation set, treat it as a bullet.

Po (Punctuation Other) is deliberately excluded — that category
contains † ‡ § ¶ ※ which are footnote-reference markers, not
paragraph starts. Tests assert both directions:
  - ☑ ☐ ✦ ✺ ♦ ⇒ bullet (So)
  - † ‡ § ¶ ※ ⇒ NOT bullet (Po)
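
The category fallback can be tried out with Python's `unicodedata` module. The whitelist below is the original set from PR #244, and `NOISE` is a hypothetical stand-in for NoisePunctuation, whose real contents are not given here.

```python
import unicodedata

BULLET_WHITELIST = set("•●▪◦○▫◆‣⁃►❖")
NOISE = set("©®™°")  # hypothetical; all four are category So


def starts_with_bullet(line):
    """Whitelist fast path first, then a Unicode-category fallback."""
    words = line.split()
    if not words:
        return False
    first = words[0]
    if first[0] in BULLET_WHITELIST:  # also catches glued "•You're"
        return True
    return (
        len(first) == 1
        and first not in NOISE
        and unicodedata.category(first) == "So"
    )


so_marks = [ch for ch in "☑☐✦✺♦" if starts_with_bullet(ch + " item")]
po_marks = [ch for ch in "†‡§¶※" if starts_with_bullet(ch + " note")]
```

The whitelist still matters: `unicodedata.category("•")` is `"Po"`, so the most common bullet would be missed by the category rule alone.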

Tests: 312 passing in Extraction.Tests (+10 new).

Other items remaining on the bug list are tradeoffs we declined:
  • sidenote columns = XYCut territory (intentionally not pursued)
  • wrapped TOC entries with hanging indent = context-aware
    extraction, larger refactor

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>