fix(pdf): robust paragraph detection (median gap + indent + bullet + TOC drop) by mrviduus · Pull Request #244 · mrviduus/textstack

mrviduus · 2026-05-23T17:37:09Z

Summary

PDF paragraph detection was globally broken — bullets, indented body paragraphs, and plain y-gap-separated paragraphs all merged into single runs. The TOC chapter showed up as one dense leader-dotted blob.

Root causes

Mean line-gap baseline was self-defeating. Real paragraph gaps in the data pulled the mean up toward themselves, so threshold = mean × 1.5 ended up higher than any actual paragraph gap.
Y-gap was the only signal. Indent-only typography (most books) had no chance.
Bullets had no special status. Tightly-spaced list items glued together.
TOC chapter was kept — useless as a leader-dotted wall of text.
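The first root cause is easiest to see with numbers. The sketch below is a minimal Python illustration, not the project's C# code; the 14 pt line spacing, the 20 pt paragraph spacing, and the strict `>` comparison are assumptions chosen for the example.

```python
from statistics import mean, median

# Hypothetical page: 14 pt between lines, 20 pt between paragraphs (about 1.43x).
LINE_GAP = 14.0
PARA_GAP = 20.0

# Short paragraphs of three lines each: every third gap is a paragraph gap.
gaps = [LINE_GAP, LINE_GAP, PARA_GAP] * 4

# Old rule: the paragraph gaps pull the mean up to 16.0, so the threshold is 24.0.
mean_threshold = mean(gaps) * 1.5
# New rule: the median stays at the line gap (14.0), so the threshold is about 16.8.
median_threshold = median(gaps) * 1.2

old_breaks = [g for g in gaps if g > mean_threshold]    # nothing clears 24.0
new_breaks = [g for g in gaps if g > median_threshold]  # all four 20 pt gaps

print(len(old_breaks), len(new_breaks))
```

With the mean-based threshold no gap qualifies as a paragraph break; with the median-based threshold all four paragraph gaps do.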

Changes

Paragraph detection

Median replaces mean as the line-gap baseline.
Multiplier 1.5 → 1.2. O'Reilly-style book typography uses ~1.25× line-height for paragraph spacing.
First-line indent detection: page's modal left margin is computed; lines ≥6 pt to its right start a new paragraph.
Bullet glyphs (•, ●, ▪, ◦, ○, ▫, ◆, ‣, ⁃, ►, ❖) force a new paragraph regardless of y-gap.
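The three signals above can be sketched together as follows. This is an illustrative Python version under stated assumptions, not the actual `GroupLinesIntoParagraphs` implementation: the `Line` type, rounding left edges to whole points, and a y coordinate that grows down the page are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median

BULLETS = set("•●▪◦○▫◆‣⁃►❖")   # glyph set listed in the PR description
GAP_MULTIPLIER = 1.2            # median gap multiplier from the PR description
INDENT_PT = 6.0                 # minimum first-line indent, in points


@dataclass
class Line:
    text: str
    left: float  # x of the left edge, in points
    top: float   # y of the line, assumed to increase down the page


def group_lines(lines: list[Line]) -> list[str]:
    if not lines:
        return []
    gaps = [b.top - a.top for a, b in zip(lines, lines[1:])]
    threshold = median(gaps) * GAP_MULTIPLIER if gaps else float("inf")
    # Modal left margin: the most common left edge on the page.
    modal_left = Counter(round(l.left) for l in lines).most_common(1)[0][0]

    paragraphs = [[lines[0].text]]
    for prev, cur in zip(lines, lines[1:]):
        big_gap = (cur.top - prev.top) > threshold
        indented = cur.left - modal_left >= INDENT_PT
        bullet = cur.text[:1] in BULLETS  # also catches the glued "•You're" form
        if big_gap or indented or bullet:
            paragraphs.append([cur.text])
        else:
            paragraphs[-1].append(cur.text)
    return [" ".join(p) for p in paragraphs]


page = [
    Line("Paragraph A first", 84, 0),
    Line("A second", 72, 14),
    Line("A third", 72, 28),
    Line("Paragraph B first", 84, 42),   # 12 pt indent, no extra y-gap
    Line("B second", 72, 56),
    Line("• item one", 72, 70),
    Line("• item two", 72, 84),
]
result = group_lines(page)
```

On this sample page every y-gap is identical, so the splits come only from the indent and bullet signals: two body paragraphs followed by two list items.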

TOC drop

FrontMatterFilter.IsTableOfContents(title) — anchored regex, en + ru/uk + a few EU languages, tolerates trailing page-number bookmarks ("Contents 5"), rejects substring matches like "Discontent".
PdfTextExtractor skips the chapter (with a ContentFiltered warning) when the bookmark title matches and at least one other chapter survives; a single-chapter book literally titled "Contents" is kept.
EPUB / FB2 unchanged — their TOC is usually HTML and renders fine.
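A rough idea of what an anchored title match looks like, written in Python for illustration. The title list here is a hypothetical, abbreviated stand-in; the real `FrontMatterFilter.IsTableOfContents` pattern is not shown in this PR description and covers more languages.

```python
import re

# Hypothetical, abbreviated title list (en + ru/uk only).
_TOC_TITLE = re.compile(
    r"^\s*(?:table\s+of\s+contents|contents|содержание|оглавление|зміст)"
    r"\s*(?:\d+)?\s*$",   # optional trailing page number, e.g. "Contents 5"
    re.IGNORECASE,
)


def is_table_of_contents(title):
    """True when the whole title is a TOC heading; substrings do not match."""
    return bool(title) and _TOC_TITLE.match(title) is not None


def should_drop_toc(title, chapter_count):
    """Single-chapter guard: never drop the only chapter in the book."""
    return chapter_count > 1 and is_table_of_contents(title)
```

Because the pattern is anchored at both ends, "Discontent" is rejected while "Contents 5" is accepted, and `should_drop_toc("Contents", 1)` stays false so a one-chapter book keeps its content.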

Re-extract path

UserBookService.RetryAsync now accepts Ready in addition to Failed. Existing books can pick up extraction improvements without delete+reupload. UserIngestionService already wipes old chapters before re-extracting, so this is safe.
UserBookDetailPage: small "Re-extract" icon button next to delete (single click, no confirm — fully reversible).

Tests

FrontMatterFilterTests — TOC titles in 6 languages + page-number-trailing variant + negative cases.
PdfPageTextExtractorTests:
- Bullet glyph detection across the set + glued "•You're" form + negative cases.
- Real-PDF integration test: y-gap = 1.43× line-spacing → must produce 2 paragraphs (old mean+1.5 would have produced 1).
- Real-PDF integration test: 12 pt first-line indent, no y-gap signal → must produce ≥2 paragraphs containing "Paragraph B".
dotnet test tests/TextStack.Extraction.Tests → 259 passed.
pnpm -C apps/web build clean.

Rollback

No flag. New behaviour applies to newly-extracted PDFs (re-extracts and fresh uploads). To roll back, revert the commit.

🤖 Generated with Claude Code

Two visible bugs reported on a fresh PDF upload:
- Bullet lists in the Preface got concatenated into one wall of text: "• You're building... • You want to... • Tool developers..." all in a single run. PdfPageTextExtractor.GroupLinesIntoParagraphs only split on vertical y-gap, and tightly-spaced lists don't have one.
- The "Table of Contents" chapter showed up as a single dense run of leader-dotted entries. It's rendered nowhere readably and the in-app TOC is already built from the chapter list itself.

Changes:
- Add BulletGlyphs set + StartsWithBulletGlyph check in GroupLinesIntoParagraphs: a line whose first word is •, ●, ▪, ◦, ○, ‣, ⁃ (etc.) forces a new paragraph regardless of y-gap.
- New FrontMatterFilter.IsTableOfContents (anchored regex, en + ru/uk + a few EU languages, tolerates trailing page-number bookmarks).
- PdfTextExtractor skips the chapter when the bookmark title matches.
- UserBookService.RetryAsync now also accepts Ready (not just Failed) so existing books can pick up extraction improvements without a delete+reupload roundtrip. UserIngestionService already wipes old chapters before re-extracting, so this is safe.
- UserBookDetailPage gets a small "Re-extract" icon button next to delete (single click, no confirm; reversible).

Tests: 36/36 in TextStack.Extraction.Tests pass. 14 new across FrontMatterFilterTests + PdfPageTextExtractorTests covering bullet glyph detection and TOC title matching.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The bullet-only fix in the previous commit only addressed list items. The broader complaint was paragraph breaks missing everywhere: even plain body text was glued into one run.

Two root causes:
1. Mean line gap was the baseline. Paragraph gaps in the data pull the mean up toward themselves, so the threshold (mean × 1.5) ends up higher than any real paragraph gap. Self-defeating statistic.
2. Y-gap-only detection. Many books use first-line indent instead of vertical paragraph spacing; we ignored indent entirely.

Changes:
- Median replaces mean for baseline gap. The modal spacing on a body page is line-height; paragraph gaps stay above the threshold.
- Multiplier 1.5 → 1.2. O'Reilly-style typography uses paragraph spacing around 1.25× line height; 1.5× missed them.
- New StartsWithIndent: compute the page's modal left margin, treat a line whose left edge is ≥6 pt right of it as a paragraph break. Catches indent-only paragraphs that have no y-gap signal at all.
- TOC drop in PdfTextExtractor now requires chapters.Count > 1, so a single-chapter book literally titled "Contents" won't disappear (PR #244 bug-report guard).

Integration tests build real PDFs (y-gap and indent variants) and verify the extractor splits them. 259/259 in Extraction.Tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-check + root-cause cleanup (#245)

* fix(pdf): bug-report sweep — content-TOC + multi-col + bullets + file-check

Four follow-ups from PR #244's bug report list:
1. Content-level TOC detection. The bookmark-title-only path missed TOCs that came in via the page-split fallback (no bookmark, chapter labeled "Pages 1–15"). Now FrontMatterFilter.LooksLikeTableOfContentsBody inspects the plain text: ≥40% of substantive lines ending in a leader-dot run (or "…") + page number ⇒ TOC. Same single-chapter safety guard.
2. Multi-column / mixed-layout guard for StartsWithIndent. The modal left margin is now only trusted when it covers ≥50% of all lines. On a 2-column academic paper the modal share is well under half; we fall back to y-gap and bullet detection only, instead of over-splitting on every column shift.
3. Bullet glyph set expanded (◇ ❖ ❍ ▶ ▸ ▻ ➤ ➔ ➢ ★ ☆ ✓ ✔ ✗ ✘) to cover modern textbook list markers.
4. RetryAsync now probes the backing file via storage.GetFileAsync before queuing the job. A missing source returned success and left the book stuck in Processing forever.

Tests: 273 passing in Extraction.Tests. 8 new across content-TOC detection (leader-dot, ellipsis, prose negative, too-short, null) and expanded bullet glyph coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 2 — root-cause-first cleanup

Senior-dev pass over the bug reports from round 1. Each one was traced to its underlying invariant; the fix removes the bug instead of just working around the symptom.

PR #245 round-1 plain-text TOC detection silently DID NOT WORK
============================================================
Root cause: ProcessingPipeline.ExtractPlainText runs WhitespaceRegex.Replace(text, " ") which collapses '\n' into a single space. LooksLikeTableOfContentsBody split on '\n', got 1 line, < 5 significant → always returned false.
Fix: detection now operates on the chapter HTML, splitting on </p>|</h\d>|</li> instead of newlines. The HTML retains paragraph boundaries by construction (PdfToHtmlConverter emits one <p> per extracted paragraph).

#1 Index/Glossary false positive on content-detected TOC
=========================================================
Root cause: Index and Glossary are *also* leader-dotted "term … 47". The single "look like TOC body" signal isn't enough.
Fix: two extra guards.
- Position: only drop chapters in the front half of the book.
- Title: new IsKnownBackMatter, which vetoes the drop when the bookmark title is Index / Glossary / Bibliography / References / Notes / Abbreviations / Colophon (en + ru/uk).

#2 Multi-column threshold flips on borderline pages
====================================================
Root cause: hard 50% modal-coverage cutoff is brittle.
Fix: dominance ratio. Modal margin trusted only when its count is ≥ 2.5× the runner-up. Real 2-column pages sit near 1.0× (~40/40); single-column body pages sit near 17× (~85/5). Cutoff is far from either distribution.

#3 RetryAsync opened a file stream just to probe existence
===========================================================
Root cause: IFileStorageService had no existence primitive, so the guard had to use the heavy GetFileAsync.
Fix: new ExistsAsync(path) on the interface, implemented in LocalFileStorageService as File.Exists(GetFullPath). RetryAsync uses it; old GetFileAsync path replaced.

Drive-by cleanup: PdfToHtmlConverter.plainBuilder
=================================================
Built up via AppendLine in the loop, never used: the returned plainText comes from HtmlCleaner.Clean's pipeline. Removed to stop implying the local copy is authoritative.

Tests: 287 passing in Extraction.Tests (+14 new across IsKnownBackMatter, HTML-based LooksLikeTableOfContentsBody).
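The dominance-ratio guard from round 2 can be sketched in a few lines. This Python version is only an illustration of the idea: the function name, the rounding of left edges to whole points, and the single-bucket fallback are assumptions; the 2.5× cutoff and the sample splits come from the commit messages on this page.

```python
from collections import Counter

DOMINANCE_RATIO = 2.5  # cutoff quoted in the round 2 commit message


def trusted_modal_margin(left_edges, ratio=DOMINANCE_RATIO):
    """Return the modal left margin, or None when it does not dominate."""
    counts = Counter(round(x) for x in left_edges).most_common(2)
    if not counts:
        return None
    modal, modal_count = counts[0]
    if len(counts) == 1:
        return modal  # only one margin on the page (assumed safe to trust)
    runner_up_count = counts[1][1]
    return modal if modal_count >= ratio * runner_up_count else None


two_column = [72] * 40 + [320] * 40      # ~40/40, ratio 1.0x
single_column = [72] * 85 + [84] * 5     # ~85/5, ratio 17x
with_sidenotes = [72] * 70 + [400] * 25  # ~70/25, ratio 2.8x

print(trusted_modal_margin(two_column))      # None
print(trusted_modal_margin(single_column))   # 72
print(trusted_modal_margin(with_sidenotes))  # 72
```

When the function returns None, a caller would skip indent detection for that page and rely on the y-gap and bullet signals alone.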
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 3 — root-cause refactor

Three remaining bug reports addressed at their underlying invariant.

#2 (HTML coupling), root cause: detection ran on chapter HTML and split on </p>|</h\d>|</li>, coupling it to PdfToHtmlConverter's markup. The actual signal was always per-paragraph plain text; we just lost it during HTML conversion. Refactor: LooksLikeTableOfContentsBody now takes IEnumerable<string> of paragraph texts, called BEFORE HTML conversion in PdfTextExtractor. Bonus: skips the HtmlCleaner pipeline entirely for chapters we're about to drop.

#4 (localized back-matter), root cause: whitelist was English + Russian/Ukrainian only. Extended to German, French, Spanish, Italian, Portuguese for Glossary / Bibliography / References / Notes / Appendix plus their localized forms.

#3 (sidenote dominance), analysis: with sidenotes the typical split is ~70/25 = 2.8× which is above the 2.5× threshold, so the multi-column guard correctly keeps trusting the modal margin. Sidenote lines (a small minority) are then NOT treated as indent breaks; they just register as regular paragraph content. The body remains correctly split. No code change needed; documenting the analysis.

Latent bug: IsTableOfContents drop had no position guard, so an Italian "Indice" / Spanish "Índice" (same word means "Index" at the back of the book) would be mis-dropped when it's the Index. Added isFrontHalf guard to the title-based drop path too.

Tests: 302 passing in Extraction.Tests (+15 new across IsKnownBackMatter in 7 languages + string-list LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 4 — Unicode-category bullet detection

Last actionable item from the bug list: hardcoded bullet set missed custom dingbat-font glyphs in modern textbooks.

Root cause: detection was an enumerated whitelist; new fonts shipped new shapes; we kept playing whack-a-mole.

Generalization: after the fast-path whitelist check, fall back to Unicode category lookup. If the first word is a single character in category "Symbol, Other" (So) AND not in the existing NoisePunctuation set, treat it as a bullet. Po (Punctuation Other) is deliberately excluded: that category contains † ‡ § ¶ ※ which are footnote-reference markers, not paragraph starts.

Tests assert both directions:
- ☑ ☐ ✦ ✺ ♦ ⇒ bullet (So)
- † ‡ § ¶ ※ ⇒ NOT bullet (Po)

Tests: 312 passing in Extraction.Tests (+10 new).

Other items remaining on the bug list are tradeoffs we declined:
• sidenote columns = XYCut territory (intentionally not pursued)
• wrapped TOC entries with hanging indent = context-aware extraction, larger refactor

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
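The category fallback from round 4 maps directly onto Python's `unicodedata` module, which makes for a compact illustration. This is a sketch, not the project's C# code: the `NOISE` set is a hypothetical stand-in for the real NoisePunctuation set, whose contents are not listed on this page, and the fast-path set reuses the glyphs from the PR description.

```python
import unicodedata

FAST_PATH = set("•●▪◦○▫◆‣⁃►❖")  # enumerated whitelist, checked first
NOISE = set("©®™°")              # hypothetical stand-in for NoisePunctuation


def starts_with_bullet(line: str) -> bool:
    words = line.split()
    if not words:
        return False
    first = words[0]
    if first[0] in FAST_PATH:
        return True  # also covers the glued "•You're" form
    # Fallback: a lone character in category So ("Symbol, Other") that is not noise.
    return (
        len(first) == 1
        and unicodedata.category(first) == "So"
        and first not in NOISE
    )


for sample in ["☑ done", "✦ key point", "† see note", "§ 4.2", "© 2026 Someone"]:
    print(sample, starts_with_bullet(sample))
```

The footnote markers fall into category Po in current Unicode data, so they are rejected without needing their own blocklist, while the So check picks up dingbat-style glyphs that no enumerated set anticipated.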

mrviduus and others added 2 commits May 23, 2026 13:37

mrviduus changed the title ~~fix(pdf): bullets → new paragraph, drop TOC chapter, allow re-extract on Ready books~~ fix(pdf): robust paragraph detection (median gap + indent + bullet + TOC drop) May 23, 2026

mrviduus merged commit 8c388ec into main May 23, 2026
5 checks passed

mrviduus deleted the fix/pdf-bullet-paragraph-split branch May 23, 2026 18:06

mrviduus mentioned this pull request May 23, 2026

fix(pdf): bug-report sweep — content-TOC + multi-col + bullets + file-check + root-cause cleanup #245

Merged

mrviduus merged 2 commits into main from fix/pdf-bullet-paragraph-split
