Feature/djvu ocr primary by mrviduus · Pull Request #4 · mrviduus/textstack

mrviduus · 2025-12-30T01:47:40Z

No description provided.

- Switch Worker from Alpine to Debian runtime (SkiaSharp needs glibc)
- Add SkiaSharp.NativeAssets.Linux.NoDependencies + HarfBuzzSharp native libs
- Add DJVU cover extraction using ddjvu + PPM→PNG conversion
- Add CoverExtractionFailed warning code for better error reporting
- Improve error logging with inner exception details

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


- Add TextProcessingUtils.cs: NormalizeText, PlainTextToHtml, CountWords, ExtractTitleFromFileName
- Add ImageUtils.cs: DetectMimeType from magic bytes
- Remove ~100 lines of duplicate code from PdfTextExtractor, DjvuTextExtractor, PlainTextReader, EpubTextExtractor

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
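For readers unfamiliar with magic-byte detection, here is a small Python sketch of the idea behind `DetectMimeType`. The actual C# helper is not shown in this PR page, so the set of formats, the fallback value, and the name `detect_mime_type` are assumptions made for illustration only.

```python
# Well-known file signatures (leading bytes) for common cover image formats.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def detect_mime_type(data: bytes, default: str = "application/octet-stream") -> str:
    """Guess an image MIME type from the first bytes of the payload."""
    for signature, mime in _SIGNATURES:
        if data.startswith(signature):
            return mime
    # WebP is a RIFF container: 'RIFF' + 4-byte size + 'WEBP'.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return default  # hypothetical fallback; the real helper may behave differently
```

Checking the content rather than the file extension matters for covers pulled out of EPUB, PDF, or DJVU containers, where the embedded image often has no reliable name.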

- Add page state with limit/offset
- Reuse search-page__pagination styles
- 12 books per page

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
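The limit/offset arithmetic behind this pagination is simple enough to show in a few lines. This is a hedged Python sketch of the calculation only; the real page is a React component, and the names `page_params` and `total_pages` are hypothetical. The page size of 12 comes from the commit message.

```python
PAGE_SIZE = 12  # "12 books per page" per the commit message


def page_params(page: int, page_size: int = PAGE_SIZE) -> dict:
    """Translate a 1-based page number into limit/offset query parameters."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"limit": page_size, "offset": (page - 1) * page_size}


def total_pages(total_items: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages needed; an empty list still renders one (empty) page."""
    return max(1, -(-total_items // page_size))  # ceiling division
```

For example, page 3 maps to `limit=12, offset=24`, and 25 books need 3 pages.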

- Add PreferOcrOverNativeText to ExtractionOptions
- DjvuTextExtractor skips native djvutxt when option enabled
- OCR via Tesseract used as primary method for better quality
- Add cover extraction to OCR path
- Add DjvuExtractorTests with Moq

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
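The control flow that `PreferOcrOverNativeText` introduces can be sketched as follows. This is an illustrative Python model, not the C# `DjvuTextExtractor`: the callables stand in for the native `djvutxt` path and the Tesseract path, and falling back to OCR when the native text layer is empty is an assumption, not something the commit message states.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ExtractionOptions:
    # Mirrors the option name from the commit; defaults to the native path first.
    prefer_ocr_over_native_text: bool = False


def extract_text(
    options: ExtractionOptions,
    native_text: Callable[[], Optional[str]],  # stand-in for the djvutxt path
    ocr_text: Callable[[], str],               # stand-in for the Tesseract path
) -> Tuple[str, str]:
    """Return (text, source), where source is 'native' or 'ocr'."""
    if not options.prefer_ocr_over_native_text:
        text = native_text()
        if text and text.strip():
            return text, "native"
        # Assumption: an empty native layer falls through to OCR.
    # When the option is enabled, the native extractor is never invoked.
    return ocr_text(), "ocr"
```

Passing the two paths as callables also mirrors why the commit can test the extractor with mocks: a test can assert that the native path was never called when the option is on.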


- TesseractCliOcrEngine uses tesseract CLI (works on Linux/ARM64)
- Worker.Dockerfile: add tesseract-ocr, libleptonica-dev packages
- Fix tessdata path to /usr/share/tesseract-ocr/5/tessdata
- docker-compose: add extraction env vars for worker

Tested: 84-page DJVU extracted successfully with clean OCR text

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
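Driving the `tesseract` CLI from a worker process looks roughly like the Python sketch below. It is not the project's `TesseractCliOcrEngine`; the function names and the default language are assumptions. The flags used (`-l`, `--tessdata-dir`, and `stdout` as the output base) are standard tesseract command-line options, and the tessdata path in the usage note is the one given in the commit message.

```python
import subprocess
from typing import List, Optional


def build_tesseract_command(
    image_path: str,
    language: str = "eng",          # assumed default; the project's value is not shown
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build the argv for one OCR call that prints recognized text to stdout."""
    # 'stdout' as the output base makes tesseract write text to standard output.
    command = ["tesseract", image_path, "stdout", "-l", language]
    if tessdata_dir:
        command += ["--tessdata-dir", tessdata_dir]
    return command


def run_ocr(image_path: str, **kwargs) -> str:
    """Run tesseract on a page image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

In the Debian-based worker image, a call might look like `run_ocr("page-001.png", tessdata_dir="/usr/share/tesseract-ocr/5/tessdata")`, where the image file name is a placeholder.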

- appendChapter: single JSON.stringify'd object, round-trips U+2028/U+2029 safely instead of interpolating into a JS string literal.
- onLoadEnd: re-inject highlights + vocab (+ inline translations on the library reader) from React-side refs so they survive the WebView reload that font/theme/line-height settings changes trigger.

Covers audit findings #4 (appendChapter injection) and #5 (overlay lost on font/theme change).


* library: fix shelf View all links + add per-book Add to collection

- Shelves used wrong query keys (?filter=, ?sort=created_desc). Now: ?status=reading|finished|all + ?sort=added (matches existing hooks).
- LibraryPage consumes ?sort= once (sort hook is localStorage-backed).
- BookActionMenu: new "Add to collection" submenu in saved + userbook menus, reusing dead addBookToCollection() and useCollections(). Success/error toasts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* library: extract BookDetailHero + align card cover shadows

#3 detail-page duplication:
- new BookDetailHero (cover + title + author + description + meta + actions slots).
- BookDetailPage and UserBookDetailPage both render it.
- user-book-detail__* hero classes replaced by book-hero__* (mark/EPUB buttons → book-hero__read-btn--secondary).

#4 card visual drift:
- library-card and user-book-card cover gain the same shadow as continue-shelf__card (0 4px 14px rgba(0,0,0,0.18)).
- library-card placeholder color/weight aligned to user-book-card.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* library: drop orphan user-book-detail__* CSS, danger-pill for delete

After hero extraction, large blocks of user-book-detail__* CSS have no references in TS/TSX. Removed: __content/__cover/__info/__title/__author/__description/__meta/__actions/__read-btn/__mark-btn/__delete-btn and the mobile overrides + dark-mode rule for __read-btn. -55 lines, CSS bundle -1.55 kB.

Delete button now uses new book-hero__read-btn--danger modifier (pill, red text+border) so its shape matches the other hero buttons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* library: lift status banners out of hero, inline Read badge into meta

Processing/Failed banners were being rendered inside book-hero__actions flex row after the hero refactor, which would distort the row. They are page-level alerts, so they moved above the hero.

The "Read" badge moves into the meta slot as an inline-flex span so it sits next to chapters/pages instead of standing alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


…-check + root-cause cleanup (#245)

* fix(pdf): bug-report sweep — content-TOC + multi-col + bullets + file-check

Four follow-ups from PR #244's bug report list:

1. Content-level TOC detection. The bookmark-title-only path missed TOCs that came in via the page-split fallback (no bookmark, chapter labeled "Pages 1–15"). Now FrontMatterFilter.LooksLikeTableOfContentsBody inspects the plain text: ≥40% of substantive lines ending in a leader-dot run (or "…") + page number ⇒ TOC. Same single-chapter safety guard.
2. Multi-column / mixed-layout guard for StartsWithIndent. The modal left margin is now only trusted when it covers ≥50% of all lines. On a 2-column academic paper the modal share is well under half; we fall back to y-gap and bullet detection only, instead of over-splitting on every column shift.
3. Bullet glyph set expanded (◇ ❖ ❍ ▶ ▸ ▻ ➤ ➔ ➢ ★ ☆ ✓ ✔ ✗ ✘) to cover modern textbook list markers.
4. RetryAsync now probes the backing file via storage.GetFileAsync before queuing the job. A missing source returned success and left the book stuck in Processing forever.

Tests: 273 passing in Extraction.Tests, 8 new across content-TOC detection (leader-dot, ellipsis, prose negative, too-short, null) and expanded bullet glyph coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 2 — root-cause-first cleanup

Senior-dev pass over the bug reports from round 1. Each one was traced to its underlying invariant; the fix removes the bug instead of just working around the symptom.

PR #245 round-1 plain-text TOC detection silently DID NOT WORK
============================================================
Root cause: ProcessingPipeline.ExtractPlainText runs WhitespaceRegex.Replace(text, " ") which collapses '\n' into a single space. LooksLikeTableOfContentsBody split on '\n', got 1 line, < 5 significant → always returned false.
Fix: detection now operates on the chapter HTML, splitting on </p>|</h\d>|</li> instead of newlines. The HTML retains paragraph boundaries by construction (PdfToHtmlConverter emits one <p> per extracted paragraph).

#1 Index/Glossary false positive on content-detected TOC
=========================================================
Root cause: Index and Glossary are *also* leader-dotted "term … 47". The single "look like TOC body" signal isn't enough.
Fix: two extra guards.
- Position: only drop chapters in the front half of the book.
- Title: new IsKnownBackMatter, which vetoes the drop when the bookmark title is Index / Glossary / Bibliography / References / Notes / Abbreviations / Colophon (en + ru/uk).

#2 Multi-column threshold flips on borderline pages
====================================================
Root cause: hard 50% modal-coverage cutoff is brittle.
Fix: dominance ratio. Modal margin trusted only when its count is ≥ 2.5× the runner-up. Real 2-column pages sit near 1.0× (~40/40); single-column body pages sit near 17× (~85/5). Cutoff is far from either distribution.

#3 RetryAsync opened a file stream just to probe existence
===========================================================
Root cause: IFileStorageService had no existence primitive, so the guard had to use the heavy GetFileAsync.
Fix: new ExistsAsync(path) on the interface, implemented in LocalFileStorageService as File.Exists(GetFullPath). RetryAsync uses it; old GetFileAsync path replaced.

Drive-by cleanup: PdfToHtmlConverter.plainBuilder
=================================================
Built up via AppendLine in the loop, never used; the returned plainText comes from HtmlCleaner.Clean's pipeline. Removed to stop implying the local copy is authoritative.

Tests: 287 passing in Extraction.Tests (+14 new across IsKnownBackMatter, HTML-based LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 3 — root-cause refactor

Three remaining bug reports addressed at their underlying invariant.

#2 (HTML coupling). Root cause: detection ran on chapter HTML and split on </p>|</h\d>|</li>, coupling it to PdfToHtmlConverter's markup. The actual signal was always per-paragraph plain text; we just lost it during HTML conversion. Refactor: LooksLikeTableOfContentsBody now takes IEnumerable<string> of paragraph texts, called BEFORE HTML conversion in PdfTextExtractor. Bonus: skips the HtmlCleaner pipeline entirely for chapters we're about to drop.

#4 (localized back-matter). Root cause: whitelist was English + Russian/Ukrainian only. Extended to German, French, Spanish, Italian, Portuguese for Glossary / Bibliography / References / Notes / Appendix plus their localized forms.

#3 (sidenote dominance). Analysis: with sidenotes the typical split is ~70/25 = 2.8× which is above the 2.5× threshold, so the multi-column guard correctly keeps trusting the modal margin. Sidenote lines (a small minority) are then NOT treated as indent breaks; they just register as regular paragraph content. The body remains correctly split. No code change needed; documenting the analysis.

Latent bug: IsTableOfContents drop had no position guard, so an Italian "Indice" / Spanish "Índice" (same word means "Index" at the back of the book) would be mis-dropped when it's the Index. Added isFrontHalf guard to the title-based drop path too.

Tests: 302 passing in Extraction.Tests (+15 new across IsKnownBackMatter in 7 languages + string-list LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 4 — Unicode-category bullet detection

Last actionable item from the bug list: hardcoded bullet set missed custom dingbat-font glyphs in modern textbooks.

Root cause: detection was an enumerated whitelist; new fonts shipped new shapes; we kept playing whack-a-mole.

Generalization: after the fast-path whitelist check, fall back to Unicode category lookup. If the first word is a single character in category "Symbol, Other" (So) AND not in the existing NoisePunctuation set, treat it as a bullet. Po (Punctuation Other) is deliberately excluded: that category contains † ‡ § ¶ ※ which are footnote-reference markers, not paragraph starts.

Tests assert both directions:
- ☑ ☐ ✦ ✺ ♦ ⇒ bullet (So)
- † ‡ § ¶ ※ ⇒ NOT bullet (Po)

Tests: 312 passing in Extraction.Tests (+10 new).

Other items remaining on the bug list are tradeoffs we declined:
• sidenote columns = XYCut territory (intentionally not pursued)
• wrapped TOC entries with hanging indent = context-aware extraction, larger refactor

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
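The round-4 rule (whitelist fast path, then a Unicode general-category fallback) can be sketched in Python with the standard `unicodedata` module. This is an illustration of the described rule, not the C# code: the whitelist below contains only the glyphs quoted in round 1, and the contents of `NOISE_SYMBOLS` are hypothetical because the real NoisePunctuation set is not shown here.

```python
import unicodedata

# Fast-path whitelist: only the glyphs quoted in the round-1 commit message.
KNOWN_BULLETS = frozenset("◇❖❍▶▸▻➤➔➢★☆✓✔✗✘")

# Hypothetical stand-in for the project's NoisePunctuation set: symbols in
# category So that should not start a list item.
NOISE_SYMBOLS = frozenset("©®™°")


def is_bullet_token(first_word: str) -> bool:
    """Decide whether the first word of a line is a bullet marker."""
    if len(first_word) != 1:
        return False  # bullets are a single character in this rule
    if first_word in KNOWN_BULLETS:
        return True   # fast path: enumerated whitelist
    if first_word in NOISE_SYMBOLS:
        return False
    # Fallback: "Symbol, Other" (So) counts as a bullet. "Punctuation, Other"
    # (Po) is excluded because it holds footnote markers such as † and ‡.
    return unicodedata.category(first_word) == "So"
```

With this rule, `☑` and `♦` are accepted through the category fallback, while `†` and `‡` are rejected because they are in Po.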

mrviduus and others added 8 commits December 29, 2025 14:43

test: E2E cover extraction tests for PDF/DJVU/EPUB

ffbb1c0

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: truncate long book titles in card grid

14a945a

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

feat: add pagination to books page

e92d0cc

- Add page state with limit/offset
- Reuse search-page__pagination styles
- 12 books per page

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

feat: add PreferOcrOverNativeText config to Worker

eccb6c6

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

mrviduus closed this Jan 21, 2026

mrviduus mentioned this pull request May 4, 2026

library: shelves + sections merge + Add-to-collection on details + author on saved #203

Merged

11 tasks

mrviduus wants to merge 8 commits into main from feature/djvu-ocr-primary
