Feature/djvu ocr primary#4

Closed
mrviduus wants to merge 8 commits into main from feature/djvu-ocr-primary

@mrviduus
Owner

No description provided.

mrviduus and others added 8 commits December 29, 2025 14:43
- Switch Worker from Alpine to Debian runtime (SkiaSharp needs glibc)
- Add SkiaSharp.NativeAssets.Linux.NoDependencies + HarfBuzzSharp native libs
- Add DJVU cover extraction using ddjvu + PPM→PNG conversion
- Add CoverExtractionFailed warning code for better error reporting
- Improve error logging with inner exception details
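
The ddjvu + PPM→PNG step can be sketched as follows. The commit does not say how the conversion is implemented (the Worker is .NET and ships SkiaSharp), so this is only a standard-library Python illustration of the same idea: render page 1 with `ddjvu -format=ppm -page=1`, then re-encode the binary PPM (P6) as PNG. The function names are hypothetical.

```python
import struct
import subprocess
import zlib


def render_first_page_ppm(djvu_path: str, ppm_path: str) -> None:
    """Render page 1 of a DJVU file to PPM via the ddjvu CLI (must be installed)."""
    subprocess.run(
        ["ddjvu", "-format=ppm", "-page=1", djvu_path, ppm_path],
        check=True,
    )


def _png_chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    body = kind + data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))


def ppm_to_png(ppm: bytes) -> bytes:
    """Convert a binary PPM (P6, maxval 255) to an 8-bit RGB PNG."""
    # Header tokens: magic, width, height, maxval; '#' starts a comment line.
    tokens, pos = [], 0
    while len(tokens) < 4:
        while ppm[pos:pos + 1].isspace():
            pos += 1
        if ppm[pos:pos + 1] == b"#":
            pos = ppm.index(b"\n", pos) + 1
            continue
        start = pos
        while pos < len(ppm) and not ppm[pos:pos + 1].isspace():
            pos += 1
        tokens.append(ppm[start:pos])
    pos += 1  # a single whitespace byte separates the header from the pixels

    magic, width, height, maxval = tokens[0], int(tokens[1]), int(tokens[2]), int(tokens[3])
    if magic != b"P6" or maxval != 255:
        raise ValueError("only binary PPM (P6) with maxval 255 is handled here")

    row_len = width * 3
    pixels = ppm[pos:pos + row_len * height]
    if len(pixels) != row_len * height:
        raise ValueError("truncated PPM pixel data")

    # Each PNG scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(
        b"\x00" + pixels[y * row_len:(y + 1) * row_len] for y in range(height)
    )
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)  # 8-bit RGB
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", zlib.compress(raw))
        + _png_chunk(b"IEND", b"")
    )
```
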

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add TextProcessingUtils.cs: NormalizeText, PlainTextToHtml, CountWords, ExtractTitleFromFileName
- Add ImageUtils.cs: DetectMimeType from magic bytes
- Remove ~100 lines of duplicate code from PdfTextExtractor, DjvuTextExtractor, PlainTextReader, EpubTextExtractor
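
Magic-byte MIME detection, as named for ImageUtils.DetectMimeType, can be illustrated with a minimal Python sketch. The set of formats below is an assumption; the commit does not list which signatures the real helper checks.

```python
from typing import Optional

# (leading bytes, MIME type). A hypothetical subset of common image formats.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def detect_mime_type(data: bytes) -> Optional[str]:
    """Guess an image MIME type from the first bytes of the file, or None."""
    for signature, mime in _SIGNATURES:
        if data.startswith(signature):
            return mime
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None
```
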

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add page state with limit/offset
- Reuse search-page__pagination styles
- 12 books per page
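
The limit/offset arithmetic behind this pagination, sketched in Python. The page size of 12 comes from the commit; the 1-based page number and the helper names are assumptions.

```python
import math

PAGE_SIZE = 12  # from the commit: 12 books per page


def page_params(page: int, page_size: int = PAGE_SIZE) -> dict:
    """Translate a 1-based page number into limit/offset query parameters."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"limit": page_size, "offset": (page - 1) * page_size}


def total_pages(total_items: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages needed; an empty list still shows one (empty) page."""
    return max(1, math.ceil(total_items / page_size))
```
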

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add PreferOcrOverNativeText to ExtractionOptions
- DjvuTextExtractor skips native djvutxt when option enabled
- OCR via Tesseract used as primary method for better quality
- Add cover extraction to OCR path
- Add DjvuExtractorTests with Moq
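
A rough Python sketch of the selection logic these bullets describe. The option name mirrors PreferOcrOverNativeText; falling back to OCR when native text is empty is an assumption about the non-OCR path, and both extractor callables are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ExtractionOptions:
    prefer_ocr_over_native_text: bool = False


def extract_text(
    path: str,
    options: ExtractionOptions,
    native: Callable[[str], str],
    ocr: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, source), where source is "native" or "ocr"."""
    if not options.prefer_ocr_over_native_text:
        text = native(path)
        if text.strip():
            return text, "native"
        # Assumed behaviour: an empty native layer falls through to OCR.
    # With the option enabled, the native extractor is never called.
    return ocr(path), "ocr"
```
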

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- TesseractCliOcrEngine uses tesseract CLI (works on Linux/ARM64)
- Worker.Dockerfile: add tesseract-ocr, libleptonica-dev packages
- Fix tessdata path to /usr/share/tesseract-ocr/5/tessdata
- docker-compose: add extraction env vars for worker
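
Shelling out to the tesseract CLI might look like the Python sketch below. It uses the documented `tesseract IMAGE OUTPUTBASE [OPTIONS]` form with `stdout` as the output base, plus `-l` and `--tessdata-dir`. How TesseractCliOcrEngine actually builds its arguments is not shown in the commit, so the helper names and defaults are assumptions.

```python
import subprocess
from typing import List, Optional


def build_tesseract_command(
    image_path: str,
    lang: str = "eng",
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build the argument list; "stdout" makes tesseract print the text."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def ocr_image(
    image_path: str,
    lang: str = "eng",
    tessdata_dir: Optional[str] = None,
) -> str:
    """Run tesseract on one page image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang, tessdata_dir),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Passing an argument list rather than a shell string keeps file names with spaces or quotes from being reinterpreted by a shell.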

Tested: 84-page DJVU extracted successfully with clean OCR text

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
mrviduus closed this Jan 21, 2026
mrviduus added a commit that referenced this pull request Apr 24, 2026
- appendChapter: single JSON.stringify'd object, round-trips U+2028/U+2029
  safely instead of interpolating into a JS string literal.
- onLoadEnd: re-inject highlights + vocab (+ inline translations on the
  library reader) from React-side refs so they survive the WebView reload
  that font/theme/line-height settings changes trigger.

Covers audit findings #4 (appendChapter injection) and #5 (overlay lost
on font/theme change).
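
The appendChapter fix follows a general pattern: serialize the payload as one JSON value and embed that, rather than splicing raw text into a JS string literal. A Python sketch of the pattern, where the `window.appendChapter(...)` call shape and the payload fields are assumptions:

```python
import json


def build_append_chapter_script(chapter_id: str, html: str) -> str:
    """Build a JS snippet that passes the chapter as a single JSON object."""
    # json.dumps escapes quotes and backslashes, and with the default
    # ensure_ascii=True it also emits U+2028/U+2029 as \u2028/\u2029,
    # so no raw line separator can end up inside the script source.
    payload = json.dumps({"id": chapter_id, "html": html})
    return f"window.appendChapter({payload});"
```
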
mrviduus added a commit that referenced this pull request May 4, 2026
* library: fix shelf View all links + add per-book Add to collection

- Shelves used wrong query keys (?filter=, ?sort=created_desc).
  Now: ?status=reading|finished|all + ?sort=added (matches existing hooks).
- LibraryPage consumes ?sort= once (sort hook is localStorage-backed).
- BookActionMenu: new "Add to collection" submenu in saved + userbook menus,
  reusing dead addBookToCollection() and useCollections(). Success/error toasts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* library: extract BookDetailHero + align card cover shadows

#3 detail-page duplication:
  - new BookDetailHero (cover + title + author + description + meta + actions slots).
  - BookDetailPage and UserBookDetailPage both render it.
  - user-book-detail__* hero classes replaced by book-hero__* (mark/EPUB
    buttons → book-hero__read-btn--secondary).

#4 card visual drift:
  - library-card and user-book-card cover gain the same shadow as
    continue-shelf__card (0 4px 14px rgba(0,0,0,0.18)).
  - library-card placeholder color/weight aligned to user-book-card.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* library: drop orphan user-book-detail__* CSS, danger-pill for delete

After hero extraction, large blocks of user-book-detail__* CSS have no
references in TS/TSX. Removed: __content/__cover/__info/__title/__author/
__description/__meta/__actions/__read-btn/__mark-btn/__delete-btn and the
mobile overrides + dark-mode rule for __read-btn. -55 lines, CSS bundle
-1.55 kB.

Delete button now uses new book-hero__read-btn--danger modifier (pill,
red text+border) so its shape matches the other hero buttons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* library: lift status banners out of hero, inline Read badge into meta

Processing/Failed banners were being rendered inside book-hero__actions
flex row after the hero refactor, which would distort the row. They are
page-level alerts — moved above the hero.

The "Read" badge moves into the meta slot as an inline-flex span so it
sits next to chapters/pages instead of standing alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mrviduus added a commit that referenced this pull request May 23, 2026
…-check + root-cause cleanup (#245)

* fix(pdf): bug-report sweep — content-TOC + multi-col + bullets + file-check

Four follow-ups from PR #244's bug report list:

1. Content-level TOC detection. The bookmark-title-only path missed
   TOCs that came in via the page-split fallback (no bookmark, chapter
   labeled "Pages 1–15"). Now FrontMatterFilter.LooksLikeTableOfContentsBody
   inspects the plain text: ≥40% of substantive lines ending in a
   leader-dot run (or "…") + page number ⇒ TOC. Same single-chapter
   safety guard.
2. Multi-column / mixed-layout guard for StartsWithIndent. The modal
   left margin is now only trusted when it covers ≥50% of all lines.
   On a 2-column academic paper the modal share is well under half;
   we fall back to y-gap and bullet detection only, instead of
   over-splitting on every column shift.
3. Bullet glyph set expanded — ◇ ❖ ❍ ▶ ▸ ▻ ➤ ➔ ➢ ★ ☆ ✓ ✔ ✗ ✘ to cover
   modern textbook list markers.
4. RetryAsync now probes the backing file via storage.GetFileAsync
   before queuing the job. Previously, a retry against a missing source
   returned success and left the book stuck in Processing forever.
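
Item 1's heuristic can be sketched in Python. The 40% share and the "fewer than 5 significant lines" cutoff are taken from these commit messages; the exact regular expression is an assumption, since the real pattern is not shown.

```python
import re
from typing import Iterable, Optional

# A run of 3+ dots (optionally space-separated) or an ellipsis,
# followed by a page number at the end of the line. Assumed pattern.
_TOC_LINE = re.compile(r"(?:(?:\.\s*){3,}|…+)\s*\d+\s*$")

MIN_SIGNIFICANT_LINES = 5
MIN_TOC_SHARE = 0.40


def looks_like_toc_body(lines: Optional[Iterable[str]]) -> bool:
    """True when enough non-blank lines end in leader dots plus a page number."""
    if lines is None:
        return False
    significant = [ln.strip() for ln in lines if ln and ln.strip()]
    if len(significant) < MIN_SIGNIFICANT_LINES:
        return False
    hits = sum(1 for ln in significant if _TOC_LINE.search(ln))
    return hits / len(significant) >= MIN_TOC_SHARE
```
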

Tests: 273 passing in Extraction.Tests — 8 new across content-TOC
detection (leader-dot, ellipsis, prose negative, too-short, null) and
expanded bullet glyph coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 2 — root-cause-first cleanup

Senior-dev pass over the bug reports from round 1. Each one was traced
to its underlying invariant; the fix removes the bug instead of just
working around the symptom.

PR #245 round-1 plain-text TOC detection silently DID NOT WORK
============================================================
Root cause: ProcessingPipeline.ExtractPlainText runs
    WhitespaceRegex.Replace(text, " ")
which collapses '\n' into a single space. LooksLikeTableOfContentsBody
split on '\n', got 1 line, < 5 significant → always returned false.
Fix: detection now operates on the chapter HTML, splitting on
</p>|</h\d>|</li> instead of newlines. The HTML retains paragraph
boundaries by construction (PdfToHtmlConverter emits one <p> per
extracted paragraph).
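
The failure mode is easy to reproduce in a few lines of Python. The `\s+` pattern is an assumption about what WhitespaceRegex matches; the commit only says it collapses '\n' into a single space.

```python
import re

WHITESPACE = re.compile(r"\s+")  # assumed equivalent of WhitespaceRegex

text = "Preface ..... 1\nChapter 1 ..... 12\nChapter 2 ..... 30"
collapsed = WHITESPACE.sub(" ", text)

lines_before = text.split("\n")      # three lines, one per TOC entry
lines_after = collapsed.split("\n")  # a single line: the newlines are gone
```

Any line-count threshold applied to `lines_after` sees one line no matter how long the chapter is, which is why the round-1 detector could never fire.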

#1  Index/Glossary false positive on content-detected TOC
=========================================================
Root cause: Index and Glossary are *also* leader-dotted "term … 47".
The single "look like TOC body" signal isn't enough.
Fix: two extra guards.
 - Position: only drop chapters in the front half of the book.
 - Title: new IsKnownBackMatter — vetoes the drop when the bookmark
   title is Index / Glossary / Bibliography / References / Notes /
   Abbreviations / Colophon (en + ru/uk).
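
The two guards combine roughly as in this Python sketch. Only the English titles are included, and measuring "front half" by chapter index is an assumption; the commit does not say whether position is counted in chapters or pages.

```python
from typing import Optional

# English subset of the back-matter titles listed above.
KNOWN_BACK_MATTER = frozenset({
    "index", "glossary", "bibliography", "references",
    "notes", "abbreviations", "colophon",
})


def is_known_back_matter(title: Optional[str]) -> bool:
    """True when the bookmark title names a known back-matter section."""
    return bool(title) and title.strip().casefold() in KNOWN_BACK_MATTER


def should_drop_as_toc(
    looks_like_toc: bool,
    chapter_index: int,
    chapter_count: int,
    title: Optional[str],
) -> bool:
    """Drop only TOC-looking chapters that are early and not back matter."""
    if not looks_like_toc:
        return False
    is_front_half = chapter_index < chapter_count / 2  # 0-based index
    return is_front_half and not is_known_back_matter(title)
```
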

#2  Multi-column threshold flips on borderline pages
====================================================
Root cause: hard 50% modal-coverage cutoff is brittle.
Fix: dominance ratio. Modal margin trusted only when its count is
≥ 2.5× the runner-up. Real 2-column pages sit near 1.0× (~40/40);
single-column body pages sit near 17× (~85/5). Cutoff is far from
either distribution.
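
A minimal Python sketch of the dominance-ratio check, assuming left margins have already been rounded into comparable buckets. Trusting a page with a single distinct margin, and distrusting an empty one, are assumptions.

```python
from collections import Counter
from typing import Iterable

DOMINANCE_RATIO = 2.5  # from the commit message


def modal_margin_is_trusted(left_margins: Iterable[int]) -> bool:
    """Trust the most common left margin only if it clearly dominates."""
    top_two = Counter(left_margins).most_common(2)
    if not top_two:
        return False
    if len(top_two) == 1:
        return True  # only one margin on the page: nothing competes with it
    (_, top_count), (_, runner_up_count) = top_two
    return top_count >= DOMINANCE_RATIO * runner_up_count
```

With the figures quoted above, a 40/40 two-column page gives a ratio of 1.0 and is rejected, while an 85/5 single-column page gives 17 and is trusted.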

#3  RetryAsync opened a file stream just to probe existence
===========================================================
Root cause: IFileStorageService had no existence primitive, so the
guard had to use the heavy GetFileAsync.
Fix: new ExistsAsync(path) on the interface, implemented in
LocalFileStorageService as File.Exists(GetFullPath). RetryAsync uses
it; old GetFileAsync path replaced.
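
The shape of this change, sketched in Python with hypothetical names: a cheap existence check on the storage abstraction, used as a guard before queuing.

```python
from pathlib import Path
from typing import Callable


class LocalFileStorage:
    """Stand-in for a local storage service rooted at one directory."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def _full_path(self, relative: str) -> Path:
        return self._root / relative

    def exists(self, relative: str) -> bool:
        # A metadata lookup only; no file handle or stream is opened.
        return self._full_path(relative).is_file()


def retry(storage: LocalFileStorage, source_path: str,
          enqueue: Callable[[str], None]) -> bool:
    """Queue a retry only when the source file is still present."""
    if not storage.exists(source_path):
        return False  # report failure instead of queuing a doomed job
    enqueue(source_path)
    return True
```
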

Drive-by cleanup: PdfToHtmlConverter.plainBuilder
=================================================
Built up via AppendLine in the loop, never used — the returned
plainText comes from HtmlCleaner.Clean's pipeline. Removed to stop
implying the local copy is authoritative.

Tests: 287 passing in Extraction.Tests (+14 new across
IsKnownBackMatter, HTML-based LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 3 — root-cause refactor

Three remaining bug reports addressed at their underlying invariant.

#2 (HTML coupling) — root cause: detection ran on chapter HTML and
split on </p>|</h\d>|</li>, coupling it to PdfToHtmlConverter's
markup. The actual signal was always per-paragraph plain text; we
just lost it during HTML conversion. Refactor: LooksLikeTableOfContentsBody
now takes IEnumerable<string> of paragraph texts, called BEFORE HTML
conversion in PdfTextExtractor. Bonus: skips the HtmlCleaner pipeline
entirely for chapters we're about to drop.

#4 (localized back-matter) — root cause: whitelist was English +
Russian/Ukrainian only. Extended to German, French, Spanish, Italian,
Portuguese for Glossary / Bibliography / References / Notes / Appendix
plus their localized forms.

#3 (sidenote dominance) — analysis: with sidenotes the typical split
is ~70/25 = 2.8× which is above the 2.5× threshold, so the multi-column
guard correctly keeps trusting the modal margin. Sidenote lines (a
small minority) are then NOT treated as indent breaks — they just
register as regular paragraph content. The body remains correctly
split. No code change needed; documenting the analysis.

Latent bug: the IsTableOfContents drop had no position guard, so a
chapter titled "Indice" (Italian) or "Índice" (Spanish), a word that
also names the Index at the back of the book, would be wrongly dropped
when it is the Index. Added the isFrontHalf guard to the title-based
drop path too.

Tests: 302 passing in Extraction.Tests (+15 new across IsKnownBackMatter
in 7 languages + string-list LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 4 — Unicode-category bullet detection

Last actionable item from the bug list: hardcoded bullet set missed
custom dingbat-font glyphs in modern textbooks. Root cause: detection
was an enumerated whitelist; new fonts shipped new shapes; we kept
playing whack-a-mole.

Generalization: after the fast-path whitelist check, fall back to
Unicode category lookup. If the first word is a single character in
category "Symbol, Other" (So) AND not in the existing
NoisePunctuation set, treat it as a bullet.

Po (Punctuation Other) is deliberately excluded — that category
contains † ‡ § ¶ ※ which are footnote-reference markers, not
paragraph starts. Tests assert both directions:
  - ☑ ☐ ✦ ✺ ♦ ⇒ bullet (So)
  - † ‡ § ¶ ※ ⇒ NOT bullet (Po)
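
The same rule can be sketched in Python with the standard `unicodedata` module. The fast-path set reuses the glyphs listed in round 1 plus "•"; the contents of the noise set are hypothetical, since NoisePunctuation is not shown.

```python
import unicodedata

# Enumerated fast path: glyphs from the round-1 list, plus the common "•".
FAST_PATH_BULLETS = frozenset("•◇❖❍▶▸▻➤➔➢★☆✓✔✗✘")
# Hypothetical stand-in for NoisePunctuation: So symbols that are not bullets.
NOISE_PUNCTUATION = frozenset("©®™°")


def is_bullet_token(first_word: str) -> bool:
    """Decide whether the first word of a line is a list-bullet glyph."""
    if len(first_word) != 1:
        return False
    if first_word in FAST_PATH_BULLETS:
        return True
    # Fallback: general category "Symbol, Other" (So), minus known noise.
    # "Punctuation, Other" (Po) is excluded on purpose: it holds † ‡ § ¶ ※.
    return (
        unicodedata.category(first_word) == "So"
        and first_word not in NOISE_PUNCTUATION
    )
```

The whitelist still matters in this sketch because "•" (U+2022) is itself category Po and would be missed by the fallback alone.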

Tests: 312 passing in Extraction.Tests (+10 new).

Other items remaining on the bug list are tradeoffs we declined:
  • sidenote columns = XYCut territory (intentionally not pursued)
  • wrapped TOC entries with hanging indent = context-aware
    extraction, larger refactor

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>