feat: add full-text search library with PostgreSQL FTS by mrviduus · Pull Request #2 · mrviduus/textstack

mrviduus · 2025-12-24T03:38:35Z

Summary

Add OnlineLib.Search library with abstractions and PostgreSQL FTS provider
Implement suggestions (title/author autocomplete) and chapter search
Add search results page with pagination at /:lang/search?q=term
Add "View all results" link in search dropdown
Add Cmd/Ctrl+K keyboard shortcut to focus search
Rich suggestion cards with cover, title, author
Search highlighting in results
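
A minimal sketch of how a paginated, highlighted chapter search could be expressed against PostgreSQL FTS. The `chapters` table, its `plain_text` and `search_vector` columns, and the `buildChapterSearchQuery` helper are hypothetical and not taken from OnlineLib.Search; only the PostgreSQL functions `websearch_to_tsquery`, `ts_headline`, and `ts_rank` are real.

```typescript
// Hypothetical names throughout: the table, columns, and this helper are
// illustrative assumptions, not the library's actual schema or API.
interface SearchQuery {
  text: string;
  values: Array<string | number>;
}

function buildChapterSearchQuery(
  term: string,
  ftsConfig: string, // a PostgreSQL text search configuration, e.g. "english"
  page: number, // 1-based page number
  pageSize: number
): SearchQuery {
  const safePage = Math.max(1, Math.floor(page));
  const offset = (safePage - 1) * pageSize;
  const text = [
    "SELECT c.id, c.title,",
    "       ts_headline($2::regconfig, c.plain_text, q.query) AS snippet,",
    "       ts_rank(c.search_vector, q.query) AS rank",
    "FROM chapters c,",
    "     websearch_to_tsquery($2::regconfig, $1) AS q(query)",
    "WHERE c.search_vector @@ q.query",
    "ORDER BY rank DESC",
    "LIMIT $3 OFFSET $4",
  ].join("\n");
  // Values are passed as bind parameters, never interpolated into the SQL.
  return { text, values: [term.trim(), ftsConfig, pageSize, offset] };
}
```

The returned object has the `{ text, values }` shape that parameterized-query clients commonly accept, so the search term from `?q=term` never ends up inside the SQL string itself.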

Test plan

162 unit tests passing
Manual testing in Chrome: search, suggestions, pagination, navigation

🤖 Generated with Claude Code

- Add OnlineLib.Search library with abstractions and PostgreSQL FTS provider
- Implement suggestions (title/author autocomplete) and chapter search
- Add search results page with pagination at /:lang/search?q=term
- Add "View all results" link in search dropdown
- Add Cmd/Ctrl+K keyboard shortcut to focus search
- Rich suggestion cards with cover, title, author
- Search highlighting in results
- 162 unit tests for search library

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Bug #2: web useRestoreProgress only fetched once per hook mount. When a user logged in mid-reading, savedProgress stayed empty until reload. Fixed with composite-key dedupe (editionId:isAuthenticated) so auth transitions re-fetch. Guard: post-login refetches update savedProgress for bookmarks/UI but do NOT trigger navigation, because mid-session jumps are disruptive UX.

Bug #1: mobile reader saved progress to the server only. Network flap → progress lost, and ContinueReadingCard showed a stale chapter on next open. Saving was also skipped entirely for guests. Fix: always persist to AsyncStorage (new progressStorage helper) with an updatedAt stamp; server PUT is still gated on auth. ContinueReadingCard merges server+local by LWW timestamp and tolerates each API failure independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
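
The two ideas in this commit, the composite dedupe key and the last-write-wins merge, can be sketched as pure functions. The `ProgressRecord` shape and both function names are assumptions made for illustration; the real hook and helper are not shown in this thread.

```typescript
// Hypothetical shape: field names are assumptions for illustration only.
interface ProgressRecord {
  chapterSlug: string;
  updatedAt: number; // epoch milliseconds
}

// Composite dedupe key: a change in auth state produces a new key, so a
// fetch that was already done as a guest is repeated after login.
function fetchKey(editionId: string, isAuthenticated: boolean): string {
  return editionId + ":" + String(isAuthenticated);
}

// Last-write-wins merge. Either side may be null (guest user, or a failed
// call), in which case the other side is used as-is.
function mergeProgress(
  server: ProgressRecord | null,
  local: ProgressRecord | null
): ProgressRecord | null {
  if (!server) return local;
  if (!local) return server;
  // Ties go to the server copy; that tie-break is an arbitrary choice here.
  return local.updatedAt > server.updatedAt ? local : server;
}
```

Keeping the merge free of I/O means each source can fail independently and simply contribute `null`.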

Cache-first paint + API refresh pattern, backed by AsyncStorage:

- readerOfflineCache: per-edition highlights, per-user-book highlights, and the shared vocab map. Loads render immediately, API .then() overwrites on success, .catch() keeps the cache paint.
- Write-through on highlight CRUD (create/note/delete) in both readers.
- Vocab map flushed on selection close so words saved this session survive an offline cold start.
- clearReaderCache() wired into signOut + onAuthFailure to match the existing B-43 hygiene for progress + vocab stats.

Covers audit finding #2 (offline persist).
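
The cache-first paint described above can be sketched roughly as follows. `KeyValueStore` is a hypothetical stand-in for AsyncStorage (only `getItem`/`setItem` semantics are assumed), and `loadCacheFirst` is an illustrative name, not the actual readerOfflineCache API.

```typescript
// Hypothetical stand-in for AsyncStorage so the sketch runs anywhere.
interface KeyValueStore {
  getItem(key: string): Promise<string | null>;
  setItem(key: string, value: string): Promise<void>;
}

async function loadCacheFirst<T>(
  store: KeyValueStore,
  key: string,
  fetchFresh: () => Promise<T>,
  render: (data: T, source: "cache" | "api") => void
): Promise<void> {
  const cached = await store.getItem(key);
  if (cached !== null) {
    render(JSON.parse(cached) as T, "cache"); // paint immediately from cache
  }
  try {
    const fresh = await fetchFresh();
    render(fresh, "api"); // API result overwrites the cache paint
    await store.setItem(key, JSON.stringify(fresh)); // refresh the cache
  } catch {
    // Offline or API failure: keep whatever the cache already painted.
  }
}
```

A failed refresh is swallowed on purpose in this sketch; a real implementation would probably also guard against a corrupted cache entry before calling `JSON.parse`.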

Bug #2 from PR thread: prior code sorted saved and uploads independently and concatenated them (saved-first). With "Recently opened" sort a freshly opened upload could end up below an older saved book. Now: tag every visible item by kind, run a single comparator over the combined list. Processing/Failed uploads still pin to the top via an attention-rank pre-check. FTS content-search override stays uploads-only (saved books have no content FTS).

Bug #3 cleanup: useLibrarySearch's `tab` arg was leftover from the two-instance setup; only one consumer remains, drop it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
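
A sketch of the combined sort, assuming a simplified `LibraryEntry` shape (hypothetical, not the project's real type): items are tagged by kind, merged into one list, and ordered by a single comparator whose first key is the attention rank.

```typescript
// Hypothetical item shape; field names are assumptions for illustration.
type UploadStatus = "Ready" | "Processing" | "Failed";

interface LibraryEntry {
  kind: "saved" | "upload";
  title: string;
  lastOpenedAt: number; // epoch milliseconds
  status?: UploadStatus; // only meaningful for uploads
}

// 0 = needs attention (pinned to the top), 1 = everything else.
function attentionRank(e: LibraryEntry): number {
  const needsAttention =
    e.kind === "upload" && (e.status === "Processing" || e.status === "Failed");
  return needsAttention ? 0 : 1;
}

// One comparator over the combined list, instead of sorting each kind
// separately and concatenating saved-first.
function sortRecentlyOpened(
  saved: LibraryEntry[],
  uploads: LibraryEntry[]
): LibraryEntry[] {
  return saved
    .concat(uploads)
    .sort(
      (a, b) =>
        attentionRank(a) - attentionRank(b) || b.lastOpenedAt - a.lastOpenedAt
    );
}
```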

Bug #2 (correctness): collection sidebar filter previously only fetched the collection's books for one type (saved OR uploads, depending on a stale activeTab). In unified mode that meant filtering only one half of the grid. Now: fetch saved+upload book IDs in parallel, both filter predicates apply independently. activeTab state removed (dead code).

Bug #1: LibraryItemDto adds Author (primary, by ea.Order). GetLibrary + AddToLibrary endpoints project it. Frontend LibraryItem type updated; combined sort by 'author' now uses the real value for saved items, and the search query also matches author. Author rendered under the title in both list and grid views (matches uploads UX).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
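
The collection filter fix might look roughly like this. The fetchers are injected and hypothetical (the real endpoints are not shown in this thread), as are the `GridItem` shape and both function names.

```typescript
// Hypothetical shapes and names, for illustration only.
interface GridItem {
  kind: "saved" | "upload";
  id: string;
}

interface CollectionIds {
  saved: Set<string>;
  uploads: Set<string>;
}

// Fetch both ID lists in parallel rather than only the one for the active tab.
async function loadCollectionIds(
  fetchSavedIds: () => Promise<string[]>,
  fetchUploadIds: () => Promise<string[]>
): Promise<CollectionIds> {
  const [saved, uploads] = await Promise.all([fetchSavedIds(), fetchUploadIds()]);
  return { saved: new Set(saved), uploads: new Set(uploads) };
}

// Each kind is checked against its own ID set, so an ID that happens to
// exist in both namespaces cannot leak across kinds.
function filterByCollection(items: GridItem[], ids: CollectionIds): GridItem[] {
  return items.filter((item) =>
    item.kind === "saved" ? ids.saved.has(item.id) : ids.uploads.has(item.id)
  );
}
```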

Bug #1 follow-up: FirstOrDefault dropped co-authors. Match the catalog detail page (which joins all authors). LibraryItemDto.Author is now the comma-joined name list, projected to a memory list before being joined (EF can't translate string.Join over nav).

Bug #2: AddToLibrary used Include().ThenInclude() to materialise full EditionAuthor + Author entities just to read names. Replaced with a single projection query that pulls only the columns it needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Senior-dev pass over the bug reports from round 1. Each one was traced to its underlying invariant; the fix removes the bug instead of just working around the symptom.

PR #245 round-1 plain-text TOC detection silently DID NOT WORK
============================================================
Root cause: ProcessingPipeline.ExtractPlainText runs WhitespaceRegex.Replace(text, " ") which collapses '\n' into a single space. LooksLikeTableOfContentsBody split on '\n', got 1 line, < 5 significant → always returned false.

Fix: detection now operates on the chapter HTML, splitting on </p>|</h\d>|</li> instead of newlines. The HTML retains paragraph boundaries by construction (PdfToHtmlConverter emits one <p> per extracted paragraph).

#1 Index/Glossary false positive on content-detected TOC
=========================================================
Root cause: Index and Glossary are *also* leader-dotted "term … 47". The single "look like TOC body" signal isn't enough.

Fix: two extra guards.
- Position: only drop chapters in the front half of the book.
- Title: new IsKnownBackMatter vetoes the drop when the bookmark title is Index / Glossary / Bibliography / References / Notes / Abbreviations / Colophon (en + ru/uk).

#2 Multi-column threshold flips on borderline pages
====================================================
Root cause: hard 50% modal-coverage cutoff is brittle.

Fix: dominance ratio. Modal margin trusted only when its count is ≥ 2.5× the runner-up. Real 2-column pages sit near 1.0× (~40/40); single-column body pages sit near 17× (~85/5). Cutoff is far from either distribution.

#3 RetryAsync opened a file stream just to probe existence
===========================================================
Root cause: IFileStorageService had no existence primitive, so the guard had to use the heavy GetFileAsync.

Fix: new ExistsAsync(path) on the interface, implemented in LocalFileStorageService as File.Exists(GetFullPath). RetryAsync uses it; old GetFileAsync path replaced.

Drive-by cleanup: PdfToHtmlConverter.plainBuilder
=================================================
Built up via AppendLine in the loop, never used; the returned plainText comes from HtmlCleaner.Clean's pipeline. Removed to stop implying the local copy is authoritative.

Tests: 287 passing in Extraction.Tests (+14 new across IsKnownBackMatter, HTML-based LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
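
The dominance-ratio check from item #2 can be illustrated with a short sketch. It is written in TypeScript for consistency with the other examples in this thread, whereas the actual extraction code is .NET; `isModalMarginTrusted` is a hypothetical name, and the input is assumed to be left margins already rounded into buckets.

```typescript
// Hypothetical helper: `leftMargins` holds one pre-bucketed left margin per
// text line (for example, rounded to whole points).
function isModalMarginTrusted(leftMargins: number[], minRatio = 2.5): boolean {
  const counts = new Map<number, number>();
  for (const m of leftMargins) {
    counts.set(m, (counts.get(m) || 0) + 1);
  }
  const sorted = Array.from(counts.values()).sort((a, b) => b - a);
  if (sorted.length === 0) return false; // no lines: nothing to trust (assumption)
  if (sorted.length === 1) return true; // no runner-up at all
  // Trust the modal margin only when it clearly dominates the runner-up.
  return sorted[0] >= minRatio * sorted[1];
}
```

With the figures quoted above, a ~40/40 two-column page gives a ratio of 1.0 and is rejected, while an ~85/5 single-column page gives 17 and is accepted.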

Three remaining bug reports addressed at their underlying invariant.

#2 (HTML coupling). Root cause: detection ran on chapter HTML and split on </p>|</h\d>|</li>, coupling it to PdfToHtmlConverter's markup. The actual signal was always per-paragraph plain text; we just lost it during HTML conversion. Refactor: LooksLikeTableOfContentsBody now takes IEnumerable<string> of paragraph texts, called BEFORE HTML conversion in PdfTextExtractor. Bonus: skips the HtmlCleaner pipeline entirely for chapters we're about to drop.

#4 (localized back-matter). Root cause: whitelist was English + Russian/Ukrainian only. Extended to German, French, Spanish, Italian, Portuguese for Glossary / Bibliography / References / Notes / Appendix plus their localized forms.

#3 (sidenote dominance). Analysis: with sidenotes the typical split is ~70/25 = 2.8× which is above the 2.5× threshold, so the multi-column guard correctly keeps trusting the modal margin. Sidenote lines (a small minority) are then NOT treated as indent breaks; they just register as regular paragraph content. The body remains correctly split. No code change needed; documenting the analysis.

Latent bug: IsTableOfContents drop had no position guard, so an Italian "Indice" / Spanish "Índice" (same word means "Index" at the back of the book) would be mis-dropped when it's the Index. Added isFrontHalf guard to the title-based drop path too.

Tests: 302 passing in Extraction.Tests (+15 new across IsKnownBackMatter in 7 languages + string-list LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
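
A rough TypeScript sketch of paragraph-list TOC detection with the position and title guards. The real code is .NET and is not shown here, so the regex, the definition of a "significant" line (non-empty after trimming), the front-half test, and the English-only back-matter list are all assumptions; the 40% share and the minimum of 5 lines come from the commit messages in this thread.

```typescript
// Assumed pattern: a run of leader dots (or an ellipsis) followed by a page number.
const LEADER_THEN_PAGE = /(?:\.{3,}|…+|(?:\.\s){3,})\s*\d{1,4}$/;

// English subset only; the commits describe a larger, localized list.
const KNOWN_BACK_MATTER = new Set([
  "index", "glossary", "bibliography", "references",
  "notes", "abbreviations", "colophon",
]);

function looksLikeTableOfContentsBody(paragraphs: string[]): boolean {
  let significant = 0;
  let tocLike = 0;
  for (const raw of paragraphs) {
    const line = raw.trim();
    if (line.length === 0) continue; // blank paragraphs are not significant
    significant++;
    if (LEADER_THEN_PAGE.test(line)) tocLike++;
  }
  return significant >= 5 && tocLike / significant >= 0.4;
}

function isKnownBackMatter(title: string): boolean {
  return KNOWN_BACK_MATTER.has(title.trim().toLowerCase());
}

// Drop only when the body looks like a TOC, the chapter sits in the front
// half of the book, and the title is not recognised back matter.
function shouldDropAsToc(
  paragraphs: string[],
  title: string,
  chapterIndex: number, // 0-based
  chapterCount: number
): boolean {
  const isFrontHalf = chapterIndex < chapterCount / 2;
  return (
    isFrontHalf && !isKnownBackMatter(title) && looksLikeTableOfContentsBody(paragraphs)
  );
}
```

Because the detector takes plain paragraph strings, it has no dependency on whatever markup the HTML converter emits.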

…-check + root-cause cleanup (#245)

* fix(pdf): bug-report sweep, content-TOC + multi-col + bullets + file-check

Four follow-ups from PR #244's bug report list:

1. Content-level TOC detection. The bookmark-title-only path missed TOCs that came in via the page-split fallback (no bookmark, chapter labeled "Pages 1–15"). Now FrontMatterFilter.LooksLikeTableOfContentsBody inspects the plain text: ≥40% of substantive lines ending in a leader-dot run (or "…") + page number ⇒ TOC. Same single-chapter safety guard.
2. Multi-column / mixed-layout guard for StartsWithIndent. The modal left margin is now only trusted when it covers ≥50% of all lines. On a 2-column academic paper the modal share is well under half; we fall back to y-gap and bullet detection only, instead of over-splitting on every column shift.
3. Bullet glyph set expanded (◇ ❖ ❍ ▶ ▸ ▻ ➤ ➔ ➢ ★ ☆ ✓ ✔ ✗ ✘) to cover modern textbook list markers.
4. RetryAsync now probes the backing file via storage.GetFileAsync before queuing the job. A missing source returned success and left the book stuck in Processing forever.

Tests: 273 passing in Extraction.Tests, with 8 new across content-TOC detection (leader-dot, ellipsis, prose negative, too-short, null) and expanded bullet glyph coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pdf): bug-report sweep round 2, root-cause-first cleanup

* fix(pdf): bug-report sweep round 3, root-cause refactor

* fix(pdf): bug-report sweep round 4, Unicode-category bullet detection

Last actionable item from the bug list: hardcoded bullet set missed custom dingbat-font glyphs in modern textbooks.

Root cause: detection was an enumerated whitelist; new fonts shipped new shapes; we kept playing whack-a-mole.

Generalization: after the fast-path whitelist check, fall back to Unicode category lookup. If the first word is a single character in category "Symbol, Other" (So) AND not in the existing NoisePunctuation set, treat it as a bullet. Po (Punctuation, Other) is deliberately excluded: that category contains † ‡ § ¶ ※ which are footnote-reference markers, not paragraph starts.

Tests assert both directions:
- ☑ ☐ ✦ ✺ ♦ ⇒ bullet (So)
- † ‡ § ¶ ※ ⇒ NOT bullet (Po)

Tests: 312 passing in Extraction.Tests (+10 new).

Other items remaining on the bug list are tradeoffs we declined:
• sidenote columns = XYCut territory (intentionally not pursued)
• wrapped TOC entries with hanging indent = context-aware extraction, larger refactor

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
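
The category fallback from round 4 can be approximated in TypeScript with a Unicode property escape. The actual code is .NET; `startsWithBullet` is a hypothetical name, `KNOWN_BULLETS` is only a subset of the whitelist quoted above, and the contents of `NOISE_PUNCTUATION` are placeholders because the real set is not shown in this thread.

```typescript
// Built with the RegExp constructor so the \p{So} escape is resolved at runtime.
const SYMBOL_OTHER = new RegExp("^\\p{So}$", "u");

// Fast-path whitelist: a subset of the glyphs listed in the round-1 commit.
const KNOWN_BULLETS = new Set(["◇", "❖", "▶", "➤", "★", "✓"]);

// Placeholder contents: So-category characters that are assumed NOT to be
// list markers. The real NoisePunctuation set is not shown in the source.
const NOISE_PUNCTUATION = new Set(["©", "®", "™"]);

function startsWithBullet(line: string): boolean {
  const firstWord = line.trim().split(/\s+/)[0] || "";
  if (KNOWN_BULLETS.has(firstWord)) return true; // fast path
  // Count code points, not UTF-16 units, so astral symbols count as one.
  if (Array.from(firstWord).length !== 1) return false;
  return SYMBOL_OTHER.test(firstWord) && !NOISE_PUNCTUATION.has(firstWord);
}
```

Footnote markers such as † and § fall in category Po rather than So, so they fail the `\p{So}` test without needing an explicit exclusion list.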

mrviduus merged commit 4c3a7a5 into main Dec 24, 2025
1 check passed

mrviduus mentioned this pull request Apr 28, 2026

my-books v3 slice 01 — header reframe + mobile tabs #166

Merged

5 tasks

mrviduus merged 1 commit into main from feature/search-library
