Add tabbed content support, URL-level change detection, and more #55

Merged
djannot merged 9 commits into main from denis-multiple-improvements
Feb 16, 2026

@djannot djannot commented Feb 15, 2026

Improvements to content processing, change detection, and crawl reliability, plus E2E test coverage:

  • Tabbed content support: Detects WAI-ARIA tabs in web pages and injects tab labels into panel content so context is preserved in the Markdown output instead of concatenating unlabeled code blocks.
  • URL-level change detection: Replaces per-chunk hash comparison with URL-level comparison across all source types. Unchanged URLs are skipped entirely; changed URLs are fully re-processed, eliminating orphaned chunks and inconsistent chunk_index/total_chunks when content shifts.
  • Puppeteer resilience: Adds protocol and evaluate timeouts, navigates to about:blank between pages, and recreates the page tab after errors to prevent cascading failures from stuck pages.
  • E2E integration tests: Three end-to-end tests covering local directory (Markdown), code source (Tree-sitter), and website source (Puppeteer + local HTTP server). Each validates that only modified documents are re-embedded while unchanged ones are skipped, with correct chunk ordering and no orphans.
  • Qdrant filter fix: removeChunksByUrlQdrant used Qdrant's full-text match (match.text) instead of an exact keyword match (match.value). This caused cross-URL chunk deletion during content-change re-processing and unnecessary re-embedding of unchanged pages on subsequent runs. The filter now uses match.value.
  • HTTP 429 retry and browser recovery: Adds rate-limit retry with Retry-After header parsing (max 3 attempts per URL, configurable fallback delay), browser restart escalation when Chrome protocol errors persist after page recreation, periodic browser restart every 50 pages to prevent memory accumulation, and --disable-dev-shm-usage for Docker stability.
  • Add ETag-based change detection for website crawls
  • Fix shouldProcessUrl rejecting version-like URLs (e.g. /app/2.1.x/)
  • Add adaptive HEAD backoff and sitemap lastmod-based change detection
  • Add tests covering all 4 change detection layers across multiple syncs
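
To make the URL-level change detection item concrete, here is a minimal sketch of the idea, not the code from this PR. The names `StoredChunk`, `syncUrl`, `simpleHash`, and `chunkText` are hypothetical, the in-memory `store` stands in for the vector database, and the djb2-style hash is only a dependency-free stand-in for a real content hash.

```typescript
// Hypothetical types and names for illustration; not taken from the PR.
interface StoredChunk {
  url: string;
  contentHash: string; // hash of the whole document this chunk came from
  chunkIndex: number;
  totalChunks: number;
  text: string;
}

// djb2-style hash, used only as a dependency-free stand-in for a real content hash.
function simpleHash(input: string): string {
  let hash = 5381;
  for (let i = 0; i < input.length; i++) {
    hash = ((hash * 33) ^ input.charCodeAt(i)) >>> 0;
  }
  return hash.toString(16);
}

// Naive fixed-size chunker standing in for the real chunking logic.
function chunkText(text: string, size: number): string[] {
  const parts: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    parts.push(text.slice(i, i + size));
  }
  return parts.length > 0 ? parts : [""];
}

type SyncResult = "skipped" | "reprocessed";

// Compare one hash per URL. Unchanged URLs are skipped entirely; changed URLs
// have all of their chunks replaced, so no stale chunk can survive.
function syncUrl(
  store: Record<string, StoredChunk[]>,
  url: string,
  content: string,
  chunkSize: number = 20
): SyncResult {
  const contentHash = simpleHash(content);
  const existing = store[url];
  if (existing && existing.length > 0 && existing[0].contentHash === contentHash) {
    return "skipped";
  }
  const parts = chunkText(content, chunkSize);
  // Overwriting the whole entry drops every old chunk for this URL at once.
  store[url] = parts.map((text, chunkIndex) => ({
    url,
    contentHash,
    chunkIndex,
    totalChunks: parts.length,
    text,
  }));
  return "reprocessed";
}
```

Because every chunk of a changed URL is rewritten together, chunk_index and total_chunks stay consistent even when the new content produces fewer chunks than the old one.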
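
For the Qdrant filter item, the difference between the two match types can be sketched as follows. The filter shape (`must`, `key`, `match.value`, `match.text`) follows Qdrant's payload filter format; the payload field name `url` and the helper names are assumptions. The two `matches...` functions are a simplified local emulation, based on my understanding that a text match on a field without a full-text index behaves like a substring match.

```typescript
// Subset of Qdrant's payload filter shape that matters here.
type MatchCondition = { value: string } | { text: string };

interface FieldCondition {
  key: string;
  match: MatchCondition;
}

interface QdrantFilter {
  must: FieldCondition[];
}

// Exact keyword match: selects only points whose field equals `url`.
// The field name "url" is an assumption for this sketch.
function exactUrlFilter(url: string, field: string = "url"): QdrantFilter {
  return { must: [{ key: field, match: { value: url } }] };
}

// Full-text match: the looser condition described in the bug.
function textUrlFilter(url: string, field: string = "url"): QdrantFilter {
  return { must: [{ key: field, match: { text: url } }] };
}

// Simplified local emulation of the two behaviours (not Qdrant code).
function matchesExact(stored: string, query: string): boolean {
  return stored === query;
}

function matchesText(stored: string, query: string): boolean {
  return stored.indexOf(query) !== -1;
}
```

Under this emulation, deleting chunks for `https://example.com/docs/a` with a text match would also hit `https://example.com/docs/a/b`, which is the kind of cross-URL deletion described above.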

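The HTTP 429 retry item can be illustrated with a small sketch of Retry-After parsing and a bounded retry loop. This is not the PR's implementation: `parseRetryAfter`, `fetchWithRetry`, and the `FetchResult` shape are hypothetical, and the request and sleep functions are injected so the sketch does not depend on Puppeteer or the network. The only facts taken from the description are the 3-attempt cap and the configurable fallback delay.

```typescript
// Retry-After is either a number of seconds or an HTTP-date.
// Returns the delay in milliseconds, or fallbackMs when the header is
// missing or unparseable.
function parseRetryAfter(
  header: string | null | undefined,
  fallbackMs: number,
  nowMs: number = Date.now()
): number {
  if (!header) return fallbackMs;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) {
    return parseInt(trimmed, 10) * 1000;
  }
  const dateMs = Date.parse(trimmed);
  if (!isNaN(dateMs)) {
    // A date in the past means no wait is needed.
    return Math.max(0, dateMs - nowMs);
  }
  return fallbackMs;
}

// Hypothetical response shape for this sketch.
interface FetchResult {
  status: number;
  retryAfter?: string | null;
}

const MAX_ATTEMPTS = 3; // cap from the PR description

// Try up to MAX_ATTEMPTS times, waiting between attempts while the
// server keeps answering 429. Returns the last response either way.
async function fetchWithRetry(
  doFetch: () => Promise<FetchResult>,
  sleep: (ms: number) => Promise<void>,
  fallbackMs: number
): Promise<FetchResult> {
  let result = await doFetch();
  for (let attempt = 2; attempt <= MAX_ATTEMPTS && result.status === 429; attempt++) {
    await sleep(parseRetryAfter(result.retryAfter, fallbackMs));
    result = await doFetch();
  }
  return result;
}
```

Injecting `sleep` keeps the loop testable without real delays; a caller would typically pass a `setTimeout`-based implementation.
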
… resilience

Signed-off-by: Denis Jannot <denis.jannot@solo.io>
@djannot djannot force-pushed the denis-multiple-improvements branch from 93d24a6 to dae3be9 Compare February 15, 2026 13:34
…er resilience

Signed-off-by: Denis Jannot <denis.jannot@solo.io>
Signed-off-by: Denis Jannot <denis.jannot@solo.io>
@djannot djannot force-pushed the denis-multiple-improvements branch from 2dc3755 to f377257 Compare February 15, 2026 19:13
Signed-off-by: Denis Jannot <denis.jannot@solo.io>
Signed-off-by: Denis Jannot <denis.jannot@solo.io>
Signed-off-by: Denis Jannot <denis.jannot@solo.io>
Signed-off-by: Denis Jannot <denis.jannot@solo.io>
Signed-off-by: Denis Jannot <denis.jannot@solo.io>
@djannot djannot force-pushed the denis-multiple-improvements branch from cf45510 to f3599ed Compare February 16, 2026 11:49
Signed-off-by: Denis Jannot <denis.jannot@solo.io>
@djannot djannot merged commit 823f197 into main Feb 16, 2026
2 checks passed