
Add incremental indexing with content-hash change detection#2

Open
nnourr wants to merge 3 commits into main from tricky-chips

Conversation


@nnourr nnourr commented Apr 16, 2026

Summary

  • When an existing index is found, `code-index analyze` now auto-runs an incremental pipeline instead of refusing or re-indexing from scratch
  • Changed files are detected via git diff (committed) + SHA-256 content hashes (uncommitted), so repeated runs on a dirty working tree skip re-indexing correctly
  • Only nodes from changed files are re-embedded; unchanged vectors are copied from the existing FAISS index via reconstruct()
  • Falls back to full re-index when the old commit is unreachable (rebase/force-push) or no hash baseline exists yet
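
The two-layer change detection above can be sketched as follows. This is a minimal sketch with assumed helper names (`content_hash`, `committed_changes`, `changed_files`), not the PR's actual code: committed changes come from `git diff --name-only`, and uncommitted ones from comparing SHA-256 content hashes against the baseline recorded at index time.

```python
# Sketch of git-diff + content-hash change detection (assumed names).
import hashlib
import subprocess
from pathlib import Path

def content_hash(path: Path) -> str:
    """SHA-256 of a file's bytes, in the same format as the stored baseline."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def committed_changes(old_commit: str, repo: Path) -> set:
    """Files touched between the indexed commit and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--name-only", old_commit, "HEAD"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())

def changed_files(old_commit: str, baseline: dict, repo: Path) -> set:
    """Union of committed diffs and files whose current hash differs from
    what was actually indexed, so dirty-tree reruns become no-ops."""
    changed = committed_changes(old_commit, repo)
    for rel, old_hash in baseline.items():
        p = repo / rel
        if not p.exists() or content_hash(p) != old_hash:
            changed.add(rel)        # content differs from what was indexed
        else:
            changed.discard(rel)    # hash matches the baseline: skip it
    return changed
```

The `discard` branch is what makes repeated runs on a dirty tree idempotent: a file git flags as modified is still skipped if its current hash equals the hash recorded when it was last indexed.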

Performance on this repo (37 files, 306 embeddable nodes, 1 file changed):

  • Full index: ~55s → Incremental: ~3.6s (~15x faster)

Test plan

  • Full index with --force establishes hash baseline and works as before
  • Incremental run with no changes detects "No files changed" instantly
  • Incremental run after editing 1 file re-embeds only that file's nodes
  • Second incremental run (no new edits) correctly skips — hashes match what was indexed
  • Search works correctly after incremental update
  • --force flag still triggers full re-index
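
The selective re-embedding step can be sketched like this. Names are hypothetical; in the real pipeline the old vectors come from FAISS's `index.reconstruct(i)`, but here a plain list of vectors stands in so the logic is runnable without faiss installed.

```python
# Sketch of the copy-vs-re-embed merge (hypothetical names; a plain
# list stands in for FAISS reconstruct()).
def rebuild_vectors(old_vectors, old_node_files, changed, embed):
    """Return the new vector list: re-embed nodes whose source file
    changed, copy everything else from the existing index unchanged."""
    new_vectors = []
    for i, src_file in enumerate(old_node_files):
        if src_file in changed:
            new_vectors.append(embed(i))        # re-embed changed node
        else:
            new_vectors.append(old_vectors[i])  # copy (reconstruct) as-is
    return new_vectors

old = [[0.1], [0.2], [0.3]]
files = ["a.py", "b.py", "a.py"]
# Only "a.py" changed: nodes 0 and 2 are re-embedded, node 1 is copied.
merged = rebuild_vectors(old, files, {"a.py"}, embed=lambda i: [9.9])
# merged == [[9.9], [0.2], [9.9]]
```

This mirrors the ~15x speedup claim: with one file changed, `embed` runs for a handful of nodes while the other ~300 vectors are straight copies.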

🤖 Generated with Claude Code

nnourr and others added 3 commits April 16, 2026 00:19
When an existing index is found, `code-index analyze` now runs an
incremental pipeline that detects changed files via git diff + SHA-256
content hashes and only re-embeds nodes from those files. Unchanged
node vectors are copied from the existing FAISS index. Repeated runs
on a dirty working tree correctly skip re-indexing because hashes are
compared against what was actually indexed, not just git state.

Falls back to full re-index when the old commit is unreachable (rebase)
or no hash baseline exists yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
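
The fallback decision described above can be sketched as a small guard (assumed helper name, not the PR's actual code). `git cat-file -e` exits non-zero when the recorded commit no longer exists, e.g. after a rebase or force-push, in which case a full re-index is the only safe option.

```python
# Sketch of the incremental-vs-full fallback check (assumed name).
import subprocess

def can_run_incremental(old_commit, has_baseline, repo):
    """Incremental indexing needs both a hash baseline and a reachable
    old commit; otherwise the caller falls back to a full re-index."""
    if not has_baseline or not old_commit:
        return False  # no baseline yet: nothing to diff against
    probe = subprocess.run(
        ["git", "cat-file", "-e", f"{old_commit}^{{commit}}"],
        cwd=repo, capture_output=True,
    )
    return probe.returncode == 0  # non-zero: commit unreachable (rebase)
```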
These dependencies were previously pulled in transitively (via sentence-transformers, optimum, fastembed), so different venvs could resolve to a broken trio where transformers demands huggingface-hub>=1.5 but the resolver picked an older hub. Pinning the 4.x / 0.x / 0.2x lines keeps the three in lockstep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Added dependencies for ONNX, ONNX Runtime, and Optimum in `pyproject.toml`.
- Updated `config.py` to set the embedding backend to ONNX, enabling faster inference on Apple Silicon and CUDA.
- Enhanced `engine.py` to support ONNX model loading and embedding, including functions for selecting the best execution provider and ensuring model export.
- Refactored embedding logic to utilize ONNX runtime, improving performance and compatibility with specific models.

These changes enhance the embedding engine's efficiency and broaden its compatibility with various hardware setups.
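
The provider-selection step mentioned for `engine.py` can be sketched like this. The function name and preference order are assumptions, not the PR's actual helper; ONNX Runtime's real API reports what is available via `onnxruntime.get_available_providers()`, and the idea is simply to prefer a hardware-accelerated provider over the always-present CPU fallback.

```python
# Sketch of execution-provider selection (hypothetical helper; the
# PR's engine.py implementation may differ).
PREFERENCE = [
    "CUDAExecutionProvider",    # NVIDIA GPUs
    "CoreMLExecutionProvider",  # Apple Silicon
    "CPUExecutionProvider",     # always-present fallback
]

def best_provider(available):
    """Pick the first available provider in preference order."""
    for name in PREFERENCE:
        if name in available:
            return name
    return "CPUExecutionProvider"

# In the real engine this would be fed from:
#   available = onnxruntime.get_available_providers()
```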