fix(indexer): reactivate deactivated documents when file is restored #41
Merged
itsmostafa merged 16 commits into main, May 3, 2026
Previously, indexFile only searched for active=1 documents. When a deactivated row existed for the same (collection, path), the subsequent INSERT violated the UNIQUE(collection, path) constraint and the file silently stayed unindexed. Fix: drop the active=1 filter from the existence lookup and include the active column. A deactivated row is now reactivated via the existing update branch (which already regenerates chunks), and its stats are counted as FilesAdded since the file was not searchable before.
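The failure mode described above can be reproduced without a database. The following is a minimal sketch, assuming an in-memory map stands in for the `documents` table and its `UNIQUE(collection, path)` constraint; `store`, `doc`, `index`, and `errUnique` are hypothetical names for illustration, not the project's code.

```go
package main

import (
	"errors"
	"fmt"
)

// key mirrors the UNIQUE(collection, path) constraint on documents.
type key struct{ collection, path string }

// doc is a hypothetical, trimmed-down documents row.
type doc struct {
	id     int
	active bool
	hash   string
}

// errUnique imitates the constraint error the real INSERT would raise.
var errUnique = errors.New("UNIQUE constraint failed: documents.collection, documents.path")

// store is an in-memory stand-in for the documents table.
type store struct {
	rows   map[key]*doc
	nextID int
}

func newStore() *store { return &store{rows: map[key]*doc{}, nextID: 1} }

// insert fails when any row, active or not, already holds the key.
func (s *store) insert(k key, hash string) error {
	if _, ok := s.rows[k]; ok {
		return errUnique
	}
	s.rows[k] = &doc{id: s.nextID, active: true, hash: hash}
	s.nextID++
	return nil
}

// find returns the row for k. With activeOnly set it behaves like the old
// lookup and cannot see deactivated rows.
func (s *store) find(k key, activeOnly bool) *doc {
	d, ok := s.rows[k]
	if !ok || (activeOnly && !d.active) {
		return nil
	}
	return d
}

// index updates (and reactivates) an existing row, or inserts a new one.
func (s *store) index(k key, hash string, activeOnly bool) error {
	if d := s.find(k, activeOnly); d != nil {
		d.active, d.hash = true, hash
		return nil
	}
	return s.insert(k, hash)
}

func main() {
	k := key{"notes", "a.md"}
	s := newStore()
	_ = s.index(k, "h1", true)
	s.rows[k].active = false // the file was removed from disk

	fmt.Println("old lookup:", s.index(k, "h2", true))
	fmt.Println("new lookup:", s.index(k, "h2", false))
}
```

With the `activeOnly` filter the lookup misses the deactivated row and the insert collides with it; without the filter the same row is found and reactivated through the update path.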
Covers the full delete-then-restore cycle: index → remove from disk → re-index (FilesRemoved=1) → restore with new content → re-index (FilesAdded=1, active=1, chunks populated).
chunk_vectors and embeddings referenced chunks(id) with NO ACTION, causing FK violations (and a silent rollback) whenever a changed document was reindexed while embeddings existed. SQLite does not support ALTER TABLE to change FK actions, so this migration rebuilds both tables with the correct ON DELETE CASCADE constraint.
TestIndexer_ReindexWithEmbeddings indexes a file, inserts chunk_vectors and embeddings rows for a real chunk, modifies the file, reindexes, and asserts FilesUpdated==1 with zero orphan rows — the failure mode this change prevents.
If the migration failed mid-run (after creating embeddings_new but before completing), schema_version stayed at 2 and the next run would fail with "table embeddings_new already exists". Adding DROP TABLE IF EXISTS before each CREATE TABLE makes the migration safe to retry.
Migration 002 used CREATE TABLE IF NOT EXISTS, which was a no-op on databases where embeddings already existed without the dimension column. The INSERT in migration 003 then failed with "no such column: dimension". Fix: give dimension a DEFAULT 0 and omit it from the INSERT select list so the migration works regardless of the source table's schema.
Dot-prefixed directories (e.g. .venv, .cache, .mypy_cache) are never user content, so skip them unconditionally rather than relying on an enumerated denylist.
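One way to prune dot-prefixed directories during a walk is to return `fs.SkipDir` from a `filepath.WalkDir` callback. This is a sketch under assumptions: `listIndexable` and `demoTree` are hypothetical names, and the real indexer's walk may differ (for example in how it treats dot-prefixed files, which this sketch does not filter).

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// listIndexable walks root and returns the files it finds as slash-separated
// relative paths, pruning every dot-prefixed directory below root.
func listIndexable(root string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			// Never prune the root itself, even if its name starts with a dot.
			if path != root && strings.HasPrefix(d.Name(), ".") {
				return fs.SkipDir
			}
			return nil
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		files = append(files, filepath.ToSlash(rel))
		return nil
	})
	return files, err
}

// demoTree builds a small temporary tree for the example and returns its
// root together with a cleanup function.
func demoTree() (string, func(), error) {
	root, err := os.MkdirTemp("", "indexer-demo")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(root) }
	for _, p := range []string{"docs/a.md", ".venv/lib.py", "docs/.cache/x.bin", "top.txt"} {
		full := filepath.Join(root, filepath.FromSlash(p))
		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
			cleanup()
			return "", nil, err
		}
		if err := os.WriteFile(full, []byte("x"), 0o644); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := demoTree()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	files, err := listIndexable(root)
	fmt.Println(files, err)
}
```

Returning `fs.SkipDir` for a directory entry stops the walk from descending into it, so nothing under `.venv` or `docs/.cache` is ever visited.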
Adds a section advising when to use qi search/query vs qi ask, and marks qi ask as to be used sparingly since it consumes LLM tokens.
Pull request overview
Fixes an indexer edge case where previously-deactivated documents (from deleted files) could not be re-added when the file is restored, by reusing/reactivating the existing documents row instead of attempting a conflicting insert.
Changes:
- Update `indexFile` to find documents regardless of `active` state, reactivate on update, and adjust stats counting for reactivated files.
- Add regression tests covering restore-after-delete and reindexing behavior when embeddings exist.
- Add a DB migration to apply `ON DELETE CASCADE` to chunk embedding reference tables; minor CLI skill/docs + marketplace version bump.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `internal/indexer/indexer.go` | Reactivate deactivated documents on restore; tweaks directory-walk ignore behavior |
| `internal/indexer/indexer_test.go` | Adds restore-after-delete regression + embedding-related reindex test |
| `internal/db/migrations/003_cascade_chunk_refs.sql` | Rebuilds embedding reference tables to add `ON DELETE CASCADE` |
| `skills/qi-cli/SKILL.md` | Adds guidance to prefer search/query and use ask sparingly |
| `.claude-plugin/marketplace.json` | Bumps plugin metadata version |
The INSERT into embeddings_new omitted dimension, silently writing 0 for all migrated rows even when the source table had real values. Derive dimension from length(cv.vector)/4 via a LEFT JOIN on chunk_vectors so both old schemas (no dimension column) and current ones are handled correctly without touching a potentially-absent source column.
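The `length(cv.vector)/4` expression works if each vector is stored as a blob of 4-byte floats. The sketch below illustrates that arithmetic, assuming little-endian `float32` encoding (the encoding is an assumption; the commit only states the division by 4). `encodeVector` and `dimensionOf` are hypothetical helpers.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeVector packs float32 values as little-endian bytes, 4 bytes each.
// The byte order is an assumption made for this example.
func encodeVector(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// dimensionOf mirrors length(vector)/4: the number of 4-byte elements in
// the blob. A nil or empty blob yields 0 in this sketch.
func dimensionOf(blob []byte) int {
	return len(blob) / 4
}

func main() {
	blob := encodeVector([]float32{0.1, 0.2, 0.3})
	fmt.Println(len(blob), dimensionOf(blob))
}
```

Deriving the value from the stored vector avoids reading a `dimension` column that older schemas may not have.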
Swallowing the Scan error left docID=0 on any transient DB failure, causing a fallthrough to INSERT that would hit the UNIQUE(collection,path) constraint and silently leave the file unindexed.
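A common way to avoid this fallthrough is to treat only `sql.ErrNoRows` as "no document yet" and propagate every other `Scan` error. The sketch below shows that distinction in isolation; `classifyScanErr`, `errLocked`, and `demoCases` are hypothetical names, and the actual fix in `indexFile` may be structured differently.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// errLocked stands in for a transient driver error in the demo below.
var errLocked = errors.New("database is locked")

// classifyScanErr interprets the error returned by (*sql.Row).Scan for the
// existence lookup:
//   - nil:           a row was found
//   - sql.ErrNoRows: no row exists, so INSERT is the right next step
//   - anything else: report the failure instead of falling through
func classifyScanErr(err error) (found bool, fatal error) {
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, sql.ErrNoRows):
		return false, nil
	default:
		return false, fmt.Errorf("lookup document: %w", err)
	}
}

// demoCases returns one error per branch of classifyScanErr.
func demoCases() []error {
	return []error{nil, sql.ErrNoRows, errLocked}
}

func main() {
	for _, err := range demoCases() {
		found, fatal := classifyScanErr(err)
		fmt.Printf("scan=%v found=%v fatal=%v\n", err, found, fatal)
	}
}
```

Only the middle case should lead to an INSERT; the last one returns early, so a transient failure can no longer be mistaken for a missing row.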
When a deactivated document was restored with byte-identical content, the indexer still ran DELETE FROM chunks, which cascades into chunk_vectors and embeddings (added by migration 003), forcing unnecessary re-embedding work. Add a fast-path for `docID != 0 && existingActive == 0 && existingHash == hash`: reactivate the document row and return without touching chunks or embeddings. Adds TestIndexer_ReactivateSameContent to guard the preserved-embedding invariant.
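The resulting branches can be summarized as a small decision function. This is a sketch rather than the actual `indexFile` code: the `action` type and `decide` are hypothetical, while the variable names `docID`, `existingActive`, `existingHash`, and `hash` follow the condition quoted above.

```go
package main

import "fmt"

// action names the path the indexer takes for one file.
type action int

const (
	insertNew      action = iota // no row yet: INSERT a new document
	skipUnchanged                // active row, same hash: nothing to do
	reactivateOnly               // inactive row, same hash: set active=1, keep chunks
	reindexAdded                 // inactive row, new hash: rebuild chunks, count as FilesAdded
	reindexUpdated               // active row, new hash: rebuild chunks, count as FilesUpdated
)

var actionNames = []string{
	"insertNew", "skipUnchanged", "reactivateOnly", "reindexAdded", "reindexUpdated",
}

func (a action) String() string { return actionNames[a] }

// decide picks the path from the lookup result. docID == 0 means the lookup
// found no row for (collection, path).
func decide(docID int64, existingActive int, existingHash, hash string) action {
	switch {
	case docID == 0:
		return insertNew
	case existingActive == 1 && existingHash == hash:
		return skipUnchanged
	case existingActive == 0 && existingHash == hash:
		// Fast path: chunks and embeddings are left untouched.
		return reactivateOnly
	case existingActive == 0:
		return reindexAdded
	default:
		return reindexUpdated
	}
}

func main() {
	fmt.Println(decide(0, 0, "", "h1"))
	fmt.Println(decide(7, 1, "h1", "h1"))
	fmt.Println(decide(7, 0, "h1", "h1"))
	fmt.Println(decide(7, 0, "h1", "h2"))
	fmt.Println(decide(7, 1, "h1", "h2"))
}
```

Keeping the same-hash reactivation separate from the reindex branches is what avoids the `DELETE FROM chunks` cascade and the re-embedding work it would trigger.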
Summary
A deleted file could not be re-added to a collection without wiping the database. This PR fixes that root cause and also resolves several cascading issues uncovered while testing the full reindex cycle.
Changes
Core fix: reindex of restored files
- Drop the `active=1` filter from the `documents` lookup in `indexFile`; also fetch the `active` column
- Tighten the unchanged-file skip to `existingActive == 1 && existingHash == hash` so deactivated rows are never skipped
- Set `active=1` in the `UPDATE` so the row is reactivated in place (existing chunk-regeneration logic applies unchanged)
- Count reactivated files as `FilesAdded` (not `FilesUpdated`) since the file was not previously searchable
DB: ON DELETE CASCADE on chunk_vectors and embeddings
- `chunk_vectors` and `embeddings` referenced `chunks(id)` with `NO ACTION`, causing FK violations (and a silent rollback) when a changed document was reindexed while embeddings existed
- `003_cascade_chunk_refs.sql` rebuilds both tables with the correct `ON DELETE CASCADE` constraint (SQLite requires a table rebuild to change FK actions)
DB: migration 003 robustness
- Add `DROP TABLE IF EXISTS` guards so a migration that failed mid-run can be safely retried
- Fix the `no such column: dimension` error on legacy databases where `embeddings` already existed without the column, so migration 002's `CREATE TABLE IF NOT EXISTS` was a no-op; `dimension` now has `DEFAULT 0` and is omitted from the INSERT select list
Indexer: skip dot-directories
- Skip dot-prefixed directories (`.venv`, `.cache`, `.mypy_cache`, etc.) unconditionally rather than relying on an enumerated denylist
Breaking Changes
None — migration 003 is a backwards-compatible table rebuild.
Test Plan
- `TestIndexer_ReindexAfterDeletion`: full index → delete → restore cycle; verifies `FilesAdded=1`, `active=1`, and chunks repopulated
- `TestIndexer_ReindexWithEmbeddings`: regression test for the FK cascade fix; reindex succeeds when embeddings rows are present
- Other indexer tests (`TestIndexer_AddFiles`, `TestIndexer_IncrementalUpdate`, `TestIndexer_DeactivatesMissingFiles`)
- `task check` (build + `go test ./...` + `go vet ./...`) passes clean
Release Notes
- … `dimension` column
- Dot-prefixed directories (`.venv`, `.git`, `.cache`, etc.) are now always excluded from indexing