Skip to content

fix(indexer): reactivate deactivated documents when file is restored#41

Merged
itsmostafa merged 16 commits intomainfrom
fix/indexer-reindex-deleted-files
May 3, 2026
Merged

fix(indexer): reactivate deactivated documents when file is restored#41
itsmostafa merged 16 commits intomainfrom
fix/indexer-reindex-deleted-files

Conversation

@itsmostafa
Copy link
Copy Markdown
Owner

@itsmostafa itsmostafa commented May 2, 2026

Summary

A deleted file could not be re-added to a collection without wiping the database. This PR fixes that root cause and also resolves several cascading issues uncovered while testing the full reindex cycle.

Changes

Core fix — reindex of restored files

  • Drop active=1 filter from the documents lookup in indexFile; also fetch the active column
  • Guard the unchanged short-circuit on existingActive == 1 && existingHash == hash so deactivated rows are never skipped
  • Include active=1 in the UPDATE so the row is reactivated in place (existing chunk-regeneration logic applies unchanged)
  • Count reactivated documents as FilesAdded (not FilesUpdated) since the file was not previously searchable

DB — ON DELETE CASCADE on chunk_vectors and embeddings

  • chunk_vectors and embeddings referenced chunks(id) with NO ACTION, causing FK violations (and a silent rollback) when a changed document was reindexed while embeddings existed
  • Migration 003_cascade_chunk_refs.sql rebuilds both tables with the correct ON DELETE CASCADE constraint (SQLite requires a table rebuild to change FK actions)

DB — migration 003 robustness

  • Made migration 003 idempotent with DROP TABLE IF EXISTS guards so a failed mid-run can be safely retried
  • Fixed a no such column: dimension error on legacy databases where embeddings was created by IF NOT EXISTS before the column was added; dimension now has DEFAULT 0 and is omitted from the INSERT select list

Indexer — skip dot-directories

  • The filesystem walker now skips all dot-prefixed directories (.venv, .cache, .mypy_cache, etc.) unconditionally rather than relying on an enumerated denylist

Breaking Changes

None — migration 003 is a backwards-compatible table rebuild.

Test Plan

  • TestIndexer_ReindexAfterDeletion — full index → delete → restore cycle: verifies FilesAdded=1, active=1, and chunks repopulated
  • TestIndexer_ReindexWithEmbeddings — regression test for the FK cascade fix: reindex succeeds when embeddings rows are present
  • All existing indexer tests pass (TestIndexer_AddFiles, TestIndexer_IncrementalUpdate, TestIndexer_DeactivatesMissingFiles)
  • task check (build + go test ./... + go vet ./...) passes clean

Release Notes

  • Fixed: a file that was indexed, deleted, and then restored on disk would never reappear in search results (previously required a full collection wipe)
  • Fixed: reindexing a document fails silently when vector embeddings exist for its chunks (FK violation)
  • Fixed: migration 003 fails on legacy databases missing the dimension column
  • Fixed: migration 003 is not safe to retry after a partial failure
  • Improved: dot-directories (.venv, .git, .cache, etc.) are now always excluded from indexing

itsmostafa added 2 commits May 1, 2026 23:38
Previously, indexFile only searched for active=1 documents. When a
deactivated row existed for the same (collection, path), the subsequent
INSERT violated the UNIQUE(collection, path) constraint and the file
silently stayed unindexed.

Fix: drop the active=1 filter from the existence lookup and include the
active column. A deactivated row is now reactivated via the existing
update branch (which already regenerates chunks), and its stats are
counted as FilesAdded since the file was not searchable before.
Covers the full delete-then-restore cycle: index → remove from disk →
re-index (FilesRemoved=1) → restore with new content → re-index
(FilesAdded=1, active=1, chunks populated).
@itsmostafa itsmostafa self-assigned this May 2, 2026
itsmostafa added 7 commits May 1, 2026 23:40
chunk_vectors and embeddings referenced chunks(id) with NO ACTION,
causing FK violations (and a silent rollback) whenever a changed document
was reindexed while embeddings existed. SQLite does not support ALTER TABLE
to change FK actions, so this migration rebuilds both tables with the
correct ON DELETE CASCADE constraint.
TestIndexer_ReindexWithEmbeddings indexes a file, inserts chunk_vectors
and embeddings rows for a real chunk, modifies the file, reindexes, and
asserts FilesUpdated==1 with zero orphan rows — the failure mode this
change prevents.
If the migration failed mid-run (after creating embeddings_new but before
completing), schema_version stayed at 2 and the next run would fail with
"table embeddings_new already exists". Adding DROP TABLE IF EXISTS before
each CREATE TABLE makes the migration safe to retry.
Migration 002 used CREATE TABLE IF NOT EXISTS, which was a no-op on
databases where embeddings already existed without the dimension column.
The INSERT in migration 003 then failed with "no such column: dimension".
Fix: give dimension a DEFAULT 0 and omit it from the INSERT select list
so the migration works regardless of the source table's schema.
Dot-prefixed directories (e.g. .venv, .cache, .mypy_cache) are
never user content, so skip them unconditionally rather than relying
on an enumerated denylist.
Adds a section advising when to use qi search/query vs qi ask, and
marks qi ask as to be used sparingly since it consumes LLM tokens.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an indexer edge case where previously-deactivated documents (from deleted files) could not be re-added when the file is restored, by reusing/reactivating the existing documents row instead of attempting a conflicting insert.

Changes:

  • Update indexFile to find documents regardless of active state, reactivate on update, and adjust stats counting for reactivated files.
  • Add regression tests covering restore-after-delete and reindexing behavior when embeddings exist.
  • Add a DB migration to apply ON DELETE CASCADE to chunk embedding reference tables; minor CLI skill/docs + marketplace version bump.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
internal/indexer/indexer.go Reactivate deactivated documents on restore; tweaks directory-walk ignore behavior
internal/indexer/indexer_test.go Adds restore-after-delete regression + embedding-related reindex test
internal/db/migrations/003_cascade_chunk_refs.sql Rebuilds embedding reference tables to add ON DELETE CASCADE
skills/qi-cli/SKILL.md Adds guidance to prefer search/query and use ask sparingly
.claude-plugin/marketplace.json Bumps plugin metadata version

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/indexer/indexer.go Outdated
Comment thread internal/indexer/indexer.go
Comment thread internal/db/migrations/003_cascade_chunk_refs.sql
Comment thread internal/db/migrations/003_cascade_chunk_refs.sql Outdated
Comment thread .claude-plugin/marketplace.json
Comment thread internal/indexer/indexer.go
itsmostafa added 6 commits May 2, 2026 00:01
The INSERT into embeddings_new omitted dimension, silently writing 0 for
all migrated rows even when the source table had real values. Derive
dimension from length(cv.vector)/4 via a LEFT JOIN on chunk_vectors so
both old schemas (no dimension column) and current ones are handled
correctly without touching a potentially-absent source column.
Swallowing the Scan error left docID=0 on any transient DB failure,
causing a fallthrough to INSERT that would hit the UNIQUE(collection,path)
constraint and silently leave the file unindexed.
When a deactivated document was restored with byte-identical content,
the indexer still ran DELETE FROM chunks, which cascades into
chunk_vectors and embeddings (added by migration 003), forcing
unnecessary re-embedding work.

Add a fast-path for `docID != 0 && existingActive == 0 && existingHash == hash`:
reactivate the document row and return without touching chunks or embeddings.
Adds TestIndexer_ReactivateSameContent to guard the preserved-embedding invariant.
@itsmostafa itsmostafa merged commit 9b2a1dc into main May 3, 2026
@itsmostafa itsmostafa deleted the fix/indexer-reindex-deleted-files branch May 3, 2026 04:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants