bug: build --no-incremental silently wipes the embeddings table #982

@carlos-alm

Found during dogfooding v3.9.4

Severity: High
Command: codegraph build --no-incremental

Running a full rebuild after codegraph embed silently drops every row from the embeddings table. There is no warning, no prompt, no opt-out. Users who have spent minutes generating embeddings (especially with the larger Jina models) lose all of them on the next --no-incremental build.

Reproduction

mkdir embed-test && cd embed-test
cat > a.js <<'EOF'
export function alpha() { return 1; }
export function beta() { return alpha(); }
EOF

npx codegraph build .
npx codegraph embed . -m minilm
# Stored 2 embeddings (384d, ...) in graph.db

node -e "const db = require('better-sqlite3')('.codegraph/graph.db'); \
  console.log('before:', db.prepare('SELECT COUNT(*) c FROM embeddings').get().c);"
# before: 2

npx codegraph build . --no-incremental
node -e "const db = require('better-sqlite3')('.codegraph/graph.db'); \
  console.log('after:',  db.prepare('SELECT COUNT(*) c FROM embeddings').get().c);"
# after: 0

Expected behavior

One of:

  1. Preserve embeddings whose node_id still maps to a live node in the rebuilt graph (ideal).
  2. Warn the user before wiping, for example: "[codegraph WARN] --no-incremental will discard N embeddings; re-run `codegraph embed` after the build."
  3. Require an explicit --wipe-embeddings flag to opt into destruction.
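
To make options 1 and 2 concrete, here is a minimal sketch in plain JavaScript. It is not taken from the codegraph source: `partitionEmbeddings`, `wipeWarning`, and the row shape (`node_id`, `vector`) are hypothetical names chosen for illustration, and the real build step would read these rows from graph.db rather than from an array.

```javascript
// Hypothetical helpers: none of these names come from the codegraph codebase.

// Option 1: split embedding rows into those whose node_id is still present
// after the rebuild (kept) and those whose node is gone (orphaned).
function partitionEmbeddings(embeddings, liveNodeIds) {
  const live = new Set(liveNodeIds);
  const kept = [];
  const orphaned = [];
  for (const row of embeddings) {
    (live.has(row.node_id) ? kept : orphaned).push(row);
  }
  return { kept, orphaned };
}

// Option 2: build the warning text proposed in this issue, or return null
// when nothing would be discarded.
function wipeWarning(count) {
  if (count === 0) return null;
  return (
    `[codegraph WARN] --no-incremental will discard ${count} embeddings; ` +
    "re-run `codegraph embed` after the build."
  );
}

// Example: node 1 disappeared in the rebuild, node 2 survived.
const rows = [
  { node_id: 1, vector: [0.1, 0.2] },
  { node_id: 2, vector: [0.3, 0.4] },
];
const { kept, orphaned } = partitionEmbeddings(rows, [2, 3]);
const warning = wipeWarning(orphaned.length);
if (warning !== null) console.warn(warning);
```

Note that this only helps if node_ids are stable across a full rebuild. If the rebuild reassigns ids, every row would land in `orphaned`, which is the case the re-keying idea under "Suggested fix" is meant to address.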

Actual behavior

Embeddings table is emptied silently. Subsequent codegraph search returns zero results with no hint about why.

Suggested fix

The simplest safe change is option 2: warn before wiping. For option 1, keep the embeddings table intact across the full rebuild, then either (a) leave the existing rows in place, including those whose node_id no longer exists, and let a downstream validator prune the orphans, or (b) re-key embeddings by symbol signature (name + file + kind) so they survive re-identification.
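
Variant (b) could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the node shape (`id`, `name`, `file`, `kind`), and the idea of snapshotting the old nodes before the rebuild are not from the codegraph codebase.

```javascript
// Hypothetical sketch of re-keying embeddings by symbol signature.

// Signature = file + kind + name, serialized so the parts cannot collide.
function signatureOf(node) {
  return JSON.stringify([node.file, node.kind, node.name]);
}

// oldNodes: nodes snapshotted before the rebuild; newNodes: nodes after it.
// Returns embeddings re-pointed at new ids, plus the rows with no match.
function rekeyEmbeddings(oldNodes, newNodes, embeddings) {
  const oldSigById = new Map(oldNodes.map((n) => [n.id, signatureOf(n)]));
  const newIdBySig = new Map(newNodes.map((n) => [signatureOf(n), n.id]));
  const rekeyed = [];
  const dropped = [];
  for (const row of embeddings) {
    const sig = oldSigById.get(row.node_id);
    const newId = sig === undefined ? undefined : newIdBySig.get(sig);
    if (newId === undefined) dropped.push(row);
    else rekeyed.push({ ...row, node_id: newId });
  }
  return { rekeyed, dropped };
}

// Example using the two functions from the reproduction, with made-up ids.
const oldNodes = [
  { id: 1, name: "alpha", file: "a.js", kind: "function" },
  { id: 2, name: "beta", file: "a.js", kind: "function" },
];
const newNodes = [
  { id: 10, name: "beta", file: "a.js", kind: "function" },
  { id: 11, name: "alpha", file: "a.js", kind: "function" },
];
const result = rekeyEmbeddings(oldNodes, newNodes, [
  { node_id: 1, vector: [0.1] },
  { node_id: 2, vector: [0.2] },
  { node_id: 99, vector: [0.3] }, // no matching old node, so it is dropped
]);
console.log(result.rekeyed.map((r) => r.node_id), result.dropped.length);
```

Two caveats with this approach: name + file + kind is not guaranteed to be unique (for example, two same-named functions in different scopes of one file), and a symbol whose body changed would keep a stale embedding unless a content hash is also compared.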

Related

Full rebuild also invalidates any external consumers that cached node_ids from the old DB. A migration note in the release notes would help, but the silent data loss is the bigger issue.
