Commit 08057f0

fix: resolve all dogfood v2.2.0 bugs
- structure: treat `.` as no filter in structureData() (#1)
- builder: invalidate embeddings when nodes are deleted during build, warn about orphaned embeddings after rebuild (#2)
- embedder: change default model to minilm (public, no auth required), catch auth/download errors with clear guidance (#3)
- embedder: split camelCase/snake_case identifiers in embedding text for better search relevance (search quality note)
- export: add --min-confidence filter (default 0.5) to DOT/Mermaid/JSON exports, filtering spurious low-confidence edges (#4)
- dogfood report: annotate all bugs as fixed

Impact: 8 functions changed, 8 affected
1 parent ca5cf24 commit 08057f0

6 files changed, +111 -18 lines changed

generated/DOGFOOD_REPORT_v2.2.0.md

Lines changed: 10 additions & 0 deletions
@@ -42,6 +42,8 @@
 
 **Fix:** Treat `.` (or current dir equivalent) as `null`/no filter in `structureData()`.
 
+> **FIXED** `structureData()` now normalizes the directory argument and treats `"."` as null/no filter. (`src/structure.js`)
+
 ### 2. Stale embeddings after rebuild (Medium severity)
 
 - After an incremental `build`, embedding `node_id`s become orphaned (e.g. old IDs in 3077-range, new IDs in 4335-range)
@@ -52,6 +54,8 @@
 
 **Fix:** Either preserve node IDs across rebuilds, invalidate embeddings when node IDs change, or warn the user to re-run `embed`.
 
+> **FIXED** — Build now invalidates embeddings alongside nodes. Full builds clear the embeddings table entirely. Incremental builds delete embeddings for affected files before deleting their nodes (order matters — need node IDs to find them). After the build, any remaining orphaned embeddings trigger a warning: `"N embeddings are orphaned (nodes changed). Run codegraph embed to refresh."` (`src/builder.js`)
+
 ### 3. `embed` default model requires HuggingFace auth (Medium severity)
 
 - `codegraph embed .` crashes with `Error: Unauthorized access to file` for the default `jina-code` model
@@ -61,6 +65,8 @@
 
 **Fix:** Either default to a public model (e.g. `minilm`), auto-fallback to `minilm` on auth failure, or catch the error and provide a clear message with instructions.
 
+> **FIXED** — Default model changed from `nomic-v1.5` (gated, requires HF_TOKEN) to `minilm` (public, 23MB, always works). Additionally, `loadModel()` now catches auth/download failures and prints a clear message with options (set HF_TOKEN or use `--model minilm`) instead of crashing with a raw stack trace. (`src/embedder.js`, `src/cli.js`)
+
 ### 4. Cross-language false positive in export (Low severity)
 
 - One low-confidence (0.3) call edge: `main` (build.rs) → `setup` (tests/unit/structure.test.js)
@@ -69,6 +75,8 @@
 
 **Fix:** Export commands could support a `--min-confidence` filter, or the default export could exclude edges below a threshold (e.g. 0.5).
 
+> **FIXED** — Added `--min-confidence <score>` option to the `export` command (default: 0.5). All three formats (DOT, Mermaid, JSON) filter edges by confidence at the SQL level. The 0.3-confidence false positive is excluded by default. Users can pass `--min-confidence 0` to include all edges. (`src/export.js`, `src/cli.js`)
+
 ## `--no-tests` Flag
 
 Tested on `stats` and `map` — both correctly filter out test files:
@@ -80,3 +88,5 @@ Tested on `stats` and `map` — both correctly filter out test files:
 - `embed --model minilm` successfully generated 392 embeddings (384d)
 - `search "build graph"` returned 15 results after fresh embeddings (top hit: 37.9% `test_triangle_cycle`)
 - Search quality is reasonable but not ideal — `buildGraph` itself didn't appear in results for "build graph"
+
+> **FIXED** — Embedding text now includes a readable split of the identifier name (e.g. `buildGraph` → `"function buildGraph (build Graph) in src/builder.js"`). This lets the model naturally associate "build graph" queries with `buildGraph` without needing hybrid search. camelCase, PascalCase, snake_case, and kebab-case are all handled. (`src/embedder.js`)

src/builder.js

Lines changed: 38 additions & 2 deletions
@@ -402,13 +402,30 @@ export async function buildGraph(rootDir, opts = {}) {
     return;
   }
 
+  // Check if embeddings table exists (created by `embed`, not by initSchema)
+  let hasEmbeddings = false;
+  try {
+    db.prepare('SELECT 1 FROM embeddings LIMIT 1').get();
+    hasEmbeddings = true;
+  } catch {
+    /* table doesn't exist */
+  }
+
   if (isFullBuild) {
+    const deletions =
+      'PRAGMA foreign_keys = OFF; DELETE FROM node_metrics; DELETE FROM edges; DELETE FROM nodes; PRAGMA foreign_keys = ON;';
     db.exec(
-      'PRAGMA foreign_keys = OFF; DELETE FROM node_metrics; DELETE FROM edges; DELETE FROM nodes; PRAGMA foreign_keys = ON;',
+      hasEmbeddings
+        ? `${deletions.replace('PRAGMA foreign_keys = ON;', '')} DELETE FROM embeddings; PRAGMA foreign_keys = ON;`
+        : deletions,
     );
   } else {
     info(`Incremental: ${parseChanges.length} changed, ${removed.length} removed`);
-    // Remove metrics/edges/nodes for changed and removed files
+    // Remove embeddings/metrics/edges/nodes for changed and removed files
+    // Embeddings must be deleted BEFORE nodes (we need node IDs to find them)
+    const deleteEmbeddingsForFile = hasEmbeddings
+      ? db.prepare('DELETE FROM embeddings WHERE node_id IN (SELECT id FROM nodes WHERE file = ?)')
+      : null;
     const deleteNodesForFile = db.prepare('DELETE FROM nodes WHERE file = ?');
     const deleteEdgesForFile = db.prepare(`
       DELETE FROM edges WHERE source_id IN (SELECT id FROM nodes WHERE file = @f)
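
Review note: the full-build branch splices `DELETE FROM embeddings;` into the statement string with `String.prototype.replace`. A minimal sketch of one way to assemble the same statements from a list is below. `fullBuildSql` is a hypothetical helper name that does not appear in this commit; the table names are the ones in the hunk above.

```javascript
// Hypothetical helper (not in the commit): assemble the full-build cleanup SQL.
// Table names are taken from the diff; nothing else about the schema is assumed.
function fullBuildSql(hasEmbeddings) {
  const tables = ['node_metrics', 'edges', 'nodes'];
  // The embeddings table is created by `embed`, so it may not exist yet.
  if (hasEmbeddings) tables.push('embeddings');
  return [
    'PRAGMA foreign_keys = OFF;',
    ...tables.map((t) => `DELETE FROM ${t};`),
    'PRAGMA foreign_keys = ON;',
  ].join(' ');
}

console.log(fullBuildSql(false));
console.log(fullBuildSql(true));
```

With `hasEmbeddings` false, this yields the same string as the `deletions` constant in the hunk; with it true, the embeddings delete lands before foreign keys are switched back on.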
@@ -418,12 +435,14 @@ export async function buildGraph(rootDir, opts = {}) {
       'DELETE FROM node_metrics WHERE node_id IN (SELECT id FROM nodes WHERE file = ?)',
     );
     for (const relPath of removed) {
+      deleteEmbeddingsForFile?.run(relPath);
       deleteEdgesForFile.run({ f: relPath });
       deleteMetricsForFile.run(relPath);
       deleteNodesForFile.run(relPath);
     }
     for (const item of parseChanges) {
       const relPath = item.relPath || normalizePath(path.relative(rootDir, item.file));
+      deleteEmbeddingsForFile?.run(relPath);
       deleteEdgesForFile.run({ f: relPath });
       deleteMetricsForFile.run(relPath);
       deleteNodesForFile.run(relPath);
@@ -823,6 +842,23 @@ export async function buildGraph(rootDir, opts = {}) {
   const nodeCount = db.prepare('SELECT COUNT(*) as c FROM nodes').get().c;
   info(`Graph built: ${nodeCount} nodes, ${edgeCount} edges`);
   info(`Stored in ${dbPath}`);
+
+  // Warn about orphaned embeddings that no longer match any node
+  if (hasEmbeddings) {
+    try {
+      const orphaned = db
+        .prepare('SELECT COUNT(*) as c FROM embeddings WHERE node_id NOT IN (SELECT id FROM nodes)')
+        .get().c;
+      if (orphaned > 0) {
+        warn(
+          `${orphaned} embeddings are orphaned (nodes changed). Run "codegraph embed" to refresh.`,
+        );
+      }
+    } catch {
+      /* ignore — embeddings table may have been dropped */
+    }
+  }
+
   db.close();
 
   // Write journal header after successful build
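
Review note: the ordering constraint ("embeddings must be deleted BEFORE nodes") and the orphan count can be illustrated without a database. The sketch below is a simplified in-memory model, with plain arrays standing in for the `nodes` and `embeddings` tables; the sample rows and function names are made up for illustration and are not how codegraph stores data.

```javascript
// Simplified in-memory stand-ins for the `nodes` and `embeddings` tables.
// This models the ordering constraint only; sample rows are invented.
function makeDb() {
  return {
    nodes: [
      { id: 1, file: 'src/a.js' },
      { id: 2, file: 'src/a.js' },
      { id: 3, file: 'src/b.js' },
    ],
    embeddings: [{ node_id: 1 }, { node_id: 2 }, { node_id: 3 }],
  };
}

// Mirrors: DELETE FROM embeddings WHERE node_id IN (SELECT id FROM nodes WHERE file = ?)
function deleteEmbeddingsForFile(db, file) {
  const ids = new Set(db.nodes.filter((n) => n.file === file).map((n) => n.id));
  db.embeddings = db.embeddings.filter((e) => !ids.has(e.node_id));
}

// Mirrors: DELETE FROM nodes WHERE file = ?
function deleteNodesForFile(db, file) {
  db.nodes = db.nodes.filter((n) => n.file !== file);
}

// Mirrors: SELECT COUNT(*) FROM embeddings WHERE node_id NOT IN (SELECT id FROM nodes)
function countOrphans(db) {
  const ids = new Set(db.nodes.map((n) => n.id));
  return db.embeddings.filter((e) => !ids.has(e.node_id)).length;
}

// Order used in the commit: embeddings first, then nodes.
const good = makeDb();
deleteEmbeddingsForFile(good, 'src/a.js');
deleteNodesForFile(good, 'src/a.js');
console.log(countOrphans(good)); // 0

// Reversed order: the node lookup finds nothing, so two embeddings are stranded.
const bad = makeDb();
deleteNodesForFile(bad, 'src/a.js');
deleteEmbeddingsForFile(bad, 'src/a.js');
console.log(countOrphans(bad)); // 2
```

In the reversed order the subquery on `nodes` returns no IDs, which is the situation the post-build warning is meant to surface.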

src/cli.js

Lines changed: 9 additions & 4 deletions
@@ -272,10 +272,15 @@ program
   .option('--functions', 'Function-level graph instead of file-level')
   .option('-T, --no-tests', 'Exclude test/spec files')
   .option('--include-tests', 'Include test/spec files (overrides excludeTests config)')
+  .option('--min-confidence <score>', 'Minimum edge confidence threshold (default: 0.5)', '0.5')
   .option('-o, --output <file>', 'Write to file instead of stdout')
   .action((opts) => {
     const db = new Database(findDbPath(opts.db), { readonly: true });
-    const exportOpts = { fileLevel: !opts.functions, noTests: resolveNoTests(opts) };
+    const exportOpts = {
+      fileLevel: !opts.functions,
+      noTests: resolveNoTests(opts),
+      minConfidence: parseFloat(opts.minConfidence),
+    };
 
     let output;
     switch (opts.format) {
@@ -412,7 +417,7 @@ program
   .action(() => {
     console.log('\nAvailable embedding models:\n');
     for (const [key, config] of Object.entries(MODELS)) {
-      const def = key === 'nomic-v1.5' ? ' (default)' : '';
+      const def = key === 'minilm' ? ' (default)' : '';
       console.log(` ${key.padEnd(12)} ${String(config.dim).padStart(4)}d ${config.desc}${def}`);
     }
     console.log('\nUsage: codegraph embed --model <name>');
@@ -426,8 +431,8 @@ program
   )
   .option(
     '-m, --model <name>',
-    'Embedding model: minilm, jina-small, jina-base, jina-code, nomic, nomic-v1.5 (default), bge-large. Run `codegraph models` for details',
-    'nomic-v1.5',
+    'Embedding model: minilm (default), jina-small, jina-base, jina-code, nomic, nomic-v1.5, bge-large. Run `codegraph models` for details',
+    'minilm',
   )
   .action(async (dir, opts) => {
     const root = path.resolve(dir || '.');
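
Review note on the `--min-confidence` option added in this file: `parseFloat(opts.minConfidence)` returns `NaN` for non-numeric input such as `--min-confidence abc`, and the `??` fallback in `src/export.js` only replaces `null` and `undefined`, so `NaN` would be passed on to the query as-is. A minimal sketch of a stricter parser follows. `parseMinConfidence` is a hypothetical helper that is not part of this commit, and the `[0, 1]` range is an assumption (the diff only shows the values 0, 0.3, and 0.5).

```javascript
// Hypothetical validator (not in the commit) for the --min-confidence value.
const DEFAULT_MIN_CONFIDENCE = 0.5; // same default as src/export.js in this commit

function parseMinConfidence(raw) {
  if (raw === undefined || raw === null) return DEFAULT_MIN_CONFIDENCE;
  const value = Number.parseFloat(raw);
  // Assumption: confidence scores lie in [0, 1].
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new RangeError(`--min-confidence must be a number between 0 and 1, got "${raw}"`);
  }
  return value;
}

console.log(parseMinConfidence('0.5')); // 0.5
console.log(parseMinConfidence('0')); // 0
console.log(parseMinConfidence(undefined)); // 0.5
```

Failing fast at the CLI boundary keeps the export functions free of input validation.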

src/embedder.js

Lines changed: 36 additions & 4 deletions
@@ -4,6 +4,18 @@ import Database from 'better-sqlite3';
 import { findDbPath, openReadonlyOrFail } from './db.js';
 import { warn } from './logger.js';
 
+/**
+ * Split an identifier into readable words.
+ * camelCase/PascalCase → "camel Case", snake_case → "snake case", kebab-case → "kebab case"
+ */
+function splitIdentifier(name) {
+  return name
+    .replace(/([a-z])([A-Z])/g, '$1 $2')
+    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
+    .replace(/[_-]+/g, ' ')
+    .trim();
+}
+
 // Lazy-load transformers (heavy, optional module)
 let pipeline = null;
 let _cos_sim = null;
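
Review note: a few sample inputs make the behavior of `splitIdentifier` concrete. The function body below is copied from the hunk above so the snippet runs on its own; the sample identifiers are illustrative.

```javascript
// Copied from the diff above so the examples are runnable on their own.
function splitIdentifier(name) {
  return name
    .replace(/([a-z])([A-Z])/g, '$1 $2') // lower-to-upper boundary: "buildGraph"
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2') // acronym followed by a word: "HTMLParser"
    .replace(/[_-]+/g, ' ') // snake_case and kebab-case separators
    .trim();
}

console.log(splitIdentifier('buildGraph')); // "build Graph"
console.log(splitIdentifier('HTMLParser')); // "HTML Parser"
console.log(splitIdentifier('snake_case_name')); // "snake case name"
console.log(splitIdentifier('kebab-case-name')); // "kebab case name"
console.log(splitIdentifier('parseV2Config')); // "parse V2Config"
```

Two properties are worth noting: the original letter case is preserved (the output is "build Graph", not "build graph"), and digits are not treated as word boundaries.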
@@ -55,7 +67,7 @@ export const MODELS = {
   },
 };
 
-export const DEFAULT_MODEL = 'nomic-v1.5';
+export const DEFAULT_MODEL = 'minilm';
 const BATCH_SIZE_MAP = {
   minilm: 32,
   'jina-small': 16,
@@ -103,8 +115,27 @@ async function loadModel(modelKey) {
   _cos_sim = transformers.cos_sim;
 
   console.log(`Loading embedding model: ${config.name} (${config.dim}d)...`);
-  const opts = config.quantized ? { quantized: true } : {};
-  extractor = await pipeline('feature-extraction', config.name, opts);
+  const pipelineOpts = config.quantized ? { quantized: true } : {};
+  try {
+    extractor = await pipeline('feature-extraction', config.name, pipelineOpts);
+  } catch (err) {
+    const msg = err.message || String(err);
+    if (msg.includes('Unauthorized') || msg.includes('401') || msg.includes('gated')) {
+      console.error(
+        `\nModel "${config.name}" requires authentication.\n` +
+          `This model is gated on HuggingFace and needs an access token.\n\n` +
+          `Options:\n` +
+          ` 1. Set HF_TOKEN env var: export HF_TOKEN=hf_...\n` +
+          ` 2. Use a public model instead: codegraph embed --model minilm\n`,
+      );
+    } else {
+      console.error(
+        `\nFailed to load model "${config.name}": ${msg}\n` +
+          `Try a different model: codegraph embed --model minilm\n`,
+      );
+    }
+    process.exit(1);
+  }
   activeModel = config.name;
   console.log('Model loaded.');
   return { extractor, config };
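
Review note: the auth-versus-other decision in the `catch` block can be exercised without downloading a model if it is pulled into a pure function. A minimal sketch follows; `classifyLoadError` is a hypothetical name that does not appear in this commit, and the substrings are the three checked in the hunk above.

```javascript
// Hypothetical pure helper (not in the commit): decide which message to print.
// The markers are the substrings tested in the catch block of loadModel().
function classifyLoadError(err) {
  const msg = (err && err.message) || String(err);
  const authMarkers = ['Unauthorized', '401', 'gated'];
  return authMarkers.some((marker) => msg.includes(marker)) ? 'auth' : 'other';
}

console.log(classifyLoadError(new Error('Unauthorized access to file'))); // "auth"
console.log(classifyLoadError(new Error('fetch failed'))); // "other"
console.log(classifyLoadError('HTTP 401')); // "auth"
```

`String.prototype.includes` is case-sensitive, so a message containing only lowercase "unauthorized" would fall through to the generic branch, both here and in the committed check.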
@@ -219,7 +250,8 @@ export async function buildEmbeddings(rootDir, modelKey, customDbPath) {
           : Math.min(lines.length, startLine + 15);
       const context = lines.slice(startLine, endLine).join('\n');
 
-      const text = `${node.kind} ${node.name} in ${file}\n${context}`;
+      const readable = splitIdentifier(node.name);
+      const text = `${node.kind} ${node.name} (${readable}) in ${file}\n${context}`;
       texts.push(text);
       nodeIds.push(node.id);
       previews.push(`${node.name} (${node.kind}) -- ${file}:${node.line}`);

src/export.js

Lines changed: 16 additions & 7 deletions
@@ -1,12 +1,15 @@
 import path from 'node:path';
 import { isTestFile } from './queries.js';
 
+const DEFAULT_MIN_CONFIDENCE = 0.5;
+
 /**
  * Export the dependency graph in DOT (Graphviz) format.
  */
 export function exportDOT(db, opts = {}) {
   const fileLevel = opts.fileLevel !== false;
   const noTests = opts.noTests || false;
+  const minConf = opts.minConfidence ?? DEFAULT_MIN_CONFIDENCE;
   const lines = [
     'digraph codegraph {',
     ' rankdir=LR;',
@@ -23,8 +26,9 @@ export function exportDOT(db, opts = {}) {
         JOIN nodes n1 ON e.source_id = n1.id
         JOIN nodes n2 ON e.target_id = n2.id
         WHERE n1.file != n2.file AND e.kind IN ('imports', 'imports-type', 'calls')
+        AND e.confidence >= ?
       `)
-      .all();
+      .all(minConf);
     if (noTests) edges = edges.filter((e) => !isTestFile(e.source) && !isTestFile(e.target));
 
     // Try to use directory nodes from DB (built by structure analysis)
@@ -102,8 +106,9 @@ export function exportDOT(db, opts = {}) {
         JOIN nodes n2 ON e.target_id = n2.id
         WHERE n1.kind IN ('function', 'method', 'class', 'interface', 'type', 'struct', 'enum', 'trait', 'record', 'module') AND n2.kind IN ('function', 'method', 'class', 'interface', 'type', 'struct', 'enum', 'trait', 'record', 'module')
         AND e.kind = 'calls'
+        AND e.confidence >= ?
       `)
-      .all();
+      .all(minConf);
     if (noTests)
       edges = edges.filter((e) => !isTestFile(e.source_file) && !isTestFile(e.target_file));
 
@@ -126,6 +131,7 @@ export function exportDOT(db, opts = {}) {
 export function exportMermaid(db, opts = {}) {
   const fileLevel = opts.fileLevel !== false;
   const noTests = opts.noTests || false;
+  const minConf = opts.minConfidence ?? DEFAULT_MIN_CONFIDENCE;
   const lines = ['graph LR'];
 
   if (fileLevel) {
@@ -136,8 +142,9 @@ export function exportMermaid(db, opts = {}) {
         JOIN nodes n1 ON e.source_id = n1.id
         JOIN nodes n2 ON e.target_id = n2.id
         WHERE n1.file != n2.file AND e.kind IN ('imports', 'imports-type', 'calls')
+        AND e.confidence >= ?
       `)
-      .all();
+      .all(minConf);
     if (noTests) edges = edges.filter((e) => !isTestFile(e.source) && !isTestFile(e.target));
 
     for (const { source, target } of edges) {
@@ -155,8 +162,9 @@ export function exportMermaid(db, opts = {}) {
         JOIN nodes n2 ON e.target_id = n2.id
         WHERE n1.kind IN ('function', 'method', 'class', 'interface', 'type', 'struct', 'enum', 'trait', 'record', 'module') AND n2.kind IN ('function', 'method', 'class', 'interface', 'type', 'struct', 'enum', 'trait', 'record', 'module')
         AND e.kind = 'calls'
+        AND e.confidence >= ?
       `)
-      .all();
+      .all(minConf);
     if (noTests)
       edges = edges.filter((e) => !isTestFile(e.source_file) && !isTestFile(e.target_file));
 
@@ -175,6 +183,7 @@ export function exportMermaid(db, opts = {}) {
  */
 export function exportJSON(db, opts = {}) {
   const noTests = opts.noTests || false;
+  const minConf = opts.minConfidence ?? DEFAULT_MIN_CONFIDENCE;
 
   let nodes = db
     .prepare(`
@@ -185,13 +194,13 @@ export function exportJSON(db, opts = {}) {
 
   let edges = db
     .prepare(`
-      SELECT DISTINCT n1.file AS source, n2.file AS target, e.kind
+      SELECT DISTINCT n1.file AS source, n2.file AS target, e.kind, e.confidence
       FROM edges e
       JOIN nodes n1 ON e.source_id = n1.id
       JOIN nodes n2 ON e.target_id = n2.id
-      WHERE n1.file != n2.file
+      WHERE n1.file != n2.file AND e.confidence >= ?
     `)
-    .all();
+    .all(minConf);
   if (noTests) edges = edges.filter((e) => !isTestFile(e.source) && !isTestFile(e.target));
 
   return { nodes, edges };
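
Review note: the threshold logic shared by the three exporters can be sketched in memory. The snippet below stands in for the SQL condition `e.confidence >= ?`; `filterEdges` and the second sample edge are invented for illustration, while the 0.3 edge mirrors the false positive described in the dogfood report.

```javascript
// In-memory stand-in for the SQL filter `e.confidence >= ?` used by the exporters.
const DEFAULT_MIN_CONFIDENCE = 0.5;

function filterEdges(edges, opts = {}) {
  // `??` keeps an explicit 0; `opts.minConfidence || 0.5` would turn 0 into 0.5.
  const minConf = opts.minConfidence ?? DEFAULT_MIN_CONFIDENCE;
  return edges.filter((e) => e.confidence >= minConf);
}

// The 0.3 edge mirrors the reported false positive; the other edge is made up.
const edges = [
  { source: 'build.rs', target: 'tests/unit/structure.test.js', confidence: 0.3 },
  { source: 'src/cli.js', target: 'src/export.js', confidence: 1.0 },
];

console.log(filterEdges(edges).length); // 1
console.log(filterEdges(edges, { minConfidence: 0 }).length); // 2
```

Using `??` rather than `||` is what makes `--min-confidence 0` behave as "include all edges", as the report states.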

src/structure.js

Lines changed: 2 additions & 1 deletion
@@ -231,7 +231,8 @@ export function buildStructure(db, fileSymbols, _rootDir, lineCountMap, director
  */
 export function structureData(customDbPath, opts = {}) {
   const db = openReadonlyOrFail(customDbPath);
-  const filterDir = opts.directory || null;
+  const rawDir = opts.directory || null;
+  const filterDir = rawDir && normalizePath(rawDir) !== '.' ? rawDir : null;
   const maxDepth = opts.depth || null;
   const sortBy = opts.sort || 'files';
   const noTests = opts.noTests || false;
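
Review note: the committed check relies on the project's `normalizePath` helper, whose behavior for inputs such as `./` is not visible in this diff. A standalone sketch of the same idea is below. `toDirFilter` is a hypothetical function written for illustration; unlike the committed code it returns the cleaned path rather than the raw argument.

```javascript
// Hypothetical standalone version of the "treat `.` as no filter" rule.
// It does not call the project's normalizePath(); the cleanup steps are assumptions.
function toDirFilter(directory) {
  if (!directory) return null;
  const cleaned = directory
    .replace(/\\/g, '/') // Windows separators to forward slashes
    .replace(/^(\.\/)+/, '') // drop leading "./"
    .replace(/\/+$/, ''); // drop trailing slashes
  return cleaned === '' || cleaned === '.' ? null : cleaned;
}

console.log(toDirFilter('.')); // null
console.log(toDirFilter('./')); // null
console.log(toDirFilter(undefined)); // null
console.log(toDirFilter('src')); // "src"
console.log(toDirFilter('./src/')); // "src"
```

Whether the real helper treats `./` the same way as `.` would need to be confirmed against `normalizePath` itself.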
