feat(core): v2.2.0 - CPG, HNSW, Hybrid Embedding, Cross-Encoder Reranking#9
feat(core): v2.2.0 - CPG, HNSW, Hybrid Embedding, Cross-Encoder Reranking#9
Conversation
Phase 2: Code Property Graph foundation - Add CPG types with AST/CFG/DFG/CallGraph layers - Implement edge types: CHILD, CALLS, DEFINES, IMPORTS, etc. Phase 4: Adaptive retrieval system - Add query classifier (semantic/structural/historical/hybrid) - Implement query expander with synonym/abbreviation resolution - Add adaptive weight computation based on query type - Implement result fusion and basic re-ranking Phase 5: Performance optimization - Add parallel indexing pipeline with configurable worker pool - Implement MemoryMonitor with adaptive worker count - Add HNSW vector index foundation - Implement error handling with fallback parsing Documentation: - Add AGENTS.md for root, src/core/, src/commands/ - Add pre_plan/optimization-plan.md with 20-week roadmap Tests: - Add retrieval.test.ts (8 tests passing) - Add indexing.test.ts (infrastructure tests) Refactor: - Simplify indexer.ts to use new parallel pipeline - Integrate adaptive retrieval into search.ts - Add HNSW config to sq8.ts
Phase 1: Chunking improvements Chunker module (chunker.ts): - Implement AST-aware hierarchical chunking - Configurable maxTokens (default: 512) - Priority constructs: functions, classes, methods, interfaces - Automatic splitting for oversized chunks with overlap - AST path metadata for each chunk - Symbol reference extraction Chunk relations (chunkRelations.ts): - Infer caller/callee relationships from symbol references - Build parent-child relationships from AST path nesting - Type-based and file-based chunk organization - getRelatedChunks() for traversal up to maxDepth Types extension (types.ts): - Extend ChunkRow with optional AST metadata fields: - file_path, start_line, end_line, ast_path - node_type, token_count, symbol_references Fix (parallel.ts): - Handle empty parse results with fallback (malformed code) - parseWithFallback now checks for empty results too Tests (chunker.test.mjs): - countTokens verification - Simple function chunking - Class with methods chunking - Large function splitting with maxTokens limit
Phase 2: Full CPG Implementation - cfgLayer.ts: Control flow graph (if/else, loops, try/catch) - dfgLayer.ts: Data flow graph (variable definitions/uses) - callGraph.ts: Cross-file call graph with import resolution - Enhanced types in types.ts Phase 3: Full Hybrid Embedding System - semantic.ts: Transformer-based semantic embeddings (CodeBERT) - structural.ts: AST-based structural embeddings (Weisfeiler-Lehman) - symbolic.ts: Symbol relationship embeddings - fusion.ts: Weighted fusion of multi-modal embeddings - tokenizer.ts: Subword tokenization for symbols - parser.ts: Code parsing utilities Phase 4: Cross-encoder Re-ranking (enhanced) - reranker.ts: Cross-encoder for result re-ranking - Improved score fusion with original retrieval scores Phase 5: Full HNSW Implementation - hnsw.ts: Complete Hierarchical Navigable Small World index - SQ8 quantization integration - Full persistence support - Optimized search algorithm Tests: - embedding.test.ts: Tests for hybrid embedding system - 25 total tests passing BREAKING: None - all additions are backward compatible
…king Version 2.2.0 - Major optimization release Core Features: - Code Property Graph (CPG): CFG, DFG, Call Graph, Import Graph layers - HNSW Vector Index: Hierarchical Navigable Small World for fast retrieval - Hybrid Embedding: Semantic/Structural/Symbolic multi-modal embeddings - Cross-Encoder Reranking: ONNX Runtime-powered result re-ranking with fallback Improvements: - AST-aware chunking with semantic boundaries - Parallel indexing pipeline with worker pool - Memory monitoring and adaptive optimization - CFG branch/loop/switch handling fixes - Short-circuit expression support (&&, ||, ternary) Documentation: - Add docs/cross-encoder.md (ONNX cross-encoder feature guide) - Update docs/README.md index Breaking Changes: - None - all additions are backward compatible Dependencies: - Add onnxruntime-node ^1.19.2 for cross-encoder support Tests: - Add test/cpg.test.ts (CFG/DFG/CallGraph tests) - Add test/hnsw.test.ts (HNSW index tests) - Add test/reranker.test.ts (Cross-encoder tests) - All 37 tests passing Note: pre_plan/ removed from tracking (optimization planning artifacts)
The default value `new LruCache(256)` caused TypeScript to infer the parameter type as `LruCache` instead of `Cache` interface, breaking tests that pass mock cache objects. Added `as Cache` assertion to ensure the parameter is typed correctly. All 37 tests pass.
There was a problem hiding this comment.
Pull request overview
Release v2.2.0 introduces major indexing + retrieval capabilities (CPG/CFG/DFG/call graph, HNSW vector index with SQ8 quantization, hybrid embeddings, and an adaptive retrieval pipeline with reranking), along with expanded documentation and new tests.
Changes:
- Added Code Property Graph (AST/CFG/DFG + call/import graphs) and parallel indexing with memory-aware throttling.
- Implemented HNSW index + SQ8 quantization improvements, plus hybrid (semantic/structural/symbolic) embedding pipeline.
- Added adaptive retrieval utilities (classification, expansion, weighting, fusion, reranking) and documentation.
Reviewed changes
Copilot reviewed 49 out of 50 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| test/retrieval.test.ts | Adds retrieval pipeline tests (currently TS test file). |
| test/reranker.test.ts | Adds cross-encoder reranker/cache tests (currently TS test file). |
| test/indexing.test.ts | Adds parallel indexing + HNSW tests (currently TS test file). |
| test/hnsw.test.ts | Adds HNSW correctness + persistence tests (currently TS test file). |
| test/embedding.test.ts | Adds hybrid embedding component tests (currently TS test file). |
| test/cpg.test.ts | Adds CFG/DFG/call-graph tests (currently TS test file). |
| test/chunker.test.mjs | Adds AST-aware chunking test script (not using node:test). |
| src/core/types.ts | Extends ChunkRow with optional AST/chunk metadata fields. |
| src/core/sq8.ts | Adds configurable-bit quantization and quantizeToBits helper. |
| src/core/search.ts | Adds adaptive query planning + fusion/rerank entrypoints. |
| src/core/retrieval/weights.ts | Implements query-type-based base weights + feedback biasing. |
| src/core/retrieval/types.ts | Defines retrieval types and weights/interfaces. |
| src/core/retrieval/reranker.ts | Adds heuristic reranker + cross-encoder reranker (ONNX + cache fallback). |
| src/core/retrieval/index.ts | Exposes retrieval submodules via a public barrel. |
| src/core/retrieval/fuser.ts | Adds weighted fusion + per-source normalization. |
| src/core/retrieval/expander.ts | Adds abbreviation/synonym/domain-vocab query expansion. |
| src/core/retrieval/classifier.ts | Adds heuristic query classifier + entity extraction. |
| src/core/retrieval/cache.ts | Adds simple LRU cache for reranker scores. |
| src/core/parser/chunker.ts | Adds AST-aware chunking implementation + metadata extraction. |
| src/core/parser/chunkRelations.ts | Adds chunk relationship inference utilities. |
| src/core/indexing/parallel.ts | Adds parallel indexing pipeline with throttling + fallback parsing. |
| src/core/indexing/monitor.ts | Adds memory monitor + adaptive worker throttling. |
| src/core/indexing/index.ts | Adds indexing public exports (HNSW, configs, parallel indexing). |
| src/core/indexing/hnsw.ts | Adds HNSW index implementation + persistence/snapshots. |
| src/core/indexing/config.ts | Adds indexing + error handling configuration defaults/merging. |
| src/core/indexer.ts | Switches full indexing flow to parallel indexing + new config plumbing. |
| src/core/embedding/types.ts | Adds embedding interfaces/config types (semantic/structural/symbolic/fusion). |
| src/core/embedding/tokenizer.ts | Adds basic tokenizer loader for ONNX models. |
| src/core/embedding/symbolic.ts | Adds symbolic embedder based on hashed token/relation features. |
| src/core/embedding/structural.ts | Adds WL-style structural embedder over AST. |
| src/core/embedding/semantic.ts | Adds ONNX-based semantic embedder with hashing fallback + caching. |
| src/core/embedding/parser.ts | Adds helper to parse code to a Tree-sitter AST for embeddings. |
| src/core/embedding/index.ts | Adds HybridEmbedder combining semantic/structural/symbolic + fusion. |
| src/core/embedding/fusion.ts | Adds weighted embedding fusion with optional normalization. |
| src/core/cpg/types.ts | Defines CPG types, edge types, and node id helpers. |
| src/core/cpg/index.ts | Builds per-file and multi-file CPGs, including call/import layers. |
| src/core/cpg/dfgLayer.ts | Adds DFG layer + DFG builder for content parsing. |
| src/core/cpg/cfgLayer.ts | Adds CFG layer + CFG builder for content parsing. |
| src/core/cpg/callGraph.ts | Adds call graph/import graph builders + CallGraphBuilder. |
| src/core/cpg/astLayer.ts | Adds AST layer builder with child + next-token edges. |
| src/core/AGENTS.md | Adds repo-specific contributor/agent guidance for src/core. |
| src/commands/AGENTS.md | Adds repo-specific contributor/agent guidance for src/commands. |
| pre_plan/optimization-plan.md | Adds planning document describing the roadmap behind these changes. |
| package.json | Bumps version to 2.2.0 and adds onnxruntime-node dependency. |
| package-lock.json | Locks onnxruntime-node and its dependencies. |
| docs/zh-CN/rules.md | Updates Chinese rules documentation. |
| docs/cross-encoder.md | Adds documentation for cross-encoder reranking + ONNX usage. |
| docs/README.md | Links new cross-encoder documentation. |
| AGENTS.md | Adds project knowledge base + conventions/anti-patterns doc. |
| .git-ai/lancedb.tar.gz | Updates LFS-tracked index archive artifact. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| export interface AdaptiveRetrieval { | ||
| classifyQuery(query: string): QueryType; | ||
| expandQuery(query: string): string[]; | ||
| computeWeights(queryType: QueryType): RetrievalWeights; | ||
| fuseResults(candidates: RetrievalResult[]): RankedResult[]; | ||
| } |
There was a problem hiding this comment.
The AdaptiveRetrieval interface does not match the implemented APIs: expandQuery accepts an optional QueryType, computeWeights accepts optional WeightFeedback, and fuseResults requires weights (and optionally limit). This type is currently misleading for consumers; update the signatures or remove the interface if it isn’t used.
| function addShortCircuitEdges(root: Parser.SyntaxNode, filePath: string, edges: CPEEdge[]): void { | ||
| const visit = (node: Parser.SyntaxNode) => { | ||
| if (node.type === 'logical_expression' || node.type === 'binary_expression') { | ||
| buildLogicalExpression(node, filePath, edges); | ||
| } else if (node.type === 'conditional_expression' || node.type === 'ternary_expression') { | ||
| buildConditionalExpression(node, filePath, edges); | ||
| } | ||
| for (let i = 0; i < node.childCount; i++) { | ||
| const child = node.child(i); | ||
| if (child) visit(child); | ||
| } | ||
| }; | ||
| visit(root); | ||
| } |
There was a problem hiding this comment.
addShortCircuitEdges treats every binary_expression as a short-circuit expression and calls buildLogicalExpression. If the operator isn’t &&/||, extractLogicalOperator returns null and buildLogicalExpression still emits TRUE/FALSE branch edges, corrupting the CFG. Gate this logic on the operator being && or || (or only visit logical_expression nodes).
| function collectSymbolTable(contexts: CallGraphContext[]): Map<string, SymbolEntry> { | ||
| const table = new Map<string, SymbolEntry>(); | ||
| for (const ctx of contexts) { | ||
| const filePosix = toPosixPath(ctx.filePath); | ||
| const visit = (node: Parser.SyntaxNode) => { | ||
| if (node.type === 'function_declaration' || node.type === 'method_definition') { | ||
| const nameNode = node.childForFieldName('name'); | ||
| if (nameNode) { | ||
| const symbol = { | ||
| name: nameNode.text, | ||
| kind: node.type === 'method_definition' ? 'method' : 'function', | ||
| startLine: node.startPosition.row + 1, | ||
| endLine: node.endPosition.row + 1, | ||
| signature: node.text.split('{')[0].trim(), | ||
| }; | ||
| const id = symbolNodeId(filePosix, symbol); | ||
| table.set(symbol.name, { | ||
| id, | ||
| name: symbol.name, | ||
| file: filePosix, | ||
| kind: symbol.kind, | ||
| startLine: symbol.startLine, | ||
| endLine: symbol.endLine, | ||
| }); |
There was a problem hiding this comment.
collectSymbolTable keys the global symbol table only by symbol.name, so same-named functions/classes in different files will overwrite each other. This breaks cross-file resolution and can create incorrect call edges. Use a qualified key (e.g., ${filePosix}:${name}) and resolve imports to the correct file/module before mapping to a symbol id.
| export function runAdaptiveRetrieval( | ||
| query: string, | ||
| candidates: RetrievalResult[], | ||
| options: AdaptiveFusionOptions = {} | ||
| ): AdaptiveFusionOutput { | ||
| const plan = buildAdaptiveQueryPlan(query, options.feedback); | ||
| const fused = fuseResults(candidates, plan.weights, options.limit); | ||
| const results = rerank(plan.query, fused, { limit: options.limit }); | ||
| return { ...plan, results }; |
There was a problem hiding this comment.
runAdaptiveRetrieval always uses the heuristic rerank(...) function, but the PR description calls out “Cross-Encoder Re-ranking (ONNX-based scoring with fallback)”. If cross-encoder reranking is intended to be part of the adaptive retrieval pipeline, consider adding an option to use CrossEncoderReranker here (or clarify that it’s a separate, opt-in API).
| import test from 'node:test'; | ||
| import assert from 'node:assert/strict'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { classifyQuery } from '../dist/src/core/retrieval/classifier.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { expandQuery } from '../dist/src/core/retrieval/expander.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { computeWeights } from '../dist/src/core/retrieval/weights.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { fuseResults } from '../dist/src/core/retrieval/fuser.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { rerank } from '../dist/src/core/retrieval/reranker.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { runAdaptiveRetrieval } from '../dist/src/core/search.js'; | ||
| import type { QueryType, RetrievalResult } from '../src/core/retrieval/types'; | ||
|
|
There was a problem hiding this comment.
npm test runs node --test (see package.json scripts). Node’s test runner won’t execute TypeScript test files by default, and this file also contains TS-only syntax (e.g. import type). As a result, this test is likely not running in CI. Consider renaming to .test.js/.test.mjs (and importing from dist/), or updating the test command to run TS via an appropriate loader.
| import test from 'node:test'; | ||
| import assert from 'node:assert/strict'; | ||
| import os from 'os'; | ||
| import path from 'path'; | ||
| import fs from 'fs-extra'; | ||
| import { HNSWIndex } from '../dist/src/core/indexing/hnsw.js'; | ||
| import { quantizeSQ8 } from '../dist/src/core/sq8.js'; | ||
|
|
There was a problem hiding this comment.
This is a *.test.ts file, but npm test runs node --test without a TS loader. Unless the project is explicitly configured for TS test discovery/execution, this test likely isn’t running. Consider renaming to .test.js/.test.mjs or updating the test script to run TS tests.
| import test from 'node:test'; | ||
| import assert from 'node:assert/strict'; | ||
| import Parser from 'tree-sitter'; | ||
| import TypeScript from 'tree-sitter-typescript'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { OnnxSemanticEmbedder, defaultSemanticConfig } from '../dist/src/core/embedding/semantic.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { WlStructuralEmbedder } from '../dist/src/core/embedding/structural.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { GraphSymbolicEmbedder } from '../dist/src/core/embedding/symbolic.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { WeightedEmbeddingFusion } from '../dist/src/core/embedding/fusion.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { HybridEmbedder, defaultHybridEmbeddingConfig } from '../dist/src/core/embedding/index.js'; | ||
| import type { SymbolInfo } from '../src/core/types'; | ||
|
|
There was a problem hiding this comment.
npm test runs node --test without a TypeScript loader, and this file uses TS-only syntax (import type). That combination usually means the test won’t execute at all. Please rename to .test.js/.test.mjs or update the test script to run TypeScript tests via a loader so these assertions run in CI.
| import test from 'node:test'; | ||
| import assert from 'node:assert/strict'; | ||
| import path from 'path'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { buildCFG } from '../dist/src/core/cpg/cfgLayer.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { buildDFG } from '../dist/src/core/cpg/dfgLayer.js'; | ||
| // eslint-disable-next-line @typescript-eslint/ban-ts-comment | ||
| // @ts-ignore dist module has no typings | ||
| import { CallGraphBuilder } from '../dist/src/core/cpg/callGraph.js'; | ||
|
|
There was a problem hiding this comment.
npm test runs node --test without a TypeScript loader, so this new *.test.ts file is likely not being executed (and it includes TS-only syntax). Convert to .test.js/.test.mjs or update the test script to run TS tests, otherwise the CFG/DFG/CallGraph assertions may not run in CI.
| for (const childChunk of childChunks) { | ||
| childChunk.astPath = getAstPath(node).concat(childChunk.astPath); | ||
| chunks.push(childChunk); | ||
| } |
There was a problem hiding this comment.
chunkNode prepends getAstPath(node) to childChunk.astPath, but childChunk.astPath already comes from getAstPath(child) (which includes the full path from the root). This concatenation will duplicate the shared prefix and produce incorrect AST paths. Consider keeping the child chunk’s astPath as-is, or computing a relative path from node before concatenating.
| const symbols = new Set<string>(); | ||
| let m: RegExpExecArray | null; | ||
| while ((m = SYMBOL_PATTERN.exec(query)) !== null) { | ||
| const token = m[1]; | ||
| if (!token) continue; | ||
| if (token.length < 3) continue; | ||
| if (/^(the|and|for|with|from|into|over|when|where|what|show)$/i.test(token)) continue; | ||
| symbols.add(token); | ||
| } |
There was a problem hiding this comment.
SYMBOL_PATTERN is a module-level regex with the g flag; RegExp.exec advances lastIndex, so subsequent calls to extractEntities can skip matches unless lastIndex is reset. Reset SYMBOL_PATTERN.lastIndex = 0 at the start of extractEntities, or create a new RegExp per call.
- Change glob pattern from 'test/**/*.mjs' to 'test/*.test.mjs test/*.test.ts' - Node.js --test doesn't expand globs in some environments (CI/Linux) - Fixes PR #9 CI failure
- Node.js 22 supports native TypeScript execution with --test - Fixes CI failure: Unknown file extension '.ts'
Summary
v2.2.0 major release with significant performance improvements and new features.
Major Features
1. Code Property Graph (CPG)
2. HNSW Vector Index
3. Hybrid Embedding System
4. Adaptive Retrieval
5. AST-Aware Chunking
ast_path,symbol_references,parent_kindFiles Changed
Breaking Changes
None. All existing functionality preserved.
Tests
All 37 tests pass.
Co-authored-by: git-ai mars167@users.noreply.github.com