Hybrid Retrieval

Lisa edited this page Dec 23, 2025 · 6 revisions

Hybrid Retrieval (v7.4)

CKB v7.4 introduces hybrid retrieval, which combines graph-based ranking with traditional text search to improve search quality. The Results table below quantifies the gains.

Overview

Traditional code search relies on full-text search (FTS), which finds symbols by name but does not model the relationships between them. Hybrid retrieval adds Personalized PageRank (PPR) over the symbol graph to boost results that are structurally related to the top text matches.

Results

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Recall@10 | 62.1% | 100% | +61% |
| MRR | 0.546 | 0.914 | +67% |
| Latency | 29.4ms | 4.5ms | -85% |
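
The Improvement figures are changes relative to the Before value, not percentage-point differences. A minimal Python check of that arithmetic follows; the helper name `relative_change` is illustrative and not part of CKB.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the Results table.
rows = {
    "Recall@10": (62.1, 100.0),
    "MRR": (0.546, 0.914),
    "Latency": (29.4, 4.5),
}

for metric, (before, after) in rows.items():
    print(f"{metric}: {relative_change(before, after):+.0f}%")
# Recall@10: +61%
# MRR: +67%
# Latency: -85%
```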

How It Works

1. Initial Search (FTS)

When you search for a symbol, CKB first uses SQLite FTS5 for fast text matching:

Query: "Engine"
FTS Results: Engine, Engine#logger, Engine#config, EngineMock, ...
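
The FTS step can be approximated with Python's built-in sqlite3 module, provided the underlying SQLite build includes FTS5. This is a sketch under stated assumptions: the table name `symbols_fts`, its single `name` column, and the use of a prefix query are illustrative choices, not CKB's actual schema or query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-column FTS5 table; CKB's real schema may differ.
conn.execute("CREATE VIRTUAL TABLE symbols_fts USING fts5(name)")
conn.executemany(
    "INSERT INTO symbols_fts(name) VALUES (?)",
    [("Engine",), ("Engine#logger",), ("Engine#config",),
     ("EngineMock",), ("Orchestrator",)],
)


def fts_search(query: str, limit: int = 10) -> list[str]:
    """Prefix-match `query` and return names ordered by BM25, best first."""
    # Quote the term so FTS5 treats it as a string, then add * for prefix matching.
    match_expr = '"' + query.replace('"', '""') + '"*'
    rows = conn.execute(
        "SELECT name FROM symbols_fts "
        "WHERE symbols_fts MATCH ? ORDER BY rank LIMIT ?",
        (match_expr, limit),
    )
    return [name for (name,) in rows]


hits = fts_search("Engine")
print(hits)  # the four Engine* symbols; Orchestrator is not matched
```

The default tokenizer splits `Engine#logger` into the tokens `engine` and `logger`, which is why member symbols match alongside `EngineMock`.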

2. Graph-Based Re-ranking (PPR)

CKB then builds a symbol graph from SCIP edges and runs Personalized PageRank:

Seeds: Top FTS hits (Engine, Engine#logger, ...)
Graph: Call edges, reference edges, type edges
PPR: Propagate importance through graph
Output: Re-ranked by graph proximity + FTS score

3. Fusion Scoring

Multiple signals are combined using the following weights:

| Signal | Weight | Description |
|--------|--------|-------------|
| FTS score | 0.40 | Text match quality |
| PPR score | 0.30 | Graph proximity |
| Hotspot | 0.15 | Recent code churn |
| Recency | 0.10 | File modification time |
| Exact match | 0.05 | Name equality bonus |

Final score = weighted sum of normalized signals.
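
The weighted sum can be sketched in Python as follows. The weights come from the table above; the min-max normalization, the signal keys, and the sample values are assumptions made for illustration, since the page does not say how CKB normalizes each signal.

```python
# Weights from the fusion table; they sum to 1.0.
WEIGHTS = {"fts": 0.40, "ppr": 0.30, "hotspot": 0.15, "recency": 0.10, "exact": 0.05}


def min_max_normalize(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; a constant signal contributes nothing."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(candidates: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (name, score) pairs sorted by fused score, best first."""
    names = list(candidates)
    scores = {name: 0.0 for name in names}
    for signal, weight in WEIGHTS.items():
        raw = [candidates[name].get(signal, 0.0) for name in names]
        for name, norm in zip(names, min_max_normalize(raw)):
            scores[name] += weight * norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up raw signal values for three FTS hits.
candidates = {
    "Engine":        {"fts": 9.0, "ppr": 0.30, "hotspot": 4, "recency": 0.5, "exact": 1.0},
    "EngineMock":    {"fts": 7.0, "ppr": 0.05, "hotspot": 1, "recency": 0.9, "exact": 0.0},
    "Engine#logger": {"fts": 8.0, "ppr": 0.20, "hotspot": 2, "recency": 0.1, "exact": 0.0},
}
ranked = fuse(candidates)
print([name for name, _ in ranked])  # ['Engine', 'Engine#logger', 'EngineMock']
```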

Eval Suite

CKB includes an evaluation framework to measure retrieval quality.

Running Eval

# Run built-in tests
ckb eval

# Custom fixtures
ckb eval --fixtures=./my-tests.json

# JSON output
ckb eval --format=json

Test Types

Needle tests - Find at least one expected symbol in top-K:

{
  "id": "find-engine",
  "type": "needle",
  "query": "Engine",
  "expectedSymbols": ["Engine", "query.Engine"],
  "topK": 10
}

Ranking tests - Verify expected symbol is highly ranked:

{
  "id": "engine-first",
  "type": "ranking",
  "query": "query engine",
  "expectedSymbols": ["Engine"],
  "topK": 3
}

Expansion tests - Check graph connectivity:

{
  "id": "engine-connects-backends",
  "type": "expansion",
  "query": "Engine",
  "expectedSymbols": ["Engine", "Orchestrator", "SCIPAdapter"],
  "topK": 20
}

Metrics

  • Recall@K - % of tests where expected symbol was in top-K
  • MRR - Mean Reciprocal Rank (higher = expected found earlier)
  • Latency - Average query time
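
As a rough illustration of the first two metrics, the Python sketch below scores a list of test cases. It assumes a test counts as a hit when any expected symbol appears within its topK, and that reciprocal rank is taken over the full result list; CKB's exact conventions may differ, and the function names are hypothetical.

```python
def first_hit_rank(ranked, expected):
    """Return the 1-based rank of the first expected symbol, or None if absent."""
    expected_set = set(expected)
    for rank, symbol in enumerate(ranked, start=1):
        if symbol in expected_set:
            return rank
    return None


def evaluate(cases):
    """cases: list of (ranked_results, expected_symbols, top_k) tuples."""
    hits = 0
    reciprocal_sum = 0.0
    for ranked, expected, top_k in cases:
        rank = first_hit_rank(ranked, expected)
        if rank is None:
            continue  # a miss adds 0 to both metrics
        reciprocal_sum += 1.0 / rank
        if rank <= top_k:
            hits += 1
    return {"recall_at_k": hits / len(cases), "mrr": reciprocal_sum / len(cases)}


cases = [
    (["Engine", "EngineMock"], ["Engine", "query.Engine"], 10),  # hit at rank 1
    (["EngineMock", "Engine#logger", "Engine"], ["Engine"], 3),  # hit at rank 3
    (["Foo", "Bar"], ["Engine"], 10),                            # miss
]
result = evaluate(cases)
print(result)  # recall_at_k is 2/3, mrr is (1 + 1/3 + 0) / 3 = 4/9
```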

PPR Algorithm

Personalized PageRank computes importance scores relative to seed nodes.

Algorithm

Input:
  - seeds: FTS hit symbol IDs
  - graph: SCIP call/reference edges
  - damping: 0.85 (probability of following edge)
  - iterations: 20 (max power iterations)

Process:
  1. Initialize scores: seeds get 1/n (n = number of seeds), others get 0
  2. Iterate: score[i] = damping * Σ(edge_weight * score[neighbor])
                        + (1-damping) * teleport[i]
     where teleport[i] = 1/n for seeds and 0 otherwise
  3. Stop when converged or max iterations

Output:
  - Ranked nodes with scores
  - Backtracked paths explaining "why"
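
The power iteration above can be sketched in Python as follows. This is not CKB's implementation: it assumes positive outgoing edge weights that are normalized per node, returns the mass of nodes without outgoing edges to the seeds (a common convention the page does not mention), and omits path backtracking. The graph and function name are illustrative.

```python
def personalized_pagerank(graph, seeds, damping=0.85, max_iter=20, tol=1e-6):
    """graph: node -> {neighbor: positive weight} for outgoing edges.

    Returns (node, score) pairs sorted by score, best first.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    nodes = set(graph) | {dst for nbrs in graph.values() for dst in nbrs} | set(seeds)
    teleport = {node: 0.0 for node in nodes}
    for seed in seeds:
        teleport[seed] = 1.0 / len(seeds)
    scores = dict(teleport)  # step 1: seeds get 1/n, others get 0

    for _ in range(max_iter):
        # Assumption: nodes with no outgoing edges hand their mass back to the seeds.
        dangling = sum(scores[n] for n in nodes if not graph.get(n))
        nxt = {n: (1.0 - damping + damping * dangling) * teleport[n] for n in nodes}
        for src, nbrs in graph.items():
            total = sum(nbrs.values())
            for dst, weight in nbrs.items():
                nxt[dst] += damping * scores[src] * weight / total
        delta = sum(abs(nxt[n] - scores[n]) for n in nodes)
        scores = nxt
        if delta < tol:  # step 3: stop early once scores settle
            break
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy graph using symbol names from this page; edges are made up.
graph = {
    "Engine": {"SearchSymbols": 1.0, "Orchestrator": 1.0},
    "SearchSymbols": {"rankSearchResults": 1.0},
    "Orchestrator": {"SCIPAdapter": 1.0},
    "EngineMock": {"Engine": 0.8},
}
ranked = personalized_pagerank(graph, ["Engine"])
print(ranked[0][0])  # Engine
```

Because scores flow along outgoing edges from the seed, `EngineMock` (which points at `Engine` but is not reachable from it) ends with a score of 0 in this sketch, while `Orchestrator` and `SCIPAdapter` are boosted.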

Edge Weights

| Edge Type | Weight | Meaning |
|-----------|--------|---------|
| Call | 1.0 | Function calls function |
| Definition | 0.9 | Reference to definition |
| Reference | 0.8 | General reference |
| Implements | 0.7 | Type implements interface |
| Type-of | 0.6 | Instance of type |
| Same-module | 0.3 | Co-located symbols |
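
One way to turn these weights into something PPR can walk is to build a per-node transition table. The Python sketch below does that under two assumptions that the page does not confirm: when two symbols share several edge types the strongest weight wins, and weights are normalized so each node's outgoing probabilities sum to 1. The lowercase edge-type keys are hypothetical identifiers.

```python
# Weights from the Edge Weights table.
EDGE_WEIGHTS = {
    "call": 1.0,
    "definition": 0.9,
    "reference": 0.8,
    "implements": 0.7,
    "type-of": 0.6,
    "same-module": 0.3,
}


def build_transitions(edges):
    """edges: iterable of (src, dst, edge_type). Returns src -> {dst: probability}."""
    raw = {}
    for src, dst, edge_type in edges:
        weight = EDGE_WEIGHTS[edge_type]
        neighbors = raw.setdefault(src, {})
        # Assumption: keep only the strongest edge between a pair of symbols.
        neighbors[dst] = max(neighbors.get(dst, 0.0), weight)
    return {
        src: {dst: w / sum(neighbors.values()) for dst, w in neighbors.items()}
        for src, neighbors in raw.items()
    }


edges = [
    ("Engine", "SearchSymbols", "call"),
    ("Engine", "Logger", "type-of"),
    ("Engine", "Logger", "same-module"),
]
transitions = build_transitions(edges)
print(transitions["Engine"])  # SearchSymbols gets 1.0/1.6, Logger gets 0.6/1.6
```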

Export Organizer

The exportForLLM tool now includes an organizer step that structures output for better LLM comprehension.

Before (v7.3)

## internal/query/
  ! engine.go
    $ Engine
    # SearchSymbols()
    # GetSymbol()
  ! symbols.go
    # rankSearchResults()

After (v7.4)

## Module Map

| Module | Symbols | Files | Key Exports |
|--------|---------|-------|-------------|
| internal/query | 150 | 12 | Engine, SearchSymbols |
| internal/backends | 80 | 8 | Orchestrator, SCIPAdapter |

## Cross-Module Connections

- internal/query → internal/backends
- internal/mcp → internal/query

## Module Details

### internal/query/

**engine.go**
  $ Engine
  # SearchSymbols() [c=12] ★★
  # GetSymbol() [c=5]

Benefits

  • Module Map - Overview of codebase structure at a glance
  • Cross-Module Connections - Key integration points highlighted
  • Importance Ordering - Most important symbols first
  • Context Efficiency - LLMs understand structure before details
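
To make the organizer idea concrete, here is a small Python sketch that groups symbols by module and emits rows in the Module Map layout shown above. It is illustrative only: the input tuple shape, the `importance` number, the file names for internal/backends, and the choice to list the two highest-scoring symbols as key exports are all assumptions, not the exportForLLM implementation.

```python
from collections import defaultdict


def module_map(symbols, top_n=2):
    """symbols: iterable of (module, file, name, importance). Returns a table string."""
    by_module = defaultdict(list)
    for module, file, name, importance in symbols:
        by_module[module].append((file, name, importance))

    lines = [
        "| Module | Symbols | Files | Key Exports |",
        "|--------|---------|-------|-------------|",
    ]
    # Largest modules first; within a module, most important symbols first.
    for module, entries in sorted(by_module.items(), key=lambda kv: len(kv[1]), reverse=True):
        files = {file for file, _, _ in entries}
        top = sorted(entries, key=lambda entry: entry[2], reverse=True)[:top_n]
        exports = ", ".join(name for _, name, _ in top)
        lines.append(f"| {module} | {len(entries)} | {len(files)} | {exports} |")
    return "\n".join(lines)


# Hypothetical input; importance values and backend file names are made up.
symbols = [
    ("internal/query", "engine.go", "Engine", 20),
    ("internal/query", "engine.go", "SearchSymbols", 12),
    ("internal/query", "engine.go", "GetSymbol", 5),
    ("internal/query", "symbols.go", "rankSearchResults", 3),
    ("internal/backends", "orchestrator.go", "Orchestrator", 9),
    ("internal/backends", "scip.go", "SCIPAdapter", 7),
]
table = module_map(symbols)
print(table)
```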

Configuration

No configuration is required. Hybrid retrieval runs automatically when all of the following hold:

  1. SCIP index is available (ckb index was run)
  2. Search returns more than 3 results
  3. Symbol graph has nodes
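
Read as a single predicate, and assuming all three conditions must hold together, the check might look like the Python sketch below. The function and parameter names are hypothetical.

```python
def should_use_hybrid(has_scip_index: bool, fts_hit_count: int, graph_node_count: int) -> bool:
    """Return True when PPR re-ranking would run under the three listed conditions."""
    return has_scip_index and fts_hit_count > 3 and graph_node_count > 0


print(should_use_hybrid(True, 12, 5000))  # True
print(should_use_hybrid(True, 3, 5000))   # False: exactly 3 results is not "more than 3"
print(should_use_hybrid(False, 12, 0))    # False: no index, empty graph
```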

Disabling PPR

If you need to disable PPR re-ranking (not recommended), set enablePPR to false under queryPolicy in .ckb/config.json:

{
  "queryPolicy": {
    "enablePPR": false
  }
}

Research Basis

Hybrid retrieval is based on 2024-2025 research:

| Paper | Key Insight |
|-------|-------------|
| HippoRAG 2 (ICML 2025) | PPR over knowledge graphs improves associative retrieval |
| CodeRAG (Sep 2025) | Multi-path retrieval + reranking beats single-path |
| GraphCoder (Jun 2024) | Code context graphs for repo-level retrieval |
| GraphRAG surveys | Explicit organizer step improves context packing |

What's NOT Included

Per CKB's "structured over semantic" principle:

| Feature | Why Skipped |
|---------|-------------|
| Embeddings | Adds complexity, PPR sufficient for code navigation |
| Learned reranker | Deterministic scoring works well |
| External vector DB | Violates single-binary principle |

Troubleshooting

Low Recall@K

  1. Index freshness - Run ckb index to rebuild
  2. FTS population - Check ckb status for FTS symbol count
  3. Query specificity - More specific queries work better

Slow Queries

  1. Graph size - Very large codebases may need graph pruning
  2. PPR iterations - Default 20 is usually sufficient
  3. Cache - Subsequent queries benefit from caching

Debugging

# Check index status
ckb status

# Run diagnostics
ckb doctor

# Verbose eval output
ckb eval --verbose
