sonar

Fast hybrid code search for AI agents. Pure Rust port of semble.

Sonar is a Rust translation of semble: same algorithm, same constants, same ranking pipeline. Full credit goes to MinishLab for designing the search and ranking system. We maintain this port because we need a single binary with no runtime dependencies that AI agents can install in any sandbox without Python.

Why this exists

AI coding agents (Cursor, Claude Code, Codex, etc.) waste significant tokens re-reading codebases. They grep for a keyword, get 90+ file matches, then cat multiple files hunting for the right context. Sonar gives them the exact code chunks they need in one call: 5 ranked results instead of 90 files to read.

This project is:

100% AI-built, with minor human steering
A living mirror of semble: when semble ships improvements, we port them
Open for agents to maintain: the intention is that LLM agents keep this in sync with upstream

Performance

Metric	Value
Index 1500-chunk Rust project	1.3s
Index 5600-chunk Python project	2.4s
Search latency (cached index)	~130ms avg
Binary size	~15MB
Dependencies at runtime	zero (statically linked)

Features

Hybrid search. BM25 keyword + Model2Vec semantic, fused with Reciprocal Rank Fusion
Tree-sitter chunking. Python, Rust, JavaScript, TypeScript, TSX, Go, Java + Markdown heading splits + line fallback
290+ file extensions recognized
Pure Rust. No Python, no ONNX runtime, no C dependencies. Single static binary.
MCP server. Stdio JSON-RPC, same tool schemas as semble
Index persistence. OS cache dir with BLAKE3 staleness detection + per-file mtime tracking
File watching. Automatic re-indexing on changes via notify
.gitignore + .sonarignore support. Respects your ignore rules
Git clone support. Index remote repos directly via HTTPS URL
Graceful fallback. If the embedding model can't be downloaded, sonar falls back to BM25-only search

Install

From a local checkout of the repository:

cargo install --path crates/cli

Or clone the repository and build the release binaries directly:

git clone https://github.com/ooboai/sonar.git
cd sonar
cargo build --release
# Binaries at target/release/sonar and target/release/sonar-mcp

Usage

CLI

# Index a codebase
sonar index /path/to/project

# Search (hybrid: BM25 + semantic)
sonar search "auth middleware" -p /path/to/project

# Search modes
sonar search "parse config" -p ./project --mode hybrid    # default
sonar search "parse config" -p ./project --mode bm25      # keyword only (no model needed)
sonar search "parse config" -p ./project --mode semantic   # vector only

# Index a remote repo
sonar index https://github.com/some/repo.git

# Pre-download embedding model
sonar download-model

# Watch for changes and re-index automatically
sonar watch /path/to/project

# View token savings stats
sonar savings

MCP Server

sonar-mcp

Exposes two tools over stdio JSON-RPC:

search - search a codebase with natural language or code queries
find_related - find semantically similar code to a given location

Compatible with any MCP client (Cursor, Claude Desktop, etc.). Add to your MCP config:

{
  "mcpServers": {
    "sonar": {
      "command": "sonar-mcp",
      "args": []
    }
  }
}
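For a sense of what a client sends to sonar-mcp over stdio, here is a minimal sketch that builds a JSON-RPC 2.0 `tools/call` request for the search tool by hand. The `tools/call` method and the `name`/`arguments` layout follow the MCP specification, but the argument names `query` and `path` are hypothetical; the real schema is whatever the server reports in its `tools/list` response. The helper names `json_escape` and `search_request` are illustrative only, and the `initialize` handshake a real client performs first is omitted.

```rust
/// Escape characters that must not appear raw inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a single-line `tools/call` request for the `search` tool.
/// ASSUMPTION: the argument names `query` and `path` are hypothetical;
/// check the server's `tools/list` output for the real schema.
fn search_request(id: u64, query: &str, path: &str) -> String {
    format!(
        "{{\"jsonrpc\":\"2.0\",\"id\":{},\"method\":\"tools/call\",\"params\":{{\"name\":\"search\",\"arguments\":{{\"query\":\"{}\",\"path\":\"{}\"}}}}}}",
        id,
        json_escape(query),
        json_escape(path)
    )
}
```

A client would write the returned string followed by a newline to the server's stdin and read the response from its stdout. In practice an MCP client library or a JSON crate would replace the hand-rolled escaping.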

As a Library

[dependencies]
sonar-core = { git = "https://github.com/ooboai/sonar" }

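use std::path::Path;

use sonar_core::index::SonarIndex;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the index for this project, reusing the on-disk cache when possible.
    let index = SonarIndex::from_path_cached(Path::new("./my-project"), None)?;
    // Ask for the 10 best-ranked chunks for the query.
    let results = index.search("error handling", 10);
    for r in &results {
        println!("{} L{}-{} (score: {:.3})",
            r.chunk.file_path, r.chunk.start_line, r.chunk.end_line, r.score);
    }
    Ok(())
}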

Architecture

sonar/
├── crates/
│   ├── core/       # Library: chunking, BM25, embeddings, ANN, ranking, persistence
│   ├── cli/        # CLI binary
│   └── mcp/        # MCP server binary
├── benchmarks/     # Parity tests and token efficiency benchmarks
└── Cargo.toml      # Workspace

How it works

Walk - discover source files via ignore crate (respects .gitignore + .sonarignore), detect languages from 290+ extensions
Chunk - split files into semantic units using tree-sitter (functions, classes, structs, methods) with merge + split for consistent sizes
Index - build a BM25 inverted index and Model2Vec embedding vectors, stored in a flat index that is searched by brute force (an exact scan over all vectors)
Search - score with both BM25 and cosine similarity, fuse with Reciprocal Rank Fusion
Rank - apply symbol definition boosts, path penalties, file saturation decay, embedded-symbol boosts
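The Search step can be illustrated with a minimal sketch: a brute-force cosine scan over the chunk embeddings, followed by Reciprocal Rank Fusion of the semantic and BM25 rankings. It uses k = 60 as listed under "Relation to semble". The function names (`cosine`, `flat_search`, `rrf_fuse`) and the use of `usize` chunk IDs are assumptions made for illustration, not sonar's actual API, and the alpha weighting and the boosts from the Rank step are left out.

```rust
use std::collections::HashMap;

/// RRF constant; the README lists k = 60.
const RRF_K: f64 = 60.0;

/// Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

/// Brute-force flat search: score every chunk embedding and return the
/// indices of the `top_k` most similar chunks, best first.
fn flat_search(query: &[f32], embeddings: &[Vec<f32>], top_k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = embeddings
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine(query, e)))
        .collect();
    // Highest similarity first; ties broken by chunk index.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(&b.0)));
    scored.into_iter().take(top_k).map(|(i, _)| i).collect()
}

/// Fuse several best-first rankings of chunk IDs into one ranking.
/// Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
fn rrf_fuse(rankings: &[Vec<usize>]) -> Vec<(usize, f64)> {
    let mut scores: HashMap<usize, f64> = HashMap::new();
    for ranking in rankings {
        for (i, &chunk_id) in ranking.iter().enumerate() {
            *scores.entry(chunk_id).or_insert(0.0) += 1.0 / (RRF_K + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(usize, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by chunk ID for determinism.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(&b.0)));
    fused
}
```

Because RRF only looks at ranks, the BM25 scores and cosine similarities never need to be put on a common scale. A chunk that appears near the top of both lists beats one that tops only a single list.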

Embedding Model

Uses Model2Vec potion-code-16M via the official model2vec-rs crate. The model is downloaded automatically from the Hugging Face Hub on first use (~30MB). If sonar is offline or the download fails, it falls back to BM25-only search.

Override the model with the SONAR_MODEL_NAME environment variable.
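An override like this is typically resolved with a lookup that falls back to a built-in default. The sketch below shows one way to do it, under stated assumptions: `DEFAULT_MODEL` is a placeholder (the identifier sonar actually passes to model2vec-rs may differ), the function names are hypothetical, and treating an empty value as unset is a design choice of this sketch rather than documented sonar behavior.

```rust
use std::env;

/// ASSUMPTION: placeholder default. The real identifier sonar uses may
/// differ, for example by including an organization prefix.
const DEFAULT_MODEL: &str = "potion-code-16M";

/// Pick the override if it is present and non-blank, else the default.
fn model_name_from(override_value: Option<String>) -> String {
    match override_value {
        Some(name) if !name.trim().is_empty() => name,
        _ => DEFAULT_MODEL.to_string(),
    }
}

/// Resolve the model name from the SONAR_MODEL_NAME environment variable.
fn resolve_model_name() -> String {
    model_name_from(env::var("SONAR_MODEL_NAME").ok())
}
```

Keeping the decision in `model_name_from` separates it from the environment read, so it can be tested without mutating process-wide state.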

Relation to semble

This is a Rust translation of semble by MinishLab (MIT licensed). We intentionally match their:

Chunking strategy and constants (DESIRED_CHUNK_LENGTH_CHARS = 1500)
BM25 parameters (k1 = 1.2, b = 0.75)
RRF fusion (k = 60, candidate counts)
Alpha weighting logic (symbol detection, natural language detection)
Ranking pipeline (definition boost, path penalty, file saturation)
MCP tool schemas

When semble updates its algorithm, we update ours. This is not a fork. It's a rewrite in a different language with the explicit goal of staying in sync.
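To make the BM25 parameters above concrete, here is a minimal sketch of the per-term BM25 score with k1 = 1.2 and b = 0.75. The IDF formula shown is the common smoothed (Lucene-style) variant and is an assumption; semble and sonar may compute IDF differently. The function names are illustrative, not sonar's API.

```rust
/// BM25 parameters as listed in the README.
const K1: f64 = 1.2;
const B: f64 = 0.75;

/// Inverse document frequency.
/// ASSUMPTION: Lucene-style smoothed IDF; the variant used by semble and
/// sonar is not stated in the README.
fn idf(total_docs: usize, docs_with_term: usize) -> f64 {
    let n = total_docs as f64;
    let df = docs_with_term as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// Score contribution of one query term for one chunk.
/// `tf` is the term's frequency in the chunk; lengths are in tokens.
fn bm25_term_score(tf: f64, doc_len: f64, avg_doc_len: f64, term_idf: f64) -> f64 {
    // b controls how strongly long chunks are penalized.
    let length_norm = 1.0 - B + B * (doc_len / avg_doc_len);
    // k1 controls how quickly repeated occurrences saturate.
    term_idf * (tf * (K1 + 1.0)) / (tf + K1 * length_norm)
}
```

A chunk's BM25 score for a query is the sum of `bm25_term_score` over the query terms. With these parameters the per-term factor never exceeds k1 + 1 = 2.2 times the IDF, however often the term repeats.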

Why not just use semble directly?

Semble requires Python. AI agents in sandboxes can't always install Python + pip dependencies.
Single binary distribution. sonar is one static binary, no runtime deps.
Embeddable. Can be linked as a Rust library into other tools (like oobo).

Contributing

This project is primarily maintained by AI agents with human oversight. Contributions welcome, especially:

Porting new semble features as they ship
Adding tree-sitter grammars for more languages
Performance improvements
Bug fixes with test cases

Credits

semble by MinishLab - the original Python implementation that this project ports
Model2Vec by MinishLab - the embedding model and Rust inference library
oobo - the AI code attribution tool that uses sonar for local code search

License

MIT. See LICENSE.
