Fast hybrid code search for AI agents. Pure Rust port of semble.
This is a Rust translation of semble. Same algorithm, same constants, same ranking pipeline. Full credit to MinishLab for designing the search and ranking system. We maintain this port because we need a single-binary, zero-dependency solution that AI agents can install in any sandbox without Python.
AI coding agents (Cursor, Claude Code, Codex, etc.) waste significant tokens re-reading codebases. They grep for a keyword, get 90+ file matches, then cat multiple files hunting for the right context. Sonar gives them the exact code chunks they need in one call: 5 ranked results instead of 90 files to read.
This project is:
- 100% AI-built with minor human steering
- A living mirror of semble. When semble ships improvements, we port them
- Open for agents to maintain. The intention is that LLM agents keep this in sync with upstream
| Metric | Value |
|---|---|
| Index 1500-chunk Rust project | 1.3s |
| Index 5600-chunk Python project | 2.4s |
| Search latency (cached index) | ~130ms avg |
| Binary size | ~15MB |
| Dependencies at runtime | zero (statically linked) |
- Hybrid search. BM25 keyword + Model2Vec semantic, fused with Reciprocal Rank Fusion
- Tree-sitter chunking. Python, Rust, JavaScript, TypeScript, TSX, Go, Java + Markdown heading splits + line fallback
- 290+ file extensions recognized
- Pure Rust. No Python, no ONNX runtime, no C dependencies. Single static binary.
- MCP server. Stdio JSON-RPC, same tool schemas as semble
- Index persistence. OS cache dir with BLAKE3 staleness detection + per-file mtime tracking
- File watching. Automatic re-indexing on changes via `notify`
- `.gitignore` + `.sonarignore` support. Respects your ignore rules
- Git clone support. Index remote repos directly via HTTPS URL
- Graceful fallback. If the embedding model can't be downloaded, search falls back to BM25-only
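
The per-file mtime tracking mentioned in the list above can be pictured as a diff between two snapshots of modification times. The following is a minimal sketch under stated assumptions, not sonar's actual implementation: `ChangeSet` and `diff_mtimes` are hypothetical names, the BLAKE3 hashing step is left out, and how sonar combines hashes with mtimes is not described here.

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::time::SystemTime;

/// Hypothetical summary of what changed between two index snapshots.
#[derive(Debug, Default, PartialEq)]
pub struct ChangeSet {
    pub added: Vec<PathBuf>,
    pub modified: Vec<PathBuf>,
    pub removed: Vec<PathBuf>,
}

/// Compare the mtimes recorded at index time with the mtimes seen now.
/// Any file whose mtime differs, or that appears or disappears, marks
/// the persisted index as stale for that file.
pub fn diff_mtimes(
    recorded: &HashMap<PathBuf, SystemTime>,
    current: &HashMap<PathBuf, SystemTime>,
) -> ChangeSet {
    let mut changes = ChangeSet::default();
    for (path, mtime) in current {
        match recorded.get(path) {
            None => changes.added.push(path.clone()),
            Some(old) if old != mtime => changes.modified.push(path.clone()),
            Some(_) => {} // unchanged
        }
    }
    for path in recorded.keys() {
        if !current.contains_key(path) {
            changes.removed.push(path.clone());
        }
    }
    // HashMap iteration order is arbitrary, so sort for stable output.
    changes.added.sort();
    changes.modified.sort();
    changes.removed.sort();
    changes
}
```

In practice the `current` map would be filled from `std::fs::metadata(path)?.modified()?` for each walked file; only files in the resulting `ChangeSet` would need re-chunking.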
Install from a local checkout:

    cargo install --path crates/cli

Or build from source:

    git clone https://github.com/ooboai/sonar.git
    cd sonar
    cargo build --release
    # Binaries at target/release/sonar and target/release/sonar-mcp

Common commands:

    # Index a codebase
    sonar index /path/to/project

    # Search (hybrid: BM25 + semantic)
    sonar search "auth middleware" -p /path/to/project

    # Search modes
    sonar search "parse config" -p ./project --mode hybrid    # default
    sonar search "parse config" -p ./project --mode bm25      # keyword only (no model needed)
    sonar search "parse config" -p ./project --mode semantic  # vector only

    # Index a remote repo
    sonar index https://github.com/some/repo.git

    # Pre-download embedding model
    sonar download-model

    # Watch for changes and re-index automatically
    sonar watch /path/to/project

    # View token savings stats
    sonar savings

`sonar-mcp` exposes two tools over stdio JSON-RPC:

- `search` - search a codebase with natural language or code queries
- `find_related` - find semantically similar code to a given location
Compatible with any MCP client (Cursor, Claude Desktop, etc.). Add to your MCP config:

    {
      "mcpServers": {
        "sonar": {
          "command": "sonar-mcp",
          "args": []
        }
      }
    }

To use sonar as a Rust library, add it to `Cargo.toml`:

    [dependencies]
    sonar-core = { git = "https://github.com/ooboai/sonar" }

Then build or load a cached index and search it:

    use std::path::Path;

    use sonar_core::index::SonarIndex;

    // Load the cached index for the project, or build it if missing.
    let index = SonarIndex::from_path_cached(Path::new("./my-project"), None)?;
    // Return the 10 best-ranked chunks for the query.
    let results = index.search("error handling", 10);
    for r in &results {
        println!("{} L{}-{} (score: {:.3})",
            r.chunk.file_path, r.chunk.start_line, r.chunk.end_line, r.score);
    }

Repository layout:

    sonar/
    ├── crates/
    │   ├── core/       # Library: chunking, BM25, embeddings, ANN, ranking, persistence
    │   ├── cli/        # CLI binary
    │   └── mcp/        # MCP server binary
    ├── benchmarks/     # Parity tests and token efficiency benchmarks
    └── Cargo.toml      # Workspace
- Walk - discover source files via the `ignore` crate (respects `.gitignore` + `.sonarignore`), detect languages from 290+ extensions
- Chunk - split files into semantic units using tree-sitter (functions, classes, structs, methods) with merge + split for consistent sizes
- Index - build a BM25 inverted index + Model2Vec embedding vectors (stored in a flat index and searched by brute force, so nearest-neighbor results are exact)
- Search - score with both BM25 and cosine similarity, fuse with Reciprocal Rank Fusion
- Rank - apply symbol definition boosts, path penalties, file saturation decay, embedded-symbol boosts
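
The Search step above can be sketched as a brute-force cosine ranking followed by Reciprocal Rank Fusion. This is an illustrative sketch, not sonar's code: the function names are hypothetical, chunks are identified by their position in a slice, and the fusion is plain unweighted RRF, so the alpha weighting and the ranking boosts from the Rank step are left out.

```rust
use std::collections::HashMap;

/// Cosine similarity; returns 0.0 if either vector has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Brute-force ("flat") search: score every chunk embedding against the
/// query and return chunk ids ordered from most to least similar.
pub fn rank_by_cosine(query: &[f32], chunks: &[Vec<f32>]) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = chunks
        .iter()
        .enumerate()
        .map(|(id, v)| (id, cosine(query, v)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(id, _)| id).collect()
}

/// Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank) to a
/// chunk's fused score, where rank is 1-based. Ties are broken by id.
pub fn rrf_fuse(rankings: &[Vec<usize>], k: f32) -> Vec<(usize, f32)> {
    let mut scores: HashMap<usize, f32> = HashMap::new();
    for ranking in rankings {
        for (pos, &id) in ranking.iter().enumerate() {
            *scores.entry(id).or_insert(0.0) += 1.0 / (k + pos as f32 + 1.0);
        }
    }
    let mut fused: Vec<(usize, f32)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&[bm25_ranking, semantic_ranking], 60.0)` fuses a keyword ranking with a semantic one. Because RRF only looks at ranks, BM25 scores and cosine similarities never need to be put on a common scale.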
Uses Model2Vec potion-code-16M via the official model2vec-rs crate. Auto-downloads from HuggingFace Hub on first use (~30MB). Falls back to BM25-only if offline or download fails.
Override the model with the `SONAR_MODEL_NAME` environment variable.
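
A minimal sketch of how such an override is typically resolved. The helper names are hypothetical, the default Hub id `minishlab/potion-code-16M` is an assumption (only the model name is given above), and treating an empty value as unset is a design choice of this sketch rather than documented sonar behavior.

```rust
/// Assumed default model id; only the model name is documented, not the
/// exact HuggingFace Hub repository id.
pub const DEFAULT_MODEL: &str = "minishlab/potion-code-16M";

/// Pick the model name: a non-empty override wins, otherwise the default.
pub fn resolve_model_name(env_value: Option<&str>) -> String {
    match env_value.map(str::trim) {
        Some(v) if !v.is_empty() => v.to_string(),
        _ => DEFAULT_MODEL.to_string(),
    }
}

/// Read SONAR_MODEL_NAME from the process environment.
pub fn model_name_from_env() -> String {
    resolve_model_name(std::env::var("SONAR_MODEL_NAME").ok().as_deref())
}
```

Keeping the decision in `resolve_model_name` and the environment lookup in a thin wrapper makes the logic testable without mutating process-wide state.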
This is a Rust translation of semble by MinishLab (MIT licensed). We intentionally match their:
- Chunking strategy and constants (`DESIRED_CHUNK_LENGTH_CHARS = 1500`)
- BM25 parameters (`k1 = 1.2`, `b = 0.75`)
- RRF fusion (`k = 60`, candidate counts)
- Alpha weighting logic (symbol detection, natural language detection)
- Ranking pipeline (definition boost, path penalty, file saturation)
- MCP tool schemas
When semble updates their algorithm, we update ours. This is not a fork. It's a rewrite in a different language with the explicit goal of staying in sync.
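
To show where the BM25 parameters listed above enter the score, here is a sketch of the per-term BM25 formula with `k1 = 1.2` and `b = 0.75`. It is illustrative only: the function names are hypothetical, and the IDF shown is the common Lucene-style smoothed variant, which may differ from the variant semble and sonar use.

```rust
pub const K1: f64 = 1.2; // term-frequency saturation
pub const B: f64 = 0.75; // strength of document-length normalization

/// Smoothed inverse document frequency (assumed variant):
/// ln(1 + (N - df + 0.5) / (df + 0.5)).
pub fn idf(n_docs: usize, doc_freq: usize) -> f64 {
    let n = n_docs as f64;
    let df = doc_freq as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// BM25 contribution of one query term to one chunk's score.
/// `tf` is the term's frequency in the chunk, `doc_len` the chunk length
/// in tokens, and `avg_doc_len` the mean chunk length in the index.
pub fn bm25_term_score(tf: f64, doc_len: f64, avg_doc_len: f64, idf: f64) -> f64 {
    // norm is 1.0 for an average-length chunk, larger for longer chunks.
    let norm = 1.0 - B + B * doc_len / avg_doc_len;
    idf * tf * (K1 + 1.0) / (tf + K1 * norm)
}
```

A chunk's BM25 score for a query is the sum of `bm25_term_score` over the query terms. With `b = 0.75`, a chunk twice the average length needs a higher term frequency to reach the same score as an average-length one, which is one reason consistent chunk sizes matter.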
- Semble requires Python. AI agents in sandboxes can't always install Python + pip dependencies.
- Single binary distribution. `sonar` is one static binary, no runtime deps.
- Embeddable. Can be linked as a Rust library into other tools (like oobo).
This project is primarily maintained by AI agents with human oversight. Contributions welcome, especially:
- Porting new semble features as they ship
- Adding tree-sitter grammars for more languages
- Performance improvements
- Bug fixes with test cases
- semble by MinishLab - the original Python implementation that this project ports
- Model2Vec by MinishLab - the embedding model and Rust inference library
- oobo - the AI code attribution tool that uses sonar for local code search
MIT. See LICENSE.
This project is a derivative work of semble (also MIT licensed, Copyright (c) 2026 Thomas van Dongen). The algorithm, constants, and ranking logic are ported from semble with full attribution.