Merged
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -149,7 +149,7 @@ system handles MCP registration, hooks, and skills declaratively via:
- **Merkle tree for diffs**: Avoid re-indexing unchanged code
- **Model name in DB path**: Different models → separate indexes (SHA-256 hash
of path + model name)
- **5-layer file filtering**: SkipDirs → .gitignore → .lumenignore →
- **6-layer file filtering**: SkipDirs → SkipFiles → .gitignore → .lumenignore →
.gitattributes → extension
- **Chunk splitting at line boundaries**: Oversized chunks split at
`LUMEN_MAX_CHUNK_TOKENS` (512 default)
34 changes: 30 additions & 4 deletions README.md
@@ -66,8 +66,8 @@ Two skills are also available: `/lumen:doctor` (health check) and
via Merkle tree diffing
- **Incremental updates** — re-indexes only what changed; large codebases
re-index in seconds after the first run
- **12 language families** — Go, Python, TypeScript, JavaScript, Rust, Ruby,
Java, PHP, C/C++, Markdown, YAML, JSON
- **14 language families** — Go, Python, TypeScript, JavaScript, Rust, Ruby,
Java, PHP, C/C++, Markdown, YAML, JSON, TOML, Go module
- **Zero cloud** — embeddings stay on your machine; no data leaves your network
- **Ollama and LM Studio** — works with either local embedding backend

@@ -107,7 +107,7 @@ reproduce instructions.

## Supported Languages

Supports **12 language families** with semantic chunking:
Supports **14 language families** with semantic chunking:

| Language | Parser | Extensions | Status |
| ---------------- | ----------- | ----------------------------------------- |-------------------------------------|
@@ -122,7 +122,8 @@ Supports **12 language families** with semantic chunking:
| C / C++ | tree-sitter | `.c`, `.h`, `.cpp`, `.cc`, `.cxx`, `.hpp` | Supported |
| Markdown / MDX | tree-sitter | `.md`, `.mdx` | Supported |
| YAML | tree-sitter | `.yaml`, `.yml` | Supported |
| JSON | tree-sitter | `.json` | Supported |
| JSON / TOML | structured | `.json`, `.toml` | Supported |
| Go module | structured | `.mod` | Supported |

Go uses the native Go AST parser for the most precise chunks. Formats marked
`structured` in the table use the structured chunker; the remaining languages
use tree-sitter grammars.
@@ -161,6 +162,31 @@ Dimensions and context length are configured automatically per model:
Switching models creates a separate index automatically. The model name is part
of the database path hash, so different models never collide.
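
As a rough illustration of why two models never share an index, the sketch below hashes the project path together with the model name using SHA-256, as the CLAUDE.md note describes. The helper name `indexKey`, the NUL separator, and the example path and model names are assumptions made for illustration; Lumen's actual key layout is not shown in this diff.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// indexKey is a hypothetical helper. It hashes the project path and the model
// name together so that changing either one yields a different key. The NUL
// separator is an assumption, added so ("a", "bc") and ("ab", "c") differ.
func indexKey(projectPath, model string) string {
	sum := sha256.Sum256([]byte(projectPath + "\x00" + model))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Placeholder path and model names, not values taken from Lumen.
	a := indexKey("/home/me/project", "model-a")
	b := indexKey("/home/me/project", "model-b")
	fmt.Println("model-a key:", a)
	fmt.Println("model-b key:", b)
	fmt.Println("distinct:", a != b)
}
```

Because the key is derived from both inputs, switching models selects a different database file rather than overwriting the existing one.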

## Controlling What Gets Indexed

Lumen filters files through six layers: built-in directory skips → built-in
lock file skips → `.gitignore` → `.lumenignore` → `.gitattributes`
(`linguist-generated`) → supported file extension. Only files that pass all
layers are indexed.

**`.lumenignore`** uses `.gitignore` syntax. Place it in your project root (or
any subdirectory) to exclude files that aren't in `.gitignore` but are noise for
code search — generated protobuf files, test snapshots, vendored data, etc.

<details>
<summary>Built-in skips (always excluded)</summary>

**Directories:** `.git`, `node_modules`, `vendor`, `dist`, `.cache`, `.venv`,
`__pycache__`, `target`, `.gradle`, `_build`, `deps`, `.idea`, `.vscode`,
`.next`, `.nuxt`, `.build`, `.output`, `bower_components`, `.bundle`, `.tox`,
`.eggs`, `testdata`, `.hg`, `.svn`

**Lock files:** `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `bun.lock`,
`bun.lockb`, `go.sum`, `composer.lock`, `poetry.lock`, `Pipfile.lock`,
`Gemfile.lock`, `Cargo.lock`, `pubspec.lock`, `mix.lock`, `flake.lock`,
`packages.lock.json`

</details>

## Database Location

Index databases are stored outside your project:
5 changes: 4 additions & 1 deletion internal/chunker/languages.go
@@ -40,7 +40,8 @@ var supportedExtensions = []string{
".cpp", ".cc", ".cxx", ".hpp",
".php",
".md", ".mdx",
".yaml", ".yml", ".json",
".yaml", ".yml", ".json", ".toml",
".mod",
}

// SupportedExtensions returns the file extensions indexed by DefaultLanguages.
@@ -184,5 +185,7 @@ func DefaultLanguages(maxChunkTokens int) map[string]Chunker {
".yaml": structured,
".yml": structured,
".json": structured,
".toml": structured,
".mod": structured,
}
}
2 changes: 2 additions & 0 deletions internal/chunker/treesitter_test.go
@@ -568,6 +568,8 @@ func TestDefaultLanguages_AllExtensionsPresent(t *testing.T) {
".yaml": []byte("foo: bar\n"),
".yml": []byte("foo: bar\n"),
".json": []byte(`{"foo": "bar"}`),
".toml": []byte("[package]\nname = \"mymod\"\n"),
".mod": []byte("module example.com/mymod\n\ngo 1.26\n"),
}

langs := chunker.DefaultLanguages(512)
50 changes: 42 additions & 8 deletions internal/merkle/ignore.go
@@ -24,6 +24,35 @@ import (
ignore "github.com/sabhiram/go-gitignore"
)

// SkipFiles is the canonical set of file basenames that are always skipped
// during tree building. These are typically large generated or binary lock files
// that add noise without indexing value.
var SkipFiles = map[string]bool{
// JS/Node package managers
"package-lock.json": true,
"yarn.lock": true,
"pnpm-lock.yaml": true,
"bun.lock": true, "bun.lockb": true,
// Go
"go.sum": true,
// PHP
"composer.lock": true,
// Python
"poetry.lock": true, "Pipfile.lock": true,
// Ruby
"Gemfile.lock": true,
// Rust
"Cargo.lock": true,
// Dart/Flutter
"pubspec.lock": true,
// Elixir
"mix.lock": true,
// Nix
"flake.lock": true,
// .NET/NuGet
"packages.lock.json": true,
}

// SkipDirs is the canonical set of directory basenames that are always skipped
// during tree building, regardless of .gitignore rules.
var SkipDirs = map[string]bool{
@@ -108,10 +137,14 @@ func (t *IgnoreTree) loadDir(dirRel string) *dirIgnore {
return d
}

// shouldSkip implements SkipFunc. It checks the five filtering layers:
// 1. SkipDirs, 2. .gitignore, 3. .lumenignore, 4. .gitattributes, 5. extension.
// shouldSkip implements SkipFunc. It checks the six filtering layers:
// 1. SkipDirs, 2. SkipFiles, 3. .gitignore, 4. .lumenignore, 5. .gitattributes, 6. extension.
func (t *IgnoreTree) shouldSkip(relPath string, isDir bool) bool {
if isDir && SkipDirs[filepath.Base(relPath)] {
base := filepath.Base(relPath)
if isDir && SkipDirs[base] {
return true
}
if !isDir && SkipFiles[base] {
return true
}

@@ -213,12 +246,13 @@ func parseLinguistGenerated(path string) *ignore.GitIgnore {
return ignore.CompileIgnoreLines(patterns...)
}

// MakeSkip returns a SkipFunc that layers five filters:
// MakeSkip returns a SkipFunc that layers six filters:
// 1. SkipDirs — map lookup on directory basename (cheapest check)
// 2. .gitignore — root + nested, hierarchical matching
// 3. .lumenignore — root + nested, hierarchical matching
// 4. .gitattributes — linguist-generated patterns, root + nested
// 5. Extension filter — only index files whose extension is in exts
// 2. SkipFiles — map lookup on file basename (lock files and other noise)
// 3. .gitignore — root + nested, hierarchical matching
// 4. .lumenignore — root + nested, hierarchical matching
// 5. .gitattributes — linguist-generated patterns, root + nested
// 6. Extension filter — only index files whose extension is in exts
//
// Ignore files are discovered lazily as the walk proceeds.
func MakeSkip(rootDir string, exts []string) SkipFunc {
19 changes: 19 additions & 0 deletions internal/merkle/ignore_test.go
@@ -82,6 +82,25 @@ func TestMakeSkip_NegationPattern(t *testing.T) {
}
}

func TestMakeSkip_HardcodedFiles(t *testing.T) {
dir := t.TempDir()
skip := MakeSkip(dir, []string{".go", ".json", ".yaml"})

for name := range SkipFiles {
if !skip(name, false) {
t.Errorf("expected hardcoded file %q to be skipped", name)
}
}

// Regular files with same extensions should pass
if skip("package.json", false) {
t.Error("expected package.json to pass")
}
if skip("main.go", false) {
t.Error("expected main.go to pass")
}
}

func TestMakeSkip_HardcodedDirs(t *testing.T) {
dir := t.TempDir()
skip := MakeSkip(dir, []string{".go"})
5 changes: 3 additions & 2 deletions internal/merkle/merkle.go
@@ -52,10 +52,11 @@ func MakeExtSkip(exts []string) SkipFunc {
extSet[ext] = true
}
return func(relPath string, isDir bool) bool {
base := filepath.Base(relPath)
if isDir {
return SkipDirs[filepath.Base(relPath)]
return SkipDirs[base]
}
return !extSet[filepath.Ext(relPath)]
return SkipFiles[base] || !extSet[filepath.Ext(relPath)]
}
}
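
To see the effect of the new basename check in isolation, here is a minimal, self-contained sketch that follows the shape of `MakeExtSkip` after this change. The lowercase names `skipDirs`, `skipFiles`, and `makeExtSkip` are stand-ins defined only for this example, and the maps are trimmed to a few entries; the real sets in `internal/merkle` are longer.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Trimmed stand-ins for the SkipDirs and SkipFiles sets (illustrative only).
var skipDirs = map[string]bool{".git": true, "node_modules": true, "vendor": true}
var skipFiles = map[string]bool{"go.sum": true, "package-lock.json": true, "Cargo.lock": true}

// makeExtSkip returns a function reporting whether a path should be skipped.
// Directories are matched by basename against skipDirs. Files are skipped when
// their basename is in skipFiles or their extension is not in exts.
func makeExtSkip(exts []string) func(relPath string, isDir bool) bool {
	extSet := make(map[string]bool, len(exts))
	for _, ext := range exts {
		extSet[ext] = true
	}
	return func(relPath string, isDir bool) bool {
		base := filepath.Base(relPath)
		if isDir {
			return skipDirs[base]
		}
		return skipFiles[base] || !extSet[filepath.Ext(relPath)]
	}
}

func main() {
	skip := makeExtSkip([]string{".go", ".json", ".mod", ".toml"})
	paths := []string{
		"go.mod", "go.sum",
		"web/package.json", "web/package-lock.json",
		"Cargo.toml", "Cargo.lock",
	}
	for _, p := range paths {
		fmt.Printf("%-24s skipped=%v\n", p, skip(p, false))
	}
}
```

The case worth noting is `web/package-lock.json`: its `.json` extension is in the allowed set, so only the basename lookup keeps it out of the index, while `web/package.json` still passes.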
