A CLI tool to index a git repository by chunking and embedding text files into a LanceDB vector database.
- File Discovery: Scans directories recursively with .gitignore support
- Smart Filtering: Skips binary files and respects custom ignore patterns
- Incremental Indexing: Git-based diff mode for fast updates (only process changed files)
- Intelligent Mode: Auto-detects best indexing strategy (full vs diff)
- Chunking: Uses Mastra's recursive chunking strategy for optimal text segmentation
- Embedding: Generates embeddings via LM Studio's OpenAI-compatible API
- Vector Storage: Stores embeddings in LanceDB for fast similarity search
- GraphRAG (Optional): Build knowledge graphs for relationship-based retrieval
- Resume Support: Tracks processed files and skips unchanged content
- Progress Tracking: Real-time progress bar with colorized output
- Error Handling: Continues processing on errors with detailed warnings
npm install
npm run build

Or use directly with tsx:
npm run dev -- [options]

embedder \
--dir /path/to/repository \
--output /path/to/lancedb \
--base-url http://localhost:1234/v1 \
--model text-embedding-qwen3-embedding-0.6b

- -d, --dir <path> - Directory to index (the git repository)
- -o, --output <path> - Output path for LanceDB database
- -u, --base-url <url> - Base URL for LM Studio (e.g., http://localhost:1234/v1)
- -m, --model <name> - Embedding model name (e.g., text-embedding-qwen3-embedding-0.6b)
- -t, --table-name <name> - LanceDB table name (default: embeddings)
- --dimensions <number> - Embedding dimension size (default: 1024)
- -i, --ignore <pattern> - Glob patterns to ignore (can be specified multiple times)
- -b, --batch-size <number> - Number of embeddings to process in a batch (default: 10)
- --mode <type> - Indexing mode: 'full' (complete re-index), 'diff' (incremental), 'intelligent' (auto-detect) (default: full)
- --from-commit <hash> - Git commit hash to diff from (overrides stored state, only used with --mode diff)
- --enable-graph - Enable GraphRAG knowledge graph creation (default: false)
- --graph-threshold <number> - Similarity threshold for graph edges, 0.0-1.0 (default: 0.7)
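
To make the roles of --base-url, --model, and --batch-size concrete, here is a minimal sketch of a batched request against an OpenAI-compatible /embeddings endpoint. The helper names (toBatches, embedAll) and the error handling are assumptions for illustration, not the tool's actual internals; only the request and response shape follow the OpenAI-compatible API that LM Studio exposes.

```typescript
// Shape of the relevant part of an OpenAI-compatible embeddings response.
interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

// Split items into groups of at most `batchSize` (cf. --batch-size).
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical helper: embed all texts, one HTTP request per batch.
async function embedAll(
  baseUrl: string,
  model: string,
  texts: string[],
  batchSize = 10,
): Promise<number[][]> {
  const vectors: number[][] = [];
  const endpoint = `${baseUrl.replace(/\/+$/, "")}/embeddings`;
  for (const batch of toBatches(texts, batchSize)) {
    const res = await fetch(endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, input: batch }),
    });
    if (!res.ok) {
      throw new Error(`Embedding request failed with status ${res.status}`);
    }
    const json = (await res.json()) as EmbeddingResponse;
    // Sort by index so vectors line up with the input order.
    const sorted = [...json.data].sort((a, b) => a.index - b.index);
    vectors.push(...sorted.map((d) => d.embedding));
  }
  return vectors;
}
```

With the defaults above, a call would look like embedAll("http://localhost:1234/v1", "text-embedding-qwen3-embedding-0.6b", chunks), assuming LM Studio is running with that model loaded.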
Full mode indexes every file in the repository. It uses content hashing to skip unchanged files.
embedder \
-d . \
-o ./embeddings \
-u http://localhost:1234/v1 \
-m text-embedding-qwen3-embedding-0.6b \
--mode full

Diff mode processes only the files that changed in git since the last index, which makes updates much faster.
embedder \
-d . \
-o ./embeddings \
-u http://localhost:1234/v1 \
-m text-embedding-qwen3-embedding-0.6b \
--mode diff

Requirements:
- Git repository
- Previous index with stored commit hash
- Only indexes committed changes (warns about uncommitted files)
Features:
- Processes added files ✓
- Processes modified files ✓
- Removes deleted files from index ✓
- Handles renamed files (delete old + add new) ✓
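
One way to derive these four cases is to parse the output of `git diff --name-status <from-commit> HEAD`, where each line holds a status code and one or two tab-separated paths. The sketch below is an assumption about how such a step could look, not the tool's actual code; parseNameStatus and DiffResult are hypothetical names.

```typescript
interface DiffResult {
  toIndex: string[];  // files to (re)chunk and embed
  toRemove: string[]; // files whose entries should be deleted from the index
}

// Parse `git diff --name-status` output into index and remove lists.
function parseNameStatus(output: string): DiffResult {
  const toIndex: string[] = [];
  const toRemove: string[] = [];
  for (const line of output.split("\n")) {
    if (line.trim() === "") continue;
    const [status, ...paths] = line.split("\t");
    const code = status[0];
    if (code === "A" || code === "M") {
      // Added or modified: index the file.
      toIndex.push(paths[0]);
    } else if (code === "D") {
      // Deleted: remove it from the index.
      toRemove.push(paths[0]);
    } else if (code === "R") {
      // Renamed (e.g. "R100\told\tnew"): delete old + add new.
      toRemove.push(paths[0]);
      toIndex.push(paths[1]);
    }
    // Other status codes are ignored in this sketch.
  }
  return { toIndex, toRemove };
}
```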
Intelligent mode automatically chooses between full and diff mode based on the state of the repository and the previous index.
embedder \
-d . \
-o ./embeddings \
-u http://localhost:1234/v1 \
-m text-embedding-qwen3-embedding-0.6b \
--mode intelligent

Decision Logic:
- Not a git repo? → Use full mode
- No previous commit hash? → Use full mode (first run)
- Commit hash unchanged? → Already indexed, skip
- Otherwise → Use diff mode
Best for: CI/CD pipelines, automated scripts, scheduled jobs
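
The decision logic above can be written as a small pure function. This is a sketch under stated assumptions: the RepoContext fields and the chooseMode name are hypothetical, and the real tool may gather this information differently.

```typescript
type Decision = "full" | "diff" | "skip";

// Hypothetical inputs describing the repository and the stored state.
interface RepoContext {
  isGitRepo: boolean;
  lastIndexedCommit?: string; // commit hash stored by a previous run, if any
  headCommit?: string;        // current HEAD commit hash, if a git repo
}

function chooseMode(ctx: RepoContext): Decision {
  if (!ctx.isGitRepo) return "full";          // not a git repo
  if (!ctx.lastIndexedCommit) return "full";  // first run, nothing to diff from
  if (ctx.lastIndexedCommit === ctx.headCommit) return "skip"; // already indexed
  return "diff";                              // index only what changed
}
```

Keeping the decision free of I/O makes it easy to unit test each of the four branches.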
# Index the current directory with custom settings
embedder \
-d . \
-o ./embeddings \
-u http://localhost:1234/v1 \
-m text-embedding-qwen3-embedding-0.6b \
-t my_embeddings \
--dimensions 768 \
-i "*.test.ts" \
-i "*.spec.ts" \
-b 20
# Incremental indexing with diff mode
embedder \
-d . \
-o ./embeddings \
-u http://localhost:1234/v1 \
-m text-embedding-qwen3-embedding-0.6b \
--mode diff
# With GraphRAG enabled for relationship-based retrieval
embedder \
-d . \
-o ./embeddings \
-u http://localhost:1234/v1 \
-m text-embedding-qwen3-embedding-0.6b \
--enable-graph \
--graph-threshold 0.7

- Discovery: Scans the specified directory for text files, respecting .gitignore rules
- Filtering: Skips binary files, images, and files matching ignore patterns
- Resume Check: Compares content hashes to skip already-processed files
- Smart Chunking: Uses appropriate strategy based on file type:
  - Markdown (.md, .mdx): semantic-markdown strategy for better structure understanding
  - HTML (.html, .htm): HTML strategy preserving document structure
  - JSON (.json): JSON-aware chunking
  - Other files: Recursive strategy with 512-char chunks, 50-char overlap
- Embedding: Generates embeddings using the specified model via LM Studio
- Storage: Stores embeddings and metadata in LanceDB with configurable table name and dimensions
- State Tracking: Saves progress in .embedder-state.json in the output directory
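
The per-file-type choice in the Smart Chunking step can be sketched as a lookup on the file extension. The strategy identifiers "html", "json", and "recursive" and the chunkConfigFor name are assumptions for illustration. Only the extensions, the semantic-markdown name, and the 512/50 sizes come from the description above, and the actual options passed to Mastra may differ.

```typescript
// Hypothetical config type; the real chunker options may be named differently.
type ChunkConfig =
  | { strategy: "semantic-markdown" }
  | { strategy: "html" }
  | { strategy: "json" }
  | { strategy: "recursive"; size: number; overlap: number };

function chunkConfigFor(filePath: string): ChunkConfig {
  // Take everything from the last dot and compare case-insensitively.
  const dot = filePath.lastIndexOf(".");
  const ext = dot === -1 ? "" : filePath.slice(dot).toLowerCase();
  switch (ext) {
    case ".md":
    case ".mdx":
      return { strategy: "semantic-markdown" };
    case ".html":
    case ".htm":
      return { strategy: "html" };
    case ".json":
      return { strategy: "json" };
    default:
      // All other text files: 512-char chunks with 50-char overlap.
      return { strategy: "recursive", size: 512, overlap: 50 };
  }
}
```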
The tool creates a .embedder-state.json file in the output directory to track:
- Processed file paths
- Content hashes
- Number of chunks per file
- Processing timestamps
- Last indexed git commit hash (for diff mode)
This enables resume functionality: on subsequent runs, only new or modified files are processed.
Diff mode benefits:
- Uses git commit history instead of scanning all files
- Automatically removes deleted files from the index
- Tracks the exact commit that was last indexed
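
As an illustration of how the state file supports resuming, the sketch below models a possible state shape and a content-hash check. The field names, the EmbedderState type, and the choice of SHA-256 are assumptions. The actual layout of .embedder-state.json and the hash algorithm are not specified here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape mirroring the tracked items listed above.
interface FileState {
  hash: string;        // content hash at the time of processing
  chunks: number;      // number of chunks produced for the file
  processedAt: string; // ISO timestamp
}

interface EmbedderState {
  files: Record<string, FileState>; // keyed by file path
  lastCommit?: string;              // last indexed git commit (diff mode)
}

// Assumption: SHA-256 over the file content as a hex string.
function contentHash(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// A file needs processing if it is new or its content hash changed.
function needsProcessing(state: EmbedderState, filePath: string, content: string): boolean {
  const previous = state.files[filePath];
  return previous === undefined || previous.hash !== contentHash(content);
}
```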
The tool displays:
- Initialization status
- File discovery results
- Real-time progress bar
- Summary with statistics:
  - Files processed
  - Files skipped (unchanged)
  - Total chunks created
  - Knowledge graph nodes and edges (if GraphRAG enabled)
- Errors and warnings
- LanceDB database - Vector embeddings stored in the output directory
- .embedder-state.json - Processing state for resume functionality
- graph-data/ - Persisted knowledge graph (if --enable-graph is used)
  - Folder-based storage with batched chunks and binary embeddings
  - Scalable for large repositories
When --enable-graph is enabled, the tool creates a knowledge graph in addition to the vector store. This enables relationship-based retrieval for RAG applications.
See GRAPHRAG.md for detailed documentation on:
- How GraphRAG works
- Using the persisted graph data in your applications
- GraphStore API reference
- Example RAG workflows
- Performance considerations
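
To give a sense of what --graph-threshold controls, here is a minimal sketch that connects two chunks with an edge when their embeddings are similar enough. It assumes cosine similarity and a simple pairwise comparison; the tool's actual graph construction is documented in GRAPHRAG.md and may differ. The names cosineSimilarity, buildEdges, and Edge are hypothetical.

```typescript
// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

interface Edge {
  source: number; // index of the first chunk
  target: number; // index of the second chunk
  weight: number; // similarity score
}

// Compare every pair once and keep those at or above the threshold.
function buildEdges(vectors: number[][], threshold = 0.7): Edge[] {
  const edges: Edge[] = [];
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      const weight = cosineSimilarity(vectors[i], vectors[j]);
      if (weight >= threshold) edges.push({ source: i, target: j, weight });
    }
  }
  return edges;
}
```

The pairwise loop is quadratic in the number of chunks, so a higher threshold reduces the number of edges stored but not the comparison cost in this naive form.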
- File read errors: Logged as warnings, processing continues
- Embedding failures: Throws exception, processing halts
- Storage errors: Throws exception, processing halts
- All warnings are displayed in the final summary
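
The policy above (recover from read errors, halt on embedding or storage errors) can be sketched as a loop with injected callbacks. The processFiles name, the callback signatures, and the RunSummary shape are assumptions made for illustration only.

```typescript
interface RunSummary {
  processed: number;
  warnings: string[];
}

// Hypothetical driver: readFile and embedAndStore are supplied by the caller.
async function processFiles(
  paths: string[],
  readFile: (path: string) => Promise<string>,
  embedAndStore: (path: string, content: string) => Promise<void>,
): Promise<RunSummary> {
  const summary: RunSummary = { processed: 0, warnings: [] };
  for (const path of paths) {
    let content: string;
    try {
      content = await readFile(path);
    } catch (err) {
      // File read errors become warnings and processing continues.
      const message = err instanceof Error ? err.message : String(err);
      summary.warnings.push(`Could not read ${path}: ${message}`);
      continue;
    }
    // Embedding and storage errors are not caught: they propagate and halt the run.
    await embedAndStore(path, content);
    summary.processed += 1;
  }
  return summary;
}
```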
- Node.js 18+
- LM Studio running with an embedding model loaded
- TypeScript 5+
# Install dependencies
npm install
# Build
npm run build
# Run in development mode
npm run dev -- --help
# Run built version
npm start -- --help

License: ISC