Embedder

A CLI tool to index a git repository by chunking and embedding text files into a LanceDB vector database.

Features

File Discovery: Scans directories recursively with .gitignore support
Smart Filtering: Skips binary files and respects custom ignore patterns
Incremental Indexing: Git-based diff mode for fast updates (only process changed files)
Intelligent Mode: Auto-detects best indexing strategy (full vs diff)
Chunking: Uses Mastra's recursive chunking strategy for optimal text segmentation
Embedding: Generates embeddings via LM Studio's OpenAI-compatible API
Vector Storage: Stores embeddings in LanceDB for fast similarity search
GraphRAG (Optional): Build knowledge graphs for relationship-based retrieval
Resume Support: Tracks processed files and skips unchanged content
Progress Tracking: Real-time progress bar with colorized output
Error Handling: Continues past file read errors with detailed warnings; embedding and storage failures halt the run

Installation

npm install
npm run build

Or use directly with tsx:

npm run dev -- [options]

Usage

embedder \
  --dir /path/to/repository \
  --output /path/to/lancedb \
  --base-url http://localhost:1234/v1 \
  --model text-embedding-qwen3-embedding-0.6b

Required Options

-d, --dir <path> - Directory to index (the git repository)
-o, --output <path> - Output path for LanceDB database
-u, --base-url <url> - Base URL for LM Studio (e.g., http://localhost:1234/v1)
-m, --model <name> - Embedding model name (e.g., text-embedding-qwen3-embedding-0.6b)
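
To make the role of --base-url and --model concrete, here is a minimal sketch of the request an OpenAI-compatible embeddings endpoint expects. The function names (buildEmbeddingRequest, embedBatch) are hypothetical and are not taken from this tool's source; only the /embeddings path and the { model, input } body follow the OpenAI-compatible API shape.

```typescript
// Sketch only: the tool's real client code may differ.
interface EmbeddingRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build a POST request for the OpenAI-compatible /embeddings endpoint.
function buildEmbeddingRequest(
  baseUrl: string,
  model: string,
  inputs: string[],
): EmbeddingRequest {
  // Strip trailing slashes so ".../v1/" and ".../v1" produce the same URL.
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: `${root}/embeddings`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, input: inputs }),
    },
  };
}

// Send one batch of texts and return the vectors in input order.
// Relies on the global fetch available in Node.js 18+.
async function embedBatch(
  baseUrl: string,
  model: string,
  inputs: string[],
): Promise<number[][]> {
  const { url, init } = buildEmbeddingRequest(baseUrl, model, inputs);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Embedding request failed: HTTP ${res.status}`);
  const json = (await res.json()) as {
    data: { index: number; embedding: number[] }[];
  };
  // Sort by index so the output lines up with the input array.
  return [...json.data]
    .sort((a, b) => a.index - b.index)
    .map((d) => d.embedding);
}
```

With the values from the usage example, embedBatch("http://localhost:1234/v1", "text-embedding-qwen3-embedding-0.6b", ["hello"]) would post to http://localhost:1234/v1/embeddings.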

Optional Options

-t, --table-name <name> - LanceDB table name (default: embeddings)
--dimensions <number> - Embedding dimension size (default: 1024)
-i, --ignore <pattern> - Glob patterns to ignore (can be specified multiple times)
-b, --batch-size <number> - Number of embeddings to process in a batch (default: 10)
--mode <type> - Indexing mode: 'full' (complete re-index), 'diff' (incremental), 'intelligent' (auto-detect) (default: full)
--from-commit <hash> - Git commit hash to diff from (overrides stored state, only used with --mode diff)
--enable-graph - Enable GraphRAG knowledge graph creation (default: false)
--graph-threshold <number> - Similarity threshold for graph edges, 0.0-1.0 (default: 0.7)

Indexing Modes

Full Mode (Default)

Indexes every eligible file in the directory. Content hashes from the state file let it skip files that have not changed since the previous run.

embedder \
  -d . \
  -o ./embeddings \
  -u http://localhost:1234/v1 \
  -m text-embedding-qwen3-embedding-0.6b \
  --mode full

Diff Mode (Incremental)

Processes only the files that changed in git since the last indexed commit, which makes updates much faster than a full run.

embedder \
  -d . \
  -o ./embeddings \
  -u http://localhost:1234/v1 \
  -m text-embedding-qwen3-embedding-0.6b \
  --mode diff

Requirements:

Git repository
Previous index with stored commit hash

Limitation: only committed changes are indexed; the tool warns about uncommitted files.

Features:

Processes added files ✓
Processes modified files ✓
Removes deleted files from index ✓
Handles renamed files (delete old + add new) ✓
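
The list above maps naturally onto the output of git diff --name-status. The sketch below shows one way to turn that output into index and remove actions. It assumes the tool reads git's name-status format; the README does not say which git command is actually used, and the Change type and parseNameStatus name are hypothetical.

```typescript
// Sketch only: assumes input in `git diff --name-status <from> <to>` format,
// where each line is "<status>\t<path>" or "R<score>\t<old>\t<new>".
type Change =
  | { action: "index"; path: string }
  | { action: "remove"; path: string };

function parseNameStatus(output: string): Change[] {
  const changes: Change[] = [];
  for (const line of output.split("\n")) {
    if (line.trim() === "") continue;
    const [status, first, second] = line.split("\t");
    const code = status[0];
    if (code === "A" || code === "M") {
      // Added or modified: (re)index the file.
      changes.push({ action: "index", path: first });
    } else if (code === "D") {
      // Deleted: drop its chunks from the index.
      changes.push({ action: "remove", path: first });
    } else if (code === "R" && second !== undefined) {
      // Renamed: delete the old path, then add the new one.
      changes.push({ action: "remove", path: first });
      changes.push({ action: "index", path: second });
    }
    // Other statuses (C, T, U, ...) are ignored in this sketch.
  }
  return changes;
}
```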

Intelligent Mode (Recommended)

Automatically chooses between full and diff mode based on the repository and the stored index state.

embedder \
  -d . \
  -o ./embeddings \
  -u http://localhost:1234/v1 \
  -m text-embedding-qwen3-embedding-0.6b \
  --mode intelligent

Decision Logic:

Not a git repo? → Use full mode
No previous commit hash? → Use full mode (first run)
Commit hash unchanged? → Already indexed, skip
Otherwise → Use diff mode

Best for: CI/CD pipelines, automated scripts, scheduled jobs
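
The decision logic above can be sketched as a small pure function. The RepoContext shape and the resolveIntelligentMode name are hypothetical, chosen for illustration; only the four rules come from this README.

```typescript
// Sketch only: field and function names are illustrative assumptions.
type ResolvedMode = "full" | "diff" | "skip";

interface RepoContext {
  isGitRepo: boolean;
  headCommit: string | null;        // current HEAD, null outside a git repo
  lastIndexedCommit: string | null; // from the state file, null on first run
}

function resolveIntelligentMode(ctx: RepoContext): ResolvedMode {
  if (!ctx.isGitRepo) return "full";                  // not a git repo
  if (ctx.lastIndexedCommit === null) return "full";  // first run
  if (ctx.lastIndexedCommit === ctx.headCommit) return "skip"; // already indexed
  return "diff";                                      // something changed
}
```

Keeping the decision free of I/O makes it easy to unit-test separately from the git and state-file lookups that feed it.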

Example

# Index the current directory with custom settings
embedder \
  -d . \
  -o ./embeddings \
  -u http://localhost:1234/v1 \
  -m text-embedding-qwen3-embedding-0.6b \
  -t my_embeddings \
  --dimensions 768 \
  -i "*.test.ts" \
  -i "*.spec.ts" \
  -b 20

# Incremental indexing with diff mode
embedder \
  -d . \
  -o ./embeddings \
  -u http://localhost:1234/v1 \
  -m text-embedding-qwen3-embedding-0.6b \
  --mode diff

# With GraphRAG enabled for relationship-based retrieval
embedder \
  -d . \
  -o ./embeddings \
  -u http://localhost:1234/v1 \
  -m text-embedding-qwen3-embedding-0.6b \
  --enable-graph \
  --graph-threshold 0.7

How It Works

Discovery: Scans the specified directory for text files, respecting .gitignore rules
Filtering: Skips binary files, images, and files matching ignore patterns
Resume Check: Compares content hashes to skip already-processed files
Smart Chunking: Uses appropriate strategy based on file type:
- Markdown (.md, .mdx): semantic-markdown strategy for better structure understanding
- HTML (.html, .htm): HTML strategy preserving document structure
- JSON (.json): JSON-aware chunking
- Other files: Recursive strategy with 512-char chunks, 50-char overlap
Embedding: Generates embeddings using the specified model via LM Studio
Storage: Stores embeddings and metadata in LanceDB with configurable table name and dimensions
State Tracking: Saves progress in .embedder-state.json in the output directory
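
As an illustration of the Smart Chunking step, the sketch below picks a strategy from the file extension. The strategy labels and the 512/50 values come from the list above; the ChunkPlan type, the size and overlap field names, and the case-insensitive extension match are assumptions, not the tool's actual code or Mastra's option names.

```typescript
// Sketch only: maps a file path to the chunking strategy described above.
interface ChunkPlan {
  strategy: "semantic-markdown" | "html" | "json" | "recursive";
  size?: number;    // characters per chunk (recursive strategy only)
  overlap?: number; // characters shared between neighbouring chunks
}

function pickChunkPlan(filePath: string): ChunkPlan {
  const dot = filePath.lastIndexOf(".");
  // Assumption: extensions are compared case-insensitively.
  const ext = dot === -1 ? "" : filePath.slice(dot).toLowerCase();
  switch (ext) {
    case ".md":
    case ".mdx":
      return { strategy: "semantic-markdown" };
    case ".html":
    case ".htm":
      return { strategy: "html" };
    case ".json":
      return { strategy: "json" };
    default:
      return { strategy: "recursive", size: 512, overlap: 50 };
  }
}
```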

State File

The tool creates a .embedder-state.json file in the output directory to track:

Processed file paths
Content hashes
Number of chunks per file
Processing timestamps
Last indexed git commit hash (for diff mode)

This enables resume functionality - on subsequent runs, only new or modified files are processed.
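
A minimal sketch of how a content-hash resume check could work is shown below. The state shape loosely mirrors the fields listed above, but every field name here is hypothetical, and SHA-256 is an assumption: the README does not say which hash the tool uses or how .embedder-state.json is laid out.

```typescript
import { createHash } from "node:crypto";

// Sketch only: hypothetical layout, not the real .embedder-state.json schema.
interface FileState {
  hash: string;        // content hash at the time of indexing
  chunks: number;      // number of chunks produced for the file
  processedAt: string; // ISO timestamp
}

interface EmbedderState {
  files: Record<string, FileState>;
  lastCommit?: string; // last indexed git commit (used by diff mode)
}

// Assumption: SHA-256 over the UTF-8 file content.
function hashContent(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// A file needs processing if it is new or its content hash has changed.
function needsProcessing(
  state: EmbedderState,
  path: string,
  content: string,
): boolean {
  const previous = state.files[path];
  return previous === undefined || previous.hash !== hashContent(content);
}
```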

Diff mode benefits:

Uses git commit history instead of scanning all files
Automatically removes deleted files from the index
Tracks the exact commit that was last indexed

Output

The tool displays:

Initialization status
File discovery results
Real-time progress bar
Summary with statistics:
- Files processed
- Files skipped (unchanged)
- Total chunks created
- Knowledge graph nodes and edges (if GraphRAG enabled)
- Errors and warnings

Output Files

LanceDB database - Vector embeddings stored in the output directory
.embedder-state.json - Processing state for resume functionality
graph-data/ - Persisted knowledge graph (if --enable-graph is used)
- Folder-based storage with batched chunks and binary embeddings
- Scalable for large repositories

GraphRAG

When --enable-graph is enabled, the tool creates a knowledge graph in addition to the vector store. This enables relationship-based retrieval for RAG applications.

See GRAPHRAG.md for detailed documentation on:

How GraphRAG works
Using the persisted graph data in your applications
GraphStore API reference
Example RAG workflows
Performance considerations
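
As a rough illustration of what --graph-threshold controls, the sketch below connects two chunks whenever the similarity of their embeddings reaches the threshold. It assumes cosine similarity and an inclusive comparison; the actual metric and edge rule are described in GRAPHRAG.md, not here, and the Edge type and function names are hypothetical.

```typescript
// Sketch only: assumes cosine similarity and an inclusive (>=) threshold.
interface Edge {
  source: number; // index of the first chunk
  target: number; // index of the second chunk
  weight: number; // similarity between the two embeddings
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Compare every pair once (O(n^2)) and keep the pairs above the threshold.
function buildEdges(vectors: number[][], threshold: number): Edge[] {
  const edges: Edge[] = [];
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      const weight = cosineSimilarity(vectors[i], vectors[j]);
      if (weight >= threshold) edges.push({ source: i, target: j, weight });
    }
  }
  return edges;
}
```

Under these assumptions, raising the threshold yields a sparser graph with fewer, stronger edges, and lowering it yields a denser one.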

Error Handling

File read errors: Logged as warnings, processing continues
Embedding failures: Throws exception, processing halts
Storage errors: Throws exception, processing halts
All warnings are displayed in the final summary
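
The policy above can be sketched as a loop that catches read errors but lets embedding and storage errors propagate. This is a simplified, synchronous illustration with hypothetical names (indexFiles, RunSummary); the real tool is presumably asynchronous, and its internals are not shown in this README.

```typescript
// Sketch only: read errors become warnings, all other errors halt the run.
interface RunSummary {
  processed: number;
  warnings: string[];
}

function indexFiles(
  paths: string[],
  readFile: (path: string) => string,
  embedAndStore: (path: string, content: string) => void,
): RunSummary {
  const summary: RunSummary = { processed: 0, warnings: [] };
  for (const path of paths) {
    let content: string;
    try {
      content = readFile(path);
    } catch (err) {
      // File read error: record a warning and move on to the next file.
      summary.warnings.push(`Could not read ${path}: ${String(err)}`);
      continue;
    }
    // Not wrapped in try/catch: an embedding or storage failure
    // propagates to the caller and stops processing.
    embedAndStore(path, content);
    summary.processed++;
  }
  return summary;
}
```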

Requirements

Node.js 18+
LM Studio running with an embedding model loaded
TypeScript 5+

Development

# Install dependencies
npm install

# Build
npm run build

# Run in development mode
npm run dev -- --help

# Run built version
npm start -- --help

License

ISC
