vector-search

Semantic codebase search engine for Claude Code. Search your projects by meaning and intent, not just exact text matches.

Built as a Model Context Protocol (MCP) server — Claude Code calls it as a native tool, the same way it uses file read or shell commands.


The Problem It Solves

Standard code search tools (grep, ripgrep, glob) are purely lexical: they find only what you literally type. If you search for "pitch correction" and the code uses autotune, praat, or fundamental_frequency, you get nothing, even though all of those names refer to the same concept.

This engine converts every function, class, and block in your codebase into a semantic vector — a mathematical representation of its meaning. Queries are converted the same way. The engine then finds the closest matches in meaning-space, regardless of what words were used.

Result: Claude Code can answer "where does pitch get corrected?" in one shot, instead of running five grep rounds guessing at variable names.


How It Works

1. Indexing

When you index a project, the engine:

  1. Scans the directory for source files (.py, .js, .ts, .tsx, .md, .json, .yaml, and more)
  2. Chunks each file at natural boundaries: Python files are split at function and class definitions using the AST; all other files are split into fixed-size line windows
  3. Embeds each chunk using OpenAI's text-embedding-3-small model, producing a 1536-dimensional vector that encodes the chunk's meaning
  4. Stores all vectors locally in a ChromaDB database at ~/.claude/vector-index/, keyed per project
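The AST-based chunking in step 2 can be sketched with the standard library alone. This is a minimal illustration, not the code in chunker.py: the function name chunk_python_source and the chunk dictionary layout are assumptions, and it only handles top-level definitions.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class (hypothetical layout)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno and end_lineno are 1-based and inclusive (Python 3.8+).
            start, end = node.lineno, node.end_lineno
            chunks.append({
                "name": node.name,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


example = '''import os

def load(path):
    return open(path).read()

class Mixer:
    def run(self):
        return 1
'''
chunks = chunk_python_source(example)
for c in chunks:
    print(f'{c["name"]}: lines {c["start_line"]}-{c["end_line"]}')
```

Keeping the start and end lines with each chunk is what makes it possible to report results as file path plus line range later on.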

Indexing is incremental: each file's content hash is cached, and on subsequent runs only files whose hash has changed are re-embedded. After the first run, re-indexing a large repo usually takes seconds.
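One way to picture the hash-and-cache step is the sketch below. It assumes SHA-256 and an in-memory dictionary as the cache; the README does not say which hash function or cache format the indexer uses, and the helper names are hypothetical.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Content hash used as the cache key (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def files_to_reembed(current: dict[str, bytes], cache: dict[str, str]) -> list[str]:
    """Return paths whose hash differs from the cache, updating the cache in place."""
    changed = []
    for path, data in current.items():
        digest = file_digest(data)
        if cache.get(path) != digest:
            changed.append(path)
            cache[path] = digest
    return changed


cache: dict[str, str] = {}
files = {"a.py": b"print(1)\n", "b.py": b"print(2)\n"}

first_run = files_to_reembed(files, cache)   # nothing cached yet: every file is new
files["b.py"] = b"print(3)\n"                # simulate an edit to one file
second_run = files_to_reembed(files, cache)  # only the edited file is returned
```

Only the paths returned by the second call would need new embeddings, which is why repeat runs are cheap.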

2. Searching

When Claude Code (or you via CLI) issues a search:

  1. The query is embedded into the same 1536-dimensional space using the same model
  2. ChromaDB performs an approximate nearest-neighbour search using cosine similarity
  3. The top-K most semantically similar chunks are returned, with file path, line numbers, and a similarity score
  4. Claude reads the actual chunk text and uses it as grounded context to answer your question
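The ranking in steps 2 and 3 can be illustrated with a brute-force version in plain Python. ChromaDB uses an approximate index rather than comparing against every vector, so this is a sketch of the scoring idea only. The three-dimensional vectors and the README.md entry are made-up toy data; real embeddings have 1536 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunks: list[dict], k: int = 5) -> list[tuple[float, dict]]:
    """Score every chunk against the query and keep the k best (exact, not approximate)."""
    scored = [(cosine_similarity(query_vec, c["vector"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: locations echo the diagram in this README, vectors are invented.
chunks = [
    {"location": "autotune.py:12-45", "vector": [0.9, 0.1, 0.0]},
    {"location": "utils.py:88-102", "vector": [0.6, 0.6, 0.1]},
    {"location": "README.md:1-20", "vector": [0.0, 0.1, 0.9]},
]
query = [1.0, 0.2, 0.0]  # stand-in for the embedded query text

results = top_k(query, chunks, k=2)
for score, chunk in results:
    print(f'{chunk["location"]}  similarity={score:.3f}')
```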

Architecture Diagram

Your Projects                   vector-search engine
─────────────                   ──────────────────────────────────────
vocal-mix-agent/   ──chunk──►  AST Chunker (function/class level)
music-ai/          ──chunk──►       │
jay-resume-maker/  ──chunk──►       ▼
...                            OpenAI text-embedding-3-small
                                    │  1536-dim vectors
                                    ▼
                               ChromaDB  (~/.claude/vector-index/)
                                    │  per-project collections
                                    │
                               ◄────┘  cosine similarity search
Claude Code (MCP tool call)
  semantic_search("pitch correction", project_dir)
       │
       ▼
  Top 5 chunks  →  autotune.py:12-45, utils.py:88-102, ...

Installation

Requirements

  • Python 3.10+
  • An OpenAI API key (used for embedding; indexing a full project typically costs about $0.01 or less)

Setup

git clone https://github.com/jayharish/vector-search.git
cd vector-search
pip install -r requirements.txt

Register as a Claude Code MCP tool

claude mcp add vector-search python /path/to/vector-search/server.py

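This registers the server so Claude Code loads it as a tool. Note that claude mcp add defaults to the local scope (the current project only); to load the server in every session, add --scope user to the command.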

Set your OpenAI API key

# Current session only
export OPENAI_API_KEY="sk-..."

# Permanent (add to ~/.bashrc or Claude Code settings)
echo 'export OPENAI_API_KEY="sk-..."' >> ~/.bashrc

Or add it to your Claude Code ~/.claude/settings.json:

{
  "env": {
    "OPENAI_API_KEY": "sk-..."
  }
}

Usage

CLI

Index a project:

python cli.py index /path/to/your/project

Search a project:

python cli.py search "how does authentication work" /path/to/your/project
python cli.py search "database connection setup" /path/to/your/project
python cli.py search "pitch correction logic" /path/to/vocal-mix-agent

Index multiple projects at once:

python cli.py index-all /project-a /project-b /project-c

List all indexed projects:

python cli.py list

Force a full re-index (ignores cache):

python cli.py index /path/to/project --force

Remove a project's index:

python cli.py drop /path/to/project

As a Claude Code MCP Tool

Once registered, Claude Code has access to three tools automatically:

semantic_search

semantic_search(query, project_dir, top_k=5)

Searches a project by meaning. Returns the top-K most relevant code chunks with file paths and line numbers.

Example Claude Code usage:

"Find where the beat gets downloaded from YouTube"
→ Claude calls semantic_search("YouTube beat download", "/path/to/vocal-mix-agent")
→ Returns downloader.py:8-41 immediately

index_project

index_project(project_dir, force=False)

Indexes or re-indexes a project directory. Incremental by default — only changed files are re-embedded.

list_indexed_projects

list_indexed_projects()

Lists all projects currently in the vector store.


Project Structure

vector-search/
├── server.py        # MCP server — exposes tools to Claude Code via stdio
├── cli.py           # Terminal CLI for indexing and searching
├── indexer.py       # Orchestrates scan → chunk → embed → store pipeline
├── chunker.py       # AST-based splitter for Python; line-based for all others
├── embedder.py      # OpenAI text-embedding-3-small wrapper with batching
├── store.py         # ChromaDB persistent vector store, one collection per project
└── requirements.txt

Supported File Types

| Extension | Chunking method |
| --- | --- |
| .py | AST: split at function and class definitions |
| .js .ts .tsx .jsx | Line-window (300-token blocks) |
| .md .txt | Line-window |
| .json .yaml .yml .toml | Line-window |
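A line-window splitter for the non-Python file types might look like the sketch below. It approximates token counts by whitespace-separated words, which is an assumption: the README gives the 300-token block size but not how tokens are counted. The function name and chunk layout are hypothetical.

```python
def line_window_chunks(text: str, max_tokens: int = 300) -> list[dict]:
    """Group consecutive lines into blocks of roughly max_tokens tokens.

    A single line longer than max_tokens still becomes its own block.
    """
    chunks = []
    current: list[str] = []
    start_line, count = 1, 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        n = len(line.split())  # crude token estimate, not a real tokenizer
        if current and count + n > max_tokens:
            chunks.append({
                "start_line": start_line,
                "end_line": lineno - 1,
                "text": "\n".join(current),
            })
            current, start_line, count = [], lineno, 0
        current.append(line)
        count += n
    if current:
        chunks.append({
            "start_line": start_line,
            "end_line": start_line + len(current) - 1,
            "text": "\n".join(current),
        })
    return chunks


# Ten short lines of four "tokens" each, with a small window for demonstration.
sample = "\n".join(f"const v{i} = {i};" for i in range(1, 11))
windows = line_window_chunks(sample, max_tokens=12)
```

Splitting on line boundaries rather than raw character offsets keeps the reported line ranges exact.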

Files larger than 500KB and standard dependency directories (node_modules, __pycache__, .git, venv) are automatically skipped.
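The skip rules above could be implemented roughly as follows. This is a sketch under assumptions: the names scan_project, SKIP_DIRS, EXTENSIONS, and MAX_BYTES are hypothetical, "500KB" is read as 500 * 1024 bytes, and the extension set is taken from the table in this section.

```python
import os

SKIP_DIRS = {"node_modules", "__pycache__", ".git", "venv"}
EXTENSIONS = {".py", ".js", ".ts", ".tsx", ".jsx", ".md", ".txt",
              ".json", ".yaml", ".yml", ".toml"}
MAX_BYTES = 500 * 1024  # assumption: "500KB" means 500 * 1024 bytes


def scan_project(root: str) -> list[str]:
    """Return sorted paths (relative to root) of files eligible for indexing."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.splitext(name)[1].lower() not in EXTENSIONS:
                continue  # unsupported file type
            if os.path.getsize(path) > MAX_BYTES:
                continue  # too large to index
            found.append(os.path.relpath(path, root))
    return sorted(found)
```

Calling scan_project("/path/to/your/project") would return the relative paths that remain after both filters.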


Tech Stack

| Component | Choice | Why |
| --- | --- | --- |
| Embedding model | text-embedding-3-small (OpenAI) | Strong embedding quality for the cost: $0.02 per 1M tokens |
| Vector store | ChromaDB | Local, embedded, no server required; persists to disk at ~/.claude/vector-index/ |
| MCP protocol | mcp Python SDK | Native Claude Code integration; server communicates via stdio |
| Chunker | Python ast module + line-window | Function-level chunks preserve semantic units better than naive splits |
| CLI | Click | Clean, composable command interface |

Cost

Indexing uses the OpenAI Embeddings API. Approximate costs:

| Project size | Estimated tokens | Cost |
| --- | --- | --- |
| Small project (< 20 files) | ~10k tokens | < $0.001 |
| Medium project (50–100 files) | ~50k tokens | ~$0.001 |
| Large project (500+ files) | ~500k tokens | ~$0.01 |

After the first index, only changed files are re-embedded — ongoing cost is essentially zero.
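The figures in the table follow directly from the $0.02 per 1M tokens rate quoted in the Tech Stack section. A small helper makes the arithmetic explicit; the function name is hypothetical and the token counts are the README's own estimates.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD for text-embedding-3-small, as quoted in this README


def embedding_cost(tokens: int) -> float:
    """Estimated USD cost of embedding the given number of tokens."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


for label, tokens in [("small", 10_000), ("medium", 50_000), ("large", 500_000)]:
    print(f"{label}: {tokens} tokens -> ${embedding_cost(tokens):.4f}")
```

This gives $0.0002, $0.0010, and $0.0100 for the three project sizes, matching the table's rounded values.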


License

MIT — free to use, modify, and distribute.
