vector-search

Semantic codebase search engine for Claude Code. Search your projects by meaning and intent, not just exact text matches.

Built as a Model Context Protocol (MCP) server — Claude Code calls it as a native tool, the same way it uses file read or shell commands.

The Problem It Solves

Standard code search tools (grep, ripgrep, glob) are purely lexical. They find what you literally type. If you search for "pitch correction" and the code uses autotune, praat, or fundamental_frequency, you get nothing, even though those names all refer to the same concept.

This engine converts every function, class, and block in your codebase into a semantic vector — a mathematical representation of its meaning. Queries are converted the same way. The engine then finds the closest matches in meaning-space, regardless of what words were used.

Result: Claude Code can answer "where does pitch get corrected?" in one shot, instead of running five grep rounds guessing at variable names.

How It Works

1. Indexing

When you index a project, the engine:

Scans the directory for source files (.py, .js, .ts, .tsx, .md, .json, .yaml, and more)
Chunks each file at natural boundaries: Python files are split at function and class definitions using the AST; all other files are split into fixed-size line windows
Embeds each chunk using OpenAI's text-embedding-3-small model, producing a 1536-dimensional vector that encodes the chunk's meaning
Stores all vectors locally in a ChromaDB database at ~/.claude/vector-index/, keyed per project
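
The AST-based chunking step can be sketched with Python's standard `ast` module. This is a minimal illustration rather than the code in chunker.py: the function name `chunk_python_source` and the chunk dictionary layout are assumptions, and only top-level functions and classes are handled.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno and end_lineno are 1-based and inclusive.
            start, end = node.lineno, node.end_lineno
            chunks.append({
                "name": node.name,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


example = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

chunks = chunk_python_source(example)
for chunk in chunks:
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Keeping the line range with each chunk is what lets a search result point back to a location such as autotune.py:12-45.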

Indexing is incremental: each file is hashed and the hash is cached. On subsequent runs, only files whose hash has changed are re-embedded, so even a large repo typically re-indexes in seconds after the first run.
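
One way to implement this kind of change detection is to compare a content hash against a cached value. The sketch below assumes SHA-256 and an in-memory dictionary as the cache; the hash function and cache format the engine actually uses are not stated here, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(paths: list[Path], cache: dict[str, str]) -> list[Path]:
    """Return the files whose digest differs from the cache, updating the cache."""
    changed = []
    for path in paths:
        digest = file_digest(path)
        if cache.get(str(path)) != digest:
            changed.append(path)
            cache[str(path)] = digest
    return changed
```

On the first run the cache is empty, so every file is returned; on later runs only edited files come back and need to be re-embedded.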

2. Searching

When Claude Code (or you via CLI) issues a search:

The query is embedded into the same 1536-dimensional space using the same model
ChromaDB performs an approximate nearest-neighbour search using cosine similarity
The top-K most semantically similar chunks are returned, with file path, line numbers, and a similarity score
Claude reads the actual chunk text and uses it as grounded context to answer your question
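
The ranking step can be illustrated with a small, exhaustive cosine-similarity search in plain Python. This is a sketch of the idea only: ChromaDB uses an approximate index rather than a full scan, and the three-dimensional vectors and chunk identifiers below are made-up stand-ins for real 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: dict[str, list[float]], k: int = 5):
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in indexed.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy index: chunk id -> embedding (illustrative values only).
index = {
    "autotune.py:12-45": [0.9, 0.1, 0.0],
    "utils.py:88-102": [0.6, 0.4, 0.1],
    "downloader.py:8-41": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}  {score:.3f}")
```

Because cosine similarity depends only on direction, two chunks that use different vocabulary can still score highly when their embeddings point the same way.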

Architecture Diagram

Your Projects                   vector-search engine
─────────────                   ──────────────────────────────────────
vocal-mix-agent/   ──chunk──►  AST Chunker (function/class level)
music-ai/          ──chunk──►       │
jay-resume-maker/  ──chunk──►       ▼
...                            OpenAI text-embedding-3-small
                                    │  1536-dim vectors
                                    ▼
                               ChromaDB  (~/.claude/vector-index/)
                                    │  per-project collections
                                    │
                               ◄────┘  cosine similarity search
Claude Code (MCP tool call)
  semantic_search("pitch correction", project_dir)
       │
       ▼
  Top 5 chunks  →  autotune.py:12-45, utils.py:88-102, ...

Installation

Requirements

Python 3.10+
An OpenAI API key (used for embedding; indexing a full project typically costs about $0.01 or less, see Cost)

Setup

git clone https://github.com/jayharish/vector-search.git
cd vector-search
pip install -r requirements.txt

Register as a Claude Code MCP tool

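claude mcp add --scope user vector-search -- python /path/to/vector-search/server.py

The --scope user flag registers the server for your user account, so Claude Code loads it as a tool in every project. Without it, claude mcp add registers the server for the current project only.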

Set your OpenAI API key

# Current session only
export OPENAI_API_KEY="sk-..."

# Permanent (add to ~/.bashrc or Claude Code settings)
echo 'export OPENAI_API_KEY="sk-..."' >> ~/.bashrc

Or add it to your Claude Code settings in ~/.claude/settings.json:

{
  "env": {
    "OPENAI_API_KEY": "sk-..."
  }
}

Usage

CLI

Index a project:

python cli.py index /path/to/your/project

Search a project:

python cli.py search "how does authentication work" /path/to/your/project
python cli.py search "database connection setup" /path/to/your/project
python cli.py search "pitch correction logic" /path/to/vocal-mix-agent

Index multiple projects at once:

python cli.py index-all /project-a /project-b /project-c

List all indexed projects:

python cli.py list

Force a full re-index (ignores cache):

python cli.py index /path/to/project --force

Remove a project's index:

python cli.py drop /path/to/project

As a Claude Code MCP Tool

Once registered, Claude Code has access to three tools automatically:

`semantic_search`

semantic_search(query, project_dir, top_k=5)

Searches a project by meaning. Returns the top-K most relevant code chunks with file paths and line numbers.

Example Claude Code usage:

"Find where the beat gets downloaded from YouTube"
→ Claude calls semantic_search("YouTube beat download", "/path/to/vocal-mix-agent")
→ Returns downloader.py:8-41 immediately

`index_project`

index_project(project_dir, force=False)

Indexes or re-indexes a project directory. Incremental by default — only changed files are re-embedded.

`list_indexed_projects`

list_indexed_projects()

Lists all projects currently in the vector store.

Project Structure

vector-search/
├── server.py        # MCP server — exposes tools to Claude Code via stdio
├── cli.py           # Terminal CLI for indexing and searching
├── indexer.py       # Orchestrates scan → chunk → embed → store pipeline
├── chunker.py       # AST-based splitter for Python; line-based for all others
├── embedder.py      # OpenAI text-embedding-3-small wrapper with batching
├── store.py         # ChromaDB persistent vector store, one collection per project
└── requirements.txt
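
The batching mentioned for embedder.py can be sketched as a small generator that groups chunk texts so each group can be sent in a single embeddings request. The helper name `batched` and the default batch size of 100 are assumptions for illustration, not values taken from the project.

```python
from typing import Iterator


def batched(texts: list[str], batch_size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of texts, each at most batch_size long."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


chunk_texts = ["a", "b", "c", "d", "e"]
for batch in batched(chunk_texts, batch_size=2):
    # In a real embedder, each batch would be passed to the embeddings API
    # as one request instead of one request per chunk.
    print(batch)
```

Sending many inputs per request reduces the number of round trips during the first full index.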

Supported File Types

| Extension | Chunking method |
| --- | --- |
| `.py` | AST: split at function and class definitions |
| `.js` `.ts` `.tsx` `.jsx` | Line-window (300-token blocks) |
| `.md` `.txt` | Line-window |
| `.json` `.yaml` `.yml` `.toml` | Line-window |
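
A line-window chunker for the non-Python file types might look like the sketch below. It assumes roughly four characters per token as a stand-in for a real tokenizer, so the 300-token budget is only approximated; the function name and that heuristic are illustrative, not taken from chunker.py.

```python
def chunk_by_line_window(text: str, max_tokens: int = 300,
                         chars_per_token: int = 4) -> list[dict]:
    """Group whole lines into chunks that stay under an approximate token budget."""
    budget = max_tokens * chars_per_token
    chunks, current, size, start = [], [], 0, 1
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Close the current window before it would exceed the budget.
        if current and size + len(line) + 1 > budget:
            chunks.append({"start_line": start, "end_line": lineno - 1,
                           "text": "\n".join(current)})
            current, size, start = [], 0, lineno
        current.append(line)
        size += len(line) + 1  # +1 accounts for the newline
    if current:
        chunks.append({"start_line": start,
                       "end_line": start + len(current) - 1,
                       "text": "\n".join(current)})
    return chunks
```

Lines are never split, so a single very long line can exceed the budget on its own; a production chunker would need a policy for that case.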

Files larger than 500KB and standard dependency directories (node_modules, __pycache__, .git, venv) are automatically skipped.
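
The scan and skip rules above can be sketched with `os.walk`. The constant and function names are hypothetical, the extension set is limited to the types in the table, and 500KB is assumed to mean 500 * 1024 bytes.

```python
import os
from pathlib import Path

SKIP_DIRS = {"node_modules", "__pycache__", ".git", "venv"}
EXTENSIONS = {".py", ".js", ".ts", ".tsx", ".jsx", ".md", ".txt",
              ".json", ".yaml", ".yml", ".toml"}
MAX_BYTES = 500 * 1024  # assumption: "500KB" interpreted as 500 KiB


def scan_project(root: str) -> list[Path]:
    """Return indexable files under root, honouring the skip rules."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            path = Path(dirpath) / name
            if path.suffix.lower() in EXTENSIONS and path.stat().st_size <= MAX_BYTES:
                found.append(path)
    return sorted(found)
```

Pruning `dirnames` in place avoids walking large dependency trees at all, which matters more for speed than filtering their files afterwards.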

Tech Stack

| Component | Choice | Why |
| --- | --- | --- |
| Embedding model | `text-embedding-3-small` (OpenAI) | Best code embedding quality per cost: $0.02 per 1M tokens |
| Vector store | ChromaDB | Local, embedded, no server required; persists to disk at `~/.claude/vector-index/` |
| MCP protocol | `mcp` Python SDK | Native Claude Code integration; server communicates via stdio |
| Chunker | Python `ast` module + line-window | Function-level chunks preserve semantic units better than naive splits |
| CLI | Click | Clean, composable command interface |

Cost

Indexing uses the OpenAI Embeddings API. Approximate costs:

| Project size | Estimated tokens | Cost |
| --- | --- | --- |
| Small project (< 20 files) | ~10k tokens | < $0.001 |
| Medium project (50–100 files) | ~50k tokens | ~$0.001 |
| Large project (500+ files) | ~500k tokens | ~$0.01 |

After the first index, only changed files are re-embedded — ongoing cost is essentially zero.
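
The figures in the table follow directly from the $0.02 per 1M tokens rate quoted under Tech Stack. A short calculation makes the arithmetic explicit; the function name is illustrative, and the token counts are the table's estimates rather than measurements.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the text-embedding-3-small rate quoted above


def embedding_cost(tokens: int,
                   price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Estimated cost in USD to embed the given number of tokens."""
    return tokens / 1_000_000 * price_per_million


for label, tokens in [("small", 10_000), ("medium", 50_000), ("large", 500_000)]:
    print(f"{label}: ${embedding_cost(tokens):.4f}")
# small: $0.0002
# medium: $0.0010
# large: $0.0100
```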

License

MIT — free to use, modify, and distribute.
