AI-powered codebase onboarding tool
Point it at any repo → understand the entire codebase in hours, not weeks
Features • Demo • Setup • Usage • MCP • API • Architecture • Contributing
Codewalk analyzes any codebase and gives you:
- Module detection — groups files into logical modules automatically
- Dependency graph — extracts every import/require → builds the full dependency map
- Blast radius — "if I change this file, what breaks?"
- Reading order — optimal file reading sequence (dependencies first)
- Execution flow — entry points, module-to-module and file-to-file dependency flow
- AI chat — ask anything about the code, powered by RAG + tool-calling agent
Three ways to use it:
| Interface | Best for |
|---|---|
| Web UI (Next.js) | Visual exploration — diagrams, module browser, blast radius viewer |
| MCP Server | VS Code Copilot, Claude Code, Cursor, Codex — AI agents use tools directly |
| REST API | Scripts, CI/CD, custom integrations |
| Scenario | How Codewalk helps |
|---|---|
| New dev joins the team | Point Codewalk at the repo → get an overview, module map, and reading order. Self-onboard in hours instead of weeks of "hey, can you explain this?" |
| LLM token costs are high | Without RAG, the LLM needs your entire codebase in context — slow and expensive. Codewalk embeds code into a vector DB and retrieves only the relevant chunks per query. Faster answers, fraction of the tokens. |
| Senior dev switches modules | You know the auth module but now need to work on payments. Get module info, blast radius, and execution flow without bugging the payments team. |
| Before a refactor | Check blast radius before touching shared code. "If I change base_model.py, what breaks?" — get the answer before you break prod. |
| PR reviews | Reviewer doesn't know what verify_request() does? Explain any function in seconds with AI-powered line-by-line breakdown. |
| Documentation is outdated | Codewalk analyzes the actual code, not stale wiki pages. Always up to date. |
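The token-saving idea in the table above can be sketched in a few lines. This is a toy illustration, not Codewalk's actual pipeline (which uses a ChromaDB vector store and a neural embedding model): "embedding" here is just a bag-of-words vector, but the retrieval step — rank chunks by similarity, send only the top-k to the LLM — has the same shape.

```python
# Toy RAG retrieval: only the most relevant chunks reach the LLM.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding': a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity and keep only the top k: these, not the
    whole codebase, become the LLM's context."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "def connect_db(): opens a database connection to postgres",
    "def render_header(): draw page header html",
    "def close_db(conn): close the postgres connection",
]
print(retrieve("how does the db connection work", chunks, k=2))
```

With a real embedding model the similarity is semantic rather than lexical, but the cost structure is identical: context size scales with `k`, not with the repo.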
| Feature | Description |
|---|---|
| 🔍 Module Detection | Auto-groups files into packages/modules by directory structure |
| 🕸️ Dependency Graph | Parses imports across 15+ languages via tree-sitter |
| 💥 Blast Radius | BFS on reversed dependency graph → shows transitive impact of any change |
| 📖 Reading Order | Topological sort → "read config.py before embedder.py because embedder imports config" |
| 🔄 Execution Flow | Entry points, module/file dependency chains, Mermaid diagrams |
| 🤖 AI Chat | LangGraph agent with 7 tools, multi-turn conversation with memory |
| 🔎 Semantic Search | ChromaDB vector search on embedded code chunks (RAG) |
| 🧩 MCP Server | 12 tools for VS Code Copilot / Claude Code / Cursor / Codex |
| ⚡ Parallel Embedding | Producer-consumer pipeline — CPU chunking overlaps with GPU embedding |
| 🏗️ Multi-Provider LLM | Ollama (local), OpenAI, Anthropic, Groq, Gemini, OpenRouter |
| 🌐 15+ Languages | Python, JS, TS, Java, Go, Rust, Ruby, PHP, C#, C++, C, Dart, Kotlin, Swift, YAML |
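The "Parallel Embedding" row describes a producer-consumer pipeline. Here is a minimal sketch of that shape — `chunk_file()` and `embed()` are stand-ins, not Codewalk's real functions — showing how the CPU-bound chunking stage overlaps with the embedding stage instead of running serially:

```python
# Producer-consumer sketch: one thread chunks files while another
# consumes chunks and embeds them; a bounded queue connects the stages.
import queue
import threading

def chunk_file(path: str) -> list[str]:
    return [f"{path}:chunk{i}" for i in range(3)]   # pretend chunking (CPU)

def embed(chunk: str) -> list[float]:
    return [float(len(chunk))]                      # pretend embedding (GPU)

def producer(files, q):
    for path in files:
        for chunk in chunk_file(path):
            q.put(chunk)
    q.put(None)                                     # sentinel: no more work

def consumer(q, out):
    while (chunk := q.get()) is not None:
        out[chunk] = embed(chunk)

q, vectors = queue.Queue(maxsize=64), {}
t1 = threading.Thread(target=producer, args=(["a.py", "b.py"], q))
t2 = threading.Thread(target=consumer, args=(q, vectors))
t1.start(); t2.start(); t1.join(); t2.join()
print(len(vectors))  # 6 chunks embedded
```

The bounded queue also acts as backpressure: chunking pauses when the embedder falls behind, keeping memory use flat.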
| Language | Extensions | Tree-sitter Parsing | Import Extraction |
|---|---|---|---|
| Python | `.py` | ✅ | ✅ |
| JavaScript | `.js`, `.jsx` | ✅ | ✅ |
| TypeScript | `.ts`, `.tsx` | ✅ | ✅ |
| Java | `.java` | ✅ | ✅ |
| Go | `.go` | ✅ | ✅ |
| Rust | `.rs` | ✅ | ✅ |
| Ruby | `.rb` | ✅ | ✅ |
| PHP | `.php` | ✅ | ✅ |
| C# | `.cs` | ✅ | ✅ |
| C++ | `.cpp` | ✅ | ✅ |
| C | `.c` | ✅ | ✅ |
| Kotlin | `.kt` | ✅ | ✅ |
| Swift | `.swift` | ✅ | ✅ |
| Dart | `.dart` | ✅ (optional install) | ✅ |
| YAML | `.yaml`, `.yml` | — | — |
| JSON | `.json` | — | — |
| TOML | `.toml` | — | — |
| Markdown | `.md` | — | — |
Tree-sitter parsing = extracts functions, classes, and methods for accurate chunking and function explanations.
Import extraction = builds the dependency graph, blast radius, and reading order.
Languages without tree-sitter support still get indexed via text splitting — they work with semantic search and AI chat, just without function-level granularity.
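The two graph features built on import extraction — blast radius (BFS over the reversed graph) and reading order (topological sort) — can be sketched on a toy dependency graph. This is an illustration of the algorithms named above, not Codewalk's actual code; the real graph comes from tree-sitter import extraction:

```python
# "A imports B" = edge A -> B. Blast radius walks the reversed edges;
# reading order is a topological sort (dependencies first).
from collections import deque
from graphlib import TopologicalSorter

imports = {                       # file -> files it imports
    "embedder.py": {"config.py"},
    "chain.py":    {"config.py"},
    "pipeline.py": {"embedder.py", "chain.py"},
    "config.py":   set(),
}

def blast_radius(changed: str) -> set[str]:
    """Everything that (transitively) imports `changed`."""
    reverse = {f: set() for f in imports}
    for f, deps in imports.items():
        for d in deps:
            reverse[d].add(f)     # d is imported by f
    hit, todo = set(), deque([changed])
    while todo:
        for dependent in reverse[todo.popleft()]:
            if dependent not in hit:
                hit.add(dependent)
                todo.append(dependent)
    return hit

print(blast_radius("config.py"))  # all three other files are affected
# Reading order: config.py comes first, pipeline.py last.
print(list(TopologicalSorter(imports).static_order()))
```

This is exactly why the reading order says "read config.py before embedder.py": a topological sort guarantees that when you open a file, everything it imports has already been explained.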
codewalk-demo-frontend.mp4
codewalk-demo-mcp.mp4
🎥 [Video coming soon]
| Tool | Version | Check |
|---|---|---|
| Python | 3.10+ | python3 --version |
| Node.js | 18+ | node --version |
| Git | Any | git --version |
| Ollama (optional) | Latest | ollama --version |
```bash
git clone https://github.com/gupta29470/codewalk.git
cd codewalk

# Create virtual environment
python3 -m venv .codewalk-env
source .codewalk-env/bin/activate   # macOS / Linux
# .codewalk-env\Scripts\activate    # Windows

# Install Python dependencies
pip install -r requirements.txt
```

⚠️ VPN / Corporate Network / Private Network Issues
If you're behind a VPN, corporate proxy, or private network, package installations and model downloads may fail due to blocked connections or SSL certificate errors.
Recommended: Use a normal (non-VPN) network for first-time setup.
Codewalk's setup downloads packages from PyPI, npm, and HuggingFace. These are one-time downloads — once installed, everything runs locally. If possible:
- Disconnect from VPN temporarily
- Run the setup steps (`pip install`, `npm install`, start the backend once to download the embedding model)
- Reconnect to VPN — everything is cached locally, no more downloads needed
After the first run, Codewalk works fully offline (with Ollama). The VPN/corporate network won't cause any issues.
Optional: Dart/Flutter support (tree-sitter-dart)

```bash
# If you get an SSH error, run this first:
git config --global url."https://github.com/".insteadOf "git@github.com:"

# Then install:
pip install "tree-sitter-dart @ git+https://github.com/UserNobody14/tree-sitter-dart.git"
```

Without this, Codewalk still works — Dart files just won't get tree-sitter parsing (falls back to text splitting).
```bash
cd frontend
npm install
cd ..
```

Create a `.env` file in the project root:
```env
# ─── LLM Configuration ──────────────────────────────────────
# Provider: ollama | openai | anthropic | gemini | groq | openrouter
LLM_PROVIDER=ollama
LLM_MODEL=qwen2.5-coder:7b

# ─── Embeddings ──────────────────────────────────────────────
EMBEDDING_MODEL=jinaai/jina-code-embeddings-1.5b

# ─── Repository to Analyze ──────────────────────────────────
# Relative path (self-analysis): src/codewalk
# Absolute path (any repo): /Users/you/projects/my-app/src
REPO_PATH=src/codewalk

# ─── API Keys (only fill the one you're using) ──────────────
# GROQ_API_KEY=gsk_...
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
# GOOGLE_API_KEY=AI...
# OPENROUTER_API_KEY=sk-or-...
```

Pull the default Ollama model:

```bash
ollama pull qwen2.5-coder:7b
```

Recommended models by size:
| Model | Size | Tool Calling | Best For |
|---|---|---|---|
| `qwen2.5-coder:7b` | 4.7 GB | ✅ | Code-focused, fast |
| `qwen3.5:latest` (8B) | 6.6 GB | ✅ | General + code |
| `qwen3.5:27b` | 17 GB | ✅ | Best accuracy |
Open two terminals in `codewalk`:

Terminal 1 — Backend API

```bash
source .codewalk-env/bin/activate
uvicorn src.codewalk.api.main:app --reload --port 8000
```

Terminal 2 — Frontend

```bash
cd frontend
npm run dev
```

Open http://localhost:3000 → enter a repo path → click Analyze Codebase.
Then explore:
- Overview — tech stack, modules, dependency diagram, riskiest files
- Modules — browse all modules, click one for file list + dependencies
- Blast Radius — which files break if you change each file
- Reading Order — optimal file reading sequence with risk levels
- Execution Flow — Mermaid diagram of module/file dependencies
- Chat — ask any question ("explain the authentication flow", "what does scanner.py do?")
See MCP Integration below.
```bash
# Start the backend
source .codewalk-env/bin/activate
uvicorn src.codewalk.api.main:app --reload --port 8000
```

Step 1 — Analyze a codebase:

```bash
curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{"repo_path": "/path/to/your/repo", "index_mode": "auto"}'
```

Step 2 — Explore the results:
```bash
# Project overview (tech stack, modules, riskiest files)
curl http://localhost:8000/overview | python3 -m json.tool

# List all modules
curl http://localhost:8000/modules | python3 -m json.tool

# Dive into a specific module
curl http://localhost:8000/modules/auth | python3 -m json.tool

# What breaks if I change files in the auth module?
curl http://localhost:8000/blast-radius/auth | python3 -m json.tool

# Optimal reading order
curl http://localhost:8000/reading-order | python3 -m json.tool

# Execution flow (entry points, dependency chains)
curl http://localhost:8000/execution-flow | python3 -m json.tool
```

Step 3 — Chat with the agent:
```bash
# Ask a question
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain this project", "thread_id": "thread-1"}'

# Follow-up (same thread_id = conversation memory)
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What does the auth module do?", "thread_id": "thread-1"}'

# After code changes — refresh analysis without re-embedding
curl -X POST http://localhost:8000/refresh
```

See API Reference for full request/response details on every endpoint.
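The same chat flow works from Python. A small sketch of a client for `POST /chat` (the class name and structure are mine, not part of Codewalk): reusing one `thread_id` on every turn is what gives the agent conversation memory. The demo only prints the payload so it runs without a server; `ask()` shows the actual request.

```python
# Minimal /chat client sketch using only the standard library.
import json
import urllib.request

class ChatSession:
    def __init__(self, thread_id: str, base_url: str = "http://localhost:8000"):
        self.thread_id = thread_id
        self.base_url = base_url

    def payload(self, message: str) -> dict:
        # Same thread_id on every turn = server-side conversation memory.
        return {"message": message, "thread_id": self.thread_id}

    def ask(self, message: str) -> str:
        req = urllib.request.Request(
            f"{self.base_url}/chat",
            data=json.dumps(self.payload(message)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["answer"]

session = ChatSession("thread-1")
print(session.payload("What happens if the token expires?"))
```

Starting a new `thread_id` starts a fresh conversation with no shared memory.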
Codewalk runs as an MCP (Model Context Protocol) server, so any AI agent that speaks MCP can use it.
1. Open VS Code in the codewalk project
2. Press `Cmd+Shift+P` (macOS) or `Ctrl+Shift+P` (Windows/Linux)
3. Type `MCP: List Servers` and select it
4. You'll see `codewalk` in the list
5. Click Start Server next to codewalk
6. The server starts in the background (stdio transport)
7. Open Copilot Chat → type `@codewalk` → all 12 tools are available
Add to `.vscode/mcp.json` in your desired project:

⚠️ Replace `/path/to/codewalk` with the actual absolute path where you cloned codewalk.
```json
{
  "servers": {
    "codewalk": {
      "command": "/path/to/codewalk/.codewalk-env/bin/python",
      "args": ["-m", "src.codewalk.mcp.server"],
      "cwd": "/path/to/codewalk",
      "env": {
        "REPO_PATH": "${workspaceFolder}",
        "EXCLUDE_PATHS": ""
      }
    }
  }
}
```
`EXCLUDE_PATHS` — comma-separated list of paths/patterns to skip during scanning. Example: `"tests,docs,scripts/legacy,*.generated.*"`
Customizing file filters: Codewalk ships with a built-in skip list (binary files, lock files, `node_modules/`, etc.). If you want to remove a predefined skip rule (e.g., to index `.md` or `.css` files), edit `src/codewalk/ingestion/file_filter.py`.
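To make the `EXCLUDE_PATHS` semantics concrete, here is a hypothetical re-implementation — the real rules live in `src/codewalk/ingestion/file_filter.py` and may differ — where each entry matches either a path prefix or a glob pattern on the file name:

```python
# Hypothetical sketch of applying an EXCLUDE_PATHS value.
from fnmatch import fnmatch

def is_excluded(path: str, exclude_paths: str) -> bool:
    for entry in filter(None, (e.strip() for e in exclude_paths.split(","))):
        if path == entry or path.startswith(entry + "/"):
            return True                      # directory/path prefix match
        if fnmatch(path.rsplit("/", 1)[-1], entry):
            return True                      # glob match on the file name
    return False

rules = "tests,docs,scripts/legacy,*.generated.*"
print(is_excluded("tests/test_api.py", rules))        # True
print(is_excluded("api/schema.generated.ts", rules))  # True
print(is_excluded("api/routes.py", rules))            # False
```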
Then in Copilot Chat: @codewalk → follow the scan → filter → index workflow.
Note: After adding or modifying `.vscode/mcp.json`, reload the VS Code window: `Cmd+Shift+P` → `Developer: Reload Window`.
Add to `~/.claude/mcp.json`:

```json
{
  "mcpServers": {
    "codewalk": {
      "command": "python",
      "args": ["-m", "src.codewalk.mcp.server"],
      "cwd": "/path/to/codewalk",
      "env": {
        "REPO_PATH": "/path/to/target/repo",
        "EXCLUDE_PATHS": ""
      }
    }
  }
}
```

Settings → MCP Servers → Add:
```json
{
  "codewalk": {
    "command": "python",
    "args": ["-m", "src.codewalk.mcp.server"],
    "cwd": "/path/to/codewalk",
    "env": {
      "REPO_PATH": "/path/to/target/repo",
      "EXCLUDE_PATHS": ""
    }
  }
}
```

Add to `~/.codex/mcp.json`:
```json
{
  "mcpServers": {
    "codewalk": {
      "command": "python",
      "args": ["-m", "src.codewalk.mcp.server"],
      "cwd": "/path/to/codewalk",
      "env": {
        "REPO_PATH": "/path/to/target/repo",
        "EXCLUDE_PATHS": ""
      }
    }
  }
}
```

The first time you use Codewalk on a new codebase, it needs to index the files.
You just tell the AI to analyze — the AI handles the rest automatically.
```
SETUP WORKFLOW (run once)

  Step 1  codewalk_analyze_codebase
          │  scans files, builds dependency graph, detects modules
          ▼
  Step 2  codewalk_scan_files(batch=1)
          │  returns ~100 file paths for review
          ▼
  Step 3  codewalk_submit_filtered_files(paths=[...])
          │  submit relevant source files from this batch
          ▼
       More batches? ── YES ─► go to Step 2 (batch=2, 3, ...)
          │ NO
          ▼
  Step 4  codewalk_index_filtered_files
          │  chunks + embeds all submitted files
          ▼
       ✅ READY — all query tools unlocked
```

```
QUERY TOOLS (use after setup)

  codewalk_get_overview          → project summary + diagrams
  codewalk_search_codebase       → semantic code search
  codewalk_get_module_info       → inspect a specific module
  codewalk_explain_function      → AI-powered function explanation
  codewalk_get_blast_radius_map  → change risk analysis
  codewalk_get_reading_order     → optimal file reading sequence
  codewalk_get_execution_flow    → dependency flow diagram
```

```
MAINTENANCE (after code changes)

  codewalk_refresh_analysis  → re-scan without re-embedding
```
💡 Before indexing: Close unnecessary applications (browsers, Slack, Docker, etc.). Indexing loads the embedding model into memory and processes all files at once — freeing up RAM helps it run faster and avoids slowdowns.
You type this in Copilot Chat:
@codewalk analyze this codebase [auto(default) | reindex(update index) | full(delete existing index and generate new index)]
or
@codewalk_analyze_codebase [auto(default) | reindex(update index) | full(delete existing index and generate new index)]
What happens behind the scenes (you don't need to do anything):
1. The AI calls `codewalk_analyze_codebase` → scans all files, detects modules, builds the dependency graph
2. The AI calls `codewalk_scan_files(batch=1)` → gets a batch of file paths
3. The AI reviews the paths — keeps source code (`.py`, `.ts`, `.js`), skips junk (`node_modules/`, `__pycache__/`, test files, images)
4. The AI calls `codewalk_submit_filtered_files(file_paths=[...])` → submits the good files
5. Steps 2-4 repeat for each batch until all files are processed
6. The AI calls `codewalk_index_filtered_files` → embeds everything into the vector database
You'll see progress like:

```
✓ Codebase analyzed — 142 files, 5 modules detected
✓ Scanning batch 1 of 2... submitted 87 source files
✓ Scanning batch 2 of 2... submitted 34 source files (LAST BATCH)
✓ Indexed 121 files → 380 chunks embedded
```
Ready! You can now use these tools (if the AI doesn't call one automatically, run it manually):

- `codewalk_get_overview` — project summary
- `codewalk_search_codebase` — search code by concept
- `codewalk_get_module_info` — inspect a specific module
- `codewalk_explain_function` — explain any function/class
- `codewalk_get_blast_radius_map` — check change risk
- `codewalk_get_reading_order` — optimal file reading order
- `codewalk_get_execution_flow` — dependency flow diagram
Note: After indexing, the AI agent should automatically call these tools. If it doesn't, you can invoke them manually — the hints above tell you exactly which tools to run.
Note: This only happens once. Next time you say `@codewalk analyze this codebase`, it detects the existing index and skips straight to "ready."
Some LLMs stop after one tool call instead of continuing the full workflow. Each tool's output tells you exactly what to call next. If the AI stops, just call the next tool yourself:
| AI stopped after... | You call next |
|---|---|
| `codewalk_analyze_codebase` | `codewalk_scan_files(batch=1)` |
| `codewalk_scan_files` | `codewalk_submit_filtered_files` with the listed paths |
| `codewalk_submit_filtered_files` | `codewalk_scan_files(batch=<next>)` or `codewalk_index_filtered_files` if last batch |
| `codewalk_index_filtered_files` | Any query tool — `codewalk_get_overview`, `codewalk_search_codebase`, etc. |
Tip: Look for the ⏩ NEXT STEP line at the bottom of each tool's output — it tells you exactly what to do.
After indexing is done, here's every tool you can use.
You don't need to remember tool names — just ask naturally and the AI picks the right tool.
Tool: codewalk_get_overview — no parameters needed
You just joined a new team. You have no idea what this project does. Start here.
@codewalk give me an overview of this project
or
@codewalk_get_overview
When to use: Day 1 on a new project. You want to know what you're dealing with.
Tool: codewalk_get_module_info(module_name) — pass the module name
You saw "auth" in the overview and want to dig into it.
@codewalk tell me about the auth module
or
@codewalk_get_module_info auth
When to use: You need to work on a specific module and want to see all its files, classes, and functions at a glance.
Tool: codewalk_explain_function(function_name) — pass the function or class name
Your tech lead mentioned verify_request in a PR review. You have no idea what it does.
@codewalk explain the verify_request function
or
@codewalk_explain_function verify_request
When to use: You see a function name in code/PR/docs and want to understand exactly what it does without reading the whole file yourself.
Tool: codewalk_search_codebase(query) — pass any natural language question
You need to find where database connections are handled but don't know which file.
@codewalk how does this project handle database connections?
or
@codewalk_search_codebase how does this project handle database connections?
When to use: You have a question about a concept ("error handling", "file upload", "caching") and don't know which files to look at.
Tool: codewalk_get_blast_radius_map(target) — pass a module name, file name, or leave empty
You're about to refactor models/base.py. Before you touch it, you want to know the damage.
@codewalk what's the blast radius of base.py? (a module name like auth works too)
or
@codewalk_get_blast_radius_map base.py
When to use: Before refactoring or making changes. "Is it safe to change this, or will half the project break?"
Tool: codewalk_get_reading_order(module_name) — pass a module name or leave empty for entire repo
You want to understand the agent module but don't know which file to read first.
@codewalk what order should I read the agent module?
or
@codewalk_get_reading_order
When to use: You want to understand code without constantly jumping between files wondering "wait, what's this import?"
Tool: codewalk_get_execution_flow(module_name) — pass a module name or leave empty for module-level view
You want to understand how modules connect to each other.
@codewalk show me the execution flow
or
@codewalk_get_execution_flow
When to use: You want to understand "what calls what" — the big picture of how code connects.
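Part of what the execution flow reports is the set of entry modules. Here is a sketch of how they can be derived — toy module names, not Codewalk's actual implementation: with edges "X depends on Y", entry modules are the ones nothing else depends on (no inbound edge):

```python
# Entry modules = modules with no inbound dependency edge.
deps = {                      # module -> modules it depends on
    "api":    {"auth", "models"},
    "cli":    {"utils"},
    "auth":   {"models"},
    "models": set(),
    "utils":  set(),
}

depended_on = {m for targets in deps.values() for m in targets}
entry_modules = sorted(set(deps) - depended_on)
print(entry_modules)  # ['api', 'cli']
```

Intuitively, entry modules are where execution starts (an API layer, a CLI); nothing imports them, they only import others.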
Tool: codewalk_refresh_analysis — no parameters needed
You added 3 new files and refactored a module. The analysis is now stale.
@codewalk refresh the analysis
or
@codewalk_refresh_analysis
When to use: After you commit code changes and want updated blast radius / reading order / execution flow results.
| You want to... | Just say... |
|---|---|
| First-time setup | @codewalk analyze this codebase or @codewalk_analyze_codebase |
| Big picture overview | @codewalk give me an overview or @codewalk_get_overview |
| Understand a module | @codewalk tell me about the auth module or @codewalk_get_module_info auth |
| Understand a function | @codewalk explain the verify_request function or @codewalk_explain_function verify_request |
| Find code by concept | @codewalk how does error handling work? or @codewalk_search_codebase how does error handling work? |
| Check change risk | @codewalk what's the blast radius of config.py? or @codewalk_get_blast_radius_map config.py |
| Find riskiest files | @codewalk show me the riskiest files |
| Best reading order | @codewalk what order should I read the agent module? or @codewalk_get_reading_order agent |
| See dependency flow | @codewalk show me the execution flow or @codewalk_get_execution_flow |
| After code changes | @codewalk refresh the analysis or @codewalk_refresh_analysis |
Base URL: http://localhost:8000
Start the server:

```bash
source .codewalk-env/bin/activate
uvicorn src.codewalk.api.main:app --reload --port 8000
```

```bash
curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "repo_path": "/Users/you/projects/my-app",
    "collection_name": "",
    "index_mode": "auto"
  }'
```

Response:
```json
{
  "status": "success",
  "repo_path": "/Users/you/projects/my-app",
  "files_scanned": 142,
  "chunks_created": 380,
  "modules": ["api", "auth", "models", "utils", "frontend"]
}
```

- `index_mode`: `"auto"` (skip if indexed), `"reindex"` (smart update), `"full"` (wipe & rebuild)
- `collection_name`: leave empty — auto-derived from repo path (e.g. `my-app`)
```bash
curl -N -X POST http://localhost:8000/analyze/stream \
  -H "Content-Type: application/json" \
  -d '{"repo_path": "/Users/you/projects/my-app", "index_mode": "auto"}'
```

Response (Server-Sent Events):
```
data: {"step": "scan", "message": "Scanning files..."}
data: {"step": "scan", "message": "Found 142 files"}
data: {"step": "deps", "message": "Building dependency graph..."}
data: {"step": "modules", "message": "Detected 5 modules"}
data: {"step": "embed", "message": "Embedding 142 files → 380 chunks"}
data: {"step": "done", "message": "Analysis complete!"}
```
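A minimal way to consume that stream from Python is to parse each `data:` frame into a dict. The parser below is a sketch (in a real client you would iterate over the HTTP response line by line, e.g. with an HTTP library's streaming mode, instead of a hard-coded list):

```python
# Parse Server-Sent Events "data:" frames into dicts.
import json

def parse_sse(lines):
    """Yield one dict per `data:` frame; skips blank/other lines."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

stream = [
    'data: {"step": "scan", "message": "Scanning files..."}',
    '',
    'data: {"step": "done", "message": "Analysis complete!"}',
]
events = list(parse_sse(stream))
print(events[-1]["step"])  # done
```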
```bash
curl -X POST http://localhost:8000/refresh
```

Response:
```json
{
  "status": "refreshed",
  "files": 142,
  "modules": ["api", "auth", "models", "utils", "frontend"]
}
```

```bash
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain how authentication works in this project", "thread_id": "thread-1"}'
```

Response:
```json
{
  "answer": "The authentication flow starts in auth/middleware.py which checks JWT tokens on every request. The token validation logic is in auth/jwt.py which uses the python-jose library...",
  "thread_id": "thread-1"
}
```

Multi-turn conversation — use the same `thread_id`:

```bash
# Follow-up question
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What happens if the token expires?", "thread_id": "thread-1"}'
```

```bash
curl http://localhost:8000/overview
```

Response:
```json
{
  "tech_stack": ["Python", "FastAPI", "React"],
  "total_files": 142,
  "total_modules": 5,
  "modules": [
    {"name": "api", "file_count": 12, "depends_on": ["auth", "models"]},
    {"name": "auth", "file_count": 5, "depends_on": ["models"]}
  ],
  "diagram": "graph TD\n  api --> auth\n  api --> models\n  auth --> models",
  "overview_text": "## Project Overview\nTech stack: Python, FastAPI...",
  "riskiest_files": [
    {"file": "models/base.py", "risk_level": "high", "affected_files": 23}
  ]
}
```

```bash
curl http://localhost:8000/modules
```

Response:
```json
{
  "modules": [
    {"name": "api", "file_count": 12, "languages": ["python"]},
    {"name": "auth", "file_count": 5, "languages": ["python"]},
    {"name": "frontend", "file_count": 34, "languages": ["typescript", "css"]}
  ],
  "total": 5
}
```

```bash
curl http://localhost:8000/modules/auth
```

Response:
```json
{
  "name": "auth",
  "file_count": 5,
  "files": ["auth/middleware.py", "auth/jwt.py", "auth/permissions.py", "auth/models.py", "auth/__init__.py"],
  "languages": {"python": 5},
  "depends_on": ["models"],
  "depended_by": ["api"],
  "blast_radius": [
    {"file": "auth/middleware.py", "risk_level": "moderate", "affected_files": 8}
  ],
  "module_risk": "moderate"
}
```

```bash
curl http://localhost:8000/blast-radius
```

Response:
```json
{
  "module": null,
  "module_risk": "high",
  "total_files": 15,
  "files": [
    {
      "file": "models/base.py",
      "risk_level": "high",
      "affected_files": 23,
      "direct": ["api/routes.py", "auth/models.py"],
      "transitive": ["api/views.py", "auth/middleware.py"]
    }
  ]
}
```

```bash
curl http://localhost:8000/blast-radius/auth
```

```bash
curl http://localhost:8000/reading-order
```

Response:
```json
{
  "order": [
    {
      "file": "config.py",
      "position": 1,
      "why": "No internal dependencies",
      "risk_level": "moderate",
      "affected_files": 12,
      "direct": ["embedder.py", "chain.py"],
      "transitive": ["pipeline.py"]
    },
    {
      "file": "models/base.py",
      "position": 2,
      "why": "No internal dependencies | Used by: routes.py, views.py",
      "risk_level": "high",
      "affected_files": 23
    }
  ]
}
```

```bash
curl http://localhost:8000/execution-flow
```

Response:
```json
{
  "flow": "## Execution Flow — Module Level\nEntry modules: api, cli\nTotal modules: 5\n\n### Module Dependencies\n  api (12 files) → depends on: auth, models\n  auth (5 files) → depends on: models\n  models (8 files) → (standalone)\n  utils (6 files) → (standalone)\n  frontend (34 files) → (standalone)"
}
```

```bash
curl http://localhost:8000/health
```

Response:
```json
{
  "status": "ok"
}
```

```
INTERFACES
  Next.js Web UI (:3000)          MCP Server (stdio)      REST API (:8000)
  ├── Overview · Modules · Blast Radius
  ├── Reading Order · Execution Flow
  └── Chat ──────────────┐              │                      │
                         ▼              ▼                      ▼
AGENT LAYER
  LangGraph StateGraph ── LLM (bind_tools) ── 7 agent tools:
    search_codebase, get_overview, get_module_info, explain_function,
    get_blast_radius_map, get_reading_order, get_execution_flow
                         │
                         ▼
ANALYSIS LAYER
  scanner.py ──► dependency_graph.py ──► module_detector
  blast_radius.py (BFS reverse graph) · reading_order.py (topological sort)
  code_parser.py (tree-sitter, 15+ langs)
                         │
                         ▼
EMBEDDING LAYER
  chunker.py ──► embedder.py ──► vector_store.py
  (smart code     (Jina 1.5B      (ChromaDB
   chunks)         MPS/CUDA)       persistent)
                         │
                         ▼
LLM LAYER
  config.py ──► get_llm() factory
  Ollama │ OpenAI │ Anthropic │ Gemini │ Groq │ ...
```
```
codewalk/
├── src/codewalk/
│   ├── config.py                  # Settings + LLM provider factory
│   ├── pipeline.py                # Orchestration (parallel embed)
│   ├── ingestion/                 # File scanning & tech detection
│   │   ├── scanner.py             # File enumeration
│   │   ├── file_filter.py         # Skip rules (node_modules, etc.)
│   │   └── tech_detect.py         # Language/framework detection
│   ├── analysis/                  # Code parsing & dependency analysis
│   │   ├── code_parser.py         # Tree-sitter (15+ languages)
│   │   ├── dependency_graph.py    # Import extraction → graph
│   │   ├── module_detector.py     # Auto-grouping into modules
│   │   ├── blast_radius.py        # Change impact (BFS)
│   │   └── reading_order.py       # Topological sort
│   ├── embeddings/                # Vectorization
│   │   ├── chunker.py             # Code → chunks
│   │   ├── embedder.py            # Chunks → vectors
│   │   └── vector_store.py        # ChromaDB storage
│   ├── agent/                     # LangGraph chat agent
│   │   ├── graph.py               # StateGraph + fallback parser
│   │   ├── tools.py               # 7 tool functions
│   │   └── prompts.py             # System prompt
│   ├── api/                       # FastAPI REST
│   │   ├── main.py                # 12 endpoints
│   │   ├── models.py              # Pydantic schemas
│   │   └── state.py               # Singleton app state
│   └── mcp/                       # Model Context Protocol
│       └── server.py              # 12 MCP tools (stdio)
│
├── frontend/                      # Next.js 14 web UI
│   └── src/app/
│       ├── page.tsx               # Home (analyze form)
│       ├── chat/page.tsx          # AI chat interface
│       ├── overview/page.tsx      # Project overview
│       ├── modules/page.tsx       # Module browser
│       ├── module/page.tsx        # Single module detail
│       ├── blast-radius/page.tsx  # Change impact viewer
│       ├── reading-order/page.tsx # Reading order viewer
│       └── execution-flow/page.tsx # Flow diagram viewer
│
├── data/
│   └── chroma/                    # ChromaDB persistent storage
│
├── requirements.txt               # Python dependencies
├── .env                           # Configuration (gitignored)
└── .vscode/mcp.json               # MCP server config
```
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `ollama` | LLM backend: ollama, openai, anthropic, gemini, groq, openrouter |
| `LLM_MODEL` | `qwen3.5:27b` | Model name (must match provider) |
| `EMBEDDING_MODEL` | `jinaai/jina-code-embeddings-1.5b` | Sentence-transformer model for code embeddings |
| `REPO_PATH` | `src/codewalk` | Default repository path to analyze |
| `EXCLUDE_PATHS` | — | Comma-separated paths to exclude from scanning (e.g. `tests,docs,*.generated.*`) |
| `GROQ_API_KEY` | — | Groq API key |
| `OPENAI_API_KEY` | — | OpenAI API key |
| `ANTHROPIC_API_KEY` | — | Anthropic API key |
| `GOOGLE_API_KEY` | — | Google Gemini API key |
| `OPENROUTER_API_KEY` | — | OpenRouter API key |
| Provider | Set `LLM_PROVIDER=` | API Key | Notes |
|---|---|---|---|
| Ollama | `ollama` | None | Fully local, no internet. Run `ollama serve` first |
| OpenAI | `openai` | `OPENAI_API_KEY` | GPT models, etc. |
| Anthropic | `anthropic` | `ANTHROPIC_API_KEY` | Claude models |
| Google Gemini | `gemini` | `GOOGLE_API_KEY` | Gemini models |
| Groq | `groq` | `GROQ_API_KEY` | Groq models |
| OpenRouter | `openrouter` | `OPENROUTER_API_KEY` | Access to 100+ models |
To wipe all indexed data and start fresh, delete the data/chroma/ directory:
```bash
# From the codewalk project root:
rm -rf data/chroma/
```

This removes all embedded chunks and collections. Next time you run `codewalk_analyze_codebase` (MCP) or `POST /analyze` (API), it will re-index from scratch.
When to do this:
- You switched to a different repo and want a clean index
- Embeddings seem stale or corrupted
- You changed the embedding model and need to re-embed everything
- You want to use `index_mode: "full"` but it's still picking up old data
| Layer | Technology |
|---|---|
| Backend | Python 3.10+, FastAPI, Uvicorn |
| Agent | LangGraph, LangChain |
| Vector DB | ChromaDB (persistent, local) |
| Embeddings | Jina Code Embeddings 1.5B (1536-dim, MPS/CUDA) |
| Code Parsing | Tree-sitter (15+ language grammars) |
| Frontend | Next.js 14, React 18, TypeScript 5 |
| Styling | Tailwind CSS, shadcn/ui |
| Diagrams | Mermaid.js |
| MCP | Model Context Protocol (stdio transport) |
1. Fork this repo
2. Clone your fork: `git clone https://github.com/<your-username>/codewalk.git`
3. Create a branch: `git checkout -b feat/my-feature`
4. Make your changes and test them
5. Commit: `git commit -m "feat: add my feature"`
6. Push: `git push origin feat/my-feature`
7. Open a Pull Request against `master`
All contributions welcome — bug fixes, new language support, UI improvements, docs, anything.
Found a bug? Open an issue with screenshots, error logs, or references — it helps us fix it faster.
⭐ If you find Codewalk useful, give it a star — it helps others discover it!
Built by gupta29470
LinkedIn · Twitter/X



