Commit afced43

Enhance Knowledge Base MCP Server Guide #7854
1 parent a3c446f commit afced43

1 file changed

Lines changed: 83 additions & 23 deletions

learn/guides/mcp/KnowledgeBase.md

@@ -14,17 +14,91 @@ It indexes four distinct types of content:
1414

1515
## Architecture
1616

17-
The server is built on two key technologies:
17+
The server uses a service-oriented architecture: an OpenAPI specification defines the tools, and a set of specialized services implements them.
1818

1919
### 1. OpenAPI-Driven Design
20-
Unlike typical MCP servers that hardcode their tools, this server is entirely driven by an **OpenAPI 3.0 Specification**. The `openapi.yaml` file defines the tools, their arguments, and their documentation. The server reads this spec at startup to dynamically generate:
21-
* **Zod Validation Schemas:** Ensuring every tool call is type-safe.
22-
* **JSON Tool Definitions:** For the MCP client to discover.
23-
24-
### 2. ChromaDB & Vector Search
20+
Unlike typical MCP servers that hardcode their tools, this server is entirely driven by an **OpenAPI 3.0 Specification**.
21+
- **Source of Truth:** `openapi.yaml` defines every tool, argument, and return type.
22+
- **Dynamic Validation:** `OpenApiValidator.mjs` generates Zod schemas at runtime to ensure strict type safety for all tool calls.
23+
- **Tool Discovery:** `toolService.mjs` dynamically maps OpenAPI operations to service handlers.
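
The mapping from OpenAPI operations to handlers described above could look roughly like the following. This is a minimal sketch, not the code in `toolService.mjs`: the `spec` object stands in for a parsed `openapi.yaml`, and the paths, the `handlers` object, and `buildToolRegistry` are hypothetical names chosen for illustration.

```javascript
// Hypothetical, trimmed-down stand-in for the parsed openapi.yaml.
// The tool names come from this guide; the paths are invented.
const spec = {
    paths: {
        '/documents/query': {
            post: {operationId: 'query_documents', summary: 'Performs a semantic search.'}
        },
        '/healthcheck': {
            get: {operationId: 'healthcheck', summary: 'Checks server dependencies.'}
        }
    }
};

// Hypothetical service handlers, keyed by operationId.
const handlers = {
    query_documents: async ({query}) => ({query, results: []}),
    healthcheck    : async () => ({status: 'healthy'})
};

// Walks every operation in the spec and pairs it with its handler.
// Fails at startup if the spec names an operation nobody implements.
function buildToolRegistry(spec, handlers) {
    const registry = new Map();

    for (const [path, methods] of Object.entries(spec.paths)) {
        for (const [method, operation] of Object.entries(methods)) {
            const handler = handlers[operation.operationId];

            if (!handler) {
                throw new Error(`No handler for operation "${operation.operationId}"`);
            }

            registry.set(operation.operationId, {
                description: operation.summary,
                method,
                path,
                handler
            });
        }
    }

    return registry;
}

const registry = buildToolRegistry(spec, handlers);
```

Because the registry is derived from the spec, adding a tool in this sketch means adding one operation and one handler, with no separate tool list to keep in sync.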
24+
25+
### 2. Core Services
26+
The server logic is distributed across specialized services:
27+
28+
#### QueryService (`services/QueryService.mjs`)
29+
Implements Stage 1 of the Two-Stage Query Protocol.
30+
- **Embeddings:** Uses Google's `text-embedding-004` model via the Gemini API to convert queries into vectors.
31+
- **Hybrid Search:** Combines vector similarity with a **Weighted Scoring Algorithm**:
32+
- **Boosts:** Matches in file paths (+40), filenames (+30), class names (+20), guides (+50).
33+
- **Penalties:** Tickets (-70) and Release Notes (-50) are penalized to prioritize current code and documentation, unless explicitly requested.
34+
- **Inheritance:** Uses pre-calculated inheritance chains to boost parent classes (+80) with a decay factor, ensuring architectural context is preserved.
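
To make the weights above concrete, here is a minimal sketch of how such a scoring pass might work. Only the numeric weights come from the list; the document fields (`type`, `path`, `className`), the way boosts are added to a base score, the rule that the guide boost needs a path match, and the `0.5` decay value are all assumptions for illustration, not the logic in `QueryService.mjs`.

```javascript
// Weights quoted from the list above.
const WEIGHTS = Object.freeze({
    pathMatch     : 40,
    fileNameMatch : 30,
    classNameMatch: 20,
    guideMatch    : 50,
    ticket        : -70,
    release       : -50,
    parentClass   : 80
});

// Scores one search result against a single search term.
// `baseScore` stands in for whatever the vector similarity contributes (assumption).
function scoreResult(doc, term, {baseScore = 0, requestedType = 'all'} = {}) {
    const needle    = term.toLowerCase();
    const path      = (doc.path ?? '').toLowerCase();
    const fileName  = path.split('/').pop();
    const className = (doc.className ?? '').toLowerCase();
    let score = baseScore;

    if (path.includes(needle))      score += WEIGHTS.pathMatch;
    if (fileName.includes(needle))  score += WEIGHTS.fileNameMatch;
    if (className.includes(needle)) score += WEIGHTS.classNameMatch;

    // Assumption: the guide boost applies to guides whose path matches the term.
    if (doc.type === 'guide' && path.includes(needle)) score += WEIGHTS.guideMatch;

    // Penalties are skipped when the caller explicitly asked for that type.
    if (doc.type === 'ticket'  && requestedType !== 'ticket')  score += WEIGHTS.ticket;
    if (doc.type === 'release' && requestedType !== 'release') score += WEIGHTS.release;

    return score;
}

// Boost for the n-th ancestor in an inheritance chain (0 = direct parent).
// The decay value is a placeholder, not the server's actual factor.
function inheritanceBoost(depth, decay = 0.5) {
    return WEIGHTS.parentClass * decay ** depth;
}
```

With these assumptions, a source file at `src/button/Base.mjs` with class `Neo.button.Base` scores 60 for the term `button` (path match plus class name match), while a ticket with no matching fields scores -70 unless `requestedType` is `'ticket'`.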
35+
36+
#### DatabaseService (`services/DatabaseService.mjs`)
37+
The ETL (Extract, Transform, Load) engine.
38+
- **Extract:** Reads from `docs/output/all.json` (JSDoc), `learn/tree.json` (Guides), and `.github/` (Tickets/Releases).
39+
- **Transform:** Normalizes content into a unified JSONL format (`dist/ai-knowledge-base.jsonl`). It generates a **Content Hash** (SHA-256) for each chunk to detect changes.
40+
- **Load:** "Upserts" vectors into ChromaDB. It uses the content hash to perform a diff, ensuring only new or modified chunks are re-embedded, saving time and API costs.
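
The hash-based diff in the Load step can be sketched as follows. This is an illustration under assumptions, not the code in `DatabaseService.mjs`: each chunk is assumed to carry an `id` and a precomputed `hash` (for example a SHA-256 hex digest from Node's `crypto.createHash('sha256')`), and `storedHashes` is a hypothetical `Map` of the hashes already in the collection. Returning stale IDs for deletion is an extra step this sketch adds; the guide only describes re-embedding new or modified chunks.

```javascript
// Decides which chunks need a (paid) embedding call.
// chunks:       [{id, hash, ...}] from the freshly generated JSONL file
// storedHashes: Map<id, hash> describing what the vector store already holds
function selectChunksToEmbed(chunks, storedHashes) {
    const toEmbed = [];

    for (const chunk of chunks) {
        // New chunk (no stored hash) or modified chunk (hash differs)
        if (storedHashes.get(chunk.id) !== chunk.hash) {
            toEmbed.push(chunk);
        }
    }

    // Assumption: IDs that no longer exist in the source are candidates for removal
    const currentIds = new Set(chunks.map(chunk => chunk.id));
    const toDelete   = [...storedHashes.keys()].filter(id => !currentIds.has(id));

    return {toEmbed, toDelete};
}
```

Unchanged chunks never reach the embedding API in this scheme, which is where the time and cost savings come from.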
41+
42+
#### HealthService (`services/HealthService.mjs`)
43+
The gatekeeper.
44+
- **Intelligent Caching:** Caches "healthy" status for 5 minutes to reduce overhead. Unhealthy states are never cached, allowing immediate recovery detection.
45+
- **Gatekeeping:** Every tool call passes through `ensureHealthy()`. If dependencies (ChromaDB, API Key) are missing, it fails fast with actionable error messages.
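
The caching rule above (cache healthy results for 5 minutes, never cache unhealthy ones) can be sketched like this. Only the method name `ensureHealthy()` comes from the guide; the `HealthGate` class, the shape of the check result, and the injectable clock are assumptions made so the behaviour is easy to follow and test.

```javascript
const HEALTHY_TTL_MS = 5 * 60 * 1000; // 5 minutes, as stated above

class HealthGate {
    // check: async function resolving to {healthy: boolean, details: string[]} (assumed shape)
    // now:   clock function, injectable for tests
    constructor(check, now = () => Date.now()) {
        this.check    = check;
        this.now      = now;
        this.cached   = null;
        this.cachedAt = 0;
    }

    async status() {
        // Serve a cached result only if it was healthy and is still fresh
        if (this.cached && this.now() - this.cachedAt < HEALTHY_TTL_MS) {
            return this.cached;
        }

        const result = await this.check();

        if (result.healthy) {
            this.cached   = result;
            this.cachedAt = this.now();
        } else {
            // Never cache failures, so recovery is noticed on the next call
            this.cached = null;
        }

        return result;
    }

    // Fail fast with a message that names what is missing
    async ensureHealthy() {
        const result = await this.status();

        if (!result.healthy) {
            throw new Error(`Knowledge base is unhealthy: ${result.details.join('; ')}`);
        }
    }
}
```

The asymmetry is deliberate: a stale "healthy" answer costs at most one failed tool call, while a stale "unhealthy" answer would block every call until the cache expired.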
46+
47+
#### DatabaseLifecycleService (`services/DatabaseLifecycleService.mjs`)
48+
Process manager.
49+
- Automatically manages the local `chroma` server process.
50+
- Can start/stop the database on demand via tools.
51+
52+
#### DocumentService (`services/DocumentService.mjs`)
53+
Inspection and debugging.
54+
- Allows raw access to the indexed documents in ChromaDB to verify content and metadata.
55+
56+
### 3. ChromaDB & Vector Search
2557
The server manages a local instance of **ChromaDB**, a high-performance vector database.
26-
* **Embeddings:** We use Google's `text-embedding-004` model to convert code and text into high-dimensional vectors.
27-
* **Hybrid Search:** The server combines vector similarity (semantic match) with keyword boosting (exact match) to deliver the best results.
58+
- **Persistence:** Data is stored locally in `chroma-neo-knowledge-base/`.
59+
- **Collection:** Uses a single collection `neo-knowledge-base` for all content types.
60+
61+
## Available Tools
62+
63+
### Query Tools
64+
These are the primary tools used by agents to retrieve information.
65+
66+
* **`query_documents`**: Performs a semantic search.
67+
* `query`: The natural language question.
68+
* `type`: Filter by content type (`guide`, `src`, `ticket`, `blog`, `release`, `example`, `all`).
69+
* *Best Practice:* Start broad, then narrow down. Use `type` to focus the search.
70+
71+
* **`list_documents`**: Retrieves a paginated list of all indexed documents (for inspection).
72+
* **`get_document_by_id`**: Retrieves a specific document chunk by its ID.
73+
74+
### Database Management Tools
75+
These tools manage the knowledge base lifecycle.
76+
77+
* **`sync_database`**: **The "One Button" Update.** Triggers the full ETL pipeline:
78+
1. Scans source files.
79+
2. Creates `ai-knowledge-base.jsonl`.
80+
3. Embeds changes into ChromaDB.
81+
* *Use when:* You have modified code or docs and want the agent to "learn" the changes.
82+
83+
* **`create_knowledge_base`**: Runs only the "Extract & Transform" steps (creates the JSONL file). Useful for debugging the extraction logic without incurring embedding costs.
84+
* **`embed_knowledge_base`**: Runs only the "Load" step (embeds the JSONL file). Useful if the JSONL file was manually edited or verified.
85+
* **`delete_database`**: **Destructive.** Deletes the entire ChromaDB collection.
86+
87+
### Infrastructure Tools
88+
These tools manage the underlying services.
89+
90+
* **`healthcheck`**: Diagnostic tool. Checks ChromaDB connectivity, collection status, and API key presence.
91+
* **`start_database`**: Starts the local ChromaDB process.
92+
* **`stop_database`**: Stops the local ChromaDB process.
93+
94+
## Configuration
95+
96+
The server is configured via `ai/mcp/server/knowledge-base/config.mjs` or environment variables.
97+
98+
**Key Environment Variables:**
99+
* `GEMINI_API_KEY`: **Required.** Used for generating text embeddings.
100+
* `CHROMA_DATA_PATH`: Path to store vector data (default: `./chroma-neo-knowledge-base`).
101+
* `CHROMA_PORT`: Port for the database (default: `8000`).
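
As an illustration of how these variables and their defaults fit together, here is a minimal sketch of a loader. The variable names and default values are taken from the list above; the `loadConfig` function and the returned property names are hypothetical, and the real `config.mjs` may be structured differently.

```javascript
// Reads the three documented environment variables and applies the documented defaults.
function loadConfig(env = process.env) {
    if (!env.GEMINI_API_KEY) {
        throw new Error('GEMINI_API_KEY is required for generating text embeddings.');
    }

    const chromaPort = Number.parseInt(env.CHROMA_PORT ?? '8000', 10);

    if (!Number.isInteger(chromaPort) || chromaPort < 1 || chromaPort > 65535) {
        throw new Error(`CHROMA_PORT must be a valid port number, got "${env.CHROMA_PORT}"`);
    }

    return {
        geminiApiKey  : env.GEMINI_API_KEY,
        chromaDataPath: env.CHROMA_DATA_PATH ?? './chroma-neo-knowledge-base',
        chromaPort
    };
}
```

Passing `env` as a parameter keeps the loader testable without touching the real process environment.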
28102

29103
## The Two-Stage Query Protocol
30104

@@ -41,19 +115,5 @@ await KB_QueryService.queryDocuments({
41115
});
42116
```
43117

44-
**Best Practices for Agents:**
45-
* **Start Broad:** Query for concepts first (e.g., "Component lifecycle").
46-
* **Narrow Down:** Use the results to find specific class names (e.g., "Neo.form.field.Base").
47-
* **Check History:** Use `type: 'ticket'` to see if a specific bug has been reported before.
48-
49118
### Stage 2: Querying Memory
50-
(Handled by the [Memory Core Server](./MemoryCore.md))
51-
52-
## Automatic Synchronization
53-
54-
The server is **self-maintaining**. On startup, it runs a health check and, if necessary, triggers an automatic synchronization process (`sync_database`). This ETL (Extract, Transform, Load) pipeline:
55-
1. Reads the latest source code and markdown files.
56-
2. Generates embeddings for new or modified content.
57-
3. Updates the vector database.
58-
59-
This ensures the agent always has the most up-to-date understanding of the project, even as you modify the code.
119+
(Handled by the [Memory Core Server](./MemoryCore.md))
