learn/guides/mcp/KnowledgeBase.md
Lines changed: 83 additions & 23 deletions
Original file line number
Diff line number
Diff line change
@@ -14,17 +14,91 @@ It indexes four distinct types of content:
14
14
15
15
## Architecture
16
16
17
-
The server is built on two key technologies:
17
+
The server is built on a robust, service-oriented architecture designed for reliability and extensibility.
18
18
19
19
### 1. OpenAPI-Driven Design
20
-
Unlike typical MCP servers that hardcode their tools, this server is entirely driven by an **OpenAPI 3.0 Specification**. The `openapi.yaml` file defines the tools, their arguments, and their documentation. The server reads this spec at startup to dynamically generate:
21
-
* **Zod Validation Schemas:** Ensuring every tool call is type-safe.
22
-
* **JSON Tool Definitions:** For the MCP client to discover.
23
-
24
-
### 2. ChromaDB & Vector Search
20
+
Unlike typical MCP servers that hardcode their tools, this server is entirely driven by an **OpenAPI 3.0 Specification**.
21
+
- **Source of Truth:** `openapi.yaml` defines every tool, argument, and return type.
22
+
- **Dynamic Validation:** `OpenApiValidator.mjs` generates Zod schemas at runtime to ensure strict type safety for all tool calls.
23
+
- **Tool Discovery:** `toolService.mjs` dynamically maps OpenAPI operations to service handlers.
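
The mapping from OpenAPI operations to handlers described above can be pictured with a minimal sketch. It is not the code in `toolService.mjs`: the spec fragment, the paths, the `handlers` object, and `buildToolRegistry` are all hypothetical, and it assumes each operation is identified by its `operationId`.

```javascript
// Hypothetical, trimmed-down spec object; the real openapi.yaml is not reproduced here.
const spec = {
    paths: {
        '/documents/query': {
            post: {operationId: 'query_documents', summary: 'Performs a semantic search.'}
        },
        '/healthcheck': {
            get: {operationId: 'healthcheck', summary: 'Checks dependencies.'}
        }
    }
};

// Hypothetical handlers keyed by operationId.
const handlers = {
    query_documents: args => `searching for ${args.query}`,
    healthcheck    : ()   => 'ok'
};

const HTTP_METHODS = ['get', 'post', 'put', 'patch', 'delete'];

// Walk every operation in the spec and pair it with the handler of the same name.
function buildToolRegistry(spec, handlers) {
    const registry = new Map();

    for (const [path, pathItem] of Object.entries(spec.paths)) {
        for (const method of HTTP_METHODS) {
            const operation = pathItem[method];
            if (!operation?.operationId) continue;

            const handler = handlers[operation.operationId];
            if (typeof handler !== 'function') {
                throw new Error(`No handler for operation "${operation.operationId}"`);
            }

            registry.set(operation.operationId, {
                description: operation.summary ?? '',
                handler,
                method,
                path
            });
        }
    }

    return registry;
}

const registry = buildToolRegistry(spec, handlers);
```

Failing at startup when an operation has no handler keeps the spec and the service code from drifting apart silently.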
24
+
25
+
### 2. Core Services
26
+
The server logic is distributed across specialized services:
27
+
28
+
#### QueryService (`services/QueryService.mjs`)
29
+
The brain of the operation. It handles the "Two-Stage Query Protocol" (Stage 1).
30
+
- **Embeddings:** Uses Google's `text-embedding-004` model via the Gemini API to convert queries into vectors.
31
+
- **Hybrid Search:** Combines vector similarity with a **Weighted Scoring Algorithm**:
32
+
    - **Boosts:** Matches in file paths (+40), filenames (+30), class names (+20), guides (+50).
33
+
    - **Penalties:** Tickets (-70) and Release Notes (-50) are penalized to prioritize current code and documentation, unless explicitly requested.
34
+
    - **Inheritance:** Uses pre-calculated inheritance chains to boost parent classes (+80) with a decay factor, ensuring architectural context is preserved.
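
As a rough illustration of how the listed boosts and penalties could combine, here is a minimal sketch. The numeric weights come from the list above; everything else is an assumption, including the document shape (`path`, `className`, `type`), the substring matching, the reading of "guides (+50)" as a boost for guide-type documents, the decay factor of `0.5`, and the ordering of the inheritance chain (nearest parent first). It does not show how `QueryService.mjs` merges these scores with vector similarity.

```javascript
// Boost and penalty values as listed above; everything else is a hypothetical shape.
const WEIGHTS = {path: 40, fileName: 30, className: 20, guide: 50, ticket: -70, release: -50, parent: 80};
const INHERITANCE_DECAY = 0.5; // assumed decay factor, not taken from the source

function scoreDocument(doc, keyword, requestedType = 'all') {
    const kw = keyword.toLowerCase();
    let score = 0;

    const filePath = (doc.path ?? '').toLowerCase();
    const fileName = filePath.split('/').pop();

    if (filePath.includes(kw)) score += WEIGHTS.path;
    if (fileName.includes(kw)) score += WEIGHTS.fileName;
    if ((doc.className ?? '').toLowerCase().includes(kw)) score += WEIGHTS.className;
    if (doc.type === 'guide') score += WEIGHTS.guide;

    // Penalties are skipped when the caller explicitly asked for that type.
    if (doc.type === 'ticket'  && requestedType !== 'ticket')  score += WEIGHTS.ticket;
    if (doc.type === 'release' && requestedType !== 'release') score += WEIGHTS.release;

    return score;
}

// Boost each ancestor of a matched class, shrinking the boost at every level.
function inheritanceBoosts(chain) {
    const boosts = new Map();

    chain.forEach((className, level) => {
        boosts.set(className, WEIGHTS.parent * INHERITANCE_DECAY ** level);
    });

    return boosts;
}
```

With these assumed values, the direct parent receives the full +80 and each further ancestor receives half of the previous boost.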
- **Extract:** Reads from `docs/output/all.json` (JSDoc), `learn/tree.json` (Guides), and `.github/` (Tickets/Releases).
39
+
- **Transform:** Normalizes content into a unified JSONL format (`dist/ai-knowledge-base.jsonl`). It generates a **Content Hash** (SHA-256) for each chunk to detect changes.
40
+
- **Load:** "Upserts" vectors into ChromaDB. It uses the content hash to perform a diff, ensuring only new or modified chunks are re-embedded, saving time and API costs.
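
The hash-based diff in the Load step can be sketched as follows. This is an illustration only: `planUpsert`, the `Map` of stored hashes, and the chunk shape (`id`, `hash`) are hypothetical, and the hashes are assumed to have been computed already during the Transform step.

```javascript
// existingHashes: Map of chunk id -> content hash already stored (hypothetical shape).
// chunks: freshly generated records, each with an id and a precomputed content hash.
function planUpsert(existingHashes, chunks) {
    const toEmbed = [];
    let unchanged = 0;

    for (const chunk of chunks) {
        if (existingHashes.get(chunk.id) === chunk.hash) {
            unchanged++;         // same content as before: skip the embedding call
        } else {
            toEmbed.push(chunk); // new id or changed hash: needs a fresh embedding
        }
    }

    return {toEmbed, unchanged};
}

const plan = planUpsert(
    new Map([['a', 'h1'], ['b', 'h2']]),
    [{id: 'a', hash: 'h1'}, {id: 'b', hash: 'changed'}, {id: 'c', hash: 'h3'}]
);
```

Only the chunks in `plan.toEmbed` would be sent to the embedding API, which is where the time and cost savings come from.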
41
+
42
+
#### HealthService (`services/HealthService.mjs`)
43
+
The gatekeeper.
44
+
- **Intelligent Caching:** Caches "healthy" status for 5 minutes to reduce overhead. Unhealthy states are never cached, allowing immediate recovery detection.
45
+
- **Gatekeeping:** Every tool call passes through `ensureHealthy()`. If dependencies (ChromaDB, API Key) are missing, it fails fast with actionable error messages.
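
The caching rule above (cache healthy results for 5 minutes, never cache unhealthy ones) can be sketched like this. It is a simplified, synchronous illustration rather than the code in `HealthService.mjs`: `createHealthGate`, the injected `checkFn` and `now`, and the `{healthy, reason}` result shape are assumptions, and a real check would most likely be asynchronous.

```javascript
const HEALTHY_TTL_MS = 5 * 60 * 1000; // 5 minutes, as described above

// checkFn and now are injected so the sketch stays testable; both are hypothetical.
function createHealthGate(checkFn, now = () => Date.now()) {
    let cached = null; // {result, expiresAt} or null

    function getStatus() {
        if (cached && now() < cached.expiresAt) {
            return cached.result;
        }

        const result = checkFn();

        // Only healthy results are cached; unhealthy ones are re-checked on every call.
        cached = result.healthy ? {result, expiresAt: now() + HEALTHY_TTL_MS} : null;

        return result;
    }

    function ensureHealthy() {
        const status = getStatus();

        if (!status.healthy) {
            throw new Error(`Knowledge base unavailable: ${status.reason}`);
        }
    }

    return {getStatus, ensureHealthy};
}
```

Because an unhealthy result is never stored, the first call after ChromaDB comes back up sees the recovery immediately instead of waiting for a cache entry to expire.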
- Allows raw access to the indexed documents in ChromaDB to verify content and metadata.
55
+
56
+
### 3. ChromaDB & Vector Search
25
57
The server manages a local instance of **ChromaDB**, a high-performance vector database.
26
-
* **Embeddings:** We use Google's `text-embedding-004` model to convert code and text into high-dimensional vectors.
27
-
* **Hybrid Search:** The server combines vector similarity (semantic match) with keyword boosting (exact match) to deliver the best results.
58
+
- **Persistence:** Data is stored locally in `chroma-neo-knowledge-base/`.
59
+
- **Collection:** Uses a single collection `neo-knowledge-base` for all content types.
60
+
61
+
## Available Tools
62
+
63
+
### Query Tools
64
+
These are the primary tools used by agents to retrieve information.
65
+
66
+
* **`query_documents`**: Performs a semantic search.
67
+
    * `query`: The natural language question.
68
+
    * `type`: Filter by content type (`guide`, `src`, `ticket`, `blog`, `release`, `example`, `all`).
69
+
    * *Best Practice:* Start broad, then narrow down. Use `type` to focus the search.
70
+
71
+
* **`list_documents`**: Retrieves a paginated list of all indexed documents (for inspection).
72
+
* **`get_document_by_id`**: Retrieves a specific document chunk by its ID.
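
For orientation, a `query_documents` call from an MCP client could look roughly like the sketch below. The `tools/call` JSON-RPC envelope follows the Model Context Protocol; `buildQueryRequest` is a hypothetical helper, and defaulting `type` to `all` is an assumption rather than documented server behavior.

```javascript
const CONTENT_TYPES = ['guide', 'src', 'ticket', 'blog', 'release', 'example', 'all'];

// Hypothetical helper that builds the JSON-RPC request an MCP client would send.
function buildQueryRequest(id, query, type = 'all') {
    if (!CONTENT_TYPES.includes(type)) {
        throw new RangeError(`Unknown content type: ${type}`);
    }

    return {
        jsonrpc: '2.0',
        id,
        method : 'tools/call',
        params : {name: 'query_documents', arguments: {query, type}}
    };
}

// Start broad, then narrow down with `type`.
const broad  = buildQueryRequest(1, 'Component lifecycle');
const narrow = buildQueryRequest(2, 'Component lifecycle', 'guide');
```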
73
+
74
+
### Database Management Tools
75
+
These tools manage the knowledge base lifecycle.
76
+
77
+
* **`sync_database`**: **The "One Button" Update.** Triggers the full ETL pipeline:
78
+
    1. Scans source files.
79
+
    2. Creates `ai-knowledge-base.jsonl`.
80
+
    3. Embeds changes into ChromaDB.
81
+
    * *Use when:* You have modified code or docs and want the agent to "learn" the changes.
82
+
83
+
* **`create_knowledge_base`**: Runs only the "Extract & Transform" steps (creates the JSONL file). Useful for debugging the extraction logic without incurring embedding costs.
84
+
* **`embed_knowledge_base`**: Runs only the "Load" step (embeds the JSONL file). Useful if the JSONL file was manually edited or verified.
85
+
* **`delete_database`**: **Destructive.** Deletes the entire ChromaDB collection.
86
+
87
+
### Infrastructure Tools
88
+
These tools manage the underlying services.
89
+
90
+
* **`healthcheck`**: Diagnostic tool. Checks ChromaDB connectivity, collection status, and API key presence.
91
+
* **`start_database`**: Starts the local ChromaDB process.
92
+
* **`stop_database`**: Stops the local ChromaDB process.
93
+
94
+
## Configuration
95
+
96
+
The server is configured via `ai/mcp/server/knowledge-base/config.mjs` or environment variables.
97
+
98
+
**Key Environment Variables:**
99
+
* `GEMINI_API_KEY`: **Required.** Used for generating text embeddings.
100
+
* `CHROMA_DATA_PATH`: Path to store vector data (default: `./chroma-neo-knowledge-base`).
101
+
* `CHROMA_PORT`: Port for the database (default: `8000`).
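
A minimal sketch of how these variables and their defaults could be resolved is shown below. `loadConfig` and the returned property names are hypothetical; the actual logic lives in `config.mjs` and may differ.

```javascript
// Hypothetical loader: takes an env-like object so it can be exercised without real secrets.
function loadConfig(env) {
    if (!env.GEMINI_API_KEY) {
        throw new Error('GEMINI_API_KEY is required for generating text embeddings.');
    }

    return {
        geminiApiKey  : env.GEMINI_API_KEY,
        chromaDataPath: env.CHROMA_DATA_PATH ?? './chroma-neo-knowledge-base',
        chromaPort    : Number(env.CHROMA_PORT ?? 8000)
    };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)`, failing fast at startup if the API key is missing.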
* **Start Broad:** Query for concepts first (e.g., "Component lifecycle").
46
-
* **Narrow Down:** Use the results to find specific class names (e.g., "Neo.form.field.Base").
47
-
* **Check History:** Use `type: 'ticket'` to see if a specific bug has been reported before.
48
-
49
118
### Stage 2: Querying Memory
50
-
(Handled by the [Memory Core Server](./MemoryCore.md))
51
-
52
-
## Automatic Synchronization
53
-
54
-
The server is **self-maintaining**. On startup, it runs a health check and, if necessary, triggers an automatic synchronization process (`sync_database`). This ETL (Extract, Transform, Load) pipeline:
55
-
1. Reads the latest source code and markdown files.
56
-
2. Generates embeddings for new or modified content.
57
-
3. Updates the vector database.
58
-
59
-
This ensures the agent always has the most up-to-date understanding of the project, even as you modify the code.
119
+
(Handled by the [Memory Core Server](./MemoryCore.md))