Merged
3 changes: 2 additions & 1 deletion knowledge server/.gitignore
@@ -1 +1,2 @@
config.yaml
config.yaml
datasource.yaml
113 changes: 105 additions & 8 deletions knowledge server/README.md
@@ -4,19 +4,23 @@ A Model Context Protocol (MCP) server that provides AI agents with access to a k

## Features

- **Document Loading**: Supports PDF and Markdown files from local directories
- **Document Loading**: Supports PDF and Markdown files from local directories, Lark Docs, Lark Wikis, and Lark Wiki Spaces
- **Vector Storage**: Uses Milvus for efficient vector similarity search with full-text search support
- **Embeddings**: Configurable embeddings via Ollama
- **Text Chunking**: Recursive character text splitting with configurable chunk size and overlap
- **MCP Integration**: Exposes knowledge base queries through FastMCP server
- **Lark Integration**: Direct integration with Lark Suite for loading documents, wikis, and entire wiki spaces
- **Flexible Configuration**: YAML-based configuration for easy customization

## Architecture

```
┌─────────────┐ ┌──────────────┐ ┌────────────┐
│ Datasource │─────▶│ Loader │─────▶│ Splitter │
│ (YAML) │ │ (PDF/MD) │ │ │
│ (YAML) │ │ Directory │ │ │
│ │ │ Lark Doc │ │ │
│ │ │ Lark Wiki │ │ │
│ │ │ Lark Space │ │ │
└─────────────┘ └──────────────┘ └────────────┘
@@ -55,19 +59,73 @@ vector_store:
chunk_size: 1000
chunk_overlap: 200
embeddings:
  provider: ollama
  model: nomic-embed-text
  source: ollama
  model: embeddinggemma:latest
lark:
  domain: "https://open.larksuite.com"
  app_id: "your_app_id"
  app_secret: "your_app_secret"
```

**Configuration Options:**
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR) - applies to both application and Lark client
- `vector_store`: Milvus configuration
- `chunk_size`: Size of text chunks for splitting
- `chunk_overlap`: Overlap between chunks
- `embeddings`: Ollama embeddings configuration
- `lark`: Lark Suite API credentials (required only if using Lark datasources)
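
How `chunk_size` and `chunk_overlap` interact can be illustrated with a simplified fixed-window splitter. This is only a sketch under stated assumptions: the server itself uses recursive character splitting, which prefers natural separators, and the function name `fixed_window_chunks` is hypothetical.

```python
def fixed_window_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Simplified illustration only; not the recursive splitter used by the server.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the defaults above, consecutive chunks share 200 characters, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.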

### 2. Configure Data Sources

Create `datasource.yaml`:
Create `datasource.yaml` with one or more data sources:

**Local Directory (PDF and Markdown files):**
```yaml
datasource:
  - type: directory
    path: ../datasets/
```

**Lark Document:**
```yaml
datasource:
  - type: lark-doc
    id: "doc-id"
```

**Lark Wiki:**
```yaml
datasource:
  - type: lark-wiki
    id: "wiki-id"
```

**Lark Wiki Space (loads all documents in a space):**
```yaml
datasource:
  - type: lark-space
    id: "space-id"
```

**Multiple Sources:**
```yaml
datasource:
  - type: directory
    path: ../datasets/
  - type: lark-doc
    id: "doc-id"
  - type: lark-wiki
    id: "wiki-id"
  - type: lark-space
    id: "space-id"
```

**Supported Datasource Types:**
- `directory`: Load PDF and Markdown files from a local directory
- `lark-doc`: Load a single Lark document by ID
- `lark-wiki`: Load a single wiki page by ID
- `lark-space`: Load all documents from a Lark wiki space by space ID (recursively loads all child pages)
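
The per-type requirements above can be checked before the server starts. The following is a minimal sketch that mirrors the validation rules in `loader/factory.py` (each type needs either `path` or `id`); the names `REQUIRED_FIELD` and `validate_entries` are hypothetical, and the input is assumed to be the already-parsed `datasource` list.

```python
# Field each datasource type must provide, mirroring the checks in Datasource.
REQUIRED_FIELD = {
    "directory": "path",
    "lark-doc": "id",
    "lark-wiki": "id",
    "lark-space": "id",
}


def validate_entries(entries: list[dict]) -> list[str]:
    """Return a list of problems found in parsed datasource entries (empty if valid)."""
    problems: list[str] = []
    for index, entry in enumerate(entries):
        kind = entry.get("type")
        if not kind:
            problems.append(f"entry {index}: type is missing")
        elif kind not in REQUIRED_FIELD:
            problems.append(f"entry {index}: unsupported type {kind!r}")
        elif not entry.get(REQUIRED_FIELD[kind]):
            field = REQUIRED_FIELD[kind]
            problems.append(f"entry {index}: {kind} requires {field!r}")
    return problems
```

Collecting all problems instead of raising on the first one is a design choice for this sketch; the `Datasource` class in this change raises `ValueError` immediately.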

### 3. Start Milvus

Using Docker Compose:
@@ -81,10 +139,16 @@ This will start Milvus on `http://localhost:19530`.

### Running the Server

Using Python directly:
```bash
uv run python main.py
```

Or using the Makefile:
```bash
make run
```

The server will:
1. Load documents from configured datasources
2. Split documents into chunks
@@ -110,8 +174,9 @@ query_knowledge_base(
├── config/
│ └── config.py # Configuration loader
├── loader/
│ ├── datasource.py # Datasource abstraction
│ └── directory.py # Directory loader (PDF/MD)
│ ├── factory.py # Loader factory and datasource abstraction
│ ├── directory.py # Directory loader (PDF/MD)
│ └── lark.py # Lark Suite loaders (Doc/Wiki/Space)
├── model/
│ ├── factory.py # Embeddings factory
│ └── model_garden.py # Model configurations
@@ -120,6 +185,7 @@ query_knowledge_base(
├── main.py # Application entry point
├── config.yaml # Runtime configuration
├── datasource.yaml # Data source definitions
├── Makefile # Development tasks
└── pyproject.toml # Project dependencies
```

@@ -131,15 +197,28 @@ query_knowledge_base(
- **pymilvus**: Milvus vector database client
- **pypdf**: PDF parsing
- **pyyaml**: YAML configuration parsing
- **lark-oapi**: Lark Suite Open API SDK

## Development

### Code Style

Run all checks:
```bash
make check
```

Or run individual tasks:
```bash
make lint # Run ruff check --fix
make format # Run ruff format
make type-check # Run ty check
```

Format code using Ruff:
```bash
uv run ruff format .
uv run ruff check .
uv run ruff check --fix
```

### Type Checking
@@ -155,6 +234,24 @@ If you encounter `ImportError: cannot import name 'Blob'`, ensure you're using t
from langchain_community.document_loaders.blob_loaders import Blob
```

### Lark API Issues

**Authentication Errors:**
- Verify `app_id` and `app_secret` in `config.yaml`
- Ensure your Lark app has the required permissions:
  - `docx:document` for document access
  - `wiki:wiki` for wiki access

**Document Not Found:**
- Verify the document/wiki ID is correct
- Check that your app has access to the document/wiki
- Ensure the document/wiki hasn't been deleted

**Getting Document IDs:**
- For Lark Docs: The ID is in the URL: `https://xxx.larksuite.com/docx/{document_id}`
- For Lark Wikis: The ID is in the URL: `https://xxx.larksuite.com/wiki/{wiki_id}`
- For Lark Spaces: The space ID can be found in wiki space settings or via the Lark API
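
For docs and wikis, the ID can be pulled out of a URL programmatically. This is a small sketch that assumes URLs follow exactly the `/docx/{document_id}` and `/wiki/{wiki_id}` shapes listed above; the helper name `lark_id_from_url` is hypothetical and not part of this project.

```python
from urllib.parse import urlparse


def lark_id_from_url(url: str) -> tuple[str, str]:
    """Return (kind, id) for a Lark URL, where kind is "docx" or "wiki".

    Assumes the path looks like /docx/{document_id} or /wiki/{wiki_id}.
    """
    # Query strings and fragments are not part of the path, so they are ignored.
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    if len(segments) >= 2 and segments[0] in ("docx", "wiki"):
        return segments[0], segments[1]
    raise ValueError(f"Not a recognised Lark doc or wiki URL: {url}")


kind, lark_id = lark_id_from_url("https://xxx.larksuite.com/docx/abc123?from=share")
print(kind, lark_id)
```

A `docx` result maps to a `lark-doc` datasource and a `wiki` result to a `lark-wiki` datasource.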

### Milvus Connection Issues

Verify Milvus is running:
7 changes: 7 additions & 0 deletions knowledge server/config.example.yaml
@@ -7,3 +7,10 @@ vector_store:
enable_full_text_search: true
chunk_size: 1000
chunk_overlap: 200
embeddings:
  source: ollama
  model: embeddinggemma:latest
lark:
  domain: "https://open.larksuite.com"
  app_id: "app_id_here"
  app_secret: "app_secret_here"
16 changes: 16 additions & 0 deletions knowledge server/config/config.py
@@ -23,6 +23,7 @@ def __init__(self, filepath):
        self.chunk_size = config.get("chunk_size", 1000)
        self.chunk_overlap = config.get("chunk_overlap", 200)
        self.embeddings = EmbeddingsConfig(config)
        self.lark = LarkConfig(config)


class EmbeddingsConfig:
@@ -38,6 +39,21 @@ def __init__(self, config: dict):
        self.model = embeddings_config.get("model", None)


class LarkConfig:
    domain: str
    app_id: str
    app_secret: str

    def __init__(self, config: dict):
        lark_config = config.get("lark", None)
        if lark_config is None:
            raise ValueError("Lark configuration is missing in the config file.")

        self.domain = lark_config.get("domain", None)
        self.app_id = lark_config.get("app_id", None)
        self.app_secret = lark_config.get("app_secret", None)


class VectorStoreConfig:
    type: str
    url: str
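
Note on `LarkConfig`: as written, the constructor raises `ValueError` whenever the `lark` section is absent, while the README in this change describes `lark` as required only when Lark datasources are used. One way to reconcile the two is sketched below. This is an assumption about the intended behaviour, not the code in this PR, and the class name `OptionalLarkConfig` and its `is_configured` property are hypothetical.

```python
from typing import Optional


class OptionalLarkConfig:
    """Hypothetical variant of LarkConfig that tolerates a missing `lark` section."""

    domain: Optional[str]
    app_id: Optional[str]
    app_secret: Optional[str]

    def __init__(self, config: dict):
        # Treat a missing or empty section as "Lark not configured" instead of failing.
        lark_config = config.get("lark") or {}
        self.domain = lark_config.get("domain")
        self.app_id = lark_config.get("app_id")
        self.app_secret = lark_config.get("app_secret")

    @property
    def is_configured(self) -> bool:
        """True only when all three credentials are present and non-empty."""
        return all((self.domain, self.app_id, self.app_secret))
```

With this shape, the check for missing credentials would move to the point where a Lark loader is first requested, so directory-only setups could start without a `lark` section.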
9 changes: 9 additions & 0 deletions knowledge server/datasource.example.yaml
@@ -0,0 +1,9 @@
datasource:
  - type: directory
    path: ../datasets/
  - type: lark-doc
    id: "some-lark-doc-id"
  - type: lark-wiki
    id: "some-lark-wiki-id"
  - type: lark-space
    id: "some-lark-space-id"
3 changes: 0 additions & 3 deletions knowledge server/datasource.yaml

This file was deleted.

31 changes: 0 additions & 31 deletions knowledge server/loader/datasource.py

This file was deleted.

6 changes: 4 additions & 2 deletions knowledge server/loader/directory.py
@@ -1,4 +1,5 @@
from collections.abc import Iterator
import logging
from langchain_core.document_loaders.base import BaseBlobParser
from langchain_community.document_loaders import (
    FileSystemBlobLoader,
@@ -23,10 +24,11 @@ def parse(self, blob: Blob) -> list[Document]:
class DirectoryLoader(BaseLoader):
    pdf_loader: PyPDFDirectoryLoader
    md_loader: GenericLoader
    logger: logging.Logger

    def __init__(self, path: str, logger) -> None:
    def __init__(self, path: str, logger: logging.Logger) -> None:
        self.pdf_loader = PyPDFDirectoryLoader(
            path, recursive=False, mode="single", extraction_mode="layout"
            path, recursive=True, mode="single", extraction_mode="layout"
        )
        self.md_loader = GenericLoader(
            blob_loader=FileSystemBlobLoader(
71 changes: 71 additions & 0 deletions knowledge server/loader/factory.py
@@ -0,0 +1,71 @@
import logging


from loader.lark import (
    LarkSuiteDocLoader,
    LarkSuiteWikiLoader,
    LarkSuiteWikiSpaceLoader,
)


from langchain_core.document_loaders.base import BaseLoader
from loader.directory import DirectoryLoader

import lark_oapi as lark


class Datasource:
    type: str
    path: str
    url: str
    id: str

    def __init__(self, type: str, path: str = "", url: str = "", id: str = ""):
        if not type:
            raise ValueError("Document source type is missing.")

        self.type = type
        self.path = path
        self.url = url
        self.id = id

        if self.type == "directory" and not self.path:
            raise ValueError("Directory source path is missing.")
        elif self.type == "lark-doc" and not self.id:
            raise ValueError("Lark document source id is missing.")
        elif self.type == "lark-wiki" and not self.id:
            raise ValueError("Lark wiki source id is missing.")
        elif self.type == "lark-space" and not self.id:
            raise ValueError("Lark space source id is missing.")
        elif self.type not in ["directory", "lark-doc", "lark-wiki", "lark-space"]:
            raise ValueError(f"Unsupported document source type: {self.type}")


class LoaderFactory:
    logger: logging.Logger
    lark_client: lark.Client

    def __init__(self, lark_client: lark.Client, logger: logging.Logger) -> None:
        self.lark_client = lark_client
        self.logger = logger

    def get_loader(self, datasource: Datasource) -> BaseLoader:
        if datasource.type == "directory":
            return DirectoryLoader(datasource.path, self.logger)
        elif datasource.type == "lark-doc":
            return LarkSuiteDocLoader(
                client=self.lark_client,
                document_id=datasource.id,
            )
        elif datasource.type == "lark-wiki":
            return LarkSuiteWikiLoader(
                client=self.lark_client,
                wiki_id=datasource.id,
            )
        elif datasource.type == "lark-space":
            return LarkSuiteWikiSpaceLoader(
                client=self.lark_client,
                space_id=datasource.id,
            )
        else:
            raise ValueError(f"Unsupported source type: {datasource.type}")
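
Note on `LoaderFactory.get_loader`: the type-to-loader mapping could also be expressed as a dispatch table, which keeps the list of supported types in one place. The sketch below is self-contained and uses a stand-in `StubLoader` instead of the real loader classes, so every name in it (`StubLoader`, `LOADER_BUILDERS`, `get_loader`) is hypothetical; it illustrates the pattern only and is not the implementation in this PR.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StubLoader:
    """Stand-in for DirectoryLoader and the Lark loaders; records what was requested."""

    kind: str
    target: str


# One builder per supported datasource type; each receives the datasource entry.
LOADER_BUILDERS: dict[str, Callable[[dict], StubLoader]] = {
    "directory": lambda ds: StubLoader("directory", ds["path"]),
    "lark-doc": lambda ds: StubLoader("lark-doc", ds["id"]),
    "lark-wiki": lambda ds: StubLoader("lark-wiki", ds["id"]),
    "lark-space": lambda ds: StubLoader("lark-space", ds["id"]),
}


def get_loader(datasource: dict) -> StubLoader:
    """Look up the builder for the datasource type and call it."""
    builder = LOADER_BUILDERS.get(datasource.get("type", ""))
    if builder is None:
        raise ValueError(f"Unsupported source type: {datasource.get('type')}")
    return builder(datasource)
```

The set of valid types could then be derived from the table's keys, so validation and dispatch would not need separate hard-coded lists.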