A Python-based API system for intelligently processing diverse Vietnamese documents through a multi-stage chunking pipeline. This system converts documents (PDF, DOCX, etc.) into structured, semantically meaningful chunks optimized for RAG (Retrieval-Augmented Generation) and embedding applications.
## Features

### Multi-Stage Processing Pipeline
- Stage 1: Document → Structured Markdown
- Stage 2: Markdown → First-Level Semantic Chunks
- Stage 3: First-Level Chunks → Detailed Sub-Chunks
- Relationship Mapping: Cross-references, hierarchies, domain-specific links
### Support for Multiple Document Types
- Pricing Documents (Biểu giá dịch vụ)
- Regulatory/Policy Documents (Quy định, hướng dẫn)
- Legal Documents (Hợp đồng, thỏa thuận)
- Operational Documents (SOP, hướng dẫn vận hành)
- Financial/Report Documents (Báo cáo, số liệu)
### Vietnamese Language Support
- Vietnamese text tokenization (`underthesea`, `pyvi`) - see the sketch below
- Vietnamese-specific pattern recognition
- Proper handling of tone marks and abbreviations
- Domain-specific terminology extraction
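
As a rough illustration, word segmentation might be wired up like this, assuming `underthesea` is installed (with `pyvi` as a fallback); `segment_vietnamese` is a hypothetical helper, not necessarily the actual function in `src/extractors/vietnamese_parser.py`:

```python
# Minimal sketch; segment_vietnamese is illustrative, not the project's API.
def segment_vietnamese(text: str) -> list[str]:
    """Segment Vietnamese text into words, preferring underthesea."""
    try:
        from underthesea import word_tokenize
        return word_tokenize(text)  # keeps multi-syllable words together
    except ImportError:
        from pyvi import ViTokenizer
        # pyvi returns a string with multi-syllable words joined by underscores
        return [w.replace("_", " ") for w in ViTokenizer.tokenize(text).split()]

print(segment_vietnamese("Cước xếp dỡ container tính lũy tiến"))
```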
### Configurable Schema System
- YAML-based document schemas
- No hardcoded logic - fully configurable chunking rules
- Easy to add new document types
### Rich Metadata Extraction
- Effective dates, container types, service types
- Article numbers, chapter numbers, clause references
- Conditional rules, surcharges, exceptions
- Table IDs, row groupings, calculation formulas
### Relationship Mapping
- Parent-child relationships
- Cross-references (tables, articles, sections)
- Domain-specific relationships (surcharges, exceptions)
- Relationship validation with broken link detection (see the sketch below)
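
A minimal sketch of what broken-link detection could look like, using the chunk and relationship field names from the example output further down (the function itself is illustrative):

```python
# Illustrative broken-link check: a relationship is "broken" when its
# target_chunk_id does not resolve to any stored chunk.
def find_broken_links(chunks: list[dict]) -> list[tuple[str, str]]:
    known_ids = {c["chunk_id"] for c in chunks}
    broken = []
    for chunk in chunks:
        for rel in chunk.get("relationships", []):
            target = rel.get("target_chunk_id")
            if target and target not in known_ids:
                broken.append((chunk["chunk_id"], target))
    return broken
```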
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd api4chatbot
   ```

2. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Create the required directories:

   ```bash
   mkdir -p data/chunks logs temp
   ```

## Configuration

Document schemas are stored in `config/schemas/`. Each schema defines how a document type should be chunked.
Example schemas provided:

- `pricing_schema.yaml` - for pricing documents
- `regulation_schema.yaml` - for regulatory documents

To add a new document type, create a new YAML schema file following the structure of the existing schemas.
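
For illustration, loading every schema in that directory could look like the sketch below (assumes PyYAML; the real loader lives in `src/schemas/schema_loader.py` and may differ):

```python
# Illustrative schema loader: read every YAML file in config/schemas/
# and index it by its document_type field.
from pathlib import Path
import yaml  # PyYAML

def load_schemas(schema_dir: str = "config/schemas") -> dict[str, dict]:
    schemas = {}
    for path in Path(schema_dir).glob("*.yaml"):
        with open(path, encoding="utf-8") as f:
            schema = yaml.safe_load(f)
        schemas[schema["document_type"]] = schema
    return schemas

schemas = load_schemas()
print(schemas.keys())  # e.g. dict_keys(['pricing', 'regulation'])
```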
## Running the API

```bash
# Development
uvicorn src.api:app --reload --host 0.0.0.0 --port 8000

# Production
uvicorn src.api:app --host 0.0.0.0 --port 8000 --workers 4
```

The API will be available at http://localhost:8000.

API documentation: http://localhost:8000/docs
## API Endpoints

### POST /documents/process

Upload and process a document through the chunking pipeline.
Parameters:

- `file`: Document file (PDF, DOCX, MD, TXT)
- `document_type`: Document type (`pricing`, `regulation`, `contract`, `sop`, `report`)
- `document_name`: Optional document name
- `effective_date`: Optional effective date (`YYYY-MM-DD`)
- `tags`: Optional comma-separated tags
Example:

```bash
curl -X POST "http://localhost:8000/documents/process" \
  -F "file=@document.pdf" \
  -F "document_type=pricing" \
  -F "document_name=Biểu giá 2025" \
  -F "effective_date=2025-03-10" \
  -F "tags=pricing,container,2025"
```

### GET /documents

Get a list of all processed documents.
### GET /documents/{document_id}/chunks

Retrieve chunks for a specific document, e.g. `GET /documents/{document_id}/chunks?level=1&chunk_type=table`.
Query Parameters:

- `level`: Filter by chunk level (1 or 2)
- `chunk_type`: Filter by chunk type (section, table, clause, etc.)
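
For example, from Python with `requests` (illustrative client code; it assumes the endpoint returns a JSON list of chunk objects):

```python
import requests

# Fetch all level-1 table chunks for a document (document_id is illustrative).
resp = requests.get(
    "http://localhost:8000/documents/doc_12345678/chunks",
    params={"level": 1, "chunk_type": "table"},
)
resp.raise_for_status()
for chunk in resp.json():  # assumed: a JSON list of chunk objects
    print(chunk["chunk_id"], chunk.get("title"))
```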
### GET /chunks/{chunk_id}

Get a specific chunk with full content and metadata.
### GET /chunks/{chunk_id}/relationships

Get chunks related to a specific chunk, e.g. `GET /chunks/{chunk_id}/relationships?relationship_type=surcharge`.
Query Parameters:

- `relationship_type`: Filter by relationship type (parent, child, reference, surcharge, exception)
### POST /chunks/search

Search chunks with advanced filters.

Request Body:

```json
{
  "document_id": "doc_12345678",
  "document_type": "pricing",
  "level": 1,
  "chunk_type": "table",
  "metadata_filter": {
    "container_types": ["20'", "40'"],
    "service_type": "Xếp dỡ"
  },
  "relationship_type": "surcharge",
  "limit": 100,
  "offset": 0
}
```
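
The same search from Python (illustrative; assumes the endpoint returns matching chunks as JSON):

```python
import requests

# Search for pricing table chunks with the "Xếp dỡ" service type.
payload = {
    "document_type": "pricing",
    "chunk_type": "table",
    "metadata_filter": {"service_type": "Xếp dỡ"},
    "limit": 10,
}
resp = requests.post("http://localhost:8000/chunks/search", json=payload)
resp.raise_for_status()
print(resp.json())
```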
### GET /schemas

Get a list of available document type schemas.

### DELETE /documents/{document_id}

Delete a document and all its chunks.
## Processing Pipeline

```
1. UPLOAD DOCUMENT
↓
2. STAGE 1: MARKDOWN CONVERSION
PDF/DOCX → Markdown (preserve structure, extract tables, metadata)
↓
3. LOAD SCHEMA FOR DOCUMENT TYPE
↓
4. STAGE 2: FIRST-LEVEL CHUNKING
Markdown + Schema → Primary chunks (1500-3000 tokens)
- Split by section/table/article boundaries
- Extract metadata fields
- Identify rough relationships
↓
5. STAGE 3: SECOND-LEVEL CHUNKING
Primary chunks → Sub-chunks (500-1500 tokens)
- Decompose tables: row groups, conditions, rules
- Decompose lists: items, nested conditions
- Extract structured data
↓
6. RELATIONSHIP MAPPING
Map cross-references and domain relationships
↓
7. VALIDATION & STORAGE
Validate relationships, store chunks
↓
8. RETURN RESPONSE
```
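
In code, the flow might look roughly like the sketch below; every function here is a stub standing in for the real modules under `src/core/`, whose actual signatures may differ:

```python
# Stubbed end-to-end sketch of the pipeline; the real logic lives in
# src/core/ (stage1_markdown.py, stage2_first_level.py, ...).

def stage1_to_markdown(file_path: str) -> str:
    # Real code converts PDF/DOCX to structured Markdown.
    with open(file_path, encoding="utf-8") as f:
        return f.read()

def stage2_first_level(markdown: str, schema: dict) -> list[dict]:
    # Real code splits on schema-defined boundaries into 1500-3000 token chunks.
    return [{"chunk_id": "doc_001_section_1", "level": 1,
             "content": markdown, "relationships": []}]

def stage3_second_level(primary: list[dict], schema: dict) -> list[dict]:
    # Real code decomposes tables and lists into 500-1500 token sub-chunks.
    return [{"chunk_id": c["chunk_id"] + "_sub_1", "level": 2,
             "content": c["content"],
             "relationships": [{"type": "parent",
                                "target_chunk_id": c["chunk_id"]}]}
            for c in primary]

def process_document(file_path: str, schema: dict) -> list[dict]:
    markdown = stage1_to_markdown(file_path)                 # Stage 1
    primary = stage2_first_level(markdown, schema)           # Stage 2
    chunks = primary + stage3_second_level(primary, schema)  # Stage 3
    # Relationship mapping, validation, and storage would follow here.
    return chunks
```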
## Project Structure

```
api4chatbot/
├── config/
│ └── schemas/ # Document type schemas (YAML)
│ ├── pricing_schema.yaml
│ └── regulation_schema.yaml
├── src/
│ ├── models.py # Pydantic data models
│ ├── api.py # FastAPI application
│ ├── core/ # Core processing modules
│ │ ├── document_processor.py # Orchestrator
│ │ ├── stage1_markdown.py # PDF → Markdown
│ │ ├── stage2_first_level.py # Primary chunking
│ │ ├── stage3_second_level.py # Sub-chunking
│ │ └── relationship_mapper.py # Relationship mapping
│ ├── extractors/ # Data extraction utilities
│ │ ├── table_extractor.py
│ │ ├── metadata_extractor.py
│ │ └── vietnamese_parser.py
│ ├── schemas/
│ │ └── schema_loader.py # Schema loader
│ ├── storage/
│ │ └── chunk_repository.py # Chunk storage
│ └── utils/
│ └── logger.py
├── data/
│ └── chunks/ # Stored chunks (JSON)
├── logs/ # Application logs
├── temp/ # Temporary file storage
├── tests/ # Unit tests
├── requirements.txt
└── README.md
```
## Example Output

Level 1 chunk:

```json
{
  "chunk_id": "doc_001_section_2_1",
  "document_id": "doc_001",
  "level": 1,
  "type": "section",
  "title": "Cước xếp dỡ container",
  "content": "...",
  "metadata": {
    "heading_path": "II.Cước Tác Nghiệp > 1.Cước xếp dỡ container",
    "page_range": "4-6",
    "effective_date": "2025-03-10",
    "container_types": ["20'", "40'", "45'"],
    "service_type": "Xếp dỡ"
  },
  "relationships": [
    {
      "type": "reference",
      "target_chunk_id": "doc_001_bảng_23",
      "description": "Xem Bảng 23"
    }
  ],
  "token_count": 1850
}
```

Level 2 chunk:

```json
{
  "chunk_id": "doc_001_section_2_1_sub_table_1",
  "document_id": "doc_001",
  "level": 2,
  "type": "row_group",
  "title": "Bảng 01 - Container 20'",
  "content": "...",
  "metadata": {
    "parent_chunk_id": "doc_001_section_2_1",
    "container_type": "20'",
    "table_id": "Bảng 01"
  },
  "relationships": [
    {
      "type": "parent",
      "target_chunk_id": "doc_001_section_2_1"
    }
  ],
  "extracted_data": {
    "type": "table",
    "headers": ["Phương án", "20'", "40'"],
    "rows": [...]
  },
  "token_count": 650
}
```
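
The project keeps its data models in `src/models.py` (Pydantic). Below is a minimal sketch of a chunk model consistent with the JSON above; the field names come from the examples, everything else is assumed:

```python
# Minimal Pydantic sketch matching the chunk JSON above;
# the real models in src/models.py are likely richer.
from typing import Any, Optional
from pydantic import BaseModel

class Relationship(BaseModel):
    type: str                      # parent, child, reference, surcharge, exception
    target_chunk_id: str
    description: Optional[str] = None

class Chunk(BaseModel):
    chunk_id: str
    document_id: str
    level: int                     # 1 = primary chunk, 2 = sub-chunk
    type: str                      # section, table, clause, row_group, ...
    title: str
    content: str
    metadata: dict[str, Any] = {}
    relationships: list[Relationship] = []
    extracted_data: Optional[dict[str, Any]] = None
    token_count: int = 0
```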
## Supported File Formats

- PDF (`.pdf`)
- Microsoft Word (`.docx`)
- Markdown (`.md`)
- Plain Text (`.txt`)
## Vietnamese Language Handling

- Tokenization: Uses `underthesea` or `pyvi` for accurate Vietnamese word segmentation
- Pattern Recognition:
  - Ordinal days: "ngày thứ 1-3" (days 1-3), "từ ngày thứ 4" (from day 4)
  - Container types: "20'", "40' HC", "45' GP"
  - Legal references: "Điều 15" (Article 15), "Khoản 3 Điều 10" (Clause 3, Article 10), "Bảng 23" (Table 23)
  - Conditional markers: "trong trường hợp" (in case of), "nếu" (if), "ngoại trừ" (except)
  - Pricing terms: "tính lũy tiến" (progressive calculation), "phụ thu" (surcharge), "không vượt quá" (not exceeding)
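
A couple of those pattern families expressed as regexes, as a sketch (illustrative, not copied from `vietnamese_parser.py`):

```python
# Illustrative regexes for two of the pattern families above.
import re

LEGAL_REF = re.compile(r"(?:Khoản\s+\d+\s+)?Điều\s+\d+|Bảng\s+\d+")
CONTAINER = re.compile(r"\b\d{2}'(?:\s*(?:HC|GP))?")

text = "Theo Khoản 3 Điều 10, phụ thu áp dụng cho container 40' HC (xem Bảng 23)."
print(LEGAL_REF.findall(text))  # ['Khoản 3 Điều 10', 'Bảng 23']
print(CONTAINER.findall(text))  # ["40' HC"]
```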
## Running Tests

```bash
pytest tests/
```

## Adding a New Document Type

1. Create a new schema file in `config/schemas/`:

   ```yaml
   # config/schemas/new_type_schema.yaml
   document_type: new_type
   description: "Description of new document type"
   first_level_boundaries:
     - type: section
       marker: "^#+\\s+"
   # ... (see existing schemas for full structure)
   ```

2. The schema will be loaded automatically on application start.
3. Use `document_type=new_type` when processing documents.
## Customizing Extractors

Modify the extractors in `src/extractors/`:

- `table_extractor.py` - Table extraction logic
- `metadata_extractor.py` - Metadata extraction patterns
- `vietnamese_parser.py` - Vietnamese text parsing utilities
## Performance

- Average processing time: <30 seconds per document
- Chunk size:
  - Level 1: 1500-3000 tokens
  - Level 2: 500-1500 tokens
- Vietnamese text handling accuracy: 98%+
- Relationship accuracy: >95% (with <5% broken links)
## Logging

Logs are stored in the `logs/` directory:

- `app.log` - General application logs
- `errors.log` - Error logs only
- `api.log` - API request logs
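
A sketch of how `src/utils/logger.py` might route records to those files (illustrative; the real module may differ, and the mapping of `api.log` to an `api` logger is an assumption):

```python
# Illustrative logging setup mirroring the file layout above.
import logging

def setup_logging(log_dir: str = "logs") -> None:
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    app = logging.FileHandler(f"{log_dir}/app.log", encoding="utf-8")

    errors = logging.FileHandler(f"{log_dir}/errors.log", encoding="utf-8")
    errors.setLevel(logging.ERROR)  # only ERROR and above reach errors.log

    api = logging.FileHandler(f"{log_dir}/api.log", encoding="utf-8")
    api.addFilter(lambda r: r.name.startswith("api"))  # assumed "api" logger

    for handler in (app, errors, api):
        handler.setFormatter(fmt)
        root.addHandler(handler)
```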
## License

[Your License Here]
## Contributing

[Your Contributing Guidelines Here]
## Support

For issues and questions, please open an issue on GitHub.
## Roadmap

- Add support for more document formats (XLSX, PPTX)
- Implement vector embedding integration
- Add async processing for large documents
- Web UI for document management
- Export chunks to various formats (JSON, CSV, Parquet)
- Multi-language support (English, etc.)
- Advanced analytics and visualization