A production-style Document Intelligence and Question-Answering system that ingests files, builds embeddings, performs hybrid retrieval, and answers questions using LLM reasoning pipelines.
## Table of Contents

- Overview
- Architecture
- Features
- Project Structure
- AI Pipeline
- Notebooks (Start Here)
- Installation
- Running the Project
- Configuration
- Future Improvements
## Overview

This project implements a modular AI retrieval architecture designed to bridge research workflows and production systems.
Key capabilities:
- Document ingestion (PDF, DOCX, text, images)
- Intelligent chunking pipeline
- Embedding generation
- Vector search with Chroma
- Keyword search with BM25 (SQLite FTS5)
- LangGraph QA reasoning pipeline
- Config-driven prompts and models
The system follows clean architecture + adapter pattern to make AI components swappable and testable.
## Architecture

```
+-------------------+
|      Client       |
|  API / Notebook   |
+---------+---------+
          |
          v
+-------------------+
|   Backend Layer   |
|  Request Control  |
+---------+---------+
          |
          v
+-----------------------+
|     AI Core Layer     |
|-----------------------|
| Document Parsers      |
| Chunking Pipeline     |
| Embedding Service     |
| Vector Store          |
| Keyword Search (BM25) |
| QA LangGraph Workflow |
+-----------+-----------+
            |
            v
      +-------------+
      |     LLM     |
      +-------------+
```
## Features

### Document Ingestion

Supports multiple document types via adapters:

- PDF
- DOCX
- Plain text
- Images
### Hybrid Retrieval

Combines:

- Dense vector search
- Sparse keyword search

Result → better recall and accuracy.
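One common way to merge the dense and sparse result lists is reciprocal rank fusion (RRF). The sketch below is illustrative only and not necessarily the fusion this project uses; the document IDs are invented:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked doc-id lists into one using RRF scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank); docs ranked high anywhere win.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["d2", "d1", "d3"]   # e.g. from vector similarity
sparse = ["d1", "d4", "d2"]  # e.g. from BM25
print(reciprocal_rank_fusion([dense, sparse]))  # ['d1', 'd2', 'd4', 'd3']
```

Documents that appear near the top of both lists ("d1", "d2") outrank documents found by only one retriever.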
### LangGraph QA Workflow

Advanced reasoning pipeline capable of:

- Context retrieval
- Multi-step reasoning
- Structured responses
- Tool orchestration
### Adapter Pattern

Adapters isolate external systems:

- LLM
- Vector database
- Embedding model
- File parsers
- Search engines

This makes the system easy to extend.
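In Python, such a port can be expressed as a `typing.Protocol` that core logic depends on, with concrete adapters plugged in at the edges. A minimal sketch with invented names (`EmbeddingPort`, `FakeEmbedding` are for illustration, not from the repo):

```python
from typing import Protocol


class EmbeddingPort(Protocol):
    """Port: any embedding backend the core layer can depend on."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class FakeEmbedding:
    """Test adapter: deterministic vectors, no network or model needed."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Trivial 2-dim "embedding": text length and vowel count.
        return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


def index_documents(embedder: EmbeddingPort, docs: list[str]) -> list[list[float]]:
    # Core logic depends only on the port, never on a concrete backend.
    return embedder.embed(docs)


vectors = index_documents(FakeEmbedding(), ["hello", "world"])
print(vectors)  # [[5.0, 2.0], [5.0, 1.0]]
```

Swapping in a real HuggingFace-backed adapter requires no change to the core function, which is what makes the components testable in isolation.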
## Project Structure

```
.
├── notebooks/              # Start here
│
├── configs/
│   ├── configs.yaml
│   └── prompts.yml
│
├── core/
│   ├── adapters/
│   │   ├── chunking/
│   │   ├── embedding/
│   │   ├── keyword_search/
│   │   ├── llm/
│   │   └── vectorstore/
│   │
│   └── workflows/
│       └── langgraph/
│
└── docker/
```
The architecture follows a ports & adapters design to separate business logic from infrastructure.
## AI Pipeline

### 1. Ingestion

Files are parsed and normalized using parser adapters.
### 2. Chunking

Documents are split into semantic segments using configurable strategies.
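As one concrete strategy, a fixed-size sliding window with overlap is the simplest baseline. The function below is a sketch for illustration, not the repo's actual chunker:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size sliding-window chunking with overlap between chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = chunk_text("a" * 500, size=200, overlap=50)
print([len(c) for c in chunks])  # [200, 200, 200]
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk; real "semantic" strategies would instead split on sentence or section boundaries.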
### 3. Embedding

Each chunk is converted into a vector via HuggingFace embedding models.
### 4. Vector Storage

Vectors are stored in ChromaDB.
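Under the hood, vector search ranks stored embeddings by similarity to the query embedding. A toy illustration of the cosine-similarity ranking a store like ChromaDB performs at scale (the IDs and 2-dim vectors are invented):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# Tiny in-memory "vector store": id -> embedding.
store = {"d1": [1.0, 0.0], "d2": [0.7, 0.7], "d3": [0.0, 1.0]}
query = [1.0, 0.1]

best = max(store, key=lambda doc_id: cosine(store[doc_id], query))
print(best)  # d1
```

Real stores use approximate nearest-neighbor indexes so this lookup stays fast for millions of vectors.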
### 5. Hybrid Retrieval

Two systems run in parallel:

- Vector similarity search
- BM25 keyword search
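SQLite ships BM25 ranking through its FTS5 extension, which is what the keyword side builds on. A minimal, self-contained illustration (the table name, column, and sample rows are made up for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
con.executemany(
    "INSERT INTO chunks(content) VALUES (?)",
    [
        ("vector search with chroma",),
        ("keyword search with bm25",),
        ("langgraph qa pipeline",),
    ],
)

# FTS5 exposes BM25 through its built-in `rank` column (lower = better match).
rows = con.execute(
    "SELECT content, rank FROM chunks WHERE chunks MATCH ? ORDER BY rank",
    ("bm25",),
).fetchall()
print(rows[0][0])  # keyword search with bm25
```

This assumes the bundled SQLite was compiled with FTS5, which is the case for standard CPython builds.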
### 6. QA Workflow

LangGraph orchestrates:

- Retrieval
- Context filtering
- LLM generation
- Response formatting
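Stripped of the framework, this orchestration amounts to threading a shared state through four node functions in sequence. The sketch below is a framework-free illustration, not the repo's LangGraph code; every function, field, and canned chunk is hypothetical:

```python
from typing import Callable

State = dict  # shared mutable state passed between pipeline nodes


def retrieve(state: State) -> State:
    # Stand-in retrieval: pretend these chunks came from hybrid search.
    state["chunks"] = ["chunk about chroma", "chunk about bm25", "unrelated chunk"]
    return state


def filter_context(state: State) -> State:
    # Keep only chunks that mention a term from the question.
    terms = state["question"].lower().split()
    state["context"] = [c for c in state["chunks"] if any(t in c for t in terms)]
    return state


def generate(state: State) -> State:
    # Stand-in for the LLM call.
    state["draft"] = f"Answer based on {len(state['context'])} chunk(s)."
    return state


def format_response(state: State) -> State:
    state["answer"] = {"question": state["question"], "text": state["draft"]}
    return state


def run(question: str, nodes: list[Callable[[State], State]]) -> State:
    state: State = {"question": question}
    for node in nodes:
        state = node(state)
    return state


result = run("what is bm25", [retrieve, filter_context, generate, format_response])
print(result["answer"]["text"])  # Answer based on 1 chunk(s).
```

LangGraph adds what this sketch lacks: typed state, conditional edges between nodes, and tool-calling loops.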
## Notebooks (Start Here)

The entry point of the project is the `notebooks/` directory.
Run the notebooks in order:
```
notebooks/
│
├── 01_Setup_Notebook.ipynb
├── 02_AI_Technical_Notebook.ipynb
├── 03_File_Chunk_Embedding_Search.ipynb
└── 04_QA_Pipeline_LangGraph.ipynb
```
These notebooks demonstrate the entire AI pipeline step-by-step.
They serve as:
- tutorials
- experiments
- debugging tools
- documentation
## Installation

```bash
git clone <repo>
cd project
pip install -r requirements.txt
```

## Running the Project

Run locally with Jupyter:

```bash
cd notebooks
jupyter notebook
```

Or run everything with Docker:

```bash
docker compose up --build
```

## Configuration

Main configuration files:

- `configs/configs.yaml`
- `configs/prompts.yml`
You can configure:
- LLM provider
- Embedding models
- Retrieval parameters
- Prompt templates
- Chunking strategies
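A hypothetical excerpt showing what such a config might look like; every key and value below is invented for illustration and may not match the real `configs/configs.yaml`:

```yaml
# Illustrative only -- actual keys may differ.
llm:
  provider: openai
  model: gpt-4o-mini
  temperature: 0.0

embedding:
  model: sentence-transformers/all-MiniLM-L6-v2

retrieval:
  top_k: 5
  hybrid: true

chunking:
  strategy: sliding_window
  size: 500
  overlap: 100
```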
Design goals:

- Research friendly
- Production oriented
- Highly modular
- Replaceable AI components
- Config-driven experimentation
## Future Improvements

- Evaluation pipelines
- Observability
- Hybrid ranking
- Streaming responses
- UI interface
- Multi-document reasoning
Recommended workflow:

1. Start experimentation in notebooks
2. Move stable logic to core/
3. Add adapters when integrating new systems
4. Expose via backend services
If you find this project useful, consider giving it a star.