A production-style Document Intelligence and Question-Answering system that ingests files, builds embeddings, performs hybrid retrieval, and answers questions using LLM reasoning pipelines.
## Table of Contents

- Overview
- Architecture
- Features
- Project Structure
- AI Pipeline
- Notebooks (Start Here)
- Installation
- Running the Project
- Configuration
- Future Improvements
## Overview

This project implements a modular AI retrieval architecture designed to bridge research workflows and production systems.
Key capabilities:
- Document ingestion (PDF, DOCX, text, images)
- Intelligent chunking pipeline
- Embedding generation
- Vector search with Chroma
- Keyword search with BM25 (SQLite FTS5)
- LangGraph QA reasoning pipeline
- Config-driven prompts and models
The system follows clean architecture + adapter pattern to make AI components swappable and testable.
## Architecture

```
+-------------------+
|      Client       |
|  API / Notebook   |
+---------+---------+
          |
          v
+-------------------+
|   Backend Layer   |
|  Request Control  |
+---------+---------+
          |
          v
+-----------------------+
|     AI Core Layer     |
|-----------------------|
| Document Parsers      |
| Chunking Pipeline     |
| Embedding Service     |
| Vector Store          |
| Keyword Search (BM25) |
| QA LangGraph Workflow |
+-----------+-----------+
            |
            v
      +-------------+
      |     LLM     |
      +-------------+
```
## Features

### Document Ingestion

Supports multiple document types via adapters:

- PDF
- DOCX
- Plain text
- Images
### Hybrid Retrieval

Combines:

- Dense vector search
- Sparse keyword search

Result → better recall and accuracy.
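One common way to merge the dense and sparse result lists is reciprocal rank fusion (RRF). The sketch below is illustrative only and not necessarily the fusion this project uses; the document IDs are invented:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked doc-id lists into one using RRF scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank); docs ranked high anywhere win.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["d2", "d1", "d3"]   # e.g. from vector similarity
sparse = ["d1", "d4", "d2"]  # e.g. from BM25
print(reciprocal_rank_fusion([dense, sparse]))  # ['d1', 'd2', 'd4', 'd3']
```

Documents that appear near the top of both lists ("d1", "d2") outrank documents found by only one retriever.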
### LangGraph QA Workflow

Advanced reasoning pipeline capable of:

- Context retrieval
- Multi-step reasoning
- Structured responses
- Tool orchestration
### Adapter Pattern

Adapters isolate external systems:

- LLM
- Vector database
- Embedding model
- File parsers
- Search engines

This makes the system easy to extend.
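In Python, such a port can be expressed as a `typing.Protocol` that core logic depends on, with concrete adapters plugged in at the edges. A minimal sketch with invented names (`EmbeddingPort`, `FakeEmbedding` are for illustration, not from the repo):

```python
from typing import Protocol


class EmbeddingPort(Protocol):
    """Port: any embedding backend the core layer can depend on."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class FakeEmbedding:
    """Test adapter: deterministic vectors, no network or model needed."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Trivial 2-dim "embedding": text length and vowel count.
        return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


def index_documents(embedder: EmbeddingPort, docs: list[str]) -> list[list[float]]:
    # Core logic depends only on the port, never on a concrete backend.
    return embedder.embed(docs)


vectors = index_documents(FakeEmbedding(), ["hello", "world"])
print(vectors)  # [[5.0, 2.0], [5.0, 1.0]]
```

Swapping in a real HuggingFace-backed adapter requires no change to the core function, which is what makes the components testable in isolation.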
## Project Structure

```
.
├── notebooks/              # Start here
│
├── configs/
│   ├── configs.yaml
│   └── prompts.yml
│
├── core/
│   ├── adapters/
│   │   ├── chunking/
│   │   ├── embedding/
│   │   ├── keyword_search/
│   │   ├── llm/
│   │   └── vectorstore/
│   │
│   └── workflows/
│       └── langgraph/
│
└── docker/
```
The architecture follows a ports & adapters design to separate business logic from infrastructure.
## AI Pipeline

### 1. Ingestion

Files are parsed and normalized using parser adapters.
### 2. Chunking

Documents are split into semantic segments using configurable strategies.
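As one concrete strategy, a fixed-size sliding window with overlap is the simplest baseline. The function below is a sketch for illustration, not the repo's actual chunker:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size sliding-window chunking with overlap between chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = chunk_text("a" * 500, size=200, overlap=50)
print([len(c) for c in chunks])  # [200, 200, 200]
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk; real "semantic" strategies would instead split on sentence or section boundaries.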
### 3. Embedding

Each chunk is converted into a vector via HuggingFace embedding models.
### 4. Vector Storage

Vectors are stored in ChromaDB.
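Under the hood, vector search ranks stored embeddings by similarity to the query embedding. A toy illustration of the cosine-similarity ranking a store like ChromaDB performs at scale (the IDs and 2-dim vectors are invented):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# Tiny in-memory "vector store": id -> embedding.
store = {"d1": [1.0, 0.0], "d2": [0.7, 0.7], "d3": [0.0, 1.0]}
query = [1.0, 0.1]

best = max(store, key=lambda doc_id: cosine(store[doc_id], query))
print(best)  # d1
```

Real stores use approximate nearest-neighbor indexes so this lookup stays fast for millions of vectors.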
### 5. Hybrid Retrieval

Two systems run in parallel:

- Vector similarity search
- BM25 keyword search
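SQLite ships BM25 ranking through its FTS5 extension, which is what the keyword side builds on. A minimal, self-contained illustration (the table name, column, and sample rows are made up for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
con.executemany(
    "INSERT INTO chunks(content) VALUES (?)",
    [
        ("vector search with chroma",),
        ("keyword search with bm25",),
        ("langgraph qa pipeline",),
    ],
)

# FTS5 exposes BM25 through its built-in `rank` column (lower = better match).
rows = con.execute(
    "SELECT content, rank FROM chunks WHERE chunks MATCH ? ORDER BY rank",
    ("bm25",),
).fetchall()
print(rows[0][0])  # keyword search with bm25
```

This assumes the bundled SQLite was compiled with FTS5, which is the case for standard CPython builds.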
### 6. QA Workflow

LangGraph orchestrates:

- Retrieval
- Context filtering
- LLM generation
- Response formatting
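Stripped of the framework, this orchestration amounts to threading a shared state through four node functions in sequence. The sketch below is a framework-free illustration, not the repo's LangGraph code; every function, field, and canned chunk is hypothetical:

```python
from typing import Callable

State = dict  # shared mutable state passed between pipeline nodes


def retrieve(state: State) -> State:
    # Stand-in retrieval: pretend these chunks came from hybrid search.
    state["chunks"] = ["chunk about chroma", "chunk about bm25", "unrelated chunk"]
    return state


def filter_context(state: State) -> State:
    # Keep only chunks that mention a term from the question.
    terms = state["question"].lower().split()
    state["context"] = [c for c in state["chunks"] if any(t in c for t in terms)]
    return state


def generate(state: State) -> State:
    # Stand-in for the LLM call.
    state["draft"] = f"Answer based on {len(state['context'])} chunk(s)."
    return state


def format_response(state: State) -> State:
    state["answer"] = {"question": state["question"], "text": state["draft"]}
    return state


def run(question: str, nodes: list[Callable[[State], State]]) -> State:
    state: State = {"question": question}
    for node in nodes:
        state = node(state)
    return state


result = run("what is bm25", [retrieve, filter_context, generate, format_response])
print(result["answer"]["text"])  # Answer based on 1 chunk(s).
```

LangGraph adds what this sketch lacks: typed state, conditional edges between nodes, and tool-calling loops.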
## Notebooks (Start Here)

The entry point of the project is the `notebooks/` directory.
Run the notebooks in order:
```
notebooks/
│
├── 01_Setup_Notebook.ipynb
├── 02_AI_Technical_Notebook.ipynb
├── 03_File_Chunk_Embedding_Search.ipynb
└── 04_QA_Pipeline_LangGraph.ipynb
```
These notebooks demonstrate the entire AI pipeline step-by-step.
They serve as:
- tutorials
- experiments
- debugging tools
- documentation
## Installation

```bash
git clone <repo>
cd project
pip install -r requirements.txt
```

## Running the Project

Run locally with Jupyter:

```bash
cd notebooks
jupyter notebook
```

Or run everything with Docker:

```bash
docker compose up --build
```

## Configuration

Main configuration files:

- `configs/configs.yaml`
- `configs/prompts.yml`
You can configure:
- LLM provider
- Embedding models
- Retrieval parameters
- Prompt templates
- Chunking strategies
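A hypothetical excerpt showing what such a config might look like; every key and value below is invented for illustration and may not match the real `configs/configs.yaml`:

```yaml
# Illustrative only -- actual keys may differ.
llm:
  provider: openai
  model: gpt-4o-mini
  temperature: 0.0

embedding:
  model: sentence-transformers/all-MiniLM-L6-v2

retrieval:
  top_k: 5
  hybrid: true

chunking:
  strategy: sliding_window
  size: 500
  overlap: 100
```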
Design goals:

- Research friendly
- Production oriented
- Highly modular
- Replaceable AI components
- Config-driven experimentation
## Future Improvements

- Evaluation pipelines
- Observability
- Hybrid ranking
- Streaming responses
- UI interface
- Multi-document reasoning
Recommended workflow:

1. Start experimentation in notebooks
2. Move stable logic to core/
3. Add adapters when integrating new systems
4. Expose via backend services
If you find this project useful, consider giving it a star.