DocVision RAG

Multimodal document Q&A with deep research capabilities

Upload PDFs → Docling parses text, tables, and images → Gemini Embedding 2 creates multimodal embeddings → ChromaDB stores vectors → Chat with your documents using streaming responses.

Quick Start

1. Install dependencies

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install packages
pip install -r requirements.txt

Note: Docling requires Python 3.10+ and may take a few minutes to install, since it pulls in ML model dependencies for PDF parsing.

2. Set your API key

# Copy the example env file
cp .env.example .env

# Edit .env and add your Google API key
# Get one free at https://aistudio.google.com/apikey
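The .env file needs a single entry. A minimal example, assuming the app reads the key from a variable named GOOGLE_API_KEY (check .env.example for the exact variable name it expects):

# .env
GOOGLE_API_KEY=your-api-key-here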

3. Run the app

streamlit run app.py

The app opens at http://localhost:8501.

How It Works

Architecture

PDF Upload → Docling Parser → Text Chunks + Page Images
                                    ↓
                          Gemini Embedding 2 (1536-dim)
                                    ↓
                          ChromaDB (local persistent)
                                    ↓
User Query → Embed (RETRIEVAL_QUERY) → Hybrid Search (text + images)
                                    ↓
                        LLM (Gemini / Ollama) → Streaming Answer + Images
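A minimal sketch of the query path above, using the google-genai SDK and ChromaDB's Python client directly. The model and collection names are the ones this README mentions; the actual code in core/embedding_manager.py and core/retriever.py may structure this differently:

import chromadb
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment
chroma = chromadb.PersistentClient(path="data/chroma_db")

def retrieve(query: str, k_text: int = 8, k_images: int = 4):
    # Embed the query with the retrieval-query task type
    resp = client.models.embed_content(
        model="gemini-embedding-2-preview",
        contents=query,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY",
            output_dimensionality=1536,
        ),
    )
    qvec = resp.embeddings[0].values
    # One query vector searches both collections (shared vector space)
    texts = chroma.get_collection("text_chunks").query(
        query_embeddings=[qvec], n_results=k_text)
    images = chroma.get_collection("page_images").query(
        query_embeddings=[qvec], n_results=k_images)
    return texts, images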

Screenshots

System Workflow

DocVision RAG Workflow

App UI (WHO PDF Test Example)

DocVision RAG App Screenshot

Two Modes

Mode            How it works
Standard RAG    Single query → retrieve top-K chunks → stream answer
Deep Research   Decompose query → multiple sub-searches → validate answers → re-search if gaps → synthesize comprehensive answer

Both modes use whichever LLM provider you select (Gemini or Ollama).
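The Deep Research loop maps naturally onto a LangGraph state machine. This is a hedged sketch of the flow described in the table, not the actual graph in research/deep_researcher.py; the node bodies are stubbed where the real app calls the LLM and the retriever:

from typing import TypedDict
from langgraph.graph import StateGraph, END

class ResearchState(TypedDict):
    question: str
    sub_queries: list[str]
    findings: list[str]
    gaps: list[str]
    answer: str

def decompose(state: ResearchState) -> dict:
    # Break the user question into focused sub-queries (an LLM call in the real app)
    return {"sub_queries": [state["question"]]}

def search(state: ResearchState) -> dict:
    # Run retrieval for each sub-query and accumulate findings
    new = [f"retrieved context for: {q}" for q in state["sub_queries"]]
    return {"findings": state["findings"] + new}

def validate(state: ResearchState) -> dict:
    # Check findings for coverage gaps (another LLM call in the real app)
    return {"gaps": []}  # empty list means "no gaps, go synthesize"

def synthesize(state: ResearchState) -> dict:
    return {"answer": "\n".join(state["findings"])}

graph = StateGraph(ResearchState)
for name, fn in [("decompose", decompose), ("search", search),
                 ("validate", validate), ("synthesize", synthesize)]:
    graph.add_node(name, fn)
graph.set_entry_point("decompose")
graph.add_edge("decompose", "search")
graph.add_edge("search", "validate")
# Loop back to search while gaps remain, otherwise synthesize
graph.add_conditional_edges("validate",
    lambda s: "search" if s["gaps"] else "synthesize")
graph.add_edge("synthesize", END)
researcher = graph.compile()

# researcher.invoke({"question": "...", "sub_queries": [],
#                    "findings": [], "gaps": [], "answer": ""})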

Key Technologies

  • Docling — IBM's PDF parser that extracts text, tables, and images with layout awareness
  • Gemini Embedding 2 (gemini-embedding-2-preview) — Google's first multimodal embedding model. Text and images share the same vector space, enabling cross-modal retrieval
  • ChromaDB — Local persistent vector database. Two collections: text_chunks and page_images
  • LangGraph — Powers the deep research mode's multi-step state machine
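For reference, the two-collection layout from the ChromaDB bullet can be set up as follows. This is a sketch; the ID scheme and metadata fields here are illustrative, not the repo's actual schema:

import chromadb

chroma = chromadb.PersistentClient(path="data/chroma_db")
text_chunks = chroma.get_or_create_collection("text_chunks")
page_images = chroma.get_or_create_collection("page_images")

# Text and page-image vectors live in separate collections but share the
# same 1536-dim embedding space, so one query vector can search both.
text_chunks.add(
    ids=["report.pdf-chunk-0"],            # illustrative ID scheme
    embeddings=[[0.0] * 1536],             # placeholder vector
    documents=["First chunk of text..."],
    metadatas=[{"source": "report.pdf", "page": 1}],
)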

Configuration

All settings are in config.py. Key options:

Setting                Default                  Description
EMBEDDING_DIM          1536                     Matryoshka dimension (768/1536/3072)
CHUNK_SIZE             1000                     Characters per text chunk
CHUNK_OVERLAP          200                      Overlap between chunks
TOP_K_TEXT             8                        Text chunks per query
TOP_K_IMAGES           4                        Page images per query
DEFAULT_LLM_PROVIDER   gemini                   Default LLM (gemini/ollama)
GEMINI_LLM_MODEL       gemini-2.5-flash-lite    Gemini model for answers
OLLAMA_LLM_MODEL       llama3.2                 Ollama model for answers
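Mirroring the table, config.py is a flat module of constants, roughly like this (illustrative snapshot of the defaults above; the real file may define additional settings):

# config.py — defaults from the table above
EMBEDDING_DIM = 1536                          # Matryoshka dimension: 768, 1536, or 3072
CHUNK_SIZE = 1000                             # characters per text chunk
CHUNK_OVERLAP = 200                           # characters shared between adjacent chunks
TOP_K_TEXT = 8                                # text chunks retrieved per query
TOP_K_IMAGES = 4                              # page images retrieved per query
DEFAULT_LLM_PROVIDER = "gemini"               # "gemini" or "ollama"
GEMINI_LLM_MODEL = "gemini-2.5-flash-lite"
OLLAMA_LLM_MODEL = "llama3.2"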

Project Structure

docvision-rag/
├── app.py                          # Streamlit entry point
├── config.py                       # Central configuration
├── requirements.txt
├── .env.example
├── core/
│   ├── document_processor.py       # Docling PDF parsing + chunking
│   ├── embedding_manager.py        # Gemini Embedding 2 wrapper
│   ├── vector_store.py             # ChromaDB operations
│   ├── retriever.py                # Hybrid text+image retrieval
│   ├── llm_manager.py              # Gemini/Ollama abstraction
│   └── chat_engine.py              # Standard RAG with streaming
├── research/
│   ├── deep_researcher.py          # LangGraph multi-step research
│   └── prompts.py                  # All prompt templates
├── utils/
│   ├── image_utils.py              # Image extraction/encoding
│   └── helpers.py                  # Text chunking, ID generation
└── data/
    ├── uploads/                    # Temp uploaded files
    ├── images/                     # Extracted page/figure images
    └── chroma_db/                  # Persistent vector storage

Using with Ollama (Local/Offline)

  1. Install Ollama: https://ollama.com
  2. Pull a model: ollama pull llama3.2
  3. In the app sidebar, select "ollama" as the LLM Provider
  4. Enter your model name (e.g., llama3.2)

Note: When using Ollama, you still need a Google API key for Gemini Embedding 2. The LLM provider choice only affects answer generation, not embedding.
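For context, streaming answer generation against a local Ollama model looks roughly like this with the ollama Python package (a sketch; core/llm_manager.py may use a different client or talk to Ollama's HTTP API directly):

import ollama

def stream_answer(prompt: str, model: str = "llama3.2"):
    # Stream tokens from the local Ollama server (http://localhost:11434 by default)
    for chunk in ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        yield chunk["message"]["content"]

for token in stream_answer("Summarize the uploaded document."):
    print(token, end="", flush=True)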

Troubleshooting

Docling is slow on first run: It downloads ML models (~1GB) for PDF understanding. Subsequent runs are faster.

CUDA/GPU errors: Docling runs on CPU by default. If you have a GPU and want to use it, install the appropriate PyTorch version first.

Rate limiting: Gemini Embedding 2 has API rate limits. The app includes automatic delays between calls. For large documents (100+ pages), processing may take several minutes.
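Conceptually, the delay logic is just a pause between embedding calls, something like the following (a simplification; the app's actual pacing or backoff strategy may be more involved):

import time

def embed_with_delay(chunks, embed_fn, delay_s=0.5):
    # Embed chunks one at a time with a fixed pause between API calls
    vectors = []
    for chunk in chunks:
        vectors.append(embed_fn(chunk))  # one API call per chunk
        time.sleep(delay_s)              # stay under the requests-per-minute cap
    return vectors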

ChromaDB errors: If the database gets corrupted, delete the data/chroma_db/ folder and reprocess your documents.
