Multimodal document Q&A with deep research capabilities
Upload PDFs → Docling parses text, tables, and images → Gemini Embedding 2 creates multimodal embeddings → ChromaDB stores vectors → Chat with your documents using streaming responses.
```bash
# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate   # Linux/Mac
# venv\Scripts\activate    # Windows

# Install packages
pip install -r requirements.txt
```

Note: Docling requires Python 3.10+ and may take a few minutes to install (it includes ML models for PDF parsing).
```bash
# Copy the example env file
cp .env.example .env

# Edit .env and add your Google API key
# Get one free at https://aistudio.google.com/apikey
```

Then start the app:

```bash
streamlit run app.py
```

The app opens at http://localhost:8501.
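For reference, the `.env` file needs just one entry, roughly like this (the variable name below is an assumption based on typical Gemini setups; `.env.example` has the authoritative name):

```
# .env — illustrative; check .env.example for the exact variable name
GOOGLE_API_KEY=your-api-key-here
```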
```
PDF Upload → Docling Parser → Text Chunks + Page Images
                    ↓
        Gemini Embedding 2 (1536-dim)
                    ↓
        ChromaDB (local persistent)
                    ↓
User Query → Embed (RETRIEVAL_QUERY) → Hybrid Search (text + images)
                    ↓
LLM (Gemini / Ollama) → Streaming Answer + Images
```
*System Workflow*

*App UI (WHO PDF Test Example)*
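To make the query path concrete, here is a minimal sketch of the retrieval step, assuming the standard `chromadb` and `google-genai` client APIs. The function and variable names are illustrative, not the repo's actual code:

```python
import chromadb
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment
chroma = chromadb.PersistentClient(path="data/chroma_db")

def hybrid_search(query: str, k_text: int = 8, k_images: int = 4):
    # Embed the query with the retrieval-optimized task type
    resp = client.models.embed_content(
        model="gemini-embedding-2-preview",  # model name from the stack above
        contents=query,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY",
            output_dimensionality=1536,
        ),
    )
    vec = resp.embeddings[0].values

    # Query both collections with the same vector — possible because
    # text and images share one embedding space
    texts = chroma.get_collection("text_chunks").query(
        query_embeddings=[vec], n_results=k_text
    )
    images = chroma.get_collection("page_images").query(
        query_embeddings=[vec], n_results=k_images
    )
    return texts, images
```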
| Mode | How it works |
|---|---|
| Standard RAG | Single query → retrieve top-K chunks → stream answer |
| Deep Research | Decompose query → multiple sub-searches → validate answers → re-search if gaps → synthesize comprehensive answer |
Both modes use whichever LLM provider you select (Gemini or Ollama).
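Deep Research is driven by a LangGraph state machine in `deep_researcher.py`. The sketch below shows how such a decompose → search → validate → re-search loop can be wired; the state fields and node bodies are hypothetical placeholders, not the repo's actual implementation:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ResearchState(TypedDict):
    query: str
    sub_queries: list[str]
    findings: list[str]
    gaps: list[str]
    answer: str

def decompose(state: ResearchState) -> dict:
    # Placeholder: an LLM call would split the query into sub-questions
    return {"sub_queries": [state["query"]]}

def search(state: ResearchState) -> dict:
    # Placeholder: run hybrid retrieval for each sub-query
    return {"findings": [f"results for {q}" for q in state["sub_queries"]]}

def validate(state: ResearchState) -> dict:
    # Placeholder: an LLM call would check the findings for coverage gaps
    return {"gaps": []}

def synthesize(state: ResearchState) -> dict:
    # Placeholder: an LLM call would write the final answer
    return {"answer": " ".join(state["findings"])}

graph = StateGraph(ResearchState)
graph.add_node("decompose", decompose)
graph.add_node("search", search)
graph.add_node("validate", validate)
graph.add_node("synthesize", synthesize)

graph.add_edge(START, "decompose")
graph.add_edge("decompose", "search")
graph.add_edge("search", "validate")
# Re-search while validation finds gaps; otherwise synthesize
graph.add_conditional_edges(
    "validate",
    lambda s: "search" if s["gaps"] else "synthesize",
    {"search": "search", "synthesize": "synthesize"},
)
graph.add_edge("synthesize", END)

app = graph.compile()
result = app.invoke({"query": "What does the report say about vaccine coverage?"})
```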
- Docling — IBM's PDF parser that extracts text, tables, and images with layout awareness
- Gemini Embedding 2 (`gemini-embedding-2-preview`) — Google's first multimodal embedding model. Text and images share the same vector space, enabling cross-modal retrieval
- ChromaDB — Local persistent vector database. Two collections: `text_chunks` and `page_images`
- LangGraph — Powers the deep research mode's multi-step state machine
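The two collections might be set up along these lines (a sketch using chromadb's persistent client; the actual setup lives in `vector_store.py` and may differ):

```python
import chromadb

# Persistent storage under data/chroma_db (survives app restarts)
client = chromadb.PersistentClient(path="data/chroma_db")

# One collection per modality; both hold 1536-dim Gemini embeddings,
# so a single query vector can search text and images alike
text_chunks = client.get_or_create_collection("text_chunks")
page_images = client.get_or_create_collection("page_images")

text_chunks.add(
    ids=["doc1-chunk-0"],
    embeddings=[[0.0] * 1536],          # placeholder vector
    documents=["First chunk of text"],
    metadatas=[{"source": "doc1.pdf", "page": 1}],
)
```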
All settings are in config.py. Key options:
| Setting | Default | Description |
|---|---|---|
| `EMBEDDING_DIM` | 1536 | Matryoshka dimension (768/1536/3072) |
| `CHUNK_SIZE` | 1000 | Characters per text chunk |
| `CHUNK_OVERLAP` | 200 | Overlap between chunks |
| `TOP_K_TEXT` | 8 | Text chunks per query |
| `TOP_K_IMAGES` | 4 | Page images per query |
| `DEFAULT_LLM_PROVIDER` | `gemini` | Default LLM (`gemini`/`ollama`) |
| `GEMINI_LLM_MODEL` | `gemini-2.5-flash-lite` | Gemini model for answers |
| `OLLAMA_LLM_MODEL` | `llama3.2` | Ollama model for answers |
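For intuition, `CHUNK_SIZE` and `CHUNK_OVERLAP` interact roughly like this (a simplified character-window sketch; the real chunking in `utils/helpers.py` may differ):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows of `size` characters."""
    step = size - overlap  # each new chunk starts 800 chars after the last
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# A 2,000-character document yields three chunks:
# [0:1000], [800:1800], [1600:2000] — adjacent chunks share 200 chars
chunks = chunk_text("x" * 2000)
assert len(chunks) == 3
```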
```
docvision-rag/
├── app.py                      # Streamlit entry point
├── config.py                   # Central configuration
├── requirements.txt
├── .env.example
├── core/
│   ├── document_processor.py   # Docling PDF parsing + chunking
│   ├── embedding_manager.py    # Gemini Embedding 2 wrapper
│   ├── vector_store.py         # ChromaDB operations
│   ├── retriever.py            # Hybrid text+image retrieval
│   ├── llm_manager.py          # Gemini/Ollama abstraction
│   └── chat_engine.py          # Standard RAG with streaming
├── research/
│   ├── deep_researcher.py      # LangGraph multi-step research
│   └── prompts.py              # All prompt templates
├── utils/
│   ├── image_utils.py          # Image extraction/encoding
│   └── helpers.py              # Text chunking, ID generation
└── data/
    ├── uploads/                # Temp uploaded files
    ├── images/                 # Extracted page/figure images
    └── chroma_db/              # Persistent vector storage
```
- Install Ollama: https://ollama.com
- Pull a model: `ollama pull llama3.2`
- In the app sidebar, select "ollama" as the LLM Provider
- Enter your model name (e.g., `llama3.2`)
Note: When using Ollama, you still need a Google API key for Gemini Embedding 2. The LLM provider choice only affects answer generation, not embedding.
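The switch in `llm_manager.py` presumably works along these lines (a sketch using the public `ollama` and `google-genai` Python packages; the function name and model strings here are illustrative):

```python
from typing import Iterator

def stream_answer(prompt: str, provider: str = "gemini") -> Iterator[str]:
    """Yield answer tokens from the selected provider."""
    if provider == "ollama":
        import ollama  # talks to the local server at http://localhost:11434
        stream = ollama.chat(
            model="llama3.2",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for part in stream:
            yield part["message"]["content"]
    else:
        from google import genai  # reads GOOGLE_API_KEY from the environment
        client = genai.Client()
        for chunk in client.models.generate_content_stream(
            model="gemini-2.5-flash-lite", contents=prompt
        ):
            yield chunk.text or ""
```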
- **Docling is slow on first run**: It downloads ML models (~1GB) for PDF understanding. Subsequent runs are faster.
- **CUDA/GPU errors**: Docling runs on CPU by default. If you have a GPU and want to use it, install the appropriate PyTorch version first.
- **Rate limiting**: Gemini Embedding 2 has API rate limits. The app includes automatic delays between calls. For large documents (100+ pages), processing may take several minutes.
- **ChromaDB errors**: If the database gets corrupted, delete the `data/chroma_db/` folder and reprocess your documents.

