Multimodal document Q&A with deep research capabilities
Upload PDFs → Docling parses text, tables, and images → Gemini Embedding 2 creates multimodal embeddings → ChromaDB stores vectors → Chat with your documents using streaming responses.
```bash
# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate   # Linux/Mac
# venv\Scripts\activate    # Windows

# Install packages
pip install -r requirements.txt
```

Note: Docling requires Python 3.10+ and may take a few minutes to install (it includes ML models for PDF parsing).
```bash
# Copy the example env file
cp .env.example .env

# Edit .env and add your Google API key
# Get one free at https://aistudio.google.com/apikey
```

Then start the app:

```bash
streamlit run app.py
```

The app opens at http://localhost:8501.
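For reference, the `.env` file needs just one entry, roughly like this (the variable name below is an assumption based on typical Gemini setups; `.env.example` has the authoritative name):

```
# .env — illustrative; check .env.example for the exact variable name
GOOGLE_API_KEY=your-api-key-here
```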
```
PDF Upload → Docling Parser → Text Chunks + Page Images
                    ↓
        Gemini Embedding 2 (1536-dim)
                    ↓
        ChromaDB (local persistent)
                    ↓
User Query → Embed (RETRIEVAL_QUERY) → Hybrid Search (text + images)
                    ↓
LLM (Gemini / Ollama) → Streaming Answer + Images
```
*System Workflow*

*App UI (WHO PDF Test Example)*
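To make the query path concrete, here is a minimal sketch of the retrieval step, assuming the standard `chromadb` and `google-genai` client APIs. The function and variable names are illustrative, not the repo's actual code:

```python
import chromadb
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment
chroma = chromadb.PersistentClient(path="data/chroma_db")

def hybrid_search(query: str, k_text: int = 8, k_images: int = 4):
    # Embed the query with the retrieval-optimized task type
    resp = client.models.embed_content(
        model="gemini-embedding-2-preview",  # model name from the stack above
        contents=query,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY",
            output_dimensionality=1536,
        ),
    )
    vec = resp.embeddings[0].values

    # Query both collections with the same vector — possible because
    # text and images share one embedding space
    texts = chroma.get_collection("text_chunks").query(
        query_embeddings=[vec], n_results=k_text
    )
    images = chroma.get_collection("page_images").query(
        query_embeddings=[vec], n_results=k_images
    )
    return texts, images
```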
| Mode | How it works |
|---|---|
| Standard RAG | Single query → retrieve top-K chunks → stream answer |
| Deep Research | Decompose query → multiple sub-searches → validate answers → re-search if gaps → synthesize comprehensive answer |
Both modes use whichever LLM provider you select (Gemini or Ollama).
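Deep Research is driven by a LangGraph state machine in `deep_researcher.py`. The sketch below shows how such a decompose → search → validate → re-search loop can be wired; the state fields and node bodies are hypothetical placeholders, not the repo's actual implementation:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ResearchState(TypedDict):
    query: str
    sub_queries: list[str]
    findings: list[str]
    gaps: list[str]
    answer: str

def decompose(state: ResearchState) -> dict:
    # Placeholder: an LLM call would split the query into sub-questions
    return {"sub_queries": [state["query"]]}

def search(state: ResearchState) -> dict:
    # Placeholder: run hybrid retrieval for each sub-query
    return {"findings": [f"results for {q}" for q in state["sub_queries"]]}

def validate(state: ResearchState) -> dict:
    # Placeholder: an LLM call would check the findings for coverage gaps
    return {"gaps": []}

def synthesize(state: ResearchState) -> dict:
    # Placeholder: an LLM call would write the final answer
    return {"answer": " ".join(state["findings"])}

graph = StateGraph(ResearchState)
graph.add_node("decompose", decompose)
graph.add_node("search", search)
graph.add_node("validate", validate)
graph.add_node("synthesize", synthesize)

graph.add_edge(START, "decompose")
graph.add_edge("decompose", "search")
graph.add_edge("search", "validate")
# Re-search while validation finds gaps; otherwise synthesize
graph.add_conditional_edges(
    "validate",
    lambda s: "search" if s["gaps"] else "synthesize",
    {"search": "search", "synthesize": "synthesize"},
)
graph.add_edge("synthesize", END)

app = graph.compile()
result = app.invoke({"query": "What does the report say about vaccine coverage?"})
```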
- Docling — IBM's PDF parser that extracts text, tables, and images with layout awareness
- Gemini Embedding 2 (`gemini-embedding-2-preview`) — Google's first multimodal embedding model. Text and images share the same vector space, enabling cross-modal retrieval
- ChromaDB — Local persistent vector database. Two collections: `text_chunks` and `page_images`
- LangGraph — Powers the deep research mode's multi-step state machine
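The two collections might be set up along these lines (a sketch using chromadb's persistent client; the actual setup lives in `vector_store.py` and may differ):

```python
import chromadb

# Persistent storage under data/chroma_db (survives app restarts)
client = chromadb.PersistentClient(path="data/chroma_db")

# One collection per modality; both hold 1536-dim Gemini embeddings,
# so a single query vector can search text and images alike
text_chunks = client.get_or_create_collection("text_chunks")
page_images = client.get_or_create_collection("page_images")

text_chunks.add(
    ids=["doc1-chunk-0"],
    embeddings=[[0.0] * 1536],          # placeholder vector
    documents=["First chunk of text"],
    metadatas=[{"source": "doc1.pdf", "page": 1}],
)
```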
All settings are in config.py. Key options:
| Setting | Default | Description |
|---|---|---|
| `EMBEDDING_DIM` | 1536 | Matryoshka dimension (768/1536/3072) |
| `CHUNK_SIZE` | 1000 | Characters per text chunk |
| `CHUNK_OVERLAP` | 200 | Overlap between chunks |
| `TOP_K_TEXT` | 8 | Text chunks per query |
| `TOP_K_IMAGES` | 4 | Page images per query |
| `DEFAULT_LLM_PROVIDER` | `gemini` | Default LLM (`gemini`/`ollama`) |
| `GEMINI_LLM_MODEL` | `gemini-2.5-flash-lite` | Gemini model for answers |
| `OLLAMA_LLM_MODEL` | `llama3.2` | Ollama model for answers |
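For intuition, `CHUNK_SIZE` and `CHUNK_OVERLAP` interact roughly like this (a simplified character-window sketch; the real chunking in `utils/helpers.py` may differ):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows of `size` characters."""
    step = size - overlap  # each new chunk starts 800 chars after the last
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# A 2,000-character document yields three chunks:
# [0:1000], [800:1800], [1600:2000] — adjacent chunks share 200 chars
chunks = chunk_text("x" * 2000)
assert len(chunks) == 3
```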
```
docvision-rag/
├── app.py                      # Streamlit entry point
├── config.py                   # Central configuration
├── requirements.txt
├── .env.example
├── core/
│   ├── document_processor.py   # Docling PDF parsing + chunking
│   ├── embedding_manager.py    # Gemini Embedding 2 wrapper
│   ├── vector_store.py         # ChromaDB operations
│   ├── retriever.py            # Hybrid text+image retrieval
│   ├── llm_manager.py          # Gemini/Ollama abstraction
│   └── chat_engine.py          # Standard RAG with streaming
├── research/
│   ├── deep_researcher.py      # LangGraph multi-step research
│   └── prompts.py              # All prompt templates
├── utils/
│   ├── image_utils.py          # Image extraction/encoding
│   └── helpers.py              # Text chunking, ID generation
└── data/
    ├── uploads/                # Temp uploaded files
    ├── images/                 # Extracted page/figure images
    └── chroma_db/              # Persistent vector storage
```
- Install Ollama: https://ollama.com
- Pull a model: `ollama pull llama3.2`
- In the app sidebar, select "ollama" as the LLM Provider
- Enter your model name (e.g., `llama3.2`)
Note: When using Ollama, you still need a Google API key for Gemini Embedding 2. The LLM provider choice only affects answer generation, not embedding.
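The switch in `llm_manager.py` presumably works along these lines (a sketch using the public `ollama` and `google-genai` Python packages; the function name and model strings here are illustrative):

```python
from typing import Iterator

def stream_answer(prompt: str, provider: str = "gemini") -> Iterator[str]:
    """Yield answer tokens from the selected provider."""
    if provider == "ollama":
        import ollama  # talks to the local server at http://localhost:11434
        stream = ollama.chat(
            model="llama3.2",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for part in stream:
            yield part["message"]["content"]
    else:
        from google import genai  # reads GOOGLE_API_KEY from the environment
        client = genai.Client()
        for chunk in client.models.generate_content_stream(
            model="gemini-2.5-flash-lite", contents=prompt
        ):
            yield chunk.text or ""
```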
- **Docling is slow on first run**: It downloads ML models (~1GB) for PDF understanding. Subsequent runs are faster.
- **CUDA/GPU errors**: Docling runs on CPU by default. If you have a GPU and want to use it, install the appropriate PyTorch version first.
- **Rate limiting**: Gemini Embedding 2 has API rate limits. The app includes automatic delays between calls. For large documents (100+ pages), processing may take several minutes.
- **ChromaDB errors**: If the database gets corrupted, delete the `data/chroma_db/` folder and reprocess your documents.

