
AI Document Analyzer

A Retrieval-Augmented Generation (RAG) document analysis system that lets you upload documents, process them, and interrogate their content with natural-language questions. Answers are grounded in the actual content of the uploaded documents, not in the model's generic knowledge.



Description

AI Document Analyzer is a full-stack application that implements a complete RAG pipeline for document analysis. It enables you to:

  • Upload documents in PDF, DOCX, TXT, and CSV formats
  • Process them automatically through text extraction, chunking, embeddings, and vector storage
  • Ask questions about the content and get AI-generated answers with citations to original sources
  • View real-time response streaming, query metrics, and cited excerpts

The system is designed for local development, prototyping, and small teams. It supports multiple LLM providers, including free options (Groq, Ollama, Gemini, OpenRouter).


Key Features

| Feature | Description |
|---|---|
| Document Upload | PDF, DOCX, TXT, CSV (up to 50 MB per file) |
| RAG Pipeline | Extraction → Cleaning → Chunking → Deduplication → Embeddings → ChromaDB |
| Contextual Chat | Q&A over document content |
| Streaming | Real-time tokens via SSE |
| Cited Sources | Answers with references to document, chunk, and original text |
| Document Dashboard | List, select, and delete documents |
| Metrics | Query time, tokens used, estimated cost |
| Multi-LLM | Groq, Ollama, LM Studio, OpenRouter, Gemini, OpenAI |
| Conversational Memory | Chat history for follow-up questions |

Architecture

```
                    ┌─────────────────┐
                    │   Frontend      │
                    │   Next.js 14    │
                    │ React + Tailwind│
                    └────────┬────────┘
                             │ HTTP API
                             ▼
                    ┌─────────────────┐
                    │   Backend       │
                    │   FastAPI       │
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         ▼                   ▼                   ▼
┌─────────────────┐ ┌───────────────┐ ┌─────────────────┐
│ Document        │ │ ChromaDB      │ │ LLM Provider    │
│ Pipeline        │ │ (Vector DB)   │ │ (Groq/Ollama)   │
│ + Embeddings    │ │               │ │                 │
└─────────────────┘ └───────────────┘ └─────────────────┘
```

Document Processing Flow

  1. Upload → User uploads file
  2. Storage → Local or S3
  3. Extraction → Text by format
  4. Cleaning → Normalization
  5. Chunking → Fragmentation with overlap
  6. Metadata → Classification (contract, invoice, CV, etc.)
  7. Embeddings → SentenceTransformer
  8. Indexing → ChromaDB
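
The chunking step (5) is essentially a sliding window with overlap, so that context spanning a chunk boundary survives in at least one chunk. A minimal sketch; the sizes below are illustrative defaults, not necessarily the project's actual configuration:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so content that straddles a
    boundary is fully contained in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the rest of the text is already covered
    return chunks
```

With `chunk_size=500` and `overlap=100`, the last 100 characters of each chunk reappear as the first 100 characters of the next one.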

Question Flow (RAG)

Question → Semantic search → Reranking (optional) → LLM → Answer + sources
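
The flow above can be sketched end to end in a few lines. The toy bag-of-characters `embed` stands in for SentenceTransformers, and the final LLM call is elided; only the shapes of the steps match the real pipeline:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def embed(text: str) -> list[float]:
    # Toy character-frequency embedding; the real system uses
    # SentenceTransformers (all-MiniLM-L6-v2) vectors stored in ChromaDB.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Semantic search: rank chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n".join(retrieve(question, chunks))
    # The real system sends this prompt to the configured LLM provider
    # and returns the generated answer together with the cited chunks.
    return f"Answer using only this context:\n{context}\n\nQ: {question}"
```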

Tech Stack

| Layer | Technology |
|---|---|
| Backend | FastAPI, Python 3.11+ |
| Frontend | Next.js 14, React, Tailwind CSS |
| Database | SQLite (dev) / PostgreSQL (prod) |
| Vector DB | ChromaDB |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) |
| LLM | Groq (default), Ollama, Gemini, OpenRouter, OpenAI |

Prerequisites

  • Python 3.11+
  • Node.js 18+ (for frontend)
  • API key from at least one LLM provider (e.g. Groq, which has a free tier)

Installation

1. Clone or download the project

```shell
cd ia_analizer_data
```

2. Backend (Python)

```shell
# Create a virtual environment
python -m venv venv

# Activate (Windows PowerShell)
.\venv\Scripts\activate

# Activate (Linux/macOS)
source venv/bin/activate

# Install dependencies
pip install -r backend/requirements.txt
```

3. Frontend (Node.js)

```shell
cd frontend
npm install
cd ..
```

4. Environment variables

```shell
# Copy the template
cp .env.example .env

# Edit .env and configure at least:
# - GROQ_API_KEY (recommended to start; free)
# - or LLM_PROVIDER plus the corresponding API key
```

Configuration

Main variables in .env:

| Variable | Description | Recommended |
|---|---|---|
| `LLM_PROVIDER` | `groq`, `ollama`, `gemini`, `openrouter`, `openai` | `groq` |
| `GROQ_API_KEY` | Groq API key | Get at console.groq.com |
| `CHROMA_PERSIST_DIR` | ChromaDB path | `./chroma_db` |
| `STORAGE_LOCAL_PATH` | Uploads folder | `./uploads` |
| `DATABASE_URL` | SQLite or PostgreSQL | `sqlite:///./ai_docs.db` |

For more LLM options, see docs/LLM_PROVIDERS.md.
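
Putting the table together, a minimal `.env` for the default Groq setup might look like this (the key value is a placeholder):

```env
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key_here
CHROMA_PERSIST_DIR=./chroma_db
STORAGE_LOCAL_PATH=./uploads
DATABASE_URL=sqlite:///./ai_docs.db
```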


Running the Project

Development (Windows PowerShell)

```shell
# Terminal 1: Backend (port 8001)
.\run.ps1 dev

# Terminal 2: Frontend (port 3000)
.\run.ps1 dev-frontend
```

Development (Linux/macOS)

```shell
# Terminal 1: Backend
make dev
# or: cd backend && uvicorn app.main:app --reload --host 0.0.0.0 --port 8001

# Terminal 2: Frontend
cd frontend && npm run dev
```

URLs

| Service | URL |
|---|---|
| Application | http://localhost:3000 |
| API docs | http://localhost:8001/docs |
| Health check | http://localhost:8001/health |

Other commands (Windows)

```shell
.\run.ps1 install      # Install backend dependencies
.\run.ps1 test         # Run tests
.\run.ps1 kill-port    # Free ports 8000, 8001, 3000
```

API

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/v1/upload` | Upload document |
| POST | `/api/v1/chat` | Chat with documents (full response) |
| POST | `/api/v1/chat/stream` | Streaming chat (SSE) |
| GET | `/api/v1/documents` | List documents |
| DELETE | `/api/v1/documents/{id}` | Delete document |
| GET | `/health` | Service status |

Interactive documentation is available at /docs (Swagger UI).
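
As a client-side sketch of the streaming endpoint: SSE responses arrive as `data:` lines (per the Server-Sent Events format), which can be parsed like this. The `token` field and the `[DONE]` marker are hypothetical illustrations, not the project's documented payload schema:

```python
import json

def parse_sse_line(line: str):
    """Extract the JSON payload from one Server-Sent Events line.
    Returns None for non-data lines (comments, keep-alives) and for
    an assumed end-of-stream marker."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":  # hypothetical end-of-stream marker
        return None
    return json.loads(payload)

# Against a running backend you would iterate the response of
# POST http://localhost:8001/api/v1/chat/stream line by line;
# here a captured sample stands in for the live stream.
sample_stream = [
    'data: {"token": "The"}',
    'data: {"token": " contract"}',
    ': keep-alive',
    'data: [DONE]',
]
tokens = [e["token"] for line in sample_stream
          if (e := parse_sse_line(line)) is not None]
```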


AI Providers

The system supports multiple providers to avoid lock-in:

| Provider | Type | Requires API key |
|---|---|---|
| Groq | Cloud (free) | Yes |
| Ollama | Local | No |
| LM Studio | Local | No |
| OpenRouter | Cloud (`:free` models) | Yes |
| Google Gemini | Cloud (free) | Yes |
| OpenAI | Cloud (paid) | Yes |

Default configuration: Groq (llama-3.1-8b-instant).
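
The avoid-lock-in pattern typically reduces to reading `LLM_PROVIDER` and dispatching to a provider-specific endpoint. A hedged sketch; the function, table, and defaults are illustrative, not the project's actual code (though the base URLs shown are the providers' usual ones):

```python
import os

# Illustrative provider table: Ollama exposes a local endpoint and
# needs no key, while cloud providers require an API key.
PROVIDERS = {
    "groq": {"base_url": "https://api.groq.com/openai/v1", "needs_key": True},
    "ollama": {"base_url": "http://localhost:11434", "needs_key": False},
    "openai": {"base_url": "https://api.openai.com/v1", "needs_key": True},
}

def resolve_provider(env=os.environ):
    """Pick a provider from LLM_PROVIDER and validate its API key."""
    name = env.get("LLM_PROVIDER", "groq")  # Groq is the default
    cfg = PROVIDERS.get(name)
    if cfg is None:
        raise ValueError(f"unknown provider: {name}")
    if cfg["needs_key"] and not env.get(f"{name.upper()}_API_KEY"):
        raise RuntimeError(f"{name} requires {name.upper()}_API_KEY")
    return name, cfg["base_url"]
```

Swapping providers then means changing one environment variable rather than touching the chat or pipeline code.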


Project Structure

```
ia_analizer_data/
├── backend/                 # FastAPI API
│   ├── app/
│   │   ├── api/             # Routes (upload, chat, documents)
│   │   ├── core/            # Security, exceptions
│   │   ├── db/              # Models, repository, session
│   │   ├── models/          # Pydantic schemas
│   │   ├── pipelines/       # Document pipeline
│   │   └── services/        # Business logic
│   ├── tests/
│   └── requirements.txt
├── frontend/                # Next.js
│   ├── src/
│   │   ├── app/             # Routes, layout
│   │   └── components/      # DocumentUpload, DocumentList, Chat
│   └── package.json
├── docs/                    # Documentation
├── docker/                  # Dockerfile, docker-compose
├── run.ps1                  # Windows scripts
├── Makefile                 # Linux/macOS commands
├── .env.example
└── README.md
```

Use Cases

  • Contract analysis: Questions about clauses, deadlines, obligations
  • CV review: Extract skills, experience, education
  • Reports and summaries: Summarize, compare sections, find data
  • Invoices and financial documents: Queries about amounts, dates, vendors
  • Technical documentation: Questions about manuals and specifications

Additional Documentation

| Document | Content |
|---|---|
| docs/INFORME_PROYECTO.md | Full analysis, strengths, limitations |
| docs/LLM_PROVIDERS.md | AI provider configuration |
| docs/architecture.md | Architecture diagrams |
| docs/deployment.md | Deployment guide |
| docs/IMPLEMENTATION_CHECKLIST.md | Implementation status |

License

MIT
