A complete end-to-end application for extracting structured invoice data from PDFs and images using AI-powered OCR and LLM extraction.
This application demonstrates a production-ready invoice extraction system that:
- Extracts text from invoices using OCR (PaddleOCR + pdfplumber)
- Uses free/open-source LLMs (Ollama or Groq) to extract structured data
- Stores data in SQLite or Excel format
- Provides a modern web UI for upload, viewing, and editing
New to the codebase? See CODE_GUIDE.md for a short “how to read this code” guide covering where each feature lives, the request flow, and conventions; it is written so anyone can follow without prior context.
```
┌───────────────────┐
│ Next.js Frontend  │
│ (React/TypeScript)│
└────────┬──────────┘
         │ HTTP/REST
         │
┌────────▼──────────┐
│   Flask Backend   │
│  ├─ OCR Service   │
│  ├─ LLM Service   │
│  └─ DB Service    │
└────────┬──────────┘
         │
┌────────▼──────────┐
│  SQLite / Excel   │
└───────────────────┘
```
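The request flow through the backend can be sketched as a three-stage pipeline. This is an illustrative sketch only: the function names, the `ExtractionResult` shape, and the stub logic below are assumptions, not the actual APIs of the services in `backend/services/`.

```python
# Illustrative sketch of the OCR -> LLM -> result pipeline; the real
# implementations live in ocr_service.py, llm_service.py, and db_service.py.
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    raw_text: str
    fields: dict = field(default_factory=dict)


def run_ocr(file_bytes: bytes) -> str:
    """Stage 1: OCR / PDF text extraction (stubbed here as a decode)."""
    return file_bytes.decode("utf-8", errors="ignore")


def run_llm(raw_text: str) -> dict:
    """Stage 2: turn raw text into structured fields.

    A real implementation would prompt Ollama/Groq with the OCR text;
    this stub just parses "Key: value" lines.
    """
    fields = {}
    for line in raw_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields


def extract_invoice(file_bytes: bytes) -> ExtractionResult:
    """Chain the stages, mirroring the diagram above."""
    raw = run_ocr(file_bytes)
    return ExtractionResult(raw_text=raw, fields=run_llm(raw))


result = extract_invoice(b"Invoice Number: INV-1001\nTotal Due: 99.50")
print(result.fields["invoice_number"])  # INV-1001
```

The point of the sketch is the separation of stages: OCR produces raw text, the LLM normalizes it into a schema, and the caller persists the result.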
Install Ollama (required for LLM processing):

macOS / Linux:

1. Install Ollama:

   ```bash
   # Option 1: Direct install (recommended)
   curl -fsSL https://ollama.com/install.sh | sh

   # Option 2: Using Homebrew
   brew install ollama
   ```

2. Start the Ollama service:

   ```bash
   # Start Ollama in the background (keep this terminal open)
   ollama serve
   ```

3. Download the required model (in a new terminal):

   ```bash
   # Download the lightweight 1B model (~1.3GB, faster processing)
   ollama pull llama3.2:1b

   # Alternative: Standard 3B model (~2GB, better accuracy)
   # ollama pull llama3.2
   ```
Windows:

1. Install Ollama:
   - Download the installer from https://ollama.com/download
   - Run the `.exe` file and follow the installation wizard
   - Ollama will automatically start as a Windows service

2. Verify the installation:

   ```bash
   # Open Command Prompt or PowerShell
   ollama --version
   ```

3. Download the required model:

   ```bash
   # Download the lightweight 1B model (~1.3GB, faster processing)
   ollama pull llama3.2:1b

   # Alternative: Standard 3B model (~2GB, better accuracy)
   # ollama pull llama3.2
   ```
Check that Ollama is running:

```bash
# Should return a list of installed models
curl http://localhost:11434/api/tags

# Or on Windows PowerShell:
# Invoke-RestMethod -Uri http://localhost:11434/api/tags
```

Expected output:

```json
{
  "models": [
    {
      "name": "llama3.2:1b",
      "size": 1321098329,
      "parameter_size": "1.2B"
    }
  ]
}
```
- macOS: If `ollama serve` fails, try `sudo ollama serve` or check whether port 11434 is already in use
- Windows: If Ollama doesn't start automatically, search for "Ollama" in the Start Menu and run it
- Both: If the model download is slow, retry later or check your internet connection
- Memory: The 1B model uses ~2GB RAM; the 3B model uses ~4GB RAM
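The same reachability check can be scripted so the backend can fail fast with a clear message. This is a small standard-library helper, not part of the existing codebase; it hits the same `/api/tags` endpoint as the `curl` command above.

```python
# Verify Ollama is reachable and the expected model is installed.
import json
import urllib.error
import urllib.request


def ollama_models(base_url: str = "http://localhost:11434",
                  timeout: float = 2.0) -> list[str]:
    """Return installed model names, or [] if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (urllib.error.URLError, OSError, ValueError):
        return []


if "llama3.2:1b" in ollama_models():
    print("Ollama is ready")
else:
    print("Ollama is not running or llama3.2:1b is missing")
```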
1. Navigate to the backend: `cd backend`

2. Create a virtual environment:

   ```bash
   # macOS/Linux
   python3 -m venv venv
   source venv/bin/activate

   # Windows (Command Prompt)
   python -m venv venv
   venv\Scripts\activate

   # Windows (PowerShell)
   python -m venv venv
   venv\Scripts\Activate.ps1
   ```

3. Install dependencies: `pip install -r requirements.txt`

4. Configure the environment:

   ```bash
   # macOS/Linux
   cp env.example .env

   # Windows
   copy env.example .env

   # Edit .env if needed (default settings work with Ollama)
   ```

5. Run the backend: `python app.py`

The backend runs on http://localhost:5000.
Note: No system dependencies required! PaddleOCR is pure Python and downloads models automatically on first use (~500MB).
1. Navigate to the frontend: `cd frontend`
2. Install dependencies: `npm install`
3. Run the frontend: `npm run dev`

The frontend runs on http://localhost:3000.
This project is configured to use Llama 3.2:1b by default for optimal performance:
- Speed: ~2-5 seconds per invoice on modern hardware
- Memory: Uses ~2GB RAM (vs 4GB+ for larger models)
- Accuracy: Excellent for structured data extraction from invoices
- Privacy: Runs completely offline on your machine
Want better accuracy? You can switch to the larger model:
```bash
# Download the 3B model
ollama pull llama3.2

# Update backend/config.py or set an environment variable:
# OLLAMA_MODEL=llama3.2
```

If you prefer not to run Ollama locally:
1. Sign up at https://console.groq.com
2. Get your free API key
3. Update your `.env` file:

   ```bash
   GROQ_API_KEY=your_api_key_here
   USE_GROQ=True
   ```

4. Restart the backend
1. Open browser: Navigate to http://localhost:3000
2. Upload invoice: Drag and drop or click to upload a PDF/image
3. View extracted data: Data appears automatically after processing
4. Edit if needed: Click "View/Edit" to modify extracted fields
5. Save: Click "Save Invoice" to persist to the database
6. View all: Click the "All Invoices" tab to see all extracted invoices
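The upload step accepts PDFs and the image formats listed under Features (PNG, JPG, TIFF, BMP). A client- or server-side pre-check can reject unsupported files before any OCR work; this helper is illustrative, and the exact validation the backend performs may differ (the `.jpeg` variant is an assumption).

```python
# Illustrative pre-upload check for supported invoice file types.
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def is_supported_invoice(filename: str) -> bool:
    """True if the file extension matches a supported upload format."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


print(is_supported_invoice("invoice_a.pdf"))  # True
print(is_supported_invoice("notes.docx"))     # False
```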
The application extracts data matching the invoice template and SalesOrderHeader / SalesOrderDetail schemas:
- Invoice: invoice_number, order_date, due_date, ship_date
- Customer: customer_id, customer_name, customer_address (Bill To)
- Ship To: ship_to_name, ship_to_address
- Order: purchase_order_number (P.O. #), salesperson, ship_via, terms
- Totals: subtotal, tax_rate, tax, shipping_handling (S&H), other_charges, total_due
- Line Items (per SalesOrderDetail): product_code (Item #), description, quantity (OrderQty), unit_price, unit_price_discount, line_total
Schema upgrade: If you have an existing `invoices.db`, delete it so the app can create tables with the new columns (or run a migration to add them).
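If you prefer the migration route over deleting the database, a minimal sketch with `sqlite3` is shown below. The column names and types here are illustrative assumptions; check `backend/models.py` for the authoritative schema and table name.

```python
# Sketch of adding missing columns to an existing invoices.db.
import sqlite3

# Hypothetical new columns; replace with the ones models.py actually defines.
NEW_COLUMNS = {
    "ship_to_name": "TEXT",
    "ship_to_address": "TEXT",
    "shipping_handling": "REAL",
}


def add_missing_columns(db_path: str, table: str = "invoices") -> list[str]:
    """Add any columns from NEW_COLUMNS the table lacks; return those added."""
    added = []
    with sqlite3.connect(db_path) as conn:
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        for name, sqltype in NEW_COLUMNS.items():
            if name not in existing:
                conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sqltype}")
                added.append(name)
    return added
```

Running it twice is safe: the second call finds every column present and adds nothing.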
```
document_extractor_demo/
├── backend/
│   ├── app.py               # Flask application
│   ├── config.py            # Configuration
│   ├── models.py            # Database models
│   ├── requirements.txt     # Python dependencies
│   ├── services/            # Business logic
│   │   ├── ocr_service.py
│   │   ├── llm_service.py
│   │   └── db_service.py
│   └── README.md
├── frontend/
│   ├── app/                 # Next.js app directory
│   ├── components/          # React components
│   ├── lib/                 # Utilities
│   ├── package.json
│   └── README.md
└── README.md
```
✅ OCR Extraction: Supports PDFs and images (PNG, JPG, TIFF, BMP)
✅ LLM Extraction: Uses free/open-source LLMs (Ollama or Groq)
✅ Structured Data: Extracts to normalized schema
✅ Database Storage: SQLite (default) or Excel
✅ Modern UI: Drag & drop upload, editable forms
✅ Error Handling: Comprehensive error messages
✅ Modular Code: DRY, production-ready architecture
Before Extraction:
- Show empty database (SQLite or Excel)
- Explain the SalesOrderHeader and SalesOrderDetail schema
During Extraction:
1. Upload Invoice A (classic corporate layout)
   - Drag & drop or click to upload
   - Show real-time processing indicator
   - System extracts structured data automatically
   - Data appears in UI immediately

2. Upload Invoice B (different layout - minimal/modern)
   - Different visual layout, same extraction process
   - Demonstrates the LLM's ability to handle varied formats
   - Same normalized schema output
After Extraction:

3. View Database
   - Show all extracted invoices in the database
   - Demonstrate a before/after comparison

4. Edit & Save
   - Click "View/Edit" on any invoice
   - Modify extracted values in the form
   - Save changes to the database
   - Verify database updates in real time
If given more time to evolve this solution:
1. Higher Volume Processing
   - Implement async processing with Celery + Redis
   - Add batch upload capabilities
   - Queue-based architecture for parallel processing
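A full Celery + Redis setup is beyond this demo, but the queue-based idea can be sketched with the standard library: submit each uploaded file to a worker pool and collect results as they complete. The `extract_invoice` stub stands in for the real OCR + LLM pipeline.

```python
# Sketch of parallel batch processing with a thread pool.
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_invoice(filename: str) -> dict:
    """Stand-in for the real OCR + LLM extraction of one file."""
    return {"file": filename, "status": "extracted"}


def process_batch(filenames: list[str], workers: int = 4) -> list[dict]:
    """Run extractions concurrently; results arrive in completion order."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(extract_invoice, f) for f in filenames]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results


batch = process_batch(["a.pdf", "b.pdf", "c.png"])
print(len(batch))  # 3
```

A real deployment would swap the thread pool for a broker-backed queue so work survives restarts and scales across machines, which is what Celery + Redis provides.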
2. Additional Document Types
   - Add a document classifier (LLM or lightweight ML model)
   - Route to specialized extraction prompts per document type
   - Schema registry (invoice, PO, receipt, contract, etc.)
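The registry idea amounts to a lookup from document type to its extraction prompt. The types and prompt strings below are illustrative placeholders, not anything the current codebase defines.

```python
# Sketch of a prompt/schema registry keyed by document type.
PROMPT_REGISTRY = {
    "invoice": "Extract invoice_number, totals, and line items as JSON.",
    "purchase_order": "Extract purchase_order_number and ordered items as JSON.",
    "receipt": "Extract merchant, date, and total as JSON.",
}


def prompt_for(doc_type: str) -> str:
    """Return the extraction prompt for a classified document type."""
    try:
        return PROMPT_REGISTRY[doc_type]
    except KeyError:
        raise ValueError(f"No extraction prompt registered for {doc_type!r}")
```

The classifier's output becomes the registry key, so adding a new document type is one new entry plus its schema, with no routing-code changes.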
3. Cost & Performance Optimization
   - Cache OCR results to avoid re-processing
   - Use cheaper models for OCR text cleanup
   - Invoke the LLM only for semantic extraction, not OCR
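OCR caching can be keyed on a hash of the uploaded bytes, so re-uploading the same document skips OCR entirely. A minimal in-memory sketch (a production version would persist the cache, e.g. in Redis or the database):

```python
# Sketch of content-addressed caching of OCR results.
import hashlib


class OCRCache:
    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0

    def get_or_run(self, file_bytes: bytes, ocr_fn) -> str:
        """Return cached OCR text for identical bytes, else run ocr_fn once."""
        key = hashlib.sha256(file_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = ocr_fn(file_bytes)
        return self._store[key]


cache = OCRCache()
cache.get_or_run(b"fake pdf bytes", lambda b: "ocr text")
cache.get_or_run(b"fake pdf bytes", lambda b: "ocr text")
print(cache.hits)  # 1
```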
4. Production Deployment
   - Dockerize the Flask and Next.js applications
   - Host on AWS/GCP/Azure
   - Store files in S3/cloud storage
   - Use Postgres instead of SQLite
   - Add authentication (JWT)
   - Implement rate limiting and monitoring
Backend:
- Flask (Python web framework)
- PaddleOCR (text extraction)
- pdfplumber (PDF parsing)
- Ollama/Groq (LLM inference)
- SQLAlchemy (ORM)
- SQLite/Excel (storage)
Frontend:
- Next.js 14 (React framework)
- TypeScript (type safety)
- Tailwind CSS (styling)
- React Hook Form (form management)
- Axios (HTTP client)
This is a demo project for educational purposes.
If you encounter issues:
- Check backend logs for errors
- Verify Ollama is running: `curl http://localhost:11434/api/tags`
- Check the frontend console for API errors
- For OCR issues, PaddleOCR models download automatically on first use