Invoice Document Extractor

A complete end-to-end application for extracting structured invoice data from PDFs and images using AI-powered OCR and LLM extraction.

Overview

This application demonstrates a production-ready invoice extraction system that:

  • Extracts text from invoices using OCR (PaddleOCR + pdfplumber)
  • Uses free/open-source LLMs (Ollama or Groq) to extract structured data
  • Stores data in SQLite or Excel format
  • Provides a modern web UI for upload, viewing, and editing

New to the codebase? See CODE_GUIDE.md for a short "how to read this code" guide covering where each feature lives, the request flow, and the project's conventions; it is written so anyone can follow it without prior context.
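At its core, the LLM step builds a prompt from the OCR text, asks the model for JSON, and parses the reply defensively, since models often wrap JSON in code fences. A minimal sketch; the prompt wording and field list here are illustrative, not the project's actual prompt:

```python
import json

def build_prompt(ocr_text: str) -> str:
    # Ask the model to return ONLY JSON matching (a subset of) the invoice schema.
    return (
        "Extract the following fields from this invoice text and return ONLY a "
        "JSON object with keys: invoice_number, order_date, total_due.\n\n"
        "Invoice text:\n" + ocr_text
    )

def parse_llm_json(raw: str) -> dict:
    # Models often wrap JSON in ```json ... ``` fences; strip them before parsing.
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        if cleaned.startswith("json"):
            cleaned = cleaned[4:]
    return json.loads(cleaned)
```

Parsing defensively matters because small models occasionally add fences or prose around the JSON they return.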

Architecture

┌─────────────────┐
│  Next.js Frontend │
│  (React/TypeScript)│
└────────┬──────────┘
         │ HTTP/REST
         │
┌────────▼──────────┐
│   Flask Backend   │
│  ├─ OCR Service   │
│  ├─ LLM Service   │
│  └─ DB Service    │
└────────┬──────────┘
         │
┌────────▼──────────┐
│  SQLite / Excel   │
└───────────────────┘

Quick Start

Prerequisites

Install Ollama (Required for LLM processing):

macOS Installation

  1. Install Ollama:

    # Option 1: Direct install (recommended)
    curl -fsSL https://ollama.com/install.sh | sh
    
    # Option 2: Using Homebrew
    brew install ollama
  2. Start Ollama service:

    # Start Ollama in the background (keep this terminal open)
    ollama serve
  3. Download the required model (in a new terminal):

    # Download the lightweight 1B model (~1.3GB, faster processing)
    ollama pull llama3.2:1b
    
    # Alternative: Standard 3B model (~2GB, better accuracy)
    # ollama pull llama3.2

Windows Installation

  1. Install Ollama:

    • Download the installer from https://ollama.com/download
    • Run the .exe file and follow the installation wizard
    • Ollama will automatically start as a Windows service
  2. Verify installation:

    # Open Command Prompt or PowerShell
    ollama --version
  3. Download the required model:

    # Download the lightweight 1B model (~1.3GB, faster processing)
    ollama pull llama3.2:1b
    
    # Alternative: Standard 3B model (~2GB, better accuracy)
    # ollama pull llama3.2

Verify Installation (Both Platforms)

  1. Check if Ollama is running:

    # Should return a list of installed models
    curl http://localhost:11434/api/tags
    
    # Or on Windows PowerShell:
    # Invoke-RestMethod -Uri http://localhost:11434/api/tags

    Expected output:

    {
      "models": [
        {
          "name": "llama3.2:1b",
          "size": 1321098329,
          "parameter_size": "1.2B"
        }
      ]
    }
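If you prefer scripting this check, the same verification takes a few lines of Python. The JSON shape matches the sample response above; the helper itself is just an illustration:

```python
import json
from urllib.request import urlopen

def has_model(tags: dict, name: str) -> bool:
    """Return True if an installed model's name starts with `name`."""
    return any(m["name"].startswith(name) for m in tags.get("models", []))

# Query the local Ollama API (requires `ollama serve` to be running):
# tags = json.load(urlopen("http://localhost:11434/api/tags"))
# print(has_model(tags, "llama3.2"))
```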

Troubleshooting Ollama

  • macOS: If ollama serve fails, try sudo ollama serve or check if port 11434 is in use
  • Windows: If Ollama doesn't start automatically, search for "Ollama" in Start Menu and run it
  • Both: If the model download is slow, retry later or check your internet connection
  • Memory: The 1B model uses ~2GB RAM, the 3B model uses ~4GB RAM

Backend Setup

  1. Navigate to backend:

    cd backend
  2. Create virtual environment:

    # macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
    
    # Windows (Command Prompt)
    python -m venv venv
    venv\Scripts\activate
    
    # Windows (PowerShell)
    python -m venv venv
    venv\Scripts\Activate.ps1
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure environment:

    # macOS/Linux
    cp env.example .env
    
    # Windows
    copy env.example .env
    
    # Edit .env if needed (default settings work with Ollama)
  5. Run backend:

    python app.py

Backend runs on http://localhost:5000

Note: No system dependencies required! PaddleOCR is pure Python and downloads models automatically on first use (~500MB).
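For reference, a minimal .env for the default Ollama setup might look like the following. The variable names are taken from the settings mentioned elsewhere in this README; check env.example for the authoritative list:

```
# LLM backend: local Ollama by default
OLLAMA_MODEL=llama3.2:1b
USE_GROQ=False

# Only needed when USE_GROQ=True:
# GROQ_API_KEY=your_api_key_here
```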

Frontend Setup

  1. Navigate to frontend:

    cd frontend
  2. Install dependencies:

    npm install
  3. Run frontend:

    npm run dev

Frontend runs on http://localhost:3000

Model Performance & Choice

This project is configured to use Llama 3.2:1b by default for optimal performance:

  • Speed: ~2-5 seconds per invoice on modern hardware
  • Memory: Uses ~2GB RAM (vs 4GB+ for larger models)
  • Accuracy: Excellent for structured data extraction from invoices
  • Privacy: Runs completely offline on your machine

Want better accuracy? You can switch to the larger model:

# Download the 3B model
ollama pull llama3.2

# Update backend/config.py or set environment variable:
# OLLAMA_MODEL=llama3.2
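However the project wires this up, the usual pattern is to read the model name from the environment with a sensible default. A sketch of what backend/config.py might contain (the exact names in the real file may differ):

```python
import os

def load_config(env: dict) -> dict:
    """Build runtime settings from an environment mapping."""
    return {
        # Override with e.g. OLLAMA_MODEL=llama3.2 for the larger model
        "model": env.get("OLLAMA_MODEL", "llama3.2:1b"),
        "ollama_url": env.get("OLLAMA_URL", "http://localhost:11434"),
        "use_groq": env.get("USE_GROQ", "False").lower() == "true",
    }

CONFIG = load_config(dict(os.environ))
```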

Alternative LLM Setup

Option B: Groq API (Free, Fast, Cloud-based)

If you prefer not to run Ollama locally:

  1. Sign up at https://console.groq.com
  2. Get your free API key
  3. Update your .env file:
    GROQ_API_KEY=your_api_key_here
    USE_GROQ=True
  4. Restart the backend
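Internally, a backend like this typically branches on the USE_GROQ flag when choosing which endpoint to call. A hedged sketch of the two request shapes (Ollama's local generate endpoint vs. Groq's OpenAI-compatible chat endpoint); the Groq model name here is an assumption, and the real code in services/llm_service.py may differ:

```python
def ollama_payload(prompt: str, model: str = "llama3.2:1b") -> dict:
    # Body for POST http://localhost:11434/api/generate
    return {"model": model, "prompt": prompt, "stream": False}

def groq_payload(prompt: str, model: str = "llama-3.1-8b-instant") -> dict:
    # Body for POST https://api.groq.com/openai/v1/chat/completions
    # (send GROQ_API_KEY as a Bearer token in the Authorization header)
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def build_request(prompt: str, use_groq: bool) -> tuple:
    """Pick the endpoint and payload based on the USE_GROQ setting."""
    if use_groq:
        return "https://api.groq.com/openai/v1/chat/completions", groq_payload(prompt)
    return "http://localhost:11434/api/generate", ollama_payload(prompt)
```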

Usage

  1. Open browser: Navigate to http://localhost:3000
  2. Upload invoice: Drag and drop or click to upload a PDF/image
  3. View extracted data: Data appears automatically after processing
  4. Edit if needed: Click "View/Edit" to modify extracted fields
  5. Save: Click "Save Invoice" to persist to database
  6. View all: Click "All Invoices" tab to see all extracted invoices

Data Model

The application extracts data matching the invoice template and SalesOrderHeader / SalesOrderDetail schemas:

SalesOrderHeader (invoice header + order/shipping)

  • Invoice: invoice_number, order_date, due_date, ship_date
  • Customer: customer_id, customer_name, customer_address (Bill To)
  • Ship To: ship_to_name, ship_to_address
  • Order: purchase_order_number (P.O. #), salesperson, ship_via, terms
  • Totals: subtotal, tax_rate, tax, shipping_handling (S&H), other_charges, total_due

SalesOrderDetail (line items: ITEM #, DESCRIPTION, QTY, UNIT PRICE, TOTAL)

  • product_code (Item #), description, quantity (OrderQty), unit_price, unit_price_discount, line_total

Schema upgrade: If you had an existing invoices.db, delete it so the app can create tables with the new columns (or run a migration to add them).
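The two tables map naturally onto a one-to-many SQLite schema. A trimmed sketch using only a few of the columns listed above (models.py defines the full set):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the app uses invoices.db on disk
conn.executescript("""
CREATE TABLE sales_order_header (
    id INTEGER PRIMARY KEY,
    invoice_number TEXT,
    customer_name TEXT,
    order_date TEXT,
    total_due REAL
);
CREATE TABLE sales_order_detail (
    id INTEGER PRIMARY KEY,
    header_id INTEGER REFERENCES sales_order_header(id),
    product_code TEXT,   -- Item #
    description TEXT,
    quantity REAL,       -- OrderQty
    unit_price REAL,
    line_total REAL
);
""")

# One header row with one line item:
conn.execute(
    "INSERT INTO sales_order_header (invoice_number, total_due) VALUES (?, ?)",
    ("INV-001", 150.0),
)
conn.execute(
    "INSERT INTO sales_order_detail "
    "(header_id, product_code, quantity, unit_price, line_total) "
    "VALUES (1, 'A-1', 3, 50.0, 150.0)"
)
```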

Project Structure

document_extractor_demo/
├── backend/
│   ├── app.py              # Flask application
│   ├── config.py           # Configuration
│   ├── models.py           # Database models
│   ├── requirements.txt    # Python dependencies
│   ├── services/           # Business logic
│   │   ├── ocr_service.py
│   │   ├── llm_service.py
│   │   └── db_service.py
│   └── README.md
├── frontend/
│   ├── app/                # Next.js app directory
│   ├── components/         # React components
│   ├── lib/                # Utilities
│   ├── package.json
│   └── README.md
└── README.md

Features

  • OCR Extraction: Supports PDFs and images (PNG, JPG, TIFF, BMP)
  • LLM Extraction: Uses free/open-source LLMs (Ollama or Groq)
  • Structured Data: Extracts to a normalized schema
  • Database Storage: SQLite (default) or Excel
  • Modern UI: Drag & drop upload, editable forms
  • Error Handling: Comprehensive error messages
  • Modular Code: DRY, production-ready architecture

Demo Flow

Before Extraction:

  • Show empty database (SQLite or Excel)
  • Explain the SalesOrderHeader and SalesOrderDetail schema

During Extraction:

  1. Upload Invoice A (classic corporate layout)

    • Drag & drop or click to upload
    • Show real-time processing indicator
    • System extracts structured data automatically
    • Data appears in UI immediately
  2. Upload Invoice B (different layout - minimal/modern)

    • Different visual layout, same extraction process
    • Demonstrates LLM's ability to handle varied formats
    • Same normalized schema output

After Extraction:

  3. View Database

    • Show all extracted invoices in database
    • Demonstrate before/after comparison
  4. Edit & Save

    • Click "View/Edit" on any invoice
    • Modify extracted values in the form
    • Save changes to database
    • Verify database updates in real-time

Scaling Strategies (For Discussion)

If given more time to evolve this solution:

  1. Higher Volume Processing

    • Implement async processing with Celery + Redis
    • Add batch upload capabilities
    • Queue-based architecture for parallel processing
  2. Additional Document Types

    • Add document classifier (LLM or lightweight ML model)
    • Route to specialized extraction prompts per document type
    • Schema registry (invoice, PO, receipt, contract, etc.)
  3. Cost & Performance Optimization

    • Cache OCR results to avoid re-processing
    • Use cheaper models for OCR text cleanup
    • Only invoke LLM for semantic extraction, not OCR
  4. Production Deployment

    • Dockerize Flask + Next.js applications
    • Host on AWS/GCP/Azure
    • Store files in S3/cloud storage
    • Use Postgres instead of SQLite
    • Add authentication (JWT)
    • Implement rate limiting and monitoring
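As a stepping stone before Celery + Redis, batch uploads can be parallelized in-process. A sketch using only the standard library; `process_invoice` is a placeholder for the real OCR → LLM → DB pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def process_invoice(path: str) -> dict:
    # Placeholder: the real pipeline would run OCR, LLM extraction, and DB save.
    return {"file": path, "status": "done"}

def process_batch(paths, workers: int = 4):
    # OCR and HTTP calls to the LLM are I/O-bound, so threads overlap well.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_invoice, paths))  # preserves input order
```

A real queue-based design would replace `process_invoice` with a Celery task so work survives restarts and scales across machines.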

Technologies

Backend:

  • Flask (Python web framework)
  • PaddleOCR (text extraction)
  • pdfplumber (PDF parsing)
  • Ollama/Groq (LLM inference)
  • SQLAlchemy (ORM)
  • SQLite/Excel (storage)

Frontend:

  • Next.js 14 (React framework)
  • TypeScript (type safety)
  • Tailwind CSS (styling)
  • React Hook Form (form management)
  • Axios (HTTP client)

License

This is a demo project for educational purposes.

Support

If you encounter issues:

  1. Check backend logs for errors
  2. Verify Ollama is running: curl http://localhost:11434/api/tags
  3. Check frontend console for API errors
  4. For OCR issues, PaddleOCR models download automatically on first use
