Invoice Document Extractor

A complete end-to-end application for extracting structured invoice data from PDFs and images using AI-powered OCR and LLM extraction.

Overview

This application demonstrates a production-ready invoice extraction system that:

  • Extracts text from invoices using OCR (PaddleOCR + pdfplumber)
  • Uses free/open-source LLMs (Ollama or Groq) to extract structured data
  • Stores data in SQLite or Excel format
  • Provides a modern web UI for upload, viewing, and editing

New to the codebase? See CODE_GUIDE.md for a short "how to read this code" guide covering where each feature lives, the request flow, and the project's conventions; it is written so anyone can follow it without prior context.
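At its core, the LLM step builds a prompt from the OCR text, asks the model for JSON, and parses the reply defensively, since models often wrap JSON in code fences. A minimal sketch; the prompt wording and field list here are illustrative, not the project's actual prompt:

```python
import json

def build_prompt(ocr_text: str) -> str:
    # Ask the model to return ONLY JSON matching (a subset of) the invoice schema.
    return (
        "Extract the following fields from this invoice text and return ONLY a "
        "JSON object with keys: invoice_number, order_date, total_due.\n\n"
        "Invoice text:\n" + ocr_text
    )

def parse_llm_json(raw: str) -> dict:
    # Models often wrap JSON in ```json ... ``` fences; strip them before parsing.
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        if cleaned.startswith("json"):
            cleaned = cleaned[4:]
    return json.loads(cleaned)
```

Parsing defensively matters because small models occasionally add fences or prose around the JSON they return.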

Architecture

┌─────────────────┐
│  Next.js Frontend │
│  (React/TypeScript)│
└────────┬──────────┘
         │ HTTP/REST
         │
┌────────▼──────────┐
│   Flask Backend   │
│  ├─ OCR Service   │
│  ├─ LLM Service   │
│  └─ DB Service    │
└────────┬──────────┘
         │
┌────────▼──────────┐
│  SQLite / Excel   │
└───────────────────┘

Quick Start

Prerequisites

Install Ollama (Required for LLM processing):

macOS Installation

  1. Install Ollama:

    # Option 1: Direct install (recommended)
    curl -fsSL https://ollama.com/install.sh | sh
    
    # Option 2: Using Homebrew
    brew install ollama
  2. Start Ollama service:

    # Start Ollama in the background (keep this terminal open)
    ollama serve
  3. Download the required model (in a new terminal):

    # Download the lightweight 1B model (~1.3GB, faster processing)
    ollama pull llama3.2:1b
    
    # Alternative: Standard 3B model (~2GB, better accuracy)
    # ollama pull llama3.2

Windows Installation

  1. Install Ollama:

    • Download the installer from https://ollama.com/download
    • Run the .exe file and follow the installation wizard
    • Ollama will automatically start as a Windows service
  2. Verify installation:

    # Open Command Prompt or PowerShell
    ollama --version
  3. Download the required model:

    # Download the lightweight 1B model (~1.3GB, faster processing)
    ollama pull llama3.2:1b
    
    # Alternative: Standard 3B model (~2GB, better accuracy)
    # ollama pull llama3.2

Verify Installation (Both Platforms)

  1. Check if Ollama is running:

    # Should return a list of installed models
    curl http://localhost:11434/api/tags
    
    # Or on Windows PowerShell:
    # Invoke-RestMethod -Uri http://localhost:11434/api/tags

    Expected output:

    {
      "models": [
        {
          "name": "llama3.2:1b",
          "size": 1321098329,
          "parameter_size": "1.2B"
        }
      ]
    }
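If you prefer scripting this check, the same verification takes a few lines of Python. The JSON shape matches the sample response above; the helper itself is just an illustration:

```python
import json
from urllib.request import urlopen

def has_model(tags: dict, name: str) -> bool:
    """Return True if an installed model's name starts with `name`."""
    return any(m["name"].startswith(name) for m in tags.get("models", []))

# Query the local Ollama API (requires `ollama serve` to be running):
# tags = json.load(urlopen("http://localhost:11434/api/tags"))
# print(has_model(tags, "llama3.2"))
```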

Troubleshooting Ollama

  • macOS: If ollama serve fails, try sudo ollama serve or check if port 11434 is in use
  • Windows: If Ollama doesn't start automatically, search for "Ollama" in Start Menu and run it
  • Both: If the model download is slow, retry later or check your internet connection
  • Memory: The 1B model uses ~2GB RAM, the 3B model uses ~4GB RAM

Backend Setup

  1. Navigate to backend:

    cd backend
  2. Create virtual environment:

    # macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
    
    # Windows (Command Prompt)
    python -m venv venv
    venv\Scripts\activate
    
    # Windows (PowerShell)
    python -m venv venv
    venv\Scripts\Activate.ps1
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure environment:

    # macOS/Linux
    cp env.example .env
    
    # Windows
    copy env.example .env
    
    # Edit .env if needed (default settings work with Ollama)
  5. Run backend:

    python app.py

Backend runs on http://localhost:5000

Note: No system dependencies required! PaddleOCR is pure Python and downloads models automatically on first use (~500MB).
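For reference, a minimal .env for the default Ollama setup might look like the following. The variable names are taken from the settings mentioned elsewhere in this README; check env.example for the authoritative list:

```
# LLM backend: local Ollama by default
OLLAMA_MODEL=llama3.2:1b
USE_GROQ=False

# Only needed when USE_GROQ=True:
# GROQ_API_KEY=your_api_key_here
```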

Frontend Setup

  1. Navigate to frontend:

    cd frontend
  2. Install dependencies:

    npm install
  3. Run frontend:

    npm run dev

Frontend runs on http://localhost:3000

Model Performance & Choice

This project is configured to use Llama 3.2:1b by default for optimal performance:

  • Speed: ~2-5 seconds per invoice on modern hardware
  • Memory: Uses ~2GB RAM (vs 4GB+ for larger models)
  • Accuracy: Excellent for structured data extraction from invoices
  • Privacy: Runs completely offline on your machine

Want better accuracy? You can switch to the larger model:

# Download the 3B model
ollama pull llama3.2

# Update backend/config.py or set environment variable:
# OLLAMA_MODEL=llama3.2
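However the project wires this up, the usual pattern is to read the model name from the environment with a sensible default. A sketch of what backend/config.py might contain (the exact names in the real file may differ):

```python
import os

def load_config(env: dict) -> dict:
    """Build runtime settings from an environment mapping."""
    return {
        # Override with e.g. OLLAMA_MODEL=llama3.2 for the larger model
        "model": env.get("OLLAMA_MODEL", "llama3.2:1b"),
        "ollama_url": env.get("OLLAMA_URL", "http://localhost:11434"),
        "use_groq": env.get("USE_GROQ", "False").lower() == "true",
    }

CONFIG = load_config(dict(os.environ))
```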

Alternative LLM Setup

Option B: Groq API (Free, Fast, Cloud-based)

If you prefer not to run Ollama locally:

  1. Sign up at https://console.groq.com
  2. Get your free API key
  3. Update your .env file:
    GROQ_API_KEY=your_api_key_here
    USE_GROQ=True
  4. Restart the backend
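Internally, a backend like this typically branches on the USE_GROQ flag when choosing which endpoint to call. A hedged sketch of the two request shapes (Ollama's local generate endpoint vs. Groq's OpenAI-compatible chat endpoint); the Groq model name here is an assumption, and the real code in services/llm_service.py may differ:

```python
def ollama_payload(prompt: str, model: str = "llama3.2:1b") -> dict:
    # Body for POST http://localhost:11434/api/generate
    return {"model": model, "prompt": prompt, "stream": False}

def groq_payload(prompt: str, model: str = "llama-3.1-8b-instant") -> dict:
    # Body for POST https://api.groq.com/openai/v1/chat/completions
    # (send GROQ_API_KEY as a Bearer token in the Authorization header)
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def build_request(prompt: str, use_groq: bool) -> tuple:
    """Pick the endpoint and payload based on the USE_GROQ setting."""
    if use_groq:
        return "https://api.groq.com/openai/v1/chat/completions", groq_payload(prompt)
    return "http://localhost:11434/api/generate", ollama_payload(prompt)
```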

Usage

  1. Open browser: Navigate to http://localhost:3000
  2. Upload invoice: Drag and drop or click to upload a PDF/image
  3. View extracted data: Data appears automatically after processing
  4. Edit if needed: Click "View/Edit" to modify extracted fields
  5. Save: Click "Save Invoice" to persist to database
  6. View all: Click "All Invoices" tab to see all extracted invoices

Data Model

The application extracts data matching the invoice template and SalesOrderHeader / SalesOrderDetail schemas:

SalesOrderHeader (invoice header + order/shipping)

  • Invoice: invoice_number, order_date, due_date, ship_date
  • Customer: customer_id, customer_name, customer_address (Bill To)
  • Ship To: ship_to_name, ship_to_address
  • Order: purchase_order_number (P.O. #), salesperson, ship_via, terms
  • Totals: subtotal, tax_rate, tax, shipping_handling (S&H), other_charges, total_due

SalesOrderDetail (line items: ITEM #, DESCRIPTION, QTY, UNIT PRICE, TOTAL)

  • product_code (Item #), description, quantity (OrderQty), unit_price, unit_price_discount, line_total

Schema upgrade: If you had an existing invoices.db, delete it so the app can create tables with the new columns (or run a migration to add them).
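The two tables map naturally onto a one-to-many SQLite schema. A trimmed sketch using only a few of the columns listed above (models.py defines the full set):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the app uses invoices.db on disk
conn.executescript("""
CREATE TABLE sales_order_header (
    id INTEGER PRIMARY KEY,
    invoice_number TEXT,
    customer_name TEXT,
    order_date TEXT,
    total_due REAL
);
CREATE TABLE sales_order_detail (
    id INTEGER PRIMARY KEY,
    header_id INTEGER REFERENCES sales_order_header(id),
    product_code TEXT,   -- Item #
    description TEXT,
    quantity REAL,       -- OrderQty
    unit_price REAL,
    line_total REAL
);
""")

# One header row with one line item:
conn.execute(
    "INSERT INTO sales_order_header (invoice_number, total_due) VALUES (?, ?)",
    ("INV-001", 150.0),
)
conn.execute(
    "INSERT INTO sales_order_detail "
    "(header_id, product_code, quantity, unit_price, line_total) "
    "VALUES (1, 'A-1', 3, 50.0, 150.0)"
)
```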

Project Structure

document_extractor_demo/
├── backend/
│   ├── app.py              # Flask application
│   ├── config.py           # Configuration
│   ├── models.py           # Database models
│   ├── requirements.txt    # Python dependencies
│   ├── services/           # Business logic
│   │   ├── ocr_service.py
│   │   ├── llm_service.py
│   │   └── db_service.py
│   └── README.md
├── frontend/
│   ├── app/                # Next.js app directory
│   ├── components/         # React components
│   ├── lib/                # Utilities
│   ├── package.json
│   └── README.md
└── README.md

Features

  • OCR Extraction: Supports PDFs and images (PNG, JPG, TIFF, BMP)
  • LLM Extraction: Uses free/open-source LLMs (Ollama or Groq)
  • Structured Data: Extracts to a normalized schema
  • Database Storage: SQLite (default) or Excel
  • Modern UI: Drag & drop upload, editable forms
  • Error Handling: Comprehensive error messages
  • Modular Code: DRY, production-ready architecture

Demo Flow

Before Extraction:

  • Show empty database (SQLite or Excel)
  • Explain the SalesOrderHeader and SalesOrderDetail schema

During Extraction:

  1. Upload Invoice A (classic corporate layout)

    • Drag & drop or click to upload
    • Show real-time processing indicator
    • System extracts structured data automatically
    • Data appears in UI immediately
  2. Upload Invoice B (different layout - minimal/modern)

    • Different visual layout, same extraction process
    • Demonstrates LLM's ability to handle varied formats
    • Same normalized schema output

After Extraction:

  3. View Database

    • Show all extracted invoices in database
    • Demonstrate before/after comparison
  4. Edit & Save

    • Click "View/Edit" on any invoice
    • Modify extracted values in the form
    • Save changes to database
    • Verify database updates in real-time

Scaling Strategies (For Discussion)

If given more time to evolve this solution:

  1. Higher Volume Processing

    • Implement async processing with Celery + Redis
    • Add batch upload capabilities
    • Queue-based architecture for parallel processing
  2. Additional Document Types

    • Add document classifier (LLM or lightweight ML model)
    • Route to specialized extraction prompts per document type
    • Schema registry (invoice, PO, receipt, contract, etc.)
  3. Cost & Performance Optimization

    • Cache OCR results to avoid re-processing
    • Use cheaper models for OCR text cleanup
    • Only invoke LLM for semantic extraction, not OCR
  4. Production Deployment

    • Dockerize Flask + Next.js applications
    • Host on AWS/GCP/Azure
    • Store files in S3/cloud storage
    • Use Postgres instead of SQLite
    • Add authentication (JWT)
    • Implement rate limiting and monitoring
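As a stepping stone before Celery + Redis, batch uploads can be parallelized in-process. A sketch using only the standard library; `process_invoice` is a placeholder for the real OCR → LLM → DB pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def process_invoice(path: str) -> dict:
    # Placeholder: the real pipeline would run OCR, LLM extraction, and DB save.
    return {"file": path, "status": "done"}

def process_batch(paths, workers: int = 4):
    # OCR and HTTP calls to the LLM are I/O-bound, so threads overlap well.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_invoice, paths))  # preserves input order
```

A real queue-based design would replace `process_invoice` with a Celery task so work survives restarts and scales across machines.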

Technologies

Backend:

  • Flask (Python web framework)
  • PaddleOCR (text extraction)
  • pdfplumber (PDF parsing)
  • Ollama/Groq (LLM inference)
  • SQLAlchemy (ORM)
  • SQLite/Excel (storage)

Frontend:

  • Next.js 14 (React framework)
  • TypeScript (type safety)
  • Tailwind CSS (styling)
  • React Hook Form (form management)
  • Axios (HTTP client)

License

This is a demo project for educational purposes.

Support

If you encounter issues:

  1. Check backend logs for errors
  2. Verify Ollama is running: curl http://localhost:11434/api/tags
  3. Check frontend console for API errors
  4. For OCR issues, PaddleOCR models download automatically on first use
