leopbar/InvoiceReader

📄 InvoiceReader

Production-grade invoice extraction system with self-correcting LangGraph pipeline, multi-LLM fallback, real-time streaming progress, and full role-based access control.



🎬 See It In Action

InvoiceReader live extraction pipeline

Watch the LangGraph pipeline run in real time: each node streams its progress to the UI via Server-Sent Events. When the primary LLM fails, the system automatically falls back to a secondary provider without the user noticing.


🎯 The Problem

Manual invoice processing is painful and expensive. Companies receive invoices in dozens of formats (PDFs, Word docs, plain text, scanned images), and someone has to read each one, copy the relevant fields (invoice number, date, total, vendor, line items), and paste them into spreadsheets or ERPs.

Existing OCR tools are brittle: they work on a specific layout, hallucinate when the format changes, have no way to validate their own output, and crash in production when the LLM provider has an outage.

💡 The Solution

InvoiceReader treats extraction as a stateful, observable, self-correcting workflow, not a single LLM call. Upload an invoice in any supported format and the system:

  1. Routes the document to a cost-appropriate model based on complexity
  2. Extracts structured fields using AI with streaming progress events
  3. Validates the output against a strict Pydantic schema
  4. Self-corrects failed fields with targeted retry prompts (no full re-extraction)
  5. Falls back to a different LLM provider on quota errors or failures
  6. Persists validated data to a normalized PostgreSQL schema
  7. Lets users build custom column views, copy individual fields, or export to CSV

🌐 Try the live demo →


🧪 Want to Test the Application?

The deployed version requires authentication to prevent abuse and protect API costs. If you'd like access to test the live application, please reach out:

📧 Email: lbarretti@gmail.com

I'll be happy to provide you with credentials to explore the full system, including:

  • Real-time streaming pipeline visualization
  • Multi-format invoice extraction (PDF, DOCX, images, text)
  • Customizable history view with 30+ columns
  • CSV export and column-level copy operations
  • Admin panel for user management

Recruiters, hiring managers, and curious developers are all welcome.


✨ Key Features

Extraction Pipeline

  • 📁 Multi-format input: PDF, DOCX, TXT, CSV, PNG, JPG/JPEG
  • 🧠 Multi-LLM strategy: Gemini as primary, OpenAI as automatic fallback
  • 🎯 Cost-aware model routing: gemini_cheap for simple docs, gemini_expensive for complex ones (images, long text)
  • 🔁 Targeted self-correction: failed validations trigger surgical retries with field-specific prompts and document excerpts (not full re-extraction)
  • ✅ Schema-enforced output: every AI response is validated against Pydantic models before being trusted
  • 📡 Real-time progress streaming: Server-Sent Events (SSE) push pipeline state to the UI as each node runs
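
A minimal sketch of the cost-aware routing idea, assuming a simple heuristic in which images and long documents go to the expensive tier. The names `Document` and `select_model_tier` and the 4000-character threshold are hypothetical and are not taken from the repository; only the tier names `gemini_cheap` and `gemini_expensive` come from the list above.

```python
import os
from dataclasses import dataclass

# Hypothetical threshold: the README says "long text" is routed to the
# expensive tier but does not state a number.
LONG_TEXT_CHARS = 4000
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


@dataclass(frozen=True)
class Document:
    filename: str
    text: str  # extracted text; empty for image-only uploads


def select_model_tier(doc: Document) -> str:
    """Return a model tier name for a document (illustrative heuristic only)."""
    extension = os.path.splitext(doc.filename)[1].lower()
    if extension in IMAGE_EXTENSIONS:
        return "gemini_expensive"
    if len(doc.text) > LONG_TEXT_CHARS:
        return "gemini_expensive"
    return "gemini_cheap"
```

Keeping the decision in one pure function makes the routing rule easy to unit test and to tune without touching the extraction code.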

Application Features

  • 🔐 Role-based access control: admin/user roles with JWT validation on every protected endpoint
  • 👥 User management: admin panel for creating, listing, and removing users
  • 📊 Customizable history view: users select which of 30+ columns to display, with preferences persisted in localStorage
  • 📤 Flexible export: copy by column, copy all visible data as TSV, or download CSV with a UTF-8 BOM
  • 🗑️ Bulk operations: multi-select and delete invoices safely with confirmation
  • 🚀 Production deploy: running on a Hostinger VPS with automated deployment on git push
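
The UTF-8 BOM mentioned in the export bullet matters because some spreadsheet applications otherwise misread accented characters. The README does not show where the export is implemented, so the Python sketch below only illustrates the technique; the function name `rows_to_csv_bytes` and the column names in the usage note are hypothetical.

```python
import csv
import io


def rows_to_csv_bytes(columns: list[str], rows: list[dict[str, object]]) -> bytes:
    """Serialize rows to CSV bytes prefixed with a UTF-8 byte order mark."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" drops keys that are not among the visible columns.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    # The "utf-8-sig" codec prepends the BOM bytes EF BB BF when encoding.
    return buffer.getvalue().encode("utf-8-sig")
```

For example, `rows_to_csv_bytes(["invoice_number", "vendor"], rows)` would export only those two columns, whatever other keys each row carries.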

Engineering Quality

  • 🧪 Comprehensive test suite: pytest with fixtures, parametrized cases, and dedicated tests for schemas, nodes, graph routing, preprocessor, prompts, file processor, and security
  • 📝 Structured logging: request-level logging middleware + named loggers per module
  • 🛡️ Security hardening: file size limits, empty file detection, Bearer token enforcement, SQL injection protection, RLS on all Supabase tables
  • 🔧 Type safety end-to-end: Pydantic on the backend, TypeScript on the frontend, shared interface contracts

🏗️ Architecture

The core differentiator of this project is how the extraction happens, not just that it happens.

Why LangGraph instead of a single LLM call?

A direct call to Gemini or GPT-4 looks like this:

User upload → LLM call → "Trust the output" → Save

This is brittle. If the AI hallucinates a date format, forgets a tax ID, or hits a quota error, you get garbage in your database with no way to recover.

InvoiceReader uses LangGraph to manage extraction as a stateful, decision-driven workflow:

                    ┌──────────────────────┐
                    │   Document Upload    │
                    │  (PDF/DOCX/TXT/IMG)  │
                    └──────────┬───────────┘
                               │
                    ┌──────────▼───────────┐
                    │  Preprocess Document │
                    │  (clean + complexity)│
                    └──────────┬───────────┘
                               │
                    ┌──────────▼───────────┐
                    │     Select Model     │
                    │  cheap / expensive   │
                    └──────────┬───────────┘
                               │
                    ┌──────────▼───────────┐
                    │      Extract         │ ◄────────┐
                    │   (Gemini/OpenAI)    │          │
                    └──────────┬───────────┘          │
                               │                      │
                    ┌──────────▼───────────┐          │
                    │     Validate         │          │
                    │  (Pydantic schema)   │          │
                    └──────────┬───────────┘          │
                               │                      │
                ┌──────────────┼──────────────┐       │
                │              │              │       │
        ✅ Valid         ⚠️ Field errors  ❌ API error│
                │              │              │       │
                ▼              ▼              ▼       │
         ┌──────────┐  ┌──────────────┐ ┌───────────┐ │
         │ Finalize │  │   Targeted   │ │ Fallback  │ │
         │ Success  │  │    Retry     │ │  Model    │─┘
         └──────────┘  │ (per-field)  │ │ (Gem→GPT) │
                       └──────┬───────┘ └───────────┘
                              │
                              └─► (back to Validate)

Routing decisions (from route_after_validate)

| State | Decision | Rationale |
| --- | --- | --- |
| All fields valid ✅ | → finalize_success | Done. No retries needed. |
| API error (429, quota, config) | → fallback_model | Switch providers automatically, with no manual intervention |
| Field errors + retries available | → targeted_retry | Re-prompt only on failed fields with the relevant document excerpt |
| Retries exhausted, fallback not yet used | → fallback_model | Try a different LLM before giving up |
| All paths exhausted | → finalize_error | Fail loudly, never with bad data |

This is the difference between a demo and a production system: demos trust the AI; production systems verify, retry, and gracefully degrade.
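
The routing table above can be sketched as a single pure function. The name `route_after_validate` and the node names come from the README; the function body, the state keys other than `failed_fields`, and the default of two retries are assumptions made for illustration, not the repository's implementation.

```python
from typing import Optional, TypedDict


class RoutingState(TypedDict, total=False):
    # Hypothetical state keys; only failed_fields appears in the README.
    failed_fields: list[str]
    api_error: Optional[str]
    retry_count: int
    max_retries: int
    fallback_used: bool


def route_after_validate(state: RoutingState) -> str:
    """Map the current state to the name of the next node."""
    failed = state.get("failed_fields") or []
    api_error = state.get("api_error")
    fallback_used = state.get("fallback_used", False)
    retries_left = state.get("retry_count", 0) < state.get("max_retries", 2)

    if not api_error and not failed:
        return "finalize_success"
    if api_error:
        # Provider-level failure: re-prompting the same provider will not help.
        return "finalize_error" if fallback_used else "fallback_model"
    if retries_left:
        return "targeted_retry"
    if not fallback_used:
        return "fallback_model"
    return "finalize_error"
```

In LangGraph, a function of this shape is typically registered with `add_conditional_edges` on the node that performs validation, so each returned string selects the next node.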

Why targeted retry instead of full re-extraction

When validation fails on supplier.name and totals.total_amount, naive systems re-run the whole extraction. InvoiceReader extracts the document excerpt most relevant to those failed fields and sends a focused prompt asking only for corrections. This is faster, cheaper, and more accurate.

# From backend/extraction/nodes.py (abridged excerpt)
def targeted_retry_node(state: ExtractionState):
    failed_fields = state.get("failed_fields") or []
    # cleaned_text and keywords are defined in the full source (elided here).
    # Find lines containing keywords from the failed field names
    relevant_lines = [
        line for line in cleaned_text.split("\n")
        if any(kw.lower() in line.lower() for kw in keywords)
    ]
    # Cap the excerpt at 1500 characters to keep the retry prompt small
    excerpt = "\n".join(relevant_lines)[:1500]
    prompt = build_targeted_retry_prompt(failed_fields, excerpt)
    # ...
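
A self-contained sketch of the excerpt selection, assuming keywords are looked up per failed field with a fallback to the field's leaf name. The `FIELD_KEYWORDS` mapping and the helper names `keywords_for` and `build_excerpt` are hypothetical; only the filtering expression and the 1500-character cap mirror the excerpt above.

```python
# Hypothetical mapping from schema field paths to search keywords.
FIELD_KEYWORDS: dict[str, list[str]] = {
    "supplier.name": ["supplier", "vendor", "from"],
    "totals.total_amount": ["total", "amount due", "balance"],
}


def keywords_for(failed_fields: list[str]) -> list[str]:
    """Collect keywords per failed field, defaulting to the field's leaf name."""
    keywords: list[str] = []
    for field in failed_fields:
        leaf = field.split(".")[-1].replace("_", " ")
        for kw in FIELD_KEYWORDS.get(field, [leaf]):
            if kw not in keywords:
                keywords.append(kw)
    return keywords


def build_excerpt(cleaned_text: str, failed_fields: list[str], max_chars: int = 1500) -> str:
    """Keep only the lines that mention a keyword, capped at max_chars."""
    keywords = keywords_for(failed_fields)
    relevant_lines = [
        line for line in cleaned_text.split("\n")
        if any(kw.lower() in line.lower() for kw in keywords)
    ]
    return "\n".join(relevant_lines)[:max_chars]
```

Because the match is a plain substring test, short keywords can pull in unrelated lines; the character cap bounds the cost of such false positives.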

Why streaming progress (SSE) matters

Production AI systems need to feel alive, not frozen. The /api/upload/stream endpoint runs the LangGraph in a background thread and pushes progress events via Server-Sent Events:

data: {"type":"progress","step":"reading","detail":"Reading and parsing invoice file..."}
data: {"type":"progress","step":"sending_to_ai","detail":"gemini"}
data: {"type":"progress","step":"waiting_for_ai","detail":"gemini_cheap"}
data: {"type":"progress","step":"ai_failed","detail":"429 quota exceeded"}
data: {"type":"progress","step":"trying_new_ai","detail":"openai_cheap"}
data: {"type":"progress","step":"preparing_data","detail":"Structuring..."}
data: {"type":"result","data": {...full extracted invoice...}}

The React frontend reads this stream and animates a real-time pipeline visualization showing each step the agent takes, including fallback transitions when one LLM fails. This dramatically improves perceived reliability and gives users (and developers) real-time observability into the AI's decisions.
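
The background-thread pattern described above can be sketched with the standard library alone. This is an illustration under assumptions, not the project's endpoint: `stream_pipeline`, `format_sse`, the `report` callback, and the `error` event type are hypothetical names, while the `progress` and `result` event shapes follow the sample stream shown earlier.

```python
import json
import queue
import threading
from collections.abc import Callable, Iterator

_DONE = object()  # sentinel marking the end of the stream


def format_sse(payload: dict) -> str:
    """Encode one payload as a Server-Sent Events message."""
    body = json.dumps(payload, separators=(",", ":"))
    return f"data: {body}\n\n"


def stream_pipeline(
    run_pipeline: Callable[[Callable[[str, str], None]], dict],
) -> Iterator[str]:
    """Run the pipeline in a background thread and yield SSE messages as they arrive."""
    events: queue.Queue = queue.Queue()

    def report(step: str, detail: str) -> None:
        events.put({"type": "progress", "step": step, "detail": detail})

    def worker() -> None:
        try:
            events.put({"type": "result", "data": run_pipeline(report)})
        except Exception as exc:  # surface failures instead of leaving the client waiting
            events.put({"type": "error", "detail": str(exc)})
        finally:
            events.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while (item := events.get()) is not _DONE:
        yield format_sse(item)
```

In a FastAPI app, a generator like this can be wrapped in `StreamingResponse(..., media_type="text/event-stream")`. Each message ends with a blank line, which is how SSE clients detect message boundaries.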


🛠️ Tech Stack

Backend

| Component | Technology | Why |
| --- | --- | --- |
| Language | Python 3.11+ | Modern type hints, async support |
| API framework | FastAPI | Async, auto-validation, OpenAPI docs out of the box |
| LLM orchestration | LangGraph 0.2+ | Stateful workflows with conditional routing |
| Validation | Pydantic 2.7+ | Runtime schema enforcement at boundaries |
| Primary LLM | Google Gemini | gemini-3-flash-preview, fast and inexpensive |
| Fallback LLM | OpenAI | gpt-4o-mini / gpt-4o, automatic failover |
| Document parsing | PyPDF2, python-docx, Pillow | Multi-format file reading |
| Testing | pytest | Unit + integration + E2E + security tests |

Frontend

| Component | Technology | Why |
| --- | --- | --- |
| Framework | React 19 | Latest stable with concurrent features |
| Language | TypeScript 5.8 | Strict type safety end-to-end |
| Build tool | Vite 6 | Fast HMR, optimized production builds |
| Styling | Tailwind CSS 4.1 | Utility-first, oxide compiler |
| Routing | React Router 7 | Latest data-router APIs |
| HTTP | Axios + fetch (SSE) | REST + Server-Sent Events for streaming |
| State | React Context | Auth state and admin role propagation |
| UI primitives | lucide-react, react-dropzone, react-hot-toast | Icons, drag-and-drop, notifications |

Infrastructure

| Component | Technology | Why |
| --- | --- | --- |
| Database | Supabase (PostgreSQL) | Managed Postgres + auth + RLS in one service |
| Auth | Supabase Auth | JWT-based, server-validated on every request |
| Hosting | Hostinger VPS | Cost-effective production hosting |
| CI/CD | Git push → auto-deploy | Direct VPS deployment on main branch push |

🚀 Getting Started

Prerequisites

  • Python 3.11+
  • Node.js 20+
  • A Supabase project (supabase.com; the free tier is enough)
  • API keys for Google Gemini and OpenAI

Installation

# Clone the repository
git clone https://github.com/leopbar/InvoiceReader.git
cd InvoiceReader

# Backend setup
cd backend
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Frontend setup
cd ../frontend
npm install

Environment Variables

Create backend/.env:

GOOGLE_API_KEY=your_gemini_api_key
OPENAI_API_KEY=your_openai_api_key
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your_supabase_anon_key
SUPABASE_SERVICE_ROLE_KEY=your_supabase_service_role_key
CORS_ORIGINS=http://localhost:5173,http://127.0.0.1:5173

# Admin bootstrap (used by create_admin.py)
ADMIN_EMAIL=your_admin_email@example.com
ADMIN_PASSWORD=your_strong_admin_password

Create frontend/.env:

VITE_SUPABASE_URL=https://your-project.supabase.co
VITE_SUPABASE_KEY=your_supabase_anon_key
VITE_API_URL=http://localhost:8000/api
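
A small sketch of how the backend values above might be read at startup, assuming the app fails fast on missing keys and rejects a wildcard CORS origin. The helper names `parse_cors_origins` and `load_settings`, and the choice of which variables count as required, are assumptions; the variable names themselves come from backend/.env above.

```python
import os
from typing import Mapping, Optional


def parse_cors_origins(raw: Optional[str]) -> list[str]:
    """Split a comma-separated CORS_ORIGINS value into a clean allowlist."""
    origins = [o.strip().rstrip("/") for o in (raw or "").split(",") if o.strip()]
    if "*" in origins:
        # A wildcard origin cannot be combined with credentialed requests.
        raise ValueError("CORS_ORIGINS must list explicit origins, not '*'")
    return origins


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, object]:
    """Read backend settings, failing fast when a required one is missing."""
    # Assumed to be required; adjust to match what the app actually needs.
    required = ["GOOGLE_API_KEY", "OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_KEY"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        missing_names = ", ".join(missing)
        raise RuntimeError(f"Missing environment variables: {missing_names}")
    settings: dict[str, object] = {name: env[name] for name in required}
    settings["CORS_ORIGINS"] = parse_cors_origins(env.get("CORS_ORIGINS"))
    return settings
```

Passing the environment in as a mapping keeps the function testable without mutating `os.environ`.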

Database Setup

Run the SQL schema in your Supabase SQL editor (creates tables for suppliers, invoices, invoice_items, invoice_addresses, and user_roles with Row Level Security enabled). The full schema is in backend/setup_db.sql.

Create the first admin user

cd backend
python create_admin.py

The script reads ADMIN_EMAIL and ADMIN_PASSWORD from your .env file, so credentials are never hardcoded in source code.

Run the application

Option 1: Single command (recommended)

./start.sh

Option 2: Run services separately

# Terminal 1: Backend
cd backend
uvicorn main:app --reload --port 8000

# Terminal 2: Frontend
cd frontend
npm run dev

  • Frontend: http://localhost:5173
  • Backend API: http://localhost:8000
  • Auto-generated API docs (dev only): http://localhost:8000/docs

🧪 Testing

The repository includes a comprehensive test suite covering schemas, graph logic, file processing, prompts, and security.

cd backend
pytest                              # Run all unit tests
pytest tests/test_schemas.py -v     # Pydantic validation tests
pytest tests/test_graph_routing.py  # LangGraph routing logic
pytest tests/test_nodes.py          # Individual graph nodes
pytest tests/test_preprocessor.py   # Text cleanup + complexity routing
pytest tests/test_file_processor.py # File reading + base64 encoding
pytest tests/test_prompts.py        # Prompt templates

Security boundary tests

# Tests for auth, file size limits, SQL injection (requires running server)
python tests/test_security_api.py

📚 Project Structure

InvoiceReader/
├── backend/
│   ├── extraction/                 # LangGraph extraction pipeline
│   │   ├── graph.py                # Workflow definition + routing logic
│   │   ├── nodes.py                # Individual graph nodes
│   │   ├── state.py                # TypedDict state contract
│   │   ├── schemas.py              # Pydantic models (Invoice, etc.)
│   │   ├── llm_clients.py          # Gemini / OpenAI client factory
│   │   ├── preprocessor.py         # Text cleanup + complexity heuristics
│   │   └── prompts.py              # Extraction + targeted retry prompts
│   ├── tests/                      # pytest test suite
│   ├── main.py                     # FastAPI app + endpoints
│   ├── database.py                 # Supabase client setup
│   ├── file_processor.py           # PDF/DOCX/image/text parsing
│   ├── supabase_service.py         # DB persistence (suppliers, invoices, items)
│   ├── setup_db.sql                # Schema migration for Supabase
│   ├── create_admin.py             # Bootstrap initial admin user (env-based)
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── pages/                  # UploadPage, HistoryPage, etc.
│   │   ├── components/             # ExtractedDataDisplay, etc.
│   │   ├── context/                # AuthContext (session + admin role)
│   │   ├── services/               # api.ts (axios + SSE), supabase.ts
│   │   ├── App.tsx                 # Routes + layout
│   │   └── main.tsx
│   ├── package.json
│   └── vite.config.ts
├── assets/                         # README screenshots and GIFs
├── start.sh                        # One-command startup
├── package.json                    # npm workspaces root
└── README.md

🔐 Security Considerations

This project applies several production security practices:

  • JWT validation on every protected endpoint via Depends(verify_token)
  • Admin-only endpoints double-check role membership in user_roles table
  • Service-role Supabase client is server-only and never exposed to the frontend
  • Row Level Security (RLS) enabled on every Supabase table
  • File size limit (10 MB) and empty-file rejection on upload
  • CORS allowlist configurable via env var (no * in production with credentials)
  • Bearer token format enforced: rejects Basic auth and missing headers
  • Self-deletion guard: an admin cannot delete their own account
  • Production mode disables /docs and /redoc to reduce attack surface
  • No credentials in source code: all secrets are loaded from .env files (never committed)
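
Two of the checks above, Bearer-format enforcement and the upload size limits, are simple enough to sketch without a framework. Everything here is illustrative: `RequestRejected`, `extract_bearer_token`, `check_upload`, and the HTTP status codes are assumptions, and verifying the JWT itself is out of scope for this sketch.

```python
from typing import Optional

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MB, per the list above


class RequestRejected(Exception):
    """Hypothetical exception carrying an HTTP status code."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def extract_bearer_token(authorization: Optional[str]) -> str:
    """Return the token from an 'Authorization: Bearer <token>' header value."""
    if not authorization:
        raise RequestRejected(401, "Missing Authorization header")
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        raise RequestRejected(401, "Authorization header must use the Bearer scheme")
    return token.strip()


def check_upload(content: bytes) -> None:
    """Reject empty and oversized uploads before any parsing happens."""
    if len(content) == 0:
        raise RequestRejected(400, "Uploaded file is empty")
    if len(content) > MAX_UPLOAD_BYTES:
        raise RequestRejected(413, "Uploaded file exceeds the 10 MB limit")
```

Running these checks before the file reaches a parser or an LLM keeps malformed requests from consuming API budget.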

⚠️ Note on initial development: the SQL schema currently uses permissive RLS policies (USING (true)) for rapid iteration. Production deployments should tighten these to per-user policies (e.g., USING (auth.uid() = user_id)).


🛣️ Roadmap

  • Per-user RLS policies on invoices and suppliers
  • Langfuse integration for full LLM observability and cost tracking
  • Quality metrics dashboard (per-field accuracy, retry rate, fallback frequency)
  • Batch upload with parallel pipeline execution
  • Golden invoice dataset + automated regression suite
  • Multi-language invoice support (currently English-optimized)
  • Webhook integration for ERP/accounting systems
  • Confidence scores per extracted field (not just pass/fail)

🤔 What I Learned Building This

A few specific takeaways from building a real production AI system:

  • Pydantic at the boundary is the cheapest insurance. Every malformed LLM response caught is a bug that never reached the database.
  • State machines beat chains for non-trivial AI flows. LangGraph's conditional edges make retry and fallback logic explicit and debuggable; the equivalent in plain LangChain becomes a maze of nested ifs.
  • Targeted retries are an order of magnitude cheaper than full re-extraction. Sending only the failed fields + relevant document excerpt back to the LLM costs a fraction of running the full prompt again.
  • Multi-provider strategies aren't premature optimization. Both Gemini and OpenAI had outages during development. Auto-fallback turned downtime into a transparent recovery; users never noticed.
  • Streaming progress changes UX dramatically. The same pipeline feels twice as fast when users can see it working, even if total latency is identical.
  • Type safety end-to-end pays for itself. Pydantic on the server + TypeScript on the client meant zero "field doesn't exist" bugs in production.

👤 About the Author

Leonardo Barretti

Building production AI systems with Python, focusing on robust LLM integration, agentic workflows, and clean engineering practices.

💬 Want to test the live application or discuss the architecture? Feel free to email me at lbarretti@gmail.com. I'm always happy to chat about LLM engineering, agentic workflows, or interesting opportunities.


📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


If this project taught you something or sparked an idea, consider giving it a ⭐. It helps other developers discover it.
