Production-grade invoice extraction system with a self-correcting LangGraph pipeline, multi-LLM fallback, real-time streaming progress, and full role-based access control.
Watch the LangGraph pipeline run in real time: each node streams its progress to the UI via Server-Sent Events. When the primary LLM fails, the system automatically falls back to a secondary provider without the user noticing.
Manual invoice processing is painful and expensive. Companies receive invoices in dozens of formats (PDFs, Word docs, plain text, scanned images), and someone has to read each one, copy the relevant fields (invoice number, date, total, vendor, line items), and paste them into spreadsheets or ERPs.
Existing OCR tools are brittle: they work on a specific layout, hallucinate when the format changes, have no way to validate their own output, and crash in production when the LLM provider has an outage.
InvoiceReader treats extraction as a stateful, observable, self-correcting workflow, not a single LLM call. Upload an invoice in any format and the system:
- Routes the document to a cost-appropriate model based on complexity
- Extracts structured fields using AI with streaming progress events
- Validates the output against a strict Pydantic schema
- Self-corrects failed fields with targeted retry prompts (no full re-extraction)
- Falls back to a different LLM provider on quota errors or failures
- Persists validated data to a normalized PostgreSQL schema
- Lets users build custom column views, copy individual fields, or export to CSV
The deployed version requires authentication to prevent abuse and protect API costs. If you'd like access to test the live application, please reach out:
Email: lbarretti@gmail.com
I'll be happy to provide you with credentials to explore the full system, including:
- Real-time streaming pipeline visualization
- Multi-format invoice extraction (PDF, DOCX, images, text)
- Customizable history view with 30+ columns
- CSV export and column-level copy operations
- Admin panel for user management
Recruiters, hiring managers, and curious developers are all welcome.
- Multi-format input: PDF, DOCX, TXT, CSV, PNG, JPG/JPEG
- Multi-LLM strategy: Gemini as primary, OpenAI as automatic fallback
- Cost-aware model routing: `gemini_cheap` for simple docs, `gemini_expensive` for complex ones (images, long text)
- Targeted self-correction: failed validations trigger surgical retries with field-specific prompts and document excerpts (not full re-extraction)
- Schema-enforced output: every AI response is validated against Pydantic models before being trusted
- Real-time progress streaming: Server-Sent Events (SSE) push pipeline state to the UI as each node runs
- Role-based access control: admin/user roles with JWT validation on every endpoint
- User management: admin panel for creating, listing, and removing users
- Customizable history view: users select which of 30+ columns to display, with preferences persisted in `localStorage`
- Flexible export: copy by column, copy all visible data as TSV, or download CSV with proper UTF-8 BOM
- Bulk operations: multi-select and delete invoices safely with confirmation
- Production deploy: running on a Hostinger VPS with automated `git push` deployment
- Comprehensive test suite: pytest with fixtures, parametrized cases, and dedicated tests for schemas, nodes, graph routing, preprocessor, prompts, file processor, and security
- Structured logging: request-level logging middleware + named loggers per module
- Security hardening: file size limits, empty file detection, Bearer token enforcement, SQL injection protection, RLS on all Supabase tables
- Type safety end-to-end: Pydantic on the backend, TypeScript on the frontend, shared interface contracts
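The cost-aware routing item above can be illustrated with a minimal sketch. The tier names `gemini_cheap` and `gemini_expensive` come from this README; the `Document` class, the `select_model` function, and the 6000-character threshold are hypothetical stand-ins, since the actual heuristics live in `backend/extraction/preprocessor.py` and are not shown here.

```python
from dataclasses import dataclass

# Hypothetical threshold: the README only says that "long text" counts as complex.
LONG_TEXT_CHARS = 6000


@dataclass
class Document:
    """Hypothetical container for a preprocessed upload."""
    text: str
    is_image: bool = False


def select_model(doc: Document) -> str:
    """Pick a model tier: images and long text go to the expensive tier."""
    if doc.is_image or len(doc.text) > LONG_TEXT_CHARS:
        return "gemini_expensive"
    return "gemini_cheap"


if __name__ == "__main__":
    print(select_model(Document(text="Invoice 123, total 40.00")))
    print(select_model(Document(text="", is_image=True)))
```

Keeping the decision in one small pure function makes it easy to unit-test the routing separately from any LLM call.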
The core differentiator of this project is how the extraction happens, not just that it happens.
A direct call to Gemini or GPT-4 looks like this:
User upload → LLM call → "Trust the output" → Save
This is brittle. If the AI hallucinates a date format, forgets a tax ID, or hits a quota error, you get garbage in your database with no way to recover.
InvoiceReader uses LangGraph to manage extraction as a stateful, decision-driven workflow:
              ┌──────────────────────┐
              │   Document Upload    │
              │  (PDF/DOCX/TXT/IMG)  │
              └──────────┬───────────┘
                         │
              ┌──────────▼───────────┐
              │ Preprocess Document  │
              │ (clean + complexity) │
              └──────────┬───────────┘
                         │
              ┌──────────▼───────────┐
              │     Select Model     │
              │  cheap / expensive   │
              └──────────┬───────────┘
                         │
              ┌──────────▼───────────┐
              │       Extract        │◄───────────┐
              │   (Gemini/OpenAI)    │            │
              └──────────┬───────────┘            │
                         │                        │
              ┌──────────▼───────────┐            │
              │       Validate       │            │
              │  (Pydantic schema)   │            │
              └──────────┬───────────┘            │
                         │                        │
         ┌───────────────┼────────────────┐       │
         │               │                │       │
       Valid       Field errors       API error   │
         │               │                │       │
         ▼               ▼                ▼       │
    ┌──────────┐  ┌──────────────┐  ┌───────────┐ │
    │ Finalize │  │   Targeted   │  │ Fallback  │ │
    │ Success  │  │    Retry     │  │   Model   ├─┘
    └──────────┘  │ (per-field)  │  │ (Gem→GPT) │
                  └──────┬───────┘  └───────────┘
                         │
                         └──► (back to Validate)
| State | Decision | Rationale |
|---|---|---|
| All fields valid | → finalize_success | Done. No retries needed. |
| API error (429, quota, config) | → fallback_model | Switch providers automatically, with no manual intervention |
| Field errors + retries available | → targeted_retry | Re-prompt only on failed fields with relevant document excerpt |
| Retries exhausted, fallback not yet used | → fallback_model | Try a different LLM before giving up |
| All paths exhausted | → finalize_error | Fail loudly rather than save bad data |
This is the difference between a demo and a production system: demos trust the AI; production systems verify, retry, and gracefully degrade.
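The decision table can be read as a single routing function. The sketch below is one possible encoding, not the code in `backend/extraction/graph.py`: the node names and `failed_fields` come from this README, while `api_error`, `retry_count`, `max_retries`, and `fallback_used` are hypothetical state keys, and "all paths exhausted" is assumed to mean that the fallback provider has already been tried.

```python
from typing import List, Optional, TypedDict


class RoutingState(TypedDict, total=False):
    """Hypothetical subset of the pipeline state used for routing."""
    api_error: Optional[str]    # e.g. "429 quota exceeded"
    failed_fields: List[str]    # e.g. ["supplier.name"]
    retry_count: int
    max_retries: int
    fallback_used: bool


def route_after_validation(state: RoutingState) -> str:
    """Return the name of the next node, following the decision table."""
    api_error = state.get("api_error")
    failed_fields = state.get("failed_fields") or []
    fallback_used = state.get("fallback_used", False)
    retries_left = state.get("retry_count", 0) < state.get("max_retries", 2)

    if not api_error and not failed_fields:
        return "finalize_success"
    if api_error:
        # Provider-level failure: switch providers once, then give up.
        return "finalize_error" if fallback_used else "fallback_model"
    if retries_left:
        return "targeted_retry"
    if not fallback_used:
        return "fallback_model"
    return "finalize_error"


if __name__ == "__main__":
    print(route_after_validation({"failed_fields": ["supplier.name"], "retry_count": 0}))
```

In LangGraph, a function of this shape is the kind of callable passed to `add_conditional_edges`, with the returned string mapped to the next node.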
When validation fails on supplier.name and totals.total_amount, naive systems re-run the whole extraction. InvoiceReader extracts the document excerpt most relevant to those failed fields and sends a focused prompt asking only for corrections. This is faster, cheaper, and more accurate.
# From backend/extraction/nodes.py (excerpt)
def targeted_retry_node(state: ExtractionState):
    failed_fields = state.get("failed_fields") or []
    # ... (cleaned_text and keywords are not shown in this excerpt)
    # Find lines containing keywords from the failed field names
    relevant_lines = [
        line for line in cleaned_text.split("\n")
        if any(kw.lower() in line.lower() for kw in keywords)
    ]
    # Cap the excerpt at 1500 characters to keep the retry prompt small
    excerpt = "\n".join(relevant_lines)[:1500]
    prompt = build_targeted_retry_prompt(failed_fields, excerpt)
    # ...

Production AI systems need to feel alive, not frozen. The /api/upload/stream endpoint runs the LangGraph pipeline in a background thread and pushes progress events via Server-Sent Events:
data: {"type":"progress","step":"reading","detail":"Reading and parsing invoice file..."}
data: {"type":"progress","step":"sending_to_ai","detail":"gemini"}
data: {"type":"progress","step":"waiting_for_ai","detail":"gemini_cheap"}
data: {"type":"progress","step":"ai_failed","detail":"429 quota exceeded"}
data: {"type":"progress","step":"trying_new_ai","detail":"openai_cheap"}
data: {"type":"progress","step":"preparing_data","detail":"Structuring..."}
data: {"type":"result","data": {...full extracted invoice...}}
The React frontend reads this stream and animates a real-time pipeline visualization showing each step the agent takes, including fallback transitions when one LLM fails. This dramatically improves perceived reliability and gives users (and developers) real-time observability into the AI's decisions.
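The background-thread-plus-SSE pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code behind `/api/upload/stream`: `run_pipeline` is a hypothetical stand-in for the LangGraph run, the event payloads mirror the sample stream shown earlier, and the invoice number is a placeholder.

```python
import json
import queue
import threading
from typing import Iterator

_DONE = object()  # sentinel that marks the end of the stream


def format_sse(event: dict) -> str:
    """Serialize one event as an SSE 'data:' frame (blank line terminates it)."""
    return "data: " + json.dumps(event, separators=(",", ":")) + "\n\n"


def run_pipeline(events: queue.Queue) -> None:
    """Hypothetical stand-in for the pipeline: pushes progress, then a result."""
    try:
        events.put({"type": "progress", "step": "reading",
                    "detail": "Reading and parsing invoice file..."})
        events.put({"type": "progress", "step": "waiting_for_ai",
                    "detail": "gemini_cheap"})
        events.put({"type": "result", "data": {"invoice_number": "PLACEHOLDER"}})
    finally:
        events.put(_DONE)  # always unblock the consumer, even on failure


def stream_events() -> Iterator[str]:
    """Run the pipeline in a background thread and yield SSE frames as they arrive."""
    events: queue.Queue = queue.Queue()
    worker = threading.Thread(target=run_pipeline, args=(events,), daemon=True)
    worker.start()
    while True:
        event = events.get()
        if event is _DONE:
            break
        yield format_sse(event)
    worker.join()


def parse_sse(frame: str) -> dict:
    """Decode a single 'data:' frame back into a dict (client-side view)."""
    line = frame.strip()
    if not line.startswith("data: "):
        raise ValueError("not an SSE data frame")
    return json.loads(line[len("data: "):])


if __name__ == "__main__":
    for frame in stream_events():
        print(frame, end="")
```

With FastAPI, a generator like `stream_events()` could be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`; whether the project does exactly that is not stated in this README.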
| Component | Technology | Why |
|---|---|---|
| Language | Python 3.11+ | Modern type hints, async support |
| API framework | FastAPI | Async, auto-validation, OpenAPI docs out of the box |
| LLM orchestration | LangGraph 0.2+ | Stateful workflows with conditional routing |
| Validation | Pydantic 2.7+ | Runtime schema enforcement at boundaries |
| Primary LLM | Google Gemini | gemini-3-flash-preview, fast and inexpensive |
| Fallback LLM | OpenAI | gpt-4o-mini / gpt-4o, automatic failover |
| Document parsing | PyPDF2, python-docx, Pillow | Multi-format file reading |
| Testing | pytest | Unit + integration + E2E + security tests |
| Component | Technology | Why |
|---|---|---|
| Framework | React 19 | Latest stable with concurrent features |
| Language | TypeScript 5.8 | Strict type safety end-to-end |
| Build tool | Vite 6 | Fast HMR, optimized production builds |
| Styling | Tailwind CSS 4.1 | Utility-first, oxide compiler |
| Routing | React Router 7 | Latest data-router APIs |
| HTTP | Axios + fetch (SSE) | REST + Server-Sent Events for streaming |
| State | React Context | Auth state and admin role propagation |
| UI primitives | lucide-react, react-dropzone, react-hot-toast | Icons, drag-and-drop, notifications |
| Component | Technology | Why |
|---|---|---|
| Database | Supabase (PostgreSQL) | Managed Postgres + auth + RLS in one service |
| Auth | Supabase Auth | JWT-based, server-validated on every request |
| Hosting | Hostinger VPS | Cost-effective production hosting |
| CI/CD | Git push → auto-deploy | Direct VPS deployment on main branch push |
- Python 3.11+
- Node.js 20+
- A Supabase project (supabase.com; the free tier is enough)
- API keys for Google Gemini and OpenAI
# Clone the repository
git clone https://github.com/leopbar/InvoiceReader.git
cd InvoiceReader
# Backend setup
cd backend
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
# Frontend setup
cd ../frontend
npm install

Create backend/.env:
GOOGLE_API_KEY=your_gemini_api_key
OPENAI_API_KEY=your_openai_api_key
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your_supabase_anon_key
SUPABASE_SERVICE_ROLE_KEY=your_supabase_service_role_key
CORS_ORIGINS=http://localhost:5173,http://127.0.0.1:5173
# Admin bootstrap (used by create_admin.py)
ADMIN_EMAIL=your_admin_email@example.com
ADMIN_PASSWORD=your_strong_admin_password

Create frontend/.env:
VITE_SUPABASE_URL=https://your-project.supabase.co
VITE_SUPABASE_KEY=your_supabase_anon_key
VITE_API_URL=http://localhost:8000/api

Run the SQL schema in your Supabase SQL editor. It creates the tables suppliers, invoices, invoice_items, invoice_addresses, and user_roles with Row Level Security enabled. The full schema is in backend/setup_db.sql.
cd backend
python create_admin.py

The script reads ADMIN_EMAIL and ADMIN_PASSWORD from your .env file, so credentials are never hardcoded in source code.
Option 1: Single command (recommended)
./start.sh

Option 2: Run services separately
# Terminal 1: Backend
cd backend
uvicorn main:app --reload --port 8000
# Terminal 2: Frontend
cd frontend
npm run dev

- Frontend: http://localhost:5173
- Backend API: http://localhost:8000
- Auto-generated API docs (dev only): http://localhost:8000/docs
The repository includes a comprehensive test suite covering schemas, graph logic, file processing, prompts, and security.
cd backend
pytest # Run all unit tests
pytest tests/test_schemas.py -v # Pydantic validation tests
pytest tests/test_graph_routing.py # LangGraph routing logic
pytest tests/test_nodes.py # Individual graph nodes
pytest tests/test_preprocessor.py # Text cleanup + complexity routing
pytest tests/test_file_processor.py # File reading + base64 encoding
pytest tests/test_prompts.py # Prompt templates

# Tests for auth, file size limits, SQL injection (requires running server)
python tests/test_security_api.py

InvoiceReader/
├── backend/
│   ├── extraction/              # LangGraph extraction pipeline
│   │   ├── graph.py             # Workflow definition + routing logic
│   │   ├── nodes.py             # Individual graph nodes
│   │   ├── state.py             # TypedDict state contract
│   │   ├── schemas.py           # Pydantic models (Invoice, etc.)
│   │   ├── llm_clients.py       # Gemini / OpenAI client factory
│   │   ├── preprocessor.py      # Text cleanup + complexity heuristics
│   │   └── prompts.py           # Extraction + targeted retry prompts
│   ├── tests/                   # pytest test suite
│   ├── main.py                  # FastAPI app + endpoints
│   ├── database.py              # Supabase client setup
│   ├── file_processor.py        # PDF/DOCX/image/text parsing
│   ├── supabase_service.py      # DB persistence (suppliers, invoices, items)
│   ├── setup_db.sql             # Schema migration for Supabase
│   ├── create_admin.py          # Bootstrap initial admin user (env-based)
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── pages/               # UploadPage, HistoryPage, etc.
│   │   ├── components/          # ExtractedDataDisplay, etc.
│   │   ├── context/             # AuthContext (session + admin role)
│   │   ├── services/            # api.ts (axios + SSE), supabase.ts
│   │   ├── App.tsx              # Routes + layout
│   │   └── main.tsx
│   ├── package.json
│   └── vite.config.ts
├── assets/                      # README screenshots and GIFs
├── start.sh                     # One-command startup
├── package.json                 # npm workspaces root
└── README.md
This project applies several production security practices:
- JWT validation on every protected endpoint via `Depends(verify_token)`
- Admin-only endpoints double-check role membership in the `user_roles` table
- Service-role Supabase client is server-only and never exposed to the frontend
- Row Level Security (RLS) enabled on every Supabase table
- File size limit (10 MB) and empty-file rejection on upload
- CORS allowlist configurable via env var (no `*` in production with credentials)
- Bearer token format enforced: rejects Basic auth and missing headers
- Cannot delete yourself: admin user deletion guard
- Production mode disables `/docs` and `/redoc` to reduce attack surface
- No credentials in source code: all secrets loaded from `.env` files (never committed)
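
Two of the checks above, Bearer token format enforcement and the upload size guard, are simple enough to sketch without any framework. The 10 MB limit comes from this README; `RequestRejected`, the function names, and the HTTP status codes are assumptions made for illustration, and the real checks presumably live in `backend/main.py`.

```python
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MB limit stated in this README


class RequestRejected(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def extract_bearer_token(authorization):
    """Return the token from an 'Authorization: Bearer <token>' header value."""
    if not authorization:
        raise RequestRejected(401, "Missing Authorization header")
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        # Rejects Basic auth and malformed or empty Bearer values.
        raise RequestRejected(401, "Expected 'Bearer <token>'")
    return token.strip()


def check_upload(content: bytes) -> None:
    """Reject empty files and files over the size limit (status codes assumed)."""
    if len(content) == 0:
        raise RequestRejected(400, "Empty file")
    if len(content) > MAX_UPLOAD_BYTES:
        raise RequestRejected(413, "File exceeds the 10 MB limit")


if __name__ == "__main__":
    print(extract_bearer_token("Bearer abc.def.ghi"))
    check_upload(b"%PDF-1.7 ...")
```

This only validates the header's shape; verifying the JWT itself would still be delegated to the auth provider.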
Note on initial development: the SQL schema currently uses permissive RLS policies (`USING (true)`) for rapid iteration. Production deployments should tighten these to per-user policies (e.g., `USING (auth.uid() = user_id)`).
- Per-user RLS policies on invoices and suppliers
- Langfuse integration for full LLM observability and cost tracking
- Quality metrics dashboard (per-field accuracy, retry rate, fallback frequency)
- Batch upload with parallel pipeline execution
- Golden invoice dataset + automated regression suite
- Multi-language invoice support (currently English-optimized)
- Webhook integration for ERP/accounting systems
- Confidence scores per extracted field (not just pass/fail)
A few specific takeaways from building a real production AI system:
- Pydantic at the boundary is the cheapest insurance. Every malformed LLM response caught is a bug that never reached the database.
- State machines beat chains for non-trivial AI flows. LangGraph's conditional edges make retry and fallback logic explicit and debuggable; the equivalent in plain LangChain becomes a maze of nested `if`s.
- Targeted retries are an order of magnitude cheaper than full re-extraction. Sending only the failed fields + relevant document excerpt back to the LLM costs a fraction of running the full prompt again.
- Multi-provider strategies aren't premature optimization. Both Gemini and OpenAI had outages during development. Auto-fallback turned downtime into a transparent recovery that users never noticed.
- Streaming progress changes UX dramatically. The same pipeline feels twice as fast when users can see it working, even if total latency is identical.
- Type safety end-to-end pays for itself. Pydantic on the server + TypeScript on the client meant zero "field doesn't exist" bugs in production.
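
The multi-provider takeaway can be made concrete with a small fallback loop. This is a generic sketch, not the project's `llm_clients.py`: `ProviderError`, `extract_with_fallback`, and the two stub providers are hypothetical, and real provider calls would replace the stubs.

```python
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Hypothetical error raised by a provider call on quota errors or outages."""


def extract_with_fallback(
    text: str,
    providers: List[Tuple[str, Callable[[str], dict]]],
) -> Tuple[str, dict]:
    """Try each provider in order; return (provider_name, result) from the first success."""
    failures = []
    for name, call in providers:
        try:
            return name, call(text)
        except ProviderError as exc:
            failures.append(f"{name}: {exc}")
    # Fail loudly once every provider has been tried.
    raise RuntimeError("all providers failed: " + "; ".join(failures))


def flaky_gemini(text: str) -> dict:
    """Stub primary provider that simulates a quota error."""
    raise ProviderError("429 quota exceeded")


def stub_openai(text: str) -> dict:
    """Stub fallback provider that returns a trivial result."""
    return {"raw_length": len(text)}


if __name__ == "__main__":
    used, result = extract_with_fallback(
        "Invoice text", [("gemini", flaky_gemini), ("openai", stub_openai)]
    )
    print(used, result)
```

Catching only a dedicated error type keeps programming mistakes from being silently retried on the next provider.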
Leonardo Barretti
Building production AI systems with Python, focusing on robust LLM integration, agentic workflows, and clean engineering practices.
- LinkedIn: linkedin.com/in/leonardo-barretti
- Email: lbarretti@gmail.com
- GitHub: @leopbar
Want to test the live application or discuss the architecture? Feel free to email me at lbarretti@gmail.com. I'm always happy to chat about LLM engineering, agentic workflows, or interesting opportunities.
This project is licensed under the MIT License; see the LICENSE file for details.
If this project taught you something or sparked an idea, consider giving it a star. It helps other developers discover it.
