AI Document Intelligence Platform (POC) - Overview

This repository is the public overview of the organization's AI Document Intelligence Platform project.

It intentionally contains no application code.
It explains the product vision, architecture, capabilities, roadmap, and deployment links.


1. Project Purpose

Build a configurable, multi-tenant platform that can:

  • Read unstructured documents from different departments
  • Convert them into structured JSON
  • Store and organize output by tenant and department
  • Prepare data for analytics, automation, and future AI/ML workflows

The product is designed to work across multiple business functions, not only healthcare.


2. Department-Driven Model

The platform now supports department isolation.
A department is selected before upload/configuration, and data remains scoped to that department.

Current department set:

  • Clinic/Pharma
  • HR
  • Billing/Finance
  • Electricity Bills
  • Water Bills
  • Sales
  • Purchasing
  • Store/Stock

What this means in practice:

  • Each department has its own extraction template
  • Uploads are tagged by department
  • Document views and downstream reports can be filtered per department
  • Future training/analytics can be run department-wise
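
As an illustration of department scoping, here is a minimal sketch of tagging uploads and filtering by tenant and department. The department slugs, the `DocumentRecord` class, and both function names are assumptions made for this example; they are not taken from the platform's code.

```python
from dataclasses import dataclass

# Slugs for the departments listed above; the exact spellings are assumed.
DEPARTMENTS = {
    "clinic_pharma", "hr", "billing_finance", "electricity_bills",
    "water_bills", "sales", "purchasing", "store_stock",
}


@dataclass(frozen=True)
class DocumentRecord:
    """Hypothetical metadata row for one uploaded document."""
    doc_id: int
    tenant_id: str
    department: str
    filename: str


def tag_upload(doc_id: int, tenant_id: str, department: str, filename: str) -> DocumentRecord:
    """Attach a department tag at upload time, rejecting unknown departments."""
    if department not in DEPARTMENTS:
        raise ValueError(f"unknown department: {department!r}")
    return DocumentRecord(doc_id, tenant_id, department, filename)


def documents_for(records, tenant_id: str, department: str):
    """Return only the records that belong to one tenant and one department."""
    return [
        r for r in records
        if r.tenant_id == tenant_id and r.department == department
    ]


records = [
    tag_upload(1, "acme", "hr", "offer_letter.pdf"),
    tag_upload(2, "acme", "sales", "invoice_0042.pdf"),
    tag_upload(3, "globex", "hr", "contract.docx"),
]
hr_docs = documents_for(records, "acme", "hr")
```

In a real deployment the same two-key filter would more likely live in the database query than in application code; the sketch only shows the scoping rule itself.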

3. Core Product Capabilities

A) Document Reader and Extractor

  • Upload mixed formats (PDF, DOC, DOCX, images)
  • OCR + native parsing for text acquisition
  • LLM-based structured extraction into JSON
  • Error tracking per document for actionable failure handling
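
One way to picture the mixed-format intake is a routing table keyed by file extension, with failures captured per document instead of aborting a batch. This is a sketch under assumptions: the route names, the image extensions, and the `(route, error)` return shape are illustrative, not the platform's actual interface.

```python
from pathlib import Path

# Extension -> text acquisition route. Route names are hypothetical labels;
# the image extensions are an assumed subset of the supported formats.
ROUTES = {
    ".pdf": "pdf_parsing_with_ocr",
    ".docx": "docx_parser",
    ".doc": "doc_parser",
    ".png": "image_ocr",
    ".jpg": "image_ocr",
    ".jpeg": "image_ocr",
}


def choose_text_route(filename: str) -> str:
    """Pick a text acquisition route from the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ROUTES:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    return ROUTES[suffix]


def acquire(filename: str):
    """Return (route, error); the error string is what would be stored per document."""
    try:
        return choose_text_route(filename), None
    except ValueError as exc:
        return None, str(exc)
```

Storing the error string next to the document record is what makes a failure actionable in the UI: the user sees why a specific file was rejected.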

B) Configurable Extraction Templates

  • Non-technical field configuration from UI
  • Field-wise controls: field_name, data_type, description, format_rules
  • Department-level save and isolation
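
The four field-wise controls map naturally onto a small data structure. The sketch below assumes a set of allowed `data_type` values and a JSON serialization that could be handed to the extraction step; the class names, the allowed types, and `to_prompt_spec` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass

# Assumed set of data types; the real UI may offer different options.
ALLOWED_TYPES = {"string", "number", "date", "boolean"}


@dataclass
class TemplateField:
    """One configurable field, using the four controls named above."""
    field_name: str
    data_type: str
    description: str = ""
    format_rules: str = ""

    def __post_init__(self):
        if not self.field_name:
            raise ValueError("field_name must not be empty")
        if self.data_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported data_type: {self.data_type!r}")


@dataclass
class ExtractionTemplate:
    """All fields saved for one department."""
    department: str
    fields: list

    def to_prompt_spec(self) -> str:
        """Serialize the template as JSON, e.g. for inclusion in an LLM prompt."""
        return json.dumps(
            {"department": self.department,
             "fields": [asdict(f) for f in self.fields]},
            indent=2,
        )


template = ExtractionTemplate(
    department="billing_finance",
    fields=[
        TemplateField("invoice_number", "string", "Supplier invoice identifier"),
        TemplateField("invoice_date", "date", "Date of issue", "YYYY-MM-DD"),
        TemplateField("total_amount", "number", "Grand total including tax"),
    ],
)
spec = json.loads(template.to_prompt_spec())
```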

C) Operational Data Layer

  • Structured output stored in PostgreSQL
  • Multi-tenant boundaries
  • Document lifecycle controls (status, error, delete, extracted-data views)

4. Architecture Flow (Current + RAG-ready)

flowchart TD
    A[User Upload] --> B[FastAPI API Layer]
    B --> C[Storage + Metadata]
    C --> D[Text Acquisition Layer]
    D --> D1[PDF parsing + OCR]
    D --> D2[DOCX parser]
    D --> D3[DOC parser]
    D --> D4[Image OCR]
    D1 --> E[LLM Structuring]
    D2 --> E
    D3 --> E
    D4 --> E
    E --> F[Validation + Normalization]
    F --> G[PostgreSQL Persistence]
    G --> H[Dashboard + Extracted Data Views]
    G --> I[RAG Ingestion - Planned]
    I --> J[Chunking + Metadata]
    J --> K[Embeddings]
    K --> L[Vector Index]
    L --> M[Retriever + Filters + Rerank]
    M --> N[Grounded Answer + Citations]
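
The planned "Chunking + Metadata" step in the flow above could look roughly like the following. Since the RAG pipeline is not built yet, everything here is an assumption: character-based chunks, the default `size` and `overlap` values, and the metadata keys are placeholders chosen for illustration.

```python
def chunk_text(text: str, doc_id: int, tenant_id: str, department: str,
               size: int = 200, overlap: int = 50):
    """Split text into overlapping character windows, each carrying scope metadata."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "chunk_index": index,
            "start": start,
            "text": text[start:start + size],
            # Metadata lets a retriever filter by tenant and department later.
            "doc_id": doc_id,
            "tenant_id": tenant_id,
            "department": department,
        })
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


def filter_chunks(chunks, tenant_id: str, department: str):
    """Keep only chunks inside one tenant/department scope before retrieval."""
    return [
        c for c in chunks
        if c["tenant_id"] == tenant_id and c["department"] == department
    ]
```

Carrying `tenant_id` and `department` on every chunk keeps the isolation rules from the extraction side intact once embeddings and a vector index are introduced.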

5. Technology Stack (Current Implementation)

Frontend

  • Next.js: 16.1.6
  • React: 19.2.3
  • Tailwind CSS: 4.x
  • TypeScript: 5.x

Backend

  • FastAPI: >=0.110.0
  • Python: 3.11
  • SQLAlchemy: >=2.0.0
  • Uvicorn: >=0.27.0

Data and Storage

  • PostgreSQL: 15-alpine
  • Local/GCP-style storage abstraction

Document and OCR

  • pdfplumber: >=0.11.0
  • pytesseract: >=0.3.10
  • python-docx: >=1.1.2
  • Pillow: >=10.3.0
  • antiword: for legacy .doc

AI Layer

  • Anthropic Claude API (4.x family with fallback strategy)
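
The overview does not describe how the fallback strategy works. One common shape, shown here purely as an assumption, is to try an ordered list of models and return the first successful response. The model names are placeholders, and `call_model` is an injected callable standing in for whatever client the platform uses, so no real API call is made.

```python
from typing import Callable, Sequence


class AllModelsFailed(RuntimeError):
    """Raised when every model in the fallback list has failed."""


def extract_with_fallback(prompt: str, models: Sequence[str],
                          call_model: Callable[[str, str], str]):
    """Try each model in order; return (model_used, response) for the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))


def fake_call(model: str, prompt: str) -> str:
    """Stand-in client: the first model times out, the second one answers."""
    if model == "primary-model":
        raise TimeoutError("simulated timeout")
    return '{"invoice_number": "A1"}'


used_model, raw_json = extract_with_fallback(
    "Extract the configured fields as JSON.",
    ["primary-model", "fallback-model"],
    fake_call,
)
```

Collecting the per-model error messages before raising fits the per-document error tracking described earlier, because the final failure explains what was attempted.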

Auth

  • Google OAuth 2.0 (authorization code flow)

6. Current Status

Completed

  • End-to-end extraction pipeline
  • Multi-format file ingestion
  • Department-aware config and upload behavior
  • Per-document error message tracking and display
  • Document delete and extracted-data navigation
  • Architecture and dev-status experience pages

In Progress

  • Quality benchmarking and extraction accuracy baselines
  • Operational hardening and migration maturity
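
An extraction accuracy baseline can start as simple per-field exact-match scoring against a hand-labelled benchmark set. The sketch below assumes predictions and expected values are aligned lists of dictionaries; the function name and the sample data are illustrative only.

```python
def field_accuracy(predictions, expected):
    """Return {field_name: fraction of documents where the prediction matches exactly}."""
    totals = {}
    correct = {}
    for pred, gold in zip(predictions, expected):
        for name, value in gold.items():
            totals[name] = totals.get(name, 0) + 1
            # A missing field counts as wrong because pred.get returns None.
            if pred.get(name) == value:
                correct[name] = correct.get(name, 0) + 1
    return {name: correct.get(name, 0) / totals[name] for name in totals}


expected = [
    {"invoice_number": "A1", "total_amount": 100.0},
    {"invoice_number": "A2", "total_amount": 250.5},
]
predictions = [
    {"invoice_number": "A1", "total_amount": 100.0},
    {"invoice_number": "A2", "total_amount": 205.5},  # digits transposed
]
scores = field_accuracy(predictions, expected)
```

Saving these scores for each pipeline change gives a basic form of regression tracking: a drop in one field's score points at the template or prompt change that caused it.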

Planned

  • RAG ingestion + retrieval pipeline
  • Citation-grounded Q&A
  • Department-wise analytics and KPI dashboards
  • Optional model fine-tuning only after quality baselines are stable

7. Product Rollout Plan (Phased)

  1. Foundation (Done)
  • Upload, extraction, storage, UI visibility
  2. Reliability (Done/In progress)
  • Error visibility, delete controls, quality fixes
  3. Quality Baseline (In progress)
  • Field-level scoring, benchmark set, regression tracking
  4. RAG Enablement (Planned)
  • Chunking, embeddings, vector index, retriever pipeline
  5. Production Hardening (Planned)
  • Queue workers, migration discipline, observability, compliance controls
  6. Advanced ML Training (Conditional)
  • Department-specific training only when measurable benefit is proven

8. Why This Approach

  • Faster business value with pre-trained models
  • Lower initial risk than immediate custom-model training
  • Strong path to scale via department templates + tenant isolation
  • Clean progression from extraction -> retrieval -> intelligence

9. Deployment Links

Update this section after deployment:

  • Product URL: TBD
  • API URL: TBD
  • API Docs: TBD
  • Architecture Page: TBD
  • Dev Status Page: TBD

10. Repository Scope

This repository is the public overview hub for stakeholders, clients, and partners.

It is intended for:

  • Product narrative
  • Capability visibility
  • Architecture communication
  • Roadmap alignment
  • Deployment link sharing

No runtime code is maintained here.
