This repository is the public overview for the organization project.
It intentionally contains no application code.
It explains the product vision, architecture, capabilities, roadmap, and deployment links.
The goal is to build a configurable, multi-tenant platform that can:
- Read unstructured documents from different departments
- Convert them into structured JSON
- Store and organize output by tenant and department
- Prepare data for analytics, automation, and future AI/ML workflows
The product is designed to work across multiple business functions, not only healthcare.
The platform now supports department isolation.
A department is selected before upload/configuration, and data remains scoped to that department.
Current department set:
- Clinic/Pharma
- HR
- Billing/Finance
- Electricity Bills
- Water Bills
- Sales
- Purchasing
- Store/Stock
What this means in practice:
- Each department has its own extraction template
- Uploads are tagged by department
- Document views and downstream reports can be filtered per department
- Future training/analytics can be run department-wise
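As an illustration of the scoping described above, here is a minimal sketch of tagging uploads by department and filtering views by tenant and department. The identifiers (`DEPARTMENTS`, `Document`, `tag_upload`, `documents_for`) and the department slugs are hypothetical and are not taken from the application code, which is not kept in this repository.

```python
from dataclasses import dataclass

# Hypothetical slugs for the departments listed above (assumed naming).
DEPARTMENTS = {
    "clinic_pharma", "hr", "billing_finance", "electricity_bills",
    "water_bills", "sales", "purchasing", "store_stock",
}


@dataclass(frozen=True)
class Document:
    doc_id: int
    tenant_id: str
    department: str
    filename: str


def tag_upload(doc_id: int, tenant_id: str, department: str, filename: str) -> Document:
    """Attach a department tag at upload time, rejecting unknown departments."""
    if department not in DEPARTMENTS:
        raise ValueError(f"Unknown department: {department!r}")
    return Document(doc_id, tenant_id, department, filename)


def documents_for(docs: list[Document], tenant_id: str, department: str) -> list[Document]:
    """Return only the documents that belong to one tenant and one department."""
    return [d for d in docs if d.tenant_id == tenant_id and d.department == department]
```

In a real deployment the same filter would more likely be a `WHERE tenant_id = ... AND department = ...` clause on the PostgreSQL side; the in-memory list here only keeps the sketch self-contained.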
- Upload mixed formats (PDF, DOC, DOCX, images)
- OCR + native parsing for text acquisition
- LLM-based structured extraction into JSON
- Error tracking per document for actionable failure handling
- Non-technical field configuration from UI
- Field-wise controls: field_name, data_type, description, format_rules
- Department-level save and isolation
- Structured output stored in PostgreSQL
- Multi-tenant boundaries
- Document lifecycle controls (status, error, delete, extracted-data views)
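The field-wise controls listed above could be modelled roughly as follows. This is a sketch under assumptions: the four attribute names come from the list, but the allowed `data_type` values ("string", "number", "date") and the interpretation of `format_rules` (a `strptime` pattern for dates, a regular expression for strings) are illustrative guesses, not the platform's documented behaviour.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FieldConfig:
    field_name: str
    data_type: str          # assumed values: "string", "number", "date"
    description: str
    format_rules: str = ""  # assumed: strptime pattern (date) or regex (string)


def validate_value(field: FieldConfig, value: str) -> bool:
    """Check one extracted value against its field configuration."""
    if field.data_type == "number":
        try:
            float(value)
            return True
        except ValueError:
            return False
    if field.data_type == "date":
        try:
            # Fall back to ISO dates when no pattern is configured (assumption).
            datetime.strptime(value, field.format_rules or "%Y-%m-%d")
            return True
        except ValueError:
            return False
    if field.data_type == "string":
        # An empty rule accepts any string; otherwise the whole value must match.
        return not field.format_rules or re.fullmatch(field.format_rules, value) is not None
    raise ValueError(f"Unsupported data_type: {field.data_type!r}")
```

For example, a field configured as `FieldConfig("invoice_date", "date", "Date printed on the invoice", "%d/%m/%Y")` would accept `"31/01/2024"` and reject `"2024-01-31"`.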
flowchart TD
A[User Upload] --> B[FastAPI API Layer]
B --> C[Storage + Metadata]
C --> D[Text Acquisition Layer]
D --> D1[PDF parsing + OCR]
D --> D2[DOCX parser]
D --> D3[DOC parser]
D --> D4[Image OCR]
D1 --> E[LLM Structuring]
D2 --> E
D3 --> E
D4 --> E
E --> F[Validation + Normalization]
F --> G[PostgreSQL Persistence]
G --> H[Dashboard + Extracted Data Views]
G --> I[RAG Ingestion - Planned]
I --> J[Chunking + Metadata]
J --> K[Embeddings]
K --> L[Vector Index]
L --> M[Retriever + Filters + Rerank]
M --> N[Grounded Answer + Citations]
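The fan-out at the Text Acquisition Layer in the flowchart above can be sketched as a simple lookup from file extension to branch. The branch names mirror nodes D1 to D4; the function name `route_for` and the specific image extensions are assumptions, since the overview only says "images".

```python
from pathlib import Path

# Branch names follow the flowchart; the image extensions are assumed.
ROUTES = {
    ".pdf": "pdf_parsing_and_ocr",   # D1
    ".docx": "docx_parser",          # D2
    ".doc": "doc_parser",            # D3
    ".png": "image_ocr",             # D4
    ".jpg": "image_ocr",
    ".jpeg": "image_ocr",
}


def route_for(filename: str) -> str:
    """Pick the text-acquisition branch for an uploaded file by its extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename!r}") from None
```

Every branch then converges on the same LLM structuring step, which is why the router only has to return a branch name rather than a different downstream pipeline.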
- Next.js: 16.1.6
- React: 19.2.3
- Tailwind CSS: 4.x
- TypeScript: 5.x
- FastAPI: >=0.110.0
- Python: 3.11
- SQLAlchemy: >=2.0.0
- Uvicorn: >=0.27.0
- PostgreSQL: 15-alpine
- Local/GCP-style storage abstraction
- pdfplumber: >=0.11.0
- pytesseract: >=0.3.10
- python-docx: >=1.1.2
- Pillow: >=10.3.0
- antiword: for legacy .doc
- Anthropic Claude API (4.x family with fallback strategy)
- Google OAuth 2.0 (authorization code flow)
- End-to-end extraction pipeline
- Multi-format file ingestion
- Department-aware config and upload behavior
- Per-document error message tracking and display
- Document delete and extracted-data navigation
- Architecture and dev-status experience pages
- Quality benchmarking and extraction accuracy baselines
- Operational hardening and migration maturity
- RAG ingestion + retrieval pipeline
- Citation-grounded Q&A
- Department-wise analytics and KPI dashboards
- Optional model fine-tuning only after quality baselines are stable
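Since the RAG pipeline is still planned, the following is only a sketch of what its "chunking + metadata" step might look like: fixed-size character windows with overlap, each carrying the tenant and department so that retrieval can be filtered later. The function name, the chunk sizes, and the metadata keys are all assumptions.

```python
def chunk_text(text: str, tenant_id: str, department: str, doc_id: int,
               size: int = 200, overlap: int = 50) -> list[dict]:
    """Split text into overlapping character windows tagged with scoping metadata."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "chunk_index": index,
            "start": start,                      # character offset in the source text
            "text": text[start:start + size],
            "tenant_id": tenant_id,              # kept for retrieval-time filtering
            "department": department,
            "doc_id": doc_id,
        })
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

Character windows are the simplest option; a production pipeline might chunk by tokens, sentences, or document sections instead.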
- Foundation (Done)
  - Upload, extraction, storage, UI visibility
- Reliability (Done/In progress)
  - Error visibility, delete controls, quality fixes
- Quality Baseline (In progress)
  - Field-level scoring, benchmark set, regression tracking
- RAG Enablement (Planned)
  - Chunking, embeddings, vector index, retriever pipeline
- Production Hardening (Planned)
  - Queue workers, migration discipline, observability, compliance controls
- Advanced ML Training (Conditional)
  - Department-specific training only when measurable benefit is proven
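To make the Quality Baseline phase concrete, field-level scoring and regression tracking could look roughly like this. It is a sketch that assumes exact-match scoring against a hand-labelled benchmark set; the function names and the exact-match metric are illustrative and not a description of the project's actual benchmark.

```python
def field_accuracy(predictions: list[dict], expected: list[dict]) -> dict[str, float]:
    """Per-field exact-match accuracy over paired (predicted, expected) records."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for pred, gold in zip(predictions, expected):
        for name, gold_value in gold.items():
            totals[name] = totals.get(name, 0) + 1
            if pred.get(name) == gold_value:
                correct[name] = correct.get(name, 0) + 1
    return {name: correct.get(name, 0) / totals[name] for name in totals}


def regressions(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """List fields whose accuracy fell below the baseline by more than the tolerance."""
    return sorted(
        name for name, score in baseline.items()
        if current.get(name, 0.0) < score - tolerance
    )
```

Running `field_accuracy` per department would give the baseline that the conditional training phase is measured against.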
- Faster business value with pre-trained models
- Lower initial risk than immediate custom-model training
- Strong path to scale via department templates + tenant isolation
- Clean progression from extraction -> retrieval -> intelligence
Update this section after deployment:
- Product URL: TBD
- API URL: TBD
- API Docs: TBD
- Architecture Page: TBD
- Dev Status Page: TBD
This repository is the public overview hub for stakeholders, clients, and partners.
It is intended for:
- Product narrative
- Capability visibility
- Architecture communication
- Roadmap alignment
- Deployment link sharing
No runtime code is maintained here.