
Document Processing Pipeline Assignment

Production-style full-stack application for asynchronous document processing with:

  • Frontend: React + TypeScript
  • Backend: FastAPI (Python)
  • Database: PostgreSQL
  • Background workers: Celery
  • Messaging + Pub/Sub: Redis
  • Live progress: Redis Pub/Sub -> FastAPI WebSocket -> React UI

Demo Video (Required)

  • 3-5 minute walkthrough: ADD_YOUR_VIDEO_LINK_HERE

Repository Structure

.
├── backend
│   ├── app
│   │   ├── api/routes.py
│   │   ├── services/
│   │   ├── celery_app.py
│   │   ├── database.py
│   │   ├── main.py
│   │   ├── models.py
│   │   ├── schemas.py
│   │   └── tasks.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend
│   ├── src
│   │   ├── pages/
│   │   ├── components/
│   │   ├── api.ts
│   │   └── types.ts
│   └── Dockerfile
├── samples
│   ├── input/
│   └── exports/
├── docker-compose.yml
└── README.md

Architecture Overview

1) Upload and Job Creation

  • The user uploads one or more files from the frontend.
  • POST /api/documents/upload stores files in backend/storage/.
  • Backend creates:
    • documents row (metadata)
    • processing_jobs row (queued status, attempt=1)
  • Backend dispatches a Celery task per uploaded document.
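The upload flow above can be sketched as follows. This is a minimal in-memory sketch, not the real implementation: the actual code uses SQLAlchemy sessions and dispatches via Celery (e.g. `process_document_task.delay(...)`), and any names below beyond those listed in this README are illustrative.

```python
import uuid
from datetime import datetime, timezone

# In-memory stand-ins for the documents and processing_jobs tables.
documents: dict[str, dict] = {}
processing_jobs: dict[str, dict] = {}
dispatched: list[tuple[str, str]] = []  # (document_id, job_id) pairs sent to Celery


def dispatch_task(document_id: str, job_id: str) -> None:
    # Real code would call process_document_task.delay(document_id, job_id).
    dispatched.append((document_id, job_id))


def handle_upload(filename: str, content_type: str,
                  file_size: int, storage_path: str) -> dict:
    """Create a documents row, a queued processing_jobs row (attempt=1),
    and dispatch one Celery task for the uploaded document."""
    document_id = str(uuid.uuid4())
    job_id = str(uuid.uuid4())
    documents[document_id] = {
        "id": document_id,
        "filename": filename,
        "content_type": content_type,
        "file_size": file_size,
        "storage_path": storage_path,
        "status": "queued",
    }
    processing_jobs[job_id] = {
        "id": job_id,
        "document_id": document_id,
        "status": "queued",
        "stage": None,
        "progress": 0,
        "attempt": 1,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    dispatch_task(document_id, job_id)
    return {"document_id": document_id, "job_id": job_id}
```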

2) Asynchronous Processing (Celery)

  • Worker task process_document_task(document_id, job_id) executes stages:
    • job_started
    • document_parsing_started
    • document_parsing_completed
    • field_extraction_started
    • field_extraction_completed
    • final_result_stored
    • job_completed / job_failed
  • Worker updates PostgreSQL state on each stage.
  • Worker publishes progress events to Redis Pub/Sub channel job-events.
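A sketch of how the worker could map stages to a progress percentage and serialize events for the job-events channel. Only the stage names and channel name come from this README; the payload shape and even progress spacing are assumptions.

```python
import json

# Happy-path stage sequence, in execution order.
STAGES = [
    "job_started",
    "document_parsing_started",
    "document_parsing_completed",
    "field_extraction_started",
    "field_extraction_completed",
    "final_result_stored",
    "job_completed",
]


def progress_for(stage: str) -> int:
    """Spread progress evenly across the happy-path stages (0..100)."""
    idx = STAGES.index(stage)
    return round(idx * 100 / (len(STAGES) - 1))


def make_event(job_id: str, document_id: str, stage: str) -> str:
    """Serialize one progress event as it might be published to 'job-events'."""
    return json.dumps({
        "job_id": job_id,
        "document_id": document_id,
        "stage": stage,
        "progress": progress_for(stage),
    })
```

In the real worker this string would be passed to `redis.publish("job-events", ...)` alongside the PostgreSQL state update for the same stage.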

3) Live Progress Tracking

  • FastAPI WebSocket endpoint /ws/jobs subscribes to Redis Pub/Sub.
  • Every event from the worker is pushed to connected clients in near real time.
  • Frontend dashboard/detail pages subscribe and update progress UI live.
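The fan-out pattern behind /ws/jobs can be sketched without a server: in the real endpoint each client is a WebSocket fed from a Redis Pub/Sub subscription, while below each client is an asyncio.Queue so the pattern runs standalone. The class and method names are illustrative.

```python
import asyncio
import json


class JobEventHub:
    """Broadcasts job events to every connected client, mimicking /ws/jobs."""

    def __init__(self) -> None:
        self.clients: set[asyncio.Queue] = set()

    def connect(self) -> asyncio.Queue:
        # Real code: accept the WebSocket and register it.
        q: asyncio.Queue = asyncio.Queue()
        self.clients.add(q)
        return q

    def disconnect(self, q: asyncio.Queue) -> None:
        self.clients.discard(q)

    async def publish(self, event: dict) -> None:
        # Real code receives this from Redis Pub/Sub, then websocket.send_text(...).
        message = json.dumps(event)
        for q in self.clients:
            await q.put(message)


async def demo() -> list[str]:
    hub = JobEventHub()
    a, b = hub.connect(), hub.connect()
    await hub.publish({"job_id": "j1", "stage": "job_started", "progress": 0})
    return [await a.get(), await b.get()]
```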

4) Review, Finalize, Export

  • Review/edit extracted output on detail page.
  • Finalize stores immutable final_output and finalized_at.
  • Export finalized records:
    • JSON: GET /api/exports/finalized.json
    • CSV: GET /api/exports/finalized.csv
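The CSV export amounts to flattening finalized records into rows. A minimal sketch using only the standard library; the exact column set the real endpoint emits is an assumption based on the schema section below.

```python
import csv
import io


def export_finalized_csv(records: list[dict]) -> str:
    """Render finalized document records as CSV, roughly what
    GET /api/exports/finalized.csv might return."""
    fieldnames = ["id", "filename", "title", "category", "summary", "finalized_at"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buf.getvalue()
```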

Database Design

documents

  • file metadata: filename, content_type, file_size, storage_path
  • searchable extracted fields: title, category, summary
  • workflow state: status, error_message
  • extraction/review/final payloads: extracted_output, reviewed_output, final_output
  • finalization flags: is_finalized, finalized_at

processing_jobs

  • FK to document
  • status and progress: status, stage, progress
  • retry tracking: attempt
  • runtime trace: celery_task_id, started_at, completed_at, error_message

API Surface

  • POST /api/documents/upload - upload one/many files and queue jobs
  • GET /api/documents - list documents with search/filter/sort
  • GET /api/documents/{id} - document detail + job history
  • GET /api/jobs/{job_id}/progress - current job progress snapshot
  • POST /api/documents/{id}/retry - retry failed job
  • PATCH /api/documents/{id}/review - update reviewed structured output
  • POST /api/documents/{id}/finalize - finalize record
  • GET /api/exports/finalized.json - export finalized records (JSON)
  • GET /api/exports/finalized.csv - export finalized records (CSV)
  • WS /ws/jobs - live progress events stream

How To Run

Option A: Docker Compose (recommended)

  1. Ensure Docker Desktop is running.
  2. From repo root:
    docker compose up --build
  3. Open:
    • Frontend: http://localhost:5173
    • Backend API docs: http://localhost:8000/docs

Option B: Run services manually

1) Start Postgres and Redis

  • Postgres: localhost:5432 (db/user/pass: doc_pipeline/postgres/postgres)
  • Redis: localhost:6379

2) Backend API

cd backend
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env       # Windows: copy .env.example .env
uvicorn app.main:app --reload --port 8000

3) Celery worker

cd backend
source .venv/bin/activate  # Windows: .venv\Scripts\activate
celery -A app.celery_app.celery_app worker --loglevel=info

4) Frontend

cd frontend
npm install
cp .env.example .env       # Windows: copy .env.example .env
npm run dev

Using the App

  1. Upload one or more files on the dashboard.
  2. Watch live status transitions and progress in the table.
  3. Open the Review page for a document.
  4. Edit the extracted output and save the review.
  5. Finalize the record.
  6. Export all finalized records as JSON/CSV.
  7. If processing fails, use Retry on the dashboard.

Error Handling and Retry Strategy

  • Worker exceptions mark job as failed and emit job_failed.
  • The retry endpoint only allows retries for the latest failed job.
  • Each retry creates a new job with incremented attempt.
  • Worker skips outdated/superseded jobs if a newer attempt exists.
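The retry rules above reduce to two small checks; a pure-function sketch (function names are illustrative, not the actual worker API):

```python
def should_process(job_attempt: int, latest_attempt_for_document: int) -> bool:
    """Stale-job guard: a worker skips a job that has been
    superseded by a newer retry attempt for the same document."""
    return job_attempt >= latest_attempt_for_document


def next_retry_attempt(last_failed_attempt: int) -> int:
    """Each retry creates a new job with an incremented attempt counter."""
    return last_failed_attempt + 1
```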

Assumptions

  • Text extraction is intentionally simple (assignment allows mocked business logic).
  • For non-text formats, extraction is simulated.
  • File storage is local filesystem (backend/storage), not object storage.
  • No authentication is required for this assignment.

Tradeoffs

  • Schema migration tooling (Alembic) is omitted to keep setup fast.
  • documents.status mirrors latest job state for simpler filtering/querying.
  • WebSocket channel is global (/ws/jobs) and client-side filters by document_id.

Limitations

  • No virus scanning or file content validation pipeline.
  • No auth/authorization or multi-tenant isolation.
  • No distributed tracing/metrics stack (Prometheus/Grafana).
  • No hard cancellation endpoint for running jobs.

Bonus Implemented

  • Docker Compose setup
  • Retry attempt tracking and stale-job guard (basic idempotent behavior)
  • Export samples and test input samples
  • Clear separation: API routes, services, worker tasks, schemas/models

Sample Files and Export Samples

  • Input samples: samples/input/
  • Export samples: samples/exports/

AI Tool Usage Note

  • AI tooling was used during development for scaffolding and implementation acceleration.
