
Document Processing Pipeline Assignment

Production-style full-stack application for asynchronous document processing with:

  • Frontend: React + TypeScript
  • Backend: FastAPI (Python)
  • Database: PostgreSQL
  • Background workers: Celery
  • Messaging + Pub/Sub: Redis
  • Live progress: Redis Pub/Sub -> FastAPI WebSocket -> React UI

Demo Video (Required)

  • 3-5 minute walkthrough: ADD_YOUR_VIDEO_LINK_HERE

Repository Structure

.
├── backend
│   ├── app
│   │   ├── api/routes.py
│   │   ├── services/
│   │   ├── celery_app.py
│   │   ├── database.py
│   │   ├── main.py
│   │   ├── models.py
│   │   ├── schemas.py
│   │   └── tasks.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend
│   ├── src
│   │   ├── pages/
│   │   ├── components/
│   │   ├── api.ts
│   │   └── types.ts
│   └── Dockerfile
├── samples
│   ├── input/
│   └── exports/
├── docker-compose.yml
└── README.md

Architecture Overview

1) Upload and Job Creation

  • The user uploads one or more files from the frontend.
  • POST /api/documents/upload stores files in backend/storage/.
  • Backend creates:
    • documents row (metadata)
    • processing_jobs row (queued status, attempt=1)
  • Backend dispatches a Celery task per uploaded document.
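The upload flow above can be sketched as follows. This is a minimal in-memory sketch, not the real implementation: the actual code uses SQLAlchemy sessions and dispatches via Celery (e.g. `process_document_task.delay(...)`), and any names below beyond those listed in this README are illustrative.

```python
import uuid
from datetime import datetime, timezone

# In-memory stand-ins for the documents and processing_jobs tables.
documents: dict[str, dict] = {}
processing_jobs: dict[str, dict] = {}
dispatched: list[tuple[str, str]] = []  # (document_id, job_id) pairs sent to Celery


def dispatch_task(document_id: str, job_id: str) -> None:
    # Real code would call process_document_task.delay(document_id, job_id).
    dispatched.append((document_id, job_id))


def handle_upload(filename: str, content_type: str,
                  file_size: int, storage_path: str) -> dict:
    """Create a documents row, a queued processing_jobs row (attempt=1),
    and dispatch one Celery task for the uploaded document."""
    document_id = str(uuid.uuid4())
    job_id = str(uuid.uuid4())
    documents[document_id] = {
        "id": document_id,
        "filename": filename,
        "content_type": content_type,
        "file_size": file_size,
        "storage_path": storage_path,
        "status": "queued",
    }
    processing_jobs[job_id] = {
        "id": job_id,
        "document_id": document_id,
        "status": "queued",
        "stage": None,
        "progress": 0,
        "attempt": 1,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    dispatch_task(document_id, job_id)
    return {"document_id": document_id, "job_id": job_id}
```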

2) Asynchronous Processing (Celery)

  • Worker task process_document_task(document_id, job_id) executes stages:
    • job_started
    • document_parsing_started
    • document_parsing_completed
    • field_extraction_started
    • field_extraction_completed
    • final_result_stored
    • job_completed / job_failed
  • Worker updates PostgreSQL state on each stage.
  • Worker publishes progress events to Redis Pub/Sub channel job-events.
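A sketch of how the worker could map stages to a progress percentage and serialize events for the job-events channel. Only the stage names and channel name come from this README; the payload shape and even progress spacing are assumptions.

```python
import json

# Happy-path stage sequence, in execution order.
STAGES = [
    "job_started",
    "document_parsing_started",
    "document_parsing_completed",
    "field_extraction_started",
    "field_extraction_completed",
    "final_result_stored",
    "job_completed",
]


def progress_for(stage: str) -> int:
    """Spread progress evenly across the happy-path stages (0..100)."""
    idx = STAGES.index(stage)
    return round(idx * 100 / (len(STAGES) - 1))


def make_event(job_id: str, document_id: str, stage: str) -> str:
    """Serialize one progress event as it might be published to 'job-events'."""
    return json.dumps({
        "job_id": job_id,
        "document_id": document_id,
        "stage": stage,
        "progress": progress_for(stage),
    })
```

In the real worker this string would be passed to `redis.publish("job-events", ...)` alongside the PostgreSQL state update for the same stage.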

3) Live Progress Tracking

  • FastAPI WebSocket endpoint /ws/jobs subscribes to Redis Pub/Sub.
  • Every event from the worker is pushed to connected clients in near real time.
  • Frontend dashboard/detail pages subscribe and update progress UI live.
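The fan-out pattern behind /ws/jobs can be sketched without a server: in the real endpoint each client is a WebSocket fed from a Redis Pub/Sub subscription, while below each client is an asyncio.Queue so the pattern runs standalone. The class and method names are illustrative.

```python
import asyncio
import json


class JobEventHub:
    """Broadcasts job events to every connected client, mimicking /ws/jobs."""

    def __init__(self) -> None:
        self.clients: set[asyncio.Queue] = set()

    def connect(self) -> asyncio.Queue:
        # Real code: accept the WebSocket and register it.
        q: asyncio.Queue = asyncio.Queue()
        self.clients.add(q)
        return q

    def disconnect(self, q: asyncio.Queue) -> None:
        self.clients.discard(q)

    async def publish(self, event: dict) -> None:
        # Real code receives this from Redis Pub/Sub, then websocket.send_text(...).
        message = json.dumps(event)
        for q in self.clients:
            await q.put(message)


async def demo() -> list[str]:
    hub = JobEventHub()
    a, b = hub.connect(), hub.connect()
    await hub.publish({"job_id": "j1", "stage": "job_started", "progress": 0})
    return [await a.get(), await b.get()]
```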

4) Review, Finalize, Export

  • Review/edit extracted output on detail page.
  • Finalize stores immutable final_output and finalized_at.
  • Export finalized records:
    • JSON: GET /api/exports/finalized.json
    • CSV: GET /api/exports/finalized.csv
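The CSV export amounts to flattening finalized records into rows. A minimal sketch using only the standard library; the exact column set the real endpoint emits is an assumption based on the schema section below.

```python
import csv
import io


def export_finalized_csv(records: list[dict]) -> str:
    """Render finalized document records as CSV, roughly what
    GET /api/exports/finalized.csv might return."""
    fieldnames = ["id", "filename", "title", "category", "summary", "finalized_at"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buf.getvalue()
```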

Database Design

documents

  • file metadata: filename, content_type, file_size, storage_path
  • searchable extracted fields: title, category, summary
  • workflow state: status, error_message
  • extraction/review/final payloads: extracted_output, reviewed_output, final_output
  • finalization flags: is_finalized, finalized_at

processing_jobs

  • FK to document
  • status and progress: status, stage, progress
  • retry tracking: attempt
  • runtime trace: celery_task_id, started_at, completed_at, error_message

API Surface

  • POST /api/documents/upload - upload one/many files and queue jobs
  • GET /api/documents - list documents with search/filter/sort
  • GET /api/documents/{id} - document detail + job history
  • GET /api/jobs/{job_id}/progress - current job progress snapshot
  • POST /api/documents/{id}/retry - retry failed job
  • PATCH /api/documents/{id}/review - update reviewed structured output
  • POST /api/documents/{id}/finalize - finalize record
  • GET /api/exports/finalized.json - export finalized records (JSON)
  • GET /api/exports/finalized.csv - export finalized records (CSV)
  • WS /ws/jobs - live progress events stream

How To Run

Option A: Docker Compose (recommended)

  1. Ensure Docker Desktop is running.
  2. From repo root:
    docker compose up --build
  3. Open:
    • Frontend: http://localhost:5173
    • Backend API docs: http://localhost:8000/docs

Option B: Run services manually

1) Start Postgres and Redis

  • Postgres: localhost:5432 (db/user/pass: doc_pipeline/postgres/postgres)
  • Redis: localhost:6379

2) Backend API

cd backend
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env       # Windows: copy .env.example .env
uvicorn app.main:app --reload --port 8000

3) Celery worker

cd backend
source .venv/bin/activate  # Windows: .venv\Scripts\activate
celery -A app.celery_app.celery_app worker --loglevel=info

4) Frontend

cd frontend
npm install
cp .env.example .env       # Windows: copy .env.example .env
npm run dev

Using the App

  1. Upload one or more files on the dashboard.
  2. Watch live status transitions and progress in the table.
  3. Open the Review page for a document.
  4. Edit the extracted output and save the review.
  5. Finalize the record.
  6. Export all finalized records as JSON/CSV.
  7. If processing fails, use Retry on the dashboard.

Error Handling and Retry Strategy

  • Worker exceptions mark job as failed and emit job_failed.
  • The retry endpoint only allows retries for the latest failed job.
  • Each retry creates a new job with incremented attempt.
  • Worker skips outdated/superseded jobs if a newer attempt exists.
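The retry rules above reduce to two small checks; a pure-function sketch (function names are illustrative, not the actual worker API):

```python
def should_process(job_attempt: int, latest_attempt_for_document: int) -> bool:
    """Stale-job guard: a worker skips a job that has been
    superseded by a newer retry attempt for the same document."""
    return job_attempt >= latest_attempt_for_document


def next_retry_attempt(last_failed_attempt: int) -> int:
    """Each retry creates a new job with an incremented attempt counter."""
    return last_failed_attempt + 1
```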

Assumptions

  • Text extraction is intentionally simple (assignment allows mocked business logic).
  • For non-text formats, extraction is simulated.
  • File storage is local filesystem (backend/storage), not object storage.
  • No authentication is required for this assignment.

Tradeoffs

  • Schema migration tooling (Alembic) is omitted to keep setup fast.
  • documents.status mirrors latest job state for simpler filtering/querying.
  • WebSocket channel is global (/ws/jobs) and client-side filters by document_id.

Limitations

  • No virus scanning or file content validation pipeline.
  • No auth/authorization or multi-tenant isolation.
  • No distributed tracing/metrics stack (Prometheus/Grafana).
  • No hard cancellation endpoint for running jobs.

Bonus Implemented

  • Docker Compose setup
  • Retry attempt tracking and stale-job guard (basic idempotent behavior)
  • Export samples and test input samples
  • Clear separation: API routes, services, worker tasks, schemas/models

Sample Files and Export Samples

  • Input samples: samples/input/
  • Export samples: samples/exports/

AI Tool Usage Note

  • AI tooling was used during development for scaffolding and implementation acceleration.
