Production-style full-stack app for asynchronous document processing with:
- Frontend: React + TypeScript
- Backend: FastAPI (Python)
- Database: PostgreSQL
- Background workers: Celery
- Messaging + Pub/Sub: Redis
- Live progress: Redis Pub/Sub -> FastAPI WebSocket -> React UI
- 3-5 minute walkthrough:
ADD_YOUR_VIDEO_LINK_HERE
```
.
├── backend
│   ├── app
│   │   ├── api/routes.py
│   │   ├── services/
│   │   ├── celery_app.py
│   │   ├── database.py
│   │   ├── main.py
│   │   ├── models.py
│   │   ├── schemas.py
│   │   └── tasks.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend
│   ├── src
│   │   ├── pages/
│   │   ├── components/
│   │   ├── api.ts
│   │   └── types.ts
│   └── Dockerfile
├── samples
│   ├── input/
│   └── exports/
├── docker-compose.yml
└── README.md
```
- User uploads one or more files from the frontend.
- `POST /api/documents/upload` stores files in `backend/storage/`.
- Backend creates:
  - a `documents` row (metadata)
  - a `processing_jobs` row (`queued` status, `attempt=1`)
- Backend dispatches a Celery task per uploaded document.
- Worker task `process_document_task(document_id, job_id)` executes stages:
  `job_started` → `document_parsing_started` → `document_parsing_completed` → `field_extraction_started` → `field_extraction_completed` → `final_result_stored` → `job_completed` / `job_failed`
- Worker updates PostgreSQL state on each stage.
- Worker publishes progress events to the Redis Pub/Sub channel `job-events`.
- FastAPI WebSocket endpoint `/ws/jobs` subscribes to Redis Pub/Sub.
- Every event from the worker is pushed to connected clients in near real time.
- Frontend dashboard/detail pages subscribe and update progress UI live.
- Review/edit extracted output on detail page.
- Finalize stores immutable `final_output` and `finalized_at`.
- Export finalized records:
  - JSON: `GET /api/exports/finalized.json`
  - CSV: `GET /api/exports/finalized.csv`
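The stage sequence above can be sketched as a minimal, dependency-free pipeline. In the real worker, `publish` would wrap a Redis `PUBLISH` on the `job-events` channel; here it is an injected callable so the sketch stays self-contained (helper names and the exact event payload shape are assumptions, not the project's actual code):

```python
import json
from typing import Callable

# Intermediate stages emitted before the terminal job_completed/job_failed event.
STAGES = [
    "job_started",
    "document_parsing_started",
    "document_parsing_completed",
    "field_extraction_started",
    "field_extraction_completed",
    "final_result_stored",
]

def run_stages(document_id: int, job_id: int, publish: Callable[[str], None]) -> None:
    """Emit each processing stage as a JSON event, ending with
    job_completed on success or job_failed on any exception."""
    try:
        for stage in STAGES:
            # Real worker: redis_client.publish("job-events", payload)
            publish(json.dumps({"document_id": document_id,
                                "job_id": job_id, "stage": stage}))
        publish(json.dumps({"document_id": document_id,
                            "job_id": job_id, "stage": "job_completed"}))
    except Exception as exc:
        publish(json.dumps({"document_id": document_id, "job_id": job_id,
                            "stage": "job_failed", "error": str(exc)}))
        raise
```

Collecting events into a list with `events.append` stands in for a Pub/Sub subscriber when experimenting locally.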
- file metadata: `filename`, `content_type`, `file_size`, `storage_path`
- searchable extracted fields: `title`, `category`, `summary`
- workflow state: `status`, `error_message`
- extraction/review/final payloads: `extracted_output`, `reviewed_output`, `final_output`
- finalization flags: `is_finalized`, `finalized_at`
- FK to document
- status and progress: `status`, `stage`, `progress`
- retry tracking: `attempt`
- runtime trace: `celery_task_id`, `started_at`, `completed_at`, `error_message`
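A minimal sketch of the two tables as plain dataclasses, following the field lists above (the actual models live in `backend/app/models.py` and are presumably SQLAlchemy models; defaults here are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Document:
    # file metadata
    filename: str
    content_type: str
    file_size: int
    storage_path: str
    # searchable extracted fields
    title: Optional[str] = None
    category: Optional[str] = None
    summary: Optional[str] = None
    # workflow state (mirrors the latest job state)
    status: str = "queued"
    error_message: Optional[str] = None
    # extraction/review/final payloads
    extracted_output: Optional[dict] = None
    reviewed_output: Optional[dict] = None
    final_output: Optional[dict] = None
    # finalization flags
    is_finalized: bool = False
    finalized_at: Optional[datetime] = None

@dataclass
class ProcessingJob:
    document_id: int              # FK to Document
    status: str = "queued"
    stage: Optional[str] = None
    progress: int = 0
    attempt: int = 1              # incremented on each retry
    celery_task_id: Optional[str] = None
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    error_message: Optional[str] = None
```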
- `POST /api/documents/upload` - upload one/many files and queue jobs
- `GET /api/documents` - list documents with search/filter/sort
- `GET /api/documents/{id}` - document detail + job history
- `GET /api/jobs/{job_id}/progress` - current job progress snapshot
- `POST /api/documents/{id}/retry` - retry failed job
- `PATCH /api/documents/{id}/review` - update reviewed structured output
- `POST /api/documents/{id}/finalize` - finalize record
- `GET /api/exports/finalized.json` - export finalized records (JSON)
- `GET /api/exports/finalized.csv` - export finalized records (CSV)
- `WS /ws/jobs` - live progress events stream
- Ensure Docker Desktop is running.
- From the repo root:

```
docker compose up --build
```
- Open:
  - Frontend: `http://localhost:5173`
  - Backend API docs: `http://localhost:8000/docs`
- Postgres: `localhost:5432` (db/user/pass: `doc_pipeline` / `postgres` / `postgres`)
- Redis: `localhost:6379`
```
cd backend
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env             # Windows: copy .env.example .env
uvicorn app.main:app --reload --port 8000
```

```
cd backend
source .venv/bin/activate        # Windows: .venv\Scripts\activate
celery -A app.celery_app.celery_app worker --loglevel=info
```

```
cd frontend
npm install
cp .env.example .env             # Windows: copy .env.example .env
npm run dev
```

- Upload one or multiple files on the dashboard.
- Watch live status transitions and progress in table.
- Open the `Review` page for a document.
- Edit extracted output and save the review.
- Finalize the record.
- Export all finalized records as JSON/CSV.
- If processing fails, use `Retry` on the dashboard.
- Worker exceptions mark the job as `failed` and emit `job_failed`.
- The retry endpoint only allows retries for the latest failed job.
- Each retry creates a new job with an incremented `attempt`.
- The worker skips outdated/superseded jobs if a newer attempt exists.
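The retry rules above can be sketched as two pure functions; this is a simplified model under the stated semantics, not the actual service code in `backend/app/services/`:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Job:
    job_id: int
    attempt: int
    status: str  # queued | running | completed | failed

def create_retry(jobs: List[Job]) -> Optional[Job]:
    """Return a new queued job only if the latest attempt failed."""
    if not jobs:
        return None
    latest = max(jobs, key=lambda j: j.attempt)
    if latest.status != "failed":
        return None  # only the latest failed job may be retried
    return Job(job_id=max(j.job_id for j in jobs) + 1,
               attempt=latest.attempt + 1, status="queued")

def is_stale(job: Job, jobs: List[Job]) -> bool:
    """A worker should skip a job once a newer attempt exists."""
    return any(other.attempt > job.attempt for other in jobs)
```

This gives the basic idempotent behavior: a superseded job is detected by comparing attempt numbers rather than by hard-cancelling the running task.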
- Text extraction is intentionally simple (assignment allows mocked business logic).
- For non-text formats, extraction is simulated.
- File storage is the local filesystem (`backend/storage`), not object storage.
- No authentication is required for this assignment.
- Schema migration tooling (Alembic) is omitted to keep setup fast.
- `documents.status` mirrors the latest job state for simpler filtering/querying.
- The WebSocket channel is global (`/ws/jobs`); the client filters events by `document_id`.
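Because every client receives all events from the global channel, filtering is the subscriber's responsibility. The real filtering happens in the React client; a language-agnostic sketch of the idea (payload shape assumed):

```python
import json
from typing import Iterable, List

def events_for_document(raw_events: Iterable[str], document_id: int) -> List[dict]:
    """Keep only the events for one document from the global job-events stream."""
    matching = []
    for raw in raw_events:
        event = json.loads(raw)
        if event.get("document_id") == document_id:
            matching.append(event)
    return matching
```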
- No virus scanning or file content validation pipeline.
- No auth/authorization or multi-tenant isolation.
- No distributed tracing/metrics stack (Prometheus/Grafana).
- No hard cancellation endpoint for running jobs.
- Docker Compose setup
- Retry attempt tracking and stale-job guard (basic idempotent behavior)
- Export samples and test input samples
- Clear separation: API routes, services, worker tasks, schemas/models
- Input samples: `samples/input/`
- Export samples: `samples/exports/`
- AI tooling was used during development for scaffolding and implementation acceleration.