Fix/worker crash orphan recovery #3

Merged
hellonish merged 24 commits into main from fix/worker-crash-orphan-recovery
Apr 4, 2026

Conversation

@hellonish
Owner

No description provided.

hellonish and others added 24 commits April 3, 2026 00:21
- Fix health endpoint paths (/api/health not /health)
- Remove local Qdrant container (use Qdrant Cloud to save 42MB)
- Remove --workers 2 from uvicorn (was causing child process crashes)
- Fix Caddyfile route ordering (auth before /api/*)
- Pin all dependency versions
- Add standalone output to Next.js config
- Add healthchecks to all services
- Add .dockerignore for lean images

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NEXT_PUBLIC_* vars are baked into client bundle at build time, not runtime.
This ensures the browser gets the correct API URL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the worker process crashes mid-job (OOM, SIGTERM, etc.), jobs are
left in "running" state indefinitely. The frontend SSE stream receives
no further events and the report stays locked with no error shown.

On startup, scan for any jobs still in "running" state, mark them
failed, and publish job_error to their Redis channels so the frontend
SSE or polling fallback surfaces the error immediately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
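
The startup scan described above could look roughly like the sketch below. It is not the code from this PR: the in-memory `jobs` mapping, the `publish` callback (standing in for a Redis publish call), the `job:<id>` channel name, and the error text are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List


def recover_orphaned_jobs(
    jobs: Dict[str, dict],
    publish: Callable[[str, str], None],
) -> List[str]:
    """Mark every job still in "running" state as failed and announce it.

    `jobs` is a hypothetical job store keyed by job id; `publish` is a
    hypothetical stand-in for publishing a message to a Redis channel.
    """
    recovered: List[str] = []
    for job_id, job in jobs.items():
        if job.get("status") != "running":
            continue  # only jobs left "running" by a crashed worker are orphans
        job["status"] = "failed"
        job["error"] = "Worker restarted while the job was running"
        # Channel naming ("job:<id>") is an assumption, not taken from the PR.
        payload = {"event": "job_error", "job_id": job_id, "error": job["error"]}
        publish(f"job:{job_id}", json.dumps(payload))
        recovered.append(job_id)
    return recovered
```

Calling this once at worker startup, before any new job is accepted, means a client listening over SSE or polling sees a `job_error` event for each orphaned job instead of waiting on a job that will never finish.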
Two root causes for the worker dying right after lead plan finalization:

1. Worker container had a 256MB Docker memory limit. Loading
   sentence-transformers (all-MiniLM-L6-v2) + PyTorch at the topic
   cache check requires ~500MB, causing Docker to OOM-kill the container.
   Raised worker limit to 1GB (t3.small has 2GB total; all other services
   combined use ~836MB, leaving roughly 190MB of headroom).

2. find_cached_run() had no error handling: any Qdrant connection
   failure (unreachable server, timeout, bad credentials) would propagate
   as an unhandled exception and kill the job. Wrapped it in try-except so
   a failure degrades gracefully to a cache miss and the job continues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
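
A minimal sketch of the "degrade to a cache miss" pattern from point 2, assuming the real lookup can be passed in as a callable. The wrapper name `find_cached_run_safe` and the `lookup` parameter are hypothetical; the actual signature of `find_cached_run()` is not shown in this PR.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def find_cached_run_safe(
    lookup: Callable[[str], Optional[dict]],
    topic: str,
) -> Optional[dict]:
    """Return the cached run for `topic`, or None if the lookup fails.

    `lookup` is a hypothetical stand-in for the real Qdrant-backed query.
    """
    try:
        return lookup(topic)
    except Exception as exc:
        # Any backend failure (unreachable server, timeout, bad credentials)
        # is logged and treated as a cache miss so the job can continue.
        logger.warning("Topic cache lookup failed, treating as miss: %s", exc)
        return None
```

Catching the broad `Exception` is deliberate here: the cache is an optimization, so no failure inside it should be able to end the job.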
QDRANT_URL defaults to localhost:6333 but no Qdrant container is
running. Set QDRANT_FORCE_IN_MEMORY=1 on the worker so the pipeline
uses an in-memory instance per job. Cross-run topic-cache deduplication
is disabled, but retrieval and writing still work correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tools making HTTP requests (pdf_reader, semantic_scholar, etc.) had no
timeout, causing the worker to hang silently at 0% CPU when a request
stalled. Fixed in two layers:

1. tools/base.py: wrap every call_with_retry attempt in asyncio.wait_for
   (default 60s) so any tool that hangs is cancelled and retried/failed.
2. tools/pdf_reader.py: add aiohttp.ClientTimeout(total=30) so the HTTP
   download itself is bounded independently of the outer timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
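
The outer timeout layer can be sketched as follows. This is an illustration under assumptions, not the contents of tools/base.py: the function name `call_with_retry_sketch`, its parameters, and the choice to retry only on timeouts are all hypothetical. The inner layer in tools/pdf_reader.py (the `aiohttp.ClientTimeout(total=30)` setting) is not reproduced here so that the sketch needs only the standard library.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retry_sketch(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    timeout_s: float = 60.0,
) -> T:
    """Run `make_call` up to `attempts` times, bounding each attempt.

    asyncio.wait_for cancels the awaited call when the timeout expires,
    so a stalled request cannot hang the worker indefinitely.
    """
    last_error: Exception = RuntimeError("no attempts were made")
    for _ in range(attempts):
        try:
            # A fresh awaitable is created per attempt; each gets its own deadline.
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc  # this sketch retries timeouts only
    raise last_error
```

Exceptions other than a timeout propagate immediately in this sketch; the real retry helper may well treat them differently.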
A stuck worker (hanging HTTP request, not crashed) keeps the job in
"running" state indefinitely. The frontend was silently spinning with
no feedback to the user.

Add client-side elapsed-time detection: if the job has been running for
more than 10 minutes, show a yellow warning box with the elapsed time
and a back button, so the user knows something is wrong and can retry
instead of waiting forever.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The code read QDRANT_URL but the .env file sets QDRANT_LOCATION.
Fall back through both names before defaulting to localhost so existing
deployments using either convention work without changing their env.

Remove QDRANT_FORCE_IN_MEMORY=1 from the worker: Qdrant Cloud is now
properly configured, so in-memory mode is no longer needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
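
The env-var fallback could be written along these lines. The helper name `resolve_qdrant_url`, the lookup order (QDRANT_URL first, then QDRANT_LOCATION), and the `http://` scheme on the default are assumptions; the commit only states that both names are tried before defaulting to localhost.

```python
import os
from typing import Mapping, Optional

# The commit mentions localhost:6333; the http:// scheme is an assumption.
DEFAULT_QDRANT_URL = "http://localhost:6333"


def resolve_qdrant_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty Qdrant address found in the environment."""
    env = os.environ if env is None else env
    # Lookup order is assumed; either naming convention is accepted.
    for name in ("QDRANT_URL", "QDRANT_LOCATION"):
        value = env.get(name, "").strip()
        if value:
            return value
    return DEFAULT_QDRANT_URL
```

Passing the environment in as a mapping keeps the function easy to test, and treating blank values as unset avoids an empty `QDRANT_URL=` line in a .env file masking a valid `QDRANT_LOCATION`.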
The createJobMutation had no onError handler. When the API returned 429
(rate limit exceeded), TanStack Query had no error path to execute,
causing an internal crash reading .payload on undefined in core.js.

Add onError to show a human-readable message below the search bar,
with a specific "please wait and retry" message for 429 responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@hellonish hellonish merged commit bc9c789 into main Apr 4, 2026
0 of 2 checks passed
@hellonish hellonish deleted the fix/worker-crash-orphan-recovery branch April 8, 2026 22:41