Fix/worker crash orphan recovery by hellonish · Pull Request #3 · hellonish/singularity

hellonish · 2026-04-04T03:01:27Z

No description provided.

- Fix health endpoint paths (/api/health not /health)
- Remove local Qdrant container (use Qdrant Cloud to save 42MB)
- Remove --workers 2 from uvicorn (was causing child process crashes)
- Fix Caddyfile route ordering (auth before /api/*)
- Pin all dependency versions
- Add standalone output to Next.js config
- Add healthchecks to all services
- Add .dockerignore for lean images

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When the worker process crashes mid-job (OOM, SIGTERM, etc.), jobs are left in "running" state indefinitely. The frontend SSE stream receives no further events and the report stays locked with no error shown.

On startup, scan for any jobs still in "running" state, mark them failed, and publish job_error to their Redis channels so the frontend SSE or polling fallback surfaces the error immediately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
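The startup scan described in this commit could look roughly like the sketch below. It is not the code from this PR: the `Job` record, the `publish` callback (standing in for a Redis publish call), the `job:<id>` channel name, and the payload shape are all hypothetical, chosen only to show the scan, mark-failed, and notify steps.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Job:
    """Hypothetical job record; the real schema is not shown in this PR."""
    id: str
    status: str
    error: Optional[str] = None


def recover_orphaned_jobs(
    jobs: Iterable[Job],
    publish: Callable[[str, str], None],
    orphan_states: Tuple[str, ...] = ("running",),
) -> List[str]:
    """Mark jobs left in an orphan state as failed and notify listeners.

    `publish(channel, message)` stands in for a Redis publish call.
    Returns the ids of the jobs that were recovered.
    """
    recovered: List[str] = []
    for job in jobs:
        if job.status not in orphan_states:
            continue  # finished or failed jobs are left untouched
        job.status = "failed"
        job.error = "Worker restarted while the job was in progress"
        # Channel name and payload shape are assumptions for illustration.
        payload = {"event": "job_error", "job_id": job.id, "message": job.error}
        publish(f"job:{job.id}", json.dumps(payload))
        recovered.append(job.id)
    return recovered
```

A later commit in this PR extends recovery to pending jobs; in this sketch that would amount to passing `orphan_states=("running", "pending")`.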

Two root causes for the worker dying right after lead plan finalization:

1. Worker container had a 256MB Docker memory limit. Loading sentence-transformers (all-MiniLM-L6-v2) + PyTorch at the topic cache check requires ~500MB, causing Docker to OOM-kill the container. Raised worker limit to 1GB (t3.small has 2GB total; all other services combined use ~836MB, leaving headroom).
2. find_cached_run() had no error handling: any Qdrant connection failure (unreachable server, timeout, bad credentials) would propagate as an unhandled exception and kill the job. Wrapped in try-except to degrade gracefully as a cache miss so the job continues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
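The second fix, degrading a failed cache lookup to a cache miss, can be illustrated with a small wrapper. This is a sketch under assumptions: the real `find_cached_run()` signature is not shown in this PR, so the lookup is passed in as a plain callable and the wrapper name `find_cached_run_safe` is hypothetical.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def find_cached_run_safe(
    lookup: Callable[[str], Optional[Any]],
    topic: str,
) -> Optional[Any]:
    """Return the cached run for `topic`, or None if the lookup fails.

    `lookup` stands in for the real Qdrant-backed query. Any exception
    (unreachable server, timeout, bad credentials) is logged and treated
    as a cache miss so the calling job can continue without the cache.
    """
    try:
        return lookup(topic)
    except Exception as exc:
        logger.warning("Topic cache lookup failed, treating as miss: %s", exc)
        return None
```

Catching `Exception` broadly is deliberate here: the cache is an optimization, so no lookup failure should be able to end the job.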

QDRANT_URL defaults to localhost:6333 but no Qdrant container is running. Set QDRANT_FORCE_IN_MEMORY=1 on the worker so the pipeline uses an in-memory instance per job. Topic-cache cross-run deduplication is disabled but all retrieval and writing works correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Tools making HTTP requests (pdf_reader, semantic_scholar, etc.) had no timeout, causing the worker to hang silently at 0% CPU when a request stalled. Fixed in two layers:

1. tools/base.py: wrap every call_with_retry attempt in asyncio.wait_for (default 60s) so any tool that hangs is cancelled and retried/failed.
2. tools/pdf_reader.py: add aiohttp.ClientTimeout(total=30) so the HTTP download itself is bounded independently of the outer timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
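The outer layer of this fix, a per-attempt timeout inside a retry loop, might be sketched as follows. The function name, parameters, and retry policy are assumptions for illustration; the actual `call_with_retry` in tools/base.py is not shown in this PR.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_retry_sketch(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    timeout_s: float = 60.0,
) -> T:
    """Run `make_call()` up to `attempts` times, bounding each attempt.

    asyncio.wait_for cancels the awaited coroutine when the timeout
    expires, so a stalled request cannot hold the worker indefinitely.
    """
    last_error: Optional[BaseException] = None
    for _ in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc  # attempt hung: it was cancelled, try again
    raise RuntimeError(f"call timed out on all {attempts} attempts") from last_error
```

The inner layer is independent of this wrapper: passing `aiohttp.ClientTimeout(total=30)` as the `timeout` of an `aiohttp.ClientSession` bounds the download itself, so a single stalled request normally fails well before the 60s outer limit.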

A stuck worker (hanging HTTP request, not crashed) keeps the job in "running" state indefinitely. The frontend was silently spinning with no feedback to the user. Add client-side elapsed-time detection: if the job has been running for more than 10 minutes, show a yellow warning box with the elapsed time and a back button, so the user knows something is wrong and can retry instead of waiting forever.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The code read QDRANT_URL but the .env file sets QDRANT_LOCATION. Fall back through both names before defaulting to localhost so existing deployments using either convention work without changing their env.

Remove QDRANT_FORCE_IN_MEMORY=1 from the worker: Qdrant Cloud is properly configured, so in-memory mode is no longer needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
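The environment-variable fallback could be written along these lines. This is a sketch, not the PR's code: the function name, the precedence order (QDRANT_URL before QDRANT_LOCATION), and the exact default string are assumptions; the commit only states that both names are tried before defaulting to localhost.

```python
import os
from typing import Mapping

# Assumed default; the commit only says the fallback is localhost:6333.
DEFAULT_QDRANT_URL = "http://localhost:6333"


def resolve_qdrant_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the Qdrant endpoint from the first non-empty variable.

    Tries QDRANT_URL, then QDRANT_LOCATION, then the localhost default,
    so deployments using either naming convention keep working.
    """
    for name in ("QDRANT_URL", "QDRANT_LOCATION"):
        value = env.get(name, "").strip()
        if value:
            return value
    return DEFAULT_QDRANT_URL
```

Accepting the environment as a parameter keeps the function easy to test without mutating `os.environ`.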

The createJobMutation had no onError handler. When the API returned 429 (rate limit exceeded), TanStack Query had no error path to execute, causing an internal crash reading .payload on undefined in core.js. Add onError to show a human-readable message below the search bar, with a specific "please wait and retry" message for 429 responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

hellonish and others added 24 commits April 3, 2026 00:21

deployment ready

2b4efd7

updated dep

e9eddbb

Docker Files Update - Pull instead of Build

579abc3

fix healthchecks

be22831

caddy prod update

ad4a361

INTERNAL API FIX

24126d6

add Google OAuth env vars to frontend service

179668c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add NEXT_PUBLIC_API_URL build arg to frontend Dockerfile

7fd923b

NEXT_PUBLIC_* vars are baked into client bundle at build time, not runtime. This ensures the browser gets the correct API URL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Migrations fix

ef800e5

fix orphan recovery to also catch pending jobs

36ea1a1

fix: run Qdrant cache ops in thread pool with timeout to prevent event loop blocking

db4c002
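Moving a blocking call off the event loop and bounding it with a timeout, as the commit title above describes, might look like this sketch. The helper name is hypothetical and `func` stands in for a synchronous Qdrant or embedding call; the PR's actual implementation is not shown.

```python
import asyncio
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


async def run_blocking_with_timeout(
    func: Callable[[], T],
    timeout_s: float,
) -> Optional[T]:
    """Run a blocking callable in a worker thread, giving up after timeout_s.

    Returns None on timeout so the caller can treat it as a cache miss.
    The event loop stays responsive because the work runs in a thread.
    """
    try:
        return await asyncio.wait_for(asyncio.to_thread(func), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None
```

One caveat with this approach: the timeout only stops the coroutine from waiting. The underlying thread cannot be interrupted and keeps running until `func` returns, so the blocking call should carry its own timeout where the client library supports one.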

fix: move blocking ML/Qdrant ops to thread pool, add sentence-transformers

2edb13c

swap sentence-transformers for fastembed (~1.4GB lighter)

ef245c1

swap sentence-transformers for fastembed, add Qdrant payload index for credibility filter

9e8228d

fix: remove unused PayloadIndexParams import that broke create_collection

275ca27

merging before cleaning v1

d92a253

hellonish merged commit bc9c789 into main Apr 4, 2026
0 of 2 checks passed

hellonish deleted the fix/worker-crash-orphan-recovery branch April 8, 2026 22:41

hellonish merged 24 commits into main from fix/worker-crash-orphan-recovery