perf+reliability: lifespan fail-fast, tool caching, indexes, task lifecycle#3
Open
madhavcodez wants to merge 1 commit into
Open
perf+reliability: lifespan fail-fast, tool caching, indexes, task lifecycle#3madhavcodez wants to merge 1 commit into
madhavcodez wants to merge 1 commit into
Conversation
4 tasks
…ecycle
Addresses the highest-leverage findings from the parallel performance and
reliability review. Each change is independently revertible.
Lifespan refactor (CRITICAL)
- main.py: replace the warn-and-continue init pattern with a
SubsystemReadiness tracker. Required subsystems (database) fail-fast and
abort startup; optional subsystems (expert seeding, source_registry,
scheduler, redis_bridge) degrade gracefully and expose their state via
the new /health/ready endpoint
- Replace ``next(get_session())`` with ``with SessionLocal() as db:`` so
the session is properly cleaned up on every path
- Remove the ``stop_scheduler = lambda: None`` rebind smell; the default
is defined up-front and overwritten only on successful start
- redis_task tracked by name and awaited on shutdown so cancellation is
observable
Health surface
- api/health.py: new GET /health/ready exposes per-subsystem readiness
from the lifespan handler. Operators can detect degraded boots without
digging through logs
Tool-result caching (CRITICAL — wired existing infra)
- core/tool_cache.py: new fail-open Redis cache keyed on (tool, params)
with deterministic JSON hashing, lazy connection, per-call TTL, and
``cached=True`` flag on hits
- services/crews/tools/exa_search.py: wraps the real fetch in
tool_cache.cached with 1h TTL. Previously every mission re-paid Exa
for the same queries
- services/crews/tools/gemini_search.py: same treatment for Gemini
grounding calls
Celery worker lifecycle (HIGH)
- celery_app.py: ``worker_max_tasks_per_child=10`` recycles each worker
process after 10 tasks to release accumulated LLM-context heap that
workers don't free between tasks. A research-heavy worker grew several
GB RSS over a day without this
- ``task_soft_time_limit=300`` + ``task_time_limit=360`` so a stuck HTTP
call (Gemini latency spike, Twilio webhook timeout) cannot hang a
queue indefinitely
Database indexes (HIGH)
- alembic/versions/f4a8d2c1e9b0_add_perf_indexes.py: adds
- ix_missions_status (backs ?status=running dashboard filter)
- ix_missions_user_created (backs /missions home-page query)
- ix_findings_mission_confidence (backs per-mission finding list with
confidence-ordered display)
Background task lifecycle (HIGH)
- core/background_tasks.py: ``spawn_background_task`` keeps a strong
reference in a module-level set + done-callback that logs uncaught
exceptions. ``asyncio.ensure_future`` (deprecated 3.10+) without
reference retention is GC-collectable; tasks would silently disappear
- api/missions.py, api/projects.py: three ``asyncio.ensure_future``
call sites migrated to ``spawn_background_task``
Error-envelope sanitisation (HIGH)
- api/data_sources.py:183,266 + api/workflows.py:262: stop leaking raw
exception strings in 5xx responses. ``logger.exception(...)`` captures
the full traceback server-side; the user gets a generic message + a
hint to check logs by correlation id
Pagination on unbounded list endpoint (HIGH)
- api/run_steps.py: ``limit``/``offset`` parameters (default 200, max
1000) so a long-running mission's run trace doesn't return a multi-MB
blob
Deferred to a follow-up PR
- Wrapping the broader async-route sync ``db.query`` calls in
``run_in_threadpool`` (or migrating to asyncpg) needs touching ~33
router modules and warrants its own PR with a focused test plan
- Moving APScheduler jobs to Celery Beat (CRITICAL #2 in perf review)
requires reconciling autopilot/monitor/workflow job loaders — also
deferred
- ReactFlow lazy-load + ActivityFeed memoization land with the broader
dashboard polish PR
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PR 3 of 4 in the critical-cleanup sweep. Addresses the highest-leverage findings from the parallel performance + reliability review.
Key changes
Lifespan rewrite (CRITICAL)
Previously every init step was
try/except: logger.warning— a degraded boot looked successful. Now:GET /health/readyendpointnext(get_session())replaced withwith SessionLocal() as db:stop_scheduler = lambda: Nonerebind smell eliminatedTool-result cache (CRITICAL — wired existing infra)
core/tool_cache.py— fail-open Redis cache keyed on (tool, params). Wired intoexa_searchandgemini_searchwith 1h TTL. Previously every mission re-paid those APIs for identical queries;SourceCacheexisted but those tools didn't use it.Celery worker lifecycle (HIGH)
worker_max_tasks_per_child=10recycles workers to release accumulated LLM-context heap (workers grew several GB RSS over a day). Soft/hard time limits (300s/360s) prevent stuck HTTP calls from hanging queues.Missing DB indexes (HIGH)
New migration
f4a8d2c1e9b0_add_perf_indexes.py:ix_missions_status— backs?status=runningix_missions_user_created— backs dashboard "your recent missions"ix_findings_mission_confidence— backs per-mission finding listBackground task lifecycle (HIGH)
core/background_tasks.py—spawn_background_taskkeeps strong refs in a module set + logs uncaught exceptions in a done-callback. Threeasyncio.ensure_future(deprecated, GC-collectable) call sites migrated.Error envelope sanitisation (HIGH)
data_sources.py:183,266+workflows.py:262no longer leak raw exception strings in 5xx responses.logger.exception(...)captures the traceback server-side.Pagination (HIGH)
run_steps.pygainslimit/offset(default 200, max 1000).Deferred to follow-up PRs
db.queryin async routes — needs ~33 router edits, own PRgather) — UI-focused PRTest plan
alembic upgrade headsucceeds, thenalembic downgrade -1reverses cleanlyGET /health/readyreturns subsystem states/health/ready→ expectredis_bridge: degradedcached: true)References