fix: override healthcheck for runner and executor workers #5
Conversation
The server Dockerfile healthcheck pings `localhost:8000/health`, but `runner` and `executor` are pure worker processes with no HTTP server. Override the healthcheck per service with a `pgrep` process check instead.
Pull request overview
Overrides container healthchecks for runner and executor services so they no longer use the server image’s default HTTP healthcheck (which marks pure workers as unhealthy).
Changes:
- Add per-service `healthcheck` definitions for `runner` and `executor` in `docker-compose.yml` using a process-based check.
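A per-service override takes this general shape in compose (a hedged sketch — the `interval`/`timeout`/`retries` values are illustrative defaults, not taken from the PR diff):

```yaml
services:
  runner:
    healthcheck:
      # Process-based check: healthy as long as the worker process exists.
      test: ["CMD-SHELL", "pgrep -f flowforge_server.services.runner || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
```

A service-level `healthcheck` in compose fully replaces the image's `HEALTHCHECK` instruction for that container.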
      - FLOWFORGE_ENCRYPTION_KEY=${FLOWFORGE_ENCRYPTION_KEY:-flowforge-dev-encryption-key-change-in-prod}
      - FLOWFORGE_ENCRYPTION_SALT=${FLOWFORGE_ENCRYPTION_SALT:-}
    healthcheck:
      test: ["CMD-SHELL", "pgrep -f flowforge_server.services.runner || exit 1"]
The runtime image (`python:3.12-slim`) doesn't install `pgrep` (typically provided by the `procps` package), so this healthcheck is likely to fail with `pgrep: not found` and keep the container unhealthy. Consider switching to a healthcheck that relies only on what's already in the image (e.g., a small `python -c` that checks PID 1 is alive, similar to `deploy/kubernetes/21-runner-deployment.yaml`'s livenessProbe), or explicitly add `procps` to the server Dockerfile runtime stage if you want to keep `pgrep`.
Suggested change:

    - test: ["CMD-SHELL", "pgrep -f flowforge_server.services.runner || exit 1"]
    + test: ["CMD", "python", "-c", "import os; os.kill(1, 0)"]
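If keeping `pgrep` is preferred, the reviewer's other option is to install `procps` in the runtime stage. A minimal sketch of that Dockerfile change (the stage layout is assumed, not taken from the actual server Dockerfile):

```dockerfile
FROM python:3.12-slim
# procps provides pgrep/pkill/ps; --no-install-recommends keeps the layer small
RUN apt-get update \
    && apt-get install -y --no-install-recommends procps \
    && rm -rf /var/lib/apt/lists/*
```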
    volumes:
      - media_data:/app/media
    healthcheck:
      test: ["CMD-SHELL", "pgrep -f flowforge_server.services.executor || exit 1"]
Same issue here: `pgrep` is not guaranteed to exist in the shared `python:3.12-slim` image, so this healthcheck can fail even when the executor is running. Prefer a `python -c` PID 1 liveness check (no extra packages) or install `procps` in the image runtime stage to ensure `pgrep` is available.
Suggested change:

    - test: ["CMD-SHELL", "pgrep -f flowforge_server.services.executor || exit 1"]
    + test: ["CMD-SHELL", "python -c \"import os; os.kill(1, 0)\""]
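Both suggested checks lean on the same idiom: `os.kill(pid, 0)` delivers no signal but raises if the PID does not exist. A small self-contained sketch of that behavior (the helper name is ours, not FlowForge's):

```python
import os

def pid_alive(pid: int) -> bool:
    """True if a process with this PID exists (signal 0 = existence check only)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True
```

In a container the worker entrypoint runs as PID 1, so the healthcheck reduces to `python -c "import os; os.kill(1, 0)"`: exit 0 while the worker lives, a non-zero traceback exit once it is gone.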
…fications, danger zone) (#36)

* feat(notifications): wire workspace alerts to Slack + PagerDuty (Phase 3)

  Adds tenant_notification_config table with one row per workspace storing Slack channel/webhook, PagerDuty key + enabled flag, email-digest flag, and per-event-type triggers (run_failed, run_timeout). Persists via new GET/PATCH /api/v1/tenant/notifications (admin-only PATCH) plus a POST /tenant/notifications/test endpoint to verify each channel.

  Real enforcement:
  - services/notifier.py — fire-and-forget dispatcher. On run failure, loads tenant config, posts to Slack incoming-webhook (text + channel override) and triggers a PagerDuty Events API v2 incident with severity=error and run/error metadata.
  - services/executor.py hooks both terminal-failure transition sites (HTTP-mode permanent failure + AI-step retries-exhausted). Wrapped in asyncio.create_task so notification I/O never blocks run completion; notifier itself is exception-safe.

  Persisted-only for v1:
  - email_digest_enabled — column lands but daily digest cron lives in services/digest.py (Phase 3.1).

  Frontend: new components/settings/notifications-tab.tsx wired into settings/page.tsx as a Bell-icon tab. Loads settings on mount, PATCHes on blur (URLs / channel) and on click (Switches), shows a Send test button per channel that surfaces the server's pass/fail message via toast.

* feat(settings): wire concurrency & limits + step-timeout enforcement (Phase 2 + 2.5)

  Persist max_concurrent_runs / per_function_default / default_step_timeout_s / use_event_id_idempotency to Tenant.settings JSONB and expose via new GET/PATCH /api/v1/tenant/concurrency (admin-only PATCH).

  Real enforcement (Phase 2):
  - runner._create_run derives Run.idempotency_key from the trigger event id when the workspace flag is on, so redelivered events map to the same logical run.
  - runner._enqueue_run consults the running-run count and re-queues with a 5s delay when over max_concurrent_runs (soft throttle), and injects per_function_default into the job config when a Function does not declare its own concurrency.

  Real enforcement (Phase 2.5 — default_step_timeout_s):
  - runner._enqueue_run additionally injects default_step_timeout_s into the job config so the executor doesn't need to re-fetch tenant rows.
  - executor._execute_run resolves the per-step timeout chain: function-level `timeout` wins, otherwise tenant default, else None.
  - executor._invoke_function (worker mode HTTP) now overrides the per-call read timeout to the resolved value via httpx.Timeout, and forwards step_timeout_s in the payload as a hint for SDK self-enforcement.
  - executor AI-step branch wraps ai_service.complete in asyncio.wait_for with the resolved timeout so a stuck LLM call yields control instead of riding the default 5-minute read timeout. Step-level timeout in step_result["timeout_seconds"] still wins when the SDK passes one.

  Frontend:
  - ConcurrencySettings type + getConcurrencySettings/updateConcurrencySettings in lib/api.ts.
  - New components/settings/concurrency-tab.tsx (shadcn-styled): three cards covering Concurrency / Step timeout / Idempotency, persisting on blur (numbers) and click (Switch).
  - settings/page.tsx adds a Gauge-icon Concurrency tab between Audit and Notifications; bumped grid cols 8→9 / 10→11.

  Out of scope:
  - Inline-mode (serverless) AI step timeout: inline_executor receives fn not job, so the tenant default would need to be threaded through; worker-mode covers the common path today.

* feat(settings): wire danger zone — pause-all, transfer, soft-delete (Phase 4)

  Backend:
  - Migration add_tenant_soft_delete.py adds tenants.deleted_at TIMESTAMPTZ + index.
  - Tenant model gains the deleted_at column.
  - api/deps.py::_bounce_if_deleted gates every authenticated request: any attempt to use a deleted tenant returns HTTP 410 Gone (distinct from 401 / 404 so clients can act on it).
  - api/routes/tenants.py adds:
    * GET /api/v1/tenant — minimal workspace identity (id/name/slug/deleted_at). Available to any auth'd user; lets the dashboard show the slug for typed-confirmation flows.
    * POST /api/v1/tenant/pause-all — admin-only. Bulk update of Function.is_active=false for the workspace; idempotent. Returns paused_count.
    * POST /api/v1/tenant/transfer-ownership — admin-only. Promotes the target user to admin, demotes the calling admin to member; rejects self-target and cross-tenant targets.
    * DELETE /api/v1/tenant — admin-only. Body must echo the workspace slug to confirm. Stamps deleted_at; recoverable until a future hard-delete background job retires the row.

  Frontend:
  - TenantInfo type + getTenantInfo + pauseAllFunctions + transferOwnership + deleteWorkspace methods in lib/api.ts.
  - New components/settings/danger-zone.tsx: three rounded action rows inside a destructive-bordered Card; each behind its own Dialog with a typed/selected confirmation. Pause and transfer use simple confirms; delete requires retyping the workspace slug exactly. On successful delete the page does a hard nav to /login so the auth store resets.
  - settings/page.tsx replaces the old "Clear All Data" Danger Zone Card with <DangerZone /> in the General tab.

  Out of scope (intentional follow-ups):
  - Hard delete background job — soft-delete persists deleted_at; a cron in services/retention.py will purge after the retention window.
  - Email confirmation for transfer-ownership — currently single-click promote/demote within the dashboard; the more careful flow can layer on.

* fix(lint): ruff I001 import ordering on notifier + executor

* fix(settings): address Copilot review feedback on PR #36

  Addresses 7 substantive issues from Copilot's automated review of the settings-backend PR.
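The per-step timeout chain described in the Phase 2.5 notes above can be sketched as a pure function (a hedged sketch: the argument names are ours, the precedence order is from the commit message — SDK-passed step timeout wins, then function-level `timeout`, then the tenant default):

```python
from typing import Optional

def resolve_step_timeout(
    sdk_timeout: Optional[float],       # step_result["timeout_seconds"], if the SDK passed one
    function_timeout: Optional[float],  # function-level `timeout`
    tenant_default: Optional[float],    # default_step_timeout_s injected by the runner
) -> Optional[float]:
    """Return the first non-None value in precedence order, else None."""
    for candidate in (sdk_timeout, function_timeout, tenant_default):
        if candidate is not None:
            return candidate
    return None
```

A `None` result means no override is applied and the HTTP client keeps its default read timeout.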
  Tests are flagged as out-of-scope follow-up.

  Idempotency (Copilot #1):
  - Add migration add_runs_idempotency_unique.py: partial UNIQUE index on runs(tenant_id, idempotency_key) WHERE idempotency_key IS NOT NULL, built CONCURRENTLY using the AUTOCOMMIT pattern.
  - runner._create_run now does optimistic query-then-insert. Insert is wrapped in `async with session.begin_nested()` (a SAVEPOINT) so an IntegrityError from a concurrent inserter doesn't poison the outer session transaction shared with _resolve_waiting_steps. On IntegrityError, re-query and return the winner row.

  Soft-delete coverage (Copilot #2):
  - Apply _bounce_if_deleted to all 4 return sites in get_tenant_with_dev_fallback (was only in get_current_tenant). get_optional_tenant has zero consumers in api/routes/ — left as-is to preserve its "endpoint works without auth" contract.

  Notification settings auth + SSRF (Copilot #3, #4):
  - GET /tenant/notifications now requires CurrentUserAdmin. The response contains the Slack webhook URL and PagerDuty integration key; members must not see them.
  - PATCH /tenant/notifications validates slack_webhook_url via the existing services/network_utils.validate_webhook_url (rejects private IPs, DNS-rebinding) AND pins the host to hooks.slack.com.
  - notifier._client now uses create_ssrf_safe_client as defense-in-depth against URLs persisted before the route validation was added.

  Cleanup (Copilot #5/#6):
  - Drop the redundant SQL update() + ORM mutate pattern in tenants.update_concurrency and tenants.delete_workspace. Single ORM mutation flushes one UPDATE on commit.

  Step-timeout coverage (Copilot #8):
  - New services/concurrency.py exports read_tenant_concurrency, shared between runner.py (enqueue-time injection) and executor.py (fallback load when job config lacks default_step_timeout_s — covers _recover_stuck_runs, which only ships trigger_data + endpoint_url).
  - Both worker-mode HTTP path and AI-step branch in executor now fall back to a tenant DB load when the runner-injected default is absent.

  Timeout classification (Copilot #9):
  - executor._invoke_function emits error.type="HTTPTimeout" for httpx.TimeoutException (covers ReadTimeout/Connect/Write/Pool); other RequestError subclasses keep "RequestError".
  - notifier.notify_run_failed reads error.type and gates dispatch on notify_on_run_timeout for {"TimeoutError","HTTPTimeout"}, otherwise notify_on_run_failed. The toggle does something now.
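The error-type gating in that last bullet can be sketched as follows (a hedged sketch: the flag names and the timeout error-type set come from the commit message, the function shapes are ours):

```python
# Error types that count as timeouts for notification purposes.
TIMEOUT_ERROR_TYPES = {"TimeoutError", "HTTPTimeout"}

def gating_flag(error_type: str) -> str:
    """Pick which workspace toggle gates dispatch for this failure type."""
    if error_type in TIMEOUT_ERROR_TYPES:
        return "notify_on_run_timeout"
    return "notify_on_run_failed"

def should_notify(error_type: str, config: dict) -> bool:
    """Dispatch only when the matching trigger is enabled for the workspace."""
    return bool(config.get(gating_flag(error_type), False))
```

This is what makes the timeout toggle meaningful: an `HTTPTimeout` failure consults `notify_on_run_timeout`, while other failures consult `notify_on_run_failed`.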
#38)

* feat(settings): adopt set-grid dark/mono shell with all sections wired

  Brings the Settings page in line with the rest of the redesigned dashboard. The earlier merge (#37) only updated Settings' theme hook because the page wasn't part of the redesign commit; this completes the job.

  CSS:
  - Restore the set-* primitives in globals.css: set-grid (sticky left nav + content), set-nav (grouped headings + is-on highlight), set-card (with .danger-card variant), set-row (200/1fr dashed-divider rows), set-input, set-help, .toggle, .key-row.

  Settings page (rewritten):
  - Replace shadcn Tabs/Card with the set-grid shell. Left nav grouped under Workspace / Runtime / Integrations / Notifications. Only the active section renders.
  - Sections wired to real APIs:
    * Concurrency — GET/PATCH /api/v1/tenant/concurrency (PATCH on blur for numbers, click for the idempotency toggle).
    * Notifications — GET/PATCH /api/v1/tenant/notifications + per-channel Test buttons hitting POST /tenant/notifications/test. All 7 fields surfaced (Slack URL/channel, PD enabled+key, two trigger toggles, email digest).
    * Danger zone — Pause all (POST /tenant/pause-all), Transfer ownership (POST /tenant/transfer-ownership with user-picker), Delete workspace (DELETE /tenant with typed-slug confirmation). Each behind a shadcn Dialog.
    * Members & access, Billing, Audit, API keys, Model providers, Secrets sections host the existing sub-components inside set-section blocks without an outer set-card wrapper (the sub-components carry their own Card chrome — avoids the card-in-card we hit earlier).
  - General workspace identity stays localStorage-only with a help-line noting the backend PATCH /tenant endpoint hasn't shipped yet.

  Cleanup:
  - Drop concurrency-tab.tsx, notifications-tab.tsx, danger-zone.tsx — their content is inlined in the page now and no other consumers exist.
* fix(settings): address Copilot review on PR #38

  - (#1 a11y) Sidebar nav items rendered as <a> without href — convert to <button type="button"> with aria-current="page" on the active item. globals.css updated so .set-nav { a, button } selectors share styles.
  - (#2 access) Notifications and Danger zone are admin-only on the server (GET /tenant/notifications returns secrets, danger actions are destructive). Mark both NAV items adminOnly so members no longer see them, and skip the notifications GET fetch entirely for non-admins to avoid a noisy 403 on mount.
  - (#3 transfer) Filter the current user out of the transfer-ownership dropdown — the server explicitly rejects self-transfer.
  - (#4 destructive) Delete-workspace button could enable when tenantSlug was still empty (initial state, both strings ""). Disable the destructive action and add an explicit guard in handleDeleteWorkspace.
  - (#5 a11y) Add role="switch" + aria-checked to all six .toggle buttons so assistive tech can read the on/off state.
  - (#6 copy) Page subtitle was hard-coded to "flowforge · workspace settings". Derive from tenantSlug (with general.workspaceSlug fallback) so the header tracks the actual workspace.
Summary
- The server image's `HEALTHCHECK` pings `localhost:8000/api/v1/health`, which marks the pure-worker containers unhealthy.
- Override the healthcheck in `docker-compose.yml` to use `pgrep` to check the process is alive.

Test plan
- `docker compose ps` shows `runner` and `executor` as healthy