fix: infer waterfall step durations from inter-step gaps #8
Conversation
For existing DB data where started_at == ended_at (both set at save time before the SDK timing fix), infer step duration from the gap to the next step's started_at. Steps with real SDK timing (>100ms) use their actual duration.
Pull request overview
Updates the run detail “Execution Timeline” waterfall to show meaningful step durations for legacy step records where started_at and ended_at were saved as the same timestamp, by inferring durations from inter-step start-time gaps.
Changes:
- Adds a `useMemo` precomputation of “effective” step durations, inferring duration from the next step's `started_at` when the recorded duration is near-zero.
- Updates waterfall bar width/label rendering to use the effective duration map rather than `ended_at - started_at`.
```tsx
// Sort by started_at for gap calculation
const sorted = [...steps]
  .map((s, origIdx) => ({ step: s, origIdx }))
  .sort((a, b) => {
    const aT = a.step.started_at ? new Date(a.step.started_at).getTime() : 0;
    const bT = b.step.started_at ? new Date(b.step.started_at).getTime() : 0;
    return aT - bT;
  });

const durations = new Map<number, number>();
for (let i = 0; i < sorted.length; i++) {
  const s = sorted[i].step;
  const start = s.started_at ? new Date(s.started_at).getTime() : spanStart;
  const end = s.ended_at ? new Date(s.ended_at).getTime() : now;
  const recorded = end - start;

  if (recorded > 100) {
    // Real timing from SDK fix — use it
    durations.set(sorted[i].origIdx, recorded);
  } else if (i < sorted.length - 1) {
    // Infer from gap to next step's started_at
    const nextStepStart = sorted[i + 1].step.started_at;
    const nextStart = nextStepStart
      ? new Date(nextStepStart).getTime()
      : now;
    durations.set(sorted[i].origIdx, Math.max(nextStart - start, 100));
```
started_at is nullable (per Step type), but the sort/inference logic treats missing started_at as time 0 and missing next started_at as now. If there are multiple steps with started_at === null (e.g., pending steps), the first one will infer its duration as now - spanStart (very large) because the “next step” also has no started_at, which will distort the waterfall.
Consider excluding steps without started_at from gap-based inference (set them to a fixed minimal duration), and/or use the same fallback timestamp (spanStart) for sorting and for nextStart fallback rather than 0/now when started_at is missing.
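A minimal sketch of the first option, reusing the `sorted`, `durations`, `spanStart`, and `now` variables from the diff above; the `MIN_DURATION_MS` constant and the skip-ahead to the next step that actually has a start time are assumptions for illustration, not code from the PR:

```tsx
const MIN_DURATION_MS = 100;

for (let i = 0; i < sorted.length; i++) {
  const { step, origIdx } = sorted[i];

  if (!step.started_at) {
    // No reliable start time (e.g. a pending step): don't infer from gaps,
    // just give it a fixed minimal duration so it renders as a sliver.
    durations.set(origIdx, MIN_DURATION_MS);
    continue;
  }

  const start = new Date(step.started_at).getTime();
  const end = step.ended_at ? new Date(step.ended_at).getTime() : now;
  const recorded = end - start;

  if (recorded > 100) {
    durations.set(origIdx, recorded);
    continue;
  }

  // Infer from the next step that actually has a started_at; if none exists,
  // fall back to the minimal duration instead of `now`, avoiding huge bars.
  const next = sorted.slice(i + 1).find((e) => e.step.started_at);
  durations.set(
    origIdx,
    next
      ? Math.max(new Date(next.step.started_at!).getTime() - start, MIN_DURATION_MS)
      : MIN_DURATION_MS
  );
}
```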
```tsx
const start = s.started_at ? new Date(s.started_at).getTime() : spanStart;
const end = s.ended_at ? new Date(s.ended_at).getTime() : now;
const recorded = end - start;

if (recorded > 100) {
  // Real timing from SDK fix — use it
  durations.set(sorted[i].origIdx, recorded);
} else if (i < sorted.length - 1) {
```
The legacy-data condition described in the PR is started_at == ended_at, but the code uses recorded > 100ms as the cutoff. That will also treat any legitimately fast steps (<=100ms) with real timing as “legacy” and inflate/replace their duration (and clamp to at least 100ms), reducing accuracy for new runs.
Prefer detecting legacy records explicitly (e.g., started_at && ended_at && started_at === ended_at or recorded === 0) and otherwise use the recorded duration (while still applying a separate visual minimum width if needed).
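Sketched against the same loop variables, the explicit check could look something like this; it is a sketch of the reviewer's suggestion, not the code that was merged:

```tsx
// Legacy records are the ones whose timestamps were stamped identically at save
// time; everything else keeps its recorded duration, even fast (<100ms) steps.
const isLegacy =
  s.started_at != null &&
  s.ended_at != null &&
  (s.started_at === s.ended_at || recorded === 0);

if (!isLegacy) {
  durations.set(sorted[i].origIdx, recorded);
} else if (i < sorted.length - 1) {
  // ...gap-based inference as in the diff above; any minimum bar width is
  // better applied at render time than baked into the stored duration.
}
```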
```tsx
{visibleSteps.map((step, visibleIdx) => {
  // Find original index for duration lookup
  const origIdx = expanded ? visibleIdx : visibleIdx;
  const stepStart = step.started_at
    ? new Date(step.started_at).getTime()
    : spanStart;
  const stepEnd = step.ended_at
    ? new Date(step.ended_at).getTime()
    : now;
  const duration = effectiveDurations.get(origIdx) ?? 100;
```
origIdx is currently always visibleIdx (expanded ? visibleIdx : visibleIdx), so the variable/comment are misleading and add noise. Either remove origIdx and use visibleIdx directly, or compute the true original index if visibleSteps can ever be a non-trivial projection of steps (filter/reorder).
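If `visibleSteps` really is just an identity view of `steps`, the simplest fix is to drop `origIdx` and use `visibleIdx` directly. If it can ever be a filtered projection, a hypothetical lookup like this would restore the true index:

```tsx
// Hypothetical: recover the original index by object identity so the duration
// lookup still works if visibleSteps becomes a filtered projection of steps.
const origIdx = steps.indexOf(step);
const duration = origIdx >= 0 ? (effectiveDurations.get(origIdx) ?? 100) : 100;
```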
- Detect legacy records explicitly (started_at ≈ ended_at within 50ms) instead of using a 100ms threshold that misclassifies fast steps
- Exclude steps without started_at from gap inference; give them a fixed minimal visual duration
- Key duration map by step.id instead of array index; remove dead origIdx variable
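A sketch of what the revised precomputation might look like with all three changes applied. The `useMemo` shape, field names, and 100ms visual floor follow the diff above; `step.id` as a string map key and the 50ms tolerance come from the bullets; everything else (and the surrounding React import) is an assumption, not the merged code:

```tsx
const effectiveDurations = useMemo(() => {
  const durations = new Map<string, number>();
  const now = Date.now();

  // Only steps with a real started_at participate in gap-based inference.
  const withStart = steps
    .filter((s) => s.started_at)
    .sort(
      (a, b) =>
        new Date(a.started_at!).getTime() - new Date(b.started_at!).getTime()
    );

  for (const s of steps) {
    if (!s.started_at) {
      durations.set(s.id, 100); // fixed minimal visual duration, no inference
      continue;
    }
    const start = new Date(s.started_at).getTime();
    const end = s.ended_at ? new Date(s.ended_at).getTime() : now;
    const recorded = end - start;

    // Legacy record: started_at and ended_at stamped (near-)identically at save time.
    const isLegacy = s.ended_at != null && Math.abs(recorded) <= 50;
    if (!isLegacy) {
      durations.set(s.id, recorded);
      continue;
    }

    // Infer from the gap to the next step that has a start time.
    const idx = withStart.findIndex((w) => w.id === s.id);
    const next = idx >= 0 ? withStart[idx + 1] : undefined;
    durations.set(
      s.id,
      next?.started_at
        ? Math.max(new Date(next.started_at).getTime() - start, 100)
        : 100
    );
  }
  return durations;
}, [steps]);
```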
…fications, danger zone) (#36)

* feat(notifications): wire workspace alerts to Slack + PagerDuty (Phase 3)

  Adds tenant_notification_config table with one row per workspace storing Slack channel/webhook, PagerDuty key + enabled flag, email-digest flag, and per-event-type triggers (run_failed, run_timeout). Persists via new GET/PATCH /api/v1/tenant/notifications (admin-only PATCH) plus a POST /tenant/notifications/test endpoint to verify each channel.

  Real enforcement:
  - services/notifier.py — fire-and-forget dispatcher. On run failure, loads tenant config, posts to Slack incoming-webhook (text + channel override) and triggers a PagerDuty Events API v2 incident with severity=error and run/error metadata.
  - services/executor.py hooks both terminal-failure transition sites (HTTP-mode permanent failure + AI-step retries-exhausted). Wrapped in asyncio.create_task so notification I/O never blocks run completion; notifier itself is exception-safe.

  Persisted-only for v1:
  - email_digest_enabled — column lands but daily digest cron lives in services/digest.py (Phase 3.1).

  Frontend: new components/settings/notifications-tab.tsx wired into settings/page.tsx as a Bell-icon tab. Loads settings on mount, PATCHes on blur (URLs / channel) and on click (Switches), shows a Send test button per channel that surfaces the server's pass/fail message via toast.

* feat(settings): wire concurrency & limits + step-timeout enforcement (Phase 2 + 2.5)

  Persist max_concurrent_runs / per_function_default / default_step_timeout_s / use_event_id_idempotency to Tenant.settings JSONB and expose via new GET/PATCH /api/v1/tenant/concurrency (admin-only PATCH).

  Real enforcement (Phase 2):
  - runner._create_run derives Run.idempotency_key from the trigger event id when the workspace flag is on, so redelivered events map to the same logical run.
  - runner._enqueue_run consults the running-run count and re-queues with a 5s delay when over max_concurrent_runs (soft throttle), and injects per_function_default into the job config when a Function does not declare its own concurrency.

  Real enforcement (Phase 2.5 — default_step_timeout_s):
  - runner._enqueue_run additionally injects default_step_timeout_s into the job config so the executor doesn't need to re-fetch tenant rows.
  - executor._execute_run resolves the per-step timeout chain: function-level `timeout` wins, otherwise tenant default, else None.
  - executor._invoke_function (worker mode HTTP) now overrides the per-call read timeout to the resolved value via httpx.Timeout, and forwards step_timeout_s in the payload as a hint for SDK self-enforcement.
  - executor AI-step branch wraps ai_service.complete in asyncio.wait_for with the resolved timeout so a stuck LLM call yields control instead of riding the default 5-minute read timeout. Step-level timeout in step_result["timeout_seconds"] still wins when the SDK passes one.

  Frontend:
  - ConcurrencySettings type + getConcurrencySettings/updateConcurrencySettings in lib/api.ts.
  - New components/settings/concurrency-tab.tsx (shadcn-styled): three cards covering Concurrency / Step timeout / Idempotency, persisting on blur (numbers) and click (Switch).
  - settings/page.tsx adds a Gauge-icon Concurrency tab between Audit and Notifications; bumped grid cols 8→9 / 10→11.

  Out of scope:
  - Inline-mode (serverless) AI step timeout: inline_executor receives fn not job, so the tenant default would need to be threaded through; worker-mode covers the common path today.

* feat(settings): wire danger zone — pause-all, transfer, soft-delete (Phase 4)

  Backend:
  - Migration add_tenant_soft_delete.py adds tenants.deleted_at TIMESTAMPTZ + index.
  - Tenant model gains the deleted_at column.
  - api/deps.py::_bounce_if_deleted gates every authenticated request: any attempt to use a deleted tenant returns HTTP 410 Gone (distinct from 401 / 404 so clients can act on it).
  - api/routes/tenants.py adds:
    * GET /api/v1/tenant — minimal workspace identity (id/name/slug/deleted_at). Available to any auth'd user; lets the dashboard show the slug for typed-confirmation flows.
    * POST /api/v1/tenant/pause-all — admin-only. Bulk update of Function.is_active=false for the workspace; idempotent. Returns paused_count.
    * POST /api/v1/tenant/transfer-ownership — admin-only. Promotes the target user to admin, demotes the calling admin to member; rejects self-target and cross-tenant targets.
    * DELETE /api/v1/tenant — admin-only. Body must echo the workspace slug to confirm. Stamps deleted_at; recoverable until a future hard-delete background job retires the row.

  Frontend:
  - TenantInfo type + getTenantInfo + pauseAllFunctions + transferOwnership + deleteWorkspace methods in lib/api.ts.
  - New components/settings/danger-zone.tsx: three rounded action rows inside a destructive-bordered Card; each behind its own Dialog with a typed/selected confirmation. Pause and transfer use simple confirms; delete requires retyping the workspace slug exactly. On successful delete the page does a hard nav to /login so the auth store resets.
  - settings/page.tsx replaces the old "Clear All Data" Danger Zone Card with <DangerZone /> in the General tab.

  Out of scope (intentional follow-ups):
  - Hard delete background job — soft-delete persists deleted_at; a cron in services/retention.py will purge after the retention window.
  - Email confirmation for transfer-ownership — currently single-click promote/demote within the dashboard; the more careful flow can layer on.

* fix(lint): ruff I001 import ordering on notifier + executor

* fix(settings): address Copilot review feedback on PR #36

  Addresses 7 substantive issues from Copilot's automated review of the settings-backend PR. Tests are flagged as out-of-scope follow-up.

  Idempotency (Copilot #1):
  - Add migration add_runs_idempotency_unique.py: partial UNIQUE index on runs(tenant_id, idempotency_key) WHERE idempotency_key IS NOT NULL, built CONCURRENTLY using the AUTOCOMMIT pattern.
  - runner._create_run now does optimistic query-then-insert. Insert is wrapped in `async with session.begin_nested()` (a SAVEPOINT) so an IntegrityError from a concurrent inserter doesn't poison the outer session transaction shared with _resolve_waiting_steps. On IntegrityError, re-query and return the winner row.

  Soft-delete coverage (Copilot #2):
  - Apply _bounce_if_deleted to all 4 return sites in get_tenant_with_dev_fallback (was only in get_current_tenant). get_optional_tenant has zero consumers in api/routes/ — left as-is to preserve its "endpoint works without auth" contract.

  Notification settings auth + SSRF (Copilot #3, #4):
  - GET /tenant/notifications now requires CurrentUserAdmin. The response contains the Slack webhook URL and PagerDuty integration key; members must not see them.
  - PATCH /tenant/notifications validates slack_webhook_url via the existing services/network_utils.validate_webhook_url (rejects private IPs, DNS-rebinding) AND pins the host to hooks.slack.com.
  - notifier._client now uses create_ssrf_safe_client as defense-in-depth against URLs persisted before the route validation was added.

  Cleanup (Copilot #5/#6):
  - Drop the redundant SQL update() + ORM mutate pattern in tenants.update_concurrency and tenants.delete_workspace. Single ORM mutation flushes one UPDATE on commit.

  Step-timeout coverage (Copilot #8):
  - New services/concurrency.py exports read_tenant_concurrency, shared between runner.py (enqueue-time injection) and executor.py (fallback load when job config lacks default_step_timeout_s — covers _recover_stuck_runs, which only ships trigger_data + endpoint_url).
  - Both worker-mode HTTP path and AI-step branch in executor now fall back to a tenant DB load when the runner-injected default is absent.

  Timeout classification (Copilot #9):
  - executor._invoke_function emits error.type="HTTPTimeout" for httpx.TimeoutException (covers ReadTimeout/Connect/Write/Pool); other RequestError subclasses keep "RequestError".
  - notifier.notify_run_failed reads error.type and gates dispatch on notify_on_run_timeout for {"TimeoutError","HTTPTimeout"}, otherwise notify_on_run_failed. The toggle does something now.
Summary
- Legacy step records have `started_at == ended_at` (both set to `utcnow()` at save time), making all steps show 0ms duration in the waterfall
- Infer duration from the gap to the next step's `started_at` for these legacy records

Test plan