Conversation
The latest updates on your projects. Learn more about Vercel for GitHub.
🦋 Changeset detected
Latest commit: e1d47f2
The changes in this PR will be included in the next version bump. This PR includes changesets to release 10 packages.
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR.
TL;DR — Replaces the per-trigger daisy-chaining workflow architecture with a single centralized scheduler workflow that polls a new trigger_schedules runtime table.

Key changes
Summary | 23 files | 2 commits

New runtime tables for centralized scheduling
|
| Table | Purpose | Key columns |
|---|---|---|
| trigger_schedules | Materialized view of enabled triggers for polling | next_run_at, claimed_at, enabled |
| scheduler_state | Tracks active scheduler workflow | current_run_id, deployment_id |
runtime-schema.ts · 0023_amazing_romulus.sql · triggerSchedules.ts · schedulerState.ts
ScheduledTriggerService rewrite — sync to runtime table
Before:
onTriggerCreated/Updated/Deleted resolved DoltgreSQL refs, managed scheduled_workflows records, and started/stopped per-trigger workflow runs.
After: These hooks call upsertTriggerSchedule / updateTriggerScheduleEnabled / deleteTriggerSchedule on the runtime DB. The centralized scheduler picks up changes on its next tick.
ScheduledTriggerService.ts · scheduled-triggers.ts
scheduledTriggerRunnerWorkflow — one-shot execution
Before: The runner workflow contained sleep logic, daisy-chaining (startNextIterationStep), pre/post-sleep enabled checks, pending invocation lookup, and parent adoption.
After: A stateless one-shot workflow: verify trigger enabled → create invocation (idempotent) → execute with retries → mark completed/failed.
scheduledTriggerRunner.ts · scheduledTriggerSteps.ts
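The one-shot shape described above can be sketched as follows. The step names and dependency shape here are illustrative, not the PR's actual API:

```typescript
// Hypothetical sketch of the one-shot runner described above; the step names
// and dependency shape are illustrative, not the PR's actual API.
type RunnerDeps = {
  isTriggerEnabled: (triggerId: string) => Promise<boolean>;
  createInvocation: (triggerId: string) => Promise<string>; // idempotent
  executeTrigger: (invocationId: string) => Promise<void>;
  markInvocation: (invocationId: string, status: 'completed' | 'failed') => Promise<void>;
};

async function runScheduledTriggerOnce(
  triggerId: string,
  deps: RunnerDeps,
  maxAttempts = 3
): Promise<'skipped' | 'completed' | 'failed'> {
  // 1. Verify the trigger is still enabled; a disabled trigger is a no-op.
  if (!(await deps.isTriggerEnabled(triggerId))) return 'skipped';

  // 2. Create the invocation idempotently so a duplicate dispatch reuses it.
  const invocationId = await deps.createInvocation(triggerId);

  // 3. Execute with bounded retries, then record the terminal status.
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await deps.executeTrigger(invocationId);
      await deps.markInvocation(invocationId, 'completed');
      return 'completed';
    } catch {
      if (attempt === maxAttempts) await deps.markInvocation(invocationId, 'failed');
    }
  }
  return 'failed';
}
```

Because invocation creation is idempotent, a duplicate dispatch of the same schedule slot reuses the existing invocation instead of double-executing.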
Deploy hook: /api/deploy/restart-scheduler
Before: No mechanism to move the scheduler onto a new Vercel deployment.
After: A POST /api/deploy/restart-scheduler endpoint (authenticated via INKEEP_AGENTS_RUN_API_BYPASS_SECRET) starts a new scheduler workflow. The vercel-production.yml CI pipeline calls it after promotion.
restartScheduler.ts · vercel-production.yml · index.ts
Solid architectural improvement — moving from per-trigger daisy-chaining to a centralized scheduler with a trigger_schedules table is much simpler to reason about and operate. The claim/advance/rollback pattern is well thought out.
A few issues to address before merge, ordered by severity:
- Security bug: The restart endpoint is accessible when INKEEP_AGENTS_RUN_API_BYPASS_SECRET is unset (undefined !== undefined is false).
- Correctness: lte vs eq in claim safety, missing cronTimezone in schedule-change detection, gutted reconciliation with no replacement.
- Robustness: No error handling in the restart handler, no error handling around the dispatch tick in the scheduler loop, and the claim-then-release flow has a theoretical re-dispatch window for fast cron intervals.
Issues on lines outside the diff (cannot be commented inline):
- agents-api/src/data-reconciliation/handlers/scheduled-triggers.ts:14-16 — scheduleChanged checks cronExpression and runAt but not cronTimezone. Changing the timezone of a cron expression changes the effective schedule (e.g. 0 9 * * * in UTC vs EST), but this code path won't cancel pending invocations or recompute nextRunAt.
- agents-api/src/data-reconciliation/handlers/scheduled-triggers.ts:26-34 — The check function now returns all-empty arrays, effectively disabling data reconciliation for scheduled triggers. In the new architecture, a useful check would verify that every enabled trigger in the manage DB has a corresponding trigger_schedules row with a non-null nextRunAt, and that no orphaned rows exist. Consider adding a TODO or a basic cross-table consistency check.
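A fixed comparison would simply include the timezone field. This sketch assumes the record shape, since only the field names appear in the review:

```typescript
// Sketch of the comparison with cronTimezone included; field names follow the
// review comment, and the record shape here is an assumption.
type ScheduleFields = {
  cronExpression: string | null;
  cronTimezone: string | null;
  runAt: string | null;
};

function scheduleChanged(before: ScheduleFields, after: ScheduleFields): boolean {
  return (
    before.cronExpression !== after.cronExpression ||
    // Same expression in a different timezone is a different effective schedule.
    before.cronTimezone !== after.cronTimezone ||
    before.runAt !== after.runAt
  );
}
```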
```ts
const authHeader = c.req.header('Authorization');
const token = authHeader?.replace('Bearer ', '');

if (token !== env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
```
Bug: undefined secret matches undefined token. INKEEP_AGENTS_RUN_API_BYPASS_SECRET is optional in env.ts. If unset, both env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET and token are undefined — so undefined !== undefined is false, granting access to anyone. Add an early guard:
```ts
if (!env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
  return c.json({ error: 'Endpoint not configured' }, 503);
}
```

```ts
  return c.json({ error: 'Unauthorized' }, 401);
}

const result = await startSchedulerWorkflow();
```
If startSchedulerWorkflow() throws (e.g. DB connection failure), Hono's onError handler returns a generic 500, and the CI curl -sf will retry with no actionable info. Wrap in try-catch and return a structured error body so CI logs are useful for debugging.
```ts
}),
async (c) => {
  const authHeader = c.req.header('Authorization');
  const token = authHeader?.replace('Bearer ', '');
```
Nit: replace('Bearer ', '') doesn't validate the prefix — a header like Basic xyz passes through unchanged and is treated as the token. Use authHeader?.startsWith('Bearer ') ? authHeader.slice(7) : undefined for stricter parsing.
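Extracted as a helper, the stricter parsing suggested above might look like this (the helper name is hypothetical):

```typescript
// Helper form of the stricter parsing suggested above; the helper name is hypothetical.
function parseBearerToken(authHeader: string | undefined): string | undefined {
  // Only accept the Bearer scheme; 'Bearer '.length === 7.
  return authHeader?.startsWith('Bearer ') ? authHeader.slice(7) : undefined;
}
```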
```ts
  expectedClaimedAt: string | null;
}): Promise<boolean> => {
  const claimCondition = params.expectedClaimedAt
    ? lte(triggerSchedules.claimedAt, params.expectedClaimedAt)
```
lte should be eq for claim safety. lte(claimedAt, expectedClaimedAt) succeeds if claimedAt is any value ≤ the expected timestamp — e.g., a stale claim from a previous cycle would still match. This defeats the optimistic-concurrency purpose. Use eq so only the exact expected state can be claimed.
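The invariant behind the eq suggestion can be shown with an in-memory compare-and-set; the real code expresses this in a SQL UPDATE's WHERE clause:

```typescript
// In-memory illustration of the eq-based claim; the PR does this in a SQL
// UPDATE's WHERE clause, but the compare-and-set invariant is the same.
type ClaimableRow = { claimedAt: string | null };

function claimSchedule(row: ClaimableRow, expectedClaimedAt: string | null, now: string): boolean {
  // eq semantics: the claim succeeds only when the row is in exactly the
  // state the caller last observed; a stale claim must fail the check.
  if (row.claimedAt !== expectedClaimedAt) return false;
  row.claimedAt = now;
  return true;
}
```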
```ts
  nextRunAt: string | null;
  enabled?: boolean;
}): Promise<void> => {
  const set: Record<string, unknown> = {
```
Nit: Record<string, unknown> bypasses Drizzle's column-name validation — typos in property names would be silently ignored. Consider building the object inline or typing as Partial<typeof triggerSchedules.$inferInsert>.
```ts
  return 'skipped';
}

await releaseTriggerScheduleClaim(runDbClient)({
```
Releasing the claim after advancing nextRunAt creates a window where the row is eligible again. For most cron intervals this is fine because nextRunAt is in the future. But for fast intervals (every minute), the next nextRunAt could already be due by the time releaseTriggerScheduleClaim runs. Consider combining advance + release into a single atomic update, or omitting the release entirely (the advance already moves the schedule forward).
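A single-statement version of the combined advance + release might look like the sketch below. Table and column names follow the schema table earlier in this summary; the parameter ordering is illustrative:

```sql
-- Advance the schedule and release the claim in one atomic UPDATE,
-- closing the window where the row is briefly released with a due next_run_at.
UPDATE trigger_schedules
SET next_run_at = $1,
    claimed_at = NULL
WHERE tenant_id = $2
  AND scheduled_trigger_id = $3
  AND claimed_at IS NOT NULL;
```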
```ts
  tenantId,
  scheduledTriggerId,
  nextRunAt: schedule.nextRunAt,
  enabled: isOneTime ? true : schedule.enabled,
```
For one-time triggers, rollback restores enabled: true. If the workflow engine stays down, the next tick (60s later) will re-dispatch indefinitely. Consider adding a failedAttempts counter or max-retry cap to prevent infinite re-dispatch of one-time triggers.
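A minimal sketch of the suggested cap; the failedAttempts column does not exist in the PR's schema and is an assumption here:

```typescript
// Sketch of the suggested cap; failedAttempts is a hypothetical column that
// does not exist in the PR's schema.
const MAX_DISPATCH_ATTEMPTS = 5;

function rollbackOneTimeTrigger(schedule: { failedAttempts: number }): { enabled: boolean; failedAttempts: number } {
  const failedAttempts = schedule.failedAttempts + 1;
  return {
    // Re-enable for another attempt only while under the cap; otherwise the
    // trigger stays disabled instead of re-dispatching every tick forever.
    enabled: failedAttempts < MAX_DISPATCH_ATTEMPTS,
    failedAttempts,
  };
}
```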
```ts
  return { status: 'superseded', runId: myRunId };
}

await dispatchDueTriggersStep();
```
If dispatchDueTriggersStep throws (e.g. transient DB error), the workflow step will fail and the framework's retry mechanism kicks in. Depending on backoff behavior, this could repeatedly spam the DB. Consider wrapping the dispatch in a try-catch to log and continue to the next tick for transient errors.
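The suggested guard around the tick could look like this sketch (function and parameter names are illustrative):

```typescript
// Sketch of the suggested guard; function and parameter names are illustrative.
async function runDispatchTick(
  dispatch: () => Promise<void>,
  logError: (message: string) => void
): Promise<void> {
  try {
    await dispatch();
  } catch (err) {
    // Transient failures (e.g. a DB blip) are logged and swallowed; the
    // scheduler simply tries again on its next tick instead of failing the step.
    logError(`dispatch tick failed: ${err instanceof Error ? err.message : String(err)}`);
  }
}
```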
```yaml
"${{ needs.deploy-agents-api.outputs.url }}/api/deploy/restart-scheduler" \
-H "Authorization: Bearer ${{ secrets.INKEEP_AGENTS_RUN_API_BYPASS_SECRET }}" \
--retry 3 \
--retry-delay 5
```
Consider adding --retry-all-errors and --max-time 30. --retry 3 retries only on errors curl treats as transient (timeouts, HTTP 408, 429, and most 5xx responses). If the new deployment hasn't finished booting, the connection may be refused outright, and curl won't retry without --retry-all-errors.
```ts
const run = await start(schedulerWorkflow, []);

await upsertSchedulerState(runDbClient)({
  currentRunId: run.runId,
  deploymentId: getDeploymentId(),
});
```
Race: supersession window allows dual dispatch. start(schedulerWorkflow) on line 16 launches the new scheduler, which calls registerSchedulerStep (writing runId to scheduler_state). But the old scheduler may have already woken and passed checkSchedulerCurrentStep before the upsert lands. Consider calling upsertSchedulerState before start(schedulerWorkflow) with a sentinel value, so the old scheduler's next check fails immediately.
PR Review Summary
(9) Total Issues | Risk: High
🔴❗ Critical (2) ❗🔴
🔴 1) trigger_schedules Missing data migration for existing scheduled triggers
Issue: The new trigger_schedules runtime table will be empty after deployment. Existing enabled triggers in the manage DB (scheduled_triggers table) will not be backfilled. The scheduler workflow reads from trigger_schedules to dispatch triggers, but only newly created/updated triggers will be synced via onTriggerCreated/onTriggerUpdated.
Why: After deploying this migration, all existing enabled scheduled triggers will stop executing. The old per-trigger daisy-chaining workflows have been removed, and the new centralized scheduler will find zero rows in trigger_schedules. Production cron jobs and one-time triggers will silently fail to run until manually touched via the API or UI. This is a data consistency gap that will cause production outages.
Fix: Add a one-time backfill step. Options:
- Add a migration script (similar to scripts/sync-spicedb.sh) that queries all enabled triggers from manage DB and inserts corresponding rows into trigger_schedules
- Add backfill logic to startSchedulerWorkflow() in agents-api/src/index.ts that runs on server startup before the scheduler starts
- Use the data reconciliation framework to sync existing triggers on first scheduler tick
Example:

```ts
// In SchedulerService.ts or index.ts startup
async function backfillTriggerSchedules() {
  const enabledTriggers = await listEnabledScheduledTriggers(manageDb)({ scopes: { /* all tenants */ } });
  for (const trigger of enabledTriggers) {
    await syncTriggerToScheduleTable(trigger);
  }
}
```

Refs:
- 0023_amazing_romulus.sql:10 — empty table created
- ScheduledTriggerService.ts:45 — only syncs on create/update
Inline Comments:
- 🔴 Critical: restartScheduler.ts:32 — Timing-attack vulnerable secret comparison + bypass when secret unset
🟠⚠️ Major (4) 🟠⚠️
🟠 1) trigger_schedules No claim timeout mechanism - stuck triggers unrecoverable
Issue: The claimedAt field has no expiration mechanism. If a dispatcher crashes after claimTriggerSchedule but before releaseTriggerScheduleClaim, the trigger remains claimed indefinitely. The partial index excludes claimed rows from findDueTriggerSchedules, so the trigger will never fire again.
Why: This creates a permanent silent failure mode. The only recovery would be manual database intervention to clear claimedAt. There's no alerting, no self-healing, and no visibility into stuck claims.
Fix: Modify the claim condition to treat stale claims as reclaimable:
```ts
// In claimTriggerSchedule - also claim if claimedAt is older than threshold
const claimCondition = or(
  isNull(triggerSchedules.claimedAt),
  lt(triggerSchedules.claimedAt, sql`now() - interval '5 minutes'`)
);
```

Or add a periodic cleanup job that releases claims older than a threshold.
Refs:
🟠 2) scheduled-triggers.ts Data reconciliation check gutted - no observability into scheduler health
Issue: The check() function now returns empty arrays for all audit categories. This removes the ability to detect orphaned, missing, or stuck triggers through the data reconciliation system.
Why: Operations teams lose visibility into scheduled trigger health. If triggers fail to sync to trigger_schedules or the scheduler workflow dies, there is no automated detection. This contradicts the existing data reconciliation pattern used throughout the codebase.
Fix: Implement a new check function that validates the new architecture:
```ts
check: async (ctx): Promise<ScheduledTriggerAuditResult> => {
  const [enabledTriggers, schedules] = await Promise.all([
    listEnabledScheduledTriggers(ctx.manageDb)({ scopes: ctx.scopes }),
    listTriggerSchedulesByProject(ctx.runDb)({ scopes: ctx.scopes }),
  ]);
  const scheduleMap = new Map(schedules.map(s => [s.scheduledTriggerId, s]));
  const missingWorkflows = enabledTriggers
    .filter(t => !scheduleMap.has(t.id))
    .map(t => ({ triggerId: t.id, triggerName: t.name }));
  return { missingWorkflows, orphanedWorkflows: [], staleWorkflows: [], deadWorkflows: [], verificationFailures: [] };
}
```

Refs:
🟠 3) system Missing changeset for schema/behavior changes
Issue: This PR adds a database migration, new data-access exports, and a new API endpoint. Per AGENTS.md changeset guidance, schema changes requiring migration warrant a minor bump.
Fix: Create a changeset:

```shell
pnpm bump minor --pkg agents-core --pkg agents-api "Add scheduler workflow with centralized trigger dispatch and deploy restart endpoint"
```

🟠 4) system Critical paths have no test coverage
Issue: The following new code has no test coverage:
- dispatchSingleTrigger — claim/advance/rollback concurrency control
- computeNextRunAt — cron parsing and timezone handling
- claimTriggerSchedule — optimistic locking primitive
- checkSchedulerCurrentStep — scheduler supersession logic
- /api/deploy/restart-scheduler — auth validation
Why: The dispatcher's claim/advance/rollback sequence is the core scheduling mechanism. A bug here could cause duplicate dispatches, permanently stuck triggers, or silent missed executions. These are the highest-risk code paths with zero coverage.
Refs:
Inline Comments:
- 🟠 Major: triggerDispatcher.ts:108 — No error handling for claim release
- 🟠 Major: restartScheduler.ts:36 — No error handling for scheduler restart
- 🟠 Major: computeNextRunAt.ts:21 — Unhandled cron parsing exception
🟡 Minor (3) 🟡
🟡 1) SchedulerService.ts:16 Race condition between workflow start and state registration
Issue: The scheduler state is updated (line 18-21) after start() succeeds (line 16). If the process crashes between these operations, the new workflow runs without being registered in scheduler_state. However, registerSchedulerStep inside the workflow also calls upsertSchedulerState, providing a fallback.
Why: The double-registration is fine but creates potential inconsistency during the race window. Low severity because the workflow self-registers.
Refs:
🟡 2) index.ts:142 Scheduler startup failure doesn't prevent server from serving traffic
Issue: If startSchedulerWorkflow() fails during startup, the error is caught and logged but the server continues running without a scheduler. This is a silent failure state.
Fix: Consider exposing scheduler status via a health endpoint, or emit metrics/alerts when the scheduler fails to start.
🟡 3) triggerSchedules.ts:87-101 No limit on findDueTriggerSchedules query
Issue: The query returns all due triggers without a LIMIT clause. If many triggers become due simultaneously (e.g., after an outage), this could cause memory pressure.
Fix: Add a configurable LIMIT and process in batches across ticks.
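Batch processing across ticks might look like this sketch, assuming a hypothetical limit parameter on the due-trigger query:

```typescript
// Sketch of batched dispatch across ticks, assuming a hypothetical limit
// parameter on the due-trigger query; triggers beyond the batch stay due and
// are picked up on the next tick.
const DISPATCH_BATCH_SIZE = 100;

async function dispatchOneBatch(
  findDue: (limit: number) => Promise<string[]>,
  dispatch: (triggerId: string) => Promise<void>
): Promise<number> {
  const due = await findDue(DISPATCH_BATCH_SIZE);
  for (const triggerId of due) {
    await dispatch(triggerId);
  }
  return due.length;
}
```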
Inline Comments:
- 🟡 Minor: vercel-production.yml:170 — Missing timeout-minutes
- 🟡 Minor: vercel-production.yml:178 — curl --retry won't retry connection-refused errors
🚫 REQUEST CHANGES
Summary: This PR introduces a well-designed centralized scheduler architecture, but has a critical data migration gap that will break all existing scheduled triggers on deployment. The trigger_schedules table will be empty, and existing triggers won't fire until manually touched. Additionally, there are several error handling gaps that could leave triggers permanently stuck, and the data reconciliation check has been gutted without replacement. Please address the migration backfill and error handling issues before merging.
Discarded (8)
| Location | Issue | Reason Discarded |
|---|---|---|
| schedulerSteps.ts:8 | Duplicate scheduler state registration | Intentional fallback - workflow self-registers as recovery mechanism |
| triggerDispatcher.ts:89 | Non-null assertion on nextRunAt | Query guarantees non-null via WHERE clause; fragile but correct |
| triggerDispatcher.ts:45-49 | Error logging without trigger context | Minor logging improvement, not blocking |
| schedulerWorkflow.ts:30 | Missing error handling in infinite loop | Workflow framework handles step retries |
| vercel-production.yml:9 | Missing secret documentation | Documentation-only, not blocking |
| runtime-schema.ts:960 | PK doesn't include all scope columns | Intentional simplification, scheduledTriggerId unique per tenant |
| restartScheduler.ts:20 | Handler-level auth vs middleware | Intentional pattern for deploy hooks |
| ScheduledTriggerService.ts:45-52 | No error handling in onTriggerCreated | Error propagates to caller appropriately |
Reviewers (8)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-architecture | 5 | 2 | 0 | 0 | 0 | 0 | 3 |
| pr-review-errors | 8 | 1 | 0 | 0 | 3 | 0 | 4 |
| pr-review-sre | 8 | 2 | 0 | 0 | 1 | 0 | 5 |
| pr-review-tests | 6 | 1 | 0 | 0 | 0 | 0 | 5 |
| pr-review-breaking-changes | 4 | 1 | 0 | 0 | 0 | 0 | 3 |
| pr-review-security-iam | 3 | 0 | 0 | 0 | 1 | 0 | 2 |
| pr-review-devops | 6 | 1 | 0 | 0 | 2 | 0 | 3 |
| pr-review-standards | 3 | 1 | 0 | 0 | 0 | 0 | 2 |
| Total | 43 | 9 | 0 | 0 | 7 | 0 | 27 |
```ts
if (token !== env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
  return c.json({ error: 'Unauthorized' }, 401);
}
```
🔴 CRITICAL: Timing-attack vulnerable secret comparison + bypass when secret unset
Issue: This comparison is not constant-time and allows auth bypass when INKEEP_AGENTS_RUN_API_BYPASS_SECRET is not configured. When the env var is unset (undefined), and no Authorization header is sent (token is also undefined), then undefined !== undefined is false, so auth passes.
Why: An attacker could probe the endpoint to infer the secret via timing differences. More critically, if deployed without the secret configured, the endpoint becomes fully unauthenticated, allowing anyone to restart the scheduler.
Fix:

```ts
if (!env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
  return c.json({ error: 'Endpoint not available' }, 503);
}
if (!token || !constantTimeEqual(token, env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET)) {
  return c.json({ error: 'Unauthorized' }, 401);
}
```
You'll need to add a constantTimeEqual helper using crypto.timingSafeEqual() - see existing patterns in packages/agents-core/src/utils/apiKeys.ts:97.
Refs:
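A sketch of such a helper, assuming Node's crypto.timingSafeEqual (the actual pattern in apiKeys.ts may differ in detail):

```typescript
import { timingSafeEqual } from 'node:crypto';

// Sketch of the suggested helper using Node's timingSafeEqual; the existing
// pattern in apiKeys.ts may differ in detail.
function constantTimeEqual(a: string, b: string): boolean {
  const bufA = Buffer.from(a);
  const bufB = Buffer.from(b);
  // timingSafeEqual throws on length mismatch, so check lengths first; this
  // leaks only the secret's length, not its contents.
  if (bufA.length !== bufB.length) return false;
  return timingSafeEqual(bufA, bufB);
}
```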
```ts
await releaseTriggerScheduleClaim(runDbClient)({
  tenantId,
  scheduledTriggerId,
});
```
🟠 MAJOR: No error handling for claim release - triggers can get stuck permanently
Issue: If releaseTriggerScheduleClaim fails after the workflow has been successfully started (line 93), the trigger remains claimed indefinitely. There's no claim timeout mechanism in the schema, so this trigger would never fire again.
Why: A transient database error at this point would leave claimedAt set with no recovery path. The only symptom would be silently missing scheduled executions, which are very hard to debug.
Fix:

```ts
try {
  await releaseTriggerScheduleClaim(runDbClient)({
    tenantId,
    scheduledTriggerId,
  });
} catch (err) {
  logger.error(
    { scheduledTriggerId, err },
    'Failed to release trigger claim after successful dispatch - trigger may remain claimed'
  );
  // Don't throw - workflow is already running
}
```
Also consider adding a claim timeout mechanism in findDueTriggerSchedules to reclaim triggers where claimed_at is older than a threshold (e.g., 5 minutes).
Refs:
```ts
const result = await startSchedulerWorkflow();

logger.info(result, 'Scheduler workflow restarted via deploy hook');
```
🟠 MAJOR: No error handling - scheduler restart failures are silent
Issue: If startSchedulerWorkflow() throws, the error propagates as an unhandled exception. The CI deploy step will fail, but the error details won't be surfaced in the response body.
Why: A failed scheduler restart during deployment leaves the system without an active scheduler workflow. Scheduled triggers would silently stop firing with no clear indication of why.
Fix:

```ts
try {
  const result = await startSchedulerWorkflow();
  logger.info(result, 'Scheduler workflow restarted via deploy hook');
  return c.json(result);
} catch (err) {
  logger.error({ error: err }, 'Failed to restart scheduler workflow via deploy hook');
  return c.json(
    { error: 'Failed to start scheduler workflow', details: err instanceof Error ? err.message : String(err) },
    500
  );
}
```
```ts
  currentDate: baseDate,
  tz: cronTimezone || 'UTC',
});
return interval.next().toISOString();
```
🟠 MAJOR: Unhandled cron parsing exception can leave triggers stuck
Issue: CronExpressionParser.parse() throws if the cron expression is invalid. This function is called from syncTriggerToScheduleTable and dispatchSingleTrigger. An invalid expression that bypasses validation would cause unhandled exceptions.
Why: In dispatchSingleTrigger, this throws after claiming the trigger but before advancing it, leaving the trigger stuck in claimed state.
Fix:

```ts
if (cronExpression) {
  try {
    const baseDate = lastScheduledFor ? new Date(lastScheduledFor) : new Date();
    const interval = CronExpressionParser.parse(cronExpression, {
      currentDate: baseDate,
      tz: cronTimezone || 'UTC',
    });
    return interval.next().toISOString();
  } catch (err) {
    throw new Error(
      `Invalid cron expression '${cronExpression}': ${err instanceof Error ? err.message : String(err)}`
    );
  }
}
```
```yaml
restart-scheduler:
  name: Restart scheduler workflow
  needs: [promote, deploy-agents-api]
  runs-on: ubuntu-latest
```
🟡 Minor: Missing timeout-minutes on restart-scheduler job
Issue: This job has no timeout-minutes setting. If the endpoint hangs, the job could run until GitHub's 6-hour default.
Fix:

```yaml
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
```
Other jobs in this repo use timeout-minutes: 15-30. Since the curl has retries (max ~15s), 5 minutes provides ample margin.
```yaml
"${{ needs.deploy-agents-api.outputs.url }}/api/deploy/restart-scheduler" \
-H "Authorization: Bearer ${{ secrets.INKEEP_AGENTS_RUN_API_BYPASS_SECRET }}" \
--retry 3 \
--retry-delay 5
```
🟡 Minor: curl --retry won't retry hard connection errors by default
Issue: --retry 3 only retries on errors curl treats as transient (timeouts, HTTP 408, 429, and most 5xx responses), not hard failures like connection refused while the deployment is still booting.
Fix: Add --retry-all-errors to retry on all failures, including refused connections:
```yaml
--retry 3 \
--retry-delay 5 \
--retry-all-errors
```
This is curl 7.71+, available on ubuntu-latest runners.
PR Review Summary
(2) Total Issues | Risk: High
This is a delta review scoped to the 2 commits (11 files) since the last automated review. The delta addresses several prior issues but leaves the most critical one unaddressed.
🔴❗ Critical (1) ❗🔴
🔴 1) system Missing data migration for existing scheduled triggers — STILL UNADDRESSED
Issue: This was flagged in the prior review as Critical and remains unaddressed in this delta. The new trigger_schedules runtime table will be empty after deployment. Existing enabled triggers in the manage DB will not be backfilled.
Why: After deploying this migration, all existing enabled scheduled triggers will stop executing. The old per-trigger daisy-chaining workflows have been removed, and the new centralized scheduler will find zero rows in trigger_schedules. Production cron jobs and one-time triggers will silently fail to run until manually touched via the API or UI.
Fix: Add a one-time backfill step. Options:
- Add a migration script that queries all enabled triggers from manage DB and inserts corresponding rows into trigger_schedules
- Add backfill logic to startSchedulerWorkflow() in agents-api/src/index.ts that runs on server startup before the scheduler starts
- Use the data reconciliation framework to sync existing triggers on first scheduler tick
Refs:
Inline Comments:
- 🟠 Major: scheduled-triggers.ts:18-20 — scheduleChanged missing cronTimezone comparison
🟡 Minor (0) 🟡
No new minor issues in delta.
💭 Consider (1) 💭
💭 1) .github/workflows/vercel-production.yml:167 Consider continue-on-error for non-blocking scheduler restart
Issue: The restart-scheduler job will fail the entire workflow if the scheduler restart endpoint returns an error after all retries. The scheduler has self-recovery mechanisms (supersession checks, re-registration on restart).
Why: If the CI call fails, the scheduler will recover when the API boots or on the next cron tick. However, if scheduler restart is considered critical to deploy success, the current strict behavior is appropriate.
Fix: Add continue-on-error: true at job level if scheduler restart failure should not fail the deploy workflow.
✅ Delta Fixes Verified
The following issues from the prior review have been properly addressed in this delta:
| Issue | Status |
|---|---|
| Security: Timing-attack vulnerable secret comparison | ✅ Fixed — constantTimeEqual() using crypto.timingSafeEqual() |
| Security: Auth bypass when secret unset | ✅ Fixed — Returns 503 when INKEEP_AGENTS_RUN_API_BYPASS_SECRET is unset |
| Security: No error handling in restart handler | ✅ Fixed — Try-catch with structured 500 response |
| CI: Missing timeout-minutes | ✅ Fixed — Added timeout-minutes: 2 |
| CI: curl --retry won't retry connection-refused errors | ✅ Fixed — Added --retry-all-errors |
| Correctness: Data reconciliation check gutted | ✅ Fixed — Restored with cross-table comparison + tests |
| Correctness: No claim timeout mechanism | ✅ Fixed — Added 5-minute stale claim recovery |
| Correctness: lte vs eq in claim safety | ✅ Fixed — Simplified to atomic stale-claim-aware WHERE clause |
🕐 Pending Recommendations (1)
- 🔴 Missing data migration for existing scheduled triggers — Critical issue from prior review, still unaddressed
🚫 REQUEST CHANGES
Summary: The delta addresses most feedback from the prior review — security fixes are complete, CI improvements are in place, and the data reconciliation check has been properly restored with tests. However, the critical data migration gap (existing triggers won't be backfilled to trigger_schedules) remains unaddressed. This will cause all existing scheduled triggers to stop firing after deployment. Please add a backfill mechanism before merging.
Discarded (3)
| Location | Issue | Reason Discarded |
|---|---|---|
| registry.test.ts:313 | Missing test for disabled orphaned schedules | Valid but low priority — edge case that affects reconciliation accuracy but not runtime execution |
| registry.test.ts:293 | Missing test edge case documentation | Valid but INFO level — documents intended behavior but doesn't catch bugs |
| vercel-production.yml | Partial index won't cover stale claims query | Acceptable tradeoff — stale claims are rare, unclaimed triggers (common case) use index |
Reviewers (5)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-security-iam | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-sre | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| pr-review-tests | 3 | 0 | 0 | 0 | 0 | 0 | 2 |
| pr-review-devops | 4 | 0 | 1 | 0 | 0 | 0 | 1 |
| Total | 8 | 0 | 1 | 0 | 1 | 0 | 3 |
Note: Security and standards reviewers returned 0 issues because all prior security issues were verified as fixed.
Ito Test Report ❌
13 test cases ran. 11 passed, 2 failed.
✅ The scheduled-trigger regression pass found stable behavior for deep-link loading, validation, mobile usability, invocation creation, and auth hardening checks.
🔍 Code-first verification of the two included failures indicates likely product defects in next-run handling for past one-time schedules and in rapid toggle state consistency under concurrent updates.
✅ Passed (11)
❌ Failed (2)
- One-time trigger scheduled in the past shows no upcoming next run – Failed
- Rapid enable/disable interaction keeps final trigger state consistent – Failed
PR Review Summary
(0) Total Issues | Risk: Low (delta scope)
This is a delta review scoped to changes since the last automated review (7af85b5d481f).
Delta Changes
The only change in this delta is the addition of a new specification document:
specs/2026-03-13-manage-table-cron-dispatcher/SPEC.md (+256 lines)
This document evaluates an alternative architectural approach (using the manage DoltgreSQL table vs runtime Postgres table for scheduling) and recommends against it. The analysis is thorough and well-reasoned, covering:
- Branch iteration cost scaling (O(n) vs O(1))
- DoltgreSQL versioning overhead for high-frequency transactional writes
- Connection pool pressure
- Broken claim/release locking semantics in Dolt's commit model
The spec correctly concludes that the runtime-table approach (already implemented in this PR) is the better design choice.
✅ Delta Review: Clean
No new issues in the delta. The spec document is well-written architectural documentation that supports the implementation decisions already made.
🕐 Pending Recommendations (1)
- 🔴 Missing data migration for existing scheduled triggers — Critical issue from prior review, still unaddressed. The new trigger_schedules table will be empty after deployment; existing enabled triggers won't be backfilled.
💡 APPROVE WITH SUGGESTIONS
Summary: The delta (spec document) is clean and adds valuable architectural context. However, the critical data migration gap from prior reviews remains unaddressed — existing scheduled triggers will not be synced to the new trigger_schedules table on deployment, causing all existing cron/one-time triggers to stop firing. Please add a backfill mechanism before merging.
Reviewers (0)
No code reviewers dispatched — delta contains only documentation changes.
No description provided.