feat(incident): bootstrap IncidentLifecycleWorkflow with E2E test#26883
feat(incident): bootstrap IncidentLifecycleWorkflow with E2E test#26883IceS2 wants to merge 41 commits intofeat/incident-lifecycle-workflowfrom
Conversation
Adds a generic, configurable-status human task node for governance workflows. The node creates an OM Task, waits for status transitions via IntermediateCatchEvent messages, and routes based on terminal vs non-terminal statuses. Key components: - ManualTask.java: BPMN subprocess builder (setup → wait → route → close) - SetupDelegate/SetupImpl: Task creation, idempotent on cycle re-entry - CheckTerminalDelegate: Validates status against template - CloseTaskDelegate/CloseTaskImpl: Closes task on terminal status - SetResultDelegate: Propagates status to parent for edge routing - ManualTaskTemplateResolver: Template-based status configuration - ManualTaskDefinition JSON schema + nodeType/nodeSubType registration The node is domain-agnostic — incident/approval behavior lives in the workflow graph around the node, not inside it.
Remove inputNamespaceMapExpr and configMapExpr from BaseDelegate. Each delegate now declares its own Expression fields, preventing NullPointerExceptions in delegates that don't use these fields (e.g., SetResultDelegate, CheckTerminalDelegate, CloseTaskDelegate).
- Fix: isAlreadyClosed now only checks task.getResolution() != null. Previously it also checked terminalStatuses.contains(currentStatus), which always returned true when CloseTask runs (the PATCH already set the terminal status), leaving tasks permanently unresolved. - Remove unused terminalStatuses parameter from closeTask/CloseTaskDelegate - Rename taskCreated → taskAlreadyExists for clarity: the variable means "should we enter the message-waiting phase" (true on re-entry, false on first creation)
Implements the bridge that connects Task status changes to the ManualTask workflow node via Flowable message delivery. Bridge chain: TaskRepository.postUpdate() detects status change → TaskWorkflowHandler.transitionManualTaskStatus() (sends updatedBy) → WorkflowHandler.sendManualTaskMessage() with async exponential retry Key design decisions: - postUpdate wrapped in try-catch: workflow failures never break PATCH - Async retry via ScheduledExecutorService + resilience4j IntervalFunction: 500ms → 1s → 2s → 4s → 5s cap (~12.5s total coverage) - First attempt synchronous (fast path), retries non-blocking - CloseTaskImpl uses actual user from PATCH, falls back to governance-bot Also includes: - WorkflowDefinitionRepository/WorkflowInstanceStateRepository updates - CollectionDAO, ListFilter, EntityResource supporting changes - SQL migration (2.0.0) for stageResult generated column - ManualTaskWorkflowTest: full E2E lifecycle test
- Fix: catch FlowableOptimisticLockingException in tryDeliverMessage and return false to trigger retry (concurrent modification means another thread may have consumed the subscription) - Fix: nonTerminalReachable BFS now iterates the full edges list at every step, not just the unfiltered outgoingEdges map. Prevents following terminal-condition edges from intermediate nodes. - Refactor: hoist IntervalFunction to a static final constant instead of recreating on each retry. Made interval fields final. - Fix: remove IF NOT EXISTS from PostgreSQL migration for consistency with MySQL pattern (Flyway handles migration idempotency)
…tConsumer Add Entity.TASK to validEntityTypes, detect workflow-managed task status changes via isWorkflowManagedTaskStatusChange, and enqueue them to the outbox table via enqueueTaskMessage before the existing signal broadcast.
Add 4 reflection-based unit tests for isWorkflowManagedTaskStatusChange covering early-return conditions: non-update events, non-task entity types, missing changeDescription, and non-status field changes.
Remove the TaskRepository.postUpdate override that synchronously called TransitionManualTaskStatus, the TaskWorkflowHandler.transitionManualTaskStatus method it depended on, and the WorkflowHandler async retry infrastructure (sendManualTaskMessage, scheduleMessageRetry, tryDeliverMessage, and their backing constants and ScheduledExecutorService). Task status transitions are now delivered exclusively via the Transactional Outbox pattern.
…andler Start the drainer after the process engine is built in the constructor. Restart it when initializeNewProcessEngine() rebuilds the engine at runtime. Shut it down gracefully via WorkflowHandler.shutdown(), which is called from ManagedShutdown.stop() in OpenMetadataApplication.
…tency The E2E test must tolerate up to 10s CE poll + 30s drainer poll plus margin. Raise all Awaitility atMost() values to 90 seconds.
…roadcast disruption A DB failure during outbox INSERT should not prevent the signal broadcast path from executing. Log the error and continue.
…an older row SKIP LOCKED skips individual locked rows, not entire task groups. Without this guard, Worker B could grab a newer status while Worker A still holds the oldest — violating per-task ordering. The fix queries the absolute oldest createdAt (no lock) and skips the task if the locked row is newer.
…ycle Collapse findDistinctPendingTaskIds + per-task findAndLockOldestPending + per-task findOldestPendingCreatedAt into a single findAndLockAllOldestPending query using MIN(createdAt) JOIN with FOR UPDATE SKIP LOCKED. Per-task ordering is preserved naturally: if the oldest row for a task is locked by another worker, the JOIN produces no match for that task.
C2: Replace MIN(createdAt) JOIN with ROW_NUMBER() PARTITION BY taskId
to guarantee exactly one row per task even with identical timestamps.
C3: Split bulk-lock transaction into bulk-read (no lock) + per-entry
transactions. Row locks now held only during single-entry delivery,
not the entire batch. Flowable API calls no longer inside a DB
transaction holding locks on other rows.
I2: Add MAX_ATTEMPTS=100. Entries exceeding this are excluded from the
drain query and effectively dead-lettered for investigation.
I3: Call cleanupDelivered() at end of each drain cycle with 7-day
retention to prevent unbounded table growth.
I4: Extract workflowInstanceId from ChangeEvent entity payload instead
of fetching from DB. Eliminates extra round trip per task status
change event.
I5: Move signal broadcast before outbox enqueue so it always fires.
Wrap enqueue in resilience4j retry (3 attempts) for transient DB
errors. Unhandled failure propagates to event publisher for retry.
Add LIMIT 500 with ORDER BY attempts ASC, createdAt ASC to prevent unbounded result sets and prioritize fresh messages over stuck ones. Separate cleanup into its own try-catch for cleaner error diagnostics.
…OutboxIT Move E2E test to integration-tests module where the full application stack is running (CE pipeline, schedulers, drainer). The test verifies the complete outbox delivery pipeline through observable outcomes: 1. Deploy workflow → create table → workflow triggers → task created 2. PATCH task InProgress → PATCH task Completed 3. Poll workflow instance until FINISHED status 4. Assert stage results contain expected status transitions
Strangler Fig bridge that syncs Task lifecycle events to TCRS records, enabling workflow-managed incidents to keep the existing incident UI/API working while Tasks become the source of truth. Components: - aboutEntityLink field on Task schema (EntityLink format with testCase FQN + incident stateId), backed by generated DB columns + index - IncidentTcrsSyncHandler: lifecycle handler mapping Task events to TCRS records (New, Ack, Assigned, Resolved) with entity relationships - TCRS guard in openOrAssignTask() to skip Task creation when a workflow-managed Task already exists for the incident - SetupImpl builds aboutEntityLink for incident task types - testCaseStatus added to WorkflowTriggerFields enum - E2E integration test verifying full pipeline
…al status check - Rename specificUsers to specificAssignees using EntityLink strings (e.g. <#E::user::alice>, <#E::team::engineers>) to support both users and teams, consistent with the existing user task pattern - CloseTaskImpl.isAlreadyClosed() now checks terminal status from the resolved template instead of checking resolution != null
…incident-tcrs-sync-hook
…pen-metadata/OpenMetadata into feat/ilw-item2-incident-tcrs-sync-hook
Adds configurable duplicate trigger handling for governance workflows. When the same entity triggers a workflow again while an instance is already running, the trigger's onConflict policy controls the behavior: - restart (default): terminate old instance, start new (existing behavior) - skip: keep existing instance, ignore new event - forward: deliver 'retrigger' status to the active ManualTask The retrigger status flows through the existing IntermediateCatchEvent path — zero BPMN changes needed. ManualTask exposes 'retrigger' as a routable branch only when the trigger uses onConflict=forward, allowing workflow authors to wire it via normal graph edges (self-loop for incidents, back-to-decision-tree for approvals).
…d E2E test Ships a default governance workflow that triggers on test case failure, creates an incident task assigned to the table owner, and waits for human resolution. Uses onConflict=forward so re-failures don't kill the running workflow. JSON Logic filter ensures only Failed status triggers the workflow. E2E test validates the full lifecycle: test failure → task created → TCRS(New) synced → InProgress → TCRS(Ack) → Completed → workflow FINISHED → task resolved → TCRS(Resolved).
❌ Lint Check Failed — ESLint + Prettier (core-components)The following files have style issues that need to be fixed: Fix locally (fast — changed files only): cd openmetadata-ui-core-components/src/main/resources/ui
yarn ui-checkstyle:changedOr to fix all files: |
| case "forward" -> { | ||
| boolean forwarded = | ||
| WorkflowHandler.getInstance() | ||
| .forwardRetrigger(mainWorkflowDefinitionName, entityLinkStr, Map.of()); | ||
| if (forwarded) { | ||
| log.info( | ||
| "onConflict=forward: retrigger delivered to running instance for entity={}", | ||
| entityLinkStr); | ||
| passesFilter = false; | ||
| } else { | ||
| log.debug( | ||
| "onConflict=forward: no active ManualTask found for entity={}, starting new instance", | ||
| entityLinkStr); | ||
| } |
There was a problem hiding this comment.
⚠️ Bug: onConflict=forward can create duplicate workflow instances
When forwardRetrigger() returns false because the running instance has no active message subscriptions (ManualTask not yet active or between re-entries), the code falls through and allows a new workflow instance to start — creating a duplicate alongside the existing one.
forwardRetrigger() returns false in two different scenarios: (1) no running instance exists, and (2) a running instance exists but has no message subscriptions. Only scenario (1) should allow a new instance; scenario (2) should block it.
The skip case correctly uses hasRunningInstance() to guard against this.
Suggested fix:
case "forward" -> {
boolean forwarded =
WorkflowHandler.getInstance()
.forwardRetrigger(mainWorkflowDefinitionName, entityLinkStr, Map.of());
if (forwarded) {
log.info("onConflict=forward: retrigger delivered");
passesFilter = false;
} else if (WorkflowHandler.getInstance()
.hasRunningInstance(mainWorkflowDefinitionName, entityLinkStr)) {
log.info("onConflict=forward: running instance exists but no active ManualTask, skipping");
passesFilter = false;
} else {
log.debug("onConflict=forward: no running instance, starting new");
}
}
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| private String extractStringValue(Object value) { | ||
| if (value instanceof String s) { | ||
| return s.startsWith("\"") ? s.substring(1, s.length() - 1) : s; | ||
| } | ||
| return value != null ? value.toString() : null; |
There was a problem hiding this comment.
💡 Edge Case: extractStringValue crashes on single-quote string edge case
In extractStringValue, if value is the string " (a single double-quote character, length 1), the check s.startsWith(""") passes but s.substring(1, s.length() - 1) calls substring(1, 0) which throws StringIndexOutOfBoundsException. While unlikely in practice (FieldChange values are typically well-formed JSON), this is called in a lifecycle handler where an unhandled exception would abort TCRS sync for the event.
Suggested fix:
private String extractStringValue(Object value) {
if (value instanceof String s) {
if (s.length() >= 2 && s.startsWith(""") && s.endsWith(""")) {
return s.substring(1, s.length() - 1);
}
return s;
}
return value != null ? value.toString() : null;
}
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| for (EventSubscription sub : subscriptions) { | ||
| try { | ||
| Map<String, Object> retriggerVars = new HashMap<>(variables); | ||
| retriggerVars.put("status", "retrigger"); | ||
| runtimeService.messageEventReceived( | ||
| sub.getEventName(), sub.getExecutionId(), retriggerVars); | ||
| LOG.info( | ||
| "Forwarded retrigger to process instance {} via message '{}'", | ||
| target.getId(), | ||
| sub.getEventName()); | ||
| return true; | ||
| } catch (Exception e) { | ||
| LOG.debug( | ||
| "Could not deliver retrigger to subscription '{}': {}", | ||
| sub.getEventName(), |
There was a problem hiding this comment.
💡 Bug: forwardRetrigger silently swallows delivery failures at DEBUG level
In the forwardRetrigger method, when message delivery fails for a subscription (line 1521), the exception is logged at DEBUG level and the loop continues to the next subscription. If all subscriptions fail, the method returns false, which the caller interprets as 'no running instance' (potentially creating a duplicate). The caller cannot distinguish between 'no instance found' and 'delivery failed' — both return false.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
✅ TypeScript Types Auto-UpdatedThe generated TypeScript types have been automatically updated based on JSON schema changes in this PR. |
❌ Playwright Lint Check Failed — ESLint + Prettier + Organise ImportsThe following files have style issues that need to be fixed: Fix locally (fast — changed files only): cd openmetadata-ui/src/main/resources/ui
yarn ui-checkstyle:playwright:changedOr to fix all playwright files: |
Code Review
|
| Compact |
|
Was this helpful? React with 👍 / 👎 | Gitar
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
Summary
testCase-entityUpdatedwhentestCaseStatus == Failed(JSON Logic filter excludes non-Failed updates)addOwners: true(auto-assigns table owner)onConflict: forwardwith retrigger self-loop so re-failures don't kill the running workflowWhy
This is the last mile of Slice 1. All infrastructure was already built (ManualTask node, outbox bridge, TCRS sync handler, onConflict policy). This PR wires them together into a shippable default workflow that addresses #22967 (auto-assign) and collate#3080 (Slice 1 default workflow).
Changes
IncidentLifecycleWorkflow.jsonIncidentLifecycleWorkflowIT.javaWorkflow Graph
Test plan
bootstrapWorkflow_fullLifecycle— test failure triggers workflow → task created (Open, Incident, IncidentResolution) → aboutEntityLink verified → TCRS(New) synced → InProgress → TCRS(Ack) → Completed → workflow FINISHED → task resolved → TCRS(Resolved)