Skip to content

feat(incident): bootstrap IncidentLifecycleWorkflow with E2E test#26883

Closed
IceS2 wants to merge 41 commits intofeat/incident-lifecycle-workflowfrom
feat/ilw-slice1-default-workflow
Closed

feat(incident): bootstrap IncidentLifecycleWorkflow with E2E test#26883
IceS2 wants to merge 41 commits intofeat/incident-lifecycle-workflowfrom
feat/ilw-slice1-default-workflow

Conversation

@IceS2
Copy link
Copy Markdown
Contributor

@IceS2 IceS2 commented Mar 31, 2026

Summary

  • Ships the default IncidentLifecycleWorkflow as a bootstrap workflow definition (auto-deployed on server startup)
  • Triggers on testCase-entityUpdated when testCaseStatus == Failed (JSON Logic filter excludes non-Failed updates)
  • Creates an incident Task via ManualTask node with addOwners: true (auto-assigns table owner)
  • Uses onConflict: forward with retrigger self-loop so re-failures don't kill the running workflow
  • E2E test validates full lifecycle: test failure → task created → TCRS sync (New/Ack/Resolved) → workflow completion

Why

This is the last mile of Slice 1. All infrastructure was already built (ManualTask node, outbox bridge, TCRS sync handler, onConflict policy). This PR wires them together into a shippable default workflow that addresses #22967 (auto-assign) and collate#3080 (Slice 1 default workflow).

Changes

File Change
IncidentLifecycleWorkflow.json Bootstrap workflow definition: trigger, ManualTask node, edges
IncidentLifecycleWorkflowIT.java E2E test: full incident lifecycle against bootstrap workflow

Workflow Graph

         ┌─── retrigger ───┐
         │  ┌── Open ──────┤
         │  │ ┌ InProgress ┤
         │  │ │ ┌ Pending ─┤
         ▼  ▼ ▼ ▼          │
[Start] → [resolveIncident] ─── Completed → [End]

Test plan

  • bootstrapWorkflow_fullLifecycle — test failure triggers workflow → task created (Open, Incident, IncidentResolution) → aboutEntityLink verified → TCRS(New) synced → InProgress → TCRS(Ack) → Completed → workflow FINISHED → task resolved → TCRS(Resolved)

IceS2 and others added 30 commits March 16, 2026 14:41
Adds a generic, configurable-status human task node for governance
workflows. The node creates an OM Task, waits for status transitions
via IntermediateCatchEvent messages, and routes based on terminal vs
non-terminal statuses.

Key components:
- ManualTask.java: BPMN subprocess builder (setup → wait → route → close)
- SetupDelegate/SetupImpl: Task creation, idempotent on cycle re-entry
- CheckTerminalDelegate: Validates status against template
- CloseTaskDelegate/CloseTaskImpl: Closes task on terminal status
- SetResultDelegate: Propagates status to parent for edge routing
- ManualTaskTemplateResolver: Template-based status configuration
- ManualTaskDefinition JSON schema + nodeType/nodeSubType registration

The node is domain-agnostic — incident/approval behavior lives in the
workflow graph around the node, not inside it.
Remove inputNamespaceMapExpr and configMapExpr from BaseDelegate.
Each delegate now declares its own Expression fields, preventing
NullPointerExceptions in delegates that don't use these fields
(e.g., SetResultDelegate, CheckTerminalDelegate, CloseTaskDelegate).
- Fix: isAlreadyClosed now only checks task.getResolution() != null.
  Previously it also checked terminalStatuses.contains(currentStatus),
  which always returned true when CloseTask runs (the PATCH already set
  the terminal status), leaving tasks permanently unresolved.
- Remove unused terminalStatuses parameter from closeTask/CloseTaskDelegate
- Rename taskCreated → taskAlreadyExists for clarity: the variable means
  "should we enter the message-waiting phase" (true on re-entry, false
  on first creation)
Implements the bridge that connects Task status changes to the
ManualTask workflow node via Flowable message delivery.

Bridge chain:
  TaskRepository.postUpdate() detects status change
  → TaskWorkflowHandler.transitionManualTaskStatus() (sends updatedBy)
  → WorkflowHandler.sendManualTaskMessage() with async exponential retry

Key design decisions:
- postUpdate wrapped in try-catch: workflow failures never break PATCH
- Async retry via ScheduledExecutorService + resilience4j IntervalFunction:
  500ms → 1s → 2s → 4s → 5s cap (~12.5s total coverage)
- First attempt synchronous (fast path), retries non-blocking
- CloseTaskImpl uses actual user from PATCH, falls back to governance-bot

Also includes:
- WorkflowDefinitionRepository/WorkflowInstanceStateRepository updates
- CollectionDAO, ListFilter, EntityResource supporting changes
- SQL migration (2.0.0) for stageResult generated column
- ManualTaskWorkflowTest: full E2E lifecycle test
- Fix: catch FlowableOptimisticLockingException in tryDeliverMessage and
  return false to trigger retry (concurrent modification means another
  thread may have consumed the subscription)
- Fix: nonTerminalReachable BFS now iterates the full edges list at every
  step, not just the unfiltered outgoingEdges map. Prevents following
  terminal-condition edges from intermediate nodes.
- Refactor: hoist IntervalFunction to a static final constant instead of
  recreating on each retry. Made interval fields final.
- Fix: remove IF NOT EXISTS from PostgreSQL migration for consistency
  with MySQL pattern (Flyway handles migration idempotency)
…tConsumer

Add Entity.TASK to validEntityTypes, detect workflow-managed task status
changes via isWorkflowManagedTaskStatusChange, and enqueue them to the
outbox table via enqueueTaskMessage before the existing signal broadcast.
Add 4 reflection-based unit tests for isWorkflowManagedTaskStatusChange
covering early-return conditions: non-update events, non-task entity types,
missing changeDescription, and non-status field changes.
Remove the TaskRepository.postUpdate override that synchronously called
TransitionManualTaskStatus, the TaskWorkflowHandler.transitionManualTaskStatus
method it depended on, and the WorkflowHandler async retry infrastructure
(sendManualTaskMessage, scheduleMessageRetry, tryDeliverMessage, and their
backing constants and ScheduledExecutorService). Task status transitions are
now delivered exclusively via the Transactional Outbox pattern.
…andler

Start the drainer after the process engine is built in the constructor.
Restart it when initializeNewProcessEngine() rebuilds the engine at
runtime. Shut it down gracefully via WorkflowHandler.shutdown(), which
is called from ManagedShutdown.stop() in OpenMetadataApplication.
…tency

The E2E test must tolerate up to 10s CE poll + 30s drainer poll plus
margin. Raise all Awaitility atMost() values to 90 seconds.
…roadcast disruption

A DB failure during outbox INSERT should not prevent the signal broadcast
path from executing. Log the error and continue.
…an older row

SKIP LOCKED skips individual locked rows, not entire task groups. Without
this guard, Worker B could grab a newer status while Worker A still holds
the oldest — violating per-task ordering. The fix queries the absolute
oldest createdAt (no lock) and skips the task if the locked row is newer.
…ycle

Collapse findDistinctPendingTaskIds + per-task findAndLockOldestPending +
per-task findOldestPendingCreatedAt into a single findAndLockAllOldestPending
query using MIN(createdAt) JOIN with FOR UPDATE SKIP LOCKED. Per-task
ordering is preserved naturally: if the oldest row for a task is locked by
another worker, the JOIN produces no match for that task.
C2: Replace MIN(createdAt) JOIN with ROW_NUMBER() PARTITION BY taskId
    to guarantee exactly one row per task even with identical timestamps.

C3: Split bulk-lock transaction into bulk-read (no lock) + per-entry
    transactions. Row locks now held only during single-entry delivery,
    not the entire batch. Flowable API calls no longer inside a DB
    transaction holding locks on other rows.

I2: Add MAX_ATTEMPTS=100. Entries exceeding this are excluded from the
    drain query and effectively dead-lettered for investigation.

I3: Call cleanupDelivered() at end of each drain cycle with 7-day
    retention to prevent unbounded table growth.

I4: Extract workflowInstanceId from ChangeEvent entity payload instead
    of fetching from DB. Eliminates extra round trip per task status
    change event.

I5: Move signal broadcast before outbox enqueue so it always fires.
    Wrap enqueue in resilience4j retry (3 attempts) for transient DB
    errors. Unhandled failure propagates to event publisher for retry.
Add LIMIT 500 with ORDER BY attempts ASC, createdAt ASC to prevent
unbounded result sets and prioritize fresh messages over stuck ones.
Separate cleanup into its own try-catch for cleaner error diagnostics.
…OutboxIT

Move E2E test to integration-tests module where the full application stack
is running (CE pipeline, schedulers, drainer). The test verifies the
complete outbox delivery pipeline through observable outcomes:

1. Deploy workflow → create table → workflow triggers → task created
2. PATCH task InProgress → PATCH task Completed
3. Poll workflow instance until FINISHED status
4. Assert stage results contain expected status transitions
IceS2 and others added 10 commits March 23, 2026 08:29
Strangler Fig bridge that syncs Task lifecycle events to TCRS records,
enabling workflow-managed incidents to keep the existing incident UI/API
working while Tasks become the source of truth.

Components:
- aboutEntityLink field on Task schema (EntityLink format with testCase
  FQN + incident stateId), backed by generated DB columns + index
- IncidentTcrsSyncHandler: lifecycle handler mapping Task events to TCRS
  records (New, Ack, Assigned, Resolved) with entity relationships
- TCRS guard in openOrAssignTask() to skip Task creation when a
  workflow-managed Task already exists for the incident
- SetupImpl builds aboutEntityLink for incident task types
- testCaseStatus added to WorkflowTriggerFields enum
- E2E integration test verifying full pipeline
…al status check

- Rename specificUsers to specificAssignees using EntityLink strings
  (e.g. <#E::user::alice>, <#E::team::engineers>) to support both
  users and teams, consistent with the existing user task pattern
- CloseTaskImpl.isAlreadyClosed() now checks terminal status from the
  resolved template instead of checking resolution != null
…pen-metadata/OpenMetadata into feat/ilw-item2-incident-tcrs-sync-hook
Adds configurable duplicate trigger handling for governance workflows.
When the same entity triggers a workflow again while an instance is
already running, the trigger's onConflict policy controls the behavior:

- restart (default): terminate old instance, start new (existing behavior)
- skip: keep existing instance, ignore new event
- forward: deliver 'retrigger' status to the active ManualTask

The retrigger status flows through the existing IntermediateCatchEvent
path — zero BPMN changes needed. ManualTask exposes 'retrigger' as a
routable branch only when the trigger uses onConflict=forward, allowing
workflow authors to wire it via normal graph edges (self-loop for
incidents, back-to-decision-tree for approvals).
…d E2E test

Ships a default governance workflow that triggers on test case failure,
creates an incident task assigned to the table owner, and waits for
human resolution. Uses onConflict=forward so re-failures don't kill
the running workflow. JSON Logic filter ensures only Failed status
triggers the workflow.

E2E test validates the full lifecycle: test failure → task created →
TCRS(New) synced → InProgress → TCRS(Ack) → Completed → workflow
FINISHED → task resolved → TCRS(Resolved).
@IceS2 IceS2 requested a review from a team as a code owner March 31, 2026 10:11
@github-actions github-actions bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Mar 31, 2026
@github-actions
Copy link
Copy Markdown
Contributor

❌ Lint Check Failed — ESLint + Prettier (core-components)

The following files have style issues that need to be fixed:

Fix locally (fast — changed files only):

cd openmetadata-ui-core-components/src/main/resources/ui
yarn ui-checkstyle:changed

Or to fix all files: yarn lint:fix && yarn pretty

Comment on lines +92 to +105
case "forward" -> {
boolean forwarded =
WorkflowHandler.getInstance()
.forwardRetrigger(mainWorkflowDefinitionName, entityLinkStr, Map.of());
if (forwarded) {
log.info(
"onConflict=forward: retrigger delivered to running instance for entity={}",
entityLinkStr);
passesFilter = false;
} else {
log.debug(
"onConflict=forward: no active ManualTask found for entity={}, starting new instance",
entityLinkStr);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: onConflict=forward can create duplicate workflow instances

When forwardRetrigger() returns false because the running instance has no active message subscriptions (ManualTask not yet active or between re-entries), the code falls through and allows a new workflow instance to start — creating a duplicate alongside the existing one.

forwardRetrigger() returns false in two different scenarios: (1) no running instance exists, and (2) a running instance exists but has no message subscriptions. Only scenario (1) should allow a new instance; scenario (2) should block it.

The skip case correctly uses hasRunningInstance() to guard against this.

Suggested fix:

case "forward" -> {
  boolean forwarded =
      WorkflowHandler.getInstance()
          .forwardRetrigger(mainWorkflowDefinitionName, entityLinkStr, Map.of());
  if (forwarded) {
    log.info("onConflict=forward: retrigger delivered");
    passesFilter = false;
  } else if (WorkflowHandler.getInstance()
      .hasRunningInstance(mainWorkflowDefinitionName, entityLinkStr)) {
    log.info("onConflict=forward: running instance exists but no active ManualTask, skipping");
    passesFilter = false;
  } else {
    log.debug("onConflict=forward: no running instance, starting new");
  }
}

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

Comment on lines +225 to +229
private String extractStringValue(Object value) {
if (value instanceof String s) {
return s.startsWith("\"") ? s.substring(1, s.length() - 1) : s;
}
return value != null ? value.toString() : null;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Edge Case: extractStringValue crashes on single-quote string edge case

In extractStringValue, if value is the string " (a single double-quote character, length 1), the check s.startsWith(""") passes but s.substring(1, s.length() - 1) calls substring(1, 0) which throws StringIndexOutOfBoundsException. While unlikely in practice (FieldChange values are typically well-formed JSON), this is called in a lifecycle handler where an unhandled exception would abort TCRS sync for the event.

Suggested fix:

private String extractStringValue(Object value) {
  if (value instanceof String s) {
    if (s.length() >= 2 && s.startsWith(""") && s.endsWith(""")) {
      return s.substring(1, s.length() - 1);
    }
    return s;
  }
  return value != null ? value.toString() : null;
}

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

Comment on lines +1510 to +1524
for (EventSubscription sub : subscriptions) {
try {
Map<String, Object> retriggerVars = new HashMap<>(variables);
retriggerVars.put("status", "retrigger");
runtimeService.messageEventReceived(
sub.getEventName(), sub.getExecutionId(), retriggerVars);
LOG.info(
"Forwarded retrigger to process instance {} via message '{}'",
target.getId(),
sub.getEventName());
return true;
} catch (Exception e) {
LOG.debug(
"Could not deliver retrigger to subscription '{}': {}",
sub.getEventName(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Bug: forwardRetrigger silently swallows delivery failures at DEBUG level

In the forwardRetrigger method, when message delivery fails for a subscription (line 1521), the exception is logged at DEBUG level and the loop continues to the next subscription. If all subscriptions fail, the method returns false, which the caller interprets as 'no running instance' (potentially creating a duplicate). The caller cannot distinguish between 'no instance found' and 'delivery failed' — both return false.

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@github-actions
Copy link
Copy Markdown
Contributor

✅ TypeScript Types Auto-Updated

The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

@github-actions
Copy link
Copy Markdown
Contributor

❌ Playwright Lint Check Failed — ESLint + Prettier + Organise Imports

The following files have style issues that need to be fixed:

Fix locally (fast — changed files only):

cd openmetadata-ui/src/main/resources/ui
yarn ui-checkstyle:playwright:changed

Or to fix all playwright files: yarn ui-checkstyle:playwright

@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Mar 31, 2026

Code Review ⚠️ Changes requested 0 resolved / 3 findings

Bootstraps IncidentLifecycleWorkflow with E2E tests but introduces three issues: the onConflict=forward policy can create duplicate workflow instances when ManualTask subscriptions aren't active, extractStringValue crashes on single-quote edge cases, and forwardRetrigger silently swallows delivery failures at DEBUG level.

⚠️ Bug: onConflict=forward can create duplicate workflow instances

📄 openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/elements/triggers/impl/FilterEntityImpl.java:92-105

When forwardRetrigger() returns false because the running instance has no active message subscriptions (ManualTask not yet active or between re-entries), the code falls through and allows a new workflow instance to start — creating a duplicate alongside the existing one.

forwardRetrigger() returns false in two different scenarios: (1) no running instance exists, and (2) a running instance exists but has no message subscriptions. Only scenario (1) should allow a new instance; scenario (2) should block it.

The skip case correctly uses hasRunningInstance() to guard against this.

Suggested fix
case "forward" -> {
  boolean forwarded =
      WorkflowHandler.getInstance()
          .forwardRetrigger(mainWorkflowDefinitionName, entityLinkStr, Map.of());
  if (forwarded) {
    log.info("onConflict=forward: retrigger delivered");
    passesFilter = false;
  } else if (WorkflowHandler.getInstance()
      .hasRunningInstance(mainWorkflowDefinitionName, entityLinkStr)) {
    log.info("onConflict=forward: running instance exists but no active ManualTask, skipping");
    passesFilter = false;
  } else {
    log.debug("onConflict=forward: no running instance, starting new");
  }
}
💡 Edge Case: extractStringValue crashes on single-quote string edge case

📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/handlers/IncidentTcrsSyncHandler.java:225-229

In extractStringValue, if value is the string " (a single double-quote character, length 1), the check s.startsWith(""") passes but s.substring(1, s.length() - 1) calls substring(1, 0) which throws StringIndexOutOfBoundsException. While unlikely in practice (FieldChange values are typically well-formed JSON), this is called in a lifecycle handler where an unhandled exception would abort TCRS sync for the event.

Suggested fix
private String extractStringValue(Object value) {
  if (value instanceof String s) {
    if (s.length() >= 2 && s.startsWith(""") && s.endsWith(""")) {
      return s.substring(1, s.length() - 1);
    }
    return s;
  }
  return value != null ? value.toString() : null;
}
💡 Bug: forwardRetrigger silently swallows delivery failures at DEBUG level

📄 openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/WorkflowHandler.java:1510-1524

In the forwardRetrigger method, when message delivery fails for a subscription (line 1521), the exception is logged at DEBUG level and the loop continues to the next subscription. If all subscriptions fail, the method returns false, which the caller interprets as 'no running instance' (potentially creating a duplicate). The caller cannot distinguish between 'no instance found' and 'delivery failed' — both return false.

🤖 Prompt for agents
Code Review: Bootstraps IncidentLifecycleWorkflow with E2E tests but introduces three issues: the onConflict=forward policy can create duplicate workflow instances when ManualTask subscriptions aren't active, extractStringValue crashes on single-quote edge cases, and forwardRetrigger silently swallows delivery failures at DEBUG level.

1. ⚠️ Bug: onConflict=forward can create duplicate workflow instances
   Files: openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/elements/triggers/impl/FilterEntityImpl.java:92-105

   When `forwardRetrigger()` returns `false` because the running instance has no active message subscriptions (ManualTask not yet active or between re-entries), the code falls through and allows a new workflow instance to start — creating a duplicate alongside the existing one.
   
   `forwardRetrigger()` returns `false` in two different scenarios: (1) no running instance exists, and (2) a running instance exists but has no message subscriptions. Only scenario (1) should allow a new instance; scenario (2) should block it.
   
   The `skip` case correctly uses `hasRunningInstance()` to guard against this.

   Suggested fix:
   case "forward" -> {
     boolean forwarded =
         WorkflowHandler.getInstance()
             .forwardRetrigger(mainWorkflowDefinitionName, entityLinkStr, Map.of());
     if (forwarded) {
       log.info("onConflict=forward: retrigger delivered");
       passesFilter = false;
     } else if (WorkflowHandler.getInstance()
         .hasRunningInstance(mainWorkflowDefinitionName, entityLinkStr)) {
       log.info("onConflict=forward: running instance exists but no active ManualTask, skipping");
       passesFilter = false;
     } else {
       log.debug("onConflict=forward: no running instance, starting new");
     }
   }

2. 💡 Edge Case: extractStringValue crashes on single-quote string edge case
   Files: openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/handlers/IncidentTcrsSyncHandler.java:225-229

   In `extractStringValue`, if `value` is the string `"` (a single double-quote character, length 1), the check `s.startsWith(""")` passes but `s.substring(1, s.length() - 1)` calls `substring(1, 0)` which throws `StringIndexOutOfBoundsException`. While unlikely in practice (FieldChange values are typically well-formed JSON), this is called in a lifecycle handler where an unhandled exception would abort TCRS sync for the event.

   Suggested fix:
   private String extractStringValue(Object value) {
     if (value instanceof String s) {
       if (s.length() >= 2 && s.startsWith(""") && s.endsWith(""")) {
         return s.substring(1, s.length() - 1);
       }
       return s;
     }
     return value != null ? value.toString() : null;
   }

3. 💡 Bug: forwardRetrigger silently swallows delivery failures at DEBUG level
   Files: openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/WorkflowHandler.java:1510-1524

   In the `forwardRetrigger` method, when message delivery fails for a subscription (line 1521), the exception is logged at DEBUG level and the loop continues to the next subscription. If all subscriptions fail, the method returns `false`, which the caller interprets as 'no running instance' (potentially creating a duplicate). The caller cannot distinguish between 'no instance found' and 'delivery failed' — both return `false`.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant