fix: resolve governance workflow deadlock and remove WorkflowTransactionManager #26452
gitar bot fix the review comments, run mvn spotless:apply on the root of the repo before pushing
Co-authored-by: yan-3005 <yan-3005@users.noreply.github.com>
Done — applied all three review fixes and ran
Pull request overview
This PR targets governance workflow stability by changing how duplicate workflow instances are terminated (to avoid DB deadlocks/FK violations) and by simplifying how workflow instance states are queried/updated, while removing an unused transaction-manager abstraction from the workflow definition resource layer.
Changes:
- Move Flowable process-instance deletion out of the JDBI transaction path and add guards to avoid deleting auto-completing processes.
- Terminate duplicate instances asynchronously from the trigger delegate to avoid transactional conflicts during signal handling.
- Simplify instance-state retrieval by adding a DAO method to list all states for a workflow instance and removing the unused WorkflowTransactionManager.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/resources/governance/WorkflowDefinitionResource.java | Reverts workflow definition CRUD to the base EntityResource implementations (removes transaction-manager usage). |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/WorkflowInstanceStateRepository.java | Refactors failure-marking to use a new “list all states for instance” query. |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/WorkflowInstanceRepository.java | Uses the new state listing method to derive final workflow instance status. |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java | Adds listAllStatesForInstance query on the workflow instance state time-series table. |
| openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/elements/triggers/impl/FilterEntityImpl.java | Runs duplicate-termination asynchronously to avoid Flowable/MySQL transactional conflicts. |
| openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/WorkflowTransactionManager.java | Removes the unused transaction manager class. |
| openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/WorkflowHandler.java | Performs Flowable deletions outside JDBI transaction and adds a “skip delete if no active user tasks” guard. |
    CompletableFuture.runAsync(
            () ->
                WorkflowHandler.getInstance()
                    .terminateDuplicateInstances(
                        workflowName, entityLinkStrFinal, processInstanceId))
        .exceptionally(
            ex -> {
              log.error("Async termination of duplicate instances failed", ex);
              return null;
            });
CompletableFuture.runAsync(...) is using the ForkJoinPool common pool. This code path can block on DB/Flowable work, so using the common pool risks thread starvation and unpredictable throughput under load. Use the service's managed executor (e.g., AsyncService.getInstance().getExecutorService() or another dedicated executor used elsewhere in the server) and consider wrapping the runnable with the existing context-propagation helper if needed.
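A minimal sketch of what this comment suggests, assuming a small dedicated pool stands in for the service's managed executor. `TERMINATION_POOL`, `terminateAsync`, and the thread name are hypothetical and are not part of the PR; the only point illustrated is the two-argument `CompletableFuture.runAsync(Runnable, Executor)` overload, which keeps blocking work off the common pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class DedicatedExecutorSketch {
  // Hypothetical bounded pool; a real service would reuse its managed executor instead.
  private static final ExecutorService TERMINATION_POOL =
      Executors.newFixedThreadPool(
          2,
          runnable -> {
            Thread thread = new Thread(runnable, "workflow-termination");
            thread.setDaemon(true); // do not keep the JVM alive for this sketch
            return thread;
          });

  /** Runs a potentially blocking task on the dedicated pool and logs any failure. */
  static CompletableFuture<Void> terminateAsync(Runnable blockingTask) {
    return CompletableFuture.runAsync(blockingTask, TERMINATION_POOL)
        .exceptionally(
            ex -> {
              System.err.println("Async termination of duplicate instances failed: " + ex);
              return null;
            });
  }

  public static void main(String[] args) {
    terminateAsync(() -> System.out.println("ran on " + Thread.currentThread().getName()))
        .join();
  }
}
```

Because the pool is bounded, a burst of slow terminations queues up here rather than starving unrelated tasks that share the common pool.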
    List<WorkflowInstanceState> states =
        workflowInstanceStateRepository.listAllStatesForInstance(workflowInstanceId);

    do {
      ResultList<WorkflowInstanceState> workflowInstanceStates =
          workflowInstanceStateRepository.listWorkflowInstanceStateForInstance(
              workflowDefinition.getName(),
              workflowInstanceId,
              offset,
              workflowInstance.getStartedAt(),
              workflowInstance.getEndedAt(),
              100,
              false);

      if (workflowInstanceStates.getData().stream()
          .anyMatch(
              workflowInstanceState ->
                  workflowInstanceState
                      .getStatus()
                      .equals(WorkflowInstance.WorkflowStatus.FAILURE))) {
        workflowStatus = WorkflowInstance.WorkflowStatus.FAILURE;
        break;
      }

      offset = workflowInstanceStates.getPaging().getAfter();
    } while (offset != null);
    WorkflowInstance.WorkflowStatus workflowStatus = WorkflowInstance.WorkflowStatus.FINISHED;
    if (states.stream()
        .anyMatch(s -> s.getStatus().equals(WorkflowInstance.WorkflowStatus.FAILURE))) {
      workflowStatus = WorkflowInstance.WorkflowStatus.FAILURE;
    }
updateWorkflowInstance now loads all states for an instance into memory just to determine whether any state is FAILURE. For instances with many state records this is a potential memory/latency regression vs the previous paginated/early-exit approach. Consider adding a DAO/repository method that short-circuits (e.g., fetch only statuses, or page through results and break on first failure) so this method doesn’t require reading every state JSON blob.
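One way to read the "page through results and break on first failure" option is sketched below. It is an illustration under assumptions, not the repository's API: `PageFetcher`, `anyMatchPaged`, and `overList` are hypothetical names, and an in-memory list stands in for the state table.

```java
import java.util.List;
import java.util.function.Predicate;

final class EarlyExitPagingSketch {
  /** Hypothetical page source: returns up to {@code limit} items starting at {@code offset}. */
  interface PageFetcher<T> {
    List<T> fetch(int offset, int limit);
  }

  /** Scans page by page and stops as soon as one item matches, so later pages are never read. */
  static <T> boolean anyMatchPaged(PageFetcher<T> fetcher, int pageSize, Predicate<T> matches) {
    if (pageSize <= 0) {
      throw new IllegalArgumentException("pageSize must be positive");
    }
    int offset = 0;
    while (true) {
      List<T> page = fetcher.fetch(offset, pageSize);
      if (page.stream().anyMatch(matches)) {
        return true; // early exit on the first matching page
      }
      if (page.size() < pageSize) {
        return false; // a short page means there is nothing left to read
      }
      offset += page.size();
    }
  }

  /** In-memory stand-in for a paginated query, used only for demonstration. */
  static <T> PageFetcher<T> overList(List<T> all) {
    return (offset, limit) -> {
      int from = Math.min(offset, all.size());
      int to = Math.min(from + limit, all.size());
      return all.subList(from, to);
    };
  }

  public static void main(String[] args) {
    List<String> statuses = List.of("RUNNING", "FINISHED", "FAILURE", "RUNNING", "RUNNING");
    boolean failed =
        EarlyExitPagingSketch.<String>anyMatchPaged(overList(statuses), 2, "FAILURE"::equals);
    System.out.println("any failure: " + failed);
  }
}
```

The other option the comment mentions, selecting only the status column, would avoid deserializing each state JSON blob entirely; it is not shown because it depends on the actual table schema.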
            processId, "Terminated due to conflicting workflow instance");
      } catch (FlowableObjectNotFoundException e) {
        LOG.debug(
            "Process instance {} already completed before termination, skipping", processId);
The Flowable termination loop only handles FlowableObjectNotFoundException. Any other runtime exception from deleteProcessInstance (e.g., transient DB/lock errors) will abort the method before the JDBI transaction runs, leaving conflicting instances neither terminated nor marked as failed. To make this more robust, catch/log unexpected exceptions per-instance and continue so the state-marking transaction still executes.
            "Process instance {} already completed before termination, skipping", processId);
      } catch (RuntimeException e) {
        LOG.error(
            "Failed to terminate process instance {} for conflicting instance {}. Continuing with remaining instances.",
            processId,
            instance.getId(),
            e);
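The control flow behind this suggestion can be sketched in isolation. This is a simplified illustration, assuming plain callbacks in place of the Flowable and JDBI calls; `terminateAll`, `terminator`, and `markAsFailed` are hypothetical names rather than methods from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ContinueOnFailureSketch {
  /**
   * Attempts to terminate every process id, logging and skipping failures, then always runs
   * the follow-up step. Returns the ids whose termination failed.
   */
  static List<String> terminateAll(
      List<String> processIds, Consumer<String> terminator, Runnable markAsFailed) {
    List<String> failedIds = new ArrayList<>();
    for (String processId : processIds) {
      try {
        terminator.accept(processId); // stand-in for the engine's delete call
      } catch (RuntimeException e) {
        // One bad instance must not prevent the remaining ones from being handled.
        System.err.println("Failed to terminate process instance " + processId + ": " + e);
        failedIds.add(processId);
      }
    }
    markAsFailed.run(); // stand-in for the state-marking transaction; reached in all cases
    return failedIds;
  }

  public static void main(String[] args) {
    List<String> failed =
        terminateAll(
            List.of("p1", "p2", "p3"),
            id -> {
              if (id.equals("p2")) {
                throw new IllegalStateException("simulated lock timeout");
              }
            },
            () -> System.out.println("marking conflicting instances as failed"));
    System.out.println("failed terminations: " + failed);
  }
}
```

Returning the failed ids is a design choice of this sketch: it lets a caller decide whether to retry them later instead of relying on the log alone.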



Describe your changes:
Fixes
I worked on ... because ...
Type of change:
Checklist:
Fixes <issue-number>: <short explanation>
Summary by Gitar
- Moved the terminateDuplicateInstances call outside the JDBI transaction in FilterEntityImpl to async execution, preventing an FK constraint violation from concurrent Flowable job operations.
- Added a guard in WorkflowHandler to skip deletion of processes with no active user tasks, preventing a lock-order deadlock on PostgreSQL.
- Removed the unused WorkflowTransactionManager and its usage in WorkflowDefinitionResource.
- Added a listAllStatesForInstance() method to CollectionDAO and WorkflowInstanceStateRepository for direct state retrieval without pagination.
- Updated markInstanceStatesAsFailed() and markInstanceAsFinished() to use the new direct query instead of paginated queries with workflow definition lookups.
This will update automatically on new commits.