Add retryable handling for transient missing MCP tools in Wayflow tool-list resolution #161

@coolkid123-collab

Problem

Our project mounts several external MCPs inside our project MCP. When an external MCP briefly becomes unhealthy or is remounted, our project workflowserver may still hold a cached view of the full tool list, while Wayflow’s direct MCP query temporarily receives only a partial tool list.

Example:

Expected/cached tools: a, b, c
Wayflow MCP query returns: b, c
Workflow step requires: a
Result: step fails because Wayflow believes tool a does not exist

This appears to be transient rather than a true configuration error. The external MCP may become healthy again shortly afterward.

Observed behavior

When Wayflow calls a tool that is missing from the MCP server’s current tool list, the MCP server returns a valid HTTP 200 response. Because of that, the existing MCP Transport-level RetryPolicy does not treat this as a transport failure and therefore does not retry.
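To illustrate why a status-based retry never fires here, the following is a minimal sketch and not Wayflow's actual RetryPolicy. The set of retryable status codes, the function names, and the response payload shape (a JSON-RPC error object, or a result flagged with `isError`) are all assumptions made for illustration; the only thing taken from our observation is that the HTTP status is 200.

```python
import json

# Hypothetical: status codes a transport-level policy might retry on.
RETRYABLE_STATUS = {502, 503, 504}


def transport_should_retry(status_code: int) -> bool:
    """Transport-style check: looks only at the HTTP status code."""
    return status_code in RETRYABLE_STATUS


def is_semantic_tool_error(body: str) -> bool:
    """Payload-level check: inspects the response body for an error.

    Assumes the body is JSON-RPC and that a failure shows up either as a
    top-level "error" object or as a result with "isError" set.
    """
    payload = json.loads(body)
    if "error" in payload:
        return True
    result = payload.get("result")
    return isinstance(result, dict) and bool(result.get("isError"))


# Illustrative response only: HTTP 200 whose body reports that tool "a"
# is unknown. The exact code and message are assumptions.
status = 200
body = json.dumps(
    {
        "jsonrpc": "2.0",
        "id": 1,
        "error": {"code": -32602, "message": "Unknown tool: a"},
    }
)

print(transport_should_retry(status))  # the transport check sees success
print(is_semantic_tool_error(body))    # the payload check sees the failure
```

The point of the sketch is that the two checks look at different layers, so any retry for this case would have to be driven by the payload-level check.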

Impact

Workflows can fail immediately even though the missing tool condition may be caused by a temporary external MCP or mount issue. This is especially problematic when our project workflowserver still reports the expected tool from its cached view, but Wayflow’s live MCP query sees only a partial list.

Request

Add a Wayflow-side mechanism to handle this condition as retryable or detectable by callers.

Possible approaches:

1. Retry missing expected tools
   - When a workflow step references a tool that is expected/configured but is absent from the current MCP tool list, retry the MCP tool-list query a few times before failing.
   - Make retry count/backoff configurable for assistant developers.
2. Return a specific retryable error/status
   - If Wayflow determines that a required tool is missing from the MCP list, return a distinct error code/status that our project can detect.
   - Our project can then retry from its side instead of treating the workflow step as a permanent failure.
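The two approaches above could be combined roughly as follows. This is a sketch under stated assumptions, not a proposal for the exact API: `resolve_tools`, `RetryableMissingToolError`, `max_attempts`, and `base_delay` are hypothetical names, and `list_tools` stands in for whatever Wayflow uses to query the MCP tool list.

```python
import time
from typing import Callable, Iterable, Set


class RetryableMissingToolError(Exception):
    """Hypothetical error: required tools are absent from the live tool list."""

    def __init__(self, missing: Set[str]):
        super().__init__(f"missing tools: {sorted(missing)}")
        self.missing = missing


def resolve_tools(
    list_tools: Callable[[], Iterable[str]],
    required: Set[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Re-query the tool list until every required tool is present.

    Waits base_delay * 2**attempt between attempts (exponential backoff).
    Raises RetryableMissingToolError if tools are still missing at the end,
    so the caller can tell this apart from a permanent failure.
    """
    missing = set(required)
    for attempt in range(max_attempts):
        available = set(list_tools())
        missing = set(required) - available
        if not missing:
            return available
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RetryableMissingToolError(missing)


# Usage with a fake MCP query: first a partial list, then the full list.
responses = iter([["b", "c"], ["a", "b", "c"]])
tools = resolve_tools(lambda: next(responses), {"a"}, sleep=lambda s: None)
print(sorted(tools))
```

Injecting `sleep` keeps the backoff testable, and raising a distinct exception type covers the second approach when the retries in the first approach are exhausted.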

Expected behavior

If a required MCP tool is missing due to a transient partial tool-list response, Wayflow should either:

retry tool-list resolution before failing

or

return a specific retryable missing-tool error/status

so the workflow does not fail immediately on temporary external MCP or remount issues.
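From our project's side, the second option could be consumed along these lines. This is only a sketch of the caller's view: `StepStatus`, `StepResult`, and `run_with_caller_retry` are hypothetical names, and the actual shape of the status Wayflow would return is left open.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class StepStatus(Enum):
    OK = "ok"
    RETRYABLE_MISSING_TOOL = "retryable_missing_tool"  # hypothetical status
    FAILED = "failed"


@dataclass
class StepResult:
    status: StepStatus
    detail: str = ""


def run_with_caller_retry(
    run_step: Callable[[], StepResult], max_attempts: int = 3
) -> StepResult:
    """Re-run a step only while it reports the retryable missing-tool status.

    Any other status (success or permanent failure) is returned immediately.
    If every attempt reports a missing tool, the last result is returned.
    """
    result = StepResult(StepStatus.FAILED, "step was never attempted")
    for _ in range(max_attempts):
        result = run_step()
        if result.status is not StepStatus.RETRYABLE_MISSING_TOOL:
            return result
    return result


# Usage with a fake step: tool "a" is missing once, then the step succeeds.
outcomes = iter(
    [
        StepResult(StepStatus.RETRYABLE_MISSING_TOOL, "tool a not in MCP list"),
        StepResult(StepStatus.OK),
    ]
)
final = run_with_caller_retry(lambda: next(outcomes))
print(final.status.value)
```

A distinct status keeps the retry decision with the caller, which already knows from its cached view that the tool is expected to exist.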

Notes

Our team’s investigation found that the current transport retry policy will not help, because the MCP server returns HTTP 200 for the missing-tool case. A new Wayflow-level mechanism would likely be needed to catch this semantic error and allow retry configuration around it.
