[RemoteWorkflows] Fail fast when remote runner is failing [1.6.x] #5469

Merged: 3 commits merged into mlrun:1.6.x on Apr 30, 2024

@liranbg liranbg commented Apr 29, 2024

When running a remote workflow, the configuration might be incorrect, causing the remote runner (the pod that spins up the remote workflow) to fail before it starts the workflow at all.
The client would then wait needlessly for the workflow to come up for the entire timeout (300 seconds by default).

This PR introduces a fail-fast mechanism for remote workflows: it detects the remote runner's status and, once the runner fails, returns 412 to the client (e.g. the SDK), which then immediately stops waiting for the remote workflow to finish.
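
To illustrate the client-side behavior described above, here is a minimal sketch of a polling loop that stops as soon as the server reports a failed runner. It is not the code from this PR: `wait_for_workflow`, `PreconditionFailedError`, and the `poll` callable are hypothetical names chosen for illustration, and the only values taken from the description are the 412 status and the 300-second default timeout.

```python
import time


class PreconditionFailedError(Exception):
    """Hypothetical stand-in for an HTTP 412 response from the API."""


def wait_for_workflow(poll, timeout=300.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the workflow is up, the runner fails, or the timeout expires.

    `poll` is a hypothetical callable that returns a workflow id once the
    workflow has started, returns None while the runner is still starting,
    and raises PreconditionFailedError when the server answers 412.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            workflow_id = poll()
        except PreconditionFailedError as exc:
            # Fail fast: the runner already failed, so waiting out the
            # rest of the timeout cannot produce a workflow.
            raise RuntimeError(f"remote runner failed: {exc}") from exc
        if workflow_id is not None:
            return workflow_id
        sleep(interval)
    raise TimeoutError(f"workflow did not start within {timeout} seconds")
```

Without the `except` branch, a failed runner would look the same as a slow one, and the caller would sit in the loop until the deadline. The `sleep` and `clock` parameters are only there so the loop can be exercised without real delays.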

https://iguazio.atlassian.net/browse/ML-6188

@liranbg liranbg requested a review from TomerShor April 29, 2024 06:34

@TomerShor TomerShor left a comment

Approved with a really minor comment

mlrun/model.py (outdated, resolved)
@liranbg liranbg merged commit c634bd8 into mlrun:1.6.x Apr 30, 2024
10 checks passed
@TomerShor TomerShor changed the title from "[RemoteWorkflows] Fail fast when remote runner is failing" to "[RemoteWorkflows] Fail fast when remote runner is failing [1.6.x]" Apr 30, 2024
liranbg added a commit to liranbg/mlrun that referenced this pull request Apr 30, 2024
@liranbg liranbg deleted the remote-wf-failfast branch May 5, 2024 05:30