Increase job submission timeout and fast fail when it occurs #217

Merged
mwylde merged 2 commits into master from micah_submission_timeout
Oct 15, 2020

Conversation

@mwylde mwylde commented Oct 5, 2020

If the job submission POST request hits our 1-minute timeout (typically because the job's main method, which must run to produce the job graph before submission can complete, is slow), we can end up with two jobs submitted to the cluster.

The sequence of events looks like this:

  1. We POST to /run
  2. The POST times out
  3. We retry the submission, POSTing to /run
  4. The JobManager finishes job submission from the first request
  5. The JobManager finishes job submission from the second request

After a timeout, we have no way to know whether the job submission is still in progress on the JobManager (JM), so the only safe behavior in this case is to fail the deploy and roll back to the previous version.

This PR adopts that behavior and also increases the timeout to 5 minutes, which should make this much rarer in practice.

@premsantosh

👍

@mwylde mwylde merged commit 236fe19 into master Oct 15, 2020
@mwylde mwylde deleted the micah_submission_timeout branch October 15, 2020 21:47
@@ -36,7 +37,7 @@ const httpPost = "POST"
 const httpPatch = "PATCH"
 const retryCount = 3
 const httpGetTimeOut = 5 * time.Second
-const defaultTimeOut = 1 * time.Minute
+const defaultTimeOut = 5 * time.Minute
Is there a way to configure timeouts for different operations? I can see that in certain environments (minikube, for instance) timeouts are prevalent.


4 participants