New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add retry to loading the sample job #2199
Conversation
Loading the sample job has sporadically failures due to race condition creating the db foreign key. Retry on the job should succeed.
For details see kubeflow/kubeflow#2199
@@ -216,7 +216,7 @@ | |||
serviceAccountName: "ml-pipeline", | |||
}, | |||
}, | |||
backoffLimit: 0, | |||
backoffLimit: 2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the load sample require retries?
BTW, kubeflow/pipelines#610 mentions that there are errors in ml-pipelines-load-samples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IronPan Can you expand on what the problem is and how this fixes it?
I see two different problems
#603 problem creating the foreign key
#610 connection timeout.
If this is a retryable error; should the retry logic be in the go code as opposed to the K8s job?
I think in other places in Kubeflow we've used this go library to implement exponential backoff.
https://github.com/cenkalti/backoff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a temporary solution to resolve the foreign key issue that doesn't require a new pipeline release and can quickly unblock kf release. I'll work on a proper fix that doesn't use job to load samples.
Add more details to the description
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Loading the sample job has sporadically failures due to race condition creating the db foreign key. Retry on the job should succeed.
* Add retry to loading the sample job (#2199) Loading the sample job has sporadically failures due to race condition creating the db foreign key. Retry on the job should succeed. * cherrypick * bump pipeline sdk version to 0.1.7 (#2253) * remove load sample job from pipeline config (#2254) Avoid using job to load samples into pipeline system. The K8s Job doesn't work well because it's immutable once created, ksonnet throw an error of "merging object with existing state failure" in case of upgrading the system. * cherrypick
Loading the sample job has sporadically failures due to race condition creating the db foreign key. Retry on the job should succeed.
Loading the sample job has sporadically failures due to race condition creating the db foreign key. Retry on the job should succeed.
Loading the sample job has sporadically failures due to race condition with api server start up. They both can create the db foreign key.
The gorm library handles foreign key existence but it won't be safe in race condition
https://github.com/jinzhu/gorm/blob/master/scope.go#L1221
This is a temporary fix to allow samples load correctly. In long term, the sample loading shouldn't be a job because job doesn't work well with ksonnet. It won't work during upgrade scenario.
This change is