Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry to loading the sample job #2199

Merged
merged 1 commit into from Jan 3, 2019
Merged

Conversation

IronPan
Copy link
Member

@IronPan IronPan commented Jan 3, 2019

Loading the sample job has sporadically failures due to race condition with api server start up. They both can create the db foreign key.

The gorm library handles foreign key existence but it won't be safe in race condition
https://github.com/jinzhu/gorm/blob/master/scope.go#L1221

This is a temporary fix to allow samples load correctly. In long term, the sample loading shouldn't be a job because job doesn't work well with ksonnet. It won't work during upgrade scenario.


This change is Reviewable

Loading the sample job has sporadically failures due to race condition creating the db foreign key. 
Retry on the job should succeed.
@IronPan
Copy link
Member Author

IronPan commented Jan 3, 2019

/assign @kunmingg @jlewi

IronPan added a commit to kubeflow/pipelines that referenced this pull request Jan 3, 2019
@IronPan
Copy link
Member Author

IronPan commented Jan 3, 2019

Fix kubeflow/pipelines#603

@@ -216,7 +216,7 @@
serviceAccountName: "ml-pipeline",
},
},
backoffLimit: 0,
backoffLimit: 2,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the load sample require retries?
BTW, kubeflow/pipelines#610 mentions that there are errors in ml-pipelines-load-samples.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IronPan Can you expand on what the problem is and how this fixes it?

I see two different problems
#603 problem creating the foreign key
#610 connection timeout.

If this is a retryable error; should the retry logic be in the go code as opposed to the K8s job?

I think in other places in Kubeflow we've used this go library to implement exponential backoff.
https://github.com/cenkalti/backoff

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a temporary solution to resolve the foreign key issue that doesn't require a new pipeline release and can quickly unblock kf release. I'll work on a proper fix that doesn't use job to load samples.

Add more details to the description

@jlewi
Copy link
Contributor

jlewi commented Jan 3, 2019

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit b50548b into kubeflow:master Jan 3, 2019
IronPan added a commit to IronPan/kubeflow that referenced this pull request Jan 3, 2019
Loading the sample job has sporadically failures due to race condition creating the db foreign key. 
Retry on the job should succeed.
k8s-ci-robot pushed a commit that referenced this pull request Jan 3, 2019
Loading the sample job has sporadically failures due to race condition creating the db foreign key. 
Retry on the job should succeed.
k8s-ci-robot pushed a commit that referenced this pull request Jan 15, 2019
* Add retry to loading the sample job (#2199)

Loading the sample job has sporadically failures due to race condition creating the db foreign key. 
Retry on the job should succeed.

* cherrypick

* bump pipeline sdk version to 0.1.7 (#2253)

* remove load sample job from pipeline config (#2254)

Avoid using job to load samples into pipeline system. 
The K8s Job doesn't work well because it's immutable once created, ksonnet throw an error of "merging object with existing state failure" in case of upgrading the system.

* cherrypick
kkasravi pushed a commit to kkasravi/kubeflow that referenced this pull request Feb 8, 2019
Loading the sample job has sporadically failures due to race condition creating the db foreign key. 
Retry on the job should succeed.
saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021
Loading the sample job has sporadically failures due to race condition creating the db foreign key. 
Retry on the job should succeed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants