Optimize post submit jobs flow #1353
Comments
This is the same case with most of the Kubeflow repos. Can there be race condition issues between multiple merges? E.g., the code of one PR is merged while the image for another is still being built?
Hmm. As I remember, any code merge re-triggers presubmit? If so, it forces the pipeline to rebase onto master automatically. If not, I think it does have the race condition issue.
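One way to reason about the race discussed here is to always tag the image with its commit SHA and to move a floating tag only when the built commit is still the branch head. The following is a minimal sketch under that assumption; `tags_for_build` and the `latest` tag convention are hypothetical and are not taken from the Kubeflow test infrastructure.

```python
from typing import List


def tags_for_build(built_sha: str, head_sha: str, floating_tag: str = "latest") -> List[str]:
    """Return the image tags a post-submit build may push (hypothetical helper).

    The short-SHA tag is unique per commit, so concurrent builds cannot
    overwrite each other. The floating tag is added only when the built
    commit is still the head of the branch, so an older, slower build
    does not replace the tag pushed for a newer merge.
    """
    tags = [built_sha[:7]]
    if built_sha == head_sha:
        tags.append(floating_tag)
    return tags
```

A check like this narrows the window for the race rather than closing it, since the branch head can still move between the check and the push.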
There's another issue. If the post-submit fails due to flaky tests, the test mark still shows red even though the image was built and pushed to the registry, which is misleading. The major problem is that we mix release and post-submit tests together. The best practice may be to split the release process out from the post-submit tests.
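The split suggested here could be pictured as two independent statuses instead of one combined mark. This is only an illustrative sketch: `PostSubmitResult` and `run_post_submit` are hypothetical names, and the two callables stand in for whatever actually publishes the image and runs the tests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PostSubmitResult:
    image_published: bool
    tests_passed: bool


def run_post_submit(publish: Callable[[], bool], run_tests: Callable[[], bool]) -> PostSubmitResult:
    """Run publish and tests as separate steps and report each status on its own."""
    published = publish()
    # A flaky test failure is recorded separately and does not change
    # the publish status.
    passed = run_tests()
    return PostSubmitResult(image_published=published, tests_passed=passed)
```

With this shape, a red test status no longer hides the fact that the image was pushed successfully.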
What do you think about using GitHub Actions for our post-submits?
We are using kaniko to build the image in the prow cluster, and this rarely fails. We can update the workflow to skip the tests part, but Johnu's concern is valid. We need to double-check that there's no race condition issue. Does Katib run post-submit e2e tests?
Unfortunately, we are currently running a manual script to publish images.
I see. tf-operator currently works this way. It's pretty simple, but the challenge is that it leverages ksonnet to render the workflow. We can do some refactoring later to get rid of ksonnet.
@Jeffwan We are following the same Kaniko builder for our Katib pre-submits. Do you think an AWS cluster is necessary to publish images for the Kubeflow Training operator?
There are some tradeoffs. If we do not run into the problem above, GitHub Actions is preferred.
Yes, you are correct, we don't deploy another cluster in post-submit, but we are still using Argo AWS Cluster to run these Kaniko containers.
I am not sure if that will be easy.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
https://argo.kubeflow-testing.com/workflows/kubeflow-test-infra/kubeflow-tf-operator-postsubmit-v1-7d11ae8-5888-d10a
This is an example of a post-submit job. We share the same workflow for presubmit jobs and post-submit jobs.
I think we just need a pipeline that builds an image and pushes it to the public ECR registry. There's no need to rerun the tests? WDYT?
/cc @kubeflow/wg-training-leads
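A build-only post-submit step, as proposed in this issue, might come down to a single kaniko invocation with no test stage. The sketch below only assembles the executor arguments; `kaniko_args` is a hypothetical helper, and the repository and tag in the usage note are placeholders rather than the project's real values. It assumes the standard kaniko executor flags `--context`, `--dockerfile`, and `--destination`.

```python
from typing import List


def kaniko_args(context: str, dockerfile: str, repository: str, tag: str) -> List[str]:
    """Build the argument list for a kaniko executor container (hypothetical helper).

    Only build and push are described; no test step is involved.
    """
    return [
        f"--context={context}",
        f"--dockerfile={dockerfile}",
        f"--destination={repository}:{tag}",
    ]
```

For example, `kaniko_args("dir:///workspace", "Dockerfile", "public.ecr.aws/example/tf-operator", "abc1234")` yields the arguments for pushing a SHA-tagged image to a placeholder public ECR repository.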