Optimize post submit jobs flow #1353

Closed
Jeffwan opened this issue Aug 14, 2021 · 11 comments

Comments

@Jeffwan
Member

Jeffwan commented Aug 14, 2021

https://argo.kubeflow-testing.com/workflows/kubeflow-test-infra/kubeflow-tf-operator-postsubmit-v1-7d11ae8-5888-d10a

This is an example of a post-submit job. We share the same workflow for presubmit and post-submit jobs.
I think the post-submit pipeline only needs to build an image and push it to the public ECR registry. There's no need to rerun the tests, is there? WDYT?

/cc @kubeflow/wg-training-leads

@johnugeorge
Member

This is the same for most of the Kubeflow repos. Could there be race conditions between multiple merges? E.g., the code of one PR is merged while the image for another is still being built?

@Jeffwan
Member Author

Jeffwan commented Aug 15, 2021

Hmm. As far as I remember, any code merge re-triggers the presubmit jobs? If so, that forces the pipeline to rebase onto master automatically. If not, I think there is a race condition issue.
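
To make the race concrete, here is a minimal sketch in Python. It is a toy model, not the project's pipeline: `Registry`, `publish`, and `simulate` are hypothetical names, the commit IDs are made up, and it assumes each post-submit run pushes one tag per commit plus a mutable `latest` tag. Two PRs merge in order, but the build for the first merge finishes last.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Registry:
    """Toy in-memory registry: maps an image tag to the commit it was built from."""
    tags: Dict[str, str] = field(default_factory=dict)
    latest_order: Optional[int] = None  # merge order of the commit behind "latest"


def publish(registry: Registry, commit: str, merge_order: int, guard_latest: bool) -> None:
    """Push a per-commit tag, then (maybe) move the mutable "latest" tag."""
    # Per-commit tags never collide, so concurrent builds cannot overwrite each other.
    registry.tags[commit] = commit
    if (
        guard_latest
        and registry.latest_order is not None
        and merge_order < registry.latest_order
    ):
        # A newer merge already owns "latest"; leave it alone.
        return
    registry.tags["latest"] = commit
    registry.latest_order = merge_order


def simulate(guard_latest: bool) -> Registry:
    """Two merges whose builds finish in the opposite order (hypothetical commits)."""
    registry = Registry()
    # "bbb222" was merged second, but its build finished first.
    publish(registry, "bbb222", merge_order=2, guard_latest=guard_latest)
    publish(registry, "aaa111", merge_order=1, guard_latest=guard_latest)
    return registry


if __name__ == "__main__":
    print(simulate(guard_latest=False).tags["latest"])  # aaa111 (stale image wins)
    print(simulate(guard_latest=True).tags["latest"])   # bbb222
```

In this model only the mutable tag is exposed to the race; the per-commit tags are correct either way. Whether the real workflow behaves like this depends on how it tags images, which the thread does not state.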

@Jeffwan
Member Author

Jeffwan commented Aug 15, 2021

There's another issue. If the post-submit job fails due to flaky tests, the status check still shows red even though the image was built and pushed to the registry, which is misleading. The main problem is that we mix the release and the post-submit tests together. The best practice may be to split the release process out from the post-submit tests.
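
The proposed split can be sketched as follows. This is only an illustration under assumptions: the step names `build-and-push` and `e2e-tests` and both helper functions are hypothetical and are not taken from the tf-operator workflow. It contrasts a single combined status with separate statuses for the release and the tests.

```python
from typing import Dict, Set


def combined_status(steps: Dict[str, bool]) -> Dict[str, str]:
    """One check for the whole workflow: any failed step turns it red."""
    return {"postsubmit": "success" if all(steps.values()) else "failure"}


def split_status(steps: Dict[str, bool], release_steps: Set[str]) -> Dict[str, str]:
    """Report release steps and test steps as two independent checks."""
    release_ok = all(ok for name, ok in steps.items() if name in release_steps)
    tests_ok = all(ok for name, ok in steps.items() if name not in release_steps)
    return {
        "release": "success" if release_ok else "failure",
        "tests": "success" if tests_ok else "failure",
    }


# Hypothetical run: the image was pushed, but a flaky e2e test failed.
run = {"build-and-push": True, "e2e-tests": False}

print(combined_status(run))                                 # {'postsubmit': 'failure'}
print(split_status(run, release_steps={"build-and-push"}))  # {'release': 'success', 'tests': 'failure'}
```

With the combined check, a reader cannot tell that the image was published; with the split, the release result stays green while the flaky test is still visible.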

@andreyvelich
Member

What do you think about using GitHub Actions for our post-submits?
I guess we only need to push images to ECR and Docker Hub after a PR is merged.
In that case, we don't need an AWS cluster.

@Jeffwan
Member Author

Jeffwan commented Aug 16, 2021

> In that case, we don't need AWS cluster for that.

We use kaniko to build the image in the Prow cluster, and that step rarely fails. We can update the workflow to skip the tests, but Johnu's concern is valid: we need to double-check that there's no race condition.

Does Katib run post-submit e2e tests?

@andreyvelich
Member

> Does Katib run post-submit e2e tests?

Unfortunately, we currently run a manual script to publish images.
I think we can set up Actions to build and publish the images.

@Jeffwan
Member Author

Jeffwan commented Aug 16, 2021

I see. tf-operator currently does it this way. It's pretty simple, but the challenge is that it uses ksonnet to render the workflow. We can refactor later to get rid of ksonnet.

https://github.com/kubeflow/tf-operator/blob/d3725fd67fb6c285a6eca3db6492048db49dd395/test/workflows/components/workflows.libsonnet#L329-L343

@andreyvelich
Member

@Jeffwan We use the same Kaniko builder for our Katib pre-submits:
https://github.com/kubeflow/katib/blob/master/test/workflows/components/workflows-v1beta1.libsonnet#L435-L440

Do you think an AWS cluster is necessary to publish images for the Kubeflow Training Operator?
For example, it's not easy to restart a post-submit job in AWS when it fails (for example, because of the Docker pull rate limit).
GitHub Actions, on the other hand, can be restarted by anyone who has write access to the repo.

@Jeffwan
Member Author

Jeffwan commented Aug 17, 2021

@andreyvelich

There are some tradeoffs.
From a code reusability perspective, I don't think we necessarily need to introduce a new stack. If we only need image building, we don't bring up an AWS cluster; it's a single kaniko pod running in the Argo cluster. The tricky part is that a post-submit job cannot be operated easily, even by maintainers. If there were a good way to restart a failed job, as with presubmit jobs, I would love it.

If we cannot overcome the problem above, GitHub Actions is preferred.

@andreyvelich
Member

Yes, you are correct: we don't deploy another cluster in post-submit, but we still use the Argo AWS cluster to run these Kaniko containers.

> If there's a good way to restart the failed job like presubmit jobs. I would love it.

I am not sure that will be easy.
Once an Argo Workflow is started, we mount a volume for it with the repo and test sources.
To restart it, we would have to deploy a new Argo Workflow with a new volume.

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
@stale stale bot closed this as completed Apr 16, 2022

3 participants