Kubeflow e2e example #116
Check out this pull request on ReviewNB. You'll be able to see the Jupyter notebook diff and discuss changes. Powered by ReviewNB.
Oh, it seems we were working on the same item. I had also worked on this, but then found this ticket.
/area example
@fenglixa do you want to add anything to this example?
No, it's OK with me. Thanks @Tomcli
@Tomcli if it's end-to-end, why do we still call it e2e-katib?
Sorry for the confusion. I renamed it to |
thanks
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: animeshsingh The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…eline-images Fix sample pipeline images
Which issue is resolved by this Pull Request:
Resolves #
Description of your changes:
This Kubeflow e2e example demonstrates how to use katib + tfjob + kfserving with volumeop using distributed training. To run this pipeline, make sure your cluster has at least 16 CPUs and 32GB of memory in total. Otherwise some jobs might not be able to run, because TFJob needs to run 4 TensorFlow pods in parallel for distributed training.
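The resource requirement above can be checked before launching the pipeline. The following is a minimal sketch, not part of this PR: it assumes the node list comes from `kubectl get nodes -o json`, sums each node's `status.allocatable` CPU and memory, and treats "32GB" as 32 GiB. The helper names (`parse_cpu`, `parse_memory`, `cluster_has_capacity`) are hypothetical. Note that allocatable capacity is an upper bound; it does not subtract resources already requested by running pods.

```python
# Hypothetical pre-flight check; names and thresholds are illustrative assumptions.

# Suffixes used by Kubernetes memory quantities (binary first, then decimal).
_MEM_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity):
    """Convert a CPU quantity such as '4' or '3920m' to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity):
    """Convert a memory quantity such as '16Gi' or '32929752Ki' to bytes."""
    for suffix, factor in _MEM_UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)  # plain byte count, no suffix


def cluster_has_capacity(node_list, min_cpu=16, min_memory_gib=32):
    """Return True if the summed allocatable CPU and memory meet the minimums.

    node_list is the parsed JSON of `kubectl get nodes -o json`.
    Assumption: the description's "32GB" is interpreted as 32 GiB.
    """
    nodes = node_list["items"]
    cpu = sum(parse_cpu(n["status"]["allocatable"]["cpu"]) for n in nodes)
    mem = sum(parse_memory(n["status"]["allocatable"]["memory"]) for n in nodes)
    return cpu >= min_cpu and mem >= min_memory_gib * 2**30
```

To use it, save the output of `kubectl get nodes -o json` to a file, load it with `json.load`, and pass the resulting dictionary to `cluster_has_capacity`.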
Environment tested:
python --version: 3.6.4
tkn version: 0.11.3
kubectl version: 1.16
/etc/os-release: