Kubeflow e2e example #116

Merged
merged 2 commits into from
May 1, 2020

Member

@Tomcli Tomcli commented Apr 24, 2020

Which issue is resolved by this Pull Request:
Resolves #

Description of your changes:

This Kubeflow e2e example demonstrates how to use Katib, TFJob, and KFServing together with VolumeOp for distributed training. To run this pipeline, make sure your cluster has at least 16 CPUs and 32 GB of memory in total. Otherwise, some jobs might not be able to run, because TFJob needs to run 4 TensorFlow pods in parallel for distributed training.
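
As a rough illustration of the cluster-size requirement above, the following sketch sums node capacity and compares it with the 16 CPU / 32 GB minimum. It is not part of this PR: the function names and the node dictionary layout are hypothetical, "32GB" is assumed to mean 32 GiB, and only the binary memory suffixes (`Ki`, `Mi`, `Gi`, `Ti`) are handled. The input strings are assumed to follow the Kubernetes quantity format, as shown for example by `kubectl describe nodes`.

```python
# Minimums taken from the PR description; treating "32GB" as 32 GiB is an assumption.
REQUIRED_CPU = 16.0
REQUIRED_MEM_GIB = 32.0

# Binary suffixes used by Kubernetes memory quantities (decimal suffixes are not handled).
_MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity):
    """Convert a CPU quantity such as "4" or "3900m" (millicores) to cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def parse_mem_gib(quantity):
    """Convert a memory quantity such as "16Gi", "16384Mi", or plain bytes to GiB."""
    for suffix, factor in _MEM_UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[:-2]) * factor / 2**30
    return float(quantity) / 2**30


def cluster_is_large_enough(nodes):
    """Return True if the summed node capacity meets both minimums.

    `nodes` is a hypothetical list of dicts like {"cpu": "8", "memory": "16Gi"}.
    """
    total_cpu = sum(parse_cpu(n["cpu"]) for n in nodes)
    total_mem = sum(parse_mem_gib(n["memory"]) for n in nodes)
    return total_cpu >= REQUIRED_CPU and total_mem >= REQUIRED_MEM_GIB
```

For example, two nodes with 8 CPUs and 16Gi each would pass this check, while four nodes with 3900m CPU each would fall short on CPU. Total capacity is only a necessary condition: the 4 TensorFlow pods must also fit on individual nodes alongside other workloads.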

Environment tested:

  • Python Version (use python --version): 3.6.4
  • Tekton Version (use tkn version): 0.11.3
  • Kubernetes Version (use kubectl version): 1.16
  • OS (e.g. from /etc/os-release):

@kubeflow-bot

This change is Reviewable

@review-notebook-app

Check out this pull request on ReviewNB

You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB.

@fenglixa
Member

Oh, it seems we are working on the same item. I had also worked on this item, but then found this ticket.

@ckadner
Member

ckadner commented Apr 29, 2020

/area example

@animeshsingh
Collaborator

@Tomcli @fenglixa are we good to go with this?

@Tomcli
Member Author

Tomcli commented Apr 30, 2020

@fenglixa do you want to add anything to this example?

@fenglixa
Member

fenglixa commented May 1, 2020

No, it's OK with me. Thanks @Tomcli
/lgtm

Collaborator

@animeshsingh animeshsingh left a comment

@Tomcli if it's end-to-end, why do we still call it e2e-katib?

@k8s-ci-robot k8s-ci-robot removed the lgtm label May 1, 2020
@Tomcli
Member Author

Tomcli commented May 1, 2020

Sorry for the confusion. I renamed it to e2e-mnist to align with the Kubeflow pipeline naming.

@animeshsingh
Collaborator

thanks
/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: animeshsingh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 2fd278a into kubeflow:master May 1, 2020
@Tomcli Tomcli deleted the katib-e2e branch May 20, 2020 17:20
HumairAK referenced this pull request in red-hat-data-services/data-science-pipelines-tekton May 16, 2023
6 participants