Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659

andreyvelich
Member

@andreyvelich andreyvelich commented Sep 12, 2022

Fixes: kubeflow/common#66.
Inspired by KFP create_component_from_func.

These APIs allow users to create a TFJob or PyTorchJob without building an image.
This is the first small step toward simplifying our Kubeflow SDKs and hiding Kubernetes complexity.

Later we can extend this functionality to support other Job types and expose more spec options via the APIs. We might also consider using a single TrainingClient() (instead of a separate client for each Job) to reduce code and improve UX.

I used:

  • docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime for PyTorch base image.
  • docker.io/tensorflow/tensorflow:2.9.1 for TensorFlow base image.
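
For readers unfamiliar with the "create from function" pattern, here is a minimal sketch of the general idea, not the code from this PR: take the source of a user function, append a call to it, and run the result with `python3 -c` inside a stock base image. The helper names `render_script` and `container_from_func` and the dict layout are hypothetical; only the two image strings come from the description above.

```python
import inspect
import textwrap

# Base images quoted in the PR description.
PYTORCHJOB_BASE_IMAGE = "docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime"
TFJOB_BASE_IMAGE = "docker.io/tensorflow/tensorflow:2.9.1"


def render_script(source, func_name, parameters=None):
    """Hypothetical helper: turn function source into a standalone script."""
    # Dedent so a function defined inside a class or notebook cell still
    # parses as top-level code.
    body = textwrap.dedent(source)
    # Call the function at the end, passing parameters as keyword arguments.
    # This assumes the parameter values have a repr() that is valid Python.
    call = f"{func_name}(**{parameters!r})" if parameters else f"{func_name}()"
    return f"{body}\n{call}\n"


def container_from_func(func, base_image, container_name, parameters=None):
    """Hypothetical helper: build a container spec (plain dict) that runs func."""
    # inspect.getsource needs the function to be defined in a file or cell
    # whose source is still available.
    script = render_script(inspect.getsource(func), func.__name__, parameters)
    return {
        "name": container_name,
        "image": base_image,
        "command": ["python3", "-c", script],
    }
```

Because only the function's own source is shipped, any imports the function needs would have to live inside the function body, and extra packages would have to be installed in the container before the script runs.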

cc @kubeflow/wg-training-leads @tenzen-y @anencore94 @ca-scribner Please give your feedback on the API design.

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@coveralls

coveralls commented Sep 12, 2022

Pull Request Test Coverage Report for Build 3068518611

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 7 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.07%) to 39.751%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller.v1/pytorch/master.go | 1 | 91.3% |
| pkg/controller.v1/pytorch/initcontainer.go | 6 | 80.0% |

Totals (Coverage Status)
  • Change from base Build 2973439690: -0.07%
  • Covered Lines: 2327
  • Relevant Lines: 5854

💛 - Coveralls

@andreyvelich
Member Author

/hold for the review

@andreyvelich andreyvelich requested a review from a team September 14, 2022 11:34
Member

@terrytangyuan terrytangyuan left a comment

Thanks!

/lgtm

Member

@tenzen-y tenzen-y left a comment

@andreyvelich Thanks for this awesome work, and sorry for the late review.
I left a few comments.

sdk/python/setup.py (outdated, resolved)
sdk/python/kubeflow/training/api/py_torch_job_client.py (outdated, resolved)
sdk/python/kubeflow/training/constants/constants.py (outdated, resolved)
TFJOB_LOGLEVEL = os.environ.get('TFJOB_LOGLEVEL', 'INFO').upper()
TFJOB_LOGLEVEL = os.environ.get("TFJOB_LOGLEVEL", "INFO").upper()

TFJOB_BASE_IMAGE = "docker.io/tensorflow/tensorflow:2.9.1"
Member

It might be nice to default to an image with GPU support, the same as PYTORCHJOB_BASE_IMAGE.
WDYT?

Member Author

@tenzen-y I think the problem is that the TensorFlow GPU image is 2.2 GB larger than the CPU-only image: https://hub.docker.com/r/tensorflow/tensorflow/tags?page=1&name=2.9.1.
Maybe we can introduce TFJOB_BASE_IMAGE and TFJOB_BASE_IMAGE_GPU in our SDK, what do you think?

Member Author

What are your thoughts on that, @johnugeorge @tenzen-y @anencore94?

Member

I also think introducing both CPU and GPU images is nice to have, but setting the CPU image as the default would be safer.

Member

sgtm

In that case, we can also introduce PYTORCHJOB_BASE_IMAGE and PYTORCHJOB_BASE_IMAGE_GPU.

Member Author

@tenzen-y I wasn't able to find an official CPU-only PyTorch image: https://hub.docker.com/r/pytorch/pytorch/tags
Are you aware of any?

Member

Sorry, I assumed it existed...

Member Author

@tenzen-y @anencore94 I've added the TensorFlow GPU image to the constants.

Member

Thanks for updating!

Member

Thanks a lot!
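
The outcome of this thread (CPU image as the default, GPU image as an explicit opt-in) can be sketched roughly as follows. This illustrates the design choice rather than the merged code: the constant names are the ones proposed in the thread, the `-gpu` tag is an assumption based on the Docker Hub tag naming for the same TensorFlow version, and `select_tf_base_image` is a hypothetical helper.

```python
# CPU image quoted in the diff under review.
TFJOB_BASE_IMAGE = "docker.io/tensorflow/tensorflow:2.9.1"
# Assumed GPU counterpart; the exact tag used in the SDK is not shown here.
TFJOB_BASE_IMAGE_GPU = "docker.io/tensorflow/tensorflow:2.9.1-gpu"


def select_tf_base_image(use_gpu=False):
    """Hypothetical helper: default to the smaller CPU image."""
    # The GPU image is considerably larger, so it is used only on request.
    return TFJOB_BASE_IMAGE_GPU if use_gpu else TFJOB_BASE_IMAGE
```

Keeping the CPU image as the default avoids pulling the larger GPU image on clusters that have no GPUs.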

sdk/python/kubeflow/training/utils/utils.py (outdated, resolved)
sdk/python/kubeflow/training/utils/utils.py (outdated, resolved)
sdk/python/kubeflow/training/constants/constants.py (outdated, resolved)
@google-oss-prow google-oss-prow bot removed the lgtm label Sep 15, 2022
Add Client info in example
@johnugeorge
Member

Thanks @andreyvelich for this work. This is a good story to start.

/lgtm

@johnugeorge
Member

/hold for LGTM from @tenzen-y

@google-oss-prow google-oss-prow bot removed the lgtm label Sep 15, 2022
Member

@tenzen-y tenzen-y left a comment

@andreyvelich Thanks for the awesome work!
/lgtm

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@johnugeorge
Member

Ready to get merged

@google-oss-prow google-oss-prow bot removed the lgtm label Sep 16, 2022
@andreyvelich
Member Author

@johnugeorge @tenzen-y @anencore94 thanks a lot for the review!
I believe all comments have been addressed.

If you are ok, we can merge this PR.

@tenzen-y
Member

I'm ok.
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Sep 16, 2022
@anencore94
Member

Thanks a lot for introducing such a useful feature!
/lgtm

@andreyvelich
Member Author

Thanks everyone!
/hold cancel

@google-oss-prow google-oss-prow bot merged commit 8c9b33c into kubeflow:master Sep 16, 2022
@andreyvelich andreyvelich deleted the update-pytorch-sdk-create-from-func branch September 16, 2022 18:02
Development

Successfully merging this pull request may close these issues.

Allow user to submit jobs using Git repo without building container images
6 participants