Add Python SDK for Kubeflow Training Operator #1420

Merged
merged 9 commits into from Oct 3, 2021

Conversation

alembiewski
Member

@alembiewski alembiewski commented Sep 22, 2021

Resolves #1380.
Depends on #1389.

Drafts the initial proposal for a Python SDK for the Kubeflow Training Operator by combining the existing TFJob and PyTorchJob SDKs into a single SDK and using updated model classes produced by the OpenAPI generator.

Changes summary:

  • The Python SDK has been generated with the updated tooling from "Update scripts to generate sdk for all frameworks" (#1389).
  • PyTorchJobClient has been copied from the kubeflow/pytorch-operator repo.
  • Introduces a new tool, hack/python-sdk/post_gen.py, that fixes imports in the generated model test cases and can be extended for other post-generation modifications.
  • Reduces code duplication by merging the utils and constants modules.
  • Adds (auto-generated) model tests and updates the e2e tests.
  • Adds support for the latest Python Kubernetes client.
  • The PyTorchJob notebook example has been copied to sdk/python/examples, and the example notebooks have been updated to reflect the changes in API and package names.
  • Updates the docs.
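
For readers unfamiliar with the post-generation step, the import fix-up described in the list above could look roughly like the sketch below. It is not the contents of hack/python-sdk/post_gen.py: the rewrite rules, the function names (`fix_imports`, `fix_imports_in_tree`), and the assumption that the generator emits bare `training` imports in the test files are all illustrative.

```python
import re
from pathlib import Path

# Hypothetical rewrite rules (pattern, replacement); the real script's rules may differ.
REPLACEMENTS = [
    (re.compile(r"^import training$", re.MULTILINE), "from kubeflow import training"),
    (re.compile(r"^from training(?=[.\s])", re.MULTILINE), "from kubeflow.training"),
]


def fix_imports(source: str) -> str:
    """Apply every rewrite rule to the text of one generated file."""
    for pattern, replacement in REPLACEMENTS:
        source = pattern.sub(replacement, source)
    return source


def fix_imports_in_tree(root: str) -> int:
    """Rewrite all test_*.py files under root; return how many files changed."""
    changed = 0
    for path in Path(root).rglob("test_*.py"):
        original = path.read_text()
        updated = fix_imports(original)
        if updated != original:
            path.write_text(updated)
            changed += 1
    return changed
```

Anchoring the patterns to the start of a line keeps the rewrite from touching unrelated names that merely contain the word `training`.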

Note for the reviewers

The following files are autogenerated and could be skipped during the review:

Observations & Questions

  • External model attributes are not generated properly (e.g. from Kubernetes Python client): K8sIoApimachineryPkgApisMetaV1ObjectMeta, K8sIoApimachineryPkgApisMetaV1ListMeta etc. Is this something that could be fixed by updating the generator configuration?
  • How are the e2e tests and unit tests executed for the SDK?

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@aws-kf-ci-bot
Contributor

Hi @alembiewski. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Jeffwan
Member

Jeffwan commented Sep 23, 2021

Thanks for the contribution. I will take some time to review it today. With this one, we no longer need this PR: https://github.com/kubeflow/tf-operator/pull/1389/files

@alembiewski
Member Author

alembiewski commented Sep 23, 2021

@Jeffwan, this PR doesn't include the tooling updates for the OpenAPI generator. I used #1389 to generate the Python SDK, so I think the tooling updates should still be merged; that way the scripts are up to date and we can regenerate the SDK when the API changes.

sdk/python/docs/V1TFJobList.md (outdated, resolved)
@@ -0,0 +1,533 @@
{
Member

Thanks for adding this example

sdk/python/kubeflow/training/__init__.py (resolved)
PYTORCH_LOGLEVEL = os.environ.get('PYTORCHJOB_LOGLEVEL', 'INFO').upper()

# PyTorchJob Label Names
PYTORCHJOB_CONTROLLER_LABEL = 'controller-name'
Member

Are the other frameworks' constants not populated?

Member

Was this file added by us for the e2e test cases? I notice mxnet and xgboost are missing. If we don't have enough time to write test cases for the other frameworks, that's totally fine. Let's leave a TODO there?

Member Author

It's also used in client methods:
https://github.com/mesosphere/tf-operator/blob/75734f854a5b35715f068327d69de6185396b5d0/sdk/python/kubeflow/training/api/tf_job_client.py#L122-L125
I will add a comment, sure. The reason why I didn't add similar constants for mxnet and xgboost is that there are no clients for these frameworks currently implemented. Adding support for two more frameworks to the SDK is out of scope for this PR and should be addressed separately IMO.

hack/python-sdk/post_gen.py (resolved)
@Jeffwan
Member

Jeffwan commented Sep 23, 2021

@alexlatchford Really appreciate your work! Please check above comments

@Jeffwan
Member

Jeffwan commented Sep 23, 2021

/cc @kubeflow/wg-training-leads

@google-oss-robot google-oss-robot requested a review from a team September 23, 2021 07:33
@alembiewski alembiewski changed the title Add Python SDK for Kubeflow Training Operator [WIP] Add Python SDK for Kubeflow Training Operator Sep 23, 2021
@Jeffwan
Member

Jeffwan commented Sep 29, 2021

/ok-to-test

@alembiewski
Member Author

alembiewski commented Sep 29, 2021

@Jeffwan, the test_sdk_e2e test in test.e2e.test_e2e_pytorchjob is failing with the following error. Could you help figure out why? I'm not sure how I can troubleshoot it on the test cluster.

>           raise RuntimeError("Not found Pods of the PyTorchJob {} "
                               "in namespace {}".format(name, namespace))
E           RuntimeError: Not found Pods of the PyTorchJob pytorchjob-mnist-ci-test in namespace default

sdk/python/kubeflow/training/api/py_torch_job_client.py:384: RuntimeError

It seems the job completed, but the client could not find a pod to fetch the logs from.
I also ran the tests on my own cluster, and they all pass:

(.venv) ➜  tf-operator git:(update-sdk) ✗ pytest sdk/python/test                                                                                                                            
======================================================================================================== test session starts =========================================================================================================
platform darwin -- Python 3.8.9, pytest-4.6.11, py-1.10.0, pluggy-0.13.1
Using --randomly-seed=1632920749
rootdir: /Users/.../go/src/github.com/mesosphere/tf-operator/sdk/python
plugins: cov-2.12.1, randomly-1.2.3
collected 20 items

sdk/python/test/test_v1_run_policy.py .                                                                                                                                                                                        [  5%]
sdk/python/test/e2e/test_e2e_tfjob.py .                                                                                                                                                                                        [ 10%]
sdk/python/test/test_v1_xg_boost_job.py .                                                                                                                                                                                      [ 15%]
sdk/python/test/test_v1_replica_status.py .                                                                                                                                                                                    [ 20%]
sdk/python/test/test_v1_py_torch_job.py .                                                                                                                                                                                      [ 25%]
sdk/python/test/test_v1_replica_spec.py .                                                                                                                                                                                      [ 30%]
sdk/python/test/test_v1_py_torch_job_list.py .                                                                                                                                                                                 [ 35%]
sdk/python/test/test_v1_scheduling_policy.py .                                                                                                                                                                                 [ 40%]
sdk/python/test/test_v1_tf_job_list.py .                                                                                                                                                                                       [ 45%]
sdk/python/test/test_v1_py_torch_job_spec.py .                                                                                                                                                                                 [ 50%]
sdk/python/test/test_v1_xg_boost_job_list.py .                                                                                                                                                                                 [ 55%]
sdk/python/test/e2e/test_e2e_pytorchjob.py .                                                                                                                                                                                   [ 60%]
sdk/python/test/test_v1_tf_job.py .                                                                                                                                                                                            [ 65%]
sdk/python/test/test_v1_xg_boost_job_spec.py .                                                                                                                                                                                 [ 70%]
sdk/python/test/test_v1_mx_job_spec.py .                                                                                                                                                                                       [ 75%]
sdk/python/test/test_v1_job_condition.py .                                                                                                                                                                                     [ 80%]
sdk/python/test/test_v1_tf_job_spec.py .                                                                                                                                                                                       [ 85%]
sdk/python/test/test_v1_mx_job.py .                                                                                                                                                                                            [ 90%]
sdk/python/test/test_v1_mx_job_list.py .                                                                                                                                                                                       [ 95%]
sdk/python/test/test_v1_job_status.py .                                                                                                                                                                                        [100%]

========================================================================================================== warnings summary ==========================================================================================================
sdk/python/.venv/lib/python3.8/site-packages/table_logger/table_logger.py:80
  /Users/.../.../tf-operator/sdk/python/.venv/lib/python3.8/site-packages/table_logger/table_logger.py:80: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    np.float: fmt.FloatFormatter,

sdk/python/.venv/lib/python3.8/site-packages/table_logger/table_logger.py:86
   /Users/.../go/src/github.com/mesosphere/tf-operator/sdk/python/.venv/lib/python3.8/site-packages/table_logger/table_logger.py:86: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    np.int: fmt.IntegerFormatter,

-- Docs: https://docs.pytest.org/en/latest/warnings.html
============================================================================================== 20 passed, 2 warnings in 308.06 seconds ===============================================================================================
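
If the failure is a timing issue (pods not yet listed, or already cleaned up, when the client looks for them), one possible mitigation is to poll for the pods instead of failing on the first empty result. The sketch below is an assumption about how that could be done, not code from py_torch_job_client.py; `wait_for_pod_names` and its parameters are hypothetical, and the pod lookup is passed in as a callable so the helper does not depend on a cluster.

```python
import time
from typing import Callable, List


def wait_for_pod_names(
    list_pod_names: Callable[[], List[str]],
    timeout_seconds: float = 60.0,
    polling_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call list_pod_names until it returns a non-empty list or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        names = list_pod_names()
        if names:
            return names
        if time.monotonic() >= deadline:
            raise RuntimeError(
                "No pods found within {} seconds".format(timeout_seconds))
        sleep(polling_interval)
```

With the Kubernetes Python client, the callable could wrap `CoreV1Api().list_namespaced_pod(namespace, label_selector=...)` and return `[pod.metadata.name for pod in result.items]`; the label selector to use depends on how the operator labels its pods.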

@alembiewski
Member Author

@Jeffwan, I updated the tooling for SDK generation and addressed the comments. Could you please take another look at the changes?

@Jeffwan
Member

Jeffwan commented Sep 29, 2021

@Jeffwan, test.e2e.test_e2e_pytorchjob: test_sdk_e2e test is failing with the following error, could you help to figure out why? Not sure how can I troubleshoot that on the test cluster.

>           raise RuntimeError("Not found Pods of the PyTorchJob {} "
                               "in namespace {}".format(name, namespace))
E           RuntimeError: Not found Pods of the PyTorchJob pytorchjob-mnist-ci-test in namespace default

sdk/python/kubeflow/training/api/py_torch_job_client.py:384: RuntimeError

It seems like the job has been completed, but it wasn't able to find a pod to fetch the logs. I also tried to run the test on my cluster, all pass:

Let me check the failure. If you can run it successfully in your local env, it could be a flaky one.

/retest

@alembiewski alembiewski changed the title [WIP] Add Python SDK for Kubeflow Training Operator Add Python SDK for Kubeflow Training Operator Sep 29, 2021
@Jeffwan
Member

Jeffwan commented Sep 30, 2021

The SDK test case passes; the clean-up pod policy test is the flaky one.

/test kubeflow-tf-operator-presubmit

@Jeffwan
Member

Jeffwan commented Sep 30, 2021

The PR looks good to me. Please double check it.

/cc @andreyvelich @kubeflow/wg-training-leads

@alembiewski
Member Author

@Jeffwan, all tests are green now - I improved attribute checks in k8s_util.py, hoping this will make the test less flaky.
After this PR is merged, we should probably think about publishing it to PyPI, maybe as a part of the 1.3.0 release?
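
The kind of attribute check mentioned above can be illustrated with a small helper that walks nested object fields and tolerates missing or `None` values instead of raising `AttributeError`. This is a guess at the general idea, not the actual change in k8s_util.py; `get_nested_attr` is a hypothetical name, and `SimpleNamespace` stands in for a Kubernetes model object.

```python
from types import SimpleNamespace
from typing import Any


def get_nested_attr(obj: Any, *names: str, default: Any = None) -> Any:
    """Follow a chain of attributes, returning default if any link is missing or None."""
    current = obj
    for name in names:
        current = getattr(current, name, None)
        if current is None:
            return default
    return current


# Stand-in for a pod whose status has not been populated yet.
pod = SimpleNamespace(metadata=SimpleNamespace(name="worker-0"), status=None)
pod_name = get_nested_attr(pod, "metadata", "name")
pod_phase = get_nested_attr(pod, "status", "phase", default="Unknown")
```

Here `pod_phase` falls back to the default because `status` is `None`, which is the situation a test can hit when it inspects a pod too early.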

Member

@andreyvelich andreyvelich left a comment

Thanks a lot for updating this @alembiewski!
I left a few comments.


To generate Python SDK for the operator, run:
```
.hack/python-sdk/gen-sdk.sh
```
Member

Suggested change
.hack/python-sdk/gen-sdk.sh
./hack/python-sdk/gen-sdk.sh

"from kubeflow.training import V1PyTorchJob\n",
"from kubeflow.training import V1PyTorchJobSpec\n",
"from kubeflow.training import V1RunPolicy\n",
"from kubeflow.training.api.py_torch_job_client import PyTorchJobClient"
Member

@alembiewski @Jeffwan Do we want to import this client in __init__.py, similar to the Katib __init__.py?

Then users can just run `from kubeflow.training import PyTorchJobClient` or `from kubeflow.training import TFJobClient`.

Member

I don't have strong opinions on this. WDYT @alembiewski?

Member Author

Sounds reasonable. I updated the post_gen.py script to add the imports automatically.
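
A post-generation step that adds the client imports could be sketched as follows. The two import lines match the module paths used elsewhere in this PR; the function name `add_client_imports` and the append-if-missing behaviour are assumptions, not the actual post_gen.py code.

```python
# Import lines to expose the hand-written clients from the package root.
CLIENT_IMPORTS = [
    "from kubeflow.training.api.tf_job_client import TFJobClient",
    "from kubeflow.training.api.py_torch_job_client import PyTorchJobClient",
]


def add_client_imports(init_source: str) -> str:
    """Append any client import that is not already present in __init__.py text."""
    existing = set(init_source.splitlines())
    missing = [line for line in CLIENT_IMPORTS if line not in existing]
    if not missing:
        return init_source
    # Make sure the appended imports start on their own line.
    if init_source and not init_source.endswith("\n"):
        init_source += "\n"
    return init_source + "\n".join(missing) + "\n"
```

Because the generator rewrites `__init__.py` on every run, making the step idempotent means it can be re-applied after each regeneration without duplicating lines.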

@@ -46,14 +46,36 @@ Class | Method | Description
[TFJobClient](docs/TFJobClient.md) | [is_job_succeeded](docs/TFJobClient.md#is_job_succeeded) | Check if the TFJob status is Succeeded |
[TFJobClient](docs/TFJobClient.md) | [get_pod_names](docs/TFJobClient.md#get_pod_names) | Get pod names of TFJob |
[TFJobClient](docs/TFJobClient.md) | [get_logs](docs/TFJobClient.md#get_logs) | Get training logs of the TFJob |
[PyTorchJobClient](docs/PyTorchJobClient.md) | [create](docs/PyTorchJobClient.md#create) | Create PyTorchJob|
Member

I believe the client docs were deleted by the generator. Should we modify our script so that it doesn't delete docs/PyTorchJobClient.md and docs/TFJobClient.md during an SDK generator run?

Member Author

Thanks for catching that, added client docs back

sdk/python/kubeflow/training/api/py_torch_job_client.py (outdated, resolved)
:return: True or False
"""
pytorchjob_status = self.get_job_status(name, namespace=namespace)
return pytorchjob_status.lower() == "succeeded"
Member

Move this status to constants?

Member Author

done
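
Moving the status strings into constants might look like this. The values `Succeeded` and `Failed` come from the snippets under review; the constant and helper names are hypothetical and may not match what was added to the SDK's constants module.

```python
# Hypothetical constant names; the string values are taken from the reviewed code.
JOB_STATUS_SUCCEEDED = "Succeeded"
JOB_STATUS_FAILED = "Failed"
TERMINAL_JOB_STATUSES = frozenset({JOB_STATUS_SUCCEEDED, JOB_STATUS_FAILED})


def is_job_succeeded(status: str) -> bool:
    """Case-insensitive check, mirroring the .lower() comparison in the client."""
    return status.lower() == JOB_STATUS_SUCCEEDED.lower()


def is_job_finished(status: str) -> bool:
    """True once the job has reached a terminal status (exact match)."""
    return status in TERMINAL_JOB_STATUSES
```

Keeping both clients on the same constants avoids the two status checks drifting apart if a status string ever changes.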

tbl(tfjob_name, status, update_time)

if name == tfjob_name:
if status == 'Succeeded' or status == 'Failed':
Member

Similar comment about status.

sdk/python/kubeflow/training/api/tf_job_client.py (outdated, resolved)
@@ -0,0 +1,60 @@
# Copyright 2020 The Kubeflow Authors.
Member

Suggested change
# Copyright 2020 The Kubeflow Authors.
# Copyright 2021 The Kubeflow Authors.

sdk/python/kubeflow/training/utils/utils.py (outdated, resolved)
description="TFJob Python SDK",
long_description="TFJob Python SDK",
description="Training Operator Python SDK",
long_description="Training Operator Python SDK",
packages=setuptools.find_packages(
include=("kubeflow*")),
package_data={},
Member

Should we drop support for Python < 3?

Member Author

Done

@alembiewski
Member Author

@Jeffwan, @andreyvelich, thanks for the review! I addressed all comments and suggestions, PTAL

@andreyvelich
Member

Thank you for updating this @alembiewski!
/lgtm
I think we should also publish this SDK to PyPI once we are ready.

/cc @kubeflow/wg-training-leads

@Jeffwan
Member

Jeffwan commented Oct 3, 2021

/approve

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Development

Successfully merging this pull request may close these issues.

Python SDK for Kubeflow Training Operator
5 participants