Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add E2E test for gang-scheduling #1736

Merged
merged 3 commits into from
Jan 24, 2023

Conversation

tenzen-y
Copy link
Member

@tenzen-y tenzen-y commented Jan 22, 2023

Signed-off-by: Yuki Iwai yuki.iwai.tz@gmail.com

What this PR does / why we need it:
I added the E2E test for gang-scheduling.
Also, I added functions to apply patches to CustomJob resources to Python SDK.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Follow up on #1724.

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Jan 22, 2023

Pull Request Test Coverage Report for Build 3994952641

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 9 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.06%) to 39.095%

Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/pytorch/pytorchjob_controller.go 9 58.16%
Totals Coverage Status
Change from base Build 3980850357: 0.06%
Covered Lines: 2687
Relevant Lines: 6873

💛 - Coveralls

@tenzen-y tenzen-y force-pushed the add-coscheduling-e2e branch 5 times, most recently from db995b5 to a976ca6 Compare January 22, 2023 15:10
@tenzen-y tenzen-y mentioned this pull request Jan 22, 2023
1 task
@tenzen-y
Copy link
Member Author

tenzen-y commented Jan 22, 2023

Blocked by #1737.

/hold

@tenzen-y tenzen-y changed the title [WIP] Add E2E test for Scheduler Plugins with coscheduling [WIP] Add E2E test for gang-scheduling Jan 22, 2023
@tenzen-y tenzen-y force-pushed the add-coscheduling-e2e branch 2 times, most recently from 306e607 to 9a7a3ac Compare January 22, 2023 16:08
Comment on lines 1434 to -1331
Args:
namespace: Namespace to list the MPIJobs.
timeout: Optional, Kubernetes API server timeout in seconds
to execute the request.

Returns:
list[KubeflowOrgV1MPIJob]: List of MPIJobs objects. It returns
empty list if MPIJobs cannot be found.
timeout: Optional, Kubernetes API server timeout in seconds
to execute the request.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout seems to be Args, not Returns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@tenzen-y tenzen-y changed the title [WIP] Add E2E test for gang-scheduling Add E2E test for gang-scheduling Jan 22, 2023
Comment on lines +5 to +7
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
Copy link
Member Author

@tenzen-y tenzen-y Jan 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding these parameters, when we push commits to the same PR, the already running Jobs are forcefully stopped, and new Jobs are started.

So we can reduce CI run times.

@tenzen-y tenzen-y force-pushed the add-coscheduling-e2e branch 3 times, most recently from 30afcaa to 977d2f9 Compare January 22, 2023 18:48
@tenzen-y
Copy link
Member Author

The mxjob-controller watches Pods created by other FrameworkJob resources and updates MXJob Status :(

@tenzen-y
Copy link
Member Author

The mxjob-controller watches Pods created by other FrameworkJob resources and updates MXJob Status :(

Ah, this is my misunderstanding.

@tenzen-y tenzen-y changed the title Add E2E test for gang-scheduling [WIP] Add E2E test for gang-scheduling Jan 22, 2023
@tenzen-y tenzen-y force-pushed the add-coscheduling-e2e branch 2 times, most recently from 50c3004 to 4f4c0da Compare January 22, 2023 21:00
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y tenzen-y changed the title [WIP] Add E2E test for gang-scheduling Add E2E test for gang-scheduling Jan 22, 2023
@tenzen-y
Copy link
Member Author

tenzen-y commented Jan 22, 2023

@johnugeorge This PR is ready for review, and all tests were passed. Please take a look.

@tenzen-y
Copy link
Member Author

/hold cancel

jobs:
integration-test:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.23.12", "v1.24.6", "v1.25.2"]
# TODO (tenzen-y): Add volcano.
gang-scheduler-name: ["none", "scheduler-plugins"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea to extend to volcano in the future

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should run E2E with volcano. Although, we can follow up on other PRs.
So, I will create an issue to keep tracking this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #1738.

@johnugeorge
Copy link
Member

Great work @tenzen-y for adding this e2e-test

/cc @andreyvelich PTAL at SDK changes

@google-oss-prow
Copy link

@johnugeorge: GitHub didn't allow me to request PR reviews from the following users: PTAL, at, SDK, changes.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Great work @tenzen-y for adding this e2e-test

/cc @andreyvelich PTAL at SDK changes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this feature @tenzen-y !
I left few comments.

sdk/python/test/e2e/test_e2e_tfjob.py Show resolved Hide resolved
sdk/python/kubeflow/training/api/training_client.py Outdated Show resolved Hide resolved
Comment on lines 1434 to -1331
Args:
namespace: Namespace to list the MPIJobs.
timeout: Optional, Kubernetes API server timeout in seconds
to execute the request.

Returns:
list[KubeflowOrgV1MPIJob]: List of MPIJobs objects. It returns
empty list if MPIJobs cannot be found.
timeout: Optional, Kubernetes API server timeout in seconds
to execute the request.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

sdk/python/kubeflow/training/utils/utils.py Outdated Show resolved Hide resolved
sdk/python/kubeflow/training/utils/utils.py Outdated Show resolved Hide resolved
sdk/python/test/e2e/constants.py Show resolved Hide resolved
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@andreyvelich
Copy link
Member

Thanks for the updates @tenzen-y!
/lgtm
/assign @johnugeorge @terrytangyuan

@johnugeorge
Copy link
Member

Thanks @tenzen-y
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 307f17a into kubeflow:master Jan 24, 2023
@tenzen-y tenzen-y deleted the add-coscheduling-e2e branch January 24, 2023 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants