Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PodGroup as controller watch source #1666

Merged

Conversation

ggaaooppeenngg
Copy link
Contributor

The common.ReconcileJobs stops creating pods when the related PodGroup is unschedulable. When the PodGroup becomes schedulable, the reconcile loop can not be triggered because of no watch source for the PodGroup.

Signed-off-by: Peng Gao peng.gao.dut@gmail.com

@gaocegege
Copy link
Member

I used to thought it should be already fixed.

/lgtm

/cc @zw0610

Copy link
Member

@shinytang6 shinytang6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe other job controllers(pytorch/tf/...) also need this?

@ggaaooppeenngg
Copy link
Contributor Author

There seems an error. How to install the PodGroup CRD in the test environment?

@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Sep 28, 2022
@coveralls
Copy link

coveralls commented Sep 28, 2022

Pull Request Test Coverage Report for Build 3271153965

  • 9 of 66 (13.64%) changed or added relevant lines in 5 files are covered.
  • 12 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.4%) to 39.527%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controller.v1/mpi/mpijob_controller.go 3 10 30.0%
pkg/controller.v1/pytorch/pytorchjob_controller.go 3 14 21.43%
pkg/controller.v1/tensorflow/tfjob_controller.go 3 14 21.43%
pkg/controller.v1/mxnet/mxjob_controller.go 0 14 0.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go 0 14 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/mxnet/mxjob_controller.go 1 0%
pkg/controller.v1/mpi/mpijob_controller.go 2 77.39%
pkg/controller.v1/pytorch/pytorchjob_controller.go 9 60.32%
Totals Coverage Status
Change from base Build 3132871820: -0.4%
Covered Lines: 2340
Relevant Lines: 5920

💛 - Coveralls

The common.ReconcileJobs stops creating pods when the related
PodGroup is unschedulable. When the PodGroup becomes schedulable,
the reconcile loop can not be triggered because of no watch source
for the PodGroup.

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
@johnugeorge
Copy link
Member

You can try to invoke this script locally for debugging

https://github.com/kubeflow/training-operator/blob/master/.github/workflows/integration-tests.yaml#L39

Continue without watching the PodGroup is the PodGroup
is not installed.

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
@ggaaooppeenngg
Copy link
Contributor Author

It's not easy to manage an external crd. I skip the watching if the crd is not found.

@ggaaooppeenngg ggaaooppeenngg requested review from shinytang6 and removed request for terrytangyuan, gaocegege, jinchihe and zw0610 October 17, 2022 09:01
@ggaaooppeenngg
Copy link
Contributor Author

Need a review @shinytang6

Copy link
Member

@shinytang6 shinytang6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, these changes are ok to me.

_, err = mgr.GetRESTMapper().RESTMapping(schema.GroupKind{Group: v1beta1.SchemeGroupVersion.Group, Kind: "PodGroup"},
v1beta1.SchemeGroupVersion.Version)
if err == nil {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra blank line

_, err = mgr.GetRESTMapper().RESTMapping(schema.GroupKind{Group: v1beta1.SchemeGroupVersion.Group, Kind: "PodGroup"},
v1beta1.SchemeGroupVersion.Version)
if err == nil {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
@ggaaooppeenngg
Copy link
Contributor Author

done @shinytang6
/assign @gaocegege

@johnugeorge
Copy link
Member

/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ggaaooppeenngg, johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@johnugeorge
Copy link
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Oct 20, 2022
@google-oss-prow google-oss-prow bot merged commit ab9f3ec into kubeflow:master Oct 20, 2022
djwhatle pushed a commit to djwhatle/training-operator that referenced this pull request Oct 27, 2022
* Add PodGroup as controller watch source

The common.ReconcileJobs stops creating pods when the related
PodGroup is unschedulable. When the PodGroup becomes schedulable,
the reconcile loop can not be triggered because of no watch source
for the PodGroup.

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>

* Fix the no PodGroup kind error

Continue without watching the PodGroup is the PodGroup
is not installed.

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>

* Remove the PodGroup crd

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>

* Remove extra blank lines

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
(cherry picked from commit ab9f3ec)
@Crazybean-lwb
Copy link

panic happend

E1031 12:09:36.449193 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 549 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x185a700?, 0x2a0c1a0})
/go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000148028?})
/go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/runtime/runtime.go:49 +0x75

@johnugeorge
Copy link
Member

Can you create a separate issue with more details?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants