fix infinite loop in init-pytorch container #1756

Merged
2 commits merged into kubeflow:master on Feb 17, 2023

Conversation

AxeZhan
Contributor

@AxeZhan AxeZhan commented Feb 10, 2023

What this PR does / why we need it:
Replace the infinite loop in the init-pytorch container with a finite loop.
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #1734

Checklist:

  • Docs included if any changes are user facing

@google-cla

google-cla bot commented Feb 10, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Member

@tenzen-y tenzen-y left a comment

@kidddddddddddddddddddddd Thanks for your contribution!

@kubeflow/wg-training-leads Can you approve CI?

@@ -43,7 +43,7 @@ var (
 requests:
 cpu: 50m
 memory: 10Mi
-command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']`
+command: ['sh', '-c', 'err=1;for i in $(seq 100); do if nslookup {{.MasterAddr}}; then err=0 && break; fi;echo waiting for master; sleep 2; done; exit $err']`
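
The change in this hunk can be read as a bounded retry. A minimal sketch of the same pattern as a standalone shell function follows, assuming a hypothetical helper name `wait_for` and a caller-supplied probe command in place of the templated `nslookup {{.MasterAddr}}`; it is an illustration, not the code merged in this PR.

```shell
#!/bin/sh
# Bounded retry sketch (hypothetical helper, not part of the PR).
# Usage: wait_for MAX_TRIES DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_TRIES attempts have failed.
wait_for() {
  max_tries=$1
  delay=$2
  shift 2
  err=1                          # assume failure until the probe succeeds
  for i in $(seq "$max_tries"); do
    if "$@"; then
      err=0                      # probe succeeded, stop retrying
      break
    fi
    echo "waiting for master (attempt $i of $max_tries)"
    sleep "$delay"
  done
  return "$err"                  # non-zero when every attempt failed
}
```

Called as `wait_for 100 2 nslookup "$MASTER_ADDR"` (where `MASTER_ADDR` is an assumed variable), this mirrors the values in the diff above: 100 attempts with a 2-second sleep gives up after roughly 200 seconds of sleeping plus lookup time, and then exits non-zero instead of looping forever.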
Member

Is there any reason we set the limit at 100 times?

Member

Can we make this as timeout which is configurable?

Contributor Author

@johnugeorge Do you mean something like {{.MasterAddr}}, i.e. setting a Helm value such as for i in $(seq {{ .MaxRetries }});? Where should I set the default value for this key? I can't seem to find the Helm values file 🤦.
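
One possible shape for a configurable limit, sketched under assumptions: the environment variable `INIT_MAX_TRIES` and the function `resolve_max_tries` are hypothetical names, and this thread does not show how (or whether) the merged change exposes such a setting.

```shell
#!/bin/sh
# Hypothetical sketch: read the retry limit from an environment variable,
# falling back to a default of 100 when it is unset, empty, or non-numeric.
# INIT_MAX_TRIES is an assumed name, not something defined by this PR.
resolve_max_tries() {
  value=${INIT_MAX_TRIES:-100}
  case $value in
    ''|*[!0-9]*) value=100 ;;    # reject anything that is not a plain integer
  esac
  echo "$value"
}
```

The result could then feed the loop bound, for example `for i in $(seq "$(resolve_max_tries)")`, so the default behaviour stays at 100 attempts unless an operator overrides it.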

@coveralls

Pull Request Test Coverage Report for Build 4145313973

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.04%) to 39.356%

Files with Coverage Reduction | New Missed Lines | %
pkg/controller.v1/mpi/mpijob_controller.go | 2 | 77.35%
Totals Coverage Status
Change from base Build 4136378513: 0.04%
Covered Lines: 2725
Relevant Lines: 6924

💛 - Coveralls

@coveralls

coveralls commented Feb 10, 2023

Pull Request Test Coverage Report for Build 4171231098

  • 3 of 3 (100.0%) changed or added relevant lines in 1 file are covered.
  • 5 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.003%) to 39.512%

Files with Coverage Reduction | New Missed Lines | %
pkg/controller.v1/mpi/mpijob_controller.go | 2 | 76.97%
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go | 3 | 53.35%
Totals Coverage Status
Change from base Build 4147261027: -0.003%
Covered Lines: 2737
Relevant Lines: 6927

💛 - Coveralls

@google-oss-prow google-oss-prow bot added size/S and removed size/XS labels Feb 14, 2023
@AxeZhan AxeZhan force-pushed the fix_loop branch 2 times, most recently from 2a6b6aa to ae2ac0e Compare February 14, 2023 03:25
@johnugeorge
Member

/lgtm

/cc @tenzen-y

Member

@tenzen-y tenzen-y left a comment

@kidddddddddddddddddddddd Thanks for your great contribution!

/lgtm
/assign @johnugeorge

@@ -34,6 +34,7 @@ func TestInitContainer(t *testing.T) {

Member

@tenzen-y tenzen-y Feb 14, 2023

We may want to test the whole of initContainer. However, that is out of the scope of this PR.
So we can follow up with another PR.

@tenzen-y
Member

@johnugeorge friendly ping.

@johnugeorge
Member

Sorry for the late response
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, kidddddddddddddddddddddd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 7067269 into kubeflow:master Feb 17, 2023
Successfully merging this pull request may close these issues.

Infinitely looping init-pytorch container may eventually exceed its memory limit
4 participants