
fix(trainer): raise ValueError for missing containers instead of Stop…#348

Closed
1Ayush-Petwal wants to merge 1 commit into kubeflow:main from 1Ayush-Petwal:fix/trainer-stop-iteration-container-lookup

Conversation

@1Ayush-Petwal

What this PR does / why we need it:
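next() on a bare generator with no default raises StopIteration when nothing
matches, and that exception carries no pod/container context. (PEP 479, the
default since Python 3.7, converts it to a RuntimeError only when it escapes
from inside a generator body.) Hardens two functions in utils.py: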

  • get_trainjob_initializer_step: replaced bare next(gen) with next(..., None) + explicit ValueError naming the pod and listing the actual containers found
  • get_trainjob_node_step: same pattern applied
  • Added Raises: docstring section to both functions
  • Added first unit test coverage for both functions (8 parametrized cases)
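
The pattern described in the bullets above can be sketched as follows. This is a minimal, self-contained illustration and not the code from the PR: `Container`, `find_initializer_container`, the `pod_name` parameter, and the literal container names are hypothetical stand-ins for the SDK's own types and constants.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for constants.DATASET_INITIALIZER and
# constants.MODEL_INITIALIZER; the real values live in the SDK and may differ.
DATASET_INITIALIZER = "dataset-initializer"
MODEL_INITIALIZER = "model-initializer"


@dataclass
class Container:
    """Hypothetical minimal container record (only the name matters here)."""

    name: str


def find_initializer_container(pod_name: str, containers: list[Container]) -> Container:
    """Return the first initializer container or raise a descriptive ValueError."""
    # next(..., None) returns None instead of raising StopIteration on no match.
    container = next(
        (c for c in containers if c.name in {DATASET_INITIALIZER, MODEL_INITIALIZER}),
        None,
    )
    if container is None:
        found = [c.name for c in containers]
        raise ValueError(
            f"Pod {pod_name!r} has no initializer container; found containers: {found}"
        )
    return container
```

The explicit check trades a context-free StopIteration for an error message that names the pod and the containers actually present, which is the stated goal of the change.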

Which issue(s) this PR fixes

General SDK hardening (noticed while exploring the codebase for #164).

Checklist:

  • Docs included if any changes are user facing
  • DCO signed (-s flag on commit)
  • Conventional commit format (fix(trainer):)
  • No public API changes

…Iteration

  next() on a bare generator with no default raises StopIteration with no
  pod/container context (PEP 479, Python 3.7+, converts it to RuntimeError
  only inside a generator body). Replace with next(..., None) + explicit ValueError in both
  get_trainjob_initializer_step and get_trainjob_node_step. Add first
  test coverage for both functions (8 parametrized cases total).

Signed-off-by: Ayush Petwal <ayushpetwal.0105@gmail.com>
Copilot AI review requested due to automatic review settings March 3, 2026 17:54
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Contributor

github-actions bot commented Mar 3, 2026

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment


Pull request overview

This PR hardens two utility functions in utils.py by replacing bare next(generator) calls, which raise a context-free StopIteration when nothing matches (PEP 479, Python 3.7+, converts it to a RuntimeError only inside a generator body), with next(..., None) plus an explicit ValueError carrying a descriptive message. It also adds the first unit test coverage for both affected functions.

Changes:

  • get_trainjob_initializer_step in utils.py: replaced bare next() with next(..., None) and an explicit ValueError naming the pod and listing found containers.
  • get_trainjob_node_step in utils.py: same pattern applied for the node container lookup.
  • New parametrized test cases in utils_test.py covering both success and error paths for both functions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| kubeflow/trainer/backends/kubernetes/utils.py | Replaces bare next(gen) calls with safe next(..., None) + ValueError, adds Raises: docstring sections |
| kubeflow/trainer/backends/kubernetes/utils_test.py | Adds 8 parametrized test cases (4 each) covering success and error paths for the two updated functions |
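
A table-driven test of the kind summarized above might look like the following sketch. It uses a plain loop and assertions rather than the repository's actual test helpers, and `lookup_node_container`, the `"node"` container name, and the case names are all hypothetical.

```python
NODE = "node"  # hypothetical stand-in for the SDK's node container name


def lookup_node_container(pod_name: str, container_names: list[str]) -> str:
    """Return the node container name or raise a descriptive ValueError."""
    match = next((n for n in container_names if n == NODE), None)
    if match is None:
        raise ValueError(
            f"Pod {pod_name!r} has no {NODE!r} container; found: {container_names}"
        )
    return match


# (case name, container names, expected exception type or None)
CASES = [
    ("node present", ["node"], None),
    ("node among sidecars", ["istio-proxy", "node"], None),
    ("only sidecars", ["istio-proxy"], ValueError),
    ("no containers", [], ValueError),
]


def run_cases() -> list[str]:
    """Run every case and return the names of the cases that behaved as expected."""
    passed = []
    for name, containers, expected_error in CASES:
        try:
            lookup_node_container("pod-0", containers)
            outcome = None
        except ValueError:
            outcome = ValueError
        if outcome is expected_error:
            passed.append(name)
    return passed
```

Keeping success and error cases in one table makes it easy to see at a glance which inputs are expected to raise.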

@Fiona-Waters
Contributor

/ok-to-test

Contributor

@Fiona-Waters Fiona-Waters left a comment


This is great! Thanks @1Ayush-Petwal
/lgtm

Comment on lines +164 to +169
(
    c
    for c in pod_spec.containers
    if c.name in {constants.DATASET_INITIALIZER, constants.MODEL_INITIALIZER}
),
None,
Member


How could the JobSet lack such containers when we define a label selector here to query the initializer Pods: https://github.com/1Ayush-Petwal/sdk/blob/cbb4ebd7ed7d24d3d4a70a579cc106c387f1efb7/kubeflow/trainer/backends/kubernetes/backend.py#L687-L689

We reserve the container names for the initializers and validate them in the TrainJob webhook: https://github.com/kubeflow/trainer/blob/master/pkg/webhooks/trainingruntime_webhook.go#L89-L101

Author


You're right. I missed the label selector; this case is effectively unreachable. Would it be OK to add a comment documenting this invariant for future contributors?

Member


Yes, please close this PR and leave a comment in the issue noting that these containers should always exist.

@1Ayush-Petwal
Author

Closing this PR: the StopIteration case is unreachable. The label selector ensures only initializer pods reach this code path, and the TrainJob webhook guarantees those pods always have the required container.
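
The invariant behind this conclusion can be illustrated with a toy model. Everything here is an assumption made for illustration: the label key, the role values, the reserved names, and the `admit` check only mimic the idea of a label selector plus webhook validation and are not the Trainer's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical label key, role value, and reserved names, for illustration only.
ROLE_LABEL = "example.com/role"
INITIALIZER_ROLE = "initializer"
RESERVED_NAMES = {"dataset-initializer", "model-initializer"}


@dataclass
class Pod:
    """Hypothetical minimal pod record: a name, labels, and container names."""

    name: str
    labels: dict[str, str] = field(default_factory=dict)
    containers: list[str] = field(default_factory=list)


def admit(pod: Pod) -> bool:
    """Toy admission check: an initializer pod must carry a reserved container name."""
    if pod.labels.get(ROLE_LABEL) != INITIALIZER_ROLE:
        return True
    return any(name in RESERVED_NAMES for name in pod.containers)


def select_initializer_pods(pods: list[Pod]) -> list[Pod]:
    """Toy label selector: keep only pods labelled as initializers."""
    return [p for p in pods if p.labels.get(ROLE_LABEL) == INITIALIZER_ROLE]


def initializer_container(pod: Pod) -> str:
    # Bare next() is safe here only because of the two guarantees above:
    # every selected pod was admitted, so a reserved name is always present.
    return next(name for name in pod.containers if name in RESERVED_NAMES)
```

Under these assumptions, any pod that passes both the admission check and the selector always yields a match, so the no-match branch can never run.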
