Skip to content

fix(trainer): validate polling_interval in wait_for_job_status#554

Open
HarshPopat23 wants to merge 6 commits into
kubeflow:mainfrom
HarshPopat23:fix-550-polling-interval-validation
Open

fix(trainer): validate polling_interval in wait_for_job_status#554
HarshPopat23 wants to merge 6 commits into
kubeflow:mainfrom
HarshPopat23:fix-550-polling-interval-validation

Conversation

@HarshPopat23

Copy link
Copy Markdown

What this PR does / why we need it:

Adds input validation to TrainerClient.wait_for_job_status() to reject

non-positive polling_interval values before delegating to the backend.

Previously, passing polling_interval=0 could cause a CPU busy-loop

(time.sleep(0) in a tight polling loop), and negative values raised a

cryptic stdlib ValueError from deep inside time.sleep() with no

context about which argument was invalid.

Which issue(s) this PR fixes:

Fixes #550

Checklist:

  • Docs included if any changes are user facing (docstring already documented ValueError: The input values are incorrect. — no change needed)

Adds a ValueError guard in TrainerClient.wait_for_job_status() to reject non-positive polling_interval values before delegating to the backend. Previously this could cause a CPU busy-loop (polling_interval=0) or a cryptic stdlib error (negative values). Fixes kubeflow#550

Signed-off-by: HarshPopat23 <musichk61@gmail.com>
Copilot AI review requested due to automatic review settings July 3, 2026 06:17
@google-oss-prow

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds early input validation to the user-facing TrainerClient.wait_for_job_status() API to reject invalid polling_interval values before delegating to the selected backend, and adds a unit test to cover the new validation behavior.

Changes:

  • Added a guard in TrainerClient.wait_for_job_status() to raise ValueError when polling_interval <= 0.
  • Added unit tests verifying ValueError is raised for polling_interval of 0 and -5.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
kubeflow/trainer/api/trainer_client.py Adds polling_interval validation before calling backend wait_for_job_status().
kubeflow/trainer/api/trainer_client_test.py Adds a unit test ensuring invalid polling_interval values raise ValueError.

Comment thread kubeflow/trainer/api/trainer_client.py Outdated
Comment thread kubeflow/trainer/api/trainer_client_test.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: HarshPopat23 <musichk61@gmail.com>
@HarshPopat23 HarshPopat23 force-pushed the fix-550-polling-interval-validation branch from 0629885 to 96650d5 Compare July 3, 2026 06:34
…rror messages

Signed-off-by: HarshPopat23 <musichk61@gmail.com>

@andreyvelich andreyvelich left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test
/retest

Comment thread kubeflow/trainer/api/trainer_client_test.py Outdated
Comment thread kubeflow/trainer/api/trainer_client_test.py Outdated
Comment thread kubeflow/trainer/api/trainer_client.py Outdated
…ed utils

Move polling_interval and timeout validation into a shared
common_utils.validate_wait_intervals() function, used by both
TrainerClient and OptimizerClient. Also renamed test function per
review feedback and fixed missing newline at end of test file.

Signed-off-by: HarshPopat23 <musichk61@gmail.com>
@google-oss-prow google-oss-prow Bot added size/M and removed size/S labels Jul 4, 2026
Signed-off-by: HarshPopat23 <musichk61@gmail.com>
Signed-off-by: HarshPopat23 <musichk61@gmail.com>
@Goku2099

Goku2099 commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Small naming suggestion: since this helper validates both polling_interval and timeout specifically for wait_for_job_status, would something like validate_wait_for_job_status_args() (or ..._params()) make its purpose a bit clearer?
Not a blocker, just a readability suggestion.

@HarshPopat23

Copy link
Copy Markdown
Author

Small naming suggestion: since this helper validates both polling_interval and timeout specifically for wait_for_job_status, would something like validate_wait_for_job_status_args() (or ..._params()) make its purpose a bit clearer? Not a blocker, just a readability suggestion.

Hi @Goku2099, thank you for taking the time to review my PR and for the suggestion! I actually used this function name based on a recommendation from @andreyvelich. If they agree that validate_wait_for_job_status_args() (or a similar name) is more appropriate, I'd be happy to update it. Thanks again for the helpful feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(trainer): wait_for_job_status() accepts zero or negative polling_interval causing CPU busy-loop or cryptic error

4 participants