Skip to content

fix(api): allow atomic RuntimePatches update on unsuspend#3469

Open
NarayanaSabari wants to merge 9 commits intokubeflow:masterfrom
NarayanaSabari:fix/runtime-patches-atomic-unsuspend
Open

fix(api): allow atomic RuntimePatches update on unsuspend#3469
NarayanaSabari wants to merge 9 commits intokubeflow:masterfrom
NarayanaSabari:fix/runtime-patches-atomic-unsuspend

Conversation

@NarayanaSabari
Copy link
Copy Markdown
Contributor

@NarayanaSabari NarayanaSabari commented Apr 30, 2026

What this PR does / why we need it:

Today the TrainJob admission webhook checks the new object's spec.suspend when validating spec.runtimePatches mutations. That forces external controllers - most notably Kueue - to issue two API requests to start a TrainJob:

  1. update runtimePatches (NodeSelectors, etc.) while still suspended,
  2. set suspend=false to actually start it.

Two consequences:

  • Race window between the two updates (originally surfaced in kubernetes-sigs/kueue#8255).
  • Doubled API server traffic for every TrainJob admission.

This PR changes checkRuntimePatchesImmutability to consult the old TrainJob's suspend state instead, mirroring the Kubernetes core Job validation pattern:

if it was safe to mutate before, it is safe to mutate during the suspended → unsuspended transition.

The downstream JobSet activity check (forbid changes while a ReplicatedJob is Active) is preserved unchanged, so updates while pods are running remain rejected.

This unblocks Kueue collapsing the two-step update at pkg/controller/jobs/trainjob/trainjob_controller.go into one atomic call.

This is a re-do of #3122 on the RuntimePatches API (the old PR targeted PodTemplateOverrides, which was replaced).

Which issue(s) this PR fixes:

Fixes #3043
Refs: kubernetes-sigs/kueue#8296

Commits (split for bisect-friendliness; each independently green):

  1. fix(webhook): allow atomic RuntimePatches update on unsuspend - validation logic + matching unit-test message.
  2. test(jobset): cover atomic suspend->unsuspend RuntimePatches update - new unit case.
  3. test(integration): cover atomic unsuspend + RuntimePatches update - webhook integration test against envtest.

Validation locally:

  • make fmt / make vet - clean
  • go test ./pkg/runtime/framework/plugins/jobset/... ./pkg/webhooks/... - pass
  • make test-integration - 46/46 passed (32.8s)

Checklist:

  • Unit tests added/updated
  • Integration tests added
  • Docs - none needed; no user-facing API change, only relaxes an internal validation rule.
TrainJob webhook validation now permits updating `spec.runtimePatches` while transitioning from `suspend=true` to `suspend=false` in a single API request. The previously required two-step update (modify runtimePatches, then unsuspend) still works.

/cc @kaisoz @mimowo @andreyvelich

Change checkRuntimePatchesImmutability to check the *old* TrainJob's
suspend state instead of the new one. This lets external controllers
(notably Kueue, see kubernetes-sigs/kueue#8296) modify spec.runtimePatches
and set spec.suspend=false in a single API request, eliminating the
two-step update workaround and the race window between the calls.

Mirrors the Kubernetes core Job validation pattern:
https://github.com/kubernetes/kubernetes/blob/86b66f6f333a/pkg/registry/batch/job/strategy.go#L191

The downstream JobSet activity check (no active replicatedJobs) still
runs, so updates while pods are running remain forbidden.

Refs: kubeflow#3043
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Add a unit test that exercises the new atomic-update path: a single
TrainJob update that both clears spec.suspend and modifies
spec.runtimePatches must succeed when the previous state was suspended.

Refs: kubeflow#3043
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Extend the trainjob webhook update table with an entry that creates a
suspended TrainJob with RuntimePatches and then, in a single Update,
both unsuspends it and modifies the runtimePatches NodeSelector. This
exercises the relaxed validation against a real apiserver and protects
the suspended -> unsuspended transition path used by Kueue.

Refs: kubeflow#3043
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Copilot AI review requested due to automatic review settings April 30, 2026 06:06
@google-oss-prow
Copy link
Copy Markdown

@NarayanaSabari: GitHub didn't allow me to request PR reviews from the following users: mimowo.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

What this PR does / why we need it:

Today the TrainJob admission webhook checks the new object's spec.suspend when validating spec.runtimePatches mutations. That forces external controllers — most notably Kueue — to issue two API requests to start a TrainJob:

  1. update runtimePatches (NodeSelectors, etc.) while still suspended,
  2. set suspend=false to actually start it.

Two consequences:

  • Race window between the two updates (originally surfaced in kubernetes-sigs/kueue#8255).
  • Doubled API server traffic for every TrainJob admission.

This PR changes checkRuntimePatchesImmutability to consult the old TrainJob's suspend state instead, mirroring the Kubernetes core Job validation pattern:

if it was safe to mutate before, it is safe to mutate during the suspended → unsuspended transition.

The downstream JobSet activity check (forbid changes while a ReplicatedJob is Active) is preserved unchanged, so updates while pods are running remain rejected.

This unblocks Kueue collapsing the two-step update at pkg/controller/jobs/trainjob/trainjob_controller.go into one atomic call.

This is a re-do of #3122 on the RuntimePatches API (the old PR targeted PodTemplateOverrides, which was replaced).

Which issue(s) this PR fixes:

Fixes #3043
Refs: kubernetes-sigs/kueue#8296

Commits (split for bisect-friendliness; each independently green):

  1. fix(webhook): allow atomic RuntimePatches update on unsuspend — validation logic + matching unit-test message.
  2. test(jobset): cover atomic suspend->unsuspend RuntimePatches update — new unit case.
  3. test(integration): cover atomic unsuspend + RuntimePatches update — webhook integration test against envtest.

Validation locally:

  • make fmt / make vet — clean
  • go test ./pkg/runtime/framework/plugins/jobset/... ./pkg/webhooks/... — pass
  • make test-integration — 46/46 passed (32.8s)

Checklist:

  • Unit tests added/updated
  • Integration tests added
  • Docs — none needed; no user-facing API change, only relaxes an internal validation rule.
TrainJob webhook validation now permits updating `spec.runtimePatches` while transitioning from `suspend=true` to `suspend=false` in a single API request. The previously required two-step update (modify runtimePatches, then unsuspend) still works.

/cc @kaisoz @mimowo @andreyvelich

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions
Copy link
Copy Markdown

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Relaxes TrainJob admission validation to allow spec.runtimePatches to be updated atomically while transitioning spec.suspend=true → false, aligning the webhook behavior with Kubernetes core Job validation patterns and removing the need for two-step updates by external controllers (e.g., Kueue).

Changes:

  • Update checkRuntimePatchesImmutability to gate RuntimePatches mutations on the old object’s suspend state instead of the new object’s state.
  • Update/extend unit tests to cover atomic unsuspend + RuntimePatches mutation behavior.
  • Add an envtest integration case validating the atomic unsuspend + RuntimePatches update path through the webhook.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
pkg/runtime/framework/plugins/jobset/jobset.go Adjusts immutability validation to consult old suspend state and preserves the “active JobSet” safety check.
pkg/runtime/framework/plugins/jobset/jobset_test.go Updates expected error text and adds a unit test for atomic unsuspend + RuntimePatches mutation.
test/integration/webhooks/trainjob_test.go Adds an integration test entry ensuring the webhook allows an atomic unsuspend + RuntimePatches update.

if !suspended {
allErrs = append(allErrs, field.Forbidden(runtimePatchesPath, "RuntimePatches can only be modified when the TrainJob is suspended"))
if !oldSuspended {
allErrs = append(allErrs, field.Forbidden(runtimePatchesPath, "RuntimePatches can only be modified when the TrainJob was suspended"))
Copy link
Copy Markdown
Contributor Author

@NarayanaSabari NarayanaSabari Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message has been reworded to "RuntimePatches can only be modified when the TrainJob is suspended before or after the update" in d067b6e - that should remove the ambiguity around which suspend state is being checked.

@NarayanaSabari NarayanaSabari changed the title fix(webhook): allow atomic RuntimePatches update on unsuspend fix(api): allow atomic RuntimePatches update on unsuspend Apr 30, 2026
if changed {
if !suspended {
allErrs = append(allErrs, field.Forbidden(runtimePatchesPath, "RuntimePatches can only be modified when the TrainJob is suspended"))
if !oldSuspended {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's consider the whole able of transitions:

oldSuspend newSuspend should block main current
false false true true true
true true false false false
true false false true false
false true false false true

So, I think the change should be block if !(oldSuspended || newSuspended) which is equivalent to !oldSuspended && !newSuspended

Copy link
Copy Markdown
Contributor Author

@NarayanaSabari NarayanaSabari Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - pushed d067b6e with the predicate changed to !oldSuspended && !newSuspended and the error message reworded to "RuntimePatches can only be modified when the TrainJob is suspended before or after the update" so it matches the new semantics.

if changed {
if !suspended {
allErrs = append(allErrs, field.Forbidden(runtimePatchesPath, "RuntimePatches can only be modified when the TrainJob is suspended"))
if !oldSuspended {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is surprising that no assert on existing tests are violated with this change. Let's make sure that all variants see table are now tested.

Copy link
Copy Markdown
Contributor Author

@NarayanaSabari NarayanaSabari Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right that the previous predicate was overly permissive on tests - the only existing case that exercises the validator path is (false, false), which still gets blocked under the corrected logic, so the test stayed green even though the behavior on (false, true) was wrong.

dbb9ba4 now covers all four transitions:

oldSuspend newSuspend expected covered by
false false block forbid changes to runtimePatches when trainJob is not suspended
true true allow allow changes to runtimePatches when trainJob is suspended and jobSet does not exist (and the inactive variant)
true false allow allow atomic update: modify runtimePatches and unsuspend trainJob in a single request
false true allow allow atomic update: modify runtimePatches and suspend trainJob in a single request (new)

A matching webhook integration entry was added for the (false, true) case as well, so all four transitions are exercised through envtest. Local make test-integration passes 47/47 in ~33s.

…pended

Per @mimowo's review, the previous condition (!oldSuspended) over-blocked
the (oldSuspended=false, newSuspended=true) transition — i.e. suspending
a running TrainJob while modifying spec.runtimePatches in the same call,
which is safe because the JobSet activity check still guards against
concurrent pod runs.

Change the predicate to !oldSuspended && !newSuspended so the webhook
only forbids RuntimePatches mutations when the TrainJob is staying fully
unsuspended. Reword the error message accordingly so callers see
explicitly that suspension at either side of the update is acceptable.

Refs: kubeflow#3043, kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Per @mimowo's request, complete the transition table coverage:

- (false, false) -> blocked   (existing test)
- (true,  true)  -> allowed   (existing test)
- (true,  false) -> allowed   (added previously)
- (false, true)  -> allowed   (added here)

Mirrors the existing atomic-unsuspend cases — one unit case in
jobset_test.go and one webhook integration entry in trainjob_test.go.

Refs: kubeflow#3043, kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Copy link
Copy Markdown

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 👍
LGTM, cc @kaisoz @tenzen-y
/assign @andreyvelich

// API request, removing the two-step workaround and the race window
// between the calls. See kubernetes-sigs/kueue#8296 and the Kubernetes
// core Job validation pattern at
// https://github.com/kubernetes/kubernetes/blob/86b66f6f333a/pkg/registry/batch/job/strategy.go#L191
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the comment doesn't need to specify neither the current Kueue behavior nor the source issue. It makes it unnecessarily long (the comment is longer than the change!). What about something like:

Allow modifying RuntimePatches if the TrainJob is suspended before or after the update (i.e. block only when it stays fully unsuspended) This lets external controllers (e.g. Kueue) update RuntimePatches and toggle spec.suspend in a single API request.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — applied your wording verbatim in bd7ffea6. The comment is now 4 lines instead of 8 and stays focused on the what/why of the change rather than the cross-repo backstory.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call indeed, the new comment version lgtm.

@kaisoz
Copy link
Copy Markdown
Contributor

kaisoz commented Apr 30, 2026

Thanks so much @NarayanaSabari ! LGTM too! just added a small nit

Trim the comment block above checkRuntimePatchesImmutability to a 4-line
summary, per @kaisoz's review feedback (the comment was longer than the
change itself). Drops the kueue#8296 backreference and the k/k Job
validation link — both are already in the PR description and commit
history.

Refs: kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich
Copy link
Copy Markdown
Member

/ok-to-test

Copy link
Copy Markdown
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM, thank you!

Comment on lines +181 to +182
oldSuspended := ptr.Equal(oldObj.Spec.Suspend, ptr.To(true))
newSuspended := ptr.Equal(newObj.Spec.Suspend, ptr.To(true))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
oldSuspended := ptr.Equal(oldObj.Spec.Suspend, ptr.To(true))
newSuspended := ptr.Equal(newObj.Spec.Suspend, ptr.To(true))
oldSuspended := ptr.Equal(oldObj.Spec.Suspend, new(true))
newSuspended := ptr.Equal(newObj.Spec.Suspend, new(true))

nits: We can rely on the std lib instead of a third-party one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick heads-up: the literal suggestion as new(true) won't compile because the new() builtin takes a type rather than a value (the closest stdlib equivalent of ptr.To(true) is t := true; &t or an inline nil-check + deref).

I went with the inline pattern in 946234f4 — same semantics as ptr.Equal(x, ptr.To(true)), but stdlib only and arguably easier to read for a simple "is this *bool pointing at true?" check:

oldSuspended := oldObj.Spec.Suspend != nil && *oldObj.Spec.Suspend
newSuspended := newObj.Spec.Suspend != nil && *newObj.Spec.Suspend

Happy to switch to a t := true; &t style if you'd prefer keeping ptr.Equal and only dropping ptr.To. The rest of the file still uses ptr.Deref elsewhere, so the import stays either way.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Please use the latest Go version (>= 1.26.0) and check your local environment.

return job
},
gomega.Succeed()),
ginkgo.Entry("Should succeed to atomically update runtimePatches and unsuspend in a single request",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove this duplicated test case?
We already checked this scenario in UTs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 6e117354. I removed both atomic-transition entries (unsuspend + suspend) since both are covered by UTs in jobset_test.go — leaving only the pre-existing "Should succeed to update runtimePatches when suspend is true" entry as the integration smoke test for the webhook chain. Let me know if you'd rather keep one of them as an end-to-end sanity check.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add one more scenario where old is not suspended, but new is suspended, and the JobSet is still active (ReplicatedJobsStatus: []jobsetv1alpha2.ReplicatedJobStatus{ {Name: constants.Node, Active: 2}})?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in a3a2271c as "forbid atomic update: suspending trainJob with runtimePatches change is rejected when jobSet has an active replicatedJob". Fixture is oldSuspend=false, newSuspend=true, and a JobSet with ReplicatedJobsStatus: [{Name: constants.Node, Active: 2}] — exactly the scenario you suggested. Verifies that the relaxed suspend predicate alone won't sneak a runtimePatches change past the activity safety net.

Per @tenzen-y's review nit, swap the ptr.Equal(x, ptr.To(true)) idiom
for an inline nil-check + dereference. Equivalent semantics, no extra
generic-comparison hop, and stays in the stdlib for these two lines.

Refs: kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Per @tenzen-y's request, add a unit case where the old TrainJob is
unsuspended, the new one is being suspended with a runtimePatches
change, and the underlying JobSet still has an active replicatedJob.
The relaxed suspend predicate alone would let this through; the
JobSet activity check should still reject it.

Refs: kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Per @tenzen-y's review, the two atomic-transition ginkgo entries
(unsuspend + RuntimePatches, suspend + RuntimePatches) duplicate the
unit cases in jobset_test.go that already exercise the same validator
path. The pre-existing "Should succeed to update runtimePatches when
suspend is true" entry stays as the integration smoke test for the
webhook chain.

Refs: kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Don't require the Trainjob to be suspended to update podTemplateOverrides

6 participants