scheduler(system): Fix potential panic in deployment handling #27571
Merged
Conversation
When a system job deployment is successful and a task group has no feasible candidate nodes, the task group's deployment state is set to nil in the mapping. If the deployment is persisted to state, because the job has multiple task groups and at least one results in successful placements, a subsequent evaluation, likely triggered by a recovering node, will look up the previous deployment state. At this point the scheduler was not correctly handling the nil object.
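The failure mode above can be sketched as a map lookup whose value may be nil even when the key is present. This is a minimal illustration, not Nomad's actual `structs` types or scheduler code; the struct shapes and field names here are assumptions made for the example.

```go
package main

import "fmt"

// DeploymentState and Deployment are simplified stand-ins for Nomad's
// structures, used only to illustrate the nil-handling bug.
type DeploymentState struct {
	HealthyAllocs int
}

type Deployment struct {
	// TaskGroups maps task group names to their deployment state. A group
	// with no feasible candidate nodes may be present with a nil value.
	TaskGroups map[string]*DeploymentState
}

// previousHealthy looks up the prior deployment state for a task group,
// guarding against both a missing key and a nil entry in the map.
// Dereferencing dstate without the nil check is what would panic.
func previousHealthy(d *Deployment, tg string) int {
	if d == nil {
		return 0
	}
	dstate, ok := d.TaskGroups[tg]
	if !ok || dstate == nil {
		return 0
	}
	return dstate.HealthyAllocs
}

func main() {
	d := &Deployment{TaskGroups: map[string]*DeploymentState{
		"web":   {HealthyAllocs: 3},
		"cache": nil, // no feasible candidate nodes for this group
	}}
	fmt.Println(previousHealthy(d, "web"))   // 3
	fmt.Println(previousHealthy(d, "cache")) // 0, no panic
}
```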
tgross
approved these changes
Feb 23, 2026
Member
tgross
left a comment
This patch seems fine, but this same code path came up in #27382 as well. It seems to me that we shouldn't be touching the deployment at all if it's terminal, and that this is where the underlying bug is. I've been looking at that block of code for the last week or so and don't quite understand why we're doing that, unfortunately.
LGTM but let's chat about whether there's a more comprehensive fix too.
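The reviewer's suggestion, skipping terminal deployments entirely, could look roughly like the sketch below. The status constants mirror Nomad's naming conventions, but the types and the `Active` helper here are illustrative assumptions, not the actual Nomad implementation.

```go
package main

import "fmt"

// Illustrative deployment statuses, named after Nomad's conventions.
const (
	DeploymentStatusRunning    = "running"
	DeploymentStatusSuccessful = "successful"
	DeploymentStatusFailed     = "failed"
	DeploymentStatusCancelled  = "cancelled"
)

type Deployment struct {
	Status string
}

// Active reports whether the scheduler should still mutate this deployment.
// If terminal deployments were skipped before the dstate lookup, the code
// path that dereferenced the nil object would never be reached.
func (d *Deployment) Active() bool {
	if d == nil {
		return false
	}
	switch d.Status {
	case DeploymentStatusSuccessful, DeploymentStatusFailed, DeploymentStatusCancelled:
		return false
	}
	return true
}

func main() {
	fmt.Println((&Deployment{Status: DeploymentStatusRunning}).Active())    // true
	fmt.Println((&Deployment{Status: DeploymentStatusSuccessful}).Active()) // false
}
```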
tgross
added a commit
that referenced
this pull request
Feb 26, 2026
In #27533 we added write skew protection against the scheduler overwriting deployment state written by the clients. But there's a potential edge case where if a deployment were to drop the task group dstate entirely, we would overwrite the existing one. It doesn't appear possible to hit this case in the scheduler without hitting the panic described in #27571 but this is belt-and-suspenders. Ref: #27533 Ref: #27382 Ref: #27571
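The belt-and-suspenders guard described in this commit message can be sketched as a merge that refuses to clobber an existing task group dstate with a missing or nil one. The function name and struct shapes below are assumptions for illustration; they are not the actual upsert code from #27533.

```go
package main

import "fmt"

type DeploymentState struct{ HealthyAllocs int }

// mergeTaskGroupStates merges the scheduler's updated dstates over the
// existing ones, but keeps the existing state (written by the clients)
// whenever the update dropped a task group entry or set it to nil.
func mergeTaskGroupStates(existing, updated map[string]*DeploymentState) map[string]*DeploymentState {
	out := make(map[string]*DeploymentState, len(existing))
	for tg, dstate := range existing {
		out[tg] = dstate
	}
	for tg, dstate := range updated {
		if dstate == nil {
			continue // never overwrite existing state with nil
		}
		out[tg] = dstate
	}
	return out
}

func main() {
	existing := map[string]*DeploymentState{"web": {HealthyAllocs: 2}}
	updated := map[string]*DeploymentState{"web": nil, "cache": {HealthyAllocs: 1}}
	merged := mergeTaskGroupStates(existing, updated)
	fmt.Println(merged["web"].HealthyAllocs, merged["cache"].HealthyAllocs) // 2 1
}
```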
tgross
added a commit
that referenced
this pull request
Feb 26, 2026
…27608) In #27533 we added write skew protection against the scheduler overwriting deployment state written by the clients. But there's a potential edge case where if a deployment were to drop the task group dstate entirely, we would overwrite the existing one. It doesn't appear possible to hit this case in the scheduler without hitting the panic described in #27571 but this is belt-and-suspenders. Ref: #27533 Ref: #27382 Ref: #27571 Co-authored-by: Tim Gross <tgross@hashicorp.com>
Another option would be to ensure the state write never includes a nil deployment state object for a task group, but that would require more work in the upsert handling. I feel this patch is the right approach, as it is defensive and will always work.
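The alternative mentioned above, keeping nil entries out of the persisted mapping, could be sketched as a prune step before the state write. This is a hypothetical helper under assumed types, not Nomad's actual upsert path.

```go
package main

import "fmt"

type DeploymentState struct{ HealthyAllocs int }

// pruneNilStates drops nil task group entries before the dstate map is
// persisted, so later evaluations can rely on every present entry being
// non-nil. Illustrative only.
func pruneNilStates(states map[string]*DeploymentState) map[string]*DeploymentState {
	out := make(map[string]*DeploymentState, len(states))
	for tg, dstate := range states {
		if dstate != nil {
			out[tg] = dstate
		}
	}
	return out
}

func main() {
	states := map[string]*DeploymentState{"web": {HealthyAllocs: 3}, "cache": nil}
	pruned := pruneNilStates(states)
	fmt.Println(len(pruned)) // 1
}
```

The tradeoff noted in the description applies here: this moves the burden to every writer of deployment state, whereas a defensive nil check at the read site always works.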
Links
Jira: https://hashicorp.atlassian.net/browse/NMD-1266
Closes: #27567
Contributor Checklist
- Changelog entry added using the `make cl` command.
- Tests added to ensure regressions will be caught.
- For changes to the user experience and job configuration, update the Nomad product documentation, which is stored in the `web-unified-docs` repo. Refer to the `web-unified-docs` contributor guide for docs guidelines. Please also consider whether the change requires notes within the upgrade guide. If you would like help with the docs, tag the `nomad-docs` team in this PR.
Reviewer Checklist
- Backport labels set according to the backporting document.
- Squash-and-merge used in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
- Changelog entry and commit messages suitable within the public repository.