
state store: better handling of job deletion #19609

Merged · 13 commits into main on Jan 12, 2024
Conversation

@pkazmierczak (Contributor) commented Jan 4, 2024

When jobs are deleted with -purge, all of their deployments and allocations should be deleted from the state store, and their evals' status should be set to complete. Otherwise we end up in a situation where users can re-submit a previously failing job, but the new job does not get deployments allocated until system GC runs.

Fixes #10502
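The purge flow described above can be sketched as follows. This is a hypothetical, heavily simplified stand-in: Nomad's real state store uses go-memdb transactions and richer structs, and the `StateStore`/`PurgeJob` names and map layout here are illustrative only.

```go
package main

import "fmt"

// Simplified stand-in for Nomad's state store, just enough to
// illustrate the purge behavior described in the PR.
type StateStore struct {
	Jobs        map[string]bool
	Deployments map[string]string // deployment ID -> job ID
	Allocs      map[string]string // alloc ID -> job ID
	EvalJob     map[string]string // eval ID -> job ID
	EvalStatus  map[string]string // eval ID -> status
}

// PurgeJob removes the job's deployments and allocations and marks its
// evals complete, so a re-submitted job starts from a clean slate
// instead of waiting on stale state until system GC runs.
func (s *StateStore) PurgeJob(jobID string) {
	for id, j := range s.Deployments {
		if j == jobID {
			delete(s.Deployments, id)
		}
	}
	for id, j := range s.Allocs {
		if j == jobID {
			delete(s.Allocs, id)
		}
	}
	for id, j := range s.EvalJob {
		if j == jobID {
			s.EvalStatus[id] = "complete"
		}
	}
	delete(s.Jobs, jobID)
}

func main() {
	s := &StateStore{
		Jobs:        map[string]bool{"web": true},
		Deployments: map[string]string{"d1": "web"},
		Allocs:      map[string]string{"a1": "web", "a2": "web"},
		EvalJob:     map[string]string{"e1": "web"},
		EvalStatus:  map[string]string{"e1": "blocked"},
	}
	s.PurgeJob("web")
	fmt.Println(len(s.Deployments), len(s.Allocs), s.EvalStatus["e1"]) // prints: 0 0 complete
}
```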

@tgross (Member) left a comment


This looks generally good, but it's worth reasoning through what happens if there are allocations on a client that's disconnected when the job is purged. I'm not sure it's safe to delete allocations until we know they're terminal?

@pkazmierczak (Contributor, Author) replied:
> what happens if there are allocations on a client that's disconnected when the job is purged.

If these are failed allocs, they remain in the dead state because there is no scheduler left to re-run them. If they belong to non-failing jobs, they keep running.

> I'm not sure that it's safe to delete allocations until we know they're terminal?

Yeah, to me this is more about the intended behavior of the -purge flag. In my mind, purge should delete any and all allocations on all clients.

The behavior introduced in this PR has one unfortunate side effect: if a client is disconnected while the job gets purged, and the job is a failing one (with alloc-failure evals), then after reconnecting that client will leak allocs in the dead state. GC cleans them up, of course.
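The reason GC can safely collect those leaked dead allocs, while the earlier review question was about deleting live ones, comes down to a terminal-state check. A minimal sketch of that idea, using status names that mirror Nomad's client statuses (the helper itself is illustrative, not Nomad's real API):

```go
package main

import "fmt"

// isClientTerminal reports whether an allocation's client status means the
// work has definitively stopped. Only terminal allocs are safe for GC to
// collect; "complete", "failed", and "lost" mirror Nomad's client statuses.
func isClientTerminal(clientStatus string) bool {
	switch clientStatus {
	case "complete", "failed", "lost":
		return true
	default:
		return false
	}
}

func main() {
	// A dead alloc leaked by a reconnecting client is terminal, so a later
	// GC pass can collect it; a still-running alloc is not.
	fmt.Println(isClientTerminal("failed"), isClientTerminal("running")) // prints: true false
}
```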

@lgfa29 (Contributor) left a comment


Some small comments, but one general observation: we usually have DeleteX() and DeleteXTxn() methods. I think it would be nice to keep that pattern here for deployments and allocs.

@tgross (Member) left a comment


LGTM once @lgfa29's comments are resolved

@pkazmierczak (Contributor, Author) replied:

> we usually have DeleteX() and DeleteXTxn() methods. I think it would be nice to keep this pattern here for deployments and allocs.

Not sure I understand: are you suggesting renaming DeleteJobTxn to DeleteJobDeploymentsAndAllocsTxn or something like that? This PR doesn't introduce a new method; it just modifies DeleteJobTxn.

@lgfa29 (Contributor) commented Jan 11, 2024

> we usually have DeleteX() and DeleteXTxn() methods. I think it would be nice to keep this pattern here for deployments and allocs.
>
> Not sure I understand: are you suggesting renaming DeleteJobTxn to DeleteJobDeploymentsAndAllocsTxn or something like that? This PR doesn't introduce a new method; it just modifies DeleteJobTxn.

Nope, I was suggesting having DeleteJobTxn() look something like this:

```go
func (s *StateStore) DeleteJobTxn(index uint64, namespace, jobID string, txn Txn) error {
	// ...
	for _, deployment := range deployments {
		err := s.DeleteDeploymentTxn(txn, deployment)
		// ...
	}
	// ...
	for _, alloc := range allocs {
		err := s.DeleteAllocTxn(txn, alloc)
		// ...
	}
}
```

That way the deployment and alloc delete logic is shared and consistent across methods (for example, you wouldn't need to update their index table in DeleteJobTxn).
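The pattern being suggested can be fleshed out as a runnable sketch. None of these signatures match Nomad's real state store (which works over go-memdb transactions); the point is only that each DeleteXTxn helper owns its delete logic, including the index-table bump, so DeleteJobTxn composes the helpers instead of duplicating that bookkeeping:

```go
package main

import "fmt"

// txnStore is an illustrative stand-in for a transactional state store.
// The index map plays the role of the per-table "last modified index"
// bookkeeping mentioned in the review.
type txnStore struct {
	deployments map[string]string // deployment ID -> job ID
	allocs      map[string]string // alloc ID -> job ID
	index       map[string]uint64 // table name -> last-modified index
}

// DeleteDeploymentTxn owns deployment deletion, including the index bump,
// so every caller gets consistent behavior for free.
func (s *txnStore) DeleteDeploymentTxn(index uint64, id string) error {
	delete(s.deployments, id)
	s.index["deployment"] = index
	return nil
}

// DeleteAllocTxn does the same for allocations.
func (s *txnStore) DeleteAllocTxn(index uint64, id string) error {
	delete(s.allocs, id)
	s.index["allocs"] = index
	return nil
}

// DeleteJobTxn composes the two helpers; it never touches the index
// table for deployments or allocs directly.
func (s *txnStore) DeleteJobTxn(index uint64, jobID string) error {
	for id, j := range s.deployments {
		if j == jobID {
			if err := s.DeleteDeploymentTxn(index, id); err != nil {
				return err
			}
		}
	}
	for id, j := range s.allocs {
		if j == jobID {
			if err := s.DeleteAllocTxn(index, id); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	s := &txnStore{
		deployments: map[string]string{"d1": "web"},
		allocs:      map[string]string{"a1": "web"},
		index:       map[string]uint64{},
	}
	_ = s.DeleteJobTxn(42, "web")
	fmt.Println(s.index["deployment"], s.index["allocs"]) // prints: 42 42
}
```
(Deleting map entries while ranging over the same map, as the helpers do here, is well defined in Go.)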

@pkazmierczak (Contributor, Author) replied:

> That way the deployment and alloc delete logic is shared and consistent across methods (for example, you wouldn't need to update their index table in DeleteJobTxn).

Oooh, I get it now, I think. Implemented in e918e3b; have a look and see if that's what you meant.

@lgfa29 (Contributor) left a comment


Last bit of nit-picking 😄

@pkazmierczak pkazmierczak merged commit 5d12ca4 into main Jan 12, 2024
19 checks passed
@pkazmierczak pkazmierczak deleted the b-unfinished-gc branch January 12, 2024 09:08
pkazmierczak added a commit that referenced this pull request Jan 15, 2024
There's no need to wait for allocs since #19609; in fact, waiting for allocs to stop will always fail, leading to e2e failures.
nvanthao pushed commits to nvanthao/nomad that referenced this pull request Mar 1, 2024 (cherry-picks of the changes above).
Labels
backport/1.5.x · backport/1.6.x · backport/1.7.x
Successfully merging this pull request may close these issues.

unfinished GC may block placements of new version of the job
3 participants