
state store: better handling of job deletion #19609

Merged · 13 commits into main on Jan 12, 2024
Conversation

@pkazmierczak (Contributor) commented Jan 4, 2024

When jobs are deleted with -purge, all of their deployments and allocations should be deleted from the state store, and their evals' status should be set to complete. Otherwise we end up in a situation where users can re-submit a previously failing job, but the new job does not get deployments allocated until system GC runs.

Fixes #10502
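The purge flow described above can be sketched as follows. This is a hypothetical, heavily simplified stand-in: Nomad's real state store uses go-memdb transactions and richer structs, and the `StateStore`/`PurgeJob` names and map layout here are illustrative only.

```go
package main

import "fmt"

// Simplified stand-in for Nomad's state store, just enough to
// illustrate the purge behavior described in the PR.
type StateStore struct {
	Jobs        map[string]bool
	Deployments map[string]string // deployment ID -> job ID
	Allocs      map[string]string // alloc ID -> job ID
	EvalJob     map[string]string // eval ID -> job ID
	EvalStatus  map[string]string // eval ID -> status
}

// PurgeJob removes the job's deployments and allocations and marks its
// evals complete, so a re-submitted job starts from a clean slate
// instead of waiting on stale state until system GC runs.
func (s *StateStore) PurgeJob(jobID string) {
	for id, j := range s.Deployments {
		if j == jobID {
			delete(s.Deployments, id)
		}
	}
	for id, j := range s.Allocs {
		if j == jobID {
			delete(s.Allocs, id)
		}
	}
	for id, j := range s.EvalJob {
		if j == jobID {
			s.EvalStatus[id] = "complete"
		}
	}
	delete(s.Jobs, jobID)
}

func main() {
	s := &StateStore{
		Jobs:        map[string]bool{"web": true},
		Deployments: map[string]string{"d1": "web"},
		Allocs:      map[string]string{"a1": "web", "a2": "web"},
		EvalJob:     map[string]string{"e1": "web"},
		EvalStatus:  map[string]string{"e1": "blocked"},
	}
	s.PurgeJob("web")
	fmt.Println(len(s.Deployments), len(s.Allocs), s.EvalStatus["e1"]) // prints: 0 0 complete
}
```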

@tgross (Member) left a comment


This looks generally good, but it's worth reasoning through what happens if there are allocations on a client that's disconnected when the job is purged. I'm not sure it's safe to delete allocations until we know they're terminal?

@pkazmierczak (Contributor, Author) replied:
> what happens if there are allocations on a client that's disconnected when the job is purged.

If these are failed allocs, they remain in the dead state because there is no scheduler left to re-run them. If they belong to non-failing jobs, they keep running.

> I'm not sure that it's safe to delete allocations until we know they're terminal?

Yeah, to me this is more about the intended behavior of the -purge flag. In my mind, purge should delete any and all allocations on all clients.

The behavior introduced in this PR has one unfortunate side effect: if a client is disconnected while the job gets purged, and the job is a failing one (with alloc-failure evals), then after reconnecting that client will leak allocs in the dead state. GC cleans them up, of course.
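The reason GC can safely collect those leaked dead allocs, while the earlier review question was about deleting live ones, comes down to a terminal-state check. A minimal sketch of that idea, using status names that mirror Nomad's client statuses (the helper itself is illustrative, not Nomad's real API):

```go
package main

import "fmt"

// isClientTerminal reports whether an allocation's client status means the
// work has definitively stopped. Only terminal allocs are safe for GC to
// collect; "complete", "failed", and "lost" mirror Nomad's client statuses.
func isClientTerminal(clientStatus string) bool {
	switch clientStatus {
	case "complete", "failed", "lost":
		return true
	default:
		return false
	}
}

func main() {
	// A dead alloc leaked by a reconnecting client is terminal, so a later
	// GC pass can collect it; a still-running alloc is not.
	fmt.Println(isClientTerminal("failed"), isClientTerminal("running")) // prints: true false
}
```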

@lgfa29 (Contributor) left a comment


Some small comments, but one general observation: we usually have DeleteX() and DeleteXTxn() methods. I think it would be nice to keep that pattern here for deployments and allocs.

@tgross (Member) left a comment


LGTM once @lgfa29's comments are resolved

@pkazmierczak (Contributor, Author) replied:

> we usually have DeleteX() and DeleteXTxn() methods. I think it would be nice to keep this pattern here for deployments and allocs.

Not sure I understand: are you suggesting renaming DeleteJobTxn to DeleteJobDeploymentsAndAllocsTxn or something like that? This PR doesn't introduce a new method; it just modifies DeleteJobTxn.

@lgfa29 (Contributor) commented Jan 11, 2024

> we usually have DeleteX() and DeleteXTxn() methods. I think it would be nice to keep this pattern here for deployments and allocs.
>
> Not sure I understand: are you suggesting renaming DeleteJobTxn to DeleteJobDeploymentsAndAllocsTxn or something like that? This PR doesn't introduce a new method; it just modifies DeleteJobTxn.

Nope, I was suggesting having DeleteJobTxn() look something like this:

```go
func (s *StateStore) DeleteJobTxn(index uint64, namespace, jobID string, txn Txn) error {
	// ...
	for _, deployment := range deployments {
		err := s.DeleteDeploymentTxn(txn, deployment)
		// ...
	}
	// ...
	for _, alloc := range allocs {
		err := s.DeleteAllocTxn(txn, alloc)
		// ...
	}
}
```

That way the deployment and alloc delete logic is shared and consistent across methods (for example, you wouldn't need to update their index table in DeleteJobTxn).
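The pattern being suggested can be fleshed out as a runnable sketch. None of these signatures match Nomad's real state store (which works over go-memdb transactions); the point is only that each DeleteXTxn helper owns its delete logic, including the index-table bump, so DeleteJobTxn composes the helpers instead of duplicating that bookkeeping:

```go
package main

import "fmt"

// txnStore is an illustrative stand-in for a transactional state store.
// The index map plays the role of the per-table "last modified index"
// bookkeeping mentioned in the review.
type txnStore struct {
	deployments map[string]string // deployment ID -> job ID
	allocs      map[string]string // alloc ID -> job ID
	index       map[string]uint64 // table name -> last-modified index
}

// DeleteDeploymentTxn owns deployment deletion, including the index bump,
// so every caller gets consistent behavior for free.
func (s *txnStore) DeleteDeploymentTxn(index uint64, id string) error {
	delete(s.deployments, id)
	s.index["deployment"] = index
	return nil
}

// DeleteAllocTxn does the same for allocations.
func (s *txnStore) DeleteAllocTxn(index uint64, id string) error {
	delete(s.allocs, id)
	s.index["allocs"] = index
	return nil
}

// DeleteJobTxn composes the two helpers; it never touches the index
// table for deployments or allocs directly.
func (s *txnStore) DeleteJobTxn(index uint64, jobID string) error {
	for id, j := range s.deployments {
		if j == jobID {
			if err := s.DeleteDeploymentTxn(index, id); err != nil {
				return err
			}
		}
	}
	for id, j := range s.allocs {
		if j == jobID {
			if err := s.DeleteAllocTxn(index, id); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	s := &txnStore{
		deployments: map[string]string{"d1": "web"},
		allocs:      map[string]string{"a1": "web"},
		index:       map[string]uint64{},
	}
	_ = s.DeleteJobTxn(42, "web")
	fmt.Println(s.index["deployment"], s.index["allocs"]) // prints: 42 42
}
```
(Deleting map entries while ranging over the same map, as the helpers do here, is well defined in Go.)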

@pkazmierczak (Contributor, Author) replied:

> That way the deployment and alloc delete logic is shared and consistent across methods (for example, you wouldn't need to update their index table in DeleteJobTxn).

Oooh, I get it now, I think. Implemented in e918e3b; have a look and see if that's what you meant.

@lgfa29 (Contributor) left a comment


Last bit of nit-picking 😄

@pkazmierczak pkazmierczak merged commit 5d12ca4 into main Jan 12, 2024
19 checks passed
@pkazmierczak pkazmierczak deleted the b-unfinished-gc branch January 12, 2024 09:08
pkazmierczak added a commit that referenced this pull request Jan 15, 2024
There's no need to wait for allocs since #19609; in fact, waiting for allocs to stop will always fail, leading to e2e failures.
nvanthao pushed commits to nvanthao/nomad that referenced this pull request Mar 1, 2024 (cherry-picks of the changes above).
Labels
backport/1.5.x · backport/1.6.x · backport/1.7.x
Successfully merging this pull request may close these issues.

unfinished GC may block placements of new version of the job
3 participants