scheduler: `RescheduleTracker` dropped if follow-up fails placements #12319

tgross · 2022-03-18T12:59:02Z

When an allocation fails it triggers an evaluation. The evaluation is processed and the scheduler sees it needs to reschedule, which triggers a follow-up eval. The follow-up eval creates a plan to (stop 1) (place 1). The replacement alloc has a RescheduleTracker (or gets its RescheduleTracker updated).

But in the case where the follow-up eval can't place all allocs (there aren't enough resources), it can create a partial plan to (stop 1) (place 0). It then creates a blocked eval. The plan applier stops the failed alloc. Then when the blocked eval is processed, the job is missing an allocation, so the scheduler creates a new allocation. This allocation is not a replacement from the perspective of the scheduler, so it's not handed off a RescheduleTracker.
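A minimal sketch of how the history gets lost, assuming a heavily simplified model. The `Alloc` type and the `reschedule` and `placeMissing` functions are hypothetical stand-ins for illustration, not Nomad's actual structs or scheduler code, and the attempt count stands in for the events held by a `RescheduleTracker`.

```go
package main

import "fmt"

// Alloc is a simplified, hypothetical stand-in for an allocation.
// Attempts models the number of events a RescheduleTracker would carry.
type Alloc struct {
	Name     string
	Attempts int
	Stopped  bool
}

// reschedule models the follow-up eval. The failed alloc is always stopped.
// With capacity, the replacement inherits the failed alloc's history plus one
// attempt. Without capacity, nothing is placed and nil is returned, which
// stands in for the partial plan (stop 1) (place 0) and the blocked eval.
func reschedule(failed *Alloc, hasCapacity bool) *Alloc {
	failed.Stopped = true
	if !hasCapacity {
		return nil
	}
	return &Alloc{Name: failed.Name, Attempts: failed.Attempts + 1}
}

// placeMissing models the blocked eval being processed later. It only knows
// the job is short one allocation, so the new alloc starts with no history.
func placeMissing(name string) *Alloc {
	return &Alloc{Name: name}
}

func main() {
	// Happy path: the replacement carries the history forward.
	failed := &Alloc{Name: "web[0]", Attempts: 2}
	happy := reschedule(failed, true)
	fmt.Println("happy path attempts:", happy.Attempts)

	// Buggy path: no capacity, so the later placement has no history.
	failed = &Alloc{Name: "web[0]", Attempts: 2}
	replacement := reschedule(failed, false)
	if replacement == nil {
		replacement = placeMissing(failed.Name)
	}
	fmt.Println("buggy path attempts:", replacement.Attempts)
}
```

In this model the buggy path resets the attempt count to zero, which is why a reschedule attempt limit would never be reached.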

This changeset fixes this by annotating the reschedule tracker whenever the scheduler can't place a replacement allocation. We check this annotation for allocations that have the stop desired status when filtering out allocations to pass to the reschedule tracker. I've also included tests that cover this case and expand coverage of the relevant area of the code.
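The filtering idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in this changeset: the `Tracker`, `Annotation`, `markUnplaced`, and `keepForReschedule` names, and the annotation value, are all hypothetical.

```go
package main

import "fmt"

// Annotation is a hypothetical marker recorded on a tracker.
type Annotation string

// AnnotationNoPlacement is an assumed value meaning "the scheduler tried to
// place a replacement for this alloc and could not".
const AnnotationNoPlacement Annotation = "no placement"

// Tracker is a simplified stand-in for a reschedule tracker.
type Tracker struct {
	Events     int
	Annotation Annotation
}

// Alloc is a simplified stand-in for an allocation.
type Alloc struct {
	ID            string
	DesiredStatus string // "run" or "stop"
	Tracker       *Tracker
}

// markUnplaced annotates the tracker of an alloc whose replacement could not
// be placed, creating the tracker if the alloc does not have one yet.
func markUnplaced(a *Alloc) {
	if a.Tracker == nil {
		a.Tracker = &Tracker{}
	}
	a.Tracker.Annotation = AnnotationNoPlacement
}

// keepForReschedule reports whether an alloc's tracker should still be
// considered when building a replacement. Stopped allocs are normally
// filtered out; an annotated stopped alloc is kept so its history survives.
// Any other filtering the real scheduler performs is omitted here.
func keepForReschedule(a *Alloc) bool {
	if a.DesiredStatus != "stop" {
		return true
	}
	return a.Tracker != nil && a.Tracker.Annotation == AnnotationNoPlacement
}

func main() {
	stopped := &Alloc{ID: "a1", DesiredStatus: "stop", Tracker: &Tracker{Events: 2}}
	fmt.Println(keepForReschedule(stopped)) // false: a plain stopped alloc is dropped

	markUnplaced(stopped)
	fmt.Println(keepForReschedule(stopped)) // true: the annotation keeps it in play
}
```

The design point is that the annotation lives on the stopped allocation, so the blocked eval can recover the history later without any extra state of its own.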

Fixes: #12147
Fixes: #17072
Fixes: https://hashicorp.atlassian.net/browse/NET-9551

Workflow of the typical happy path:

graph TD;
id1(alloc fails)--client status: failed-->id2(eval A);
id2(eval A)-->id3(follow-up eval B);
id3(follow-up eval B)--desired status: stop-->id4(alloc A);
id3(follow-up eval B)--create with RescheduleTracker-->id5(alloc B);

Workflow of the buggy path:

graph TD;
id1(alloc fails)--client status: failed-->id2(eval A);
id2(eval A)-->id3(follow-up eval B);
id3(follow-up eval B)--desired status: stop-->id4(alloc A);
id3(follow-up eval B)--could not place alloc B-->id5(blocked eval C);
id5(blocked eval C)--create without RescheduleTracker -->id6(alloc B);

While working on #20462 #12319 I found that some of our scheduler tests around down nodes or disconnected clients were enforcing invariants that were unclear. This changeset pulls out some minor refactorings so that the bug fix PR is easier to review. This includes:

* Migrating a few tests from `testify` to `shoenig/test` that I'm going to touch in #12319 anyways.
* Adding test names to the node down test
* Update the disconnected client test so that we always re-process the pending/blocked eval it creates; this eliminates 2 redundant sub-tests.
* Update the disconnected client test assertions so that they're explicit in the test setup rather than implied by whether we re-process the pending/blocked eval.

Ref: #20462
Ref: #12319

gulducat

I haven't fully absorbed all the implications of this, but here are my comments so far

nomad/structs/structs.go

scheduler/generic_sched.go

scheduler/generic_sched_test.go

gulducat

lgtm!

pkazmierczak

LGTM!

nomad/structs/structs.go

tgross · 2024-06-07T15:33:20Z

Waiting on some comments Juanita told me she was working on before merging this.

Juanadelacuesta

Great work in this meta bug :)

vercel bot deployed to Preview – nomad-storybook-and-ui March 18, 2022 12:59 View deployment

tgross mentioned this pull request Mar 18, 2022

Reschedule attempts limit is not respected when there is placement failures #12147

Closed

tgross added type/bug theme/restart/reschedule labels Mar 18, 2022

tgross self-assigned this Mar 18, 2022

tgross changed the title ~~scheduler: RescheduleTracker dropped if follow-up fail placements~~ scheduler: RescheduleTracker dropped if follow-up fails placements Mar 18, 2022

tgross force-pushed the b-blocked-reschedules branch from 2e99a15 to c704fc2 April 6, 2022 14:59

vercel bot temporarily deployed to Preview – nomad April 6, 2022 14:59 Inactive

vercel bot deployed to Preview – nomad-storybook-and-ui April 6, 2022 14:59 View deployment

djenriquez mentioned this pull request Aug 21, 2023

v1.5.3: TaskGroup Rescheduling not enforced #17072

Closed

tgross removed their assignment Oct 26, 2023

tgross mentioned this pull request Apr 25, 2024

Batch job reschedule attempts sometimes not being respected #20462

Open

tgross self-assigned this May 15, 2024

tgross force-pushed the b-blocked-reschedules branch from c704fc2 to 0533b7c May 15, 2024 18:46

vercel bot deployed to Preview – nomad-storybook-and-ui May 15, 2024 18:48 View deployment

tgross force-pushed the b-blocked-reschedules branch from 0533b7c to b8127cb May 15, 2024 18:54

vercel bot deployed to Preview – nomad-storybook-and-ui May 15, 2024 18:57 View deployment

tgross added theme/scheduling stage/needs-rebase labels May 15, 2024

tgross force-pushed the b-blocked-reschedules branch from b8127cb to 680a968 May 17, 2024 15:57

tgross removed the stage/needs-rebase label May 17, 2024

vercel bot deployed to Preview – nomad-storybook-and-ui May 17, 2024 16:00 View deployment

tgross mentioned this pull request May 22, 2024

refactor scheduler tests for node down/disconnected #22198

Merged

tgross force-pushed the b-blocked-reschedules branch from 680a968 to b162cc2 May 22, 2024 14:37

vercel bot deployed to Preview – nomad-storybook-and-ui May 22, 2024 14:40 View deployment

tgross force-pushed the b-blocked-reschedules branch from b162cc2 to a2bd26e May 30, 2024 19:18

vercel bot deployed to Preview – nomad-ui May 30, 2024 19:20 View deployment

tgross force-pushed the b-blocked-reschedules branch from a2bd26e to ac6d975 May 30, 2024 19:31

tgross force-pushed the b-blocked-reschedules branch from 63473e4 to 2e75cd1 May 30, 2024 20:55

vercel bot deployed to Preview – nomad-ui May 30, 2024 20:57 View deployment

tgross force-pushed the b-blocked-reschedules branch from 2e75cd1 to d3c5c8d June 4, 2024 15:07

vercel bot deployed to Preview – nomad-ui June 4, 2024 15:09 View deployment

tgross force-pushed the b-blocked-reschedules branch from d3c5c8d to e404f55 June 4, 2024 15:50

vercel bot deployed to Preview – nomad-ui June 4, 2024 15:52 View deployment

tgross marked this pull request as ready for review June 4, 2024 17:42

tgross requested review from Juanadelacuesta, gulducat and pkazmierczak June 4, 2024 17:42

tgross added this to the 1.8.1 milestone Jun 4, 2024

tgross added backport/ent/1.6.x+ent backport/ent/1.7.x+ent backport/ent/1.8.x+ent backport/1.8.x labels Jun 4, 2024

gulducat reviewed Jun 4, 2024

nomad/structs/structs.go (outdated)

scheduler/generic_sched.go (outdated)

scheduler/generic_sched_test.go (outdated)

address comments from code review

a8f174b

vercel bot deployed to Preview – nomad-ui June 5, 2024 15:49 View deployment

gulducat approved these changes Jun 6, 2024

pkazmierczak approved these changes Jun 6, 2024

nomad/structs/structs.go (outdated)

address comments from code review

b7fcb2f

vercel bot deployed to Preview – nomad-ui June 6, 2024 15:36 View deployment

address comments from code review

1585937

vercel bot deployed to Preview – nomad-ui June 10, 2024 14:16 View deployment

Juanadelacuesta approved these changes Jun 10, 2024

tgross merged commit fa70267 into main Jun 10, 2024
19 checks passed

tgross deleted the b-blocked-reschedules branch June 10, 2024 15:15

hc-github-team-nomad-core mentioned this pull request Jun 10, 2024

Backport of scheduler: RescheduleTracker dropped if follow-up fails placements into release/1.8.x #23282

Merged
