cli: non-service jobs on job restart -reschedule (hashicorp#19147)
The `-reschedule` flag stops allocations and assumes the Nomad scheduler
will create new allocations to replace them. But this is only true for
service and batch jobs.

Restarting non-service jobs with the `-reschedule` flag causes the
command to loop forever waiting for the allocations to be replaced,
which never happens.

Allocations for system jobs may be replaced by triggering an evaluation
after each stop to cause the reconciler to run again.

Sysbatch jobs should not be allowed to be rescheduled as they are never
replaced by the scheduler.
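
Taken together, the message describes three outcomes for `-reschedule`, depending on job type. The sketch below encodes that decision as a small lookup. The names `rescheduleAction` and `actionFor` are hypothetical, and the string constants are assumed to mirror the job type values used by the Nomad API.

```go
package main

import "fmt"

// rescheduleAction is a hypothetical enum describing what a restart with
// -reschedule should do for a given job type.
type rescheduleAction int

const (
	actionReject          rescheduleAction = iota // job type cannot be rescheduled
	actionStopOnly                                // scheduler replaces the allocation by itself
	actionStopAndEvaluate                         // an extra evaluation is needed after the stop
)

// Job type strings, assumed to match the values used by the Nomad API.
const (
	jobTypeService  = "service"
	jobTypeBatch    = "batch"
	jobTypeSystem   = "system"
	jobTypeSysbatch = "sysbatch"
)

// actionFor maps a job type to the behaviour described in the commit message.
// Unknown types are rejected, the same way the default case treats them.
func actionFor(jobType string) rescheduleAction {
	switch jobType {
	case jobTypeService, jobTypeBatch:
		return actionStopOnly
	case jobTypeSystem:
		return actionStopAndEvaluate
	default:
		return actionReject
	}
}

func main() {
	names := map[rescheduleAction]string{
		actionReject:          "reject",
		actionStopOnly:        "stop only",
		actionStopAndEvaluate: "stop and evaluate",
	}
	for _, t := range []string{jobTypeService, jobTypeBatch, jobTypeSystem, jobTypeSysbatch} {
		fmt.Printf("%-8s -> %s\n", t, names[actionFor(t)])
	}
}
```

Rejecting by default, instead of listing sysbatch explicitly, means a job type added later cannot silently fall into a wait that never finishes.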
lgfa29 authored and nvanthao committed Mar 1, 2024
1 parent f156810 commit 17bbe08
Showing 4 changed files with 272 additions and 47 deletions.
3 changes: 3 additions & 0 deletions .changelog/19147.txt
@@ -0,0 +1,3 @@
```release-note:bug
cli: Fixed the `nomad job restart` command to create replacements for batch and system jobs and to prevent sysbatch jobs from being rescheduled since they never create replacements
```
25 changes: 24 additions & 1 deletion command/job_restart.go
@@ -187,7 +187,8 @@ Restart Options:
     in-place. Since the group is not modified the restart does not create a new
     deployment, and so values defined in 'update' blocks, such as
     'max_parallel', are not taken into account. This option cannot be used with
-    '-task'.
+    '-task'. Only jobs of type 'batch', 'service', and 'system' can be
+    rescheduled.
   -task=<task-name>
     Specify the task to restart. Can be specified multiple times. If groups are
@@ -286,6 +287,16 @@ func (c *JobRestartCommand) Run(args []string) int {

 	go c.handleSignal(c.sigsCh, activeCh)

+	// Verify job type can be rescheduled.
+	if c.reschedule {
+		switch *job.Type {
+		case api.JobTypeBatch, api.JobTypeService, api.JobTypeSystem:
+		default:
+			c.Ui.Error(fmt.Sprintf("Jobs of type %q are not allowed to be rescheduled.", *job.Type))
+			return 1
+		}
+	}
+
 	// Confirm that we should restart a multi-region job in a single region.
 	if job.IsMultiregion() && !c.autoYes && !c.shouldRestartMultiregion() {
 		c.Ui.Output("\nJob restart canceled.")
@@ -952,6 +963,18 @@ func (c *JobRestartCommand) stopAlloc(alloc AllocationListStubWithJob) error {
 		return fmt.Errorf("Failed to stop allocation: %w", err)
 	}

+	// Allocations for system jobs do not get replaced by the scheduler after
+	// being stopped, so an eval is needed to trigger the reconciler.
+	if *alloc.Job.Type == api.JobTypeSystem {
+		opts := api.EvalOptions{
+			ForceReschedule: true,
+		}
+		_, _, err := c.client.Jobs().EvaluateWithOpts(*alloc.Job.ID, opts, nil)
+		if err != nil {
+			return fmt.Errorf("Failed to evaluate job: %w", err)
+		}
+	}
+
 	// errCh receives an error if anything goes wrong or nil when the
 	// replacement allocation is running.
 	// Use a buffered channel to prevent both goroutines from blocking trying to