unfinished GC may block placements of new version of the job #10502

tgross · 2021-05-04T13:50:55Z

We received an internal report in which a job that had been stopped with the -purge flag and then re-run was not placed until after a nomad system gc command was run. On further observation, though, it appears that while the job doesn't get placed initially, it does eventually get placed.

I suspect that the system gc is a bit of a red herring and that the behavior is timing dependent. There's already a GC evaluation for the job in flight because of the -purge flag, but the old allocation still exists in the state store because the client hasn't GC'd it yet. The old allocation's reference to the job is orphaned, but then becomes valid again once the new job is registered.

The reporter of the issue has a workaround: run system gc (or they could simply not re-run the job immediately after purging it). So this isn't particularly urgent.
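
The workaround described above can be sketched as a small shell helper. This is only an illustration: the function name purge_and_rerun is hypothetical, and since the system gc step may be timing dependent rather than a real fix, treat it as a stopgap.

```shell
# Hypothetical helper sketching the reporter's workaround:
# purge the job, force a garbage collection, then re-run it.
purge_and_rerun() {
  job_id="$1"     # e.g. "example"
  job_file="$2"   # e.g. "./example.nomad"

  # Stop the job and purge it from the servers.
  nomad job stop -purge "$job_id" || return 1

  # Force a GC pass before registering the job again.
  nomad system gc || return 1

  # Register the job again.
  nomad job run "$job_file"
}
```

With the jobspec from this issue, the call would be purge_and_rerun example ./example.nomad.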

jobspec

job "example" {
  datacenters = ["dc1"]

  group "group" {
    task "task" {
      driver = "docker"
      config {
        image      = "centos:centos7"
        force_pull = true
        command    = "/blah/non-existent-file"
      }

      resources {
        cpu    = 200
        memory = 64
      }
    }
  }
}

To reproduce, run the failing job:

$ nomad job run ./example.nomad
==> Monitoring evaluation "1a2b9089"
    Evaluation triggered by job "example"
==> Monitoring evaluation "1a2b9089"
    Evaluation within deployment: "e21d83a7"
    Allocation "ef54f207" created: node "7df6e0cf", group "group"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "1a2b9089" finished with status "complete"

$ nomad job status ex
ID            = example
Name          = example
Submit Date   = 2021-05-04T13:34:46Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
group       0       0         0        1       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
group       ab6d7799  24s from now

Latest Deployment
ID          = e21d83a7
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
group       1        1       0        1          2021-05-04T13:44:46Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
ef54f207  7df6e0cf  group       0        run      failed  7s ago   2s ago

Stop and purge the job, then immediately re-run it, and note that no deployment is visible:

$ nomad job stop -purge example
==> Monitoring evaluation "b98aa4ca"
    Evaluation triggered by job "example"
==> Monitoring evaluation "b98aa4ca"
    Evaluation within deployment: "e21d83a7"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "b98aa4ca" finished with status "complete"

$ nomad job run ./example.nomad
==> Monitoring evaluation "1b42cb0e"
    Evaluation triggered by job "example"
==> Monitoring evaluation "1b42cb0e"
    Evaluation within deployment: "e21d83a7"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "1b42cb0e" finished with status "complete"

$ nomad job status ex
ID            = example
Name          = example
Submit Date   = 2021-05-04T13:35:03Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
group       0       0         0        0       0         0

Allocations
No allocations placed

Wait a little bit and see that the new deployment has been created and its allocation has failed as expected:

$ nomad job status example
...
Latest Deployment
ID          = 9e69d457
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
group       1        1       0        1          2021-05-04T13:45:18Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
09ca1a69  7df6e0cf  group       0        run      failed  16s ago  11s ago
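
The "wait a little bit" step can be scripted when checking how long placement is delayed. The following is a minimal sketch, assuming the "No allocations placed" text shown in the status output above; the helper name wait_for_placement, the retry count, and the delay are all hypothetical choices.

```shell
# Hypothetical helper: poll job status until at least one allocation exists.
# Relies on the "No allocations placed" line printed by `nomad job status`.
wait_for_placement() {
  job_id="$1"
  attempts="${2:-30}"   # assumed retry budget
  delay="${3:-2}"       # assumed seconds between polls
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Succeed only if the status call worked and allocations are listed.
    if out=$(nomad job status "$job_id") &&
       ! printf '%s\n' "$out" | grep -q "No allocations placed"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  # Gave up: the job still has no allocations.
  return 1
}
```

Running wait_for_placement example right after re-registering the job returns 0 once an allocation shows up, or 1 if none appears within the retry budget.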

tgross added type/bug theme/deployments hcc/cst Admin - internal labels May 4, 2021

pkazmierczak self-assigned this Dec 14, 2023

pkazmierczak mentioned this issue Jan 8, 2024

state store: better handling of job deletion #19609

Merged

pkazmierczak closed this as completed in #19609 Jan 12, 2024
