unfinished GC may block placements of new version of the job #10502

Closed
tgross opened this issue May 4, 2021 · 0 comments · Fixed by #19609

tgross commented May 4, 2021

We received an internal report where a job that had been stopped with the -purge flag and then re-run would not be placed until after a nomad system gc command was run. On further observation, though, the job does eventually get placed even though it isn't placed initially.

I suspect that the system gc is a bit of a red herring and that the behavior is timing-dependent. There's already a GC evaluation for the job in flight because of the -purge flag. But the old allocation still exists in the state store because the client hasn't GC'd it yet. The old allocation's reference to the job is orphaned, but then becomes valid again once the new job is registered.

The reporter of the issue has a workaround to run system gc (or they could simply not re-run the job immediately after purging it). So this isn't particularly urgent.
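
For reference, the reporter's workaround can be sketched as a small shell helper. `purge_and_rerun` is a hypothetical name used only for illustration, and since the GC step may be a red herring, treat this as a description of what the reporter ran rather than a confirmed fix:

```shell
# Hypothetical helper (not part of the Nomad CLI): purge the job, force a
# garbage collection, then register the job again. Each step only runs if
# the previous one succeeded.
purge_and_rerun() {
  local job_id="$1" jobspec="$2"
  nomad job stop -purge "$job_id" &&
    nomad system gc &&
    nomad job run "$jobspec"
}
```

With the jobspec below, this would be called as `purge_and_rerun example ./example.nomad`.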

jobspec
job "example" {
  datacenters = ["dc1"]

  group "group" {
    task "task" {
      driver = "docker"
      config {
        image      = "centos:centos7"
        force_pull = true
        command    = "/blah/non-existent-file"
      }

      resources {
        cpu    = 200
        memory = 64
      }
    }
  }
}

To reproduce, run the failing job:

$ nomad job run ./example.nomad
==> Monitoring evaluation "1a2b9089"
    Evaluation triggered by job "example"
==> Monitoring evaluation "1a2b9089"
    Evaluation within deployment: "e21d83a7"
    Allocation "ef54f207" created: node "7df6e0cf", group "group"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "1a2b9089" finished with status "complete"
vagrant@linux$ nomad job stop -purge ^C
vagrant@linux$ nomad job status ex
ID            = example
Name          = example
Submit Date   = 2021-05-04T13:34:46Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
group       0       0         0        1       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
group       ab6d7799  24s from now

Latest Deployment
ID          = e21d83a7
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
group       1        1       0        1          2021-05-04T13:44:46Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
ef54f207  7df6e0cf  group       0        run      failed  7s ago   2s ago

Stop and purge the job, then immediately re-run it, and note that no new deployment or allocation is visible:

$ nomad job stop -purge example
==> Monitoring evaluation "b98aa4ca"
    Evaluation triggered by job "example"
==> Monitoring evaluation "b98aa4ca"
    Evaluation within deployment: "e21d83a7"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "b98aa4ca" finished with status "complete"
vagrant@linux$ nomad job run ./example.nomad
==> Monitoring evaluation "1b42cb0e"
    Evaluation triggered by job "example"
==> Monitoring evaluation "1b42cb0e"
    Evaluation within deployment: "e21d83a7"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "1b42cb0e" finished with status "complete"

$ nomad job status ex
ID            = example
Name          = example
Submit Date   = 2021-05-04T13:35:03Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
group       0       0         0        0       0         0

Allocations
No allocations placed

Wait a little bit and see that the new deployment has been created and its allocation has failed as expected:

$ nomad job status example
...
Latest Deployment
ID          = 9e69d457
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
group       1        1       0        1          2021-05-04T13:45:18Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
09ca1a69  7df6e0cf  group       0        run      failed  16s ago  11s ago
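
Since placement is delayed rather than blocked, the "wait a little bit" step can be scripted when reproducing this. The following is a minimal sketch that assumes the "No allocations placed" line shown above is a reliable signal; `wait_for_placement` and its attempt and delay parameters are hypothetical, not part of the Nomad CLI:

```shell
# Hypothetical helper: poll `nomad job status` until the output no longer
# contains "No allocations placed", or give up after a number of attempts.
wait_for_placement() {
  local job_id="$1" attempts="${2:-30}" delay="${3:-2}"
  local i=1 out
  while [ "$i" -le "$attempts" ]; do
    # Bail out if the status command itself fails (e.g. job not found).
    out="$(nomad job status "$job_id")" || {
      echo "nomad job status failed" >&2
      return 2
    }
    case "$out" in
      *"No allocations placed"*) ;; # not placed yet, keep polling
      *)
        echo "placement visible after $i check(s)"
        return 0
        ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "no placement after $attempts checks" >&2
  return 1
}
```

For example, `wait_for_placement example 30 2` would check roughly every two seconds for up to a minute.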