Periodic and parameterized Batch jobs with no children are shown as queued in summary when server is reelected #3886

Closed
burdandrei opened this issue Feb 20, 2018 · 2 comments · Fixed by #5205

@burdandrei
Contributor

Nomad version

Nomad v0.7.1 (0b295d3)

Operating system and Environment details

Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04

Issue

When the Nomad server leader changes (for example, after the leader is restarted), periodic and parameterized batch jobs transition to Queued status in the job summary:

curl -X GET \
  https://nomad-ui.us-east-1.yotpo.xyz/v1/job/travis-setup/summary 

{
    "JobID": "travis-setup",
    "Namespace": "default",
    "Summary": {
        "travis": {
            "Queued": 1,
            "Complete": 0,
            "Failed": 0,
            "Running": 0,
            "Starting": 0,
            "Lost": 0
        }
    },
    "Children": null,
    "CreateIndex": 6850855,
    "ModifyIndex": 6850874
}
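
To scan a cluster for this symptom, something like the following Python sketch could be applied to each job summary. It is not part of Nomad. The function name `stale_queued_groups` is hypothetical, and the heuristic is an assumption drawn only from the response above: a task group is flagged when `Children` is null, `Queued` is positive, and every other counter is zero.

```python
import json


def stale_queued_groups(summary):
    """Return the task group names that look stuck in Queued.

    Heuristic (an assumption, not a Nomad rule): the job has no child
    summary, and the group reports Queued > 0 with all other counters at 0.
    """
    if summary.get("Children") is not None:
        return []
    stuck = []
    for name, counts in (summary.get("Summary") or {}).items():
        queued = counts.get("Queued", 0)
        others = sum(v for k, v in counts.items() if k != "Queued")
        if queued > 0 and others == 0:
            stuck.append(name)
    return stuck


# The response body reported in this issue.
EXAMPLE = json.loads("""
{
    "JobID": "travis-setup",
    "Namespace": "default",
    "Summary": {
        "travis": {
            "Queued": 1,
            "Complete": 0,
            "Failed": 0,
            "Running": 0,
            "Starting": 0,
            "Lost": 0
        }
    },
    "Children": null,
    "CreateIndex": 6850855,
    "ModifyIndex": 6850874
}
""")

print(stale_queued_groups(EXAMPLE))  # prints ['travis']
```

A job whose summary carries a non-null `Children` object is never flagged by this check, so it may miss variants of the problem that look different from the one reported here.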

Calling /v1/system/gc and /v1/system/reconcile/summaries does not fix the problem. The only way to clear it is to run nomad stop --purge ${JOB_NAME} and then submit the job again.
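
For reference, the calls mentioned above can be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, `NOMAD_ADDR` is the default local agent address rather than the cluster in this report, ACL tokens and TLS settings are omitted, and the purge is expressed as `DELETE /v1/job/<id>?purge=true`, which is the HTTP counterpart of `nomad stop --purge` as far as I understand the API. The helpers only build the requests; `send` is what actually contacts a server.

```python
import urllib.parse
import urllib.request

# Assumption: default local agent address; replace with your own server.
NOMAD_ADDR = "http://127.0.0.1:4646"


def build_request(method, path, addr=NOMAD_ADDR):
    """Build (but do not send) an HTTP request against the Nomad API."""
    return urllib.request.Request(addr.rstrip("/") + path, method=method)


def gc_request(addr=NOMAD_ADDR):
    """Force garbage collection; did not clear the summary in this report."""
    return build_request("PUT", "/v1/system/gc", addr)


def reconcile_request(addr=NOMAD_ADDR):
    """Reconcile job summaries; also did not clear the summary here."""
    return build_request("PUT", "/v1/system/reconcile/summaries", addr)


def purge_request(job_id, addr=NOMAD_ADDR):
    """Stop and purge a job; the job must be resubmitted afterwards."""
    quoted = urllib.parse.quote(job_id, safe="")
    return build_request("DELETE", f"/v1/job/{quoted}?purge=true", addr)


def send(request, timeout=10):
    """Send a prepared request and return (status code, raw body)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.read()
```

Usage would look like `send(purge_request("travis-setup"))`, followed by submitting the job again. Purging removes the job and its history from the server, so it is worth keeping the job file at hand before running it.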

Reproduction steps

  • Submit a parameterized or periodic job.
  • Restart the current cluster leader.

Nomad Server logs (from new leader)

Feb 20 09:39:10 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:10.411107 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
Feb 20 09:39:10 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:10.411217 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
Feb 20 09:39:11 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:11 [INFO] serf: EventMemberLeave: i-0319fb6222c41dec5.us-east-1 10.x.x.123
Feb 20 09:39:11 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:11.135682 [INFO] nomad: removing server i-0319fb6222c41dec5.us-east-1 (Addr: 10.x.x.123:4647) (DC: us-east-1)
Feb 20 09:39:11 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:11.137045 [ERR] worker: failed to dequeue evaluation: rpc error: No cluster leader
Feb 20 09:39:11 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:11.137171 [ERR] worker: failed to dequeue evaluation: rpc error: No cluster leader
Feb 20 09:39:11 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:11 [WARN] raft: Heartbeat timeout from "10.x.x.123:4647" reached, starting election
Feb 20 09:39:11 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:11 [INFO] raft: Node at 10.x.y.212:4647 [Candidate] entering Candidate state in term 3443
Feb 20 09:39:11 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:11 [INFO] serf: EventMemberJoin: i-0319fb6222c41dec5.us-east-1 10.x.x.123
Feb 20 09:39:11 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:11.858108 [INFO] nomad: adding server i-0319fb6222c41dec5.us-east-1 (Addr: 10.x.x.123:4647) (DC: us-east-1)
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [INFO] raft: Duplicate RequestVote for same term: 3443
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [WARN] raft: Election timeout reached, restarting election
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [INFO] raft: Node at 10.x.y.212:4647 [Candidate] entering Candidate state in term 3444
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [INFO] raft: Election won. Tally: 3
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [INFO] raft: Node at 10.x.y.212:4647 [Leader] entering Leader state
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [INFO] raft: Added peer 10.x.z.134:4647, starting replication
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [INFO] raft: Added peer 10.x.c.234:4647, starting replication
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [INFO] raft: Added peer 10.x.z.92:4647, starting replication
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12.953449 [INFO] nomad: cluster leadership acquired
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [INFO] raft: pipelining replication to peer {Voter 10.x.c.234:4647 10.x.c.234:4647}
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [INFO] raft: pipelining replication to peer {Voter 10.x.z.134:4647 10.x.z.134:4647}
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12.967748 [ERR] worker: failed to dequeue evaluation: eval broker disabled
Feb 20 09:39:12 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:12 [INFO] raft: pipelining replication to peer {Voter 10.x.z.92:4647 10.x.z.92:4647}
Feb 20 09:39:13 ip-10-x-y-212 nomad[1567]:     2018/02/20 09:39:13 [WARN] raft: Rejecting vote request from 10.x.x.123:4647 since we have a leader: 10.x.y.212:4647

@archer-trek

I have the same issue.

Nomad version
Nomad v0.8.6 (ab54ebc+CHANGES)

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 26, 2022