Candidate server failed to vote for itself (Can a server be a non-voter or is this a bug?) #13889
Comments
Related: hashicorp/raft#516

The "failed to make requestVote RPC" log entry suggests that this node did try to self-elect, as that's from

In any case, that's just one term, and raft isn't guaranteed to make forward progress in the face of connectivity issues. But you said that the cluster never elected a leader. Were all the server nodes able to reach each other at some point? Usually when we see a cluster completely stuck on leader election it's because of either that or disk IO problems causing log applies to be too slow. The Agent Members API gives a per-agent view of how many raft members are in play, but won't really help for post-mortem debugging here. The serf peer lists should also be in the logs though, and that's what feeds raft's configuration.
I considered this, except r.goFunc means it runs asynchronously and should not block attempts to ask other peers. So I am still not entirely convinced :)
The "usual suspect" list is interesting, thank you for sharing. I have limited information (just what logs and long-term metric storage can tell us) since our ops team remediated the issue by restarting everything; I was not able to debug it live. But we had 5 servers online that should have been able to reach each other, and they never established a leader within the 15 minutes or so our ops team took to restart everything.

Of interest: the servers listed as the target of "failed to make requestVote RPC" logs were not expected to be live. We had 5 servers live, but perhaps more than 5 servers were recognized as potential peers. So I am pretty sure this is a complication in how we dynamically change which bare-metal machines in our datacenter are servers (we use Consul to discover peers).

Anyway, I realize your eyes are probably glazing over with this description 😁 I think this is likely to be a problem in how we pick/move servers, but to understand how to do it better I feel like I need to be able to look at the logs and understand what happened, and I cannot with the current logs. If I could get hashicorp/raft#516 maybe I would be able to make sense of it. I have been able to determine from logs that for term 252218
Is it this log? Kibana is no longer loading my logs from the time of this incident, so I will have to check next time :(

As a side note, let me know if I am abusing GitHub issues with questions like this, or if there is a better place. I would love to be able to more intelligently debug these so I can contribute better, but I felt here that either (1) I found a bug or (2) I didn't understand something, and I didn't want to assume #2.

I still feel I don't understand how it could ever not print "vote granted" for itself, since it requests from peers asynchronously.
Oh you're right, I missed that.
Aye, I'll try to nudge that along. There's probably a few other open issues we could work on to make raft more observable. Of course we'd then have to land that version of raft in Nomad.
Yeah totally agreed. I'm noticing the channel for votes is unbuffered, so the log for "vote granted" would only come after we've polled on that channel at
That one is pretty good, but only happens once I think? If you did see this sort of thing again, you could take a goroutine dump, and that would let us know exactly where everything is getting stuck.
We encourage folks to use Discuss for "question-y" things, but for a good in-depth discussion where you need folks from the Nomad engineering team, GitHub issues is totally fine and probably more likely to get good results.
If any come to mind, let me know; I'm happy to spend a day adding some visibility to raft, and I can take a swing at updating Nomad's dependency.
It is buffered: https://github.com/hashicorp/raft/blob/v1.3.5/raft.go#L1700. I considered that the buffer size might be wrong somehow, but I think it looks right, since you only write to the chan from a range loop over that same list of servers. The only explanation I can come up with is that the server is a candidate but not a voter, and I don't know how that could happen.
Thanks!
Augh, you're right. Just FYI, our team is focused on a team conference this week, so I've been kind of dipping in and out, which isn't fair for getting the right attention here. My apologies for getting a little sloppy; I'll commit to circling back on this early next week with some renewed focus.
No problem, Tim, you've been helpful. I think all I'd ask for here is confirmation that "non-voter candidate servers" should NOT be possible if that setting is false, plus help with the raft logging. Appreciate the help, enjoy the conference!
Looking for some guidance at least: I have logs from a server that indicate it never voted for itself. My question is, is this possible, or is it a bug?
Details:
We restarted a bunch of servers in a datacenter and lost quorum. When we brought servers back, Nomad failed to elect a leader. I am trying to understand why that happened.
In digging I found this strange behavior and need advice. Take term=252218 as an example:
This server entered a candidate state but never voted for itself, it never once prints this:
r.logger.Debug("vote granted", "from", vote.voterID, "term", vote.Term, "tally", grantedVotes)
How is this possible? The only thing that makes sense to me is that perhaps this loop never contains the server itself, which means either (1) there's a bug where the server doesn't see itself in the server list, or (2) the server isn't a voter.
For the avoidance of doubt, this is not set and we do not have Nomad Enterprise:
https://www.nomadproject.io/docs/configuration/server#non_voting_server
I'd really appreciate guidance on how this is supposed to work, to help me investigate. Digging through raft to find out where suffrage can change is a nightmare.
Nomad version
Nomad v1.2.6