Nomad thinks address of all other servers is 127.0.0.1 #1140

Closed · ghost opened this issue May 2, 2016 · 8 comments

ghost commented May 2, 2016

Nomad version

Nomad v0.3.2

Operating system and Environment details

Linux consul-master-eu-west-1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u4 (2016-02-29) x86_64 GNU/Linux

Issue

I've upgraded Nomad from v0.3.1 to v0.3.2 and restarted each of the servers in turn. Nomad is now unable to elect a leader because it thinks all other servers are reachable on 127.0.0.1:

Name                            Address    Port  Status  Leader  Protocol  Build  Datacenter    Region
consul-master-eu-west-1.europe  127.0.0.1  4648  alive   false   2         0.3.2  europe-west1  europe
consul-master-eu-west-2.europe  127.0.0.1  4648  failed  false   2         0.3.2  europe-west1  europe
consul-master-eu-west-3.europe  127.0.0.1  4648  failed  false   2         0.3.2  europe-west1  europe

The configuration of each node has the following format (substituting the correct IP address):

{
  "advertise": {
    "rpc": "10.133.0.4:4647"
  },
  "bind_addr": "0.0.0.0",
  "client": {
    "enabled": false
  },
  "data_dir": "/var/nomad",
  "datacenter": "europe-west1",
  "region": "europe",
  "server": {
    "bootstrap_expect": 3,
    "enabled": true
  }
}

Nomad Server logs (if appropriate)

The same pattern of errors is visible in the logs of all server nodes:

May  2 10:06:32 consul-master-eu-west-1 nomad[24615]: 2016/05/02 10:06:32 [INFO] serf: attempting reconnect to consul-master-eu-west-3.europe 127.0.0.1:4648
May  2 10:07:02 consul-master-eu-west-1 nomad[24615]: 2016/05/02 10:07:02 [INFO] serf: attempting reconnect to consul-master-eu-west-2.europe 127.0.0.1:4648
May  2 10:07:32 consul-master-eu-west-1 nomad[24615]: 2016/05/02 10:07:32 [INFO] serf: attempting reconnect to consul-master-eu-west-2.europe 127.0.0.1:4648

ghost (Author) commented May 2, 2016

Rolling back to v0.3.1 fixes the issue.

Name                            Address     Port  Status  Protocol  Build  Datacenter    Region
consul-master-eu-west-1.europe  10.133.0.4  4648  alive   2         0.3.1  europe-west1  europe
consul-master-eu-west-2.europe  10.133.0.6  4648  alive   2         0.3.1  europe-west1  europe
consul-master-eu-west-3.europe  10.133.0.7  4648  alive   2         0.3.1  europe-west1  europe

dadgar (Contributor) commented May 4, 2016

Can you add the serf key to your advertise block?
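
In the JSON form used by the reporter's config above, that amounts to adding a serf entry alongside rpc in the advertise block. A minimal sketch, with the rest of the file unchanged (the IP and rpc line come from the config above; 4648 is the serf port shown in the server-members output and logs):

  "advertise": {
    "rpc": "10.133.0.4:4647",
    "serf": "10.133.0.4:4648"
  },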

igrayson commented May 22, 2016

I ran into this just now. Rolling back also fixed it for me.

Adding the serf key also fixes it on 0.3.2:

advertise {
  rpc  = "10.10.36.2:4647"
  http = "10.10.36.2:4648"
  serf = "10.10.36.2:4648"
}

I haven't gotten a 3-node leader election working yet, so I can't confirm a full fix.

igrayson commented

Election started working with the addition of the serf clause, after I wiped each node's data directory (sketched below, after the log excerpt).

Before doing so, this general pattern repeated:

...
    2016/05/22 06:30:05 [INFO] raft: Duplicate RequestVote for same term: 778
    2016/05/22 06:30:05 [WARN] raft: Duplicate RequestVote from candidate: 10.10.3.95:4647
    2016/05/22 06:30:05 [WARN] raft: Remote peer 127.0.0.1:4647 does not have local node 10.10.3.95:4647 as a peer
    2016/05/22 06:30:05 [DEBUG] raft: Vote granted from 127.0.0.1:4647. Tally: 2
    2016/05/22 06:30:05 [INFO] raft: Election won. Tally: 2
    2016/05/22 06:30:05 [INFO] raft: Node at 10.10.3.95:4647 [Leader] entering Leader state
    2016/05/22 06:30:05 [INFO] nomad: cluster leadership acquired
    2016/05/22 06:30:05 [WARN] raft: Clearing log suffix from 925 to 926
    2016/05/22 06:30:05 [INFO] raft: Node at 10.10.3.95:4647 [Follower] entering Follower state
    2016/05/22 06:30:05 [INFO] nomad: cluster leadership lost
    2016/05/22 06:30:05 [ERR] nomad: failed to wait for barrier: leadership lost while committing log
    2016/05/22 06:30:05 [INFO] raft: pipelining replication to peer 127.0.0.1:4647
    2016/05/22 06:30:05 [INFO] raft: aborting pipeline replication to peer 127.0.0.1:4647
    2016/05/22 06:30:05 [ERR] worker: failed to dequeue evaluation: rpc error: rpc error: rpc error: rpc error: < snipped hundreds of lines of this > rpc error: rpc error: rpc error: rpc error: rpc error: No cluster leader
...
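
For anyone retracing this, a rough sketch of the wipe-and-restart step described above, run on each server in turn. It assumes the /var/nomad data_dir from the original config and a systemd-managed agent; both are assumptions, so adjust the path and service commands to your setup:

sudo systemctl stop nomad     # assumption: agent runs under systemd; stop it however it is supervised
sudo rm -rf /var/nomad/*      # data_dir from the config above; clears the stale raft/serf state
sudo systemctl start nomad    # restart with the corrected advertise block in place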

dadgar (Contributor) commented May 24, 2016

Going to close this as you have to include the advertise address.

dadgar closed this as completed May 24, 2016
clumsy commented Oct 13, 2016

@dadgar Doesn't it look like a bug that you need to wipe the data dir clean after changing the advertise options? I spent quite a while figuring out the problem before I found this single helpful comment from @igrayson.

dadgar (Contributor) commented Oct 13, 2016

@clumsy Got a chuckle out of your handle and the situation. I think it needs to be a validation error so you can never even get into this state.

github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 19, 2022