intermittent etcd failures post build #1372
Over the last few builds, across platforms, I have occasionally seen some worker nodes come up with a failing
On the affected node, the
This does not happen every time, so I am not sure yet how to reproduce it consistently.
We should figure out how to prevent this and document how to fix it if it does occur.
hm, after about 45 minutes, the problem resolved itself:
And the service is healthy in consul now.
I just saw this problem, and I believe it is due to etcd not running on every node in the defined
This can happen on existing clusters if you don't push etcd to all the nodes.
Should we make the initial etcd cluster more like consul and only require quorum from the control nodes?
It feels like the non-control nodes should be running as proxies and not taking part in the raft election: https://coreos.com/etcd/docs/latest/proxy.html
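To make the quorum concern concrete, here is a rough sketch of the arithmetic. It assumes standard Raft majority rules (a cluster of N voting members needs floor(N/2) + 1 of them running) and that the initial cluster lists every node as a voting member. The node counts are made up for illustration and are not taken from any build in this thread.

```python
def quorum_size(voting_members: int) -> int:
    """Smallest majority of a Raft cluster with the given number of voting members."""
    if voting_members < 1:
        raise ValueError("a cluster needs at least one voting member")
    return voting_members // 2 + 1


def has_quorum(voting_members: int, running_members: int) -> bool:
    """True if enough voting members are running to elect a leader."""
    return running_members >= quorum_size(voting_members)


# Hypothetical layout (an assumption, not from this issue): 3 control nodes, 4 workers.
CONTROL_NODES = 3
WORKER_NODES = 4

# Case 1: every node is a voting member, but etcd only runs on the control nodes.
all_voting = has_quorum(CONTROL_NODES + WORKER_NODES, running_members=CONTROL_NODES)

# Case 2: only control nodes vote; workers would act as proxies and do not count.
control_only = has_quorum(CONTROL_NODES, running_members=CONTROL_NODES)

print(all_voting, control_only)  # prints: False True
```

With these example numbers, a seven-member cluster needs four members up, so three running control nodes are not enough. Restricting membership to the three control nodes needs only two.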
Seeing this in the newest build. When I access the mesos UI, it frequently says the server is not available. Consul is green, with no issues listed. /var/log/messages is scrolling with this message:
Failed to connect to slave '9bb05ba1-3873-4257-8257-0319c5b1f91a-S3' on '/mesos/slave/9bb05ba1-3873-4257-8257-0319c5b1f91a-S3'.
I couldn't get the worker nodes to come back online. When I dug deeper, I found that one of the control nodes (the one in the error message above) was not reachable. I rebooted that node, and after the reboot consul on it would not come back online. The worker nodes continued to fail (rather than switching to another control node?). At this point three worker nodes and one control node were not working. My guess is that the cause was the control node not coming back online.
I destroyed the environment and rebuilt it. After the rebuild, two of the worker nodes showed the same issue. Again, one control node wasn't reachable. This time I was able to restart the control node without issue, and it was fully functional after the reboot. What I did observe, though, was that I could not ping the control node over the private network from the two worker nodes (or vice versa, as expected). I was able to ping all the other nodes from both the control node and the worker nodes; it was just that those particular worker nodes wouldn't talk to that particular control node.
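For anyone trying to narrow down which node pairs can't talk, a small reachability check run from each node might help. This is only a minimal sketch: it tests TCP connects rather than ICMP ping, so it is not a drop-in replacement for the ping test above, and it assumes something is listening on the chosen port on each peer. The function names are made up for this example.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def unreachable_peers(peers, port, timeout=2.0):
    """Return the peers (in input order) that could not be reached on the port."""
    return [host for host in peers if not can_connect(host, port, timeout)]
```

Calling `unreachable_peers` with the private addresses of the other nodes and a port known to be open (both are placeholders to fill in per environment) on every node and comparing the results should show whether the failure is limited to specific pairs, as it appeared to be here.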
I put the two worker nodes in consul maintenance mode, but was unable to get them into mesos maintenance mode (is there a guide for this somewhere? Killing the process doesn't work, since it restarts). I went to bed, and when I woke up this morning the servers were all able to talk to each other.
So this leads me to believe there is some sort of temporary firewall rule being activated during install. Probably not intentionally, but perhaps something is triggering it?
At this point the environment is green.