intermittent etcd failures post build #1372

Open
ryane opened this Issue Apr 21, 2016 · 5 comments

ryane commented Apr 21, 2016

  • Ansible version (ansible --version): 1.9.4
  • Python version (python --version): 2.7.6
  • Git commit hash or branch: master
  • Cloud Environment: gce, aws
  • Terraform version (terraform version): v0.6.11

Over the last few builds, across platforms, I have occasionally seen some worker nodes come up with a failing etcd distributive health check:

{"message":"Internal Server Error"}

On the affected node, the etcd service is running but there are lots of errors like this in the logs:

Apr 21 13:05:33 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:05:33 rafthttp: failed to find member 96bed1dbce03eb25 in cluster 7eda40fd26f24de5

This does not happen every time, so I am not sure yet how to reproduce it consistently.

We should figure out how to prevent this and document how to fix it if it does occur.
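
For anyone hitting this, a couple of read-only etcdctl commands (etcd 2.x syntax) should help narrow it down; the member ID to compare against is the one in the "failed to find member ..." log line:

# from the affected worker node
etcdctl cluster-health

# list the members this node's cluster actually contains and compare the
# IDs against the one etcd says it cannot find
etcdctl member list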

ryane commented Apr 21, 2016

hm, after about 45 minutes, the problem resolved itself:

Apr 21 12:23:32 resching-gce-worker-02.c.asteris-mi.internal etcd-service-start.sh[7696]: 2016/04/21 12:23:32 rafthttp: failed to find member beed3541e47a9276 in cluster 7eda40fd26f24de5
... (lots of rafthttp errors) ...
Apr 21 13:08:08 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:08:08 rafthttp: the connection with fe4d96286d46b0e6 became active
Apr 21 13:08:12 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:08:12 etcdserver: publish error: etcdserver: request timed out
Apr 21 13:08:12 resching-gce-worker-02 etcd-service-start.sh[7696]: 2016/04/21 13:08:12 etcdserver: published {Name:resching-gce-worker-02 ClientURLs:[http://resching-gce-worker-02:2379]} to cluster 7eda40fd26f24de5

And the service is healthy in consul now.
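
For anyone wanting to watch this from the node itself, the check status can also be queried via the Consul HTTP API (assuming the check is registered on the local agent under a service named etcd; adjust the name if it differs):

# ask the local Consul agent (default HTTP port 8500) for the etcd checks
curl http://localhost:8500/v1/health/checks/etcd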

ryane modified the milestone: 1.1 on Apr 22, 2016

ryane referenced this issue on Apr 26, 2016: Remove .node.consul from mesos hostnames #1385 (merged)

ryane modified the milestones: Feature Backlog, 1.1 on Apr 28, 2016

stevendborrelli commented May 2, 2016

I just saw this problem, and I believe it is due to etcd not running on every node listed in ETCD_INITIAL_CLUSTER= during the initial boot-up of the cluster.

This can happen on existing clusters if you don't push etcd to all the nodes.
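
For context, the static bootstrap list in question looks something like this (hostnames below are placeholders); if etcd never actually comes up on one of the listed peers, or a node's view of the cluster does not match this list, the remaining members keep retrying and you get rafthttp errors like the ones above:

ETCD_INITIAL_CLUSTER=control-01=http://control-01:2380,control-02=http://control-02:2380,worker-01=http://worker-01:2380
ETCD_INITIAL_CLUSTER_STATE=new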

Should we make the initial etcd cluster more like consul and only require quorum from the control nodes?

It feels like the non-control nodes should be running as proxies rather than taking part in the raft election: https://coreos.com/etcd/docs/latest/proxy.html
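
If we went that route, a rough sketch of the worker-side settings with etcd 2.x proxy mode would be something like the following (control-node hostnames are placeholders; see the proxy doc linked above for the full set of options):

# worker nodes: forward client traffic, never join the raft quorum
ETCD_PROXY=on
ETCD_LISTEN_CLIENT_URLS=http://127.0.0.1:2379
ETCD_INITIAL_CLUSTER=control-01=http://control-01:2380,control-02=http://control-02:2380,control-03=http://control-03:2380

Clients on the worker would still hit localhost:2379 as before; the proxy just forwards requests to the control-node members.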

distributorofpain commented Jun 3, 2016

Seeing this in the newest build. When I access the Mesos UI, it frequently says the server is not available. The /var/log/messages log is scrolling with this message:

Note: consul is green, no issues listed...

Jun  3 04:52:09 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:09 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:09 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:09 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,346] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,347] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,347] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:11,348] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:11 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:11 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:11 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,349] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,350] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,350] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:12,350] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:12 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:12 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:12 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,349] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,350] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,350] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:13 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:13,351] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)
Jun  3 04:52:14 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:14 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:14 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:14 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: no route to host)
Jun  3 04:52:15 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:15,562] INFO 10.128.36.67 -  -  [03/Jun/2016:04:52:15 +0000] "GET / HTTP/1.1" 200 3492 "-" "Consul Health Check" (mesosphere.chaos.http.ChaosRequestLog:15)
Jun  3 04:52:15 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:15 rafthttp: failed to dial a43f56d4501b2085 on stream MsgApp v2 (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:15 mantl-do-nyc2-worker-005 etcd-service-start.sh: 2016/06/3 04:52:15 rafthttp: failed to dial a43f56d4501b2085 on stream Message (dial tcp 10.128.35.83:2380: i/o timeout)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,353] INFO Received resource offers (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:82)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,354] INFO No tasks scheduled or next task has been disabled.
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:131)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,354] INFO Declining unused offers. (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:89)
Jun  3 04:52:16 mantl-do-nyc2-worker-005 journal: [2016-06-03 04:52:16,355] INFO Declined unused offers with filter refuseSeconds=5.0 (use --decline_offer_duration to reconfigure) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:97)

Some error messages from the Mesos UI:

Failed to connect to slave '9bb05ba1-3873-4257-8257-0319c5b1f91a-S3' on '/mesos/slave/9bb05ba1-3873-4257-8257-0319c5b1f91a-S3'.
Potential reasons:
• The slave is not accessible from your network
• The slave timed out or went offline
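
While a node is in this state, a few quick checks from the affected worker could help separate a plain network/firewall problem from an etcd membership problem (the IP below is just the peer from the rafthttp errors above; nc may not be installed on every host):

# basic reachability of the peer that etcd cannot dial
ping -c 3 10.128.35.83

# etcd 2.x members API on the client port; should return the member list as JSON
curl -m 5 http://10.128.35.83:2379/v2/members

# raw TCP check against the raft peer port
nc -zv 10.128.35.83 2380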

distributorofpain commented Jun 7, 2016

I couldn't get the worker nodes to come back online. When I dug deeper, I found that one of the control nodes (the one in the error message above) was not reachable. I then rebooted that node, but after the reboot consul would not come back online on it. The worker nodes continued to fail (rather than switching to another control node?). At that point three worker nodes and one control node were not working; I am guessing the cause was the control node not coming back online.

I destroyed the environment and rebuilt it. After the rebuild, two of the worker nodes showed the same issue. Again, one control node wasn't reachable, although this time I was able to restart it without issue and it was fully functional after the reboot. What I did observe this time was that I could not ping the control node over the private network from the two worker nodes (or vice versa, as expected). I was able to ping all of the other nodes from both the control node and the worker nodes; it was just those particular worker nodes that wouldn't talk to that particular control node.

I put the two worker nodes into consul maintenance mode, but I was unable to get them into Mesos maintenance mode (is there a guide for that somewhere? killing the process doesn't work since it just restarts). I went to bed, and when I woke up this morning the servers were all able to talk to each other again.

So this leads me to believe there is some sort of temporary firewall rule that gets activated during the install, probably not intentionally, but perhaps it is being triggered somehow?
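
If anyone hits this again while two nodes cannot see each other, dumping the firewall state on both ends would help confirm or rule that out (assuming the stock CentOS tooling; adjust if the hosts are set up differently):

# full rule dump with packet counters
iptables -L -n -v

# and, if firewalld is managing the rules
firewall-cmd --list-all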

At this point the environment is green.

siddharthist commented Jul 4, 2016

The next time anyone sees this, please try #1616.
