
nginx-consul not running on kube workers #1346

Closed
ryane opened this issue Apr 13, 2016 · 10 comments · Fixed by #1394

@ryane
Contributor

ryane commented Apr 13, 2016

  • Ansible version (ansible --version): 1.9.4
  • Python version (python --version): 2.7.6
  • Git commit hash or branch: master
  • Cloud Environment: all
  • Terraform version (terraform version): v0.6.11

After a fresh build, the Distributive Mantl health checks for consul are failing on all kube workers:

report:
Total: 9
Passed: 8
Failed: 1
Other: 0
Docker container not runnning:
	Specified: ciscocloud/nginx-consul
	Actual: gcr.io/google_containers/pause:2.0, skynetservices/skydns:2.5.3a, gcr.io/google_containers/pause:2.0, gcr.io/google_containers/pause:2.0, gcr.io/google_containers/hyperkube:v1.2.0, gcr.io/google_containers/pause:2.0, gcr.io/google_containers/pause:2.0

On one worker:

sudo systemctl status nginx-consul
● nginx-consul.service - nginx-consul
   Loaded: loaded (/usr/lib/systemd/system/nginx-consul.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2016-04-13 13:09:36 UTC; 9min ago
  Process: 12656 ExecStop=/usr/bin/docker kill nginx-consul (code=exited, status=0/SUCCESS)
 Main PID: 9449 (code=exited, status=137)

Apr 13 13:04:34 resching-gce-kubeworker-01.c.asteris-mi.internal docker[9449]: + /bin/echo 'Starting nginx...'
Apr 13 13:04:34 resching-gce-kubeworker-01.c.asteris-mi.internal docker[9449]: Starting nginx...
Apr 13 13:04:34 resching-gce-kubeworker-01.c.asteris-mi.internal docker[9449]: + /usr/sbin/nginx -c /etc/nginx/nginx.conf
Apr 13 13:04:34 resching-gce-kubeworker-01.c.asteris-mi.internal docker[9449]: + exit 0
Apr 13 13:09:36 resching-gce-kubeworker-01 systemd[1]: Stopping nginx-consul...
Apr 13 13:09:36 resching-gce-kubeworker-01 docker[12656]: nginx-consul
Apr 13 13:09:36 resching-gce-kubeworker-01 systemd[1]: nginx-consul.service: main process exited, code=exited, status=137/n/a
Apr 13 13:09:36 resching-gce-kubeworker-01 systemd[1]: Stopped nginx-consul.
Apr 13 13:09:36 resching-gce-kubeworker-01 systemd[1]: Unit nginx-consul.service entered failed state.
Apr 13 13:09:36 resching-gce-kubeworker-01 systemd[1]: nginx-consul.service failed.

Restart is set to always:

grep Restart /usr/lib/systemd/system/nginx-consul.service
Restart=always
RestartSec=20
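
For completeness, whether the unit is tied to docker.service (its dependencies are not captured above) can be checked with:

systemctl show nginx-consul -p Requires -p After -p Restart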

Starting nginx-consul manually on each worker seems to resolve the problem.
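
For reference, the manual fix on each worker amounts to:

sudo systemctl start nginx-consul
systemctl is-active nginx-consul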

@ChrisAubuchon
Contributor

@BrianHicks Is this related to the issue you saw with k8s stopping non-k8s containers?

@BrianHicks
Contributor

Yes, and exactly the same solution. I thought we had fixed this!

@andreimc
Contributor

I see this happening as well.

On the latest master, I had to manually start the container after the worker came up.

@ryane
Contributor Author

ryane commented Apr 22, 2016

It doesn't look like k8s is stopping the service after all. Docker is stopped and restarted as part of the flannel install, and nginx-consul dies along with it:

NOTIFIED: [flannel | reconfig docker] *****************************************
changed: [resching-gce-kubeworker-02]
changed: [resching-gce-kubeworker-01]

NOTIFIED: [flannel | restart flannel] *****************************************
changed: [resching-gce-kubeworker-02]
changed: [resching-gce-kubeworker-01]

NOTIFIED: [flannel | stop docker] *********************************************
ok: [resching-gce-kubeworker-02]
ok: [resching-gce-kubeworker-01]

NOTIFIED: [flannel | delete docker0] ******************************************
changed: [resching-gce-kubeworker-02]
changed: [resching-gce-kubeworker-01]

NOTIFIED: [flannel | start docker] ********************************************
changed: [resching-gce-kubeworker-02]
changed: [resching-gce-kubeworker-01]

NOTIFIED: [kubernetes-node | restart kubelet] *********************************
changed: [resching-gce-kubeworker-02]
changed: [resching-gce-kubeworker-01]

I don't understand why Restart=always does not seem to be effective in the nginx-consul systemd service.
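
The failure can be reproduced by hand on a worker by mimicking the handler sequence above (assuming the docker0 handler amounts to an ip link delete):

sudo systemctl stop docker          # nginx-consul is taken down too if it requires docker.service
sudo ip link delete docker0
sudo systemctl start docker
systemctl is-active nginx-consul    # reportedly stays down despite Restart=always

One possibility worth ruling out: systemd does not apply Restart= to a unit taken down by an explicit stop job, so if nginx-consul has Requires=docker.service, the stop propagated from the flannel handler would bypass the restart logic entirely.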

@BrianHicks
Contributor

Aha! There it is! That explains why I couldn't find this. I thought it was part of kubelet starting up. Would it be possible to just remove Flannel? Kubernetes works without it, doesn't it?

@BrianHicks
Contributor

My bad, it's not debug mode. The man page says -q or --log-queries. That's probably what you want. http://linux.die.net/man/8/dnsmasq

@BrianHicks
Contributor

Whoops! I commented on the wrong issue. The above comment should go in #1367.

@ryane
Contributor Author

ryane commented Apr 26, 2016

@BrianHicks as far as I know, Kubernetes will require some kind of custom setup to enable ip-per-pod. Will it work if we remove Flannel without implementing one of the other options? http://kubernetes.io/docs/admin/networking/

@BrianHicks
Contributor

BrianHicks commented Apr 27, 2016

It actually should. I'll investigate and report back.

@ryane
Contributor Author

ryane commented Apr 28, 2016

Since we don't yet have an explanation for why systemd does not restart the service, we have a workaround implemented in #1394.

It will be removed if we end up using an alternative to flannel.
