This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Fix instability while starting kubelet on worker/controller nodes #26

Closed
mumoshu opened this issue Nov 2, 2016 · 4 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Milestone

Comments

@mumoshu
Contributor

mumoshu commented Nov 2, 2016

i.e. coreos/coreos-kubernetes#697

@mumoshu mumoshu added this to the v0.9.0-rc.5 milestone Nov 2, 2016
@mumoshu mumoshu added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 2, 2016
@mumoshu mumoshu changed the title Try to fix instability while starting flanneld and kubelet Try to fix instability while starting kubelet Nov 4, 2016
@mumoshu
Contributor Author

mumoshu commented Nov 4, 2016

I'm going to eventually fix 3 units: docker.service, flanneld.service, kubelet.service.
The reasoning behind this is that I believe:

  • If flanneld.service depended only on a local etcd.service, it normally shouldn't fail transiently (aws: workaround for systemd #1312, coreos/coreos-kubernetes#697 (comment)). But in today's kube-aws it depends on discrete etcd nodes from the controller/worker nodes, so it may fail when flanneld.service on a controller/worker node starts before the remote etcd members become ready.
  • docker.service depends on the local flanneld.service, which in turn depends on the remote etcd nodes, so it may also fail on startup.
  • kubelet.service depends on flanneld.service, so it may fail for the same reason.
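For reference, the problematic hard-dependency chain described above looks roughly like this. These are illustrative sketches, not the actual kube-aws unit files:

```ini
# flanneld.service (sketch): a hard Requires= on etcd fails the whole
# chain when the remote etcd members aren't ready yet at boot.
[Unit]
Requires=etcd.service
After=etcd.service

# docker.service (sketch): transitively broken if flanneld fails.
[Unit]
Requires=flanneld.service
After=flanneld.service

# kubelet.service (sketch): at the end of the chain, so it inherits
# every transient failure below it.
[Unit]
Requires=docker.service flanneld.service
After=docker.service flanneld.service
```

With Requires=, a single failed start of a dependency leaves the dependent unit stopped, which is the instability this issue is about.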

fyi @dghubble

@mumoshu
Contributor Author

mumoshu commented Nov 4, 2016

Well, some more information regarding this: I'm reconsidering where to apply the fix.

@mumoshu mumoshu changed the title Try to fix instability while starting kubelet Try to fix instability while starting kubelet on worker/controller nodes Nov 4, 2016
@mumoshu mumoshu changed the title Try to fix instability while starting kubelet on worker/controller nodes Fix instability while starting kubelet on worker/controller nodes Nov 4, 2016
@mumoshu
Contributor Author

mumoshu commented Nov 4, 2016

I've noticed that, in cloud-config-etcd, we should make decrypt-tls-assets rkt-based instead of docker-based, as we've already done for workers/controllers.
I'll address that in a separate issue if it matters.

@mumoshu
Contributor Author

mumoshu commented Nov 4, 2016

Let me outline how I'm going to fix the issue based on my initial plan in coreos/coreos-kubernetes#697 (comment)

  • A modification to kubelet.service would be useful, per coreos-baremetal#253

kubelet.service: make it not Require flannel and/or docker, but just Want flannel to be up, using the Wants and ExecStartPre pattern. This will prevent kubelet.service from being stopped forever in case of startup failures in flanneld.service (and decrypt-tls-assets).

flanneld.service: make it Want/Require nothing; just run decrypt-tls-assets in ExecStartPre, to prevent a single failure of decrypt-tls-assets from breaking flanneld.service forever.

docker.service: use Wants and ExecStartPre on flanneld, to prevent a single failure while starting flanneld.service from breaking docker.service forever.

Also note that this should work as long as docker.service, flanneld.service and kubelet.service aren't Required by other services.
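The Wants + ExecStartPre pattern above could be sketched as follows. These are hypothetical unit fragments (the script path and exact check are assumptions, not the real kube-aws manifests):

```ini
# kubelet.service (sketch): Wants= pulls flanneld in without making kubelet
# fail permanently if flanneld fails. The ExecStartPre check makes the start
# attempt fail while flannel isn't up yet, and Restart=always retries it.
[Unit]
Wants=flanneld.service
After=flanneld.service

[Service]
ExecStartPre=/usr/bin/systemctl is-active flanneld.service
Restart=always
RestartSec=10

# flanneld.service (sketch): no Wants/Requires at all; decrypt-tls-assets
# runs in ExecStartPre, so a single decryption failure just fails this start
# attempt and systemd retries, instead of wedging the unit forever.
# /opt/bin/decrypt-tls-assets is a hypothetical path.
[Service]
ExecStartPre=/opt/bin/decrypt-tls-assets
Restart=always
RestartSec=10
```

The key design point is that Wants= expresses ordering and pull-in without the hard failure propagation of Requires=, while ExecStartPre converts "dependency not ready" into a retryable start failure.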

@mumoshu mumoshu closed this as completed in dc7aafa Nov 7, 2016
davidmccormick pushed a commit to HotelsDotCom/kube-aws that referenced this issue Mar 21, 2018
…/RUN-826 to hcom-flavour

* commit 'af4a6da163713398532df5389da58a3886ef61f2':
  Un-hardcode replica count for flyte