This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Fix instability while starting kubelet on worker/controller nodes #26

Closed
mumoshu opened this issue Nov 2, 2016 · 4 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Milestone

Comments

@mumoshu
Contributor

mumoshu commented Nov 2, 2016

i.e. coreos/coreos-kubernetes#697

@mumoshu mumoshu added this to the v0.9.0-rc.5 milestone Nov 2, 2016
@mumoshu mumoshu added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 2, 2016
@mumoshu mumoshu changed the title Try to fix instability while starting flanneld and kubelet Try to fix instability while starting kubelet Nov 4, 2016
@mumoshu
Contributor Author

mumoshu commented Nov 4, 2016

I'm going to eventually fix 3 units: docker.service, flanneld.service, kubelet.service.
The reasoning behind this is that I believe:

  • If flanneld.service depended only on a local etcd.service, it normally shouldn't fail transiently (aws: workaround for systemd #1312, coreos/coreos-kubernetes#697 (comment)). But in today's kube-aws it depends on discrete etcd nodes from the controller/worker nodes, so it may fail when flanneld.service on a controller/worker node starts before the remote etcd members become ready.
  • docker.service depends on the local flanneld.service, which in turn depends on the remote etcd nodes, so it may also fail on startup.
  • kubelet.service depends on flanneld.service, so it may fail for the same reason.
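For reference, the problematic hard-dependency chain described above looks roughly like this. These are illustrative sketches, not the actual kube-aws unit files:

```ini
# flanneld.service (sketch): a hard Requires= on etcd fails the whole
# chain when the remote etcd members aren't ready yet at boot.
[Unit]
Requires=etcd.service
After=etcd.service

# docker.service (sketch): transitively broken if flanneld fails.
[Unit]
Requires=flanneld.service
After=flanneld.service

# kubelet.service (sketch): at the end of the chain, so it inherits
# every transient failure below it.
[Unit]
Requires=docker.service flanneld.service
After=docker.service flanneld.service
```

With Requires=, a single failed start of a dependency leaves the dependent unit stopped, which is the instability this issue is about.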

fyi @dghubble

@mumoshu
Contributor Author

mumoshu commented Nov 4, 2016

Well, some more information regarding this: I'm reconsidering where to apply the fix.

@mumoshu mumoshu changed the title Try to fix instability while starting kubelet Try to fix instability while starting kubelet on worker/controller nodes Nov 4, 2016
@mumoshu mumoshu changed the title Try to fix instability while starting kubelet on worker/controller nodes Fix instability while starting kubelet on worker/controller nodes Nov 4, 2016
@mumoshu
Contributor Author

mumoshu commented Nov 4, 2016

I've noticed that, in cloud-config-etcd, we should make decrypt-tls-assets rkt-based instead of docker-based, as we've already done for workers/controllers.
I'll address that in a separate issue if it matters.

@mumoshu
Contributor Author

mumoshu commented Nov 4, 2016

Let me outline how I'm going to fix the issue based on my initial plan in coreos/coreos-kubernetes#697 (comment)

  • A modification to kubelet.service would be useful, per coreos-baremetal#253

kubelet.service: make it not Require flannel and/or docker, but just Want flannel to be up, using the Wants and ExecStartPre pattern. This will prevent kubelet.service from being stopped forever in case of startup failures in flanneld.service (and decrypt-tls-assets).

flanneld.service: make it Want/Require nothing; just run decrypt-tls-assets in ExecStartPre, to prevent a single failure of decrypt-tls-assets from breaking flanneld.service forever.

docker.service: use Wants and ExecStartPre on flanneld, to prevent a single failure while starting flanneld.service from breaking docker.service forever.

Also note that this should work as long as docker.service, flanneld.service and kubelet.service aren't Required by other services.
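The Wants + ExecStartPre pattern above could be sketched as follows. These are hypothetical unit fragments (the script path and exact check are assumptions, not the real kube-aws manifests):

```ini
# kubelet.service (sketch): Wants= pulls flanneld in without making kubelet
# fail permanently if flanneld fails. The ExecStartPre check makes the start
# attempt fail while flannel isn't up yet, and Restart=always retries it.
[Unit]
Wants=flanneld.service
After=flanneld.service

[Service]
ExecStartPre=/usr/bin/systemctl is-active flanneld.service
Restart=always
RestartSec=10

# flanneld.service (sketch): no Wants/Requires at all; decrypt-tls-assets
# runs in ExecStartPre, so a single decryption failure just fails this start
# attempt and systemd retries, instead of wedging the unit forever.
# /opt/bin/decrypt-tls-assets is a hypothetical path.
[Service]
ExecStartPre=/opt/bin/decrypt-tls-assets
Restart=always
RestartSec=10
```

The key design point is that Wants= expresses ordering and pull-in without the hard failure propagation of Requires=, while ExecStartPre converts "dependency not ready" into a retryable start failure.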

@mumoshu mumoshu closed this as completed in dc7aafa Nov 7, 2016
davidmccormick pushed a commit to HotelsDotCom/kube-aws that referenced this issue Mar 21, 2018
…/RUN-826 to hcom-flavour

* commit 'af4a6da163713398532df5389da58a3886ef61f2':
  Un-hardcode replica count for flyte