CoreOS: Docker fail to restart with missing /etc/sysconfig/docker file #4747

Closed
ROunofF opened this issue Mar 21, 2018 · 6 comments

ROunofF commented Mar 21, 2018

  1. kops 1.8.1
  2. kubernetes 1.8.6
  3. AWS
  4. Any node or master startup (e.g. after kops edit ig nodes), about 5-10% of the time
  5. Docker fails to start on CoreOS during nodeup. The failure happens when nodeup restarts docker after adding /etc/sysconfig/docker as an EnvironmentFile.

If we ssh into the node and start docker manually, all services will eventually restart and the node will join the cluster successfully.

  1. Docker restart succeeds and nodeup finishes :)
  2. N/A
  3. N/A
  4. More information:

We have tracked this down to PR #3134.

It looks like the OnChangeExecute command sometimes runs before the file /etc/sysconfig/docker has been written, so docker fails to restart.

Here's the entry in journalctl:

Mar 21 10:17:19 <snip> nodeup[898]: I0321 10:17:19.112283     898 file.go:284] Changed; will execute OnChangeExecute command: "systemctl restart docker.service"
Mar 21 10:17:19 <snip> systemd[1]: Starting Wait for Network to be Configured...
Mar 21 10:17:19 <snip> systemd[1]: Started Containerd Container Daemon.
Mar 21 10:17:19 <snip> systemd[1]: Closed Docker Socket for the API.
Mar 21 10:17:19 <snip> systemd[1]: Stopping Docker Socket for the API.
Mar 21 10:17:19 <snip> systemd[1]: Starting Docker Socket for the API.
Mar 21 10:17:19 <snip> systemd[1]: Listening on Docker Socket for the API.
Mar 21 10:17:19 <snip> systemd[1]: Started Garbage Collection for rkt.
Mar 21 10:17:19 <snip> systemd-networkd-wait-online[943]: ignoring: lo
Mar 21 10:17:19 <snip> systemd[1]: Started Wait for Network to be Configured.
Mar 21 10:17:19 <snip> systemd[1]: Reached target Network is Online.
Mar 21 10:17:19 <snip> systemd[1]: docker.service: Failed to load environment files: No such file or directory
Mar 21 10:17:19 <snip> systemd[1]: docker.service: Failed to run 'start' task: No such file or directory
Mar 21 10:17:19 <snip> systemd[1]: Failed to start Docker Application Container Engine.
Mar 21 10:17:19 <snip> systemd[1]: docker.service: Unit entered failed state.
Mar 21 10:17:19 <snip> systemd[1]: docker.service: Failed with result 'resources'.

We would gladly fix and test this, but we are unsure what the best approach is for kops at this point; see #9 for a couple of solutions we put forward internally.

Solutions that came up while discussing this internally:

  • Move the call to buildSysconfig in https://github.com/kubernetes/kops/blame/master/nodeup/pkg/model/docker.go#L730 to L714? We believe that, because the tasks are stored in a map, this won't guarantee the execution order.
  • Add an ExecStartPre that checks for the file /etc/sysconfig/docker to exist (e.g.: /bin/sh -cx "declare -i TRY=0;while [ $TRY -lt 5 ]; do if [ -f /etc/sysconfig/docker ]; then break 1;fi; sleep 1; TRY=$TRY+1; done" or something like this?)
  • Add a dependency from this task to the other task? It looks like this is not possible with the current Dependency; should we aim for this?
@louismunro
Contributor

See PR #4760 for a possible solution.

@chrislovecnm
Contributor

/cc @KashifSaadat @gambol99

@louismunro
Contributor

I'd love to add unit tests too, when I have a minute.
I'll probably work on it over the weekend, unless someone tells me I'm wasting my time.

@gambol99
Contributor

Sounds odd ... and admittedly we've not seen any behavior like this ... which version of CoreOS are you running? ... strange that Restart=on-failure doesn't fix it either

@louismunro
Contributor

We've seen this on stable (1520.8.0) with some (unpredictable) regularity.
Of course the image updates itself eventually, but by then the failure has already occurred.

I think Restart=on-failure does not cover systemd itself failing to find the EnvironmentFile.
It only restarts the docker process if it has failed.

@justinsb
Member

I think the PR makes a ton of sense and would love to get it merged (just needs CLA). There might be more to it, but this feels right & the right fix.
