This repository has been archived by the owner on Oct 22, 2020. It is now read-only.

Overlord doesn't finish completion due to failed unit in fleet #3

Closed
metral opened this issue Oct 29, 2014 · 1 comment

Comments

@metral
Owner

metral commented Oct 29, 2014

Overlord waits for a node to finish running its service, but the wait never ends and the deployment does not continue.

The cause is that one or more units failed on a node, as shown by fleetctl list-units.
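
A minimal sketch of how the failed units could be picked out of that output, assuming the listing has a header row with UNIT and ACTIVE columns and that no field contains spaces. The function name failed_units and the sample text are hypothetical, made up for illustration; they are not real output from this deployment.

```python
def failed_units(list_units_output):
    """Return (unit, machine) pairs whose ACTIVE column reads 'failed'."""
    lines = [ln for ln in list_units_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Locate columns by name, since the column set differs between fleet versions.
    unit_col = header.index("UNIT")
    active_col = header.index("ACTIVE")
    machine_col = header.index("MACHINE") if "MACHINE" in header else None
    failed = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= max(unit_col, active_col):
            continue  # skip rows too short to hold both columns
        if fields[active_col] == "failed":
            machine = None
            if machine_col is not None and machine_col < len(fields):
                machine = fields[machine_col]
            failed.append((fields[unit_col], machine))
    return failed


# Hypothetical sample in the assumed column layout (not captured from a real cluster).
SAMPLE = """\
UNIT               MACHINE               ACTIVE  SUB
example-a.service  0123abcd.../10.0.0.2  active  running
example-b.service  4567ef01.../10.0.0.3  failed  failed
"""
```

On a real cluster the text would come from running fleetctl list-units, for example through subprocess.run with capture_output=True and text=True, and the MACHINE value tells you which node to log into.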

When you log into the failed node and run fleetctl status <unit_name>, it reports an error because the unit could not pull its binaries from the Internet. The node has no working network because systemd-networkd was restarted too many times and systemd gave up trying to restart it, as shown in systemctl status systemd-networkd. Each of the multiple network devices being configured requires a restart of systemd-networkd, and together those restarts are more than systemd tolerates.

This issue is intermittent: it happens every so often, but not on every deployment.

Unfortunately, the simple workaround is either to destroy the stack altogether via Heat and create a new one, or to restart the systemd-networkd unit as well as any other units on the failed nodes, and then watch the overlord logs to make sure it completed.
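
The restart half of that workaround could be sketched roughly as below, to be run as root on the failed node. The helper names recovery_commands and run_recovery are hypothetical, and the systemctl reset-failed step is an assumption added here (it clears the failed state and the start rate limit counter so a new start is accepted); the original workaround only mentions restarting.

```python
import subprocess


def recovery_commands(failed_units):
    """Build the systemctl commands for the manual workaround, in order."""
    commands = [
        # Assumption: clear the failed state first so systemd accepts a new start.
        ["systemctl", "reset-failed", "systemd-networkd.service"],
        ["systemctl", "restart", "systemd-networkd.service"],
    ]
    # Networking comes first; the failed units need it to pull their binaries.
    for unit in failed_units:
        commands.append(["systemctl", "reset-failed", unit])
        commands.append(["systemctl", "restart", unit])
    return commands


def run_recovery(failed_units, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first error."""
    commands = recovery_commands(failed_units)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling run_recovery(["example.service"]) with the default dry_run=True only prints the commands, which makes it easy to review them before running anything on the node.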

@metral
Owner Author

metral commented Nov 20, 2014

Closing this issue: the manual, hard-coded networking of all network devices and configs for the overlay has been replaced by CoreOS Flannel in commit 2ab5885, so this issue no longer exists.

@metral metral closed this as completed Nov 20, 2014