Commit
Handle cacheservice startup errors gracefully
The cache service requires a valid network stack, including a loopback address. During system startup the network stack might not be fully initialized yet, which can cause the openQA worker cache service to fail with a message like

```
Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
```

Our systemd service definition already covers this with an automatic restart on failure. However, we can do better within the application code by catching such errors and retrying internally. This prevents the error messages, makes the service available sooner than when relying on systemd, and also works in environments without systemd, e.g. container runtime environments.

Tested with

```
for i in {1..10}; do
    echo "### Run $i" &&
    # wait until the worker answers pings again after the reboot
    while :; do ping -c 30 "$worker" && break || sleep 1; done &&
    # wait until SSH is reachable
    while :; do nc -z "$worker" 22 && break || sleep 1; done &&
    sleep 30 &&
    # show the cache service log of the current boot, then reboot
    ssh "$worker" "sudo journalctl -b _SYSTEMD_UNIT=openqa-worker-cacheservice.service ; sudo reboot"
done
```

and later by simulating a not yet available network stack with

```
sudo ip addr del 127.0.0.1/8 dev lo
sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache-daemon
```

Before this change, the error message above showed up in 50% of all boot processes on one machine used for development.

This commit catches failed application starts within the Perl code and retries after a log message and a configurable sleep period.

Related progress issue: https://progress.opensuse.org/issues/108091
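For illustration only, the catch-and-retry idea described above can be sketched as follows. The actual change lives in the Perl code of the cache service. This Python sketch is not that code, and all names in it (`start_with_retry`, `retry_delay`, `max_attempts`, `make_flaky_start`) are hypothetical. The optional attempt limit is an assumption added for the example and is not something this commit message describes.

```python
import logging
import time

log = logging.getLogger("cache-service-sketch")


def start_with_retry(start, retry_delay=1.0, max_attempts=None, sleep=time.sleep):
    """Call start() until it succeeds, logging and sleeping after each failure.

    start        -- callable that raises OSError while the network is not ready
    retry_delay  -- configurable sleep period between attempts, in seconds
    max_attempts -- hypothetical upper bound; None means retry indefinitely
    sleep        -- injectable sleep function, which makes the loop easy to test
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return start()
        except OSError as err:
            # Give up only if a limit was configured and has been reached.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            log.warning(
                "Startup failed (attempt %d): %s, retrying in %s s",
                attempt, err, retry_delay,
            )
            sleep(retry_delay)


def make_flaky_start(failures):
    """Return a stand-in start function that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def start():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise OSError("Can't create listen socket")
        return "started"

    return start


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Simulates a loopback address that becomes available on the third attempt.
    print(start_with_retry(make_flaky_start(2), retry_delay=0.1))
```

Passing the sleep function in as a parameter keeps the retry loop independent of real time, so the behavior can be checked without waiting for an actual reboot.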