Handle cacheservice startup errors gracefully
The cache service requires a valid network stack, including a loopback
address. During system startup the network stack might not be fully
initialized yet, which can cause the openQA worker cache service to fail
with a message like

```
Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
```

Our systemd service definition already covers that with an automatic
restart on failure. However, we can do better within the application
code by catching such errors and retrying internally. This prevents the
error messages, makes the service available sooner than when relying on
systemd, and also works in environments without systemd, e.g. container
runtime environments.

Tested with

```
for i in {1..10}; do echo "### Run $i" && while : ; do ping -c 30 $worker && break || sleep 1; done && while :; do nc -z $worker 22 && break || sleep 1; done && sleep 30 && ssh $worker "sudo journalctl -b _SYSTEMD_UNIT=openqa-worker-cacheservice.service ; sudo reboot"; done
```

and later by simulating a not yet available network stack with

```
sudo ip addr del 127.0.0.1/8 dev lo
sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache-daemon
```

This testing previously showed that the above-mentioned error message
occurs in 50% of all boot processes on one machine used for development.

This commit catches failed application starts within the Perl code and
retries after a log message and a configurable sleep period.

Related progress issue: https://progress.opensuse.org/issues/108091
okurz committed Mar 24, 2022
1 parent b1113b6 commit 0275fba
Showing 1 changed file with 11 additions and 3 deletions.
14 changes: 11 additions & 3 deletions lib/OpenQA/CacheService.pm
@@ -130,9 +130,17 @@ sub run {
     $ENV{MOJO_INACTIVITY_TIMEOUT} //= 300;
     $app->log->debug("Starting cache service: $0 @args");
     $app->defaults->{service_pid} = $$;
-
-    my $cmd_return_code = $app->start(@args);
-    return $app->exit_code // $cmd_return_code // 0;
+    my $e;
+    my $cmd_return_code;
+    my $retry_interval = $ENV{OPENQA_CACHE_SERVICE_RETRY_INTERVAL} // 10;
+    do {
+        eval { $cmd_return_code = $app->start(@args) };
+        chomp($e = $@);
+        return $app->exit_code // $cmd_return_code // 0 unless $e;
+        die $e unless $e =~ /Cannot assign requested address/;
+        $app->log->info("cache service failed with '$e', retrying after $retry_interval seconds");
+        sleep $retry_interval;
+    } while (1);
 }
 
 1;
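
The catch-and-retry pattern from the diff above can be tried in isolation, without Mojolicious or a running cache service. The following is a minimal sketch, not the code from this commit: `start_service` is a hypothetical stand-in for `$app->start` that fails twice with the retryable error before succeeding, and `run_with_retry` is an illustrative helper name. The retry interval defaults to 0 here only so that the demonstration finishes immediately; the commit defaults to 10 seconds.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $app->start (not part of openQA): fails twice
# with the retryable error, then succeeds and returns 0.
my $attempts = 0;

sub start_service {
    $attempts++;
    die "Can't create listen socket: Cannot assign requested address\n" if $attempts < 3;
    return 0;
}

# Illustrative helper: call $start until it stops dying. Only errors that
# match $retryable are retried; any other error is rethrown to the caller.
sub run_with_retry {
    my ($start, $retry_interval, $retryable) = @_;
    while (1) {
        my $rc = eval { $start->() };
        chomp(my $e = $@);
        return $rc // 0 unless $e;
        die "$e\n" unless $e =~ $retryable;
        print "start failed with '$e', retrying after $retry_interval seconds\n";
        sleep $retry_interval;
    }
}

# Same environment variable as in the commit, but with a default of 0
# (assumption made for this demonstration only).
my $interval = $ENV{OPENQA_CACHE_SERVICE_RETRY_INTERVAL} // 0;
my $rc = run_with_retry(\&start_service, $interval, qr/Cannot assign requested address/);
print "started after $attempts attempts, return code $rc\n";
```

Matching on the error text keeps unrelated failures fatal, so a misconfiguration still terminates the process instead of looping forever.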