New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PART 2: Remove undesired Wants=network in systemd + graceful cacheservice start - after #4577 #4573
Conversation
f4319fc
to
cfbc323
Compare
Codecov Report
@@ Coverage Diff @@
## master #4573 +/- ##
==========================================
- Coverage 98.05% 97.98% -0.08%
==========================================
Files 374 375 +1
Lines 34452 34407 -45
==========================================
- Hits 33781 33712 -69
- Misses 671 695 +24
Continue to review full report at Codecov.
|
Cannot judge the perl code but the rest looks fine :) |
lib/OpenQA/CacheService.pm
Outdated
my $e; | ||
my $cmd_return_code; | ||
do { | ||
eval { $cmd_return_code = $app->start(@args) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eval { $cmd_return_code = $app->start(@args) }; | |
my $cmd_return_code = eval { $cmd_return_code = $app->start(@args) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know what's expected with async responses here? Mojolicious::startup doesn't seem to document that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think start
blocks for the whole time the service is running. We likely don't need to care about anything happening asynchronously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see anything happing asynchronously here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on what command is executed by start
. By default there are no async commands included though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be using around_dispatch instead? That seems to be the "native" approach to exception handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering whether invoking start
a 2nd time will ensure the event loop is reset. If not, I suppose that would be a good idea to do to avoid events from the previous run being processed on the next start
invocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kalikiana You might be confusing this with another exception handling issue. This one is during app server startup, when the server can't bind to the socket because the interfaces are not ready yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering whether invoking
start
a 2nd time will ensure the event loop is reset. If not, I suppose that would be a good idea to do to avoid events from the previous run being processed on the nextstart
invocation.
Good point, i think this is currently undefined behaviour.
I still believe that this issue should be resolved at the systemd level, and does not require code workarounds. |
I am open to suggestions how you want to handle that better in systemd. But we can actually separate the concerns here, so moved the systemd changes only to #4577 |
cfbc323
to
0275fba
Compare
When I understand https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget correctly, it should be enough to use Apparently this is also relying on support from the networking manager, in our case Wicked. So that could still be a source of error. Btw, NGINX is using
under Arch Linux which is kind of in-line with https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget. Under Tumbleweed, NGINX is using
which has Maybe just go with
or even just
? |
Ok, my last comment is actually just doing #4577 which I've just approved. |
You mean #4577 is is doing what you proposed already? What this PR is doing on top is fixing the error message that can be seen in https://progress.opensuse.org/issues/108091#note-4 for openqa-worker-cacheservice.service |
0275fba
to
eca0dd0
Compare
The cache service requires a valid network stack including a loopback address. On a system startup the network stack might not be fully initialized yet which can cause the openQA worker cache service to fail with a message like ``` Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124. ``` Our systemd service definition already covers that with automatic restart on failure however we can do better than that already within the application code by catching such errors and retrying internally. This prevents the error messages, ensures the service to be available sooner than when relying on systemd and also works in system environments without systemd present, e.g. container runtime environments. Tested with ``` for i in {1..10}; do echo "### Run $i" && while : ; do ping -c 30 $worker && break || sleep 1; done && while :; do nc -z $worker 22 && break || sleep 1; done && sleep 30 && ssh $worker "sudo journalctl -b _SYSTEMD_UNIT=openqa-worker-cacheservice.service ; sudo reboot"; done ``` and later by simulating a not yet available network stack with ``` sudo ip addr del 127.0.0.1/8 dev lo sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache-daemon ``` which previously showed that the above mentioned error message happens on one machine used for development in 50% of all boot processes. This commit catches failed application starts within the Perl code and retries after a log message and configurable sleep period Related progress issue: https://progress.opensuse.org/issues/108091
eca0dd0
to
fde81a3
Compare
As decided in the weekly unblock of SUSE QE Tools we decided that we do not want to solve this within the application code and the majority decided that we should rely on system service management frameworks to ensure the presence of network. |
After
systemd: Remove undesired Wants=network.target (boo#1196359) #4577
Handle cacheservice startup errors gracefully
systemd: Remove undesired Wants=network.target (boo#1196359)
See: https://progress.opensuse.org/issues/108091