Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PART 2: Remove undesired Wants=network in systemd + graceful cacheservice start - after #4577 #4573

Closed
wants to merge 1 commit into from

Conversation

okurz
Copy link
Member

@okurz okurz commented Mar 21, 2022

After

See: https://progress.opensuse.org/issues/108091

@codecov
Copy link

codecov bot commented Mar 21, 2022

Codecov Report

Merging #4573 (0275fba) into master (63e6a01) will decrease coverage by 0.07%.
The diff coverage is 66.66%.

❗ Current head 0275fba differs from pull request most recent head fde81a3. Consider uploading reports for the commit fde81a3 to get more accurate results

@@            Coverage Diff             @@
##           master    #4573      +/-   ##
==========================================
- Coverage   98.05%   97.98%   -0.08%     
==========================================
  Files         374      375       +1     
  Lines       34452    34407      -45     
==========================================
- Hits        33781    33712      -69     
- Misses        671      695      +24     
Impacted Files Coverage Δ
lib/OpenQA/CacheService.pm 96.51% <66.66%> (-3.49%) ⬇️
lib/OpenQA/Setup.pm 98.48% <0.00%> (-1.52%) ⬇️
lib/OpenQA/Schema/Result/Jobs.pm 98.38% <0.00%> (-0.30%) ⬇️
lib/OpenQA/Script/CloneJob.pm 96.98% <0.00%> (-0.14%) ⬇️
lib/OpenQA/Schema/Result/ScheduledProducts.pm 98.88% <0.00%> (-0.02%) ⬇️
lib/OpenQA/Worker/Engines/isotovideo.pm 97.16% <0.00%> (-0.02%) ⬇️
t/api/02-iso.t 99.20% <0.00%> (-0.01%) ⬇️
t/03-auth.t 100.00% <0.00%> (ø)
t/10-jobs.t 100.00% <0.00%> (ø)
t/01-style.t 100.00% <0.00%> (ø)
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 63e6a01...fde81a3. Read the comment docs.

@nicksinger
Copy link
Member

Cannot judge the perl code but the rest looks fine :)

lib/OpenQA/CacheService.pm Show resolved Hide resolved
lib/OpenQA/CacheService.pm Outdated Show resolved Hide resolved
my $e;
my $cmd_return_code;
do {
eval { $cmd_return_code = $app->start(@args) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
eval { $cmd_return_code = $app->start(@args) };
my $cmd_return_code = eval { $cmd_return_code = $app->start(@args) };

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know what's expected with async responses here? Mojolicious::startup doesn't seem to document that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think start blocks for the whole time the service is running. We likely don't need to care about anything happening asynchronously.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything happing asynchronously here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on what command is executed by start. By default there are no async commands included though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using around_dispatch instead? That seems to be the "native" approach to exception handling

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether invoking start a 2nd time will ensure the event loop is reset. If not, I suppose that would be a good idea to do to avoid events from the previous run being processed on the next start invocation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kalikiana You might be confusing this with another exception handling issue. This one is during app server startup, when the server can't bind to the socket because the interfaces are not ready yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether invoking start a 2nd time will ensure the event loop is reset. If not, I suppose that would be a good idea to do to avoid events from the previous run being processed on the next start invocation.

Good point, i think this is currently undefined behaviour.

@kraih
Copy link
Member

kraih commented Mar 24, 2022

I still believe that this issue should be resolved at the systemd level, and does not require code workarounds.

@okurz
Copy link
Member Author

okurz commented Mar 24, 2022

I still believe that this issue should be resolved at the systemd level, and does not require code workarounds.

I am open to suggestions how you want to handle that better in systemd. But we can actually separate the concerns here, so moved the systemd changes only to #4577

@okurz okurz marked this pull request as draft March 24, 2022 11:28
@okurz okurz changed the title Remove undesired Wants=network in systemd + graceful cacheservice start PART 2: Remove undesired Wants=network in systemd + graceful cacheservice start - after #4577 Mar 24, 2022
@Martchus
Copy link
Contributor

I am open to suggestions how you want to handle that better in systemd.

When I understand https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget correctly, it should be enough to use After=network.target (and Wants=/Requires= should not be used, indeed).

Apparently this is also relying on support from the networking manager, in our case Wicked. So that could still be a source of error.

Btw, NGINX is using

After=network.target network-online.target nss-lookup.target

under Arch Linux which is kind of in-line with https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget.

Under Tumbleweed, NGINX is using

After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target

which has Wants= which is likely not necessary.

Maybe just go with

After=network.target network-online.target nss-lookup.target

or even just

After=network.target nss-lookup.target

?

@Martchus
Copy link
Contributor

Ok, my last comment is actually just doing #4577 which I've just approved.

@okurz
Copy link
Member Author

okurz commented Mar 26, 2022

Ok, my last comment is actually just doing #4577 which I've just approved.

You mean #4577 is is doing what you proposed already?

What this PR is doing on top is fixing the error message that can be seen in https://progress.opensuse.org/issues/108091#note-4 for openqa-worker-cacheservice.service

The cache service requires a valid network stack including a loopback
address. On a system startup the network stack might not be fully
initialized yet which can cause the openQA worker cache service to fail
with a message like

```
Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
```

Our systemd service definition already covers that with automatic
restart on failure however we can do better than that already within the
application code by catching such errors and retrying internally. This
prevents the error messages, ensures the service to be available sooner
than when relying on systemd and also works in system environments
without systemd present, e.g. container runtime environments.

Tested with

```
for i in {1..10}; do echo "### Run $i" && while : ; do ping -c 30 $worker && break || sleep 1; done && while :; do nc -z $worker 22 && break || sleep 1; done && sleep 30 && ssh $worker "sudo journalctl -b _SYSTEMD_UNIT=openqa-worker-cacheservice.service ; sudo reboot"; done
```

and later by simulating a not yet available network stack with

```
sudo ip addr del 127.0.0.1/8 dev lo
sudo -u _openqa-worker /usr/share/openqa/script/openqa-workercache-daemon
```

which previously showed that the above mentioned error message happens
on one machine used for development in 50% of all boot processes.

This commit catches failed application starts within the Perl code and
retries after a log message and configurable sleep period

Related progress issue: https://progress.opensuse.org/issues/108091
@okurz
Copy link
Member Author

okurz commented May 4, 2022

As decided in the weekly unblock of SUSE QE Tools we decided that we do not want to solve this within the application code and the majority decided that we should rely on system service management frameworks to ensure the presence of network.

@okurz okurz closed this May 4, 2022
@okurz okurz deleted the feature/network_target branch May 4, 2022 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants