Make logs collection more resilient to failures #11996

rwx788 · 2021-02-19T16:36:25Z

We already use script_run instead of assert_script_run in most places
during log collection, so that we can gather more information even if
parts of the system are broken.
However, we call upload_logs and script_output without the flags that
let them proceed on failure, so a single failing call prevents the
remaining post_fail_hook steps from executing.

See [poo#77161](https://progress.opensuse.org/issues/77161).
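The difference between aborting on the first failed step and continuing past it can be sketched as below. This is a minimal, language-neutral model in Python, not the actual Perl testapi: `collect_logs` and the step names are hypothetical and only illustrate the control flow.

```python
def collect_logs(steps):
    """Run every (name, callable) step in order.

    A failing step is recorded and skipped instead of aborting the
    whole collection, so later steps still get a chance to run.
    """
    failed = []
    for name, step in steps:
        try:
            step()
        except Exception as err:  # deliberately broad: any failure is non-fatal here
            failed.append((name, str(err)))
    return failed


if __name__ == "__main__":
    def upload_broken():
        raise RuntimeError("upload failed: network unreachable")

    collected = []
    steps = [
        ("upload journal", upload_broken),
        ("save dmesg", lambda: collected.append("dmesg")),
    ]
    print(collect_logs(steps))  # the failed step is reported
    print(collected)            # the later step still ran
```

With a fatal first step, the second step would never have been reached; here its result is still available to the reviewer.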

Verification runs

Here we can see that we even have a mechanism to recover the network in trivial cases (I added a command to turn off the interface). Previously we would not even attempt it, because one of the upload commands would fail first.


For the generic case, we do not want to stop post_fail_hook execution if coredumps cannot be collected. By introducing an additional parameter, we can select the behavior, because the subroutine is also used in a test module, where we expect it to fail in case of issues.
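The idea of one subroutine with a caller-selected failure mode could look roughly like this. It is a Python sketch under assumptions: the function name `collect_coredumps`, the parameter name `proceed_on_failure`, and the `list_coredumps` callable are hypothetical and do not reproduce the actual change in lib/opensusebasetest.pm.

```python
def collect_coredumps(list_coredumps, proceed_on_failure=False):
    """Return the collected coredumps.

    By default a failure propagates to the caller (what a test module
    wants). With proceed_on_failure=True the failure is only reported,
    so a post-fail hook can carry on with its remaining steps.
    """
    try:
        return list_coredumps()
    except Exception as err:
        if not proceed_on_failure:
            raise
        print(f"coredump collection failed, continuing: {err}")
        return []
```

A post-fail hook would pass `proceed_on_failure=True`, while the test module that checks coredump handling keeps the default and fails loudly.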

jknphy

LGTM

lib/opensusebasetest.pm

okurz · 2021-02-23T09:07:04Z

I wonder if we should make every call within the post_fail_hook non-fatal to try to continue but then again a lot more failed calls can be confusing to test reviewers as well, hm …

rwx788 · 2021-02-23T10:42:57Z

> I wonder if we should make every call within the post_fail_hook non-fatal to try to continue but then again a lot more failed calls can be confusing to test reviewers as well, hm …

I had the same idea, but I believe there are cases where we want to stop: when something is terribly wrong and we know that continuing is a waste of time, e.g. in case of an explicit die. From my experience, a good use case is when we have multiple post fail hooks (the parent one first and then another one). Ideally, if one of them fails, we stop executing that hook and attempt the next one. That would provide granularity, because with the solution I've provided now we can spend more time and still not be able to collect any logs (the risk is there). Consequently, I would keep the possibility to stop trying or to skip some steps when we are sure that we cannot collect any valuable information. For instance, do not execute multiple coredumpctl commands when we know that the binary is not available.
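The per-hook granularity described here might be modeled as follows. This is a hedged Python sketch, not openQA code: `HookAborted`, `run_hooks`, and `coredump_hook` are hypothetical names, and the availability check uses the standard `shutil.which` as a stand-in for whatever check the real hook would perform.

```python
import shutil


class HookAborted(Exception):
    """Raised by a hook that knows further attempts are a waste of time."""


def run_hooks(hooks):
    """Run hooks in order; a failing hook stops itself, not the hooks after it."""
    outcome = {}
    for hook in hooks:
        try:
            hook()
            outcome[hook.__name__] = "ok"
        except Exception as err:
            outcome[hook.__name__] = f"aborted: {err}"
    return outcome


def coredump_hook(which=shutil.which):
    """Skip all coredumpctl steps up front when the binary is missing."""
    if which("coredumpctl") is None:
        raise HookAborted("coredumpctl not available, skipping coredump steps")
    # the individual coredumpctl invocations would follow here
```

An explicit abort inside one hook ends only that hook, so the next hook in the list is still attempted.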

Rodion Iafarov added 2 commits February 19, 2021 17:35

jknphy reviewed Feb 22, 2021


rwx788 merged commit 9770235 into os-autoinst:master Feb 22, 2021
