Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make logs collection more resilient to failures #11996

Merged
merged 2 commits into from Feb 22, 2021
Merged

Conversation

rwx788
Copy link
Member

@rwx788 rwx788 commented Feb 19, 2021

We already use script_run in most of the places instead of
assert_script_run in case of logs collection to attempt gathering more
information even if parts of the system might be broken.
However, we use upload_logs and script_output calls without flags to
proceed on failure, which prevent further post_fail_hook steps from the
execution.

See poo#77161.

Verification runs

Here we can see that we even have a mechanism to recover network in trivial cases (I've added command to turn off the interface), but previously we would not even attempt it, as one of the upload commands will fail.

Rodion Iafarov added 2 commits February 19, 2021 17:35
We already use `script_run` in most of the places instead of
`assert_script_run` in case of logs collection to attempt gathering more
information even if parts of the system might be broken.
However, we use `upload_logs` and `script_output` calls without flags to
proceed on failure, which prevent further post_fail_hook steps from the
execution.

See [poo#77161](https://progress.opensuse.org/issues/77161).
For the generic case, we do not want to stop post_fail_hook execution if
coredumps cannot be collected. By introducing additional parameter, we
can select the behavior, as subroutine is also used in the testing
module, where we expect it to fail in case of issues.
Copy link
Contributor

@jknphy jknphy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

lib/opensusebasetest.pm Show resolved Hide resolved
@rwx788 rwx788 merged commit 9770235 into os-autoinst:master Feb 22, 2021
@okurz
Copy link
Member

okurz commented Feb 23, 2021

I wonder if we should make every call within the post_fail_hook non-fatal to try to continue but then again a lot more failed calls can be confusing to test reviewers as well, hm …

@rwx788
Copy link
Member Author

rwx788 commented Feb 23, 2021

I wonder if we should make every call within the post_fail_hook non-fatal to try to continue but then again a lot more failed calls can be confusing to test reviewers as well, hm …

I also had same idea, but I believe that there are cases where we want to stop, if something is terribly wrong and we know that it's waste of time. E.g. in case of explicit die. From my experience, good use case is if we can multiple post fail hooks (parent one first and then other one), it would be ideal if we fail from the one, we stop execution of the first one and attempt the next one. It will provide granularity, as with the solution I've provided now we can spend more time and still not be able to collect any logs (risk is there). Consequently, I would leave a possibility to stop trying/skip some steps in case we are sure that we cannot collect any valuable information. For instance, do not execute multiple coredumpctl commands when we know that binary is not available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants