
5.15.55: Integrity verification failed (no details) #208

Closed
vt-alt opened this issue Jul 20, 2022 · 31 comments
Labels
duplicate (This issue or pull request already exists), portability

Comments

@vt-alt (Contributor) commented Jul 20, 2022

End users are reporting LKRG-induced kernel panics: https://bugzilla.altlinux.org/43005

(screenshot attached: IMG_20220720_090515)

I only wonder why LKRG does not report the reason for the kill beyond the very generic "Integrity verification failed". This means we will never get even the slightest hint of the real cause.
The package maintainer does not seem to be escalating bug reports, perhaps because there is nothing specific to relay.

@solardiz (Contributor) commented Jul 20, 2022

I think this is #192, so it's already fixed in our git. If ALT is packaging 5.15.40+, you need to include that patch (edit: I mean the one in PR #193) in your LKRG package.

As to the lack of log messages prior to the kernel panic, that's nasty. My guess is the console log level was reduced to include only messages above a certain severity. Does ALT do that by default? IIRC, kernel panic uses EMERG severity. Other LKRG messages are at most CRIT.
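For reference, a generic way to check and temporarily raise the console log level (a sketch with standard procps/util-linux tools, not something specific to ALT's setup):

cat /proc/sys/kernel/printk   # first value is the console log level; EMERG is 0, CRIT is 2
dmesg -n 8                    # temporarily let all severities through to the console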

@solardiz added the duplicate (This issue or pull request already exists) and portability labels on Jul 20, 2022
@wladmis (Contributor) commented Jul 20, 2022

Hi!

For some reason this panic was not triggered during the package build tests. I would like to enhance those tests so that this kind of issue could be detected early, but I have no idea what to add. Could you please suggest something?

@solardiz (Contributor)

@wladmis Maybe you need to add sysctl lkrg.trigger=1. Otherwise it could take up to a 15-second wait until the integrity check, and maybe your test system isn't staying up that long?

@wladmis (Contributor) commented Jul 20, 2022

@solardiz There was sysctl lkrg.trigger=1 (it was done by opening /proc/sys/lkrg/trigger and writing 1\n to it). I increased the timeout to 20 seconds, but there was still no panic.

@solardiz (Contributor)

@wladmis Then I guess nothing triggered an update of the jump labels during this time. I'd try (un)loading other modules while LKRG is already loaded.

@vt-alt (Contributor, Author) commented Jul 20, 2022

Yes, the boot test should have caught this.

@vt-alt (Contributor, Author) commented Jul 20, 2022

Interestingly, the user also reports losing the rootfs (on the next boot) after the panic - can this be due to not syncing?

@solardiz (Contributor)

A kernel panic is indeed not good for filesystem health, but it's unexpected for the filesystem to be practically lost as a result of that. What fs type was that?

@solardiz (Contributor)

I wonder if we have any of the affected kernels (5.15.40+) in our CI - do we? If we happen to have them and we didn't trigger the issue in CI, then we probably need to add something to ensure it would have been triggered (and test that in a fork by removing the fix). If we don't have them, I guess that's fine - no action needed.

@Adam-pi3 (Collaborator) commented Jul 21, 2022

We can easily force TRACEPOINT / JUMP_LABEL activity by executing:
echo 1 > /sys/kernel/debug/tracing/events/enable

When the command finishes the job, we should run:
sysctl lkrg.trigger=1

Later we should disable tracing via:
echo 0 > /sys/kernel/debug/tracing/events/enable

and re-run sysctl lkrg.trigger=1.

Although the sequence of such commands might take some time...
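For a build-test script, the whole sequence could be wrapped up roughly like this (a sketch; it assumes debugfs is mounted at /sys/kernel/debug and that the lkrg sysctl tree is present, i.e. the module is loaded):

mountpoint -q /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug  # assumption: debugfs may need mounting
echo 1 > /sys/kernel/debug/tracing/events/enable  # enable all trace events (forces jump label patching)
sysctl lkrg.trigger=1                             # integrity check with the events patched in
echo 0 > /sys/kernel/debug/tracing/events/enable  # disable them again (patches the code back)
sysctl lkrg.trigger=1                             # integrity check after patching back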

@wladmis (Contributor) commented Jul 21, 2022

For now, even with this advice, I have had no success triggering a KP during the build test. I guess it is because the test is quite simple compared to a real booting system, and not many things happen in it.

@solardiz (Contributor)

@wladmis Sorry to ask the obvious, but are you sure you're testing a problematic combination of kernel (5.15.40+, yet still in the 5.15.x branch) and LKRG (without our recent fix in #193)? And is this the same kernel configuration that's in the user report?

@wladmis (Contributor) commented Jul 21, 2022

Yes, I'm sure. It is LKRG 0.9.3 and kernel version 5.15.55 (not exactly the same version, but the kernel config should be the same). Of course, the user's system configuration is quite different from the test one.

@Adam-pi3 (Collaborator)

@wladmis I think @solardiz is referring not to LKRG 0.9.3 but to the git TOT, since it includes fixes related to JUMP_LABEL.

@wladmis (Contributor) commented Jul 21, 2022

No, it does not include those fixes.

@solardiz (Contributor)

@Adam-pi3 Actually, @wladmis did answer my question - LKRG 0.9.3 is obviously old enough that it doesn't yet contain the fix, so the issue could have been triggered. However, as you pointed out to me in our private discussion, it is not surprising the issue isn't easy to trigger reliably. Maybe @wladmis needs to add loading and unloading of many other modules to trigger it.

@Adam-pi3 (Collaborator)

It is not so easy to generate that specific type of patching (JMP8). One way that I can think of is to load and unload a lot of kernel modules and in the meantime execute sysctl lkrg.trigger=1.
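A rough sketch of such a loop for the build test (the module name is only a placeholder for anything that is safe to repeatedly load and unload on the test system):

for i in $(seq 1 50); do
    modprobe loop            # placeholder module - pick one that is not built-in and not in use
    sysctl lkrg.trigger=1    # integrity check right after the module was patched in
    modprobe -r loop
    sysctl lkrg.trigger=1    # and again after it was removed
done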

@vt-alt (Contributor, Author) commented Jul 22, 2022

What fs type was that?

The user said (in the downstream bug report) that it's XFS.

(screenshot attached: IMG_20220721_093543)

He had to run xfs_repair -L to clear the journal (replaying the journal did not work), which also reported dangling inodes and relocation errors, and then run apt-get --reinstall, etc.
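Roughly, the recovery path described (device and package names are placeholders; note that xfs_repair -L discards the log, so recent metadata updates are lost):

xfs_repair -L /dev/sdXN                          # zero the log and repair the filesystem
apt-get install --reinstall <damaged-packages>   # restore files that ended up corrupted or disconnected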

I think LKRG should not destroy users' systems.

@solardiz (Contributor)

I think LKRG should not destroy users' systems.

Indeed, but do you suggest any specific change? Disable kernel integrity enforcement by default? Force disk sync before panic?

Of course, you'll soon have this specific LKRG/kernel incompatibility fixed by using our latest code, but issues like this can reoccur.

I also think this indicates XFS is somehow problematic. A simple kernel panic shouldn't cause any metadata corruption, and a journaling filesystem should be able to recover from it on its own. Maybe there's some mount option or the like to make XFS safer in this respect, but I couldn't find any now.

@wladmis (Contributor) commented Jul 22, 2022

I also think this indicates XFS is somehow problematic. A simple kernel panic shouldn't cause any metadata corruption, and a journaling filesystem should be able to recover from it on its own. Maybe there's some mount option or the like to make XFS safer in this respect, but I couldn't find any now.

They say it is known that XFS handles sudden power-off very badly (an example), and the KP case should not be any better.

@solardiz (Contributor)

@wladmis Those examples tend to be 10+ years old or involve VMs (so filesystem-on-filesystem), and the typical effect is loss of recent or not-so-recent writes - not metadata corruption.

@vt-alt (Contributor, Author) commented Jul 28, 2022

There is also a similar kernel panic report for 0.9.2.0.1.git10ba314-alt1.331577.1 (by ALT QA), so the bug may have been introduced earlier than v0.9.3.

(screenshot attached: testing-1)

@solardiz (Contributor)

the bug may have been introduced earlier than v0.9.3.

In our understanding, there wasn't a bug - rather, the kernel proceeded to make changes that LKRG did not expect, so we had to adapt, which we did for mainline (5.17+ at the time) between 0.9.2 and 0.9.3, and for 5.15.40+ (which didn't exist at the time of the 0.9.3 release) between 0.9.3 and 0.9.4.

Now you've shared a screenshot showing an issue with 0.9.2 on 5.10.133. This could be something else entirely. It is unfortunate that you seem to have a default console log level such that KERN_CRIT messages are not seen - apparently only KERN_EMERG are? Is this something you're fixing in ALT?

BTW, I kept this issue open not because it still needed tracking/work from LKRG project, but just for the ongoing discussion. We are not currently aware of anything to fix in LKRG here. If you see this differently, please say so (and explain). Thanks!

@solardiz (Contributor)

Oh, I see 5.10.133 is actually the latest in that series, and is very recent. Now we need to see whether they possibly back-ported that same change, and if so we will in fact need to update our code.

@vt-alt (Contributor, Author) commented Jul 28, 2022

It is unfortunate that you seem to have a default console log level such that KERN_CRIT messages are not seen - apparently only KERN_EMERG are? Is this something you're fixing in ALT?

Yes, it's unfortunate. I think they use the default settings for the distribution, which are set by people who are neither aware of nor care about needs other than making it look good for users. So in the future we may get more such abbreviated and incomprehensible reports from end users if LKRG doesn't log at the highest log level (which is what I suggest).

@solardiz (Contributor)

I think we can't reasonably log everything at the highest severity level, but we can upgrade what we currently log as CRIT to EMERG. What do you think, @Adam-pi3?

@solardiz (Contributor)

Just confirmed that https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.133 mentions a back-port of the "problematic" change. We'll need to adapt to that, re-check other kernel branches, and maybe release 0.9.5 shortly. Ideally, we'd come up with and implement a way to detect that change other than via kernel version number, but that's tricky.
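For re-checking the other kernel branches, a quick (and assumed) shortcut is to grep the stable changelogs for jump label commits, given that the fixes in #193 relate to JUMP_LABEL handling:

curl -s https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.133 | grep -i 'jump_label'   # repeat for other branch changelogs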

@wladmis (Contributor) commented Oct 11, 2022 via email

@solardiz (Contributor)

@wladmis In a comment above, @Adam-pi3 suggested:

load and unload a lot of kernel modules and in the meantime execute sysctl lkrg.trigger=1

We don't currently have a better idea.

@vt-alt (Contributor, Author) commented Oct 14, 2022

This does not look hard (looping insmod/rmmod). @wladmis, if you succeed in reproducing the bug with such a test, please report it. We may want to add this to CI.

@wladmis (Contributor) commented Oct 24, 2022

Ok
