
5.15.55: Integrity verification failed (no details) #208

Closed
vt-alt opened this issue Jul 20, 2022 · 31 comments
Labels
duplicate (This issue or pull request already exists), portability

Comments

@vt-alt (Contributor) commented Jul 20, 2022

End users are reporting LKRG-induced kernel panics: https://bugzilla.altlinux.org/43005

(screenshot attached: IMG_20220720_090515)

I only wonder why LKRG does not report the reason for the kill beyond the very generic "Integrity verification failed". This means we will never get even the slightest hint of the real cause.
The package maintainer does not seem to be escalating bug reports, perhaps because there is nothing specific to relay.

@solardiz (Contributor) commented Jul 20, 2022

I think this is #192, so it's already fixed in our git. If ALT is packaging 5.15.40+, you need to include that patch (edit: I mean the one in PR #193) in your LKRG package.

As to the lack of log messages prior to the kernel panic, that's nasty. My guess is the console log level was reduced to include only messages above a certain severity. Does ALT do that by default? IIRC, kernel panic uses EMERG severity. Other LKRG messages are at most CRIT.
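For reference, a generic way to check and temporarily raise the console log level (a sketch with standard procps/util-linux tools, not something specific to ALT's setup):

cat /proc/sys/kernel/printk   # first value is the console log level; EMERG is 0, CRIT is 2
dmesg -n 8                    # temporarily let all severities through to the console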

@solardiz added the duplicate (This issue or pull request already exists) and portability labels on Jul 20, 2022
@wladmis (Contributor) commented Jul 20, 2022

Hi!

For some reason this panic was not triggered during the package build tests. I would like to enhance those tests so that this kind of issue could be detected early, but I have no idea what to add. Could you please suggest something?

@solardiz (Contributor)

@wladmis Maybe you need to add sysctl lkrg.trigger=1. Otherwise it could take up to a 15-second wait until the integrity check, and maybe your test system isn't staying up that long?

@wladmis (Contributor) commented Jul 20, 2022

@solardiz There was sysctl lkrg.trigger=1 (it was done by opening /proc/sys/lkrg/trigger and writing 1\n to it). I increased the timeout to 20 seconds, but there was still no panic.

@solardiz (Contributor)

@wladmis Then I guess nothing triggered an update of the jump labels during this time. I'd try (un)loading other modules while LKRG is already loaded.

@vt-alt (Contributor, Author) commented Jul 20, 2022

Yes, the boot test should have caught this.

@vt-alt (Contributor, Author) commented Jul 20, 2022

Interestingly, the user also reports losing the rootfs (on the next boot) after the panic - can this be due to not syncing?

@solardiz (Contributor)

A kernel panic is indeed not good for filesystem health, but it's unexpected for the filesystem to be practically lost as a result of that. What fs type was that?

@solardiz (Contributor)

I wonder if we have any of the affected kernels (5.15.40+) in our CI - do we? If we happen to have them and we didn't trigger the issue in CI, then we probably need to add something to ensure it would have been triggered (and test that in a fork by removing the fix). If we don't have them, I guess that's fine - no action needed.

@Adam-pi3 (Collaborator) commented Jul 21, 2022

We can easily force TRACEPOINT / JUMP_LABEL activity by executing:
echo 1 > /sys/kernel/debug/tracing/events/enable

When the command finishes the job, we should run:
sysctl lkrg.trigger=1

Later we should disable tracing via:
echo 0 > /sys/kernel/debug/tracing/events/enable

and re-run sysctl lkrg.trigger=1.

Although the sequence of such commands might take some time...
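For a build-test script, the whole sequence could be wrapped up roughly like this (a sketch; it assumes debugfs is mounted at /sys/kernel/debug and that the lkrg sysctl tree is present, i.e. the module is loaded):

mountpoint -q /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug  # assumption: debugfs may need mounting
echo 1 > /sys/kernel/debug/tracing/events/enable  # enable all trace events (forces jump label patching)
sysctl lkrg.trigger=1                             # integrity check with the events patched in
echo 0 > /sys/kernel/debug/tracing/events/enable  # disable them again (patches the code back)
sysctl lkrg.trigger=1                             # integrity check after patching back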

@wladmis (Contributor) commented Jul 21, 2022

For now, even with this advice, I have had no success triggering a KP during the build test. I guess it is because the test is quite simple compared to a real booting system, and not many things happen in it.

@solardiz (Contributor)

@wladmis Sorry to ask the obvious, but are you sure you're testing a problematic combination of kernel (5.15.40+, yet still in the 5.15.x branch) and LKRG (without our recent fix in #193)? And is this the same kernel configuration that's in the user report?

@wladmis (Contributor) commented Jul 21, 2022

Yes, I'm sure. It is LKRG 0.9.3 and kernel version 5.15.55 (not exactly the same version, but the kernel config should be the same). Of course, the user's system configuration is quite different from the test one.

@Adam-pi3 (Collaborator)

@wladmis I think @solardiz is referring not to LKRG 0.9.3 but to the git TOT, since it includes fixes related to JUMP_LABEL.

@wladmis (Contributor) commented Jul 21, 2022

No, it does not include those fixes.

@solardiz (Contributor)

@Adam-pi3 Actually, @wladmis did answer my question - LKRG 0.9.3 is obviously old enough that it doesn't yet contain the fix, so the issue could have been triggered. However, as you pointed out to me in our private discussion, it is not surprising the issue isn't easy to trigger reliably. Maybe @wladmis needs to add loading and unloading of many other modules to trigger it.

@Adam-pi3 (Collaborator)

It is not so easy to generate that specific type of patching (JMP8). One way that I can think of is to load and unload a lot of kernel modules and in the meantime execute sysctl lkrg.trigger=1.
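A rough sketch of such a loop for the build test (the module name is only a placeholder for anything that is safe to repeatedly load and unload on the test system):

for i in $(seq 1 50); do
    modprobe loop            # placeholder module - pick one that is not built-in and not in use
    sysctl lkrg.trigger=1    # integrity check right after the module was patched in
    modprobe -r loop
    sysctl lkrg.trigger=1    # and again after it was removed
done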

@vt-alt (Contributor, Author) commented Jul 22, 2022

What fs type was that?

The user said (in the downstream bug report) that it's XFS.

(screenshot attached: IMG_20220721_093543)

He had to run xfs_repair -L to clear the journal (replaying the journal did not work), which also reported dangling inodes and relocation errors, and then run apt-get --reinstall, etc.
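Roughly, the recovery path described (device and package names are placeholders; note that xfs_repair -L discards the log, so recent metadata updates are lost):

xfs_repair -L /dev/sdXN                          # zero the log and repair the filesystem
apt-get install --reinstall <damaged-packages>   # restore files that ended up corrupted or disconnected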

I think LKRG should not destroy users' systems.

@solardiz (Contributor)

I think LKRG should not destroy users' systems.

Indeed, but do you suggest any specific change? Disable kernel integrity enforcement by default? Force disk sync before panic?

Of course, you'll soon have this specific LKRG/kernel incompatibility fixed by using our latest code, but issues like this can reoccur.

I also think this indicates XFS is somehow problematic. A simple kernel panic shouldn't cause any metadata corruption, and a journaling filesystem should be able to recover from it on its own. Maybe there's some mount option or the like to make XFS safer in this respect, but I couldn't find any now.

@wladmis (Contributor) commented Jul 22, 2022

I also think this indicates XFS is somehow problematic. A simple kernel panic shouldn't cause any metadata corruption, and a journaling filesystem should be able to recover from it on its own. Maybe there's some mount option or the like to make XFS safer in this respect, but I couldn't find any now.

They say it is known that XFS handles sudden power-off very badly (an example), and the KP case should not be any better.

@solardiz (Contributor)

@wladmis Those examples tend to be 10+ years old or involve VMs (so filesystem-on-filesystem), and the typical effect is loss of recent or not-so-recent writes - not metadata corruption.

@vt-alt (Contributor, Author) commented Jul 28, 2022

There is also a similar kernel panic report for 0.9.2.0.1.git10ba314-alt1.331577.1 (by ALT QA), so the bug may have been introduced earlier than v0.9.3.

(screenshot attached: testing-1)

@solardiz (Contributor)

the bug may have been introduced earlier than v0.9.3.

In our understanding, there wasn't a bug - rather, the kernel proceeded to make changes that LKRG did not expect, so we had to adapt, which we did for mainline (5.17+ at the time) between 0.9.2 and 0.9.3, and for 5.15.40+ (which didn't exist at the time of the 0.9.3 release) between 0.9.3 and 0.9.4.

Now you've shared a screenshot showing an issue with 0.9.2 on 5.10.133. This could be something else entirely. It is unfortunate that you seem to have a default console log level such that KERN_CRIT messages are not seen - apparently only KERN_EMERG are? Is this something you're fixing in ALT?

BTW, I kept this issue open not because it still needed tracking/work from LKRG project, but just for the ongoing discussion. We are not currently aware of anything to fix in LKRG here. If you see this differently, please say so (and explain). Thanks!

@solardiz (Contributor)

Oh, I see 5.10.133 is actually the latest in that series, and is very recent. Now we need to see whether they possibly back-ported that same change, and if so we will in fact need to update our code.

@vt-alt (Contributor, Author) commented Jul 28, 2022

It is unfortunate that you seem to have a default console log level such that KERN_CRIT messages are not seen - apparently only KERN_EMERG are? Is this something you're fixing in ALT?

Yes, it's unfortunate. I think they use the default settings for the distribution, which are set by people who are neither aware of nor care about needs other than making it look good for users. So in the future we may get more such abbreviated and incomprehensible reports from end users if LKRG doesn't log at the highest log level (which is what I suggest).

@solardiz (Contributor)

I think we can't reasonably log everything at the highest severity level, but we can upgrade what we currently log as CRIT to EMERG. What do you think, @Adam-pi3?

@solardiz (Contributor)

Just confirmed that https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.133 mentions a back-port of the "problematic" change. We'll need to adapt to that, re-check other kernel branches, and maybe release 0.9.5 shortly. Ideally, we'd come up with and implement a way to detect that change other than via kernel version number, but that's tricky.
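For re-checking the other kernel branches, a quick (and assumed) shortcut is to grep the stable changelogs for jump label commits, given that the fixes in #193 relate to JUMP_LABEL handling:

curl -s https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.133 | grep -i 'jump_label'   # repeat for other branch changelogs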

@wladmis (Contributor) commented Oct 11, 2022 via email

@solardiz (Contributor)

@wladmis In a comment above, @Adam-pi3 suggested:

load and unload a lot of kernel modules and in the meantime execute sysctl lkrg.trigger=1

We don't currently have a better idea.

@vt-alt (Contributor, Author) commented Oct 14, 2022

This does not look hard (looping insmod/rmmod). @wladmis, if you succeed in reproducing the bug with such a test, please report it. We may want to add this to CI.

@wladmis (Contributor) commented Oct 24, 2022

Ok
