5.15.55: Integrity verification failed (no details) #208
I think this is #192, so already fixed in our git. If ALT is packaging 5.15.40+, you need to include that patch (edit: I mean the one in PR #193) in your LKRG package. As to there being no log messages prior to the kernel panic, that's nasty. My guess is the console log level was reduced to include only messages above a certain severity. Does ALT do that by default? IIRC, kernel panic uses EMERG severity. Other LKRG messages are up to CRIT.
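As background for the log-level guess above, the console threshold can be checked from user space via the standard procfs interface. This is a generic sketch, not from the thread, and the numbers mentioned are illustrative, not ALT's actual defaults:

```shell
# The four fields are: console loglevel, default message loglevel,
# minimum console loglevel, boot-time default console loglevel.
cat /proc/sys/kernel/printk

# A message reaches the console only if its severity value is numerically
# lower than the console loglevel (KERN_EMERG=0 ... KERN_DEBUG=7).
# To let everything through, including LKRG's KERN_CRIT (2) messages
# (requires root):
#   dmesg -n 8            # or: echo 8 > /proc/sys/kernel/printk
```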
Hi! For some reason this panic was not triggered during the package build tests. I would like to enhance those tests so that this kind of issue could be detected early, but I have no idea what to add. Could you please give any advice?
@wladmis Maybe you need to add …
@solardiz There was …
@wladmis Then I guess nothing triggered an update of the jump labels during this time. I'd try (un)loading other modules while LKRG is already loaded.
Yes, the boot test should have caught this.
Interestingly, the user also reports that they lost the rootfs (on the next boot) after the panic. Can this be due to …
A kernel panic is indeed not good for filesystem health, but it's unexpected for the filesystem to be practically lost as a result of that. What fs type was that?
I wonder if we have any of these affected kernels 5.15.40+ in our CI, do we? If we happen to have them and we didn't trigger the issue in CI, then we probably need to add something to ensure it'd have been triggered (and test that in a fork by removing the fix). If we don't have them, I guess that's fine - no action needed.
We can easily force TRACEPOINT / JUMP_LABEL activity by executing: … When the commands finish the job, we should run: … and later we should disable tracing via: … and we should re-run … Although a sequence of such commands might take some time...
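The exact commands were lost from the comment above; as a hedged reconstruction, toggling an ftrace tracer is one standard way to force tracepoint/jump-label patching. The `toggle_tracer` function name is my own, and the sketch assumes tracefs is reachable under /sys/kernel/debug (writing to it requires root):

```shell
# Hypothetical sketch, not the original commands: enabling a tracer
# patches in tracepoints (jump labels); disabling patches them back out.
toggle_tracer() {
    t=/sys/kernel/debug/tracing/current_tracer
    if [ -w "$t" ]; then
        echo function > "$t"   # force jump-label updates on
        sleep 5                # let some activity accumulate
        echo nop > "$t"        # force jump-label updates off
        echo toggled
    else
        echo skipped           # need root and debugfs/tracefs mounted
    fi
}
toggle_tracer
```

Running this while LKRG is loaded should exercise the same JUMP_LABEL code paths the panic came from.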
So far, even with this advice, I have had no success triggering a kernel panic during the build test. I guess it is because the test is quite simple compared with a real booting system, and not many things happen in it.
Yes, I'm sure. It is LKRG 0.9.3 and kernel version 5.15.55 (not exactly the same version, but the kernel config should be the same). Sure, the user's system configuration is quite different from the test one.
No, it does not include those fixes.
@Adam-pi3 Actually, @wladmis did answer my question - LKRG 0.9.3 is obviously old enough that it doesn't yet contain the fix, so the issue could have been triggered. However, as you pointed out to me in our private discussion, it is not surprising the issue isn't easy to trigger reliably. Maybe @wladmis needs to add loading and unloading of many other modules to trigger it.
It is not that easy to generate that specific type of patching (JMP8). One way I can think of that you can try is to load and unload a lot of kernel modules, and in the meantime execute …
Indeed, but do you suggest any specific change? Disable kernel integrity enforcement by default? Force a disk sync before panic? Of course, you'll soon have this specific LKRG/kernel incompatibility fixed by using our latest code, but issues like this can recur. I also think this indicates XFS is somehow problematic. A simple kernel panic shouldn't cause any metadata corruption, and a journaling filesystem should be able to recover from it on its own. Maybe there's a mount option or the like to make XFS safer in this respect, but I couldn't find any just now.
They say it is known that XFS handles sudden power-off very badly (an example), and the kernel panic case should not be any better.
@wladmis Those examples tend to be 10+ years old or involve VMs (so filesystem-on-filesystem), and the typical effect is loss of recent or not-so-recent writes - not metadata corruption.
In our understanding, there wasn't a bug - rather, the kernel proceeded to make changes that LKRG did not expect, so we had to adapt, which we did for mainline (5.17+ at the time) between 0.9.2 and 0.9.3, and for 5.15.40+ (which didn't exist at the time of the 0.9.3 release) between 0.9.3 and 0.9.4. Now you've shared a screenshot showing an issue with 0.9.2 on 5.10.133. This could be something else entirely. It is unfortunate that you seem to have a default console log level such that …

BTW, I kept this issue open not because it still needed tracking/work from the LKRG project, but just for the ongoing discussion. We are not currently aware of anything to fix in LKRG here. If you see this differently, please say so (and explain). Thanks!
Oh, I see 5.10.133 is actually the latest in that series, and is very recent. Now we need to see whether they possibly back-ported that same change, and if so we will in fact need to update our code.
Yes, it's unfortunate. I think they use default settings for the distribution, set by people who aren't aware of, nor care about, needs other than looking good for users. So in the future we may get more such abbreviated and incomprehensible reports from end users if LKRG doesn't log at the highest log level (which I suggest).
I think we can't reasonably log everything at the highest severity level, but we can upgrade what we currently log as …
Just confirmed that https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.133 mentions a back-port of the "problematic" change. We'll need to adapt to that, re-check other kernel branches, and maybe release 0.9.5 shortly. Ideally, we'd come up with and implement a way to detect that change other than via the kernel version number, but that's tricky.
> Hi!
> For some reason this panic was not triggered during the package build
> tests. I would like to enhance those tests that this kind of issue could
> be detected early, but I have no idea what to add. Could you please give
> any?
> …--
> WBR,
> Vladimir D. Seleznev
This does not look hard (looping insmod/rmmod). @wladmis, if you are successful in reproducing the bug with such a test, please report back. We may want to add this into CI.
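A minimal sketch of such a loop. The module name, the cycle count, and the helper name `cycle_module` are arbitrary choices for illustration, not from the thread; actually loading/unloading modules requires root and LKRG already loaded:

```shell
#!/bin/sh
# Repeatedly load and unload a module so the kernel updates jump labels
# while LKRG is watching; failures are ignored so the loop always finishes.
cycle_module() {
    mod="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        modprobe "$mod" 2>/dev/null || true
        rmmod "$mod" 2>/dev/null || true
        i=$((i + 1))
    done
    echo "$i"   # number of completed cycles
}

cycle_module loop 100
```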
Ok
End users reporting LKRG-induced kernel panics: https://bugzilla.altlinux.org/43005
I only wonder why LKRG does not tell the reason for the kill except for the very generic

Integrity verification failed

This means we will never get even the slightest hint of the real cause. The package maintainer seems [not] to be escalating bug reports, perhaps because there is nothing specific to relay.