investigate kernel/qemu issue on compute nodes #772

Closed
artificial-intelligence opened this issue Nov 23, 2023 · 5 comments
Labels: bug (Something isn't working)

Comments

@artificial-intelligence

dmesg:

[3579432.962837] x86/split lock detection: #AC: CPU 0/KVM/3324070 took a split_lock trap at address: 0xfffff805742147bf
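For triaging many such messages, the line above can be picked apart with a small parser. This is only a minimal sketch and not part of the original report: the name `parse_split_lock_line` and the regular expression are illustrative, and it assumes the message keeps the `<comm>/<pid> took a split_lock trap at address: 0x...` layout seen in this one log line.

```python
import re
from typing import NamedTuple, Optional

# Pattern modelled on the single dmesg line quoted in this issue.
# This is an assumption, not a stable kernel interface.
_SPLIT_LOCK_RE = re.compile(
    r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"
    r"x86/split lock detection: #AC: "
    r"(?P<comm>.+)/(?P<pid>\d+) took a split_lock trap at address: "
    r"(?P<addr>0x[0-9a-fA-F]+)"
)


class SplitLockEvent(NamedTuple):
    timestamp: Optional[float]  # seconds since boot, if present in the line
    comm: str                   # thread name of the task that trapped
    pid: int                    # thread id of that task
    address: int                # address reported in the message


def parse_split_lock_line(line: str) -> Optional[SplitLockEvent]:
    """Return the parsed event, or None if the line is not a split lock message."""
    m = _SPLIT_LOCK_RE.search(line)
    if m is None:
        return None
    ts = m.group("ts")
    return SplitLockEvent(
        timestamp=float(ts) if ts is not None else None,
        comm=m.group("comm"),
        pid=int(m.group("pid")),
        address=int(m.group("addr"), 16),
    )
```

For the line above, the thread name comes out as `CPU 0/KVM`, which matches the naming QEMU uses for its KVM vCPU threads, so the trap would appear to originate from guest code rather than from the host itself.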

Links/information found so far:

A workaround available from kernel 6.2 onward might be to set:

kernel.split_lock_mitigate=0
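To see what a compute node currently has configured before changing anything, the sysctl can be read from procfs. A minimal sketch, assuming the usual mapping of `kernel.split_lock_mitigate` to `/proc/sys/kernel/split_lock_mitigate`; the helper name `read_split_lock_mitigate` is made up for illustration:

```python
from pathlib import Path
from typing import Optional

# Usual procfs location for the kernel.split_lock_mitigate sysctl.
SYSCTL_PATH = Path("/proc/sys/kernel/split_lock_mitigate")


def read_split_lock_mitigate(path: Path = SYSCTL_PATH) -> Optional[int]:
    """Return the current value, or None if the sysctl file does not exist
    (for example on kernels that predate it)."""
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None
```

A return value of `None` would suggest the running kernel does not offer this knob at all, in which case only the boot parameter route remains. Changing the value at runtime would be `sysctl -w kernel.split_lock_mitigate=0` as root, which, like everything else here, is untested.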

For older kernels, this boot parameter might mitigate the issue (everything here is currently untested!):

split_lock_detect=off
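Whether a node was actually booted with that parameter can be checked against its kernel command line. A small sketch, with `split_lock_detect_mode` being a hypothetical helper name; it only looks at the string it is given and makes no claim about how the kernel itself resolves repeated parameters:

```python
from typing import Optional


def split_lock_detect_mode(cmdline: str) -> Optional[str]:
    """Return the value of split_lock_detect= from a kernel command line,
    or None if the parameter is not present."""
    mode = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "split_lock_detect" and sep:
            mode = value  # this sketch simply keeps the last occurrence
    return mode
```

On a running node this could be called as `split_lock_detect_mode(open("/proc/cmdline").read())`.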


berendt commented Nov 23, 2023

  1. Upgrade to linux-generic-hwe-22.04 (Linux Kernel 6.2.0)

berendt added the bug label Nov 24, 2023

garloff commented Dec 5, 2023

[3579432.962837] x86/split lock detection: #AC: CPU 0/KVM/3324070 took a split_lock trap at address: 0xfffff805742147bf

So this is a warning from the kernel that the workload is doing something inefficient; switching the warning off only hides it.
The real fix is to change the code to avoid locked operations that are split across two cache lines.
So maybe the answer is to ignore this?
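To make "split between cache lines" concrete: a locked access is split when its operand starts in one cache line and ends in the next. The arithmetic can be sketched as below, assuming a 64-byte cache line (typical for current x86 CPUs, but an assumption here); `crosses_cache_line` is an illustrative name.

```python
CACHE_LINE_SIZE = 64  # bytes; assumed, not read from the hardware


def crosses_cache_line(address: int, size: int,
                       line_size: int = CACHE_LINE_SIZE) -> bool:
    """Return True if an operand of `size` bytes at `address` spans two cache lines."""
    if size <= 0:
        raise ValueError("size must be positive")
    first_line = address // line_size
    last_line = (address + size - 1) // line_size
    return first_line != last_line
```

An 8-byte operand at offset 56 of a line still fits, while the same operand at offset 60 spills into the next line and would be a split lock if accessed atomically. Note that, as far as I understand it, the address in the dmesg message is that of the trapping instruction and not of the operand, so it should not be fed into this check directly.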


berendt commented Dec 5, 2023

Yes, it was just an overlap with another problem. The problem itself was purely in the payload and independent of the compute nodes. I will document what Sven wrote. These are the parameters for ignoring it.

artificial-intelligence (author) commented

As was pointed out, this is more of an application issue; the kernel-side mitigation is documented above. In the general case, the application should be fixed.
