tests fail on some KVM guests #2

Closed
sharkcz opened this issue Dec 16, 2022 · 16 comments

@sharkcz
Contributor

sharkcz commented Dec 16, 2022

We are experiencing test failures (running make check) on some KVM guests when building qclib in Fedora.

from the build log

...
+ cd qclib-2.3.2
+ make test-sh test
Warning: Capacity data inconsistent, try again later (rc=2)
1 error(s) detected
make: *** [Makefile:69: test-sh] Error 1

But when I try to reproduce it outside the Fedora buildsystem in my KVM guest on z14 LPAR (or on another guest on z15 LPAR (in beaker)), there is no such problem, the tests run and pass.

Is the hypervisor too old or missing some features/fixes? Would a reboot "fix" that?

@Stefan-Raspl
Contributor

Stefan-Raspl commented Oct 18, 2023

Hi - can you provide a dump using the qc_dump script, and share at least the log (to be found at /tmp/qclib-xxxxxx)?
Is this running on actual IBM Z hardware or in some kind of emulated environment?

@sharkcz
Contributor Author

sharkcz commented Oct 18, 2023

sure, I should have something for you tomorrow

@Stefan-Raspl
Contributor

Awesome - and I'm terribly sorry for taking so long to respond!

@sharkcz
Contributor Author

sharkcz commented Oct 19, 2023

no problem, I got only recently notified again that it's still a problem :-)

But it seems my original comment is right, I can't reproduce the problem outside the buildsystem (KVM guests on z15 LPAR), but working on getting the requested details ...

@sharkcz
Contributor Author

sharkcz commented Oct 19, 2023

So here it is - https://koji.fedoraproject.org/koji/taskinfo?taskID=107766617
Please see build.log for the full build log; it contains the data dump inline, in uuencoded form.

@sharkcz
Contributor Author

sharkcz commented Oct 19, 2023

Could it be a LPAR configuration/permission issue?

@Stefan-Raspl
Contributor

So...I looked into https://kojipkgs.fedoraproject.org//work/tasks/6617/107766617/build.log, extracted the uuencoded output of qc_dump, uudecoded, and checked the log.
While the whole build process refers to qclib 2.4.0 (which would be the latest version), the log says
10/19,12:11:08,(nil) : This is qclib v2.2.1, level 857cc75, date 2020-10-14 21:58:25 +0200
Seems like there's a qclib installed on that machine. You can specify an arbitrary binary (ideally zname in the build tree) as an argument to qc_dump. Can you re-run that way?
As for the issue itself: qclib cannot be executed in an atomic manner, and has to account for potential live guest migration events. Therefore, we need to detect whether a live guest migration took place while it was collecting data. To do so, we simply take a copy of /proc/sysinfo before we start to collect the data, and compare it to what it looks like after we have completed the collection. That's why I asked whether you're in an emulated environment - unless you're migrating the build system while qclib is executed, I can't imagine what triggers that check to fail.
What might help us here is if you would dump the content of /proc/sysinfo before and after 'make test-sh test' is executed - and/or before/after executing qc_debug.
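
The before/after comparison described above can be sketched in a few lines. This is only an illustration of the idea, not qclib's implementation (which is C code); the function names `read_snapshot` and `diff_around` are made up for this example, and the path is a parameter so the sketch also works on machines without /proc/sysinfo.

```python
import difflib
from pathlib import Path


def read_snapshot(path="/proc/sysinfo"):
    """Return the current content of `path` as text."""
    return Path(path).read_text()


def diff_around(action, path="/proc/sysinfo"):
    """Call action() and return unified-diff lines showing how `path` changed.

    An empty list means the file looked identical before and after,
    i.e. a check of this kind would not suspect a migration.
    """
    before = read_snapshot(path)
    action()
    after = read_snapshot(path)
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
```

On the build machine, `action` could be a callable that runs `make test-sh test` via `subprocess.run`; any non-empty result would show exactly which lines of /proc/sysinfo differ.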

@sharkcz
Contributor Author

sharkcz commented Oct 20, 2023

> So...I looked into https://kojipkgs.fedoraproject.org//work/tasks/6617/107766617/build.log, extracted the uuencoded output of qc_dump, uudecoded, and checked the log.
> While the whole build process refers to qclib 2.4.0 (which would be the latest version), the log says
> 10/19,12:11:08,(nil) : This is qclib v2.2.1, level 857cc75, date 2020-10-14 21:58:25 +0200
> Seems like there's

it comes from https://github.com/ibm-s390-linux/qclib/blob/main/query_capacity.c#L104 (hard-coded there), also both PATH and LD_LIBRARY_PATH are set to pwd when running qc_dump. This might be useful to fix, because it should prefer the fresh qclib stuff from the build directory, not from the system.

> qclib installed on that machine. You can specify an arbitrary binary (ideally zname in the build tree) as an argument to qc_dump. Can you re-run that way?
> As for the issue itself: qclib cannot be executed in an atomic manner, and has to account for potential live guest migration events. Therefore, we need to detect whether a live guest migration took place while it was collecting data. To do so, we simply get a copy of /proc/sysinfo before we start to collect the data, and compare to what it looks like after we completed collection of data. That's why I asked whether you're in an emulated environment - unless you're migrating the build system while qclib is executed, I can't imagine what triggers that check to fail.
> What might help us here is if you would dump the content of /proc/sysinfo before and after 'make test-sh test' is executed - and/or before/after executing qc_debug.

ack

btw, how or where are you detecting the live migration? There must be something that's specific to the Fedora builders, because I can't reproduce the error anywhere else.

@Stefan-Raspl
Contributor

> it comes from https://github.com/ibm-s390-linux/qclib/blob/main/query_capacity.c#L104 (hard-coded there), also both PATH and LD_LIBRARY_PATH are set to pwd when running qc_dump. This might be useful to fix, because it should prefer the fresh qclib stuff from the build directory, not from the system.

Oh, crap - this (the hardcoded version/commit ID) must have gotten broken a while ago, when we started to publish code on github. Will need to think about how to fix that...
qc_dump should be fine, though: If you specify an arbitrary path for shared libs, then that is honored. E.g. LD_LIBRARY_PATH=. ./qc_dump ./zname uses the shared lib in the current dir instead of the one in /lib.
As for the lgm check: That is done in https://github.com/ibm-s390-linux/qclib/blob/main/query_capacity_sysinfo.c#L118: We simply compare the current content of /proc/sysinfo with what we have stored in memory. If we had changed to a different machine/guest/lpar, that content would have changed.
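
For scripted runs, the `LD_LIBRARY_PATH=. ./qc_dump ./zname` invocation mentioned above could be wrapped as follows. This is a minimal sketch; `build_tree_env` is a hypothetical helper name, and prepending (rather than replacing) an existing LD_LIBRARY_PATH is a choice made for this example.

```python
import os


def build_tree_env(build_dir, base=None):
    """Return a copy of the environment whose LD_LIBRARY_PATH starts with build_dir.

    `base` defaults to the current process environment; it is never modified.
    """
    env = dict(os.environ if base is None else base)
    existing = env.get("LD_LIBRARY_PATH")
    if existing:
        # Keep the previous search path, but let the build tree win.
        env["LD_LIBRARY_PATH"] = build_dir + os.pathsep + existing
    else:
        env["LD_LIBRARY_PATH"] = build_dir
    return env


# Possible use from the qclib build directory (not executed here):
#   import subprocess
#   subprocess.run(["./qc_dump", "./zname"], env=build_tree_env("."), check=True)
```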

@sharkcz
Contributor Author

sharkcz commented Oct 20, 2023

build.log in https://koji.fedoraproject.org/koji/taskinfo?taskID=107818576 contains 3 qc_dump captures, before make test, after make test (= before make test-sh), after make test-sh

Perhaps the rc = 2 is sneaking into get_handle() incorrectly somehow ...

@Stefan-Raspl
Contributor

The rc=2 is actually set by qc_test at https://github.com/ibm-s390-linux/qclib/blob/main/qc_test.c#L881. None of the places you instrumented exposed the issue; maybe invoking qc_dump has a side effect that makes this particular issue disappear (which is a first)... so may I ask you to do the following:

  • Add QC_DEBUG=2 to the make test statement (this will trigger creation of a couple of files in /tmp/qclib-xxxxxxx* format - please pack them up and make them available!)
  • Add cat /proc/sysinfo before and after that make test statement

Thanks!
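
Packing up the debug files can be automated. The sketch below assumes that the "/tmp/qclib-xxxxxxx*" naming translates to the glob pattern `/tmp/qclib-*`; that pattern, the helper name `pack_debug_files`, and the archive name in the usage comment are assumptions for this example, not something qclib prescribes.

```python
import glob
import os
import tarfile


def pack_debug_files(out_path, pattern="/tmp/qclib-*"):
    """Write every path matching `pattern` into a gzip-compressed tarball.

    Returns the sorted list of paths that were packed. Entries are stored
    under their base name only, so the archive has no leading /tmp.
    """
    paths = sorted(glob.glob(pattern))
    with tarfile.open(out_path, "w:gz") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))
    return paths


# Possible use after running `make test` with QC_DEBUG=2 (not executed here):
#   pack_debug_files("qclib-debug.tar.gz")
```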

@sharkcz
Contributor Author

sharkcz commented Oct 24, 2023

Please see build.log from https://koji.fedoraproject.org/koji/taskinfo?taskID=108025354

@sharkcz
Contributor Author

sharkcz commented Oct 24, 2023

And here we are ...

...
10/24,09:42:39,0x2c69c40 :   Run consistency check
10/24,09:42:39,0x2c69c40 :     Warning: Consistency check 'num_cpu_shared + num_cpu_dedicated = num_cpu_total' failed at layer 3 (KVM-guest/GUEST): 3 + 0 != 15
10/24,09:42:39,0x2c69c40 :     Warning: Consistency check failed
10/24,09:42:39,0x2c69c40 :   Warning: Gathering data failed, retry 1
...
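
When scanning long QC_DEBUG logs for this kind of failure, a small parser helps. The sketch below is written against the single warning line quoted above; the regular expression and the name `parse_cpu_warning` are assumptions for this example, and other qclib warnings may well be formatted differently.

```python
import re

# Matches the tail of the warning: layer number, layer name, and the three operands.
WARNING_RE = re.compile(
    r"failed at layer (\d+) \(([^)]*)\): (\d+) \+ (\d+) != (\d+)"
)


def parse_cpu_warning(line):
    """Extract the operands of a failed 'a + b != c' consistency warning.

    Returns None if the line does not look like such a warning.
    """
    match = WARNING_RE.search(line)
    if match is None:
        return None
    layer, name, shared, dedicated, total = match.groups()
    return {
        "layer": int(layer),
        "layer_name": name,
        "shared": int(shared),
        "dedicated": int(dedicated),
        "total": int(total),
        # CPUs counted in the total but in neither of the two summands.
        "unaccounted": int(total) - int(shared) - int(dedicated),
    }
```

Applied to the warning above, the "unaccounted" field is 15 - 3 - 0 = 12.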

@Stefan-Raspl
Contributor

OK, found it:
So, besides the aforementioned mechanism to detect live guest migrations, there is also a consistency check. Since we tap into multiple data sources, we need to make sure we have consistency. That check is only active during tests, as it is mostly for debugging purposes - which is the case for qc_test (that's what is executed during make test). If the consistency check fails, it does so thrice, hence triggering the LGM check. Turns out that the respective check is wrong - the difference reported in the warning is due to 12 reserved CPUs in the respective KVM guest, which are not accounted for. I have a fix that addresses the issue. Add'l polishing, due diligence, etc. still required.
In the meantime, you could work around this issue by setting the env variable QC_CHECK_CONSISTENCY=0 when calling make test - or simply omit the make test step entirely. In any case: no real bug, just a faulty consistency check.
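
To make the arithmetic concrete: with the values from the log, 3 + 0 != 15, while 3 + 0 + 12 == 15 once the reserved CPUs are counted. The sketch below only illustrates that arithmetic and the workaround; how the actual fix treats reserved CPUs is not stated in this thread, and `cpus_consistent` and `workaround_env` are hypothetical names.

```python
import os


def cpus_consistent(shared, dedicated, total, reserved=0):
    """Return True if the CPU counts add up to the total.

    Counting `reserved` in the sum is an assumption made for illustration;
    it is not necessarily how qclib's fix works.
    """
    return shared + dedicated + reserved == total


def workaround_env(base=None):
    """Return a copy of the environment with QC_CHECK_CONSISTENCY=0 set."""
    env = dict(os.environ if base is None else base)
    env["QC_CHECK_CONSISTENCY"] = "0"
    return env


# Possible use in a build script (not executed here):
#   import subprocess
#   subprocess.run(["make", "test"], env=workaround_env(), check=True)
```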

@sharkcz
Contributor Author

sharkcz commented Oct 30, 2023

Thanks, Stefan, the workaround unblocks the build.

@Stefan-Raspl
Contributor

Great! I'll commit a proper fix "soon" and make sure the README has a proper entry so you'll know when it's fixed.
