Add a CPU usage check for HAWK/PUMA #12464
This adds a CPU usage check on the ha/check_hawk test module while a client running the ha/hawk_gui test module is interacting with HAWK. It will soft fail with bsc#1179609 (HAWK/PUMA consume a considerable amount of CPU) if HAWK/PUMA CPU usage is over 50%.
I believe this test should be added to hawk_test instead of openQA. It keeps our openQA code simpler and leaves all the tests to the hawk_test suite instead.
I don't see why this shouldn't be in openQA. FMPOV the test code is not so complicated, and moreover the test code in openQA doesn't have to be simple; it has to be readable, which is the case here. And as most of our test code is in openQA, it is better to have it here IMHO.
We need to check the CPU usage in the server.
It's much cleaner and easier to add this test (and, for that matter, any other test) to the I vote for leaving to openQA only the tests that can't be run on
We can run
And you call this "simpler"?
We already run commands in
Yeah, that sounds like an over-complication. I would vote against that.
Sorry:
LGTM
The requested change would move commands from the server to the client and unnecessarily complicate the solution.
That's cute. ;)
I'm proposing adding this test (and any other test on the server) to Missing:
Will provide a verification run once we agree on the method to signal the bug. Benefits:
And finally, when it's merged, we should monitor the CPU usage in every test in past SLES versions to arrive at a decent default value for
LGTM
Then remove all openQA code and execute the test outside of openQA, you will have ALL test code in
Does a TLS handshake mean a failure from a customer's point of view?
CPU utilization above 50% can happen with TLS handshakes or any other operation. That's why we should test CPU utilization after each test in Hawk. It's the best way to test bsc#1179651
I respectfully disagree. I think that in this manner we actually increase the possibility of both false positives and false negatives, and that we effectively reduce it by measuring the CPU usage over a longer period of time and working with averages. Please read the description of the test case in https://confluence.suse.com/pages/viewpage.action?pageId=634290489 (linked on the Jira ticket). It explicitly recommends against taking a single measurement after "stressing" HAWK, which is what the proposed change in

What I have included in this PR is not exactly the same as described in Confluence, as the measurements are taken over a longer period of time (roughly 5 to 10 minutes instead of 1 minute), but it covers a longer interaction with HAWK with more measurements. I understand you're worried that some of those measurements could pick up expected high CPU usage due to TLS handshakes, and that this could lead to false positives, but as you can see from the verification runs, if this is happening, it is being handled by the averaging. CPU usage recorded so far in the verification runs is in the 7.8% to 13.6% range.

I could increase the comparison threshold from 50% to 60% if we agree on this as a second measure to avoid false positives, but IMHO, if we see a test spiking from 13.6% CPU usage to more than 50%, this would require an investigation, false positive or not.

As a compromise, we could trigger a verification run with both approaches and check how different the results are.
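To make the averaging argument concrete, here is a minimal Python sketch. It is not the code in this PR (the openQA test modules are not shown in this thread); the function name and the sample values are hypothetical, and only the 50% threshold comes from the discussion above. It shows how a single reading taken during a spike trips the check while the average over several readings does not.

```python
from statistics import mean

CPU_THRESHOLD = 50.0  # percent; the soft-fail limit discussed in this PR


def exceeds_threshold(samples, threshold=CPU_THRESHOLD):
    """Return True when the average of the CPU samples is above the threshold."""
    if not samples:
        raise ValueError("at least one CPU sample is required")
    return mean(samples) > threshold


# Hypothetical readings in percent CPU; the third one stands for a short
# spike such as a TLS handshake. These are NOT values from the verification runs.
samples = [9.0, 11.5, 72.0, 10.2, 8.8, 12.1]

# Judging on the spike alone would flag the run.
single_after_spike = exceeds_threshold([samples[2]])

# Judging on the average of all readings absorbs the spike.
averaged = exceeds_threshold(samples)
```

With these made-up numbers the single reading is over the limit while the average stays well below it, which is the behaviour described for the verification runs.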
If we don't get a verification run for this shortly, I propose merging this PR as is, and then rolling back the changes after we determine that both approaches are the same. If we determine that both approaches are complementary (which is my gut feeling ATM), we keep both.
This is the benefit, as this test could be used by other teams outside of openQA (for example as a CI/CD tool in ClusterLabs repositories) to check regressions to HAWK more thoroughly.
See my other message. I think we avoid false positives related to TLS handshakes only, but we may introduce a new set of both false positives and false negatives by taking fewer measurements.
I can add a sleep after the logout. From the description of bsc#1179651, once the CPU problem begins it doesn't go away. So I think that just one final measurement after all the Hawk tests would be enough. I've been reading bug reports about CPU usage in web frameworks for both Python & Ruby (Django and Puma, respectively) and high CPU utilization can also mean 10% CPU usage when idle. Greater than that may happen when too many persistent connections weren't closed on the server:
Consider that, according to the bug description, the CPU problem doesn't go away once it begins, so I don't see the possibility of introducing false positives or false negatives, provided that we check the CPU some time after logout.
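The single-measurement variant argued for here could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the function name, the settle period, and the `sample_cpu` callable, which stands in for whatever command the test would actually run on the server. The sleep function is injectable so the example runs without really waiting.

```python
import time


def cpu_still_high_after_logout(sample_cpu, threshold, settle_seconds=60.0,
                                sleep=time.sleep):
    """Wait for a settle period after logout, then take ONE CPU reading.

    sample_cpu is a hypothetical callable returning percent CPU.
    Returns a (is_over_threshold, reading) tuple.
    """
    sleep(settle_seconds)  # let handshakes and other short-lived work finish
    usage = sample_cpu()
    return usage > threshold, usage


# Usage with a fake sampler and a no-op sleep (made-up reading of 85%).
stuck, reading = cpu_still_high_after_logout(
    lambda: 85.0, threshold=50.0, settle_seconds=5, sleep=lambda seconds: None
)
```

This relies on the premise from the bug description that the CPU problem persists once it starts; an issue that only shows up while HAWK is in use would not be seen by a single late reading.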
To detect this specific bug, I agree. To detect other issues related to CPU usage it can be lacking.
Then it may be relevant to check CPU usage against this 10% threshold even before we have the connection from the client. After the current verification runs finish, I'll see how to add it in this PR.
Added code to also check CPU usage while HAWK is idle, as suggested by @ricardobranco777. @juadk @ldevulder, could you please review again? Verification runs are at: node 1, node 2, client & support server
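For readers following the thread, the combined idle and busy check can be pictured with the Python sketch below. It is an illustration only, not the code added to this PR: the function name is hypothetical, and pairing the 10% figure with the idle phase and the 50% figure with the busy phase is an assumption based on the numbers mentioned in this conversation.

```python
from statistics import mean

IDLE_THRESHOLD = 10.0  # percent; assumed limit while no client is connected
BUSY_THRESHOLD = 50.0  # percent; limit while the client interacts with HAWK


def hawk_cpu_findings(idle_samples, busy_samples):
    """Return one message per phase whose average CPU usage is over its limit."""
    findings = []
    phases = (
        ("idle", idle_samples, IDLE_THRESHOLD),
        ("busy", busy_samples, BUSY_THRESHOLD),
    )
    for phase, samples, limit in phases:
        if not samples:
            raise ValueError(f"no CPU samples for the {phase} phase")
        average = mean(samples)
        if average > limit:
            findings.append(
                f"{phase}: average CPU {average:.1f}% is over {limit:.0f}%"
            )
    return findings


# Idle readings are made up; the busy readings reuse the 7.8% and 13.6%
# bounds quoted earlier in this conversation.
healthy = hawk_cpu_findings([2.0, 3.0], [7.8, 13.6])
idle_problem = hawk_cpu_findings([12.0, 15.0], [7.8, 13.6])
```

An empty list means neither phase would trigger a soft failure; each message in a non-empty list corresponds to a phase that would.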
Good point also checking the CPU when hawk is idle, I will keep an eye on the test results in all of the maintained OS versions. LGTM
LGTM
This PR adds a CPU usage check on the ha/check_hawk test module while a client running the ha/hawk_gui test module is interacting with HAWK. It will soft fail with bsc#1179609 (HAWK/PUMA consume a considerable amount of CPU) if HAWK/PUMA CPU usage is over 50%. (Failures in 15-SP3 are due to bsc#1184274 and unrelated to this PR.)