WIP: performance: e2e: configure adaptive cpu in profile #884
Conversation
/hold

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shajmakh
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
@@ -3,6 +3,7 @@ package __performance_config
import (
	"context"
	"fmt"
	"github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/utils/nodes"
Please move it down along with the other relevant imports
	return nil, fmt.Errorf("insufficient cpus on a node to create a valid performanceprofile for the tests, found %d cpus, expects at least 4", capacityCPU)
}
reserved := performancev2.CPUSet(fmt.Sprintf("0,%d", capacityCPU/2))
isolated := performancev2.CPUSet(fmt.Sprintf("1-%d,%d-%d", capacityCPU/2-1, capacityCPU/2+1, capacityCPU-1))
Are 2 reserved CPUs enough for a machine with a large number of CPUs?
This will cause issues on BM with HT enabled if we don't pick CPUs from the same core.
The same goes for isolated. Upstream, where there are only 4 CPUs and no NUMA, this works, but since we use this on multi-NUMA systems with HT enabled, it gets complicated. I strongly suggest using PPC. Otherwise you will have to duplicate what PPC does here.
Using PPC can work here, even though it will make the lane setup significantly more complex. But I agree we're reimplementing PPC.
The best option, IMO, would be to reuse part of the PPC code, but we would need more detailed HW information to enable PPC.
// Get cpus capacity on a node; typically it is the same for all nodes
workerNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
Expect(err).ToNot(HaveOccurred())
capacityCPU, _ := workerNodes[0].Status.Capacity.Cpu().AsInt64()
I would validate that all workers are homogeneous and fail otherwise.
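The homogeneity check suggested here could look something like this stdlib-only sketch (`validateHomogeneousCPU` and the plain `int64` capacities are hypothetical; the real test would collect `node.Status.Capacity.Cpu().AsInt64()` from each worker node):

```go
package main

import "fmt"

// validateHomogeneousCPU checks that every worker reports the same CPU
// capacity and returns that capacity, or an error if the nodes differ.
// Capacities are plain int64 values here for illustration only.
func validateHomogeneousCPU(capacities []int64) (int64, error) {
	if len(capacities) == 0 {
		return 0, fmt.Errorf("no worker nodes found")
	}
	first := capacities[0]
	for i, c := range capacities[1:] {
		if c != first {
			return 0, fmt.Errorf("workers are not homogeneous: node 0 has %d cpus, node %d has %d", first, i+1, c)
		}
	}
	return first, nil
}

func main() {
	capacity, err := validateHomogeneousCPU([]int64{8, 8, 8})
	fmt.Println(capacity, err)

	_, err = validateHomogeneousCPU([]int64{8, 4})
	fmt.Println(err)
}
```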
yes please; also, as unlikely as it may be, let's handle the error here
I side with @mrniranjan here. The main problem is we're reimplementing PPC.
Best would be to either reuse (part of) the PPC code or just run PPC as a prerequisite before the test suite starts.
I don't have strong opinions.
func testProfile() (*performancev2.PerformanceProfile, error) {
	// Get cpus capacity on a node; typically it is the same for all nodes
	workerNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
	Expect(err).ToNot(HaveOccurred())
Please either use Expect in the code (but then use ExpectWithOffset) OR return an error and let the caller handle it; let's not do both.
	// should never be reached, but still
	return nil, fmt.Errorf("insufficient cpus on a node to create a valid performanceprofile for the tests, found %d cpus, expects at least 4", capacityCPU)
}
reserved := performancev2.CPUSet(fmt.Sprintf("0,%d", capacityCPU/2))
I strongly recommend using CPUSets and doing the logic with those instead of using just numbers.
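A minimal sketch of what set-based logic could look like (`cpuList` and `splitCPUs` are invented names; a real implementation would more likely reuse an existing CPUSet type such as the one in k8s.io/utils): the split is computed over sets of CPU ids and only rendered to a list string at the end, reproducing the same "0,4" / "1-3,5-7" result for an 8-CPU node as the hand-built format strings.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// cpuList renders a set of CPU ids as a kernel cpuset list string,
// e.g. {0,1,2,5} -> "0-2,5".
func cpuList(cpus map[int]bool) string {
	ids := make([]int, 0, len(cpus))
	for id := range cpus {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	var parts []string
	for i := 0; i < len(ids); {
		j := i
		// Extend j while the ids are consecutive, forming a range.
		for j+1 < len(ids) && ids[j+1] == ids[j]+1 {
			j++
		}
		if i == j {
			parts = append(parts, fmt.Sprintf("%d", ids[i]))
		} else {
			parts = append(parts, fmt.Sprintf("%d-%d", ids[i], ids[j]))
		}
		i = j + 1
	}
	return strings.Join(parts, ",")
}

// splitCPUs reproduces the PR's reserved/isolated split as set logic:
// reserve CPU 0 and capacity/2, isolate everything else.
func splitCPUs(capacity int) (reserved, isolated string) {
	res := map[int]bool{0: true, capacity / 2: true}
	iso := map[int]bool{}
	for id := 0; id < capacity; id++ {
		if !res[id] {
			iso[id] = true
		}
	}
	return cpuList(res), cpuList(iso)
}

func main() {
	r, i := splitCPUs(8)
	fmt.Println(r) // 0,4
	fmt.Println(i) // 1-3,5-7
}
```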
Another option we could explore is a middle ground: it is not fully automated (e.g. just run must-gather and let PPC do its magic), but it would be a step in the right direction, would be more automated than it is now, and should be deliverable quickly as a fix.
However, we will need to check how to actually ship the data and make it accessible to the 0_config suite.
@Tal-or @mrniranjan @ffromani thank you for the valuable reviews and comments. You highlighted legitimate points about using PPC. However, this PR is intended to address the blocking CI failure on cnf-feature-deploy, considering that the d/s CI is in fact using PPC to generate the profile. I'm researching whether we can enhance this further to satisfy all needs, taking into account the urgency of this for u/s.
The urgency IS a factor indeed, but the main blocker for this approach is that we need to take HT and NUMA CPU affinity into account when deciding the reserved CPUs. Otherwise the perfprof will be incorrect for other reasons; perhaps lesser reasons than the current state, but not much better. The NUMA allocation is perhaps debatable, but HT is a requirement for which we have e2e tests. I understand both camps and I don't have strong opinions yet.
We've been observing lately that some tests that involve disabling load balancing are failing (like 32646) because the expected result does not have specific anticipated CPUs. After investigation, it turns out that one factor is the profile configuration of the CPU distribution.

PAO functional tests configure fixed CPU values under the PP. This is a misconfiguration, especially when the system has more than 4 CPUs, and there is no guarantee that the performance profile controller will work adequately when not all CPUs are reflected in the CPU section of the PP.

To fix this, we check the actual CPU capacity on a worker node and divide it between reserved and isolated (required CPU fields in the PP).

Signed-off-by: shajmakh <shajmakh@redhat.com>
Force-pushed from ab19aa9 to 60b74db.
After reassessing the cost of the approach described here, I am closing this in favor of #909.