OCPBUGS-20368: E2E: Add tests for Dynamic ovs pinning #746
Conversation
Skipping CI for Draft Pull Request.
Force-pushed from 0ff4316 to 4cc20a2
Force-pushed from 3b506a2 to 9bc45e6
Initial review.
"k8s.io/utils/pointer" | ||
"sigs.k8s.io/controller-runtime/pkg/client" | ||
|
||
"embed" |
should go up with the other built-in deps
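For reference, a minimal sketch of the usual grouping: standard-library ("built-in") imports in one block ahead of external dependencies. The package name and the tiny usage function are illustrative only; the embed var mirrors the one in the diff.

```go
package cgroups_test // illustrative package name

import (
	// Standard library ("built-in") imports come first, in one group.
	"embed"
	"fmt"

	// External dependencies follow in a separate group.
	"k8s.io/utils/pointer"
)

//go:embed scripts/*
var scripts embed.FS // as in the diff above

func example() {
	fmt.Println(pointer.String("grouped imports"), scripts)
}
```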
Addressed in the latest commit
})

AfterAll(func() {
	By("Removing the crio fix workaround")
Can we add a comment that explains the workaround?
Addressed in the latest commit
	performanceMCP string
	//go:embed scripts/*
	Scripts embed.FS
)

var _ = Describe("[performance] Cgroups and affinity", Ordered, func() {
	var onlineCPUSet cpuset.CPUSet

	testutils.CustomBeforeAll(func() {
I see the tests are Ordered, so maybe we can use https://onsi.github.io/ginkgo/#setup-in-ordered-containers-beforeall-and-afterall
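For illustration, a minimal sketch of that shape; onlineCPUSet and workerRTNode are taken from the surrounding diff, and GetOnlineCPUsSet is assumed to be the existing node utility:

```go
var _ = Describe("[performance] Cgroups and affinity", Ordered, func() {
	var onlineCPUSet cpuset.CPUSet

	// In an Ordered container, BeforeAll runs exactly once before the
	// first spec, replacing the testutils.CustomBeforeAll helper.
	BeforeAll(func() {
		var err error
		onlineCPUSet, err = nodes.GetOnlineCPUsSet(workerRTNode)
		Expect(err).ToNot(HaveOccurred())
	})

	// AfterAll runs once after the last spec, for one-time teardown.
	AfterAll(func() {
		// ... revert any node-level changes made in BeforeAll ...
	})
})
```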
Addressed in the latest commit
	return pids, nil
}

/*// getCpuOfOvsServices returns cpus used by the ovs services ovs-vswitchd and ovs-dbserver
please remove the commented-out code
Addressed in the latest commit
@@ -95,3 +518,246 @@ func cpuSpecToString(cpus *performancev2.CPU) string {
	}
	return sb.String()
}

func createMachineConfig(profile *performancev2.PerformanceProfile) (*machineconfigv1.MachineConfig, error) {
Since it's test code used only once here, maybe it's simpler to read a complete MC manifest using embed and just apply it instead?
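Something along these lines, as a sketch: the manifest path is hypothetical, and the machineconfig import path may differ in this repo.

```go
import (
	_ "embed"

	machineconfigv1 "github.com/openshift/api/machineconfiguration/v1" // import path assumed
	"sigs.k8s.io/yaml"
)

//go:embed manifests/ovs-pinning-machineconfig.yaml
var mcManifest []byte // hypothetical manifest checked in next to the test

func loadMachineConfig() (*machineconfigv1.MachineConfig, error) {
	mc := &machineconfigv1.MachineConfig{}
	// sigs.k8s.io/yaml converts YAML to JSON before unmarshalling,
	// so the standard k8s JSON tags apply.
	if err := yaml.Unmarshal(mcManifest, mc); err != nil {
		return nil, err
	}
	return mc, nil
}
```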
/test e2e-gcp-pao-updating-profile
/test e2e-gcp-pao-workloadhints
/retest-required
/retest
/retest-required
2 similar comments
/retest-required
/retest-required
Thanks for the PR! I have left a few comments below for you.
It would be good to have a short description on the PR, and to squash commits into single units where it fits.
appsv1 "k8s.io/api/apps/v1"
"k8s.io/apimachinery/pkg/api/errors"
let's move these up to where all the k8s imports are
Describe("[rfe_id: 64006][Dynamic OVS Pinning]", Ordered, func() {
	Context("[Performance Profile applied]", func() {
		It("[test_id:64097] Activation file is created", func() {
			cmd := []string{"ls", activation_file}
if possible, let's avoid creating deps on other tools like the Linux commands ls, cat, and find as much as possible; we can use Go code to fulfill what we need
This is being run on the worker-cnf node, so we need to use Linux tools.
you're right, I missed that.
policy := "best-effort"
// We need to make a change to the profile that causes a system reboot,
// then check whether the activation file is modified or deleted.
do you need the system reboot or the PP modification? Because the latter has a higher cost in execution time and can possibly affect other tests if not reverted properly.
	TopologyPolicy: &policy,
}
By("Updating the performance profile")
profiles.UpdateWithRetry(profile)
I see you revert the profile in AfterEach; why not add a defer() here to revert it? It's only one test that modifies the profile.
OTOH, if you don't specifically need a PP update, then I believe a system reboot of the node on which the pod is running can be more efficient and cheaper here.
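A sketch of the defer-based revert, using the identifiers already in the diff (error handling trimmed):

```go
It("modifies and reverts the profile", func() {
	initial := profile.DeepCopy()
	// Revert when this spec exits, pass or fail; only this one test
	// modifies the profile.
	defer func() {
		profiles.UpdateWithRetry(initial)
		By("Waiting for MCP to settle after the revert")
		mcps.WaitForCondition(performanceMCP, machineconfigv1.MachineConfigPoolUpdated, corev1.ConditionTrue)
	}()

	profile.Spec.NUMA = &performancev2.NUMA{TopologyPolicy: &policy}
	By("Updating the performance profile")
	profiles.UpdateWithRetry(profile)
	// ... assertions against the rebooted node ...
})
```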
we can't run systemctl reboot from the pod; it isn't allowed.
For example:
[root@dell-r630-007 ~]# oc exec -it pods/machine-config-daemon-bvp6t -n openshift-machine-config-operator -- bash -c "systemctl reboot"
Defaulted container "machine-config-daemon" out of: machine-config-daemon, kube-rbac-proxy
Running in chroot, ignoring request: reboot
Also, yes, we are changing the profile to trigger the reboot, but this is the safest way to reboot, as all the utility functions are well tested and used extensively.
re: reboot, you can do it by jumping through some hoops: https://github.com/openshift/cluster-node-tuning-operator/blob/master/test/e2e/performanceprofile/functests/9_reboot/devices.go#L125
that said, I can't tell if the reboot is the best approach here
I can follow that approach, but I don't see much benefit, and it uses some hard-coded timeout values that I can't guarantee will work on worker nodes, specifically on BM. Modifying the profile and waiting for the MCP to update seems safest to me.
Also, executing commands through the MCD pod is fine as long as you need some output; executing commands like "reboot", where you instantly lose the connection to the node, I find not so appropriate. As I mentioned above, I can follow the same method if required.
I see your point, but there is a big difference in execution time between the PP update and a node reboot. When the PP is updated it triggers a reboot on all (usually no fewer than 2) nodes, and later the revert does the same, so the time is 4x that of rebooting only the node on which the pod is running. I do not expect the pod to be terminated on node reboot; otherwise it would likely be a bug.
Yes, because in this case the activation file will exist on all the worker-cnf nodes when the PP is applied. If there are multiple worker-cnf nodes, we do want to reboot all of them.
Yes, it causes a delay, but it ensures the node is rebooted properly; by that I mean we give the MCP time to evict the pods, drain the node, and come back again in an orderly fashion.
Okay, so I see this involves all nodes. Following that, and correct me if I'm mistaken, I think it is a good idea to verify the existence of the file not only on one node but on all relevant nodes, right?
ref:
https://github.com/openshift/cluster-node-tuning-operator/pull/746/files/23f53670b348a210197ec8d933c7db6fa48934b5#diff-ce7d5cbe8bfc6cc4efbf81602d5f238b37bafcbb68bf2c1ad66f1e1b148a33c5R163
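For example, a sketch that verifies the file on every worker-cnf node instead of a single one (helpers and variables as in the diff):

```go
By("Checking activation file on all worker-cnf nodes")
cmd := []string{"ls", activation_file}
for i := range workerRTNodes {
	node := &workerRTNodes[i]
	out, err := nodes.ExecCommandOnNode(cmd, node)
	Expect(err).ToNot(HaveOccurred(),
		"activation file %s not found on node %s: %s", activation_file, node.Name, out)
}
```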
Addressed in the latest commit
By("Waiting for MCP being updated")
mcps.WaitForCondition(performanceMCP, machineconfigv1.MachineConfigPoolUpdated, corev1.ConditionTrue)
By("Checking Activation file")
cmd := []string{"ls", activation_file}
same here for linux commands
as mentioned above, we need to use the ls command to check that the activation file exists on the worker node.
for _, pid := range pidList {
	cmd := []string{"/bin/bash", "-c", fmt.Sprintf("grep Cpus_allowed_list /proc/%s/status | awk '{print $2}'", pid)}
	cpusOfovServices, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
	Expect(cpus == cpusOfovServices).To(BeTrue(), "affinity of ovn kube node pods(%s) do not match with ovservices(%s)", cpus, cpusOfovServices)
same, you can use To(Equal(..)) instead
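I.e., roughly:

```go
cpusOfovServices, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
Expect(err).ToNot(HaveOccurred())
// Equal prints both values on failure, unlike a bare BeTrue().
Expect(cpusOfovServices).To(Equal(cpus),
	"affinity of ovn kube node pods (%s) does not match ovs services (%s)", cpus, cpusOfovServices)
```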
Addressed in the latest commit
cmd := []string{"/bin/bash", "-c", fmt.Sprintf("grep Cpus_allowed_list /proc/%s/status | awk '{print $2}'", pid)}
cpusOfovServices, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
testlog.Infof("cpus used by ovs service %s is %s", pid, cpusOfovServices)
Expect(cpusOfovServices == ovnContainerCpus).To(BeTrue(), "affinity of ovn kube node pods(%s) do not match with ovservices(%s)", ovnContainerCpus, cpusOfovServices)
same, you can use To(Equal(..)) instead, and in other later places
Addressed in the latest commit
pidList, err := getOVSServicesPid(workerRTNode)
Expect(err).ToNot(HaveOccurred())
for _, pid := range pidList {
	cmd := []string{"/bin/bash", "-c", fmt.Sprintf("grep Cpus_allowed_list /proc/%s/status | awk '{print $2}'", pid)}
same here and in other places regarding the dependency on Linux commands
please note these commands are required to run on the worker node.
testlog.Info("Rebooting the node")
// Reboot the node; we do this by changing the NUMA topology policy,
// which is used here only to trigger the reboot.
policy := "best-effort"
// Make a change to the profile that causes a system reboot,
// then check whether the activation file is modified or deleted.
profile, err = profiles.GetByNodeLabels(testutils.NodeSelectorLabels)
Expect(err).ToNot(HaveOccurred(), "Unable to fetch latest performance profile")
currentPolicy := profile.Spec.NUMA.TopologyPolicy
if *currentPolicy == "best-effort" {
	policy = "restricted"
}
same here regarding the reboot: if the goal is to trigger a system reboot and not a PP-specific update, then I'd do a system reboot of the node the pod runs on. This 1. won't need a revert, 2. keeps the tests running on a consistent configuration (unless the opposite is intended), and 3. has less dependency, by directly rebooting the node instead of a workaround.
As mentioned above, using systemctl reboot or rebooting the node from a pod will cause unspecified behavior.
I think I have already addressed why doing a system reboot is a bad idea.
	}
})
AfterEach(func() {
	By("Reverting the Profile")
same regarding the revert vs reboot
addressed above.
/test e2e-gcp-pao-updating-profile
/retest-required
1 similar comment
/retest-required
Force-pushed from b0f5e79 to 413e577
@mrniranjan: This pull request references Jira Issue OCPBUGS-20368, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira (nkononov@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
/retest-required
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
workerRTNode *corev1.Node
workerRTNodes []corev1.Node
profile, initialProfile *performancev2.PerformanceProfile
activation_file string = "/rootfs/var/lib/ovn-ic/etc/enable_dynamic_cpu_affinity"
If this path is not changing, it can be a const.
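I.e. (idiomatic Go naming shown; the diff uses activation_file):

```go
// The path never changes at runtime, so a const documents that intent.
const activationFile = "/rootfs/var/lib/ovn-ic/etc/enable_dynamic_cpu_affinity"
```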
Addressed in the latest commit
})

BeforeEach(func() {
	if discovery.Enabled() && testutils.ProfileNotFound {
Why do we still check this in BeforeEach?
Expect(testclient.Client.Patch(context.TODO(), profile,
	client.RawPatch(
		types.JSONPatchType,
		[]byte(fmt.Sprintf(`[{ "op": "replace", "path": "/spec", "value": %s }]`, spec)),
please address
	return
}

err = testclient.Client.Delete(context.TODO(), testpod)
You are creating a new context here. You should pass ctx (the first argument of the deleteTestPod function).
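I.e., something like this, with the signature and surrounding error handling assumed from the snippet above:

```go
func deleteTestPod(ctx context.Context, testpod *corev1.Pod) {
	// Reuse the caller's context instead of minting a new one,
	// so cancellation and deadlines propagate.
	err := testclient.Client.Delete(ctx, testpod)
	if errors.IsNotFound(err) {
		return
	}
	Expect(err).ToNot(HaveOccurred())
}
```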
options := &client.ListOptions{
	Namespace: "openshift-ovn-kubernetes",
}
err := testclient.Client.List(context.TODO(), ovnpods, options)
Please change getOvnPod to take a context as its first argument and pass it to the Client.List call.
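A sketch of the suggested signature (the return shape is assumed):

```go
func getOvnPod(ctx context.Context) (*corev1.PodList, error) {
	ovnpods := &corev1.PodList{}
	options := &client.ListOptions{
		Namespace: "openshift-ovn-kubernetes",
	}
	// Propagate the caller's context into the List call.
	if err := testclient.Client.List(ctx, ovnpods, options); err != nil {
		return nil, err
	}
	return ovnpods, nil
}
```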
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
/test e2e-gcp-pao-updating-profile
1 similar comment
/test e2e-gcp-pao-updating-profile
…fficient cpus Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
/retest-required
1 similar comment
/retest-required
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
/retest-required
Expect(err).ToNot(HaveOccurred())

cgfs, err := nodes.GetCgroupFs(workerRTNode)
if cgfs == "tmpfs" {
tmpfs? really?
I guess it comes from https://kubernetes.io/docs/concepts/architecture/cgroups/#check-cgroup-version
Yes, on cgroup v1, /sys/fs/cgroup/ is tmpfs.
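For reference, a sketch of the distinction: on cgroup v1 hosts, /sys/fs/cgroup is a tmpfs holding per-controller mounts, while on v2 it is a single cgroup2 filesystem, so the filesystem type identifies the version (GetCgroupFs is the helper from the diff; "cgroup2fs" is the type name the linked k8s docs show):

```go
cgroupFsType, err := nodes.GetCgroupFs(workerRTNode)
Expect(err).ToNot(HaveOccurred())
switch cgroupFsType {
case "tmpfs":
	// cgroup v1: controllers mounted individually under /sys/fs/cgroup
case "cgroup2fs":
	// cgroup v2: unified hierarchy
}
```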
@MarSik I have completely removed this check; the whole automation patch for now will be compatible with cgroup v1 only. I will do the cgroup v2 changes in a separate PR, as I need more time to test.
@MarSik Requesting your review. Regarding the cgroup v2 changes, I would like to take some time and test properly before sending any PR.
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
/retest-required
I am pretty much OK with how it looks now. I can still see possible improvements (cgroups v2, race prevention, etc.); however, we need better test infra for some of that.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: MarSik, mrniranjan.
@mrniranjan: all tests passed!
@mrniranjan: Jira Issue OCPBUGS-20368: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-20368 has been moved to the MODIFIED state.
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202312201352.p0.gcd7cf63.assembly.stream for distgit cluster-node-tuning-operator.
/cherry-pick release-4.15
@mrniranjan: new pull request created: #904
No description provided.