
OCPBUGS-20368: E2E: Add tests for Dynamic ovs pinning #746

Merged
merged 15 commits into openshift:master from the dynamic_ovs branch on Dec 20, 2023

Conversation

mrniranjan
Contributor

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 3, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 3, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@mrniranjan mrniranjan marked this pull request as ready for review September 4, 2023 13:02
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 4, 2023
@mrniranjan mrniranjan force-pushed the dynamic_ovs branch 2 times, most recently from 3b506a2 to 9bc45e6 Compare September 4, 2023 13:37
Contributor

@Tal-or Tal-or left a comment

Initial review.

"k8s.io/utils/pointer"
"sigs.k8s.io/controller-runtime/pkg/client"

"embed"
Contributor

should go up with the other built-in deps
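For illustration, a minimal sketch of the grouping being asked for, with the standard-library packages (including embed) in one block above the external modules; the exact set of imports shown here is only an example:

```go
import (
	// Standard library ("built-in") packages first.
	"embed"
	"fmt"

	// External dependencies below.
	"k8s.io/utils/pointer"
	"sigs.k8s.io/controller-runtime/pkg/client"
)
```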

Contributor Author

Addressed in the latest commit

})

AfterAll(func() {
By("Removing the crio fix workaround")
Contributor

Can we add a comment that explains the workaround?

Contributor Author

Addressed in the latest commit

performanceMCP string
//go:embed scripts/*
Scripts embed.FS
)

var _ = Describe("[performance] Cgroups and affinity", Ordered, func() {
var onlineCPUSet cpuset.CPUSet

testutils.CustomBeforeAll(func() {
Contributor Author

Addressed in the latest commit

return pids, nil
}

/*// getCpuOfOvsServices returns cpus used by the ovs services ovs-vswitchd and ovs-dbserver
Contributor

please remove the commented code

Contributor Author

Addressed in the latest commit

@@ -95,3 +518,246 @@ func cpuSpecToString(cpus *performancev2.CPU) string {
}
return sb.String()
}

func createMachineConfig(profile *performancev2.PerformanceProfile) (*machineconfigv1.MachineConfig, error) {
Contributor

@Tal-or Tal-or Sep 4, 2023

Since it's test code that is used only once here, maybe it would be simpler to read a complete MC manifest using embed and just apply it instead?
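A rough sketch of that suggestion, assuming a hypothetical testdata/machineconfig.yaml embedded next to the test and decoded with sigs.k8s.io/yaml; the package name, file name, function name, and the machineconfigv1 import path are assumptions, not the code that was merged:

```go
package __performance // package name is an assumption

import (
	_ "embed"
	"fmt"

	machineconfigv1 "github.com/openshift/api/machineconfiguration/v1" // assumed import path
	"sigs.k8s.io/yaml"
)

//go:embed testdata/machineconfig.yaml
var machineConfigManifest []byte

// loadMachineConfig decodes the embedded MachineConfig manifest instead of
// building the object field by field in test code.
func loadMachineConfig() (*machineconfigv1.MachineConfig, error) {
	mc := &machineconfigv1.MachineConfig{}
	if err := yaml.Unmarshal(machineConfigManifest, mc); err != nil {
		return nil, fmt.Errorf("failed to decode embedded MachineConfig: %w", err)
	}
	return mc, nil
}
```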

@mrniranjan
Contributor Author

/test e2e-gcp-pao-updating-profile

@mrniranjan
Contributor Author

/test e2e-gcp-pao-workloadhints

@mrniranjan
Contributor Author

/retest-required

@mrniranjan
Contributor Author

/retest

@mrniranjan
Contributor Author

/retest-required

2 similar comments
@mrniranjan
Contributor Author

/retest-required

@mrniranjan
Contributor Author

/retest-required

Contributor

@shajmakh shajmakh left a comment

Thanks for the PR! I have left a few comments below for you.
It would also be good to add a short description to the PR, and to squash commits into single units where that fits.

Comment on lines 41 to 34
appsv1 "k8s.io/api/apps/v1"
"k8s.io/apimachinery/pkg/api/errors"
Contributor

Let's move these up to where all the k8s imports are.

Describe("[rfe_id: 64006][Dynamic OVS Pinning]", Ordered, func() {
Context("[Performance Profile applied]", func() {
It("[test_id:64097] Activation file is created", func() {
cmd := []string{"ls", activation_file}
Contributor

If possible, let's avoid creating dependencies on other tools such as the Linux commands ls, cat, and find. We can use Go binaries to fulfill what we need.

Contributor Author

This is being run on the worker-cnf node, so we need to use Linux tools.
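For context, the check under discussion executes inside the node through the suite's exec helper, so it is limited to tools present on RHCOS; a sketch using the helpers already in this file (stat instead of ls is just an illustrative alternative):

```go
// Sketch: fail the test if the activation file is missing on the node.
cmd := []string{"stat", activation_file}
out, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
Expect(err).ToNot(HaveOccurred(),
	"activation file %s not found on node %s: %s", activation_file, workerRTNode.Name, out)
```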

Contributor

you're right, I missed that.

Comment on lines +143 to +107
policy := "best-effort"
// Need to make some changes to pp , causing system reboot
// and check if activation files is modified or deleted
Contributor

Do you need the system reboot or the PP modification? The latter has a higher cost in execution time and can possibly affect other tests if not reverted properly.

TopologyPolicy: &policy,
}
By("Updating the performance profile")
profiles.UpdateWithRetry(profile)
Contributor

I see you revert the profile in AfterEach; why not add a defer here to revert it, since only one test modifies the profile?
OTOH, if you don't specifically need a PP update, then I believe a system reboot of the node the pod is running on would be more efficient and cheaper here.
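A sketch of the defer-style revert mentioned here, using Ginkgo v2's DeferCleanup together with the profiles and mcps helpers already used in this file; this is illustrative, not the code that was merged:

```go
By("Updating the performance profile")
initialSpec := profile.Spec.DeepCopy()
profiles.UpdateWithRetry(profile)

// Revert the profile when this spec finishes, however it exits.
DeferCleanup(func() {
	profile.Spec = *initialSpec
	profiles.UpdateWithRetry(profile)
	By("Waiting for MCP being updated after the revert")
	mcps.WaitForCondition(performanceMCP, machineconfigv1.MachineConfigPoolUpdated, corev1.ConditionTrue)
})
```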

Contributor Author

We can't run systemctl reboot from the pod; it isn't allowed.
For example:

[root@dell-r630-007 ~]# oc exec -it pods/machine-config-daemon-bvp6t -n openshift-machine-config-operator -- bash -c "systemctl reboot"
Defaulted container "machine-config-daemon" out of: machine-config-daemon, kube-rbac-proxy
Running in chroot, ignoring request: reboot

Also, yes, we are changing the profile to trigger a reboot, but this is the safest way to reboot, as all the utility functions involved are well tested and used extensively.

Contributor

Re: reboot, you can do it by jumping through some hoops: https://github.com/openshift/cluster-node-tuning-operator/blob/master/test/e2e/performanceprofile/functests/9_reboot/devices.go#L125

That said, I can't tell whether a reboot is the best approach here.

Contributor Author

I can follow that approach, but I don't see much benefit, and it relies on hard-coded timeout values that I can't guarantee will work on worker nodes, specifically on bare metal. Modifying the profile and waiting for the MCP to update seems safest to me.

Contributor Author

Also, executing commands through the MCD pod is fine as long as you need some output; executing commands like reboot, where you instantly lose the connection to the node, doesn't seem appropriate to me. As I mentioned above, I can follow the same method if required.

Contributor

I see your point, but there is a big difference in execution time between the PP update and a node reboot. When the PP is updated, it triggers a reboot on all nodes (usually no fewer than 2), and later the revert does the same, so the time is about 4x that of rebooting only the node the pod is running on. I do not expect the pod to be terminated on node reboot; otherwise it would likely be a bug.

Contributor Author

Yes, because in this case the activation file will exist on all the worker-cnf nodes once the PP is applied. If there are multiple worker-cnf nodes, we do want to reboot all of them.

Yes, it causes a delay, but it ensures the nodes are rebooted properly; by that I mean we give the MCP time to evict the pods, drain the node, and bring it back in an orderly fashion.

Contributor

Okay, so I see this involves all nodes. Following that, and correct me if I'm mistaken, I think it would be good to also verify the existence of the file not just on one node but on all relevant nodes, right?
ref:
https://github.com/openshift/cluster-node-tuning-operator/pull/746/files/23f53670b348a210197ec8d933c7db6fa48934b5#diff-ce7d5cbe8bfc6cc4efbf81602d5f238b37bafcbb68bf2c1ad66f1e1b148a33c5R163

Contributor Author

Addressed in the latest commit
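A sketch of the all-nodes variant discussed above, reusing workerRTNodes and the exec helper from this file; the loop body is illustrative:

```go
By("Checking the activation file on every worker-cnf node")
for i := range workerRTNodes {
	node := &workerRTNodes[i]
	cmd := []string{"ls", activation_file}
	out, err := nodes.ExecCommandOnNode(cmd, node)
	Expect(err).ToNot(HaveOccurred(),
		"activation file %s missing on node %s: %s", activation_file, node.Name, out)
}
```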

By("Waiting for MCP being updated")
mcps.WaitForCondition(performanceMCP, machineconfigv1.MachineConfigPoolUpdated, corev1.ConditionTrue)
By("Checking Activation file")
cmd := []string{"ls", activation_file}
Contributor

Same here regarding the Linux commands.

Contributor Author

As mentioned above, we need to use the ls command to check that the activation file exists on the worker node.

for _, pid := range pidList {
cmd := []string{"/bin/bash", "-c", fmt.Sprintf("grep Cpus_allowed_list /proc/%s/status | awk '{print $2}'", pid)}
cpusOfovServices, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
Expect(cpus == cpusOfovServices).To(BeTrue(), "affinity of ovn kube node pods(%s) do not match with ovservices(%s)", cpus, cpusOfovServices)
Contributor

Same here: you can use To(Equal(...)) instead.
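The matcher form being suggested, which also produces a more useful failure message than asserting a boolean comparison:

```go
Expect(cpusOfovServices).To(Equal(cpus),
	"affinity of ovn kube node pods (%s) does not match ovs services (%s)", cpus, cpusOfovServices)
```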

Contributor Author

Addressed in the latest commit

cmd := []string{"/bin/bash", "-c", fmt.Sprintf("grep Cpus_allowed_list /proc/%s/status | awk '{print $2}'", pid)}
cpusOfovServices, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
testlog.Infof("cpus used by ovs service %s is %s", pid, cpusOfovServices)
Expect(cpusOfovServices == ovnContainerCpus).To(BeTrue(), "affinity of ovn kube node pods(%s) do not match with ovservices(%s)", ovnContainerCpus, cpusOfovServices)
Contributor

Same here: you can use To(Equal(...)) instead, and in the other later places as well.

Contributor Author

Addressed in the latest commit

pidList, err := getOVSServicesPid(workerRTNode)
Expect(err).ToNot(HaveOccurred())
for _, pid := range pidList {
cmd := []string{"/bin/bash", "-c", fmt.Sprintf("grep Cpus_allowed_list /proc/%s/status | awk '{print $2}'", pid)}
Contributor

Same here, and in other places, regarding the dependency on Linux commands.

Contributor Author

Please note these commands are required to run on the worker node.

Comment on lines +469 to +401
testlog.Info("Rebooting the node")
// reboot the node, for that we change the numa policy to best-effort
// Note: this is used only to trigger reboot
policy := "best-effort"
// Need to make some changes to pp , causing system reboot
// and check if activation files is modified or deleted
profile, err = profiles.GetByNodeLabels(testutils.NodeSelectorLabels)
Expect(err).ToNot(HaveOccurred(), "Unable to fetch latest performance profile")
currentPolicy := profile.Spec.NUMA.TopologyPolicy
if *currentPolicy == "best-effort" {
policy = "restricted"
}
Contributor

Same here regarding reboot: if the goal is to trigger a system reboot and not a PP-specific update, then I'd do a system reboot of the node the pod runs on. This 1. doesn't need a revert, 2. keeps the tests running on a consistent configuration (unless the opposite is intended), and 3. has fewer dependencies, by directly rebooting the node instead of working around it.

Contributor Author

As mentioned above, using systemctl reboot or rebooting the node from a pod causes unspecified behavior.

Contributor Author

I think I have already addressed why doing a direct system reboot is a bad idea.

}
})
AfterEach(func() {
By("Reverting the Profile")
Contributor

Same here regarding the revert vs. reboot.

Contributor Author

Addressed above.

@mrniranjan
Contributor Author

/test e2e-gcp-pao-updating-profile

@mrniranjan
Contributor Author

/retest-required

1 similar comment
@mrniranjan
Contributor Author

/retest-required

@mrniranjan mrniranjan force-pushed the dynamic_ovs branch 3 times, most recently from b0f5e79 to 413e577 Compare October 10, 2023 09:25
@mrniranjan mrniranjan changed the title E2E: Add tests for Dynamic ovs pinning OCPBUGS-20368: E2E: Add tests for Dynamic ovs pinning Oct 11, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Oct 11, 2023
@openshift-ci-robot
Contributor

@mrniranjan: This pull request references Jira Issue OCPBUGS-20368, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (nkononov@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mrniranjan
Contributor Author

/retest-required

Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
workerRTNode *corev1.Node
workerRTNodes []corev1.Node
profile, initialProfile *performancev2.PerformanceProfile
activation_file string = "/rootfs/var/lib/ovn-ic/etc/enable_dynamic_cpu_affinity"
Contributor

If this file path does not change, it can be a const.
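What the reviewer is asking for, roughly (same path as in the snippet above):

```go
const activation_file = "/rootfs/var/lib/ovn-ic/etc/enable_dynamic_cpu_affinity"
```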

Contributor Author

Addressed in the latest commit

})

BeforeEach(func() {
if discovery.Enabled() && testutils.ProfileNotFound {
Contributor

Why do we still check it in BeforeEach?

Expect(testclient.Client.Patch(context.TODO(), profile,
client.RawPatch(
types.JSONPatchType,
[]byte(fmt.Sprintf(`[{ "op": "replace", "path": "/spec", "value": %s }]`, spec)),
Contributor

Please address.

return
}

err = testclient.Client.Delete(context.TODO(), testpod)
Contributor

@Tal-or Tal-or Dec 5, 2023

You are creating a new context here; you should pass ctx (the first argument of the deleteTestPod function) instead.

options := &client.ListOptions{
Namespace: "openshift-ovn-kubernetes",
}
err := testclient.Client.List(context.TODO(), ovnpods, options)
Contributor

Please change getOvnPod to take a context as its first argument and pass it to the Client.List call.
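A sketch of the requested signature change, threading the caller's ctx into Client.List instead of creating a fresh context.TODO(); the return type and node filtering here are assumptions, only the context handling mirrors the request:

```go
func getOvnPod(ctx context.Context, workerNode *corev1.Node) (*corev1.Pod, error) {
	ovnpods := &corev1.PodList{}
	options := &client.ListOptions{
		Namespace: "openshift-ovn-kubernetes",
	}
	// Pass the caller's context through instead of context.TODO().
	if err := testclient.Client.List(ctx, ovnpods, options); err != nil {
		return nil, err
	}
	for i := range ovnpods.Items {
		if ovnpods.Items[i].Spec.NodeName == workerNode.Name {
			return &ovnpods.Items[i], nil
		}
	}
	return nil, fmt.Errorf("no ovnkube pod found on node %s", workerNode.Name)
}
```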

Niranjan M.R added 3 commits December 6, 2023 16:25
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
@mrniranjan
Contributor Author

/test e2e-gcp-pao-updating-profile

1 similar comment
@mrniranjan
Contributor Author

/test e2e-gcp-pao-updating-profile

…fficient cpus

Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
@mrniranjan
Contributor Author

/retest-required

1 similar comment
@mrniranjan
Contributor Author

/retest-required

Niranjan M.R added 2 commits December 8, 2023 11:33
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
@mrniranjan
Contributor Author

/retest-required

Expect(err).ToNot(HaveOccurred())

cgfs, err := nodes.GetCgroupFs(workerRTNode)
if cgfs == "tmpfs" {
Contributor

tmpfs? really?


Contributor Author

Yes, on cgroup v1 the /sys/fs/cgroup mount is a tmpfs.
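For context, a sketch of how that can be observed on the node with the same exec helper; stat -f -c %T reports tmpfs for the cgroup v1 mount at /sys/fs/cgroup and cgroup2fs for cgroup v2. This is a sketch, not the GetCgroupFs helper used in the PR:

```go
cmd := []string{"stat", "-f", "-c", "%T", "/sys/fs/cgroup"}
fsType, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
Expect(err).ToNot(HaveOccurred())
if strings.TrimSpace(fsType) == "tmpfs" {
	// cgroup v1: per-controller hierarchies live under /sys/fs/cgroup/<controller>
} else {
	// cgroup v2: a single unified hierarchy ("cgroup2fs")
}
```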

Contributor Author

@MarSik I have completely removed this check. For now the whole automation patch will be compatible with cgroup v1 only; I will do the cgroup v2 changes in a separate PR, as I need more time to test.

Contributor Author

@MarSik Requesting review. Regarding the cgroup v2 changes, I would like to take some time and test properly before sending a PR.

Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
@mrniranjan
Contributor Author

/retest-required

Contributor

@MarSik MarSik left a comment

I am pretty much OK with how it looks now. I can still see possible improvements (cgroups v2, race prevention, etc.), but we need better test infra for some of that.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 20, 2023
Contributor

openshift-ci bot commented Dec 20, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MarSik, mrniranjan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 20, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 878a7db and 2 for PR HEAD 83791a4 in total

Contributor

openshift-ci bot commented Dec 20, 2023

@mrniranjan: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit cd7cf63 into openshift:master Dec 20, 2023
15 checks passed
@openshift-ci-robot
Contributor

@mrniranjan: Jira Issue OCPBUGS-20368: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-20368 has been moved to the MODIFIED state.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202312201352.p0.gcd7cf63.assembly.stream for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

@mrniranjan
Contributor Author

/cherry-pick release-4.15

@openshift-cherrypick-robot

@mrniranjan: new pull request created: #904

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
