
OCPBUGS-20368: E2E: Add tests for Dynamic ovs pinning #746

Merged
merged 15 commits into openshift:master from the dynamic_ovs branch on Dec 20, 2023

Conversation

mrniranjan
Contributor

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 3, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 3, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@mrniranjan mrniranjan marked this pull request as ready for review September 4, 2023 13:02
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 4, 2023
@mrniranjan mrniranjan force-pushed the dynamic_ovs branch 2 times, most recently from 3b506a2 to 9bc45e6 Compare September 4, 2023 13:37
Contributor

@Tal-or Tal-or left a comment

Initial review.

"k8s.io/utils/pointer"
"sigs.k8s.io/controller-runtime/pkg/client"

"embed"
Contributor

should go up with the other built-in deps
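For illustration, a minimal sketch of the grouping being asked for, with the standard-library packages (including embed) in one block above the external modules; the exact set of imports shown here is only an example:

```go
import (
	// Standard library ("built-in") packages first.
	"embed"
	"fmt"

	// External dependencies below.
	"k8s.io/utils/pointer"
	"sigs.k8s.io/controller-runtime/pkg/client"
)
```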

Contributor Author

Addressed in the latest commit

})

AfterAll(func() {
By("Removing the crio fix workaround")
Contributor

Can we add a comment that explains the workaround?

Contributor Author

Addressed in the latest commit

performanceMCP string
//go:embed scripts/*
Scripts embed.FS
)

var _ = Describe("[performance] Cgroups and affinity", Ordered, func() {
var onlineCPUSet cpuset.CPUSet

testutils.CustomBeforeAll(func() {
Contributor Author

Addressed in the latest commit

return pids, nil
}

/*// getCpuOfOvsServices returns cpus used by the ovs services ovs-vswitchd and ovs-dbserver
Contributor

please remove the commented code

Contributor Author

Addressed in the latest commit

@@ -95,3 +518,246 @@ func cpuSpecToString(cpus *performancev2.CPU) string {
}
return sb.String()
}

func createMachineConfig(profile *performancev2.PerformanceProfile) (*machineconfigv1.MachineConfig, error) {
Contributor

@Tal-or Tal-or Sep 4, 2023

Since it's test code that is used only once here, maybe it would be simpler to read a complete MC manifest using embed and just apply it instead?
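A rough sketch of that suggestion, assuming a hypothetical testdata/machineconfig.yaml embedded next to the test and decoded with sigs.k8s.io/yaml; the package name, file name, function name, and the machineconfigv1 import path are assumptions, not the code that was merged:

```go
package __performance // package name is an assumption

import (
	_ "embed"
	"fmt"

	machineconfigv1 "github.com/openshift/api/machineconfiguration/v1" // assumed import path
	"sigs.k8s.io/yaml"
)

//go:embed testdata/machineconfig.yaml
var machineConfigManifest []byte

// loadMachineConfig decodes the embedded MachineConfig manifest instead of
// building the object field by field in test code.
func loadMachineConfig() (*machineconfigv1.MachineConfig, error) {
	mc := &machineconfigv1.MachineConfig{}
	if err := yaml.Unmarshal(machineConfigManifest, mc); err != nil {
		return nil, fmt.Errorf("failed to decode embedded MachineConfig: %w", err)
	}
	return mc, nil
}
```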

@mrniranjan
Contributor Author

/test e2e-gcp-pao-updating-profile

@mrniranjan
Contributor Author

/test e2e-gcp-pao-workloadhints

@mrniranjan
Contributor Author

/retest-required

@mrniranjan
Contributor Author

/retest

@mrniranjan
Contributor Author

/retest-required

2 similar comments
@mrniranjan
Contributor Author

/retest-required

@mrniranjan
Contributor Author

/retest-required

Contributor

@shajmakh shajmakh left a comment

Thanks for the PR! I have left a few comments below for you.
It would also be good to add a short description to the PR, and to squash commits into single units where that fits.

Comment on lines 41 to 34
appsv1 "k8s.io/api/apps/v1"
"k8s.io/apimachinery/pkg/api/errors"
Contributor

Let's move these up to where all the k8s imports are.

Describe("[rfe_id: 64006][Dynamic OVS Pinning]", Ordered, func() {
Context("[Performance Profile applied]", func() {
It("[test_id:64097] Activation file is created", func() {
cmd := []string{"ls", activation_file}
Contributor

If possible, let's avoid creating dependencies on other tools such as the Linux commands ls, cat, and find. We can use Go binaries to fulfill what we need.

Contributor Author

This is being run on the worker-cnf node, so we need to use Linux tools.
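For context, the check under discussion executes inside the node through the suite's exec helper, so it is limited to tools present on RHCOS; a sketch using the helpers already in this file (stat instead of ls is just an illustrative alternative):

```go
// Sketch: fail the test if the activation file is missing on the node.
cmd := []string{"stat", activation_file}
out, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
Expect(err).ToNot(HaveOccurred(),
	"activation file %s not found on node %s: %s", activation_file, workerRTNode.Name, out)
```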

Contributor

you're right, I missed that.

Comment on lines +143 to +107
policy := "best-effort"
// Need to make some changes to pp , causing system reboot
// and check if activation files is modified or deleted
Contributor

Do you need the system reboot or the PP modification? The latter has a higher cost in execution time and can possibly affect other tests if not reverted properly.

TopologyPolicy: &policy,
}
By("Updating the performance profile")
profiles.UpdateWithRetry(profile)
Contributor

I see you revert the profile in AfterEach; why not add a defer here to revert it, since only one test modifies the profile?
OTOH, if you don't specifically need a PP update, then I believe a system reboot of the node the pod is running on would be more efficient and cheaper here.
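A sketch of the defer-style revert mentioned here, using Ginkgo v2's DeferCleanup together with the profiles and mcps helpers already used in this file; this is illustrative, not the code that was merged:

```go
By("Updating the performance profile")
initialSpec := profile.Spec.DeepCopy()
profiles.UpdateWithRetry(profile)

// Revert the profile when this spec finishes, however it exits.
DeferCleanup(func() {
	profile.Spec = *initialSpec
	profiles.UpdateWithRetry(profile)
	By("Waiting for MCP being updated after the revert")
	mcps.WaitForCondition(performanceMCP, machineconfigv1.MachineConfigPoolUpdated, corev1.ConditionTrue)
})
```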

Contributor Author

We can't run systemctl reboot from the pod; it isn't allowed.
For example:

[root@dell-r630-007 ~]# oc exec -it pods/machine-config-daemon-bvp6t -n openshift-machine-config-operator -- bash -c "systemctl reboot"
Defaulted container "machine-config-daemon" out of: machine-config-daemon, kube-rbac-proxy
Running in chroot, ignoring request: reboot

Also, yes, we are changing the profile to trigger a reboot, but this is the safest way to reboot, as all the utility functions involved are well tested and used extensively.

Contributor

Re: reboot, you can do it by jumping through some hoops: https://github.com/openshift/cluster-node-tuning-operator/blob/master/test/e2e/performanceprofile/functests/9_reboot/devices.go#L125

That said, I can't tell whether a reboot is the best approach here.

Contributor Author

I can follow that approach, but I don't see much benefit, and it relies on hard-coded timeout values that I can't guarantee will work on worker nodes, specifically on bare metal. Modifying the profile and waiting for the MCP to update seems safest to me.

Contributor Author

Also, executing commands through the MCD pod is fine as long as you need some output; executing commands like reboot, where you instantly lose the connection to the node, doesn't seem appropriate to me. As I mentioned above, I can follow the same method if required.

Contributor

I see your point, but there is a big difference in execution time between the PP update and a node reboot. When the PP is updated, it triggers a reboot on all nodes (usually no fewer than 2), and later the revert does the same, so the time is about 4x that of rebooting only the node the pod is running on. I do not expect the pod to be terminated on node reboot; otherwise it would likely be a bug.

Contributor Author

Yes, because in this case the activation file will exist on all the worker-cnf nodes once the PP is applied. If there are multiple worker-cnf nodes, we do want to reboot all of them.

Yes, it causes a delay, but it ensures the nodes are rebooted properly; by that I mean we give the MCP time to evict the pods, drain the node, and bring it back in an orderly fashion.

Contributor

Okay, so I see this involves all nodes. Following that, and correct me if I'm mistaken, I think it would be good to also verify the existence of the file not just on one node but on all relevant nodes, right?
ref:
https://github.com/openshift/cluster-node-tuning-operator/pull/746/files/23f53670b348a210197ec8d933c7db6fa48934b5#diff-ce7d5cbe8bfc6cc4efbf81602d5f238b37bafcbb68bf2c1ad66f1e1b148a33c5R163

Contributor Author

Addressed in the latest commit
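A sketch of the all-nodes variant discussed above, reusing workerRTNodes and the exec helper from this file; the loop body is illustrative:

```go
By("Checking the activation file on every worker-cnf node")
for i := range workerRTNodes {
	node := &workerRTNodes[i]
	cmd := []string{"ls", activation_file}
	out, err := nodes.ExecCommandOnNode(cmd, node)
	Expect(err).ToNot(HaveOccurred(),
		"activation file %s missing on node %s: %s", activation_file, node.Name, out)
}
```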

By("Waiting for MCP being updated")
mcps.WaitForCondition(performanceMCP, machineconfigv1.MachineConfigPoolUpdated, corev1.ConditionTrue)
By("Checking Activation file")
cmd := []string{"ls", activation_file}
Contributor

Same here regarding the Linux commands.

Contributor Author

As mentioned above, we need to use the ls command to check that the activation file exists on the worker node.

for _, pid := range pidList {
cmd := []string{"/bin/bash", "-c", fmt.Sprintf("grep Cpus_allowed_list /proc/%s/status | awk '{print $2}'", pid)}
cpusOfovServices, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
Expect(cpus == cpusOfovServices).To(BeTrue(), "affinity of ovn kube node pods(%s) do not match with ovservices(%s)", cpus, cpusOfovServices)
Contributor

Same here: you can use To(Equal(...)) instead.
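The matcher form being suggested, which also produces a more useful failure message than asserting a boolean comparison:

```go
Expect(cpusOfovServices).To(Equal(cpus),
	"affinity of ovn kube node pods (%s) does not match ovs services (%s)", cpus, cpusOfovServices)
```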

Contributor Author

Addressed in the latest commit

cmd := []string{"/bin/bash", "-c", fmt.Sprintf("grep Cpus_allowed_list /proc/%s/status | awk '{print $2}'", pid)}
cpusOfovServices, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
testlog.Infof("cpus used by ovs service %s is %s", pid, cpusOfovServices)
Expect(cpusOfovServices == ovnContainerCpus).To(BeTrue(), "affinity of ovn kube node pods(%s) do not match with ovservices(%s)", ovnContainerCpus, cpusOfovServices)
Contributor

Same here: you can use To(Equal(...)) instead, and in the other later places as well.

Contributor Author

Addressed in the latest commit

pidList, err := getOVSServicesPid(workerRTNode)
Expect(err).ToNot(HaveOccurred())
for _, pid := range pidList {
cmd := []string{"/bin/bash", "-c", fmt.Sprintf("grep Cpus_allowed_list /proc/%s/status | awk '{print $2}'", pid)}
Contributor

Same here, and in other places, regarding the dependency on Linux commands.

Contributor Author

Please note these commands are required to run on the worker node.

Comment on lines +469 to +401
testlog.Info("Rebooting the node")
// reboot the node, for that we change the numa policy to best-effort
// Note: this is used only to trigger reboot
policy := "best-effort"
// Need to make some changes to pp , causing system reboot
// and check if activation files is modified or deleted
profile, err = profiles.GetByNodeLabels(testutils.NodeSelectorLabels)
Expect(err).ToNot(HaveOccurred(), "Unable to fetch latest performance profile")
currentPolicy := profile.Spec.NUMA.TopologyPolicy
if *currentPolicy == "best-effort" {
policy = "restricted"
}
Contributor

Same here regarding reboot: if the goal is to trigger a system reboot and not a PP-specific update, then I'd do a system reboot of the node the pod runs on. This 1. doesn't need a revert, 2. keeps the tests running on a consistent configuration (unless the opposite is intended), and 3. has fewer dependencies, by directly rebooting the node instead of working around it.

Contributor Author

As mentioned above, using systemctl reboot or rebooting the node from a pod causes unspecified behavior.

Contributor Author

I think I have already addressed why doing a direct system reboot is a bad idea.

}
})
AfterEach(func() {
By("Reverting the Profile")
Contributor

Same here regarding the revert vs. reboot.

Contributor Author

Addressed above.

@mrniranjan
Contributor Author

/test e2e-gcp-pao-updating-profile

@mrniranjan
Contributor Author

/retest-required

1 similar comment
@mrniranjan
Contributor Author

/retest-required

@mrniranjan mrniranjan force-pushed the dynamic_ovs branch 3 times, most recently from b0f5e79 to 413e577 Compare October 10, 2023 09:25
@mrniranjan mrniranjan changed the title E2E: Add tests for Dynamic ovs pinning OCPBUGS-20368: E2E: Add tests for Dynamic ovs pinning Oct 11, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Oct 11, 2023
@openshift-ci-robot
Contributor

@mrniranjan: This pull request references Jira Issue OCPBUGS-20368, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (nkononov@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mrniranjan
Contributor Author

/retest-required

Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
workerRTNode *corev1.Node
workerRTNodes []corev1.Node
profile, initialProfile *performancev2.PerformanceProfile
activation_file string = "/rootfs/var/lib/ovn-ic/etc/enable_dynamic_cpu_affinity"
Contributor

If this file path does not change, it can be a const.
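What the reviewer is asking for, roughly (same path as in the snippet above):

```go
const activation_file = "/rootfs/var/lib/ovn-ic/etc/enable_dynamic_cpu_affinity"
```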

Contributor Author

Addressed in the latest commit

})

BeforeEach(func() {
if discovery.Enabled() && testutils.ProfileNotFound {
Contributor

Why do we still check it in BeforeEach?

Expect(testclient.Client.Patch(context.TODO(), profile,
client.RawPatch(
types.JSONPatchType,
[]byte(fmt.Sprintf(`[{ "op": "replace", "path": "/spec", "value": %s }]`, spec)),
Contributor

Please address.

return
}

err = testclient.Client.Delete(context.TODO(), testpod)
Contributor

@Tal-or Tal-or Dec 5, 2023

You are creating a new context here; you should pass ctx (the first argument of the deleteTestPod function) instead.

options := &client.ListOptions{
Namespace: "openshift-ovn-kubernetes",
}
err := testclient.Client.List(context.TODO(), ovnpods, options)
Contributor

Please change getOvnPod to take a context as its first argument and pass it to the Client.List call.
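A sketch of the requested signature change, threading the caller's ctx into Client.List instead of creating a fresh context.TODO(); the return type and node filtering here are assumptions, only the context handling mirrors the request:

```go
func getOvnPod(ctx context.Context, workerNode *corev1.Node) (*corev1.Pod, error) {
	ovnpods := &corev1.PodList{}
	options := &client.ListOptions{
		Namespace: "openshift-ovn-kubernetes",
	}
	// Pass the caller's context through instead of context.TODO().
	if err := testclient.Client.List(ctx, ovnpods, options); err != nil {
		return nil, err
	}
	for i := range ovnpods.Items {
		if ovnpods.Items[i].Spec.NodeName == workerNode.Name {
			return &ovnpods.Items[i], nil
		}
	}
	return nil, fmt.Errorf("no ovnkube pod found on node %s", workerNode.Name)
}
```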

Niranjan M.R added 3 commits December 6, 2023 16:25
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
@mrniranjan
Contributor Author

/test e2e-gcp-pao-updating-profile

1 similar comment
@mrniranjan
Contributor Author

/test e2e-gcp-pao-updating-profile

…fficient cpus

Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
@mrniranjan
Contributor Author

/retest-required

1 similar comment
@mrniranjan
Contributor Author

/retest-required

Niranjan M.R added 2 commits December 8, 2023 11:33
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
@mrniranjan
Contributor Author

/retest-required

Expect(err).ToNot(HaveOccurred())

cgfs, err := nodes.GetCgroupFs(workerRTNode)
if cgfs == "tmpfs" {
Contributor

tmpfs? really?


Contributor Author

Yes, on cgroup v1 the /sys/fs/cgroup mount is a tmpfs.
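For context, a sketch of how that can be observed on the node with the same exec helper; stat -f -c %T reports tmpfs for the cgroup v1 mount at /sys/fs/cgroup and cgroup2fs for cgroup v2. This is a sketch, not the GetCgroupFs helper used in the PR:

```go
cmd := []string{"stat", "-f", "-c", "%T", "/sys/fs/cgroup"}
fsType, err := nodes.ExecCommandOnNode(cmd, workerRTNode)
Expect(err).ToNot(HaveOccurred())
if strings.TrimSpace(fsType) == "tmpfs" {
	// cgroup v1: per-controller hierarchies live under /sys/fs/cgroup/<controller>
} else {
	// cgroup v2: a single unified hierarchy ("cgroup2fs")
}
```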

Contributor Author

@MarSik I have completely removed this check. For now the whole automation patch will be compatible with cgroup v1 only; I will do the cgroup v2 changes in a separate PR, as I need more time to test.

Contributor Author

@MarSik Requesting review. Regarding the cgroup v2 changes, I would like to take some time and test properly before sending a PR.

Signed-off-by: Niranjan M.R <mrniranjan@redhat.com>
@mrniranjan
Contributor Author

/retest-required

Contributor

@MarSik MarSik left a comment

I am pretty much OK with how it looks now. I can still see possible improvements (cgroups v2, race prevention, etc.), but we need better test infra for some of that.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 20, 2023
Contributor

openshift-ci bot commented Dec 20, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MarSik, mrniranjan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 20, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 878a7db and 2 for PR HEAD 83791a4 in total

Contributor

openshift-ci bot commented Dec 20, 2023

@mrniranjan: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit cd7cf63 into openshift:master Dec 20, 2023
15 checks passed
@openshift-ci-robot
Contributor

@mrniranjan: Jira Issue OCPBUGS-20368: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-20368 has been moved to the MODIFIED state.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202312201352.p0.gcd7cf63.assembly.stream for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

@mrniranjan
Contributor Author

/cherry-pick release-4.15

@openshift-cherrypick-robot

@mrniranjan: new pull request created: #904

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
