
OCPBUGS-7980: e2e:wait: return updated pod object explicitly #744

Merged
merged 1 commit into from Aug 7, 2023

Conversation

Tal-or
Contributor

@Tal-or Tal-or commented Aug 2, 2023

In some of the tests we implicitly rely on the fact that the pod data returned from the pods.WaitFor* functions is the most up-to-date one.

This is not the case in some of the functions, where we store the data in a different pointer to a pod object and never return it.

Instead of using a different pod object, we should use the same one passed into the function.
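To illustrate the failure mode, here is a minimal sketch with stand-in types (the real helpers operate on corev1.Pod via a Kubernetes client; all names here are hypothetical): fresh data is fetched into a new local object and never propagated, so the caller's pod silently stays stale.

```go
package main

import "fmt"

// Pod is a stand-in for corev1.Pod, for illustration only.
type Pod struct {
	Name  string
	Phase string
}

// fakeGet simulates a client Get returning fresh cluster state.
func fakeGet(name string) *Pod {
	return &Pod{Name: name, Phase: "Running"}
}

// waitForPhaseBuggy fetches fresh data into a new local object and never
// propagates it, so the caller's pod keeps its stale fields.
func waitForPhaseBuggy(pod *Pod, phase string) error {
	updated := fakeGet(pod.Name) // fresh data lands here...
	if updated.Phase == phase {
		return nil // ...but is dropped: neither returned nor copied back
	}
	return fmt.Errorf("pod %s not in phase %s", pod.Name, phase)
}

func main() {
	pod := &Pod{Name: "test-pod", Phase: "Pending"}
	if err := waitForPhaseBuggy(pod, "Running"); err != nil {
		panic(err)
	}
	fmt.Println(pod.Phase) // still "Pending", even though the wait succeeded
}
```

The wait reports success, yet the caller's object still says "Pending" — exactly the implicit-freshness assumption the PR fixes.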

@Tal-or
Contributor Author

Tal-or commented Aug 2, 2023

Thank you @ffromani for helping with this

Contributor

@ffromani ffromani left a comment

The intent of the change is fine.

I'm reluctant to merge as-is, as I'm generally against mutating arguments if we can help it. I'd prefer to return the updated object; even if this change is more invasive, I think it is also clearer in the long run and leads to better code.

Also note that by far the most likely reason we have a pointer to pod as the first argument (which does indeed mean the function can mutate the argument) is that we were concerned about performance in argument passing.

IOW, we pass a pointer not because the function was meant to mutate it (that was never really documented or decided) but as an optimization.

@ffromani
Contributor

ffromani commented Aug 2, 2023

/cc @MarSik @yanirq @swatisehgal

@Tal-or
Contributor Author

Tal-or commented Aug 2, 2023

The intent of the change is fine.

I'm reluctant to merge as-is, as I'm generally against mutating arguments if we can help it. I'd prefer to return the updated object; even if this change is more invasive, I think it is also clearer in the long run and leads to better code.

Also note that by far the most likely reason we have a pointer to pod as the first argument (which does indeed mean the function can mutate the argument) is that we were concerned about performance in argument passing.

IOW, we pass a pointer not because the function was meant to mutate it (that was never really documented or decided) but as an optimization.

Yes, I completely understand your intention.
I mutate the pointer itself because:

  1. I wanted to avoid changing the function signatures, because as you said it would require a more invasive change (maybe in cnf-feature-deploy as well?)
  2. Mutating the pod pointer is common and a controller-runtime-like approach.

Anyway, I'm OK with whatever approach.

@ffromani
Contributor

ffromani commented Aug 2, 2023

1. I wanted to avoid changing the function signatures, because as you said it would require a more invasive change (maybe in cnf-feature-deploy as well?)

2. Mutating the pod pointer is common and a controller-runtime-like approach.

You make good points. Even if I'm not a fan of this approach, I won't block it either. I've tagged the other reviewers, let's see if we have a preference collectively.

@Tal-or
Contributor Author

Tal-or commented Aug 2, 2023

/test e2e-hypershift

@MarSik
Contributor

MarSik commented Aug 3, 2023

I wish we had an explicit way of marking the pointer as mutable (&mut Pod from Rust, anyone? :)
I am OK with the approach, but I want to see that written in the doctext for each of the functions.

@Tal-or
Contributor Author

Tal-or commented Aug 3, 2023

I wish we had an explicit way of marking the pointer as mutable (&mut Pod from Rust, anyone? :) I am OK with the approach, but I want to see that written in the doctext for each of the functions.

const pod *corev1.Pod :)

@yanirq
Contributor

yanirq commented Aug 3, 2023

1. I wanted to avoid changing the function signatures, because as you said it would require a more invasive change (maybe in cnf-feature-deploy as well?)

2. Mutating the pod pointer is common and a controller-runtime-like approach.

You make good points. Even if I'm not a fan of this approach, I won't block it either. I've tagged the other reviewers, let's see if we have a preference collectively.

I would not rule out the "invasive" approach yet; can we check to what extent it is invasive? (Maybe it's not that large.)

@ffromani
Contributor

ffromani commented Aug 3, 2023

Rough guesstimation:

$ git log -1
commit 0e3f9fca2f798654e3a0e8758241a299b27dae8d (HEAD -> master, upstream/release-4.15, upstream/release-4.14, upstream/master)
Author: SargunNarula <57625241+SargunNarula@users.noreply.github.com>
Date:   Wed Aug 2 18:14:03 2023 +0530

    e2e: remove image parameter from must gather (#743)
$ git grep -w WaitForDeletion | grep -v vendor | grep pods | wc -l
7
$ git grep -w WaitForCondition | grep -v vendor | grep pods | wc -l
15
$ git grep -w WaitForPhase | grep -v vendor | grep pods | wc -l
3

Contributor

@swatisehgal swatisehgal left a comment

My thought on this is that for operations like the ones here (WaitForDeletion, WaitForCondition and WaitForPhase) it seems inappropriate to assume that the pod object is mutated (and is fresher than the object passed in). Mutation should not be part of the equation here; the focus should be on waiting until the pod is deleted, meets a condition, or attains a certain phase, respectively. Ideally we should have had a function definition where we passed the pod name instead of a pointer to a pod object, which would have made it explicit that we don't expect the pod object to be mutated.

If the caller relies on a fresh pod object, they should obtain it explicitly. The current changes seem to further enforce this (incorrect?) assumption rather than correct it or make it explicit. Now the question is: are there many places where we have this assumption? IOW, what is the impact surface if we wanted to rectify this assumption of obtaining up-to-date pod objects when these functions are called?

If the motivating factor here is to ensure clarity and to facilitate the consumers of the utils/pods package, how about we add a function to explicitly obtain an updated pod object, like the one below, and use that where we expect to have a fresh pod object:

func GetPod(podName, namespace string) (*corev1.Pod, error) {
    .....
}

I am curious about what exactly the reason is that we want to have the latest pod object. Do we expect the pod object to change here? In theory I can see a few use cases (in-place VPA allowing resource updates; updates to labels, taints, tolerations, etc.), but I would like to understand it better in the context of the use cases we care about and our test suite.
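The suggested helper could be sketched roughly like this (with an in-memory stand-in for the API server; the real version would issue a client Get for a corev1.Pod, and all names here are illustrative):

```go
package main

import "fmt"

// Pod is a stand-in for corev1.Pod; cluster fakes API-server state.
type Pod struct {
	Namespace, Name string
	Phase           string
}

var cluster = map[string]*Pod{
	"ns1/web": {Namespace: "ns1", Name: "web", Phase: "Running"},
}

// GetPod fetches a fresh copy by name, making "give me the latest object"
// an explicit call rather than a hidden side effect of a Wait* helper.
func GetPod(namespace, name string) (*Pod, error) {
	p, ok := cluster[namespace+"/"+name]
	if !ok {
		return nil, fmt.Errorf("pod not found: %s/%s", namespace, name)
	}
	cp := *p // return a copy so callers cannot mutate shared state
	return &cp, nil
}

func main() {
	pod, err := GetPod("ns1", "web")
	if err != nil {
		panic(err)
	}
	fmt.Println(pod.Phase) // "Running"
}
```

The point of the design is that freshness becomes an explicit, separate operation the caller opts into, instead of an undocumented property of the wait functions.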

@ffromani
Contributor

ffromani commented Aug 4, 2023

My thought on this is that for operations like the ones here (WaitForDeletion, WaitForCondition and WaitForPhase) it seems inappropriate to assume that the pod object is mutated (and is fresher than the object passed in). Mutation should not be part of the equation here; the focus should be on waiting until the pod is deleted, meets a condition, or attains a certain phase, respectively. Ideally we should have had a function definition where we passed the pod name instead of a pointer to a pod object, which would have made it explicit that we don't expect the pod object to be mutated.

I fully agree. It would be much better if the functions accept the pod identification data (= ObjectKey or namespace/name pair) rather than the full pod object.

If the caller relies on a fresh pod object, they should obtain it explicitly. The current changes seem to further enforce this (incorrect?) assumption rather than correct it or make it explicit. Now the question is: are there many places where we have this assumption? IOW, what is the impact surface if we wanted to rectify this assumption of obtaining up-to-date pod objects when these functions are called?

This would be indeed the most explicit approach.

If the motivating factor here is to ensure clarity and to facilitate the consumers of the utils/pods package, how about we add a function to explicitly obtain an updated pod object, like the one below, and use that where we expect to have a fresh pod object:

func GetPod(podName, namespace string) (*corev1.Pod, error) {
    .....
}

I am curious about what exactly the reason is that we want to have the latest pod object. Do we expect the pod object to change here? In theory I can see a few use cases (in-place VPA allowing resource updates; updates to labels, taints, tolerations, etc.), but I would like to understand it better in the context of the use cases we care about and our test suite.

It's to save a Get, which is not really costly, but redundant (yes, this is a form of premature optimization).

In a nutshell, some Wait* functions return a copy of the object after the waited-for condition is met, effectively saving the caller a Get. This is both an optimization and an implementation-detail leak (the Wait* functions are all essentially a Get plus a condition check wrapped in a poll loop with a timeout), but the saving grace is that it still roughly fits the semantics, albeit not the cleanest approach.

@Tal-or
Contributor Author

Tal-or commented Aug 6, 2023

Then it's settled; we have a decision. Please review the latest patch.

@Tal-or Tal-or changed the title e2e:wait: store fresh info in pod pointer e2e:wait: return updated pod object explicitly Aug 6, 2023
In some of the tests we implicitly rely on the fact
that the pod data returned from the `pods.WaitFor*` functions is
the most up-to-date one.

This is not the case in some of the functions, where
we store the data in a different pointer to a pod object and
never return it.

On the other hand, those wait functions changing the passed object
is a hidden implementation detail.

Now the functions accept only a key object, in order to explicitly
indicate that the passed object is not mutated.

In addition, the functions return an updated copy of the object.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
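The reworked shape the commit describes — a key in, an updated copy out — might look roughly like this (stand-in types; the real functions take an ObjectKey-style namespace/name pair and return a *corev1.Pod, and all names here are illustrative):

```go
package main

import "fmt"

// ObjectKey mirrors the namespace/name pair the reworked helpers accept;
// Pod stands in for corev1.Pod.
type ObjectKey struct {
	Namespace, Name string
}

type Pod struct {
	Namespace, Name string
	Phase           string
}

// fakeGet simulates fetching fresh state from the API server.
func fakeGet(key ObjectKey) *Pod {
	return &Pod{Namespace: key.Namespace, Name: key.Name, Phase: "Running"}
}

// WaitForPodPhase takes only a key, so it visibly cannot mutate any caller
// object, and it returns an updated copy for callers that need fresh data.
func WaitForPodPhase(key ObjectKey, phase string) (*Pod, error) {
	updated := fakeGet(key)
	if updated.Phase != phase {
		return nil, fmt.Errorf("pod %s/%s not in phase %s", key.Namespace, key.Name, phase)
	}
	return updated, nil
}

func main() {
	pod, err := WaitForPodPhase(ObjectKey{Namespace: "ns1", Name: "web"}, "Running")
	if err != nil {
		panic(err)
	}
	fmt.Println(pod.Phase) // "Running"
}
```

With a value-type key as the only input, the no-mutation guarantee is enforced by the signature itself rather than by documentation.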
@Tal-or
Contributor Author

Tal-or commented Aug 6, 2023

/test e2e-gcp-pao
A new flake/failure, which we won't investigate as part of this PR:

[rfe_id:27368][performance] Verify API Conversions  [JustBeforeEach] when the GloballyDisableIrqLoadBalancing field set to false should preserve the value during the v1 <-> v2 conversion
  [JustBeforeEach] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/performance.go:777
  [It] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/performance.go:810
  [FAILED] Failed to create profile
  Unexpected error:
      <*errors.StatusError | 0xc00037f720>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
            Message: "Internal error occurred: failed calling webhook \"vwb.performance.openshift.io\": failed to call webhook: Post \"https://performance-addon-operator-service.openshift-cluster-node-tuning-operator.svc:443/validate-performance-openshift-io-v2-performanceprofile?timeout=10s\": no endpoints available for service \"performance-addon-operator-service\"",
              Reason: "InternalError",
              Details: {
                  Name: "",
                  Group: "",
                  Kind: "",
                  UID: "",
                  Causes: [
                      {
                          Type: "",
                          Message: "failed calling webhook \"vwb.performance.openshift.io\": failed to call webhook: Post \"https://performance-addon-operator-service.openshift-cluster-node-tuning-operator.svc:443/validate-performance-openshift-io-v2-performanceprofile?timeout=10s\": no endpoints available for service \"performance-addon-operator-service\"",
                          Field: "",
                      },
                  ],
                  RetryAfterSeconds: 0,
              },
              Code: 500,
          },
      }
      Internal error occurred: failed calling webhook "vwb.performance.openshift.io": failed to call webhook: Post "https://performance-addon-operator-service.openshift-cluster-node-tuning-operator.svc:443/validate-performance-openshift-io-v2-performanceprofile?timeout=10s": no endpoints available for service "performance-addon-operator-service"
  occurred
  In [JustBeforeEach] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/performance.go:793 @ 08/06/23 12:22:20.221

@openshift-ci
Contributor

openshift-ci bot commented Aug 6, 2023

@Tal-or: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yanirq
Contributor

yanirq commented Aug 6, 2023

Looks good to me. I'll wait for others to give an ack before I add the approval.

Contributor

@ffromani ffromani left a comment

/approve

Thanks for the changes; IMO it's much better now.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 7, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 7, 2023
@ffromani
Contributor

ffromani commented Aug 7, 2023

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2023
@ffromani
Contributor

ffromani commented Aug 7, 2023

holding to let other reviewers chime in

@ffromani
Contributor

ffromani commented Aug 7, 2023

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2023
@openshift-merge-robot openshift-merge-robot merged commit 48df9b4 into openshift:master Aug 7, 2023
13 checks passed
@ffromani ffromani deleted the fix_wait_functions branch August 7, 2023 14:47
@Tal-or Tal-or changed the title e2e:wait: return updated pod object explicitly OCPBUGS-7980: e2e:wait: return updated pod object explicitly Aug 17, 2023
@openshift-ci-robot
Contributor

@Tal-or: Jira Issue OCPBUGS-7980 is in an unrecognized state (MODIFIED) and will not be moved to the MODIFIED state.

In response to this:

In some of the tests we implicitly rely on the fact that the pod data returned from the pods.WaitFor* functions is the most up-to-date one.

This is not the case in some of the functions, where we store the data in a different pointer to a pod object and never return it.

Instead of using a different pod object, we should use the same one passed into the function.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tal-or
Contributor Author

Tal-or commented Aug 17, 2023

/jira refresh

@openshift-ci-robot
Contributor

@Tal-or: Jira Issue OCPBUGS-7980: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-7980 has been moved to the MODIFIED state.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tal-or
Contributor Author

Tal-or commented Aug 17, 2023

/help

@Tal-or
Contributor Author

Tal-or commented Aug 17, 2023

/cherry-pick ?

@openshift-cherrypick-robot

@Tal-or: cannot checkout ?: error checking out ?: exit status 1. output: error: pathspec '?' did not match any file(s) known to git

In response to this:

/cherry-pick ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
