
OCPBUGS-7980: e2e:wait: return updated pod object explicitly #744

Merged
merged 1 commit into from Aug 7, 2023

Conversation

Tal-or
Contributor

@Tal-or Tal-or commented Aug 2, 2023

In some of the tests we implicitly rely on the fact that the pod data returned from the pods.WaitFor* functions is the most up-to-date one.

This is not the case in some of the functions, where we store the data in a different pointer to a pod object and never return it.

Instead of using a different pod object, we should use the same one passed into the function.
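To illustrate the failure mode, here is a minimal sketch with stand-in types (the real helpers operate on corev1.Pod via a Kubernetes client; all names here are hypothetical): fresh data is fetched into a new local object and never propagated, so the caller's pod silently stays stale.

```go
package main

import "fmt"

// Pod is a stand-in for corev1.Pod, for illustration only.
type Pod struct {
	Name  string
	Phase string
}

// fakeGet simulates a client Get returning fresh cluster state.
func fakeGet(name string) *Pod {
	return &Pod{Name: name, Phase: "Running"}
}

// waitForPhaseBuggy fetches fresh data into a new local object and never
// propagates it, so the caller's pod keeps its stale fields.
func waitForPhaseBuggy(pod *Pod, phase string) error {
	updated := fakeGet(pod.Name) // fresh data lands here...
	if updated.Phase == phase {
		return nil // ...but is dropped: neither returned nor copied back
	}
	return fmt.Errorf("pod %s not in phase %s", pod.Name, phase)
}

func main() {
	pod := &Pod{Name: "test-pod", Phase: "Pending"}
	if err := waitForPhaseBuggy(pod, "Running"); err != nil {
		panic(err)
	}
	fmt.Println(pod.Phase) // still "Pending", even though the wait succeeded
}
```

The wait reports success, yet the caller's object still says "Pending" — exactly the implicit-freshness assumption the PR fixes.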

@Tal-or
Contributor Author

Tal-or commented Aug 2, 2023

Thank you @ffromani for helping with this

Contributor

@ffromani ffromani left a comment

The intent of the change is fine.

I'm reluctant to merge as-is, as I'm generally against mutating arguments if we can help it. I'd prefer to return the updated object; even if this change is more invasive, I think it is also clearer in the long run and leads to better code.

Also note that by far the most likely reason we have a pointer to pod as the first argument (which does indeed mean the function can mutate the argument) is that we were concerned about performance in argument passing.

IOW, we pass a pointer not because the function was meant to mutate it (that was never really documented or decided) but as an optimization.

@ffromani
Contributor

ffromani commented Aug 2, 2023

/cc @MarSik @yanirq @swatisehgal

@Tal-or
Contributor Author

Tal-or commented Aug 2, 2023

The intent of the change is fine.

I'm reluctant to merge as-is, as I'm generally against mutating arguments if we can help it. I'd prefer to return the updated object; even if this change is more invasive, I think it is also clearer in the long run and leads to better code.

Also note that by far the most likely reason we have a pointer to pod as the first argument (which does indeed mean the function can mutate the argument) is that we were concerned about performance in argument passing.

IOW, we pass a pointer not because the function was meant to mutate it (that was never really documented or decided) but as an optimization.

Yes, I completely understand your intention.
I mutate the pointer itself because:

  1. I wanted to avoid changing the function signatures, because as you said it would require a more invasive change (maybe in cnf-feature-deploy as well?)
  2. Mutating the pod pointer is common and a controller-runtime-like approach.

Anyway, I'm OK with whatever approach.

@ffromani
Contributor

ffromani commented Aug 2, 2023

1. I wanted to avoid changing the function signatures, because as you said it would require a more invasive change (maybe in cnf-feature-deploy as well?)

2. Mutating the pod pointer is common and a controller-runtime-like approach.

You make good points. Even if I'm not a fan of this approach, I won't block it either. I've tagged the other reviewers, let's see if we have a preference collectively.

@Tal-or
Contributor Author

Tal-or commented Aug 2, 2023

/test e2e-hypershift

@MarSik
Contributor

MarSik commented Aug 3, 2023

I wish we had an explicit way of marking the pointer as mutable (&mut Pod from Rust, anyone? :)
I am OK with the approach, but I want to see that written in the doctext for each of the functions.

@Tal-or
Contributor Author

Tal-or commented Aug 3, 2023

I wish we had an explicit way of marking the pointer as mutable (&mut Pod from Rust, anyone? :) I am OK with the approach, but I want to see that written in the doctext for each of the functions.

const pod *corev1.Pod :)

@yanirq
Contributor

yanirq commented Aug 3, 2023

1. I wanted to avoid changing the function signatures, because as you said it would require a more invasive change (maybe in cnf-feature-deploy as well?)

2. Mutating the pod pointer is common and a controller-runtime-like approach.

You make good points. Even if I'm not a fan of this approach, I won't block it either. I've tagged the other reviewers, let's see if we have a preference collectively.

I would not rule out the "invasive" approach yet; can we check to what extent it is invasive? (Maybe it's not that large.)

@ffromani
Contributor

ffromani commented Aug 3, 2023

Rough guesstimation:

$ git log -1
commit 0e3f9fca2f798654e3a0e8758241a299b27dae8d (HEAD -> master, upstream/release-4.15, upstream/release-4.14, upstream/master)
Author: SargunNarula <57625241+SargunNarula@users.noreply.github.com>
Date:   Wed Aug 2 18:14:03 2023 +0530

    e2e: remove image parameter from must gather (#743)
$ git grep -w WaitForDeletion | grep -v vendor | grep pods | wc -l
7
$ git grep -w WaitForCondition | grep -v vendor | grep pods | wc -l
15
$ git grep -w WaitForPhase | grep -v vendor | grep pods | wc -l
3

Contributor

@swatisehgal swatisehgal left a comment

My thought on this is that for operations like the ones here (WaitForDeletion, WaitForCondition and WaitForPhase) it seems inappropriate to assume that the pod object is mutated (and is fresher than the object passed in). Mutation should not be part of the equation here; the focus should be on waiting until the pod is deleted, meets a condition, or attains a certain phase, respectively. Ideally we should have had a function definition where we passed the pod name instead of a pointer to a pod object, which would have made it explicit that we don't expect the pod object to be mutated.

If the caller relies on a fresh pod object, they should obtain it explicitly. The current changes seem to further enforce this (incorrect?) assumption rather than correct it or make it explicit. Now the question is: are there many places where we have this assumption? IOW, what is the impact surface if we wanted to rectify this assumption of obtaining up-to-date pod objects when these functions are called?

If the motivating factor here is to ensure clarity and to facilitate the consumers of the utils/pods package, how about we add a function to explicitly obtain an updated pod object, like the one below, and use that where we expect to have a fresh pod object:

func GetPod(podName, namespace string) (*corev1.Pod, error) {
    .....
}

I am curious about what exactly the reason is that we want to have the latest pod object. Do we expect the pod object to change here? In theory I can see a few use cases (in-place VPA allowing resource updates; updates to labels, taints, tolerations, etc.), but I would like to understand it better in the context of the use cases we care about and our test suite.
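The suggested helper could be sketched roughly like this (with an in-memory stand-in for the API server; the real version would issue a client Get for a corev1.Pod, and all names here are illustrative):

```go
package main

import "fmt"

// Pod is a stand-in for corev1.Pod; cluster fakes API-server state.
type Pod struct {
	Namespace, Name string
	Phase           string
}

var cluster = map[string]*Pod{
	"ns1/web": {Namespace: "ns1", Name: "web", Phase: "Running"},
}

// GetPod fetches a fresh copy by name, making "give me the latest object"
// an explicit call rather than a hidden side effect of a Wait* helper.
func GetPod(namespace, name string) (*Pod, error) {
	p, ok := cluster[namespace+"/"+name]
	if !ok {
		return nil, fmt.Errorf("pod not found: %s/%s", namespace, name)
	}
	cp := *p // return a copy so callers cannot mutate shared state
	return &cp, nil
}

func main() {
	pod, err := GetPod("ns1", "web")
	if err != nil {
		panic(err)
	}
	fmt.Println(pod.Phase) // "Running"
}
```

The point of the design is that freshness becomes an explicit, separate operation the caller opts into, instead of an undocumented property of the wait functions.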

@ffromani
Contributor

ffromani commented Aug 4, 2023

My thought on this is that for operations like the ones here (WaitForDeletion, WaitForCondition and WaitForPhase) it seems inappropriate to assume that the pod object is mutated (and is fresher than the object passed in). Mutation should not be part of the equation here; the focus should be on waiting until the pod is deleted, meets a condition, or attains a certain phase, respectively. Ideally we should have had a function definition where we passed the pod name instead of a pointer to a pod object, which would have made it explicit that we don't expect the pod object to be mutated.

I fully agree. It would be much better if the functions accept the pod identification data (= ObjectKey or namespace/name pair) rather than the full pod object.

If the caller relies on a fresh pod object, they should obtain it explicitly. The current changes seem to further enforce this (incorrect?) assumption rather than correct it or make it explicit. Now the question is: are there many places where we have this assumption? IOW, what is the impact surface if we wanted to rectify this assumption of obtaining up-to-date pod objects when these functions are called?

This would be indeed the most explicit approach.

If the motivating factor here is to ensure clarity and to facilitate the consumers of the utils/pods package, how about we add a function to explicitly obtain an updated pod object, like the one below, and use that where we expect to have a fresh pod object:

func GetPod(podName, namespace string) (*corev1.Pod, error) {
    .....
}

I am curious about what exactly the reason is that we want to have the latest pod object. Do we expect the pod object to change here? In theory I can see a few use cases (in-place VPA allowing resource updates; updates to labels, taints, tolerations, etc.), but I would like to understand it better in the context of the use cases we care about and our test suite.

It's to save a Get, which is not really costly, but redundant (yes, this is a form of premature optimization).

In a nutshell, some Wait* functions return a copy of the object after the waited-for condition is met, effectively saving the caller a Get. This is both an optimization and an implementation-detail leak (the Wait* functions are all essentially a Get plus a condition check wrapped in a poll loop with a timeout), but the saving grace is that it still roughly fits the semantics, albeit not the cleanest approach.

@Tal-or
Contributor Author

Tal-or commented Aug 6, 2023

Then it's settled; we have a decision. Please review the latest patch.

@Tal-or Tal-or changed the title e2e:wait: store fresh info in pod pointer e2e:wait: return updated pod object explicitly Aug 6, 2023
In some of the tests we implicitly rely on the fact
that the pod data returned from the `pods.WaitFor*` functions is
the most up-to-date one.

This is not the case in some of the functions, where
we store the data in a different pointer to a pod object and
never return it.

On the other hand, those wait functions changing the passed object
is a hidden implementation detail.

Now the functions accept only a key object, in order to explicitly
indicate that the passed object is not mutated.

In addition, the functions return an updated copy of the object.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
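The reworked shape the commit describes — a key in, an updated copy out — might look roughly like this (stand-in types; the real functions take an ObjectKey-style namespace/name pair and return a *corev1.Pod, and all names here are illustrative):

```go
package main

import "fmt"

// ObjectKey mirrors the namespace/name pair the reworked helpers accept;
// Pod stands in for corev1.Pod.
type ObjectKey struct {
	Namespace, Name string
}

type Pod struct {
	Namespace, Name string
	Phase           string
}

// fakeGet simulates fetching fresh state from the API server.
func fakeGet(key ObjectKey) *Pod {
	return &Pod{Namespace: key.Namespace, Name: key.Name, Phase: "Running"}
}

// WaitForPodPhase takes only a key, so it visibly cannot mutate any caller
// object, and it returns an updated copy for callers that need fresh data.
func WaitForPodPhase(key ObjectKey, phase string) (*Pod, error) {
	updated := fakeGet(key)
	if updated.Phase != phase {
		return nil, fmt.Errorf("pod %s/%s not in phase %s", key.Namespace, key.Name, phase)
	}
	return updated, nil
}

func main() {
	pod, err := WaitForPodPhase(ObjectKey{Namespace: "ns1", Name: "web"}, "Running")
	if err != nil {
		panic(err)
	}
	fmt.Println(pod.Phase) // "Running"
}
```

With a value-type key as the only input, the no-mutation guarantee is enforced by the signature itself rather than by documentation.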
@Tal-or
Contributor Author

Tal-or commented Aug 6, 2023

/test e2e-gcp-pao
A new flake/failure, which we won't investigate as part of this PR:

[rfe_id:27368][performance] Verify API Conversions  [JustBeforeEach] when the GloballyDisableIrqLoadBalancing field set to false should preserve the value during the v1 <-> v2 conversion
  [JustBeforeEach] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/performance.go:777
  [It] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/performance.go:810
  [FAILED] Failed to create profile
  Unexpected error:
      <*errors.StatusError | 0xc00037f720>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
            Message: "Internal error occurred: failed calling webhook \"vwb.performance.openshift.io\": failed to call webhook: Post \"https://performance-addon-operator-service.openshift-cluster-node-tuning-operator.svc:443/validate-performance-openshift-io-v2-performanceprofile?timeout=10s\": no endpoints available for service \"performance-addon-operator-service\"",
              Reason: "InternalError",
              Details: {
                  Name: "",
                  Group: "",
                  Kind: "",
                  UID: "",
                  Causes: [
                      {
                          Type: "",
                          Message: "failed calling webhook \"vwb.performance.openshift.io\": failed to call webhook: Post \"https://performance-addon-operator-service.openshift-cluster-node-tuning-operator.svc:443/validate-performance-openshift-io-v2-performanceprofile?timeout=10s\": no endpoints available for service \"performance-addon-operator-service\"",
                          Field: "",
                      },
                  ],
                  RetryAfterSeconds: 0,
              },
              Code: 500,
          },
      }
      Internal error occurred: failed calling webhook "vwb.performance.openshift.io": failed to call webhook: Post "https://performance-addon-operator-service.openshift-cluster-node-tuning-operator.svc:443/validate-performance-openshift-io-v2-performanceprofile?timeout=10s": no endpoints available for service "performance-addon-operator-service"
  occurred
  In [JustBeforeEach] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/performance.go:793 @ 08/06/23 12:22:20.221

@openshift-ci
Contributor

openshift-ci bot commented Aug 6, 2023

@Tal-or: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yanirq
Contributor

yanirq commented Aug 6, 2023

Looks good to me. I'll wait for others to give an ack before I add the approval.

Contributor

@ffromani ffromani left a comment

/approve

Thanks for the changes; IMO it's much better now.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 7, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 7, 2023
@ffromani
Contributor

ffromani commented Aug 7, 2023

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2023
@ffromani
Contributor

ffromani commented Aug 7, 2023

holding to let other reviewers chime in

@ffromani
Contributor

ffromani commented Aug 7, 2023

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2023
@openshift-merge-robot openshift-merge-robot merged commit 48df9b4 into openshift:master Aug 7, 2023
13 checks passed
@ffromani ffromani deleted the fix_wait_functions branch August 7, 2023 14:47
@Tal-or Tal-or changed the title e2e:wait: return updated pod object explicitly OCPBUGS-7980: e2e:wait: return updated pod object explicitly Aug 17, 2023
@openshift-ci-robot
Contributor

@Tal-or: Jira Issue OCPBUGS-7980 is in an unrecognized state (MODIFIED) and will not be moved to the MODIFIED state.

In response to this:

In some of the tests we implicitly rely on the fact that the pod data returned from the pods.WaitFor* functions is the most up-to-date one.

This is not the case in some of the functions, where we store the data in a different pointer to a pod object and never return it.

Instead of using a different pod object, we should use the same one passed into the function.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tal-or
Contributor Author

Tal-or commented Aug 17, 2023

/jira refresh

@openshift-ci-robot
Contributor

@Tal-or: Jira Issue OCPBUGS-7980: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-7980 has been moved to the MODIFIED state.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tal-or
Contributor Author

Tal-or commented Aug 17, 2023

/help

@Tal-or
Contributor Author

Tal-or commented Aug 17, 2023

/cherry-pick ?

@openshift-cherrypick-robot

@Tal-or: cannot checkout ?: error checking out ?: exit status 1. output: error: pathspec '?' did not match any file(s) known to git

In response to this:

/cherry-pick ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
