
Force PAO installation on master nodes #351

Closed
wants to merge 2 commits from the schedule-masters branch

Conversation

marcel-apf
Contributor

PAO is meant to run on the "control plane".

Use podAffinity to schedule PAO only on master nodes.
(https://docs.openshift.com/container-platform/4.5/nodes/scheduling/nodes-scheduler-pod-affinity.html)
Use taint tolerations to allow PAO scheduling on the master nodes.
(https://docs.openshift.com/container-platform/4.5/nodes/scheduling/nodes-scheduler-taints-tolerations.html)

Add a test verifying PAO runs on master nodes.
A straightforward test that checks that the node PAO is running on
is a master node.

@marcel-apf
Contributor Author

/retest

@marcel-apf
Contributor Author

Depends on #350

if err := testclient.Client.List(context.TODO(), pods, opts); err != nil {
    return nil, err
}
if len(pods.Items) != 1 {
Member

I wonder if we should just use len(pods.Items) < 1 (just in case we eventually add HA and leader election). Or maybe it would be premature to change the test now, so just asking.

Member

Fair point. I prefer to keep the test as-is and see it break first when/if we add HA/leader election.

@ffromani (Member) left a comment

Looks mostly OK. My (relatively minor) concerns are about the e2e test and the GetPerformanceProfilePod function; the changes to the CSV look good.

@@ -118,6 +118,13 @@ spec:
      labels:
        name: performance-operator
    spec:
      affinity:
Member

it seems the OCP control plane is using

    nodeSelector:
      node-role.kubernetes.io/master: ""

but the end result is the same, and affinity rules are more powerful than nodeSelector, so this seems fine.

Contributor Author

I preferred affinity since it is reflected in the official docs.

Member

Is affinity reflected in the OpenShift official docs, the Kubernetes official docs, or both?

Contributor Author

both

functests/0_config/config.go (resolved)
functests/0_config/config.go (outdated, resolved)
functests/0_config/config.go (outdated, resolved)
functests/utils/pods/pods.go (outdated, resolved)
if len(pods.Items) != 1 {
    return nil, fmt.Errorf("incorrect performance operator pods count: %d", len(pods.Items))
}

Member

Is there any way to check that the pod is indeed running performance-operator, and not some random pod that happens to have the same label?

Contributor Author

In the same namespace as PAO, "openshift-performance-addon"? I think it is a relatively safe bet...

Member

It's still a bet, meaning it is still anecdotal evidence, and we need quite a lot of it to gain confidence. Checking the image name seems stronger (does it contain 'performance-operator'?). Feel free to add more checks, or to look up better ones.
The best way would probably be to check a PAO endpoint and verify it is behaving correctly; that would be sufficient proof the service is there. It's probably complex, maybe too complex to be done here, though.

Contributor Author

We are also checking that we have only one pod in the namespace. The check is the first step in the func tests; I suppose the next steps will fail if, for some reason, PAO is not in the namespace but another pod with the same label is.
While I agree that theoretically this is not enough, is this the place to add more checks?
I can add a check that the actual pod name starts with performance-operator-{something}, but this relies on a side effect of how k8s names pods. Or maybe it would add to the confidence?

Contributor Author

Added a check that the performance operator pod name (not label) starts with "performance-operator".

Contributor Author

That should increase the confidence in the bet.

Member

Still not completely happy with this solution, but good enough for now.

@marcel-apf marcel-apf force-pushed the schedule-masters branch 2 times, most recently from 56ae514 to ac70c96 (September 21, 2020 12:32)
@@ -9,6 +9,7 @@ import (

. "github.com/onsi/ginkgo"
. "github.com/onsi/gomega"
. "github.com/onsi/gomega/gstruct"
Member

unused?

Contributor Author

used by Reject()

Member

Fair enough. And BTW this is exactly why I dislike the dot imports.

functests/0_config/config.go (resolved)
@ffromani (Member) left a comment

/lgtm
good enough for now

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 21, 2020
@marcel-apf
Contributor Author

/retest

1 similar comment
@cynepco3hahue
Contributor

/retest

@coveralls

coveralls commented Sep 22, 2020

Pull Request Test Coverage Report for Build 492

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 70.939%

Totals Coverage Status
Change from base Build 485: 0.0%
Covered Lines: 869
Relevant Lines: 1225

💛 - Coveralls

@cynepco3hahue
Contributor

@marcel-apf Please fix CI job

export GOROOT=$(go env GOROOT); _cache/tools/operator-sdk-v0.18.2-x86_64-linux-gnu generate k8s
time="2020-09-22T16:10:05Z" level=info msg="Running deepcopy code-generation for Custom Resource group versions: [performance:[v1 v1alpha1], ]\n"
time="2020-09-22T16:10:47Z" level=info msg="Code-generation complete."
Verifying that all code is committed after updating deps and formatting and generating code
hack/verify-generated.sh
uncommitted generated files. run 'make generate' and commit results.
 M functests/utils/pods/pods.go

PAO is meant to run on the "control plane".
Use podAffinity to schedule PAO only on master nodes.
(https://docs.openshift.com/container-platform/4.5/nodes/scheduling/nodes-scheduler-pod-affinity.html)
Use taint tolerations to allow PAO scheduling on the master nodes.
(https://docs.openshift.com/container-platform/4.5/nodes/scheduling/nodes-scheduler-taints-tolerations.html)

Update the manifests according to the above.

Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
A straightforward test that checks that the node PAO is running on
is a master node.

Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Sep 23, 2020
@marcel-apf
Contributor Author

@marcel-apf Please fix CI job

export GOROOT=$(go env GOROOT); _cache/tools/operator-sdk-v0.18.2-x86_64-linux-gnu generate k8s
time="2020-09-22T16:10:05Z" level=info msg="Running deepcopy code-generation for Custom Resource group versions: [performance:[v1 v1alpha1], ]\n"
time="2020-09-22T16:10:47Z" level=info msg="Code-generation complete."
Verifying that all code is committed after updating deps and formatting and generating code
hack/verify-generated.sh
uncommitted generated files. run 'make generate' and commit results.
 M functests/utils/pods/pods.go

Done, thanks!

@ffromani (Member) left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 23, 2020
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fromanirh, marcel-apf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@marcel-apf
Contributor Author

/retest

1 similar comment
@marcel-apf
Contributor Author

/retest

@cynepco3hahue
Contributor

/cherry-pick release-4.6

@openshift-cherrypick-robot

@cynepco3hahue: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ffromani
Member

/override ci/prow/e2e-gcp-operator-upgrade
known issue about dup kernel args

@openshift-ci-robot
Collaborator

@fromanirh: Overrode contexts on behalf of fromanirh: ci/prow/e2e-gcp-operator-upgrade

In response to this:

/override ci/prow/e2e-gcp-operator-upgrade
known issue about dup kernel args


@cynepco3hahue
Contributor

/retest

@cynepco3hahue
Contributor

The single flaky failure is because of

out=Unable to connect to the server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

@cynepco3hahue
Contributor

/override ci/prow/e2e-gcp

@openshift-ci-robot
Collaborator

@cynepco3hahue: Overrode contexts on behalf of cynepco3hahue: ci/prow/e2e-gcp

In response to this:

/override ci/prow/e2e-gcp


@ffromani
Member

The single flaky failure is because of

out=Unable to connect to the server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

OK, for the record I agree with the evaluation, and I was about to override myself. @cynepco3hahue was just faster.

@ffromani
Member

/retest

@openshift-ci-robot
Collaborator

@marcel-apf: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-gcp
Commit: 414cb91
Rerun command: /test e2e-gcp

Full PR test history. Your PR dashboard.


@cynepco3hahue
Contributor

@marcel-apf The failure is connected to the functional test added under this PR:

• Failure [0.012 seconds]
[performance][config] Performance configuration
/go/src/github.com/openshift-kni/performance-addon-operators/functests/0_config/config.go:42
  Should run performance profile pod on a master node [It]
  /go/src/github.com/openshift-kni/performance-addon-operators/functests/0_config/config.go:44
  Failed to find the Performance Addon Operator pod
      
  Unexpected error:
      <*errors.errorString | 0xc00021e7a0>: {
          s: "incorrect performance operator pods count: 0",
      }
      incorrect performance operator pods count: 0
  occurred
  /go/src/github.com/openshift-kni/performance-addon-operators/functests/0_config/config.go:46

@cynepco3hahue
Contributor

The new test runs before the deployment of the PAO, so it is not surprising that the test fails. I propose making the check of where the PAO pod runs part of the deployment test.

@ffromani
Member

The new test runs before the deployment of the PAO, so it is not surprising that the test fails. I propose making the check of where the PAO pod runs part of the deployment test.

This means moving the test from the 0_config to the 1_performance suite, right? Because this is what I was thinking, and it's fine for me.

@ffromani ffromani mentioned this pull request Sep 25, 2020
@ffromani
Member

The new test runs before the deployment of the PAO, so it is not surprising that the test fails. I propose making the check of where the PAO pod runs part of the deployment test.

This means moving the test from the 0_config to the 1_performance suite, right? Because this is what I was thinking, and it's fine for me.

let's see: c61e2d1

I'll take care of this PR and have it merged ASAP.

@MarSik
Member

MarSik commented Sep 25, 2020

/retest

@ffromani
Member

actually obsoleted because #373 got merged

@marcel-apf
Contributor Author

Yay! Closing this one.

@marcel-apf marcel-apf closed this Sep 25, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
8 participants