
update-etcd-scaling-test #27788

Merged

Conversation

@Elbehery (Contributor) commented Mar 9, 2023

This PR addresses reviews on #27702

cc @mfojtik @deads2k @hasbro17

@Elbehery (Contributor, Author) commented Mar 9, 2023

/assign @hasbro17
/assign @mfojtik
/assign @tjungblu
/assign @deads2k

@Elbehery (Contributor, Author) commented Mar 9, 2023

/assign @DennisPeriquet

@Elbehery (Contributor, Author) commented Mar 9, 2023

/cherry-pick release-4.13

@openshift-cherrypick-robot

@Elbehery: once the present PR merges, I will cherry-pick it on top of release-4.13 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch from ca3161c to e05186c on March 9, 2023 13:15
t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
t.Logf("machine '%q' was listed but not found or already deleted", machineToDelete)
return false, nil
}
return isTransientAPIError(t, err)
Contributor

We can leave off detecting transient API errors here, as the client should retry and we shouldn't need to handle those, per #27702 (comment)

Suggested change:
-return isTransientAPIError(t, err)
+return false, err

Contributor Author

done

Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch from e05186c to 75f666c on March 9, 2023 20:30
@DennisPeriquet (Contributor)

Since https://issues.redhat.com/browse/OCPBUGS-7989 is merged, the aws and gcp cases for the etcd-scaling tests (that this PR addresses) pass. We also understand that the azure case is still failing, as shown in this job at: [sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling] etcd [apigroup:config.openshift.io] is able to vertically scale up and down with a single node

@DennisPeriquet (Contributor)

/lgtm

openshift-ci bot added the lgtm label Mar 9, 2023
@@ -247,7 +246,7 @@ func IsCPMSActive(ctx context.Context, t TestingT, cpmsClient machinev1client.Co
 // EnsureReadyReplicasOnCPMS checks if status.readyReplicas on the cluster CPMS is n
 // this effectively counts the number of control-plane machines with the provider state as running
 func EnsureReadyReplicasOnCPMS(ctx context.Context, t TestingT, expectedReplicaCount int, cpmsClient machinev1client.ControlPlaneMachineSetInterface) error {
-	waitPollInterval := 5 * time.Second
+	waitPollInterval := 2 * time.Second
 	waitPollTimeout := 18 * time.Minute
 	t.Logf("Waiting up to %s for the CPMS to have status.readyReplicas = %v", waitPollTimeout.String(), expectedReplicaCount)
@hasbro17 (Contributor) commented Mar 10, 2023

This comment is for L259-266 (since GitHub won't let me comment on unchanged lines):

if cpms.Status.ReadyReplicas != int32(expectedReplicaCount) {
	t.Logf("expected %d ready replicas on CPMS, got: %v,", expectedReplicaCount, cpms.Status.ReadyReplicas)
	return false, nil
}

t.Logf("CPMS has reached the desired number of ready replicas: %v,", cpms.Status.ReadyReplicas)
return true, nil

Previous discussion about the behavior of EnsureReadyReplicasOnCPMS #27702 (comment)

deads2k: do you care about this condition or do you care this AND that you have exactly the right number of total nodes AND that the desired replicas is also expectedReplicaCount?

hasbro17: We care about this condition because we want the test to ensure that a new machine has successfully been created and is ready, which means the associated node has also been created and is running.
IIUC ReadyReplicas should cover both conditions:
https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/pkg/controllers/controlplanemachineset/status.go#L95-L96
https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/pkg/machineproviders/types.go#L38-L40
I think the desired status.replicas will only tell us if the machine was created, and not whether it became running with a node attached.

deads2k: if expected does not match the desired (spec), the return further down looks incorrect.
also, this looks like the correct spot to double check that the number of nodes is actually correct.

deads2k (slack): EnsureReadyReplicasOnCPMS doesn't appear to be sure

  1. that .spec.replicas matches .status.readyreplicas
  2. that the number of nodes in the API is actually correct

Addressing the above two points:

2. that the number of nodes in the API is actually correct

As I pointed out above, checking cpms.Status.ReadyReplicas will also indirectly check that the number of nodes is correct. The CPMSO updates status.ReadyReplicas with machines that are only counted as ready when their node is also Ready.

We actually caught a bug in this behavior with our test that has since been verified and fixed.
https://issues.redhat.com/browse/OCPBUGS-7989
openshift/cluster-control-plane-machine-set-operator#171

1. that .spec.replicas matches .status.readyReplicas

The purpose of EnsureReadyReplicasOnCPMS in this test is not to check if spec.replicas matches status.readyReplicas.
It is to check whether status.readyReplicas is whatever we expect it to be at a given point in the test, i.e. expectedReplicaCount.

While spec.replicas will always be 3, it is expected behavior from the CPMSO that status.readyReplicas will surge up to 4 as we scale up, and that's what we check for in step 2 of the test with:

 err = scalingtestinglibrary.EnsureReadyReplicasOnCPMS(ctx, g.GinkgoT(), 4, cpmsClient)

And then in step 3 after scale-down we expect it to come back down to 3:

 err = scalingtestinglibrary.EnsureReadyReplicasOnCPMS(ctx, g.GinkgoT(), 3, cpmsClient)

@deads2k Does that address your previous comments?
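
For reference, the polling shape being discussed is roughly the following. This is a minimal sketch reconstructed from the diff and snippet above, not the verbatim PR code; the "cluster" object name, the import paths, and the use of the helpers' TestingT interface are assumptions:

```go
import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"

	machinev1client "github.com/openshift/client-go/machine/clientset/versioned/typed/machine/v1"
)

// EnsureReadyReplicasOnCPMS polls the ControlPlaneMachineSet until
// status.readyReplicas equals expectedReplicaCount, or times out.
func EnsureReadyReplicasOnCPMS(ctx context.Context, t TestingT, expectedReplicaCount int,
	cpmsClient machinev1client.ControlPlaneMachineSetInterface) error {
	waitPollInterval := 2 * time.Second
	waitPollTimeout := 18 * time.Minute
	t.Logf("Waiting up to %s for the CPMS to have status.readyReplicas = %v", waitPollTimeout.String(), expectedReplicaCount)

	return wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
		// Assumption: the CPMS is a singleton object named "cluster".
		cpms, err := cpmsClient.Get(ctx, "cluster", metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if cpms.Status.ReadyReplicas != int32(expectedReplicaCount) {
			t.Logf("expected %d ready replicas on CPMS, got: %v", expectedReplicaCount, cpms.Status.ReadyReplicas)
			return false, nil
		}
		t.Logf("CPMS has reached the desired number of ready replicas: %v", cpms.Status.ReadyReplicas)
		return true, nil
	})
}
```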

Contributor

We actually caught a bug in this behavior with our test that has since been verified and fixed.
https://issues.redhat.com/browse/OCPBUGS-7989
openshift/cluster-control-plane-machine-set-operator#171

This sounds like a really strong argument for adding the verification I've listed.

Contributor

The purpose of EnsureReadyReplicasOnCPMS in this test is not to check if spec.replicas matches status.readyReplicas.
It is to check whether status.readyReplicas is whatever we expect it to be at a given point in the test, i.e. expectedReplicaCount.

Accepted. Is there some test to be sure we eventually converge?

Contributor

This sounds like a really strong argument for adding the verification I've listed.

Fair point. The CPMSO should have test cases for this with the recent fix, but we can add a check for the node count here to ensure we don't regress.

Is there some test to be sure we eventually converge?

Not quite. Indirectly we do check that it goes down to 3. And spec.replicas is supposed to be immutable, but that could change in the future.
So based on that, we'll add a check at the end to ensure they both are the same after scale-down, in case that guarantee ever changes.

Contributor

Alright, with the latest updates the test should now be checking that the number of nodes is correct, as well as ensuring that spec.replicas and status.readyReplicas converge at the end of the test.
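
For illustration, checks of that shape could look roughly like the sketch below. The helper names (ensureMasterNodeCount, ensureCPMSConverged), the node client, and the "cluster" CPMS name are assumptions, not the PR's actual code; imports mirror the earlier sketch plus fmt and corev1client from k8s.io/client-go/kubernetes/typed/core/v1:

```go
// ensureMasterNodeCount polls until the number of control-plane nodes matches the expectation.
func ensureMasterNodeCount(ctx context.Context, t TestingT, expected int, nodeClient corev1client.NodeInterface) error {
	return wait.Poll(5*time.Second, 18*time.Minute, func() (bool, error) {
		nodes, err := nodeClient.List(ctx, metav1.ListOptions{
			LabelSelector: "node-role.kubernetes.io/master",
		})
		if err != nil {
			return false, err
		}
		if len(nodes.Items) != expected {
			t.Logf("expected %d control-plane nodes, got %d", expected, len(nodes.Items))
			return false, nil
		}
		return true, nil
	})
}

// ensureCPMSConverged asserts, after scale-down, that spec.replicas and status.readyReplicas agree.
// Assumption: Spec.Replicas is a *int32 on the CPMS API type.
func ensureCPMSConverged(ctx context.Context, cpmsClient machinev1client.ControlPlaneMachineSetInterface) error {
	cpms, err := cpmsClient.Get(ctx, "cluster", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if cpms.Spec.Replicas == nil {
		return fmt.Errorf("cpms spec.replicas is not set")
	}
	if *cpms.Spec.Replicas != cpms.Status.ReadyReplicas {
		return fmt.Errorf("spec.replicas (%d) and status.readyReplicas (%d) have not converged",
			*cpms.Spec.Replicas, cpms.Status.ReadyReplicas)
	}
	return nil
}
```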


t.Logf("Waiting up to %s to delete a machine", waitPollTimeout.String())

err = wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
Contributor

This doesn't need to be wrapped in a poll either, but this way is at least functionally correct.
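
In other words, a single delete call that treats NotFound as success would already be enough here. A minimal sketch, assuming a machine.openshift.io/v1beta1 MachineInterface client and apierrors from k8s.io/apimachinery/pkg/api/errors:

```go
// deleteMachine issues a single delete and treats NotFound as success,
// since the machine is already gone in that case.
func deleteMachine(ctx context.Context, machineClient machinev1beta1client.MachineInterface, name string) error {
	err := machineClient.Delete(ctx, name, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}
```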

t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
t.Logf("machine '%q' was listed but not found or already deleted", machineToDelete)
return false, nil
Contributor

this is actually true, nil, right? It's gone and the loop needs to exit
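
Putting the two review points together (surface unexpected errors, and exit the poll once the machine is gone), the delete poll would read roughly as follows. A sketch only, reusing the surrounding variables from the diff above; not the PR's exact code:

```go
t.Logf("Waiting up to %s to delete a machine", waitPollTimeout.String())

err = wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
	err := machineClient.Delete(ctx, machineToDelete, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		// The machine is already gone, so the poll can stop successfully.
		t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
		return true, nil
	}
	if err != nil {
		// Unexpected errors are surfaced directly rather than classified as transient.
		return false, err
	}
	t.Logf("deleted machine %q", machineToDelete)
	return true, nil
})
```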

Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch from 75f666c to 8638513 on March 25, 2023 17:31
openshift-ci bot removed the lgtm label Mar 25, 2023
Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch from 8638513 to 2118ec0 on March 25, 2023 17:33
Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch 3 times, most recently from 91c5f8e to 2b6a666, on April 5, 2023 10:16
@Elbehery (Contributor, Author) commented Apr 5, 2023

/label tide/merge-method-squash

openshift-ci bot added the tide/merge-method-squash label Apr 5, 2023
Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch from 5fcdc8d to 819bd48 on April 5, 2023 13:23
@Elbehery (Contributor, Author) commented Apr 5, 2023

/retest-required

@Elbehery (Contributor, Author)

e2e-openstack-ovn failed during cluster installation

level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
Installer exit with code 5

another run

/test e2e-openstack-ovn

@Elbehery (Contributor, Author)

e2e-vsphere-ovn-etcd-scaling failed during cluster installation:

install should succeed: overall (0s)
{ openshift cluster install failed overall }

/test e2e-vsphere-ovn-etcd-scaling

@Elbehery (Contributor, Author)

e2e-aws-csi does not attempt to execute the vertical-scaling e2e test, see machines


@Elbehery (Contributor, Author)

/test e2e-gcp-ovn-etcd-scaling

@Elbehery (Contributor, Author)

/test e2e-gcp-csi

@tsorya commented Apr 16, 2023

can you explain your question please? I mean not sure what you expect to be replaced? We have one node

e2e-aws-ovn-single-node-upgrade does not attempt to replace the master machine, nodes

@eranco74 Hello ✋🏽 , is this expected behaviour ?

@Elbehery (Contributor, Author)

can you explain your question please? I mean not sure what you expect to be replaced? We have one node

e2e-aws-ovn-single-node-upgrade does not attempt to replace the master machine, nodes
@eranco74 Hello ✋🏽 , is this expected behaviour ?

Yes, that's what I thought too; this e2e does not make sense for SNO, correct?

@DennisPeriquet (Contributor)

Regarding ci/prow/e2e-vsphere-ovn-etcd-scaling, we see this build-log.txt; note this timeline:

INFO[2023-04-13T12:00:32Z] Running step e2e-vsphere-ovn-etcd-scaling-ipi-install-install.
INFO[2023-04-13T12:33:41Z] Step e2e-vsphere-ovn-etcd-scaling-ipi-install-install succeeded after 33m9s.
     (looks like an installation finished)
INFO[2023-04-13T12:33:41Z] Running step e2e-vsphere-ovn-etcd-scaling-ipi-install-vsphere-registry.
INFO[2023-04-13T12:44:35Z] Step e2e-vsphere-ovn-etcd-scaling-ipi-install-vsphere-registry succeeded after 10m53s.
INFO[2023-04-13T12:44:35Z] Step phase pre succeeded after 46m4s.
INFO[2023-04-13T12:44:35Z] Running multi-stage phase test
INFO[2023-04-13T12:44:35Z] Running step e2e-vsphere-ovn-etcd-scaling-openshift-e2e-test.
     (the e2e test is about to start)
INFO[2023-04-13T12:55:20Z] Logs for container test in pod e2e-vsphere-ovn-etcd-scaling-openshift-e2e-test:
INFO[2023-04-13T12:55:20Z] secret/support created
configmap/admin-acks patched
clusterversion.config.openshift.io/version condition met
Thu Apr 13 12:44:45 UTC 2023 - 6 Machines - 5 Nodes
     (10 seconds after trying to start the e2e, we already see only 5 nodes for 6 machines, and so the e2e never gets to run)
Thu Apr 13 12:45:15 UTC 2023 - 6 Machines - 5 Nodes

As such, I believe your PR change never ran on this job so I don't think your PR is the cause.

As a reference, here's a log where the e2e did run:

INFO[2023-04-14T20:29:50Z] Running step e2e-vsphere-ovn-etcd-scaling-openshift-e2e-test.
INFO[2023-04-14T21:28:35Z] Logs for container test in pod e2e-vsphere-ovn-etcd-scaling-openshift-e2e-test:
INFO[2023-04-14T21:28:35Z] secret/support created
configmap/admin-acks patched
clusterversion.config.openshift.io/version condition met
Fri Apr 14 20:30:03 UTC 2023 - node count (6) now matches or exceeds machine count (6)
Fri Apr 14 20:30:03 UTC 2023 - waiting for nodes to be ready...
node/ci-op-c8xh1lr5-1eb13-d5v44-master-0 condition met
...
[Fri Apr 14 20:30:09 UTC 2023] All imagestreams are imported.
+ openshift-tests run openshift/etcd/scaling --provider vsphere -o /logs/artifacts/e2e.log --junit-dir /logs/artifacts/junit

Note that there are 6 of 6 nodes. That last line is evidence that the e2e started running. In the earlier log, that last line is not present, which makes me believe the e2e never ran.

@DennisPeriquet (Contributor)

Regarding ci/prow/e2e-aws-csi, I see from the relevant Slack thread that your e2e test won't even be run in that job, so we can ignore the failure.

@Elbehery (Contributor, Author)

/test e2e-gcp-ovn-etcd-scaling

@Elbehery (Contributor, Author)

/test e2e-vsphere-ovn-etcd-scaling

@Elbehery (Contributor, Author)

Filed a bug for the vSphere failures: https://issues.redhat.com/browse/OCPBUGS-11943

@hasbro17 (Contributor)

/lgtm

openshift-ci bot added the lgtm label Apr 24, 2023
@openshift-ci bot commented Apr 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DennisPeriquet, Elbehery, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Apr 24, 2023
@openshift-ci bot commented Apr 24, 2023

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/e2e-aws-ovn-single-node-upgrade | bc65382 | link | false | /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-azure-ovn-etcd-scaling | bc65382 | link | false | /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-aws-csi | bc65382 | link | false | /test e2e-aws-csi
ci/prow/e2e-vsphere-ovn-etcd-scaling | bc65382 | link | false | /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-gcp-ovn-etcd-scaling | bc65382 | link | false | /test e2e-gcp-ovn-etcd-scaling

Full PR test history. Your PR dashboard.


@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD c9cbeee and 2 for PR HEAD bc65382 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 9aff6f7 and 1 for PR HEAD bc65382 in total

openshift-ci bot merged commit 9c05d0e into openshift:master Apr 25, 2023
21 of 26 checks passed
@openshift-cherrypick-robot

@Elbehery: new pull request created: #27892

In response to this:

/cherry-pick release-4.13


vrutkovs pushed a commit to vrutkovs/origin that referenced this pull request May 2, 2023
@Elbehery (Contributor, Author) commented May 4, 2023

/cherry-pick release-4.12

@openshift-cherrypick-robot

@Elbehery: #27788 failed to apply on top of branch "release-4.12":

Applying: update-etcd-scaling-test
Using index info to reconstruct a base tree...
M	test/extended/etcd/helpers/helpers.go
M	test/extended/etcd/vertical_scaling.go
Falling back to patching base and 3-way merge...
Auto-merging test/extended/etcd/vertical_scaling.go
CONFLICT (content): Merge conflict in test/extended/etcd/vertical_scaling.go
Auto-merging test/extended/etcd/helpers/helpers.go
CONFLICT (content): Merge conflict in test/extended/etcd/helpers/helpers.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 update-etcd-scaling-test
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.12


Elbehery added a commit to Elbehery/origin that referenced this pull request May 4, 2023
Elbehery added a commit to Elbehery/origin that referenced this pull request May 5, 2023
openshift-merge-robot pushed a commit that referenced this pull request May 23, 2023
…unt for CPMSO (#27907)

* manual cherrypick of #27788

* bindata diff

* go1.19 diff
Labels
- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- lgtm: Indicates that a PR is ready to be merged.
- tide/merge-method-squash: Denotes a PR that should be squashed by tide when it merges.