
update-etcd-scaling-test #27788

Merged

Conversation

@Elbehery (Contributor) commented Mar 9, 2023

This PR addresses reviews on #27702

cc @mfojtik @deads2k @hasbro17

@Elbehery (Contributor, Author) commented Mar 9, 2023

/assign @hasbro17
/assign @mfojtik
/assign @tjungblu
/assign @deads2k

@Elbehery (Contributor, Author) commented Mar 9, 2023

/assign @DennisPeriquet

@Elbehery (Contributor, Author) commented Mar 9, 2023

/cherry-pick release-4.13

@openshift-cherrypick-robot

@Elbehery: once the present PR merges, I will cherry-pick it on top of release-4.13 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch from ca3161c to e05186c on March 9, 2023 13:15
t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
t.Logf("machine '%q' was listed but not found or already deleted", machineToDelete)
return false, nil
}
return isTransientAPIError(t, err)
Contributor

We can leave off detecting transient API errors here, as the client should retry and we shouldn't need to handle those, per #27702 (comment)

Suggested change:
-return isTransientAPIError(t, err)
+return false, err

Contributor Author

done

Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch from e05186c to 75f666c on March 9, 2023 20:30
@DennisPeriquet (Contributor)

Since https://issues.redhat.com/browse/OCPBUGS-7989 is merged, the aws and gcp cases for the etcd-scaling tests (that this PR addresses) pass. We also understand that the azure case is still failing, as shown in this job at: [sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling] etcd [apigroup:config.openshift.io] is able to vertically scale up and down with a single node

@DennisPeriquet (Contributor)

/lgtm

openshift-ci bot added the lgtm label Mar 9, 2023
@@ -247,7 +246,7 @@ func IsCPMSActive(ctx context.Context, t TestingT, cpmsClient machinev1client.Co
 // EnsureReadyReplicasOnCPMS checks if status.readyReplicas on the cluster CPMS is n
 // this effectively counts the number of control-plane machines with the provider state as running
 func EnsureReadyReplicasOnCPMS(ctx context.Context, t TestingT, expectedReplicaCount int, cpmsClient machinev1client.ControlPlaneMachineSetInterface) error {
-	waitPollInterval := 5 * time.Second
+	waitPollInterval := 2 * time.Second
 	waitPollTimeout := 18 * time.Minute
 	t.Logf("Waiting up to %s for the CPMS to have status.readyReplicas = %v", waitPollTimeout.String(), expectedReplicaCount)
@hasbro17 (Contributor) commented Mar 10, 2023

This comment is for L259-266 (since GitHub won't let me comment on unchanged lines):

if cpms.Status.ReadyReplicas != int32(expectedReplicaCount) {
	t.Logf("expected %d ready replicas on CPMS, got: %v,", expectedReplicaCount, cpms.Status.ReadyReplicas)
	return false, nil
}

t.Logf("CPMS has reached the desired number of ready replicas: %v,", cpms.Status.ReadyReplicas)
return true, nil

Previous discussion about the behavior of EnsureReadyReplicasOnCPMS #27702 (comment)

deads2k: do you care about this condition or do you care this AND that you have exactly the right number of total nodes AND that the desired replicas is also expectedReplicaCount?

hasbro17: We care about this condition because we want the test to ensure that a new machine has successfully been created and is ready, which means the associated node has also been created and is running.
IIUC ReadyReplicas should cover both conditions:
https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/pkg/controllers/controlplanemachineset/status.go#L95-L96
https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/pkg/machineproviders/types.go#L38-L40
I think the desired status.replicas will only tell us if the machine was created, and not whether it became running with a node attached.

deads2k: if expected does not match the desired (spec), the return further down looks incorrect.
also, this looks like the correct spot to double check that the number of nodes is actually correct.

deads2k (slack): EnsureReadyReplicasOnCPMS doesn't appear to be sure

  1. that .spec.replicas matches .status.readyreplicas
  2. that the number of nodes in the API is actually correct

Addressing the above two points:

2. that the number of nodes in the API is actually correct

As I pointed out above, checking cpms.Status.ReadyReplicas will also indirectly check that the number of nodes is correct. The CPMSO updates status.ReadyReplicas with machines that are only counted as ready when their node is also Ready.

We actually caught a bug in this behavior with our test that has since been verified and fixed.
https://issues.redhat.com/browse/OCPBUGS-7989
openshift/cluster-control-plane-machine-set-operator#171

1. that .spec.replicas matches .status.readyReplicas

The purpose of EnsureReadyReplicasOnCPMS in this test is not to check if spec.replicas matches status.readyReplicas.
It is to check whether status.readyReplicas is whatever we expect it to be at a given point in the test, i.e. expectedReplicaCount.

While spec.replicas will always be 3, it is expected behavior from the CPMSO that status.readyReplicas will surge up to 4 as we scale up, and that's what we check for in step 2 of the test with:

 err = scalingtestinglibrary.EnsureReadyReplicasOnCPMS(ctx, g.GinkgoT(), 4, cpmsClient)

And then in step 3 after scale-down we expect it to come back down to 3:

 err = scalingtestinglibrary.EnsureReadyReplicasOnCPMS(ctx, g.GinkgoT(), 3, cpmsClient)

@deads2k Does that address your previous comments?
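
For reference, the polling shape being discussed is roughly the following. This is a minimal sketch reconstructed from the diff and snippet above, not the verbatim PR code; the "cluster" object name, the import paths, and the use of the helpers' TestingT interface are assumptions:

```go
import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"

	machinev1client "github.com/openshift/client-go/machine/clientset/versioned/typed/machine/v1"
)

// EnsureReadyReplicasOnCPMS polls the ControlPlaneMachineSet until
// status.readyReplicas equals expectedReplicaCount, or times out.
func EnsureReadyReplicasOnCPMS(ctx context.Context, t TestingT, expectedReplicaCount int,
	cpmsClient machinev1client.ControlPlaneMachineSetInterface) error {
	waitPollInterval := 2 * time.Second
	waitPollTimeout := 18 * time.Minute
	t.Logf("Waiting up to %s for the CPMS to have status.readyReplicas = %v", waitPollTimeout.String(), expectedReplicaCount)

	return wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
		// Assumption: the CPMS is a singleton object named "cluster".
		cpms, err := cpmsClient.Get(ctx, "cluster", metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if cpms.Status.ReadyReplicas != int32(expectedReplicaCount) {
			t.Logf("expected %d ready replicas on CPMS, got: %v", expectedReplicaCount, cpms.Status.ReadyReplicas)
			return false, nil
		}
		t.Logf("CPMS has reached the desired number of ready replicas: %v", cpms.Status.ReadyReplicas)
		return true, nil
	})
}
```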

Contributor

We actually caught a bug in this behavior with our test that has since been verified and fixed.
https://issues.redhat.com/browse/OCPBUGS-7989
openshift/cluster-control-plane-machine-set-operator#171

This sounds like a really strong argument for adding the verification I've listed.

Contributor

The purpose of EnsureReadyReplicasOnCPMS in this test is not to check if spec.replicas matches status.readyReplicas.
It is to check whether status.readyReplicas is whatever we expect it to be at a given point in the test, i.e. expectedReplicaCount.

Accepted. Is there some test to be sure we eventually converge?

Contributor

This sounds like a really strong argument for adding the verification I've listed.

Fair point. The CPMSO should have test cases for this with the recent fix, but we can add a check for the node count here to ensure we don't regress.

Is there some test to be sure we eventually converge?

Not quite. Indirectly we do check that it goes down to 3. And spec.replicas is supposed to be immutable, but that could change in the future.
So based on that, we'll add a check at the end to ensure they both are the same after scale-down, in case that guarantee ever changes.

Contributor

Alright, with the latest updates the test should now be checking that the number of nodes is correct, as well as ensuring that spec.replicas and status.readyReplicas converge at the end of the test.
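
For illustration, checks of that shape could look roughly like the sketch below. The helper names (ensureMasterNodeCount, ensureCPMSConverged), the node client, and the "cluster" CPMS name are assumptions, not the PR's actual code; imports mirror the earlier sketch plus fmt and corev1client from k8s.io/client-go/kubernetes/typed/core/v1:

```go
// ensureMasterNodeCount polls until the number of control-plane nodes matches the expectation.
func ensureMasterNodeCount(ctx context.Context, t TestingT, expected int, nodeClient corev1client.NodeInterface) error {
	return wait.Poll(5*time.Second, 18*time.Minute, func() (bool, error) {
		nodes, err := nodeClient.List(ctx, metav1.ListOptions{
			LabelSelector: "node-role.kubernetes.io/master",
		})
		if err != nil {
			return false, err
		}
		if len(nodes.Items) != expected {
			t.Logf("expected %d control-plane nodes, got %d", expected, len(nodes.Items))
			return false, nil
		}
		return true, nil
	})
}

// ensureCPMSConverged asserts, after scale-down, that spec.replicas and status.readyReplicas agree.
// Assumption: Spec.Replicas is a *int32 on the CPMS API type.
func ensureCPMSConverged(ctx context.Context, cpmsClient machinev1client.ControlPlaneMachineSetInterface) error {
	cpms, err := cpmsClient.Get(ctx, "cluster", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if cpms.Spec.Replicas == nil {
		return fmt.Errorf("cpms spec.replicas is not set")
	}
	if *cpms.Spec.Replicas != cpms.Status.ReadyReplicas {
		return fmt.Errorf("spec.replicas (%d) and status.readyReplicas (%d) have not converged",
			*cpms.Spec.Replicas, cpms.Status.ReadyReplicas)
	}
	return nil
}
```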


t.Logf("Waiting up to %s to delete a machine", waitPollTimeout.String())

err = wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
Contributor

This doesn't need to be wrapped in a poll either, but this way is at least functionally correct.
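
In other words, a single delete call that treats NotFound as success would already be enough here. A minimal sketch, assuming a machine.openshift.io/v1beta1 MachineInterface client and apierrors from k8s.io/apimachinery/pkg/api/errors:

```go
// deleteMachine issues a single delete and treats NotFound as success,
// since the machine is already gone in that case.
func deleteMachine(ctx context.Context, machineClient machinev1beta1client.MachineInterface, name string) error {
	err := machineClient.Delete(ctx, name, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}
```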

t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
t.Logf("machine '%q' was listed but not found or already deleted", machineToDelete)
return false, nil
Contributor

this is actually true, nil, right? It's gone and the loop needs to exit
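
Putting the two review points together (surface unexpected errors, and exit the poll once the machine is gone), the delete poll would read roughly as follows. A sketch only, reusing the surrounding variables from the diff above; not the PR's exact code:

```go
t.Logf("Waiting up to %s to delete a machine", waitPollTimeout.String())

err = wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
	err := machineClient.Delete(ctx, machineToDelete, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		// The machine is already gone, so the poll can stop successfully.
		t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
		return true, nil
	}
	if err != nil {
		// Unexpected errors are surfaced directly rather than classified as transient.
		return false, err
	}
	t.Logf("deleted machine %q", machineToDelete)
	return true, nil
})
```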

Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch from 75f666c to 8638513 on March 25, 2023 17:31
openshift-ci bot removed the lgtm label Mar 25, 2023
Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch from 8638513 to 2118ec0 on March 25, 2023 17:33
Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch 3 times, most recently from 91c5f8e to 2b6a666, on April 5, 2023 10:16
@Elbehery (Contributor, Author) commented Apr 5, 2023

/label tide/merge-method-squash

openshift-ci bot added the tide/merge-method-squash label Apr 5, 2023
Elbehery force-pushed the Update-etcd-scaling-test-for-CPMS branch from 5fcdc8d to 819bd48 on April 5, 2023 13:23
@Elbehery (Contributor, Author) commented Apr 5, 2023

/retest-required

@Elbehery (Contributor, Author)

e2e-openstack-ovn failed during cluster installation

level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
Installer exit with code 5

another run

/test e2e-openstack-ovn

@Elbehery (Contributor, Author)

e2e-vsphere-ovn-etcd-scaling failed during cluster installation:

install should succeed: overall (0s)
{ openshift cluster install failed overall }

/test e2e-vsphere-ovn-etcd-scaling

@Elbehery (Contributor, Author)

e2e-aws-csi does not attempt to execute the vertical-scaling e2e test, see machines


@Elbehery (Contributor, Author)

/test e2e-gcp-ovn-etcd-scaling

@Elbehery (Contributor, Author)

/test e2e-gcp-csi

@tsorya commented Apr 16, 2023

can you explain your question please? I mean not sure what you expect to be replaced? We have one node

e2e-aws-ovn-single-node-upgrade does not attempt to replace the master machine, nodes

@eranco74 Hello ✋🏽 , is this expected behaviour ?

@Elbehery (Contributor, Author)

can you explain your question please? I mean not sure what you expect to be replaced? We have one node

e2e-aws-ovn-single-node-upgrade does not attempt to replace the master machine, nodes
@eranco74 Hello ✋🏽 , is this expected behaviour ?

Yes, that's what I thought too; this e2e does not make sense for SNO, correct?

@DennisPeriquet (Contributor)

Regarding ci/prow/e2e-vsphere-ovn-etcd-scaling, we see this build-log.txt; note this timeline:

INFO[2023-04-13T12:00:32Z] Running step e2e-vsphere-ovn-etcd-scaling-ipi-install-install.
INFO[2023-04-13T12:33:41Z] Step e2e-vsphere-ovn-etcd-scaling-ipi-install-install succeeded after 33m9s.
     (looks like an installation finished)
INFO[2023-04-13T12:33:41Z] Running step e2e-vsphere-ovn-etcd-scaling-ipi-install-vsphere-registry.
INFO[2023-04-13T12:44:35Z] Step e2e-vsphere-ovn-etcd-scaling-ipi-install-vsphere-registry succeeded after 10m53s.
INFO[2023-04-13T12:44:35Z] Step phase pre succeeded after 46m4s.
INFO[2023-04-13T12:44:35Z] Running multi-stage phase test
INFO[2023-04-13T12:44:35Z] Running step e2e-vsphere-ovn-etcd-scaling-openshift-e2e-test.
     (the e2e test is about to start)
INFO[2023-04-13T12:55:20Z] Logs for container test in pod e2e-vsphere-ovn-etcd-scaling-openshift-e2e-test:
INFO[2023-04-13T12:55:20Z] secret/support created
configmap/admin-acks patched
clusterversion.config.openshift.io/version condition met
Thu Apr 13 12:44:45 UTC 2023 - 6 Machines - 5 Nodes
     (10 seconds after trying to start the e2e, we already see only 5 nodes for 6 machines, and so the e2e never gets to run)
Thu Apr 13 12:45:15 UTC 2023 - 6 Machines - 5 Nodes

As such, I believe your PR change never ran on this job so I don't think your PR is the cause.

As a reference, here's a log where the e2e did run:

INFO[2023-04-14T20:29:50Z] Running step e2e-vsphere-ovn-etcd-scaling-openshift-e2e-test.
INFO[2023-04-14T21:28:35Z] Logs for container test in pod e2e-vsphere-ovn-etcd-scaling-openshift-e2e-test:
INFO[2023-04-14T21:28:35Z] secret/support created
configmap/admin-acks patched
clusterversion.config.openshift.io/version condition met
Fri Apr 14 20:30:03 UTC 2023 - node count (6) now matches or exceeds machine count (6)
Fri Apr 14 20:30:03 UTC 2023 - waiting for nodes to be ready...
node/ci-op-c8xh1lr5-1eb13-d5v44-master-0 condition met
...
[Fri Apr 14 20:30:09 UTC 2023] All imagestreams are imported.
+ openshift-tests run openshift/etcd/scaling --provider vsphere -o /logs/artifacts/e2e.log --junit-dir /logs/artifacts/junit

Note that there are 6 of 6 nodes. That last line is evidence that the e2e started running. In the earlier log, that last line is not present, which makes me believe the e2e never ran.

@DennisPeriquet (Contributor)

Regarding ci/prow/e2e-aws-csi, I see from the relevant Slack thread that your e2e test won't even be run in that job, so we can ignore the failure.

@Elbehery (Contributor, Author)

/test e2e-gcp-ovn-etcd-scaling

@Elbehery (Contributor, Author)

/test e2e-vsphere-ovn-etcd-scaling

@Elbehery (Contributor, Author)

Filed a bug for the vSphere failures: https://issues.redhat.com/browse/OCPBUGS-11943

@hasbro17 (Contributor)

/lgtm

openshift-ci bot added the lgtm label Apr 24, 2023
@openshift-ci bot commented Apr 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DennisPeriquet, Elbehery, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Apr 24, 2023
@openshift-ci bot commented Apr 24, 2023

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/e2e-aws-ovn-single-node-upgrade | bc65382 | link | false | /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-azure-ovn-etcd-scaling | bc65382 | link | false | /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-aws-csi | bc65382 | link | false | /test e2e-aws-csi
ci/prow/e2e-vsphere-ovn-etcd-scaling | bc65382 | link | false | /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-gcp-ovn-etcd-scaling | bc65382 | link | false | /test e2e-gcp-ovn-etcd-scaling

Full PR test history. Your PR dashboard.


@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD c9cbeee and 2 for PR HEAD bc65382 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 9aff6f7 and 1 for PR HEAD bc65382 in total

openshift-ci bot merged commit 9c05d0e into openshift:master Apr 25, 2023
21 of 26 checks passed
@openshift-cherrypick-robot

@Elbehery: new pull request created: #27892

In response to this:

/cherry-pick release-4.13


vrutkovs pushed a commit to vrutkovs/origin that referenced this pull request May 2, 2023
@Elbehery (Contributor, Author) commented May 4, 2023

/cherry-pick release-4.12

@openshift-cherrypick-robot

@Elbehery: #27788 failed to apply on top of branch "release-4.12":

Applying: update-etcd-scaling-test
Using index info to reconstruct a base tree...
M	test/extended/etcd/helpers/helpers.go
M	test/extended/etcd/vertical_scaling.go
Falling back to patching base and 3-way merge...
Auto-merging test/extended/etcd/vertical_scaling.go
CONFLICT (content): Merge conflict in test/extended/etcd/vertical_scaling.go
Auto-merging test/extended/etcd/helpers/helpers.go
CONFLICT (content): Merge conflict in test/extended/etcd/helpers/helpers.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 update-etcd-scaling-test
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.12


Elbehery added a commit to Elbehery/origin that referenced this pull request May 4, 2023
Elbehery added a commit to Elbehery/origin that referenced this pull request May 5, 2023
openshift-merge-robot pushed a commit that referenced this pull request May 23, 2023
…unt for CPMSO (#27907)

* manual cherrypick of #27788

* bindata diff

* go1.19 diff
Labels
- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- lgtm: Indicates that a PR is ready to be merged.
- tide/merge-method-squash: Denotes a PR that should be squashed by tide when it merges.