update-etcd-scaling-test #27788
Conversation
/assign @DennisPeriquet
/cherry-pick release-4.13
@Elbehery: once the present PR merges, I will cherry-pick it on top of release-4.13 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from ca3161c to e05186c
t.Logf("machine %q was listed but not found or already deleted", machineToDelete) | ||
return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete) | ||
t.Logf("machine '%q' was listed but not found or already deleted", machineToDelete) | ||
return false, nil | ||
} | ||
return isTransientAPIError(t, err) |
We can leave off detecting transient API errors here, as the client should retry and we shouldn't need to handle those per #27702 (comment).
-	return isTransientAPIError(t, err)
+	return false, err
done
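For concreteness, here is a minimal sketch of the poll condition with that suggestion applied. This is an illustration under assumptions (machineClient, machineToDelete, and the poll variables stand in for the test's actual names), not the PR's verbatim code:

// With the client retrying transient API errors internally, the condition
// func can return any remaining error directly and end the poll early.
err = wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
	machineList, err := machineClient.List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err // no isTransientAPIError classification needed
	}
	// ... inspect machineList for machineToDelete as before ...
	return true, nil
})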
Force-pushed from e05186c to 75f666c
Since https://issues.redhat.com/browse/OCPBUGS-7989 is merged, the AWS and GCP cases for the etcd-scaling tests (that this PR addresses) pass. We also understand that the Azure case is still failing, as shown in this job.
/lgtm
@@ -247,7 +246,7 @@ func IsCPMSActive(ctx context.Context, t TestingT, cpmsClient machinev1client.Co
 // EnsureReadyReplicasOnCPMS checks if status.readyReplicas on the cluster CPMS is n
 // this effectively counts the number of control-plane machines with the provider state as running
 func EnsureReadyReplicasOnCPMS(ctx context.Context, t TestingT, expectedReplicaCount int, cpmsClient machinev1client.ControlPlaneMachineSetInterface) error {
-	waitPollInterval := 5 * time.Second
+	waitPollInterval := 2 * time.Second
 	waitPollTimeout := 18 * time.Minute
 	t.Logf("Waiting up to %s for the CPMS to have status.readyReplicas = %v", waitPollTimeout.String(), expectedReplicaCount)
This comment is for L259-266 (since GitHub won't let me comment on unchanged lines):
if cpms.Status.ReadyReplicas != int32(expectedReplicaCount) {
t.Logf("expected %d ready replicas on CPMS, got: %v,", expectedReplicaCount, cpms.Status.ReadyReplicas)
return false, nil
}
t.Logf("CPMS has reached the desired number of ready replicas: %v,", cpms.Status.ReadyReplicas)
return true, nil
Previous discussion about the behavior of EnsureReadyReplicasOnCPMS
#27702 (comment)
deads2k: do you care about this condition or do you care this AND that you have exactly the right number of total nodes AND that the desired replicas is also expectedReplicaCount?
hasbro17: We care about this condition because we want the test to ensure that a new machine has successfully been created and is ready, which means the associated node has also been created and is running.
IIUC ReadyReplicas should cover both conditions:
https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/pkg/controllers/controlplanemachineset/status.go#L95-L96
https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/pkg/machineproviders/types.go#L38-L40
I think the desired status.replicas will only tell us if the machine was created, and not whether it became running with a node attached.
deads2k: if expected does not match the desired (spec), the return further down looks incorrect.
also, this looks like the correct spot to double check that the number of nodes is actually correct.
deads2k (slack): EnsureReadyReplicasOnCPMS doesn't appear to be sure
- that .spec.replicas matches .status.readyreplicas
- that the number of nodes in the API is actually correct
Addressing the above two points:

2. that the number of nodes in the API is actually correct

As I pointed out above, checking cpms.Status.ReadyReplicas will also indirectly check that the number of nodes is correct. The CPMSO updates status.ReadyReplicas with machines that are only counted as ready when their nodes are also Ready.

We actually caught a bug in this behavior with our test that has since been verified and fixed:
https://issues.redhat.com/browse/OCPBUGS-7989
openshift/cluster-control-plane-machine-set-operator#171
1. that .spec.replicas matches .status.readyReplicas

The purpose of EnsureReadyReplicasOnCPMS in this test is not to check if spec.replicas matches status.readyReplicas. It is to check whether status.readyReplicas is whatever we expect it to be at a given point in the test, i.e. expectedReplicaCount.

While spec.replicas will always be 3, it is expected behavior from the CPMSO that status.readyReplicas will surge up to 4 as we scale up, and that's what we check for in step 2 of the test with:

err = scalingtestinglibrary.EnsureReadyReplicasOnCPMS(ctx, g.GinkgoT(), 4, cpmsClient)

And then in step 3, after scale-down, we expect it to come back down to 3:

err = scalingtestinglibrary.EnsureReadyReplicasOnCPMS(ctx, g.GinkgoT(), 3, cpmsClient)
@deads2k Does that address your previous comments?
We actually caught a bug in this behavior with our test that has since been verified and fixed.
https://issues.redhat.com/browse/OCPBUGS-7989
openshift/cluster-control-plane-machine-set-operator#171
This sounds like a really strong argument for adding the verification I've listed.
The purpose of EnsureReadyReplicasOnCPMS in this test is not to check if spec.replicas matches status.readyReplicas.
It is to check whether status.readyReplicas is whatever we expect it to be at a given point in the test, i.e. expectedReplicaCount.
Accepted. There is some test to be sure we eventually converge?
This sounds like a really strong argument for adding the verification I've listed.
Fair point. The CPMSO should have test cases for this with the recent fix, but we can add a check for the node count here to ensure we don't regress.
There is some test to be sure we eventually converge?
Not quite. Indirectly we do check that it goes down to 3. And spec.replicas is supposed to be immutable, but that could change in the future. So to be safe, we'll add a check at the end to ensure they are both the same after scale-down, in case that guarantee ever changes.
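For illustration, a rough sketch of what such a convergence check could look like; the helper name, the poll timings, and the node-role.kubernetes.io/master label selector are assumptions, not the PR's final code:

// ensureCPMSConverged polls until spec.replicas == status.readyReplicas and
// the control-plane node count matches, guarding against drift even if
// spec.replicas ever stops being immutable.
func ensureCPMSConverged(ctx context.Context,
	cpmsClient machinev1client.ControlPlaneMachineSetInterface,
	nodeClient corev1client.NodeInterface) error {
	return wait.Poll(5*time.Second, 10*time.Minute, func() (bool, error) {
		cpms, err := cpmsClient.Get(ctx, "cluster", metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if cpms.Spec.Replicas == nil || *cpms.Spec.Replicas != cpms.Status.ReadyReplicas {
			return false, nil // not converged yet; keep polling
		}
		// Independently verify the control-plane node count via the role label.
		nodes, err := nodeClient.List(ctx, metav1.ListOptions{
			LabelSelector: "node-role.kubernetes.io/master",
		})
		if err != nil {
			return false, err
		}
		return int32(len(nodes.Items)) == cpms.Status.ReadyReplicas, nil
	})
}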
Alright, with the latest updates the test should now be checking that the number of nodes is correct, as well as ensuring that spec.replicas and status.readyReplicas converge at the end of the test.
t.Logf("Waiting up to %s to delete a machine", waitPollTimeout.String()) | ||
|
||
err = wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) { |
This doesn't need to be wrapped in poll either, but this way is at least functionally correct.
t.Logf("machine %q was listed but not found or already deleted", machineToDelete) | ||
return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete) | ||
t.Logf("machine '%q' was listed but not found or already deleted", machineToDelete) | ||
return false, nil |
this is actually true, nil, right? It's gone and the loop needs to exit
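Sketched out under that reading, the branch could become something like the following; apierrors is k8s.io/apimachinery/pkg/api/errors, and the delete-based shape of the surrounding loop is an assumption:

// Inside the poll's condition func: a NotFound means the machine is
// already gone, which is exactly the state the loop is waiting for.
err := machineClient.Delete(ctx, machineToDelete, metav1.DeleteOptions{})
if apierrors.IsNotFound(err) {
	return true, nil // already deleted: succeed and exit the poll
}
if err != nil {
	return false, err
}
return true, nil // delete accepted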
Force-pushed from 75f666c to 8638513
Force-pushed from 8638513 to 2118ec0
Force-pushed from 91c5f8e to 2b6a666
/label tide/merge-method-squash
Force-pushed from 5fcdc8d to 819bd48
/retest-required
another run: /test e2e-openstack-ovn
/test e2e-vsphere-ovn-etcd-scaling
it is not due to
/test e2e-gcp-ovn-etcd-scaling
/test e2e-gcp-csi
Yes, that's what I thought too; this e2e does not make sense for SNO, correct?
Regarding ci/prow/e2e-vsphere-ovn-etcd-scaling, we see this build-log.txt; see this timeline:
As such, I believe your PR change never ran on this job, so I don't think your PR is the cause. As a reference, here's a log where the e2e did run:
Note that there are 6 of 6 nodes. That last line is evidence that the e2e started running. In the above log, that last line is not present, which makes me believe the e2e never ran.
Regarding ci/prow/e2e-aws-csi, I see from the relevant Slack thread that your e2e test won't even be exercised there, so we can ignore the failure.
/test e2e-gcp-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling
Filed a bug for the vSphere failures: https://issues.redhat.com/browse/OCPBUGS-11943
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: DennisPeriquet, Elbehery, hasbro17. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing:
@Elbehery: The following tests failed, say:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@Elbehery: new pull request created: #27892 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-4.12
@Elbehery: #27788 failed to apply on top of branch "release-4.12":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This PR addresses reviews on #27702
cc @mfojtik @deads2k @hasbro17