[release-4.12] OCPBUGS-6841: Update etcd scaling test for CPMS supported platforms #27702
Conversation
@Elbehery: This pull request references Jira Issue OCPBUGS-6841, which is valid. 6 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@Elbehery: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I ran it locally && /retest-required
/label cherry-pick-approved
@geliu2016: Can not set label cherry-pick-approved: Must be member in one of these teams: [openshift-staff-engineers] In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
test/extended/testdata/bindata.go
Outdated
// a.png
// b.png
//
// data/
not sure how that got added, but you should probably remove this file from the changeset. Seems like a difference in go version
I used go1.18; it is the version used in go.mod for the 4.12 branch.
I will remove this 👍🏽
For platforms where the ControlPlaneMachineSet is active and being reconciled by the CPMSO, the vertical scaling test should rely on the CPMSO to remove and add new machines, otherwise there is a race between the test removing a machine and the CPMSO adding a new one.
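As a rough illustration of that idea (a hypothetical sketch, not the code in this PR), the test could branch on whether the cluster's ControlPlaneMachineSet is actively reconciled before deciding who creates the replacement machine; `chooseScalingPath`, `scaleViaCPMS`, and `scaleManually` are made-up names here:

```go
// Hypothetical sketch: pick the scaling strategy based on whether the CPMS is
// active. Assumes the openshift/api machine v1 types and generated clients.
func chooseScalingPath(ctx context.Context, t TestingT,
	machineClient machinev1beta1client.MachineInterface,
	cpmsClient machinev1client.ControlPlaneMachineSetInterface) error {

	cpms, err := cpmsClient.Get(ctx, "cluster", metav1.GetOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	if err == nil && cpms.Spec.State == machinev1.ControlPlaneMachineSetStateActive {
		// The CPMSO replaces the deleted machine, so the test only deletes one
		// machine and waits for the CPMS replica counts to recover.
		return scaleViaCPMS(ctx, t, machineClient, cpmsClient)
	}
	// No active CPMS: the test creates the replacement machine itself, as before.
	return scaleManually(ctx, t, machineClient)
}
```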
b9dae82 to 4d9cf6a
@Elbehery: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest-required
/retest-required
@geliu2016 can we label this please
/label cherry-pick-approved
@geliu2016: Can not set label cherry-pick-approved: Must be member in one of these teams: [openshift-staff-engineers] In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Test is passing on vSphere; the job is failing due to unrelated monitoring issues:
/approve
@hasbro17: Can not set label backport-risk-assessed: Must be member in one of these teams: [openshift-staff-engineers] In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest-required
@soltysh hello ✋🏽 Can we get this merged? It is just bumping a timeout :)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Elbehery, geliu2016, hasbro17, mfojtik. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@@ -180,6 +183,91 @@ func recoverClusterToInitialStateIfNeeded(ctx context.Context, t TestingT, machi
	})
}

func DeleteSingleMachine(ctx context.Context, t TestingT, machineClient machinev1beta1client.MachineInterface) (string, error) {
	waitPollInterval := 15 * time.Second
	waitPollTimeout := 5 * time.Minute
you probably want to check this every 30s or so. The load is minimal and it helps latency. Just fixing in master is ok with me.
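As a tiny sketch of that suggestion (hypothetical values, not this PR's code), the cadence might look like:

```go
// Hypothetical cadence per the review suggestion: ~30s keeps API load minimal
// while still reacting to the deletion reasonably quickly.
waitPollInterval := 30 * time.Second
waitPollTimeout := 5 * time.Minute
```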
	return isTransientAPIError(t, err)
}

// Machine names are suffixed with an index number (e.g "ci-op-xlbdrkvl-6a467-qcbkh-master-0")
not all names are ordered by number, but ordering for consistency is still ok.
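For illustration only (not this PR's code), a deterministic way to pick the machine is to sort the listed names and take the last one; `machineList` is assumed to be the result of the List call quoted below:

```go
// Hypothetical sketch: sort control-plane machine names lexically and delete the
// last one. Lexical order is not strictly numeric, but it is stable across runs.
names := make([]string, 0, len(machineList.Items))
for _, m := range machineList.Items {
	names = append(names, m.Name)
}
sort.Strings(names) // assumes "sort" is imported
machineToDelete := names[len(names)-1]
```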

machineToDelete := ""
err := wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
	machineList, err := machineClient.List(ctx, metav1.ListOptions{LabelSelector: masterMachineLabelSelector})
Why is this listing repeated instead of identifying the machine-to-be-deleted once and then retrying the Delete until you get a successful call?
That's to retry the List if it errors out due to a transient API error. Would we not want to retry in that case?
That's to retry the List if it errors out due to a transient API error. Would we not want to retry in that case?
The client retries automatically. Failures at this level aren't expected and even in controller code aren't generally handled.
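A hypothetical sketch of the reviewer's suggestion (list once, retry only the Delete) could look like the following; it reuses the helper names from the surrounding diff and is not the code this PR merges:

```go
// Hypothetical alternative: identify the machine once, then retry only the
// Delete call on transient API errors.
machineList, err := machineClient.List(ctx, metav1.ListOptions{LabelSelector: masterMachineLabelSelector})
if err != nil {
	return "", err
}
if len(machineList.Items) == 0 {
	return "", fmt.Errorf("no control-plane machines found")
}
machineToDelete := machineList.Items[0].Name // selection policy is orthogonal here

err = wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
	if err := machineClient.Delete(ctx, machineToDelete, metav1.DeleteOptions{}); err != nil {
		if apierrors.IsNotFound(err) {
			return true, nil // already gone, treat as success
		}
		return isTransientAPIError(t, err)
	}
	return true, nil
})
return machineToDelete, err
```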
@deads2k can we merge this to be aligned with master and 4.13, then make the required changes in a new PR against master and backport?
// EnsureReadyReplicasOnCPMS checks if status.readyReplicas on the cluster CPMS is n
// this effectively counts the number of control-plane machines with the provider state as running
func EnsureReadyReplicasOnCPMS(ctx context.Context, t TestingT, expectedReplicaCount int, cpmsClient machinev1client.ControlPlaneMachineSetInterface) error {
	waitPollInterval := 5 * time.Second
poll should be faster.
// The machine we just listed should be present but if not, error out
if apierrors.IsNotFound(err) {
	t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
	return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
this error will exit the loop as an error, right?
Yes, it will break the polling loop and exit the function with an error.
Yes, it will break the polling loop and exit the function with an error.
Since the item was deleted and the function is called DeleteSingleMachine, why return the error?
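A hypothetical variant reflecting this point (not the PR's code) would treat NotFound as success, since the machine being gone is exactly what DeleteSingleMachine is after:

```go
// Hypothetical variant: a missing machine means the delete already happened,
// so stop polling successfully instead of surfacing an error.
if err := machineClient.Delete(ctx, machineToDelete, metav1.DeleteOptions{}); err != nil {
	if apierrors.IsNotFound(err) {
		t.Logf("machine %q already deleted", machineToDelete)
		return true, nil
	}
	return isTransientAPIError(t, err)
}
return true, nil
```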
	return isTransientAPIError(t, err)
}

if cpms.Status.ReadyReplicas != int32(expectedReplicaCount) {
do you care about this condition, or do you care about this AND that you have exactly the right number of total nodes AND that the desired replicas is also expectedReplicaCount?
We care about this condition because we want the test to ensure that a new machine has successfully been created and is ready, which means the associated node has also been created and is running.
IIUC ReadyReplicas should cover both conditions:
- https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/pkg/controllers/controlplanemachineset/status.go#L95-L96
- https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/pkg/machineproviders/types.go#L38-L40
I think the desired status.replicas will only tell us if the machine was created, and not whether it became running with a node attached.
if expected does not match the desired (spec), the return further down looks incorrect.
also, this looks like the correct spot to double check that the number of nodes is actually correct.
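A hypothetical sketch of that stricter check (not this PR's code) could verify the desired replicas, the ready replicas, and the actual control-plane node count together; it assumes a `kubeClient kubernetes.Interface` is available and that control-plane nodes carry the `node-role.kubernetes.io/master` label:

```go
// Hypothetical stricter check inside the poll loop: desired replicas, ready
// replicas, and control-plane node count must all equal expectedReplicaCount.
if cpms.Spec.Replicas == nil || *cpms.Spec.Replicas != int32(expectedReplicaCount) {
	// A spec mismatch will not self-heal, so fail instead of retrying.
	return false, fmt.Errorf("CPMS spec.replicas does not match expected count %d", expectedReplicaCount)
}
if cpms.Status.ReadyReplicas != int32(expectedReplicaCount) {
	t.Logf("expected %d ready replicas, found %d", expectedReplicaCount, cpms.Status.ReadyReplicas)
	return false, nil
}
nodes, err := kubeClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{
	LabelSelector: "node-role.kubernetes.io/master",
})
if err != nil {
	return isTransientAPIError(t, err)
}
if len(nodes.Items) != expectedReplicaCount {
	t.Logf("expected %d control-plane nodes, found %d", expectedReplicaCount, len(nodes.Items))
	return false, nil
}
return true, nil
```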
/label cherry-pick-approved
@geliu2016: Can not set label cherry-pick-approved: Must be member in one of these teams: [openshift-staff-engineers] In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@deads2k I think we can put up a new PR to address the review comments from here and make those changes in master since this is just a backport.
	t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
	return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
}
return isTransientAPIError(t, err)
if you retry this, you are no longer guaranteeing that you're deleting a single item. You can be deleting many.
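One hypothetical way to keep the retry yet still guarantee a single deletion (not this PR's code) is to pin the Delete to the originally listed object's UID with a precondition, so a retried call can never remove a newer machine that reuses the name; `machineToDeleteObj` stands in for the object captured from the earlier List:

```go
// Hypothetical sketch: a UID precondition makes retries safe, because the API
// server rejects the delete with a Conflict if the named machine was replaced.
uid := machineToDeleteObj.UID
err := machineClient.Delete(ctx, machineToDeleteObj.Name, metav1.DeleteOptions{
	Preconditions: &metav1.Preconditions{UID: &uid},
})
switch {
case err == nil, apierrors.IsNotFound(err):
	// Deleted now or already gone: at most this one machine was removed.
case apierrors.IsConflict(err):
	// UID mismatch: a different object now holds this name; leave it alone.
default:
	// Other errors can be retried without risking an extra deletion.
}
```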
Looks like etcd vertical scaling for 4.12 is broken until this backport merges (which is a blocking job for the control-plane-machine-set-operator).
resolved by #27907 /close
@Elbehery: This pull request references Jira Issue OCPBUGS-6841. The bug has been updated to no longer refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This is a cherry-pick of #27497
cc @hasbro17