[release-4.12] OCPBUGS-6841: Update etcd scaling test for CPMS supported platforms #27702

Conversation

Contributor

@Elbehery Elbehery commented Feb 2, 2023

This is a cherry-pick of #27497

cc @hasbro17

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Feb 2, 2023
@openshift-ci-robot

@Elbehery: This pull request references Jira Issue OCPBUGS-6841, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.z) matches configured target version for branch (4.12.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-6844 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE))
  • dependent Jira Issue OCPBUGS-6844 targets the "4.13.0" version, which is one of the valid target versions: 4.13.0
  • bug has dependents

Requesting review from QA contact:
/cc @geliu2016

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci
Contributor

openshift-ci bot commented Feb 2, 2023

@Elbehery: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

@openshift-ci openshift-ci bot requested a review from geliu2016 February 2, 2023 18:26
@Elbehery
Contributor Author

Elbehery commented Feb 2, 2023

/assign @tjungblu
/assign @hasbro17

@Elbehery
Contributor Author

Elbehery commented Feb 2, 2023

The verify job fails on make update-gofmt.

I ran it locally, along with make verify: nothing changed and there was no error.

/retest-required

@geliu2016 geliu2016 left a comment

/label cherry-pick-approved

@openshift-ci
Contributor

openshift-ci bot commented Feb 3, 2023

@geliu2016: Can not set label cherry-pick-approved: Must be member in one of these teams: [openshift-staff-engineers]

// a.png
// b.png
//
// data/
Contributor

Not sure how that got added, but you should probably remove this file from the changeset. It looks like a difference in Go version.

Contributor Author

I used go1.18, which is the version specified in go.mod for the 4.12 branch.

I will remove this 👍🏽

For platforms where the ControlPlaneMachineSet is active and
being reconciled by the CPMSO, the vertical scaling test should rely on
the CPMSO to remove and add new machines, otherwise there is a race between
the test removing a machine and the CPMSO adding a new one.
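
As a rough illustration of the commit message above, the branching could be sketched as follows. The type `cpmsState`, the function `testSteps`, and the step descriptions are hypothetical and are not taken from the PR; in particular, the order of steps on the non-CPMS path is an assumption.

```go
package main

import "fmt"

// cpmsState is a hypothetical stand-in for the state of the ControlPlaneMachineSet.
type cpmsState string

const (
	cpmsActive   cpmsState = "Active"
	cpmsInactive cpmsState = "Inactive"
)

// testSteps returns the actions the scaling test performs itself.
// When the CPMS is active, the test only deletes a machine and leaves the
// replacement to the CPMSO, so the two never race over adding machines.
func testSteps(state cpmsState) []string {
	if state == cpmsActive {
		return []string{
			"delete one control-plane machine",
			"wait for the CPMSO to create the replacement",
		}
	}
	// Assumed order for platforms without an active CPMS.
	return []string{
		"create a new control-plane machine",
		"wait for the new machine to be running",
		"delete one of the old machines",
	}
}

func main() {
	for _, s := range []cpmsState{cpmsActive, cpmsInactive} {
		fmt.Println(s, testSteps(s))
	}
}
```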
@Elbehery Elbehery force-pushed the backport_etcd_scaling_test_for_CPMS branch from b9dae82 to 4d9cf6a Compare February 3, 2023 13:01
@openshift-ci
Contributor

openshift-ci bot commented Feb 3, 2023

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
-- | -- | -- | -- | --
ci/prow/e2e-vsphere-ovn-etcd-scaling | 4d9cf6a | link | false | /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-aws-ovn-single-node-serial | 4d9cf6a | link | false | /test e2e-aws-ovn-single-node-serial

@Elbehery
Contributor Author

Elbehery commented Feb 4, 2023

/retest-required

@Elbehery
Contributor Author

Elbehery commented Feb 7, 2023

/retest-required

@Elbehery
Contributor Author

Elbehery commented Feb 7, 2023

@geliu2016 can we label this, please?
@hasbro17 do any jobs need to be overridden?

@geliu2016 geliu2016 left a comment

/label cherry-pick-approved

@openshift-ci
Contributor

openshift-ci bot commented Feb 8, 2023

@geliu2016: Can not set label cherry-pick-approved: Must be member in one of these teams: [openshift-staff-engineers]

@hasbro17
Contributor

hasbro17 commented Feb 8, 2023

Test is passing on vsphere, the job is failing due to unrelated monitoring issues:

[sig-instrumentation][Late] Alerts shouldn't report any unexpected alerts in firing or pending state [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] (32s)

fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:331]: unable to check firing alerts during test
Unexpected error:
    <*v1.Error | 0xc005915500>: {
        Type: "server_error",
        Msg: "server error: 504",
        Detail: "<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in time.\n</body></html>\n",
    }
    server_error: server error: 504
occurred
Ginkgo exit error 1: exit with code 1


/approve
/label backport-risk-assessed

@openshift-ci
Contributor

openshift-ci bot commented Feb 8, 2023

@hasbro17: Can not set label backport-risk-assessed: Must be member in one of these teams: [openshift-staff-engineers]

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2023
@Elbehery
Contributor Author

/retest-required

@Elbehery
Contributor Author

@soltysh hello ✋🏽

Can we get this merged? It is just bumping a timeout :)

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2023
@openshift-ci
Contributor

openshift-ci bot commented Feb 14, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Elbehery, geliu2016, hasbro17, mfojtik

@mfojtik mfojtik added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Feb 14, 2023
@@ -180,6 +183,91 @@ func recoverClusterToInitialStateIfNeeded(ctx context.Context, t TestingT, machi
})
}

func DeleteSingleMachine(ctx context.Context, t TestingT, machineClient machinev1beta1client.MachineInterface) (string, error) {
	waitPollInterval := 15 * time.Second
	waitPollTimeout := 5 * time.Minute
Contributor

you probably want to check this every 30s or so. The load is minimal and it helps latency. Just fixing in master is ok with me.

Contributor Author

Thanks @deads2k.

This is a backport. @hasbro17, maybe we need a new PR with these changes to 4.13 before backporting?

return isTransientAPIError(t, err)
}

// Machine names are suffixed with an index number (e.g "ci-op-xlbdrkvl-6a467-qcbkh-master-0")
Contributor

not all names are ordered by number, but ordering for consistency is still ok.
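
To illustrate the point about ordering for consistency, here is a minimal sketch of a deterministic pick, assuming the candidate names are already available as plain strings. `pickMachineToDelete` is a hypothetical helper, not a function from the PR, and the sample names are made up. Note that a plain lexicographic sort places "master-10" before "master-2", so the result is stable but not numerically ordered.

```go
package main

import (
	"fmt"
	"sort"
)

// pickMachineToDelete returns the lexicographically smallest machine name,
// or "" when the list is empty. The input slice is left untouched.
func pickMachineToDelete(names []string) string {
	if len(names) == 0 {
		return ""
	}
	sorted := append([]string(nil), names...) // copy before sorting
	sort.Strings(sorted)
	return sorted[0]
}

func main() {
	names := []string{"ci-op-x-master-2", "ci-op-x-master-abcde-1", "ci-op-x-master-0"}
	fmt.Println(pickMachineToDelete(names))
}
```

The same input always yields the same choice, regardless of the order in which the API returned the machines.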


machineToDelete := ""
err := wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
machineList, err := machineClient.List(ctx, metav1.ListOptions{LabelSelector: masterMachineLabelSelector})
Contributor

Why is this listing repeated, instead of identifying the machine to be deleted once and then retrying the Delete until you get a successful call?

Contributor

That's to retry the List if it errors out due to a transient API error. Would we not want to retry in that case?

Contributor

That's to retry the List if it errors out due to a transient API error. Would we not want to retry in that case?

The client retries automatically. Failures at this level aren't expected and even in controller code aren't generally handled.
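
The alternative raised in this thread (identify the machine once, then retry only the delete) could look roughly like the sketch below. The `machineAPI` interface, `deleteOneMachine`, and the `flakyAPI` fake are hypothetical stand-ins for illustration; they are not the machine client used in the PR.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// machineAPI is a hypothetical, minimal stand-in for the machine client.
type machineAPI interface {
	List() ([]string, error)
	Delete(name string) error
}

// deleteOneMachine chooses the target exactly once, then retries only the
// Delete call, so every attempt is aimed at the same machine.
func deleteOneMachine(api machineAPI, attempts int) (string, error) {
	if attempts < 1 {
		attempts = 1
	}
	names, err := api.List()
	if err != nil {
		return "", err
	}
	if len(names) == 0 {
		return "", errors.New("no machines found")
	}
	sort.Strings(names)
	target := names[0]

	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = api.Delete(target); lastErr == nil {
			return target, nil
		}
	}
	return target, fmt.Errorf("deleting %q: %w", target, lastErr)
}

// flakyAPI is a fake whose first Delete call fails and later calls succeed.
type flakyAPI struct {
	names   []string
	calls   int
	deleted []string
}

func (f *flakyAPI) List() ([]string, error) {
	return append([]string(nil), f.names...), nil
}

func (f *flakyAPI) Delete(name string) error {
	f.calls++
	if f.calls == 1 {
		return errors.New("transient error")
	}
	f.deleted = append(f.deleted, name)
	return nil
}

func main() {
	api := &flakyAPI{names: []string{"master-1", "master-0", "master-2"}}
	name, err := deleteOneMachine(api, 3)
	fmt.Println(name, err, api.deleted)
}
```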

@Elbehery
Contributor Author

@deads2k can we merge this to be aligned with master and 4.13, then make the required changes in a new PR against master and backport them?

// EnsureReadyReplicasOnCPMS checks if status.readyReplicas on the cluster CPMS is n
// this effectively counts the number of control-plane machines with the provider state as running
func EnsureReadyReplicasOnCPMS(ctx context.Context, t TestingT, expectedReplicaCount int, cpmsClient machinev1client.ControlPlaneMachineSetInterface) error {
	waitPollInterval := 5 * time.Second
Contributor

poll should be faster.

// The machine we just listed should be present but if not, error out
if apierrors.IsNotFound(err) {
t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
Contributor

This error will exit the loop as an error, right?

Contributor

Yes, it will break the polling loop and exit the function with an error.

Contributor

@deads2k deads2k Feb 27, 2023

Yes, it will break the polling loop and exit the function with an error.

Since the item was deleted and the function is called DeleteSingleMachine, why return the error?

return isTransientAPIError(t, err)
}

if cpms.Status.ReadyReplicas != int32(expectedReplicaCount) {
Contributor

Do you care only about this condition, or do you care about this AND that you have exactly the right number of total nodes AND that the desired replicas is also expectedReplicaCount?

Contributor

We care about this condition because we want the test to ensure that a new machine has successfully been created and is ready, which means the associated node has also been created and is running.
IIUC ReadyReplicas should cover both conditions:

I think the desired status.replicas will only tell us if the machine was created, and not whether it became running with a node attached.

Contributor

If expected does not match the desired (spec) replica count, the return further down looks incorrect.

Contributor

also, this looks like the correct spot to double check that the number of nodes is actually correct.
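
The combined check suggested in this thread (ready replicas AND desired replicas AND node count) might be sketched as below. `cpmsView` is a hypothetical flattened view of the relevant fields, and `controlPlaneSettled` is an illustrative helper; how the node count is obtained is left out and would be up to the caller.

```go
package main

import "fmt"

// cpmsView is a hypothetical flattened view of the fields under discussion.
type cpmsView struct {
	SpecReplicas  int32 // desired replicas from the spec
	ReadyReplicas int32 // ready replicas from the status
}

// controlPlaneSettled returns nil only when the desired replicas, the ready
// replicas, and the observed control-plane node count all equal expected.
func controlPlaneSettled(v cpmsView, nodeCount int, expected int32) error {
	if v.SpecReplicas != expected {
		return fmt.Errorf("desired replicas is %d, expected %d", v.SpecReplicas, expected)
	}
	if v.ReadyReplicas != expected {
		return fmt.Errorf("ready replicas is %d, expected %d", v.ReadyReplicas, expected)
	}
	if int32(nodeCount) != expected {
		return fmt.Errorf("found %d control-plane nodes, expected %d", nodeCount, expected)
	}
	return nil
}

func main() {
	fmt.Println(controlPlaneSettled(cpmsView{SpecReplicas: 3, ReadyReplicas: 3}, 3, 3))
	fmt.Println(controlPlaneSettled(cpmsView{SpecReplicas: 3, ReadyReplicas: 2}, 3, 3))
}
```

Returning an error that names the failing condition keeps the poll output readable when the check is retried.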

@geliu2016

/label cherry-pick-approved

@openshift-ci
Contributor

openshift-ci bot commented Feb 21, 2023

@geliu2016: Can not set label cherry-pick-approved: Must be member in one of these teams: [openshift-staff-engineers]

@hasbro17
Contributor

@deads2k I think we can put up a new PR to address the review comments from here and make those changes in master since this is just a backport.
However I think merging this first is still useful in fixing the test where it's broken on the platforms that have CPMS enabled.

t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
}
return isTransientAPIError(t, err)
Contributor

if you retry this, you are no longer guaranteeing that you're deleting a single item. You can be deleting many.
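
The concern can be reproduced with a small in-memory simulation. It assumes one particular failure mode: a delete that takes effect on the server while the caller still sees an error (for example, a lost response). `fakeCluster` and `relistAndDelete` are hypothetical and only model the retry shape being discussed.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// fakeCluster is a hypothetical in-memory stand-in for the machine API.
type fakeCluster struct {
	machines map[string]bool
	failOnce bool // the next remove succeeds server-side but reports an error
}

// list returns the current machine names in sorted order.
func (c *fakeCluster) list() []string {
	names := make([]string, 0, len(c.machines))
	for n := range c.machines {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// remove always deletes the machine; with failOnce set it also returns an error once.
func (c *fakeCluster) remove(name string) error {
	delete(c.machines, name)
	if c.failOnce {
		c.failOnce = false
		return errors.New("timeout: response lost")
	}
	return nil
}

// relistAndDelete re-lists on every attempt, as a retried poll body would,
// and returns the names it targeted.
func relistAndDelete(c *fakeCluster, attempts int) []string {
	var targeted []string
	for i := 0; i < attempts; i++ {
		names := c.list()
		if len(names) == 0 {
			break
		}
		target := names[0]
		targeted = append(targeted, target)
		if err := c.remove(target); err == nil {
			break
		}
	}
	return targeted
}

func main() {
	c := &fakeCluster{
		machines: map[string]bool{"master-0": true, "master-1": true, "master-2": true},
		failOnce: true,
	}
	fmt.Println(relistAndDelete(c, 3), len(c.machines))
}
```

In this run the first attempt removes master-0 but reports an error, so the second attempt re-lists, picks master-1, and removes that one too: two machines are gone from a function meant to delete one.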

@damdo
Member

damdo commented Apr 7, 2023

Looks like etcd vertical scaling for 4.12 is broken until this backport merges (which is a blocking job for the control-plane-machine-set-operator).
Their 4.14 and 4.13 counterparts are working well for us, it would be nice to get this one in too.
Thanks!

@Elbehery
Contributor Author

Elbehery commented Apr 7, 2023

Thanks @damdo for your review 👍🏽

There is another ongoing PR: #27788

@Elbehery
Contributor Author

Elbehery commented May 5, 2023

resolved by #27907

/close

@Elbehery Elbehery closed this May 5, 2023
@openshift-ci-robot

@Elbehery: This pull request references Jira Issue OCPBUGS-6841. The bug has been updated to no longer refer to the pull request using the external bug tracker.
