Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

member removal part two #768

Merged

Conversation

p0lyn0mial
Copy link
Contributor

No description provided.

@p0lyn0mial
Copy link
Contributor Author

/retest

@p0lyn0mial
Copy link
Contributor Author

/assing @hasbro17 @deads2k

@p0lyn0mial p0lyn0mial changed the title WIP: member removal part two member removal part two Mar 28, 2022
@p0lyn0mial
Copy link
Contributor Author

/assing @hasbro17 @deads2k

@p0lyn0mial
Copy link
Contributor Author

/retest

Copy link
Contributor

@hasbro17 hasbro17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions/notes about how we detect an excess of voting members when deciding to scale-down.

Comment on lines 125 to 149
if len(memberMachines) <= desiredControlPlaneReplicasCount {
klog.V(4).Infof("haven't found a replacement machine, the number of desired control plane replicas: %d must be greater than the current number of machines that host an etcd member: %d", desiredControlPlaneReplicasCount, len(memberMachines))
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the number of desired control plane replicas: %d must be greater than the current number of machines that host an etcd member

It's actually the opposite since we want an excess of voting member etcd machines or greater than the desired control-plane size, in order to select a replacement member to scale-down.
So we can probably make this log statement more accurate e.g:

Suggested change
if len(memberMachines) <= desiredControlPlaneReplicasCount {
klog.V(4).Infof("haven't found a replacement machine, the number of desired control plane replicas: %d must be greater than the current number of machines that host an etcd member: %d", desiredControlPlaneReplicasCount, len(memberMachines))
return nil
}
if len(memberMachines) <= desiredControlPlaneReplicasCount {
klog.V(4).Infof("Ignoring scale-down since the number of etcd voting member machines (%d) < desired number of control-plane replicas (%d) ", len(memberMachines), desiredControlPlaneReplicasCount)
return nil
}

if err != nil {
return err
}
memberMachines := ceohelpers.FilterMachinesWithMachineDeletionHook(masterMachines)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When deciding whether or not to scale-down a voting member this count should only be voting member machines.

Otherwise given desiredControlPlaneReplicasCount = 3 and if we have 3 voting + 1 learner members, then len(memberMachines) = 4 since learner member machines also have a deletion hook.
Deleting the voting member would trigger a scale-down since len(memberMachines) > desiredControlPlaneReplicasCount but now we've just replaced a voting member with a learner member (which may not get promoted).

So we also need to filter out the learner members here and change memberMachines => votingMemberMachines when considering scaling down a voting machine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the above you would still need to handle the cleanup of learner members pending deletion. That does not depend on len(memberMachines) <= desiredControlPlaneReplicasCount, but rather we remove the learner member whenever its machine is pending deletion.

I would suggest moving the learner member removal logic to its own function to make it simpler, but your call on addressing that wherever makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are right, I have updated the PR, please see if you like it.

it attempts to remove the member only once we have identified that
a Machine resource is being deleted and a replacement member has been created
…roller

it attempts to remove a learning member pending deletion regardless of whether a replacement member has been found
if err != nil {
return err
}
for _, member := range members {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this point I don't think we have to check the type of the member (voting vs learner)

…ks for master machines

the Machine Deletion Hook this controller reconciles is a mechanism within the Machine API that allow this operator to hold up removal of a machine
@p0lyn0mial p0lyn0mial force-pushed the member-removal-part-two branch 2 times, most recently from 3a56cc8 to c272151 Compare March 30, 2022 12:29
@p0lyn0mial
Copy link
Contributor Author

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 30, 2022

@p0lyn0mial: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-five-control-plane-replicas 5146ae7 link false /test e2e-gcp-five-control-plane-replicas

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Contributor

@hasbro17 hasbro17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Still going through the unit tests but the member removal controller logic makes sense.

learningMachines = append(learningMachines, memberMachine)
}
}
return c.removeMemberPendingDeletion(ctx, learningMachines, "learning")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a general comment, not a suggestion to change :)
There is a simpler way to tell if members are learning machines by just doing a member list and then just filtering on the member.IsLearner field from the response.
Although the benefit of your approach is that it doesn't require a live call to the etcd server so this is fine too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the benefit of your approach is that it doesn't require a live call to the etcd server so this is fine too.

yeah, I wanted to avoid a live call on every interation.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 31, 2022
@wallylewis
Copy link

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 31, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, p0lyn0mial, wallylewis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 85c402e into openshift:master Mar 31, 2022
}

if machineForEtcdHost, hasMachine := ceohelpers.IndexMachinesByNodeInternalIP(masterMachines)[etcdHost]; hasMachine && machineForEtcdHost.DeletionTimestamp != nil {
klog.V(4).Infof("won't add member: %v to the cluster because its machine is pending deletion: %v", etcdHost, machineForEtcdHost.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest v(2). This shouldn't happen often. I can probably agree one that one event every time may cause more problems than it fixes (though a rate limit may be in order), but you'll want to know this is happening when it happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am okay with v(2)

// stop if the machine API is not functional
var errs []error

if err := c.removeMemberWithoutMachine(ctx); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first check used to be, "do we have a machine API". Why would we remove a member without a machine if we don't have a machine API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not? A member without a machine won't' work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, unless there won't be machine objects when the machine API is not functional

if err != nil {
return err
}
if currentVotingMemberIPListSet.Len() <= desiredControlPlaneReplicasCount {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing you're doing this because deletionTimestamp set on a machine can mean either

  1. this member should scale down now, or
  2. this member should scale down after something else comes up.

However, there needs to be a better signal of this than a static installConfig that is never changed. Where is this intent actually present.

@@ -149,6 +226,42 @@ func (c *clusterMemberRemovalController) sync(ctx context.Context, _ factory.Syn
return nil
}

func (c *clusterMemberRemovalController) removeMemberPendingDeletion(ctx context.Context, memberMachines []*machinev1beta1.Machine, memberType string) error {
memberMachinesPendingDeletion := ceohelpers.FilterMachinesPendingDeletion(memberMachines)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is missing in this PR, is it DeletionTimestamp != nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return err
}
if hasInternalIP(memberMachineToDelete, memberIP) {
memberLocator := fmt.Sprintf("[ url: %v, name: %v, id: %v ]", memberIP, member.Name, member.ID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is event worthy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MemberRemove fires an event

stbenjam added a commit to stbenjam/cluster-etcd-operator that referenced this pull request Apr 4, 2022
…al-part-two"

This reverts commit 85c402e, reversing
changes made to c48fd04.
deads2k added a commit that referenced this pull request Apr 4, 2022
Revert "Merge pull request #768 from p0lyn0mial/member-removal-part-two"
@p0lyn0mial p0lyn0mial deleted the member-removal-part-two branch April 4, 2022 15:46
@p0lyn0mial p0lyn0mial restored the member-removal-part-two branch April 4, 2022 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants