member removal part two #768
Some questions/notes about how we detect an excess of voting members when deciding to scale-down.
if len(memberMachines) <= desiredControlPlaneReplicasCount {
	klog.V(4).Infof("haven't found a replacement machine, the number of desired control plane replicas: %d must be greater than the current number of machines that host an etcd member: %d", desiredControlPlaneReplicasCount, len(memberMachines))
	return nil
}
the number of desired control plane replicas: %d must be greater than the current number of machines that host an etcd member
It's actually the opposite: to select a member to scale down, we want an excess of voting member etcd machines, i.e. more than the desired control-plane size.
So we can probably make this log statement more accurate, e.g.:
-if len(memberMachines) <= desiredControlPlaneReplicasCount {
-	klog.V(4).Infof("haven't found a replacement machine, the number of desired control plane replicas: %d must be greater than the current number of machines that host an etcd member: %d", desiredControlPlaneReplicasCount, len(memberMachines))
-	return nil
-}
+if len(memberMachines) <= desiredControlPlaneReplicasCount {
+	klog.V(4).Infof("Ignoring scale-down since the number of etcd voting member machines (%d) <= desired number of control-plane replicas (%d)", len(memberMachines), desiredControlPlaneReplicasCount)
+	return nil
+}
if err != nil {
	return err
}
memberMachines := ceohelpers.FilterMachinesWithMachineDeletionHook(masterMachines)
When deciding whether or not to scale down a voting member, this count should only include voting member machines.
Otherwise, given desiredControlPlaneReplicasCount = 3 and 3 voting + 1 learner members, len(memberMachines) = 4, since learner member machines also have a deletion hook.
Deleting the voting member would then trigger a scale-down, since len(memberMachines) > desiredControlPlaneReplicasCount, but we would have just replaced a voting member with a learner member (which may not get promoted).
So we also need to filter out the learner members here and change memberMachines => votingMemberMachines when considering scaling down a voting machine.
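To make the 3 voting + 1 learner case concrete, here is a minimal sketch of the suggested filtering. The Machine type, filterVotingMemberMachines, and hasExcessVotingMembers are hypothetical names for illustration only, not the operator's actual helpers, and the set of voting member IPs is assumed to be known already.

```go
package main

import "fmt"

// Machine is a hypothetical, simplified stand-in for a control-plane Machine object.
type Machine struct {
	Name       string
	InternalIP string
}

// filterVotingMemberMachines keeps only the machines whose internal IP
// belongs to a voting etcd member (assumption: the IP set is already known).
func filterVotingMemberMachines(machines []Machine, votingMemberIPs map[string]bool) []Machine {
	var voting []Machine
	for _, m := range machines {
		if votingMemberIPs[m.InternalIP] {
			voting = append(voting, m)
		}
	}
	return voting
}

// hasExcessVotingMembers reports whether there are more voting member
// machines than desired control-plane replicas.
func hasExcessVotingMembers(votingMemberMachines []Machine, desiredReplicas int) bool {
	return len(votingMemberMachines) > desiredReplicas
}

func main() {
	desired := 3
	machines := []Machine{
		{Name: "master-0", InternalIP: "10.0.0.1"},
		{Name: "master-1", InternalIP: "10.0.0.2"},
		{Name: "master-2", InternalIP: "10.0.0.3"},
		{Name: "master-3", InternalIP: "10.0.0.4"}, // hosts a learner member
	}
	votingIPs := map[string]bool{"10.0.0.1": true, "10.0.0.2": true, "10.0.0.3": true}

	// Counting every machine with a deletion hook would allow a scale-down.
	fmt.Println(len(machines) > desired) // true

	// Counting only voting member machines does not.
	voting := filterVotingMemberMachines(machines, votingIPs)
	fmt.Println(hasExcessVotingMembers(voting, desired)) // false
}
```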
With the above you would still need to handle the cleanup of learner members pending deletion. That does not depend on len(memberMachines) <= desiredControlPlaneReplicasCount; rather, we remove the learner member whenever its machine is pending deletion.
I would suggest moving the learner member removal logic into its own function to make it simpler, but your call on addressing that wherever makes sense.
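As a rough illustration of that split, the sketch below keeps learner cleanup in its own function that ignores the replica count, while voting member removal stays behind the excess check. All names here (Machine, IsLearner, PendingDeletion, learnersPendingDeletion, votersPendingDeletion) are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// Machine is a hypothetical, simplified view of a master machine.
type Machine struct {
	Name            string
	IsLearner       bool // assumption: the machine hosts a learner member
	PendingDeletion bool // assumption: the machine has a deletion timestamp set
}

// learnersPendingDeletion selects learner machines pending deletion.
// It deliberately does not look at the desired replica count.
func learnersPendingDeletion(machines []Machine) []string {
	var names []string
	for _, m := range machines {
		if m.IsLearner && m.PendingDeletion {
			names = append(names, m.Name)
		}
	}
	return names
}

// votersPendingDeletion selects voting machines pending deletion, but only
// when there are more voting machines than desired replicas.
func votersPendingDeletion(machines []Machine, desiredReplicas int) []string {
	var voters []Machine
	for _, m := range machines {
		if !m.IsLearner {
			voters = append(voters, m)
		}
	}
	if len(voters) <= desiredReplicas {
		return nil // no excess of voting members, nothing to scale down
	}
	var names []string
	for _, m := range voters {
		if m.PendingDeletion {
			names = append(names, m.Name)
		}
	}
	return names
}

func main() {
	machines := []Machine{
		{Name: "master-0"},
		{Name: "master-1"},
		{Name: "master-2", PendingDeletion: true},
		{Name: "master-3", IsLearner: true, PendingDeletion: true},
	}
	fmt.Println(learnersPendingDeletion(machines))  // [master-3]
	fmt.Println(votersPendingDeletion(machines, 3)) // []
}
```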
Yes, you are right. I have updated the PR; please see if you like it.
it attempts to remove the member only once we have identified that a Machine resource is being deleted and a replacement member has been created
…roller it attempts to remove a learning member pending deletion regardless of whether a replacement member has been found
if err != nil {
	return err
}
for _, member := range members {
at this point I don't think we have to check the type of the member (voting vs learner)
…ks for master machines the Machine Deletion Hook this controller reconciles is a mechanism within the Machine API that allow this operator to hold up removal of a machine
/approve
Still going through the unit tests but the member removal controller logic makes sense.
		learningMachines = append(learningMachines, memberMachine)
	}
}
return c.removeMemberPendingDeletion(ctx, learningMachines, "learning")
Just a general comment, not a suggestion to change :)
There is a simpler way to tell whether members are learners: do a member list and then filter on the member.IsLearner field from the response.
Although the benefit of your approach is that it doesn't require a live call to the etcd server so this is fine too.
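For reference, the member-list alternative could look roughly like the sketch below. The local Member type is a stand-in that mirrors only the ID, Name, and IsLearner fields of an etcd member; in a real controller the slice would come from a member list response, which is the live call discussed above. splitMembers is a hypothetical helper name.

```go
package main

import "fmt"

// Member is a local stand-in mirroring the fields of an etcd member that
// matter here; it is not the real etcd type.
type Member struct {
	ID        uint64
	Name      string
	IsLearner bool
}

// splitMembers partitions members into voting members and learners
// based on the IsLearner field.
func splitMembers(members []Member) (voting, learners []Member) {
	for _, m := range members {
		if m.IsLearner {
			learners = append(learners, m)
		} else {
			voting = append(voting, m)
		}
	}
	return voting, learners
}

func main() {
	members := []Member{
		{ID: 1, Name: "etcd-0"},
		{ID: 2, Name: "etcd-1"},
		{ID: 3, Name: "etcd-2"},
		{ID: 4, Name: "etcd-3", IsLearner: true},
	}
	voting, learners := splitMembers(members)
	fmt.Println(len(voting), len(learners)) // 3 1
}
```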
Although the benefit of your approach is that it doesn't require a live call to the etcd server so this is fine too.
Yeah, I wanted to avoid a live call on every iteration.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: hasbro17, p0lyn0mial, wallylewis
}

if machineForEtcdHost, hasMachine := ceohelpers.IndexMachinesByNodeInternalIP(masterMachines)[etcdHost]; hasMachine && machineForEtcdHost.DeletionTimestamp != nil {
	klog.V(4).Infof("won't add member: %v to the cluster because its machine is pending deletion: %v", etcdHost, machineForEtcdHost.Name)
I suggest v(2). This shouldn't happen often. I can probably agree that one event every time may cause more problems than it fixes (though a rate limit may be in order), but you'll want to know this is happening when it happens.
i am okay with v(2)
// stop if the machine API is not functional
var errs []error

if err := c.removeMemberWithoutMachine(ctx); err != nil {
the first check used to be, "do we have a machine API". Why would we remove a member without a machine if we don't have a machine API?
Why not? A member without a machine won't work.
ah, unless there won't be machine objects when the machine API is not functional
if err != nil {
	return err
}
if currentVotingMemberIPListSet.Len() <= desiredControlPlaneReplicasCount {
I'm guessing you're doing this because deletionTimestamp set on a machine can mean either
- this member should scale down now, or
- this member should scale down after something else comes up.
However, there needs to be a better signal of this than a static installConfig that is never changed. Where is this intent actually present?
@@ -149,6 +226,42 @@ func (c *clusterMemberRemovalController) sync(ctx context.Context, _ factory.Syn
	return nil
}

func (c *clusterMemberRemovalController) removeMemberPendingDeletion(ctx context.Context, memberMachines []*machinev1beta1.Machine, memberType string) error {
	memberMachinesPendingDeletion := ceohelpers.FilterMachinesPendingDeletion(memberMachines)
this method is missing in this PR, is it DeletionTimestamp != nil?
	return err
}
if hasInternalIP(memberMachineToDelete, memberIP) {
	memberLocator := fmt.Sprintf("[ url: %v, name: %v, id: %v ]", memberIP, member.Name, member.ID)
this is event worthy
MemberRemove fires an event
Revert "Merge pull request #768 from p0lyn0mial/member-removal-part-two"
No description provided.