-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rotator tool to accelerate k8s cluster upgrades and node rotations. #1
Conversation
@stafot @angeloskyratzakos @Szymongib @gabrieljackson Sorry in advance for the huge PR, it is just all new and there was no easy way to split it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work on this! A few thoughts and questions below.
return "", nil | ||
} | ||
|
||
return *resp.Reservations[0].Instances[0].InstanceId, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check for len() on either Reservations
or Instances
before referencing the first value? I honestly don't know so just checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes make sense an additional check imho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically line 46 covers this. Because if AWS cannot find the instance it will return a nil for reservations. I think we don't need an extra check for that but 0/5 on that.
rotator/drain.go
Outdated
|
||
// getPodsForDeletion receives resource info for a node, and returns all the pods from the given node that we | ||
// are planning on deleting. If there are any pods preventing us from deleting, we return that list in an error. | ||
func getPodsForDeletion(client kubernetes.Interface, node *corev1.Node, options *DrainOptions) (pods []corev1.Pod, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing with named returns here. This file appears to originate from somewhere else so I will leave changing it up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gabrieljackson drain was the most complex part of the job because its not supported natively by k8s. This code here is a combination of open shift and k8s code adapted to our needs 😅 . Missed the returns somehow, should be good now.
return err | ||
} | ||
|
||
newNodes := newNodes(nodeHostnames, autoscalingGroup.Nodes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain a bit the logic here? Why autoscaling.Nodes are the old nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stafot sure, autoscaling.Nodes is the only source of truth that we need. It includes a list of all the nodes that the ASG included when we decided to initiate the rotation. This is the list against which we compare and check the progress of rotation.
Makefile
Outdated
# Build variables | ||
COMMIT_HASH ?= $(shell git rev-parse HEAD) | ||
BUILD_DATE ?= $(shell date +%FT%T%z) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally I would like to use these on tagging, but 0/5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great work, which could possible useful not only for Mattermost but for whole Open Source community.
Very nice tool!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! Excited to try it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome tool! 🚀
I have listed some of my more or less minor concerns.
rotator/helpers.go
Outdated
func newNodes(allNodes, oldNodes []string) []string { | ||
var newNodes []string | ||
for _, node := range allNodes { | ||
for _, oldNode := range oldNodes { | ||
if node != oldNode { | ||
newNodes = append(newNodes, node) | ||
} | ||
} | ||
} | ||
return newNodes | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the logic is correct here.
Lets examine the following case:
allNodes = [n1, n2, n3, n4]
oldNodes = [n2,n3]
The the function will return newNodes = [n1, n1, n2, n3, n4, n4]
If I understand correctly, you aimed to achieve [n1, n4]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing catch thanks a lot. This check is not that vital for the functionality because it is used to check that ASG nodes are up but it was doing duplicate work for no reason and I missed it. I changed to to a new function :)
rotator/helpers.go
Outdated
return err | ||
} | ||
logger.Infof("Cluster with cluster ID %s is consisted of %d Autoscaling Groups", cluster.ClusterID, len(asgs)) | ||
var autoscalingGroups []AutoscalingGroup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this var needed? I see you collect all autoscaling groups in it but I do not see it being used after that.
k8s/helpers.go
Outdated
if err != nil && k8sErrors.IsNotFound(err) { | ||
logger.Infof("Node %s not found, waiting...", nodeName) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it wouldn't be useful to also log the error if it is different than NotFound
as in this case we won't know about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split it to log the error if its not NotFound
. It won't return the error as we want the timeout to handle any break.
} | ||
|
||
if len(autoscalingGroup.Nodes) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we pop nodes from autoscalingGroup
here? Or are they removed somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For workers we are popping the nodes in the Drain function. The reason for that is that in a group of worker nodes that are being rotated we want to drain/terminate/pop each one individually so it is done there.
rotator/helpers.go
Outdated
err = Drain(clientset, []*corev1.Node{node}, drainOptions) | ||
if err != nil { | ||
if attempts--; attempts > 0 { | ||
logger.Infof("Node %s drain failed, retrying...", nodeToDrain) | ||
autoscalingGroup.DrainNodes([]string{nodeToDrain}, attempts-1, gracePeriod, wait, clientset, logger, nodeType) | ||
} else { | ||
return errors.Wrapf(err, "Failed to drain node %s", nodeToDrain) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to use loop here instead of recursion. Also I think that we reduce attempts
here twice with attempts--
in line 90 and then passing attempts-1
to the DrainNodes
function. In case of attempts = 3
there would actually be 2 attempts.
It could be rafactored to something like that (disclaimer: I did not test this code 😄 ):
err = Drain(clientset, []*corev1.Node{node}, drainOptions)
for i := 1; i<attempts && err != nil; i++ {
logger.Warnf("Failed to drain node %q on attempt %d, retrying up to %d times", nodesToDrain, i, attempts)
err = Drain(clientset, []*corev1.Node{node}, drainOptions)
}
if err != nil {
return errors.Wrapf(err, "Failed to drain node %s", nodeToDrain)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice and clean, updated. Thanks a lot @Szymongib .
docker build \ | ||
--build-arg DOCKER_BUILD_IMAGE=$(DOCKER_BUILD_IMAGE) \ | ||
--build-arg DOCKER_BASE_IMAGE=$(DOCKER_BASE_IMAGE) \ | ||
. -f build/Dockerfile \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There actually is no Dockerfile, did you intend to add it in later PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap that was my plan, but I already wrote about this in the readme so I added it now ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
Rotator is a tool meant to smooth and accelerate k8s cluster upgrades and node rotations. It offers automation on autoscaling group recognition and flexility on options such as, how fast to rotate nodes, drain retries, waiting time between rotations and drains as well as mater/worker node separation.
This tool can work as both a cli/api tool and an imported go package for use from other applications.