
Rotator tool to accelerate k8s cluster upgrades and node rotations. #1

Merged: 7 commits into main on Mar 1, 2021

Conversation

stylianosrigas (Contributor) commented Feb 16, 2021

Rotator is a tool meant to smooth and accelerate k8s cluster upgrades and node rotations. It automates autoscaling group discovery and offers flexibility on options such as how fast to rotate nodes, drain retries, waiting time between rotations and drains, as well as master/worker node separation.

This tool can work both as a CLI/API tool and as an imported Go package for use from other applications.
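
For illustration, a minimal sketch of the package-import use case; the names below (import path, Cluster struct, InitRotateCluster) are assumptions made for this example, not the confirmed exported API:

package main

import (
	"log"

	"github.com/mattermost/rotator/rotator" // assumed import path
)

func main() {
	// Describe the target cluster; struct and field names are assumptions
	// made for this sketch.
	cluster := &rotator.Cluster{
		ClusterID: "example-cluster",
	}
	// Hypothetical entry point that discovers the ASGs and rotates nodes.
	if err := rotator.InitRotateCluster(cluster); err != nil {
		log.Fatalf("rotation failed: %v", err)
	}
}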

@stylianosrigas stylianosrigas changed the title WIP: Node rotator Rotator tool to accelerate k8s cluster upgrades and node rotations. Feb 25, 2021
stylianosrigas (Author) commented:

@stafot @angeloskyratzakos @Szymongib @gabrieljackson Sorry in advance for the huge PR, it is just all new and there was no easy way to split it.

gabrieljackson left a comment:

Great work on this! A few thoughts and questions below.

api/api.go (outdated, resolved)
return "", nil
}

return *resp.Reservations[0].Instances[0].InstanceId, nil

gabrieljackson:

Do we need to check for len() on either Reservations or Instances before referencing the first value? I honestly don't know so just checking.

Contributor:

Yes, an additional check makes sense IMHO.

stylianosrigas (Author):

Theoretically line 46 covers this, because if AWS cannot find the instance it will return nil for Reservations. I don't think we need an extra check for that, but 0/5 on that.
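
For reference, a guarded version of the snippet above might look like this (a sketch of the extra check being discussed, not necessarily what was merged):

// Check both slices before indexing, in case AWS returns an empty
// (rather than nil) Reservations or Instances list.
if len(resp.Reservations) == 0 || len(resp.Reservations[0].Instances) == 0 {
	return "", nil
}

return *resp.Reservations[0].Instances[0].InstanceId, nil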

rotator/rotator.go (outdated, resolved)
model/rotate_request.go (outdated, resolved)
model/rotate_request.go (resolved)
rotator/drain.go (outdated, resolved)
rotator/drain.go (outdated, resolved)
rotator/drain.go (outdated)

// getPodsForDeletion receives resource info for a node, and returns all the pods from the given node that we
// are planning on deleting. If there are any pods preventing us from deleting, we return that list in an error.
func getPodsForDeletion(client kubernetes.Interface, node *corev1.Node, options *DrainOptions) (pods []corev1.Pod, err error) {

gabrieljackson:

Same thing with named returns here. This file appears to originate from somewhere else so I will leave changing it up to you.

stylianosrigas (Author):

@gabrieljackson drain was the most complex part of the job because it's not supported natively by k8s. This code here is a combination of OpenShift and k8s code adapted to our needs 😅. I missed the returns somehow; should be good now.
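
Dropping the named returns is just a signature change, roughly:

// Same function with plain (unnamed) return values; the body then returns
// the pod slice and error explicitly (sketch of the change discussed above).
func getPodsForDeletion(client kubernetes.Interface, node *corev1.Node, options *DrainOptions) ([]corev1.Pod, error) {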

Makefile (outdated, resolved)
Makefile (outdated, resolved)
return err
}

newNodes := newNodes(nodeHostnames, autoscalingGroup.Nodes)
stafot (Contributor):

Could you explain the logic here a bit? Why are autoscaling.Nodes the old nodes?

stylianosrigas (Author):

@stafot sure, autoscaling.Nodes is the only source of truth that we need. It includes a list of all the nodes that the ASG included when we decided to initiate the rotation. This is the list against which we compare and check the progress of rotation.

Makefile (outdated):
# Build variables
COMMIT_HASH ?= $(shell git rev-parse HEAD)
BUILD_DATE ?= $(shell date +%FT%T%z)

Contributor:

Ideally I would like to use these for tagging, but 0/5.
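
One common way to wire these build variables into the binary is -X linker flags targeting package-level variables; a sketch under that assumption (this repo's actual wiring isn't shown in the thread):

// Set at build time, e.g.:
//   go build -ldflags "-X main.commitHash=$(COMMIT_HASH) -X main.buildDate=$(BUILD_DATE)"
var (
	commitHash = "unknown"
	buildDate  = "unknown"
)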

stafot (Contributor) left a comment:

Really great work, which could possibly be useful not only for Mattermost but for the whole open-source community.
Very nice tool!!

gabrieljackson left a comment:

This looks good to me! Excited to try it out.

Makefile (outdated, resolved)
Szymongib left a comment:

Awesome tool! 🚀
I have listed some of my more or less minor concerns.

rotator/rotator.go (outdated, resolved)
Comment on lines 21 to 31
func newNodes(allNodes, oldNodes []string) []string {
var newNodes []string
for _, node := range allNodes {
for _, oldNode := range oldNodes {
if node != oldNode {
newNodes = append(newNodes, node)
}
}
}
return newNodes
}

Szymongib:

Not sure if the logic is correct here.
Let's examine the following case:

allNodes = [n1, n2, n3, n4]
oldNodes = [n2, n3]

Then the function will return newNodes = [n1, n1, n2, n3, n4, n4].
If I understand correctly, you aimed to achieve [n1, n4]?

stylianosrigas (Author):

Amazing catch, thanks a lot. This check is not that vital for the functionality, because it is only used to check that ASG nodes are up, but it was doing duplicate work for no reason and I missed it. I changed it to a new function :)
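
For reference, a set-difference version that returns [n1, n4] for the example above (a sketch of the fix; the exact function that landed may differ):

// newNodes returns the members of allNodes that do not appear in oldNodes.
func newNodes(allNodes, oldNodes []string) []string {
	old := make(map[string]struct{}, len(oldNodes))
	for _, n := range oldNodes {
		old[n] = struct{}{}
	}
	var fresh []string
	for _, n := range allNodes {
		if _, found := old[n]; !found {
			fresh = append(fresh, n)
		}
	}
	return fresh
}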

cmd/rotator/server.go (outdated, resolved)
return err
}
logger.Infof("Cluster with cluster ID %s is consisted of %d Autoscaling Groups", cluster.ClusterID, len(asgs))
var autoscalingGroups []AutoscalingGroup

Szymongib:

Is this var needed? I see you collect all autoscaling groups in it, but I do not see it being used after that.

k8s/helpers.go (outdated):
Comment on lines 49 to 51
if err != nil && k8sErrors.IsNotFound(err) {
logger.Infof("Node %s not found, waiting...", nodeName)
}

Szymongib:

I wonder if it wouldn't be useful to also log the error when it is something other than NotFound, as in that case we won't know about it.

stylianosrigas (Author):

Split it to log the error if it's not NotFound. It won't return the error, as we want the timeout to handle any break.
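
The split described here might look roughly like the following, assuming a logrus-style logger (a sketch; exact log wording assumed):

if err != nil {
	if k8sErrors.IsNotFound(err) {
		logger.Infof("Node %s not found, waiting...", nodeName)
	} else {
		// Surface unexpected errors but keep polling; the surrounding
		// timeout is responsible for breaking out of the wait.
		logger.WithError(err).Warnf("Failed to get node %s, waiting...", nodeName)
	}
}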

Comment on lines +214 to +216
}

if len(autoscalingGroup.Nodes) > 0 {

Szymongib:

Shouldn't we pop nodes from autoscalingGroup here? Or are they removed somewhere else?

stylianosrigas (Author):

For workers we pop the nodes in the Drain function. The reason is that in a group of worker nodes being rotated, we want to drain/terminate/pop each one individually, so it is done there.
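
Conceptually, the per-worker flow is drain, terminate, then pop from the tracked list before moving on; a rough sketch with hypothetical helper names (in the real code the popping happens inside the Drain path):

// Iterate over a copy so popping from the source-of-truth list is safe.
nodes := append([]string(nil), autoscalingGroup.Nodes...)
for _, nodeName := range nodes {
	if err := drainNode(nodeName); err != nil { // hypothetical helper
		return err
	}
	if err := terminateInstance(nodeName); err != nil { // hypothetical helper
		return err
	}
	autoscalingGroup.popNodes([]string{nodeName}) // hypothetical helper
}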

Comment on lines 88 to 96
err = Drain(clientset, []*corev1.Node{node}, drainOptions)
if err != nil {
if attempts--; attempts > 0 {
logger.Infof("Node %s drain failed, retrying...", nodeToDrain)
autoscalingGroup.DrainNodes([]string{nodeToDrain}, attempts-1, gracePeriod, wait, clientset, logger, nodeType)
} else {
return errors.Wrapf(err, "Failed to drain node %s", nodeToDrain)
}
}

Szymongib:

I think it would be better to use a loop here instead of recursion. Also, I think we reduce attempts twice: with attempts-- in line 90 and then by passing attempts-1 to the DrainNodes function. In case of attempts = 3 there would actually be only 2 attempts.

It could be refactored to something like this (disclaimer: I did not test this code 😄 ):

	err = Drain(clientset, []*corev1.Node{node}, drainOptions)
	for i := 1; i < attempts && err != nil; i++ {
		logger.Warnf("Failed to drain node %q on attempt %d, retrying up to %d times", nodeToDrain, i, attempts)
		err = Drain(clientset, []*corev1.Node{node}, drainOptions)
	}
	if err != nil {
		return errors.Wrapf(err, "Failed to drain node %s", nodeToDrain)
	}

stylianosrigas (Author):

Nice and clean, updated. Thanks a lot @Szymongib .

Makefile (outdated, resolved)
docker build \
--build-arg DOCKER_BUILD_IMAGE=$(DOCKER_BUILD_IMAGE) \
--build-arg DOCKER_BASE_IMAGE=$(DOCKER_BASE_IMAGE) \
. -f build/Dockerfile \

Szymongib:

There actually is no Dockerfile; did you intend to add it in a later PR?

stylianosrigas (Author):

Yep, that was my plan, but I had already written about it in the README, so I added it now ;)

Makefile (resolved)
Szymongib left a comment:

Great work!

stylianosrigas merged commit eccb2da into main on Mar 1, 2021
Labels: 2: Dev Review (Requires review by a core committer)
4 participants