Some documentation and code for managing blue/green EKS workers
- You already have a VPC created (and NAT gateway if applicable)
- You already have [private] subnets for EKS created - see data.tf as you may need to modify the filter for your subnets
- You have AWS credentials setup for the correct region and can Terraform on a basic level
- A public key for your instance keypair exists at eks.pub (or change the path in data.tf)
- Your load balancers will attach the autoscaling groups created here
- cluster-autoscaler is running in your cluster to provide the autoscaling capabilities.
- You've made any additional changes to the Terraform files as required
Build your AMI
Run terraform to create your cluster and workers
Clone this repo and update any variables, worker parameters, etc. Then go through the standard "new terraform steps":
- terraform workspace new <YOUR WORKSPACE NAME> - used as an environment name in the code (e.g. prod, staging, dev)
- terraform validate - make sure any changes are valid
- terraform apply - to create your cluster with blue workers scaled up.
If you don't want to use us-west-2, modify provider.tf.
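
The steps above can be sketched as a small shell helper. This is only an illustration, not part of the repo: the names run, create_cluster, and DRY_RUN are hypothetical, and terraform init is added on the assumption that the working directory has not been initialized yet. With DRY_RUN left at its default of 1, the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this repo.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create a workspace, validate the configuration, and apply it.
# $1: workspace name, used as the environment name (e.g. prod, staging, dev)
create_cluster() {
  local workspace="${1:-}"
  if [ -z "$workspace" ]; then
    echo "usage: create_cluster <workspace>" >&2
    return 1
  fi
  # Assumption: the directory is not initialized yet, so init runs first.
  run terraform init &&
    run terraform workspace new "$workspace" &&
    run terraform validate &&
    run terraform apply
}
```

Sourcing the file and calling create_cluster dev prints the four terraform commands in order. Setting DRY_RUN=0 before the call runs them for real, stopping at the first failure.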
Blue/green worker updates
Now that you have a cluster and a fully scaled up worker group, it's time to bring up the green workers with a new AMI. Here's an outline of the process:
- Set asg_min_size greater than 0 to scale up the green workers with the updated AMI
- Wait for them to join the cluster - takes about 30s to build them and another 30-60s or so for them to be ready.
- Assuming your Load Balancers are already aware of the autoscaling groups created by the terraform-aws-eks module, make sure the new workers are attached to your LBs before proceeding, or you will be in for a rude awakening when you transition pods in the next step!
- Drain the old nodes to transition pods slowly over to the new nodes with drain_nodes.sh. If you are confident, you can drain the entire blue node group with this command:
kubectl drain -l eks_worker_group=blue --ignore-daemonsets=true --delete-local-data --force
- After verifying all the pods have been moved to the right nodes, scale the old worker autoscaling group to zero by setting the parameters in step 1 to 0 on the blue worker group. With cluster-autoscaler running, however, the node group will not be scaled to zero this way because of how cluster-autoscaler works. Instead, set only the minSize to 0; cluster-autoscaler will reap the cordoned nodes after 10 minutes or so, once it detects them as unneeded.
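
The wait-then-drain part of the process above could be scripted roughly as follows. This is a sketch under assumptions, not the contents of drain_nodes.sh: the function names and DRY_RUN switch are hypothetical, and the eks_worker_group=green label is assumed by analogy with the blue label used in the drain command. It does not check load balancer attachment, so that still needs to be verified by hand before draining.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this repo.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Block until every green node reports Ready.
# $1: optional timeout (default 300s)
# Assumption: green workers carry the label eks_worker_group=green.
wait_for_green() {
  run kubectl wait --for=condition=Ready node \
    -l eks_worker_group=green --timeout="${1:-300s}"
}

# Drain every node in the given worker group in one go.
# Note: newer kubectl releases rename --delete-local-data
# to --delete-emptydir-data.
drain_group() {
  local group="$1"
  run kubectl drain -l "eks_worker_group=${group}" \
    --ignore-daemonsets=true --delete-local-data --force
}

# Wait for green, then drain blue. Draining is skipped if the wait fails.
rollover() {
  wait_for_green "${1:-}" && drain_group blue
}
```

Calling rollover 120s in dry-run mode prints the kubectl wait and kubectl drain commands without touching the cluster. Scaling the blue group down afterwards is still a Terraform change, as described in the last step.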
- Simplify the draining/transition process to a single step. Will need CA to support this, as requested here.
- Write a wrapper around all the terraform, verification, waiting, etc.