NOTE: this repository is currently UNMAINTAINED and is looking for new owner(s). See #74 for more information.
- Introduction
- Motivation
- Usage
- Scope of the project
- Operating logic
- Related
- Communication
- Contributing
- License
K8s Spot Rescheduler is a tool that tries to reduce load on a set of Kubernetes nodes. It was designed to move Pods scheduled on AWS on-demand instances to AWS spot instances, so that the on-demand instances can be safely scaled down (by the Cluster Autoscaler).
In practice, the rescheduler can be used to move load from any group of nodes onto a different group of nodes. The two groups just need to be labelled appropriately.
For example, it could also be used to allow controller nodes to take up slack while new nodes are being scaled up, and then to reschedule those pods when the new capacity becomes available, thus reducing the load on the controllers once again.
This project was inspired by the Critical Pod Rescheduler and takes portions of code from both the Critical Pod Rescheduler and the Cluster Autoscaler.
AWS spot instances are a great way to reduce your infrastructure running costs. They do, however, come with a significant drawback: at any point, the spot price for the instances you are using could rise above your bid, and your instances will be terminated. To solve this problem, you can use an AutoScaling group backed by on-demand instances and managed by the Cluster Autoscaler to take up the slack when spot instances are removed from your cluster.
The problem, however, comes when the spot price drops and new spot instances are added back into your cluster. At this point you are left with empty spot instances and full, expensive on-demand instances.
By tainting the on-demand instances with the Kubernetes PreferNoSchedule taint, we can ensure that, if at any point the scheduler needs to choose between spot and on-demand instances, it will choose the preferred spot instances to schedule the new Pods onto.
However, the scheduler won't reschedule Pods that are already running on on-demand instances, blocking them from being scaled down. At this point, the K8s Spot Rescheduler is required to start the process of moving Pods from the on-demand instances back onto the spot instances.
A Docker image is available at quay.io/pusher/k8s-spot-rescheduler.
These images are currently built on pushes to master. Releases will be tagged as and when releases are made.
Sample Kubernetes manifests are available in the deploy folder.
To deploy in clusters using RBAC, please apply all of the manifests (Deployment, ClusterRole, ClusterRoleBinding and ServiceAccount) in the deploy folder, and uncomment the serviceAccountName in the Deployment.
For the K8s Spot Rescheduler to process nodes as expected, you will need identifying labels that can be passed to the program so it can distinguish which nodes it should consider as on-demand instances and which it should consider as spot instances.
For instance, you could add the labels node-role.kubernetes.io/worker and node-role.kubernetes.io/spot-worker to your on-demand and spot instances respectively.
You should also add the PreferNoSchedule taint to your on-demand instances to ensure that the scheduler prefers spot instances when making its scheduling decisions.
For example you could add the following flags to your Kubelet:
--register-with-taints="node-role.kubernetes.io/worker=true:PreferNoSchedule"
--node-labels="node-role.kubernetes.io/worker=true"
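As a rough illustration of how such labels can separate the two groups, the sketch below classifies a node from its label map. The `classify` helper and the `nodeKind` type are hypothetical, and matching on the mere presence of the label key (ignoring its value) is an assumption made for this example, not a description of the rescheduler's actual matching rules.

```go
package main

import "fmt"

// Label keys taken from the example above.
const (
	onDemandLabel = "node-role.kubernetes.io/worker"
	spotLabel     = "node-role.kubernetes.io/spot-worker"
)

// nodeKind is a hypothetical helper type used only in this sketch.
type nodeKind string

const (
	kindOnDemand nodeKind = "on-demand"
	kindSpot     nodeKind = "spot"
	kindIgnored  nodeKind = "ignored"
)

// classify reports which group a node belongs to. It only checks whether
// the label key is present; the label value is not inspected (assumption).
func classify(labels map[string]string) nodeKind {
	if _, ok := labels[spotLabel]; ok {
		return kindSpot
	}
	if _, ok := labels[onDemandLabel]; ok {
		return kindOnDemand
	}
	return kindIgnored
}

func main() {
	// Example nodes; the names are made up for illustration.
	nodes := []struct {
		name   string
		labels map[string]string
	}{
		{"node-a", map[string]string{onDemandLabel: "true"}},
		{"node-b", map[string]string{spotLabel: "true"}},
		{"node-c", map[string]string{"example.com/other": "true"}},
	}
	for _, n := range nodes {
		fmt.Printf("%s: %s\n", n.name, classify(n.labels))
	}
}
```

Nodes carrying neither label are left alone in this sketch, which mirrors the idea that only labelled nodes are considered as drain candidates or targets.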
If you wish to build the binary yourself, first make sure you have Go installed and set up. Then clone this repo into your $GOPATH and download the dependencies using dep.
cd $GOPATH/src/github.com # Create this directory if it doesn't exist
git clone git@github.com:pusher/k8s-spot-rescheduler pusher/k8s-spot-rescheduler
dep ensure -v # Installs dependencies to vendor folder.
Then build the code using go build, which will produce the built binary in a file named k8s-spot-rescheduler.
-v (default: 0): The log verbosity level the program should run at. Currently numeric, with values between 2 and 4; -v=2 is recommended.
--running-in-cluster (default: true): Optional. If this controller is running in a Kubernetes cluster, use the pod secrets for creating a Kubernetes client.
--namespace (default: kube-system): Namespace in which k8s-spot-rescheduler is run.
--kube-api-content-type (default: application/vnd.kubernetes.protobuf): Content type of requests sent to apiserver.
--housekeeping-interval (default: 10s): How often the rescheduler takes actions.
--node-drain-delay (default: 10m): How long the rescheduler should wait between draining nodes.
--pod-eviction-timeout (default: 2m): How long the rescheduler should keep trying to evict pods successfully.
--max-graceful-termination (default: 2m): How long the rescheduler should wait for pods to shut down gracefully before failing the node drain attempt.
--listen-address (default: localhost:9235): Address to listen on for serving Prometheus metrics.
--on-demand-node-label (default: node-role.kubernetes.io/worker): Name of label on nodes to be considered for draining.
--spot-node-label (default: node-role.kubernetes.io/spot-worker): Name of label on nodes to be considered as targets for pods.
--delete-non-replicated-pods (default: false): Delete non-replicated pods running on on-demand instances. Note that some non-replicated pods will not be rescheduled.
- Looks for Pods on on-demand instances
- Looks for space for Pods on spot instances
- Checks the following predicates when determining whether a pod can be moved:
  - CheckNodeMemoryPressure
  - CheckNodeDiskPressure
  - GeneralPredicates
  - MaxAzureDiskVolumeCount
  - MaxGCEPDVolumeCount
  - NoDiskConflict
  - MatchInterPodAffinity
  - PodToleratesNodeTaints
  - MaxEBSVolumeCount
  - NoVolumeZoneConflict
  - ready
- Checks whether there is enough capacity to move all pods on the on-demand node to spot nodes
- Evicts all pods on the node if the previous check passes
- Leaves the node in a schedulable state, in case its capacity is required again
- Does not schedule pods (the default scheduler handles this)
- Does not scale down empty nodes on your cloud provider (try the Cluster Autoscaler)
The rescheduler logic is roughly as follows:
- Gets a list of on-demand and spot nodes and their respective Pods
- Builds a map of nodeInfo structs
  - Add node to struct
  - Add pods for that node to struct
  - Add requested and free CPU fields to struct
  - Map these structs based on whether they are on-demand or spot instances
  - Sort on-demand instances by least requested CPU
  - Sort spot instances by most free CPU
- Iterate through each on-demand node and try to drain it
  - Iterate through each pod
    - Determine if a spot node has space for the pod
    - Add the pod to the prospective spot node
    - Move onto next node if no spot node space available
  - Drain the node
    - Iterate through pods and evict them in turn
      - Evict pod
      - Wait for deletion and reschedule
    - Cancel all further processing
This process is repeated every housekeeping-interval (10 seconds by default).
The effect of this algorithm should be that we take the emptiest nodes first and empty those before we empty a busier node, thus resulting in the highest number of 'empty' nodes that can be removed from the cluster.
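The ordering and placement steps described above can be sketched as follows. This is a simplified illustration, not the project's code: the `pod` and `nodeInfo` types, the `newNodeInfo`, `planDrain` and `firstDrainable` functions, and all node names are hypothetical, CPU is expressed in millicores, and free CPU is the only fit criterion, whereas the rescheduler also checks the scheduler predicates listed earlier.

```go
package main

import (
	"fmt"
	"sort"
)

// pod and nodeInfo are simplified stand-ins; CPU values are in millicores.
type pod struct {
	name   string
	cpuReq int64
}

type nodeInfo struct {
	name         string
	pods         []pod
	requestedCPU int64
	freeCPU      int64
}

// newNodeInfo derives requested and free CPU from the pods on a node.
func newNodeInfo(name string, allocatableCPU int64, pods []pod) *nodeInfo {
	var requested int64
	for _, p := range pods {
		requested += p.cpuReq
	}
	return &nodeInfo{name: name, pods: pods, requestedCPU: requested, freeCPU: allocatableCPU - requested}
}

// planDrain tries to place every pod of an on-demand node onto a spot node.
// It returns a pod-name to spot-node-name map, or false if any pod does not fit.
func planDrain(onDemand *nodeInfo, spot []*nodeInfo) (map[string]string, bool) {
	// Track free CPU in a copy so a failed plan leaves the inputs untouched.
	free := make([]int64, len(spot))
	for i, s := range spot {
		free[i] = s.freeCPU
	}
	plan := make(map[string]string, len(onDemand.pods))
	for _, p := range onDemand.pods {
		placed := false
		for i, s := range spot {
			if free[i] >= p.cpuReq {
				free[i] -= p.cpuReq
				plan[p.name] = s.name
				placed = true
				break
			}
		}
		if !placed {
			return nil, false
		}
	}
	return plan, true
}

// firstDrainable sorts on-demand nodes by least requested CPU and spot nodes
// by most free CPU, then returns the first on-demand node whose pods all fit.
func firstDrainable(onDemand, spot []*nodeInfo) (*nodeInfo, map[string]string) {
	sort.SliceStable(onDemand, func(i, j int) bool { return onDemand[i].requestedCPU < onDemand[j].requestedCPU })
	sort.SliceStable(spot, func(i, j int) bool { return spot[i].freeCPU > spot[j].freeCPU })
	for _, n := range onDemand {
		if plan, ok := planDrain(n, spot); ok {
			return n, plan
		}
	}
	return nil, nil
}

func main() {
	onDemand := []*nodeInfo{
		newNodeInfo("od-a", 4000, []pod{{"api", 1500}, {"worker", 1500}}),
		newNodeInfo("od-b", 4000, []pod{{"cache", 500}}),
	}
	spot := []*nodeInfo{
		newNodeInfo("spot-1", 2000, []pod{{"batch", 1000}}),
		newNodeInfo("spot-2", 4000, nil),
	}
	if n, plan := firstDrainable(onDemand, spot); n != nil {
		fmt.Println("drain", n.name, "with plan", plan)
	} else {
		fmt.Println("no on-demand node can be drained")
	}
}
```

Returning after the first drainable node reflects the "cancel all further processing" step: in this sketch, at most one node is selected per pass, and the next pass starts from fresh node information.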
- K8s Spot Termination Handler: Gracefully drain spot instances when they are issued with a termination notice.
- Found a bug? Please open an issue.
- Have a feature request? Please open an issue.
- If you want to contribute, please submit a pull request.
Please see our Contributing guidelines.
This project is licensed under Apache 2.0 and a copy of the license is available here.