The Poseidon/Firmament scheduler incubation project brings the integration of the Firmament scheduler (from the OSDI paper) into Kubernetes. At a very high level, Poseidon/Firmament augments the current Kubernetes scheduling capabilities by adding a novel flow-network-graph-based scheduler alongside the default Kubernetes scheduler. Firmament models workloads on a cluster as flow networks and runs min-cost flow optimizations over these networks to make scheduling decisions.
Thanks to its inherent rescheduling capability, the new scheduler enables globally optimal scheduling for a given policy, continuously refining the placement of workloads as cluster conditions change.
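To illustrate the idea only (this is not Firmament's actual algorithm or cost model), the sketch below brute-forces the cheapest one-pod-per-node assignment over a made-up cost matrix; a real flow scheduler obtains the same result by running min-cost max-flow over a pods → nodes → sink graph:

```python
from itertools import permutations

# Toy cost matrix: costs[pod][node] is the cost of placing that pod on
# that node (e.g. derived from CPU/memory fit). Values are illustrative.
costs = {
    "pod-a": {"node-1": 2, "node-2": 5},
    "pod-b": {"node-1": 4, "node-2": 1},
}

def best_placement(costs):
    """Enumerate every one-pod-per-node assignment and return the cheapest.

    A flow scheduler solves the same optimization with a min-cost flow
    algorithm; brute force is used here only to keep the example
    self-contained.
    """
    pods = list(costs)
    nodes = list(next(iter(costs.values())))
    best = None
    for perm in permutations(nodes, len(pods)):
        total = sum(costs[p][n] for p, n in zip(pods, perm))
        if best is None or total < best[0]:
            best = (total, dict(zip(pods, perm)))
    return best

total_cost, placement = best_placement(costs)
# pod-a -> node-1 (cost 2) and pod-b -> node-2 (cost 1) is optimal.
```

Because the optimizer sees all pods and nodes at once, it can also *move* existing placements when a cheaper global solution appears, which is the rescheduling behavior described above.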
As part of Kubernetes' support for multiple schedulers, each new pod is typically scheduled by the default scheduler, but Kubernetes can be instructed to use another scheduler by specifying the name of a custom scheduler (Poseidon in our case) at the time of pod deployment. In that case, the default scheduler ignores the Pod and lets the Poseidon scheduler place it on a suitable node. Poseidon plugs into Kubernetes as an add-on scheduler: setting 'schedulerName' to Poseidon in the pod template bypasses the default scheduler.
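For example, a pod can opt in to the Poseidon scheduler via its spec (the scheduler name `poseidon` is assumed here; use whatever name your Poseidon deployment registers):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  schedulerName: poseidon   # bypasses the default scheduler
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
```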
Flow graph scheduling provides the following:
- Support for high-volume workloads placement.
- Complex rule constraints.
- Globally optimal scheduling for a given policy.
- Extremely high scalability.
NOTE: Firmament scales much better than the default scheduler as the number of nodes in a cluster increases.
Current Project Stage
Poseidon/Firmament Integration architecture
For more details about the design of this project, see the design document.
For in-cluster installation of Poseidon, please start here.
For development, please refer here.
For details of the coordinated release process between the Firmament & Poseidon repos, refer here.
Release 0.1 – Currently Available:
- Baseline Poseidon/Firmament scheduling capabilities using the new multi-dimensional CPU/Memory cost model are part of this release. This does not yet include node-level or pod-level affinity/anti-affinity; as the roadmap below shows, these are being built out in the upcoming releases.
- All test-infra BOT automation jobs are in place as part of this release.
Release 0.2 – Target Date: 25th May 2018:
- Node level Affinity and Anti-Affinity implementation.
Release 0.3 – Target Date: 15th June 2018:
- Pod level Affinity and Anti-Affinity implementation using multi-round scheduling based affinity and anti-affinity.
Release 0.4 – Tentative Target Date: 15th August 2018:
- Taints & Tolerations.
- Support for Pod anti-affinity symmetry.
- Throughput Performance Optimizations.
Release 0.5 onwards:
- Support for Max. Pods per Node.
- Co-Existence with Default Scheduler.
- Optimizations for reducing the number of arcs by limiting the number of eligible nodes in a cluster.
- CPU/Mem combination optimizations.
- Transitioning to the Metrics Server API – upstreaming our new Heapster sink is no longer an option, as Heapster is being deprecated.
- Continuous running scheduling loop versus scheduling intervals mechanism.
- Provide High Availability/Failover for in-memory Firmament/Poseidon processes.
- Gang Scheduling.
- Priority Pre-emption support.
- Resource Utilization benchmark.