horus-scheduler

Horus In-Network Scheduler

This is the repository for "Horus", an in-network task scheduler for datacenters, which is published in the following NSDI'24 paper:

P. Yassini, K. Diab, S. Zanganeh, and M. Hefeeda, Horus: Granular In-Network Task Scheduler for Cloud Datacenters, In Proc. of USENIX Symposium on Networked Systems Design and Implementation (NSDI'24), Santa Clara, CA, April 2024.

For more information, check the webpage of the Network & Multimedia Systems Lab (NMSL) at Simon Fraser University (SFU).

Abstract: Short-lived tasks are prevalent in modern interactive datacenter applications. However, designing schedulers to assign these tasks to workers distributed across the whole datacenter is challenging, because such schedulers need to make decisions at a microsecond scale, achieve high throughput, and minimize the tail response time. Current task schedulers in the literature are limited to individual racks. We present Horus, a new in-network task scheduler for short tasks that operates at the datacenter scale. Horus efficiently tracks and distributes the worker state among switches, which enables it to schedule tasks in parallel at line rate while optimizing the scheduling quality. We propose a new distributed task scheduling policy that minimizes the state and communication overheads, handles dynamic loads, and does not buffer tasks in switches. We compare Horus against the state-of-the-art in-network scheduler in a testbed with programmable switches as well as using simulations of datacenters with more than 27K hosts and thousands of switches handling diverse and dynamic workloads. Our results show that Horus efficiently scales to large datacenters, and it substantially outperforms the state-of-the-art across all performance metrics, including tail response time and throughput.


Horus has two main components: Data Plane Scheduler and Control Plane.

In Horus, network switches run the task scheduler. The schedulers are implemented in P4. This repository describes the P4 implementation and how to build and run the in-network schedulers.

The control plane of Horus consists of a centralized controller and a switch controller, both written in Go. This repository describes the implementation of the control plane and how to build and run the Horus controllers.

This repository describes how to set up the evaluation testbed described in the above paper. It also contains the scripts and detailed steps needed to run the experiments in the paper and reproduce the results.

Here is a video recording of testing Horus in a testbed with a Tofino switch: https://drive.google.com/file/d/1TIHCN1q31pry20VbpgmKeH5PHyE01xuv/view?usp=drive_link

Hardware Setup

You need at least 2 server machines and 1 client machine, all connected to a single Tofino switch that can operate as both the spine and the leaf. All machines need DPDK-compatible operating systems and NICs.

Deploying Horus

You can run Horus on your own infrastructure to reproduce the results. Deploying Horus involves these overall steps:

  • Clone the P4 implementation on the switch and build it using TNA tools
  • Clone the manager repository and build both the controller and manager executables
  • Run the compiled Horus P4 application using the manager and controller
  • Build and run server applications on all the server machines
  • Build and run the client application on client machines

Every invocation of the client application generates a couple of reports. The output of each component is described in its repository's documentation.

This repository contains the code and instructions for running large-scale simulations. It provides a discrete event simulator that simulates a datacenter with a fat-tree topology, with 27K hosts and 1K worker pools operating simultaneously.
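As a rough illustration of the simulated scale, the sketch below computes the size of a standard three-tier k-ary fat-tree. The function names and the choice of k = 48 are assumptions made for this example and are not taken from the simulator: the text above only says "27K hosts", and k = 48 is the port count for which the textbook fat-tree formula gives a host count in that range.

```python
def fat_tree_size(k: int) -> dict:
    """Return component counts for a standard three-tier k-ary fat-tree.

    Hypothetical helper for illustration; not part of the Horus simulator.
    """
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even integer")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,          # k/2 edge switches per pod
        "aggregation_switches": k * half,   # k/2 aggregation switches per pod
        "core_switches": half * half,       # (k/2)^2 core switches
        "hosts": k * half * half,           # k^3 / 4 hosts
    }


def total_switches(sizes: dict) -> int:
    """Sum the edge, aggregation, and core switch counts."""
    return (
        sizes["edge_switches"]
        + sizes["aggregation_switches"]
        + sizes["core_switches"]
    )


if __name__ == "__main__":
    # k = 48 is an assumption chosen to match the "27K hosts" figure.
    sizes = fat_tree_size(48)
    print(sizes["hosts"], total_switches(sizes))  # prints: 27648 2880
```

Under this assumption the topology has 27,648 hosts and 2,880 switches, which is in line with the "more than 27K hosts and thousands of switches" mentioned in the abstract.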


This repository contains raw data and logs collected from simulations and testbed experiments.

Pinned

  1. horus-p4 (P4)

  2. horus_controller (Go)

  3. horus-app-eval (C)

  4. horus-sim (Python): Large-scale simulations of Horus scheduler in a multi-tenant datacenter

  5. results: This repository contains the data collected during our experiments
