
[RFC] Use Completely Fair Scheduler (CFS) in NGO #223

Open
guzongmin opened this issue Mar 3, 2022 · 0 comments

  • Feature Name: Use completely fair scheduler in NGO
  • Start Date: 2022-3-3

Summary

Replace the current basic scheduler with the completely fair scheduler.
reference: Process Scheduling in Linux

Motivation

The current naive basic scheduler is workable, but its performance is poor. Two major reasons are listed below:

  1. A task stays pinned to the vCPU it was assigned to the first time. In some cases, the busy vCPU has many tasks waiting for execution while other vCPUs are idle.
  2. Tasks on one vCPU are scheduled with a FIFO policy, so short tasks spend most of their time waiting for execution.

CFS is the default scheduler in Linux, so applications are presumably tuned for its behavior. Using CFS (or a similar scheduler) in Occlum should therefore be a good choice.

Guide-level explanation

When using CFS in Occlum, all tasks should get a fair share of execution time. Short (most likely I/O-bound) tasks should see lower latency than long (CPU-bound) tasks. If multiple tasks are running, all vCPUs should get a fair share of the load.

There are two major challenges in the CFS design for Occlum:

  1. Occlum is not able to interrupt user tasks precisely, so it cannot control a task's execution time once the task has started.
  • Possible solution: Occlum records the execution time when the task returns to Occlum and tries to balance execution time in future scheduling decisions.
  2. The overhead of an interrupt in Occlum is very high, so Occlum cannot interrupt tasks frequently. As a result, the average time slice would be large.
  • In the Occlum CFS implementation, Occlum should try to manage the tasks' average waiting time.
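The record-on-return idea can be sketched as below. This is an illustrative sketch, not code from the Occlum codebase; the type and method names are hypothetical. The point is that execution time is only accounted at the moments the task re-enters the runtime, since Occlum cannot preempt user code precisely.

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: account a task's execution time only at the points
// where it returns to the runtime (syscall, yield, signal), because Occlum
// cannot interrupt user code at arbitrary moments.
struct VruntimeClock {
    vruntime: Duration,             // total accounted execution time
    running_since: Option<Instant>, // set while the task is on a vCPU
}

impl VruntimeClock {
    fn new() -> Self {
        Self { vruntime: Duration::ZERO, running_since: None }
    }

    // Called when the task is dispatched onto a vCPU.
    fn on_dispatch(&mut self) {
        self.running_since = Some(Instant::now());
    }

    // Called when the task returns to Occlum. Only here can the elapsed
    // slice be measured and added to the accumulated runtime.
    fn on_return(&mut self) {
        if let Some(start) = self.running_since.take() {
            self.vruntime += start.elapsed();
        }
    }
}

fn main() {
    let mut clock = VruntimeClock::new();
    clock.on_dispatch();
    std::thread::sleep(Duration::from_millis(10));
    clock.on_return();
    // The accounted vruntime reflects at least the time the task ran.
    assert!(clock.vruntime >= Duration::from_millis(10));
    println!("accounted vruntime: {:?}", clock.vruntime);
}
```

The accumulated `vruntime` is then what the scheduler balances in future decisions, even though individual slices cannot be bounded precisely.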

In order to add CFS to Occlum while keeping the current basic scheduler working, Occlum should separate the scheduling information from the basic task information, since the scheduling information for CFS and for the basic scheduler is different.

Reference-level explanation

  1. Occlum CFS implementation
  • For each vCPU, there is a red-black tree holding the ready tasks. The tree is indexed by runtime (accumulated execution time), with the smallest value on the left. Each time, Occlum picks the leftmost (smallest-runtime) task from the tree and executes it on the vCPU.
  • To obtain the task runtime, Occlum should use the vDSO interface or RDTSC to avoid an ocall. On SGX2 platforms, RDTSC is preferred.
  • SIG64 is used to interrupt CPU-bound tasks; the frequency of SIG64 impacts overall system performance.
  2. vCPU balancing
  • Active balancing
    Use the waiting time of the leftmost task (the task for the next round) as the waiting time of the vCPU run queue. If this value is much larger than the average, move the task to another vCPU run queue.

  • Idle balancing
    If a vCPU run queue is empty, move half of the longest run queue, or half of the run queue with the largest waiting time, to this queue.

  • New task
    A new task is enqueued to the shortest run queue, with one exception: if the task was recently dequeued from a run queue, it is enqueued back to its previous run queue.
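The per-vCPU run queue described above can be sketched with the standard library's ordered map. Linux's CFS uses a red-black tree; `BTreeMap` provides the same smallest-key-first ordering, which is what matters here. The task representation is simplified to an id; this is not the Occlum `Task` type.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

// Illustrative sketch: a per-vCPU run queue ordered by accumulated runtime.
// Linux's CFS uses a red-black tree; BTreeMap gives the same ordered
// iteration. Task ids stand in for real task handles.
struct RunQueue {
    // Key: (vruntime, task_id) — the id breaks ties between equal runtimes.
    tasks: BTreeMap<(Duration, u64), u64>,
}

impl RunQueue {
    fn new() -> Self {
        Self { tasks: BTreeMap::new() }
    }

    fn enqueue(&mut self, task_id: u64, vruntime: Duration) {
        self.tasks.insert((vruntime, task_id), task_id);
    }

    // Pick the task with the smallest accumulated runtime (leftmost node).
    fn pick_next(&mut self) -> Option<u64> {
        let key = *self.tasks.keys().next()?;
        self.tasks.remove(&key)
    }

    fn len(&self) -> usize {
        self.tasks.len()
    }
}

fn main() {
    let mut rq = RunQueue::new();
    rq.enqueue(1, Duration::from_millis(30)); // long-running task
    rq.enqueue(2, Duration::from_millis(5));  // short task
    rq.enqueue(3, Duration::from_millis(12));
    // The task with the smallest vruntime is always picked first.
    assert_eq!(rq.pick_next(), Some(2));
    assert_eq!(rq.pick_next(), Some(3));
    assert_eq!(rq.pick_next(), Some(1));
    assert_eq!(rq.len(), 0);
}
```

The balancing policies then operate on these queues: idle balancing steals from the longest queue, and the new-task policy enqueues into the shortest one.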

Design Detail

Add the four traits shown below.

  • The scheduler can define its own SchedInfo. Each Task has an interface to get its SchedInfo at runtime.
  • The TaskSet is a set of Tasks. Each vCPU has its own TaskSet, which the scheduler can use as a run queue.
  • The Balancer implements the three balancing policies documented in the Linux scheduling guide.
  • The Scheduler interface is used by the executor.
// Types such as Task, Affinity, SchedPriority and Result come from the
// Occlum crate; the std and async_trait imports are shown for completeness.
use std::sync::{Arc, RwLock};
use std::time::Duration;
use async_trait::async_trait;

pub trait SchedInfo: Send + Sync {
    fn affinity(&self) -> &RwLock<Affinity>;
    fn priority(&self) -> SchedPriority;
    fn set_priority(&self, priority: SchedPriority);
    fn last_thread_id(&self) -> u32;
    fn set_last_thread_id(&self, id: u32);
    fn enqueue_epochs(&self) -> Duration;
    fn set_enqueue_epochs(&self, data: Duration);
    fn dequeue_epochs(&self) -> Duration;
    fn set_dequeue_epochs(&self, data: Duration);
}

pub(crate) trait TaskSet: Send + Sync {
    fn add_task(&self, task: Arc<Task>) -> Result<()>;
    fn first_task(&self) -> Option<Arc<Task>>;
    fn tasks_with(&self, f: Box<dyn Fn(Arc<Task>) -> bool>, capacity: usize) -> Vec<Arc<Task>>;
    fn len(&self) -> usize;
    fn latency(&self) -> Option<Duration>;
    fn index(&self) -> usize;
}

pub(crate) trait Balancer {
    fn idle_balance(&self, idle_queue: &dyn TaskSet);
    fn enqueue_balance(&self, task: Arc<Task>) -> Option<usize>;
    fn active_balance(&self);
}

#[async_trait]
pub trait Scheduler: Send + Sync {
    fn enqueue_task(&self, task: Arc<Task>);
    fn dequeue_task(&self, thread_id: usize) -> Option<Arc<Task>>;
    async fn balance(&self, task: Arc<Task>);
    fn sched_info(&self, priority: SchedPriority) -> Arc<dyn SchedInfo>;
}
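To illustrate the separation these traits aim for, here is a much-simplified stand-in (the trait methods and type names below are reduced for illustration and are not the RFC's actual signatures): the task only carries its scheduler-specific state as a trait object, so the basic scheduler and CFS can attach different data to the same Task type.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Simplified stand-in for the RFC's SchedInfo trait: only CFS-relevant
// state is shown. The real trait carries affinity, priority, epochs, etc.
trait SchedInfo: Send + Sync {
    fn vruntime(&self) -> Duration;
    fn add_runtime(&self, delta: Duration);
}

// CFS-specific scheduling state, interior-mutable so the shared
// reference held by the task can be updated.
struct CfsSchedInfo {
    vruntime: Mutex<Duration>,
}

impl SchedInfo for CfsSchedInfo {
    fn vruntime(&self) -> Duration {
        *self.vruntime.lock().unwrap()
    }
    fn add_runtime(&self, delta: Duration) {
        *self.vruntime.lock().unwrap() += delta;
    }
}

// The task exposes its sched info only through the trait object,
// decoupling the executor's Task type from any particular scheduler.
struct Task {
    id: u64,
    sched_info: Arc<dyn SchedInfo>,
}

fn main() {
    let task = Task {
        id: 1,
        sched_info: Arc::new(CfsSchedInfo { vruntime: Mutex::new(Duration::ZERO) }),
    };
    task.sched_info.add_runtime(Duration::from_millis(7));
    assert_eq!(task.sched_info.vruntime(), Duration::from_millis(7));
    println!("task {} vruntime: {:?}", task.id, task.sched_info.vruntime());
}
```

Swapping in a basic-scheduler SchedInfo implementation would require no change to the Task type itself.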

This is a reference implementation; it does not change the current executor implementation.
(architecture diagram)

Alternatively, these traits could be used as type parameters in the executor type definition.
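The generic alternative might look like the following sketch. The trait is reduced to two methods and the `FifoScheduler` stands in for the basic scheduler; none of these names come from the Occlum codebase. The trade-off it illustrates: scheduler calls are statically dispatched and monomorphized, instead of going through `dyn` trait objects.

```rust
use std::collections::VecDeque;

// Illustrative sketch of the generic alternative: the executor is
// parameterized over the Scheduler implementation at compile time.
trait Scheduler {
    fn enqueue(&mut self, task_id: u64);
    fn pick_next(&mut self) -> Option<u64>;
}

struct Executor<S: Scheduler> {
    scheduler: S,
}

impl<S: Scheduler> Executor<S> {
    fn new(scheduler: S) -> Self {
        Self { scheduler }
    }

    // Drain the run queue once, in scheduler order.
    fn run_once(&mut self) -> Vec<u64> {
        let mut executed = Vec::new();
        while let Some(task_id) = self.scheduler.pick_next() {
            executed.push(task_id);
        }
        executed
    }
}

// A trivial FIFO scheduler standing in for the current basic scheduler.
struct FifoScheduler {
    queue: VecDeque<u64>,
}

impl Scheduler for FifoScheduler {
    fn enqueue(&mut self, task_id: u64) {
        self.queue.push_back(task_id);
    }
    fn pick_next(&mut self) -> Option<u64> {
        self.queue.pop_front()
    }
}

fn main() {
    let mut executor = Executor::new(FifoScheduler { queue: VecDeque::new() });
    executor.scheduler.enqueue(1);
    executor.scheduler.enqueue(2);
    // FIFO order is preserved: tasks run in enqueue order.
    assert_eq!(executor.run_once(), vec![1, 2]);
}
```

A CFS implementation of the same trait could then be selected at compile time without any dynamic dispatch in the hot path.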

Drawbacks

The new scheduler does not always improve Occlum's performance. If the number of processes/threads is small, CFS performance is most likely the same as the basic scheduler's.

Rationale and alternatives

There are many scheduler implementations. CFS is the Linux default and should be the best choice for general-purpose workloads.
