
[RFC] Use Completely Fair Scheduler (CFS) in NGO #223

Open
guzongmin opened this issue Mar 3, 2022 · 0 comments

  • Feature Name: Use completely fair scheduler in NGO
  • Start Date: 2022-3-3

Summary

Replace the current basic scheduler with the completely fair scheduler.
reference: Process Scheduling in Linux

Motivation

The current naive basic scheduler is workable, but its performance is poor. Two major reasons are listed below:

  1. A task stays pinned to the vCPU it was assigned to the first time. In some cases, the busy vCPU has many tasks waiting for execution while other vCPUs are idle.
  2. Tasks on one vCPU are scheduled with a FIFO policy, so short tasks spend most of their time waiting for execution.

CFS is the default scheduler in Linux, so applications are presumably tuned for its behavior. Using CFS (or a similar scheduler) in Occlum should therefore be a good choice.

Guide-level explanation

When using CFS in Occlum, all tasks should get a fair share of execution time. Short (most likely I/O-bound) tasks should see lower latency than long (CPU-bound) tasks. If multiple tasks are running, all vCPUs should get a fair share of the load.

There are two major challenges in the CFS design for Occlum:

  1. Occlum is not able to interrupt user tasks precisely, so it cannot control a task's execution time once the task has started.
  • Possible solution: Occlum records the execution time when the task returns to Occlum and tries to balance execution time in future scheduling decisions.
  2. The overhead of an interrupt in Occlum is very high, so Occlum cannot interrupt tasks frequently. As a result, the average time slice would be large.
  • In the Occlum CFS implementation, Occlum should try to manage the tasks' average waiting time.
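The record-on-return idea can be sketched as below. This is an illustrative sketch, not code from the Occlum codebase; the type and method names are hypothetical. The point is that execution time is only accounted at the moments the task re-enters the runtime, since Occlum cannot preempt user code precisely.

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: account a task's execution time only at the points
// where it returns to the runtime (syscall, yield, signal), because Occlum
// cannot interrupt user code at arbitrary moments.
struct VruntimeClock {
    vruntime: Duration,             // total accounted execution time
    running_since: Option<Instant>, // set while the task is on a vCPU
}

impl VruntimeClock {
    fn new() -> Self {
        Self { vruntime: Duration::ZERO, running_since: None }
    }

    // Called when the task is dispatched onto a vCPU.
    fn on_dispatch(&mut self) {
        self.running_since = Some(Instant::now());
    }

    // Called when the task returns to Occlum. Only here can the elapsed
    // slice be measured and added to the accumulated runtime.
    fn on_return(&mut self) {
        if let Some(start) = self.running_since.take() {
            self.vruntime += start.elapsed();
        }
    }
}

fn main() {
    let mut clock = VruntimeClock::new();
    clock.on_dispatch();
    std::thread::sleep(Duration::from_millis(10));
    clock.on_return();
    // The accounted vruntime reflects at least the time the task ran.
    assert!(clock.vruntime >= Duration::from_millis(10));
    println!("accounted vruntime: {:?}", clock.vruntime);
}
```

The accumulated `vruntime` is then what the scheduler balances in future decisions, even though individual slices cannot be bounded precisely.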

In order to add CFS to Occlum while keeping the current basic scheduler working, Occlum should separate the scheduling information from the basic task information, since the scheduling information for CFS and for the basic scheduler is different.

Reference-level explanation

  1. Occlum CFS implementation
  • For each vCPU, there is a red-black tree holding the ready tasks. The tree is indexed by runtime (accumulated execution time), with the smallest value on the left. Each time, Occlum picks the leftmost (smallest-runtime) task from the tree and executes it on the vCPU.
  • To obtain the task runtime, Occlum should use the vDSO interface or RDTSC to avoid an ocall. On SGX2 platforms, RDTSC is preferred.
  • SIG64 is used to interrupt CPU-bound tasks; the frequency of SIG64 impacts overall system performance.
  2. vCPU balancing
  • Active balancing
    Use the waiting time of the leftmost task (the task for the next round) as the waiting time of the vCPU run queue. If this value is much larger than the average, move the task to another vCPU run queue.

  • Idle balancing
    If a vCPU run queue is empty, move half of the longest run queue, or half of the run queue with the largest waiting time, to this queue.

  • New task
    A new task is enqueued to the shortest run queue, with one exception: if the task was recently dequeued from a run queue, it is enqueued back to its previous run queue.
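The per-vCPU run queue described above can be sketched with the standard library's ordered map. Linux's CFS uses a red-black tree; `BTreeMap` provides the same smallest-key-first ordering, which is what matters here. The task representation is simplified to an id; this is not the Occlum `Task` type.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

// Illustrative sketch: a per-vCPU run queue ordered by accumulated runtime.
// Linux's CFS uses a red-black tree; BTreeMap gives the same ordered
// iteration. Task ids stand in for real task handles.
struct RunQueue {
    // Key: (vruntime, task_id) — the id breaks ties between equal runtimes.
    tasks: BTreeMap<(Duration, u64), u64>,
}

impl RunQueue {
    fn new() -> Self {
        Self { tasks: BTreeMap::new() }
    }

    fn enqueue(&mut self, task_id: u64, vruntime: Duration) {
        self.tasks.insert((vruntime, task_id), task_id);
    }

    // Pick the task with the smallest accumulated runtime (leftmost node).
    fn pick_next(&mut self) -> Option<u64> {
        let key = *self.tasks.keys().next()?;
        self.tasks.remove(&key)
    }

    fn len(&self) -> usize {
        self.tasks.len()
    }
}

fn main() {
    let mut rq = RunQueue::new();
    rq.enqueue(1, Duration::from_millis(30)); // long-running task
    rq.enqueue(2, Duration::from_millis(5));  // short task
    rq.enqueue(3, Duration::from_millis(12));
    // The task with the smallest vruntime is always picked first.
    assert_eq!(rq.pick_next(), Some(2));
    assert_eq!(rq.pick_next(), Some(3));
    assert_eq!(rq.pick_next(), Some(1));
    assert_eq!(rq.len(), 0);
}
```

The balancing policies then operate on these queues: idle balancing steals from the longest queue, and the new-task policy enqueues into the shortest one.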

Design Detail

Add the four traits shown below.

  • The scheduler can define its own SchedInfo. Each Task has an interface to get its SchedInfo at runtime.
  • The TaskSet is a set of Tasks. Each vCPU has its own TaskSet, which the scheduler can use as a run queue.
  • The Balancer implements the three balancing policies documented in the Linux scheduling guide.
  • The Scheduler interface is used by the executor.
// Types such as Task, Affinity, SchedPriority and Result come from the
// Occlum crate; the std and async_trait imports are shown for completeness.
use std::sync::{Arc, RwLock};
use std::time::Duration;
use async_trait::async_trait;

pub trait SchedInfo: Send + Sync {
    fn affinity(&self) -> &RwLock<Affinity>;
    fn priority(&self) -> SchedPriority;
    fn set_priority(&self, priority: SchedPriority);
    fn last_thread_id(&self) -> u32;
    fn set_last_thread_id(&self, id: u32);
    fn enqueue_epochs(&self) -> Duration;
    fn set_enqueue_epochs(&self, data: Duration);
    fn dequeue_epochs(&self) -> Duration;
    fn set_dequeue_epochs(&self, data: Duration);
}

pub(crate) trait TaskSet: Send + Sync {
    fn add_task(&self, task: Arc<Task>) -> Result<()>;
    fn first_task(&self) -> Option<Arc<Task>>;
    fn tasks_with(&self, f: Box<dyn Fn(Arc<Task>) -> bool>, capacity: usize) -> Vec<Arc<Task>>;
    fn len(&self) -> usize;
    fn latency(&self) -> Option<Duration>;
    fn index(&self) -> usize;
}

pub(crate) trait Balancer {
    fn idle_balance(&self, idle_queue: &dyn TaskSet);
    fn enqueue_balance(&self, task: Arc<Task>) -> Option<usize>;
    fn active_balance(&self);
}

#[async_trait]
pub trait Scheduler: Send + Sync {
    fn enqueue_task(&self, task: Arc<Task>);
    fn dequeue_task(&self, thread_id: usize) -> Option<Arc<Task>>;
    async fn balance(&self, task: Arc<Task>);
    fn sched_info(&self, priority: SchedPriority) -> Arc<dyn SchedInfo>;
}
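To illustrate the separation these traits aim for, here is a much-simplified stand-in (the trait methods and type names below are reduced for illustration and are not the RFC's actual signatures): the task only carries its scheduler-specific state as a trait object, so the basic scheduler and CFS can attach different data to the same Task type.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Simplified stand-in for the RFC's SchedInfo trait: only CFS-relevant
// state is shown. The real trait carries affinity, priority, epochs, etc.
trait SchedInfo: Send + Sync {
    fn vruntime(&self) -> Duration;
    fn add_runtime(&self, delta: Duration);
}

// CFS-specific scheduling state, interior-mutable so the shared
// reference held by the task can be updated.
struct CfsSchedInfo {
    vruntime: Mutex<Duration>,
}

impl SchedInfo for CfsSchedInfo {
    fn vruntime(&self) -> Duration {
        *self.vruntime.lock().unwrap()
    }
    fn add_runtime(&self, delta: Duration) {
        *self.vruntime.lock().unwrap() += delta;
    }
}

// The task exposes its sched info only through the trait object,
// decoupling the executor's Task type from any particular scheduler.
struct Task {
    id: u64,
    sched_info: Arc<dyn SchedInfo>,
}

fn main() {
    let task = Task {
        id: 1,
        sched_info: Arc::new(CfsSchedInfo { vruntime: Mutex::new(Duration::ZERO) }),
    };
    task.sched_info.add_runtime(Duration::from_millis(7));
    assert_eq!(task.sched_info.vruntime(), Duration::from_millis(7));
    println!("task {} vruntime: {:?}", task.id, task.sched_info.vruntime());
}
```

Swapping in a basic-scheduler SchedInfo implementation would require no change to the Task type itself.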

This is a reference implementation; it does not change the current executor implementation.
(architecture diagram)

Alternatively, these traits could be used as type parameters in the executor type definition.
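The generic alternative might look like the following sketch. The trait is reduced to two methods and the `FifoScheduler` stands in for the basic scheduler; none of these names come from the Occlum codebase. The trade-off it illustrates: scheduler calls are statically dispatched and monomorphized, instead of going through `dyn` trait objects.

```rust
use std::collections::VecDeque;

// Illustrative sketch of the generic alternative: the executor is
// parameterized over the Scheduler implementation at compile time.
trait Scheduler {
    fn enqueue(&mut self, task_id: u64);
    fn pick_next(&mut self) -> Option<u64>;
}

struct Executor<S: Scheduler> {
    scheduler: S,
}

impl<S: Scheduler> Executor<S> {
    fn new(scheduler: S) -> Self {
        Self { scheduler }
    }

    // Drain the run queue once, in scheduler order.
    fn run_once(&mut self) -> Vec<u64> {
        let mut executed = Vec::new();
        while let Some(task_id) = self.scheduler.pick_next() {
            executed.push(task_id);
        }
        executed
    }
}

// A trivial FIFO scheduler standing in for the current basic scheduler.
struct FifoScheduler {
    queue: VecDeque<u64>,
}

impl Scheduler for FifoScheduler {
    fn enqueue(&mut self, task_id: u64) {
        self.queue.push_back(task_id);
    }
    fn pick_next(&mut self) -> Option<u64> {
        self.queue.pop_front()
    }
}

fn main() {
    let mut executor = Executor::new(FifoScheduler { queue: VecDeque::new() });
    executor.scheduler.enqueue(1);
    executor.scheduler.enqueue(2);
    // FIFO order is preserved: tasks run in enqueue order.
    assert_eq!(executor.run_once(), vec![1, 2]);
}
```

A CFS implementation of the same trait could then be selected at compile time without any dynamic dispatch in the hot path.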

Drawbacks

The new scheduler does not always improve Occlum's performance. If the number of processes/threads is small, CFS performance is most likely the same as the basic scheduler's.

Rationale and alternatives

There are many scheduler implementations. CFS is the Linux default and should be the best choice for general-purpose workloads.
