runtime: investigate possible Go scheduler improvements inspired by Linux Kernel's CFS #51071
Comments
There are no API changes here, so this isn't really a proposal. Retitling to reflect the intent as mentioned in Option 1 above.
CC @aclements @mknyszek @prattmic. Note that any changes to the Go scheduler are complicated by the fact that goroutines run on top of threads that are themselves managed by the Linux scheduler.
Reworking the scheduler is something we've thought about on and off for quite a while. I haven't taken a close look at your demo scheduler yet, but it looks nifty.

In addition to the issues you describe with IO-intensive tasks, the overall CPU overhead of the scheduler is something that comes up frequently as problematic for applications that happen to stress corners of the scheduler, so that is another point to consider.

Any major change we make to the scheduler will likely need to be paired with an investment in better testing of the scheduler. The current scheduler has evolved to become quite complex, but has few explicit, deterministic tests, due to the difficulty of testing a scheduler deep in the runtime; instead it relies on larger Go programs to reveal problems. This makes it difficult to have high confidence in major changes, so investing in better testing is definitely something I'd want to see (whether or not we make large changes to the scheduler, but definitely if we do).
Thanks a lot, @prattmic.
Yes, I agree with you. I investigated the performance of a priority heap implemented in Go; such cost might be affordable. I think maybe we should decouple the specific scheduler implementation from the other parts of the runtime, the way the Linux kernel does:

```c
// definition of `sched_class` in kernel
struct sched_class {
#ifdef CONFIG_UCLAMP_TASK
	int uclamp_enabled;
#endif

	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*yield_task)   (struct rq *rq);
	bool (*yield_to_task)(struct rq *rq, struct task_struct *p);

	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

	struct task_struct *(*pick_next_task)(struct rq *rq);

	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

#ifdef CONFIG_SMP
	int  (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
	struct task_struct * (*pick_task)(struct rq *rq);
	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

	void (*task_woken)(struct rq *this_rq, struct task_struct *task);

	void (*set_cpus_allowed)(struct task_struct *p,
				 const struct cpumask *newmask,
				 u32 flags);

	void (*rq_online)(struct rq *rq);
	void (*rq_offline)(struct rq *rq);

	struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);
#endif

	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
	void (*task_fork)(struct task_struct *p);
	void (*task_dead)(struct task_struct *p);

	/*
	 * The switched_from() call is allowed to drop rq->lock, therefore we
	 * cannot assume the switched_from/switched_to pair is serialized by
	 * rq->lock. They are however serialized by p->pi_lock.
	 */
	void (*switched_from)(struct rq *this_rq, struct task_struct *task);
	void (*switched_to)  (struct rq *this_rq, struct task_struct *task);
	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
			      int oldprio);

	unsigned int (*get_rr_interval)(struct rq *rq,
					struct task_struct *task);

	void (*update_curr)(struct rq *rq);

#define TASK_SET_GROUP	0
#define TASK_MOVE_GROUP	1

#ifdef CONFIG_FAIR_GROUP_SCHED
	void (*task_change_group)(struct task_struct *p, int type);
#endif
};
```

This is not only convenient for testing, but also convenient for users who want to plug in their own scheduler implementation.
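To make that concrete, here is what a Go analogue of `sched_class` might look like. This is a purely hypothetical sketch (none of these types or hooks exist in the runtime today; `G`, `P`, and `SchedClass` are made up for illustration), just to show how a scheduler implementation could be decoupled from the rest of the runtime:

```go
package sched

// Hypothetical stand-ins for the runtime's goroutine and processor structures.
type G struct{ /* goroutine state */ }
type P struct{ /* per-processor run-queue state */ }

// SchedClass mirrors the shape of the kernel's sched_class: the runtime
// would call these hooks, and a scheduler implementation would satisfy them.
type SchedClass interface {
	EnqueueG(p *P, g *G) // make g runnable on p
	DequeueG(p *P, g *G) // remove g from p's run queue
	PickNextG(p *P) *G   // choose the next goroutine to run, or nil
	PutPrevG(p *P, g *G) // g was just descheduled from p
	Tick(p *P, g *G)     // periodic tick, drives preemption decisions
	Balance(p *P) *G     // steal work for an idle p, or return nil
}
```

With an interface like this, a deterministic test could drive a scheduler implementation directly, without booting the whole runtime.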
Thank you for working on this. I know nothing of the Go scheduler, but a few questions came to mind when reading your proposal:
Maybe MLFQ could be of use? Goroutines at the same priority level would run round-robin, perhaps not too different from the current approach. Demotions would occur once a time allotment is used up, and periodic priority boosts could prevent starvation of longer-running tasks.
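For what it's worth, here is a minimal sketch of that MLFQ idea in Go, using a hypothetical `task` type (nothing here corresponds to real runtime internals): same-level tasks run round-robin, a task that exhausts its allotment is demoted, and a periodic boost returns everything to the top level:

```go
package mlfq

import "container/list"

const levels = 3

// task is a hypothetical stand-in for a schedulable goroutine.
type task struct {
	allotment int // remaining time slices at the current level
}

// MLFQ holds one FIFO run queue per priority level (0 = highest).
type MLFQ struct {
	queues [levels]list.List
}

// Enqueue adds t at the given level with a fresh allotment.
func (m *MLFQ) Enqueue(t *task, level int) {
	t.allotment = 10 // e.g. 10 slices per level
	m.queues[level].PushBack(t)
}

// Pick returns the next task, round-robin from the highest non-empty level.
func (m *MLFQ) Pick() (*task, int) {
	for lvl := 0; lvl < levels; lvl++ {
		if e := m.queues[lvl].Front(); e != nil {
			m.queues[lvl].Remove(e)
			return e.Value.(*task), lvl
		}
	}
	return nil, -1
}

// Tick charges one slice; a task that uses up its allotment is demoted.
func (m *MLFQ) Tick(t *task, lvl int) {
	t.allotment--
	if t.allotment == 0 && lvl < levels-1 {
		m.Enqueue(t, lvl+1) // demote one level
		return
	}
	m.queues[lvl].PushBack(t) // stay at this level, go to the back
}

// Boost periodically moves every task back to the top level to prevent
// starvation of long-running tasks.
func (m *MLFQ) Boost() {
	for lvl := 1; lvl < levels; lvl++ {
		for e := m.queues[lvl].Front(); e != nil; e = m.queues[lvl].Front() {
			m.queues[lvl].Remove(e)
			m.Enqueue(e.Value.(*task), 0)
		}
	}
}
```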
I am not aware of how the interactions between the two schedulers (Go vs OS) are being tested now, but making the two behave similarly might cause drastic behavior where they work against each other (TCP-over-TCP meltdown comes to mind). Which means this would be a slow process. Meanwhile, having a plug-and-play scheduler implementation would make the burden of switching, or of making large changes to the scheduler, a separate concern, with the added benefit of easier testing. Maybe the community will surprise us all with a whole bunch of scheduling ideas. My 2¢
@adyanth While your concern is fair, I believe this may not be the case eventually. Kernel threads are oblivious to the Go runtime and to what kind of code a thread is running (ignoring I/O wait and CPU fairness). A new runtime scheduler is, in turn, a pure userspace construct; how it manages its own goroutines should not compete with the kernel scheduler, which operates on the other side of the system boundary.
It's very kind of you, @tommie :)
I'm afraid that would not make a big difference. Round-Robin is one kind of scheduling strategy, and so is CFS. The key difference between them is that CFS does adaptive priority adjustment, trying to achieve the goal of completely fair scheduling.
Do you mean something like that? I think the API design of the kernel's scheduler-related syscalls is also worth learning from:

```c
setpriority         // set the "literal" priority; the scheduler can then adaptively adjust the real priority based on it
sched_setscheduler  // set the scheduling strategy: RR or CFS?
sched_setparam      // tuning
```
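As a thought experiment, goroutine-level analogues of those three syscalls might look like the sketch below. These functions are entirely hypothetical (no such runtime hooks exist); the sketch only shows the shape such an API could take:

```go
package sched

// Policy selects a hypothetical per-goroutine scheduling strategy.
type Policy int

const (
	PolicyRR  Policy = iota // round-robin, roughly today's behavior
	PolicyCFS               // weighted fair scheduling
)

// SetPriority sets the "literal" nice value of the calling goroutine; the
// scheduler could adaptively adjust the real priority around it.
func SetPriority(nice int) { /* hypothetical runtime hook */ }

// SetScheduler picks the scheduling strategy for the calling goroutine.
func SetScheduler(p Policy) { /* hypothetical runtime hook */ }

// SetParam tunes strategy-specific parameters, e.g. a time-slice hint.
func SetParam(sliceHint int) { /* hypothetical runtime hook */ }
```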
To be honest, I thought about this problem myself before, and my conclusion is that such a problem is unlikely to happen in a real production environment; and if it does happen, it means the system is already far too overloaded. Please take a look at the benchmark below:

```go
package echo

import "testing"

// BenchmarkIntChan measures a plain send to a goroutine that drains the channel.
func BenchmarkIntChan(b *testing.B) {
	ch := make(chan int)
	go func() {
		for {
			<-ch
		}
	}()
	for i := 0; i < b.N; i++ {
		ch <- 1
	}
}

// BenchmarkIntChanX spawns a short-lived goroutine per operation and waits
// for it, modeling the "one goroutine per task" pattern.
func BenchmarkIntChanX(b *testing.B) {
	ch := make(chan int)
	go func() {
		for {
			<-ch
		}
	}()
	for i := 0; i < b.N; i++ {
		myCh := make(chan int)
		go func() {
			ch <- 1
			myCh <- 1
		}()
		<-myCh
	}
}
```

Which yields:

```
$ # Rocky Linux 8.5 & AMD 5950X
$ GOMAXPROCS=1 go test -bench=.
goos: linux
goarch: amd64
pkg: echo
cpu: AMD EPYC-Milan Processor
BenchmarkIntChan     8433139    139.5 ns/op
BenchmarkIntChanX    2393584    495.8 ns/op
```

Let's assume it takes about 1µs of on-CPU time to run one of those short-lived goroutines, and that we have 32 logical cores; then we could sustain roughly 32 cores / 1µs ≈ 32,000,000 op/sec, which is a very heavy load. If such applications really exist, it means either the application's design is not very good, or other components (kernel, network, disk, ...) will hit their bottlenecks before Go does. This benchmark also shows the excellent performance of goroutines and channels.
@tommie, and furthermore, there are strategies in both CFS and MLFQ (mentioned by @tfaughnan above) to prevent this from happening.
Very good point, @tfaughnan. My only concern is that a mature MLFQ may not be much simpler than, and may even be more complex than, CFS, so I think an interface like the `sched_class` above would still be worth having.
Hi @adyanth, I agree with @singhpradeep's reply above. The Go runtime scheduler runs on top of the kernel scheduler, so they are not in a competitive relationship. Furthermore, they could "work together".
I wrote a similar package to play with different scheduling algos: https://github.com/tommie/go-coopsched
Assuming the code works, and that |
I'd like to register my partial skepticism about this change. As a goal, it might be a good thing; I am not equipped to evaluate whether adopting the CFS scheduling algorithm in the goroutine clockwork is a good idea moving forward or not.

What I do want to point out is that Go has a surprisingly small number of adjustment knobs considering the complexity of the runtime (we do see a lot of knobs tweaking compilation or parts of the standard library's behavior, though). I might be forgetting something here, but to the best of my recollection we have just GOGC, GOMAXPROCS, GORACE, and GOTRACEBACK, and these are all scalar values. That leads to a significantly easier experience from an operations standpoint. Therefore, adopting a custom goroutine scheduler interface, or adding "just another switch" for an alternative operation mode, might go against the grain of Go's long-term stance of having as few knobs as possible.

In other words: CFS scheduler, good. More knobs, I am not so sure.
Agreed about Go's lack of knobs being a great feature. Here's a self-contained cpuworker/example/demo.go. The percentile printout isn't as nice. I wonder if this could go as an add-on to the existing scheduler.
If the internal workings of the Go scheduler are to be reconsidered, then an obvious question arises: should any new scheduler take account of the heterogeneous nature of modern CPUs? Arm has had big.LITTLE for over ten years, Apple's M1 processors have Firestorm and Icestorm cores, and Intel's Alder Lake CPUs have Golden Cove high-performance cores and Gracemont power-efficient cores. I too appreciate the lack of "knobs" in the Go runtime, but the asymmetric nature of current CPUs may require an extra knob or two to schedule goroutines as efficiently as possible.
Naive question: would this interact at all with package sync's fairness algorithms?
I think these two behaviors of the Go scheduler must be fixed in any new version (if there is one):

Neither suggestion changes Go's observable behavior, and existing Go apps would work out of the box; they only change the way advanced developers write frameworks, letting them build software in Go in a better way.
See also #33803
First, please let me pay tribute to your contributions. You guys are awesome! And Go is so marvelous!
It has been more than ten years, and Go has already been very successful. So I think it is time to give Go a better scheduler, perhaps one like the CFS scheduler of the Linux kernel.

The scheduling strategy of the current Go runtime scheduler is basically Round-Robin. But in practical application scenarios, priority relationships may need to exist between different types of tasks, or even between tasks of the same type. For example, if we use Go to implement a storage engine, I/O tasks should generally have higher priority than other CPU-intensive tasks, because in this case the system bottleneck is more likely to be disk I/O, not CPU. For another example, we may use Go to implement a network service which must keep the latency of certain lightweight tasks low even while other goroutines are quite CPU-intensive. In the current Go runtime, however, if a certain number of CPU-intensive tasks need to run for a long time, they will inevitably hurt scheduling latency. Therefore, we need a truly mature Go scheduler, like the CFS scheduler in the kernel.

From another perspective, we can compare the thread model provided by the kernel with the goroutine model provided by Go. Apart from the inherent low cost and high concurrency of goroutines compared with threads, some very useful mechanisms of the kernel thread model have no counterpart in Go's goroutine model, at least not yet: for example, dynamic modification of the scheduling policy, and a priority mechanism including adaptive priority adjustment.

The scheduler in the initial versions of the Linux kernel, e.g. v0.01, was quite primitive. But as the kernel developed and more and more applications placed ever higher demands on the scheduler, it kept evolving into today's CFS scheduler. The same should apply to Go: a truly mature scheduler will make Go even greater.

I have already developed a customized scheduler on top of the Go runtime as a demo, which shows that this proposal is feasible and that such a scheduler would benefit many applications with high latency and throughput requirements.
Terminology

- Event-intensive task: most of the time in its life cycle is spent waiting for events (call it `off CPU`), and only a very small part is spent doing CPU computation.
- CPU-intensive task: most of, or even all of, its life cycle is spent doing CPU computation (call it `on CPU`), and it spends only a very small part of its time, or none at all, waiting for events.

Description
Basically, the idea of CFS scheduling is to give every thread a logical clock which records the thread's on-CPU time. The thread's priority setting determines how fast its logical clock ticks. The CFS scheduler prefers to pick the thread whose logical clock is furthest behind, because it considers it quite unfair to let that thread fall even further behind; after all, the name CFS stands for `Completely Fair Scheduler`. So, if a thread is event-intensive, it ends up with a higher de facto priority, and if a thread is CPU-intensive, it ends up with a lower one. We could call this adaptive priority adjustment. It is an important feature which ensures that the scheduling latency of event-intensive threads stays low even when the system is under high CPU load caused by CPU-intensive threads.

Although this demo only implements part of CFS scheduling, the result is quite promising.
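To make the logical-clock idea concrete, here is a minimal sketch in Go (with a hypothetical `g` type, not the runtime's real goroutine structure) of a run queue that always picks the goroutine whose clock is furthest behind:

```go
package cfs

import "container/heap"

// g is a hypothetical stand-in for a goroutine; vruntime is its logical
// clock, i.e. the accumulated (priority-weighted) on-CPU time in ns.
type g struct {
	vruntime int64
}

// runQueue is a min-heap ordered by vruntime, so the goroutine whose
// logical clock is furthest behind is always at the top.
type runQueue []*g

func (q runQueue) Len() int           { return len(q) }
func (q runQueue) Less(i, j int) bool { return q[i].vruntime < q[j].vruntime }
func (q runQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }

func (q *runQueue) Push(x interface{}) { *q = append(*q, x.(*g)) }
func (q *runQueue) Pop() interface{} {
	old := *q
	n := len(old)
	x := old[n-1]
	*q = old[:n-1]
	return x
}

// pickNext returns the most-behind goroutine: running anything else would
// make the laggard even more unfairly behind.
func pickNext(q *runQueue) *g {
	if q.Len() == 0 {
		return nil
	}
	return heap.Pop(q).(*g)
}

// account charges the time a goroutine actually ran to its logical clock
// and puts it back. An event-intensive goroutine accumulates very little
// vruntime, so it keeps a high de facto priority; a CPU hog's clock races
// ahead, and it naturally loses preference.
func account(q *runQueue, cur *g, ranForNs int64) {
	cur.vruntime += ranForNs
	heap.Push(q, cur)
}
```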
Demo

The server cpuworker/example/demo.go is running on an AWS `c5d.12xlarge` instance with the env `GOMAXPROCS` set to 16. The benchmark tool is running on an AWS `c5d.4xlarge` instance. The two machines are running in the same cluster placement group.

Step 1: Kill all already-running `x` commands, then run cmd 1 (the standalone delay1ms benchmark).

Step 2: Kill all already-running `x` commands, then run cmd 1 and cmd 2 simultaneously (the delay1ms benchmark together with a very heavy CPU load scheduled by the original Go scheduler). Current CPU load of the server side (note that the load average already reaches `GOMAXPROCS`, i.e. 16 in this case). As we can see, the latency becomes very high and unstable.

Step 3: Kill all already-running `x` commands, then run cmd 1 and cmd 3 simultaneously (the delay1ms benchmark together with a very heavy CPU load scheduled by our own scheduler on top of the original Go scheduler). Current CPU load of the server side (note that the load average is near `cpuWorkerMaxP`, i.e. 12 in this case; you can set this parameter yourself). Now the latency is fine again, even while running alongside many CPU-intensive tasks!

Step 4: Kill all already-running `x` commands, then run cmd 1, cmd 3, and cmd 4 simultaneously (the delay1ms and checksumSmallTaskWithCpuWorker benchmarks together with a very heavy CPU load scheduled by our own scheduler on top of the original Go scheduler). Current CPU load of the server side (note that the load average is near `cpuWorkerMaxP`, i.e. 12 in this case; you can set this parameter yourself). The latency of both is fine :-)
This demonstration is only meant to prove that the proposal is feasible and would bring obvious benefits to applications that care about latency. Moreover, for many applications throughput is directly affected by latency, so this proposal can also improve the throughput of those applications.
Proposal

Option 1: Bring Go a better scheduler, much like the Linux kernel's CFS scheduler, with support for adaptive priority adjustment. A goroutine `Setpriority`, similar to `syscall.Setpriority`, could also be supported; it would only influence the speed at which the goroutine's logical clock ticks (a sketch follows below).

Option 2: Give users the ability to customize their own scheduler without needing to modify their existing Go code. I don't have a very detailed idea for this yet; I look forward to your excellent ideas.
You might say that one could achieve this goal in the application layer, the way cpuworker does, but for most applications that already have a very large amount of Go code, such a change is just too expensive, and a change in the runtime would be more efficient. Therefore, I tend to hope that Go can provide a similar mechanism rather than expecting users to achieve this goal themselves in the application layer.
Thanks a lot for taking the time to read this proposal :-)