Question: Differences with my coroutine library? #12

Closed
abbycin opened this issue Jul 18, 2018 · 6 comments

@abbycin

commented Jul 18, 2018

Compared with this one: https://github.com/abbycin/tools/tree/master/coroutine

Could you please tell me the differences between my coroutine library and libaco?

@hnes

Owner

commented Jul 18, 2018

Hello, please give me some time to read it through carefully before I reply :D

@hnes hnes changed the title "I just want to ask: what's the difference?" to "Question: Differences with my coroutine library?" Jul 18, 2018

@hnes

Owner

commented Jul 19, 2018

I have read your blog post; it is a very nice one :D

Here are the differences between your library and libaco:

  1. It is not recommended for users to implement a coroutine library in C++ themselves (it should be provided by the C++ standard and its library, which is the only way to guarantee correctness), because the C++ ABI depends on the compiler (even on the compiler version) and on the platform. Your abbycin/tools/coroutine implementation follows only the Sys V ABI standard, which is incorrect. Alternatively, you could implement the coroutine library in C and call it from C++, taking care to handle the boundary between the two languages correctly.

  2. Even taking the Sys V ABI standard as the reference, switch_stack.asm does not fully comply with it (the control words of the FPU and MXCSR are not handled; Tencent's libco makes the same mistake, and there is a bug issue about it).

  3. Your Windows implementation, switch_stack_win.asm, is also wrong: the Microsoft x64 ABI is far more complex than the Sys V AMD64 ABI.

  4. libaco supports not only a standalone execution stack per coroutine, but also sharing a single execution stack among an unlimited number of coroutines (and it supports a guard page on the execution stack). abbycin/tools/coroutine supports only standalone execution stacks, which consumes a huge amount of virtual memory in high-concurrency scenarios (1 to 10 million coroutines, for example).
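
To make point 2 more concrete: the Sys V AMD64 ABI treats the x87 control word and the control bits of MXCSR as callee-saved, so a context switch routine is expected to preserve them per coroutine. The following is only a minimal sketch of that idea using GCC/Clang inline assembly on x86-64; the type `fp_ctl_t` and the helper names are hypothetical and are not taken from libaco or from abbycin/tools/coroutine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if !defined(__x86_64__)
#error "This sketch uses x86-64 instructions (fnstcw/fldcw/stmxcsr/ldmxcsr)."
#endif

/* Hypothetical container for the two floating-point control registers. */
typedef struct {
    uint16_t x87_cw; /* x87 FPU control word */
    uint32_t mxcsr;  /* SSE control/status register */
} fp_ctl_t;

/* Store the current control words, as a switch routine would do for the
 * coroutine that is being suspended. */
static inline void fp_ctl_save(fp_ctl_t *s) {
    assert(s != NULL);
    __asm__ volatile("fnstcw %0" : "=m"(s->x87_cw));
    __asm__ volatile("stmxcsr %0" : "=m"(s->mxcsr));
}

/* Load previously stored control words, as a switch routine would do for
 * the coroutine that is being resumed. */
static inline void fp_ctl_restore(const fp_ctl_t *s) {
    assert(s != NULL);
    __asm__ volatile("fldcw %0" : : "m"(s->x87_cw));
    __asm__ volatile("ldmxcsr %0" : : "m"(s->mxcsr));
}
```

Without such a save/restore pair, a coroutine that changes the rounding mode would leak that change into whichever coroutine runs next.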

@hnes hnes added the question label Jul 19, 2018

@yuanzhubi


commented Jul 19, 2018

Apart from not saving and restoring RDI and RSI, the original poster's library has no real problems.
You forgot to save/restore the callee-saved registers RDI and RSI on Windows x64.

@hnes

Owner

commented Jul 19, 2018

Here are the nonvolatile (callee-saved) registers in the Windows x64 ABI standard:

Nonvolatile registers: R12:R15 RDI RSI RBX RBP RSP XMM6:XMM15

The code snippet below shows what a correct implementation on Windows should look like (it may still contain some bugs, because I have not fully checked it yet).

libcoro/coro.c#L137:

       #if __amd64

         #if _WIN32 || __CYGWIN__
           #define NUM_SAVED 29
           "\tsubq $168, %rsp\n" /* one dummy qword to improve alignment */
           "\tmovaps %xmm6, (%rsp)\n"
           "\tmovaps %xmm7, 16(%rsp)\n"
           "\tmovaps %xmm8, 32(%rsp)\n"
           "\tmovaps %xmm9, 48(%rsp)\n"
           "\tmovaps %xmm10, 64(%rsp)\n"
           "\tmovaps %xmm11, 80(%rsp)\n"
           "\tmovaps %xmm12, 96(%rsp)\n"
           "\tmovaps %xmm13, 112(%rsp)\n"
           "\tmovaps %xmm14, 128(%rsp)\n"
           "\tmovaps %xmm15, 144(%rsp)\n"
           "\tpushq %rsi\n"
           "\tpushq %rdi\n"
           "\tpushq %rbp\n"
           "\tpushq %rbx\n"
           "\tpushq %r12\n"
           "\tpushq %r13\n"
           "\tpushq %r14\n"
           "\tpushq %r15\n"
           #if CORO_WIN_TIB
             "\tpushq %fs:0x0\n"
             "\tpushq %fs:0x8\n"
             "\tpushq %fs:0xc\n"
           #endif
           "\tmovq %rsp, (%rcx)\n"
           "\tmovq (%rdx), %rsp\n"
           #if CORO_WIN_TIB
             "\tpopq %fs:0xc\n"
             "\tpopq %fs:0x8\n"
             "\tpopq %fs:0x0\n"
           #endif
           "\tpopq %r15\n"
           "\tpopq %r14\n"
           "\tpopq %r13\n"
           "\tpopq %r12\n"
           "\tpopq %rbx\n"
           "\tpopq %rbp\n"
           "\tpopq %rdi\n"
           "\tpopq %rsi\n"
           "\tmovaps (%rsp), %xmm6\n"
           "\tmovaps 16(%rsp), %xmm7\n"
           "\tmovaps 32(%rsp), %xmm8\n"
           "\tmovaps 48(%rsp), %xmm9\n"
           "\tmovaps 64(%rsp), %xmm10\n"
           "\tmovaps 80(%rsp), %xmm11\n"
           "\tmovaps 96(%rsp), %xmm12\n"
           "\tmovaps 112(%rsp), %xmm13\n"
           "\tmovaps 128(%rsp), %xmm14\n"
           "\tmovaps 144(%rsp), %xmm15\n"
           "\taddq $168, %rsp\n"
         #else
           #define NUM_SAVED 6
           "\tpushq %rbp\n"
           "\tpushq %rbx\n"
           "\tpushq %r12\n"
           "\tpushq %r13\n"
           "\tpushq %r14\n"
           "\tpushq %r15\n"
           "\tmovq %rsp, (%rdi)\n"
           "\tmovq (%rsi), %rsp\n"
           "\tpopq %r15\n"
           "\tpopq %r14\n"
           "\tpopq %r13\n"
           "\tpopq %r12\n"
           "\tpopq %rbx\n"
           "\tpopq %rbp\n"
         #endif
         "\tpopq %rcx\n"
         "\tjmpq *%rcx\n"

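As a cross-check of the constants in the snippet above, the arithmetic behind `subq $168` and `NUM_SAVED 29` can be written out. This is a sketch under the assumption that NUM_SAVED counts 8-byte stack slots and excludes the optional CORO_WIN_TIB block; the enum names are hypothetical and are not part of libcoro.

```c
#include <assert.h>

/* Hypothetical names; only the numbers 168, 29 and 6 come from the snippet. */
enum {
    QWORD_BYTES     = 8,
    XMM_BYTES       = 16,
    WIN64_XMM_COUNT = 10, /* xmm6..xmm15 */
    WIN64_GPR_COUNT = 8,  /* rsi, rdi, rbp, rbx, r12..r15 */
    SYSV_GPR_COUNT  = 6,  /* rbp, rbx, r12..r15 */

    /* Ten XMM slots plus the one dummy qword mentioned in the comment. */
    WIN64_XMM_AREA_BYTES = WIN64_XMM_COUNT * XMM_BYTES + QWORD_BYTES,

    /* Everything saved on Windows, counted in 8-byte slots. */
    WIN64_SAVED_QWORDS = WIN64_XMM_AREA_BYTES / QWORD_BYTES + WIN64_GPR_COUNT
};

static_assert(WIN64_XMM_AREA_BYTES == 168, "matches subq $168, %rsp");
static_assert(WIN64_SAVED_QWORDS == 29, "matches NUM_SAVED on Windows");
static_assert(SYSV_GPR_COUNT == 6, "matches NUM_SAVED elsewhere");
```

The extra qword also fits the alignment comment: assuming rsp is 8 modulo 16 on entry (the usual state right after a call), subtracting 168 leaves it 16-byte aligned, which movaps requires.
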
@abbycin

Author

commented Jul 20, 2018

emmm...

[image]

@hnes

Owner

commented Jul 21, 2018

In the Best Practice part:

In summary, if you want to gain the ultra performance of libaco, just keep the stack usage of the non-standalone non-main co at the point of calling aco_yield as small as possible.

       co_fp 
       /  \
      /    \  
    f1     f2
   /  \    / \
  /    \  f4  \
yield  f3     f5

The stack usage of a non-standalone (sharing its stack with other coroutines) non-main co at the moment it yields (i.e. calls aco_yield to yield back to the main co) has a big impact on the performance of context switching between coroutines. The benchmark results already show this clearly. In the diagram above, the stack usage of functions f2, f3, f4 and f5 has no direct influence on context switching performance, since no aco_yield happens while they are executing. The stack usage of co_fp and f1, by contrast, dominates the value of co->save_stack.max_cpsz and has a big influence on context switching performance.

The key to keeping the stack usage of a function tiny is to allocate local variables (especially the big ones) on the heap and manage their lifecycle manually, instead of allocating them on the stack by default. The -fstack-usage option of gcc is very helpful for this.
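
The heap-allocation advice above can be illustrated with a small sketch. It does not call libaco at all; the function names and the 64 KiB buffer size are hypothetical, chosen only to contrast a large stack frame with a small one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { SCRATCH_SIZE = 64 * 1024 }; /* hypothetical scratch buffer size */

/* Add up all bytes of a buffer. */
static unsigned long sum_bytes(const unsigned char *buf, size_t len) {
    unsigned long total = 0;
    for (size_t i = 0; i < len; i++) {
        total += buf[i];
    }
    return total;
}

/* Stack-heavy variant: the 64 KiB buffer is part of this frame. If a
 * coroutine on a shared stack yielded while this frame was live, the
 * whole buffer would count toward the stack bytes to be saved. */
unsigned long work_on_stack(unsigned char seed) {
    unsigned char scratch[SCRATCH_SIZE];
    memset(scratch, seed, sizeof scratch);
    return sum_bytes(scratch, sizeof scratch);
}

/* Heap variant: the frame only holds a pointer, and the buffer's
 * lifetime is managed by hand. Returns 0 on success, -1 if the
 * allocation fails. */
int work_on_heap(unsigned char seed, unsigned long *out) {
    assert(out != NULL);
    unsigned char *scratch = malloc(SCRATCH_SIZE);
    if (scratch == NULL) {
        return -1;
    }
    memset(scratch, seed, SCRATCH_SIZE);
    /* A yield at this point would only need to preserve a small frame. */
    *out = sum_bytes(scratch, SCRATCH_SIZE);
    free(scratch);
    return 0;
}
```

Compiling such a file with `gcc -fstack-usage` writes a `.su` file listing the stack usage per function, which makes the difference between the two variants visible.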

And from the Benchmark part:

aco_create/init_save_stk_sz=64B                              1     0.000 s      230.00 ns/op    4347824.79 op/s
aco_resume/co_amount=1/copy_stack_size=0B             20000000     0.412 s       20.59 ns/op   48576413.55 op/s
  -> acosw                                            40000000     0.412 s       10.29 ns/op   97152827.10 op/s
aco_destroy                                                  1     0.000 s      650.00 ns/op    1538461.66 op/s

aco_create/init_save_stk_sz=64B                       10000000     1.240 s      123.97 ns/op    8066542.54 op/s
aco_resume/co_amount=10000000/copy_stack_size=8B      40000000     1.327 s       33.17 ns/op   30143409.55 op/s
aco_destroy                                           10000000     0.328 s       32.82 ns/op   30467658.05 op/s

aco_create/init_save_stk_sz=64B                       10000000     0.659 s       65.94 ns/op   15165717.02 op/s
aco_resume/co_amount=10000000/copy_stack_size=24B     40000000     1.345 s       33.63 ns/op   29737708.53 op/s
aco_destroy                                           10000000     0.337 s       33.71 ns/op   29666697.09 op/s

aco_create/init_save_stk_sz=64B                       10000000     0.654 s       65.38 ns/op   15296191.35 op/s
aco_resume/co_amount=10000000/copy_stack_size=40B     40000000     1.348 s       33.71 ns/op   29663992.77 op/s
aco_destroy                                           10000000     0.336 s       33.56 ns/op   29794574.96 op/s

aco_create/init_save_stk_sz=64B                       10000000     0.653 s       65.29 ns/op   15316087.09 op/s
aco_resume/co_amount=10000000/copy_stack_size=56B     40000000     1.384 s       34.60 ns/op   28902221.24 op/s
aco_destroy                                           10000000     0.337 s       33.73 ns/op   29643682.93 op/s

aco_create/init_save_stk_sz=64B                       10000000     0.652 s       65.19 ns/op   15340872.40 op/s
aco_resume/co_amount=10000000/copy_stack_size=120B    40000000     1.565 s       39.11 ns/op   25566255.73 op/s
aco_destroy                                           10000000     0.443 s       44.30 ns/op   22574242.55 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.131 s       65.61 ns/op   15241722.94 op/s
aco_resume/co_amount=2000000/copy_stack_size=136B     20000000     0.947 s       47.36 ns/op   21114212.05 op/s
aco_destroy                                            2000000     0.125 s       62.35 ns/op   16039466.45 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.131 s       65.71 ns/op   15218784.72 op/s
aco_resume/co_amount=2000000/copy_stack_size=136B     20000000     0.948 s       47.39 ns/op   21101216.29 op/s
aco_destroy                                            2000000     0.125 s       62.73 ns/op   15941559.26 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.131 s       65.49 ns/op   15270258.18 op/s
aco_resume/co_amount=2000000/copy_stack_size=152B     20000000     1.069 s       53.44 ns/op   18714275.17 op/s
aco_destroy                                            2000000     0.122 s       61.05 ns/op   16378678.85 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.132 s       65.91 ns/op   15171336.62 op/s
aco_resume/co_amount=2000000/copy_stack_size=232B     20000000     1.190 s       59.48 ns/op   16813230.99 op/s
aco_destroy                                            2000000     0.123 s       61.26 ns/op   16324298.25 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.131 s       65.68 ns/op   15224361.30 op/s
aco_resume/co_amount=2000000/copy_stack_size=488B     20000000     1.828 s       91.40 ns/op   10941133.56 op/s
aco_destroy                                            2000000     0.145 s       72.56 ns/op   13781182.82 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.132 s       65.80 ns/op   15197461.34 op/s
aco_resume/co_amount=2000000/copy_stack_size=488B     20000000     1.829 s       91.47 ns/op   10932139.32 op/s
aco_destroy                                            2000000     0.149 s       74.70 ns/op   13387258.82 op/s

As the README already describes, there are some limitations when using libaco in shared execution stack mode, but as long as you keep the stack usage on the shared execution stack at the point of yielding as small as you can accept, it works just fine.
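
The trend in the benchmark (aco_resume getting slower as copy_stack_size grows) is what one would expect if the live part of a shared stack is copied into a private save buffer when a coroutine gives up the stack, and copied back when it resumes. The sketch below is a deliberately simplified model of that idea, not libaco's actual code: `model_co_t` and the `model_*` functions are hypothetical, and a plain byte array stands in for the shared stack.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified model of a coroutine on a shared stack. */
typedef struct {
    unsigned char *save_buf; /* private copy of the live stack bytes */
    size_t save_cap;         /* capacity of save_buf */
    size_t saved_len;        /* bytes copied at the last yield */
    size_t max_saved;        /* largest copy seen so far */
} model_co_t;

/* Copy the `used` live bytes out of the shared stack so that another
 * coroutine can reuse it. The cost grows with `used`.
 * Returns 0 on success, -1 if the save buffer cannot be grown. */
int model_save(model_co_t *co, const unsigned char *live, size_t used) {
    assert(co != NULL);
    if (used > co->save_cap) {
        unsigned char *p = realloc(co->save_buf, used);
        if (p == NULL) {
            return -1;
        }
        co->save_buf = p;
        co->save_cap = used;
    }
    if (used > 0) {
        memcpy(co->save_buf, live, used);
    }
    co->saved_len = used;
    if (used > co->max_saved) {
        co->max_saved = used;
    }
    return 0;
}

/* Copy the saved bytes back onto the shared stack before resuming. */
void model_restore(const model_co_t *co, unsigned char *live) {
    assert(co != NULL);
    if (co->saved_len > 0) {
        memcpy(live, co->save_buf, co->saved_len);
    }
}

/* Release the private save buffer. */
void model_free(model_co_t *co) {
    assert(co != NULL);
    free(co->save_buf);
    co->save_buf = NULL;
    co->save_cap = co->saved_len = co->max_saved = 0;
}
```

In this model a coroutine that yields with 8 live bytes costs two 8-byte copies per round trip, while one that yields with 488 live bytes costs two 488-byte copies, which matches the direction of the numbers above.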

Still, using DMA from userspace is a very valuable approach and worth further investigation in the future (although it could be OS dependent, or even OS version dependent).

Thank you very much, @abbycin and your friend :D

@hnes hnes added the answered label Jul 23, 2018

@hnes hnes closed this Jul 23, 2018
