Wait buffer #11

craff · 2023-08-06T16:45:56Z

Introduces EventBuffer to avoid allocating at each wait/wait_fold.

Warning: This PR is based on constant3 (PR#10) and not master, because I added one line in src/constants.ml

Questions:

I think the new wait_buffer/wait_buffer_fold could replace wait and wait_fold which do not have the intended complexity. We could use deprecation or just accept a breaking change ? Problem with deprecation is we need new function names.
I am not sure if I want EventBuffer.t = int * Bytes.t or just Bytes.t and to a division to compute val_max ?

lindig · 2023-08-06T18:12:08Z

The allocation using alloca on the C stack is basically cost free and cheaper than any other form of allocation either using malloc or using the OCaml heap because the latter requires garbage collection. Allocation with alloca is just a pointer operation and memory is freed as soon as possible. The fact that this buffer is allocated at every invocation of epoll_wait and does not survive the call is irrelevant. I don't believe that pre-allocating a buffer which is GC'ed later is cheaper or faster. So this would require a benchmark to convince me otherwise.

lindig · 2023-08-06T18:13:01Z

Maybe epoll_wait and epoll_wait_fold could be unified? The latter is the more general case.

edwintorok · 2023-08-06T18:53:21Z

Note that if you want the highest performance you should look at using uring (https://github.com/ocaml-multicore/ocaml-uring), which works with OCaml 4.x too (with 5.x you get a higher level and more convenient interface to it with Eio). It is a much lower level interface though.

polly is useful in situations where you do not have uring: an older kernel, when you are not root (in case 'uring' has been restricted to root only), or if uring has been compiled out of the kernel. It is also useful when you want to debug your program using strace (uring avoids making any syscalls in some modes, so it is harder to debug).
Also polly doesn't allocate memory on the OCaml side, and requires no work from the GC, whereas 'uring' does perform small allocations in a few places.
I would expect the performance of most functions in polly to be dominated by the syscall time (which is quite slow, especially with all the security mitigations in place).

I have some benchmark code for a small HTTP load generator built using polly, but I don't have benchmark code for polly itself. IIRC 'polly' didn't show up in my performance profiles in a significant way, so I haven't looked at optimizing it, but I don't have concrete numbers at hand. When I have some time I can try to rerun the benchmarks and see how much time 'Polly.wait_fold' and 'Polly.upd' takes up.

craff · 2023-08-06T18:59:52Z

It is still scanned by the minor GC? I was going to benchmark, but I forgot the power outlet of my laptop. Le 6 août 2023 08:12:19 GMT-10:00, Christian Lindig ***@***.***> a écrit :

…

The allocation using `alloca` on the C stack is basically cost free and cheaper than any other form of allocation either using malloc or using the OCaml heap because the latter requires garbage collection. Allocation with `alloca` is just a pointer operation and memory is freed as soon as possible. The fact that this buffer is allocated at every invocation of `epoll_wait` and does not survive the call is irrelevant. I don't believe that pre-allocating a buffer which is GC'ed later is cheaper or faster. So this would require a benchmark to convince me otherwise. -- Reply to this email directly or view it on GitHub: #11 (comment) You are receiving this because you authored the thread. Message ID: ***@***.***>

lindig · 2023-08-06T19:11:51Z

Maybe my understanding is wrong (@edwintorok please correct me), but alloca reserves space on the C stack by increasing the current stack frame. It is invisible to the OCaml GC and is automatically de-allocated when the C function returns. This is why it is so cheap. The only potential problem are very large numbers of val_max because it could overflow the stack. But the structure we are allocating per instance is small: about two 64-bit words.

craff · 2023-08-06T19:45:48Z

Isn't there a bzero at stack allocation or deallocation? Le 6 août 2023 08:12:19 GMT-10:00, Christian Lindig ***@***.***> a écrit :

…

The allocation using `alloca` on the C stack is basically cost free and cheaper than any other form of allocation either using malloc or using the OCaml heap because the latter requires garbage collection. Allocation with `alloca` is just a pointer operation and memory is freed as soon as possible. The fact that this buffer is allocated at every invocation of `epoll_wait` and does not survive the call is irrelevant. I don't believe that pre-allocating a buffer which is GC'ed later is cheaper or faster. So this would require a benchmark to convince me otherwise. -- Reply to this email directly or view it on GitHub: #11 (comment) You are receiving this because you authored the thread. Message ID: ***@***.***>

edwintorok · 2023-08-06T21:43:10Z

Variables allocated on the stack are uninitialized, there is no need for a bzero there. You can look at the generated assembly code, I wouldn't worry about the performance of the alloca, it is quite similar to allocating an array on the stack.
Using objdump -dS on _build/default/src/main.exe shows the C source code together with the disassembly (approximatively, due to optimization the code is rearranged a bit, and depending on OS and compiler version the exact code may be different a bit). You can see that the alloca is basically just sub %rax, %rsp (where %rax is computed based on val_max):

      if (Int_val(val_max) <= 0)
  52fc18:       85 c0                   test   %eax,%eax
  52fc1a:       0f 8e ed 00 00 00       jle    52fd0d <caml_polly_wait_fold+0x1dd>
                caml_invalid_argument(__FUNCTION__);
        events =
            (struct epoll_event *)alloca(Int_val(val_max) *
  52fc20:       48 98                   cltq
  52fc22:       48 8d 04 40             lea    (%rax,%rax,2),%rax
  52fc26:       48 8d 04 85 17 00 00    lea    0x17(,%rax,4),%rax
  52fc2d:       00 
  52fc2e:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
  52fc32:       48 29 c4                sub    %rax,%rsp
                                         sizeof(struct epoll_event));

        caml_enter_blocking_section();
  52fc35:       e8 66 60 02 00          call   555ca0 <caml_enter_blocking_section>

craff · 2023-08-06T23:42:28Z

Variables allocated on the stack are uninitialized, there is no need for a bzero there. You can look at the generated assembly code, I wouldn't worry about the performance of the alloca, it is quite similar to allocating an array on the stack. Using objdump -dS on _build/default/src/main.exe shows the C source code together with the disassembly (approximatively, due to optimization the code is rearranged a bit, and depending on OS and compiler version the exact code may be different a bit). You can see that the alloca is basically just sub %rax, %rsp (where %rax is computed based on val_max):
      if (Int_val(val_max) <= 0)
  52fc18:       85 c0                   test   %eax,%eax
  52fc1a:       0f 8e ed 00 00 00       jle    52fd0d <caml_polly_wait_fold+0x1dd>
                caml_invalid_argument(__FUNCTION__);
        events =
            (struct epoll_event *)alloca(Int_val(val_max) *
  52fc20:       48 98                   cltq
  52fc22:       48 8d 04 40             lea    (%rax,%rax,2),%rax
  52fc26:       48 8d 04 85 17 00 00    lea    0x17(,%rax,4),%rax
  52fc2d:       00 
  52fc2e:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
  52fc32:       48 29 c4                sub    %rax,%rsp
                                         sizeof(struct epoll_event));

        caml_enter_blocking_section();
  52fc35:       e8 66 60 02 00          call   555ca0 <caml_enter_blocking_section>

Thanks for the information. This PR is therefore clearly useless. I close it.

craff added 4 commits August 5, 2023 13:28

compile time constants generated from a small C file

8668d60

forgotten new file

7d2e8e3

minor changes

60c4001

introduce EventBuffer to avoid allocation at each wait or wait_fold

5b7eb27

craff closed this Aug 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wait buffer #11

Wait buffer #11

craff commented Aug 6, 2023 •

edited

Loading

lindig commented Aug 6, 2023

lindig commented Aug 6, 2023

edwintorok commented Aug 6, 2023

craff commented Aug 6, 2023 via email

lindig commented Aug 6, 2023

craff commented Aug 6, 2023 via email

edwintorok commented Aug 6, 2023 •

edited

Loading

craff commented Aug 6, 2023

Wait buffer #11

Wait buffer #11

Conversation

craff commented Aug 6, 2023 • edited Loading

lindig commented Aug 6, 2023

lindig commented Aug 6, 2023

edwintorok commented Aug 6, 2023

craff commented Aug 6, 2023 via email

lindig commented Aug 6, 2023

craff commented Aug 6, 2023 via email

edwintorok commented Aug 6, 2023 • edited Loading

craff commented Aug 6, 2023

craff commented Aug 6, 2023 •

edited

Loading

edwintorok commented Aug 6, 2023 •

edited

Loading