Wait buffer #11

Closed

craff
Contributor

@craff craff commented Aug 6, 2023

Introduces EventBuffer to avoid allocating at each wait/wait_fold.

Warning: This PR is based on constant3 (PR#10) and not master, because I added one line in src/constants.ml

Questions:

  • I think the new wait_buffer/wait_buffer_fold could replace wait and wait_fold, which do not have the intended complexity. We could deprecate the old functions or just accept a breaking change. The problem with deprecation is that we would need new function names.
  • I am not sure whether I want EventBuffer.t = int * Bytes.t, or just Bytes.t with a division to compute val_max.
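
To make the second question concrete, here is a minimal C sketch of the "just the bytes, plus a division" option. It only models the idea; the PR itself works with OCaml Bytes.t, and the function names `length_for` and `val_max_of_length` are hypothetical and do not come from polly.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>

/* Hypothetical helper: byte length needed for a buffer of `max` events. */
size_t length_for(int max)
{
    assert(max > 0);
    return (size_t)max * sizeof(struct epoll_event);
}

/* Hypothetical helper: recover the capacity from the byte length alone.
 * Integer division drops any trailing bytes that cannot hold a full event. */
int val_max_of_length(size_t len)
{
    return (int)(len / sizeof(struct epoll_event));
}
```

Storing the capacity next to the bytes avoids the division at each call, while deriving it keeps the type to a single field; either way the cost is small compared to the syscall.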

@lindig
Owner

lindig commented Aug 6, 2023

The allocation using alloca on the C stack is basically cost free and cheaper than any other form of allocation either using malloc or using the OCaml heap because the latter requires garbage collection. Allocation with alloca is just a pointer operation and memory is freed as soon as possible. The fact that this buffer is allocated at every invocation of epoll_wait and does not survive the call is irrelevant. I don't believe that pre-allocating a buffer which is GC'ed later is cheaper or faster. So this would require a benchmark to convince me otherwise.
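
As an illustration of the pattern described above, the following is a minimal sketch (Linux only) of an epoll_wait call whose event array is reserved with alloca and released when the function returns. The helper names `count_ready` and `demo_pipe_ready` are made up for this example and are not part of polly.

```c
#include <alloca.h>
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: wait on `epfd` for at most `max` events and return
 * how many were ready. The event array lives only in this stack frame. */
int count_ready(int epfd, int max, int timeout_ms)
{
    assert(max > 0);
    struct epoll_event *events =
        alloca((size_t)max * sizeof(struct epoll_event));
    /* No free() is needed: the space is released when the function returns. */
    return epoll_wait(epfd, events, max, timeout_ms);
}

/* Small demo: register the read end of a pipe, write one byte,
 * and report how many descriptors epoll considers ready. */
int demo_pipe_ready(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[0] };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) != 0)
        return -1;
    if (write(fds[1], "x", 1) != 1)
        return -1;
    int n = count_ready(epfd, 8, 1000);
    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Calling `demo_pipe_ready()` exercises one wait with a stack-allocated buffer of 8 events; nothing from the buffer outlives the call, which is the property under discussion.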

@lindig
Owner

lindig commented Aug 6, 2023

Maybe epoll_wait and epoll_wait_fold could be unified? The latter is the more general case.

@edwintorok
Collaborator

Note that if you want the highest performance you should look at using uring (https://github.com/ocaml-multicore/ocaml-uring), which works with OCaml 4.x too (with 5.x you get a higher level and more convenient interface to it with Eio). It is a much lower level interface though.

polly is useful in situations where you do not have uring: an older kernel, when you are not root (in case 'uring' has been restricted to root only), or if uring has been compiled out of the kernel. It is also useful when you want to debug your program using strace (uring avoids making any syscalls in some modes, so it is harder to debug).
Also polly doesn't allocate memory on the OCaml side, and requires no work from the GC, whereas 'uring' does perform small allocations in a few places.
I would expect the performance of most functions in polly to be dominated by the syscall time (which is quite slow, especially with all the security mitigations in place).

I have some benchmark code for a small HTTP load generator built using polly, but I don't have benchmark code for polly itself. IIRC 'polly' didn't show up in my performance profiles in a significant way, so I haven't looked at optimizing it, but I don't have concrete numbers at hand. When I have some time I can try to rerun the benchmarks and see how much time 'Polly.wait_fold' and 'Polly.upd' take up.

@craff
Contributor Author

craff commented Aug 6, 2023 via email

@lindig
Owner

lindig commented Aug 6, 2023

Maybe my understanding is wrong (@edwintorok please correct me), but alloca reserves space on the C stack by increasing the current stack frame. It is invisible to the OCaml GC and is automatically de-allocated when the C function returns. This is why it is so cheap. The only potential problem is a very large val_max, because the allocation could then overflow the stack. But the structure we are allocating per instance is small: about two 64-bit words.
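
The stack-overflow concern can be made concrete with a small sketch that computes the size of the event array and compares it against a limit. The 64 KiB cap and the names `event_buffer_bytes` and `fits_on_stack` are assumptions chosen for illustration; polly itself is not known to apply such a cap.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>

/* Hypothetical cap on what we are willing to reserve with alloca. */
#define MAX_STACK_EVENT_BYTES ((size_t)64 * 1024)

/* Bytes needed for an array of `max` epoll events. */
size_t event_buffer_bytes(int max)
{
    assert(max > 0);
    return (size_t)max * sizeof(struct epoll_event);
}

/* Returns 1 if an array of `max` events stays under the cap, else 0.
 * Non-positive values are rejected before any size is computed. */
int fits_on_stack(int max)
{
    return max > 0 && event_buffer_bytes(max) <= MAX_STACK_EVENT_BYTES;
}
```

With a cap like this, a caller could fall back to heap allocation, or reject the argument, for unusually large val_max values while keeping alloca for the common case.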

@craff
Contributor Author

craff commented Aug 6, 2023 via email

@edwintorok
Collaborator

edwintorok commented Aug 6, 2023

Variables allocated on the stack are uninitialized, so there is no need for a bzero there. You can look at the generated assembly code. I wouldn't worry about the performance of the alloca; it is quite similar to allocating an array on the stack.
Using objdump -dS on _build/default/src/main.exe shows the C source code together with the disassembly (approximately: optimization rearranges the code a bit, and the exact code may differ slightly depending on OS and compiler version). You can see that the alloca is basically just sub %rax, %rsp (where %rax is computed based on val_max):

      if (Int_val(val_max) <= 0)
  52fc18:       85 c0                   test   %eax,%eax
  52fc1a:       0f 8e ed 00 00 00       jle    52fd0d <caml_polly_wait_fold+0x1dd>
                caml_invalid_argument(__FUNCTION__);
        events =
            (struct epoll_event *)alloca(Int_val(val_max) *
  52fc20:       48 98                   cltq
  52fc22:       48 8d 04 40             lea    (%rax,%rax,2),%rax
  52fc26:       48 8d 04 85 17 00 00    lea    0x17(,%rax,4),%rax
  52fc2d:       00 
  52fc2e:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
  52fc32:       48 29 c4                sub    %rax,%rsp
                                         sizeof(struct epoll_event));

        caml_enter_blocking_section();
  52fc35:       e8 66 60 02 00          call   555ca0 <caml_enter_blocking_section>
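
The arithmetic in the listing above can be restated in C. The two lea instructions multiply the event count by 3 and then by 4, giving 12 bytes per event in this particular build, and add 0x17; the and instruction then clears the low four bits so the result is a multiple of 16. This is a sketch of that computation only; the function name `alloca_stack_adjust` is hypothetical, and other compilers or platforms may emit different code.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the instructions in the listing:
 *   lea (%rax,%rax,2),%rax    -> max * 3
 *   lea 0x17(,%rax,4),%rax    -> (max * 3) * 4 + 0x17
 *   and $0xfffffffffffffff0   -> round down to a multiple of 16
 * The result is the amount subtracted from %rsp. */
size_t alloca_stack_adjust(size_t max)
{
    assert(max > 0);                 /* the C code rejects max <= 0 earlier */
    size_t bytes = max * 3 * 4;      /* 12 bytes per event in this build */
    return (bytes + 0x17) & ~(size_t)0xf;
}
```

For example, one event yields an adjustment of 32 bytes and ten events yield 128 bytes, so the whole allocation amounts to a handful of integer instructions regardless of val_max.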

@craff
Contributor Author

craff commented Aug 6, 2023

Thanks for the information. This PR is therefore clearly useless. I close it.

@craff craff closed this Aug 6, 2023