Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store both ends of the stack chain in continuations #12735

Merged
merged 3 commits into from
Nov 29, 2023

Conversation

lpw25
Copy link
Contributor

@lpw25 lpw25 commented Nov 13, 2023

This PR changes the representation of continuations to include pointers to both ends of the chain of stacks that it represents. Currently, when resuming a continuation we must walk the whole chain to find the other end, making continue linear in the length of the chain. With this PR the walk is not needed and continue is constant time.

Keeping both ends of the chain in the continuation also allows for a reperform operation that takes a continuation as input. Such an operation could technically be implemented now, but would produce quadratic behaviour. I don't include this operation in this PR, but it is implemented in a later one.

The other end of the chain is stored tagged in the second field of the continuation. The behaviour of the first field is unchanged. The underlying perform primitive now takes 4 parameters and we project out the end of the chain and pass it in.

The bytecode interpreter and all the native code backends have been updated to remove the walk over the stack chain. I've only tested the bytecode and x86_64 implementations though. I also just rebased this over some TSAN related changes, and probably someone who understands those should check things still look right.

runtime/amd64.S Outdated Show resolved Hide resolved
@kayceesrk
Copy link
Contributor

Currently, when resuming a continuation we must walk the whole chain to find the other end, making perform linear in the length of the chain. With this PR the walk is not needed and perform is constant time.

The wording is a bit confusing. I agree that resuming a continuation is linear in the length of the fiber chain. This PR makes it constant time as we hold the handle to the last fiber. However, perform was and is still is linear time, isn't it? We still need to dynamically check for a matching handler in the stack of handlers one at a time.

@lpw25
Copy link
Contributor Author

lpw25 commented Nov 14, 2023

Yes, sorry, I meant continue/discontinue (or resume when its a primitive). I'll fix the PR text.

@kayceesrk
Copy link
Contributor

kayceesrk commented Nov 15, 2023

I am yet to review this change, but I am broadly in favour of the idea.

We have discussed this optimisation, holding a pointer to both ends of the linked list of fibers, in the continuation as a means of speeding up resumption. Good to see this implemented.

Many papers that propose efficient compilation of effect handlers by translating them through CPS, evidence translation, etc., compare the performance of their system against OCaml. The chosen benchmark is usually a microbenchmark with a deeply nested handler stack (which makes sense if the aim is to effect-handler-oriented programming without relying on built-in features in OCaml such as ref cells). OCaml is usually slower than the other systems on such benchmarks. There are two parts to the slowdown observed in OCaml compared to other compilation methods:

  1. OCaml's need to look for a matching handler one at a time from the top of the handler stack. In other works, they may have enough information to jump directly to the matching handler.
  2. OCaml's need to perform linear walk through the list of fibers for resumption.

While (1) cannot be easily avoided in current scheme, (2) certainly can be and this PR fixes (2). This would make OCaml's performance closer to other systems where there is a deeply nested handler stack. This change does not modify the user-facing API and technically is a very small change.

I'll leave a detailed review.

The PR needs change in every architecture where OCaml 5 is supported. It would be great if @dustanddreams can help review and potentially implement the missing bits.

For TSan bits that may be affected, may I ask @fabbing or @OlivierNicole to see whether the changes are sensible?

runtime/amd64.S Outdated Show resolved Hide resolved
runtime/s390x.S Outdated Show resolved Hide resolved
@dustanddreams
Copy link
Contributor

The runtime/power.S bits are missing.

The following diff appears to work for me.

diff --git runtime/power.S runtime/power.S
index 7afefe8735..2fe6fd02ec 100644
--- runtime/power.S
+++ runtime/power.S
@@ -719,31 +719,28 @@ FUNCTION caml_reperform
 ENDFUNCTION caml_reperform
 
 FUNCTION caml_resume
   /* r3: new fiber
      r4: fun
-     r5: arg */
+     r5: arg
+     r6: last_fiber */
         addi    3, 3, -1 /* r3 = Ptr_val(r3) */
         ld      12, 0(4) /* r12 = code pointer */
         mtctr   12       /* CTR = code pointer */
     /* Check if stack is null, then already used */
         cmpdi   3, 0
-        beq     2f
-    /* Find end of list of stacks (put in r7) */
-        mr      TMP, 3
-1:      ld      7, Stack_handler(TMP)
-        ld      TMP, Handler_parent(7)
-        cmpdi   TMP, 0
-        bne     1b
+        beq     1f
     /* Add current stack to the end */
+        addi    6, 6, -1 /* r6 = Ptr_val(r6) */
+        ld      7, Stack_handler(6)
         ld      8, Caml_state(current_stack)
         std     8, Handler_parent(7)
     /* Switch stacks and run code */
         SWITCH_OCAML_STACKS 8, 3
         mr      3, 5
         bctr
-2:      Addrglobal(C_CALL_FUN, caml_raise_continuation_already_resumed)
+1:      Addrglobal(C_CALL_FUN, caml_raise_continuation_already_resumed)
         b       .Lcaml_c_call
 ENDFUNCTION caml_resume
 
 /* Run a function on a new stack, then either
    return the value or invoke exception handler */

@gasche
Copy link
Member

gasche commented Nov 15, 2023

@kayceesrk your comment on the performance of OCaml native handlers compared to related works in deeply-nested scenarios is interesting. It would be nice to invite someome motivated to take a benchmark from the literature, and re-run the OCaml measurements before and after the present PR. Do you have a more precise paper/benchmark in mind that we could point someone to? (I am not up to date on recent handlers-compilation work, so myself I would just look at the latest handler-optimisation work by Daan Leijen's group, but there may be better choices.)

@gasche gasche added runtime-system Performance PR or issues affecting runtime performance of the compiled programs effect handlers labels Nov 15, 2023
@kayceesrk
Copy link
Contributor

kayceesrk commented Nov 15, 2023

Daan Leijen's work on Koka and Jonathan Brachthäuser work on Effekt are the ones that I have in mind. @dhil are you aware of papers where deeply nested handlers were evaluated?

A deeply nested state handler microbenchmark should be easy to write. Let me give this a go.

@dhil
Copy link
Contributor

dhil commented Nov 15, 2023

Daan Leijen's work on Koka and Jonathan Brachthäuser work on Effekt are the ones that I have in mind. @dhil are you aware of papers where deeply nested handlers were evaluated?

I think Handlers in Action may feature such a benchmark. There is my own UNIX operating system implementation (built from ~20 handlers), though I do not yet have an OCaml implementation of it. The C++-effects paper has something with generators, but I cannot remember offhand whether they measure wrt. to the depth of the call stack or handler stack.

I have a contrived microbenchmark for this sort of thing in our WasmFX implementation. The program installs a toplevel handler which handles op and then nests N handlers underneath it which do not handle op, at the bottom we perform op. I can transliterate it to OCaml, if there is interest.

@kayceesrk
Copy link
Contributor

kayceesrk commented Nov 16, 2023

I wrote up the microbenchmark here: https://gist.github.com/kayceesrk/92e029ee70ad8cb98c5ef34e88b48959.

The benchmark installs a number of nested integer state handlers (depth) with an outermost handler for string state. The computation performs a number of string state get and set operations that need to traverse through the entire handler stack for each operation.

The program takes the number integer state handlers (depth) as the first argument and the number of operations to be done (ops) as the second. The results were obtained using hyperfine on an M2 mac.

$ hyperfine -P depth 0 8 './deep_state_trunk.exe $((2**{depth})) 500_000' \
    './deep_state_pr12735.exe $((2**{depth})) 500_000' --export-markdown out.md
Command Mean [ms] Min [ms] Max [ms] Relative
./deep_state_trunk.exe $((2**0)) 500_000 42.5 ± 0.9 42.0 47.4 1.00
./deep_state_pr12735.exe $((2**0)) 500_000 43.2 ± 1.3 42.5 52.3 1.02 ± 0.04
./deep_state_trunk.exe $((2**1)) 500_000 55.0 ± 0.9 54.5 60.4 1.30 ± 0.03
./deep_state_pr12735.exe $((2**1)) 500_000 55.2 ± 1.2 54.3 60.8 1.30 ± 0.04
./deep_state_trunk.exe $((2**2)) 500_000 77.0 ± 0.3 76.4 78.2 1.81 ± 0.04
./deep_state_pr12735.exe $((2**2)) 500_000 76.9 ± 1.1 76.0 83.1 1.81 ± 0.05
./deep_state_trunk.exe $((2**3)) 500_000 129.1 ± 0.3 128.6 130.2 3.04 ± 0.06
./deep_state_pr12735.exe $((2**3)) 500_000 129.0 ± 3.3 127.6 142.4 3.04 ± 0.10
./deep_state_trunk.exe $((2**4)) 500_000 243.3 ± 0.4 242.7 244.2 5.73 ± 0.12
./deep_state_pr12735.exe $((2**4)) 500_000 234.2 ± 2.0 233.2 240.3 5.52 ± 0.12
./deep_state_trunk.exe $((2**5)) 500_000 466.8 ± 2.2 464.9 472.7 10.99 ± 0.23
./deep_state_pr12735.exe $((2**5)) 500_000 438.3 ± 6.4 433.7 454.7 10.32 ± 0.26
./deep_state_trunk.exe $((2**6)) 500_000 921.0 ± 5.7 915.8 931.2 21.69 ± 0.46
./deep_state_pr12735.exe $((2**6)) 500_000 855.4 ± 3.2 850.5 860.8 20.15 ± 0.42
./deep_state_trunk.exe $((2**7)) 500_000 1936.8 ± 21.1 1922.7 1994.1 45.62 ± 1.05
./deep_state_pr12735.exe $((2**7)) 500_000 1714.7 ± 29.8 1692.0 1776.8 40.39 ± 1.08
./deep_state_trunk.exe $((2**8)) 500_000 4406.1 ± 48.4 4344.7 4470.9 103.78 ± 2.40
./deep_state_pr12735.exe $((2**8)) 500_000 3900.0 ± 31.2 3860.0 3953.3 91.86 ± 2.01

The PR makes the program run ~10% faster for deep handler stack.

I should note that this 10% is a reasonable improvement. On trunk, the search for the right handler as well as the resumption is linear in the depth of the handler stack (length of the linked list of fibers). In this PR, the former is still linear whereas the latter is constant time. However, the former has a larger constant factor since the search for the matching handler requires executing OCaml code at every handler. The latter is just a traversal of the linked list of fibers, handwritten in assembly, which is now replaced with a constant time operation.

@dhil
Copy link
Contributor

dhil commented Nov 16, 2023

However, the former has a larger constant factor since the search for the matching handler requires executing OCaml code at every handler.

If we were to give a collective name to operations (e.g. as in #12736) then we could potentially optimise the matching too by storing this name in the header of the each fiber stack. Then rather than having to execute user code to find a matching handler, we simply have to compare identities. With such a scheme in place we can also eliminate (most) unnecessary context switches, as the search for a matching handler can be performed entirely on the performing fiber stack (e.g. as I do here). If a matching handler happens to be incomplete, then we can fallback to the reperform/resume strategy.

@kayceesrk
Copy link
Contributor

If we were to give a collective name to operations (e.g. as in #12736) then we could potentially optimise the matching too by storing this name in the header of the each fiber stack.

This is a very good point. I wonder if @lpw25 was thinking about such an optimisation in separating effects from operations and having a given handler handle all the operations of a given effect.

@lpw25
Copy link
Contributor Author

lpw25 commented Nov 16, 2023

Yes, that is what I had in mind in the "Supports more efficient implementation" of that PR description.

Copy link
Contributor

@kayceesrk kayceesrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me overall. The PR improves performance of deeply nested handlers by around 10% in my experiments. I support the inclusion of this change.

Change to be done:

  • There are a few minor correctness and clarity fixes that @dustanddreams has pointed out, which need to be fixed before the PR can be merged.
  • backtrace/backtrace_effects_nested.ml test fails on flambda since the reference files have not been promoted. All other tests pass.

The PR also needs a precheck on the INRIA CI to confirm that it works on all the supported platforms. It has been a while since I triggered a precheck. The instructions here seem out of date? https://github.com/ocaml/ocaml/blob/b58aafcdd1335fd0d13ab6214a6a884dfb9b87ab/HACKING.adoc#running-inrias-ci-on-a-publicly-available-git-branch. I don't see a "Build with parameters" option anywhere.

@gasche
Copy link
Member

gasche commented Nov 22, 2023

Visiting https://ci.inria.fr/ocaml/job/precheck/ (logged in with my CI-enabled account), I see a list of icons with text on the left column of the page, one of them being "Build with Parameters".

@xavierleroy
Copy link
Contributor

See run at https://ci.inria.fr/ocaml/job/precheck/914/ . There are some problems:

  • with flambda enabled, on all platforms tested with flambda,
List of failed tests:
    tests/backtrace/'backtrace_effects_nested.ml' with line 57 (native) 
  • On system Z,
List of failed tests:
    tests/effects/'shallow_state_io.ml' with default (native) Process 3153726 got signal 4(Illegal instruction), core dumped
    tests/effects/'test2.ml' with default (native) Process 3153827 got signal 4(Illegal instruction), core dumped
    tests/effects/'partial.ml' with default (native) Process 3153625 got signal 4(Illegal instruction), core dumped
    tests/effects/'test_lazy.ml' with default (native) Process 3153954 got signal 4(Illegal instruction), core dumped
    tests/parallel/'test_issue_11094.ml' with line 5 (native) Process 3198561 got signal 4(Illegal instruction), core dumped
    tests/effects/'test11.ml' with default (native) Process 3153802 got signal 4(Illegal instruction), core dumped
    tests/parallel/'deadcont.ml' with default (native) Process 3197643 got signal 4(Illegal instruction), core dumped
    tests/effects/'test5.ml' with default (native) Process 3153903 got signal 4(Illegal instruction), core dumped
    tests/effects/'test10.ml' with default (native) Process 3153777 got signal 4(Illegal instruction), core dumped
    tests/backtrace/'backtrace_effects_nested.ml' with line 53 (native) Process 3150074 got signal 4(Illegal instruction), core dumped
    tests/effects/'backtrace.ml' with default (native) Process 3153467 got signal 4(Illegal instruction), core dumped
    tests/effects/'test4.ml' with default (native) Process 3153878 got signal 4(Illegal instruction), core dumped
    tests/effects/'shallow_state.ml' with default (native) Process 3153701 got signal 4(Illegal instruction), core dumped
    tests/parallel/'mctest.ml' with line 7 (native) Process 3198170 got signal 4(Illegal instruction), core dumped
    tests/backtrace/'backtrace_effects.ml' with default (native) Process 3150039 got signal 4(Illegal instruction), core dumped
    tests/effects/'sched.ml' with default (native) Process 3153676 got signal 4(Illegal instruction), core dumped
    tests/effects/'used_cont.ml' with default (native) Process 3154036 got signal 4(Illegal instruction), core dumped
    tests/effects/'reperform.ml' with default (native) Process 3153651 got signal 4(Illegal instruction), core dumped

@dustanddreams
Copy link
Contributor

* On system Z,
List of failed tests:
    tests/effects/'shallow_state_io.ml' with default (native) Process 3153726 got signal 4(Illegal instruction), core dumped
    tests/effects/'test2.ml' with default (native) Process 3153827 got signal 4(Illegal instruction), core dumped
    tests/effects/'partial.ml' with default (native) Process 3153625 got signal 4(Illegal instruction), core dumped
    tests/effects/'test_lazy.ml' with default (native) Process 3153954 got signal 4(Illegal instruction), core dumped
    tests/parallel/'test_issue_11094.ml' with line 5 (native) Process 3198561 got signal 4(Illegal instruction), core dumped
    tests/effects/'test11.ml' with default (native) Process 3153802 got signal 4(Illegal instruction), core dumped
    tests/parallel/'deadcont.ml' with default (native) Process 3197643 got signal 4(Illegal instruction), core dumped
    tests/effects/'test5.ml' with default (native) Process 3153903 got signal 4(Illegal instruction), core dumped
    tests/effects/'test10.ml' with default (native) Process 3153777 got signal 4(Illegal instruction), core dumped
    tests/backtrace/'backtrace_effects_nested.ml' with line 53 (native) Process 3150074 got signal 4(Illegal instruction), core dumped
    tests/effects/'backtrace.ml' with default (native) Process 3153467 got signal 4(Illegal instruction), core dumped
    tests/effects/'test4.ml' with default (native) Process 3153878 got signal 4(Illegal instruction), core dumped
    tests/effects/'shallow_state.ml' with default (native) Process 3153701 got signal 4(Illegal instruction), core dumped
    tests/parallel/'mctest.ml' with line 7 (native) Process 3198170 got signal 4(Illegal instruction), core dumped
    tests/backtrace/'backtrace_effects.ml' with default (native) Process 3150039 got signal 4(Illegal instruction), core dumped
    tests/effects/'sched.ml' with default (native) Process 3153676 got signal 4(Illegal instruction), core dumped
    tests/effects/'used_cont.ml' with default (native) Process 3154036 got signal 4(Illegal instruction), core dumped
    tests/effects/'reperform.ml' with default (native) Process 3153651 got signal 4(Illegal instruction), core dumped

You need to test with this fix applied.

@xavierleroy
Copy link
Contributor

xavierleroy commented Nov 22, 2023

CI tests branches, not branches + patches. The fix needs to be merged in this branch.

@kayceesrk
Copy link
Contributor

I confirmed that the PR passes the test suite on the RISC-V backend.

@kayceesrk
Copy link
Contributor

@lpw25 would you be able to complete the pending tasks? I'm happy to approve after that: #12735 (review).

@lpw25
Copy link
Contributor Author

lpw25 commented Nov 28, 2023

Branch rebased, comments addressed and Changes entry added

@kayceesrk
Copy link
Contributor

LGTM. Thanks for this work @lpw25.

@xavierleroy
Copy link
Contributor

"Precheck" CI is green on all platforms. Feel free to merge.

@kayceesrk kayceesrk merged commit c2476e3 into ocaml:trunk Nov 29, 2023
9 checks passed
@kayceesrk kayceesrk mentioned this pull request Mar 18, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
effect handlers need-precheck Performance PR or issues affecting runtime performance of the compiled programs runtime-system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants