Store both ends of the stack chain in continuations #12735

lpw25 · 2023-11-13T11:25:45Z

This PR changes the representation of continuations to include pointers to both ends of the chain of stacks that it represents. Currently, when resuming a continuation we must walk the whole chain to find the other end, making continue linear in the length of the chain. With this PR the walk is not needed and continue is constant time.

Keeping both ends of the chain in the continuation also allows for a reperform operation that takes a continuation as input. Such an operation could technically be implemented now, but would produce quadratic behaviour. I don't include this operation in this PR, but it is implemented in a later one.

The other end of the chain is stored tagged in the second field of the continuation. The behaviour of the first field is unchanged. The underlying perform primitive now takes 4 parameters and we project out the end of the chain and pass it in.

The bytecode interpreter and all the native code backends have been updated to remove the walk over the stack chain. I've only tested the bytecode and x86_64 implementations though. I also just rebased this over some TSAN related changes, and probably someone who understands those should check things still look right.

runtime/amd64.S

kayceesrk · 2023-11-14T10:56:37Z

Currently, when resuming a continuation we must walk the whole chain to find the other end, making perform linear in the length of the chain. With this PR the walk is not needed and perform is constant time.

The wording is a bit confusing. I agree that resuming a continuation is linear in the length of the fiber chain. This PR makes it constant time as we hold the handle to the last fiber. However, perform was and is still is linear time, isn't it? We still need to dynamically check for a matching handler in the stack of handlers one at a time.

lpw25 · 2023-11-14T17:35:11Z

Yes, sorry, I meant continue/discontinue (or resume when its a primitive). I'll fix the PR text.

kayceesrk · 2023-11-15T04:55:30Z

I am yet to review this change, but I am broadly in favour of the idea.

We have discussed this optimisation, holding a pointer to both ends of the linked list of fibers, in the continuation as a means of speeding up resumption. Good to see this implemented.

Many papers that propose efficient compilation of effect handlers by translating them through CPS, evidence translation, etc., compare the performance of their system against OCaml. The chosen benchmark is usually a microbenchmark with a deeply nested handler stack (which makes sense if the aim is to effect-handler-oriented programming without relying on built-in features in OCaml such as ref cells). OCaml is usually slower than the other systems on such benchmarks. There are two parts to the slowdown observed in OCaml compared to other compilation methods:

OCaml's need to look for a matching handler one at a time from the top of the handler stack. In other works, they may have enough information to jump directly to the matching handler.
OCaml's need to perform linear walk through the list of fibers for resumption.

While (1) cannot be easily avoided in current scheme, (2) certainly can be and this PR fixes (2). This would make OCaml's performance closer to other systems where there is a deeply nested handler stack. This change does not modify the user-facing API and technically is a very small change.

I'll leave a detailed review.

The PR needs change in every architecture where OCaml 5 is supported. It would be great if @dustanddreams can help review and potentially implement the missing bits.

For TSan bits that may be affected, may I ask @fabbing or @OlivierNicole to see whether the changes are sensible?

runtime/amd64.S

runtime/s390x.S

dustanddreams · 2023-11-15T07:50:29Z

The runtime/power.S bits are missing.

The following diff appears to work for me.

diff --git runtime/power.S runtime/power.S
index 7afefe8735..2fe6fd02ec 100644
--- runtime/power.S
+++ runtime/power.S
@@ -719,31 +719,28 @@ FUNCTION caml_reperform
 ENDFUNCTION caml_reperform
 
 FUNCTION caml_resume
   /* r3: new fiber
      r4: fun
-     r5: arg */
+     r5: arg
+     r6: last_fiber */
         addi    3, 3, -1 /* r3 = Ptr_val(r3) */
         ld      12, 0(4) /* r12 = code pointer */
         mtctr   12       /* CTR = code pointer */
     /* Check if stack is null, then already used */
         cmpdi   3, 0
-        beq     2f
-    /* Find end of list of stacks (put in r7) */
-        mr      TMP, 3
-1:      ld      7, Stack_handler(TMP)
-        ld      TMP, Handler_parent(7)
-        cmpdi   TMP, 0
-        bne     1b
+        beq     1f
     /* Add current stack to the end */
+        addi    6, 6, -1 /* r6 = Ptr_val(r6) */
+        ld      7, Stack_handler(6)
         ld      8, Caml_state(current_stack)
         std     8, Handler_parent(7)
     /* Switch stacks and run code */
         SWITCH_OCAML_STACKS 8, 3
         mr      3, 5
         bctr
-2:      Addrglobal(C_CALL_FUN, caml_raise_continuation_already_resumed)
+1:      Addrglobal(C_CALL_FUN, caml_raise_continuation_already_resumed)
         b       .Lcaml_c_call
 ENDFUNCTION caml_resume
 
 /* Run a function on a new stack, then either
    return the value or invoke exception handler */

gasche · 2023-11-15T08:05:34Z

@kayceesrk your comment on the performance of OCaml native handlers compared to related works in deeply-nested scenarios is interesting. It would be nice to invite someome motivated to take a benchmark from the literature, and re-run the OCaml measurements before and after the present PR. Do you have a more precise paper/benchmark in mind that we could point someone to? (I am not up to date on recent handlers-compilation work, so myself I would just look at the latest handler-optimisation work by Daan Leijen's group, but there may be better choices.)

kayceesrk · 2023-11-15T12:54:17Z

Daan Leijen's work on Koka and Jonathan Brachthäuser work on Effekt are the ones that I have in mind. @dhil are you aware of papers where deeply nested handlers were evaluated?

A deeply nested state handler microbenchmark should be easy to write. Let me give this a go.

dhil · 2023-11-15T13:07:25Z

Daan Leijen's work on Koka and Jonathan Brachthäuser work on Effekt are the ones that I have in mind. @dhil are you aware of papers where deeply nested handlers were evaluated?

I think Handlers in Action may feature such a benchmark. There is my own UNIX operating system implementation (built from ~20 handlers), though I do not yet have an OCaml implementation of it. The C++-effects paper has something with generators, but I cannot remember offhand whether they measure wrt. to the depth of the call stack or handler stack.

I have a contrived microbenchmark for this sort of thing in our WasmFX implementation. The program installs a toplevel handler which handles op and then nests N handlers underneath it which do not handle op, at the bottom we perform op. I can transliterate it to OCaml, if there is interest.

kayceesrk · 2023-11-16T09:11:44Z

I wrote up the microbenchmark here: https://gist.github.com/kayceesrk/92e029ee70ad8cb98c5ef34e88b48959.

The benchmark installs a number of nested integer state handlers (depth) with an outermost handler for string state. The computation performs a number of string state get and set operations that need to traverse through the entire handler stack for each operation.

The program takes the number integer state handlers (depth) as the first argument and the number of operations to be done (ops) as the second. The results were obtained using hyperfine on an M2 mac.

$ hyperfine -P depth 0 8 './deep_state_trunk.exe $((2**{depth})) 500_000' \
    './deep_state_pr12735.exe $((2**{depth})) 500_000' --export-markdown out.md

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./deep_state_trunk.exe $((2**0)) 500_000`	42.5 ± 0.9	42.0	47.4	1.00
`./deep_state_pr12735.exe $((2**0)) 500_000`	43.2 ± 1.3	42.5	52.3	1.02 ± 0.04
`./deep_state_trunk.exe $((2**1)) 500_000`	55.0 ± 0.9	54.5	60.4	1.30 ± 0.03
`./deep_state_pr12735.exe $((2**1)) 500_000`	55.2 ± 1.2	54.3	60.8	1.30 ± 0.04
`./deep_state_trunk.exe $((2**2)) 500_000`	77.0 ± 0.3	76.4	78.2	1.81 ± 0.04
`./deep_state_pr12735.exe $((2**2)) 500_000`	76.9 ± 1.1	76.0	83.1	1.81 ± 0.05
`./deep_state_trunk.exe $((2**3)) 500_000`	129.1 ± 0.3	128.6	130.2	3.04 ± 0.06
`./deep_state_pr12735.exe $((2**3)) 500_000`	129.0 ± 3.3	127.6	142.4	3.04 ± 0.10
`./deep_state_trunk.exe $((2**4)) 500_000`	243.3 ± 0.4	242.7	244.2	5.73 ± 0.12
`./deep_state_pr12735.exe $((2**4)) 500_000`	234.2 ± 2.0	233.2	240.3	5.52 ± 0.12
`./deep_state_trunk.exe $((2**5)) 500_000`	466.8 ± 2.2	464.9	472.7	10.99 ± 0.23
`./deep_state_pr12735.exe $((2**5)) 500_000`	438.3 ± 6.4	433.7	454.7	10.32 ± 0.26
`./deep_state_trunk.exe $((2**6)) 500_000`	921.0 ± 5.7	915.8	931.2	21.69 ± 0.46
`./deep_state_pr12735.exe $((2**6)) 500_000`	855.4 ± 3.2	850.5	860.8	20.15 ± 0.42
`./deep_state_trunk.exe $((2**7)) 500_000`	1936.8 ± 21.1	1922.7	1994.1	45.62 ± 1.05
`./deep_state_pr12735.exe $((2**7)) 500_000`	1714.7 ± 29.8	1692.0	1776.8	40.39 ± 1.08
`./deep_state_trunk.exe $((2**8)) 500_000`	4406.1 ± 48.4	4344.7	4470.9	103.78 ± 2.40
`./deep_state_pr12735.exe $((2**8)) 500_000`	3900.0 ± 31.2	3860.0	3953.3	91.86 ± 2.01

The PR makes the program run ~10% faster for deep handler stack.

I should note that this 10% is a reasonable improvement. On trunk, the search for the right handler as well as the resumption is linear in the depth of the handler stack (length of the linked list of fibers). In this PR, the former is still linear whereas the latter is constant time. However, the former has a larger constant factor since the search for the matching handler requires executing OCaml code at every handler. The latter is just a traversal of the linked list of fibers, handwritten in assembly, which is now replaced with a constant time operation.

dhil · 2023-11-16T09:29:58Z

However, the former has a larger constant factor since the search for the matching handler requires executing OCaml code at every handler.

If we were to give a collective name to operations (e.g. as in #12736) then we could potentially optimise the matching too by storing this name in the header of the each fiber stack. Then rather than having to execute user code to find a matching handler, we simply have to compare identities. With such a scheme in place we can also eliminate (most) unnecessary context switches, as the search for a matching handler can be performed entirely on the performing fiber stack (e.g. as I do here). If a matching handler happens to be incomplete, then we can fallback to the reperform/resume strategy.

kayceesrk · 2023-11-16T10:19:14Z

If we were to give a collective name to operations (e.g. as in #12736) then we could potentially optimise the matching too by storing this name in the header of the each fiber stack.

This is a very good point. I wonder if @lpw25 was thinking about such an optimisation in separating effects from operations and having a given handler handle all the operations of a given effect.

lpw25 · 2023-11-16T12:26:50Z

Yes, that is what I had in mind in the "Supports more efficient implementation" of that PR description.

kayceesrk

The code looks good to me overall. The PR improves performance of deeply nested handlers by around 10% in my experiments. I support the inclusion of this change.

Change to be done:

There are a few minor correctness and clarity fixes that @dustanddreams has pointed out, which need to be fixed before the PR can be merged.
backtrace/backtrace_effects_nested.ml test fails on flambda since the reference files have not been promoted. All other tests pass.

The PR also needs a precheck on the INRIA CI to confirm that it works on all the supported platforms. It has been a while since I triggered a precheck. The instructions here seem out of date? https://github.com/ocaml/ocaml/blob/b58aafcdd1335fd0d13ab6214a6a884dfb9b87ab/HACKING.adoc#running-inrias-ci-on-a-publicly-available-git-branch. I don't see a "Build with parameters" option anywhere.

gasche · 2023-11-22T15:44:18Z

Visiting https://ci.inria.fr/ocaml/job/precheck/ (logged in with my CI-enabled account), I see a list of icons with text on the left column of the page, one of them being "Build with Parameters".

xavierleroy · 2023-11-22T15:58:03Z

See run at https://ci.inria.fr/ocaml/job/precheck/914/ . There are some problems:

with flambda enabled, on all platforms tested with flambda,

List of failed tests:
    tests/backtrace/'backtrace_effects_nested.ml' with line 57 (native)

On system Z,

List of failed tests:
    tests/effects/'shallow_state_io.ml' with default (native) Process 3153726 got signal 4(Illegal instruction), core dumped
    tests/effects/'test2.ml' with default (native) Process 3153827 got signal 4(Illegal instruction), core dumped
    tests/effects/'partial.ml' with default (native) Process 3153625 got signal 4(Illegal instruction), core dumped
    tests/effects/'test_lazy.ml' with default (native) Process 3153954 got signal 4(Illegal instruction), core dumped
    tests/parallel/'test_issue_11094.ml' with line 5 (native) Process 3198561 got signal 4(Illegal instruction), core dumped
    tests/effects/'test11.ml' with default (native) Process 3153802 got signal 4(Illegal instruction), core dumped
    tests/parallel/'deadcont.ml' with default (native) Process 3197643 got signal 4(Illegal instruction), core dumped
    tests/effects/'test5.ml' with default (native) Process 3153903 got signal 4(Illegal instruction), core dumped
    tests/effects/'test10.ml' with default (native) Process 3153777 got signal 4(Illegal instruction), core dumped
    tests/backtrace/'backtrace_effects_nested.ml' with line 53 (native) Process 3150074 got signal 4(Illegal instruction), core dumped
    tests/effects/'backtrace.ml' with default (native) Process 3153467 got signal 4(Illegal instruction), core dumped
    tests/effects/'test4.ml' with default (native) Process 3153878 got signal 4(Illegal instruction), core dumped
    tests/effects/'shallow_state.ml' with default (native) Process 3153701 got signal 4(Illegal instruction), core dumped
    tests/parallel/'mctest.ml' with line 7 (native) Process 3198170 got signal 4(Illegal instruction), core dumped
    tests/backtrace/'backtrace_effects.ml' with default (native) Process 3150039 got signal 4(Illegal instruction), core dumped
    tests/effects/'sched.ml' with default (native) Process 3153676 got signal 4(Illegal instruction), core dumped
    tests/effects/'used_cont.ml' with default (native) Process 3154036 got signal 4(Illegal instruction), core dumped
    tests/effects/'reperform.ml' with default (native) Process 3153651 got signal 4(Illegal instruction), core dumped

dustanddreams · 2023-11-22T16:08:32Z

* On system Z,

List of failed tests:
    tests/effects/'shallow_state_io.ml' with default (native) Process 3153726 got signal 4(Illegal instruction), core dumped
    tests/effects/'test2.ml' with default (native) Process 3153827 got signal 4(Illegal instruction), core dumped
    tests/effects/'partial.ml' with default (native) Process 3153625 got signal 4(Illegal instruction), core dumped
    tests/effects/'test_lazy.ml' with default (native) Process 3153954 got signal 4(Illegal instruction), core dumped
    tests/parallel/'test_issue_11094.ml' with line 5 (native) Process 3198561 got signal 4(Illegal instruction), core dumped
    tests/effects/'test11.ml' with default (native) Process 3153802 got signal 4(Illegal instruction), core dumped
    tests/parallel/'deadcont.ml' with default (native) Process 3197643 got signal 4(Illegal instruction), core dumped
    tests/effects/'test5.ml' with default (native) Process 3153903 got signal 4(Illegal instruction), core dumped
    tests/effects/'test10.ml' with default (native) Process 3153777 got signal 4(Illegal instruction), core dumped
    tests/backtrace/'backtrace_effects_nested.ml' with line 53 (native) Process 3150074 got signal 4(Illegal instruction), core dumped
    tests/effects/'backtrace.ml' with default (native) Process 3153467 got signal 4(Illegal instruction), core dumped
    tests/effects/'test4.ml' with default (native) Process 3153878 got signal 4(Illegal instruction), core dumped
    tests/effects/'shallow_state.ml' with default (native) Process 3153701 got signal 4(Illegal instruction), core dumped
    tests/parallel/'mctest.ml' with line 7 (native) Process 3198170 got signal 4(Illegal instruction), core dumped
    tests/backtrace/'backtrace_effects.ml' with default (native) Process 3150039 got signal 4(Illegal instruction), core dumped
    tests/effects/'sched.ml' with default (native) Process 3153676 got signal 4(Illegal instruction), core dumped
    tests/effects/'used_cont.ml' with default (native) Process 3154036 got signal 4(Illegal instruction), core dumped
    tests/effects/'reperform.ml' with default (native) Process 3153651 got signal 4(Illegal instruction), core dumped

You need to test with this fix applied.

xavierleroy · 2023-11-22T16:10:11Z

CI tests branches, not branches + patches. The fix needs to be merged in this branch.

kayceesrk · 2023-11-23T08:24:15Z

I confirmed that the PR passes the test suite on the RISC-V backend.

kayceesrk · 2023-11-27T04:19:28Z

@lpw25 would you be able to complete the pending tasks? I'm happy to approve after that: #12735 (review).

lpw25 · 2023-11-28T19:39:46Z

Branch rebased, comments addressed and Changes entry added

kayceesrk · 2023-11-29T05:06:29Z

LGTM. Thanks for this work @lpw25.

xavierleroy · 2023-11-29T12:50:35Z

"Precheck" CI is green on all platforms. Feel free to merge.

dustanddreams reviewed Nov 13, 2023

View reviewed changes

runtime/amd64.S Outdated Show resolved Hide resolved

lpw25 mentioned this pull request Nov 13, 2023

An alternative effect API #12736

Open

dustanddreams reviewed Nov 15, 2023

View reviewed changes

runtime/amd64.S Outdated Show resolved Hide resolved

dustanddreams reviewed Nov 15, 2023

View reviewed changes

runtime/s390x.S Outdated Show resolved Hide resolved

gasche added runtime-system Performance PR or issues affecting runtime performance of the compiled programs effect handlers labels Nov 15, 2023

kayceesrk requested changes Nov 22, 2023

View reviewed changes

kayceesrk added the need-precheck label Nov 22, 2023

lpw25 added 3 commits November 28, 2023 19:38

Store both ends of the stack chain in continuations

f80990e

Bootstrap

9e44fbd

Add Changes entry

3437f2e

lpw25 force-pushed the continuation-last-fiber branch from d0f918f to 3437f2e Compare November 28, 2023 19:39

kayceesrk approved these changes Nov 29, 2023

View reviewed changes

kayceesrk merged commit c2476e3 into ocaml:trunk Nov 29, 2023
9 checks passed

hhugo mentioned this pull request Dec 4, 2023

Support for OCaml trunk ocsigen/js_of_ocaml#1487

Draft

11 tasks

kayceesrk mentioned this pull request Dec 8, 2023

make dls thread local #12724

Open

kit-ty-kate mentioned this pull request Mar 5, 2024

OCaml 5.2 support dhil/ocaml-multicont#6

Closed

kayceesrk mentioned this pull request Mar 18, 2024

Add effect syntax #12309

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Store both ends of the stack chain in continuations #12735

Store both ends of the stack chain in continuations #12735

lpw25 commented Nov 13, 2023 •

edited

kayceesrk commented Nov 14, 2023

lpw25 commented Nov 14, 2023

kayceesrk commented Nov 15, 2023 •

edited

dustanddreams commented Nov 15, 2023

gasche commented Nov 15, 2023

kayceesrk commented Nov 15, 2023 •

edited

dhil commented Nov 15, 2023 •

edited

kayceesrk commented Nov 16, 2023 •

edited

dhil commented Nov 16, 2023

kayceesrk commented Nov 16, 2023

lpw25 commented Nov 16, 2023

kayceesrk left a comment •

edited

gasche commented Nov 22, 2023

xavierleroy commented Nov 22, 2023

dustanddreams commented Nov 22, 2023

xavierleroy commented Nov 22, 2023 •

edited

kayceesrk commented Nov 23, 2023

kayceesrk commented Nov 27, 2023

lpw25 commented Nov 28, 2023

kayceesrk commented Nov 29, 2023

xavierleroy commented Nov 29, 2023

Store both ends of the stack chain in continuations #12735

Store both ends of the stack chain in continuations #12735

Conversation

lpw25 commented Nov 13, 2023 • edited

kayceesrk commented Nov 14, 2023

lpw25 commented Nov 14, 2023

kayceesrk commented Nov 15, 2023 • edited

dustanddreams commented Nov 15, 2023

gasche commented Nov 15, 2023

kayceesrk commented Nov 15, 2023 • edited

dhil commented Nov 15, 2023 • edited

kayceesrk commented Nov 16, 2023 • edited

dhil commented Nov 16, 2023

kayceesrk commented Nov 16, 2023

lpw25 commented Nov 16, 2023

kayceesrk left a comment • edited

Choose a reason for hiding this comment

gasche commented Nov 22, 2023

xavierleroy commented Nov 22, 2023

dustanddreams commented Nov 22, 2023

xavierleroy commented Nov 22, 2023 • edited

kayceesrk commented Nov 23, 2023

kayceesrk commented Nov 27, 2023

lpw25 commented Nov 28, 2023

kayceesrk commented Nov 29, 2023

xavierleroy commented Nov 29, 2023

lpw25 commented Nov 13, 2023 •

edited

kayceesrk commented Nov 15, 2023 •

edited

kayceesrk commented Nov 15, 2023 •

edited

dhil commented Nov 15, 2023 •

edited

kayceesrk commented Nov 16, 2023 •

edited

kayceesrk left a comment •

edited

xavierleroy commented Nov 22, 2023 •

edited