
Improvements to tail-call-modulo-construction optimization #316

Draft
wants to merge 13 commits into master

Conversation

@neon64 (Contributor) commented May 24, 2022

  • when dealing with proc leaves that want different outputs to be made outByReference, take the union of all such outputs, rather than reporting an "incompatible protos" error

    • this allows us to make partition() tail-recursive
  • when writing the result of a recursive call into a structure, the current (pseudo) code can look a bit like this:

    alloca(?var)
    append(...,...,outByReference var)
    ?var1 = load(var)
    alloc(..., ?structure)
    mutate(structure,?structure1,..., var1)
    

    we should optimize this to write the result directly into the memory location of the destination field:

    alloc(..., ?structure)
    append(...,...,outByReference var)
    mutate(structure,?structure1,..., takeReference var)
    
  • added some more tests and corner cases, some of which currently highlight deficiencies in the analysis (hence this PR is a WIP until I can hopefully fix more of them)
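The pseudocode transformation in the second bullet can be sketched as a peephole pass over a toy instruction list. This is a hypothetical simplification, not the compiler's actual LPVM representation; the `Prim` constructors and variable names are all invented for illustration:

```haskell
-- Hypothetical, simplified prim representation (not the real Wybe LPVM).
data Prim
  = Alloca String            -- allocate a stack slot for an outByReference output
  | Call String              -- recursive call writing outByReference into the slot
  | Load String String       -- load the slot's contents into a fresh variable
  | Alloc String             -- heap-allocate the destination structure
  | Mutate String String     -- write a variable's value into a structure field
  | MutateRef String String  -- write via the field's address ("takeReference")
  deriving (Eq, Show)

-- Rewrite alloca -> call -> load -> alloc -> mutate so the recursive call
-- writes directly into the destination field of the fresh structure.
rewrite :: [Prim] -> [Prim]
rewrite (Alloca v : Call v' : Load v'' v1 : Alloc s : Mutate s' x : rest)
  | v == v' && v == v'' && x == v1 && s == s'
  = Alloc s : Call v : MutateRef s v : rewrite rest
rewrite (p : ps) = p : rewrite ps
rewrite []       = []
```

In this sketch the alloca and the load disappear entirely, mirroring the before/after pseudocode above.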

Making this a draft for now, while I focus on the report.

…Ref args

this came up when looking at this example of `partition`, where it wouldn't get TCMC-optimized since each leaf branch wanted a different output argument to be outByReference. Instead, if we make *both* args outByReference, then the function is eligible for the optimization.
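The union-of-outputs idea can be sketched as follows. This is an illustrative model only; `Proto` and both function names are made up, not the compiler's actual types:

```haskell
import qualified Data.Set as Set

-- Each leaf's choice of which output arguments should be outByReference.
type Proto = Set.Set String

-- Previous behaviour (sketch): leaves that disagree are an error.
combineStrict :: [Proto] -> Either String Proto
combineStrict []       = Right Set.empty
combineStrict (p : ps)
  | all (== p) ps = Right p
  | otherwise     = Left "incompatible protos"

-- New behaviour (sketch): take the union, so partition's two leaves, one
-- wanting ?yes outByReference and the other ?no, agree on {no, yes}.
combineUnion :: [Proto] -> Proto
combineUnion = Set.unions
```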
this is a further optimization on top of the existing TCMC optimisation, which tries to optimize the callers of these tail-call-optimized procs. We identify the pattern `alloca` -> call -> `load` -> `mutate`, which occurs quite often. Instead we calculate the address first and make the call passing in the final address to write to (i.e. writing *directly into the structure*).

# UNSAFE
# this forcibly writes without checking whether it is the correct variant or not
def {inline} unsafe_update_tail(!x:list(T), field:list(T)) {
@jimbxb (Collaborator) commented May 25, 2022

You should be able to do this without the use of a helper

?ls = [1]
tail(!ls, [2], _) # ignore if the test failed
# ls = [1,2]

@neon64 (Contributor, Author):

Hey, thanks for the suggestion. I tried it out, but alas I don't think it's quite what I'm trying to do.

I want to check that the function isn't TCMC-optimized if there happen to be two or more mutates writing to the same field, e.g. both yes0 and no0 are written to the field at offset 8 here:

1:
            tcmc_fields_alias.partition<0>(~p##0:(T, ?wybe.bool), ~t##0:wybe.list(T), ?yes0##0:wybe.list(T), ?no0##0:wybe.list(T)) #2 @tcmc_fields_alias:12:17
            foreign lpvm alloc(16:wybe.int, ?tmp#11##0:wybe.list(T)) @list:9:27
            foreign lpvm mutate(~tmp#11##0:wybe.list(T), ?tmp#12##0:wybe.list(T), 0:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~h##0:T) @list:9:27
            foreign lpvm mutate(~tmp#12##0:wybe.list(T), ?tmp#0##0:wybe.list(T), 8:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~yes0##0:wybe.list(T)) @list:9:27
            foreign lpvm mutate(~tmp#0##0:wybe.list(T), ?yes##1:wybe.list(T), 8:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~no0##0:wybe.list(T)) @tcmc_fields_alias:6:5
            foreign llvm move(0:wybe.list(T), ?no##0:wybe.list(T)) @tcmc_fields_alias:20:18

I agree tail(!ls, [2], _) compiles and works fine, but the LPVM it produces isn't quite what I want to test:

1:
            tcmc_fields_alias.partition<0>(~p##0:(T, ?wybe.bool), ~t##0:wybe.list(T), ?yes0##0:wybe.list(T), ?no0##0:wybe.list(T)) #2 @tcmc_fields_alias:10:17
            foreign lpvm alloc(16:wybe.int, ?tmp#12##0:wybe.list(T)) @list:9:27
            foreign lpvm mutate(~tmp#12##0:wybe.list(T), ?tmp#13##0:wybe.list(T), 0:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~h##0:T) @list:9:27
            foreign lpvm mutate(~tmp#13##0:wybe.list(T), ?tmp#0##0:wybe.list(T), 8:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~yes0##0:wybe.list(T)) @list:9:27
            foreign llvm icmp_ne(tmp#0##0:wybe.int, 0:wybe.int, ?tmp#16##0:wybe.bool)
            case ~tmp#16##0:wybe.bool of
            0:
                foreign llvm move(~tmp#0##0:wybe.list(T), ?yes##1:wybe.list(T))
                foreign llvm move(0:wybe.list(T), ?no##0:wybe.list(T)) @tcmc_fields_alias:18:18

            1:
                foreign lpvm {noalias} mutate(~tmp#0##0:wybe.list(T), ?yes##1:wybe.list(T), 8:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~no0##0:wybe.list(T)) @list:9:27
                foreign llvm move(0:wybe.list(T), ?no##0:wybe.list(T)) @tcmc_fields_alias:18:18

The extra icmp_ne and the extra branching already disqualify this function from TCMC optimisation.
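The disqualification being described can be modelled as a simple eligibility check. This is a toy sketch, not the actual LastCallAnalysis code, and the instruction alphabet is invented: in this model, everything after the last recursive call must be a move or a mutate, so any comparison or fork rules the proc out.

```haskell
-- Toy instruction alphabet (invented for illustration).
data Instr = RecCall | Move | MutateI | Icmp | Fork
  deriving (Eq, Show)

-- A proc body is TCMC-eligible (in this sketch) if everything after the
-- recursive call is just moves/mutates -- no comparisons, no forks.
tcmcEligible :: [Instr] -> Bool
tcmcEligible body = case break (== RecCall) body of
  (_, RecCall : after) -> all (`elem` [Move, MutateI]) after
  _                    -> False
```

The second LPVM listing above fails this check: the icmp_ne (and the fork it feeds) appears after the recursive call.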

@jimbxb (Collaborator) commented May 28, 2022

That's unfortunate. It looks like the mutator is inlined, causing the branch.

I wonder if getter/mutator procs being marked for inlining is the right idea. I agree that for single-constructor types they should be, however.

I imagine you'd be able to use the ProcVariant tags somehow to recognise "mutation chains" also, but now that the proc has been inlined you don't get that information.

EDIT: also, that LPVM code looks fishy. We should be able to recognise that the same memory address is overwritten, and hence remove the former instruction. If re-ordered, with the former mutate performed before the previous one, then it's a little more interesting to analyse.

@pschachte (Owner):

In this case, the LPVM code is comparing the output of a mutate instruction with 0. A mutate instruction should never return 0 (Wybe assumes that memory below smallestAllocatedAddress (64K) cannot be written), so that icmp_ne comparison should always return 1, which would allow BodyBuilder to eliminate the fork. Chris, I assume that if the fork were removed, leaving only the 1 branch, then the move(0, ?no##0) would be moved before the recursive call, leaving only the mutate, which your transformation would handle?

You're right, James, that a mutate instruction with the destructive flag set always returns the same address as it receives as input, but we don't want to say that in the LPVM code itself, as that's the impure view of what's happening, and then it would appear that the mutate instruction isn't needed. Instead we want to make use of the fact that two mutate instructions (with or without the destructive flag set) that mutate non-overlapping fields can be exchanged, or rather the offsets, field sizes, and values can be swapped, keeping the address arguments the same.
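The exchange described in the last paragraph could look something like this. It is a minimal sketch with a simplified mutate record (the field names are illustrative, not real LPVM): the offset, size, and value payloads swap, while the address arguments, and hence the dataflow through the chain, stay put.

```haskell
-- Simplified mutate: input address, output address, byte offset,
-- field size, and the value written (illustrative, not real LPVM).
data Mut = Mut
  { addrIn  :: String
  , addrOut :: String
  , offset  :: Int
  , size    :: Int
  , value   :: String
  } deriving (Eq, Show)

-- Do the two mutates touch overlapping byte ranges?
overlaps :: Mut -> Mut -> Bool
overlaps a b = offset a < offset b + size b && offset b < offset a + size a

-- Exchange two adjacent mutates in a chain: swap offsets, sizes and
-- values, but keep the address arguments (and thus the dataflow) intact.
exchange :: Mut -> Mut -> Maybe (Mut, Mut)
exchange a b
  | overlaps a b = Nothing
  | otherwise    = Just
      ( a { offset = offset b, size = size b, value = value b }
      , b { offset = offset a, size = size a, value = value a } )
```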

@pschachte (Owner):

BTW, a fork following the recursive call need not rule out TCMC optimisation, if the optimisation can be applied in some or all of the branches. For example, it should be possible to optimise code like:

?result0 = recursive_call(xs)
?result = if{bla:: result0 | else:: [a|result0]}

(assuming bla doesn't depend on result0 and the compiler generates suitable LPVM).

@neon64 (Contributor, Author):

> Chris, I assume that if the fork were removed, leaving only the 1 branch, then the move(0, ?no##0) would be moved before the recursive call, leaving only the mutate, which your transformation would handle?

Yep, the first stage of the transformation should move the move() above the recursive call.

> For example, it should be possible to optimise code like:
>
> ?result0 = recursive_call(xs)
> ?result = if{bla:: result0 | else:: [a|result0]}

Indeed this should be possible; I have noted this limitation down in #320.
Alas, I don't think I will actually get to solve it, at least until after the report is due. I've instead put time into a solution for #322 and #322 (which is part of the implementation in this PR).

@pschachte (Owner):

> Alas I don't think I will actually get to solve this, at least until after the report is due.

Oh, no, definitely just focus on the report. I'd suggest not working on the code unless it's needed for the report.

After the semester is over, if you feel like pushing this further, of course that would be warmly welcomed. :-)

pub def appendstuff(l: list(int), l2: list(int), ?result: combined(int)) {
    ?l3 = l ,, l2
    ?result = combined(123, [])
    !result^combined.y = l3
}
@neon64 (Contributor, Author) commented May 26, 2022

  • combined.y should not be needed - check

neon64 added 10 commits May 26, 2022 23:01
remaining test cases I'd like to fix are tcmc_unrolled_map, and tcmc_mutates_in_wrong_order. Both of these should be fixed by using a generalisation of "mutate chains" - namely "mutation graphs", which I plan to implement and detail in the report
final missing step is to convert a mutate graph back into a (sorted) list of prims
(PrimCall _ spec' _ _) | spec == spec' -> True
(PrimCall _ spec' _ _) | procSpecMod spec == procSpecMod spec' &&
procSpecName spec == procSpecName spec' &&
procSpecID spec == procSpecID spec' -> True
_ -> False
@neon64 (Contributor, Author):

I believe this change should fix #336 - we check whether the caller and callee match, ignoring the specz version.
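The matching described here can be sketched like so, using a hypothetical stand-in for the compiler's proc identifier (the real ProcSpec type differs): two calls denote the same proc when module, name, and ID agree, with any specz version ignored.

```haskell
-- Hypothetical stand-in for the compiler's proc identifier.
data ProcSpec = ProcSpec
  { psMod   :: String
  , psName  :: String
  , psID    :: Int
  , psSpecz :: Maybe String  -- specialisation version, ignored when matching
  } deriving (Eq, Show)

-- Caller and callee denote the same proc if module, name and ID agree,
-- regardless of which specialised version either one names.
sameProc :: ProcSpec -> ProcSpec -> Bool
sameProc a b = psMod a == psMod b
            && psName a == psName b
            && psID a == psID b
```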

@jimbxb (Collaborator):

Will this also modify the multi-specz bodies?

@neon64 (Contributor, Author):

I believe the specz bodies are only generated right at the very end, after this LastCallAnalysis pass. So while the LastCallAnalysis pass won't directly touch the multi-specz bodies, the multi-specz bodies get generated from the post-LastCallAnalysis unspecialized body, so importantly they will have the correct signature (with e.g.: outByReference params).

On a related note: I feel like there could be a case where a proc is only tail-call optimizable in a particular specialized version, but not in the unspecialized version. In that case we'd have to run LastCallAnalysis after the multi-specz bodies are generated in order to fully optimize the proc. But for many other examples (e.g. append), all the multi-specz versions end up getting TCMC-optimized the same way, so running LastCallAnalysis after generating multi-specz bodies would just increase the number of times lastCallAnalyseProc must be called, for not much gain.

@jimbxb (Collaborator):

Okay, that makes sense.

@pschachte (Owner):

I'm not sure this is right. Multiple specialisation can change the code, so it's possible that TCMC would be applicable for a specialised version but not the non-specialised version. Remember multiple specialisation changes some non-destructive updates to be destructive, and changes some allocations to structure reuses. Longer term, it's even possible in principle that TCMC would be applicable to a proc but not to its specialisation.

I think the right solution is to separately apply TCMC analysis/transformation to specialised procs.

@neon64 (Contributor, Author):

> Longer term, it's even possible in principle that TCMC would be applicable to a proc but not to its specialisation.

Ah I see how this would be problematic. I'll have to rework this PR before moving it out of "draft" status.

@jimbxb (Collaborator) commented Jun 28, 2022

I am not sure such a gradient is possible within the current multi-specz framework. All specialised procs must have the same interface as the original version of the procedure. If TCMC or anything else wants to change the flow/whatever of a parameter, this cannot be described with a specialisation.

@pschachte (Owner) commented Jul 7, 2022

Good point. That makes multi-specz easier to manage, but it doesn't really need to be that way. Generalising it might be a lot of work, though.

@jimbxb (Collaborator):

Yeah, I've been looking at the multi-specz code recently, trying to digest it. It seems very specialised for what it currently achieves, which is fair enough.

I'm just not sure how to extend it to support these types of specialisations.
