
Improvements to tail-call-modulo-construction optimization #316

Draft
wants to merge 13 commits into master

Conversation

@neon64 (Contributor) commented May 24, 2022

  • when dealing with proc leaves that want different outputs to be made outByReference, take the union of all such outputs, rather than reporting an "incompatible protos" error

    • this allows us to make partition() tail-recursive
  • when writing the result of a recursive call into a structure, the current (pseudo) code can look a bit like this:

    alloca(?var)
    append(...,...,outByReference var)
    ?var1 = load(var)
    alloc(..., ?structure)
    mutate(structure,?structure1,..., var1)
    

    we should optimize this to write the result directly into the memory location of the destination field:

    alloc(..., ?structure)
    append(...,...,outByReference var)
    mutate(structure,?structure1,..., takeReference var)
    
  • added some more tests and corner cases, some of which currently highlight deficiencies in the analysis (hence this PR is a WIP until I can hopefully fix more of them)
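The pseudocode transformation in the second bullet can be sketched as a peephole pass over a toy instruction list. This is a hypothetical simplification, not the compiler's actual LPVM representation; the `Prim` constructors and variable names are all invented for illustration:

```haskell
-- Hypothetical, simplified prim representation (not the real Wybe LPVM).
data Prim
  = Alloca String            -- allocate a stack slot for an outByReference output
  | Call String              -- recursive call writing outByReference into the slot
  | Load String String       -- load the slot's contents into a fresh variable
  | Alloc String             -- heap-allocate the destination structure
  | Mutate String String     -- write a variable's value into a structure field
  | MutateRef String String  -- write via the field's address ("takeReference")
  deriving (Eq, Show)

-- Rewrite alloca -> call -> load -> alloc -> mutate so the recursive call
-- writes directly into the destination field of the fresh structure.
rewrite :: [Prim] -> [Prim]
rewrite (Alloca v : Call v' : Load v'' v1 : Alloc s : Mutate s' x : rest)
  | v == v' && v == v'' && x == v1 && s == s'
  = Alloc s : Call v : MutateRef s v : rewrite rest
rewrite (p : ps) = p : rewrite ps
rewrite []       = []
```

In this sketch the alloca and the load disappear entirely, mirroring the before/after pseudocode above.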

Making this a draft for now, while I focus on the report.

…Ref args

this came up when looking at this example of `partition`, where it wouldn't get TCMC-optimized since each leaf branch wanted a different output argument to be outByReference. Instead, if we make *both* args outByReference, then the function is eligible for the optimization.
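The union-of-outputs idea can be sketched as follows. This is an illustrative model only; `Proto` and both function names are made up, not the compiler's actual types:

```haskell
import qualified Data.Set as Set

-- Each leaf's choice of which output arguments should be outByReference.
type Proto = Set.Set String

-- Previous behaviour (sketch): leaves that disagree are an error.
combineStrict :: [Proto] -> Either String Proto
combineStrict []       = Right Set.empty
combineStrict (p : ps)
  | all (== p) ps = Right p
  | otherwise     = Left "incompatible protos"

-- New behaviour (sketch): take the union, so partition's two leaves, one
-- wanting ?yes outByReference and the other ?no, agree on {no, yes}.
combineUnion :: [Proto] -> Proto
combineUnion = Set.unions
```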
this is a further optimization on top of the existing TCMC optimisation, which tries to optimize the callers of these tail-call-optimized procs. We identify the pattern `alloca` -> call -> `load` -> `mutate`, which occurs quite often. Instead we calculate the address first and make the call passing in the final address to write to (i.e. writing *directly into the structure*).

# UNSAFE
# this forcibly writes without checking whether it is the correct variant or not
def {inline} unsafe_update_tail(!x:list(T), field:list(T)) {
@jimbxb (Collaborator) commented May 25, 2022

You should be able to do this without the use of a helper

?ls = [1]
tail(!ls, [2], _) # ignore if the test failed
# ls = [1,2]

@neon64 (Contributor, Author):

Hey, thanks for the suggestion. I tried it out, but alas I don't think it's quite what I'm trying to do.

I want to check that the function isn't TCMC-optimized if there happen to be two or more mutates writing to the same field, e.g. both yes0 and no0 are written to the field at offset 8 here:

1:
            tcmc_fields_alias.partition<0>(~p##0:(T, ?wybe.bool), ~t##0:wybe.list(T), ?yes0##0:wybe.list(T), ?no0##0:wybe.list(T)) #2 @tcmc_fields_alias:12:17
            foreign lpvm alloc(16:wybe.int, ?tmp#11##0:wybe.list(T)) @list:9:27
            foreign lpvm mutate(~tmp#11##0:wybe.list(T), ?tmp#12##0:wybe.list(T), 0:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~h##0:T) @list:9:27
            foreign lpvm mutate(~tmp#12##0:wybe.list(T), ?tmp#0##0:wybe.list(T), 8:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~yes0##0:wybe.list(T)) @list:9:27
            foreign lpvm mutate(~tmp#0##0:wybe.list(T), ?yes##1:wybe.list(T), 8:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~no0##0:wybe.list(T)) @tcmc_fields_alias:6:5
            foreign llvm move(0:wybe.list(T), ?no##0:wybe.list(T)) @tcmc_fields_alias:20:18

I agree tail(!ls, [2], _) compiles and works fine, but the LPVM it produces isn't quite what I want to test:

1:
            tcmc_fields_alias.partition<0>(~p##0:(T, ?wybe.bool), ~t##0:wybe.list(T), ?yes0##0:wybe.list(T), ?no0##0:wybe.list(T)) #2 @tcmc_fields_alias:10:17
            foreign lpvm alloc(16:wybe.int, ?tmp#12##0:wybe.list(T)) @list:9:27
            foreign lpvm mutate(~tmp#12##0:wybe.list(T), ?tmp#13##0:wybe.list(T), 0:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~h##0:T) @list:9:27
            foreign lpvm mutate(~tmp#13##0:wybe.list(T), ?tmp#0##0:wybe.list(T), 8:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~yes0##0:wybe.list(T)) @list:9:27
            foreign llvm icmp_ne(tmp#0##0:wybe.int, 0:wybe.int, ?tmp#16##0:wybe.bool)
            case ~tmp#16##0:wybe.bool of
            0:
                foreign llvm move(~tmp#0##0:wybe.list(T), ?yes##1:wybe.list(T))
                foreign llvm move(0:wybe.list(T), ?no##0:wybe.list(T)) @tcmc_fields_alias:18:18

            1:
                foreign lpvm {noalias} mutate(~tmp#0##0:wybe.list(T), ?yes##1:wybe.list(T), 8:wybe.int, 1:wybe.int, 16:wybe.int, 0:wybe.int, ~no0##0:wybe.list(T)) @list:9:27
                foreign llvm move(0:wybe.list(T), ?no##0:wybe.list(T)) @tcmc_fields_alias:18:18

The extra icmp_ne and the extra branching already disqualify this function from TCMC optimisation.
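The disqualification being described can be modelled as a simple eligibility check. This is a toy sketch, not the actual LastCallAnalysis code, and the instruction alphabet is invented: in this model, everything after the last recursive call must be a move or a mutate, so any comparison or fork rules the proc out.

```haskell
-- Toy instruction alphabet (invented for illustration).
data Instr = RecCall | Move | MutateI | Icmp | Fork
  deriving (Eq, Show)

-- A proc body is TCMC-eligible (in this sketch) if everything after the
-- recursive call is just moves/mutates -- no comparisons, no forks.
tcmcEligible :: [Instr] -> Bool
tcmcEligible body = case break (== RecCall) body of
  (_, RecCall : after) -> all (`elem` [Move, MutateI]) after
  _                    -> False
```

The second LPVM listing above fails this check: the icmp_ne (and the fork it feeds) appears after the recursive call.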

@jimbxb (Collaborator) commented May 28, 2022

That's unfortunate. It looks like the mutator is inlined, causing the branch.

I wonder if getter/mutator procs being marked for inlining is the right idea. I agree that for single-constructor types they should be, however.

I imagine you'd be able to use the ProcVariant tags somehow to recognise "mutation chains" also, but now that the proc has been inlined you don't get that information.

EDIT: also, that LPVM code looks fishy. We should be able to recognise that the same memory address is overwritten, and hence remove the former instruction. If re-ordered, with the former mutate performed before the previous one, then it's a little more interesting to analyse.

@pschachte (Owner):

In this case, the LPVM code is comparing the output of a mutate instruction with 0. A mutate instruction should never return 0 (Wybe assumes that memory below smallestAllocatedAddress (64K) cannot be written), so that icmp_ne comparison should always return 1, which would allow BodyBuilder to eliminate the fork. Chris, I assume that if the fork were removed, leaving only the 1 branch, then the move(0, ?no##0) would be moved before the recursive call, leaving only the mutate, which your transformation would handle?

You're right, James, that a mutate instruction with the destructive flag set always returns the same address as it receives as input, but we don't want to say that in the LPVM code itself, as that's the impure view of what's happening, and then it would appear that the mutate instruction isn't needed. Instead we want to make use of the fact that two mutate instructions (with or without the destructive flag set) that mutate non-overlapping fields can be exchanged, or rather the offsets, field sizes, and values can be swapped, keeping the address arguments the same.
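The exchange described in the last paragraph could look something like this. It is a minimal sketch with a simplified mutate record (the field names are illustrative, not real LPVM): the offset, size, and value payloads swap, while the address arguments, and hence the dataflow through the chain, stay put.

```haskell
-- Simplified mutate: input address, output address, byte offset,
-- field size, and the value written (illustrative, not real LPVM).
data Mut = Mut
  { addrIn  :: String
  , addrOut :: String
  , offset  :: Int
  , size    :: Int
  , value   :: String
  } deriving (Eq, Show)

-- Do the two mutates touch overlapping byte ranges?
overlaps :: Mut -> Mut -> Bool
overlaps a b = offset a < offset b + size b && offset b < offset a + size a

-- Exchange two adjacent mutates in a chain: swap offsets, sizes and
-- values, but keep the address arguments (and thus the dataflow) intact.
exchange :: Mut -> Mut -> Maybe (Mut, Mut)
exchange a b
  | overlaps a b = Nothing
  | otherwise    = Just
      ( a { offset = offset b, size = size b, value = value b }
      , b { offset = offset a, size = size a, value = value a } )
```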

@pschachte (Owner):

BTW, a fork following the recursive call need not rule out TCMC optimisation, if the optimisation can be applied in some or all of the branches. For example, it should be possible to optimise code like:

?result0 = recursive_call(xs)
?result = if{bla:: result0 | else:: [a|result0]}

(assuming bla doesn't depend on result0 and the compiler generates suitable LPVM).

@neon64 (Contributor, Author):

> Chris, I assume that if the fork were removed, leaving only the 1 branch, then the move(0, ?no##0) would be moved before the recursive call, leaving only the mutate, which your transformation would handle?

Yep, the first stage of the transformation should move the move() above the recursive call.

> For example, it should be possible to optimise code like:
>
> ?result0 = recursive_call(xs)
> ?result = if{bla:: result0 | else:: [a|result0]}

Indeed this should be possible; I have noted this limitation down in #320.
Alas, I don't think I will actually get to solve it, at least until after the report is due. I've instead put time into a solution for #322 and #322 (which is part of the implementation in this PR).

@pschachte (Owner):

> Alas I don't think I will actually get to solve this, at least until after the report is due.

Oh, no, definitely just focus on the report. I'd suggest not working on the code unless it's needed for the report.

After the semester is over, if you feel like pushing this further, of course that would be warmly welcomed. :-)

pub def appendstuff(l: list(int), l2: list(int), ?result: combined(int)) {
    ?l3 = l ,, l2
    ?result = combined(123, [])
    !result^combined.y = l3
}
@neon64 (Contributor, Author) commented May 26, 2022

  • combined.y should not be needed - check

neon64 added 10 commits May 26, 2022 23:01
remaining test cases I'd like to fix are tcmc_unrolled_map, and tcmc_mutates_in_wrong_order. Both of these should be fixed by using a generalisation of "mutate chains" - namely "mutation graphs", which I plan to implement and detail in the report
final missing step is to convert a mutate graph back into a (sorted) list of prims
(PrimCall _ spec' _ _) | spec == spec' -> True
(PrimCall _ spec' _ _) | procSpecMod spec == procSpecMod spec' &&
procSpecName spec == procSpecName spec' &&
procSpecID spec == procSpecID spec' -> True
_ -> False
@neon64 (Contributor, Author):

I believe this change should fix #336 - we check whether the caller and callee match, ignoring the specz version.
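The matching described here can be sketched like so, using a hypothetical stand-in for the compiler's proc identifier (the real ProcSpec type differs): two calls denote the same proc when module, name, and ID agree, with any specz version ignored.

```haskell
-- Hypothetical stand-in for the compiler's proc identifier.
data ProcSpec = ProcSpec
  { psMod   :: String
  , psName  :: String
  , psID    :: Int
  , psSpecz :: Maybe String  -- specialisation version, ignored when matching
  } deriving (Eq, Show)

-- Caller and callee denote the same proc if module, name and ID agree,
-- regardless of which specialised version either one names.
sameProc :: ProcSpec -> ProcSpec -> Bool
sameProc a b = psMod a == psMod b
            && psName a == psName b
            && psID a == psID b
```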

@jimbxb (Collaborator):

Will this also modify the multi-specz bodies?

@neon64 (Contributor, Author):

I believe the specz bodies are only generated right at the very end, after this LastCallAnalysis pass. So while the LastCallAnalysis pass won't directly touch the multi-specz bodies, the multi-specz bodies get generated from the post-LastCallAnalysis unspecialized body, so importantly they will have the correct signature (with e.g.: outByReference params).

On a related note: I feel like there could be a case where a proc is only tail-call optimizable in a particular specialized version, but not in the unspecialized version. In that case we'd have to run LastCallAnalysis after the multi-specz bodies are generated in order to fully optimize the proc. But for many other examples (e.g. append), all the multi-specz versions end up getting TCMC-optimized the same way, so running LastCallAnalysis after generating multi-specz bodies would just increase the number of times lastCallAnalyseProc must be called, for not much gain.

@jimbxb (Collaborator):

Okay, that makes sense.

@pschachte (Owner):

I'm not sure this is right. Multiple specialisation can change the code, so it's possible that TCMC would be applicable for a specialised version but not the non-specialised version. Remember multiple specialisation changes some non-destructive updates to be destructive, and changes some allocations to structure reuses. Longer term, it's even possible in principle that TCMC would be applicable to a proc but not to its specialisation.

I think the right solution is to separately apply TCMC analysis/transformation to specialised procs.

@neon64 (Contributor, Author):

> Longer term, it's even possible in principle that TCMC would be applicable to a proc but not to its specialisation.

Ah I see how this would be problematic. I'll have to rework this PR before moving it out of "draft" status.

@jimbxb (Collaborator) commented Jun 28, 2022

I am not sure such a gradient is possible within the current multi-specz framework. All specialised procs must have the same interface as the original version of the procedure. If TCMC or anything else wants to change the flow/whatever of a parameter, this cannot be described with a specialisation.

@pschachte (Owner) commented Jul 7, 2022

Good point. That makes multi-specz easier to manage, but it doesn't really need to be that way. Generalising it might be a lot of work, though.

@jimbxb (Collaborator):

Yeah, I've been looking at the multi-specz code recently, trying to digest it. It seems very specialised for what it currently achieves, which is fair enough.

I'm just not sure how to extend it to support these types of specialisations.
