Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: Tail calls optimizations #6914

Open
marek-safar opened this Issue Feb 9, 2018 · 12 comments

Comments

6 participants
@marek-safar
Copy link
Member

marek-safar commented Feb 9, 2018

The goal is to make our tail call optimization to match netcore/.NET

The four main platforms we care about at first are: x86, amd64, arm and arm64; the optimization should work equally well on all of them.

Case Priority JIT,{,hybrid,full}AOT JIT,{,hybrid,full}AOT + LLVM bitcode interpreter (?)
tail.call high
tail.callvirt high amd64, x86, arm, arm64 amd64(?)
tail.callvirt (interface) high
tail.call (generics) high
tail.callvirt (generics) high
tail.callvirt (generics,interface) high
native pointers medium
tail.calli low
function pointers low
managed references low

Milestone I

  • Bring coreclr tests for tail calls to Mono source tree #6962
  • Update our JIT to pass all tail calls tests we have #6963
  • Extend our AOT compiler to support tail calls and pass all our tests

@marek-safar marek-safar added this to New Issues in Short Term Projects via automation Feb 9, 2018

@marek-safar marek-safar moved this from New Issues to Epics in Short Term Projects Feb 9, 2018

@marek-safar

This comment has been minimized.

Copy link
Member Author

marek-safar commented Feb 24, 2018

Collecting notes for somewhat future tailcall work.

csc can emit things like:
IL_0002: callvirt instance string Test.Base::AbstractOverrideNil()
IL_0007: stloc.0
IL_0008: br.s IL_000a
IL_000a: ldloc.0
IL_000b: ret

or if you say csc -optimize:
IL_0001: callvirt instance string Test.Base::VirtualOverrideNil()
IL_0006: ret

JIT should recognize the second and possibly the first pattern and attempt to tail call. Even without the tail prefix.

The first pattern is somewhat generalizable in that "effectively nop" is unbounded -- a general optimization problem. Some of this this work might involve more surveying of what csc/mcs tends to emit. And surveying of people tend to use csc -optimize. The second pattern is trivial.

If these are already handled, then just add tests to assert it.

Meta-point is that csc never outputs "tail." yet programmers might deliberately write code amenable to tail optimizations. Until/unless C#/csc are fixed, the burden is all on the JIT. C# should gain a syntax though, and then this bug can be skipped.

@jaykrell

This comment has been minimized.

Copy link
Contributor

jaykrell commented Jun 28, 2018

Pretty much everything is believed to work. There are a few known problems. Traversing the test matrix takes a long time.
Known problems:

  1. arm32 can't pass "extra arg" in a tailcall, so that breaks many generic/interface calls.

  2. gsharedvt has wrappers which break tailcall on FullAOT only. This can probably be reduced and work in in-only scenarios. gsharedvt isn't just about recursive generic parameters, but also interface calls and/or virtual calls, because FullAOT doesn't see the functions as used so doesn't compile them. Suggestion is intersected tailcaled interface/virtual signatures with interface/virtual signatures and compile those. This is overkill, but would not penalize C# FullAOT at all (it has no tailcalls).

  3. Of course, passing more parameters on stack than recieved. Big difficult problem and CoreCLR on Unix is believed no better, but CLR on desktop is better.

  4. Some scenarios involving valuetypes, like passing them by value on some architectures (arm64, amd64), and returning them by value similarly. This can be improved by reusing space from incoming parameters.

  5. "auto" tailcall remains very limited -- only direct calls to self. This should be able to inherit most of the tail.call improvements though.

Many many scenarios improved.
We used to essentially never tailcall generics or virtual calls or many other cases.
arm64 could not pass tailcall parameters on stack, only registers.
x86/arm32 could not pass valuetypes.
Among aot/fullaot/hybridaot, some where blocked for no apparent reason.
x86 had racy code patching.
Probably a lot more I forget.

@dsyme

This comment has been minimized.

Copy link
Contributor

dsyme commented Mar 19, 2019

@jaykrell Just to check - is the above an accurate summary of where we are with tailcalls today? Thanks!

@jaykrell

This comment has been minimized.

Copy link
Contributor

jaykrell commented Mar 19, 2019

Yes pretty accurate. Your test suite, can you try it? It is in mono build, as IL, with some disabled, varying per architectute. I send link later.

@jaykrell

This comment has been minimized.

Copy link
Contributor

jaykrell commented Mar 19, 2019

See around here https://github.com/mono/mono/tree/master/mono/tests/tailcall
and the makefile one level up. Look for tailcall and fsharp.

@luhenry

This comment has been minimized.

Copy link
Member

luhenry commented Mar 20, 2019

A lot of these changes had to be reverted due to introduced breakages. I wouldn't say the situation to have changed much since before @jaykrell's work, and the existing set of tests, even if somehow exhaustive, doesn't give a clear picture as to what exactly is passing.

@dsyme

This comment has been minimized.

Copy link
Contributor

dsyme commented Mar 20, 2019

@luhenry Do you have a list of relevant PRs, both pushed and reverting? Thank you

A full status report would be very helpful.

@jaykrell

This comment has been minimized.

Copy link
Contributor

jaykrell commented Mar 20, 2019

Very little had to be reverted.
Or if not mostly restored after revert.

The main problem was ARM32 could often not be fixed due to the use of an extra
non-volatile register, to pass an extra parameter for generics, and then restore it after the call,
means those could not be tailcalled.
i.e. ARM32 disabling here:

# FIXME These presumably take an extra parameter though this should be confirmed.

Similar problem with interface calls, still ARM32-specific:

PLATFORM_DISABLED_TESTS += tailcall/fsharp-deeptail.exe

This could be trivially fixed for "bitcode" targets (watch, TV) as they work differently but that was rejected:

# Interface calls cannot be tailcalls on ARM due to extra_arg. Bitcode could easily work however.

Though again, the problem is ARM32-specific.

There is also a problem with gsharedvt.
This is only on FullAOT. JIT and other AOT scenarios are ok.
The problem here is that mono sometimes creates generic instantiations by wrapping functions and doing some parameter/return conversion before and after.
Wrapping functions means, you have to return to the wrapper, the wrapper
cannot be a tailcall, stack will be consumed, even if some tailcalling is applied.

See here:

PROFILE_DISABLED_TESTS += tailcall-interface-conservestack.exe

This disabling is only under FullAOT. JIT works and regular AOT and I think HybridAOT.
Generic value types with FullAOT do work, when the AOT compiler can figure what to instantiate, but not when wrappers are applied at runtime.

Also, value types passed by value on AMD64 and ARM64 as a problem, as they are passed by reference
to a local. This could be improved, if the caller took the same types as parameters. The incoming
parameter could be replaced by outgoing.

Also, mono is stricter than CoreCLR about type matching.
i.e. it won't tailcall passing an r4 as an r8, or i32 vs. i64, in a tailcall,
even though they would often take up the same parameter space.

Here:

# return type mismatch (float32 vs. float64)

and here:
// As well I <=> I[48] and U <=> U[48] would be ok, for matching size.

Lots of tests used to fail and now pass.
The F# tests really are useful here.

Then there is the "auto tailcall" scenario, tailcalling without a tail. prefix.
This is severely limited but could be improved to about the point of tail.
Mono only does autotail for self-calls, and then not always.

I didn't change that much or at all.
People really don't like auto tail stuff much, and it is less relevant for F#.

The problems here are:
There are tests that depend on non-tailcalls. That walk the stack
and check who there caller is. So it breaks tests.
There is even an API that depends on stackwalk, because
the assembly of its caller is a parameter.

There are tests that depend on stackwalk to form strings for logging.

Debugging. People complain about lost frames.

Social. People don't think it is important.

See here for a start, uncommited PRs:

#9620 probably ignore this in dereference to the next two.

and then redone later more conservatively:
#9389

and then just focusing on the sensitivity separately:

#9759

The best/easiest thing is to try the F# tests.

And then there is a very large matrix to run them, which is largely covered by CI:
{amd64, x86, arm32, arm64} x { JIT, AOT, HybridAOT, FullAOT, Interpreter } x { LLVM or not}

Some of the cells don't make sense or are unimportant but many are valid.

The test infrastructure splits up the F# tests runs them all individually.
They have a very high pass rate, if you ignore arm32 (extra parameter) and FullAOT (wrappers).

@jaykrell

This comment has been minimized.

Copy link
Contributor

jaykrell commented Mar 20, 2019

I'm just repeating myself. Yes the summary from #6914 (comment) is accurate.

It is confusing I agree, because there are many variables, more than you might expect. Many cases were improved, and some improvements are blocked per-architecture (arm32) or codegen variant (FullAOT gsharedvt).

@lambdageek

This comment has been minimized.

Copy link
Member

lambdageek commented Mar 20, 2019

Let me try and split the difference between @luhenry and @jaykrell - our tail call support is better than it used to be, but it is difficult to concisely characterize when we will compile a tailcall successfully and diagnosing why a particular tail-call had to fall back to a regular call is difficult.

  1. If you set the environment variables MONO_LOG_LEVEL=debug MONO_LOG_MASK=tailcall mono will emit log messages like this:
tail.call isEven -> isOdd tailcall:1 tailcall_calli:1 gshared:0 extra_arg:0 virtual_:0
tailcalllog success from isEven
tail.call isOdd -> isEven tailcall:1 tailcall_calli:1 gshared:0 extra_arg:0 virtual_:0
tailcalllog success from isOdd

The first line means that isEven is calling isOdd, and both a tailcall or an indirect tailcall would be allowed for this callsite. The call isn't being made from a generic sharing caller, and Mono does not need to pass an extra argument that's not part of the call signature; and it's not a virtual call.

if we can't make the tailcall, there may be a message like:

mono_is_not_supported_tailcall_helper isOdd -> Invoke cfg->gsharedvt:1
  1. Automatic call->tail.call conversion does not happen. Only explicit tail. prefixes are considered.

  2. Explicit tail recursion (self tail-calls) generally has fewer restrictions as we can implement by manipulating the control flow graph.

  3. We will fall back to a regular call instead of a tailcall if any of the following happens:

    • if we're targeting watchOS (or another of our LLVM bitcode targets; when cfg->llvm_only is true)
    • callvirt if target architecture doesn't have_op_tailcall_membase (x86, amd64, arm32 and arm64 all have it).
    • calli if target architecture doesn't have_op_tailcall_reg (x86, amd64, arm32, arm64 all have it).
    • call{,virt} callee is an instance method on a valuetype
    • call{,virt} callee is a pinvoke
    • all tailcalls across aot/interpreter boundary (or other places where Mono will need to insert a wrapper method)
    • calli of an instance method
    • call{,virt} through Dynamic
    • if mono is compiling a caller method using generic valuetype sharing
      #10323
    • call{,virt} when mono needs to pass an extra argument that isn't part of a method signature and the architecture does not have an available volatile non-parameter register (x86, amd64, arm64 have one; arm32 doesn't)
      • interface calls fall under this
      • also when mono is compiling using generic sharing (of reference types)
    • if any arguments are byref or there are pointer arguments or fnptr (fnptr is not the same as delegates; delegates are normal reference parameters).
    • if the return type of the caller and the callee are not exact matches (e.g. caller is float, callee is double or vice versa).
    • if mono computes that the callee would need to use more stack than the caller for parameter passing
    • if any valuetype arg is passed by address (this depends on the callee signature and the calling convention on the platform).
  4. if Mono determines that the caller can tailcall the callee, that will inhibit inlining of the caller (into its caller). (For both self and general tail calls)

For future reference: This summary is from mono HEAD 86e7400.

@dsyme

This comment has been minimized.

Copy link
Contributor

dsyme commented Mar 20, 2019

@jaykrell

This comment has been minimized.

Copy link
Contributor

jaykrell commented Mar 20, 2019

Thank you.

That is a pretty accurate rendition of the source, yes.
And confusingly long, eh?
But clearer than code, agreed.

And despite all the conditions, this does let through a lot more than before.

Automatic call->tail.call conversion does not happen. Only explicit tail. prefixes are considered.

Close but not quite.
We can auto tailcall to self, with some limits.
That was always the case.
We just move the parameters and jmp.
Imho we could do much better there with little effort.
But it does break some tests, mostly bogus tests, and one API.

Aside: There is a separate jmp instruction that is pretty broken.

calli of an instance method

I'm fuzzy on what I did with calli.
calli is a rare opcode as I recall, maybe never output by C# or F#.

if mono computes that the callee would need
to use more stack than the caller for parameter passing

This is actually the main obvious limitation and CoreCLR
probably has the same, at least on Unix.
Everything else -- the extra parameters to generics and interface calls and the gsharedvt -- is kinda surprising.

The inlining thing is a dilema.
Inlining and tailcall interplay in a kinda unsolvable way.
In particular, because inlinine impacts that entire list
of variables. Inline can induce any of the variables
to be true or false. Because it changes who calls who.

You could try to adjust inlining to encourage tailcall
but it is probably N-P, and could contradict all
other inlining heuristics.

We might actually inhibit inlining of anything with tail. prefix?
I'd have to look.
What you say is better, to only inhibit inlining if in fact we can tailcall.

@luhenry luhenry removed their assignment Mar 20, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.