Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: use switch case to improve codegen #52

Closed
wants to merge 1 commit into from

Conversation

quicknir
Copy link

This PR is just to provide a convenient interface for initial code review.

For now I only implemented for variants of up to 2 types. It seems to compile and pass unit tests. I tried to insert the implementation branch (to switch case) at the lowest point in the call stack just before it heads to the function pointer table. It took me a bit of trial and error to understand what to use to get the reference to the ith type from the variant at that point in the callstack so I'm hopefully using the right thing. Hopefully, access::base::get_alt is both correct and also unchecked, the latter seems to be important in getting optimal assembly (e.g. when I wrote my top level switch case code with get instead of get_if it didn't generate as good code).

Anyhow, let me know what you think, we can finalize the max number of types for the optimization later once we have the details nailed down (it's very easy to change from how it's written).

@quicknir
Copy link
Author

Oh, also I'm aware you might not want to use if constexpr in the implementation. I just did it this way for now, can easily switch to tag dispatch.

I didn't do the fdiagonal into switch case yet either, that could be interesting as well. AFAICS, the "visit_at" line of functions seems to only be used for two variants of same type, visiting at the same index. Seemingly for binary operations like ==? Should probably do that too actually, seems like significantly better assembly when switch case is used: https://gcc.godbolt.org/z/To-fsq.

@quicknir quicknir mentioned this pull request Oct 11, 2018
@mpark
Copy link
Owner

mpark commented Oct 12, 2018

Is the idea to use a switch for only 1 variant upto some fixed-N alternatives? Could we at least support it for arbitrary N alternatives for 1 variant?

@quicknir
Copy link
Author

quicknir commented Oct 12, 2018

@mpark Well, you can't have a switch with an arbitrary number of cases. If variant has a well defined hard limit to the number of types it can hold, then you could generate up to that point. But if variant doesn't make such an attempt, and it's only limited by e.g. compiler limits on template recursion and such, then there is no way to do the switch case approach for purely arbitrary N types in a variant.

Personally I don't see this as a big issue; we already know this doesn't cover everything and instead is supposed to be an optimization for common cases. Covering up to e.g. 20 types will in practice significantly improve codegen for the overwhelming majority (IMHO) of uses of variant.

Note also that it will also cover built in binary operations (like comparison) for up to N types, despite those involving two variants, because the dispatch only occurs if the index for both variants matches, so we can handle that with switch case quite easily as well (not yet present in implementation, wanted to verify basic approach first).

@mpark
Copy link
Owner

mpark commented Oct 12, 2018

I was thinking of something like this, assuming the fixed-N is 32:

template <std::size_t B, typename T>
decltype(auto) switch_visit(T&& v) {
  switch (v.index()) {
    case B+0: /* ... */ break;
    case B+1:  /* ... */ break;
    // ...
    case B+31: /* ... */ break;
    // this should be tail-recursive-able:
    default: switch_visit<B+32>(std::forward<T>(v));
  }
}

@quicknir
Copy link
Author

@mpark Very neat! I hadn't considered an approach like that. However, there's a rather significant issue, which I couldn't solve on the user side but could very well be solvable when we are working inside the implementation. The code inside the /* .. */ is basically going to ask for a reference to the type inside the variant at index I (where I is just a constant matching the case). If you use an index that isn't within the size of the variant, you immediately get a hard compilation error (at least, this is the behavior both if you use the public api with get_if, and I think it's what you see with get_alt as well). For that reason, there has to be a separate switch case function for every size of variant. As well, for performance reasons the compiler may be able to generate better code when it has the unreachable pragma.

In order to use this approach, I'd: a) need some way to get a reference to the Ith type such that a compilation error is not triggered when I is out of bounds, b) we'd need to convince ourselves that this doesn't make the assembly worse in common case. I'm not against covering these two as long as it's reasonably time efficient, but I still feel like there is little disadvantage in only covering up to fixed N.

What do you think?

@mpark
Copy link
Owner

mpark commented Oct 13, 2018

@quicknir: We have the exact same issue for a fixed size N though, right? As in, if our N is bigger than the variant_size, then we're gonna have to not hard-error on get. So it seems to me like we need to solve that problem regardless.

@quicknir
Copy link
Author

quicknir commented Oct 13, 2018

@mpark Well, I actually generate a separate switch case function for each size of variant. That's what the integer template parameter in the function is for. So in the current approach we don't need to change get_alt and such.

@SaadAttieh
Copy link

SaadAttieh commented Oct 13, 2018 via email

@quicknir
Copy link
Author

I like your first idea a lot, I'll try to show a POC godbolt example this weekend.

@mpark
Copy link
Owner

mpark commented Oct 14, 2018

Here's a version that handles arbitrary # of alternatives, with N=4, for one variant with 6 alternatives: https://godbolt.org/z/TBd9rr

@mpark
Copy link
Owner

mpark commented Oct 15, 2018

Okay, I've compared the following options to visit a single variant with 7 alternatives.

  • handwritten switch
  • recursive switch_visit with N=4
  • recursive switch_visit with N=8
  • mpark::visit
Clang 7 GCC 8.2
switch HS-C HS-G
switch_visit, N=4 SV4-C SV4-G
switch_visit, N=8 SV8-C SV8-G
mpark::visit MV-C MV-G

Results (roughly.. in terms of code size/quality):

GCC doesn't seem to know how to unroll the recursive call in case it's needed, but I think that's fine as it's still better than the current manual jump table approach.

@mpark
Copy link
Owner

mpark commented Oct 21, 2018

@quicknir: have you had time to look into this stuff?

@quicknir
Copy link
Author

@mpark Sorry, got a bit busy, will make time this coming weekend.

@mpark
Copy link
Owner

mpark commented Oct 28, 2018

@quicknir: Cool, no problem! I wanted to share my preference in direction so that we can align our expectations.

  • I'd like to pursue the full switch-based visit solution here, and can the existing mechanism.
  • As far as I see, the switch-based solution is strictly better for single visitation. Even for GCC which didn't seem to do as much when the # of alternatives is greater than N.
  • We can build the N-ary visitation off of the single visitation from above, and I think we'll still get better code-gen than the existing solution.

The following are some experiments with double visitation:

The fact that the templated nested switch generates the same code as the hand-rolled code makes me optimistic about even N-ary visitation performing better than the manual jump table approach.

Anyway, I'm not sure exactly what stage you're in at the moment, but let me know if you disagree so that we can figure out what we can agree to do here!

@quicknir
Copy link
Author

@mpark I am fine with the recursive switch solution you suggested for single visitation. The only caveat is that I think that N is reasonable (say 10, as opposed to like 2) so that in common cases we still get optimal codegen, from gcc as well. N-ary visitation, makes me nervous; I tend to agree the codegen will probably be better, it's just that implementing it for arbitrary ariness seems like it will be significantly more complicated, and will delay things going through, and for significantly less benefit. Neither of the links obviously scales to arbitrary ariness AFAICS. It's possible that some kind of recursive approach would work (peel off the first variant each time, create a new visitor that hardcodes the position of the first variant and visits the rest normally and pass it down recursively).

Ideally (IMHO), we would start with single visitation + diagonal visitation (which IIUC is only used internally for binary operators like ==), get that to a place we're happy with and merge it, and only then start looking at N-ary visitation. That way, in parallel we can look at pushing this approach into the standard libraries. Ultimately getting single + diagonal visitation into standard libraries ASAP should be the main priority here, I think, because this is going to deliver the vast majority of the value to users. That said, I'm open if we can find a good implementation strategy. At the start I was doubtful about arbitrary N for single visitation and all the smart people on this thread changed my mind rather quickly :-).

@quicknir
Copy link
Author

Okay, updated PR. It should now be doing single variant visitation with arbitrary number of types with recursive switch case. For now I only switch-case on 2 values at a time before recursing, but if you're ok with this approach I can extend to something more reasonable like 10, and also proceed to extend it to binary diagonal visitation. I think we might be pretty close!

@mpark
Copy link
Owner

mpark commented Oct 29, 2018

@quicknir:

N-ary visitation, makes me nervous; I tend to agree the codegen will probably be better, it's just that implementing it for arbitrary ariness seems like it will be significantly more complicated, and will delay things going through, and for significantly less benefit.

I've put together the following snippet to show that the N-ary visitation code is achievable without too much complexity over the previous snippet. I would strongly prefer to replace the existing solution entirely, so that we get strictly better code-gen in general and also to solve problems like #41 for free. I promise it will not delay our progress.. personally I'd be much more excited/engaged if we were to pursue the full solution.

Neither of the links obviously scales to arbitrary ariness AFAICS.

I was only trying to hint that the generated code for the N-ary approach would be optimizable by the compiler with these links.

Anyway, the following is comparison of the options:

  • handwritten (nested) switch
  • N-ary switch_visit with N=8
  • mpark::visit

Single Visitation (7 alternatives):

Clang 7 GCC 8.2
switch HS1-C HS1-G
switch_visit SV1-C SV1-G
mpark::visit MV1-C MV1-G

Results:

Double Visitation (5 alternatives):

Clang 7 GCC 8.2
switch HS2-C HS2-G
switch_visit SV2-C SV2-G
mpark::visit MV2-C MV2-G

Results:

@quicknir
Copy link
Author

@mpark Sorry, you say "I put together the following snippet" but then I'm not sure exactly which snippet you are referring to? My basic idea btw for scaling it to n-ariness is something like this:

template <class Visitor, class FirstVariant, class ... Variants>
decltype(auto) switch_visit(Visitor&& visitor, FirstVariant&& first_variant, Variants&& ... variants) {
   switch (first_variant.index()) {
      case(0): return switch_visit(make_recursing_visitor(visitor, *get_if<0>(first_variant)), std::forward<Variants>(variants)...);
      ...
   }
}

Basically, this function make_recursing_visitor is essentially performing currying for us. There's a lot of details to get right here with perfect forwarding, but I think it's basically doable. Is this what you had in mind for a completely generic solution?

@mpark
Copy link
Owner

mpark commented Oct 29, 2018

@quicknir: Sorry about that. I was referring to the SV links in the tables above.

The approach you mention of retrieving the alternative per variant and passing it down
the recursive calls induced push/pop calls from my experiments. It seems better to actually
forward the variants through while building up the indices, and retrieve all of them at the end
in the final dispatch. The following is the full solution inlined:

template <bool, typename F, typename... Vs>
struct Switch;

template <typename F, typename... Vs>
struct Switch<false, F, Vs...> {
  template <std::size_t = 0, std::size_t... Is, typename... Vs_>
  static void switch_(std::index_sequence<Is...>, F&&, Vs&&..., Vs_&&...) { __builtin_unreachable(); }
};

template <typename F, typename... Vs>
struct Switch<true, F, Vs...> {
  template <std::size_t B = 0, std::size_t... Is>
  static void switch_(std::index_sequence<Is...>, F&& f, Vs&&... vs) {
    static_assert(B == 0);
    std::invoke(std::forward<F>(f), *mpark::get_if<Is>(&vs)...);
  }

  template <std::size_t B = 0, std::size_t... Is, typename V_, typename... Vs_>
  static void switch_(std::index_sequence<Is...>, F&& f, Vs&&... vs, V_&& v_, Vs_&&... vs_) {
    static constexpr std::size_t size = mpark::variant_size_v<std::decay_t<V_>>;
    switch (v_.index()) {
      case B+0: Switch<B+0 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+0>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+1: Switch<B+1 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+1>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+2: Switch<B+2 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+2>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+3: Switch<B+3 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+3>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+4: Switch<B+4 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+4>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+5: Switch<B+5 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+5>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+6: Switch<B+6 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+6>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      case B+7: Switch<B+7 < size, F, Vs...>::switch_(std::index_sequence<Is..., B+7>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs_>(vs_)...); break;
      default: Switch<B+8 < size, F, Vs...>::template switch_<B+8>(std::index_sequence<Is...>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<V_>(v_), std::forward<Vs_>(vs_)...); break;
    }
  }
};

template <typename F, typename... Vs>
void switch_visit(F&& f, Vs&&... vs) {
  Switch<true, F, Vs...>::switch_(std::index_sequence<>{}, std::forward<F>(f), std::forward<Vs>(vs)..., std::forward<Vs>(vs)...);
}

@quicknir
Copy link
Author

@mpark That seems reasonable. Okay, I will put together arbitrary visitation including diagonal visitation in the PR by the end of next weekend.

@mpark
Copy link
Owner

mpark commented Oct 29, 2018

I'm not sure if we even need a diagonal special case anymore with this approach actually 🤔
but I haven't tested that yet.

@quicknir
Copy link
Author

quicknir commented Oct 29, 2018

@mpark Maybe it's not conclusive evidence but this snippet leads me to believe that you still need a diagonal special case to be optimal: https://godbolt.org/z/2Xek12. You can see that when I mark the off-diagonal case no inline, code is still generated for it, even though because of the early exit, it's not possible to reach it. This in turn isn't really surprising because the whole switch operation doesn't get fully inlined into f; this means that the whole visit code does not actually have knowledge of the early exit and therefore still has to generate branches etc concerning itself with the off diagonal case.

Note: I think the early exit is actually necessary; not here necessarily for equality but e.g. for operator< you can't recover the indices from the types in general, so you'll have to have the branch before the visitation, and only do visitation when indices are equal, but then you'll still have branches entering the non-diagonal cases even though they are unreachable, which is sub-optimal.

@mpark
Copy link
Owner

mpark commented Oct 29, 2018

I'm not understanding what the purpose of the no-inline is... could you explain? I assume we wouldn't mark it inline, and also tell the compiler that it's unreachable: https://godbolt.org/z/N-ezqd

Anyway, I'm totally fine with implementing the diagonal case. It should be pretty simple, as it'll essentially be the single visitation code, taking arbitrary # of variants and retrieving the Ith alternative for all of them.

@quicknir
Copy link
Author

@mpark Cool, I think I will just implement diagonal side-by-side with the snippet above and look at the assembly. I also realized some other improvements in your approach compared to my PR :-). Feeding the B+K < size into the template parameter is clever for sure. I'll try to put it all together by the end of the coming weekend like I said and hopefully we can move along.

@quicknir
Copy link
Author

quicknir commented Nov 4, 2018

@mpark Ok tired for today but made some good progress. I was able to take your snippet and resolve both the return type issue, and get rid of double-passing all the variants. The build and unit tests all passed for me; currently it's using recursive switch-case for all non-diagonal visitation. So we're in good shape! I'll look more at diagonal visitation and make a decision there, and add commits taking care of that and removing the dead code.

For now, the number of cases before recursion is just 2; I lowered it to reduce the size of errors while working :-). We can discuss the correct number at the end; because of the macro I use it's very very easy to change the number of cases.

@mpark
Copy link
Owner

mpark commented Nov 8, 2018

@quicknir: are you sure the tests pass...? It doesn't seem to compile for me 😕

../include/mpark/variant.hpp:511:38: error: constexpr variable 'size' must be initialized by a constant expression
        constexpr std::size_t size = v_.size();
                                     ^~~~~~~~~

@quicknir
Copy link
Author

quicknir commented Nov 8, 2018

@mpark I only compiled locally with 17, so perhaps that is the issue? v_.size() is actually calling a static constexpr member of v_'s class, so it should still compile even if v itself is not a constexpr expression. That seems like a compiler bug?

It's not infinite recursion because what happen is that V_ gets added to the list of types that the struct is templated. Basically, the list of types in Switch are the types that have not yet been recursed through; that's why the initial call doesn't give it any types at all. Each recurse a type is added to that list. The termination of the recursion happens when Switch is templated on the full type list, at that point the switch_ overload with no extra arguments gets selected.

@quicknir
Copy link
Author

quicknir commented Nov 9, 2018

Okay, so I think I've convinced myself that a separate diagonal implementation is not necessary. Here was my quick implementation of diagonal visitation: https://godbolt.org/z/JHkwE7. It doesn't seem to generate markedly better code, though it's hard to say. The current approach actually tends to encourage more inlining because there's more complex templating; with the diagonal visitor it didn't tend to inline so much.

At any rate I think it's close enough we can justify doing the simple thing and just not having a diagonal implementation.

@mpark
Copy link
Owner

mpark commented Jan 9, 2019

Closed by #56

@mpark mpark closed this Jan 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants