Operate on WebAssembly? #323

Open
fitzgen opened this issue Apr 17, 2018 · 82 comments
fitzgen commented Apr 17, 2018

I don't see any references to it anymore, but my understanding was that Souper supported not just LLVM IR but also MSVC's IR. With that in mind, I think it would be exciting if Souper also supported WebAssembly as an IR (potentially hinting to binaryen what optimizations it is missing).

Brief intro to WebAssembly for the uninitiated (from http://webassembly.org/ ):

WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable target for compilation of high-level languages like C/C++/Rust, enabling deployment on the web for client and server applications.

WebAssembly also has an extensive specification, which hopefully makes supporting it easier than it would otherwise be.

I'm not proposing anything concrete (yet?), just trying to gauge interest and get feedback.

Thoughts?

cc @kripken @sunfishcode


fitzgen commented Apr 17, 2018

Related: a simple, offline super optimizer for wasm by @kripken: https://github.com/WebAssembly/binaryen/blob/superoptimizer/src/tools/wasm-analyze.cpp


regehr commented Apr 17, 2018

I'll need to think about this, but I'll just note right now that Souper doesn't support MSVC directly; rather, MSVC's IR and Souper IR are similar enough that Gratian at Microsoft wrote a quick MSVC->Souper translator and then manually implemented a bunch of optimizations that it discovered. Also, we recently learned that the Mono compiler people used Souper similarly! But I don't have much detail about that yet. I assume they used their SSA IR, but I don't know for sure.

As far as non-SSA IRs go, a while ago I started on a CompCert->Souper translator but got stalled for lack of time. I don't have a good sense for how easy it would be to do a WebAssembly->Souper translator.


regehr commented Apr 20, 2018

Ok, I thought about this a bit more.

The first issue is getting Souper to understand WebAssembly; this is less trivial than other cases because of the stack machine. On the other hand, a very nice thing about conversion to Souper is that you can stop whenever you want, for example because you run into something tricky. I've never written code to symbolically execute stack machine code to make it into register code, but conceptually this doesn't sound that hard (again, as long as we can bail out when things get tricky).

As far as I know, after un-stack-conversion (is there a word for this?) the underlying operations should map nicely between WebAssembly and Souper, and Souper's synthesizer should find optimizations perfectly normally.

The tricky part, then, is translating the synthesized results back into WebAssembly optimizations. I suppose this would be 100% manual at first, and then there are probably things that make sense to automate.

So anyhow my initial impression is that we want an unstackifier that makes Souper IR out of webasm, and then we can proceed from there. I think this makes more sense than trying to shoehorn a stack machine into the Souper implementation.
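As a rough illustration of the unstackifier idea, here is a toy sketch: it symbolically executes a stack-machine instruction stream into register-style SSA, bailing out on anything unhandled. The tiny instruction vocabulary and tuple encoding are invented for illustration; real wasm handling (blocks, local writes, traps) is omitted.

```python
# Toy sketch of "unstackifying": symbolically execute stack code,
# emitting a fresh SSA name for each operation's result.
def unstackify(instrs):
    stack, ssa, counter = [], [], 0
    for op, *args in instrs:
        if op == "i32.const":
            stack.append(str(args[0]))
        elif op == "get_local":
            stack.append(f"${args[0]}")
        elif op in ("i32.add", "i32.sub", "i32.and"):
            rhs, lhs = stack.pop(), stack.pop()
            counter += 1
            reg = f"%{counter}"
            ssa.append(f"{reg} = {op.split('.')[1]} {lhs}, {rhs}")
            stack.append(reg)
        else:
            raise NotImplementedError(op)  # "bail out when things get tricky"
    return ssa, stack

ssa, stack = unstackify([("get_local", "x"), ("i32.const", 1), ("i32.and",)])
assert ssa == ["%1 = and $x, 1"]
assert stack == ["%1"]
```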


fitzgen commented Apr 20, 2018

Thanks for the thoughtful reply :)

One benefit I see of destackifying wasm into Souper IR is that experimentation developing that tool can happen out-of-tree and can move fast and loose.

Souper's ability to bail out on tricky constructs is nice, too. This enables incremental development, where we could get an MVP working and then start ratcheting up the percentage of wasm that is understood by the translator.

On the other hand, I also have a concern with this approach. Code size is super important for wasm because it is delivered over the network, and therefore potential size optimizations are particularly enticing. I fear that translating wasm into Souper IR won't help us discover "silly" wasm-specific size optimizations, like https://kripken.github.io/blog/binaryen/2018/04/13/a-silly-optimization.html

Maybe this fear is unfounded, and size optimizations aren't as specific to the instruction set and its encoding as I'm assuming? Or maybe translation into Souper IR just isn't the right approach for wasm size optimization, and further development of a wasm-specific super optimizer is a more promising avenue of investigation?


regehr commented Apr 20, 2018

I have the same worry about size! Best we can do is give it a try, and maybe it'll turn out that re-stackifying the Souper output gets back the benefits?
Synthesis is a hard enough problem that I think it's worth exploring with Souper before doing more one-offs.
Random thought: I wonder if perhaps the webassembly typechecker/bytecode verifier/whatever contains the simple symbolic execution engine that we want for unstackifying webasm?


kripken commented Apr 20, 2018

I agree this is definitely worth trying! I could look into creating a Binaryen pass to convert Binaryen IR to Souper IR, if that would be helpful? (Binaryen IR is almost identical to wasm, can be converted to and from it, and preserves size and performance aspects.) It might be simplest to convert to the Souper text format to get started?

About the size concern, yeah, that seems like it could be an issue. For example if the wasm => Souper IR conversion is inefficient, then Souper's proposed optimizations might be to remove those inefficiencies. We'll just have to make sure it's a good conversion :)


regehr commented Apr 20, 2018

Alon, if you could hack something up that generates Souper text, this would be great! I'm happy to read code, play with crappy prototypes, etc. Also I'm happy to do the Souper runs (synthesizer can be finicky).

One thing to keep in mind is that one of Souper's most powerful features is its path conditions-- so if there's a way to generate those easily as part of the conversion, this will be especially helpful in generating optimizations (though applying the optimizations becomes less straightforward).


kripken commented Apr 20, 2018

Great! I'll see what I can hack up.

Is the Souper paper the best place for docs on the IR text format?

About path conditions, naively it seems like that should be easy since we know the conditions of conditional branches etc. But if we need to do stuff like forward conditions through multiple conditional branches etc. (if the condition is still true in them) then I'm not sure offhand.


regehr commented Apr 20, 2018

Argh we don't really have docs for Souper IR -- but it is really simple. I guess maybe ask me lots of questions and look at the test cases?

find souper/test -name '*.opt'


regehr commented Apr 20, 2018

Re. path conditions, extracting the condition from only the most recent branch is probably the way to go.

@chenyang78

Sorry for chiming in on this :) Besides the *.opt tests John mentioned, I think souper/InstRef.rst is a good place to learn Souper IR. Otherwise, you may end up scanning the relevant code.

In terms of path condition (PC in Souper's terminology), I think knowing the conditions of conditional branches should be enough for the first version. You can skip the BlockPC part at the moment.


regehr commented Apr 20, 2018

Ah, regarding docs/InstRef.rst, let me take a look and make sure it's up to date; it hasn't been touched for quite a while.

@chenyang78

@regehr , right, I just realized that we haven't touched it for a while :)


kripken commented Apr 20, 2018

Thanks @regehr and @chenyang78, those look like what I need to get started!


kripken commented Apr 22, 2018

Ok, I wrote up something that seems like it's starting to have the right general shape for simple things. Feedback welcome! Code is in a souper branch in binaryen, and the new pass is here.

To try it, you can do something like

wasm-opt input.wasm --flatten --simplify-locals-nonesting --souperify

(flatten flattens the IR; simplify-locals-nonesting simplifies it while keeping it flat; souperify runs the actual conversion). This will print out all the LHSes it can handle. Example input and output are in the test suite. For example, something conceptually like

int func(int x) {
  if (x & 1)
    x += 1;
  else
    x += 2;
  return x & 1;
}

can be written in wasm like this

    (if
      (i32.and
        (get_local $x)
        (i32.const 1)
      )
      (set_local $x
        (i32.add
          (get_local $x)
          (i32.const 1)
        )
      )
      (set_local $x
        (i32.add
          (get_local $x)
          (i32.const 2)
        )
      )
    )
    (return
      (i32.and
        (get_local $x)
        (i32.const 1)
      )
    )

and --souperify will emit 5 LHSes for it, the interesting one is for the returned value, which is

%0 = block 2
%1 = var
%2 = and %1, 1:i32
%3 = blockpc %0 0 %2 1:i32
%4 = blockpc %0 1 %2 0:i32
%5 = add %1, 1:i32
%6 = add %1, 2:i32
%7 = phi %0, %5, %6
%8 = and %7, 1:i32
infer %8

which looks like it might be valid Souper IR. I didn't try to load it yet though... :)

Still a lot of TODOs (IR not yet converted) and FIXMEs (no i1 conversions yet). It just stops a particular LHS when it sees stuff it can't handle. For control flow, so far only ifs are handled. blockpcs for them are implemented too (see example above), but no regular pcs yet. I suspect ifs are enough control flow to get interesting results.

Some thoughts as I was doing this:

  • One thing I would expect the Souper approach to help with is "backend" inefficiencies. wasm is often code that has been run through the LLVM optimizer (nothing new there for Souper), but then it goes through an LLVM backend, which maybe introduces extra work there.
  • wasm not run through the LLVM optimizer is interesting too, e.g. generated from Go or AssemblyScript, and maybe optimized by Binaryen. Souper on such wasm could find stuff the Go, AssemblyScript, and Binaryen optimizers could do better. I'm guessing there is quite a lot there!
  • We can't expect this to handle very low-level wasm things like -(-64) being smaller than 64, or things with tees, block and if values, etc. All those details get lost in the translation to Souper IR. Overall I expect we can benefit both from the Souper approach and from superoptimizing on wasm directly (I recently got that building again for more tests), they are probably complementary.
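One plausible reading of the -(-64)-vs-64 point above, assuming standard signed LEB128 immediates (which is what wasm uses for i32.const operands): the immediate -64 fits in one byte while 64 needs two. A small sketch of the size computation:

```python
# Sketch (assumed, not binaryen code): byte count of a signed LEB128
# immediate, the encoding wasm uses for integer constants.
def sleb128_size(value: int) -> int:
    size = 0
    while True:
        byte = value & 0x7F
        value >>= 7          # Python's >> is an arithmetic shift
        size += 1
        # stop once the remaining bits are pure sign extension of this byte
        if (value == 0 and not (byte & 0x40)) or \
           (value == -1 and (byte & 0x40)):
            return size

# i32.const -64 needs a 1-byte immediate; i32.const 64 needs 2 bytes
assert sleb128_size(-64) == 1
assert sleb128_size(64) == 2
```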


regehr commented Apr 22, 2018

Awesome. Is there a particular C/C++ code base that is particularly representative of code you're interested in optimizing, that I should start with? Otherwise I'll just build a few random things.

So far I've been creating wasm by passing --target=wasm32 to clang, is that the right approach?

Indeed, Go's equivalent of InstCombine is quite weak so Souper will find a lot of stuff.

There's an interesting piece of this puzzle that I don't think has been attacked seriously yet, that would not only be useful to solve but also would make a good submission to a conference like PLDI. This is taking patterns from Souper, perhaps abstracting them so they're more broadly applicable, and then making a very fast matcher for them. Just mentioning this since it's something I'm very interested in working on.


kripken commented Apr 22, 2018

Awesome. Is there a particular C/C++ code base that is particularly representative of code you're interested in optimizing, that I should start with? Otherwise I'll just build a few random things.

Hmm, some good samples are here: https://github.com/kripken/embenchen/tree/master/asm_v_wasm (that's output from the emscripten benchmark suite). In particular the larger ones like bullet, box2d, and lua, those are important real-world codebases I think.

So far I've been creating wasm by passing --target=wasm32 to clang, is that the right approach?

In general yes, however it's worth running the binaryen optimizer on it too (wasm-opt -Os input.wasm -o output.wasm) as it will shrink it by 10% or so. Probably not the same optimizations as Souper would look for, but while shrinking it makes the code cleaner which is more likely to work in the --souperify pass.


regehr commented Apr 22, 2018

Thanks Alon. I should have a bit of time this week to work on this!


kripken commented Apr 23, 2018

pc implementation (for ifs, the only control flow so far) added to that branch. Output looks like it might be correct, but I'm not sure :)


regehr commented Apr 23, 2018

I've been reading the code looking to fix a couple of easy things, but if you're in there right now it'll be quicker for you to do these if you have time!

  • var instructions require a width, e.g.
    %x:i32 = var
  • we don't support redundant comparison instructions, so we don't have sgt, sge, ugt, uge (I know this is quirky and weird, but canonicalization has zero relevance in Souper IR so we've not bothered to fix this)
  • pc arguments need to be i1 instead of i32
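Since Souper's text format omits those redundant comparisons, a translator can canonicalize them by swapping operands (a > b is b < a, and so on). A minimal sketch; the names are illustrative, not binaryen's actual code:

```python
# Map the comparisons Souper doesn't accept onto the ones it does,
# swapping the operand order to preserve meaning.
CANON = {"sgt": "slt", "sge": "sle", "ugt": "ult", "uge": "ule"}

def canonicalize_cmp(op, lhs, rhs):
    if op in CANON:
        return CANON[op], rhs, lhs  # a > b  ==  b < a
    return op, lhs, rhs

assert canonicalize_cmp("sgt", "%1", "5:i32") == ("slt", "5:i32", "%1")
assert canonicalize_cmp("ule", "%1", "%2") == ("ule", "%1", "%2")
```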

I'm working on updating our Inst documentation to reflect this kind of thing, thanks for your patience!


rsas commented Apr 23, 2018

Hi Alon, great idea to apply superoptimization to wasm! In particular, I second the thought of applying Souper directly to wasm that comes from non-LLVM back-ends. Let's start with getting the Souper IR right first.


kripken commented Apr 23, 2018

@regehr thanks! I fixed those 3 things on the branch.

Output looks better now, but probably still a lot left to do to get it right. Also, to add operations aside from binaries (unaries, etc.)...


regehr commented Apr 24, 2018

Ok, great!
I'm now seeing two small issues.

First, while most LHSs have the proper var declarations there are also some degenerate LHSs that are just this, which makes Souper's parser barf:

; start LHS
infer %0

Second, there's a common pattern of comparing the results of comparisons against zero using the wrong width, as seen here:

; start LHS
%0:i32 = var
%1 = slt -1000999:i32, %0
%2 = sle 0:i32, %0
%3 = eq %2, 0:i32
pc %3 1:i1
infer %1

Since Souper doesn't have any implicit sext/zext, it should be:

%3 = eq %2, 0:i1

But leaving aside these issues I'm already seeing some perhaps-useful stuff coming out of the synthesizer!


regehr commented Apr 24, 2018

As a random example, here's an optimization that Souper would like to perform.
LHS:

%0:i32 = var
%1 = sle 0:i32, %0
%2:i32 = zext %1
%3:i32 = var
%4 = sle 0:i32, %3
%5:i32 = zext %4
%6 = or %2, %5
%7 = eq %6, 0:i32
infer %7

RHS:

%8:i32 = and %0, %3
%9:i1 = slt %8, 0:i32
result %9


kripken commented Apr 24, 2018

Exciting to see results already! And nice optimization it found there.

Those two issues should be fixed on the branch. Also added unary ops (cttz etc.).


regehr commented Apr 24, 2018

Ok!
Next two issues should also be easy:

  • please use lshr/ashr instead of sshr/ushr
  • blockpc doesn't get a label in Souper IR, so instead of:
%3 = blockpc %0 0 %2 1:i1

we want:

blockpc %0 0 %2 1:i1


regehr commented Apr 24, 2018

And here are a few more results:
https://gist.github.com/regehr/a428f164ae60ff783e5872d5efc5416e

To get these, I took wasm_lua_scimark.c.wasm and ran it through binaryen -O3, then ran the output through Souper in "constant synthesis" mode, where the only thing it does is try to prove that the LHS can be replaced by a constant.

I haven't gone and tracked these back to the wasm, so some of them could be artifacts of the translation, but it looks like at least some of them represent real opportunities for shaving code size.

To facilitate backtracking, would it make sense for --souperify to also print the wasm in a comment? Or print a line number for the root of the LHS maybe?


regehr commented Apr 24, 2018

You'll notice that sometimes the LHS contains far more context than is required to justify the optimization, making it harder to understand what's going on. I have a reducer for LHSs that I can run in the future.


kripken commented Apr 24, 2018

Thanks! Fixed those two minor issues on the branch.

About backtracking, I need to think about that - we have a DebugInfo mechanism that might work somehow, but it's not trivial to connect it. Looks like we definitely need a way to do that for debugging, though, as several of those results do look like my translation is wrong, for example the optimizer definitely knows that multiplying an integer by zero is zero, and can add two constants, etc.


kripken commented Apr 25, 2018

To get something simple, I added support for setting BINARYEN_DEBUG_SOUPERIFY=1 in the env; it then prints each wasm expression before the Souper one. However, it's not in a comment (no trivial way to do that atm, I'll look into it), and worse, it's printing the "flattened" IR, so it may not be useful enough for real debugging.

However, when trying to use this to debug those weird muls and adds, I think I might be misreading your gist:

; RHS inferred successfully
%0:i32 = add 0:i32, 1:i32
cand %0 1:i32

; RHS inferred successfully
%0:i32 = var
%1:i32 = mul 0:i32, %0
cand %1 0:i32

Does that mean I should be able to search in the output from processing that lua file (after -O3) and find the strings = add 0 or = mul 0 in the LHSes? Neither seems to show up. All the adds and muls have a non-constant left side (starting with %) as I'd expect, so maybe I'm confused? Or was this another file than https://github.com/kripken/embenchen/blob/master/asm_v_wasm/wasm_lua_scimark.c.wasm ?


kripken commented May 1, 2018

I see what's going on, yeah - if you ran two commands, then binaryen by default won't save function names (you need -g to preserve them). And yeah, it's fine to compose those things, -O3, --souperify etc. are all just passes (or lists of passes), run in order, so a single command like my example is fine too.


kripken commented May 1, 2018

And yeah, now I see those trivial add/sub pairs - they are in the wasmbackend versions, not the asm2wasm ones. E.g.

; start LHS (in $luaL_error)
%0:i32 = var
; (i32.sub(get_local $5)(i32.const 128))
%1 = sub %0, 128:i32
; (i32.add(get_local $3)(i32.const 128))
%2 = add %1, 128:i32
infer %2


kripken commented May 1, 2018

And here's the relevant wast:

 (func $luaL_error (; 41 ;) (type $3) (param $var$0 i32) (param $var$1 i32) (param $var$2 i32) (result i32)
  (local $var$3 i32)
  (local $var$4 i32)
  (i32.store
   (i32.const 1024)
   (tee_local $var$3
    (i32.sub
     (i32.load
      (i32.const 1024)
     )
     (i32.const 128)
    )
   )
  )
  [..]
  (i32.store
   (i32.const 1024)
   (i32.add
    (get_local $var$3)
    (i32.const 128)
   )
  )
  (get_local $var$0)
 )

Looks like stack prologue/epilogue stuff - things added by the wasm backend, after LLVM IR.

I think that could be improved, but I'm not sure. Opened https://bugs.llvm.org/show_bug.cgi?id=37299


regehr commented May 1, 2018

Interesting!
I'll do the next synthesis run with instructions on the asm2wasm version.
Unfortunately it may not finish before I leave town again, and my current fastest machine that I use for synthesis happens to be at home, so it may be the weekend before I post results.
Let me know if it would be useful to do a constant/nop synthesis run on the asm2wasm version, I can turn that around quickly.
I have money for faster machines at work but haven't gotten around to ordering anything, argh.


regehr commented May 1, 2018

Here's an example where we need to ask what the cost model should be. Souper wants to eliminate an instruction, but is that an appropriate optimization here?

; start LHS (in $_lua_settop)
%0:i32 = var
; (i32.le_s(get_local $1)(i32.const -1))
%1 = sle %0, -1:i32
%2:i32 = zext %1
infer %2

; RHS inferred successfully
%3:i32 = lshr %0, 31:i32
result %3


regehr commented May 1, 2018

Some constants and nops:
https://gist.github.com/regehr/649c148d8c154b2550062d865b30a244
From:

BINARYEN_DEBUG_SOUPERIFY=1 ~/binaryen/build/bin/wasm-opt asm2wasm_lua_scimark.c.wasm -O3 --flatten --simplify-locals-nonesting --souperify


regehr commented May 1, 2018

Does it seem weird that Souper gets 16k LHSs out of asm2wasm_lua_scimark.c.js.bc, but --souperify gets 200k LHSs out of asm2wasm_lua_scimark.c.wasm?


regehr commented May 1, 2018

Well, in the ongoing instruction synthesis run, I'm getting quite a bit of this:

; start LHS (in $_lua_tolstring)
%0:i32 = var
; (i32.le_s(i32.const 0)(get_local $68))
%1 = sle 0:i32, %0
%2:i32 = zext %1
; (i32.eq(unreachable)(unreachable))
%3 = eq %2, 0:i32
infer %3

; RHS inferred successfully
%4:i32 = ashr %0, 31:i32
%5:i1 = trunc %4
result %5

Something I'd like to do is make a custom cost function for Souper that closely mirrors the actual code size costs in the webassembly binary format.

I gather that some of the sexts/zexts in --souperify output are zero-cost in the sense that they were added only to make Souper happy. One thing we might consider is marking (in --souperify output) which instructions are freebies that have zero cost in the original wasm file. I could hack Souper to take this into account when it attempts to make a profitable optimization. This, plus some handling of external uses, might significantly reduce the noise level in synthesis results.


kripken commented May 1, 2018

Does "synthesis" mean stuff aside from const/nop?

In general results on the wasm backend are more important for the future (as we plan to switch to it from asm2wasm), however right now it is less optimized so more trivial stuff is possible there. But that's trivial stuff we need to fix ;)

Yeah, the cost model seems like it matters in that example. For code size, a zext of an i1 is free in wasm. On the other hand, I'm not sure it's free in the compiled machine code? So it could matter whether we're optimizing for download size or runtime speed.


regehr commented May 1, 2018

Yeah, sorry, "synthesis" means making new instructions that weren't there before.

The issue of cost models is one that we've spent a lot of time thinking about. I think the issues are basically the same for LLVM IR and WebAssembly. Options include:

  • right now we just eyeball it, tweaking the model when synthesis comes up with something it shouldn't have. this is only minimally workable. LLVM has these somewhat complicated and poorly documented ideas about canonicalization and they're hard to capture in a simple cost model.
  • probably the next thing I'll try is to run a ton of small LLVM functions through opt and then build a cost model by observing LLVM's behavior. I'm not certain that this approach makes sense for WebAssembly since the optimizer isn't as thorough yet.
  • I also want to run a ton of small functions through a backend, apply an assembly-level cost model, and then try to map that back to the LLVM level. this should work really well when optimizing for the size of the eventual object code; it isn't clear how well it'll work when optimizing for throughput, since predicting throughput of assembly code is difficult for modern processors. we can try this approach for webassembly assuming that you have an ahead-of-time webassembly backend?

Hope this all makes sense. Summary is that this is a problem we need to solve anyway, so happy to put some effort into it.


kripken commented May 3, 2018

Interesting about the cost model. Yeah, sounds like a lot to do there.

Meanwhile I generated some LHSes that might be interesting, this is from a function in the entropy coder in the AV1 media codec, which is something they tell me is relevant for superoptimization:
https://gist.github.com/kripken/b5da8f8da6b3cb88de012aa24e3393ce Would be interesting if it finds any especially clever stuff there.

I've also been improving the LHS generation mechanism. I'm working on a fuzzer that will verify that we generate proper LHSes from wasm, by doing trivial operations on them that should change nothing and then running to see that the output doesn't change. Already found one bug.


regehr commented May 3, 2018

Great! I'm traveling and won't do anything on this until the weekend unfortunately. Looking forward to seeing the new stuff. Instruction synthesis hadn't finished before I left home this morning argh.


regehr commented May 5, 2018

Ok, here are synthesis results from wasm_lua_scimark.c.wasm:
https://gist.github.com/regehr/0271e1060f791d7d813c000fa5596c8e
A few patterns dominate the output but there's some interesting stuff hiding in there.

And here's what Souper finds in the corresponding LLVM bitcode:
https://gist.github.com/regehr/b19854020a21a39fb34e50eff922e187

Is the AV1 codec still the best thing to do next?


regehr commented May 6, 2018

Results from the AV1 LHSs linked above:
https://gist.github.com/regehr/be7fd81ee4940086fcbf1fa3dd521fe0
The last one (with bswap) might be cute for processors where bswap isn't microcoded.


kripken commented May 6, 2018

Thanks!

Reading the lua synthesis results, one problem is that we tell Souper to infer the artificial instructions that we add just for legal reasons, like zexts and comparisons to zero for path conditions. I pushed an improvement to avoid that.

Another issue is that the cost of zext and trunc should be 0. Not inferring those instructions (last paragraph) will help, but they can also appear in the middle of LHSes. Then again, maybe that will be rare enough that it's not a problem.

A third issue is the presence of additional uses not seen in the LHS, as you brought up before. I looked into a couple manually, and in all of them that was why Binaryen didn't already perform the optimization. I wonder if this is more of an issue on wasm than on LLVM IR for some reason. In any case, I wonder if it wouldn't be more interesting to focus on single-use instructions, and I added a pass for that, --souperify-single-use (can use it just like --souperify).

Thinking about this, the single-use case might be interesting to focus on for an additional reason, that it should be much easier to create a fast matcher for such patterns - in particular, such a matcher could operate on Binaryen IR directly as opposed to first constructing the DataFlow IR. That should be far faster.

Reading the AV1 results, those are very interesting :) I'll pass those along.


regehr commented May 7, 2018

Great, I'm glad we're at the point of worrying about these sorts of details.

The current Souper cost model is dead simple:
https://github.com/google/souper/blob/master/lib/Inst/Inst.cpp#L640

I'll make an alternate version that has cost of ext/trunc at 0.
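A minimal sketch of what such an alternate model could look like, assuming an invented tuple representation of Souper instructions (this is not Souper's actual C++ cost function linked above): ext/trunc and pure dataflow bookkeeping cost 0, everything else costs 1.

```python
# Hypothetical size-oriented cost model: zero-cost casts, as discussed.
FREE_OPS = {"zext", "sext", "trunc", "var"}

def lhs_cost(insts):
    """insts: list of (opcode, ...) tuples in Souper's text vocabulary."""
    return sum(0 if op in FREE_OPS else 1 for op, *_ in insts)

# e.g. `%0 = var; %1 = sle 0, %0; %2 = zext %1` costs just 1:
assert lhs_cost([("var",), ("sle", 0, "%0"), ("zext", "%1")]) == 1
```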

Regarding external uses: empirically, the single-use restriction will be a serious one, though certainly we should pursue the low hanging fruit there. But soon I'd like to move to a more sophisticated model where you mark instructions with external uses in the Souper IR and then synthesis will take them into account. The syntax for this isn't worked out yet but we'll do this soon. There's room for even more improvement beyond that, but let's not worry about that right now.


regehr commented May 7, 2018

Using the latest binaryen, I've just kicked off a synthesis run for these LHSs:

BINARYEN_DEBUG_SOUPERIFY=1 ~/binaryen/build/bin/wasm-opt wasmbackend_bullet.wasm -O3 --flatten --simplify-locals-nonesting --souperify-single-use

I'm worried, however, that there may be a bug, because I'm seeing quite a few optimizations such as this one that are trivial due to an operation having the same value passed in both argument positions:

; start LHS (in $free)
%0:i32 = var
%1:i32 = var
; (i32.ne(get_local $36)(get_local $2))
%2 = ne %0, %0
infer %2

; RHS inferred successfully
result 0:i1


regehr commented May 7, 2018

OK, feel free to generate Souper that looks like this:

%0:i32 = var (hasExternalUses)
%1:i32 = add 0:i32, %0
%2:i32 = add 0:i32, %1
%3:i32 = add 0:i32, %2 (hasExternalUses)
%4:i32 = add 0:i32, %3

We'll modify synthesis to treat externally used values as fixed, and not try to replace them.
Please keep in mind that:

  • checking for a single use is not the right algorithm, since many values have multiple uses that are captured by the Souper LHS; we really want to only mark values that have uses that are not visible in the Souper code
  • the root of a LHS can obviously have as many uses as it likes and all of them are, by definition, invisible to Souper, and these have no impact on synthesis or the subsequent optimization


kripken commented May 7, 2018

%2 = ne %0, %0

Thanks, yes, definitely a bug. Fixed on the branch now, and added a test.

the root of a LHS can obviously have as many uses as it likes and all of them are, by definition, invisible to Souper, and these have no impact on synthesis or the subsequent optimization

Very good point, yeah. Fixed now. This was unnecessarily excluding a lot of LHSes! :)

checking for a single use is not the right algorithm, since many values have multiple uses that are captured by the Souper LHS; we really want to only mark values that have uses that are not visible in the Souper code

Fair point, but this does bring us to a tradeoff with the speed of matching the pattern. If (aside from the root) we have only single uses, then we are just looking at some simple linear wasm expression code (with no set_locals in the middle, in particular), which is very fast to compare to a pattern. And I could be wrong but my current intuition is that we'll have many or most of the wins this way anyhow.
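To illustrate why the single-use case allows fast matching: with only single uses, an LHS is a pure expression tree, so matching is a simple recursive walk with no dataflow bookkeeping. A toy sketch with invented tuple encodings, using the add/sub pair from $luaL_error above:

```python
# Toy tree matcher for single-use LHS patterns. Patterns are tuples like
# ("add", sub-pattern, sub-pattern); strings are pattern variables;
# anything else (e.g. an integer constant) must match exactly.
def match(pattern, expr, binding):
    if isinstance(pattern, str):          # pattern variable, e.g. "x"
        if pattern in binding:
            return binding[pattern] == expr
        binding[pattern] = expr
        return True
    if isinstance(pattern, tuple):        # operator node
        return (isinstance(expr, tuple)
                and len(pattern) == len(expr)
                and pattern[0] == expr[0]
                and all(match(p, e, binding)
                        for p, e in zip(pattern[1:], expr[1:])))
    return pattern == expr                # constant

# (x - 128) + 128  ==>  x, as in the $luaL_error example above
pattern = ("add", ("sub", "x", 128), 128)
expr = ("add", ("sub", ("get_local", 5), 128), 128)
binding = {}
assert match(pattern, expr, binding)
assert binding["x"] == ("get_local", 5)
```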


regehr commented May 7, 2018

Great! I'll kick off a new synthesis run tonight but then I'm leaving town tomorrow so it'll be the weekend again before I see the results (maybe I should poke holes in my home firewall but I don't really want to).


regehr commented May 14, 2018

here are a few synthesis results from wasmbackend_bullet.wasm:
https://gist.github.com/regehr/2378313b8b4ed859948341e79c375eeb


regehr commented May 14, 2018

Also here's souper's cache dump of the same information (which I find easier to read without the webasm noise): https://gist.github.com/regehr/152b5e8359509c52a6c36829506d9d05


kripken commented May 15, 2018

Thanks!

Seeing some trivial stuff that shouldn't be there, I found that the single-use calculation was actually wrong - it ignored uses in things like stores and calls. Fixed, hopefully now it's correct.


kripken commented May 26, 2018

--souperify now emits (hasExternalUses) when relevant. This required some redesign of the IR to properly handle a single wasm instruction that turns into multiple Souper instructions, etc.

(--souperify-single-use still exists, and should never emit that notation, which is useful for fuzzing)

I think that's all I had on my list of stuff to fix.


regehr commented May 29, 2018

Thanks! I'll post some new synthesis results soon.

kripken added a commit to WebAssembly/binaryen that referenced this issue Aug 27, 2018
Background: google/souper#323

This adds a --souperify pass, which emits Souper IR in text format. That can then be read by Souper which can emit superoptimization rules. We hope that eventually we can integrate those rules into Binaryen.

How this works is we emit an internal "DataFlow IR", which is an SSA-based IR, and then write that out into Souper text.

This also adds a --dfo pass, which stands for data-flow optimizations. A DataFlow IR is generated, as in souperify, and the pass then performs some trivial optimizations using it. There are very few things it can do that our other optimizations can't already, but this is also good testing for the DataFlow IR, plus it is good preparation for using Souper's superoptimization output (which would also construct DataFlow IR, as here, but then do some matching on the Souper rules).

regehr commented Oct 28, 2019

@shrin18 please explain in more detail what you want to do
