
syscall/js: performance considerations #32591

Open · dmitshur opened this issue Jun 13, 2019 · 22 comments

@dmitshur (Member) commented Jun 13, 2019

I was porting some frontend Go code to be compiled to WebAssembly instead of GopherJS, and noticed that performance dropped noticeably. The Go code in question makes a lot of DOM manipulation calls and queries, so I decided to benchmark the performance of making calls from WebAssembly to the JavaScript APIs via syscall/js.

I found it's approximately 10x slower than native JavaScript.

Results of running a benchmark in Chrome 75.0.3770.80 on macOS 10.14.5:

  131.212518 ms/op - WebAssembly via syscall/js
   61.850000 ms/op - GopherJS via syscall/js
   12.040000 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
   11.320000 ms/op - native JavaScript

Here's the benchmark code I used, written to be self-contained:

Source Code

main.go

package main

import (
	"fmt"
	"runtime"
	"syscall/js"
	"testing"
	"time"

	"honnef.co/go/js/dom/v2"
)

var document = dom.GetWindow().Document().(dom.HTMLDocument)

func main() {
	loaded := make(chan struct{})
	switch readyState := document.ReadyState(); readyState {
	case "loading":
		document.AddEventListener("DOMContentLoaded", false, func(dom.Event) { close(loaded) })
	case "interactive", "complete":
		close(loaded)
	default:
		panic(fmt.Errorf("internal error: unexpected document.ReadyState value: %v", readyState))
	}
	<-loaded

	for i := 0; i < 10000; i++ {
		div := document.CreateElement("div")
		div.SetInnerHTML(fmt.Sprintf("foo <strong>bar</strong> baz %d", i))
		document.Body().AppendChild(div)
	}

	time.Sleep(time.Second)

	runBench(BenchmarkGoSyscallJS, WasmOrGJS+" via syscall/js")
	if runtime.GOARCH == "js" { // GopherJS-only benchmark.
		runBench(BenchmarkGoGopherJS, "GopherJS via github.com/gopherjs/gopherjs/js")
	}
	runBench(BenchmarkNativeJavaScript, "native JavaScript")

	document.Body().Style().SetProperty("background-color", "lightgreen", "")
}

func runBench(f func(*testing.B), desc string) {
	r := testing.Benchmark(f)
	msPerOp := float64(r.T) * 1e-6 / float64(r.N)
	fmt.Printf("%f ms/op - %s\n", msPerOp, desc)
}

func BenchmarkGoSyscallJS(b *testing.B) {
	var total float64
	for i := 0; i < b.N; i++ {
		total = 0
		divs := js.Global().Get("document").Call("getElementsByTagName", "div")
		for j := 0; j < divs.Length(); j++ {
			total += divs.Index(j).Call("getBoundingClientRect").Get("top").Float()
		}
	}
	_ = total
}

func BenchmarkNativeJavaScript(b *testing.B) {
	js.Global().Set("NativeJavaScript", js.Global().Call("eval", nativeJavaScript))
	b.ResetTimer()
	js.Global().Get("NativeJavaScript").Invoke(b.N)
}

const nativeJavaScript = `(function(N) {
	var i, j, total;
	for (i = 0; i < N; i++) {
		total = 0;
		var divs = document.getElementsByTagName("div");
		for (j = 0; j < divs.length; j++) {
			total += divs[j].getBoundingClientRect().top;
		}
	}
	var _ = total;
})`

wasm.go

// +build wasm

package main

import "testing"

const WasmOrGJS = "WebAssembly"

func BenchmarkGoGopherJS(b *testing.B) {}

gopherjs.go

// +build !wasm

package main

import (
	"testing"

	"github.com/gopherjs/gopherjs/js"
)

const WasmOrGJS = "GopherJS"

func BenchmarkGoGopherJS(b *testing.B) {
	var total float64
	for i := 0; i < b.N; i++ {
		total = 0
		divs := js.Global.Get("document").Call("getElementsByTagName", "div")
		for j := 0; j < divs.Length(); j++ {
			total += divs.Index(j).Call("getBoundingClientRect").Get("top").Float()
		}
	}
	_ = total
}

I know syscall/js is documented as "Its current scope is only to allow tests to run, but not yet to provide a comprehensive API for users", but I wanted to open this issue to discuss the future. Performance is important for Go applications that need to make a lot of calls into the JavaScript world.

What is the current state of syscall/js performance, and are there known opportunities to improve it?

/cc @neelance @cherrymui @hajimehoshi

@agnivade (Contributor) commented Jun 13, 2019

It would also be good to benchmark with Firefox and see the results.

IIUC, you are just benchmarking DOM manipulation. And since DOM manipulation happens outside wasm anyway, what's being measured is essentially the price of the context jump from wasm land to browser land and back. In that case, I wonder if it is even within the control of syscall/js rather than the underlying wasm engine.

It would also be good to benchmark equivalent code in Rust and C. I think that may be a better apples-to-apples comparison of syscall/js performance against other languages.

@cherrymui (Contributor) commented Jun 13, 2019

As @agnivade said, probably worth trying Firefox. V8 is known to have some performance problems with the Wasm code generated by the Go compiler.

@dmitshur (Member, Author) commented Jun 13, 2019

It would also be good to benchmark with Firefox and see the results.

Agreed. I'll do this later and share results.

IIUC, you are just benchmarking DOM manipulation. And since DOM manipulation happens outside wasm anyway, what's being measured is essentially the price of the context jump from wasm land to browser land and back. In that case, I wonder if it is even within the control of syscall/js rather than the underlying wasm engine.

Yes. When I said syscall/js, I meant the entire performance cost of jumping from Wasm to the browser APIs and back. It's what the user sees when they use the API to interact with the JavaScript world.

It would also be good to benchmark equivalent code in Rust and C. I think that may be a better apples-to-apples comparison of syscall/js performance against other languages.

Agreed, that would be more representative of the actual WebAssembly <-> JS call overhead and would give us more information. I won't have a chance to do this, but if someone else can, it'd be helpful.

@eliasnaur (Contributor) commented Jun 13, 2019

Perhaps it's not worth doing anything substantial here before something like WASI is standardized. @neelance even did a WIP implementation at #31105.

@dmitshur (Member, Author) commented Jun 14, 2019

I've tried the benchmark again with recent development versions of 3 browsers:

Chrome Canary
Version 77.0.3824.0 (Official Build) canary (64-bit)

    114.154496 ms/op - WebAssembly via syscall/js
     63.350000 ms/op - GopherJS via syscall/js
     11.740000 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
     11.360000 ms/op - native JavaScript

Firefox Nightly
69.0a1 (2019-06-13) (64-bit)

     94.150003 ms/op - WebAssembly via syscall/js
     85.300000 ms/op - GopherJS via syscall/js
      7.695000 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
      7.405000 ms/op - native JavaScript

Safari Technology Preview
Release 85 (Safari 13.0, WebKit 14608.1.28.1)

     57.249996 ms/op - WebAssembly via syscall/js
     42.866666 ms/op - GopherJS via syscall/js
      5.536666 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
      5.073333 ms/op - native JavaScript

The results are pretty consistent across the 3 browsers in that doing lots of DOM queries via WebAssembly was about 10x slower than with pure JavaScript.

@hajimehoshi (Contributor) commented Jun 14, 2019

Could you share the code for the benchmark that outputs the [ms/op] values?

@agnivade (Contributor) commented Jun 14, 2019

Thanks for the tests @dmitshur. I would have thought that after https://hacks.mozilla.org/2018/10/calls-between-javascript-and-webassembly-are-finally-fast-%F0%9F%8E%89/, the DOM access overhead would have been reduced in Firefox. It's also interesting that Safari is much faster for DOM access than Firefox.

The tests with Rust/C should give us a better idea of what exactly can be improved on the Go side. If anybody can post results for that, that would be great.

@dmitshur (Member, Author) commented Jun 14, 2019

@hajimehoshi Sure. I've updated the source code in the original post.

@gopherbot commented Jun 22, 2019

Change https://golang.org/cl/183457 mentions this issue: runtime,syscall/js: reuse wasm memory DataView

@eliasnaur (Contributor) commented Aug 1, 2019

@martisch suggested that I add a "real-world" example that demonstrates the performance hit of WebAssembly compared to running natively. A good example is the "gophers" demo from Gio (gioui.org). With modules enabled and using Go 1.13 (tip), you can build and serve the demo with the following commands:

    $ export GO111MODULE=on
    $ go run gioui.org/cmd/gio -target js gioui.org/apps/gophers -stats # build gophers
    $ go run github.com/shurcooL/goexec 'http.ListenAndServe(":8080", http.FileServer(http.Dir("gophers")))' # serve gophers on localhost:8080

Then open http://localhost:8080 in a browser. The target frame time is ~16.7ms (60 Hz), but on my MacBook Pro it almost never hits the target.

Running the example natively,

     $ go run gioui.org/apps/gophers -stats

it easily hits the 60 Hz target.

In both Chrome and Firefox, the built-in profiler is a great way to see where the time goes. I've attached a screenshot of a single frame from Chrome's "Performance" tab. The frame time is 24ms.

Screenshot 2019-08-02 at 16 06 29

Unfortunately, the function names are all mangled ("wasm-function[]"), which makes it much harder to discern which functions take up time. (Fixed by not passing -w -s to ldflags.)

CC @cherrymui who recently optimized wasm.

@agnivade (Contributor) commented Aug 1, 2019

Thanks @eliasnaur. I have sent CL 183457, which should alleviate the DOM overhead to some extent. Would you be able to try it and check whether it helps? Note that the CL only optimizes DOM overhead, so if your app is heavy on computation in wasm land itself, it might not help very much.

Regarding profiles, yes, the Chrome profiler is a great tool. The wasm-function[] naming is indeed a bother (see the bug I filed with the Chrome team). Until that's fixed, may I take this chance to suggest using wasmbrowsertest? It was mentioned in @johanbrandhorst's GopherCon talk. With it you can take CPU profiles for wasm natively, just as you would for amd64. It automagically converts wasm-function entries to their proper names, and you can analyze the profiles directly with go tool pprof 🙂. That should give you better insight into what's going on in your app, and you can see whether there are hot functions that could be optimized.

@eliasnaur (Contributor) commented Aug 2, 2019

Thanks @agnivade, wasmbrowsertest is definitely useful for running benchmarks and standalone tests on wasm. However, the full drawing and rendering to a window doesn't lend itself to that model yet.

Fortunately, I figured out how to bring back the function names: the gioui.org/cmd/gio command passed -ldflags=-w -s, which as a side effect stripped the function names from browser debuggers. I've removed the flags, which didn't save much space anyway.

Finally, I updated my comment to add the -stats flag, which enables profiling without Ctrl-P.

@agnivade (Contributor) commented Aug 2, 2019

Re: function names, it's a cold-cache phenomenon as far as I understand. The first time, a function shows up as wasm-function; on subsequent reloads the names appear. It is hard to reproduce, though. See the bug I filed.

Anyway, I see some syscall/js.ValueCall in the profile, so my CL should™ be able to help. Feel free to give it a try whenever you have a chance.

@eliasnaur (Contributor) commented Aug 2, 2019

I tested with your CL 183457 which seems to help: the frame times are lower and more consistent. This is an example for a 17ms frame (the above profile had frame times above 20ms):

Screenshot 2019-08-02 at 16 14 51

However, the CPU usage still seems too high. According to the profile, almost 10ms of CPU time is spent building the vector shape for the frame timer in the top right corner. The text layout code is definitely CPU heavy and unoptimized, but 10ms seems excessive.

To verify the profile, I cut out the rendering of the statistics label and redid the profile:

Screenshot 2019-08-02 at 16 23 47

Firefox also misses the frame target:

Screenshot 2019-08-02 at 16 28 46

It looks like CPU-heavy code is faster in Firefox, whereas DOM calls are slower. Perhaps the DOM calls are only slower because Firefox's WebGL implementation is slower.

In summary, it looks like the demo is CPU bound, which supports the claim that Go generates inefficient WebAssembly.

I'll work on preparing a benchmark that can run in wasmbrowsertest and that skips all rendering/DOM calls.

@agnivade (Contributor) commented Aug 2, 2019

Great stuff! I think we are getting somewhere. Yes, the wasm code generation could use some love. I have a couple of CLs that apply rewrite optimizations which exist on amd64 but were absent on wasm; they should go in when the tree opens.

But it would be great if you could prepare a standalone benchmark. That would allow us to compare the generated code with amd64 and see if there are obvious places for improvement.

@eliasnaur (Contributor) commented Aug 2, 2019

I split the UI update from its rendering and added a benchmark. To see the difference, I ran:

    $ go test -bench . -count 8 -cpu 1 gioui.org/apps/gophers > native.bench
    $ GOOS=js GOARCH=wasm go test -exec ~/go/bin/wasmbrowsertest -bench . -count 8 gioui.org/apps/gophers > wasm.bench
    $ benchstat native.bench wasm.bench
    name  old time/op  new time/op   delta
    UI    14.9µs ± 1%  216.5µs ±22%  +1354.01%  (p=0.000 n=7+8)

So more than 10 times slower on wasm compared to native code, at least on my 2014 MBP.

@agnivade (Contributor) commented Aug 17, 2019

I investigated the profiles and started looking at the GOSSAFUNC output of some hot functions. The amd64 code showed lots of (MUL/DIV)SS. The wasm code, however, showed something interesting: there were lots of F32DemoteF64 and F64PromoteF32 instructions in the generated code. For example:

v395 00419 (14) F32Load	"".ctrl1+32(SP)
v395 00420 (14) F64PromoteF32
v395 00421 (14) F64Sub
v395 00422 (14) F64Const	$(0.5)

And in fact, code like this was generated several times:

v403 00474 (213) I32WrapI64
v403 00475 (213) F32Load	""..autotmp_318-64(SP)
v403 00476 (213) F64PromoteF32
v403 00477 (213) F32DemoteF64
v403 00478 (213) F32Store	$0

This means all 32-bit FP values are promoted to 64 bits, operated on, and then demoted back to 32 bits before being written back to memory.

A quick look into WasmOps.go revealed that the 32-bit FP instructions were missing, and then I understood why: all the FP registers (F0-F15) are treated as 64-bit registers.

Now here is where my speculation begins. Since Go SSA works only with registers, these virtual registers were created to work with SSA. But in the generated code, all references to registers are rewritten to local.(get|set|tee). So in theory it should be possible to construct another set of 32-bit registers and add 32-bit FP instructions that deal only with them, avoiding this 32-64 round trip.

@neelance / @cherrymui - Is this analysis correct? If so, how would you recommend extending the F0-F15 register set to include 32-bit registers too? I have a local CL where I have already added the 32-bit instructions. Now I just need to make these local.(get|set|tee) references work with 32-bit values.

@neelance (Member) commented Aug 17, 2019

The F64PromoteF32+F32DemoteF64 combination should only happen if rounding to 32 bits is actually necessary. In many cases the Go spec allows using 64-bit precision for float32 values.

Yes, it is possible to add registers for 32-bit floats, but I'm not sure how much this would affect performance, because I'd guess that CPUs are not faster on 32-bit floats than on 64-bit floats (I might be wrong).

@agnivade (Contributor) commented Aug 17, 2019

The F64PromoteF32+F32DemoteF64 combination should only happen if rounding to 32 bits is actually necessary.

I think you are referring to this:

case ssa.OpWasmLoweredRound32F:
		getValue64(s, v.Args[0])
		s.Prog(wasm.AF32DemoteF64)
		s.Prog(wasm.AF64PromoteF32)

I actually found another code path: in case ssa.OpWasmF32Store, getValue64 generates an F64PromoteF32, and then, because of the if v.Op == ssa.OpWasmF32Store { check, another AF32DemoteF64 gets added. I did not look much deeper into it, though.

Yes, it is possible to add registers for 32 bit floats, but I'm not sure how much this would affect performance,

Sure, if there is no perf boost, then there is no use. But I would like to try it and check the benchmarks. What is the right way to add 32-bit registers? Just add F16-F32? Or is there another way?

@neelance (Member) commented Aug 17, 2019

Is the F64PromoteF32 the one emitted by case ssa.OpLoadReg: of ssaGenValueOnStack? If yes, then this is indeed something we could optimize.

What is the right way to add 32-bit registers? Just add F16-F32? Or is there another way?

This is not easy to describe in a few words...

@beenshi commented Aug 18, 2019

@agnivade does your CL contain fixes for 32-bit integer instructions? It also amazed me to see such 32-64-32 int/FP conversions.

@agnivade (Contributor) commented Aug 18, 2019

I have not sent any CL yet. And no, I have not looked into 32-bit integer instructions.

gopherbot pushed a commit that referenced this issue Aug 28, 2019
Currently, every call to mem() incurs a new DataView object. This was necessary
because the wasm linear memory could grow at any time.

Now, whenever the memory grows, we make a call to the front-end. This allows us to
reuse the existing DataView object and create a new one only when the memory actually grows.

This gives us a boost in performance during DOM operations, while incurring an extra
trip to front-end when memory grows. However, since the GrowMemory calls are meant to decrease
over the runtime of an application, this is a good tradeoff in the long run.

The benchmarks have been tested inside a browser (Google Chrome 75.0.3770.90 (Official Build) (64-bit)).
It is hard to get stable nos. for DOM operations since the jumps make the timing very unreliable.
But overall, it shows a clear gain.

name  old time/op  new time/op  delta
DOM    135µs ±26%    84µs ±10%  -37.22%  (p=0.000 n=10+9)

Go1 benchmarks do not show any noticeable degradation:
name                   old time/op    new time/op    delta
BinaryTree17              22.5s ± 0%     22.5s ± 0%     ~     (p=0.743 n=8+9)
Fannkuch11                15.1s ± 0%     15.1s ± 0%   +0.17%  (p=0.000 n=9+9)
FmtFprintfEmpty           324ns ± 1%     303ns ± 0%   -6.64%  (p=0.000 n=9+10)
FmtFprintfString          535ns ± 1%     515ns ± 0%   -3.85%  (p=0.000 n=10+10)
FmtFprintfInt             609ns ± 0%     589ns ± 0%   -3.28%  (p=0.000 n=10+10)
FmtFprintfIntInt          938ns ± 0%     920ns ± 0%   -1.92%  (p=0.000 n=9+10)
FmtFprintfPrefixedInt     950ns ± 0%     924ns ± 0%   -2.72%  (p=0.000 n=10+9)
FmtFprintfFloat          1.41µs ± 1%    1.43µs ± 0%   +1.01%  (p=0.000 n=10+10)
FmtManyArgs              3.66µs ± 1%    3.46µs ± 0%   -5.43%  (p=0.000 n=9+10)
GobDecode                38.8ms ± 1%    37.8ms ± 0%   -2.50%  (p=0.000 n=10+8)
GobEncode                26.3ms ± 1%    26.3ms ± 0%     ~     (p=0.853 n=10+10)
Gzip                      1.16s ± 1%     1.16s ± 0%   -0.37%  (p=0.008 n=10+9)
Gunzip                    210ms ± 0%     208ms ± 1%   -1.01%  (p=0.000 n=10+10)
JSONEncode               48.0ms ± 0%    48.1ms ± 1%   +0.29%  (p=0.019 n=9+9)
JSONDecode                348ms ± 1%     326ms ± 1%   -6.34%  (p=0.000 n=10+10)
Mandelbrot200            6.62ms ± 0%    6.64ms ± 0%   +0.37%  (p=0.000 n=7+9)
GoParse                  23.9ms ± 1%    24.7ms ± 1%   +2.98%  (p=0.000 n=9+9)
RegexpMatchEasy0_32       555ns ± 0%     561ns ± 0%   +1.10%  (p=0.000 n=8+10)
RegexpMatchEasy0_1K      3.94µs ± 1%    3.94µs ± 0%     ~     (p=0.906 n=9+8)
RegexpMatchEasy1_32       516ns ± 0%     524ns ± 0%   +1.51%  (p=0.000 n=9+10)
RegexpMatchEasy1_1K      4.39µs ± 1%    4.40µs ± 1%     ~     (p=0.171 n=10+10)
RegexpMatchMedium_32     25.1ns ± 0%    25.5ns ± 0%   +1.51%  (p=0.000 n=9+8)
RegexpMatchMedium_1K      196µs ± 0%     203µs ± 1%   +3.23%  (p=0.000 n=9+10)
RegexpMatchHard_32       11.2µs ± 1%    11.6µs ± 1%   +3.62%  (p=0.000 n=10+10)
RegexpMatchHard_1K        334µs ± 1%     348µs ± 1%   +4.21%  (p=0.000 n=9+10)
Revcomp                   2.39s ± 0%     2.41s ± 0%   +0.78%  (p=0.000 n=8+9)
Template                  385ms ± 1%     336ms ± 0%  -12.61%  (p=0.000 n=10+9)
TimeParse                2.18µs ± 1%    2.18µs ± 1%     ~     (p=0.424 n=10+10)
TimeFormat               2.28µs ± 1%    2.22µs ± 1%   -2.30%  (p=0.000 n=10+10)

name                   old speed      new speed      delta
GobDecode              19.8MB/s ± 1%  20.3MB/s ± 0%   +2.56%  (p=0.000 n=10+8)
GobEncode              29.1MB/s ± 1%  29.2MB/s ± 0%     ~     (p=0.810 n=10+10)
Gzip                   16.7MB/s ± 1%  16.8MB/s ± 0%   +0.37%  (p=0.007 n=10+9)
Gunzip                 92.2MB/s ± 0%  93.2MB/s ± 1%   +1.03%  (p=0.000 n=10+10)
JSONEncode             40.4MB/s ± 0%  40.3MB/s ± 1%   -0.28%  (p=0.025 n=9+9)
JSONDecode             5.58MB/s ± 1%  5.96MB/s ± 1%   +6.80%  (p=0.000 n=10+10)
GoParse                2.42MB/s ± 0%  2.35MB/s ± 1%   -2.83%  (p=0.000 n=8+9)
RegexpMatchEasy0_32    57.7MB/s ± 0%  57.0MB/s ± 0%   -1.09%  (p=0.000 n=8+10)
RegexpMatchEasy0_1K     260MB/s ± 1%   260MB/s ± 0%     ~     (p=0.963 n=9+8)
RegexpMatchEasy1_32    62.1MB/s ± 0%  61.1MB/s ± 0%   -1.53%  (p=0.000 n=10+10)
RegexpMatchEasy1_1K     233MB/s ± 1%   233MB/s ± 1%     ~     (p=0.190 n=10+10)
RegexpMatchMedium_32   39.8MB/s ± 0%  39.1MB/s ± 1%   -1.74%  (p=0.000 n=9+10)
RegexpMatchMedium_1K   5.21MB/s ± 0%  5.05MB/s ± 1%   -3.09%  (p=0.000 n=9+10)
RegexpMatchHard_32     2.86MB/s ± 1%  2.76MB/s ± 1%   -3.43%  (p=0.000 n=10+10)
RegexpMatchHard_1K     3.06MB/s ± 1%  2.94MB/s ± 1%   -4.06%  (p=0.000 n=9+10)
Revcomp                 106MB/s ± 0%   105MB/s ± 0%   -0.77%  (p=0.000 n=8+9)
Template               5.04MB/s ± 1%  5.77MB/s ± 0%  +14.48%  (p=0.000 n=10+9)

Updates #32591

Change-Id: Id567e14a788e359248b2129ef1cf0adc8cc4ab7f
Reviewed-on: https://go-review.googlesource.com/c/go/+/183457
Run-TryBot: Agniva De Sarker <agniva.quicksilver@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Richard Musiol <neelance@gmail.com>
tomocy added a commit to tomocy/go that referenced this issue Sep 1, 2019
t4n6a1ka added a commit to t4n6a1ka/go that referenced this issue Sep 5, 2019
@rsc modified the milestones: Go1.14, Backlog (Oct 9, 2019)