image/jpeg: Decode is slow #24499
macOS Sierra
I noticed that jpeg.Decode is very slow.
Test case: decoding a 1920x1080 JPEG image.
I tested github.com/pixiv/go-libjpeg/jpeg and the native image/jpeg package.
Go 1.10 jpeg.Decode: ≈ 30 ms, CPU ≈ 15 %
Will it ever be as fast as other libraries?
Performance does seem quite bad. Using stb_image (via cgo) seems to be much faster in a simple test I've done:
This wasn't a scientific test, but the results were markedly different between the two approaches, and consistently so (+/- 10ms here or there with a few repeats).
I cut what I have into a standalone benchmark, and pushed it up at https://github.com/rburchell/imgtestcase. In my case, the format conversion is costly, but loading the image itself is also far from fast.
A CPU profile of repeatedly loading images gives me this:
I took a look into it. The majority of the improvement in the C version comes from SIMD (SSE2) support. If I disable SIMD and rerun the benchmarks, the C version runs much slower, although the Go version is still slow compared to it.
Other than that, there are some bounds check elimination (BCE) improvements that can be made in idct and fdct, but they yield an improvement of barely 40-50ns.
I don't see any major algorithmic improvements that can be applied. All the math operations, such as Huffman decoding and the IDCT, already have their fast paths, the same as in the C version.
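To illustrate the kind of BCE tweak meant here, the following is a minimal sketch, not the actual idct.go or fdct.go code. The function names rowSumChecked and rowSumSliced are hypothetical. The idea is to take one 8-element sub-slice up front, so the compiler typically needs only a single slice bounds check (reported as IsSliceInBounds) instead of one IsInBounds check per element access.

```go
package main

import "fmt"

// rowSumChecked adds the 8 values of row y of an 8x8 block, indexing the
// full slice each time. Each access may carry its own bounds check.
func rowSumChecked(src []int32, y int) int32 {
	return src[y*8+0] + src[y*8+1] + src[y*8+2] + src[y*8+3] +
		src[y*8+4] + src[y*8+5] + src[y*8+6] + src[y*8+7]
}

// rowSumSliced takes an 8-element sub-slice first. The slice expression is
// checked once; after that the constant indexes 0..7 can usually be proven
// in range by the compiler.
func rowSumSliced(src []int32, y int) int32 {
	s := src[y*8 : y*8+8 : y*8+8]
	return s[0] + s[1] + s[2] + s[3] + s[4] + s[5] + s[6] + s[7]
}

func main() {
	// Fill a hypothetical 8x8 block with the values 0..63.
	src := make([]int32, 64)
	for i := range src {
		src[i] = int32(i)
	}
	fmt.Println(rowSumChecked(src, 2), rowSumSliced(src, 2)) // prints "156 156"
}
```

Building with -gcflags="-d=ssa/check_bce/debug=1" lists the bounds checks that remain, which is how the effect of such a change can be verified.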
There may be some logic tweaks that can be done in the top time-taking functions like
Anybody is welcome to investigate and send CLs.
Before -
$gotip build -gcflags="-d=ssa/check_bce/debug=1" fdct.go idct.go
./fdct.go:89:10: Found IsInBounds
./fdct.go:90:10: Found IsInBounds
./fdct.go:91:10: Found IsInBounds
./fdct.go:92:10: Found IsInBounds
./fdct.go:93:10: Found IsInBounds
./fdct.go:94:10: Found IsInBounds
./fdct.go:95:10: Found IsInBounds
./fdct.go:96:10: Found IsInBounds
./idct.go:77:9: Found IsInBounds
./idct.go:77:27: Found IsInBounds
./idct.go:77:45: Found IsInBounds
./idct.go:78:7: Found IsInBounds
./idct.go:78:25: Found IsInBounds
./idct.go:78:43: Found IsInBounds
./idct.go:78:61: Found IsInBounds
./idct.go:79:13: Found IsInBounds
./idct.go:92:13: Found IsInBounds
./idct.go:93:12: Found IsInBounds
./idct.go:94:12: Found IsInBounds
./idct.go:95:12: Found IsInBounds
./idct.go:97:12: Found IsInBounds
./idct.go:98:12: Found IsInBounds
./idct.go:99:12: Found IsInBounds

After -
$gotip build -gcflags="-d=ssa/check_bce/debug=1" fdct.go idct.go
./fdct.go:90:9: Found IsSliceInBounds
./idct.go:76:11: Found IsSliceInBounds
./idct.go:145:11: Found IsSliceInBounds

name                 old time/op    new time/op    delta
FDCT-4               1.85µs ± 2%    1.74µs ± 1%    -5.95%  (p=0.000 n=10+10)
IDCT-4               1.94µs ± 2%    1.89µs ± 1%    -2.67%  (p=0.000 n=10+9)
DecodeBaseline-4     1.45ms ± 2%    1.46ms ± 1%    ~       (p=0.156 n=9+10)
DecodeProgressive-4  2.21ms ± 1%    2.21ms ± 1%    ~       (p=0.796 n=10+10)
EncodeRGBA-4         24.9ms ± 1%    25.0ms ± 1%    ~       (p=0.075 n=10+10)
EncodeYCbCr-4        26.1ms ± 1%    26.2ms ± 1%    ~       (p=0.573 n=8+10)

name                 old speed      new speed      delta
DecodeBaseline-4     42.5MB/s ± 2%  42.4MB/s ± 1%  ~       (p=0.162 n=9+10)
DecodeProgressive-4  27.9MB/s ± 1%  27.9MB/s ± 1%  ~       (p=0.796 n=10+10)
EncodeRGBA-4         49.4MB/s ± 1%  49.1MB/s ± 1%  ~       (p=0.066 n=10+10)
EncodeYCbCr-4        35.3MB/s ± 1%  35.2MB/s ± 1%  ~       (p=0.586 n=8+10)

name                 old alloc/op   new alloc/op   delta
DecodeBaseline-4     63.0kB ± 0%    63.0kB ± 0%    ~       (all equal)
DecodeProgressive-4  260kB ± 0%     260kB ± 0%     ~       (all equal)
EncodeRGBA-4         4.40kB ± 0%    4.40kB ± 0%    ~       (all equal)
EncodeYCbCr-4        4.40kB ± 0%    4.40kB ± 0%    ~       (all equal)

name                 old allocs/op  new allocs/op  delta
DecodeBaseline-4     5.00 ± 0%      5.00 ± 0%      ~       (all equal)
DecodeProgressive-4  13.0 ± 0%      13.0 ± 0%      ~       (all equal)
EncodeRGBA-4         4.00 ± 0%      4.00 ± 0%      ~       (all equal)
EncodeYCbCr-4        4.00 ± 0%      4.00 ± 0%      ~       (all equal)

Updates #24499

Change-Id: I6828d077b851817503a7c1a08235763f81bdadf9
Reviewed-on: https://go-review.googlesource.com/c/go/+/167417
Run-TryBot: Agniva De Sarker <firstname.lastname@example.org>
TryBot-Result: Gobot Gobot <email@example.com>
Reviewed-by: Nigel Tao <firstname.lastname@example.org>
First, the AssemblyPolicy (for Go packages overall) applies equally well to
Amongst other things, standard library code is often read by people learning Go, so for that code we weigh simplicity and readability against raw performance more heavily than other Go packages might. Neither position is wrong; it's just a different trade-off.
If adding a small amount of SIMD assembly to the standard library gets you a 1.5x benchmark improvement, then I'd probably take it. If adding a large amount of SIMD assembly gets you a 1.05x benchmark improvement, then I'd probably reject it.
What "small" and "large" mean is subjective. It's hard to say without a specific SIMD code listing.
Note also that, when I say benchmarks, I'm primarily concerned about overall decode / encode benchmarks, not just FDCT / IDCT benchmarks. Users want to decode a JPEG image; they don't want to run IDCTs directly.
For example, https://go-review.googlesource.com/c/go/+/167417/2//COMMIT_MSG shows a 1.03x benchmark improvement to IDCTs, but no significant change to the overall decode benchmarks. If that change were a SIMD assembly change, I'd reject it as being of too small a benefit, compared to the costs of supporting assembly.