Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSoC 2019] Improve the performance of JavaScript version of OpenCV (OpenCV.js) #15371

Merged
merged 4 commits into from
Sep 24, 2019

Conversation

Wenzhao-Xiang
Copy link
Contributor

@Wenzhao-Xiang Wenzhao-Xiang commented Aug 22, 2019

Overview

This pullrequest changes

  • Create the base of OpenCV.js performance test:
    This perf test is based on benchmark.js. And we first add cvtColor, Resize, Threshold into it. We support both browser and Node.js version of it for test.
  • Optimize the OpenCV.js performance by WASM threads:
    This optimization is based on Web Worker API and SharedArrayBuffer, so it can be only used in browser. We expose two new API cv.parallel_pthreads_set_threads_num(number) and cv.parallel_pthreads_get_threads_num(), so we can use the former to set threads number dynamically and use the latter to get the current threads number. And the default threads number is the logic core number of the device.
  • Optimize the OpenCV.js performance by WASM SIMD:
    Add WASM SIMD backend for OpenCV Universal Intrinsics. It's experimental as WASM SIMD is still in development. The simd version of OpenCV.js built by latest LLVM upstream may not work with the stable browser or old version of Node.js. Please use the latest version of unstable browser or Node.js to get new features, like Chrome Dev.

The Test

Test Environment:

  OS: Ubuntu 16.04
  Emscripten: 1.38.42, LLVM upstream backend
  Browser: Chrome, Version 78.0.3880.4 (Official Build) dev (64-bit)
  Hardware: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz with 8 logical cores:
  • OpenCV.js tests: all passed
  • Universal Intrinsics WASM backend test: all passed

Results

  • For OpenCV kernels, take Threshold kernel with parameter (1920x1080, CV_8UC1, THRESH_BINARY) as example:
OS: Ubuntu 16.04<br>
Emscripten: 1.38.42, LLVM upstream backend<br>
Browser: Chrome, Version 78.0.3880.4 (Official Build) dev (64-bit)<br>
Hardware: Core(TM) i7-8700 CPU @ 3.20GHz with 12 logical cores
build mean time (ms) speed up (to scalar)
scalar 1.164 1
threads 0.261 4.45
simd 0.123 9.46
threads + simd 0.039 29.84
  • For real case, take OpenCV.js face recognition sample as example:
OS: Ubuntu Linux 16.04.5
Emscripten: 1.38.42, LLVM upstream backend
Browser: Chrome, Version 78.0.3880.4 (Official Build) dev (64-bit)
Hardware: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz with 12 logical cores
OpenCV.js build FPS Speedup (to scalar)
scalar 3 1
threads 10 3.33
simd 12 4
threads + simd 26 8.6

Performance Analysis

Kernel performance(ms)

Test Environment:
OS: Ubuntu 16.04
Emscripten: 1.38.42, LLVM upstream backend
Browser: Chrome, Version 78.0.3880.4 (Official Build) dev (64-bit)
Hardware: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz with 8 logical cores:

Kernel cvtColor resize threshold
Type BGR2GRAY YUV2BGR BGR2YUV LINEAR_EXACT THRESH_BINARY
WASM(scalar) 2.747 12.04 10.685 5.425 1.347
WASM(threads=8) 0.722 2.636 2.618 2.529 0.353
WASM(SIMD128) 11.355 27.02 27.245 4.309 0.14
WASM(threads+SIMD) 1.869 4.757 4.769 1.79 0.043
Native(scalar) 2.11 7.89 8.68 6.45 0.6
Native(threads=8) 0.58 1.97 2.19 2.48 0.17
Native(SIMD128) 1.91 3.4 4.15 1.03 0.11
Native(threads+SIMD) 0.43 0.86 0.91 0.65 0.04

Analysis

With the current optimization, threads optimization works as we expected. However, wasm simd still have some issues. As we can see in the Kernel performance result, now resize only have 1.34x speed up than scalar version and cvtColor is even 2-3x slower than scalar version, which still have a big gap compared with Native SIMD optimization.
Thanks @huningxin for the investigation, here are some analysis results:

  • For resize, the biggest reason is that we now use shift to simulation integer widening instructions in v_dotprod. We have opened an emscripten issue. And we can continue to optimize resize kernel after this new feature is enabled.
  • For cvtColor, the root cause is the inefficient pshufb with memory operands are generated by V8 for current implementation.
    One solution is to refer to sse implementation that uses punpcklbw and punpckhqdq. We tried but it still fails due to an emscripten issue that leads V8 fails to generate those instructions. Let's see the response from emscripten community.
force_builders=Custom
buildworker:Custom=linux-1,linux-2,linux-4
build_image:Docs=docs-js
build_image:Custom=javascript

@huningxin
Copy link
Contributor

@Wenzhao-Xiang , please use git diff --check to find and fix the trailing white space issues.

@Wenzhao-Xiang
Copy link
Contributor Author

@huningxin Thanks! I will fix the trailing white space issues and merge the two commits into one to take it as my GSoC final commit.

Copy link
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution! Great job 👍

modules/core/include/opencv2/core/hal/intrin_wasm.hpp Outdated Show resolved Hide resolved
modules/core/include/opencv2/core/hal/intrin_wasm.hpp Outdated Show resolved Hide resolved
modules/core/include/opencv2/core/hal/intrin_wasm.hpp Outdated Show resolved Hide resolved
modules/core/src/mathfuncs_core.simd.hpp Show resolved Hide resolved
@Wenzhao-Xiang
Copy link
Contributor Author

@huningxin @terfendail I just found, almost all the implementation of f64/u64/i64 function use fallback implementation due to the limitation of wasm intrin, which means only few function could be removed. So I'd suggest we just keep all the fallback function. WDYT?

@Wenzhao-Xiang
Copy link
Contributor Author

Update the performance analysis #15371 (comment)

@huningxin
Copy link
Contributor

which means only few function could be removed. So I'd suggest we just keep all the fallback function. WDYT?

Could you please specify what they are? I think it will help the decision. Thanks.

Copy link
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for update!

@@ -152,6 +152,11 @@
# define CV_VSX3 1
#endif

#if defined(EMSCRIPTEN)
# define CV_WASM_SIMD 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually this macro is used in OpenCV for check:

defined(__EMSCRIPTEN__)

Is there any difference?

How SIMD feature can be disabled (via CMake/.py script parameters)? (it is useful for debugging purposes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review!
According to Detecting Emscripten in preprocessor, the correct define to use is __EMSCRIPTEN__.
emscripten-core/emscripten#4665 introduced a strict build mode and removed the EMSCRIPTEN define. Therefore it is not recommended to use EMSCRIPTEN even though it still works in non-strict build mode.
I'll fix that then.

Copy link
Contributor Author

@Wenzhao-Xiang Wenzhao-Xiang Aug 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For how to disable SIMD feature, it's decided by a .py script flag --simd. If you build with this flag, CV_ENABLE_INTRINSICS will be turned on, and then SIMD feature will be detected. And if not, only scalar version will be built.

@Wenzhao-Xiang
Copy link
Contributor Author

Wenzhao-Xiang commented Sep 3, 2019

@huningxin
See here:

v_rshr_pack, v_rshr_pack_store, v_pack_b, + , -, * , /, v_mul_expand, v_sqrt, v_invsqrt, 
v_abs, min, max, ==, !=, <, >, <=, >=, v_not_nan, v_absdiff, v_magnitude, <<, >>, v_shl, 
v_shr, v_store_low, v_store_high, v_reduce_sum, v_popcount, v_signmask, v_check_all, 
v_check_any, v_round, v_floor, v_ceil, v_trunc, v_cvt_f32, v_cvt_f64, v_cvt_f64_high, 

They are almost for f64, i64, u64.
Keeping all the fallback functions will also help debug, I think.

@Wenzhao-Xiang
Copy link
Contributor Author

@alalek updated it. Is there any issues?

@huningxin
Copy link
Contributor

Keeping all the fallback functions will also help debug, I think.

+1. Thanks for the information.

@terfendail
Copy link
Contributor

I suppose that retaining a few more fallback functions shouldn't essentially affect the size of the library. So let's keep them.

@Wenzhao-Xiang
Copy link
Contributor Author

I suppose that retaining a few more fallback functions shouldn't essentially affect the size of the library.

I agree! Thanks!

@Wenzhao-Xiang
Copy link
Contributor Author

Any updates here? @alalek @terfendail @huningxin

Copy link
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍

@Wenzhao-Xiang
Copy link
Contributor Author

Thanks! @alalek
Any inputs? @terfendail

Copy link
Contributor

@terfendail terfendail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Wenzhao-Xiang
Copy link
Contributor Author

Rebased this branch to solve the conflicts.

@huningxin
Copy link
Contributor

huningxin commented Sep 23, 2019

@Wenzhao-Xiang thanks for the rebase. @terfendail @alalek , is it OK to merge now? Otherwise we need Wenzhao to keep rebasing this PR.

@alalek
Copy link
Member

alalek commented Sep 23, 2019

I rebased PR onto 3.4 branch: https://github.com/alalek/opencv/commits/pr15371_r

Please pull these changes into Wenzhao-Xiang:gsoc_2019 or follow the instruction here: https://github.com/opencv/opencv/wiki/Branches

@Wenzhao-Xiang Wenzhao-Xiang changed the base branch from master to 3.4 September 23, 2019 15:40
Improve the performance of JavaScript version of OpenCV (OpenCV.js):
1. Create the base of OpenCV.js performance test:
     This perf test is based on benchmark.js(https://benchmarkjs.com). And first add `cvtColor`, `Resize`, `Threshold` into it.
2. Optimize the OpenCV.js performance by WASM threads:
     This optimization is based on Web Worker API and SharedArrayBuffer, so it can be only used in browser.
3. Optimize the OpenCV.js performance by WASM SIMD:
     Add WASM SIMD backend for OpenCV Universal Intrinsics. It's experimental as WASM SIMD is still in development.
1. use short license header
2. fix documentation node issue
3. remove the unused `hasSIMD128()` api
1. fix emscripten define
2. use fallback function for f16
Fix rebase issue
@Wenzhao-Xiang
Copy link
Contributor Author

@alalek
Done! PTAL, Thanks!

@alalek alalek merged commit c209677 into opencv:3.4 Sep 24, 2019
@huningxin
Copy link
Contributor

Awesome! Thanks @alalek @terfendail @Wenzhao-Xiang .

@waliurjs
Copy link

Guys, begging you to release. This will be so dope!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants