WASM SIMD for NNUE #30

Closed (wants to merge 11 commits)

Conversation

@hi-ogawa commented Feb 10, 2021

EDIT: I removed NOTE.md from the repository and moved its content here: https://github.com/hi-ogawa/stockfish.wasm/wiki/NOTE

Hi @niklasf
I experimented with WASM SIMD and wanted to share the results, which look promising even though I'm not sure they are at a usable stage yet.

I've seen the closed PR (#21), which relied on emscripten's x86 SIMD emulation feature; here, I instead implemented some linear algebra routines directly with WASM SIMD intrinsics.

These are some benchmark results (see NOTE.md for the details):

# NOTE:
#  emscripten's node (emsdk/node/12.18.1_64bit/bin/node) is old and doesn't seem to be compatible with the latest SIMD (cf. https://github.com/emscripten-core/emscripten/issues/11484),
#  so I'm using a locally installed node with the version shown below.
$ node -e 'console.log(`node: ${process.version}\nv8: ${process.versions.v8}`)'
node: v15.8.0
v8: 8.6.395.17-node.23

# [ Build type `wasm_simd_post_mvp=yes` ]
$ node --experimental-wasm-{simd,threads} --experimental-repl-await --wasm-simd-post-mvp

> const sf = await require("./stockfish")();

> sf.postMessage("bench 16 1 22 current depth classical");
Total time (ms) : 5056
Nodes searched  : 3478378
Nodes/second    : 687970

> sf.postMessage("bench 16 1 22 current depth NNUE");
Total time (ms) : 6424
Nodes searched  : 2670151
Nodes/second    : 415652


# [ Build type `wasm_simd=yes wasm_simd_post_mvp=no` ]
> sf.postMessage("bench 16 1 22 current depth NNUE");
Total time (ms) : 9472
Nodes searched  : 2670151
Nodes/second    : 281899


# [ Build type `wasm_simd=no wasm_simd_post_mvp=no` ]
> sf.postMessage("bench 16 1 22 current depth NNUE");
Total time (ms) : 54008
Nodes searched  : 2670151
Nodes/second    : 49439

For comparison, these are the results from current Stockfish master (29ed22d) on my machine.

# Built with flag `ARCH=x86-64-modern` which enables SSSE3.
bench 16 1 22 current depth classical
Total time (ms) : 4323
Nodes searched  : 4850165
Nodes/second    : 1121944

bench 16 1 22 current depth NNUE
Total time (ms) : 2992
Nodes searched  : 2655513
Nodes/second    : 887537

# Built with `ARCH=x86-64 sse=yes sse2=yes SUPPORTED_ARCH=true` (SSE3 is disabled)
# I think this setting is the closest one we can get to with WASM SIMD.
bench 16 1 22 current depth NNUE
Total time (ms) : 5202
Nodes searched  : 2655513
Nodes/second    : 510479

A few things I wanted to mention:

  • I made a separate file nnue/math.cpp where the WASM SIMD matrix-vector multiplication routines live (mostly because I wanted a separate translation unit so that I can rebuild faster).

  • I added two build types wasm_simd and wasm_simd_post_mvp, where wasm_simd_post_mvp has a substantially faster implementation thanks to the new intrinsic __builtin_wasm_dot_s_i32x4_i16x8.

  • The NNUE data is embedded in the executable by generating a C++ string literal with my little script embedded_nnue_data.py,
    which means the size of stockfish.wasm becomes 21MB.

  • I added a bench_eval command to measure the performance of NNUE inference directly, using another little helper of mine, timeit.hpp.

  • It seems both Chrome/Chromium (88.0.4324.150) and Firefox (85.0.1) already enable "post-mvp" features when SIMD is enabled.

  • On Firefox, there is a security restriction regarding SharedArrayBuffer, and some special HTTP headers need to be sent.
    So, I included misc/server.py to run an HTTP server with such headers (see the sketch after this list).
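The actual helper in this branch is misc/server.py; purely as an illustration of the same idea (not code from the PR), here is a minimal Node.js sketch of a static server that sends the cross-origin isolation headers that Firefox requires for SharedArrayBuffer. The file handling and the port number are hypothetical details.

// serve.js - static server with cross-origin isolation headers (sketch only)
const http = require("http");
const fs = require("fs");
const path = require("path");

http.createServer((req, res) => {
  const file = path.join(process.cwd(), req.url === "/" ? "index.html" : req.url);
  fs.readFile(file, (err, data) => {
    if (err) { res.writeHead(404); res.end("not found"); return; }
    res.writeHead(200, {
      // These two headers make the page "cross-origin isolated",
      // which Firefox requires before it exposes SharedArrayBuffer.
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
      // A real server should also set a proper Content-Type (e.g. application/wasm).
    });
    res.end(data);
  });
}).listen(8080);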

At this point, I'm not really sure whether this should even be a PR, but I'm curious what you think about it.
As a next step, I'm planning to implement a way to run the JavaScript build as a UCI engine executable so that we can compare the two versions' strength directly by pitting them against each other (e.g. with cutechess-cli); a rough sketch of the idea follows.
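To make that plan concrete, here is a rough sketch (not the uci.js from this PR) of how the wasm engine could be bridged to stdin/stdout so that cutechess-cli can drive it as an ordinary UCI process. It assumes an addMessageListener(line => ...) callback for engine output; that interface is an assumption for illustration, not something shown in this thread.

// uci bridge sketch (illustration only, not part of this PR)
const readline = require("readline");

(async () => {
  const sf = await require("./stockfish")();
  // Forward every line of engine output to stdout for the GUI / cutechess-cli.
  sf.addMessageListener((line) => process.stdout.write(line + "\n"));
  // Forward every line from stdin (UCI commands) to the engine.
  const rl = readline.createInterface({ input: process.stdin });
  rl.on("line", (line) => sf.postMessage(line));
})();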

Thanks a lot for reading!

@niklasf (Member) commented Feb 10, 2021

Wow, amazing work!

Do you know if browsers have a good way to test for the required post MVP features? If so, it should be possible to get a test version out on Lichess pretty quickly.

@hi-ogawa (Author)

Thanks for the reply!

Do you know if browsers have a good way to test for the required post MVP features?

Probably, we have to check the wasm binary itself with WebAssembly.validate.
But, of course, we don't want to download the whole wasm file before knowing whether it runs, so one idea is to validate, before downloading, a small wasm binary containing just the SIMD opcode we want to use.
I think I can put together an example right away; let's see.

@hi-ogawa (Author)

I think I found a very quick way to check whether a certain wasm opcode is supported. It's something like this:

# Feature detection for `i32x4.dot_i16x8_s` (see `misc/feature_detection.wat` for how these bytes are generated)

/usr/bin/node --experimental-wasm-simd
> data = Uint8Array.from([0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3, 2, 1, 0, 7, 8, 1, 4, 116, 101, 115, 116, 0, 0, 10, 15, 1, 13, 0, 65, 0, 253, 17, 65, 0, 253, 17, 253, 186, 1, 11])
> WebAssembly.validate(data)
false

/usr/bin/node --experimental-wasm-simd --wasm-simd-post-mvp
> data = Uint8Array.from([0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3, 2, 1, 0, 7, 8, 1, 4, 116, 101, 115, 116, 0, 0, 10, 15, 1, 13, 0, 65, 0, 253, 17, 65, 0, 253, 17, 253, 186, 1, 11])
> WebAssembly.validate(data)
true
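Wrapped up as a helper, a page could run something like the following before fetching stockfish.wasm and fall back to a slower build when it returns false. This is just a sketch; the byte array is exactly the test module shown above.

// feature-detection helper sketch (bytes are the i32x4.dot_i16x8_s test module above)
function supportsWasmSimdDot() {
  const data = Uint8Array.from([
    0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3, 2, 1, 0, 7, 8, 1,
    4, 116, 101, 115, 116, 0, 0, 10, 15, 1, 13, 0, 65, 0, 253, 17, 65, 0, 253,
    17, 253, 186, 1, 11,
  ]);
  return WebAssembly.validate(data);
}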

@hi-ogawa (Author) commented Feb 11, 2021

I implemented uci.js to run this stockfish as a UCI engine executable.
Also, I added a new GitHub Actions workflow match.yml which runs a cutechess-cli tournament between two versions (NNUE and classical).
It's only triggered manually (workflow_dispatch mode), and I just started four tournaments on my repository (unfortunately, it cannot be run from the web interface unless it's on the master branch, so I need to trigger it via the REST API).
You can check progress and results here: https://github.com/hi-ogawa/stockfish.wasm/actions?query=workflow%3AMatch.
(GitHub CI is probably not a great place to run engine testing, but I've been using this approach for my own engine because I don't have a spare machine.)

The four tournament settings are the ones listed in the EDIT below, each with 100 rounds (200 games); openings are from noob_3moves.epd.

EDIT:

Here are the results of this very rudimentary test:

  • nnue(wasm_simd_post_mvp=yes) vs classical, tc=10+0.1 (https://github.com/hi-ogawa/stockfish.wasm/runs/1877640888)
    Score of nnue vs classical: 104 - 21 - 75 [0.708] 200
    ... nnue playing White: 57 - 12 - 31 [0.725] 100
    ... nnue playing Black: 47 - 9 - 44 [0.690] 100
    ... White vs Black: 66 - 59 - 75 [0.517] 200
    Elo difference: 153.4 +/- 39.3, LOS: 100.0 %, DrawRatio: 37.5 %

  • nnue(wasm_simd_post_mvp=yes) vs classical, tc=60+0.6 (https://github.com/hi-ogawa/stockfish.wasm/runs/1877641945)
    Score of nnue vs classical: 95 - 8 - 97 [0.718] 200
    ... nnue playing White: 54 - 3 - 43 [0.755] 100
    ... nnue playing Black: 41 - 5 - 54 [0.680] 100
    ... White vs Black: 59 - 44 - 97 [0.537] 200
    Elo difference: 161.9 +/- 34.0, LOS: 100.0 %, DrawRatio: 48.5 %

  • nnue(wasm_simd=yes) vs classical, tc=10+0.1 (https://github.com/hi-ogawa/stockfish.wasm/runs/1877643010)
    Score of nnue vs classical: 80 - 28 - 92 [0.630] 200
    ... nnue playing White: 40 - 13 - 47 [0.635] 100
    ... nnue playing Black: 40 - 15 - 45 [0.625] 100
    ... White vs Black: 55 - 53 - 92 [0.505] 200
    Elo difference: 92.5 +/- 35.6, LOS: 100.0 %, DrawRatio: 46.0 %

  • nnue(wasm_simd=yes) vs classical, tc=60+0.6 (https://github.com/hi-ogawa/stockfish.wasm/runs/1877643839)
    Score of nnue vs classical: 84 - 8 - 108 [0.690] 200
    ... nnue playing White: 43 - 3 - 54 [0.700] 100
    ... nnue playing Black: 41 - 5 - 54 [0.680] 100
    ... White vs Black: 48 - 44 - 108 [0.510] 200
    Elo difference: 139.0 +/- 31.7, LOS: 100.0 %, DrawRatio: 54.0 %

@hi-ogawa (Author) commented Feb 11, 2021

I was crawling through v8's code and found that __builtin_wasm_dot_s_i32x4_i16x8 is actually not "post mvp" anymore since v8 version 8.8.101 (v8/v8@01b8b3e).
On my machine, Node.js has v8 8.6.395.17 and Chromium has v8 8.8.278.15, which is why the fast version is able to run on Chromium.
Essentially, what I thought was a "post mvp" build might not be a "post mvp" build anymore, though this churn itself suggests that the feature is not really stable yet.

EDIT:

According to https://omahaproxy.appspot.com, which lists the latest Chrome versions and their corresponding v8 versions, it seems all platforms (except iOS, which doesn't use v8) should support the latest WASM SIMD features.

EDIT:

I confirmed that the build with wasm_simd_post_mvp=yes runs on my Android phone with the latest Chrome.

@hi-ogawa (Author) commented Feb 12, 2021

I made a little frontend, misc/uci.html, where you can run UCI commands for quick testing. I published it via my fork's GitHub Pages here: https://hi-ogawa.github.io/stockfish.wasm/misc/uci.html.

EDIT: Due to the security header issue, the page doesn't work on Firefox...

@niklasf (Member) commented Feb 20, 2021

Still very impressed with your work. I also see you have created a much cleaner port of Stockfish at https://github.com/hi-ogawa/Stockfish/tree/emscripten. To me it would make sense to stop maintaining this repository in favor of yours. Are you planning to publish it to npm?

@hi-ogawa (Author)

Thanks for noticing my port!
I was just experimenting with emscripten's ASYNCIFY and PROXY_TO_PTHREAD, and (somehow) managed to make it work. There's a slightly sketchy part in the UCI command communication, but I think (hope) it doesn't cause a problem.

Are you planning to publish it to npm?

I wasn't planning to make an npm package, but sure, I can do that.
After publishing the npm package from my repo, I will update lila's PR lichess-org/lila#8154 accordingly.
