
performance expectations #4

Open
chadkirby opened this issue Apr 18, 2024 · 5 comments

@chadkirby

First, thanks for putting this project together!

I modified examples/basic/index.html to use a more capable model: https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/resolve/main/gemma-2b-it-q4_k_m.gguf, which is 1.5gb.

Using LM Studio on my laptop (with GPU Acceleration disabled), I get roughly 25 tokens per second from gemma-2b-it-q4_k_m.gguf.

Running examples/basic/index.html in Chrome 124 on my laptop, I get roughly 6-7 tokens per second from gemma-2b-it-q4_k_m.gguf. (Similar performance in Edge 123.)

Generally, the wasm bindings seem roughly 3-4x slower than native. Is that more or less expected? Are there any wllama knobs I can twiddle to improve performance?

@ngxson
Owner

ngxson commented Apr 19, 2024

It is expected: WebAssembly SIMD only supports the equivalent of AVX instructions, not AVX2. This is likely the biggest performance bottleneck at the moment.

Another issue is that we're using emscripten's non-native exception handler, which maintains support for older browsers but comes with a small performance cost. We may move to the native exception handler in the future.

Edit: it seems most mainstream browser versions already support native Wasm exceptions (see here), so it's safe to enable them. Support will be added in the next build of wllama.

@ngxson
Owner

ngxson commented Apr 21, 2024

v1.6.0 now uses the native exception handler via `-fwasm-exceptions`. Here is the browser support matrix: https://webassembly.org/features/
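For context, the change amounts to swapping which exception-handling mode emscripten compiles with. A minimal sketch of the relevant flags (the source and output file names here are placeholders, not wllama's actual build setup):

```shell
# Old approach: JavaScript-based exception handling.
# Broad browser support, but adds runtime overhead on every try/catch boundary:
#   emcc main.cpp -fexceptions -o wllama.js

# v1.6.0+: native Wasm exception handling (faster, requires a recent browser):
emcc main.cpp -fwasm-exceptions -o wllama.js
```

Both `-fexceptions` and `-fwasm-exceptions` are real emscripten flags; the trade-off is exactly the one described above (compatibility vs. speed).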

@iSuslov

iSuslov commented May 12, 2024

Hey @chadkirby, out of curiosity, have you tried on latest version with native exception handler?

@chadkirby
Author

> Hey @chadkirby, out of curiosity, have you tried on latest version with native exception handler?

I did. IIRC, I saw a modest performance improvement, but wasm speed was still roughly 3x slower than native.

@felladrin
Contributor

One important consideration is that certain browsers, such as Brave, may alter the value of navigator.hardwareConcurrency to prevent fingerprinting.

As a result, the browser may have been using only 2 threads, leading to slow inference.

Using 8 threads has resulted in satisfactory performance for the Phi-3 model (demo recording: minisearch-phi-3-wllama.mp4).
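The fingerprinting issue above suggests not trusting `navigator.hardwareConcurrency` blindly. A minimal sketch of picking a thread count, with a user override for browsers that spoof the value; note that `pickThreadCount` is a hypothetical helper, and the `n_threads` option name in the usage comment is an assumption to be checked against the wllama docs:

```javascript
// Sketch: choose a thread count for inference.
// Browsers like Brave may report a spoofed navigator.hardwareConcurrency
// (often 2) to prevent fingerprinting, so prefer an explicit user override.
function pickThreadCount(reportedConcurrency, userOverride) {
  if (Number.isInteger(userOverride) && userOverride > 0) {
    return userOverride;
  }
  // Fall back to the reported value; if it looks spoofed or missing
  // (<= 2), default to 4 threads rather than crippling inference.
  const reported = Number.isInteger(reportedConcurrency) ? reportedConcurrency : 0;
  return reported > 2 ? reported : 4;
}

// Browser usage (option name `n_threads` is an assumption):
// const nThreads = pickThreadCount(navigator.hardwareConcurrency, 8);
// await wllama.loadModelFromUrl(modelUrl, { n_threads: nThreads });
```

The override-first design lets an app expose a "threads" setting so users on privacy-hardened browsers can reclaim full performance.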

@ngxson ngxson pinned this issue May 21, 2024