With other methods of running LLMs using fp16, or quantization down to 4-bit/5-bit/8-bit, I'm wondering whether the web demo could be made faster and smaller in the future with quantization, or at least fp16.
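For context on what 4-bit quantization trades away, here is a minimal NumPy sketch loosely modeled on ggml's q4_0 scheme (blocks of 32 weights, one fp16 scale per block). The block size, scale convention, and rounding are assumptions for illustration, not the demo's actual implementation:

```python
import numpy as np

def quantize_q4_0(weights: np.ndarray, block_size: int = 32):
    """Quantize a 1-D float array into 4-bit blocks, q4_0-style:
    each block keeps one fp16 scale plus block_size 4-bit codes."""
    blocks = weights.reshape(-1, block_size)
    # Per-block scale: map the largest-magnitude weight onto the 4-bit range
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = amax / 8.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scale) + 8, 0, 15).astype(np.uint8)
    return q, scale.astype(np.float16)

def dequantize_q4_0(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate fp32 weights from 4-bit codes and per-block scales."""
    return ((q.astype(np.float32) - 8) * scale.astype(np.float32)).ravel()

# Effective cost is ~4.5 bits/weight: 32*4 bits of codes + 16 bits of scale per block
w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q4_0(w)
w_hat = dequantize_q4_0(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

The upshot: memory (and bandwidth) drops roughly 4x versus fp16 at the cost of a small per-weight reconstruction error, which is why the format is attractive for browser-hosted demos.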
Thanks for the suggestion. We are testing fp16 correctness and speed internally and will make it public soon.
I'm also wondering what the web demo uses currently. The download size is similar to a q4_0 ggml model, so is it running 4-bit? I couldn't find any specific info on what precision you're using.
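A rough bytes-per-parameter calculation can hint at the precision from the download size alone. This is a back-of-envelope sketch: the 7B parameter count and the effective bit widths below are assumptions, not confirmed details of the demo:

```python
# Hypothetical parameter count; adjust n_params to match the actual demo weights.
n_params = 7e9
# Quantized formats carry per-block scales, hence the fractional effective bit widths.
bits_per_weight = {"fp16": 16, "int8": 8, "q5": 5.5, "q4_0": 4.5}
for fmt, bits in bits_per_weight.items():
    print(f"{fmt}: ~{n_params * bits / 8 / 2**30:.1f} GiB")
```

Under these assumptions, a 7B model comes out to roughly 13 GiB at fp16 versus about 3.7 GiB at q4_0, so a download in the 3-4 GiB range would be consistent with 4-bit weights.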