With other methods of running LLMs using fp16, or quantization down to 4-bit/5-bit/8-bit, I'm wondering whether the web demo could be made faster and smaller in the future with quantization, or at least fp16.
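For context on what 4-bit quantization trades away, here is a minimal NumPy sketch loosely modeled on ggml's q4_0 scheme (blocks of 32 weights, one fp16 scale per block). The block size, scale convention, and rounding are assumptions for illustration, not the demo's actual implementation:

```python
import numpy as np

def quantize_q4_0(weights: np.ndarray, block_size: int = 32):
    """Quantize a 1-D float array into 4-bit blocks, q4_0-style:
    each block keeps one fp16 scale plus block_size 4-bit codes."""
    blocks = weights.reshape(-1, block_size)
    # Per-block scale: map the largest-magnitude weight onto the 4-bit range
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = amax / 8.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scale) + 8, 0, 15).astype(np.uint8)
    return q, scale.astype(np.float16)

def dequantize_q4_0(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate fp32 weights from 4-bit codes and per-block scales."""
    return ((q.astype(np.float32) - 8) * scale.astype(np.float32)).ravel()

# Effective cost is ~4.5 bits/weight: 32*4 bits of codes + 16 bits of scale per block
w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q4_0(w)
w_hat = dequantize_q4_0(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

The upshot: memory (and bandwidth) drops roughly 4x versus fp16 at the cost of a small per-weight reconstruction error, which is why the format is attractive for browser-hosted demos.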
Thanks for the suggestion. We are testing fp16 correctness and speed internally and will make it public soon.
I'm also wondering what the web demo uses currently. The download size is similar to a q4_0 ggml model, so is it running 4-bit? I couldn't find any specific info on what precision you're using.
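A rough bytes-per-parameter calculation can hint at the precision from the download size alone. This is a back-of-envelope sketch: the 7B parameter count and the effective bit widths below are assumptions, not confirmed details of the demo:

```python
# Hypothetical parameter count; adjust n_params to match the actual demo weights.
n_params = 7e9
# Quantized formats carry per-block scales, hence the fractional effective bit widths.
bits_per_weight = {"fp16": 16, "int8": 8, "q5": 5.5, "q4_0": 4.5}
for fmt, bits in bits_per_weight.items():
    print(f"{fmt}: ~{n_params * bits / 8 / 2**30:.1f} GiB")
```

Under these assumptions, a 7B model comes out to roughly 13 GiB at fp16 versus about 3.7 GiB at q4_0, so a download in the 3-4 GiB range would be consistent with 4-bit weights.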