
Changes for prompt speedup for BF16 models - Both inputs as BF16 for mulmat operation #2

Open
wants to merge 9 commits into base: bf16
Conversation


@Srihari-mcw commented May 3, 2024

This PR contains a prompt-processing speedup for BF16 models.
The code is derived from the fork https://github.com/jart/llama.cpp/tree/unified.

The PR changes the second input (Btype) of the mulmat kernels to GGML_TYPE_BF16 and removes the GGML_TYPE_F32 case. The second operand (B) is therefore quantized to BF16 format, which in turn lets the dot product use BF16 intrinsics. A significant speedup was observed compared to the original version in the fork, where the second operand of the mulmat operation is kept in FP32 format. Results:

| model | size | params | backend | threads | test | t/s | speedup | commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B BF16 (without prompt speedup) | 12.55 GiB | 6.74 B | CPU | 6 | pp 512 | 39.111 ± 0.03 | - | 4e57aa6 |
| llama 7B BF16 (prompt speedup, both inputs BF16) | 12.55 GiB | 6.74 B | CPU | 6 | pp 512 | 103.343 ± 0.14 | 164.23% | b25ba28 |
| llama 7B BF16 (prompt speedup, BF16 x FP32, second input in FP32) | 12.55 GiB | 6.74 B | CPU | 6 | pp 512 | 45.126 ± 0.04 | 15.379% | 4e57aa6 |

The code was tested on an AMD Raphael 7600X machine, which has AVX512_BF16 support, running Linux. The original unquantized model is taken from https://huggingface.co/TheBloke/wizardLM-7B-HF. The changes from PR #1 are included while testing the changes here.
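For illustration only, here is a minimal sketch of how a BF16 x BF16 dot product can be written with AVX512_BF16 intrinsics. The actual kernels live in the ggml/sgemm code of the fork; the function name, loop structure, and alignment assumptions below are mine, not the PR's code.

```c
// Minimal sketch of a BF16 x BF16 dot product using AVX512_BF16
// (compile with e.g. -mavx512bf16). Assumes n is a multiple of 32 and
// that bf16 values are stored as raw uint16_t bit patterns.
#include <immintrin.h>
#include <stdint.h>

static float bf16_dot(const uint16_t *a, const uint16_t *b, int n) {
    __m512 acc = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 32) {
        // Each 512-bit load carries 32 bf16 values; reinterpret as __m512bh
        // (vector cast as supported by GCC/Clang).
        __m512bh va = (__m512bh)_mm512_loadu_si512((const void *)(a + i));
        __m512bh vb = (__m512bh)_mm512_loadu_si512((const void *)(b + i));
        // vdpbf16ps: multiply bf16 pairs and accumulate into fp32 lanes.
        acc = _mm512_dpbf16_ps(acc, va, vb);
    }
    return _mm512_reduce_add_ps(acc); // horizontal sum of the 16 fp32 lanes
}
```

The point of quantizing B to BF16 up front is that both operands can then feed the BF16 fused multiply-accumulate directly, instead of widening one side to FP32.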

jart and others added 9 commits May 1, 2024 09:59
Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32
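As a concrete sketch of that shift (assuming bf16 values are carried as raw uint16_t bit patterns; the helper name here is illustrative, not necessarily GGML's):

```c
#include <stdint.h>
#include <string.h>

// brain16 -> binary32: the bf16 bits become the upper half of the float32
// word, and the 16 low mantissa bits are zero-filled.
static float bf16_to_fp32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);   // bit-for-bit reinterpretation, no UB
    return f;
}
```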

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16,
which in practice ends up being 99.71% of Mistral 7b v0.2's weights;
however, there is currently no way other than fp32 to represent the others.

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16
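A small check of that range mismatch (a sketch, not part of the patch; it assumes a compiler and target that provide _Float16):

```c
#include <stdio.h>

int main(void) {
    // 2^100 is a power of two, so it is exactly representable in bf16
    // (8 exponent bits, like float32), but it lies far beyond the fp16
    // range, whose largest finite value is 65504.
    float x = 0x1p100f;
    _Float16 y = (_Float16)x;   // narrows to +inf under round-to-nearest
    printf("fp32: %g  fp16: %g\n", (double)x, (double)y);
    return 0;
}
```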

This change fixes that by adding a bf16 data type to GGML. Support
for CPU inference has been implemented, along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b v0.2
improves by somewhere around 0.0024 to 0.0046 compared to using fp16.