
Changes for prompt speedup for BF16 models - Both inputs as BF16 for mulmat operation #2

Open
wants to merge 9 commits into base: bf16
Conversation


@Srihari-mcw commented May 3, 2024

This PR contains a prompt-processing speedup for BF16 models.
The code is derived from the fork https://github.com/jart/llama.cpp/tree/unified.

The PR changes the second input (Btype) of the mulmat kernels to GGML_TYPE_BF16 and removes the GGML_TYPE_F32 case. The second operand (B) is therefore quantized to BF16 format, which in turn lets the dot product use BF16 intrinsics. A significant speedup was observed compared to the original version in the fork, where the second operand of the mulmat operation is kept in FP32 format. Results:

| model | size | params | backend | threads | test | t/s | speedup | commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B BF16 (without prompt speedup) | 12.55 GiB | 6.74 B | CPU | 6 | pp 512 | 39.111 ± 0.03 | - | 4e57aa6 |
| llama 7B BF16 (prompt speedup, both inputs BF16) | 12.55 GiB | 6.74 B | CPU | 6 | pp 512 | 103.343 ± 0.14 | 164.23% | b25ba28 |
| llama 7B BF16 (prompt speedup, BF16 x FP32, second input in FP32) | 12.55 GiB | 6.74 B | CPU | 6 | pp 512 | 45.126 ± 0.04 | 15.379% | 4e57aa6 |

The code was tested on an AMD Raphael 7600X machine, which has AVX512_BF16 support, running Linux. The original unquantized model is taken from https://huggingface.co/TheBloke/wizardLM-7B-HF. The changes from PR #1 are included while testing the changes here.
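For illustration only, here is a minimal sketch of how a BF16 x BF16 dot product can be written with AVX512_BF16 intrinsics. The actual kernels live in the ggml/sgemm code of the fork; the function name, loop structure, and alignment assumptions below are mine, not the PR's code.

```c
// Minimal sketch of a BF16 x BF16 dot product using AVX512_BF16
// (compile with e.g. -mavx512bf16). Assumes n is a multiple of 32 and
// that bf16 values are stored as raw uint16_t bit patterns.
#include <immintrin.h>
#include <stdint.h>

static float bf16_dot(const uint16_t *a, const uint16_t *b, int n) {
    __m512 acc = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 32) {
        // Each 512-bit load carries 32 bf16 values; reinterpret as __m512bh
        // (vector cast as supported by GCC/Clang).
        __m512bh va = (__m512bh)_mm512_loadu_si512((const void *)(a + i));
        __m512bh vb = (__m512bh)_mm512_loadu_si512((const void *)(b + i));
        // vdpbf16ps: multiply bf16 pairs and accumulate into fp32 lanes.
        acc = _mm512_dpbf16_ps(acc, va, vb);
    }
    return _mm512_reduce_add_ps(acc); // horizontal sum of the 16 fp32 lanes
}
```

The point of quantizing B to BF16 up front is that both operands can then feed the BF16 fused multiply-accumulate directly, instead of widening one side to FP32.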

jart and others added 9 commits May 1, 2024 09:59
Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32
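As a concrete sketch of that shift (assuming bf16 values are carried as raw uint16_t bit patterns; the helper name here is illustrative, not necessarily GGML's):

```c
#include <stdint.h>
#include <string.h>

// brain16 -> binary32: the bf16 bits become the upper half of the float32
// word, and the 16 low mantissa bits are zero-filled.
static float bf16_to_fp32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);   // bit-for-bit reinterpretation, no UB
    return f;
}
```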

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16,
which in practice ends up being 99.71% of Mistral 7b v0.2's weights;
however, there is currently no way other than fp32 to represent the others.

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16
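A small check of that range mismatch (a sketch, not part of the patch; it assumes a compiler and target that provide _Float16):

```c
#include <stdio.h>

int main(void) {
    // 2^100 is a power of two, so it is exactly representable in bf16
    // (8 exponent bits, like float32), but it lies far beyond the fp16
    // range, whose largest finite value is 65504.
    float x = 0x1p100f;
    _Float16 y = (_Float16)x;   // narrows to +inf under round-to-nearest
    printf("fp32: %g  fp16: %g\n", (double)x, (double)y);
    return 0;
}
```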

This change fixes that by adding a bf16 data type to GGML. Support
for CPU inference has been implemented, along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b v0.2
improves by somewhere around 0.0024 to 0.0046 compared to using fp16.