mistralai/Mixtral-8x7B-v0.1 bfloat16 much slower than FP32 on Intel EMR CPU #30588

Open
badhri-intel opened this issue Apr 30, 2024 · 4 comments

badhri-intel commented Apr 30, 2024

System Info

transformers == 4.36.0
Hardware: Intel Emerald Rapids CPU
OS: CentOS Stream 8
Python: 3.9

Who can help?

@ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP32 baseline (uncomment to compare)
# model = AutoModelForCausalLM.from_pretrained(model_id)

# BF16
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

text = "Hello my name is"
inputs = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")

# Time five consecutive generations of 50 new tokens each.
for i in range(5):
    tic = time.time()
    outputs = model.generate(**inputs, max_new_tokens=50, use_cache=True)
    toc = time.time()
    print("Generation took :", toc - tic, "seconds")
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))



Expected behavior

With FP32, generation takes around 10 seconds, whereas with bfloat16 it takes around 20 seconds, even though bfloat16 would be expected to be at least as fast on this CPU.

Bfloat16 log

Setting pad_token_id to eos_token_id:2 for open-end generation.
Generation took : 21.25957155227661 seconds

Generated Text : Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying a degree in English Literature and Creative Writing at the University of Winchester. I have always had a passion for writing and I am hoping to pursue

FP32 log

Setting pad_token_id to eos_token_id:2 for open-end generation.
Generation took : 10.631409406661987 seconds

Generated Text : Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying a degree in English Literature and Creative Writing at the University of Winchester. I have always had a passion for writing and I am hoping to pursue
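
For reference, here is a minimal sketch that reproduces the comparison from a single script, timing both dtypes back-to-back with one warm-up generation each. It assumes the host has enough RAM to load the model sequentially in each dtype; the helper name time_generate is mine.

import gc
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello my name is", return_tensors="pt")

def time_generate(dtype):
    # Load the model in the requested dtype, then time one generation
    # after a short warm-up call.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    model.eval()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=5, use_cache=True)  # warm-up
        tic = time.time()
        model.generate(**inputs, max_new_tokens=50, use_cache=True)
        latency = time.time() - tic
    del model
    gc.collect()
    return latency

for dtype in (torch.float32, torch.bfloat16):
    print(dtype, ":", round(time_generate(dtype), 2), "seconds")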

@badhri-intel (Author)

This behavior is also observed for other LLMs loaded with AutoModelForCausalLM, such as EleutherAI/gpt-j-6b.

@amyeroberts (Collaborator)

@badhri-intel, thanks for opening this issue! I think you might have tagged the wrong @younesbelkada :)

@badhri-intel (Author)

@amyeroberts Thanks for the heads up. I have updated it now.

@github-actions github-actions bot closed this as completed Jun 8, 2024
@amyeroberts amyeroberts reopened this Jun 10, 2024
@huggingface huggingface deleted a comment from github-actions bot Jun 10, 2024
@younesbelkada (Contributor)

Hi @badhri-intel
Apologies for the late reply. I tried to reproduce locally without success: I am getting similar performance between fp32 and bf16 on a CPU. Some hardware does not support bfloat16 operations out of the box, so bf16 operations may need to be emulated (similar to what is done in optimum-quanto to emulate fp8 on devices that do not support fp8). Updating your PyTorch version might also help. I would also advise cross-posting this issue to intel-extension-for-pytorch or intel-extension-for-transformers.
Similar issue: intel/intel-extension-for-pytorch#178
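
A rough diagnostic sketch along those lines is below. It checks whether the kernel reports the avx512_bf16 and amx_bf16 CPU flags (which Emerald Rapids should expose) and then runs generation through ipex.optimize with CPU autocast. This is only a sketch, not a confirmed fix, and it assumes a Linux host (for /proc/cpuinfo) and that intel-extension-for-pytorch is installed.

import torch
import intel_extension_for_pytorch as ipex  # assumption: IPEX is installed
from transformers import AutoModelForCausalLM, AutoTokenizer

print("torch:", torch.__version__)
cpu_flags = open("/proc/cpuinfo").read()
# If these flags are missing, bf16 matmuls fall back to slower emulated paths.
print("avx512_bf16:", "avx512_bf16" in cpu_flags)
print("amx_bf16:", "amx_bf16" in cpu_flags)

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Let IPEX swap in bf16-optimized CPU kernels where it can.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Hello my name is", return_tensors="pt")
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    outputs = model.generate(**inputs, max_new_tokens=50, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))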
