mistralai/Mixtral-8x7B-v0.1 bfloat16 much slower than FP32 on Intel EMR CPU #30588

Open
badhri-intel opened this issue Apr 30, 2024 · 4 comments

badhri-intel commented Apr 30, 2024

System Info

transformers == 4.36.0
Hardware: Intel Emerald Rapids CPU
OS: CentOS Stream 8
Python: 3.9

Who can help?

@ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP32 baseline (uncomment to compare)
# model = AutoModelForCausalLM.from_pretrained(model_id)

# BF16
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

text = "Hello my name is"
inputs = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")

# Time five consecutive generations of 50 new tokens each.
for i in range(5):
    tic = time.time()
    outputs = model.generate(**inputs, max_new_tokens=50, use_cache=True)
    toc = time.time()
    print("Generation took :", toc - tic, "seconds")
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))



Expected behavior

With FP32, generation takes around 10 seconds, whereas with bfloat16 it takes around 20 seconds, even though bfloat16 would be expected to be at least as fast on this CPU.

Bfloat16 log

Setting pad_token_id to eos_token_id:2 for open-end generation.
Generation took : 21.25957155227661 seconds

Generated Text : Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying a degree in English Literature and Creative Writing at the University of Winchester. I have always had a passion for writing and I am hoping to pursue

FP32 log

Setting pad_token_id to eos_token_id:2 for open-end generation.
Generation took : 10.631409406661987 seconds

Generated Text : Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying a degree in English Literature and Creative Writing at the University of Winchester. I have always had a passion for writing and I am hoping to pursue
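
For reference, here is a minimal sketch that reproduces the comparison from a single script, timing both dtypes back-to-back with one warm-up generation each. It assumes the host has enough RAM to load the model sequentially in each dtype; the helper name time_generate is mine.

import gc
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello my name is", return_tensors="pt")

def time_generate(dtype):
    # Load the model in the requested dtype, then time one generation
    # after a short warm-up call.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    model.eval()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=5, use_cache=True)  # warm-up
        tic = time.time()
        model.generate(**inputs, max_new_tokens=50, use_cache=True)
        latency = time.time() - tic
    del model
    gc.collect()
    return latency

for dtype in (torch.float32, torch.bfloat16):
    print(dtype, ":", round(time_generate(dtype), 2), "seconds")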

@badhri-intel (Author)

This behavior is also observed for other LLMs loaded with AutoModelForCausalLM, such as EleutherAI/gpt-j-6b.

@amyeroberts (Collaborator)

@badhri-intel, thanks for opening this issue! I think you might have tagged the wrong @younesbelkada :)

@badhri-intel (Author)

@amyeroberts Thanks for the heads up. I have updated it now.

@github-actions github-actions bot closed this as completed Jun 8, 2024
@amyeroberts amyeroberts reopened this Jun 10, 2024
@huggingface huggingface deleted a comment from github-actions bot Jun 10, 2024
@younesbelkada (Contributor)

Hi @badhri-intel
Apologies for the late reply. I tried to reproduce locally without success: I am getting similar performance between fp32 and bf16 on a CPU. Some hardware does not support bfloat16 operations out of the box, so bf16 operations may need to be emulated (similar to what is done in optimum-quanto to emulate fp8 on devices that do not support fp8). Updating your PyTorch version might also help. I would also advise cross-posting this issue to intel-extension-for-pytorch or intel-extension-for-transformers.
Similar issue: intel/intel-extension-for-pytorch#178
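
A rough diagnostic sketch along those lines is below. It checks whether the kernel reports the avx512_bf16 and amx_bf16 CPU flags (which Emerald Rapids should expose) and then runs generation through ipex.optimize with CPU autocast. This is only a sketch, not a confirmed fix, and it assumes a Linux host (for /proc/cpuinfo) and that intel-extension-for-pytorch is installed.

import torch
import intel_extension_for_pytorch as ipex  # assumption: IPEX is installed
from transformers import AutoModelForCausalLM, AutoTokenizer

print("torch:", torch.__version__)
cpu_flags = open("/proc/cpuinfo").read()
# If these flags are missing, bf16 matmuls fall back to slower emulated paths.
print("avx512_bf16:", "avx512_bf16" in cpu_flags)
print("amx_bf16:", "amx_bf16" in cpu_flags)

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Let IPEX swap in bf16-optimized CPU kernels where it can.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Hello my name is", return_tensors="pt")
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    outputs = model.generate(**inputs, max_new_tokens=50, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))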
