System Info
transformers==4.36.0
Hardware: Intel Emerald Rapids CPU
OS: CentOS Stream 8
Python: 3.9
Who can help?
@ArthurZucker and @younesbelkada
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP32
# model = AutoModelForCausalLM.from_pretrained(model_id)
# BF16
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

text = "Hello my name is"
inputs = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")

# Time five generation runs of 50 new tokens each
for i in range(5):
    tic = time.time()
    outputs = model.generate(**inputs, max_new_tokens=50, use_cache=True)
    toc = time.time()
    print("Generation took :", toc - tic, "seconds")
    print("Generated Text :", tokenizer.decode(outputs[0], skip_special_tokens=True))
Expected behavior
BFloat16 was expected to be at least as fast as FP32, but the opposite is observed: FP32 generation takes around 10 seconds, whereas BFloat16 takes around 20 seconds.
BFloat16 log
Setting pad_token_id to eos_token_id:2 for open-end generation.
Generation took : 21.25957155227661 seconds
Generated Text : Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying a degree in English Literature and Creative Writing at the University of Winchester. I have always had a passion for writing and I am hoping to pursue
FP32 log
Setting pad_token_id to eos_token_id:2 for open-end generation.
Generation took : 10.631409406661987 seconds
Generated Text : Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying a degree in English Literature and Creative Writing at the University of Winchester. I have always had a passion for writing and I am hoping to pursue
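One way to check whether the slowdown comes from the CPU's bf16 kernels rather than from transformers itself is a plain matmul micro-benchmark. A minimal sketch (not part of the original report; the matrix size and iteration count are arbitrary):

import time

import torch

def bench_matmul(dtype, n=4096, iters=10):
    # Multiply two n x n matrices of the given dtype and return the
    # average time per multiplication, excluding a one-time warm-up.
    a = torch.randn(n, n, dtype=dtype)
    b = torch.randn(n, n, dtype=dtype)
    torch.mm(a, b)  # warm-up
    tic = time.time()
    for _ in range(iters):
        torch.mm(a, b)
    return (time.time() - tic) / iters

for dtype in (torch.float32, torch.bfloat16):
    print(dtype, f"{bench_matmul(dtype):.4f} s/iter")

On hardware with native bf16 support (AMX or AVX512-BF16), the bf16 matmul should not be slower than fp32; if it is, the regression sits below the transformers layer.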
Hi @badhri-intel
Apologies for the late reply. I tried to reproduce this locally without success: I get similar performance between fp32 and bf16 on CPU. Some hardware does not support bfloat16 operations natively, and in that case bf16 operations have to be emulated (similar to what is done in optimum-quanto to emulate fp8 on devices that do not support fp8), which can make them slower than fp32. Updating your PyTorch version might also help. I would also advise cross-posting this issue in intel-extension-for-pytorch or intel-extension-for-transformers.
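For reference, here is one way to check whether the CPU advertises native bf16 instructions (a sketch assuming Linux; amx_bf16 and avx512_bf16 are the standard /proc/cpuinfo flag names, and get_cpu_capability is only available on recent PyTorch):

import torch

# Without amx_bf16 / avx512_bf16, bf16 matmuls can fall back to slower
# conversion-based code paths on CPU.
with open("/proc/cpuinfo") as f:
    flags = f.read()
for flag in ("avx512_bf16", "amx_bf16", "amx_tile"):
    print(flag, "present" if flag in flags else "absent")

# Recent PyTorch also reports the ISA level its CPU kernels dispatch to.
cpu_backend = getattr(torch.backends, "cpu", None)
if cpu_backend is not None and hasattr(cpu_backend, "get_cpu_capability"):
    print("torch CPU capability:", cpu_backend.get_cpu_capability())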
Similar issue: intel/intel-extension-for-pytorch#178
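If intel-extension-for-pytorch is installed, something along these lines enables its bf16 path (a sketch based on the documented ipex.optimize API; not verified on this exact model):

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# ipex.optimize applies CPU-specific operator and layout optimizations;
# dtype=torch.bfloat16 selects its bf16 kernels (AMX/AVX512-BF16 where available).
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Hello my name is", return_tensors="pt")
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))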