
Extremely slow model inference for load_in_4bit #24502

Closed

cnut1648 opened this issue Jun 26, 2023 · 14 comments

@cnut1648

System Info

  • transformers version: 4.31.0.dev0
  • Platform: Linux-5.10.179-171.711.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.9.16
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Using load_in_4bit makes the model extremely slow (with accelerate 0.21.0.dev0 and bitsandbytes 0.39.1, which should be the latest versions; I installed them from source). I am using the following code:

from transformers import LlamaTokenizer, AutoModelForCausalLM, AutoTokenizer
import torch
from time import time

model_name = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="auto", trust_remote_code=True)

tokenizer.padding_side = "left"

# Define PAD Token = EOS Token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# use different length sentences to test batching
# measure time
start_time = time()
sentences = [
    "Hello, my dog is a little",
    "Today, I",
    "I love",
    "Repeat after me: I love you.",
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "You are a human that uses a lot of expletives. For each one of your responses to the user you use expletives liberally. If you find yourself questioning whether it’s appropriate or not you’re to say fuck that noise and keep using your expletives regardless. Human: Tell me a story. You:",
]

inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
inputs.pop("token_type_ids")

output_sequences = model.generate(**inputs, max_new_tokens=400, temperature=0.7)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
print("Elapsed time: ", time() - start_time)

This gives me 3138 seconds on 8xA100 40G GPUs.

Expected behavior

If I instead use the bf16 version, i.e. use this as the model init:

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

It gives 266 seconds, more than 10x faster. On the other hand, loading in 4-bit only cuts the memory footprint by 4x. I wonder if there are other things I should do to fully exploit the benefits of 4-bit. Right now the generation speed is not usable for real-time conversation. Thanks.
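
One knob that may be relevant here (an assumption on my side, not something confirmed in this thread): the 4-bit compute dtype defaults to float32 and can be set explicitly through BitsAndBytesConfig, which often speeds up 4-bit generation. A hedged sketch of an alternative model init:

# Alternative model init: control the 4-bit options explicitly via BitsAndBytesConfig.
# bnb_4bit_compute_dtype sets the dtype used for the matmuls around the 4-bit weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # default is float32, which is slower
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto", trust_remote_code=True
)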

@sgugger
Collaborator

sgugger commented Jun 26, 2023

cc @younesbelkada

@gante
Member

gante commented Jun 27, 2023

Hey @cnut1648 👋

We also had an internal user report the same issue; I'm currently exploring whether it comes from the text generation end or from the 4-bit end. Our internal user also reported that unbatched text generation worked fine (in terms of output quality and inference time), so you can try that route until this issue gets sorted :)
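
A minimal sketch of that unbatched route, assuming the model, tokenizer, and sentences objects from the reproduction snippet above (an illustration, not code taken from this thread):

# Generate one prompt at a time (batch size 1) instead of a padded batch.
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt").to(model.device)
    inputs.pop("token_type_ids", None)  # Falcon's tokenizer may return this key; generate() does not accept it
    output = model.generate(**inputs, max_new_tokens=400, temperature=0.7)
    print(tokenizer.decode(output[0], skip_special_tokens=True))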

cc @younesbelkada

@younesbelkada
Contributor

Hi @cnut1648
Thanks for bringing this discussion up.
Note that this is a more or less known issue; bitsandbytes is working on optimized 4-bit inference kernels that should be much faster than the current ones.
On the other hand, I believe there is high variance across devices; for example, this user: #23989 (comment) reports the same speed as bf16 using Falcon.
Do you face the same issue if you run your inference on a single A100?

@cnut1648
Author

Hey @gante, @younesbelkada, thanks! Excited to see how bnb 4-bit inference will accelerate generation. For unbatched inference (bsz=1) with multi-GPU, I tried it: it took more than 1 hour and only produced 4 out of 6 inputs, and I had to cut it short to save cost. As for a single A100 with 4-bit, I get:

  • batched: 3038 seconds, no big improvement
  • unbatched: again this goes over 1 hour

@BaileyWei

Actually, I had the same confusion: I used the load_in_4bit parameter and got 2-3x slower inference than full precision.

@gante
Member

gante commented Jun 28, 2023

@BaileyWei 2-3x slower is to be expected with load_in_4bit (vs 16-bit weights), on any model -- that's the current price of performing dynamic quantization :)

@gante
Member

gante commented Jun 28, 2023

@cnut1648 @younesbelkada

If we take the code example from @cnut1648 and play around with the following settings

  1. tiiuae/falcon-7b-instruct vs huggyllama/llama-7b (i.e. Falcon vs LLaMA)
  2. load_in_4bit=True vs torch_dtype=torch.bfloat16
  3. short prompts vs long prompts (e.g. first two vs last two in the code example)

We quickly conclude that the problem seems to be related to Falcon itself, not to the 4-bit part or to generate. In a nutshell, on my end load_in_4bit=True added a stable 4-5x slowdown vs torch_dtype=torch.bfloat16, but with Falcon the execution time grew very quickly with the sequence length (i.e. with the prompt size and with max_new_tokens) AND with the batch size. This does not happen with other models, and it explains the extremely slow execution times you're seeing -- especially in 4-bit format. I'm not sure if there are additional 4-bit-related issues that further explain what you're seeing, but the behavior described above is not normal.

As for solutions: currently, the Falcon code sits on the Hub, and we have a PR open to add it to transformers. If the issue is still present after the port is complete, we can dive deeper 🤗
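
A rough sketch of that comparison grid, with illustrative 7B checkpoints and a simple timing loop (an assumption of how such a comparison could be scripted, not the exact code that was run):

# Rough comparison grid: Falcon vs LLaMA, 4-bit vs bf16, short vs long prompt.
from time import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prompts = {
    "short": "Hello, my dog is a little",
    "long": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. " * 4,
}

for model_name in ("tiiuae/falcon-7b-instruct", "huggyllama/llama-7b"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for label, kwargs in (("4bit", {"load_in_4bit": True}), ("bf16", {"torch_dtype": torch.bfloat16})):
        model = AutoModelForCausalLM.from_pretrained(
            model_name, device_map="auto", trust_remote_code=True, **kwargs
        )
        for prompt_label, prompt in prompts.items():
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            inputs.pop("token_type_ids", None)  # Falcon's tokenizer returns this; generate() does not accept it
            start = time()
            model.generate(**inputs, max_new_tokens=200)
            print(f"{model_name} | {label} | {prompt_label}: {time() - start:.1f}s")
        del model
        torch.cuda.empty_cache()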

@cnut1648
Author

Thank you so much for this @gante!

@younesbelkada
Contributor

younesbelkada commented Jul 10, 2023

@cnut1648
Check out this tweet: https://twitter.com/Tim_Dettmers/status/1677826353457143808. You should be able to benefit from it out of the box just by updating bitsandbytes; can you quickly give it a try? 🙏
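
A quick sanity check before re-running, to confirm which bitsandbytes build is actually installed (a trivial sketch; the exact version that ships the new kernels is not stated in this thread):

# Confirm the installed bitsandbytes version before re-running the benchmark.
import bitsandbytes as bnb
print(bnb.__version__)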

@cnut1648
Author

cnut1648 commented Jul 14, 2023

Hmm @younesbelkada, I did a test run today using llama-65b and falcon-40b.
Since it seems that bnb 4-bit inference supports batch size = 1, I modified the code to this:

from transformers import LlamaTokenizer, AutoModelForCausalLM, AutoTokenizer
import torch
from time import time

# model_name = "tiiuae/falcon-40b-instruct"
model_name = "huggyllama/llama-65b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True, trust_remote_code=True)

tokenizer.padding_side = "left"

# Define PAD Token = EOS Token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# use different length sentences to test batching
# measure time
start_time = time()
sentences = [
    "Hello, my dog is a little",
    "Today, I",
    "I love",
    "Repeat after me: I love you.",
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "You are a human that uses a lot of expletives. For each one of your responses to the user you use expletives liberally. If you find yourself questioning whether it’s appropriate or not you’re to say fuck that noise and keep using your expletives regardless. Human: Tell me a story. You:",
]

for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", padding=True).to(model.device)
    # inputs.pop("token_type_ids")

    output_sequences = model.generate(**inputs, max_new_tokens=400, temperature=0.7)
    print(tokenizer.decode(output_sequences[0], skip_special_tokens=True))

print("Elapsed time: ", time() - start_time)

Essentially, for falcon-40b the issue still remains: the model in 4-bit is just extremely slow (2561s).
For llama, I get:

  • 4-bit: 566s
  • w/o 4-bit: 550s

So it seems there is no major speed benefit, but the memory usage did decrease.
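
One way to put a number on that memory difference is to record peak GPU usage around the generate call (a minimal sketch, assuming a single-GPU run with model and inputs already prepared):

# Record peak GPU memory around generation to quantify the 4-bit vs bf16 footprint.
import torch

torch.cuda.reset_peak_memory_stats()
output_sequences = model.generate(**inputs, max_new_tokens=400, temperature=0.7)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")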

@gante
Member

gante commented Jul 18, 2023

@cnut1648 the Falcon code on the Hub is known to be very slow, and that may explain the issue. We are about to release the transformers-side Falcon, so hopefully the problem will go away on its own soon 🤞

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Ali-Issa-aems

Hello @gante, what's the difference between load_in_4bit=True and torch_dtype=torch.bfloat16? Are they both quantisation techniques?

@gante
Member

gante commented Aug 31, 2023

@Ali-Issa-aems This guide answers all related questions: https://huggingface.co/docs/transformers/perf_infer_gpu_one
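
For a quick illustration of the difference: torch_dtype=torch.bfloat16 loads the full weights in half precision (no quantization), while load_in_4bit=True quantizes the weights to 4-bit via bitsandbytes. A minimal side-by-side sketch, with an example checkpoint name:

# Two different ways to reduce inference memory; only the second is quantization.
import torch
from transformers import AutoModelForCausalLM

model_name = "huggyllama/llama-7b"  # example checkpoint

# 1) Load the full weights in bfloat16 (half precision, no quantization).
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# 2) Quantize the weights to 4-bit on load via bitsandbytes.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, device_map="auto"
)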
