
Extremely slow model inference for load_in_4bit #24502

Closed

cnut1648 opened this issue Jun 26, 2023 · 14 comments

@cnut1648

System Info

  • transformers version: 4.31.0.dev0
  • Platform: Linux-5.10.179-171.711.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.9.16
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Using load_in_4bit makes the model extremely slow (with accelerate 0.21.0.dev0 and bitsandbytes 0.39.1, which should be the latest versions; I installed them from source). I am using the following code:

from transformers import LlamaTokenizer, AutoModelForCausalLM, AutoTokenizer
import torch
from time import time

model_name = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="auto", trust_remote_code=True)

tokenizer.padding_side = "left"

# Define PAD Token = EOS Token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# use different length sentences to test batching
# measure time
start_time = time()
sentences = [
    "Hello, my dog is a little",
    "Today, I",
    "I love",
    "Repeat after me: I love you.",
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "You are a human that uses a lot of expletives. For each one of your responses to the user you use expletives liberally. If you find yourself questioning whether it’s appropriate or not you’re to say fuck that noise and keep using your expletives regardless. Human: Tell me a story. You:",
]

inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
inputs.pop("token_type_ids")

output_sequences = model.generate(**inputs, max_new_tokens=400, temperature=0.7)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
print("Elapsed time: ", time() - start_time)

This gives me 3138 seconds on 8xA100 40G GPUs.

Expected behavior

If I instead use the bf16 version, i.e. use this as the model init:

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

It gives 266 seconds, more than 10x faster. On the other hand, loading in 4-bit only cuts the memory footprint by 4x. I wonder if there are other things I should do to fully exploit the benefits of 4-bit. Right now the generation speed is not usable for real-time conversation. Thanks.
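
One knob that may be relevant here (an assumption on my side, not something confirmed in this thread): the 4-bit compute dtype defaults to float32 and can be set explicitly through BitsAndBytesConfig, which often speeds up 4-bit generation. A hedged sketch of an alternative model init:

# Alternative model init: control the 4-bit options explicitly via BitsAndBytesConfig.
# bnb_4bit_compute_dtype sets the dtype used for the matmuls around the 4-bit weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # default is float32, which is slower
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto", trust_remote_code=True
)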

@sgugger
Collaborator

sgugger commented Jun 26, 2023

cc @younesbelkada

@gante
Member

gante commented Jun 27, 2023

Hey @cnut1648 👋

We also had an internal user report the same issue; I'm currently exploring whether it comes from the text generation end or from the 4-bit end. Our internal user also reported that unbatched text generation worked fine (in terms of output quality and inference time), so you can try that route until this issue gets sorted :)
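
A minimal sketch of that unbatched route, assuming the model, tokenizer, and sentences objects from the reproduction snippet above (an illustration, not code taken from this thread):

# Generate one prompt at a time (batch size 1) instead of a padded batch.
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt").to(model.device)
    inputs.pop("token_type_ids", None)  # Falcon's tokenizer may return this key; generate() does not accept it
    output = model.generate(**inputs, max_new_tokens=400, temperature=0.7)
    print(tokenizer.decode(output[0], skip_special_tokens=True))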

cc @younesbelkada

@younesbelkada
Contributor

Hi @cnut1648
Thanks for bringing this discussion up.
Note that this is a more or less known issue; bitsandbytes is working on optimized 4-bit inference kernels that should be much faster than the current ones.
On the other hand, I believe there is high variance across devices; for example, this user: #23989 (comment) reports the same speed as bf16 using Falcon.
Do you face the same issue if you run your inference on a single A100?

@cnut1648
Author

Hey @gante, @younesbelkada, thanks! Excited to see how bnb 4-bit inference will accelerate generation. For unbatched inference (bsz=1) with multi-GPU, I tried it: it took more than 1 hour and only produced 4 out of 6 inputs, and I had to cut it short to save cost. As for a single A100 with 4-bit, I get:

  • batched: 3038 seconds, no big improvement
  • unbatched: again this goes over 1 hour

@BaileyWei

Actually, I had the same confusion: I used the load_in_4bit parameter and got 2-3x slower inference than full precision.

@gante
Member

gante commented Jun 28, 2023

@BaileyWei 2-3x slower is to be expected with load_in_4bit (vs 16-bit weights), on any model -- that's the current price of performing dynamic quantization :)

@gante
Member

gante commented Jun 28, 2023

@cnut1648 @younesbelkada

If we take the code example from @cnut1648 and play around with the following settings

  1. tiiuae/falcon-7b-instruct vs huggyllama/llama-7b (i.e. Falcon vs LLaMA)
  2. load_in_4bit=True vs torch_dtype=torch.bfloat16
  3. short prompts vs long prompts (e.g. first two vs last two in the code example)

We quickly conclude that the problem seems to be related to Falcon itself, not to the 4-bit part or to generate. In a nutshell, on my end load_in_4bit=True added a stable 4-5x slowdown vs torch_dtype=torch.bfloat16, but with Falcon the execution time grew very quickly with the sequence length (i.e. with the prompt size and with max_new_tokens) AND with the batch size. This does not happen with other models, and it explains the extremely slow execution times you're seeing -- especially in 4-bit format. I'm not sure if there are additional 4-bit-related issues that further explain what you're seeing, but the behavior described above is not normal.

As for solutions: currently, the Falcon code sits on the Hub, and we have a PR open to add it to transformers. If the issue is still present after the port is complete, we can dive deeper 🤗
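
A rough sketch of that comparison grid, with illustrative 7B checkpoints and a simple timing loop (an assumption of how such a comparison could be scripted, not the exact code that was run):

# Rough comparison grid: Falcon vs LLaMA, 4-bit vs bf16, short vs long prompt.
from time import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prompts = {
    "short": "Hello, my dog is a little",
    "long": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. " * 4,
}

for model_name in ("tiiuae/falcon-7b-instruct", "huggyllama/llama-7b"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for label, kwargs in (("4bit", {"load_in_4bit": True}), ("bf16", {"torch_dtype": torch.bfloat16})):
        model = AutoModelForCausalLM.from_pretrained(
            model_name, device_map="auto", trust_remote_code=True, **kwargs
        )
        for prompt_label, prompt in prompts.items():
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            inputs.pop("token_type_ids", None)  # Falcon's tokenizer returns this; generate() does not accept it
            start = time()
            model.generate(**inputs, max_new_tokens=200)
            print(f"{model_name} | {label} | {prompt_label}: {time() - start:.1f}s")
        del model
        torch.cuda.empty_cache()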

@cnut1648
Author

Thank you so much for this @gante!

@younesbelkada
Contributor

younesbelkada commented Jul 10, 2023

@cnut1648
Check out this tweet: https://twitter.com/Tim_Dettmers/status/1677826353457143808. You should be able to benefit from it out of the box just by updating bitsandbytes; can you quickly give it a try? 🙏
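
A quick sanity check before re-running, to confirm which bitsandbytes build is actually installed (a trivial sketch; the exact version that ships the new kernels is not stated in this thread):

# Confirm the installed bitsandbytes version before re-running the benchmark.
import bitsandbytes as bnb
print(bnb.__version__)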

@cnut1648
Author

cnut1648 commented Jul 14, 2023

Hmm @younesbelkada, I did a test run today using llama-65b and falcon-40b.
Since it seems that bnb 4-bit inference supports batch size = 1, I modified the code to this:

from transformers import LlamaTokenizer, AutoModelForCausalLM, AutoTokenizer
import torch
from time import time

# model_name = "tiiuae/falcon-40b-instruct"
model_name = "huggyllama/llama-65b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True, trust_remote_code=True)

tokenizer.padding_side = "left"

# Define PAD Token = EOS Token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# use different length sentences to test batching
# measure time
start_time = time()
sentences = [
    "Hello, my dog is a little",
    "Today, I",
    "I love",
    "Repeat after me: I love you.",
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "You are a human that uses a lot of expletives. For each one of your responses to the user you use expletives liberally. If you find yourself questioning whether it’s appropriate or not you’re to say fuck that noise and keep using your expletives regardless. Human: Tell me a story. You:",
]

for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", padding=True).to(model.device)
    # inputs.pop("token_type_ids")

    output_sequences = model.generate(**inputs, max_new_tokens=400, temperature=0.7)
    print(tokenizer.decode(output_sequences[0], skip_special_tokens=True))

print("Elapsed time: ", time() - start_time)

Essentially, for falcon-40b the issue still remains: the model in 4-bit is just extremely slow (2561s).
For llama, I get:

  • 4-bit: 566s
  • w/o 4-bit: 550s

So it seems there is no major speed benefit, but the memory usage did decrease.
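
One way to put a number on that memory difference is to record peak GPU usage around the generate call (a minimal sketch, assuming a single-GPU run with model and inputs already prepared):

# Record peak GPU memory around generation to quantify the 4-bit vs bf16 footprint.
import torch

torch.cuda.reset_peak_memory_stats()
output_sequences = model.generate(**inputs, max_new_tokens=400, temperature=0.7)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")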

@gante
Member

gante commented Jul 18, 2023

@cnut1648 the Falcon code on the Hub is known to be very slow, and that may explain the issue. We are about to release the transformers-side Falcon, so hopefully the problem will go away on its own soon 🤞

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Ali-Issa-aems

Hello @gante, what's the difference between load_in_4bit=True and torch_dtype=torch.bfloat16? Are they both quantisation techniques?

@gante
Member

gante commented Aug 31, 2023

@Ali-Issa-aems This guide answers all related questions: https://huggingface.co/docs/transformers/perf_infer_gpu_one
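
For a quick illustration of the difference: torch_dtype=torch.bfloat16 loads the full weights in half precision (no quantization), while load_in_4bit=True quantizes the weights to 4-bit via bitsandbytes. A minimal side-by-side sketch, with an example checkpoint name:

# Two different ways to reduce inference memory; only the second is quantization.
import torch
from transformers import AutoModelForCausalLM

model_name = "huggyllama/llama-7b"  # example checkpoint

# 1) Load the full weights in bfloat16 (half precision, no quantization).
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# 2) Quantize the weights to 4-bit on load via bitsandbytes.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, device_map="auto"
)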
