[bnb] Fix blip2 4bit #23895
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
This feels a tiny bit brittle. Are you sure it doesn't break any other model quantization?
It should be all good, I have verified that the slow tests pass for 8-bit and 4-bit. Let me know if there is anything in particular I should have a look at. Per my understanding this only affects Blip2, as it is the only model (from what I know) that has an lm head as part of a submodule.
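As an aside, a minimal sketch (not part of the PR or its test suite) of how one might check this locally: it assumes a model has already been loaded in 4-bit (for example via the Blip2 script further down this thread) and that bitsandbytes is installed. If the fix works, the lm_head should not be a bitsandbytes Linear4bit layer.

import bitsandbytes as bnb

# Print every module whose name ends in "lm_head" and whether it was
# converted to a 4-bit bitsandbytes layer (it should not have been).
for name, module in model.named_modules():
    if name.split(".")[-1] == "lm_head":
        print(name, type(module).__name__, isinstance(module, bnb.nn.Linear4bit))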
Hmm, getting some gibberish output with the fix, need to investigate more.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I know this was closed, but I'm getting the following error:
Runner: Docker container using the python:3.9 image.
Usage:
Model class:
Complete traceback:
Hi @kevinknights29, I see that in your script you are trying to load in 8-bit and in 4-bit at the same time. Please select only one option:

return AutoModelForCausalLM.from_pretrained(
    self.model_path,
    config=self.config,
    trust_remote_code=True,
    # either remove the load_in_8bit arg ...
    load_in_8bit=True,
    device_map="cpu",
    torch_dtype=torch.float16,
    # ... or remove quantization_config
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        llm_int8_enable_fp32_cpu_offload=True,
    ),
)

On my side, I was able to execute the following script:
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor, BitsAndBytesConfig
from PIL import Image
import requests
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.float16
)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b-coco")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-6.7b-coco", device_map='auto', quantization_config=nf4_config)
def prepare_img():
url = "https://huggingface.co/hf-internal-testing/blip-test-image/resolve/main/demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)
return image
image = prepare_img()
inputs = processor(images=[image, image], return_tensors="pt").to(dtype=torch.float16)
predictions = model.generate(**inputs, num_beams=2)
print(processor.batch_decode(predictions, skip_special_tokens=True)[0].strip())
# print -> a woman sitting on the beach with her dog
I am getting a similar error while using the Llama 2 7B model, and I am using the latest version of transformers. Here is the code:
I also tried 4.34, and it doesn't work either. Besides that, I checked #23895, and it doesn't look like it is in any of the release branches or the master branch. Here is the error I get:

FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.
Cell In[17], line 13, in generate_text(prompt)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/transformers/generation/utils.py:1345, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/transformers/generation/utils.py:644, in GenerationMixin._prepare_encoder_decoder_kwargs_for_generation(self, inputs_tensor, model_kwargs, model_input_name)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module..new_forward(*args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:1094, in T5Stack.forward(self, input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask, inputs_embeds, head_mask, cross_attn_head_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module..new_forward(*args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:694, in T5Block.forward(self, hidden_states, attention_mask, position_bias, encoder_hidden_states, encoder_attention_mask, encoder_decoder_position_bias, layer_head_mask, cross_attn_layer_head_mask, past_key_value, use_cache, output_attentions, return_dict)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module..new_forward(*args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:601, in T5LayerSelfAttention.forward(self, hidden_states, attention_mask, position_bias, layer_head_mask, past_key_value, use_cache, output_attentions)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module..new_forward(*args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:520, in T5Attention.forward(self, hidden_states, mask, key_value_states, position_bias, past_key_value, layer_head_mask, query_length, use_cache, output_attentions)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module..new_forward(*args, **kwargs)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:256, in Linear4bit.forward(self, x)
File /opt/conda/envs/domino-ray/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:566, in matmul_4bit(A, B, quant_state, out, bias)
AssertionError:
Found the issue: changing device_map to 'auto' fixes it. Can anyone explain why?
It should be:
The transformers and accelerate versions are as below:
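For readers hitting the same assertion, here is a minimal sketch of the kind of change that comment describes, assuming a 4-bit bitsandbytes load of an encoder-decoder checkpoint like the T5-based model in the traceback above; the checkpoint name and config values are placeholders, not the commenter's actual script. The error message itself asks for the LinearFP4 layer to be moved to a device first, which device_map="auto" takes care of during loading.

import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

# Placeholder 4-bit config; adjust to your own settings.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-large",      # placeholder checkpoint, not the commenter's model
    quantization_config=quant_config,
    device_map="auto",           # dispatches the 4-bit layers onto the GPU so their
                                 # quantization state gets initialized, per the error:
                                 # "call .cuda() or .to(device) on the LinearFP4 layer first"
)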
What does this PR do?
Fixes #23839
Indeed, for models such as Blip2 that have the lm head inside a submodule (and not directly at the top level of the model), the lm head does get converted to 4-bit / 8-bit, leading to unexpected behavior for 4-bit models. The PR fixes this by making sure to consider only the last term after . when creating modules_not_to_convert.
cc @sgugger
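A minimal sketch (not the PR's actual diff) of the matching rule described above: compare only the last component of each dotted module name against modules_not_to_convert, so a nested language_model.lm_head in Blip2 is skipped just like a top-level lm_head. The module paths below are illustrative.

modules_not_to_convert = ["lm_head"]

def keep_in_full_precision(module_name, skip_list=modules_not_to_convert):
    # Match on the last component of the dotted path so that nested heads
    # such as "language_model.lm_head" are also kept out of 4-bit / 8-bit.
    return module_name.split(".")[-1] in skip_list

print(keep_in_full_precision("lm_head"))                 # True (e.g. a top-level OPT head)
print(keep_in_full_precision("language_model.lm_head"))  # True (e.g. Blip2's nested head)
print(keep_in_full_precision("language_model.model.decoder.layers.0.self_attn.q_proj"))  # False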