Description
I'm not sure whether the root cause of this issue is in huggingface/transformers or huggingface/onnxruntime, but I'm posting it here in case people have more context. Sorry if this ends up being noise for this forum.
System Info
transformers version: 4.37.2
optimum[onnxruntime-gpu]==1.16.2
onnxruntime-gpu==1.17.0
Platform: linux_x86_64 cp310 ubuntu-22.04
Python version: 3.10
Huggingface_hub version: 0.20.3
Safetensors version: 0.4.2
Accelerate version: 0.26.1
Accelerate config: not found
PyTorch version (GPU?): 2.1.2 (True) torch-2.1.2-cu118-cp310-cp310-linux_x86_64.whl
Tensorflow version (GPU?): not installed
Flax version (CPU?/GPU?/TPU?): not installed
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes. A100
CUDA_VERSION: 11.8.0
Using distributed or parallel set-up in script?: yes (deepspeed 0.11.2)
Who can help?
@amyeroberts @pacman100 @JingyaHuang @younesbelkada @michaelbenayoun
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
GPTBigCode is listed as a supported model under onnxruntime. After loading bigcode/starcoderbase-3b as optimum.onnxruntime.modeling_decoder.ORTGPTBigCodeForCausalLM, inference runs fine for greedy search (num_beams = 1) but crashes for beam search (num_beams > 1) with the following stack trace:
2024-02-08 15:37:47 [info ] loaded the model as ORTModel model_type=<class 'optimum.onnxruntime.modeling_decoder.ORTGPTBigCodeForCausalLM'>
2024-02-08 15:37:47 [warning ] switching the tokenizer padding side from 'right' to 'left' for a causal LM
2024-02-08 15:37:51.218582394 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 36 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-02-08 15:37:51.237708474 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-02-08 15:37:51.237736553 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-02-08 15:37:55 [info ] Successfully created the pipeline device=cuda:0
Traceback (most recent call last):
File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/text_generation.py", line 219, in __call__
return super().__call__(text_inputs, **kwargs)
File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/base.py", line 1143, in __call__
outputs = list(final_iterator)
File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
item = next(self.iterator)
File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
processed = self.infer(item, **self.params)
File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/base.py", line 1068, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "/opt/tools/redacted/__main__/pantscode/redacted/ml_inference/inference_pipeline.py", line 66, in _forward
return super()._forward(model_inputs, **generate_kwargs)
File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/text_generation.py", line 295, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
File "/opt/tools/redacted/ml_deps_torch/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/generation/utils.py", line 1558, in generate
return self.beam_search(
File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/generation/utils.py", line 2940, in beam_search
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
File "/opt/tools/redacted/ml_deps_optimum/site-packages/optimum/onnxruntime/modeling_decoder.py", line 684, in prepare_inputs_for_generation
past_length = past_key_values[0].shape[1]
AttributeError: 'tuple' object has no attribute 'shape'
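To make the failing line easier to reason about, here is a minimal sketch that reproduces the same AttributeError without any model. It assumes (this is a guess, not something confirmed by the traceback) that the code expects one fused key/value tensor per layer, while beam search ends up passing one tuple per layer. `FakeTensor` and `past_length_fused` are hypothetical names used only for illustration, and the shapes are made up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FakeTensor:
    """Hypothetical stand-in for a torch.Tensor; it only carries a shape."""
    shape: Tuple[int, ...]


def past_length_fused(past_key_values):
    # Same expression as the failing line in the traceback.
    return past_key_values[0].shape[1]


# Assumed layout A: one fused key/value tensor per layer.
fused = (FakeTensor((4, 7, 256)), FakeTensor((4, 7, 256)))

# Assumed layout B: one tuple of tensors per layer.
nested = ((FakeTensor((4, 7, 128)), FakeTensor((4, 7, 128))),)

print(past_length_fused(fused))  # works: element 0 has a .shape

try:
    past_length_fused(nested)
except AttributeError as exc:
    # Element 0 is a plain tuple here, so .shape does not exist.
    print(exc)
```

If this guess is right, whatever turns the per-layer entries into tuples between beam search steps would be the place to look, but I have not verified that.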
The model is loaded as ORTModelForCausalLM instead of transformers.AutoModelForCausalLM, but it is still used with transformers.TextGenerationPipeline:
from optimum import onnxruntime

model = onnxruntime.ORTModelForCausalLM.from_pretrained(
    model_path,
    use_io_binding=True,
    provider="CUDAExecutionProvider",
    use_cache=True,
    export=True,
)
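For anyone trying to reproduce this, a rough end-to-end sketch could look like the following. It is an approximation rather than my exact script: it uses the stock transformers pipeline instead of my custom pipeline subclass, and the function name `run_beam_search_repro`, the prompt, and the values of `max_new_tokens` and `num_beams` are placeholders I picked for illustration.

```python
def run_beam_search_repro(model_path: str, num_beams: int = 4):
    """Load the model through Optimum and generate with beam search.

    Assumes optimum[onnxruntime-gpu], transformers and a CUDA GPU are available.
    """
    # Heavy imports stay local so that defining this function needs no GPU stack.
    from optimum.onnxruntime import ORTModelForCausalLM
    from transformers import AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = ORTModelForCausalLM.from_pretrained(
        model_path,
        use_io_binding=True,
        provider="CUDAExecutionProvider",
        use_cache=True,
        export=True,
    )

    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        device="cuda:0",
    )

    # num_beams=1 (greedy search) works for me; num_beams > 1 is what crashes.
    return generator("def fibonacci(n):", max_new_tokens=16, num_beams=num_beams)
```

Calling `run_beam_search_repro("bigcode/starcoderbase-3b", num_beams=1)` should correspond to the working greedy case, and any `num_beams` above 1 to the failing one.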
Not every model has this bug, but GPTBigCode may not be the only one affected, as mentioned in another GitHub issue. Since the bug has not been acknowledged in that issue, I'm wondering whether it was posted in the correct place.
Expected behavior
When num_beams > 1, inference should not crash with AttributeError: 'tuple' object has no attribute 'shape'.