Starcoder/GPTBigCode has broken beam search when converted to an ONNX Runtime model #28996

@lidingsnyk

Description

Not sure whether the root cause of the issue is in huggingface/transformers or in huggingface/optimum's ONNX Runtime integration, but posting it here in case people have more context. Apologies if this ends up being noise for this forum.

System Info

transformers version: 4.37.2
optimum[onnxruntime-gpu]==1.16.2 
onnxruntime-gpu==1.17.0
Platform: linux_x86_64 cp310 ubuntu-22.04
Python version: 3.10
Huggingface_hub version: 0.20.3
Safetensors version: 0.4.2
Accelerate version: 0.26.1
Accelerate config: not found
PyTorch version (GPU?): 2.1.2 (True) torch-2.1.2-cu118-cp310-cp310-linux_x86_64.whl
Tensorflow version (GPU?): not installed
Flax version (CPU?/GPU?/TPU?): not installed
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes. A100
CUDA_VERSION: 11.8.0
Using distributed or parallel set-up in script?: yes (deepspeed 0.11.2)

Who can help?

@amyeroberts @pacman100 @JingyaHuang @younesbelkada @michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

GPTBigCode is listed as a supported model for ONNX Runtime. After loading bigcode/starcoderbase-3b as optimum.onnxruntime.modeling_decoder.ORTGPTBigCodeForCausalLM, inference runs fine for greedy search (num_beams = 1) but crashes for beam search (num_beams > 1) with the following stacktrace:

2024-02-08 15:37:47 [info     ] loaded the model as ORTModel   model_type=<class 'optimum.onnxruntime.modeling_decoder.ORTGPTBigCodeForCausalLM'>
2024-02-08 15:37:47 [warning  ] switching the tokenizer padding side from 'right' to 'left' for a causal LM
2024-02-08 15:37:51.218582394 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 36 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-02-08 15:37:51.237708474 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-02-08 15:37:51.237736553 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-02-08 15:37:55 [info     ] Successfully created the pipeline device=cuda:0
Traceback (most recent call last):
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/text_generation.py", line 219, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/base.py", line 1143, in __call__
    outputs = list(final_iterator)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/opt/tools/redacted/__main__/pantscode/redacted/ml_inference/inference_pipeline.py", line 66, in _forward
    return super()._forward(model_inputs, **generate_kwargs)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/text_generation.py", line 295, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/opt/tools/redacted/ml_deps_torch/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/generation/utils.py", line 1558, in generate
    return self.beam_search(
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/generation/utils.py", line 2940, in beam_search
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/opt/tools/redacted/ml_deps_optimum/site-packages/optimum/onnxruntime/modeling_decoder.py", line 684, in prepare_inputs_for_generation
    past_length = past_key_values[0].shape[1]
AttributeError: 'tuple' object has no attribute 'shape'
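
Reading the failing line, prepare_inputs_for_generation appears to assume past_key_values[0] is a single tensor (GPTBigCode's fused multi-query cache, where dim 1 is the past sequence length), while beam search apparently hands it a per-layer (key, value) tuple. A minimal sketch of the mismatch; the shapes are illustrative assumptions, not values dumped from the model:

    import torch

    # Fused GPTBigCode-style cache entry: one tensor per layer,
    # shaped (batch, past_len, 2 * head_dim); dim 1 is the past length.
    fused = torch.zeros(1, 7, 256)
    print(fused.shape[1])  # 7 -> what past_length should be

    # Legacy per-layer cache entry: a (key, value) tuple of tensors.
    # Per the traceback, this is what beam search passed in.
    legacy = (torch.zeros(1, 16, 7, 16), torch.zeros(1, 16, 7, 16))
    # legacy.shape[1]  # AttributeError: 'tuple' object has no attribute 'shape'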

The model is loaded as ORTModelForCausalLM instead of transformers.AutoModelForCausalLM, but is still run through a transformers.TextGenerationPipeline:

    from optimum.onnxruntime import ORTModelForCausalLM

    model = ORTModelForCausalLM.from_pretrained(
        model_path,
        use_io_binding=True,
        provider="CUDAExecutionProvider",
        use_cache=True,
        export=True,
    )
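
For completeness, a minimal sketch of the pipeline invocation that triggers the crash; the prompt and generation parameters here are illustrative placeholders, not the actual script:

    from transformers import AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.padding_side = "left"  # matches the warning in the log above
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    pipe("def hello_world():", num_beams=1, max_new_tokens=16)  # works
    pipe("def hello_world():", num_beams=4, max_new_tokens=16)  # raises the AttributeError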

Not every model has this bug, but GPTBigCode is perhaps not the only one affected, as mentioned in another GitHub issue. Since the bug was never acknowledged in that issue, I'm wondering whether it was posted in the right place.

Expected behavior

When num_beams > 1, inference should not crash with AttributeError: 'tuple' object has no attribute 'shape'.
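
Until this is resolved upstream, a purely illustrative stopgap (not a vetted fix) would be to compute the past length defensively for both cache layouts; the dimension indices below are assumptions based on the shapes sketched earlier:

    def past_length_from_cache(past_key_values):
        first = past_key_values[0]
        if isinstance(first, (tuple, list)):
            # legacy (key, value) layout, e.g. (batch, num_heads, past_len, head_dim)
            return first[0].shape[-2]
        # fused GPTBigCode layout: (batch, past_len, 2 * head_dim)
        return first.shape[1]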
