Starcoder/GPTBigCode has broken beam search when converted to an ONNX Runtime model #28996

@lidingsnyk

Description

Not sure whether the root cause of the issue is in huggingface/transformers or in huggingface/optimum's ONNX Runtime integration, but posting it here in case people have more context. Apologies if this ends up being noise for this forum.

System Info

transformers version: 4.37.2
optimum[onnxruntime-gpu]==1.16.2 
onnxruntime-gpu==1.17.0
Platform: linux_x86_64 cp310 ubuntu-22.04
Python version: 3.10
Huggingface_hub version: 0.20.3
Safetensors version: 0.4.2
Accelerate version: 0.26.1
Accelerate config: not found
PyTorch version (GPU?): 2.1.2 (True) torch-2.1.2-cu118-cp310-cp310-linux_x86_64.whl
Tensorflow version (GPU?): not installed
Flax version (CPU?/GPU?/TPU?): not installed
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes. A100
CUDA_VERSION: 11.8.0
Using distributed or parallel set-up in script?: yes (deepspeed 0.11.2)

Who can help?

@amyeroberts @pacman100 @JingyaHuang @younesbelkada @michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

GPTBigCode is listed as a supported model for ONNX Runtime. After loading bigcode/starcoderbase-3b as optimum.onnxruntime.modeling_decoder.ORTGPTBigCodeForCausalLM, inference runs fine for greedy search (num_beams = 1) but crashes for beam search (num_beams > 1) with the following stacktrace:

2024-02-08 15:37:47 [info     ] loaded the model as ORTModel   model_type=<class 'optimum.onnxruntime.modeling_decoder.ORTGPTBigCodeForCausalLM'>
2024-02-08 15:37:47 [warning  ] switching the tokenizer padding side from 'right' to 'left' for a causal LM
2024-02-08 15:37:51.218582394 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 36 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-02-08 15:37:51.237708474 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-02-08 15:37:51.237736553 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-02-08 15:37:55 [info     ] Successfully created the pipeline device=cuda:0
Traceback (most recent call last):
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/text_generation.py", line 219, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/base.py", line 1143, in __call__
    outputs = list(final_iterator)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/opt/tools/redacted/__main__/pantscode/redacted/ml_inference/inference_pipeline.py", line 66, in _forward
    return super()._forward(model_inputs, **generate_kwargs)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/pipelines/text_generation.py", line 295, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/opt/tools/redacted/ml_deps_torch/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/generation/utils.py", line 1558, in generate
    return self.beam_search(
  File "/opt/tools/redacted/ml_deps_transformers/site-packages/transformers/generation/utils.py", line 2940, in beam_search
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/opt/tools/redacted/ml_deps_optimum/site-packages/optimum/onnxruntime/modeling_decoder.py", line 684, in prepare_inputs_for_generation
    past_length = past_key_values[0].shape[1]
AttributeError: 'tuple' object has no attribute 'shape'
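
Reading the failing line, prepare_inputs_for_generation appears to assume past_key_values[0] is a single tensor (GPTBigCode's fused multi-query cache, where dim 1 is the past sequence length), while beam search apparently hands it a per-layer (key, value) tuple. A minimal sketch of the mismatch; the shapes are illustrative assumptions, not values dumped from the model:

    import torch

    # Fused GPTBigCode-style cache entry: one tensor per layer,
    # shaped (batch, past_len, 2 * head_dim); dim 1 is the past length.
    fused = torch.zeros(1, 7, 256)
    print(fused.shape[1])  # 7 -> what past_length should be

    # Legacy per-layer cache entry: a (key, value) tuple of tensors.
    # Per the traceback, this is what beam search passed in.
    legacy = (torch.zeros(1, 16, 7, 16), torch.zeros(1, 16, 7, 16))
    # legacy.shape[1]  # AttributeError: 'tuple' object has no attribute 'shape'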

The model is loaded as ORTModelForCausalLM instead of transformers.AutoModelForCausalLM, but is still run through a transformers.TextGenerationPipeline:

    from optimum.onnxruntime import ORTModelForCausalLM

    model = ORTModelForCausalLM.from_pretrained(
        model_path,
        use_io_binding=True,
        provider="CUDAExecutionProvider",
        use_cache=True,
        export=True,
    )
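
For completeness, a minimal sketch of the pipeline invocation that triggers the crash; the prompt and generation parameters here are illustrative placeholders, not the actual script:

    from transformers import AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.padding_side = "left"  # matches the warning in the log above
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    pipe("def hello_world():", num_beams=1, max_new_tokens=16)  # works
    pipe("def hello_world():", num_beams=4, max_new_tokens=16)  # raises the AttributeError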

Not every model has this bug, but GPTBigCode is perhaps not the only one affected, as mentioned in another GitHub issue. Since the bug was never acknowledged in that issue, I'm wondering whether it was posted in the right place.

Expected behavior

When num_beams > 1, inference should not crash with AttributeError: 'tuple' object has no attribute 'shape'.
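
Until this is resolved upstream, a purely illustrative stopgap (not a vetted fix) would be to compute the past length defensively for both cache layouts; the dimension indices below are assumptions based on the shapes sketched earlier:

    def past_length_from_cache(past_key_values):
        first = past_key_values[0]
        if isinstance(first, (tuple, list)):
            # legacy (key, value) layout, e.g. (batch, num_heads, past_len, head_dim)
            return first[0].shape[-2]
        # fused GPTBigCode layout: (batch, past_len, 2 * head_dim)
        return first.shape[1]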
