
Simple Bug in modeling_attn_mask_utils.py #28317

@Adam1679

Description

System Info

(torch) (base) anxiang.zhang@n214-176-142:~/DeepSeek-Coder$ transformers-cli env
WARNING:tensorflow:From /data02/home/anxiang.zhang/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/commands/env.py:100: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.config.list_physical_devices('GPU') instead.
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • transformers version: 4.36.0
  • Platform: Linux-5.4.56.bsk.9-amd64-x86_64-with-glibc2.28
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.1
  • Safetensors version: 0.4.1
  • Accelerate version: 0.25.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2 (True)
  • Tensorflow version (GPU?): 2.9.3 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.7.4 (cpu)
  • Jax version: 0.4.18
  • JaxLib version: 0.4.18
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

In transformers/modeling_attn_mask_utils.py, line 238, the code is:

```python
tmp = torch.arange(attention_mask.shape[1], 0, -1)
indices = torch.argmax(attention_mask.cpu() * tmp, 1, keepdim=True)
```

Calling attention_mask.cpu() here is clearly an error when the global default tensor type is not a CPU type. For example, after torch.set_default_tensor_type(torch.cuda.HalfTensor), torch.arange(attention_mask.shape[1], 0, -1) returns a tensor on CUDA instead of CPU, so multiplying the CPU mask by the CUDA tensor raises a device-mismatch error.
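
A minimal reproduction, assuming the default-device behavior described above (the mask values are made up for illustration; only the devices matter):

```python
import torch

# Make factory functions such as torch.arange allocate on CUDA by default.
torch.set_default_tensor_type(torch.cuda.HalfTensor)

# Illustrative 2x3 padding mask; any attention mask on CUDA will do.
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]], device="cuda")

# The two lines from modeling_attn_mask_utils.py:
tmp = torch.arange(attention_mask.shape[1], 0, -1)  # now allocated on CUDA
indices = torch.argmax(attention_mask.cpu() * tmp, 1, keepdim=True)
# -> RuntimeError: expected all tensors to be on the same device
```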

A simple fix would be to replace attention_mask.cpu() with attention_mask.to(tmp.device), as sketched below.
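
A sketch of the patched lines under that proposal; it keeps both operands on whatever device torch.arange picks, so it works with either a CPU or CUDA default tensor type:

```python
tmp = torch.arange(attention_mask.shape[1], 0, -1)
# Follow tmp's device instead of forcing CPU, so the multiply never mixes devices.
indices = torch.argmax(attention_mask.to(tmp.device) * tmp, 1, keepdim=True)
```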

Expected behavior

The code should work regardless of the global default tensor type; replacing attention_mask.cpu() with attention_mask.to(tmp.device) achieves this.
