
BatchFeature performance improvement: convert List[np.ndarray] to np.ndarray before converting to pytorch tensors #14307

Closed
eladsegal opened this issue Nov 7, 2021 · 8 comments · Fixed by #14306

Comments

@eladsegal
Contributor

🚀 Feature request

@NielsRogge, @sgugger
When using a FeatureExtractor for images and passing a List[np.ndarray] with return_tensors="pt", the following warning is emitted:

.../lib/python3.8/site-packages/transformers/feature_extraction_utils.py:158: 
UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. 
Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. 
(Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:201.)
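
For example, a minimal repro sketch (any image feature extractor behaves the same; a ViT checkpoint is used here purely as an example):

import numpy as np
from transformers import ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")

# a list of fake images as separate numpy arrays (HWC, uint8)
images = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

# passing a List[np.ndarray] with return_tensors="pt" triggers the UserWarning above
inputs = feature_extractor(images, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([16, 3, 224, 224])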

As reported in pytorch/pytorch#13918, a significant performance improvement can be obtained by calling torch.tensor on a numpy.ndarray instead of on a List[numpy.ndarray].
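
A rough illustration of the difference (shapes and batch size are arbitrary, just large enough to make the slowdown visible):

import time

import numpy as np
import torch

# a batch of 64 fake images as a list of separate numpy arrays
arrays = [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(64)]

start = time.perf_counter()
torch.tensor(arrays)            # slow path, triggers the UserWarning
print(f"List[np.ndarray] -> tensor: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
torch.tensor(np.array(arrays))  # stack into one ndarray first, then convert
print(f"np.ndarray -> tensor:       {time.perf_counter() - start:.3f}s")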

I think a possible solution would be #14306:

elif tensor_type == TensorType.PYTORCH:
    if not is_torch_available():
        raise ImportError("Unable to convert output to PyTorch tensors format, PyTorch is not installed.")
    import torch

    def as_tensor(value):
        # stack a List[np.ndarray] into a single np.ndarray before handing it to torch.tensor
        return torch.tensor(value if not isinstance(value, list) else np.array(value))

    is_tensor = torch.is_tensor
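
As a rough sanity check of the idea (a standalone sketch, not the exact PR code), the helper handles both a single array and a list of arrays without hitting the slow path:

import numpy as np
import torch

def as_tensor(value):
    return torch.tensor(value if not isinstance(value, list) else np.array(value))

single = np.zeros((3, 224, 224), dtype=np.float32)
batch = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(8)]

print(as_tensor(single).shape)  # torch.Size([3, 224, 224])
print(as_tensor(batch).shape)   # torch.Size([8, 3, 224, 224]), no UserWarning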

@NielsRogge
Contributor

Thanks for reporting this. Could it be that PyTorch only added this warning in 1.10?

@eladsegal
Contributor Author

Yes, the problem is longstanding but the warning is new in 1.10. Here's the commit where it was added:
pytorch/pytorch@5a00152

@Alex-ley

Alex-ley commented Feb 18, 2022

I am getting the same warning on this line with v4.16.2:
https://github.com/huggingface/transformers/blob/v4.16.2/src/transformers/tokenization_utils_base.py#L707

presumably stemming from these lines, which look identical to those in @eladsegal's PR above:
https://github.com/huggingface/transformers/blob/v4.16.2/src/transformers/tokenization_utils_base.py#L677-L683

@NielsRogge
Contributor

NielsRogge commented Jul 8, 2022

@sgugger this warning is also triggered when using the Trainer at:

/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py:131: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:210.)
  batch[k] = torch.tensor([f[k] for f in features])

I'm using PyTorch 1.11 and Transformers v4.20.1
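
In the meantime, a rough workaround sketch (not a library fix; collate_numpy_features is a made-up helper and it assumes every feature under a key is a numpy array of the same shape) is to stack into one ndarray before building the tensor and pass the helper to the Trainer as data_collator:

import numpy as np
import torch

def collate_numpy_features(features):
    # features: a list of dicts mapping column names to equal-shaped numpy arrays
    batch = {}
    for key in features[0]:
        # stack into a single np.ndarray first to avoid the slow per-element path
        batch[key] = torch.tensor(np.array([f[key] for f in features]))
    return batch

# e.g. Trainer(model=..., args=..., train_dataset=..., data_collator=collate_numpy_features)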

@FlorinAndrei

@sgugger I'm getting lots and lots of this warning all the time, which makes troubleshooting pretty hard. The Jupyter interface struggles because the output grows very large after a while.

transformers/data/data_collator.py:131: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:201.)
  batch[k] = torch.tensor([f[k] for f in features])

Python 3.10.6
Pytorch 1.12.1+cu116
Transformers 4.23.1

@sgugger
Collaborator

sgugger commented Oct 24, 2022

There is no point in commenting on a resolved issue without providing a code reproducer. You should open a new issue and follow the template :-)

@Alex-ley

@sgugger but what if the issue turns out to be only partially resolved? I think the lines I linked show that the PR may have fixed only one occurrence of this issue and missed others. Do you think it is better to open a new issue in that case rather than re-open the original one?

@sgugger
Collaborator

sgugger commented Apr 12, 2023

You should definitely open a new one with a code sample that shows the problem: tokenizers do not return NumPy arrays but lists of token IDs, so even if the line is the same as the one touched by this PR, it doesn't mean there is a problem to fix.
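
For what it's worth, here is a minimal sketch of the distinction: the warning only fires when the sequence contains numpy arrays, not for plain Python ints, which is what tokenizers return.

import numpy as np
import torch

token_ids = [[101, 2023, 102], [101, 2003, 102]]           # what tokenizers return
pixel_arrays = [np.zeros((3, 2, 2)), np.zeros((3, 2, 2))]  # a list of numpy arrays

torch.tensor(token_ids)     # no warning: plain Python ints
torch.tensor(pixel_arrays)  # emits the "list of numpy.ndarrays" UserWarning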
