
BatchFeature performance improvement: convert List[np.ndarray] to np.ndarray before converting to pytorch tensors #14307

Closed
eladsegal opened this issue Nov 7, 2021 · 8 comments · Fixed by #14306

Comments

@eladsegal
Contributor

🚀 Feature request

@NielsRogge, @sgugger
When using a FeatureExtractor for images and passing a List[np.ndarray] with return_tensors="pt", the following warning is emitted:

.../lib/python3.8/site-packages/transformers/feature_extraction_utils.py:158: 
UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. 
Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. 
(Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:201.)
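
For example, a minimal repro sketch (any image feature extractor behaves the same; a ViT checkpoint is used here purely as an example):

import numpy as np
from transformers import ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")

# a list of fake images as separate numpy arrays (HWC, uint8)
images = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

# passing a List[np.ndarray] with return_tensors="pt" triggers the UserWarning above
inputs = feature_extractor(images, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([16, 3, 224, 224])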

As reported in pytorch/pytorch#13918, a significant performance improvement can be obtained by calling torch.tensor on a numpy.ndarray instead of on a List[numpy.ndarray].
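
A rough illustration of the difference (shapes and batch size are arbitrary, just large enough to make the slowdown visible):

import time

import numpy as np
import torch

# a batch of 64 fake images as a list of separate numpy arrays
arrays = [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(64)]

start = time.perf_counter()
torch.tensor(arrays)            # slow path, triggers the UserWarning
print(f"List[np.ndarray] -> tensor: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
torch.tensor(np.array(arrays))  # stack into one ndarray first, then convert
print(f"np.ndarray -> tensor:       {time.perf_counter() - start:.3f}s")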

I think a possible solution would be #14306:

elif tensor_type == TensorType.PYTORCH:
    if not is_torch_available():
        raise ImportError("Unable to convert output to PyTorch tensors format, PyTorch is not installed.")
    import torch

    def as_tensor(value):
        # stack a List[np.ndarray] into a single np.ndarray before handing it to torch.tensor
        return torch.tensor(value if not isinstance(value, list) else np.array(value))

    is_tensor = torch.is_tensor
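
As a rough sanity check of the idea (a standalone sketch, not the exact PR code), the helper handles both a single array and a list of arrays without hitting the slow path:

import numpy as np
import torch

def as_tensor(value):
    return torch.tensor(value if not isinstance(value, list) else np.array(value))

single = np.zeros((3, 224, 224), dtype=np.float32)
batch = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(8)]

print(as_tensor(single).shape)  # torch.Size([3, 224, 224])
print(as_tensor(batch).shape)   # torch.Size([8, 3, 224, 224]), no UserWarning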

@NielsRogge
Contributor

Thanks for reporting this. Could it be that PyTorch only added this warning in 1.10?

@eladsegal
Contributor Author

Yes, the problem is longstanding but the warning is new in 1.10. Here's the commit where it was added:
pytorch/pytorch@5a00152

@Alex-ley

Alex-ley commented Feb 18, 2022

I am getting the same warning on this line with v4.16.2:
https://github.com/huggingface/transformers/blob/v4.16.2/src/transformers/tokenization_utils_base.py#L707

presumably stemming from these lines, which look identical to those in @eladsegal's PR above:
https://github.com/huggingface/transformers/blob/v4.16.2/src/transformers/tokenization_utils_base.py#L677-L683

@NielsRogge
Contributor

NielsRogge commented Jul 8, 2022

@sgugger this warning is also triggered when using the Trainer at:

/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py:131: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:210.)
  batch[k] = torch.tensor([f[k] for f in features])

I'm using PyTorch 1.11 and Transformers v4.20.1
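
In the meantime, a rough workaround sketch (not a library fix; collate_numpy_features is a made-up helper and it assumes every feature under a key is a numpy array of the same shape) is to stack into one ndarray before building the tensor and pass the helper to the Trainer as data_collator:

import numpy as np
import torch

def collate_numpy_features(features):
    # features: a list of dicts mapping column names to equal-shaped numpy arrays
    batch = {}
    for key in features[0]:
        # stack into a single np.ndarray first to avoid the slow per-element path
        batch[key] = torch.tensor(np.array([f[key] for f in features]))
    return batch

# e.g. Trainer(model=..., args=..., train_dataset=..., data_collator=collate_numpy_features)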

@FlorinAndrei

@sgugger I'm getting lots and lots of this warning all the time, which makes troubleshooting pretty hard. The Jupyter interface struggles because the output grows very large after a while.

transformers/data/data_collator.py:131: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:201.)
  batch[k] = torch.tensor([f[k] for f in features])

Python 3.10.6
Pytorch 1.12.1+cu116
Transformers 4.23.1

@sgugger
Collaborator

sgugger commented Oct 24, 2022

There is no point in commenting on a resolved issue without providing a code reproducer. You should open a new issue and follow the template :-)

@Alex-ley

@sgugger but what if the issue turns out to be only partially resolved? I think the lines I linked show that the PR may have fixed only one occurrence of this issue and missed others. Do you think it is better to open a new issue in that case rather than re-open the original one?

@sgugger
Collaborator

sgugger commented Apr 12, 2023

You should definitely open a new one with a code sample that shows the problem: tokenizers do not return NumPy arrays but lists of token IDs, so even if the line is the same as the one touched by this PR, it doesn't mean there is a problem to fix.
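
For what it's worth, here is a minimal sketch of the distinction: the warning only fires when the sequence contains numpy arrays, not for plain Python ints, which is what tokenizers return.

import numpy as np
import torch

token_ids = [[101, 2023, 102], [101, 2003, 102]]           # what tokenizers return
pixel_arrays = [np.zeros((3, 2, 2)), np.zeros((3, 2, 2))]  # a list of numpy arrays

torch.tensor(token_ids)     # no warning: plain Python ints
torch.tensor(pixel_arrays)  # emits the "list of numpy.ndarrays" UserWarning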
