Thanks for looking into it. We investigated further after your comment and found that it only affects some hardware/software configurations with the conda-forge installation of PyTorch. Based on this, we found the following issue in PyTorch, which has a quick fix for now: pytorch/pytorch#102269.
Since it seems to be a deeper issue with forking processes, switching between multiprocess and multiprocessing made no difference.
Closing this, since the issue comes from PyTorch, not datasets.
Describe the bug
I noticed that the performance of my dataset preprocessing with map(..., num_proc=32) decreases when PyTorch is imported.
Steps to reproduce the bug
I created two example scripts to reproduce this behavior:
The first takes around 4 seconds on my machine.
The same code with an added import torch takes around 22 seconds.
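The two scripts themselves are not shown above. As a rough stand-in, here is a minimal sketch of how such a comparison could be timed using only the standard library. It is not the original reproduction: the names square, pool_map, and time_call, the --torch flag, the workload size, and the worker count are all hypothetical, and it uses multiprocessing.Pool rather than the datasets map call.

```python
import importlib.util
import sys
import time
from multiprocessing import Pool


def square(x):
    """Tiny CPU-bound stand-in for a real preprocessing function (hypothetical)."""
    return x * x


def pool_map(func, items, num_proc):
    """Apply func to every item using a pool of num_proc worker processes."""
    with Pool(processes=num_proc) as pool:
        return pool.map(func, items)


def time_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical flag: pass --torch to import PyTorch before the workers start.
    # The import is skipped if PyTorch is not installed.
    if "--torch" in sys.argv and importlib.util.find_spec("torch") is not None:
        import torch  # noqa: F401  (imported only for its side effects)

    _, seconds = time_call(pool_map, square, range(100_000), 4)
    print(f"torch imported: {'torch' in sys.modules}, elapsed: {seconds:.2f}s")
```

Running the file once without arguments and once with --torch gives two timings to compare on the same machine. Whether the gap reported above shows up will depend on the hardware/software configuration.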
Expected behavior
I would expect the import of torch not to have such a significant effect on the performance of map with multiprocessing.
Environment info
datasets version: 2.12.0