Describe the bug
The bug occurs when calling the dataloader with num_workers > 1. Here, 'trainer' is initialized from transformers. If I only iterate over the dataloader for debugging, as follows, the code works:
```python
for _, data in tqdm(enumerate(trainer.get_train_dataloader())):
    print('dataloader: ', _, data.keys())
```
However, when running trainer.train(), the code throws the error given in the title.
Docker context
Docker is not used.
Hi, thanks for your quick reply. My dataloader contains a function for GPU video decoding, which may be the cause of the issue. If I comment out this part, the dataset works well with transformers and DeepSpeed. I will provide a minimal reproduction script later.
@loadams Hi, the root cause of the issue is multiprocessing_context. I think the error can be reproduced by setting multiprocessing_context='spawn' on the dataloader.
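The failure mode above can be illustrated without torch: workers started with the 'spawn' context receive their dataset by pickling, so any non-picklable handle held by the dataset (such as a GPU decoder session or CUDA context) breaks worker start-up. A minimal stdlib-only sketch, where `GpuVideoDecoder` is a hypothetical stand-in for the real dataset:

```python
import pickle
import threading

class GpuVideoDecoder:
    """Hypothetical stand-in for a dataset holding a GPU decoder handle."""
    def __init__(self):
        # Non-picklable handle, analogous to a CUDA context or decoder session.
        self._lock = threading.Lock()

def can_spawn_worker_with(obj):
    # Workers started with the 'spawn' start method receive their arguments
    # by pickling; anything that cannot be pickled fails at worker start-up.
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False

print(can_spawn_worker_with(GpuVideoDecoder()))  # False
print(can_spawn_worker_with([1, 2, 3]))          # True
```

With the default 'fork' start method the dataset is inherited by the child process rather than pickled, which is why the plain dataloader loop works but the 'spawn' context does not.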