
[BUG] Can't pickle local object 'instrument_w_nvtx.<locals>.wrapped_fn' #5446

Open

annopackage opened this issue Apr 22, 2024 · 4 comments

Labels: bug (Something isn't working), training

@annopackage
Describe the bug

The bug occurs when calling the dataloader with multiple `num_workers`. Here, `trainer` is initialized from transformers. If I only iterate over the dataloader as follows, the code works:

```
for idx, data in tqdm(enumerate(trainer.get_train_dataloader())):
    print('dataloader: ', idx, data.keys())
```

However, when running trainer.train(), the code throws the error in the title.




Docker context
Docker is not used.


@annopackage added the bug (Something isn't working) and training labels on Apr 22, 2024
@loadams (Contributor) commented Apr 22, 2024

Hi @annopackage - can you share a full minimal repro script with us please?

@annopackage (Author)

Hi, thanks for your quick reply. My dataloader contains a function for GPU video decoding, which may be the cause of the issue. If I comment out this part, the dataset works well with transformers and deepspeed. I will provide a minimal reproduction script later.

@annopackage (Author) commented Apr 24, 2024

@loadams Hi, the cause of the issue is multiprocessing_context. I think the error can be reproduced by setting multiprocessing_context to 'spawn' in the dataloader.
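For context, here is a minimal pure-Python sketch (no DeepSpeed or torch required, names chosen for illustration) of why 'spawn' trips over this: spawn starts workers by pickling their state, and a function defined inside another function — like `instrument_w_nvtx.<locals>.wrapped_fn` — cannot be pickled, whereas a module-level function can.

```python
import pickle

def top_level_fn(x):
    # Module-level functions pickle by reference, so 'spawn' workers can import them.
    return x + 1

def make_wrapped():
    # A nested function, analogous to instrument_w_nvtx.<locals>.wrapped_fn:
    # pickle cannot locate it by a module-level name.
    def wrapped_fn(x):
        return x + 1
    return wrapped_fn

assert pickle.loads(pickle.dumps(top_level_fn))(1) == 2  # top-level: fine
try:
    pickle.dumps(make_wrapped())
except (AttributeError, pickle.PicklingError) as exc:
    # e.g. "Can't pickle local object 'make_wrapped.<locals>.wrapped_fn'"
    print(f"pickling failed: {exc}")
```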

@annopackage (Author)
With ZeRO-2 and spawn, the code runs well.
With ZeRO-3 and fork, the code runs well.
With ZeRO-3 and spawn, the bug occurs.
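This matrix is consistent with the start-method semantics: 'fork' children inherit the parent's memory and never pickle worker state, while 'spawn' children must unpickle it. A minimal sketch (plain multiprocessing, no torch; 'fork' is Unix-only, and the closure name is illustrative) of a local closure surviving a forked worker:

```python
import multiprocessing as mp

def make_closure():
    # Local closure, analogous to the wrapped function ZeRO-3 injects.
    def wrapped_fn(q):
        q.put("ok")
    return wrapped_fn

ctx = mp.get_context("fork")  # child inherits memory; nothing is pickled
q = ctx.Queue()
p = ctx.Process(target=make_closure(), args=(q,))
p.start()
p.join()
print(q.get())  # "ok" — with mp.get_context("spawn") this instead raises the pickling error
```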
