
error when training with multiple GPUs : AttributeError: Can't pickle local object 'ExtractiveSummarizer.prepare_data.<locals>.longformer_modifier' #25

Closed
moyid opened this issue Oct 23, 2020 · 7 comments


moyid commented Oct 23, 2020

Hi @HHousen -- we have talked in a previous issue -- the good news is that I actually got the longformer training working! Now I'm trying to speed up training by using multiple GPUs. However, I get the following error with multiple GPUs, while everything works fine with just 1 GPU:
```
Traceback (most recent call last):
  File "src/main.py", line 393, in <module>
    main(main_args)
  File "src/main.py", line 97, in main
    trainer.fit(model)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit
    results = self.accelerator_backend.train()
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_spawn_accelerator.py", line 65, in train
    mp.spawn(self.ddp_train, nprocs=self.nprocs, args=(self.mp_queue, model,))
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
    process.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'ExtractiveSummarizer.prepare_data.<locals>.longformer_modifier'
```
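For context, this failure is generic CPython behavior rather than anything GPU-specific: the `spawn` start method pickles the model to send it to each worker process, and a function defined inside another function (a "local object") cannot be pickled by reference. A minimal sketch (names hypothetical, not TransformerSum's actual code) showing the failure and the usual fix of moving the function to module scope:

```python
import pickle


def module_level_modifier(batch):
    # Defined at module scope, so pickle can record it by qualified name.
    return batch


def make_local_modifier():
    # Mimics defining longformer_modifier inside prepare_data: the closure
    # only exists as 'make_local_modifier.<locals>.local_modifier', which
    # pickle cannot reference.
    def local_modifier(batch):
        return batch

    return local_modifier


def is_picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False
```

Here `is_picklable(make_local_modifier())` is `False` while `is_picklable(module_level_modifier)` is `True`, which is why hoisting such a closure to module level (or using `functools.partial` around a top-level function) resolves this class of `ddp_spawn` error.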


HHousen commented Oct 23, 2020

It's good to hear that you've started training! I've seen this type of error before during abstractive summarization and I should be able to fix it relatively quickly. I'll have a fix in the next few days.


HHousen commented Oct 23, 2020

@moyid I believe that commit 597bc9d should fix the issue. Let me know if it works now.


moyid commented Oct 23, 2020

@HHousen - it's training!


moyid commented Oct 23, 2020

@HHousen - sorry, one more issue -- I think my training is slow because of `num_workers`. I tried setting it through `--dataloader_num_workers` based on the documentation, but got the error `main.py: error: unrecognized arguments: --dataloader_num_workers 1`.


HHousen commented Oct 23, 2020

@moyid The `--dataloader_num_workers` argument works only for abstractive summarization. You cannot change this option for extractive summarization because the DataLoaders are created from `torch.utils.data.IterableDataset`s, which replicate the same dataset object in each worker process; the replicas must therefore be configured differently to avoid duplicated data. See the PyTorch documentation's description of iterable-style datasets and the `IterableDataset` docstring for more information.
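The splitting that docstring describes can be sketched in plain Python (here `worker_id` and `num_workers` stand in for the fields that `torch.utils.data.get_worker_info()` would return inside each worker process):

```python
def worker_shard(start, end, worker_id, num_workers):
    # Give each worker a contiguous, disjoint slice of [start, end),
    # mirroring the recipe in the IterableDataset docstring. A dataset's
    # __iter__ would yield only the examples in its own slice.
    per_worker = -(-(end - start) // num_workers)  # ceiling division
    lo = start + worker_id * per_worker
    hi = min(lo + per_worker, end)
    return range(lo, hi)


# With 3 workers over 10 examples, every index appears exactly once.
shards = [list(worker_shard(0, 10, w, 3)) for w in range(3)]
```

Without this kind of split, every worker iterates the full dataset and each example is yielded `num_workers` times.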

The docstring gives two examples of how to split an `IterableDataset` workload across all workers. However, I have not implemented this in the library. Ideally, I would simply use a normal `Dataset`, but I'm not certain how to do that properly since the entire dataset cannot be loaded into memory at once. Apache Arrow is one possibility.

I was looking at how the huggingface/transformers seq2seq example deals with this problem. It uses the `Dataset` class instead of `IterableDataset` via the built-in Python `linecache` module, which I had not heard of before. Implementing this will require a significant refactor of the library's extractive data loading code, so I have opened a new issue for it (#27). The `linecache` module looks promising.
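As a rough illustration of that `linecache` idea (a hypothetical class, not code from either library): if each example lives on its own line of a text file, a map-style dataset only needs the line count up front and can fetch any example lazily, so random indexing from multiple workers works without loading the file into memory:

```python
import linecache
import tempfile


class LineByLineDataset:
    """Map-style dataset sketch: one example per line, fetched on demand
    with linecache so the whole file never sits in memory at once."""

    def __init__(self, path):
        self.path = path
        # One pass to count examples; only the count is kept in memory.
        with open(path, encoding="utf-8") as f:
            self._length = sum(1 for _ in f)

    def __len__(self):
        return self._length

    def __getitem__(self, idx):
        # linecache line numbers are 1-indexed; it caches reads internally.
        return linecache.getline(self.path, idx + 1).rstrip("\n")


# Tiny usage example on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("first example\nsecond example\nthird example\n")
    path = f.name
dataset = LineByLineDataset(path)
```

A real implementation would tokenize the fetched line in `__getitem__`; the point is that `__len__` and `__getitem__` are all a map-style `Dataset` needs, so the worker-sharding problem disappears.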


HHousen commented Oct 23, 2020

Also @moyid, to determine whether the number of workers is actually the problem, you can train with the `--profiler` argument, which outputs how long certain functions took to run once training completes. To speed up training you can also train on only a percentage of the dataset using the `--overfit_pct` argument (for example, `--overfit_pct 0.001` for 0.1% of the data). You can find more info in the pytorch-lightning profiler documentation.


moyid commented Oct 23, 2020

sounds good, I'll try that.
