
error when training with multiple GPUs : AttributeError: Can't pickle local object 'ExtractiveSummarizer.prepare_data.<locals>.longformer_modifier' #25

Closed
moyid opened this issue Oct 23, 2020 · 7 comments


moyid commented Oct 23, 2020

Hi @HHousen -- we have talked in a previous issue -- the good news is that I actually got the longformer training working! Now I'm trying to speed up training by using multiple GPUs. However, I get the following error with multiple GPUs, while everything works fine with just 1 GPU:
```
Traceback (most recent call last):
  File "src/main.py", line 393, in <module>
    main(main_args)
  File "src/main.py", line 97, in main
    trainer.fit(model)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit
    results = self.accelerator_backend.train()
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_spawn_accelerator.py", line 65, in train
    mp.spawn(self.ddp_train, nprocs=self.nprocs, args=(self.mp_queue, model,))
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
    process.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'ExtractiveSummarizer.prepare_data.<locals>.longformer_modifier'
```
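For context, this failure is generic CPython behavior rather than anything GPU-specific: the `spawn` start method pickles the model to send it to each worker process, and a function defined inside another function (a "local object") cannot be pickled by reference. A minimal sketch (names hypothetical, not TransformerSum's actual code) showing the failure and the usual fix of moving the function to module scope:

```python
import pickle


def module_level_modifier(batch):
    # Defined at module scope, so pickle can record it by qualified name.
    return batch


def make_local_modifier():
    # Mimics defining longformer_modifier inside prepare_data: the closure
    # only exists as 'make_local_modifier.<locals>.local_modifier', which
    # pickle cannot reference.
    def local_modifier(batch):
        return batch

    return local_modifier


def is_picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False
```

Here `is_picklable(make_local_modifier())` is `False` while `is_picklable(module_level_modifier)` is `True`, which is why hoisting such a closure to module level (or using `functools.partial` around a top-level function) resolves this class of `ddp_spawn` error.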


HHousen commented Oct 23, 2020

It's good to hear that you've started training! I've seen this type of error before during abstractive summarization and I should be able to fix it relatively quickly. I'll have a fix in the next few days.


HHousen commented Oct 23, 2020

@moyid I believe that commit 597bc9d should fix the issue. Let me know if it works now.


moyid commented Oct 23, 2020

@HHousen - it's training!


moyid commented Oct 23, 2020

@HHousen - sorry, one more issue -- I think my training is slow because of `num_workers`. I tried setting it through `--dataloader_num_workers` based on the documentation, but got the error `main.py: error: unrecognized arguments: --dataloader_num_workers 1`.


HHousen commented Oct 23, 2020

@moyid The `--dataloader_num_workers` argument works only for abstractive summarization. You cannot change this option for extractive summarization because the DataLoaders are created from `torch.utils.data.IterableDataset`s, which replicate the same dataset object in each worker process; the replicas must therefore be configured differently to avoid duplicated data. See the PyTorch documentation's description of iterable-style datasets and the `IterableDataset` docstring for more information.
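The splitting that docstring describes can be sketched in plain Python (here `worker_id` and `num_workers` stand in for the fields that `torch.utils.data.get_worker_info()` would return inside each worker process):

```python
def worker_shard(start, end, worker_id, num_workers):
    # Give each worker a contiguous, disjoint slice of [start, end),
    # mirroring the recipe in the IterableDataset docstring. A dataset's
    # __iter__ would yield only the examples in its own slice.
    per_worker = -(-(end - start) // num_workers)  # ceiling division
    lo = start + worker_id * per_worker
    hi = min(lo + per_worker, end)
    return range(lo, hi)


# With 3 workers over 10 examples, every index appears exactly once.
shards = [list(worker_shard(0, 10, w, 3)) for w in range(3)]
```

Without this kind of split, every worker iterates the full dataset and each example is yielded `num_workers` times.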

The docstring gives two examples of how to split an `IterableDataset` workload across all workers. However, I have not implemented this in the library. Ideally, I would simply use a normal `Dataset`, but I'm not certain how to do that properly since the entire dataset cannot be loaded into memory at once. Apache Arrow is one possibility.

I was looking at how the huggingface/transformers seq2seq example deals with this problem. It uses the `Dataset` class instead of `IterableDataset` via the built-in Python `linecache` module, which I had not heard of before. Implementing this will require a significant refactor of the library's extractive data loading code, so I have opened a new issue for it (#27). The `linecache` module looks promising.
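As a rough illustration of that `linecache` idea (a hypothetical class, not code from either library): if each example lives on its own line of a text file, a map-style dataset only needs the line count up front and can fetch any example lazily, so random indexing from multiple workers works without loading the file into memory:

```python
import linecache
import tempfile


class LineByLineDataset:
    """Map-style dataset sketch: one example per line, fetched on demand
    with linecache so the whole file never sits in memory at once."""

    def __init__(self, path):
        self.path = path
        # One pass to count examples; only the count is kept in memory.
        with open(path, encoding="utf-8") as f:
            self._length = sum(1 for _ in f)

    def __len__(self):
        return self._length

    def __getitem__(self, idx):
        # linecache line numbers are 1-indexed; it caches reads internally.
        return linecache.getline(self.path, idx + 1).rstrip("\n")


# Tiny usage example on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("first example\nsecond example\nthird example\n")
    path = f.name
dataset = LineByLineDataset(path)
```

A real implementation would tokenize the fetched line in `__getitem__`; the point is that `__len__` and `__getitem__` are all a map-style `Dataset` needs, so the worker-sharding problem disappears.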


HHousen commented Oct 23, 2020

Also @moyid, to determine whether the number of workers is actually the problem, you can train with the `--profiler` argument, which outputs how long certain functions took to run once training completes. To speed up training you can also train on only a percentage of the dataset using the `--overfit_pct` argument (for example, `--overfit_pct 0.001` for 0.1% of the data). You can find more info in the pytorch-lightning profiler documentation.


moyid commented Oct 23, 2020

sounds good, I'll try that.
