During the training process I'm getting this error:
RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 42472464384 bytes. Error code 12 (Cannot allocate memory)
Epoch 5:  98%|█████████▎| 3250/3308 [1:10:30<01:15, 1.30s/it, loss=0.165, v_num=0_0]
Validating:  80%|████████▏ | 241/300 [02:57<00:39, 1.50it/s]
Traceback (most recent call last):
File "./jerex_train.py", line 20, in train
model.train(cfg)
File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 341, in train
trainer.fit(model, datamodule=data_module)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
self._run(model)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
self.dispatch()
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
self.accelerator.start_training(self)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
return self.run_train()
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
self.train_loop.run_training_epoch()
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 576, in run_training_epoch
self.trainer.run_evaluation(on_epoch=True)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 962, in run_evaluation
output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 174, in evaluation_step
output = self.trainer.accelerator.validation_step(args)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in validation_step
return self.training_type_plugin.validation_step(*args)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in validation_step
return self.lightning_module.validation_step(*args, **kwargs)
File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 126, in validation_step
return self._inference(batch, batch_idx)
File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 176, in _inference
output = self(**batch, inference=True)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 106, in forward
max_rel_pairs=max_rel_pairs, inference=inference)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/joint_models.py", line 144, in forward
return self._forward_inference(*args, **kwargs)
File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/joint_models.py", line 233, in _forward_inference
max_pairs=max_rel_pairs)
File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/modules/relation_classification_multi_instance.py", line 49, in forward
chunk_rel_sentence_distances, mention_reprs, chunk_h)
File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/modules/relation_classification_multi_instance.py", line 73, in _create_mention_pair_representations
rel_ctx = m + h
RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 42472464384 bytes. Error code 12 (Cannot allocate memory)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 5:  98%|█████████▎| 3250/3308 [1:10:48<01:15, 1.31s/it, loss=0.165, v_num=0_0]
This is the memory footprint:
(PyTorch1.7) (base) marco@pc:~/PyTorchMatters/EntitiesRelationsExtraction/jerex$ free -m
              total        used        free      shared  buff/cache   available
Mem:          32059        1670       20587         110        9802       29827
Swap:           979           0         979
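For context, the single allocation that fails is larger than the machine's entire RAM, so this does not look like a gradual leak but like one oversized tensor operation (the `rel_ctx = m + h` in the traceback). A quick conversion of the requested byte count:

```python
# Failed allocation size taken from the traceback, converted to GiB.
requested_bytes = 42_472_464_384

gib = requested_bytes / 2**30
print(f"{gib:.1f} GiB")  # roughly 39.6 GiB, more than the 32 GB of system RAM
```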
I thought that the model (and training) should fit into 32 GB of memory, to be honest. Are you sure there wasn't any other memory-consuming process running on the system (since the model crashes in the 5th epoch)?
Nevertheless, we have a setting in place that can lower memory consumption: you can reduce the maximum number of spans and mention pairs that are processed simultaneously (in single tensor operations) during training/inference.
Just set the following settings in configs/docred_joint/train.yaml for both training and inference:

training:
  # (... other settings ...)
  max_spans: 200
  max_rel_pairs: 50

inference:
  # (... other settings ...)
  max_spans: 200
  max_rel_pairs: 50
The settings above work well on an 11 GB GPU (up to about 4 GB of CPU memory is also occupied). Lowering these maximums reduces memory consumption, but also slows down training/inference. You may tinker with these settings to fit your system.
Regarding TPUs: I would definitely recommend training the model on a GPU. We see a speedup of more than 10x when training on a single GPU compared to a CPU. I do not have any experience with TPUs, but I would expect a speedup there as well.
Regarding shapes: Since we extract positive/negative samples from documents, it is hard to guarantee equal shapes across batches. This is not ideal speed-wise (as your referenced Google document also notes), but you should still gain significant speedups when training on GPUs (and probably also TPUs).
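To illustrate the shape issue: accelerators like TPUs strongly prefer fixed tensor shapes across batches, and variable-length sampling breaks that unless you pad every batch to a fixed maximum. A generic sketch (not JEREX code):

```python
import torch

def pad_to_fixed(batch_tensors, max_len: int, pad_value: float = 0.0):
    """Pad variable-length 1-D tensors to a fixed length so every batch
    presents the same shape to the accelerator; the mask marks real entries."""
    padded = torch.full((len(batch_tensors), max_len), pad_value)
    mask = torch.zeros(len(batch_tensors), max_len, dtype=torch.bool)
    for i, t in enumerate(batch_tensors):
        n = min(t.size(0), max_len)
        padded[i, :n] = t[:n]
        mask[i, :n] = True
    return padded, mask

seqs = [torch.randn(7), torch.randn(12), torch.randn(3)]
padded, mask = pad_to_fixed(seqs, max_len=12)
print(padded.shape, mask.sum().item())  # torch.Size([3, 12]) 22
```

The cost is wasted compute on padding positions, which is why the maintainers note that unequal shapes are "not ideal speed-wise".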
How much memory is required, and what are the minimum requirements (memory, CPU, storage, ...) for running the training process?
Which Google Cloud Architecture could be better suited? https://cloud.google.com/tpu/docs/tpus#when_to_use_tpus
Do you think that Google's TPU are a good fit for Jerex Model's tensors shapes and dimensions? https://cloud.google.com/tpu/docs/tpus#shapes