
RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 42472464384 bytes. Error code 12 (Cannot allocate memory) #3

Closed
raphael10-collab opened this issue May 18, 2021 · 2 comments



raphael10-collab commented May 18, 2021

During the training process I'm getting this error:

RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 42472464384 bytes. Error code 12 (Cannot allocate memory)

Epoch 5:  98%|███████████████████████████████████████████████████████████████████████████████████████████████▎ | 3250/3308 [1:10:30<01:15,  1.30s/it, loss=0.165, v_num=0_0]
Traceback (most recent call last):
  File "./jerex_train.py", line 20, in train
    model.train(cfg)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 341, in train
    trainer.fit(model, datamodule=data_module)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 576, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 962, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 174, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 126, in validation_step
    return self._inference(batch, batch_idx)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 176, in _inference
    output = self(**batch, inference=True)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 106, in forward
    max_rel_pairs=max_rel_pairs, inference=inference)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/joint_models.py", line 144, in forward
    return self._forward_inference(*args, **kwargs)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/joint_models.py", line 233, in _forward_inference
    max_pairs=max_rel_pairs)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/modules/relation_classification_multi_instance.py", line 49, in forward
    chunk_rel_sentence_distances, mention_reprs, chunk_h)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/modules/relation_classification_multi_instance.py", line 73, in _create_mention_pair_representations
    rel_ctx = m + h
RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 42472464384 bytes. Error code 12 (Cannot allocate memory)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 5:  98%|███████████████████████████████████████████████████████████████████████████████████████████████▎ | 3250/3308 [1:10:48<01:15,  1.31s/it, loss=0.165, v_num=0_0]

This is the memory footprint:

 (PyTorch1.7) (base) marco@pc:~/PyTorchMatters/EntitiesRelationsExtraction/jerex$ free -m                                                                                     
              total        used        free      shared  buff/cache   available
Mem:          32059        1670       20587         110        9802       29827
Swap:           979           0         979
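
Note that the single failed allocation alone exceeds the machine's entire memory: 42472464384 bytes / 1024^3 ≈ 39.6 GiB, versus roughly 32 GB of RAM plus under 1 GB of swap.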

How much memory is required, and what are the minimum requirements (memory, CPU, storage, ...) for running the training process?
Which Google Cloud architecture would be best suited? https://cloud.google.com/tpu/docs/tpus#when_to_use_tpus
Do you think Google's TPUs are a good fit for the JEREX model's tensor shapes and dimensions? https://cloud.google.com/tpu/docs/tpus#shapes

markus-eberts (Member) commented May 18, 2021

To be honest, I thought the model (and training) would fit into 32 GB of memory. Are you sure no other memory-consuming process was running on the system (since the model crashes in the 5th epoch)?

Nevertheless, we have a setting in place that can lower memory consumption. You can reduce the maximum number of spans and mention pairs that are processed simultaneously (in single tensor operations) during training/inference.
Just set the following in configs/docred_joint/train.yaml for both training and inference:

training:
  (... other settings ...)
  max_spans: 200
  max_rel_pairs: 50

inference:
  (... other settings ...)
  max_spans: 200
  max_rel_pairs: 50

The settings above work well on an 11 GB GPU (up to about 4 GB of CPU memory is also used). Lowering these maximums reduces memory consumption, but also training/inference speed. You may tinker with these settings to fit your system.
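
For intuition, here is a minimal sketch of what a max_rel_pairs-style limit does (this is not the actual JEREX code; the function name and classifier interface are made up for illustration): candidate mention pairs are scored in bounded slices instead of one huge tensor operation, so peak memory scales with the chunk size rather than with the total number of pairs in a document.

import torch

def score_pairs_chunked(pair_reprs, classifier, max_rel_pairs=50):
    # pair_reprs: [num_pairs, hidden] representations of candidate mention pairs
    outputs = []
    for start in range(0, pair_reprs.size(0), max_rel_pairs):
        # process at most max_rel_pairs pairs in a single tensor operation
        chunk = pair_reprs[start:start + max_rel_pairs]
        outputs.append(classifier(chunk))
    return torch.cat(outputs, dim=0)

With max_rel_pairs=50, each forward call only ever sees a slice of at most 50 pairs, which is why lowering the value trades speed for memory.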

Regarding TPUs: I would definitely recommend training the model on a GPU. We get a speedup of more than 10x when training on a single GPU compared to a CPU. I do not have any experience with TPUs, but I would expect a speedup there as well.

Regarding shapes: since we extract positive/negative samples from documents, it is hard to guarantee equal shapes across batches. This is not ideal speed-wise (as the Google document you referenced also notes), but you should still gain significant speedups when training on GPUs (and probably also TPUs).
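
As a hypothetical illustration of the shape issue (not code from the repository): if the number of sampled mention pairs differs per document, the pair tensor's leading dimension differs per batch; padding to a fixed maximum is a common way to make shapes static for accelerators that prefer them, at the cost of wasted compute.

import torch

def pad_pairs(pair_reprs, max_pairs):
    # pair_reprs: [num_pairs, hidden]; pad (or truncate) the leading dim to max_pairs
    num_pairs, hidden = pair_reprs.shape
    if num_pairs >= max_pairs:
        return pair_reprs[:max_pairs]
    pad = pair_reprs.new_zeros(max_pairs - num_pairs, hidden)
    return torch.cat([pair_reprs, pad], dim=0)  # static shape: [max_pairs, hidden]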


raphael10-collab commented Jun 1, 2021

After modifying the training/inference settings, it finally seems to have succeeded (it took a while: around 26 hours).

Thank you!!!
