CUDA out of memory. #9

Closed
vmr013 opened this issue Nov 5, 2020 · 4 comments

vmr013 commented Nov 5, 2020

Hi, can someone tell me what I can do to fix this issue?
This is what I get when running it on my local machine or on Google Colab.

2020-11-05 11:56:07.097272: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: WORLD_SIZE environment variable (2) is not equal to the computed world size (1). Ignored.
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=ddp
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
  warnings.warn(*args, **kwargs)

  | Name     | Type                        | Params
---------------------------------------------------------
0 | G        | ADDGenerator                | 372 M 
1 | E        | MultilevelAttributesEncoder | 67 M  
2 | D        | MultiscaleDiscriminator     | 8 M   
3 | Z        | ResNet                      | 43 M  
4 | Loss_GAN | GANLoss                     | 0     
5 | Loss_E_G | AEI_Loss                    | 0     
Validation sanity check: 0it [00:00, ?it/s]/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:2494: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode))
Epoch 0:   0% 0/5000250 [00:00<?, ?it/s] Traceback (most recent call last):
  File "aei_trainer.py", line 62, in <module>
    main(args)
  File "aei_trainer.py", line 40, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1058, in fit
    results = self.accelerator_backend.spawn_ddp_children(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/ddp_backend.py", line 123, in spawn_ddp_children
    results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/ddp_backend.py", line 224, in ddp_train
    results = self.trainer.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 491, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 844, in run_training_batch
    self.hiddens
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 1015, in optimizer_closure
    hiddens)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 1197, in training_forward
    output = self.model(*args)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/overrides/data_parallel.py", line 170, in forward
    output = self.module.training_step(*inputs[0], **kwargs[0])
  File "/content/faceshifter/aei_net.py", line 54, in training_step
    output, z_id, output_z_id, feature_map, output_feature_map = self(target_img, source_img)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/faceshifter/aei_net.py", line 44, in forward
    output = self.G(z_id, feature_map)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/faceshifter/model/AEINet.py", line 132, in forward
    x = self.model["layer_7"](x, z_att[7], z_id)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/faceshifter/model/AEINet.py", line 98, in forward
    x1 = self.activation(self.add1(h_in, z_att, z_id))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/faceshifter/model/AEINet.py", line 72, in forward
    h_out = (1-m)*a + m*i
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.73 GiB total capacity; 12.56 GiB already allocated; 45.88 MiB free; 1.17 GiB cached)
Epoch 0:   0%|          | 0/5000250 [00:36<?, ?it/s]

m-pektas commented Nov 5, 2020

Your GPU memory is not enough for training this model. As a first step, you can try decreasing your batch size.
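
For example, a minimal sketch of the idea (not the actual faceshifter training code; the dataset here is just a dummy stand-in):

# Hypothetical sketch: the batch size that matters is the one passed to the
# DataLoader feeding trainer.fit(). A smaller batch holds fewer activations in
# GPU memory, and gradient accumulation can keep the effective batch size similar.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3, 256, 256))  # dummy data, stand-in for the real dataset
loader = DataLoader(dataset, batch_size=1, num_workers=4, shuffle=True)

# In PyTorch Lightning, accumulate_grad_batches=16 with batch_size=1 approximates
# the original batch_size=16 while keeping only one sample's activations at a time:
# trainer = pl.Trainer(gpus=1, distributed_backend="ddp", accumulate_grad_batches=16)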

vmr013 commented Nov 6, 2020

I tried batch_size with the values 8, 4, 2, and 1. With each of them I get this memory error.

This is the model configuration from train.yaml:

model:
  learning_rate_E_G: 4e-4
  learning_rate_D: 4e-4

  beta1: 0
  beta2: 0.999

  batch_size: 16

  num_workers: 16
  grad_clip: 0.0

Do you see the issue? I don't see it.

m-pektas commented Nov 6, 2020

I have gotten this error repeatedly before. The reason is always that my GPU memory is not enough. Once training starts, if you are on Linux, you can watch your GPU memory usage by running the "watch nvidia-smi" command in a terminal.
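
If you prefer to check from inside the training process, a small sketch with plain PyTorch calls (nothing specific to this repo):

import torch

if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    total = torch.cuda.get_device_properties(dev).total_memory
    allocated = torch.cuda.memory_allocated(dev)  # memory currently held by tensors
    cached = torch.cuda.memory_cached(dev)        # memory held by the caching allocator
    print(f"GPU 0: {allocated / 2**30:.2f} GiB allocated, "
          f"{cached / 2**30:.2f} GiB cached, "
          f"{total / 2**30:.2f} GiB total")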

vmr013 commented Nov 9, 2020

Thanks
