RuntimeError: CUDA out of memory. #8
Comments
Late reply, but in case anyone needs this in the future: I set train_batch_size=4 instead of 32 in utils.argparse_init.py, and it was able to train without CUDA out-of-memory errors on a 16 GB GPU.
I think 16 GB or 12 GB GPUs (single or multiple) are quite common. If the author could provide instructions or an option to train on these machines, that would be very helpful. Thanks.
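For reference, a minimal sketch of the kind of argparse setup this change touches (the build_parser function name is an assumption for illustration, not the repo's actual code; only the --train_batch_size default is the point):

```python
import argparse

# Hypothetical sketch of the argument setup in utils.argparse_init.py.
# Lowering --train_batch_size is the knob that avoids CUDA OOM on a 16 GB GPU.
def build_parser():
    parser = argparse.ArgumentParser()
    # default lowered from 32 to 4 to fit a single 16 GB GPU
    parser.add_argument("--train_batch_size", default=4, type=int,
                        help="per-step training batch size")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args([])
    print(args.train_batch_size)
```

The trade-off is the usual one: a smaller batch trains slower per epoch and may need a lower learning rate, but it fits in memory.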
I got it to train on multiple GPUs (so that I can train on AWS more easily; a single GPU with 32 GB or more is only available in p3dn.24xlarge and p4d.24xlarge, which is expensive and wastes resources). I made the following modifications in case anyone needs this in the future.

# in sdr_main.py
trainer = pytorch_lightning.Trainer(
    ...
    distributed_backend="dp",
    ...)

# in argparse_init.py
parser.add_argument("--gpus", default=2, type=int, help="gpu count")  # specify the GPU count you have

# in SDR.py, add the following method; the reason is that we get one loss per GPU.
def training_step_end(self, training_step_outputs):
    return {'loss': training_step_outputs['loss'].sum()}
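To illustrate why training_step_end is needed, here is a minimal stand-in sketch using plain Python floats in place of torch tensors (the real hook calls .sum() on a tensor, as shown in the comment above): under the "dp" backend, training_step runs once per GPU, so the collected output carries one loss per GPU and must be reduced to a single scalar.

```python
# Stand-in sketch: plain Python floats in place of torch tensors.
# With distributed_backend="dp", the collected training_step output
# holds one loss per GPU; training_step_end reduces them to one scalar.
def training_step_end(training_step_outputs):
    # in the real LightningModule: training_step_outputs['loss'].sum()
    return {"loss": sum(training_step_outputs["loss"])}

per_gpu_outputs = {"loss": [0.5, 0.7]}  # e.g. losses from two GPUs
reduced = training_step_end(per_gpu_outputs)
print(reduced["loss"])
```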
@weicheng113 what configuration did you use for multiple GPUs? 16 GB each? And did you change any other params?
@hassiahk 16 GB each. Please check the last two comments I made above.
So you could only run with a reduced batch size?
I faced the same problem and resolved it by downgrading PyTorch from 1.10.1 to 1.8.1 built against CUDA 11.1:
$ pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
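A quick way to sanity-check that the intended wheel is active is to look at the local version tag on the version string (the cuda_tag helper below is a hypothetical sketch, not part of torch or the repo):

```python
import re

# Hypothetical helper: parse a torch version string such as "1.8.1+cu111"
# and pull out the CUDA build tag, to confirm you did not get a CPU wheel.
def cuda_tag(version):
    m = re.search(r"\+cu(\d+)", version)
    return m.group(1) if m else None

print(cuda_tag("1.8.1+cu111"))  # wheel installed by the command above
print(cuda_tag("1.10.1"))       # a wheel with no CUDA tag
```

In practice you would feed it torch.__version__ after the install.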
Train command
Stacktrace: