Can't train with fp16 on Nvidia RTX3060 #45

Open
thangnvkcn opened this issue Feb 9, 2022 · 1 comment
Comments

@thangnvkcn

Training with fp16 doesn't work for me on an RTX 3060. I'll look into fixing it, but for future reference, here is the full stack trace.
torch version: 1.9.0

INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for 1 nodes.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:158, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
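
For anyone hitting the same thing, a quick way to localize the failing kernel is to rerun with CUDA_LAUNCH_BLOCKING=1 (as the message suggests) and to try a tiny standalone fp16 step outside this repo's training code. The model and sizes below are placeholders, not the project's actual model; it's only a sketch to check whether AMP fp16 works at all on the card:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized; errors then surface at the real call site

import torch

device = torch.device("cuda")
model = torch.nn.Linear(128, 64).to(device)       # placeholder model, not from this repo
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()              # standard mixed-precision loss scaling

x = torch.randn(8, 128, device=device)
with torch.cuda.amp.autocast():                   # run the forward pass in fp16 where safe
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
print("fp16 step OK on", torch.cuda.get_device_name(0))

If even this minimal step dies with a device-side assert, the problem is likely the torch build rather than this project's code.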

@nikich340

Maybe the problem is that the torch version is too new. Have you tried 1.8.1+cu111?
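
If it helps narrow it down, one quick check before reinstalling is to see which CUDA build the current torch wheel is and whether it was compiled for the RTX 3060 (Ampere, compute capability 8.6). This is just a generic sanity check, not specific to this repo:

import torch

print(torch.__version__)                     # e.g. 1.9.0+cu102 vs 1.9.0+cu111
print(torch.version.cuda)                    # CUDA version the wheel was built against
print(torch.cuda.get_device_capability(0))   # RTX 3060 should report (8, 6)
print(torch.cuda.get_arch_list())            # needs to include sm_86 for this card

If sm_86 isn't in that list, switching to a +cu111 build (e.g. the 1.8.1+cu111 wheel suggested above, installed from the pytorch.org package index) should be the fix.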
