Grad strides do not match bucket view strides #26
This is weird. Are you training on a GPU? Why set the GPU to 0? |
Thanks for your reply; let me describe my test process. At first I got `CUDA error: invalid device ordinal`. Then I set `os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"`, and here's the problem: `Grad strides do not match bucket view strides`, and there is no Loss or Mean-N value. Printing `args.gpu` before training gives the value 0, so I set `--gpu 0`. |
This is weird. When using `os.environ["CUDA_VISIBLE_DEVICES"] = "0"`, only GPU-0 will be visible, so there is no need to set `--gpu 0`. I did not get anything like this; it might be a CUDA or PyTorch version issue? |
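As a minimal sketch of the device-visibility point above (the helper function is hypothetical, not from the repo): `CUDA_VISIBLE_DEVICES` must be set before any CUDA-aware library initializes, and it remaps devices so that logical device `k` is the `k`-th entry of the list. With `"0"`, the only visible GPU is always logical `cuda:0`.

```python
import os

# Restrict CUDA visibility before any CUDA-aware library is imported;
# logical device k then maps to the k-th physical id in this list.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

def visible_gpus():
    """Physical GPU ids this process may see, in logical device order."""
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(x) for x in value.split(",") if x.strip()]

print(visible_gpus())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With `os.environ["CUDA_VISIBLE_DEVICES"] = "0"` the list collapses to `[0]`, which is why passing `--gpu 0` on top of it is redundant.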
Thanks, the problem has been solved. The PyTorch environments for the VIGOR and CVUSA datasets are quite different, and after switching I could reproduce the results. However, there are still some questions: `dim` sets the dimension in the sampler, so does the same `dim` setting give the same result after embedding in the TransGeo network? Also, `dim` in the file is set to 1000, but the default is 128. For these datasets, which value is better: 4096, 1000, or 128? |
Please follow the parameters in the provided script to reproduce the result. |
The following warning occurs at runtime:
/.conda/envs/pytor1/lib/python3.7/site-packages/torch/autograd/__init__.py:175: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 1, 384], strides() = [99072, 384, 1]
bucket_view.sizes() = [1, 1, 384], strides() = [384, 384, 1] (Triggered internally at /opt/conda/conda-bld/pytorch_1656352430114/work/torch/csrc/distributed/c10d/reducer.cpp:326.)
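The warning above is about memory strides: DDP's flattened gradient buckets assume contiguous row-major gradients, and a gradient produced through a transpose or permute in the backward path can carry different strides for the same shape. A small stride illustration with NumPy (PyTorch strides behave the same way, counted in elements rather than bytes):

```python
import numpy as np

# A freshly allocated gradient-like buffer has row-major strides.
grad = np.arange(6, dtype=np.float32).reshape(2, 3)
assert grad.flags["C_CONTIGUOUS"]

# A transpose is a view with swapped strides, not a copy, so the same
# data can have strides that no longer match a row-major bucket view --
# the mismatch DDP's reducer warns about.
grad_t = grad.T
assert not grad_t.flags["C_CONTIGUOUS"]

# Forcing a contiguous copy (torch equivalent: tensor.contiguous())
# restores the expected row-major layout.
fixed = np.ascontiguousarray(grad_t)
assert fixed.flags["C_CONTIGUOUS"]
print(grad.strides, grad_t.strides, fixed.strides)
```

As the warning text itself says, this is a performance concern, not a correctness error; the nan loss below is a separate problem.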
And the training results are unreasonable:
Time 10.881 (10.881) Data 9.173 ( 9.173) Loss nan (nan) Mean-P 0.34 ( 0.34) Mean-N nan ( nan)
This problem seems to be caused by distributed training, but I have already set the GPU to 0. How can I solve it? |
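One commonly reported mitigation for this warning (a hedged sketch, not the repo's own code; the small `nn.Linear` model stands in for the real network) is to make every parameter contiguous before DDP builds its flat gradient buckets:

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the actual network.
model = nn.Linear(384, 384)

# Ensure every parameter is stored contiguously before wrapping in DDP;
# non-contiguous parameters are one reported cause of the
# "Grad strides do not match bucket view strides" warning.
with torch.no_grad():
    for p in model.parameters():
        if not p.is_contiguous():
            p.set_(p.contiguous())

assert all(p.is_contiguous() for p in model.parameters())

# The DDP wrapper itself needs an initialized process group, e.g.:
# model = nn.parallel.DistributedDataParallel(model, gradient_as_bucket_view=True)
```

`gradient_as_bucket_view=True` is a real `DistributedDataParallel` option that makes gradients views into the buckets; whether it applies here depends on the PyTorch version in use. The nan loss is likely a separate issue, as the maintainer's version-mismatch suggestion indicates.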