
Grad strides do not match bucket view strides #26

Closed
LWhale13358 opened this issue Oct 27, 2023 · 5 comments

Comments

@LWhale13358

The following warning occurs at runtime:
/.conda/envs/pytor1/lib/python3.7/site-packages/torch/autograd/__init__.py:175: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 1, 384], strides() = [99072, 384, 1]
bucket_view.sizes() = [1, 1, 384], strides() = [384, 384, 1] (Triggered internally at /opt/conda/conda-bld/pytorch_1656352430114/work/torch/csrc/distributed/c10d/reducer.cpp:326.)
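For context, this warning says the gradient of a [1, 1, 384] parameter has strides that appear to be inherited from a slice of a larger tensor, so it does not match the dense layout of DDP's flattened bucket; as the message itself notes, it is a performance hint rather than an error. The helper below is only a diagnostic sketch for locating the offending parameter (report_grad_strides is a hypothetical name, not part of this repo):

```python
import torch
import torch.nn as nn

def report_grad_strides(model: nn.Module) -> None:
    """After loss.backward(), print parameters whose gradient strides differ
    from the dense row-major layout that DDP's bucket views use, to identify
    which tensor triggers the warning."""
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        dense = torch.empty(param.shape).stride()  # default contiguous strides
        if tuple(param.grad.stride()) != tuple(dense):
            print(f"{name}: grad strides {tuple(param.grad.stride())}, "
                  f"bucket expects {tuple(dense)}")

# Usage sketch: run once after the first backward pass, before optimizer.step()
# loss.backward()
# report_grad_strides(model)
```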

And the training results are unreasonable:
Time 10.881 (10.881) Data 9.173 ( 9.173) Loss nan (nan) Mean-P 0.34 ( 0.34) Mean-N nan ( nan)

This problem seems to be caused by distributed training, but I set the GPU to 0. How can I solve it?
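Not from the original thread, but when the loss turns NaN as in the log above, two standard PyTorch tools can help narrow it down: anomaly detection, which reports the backward op that first produced a NaN/Inf, and a finite-loss guard so one bad batch does not poison the running averages (safe_backward is a hypothetical helper, not part of this repo):

```python
import torch

# Slows training noticeably, so enable only while debugging.
torch.autograd.set_detect_anomaly(True)

def safe_backward(loss: torch.Tensor, optimizer: torch.optim.Optimizer) -> bool:
    """Skip the update when the loss is NaN/Inf instead of propagating it."""
    if not torch.isfinite(loss):
        optimizer.zero_grad()
        return False
    loss.backward()
    return True
```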

@Jeff-Zilence
Owner

This is weird, are you training on GPU? Why set GPU to 0?

@LWhale13358
Author

Thanks for your reply; let me describe my test process.
Initially the GPU argument was not set and os.environ["CUDA_VISIBLE_DEVICES"] = "0". However, the server has GPUs 0-7 available, so running it this way immediately raises an error:

CUDA error: invalid device ordinal

Then, with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7", the problem appears:

Grad strides do not match bucket view strides.

And there is no valid Loss or Mean-N value. Printing args.gpu before training gives 0, so I set --gpu 0.

@Jeff-Zilence
Owner

This is weird. When using os.environ["CUDA_VISIBLE_DEVICES"] = "0", only GPU-0 will be visible, so there is no need to set --gpu 0. I did not encounter anything like this; it might be a CUDA or PyTorch version issue?
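A small sketch of the point above (assuming the environment variable is set before anything initializes CUDA): with it set to "0", only one device is visible and index 0 maps to the physical GPU 0, while requesting any higher index is what raises "invalid device ordinal":

```python
import os

# Must be set before the first CUDA call, ideally before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())    # 1: only physical GPU 0 is visible
print(torch.cuda.current_device())  # 0: no need to pass --gpu 0 explicitly
# torch.zeros(1, device="cuda:1") would now fail with "invalid device ordinal"
```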

@LWhale13358
Author

Thanks, the problem has been solved. The PyTorch environments for the VIGOR and CVUSA datasets are quite different, and after switching environments the results can be produced. However, there are still some questions: dim sets the dimension in the sampler, but does the same dim setting give the same result for the embedding in the TransGeo network? Also, dim is set to 1000 in the file while the default is 128; for these datasets, which value is better: 4096, 1000, or 128?

@Jeff-Zilence
Owner

Please follow the parameters in the provided script to reproduce the result.
