Grad strides do not match bucket view strides #26
This is weird. Are you training on a GPU? Why set the GPU to 0? |
Thanks for your reply; let me describe my test process. At first I got `CUDA error: invalid device ordinal`. Then I set `os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"`, and here's the problem: `Grad strides do not match bucket view strides`, and there is no Loss or Mean-N value. Printing `args.gpu` before training gives the value 0, so I set `--gpu 0`. |
This is weird. When using `os.environ["CUDA_VISIBLE_DEVICES"] = "0"`, only GPU-0 will be visible, so there is no need to set `--gpu 0`. I did not get anything like this; it might be a CUDA or PyTorch version issue? |
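As a minimal sketch of the device-visibility point above (the helper function is hypothetical, not from the repo): `CUDA_VISIBLE_DEVICES` must be set before any CUDA-aware library initializes, and it remaps devices so that logical device `k` is the `k`-th entry of the list. With `"0"`, the only visible GPU is always logical `cuda:0`.

```python
import os

# Restrict CUDA visibility before any CUDA-aware library is imported;
# logical device k then maps to the k-th physical id in this list.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

def visible_gpus():
    """Physical GPU ids this process may see, in logical device order."""
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(x) for x in value.split(",") if x.strip()]

print(visible_gpus())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With `os.environ["CUDA_VISIBLE_DEVICES"] = "0"` the list collapses to `[0]`, which is why passing `--gpu 0` on top of it is redundant.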
Thanks, the problem has been solved. The PyTorch environments for the VIGOR and CVUSA datasets are quite different, and after switching I could reproduce the results. However, there are still some questions: `dim` sets the dimension in the sampler, so does the same `dim` setting give the same result after embedding in the TransGeo network? Also, `dim` in the file is set to 1000, but the default is 128. For these datasets, which value is better: 4096, 1000, or 128? |
Please follow the parameters in the provided script to reproduce the result. |
The following warning occurs at runtime:
/.conda/envs/pytor1/lib/python3.7/site-packages/torch/autograd/__init__.py:175: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 1, 384], strides() = [99072, 384, 1]
bucket_view.sizes() = [1, 1, 384], strides() = [384, 384, 1] (Triggered internally at /opt/conda/conda-bld/pytorch_1656352430114/work/torch/csrc/distributed/c10d/reducer.cpp:326.)
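The warning above is about memory strides: DDP's flattened gradient buckets assume contiguous row-major gradients, and a gradient produced through a transpose or permute in the backward path can carry different strides for the same shape. A small stride illustration with NumPy (PyTorch strides behave the same way, counted in elements rather than bytes):

```python
import numpy as np

# A freshly allocated gradient-like buffer has row-major strides.
grad = np.arange(6, dtype=np.float32).reshape(2, 3)
assert grad.flags["C_CONTIGUOUS"]

# A transpose is a view with swapped strides, not a copy, so the same
# data can have strides that no longer match a row-major bucket view --
# the mismatch DDP's reducer warns about.
grad_t = grad.T
assert not grad_t.flags["C_CONTIGUOUS"]

# Forcing a contiguous copy (torch equivalent: tensor.contiguous())
# restores the expected row-major layout.
fixed = np.ascontiguousarray(grad_t)
assert fixed.flags["C_CONTIGUOUS"]
print(grad.strides, grad_t.strides, fixed.strides)
```

As the warning text itself says, this is a performance concern, not a correctness error; the nan loss below is a separate problem.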
And the training results are unreasonable:
Time 10.881 (10.881) Data 9.173 ( 9.173) Loss nan (nan) Mean-P 0.34 ( 0.34) Mean-N nan ( nan)
This problem seems to be caused by distributed training, but I have already set the GPU to 0. How can I solve it? |
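One commonly reported mitigation for this warning (a hedged sketch, not the repo's own code; the small `nn.Linear` model stands in for the real network) is to make every parameter contiguous before DDP builds its flat gradient buckets:

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the actual network.
model = nn.Linear(384, 384)

# Ensure every parameter is stored contiguously before wrapping in DDP;
# non-contiguous parameters are one reported cause of the
# "Grad strides do not match bucket view strides" warning.
with torch.no_grad():
    for p in model.parameters():
        if not p.is_contiguous():
            p.set_(p.contiguous())

assert all(p.is_contiguous() for p in model.parameters())

# The DDP wrapper itself needs an initialized process group, e.g.:
# model = nn.parallel.DistributedDataParallel(model, gradient_as_bucket_view=True)
```

`gradient_as_bucket_view=True` is a real `DistributedDataParallel` option that makes gradients views into the buckets; whether it applies here depends on the PyTorch version in use. The nan loss is likely a separate issue, as the maintainer's version-mismatch suggestion indicates.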