Runtime error in gradient computation when running train.py #248

Closed
Salvatore-tech opened this issue Sep 2, 2022 · 7 comments
Comments

@Salvatore-tech
Contributor

Describe the bug
Good evening, I wanted to launch train.py using a default config file for RAFT on a standard dataset (KITTI_2015).
I followed the instructions to install MMFlow from source successfully.

Reproduction

python tools/train.py configs/raft/raft_8x2_50k_kitti2015_and_Aug_288x960.py \
--load-from /home/s.starace/FlowNets/mmflow/checkpoints/raft/raft_8x2_100k_mixed_368x768.pth
  1. Did you make any modifications on the code or config? Did you understand what you have modified?
    I just changed the name of the symlink that I created under /data (uppercase).

  2. What dataset did you use?
    KITTI_2015

Environment
I launched the command on my PC and also on a small cluster, and the output error is the same.

Error traceback
See log attached: slurm-53090.out.txt

Bug fix
Not sure about it; it could either be a configuration issue in the encoder/decoder or a regression.
I'll try train.py with other models as well and update the report if I understand the problem better.
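
For reference, in recent PyTorch versions this class of runtime error is typically raised by autograd when a tensor needed for the backward pass has been modified in place. Whether that is what the attached log shows is an assumption on my part rather than a confirmed diagnosis, but a minimal, self-contained sketch of the same class of failure looks like this:

import torch

x = torch.randn(4, requires_grad=True)
y = torch.relu(x)
z = y ** 2           # autograd saves y to compute dz/dy = 2 * y in backward
y.add_(1)            # the in-place update invalidates the saved copy of y
z.sum().backward()   # raises: "one of the variables needed for gradient
                     # computation has been modified by an inplace operation"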

@Salvatore-tech
Contributor Author

Training models other than RAFT (PWC-Net, LiteFlowNet, etc.) with the same script and dataset does NOT show the issue reported above.

@wz940216

The same error is also reported when training GMA.

@Salvatore-tech
Contributor Author

Sorry @wz940216, but I did not understand your reply. Do you need additional details about the issue?

@wz940216

@Salvatore-tech I'm not a developer, but I had the same problem as you when training GMA and RAFT while using mmflow.

@Salvatore-tech
Contributor Author

@wz940216 that's interesting. I hope the maintainers of this repository can give us a clue (I'd like to use RAFT for my use case because it should give better performance).
@MeowZheng @Zachary-66

@Zachary-66
Collaborator

Zachary-66 commented Sep 26, 2022

I met the same bug using PyTorch 1.12.1. Below is my log:
raft_kitti_bug.txt

However, when I use PyTorch 1.8.0, this bug no longer appears. Below is the normal log:
raft_kitti_normal.txt

I notice you are using PyTorch 1.12.1, which is the latest version, and there might be some unexpected bugs. To save time, I recommend following the official command below to install PyTorch 1.8.0 instead:

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
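
After reinstalling, a quick sanity check (not part of MMFlow itself, just a plain PyTorch query) confirms the downgraded environment before re-running train.py:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

This should report 1.8.0 and a CUDA build matching cudatoolkit 10.2.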

We will modify MMFlow as soon as possible to avoid similar bugs. Thanks a lot!

@Salvatore-tech
Contributor Author

Thanks @Zachary-66, your fix did the job. I had not noticed that operator, otherwise I would have opened a pull request myself.
I'm closing the issue.
