nan loss and weights in training #9
Comments
Hi! Hmm... This is very weird because I haven't met any NaN issue during training in lots of experiments. I haven't tried batch size 48 (your GPU must be quite large if it can handle 48), but I have tried 8 and 16, and 2~4 GPUs.
@mks0601 Hi, thank you for your prompt reply!
I think …
@mks0601 Hi! I've already successfully executed … I've tried to disable … Then I tried to debug the training process to figure out what makes the mesh loss NaN (…). This error actually happens randomly with a small probability, so I wondered whether it is related to some specific images, and I recorded some images from Human3.6M that trigger the error:
I also wrote a script to reproduce the error; just put it in … and run it.
In my environment, the NaN error ALWAYS happens on the 2 designated images (note that I've modified …).
You can see that before the error, a divide-by-zero warning occurs in cam2pixel:
With that change, the error seems to be solved. I don't know whether this will cause any accuracy loss or other problems, or why it happens in the first place. I've just started training with that simple modification to see whether other problems show up. Any suggestions?
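For reference, a minimal sketch of that kind of guard, assuming cam2pixel has the usual pinhole-projection form (the exact code in common/utils/transforms.py may differ): clamp the camera-space z before dividing.

```python
import numpy as np

def cam2pixel_safe(cam_coord, f, c, eps=1e-8):
    # cam_coord: (N, 3) camera-centered coordinates (mm); f, c: focal length and principal point
    z = cam_coord[:, 2]
    z_safe = np.where(np.abs(z) < eps, eps, z)   # guard the divide: z == 0 would give inf/nan
    x = cam_coord[:, 0] / z_safe * f[0] + c[0]
    y = cam_coord[:, 1] / z_safe * f[1] + c[1]
    return np.stack((x, y, z), axis=1)
```

Clamping z like this only hides the symptom, though; a vertex sitting exactly on the camera plane usually means something is off upstream.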
Hi, I got this result:
Basically, I didn't get any NaN error. Could you check which cam2pixel call gives the error, and whether some of the coordinates contain a zero element?
@mks0601 Hi! I've debugged the script only on …
On line 202, the returned coordinates show that vertex no. 4794 is the only vertex that contains 0 on the z-axis (…). Then after line 209, … Then after line 226, …
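Roughly, the check behind those numbers looks like this (a sketch only; smpl_coord_cam and smpl_coord_img are the camera- and image-space coordinate names used in this thread, and cam2pixel is assumed to be importable from the repo's transforms module):

```python
import numpy as np
from utils.transforms import cam2pixel   # repo helper; adjust the import path if needed

def find_zero_depth_vertices(smpl_coord_cam, f, c):
    # smpl_coord_cam: (6890, 3) SMPL vertices in camera-centered coordinates (mm)
    zero_idx = np.where(np.isclose(smpl_coord_cam[:, 2], 0))[0]
    print('vertices with z == 0:', zero_idx.tolist())            # in the report above: [4794]
    smpl_coord_img = cam2pixel(smpl_coord_cam, f, c)              # the repo's projection function
    print('their projected coords:', smpl_coord_img[zero_idx])   # x, y come out inf/nan
    return zero_idx
```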
If I'm the only one who has this problem, then it might be something to do with my environment settings.
Here's the good news. Could you please show me your Python environment details? That would help. ^^
Did you modify the get_smpl_coord function of Human36M.py? For example, did you make the coordinates root-relative? Could you check yours against mine line by line? cam means camera-centered coordinates, and a 0 z-axis coordinate means zero distance from the camera along the z-axis, which is nonsense. Could you visualize smpl_coord_img on the image in Human36M.py using the vis_mesh function?
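A rough sketch of that visualization (assuming the vis_mesh helper in common/utils/vis.py overlays projected vertices on an image, and that img_path and smpl_coord_img are in scope inside Human36M.py; the exact signature may differ):

```python
import cv2
from utils.vis import vis_mesh   # repo helper; adjust the import path if needed

img = cv2.imread(img_path)                        # the Human3.6M frame being processed
overlay = vis_mesh(img, smpl_coord_img[:, :2])    # draw the projected SMPL vertices
cv2.imwrite('smpl_overlay.jpg', overlay)
```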
Hmmmm... Besides, do you think it could be due to using the 3DMPPE version of the Human3.6M dataset?
The data from 3DMPPE is exactly the same as that of I2L-MeshNet; I just added SMPL parameters. Ah, when did you download the H36M data? I changed the extrinsic camera parameters and the corresponding functions on Jun 8 this year. I think this can make the coordinates zero, because the translation vector was changed. If you downloaded them before Jun 8, could you download the camera parameters again and check the error?
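For context, the world-to-camera transform usually has this form (a sketch, up to the exact convention in the repo's transforms.py); with a stale translation vector t, a vertex can land exactly on the camera's z = 0 plane, which then breaks the divide in cam2pixel:

```python
import numpy as np

def world2cam(world_coord, R, t):
    # world_coord: (N, 3) in world space (mm); R: (3, 3) rotation; t: (3,) translation
    return np.dot(R, world_coord.T).T + t.reshape(1, 3)
```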
@mks0601 Hi!
Awesome!
Problem solved!
@mks0601 Hi, thank you for your great work.
I had a problem while training the model in the first 'lixel' stage.
I care more about the Human Pose and Mesh Estimation performance, and I've downloaded the Human3.6M (from your other project, 3DMPPE; they are the same data, right?), MSCOCO 2017 and 3DPW datasets with the links you provided, and made them meet the requirements mentioned in README.md. I haven't downloaded the MuCo dataset, so I modified main/config.py like this:
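A hypothetical sketch of that change, assuming main/config.py keeps its dataset lists in trainset_3d / trainset_2d / testset attributes (the names in the actual file may differ slightly):

```python
# inside the Config class in main/config.py
trainset_3d = ['Human36M']   # MuCo dropped because it was not downloaded
trainset_2d = ['MSCOCO']
testset = 'PW3D'
```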
Besides that, I modified the train_batch_size from 16 to 48.
Then I tried to execute main/train.py with no further modification of the config:
python train.py --gpu 0-2 --stage lixel
It runs on three Titan RTX GPUs, and everything looks fine, but a NaN loss occurs in epoch 0:
I modified the train_batch_size from 16 to 48, so the total number of iterations looks quite small. You can see that loss_mesh_fit, loss_mesh_normal and loss_mesh_edge become NaN first, and after that all losses become NaN.
I debugged the program and found that the weights of the relevant layers all become NaN when this happens, so it is probably caused by the NaN losses above, which then spread to the parameters of all layers through backpropagation.
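One generic way to catch this early (not code from the repo) is to check every loss term before calling backward and to let autograd's anomaly detection point at the op that produced the first NaN:

```python
import torch

torch.autograd.set_detect_anomaly(True)   # report which backward op produced NaN/Inf

def assert_finite_losses(loss_dict, itr):
    # loss_dict: whatever dict of per-term loss tensors the training loop sums before .backward()
    for name, value in loss_dict.items():
        if not torch.isfinite(value).all():
            raise RuntimeError(f'non-finite loss "{name}" at iteration {itr}')
```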
I've tried this several times, with different GPUs (0, 1, 2) and different batch sizes (8, 16, 32, 48); it always happens at some point in the first epoch (0/13). I thought it might be due to some specific images, so I recorded the batch of images that seemed to trigger the NaN loss several times (by simply logging their paths), but the batches don't seem to intersect.
Here are 8 images from one attempt, recorded when the NaN loss occurred, with 1 GPU and train_batch_size=8:
I'm new to PyTorch and HPE, and I'd appreciate your suggestions.