model can't converge #9

Closed
luhavefun opened this issue Aug 17, 2022 · 2 comments

@luhavefun

Thanks for your great work. I ran into some trouble training on HO3D-v2. I trained the model following the given steps, but found that it did not converge properly. Here are the logs from the last epoch:
08-09 04:21:59 Epoch 69/70 itr 4123/4128: lr: 1.17649e-05 speed: 0.19(0.19s r0.00)s/itr 0.22h/epoch loss_mano_verts: 5.1809 loss_mano_joints: 5.4888 loss_mano_pose: 0.3954 loss_mano_shape: 0.2697 loss_joints_img: 3.1225
08-09 04:22:00 Epoch 69/70 itr 4124/4128: lr: 1.17649e-05 speed: 0.19(0.19s r0.00)s/itr 0.22h/epoch loss_mano_verts: 11.5330 loss_mano_joints: 12.5328 loss_mano_pose: 0.6882 loss_mano_shape: 0.3608 loss_joints_img: 4.0549
08-09 04:22:00 Epoch 69/70 itr 4125/4128: lr: 1.17649e-05 speed: 0.19(0.19s r0.00)s/itr 0.22h/epoch loss_mano_verts: 4.1236 loss_mano_joints: 4.4260 loss_mano_pose: 0.3778 loss_mano_shape: 0.5172 loss_joints_img: 3.0202
08-09 04:22:00 Epoch 69/70 itr 4126/4128: lr: 1.17649e-05 speed: 0.19(0.19s r0.00)s/itr 0.22h/epoch loss_mano_verts: 23.6163 loss_mano_joints: 25.9149 loss_mano_pose: 0.5916 loss_mano_shape: 0.2481 loss_joints_img: 3.1539
08-09 04:22:00 Epoch 69/70 itr 4127/4128: lr: 1.17649e-05 speed: 0.19(0.19s r0.00)s/itr 0.22h/epoch loss_mano_verts: 10.1710 loss_mano_joints: 10.8513 loss_mano_pose: 0.5306 loss_mano_shape: 0.2882 loss_joints_img: 4.7335

@namepllet
Owner

We trained our model with 4 GPUs.

So the effective batch size is 64 (16 per GPU * 4 GPUs).

I suspect you trained the model on a single GPU with the learning rate still set to 1e-4.

If you'd like to train the model on a single GPU, please set the learning rate to 1e-4 * 1/4 (that is, 2.5e-5).
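
For anyone adapting this to a different number of GPUs, the advice above matches the commonly used linear scaling heuristic: scale the learning rate in proportion to the effective batch size. Below is a minimal sketch of that arithmetic. The function name `scale_lr` and the constants are hypothetical and are not taken from this repository's config; only the values 1e-4, 16 per GPU, and 4 GPUs come from this thread.

```python
import math


def scale_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Linear scaling heuristic: keep lr / effective_batch_size constant.

    Hypothetical helper for illustration, not part of the repository.
    """
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


# Values stated in this thread: lr 1e-4 with 16 samples per GPU on 4 GPUs.
BASE_LR = 1e-4
PER_GPU_BATCH = 16
REFERENCE_GPUS = 4

reference_batch = PER_GPU_BATCH * REFERENCE_GPUS  # effective batch size of 64

# Single-GPU run with the same per-GPU batch size.
single_gpu_lr = scale_lr(BASE_LR, reference_batch, PER_GPU_BATCH * 1)

# Sanity check: this agrees with the 1e-4 * 1/4 suggested above.
assert math.isclose(single_gpu_lr, BASE_LR / 4)

print(f"single-GPU learning rate: {single_gpu_lr:.2e}")
```

The same helper would give 5e-5 for 2 GPUs. This is a rule of thumb rather than a guarantee, so it is still worth watching the loss curves after changing the setup.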

@luhavefun
Author

Thank you for your reply! After changing the lr, training has been stable so far (20 epochs).
