Training on a single GPU (Losses keep fluctuating and do not converge) #31

nuschandra opened this issue Apr 13, 2021 · 2 comments

@nuschandra

Hi,

I am training the Faster RCNN model on 10% of the labelled COCO data. While training with 1 GPU the losses do not converge, and based on an earlier issue (#12) I understand that with 1 GPU the batch size is fixed at 1 due to tensorpack constraints, which may be too small for the network to train and converge. If that is the case, what are the alternatives? Is the only option to move away from tensorpack in order to use a larger batch size?
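
One workaround that keeps tensorpack would be the linear learning-rate scaling rule: with an effective batch size of 1 instead of the multi-GPU reference batch, scale the base learning rate down proportionally and stretch the step schedule so the model still sees the same number of images. A rough sketch with illustrative numbers (the reference batch size and decay steps below are assumptions, not this repo's actual config keys):

```python
# Linear LR scaling sketch (Goyal et al., 2017). All numbers are illustrative
# assumptions: the reference schedule is taken to be 8 GPUs x 1 image each,
# and the decay steps are typical detection values, not this repo's config.
REFERENCE_BATCH_SIZE = 8
REFERENCE_BASE_LR = 0.01                      # illustrative; the log later in this thread shows learning_rate: 0.01
REFERENCE_DECAY_STEPS = (120_000, 160_000, 180_000)

my_batch_size = 1                             # single GPU, 1 image per step
scale = my_batch_size / REFERENCE_BATCH_SIZE

base_lr = REFERENCE_BASE_LR * scale           # 0.00125
decay_steps = [int(s / scale) for s in REFERENCE_DECAY_STEPS]

print(f"scaled base LR: {base_lr}")
print(f"stretched decay steps: {decay_steps}")
```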

Any inputs/suggestions are more than welcome as I am a bit stuck at the moment and do not have access to more than 1 GPU.

Regards,
Chandra

@Shuixin-Li

Same question here; my losses even go to NaN. What is happening? (Actually, I am not sure how to check the number of GPUs, but when I queried the GPU the machine only listed one device name, so I assume I only have one GPU.)

@nuschandra, have you solved this problem?
Any comments and advice are welcome (TAT, a crying face).
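
A generic way to confirm how many GPUs TensorFlow can actually see (not specific to this repo; `nvidia-smi -L` on the command line reports the same thing):

```python
# Generic GPU-count check for a TF1-style setup such as tensorpack's.
from tensorflow.python.client import device_lib

gpus = [d for d in device_lib.list_local_devices() if d.device_type == "GPU"]
print(f"{len(gpus)} GPU(s) visible to TensorFlow")
for d in gpus:
    print("  ", d.name, d.physical_device_desc)
```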

Here is part of the log:

[0621 21:52:52 @base.py:283] Epoch 142 (global_step 71000) finished, time:47.1 seconds.
[0621 21:52:52 @misc.py:109] Estimated Time Left: 1 day 12 hours 11 minutes 51 seconds
[0621 21:52:52 @monitor.py:474] GPUUtil/0: 75.174
[0621 21:52:52 @monitor.py:474] HostFreeMemory (GB): 233.03
[0621 21:52:52 @monitor.py:474] PeakMemory(MB)/gpu:0: 2020.6
[0621 21:52:52 @monitor.py:474] QueueInput/queue_size: 50
[0621 21:52:52 @monitor.py:474] Throughput (samples/sec): 10.624
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/box_loss: 0.045659
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/box_loss_debug: 0.045659
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/box_loss_unormalized: 2.9222
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/detect_empty_labels: 0
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/label_loss: 0.45146
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/label_metrics/accuracy: 0.92609
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/label_metrics/false_negative: 1
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/label_metrics/fg_accuracy: 2.3493e-37
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/num_fg_label: 4.73
[0621 21:52:52 @monitor.py:474] learning_rate: 0.01
[0621 21:52:52 @monitor.py:474] mean_gt_box_area: 26681
[0621 21:52:52 @monitor.py:474] multilevel_roi_align/fpn_map_rois_to_levels/num_roi_level2: 63.574
[0621 21:52:52 @monitor.py:474] multilevel_roi_align/fpn_map_rois_to_levels/num_roi_level3: 0.1091
[0621 21:52:52 @monitor.py:474] multilevel_roi_align/fpn_map_rois_to_levels/num_roi_level4: 0.31203
[0621 21:52:52 @monitor.py:474] multilevel_roi_align/fpn_map_rois_to_levels/num_roi_level5: 0.0048982
[0621 21:52:52 @monitor.py:474] rpn_losses/box_loss: 0.033246
[0621 21:52:52 @monitor.py:474] rpn_losses/label_loss: 0.21476
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/box_loss: 0.023036
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_loss: 0.15323
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/precision_th0.1: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/precision_th0.2: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/precision_th0.5: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/recall_th0.1: 0.29284
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/recall_th0.2: 0.29284
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/recall_th0.5: 0.29284
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/num_pos_anchor: 10.744
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/num_valid_anchor: 198.3
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/box_loss: 0.0064434
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_loss: 0.041088
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/precision_th0.1: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/precision_th0.2: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/precision_th0.5: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/recall_th0.1: 0.39949
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/recall_th0.2: 0.39949
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/recall_th0.5: 0.39949
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/num_pos_anchor: 3.0853
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/num_valid_anchor: 46.819
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/box_loss: 0.0014308
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_loss: 0.0070757
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/precision_th0.1: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/precision_th0.2: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/precision_th0.5: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/recall_th0.1: 0.44505
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/recall_th0.2: 0.44505
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/recall_th0.5: 0.44505
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/num_pos_anchor: 0.51146
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/num_valid_anchor: 8.6195
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/box_loss: 0.0020379
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_loss: 0.01154
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/precision_th0.1: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/precision_th0.2: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/precision_th0.5: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/recall_th0.1: 0.38672
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/recall_th0.2: 0.38672
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/recall_th0.5: 0.38672
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/num_pos_anchor: 1.0808
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/num_valid_anchor: 2.0708
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/box_loss: 0.00029753
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_loss: 0.0018269
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/precision_th0.1: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/precision_th0.2: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/precision_th0.5: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/recall_th0.1: 0.46507
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/recall_th0.2: 0.46507
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/recall_th0.5: 0.46507
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/num_pos_anchor: 0.18924
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/num_valid_anchor: 0.19345
[0621 21:52:52 @monitor.py:474] sample_fast_rcnn_targets/num_bg: 59.27
[0621 21:52:52 @monitor.py:474] sample_fast_rcnn_targets/num_fg: 4.73
[0621 21:52:52 @monitor.py:474] sample_fast_rcnn_targets/proposal_metrics/best_iou_per_gt/Merge: 0.11147
[0621 21:52:52 @monitor.py:474] sample_fast_rcnn_targets/proposal_metrics/recall_iou0.3: 0.12741
[0621 21:52:52 @monitor.py:474] sample_fast_rcnn_targets/proposal_metrics/recall_iou0.5: 0.05331
[0621 21:52:52 @monitor.py:474] total_cost: nan
[0621 21:52:52 @monitor.py:474] wd_cost: nan
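
I notice that total_cost and wd_cost are already nan while every individual loss term in the log still looks finite, so the NaN seems to enter on the total-cost / weight-decay side rather than in one of the printed losses. A generic TF1-style guard that names the first tensor to go non-finite (just a debugging sketch, not code from this repo):

```python
import tensorflow as tf  # TF1 graph mode, as used by tensorpack


def guard(tensor, name):
    # Raises InvalidArgumentError the first time `tensor` contains NaN/Inf,
    # instead of letting it silently propagate into total_cost.
    return tf.debugging.check_numerics(tensor, message=f"NaN/Inf in {name}")


# Hypothetical usage inside the loss-building code:
#   wd_cost    = guard(wd_cost, "wd_cost")
#   total_cost = guard(total_cost, "total_cost")
```

Lowering the learning rate (see the scaling sketch in the first comment) is also a common first thing to try when a single-GPU run diverges to NaN.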

@Shuixin-Li

@zizhaozhang Thank you for your hard work on this. Could you please help with this problem? Or is it actually the case that this cannot run on custom data with one GPU?
