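Different servers running the same program: one machine trains stably for several days, while the other throws an error after less than 3 hours of training #77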

Closed
dragon515 opened this issue Dec 8, 2019 · 4 comments


dragon515 commented Dec 8, 2019

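As the title says: the same program runs on different servers. One machine trains stably for several days, while the other fails with the following error after less than 3 hours of training: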
PaddleCheckError: Expected index.dims()[0] > 0, but received index.dims()[0]:0 <= 0:0. The index of gather_op should not be empty when the index's rank is 1. at [/paddle/paddle/fluid/operators/gather.cu.h:82]
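A diagnostic sketch that may help narrow this down. The Python call stack in the log points at `rpn_target_assign`, where `score_index` was empty when passed to `gather`. One possibility (an assumption, not something confirmed in this report) is that a training batch contained an image with no usable ground-truth boxes. The helper names `images_without_usable_boxes` and `check_file` are hypothetical; the code only relies on the standard COCO annotation layout (`images`, `annotations`, `bbox` as `[x, y, width, height]`).

```python
import json
from collections import defaultdict


def images_without_usable_boxes(coco):
    """Return ids of images that have no annotation with positive width and height.

    `coco` is a dict loaded from a COCO-style annotation file.
    This is a hypothetical helper for diagnosis, not part of PaddleDetection.
    """
    usable = defaultdict(int)
    for ann in coco.get("annotations", []):
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        if w > 0 and h > 0:
            usable[ann["image_id"]] += 1
    # Images with zero usable boxes are candidates for an empty RPN target.
    return [img["id"] for img in coco.get("images", []) if usable[img["id"]] == 0]


def check_file(path):
    """Load an annotation file and list the suspicious image ids."""
    with open(path, "r", encoding="utf-8") as f:
        return images_without_usable_boxes(json.load(f))


# Example call, using the annotation path that appears in the log:
# print(check_file("dataset/coco/annotations/instances_train2017.json"))
```

If the list is non-empty on the failing machine but empty on the stable one, the two machines are probably not reading identical data. If it is empty on both, this hypothesis can be set aside.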

2019-12-08 09:36:34,183-INFO: 1076 samples in file dataset/coco/annotations/instances_val2017.json
2019-12-08 09:36:34,186-INFO: places would be ommited when DataLoader is not iterable
W1208 09:36:36.099746   317 device_context.cc:235] Please NOTE: device: 0, CUDA Capability: 60, Driver API Version: 10.0, Runtime API Version: 10.0
W1208 09:36:36.412878   317 device_context.cc:243] device: 0, cuDNN Version: 7.6.
2019-12-08 09:36:40,108-INFO: Loading checkpoint from output/faster_rcnn_dcn_x101_vd_64x4d_fpn_1x/27000...
2019-12-08 09:36:46,786-INFO: 20764 samples in file dataset/coco/annotations/instances_train2017.json
2019-12-08 09:36:46,893-INFO: places would be ommited when DataLoader is not iterable
I1208 09:36:47.311914   317 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 2. And the Program will be copied 2 copies
I1208 09:36:49.598440   317 graph_pattern_detector.cc:96] ---  detected 40 subgraphs
I1208 09:36:49.672873   317 graph_pattern_detector.cc:96] ---  detected 37 subgraphs
W1208 09:36:49.861197   317 fuse_all_reduce_op_pass.cc:72] Find all_reduce operators: 183. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 153.
I1208 09:36:49.873278   317 build_strategy.cc:363] SeqOnlyAllReduceOps:0, num_trainers:1
I1208 09:36:50.447363   317 parallel_executor.cc:285] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I1208 09:36:50.538653   317 parallel_executor.cc:368] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
2019-12-08 09:37:11,792-INFO: iter: 27020, lr: 0.002500, 'loss_cls': '0.107255', 'loss_bbox': '0.027132', 'loss_rpn_cls': '0.029164', 'loss_rpn_bbox': '0.004501', 'loss': '0.162709', time: 1.196, eta: 9 days, 14:16:56
2019-12-08 09:37:33,380-INFO: iter: 27040, lr: 0.002500, 'loss_cls': '0.160085', 'loss_bbox': '0.037842', 'loss_rpn_cls': '0.029905', 'loss_rpn_bbox': '0.005318', 'loss': '0.248189', time: 1.069, eta: 8 days, 13:42:18
2019-12-08 09:37:56,325-INFO: iter: 27060, lr: 0.002500, 'loss_cls': '0.111756', 'loss_bbox': '0.042657', 'loss_rpn_cls': '0.022942', 'loss_rpn_bbox': '0.003554', 'loss': '0.176668', time: 1.153, eta: 9 days, 5:55:00
2019-12-08 09:38:18,266-INFO: iter: 27080, lr: 0.002500, 'loss_cls': '0.117590', 'loss_bbox': '0.037847', 'loss_rpn_cls': '0.028765', 'loss_rpn_bbox': '0.007621', 'loss': '0.183682', time: 1.103, eta: 8 days, 20:15:40
2019-12-08 09:38:38,424-INFO: iter: 27100, lr: 0.002500, 'loss_cls': '0.109035', 'loss_bbox': '0.036450', 'loss_rpn_cls': '0.020142', 'loss_rpn_bbox': '0.002935', 'loss': '0.167680', time: 1.001, eta: 8 days, 0:35:44
2019-12-08 09:39:01,073-INFO: iter: 27120, lr: 0.002500, 'loss_cls': '0.112554', 'loss_bbox': '0.030803', 'loss_rpn_cls': '0.027392', 'loss_rpn_bbox': '0.007306', 'loss': '0.210294', time: 1.125, eta: 9 days, 0:33:28
2019-12-08 09:39:22,075-INFO: iter: 27140, lr: 0.002500, 'loss_cls': '0.119142', 'loss_bbox': '0.032270', 'loss_rpn_cls': '0.024191', 'loss_rpn_bbox': '0.004826', 'loss': '0.176259', time: 1.051, eta: 8 days, 10:21:23
2019-12-08 09:39:43,539-INFO: iter: 27160, lr: 0.002500, 'loss_cls': '0.146193', 'loss_bbox': '0.047863', 'loss_rpn_cls': '0.028642', 'loss_rpn_bbox': '0.007621', 'loss': '0.238028', time: 1.064, eta: 8 days, 12:47:36
2019-12-08 09:40:05,350-INFO: iter: 27180, lr: 0.002500, 'loss_cls': '0.095980', 'loss_bbox': '0.035866', 'loss_rpn_cls': '0.022334', 'loss_rpn_bbox': '0.004785', 'loss': '0.186832', time: 1.113, eta: 8 days, 22:10:17
2019-12-08 09:40:27,442-INFO: iter: 27200, lr: 0.002500, 'loss_cls': '0.103658', 'loss_bbox': '0.030443', 'loss_rpn_cls': '0.023747', 'loss_rpn_bbox': '0.005142', 'loss': '0.174824', time: 1.100, eta: 8 days, 19:36:49
2019-12-08 09:40:48,639-INFO: iter: 27220, lr: 0.002500, 'loss_cls': '0.123171', 'loss_bbox': '0.041430', 'loss_rpn_cls': '0.026343', 'loss_rpn_bbox': '0.007074', 'loss': '0.197138', time: 1.065, eta: 8 days, 12:56:42
2019-12-08 09:41:09,553-INFO: iter: 27240, lr: 0.002500, 'loss_cls': '0.126446', 'loss_bbox': '0.030707', 'loss_rpn_cls': '0.021942', 'loss_rpn_bbox': '0.004218', 'loss': '0.185318', time: 1.044, eta: 8 days, 8:53:24
2019-12-08 09:41:29,916-INFO: iter: 27260, lr: 0.002500, 'loss_cls': '0.109329', 'loss_bbox': '0.035236', 'loss_rpn_cls': '0.029849', 'loss_rpn_bbox': '0.008709', 'loss': '0.188167', time: 1.012, eta: 8 days, 2:44:33
2019-12-08 09:41:51,827-INFO: iter: 27280, lr: 0.002500, 'loss_cls': '0.117947', 'loss_bbox': '0.035055', 'loss_rpn_cls': '0.022669', 'loss_rpn_bbox': '0.004247', 'loss': '0.183069', time: 1.100, eta: 8 days, 19:38:23
2019-12-08 09:42:13,266-INFO: iter: 27300, lr: 0.002500, 'loss_cls': '0.104928', 'loss_bbox': '0.032062', 'loss_rpn_cls': '0.031705', 'loss_rpn_bbox': '0.006268', 'loss': '0.186234', time: 1.072, eta: 8 days, 14:19:22
2019-12-08 09:42:35,486-INFO: iter: 27320, lr: 0.002500, 'loss_cls': '0.110162', 'loss_bbox': '0.032131', 'loss_rpn_cls': '0.037423', 'loss_rpn_bbox': '0.009977', 'loss': '0.179894', time: 1.113, eta: 8 days, 22:07:25
2019-12-08 09:42:57,989-INFO: iter: 27340, lr: 0.002500, 'loss_cls': '0.098853', 'loss_bbox': '0.036495', 'loss_rpn_cls': '0.022839', 'loss_rpn_bbox': '0.007259', 'loss': '0.177353', time: 1.103, eta: 8 days, 20:13:59
2019-12-08 09:43:18,957-INFO: iter: 27360, lr: 0.002500, 'loss_cls': '0.136675', 'loss_bbox': '0.037647', 'loss_rpn_cls': '0.024854', 'loss_rpn_bbox': '0.005528', 'loss': '0.243505', time: 1.051, eta: 8 days, 10:15:43
2019-12-08 09:43:40,055-INFO: iter: 27380, lr: 0.002500, 'loss_cls': '0.107223', 'loss_bbox': '0.036117', 'loss_rpn_cls': '0.025308', 'loss_rpn_bbox': '0.005180', 'loss': '0.191671', time: 1.074, eta: 8 days, 14:33:44
2019-12-08 09:44:01,801-INFO: iter: 27400, lr: 0.002500, 'loss_cls': '0.114322', 'loss_bbox': '0.032498', 'loss_rpn_cls': '0.021028', 'loss_rpn_bbox': '0.003865', 'loss': '0.180020', time: 1.088, eta: 8 days, 17:13:38
2019-12-08 09:44:23,235-INFO: iter: 27420, lr: 0.002500, 'loss_cls': '0.101029', 'loss_bbox': '0.029264', 'loss_rpn_cls': '0.028364', 'loss_rpn_bbox': '0.007907', 'loss': '0.181565', time: 1.067, eta: 8 days, 13:19:37
2019-12-08 09:44:44,491-INFO: iter: 27440, lr: 0.002500, 'loss_cls': '0.097345', 'loss_bbox': '0.030726', 'loss_rpn_cls': '0.029091', 'loss_rpn_bbox': '0.005249', 'loss': '0.170476', time: 1.069, eta: 8 days, 13:35:30
2019-12-08 09:45:06,253-INFO: iter: 27460, lr: 0.002500, 'loss_cls': '0.107105', 'loss_bbox': '0.038378', 'loss_rpn_cls': '0.025624', 'loss_rpn_bbox': '0.005551', 'loss': '0.181381', time: 1.087, eta: 8 days, 17:10:51
2019-12-08 09:45:28,042-INFO: iter: 27480, lr: 0.002500, 'loss_cls': '0.108549', 'loss_bbox': '0.029579', 'loss_rpn_cls': '0.024967', 'loss_rpn_bbox': '0.006547', 'loss': '0.215194', time: 1.064, eta: 8 days, 12:39:42
2019-12-08 09:45:49,157-INFO: iter: 27500, lr: 0.002500, 'loss_cls': '0.112655', 'loss_bbox': '0.038301', 'loss_rpn_cls': '0.025386', 'loss_rpn_bbox': '0.006977', 'loss': '0.210186', time: 1.081, eta: 8 days, 15:58:03
2019-12-08 09:46:10,362-INFO: iter: 27520, lr: 0.002500, 'loss_cls': '0.115814', 'loss_bbox': '0.031053', 'loss_rpn_cls': '0.022635', 'loss_rpn_bbox': '0.004697', 'loss': '0.195093', time: 1.056, eta: 8 days, 11:02:29
2019-12-08 09:46:32,177-INFO: iter: 27540, lr: 0.002500, 'loss_cls': '0.116692', 'loss_bbox': '0.036010', 'loss_rpn_cls': '0.027288', 'loss_rpn_bbox': '0.005695', 'loss': '0.193081', time: 1.066, eta: 8 days, 13:02:31
2019-12-08 09:46:54,590-INFO: iter: 27560, lr: 0.002500, 'loss_cls': '0.128424', 'loss_bbox': '0.034904', 'loss_rpn_cls': '0.025524', 'loss_rpn_bbox': '0.004873', 'loss': '0.208800', time: 1.151, eta: 9 days, 5:23:29
2019-12-08 09:47:16,635-INFO: iter: 27580, lr: 0.002500, 'loss_cls': '0.107845', 'loss_bbox': '0.033880', 'loss_rpn_cls': '0.027985', 'loss_rpn_bbox': '0.005409', 'loss': '0.170375', time: 1.102, eta: 8 days, 19:58:25
2019-12-08 09:47:37,165-INFO: iter: 27600, lr: 0.002500, 'loss_cls': '0.096917', 'loss_bbox': '0.027007', 'loss_rpn_cls': '0.023711', 'loss_rpn_bbox': '0.005217', 'loss': '0.149012', time: 1.022, eta: 8 days, 4:34:20
2019-12-08 09:47:59,157-INFO: iter: 27620, lr: 0.002500, 'loss_cls': '0.148022', 'loss_bbox': '0.041588', 'loss_rpn_cls': '0.034595', 'loss_rpn_bbox': '0.008284', 'loss': '0.243836', time: 1.097, eta: 8 days, 18:56:47
2019-12-08 09:48:21,268-INFO: iter: 27640, lr: 0.002500, 'loss_cls': '0.102161', 'loss_bbox': '0.033482', 'loss_rpn_cls': '0.022656', 'loss_rpn_bbox': '0.003083', 'loss': '0.176066', time: 1.113, eta: 8 days, 22:04:28
2019-12-08 09:48:42,021-INFO: iter: 27660, lr: 0.002500, 'loss_cls': '0.116623', 'loss_bbox': '0.031174', 'loss_rpn_cls': '0.021008', 'loss_rpn_bbox': '0.006698', 'loss': '0.181974', time: 1.032, eta: 8 days, 6:25:45

2019-12-08 10:03:30,895-INFO: iter: 28260, lr: 0.002500, 'loss_cls': '0.120618', 'loss_bbox': '0.036356', 'loss_rpn_cls': '0.028862', 'loss_rpn_bbox': '0.006762', 'loss': '0.196524', time: 1.062, eta: 8 days, 12:08:16
2019-12-08 10:03:54,389-INFO: iter: 28280, lr: 0.002500, 'loss_cls': '0.116200', 'loss_bbox': '0.037455', 'loss_rpn_cls': '0.026497', 'loss_rpn_bbox': '0.006574', 'loss': '0.179571', time: 1.176, eta: 9 days, 9:52:23
2019-12-08 10:04:16,664-INFO: iter: 28300, lr: 0.002500, 'loss_cls': '0.090389', 'loss_bbox': '0.022946', 'loss_rpn_cls': '0.028943', 'loss_rpn_bbox': '0.005928', 'loss': '0.150511', time: 1.106, eta: 8 days, 20:34:48
2019-12-08 10:04:39,204-INFO: iter: 28320, lr: 0.002500, 'loss_cls': '0.099024', 'loss_bbox': '0.030282', 'loss_rpn_cls': '0.027065', 'loss_rpn_bbox': '0.006674', 'loss': '0.180380', time: 1.125, eta: 9 days, 0:10:30
2019-12-08 10:05:00,137-INFO: iter: 28340, lr: 0.002500, 'loss_cls': '0.104252', 'loss_bbox': '0.031640', 'loss_rpn_cls': '0.023287', 'loss_rpn_bbox': '0.004962', 'loss': '0.179577', time: 1.055, eta: 8 days, 10:42:35
2019-12-08 10:05:20,870-INFO: iter: 28360, lr: 0.002500, 'loss_cls': '0.106998', 'loss_bbox': '0.036490', 'loss_rpn_cls': '0.027223', 'loss_rpn_bbox': '0.006033', 'loss': '0.196109', time: 1.037, eta: 8 days, 7:08:30

2019-12-08 11:26:26,361-INFO: iter: 32040, lr: 0.002500, 'loss_cls': '0.093410', 'loss_bbox': '0.025801', 'loss_rpn_cls': '0.020566', 'loss_rpn_bbox': '0.004328', 'loss': '0.155523', time: 1.130, eta: 8 days, 23:52:36
2019-12-08 11:26:48,282-INFO: iter: 32060, lr: 0.002500, 'loss_cls': '0.116218', 'loss_bbox': '0.030774', 'loss_rpn_cls': '0.028761', 'loss_rpn_bbox': '0.008732', 'loss': '0.198504', time: 1.111, eta: 8 days, 20:15:33
2019-12-08 11:27:10,345-INFO: iter: 32080, lr: 0.002500, 'loss_cls': '0.130258', 'loss_bbox': '0.032691', 'loss_rpn_cls': '0.036417', 'loss_rpn_bbox': '0.007845', 'loss': '0.220228', time: 1.090, eta: 8 days, 16:15:38
2019-12-08 11:27:32,476-INFO: iter: 32100, lr: 0.002500, 'loss_cls': '0.108925', 'loss_bbox': '0.037187', 'loss_rpn_cls': '0.021106', 'loss_rpn_bbox': '0.007838', 'loss': '0.178534', time: 1.119, eta: 8 days, 21:54:37
2019-12-08 11:27:54,745-INFO: iter: 32120, lr: 0.002500, 'loss_cls': '0.112486', 'loss_bbox': '0.036119', 'loss_rpn_cls': '0.026041', 'loss_rpn_bbox': '0.005284', 'loss': '0.181879', time: 1.102, eta: 8 days, 18:36:12
2019-12-08 11:28:17,569-INFO: iter: 32140, lr: 0.002500, 'loss_cls': '0.105735', 'loss_bbox': '0.034030', 'loss_rpn_cls': '0.024859', 'loss_rpn_bbox': '0.003535', 'loss': '0.172869', time: 1.147, eta: 9 days, 3:05:29
2019-12-08 11:28:38,749-INFO: iter: 32160, lr: 0.002500, 'loss_cls': '0.097556', 'loss_bbox': '0.032315', 'loss_rpn_cls': '0.027183', 'loss_rpn_bbox': '0.004453', 'loss': '0.168935', time: 1.063, eta: 8 days, 11:03:53
2019-12-08 11:29:01,167-INFO: iter: 32180, lr: 0.002500, 'loss_cls': '0.106844', 'loss_bbox': '0.027867', 'loss_rpn_cls': '0.027673', 'loss_rpn_bbox': '0.007006', 'loss': '0.180671', time: 1.103, eta: 8 days, 18:40:46
2019-12-08 11:29:22,644-INFO: iter: 32200, lr: 0.002500, 'loss_cls': '0.111045', 'loss_bbox': '0.039628', 'loss_rpn_cls': '0.023468', 'loss_rpn_bbox': '0.004395', 'loss': '0.191487', time: 1.086, eta: 8 days, 15:34:46
2019-12-08 11:29:44,434-INFO: iter: 32220, lr: 0.002500, 'loss_cls': '0.124690', 'loss_bbox': '0.027931', 'loss_rpn_cls': '0.019735', 'loss_rpn_bbox': '0.005108', 'loss': '0.175699', time: 1.097, eta: 8 days, 17:33:22
2019-12-08 11:30:05,626-INFO: iter: 32240, lr: 0.002500, 'loss_cls': '0.095456', 'loss_bbox': '0.035033', 'loss_rpn_cls': '0.022073', 'loss_rpn_bbox': '0.003227', 'loss': '0.156538', time: 1.059, eta: 8 days, 10:13:40
2019-12-08 11:30:26,897-INFO: iter: 32260, lr: 0.002500, 'loss_cls': '0.119746', 'loss_bbox': '0.046080', 'loss_rpn_cls': '0.019381', 'loss_rpn_bbox': '0.004442', 'loss': '0.182005', time: 1.066, eta: 8 days, 11:36:56
2019-12-08 11:30:48,440-INFO: iter: 32280, lr: 0.002500, 'loss_cls': '0.115690', 'loss_bbox': '0.028565', 'loss_rpn_cls': '0.028244', 'loss_rpn_bbox': '0.005486', 'loss': '0.200766', time: 1.057, eta: 8 days, 9:58:26
2019-12-08 11:31:09,633-INFO: iter: 32300, lr: 0.002500, 'loss_cls': '0.107902', 'loss_bbox': '0.033771', 'loss_rpn_cls': '0.021195', 'loss_rpn_bbox': '0.004134', 'loss': '0.175219', time: 1.079, eta: 8 days, 14:09:52
2019-12-08 11:31:30,260-INFO: iter: 32320, lr: 0.002500, 'loss_cls': '0.098927', 'loss_bbox': '0.037162', 'loss_rpn_cls': '0.021459', 'loss_rpn_bbox': '0.005403', 'loss': '0.158387', time: 1.029, eta: 8 days, 4:37:34
2019-12-08 11:31:53,241-INFO: iter: 32340, lr: 0.002500, 'loss_cls': '0.107896', 'loss_bbox': '0.033144', 'loss_rpn_cls': '0.033767', 'loss_rpn_bbox': '0.004966', 'loss': '0.197591', time: 1.142, eta: 9 days, 2:11:20

2019-12-08 11:32:13,843-INFO: iter: 32360, lr: 0.002500, 'loss_cls': '0.107879', 'loss_bbox': '0.041093', 'loss_rpn_cls': '0.033133', 'loss_rpn_bbox': '0.011063', 'loss': '0.211041', time: 1.035, eta: 8 days, 5:37:19
2019-12-08 11:32:34,803-INFO: iter: 32380, lr: 0.002500, 'loss_cls': '0.110426', 'loss_bbox': '0.039289', 'loss_rpn_cls': '0.027911', 'loss_rpn_bbox': '0.005713', 'loss': '0.184945', time: 1.052, eta: 8 days, 8:53:31
2019-12-08 11:32:55,430-INFO: iter: 32400, lr: 0.002500, 'loss_cls': '0.128478', 'loss_bbox': '0.041886', 'loss_rpn_cls': '0.026493', 'loss_rpn_bbox': '0.006556', 'loss': '0.188470', time: 1.031, eta: 8 days, 4:53:08
2019-12-08 11:33:17,293-INFO: iter: 32420, lr: 0.002500, 'loss_cls': '0.110536', 'loss_bbox': '0.032399', 'loss_rpn_cls': '0.030574', 'loss_rpn_bbox': '0.007749', 'loss': '0.206734', time: 1.086, eta: 8 days, 15:30:35
2019-12-08 11:33:38,599-INFO: iter: 32440, lr: 0.002500, 'loss_cls': '0.095937', 'loss_bbox': '0.035684', 'loss_rpn_cls': '0.020276', 'loss_rpn_bbox': '0.007348', 'loss': '0.159013', time: 1.070, eta: 8 days, 12:17:27
2019-12-08 11:34:00,011-INFO: iter: 32460, lr: 0.002500, 'loss_cls': '0.120759', 'loss_bbox': '0.040102', 'loss_rpn_cls': '0.025354', 'loss_rpn_bbox': '0.005780', 'loss': '0.201859', time: 1.072, eta: 8 days, 12:43:34
2019-12-08 11:34:20,885-INFO: iter: 32480, lr: 0.002500, 'loss_cls': '0.091575', 'loss_bbox': '0.030783', 'loss_rpn_cls': '0.027161', 'loss_rpn_bbox': '0.004689', 'loss': '0.153618', time: 1.039, eta: 8 days, 6:27:44
2019-12-08 11:34:41,995-INFO: iter: 32500, lr: 0.002500, 'loss_cls': '0.131100', 'loss_bbox': '0.042119', 'loss_rpn_cls': '0.024955', 'loss_rpn_bbox': '0.005969', 'loss': '0.209270', time: 1.054, eta: 8 days, 9:20:55
2019-12-08 11:35:03,078-INFO: iter: 32520, lr: 0.002500, 'loss_cls': '0.108701', 'loss_bbox': '0.025954', 'loss_rpn_cls': '0.019844', 'loss_rpn_bbox': '0.004977', 'loss': '0.180352', time: 1.053, eta: 8 days, 9:05:40
2019-12-08 11:35:24,807-INFO: iter: 32540, lr: 0.002500, 'loss_cls': '0.100763', 'loss_bbox': '0.031137', 'loss_rpn_cls': '0.025281', 'loss_rpn_bbox': '0.004916', 'loss': '0.175427', time: 1.068, eta: 8 days, 11:56:12
2019-12-08 11:35:47,071-INFO: iter: 32560, lr: 0.002500, 'loss_cls': '0.103370', 'loss_bbox': '0.027407', 'loss_rpn_cls': '0.025815', 'loss_rpn_bbox': '0.007054', 'loss': '0.167267', time: 1.132, eta: 9 days, 0:06:07
2019-12-08 11:36:09,177-INFO: iter: 32580, lr: 0.002500, 'loss_cls': '0.093084', 'loss_bbox': '0.023440', 'loss_rpn_cls': '0.023047', 'loss_rpn_bbox': '0.006587', 'loss': '0.151249', time: 1.100, eta: 8 days, 18:06:41
2019-12-08 11:36:30,414-INFO: iter: 32600, lr: 0.002500, 'loss_cls': '0.092012', 'loss_bbox': '0.028003', 'loss_rpn_cls': '0.021767', 'loss_rpn_bbox': '0.006627', 'loss': '0.159122', time: 1.050, eta: 8 days, 8:25:08
2019-12-08 11:36:52,008-INFO: iter: 32620, lr: 0.002500, 'loss_cls': '0.118538', 'loss_bbox': '0.029654', 'loss_rpn_cls': '0.031362', 'loss_rpn_bbox': '0.007302', 'loss': '0.238050', time: 1.106, eta: 8 days, 19:08:26
2019-12-08 11:37:13,449-INFO: iter: 32640, lr: 0.002500, 'loss_cls': '0.072994', 'loss_bbox': '0.026526', 'loss_rpn_cls': '0.022744', 'loss_rpn_bbox': '0.004115', 'loss': '0.134926', time: 1.055, eta: 8 days, 9:23:13
2019-12-08 11:37:34,177-INFO: iter: 32660, lr: 0.002500, 'loss_cls': '0.097421', 'loss_bbox': '0.024360', 'loss_rpn_cls': '0.023935', 'loss_rpn_bbox': '0.006053', 'loss': '0.161697', time: 1.053, eta: 8 days, 9:03:48
2019-12-08 11:37:55,721-INFO: iter: 32680, lr: 0.002500, 'loss_cls': '0.104007', 'loss_bbox': '0.027913', 'loss_rpn_cls': '0.023525', 'loss_rpn_bbox': '0.008149', 'loss': '0.181776', time: 1.053, eta: 8 days, 9:06:54
2019-12-08 11:38:17,213-INFO: iter: 32700, lr: 0.002500, 'loss_cls': '0.115015', 'loss_bbox': '0.037233', 'loss_rpn_cls': '0.026973', 'loss_rpn_bbox': '0.007489', 'loss': '0.198169', time: 1.099, eta: 8 days, 17:47:08
2019-12-08 11:38:37,801-INFO: iter: 32720, lr: 0.002500, 'loss_cls': '0.107442', 'loss_bbox': '0.030250', 'loss_rpn_cls': '0.026453', 'loss_rpn_bbox': '0.006128', 'loss': '0.170408', time: 1.023, eta: 8 days, 3:16:08
2019-12-08 11:38:58,985-INFO: iter: 32740, lr: 0.002500, 'loss_cls': '0.115347', 'loss_bbox': '0.037755', 'loss_rpn_cls': '0.022134', 'loss_rpn_bbox': '0.003413', 'loss': '0.182344', time: 1.065, eta: 8 days, 11:21:58
2019-12-08 11:39:20,945-INFO: iter: 32760, lr: 0.002500, 'loss_cls': '0.100408', 'loss_bbox': '0.030180', 'loss_rpn_cls': '0.031180', 'loss_rpn_bbox': '0.006408', 'loss': '0.176443', time: 1.091, eta: 8 days, 16:18:49
2019-12-08 11:39:42,146-INFO: iter: 32780, lr: 0.002500, 'loss_cls': '0.090289', 'loss_bbox': '0.019610', 'loss_rpn_cls': '0.024948', 'loss_rpn_bbox': '0.003633', 'loss': '0.158710', time: 1.066, eta: 8 days, 11:26:46
2019-12-08 11:40:03,092-INFO: iter: 32800, lr: 0.002500, 'loss_cls': '0.096883', 'loss_bbox': '0.028336', 'loss_rpn_cls': '0.028111', 'loss_rpn_bbox': '0.007136', 'loss': '0.154402', time: 1.048, eta: 8 days, 7:57:52
2019-12-08 11:40:25,261-INFO: iter: 32820, lr: 0.002500, 'loss_cls': '0.127951', 'loss_bbox': '0.041875', 'loss_rpn_cls': '0.027269', 'loss_rpn_bbox': '0.006420', 'loss': '0.202579', time: 1.100, eta: 8 days, 17:53:07
2019-12-08 11:40:47,515-INFO: iter: 32840, lr: 0.002500, 'loss_cls': '0.105165', 'loss_bbox': '0.026979', 'loss_rpn_cls': '0.026087', 'loss_rpn_bbox': '0.006410', 'loss': '0.165277', time: 1.106, eta: 8 days, 19:01:51
2019-12-08 11:41:10,274-INFO: iter: 32860, lr: 0.002500, 'loss_cls': '0.113844', 'loss_bbox': '0.035535', 'loss_rpn_cls': '0.021415', 'loss_rpn_bbox': '0.005199', 'loss': '0.179511', time: 1.148, eta: 9 days, 3:02:05
2019-12-08 11:41:32,032-INFO: iter: 32880, lr: 0.002500, 'loss_cls': '0.107139', 'loss_bbox': '0.034098', 'loss_rpn_cls': '0.023092', 'loss_rpn_bbox': '0.005283', 'loss': '0.178474', time: 1.094, eta: 8 days, 16:51:14
2019-12-08 11:41:53,913-INFO: iter: 32900, lr: 0.002500, 'loss_cls': '0.107068', 'loss_bbox': '0.033079', 'loss_rpn_cls': '0.025516', 'loss_rpn_bbox': '0.005883', 'loss': '0.179267', time: 1.096, eta: 8 days, 17:05:30
2019-12-08 11:42:15,265-INFO: iter: 32920, lr: 0.002500, 'loss_cls': '0.084352', 'loss_bbox': '0.025512', 'loss_rpn_cls': '0.026543', 'loss_rpn_bbox': '0.005927', 'loss': '0.174963', time: 1.060, eta: 8 days, 10:17:00
2019-12-08 11:42:35,988-INFO: iter: 32940, lr: 0.002500, 'loss_cls': '0.094424', 'loss_bbox': '0.026548', 'loss_rpn_cls': '0.028392', 'loss_rpn_bbox': '0.004338', 'loss': '0.166381', time: 1.043, eta: 8 days, 6:58:18
2019-12-08 11:42:57,952-INFO: iter: 32960, lr: 0.002500, 'loss_cls': '0.077417', 'loss_bbox': '0.029750', 'loss_rpn_cls': '0.025627', 'loss_rpn_bbox': '0.006614', 'loss': '0.165685', time: 1.098, eta: 8 days, 17:34:56
2019-12-08 11:43:19,212-INFO: iter: 32980, lr: 0.002500, 'loss_cls': '0.104759', 'loss_bbox': '0.029914', 'loss_rpn_cls': '0.024789', 'loss_rpn_bbox': '0.005739', 'loss': '0.170735', time: 1.056, eta: 8 days, 9:36:00
2019-12-08 11:43:39,756-INFO: iter: 33000, lr: 0.002500, 'loss_cls': '0.116220', 'loss_bbox': '0.036502', 'loss_rpn_cls': '0.019375', 'loss_rpn_bbox': '0.004633', 'loss': '0.182570', time: 1.028, eta: 8 days, 4:06:34
2019-12-08 11:43:39,761-INFO: Save model to output/faster_rcnn_dcn_x101_vd_64x4d_fpn_1x/33000.
2019-12-08 11:43:55,919-INFO: Test iter 0
2019-12-08 11:44:14,859-INFO: Test iter 100
2019-12-08 11:44:33,713-INFO: Test iter 200
2019-12-08 11:44:53,128-INFO: Test iter 300
2019-12-08 11:45:12,317-INFO: Test iter 400
2019-12-08 11:45:30,595-INFO: Test iter 500
2019-12-08 11:45:48,932-INFO: Test iter 600
2019-12-08 11:46:07,444-INFO: Test iter 700
2019-12-08 11:46:25,915-INFO: Test iter 800
2019-12-08 11:46:44,161-INFO: Test iter 900
2019-12-08 11:47:02,660-INFO: Test iter 1000
2019-12-08 11:47:16,320-INFO: Test finish iter 1076
2019-12-08 11:47:16,321-INFO: Total number of images: 1076, inference time: 5.363484210675358 fps.
2019-12-08 11:47:16,698-INFO: Start evaluate...
2019-12-08 11:47:19,054-INFO: Best test box ap: 0.03341993076262028, in iter: 30000
2019-12-08 11:47:40,249-INFO: iter: 33020, lr: 0.002500, 'loss_cls': '0.108531', 'loss_bbox': '0.043173', 'loss_rpn_cls': '0.017900', 'loss_rpn_bbox': '0.004340', 'loss': '0.169767', time: 12.031, eta: 95 days, 15:54:32
2019-12-08 11:48:01,339-INFO: iter: 33040, lr: 0.002500, 'loss_cls': '0.079204', 'loss_bbox': '0.025323', 'loss_rpn_cls': '0.019523', 'loss_rpn_bbox': '0.004204', 'loss': '0.132759', time: 1.051, eta: 8 days, 8:30:30
2019-12-08 11:48:23,153-INFO: iter: 33060, lr: 0.002500, 'loss_cls': '0.090260', 'loss_bbox': '0.036725', 'loss_rpn_cls': '0.027528', 'loss_rpn_bbox': '0.006311', 'loss': '0.179389', time: 1.095, eta: 8 days, 17:01:20
2019-12-08 11:48:44,463-INFO: iter: 33080, lr: 0.002500, 'loss_cls': '0.093552', 'loss_bbox': '0.034076', 'loss_rpn_cls': '0.018074', 'loss_rpn_bbox': '0.007104', 'loss': '0.152736', time: 1.065, eta: 8 days, 11:12:37
2019-12-08 11:49:06,337-INFO: iter: 33100, lr: 0.002500, 'loss_cls': '0.136960', 'loss_bbox': '0.034778', 'loss_rpn_cls': '0.019313', 'loss_rpn_bbox': '0.004328', 'loss': '0.193764', time: 1.087, eta: 8 days, 15:29:17

2019-12-08 11:49:27,727-INFO: iter: 33120, lr: 0.002500, 'loss_cls': '0.096826', 'loss_bbox': '0.027477', 'loss_rpn_cls': '0.027583', 'loss_rpn_bbox': '0.003562', 'loss': '0.162150', time: 1.071, eta: 8 days, 12:25:21


2019-12-08 11:49:48,957-INFO: iter: 33140, lr: 0.002500, 'loss_cls': '0.111899', 'loss_bbox': '0.039398', 'loss_rpn_cls': '0.017612', 'loss_rpn_bbox': '0.005294', 'loss': '0.181398', time: 1.044, eta: 8 days, 7:10:46
2019-12-08 11:50:09,371-INFO: iter: 33160, lr: 0.002500, 'loss_cls': '0.114998', 'loss_bbox': '0.040129', 'loss_rpn_cls': '0.021477', 'loss_rpn_bbox': '0.005812', 'loss': '0.186288', time: 1.042, eta: 8 days, 6:47:26
2019-12-08 11:50:30,408-INFO: iter: 33180, lr: 0.002500, 'loss_cls': '0.088878', 'loss_bbox': '0.026715', 'loss_rpn_cls': '0.027719', 'loss_rpn_bbox': '0.007981', 'loss': '0.171643', time: 1.040, eta: 8 days, 6:19:52
2019-12-08 11:50:51,979-INFO: iter: 33200, lr: 0.002500, 'loss_cls': '0.127711', 'loss_bbox': '0.037682', 'loss_rpn_cls': '0.028521', 'loss_rpn_bbox': '0.004866', 'loss': '0.217983', time: 1.069, eta: 8 days, 11:54:39
2019-12-08 11:51:13,415-INFO: iter: 33220, lr: 0.002500, 'loss_cls': '0.135208', 'loss_bbox': '0.038123', 'loss_rpn_cls': '0.029232', 'loss_rpn_bbox': '0.005568', 'loss': '0.215429', time: 1.088, eta: 8 days, 15:28:55
2019-12-08 11:51:34,174-INFO: iter: 33240, lr: 0.002500, 'loss_cls': '0.124688', 'loss_bbox': '0.035213', 'loss_rpn_cls': '0.030661', 'loss_rpn_bbox': '0.007142', 'loss': '0.207017', time: 1.044, eta: 8 days, 7:04:49
2019-12-08 11:51:55,334-INFO: iter: 33260, lr: 0.002500, 'loss_cls': '0.107978', 'loss_bbox': '0.040368', 'loss_rpn_cls': '0.029550', 'loss_rpn_bbox': '0.005072', 'loss': '0.180974', time: 1.060, eta: 8 days, 10:09:37
2019-12-08 11:52:16,021-INFO: iter: 33280, lr: 0.002500, 'loss_cls': '0.117390', 'loss_bbox': '0.036064', 'loss_rpn_cls': '0.022095', 'loss_rpn_bbox': '0.004055', 'loss': '0.184817', time: 1.032, eta: 8 days, 4:57:10
2019-12-08 11:52:39,507-INFO: iter: 33300, lr: 0.002500, 'loss_cls': '0.099554', 'loss_bbox': '0.028488', 'loss_rpn_cls': '0.030340', 'loss_rpn_bbox': '0.004249', 'loss': '0.173735', time: 1.168, eta: 9 days, 6:50:47
2019-12-08 11:53:00,719-INFO: iter: 33320, lr: 0.002500, 'loss_cls': '0.091982', 'loss_bbox': '0.025819', 'loss_rpn_cls': '0.026813', 'loss_rpn_bbox': '0.004227', 'loss': '0.173283', time: 1.058, eta: 8 days, 9:50:57
/home/admin/.local/lib/python3.6/site-packages/paddle/fluid/executor.py:774: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "tools/train.py", line 340, in <module>
    main()
  File "tools/train.py", line 246, in main
    outs = exe.run(compiled_train_prog, fetch_list=train_values)
  File "/home/admin/.local/lib/python3.6/site-packages/paddle/fluid/executor.py", line 775, in run
    six.reraise(*sys.exc_info())
  File "/opt/conda/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/admin/.local/lib/python3.6/site-packages/paddle/fluid/executor.py", line 770, in run
    use_program_cache=use_program_cache)
  File "/home/admin/.local/lib/python3.6/site-packages/paddle/fluid/executor.py", line 829, in _run_impl
    return_numpy=return_numpy)
  File "/home/admin/.local/lib/python3.6/site-packages/paddle/fluid/executor.py", line 669, in _run_parallel
    tensors = exe.run(fetch_var_names)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2   void paddle::operators::GPUGather<float, int>(paddle::platform::DeviceContext const&, paddle::framework::Tensor const&, paddle::framework::Tensor const&, paddle::framework::Tensor*)
3   paddle::operators::GatherOpCUDAKernel<float>::Compute(paddle::framework::ExecutionContext const&) const
4   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::GatherOpCUDAKernel<float>, paddle::operators::GatherOpCUDAKernel<double>, paddle::operators::GatherOpCUDAKernel<long>, paddle::operators::GatherOpCUDAKernel<int>, paddle::operators::GatherOpCUDAKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&, paddle::framework::RuntimeContext*) const
6   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const
7   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&)
8   paddle::framework::details::ComputationOpHandle::RunImpl()
9   paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
10  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue<unsigned long> > const&, unsigned long*)
11  std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
12  std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
13  ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const

------------------------------------------
Python Call Stacks (More useful to users):
------------------------------------------
  File "/home/admin/.local/lib/python3.6/site-packages/paddle/fluid/framework.py", line 2459, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/admin/.local/lib/python3.6/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/home/admin/.local/lib/python3.6/site-packages/paddle/fluid/layers/nn.py", line 10806, in gather
    attrs={'overwrite': overwrite})
  File "/home/admin/.local/lib/python3.6/site-packages/paddle/fluid/layers/detection.py", line 428, in rpn_target_assign
    predicted_cls_logits = nn.gather(cls_logits, score_index)
  File "/data/nas/workspace/jupyter/PaddleDetection-release-0.1/ppdet/core/workspace.py", line 113, in partial_apply
    return op(*args, **kwargs_)
  File "/data/nas/workspace/jupyter/PaddleDetection-release-0.1/ppdet/modeling/anchor_heads/rpn_head.py", line 227, in get_loss
    im_info=im_info)
  File "/data/nas/workspace/jupyter/PaddleDetection-release-0.1/ppdet/modeling/architectures/faster_rcnn.py", line 100, in build
    rpn_loss = self.rpn_head.get_loss(im_info, gt_box, is_crowd)
  File "/data/nas/workspace/jupyter/PaddleDetection-release-0.1/ppdet/modeling/architectures/faster_rcnn.py", line 196, in train
    return self.build(feed_vars, 'train')
  File "tools/train.py", line 128, in main
    train_fetches = model.train(feed_vars)
  File "tools/train.py", line 340, in <module>
    main()

----------------------
Error Message Summary:
----------------------
PaddleCheckError: Expected index.dims()[0] > 0, but received index.dims()[0]:0 <= 0:0.
The index of gather_op should not be empty when the index's rank is 1. at [/paddle/paddle/fluid/operators/gather.cu.h:82]
  [operator < gather > error]
terminate called without an active exception
W1208 11:53:03.514448   366 init.cc:205] *** Aborted at 1575777183 (unix time) try "date -d @1575777183" if you are using GNU date ***
W1208 11:53:03.517763   366 init.cc:205] PC: @                0x0 (unknown)
W1208 11:53:03.519001   366 init.cc:205] *** SIGABRT (@0x1f90000013d) received by PID 317 (TID 0x7f0f2b4bf700) from PID 317; stack trace: ***
W1208 11:53:03.525950   366 init.cc:205]     @     0x7f139613d100 (unknown)
W1208 11:53:03.542176   366 init.cc:205]     @     0x7f1395da15f7 __GI_raise
W1208 11:53:03.555480   366 init.cc:205]     @     0x7f1395da2ce8 __GI_abort
W1208 11:53:03.578831   366 init.cc:205]     @     0x7f137646d84a __gnu_cxx::__verbose_terminate_handler()
W1208 11:53:03.585741   366 init.cc:205]     @     0x7f137646bf47 __cxxabiv1::__terminate()
W1208 11:53:03.608299   366 init.cc:205]     @     0x7f137646bf7d std::terminate()
W1208 11:53:03.621994   366 init.cc:205]     @     0x7f137646bc5a __gxx_personality_v0
W1208 11:53:03.625728   366 init.cc:205]     @     0x7f1388dd3b97 _Unwind_ForcedUnwind_Phase2
W1208 11:53:03.629989   366 init.cc:205]     @     0x7f1388dd3e7d _Unwind_ForcedUnwind
W1208 11:53:03.634865   366 init.cc:205]     @     0x7f139613bd60 __GI___pthread_unwind
W1208 11:53:03.639715   366 init.cc:205]     @     0x7f1396136dd5 __pthread_exit
W1208 11:53:03.664770   366 init.cc:205]     @     0x559fb4fe2289 PyThread_exit_thread
W1208 11:53:03.670189   366 init.cc:205]     @     0x559fb4e7447a PyEval_RestoreThread.cold.736
W1208 11:53:03.674211   366 init.cc:205]     @     0x7f13427e65b9 pybind11::gil_scoped_release::~gil_scoped_release()
W1208 11:53:03.675829   366 init.cc:205]     @     0x7f134279af23 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL22pybind11_init_core_avxERNS_6moduleEEUlRNS2_9operators6reader22LoDTensorBlockingQueueERKSt6vectorINS2_9framework9LoDTensorESaISC_EEE58_bIS9_SG_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESY_
W1208 11:53:03.678143   366 init.cc:205]     @     0x7f13427fa6e6 pybind11::cpp_function::dispatcher()
W1208 11:53:03.705709   366 init.cc:205]     @     0x559fb4f23fd4 _PyCFunction_FastCallDict
W1208 11:53:03.722385   366 init.cc:205]     @     0x559fb4fb1d3e call_function
W1208 11:53:03.749676   366 init.cc:205]     @     0x559fb4fd619a _PyEval_EvalFrameDefault
W1208 11:53:03.776196   366 init.cc:205]     @     0x559fb4fac8c8 PyEval_EvalCodeEx
W1208 11:53:03.791987   366 init.cc:205]     @     0x559fb4fad456 function_call
W1208 11:53:03.819322   366 init.cc:205]     @     0x559fb4f23dde PyObject_Call
W1208 11:53:03.846752   366 init.cc:205]     @     0x559fb4fd7994 _PyEval_EvalFrameDefault
W1208 11:53:03.861977   366 init.cc:205]     @     0x559fb4fab7db fast_function
W1208 11:53:03.877820   366 init.cc:205]     @     0x559fb4fb1cc5 call_function
W1208 11:53:03.905719   366 init.cc:205]     @     0x559fb4fd619a _PyEval_EvalFrameDefault
W1208 11:53:03.921274   366 init.cc:205]     @     0x559fb4fab7db fast_function
W1208 11:53:03.937258   366 init.cc:205]     @     0x559fb4fb1cc5 call_function
W1208 11:53:03.964187   366 init.cc:205]     @     0x559fb4fd619a _PyEval_EvalFrameDefault
W1208 11:53:03.989130   366 init.cc:205]     @     0x559fb4fabe4b _PyFunction_FastCallDict
W1208 11:53:04.013437   366 init.cc:205]     @     0x559fb4f2439f _PyObject_FastCallDict
W1208 11:53:04.037919   366 init.cc:205]     @     0x559fb4f28ff3 _PyObject_Call_Prepend

@qingqing01
Collaborator

qingqing01 commented Dec 9, 2019

@dragon515

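I updated the issue title and the formatting of its content :)

From your log, it looks like you are training on your own data, using the VOC data format.
It also looks like you resumed training from "Loading checkpoint from output/faster_rcnn_dcn_x101_vd_64x4d_fpn_1x/27000..." and are using 2 GPUs. The learning rate in the yml defaults to the 8-GPU setting; if you change the number of GPUs, please adjust the learning rate accordingly and try again. See https://github.com/PaddlePaddle/PaddleDetection/blob/release/0.1/docs/GETTING_STARTED_cn.md#faq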
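
The learning-rate adjustment suggested above is commonly done with the linear scaling rule, which keeps the learning rate proportional to the total batch size (number of GPUs times images per GPU). The following is a minimal sketch under that assumption. The helper name `scale_lr` and the numbers are illustrative only and are not taken from the yml used in this issue.

```python
def scale_lr(base_lr, base_gpus, new_gpus, base_bs_per_gpu=1, new_bs_per_gpu=1):
    """Linear scaling rule: keep the LR proportional to the total batch size."""
    if min(base_gpus, new_gpus, base_bs_per_gpu, new_bs_per_gpu) <= 0:
        raise ValueError("GPU counts and batch sizes must be positive")
    base_total = base_gpus * base_bs_per_gpu  # total batch size the config was tuned for
    new_total = new_gpus * new_bs_per_gpu     # total batch size actually used
    return base_lr * new_total / base_total


# Hypothetical values: a config tuned for 8 GPUs, run on 2 GPUs instead.
new_lr = scale_lr(base_lr=0.02, base_gpus=8, new_gpus=2)
print(new_lr)
```

With fewer GPUs the scaled learning rate is smaller, which lowers the risk of the loss diverging early in training.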

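@qingqing01 changed the title from "Different servers run the same program; one machine trains stably for days, while the other fails with an error after less than 3 hours of training. PaddleCheckError: Expected index.dims()[0] > 0, but received index.dims()[0]:0 <= 0:0. The index of gather_op should not be empty when the index's rank is 1. at [/paddle/paddle/fluid/operators/gather.cu.h:82]" to "Different servers run the same program; one machine trains stably for days, while the other fails with an error after less than 3 hours of training" on Dec 9, 2019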
@dragon515
Author

@qingqing01
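Hello, we are training on our own data in COCO format, resuming from a checkpoint, on two GPUs.

We have already adjusted the learning rate and other parameters following the official tutorial. If it is convenient, I can give you an account on the online server so that you can log in and take a look. The same training job runs without any problem on a local server.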

@TheLostIn

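> @dragon515
>
> I updated the issue title and the formatting of its content :)
>
> From your log, it looks like you are training on your own data, using the VOC data format.
> It also looks like you resumed training from "Loading checkpoint from output/faster_rcnn_dcn_x101_vd_64x4d_fpn_1x/27000..." and are using 2 GPUs. The learning rate in the yml defaults to the 8-GPU setting; if you change the number of GPUs, please adjust the learning rate accordingly and try again. See https://github.com/PaddlePaddle/PaddleDetection/blob/release/0.1/docs/GETTING_STARTED_cn.md#faq

Thanks. After I changed the learning rate and the points at which it changes, training worked.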
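
The comment above reports that changing the learning rate together with its change points resolved the error. Those change points are presumably the iteration milestones at which the learning rate decays. A common convention, assumed here, is to stretch the schedule by the same factor by which the total batch size shrinks, so that the model still sees the same number of images. The helper names `scale_schedule` and `piecewise_lr` and all numbers below are hypothetical.

```python
import bisect


def scale_schedule(max_iters, milestones, base_gpus, new_gpus):
    """Stretch iteration counts so the number of images seen stays the same."""
    if base_gpus <= 0 or new_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    factor = base_gpus / new_gpus
    scaled_milestones = [int(round(m * factor)) for m in milestones]
    return int(round(max_iters * factor)), scaled_milestones


def piecewise_lr(step, base_lr, milestones, gamma=0.1):
    """Piecewise-constant decay: multiply the LR by gamma at each milestone."""
    # Number of milestones already reached (a milestone counts from its own step).
    passed = bisect.bisect_right(milestones, step)
    return base_lr * gamma ** passed


# Hypothetical values: a schedule written for 8 GPUs, run on 2 GPUs.
max_iters, milestones = scale_schedule(90000, [60000, 80000], base_gpus=8, new_gpus=2)
print(max_iters, milestones)
print(piecewise_lr(250000, base_lr=0.005, milestones=milestones))
```

When resuming from a checkpoint, the saved iteration number should be interpreted against the rescaled schedule, not the original one.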

@qingqing01
Collaborator

@dragon515 @TheLostIn Closing this issue because it has been inactive for a long time. If you still have problems, feel free to open a new issue. :)
