
I got a problem when I use the KITTI dataset to train the D2Det-mmdet2.1 model #4700

Closed
Machine97 opened this issue Mar 2, 2021 · 9 comments

@Machine97

I tried to train the D2Det-mmdet2.1 model on the KITTI dataset, and the following error occurs every time:

2021-03-01 09:19:43,961 - mmdet - INFO - Epoch [1][440/1856] lr: 5.067e-06, eta: 8:32:39, time: 0.400, data_time: 0.100, memory: 4264, loss_rpn_cls: 0.3128, loss_rpn_bbox: 0.2060, loss_cls: 0.2960, acc: 96.9043, loss_reg: 0.2682, loss_mask: 0.6795, loss: 1.7624
2021-03-01 09:19:47,923 - mmdet - INFO - Epoch [1][450/1856] lr: 5.177e-06, eta: 8:32:01, time: 0.396, data_time: 0.095, memory: 4264, loss_rpn_cls: 0.3051, loss_rpn_bbox: 0.1585, loss_cls: 0.2783, acc: 96.7188, loss_reg: 0.2841, loss_mask: 0.6796, loss: 1.7056
2021-03-01 09:19:51,800 - mmdet - INFO - Epoch [1][460/1856] lr: 5.286e-06, eta: 8:31:11, time: 0.388, data_time: 0.087, memory: 4264, loss_rpn_cls: 0.2838, loss_rpn_bbox: 0.1638, loss_cls: 0.2632, acc: 96.6406, loss_reg: 0.2799, loss_mask: 0.6799, loss: 1.6707
Traceback (most recent call last):
File "train.py", line 161, in
main()
File "train.py", line 157, in main
meta=meta)
File "/work_dirs/D2Det_mmdet2.1/mmdet/apis/train.py", line 179, in train_detector
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 43, in train
self.call_hook('after_train_iter')
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 282, in call_hook
getattr(hook, fn_name)(self)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 21, in after_train_iter
runner.outputs['loss'].backward()
File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: shape mismatch: value tensor of shape [8, 256, 7, 7] cannot be broadcast to indexing result of shape [9, 256, 7, 7] (make_index_put_iterator at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/ATen/native/TensorAdvancedIndexing.cpp:215)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f43abb90b5e in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::native::index_put_impl(at::Tensor&, c10::ArrayRef<at::Tensor>, at::Tensor const&, bool, bool) + 0x712 (0x7f43d38d0b82 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0xee23de (0x7f43d3c543de in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::native::index_put_(at::Tensor&, c10::ArrayRef<at::Tensor>, at::Tensor const&, bool) + 0x135 (0x7f43d38c0255 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: + 0xee210e (0x7f43d3c5410e in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: + 0x288fa88 (0x7f43d5601a88 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::generated::IndexPutBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x251 (0x7f43d53cc201 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0x2ae8215 (0x7f43d585a215 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f43d5857513 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f43d58582f2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f43d5850969 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f43d8b97558 in
/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0xc819d (0x7f43db5ff19d in /opt/conda/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #13: + 0x76db (0x7f43fbfdf6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #14: clone + 0x3f (0x7f43fbd0888f in /lib/x86_64-linux-gnu/libc.so.6)

This error occurs randomly at different iterations. In addition, every time the error occurred, the first dimension of the tensor [8, 256, 7, 7] was different.
Do you know the possible reasons for this error?
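
For reference, this message comes from PyTorch's advanced-indexing assignment (index_put), which is what the IndexPutBackward frames above refer to. A minimal sketch of what the message means (illustrative only, not D2Det code; the sizes are made up):

import torch

# Hypothetical illustration only, not code from D2Det: the mask selects 9
# RoI feature maps, but only 8 replacement values are supplied, so the
# assignment cannot broadcast [8, 256, 7, 7] to the indexing result of
# shape [9, 256, 7, 7].
feats = torch.zeros(16, 256, 7, 7)
mask = torch.zeros(16, dtype=torch.bool)
mask[:9] = True                      # 9 positions selected
values = torch.randn(8, 256, 7, 7)   # only 8 values provided
feats[mask] = values                 # RuntimeError: shape mismatch ...

So somewhere the number of RoI feature maps selected by an index and the number of values written into them disagree; the traceback only shows where the gradient of such an assignment is computed.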

@hhaAndroid
Collaborator

Sorry, I also don't know why 'shape mismatch: value tensor of shape [8, 256, 7, 7] cannot be broadcast to indexing result of shape [9, 256, 7, 7]' appears. Can you provide more information using the issue template?

@Machine97
Author

I think this problem appears during back propagation. Every time the error is reported, the last three dimensions of the tensor are always 256, 7 and 7, but the first dimension (8 here) changes. I don't know whether that number represents the number of RoI feature maps.

@Machine97
Author

> Sorry, I also don't know why 'shape mismatch: value tensor of shape [8, 256, 7, 7] cannot be broadcast to indexing result of shape [9, 256, 7, 7]' appears. Can you provide more information using the issue template?

Thank you for your reply. I think this problem appears during back propagation. Every time the error is reported, the last three dimensions of the tensor are always 256, 7 and 7, but the first dimension (8 here) changes. I don't know whether that number represents the number of RoI feature maps.

@hhaAndroid
Collaborator

> Sorry, I also don't know why 'shape mismatch: value tensor of shape [8, 256, 7, 7] cannot be broadcast to indexing result of shape [9, 256, 7, 7]' appears. Can you provide more information using the issue template?
>
> Thank you for your reply. I think this problem appears during back propagation. Every time the error is reported, the last three dimensions of the tensor are always 256, 7 and 7, but the first dimension (8 here) changes. I don't know whether that number represents the number of RoI feature maps.

I think so.

@Machine97
Author

> Sorry, I also don't know why 'shape mismatch: value tensor of shape [8, 256, 7, 7] cannot be broadcast to indexing result of shape [9, 256, 7, 7]' appears. Can you provide more information using the issue template?
>
> Thank you for your reply. I think this problem appears during back propagation. Every time the error is reported, the last three dimensions of the tensor are always 256, 7 and 7, but the first dimension (8 here) changes. I don't know whether that number represents the number of RoI feature maps.
>
> I think so.

Could you please give me some suggestions to solve this problem?

@hhaAndroid
Collaborator

Can you provide more information using the error report template? @Machine97

@Machine97
Author

@hhaAndroid What I've shown is the whole error report. The environment information and config are as follows:

/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py:64: UserWarning: The old API of register_module(module, force=False) is deprecated and will be removed, please use the new API register_module(name=None, force=False, module=None) instead.
'The old API of register_module(module, force=False) '
2021-03-01 09:16:33,083 - mmdet - INFO - Environment info:

sys.platform: linux
Python: 3.7.7 (default, Mar 23 2020, 22:36:06) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GPU 0,1,2,3: GeForce RTX 2080 Ti
GCC: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
PyTorch: 1.5.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.6.0a0+82fd1c8
OpenCV: 4.5.1
MMCV: 0.6.1
MMDetection: 2.1.0+unknown
MMDetection Compiler: GCC 7.4
MMDetection CUDA Compiler: 10.1

2021-03-01 09:16:33,083 - mmdet - INFO - Distributed training: False
2021-03-01 09:16:33,867 - mmdet - INFO - Config:
dataset_type = 'KittiDataset'
data_root = '/dataset/kitti/'
class_names = ['Pedestrian', 'Cyclist', 'Car']
img_norm_cfg = dict(
    mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=True)
input_modality = dict(use_lidar=True, use_camera=True)
voxel_size = [0.1, 0.1, 0.1]
point_cloud_range = [0, -40, -3, 70, 40, 1]
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='Resize',
        img_scale=(1242, 375),
        multiscale_mode='value',
        keep_ratio=False),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[103.53, 116.28, 123.675],
        std=[1.0, 1.0, 1.0],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='CollectKitti', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4),
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1242, 375),
        flip=False,
        transforms=[
            dict(
                type='Resize',
                img_scale=(1242, 375),
                multiscale_mode='value',
                keep_ratio=False),
            dict(type='RandomFlip'),
            dict(type='PointsRange2DFilter', target_size=[384, 1248]),
            dict(
                type='Normalize',
                mean=[103.53, 116.28, 123.675],
                std=[1.0, 1.0, 1.0],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='PointsToTensor', keys=['points']),
            dict(type='CollectKitti', keys=['points', 'img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=0,
    train=dict(
        type='KittiDataset',
        data_root='/dataset/kitti/',
        ann_file='/dataset/kitti/kitti_infos_train.pkl',
        split='training',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(
                type='Resize',
                img_scale=(1242, 375),
                multiscale_mode='value',
                keep_ratio=False),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[103.53, 116.28, 123.675],
                std=[1.0, 1.0, 1.0],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='CollectKitti', keys=['img', 'gt_bboxes', 'gt_labels'])
        ],
        modality=dict(use_lidar=True, use_camera=True),
        classes=['Pedestrian', 'Cyclist', 'Car'],
        test_mode=False),
    val=dict(
        type='KittiDataset',
        data_root='/dataset/kitti/',
        ann_file='/dataset/kitti/kitti_infos_val.pkl',
        split='training',
        pipeline=[
            dict(
                type='LoadPointsFromFile',
                coord_type='LIDAR',
                load_dim=4,
                use_dim=4),
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1242, 375),
                flip=False,
                transforms=[
                    dict(
                        type='Resize',
                        img_scale=(1242, 375),
                        multiscale_mode='value',
                        keep_ratio=False),
                    dict(type='RandomFlip'),
                    dict(type='PointsRange2DFilter', target_size=[384, 1248]),
                    dict(
                        type='Normalize',
                        mean=[103.53, 116.28, 123.675],
                        std=[1.0, 1.0, 1.0],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='PointsToTensor', keys=['points']),
                    dict(type='CollectKitti', keys=['points', 'img'])
                ])
        ],
        modality=dict(use_lidar=True, use_camera=True),
        classes=['Pedestrian', 'Cyclist', 'Car'],
        test_mode=True),
    test=dict(
        type='KittiDataset',
        data_root='/dataset/kitti/',
        ann_file='/dataset/kitti/kitti_infos_val.pkl',
        split='training',
        pipeline=[
            dict(
                type='LoadPointsFromFile',
                coord_type='LIDAR',
                load_dim=4,
                use_dim=4),
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1242, 375),
                flip=False,
                transforms=[
                    dict(
                        type='Resize',
                        img_scale=(1242, 375),
                        multiscale_mode='value',
                        keep_ratio=False),
                    dict(type='RandomFlip'),
                    dict(type='PointsRange2DFilter', target_size=[384, 1248]),
                    dict(
                        type='Normalize',
                        mean=[103.53, 116.28, 123.675],
                        std=[1.0, 1.0, 1.0],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='PointsToTensor', keys=['points']),
                    dict(type='CollectKitti', keys=['points', 'img'])
                ])
        ],
        modality=dict(use_lidar=True, use_camera=True),
        classes=['Pedestrian', 'Cyclist', 'Car'],
        test_mode=True))
model = dict(
    type='D2Det',
    pretrained='torchvision://resnet50',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(
            type='SmoothL1Loss', beta=0.1111111111111111, loss_weight=1.0)),
    roi_head=dict(
        type='D2DetRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(
                type='DeformRoIPoolingPack',
                out_size=7,
                sample_per_part=2,
                out_channels=256,
                no_trans=False,
                group_size=1,
                trans_std=0.1),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            with_reg=False,
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=80,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=2.0)),
        reg_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', out_size=14, sample_num=2),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        d2det_head=dict(
            type='D2DetHead',
            num_convs=1,
            in_channels=256,
            num_classes=80,
            norm_cfg=dict(type='GN', num_groups=36),
            MASK_ON=False)))
train_cfg = dict(
    rpn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.7,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=256,
            pos_fraction=0.5,
            neg_pos_ub=-1,
            add_gt_as_proposals=False),
        allowed_border=0,
        pos_weight=-1,
        debug=False),
    rpn_proposal=dict(
        nms_across_levels=False,
        nms_pre=2000,
        nms_post=2000,
        max_num=2000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.5,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=512,
            pos_fraction=0.25,
            neg_pos_ub=-1,
            add_gt_as_proposals=True),
        pos_radius=1,
        pos_weight=-1,
        max_num_reg=192,
        mask_size=28,
        debug=False))
test_cfg = dict(
    rpn=dict(
        nms_across_levels=False,
        nms_pre=1000,
        nms_post=1000,
        max_num=1000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.03, nms=dict(type='nms', iou_thr=0.5), max_per_img=125))
optimizer = dict(type='SGD', lr=2e-05, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=1800,
    warmup_ratio=0.0125,
    step=[20, 23])
checkpoint_config = dict(interval=1)
log_config = dict(interval=10, hooks=[dict(type='TextLoggerHook')])
evaluation = dict(interval=1)
total_epochs = 40
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = '/work_dirs/D2Det_mmdet2.1/work_dir/'
load_from = None
resume_from = None
workflow = [('train', 1)]
gpu_ids = [0]

2021-03-01 09:16:34,738 - mmdet - INFO - load model from: torchvision://resnet50
2021-03-01 09:16:35,128 - mmdet - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py:168: UserWarning: Runner was deprecated, please use EpochBasedRunner instead
'Runner was deprecated, please use EpochBasedRunner instead')
/opt/conda/lib/python3.7/site-packages/mmcv/runner/base_runner.py:59: UserWarning: batch_processor is deprecated, please implement train_step() and val_step() in the model instead.
warnings.warn('batch_processor is deprecated, please implement '
2021-03-01 09:16:40,538 - mmdet - INFO - Start running, host: root@6176961423d3, work_dir: /work_dirs/D2Det_mmdet2.1/work_dir
2021-03-01 09:16:40,538 - mmdet - INFO - workflow: [('train', 1)], max: 40 epochs
2021-03-01 09:16:47,118 - mmdet - INFO - Epoch [1][10/1856] lr: 3.487e-07, eta: 13:30:46, time: 0.655, data_time: 0.278, memory: 4260, loss_rpn_cls: 1.1617, loss_rpn_bbox: 0.8447, loss_cls: 48.0927, acc: 42.9688, loss_reg: 1.0057, loss_mask: 0.6919, loss: 51.7967
2021-03-01 09:16:51,577 - mmdet - INFO - Epoch [1][20/1856] lr: 4.585e-07, eta: 11:21:03, time: 0.446, data_time: 0.133, memory: 4262, loss_rpn_cls: 0.9803, loss_rpn_bbox: 0.9297, loss_cls: 1.3802, acc: 98.9551, loss_reg: 1.0286, loss_mask: 0.6909, loss: 5.0097
2021-03-01 09:16:55,695 - mmdet - INFO - Epoch [1][30/1856] lr: 5.682e-07, eta: 10:23:49, time: 0.412, data_time: 0.100, memory: 4262, loss_rpn_cls: 0.7760, loss_rpn_bbox: 0.6778, loss_cls: 1.3409, acc: 99.1309, loss_reg: 1.0046, loss_mask: 0.6911, loss: 4.4905
...

The config for testing is not used, so you don't have to pay attention to it. Thank you very much!

@Machine97
Author

@hhaAndroid When the above error occurs, the batch size is 2. When I set the batch size to 1, the following error also occurs in addition to the one above:

Traceback (most recent call last):
File "/work_dirs/D2Det_mmdet2.1/tools/train.py", line 161, in
main()
File "/work_dirs/D2Det_mmdet2.1/tools/train.py", line 157, in main
meta=meta)
File "/work_dirs/D2Det_mmdet2.1/mmdet/apis/train.py", line 179, in train_detector
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 43, in train
self.call_hook('after_train_iter')
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 282, in call_hook
getattr(hook, fn_name)(self)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 21, in after_train_iter
runner.outputs['loss'].backward()
File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function IndexPutBackward returned an invalid gradient at index 1 - got [353, 256, 7, 7] but expected shape compatible with [355, 256, 7, 7] (validate_outputs at /opt/conda/conda-bld/pytorch_1587428398394/work/torch/csrc/autograd/engine.cpp:472)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fc24d54fb5e in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x2ae3134 (0x7fc277214134 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x548 (0x7fc277215368 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fc2772172f2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fc27720f969 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fc27a556558 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0xc819d (0x7fc28fae019d in /opt/conda/bin/../lib/libstdc++.so.6)
frame #7: + 0x76db (0x7fc29e1746db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x3f (0x7fc29de9d88f in /lib/x86_64-linux-gnu/libc.so.6)

Process finished with exit code 1

@Machine97
Author

@hhaAndroid The problem has been solved. The reason is that during preprocessing of the KITTI dataset, classes other than Car, Pedestrian, Cyclist and DontCare are marked with -1. In mmdetection 2.1 there is no error reminder such as "assertion `cur_target >= 0 && cur_target < n_classes` failed", so the invalid labels go unnoticed until training fails.
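
In case it helps anyone else, a small sanity check at the point where the KITTI annotations are converted to gt_bboxes/gt_labels can catch this early. This is only a sketch of the idea, not code from D2Det or mmdetection; filter_kitti_annotations and its arguments are hypothetical names for wherever that conversion happens in the dataset class:

import numpy as np

VALID_CLASSES = ['Pedestrian', 'Cyclist', 'Car']

def filter_kitti_annotations(gt_bboxes, gt_labels):
    """Drop boxes whose label is -1, i.e. classes outside VALID_CLASSES.

    gt_bboxes: (N, 4) float array; gt_labels: (N,) int array in which
    classes other than Pedestrian/Cyclist/Car were mapped to -1 during
    preprocessing (hypothetical setup mirroring the situation above).
    """
    keep = gt_labels >= 0
    gt_bboxes = gt_bboxes[keep]
    gt_labels = gt_labels[keep]
    # Fail fast here instead of hitting a shape-mismatch error in backward().
    assert (gt_labels >= 0).all() and (gt_labels < len(VALID_CLASSES)).all()
    return gt_bboxes, gt_labels

# Example: two valid boxes and one box labelled -1 that gets dropped.
boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 20], [1, 1, 3, 3]], dtype=np.float32)
labels = np.array([2, 0, -1])
boxes, labels = filter_kitti_annotations(boxes, labels)

Filtering out (or asserting on) the -1 labels before they reach the sampler avoids the confusing failure inside loss.backward().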
