HTC: It broke down in this piece #6533

Closed
WandernForte opened this issue Nov 17, 2021 · 5 comments

@WandernForte

[Screenshot 1: web capture 17-11-2021 235738, www.cloudam.cn]
It broke down starting from here.
[Screenshot 2: web capture 17-11-2021 19309, www.cloudam.cn]
[Screenshot 3: web capture 17-11-2021 165827, www.cloudam.cn]

Here are some console logs. They are all different, and I don't know what I should do...

HERE IS THE CHINESE VERSION (translated):
It stopped running from here; I'm not sure at which exact step (see Screenshot 1).
There are many different error messages, and as a beginner I have no idea how to handle them...

@hhaAndroid
Collaborator

@WandernForte The above shows an OOM (out of memory); you need to reduce the image size.
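
For illustration only (not part of the maintainer's reply): a minimal sketch of how the image size and per-GPU batch size in the config posted further down could be reduced. The `_base_` filename and the smaller scale values are placeholder assumptions, not values suggested in this thread.

```python
# Hypothetical OOM-mitigation sketch for the config posted below.
# Assumes that config is saved as htc_sartorius.py (placeholder name).
_base_ = './htc_sartorius.py'

# Roughly 25% smaller than the original (440, 596) .. (620, 839) scale range.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(
        type='Resize',
        img_scale=[(330, 447), (440, 596)],
        multiscale_mode='value',
        keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_masks', 'gt_labels'])
]

data = dict(
    samples_per_gpu=1,  # down from 2
    train=dict(pipeline=train_pipeline))
```

The main levers for GPU memory here are `img_scale` and `samples_per_gpu`.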

@WandernForte
Author

> @WandernForte The above shows an OOM (out of memory); you need to reduce the image size.

But it doesn't work. It even leads to a c10 error...

@WandernForte
Author

```
Config:
dataset_type = 'CocoDataset'
data_root = '/home/cloudam/LZF/working'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
dict(
type='Resize',
img_scale=[(440, 596), (480, 650), (520, 704), (580, 785), (620, 839)],
multiscale_mode='value',
keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(p=0.5, max_h_size=64, type='Cutout'),
dict(
type='Albu',
transforms=[
dict(
type='ShiftScaleRotate',
shift_limit=0.0625,
scale_limit=0.15,
rotate_limit=15,
p=0.4),
dict(
type='RandomBrightnessContrast',
brightness_limit=0.2,
contrast_limit=0.2,
p=0.5),
dict(
type='OneOf',
transforms=[
dict(type='GaussianBlur', p=1.0, blur_limit=7),
dict(type='MedianBlur', p=1.0, blur_limit=7)
],
p=0.4)
],
bbox_params=dict(
type='BboxParams',
format='coco',
label_fields=['gt_labels'],
min_visibility=0.0,
filter_lost_elements=True),
keymap=dict(img='image', gt_bboxes='bboxes', gt_masks='masks'),
update_pad_shape=False,
skip_img_without_anno=True),
dict(
type='Normalize',
mean=[128, 128, 128],
std=[11.58, 11.58, 11.58],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_masks', 'gt_labels'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='LoadAnnotations', with_bbox=True, with_mask=True,
with_seg=False),
dict(
type='Resize',
img_scale=[(440, 596), (480, 650), (520, 704), (580, 785), (620, 839)],
multiscale_mode='value',
keep_ratio=True),
dict(type='Instaboost'),
dict(
type='Normalize',
mean=[128, 128, 128],
std=[11.58, 11.58, 11.58],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img'])
]
data = dict(
samples_per_gpu=2,
workers_per_gpu=2,
train=dict(
type='CocoDataset',
ann_file='../input/sartorius-coco-dataset-notebook/train.json',
img_prefix='',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='LoadAnnotations',
with_bbox=True,
with_mask=True,
with_seg=True),
dict(type='Resize', img_scale=(704, 520), keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='SegRescale', scale_factor=0.125),
dict(type='DefaultFormatBundle'),
dict(
type='Collect',
keys=[
'img', 'gt_bboxes', 'gt_labels', 'gt_masks',
'gt_semantic_seg'
])
],
seg_prefix='/home/cloudam/LZF/working/',
data_root='/home/cloudam/LZF/working',
classes='labels.txt'),
val=dict(
type='CocoDataset',
ann_file='../input/sartorius-coco-dataset-notebook/val.json',
img_prefix='',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
],
data_root='/home/cloudam/LZF/working',
classes='labels.txt'),
test=dict(
type='CocoDataset',
ann_file='../input/sartorius-coco-dataset-notebook/val.json',
img_prefix='',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
],
classes='labels.txt',
data_root='/home/cloudam/LZF/working'))
evaluation = dict(metric='segm', interval=1)
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
policy='CosineAnnealing',
by_epoch=False,
warmup='linear',
warmup_iters=125,
warmup_ratio=0.001,
min_lr=1e-07)
runner = dict(type='EpochBasedRunner', max_epochs=18)
checkpoint_config = dict(interval=1)
log_config = dict(interval=100, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = './mmdetection/checkpoints/htc_x101_32x4d_fpn_16x1_20e_coco_20200318-de97ae01.pth'
resume_from = None
workflow = [('train', 1)]
model = dict(
type='HybridTaskCascade',
backbone=dict(
type='ResNeXt',
depth=101,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=True),
norm_eval=True,
style='pytorch',
init_cfg=dict(
type='Pretrained', checkpoint='open-mmlab://resnext101_32x4d'),
groups=32,
base_width=4),
neck=dict(
type='FPN',
in_channels=[256, 512, 1024, 2048],
out_channels=256,
num_outs=5),
rpn_head=dict(
type='RPNHead',
in_channels=256,
feat_channels=256,
anchor_generator=dict(
type='AnchorGenerator',
scales=[8],
ratios=[0.5, 1.0, 2.0],
strides=[4, 8, 16, 32, 64]),
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[0.0, 0.0, 0.0, 0.0],
target_stds=[1.0, 1.0, 1.0, 1.0]),
loss_cls=dict(
type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
loss_bbox=dict(
type='SmoothL1Loss', beta=0.1111111111111111, loss_weight=1.0)),
roi_head=dict(
type='HybridTaskCascadeRoIHead',
interleaved=True,
mask_info_flow=True,
num_stages=3,
stage_loss_weights=[1, 0.5, 0.25],
bbox_roi_extractor=dict(
type='SingleRoIExtractor',
roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
out_channels=256,
featmap_strides=[4, 8, 16, 32]),
bbox_head=[
dict(
type='Shared2FCBBoxHead',
in_channels=256,
fc_out_channels=1024,
roi_feat_size=7,
num_classes=3,
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[0.0, 0.0, 0.0, 0.0],
target_stds=[0.1, 0.1, 0.2, 0.2]),
reg_class_agnostic=True,
loss_cls=dict(
type='CrossEntropyLoss',
use_sigmoid=False,
loss_weight=1.0),
loss_bbox=dict(type='SmoothL1Loss', beta=1.0,
loss_weight=1.0)),
dict(
type='Shared2FCBBoxHead',
in_channels=256,
fc_out_channels=1024,
roi_feat_size=7,
num_classes=3,
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[0.0, 0.0, 0.0, 0.0],
target_stds=[0.05, 0.05, 0.1, 0.1]),
reg_class_agnostic=True,
loss_cls=dict(
type='CrossEntropyLoss',
use_sigmoid=False,
loss_weight=1.0),
loss_bbox=dict(type='SmoothL1Loss', beta=1.0,
loss_weight=1.0)),
dict(
type='Shared2FCBBoxHead',
in_channels=256,
fc_out_channels=1024,
roi_feat_size=7,
num_classes=3,
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[0.0, 0.0, 0.0, 0.0],
target_stds=[0.033, 0.033, 0.067, 0.067]),
reg_class_agnostic=True,
loss_cls=dict(
type='CrossEntropyLoss',
use_sigmoid=False,
loss_weight=1.0),
loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0))
],
mask_roi_extractor=dict(
type='SingleRoIExtractor',
roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
out_channels=256,
featmap_strides=[4, 8, 16, 32]),
mask_head=[
dict(
type='HTCMaskHead',
with_conv_res=False,
num_convs=4,
in_channels=256,
conv_out_channels=256,
num_classes=3,
loss_mask=dict(
type='CrossEntropyLoss', use_mask=True, loss_weight=1.0)),
dict(
type='HTCMaskHead',
num_convs=4,
in_channels=256,
conv_out_channels=256,
num_classes=3,
loss_mask=dict(
type='CrossEntropyLoss', use_mask=True, loss_weight=1.0)),
dict(
type='HTCMaskHead',
num_convs=4,
in_channels=256,
conv_out_channels=256,
num_classes=3,
loss_mask=dict(
type='CrossEntropyLoss', use_mask=True, loss_weight=1.0))
],
semantic_roi_extractor=dict(
type='SingleRoIExtractor',
roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
out_channels=256,
featmap_strides=[8]),
semantic_head=dict(
type='FusedSemanticHead',
num_ins=5,
fusion_level=1,
num_convs=4,
in_channels=256,
conv_out_channels=256,
num_classes=3,
loss_seg=dict(
type='CrossEntropyLoss', ignore_index=255, loss_weight=0.2))),
train_cfg=dict(
rpn=dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.7,
neg_iou_thr=0.3,
min_pos_iou=0.3,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',
num=256,
pos_fraction=0.5,
neg_pos_ub=-1,
add_gt_as_proposals=False),
allowed_border=0,
pos_weight=-1,
debug=False),
rpn_proposal=dict(
nms_pre=2000,
max_per_img=2000,
nms=dict(type='nms', iou_threshold=0.7),
min_bbox_size=0),
rcnn=[
dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.5,
neg_iou_thr=0.5,
min_pos_iou=0.5,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',
num=512,
pos_fraction=0.25,
neg_pos_ub=-1,
add_gt_as_proposals=True),
mask_size=28,
pos_weight=-1,
debug=False),
dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.6,
neg_iou_thr=0.6,
min_pos_iou=0.6,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',
num=512,
pos_fraction=0.25,
neg_pos_ub=-1,
add_gt_as_proposals=True),
mask_size=28,
pos_weight=-1,
debug=False),
dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.7,
neg_iou_thr=0.7,
min_pos_iou=0.7,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',
num=512,
pos_fraction=0.25,
neg_pos_ub=-1,
add_gt_as_proposals=True),
mask_size=28,
pos_weight=-1,
debug=False)
]),
test_cfg=dict(
rpn=dict(
nms_pre=1000,
max_per_img=1000,
nms=dict(type='nms', iou_threshold=0.7),
min_bbox_size=0),
rcnn=dict(
score_thr=0.001,
nms=dict(type='nms', iou_threshold=0.5),
max_per_img=100,
mask_thr_binary=0.5)))
classes = '/home/cloudam/LZF/working/labels.txt'
work_dir = '/home/cloudam/LZF/working/model_output'
seed = 0
gpu_ids = range(0, 1)
fp16 = dict(loss_scale=512.0)

loading annotations into memory...
Done (t=1.73s)
creating index...
index created!
loading annotations into memory...
Done (t=0.09s)
creating index...
index created!
2021-11-20 13:15:49,470 - mmdet - INFO - load checkpoint from local path: ./mmdetection/checkpoints/htc_x101_32x4d_fpn_16x1_20e_coco_20200318-de97ae01.pth
2021-11-20 13:15:52,309 - mmdet - WARNING - The model and loaded state dict do not match exactly

size mismatch for roi_head.bbox_head.0.fc_cls.weight: copying a param with shape torch.Size([81, 1024]) from checkpoint, the shape in current model is torch.Size([4, 1024]).
size mismatch for roi_head.bbox_head.0.fc_cls.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([4]).
size mismatch for roi_head.bbox_head.1.fc_cls.weight: copying a param with shape torch.Size([81, 1024]) from checkpoint, the shape in current model is torch.Size([4, 1024]).
size mismatch for roi_head.bbox_head.1.fc_cls.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([4]).
size mismatch for roi_head.bbox_head.2.fc_cls.weight: copying a param with shape torch.Size([81, 1024]) from checkpoint, the shape in current model is torch.Size([4, 1024]).
size mismatch for roi_head.bbox_head.2.fc_cls.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([4]).
size mismatch for roi_head.mask_head.0.conv_logits.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 256, 1, 1]).
size mismatch for roi_head.mask_head.0.conv_logits.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([3]).
size mismatch for roi_head.mask_head.1.conv_logits.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 256, 1, 1]).
size mismatch for roi_head.mask_head.1.conv_logits.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([3]).
size mismatch for roi_head.mask_head.2.conv_logits.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 256, 1, 1]).
size mismatch for roi_head.mask_head.2.conv_logits.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([3]).
size mismatch for roi_head.semantic_head.conv_logits.weight: copying a param with shape torch.Size([183, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 256, 1, 1]).
size mismatch for roi_head.semantic_head.conv_logits.bias: copying a param with shape torch.Size([183]) from checkpoint, the shape in current model is torch.Size([3]).
2021-11-20 13:15:52,315 - mmdet - INFO - Start running, host: cloudam@master, work_dir: /home/cloudam/LZF/working/model_output
2021-11-20 13:15:52,316 - mmdet - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(ABOVE_NORMAL) Fp16OptimizerHook
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook

before_train_epoch:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook

before_train_iter:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(LOW ) IterTimerHook
(LOW ) EvalHook

after_train_iter:
(ABOVE_NORMAL) Fp16OptimizerHook
(NORMAL ) CheckpointHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook

after_train_epoch:
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook

before_val_epoch:
(NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook

before_val_iter:
(LOW ) IterTimerHook

after_val_iter:
(LOW ) IterTimerHook

after_val_epoch:
(VERY_LOW ) TextLoggerHook

after_run:
(VERY_LOW ) TextLoggerHook

2021-11-20 13:15:52,316 - mmdet - INFO - workflow: [('train', 1)], max: 18 epochs
2021-11-20 13:15:52,318 - mmdet - INFO - Checkpoints will be saved to /home/cloudam/LZF/working/model_output by HardDiskBackend.
Traceback (most recent call last):
File "train.py", line 323, in
train_detector(model, datasets, cfg, distributed=False, validate=True, meta=meta)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmdet/apis/train.py", line 177, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 128, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.run_iter(data_batch = data_batch, train_mode=True, **kwargs)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 238, in train_step
losses = self(**data)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmdet/models/detectors/two_stage.py", line 150, in forward_train
**kwargs)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmdet/models/roi_heads/htc_roi_head.py", line 270, in forward_train
gt_labels[j])
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmdet/core/bbox/assigners/max_iou_assigner.py", line 106, in assign
overlaps = self.iou_calculator(gt_bboxes, bboxes)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmdet/core/bbox/iou_calculators/iou2d_calculator.py", line 66, in call
return bbox_overlaps(bboxes1, bboxes2, mode, is_aligned)
File "/home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/mmdet/core/bbox/iou_calculators/iou2d_calculator.py", line 251, in bbox_overlaps
eps = union.new_tensor([eps])
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1607370128159/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2ae39d9168b2 in /home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x2ae39d6ae982 in /home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x2ae39d901b7d in /home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: + 0x5fea0a (0x2ae35bf6fa0a in /home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x5feab6 (0x2ae35bf6fab6 in /home/cloudam/anaconda3/envs/pytorch_gpu/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #22: __libc_start_main + 0xf5 (0x2ae339be8555 in /lib64/libc.so.6)

Aborted (core dumped)
```
HERE IS THE WHOLE CONSOLE LOG.

@joe1chief

Did you solve it? I faced the same issue.

@wudaxianzi

Faced the same issue:
`eps = union.new_tensor([eps])`
`RuntimeError: CUDA error: an illegal memory access was encountered`
I don't know how to solve it.
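
Not a confirmed fix, but one note for anyone debugging this: CUDA "illegal memory access" errors are reported asynchronously, so the traceback above may point at an unrelated call site (here `bbox_overlaps`) rather than the kernel that actually faulted. A common first step is to rerun with synchronous kernel launches to get a more precise stack trace; a minimal sketch:

```python
# Debugging sketch only: force synchronous CUDA launches so the Python
# traceback points at the kernel that actually raised the error.
# Must be set before the first CUDA call, e.g. at the very top of train.py,
# or exported in the shell as CUDA_LAUNCH_BLOCKING=1 before launching training.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
```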
