
[help wanted] console log freezes while train... #41

Closed
jiawenhao2015 opened this issue Jul 29, 2020 · 5 comments
Labels
question Further information is requested

Comments

@jiawenhao2015

jiawenhao2015 commented Jul 29, 2020

[image: screenshot of the frozen training console]
It has been running for almost 1 hour, and it seems that it isn't training at all.
Has anybody met this case before? Many thanks.


log_level = 'INFO'
load_from = None
resume_from = None
dist_params = dict(backend='nccl')
workflow = [('train', 1)]
checkpoint_config = dict(interval=10)
evaluation = dict(interval=10, metric='mAP')

optimizer = dict(
    type='Adam',
    lr=5e-4,
)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[170, 200])
total_epochs = 210
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])

channel_cfg = dict(
    num_output_channels=17,
    dataset_joints=17,
    dataset_channel=[
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
    ],
    inference_channel=[
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
    ])

# model settings
model = dict(
    type='TopDown',
    pretrained='models/pytorch/imagenet/'
    'mobilenet_v2_batch256_20200708-3b2dc3af.pth',
    backbone=dict(type='MobileNetV2', widen_factor=1., out_indices=(7, )),
    keypoint_head=dict(
        type='TopDownSimpleHead',
        in_channels=1280,
        out_channels=channel_cfg['num_output_channels'],
    ),
    train_cfg=dict(),
    test_cfg=dict(
        flip_test=True,
        post_process=True,
        shift_heatmap=True,
        unbiased_decoding=False,
        modulate_kernel=11),
    loss_pose=dict(type='JointsMSELoss', use_target_weight=True))

data_cfg = dict(
    image_size=[192, 256],
    heatmap_size=[48, 64],
    num_output_channels=channel_cfg['num_output_channels'],
    num_joints=channel_cfg['dataset_joints'],
    dataset_channel=channel_cfg['dataset_channel'],
    inference_channel=channel_cfg['inference_channel'],
    soft_nms=False,
    nms_thr=1.0,
    oks_thr=0.9,
    vis_thr=0.2,
    bbox_thr=1.0,
    use_gt_bbox=False,
    image_thr=0.0,
    bbox_file='data/coco/person_detection_results/'
    'COCO_val2017_detections_AP_H_56_person.json',
)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownRandomFlip', flip_prob=0.5),
    dict(
        type='TopDownHalfBodyTransform',
        num_joints_half_body=8,
        prob_half_body=0.3),
    dict(
        type='TopDownGetRandomScaleRotation', rot_factor=40, scale_factor=0.5),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(type='TopDownGenerateTarget', sigma=2),
    dict(
        type='Collect',
        keys=['img', 'target', 'target_weight'],
        meta_keys=[
            'image_file', 'joints_3d', 'joints_3d_visible', 'center', 'scale',
            'rotation', 'bbox_score', 'flip_pairs'
        ]),
]

valid_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(
        type='Collect',
        keys=[
            'img',
        ],
        meta_keys=[
            'image_file', 'center', 'scale', 'rotation', 'bbox_score',
            'flip_pairs'
        ]),
]

test_pipeline = valid_pipeline

data_root = '/share/jwh/utils/data/zipped/object/detect/coco/coco2017'
data = dict(
    samples_per_gpu=64,
    workers_per_gpu=2,
    train=dict(
        type='TopDownCocoDataset',
        ann_file=f'{data_root}/annotations/person_keypoints_val2017.json',
        img_prefix=f'{data_root}/val2017/',
        data_cfg=data_cfg,
        pipeline=train_pipeline),
    val=dict(
        type='TopDownCocoDataset',
        ann_file=f'{data_root}/annotations/person_keypoints_val2017.json',
        img_prefix=f'{data_root}/val2017/',
        data_cfg=data_cfg,
        pipeline=valid_pipeline),
    test=dict(
        type='TopDownCocoDataset',
        ann_file=f'{data_root}/annotations/person_keypoints_val2017.json',
        img_prefix=f'{data_root}/val2017/',
        data_cfg=data_cfg,
        pipeline=valid_pipeline),
)
@innerlee
Contributor

We need more info to reproduce this.

Some simple checks before that:

  • Data: could you double-check that your data is well prepared? You may try truncating person_keypoints_val2017.json to only a few samples and running on this tiny subset (see the sketch below).
  • CPU/GPU status: check htop for CPU usage and watch nvidia-smi for GPU usage. Is anything at 100% usage? How many GPUs are used? Is it run in a docker env?
  • 1 hour is surely abnormal; if nothing happens after 5 minutes when debugging, you may interrupt it.

ps. I edited the above post to fence the code.
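
For the first check, a minimal sketch of truncating the annotation file to a handful of images, assuming the standard COCO keypoints layout (the 'images', 'annotations' and 'categories' keys); the paths and subset size below are just examples:

# Build a tiny debugging subset: keep the first few images and their annotations.
import json

src = 'data/coco/annotations/person_keypoints_val2017.json'  # example path
dst = 'data/coco/annotations/person_keypoints_val2017_tiny.json'
num_images = 8  # size of the tiny subset

with open(src) as f:
    coco = json.load(f)

images = coco['images'][:num_images]
image_ids = {img['id'] for img in images}
annotations = [ann for ann in coco['annotations'] if ann['image_id'] in image_ids]

tiny = dict(coco, images=images, annotations=annotations)
with open(dst, 'w') as f:
    json.dump(tiny, f)

print(f'kept {len(images)} images and {len(annotations)} annotations')

Pointing ann_file in the config at the truncated file shows quickly whether the hang also happens on the tiny subset.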

@innerlee innerlee added the question Further information is requested label Jul 29, 2020
@jiawenhao2015
Author


Many thanks for your reply!

I printed the log and found that it gets stuck in epoch_based_runner.py (line 30), when it fetches images from the data loader. I have checked my images, but they are normal.
[image: screenshot of the debug output]
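
Since the hang is at the point where the runner pulls a batch from the data loader, one way to narrow it down is to fetch a single sample in the main process, bypassing worker processes entirely. A rough sketch, assuming mmpose 0.x's build_dataset and mmcv's Config; the config filename is a placeholder:

import time
from mmcv import Config
from mmpose.datasets import build_dataset

cfg = Config.fromfile('my_topdown_mobilenetv2_coco.py')  # the config posted above
dataset = build_dataset(cfg.data.train)

start = time.time()
sample = dataset[0]  # runs LoadImageFromFile and the rest of the pipeline in-process
print(f'fetched one sample in {time.time() - start:.2f}s, keys: {list(sample.keys())}')

If this returns quickly, the data itself is fine and the problem is more likely in the DataLoader worker processes (see the next comment).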

@jin-s13
Collaborator

jin-s13 commented Jul 29, 2020

Please try using workers_per_gpu=0 in the config file.
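
A minimal sketch of the corresponding change in the data dict of the config posted above; the train/val/test entries stay as they were:

data = dict(
    samples_per_gpu=64,
    workers_per_gpu=0,  # was 2; 0 loads data in the main process, no worker subprocesses
    # train=..., val=..., test=... unchanged
)

With workers_per_gpu=0 the DataLoader does not spawn worker processes, which sidesteps hangs caused by multiprocessing (e.g. limited shared memory inside docker containers).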

@jiawenhao2015
Author

Please try using workers_per_gpu=0 in the config file.

It works! 👍👍 Many, many thanks!!
I feel so silly~

@innerlee
Contributor

Please open a new issue instead of bumping a closed one.
The underlying causes might be different.

rollingman1 pushed a commit to rollingman1/mmpose that referenced this issue Nov 5, 2021
HAOCHENYE pushed a commit to HAOCHENYE/mmpose that referenced this issue Jun 27, 2023