Problems in training MJ + ST training set using robustscanner algorithm #508
I have trained different models on the MJ + ST datasets and tried setting num_workers to 0, 2, 4, and 8, but got the same problem.
Environment:
TorchVision: 0.10.0+cu111
I suspect it has something to do with a specific part of the data. You could rerun the training process several times and check whether the slowdown occurs at particular batches. Also, could you share the config you have been using?
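One way to check this (a minimal sketch, not something posted in the thread): time each batch fetch from the training dataloader and flag the slow ones. The dataloader below is a placeholder for whatever loader your training script builds, and the threshold is arbitrary, only meant to surface outliers.

import time

def time_batches(dataloader, num_batches=500, slow_threshold=1.0):
    # Iterate over a dataloader and report batches whose fetch time exceeds
    # slow_threshold seconds, to tell whether the stall is tied to particular
    # batches or creeps in gradually over the run.
    it = iter(dataloader)
    for i in range(num_batches):
        start = time.perf_counter()
        try:
            next(it)
        except StopIteration:
            break
        elapsed = time.perf_counter() - start
        if elapsed > slow_threshold:
            print(f"batch {i}: fetch took {elapsed:.2f}s")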
Of course. Thank you very much for your help. My configuration file is below:
_base_ = [
'../../_base_/default_runtime.py',
'../../_base_/recog_models/robust_scanner.py'
]
# optimizer
optimizer = dict(type='Adam', lr=1e-3)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(policy='step', step=[3, 4])
total_epochs = 5
img_norm_cfg = dict(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='ResizeOCR',
height=48,
min_width=48,
max_width=160,
keep_aspect_ratio=True,
width_downsample_ratio=0.25),
dict(type='ToTensorOCR'),
dict(type='NormalizeOCR', **img_norm_cfg),
dict(
type='Collect',
keys=['img'],
meta_keys=[
'filename', 'ori_shape', 'resize_shape', 'text', 'valid_ratio'
]),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiRotateAugOCR',
rotate_degrees=[0, 90, 270],
transforms=[
dict(
type='ResizeOCR',
height=48,
min_width=48,
max_width=160,
keep_aspect_ratio=True,
width_downsample_ratio=0.25),
dict(type='ToTensorOCR'),
dict(type='NormalizeOCR', **img_norm_cfg),
dict(
type='Collect',
keys=['img'],
meta_keys=[
'filename', 'ori_shape', 'resize_shape', 'valid_ratio'
]),
])
]
dataset_type = 'OCRDataset'
train_prefix = 'data/mixture/'
train_img_prefix1 = train_prefix + \
'SynthText/synthtext/SynthText_patch_horizontal'
train_img_prefix2 = train_prefix + 'mjsynth/mnt/ramdisk/max/90kDICT32px'
train_ann_file1 = train_prefix + 'SynthText/label.lmdb'
train_ann_file2 = train_prefix + 'mjsynth/label.lmdb'
train1 = dict(
type=dataset_type,
img_prefix=train_img_prefix1,
ann_file=train_ann_file1,
loader=dict(
type='LmdbLoader',
repeat=1,
parser=dict(
type='LineStrParser',
keys=['filename', 'text'],
keys_idx=[0, 1],
separator=' ')),
pipeline=None,
test_mode=False)
train2 = {key: value for key, value in train1.items()}
train2['img_prefix'] = train_img_prefix2
train2['ann_file'] = train_ann_file2
test_prefix = 'data/mixture/'
test_img_prefix1 = test_prefix + 'IIIT5K/'
test_img_prefix2 = test_prefix + 'svt/'
test_img_prefix3 = test_prefix + 'ic13/'
test_img_prefix4 = test_prefix + 'ic15/'
test_img_prefix5 = test_prefix + 'svtp/'
test_img_prefix6 = test_prefix + 'CUTE80/'
test_ann_file1 = test_prefix + 'IIIT5K/test_label.txt'
test_ann_file2 = test_prefix + 'svt/test_label.txt'
test_ann_file3 = test_prefix + 'ic13/test_label_1015.txt'
test_ann_file4 = test_prefix + 'ic15/test_label.txt'
test_ann_file5 = test_prefix + 'svtp/test_label.txt'
test_ann_file6 = test_prefix + 'CUTE80/lable.txt'
test1 = dict(
type=dataset_type,
img_prefix=test_img_prefix1,
ann_file=test_ann_file1,
loader=dict(
type='HardDiskLoader',
repeat=1,
parser=dict(
type='LineStrParser',
keys=['filename', 'text'],
keys_idx=[0, 1],
separator=' ')),
pipeline=None,
test_mode=True)
test2 = {key: value for key, value in test1.items()}
test2['img_prefix'] = test_img_prefix2
test2['ann_file'] = test_ann_file2
test3 = {key: value for key, value in test1.items()}
test3['img_prefix'] = test_img_prefix3
test3['ann_file'] = test_ann_file3
test4 = {key: value for key, value in test1.items()}
test4['img_prefix'] = test_img_prefix4
test4['ann_file'] = test_ann_file4
test5 = {key: value for key, value in test1.items()}
test5['img_prefix'] = test_img_prefix5
test5['ann_file'] = test_ann_file5
test6 = {key: value for key, value in test1.items()}
test6['img_prefix'] = test_img_prefix6
test6['ann_file'] = test_ann_file6
data = dict(
samples_per_gpu=128,
workers_per_gpu=16,
val_dataloader=dict(samples_per_gpu=1),
test_dataloader=dict(samples_per_gpu=1),
train=dict(
type='UniformConcatDataset',
datasets=[train1, train2],
pipeline=train_pipeline),
val=dict(
type='UniformConcatDataset',
datasets=[test1, test2, test3, test4, test5, test6],
pipeline=test_pipeline),
test=dict(
type='UniformConcatDataset',
datasets=[test1, test2, test3, test4, test5, test6],
pipeline=test_pipeline))
evaluation = dict(interval=1, metric='acc')
The config looks fine. So have you tested whether specific data batches are causing the slowdown? I went over our RobustScanner log file and noticed that the data time also fluctuated over the entire training process. You might keep training and let us know if there is a significant difference between your log and ours.
I changed the training set from ST + MJ to IIIT5K and the training ran without any problems, so I guessed that something was wrong with my ST + MJ dataset. I downloaded the ST + MJ dataset again and processed it according to the instructions in the documentation, but I still got the same problem: after a few minutes of training, the GPU power draw stays very low and only occasionally reaches full load.
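As a quick sanity check on the re-downloaded annotations (a sketch using the py-lmdb package, not taken from this thread), you can open the label LMDB read-only and dump a few raw records; the path below follows the config above, so adjust it to your own layout.

import lmdb

def spot_check_lmdb(lmdb_path, num_samples=5):
    # Open the annotation LMDB read-only and print a few raw records,
    # just to confirm the file is readable and the entries look sane.
    env = lmdb.open(lmdb_path, readonly=True, lock=False, readahead=False)
    with env.begin() as txn:
        print("entries:", txn.stat()["entries"])
        for i, (key, value) in enumerate(txn.cursor()):
            if i >= num_samples:
                break
            print(key, value[:80])
    env.close()

spot_check_lmdb('data/mixture/mjsynth/label.lmdb')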
It's probably because some cropped images can be much larger than others, resulting in a longer processing time. Let's keep monitoring it for several hours of running first. Such a large gap between
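To check for such size outliers, here is a sketch assuming a plain-text 'filename text' label file and Pillow; adapt the annotation reading if your labels live in an LMDB.

import os
from PIL import Image

def width_stats(img_prefix, ann_file, max_samples=2000):
    # Sample image widths and report the spread, to see whether a few
    # very wide crops stand out from the rest of the dataset.
    widths = []
    with open(ann_file) as f:
        for i, line in enumerate(f):
            if i >= max_samples:
                break
            filename = line.split(' ', 1)[0]
            with Image.open(os.path.join(img_prefix, filename)) as im:
                widths.append(im.size[0])
    if not widths:
        print('no samples read')
        return
    widths.sort()
    n = len(widths)
    print('min / median / p99 / max width:',
          widths[0], widths[n // 2], widths[int(n * 0.99)], widths[-1])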
Yep, there must be something wrong. I suspect this PR causes the slowdown. Could you comment out the changes in mmocr/models/textrecog/recognizer/encode_decode_recognizer.py and rerun the training process?
I went back to this version, retrained, and got the same error.
I got the same error too. |
Have you solved it? Which type of GPU do you use?
I'm looking into this issue but it may take some time to find the solution. @wushilian could you also share more details to help us locate the problem? Specifically:
(@Gaoxj2020, it would be great if you could share the profiling results with us :))
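For the profiling itself, a minimal cProfile sketch (my own suggestion, not the maintainers'): pull a fixed number of batches and profile that. With workers_per_gpu > 0 the loading happens in worker processes and the main-process profile only shows waiting, so set the workers to 0 for this measurement; dataloader is again a placeholder for the loader your script builds.

import cProfile
import pstats

def fetch_batches(dataloader, n=200):
    # Pull a fixed number of batches so the profile covers data loading.
    it = iter(dataloader)
    for _ in range(n):
        try:
            next(it)
        except StopIteration:
            break

profiler = cProfile.Profile()
profiler.enable()
fetch_batches(dataloader)  # placeholder: use the dataloader your script builds
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)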
|
I'm gonna close this but feel free to reopen it if you find anything worthy of reporting. |
At the beginning of training, the value of data_time is small, but a few minutes later data_time increases gradually, which leads to low GPU utilization.
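One way to quantify this trend (a sketch assuming the default mmcv text log, where each iteration line contains a 'data_time: <seconds>' field; the log path below is hypothetical): extract the data_time values and average them in chunks over the run.

import re

def data_time_trend(log_path, chunks=10):
    # Extract data_time values from a text training log and print a coarse
    # trend, to confirm whether data loading slows down over time.
    pattern = re.compile(r'data_time: ([0-9.]+)')
    values = []
    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                values.append(float(m.group(1)))
    if not values:
        print('no data_time entries found')
        return
    step = max(1, len(values) // chunks)
    for i in range(0, len(values), step):
        block = values[i:i + step]
        print(f'iters {i}-{i + len(block) - 1}: '
              f'mean data_time {sum(block) / len(block):.3f}s')

data_time_trend('work_dirs/robustscanner/train.log')  # hypothetical path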