[Errno 21] Is a directory: 'data/mixture/Syn90k/label.lmdb' #1105

Closed
MingyuLau opened this issue Jun 23, 2022 · 22 comments

@MingyuLau
Contributor

When I tried to train MASTER on GPUs, it raised the error shown below. However, I had organized my data correctly, and the directory "label.lmdb" does contain the two files "data.mdb" and "lock.mdb".

[Two screenshot attachments; images not preserved]
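
For reference, Errno 21 (EISDIR) is what Python reports on Linux when a directory is opened as if it were a regular file. An lmdb dataset such as label.lmdb is a directory holding data.mdb and lock.mdb, so the message suggests that some code path treated the path as a plain text file. The sketch below reproduces that situation with a temporary directory. The helper name open_as_text is hypothetical, and nothing here is MMOCR code.

```python
import errno
import os
import tempfile


def open_as_text(path):
    """Try to read `path` as a plain text file; return the OSError if it fails."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
    except OSError as err:
        return err
    return None


# Build a stand-in for an lmdb "file", which is really a directory.
with tempfile.TemporaryDirectory() as root:
    lmdb_dir = os.path.join(root, "label.lmdb")
    os.mkdir(lmdb_dir)
    for name in ("data.mdb", "lock.mdb"):
        open(os.path.join(lmdb_dir, name), "wb").close()

    err = open_as_text(lmdb_dir)
    # On Linux this is expected to be IsADirectoryError with errno.EISDIR (21);
    # other platforms may report a different OSError subclass.
    print(type(err).__name__, err.errno == errno.EISDIR)
```

If this matches what happens during training, the data layout is probably fine and the question becomes why the loader chose the text-file path for an lmdb annotation.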

@Mountchicken
Collaborator

Hi @MingyuLau

This looks like a problem with the dataset config. Try replacing the following code:

train1 = dict(
    type='OCRDataset',
    img_prefix=train_img_prefix1,
    ann_file=train_ann_file1,
    loader=dict(
        type='AnnFileLoader',
        repeat=1,
        file_format='lmdb',
        parser=dict(
            type='LineStrParser',
            keys=['filename', 'text'],
            keys_idx=[0, 1],
            separator=' ')),
    pipeline=None,
    test_mode=False)

with:

train1 = dict(
    type='OCRDataset',
    img_prefix=train_img_prefix1,
    ann_file=train_ann_file1,
    loader=dict(
        type='AnnFileLoader',
        repeat=1,
        file_format='lmdb',
        parser=dict(
            type='LineJsonParser',
            keys=['filename', 'text']),
    pipeline=None,
    test_mode=False)

@MingyuLau
Contributor Author

Thank you for looking into this so quickly. Your change fixed the previous error, but now a new "pipeline" error is raised.
[Screenshot attachment; image not preserved]

@Mountchicken
Collaborator

We may need to fix more. Try replacing

train3['loader']['file_format'] = 'txt'
train_list = [train1, train2, train3]

with

train3['loader']['file_format'] = 'txt'
train3['loader']['parser'] = dict(
    type='LineStrParser',
    keys=['filename', 'text'],
    keys_idx=[0, 1],
    separator=' ')
train_list = [train1, train2, train3]

@MingyuLau
Contributor Author

I replaced it, but it raises the same error. I will check my code again.

@MingyuLau
Contributor Author

@Mountchicken Sorry to bother you. Yesterday I found a typo in your code: a ")" is missing from "loader=dict(......)". After fixing the typo, the console raises the same "is a directory" error as before.

@Mountchicken
Collaborator

@MingyuLau
Can you show me your dataset configs? This problem is caused by using LineStrParser when loading a dataset in lmdb format; LineJsonParser should be used instead.
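
To illustrate the difference between the two parsers, here is a minimal sketch in plain Python. It assumes that each lmdb annotation value is a JSON string with "filename" and "text" fields, and that each txt annotation line is "filename text" separated by a space. The function names parse_str_line and parse_json_line and the sample lines are hypothetical stand-ins; this is not MMOCR's implementation.

```python
import json


def parse_str_line(line, keys=("filename", "text"), keys_idx=(0, 1), separator=" "):
    """Split a plain-text annotation line, in the spirit of LineStrParser."""
    parts = line.strip().split(separator)
    return {key: parts[idx] for key, idx in zip(keys, keys_idx)}


def parse_json_line(line, keys=("filename", "text")):
    """Decode a JSON annotation line, in the spirit of LineJsonParser."""
    record = json.loads(line)
    return {key: record[key] for key in keys}


# Hypothetical sample lines for the two annotation styles.
txt_line = "img_0001.jpg hello"
json_line = '{"filename": "img_0001.jpg", "text": "hello"}'

print(parse_str_line(txt_line))
print(parse_json_line(json_line))
```

Under these assumptions, feeding the JSON line to the string parser would split it on spaces and return fragments of JSON syntax rather than the filename and text, which is why the parser has to match the annotation format.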

@MingyuLau
Contributor Author

@Mountchicken

# Text Recognition Training set, including:
# Synthetic Datasets: SynthText, Syn90k

train_root = 'data/mixture'

train_img_prefix1 = f'{train_root}/Syn90k/mnt/ramdisk/max/90kDICT32px'
train_ann_file1 = f'{train_root}/Syn90k/label.lmdb'

train1 = dict(
    type='OCRDataset',
    img_prefix=train_img_prefix1,
    ann_file=train_ann_file1,
    loader=dict(
        type='AnnFileLoader',
        repeat=1,
        file_format='lmdb',
        parser=dict(
            type='LineJsonParser',
            keys=['filename', 'text'])),
    pipeline=None,
    test_mode=False)

train_img_prefix2 = f'{train_root}/SynthText/' + \
    'synthtext/SynthText_patch_horizontal'
train_ann_file2 = f'{train_root}/SynthText/label.lmdb'

train_img_prefix3 = f'{train_root}/SynthText_Add'
train_ann_file3 = f'{train_root}/SynthText_Add/label.txt'

train2 = {key: value for key, value in train1.items()}
train2['img_prefix'] = train_img_prefix2
train2['ann_file'] = train_ann_file2

train3 = {key: value for key, value in train1.items()}
train3['img_prefix'] = train_img_prefix3
train3['ann_file'] = train_ann_file3
train3['loader']['file_format'] = 'txt'

train_list = [train1, train2, train3]

@Mountchicken
Collaborator

@MingyuLau
Sorry for the misunderstanding. We have merged a new PR to fix this problem. The correct config should be

train_root = 'data/mixture'

train_img_prefix1 = f'{train_root}/Syn90k/mnt/ramdisk/max/90kDICT32px'
train_ann_file1 = f'{train_root}/Syn90k/label.lmdb'

train1 = dict(
    type='OCRDataset',
    img_prefix=train_img_prefix1,
    ann_file=train_ann_file1,
    loader=dict(
        type='AnnFileLoader',
        repeat=1,
        file_format='lmdb',
        parser=dict(type='LineJsonParser', keys=['filename', 'text'])),
    pipeline=None,
    test_mode=False)

train_img_prefix2 = f'{train_root}/SynthText/' + \
    'synthtext/SynthText_patch_horizontal'
train_ann_file2 = f'{train_root}/SynthText/label.lmdb'

train_img_prefix3 = f'{train_root}/SynthText_Add'
train_ann_file3 = f'{train_root}/SynthText_Add/label.txt'

train2 = {key: value for key, value in train1.items()}
train2['img_prefix'] = train_img_prefix2
train2['ann_file'] = train_ann_file2

train3 = {key: value for key, value in train1.items()}
train3['img_prefix'] = train_img_prefix3
train3['ann_file'] = train_ann_file3
train3['loader']['file_format'] = 'txt'
train3['loader']['parser'] = dict(
    type='LineStrParser',
    keys=['filename', 'text'],
    keys_idx=[0, 1],
    separator=' ')

train_list = [train1, train2, train3]

@Mountchicken
Collaborator

Mountchicken commented Jun 24, 2022

SA is in txt format and needs to use LineStrParser. The parsers are admittedly confusing; we are working on this and will provide more specific documentation later.

@MingyuLau
Contributor Author

@Mountchicken
But it still gives me the same error. Did I do something wrong when I converted "label.txt" to "label.lmdb"? I chose the "label-only" option when using "lmdb_converter.py".
[Screenshot attachment; image not preserved]

@MingyuLau
Contributor Author

And this is my dataset structure, organized as described in the official documentation.
[Screenshot attachment; image not preserved]

@Mountchicken
Collaborator

@MingyuLau
It's not a problem with lmdb_converter.py. The problem still lies in the use of LineStrParser for lmdb annotations, which the modified config should have solved. My guess is that something is wrong with your version of MMOCR, or that something was missed during installation. Please try installing the newest version of MMOCR, using pip install -v -e . after you clone it.

@MingyuLau
Contributor Author

@Mountchicken
Strangely, I have fixed my dataset config file, but the fix does not seem to take effect: the console output shows that the parser in the config is still "str", not "json".
[Screenshot attachment; image not preserved]

@Mountchicken
Collaborator

Try pip install -v -e .

@MingyuLau
Contributor Author

It makes no difference

@MingyuLau
Contributor Author

@Mountchicken
Why does the output show that nothing has changed, even though I changed the config file?

@Mountchicken
Collaborator

@MingyuLau
I am confused too. Is it possible that you specified two dataset configs at the same time, and the correct one was overridden by the wrong one?
For example:

_base_ = [
    '../../_base_/recog_datasets/ST_SA_MJ_train.py',
    '../../_base_/recog_datasets/ST_MJ_train.py',
]

Are you training with master_r31_12e_ST_MJ_SA.py ?

@MingyuLau
Contributor Author

MingyuLau commented Jun 24, 2022

@Mountchicken
Yes, I am training with master_r31_12e_ST_MJ_SA.py. This is my training config file:

_base_ = [
    '../../_base_/default_runtime.py', '../../_base_/recog_models/master.py',
    '../../_base_/schedules/schedule_adam_step_12e.py',
    '../../_base_/recog_pipelines/master_pipeline.py',
    '../../_base_/recog_datasets/ST_SA_MJ_train.py',
    '../../_base_/recog_datasets/academic_test.py'
]

train_list = {{_base_.train_list}}
test_list = {{_base_.test_list}}

train_pipeline = {{_base_.train_pipeline}}
test_pipeline = {{_base_.test_pipeline}}

data = dict(
    samples_per_gpu=8,
    workers_per_gpu=4,
    val_dataloader=dict(samples_per_gpu=8),
    test_dataloader=dict(samples_per_gpu=8),
    train=dict(
        type='UniformConcatDataset',
        datasets=train_list,
        pipeline=train_pipeline),
    val=dict(
        type='UniformConcatDataset',
        datasets=test_list,
        pipeline=test_pipeline),
    test=dict(
        type='UniformConcatDataset',
        datasets=test_list,
        pipeline=test_pipeline))

evaluation = dict(interval=1, metric='acc')

@MingyuLau
Contributor Author

@Mountchicken
I found where the error lies. While debugging, I found that in ST_SA_MJ_train.py the parameters of train3 overwrite the parameters of train1.
Can you help me fix this error?
[Screenshot attachment; image not preserved]

@Mountchicken
Collaborator

@MingyuLau
Thanks for pointing that out. That's a really well hidden bug. Using the following config can fix the problem.

# Text Recognition Training set, including:
# Synthetic Datasets: SynthText, Syn90k

train_root = 'data/mixture'

train_img_prefix1 = f'{train_root}/Syn90k/mnt/ramdisk/max/90kDICT32px'
train_ann_file1 = f'{train_root}/Syn90k/label.lmdb'

train1 = dict(
    type='OCRDataset',
    img_prefix=train_img_prefix1,
    ann_file=train_ann_file1,
    loader=dict(
        type='AnnFileLoader',
        repeat=1,
        file_format='lmdb',
        parser=dict(type='LineJsonParser', keys=['filename', 'text'])),
    pipeline=None,
    test_mode=False)

train_img_prefix2 = f'{train_root}/SynthText/' + \
    'synthtext/SynthText_patch_horizontal'
train_ann_file2 = f'{train_root}/SynthText/label.lmdb'

train_img_prefix3 = f'{train_root}/SynthText_Add'
train_ann_file3 = f'{train_root}/SynthText_Add/label.txt'

train2 = {key: value for key, value in train1.items()}
train2['img_prefix'] = train_img_prefix2
train2['ann_file'] = train_ann_file2

train3 = dict(
    type='OCRDataset',
    img_prefix=train_img_prefix3,
    ann_file=train_ann_file3,
    loader=dict(
        type='AnnFileLoader',
        repeat=1,
        file_format='txt',
        parser=dict(
            type='LineStrParser',
            keys=['filename', 'text'],
            keys_idx=[0, 1],
            separator=' ')),
    pipeline=None,
    test_mode=False)

train_list = [train1, train2, train3]
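
The reason the earlier config misbehaved can be shown with a small standalone sketch. A dict comprehension over train1.items() makes a shallow copy, so the nested loader dict is shared, and changing train3['loader'] also changes train1['loader']. The trimmed-down dicts below are illustrative only, and copy.deepcopy is shown as one possible alternative to spelling out train3 in full; whether it is convenient inside an MMOCR config file has not been verified in this thread.

```python
import copy

# Trimmed-down stand-in for the dataset config; only the nested part matters.
train1 = dict(
    ann_file="Syn90k/label.lmdb",
    loader=dict(file_format="lmdb", parser=dict(type="LineJsonParser")),
)

# Shallow copy: the top-level dict is new, but 'loader' is the same object.
train3_shallow = {key: value for key, value in train1.items()}
shares_loader = train3_shallow["loader"] is train1["loader"]
train3_shallow["loader"]["file_format"] = "txt"
leaked_format = train1["loader"]["file_format"]  # train1 was changed too

# Reset train1, then use a deep copy so nested dicts are independent.
train1["loader"]["file_format"] = "lmdb"
train3_deep = copy.deepcopy(train1)
train3_deep["loader"]["file_format"] = "txt"

print(shares_loader, leaked_format, train1["loader"]["file_format"])
```

Note that in the config above train2 is still a shallow copy of train1. That is harmless here because both use lmdb annotations with the same loader settings, and only top-level keys of train2 are reassigned.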

@Mountchicken
Collaborator

BTW, it would be appreciated if you could also raise a PR to help us fix it.

@MingyuLau
Contributor Author

It's my pleasure, and I'm sincerely grateful for all your help in this problem!
