[LayoutLM] How to reproduce FUNSD result #134

Closed
nv-quan opened this issue May 14, 2020 · 17 comments

nv-quan commented May 14, 2020

Hello,
I ran fine-tuning for the sequence labeling task on the FUNSD dataset, but I couldn't reproduce the result presented in the paper (my precision is only 40%). Here are the scripts and logs I used. Any idea what could be wrong?
Thank you very much.
Training:

#!/bin/bash

python run_seq_labeling.py  --data_dir ~/mnt/data \
                            --model_type layoutlm \
                            --model_name_or_path ~/mnt/model \
                            --do_lower_case \
                            --max_seq_length 512 \
                            --do_train \
                            --num_train_epochs 100.0 \
                            --logging_steps 10 \
                            --save_steps -1 \
                            --output_dir ~/mnt/output \
                            --labels ~/mnt/data/labels.txt \
                            --per_gpu_train_batch_size 16 \
                            --fp16

Testing:

#!/bin/bash

python run_seq_labeling.py --do_predict \
  --model_type layoutlm \
  --model_name_or_path ~/mnt/model \
  --data_dir ~/mnt/data \
  --output_dir ~/mnt/output \
  --labels ~/mnt/data/labels.txt

Some log:

05/14/2020 09:40:45 - INFO - __main__ -   ***** Running training *****
05/14/2020 09:40:45 - INFO - __main__ -     Num examples = 150
05/14/2020 09:40:45 - INFO - __main__ -     Num Epochs = 100
05/14/2020 09:40:45 - INFO - __main__ -     Instantaneous batch size per GPU = 16
05/14/2020 09:40:45 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 16
05/14/2020 09:40:45 - INFO - __main__ -     Gradient Accumulation steps = 1
05/14/2020 09:40:45 - INFO - __main__ -     Total optimization steps = 1000
05/14/2020 09:53:00 - INFO - __main__ -    global_step = 1000, average loss = 0.10387736940692412

05/14/2020 10:17:07 - INFO - __main__ -   ***** Running evaluation  *****
05/14/2020 10:17:07 - INFO - __main__ -     Num examples = 52
05/14/2020 10:17:07 - INFO - __main__ -     Batch size = 8
05/14/2020 10:17:07 - INFO - __main__ -   
           precision    recall  f1-score   support

 QUESTION       0.41      0.70      0.52       771
   HEADER       0.00      0.00      0.00       108
   ANSWER       0.39      0.50      0.44       513

micro avg       0.40      0.57      0.47      1392
macro avg       0.37      0.57      0.45      1392

05/14/2020 10:17:07 - INFO - __main__ -   ***** Eval results  *****
05/14/2020 10:17:07 - INFO - __main__ -     f1 = 0.472115668338743
05/14/2020 10:17:07 - INFO - __main__ -     loss = 2.9291565077645436
05/14/2020 10:17:07 - INFO - __main__ -     precision = 0.400600901352028
05/14/2020 10:17:07 - INFO - __main__ -     recall = 0.5747126436781609

ranpox commented May 14, 2020

Hi @nv-quan,
Could you provide your preprocessing commands? The support numbers in your classification report look incorrect. With a max sequence length of 512, the total count for each entity type should be:

support
QUESTION 1071
ANSWER 809
HEADER 119
micro avg 1999
macro avg 1999
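
For anyone checking their own preprocessed data, here is a minimal sketch for counting entity support directly from the tab-separated test.txt. It assumes the BIOES tag scheme discussed later in this thread, where every entity contributes exactly one B- or S- tag:

#!/bin/bash

# Count entities per type in a CoNLL-style "token<TAB>label" file.
# Each entity starts with exactly one B- (multi-token) or S- (single-token)
# tag, so counting those tags reproduces the seqeval "support" column.
cut -d$'\t' -f 2 ~/mnt/data/test.txt \
  | grep -E '^(B|S)-' \
  | sed -E 's/^[BS]-//' \
  | sort | uniq -c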


nv-quan commented May 14, 2020

Thank you, here are my preprocessing scripts:
Training:

#!/bin/bash

python scripts/funsd_preprocess.py --data_dir ~/mnt/data/training_data/annotations/ \
  --data_split train \
  --output_dir ~/mnt/data \
  --model_name_or_path ~/mnt/model

cat ~/mnt/data/train.txt | cut -d$'\t' -f 2 | grep -v "^$" | sort | uniq > ~/mnt/data/labels.txt

Testing:

#!/bin/bash

python scripts/funsd_preprocess.py --data_dir ~/mnt/data/testing_data/annotations/ \
  --data_split test \
  --output_dir ~/mnt/data \
  --model_name_or_path ~/mnt/model

cat ~/mnt/data/test.txt | cut -d$'\t' -f 2 | grep -v "^$" | sort | uniq > ~/mnt/data/labels.txt


nv-quan commented May 14, 2020

Also, I see a lot of "WARNING maximum sequence length exceeded: No prediction for" messages in the log. Is that normal?


ranpox commented May 14, 2020

I don't think so. Documents longer than 512 tokens should be split into chunks to fit the max sequence length, so these warnings are abnormal. I can generate the data correctly with the preprocessing commands you provided, so please check that they were executed correctly.
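
If you want to double-check that chunking happened, one rough sanity check is to print the longest example in the preprocessed file. This assumes examples are separated by blank lines, CoNLL-style, and it counts whitespace tokens rather than wordpieces, so it is only an approximation:

#!/bin/bash

# Print the token count of the longest example in test.txt.
# After correct chunking, no example should come close to exceeding
# the --max_seq_length budget (512 minus the special tokens).
awk 'NF == 0 { if (n > max) max = n; n = 0; next }
     { n++ }
     END { if (n > max) max = n; print max }' ~/mnt/data/test.txt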

@marythomaa98

Hi @nv-quan, were you able to resolve this issue?


nv-quan commented May 20, 2020

@marythomaa98 not yet. I've been busy and haven't had a chance to look at it, but I'll try to fix this bug tomorrow.

@marythomaa98

@nv-quan okay, sure! Do let me know if it works out; I am getting the same support numbers as you.


nv-quan commented May 21, 2020

@marythomaa98 The preprocessing is totally fine, but for some reason there are fewer predicted labels than input tokens.

@wolfshow

@nv-quan The dataset contains entries with empty text but non-empty labels. I think you may need to remove them.
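
A hypothetical filter for such entries, assuming the tab-separated "token<TAB>label" layout of the preprocessed files. Note that if you drop rows here, companion files such as test_box.txt would need the same rows dropped, or the files will fall out of alignment:

#!/bin/bash

# Drop rows whose text field is empty but which still carry a label;
# blank separator lines and normal "token<TAB>label" rows are kept.
awk -F'\t' '!(NF >= 2 && $1 == "")' ~/mnt/data/test.txt > ~/mnt/data/test.filtered.txt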


nv-quan commented May 21, 2020

@wolfshow I'm comparing the two files output/test_predictions.txt and data/test.txt. Everything looks fine until line 181: the test data still continues for that example_id, while test_predictions prints '\n' (end of the example_id). And the text in the testing data is not empty at all.
[Screenshot: side-by-side view of data/test.txt and output/test_predictions.txt around line 181]
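
One way to reproduce that comparison from the shell, assuming the prediction file uses space-separated "token label" lines while test.txt is tab-separated, is to diff just the token columns and look at the first hunk:

#!/bin/bash

# Show where the prediction file first falls out of sync with the gold file.
diff <(cut -d$'\t' -f 1 ~/mnt/data/test.txt) \
     <(cut -d' ' -f 1 ~/mnt/output/test_predictions.txt) | head -n 20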


ranpox commented May 21, 2020

Hi @nv-quan,
It seems that you didn't set max_seq_length during the evaluation stage. Please add --max_seq_length 512 to your testing command and try again.
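
That is, the testing command from above with the flag appended:

#!/bin/bash

python run_seq_labeling.py --do_predict \
  --model_type layoutlm \
  --model_name_or_path ~/mnt/model \
  --data_dir ~/mnt/data \
  --output_dir ~/mnt/output \
  --labels ~/mnt/data/labels.txt \
  --max_seq_length 512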


nv-quan commented May 21, 2020

@ranpox thank you. The support numbers are now correct, but the results are still off:

f1 = 0.4204204204204205
loss = 3.160606418337141
precision = 0.3364373685791529
recall = 0.560280140070035

@marythomaa98

Hi @nv-quan, adding --do_lower_case and --fp16 works for me.


nv-quan commented May 21, 2020

@marythomaa98 thanks a lot. It works when I add --do_lower_case to my test script and also remove data/cached_test_model_512.
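
Putting the fixes from this thread together, the final predict invocation would look something like the sketch below. Deleting the cached feature file mentioned above forces the features to be rebuilt with the new settings; paths follow the earlier scripts in this thread:

#!/bin/bash

# Remove stale cached features built before --do_lower_case was set.
rm -f ~/mnt/data/cached_test_model_512

python run_seq_labeling.py --do_predict \
  --model_type layoutlm \
  --model_name_or_path ~/mnt/model \
  --data_dir ~/mnt/data \
  --output_dir ~/mnt/output \
  --labels ~/mnt/data/labels.txt \
  --max_seq_length 512 \
  --do_lower_case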


elnazsn1988 commented Aug 4, 2020

@marythomaa98 @nv-quan @ranpox can you paste your final command for prediction here? I am having a bit of trouble understanding where to place my test input, where the test output goes, and where the trained model sits.

Also, @ranpox, I had to set the max sequence length to 128 or CUDA would run out of memory. Is that an issue?

@james-griffin-deepsee

> @wolfshow I'm comparing the two files output/test_predictions.txt and data/test.txt. Everything looks fine until line 181: the test data still continues for that example_id, while test_predictions prints '\n' (end of the example_id). And the text in the testing data is not empty at all.

Could you explain the difference between the labels? I know the difference between Answer, Header, Question, and Other, but what do B-ANSWER vs E-ANSWER vs I-ANSWER vs S-ANSWER mean?


nv-quan commented Jul 22, 2021

> Could you explain the difference between the labels? I know the difference between Answer, Header, Question, and Other, but what do B-ANSWER vs E-ANSWER vs I-ANSWER vs S-ANSWER mean?

As far as I understand, B is beginning, E is end, I is in between (inside), and S is single.
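
For illustration, here is how some hypothetical tokens would be tagged under that BIOES scheme: a multi-token answer runs B- ... I- ... E-, while a single-token answer gets S-:

Licensed   B-ANSWER
to         I-ANSWER
operate    E-ANSWER
Yes        S-ANSWER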
