
Training Issues with ArXiv papers #777

Open
m485liuw opened this issue Jun 21, 2021 · 13 comments

@m485liuw

Hi, we just finished a small-scale data annotation effort for the header model with arXiv papers (30 PDFs that the header model had annotated incorrectly) and retrained the header model with the original data plus the newly annotated examples. Unfortunately, we found that the accuracy stays the same on our generated test set (1000 arXiv papers). In fact, we found that the accuracy barely increases (~88% for abstract, 95% for header) after 300 training data points. We are wondering whether it is even possible to increase the accuracy by training with more data. If so, do you have any suggestions on designing the dataset (i.e., any rules for selecting papers)?

Also, we found that most of the header model errors came from: 1. foot/header appears in abstract, 2. title not identified, 3. keywords/CCS in abstract, 4. upstream (segmentation) model wrong. After annotating papers with the first 3 issues and retraining, some of the papers with these issues got corrected, whereas some new PDFs (originally correct) became wrongly annotated with these same issues. Do you know how to explain this? Shouldn't these issues be fixed after retraining (or, at the very least, shouldn't retraining avoid introducing new errors of this type)?

Plus, as mentioned, some errors come from the segmentation model. Do you think it would be helpful to retrain the segmentation model as well?

Looking forward to your reply and to any plans for improving the header model.

@kermitt2
Owner

Hello @m485liuw !

Thanks a lot for the issue and the detailed analysis of errors.

I think it's a bit difficult to conclude anything from 30 examples added to the existing 592, but it's possible that, with the existing set of features (which is limited due to the small size of the training data) and for a relatively homogeneous collection like arXiv, we don't improve beyond 300 examples. When evaluating against PubMed Central, scores were still improving as I reached 600 examples, so I was not facing this issue yet.

Your approach is the one I was following: adding examples that the current model gets wrong. This is how I got the best learning curve so far.

One important aspect, I think, is that the current training data is "small size" but "high quality": every label was checked several times, and we paid a lot of attention to having very consistent labelling (in terms of what exactly is labeled, which words are excluded from the labeled chunks, etc.). Introducing examples with a slightly different, inconsistent, or incomplete labeling approach can significantly impact the learning (making it less efficient) and introduce prediction errors compared to the older version. Working with such small training sets makes every small labeling error extremely impactful. On the other hand, this approach makes it possible to get very good accuracy with limited training data and to add new examples with visible effects.

If you'd like to contribute training data to GROBID and you're not under time pressure, I would be very happy to review the 30 examples and check their overall consistency.

There would be different ways to improve the accuracy:

  • try a different algorithm than CRF; in particular the deep learning ones with layout features are pretty good at taking advantage of more training data. SciBERT is actually very good at picking some fields correctly (like title, which reaches a 98.58 f1-score on the PMC sample with Levenshtein matching, while we are at 97.0 with CRF), while being worse at others (keywords/abstract, possibly due to the sequence length input limit and the inability to exploit layout features).

  • use an ensemble approach if you're not constrained by runtime: run several ML header models and use voting to pick the best fields (I am not working on that in GROBID because I think improving base models and extending training data are more of a priority, but it would be a normal optimization to maximize accuracy); see the sketch after this list.

  • in the old CRF approach, review the features and experiment with new ones intuitively relevant for arXiv papers. More training data makes it possible to add more features with positive results, but it's more complicated as it requires diving into the Java code.
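
To illustrate the ensemble option above, here is a minimal sketch of field-level majority voting over the outputs of several header models. It is not an existing GROBID API: `run_header_model`-style callables and the field names are hypothetical placeholders you would wire to your own model runners.

```python
from collections import Counter

def vote_on_field(field_values):
    """Pick the most frequent non-empty value among the candidate extractions."""
    candidates = [v.strip() for v in field_values if v and v.strip()]
    if not candidates:
        return ""
    value, _count = Counter(candidates).most_common(1)[0]
    return value

def ensemble_header(pdf_path, models):
    """Run several header models (hypothetical callables returning dicts of
    field name -> extracted string) and combine them by simple majority voting."""
    results = [model(pdf_path) for model in models]   # e.g. CRF, BiLSTM+layout, SciBERT variants
    fields = {key for result in results for key in result}
    return {field: vote_on_field([r.get(field, "") for r in results]) for field in fields}
```

With only two or three models a tie-break rule is needed (here the first most-common value wins); weighting models by their per-field f-scores would be a natural refinement.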

In general, if the error is due to a segmentation error from the segmentation model, it indeed has to be fixed first by retraining the segmentation model; this would be the way to fix "1. foot/header appears in abstract".
A title not being identified might come from a reading order issue, which is something I am still working on in pdfalto. As it is still not stable, I have excluded examples with strong reading order issues from the training data.
"3. keywords/CCS in abstract" is a relatively common error I also observed with the current model, and so far I was thinking that more training data would help with it.

@m485liuw
Author

Thanks a lot for the detailed reply!
Is "foot/header appears in abstract" error coming from segmentation model? We thought foot/header in the first page should be included in the header part so segmentation model was correct?
Here are some of our annotations. It would be really helpful if you could help check any inconsistency.
xml.zip

@kermitt2
Owner

Is "foot/header appears in abstract" error coming from segmentation model? We thought foot/header in the first page should be included in the header part so segmentation model was correct?

Yes, you're correct, sorry, my fault! The header/footer on the first page should be included in the header part, because it usually contains metadata that is important to extract with the header!

Thanks for the annotations, I'll try to have a look soon and give feedback.

@kermitt2 kermitt2 added the enhancement, error cases, and training guidelines labels on Jun 25, 2021
@kermitt2
Owner

Here is the review. I made corrections in every document; some of them had significant problems in the authors/affiliation/email sequences.

xml-reviewed.zip

@m485liuw
Author

Hi, thanks for your detailed annotation. We are interested in the title and abstract only, so we only annotated those parts. Do you know whether it affects the training if we only annotate parts of the header?

@kermitt2
Owner

kermitt2 commented Jun 29, 2021

My experience is that errors and incomplete annotations significantly affect the other annotations. By introducing more labels, we improve the accuracy globally, because we enrich the representation of the contexts, make the learning more efficient, and decrease sources of ambiguity. For instance in NER, to improve the recognition of dates, we typically also annotate other numerical entities (like currencies, quantities, reference markers, ...).

In our case, for instance, by explicitly labeling keywords and meeting places/venues, we normally improve the accuracy of identifying the title and abstract for the same amount of training data.
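
If it helps, here is a small sketch (not part of GROBID) for spotting partially annotated files: it lists which XML elements each header training TEI file actually uses, so files labeled only for title/abstract stand out. It assumes the labels appear as elements under a `<front>` block, as in the header training files; adjust to your actual file layout.

```python
import sys
from collections import Counter
from xml.etree import ElementTree as ET

def label_summary(tei_path):
    """Count the element tags used inside the <front> block of a header training file."""
    tree = ET.parse(tei_path)
    counts = Counter()
    for elem in tree.iter():
        if elem.tag.endswith("front"):                  # namespace-agnostic match on <front>
            for child in elem.iter():
                counts[child.tag.split("}")[-1]] += 1   # strip any namespace prefix
    return counts

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, dict(label_summary(path)))          # files missing e.g. author/keyword labels stand out
```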

@kermitt2
Owner

kermitt2 commented Jul 2, 2021

For info, I've added the 12 corrected header XML files to the training data, together with 12 bioRxiv headers, and got an improvement in header f-score results for both the PMC and bioRxiv evaluation sets.

In your initial set of 12 headers, for instance, I corrected 3 titles (xml/2104.06550v1.training.header.tei.xml, 2104.06800v1.training.header.tei.xml and xml/2104.10542v1.training.header.tei.xml). That would mean your additional training data has a 75% precision on the title field, considering the GROBID annotation guidelines. I don't think it's possible to improve a model that already provides above 90% title field accuracy without annotations of higher quality than its current performance.

@m485liuw
Author

m485liuw commented Jul 5, 2021 via email

@kermitt2
Owner

kermitt2 commented Jul 5, 2021

Hello !

> Btw, is there anywhere I could find how you got your evaluation sets?

About the evaluation sets, it's described at https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/
The PMC and bioRxiv samples are holdout sets (kept out of the different training data) and stable over time. It's an end-to-end evaluation, so it starts from the PDF and compares the final TEI XML results with the JATS files.

The PMC sample set is from Alexandru Constantin's Ph.D. work (PDFX), and the bioRxiv set was created by Daniel Ecer.
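
As a rough illustration of that comparison step (not GROBID's actual evaluation code, which supports several matching criteria such as strict, soft, Levenshtein and Ratcliff/Obershelp), a stripped-down title comparison between a GROBID TEI result and the JATS ground truth could look like this:

```python
from difflib import SequenceMatcher
from xml.etree import ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_title(tei_path):
    """Extract the header title from a GROBID TEI result file."""
    root = ET.parse(tei_path).getroot()
    node = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    return "".join(node.itertext()).strip() if node is not None else ""

def jats_title(jats_path):
    """Extract the article title from a JATS XML ground-truth file."""
    root = ET.parse(jats_path).getroot()
    node = root.find(".//article-title")
    return "".join(node.itertext()).strip() if node is not None else ""

def title_matches(tei_path, jats_path, threshold=0.95):
    """Soft comparison: a similarity ratio above the threshold counts as a match."""
    return SequenceMatcher(None, tei_title(tei_path).lower(),
                           jats_title(jats_path).lower()).ratio() >= threshold
```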

> Did you get improvement for both abstract and title?

Yes, but not a lot for PMC

Last weekend I made an additional "training data" effort (a few dozen examples in segmentation and header, originally failing), and the results are again better. So we should be able to continue improving the current header model (independently from improving its design/implementation) just by augmenting the training data like this.

@m485liuw
Author

m485liuw commented Jul 7, 2021 via email

@kermitt2
Owner

kermitt2 commented Jul 7, 2021

Hi !

I think it's a good idea, but I am not even at that point... I simply downloaded some random PMC, bioRxiv and arXiv PDF files, processed them, and selected a few dozen with an empty title, empty authors and/or an empty abstract (which is a good sign that something very bad happened :).
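
For reference, a simple way to automate that selection is to scan the TEI results and keep the documents where the title, authors or abstract came out empty. This is only a sketch: it assumes you already ran GROBID over the PDFs and have the resulting *.tei.xml files in a folder (hypothetically named grobid_output here), and that the usual TEI header structure applies.

```python
from pathlib import Path
from xml.etree import ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def is_empty(root, xpath):
    """True if the element is missing or contains no text."""
    node = root.find(xpath, TEI_NS)
    return node is None or not "".join(node.itertext()).strip()

def suspicious_results(tei_dir):
    """Yield TEI result files with an empty title, no authors, or an empty abstract."""
    for tei_path in Path(tei_dir).glob("*.tei.xml"):
        root = ET.parse(tei_path).getroot()
        source = root.find(".//tei:sourceDesc", TEI_NS)
        no_title = is_empty(root, ".//tei:titleStmt/tei:title")
        no_authors = source is None or source.find(".//tei:author", TEI_NS) is None
        no_abstract = is_empty(root, ".//tei:profileDesc/tei:abstract")
        if no_title or no_authors or no_abstract:
            yield tei_path.name

if __name__ == "__main__":
    for name in suspicious_results("grobid_output"):
        print(name)   # candidates worth annotating and adding to the training data
```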

I added 65 header files and 62 segmentation files in last week's effort, so an increase of about 10% in the training data. For the header model, this led to around +1.0 F-score on average on the PMC sample set, and around +4 on bioRxiv (there were around 30 files from bioRxiv, because it was not really represented in the training data so far).

Except for reference parsing, I didn't get this kind of improvement after working on deep learning models for 3 years :D

@m485liuw
Author

m485liuw commented Jul 7, 2021 via email

@kermitt2
Owner

kermitt2 commented Jul 8, 2021

yes
