
Training Issues with ArXiv papers #777

Open
m485liuw opened this issue Jun 21, 2021 · 13 comments

@m485liuw

Hi, we just finished a small-scale data annotation effort for the header model with arXiv papers (30 PDFs that the header model had annotated incorrectly) and retrained the header model with the original data plus the newly annotated examples. Unfortunately, we found that the accuracy stays the same on our generated test set (1000 arXiv papers). In fact, we found that the accuracy barely increases (~88% for abstract, 95% for header) after 300 training data points. We are wondering whether it is even possible to increase the accuracy by training with more data. If so, do you have any suggestions on designing the dataset (i.e., any rules for selecting papers)?

Also, we found that most of the header model errors came from: 1. foot/header appears in abstract, 2. title not identified, 3. keywords/CCS in abstract, 4. upstream (segmentation) model wrong. After annotating papers with the first 3 issues and retraining, some of the papers with these issues got corrected, whereas some new PDFs (originally correct) became wrongly annotated with these same issues. Do you know how to explain this? Shouldn't these issues be fixed after retraining (or, at the very least, shouldn't retraining avoid introducing new errors of this type)?

Plus, as mentioned, some errors come from the segmentation model. Do you think it would be helpful to retrain the segmentation model as well?

Looking forward to your reply and to any plans for improving the header model.

@kermitt2
Owner

Hello @m485liuw !

Thanks a lot for the issue and the detailed analysis of errors.

I think it's a bit difficult to conclude anything from 30 examples added to the existing 592, but it's possible that, with the existing set of features (which is limited due to the small size of the training data) and for a relatively homogeneous collection like arXiv, we don't improve beyond 300 examples. When evaluating against PubMed Central, scores were still improving as I reached 600 examples, so I was not facing this issue yet.

Your approach is the one I was following: adding examples that the current model gets wrong. This is how I got the best learning curve so far.

One important aspect, I think, is that the current training data is "small size" but "high quality": every label was checked several times, and we paid a lot of attention to having very consistent labelling (in terms of what exactly is labeled, which words are excluded from the labeled chunks, etc.). Introducing examples with a slightly different, inconsistent, or incomplete labeling approach can significantly impact the learning (making it less efficient) and introduce prediction errors compared to the older version. Working with such small training sets makes every small labeling error extremely impactful. On the other hand, this approach makes it possible to get very good accuracy with limited training data and to add new examples with visible effects.

If you'd like to contribute training data to GROBID and you're not under time pressure, I would be very happy to review the 30 examples and check their overall consistency.

There would be different ways to improve the accuracy:

  • try a different algorithm than CRF; in particular the deep learning ones with layout features are pretty good at taking advantage of more training data. SciBERT is actually very good at picking some fields correctly (like title, which reaches a 98.58 f1-score on the PMC sample with Levenshtein matching, while we are at 97.0 with CRF), while being worse at others (keywords/abstract, possibly due to the sequence length input limit and the inability to exploit layout features).

  • use an ensemble approach if you're not constrained by runtime: run several ML header models and use voting to pick the best fields (I am not working on that in GROBID because I think improving base models and extending training data are more of a priority, but it would be a normal optimization to maximize accuracy); see the sketch after this list.

  • in the old CRF approach, review the features and experiment with new ones intuitively relevant for arXiv papers. More training data makes it possible to add more features with positive results, but it's more complicated as it requires diving into the Java code.
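
To illustrate the ensemble option above, here is a minimal sketch of field-level majority voting over the outputs of several header models. It is not an existing GROBID API: `run_header_model`-style callables and the field names are hypothetical placeholders you would wire to your own model runners.

```python
from collections import Counter

def vote_on_field(field_values):
    """Pick the most frequent non-empty value among the candidate extractions."""
    candidates = [v.strip() for v in field_values if v and v.strip()]
    if not candidates:
        return ""
    value, _count = Counter(candidates).most_common(1)[0]
    return value

def ensemble_header(pdf_path, models):
    """Run several header models (hypothetical callables returning dicts of
    field name -> extracted string) and combine them by simple majority voting."""
    results = [model(pdf_path) for model in models]   # e.g. CRF, BiLSTM+layout, SciBERT variants
    fields = {key for result in results for key in result}
    return {field: vote_on_field([r.get(field, "") for r in results]) for field in fields}
```

With only two or three models a tie-break rule is needed (here the first most-common value wins); weighting models by their per-field f-scores would be a natural refinement.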

In general, if the error is due to a segmentation error from the segmentation model, it indeed has to be fixed first by retraining the segmentation model; this would be the way to fix "1. foot/header appears in abstract".
A title not being identified might come from a reading order issue, which is something I am still working on in pdfalto. As it is still not stable, I have excluded examples with strong reading order issues from the training data.
"3. keywords/CCS in abstract" is a relatively common error I also observed with the current model, and so far I was thinking that more training data would help with it.

@m485liuw
Author

Thanks a lot for the detailed reply!
Is "foot/header appears in abstract" error coming from segmentation model? We thought foot/header in the first page should be included in the header part so segmentation model was correct?
Here are some of our annotations. It would be really helpful if you could help check any inconsistency.
xml.zip

@kermitt2
Owner

Is "foot/header appears in abstract" error coming from segmentation model? We thought foot/header in the first page should be included in the header part so segmentation model was correct?

Yes, you're correct, sorry, my fault! The header/footer on the first page should be included in the header part, because it usually contains metadata that is important to extract with the header!

Thanks for the annotations, I'll try to have a look soon and give feedback.

@kermitt2 kermitt2 added the enhancement, error cases, and training guidelines labels on Jun 25, 2021
@kermitt2
Owner

Here is the review. I made corrections in every document; some of them had significant problems in the authors/affiliation/email sequences.

xml-reviewed.zip

@m485liuw
Author

Hi, thanks for your detailed annotation. We are interested in the title and abstract only, so we only annotated those parts. Do you know whether it affects the training if we only annotate parts of the header?

@kermitt2
Owner

kermitt2 commented Jun 29, 2021

My experience is that errors and incomplete annotations significantly affect the other annotations. By introducing more labels, we improve the accuracy globally, because we enrich the representation of the contexts, make the learning more efficient, and decrease sources of ambiguity. For instance in NER, to improve the recognition of dates, we typically also annotate other numerical entities (like currencies, quantities, reference markers, ...).

In our case, for instance, by explicitly labeling keywords and meeting places/venues, we normally improve the accuracy of identifying the title and abstract for the same amount of training data.
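
If it helps, here is a small sketch (not part of GROBID) for spotting partially annotated files: it lists which XML elements each header training TEI file actually uses, so files labeled only for title/abstract stand out. It assumes the labels appear as elements under a `<front>` block, as in the header training files; adjust to your actual file layout.

```python
import sys
from collections import Counter
from xml.etree import ElementTree as ET

def label_summary(tei_path):
    """Count the element tags used inside the <front> block of a header training file."""
    tree = ET.parse(tei_path)
    counts = Counter()
    for elem in tree.iter():
        if elem.tag.endswith("front"):                  # namespace-agnostic match on <front>
            for child in elem.iter():
                counts[child.tag.split("}")[-1]] += 1   # strip any namespace prefix
    return counts

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, dict(label_summary(path)))          # files missing e.g. author/keyword labels stand out
```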

@kermitt2
Owner

kermitt2 commented Jul 2, 2021

For info, I've added the 12 corrected header XML files to the training data, together with 12 bioRxiv headers, and got an improvement in header f-score results for both the PMC and bioRxiv evaluation sets.

In your initial set of 12 headers, for instance, I corrected 3 titles (xml/2104.06550v1.training.header.tei.xml, 2104.06800v1.training.header.tei.xml and xml/2104.10542v1.training.header.tei.xml). That would mean your additional training data has a 75% precision on the title field, considering the GROBID annotation guidelines. I don't think it's possible to improve a model that already provides above 90% title field accuracy without annotations of higher quality than its current performance.

@m485liuw
Author

m485liuw commented Jul 5, 2021 via email

@kermitt2
Owner

kermitt2 commented Jul 5, 2021

Hello !

> Btw, is there anywhere I could find how you got your evaluation sets?

About the evaluation sets, it's described at https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/
The PMC and bioRxiv samples are holdout sets (kept out of the different training data) and stable over time. It's an end-to-end evaluation, so it starts from the PDF and compares the final TEI XML results with the JATS files.

The PMC sample set is from Alexandru Constantin's Ph.D. work (PDFX), and the bioRxiv set was created by Daniel Ecer.
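
As a rough illustration of that comparison step (not GROBID's actual evaluation code, which supports several matching criteria such as strict, soft, Levenshtein and Ratcliff/Obershelp), a stripped-down title comparison between a GROBID TEI result and the JATS ground truth could look like this:

```python
from difflib import SequenceMatcher
from xml.etree import ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_title(tei_path):
    """Extract the header title from a GROBID TEI result file."""
    root = ET.parse(tei_path).getroot()
    node = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    return "".join(node.itertext()).strip() if node is not None else ""

def jats_title(jats_path):
    """Extract the article title from a JATS XML ground-truth file."""
    root = ET.parse(jats_path).getroot()
    node = root.find(".//article-title")
    return "".join(node.itertext()).strip() if node is not None else ""

def title_matches(tei_path, jats_path, threshold=0.95):
    """Soft comparison: a similarity ratio above the threshold counts as a match."""
    return SequenceMatcher(None, tei_title(tei_path).lower(),
                           jats_title(jats_path).lower()).ratio() >= threshold
```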

> Did you get improvement for both abstract and title?

Yes, but not a lot for PMC

Last weekend I made an additional "training data" effort (a few dozen examples in segmentation and header, originally failing), and the results are again better. So we should be able to continue improving the current header model (independently from improving its design/implementation) just by augmenting the training data like this.

@m485liuw
Author

m485liuw commented Jul 7, 2021 via email

@kermitt2
Owner

kermitt2 commented Jul 7, 2021

Hi !

I think it's a good idea, but I am not even at that point... I simply downloaded some random PMC, bioRxiv and arXiv PDF files, processed them, and selected a few dozen with an empty title, empty authors and/or an empty abstract (which is a good sign that something very bad happened :).
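
For reference, a simple way to automate that selection is to scan the TEI results and keep the documents where the title, authors or abstract came out empty. This is only a sketch: it assumes you already ran GROBID over the PDFs and have the resulting *.tei.xml files in a folder (hypothetically named grobid_output here), and that the usual TEI header structure applies.

```python
from pathlib import Path
from xml.etree import ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def is_empty(root, xpath):
    """True if the element is missing or contains no text."""
    node = root.find(xpath, TEI_NS)
    return node is None or not "".join(node.itertext()).strip()

def suspicious_results(tei_dir):
    """Yield TEI result files with an empty title, no authors, or an empty abstract."""
    for tei_path in Path(tei_dir).glob("*.tei.xml"):
        root = ET.parse(tei_path).getroot()
        source = root.find(".//tei:sourceDesc", TEI_NS)
        no_title = is_empty(root, ".//tei:titleStmt/tei:title")
        no_authors = source is None or source.find(".//tei:author", TEI_NS) is None
        no_abstract = is_empty(root, ".//tei:profileDesc/tei:abstract")
        if no_title or no_authors or no_abstract:
            yield tei_path.name

if __name__ == "__main__":
    for name in suspicious_results("grobid_output"):
        print(name)   # candidates worth annotating and adding to the training data
```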

I added 65 header files and 62 segmentation files in last week's effort, so an increase of about 10% in the training data. For the header model, this led to around +1.0 F-score on average on the PMC sample set, and around +4 on bioRxiv (there were around 30 files from bioRxiv, because it was not really represented in the training data so far).

Except for reference parsing, I didn't get this kind of improvement after working on deep learning models for 3 years :D

@m485liuw
Author

m485liuw commented Jul 7, 2021 via email

@kermitt2
Owner

kermitt2 commented Jul 8, 2021

yes
