Training Issues with ArXiv papers #777
Hello @m485liuw ! Thanks a lot for the issue and the detailed analysis of errors. I think it's a bit difficult to conclude anything with 30 examples added to the existing 592, but it's possible that with the existing set of features (which is limited due to the small size of the training data) and for a relatively homogeneous collection like arXiv, we don't improve beyond 300 examples. When evaluating against PubMed Central, scores were still improving as I was reaching 600 examples, so I was not facing this issue yet. Your approach is the one I was following: adding examples if they were wrong with the current model, and this is how I got the best learning curve so far.
One important aspect, I think, is that the current training data is "small size" but "high quality". Every label was checked several times and we pay a lot of attention to having very consistent labelling (in terms of what exactly is labeled, which words are excluded from the labeled chunks, etc.). Introducing examples with a slightly different, inconsistent or incomplete labeling approach can impact the learning a lot (make it less efficient) and introduce errors in prediction as compared to older versions. Working with such small training sets makes every small labeling error extremely impactful. On the other hand, this approach makes it possible to get very good accuracy with limited training data and to add new examples with some effect. If you'd like to contribute to GROBID with training data and if you're not under time pressure, I would be very happy to review the 30 examples and check the overall consistency.
There would be different ways to improve the accuracy. In general, if the error is due to a segmentation error from the segmentation model, it indeed has to be fixed first by retraining the segmentation model - this would be the way to fix "1. foot/header appears in abstract". |
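For illustration, the "annotate failing examples, then retrain" loop described in the comment above could look roughly like the sketch below. It is only a sketch: the commands are paraphrased from the GROBID training documentation, and the jar name/version, directory names and corpus layout are assumptions that should be double-checked against a local checkout.

```python
# Sketch of the retraining loop: generate pre-annotated training files for PDFs
# that the current model gets wrong, correct them manually, then retrain.
# Run from the root of a local GROBID checkout; jar version and folder names are placeholders.
import subprocess

GROBID_JAR = "grobid-core/build/libs/grobid-core-0.7.0-onejar.jar"  # adjust to your build

# 1. Generate pre-annotated training files (e.g. *.training.header.tei.xml) for the
#    failing PDFs; these are corrected manually before being added to the corpus.
subprocess.run(
    ["java", "-Xmx4G", "-jar", GROBID_JAR,
     "-gH", "grobid-home",
     "-dIn", "failing_pdfs",          # PDFs the current model processes incorrectly
     "-dOut", "training_candidates",  # generated training files to review and correct
     "-exe", "createTraining"],
    check=True,
)

# 2. After manual correction, copy the files into the corresponding corpus folders
#    under grobid-trainer/resources/dataset/ (see the training docs for the exact layout).

# 3. Retrain, segmentation model first if the errors come from the segmentation step.
subprocess.run(["./gradlew", "train_segmentation"], check=True)
subprocess.run(["./gradlew", "train_header"], check=True)
```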
Thanks a lot for the detailed reply! |
Yes you're correct, sorry, my fault! The header/footer on the first page should be included in the header parts, because they usually contain metadata that is important to extract with the header! Thanks for the annotations, I'll try to have a look soon for feedback. |
Here is the review. I made corrections in every document; some of them had significant problems in the authors/affiliation/email sequences. |
Hi, thanks for your detailed annotation. We are interested in the title and abstract only, so we only annotated those parts. Do you know whether it affects the training if we only annotate parts of the documents? |
My experience is that errors and incomplete annotations significantly affect the other annotations. By introducing more labels, we improve the accuracy globally, because we enrich the representation of the contexts, make the learning more efficient and decrease sources of ambiguity. For instance in NER, to improve the recognition of dates, we typically annotate other numerical entities (like currencies, quantities, reference markers, ...). In our case, for instance, by explicitly labeling keywords and meeting places/venues, we normally improve the accuracy of identifying the title and abstract for the same amount of training data. |
For info I've added the 12 corrected header xml in the training data, together with 12 bioRxiv headers, and got improvement in the header f-score results for both the PMC and bioRxiv evaluation sets. In your initial set of 12 headers for instance, I corrected 3 titles (xml/2104.06550v1.training.header.tei.xml, 2104.06800v1.training.header.tei.xml and xml/2104.10542v1.training.header.tei.xml). It would mean that your additional training data have a 75% precision on the title field, considering the Grobid annotation guidelines. I think it's not possible to improve the model, which already provides above 90% title field accuracy, without higher-quality annotations than the current performance. |
Hi Patrice,
Did you get improvement for both abstract and title? Btw, is there anywhere I could find how you obtained your evaluation set?
|
Hello !
About the evaluation sets, it's described at https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/
The PMC and bioRxiv samples are holdout sets (kept out of the different training data) and stable over time. It's an end-to-end evaluation, so starting from the PDF and comparing the final TEI XML results with the JATS files. The PMC sample set is from Alexandru Constantin's Ph.D. work (PDFX) and the bioRxiv set has been created by Daniel Ecer.
Yes, I got improvement for both abstract and title, but not a lot for PMC:
- PMC set, before: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/doc/PMC_sample_1943.results.grobid-0.7.0-SNAPSHOT-Glutton-DeLFT-WAPITI-MIXED-BidLSTM-CRF-FEATURES-CITATION-09.06.2021
  after: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/doc/PMC_sample_1943.results.grobid-0.7.0-SNAPSHOT-Glutton-WAPITI-29.06.2021
- bioRxiv set, before: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/doc/bioRxiv_test_2000.results.grobid-0.6.2-SNAPSHOT-Glutton-DeLFT-WAPITI-MIXED-SciBERT-01.11.2020
  after: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/doc/bioRxiv_test_2000.results.grobid-0.7-0-SNAPSHOT-Glutton-DeLFT-WAPITI-MIXED-BidLSTM-CRF-FEATURES-HEADER_CITATIONS-29.06.2021
Last weekend I made an additional "training data" effort (a few dozen examples in segmentation and header, originally failing), and the results are again better. So we should be able to continue improving the current header model (independently from improving its design/implementation), just by augmenting the training data like this. |
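As a side note, a very stripped-down version of this kind of field-level comparison can be sketched as follows. This is not the actual evaluation code: the file names, the XPath expressions on the TEI and JATS files, and the crude text normalization are assumptions for illustration only.

```python
# Sketch: compare the title extracted by GROBID (TEI XML) against the JATS reference,
# in the spirit of the end-to-end evaluation described above. Names and XPaths are illustrative.
import re
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def normalize(text):
    """Lowercase and collapse whitespace so layout differences don't count as errors."""
    return re.sub(r"\s+", " ", text or "").strip().lower()

def tei_title(tei_path):
    root = etree.parse(tei_path).getroot()
    return normalize(" ".join(root.xpath("//tei:titleStmt/tei:title//text()", namespaces=TEI_NS)))

def jats_title(jats_path):
    root = etree.parse(jats_path).getroot()
    return normalize(" ".join(root.xpath("//front//title-group/article-title//text()")))

# Example: exact-match accuracy for the title field over a small holdout set (hypothetical files).
pairs = [("doc1.grobid.tei.xml", "doc1.jats.xml"), ("doc2.grobid.tei.xml", "doc2.jats.xml")]
matches = sum(tei_title(tei) == jats_title(jats) for tei, jats in pairs)
print(f"title exact-match accuracy: {matches / len(pairs):.2%}")
```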
Hi Patrice,
I saw you mentioned in the end-to-end evaluation that you used Pub2TEI to generate the ground truth. Do you also use that method in training, to find what Grobid annotated wrong?
|
Hi ! I think it's a good idea. But I am not even at this point... I simply downloaded some random PMC, bioRxiv and arXiv PDF files, processed them and selected a few dozen with an empty title, empty authors and/or an empty abstract (which is a good sign that something very bad happened :).
I added 65 header files and 62 segmentation files in last week's effort, so an increase of 10% of the training data. For headers, on the PMC sample set it led to around +1.0 F-score on average and on bioRxiv to around +4 (there were around 30 files from bioRxiv, because it was not really represented in the training data so far). Except for reference parsing, I didn't get this kind of improvement after working on deep learning models for 3 years :D |
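As an illustration of that selection step, a minimal sketch could look like the code below. It assumes a GROBID service running locally on the default port 8070 and its /api/processHeaderDocument endpoint; the folder name and the XPath queries on the returned TEI header are assumptions, not taken from the comment above.

```python
# Sketch: flag PDFs whose extracted header has an empty title, authors or abstract,
# as candidates for new training data. Requires the requests and lxml packages.
import pathlib
import requests
from lxml import etree

GROBID_URL = "http://localhost:8070/api/processHeaderDocument"
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def header_fields(pdf_path):
    """Send one PDF to the GROBID header service and return (title, authors, abstract)."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=60)
    resp.raise_for_status()
    root = etree.fromstring(resp.content)
    def get(xp):
        return " ".join(root.xpath(xp, namespaces=TEI_NS)).strip()
    return (get("//tei:titleStmt/tei:title//text()"),
            get("//tei:sourceDesc//tei:author//text()"),
            get("//tei:profileDesc/tei:abstract//text()"))

candidates = []
for pdf in sorted(pathlib.Path("downloaded_pdfs").glob("*.pdf")):  # placeholder folder
    title, authors, abstract = header_fields(pdf)
    if not title or not authors or not abstract:
        candidates.append(pdf)  # an empty field usually means something went very wrong

print(f"{len(candidates)} PDFs selected as candidates for manual annotation")
```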
Thanks! Have you uploaded the new models to GitHub?
|
yes |
Hi, we just finished a small-scale data annotation for the header model with arXiv papers (30 PDFs which were annotated wrong by the header model) and retrained the header model with the original data plus the newly annotated ones. Unfortunately, we found that the accuracy stays the same on our generated test set (1000 arXiv papers). In fact, we found the accuracy barely increases (~88% for abstract, 95% for header) after 300 training data points. We are wondering whether it is even possible to increase the accuracy by training with more data. If so, any suggestions on designing the dataset (i.e. any rule for selecting examples)?
Also, we found most of the errors of the header model came from:
1. foot/header appears in the abstract
2. title not identified
3. keywords/CCS in the abstract
4. the upper model (segmentation model) being wrong
After annotating papers with the first 3 issues and retraining, some of the papers with these issues got corrected, whereas some new PDFs (originally correct) got annotated wrong with these issues. Do you know how to explain this? Shouldn't these issues be fixed after retraining (or at least, no new errors of this type should appear)?
Plus, as mentioned, some errors come from the segmentation model. Do you think it would be helpful to retrain the segmentation model?
Looking forward to your reply and any plans for improving the header model.