[feature request] to skip FullTextParse on certain page region #950
Comments
Hi @frankang, thank you for the issue! The recognition of figure and table zones is indeed one of the two main problems with Grobid currently. I also think that figures and tables should be processed first, upstream of the rest of the text body parsing. I initially thought of using an R-CNN or LayoutLM approach for figures and tables (it works very well for these objects, though not as well as Grobid for the other, coarser structures), but this is heavy/slow and there's still the issue of correctly associating captions, figure/table titles, table notes, etc. So I started with a different approach. There is an ongoing branch to tackle this problem, called fix-vector-graphics. Despite the name of the branch, this is a relatively important redesign of the model cascading approach:
So this is consistent with your suggestion of addressing tables/figures as a first step, but tightly integrated with the usual cascading approach of Grobid. Progress is very slow, because I unfortunately have very little time for Grobid.
Thanks @kermitt2, looking forward to it.
@frankang The other main problem for me is that all the models lack training data! For example, there are only 40 training examples for the fulltext model... Each time I add a bit of new training data, the metrics in the end-to-end evaluation increase, so it's a bit frustrating that the tool is running with lower accuracy than it is capable of. So if I had 2 months to work only on Grobid, I would spend one fixing the figure/table extraction, and one just producing training data :)
The provided model cannot correctly classify some "vaguely" drawn Figures and Tables. In this case, the words in the Table region are treated as normal Text, which disrupts the normal reading order.
IMHO, one solution is to parse the PDF file and use certain rules to detect Figures and Tables, then pass this region information to Grobid to preempt FullTextParse on those "hard" parts. Another solution could be to expose an API for the sequence labeling task, so we can directly pass an ALTO (XML) file with those regions manually cleared and let Grobid finish the remaining procedures.
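To make the "region-cleared ALTO" idea more concrete, here is a minimal Python sketch of what the clearing step could look like on the user side. It assumes the figure/table regions have already been detected by some external rule-based step and are given as (x0, y0, x1, y1) rectangles in the same units as the ALTO coordinates. The function names (clear_regions, box_of, overlaps) and the sample document are hypothetical, and nothing here is an existing Grobid API: whether Grobid could accept such a file as input is exactly what this request is about. The only things taken from the ALTO format are the TextBlock element and its HPOS, VPOS, WIDTH and HEIGHT attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ALTO-like document with two text blocks.
SAMPLE = """<alto><Layout><Page><PrintSpace>
<TextBlock ID="b1" HPOS="50" VPOS="50" WIDTH="400" HEIGHT="100">
  <TextLine><String CONTENT="Body"/></TextLine>
</TextBlock>
<TextBlock ID="b2" HPOS="60" VPOS="300" WIDTH="200" HEIGHT="80">
  <TextLine><String CONTENT="Cell"/></TextLine>
</TextBlock>
</PrintSpace></Page></Layout></alto>"""


def local_name(tag):
    """Strip an XML namespace such as '{http://...}TextBlock' -> 'TextBlock'."""
    return tag.rsplit("}", 1)[-1]


def box_of(elem):
    """Return the (x0, y0, x1, y1) bounding box from ALTO position attributes."""
    x = float(elem.get("HPOS", 0))
    y = float(elem.get("VPOS", 0))
    w = float(elem.get("WIDTH", 0))
    h = float(elem.get("HEIGHT", 0))
    return (x, y, x + w, y + h)


def overlaps(a, b):
    """True if two (x0, y0, x1, y1) rectangles share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def clear_regions(alto_xml, regions):
    """Drop every TextBlock overlapping one of the given regions.

    Returns the cleaned XML as a string and the number of removed blocks.
    """
    root = ET.fromstring(alto_xml)
    removed = 0
    # Snapshot the tree first so removing children does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) != "TextBlock":
                continue
            if any(overlaps(box_of(child), r) for r in regions):
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


if __name__ == "__main__":
    # One assumed table region that covers the second block only.
    cleaned, removed = clear_regions(SAMPLE, [(50, 280, 300, 400)])
    print(removed)
    print(cleaned)
```

Removing whole blocks on any overlap is the simplest possible rule. A real detector would probably need a stricter criterion (for example, a minimum overlap ratio) so that body paragraphs that only touch a figure are not discarded.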