[feature request] to skip FullTextParse on certain page region #950

Open
frankang opened this issue Sep 22, 2022 · 3 comments
@frankang

The provided model cannot correctly categorize some "vaguely" plotted figures and tables. In such cases, the words in a table region are treated as normal text, which disrupts the normal reading order.
IMHO, one solution is to parse the PDF file and use certain rules to detect figures and tables, then pass this region information to Grobid to preempt FullTextParse on those "hard" parts. Another solution could be to expose an API for the sequence labeling task, so we can directly pass a manually region-cleared ALTO (XML) file and let Grobid finish the remaining steps.
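A minimal sketch of the first suggestion, assuming a simple token/region representation (the `Box` class, the `filter_tokens` helper, and the containment rule below are all illustrative, not part of Grobid's API):

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        # True if `other` lies entirely inside this box
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)

def filter_tokens(tokens, skip_regions):
    """Drop (text, box) tokens falling inside any detected figure/table
    region, so a downstream full-text parser never sees them."""
    return [(text, box) for text, box in tokens
            if not any(region.contains(box) for region in skip_regions)]
```

For example, with a detected table region `Box(0, 400, 300, 600)`, any token whose box lies inside that region is removed before the full-text stage runs, so table cells no longer pollute the reading order.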

@kermitt2
Owner

Hi @frankang

Thank you for the issue !

The recognition of figure and table zones is indeed one of the two main problems with Grobid currently. I also think that figures and tables should be processed first, upstream of the rest of the text body parsing.

I initially thought of using an R-CNN or LayoutLM approach for figures and tables (it works very well for these objects, though not so much for the other coarse ones compared to Grobid), but this is heavy/slow, and there is still the issue of correctly associating captions, figure/table titles, table notes, etc. So I started with a different approach.

There is an ongoing branch tackling this problem, called fix-vector-graphics. Despite the name, it is a relatively important redesign of the model cascading approach:

  • figure and table zones are identified prior to the segmentation model (the new models are called figure-segmentation and table-segmentation). These models are anchored on clustered graphic elements (vector graphics and bitmaps) and are trained to extend the zones up and down, eventually resulting in well-formed figure/table areas with captions, figures containing several images, etc., or rejecting graphics that are mere publisher decorations.

  • the segmentation model then applies to the content with these zones removed, followed by the full-text parser; both are simplified because the very noisy figure/table content is gone.

  • the branch comes with more advanced processing of vector graphics, to avoid heavy and possibly very slow steps like rasterizing.

So this is consistent with your suggestion of addressing tables/figures as a first step, but tightly integrated with Grobid's usual cascading approach. Progress is very slow, because I unfortunately have very little time for Grobid.
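Spelled out, the cascade order described above could be sketched as follows (a rough sketch with stand-in data structures and function names; the real figure-segmentation/table-segmentation models in the fix-vector-graphics branch are trained sequence-labeling models, not hand-written rules like these):

```python
def figure_segmentation(blocks):
    """Stand-in for the new figure-segmentation model: anchor zones on
    graphic blocks, then extend to an adjacent caption line."""
    zones = []
    for i, b in enumerate(blocks):
        if b["type"] == "graphic":
            zone = {i}
            # extend downward to attach a caption to the graphic anchor
            if i + 1 < len(blocks) and blocks[i + 1]["text"].startswith("Figure"):
                zone.add(i + 1)
            zones.append(zone)
    return zones

def cascade(blocks):
    """Step 1: identify figure/table zones before the segmentation model.
    Step 2: segmentation and full-text parsing apply only to what remains."""
    fig_zones = figure_segmentation(blocks)
    in_zone = set().union(*fig_zones) if fig_zones else set()
    body = [b for i, b in enumerate(blocks) if i not in in_zone]
    return fig_zones, body
```

The point of the ordering is visible in the return value: the body handed to the downstream models no longer contains the noisy graphic blocks or their captions, which is what simplifies the segmentation and full-text stages.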

@frankang
Author

Thanks @kermitt2 , looking forward to it.
BTW, what is the other "main problem" with Grobid? Just curious, since you said figure and table recognition is one of the two "main problems".

@kermitt2
Owner

@frankang The other main problem for me is that all the models lack training data! For example, there are only 40 training examples for the fulltext model... Each time I add a bit of new training data, the metrics in the end-to-end evaluation improve, so it's a bit frustrating that the tool runs at lower accuracy than it is capable of.

So if I had 2 months to work only on Grobid, I would spend one to fix the figure/table extraction, and one just producing training data :)
