PDF annotation: tables, figures, formulas should be excluded #22

Closed
lfoppiano opened this issue Mar 16, 2017 · 6 comments

@lfoppiano
Owner

lfoppiano commented Mar 16, 2017

I've noticed that, with the current amount of training data, tables and figures are incorrectly annotated. Example document: hal-00643787.

Here is an example of a figure:
screen shot 2017-03-16 at 11 03 36

Here is an example of a table:
screen shot 2017-03-16 at 11 07 30

And here is a formula:
screen shot 2017-03-16 at 11 05 23

@kermitt2
Collaborator

Yes, I was in a bit of a hurry, so the PDF processing via GROBID is not fine-grained yet. Right now the quantity parser is applied to the title, the abstract, the whole full text and the annexes (the rest is ignored).

To do: in the body, we should exclude formulas, all the reference markers present in the body (bibliographic references such as [Foppiano et al. 2017], as well as references to figures, formulas, etc.) and, for the moment, figures and tables.

@lfoppiano lfoppiano self-assigned this Mar 17, 2017
@lfoppiano lfoppiano changed the title from "PDF annotation: tables, figures, formulas should be excluded (at least for the moment)" to "PDF annotation: tables, figures, formulas should be excluded" Mar 17, 2017
@lfoppiano
Owner Author

OK, so I excluded <formula>, <table>, <equation> and <figure>
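
To illustrate the idea, here is a minimal sketch of label-based filtering. It is not the project's actual code (which is Java and works on GROBID's labelled tokens); the `LabelledSegment` type and the `segments_for_quantity_parsing` helper are hypothetical, and only the four excluded label names come from the comment above.

```python
from dataclasses import dataclass

# Labels excluded from quantity parsing, as listed in the comment above.
EXCLUDED_LABELS = {"<formula>", "<table>", "<equation>", "<figure>"}


@dataclass
class LabelledSegment:
    """Hypothetical stand-in for a run of body text with its label."""
    label: str
    text: str


def segments_for_quantity_parsing(segments):
    """Return the text of every segment whose label is not excluded, in order."""
    return [s.text for s in segments if s.label not in EXCLUDED_LABELS]


# Made-up body content, for illustration only.
body = [
    LabelledSegment("<paragraph>", "The sample was heated to 300 K."),
    LabelledSegment("<formula>", "E = m c^2"),
    LabelledSegment("<table>", "1.2 3.4 5.6"),
    LabelledSegment("<paragraph>", "The film was 20 nm thick."),
]
kept = segments_for_quantity_parsing(body)
```

Only the two paragraph segments end up in `kept`, so the quantity parser never sees the formula or the table content.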

@lfoppiano
Owner Author

Here is how a figure looks now:

screen shot 2017-03-17 at 16 57 07

@kermitt2
Collaborator

Nice!
Maybe we could also let the model annotate the caption and figure/table titles?

That would require applying the figure and table models to the LayoutTokens labelled with TaggingLabels.FIGURE and TaggingLabels.TABLE.
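
A rough sketch of what this could look like once a figure or table has been split into sub-parts: keep the title and caption text for annotation and drop the rest. The sub-label names (`title`, `caption`, `content`) and the `caption_text` helper are assumptions for illustration, not the labels the figure and table models actually emit.

```python
# Hypothetical sub-labels for the parts of a figure or table that
# should still be passed to the quantity parser.
CAPTION_PARTS = {"title", "caption"}


def caption_text(parts):
    """Join the text of the title/caption parts of one figure or table.

    `parts` is a list of (sub_label, text) pairs in reading order.
    """
    return " ".join(text for sub_label, text in parts if sub_label in CAPTION_PARTS)


# Made-up figure, for illustration only.
figure = [
    ("title", "Figure 2."),
    ("caption", "Resistivity measured at 4.2 K."),
    ("content", "0 50 100 150 200"),  # axis ticks and similar residue
]
result = caption_text(figure)
```

Here `result` contains the title and caption only, so a quantity such as "4.2 K" can still be annotated while the axis ticks in the figure content are ignored.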

@lfoppiano
Owner Author

Now only captions are processed:
screenshot 2018-11-11 at 14 26 09

@lfoppiano
Owner Author

I think this is solved. Tables are sometimes still annotated, but that is due more to the table model.
