Ignore FORM tags in HTMLHighlighter #44

Open
GoogleCodeExporter opened this issue Jan 8, 2016 · 3 comments

Comments

@GoogleCodeExporter

When using HTMLHighlighter, boilerpipe sometimes keeps artifacts coming from 
FORM and LABEL tags.

This can easily be prevented by adding a new ignorable element to the 
TAG_ACTIONS map in HTMLHighlighter.java:

TAG_ACTIONS.put("FORM", TA_IGNORABLE_ELEMENT);


Original issue reported on code.google.com by xavi.beu...@gmail.com on 24 Mar 2012 at 6:40

@GoogleCodeExporter
Author

Issue 45 has been merged into this issue.

Original comment by ckkohl79 on 25 Mar 2012 at 2:12

@GoogleCodeExporter
Author

Adding FORM as an ignorable element in the highlighter (but not in the 
Extractor itself) has two disadvantages:

1. The highlighted HTML will no longer be consistent with the TextDocument's 
content information (i.e., the rule should rather go into DefaultTagActionMap).
2. FORM may span many otherwise relevant text blocks. Adding FORM as 
TA_IGNORABLE_ELEMENT significantly reduces extraction accuracy (with L3S-GN1, 
avg. token-level F1 goes down by about 5% to 7%).
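
To illustrate the second point, here is a minimal toy sketch of how an ignorable-element rule behaves when FORM wraps real content. It does not use boilerpipe's classes: the `IgnorableTagDemo` class, the `extract` method and the string-based event format are all hypothetical and only model the idea of skipping everything nested inside an ignorable tag. It assumes Java 9 or later for `List.of` and `Set.of`.

```java
import java.util.List;
import java.util.Set;

public class IgnorableTagDemo {

    /**
     * Toy extractor (hypothetical, not boilerpipe's implementation).
     * Events are "<TAG>" for an opening tag, "</TAG>" for a closing tag,
     * and any other string for a text node. Text is kept only while we
     * are not nested inside an ignorable element.
     */
    static String extract(List<String> events, Set<String> ignorable) {
        StringBuilder out = new StringBuilder();
        int ignoreDepth = 0; // number of open ignorable elements around us
        for (String e : events) {
            if (e.startsWith("</")) {
                String tag = e.substring(2, e.length() - 1);
                if (ignorable.contains(tag) && ignoreDepth > 0) {
                    ignoreDepth--;
                }
            } else if (e.startsWith("<")) {
                String tag = e.substring(1, e.length() - 1);
                if (ignorable.contains(tag)) {
                    ignoreDepth++;
                }
            } else if (ignoreDepth == 0) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A page whose FORM wraps both a search label and the article body.
        List<String> page = List.of(
                "<FORM>",
                "<LABEL>", "Search", "</LABEL>",
                "<P>", "Main article text.", "</P>",
                "</FORM>");

        // FORM not ignorable: the label artifact and the article both survive.
        System.out.println(extract(page, Set.of("SCRIPT")));

        // FORM ignorable: the artifact is gone, but so is the article text.
        System.out.println(extract(page, Set.of("SCRIPT", "FORM")));
    }
}
```

In this toy model, marking FORM as ignorable removes the "Search" artifact but also drops the paragraph nested in the same FORM, which is the kind of loss the accuracy figures above refer to.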

Could you please give examples (URLs) where boilerpipe fails currently?
Please try the very latest version from trunk, or -- preferably -- the version 
at http://boilerpipe-web.appspot.com/

Original comment by ckkohl79 on 25 Mar 2012 at 2:19

  • Added labels: Type-Enhancement
  • Removed labels: Type-Defect
