When using HTMLHighlighter, boilerpipe sometimes keeps artifacts coming
from FORM and LABEL tags.
This can easily be prevented by adding a new ignorable element to the TAG_ACTIONS
map in HTMLHighlighter.java:
TAG_ACTIONS.put("FORM", TA_IGNORABLE_ELEMENT);
Original issue reported on code.google.com by xavi.beu...@gmail.com on 24 Mar 2012 at 6:40
Adding FORM as an ignorable element in the highlighter (but not in the
Extractor itself) has two disadvantages:
1. The highlighted HTML will no longer be consistent with the TextDocument's content
information (i.e., the rule should rather go into DefaultTagActionMap).
2. FORM may span many otherwise relevant text blocks. Adding FORM as
TA_IGNORABLE_ELEMENT significantly reduces extraction accuracy (with L3S-GN1,
avg. token-level F1 goes down by about 5% to 7%).
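The second point can be illustrated with a minimal, self-contained sketch. It does not use boilerpipe and is not how HTMLHighlighter is implemented. The class IgnorableTagSketch, the method extractText, and the sample markup are hypothetical, and the regex-based tag scan is a deliberate simplification (no comments, scripts, or malformed nesting). It only shows that skipping everything inside an ignorable element removes the LABEL artifact together with any relevant text the FORM happens to wrap.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of boilerpipe.
public class IgnorableTagSketch {
    // Matches opening and closing tags; group 1 is "/" for closing tags, group 2 the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns the text of html, dropping everything nested inside any tag listed in ignorable.
    public static String extractText(String html, Set<String> ignorable) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int depth = 0; // nesting depth inside ignorable elements
        int pos = 0;   // end of the previously matched tag
        while (m.find()) {
            if (depth == 0) {
                // Keep the text between the previous tag and this one.
                out.append(html, pos, m.start()).append(' ');
            }
            pos = m.end();
            String name = m.group(2).toUpperCase(Locale.ROOT);
            if (ignorable.contains(name)) {
                if (m.group(1).isEmpty()) {
                    depth++;
                } else if (depth > 0) {
                    depth--;
                }
            }
        }
        if (depth == 0) {
            out.append(html.substring(pos));
        }
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Made-up page: the FORM wraps both a LABEL artifact and relevant article text.
        String page = "<form><label>Search</label><p>Main article text.</p></form>"
                + "<p>Footer note.</p>";
        System.out.println(extractText(page, Set.of()));       // Search Main article text. Footer note.
        System.out.println(extractText(page, Set.of("FORM"))); // Footer note.
    }
}
```

With FORM ignored, the "Search" label is gone, but so is "Main article text.", which is the kind of loss that would lower token-level F1 on pages where a FORM wraps the main content.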
Could you please give examples (URLs) where boilerpipe fails currently?
Please try the very latest version from trunk, or -- preferably -- the version
at http://boilerpipe-web.appspot.com/
Original comment by ckkohl79 on 25 Mar 2012 at 2:19