Feature request: support for schema.org/Article in ArticleExtractor #5

EgbertW · 2015-07-13T11:04:25Z

(updated)

HTML5 introduces microdata by adding the attributes itemscope, itemid, itemtype, itemprop and itemref. These attributes provide valuable information about the semantic role of the parts of a document. This information can also be very useful for parsing the contents of a website as the author intended, rather than estimating that intent with statistical or other heuristics.

An effort to standardize the values of these attributes is available at http://schema.org/, which defines various types of documents, such as Article: http://schema.org/Article

One website I encountered that uses this effectively is http://tweakers.net/. The ArticleExtractor itself does a poor job on this website: in addition to the article text, it also includes several (but not all) user comments.

In my setup, I currently work around this by first checking for any HTML element with an itemprop=articleBody or itemprop=description attribute and using its text when available, rather than invoking BoilerPipe. It would be great if this knowledge could somehow be incorporated into a library such as BoilerPipe, which focuses on extracting the article from such an HTML document.
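To make the workaround concrete, here is a minimal sketch of the idea in Python, using only the standard library. It is not BoilerPipe code and not my exact setup: the names `ItempropTextExtractor` and `extract_article` are hypothetical, and `fallback` is a placeholder for whatever extractor (for example a BoilerPipe call) you would otherwise use. It assumes reasonably well-nested markup and takes the first element in document order whose itemprop contains articleBody or description.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ItempropTextExtractor(HTMLParser):
    """Collects the text inside the first element whose itemprop matches.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.depth = 0      # nesting depth inside the matched element (0 = outside)
        self.done = False   # True once the first matching element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            return
        # itemprop may hold several space-separated property names.
        props = (dict(attrs).get("itemprop") or "").split()
        if self.wanted.intersection(props):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_article(html, fallback):
    """Return microdata-marked text if present, else the result of fallback(html)."""
    parser = ItempropTextExtractor(["articleBody", "description"])
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    text = " ".join("".join(parser.chunks).split())
    return text if text else fallback(html)
```

Something like `extract_article(page_html, run_boilerpipe)` would then only reach the heuristic extractor when the page carries no usable microdata (`run_boilerpipe` is again a made-up name). Note that this sketch ignores values stored in attributes, such as `<meta itemprop="description" content="...">`, and does not check the enclosing itemtype.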
