Fix Issue #42 to improve content identification #47

nzv8fan · 2016-05-11T03:48:07Z

When the text inside a p tag contains formatting tags (e.g. strong, b, em), the weightChildNodes method in ArticleTextExtractor ignores the text inside those tags. This change uses all text under a p tag, including text in subtags, to calculate the child's weight score, which in turn improves identification of the main text area. A test case has been added to show this working.
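To illustrate the difference, here is a minimal, self-contained sketch. It is not the actual snacktory code: `Node`, `weightByOwnText`, and `weightByAllText` are hypothetical names, and using raw text length as the weight is a simplifying assumption made only to show why text inside subtags matters.

```java
import java.util.ArrayList;
import java.util.List;

class WeightSketch {
    /** Hypothetical stand-in for an HTML element (not jsoup's Element). */
    static final class Node {
        final String tag;
        final String ownText; // text directly inside this tag, excluding subtags
        final List<Node> children = new ArrayList<>();

        Node(String tag, String ownText) {
            this.tag = tag;
            this.ownText = ownText;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }

        /** Own text plus the text of all descendants (order is not preserved; only the length matters here). */
        String allText() {
            StringBuilder sb = new StringBuilder(ownText);
            for (Node child : children) {
                sb.append(child.allText());
            }
            return sb.toString();
        }
    }

    /** Assumed old behaviour: only the paragraph's direct text counts. */
    static int weightByOwnText(Node p) {
        return p.ownText.length();
    }

    /** Assumed new behaviour: text in strong, b, em, etc. counts as well. */
    static int weightByAllText(Node p) {
        return p.allText().length();
    }

    public static void main(String[] args) {
        // Models <p>Read <strong>this important paragraph</strong></p>
        Node p = new Node("p", "Read ")
                .add(new Node("strong", "this important paragraph"));
        System.out.println("own text only: " + weightByOwnText(p));
        System.out.println("all text:      " + weightByAllText(p));
    }
}
```

A paragraph whose content is mostly wrapped in formatting tags scores close to zero under the first method and receives its full weight under the second.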


karussell · 2016-05-11T08:45:52Z

Thanks!

As I no longer maintain this project, would you mind looking at the other failing test cases? You may simply have to remove them, since some URLs are no longer valid, or find replacement pages on those domains.


Change to weightChildNodes method of ArticleTextExtractor - weight for text in p tags will include all text in subtags, e.g. strong, em tags

0326a5f

nzv8fan mentioned this pull request May 11, 2016

Many websites only extract partial content #42

Open

karussell merged commit cb24ab4 into karussell:master May 11, 2016

nzv8fan deleted the issue42 branch May 12, 2016 06:16

rborer mentioned this pull request May 23, 2016

Karussel merge finity-ai/snacktory#8

Merged

rborer pushed a commit to finity-ai/snacktory that referenced this pull request Aug 12, 2016

Fix testBizJournal test

c2bc38a

We get slightly less content with our implementation (probably because of karussell#47).
