This repository has been archived by the owner on Mar 9, 2021. It is now read-only.

Fix Issue #42 to improve content identification #47

Merged (1 commit) on May 11, 2016

Conversation

nzv8fan
Contributor

@nzv8fan nzv8fan commented May 11, 2016

When the text in a p tag contains inline formatting tags (e.g., strong, b, em), weightChildNodes in ArticleTextExtractor ignores the text inside those tags. This change uses all text under a p tag, including text in nested tags, to calculate the child's weight score, which in turn improves identification of the main text area. An additional test case has been added to show this working.
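To illustrate the difference described above, here is a minimal sketch, not the actual patch. It assumes a tiny hypothetical `Node` model in place of a real HTML parser, and the names `ownText()` and `allText()` are illustrative only (they loosely mirror the own-text versus full-text distinction that HTML parsers such as jsoup expose). Using the character count as a stand-in for the weight score is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed HTML tree; not snacktory's or jsoup's API.
class OwnTextVsAllText {

    static final class Node {
        final String tag;   // null for a plain text node
        final String text;  // non-null only for text nodes
        final List<Node> children = new ArrayList<>();

        private Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        static Node text(String s) {
            return new Node(null, s);
        }

        static Node element(String tag, Node... kids) {
            Node n = new Node(tag, null);
            for (Node k : kids) {
                n.children.add(k);
            }
            return n;
        }

        // Text held directly by this element; text in nested tags is skipped.
        String ownText() {
            StringBuilder sb = new StringBuilder();
            for (Node c : children) {
                if (c.tag == null) {
                    sb.append(c.text);
                }
            }
            return sb.toString();
        }

        // All text under this element, including text inside nested tags.
        String allText() {
            if (tag == null) {
                return text;
            }
            StringBuilder sb = new StringBuilder();
            for (Node c : children) {
                sb.append(c.allText());
            }
            return sb.toString();
        }
    }

    // <p>The <strong>quick brown fox</strong> jumps <em>over</em> the dog.</p>
    static Node sampleParagraph() {
        return Node.element("p",
                Node.text("The "),
                Node.element("strong", Node.text("quick brown fox")),
                Node.text(" jumps "),
                Node.element("em", Node.text("over")),
                Node.text(" the dog."));
    }

    public static void main(String[] args) {
        Node p = sampleParagraph();
        // Assumption: the character count stands in for the real weight score.
        System.out.println("own text length: " + p.ownText().length());
        System.out.println("all text length: " + p.allText().length());
    }
}
```

In this sketch the own-text variant misses the words inside `strong` and `em`, so a paragraph with heavy inline formatting would be under-weighted relative to an unformatted paragraph of the same visible length.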

…r text in p tags will include all text in subtags, e.g. strong, em tags
@karussell
Owner

Thanks!

As I no longer maintain this project: would you mind looking at the other failing test cases? You may just have to remove them, as some URLs are no longer valid, or find replacement pages on those domains.

@karussell karussell merged commit cb24ab4 into karussell:master May 11, 2016
@nzv8fan nzv8fan deleted the issue42 branch May 12, 2016 06:16
rborer pushed a commit to finity-ai/snacktory that referenced this pull request Aug 12, 2016
We get slightly less content with our implementation (probably because of karussell#47).