Consider #hash links in link density #646
Readability.js
Outdated
var href = linkNode.getAttribute("href");
var coefficient = href && href.match(this.REGEXPS.hashUrl) ? 0.2 : 1;
linkLength += this._getInnerText(linkNode).length * coefficient;
Interesting - why not drop scoring for these links completely (i.e. why 0.2 rather than 0, or wrapping the linkLength += statement in an if condition testing the href)?
I originally had it that way. But after testing edge cases and playing with isList detection, I decided to add a coefficient.
For example, when an element has a mix of hash links and full links, you want to be more sensitive than just treating hash links as plain text.
After more testing I changed the coefficient to 0.3 so it sits between the 0.5 threshold for links and 0.2 for other content. That way it works really well with the test cases and has only 1 false positive.
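The weighting discussed in this thread can be sketched without a DOM. This is only an illustration under stated assumptions: `weightedLinkDensity`, `HASH_URL`, `HASH_LINK_WEIGHT` and the plain `{ href, text }` link objects are hypothetical names, not part of Readability.js, and the pattern `/^#.+/` is an assumed stand-in for `this.REGEXPS.hashUrl`.

```javascript
// Hypothetical, DOM-free sketch; none of these names are Readability.js API.
var HASH_URL = /^#.+/;       // assumed pattern: "#" followed by at least one character
var HASH_LINK_WEIGHT = 0.3;  // coefficient mentioned in the comment above

// links: array of { href: string|null, text: string }
// textLength: length of all text inside the element being scored
function weightedLinkDensity(links, textLength) {
  if (textLength === 0) {
    return 0;
  }
  var linkLength = 0;
  links.forEach(function(link) {
    // On-page hash links count with reduced weight, all other links count fully.
    var coefficient = link.href && HASH_URL.test(link.href) ? HASH_LINK_WEIGHT : 1;
    linkLength += link.text.length * coefficient;
  });
  return linkLength / textLength;
}

// A ToC-like element: every link points inside the page.
var toc = [
  { href: "#intro", text: "Introduction" },
  { href: "#usage", text: "Usage" }
];

// A menu-like element: every link leaves the page.
var menu = [
  { href: "/home", text: "Home" },
  { href: "/about", text: "About us" }
];

var tocDensity = weightedLinkDensity(toc, 20);
var menuDensity = weightedLinkDensity(menu, 15);

console.log("ToC density:", tocDensity);
console.log("Menu density:", menuDensity);
```

With a weight of 0 the hash links would be treated as plain text, which is the alternative raised in the review; a weight between 0 and 1 keeps mixed blocks of hash links and full links somewhat sensitive to their link content.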
Readability.js
Outdated
@@ -1745,7 +1746,9 @@ Readability.prototype = {

    // XXX implement _reduceNodeList?
    this._forEachNode(element.getElementsByTagName("a"), function(linkNode) {
      linkLength += this._getInnerText(linkNode).length;
      var href = linkNode.getAttribute("href");
      var coefficient = href && href.match(this.REGEXPS.hashUrl) ? 0.2 : 1;
Nit: this.REGEXPS.hashUrl.test(href) is faster (and this code can be hot, so it makes sense to use it here).
In fact, should we just use href.startsWith("#") here? That accomplishes the same thing, right?
The regexp needs to be there because I found many cases where the HTML contains <a href="#"> and the navigation is managed by JS. Those are quite often buttons and not proper hash links.
Changed it to this.REGEXPS.hashUrl.test(href) in the latest commit.
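To make the difference between the two checks concrete, here is a small comparison. It assumes the pattern behind `this.REGEXPS.hashUrl` is `/^#.+/` (a `#` followed by at least one character); that pattern and the sample hrefs are assumptions for illustration only.

```javascript
// Assumed stand-in for this.REGEXPS.hashUrl; the real pattern may differ.
var hashUrl = /^#.+/;

// Sample href values, chosen for illustration.
var hrefs = ["#", "#section-2", "/page#section-2", ""];

var results = hrefs.map(function(href) {
  return {
    href: href,
    regexp: hashUrl.test(href),        // true only for "#" plus at least one character
    startsWith: href.startsWith("#")   // true for any href beginning with "#"
  };
});

results.forEach(function(r) {
  console.log(JSON.stringify(r.href), "regexp:", r.regexp, "startsWith:", r.startsWith);
});
```

The two checks disagree only on the bare `"#"` href, which is the JS-driven button case described above: `startsWith("#")` would give it the reduced weight, while the regexp keeps counting it as a full link.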
Thank you for the PR! This looks really nice.
Oops, looks like something went missing here?
Thanks for this comprehensive explanation.
OK. Would you mind filing an issue for this after this PR lands?
Can we add the
Oops, forgot to mark as request changes.
I've played more with the ToC detection and managed to get it even better by fixing the Current effects on tests
Thanks!
Fixes #643
As discussed in the issue, a table of contents sometimes gets removed because it has high link density and so looks like a menu. But for readability it makes sense to keep ToCs.
This PR tackles that by checking whether a link is an on-page navigation (#hash) link and counting those toward linkDensity with just 20% weight. (That was selected because it plays well with the cleanConditionally rule and will )
Side effects on existing test cases:
mercurial: ✅ it correctly detected the ToC and kept it
mozilla-1: 🤔✅ it kept a menu that navigates only within the page and all links are working, so I would consider it correct
nytimes-1 & nytimes-2: story in id. The easiest way to tackle this would be to add ad- to REGEX.negative to counterweight the story. I've also tried to improve the list detection as discussed in the issue, and it can solve this problem, but it was causing a lot of other side effects that go well beyond the scope of this PR.
wikipedia & wikipedia-2: ✅ it correctly detected the ToC and kept it