Consider #hash links in link density #646

Merged
5 commits merged into mozilla:master on Nov 23, 2020

Conversation

jakubriedl (Contributor)

Fixes #643

As discussed in the issue, the Table of Contents sometimes gets removed because it has a high link density and so looks like a menu. But for readability it makes sense to keep ToCs.

This PR tackles that by checking whether a link is on-page navigation (a #hash link) and counting those into linkDensity with just 20% weight. (That was selected because it plays well with the cleanConditionally rule and will )
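
In outline, the change scales the counted link text by a small coefficient whenever the href is a same-page #hash link. A minimal, simplified sketch of the idea (getInnerText is a stand-in for Readability's _getInnerText, /^#.+/ is the assumed shape of REGEXPS.hashUrl, and the real change to _getLinkDensity is quoted in the diff excerpts further down):

// Sketch only; mirrors the shape of Readability's _getLinkDensity, not the exact code.
function getInnerText(node) {
  // Simplified stand-in for Readability's _getInnerText helper.
  return node.textContent.trim();
}

function getLinkDensity(element) {
  var textLength = getInnerText(element).length;
  if (textLength === 0) {
    return 0;
  }

  var linkLength = 0;
  var links = element.getElementsByTagName("a");
  for (var i = 0; i < links.length; i++) {
    var href = links[i].getAttribute("href");
    // Same-page navigation (e.g. a ToC entry) only counts with 20% weight,
    // so a legitimate table of contents no longer looks like a link farm.
    // (The coefficient is later tuned to 0.3 in this PR.)
    var coefficient = href && /^#.+/.test(href) ? 0.2 : 1;
    linkLength += getInnerText(links[i]).length * coefficient;
  }

  return linkLength / textLength;
}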

Side effects on existing test cases:

  • mercurial
    ✅ it correctly detected ToC and kept it
    ⚠️ but after that the first paragraph changed and was detected as the excerpt. That is not quite correct, but it is more a problem of excerpt detection than of whether the P should be kept. I would recommend tackling it in a different PR.
  • mozilla-1
    🤔✅ it kept a menu that navigates only within the page, and all its links work, so I would consider it correct
  • nytimes-1 & nytimes-2
    ⚠️ it kept ads that point to headings on the page. That is also because they have story in their id. The easiest way to tackle this would be to add -ad- to REGEX.negative to counterweight the story. I've also tried to improve the list detection as discussed in the issue; it can solve this case, but it caused a lot of other side effects that go well beyond the scope of this PR.
  • wikipedia & wikipedia-2
    ✅ it correctly detected ToC and kept it

Readability.js Outdated
Comment on lines 1749 to 1751
var href = linkNode.getAttribute("href");
var coefficient = href && href.match(this.REGEXPS.hashUrl) ? 0.2 : 1;
linkLength += this._getInnerText(linkNode).length * coefficient;
Contributor

Interesting - why not drop scoring for these links completely (i.e. why 0.2 rather than 0, or why not wrap the linkLength += statement in an if condition testing the href)?

Contributor Author

I originally had it that way, but after testing edge cases and playing with the isList detection I decided to add a coefficient instead.
For example, when an element has a mix of hash links and full links you want to be more sensitive than you would be by simply treating hash links as plain text.

Contributor Author

After more testing I changed the coefficient to 0.3, so it sits between the 0.5 threshold for links and the 0.2 threshold for other content. That way it works really well with the test cases and has only 1 false positive.
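
To make the numbers concrete, here is a small illustrative calculation (the figures are made up rather than taken from any test fixture; 0.2 and 0.5 are the thresholds mentioned above):

// A ToC-like list: 400 characters of text, 320 of them inside <a href="#..."> links.
var textLength = 400;
var hashLinkLength = 320;

// Old behaviour: hash-link text counts in full.
var oldDensity = hashLinkLength / textLength;          // 0.8 -> over the 0.5 threshold, node removed

// New behaviour: hash-link text is scaled by the 0.3 coefficient.
var newDensity = (hashLinkLength * 0.3) / textLength;  // 0.24 -> under 0.5 (though still above 0.2)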

Readability.js Outdated
@@ -1745,7 +1746,9 @@ Readability.prototype = {

   // XXX implement _reduceNodeList?
   this._forEachNode(element.getElementsByTagName("a"), function(linkNode) {
-    linkLength += this._getInnerText(linkNode).length;
+    var href = linkNode.getAttribute("href");
+    var coefficient = href && href.match(this.REGEXPS.hashUrl) ? 0.2 : 1;
Contributor

Nit: this.REGEXPS.hashUrl.test(href) is faster (and this code can be hot, so it makes sense to use it here).

In fact, should we just use href.startsWith("#") here? That accomplishes the same thing, right?

jakubriedl (Contributor Author) commented Nov 20, 2020

The regexp needs to be there because I found many cases where the HTML contains <a href="#"> and the navigation is managed by JS. Those are quite often buttons rather than proper hash links.
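
In other words, the pattern has to require at least one character after the #, so a bare href="#" still counts as an ordinary link. Presumably something along these lines (an assumed shape; the exact pattern is in REGEXPS in the PR):

// Assumed shape of the pattern: matches "#section-2" but not a bare "#".
var hashUrl = /^#.+/;

hashUrl.test("#table-of-contents");             // true  -> weighted as in-page navigation
hashUrl.test("#");                              // false -> treated like any other link
hashUrl.test("https://example.com/p#fragment"); // false -> not a same-page href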

Contributor Author

Changed it to this.REGEXPS.hashUrl.test(href) in the latest commit.

gijsk (Contributor) commented Nov 19, 2020

> Fixes #643
>
> As discussed in the issue, the Table of Contents sometimes gets removed because it has a high link density and so looks like a menu. But for readability it makes sense to keep ToCs.

Thank you for the PR! This looks really nice.

> This PR tackles that by checking whether a link is on-page navigation (a #hash link) and counting those into linkDensity with just 20% weight. (That was selected because it plays well with the cleanConditionally rule and will )

Oops, looks like something went missing here?

> Side effects on existing test cases:

Thanks for this comprehensive explanation.

> * `mercurial`
>   ✅ it correctly detected ToC and kept it
>   ⚠️ but after that the first paragraph changed and was detected as the excerpt. That is not quite correct, but it is more a problem of excerpt detection than of whether the P should be kept. I would recommend tackling it in a different PR.

OK. Would you mind filing an issue for this after this PR lands?

> * `mozilla-1`
>   🤔✅ it kept a menu that navigates only within the page, and all its links work, so I would consider it correct
>
> * `nytimes-1` & `nytimes-2`
>   ⚠️ it kept ads that point to headings on the page. That is also because they have `story` in their `id`. The easiest way to tackle this would be to add `-ad-` to `REGEX.negative` to counterweight the `story`. I've also tried to improve the list detection as discussed in the issue; it can solve this case, but it caused a lot of other side effects that go well beyond the scope of this PR.

Can we add the -ad- thing to this PR so the nytimes testcase remains unchanged?
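
For context, that would mean extending REGEXPS.negative with an -ad- alternative so ad containers are penalized even when their ids also contain story. A rough, abridged sketch (the real pattern has many more alternatives):

// Abridged sketch, not the full Readability.js pattern: the "-ad-" alternative
// counter-weights ids like "story-ad-wrapper", which would otherwise score
// positively because they contain "story".
var negative = /-ad-|hidden|banner|comment|footer|masthead|promo|related|sidebar|sponsor/i;

negative.test("story-ad-wrapper"); // true -> the node gets a negative class weight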

> * `wikipedia` & `wikipedia-2`
>   ✅ it correctly detected ToC and kept it

gijsk (Contributor) left a comment

Oops, forgot to mark as request changes.

jakubriedl (Contributor Author) commented Nov 20, 2020

I've played with the ToC detection some more and managed to get it even better by fixing the isList condition and tweaking the params. That (mostly) solved the issues with nytimes-1 & nytimes-2 and further improved wikipedia and one other case.

Current effects on tests

  • bug-1255978
    ✅ this actually fixed a case where a leftover (a menu heading) remained from a removed menu.
  • mercurial
    ✅ + ⚠️ as mentioned in the original comment; I will raise an issue once this gets merged
  • mozilla-1
    ✅ as mentioned in original comment
  • nytimes-2
    ⚠️ it now included a block which is navigation for the article. That is because weight = 50 and linkDensity dropped below 0.5 (see the sketch after this list). I don't think we can do anything more with this case.
  • nytimes-4
    🤔 ✅ it kept the date in the header, which is sort of correct. I don't know why it was removed before
  • wikipedia (1,2,3)
    ✅ kept correct ToC and Cite references
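
For reference, the nytimes-2 behaviour follows from the two link-density clauses referenced in this thread: high-weight nodes are removed when linkDensity exceeds 0.5, low-weight ones when it exceeds 0.2. A hedged paraphrase of just those clauses (the weight cut-off of 25 is assumed here; the real _cleanConditionally check combines these with several other heuristics):

// Paraphrase of the two link-density clauses only; not the full condition.
function failsLinkDensityCheck(weight, linkDensity) {
  return (weight < 25 && linkDensity > 0.2) ||
         (weight >= 25 && linkDensity > 0.5);
}

// nytimes-2 case described above: weight = 50 and linkDensity just below 0.5.
failsLinkDensityCheck(50, 0.49); // false -> the navigation block is kept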

gijsk merged commit 3c83389 into mozilla:master on Nov 23, 2020
gijsk (Contributor) commented Nov 23, 2020

Thanks!
