Consider #hash links in link density #646

Merged
5 commits merged into mozilla:master on Nov 23, 2020

Conversation

jakubriedl (Contributor)

Fixes #643

As discussed in the issue, the Table of Contents sometimes gets removed because it has a high link density and so looks like a menu. But for readability it makes sense to keep ToCs.

This PR tackles that by checking whether a link is on-page navigation (a #hash link) and counting those into linkDensity with just 20% weight. (That was selected because it plays well with the cleanConditionally rule and will )
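
In outline, the change scales the counted link text by a small coefficient whenever the href is a same-page #hash link. A minimal, simplified sketch of the idea (getInnerText is a stand-in for Readability's _getInnerText, /^#.+/ is the assumed shape of REGEXPS.hashUrl, and the real change to _getLinkDensity is quoted in the diff excerpts further down):

// Sketch only; mirrors the shape of Readability's _getLinkDensity, not the exact code.
function getInnerText(node) {
  // Simplified stand-in for Readability's _getInnerText helper.
  return node.textContent.trim();
}

function getLinkDensity(element) {
  var textLength = getInnerText(element).length;
  if (textLength === 0) {
    return 0;
  }

  var linkLength = 0;
  var links = element.getElementsByTagName("a");
  for (var i = 0; i < links.length; i++) {
    var href = links[i].getAttribute("href");
    // Same-page navigation (e.g. a ToC entry) only counts with 20% weight,
    // so a legitimate table of contents no longer looks like a link farm.
    // (The coefficient is later tuned to 0.3 in this PR.)
    var coefficient = href && /^#.+/.test(href) ? 0.2 : 1;
    linkLength += getInnerText(links[i]).length * coefficient;
  }

  return linkLength / textLength;
}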

Side effects on existing test cases:

  • mercurial
    ✅ it correctly detected ToC and kept it
    ⚠️ but after that the first paragraph changed and was detected as the excerpt. That is not quite correct, but it is more a problem of excerpt detection than of whether the P should be kept. I would recommend tackling it in a different PR.
  • mozilla-1
    🤔✅ it kept a menu that navigates only within the page, and all its links work, so I would consider it correct
  • nytimes-1 & nytimes-2
    ⚠️ it kept ads that point to headings on the page. That is also because they have story in their id. The easiest way to tackle this would be to add -ad- to REGEX.negative to counterweight the story. I've also tried to improve the list detection as discussed in the issue; it can solve this case, but it caused a lot of other side effects that go well beyond the scope of this PR.
  • wikipedia & wikipedia-2
    ✅ it correctly detected ToC and kept it

Readability.js Outdated
Comment on lines 1749 to 1751
var href = linkNode.getAttribute("href");
var coefficient = href && href.match(this.REGEXPS.hashUrl) ? 0.2 : 1;
linkLength += this._getInnerText(linkNode).length * coefficient;
Contributor

Interesting - why not drop scoring for these links completely (i.e. why 0.2 rather than 0, or why not wrap the linkLength += statement in an if condition testing the href)?

Contributor Author

I originally had it that way, but after testing edge cases and playing with the isList detection I decided to add a coefficient instead.
For example, when an element has a mix of hash links and full links you want to be more sensitive than you would be by simply treating hash links as plain text.

Contributor Author

After more testing I changed the coefficient to 0.3, so it sits between the 0.5 threshold for links and the 0.2 threshold for other content. That way it works really well with the test cases and has only 1 false positive.
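
To make the numbers concrete, here is a small illustrative calculation (the figures are made up rather than taken from any test fixture; 0.2 and 0.5 are the thresholds mentioned above):

// A ToC-like list: 400 characters of text, 320 of them inside <a href="#..."> links.
var textLength = 400;
var hashLinkLength = 320;

// Old behaviour: hash-link text counts in full.
var oldDensity = hashLinkLength / textLength;          // 0.8 -> over the 0.5 threshold, node removed

// New behaviour: hash-link text is scaled by the 0.3 coefficient.
var newDensity = (hashLinkLength * 0.3) / textLength;  // 0.24 -> under 0.5 (though still above 0.2)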

Readability.js Outdated
@@ -1745,7 +1746,9 @@ Readability.prototype = {

   // XXX implement _reduceNodeList?
   this._forEachNode(element.getElementsByTagName("a"), function(linkNode) {
-    linkLength += this._getInnerText(linkNode).length;
+    var href = linkNode.getAttribute("href");
+    var coefficient = href && href.match(this.REGEXPS.hashUrl) ? 0.2 : 1;
Contributor

Nit: this.REGEXPS.hashUrl.test(href) is faster (and this code can be hot, so it makes sense to use it here).

In fact, should we just use href.startsWith("#") here? That accomplishes the same thing, right?

jakubriedl (Contributor Author) commented Nov 20, 2020

The regexp needs to be there because I found many cases where the HTML contains <a href="#"> and the navigation is managed by JS. Those are quite often buttons rather than proper hash links.
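
In other words, the pattern has to require at least one character after the #, so a bare href="#" still counts as an ordinary link. Presumably something along these lines (an assumed shape; the exact pattern is in REGEXPS in the PR):

// Assumed shape of the pattern: matches "#section-2" but not a bare "#".
var hashUrl = /^#.+/;

hashUrl.test("#table-of-contents");             // true  -> weighted as in-page navigation
hashUrl.test("#");                              // false -> treated like any other link
hashUrl.test("https://example.com/p#fragment"); // false -> not a same-page href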

Contributor Author

Changed it to this.REGEXPS.hashUrl.test(href) in the latest commit.

gijsk (Contributor) commented Nov 19, 2020

> Fixes #643
>
> As discussed in the issue, the Table of Contents sometimes gets removed because it has a high link density and so looks like a menu. But for readability it makes sense to keep ToCs.

Thank you for the PR! This looks really nice.

> This PR tackles that by checking whether a link is on-page navigation (a #hash link) and counting those into linkDensity with just 20% weight. (That was selected because it plays well with the cleanConditionally rule and will )

Oops, looks like something went missing here?

> Side effects on existing test cases:

Thanks for this comprehensive explanation.

> * `mercurial`
>   ✅ it correctly detected ToC and kept it
>   ⚠️ but after that the first paragraph changed and was detected as the excerpt. That is not quite correct, but it is more a problem of excerpt detection than of whether the P should be kept. I would recommend tackling it in a different PR.

OK. Would you mind filing an issue for this after this PR lands?

> * `mozilla-1`
>   🤔✅ it kept a menu that navigates only within the page, and all its links work, so I would consider it correct
>
> * `nytimes-1` & `nytimes-2`
>   ⚠️ it kept ads that point to headings on the page. That is also because they have `story` in their `id`. The easiest way to tackle this would be to add `-ad-` to `REGEX.negative` to counterweight the `story`. I've also tried to improve the list detection as discussed in the issue; it can solve this case, but it caused a lot of other side effects that go well beyond the scope of this PR.

Can we add the -ad- thing to this PR so the nytimes testcase remains unchanged?
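
For context, that would mean extending REGEXPS.negative with an -ad- alternative so ad containers are penalized even when their ids also contain story. A rough, abridged sketch (the real pattern has many more alternatives):

// Abridged sketch, not the full Readability.js pattern: the "-ad-" alternative
// counter-weights ids like "story-ad-wrapper", which would otherwise score
// positively because they contain "story".
var negative = /-ad-|hidden|banner|comment|footer|masthead|promo|related|sidebar|sponsor/i;

negative.test("story-ad-wrapper"); // true -> the node gets a negative class weight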

> * `wikipedia` & `wikipedia-2`
>   ✅ it correctly detected ToC and kept it

gijsk (Contributor) left a comment

Oops, forgot to mark as request changes.

jakubriedl (Contributor Author) commented Nov 20, 2020

I've played with the ToC detection some more and managed to get it even better by fixing the isList condition and tweaking the params. That (mostly) solved the issues with nytimes-1 & nytimes-2 and further improved wikipedia and one other case.

Current effects on tests

  • bug-1255978
    ✅ this actually fixed a case where a leftover (a menu heading) remained from a removed menu.
  • mercurial
    ✅ + ⚠️ as mentioned in the original comment; I will raise an issue once this gets merged
  • mozilla-1
    ✅ as mentioned in original comment
  • nytimes-2
    ⚠️ it now included a block which is navigation for the article. That is because weight = 50 and linkDensity dropped below 0.5 (see the sketch after this list). I don't think we can do anything more with this case.
  • nytimes-4
    🤔 ✅ it kept the date in the header, which is sort of correct. I don't know why it was removed before
  • wikipedia (1,2,3)
    ✅ kept correct ToC and Cite references
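
For reference, the nytimes-2 behaviour follows from the two link-density clauses referenced in this thread: high-weight nodes are removed when linkDensity exceeds 0.5, low-weight ones when it exceeds 0.2. A hedged paraphrase of just those clauses (the weight cut-off of 25 is assumed here; the real _cleanConditionally check combines these with several other heuristics):

// Paraphrase of the two link-density clauses only; not the full condition.
function failsLinkDensityCheck(weight, linkDensity) {
  return (weight < 25 && linkDensity > 0.2) ||
         (weight >= 25 && linkDensity > 0.5);
}

// nytimes-2 case described above: weight = 50 and linkDensity just below 0.5.
failsLinkDensityCheck(50, 0.49); // false -> the navigation block is kept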

gijsk merged commit 3c83389 into mozilla:master on Nov 23, 2020
gijsk (Contributor) commented Nov 23, 2020

Thanks!
