Consider #hash links in link density (#646)
* don't count #hash links to link density

* detect hash links correctly

* update existing test cases

* use coefficient for link density as it better plays with `cleanConditionally`

* improve isList detection
jakubriedl committed Nov 23, 2020
1 parent fc78270 commit 3c83389
Showing 13 changed files with 3,022 additions and 8 deletions.
17 changes: 13 additions & 4 deletions Readability.js
@@ -124,7 +124,7 @@ Readability.prototype = {
okMaybeItsACandidate: /and|article|body|column|content|main|shadow/i,

positive: /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story/i,
- negative: /hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget/i,
+ negative: /-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget/i,
extraneous: /print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single|utility/i,
byline: /byline|author|dateline|writtenby|p-author/i,
replaceFonts: /<(\/?)font[^>]*>/gi,
@@ -135,6 +135,7 @@ Readability.prototype = {
prevLink: /(prev|earl|old|new|<|«)/i,
whitespace: /^\s*$/,
hasContent: /\S$/,
+ hashUrl: /^#.+/,
srcsetUrl: /(\S+)(\s+[\d.]+[xw])?(\s*(?:,|$))/g,
b64DataUrl: /^data:\s*([^\s;,]+)\s*;\s*base64\s*,/i,
// See: https://schema.org/Article
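Note on the new `hashUrl` pattern: a minimal sketch of what it does and does not match when applied to a raw `href` attribute value. The helper name `isHashLink` is hypothetical and is not part of Readability.js; only the regular expression is taken from the hunk above.

```javascript
// Same pattern as the `hashUrl` entry added in this commit.
const hashUrl = /^#.+/;

// Hypothetical helper (not in Readability.js): classify a raw href value.
function isHashLink(href) {
  return typeof href === "string" && hashUrl.test(href);
}

console.log(isHashLink("#conclusion"));          // true
console.log(isHashLink("#"));                    // false: nothing follows the "#"
console.log(isHashLink("page.html#conclusion")); // false: does not start with "#"
console.log(isHashLink(null));                   // false: no href attribute
```

Because the pattern is anchored at the start, only same-page fragment links written as `#something` qualify; a full URL that happens to carry a fragment is still treated as an ordinary link.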
@@ -1745,7 +1746,9 @@ Readability.prototype = {

// XXX implement _reduceNodeList?
this._forEachNode(element.getElementsByTagName("a"), function(linkNode) {
- linkLength += this._getInnerText(linkNode).length;
+ var href = linkNode.getAttribute("href");
+ var coefficient = href && this.REGEXPS.hashUrl.test(href) ? 0.3 : 1;
+ linkLength += this._getInnerText(linkNode).length * coefficient;
});

return linkLength / textLength;
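Note on the weighted link density: the hunk above can be sketched on plain data, without a DOM. The function `linkDensity`, its `{ href, text }` input shape, and the zero-length guard are assumptions made for illustration; only the `0.3` / `1` coefficient and the `hashUrl` pattern come from the commit.

```javascript
const HASH_URL = /^#.+/;

// Hypothetical stand-in for the link density calculation.
// `textLength` is the element's total text length;
// `links` is a list of { href, text } objects, one per <a> inside the element.
function linkDensity(textLength, links) {
  // Guard added for this sketch so an empty element yields 0 rather than NaN.
  if (textLength === 0) return 0;
  let linkLength = 0;
  for (const { href, text } of links) {
    // Same-page "#hash" links count at 30% of their text length.
    const coefficient = href && HASH_URL.test(href) ? 0.3 : 1;
    linkLength += text.length * coefficient;
  }
  return linkLength / textLength;
}

// A 100-character block whose text is entirely table-of-contents links.
const tocDensity = linkDensity(100, [{ href: "#setting-up", text: "x".repeat(100) }]);
// The same block with ordinary outbound links.
const navDensity = linkDensity(100, [{ href: "http://fakehost/", text: "x".repeat(100) }]);
console.log(tocDensity < navDensity); // true
```

With the coefficient, a block made up of in-page links scores roughly 0.3 instead of 1, which is what lets tables of contents such as the one in the `mercurial` test page survive the density thresholds that would otherwise drop them.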
@@ -2007,8 +2010,6 @@ Readability.prototype = {
if (!this._flagIsActive(this.FLAG_CLEAN_CONDITIONALLY))
return;

- var isList = tag === "ul" || tag === "ol";
-
// Gather counts for other typical elements embedded within.
// Traverse backwards so we can remove nodes at the same time
// without effecting the traversal.
@@ -2020,6 +2021,14 @@ Readability.prototype = {
return t._readabilityDataTable;
};

+ var isList = tag === "ul" || tag === "ol";
+ if (!isList) {
+ var listLength = 0;
+ var listNodes = this._getAllNodesWithTag(node, ["ul", "ol"]);
+ this._forEachNode(listNodes, (list) => listLength += this._getInnerText(list).length);
+ isList = listLength / this._getInnerText(node).length > 0.9;
+ }
+
if (tag === "table" && isDataTable(node)) {
return false;
}
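Note on the improved `isList` detection: a minimal sketch of the ratio test, again on plain numbers rather than DOM nodes. The function `looksLikeList` and its parameters are hypothetical; the `0.9` threshold and the `ul`/`ol` check are taken from the hunk above.

```javascript
// Hypothetical stand-in for the new isList check.
// `nodeTextLength` is the text length of the node being cleaned;
// `listTextLengths` holds the text length of each <ul>/<ol> found inside it.
function looksLikeList(tag, nodeTextLength, listTextLengths) {
  if (tag === "ul" || tag === "ol") return true;
  const listLength = listTextLengths.reduce((sum, n) => sum + n, 0);
  // For an empty node this is 0 / 0 = NaN, and NaN > 0.9 is false.
  return listLength / nodeTextLength > 0.9;
}

console.log(looksLikeList("ul", 40, []));        // true: the node is a list itself
console.log(looksLikeList("div", 100, [95]));    // true: 95% of the text sits in a list
console.log(looksLikeList("div", 100, [50]));    // false: only half of the text is list text
console.log(looksLikeList("div", 0, []));        // false: empty wrapper
```

One thing to keep in mind when reading the real code: if `_getAllNodesWithTag` returns every descendant list (an assumption here, not confirmed by this diff), text in nested lists is counted once per enclosing list, so the ratio can exceed 1 for deeply nested tables of contents. That only pushes such wrappers further over the threshold.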
3 changes: 0 additions & 3 deletions test/test-pages/bug-1255978/expected.html
@@ -42,9 +42,6 @@
<p>Zeev Sharon said that the old rule of thumb is that for every $1000 invested in a room, the hotel should charge $1 in average daily rate. So a room that cost $300,000 to build, should sell on average for $300/night.</p>
<h3>5. Beware the wall-mounted hairdryer</h3>
<p>It contains the most germs of anything in the room. Other studies have said the TV remote and bedside lamp switches are the most unhygienic. “Perhaps because it's something that's easy for the housekeepers to forget to check or to squirt down with disinfectant,” Forrest Jones said.</p>
- <div data-scald-gallery="3739501">
- <h2><span></span>Business news in pictures</h2>
- </div>
<h3>6. Mini bars almost always lose money</h3>
<p>Despite the snacks in the minibar seeming like the most overpriced food you have ever seen, hotel owners are still struggling to make a profit from those snacks. "Minibars almost always lose money, even when they charge $10 for a Diet Coke,” Sharon said.</p>
<div>
2 changes: 1 addition & 1 deletion test/test-pages/mercurial/expected-metadata.json
@@ -2,7 +2,7 @@
"title": "Shared Mutable History — evolve extension for Mercurial",
"byline": null,
"dir": null,
- "excerpt": "Once you have mastered the art of mutable history in a single repository (see the user guide), you can move up to the next level: shared mutable history. evolve lets you push and pull draft changesets between repositories along with their obsolescence markers. This opens up a number of interesting possibilities.",
+ "excerpt": "Contents",
"siteName": null,
"readerable": true
}
61 changes: 61 additions & 0 deletions test/test-pages/mercurial/expected.html
@@ -1,5 +1,66 @@
<div id="readability-page-1" class="page">
<div id="evolve-shared-mutable-history">
+ <div id="contents">
+ <p> Contents </p>
+ <ul>
+ <li>
+ <a href="#evolve-shared-mutable-history" id="id4">Evolve: Shared Mutable History</a>
+ <ul>
+ <li>
+ <a href="#sharing-with-a-single-developer" id="id5">Sharing with a single developer</a>
+ <ul>
+ <li>
+ <a href="#publishing-and-non-publishing-repositories" id="id6">Publishing and non-publishing repositories</a>
+ </li>
+ <li>
+ <a href="#setting-up" id="id7">Setting up</a>
+ </li>
+ <li>
+ <a href="#example-1-amend-a-shared-changeset" id="id8">Example 1: Amend a shared changeset</a>
+ </li>
+ <li>
+ <a href="#example-2-amend-again-locally" id="id9">Example 2: Amend again, locally</a>
+ </li>
+ </ul>
+ </li>
+ <li>
+ <a href="#sharing-with-multiple-developers-code-review" id="id10">Sharing with multiple developers: code review</a>
+ <ul>
+ <li>
+ <a href="#id2" id="id11">Setting up</a>
+ </li>
+ <li>
+ <a href="#example-3-alice-commits-and-amends-a-draft-fix" id="id12">Example 3: Alice commits and amends a draft fix</a>
+ </li>
+ <li>
+ <a href="#example-4-bob-implements-and-publishes-a-new-feature" id="id13">Example 4: Bob implements and publishes a new feature</a>
+ </li>
+ <li>
+ <a href="#example-5-alice-integrates-and-publishes" id="id14">Example 5: Alice integrates and publishes</a>
+ </li>
+ </ul>
+ </li>
+ <li>
+ <a href="#getting-into-trouble-with-shared-mutable-history" id="id15">Getting into trouble with shared mutable history</a>
+ <ul>
+ <li>
+ <a href="#id3" id="id16">Setting up</a>
+ </li>
+ <li>
+ <a href="#example-6-divergent-changesets" id="id17">Example 6: Divergent changesets</a>
+ </li>
+ <li>
+ <a href="#phase-divergence-when-a-rewritten-changeset-is-made-public" id="id18">Phase-divergence: when a rewritten changeset is made public</a>
+ </li>
+ </ul>
+ </li>
+ <li>
+ <a href="#conclusion" id="id19">Conclusion</a>
+ </li>
+ </ul>
+ </li>
+ </ul>
+ </div>
<p> Once you have mastered the art of mutable history in a single repository (see the <a href="http://fakehost/test/user-guide.html">user guide</a>), you can move up to the next level: <em>shared</em> mutable history. <tt><span>evolve</span></tt> lets you push and pull draft changesets between repositories along with their obsolescence markers. This opens up a number of interesting possibilities. </p>
<p> The simplest scenario is a single developer working across two computers. Say you’re working on code that must be tested on a remote test server, probably in a rack somewhere, only accessible by SSH, and running an “enterprise-grade” (out-of-date) OS. But you probably prefer to write code locally: everything is setup the way you like it, and you can use your preferred editor, IDE, merge/diff tools, etc. </p>
<p> Traditionally, your options are limited: either </p>
11 changes: 11 additions & 0 deletions test/test-pages/mozilla-1/expected.html
@@ -17,6 +17,17 @@ <h2>Designed to <br />be redesigned</h2>
<p><img src="http://mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/animations/flexible-bottom-fallback.cafd48a3d0a4.png" alt="" /></p>
</div>
</div>
+ <div id="customize" data-ga-label="More ways to customize">
+ <h2>More ways to customize</h2>
+ <ul id="customizer-list" role="tablist">
+ <li> <a id="customize-themes" href="#themes"> Themes </a>
+ </li>
+ <li> <a id="customize-addons" href="#add-ons"> Add-ons </a>
+ </li>
+ <li> <a id="customize-awesomebar" href="#awesome-bar"> Awesome Bar </a>
+ </li>
+ </ul>
+ </div>
<div id="customizers-wrapper">
<div id="themes" role="tabpanel" aria-labelledby="customize-themes">
<div>
3 changes: 3 additions & 0 deletions test/test-pages/nytimes-2/expected.html
@@ -22,6 +22,9 @@
<p><a href="#story-continues-1">Continue reading the main story</a>
</p>
</div>
+ <div id="story-continues-1">
+ <p><a href="#story-continues-2">Continue reading the main story</a></p>
+ </div>
<div>
<p data-para-count="602" data-total-count="1935" id="story-continues-2">In the second step, at the closing, <a href="https://www.sec.gov/Archives/edgar/data/1011006/000119312516656036/d178500dex22.htm">Yahoo will sell the stock</a> in the single subsidiary to Verizon. At that point, Yahoo will change its name to something without “Yahoo” in it. My favorite is simply Remain Co., the name Yahoo executives are using. Remain Co. will become a holding company for the Alibaba and Yahoo Japan stock. Included will also be $10 billion in cash, plus the Excalibur patent portfolio and a number of minority investments including Snapchat. Ahh, if only Yahoo had bought Snapchat instead of Tumblr (indeed, if only Yahoo had bought Google or Facebook when it had the chance).</p>
<p data-para-count="262" data-total-count="2197" id="story-continues-3">Because it is a sale of a subsidiary, the $4.8 billion will be paid to Yahoo. Its shareholders will not receive any money unless Yahoo pays it out in a dividend (after paying taxes). Instead, Yahoo shareholders will be left holding shares in the renamed company.</p>
9 changes: 9 additions & 0 deletions test/test-pages/nytimes-4/expected.html
@@ -11,6 +11,15 @@
</figcaption>
</figure>
</div>
+ <div>
+ <ul>
+ <li>
+ <time datetime="2018-09-25">Sept. 25, 2018</time>
+ </li>
+ <li>
+ </li>
+ </ul>
+ </div>
</header>
<section name="articleBody" itemprop="articleBody">
<div>
8 changes: 8 additions & 0 deletions test/test-pages/toc-missing/expected-metadata.json
@@ -0,0 +1,8 @@
+ {
+ "title": "Simple Anomaly Detection Using Plain SQL",
+ "byline": "Haki Benita",
+ "dir": null,
+ "excerpt": "Many developers think that having a critical bug in their code is the worse thing that can happen. Well, there is something much worst than that: Having a critical bug in your code and not knowing about it! Using some high school level statistics and a fair knowledge of SQL, I implemented a very simple anomaly detection system.",
+ "siteName": "Haki Benita",
+ "readerable": true
+ }
