
Readability

Erik Rose edited this page Mar 22, 2016 · 2 revisions

Safari and Firefox both use it. Safari's Reader was based on the Readability code (Apache 2 licensed): http://www.theregister.co.uk/2010/06/08/safari_reader_based_on_open_source_project/

Apache-2 licensed

Algorithm

This is roughly accurate as of the Arc90 Labs work. It needs to be updated to reflect the changes Mozilla has made since then.

  • Rip out elements whose id or class marks them as unlikely content, like "comment", "disqus", "menu", etc. (unless the match is on the body tag).
  • Turn divs that don't contain any block elements into p tags.
  • Score using…
    • Length of paragraphs
    • Number of commas (?!)
  • Scale scores by link density.
  • Prepend and append sibling nodes of winner if…
    • Their scores are ≥1/5 of the winner's, or
    • They're at least 80 chars long and have low link density, or
    • They're shorter than that but have no links and contain at least one thing that looks like a sentence.
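
The steps above can be sketched in a few small functions. This is only an illustration under stated assumptions, not the Arc90 or Mozilla code: `Block` is a hypothetical stand-in for a DOM node, the unlikely-name pattern contains only the three names listed above, and the block-tag set, the score weights (one point per comma, up to three points for length) and the 0.25 "low link density" cutoff are all assumed values.

```python
import re
from dataclasses import dataclass

# Only the names mentioned in the list above; a real pattern would be longer.
UNLIKELY = re.compile(r"comment|disqus|menu", re.IGNORECASE)
# Assumed test for "looks like a sentence": a period followed by a space or end of text.
SENTENCE_END = re.compile(r"\.( |$)")
# Assumed set of block-level tags; the real list may differ.
BLOCK_TAGS = {"blockquote", "div", "dl", "ol", "p", "pre", "table", "ul"}
# Assumed cutoff for "low link density".
LOW_LINK_DENSITY = 0.25


@dataclass
class Block:
    """Hypothetical stand-in for a DOM node."""
    tag: str
    text: str
    link_text: str = ""      # concatenated text found inside <a> descendants
    class_and_id: str = ""   # class and id attributes joined together
    score: float = 0.0


def is_unlikely(block: Block) -> bool:
    """Step 1: flag nodes with unlikely class/id names, but never the body."""
    return block.tag != "body" and bool(UNLIKELY.search(block.class_and_id))


def div_becomes_p(child_tags: list) -> bool:
    """Step 2: a div with no block-level children is treated as a paragraph."""
    return not any(tag in BLOCK_TAGS for tag in child_tags)


def link_density(block: Block) -> float:
    """Fraction of the node's text that sits inside links."""
    if not block.text:
        return 0.0
    return len(block.link_text) / len(block.text)


def base_score(text: str) -> int:
    """Step 3: score by comma count and length (weights are assumptions)."""
    return 1 + text.count(",") + min(len(text) // 100, 3)


def scaled_score(block: Block) -> float:
    """Step 4: scale the score down as link density goes up."""
    return base_score(block.text) * (1 - link_density(block))


def keep_sibling(sibling: Block, winner_score: float) -> bool:
    """Step 5: decide whether a sibling of the winner is kept."""
    if sibling.score >= winner_score / 5:
        return True
    length = len(sibling.text)
    density = link_density(sibling)
    if length >= 80 and density < LOW_LINK_DENSITY:
        return True
    return length < 80 and density == 0 and bool(SENTENCE_END.search(sibling.text))
```

In this sketch a candidate is scored directly from its own text. How scores are accumulated across a node's descendants or ancestors is left out, since the list above does not describe it.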

Differences we'd like

For our purposes, we would need…

  • Some output even if we aren't confident that it's the main content. Err on the side of too much.
  • No automatic traversal to additional pages of paginated content