ReadabilityJS for Ruby

License: MIT

Clean up web pages and extract the main content, powered by Mozilla Readability.

This gem is a Ruby wrapper for Mozilla Readability. It runs Readability in a Node.js process via nodo.

Installation

Prerequisites

Node.js 22.x or later must be installed and available on the command line (in PATH).
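A quick way to verify this prerequisite from Ruby is to inspect the output of `node --version`. The following is a minimal sketch, not part of the gem; the helper names `node_major_version` and `node_recent_enough?` are hypothetical and only illustrate the check.

```ruby
# Hypothetical helpers, not part of readability_js.

# Extracts the major version from the output of `node --version`,
# for example "v22.3.0". Returns nil when the output is not a version.
def node_major_version(version_output)
  match = version_output.to_s.strip.match(/\Av?(\d+)\./)
  match ? match[1].to_i : nil
end

# True when the reported Node.js version meets the minimum major version.
def node_recent_enough?(version_output, minimum_major = 22)
  major = node_major_version(version_output)
  !major.nil? && major >= minimum_major
end

# Returns the raw output of `node --version`, or an empty string
# when no `node` executable can be found in PATH.
def installed_node_version_output
  `node --version`
rescue Errno::ENOENT
  ""
end
```

Calling `node_recent_enough?(installed_node_version_output)` then returns `false` both when Node.js is missing and when it is older than 22.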

Gem

Add this line to your application's Gemfile:

gem 'readability_js'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install readability_js

Usage examples

Original parse

This method calls only Mozilla Readability's own parse method.

    require 'readability_js'
    html = File.read("my_article.html")
    result = ReadabilityJs.parse(html)
    p result

Extended parse

This method calls the extended parse method, which produces more thoroughly cleaned-up output and includes a beautified Markdown version of the content.

    require 'readability_js'
    html = File.read("my_article.html")
    # parse_extended ships with a DEFAULT_SELECTOR_BLACKLIST, and you can add your own selectors to it.
    # Matching elements are removed from the content before any parsing takes place.
    result = ReadabilityJs.parse_extended(html, blacklist_selectors: [".advertisement", "#sponsored"])
    p result

Query parameters

You can pass all parameters supported by Readability; check out the RubyDoc for more details.

Here is an example. Readability's camelCase parameter names are converted to snake_case in Ruby:

    require 'readability_js'
    html = File.read("my_article.html")
    data = ReadabilityJs.parse(
      html
      # TODO: add parameters here
    )
    # => Hash
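To illustrate the naming convention, the sketch below converts a few of Mozilla Readability's documented option names from camelCase to snake_case. The `snake_case` helper is hypothetical and is not the gem's actual conversion code; whether the gem accepts exactly these keyword names is an assumption to verify against the RubyDoc.

```ruby
# Illustrative only: one way camelCase option names could map to snake_case.
# This is an assumption, not the conversion code used by readability_js.
def snake_case(name)
  # Insert an underscore between a lowercase letter or digit and the
  # uppercase letter that follows it, then lowercase the whole string.
  name.to_s.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
end

# Option names taken from Mozilla Readability's documented options.
%w[charThreshold keepClasses nbTopCandidates].each do |js_name|
  puts "#{js_name} -> #{snake_case(js_name)}"
end
```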

Parse response

The response is a Hash. It contains the data returned by Readability, with the keys converted to snake_case.

    {
      "title" => "Article Title",
      "content" => "<div>...</div>",
      "text_content" => "Plain text content",
      "markdown_content" => "## Markdown content", # only for extended parse
      "length" => 1234,
      "excerpt" => "This is an excerpt of the article...",
      "byline" => "Author Name",
      "dir" => "ltr",
      "site_name" => "example.com",
      "lang" => "en",
      "published_time" => "2024-01-01T12:00:00Z",
      "image_url" => "https://example.com/image.jpg" # only for extended parse
    }
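Because the result is a plain Hash with string keys, it can be consumed with ordinary Hash access. The sketch below uses placeholder data shaped like the response above instead of calling the gem; the `summarize` helper is hypothetical, and treating absent keys as possible is a defensive assumption rather than documented behavior.

```ruby
# Sample data shaped like the documented response; the values are placeholders.
result = {
  "title" => "Article Title",
  "byline" => "Author Name",
  "excerpt" => "This is an excerpt of the article...",
  "length" => 1234,
  "markdown_content" => "## Markdown content"
}

# Hypothetical helper: builds a one-line summary and falls back to
# defaults when a key is absent or nil.
def summarize(result)
  title  = result["title"] || "(untitled)"
  byline = result["byline"] || "unknown author"
  length = result["length"] || 0
  "#{title} by #{byline} (#{length} characters)"
end

puts summarize(result)

# "markdown_content" is only returned by the extended parse, so check first.
markdown = result["markdown_content"]
puts markdown if markdown
```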

Documentation

Check out the documentation at RubyDoc:
https://rubydoc.info/github/magynhard/ruby-readability_js

As this library is only a wrapper, also check out the original Readability documentation:
https://github.com/mozilla/readability

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/magynhard/ruby-readability_js.

This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
