📜 Extracting content from the chaos of the web.

Mercury Parser


The Mercury Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.

Mercury Parser powers the Mercury AMP Converter and Mercury Reader, a Chrome extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.

Mercury Parser allows you to easily create custom parsers using simple JavaScript and CSS selectors. This allows you to proactively manage parsing and migration edge cases. There are many examples available along with documentation.

How? Like this.

Installation

# If you're using yarn
yarn add @postlight/mercury-parser

# If you're using npm
npm install @postlight/mercury-parser

Usage

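import Mercury from '@postlight/mercury-parser';

// Any article URL works; this one matches the sample result below.
const url = 'https://en.wikipedia.org/wiki/Thunder_(mascot)';

Mercury.parse(url).then(result => console.log(result));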

// NOTE: When used in the browser, you can omit the URL argument
// and simply run `Mercury.parse()` to parse the current page.

The result looks like this:

{
  "title": "Thunder (mascot)",
  "content": "... <p><b>Thunder</b> is the <a href=\"https://en.wikipedia.org/wiki/Stage_name\">stage name</a> for the...",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

If Mercury is unable to find a field, that field is returned as null.
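
Because missing fields come back as null, calling code usually needs to guard against them before rendering. The sketch below shows one way to do that. `withFallbacks` and the fallback values are hypothetical and not part of Mercury Parser; the sample object is a trimmed copy of the result shown above.

```javascript
// Hypothetical helper (not part of Mercury Parser): returns a copy of a
// parse result in which null fields are replaced by caller-supplied fallbacks.
function withFallbacks(result, fallbacks) {
  const merged = { ...result };
  for (const [field, fallback] of Object.entries(fallbacks)) {
    // `== null` matches both null and undefined.
    if (merged[field] == null) {
      merged[field] = fallback;
    }
  }
  return merged;
}

// Sample shaped like the result above, trimmed to a few fields.
const sample = {
  title: 'Thunder (mascot)',
  author: 'Wikipedia Contributors',
  lead_image_url: null,
  dek: null,
};

const safe = withFallbacks(sample, {
  lead_image_url: '',
  dek: 'No summary available',
});

console.log(safe.title);
console.log(safe.dek);
```

The helper copies the result rather than mutating it, so the original nulls remain available if you need to tell "missing" apart from "empty".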

parse() Options

By default, Mercury Parser returns the content field as HTML. You can override this by passing options to the parse function. The options control whether to scrape all pages of an article and which type of output to return (valid values are 'html', 'markdown', and 'text'). For example:

Mercury.parse(url, { contentType: 'markdown' }).then(result => console.log(result));

This returns the page's content as GitHub-flavored Markdown:

"content": "...**Thunder** is the [stage name](https://en.wikipedia.org/wiki/Stage_name) for the..."
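
If the content type comes from user input or configuration, it can help to check it against the three valid values before making a request. The following is a minimal sketch under that assumption. `parseAs`, `VALID_CONTENT_TYPES`, and `stubParser` are hypothetical names, not part of Mercury Parser; the stub stands in for Mercury so the example runs without network access.

```javascript
// The three content types named in the text above.
const VALID_CONTENT_TYPES = ['html', 'markdown', 'text'];

// Hypothetical wrapper (not part of Mercury Parser). `parser` is any object
// with a parse(url, options) method that returns a promise, such as Mercury.
function parseAs(parser, url, contentType = 'html') {
  if (!VALID_CONTENT_TYPES.includes(contentType)) {
    return Promise.reject(
      new Error(`Unsupported contentType: ${contentType}`)
    );
  }
  return parser.parse(url, { contentType });
}

// Stand-in parser that echoes back what it was asked for.
const stubParser = {
  parse: (url, options) =>
    Promise.resolve({ url, requested: options.contentType }),
};

parseAs(stubParser, 'https://example.com/article', 'text')
  .then(result => console.log(result.requested));

parseAs(stubParser, 'https://example.com/article', 'pdf')
  .catch(err => console.log(err.message));
```

In real use you would pass the imported `Mercury` object in place of `stubParser`.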

The command-line parser

Mercury Parser also ships with a CLI, so you can run it from your command line like so:

# Install Mercury globally
yarn global add @postlight/mercury-parser
#   or
npm -g install @postlight/mercury-parser

# Then
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source

# Pass optional --format argument to set content type (html|markdown|text)
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --format=markdown

License

Licensed under either of the following, at your preference:

Apache License, Version 2.0 (LICENSE-APACHE)
MIT license (LICENSE-MIT)

Contributing

For details on how to contribute to Mercury, including how to write a custom content extractor for any site, see CONTRIBUTING.md.

Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.
