|
1 |
| -# microlink-core |
| 1 | +# metascraper |
2 | 2 |
|
3 |
| - |
4 |
| -[](https://travis-ci.org/microlinkhq/microlink-core) |
5 |
| -[](https://coveralls.io/github/microlinkhq/microlink-core) |
6 |
| -[](https://david-dm.org/microlinkhq/microlink-core) |
7 |
| -[](https://david-dm.org/microlinkhq/microlink-core#info=devDependencies) |
8 |
| -[](https://www.npmjs.org/package/microlink-core) |
9 |
| -[](https://paypal.me/microlinkhq) |
| 3 | + |
| 4 | +[](https://travis-ci.org/microlinkhq/metascraper) |
| 5 | +[](https://coveralls.io/github/microlinkhq/metascraper) |
| 6 | +[](https://david-dm.org/microlinkhq/metascraper) |
| 7 | +[](https://david-dm.org/microlinkhq/metascraper#info=devDependencies) |
| 8 | +[](https://www.npmjs.org/package/metascraper) |
10 | 9 |
|
11 |
| -> Get metadata from HTML. |
| 10 | +> A library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks. |
12 | 11 |
|
13 |
| -## Install |
| 12 | +## Table of Contents |
| 13 | + |
| 14 | +TODO: INSERT TABLE OF CONTENT |
| 15 | + |
| 16 | +## Getting Started |
| 17 | + |
| 18 | +**metascraper** is a library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks. |
| 19 | + |
| 20 | +It follows a few principles: |
| 21 | + |
| 22 | +- Have a high accuracy for online articles by default. |
| 23 | +- Be usable on the server and in the browser. |
| 24 | +- Make it simple to add new rules or override existing ones. |
| 25 | +- Don't restrict rules to CSS selectors or text accessors. |
| 26 | + |
| 27 | +## Installation |
14 | 28 |
|
15 | 29 | ```bash
|
16 |
| -$ npm install microlink-core --save |
| 30 | +$ npm install metascraper --save |
17 | 31 | ```
|
18 | 32 |
|
19 | 33 | ## Usage
|
20 | 34 |
|
| 35 | +Let's extract accurate information from the following article: |
| 36 | + |
| 37 | +[](http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance) |
| 38 | + |
21 | 39 | ```js
|
22 |
| -const microlink = require('microlink-core') |
23 |
| -const get = require('simple-get') |
| 40 | +const metascraper = require('metascraper') |
| 41 | +const got = require('got') |
24 | 42 |
|
25 |
| -get.concat('http://example.com', function (err, res, html) { |
26 |
| - if (err) throw err |
| 43 | +const targetUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance' |
27 | 44 |
|
28 |
| - const output = microlink(html) |
29 |
| - console.log(output) |
30 |
| -}) |
| 45 | +;(async () => { |
| 46 | + const {body: html, url} = await got(targetUrl) |
| 47 | + const metadata = await metascraper({html, url}) |
| 48 | + console.log(metadata) |
| 49 | +})() |
31 | 50 | ```
|
| 51 | + |
| 52 | +Where the output will be something like: |
| 53 | + |
| 54 | +```json |
| 55 | +{ |
| 56 | + "author": "Ellen Huet", |
| 57 | + "date": "2016-05-24T18:00:03.894Z", |
| 58 | + "description": "The HR startups go to war.", |
| 59 | + "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg", |
| 60 | + "publisher": "Bloomberg.com", |
| 61 | + "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance", |
| 62 | + "url": "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance" |
| 63 | +} |
| 64 | +``` |
| 65 | + |
| 66 | +## Metadata |
| 67 | + |
| 68 | +Here is a list of the metadata that **metascraper** collects by default: |
| 69 | + |
| 70 | +- **`author`** — eg. `Noah Kulwin`<br/> |
| 71 | + A human-readable representation of the author's name. |
| 72 | + |
| 73 | +- **`date`** — eg. `2016-05-27T00:00:00.000Z`<br/> |
| 74 | + An [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) representation of the date the article was published. |
| 75 | + |
| 76 | +- **`description`** — eg. `Venture capitalists are raising money at the fastest rate...`<br/> |
| 77 | + The publisher's chosen description of the article. |
| 78 | + |
| 79 | +- **`image`** — eg. `https://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg`<br/> |
| 80 | + An image URL that best represents the article. |
| 81 | + |
| 82 | + - **`logo`** — eg. `https://entrepreneur.com/favicon180x180.png`<br/> |
| 83 | + An image URL that best represents the publisher brand. |
| 84 | + |
| 85 | +- **`publisher`** — eg. `Fast Company`<br/> |
| 86 | + A human-readable representation of the publisher's name. |
| 87 | + |
| 88 | +- **`title`** — eg. `Meet Wall Street's New A.I. Sheriffs`<br/> |
| 89 | + The publisher's chosen title of the article. |
| 90 | + |
| 91 | +- **`url`** — eg. `http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion`<br/> |
| 92 | + The URL of the article. |
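
As a rough illustration of the `date` format listed above, the sketch below normalizes a raw date string to ISO 8601 using only the built-in `Date` object. The helper name `toIsoDate` is hypothetical and is not part of metascraper's API; how the library parses dates internally is not shown here.

```js
// Hypothetical helper for illustration; not part of metascraper's API.
function toIsoDate (value) {
  const date = new Date(value)
  // An unparseable input yields an invalid Date, whose time value is NaN.
  return Number.isNaN(date.getTime()) ? null : date.toISOString()
}

// Date-only ISO strings are parsed as UTC midnight.
console.log(toIsoDate('2016-05-27'))
// A timestamp with an offset is converted to UTC.
console.log(toIsoDate('2016-05-24T14:00:03.894-04:00'))
// Unparseable input returns null instead of throwing.
console.log(toIsoDate('not a date'))
```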
| 93 | + |
| 94 | +## API |
| 95 | + |
| 96 | +### metascraper(options) |
| 97 | + |
| 98 | +#### options |
| 99 | + |
| 100 | +##### html |
| 101 | + |
| 102 | +*Required*<br> |
| 103 | +Type: `String` |
| 104 | + |
| 105 | +The HTML markup for extracting the content. |
| 106 | + |
| 107 | +##### url |
| 108 | + |
| 109 | +*Required*<br> |
| 110 | +Type: `String` |
| 111 | + |
| 112 | +The URL associated with the HTML markup. |
| 113 | + |
| 114 | +It is used to resolve relative links that may be present in the HTML markup. |
| 115 | + |
| 116 | +It can also be used as a fallback field for different rules. |
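
To make the role of `url` more concrete, here is a minimal sketch of what resolving a relative link against a base URL looks like with the standard WHATWG `URL` class. The helper `resolveLink` and the `example.com` addresses are assumptions made for illustration; this is not necessarily how metascraper resolves links internally.

```js
// Hypothetical helper for illustration; not part of metascraper's API.
function resolveLink (link, baseUrl) {
  // The WHATWG URL constructor resolves `link` relative to `baseUrl`.
  // Links that are already absolute ignore the base.
  return new URL(link, baseUrl).toString()
}

const baseUrl = 'http://example.com/news/articles/some-article'

console.log(resolveLink('/images/cover.jpg', baseUrl)) // root-relative path
console.log(resolveLink('cover.jpg', baseUrl)) // path relative to the article
console.log(resolveLink('https://cdn.example.com/cover.jpg', baseUrl)) // already absolute
```

Without a base URL, a value such as `/images/cover.jpg` found in the markup could not be turned into a usable `image` field.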
| 117 | + |
| 118 | +## Comparison |
| 119 | + |
| 120 | +To give you an idea of how accurate **metascraper** is, here is a comparison of similar libraries: |
| 121 | + |
| 122 | +| Library | [`metascraper`](https://www.npmjs.com/package/metascraper) | [`html-metadata`](https://www.npmjs.com/package/html-metadata) | [`node-metainspector`](https://www.npmjs.com/package/node-metainspector) | [`open-graph-scraper`](https://www.npmjs.com/package/open-graph-scraper) | [`unfluff`](https://www.npmjs.com/package/unfluff) | |
| 123 | +| :--- | :--- | :--- | :--- | :--- | :--- | |
| 124 | +| Correct | **95.54%** | **74.56%** | **61.16%** | **66.52%** | **70.90%** | |
| 125 | +| Incorrect | 1.79% | 1.79% | 0.89% | 6.70% | 10.27% | |
| 126 | +| Missed | 2.68% | 23.67% | 37.95% | 26.34% | 8.95% | |
| 127 | + |
| 128 | +A big part of the reason for **metascraper**'s higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly-used, spec-compliant pieces of metadata, like Open Graph. |
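
The fallback idea described above can be sketched as an ordered list of rules, where the first rule that returns a value wins. Everything in this block is hypothetical: the rule order, the `matchMeta` and `firstMatch` helpers, and the regex-based matching are simplifications for illustration, not metascraper's actual rules or implementation.

```js
// A hypothetical sketch of a fallback chain; NOT metascraper's actual rules.
// Each rule receives the raw HTML string and returns a value or null.

// Simplified matcher: assumes `property`/`name` appears before `content`
// inside the <meta> tag. A real implementation would use an HTML parser.
function matchMeta (html, name) {
  const re = new RegExp(
    `<meta[^>]+(?:property|name)=["']${name}["'][^>]+content=(["'])(.*?)\\1`,
    'i'
  )
  const m = re.exec(html)
  return m ? m[2] : null
}

const titleRules = [
  html => matchMeta(html, 'og:title'), // Open Graph first
  html => matchMeta(html, 'twitter:title'), // then Twitter card metadata
  html => { // finally the plain <title> element
    const m = /<title[^>]*>([^<]*)<\/title>/i.exec(html)
    return m ? m[1].trim() : null
  }
]

// Try each rule in order and return the first non-empty result.
function firstMatch (rules, html) {
  for (const rule of rules) {
    const value = rule(html)
    if (value) return value
  }
  return null
}

const html = '<html><head><title> Plain Title </title></head></html>'
console.log(firstMatch(titleRules, html))
```

A page with no Open Graph tags still yields a title here because the chain falls through to the `<title>` element, which is the kind of case where a scraper that only reads Open Graph would report a miss.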
| 129 | + |
| 130 | +**metascraper**'s default settings are targeted specifically at parsing online articles, which is why it's able to be more highly-tuned than the other libraries for that purpose. |
| 131 | + |
| 132 | +If you're interested in the breakdown by individual pieces of metadata, check out the [full comparison summary](/support/comparison), or dive into the [raw result data for each library](/support/comparison/results). |
| 133 | + |
32 | 134 | ## License
|
33 | 135 |
|
34 |
| -MIT © [microlonk.io](https://github.com/microlinkhq). |
| 136 | +**metascraper** © [Ian Storm Taylor](https://github.com/ianstormtaylor), Released under the [MIT](https://github.com/Kikobeats/free-email-domains/blob/master/LICENSE.md) License.<br> |
| 137 | +Maintained by [Kiko Beats](https://kikobeats.com) with help from [contributors](https://github.com/microlinkhq/metascraper/contributors). |