Commit 96a5601

Add meta and new docs
1 parent a285914 commit 96a5601

2 files changed

+176
-28
lines changed

README.md

Lines changed: 122 additions & 19 deletions
@@ -1,34 +1,137 @@
1-
# microlink-core
1+
# metascraper
22

3-
![Last version](https://img.shields.io/github/tag/microlinkhq/microlink-core.svg?style=flat-square)
4-
[![Build Status](https://img.shields.io/travis/microlinkhq/microlink-core/master.svg?style=flat-square)](https://travis-ci.org/microlinkhq/microlink-core)
5-
[![Coverage Status](https://img.shields.io/coveralls/microlinkhq/microlink-core.svg?style=flat-square)](https://coveralls.io/github/microlinkhq/microlink-core)
6-
[![Dependency status](https://img.shields.io/david/microlinkhq/microlink-core.svg?style=flat-square)](https://david-dm.org/microlinkhq/microlink-core)
7-
[![Dev Dependencies Status](https://img.shields.io/david/dev/microlinkhq/microlink-core.svg?style=flat-square)](https://david-dm.org/microlinkhq/microlink-core#info=devDependencies)
8-
[![NPM Status](https://img.shields.io/npm/dm/microlink-core.svg?style=flat-square)](https://www.npmjs.org/package/microlink-core)
9-
[![Donate](https://img.shields.io/badge/donate-paypal-blue.svg?style=flat-square)](https://paypal.me/microlinkhq)
3+
![Last version](https://img.shields.io/github/tag/microlinkhq/metascraper.svg?style=flat-square)
4+
[![Build Status](https://img.shields.io/travis/microlinkhq/metascraper/master.svg?style=flat-square)](https://travis-ci.org/microlinkhq/metascraper)
5+
[![Coverage Status](https://img.shields.io/coveralls/microlinkhq/metascraper.svg?style=flat-square)](https://coveralls.io/github/microlinkhq/metascraper)
6+
[![Dependency status](https://img.shields.io/david/microlinkhq/metascraper.svg?style=flat-square)](https://david-dm.org/microlinkhq/metascraper)
7+
[![Dev Dependencies Status](https://img.shields.io/david/dev/microlinkhq/metascraper.svg?style=flat-square)](https://david-dm.org/microlinkhq/metascraper#info=devDependencies)
8+
[![NPM Status](https://img.shields.io/npm/dm/metascraper.svg?style=flat-square)](https://www.npmjs.org/package/metascraper)
109

11-
> Get metadata from HTML.
10+
> A library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.
1211
13-
## Install
12+
## Table of Contents
13+
14+
TODO: INSERT TABLE OF CONTENT
15+
16+
## Getting Started
17+
18+
**metascraper** is a library that scrapes metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.
19+
20+
It follows a few principles:
21+
22+
- Have a high accuracy for online articles by default.
23+
- Be usable on the server and in the browser.
24+
- Make it simple to add new rules or override existing ones.
25+
- Don't restrict rules to CSS selectors or text accessors.
26+
27+
## Installation
1428

1529
```bash
16-
$ npm install microlink-core --save
30+
$ npm install metascraper --save
1731
```
1832

1933
## Usage
2034

35+
Let's extract accurate information from the following article:
36+
37+
[![](/support/screenshot.png)](http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance)
38+
2139
```js
22-
const microlink = require('microlink-core')
23-
const get = require('simple-get')
40+
const metascraper = require('metascraper')
41+
const got = require('got')
2442

25-
get.concat('http://example.com', function (err, res, html) {
26-
if (err) throw err
43+
const targetUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'
2744

28-
const output = microlink(html)
29-
console.log(output)
30-
})
45+
;(async () => {
46+
const {body: html, url} = await got(targetUrl)
47+
const metadata = await metascraper({html, url})
48+
console.log(metadata)
49+
})()
3150
```
51+
52+
The output will look something like this:
53+
54+
```json
55+
{
56+
"author": "Ellen Huet",
57+
"date": "2016-05-24T18:00:03.894Z",
58+
"description": "The HR startups go to war.",
59+
"image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
60+
"publisher": "Bloomberg.com",
61+
"title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
62+
"url": "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
63+
}
64+
```
65+
66+
## Metadata
67+
68+
Here is a list of the metadata that **metascraper** collects by default:
69+
70+
- **`author`** — eg. `Noah Kulwin`<br/>
71+
A human-readable representation of the author's name.
72+
73+
- **`date`** — eg. `2016-05-27T00:00:00.000Z`<br/>
74+
An [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) representation of the date the article was published.
75+
76+
- **`description`** — eg. `Venture capitalists are raising money at the fastest rate...`<br/>
77+
The publisher's chosen description of the article.
78+
79+
- **`image`** — eg. `https://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg`<br/>
80+
An image URL that best represents the article.
81+
82+
- **`logo`** — eg. `https://entrepreneur.com/favicon180x180.png`<br/>
83+
An image URL that best represents the publisher brand.
84+
85+
- **`publisher`** — eg. `Fast Company`<br/>
86+
A human-readable representation of the publisher's name.
87+
88+
- **`title`** — eg. `Meet Wall Street's New A.I. Sheriffs`<br/>
89+
The publisher's chosen title of the article.
90+
91+
- **`url`** — eg. `http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion`<br/>
92+
The URL of the article.
93+
94+
## API
95+
96+
### metascraper(options)
97+
98+
#### options
99+
100+
##### html
101+
102+
*Required*<br>
103+
Type: `String`
104+
105+
The HTML markup for extracting the content.
106+
107+
##### url
108+
109+
*Required*<br>
110+
Type: `String`
111+
112+
The URL associated with the HTML markup.
113+
114+
It is used to resolve relative links that may be present in the HTML markup.
115+
116+
It can also be used as a fallback value for some rules.
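
To illustrate why the page URL matters, here is a minimal sketch of resolving a relative link against it. It uses the standard WHATWG `URL` class (a global in recent Node.js versions and in browsers), not **metascraper**'s own internals. The `resolveLink` helper and the example image paths are hypothetical names introduced only for this illustration.

```js
// Illustrative only: this is not metascraper's internal code.
// `resolveLink` is a hypothetical helper built on the standard URL class.
function resolveLink (link, pageUrl) {
  // The second argument is the base that relative links are resolved against.
  return new URL(link, pageUrl).href
}

const pageUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'

// A root-relative path picks up the protocol and host of the page.
console.log(resolveLink('/images/cover.jpg', pageUrl))
// http://www.bloomberg.com/images/cover.jpg

// An absolute link is returned unchanged.
console.log(resolveLink('https://assets.bwbx.io/a.jpg', pageUrl))
// https://assets.bwbx.io/a.jpg
```

Without the page URL, a value such as `/images/cover.jpg` could not be turned into a usable `image` field.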
117+
118+
## Comparison
119+
120+
To give you an idea of how accurate **metascraper** is, here is a comparison of similar libraries:
121+
122+
| Library | [`metascraper`](https://www.npmjs.com/package/metascraper) | [`html-metadata`](https://www.npmjs.com/package/html-metadata) | [`node-metainspector`](https://www.npmjs.com/package/node-metainspector) | [`open-graph-scraper`](https://www.npmjs.com/package/open-graph-scraper) | [`unfluff`](https://www.npmjs.com/package/unfluff) |
123+
| :--- | :--- | :--- | :--- | :--- | :--- |
124+
| Correct | **95.54%** | **74.56%** | **61.16%** | **66.52%** | **70.90%** |
125+
| Incorrect | 1.79% | 1.79% | 0.89% | 6.70% | 10.27% |
126+
| Missed | 2.68% | 23.67% | 37.95% | 26.34% | 8.95% |
127+
128+
A big part of the reason for **metascraper**'s higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly-used, spec-compliant pieces of metadata, like Open Graph.
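
The fallback idea can be sketched as an ordered list of rules where the first one that yields a value wins. This is only an assumption about the general shape of such a chain, not **metascraper**'s actual rule set. The names `titleRules`, `match`, and `firstMatch` are hypothetical, and regular expressions stand in for a real HTML parser to keep the example dependency-free.

```js
// Hypothetical rules for a `title` field, ordered from most to least preferred.
// A real scraper would query a parsed document instead of using regexes.
const titleRules = [
  html => match(html, /<meta[^>]+property="og:title"[^>]+content="([^"]*)"/i),
  html => match(html, /<meta[^>]+name="twitter:title"[^>]+content="([^"]*)"/i),
  html => match(html, /<title[^>]*>([^<]*)<\/title>/i)
]

// Return the first capture group, trimmed, or undefined when nothing matches.
function match (html, pattern) {
  const result = pattern.exec(html)
  const value = result && result[1].trim()
  return value || undefined
}

// Try each rule in order and return the first non-empty value, or null.
function firstMatch (rules, html) {
  for (const rule of rules) {
    const value = rule(html)
    if (value) return value
  }
  return null
}

const withOpenGraph = '<head><meta property="og:title" content="OG title"><title>Page title</title></head>'
const withoutOpenGraph = '<head><title>Page title</title></head>'

console.log(firstMatch(titleRules, withOpenGraph)) // OG title
console.log(firstMatch(titleRules, withoutOpenGraph)) // Page title
```

A library that only checks the Open Graph tag would report the second page as missing a title, while a chain like this still finds one.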
129+
130+
**metascraper**'s default settings are targeted specifically at parsing online articles, which is why it can be more highly tuned for that purpose than the other libraries.
131+
132+
If you're interested in the breakdown by individual pieces of metadata, check out the [full comparison summary](/support/comparison), or dive into the [raw result data for each library](/support/comparison/results).
133+
32134
## License
33135

34-
MIT © [microlonk.io](https://github.com/microlinkhq).
136+
**metascraper** © [Ian Storm Taylor](https://github.com/ianstormtaylor), released under the [MIT](https://github.com/Kikobeats/free-email-domains/blob/master/LICENSE.md) License.<br>
137+
Maintained by [Kiko Beats](https://kikobeats.com) with help from [contributors](https://github.com/microlinkhq/metascraper/contributors).

package.json

Lines changed: 54 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,21 +1,66 @@
11
{
2-
"name": "microlink-core",
3-
"description": "Get metadata from HTML",
4-
"homepage": "https://github.com/microlinkhq/microlink-core",
5-
"version": "0.0.0",
2+
"name": "metascraper",
3+
"description": "A library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.",
4+
"homepage": "https://metascraper.js.org",
5+
"version": "1.0.7",
66
"main": "index.js",
77
"author": {
8-
"url": "https://github.com/microlinkhq"
8+
"email": "ian@ianstormtaylor.com",
9+
"name": "Ian Storm Taylor",
10+
"url": "https://github.com/ianstormtaylor"
911
},
12+
"contributors": [
13+
{
14+
"name": "Kiko Beats",
15+
"email": "josefrancisco.verdu@gmail.com",
16+
"url": "https://github.com/Kikobeats"
17+
}
18+
],
1019
"repository": {
1120
"type": "git",
12-
"url": "git+https://github.com/microlinkhq/microlink-core.git"
21+
"url": "git+https://github.com/microlinkhq/metascraper.git"
1322
},
1423
"bugs": {
15-
"url": "https://github.com/microlinkhq/microlink-core/issues"
24+
"url": "https://github.com/microlinkhq/metascraper/issues"
1625
},
1726
"keywords": [
18-
"metadata"
27+
"article",
28+
"browser",
29+
"cheerio",
30+
"content",
31+
"expand",
32+
"extract",
33+
"facebook",
34+
"fallback",
35+
"fetch",
36+
"get",
37+
"graph",
38+
"html",
39+
"meta",
40+
"metadata",
41+
"micro format",
42+
"microformat",
43+
"og",
44+
"open",
45+
"open graph",
46+
"opengraph",
47+
"page",
48+
"parse",
49+
"parser",
50+
"scrape",
51+
"scraper",
52+
"server",
53+
"site",
54+
"summarize",
55+
"summary",
56+
"tag",
57+
"tags",
58+
"twitter",
59+
"unfluff",
60+
"unfurl",
61+
"url",
62+
"web",
63+
"website"
1964
],
2065
"dependencies": {
2166
"async": "~2.5.0",
@@ -47,7 +92,7 @@
4792
"standard-markdown": "latest"
4893
},
4994
"engines": {
50-
"node": "8"
95+
"node": ">= 8"
5196
},
5297
"files": [
5398
"index.js",
