Seize is a lightweight web-page content extractor for Node and the browser, inspired by arc90's Readability and Safari Reader.
npm i --save seize
Seize can be used with DOM libraries such as jsdom. It only extracts and prepares the DOM node that holds the main content, so you can process it further yourself.
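// Note: require('jsdom').jsdom is the legacy jsdom API (jsdom < 10);
// newer jsdom versions expose `new JSDOM(html).window` instead.
var Seize = require('seize'),
    jsdom = require('jsdom').jsdom;
// Parse the HTML and hand the resulting document to Seize
var window = jsdom('<your html here>').defaultView,
    seize = new Seize(window.document);
seize.content(); // returns DOM-node
seize.text(); // returns only text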
For browser usage you should clone your DOM object or create a new one from an HTML string:
/**
 * Converts an HTML string to a Document
 * @param {String} html HTML document string
 * @return {Document} detached document built from the string
 */
function HTMLParser(html) {
  // Create an empty document that is not attached to the current page
  var doc = document.implementation.createHTMLDocument("example");
  doc.documentElement.innerHTML = html;
  return doc;
}
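A minimal sketch of how the two browser options might be wired up with Seize. The helper names `extractFromHtml` and `extractFromCurrentPage` are hypothetical and not part of Seize; `Seize` and the root document are passed in as arguments, so the sketch makes no assumption about how Seize is loaded in the browser. It only uses `content()` and `text()` as shown above.

```javascript
// Hypothetical helpers, not part of Seize itself.

// Option 1: build a detached document from an HTML string.
function extractFromHtml(Seize, rootDocument, html) {
  var doc = rootDocument.implementation.createHTMLDocument('example');
  doc.documentElement.innerHTML = html;
  var seize = new Seize(doc);
  return { node: seize.content(), text: seize.text() };
}

// Option 2: work on a deep copy of an existing document,
// so the page the user is looking at stays untouched.
function extractFromCurrentPage(Seize, rootDocument) {
  var copy = rootDocument.cloneNode(true); // true = deep clone
  var seize = new Seize(copy);
  return { node: seize.content(), text: seize.text() };
}

// In a browser, assuming Seize is available as a global or via a bundler:
//   var result = extractFromCurrentPage(Seize, document);
//   console.log(result.text);
```

Either way Seize receives its own copy of the DOM rather than the live `document`.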
Here is how the algorithm works:
- Getting HTML tags that we expect to be text or content containers, such as p, table, img, etc.
- Filtering out unnecessary tags, by content and by tag name, which definitely can't be in a content container
- Setting score for each container by containing tags
- Setting score by class name, id name, tag xPath score and text score
- Sorting candidates by score
- Taking first candidate
- Cleaning up article
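The filter, score, sort and pick steps above can be sketched roughly as follows. This is an illustration only, not Seize's actual implementation: the function names, regular expressions, excluded tags and weights are all assumptions, candidates are plain objects instead of DOM nodes, and the xPath score and the final clean-up step are left out.

```javascript
// Illustrative sketch; patterns and weights are assumptions, not Seize's values.
var POSITIVE = /article|content|entry|main|post|text/i;
var NEGATIVE = /comment|footer|header|menu|nav|sidebar/i;
var TAG_SCORES = { p: 3, img: 1, table: 1 };
var EXCLUDED_TAGS = ['nav', 'footer', 'aside', 'form'];

// Score by class name and id name.
function nameScore(candidate) {
  var name = candidate.className + ' ' + candidate.id;
  var score = 0;
  if (POSITIVE.test(name)) score += 25;
  if (NEGATIVE.test(name)) score -= 25;
  return score;
}

// Score by the tags the container holds.
function childScore(candidate) {
  return candidate.children.reduce(function (sum, tag) {
    return sum + (TAG_SCORES[tag] || 0);
  }, 0);
}

// Score by text: one point per comma, plus up to 3 points for length.
function textScore(candidate) {
  var commas = candidate.text.split(',').length - 1;
  return commas + Math.min(Math.floor(candidate.text.length / 100), 3);
}

// Filter, score, sort by score (highest first), take the first candidate.
function pickBest(candidates) {
  var scored = candidates
    .filter(function (c) { return EXCLUDED_TAGS.indexOf(c.tag) === -1; })
    .map(function (c) {
      return { candidate: c, score: childScore(c) + nameScore(c) + textScore(c) };
    })
    .sort(function (a, b) { return b.score - a.score; });
  return scored.length ? scored[0] : null;
}

var best = pickBest([
  { tag: 'nav', id: 'menu', className: '', text: 'Home, About, Contact', children: [] },
  { tag: 'div', id: 'sidebar', className: 'sidebar', text: 'Links', children: ['p'] },
  { tag: 'div', id: '', className: 'post-content', text: 'First, second, third.', children: ['p', 'p', 'img'] }
]);
console.log(best.candidate.className); // post-content
```

Keeping each score in its own small function makes it easy to tune one signal without touching the others.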
Seize is still in development, so use it at your own risk. You can always help to improve it:
- Improve readme
- Improve text scoring
- Improve detection of pages which can't be extracted
- More tests
- More examples
You are welcome to improve this small piece of software :)