node-soupselect

A port of Simon Willison's soupselect for use with node.js and node-htmlparser.

$ npm install soupselect

Minimal example...

var select = require('soupselect').select;
// dom provided by htmlparser (see the sketch below)...
select(dom, "#main a.article").forEach(function(element) { /* ... */ });
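Here dom is the node tree that node-htmlparser hands to its DefaultHandler callback. A minimal sketch of wiring the two together, using the same handler and parser calls as the complete example below (the sample HTML string is just for illustration):

var htmlparser = require("htmlparser"),
    select = require('soupselect').select;

var handler = new htmlparser.DefaultHandler(function(err, dom) {
    if (err) throw err;
    // dom is ready; select returns the matching nodes as an array...
    select(dom, "#main a.article").forEach(function(element) { /* ... */ });
});
new htmlparser.Parser(handler).parseComplete('<div id="main"><a class="article" href="/x">x</a></div>');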

Wanted a friendly way to scrape HTML using node.js. Tried using jsdom, prompted by this article, but unfortunately jsdom takes a strict view of lax HTML, making it unusable for scraping the kind of soup found in real-world web pages. Luckily, htmlparser is more forgiving. More details on this can be found here.

A complete example, including fetching the HTML:

var select = require('soupselect').select,
    htmlparser = require("htmlparser"),
    http = require('http'),
    sys = require('sys');

// fetch some HTML...
var host = 'www.reddit.com';
var client = http.createClient(80, host);
var request = client.request('GET', '/', {'host': host});

request.on('response', function (response) {
    response.setEncoding('utf8');

    var body = "";
    response.on('data', function (chunk) {
        body = body + chunk;
    });

    response.on('end', function() {

        // now we have the whole body, parse it and select the nodes we want...
        var handler = new htmlparser.DefaultHandler(function(err, dom) {
            if (err) {
                sys.debug("Error: " + err);
            } else {

                // soupselect happening here...
                var titles = select(dom, 'a.title');

                sys.puts("Top stories from reddit");
                titles.forEach(function(title) {
                    sys.puts("- " + title.children[0].raw + " [" + title.attribs.href + "]\n");
                });
            }
        });

        var parser = new htmlparser.Parser(handler);
        parser.parseComplete(body);
    });
});
request.end();

Notes:

  • Requires node-htmlparser > 1.6.2 & node.js 0.2+
  • Calls to select are synchronous; matches come back as a plain array, ready to use on the next line (see the sketch below). Not worth trying to make it asynchronous IMO given the use case.
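A small sketch of the synchronous call style, assuming dom comes from a DefaultHandler callback as in the examples above:

// matches are available immediately; no callback or event needed
var titles = select(dom, 'a.title');
sys.puts(titles.length + " titles found");

// class, id and descendant selectors combine as in the minimal example
select(dom, '#main a.article').forEach(function(element) {
    sys.puts(element.attribs.href);
});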