joakimbeng/spindel


spindel

A web crawler/spider

"spindel" is the Swedish word for spider.

Installation

Install spindel using npm:

npm install --save spindel

Usage

Module usage

Start with a single url

const spindel = require('spindel');

// Start a crawler at http://example.com:
const stream = spindel('http://example.com');

stream.on('data', res => {
	// see response object format below
});

Start with multiple urls

// Start a crawler with an initial queue consisting of two urls:
const stream = spindel([
	'http://example.com',
	'http://another.com'
]);

stream.on('data', res => {
	// see response object format below
});

Use a database as url queue

// Start a crawler with a custom queue:
const redisQueue = {
	popUrl() {
		return getNextUrlFromRedisAndReturnAPromise();
	},
	pushUrl(url) {
		return pushUrlToRedisAndReturnAPromise(url);
	}
};
const stream = spindel(redisQueue);

stream.on('data', res => {
	// see response object format below
});

API

spindel(urlsOrQueue, options)

Name | Type | Description
--- | --- | ---
urlsOrQueue | String, Array or Object | A single url, an array of urls or a queue implementation
options | Object | The options object

Returns: stream.Readable which emits response objects on the 'data' event.
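Since the return value is a stream.Readable, the usual stream events can be used to gather results. The following is a minimal sketch, not part of spindel itself: `collectResponses` is a hypothetical helper, and it assumes the stream emits 'end' once crawling stops.

```javascript
// Hypothetical helper: collects every response object emitted by a stream.
// `stream` is expected to be the return value of spindel(...), but any
// object-mode stream.Readable works.
function collectResponses(stream) {
	return new Promise((resolve, reject) => {
		const responses = [];
		stream.on('data', res => responses.push(res));
		// Assumption: the stream ends when the queue signals that crawling should stop.
		stream.on('end', () => resolve(responses));
		stream.on('error', reject);
	});
}

// Assumed usage:
// collectResponses(spindel('http://example.com'))
// 	.then(responses => console.log(responses.length));
```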

Options

options.transformHtml

Type: Function
Default: noop

Params:

Name | Type | Description
--- | --- | ---
body | String | The response body
url | String | The url for the page being crawled
res | Object | The full response object

Return value: Any or Promise<Any>.

For responses containing HTML (i.e. whose content-type begins with text/ and ends with html), this function is called and its return value is assigned to transformedHtml in the response object.
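As an illustration of the signature above, here is a sketch of a transformHtml function that extracts the page title. The name `extractTitle` and the shape of its return value are assumptions made for this example, and the regular expression is a simplification rather than a real HTML parser.

```javascript
// Hypothetical transformHtml function: pulls the <title> text out of the body.
// The regex is a deliberate simplification and will miss unusual markup.
function extractTitle(body, url, res) {
	const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(body);
	return {
		url,
		statusCode: res.statusCode,
		title: match ? match[1].trim() : null
	};
}

// Assumed usage, following the documented options:
// const stream = spindel('http://example.com', {transformHtml: extractTitle});
// stream.on('data', res => console.log(res.transformedHtml));
```

Because the return value may be anything (or a Promise of anything), returning a small object like this keeps the extracted data next to the raw body in each response object.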

options.gotOptions

Type: Object
Default: {}

Options passed to got.
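A small sketch of an options object that forwards request headers to got. The user agent string is a made-up value for illustration; `headers` is a standard got option, but which other options are available depends on the got version spindel uses.

```javascript
// Options object with settings forwarded to got.
// The user agent value below is hypothetical.
const options = {
	gotOptions: {
		headers: {
			'user-agent': 'my-crawler/1.0 (+http://example.com/bot)'
		}
	}
};

// Assumed usage:
// const stream = spindel('http://example.com', options);
```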

Streamed response objects

A response object has the format:

{
	url: String, // the crawled url
	statusCode: Number, // the HTTP status code
	statusMessage: String, // the HTTP status message
	body: String, // the response body
	headers: Object, // the HTTP response headers
	hrefs: Array(String), // found <a href /> urls in the body if content is HTML
	transformedHtml: Any // if content is HTML this contains the return value of the `transformHtml` option function applied to `body`
}
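To show how the fields of a response object might be used, the sketch below filters `hrefs` down to links on the same host as the crawled url. `sameHostHrefs` is a hypothetical helper; it assumes `hrefs` may hold relative or absolute urls and resolves each one against `url`.

```javascript
// Hypothetical helper: returns the absolute urls in res.hrefs that point
// to the same host as the crawled url.
function sameHostHrefs(res) {
	const base = new URL(res.url);
	return (res.hrefs || [])
		.map(href => {
			try {
				// Resolves relative hrefs against the crawled url.
				return new URL(href, base);
			} catch (_) {
				return null; // skip hrefs that cannot be parsed
			}
		})
		.filter(parsed => parsed && parsed.host === base.host)
		.map(parsed => parsed.href);
}
```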

Queue implementation

A queue implementation consists of two functions: popUrl and pushUrl.

queue.popUrl

Type: Function

Params:

Name | Type | Description
--- | --- | ---
lastUrl | String | The last crawled url, or null for the first url

Should return: String or Promise<String> to continue crawling, or null or Promise<null> to stop crawling.

queue.pushUrl

Type: Function

Params:

Name | Type | Description
--- | --- | ---
href | String | A found href in the currently crawled response body
referral | String | The url for the current crawl

Should return: nothing or Promise.
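Since both functions may return Promises, a queue can be wrapped to add asynchronous behaviour. The sketch below is one such wrapper that waits between urls; `withDelay` and `delayMs` are hypothetical names, and the wrapper relies only on the popUrl/pushUrl contract described above.

```javascript
// Hypothetical wrapper: delays every popUrl call except the first one.
function withDelay(queue, delayMs) {
	return {
		pushUrl(href, referral) {
			return queue.pushUrl(href, referral);
		},
		popUrl(lastUrl) {
			// lastUrl is null for the first url, so no wait is needed then.
			const wait = lastUrl ? delayMs : 0;
			return new Promise(resolve => setTimeout(resolve, wait))
				.then(() => queue.popUrl(lastUrl));
		}
	};
}

// Assumed usage, with redisQueue being any queue implementation:
// const stream = spindel(withDelay(redisQueue, 1000));
```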

Example of the internal ArrayQueue

function arrayQueue(initialUrls) {
	const urls = initialUrls.slice();

	return {
		pushUrl(url) {
			urls.push(url);
		},
		popUrl() {
			return urls.pop();
		}
	};
}

The queue implementation above is used if spindel's urlsOrQueue parameter is a String or Array.
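The array queue shown above keeps no record of urls it has already handed out, and it pops from the end of the array. As an alternative, here is a sketch of a custom queue that skips duplicates, crawls in first-in-first-out order and stops after a fixed number of urls. `uniqueQueue` and `maxUrls` are hypothetical names for this example and are not part of spindel.

```javascript
// Hypothetical queue: FIFO order, no duplicate urls, capped at maxUrls.
function uniqueQueue(initialUrls, maxUrls = 100) {
	const seen = new Set(initialUrls);
	const pending = Array.from(seen);
	let handedOut = 0;

	return {
		pushUrl(href) {
			// Ignore urls that have been queued before.
			if (!seen.has(href)) {
				seen.add(href);
				pending.push(href);
			}
		},
		popUrl() {
			// Returning null tells the crawler to stop.
			if (pending.length === 0 || handedOut >= maxUrls) {
				return null;
			}
			handedOut++;
			return pending.shift();
		}
	};
}

// Assumed usage:
// const stream = spindel(uniqueQueue(['http://example.com'], 50));
```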

License

MIT © Joakim Carlstein
