
# spider

A web spider written in Node.js.

## features

1. Set how many requests are made per second
2. Use `extend` to inherit from the `Spider` and `Crawler` classes

## install

```
npm install xspider
```

## usage

```js
var Spider = require('xspider').Spider,
    Crawler = require('xspider').Crawler;

var s = new Spider('http://www.sina.com.cn/');
```

See `examples/v2ex.js` for a complete example.

## API

### Spider

#### options

Create a spider with options:

```js
var s = new Spider(option);
```

The `option` object supports:

- `maxConnections`: maximum number of concurrent connections
- `rps`: maximum requests per second; if less than 1 (for example `0.5`), the spider sends one request every two seconds
- `maxPages`: maximum number of pages to crawl
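For example, a spider throttled to two requests per second (the option values here are illustrative, not defaults):

```js
var Spider = require('xspider').Spider;

// Illustrative values; the option names follow the list above.
var s = new Spider({
    maxConnections: 10, // at most 10 requests in flight at once
    rps: 2,             // at most 2 requests per second
    maxPages: 100       // stop after crawling 100 pages
});
```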

#### methods

##### start

`start(crawler)`

Start crawling. Before starting, you must set the spider's crawler instance, either with the `crawler()` method or by passing a crawler instance to `start()`. A sketch of both forms follows.
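A minimal sketch of both ways to start, assuming `MyCrawler` is a class built with `Crawler.extend` (see below):

```js
var s = new Spider({ rps: 1 });
var c = new MyCrawler(); // hypothetical Crawler subclass

// Either attach the crawler first and then start...
s.crawler(c);
s.start();

// ...or pass the crawler directly to start().
s.start(c);
```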

##### stop

`stop()`

Stop the spider.

##### cycle

`cycle(crawler, interval)`

##### crawler

`crawler(crawl)`

Set or get the spider's crawler instance.

##### pause

`pause()`

Pause the spider.

##### resume

`resume()`

Resume a paused spider, as in the sketch below.
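A small sketch of pausing a running spider and resuming it later (the 5-second delay is arbitrary):

```js
s.pause(); // stop issuing new requests

setTimeout(function () {
    s.resume(); // pick up where the crawl left off
}, 5000);
```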

#### static method

##### extend

Extend the `Spider` class:

```js
var MySpider = Spider.extend({
    start: function (crawler) {
        // override start() with custom logic here
    }
});
```

### Crawler

#### methods

##### setRoute

Set crawler callbacks for specific URLs.
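The exact signature is not documented here; the sketch below is hypothetical and assumes `setRoute` maps a URL pattern to the name of a handler method:

```js
// Hypothetical usage: both the pattern argument and the
// handler-name argument are assumptions, not documented API.
c.setRoute(/\/t\/\d+/, 'detail');
```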

##### router

`router(url)`

Returns the crawler method that handles the given URL. A crawler has two basic methods: `index` and `detail`.

##### fetch

`fetch(url)`

Used internally. Returns a promise.

##### handle

`handle(url)`

The interface for handling a URL. Internally, it first calls `router()` to find a method that handles the URL; if one is found, it calls `fetch()`, then invokes that method with the URL and its response.
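A conceptual sketch of that flow; this is not the library's actual implementation, only the sequence described above:

```js
// Conceptual only: the real handle() lives inside xspider.
function handle(crawler, url) {
    var method = crawler.router(url);        // e.g. crawler.index or crawler.detail
    if (!method) return Promise.resolve([]); // no handler for this URL

    return crawler.fetch(url).then(function (resp) {
        // The handler returns an array of URLs to crawl next.
        return method.call(crawler, url, resp);
    });
}
```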

#### static method

##### extend

Extend the `Crawler` class:

```js
var MyCrawler = Crawler.extend({
    index: function (url, resp) { return []; },
    detail: function (url, resp) { return []; }
});
```

Each callback should return an array: the list of URLs to crawl next.
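Putting it together, a minimal sketch of a custom crawler driving a spider (the URLs and option values are illustrative):

```js
var xspider = require('xspider'),
    Spider = xspider.Spider,
    Crawler = xspider.Crawler;

var MyCrawler = Crawler.extend({
    index: function (url, resp) {
        // Illustrative: parse resp and return detail-page URLs.
        return ['http://example.com/detail/1'];
    },
    detail: function (url, resp) {
        return []; // leaf page: nothing further to crawl
    }
});

var s = new Spider({ rps: 1, maxPages: 50 });
s.start(new MyCrawler());
```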
