A flexible crawler framework based on Node.js.
TurtleFly is a crawler framework based on Node.js, now at version 2.0, which is simpler and more flexible than version 1.
TurtleFly 2.0 consists of a crawler schema design, a task scheduler, a request module, and a content parsing module:
- Crawler - The core class of the framework; all crawlers extend this class.
- Requester - Used to make web requests.
- RedisScheduler [new] - Handles task interaction with Redis.
- Task [new] - Records task progress and schedules tasks.
The framework's index.js:
```js
module.exports = {
  Crawler: require('./lib/Crawler'),
  Requester: require('./lib/Requester'),
  Scheduler: require('./lib/Scheduler')
}
```
```sh
$ npm i turtlefly
```
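Once installed, the exported modules can be required directly; a minimal sketch using the names declared in index.js above:

```js
// Pull in the framework's exports, matching index.js above.
const { Crawler, Requester, Scheduler } = require('turtlefly');
```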
```
Crawler.constructor(config)
Crawler.loadDom(buffer html, string charset = 'utf-8') : {object} cheerio object
Crawler.getList(object cheerioObj, object schema) : array dataList
Crawler.getContent(object cheerioObj, object schema) : object data
```
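A rough sketch of a crawler subclass follows. The schema shape (CSS selectors keyed by field name), the config shape, and the value that `Requester.handle` resolves with are assumptions, not documented behavior:

```js
const { Crawler, Requester } = require('turtlefly');

class NewsCrawler extends Crawler {
  async crawlList(url) {
    // Assumption: the resolved value is the raw HTML, suitable for loadDom.
    const html = await Requester.handle(url, { charset: 'UTF-8' });
    const $ = this.loadDom(html, 'utf-8'); // returns a cheerio object
    // Assumption: the schema maps field names to CSS selectors.
    return this.getList($, { item: '.article', title: 'h2 a' }); // -> array dataList
  }
}

new NewsCrawler({ name: 'news' }) // config shape is an assumption
  .crawlList('https://example.com/news')
  .then(console.log)
  .catch(console.error);
```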
```
Requester.handle(string url, {charset = "UTF-8", respType = 'object', config = {}} = {}) : Promise
```
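A hedged usage example; the contents of `config` (assumed to be forwarded to the underlying HTTP client) are a guess:

```js
const { Requester } = require('turtlefly');

// Options mirror the signature above; charset and respType use the defaults.
Requester.handle('https://example.com/list', {
  charset: 'UTF-8',
  respType: 'object',
  config: {}, // assumed to pass through to the underlying request
})
  .then((resp) => console.log(resp))
  .catch(console.error);
```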
```
Task.constructor(object crawler)
Task.next(bool nextPage = true, bool nextParam = false)
Task.request({ url, charset, respType, method, config, payload })
```
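A sketch of driving a task. Note that index.js above does not re-export Task, so the import path is an assumption, and `myCrawler` is a placeholder crawler instance:

```js
// Assumption: Task lives under the package's lib/ directory like the
// other modules; adjust the path to match the actual layout.
const Task = require('turtlefly/lib/Task');

async function run(myCrawler) {
  const task = new Task(myCrawler); // bind the task to a crawler instance

  // Fetch one page through the task so progress is recorded.
  const page = await task.request({
    url: 'https://example.com/list?page=1',
    charset: 'UTF-8',
    method: 'GET',
  });
  console.log(page);

  // Advance to the next page of the current parameter set.
  await task.next(true, false);
}
```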
The RedisScheduler module provides support for distributed task control.
Redis key layout (`${key}` is the current crawler project's name):
```
# Crawler task list, records task info.
turtlefly:task:${key}

# Task status - checked when the crawler starts; set to none when the task finishes.
# working|none
turtlefly:task:${key}:status {string}

# Pending tasks
turtlefly:task:${key}:pending {list} [task detail JSON]

# In-progress tasks
turtlefly:task:${key}:inProgress {hash} [hash -> task detail JSON]
```
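The layout above can be inspected with any Redis client; a sketch using ioredis (the client choice is illustrative, and `myProject` is a placeholder project name):

```js
const Redis = require('ioredis');
const redis = new Redis(); // defaults to localhost:6379

async function inspect(key) {
  // Keys mirror the layout above; ${key} is the crawler project name.
  const status = await redis.get(`turtlefly:task:${key}:status`);             // 'working' | 'none'
  const pending = await redis.lrange(`turtlefly:task:${key}:pending`, 0, -1); // task detail JSON strings
  const inProgress = await redis.hgetall(`turtlefly:task:${key}:inProgress`); // hash -> task detail JSON
  console.log({ status, pending, inProgress });
}

inspect('myProject').catch(console.error);
```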