Skip to content

imcsd/turtlefly

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TurtleFly

A flexible crawler framework based on NodeJS.

Intro

TurtleFly is a Crawler framework based on Node.js, and now with the version 2.0. which is more simplify and flexible then version 1.

Turtlefly 2.0 consist Crawler schema design, task scheduler, request module and content parse module.

Module

  • Crawler - The core class of Node Crawler, all crawler extends this class.
  • Requester - Use to make Web request.
  • RedisScheduler [new] - Task interaction with redis.
  • Task [new] - Record task process and schedule task.

Code of framework index.js file:

module.exports = {
	Crawler: require('./lib/Crawler'),
	Requester: require('./lib/Requester'),
	Scheduler: require('./lib/Scheduler')
}

Installation

$ npm i turtlefly

Main Modules API

Crawler

Crawler.constructor(config)

Crawler.loadDom(buffer html, string charset = 'utf-8') : {object} cheerio object

Crawler.getList(object cheerioObj, object schema) : array dataList

Crawler.getContent(object cheerioObj, object schema) : object data

Requester

Requester.handle(string url, {charset = "UTF-8", respType = 'object', config = {}} = {}) : Promise

Task

Task.constructor(object crawler)

Task.next(bool nextPage = true, bool nextParam = false)

Task.request({ url, charset, respType, method, config, payload })

RedisScheduler

RedisScheduler module provide support for distributed task control.

Redis data construction:

  • ${key} for current crawler project name.
# Crawler task list, record task info.
turtlefly:task:${key}

# Task status - check this status when crawler start and switch to none when task finished.
# working|none
turtlefly:task:${key}:status {string}

# Pending task
turtlefly:task:${key}:pending {list} [task detail json]

# In progress task
turtlefly:task:${key}:inProgress {hash} [hask -> task detail json]

About

A crawler framework based on NodeJS.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •