SYMFONY-SPIDER

A simple, easy-to-use multi-process web crawler built on PHP's Symfony framework.

Requirements

php >= 5.6

Installation

git clone git@github.com:Jaggle/symfony-spider.git spider
cd spider 
composer install

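At the end of composer install, follow the prompts to enter the database configuration and the Redis DSN (for example, redis://pass@localhost). Provide the database name at this point, but you do not need to create the database yet; the command in the next step creates it automatically.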
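To make the shape of the Redis DSN concrete, the sketch below splits a DSN of the form shown above into its password and host parts using plain POSIX shell. The function name `parse_redis_dsn` and the variables `REDIS_PASS` and `REDIS_HOST` are illustrative assumptions and are not part of symfony-spider; the project's own DSN handling may accept more forms (a port or database index, for instance), and a password containing "@" is not handled here.

```shell
# Hypothetical helper (not part of symfony-spider): split a DSN such as
# redis://pass@localhost into password and host using parameter expansion.
parse_redis_dsn() {
  dsn=$1
  case $dsn in
    redis://*) rest=${dsn#redis://} ;;
    *) echo "not a redis DSN: $dsn" >&2; return 1 ;;
  esac
  case $rest in
    *@*)
      REDIS_PASS=${rest%%@*}   # text before the first "@"
      REDIS_HOST=${rest#*@}    # text after the first "@"
      ;;
    *)
      REDIS_PASS=              # no password given
      REDIS_HOST=$rest
      ;;
  esac
}
```

After `parse_redis_dsn "redis://pass@localhost"`, `REDIS_PASS` holds `pass` and `REDIS_HOST` holds `localhost`.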

Create the database

php app/console doctrine:database:create

Create the table schema

php app/console doctrine:schema:update --force --dump-sql

Create a spider

php app/console spider:create

Add crawl rules

vim app/config/rules.json

About the rules:

Currently only four fields can be scraped. The following example scrapes Zhihu:

{
  "sf-spider" : {
    "linkRule": {
      "status": false,
      "rule": ""
    },
    "documentRule": {
      "title":  {
        "type": "text",
        "rule": "h1.QuestionHeader-title"
      },
      "meta": {
        "type": "href",
        "rule": ".UserLink-link"
      },
      "desc": {
        "type": "text",
        "rule": ".QuestionHeader-detail span.RichText.ztext"
      },
      "content": {
        "type": "html",
        "rule": "div.RichContent .RichContent-inner"
      }
    }
  }
}
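Judging from the example above, each of the four fields (title, meta, desc, content) pairs a "type" (text, href, or html) with a CSS-selector "rule". As a quick sanity check after editing, the sketch below reports which of those four field names a rules file fails to mention. The helper `check_rule_fields` is hypothetical and not part of symfony-spider, and it is a plain-text grep rather than a JSON parser, so it catches omitted keys but not malformed JSON. Whether the spider requires all four fields is not stated in this README; the check only flags what is absent.

```shell
# Hypothetical helper (not part of symfony-spider): report which of the four
# fields from the example (title, meta, desc, content) a rules file mentions.
# This is a textual check only; it does not validate the JSON itself.
check_rule_fields() {
  file=$1
  missing=""
  for field in title meta desc content; do
    # Look for a JSON key such as "title": anywhere in the file.
    if ! grep -q "\"$field\"[[:space:]]*:" "$file"; then
      missing="$missing $field"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing fields:$missing"
    return 1
  fi
  echo "all four fields present"
}
```

For example, `check_rule_fields app/config/rules.json` could be run before starting the spider.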

Run the spider

SPIDER_NAME is the name of the spider you created; for example, mine is "sf-spider".

php app/console spider:run --spiderName=SPIDER_NAME --workerCount=4

or

php app/console spider:run SPIDER_NAME --workerCount=4

or, with log output enabled:

php app/console spider:run SPIDER_NAME --workerCount=4 --debug

workerCount: number of worker processes to start; defaults to 1

spiderName: name of the spider; defaults to "default"
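The two defaults can be made explicit with a small wrapper. The sketch below only prints the command line it would run, filling in the documented defaults when an argument is omitted; the function name `spider_run_cmd` and the positive-integer check are assumptions for illustration and are not part of symfony-spider.

```shell
# Hypothetical wrapper (not part of symfony-spider): print the spider:run
# command line, filling in the documented defaults when arguments are omitted.
spider_run_cmd() {
  name=${1:-default}   # spiderName defaults to "default"
  workers=${2:-1}      # workerCount defaults to 1
  case $workers in
    ''|*[!0-9]*|0)
      echo "workerCount must be a positive integer" >&2
      return 1
      ;;
  esac
  echo "php app/console spider:run $name --workerCount=$workers"
}
```

Calling `spider_run_cmd sf-spider 4` prints the positional form of the command shown above, and calling it with no arguments prints the same command with "default" and a worker count of 1.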

Execution flow

                    | -- job process <fetches page content>
master process -----| -- job process <fetches page content>
                    | -- job process <fetches page content>


task queue ----| -- document task: parses the page and stores the document in the database
               | -- job task: tracks job status and assigns jobs to job processes
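The master/job fan-out in the first diagram can be imitated in a few lines of shell. This is a toy illustration of the process model only: the real spider is written in PHP, the `master` and `job` function names are made up for this sketch, and the task queue from the second diagram is left out entirely.

```shell
# Illustrative only: a toy "master" that starts N background "job" processes
# and waits for all of them, mirroring the first diagram.
job() {
  # Stand-in for fetching a page; $1 is the job number.
  echo "job $1: fetched page"
}

master() {
  count=$1
  i=1
  while [ "$i" -le "$count" ]; do
    job "$i" &        # each job runs in its own child process
    i=$((i + 1))
  done
  wait                # block until every job process has exited
}
```

Running `master 3` prints three lines, one per job, in an order that may vary between runs because the jobs execute concurrently.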
