This branch is 439 commits behind binux:master.

pyspider

A Powerful Spider (Web Crawler) System in Python.

Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

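# BaseHandler, every, and config are provided by this wildcard import.
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # Project-wide crawl settings (empty here, so defaults apply).
    crawl_config = {
    }

    # Run on_start once a day to seed the crawl.
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # Treat a fetched index page as valid for 10 days before re-crawling it.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every link whose href starts with "http".
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is stored as the result for this page.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }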
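
To make two ideas from the sample concrete without installing pyspider, here is a minimal standard-library sketch. It approximates what the selector a[href^="http"] picks out and how an age check of the kind set by @config(age=...) might decide whether a page is still fresh. The names HttpLinkParser, extract_http_links, and is_fresh are hypothetical and are not part of pyspider, which does this work through response.doc and its scheduler.

```python
import time
from html.parser import HTMLParser


class HttpLinkParser(HTMLParser):
    """Collect href values of <a> tags whose href starts with "http".

    This roughly mirrors the CSS selector a[href^="http"], which also
    matches "https" URLs because they share the same prefix.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith("http"):
                self.links.append(value)
                break  # one link per tag is enough


def extract_http_links(html):
    """Return absolute http(s) links found in an HTML string."""
    parser = HttpLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def is_fresh(last_crawled, age, now=None):
    """Return True if a page fetched at `last_crawled` is younger than `age` seconds.

    Assumption: timestamps are Unix seconds, and None means "never crawled".
    """
    if last_crawled is None:
        return False
    if now is None:
        now = time.time()
    return now - last_crawled < age


TEN_DAYS = 10 * 24 * 60 * 60  # same value as age in the sample handler

sample_html = """
<a href="http://example.com/a">absolute</a>
<a href="/relative">relative</a>
<a href="https://example.com/b">secure</a>
<a name="anchor">no href</a>
"""

links = extract_http_links(sample_html)
```

In this sketch only the two absolute links are kept, and a crawler built on it would skip any URL for which is_fresh returns True.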

Demo

Installation

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/

Contribute

TODO

v0.4.0

  • local mode, load script from file.
  • works as a framework (all components running in one process, no threads)
  • redis
  • shell mode like scrapy shell
  • a visual scraping interface like portia

more

  • edit script with vim via WebDAV

License

Licensed under the Apache License, Version 2.0
