A Python web crawler using Tornado and ZeroMQ


Spyder

ALONG CAME A SPIDER

Spyder is a scalable web spider written in Python, using the non-blocking Tornado library and ZeroMQ as the messaging layer. The messages are serialized using Thrift.

The architecture is very basic: a Master process contains the crawl Frontier, which organises the URLs that need to be crawled; several Worker processes download the content and extract new URLs to be crawled in the future. To store the content, you can attach a Sink to the Master and be informed about the interesting events for a URL.
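To make the division of labour concrete, here is a minimal single-process sketch of that flow. It is not Spyder's implementation: the class and function names (`Frontier`, `ListSink`, `worker`, `run_master`) and the in-memory `FAKE_WEB` are assumptions made for illustration, and the real system runs these roles in separate processes that talk over ZeroMQ.

```python
from collections import deque


class Frontier:
    """Toy frontier: queues URLs to crawl and remembers which were seen."""

    def __init__(self, start_urls):
        self._queue = deque()
        self._seen = set()
        for url in start_urls:
            self.add(url)

    def add(self, url):
        # Schedule a URL only the first time it is seen.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        return self._queue.popleft() if self._queue else None


class ListSink:
    """Toy sink: collects (url, content) pairs in memory."""

    def __init__(self):
        self.stored = []

    def on_success(self, url, content):
        self.stored.append((url, content))


def worker(url, fake_web):
    """Stand-in for a Worker: 'downloads' a page and extracts its links."""
    content, links = fake_web.get(url, ("", []))
    return content, links


def run_master(frontier, sink, fake_web):
    """Stand-in for the Master: hand out URLs and feed results back."""
    url = frontier.next_url()
    while url is not None:
        content, links = worker(url, fake_web)
        sink.on_success(url, content)
        for link in links:
            frontier.add(link)
        url = frontier.next_url()


# Hypothetical three-page site: url -> (content, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/"]),
    "http://example.com/b": ("page b", []),
}

sink = ListSink()
run_master(Frontier(["http://example.com/"]), sink, FAKE_WEB)
crawled = [url for url, _ in sink.stored]
```

The point of the sketch is the feedback loop: links extracted by a Worker go back into the Frontier, while the Sink only observes results.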

Getting Started

Spyder is just a library for creating web crawlers. To actually crawl content, you first have to create a Spyder skeleton:

$ mkdir my-crawler && cd my-crawler
$ spyder start
$ ls
log logging.conf master.py settings.py sink.py spyder-ctrl.py

This will copy the skeleton into my-crawler. The main file is settings.py. In it, you can configure the logging level for Masters and Workers and define the crawl scope. In master.py you should manipulate the starting URLs and add your specific sink.py into the Frontier. spyder-ctrl.py is just a small control script that helps you start the Log Sink, Master and Worker.

The skeleton is set up to crawl sailing-related pages from DMOZ. That should give you a starting point for your own crawler.

Once you have written your sink and configured everything correctly, it's time to start crawling. First, start the logsink on one of your nodes:
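As an illustration of what "defining the crawl scope" can mean, the sketch below filters URLs with positive and negative regular expressions. The names `SCOPE_POSITIVE`, `SCOPE_NEGATIVE` and `in_scope`, and the example patterns, are hypothetical; check the generated settings.py for the names and defaults Spyder actually uses.

```python
import re

# Hypothetical patterns; the skeleton's settings.py may use other names.
SCOPE_POSITIVE = [r"^https?://([a-z0-9-]+\.)*example\.org/sailing/"]
SCOPE_NEGATIVE = [r"\.(jpg|png|gif|pdf)$"]


def in_scope(url, positive=SCOPE_POSITIVE, negative=SCOPE_NEGATIVE):
    """Return True if url matches a positive pattern and no negative one."""
    if any(re.search(pattern, url) for pattern in negative):
        return False
    return any(re.search(pattern, url) for pattern in positive)
```

Checking the negative patterns first keeps unwanted file types out even when they live under an otherwise interesting path.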

$ spyder-ctrl.py logsink &

Then, on one node (for example, the same one as the logsink), start the Master:

$ spyder-ctrl.py master &

Finally, you can start as many Workers as you want:

$ spyder-ctrl.py worker &
$ spyder-ctrl.py worker &
$ spyder-ctrl.py worker &

Here we started three Workers because this is a powerful node with a quad-core CPU.

Scaling the Crawl

With the default settings, Workers cannot be started on different nodes. Most of the time a single node is powerful enough to crawl a considerable amount of data, but sometimes you want to spread the crawl across many nodes. This can be done by configuring the ZeroMQ transports along these lines:

ZEROMQ_MASTER_PUSH = "tcp://NodeA:5005"
ZEROMQ_MASTER_SUB = "tcp://NodeA:5007"

ZEROMQ_MGMT_MASTER = "tcp://NodeA:5008"
ZEROMQ_MGMT_WORKER = "tcp://NodeA:5009"

ZEROMQ_LOGGING = "tcp://NodeA:5010"

With these settings we have set up a two-node crawl cluster. NodeA acts as the logging sink and controls the crawl via the Master. NodeB is a pure Worker node. Only the Master actually binds ZeroMQ sockets; the Workers always connect to them, so the Master does not have to know where the Workers are running.
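Because every endpoint points at the Master's node, the five settings differ only in their port. A small helper can derive them from a single host name so that all nodes share identical values. This is a sketch: `ZEROMQ_PORTS` and `zeromq_endpoints` are hypothetical names and not part of Spyder, while the setting names and port numbers are copied from the example settings above.

```python
# Setting names and ports copied from the example settings above.
ZEROMQ_PORTS = {
    "ZEROMQ_MASTER_PUSH": 5005,
    "ZEROMQ_MASTER_SUB": 5007,
    "ZEROMQ_MGMT_MASTER": 5008,
    "ZEROMQ_MGMT_WORKER": 5009,
    "ZEROMQ_LOGGING": 5010,
}


def zeromq_endpoints(master_host):
    """Build tcp:// endpoints that all point at the Master's node."""
    return {
        name: "tcp://%s:%d" % (master_host, port)
        for name, port in ZEROMQ_PORTS.items()
    }


endpoints = zeromq_endpoints("NodeA")
```

In settings.py the result could be applied with something like `globals().update(zeromq_endpoints("NodeA"))`, assuming the settings are read as plain module-level names.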

From here

There is plenty of room for improvement and development ahead. Everything will be handled through GitHub tickets from now on and, if there is interest, we may set up a Google Group.
