a complex but scalable web spider

arachne

Arachne is meant to be a next-generation version of hiispider, a flexible web spider written at hiidef for flavors.me. It has a very similar high-level architecture but implements it differently to achieve a few important objectives:

  • HTTP interfaces should be rich and easily extendable
  • Plugins should be easy to run synchronously
  • DRY-ness in the resultant plugin code
  • Should not depend on undocumented architectural decisions

Asynchronicity is achieved with gevent. Users of arachne are expected to apply gevent's monkey patching themselves. Without patching, arachne behaves synchronously, and nearly all of its clients and libraries are usable from the Python shell.
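As a minimal sketch of what applying the patch could look like in an application entry point: `gevent.monkey.patch_all()` is gevent's own call, while the `enable_async` helper name is hypothetical and not part of arachne. The fallback branch is an assumption that mirrors the synchronous behavior described above.

```python
def enable_async():
    """Apply gevent's monkey patching if gevent is available.

    Returns True when the standard library was patched, and False when
    gevent is not installed, in which case code keeps running synchronously.
    """
    try:
        from gevent import monkey
    except ImportError:
        # Assumption: without gevent we simply stay synchronous.
        return False
    monkey.patch_all()
    return True
```

Patching works best when it happens as early as possible, before other modules that open sockets are imported, so a call to `enable_async()` would normally sit at the very top of the entry point.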

architectural overview

Arachne is split into three major pieces:

  • A scheduler which puts jobs on a queue
  • A worker which executes scheduled jobs
  • An interface which runs jobs on demand via HTTP

Every job is tied to a method implemented in a plugin. Arachne makes certain basic assumptions and decisions for you, and takes care of the following:

  • Mapping URLs to plugin methods
  • Basic plugin execution and result storage
  • Registration and lookup for available plugins
  • Associating a run-interval (every n seconds) with each plugin method
  • Daemonization, start/stop/restart & pidfiles
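To make the first four responsibilities concrete, here is a rough sketch of a registry that maps URL patterns to plugin methods and attaches a run interval to each. Every name in it (`REGISTRY`, `plugin_method`, `resolve`, `example_profile`) is hypothetical and chosen for illustration; arachne's actual plugin API may look quite different.

```python
import re

# Hypothetical registry; not arachne's real data structure.
REGISTRY = {}


def plugin_method(url_pattern, interval):
    """Register a function under a URL pattern with a run interval in seconds."""
    def decorator(func):
        REGISTRY[func.__name__] = {
            "pattern": re.compile(url_pattern),
            "interval": interval,
            "func": func,
        }
        return func
    return decorator


def resolve(path):
    """Return (entry, kwargs) for the first registered pattern matching path."""
    for entry in REGISTRY.values():
        match = entry["pattern"].fullmatch(path)
        if match:
            return entry, match.groupdict()
    raise LookupError("no plugin method for %r" % path)


@plugin_method(r"/example/(?P<username>\w+)/profile", interval=300)
def example_profile(username):
    # A real plugin would fetch remote data here; this one just echoes.
    return {"username": username}


# Look the method up by URL, then run it with the captured arguments.
entry, kwargs = resolve("/example/alice/profile")
result = entry["func"](**kwargs)
```

The same registry entry serves both the HTTP interface (lookup by path) and the scheduler (lookup of the interval), which is one way to keep plugin code free of repetition.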

You will have to decide:

  • What a "job" looks like coming on and off the queue
  • Where and how to store plugin results
  • How to schedule those jobs
  • How to store data necessary to run the jobs
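One possible set of answers to these questions, sketched with the standard library only: a job is a small JSON document, the queue is an in-process `queue.Queue`, and results are appended to a list. All names here (`encode_job`, `Scheduler`, `work`) are hypothetical, and a real deployment would presumably swap in a message broker and a persistent store.

```python
import heapq
import json
import queue


def encode_job(plugin, method, args):
    """Serialize a job so it can travel over a queue."""
    return json.dumps({"plugin": plugin, "method": method, "args": args},
                      sort_keys=True)


def decode_job(raw):
    return json.loads(raw)


class Scheduler:
    """Puts due jobs on a queue and re-arms each one after its interval."""

    def __init__(self, job_queue):
        self.job_queue = job_queue
        self._heap = []  # entries are (next_run_time, encoded_job, interval)

    def add(self, raw_job, interval, now=0.0):
        if interval <= 0:
            raise ValueError("interval must be positive")
        heapq.heappush(self._heap, (now + interval, raw_job, interval))

    def tick(self, now):
        """Enqueue every job whose run time has passed; return how many."""
        enqueued = 0
        while self._heap and self._heap[0][0] <= now:
            _, raw_job, interval = heapq.heappop(self._heap)
            self.job_queue.put(raw_job)
            heapq.heappush(self._heap, (now + interval, raw_job, interval))
            enqueued += 1
        return enqueued


def work(job_queue, handlers, results):
    """Drain the queue, run each job's handler, and store the result."""
    while True:
        try:
            raw_job = job_queue.get_nowait()
        except queue.Empty:
            return
        job = decode_job(raw_job)
        handler = handlers[(job["plugin"], job["method"])]
        results.append(handler(**job["args"]))


jobs = queue.Queue()
scheduler = Scheduler(jobs)
scheduler.add(encode_job("example", "profile", {"username": "alice"}), interval=60)
early = scheduler.tick(now=30)  # not due yet
due = scheduler.tick(now=60)    # due: enqueued and re-armed for t=120

stored = []
handlers = {("example", "profile"): lambda username: {"username": username}}
work(jobs, handlers, stored)
```

Passing `now` explicitly instead of reading the clock inside `tick` keeps the sketch deterministic and easy to test.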

batteries

Arachne comes with a number of batteries included:

  • a simple no-magic configuration management system

  • a rich HTTP library, based on requests, with:
    • header caching on a pluggable backend (e.g. memcached)
    • header-based JSON/XML parsing with forced overrides
    • OAuth 1.0a helpers (via requests-oauth)
    • alternate session-style helpers with base-URL support
  • a memcached wrapper based on ultramemcache

  • a MySQL wrapper based on ultramysql

  • an AMQP client based on kombu and amqplib

  • a Cassandra client based on pycassa

All of these clients will attempt to auto-configure with arachne's configuration management system.
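To illustrate the idea behind header-based JSON/XML parsing with forced overrides, here is a small standard-library sketch. The `parse_body` function and its `force` parameter are hypothetical names used for illustration only; they are not arachne's HTTP library, which is built on requests.

```python
import json
import xml.etree.ElementTree as ET


def parse_body(body, content_type="", force=None):
    """Parse body as JSON or XML based on the Content-Type header.

    force ("json" or "xml") overrides the header, which helps with
    servers that mislabel their responses. Unknown types are returned
    unparsed.
    """
    if force not in (None, "json", "xml"):
        raise ValueError("force must be 'json', 'xml', or None")

    if force is not None:
        kind = force
    else:
        # Drop parameters such as "; charset=utf-8" before comparing.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/json" or media_type.endswith("+json"):
            kind = "json"
        elif media_type in ("application/xml", "text/xml") or media_type.endswith("+xml"):
            kind = "xml"
        else:
            kind = None

    if kind == "json":
        return json.loads(body)
    if kind == "xml":
        return ET.fromstring(body)
    return body


parsed = parse_body('{"ok": true}', "application/json; charset=utf-8")
forced = parse_body('{"ok": true}', "text/html", force="json")
untouched = parse_body("<p>hi</p>", "text/html")
tree = parse_body("<a><b>1</b></a>", "text/xml")
```

Here `forced` shows the override path: the header claims HTML, but the caller knows the payload is JSON and says so explicitly.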
