A basic web crawler written in Python for the dotCloud platform.
.gitignore
README.md
dotcloud.yml
mkdb.py
requirements.txt
supervisord.conf
tools.py
worker.py
wsgi.py


DotCloud Crawler Sample

A DotCloud sample for experimenting with the platform. This crawler, written in Python, returns the image links it encounters while traversing pages, down to a configurable depth.
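
To make the idea of a depth-limited traversal concrete, here is a minimal sketch using only the standard library. It is not the code in worker.py: the names `LinkAndImageParser` and `crawl_images`, the injected `fetch` callable, and the convention that depth 0 scans only the start pages are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageParser(HTMLParser):
    """Collects absolute <a href> and <img src> URLs from one page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def crawl_images(start_urls, depth, fetch):
    """Breadth-first traversal returning the sorted image URLs found.

    `fetch(url)` must return the page HTML as a string, or None on failure.
    Assumption: depth 0 scans only the start pages, depth 1 also follows
    their links, and so on.
    """
    seen = set(start_urls)
    frontier = list(start_urls)
    images = set()
    for level in range(depth + 1):
        next_frontier = []
        for url in frontier:
            html = fetch(url)
            if html is None:
                continue
            parser = LinkAndImageParser(url)
            parser.feed(html)
            images.update(parser.images)
            if level < depth:
                # Queue unseen links for the next level only.
                for link in parser.links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
        frontier = next_frontier
    return sorted(images)
```

Passing `fetch` in as a parameter keeps the traversal logic independent of the HTTP client, so it can be exercised against an in-memory dictionary of pages.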

Installation

Spawning the whole thing is pretty straightforward:

$> git clone git://github.com/jpbillaud/dotcloud-crawler.git
$> cd dotcloud-crawler
$> dotcloud create mycrawler
$> dotcloud push mycrawler

Note that after the first deployment, a database named 'crawldb' must be created manually (to be fixed):

$> dotcloud run mycrawler.db -- mysql -uroot -p<root_password>
mysql> create database crawldb;

Usage

The crawler's www service accepts a JSON document as input, such as:

$> cat myurls
{ "depth": "1",
  "urls": [
  "http://www.dotcloud.com",
  "http://www.cnn.com",
  "http://www.tf1.fr",
  "http://www.facebook.com" ] }

$> curl -X POST -D /dev/stderr -d @myurls <your_dotcloud_www_app>
HTTP/1.1 201 Created
Server: nginx/1.0.11
Date: Sun, 05 Feb 2012 09:21:54 GMT
Content-Type: text/plain
Location: http://<your_dotcloud_www_app>/57
Transfer-Encoding: chunked
Connection: keep-alive
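
The request above can also be issued from Python. The following is a rough sketch with the standard library only; `build_payload`, `job_id_from_location`, `submit_job`, and the `service_url` parameter are hypothetical names, and the sketch assumes the job id is the last path segment of the `Location` header, as in the sample response.

```python
import json
import urllib.request


def build_payload(urls, depth=1):
    """Serialize a crawl request in the shape shown in the sample file."""
    # The sample sends depth as a string, so this sketch does the same.
    return json.dumps({"depth": str(depth), "urls": list(urls)}).encode("utf-8")


def job_id_from_location(location):
    """Return the last path segment of a Location header value, e.g. '57'."""
    return location.rstrip("/").rsplit("/", 1)[-1]


def submit_job(service_url, urls, depth=1):
    """POST the payload and return the job id (assumed to be in Location)."""
    request = urllib.request.Request(
        service_url, data=build_payload(urls, depth), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return job_id_from_location(response.headers["Location"])
```

`service_url` is expected to include the scheme, for example `http://` followed by your www application's host name.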

At any time, you can fetch the status of a given job (returned as JSON):

$> curl -X GET <your_dotcloud_www_app>/57/status
{ "result": {
     "urls_completed": "1",
     "urls_requested": "1",
     "creation": "2012-02-06 09:11:02" } }
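
A client might use this status document to decide when to stop polling. The sketch below assumes, without confirmation from the source, that a job is finished once `urls_completed` reaches `urls_requested`; `fetch_status` and `is_complete` are hypothetical helper names.

```python
import json
import urllib.request


def fetch_status(service_url, job_id):
    """GET <service_url>/<job_id>/status and decode the JSON body."""
    url = f"{service_url.rstrip('/')}/{job_id}/status"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def is_complete(status):
    """Assumption: the job is done when every requested URL is completed."""
    result = status["result"]
    # The counters arrive as strings in the sample, so convert before comparing.
    return int(result["urls_completed"]) >= int(result["urls_requested"])
```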

The incremental result (also returned as JSON) is available in the same way:

$> curl -X GET <your_dotcloud_www_app>/57/result
{ "images": [ "http://files.posterous.com/user_profile_pics/1183290/avatar_thumb.jpg",
[...]
"http://www.dotcloud.com/about/static/img/navbar-signup.png",
"http://www.dotcloud.com/about/static/img/padlock.png",
"http://www.dotcloud.com/accounts/login?next=/dashboard/static/img/logo.png",
"http://www.dotcloud.com/accounts/login?next=/dashboard/static/img/navbar-signup.png",
"http://www.dotcloud.com/accounts/login?next=/dashboard/static/img/padlock.png",
"http://www.dotcloud.com/accounts/register/static/img/logo.png",
"http://www.dotcloud.com/accounts/register/static/img/navbar-signup.png",
"http://www.dotcloud.com/accounts/register/static/img/padlock.png",
"http://www.dotcloud.com/gallery/static/img/logo.png",
"http://www.dotcloud.com/gallery/static/img/navbar-signup.png",
"http://www.dotcloud.com/gallery/static/img/padlock.png" ] }
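
Since the result is a flat list of image URLs, a client may want to deduplicate it and group it by host. The helper below is only an illustration of consuming the response shape shown above; `images_by_host` is a hypothetical name and not part of the service.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit


def images_by_host(result_json):
    """Group the "images" list of a result document by host name."""
    images = json.loads(result_json)["images"]
    grouped = defaultdict(list)
    # dict.fromkeys drops duplicates while preserving first-seen order.
    for url in dict.fromkeys(images):
        grouped[urlsplit(url).netloc].append(url)
    return dict(grouped)
```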