Pythonic Crawling / Scraping Framework based on non-blocking I/O operations.
Latest commit 3953dca Jun 13, 2015

Pythonic Crawling / Scraping Framework Built on Eventlet



  • High-speed web crawler built on Eventlet.
  • Supports relational database engines such as PostgreSQL, MySQL, Oracle and SQLite.
  • Supports NoSQL databases such as MongoDB and CouchDB. New!
  • Export your data to JSON, XML or CSV formats. New!
  • Command line tools.
  • Extract data using your favourite tool: XPath or PyQuery (a jQuery-like library for Python).
  • Cookie Handlers.
  • Very easy to use (see the example).


Project WebSite

To install crawley run

~$ python setup.py install

or from pip

~$ pip install crawley

To start a new project run

~$ crawley startproject [project_name]
~$ cd [project_name]

Write your Models

""" """

from crawley.persistance import Entity, UrlEntity, Field, Unicode

class Package(Entity):
    #add your table fields here
    updated = Field(Unicode(255))    
    package = Field(Unicode(255))
    description = Field(Unicode(255))
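
As a rough illustration of what the model above declares, the sketch below builds a comparable three-column table with Python's standard sqlite3 module. The table name `package`, the `id` column and the helper `create_package_table` are assumptions made for this example; crawley's own persistence layer may name and create things differently.

```python
import sqlite3


def create_package_table(conn):
    # Hypothetical plain-SQL counterpart of the Package entity.
    # Columns are declared as VARCHAR(255); note that SQLite itself
    # does not enforce the declared length.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS package ("
        "id INTEGER PRIMARY KEY, "
        "updated VARCHAR(255), "
        "package VARCHAR(255), "
        "description VARCHAR(255))"
    )


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
create_package_table(conn)

# Made-up sample row, standing in for one scraped table line.
conn.execute(
    "INSERT INTO package (updated, package, description) VALUES (?, ?, ?)",
    ("2015-06-13", "crawley", "Pythonic crawling framework"),
)
rows = conn.execute(
    "SELECT updated, package, description FROM package"
).fetchall()
print(rows)
```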

Write your Scrapers

""" """

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *

class pypiScraper(BaseScraper):
    #specify the urls that can be scraped by this class
    matching_urls = ["%"]
    def scrape(self, response):
        #getting the current document's url.
        current_url = response.url        
        #getting the html table.
        table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]
        #for rows 1 to n-1
        for tr in table[1:-1]:
            #obtaining the searched html inside the rows
            td_updated = tr[0]
            td_package = tr[1]
            package_link = td_package[0]
            td_description = tr[2]
            #storing data in Packages table
            Package(updated=td_updated.text, package=package_link.text, description=td_description.text)

class pypiCrawler(BaseCrawler):
    #add your starting urls here
    start_urls = [""]
    #add your scraper classes here    
    scrapers = [pypiScraper]
    #specify your maximum crawling depth level
    max_depth = 0
    #select your favourite HTML parsing tool
    extractor = XPathExtractor
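
The row-walking logic inside scrape() can be tried in isolation. The sketch below is not crawley code: it swaps in the standard library's xml.etree.ElementTree for the XPath extractor, and both the sample markup and the helper `extract_rows` are invented for illustration. It shows why the loop slices the table with [1:-1]: the first row (header) and the last row are skipped.

```python
import xml.etree.ElementTree as ET

# Invented sample markup shaped like the table the scraper expects:
# a header row, two data rows and a trailing row to be ignored.
SAMPLE = """
<table>
  <tr><th>Updated</th><th>Package</th><th>Description</th></tr>
  <tr><td>2015-06-13</td><td><a href="#">crawley</a></td><td>Crawling framework</td></tr>
  <tr><td>2015-06-12</td><td><a href="#">eventlet</a></td><td>Networking library</td></tr>
  <tr><td>footer</td><td><a href="#">more</a></td><td>-</td></tr>
</table>
""".strip()


def extract_rows(table):
    records = []
    # table[1:-1] drops the first and the last <tr> element.
    for tr in table[1:-1]:
        td_updated, td_package, td_description = tr[0], tr[1], tr[2]
        # The package name lives in the <a> element inside the cell.
        package_link = td_package[0]
        records.append(
            (td_updated.text, package_link.text, td_description.text)
        )
    return records


table = ET.fromstring(SAMPLE)
records = extract_rows(table)
for record in records:
    print(record)
```

In the scraper itself each tuple would instead be passed to Package(...) so that it is stored.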

Configure your settings

""" """

import os 
PATH = os.path.dirname(os.path.abspath(__file__))

#Don't change this unless you have renamed the project

DATABASE_ENGINE = 'sqlite'     
DATABASE_NAME = 'pypi'  
DATABASE_USER = ''             
DATABASE_HOST = ''             
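
To make the role of the four DATABASE_* values more concrete, here is a minimal sketch of how such settings could be combined into a SQLAlchemy-style connection URL. The helper `build_database_url` is hypothetical, as are the example user and host; whether crawley assembles its connection this way is an assumption.

```python
def build_database_url(engine, name, user="", host=""):
    # Hypothetical helper, not part of crawley.
    # SQLite databases are plain files, so user and host are unused.
    if engine == "sqlite":
        return "sqlite:///%s" % name
    # For server-based engines, prepend the user only when one is set.
    credentials = ("%s@" % user) if user else ""
    return "%s://%s%s/%s" % (engine, credentials, host, name)


# Values taken from the settings above.
url = build_database_url("sqlite", "pypi")
print(url)

# Made-up values for a server-based engine.
server_url = build_database_url("postgresql", "pypi", "scraper", "localhost")
print(server_url)
```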


Finally, just run the crawler

~$ crawley run