Skip to content
main
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

scrapelib

https://coveralls.io/repos/jamesturk/scrapelib/badge.png?branch=master Documentation Status

scrapelib is a library for making requests to less-than-reliable websites, it is implemented (as of 0.7) as a wrapper around requests.

scrapelib originated as part of the Open States project to scrape the websites of all 50 state legislatures and as a result was therefore designed with features desirable when dealing with sites that have intermittent errors or require rate-limiting.

Advantages of using scrapelib over alternatives like httplib2 simply using requests as-is:

  • All of the power of the suberb requests library.
  • HTTP, HTTPS, and FTP requests via an identical API
  • support for simple caching with pluggable cache backends
  • request throttling
  • configurable retries for non-permanent site failures

Written by James Turk <dev@jamesturk.net>, thanks to Michael Stephens for initial urllib2/httplib2 version

See https://github.com/jamesturk/scrapelib/graphs/contributors for contributors.

Requirements

  • python >=3.7
  • requests >= 2.0

Example Usage

Documentation: http://scrapelib.readthedocs.org/en/latest/

import scrapelib
s = scrapelib.Scraper(requests_per_minute=10)

# Grab Google front page
s.get('http://google.com')

# Will be throttled to 10 HTTP requests per minute
while True:
    s.get('http://example.com')