pull - A web scraper library

A scrape, or feed, breaks down into three steps: Fetch -> Parse -> Store. The library provides scaffolding that runs these steps for a feed or a collection of feeds; the user defines each step using the classes provided.

  1. Fetch - Criteria+Protocol
  2. Parse - Parser
  3. Store - Updater
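
As a rough sketch, the user-defined pieces for one feed might look like the following. The base class names come from this README, but the import path and every method name and signature below (files, parse, update) are assumptions made for illustration; the library's actual hook names are not shown here::

    from datetime import timedelta

    from pull import FileListCriteria, Parser, Updater  # import path assumed

    class DailyCriteria(FileListCriteria):
        """Map a date range to (url, cache_file_location) tuples."""
        def __init__(self, start, end):  # "built given a date range"
            self.start, self.end = start, end

        def files(self):  # hook name assumed
            day, out = self.start, []
            while day <= self.end:
                out.append(('http://example.com/feed/%s' % day.isoformat(),
                            '/tmp/cache/%s.html' % day.isoformat()))
                day += timedelta(days=1)
            return out

    class LineParser(Parser):
        """Open a cached file and parse it into a list of dicts."""
        def parse(self, cache_file_location):  # hook name assumed
            with open(cache_file_location) as fh:
                return [{'raw': line.strip()} for line in fh]

    class PrintUpdater(Updater):
        """Store (here: just print) each parsed item."""
        def update(self, items):  # hook name assumed
            for item in items:
                print('storing', item)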

The typical flow: a FileListCriteria is built for a date range and returns a list of (url, cache_file_location) tuples. The UrlProtocol performs an HTTP GET for each url and writes the contents to a file at the corresponding cache_file_location. The parser is then called for each cached file, opening and parsing it into a list of dicts for the updater to store. A default updater called StoreItems is provided that simply stores the items to a list, but users are expected to provide their own updater.
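
Wiring the typical flow together with the classes sketched above might then look like this. build_feed, UrlProtocol, StoreItems, and the updater keyword come from this README; the parser keyword, the import path, and the run call are guesses for illustration::

    from datetime import date

    from pull import StoreItems, UrlProtocol, build_feed  # import path assumed

    criteria = DailyCriteria(date(2014, 1, 1), date(2014, 1, 7))
    store = StoreItems()  # default updater: collects items into a list
    feed = build_feed('daily', UrlProtocol(criteria),
                      parser=LineParser(), updater=store)  # parser kwarg assumed
    feed.run()  # run method assumed; executes Fetch -> Parse -> Store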

It is sometimes useful to bypass a step while still using the scaffolding. For example, suppose you are using OAuth with a web API that returns JSON, and you want to pass the JSON response directly to an Updater without a Protocol or Parser in between. Simply build your feed with SkipUrlProtocol and don't specify a Parser::

    myfeed = build_feed('myfeed', SkipUrlProtocol(MyFetchCriteria()), updater=MyUpdater())
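
In that setup, MyFetchCriteria would presumably make the authenticated API calls itself, and the updater receives the JSON bodies unparsed. A sketch of what MyUpdater might look like under that assumption; the hook name and payload shape are guesses::

    import json

    from pull import Updater  # import path assumed

    class MyUpdater(Updater):
        """Assumed to receive raw JSON response bodies when no Parser is set."""
        def update(self, responses):  # hook name assumed
            for body in responses:
                print('storing', json.loads(body))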
