Python Web Scraping Utilities (PWSU)

Fork me on GitHub: http://github.com/pariser/pwsu

Written by Andrew Pariser -- http://pariser.me / @pariser

Provides the following tools:

  • HTMLCache

About HTMLCache

HTMLCache caches the HTML fetched from a live URL on local disk, so that repeated runs while you develop a web scraping tool make fewer live requests to the server.

Usage

The library is meant to be as easy to use as possible. To download the HTML for a given URL and cache the result (or read the cached version, if one exists), call:

html = HTMLCache.read_url(url, redownload=False)
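
To illustrate the read-or-download behavior described above, here is a minimal sketch of a read-through cache. It is not the library's implementation: the names read_url_sketch and fetch_live, the explicit cache_dir argument, and the injectable fetch parameter are all assumptions made for illustration. Only the redownload flag and the "~" slash replacement come from this README.

```python
import os
import urllib.request


def fetch_live(url):
    """Download the page body from the live server and decode it as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def read_url_sketch(url, cache_dir, redownload=False, fetch=fetch_live):
    """Return the HTML for url, reading a cached copy from cache_dir if present.

    Hypothetical helper: the real HTMLCache.read_url may behave differently.
    """
    # Replace slashes with "~" (the default slash_character in this README)
    # so the whole URL can serve as a single file name.
    path = os.path.join(cache_dir, url.replace("/", "~"))

    # Serve the cached copy unless the caller forces a fresh download.
    if not redownload and os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()

    # Cache miss (or forced refresh): hit the server once and store the result.
    html = fetch(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

Passing the fetch function as a parameter is a design choice of this sketch only; it lets you swap in a fake downloader when testing a scraper offline.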

Configuration

Use the following methods to set options that affect library behavior:

HTMLCache.set_logger( logger )
HTMLCache.set_cache_dir( cache_dir )
HTMLCache.set_slash_character( slash_character )
  • logger overrides the default Python logger that HTMLCache uses to print DEBUG messages about its operation.
  • cache_dir defines the folder into which HTML files are cached. Its default value is "~/data/html_cache".
  • slash_character is used to replace slashes in the output file name. Its default value is "~", so that the URL http://github.com/pariser/pwsu is written to disk at cache_dir + "/http:~~github.com~pariser~pwsu".
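
The file-name mapping described in the last bullet can be sketched as follows. The function name cache_path is hypothetical, and expanding "~" in cache_dir to the home directory is an assumption; the README does not say how the library resolves that path.

```python
import os


def cache_path(url, cache_dir="~/data/html_cache", slash_character="~"):
    """Build the on-disk path for a URL (hypothetical helper, not library API)."""
    # Assumption: a leading "~" in cache_dir refers to the user's home directory.
    expanded_dir = os.path.expanduser(cache_dir)
    # Every "/" in the URL becomes slash_character, so the URL fits in one file name.
    return expanded_dir + "/" + url.replace("/", slash_character)


print(cache_path("http://github.com/pariser/pwsu", cache_dir="/tmp/html_cache"))
# prints /tmp/html_cache/http:~~github.com~pariser~pwsu
```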

A convenience method allows setting all options at once:

HTMLCache.set_opts( opts={} )

where opts is a dictionary with optional keys logger, cache_dir, and slash_character.
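
One way such a convenience method could work is to forward each key that is present to the matching setter and leave the other options alone. The class below is a sketch under that assumption; OptionsSketch is a hypothetical stand-in, not the real HTMLCache class, and the logger name is made up.

```python
import logging


class OptionsSketch:
    """Hypothetical stand-in showing class-level options with individual setters."""

    logger = logging.getLogger("html_cache_sketch")  # assumed logger name
    cache_dir = "~/data/html_cache"
    slash_character = "~"

    @classmethod
    def set_logger(cls, logger):
        cls.logger = logger

    @classmethod
    def set_cache_dir(cls, cache_dir):
        cls.cache_dir = cache_dir

    @classmethod
    def set_slash_character(cls, slash_character):
        cls.slash_character = slash_character

    @classmethod
    def set_opts(cls, opts=None):
        # Using None instead of {} as the default avoids sharing one mutable
        # dictionary between calls.
        opts = opts or {}
        # Only keys that are present change anything; the rest keep their values.
        if "logger" in opts:
            cls.set_logger(opts["logger"])
        if "cache_dir" in opts:
            cls.set_cache_dir(opts["cache_dir"])
        if "slash_character" in opts:
            cls.set_slash_character(opts["slash_character"])


# Change only the cache directory; slash_character keeps its default.
OptionsSketch.set_opts({"cache_dir": "/tmp/pwsu_cache"})
```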
