Getting Started Writing Scrapers
While we strive to make writing scrapers as simple as possible, there are a few prerequisites:
If you're already well-versed in Python, GitHub, and the basics of web scraping, you can skip to Getting Started.
These instructions are intended for Linux or OS X. If you're using Windows you'll probably benefit from using something like MinGW or a VM running Linux. If you're using OS X you may also find the excellent OS X-specific docs published by Open North useful.
If you aren't already familiar with Python you might want to start with Python on Codecademy.
Make sure you are using Python 3.3 or newer.
Having a local development environment is recommended. virtualenv and virtualenvwrapper are optional tools that help keep your Python environment clean if you work on multiple projects.
It is useful to understand the basic concepts of web scraping before beginning, though teaching them is somewhat beyond the scope of this documentation. We recommend this source.
In our experience, spending a few minutes brushing up on the basics of XPath is well worth it: it makes scrapers easier to write and more maintainable in the long run.
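To give a feel for why XPath pays off, here is a minimal sketch that pulls bill links out of a small page fragment with a single path expression. The markup, the `extract_bills` helper, and the bill identifiers are all hypothetical. The sketch uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and requires well-formed markup; a real scraper would more likely use lxml, which handles messy HTML and full XPath 1.0.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; real pages are usually messier HTML.
PAGE = """
<html>
  <body>
    <ul id="bills">
      <li class="bill"><a href="/bills/hb-1">HB 1</a> Education funding</li>
      <li class="bill"><a href="/bills/sb-7">SB 7</a> Road repairs</li>
      <li class="note">Session ends in June</li>
    </ul>
  </body>
</html>
"""


def extract_bills(markup):
    """Return (identifier, link) pairs for every bill listed in the markup."""
    root = ET.fromstring(markup.strip())
    # One path expression selects the links inside list items whose
    # class attribute is exactly "bill", skipping the "note" item.
    links = root.findall(".//li[@class='bill']/a")
    return [(link.text, link.get("href")) for link in links]


bills = extract_bills(PAGE)
for identifier, href in bills:
    print(identifier, href)
```

Without the path expression, the same selection would need nested loops and manual attribute checks, which is where scrapers tend to become hard to maintain.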
The first thing to do is to choose a repository to work with, or create a new one.
Most likely you'll be creating a fork of one of the existing scraper repositories:
- scrapers-us-municipal - US municipal governments
- scrapers-us-state - US state-level governments
- scrapers-us-federal - US federal government
- scrapers-ca - Canadian legislative
- influence-usa/scrapers-us-state - US state influence data
If your scraper falls into one of those categories, you should fork the matching repository and create a new directory for your scraper within it. We'd also suggest you work on a branch to make merging changes as easy as possible.
If you're hoping to create a scraper for something not yet covered please email the Open Civic Data list and we can work with you to decide the best way to proceed.
Once you've chosen a repository you'll need to install the pupa library (the first syllable of pupa is pronounced 'pew' as in 'pew pew pew pew pew'). Also install any other dependencies (like lxml) that you'll be using to do your scraping. If you're using an existing repo, you should be able to get all necessary libraries by installing the requirements listed in that repository's requirements.txt file.
An example of how you might configure your setup:
    # Using a virtualenv is highly recommended
    $ mkvirtualenv --python `which python3` opencivicdata

    # Install pupa
    $ pip install --upgrade pupa

    # Clone the repo that you forked on GitHub
    $ git clone git@github.com:<yourusername>/scrapers-us-state.git

    # Switch to a branch to make pulling your work later as easy as possible
    $ cd scrapers-us-state
    $ git checkout -b <new-branch-name>

    # ...do work...
    $ git push --set-upstream origin <new-branch-name>
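Before moving on, it may help to confirm that the environment meets the prerequisites. The following is a rough sketch, not part of pupa itself: the helper names `python_is_supported` and `missing_modules` are made up for illustration, and the list of modules to check (`pupa`, `lxml`) is an assumption based on the dependencies mentioned above. Adjust it to match your repository's requirements.txt.

```python
import importlib
import sys

MINIMUM_PYTHON = (3, 3)  # minimum version stated in the prerequisites


def python_is_supported(version_info=None):
    """Return True if the given (or running) Python meets the minimum version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed dependency list; replace with whatever your repository requires.
REQUIRED = ["pupa", "lxml"]

print("Python version OK:", python_is_supported())
print("Missing modules:", missing_modules(REQUIRED) or "none")
```

If the script reports missing modules, make sure the virtualenv is active and re-run the install step.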
If you're all set up, you can move on to :doc:`new`.