Canadian Legislative Scrapers

Usage

Follow the instructions in the Python Quick Start Guide to install Homebrew, Git, PostGIS, Python 3.3+ and virtualenv.

mkvirtualenv scrapers-ca --python=`which python3`
git clone https://github.com/opencivicdata/scrapers-ca.git
cd scrapers-ca
pip install -r requirements.txt

Initialize the database:

createdb pupa
psql pupa -c "CREATE EXTENSION postgis;"
pupa dbinit ca

If you get an error like "no password supplied", then you need to configure the default DATABASE_URL in pupa_settings.py, e.g. postgis://USERNAME:PASSWORD@localhost/pupa.
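As a rough illustration of how such a URL breaks down, the sketch below splits the example value into its parts using only the standard library. The `database_url` and `describe` helpers and the idea of reading the value from an environment mapping are assumptions made for this example; they are not taken from pupa_settings.py.

```python
from urllib.parse import urlsplit

# The example value from the text above; replace USERNAME and PASSWORD locally.
DEFAULT_DATABASE_URL = "postgis://USERNAME:PASSWORD@localhost/pupa"


def database_url(environ):
    """Hypothetical helper: prefer a DATABASE_URL entry, else the default."""
    return environ.get("DATABASE_URL", DEFAULT_DATABASE_URL)


def describe(url):
    """Split a database URL into the parts that usually need checking."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        # A missing password is one plausible cause of "no password supplied".
        "has_password": parts.password is not None,
        "host": parts.hostname,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(describe(database_url({})))
```

Passing a plain dictionary instead of reading `os.environ` directly keeps the sketch easy to try with different values.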

Run a scraper

pupa update ca_ab_edmonton

To run only the scraping step and skip the import step, add the --scrape switch:

pupa update --scrape ca_ab_edmonton
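To run the same command for several scrapers, one option is to build the argument lists in Python. This is only a sketch: the `update_command` helper is hypothetical, and it uses nothing beyond the `pupa update` invocation and the `--scrape` switch shown above.

```python
def update_command(module, scrape_only=False):
    """Hypothetical helper: build the argument list for one `pupa update` run."""
    argv = ["pupa", "update"]
    if scrape_only:
        argv.append("--scrape")
    argv.append(module)
    return argv


if __name__ == "__main__":
    for module in ["ca_ab_edmonton", "ca_ab_calgary"]:
        # Print the command; to execute it instead, pass the list to
        # subprocess.run(argv, check=True).
        print(" ".join(update_command(module, scrape_only=True)))
```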

For documentation on the pupa command:

pupa -h

For documentation on the update subcommand:

pupa update -h

Create a scraper

See the first few steps of this wiki page to create a scraper.

Develop a scraper

Read the Pupa documentation or an existing scraper's code.

Avoid using the XPath string() function unless the expression is known to have no matches on some pages. string() returns an empty string when nothing matches, so scrapers may continue to run without error despite failing to find a match. Accompany each use of string() with a comment like # can be empty or # allow string().
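The difference between failing loudly and silently returning an empty value can be sketched as follows. This example uses the standard library's ElementTree rather than the XPath tooling the scrapers actually use, and `require_text` and `lenient_text` are hypothetical names chosen for illustration; the second one only mimics the empty-string behaviour of string().

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for a scraped page; it has a name but no email.
PAGE = ET.fromstring("<div><p class='name'>Jane Doe</p></div>")


def require_text(root, path):
    """Return the text of the first match, or raise if nothing matches."""
    nodes = root.findall(path)
    if not nodes:
        raise LookupError("no match for {!r}".format(path))
    return (nodes[0].text or "").strip()


def lenient_text(root, path):
    """Mimic string(): return an empty string when nothing matches."""
    return root.findtext(path, default="").strip()  # can be empty


if __name__ == "__main__":
    print(require_text(PAGE, ".//p[@class='name']"))
    # No error here, even though the page has no such element.
    print(repr(lenient_text(PAGE, ".//p[@class='email']")))
```

With the strict helper, a page whose layout has changed stops the scraper instead of producing records with blank fields.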

Use the get_email and get_phone helpers as much as possible.

In late 2014 and early 2015, we disabled some single-jurisdiction scrapers to lower maintenance costs; some of these have since been re-enabled. We also disabled all multi-jurisdiction scrapers, because Pupa didn't support them. The disabled scrapers are in disabled/.

We heavily modify Pupa's validations in patch.py to be as strict as possible in order to keep data quality high. We subclass Pupa's Scraper, Jurisdiction and Person classes in utils.py to reduce code duplication and to correct common data quality issues.

Maintenance

List the available maintenance tasks:

invoke -l

Check that the code style is consistent:

flake8

Check module names, class names, classification, division_name, name and url in __init__.py files:

invoke tidy

Check that sources are credited and assertions are made:

invoke sources_and_assertions

Check jurisdiction URLs (look for Delete COUNCIL_PAGE or Missing COUNCIL_PAGE instructions):

invoke council_pages

Update the OCD-IDs:

curl -O https://raw.githubusercontent.com/opencivicdata/ocd-division-ids/master/identifiers/country-ca.csv
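After downloading a new copy, it can help to see which identifiers were added or removed. The sketch below compares two versions of the CSV using only the standard library. It assumes the first row is a header and the first column holds the identifier; the helper names and the sample rows are made up for illustration and are not real data.

```python
import csv
import io


def division_ids(csv_text):
    """Return the set of first-column values, skipping the header row."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # assumption: the first row is a header
    return {row[0] for row in rows if row}


def diff_ids(old_text, new_text):
    """Return (added, removed) identifiers between two versions of the CSV."""
    old, new = division_ids(old_text), division_ids(new_text)
    return sorted(new - old), sorted(old - new)


if __name__ == "__main__":
    # Illustrative rows only; real files are much larger.
    old = "id,name\nocd-division/country:ca,Canada\nexample/a,A\n"
    new = "id,name\nocd-division/country:ca,Canada\nexample/b,B\n"
    added, removed = diff_ids(old, new)
    print("added:", added)
    print("removed:", removed)
```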

Check whether any non-authoritative CSVs are likely to be stale:

invoke csv_stale

Check whether any CSV errors can be reported to data publishers:

invoke csv_error

Scraper code rarely undergoes code review. The focus is on the quality of the data.

Bugs? Questions?

This repository is on GitHub: https://github.com/opencivicdata/scrapers-ca, where your contributions, forks, bug reports, feature requests, and feedback are welcome.

Copyright (c) 2013 Open North Inc., released under the MIT license
