Transports Quebec Infrastructure Database Scraper
mtqinfra
.gitignore
LICENSE
README.md
reprocess_json.py
scrapy.cfg

Overview

This is an experiment to scrape the Transports Quebec infrastructure database and save the data in different, easily usable formats.

Please see this blog post for the technical details or this blog post for the story and output files in CSV, JSON, LineJSON, XML and KML format.

Requirements (Tested On)

Not tested with newer versions of the above. YMMV.

NOTE

I currently have no plan to "support" this project. However, if you find and fix issues (e.g. stuff that no longer works because the HTML being scraped has changed) or add features, feel free to send me pull requests.

If you find an issue that you yourself have no plan to fix, feel free to open a ticket to let me know. Maybe by that time I will have found a portal to another dimension where I have extra time, or a clone that would allow me to work on it.

Cheers!

UPDATE 2013/10/18

Roberto Rocca (@robroc) from The Gazette asked me if I had a recent scrape of the MTQ database.

I had not looked at this code in a long time and I was curious to see if it still worked. It did not.

However, after some tests in the Scrapy shell and a look at the HTML source, I realized that little would be needed to fix things. So I found some time to update the scraper so that it works on the current MTQ website. Mostly, I had to change the base URL, the table IDs and the XPath selector used to get the structure photo URL.
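For anyone who needs to repeat this kind of fix, here is a minimal sketch of the idea: keep the site-specific values (base URL, table ID, XPath) in one place so they are easy to update when the HTML changes. Everything in it is hypothetical, including the URL, the table ID, the sample markup and the `photo_url` helper; none of it is taken from the actual scraper. It also uses the standard library's ElementTree rather than Scrapy selectors so that it runs on its own.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical site-specific values; these are the kind of things that
# need updating when the scraped website changes.
BASE_URL = "http://example.invalid/mtq/"
DETAILS_TABLE_ID = "details"
PHOTO_XPATH = ".//table[@id='%s']//img" % DETAILS_TABLE_ID

# Hypothetical, well-formed sample markup (real pages are usually messier).
SAMPLE = (
    '<html><body><table id="details"><tr><td>'
    '<img src="photos/1234.jpg"/>'
    "</td></tr></table></body></html>"
)


def photo_url(markup):
    """Return the absolute photo URL found in `markup`, or None."""
    root = ET.fromstring(markup)
    img = root.find(PHOTO_XPATH)
    if img is None:
        return None
    # Resolve the (possibly relative) src against the base URL.
    return urljoin(BASE_URL, img.get("src"))


if __name__ == "__main__":
    print(photo_url(SAMPLE))
```

With the values grouped like this, a change on the website should mean editing the three constants at the top rather than hunting through the parsing code.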

NOTE: I did not test the code with the latest and greatest Scrapy version. Instead, to save myself trouble, I went with one of the oldest versions available on PyPI (0.14.4), which did not require any change to my code.

NOTE 2: The latest version of the MTQ website uses cookies to track the session. To pause the crawl and inspect responses past the initial form submission, call inspect_response in parse_main_list or parse_details.
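A minimal sketch of how that could look, assuming you want the shell to open only on demand. The `maybe_inspect` helper and the `MTQ_DEBUG` environment variable are hypothetical and not part of this project; `inspect_response` itself comes from `scrapy.shell`. Its call signature has changed between Scrapy releases, so check the documentation for the version you have installed.

```python
import os

DEBUG_ENV_VAR = "MTQ_DEBUG"  # hypothetical switch, not part of the project


def maybe_inspect(response, spider):
    """Open the Scrapy shell on `response` when the debug switch is set.

    Returns True if the shell was opened, False otherwise.
    """
    if os.environ.get(DEBUG_ENV_VAR) != "1":
        return False
    # Imported lazily so that normal crawls never load the shell machinery.
    from scrapy.shell import inspect_response
    # Recent Scrapy releases expect the spider as the second argument;
    # older releases may differ, so adjust to the installed version.
    inspect_response(response, spider)
    return True
```

To use it, call `maybe_inspect(response, self)` as the first line of parse_main_list or parse_details, then run the crawl with `MTQ_DEBUG=1` set. Because the shell opens inside the running crawl, the response you inspect is the one obtained with the session cookies already in place.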