rhymeswithcycle edited this page Aug 13, 2010 · 4 revisions

These are mostly larger projects; if there’s something smaller you’d like to work on, I’d love to hear about it.

Flexible search

UI and backend code (we use Solr, via the lightweight pysolr wrapper) for more advanced search features. Sorting, filtering, faceting.
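As a rough starting point, here is a minimal sketch of how sorting, filtering, and faceting map onto standard Solr query parameters (`sort`, `fq`, `facet`, `facet.field`). The helper `build_search_params` and the field names `date`, `party`, and `politician` are assumptions for illustration, not the site's actual schema.

```python
def build_search_params(sort_field=None, descending=True, filters=None, facet_fields=None):
    """Build a dict of standard Solr query parameters (hypothetical helper)."""
    params = {}
    if sort_field:
        # Solr expects "<field> asc" or "<field> desc".
        params["sort"] = f"{sort_field} {'desc' if descending else 'asc'}"
    if filters:
        # Each fq entry narrows the result set without changing relevance scores.
        params["fq"] = [f'{field}:"{value}"' for field, value in sorted(filters.items())]
    if facet_fields:
        # Ask Solr to return per-value counts for these fields.
        params["facet"] = "true"
        params["facet.field"] = list(facet_fields)
    return params


# Field names below are assumed, not taken from the real index.
params = build_search_params(
    sort_field="date",
    filters={"party": "NDP"},
    facet_fields=["party", "politician"],
)
```

With pysolr, a dict like this could be passed as keyword arguments, along the lines of `pysolr.Solr(solr_url).search("budget", **params)`, where `solr_url` points at your own Solr instance.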

French version

I would very much like to have a French version of the site. I think that, rather than adding the French data to the current database, it would be better to create a separate site with its own database. (This project requires not only translating the HTML pages of openparliament.ca, but also obtaining the French versions of Hansard and the other documents we use.)

Senate and committee data

I’m listing this here mostly because I’ve received a lot of requests for it, and indeed these are both crucial parts of the legislative process. But I’m reluctant to add new scraped data sources: they mean more things that can go wrong and more ongoing maintenance (and then a dead site when I don’t have the time or will to keep up with it). That's especially true as Parliament plans to release XML versions of its data, which will hopefully include Senate and committee transcripts. That said, if you’re interested in and willing to put in the scraping and parsing work, we should discuss.


Video

This is a fun and difficult one. You can pull House video off of the Parliament and CPAC streams, and it’s archived at Mycelium. Automatically matching that video with the statement transcript is tricky but totally possible. We have an approximate (±5 minutes) timestamp for each statement. The QP video almost always shows an onscreen banner with the name of the speaker. I suspect the video resolution is too low to do OCR on the name, but the name is accompanied by a color-coded party banner, and detecting that color bar and mapping it to a party name is possible. So, given a video, you could get a timestamped list of which parties spoke when. (I have rough proof-of-concept code for this somewhere.) That list could then be paired with the Hansard via some kind of approximate sequence-matching algorithm.
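The two steps described above can be sketched roughly as follows. This is not the proof-of-concept code mentioned, just an illustration using the standard library: the reference colors in `PARTY_COLOURS` are made-up placeholders that would have to be sampled from real video, and `difflib.SequenceMatcher` stands in for whatever approximate sequence-matching algorithm ends up working best.

```python
from difflib import SequenceMatcher

# Hypothetical banner colours (RGB); real values must be sampled from the video.
PARTY_COLOURS = {
    "Conservative": (0, 0, 160),
    "Liberal": (200, 0, 0),
    "NDP": (240, 130, 0),
    "Bloc": (0, 160, 220),
}


def nearest_party(rgb):
    """Map an averaged banner colour to the closest reference party colour."""
    def distance(party):
        return sum((a - b) ** 2 for a, b in zip(rgb, PARTY_COLOURS[party]))
    return min(PARTY_COLOURS, key=distance)


def align(video_parties, hansard_parties):
    """Return (video_index, hansard_index) pairs for runs that match in order."""
    matcher = SequenceMatcher(a=video_parties, b=hansard_parties, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((block.a + k, block.b + k))
    return pairs


# Averaged banner colours detected in the video, in order (invented sample data).
video = [nearest_party(c) for c in [(210, 10, 5), (5, 5, 150), (235, 120, 10), (10, 0, 170)]]
# Party of each Hansard statement, in order; the video missed the Bloc speaker.
hansard = ["Liberal", "Conservative", "Bloc", "NDP", "Conservative"]
pairs = align(video, hansard)
```

In practice the approximate statement timestamps could be used to restrict which Hansard statements are candidates for each stretch of video before aligning.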


Statistics

There are all manner of fun and informative statistics that could be derived from our dataset. If you know statistics (bonus points for knowing how to compute stats in Python) and can suggest potential things to do with the data I have, get in touch.
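As one small example of the kind of thing that could be computed, here is a sketch that averages statement length by party. The record layout (`party` and `text` keys) and the sample statements are invented for illustration and do not reflect the site's actual data model.

```python
from collections import defaultdict
from statistics import mean

# Invented sample records; the real dataset's structure may differ.
statements = [
    {"party": "Liberal", "text": "Mr. Speaker, I rise today on the budget."},
    {"party": "Liberal", "text": "Thank you."},
    {"party": "NDP", "text": "Will the minister answer the question?"},
]


def mean_words_by_party(records):
    """Average number of whitespace-separated words per statement, by party."""
    lengths = defaultdict(list)
    for record in records:
        lengths[record["party"]].append(len(record["text"].split()))
    return {party: mean(counts) for party, counts in lengths.items()}


averages = mean_words_by_party(statements)
```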
