The goal here is to build scrapers and parsers in order to get as much state legislative data as possible in one place.
For the reasons and goals behind the project, see the project announcement.
For an editable overview of each state's progress, visit the Sunlight Labs Wiki.
- Collect URLs of State Legislature and Legislative Information Pages [done]
- Grab legislators and legislation
  - Build scrapers and obtain data files for legislation in each of the fifty states
  - Create sponsor relationship between legislators and legislation
- Grab votes
  - Build scrapers and obtain data files for legislator votes on legislation
  - Create voting relationship between legislators and legislation
- Build tools on top of data
To encourage as many contributions as possible, we aren't saying "write in Python" or anything, but we do need the code to follow a few guidelines.
For details on how scripts should be written and how they should run see scripts/README.rst. For details on how data should be stored see data/README.rst.
- Valid options:
  - --year: a year or years the parser should attempt
  - --all: attempt to parse years from 1969-2009
  - --upper: parse upper chamber
  - --lower: parse lower chamber
- The vision is that the flow will look something like this:

      $ ./scripts/nc/get_legislation --year=2009 --upper
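For contributors writing their script in Python, the option list above could be handled roughly as follows. This is only a sketch: scripts/README.rst remains the authority on how scripts should behave. The function names `build_parser` and `resolve`, the choice to repeat `--year` for several years, and the use of the standard-library `argparse` module are all assumptions made for illustration.

```python
import argparse

# Year range covered by --all, taken from the option list above.
FIRST_YEAR, LAST_YEAR = 1969, 2009


def build_parser():
    """Build a parser for the four documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Sketch of a get_legislation front end")
    # Assumption: several years are given by repeating the flag.
    parser.add_argument("--year", action="append", type=int, default=[],
                        help="a year to parse; repeat the flag for several years")
    parser.add_argument("--all", action="store_true", dest="all_years",
                        help="attempt every year from 1969 to 2009")
    parser.add_argument("--upper", action="store_true", help="parse the upper chamber")
    parser.add_argument("--lower", action="store_true", help="parse the lower chamber")
    return parser


def resolve(argv):
    """Return (years, chambers) for a list of command-line arguments."""
    args = build_parser().parse_args(argv)
    if args.all_years:
        years = list(range(FIRST_YEAR, LAST_YEAR + 1))
    else:
        years = sorted(set(args.year))
    chambers = [name for name in ("upper", "lower") if getattr(args, name)]
    return years, chambers


# Mirrors the example invocation: --year=2009 --upper
print(resolve(["--year=2009", "--upper"]))  # ([2009], ['upper'])
```

A real script would pass `sys.argv[1:]` to `resolve` and then scrape each requested year and chamber in turn.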
If you are interested in contributing, first check the Sunlight Labs Wiki and the repository to see where your state stands. The next step is generally to announce your interest on the Sunlight Labs Mailing List, which is also the place to ask questions and make suggestions about the project.
Once you have claimed a state on the wiki and mailing list, you should probably maintain your own fork of the project on GitHub.
Please avoid changing files that belong to other states on your state branch. Stick to editing files in the scripts/your_state directory and, where necessary, in any relevant utils directories.
Once your state script works as it should, announce it on the mailing list and someone will merge your changes into the core.
We feel that, to protect everyone's intellectual property, it makes the most sense to keep the code under the AGPL license. In a nutshell, this means that you are not permitted to make changes to this source without releasing your code as well. This is intended to prevent abuse, but we are sensitive to license choice on a project of this scale, so if you have strong objections please raise them on the mailing list. See LICENSING for the full terms of the AGPLv3.
- Python
  - BeautifulSoup
- Ruby
  - hpricot (gem install hpricot)
  - fastercsv (gem install fastercsv)
  - mechanize (gem install mechanize)
Because there are potentially fifty-plus contributors to this project, a real effort should be made to keep the requirements for running the full suite to a minimum.
In other words, if you can write your parser in either language X or language Y, choose language X if it is already a requirement. Likewise, if you are using a language (say Python) and need to parse HTML, and the project already depends on an HTML parsing library (say BeautifulSoup), favor that library over another unless a different one is absolutely necessary.