pr_scrapers

Scrapers to collect press releases, written in Python.

This project is a stand-alone sandbox for easy development of press release scrapers for incorporation into churnalism.com. It's designed to require minimal setup...

Requirements

lxml
BeautifulSoup (for UnicodeDammit)

Details

Just copy an existing scraper (e.g. onepoll.py) and start hacking about!

base.py is a mockup of the churnalism.com scraper interface. Rather than working against a database, it just dumps scraped press releases to stdout. It provides the BaseScraper interface to derive scrapers from. It also installs a caching handler for urllib2, which creates a ".cache" directory to store downloaded files. This makes repeated test runs during development a lot quicker. Just delete the ".cache" directory to clear the cache.
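To illustrate the idea behind the download cache, here is a minimal sketch of an on-disk cache keyed by URL. It is not the code in urllib2helpers.py: the names cache_path and cached_fetch, the MD5-based file naming, and the pluggable fetch callable are all assumptions made for illustration. It is written for Python 3, and takes the download function as a parameter so it runs without network access.

```python
import hashlib
import os
import tempfile


def cache_path(url, cache_dir=".cache"):
    """Map a URL to a file inside cache_dir, named by a hash of the URL.

    Hypothetical naming scheme; the real handler may name files differently.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def cached_fetch(url, fetch, cache_dir=".cache"):
    """Return the body for url, calling fetch(url) only on a cache miss."""
    path = cache_path(url, cache_dir)
    if os.path.exists(path):
        # Cache hit: read the stored bytes instead of downloading again.
        with open(path, "rb") as f:
            return f.read()
    # Cache miss: download, then store the bytes for the next run.
    body = fetch(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return body


if __name__ == "__main__":
    calls = []

    def fake_fetch(url):
        # Stand-in for a real download, so the sketch needs no network.
        calls.append(url)
        return b"<html>press release</html>"

    with tempfile.TemporaryDirectory() as tmp:
        cached_fetch("http://example.com/pr/1", fake_fetch, cache_dir=tmp)
        cached_fetch("http://example.com/pr/1", fake_fetch, cache_dir=tmp)
        print(len(calls))  # prints 1: the second call is served from disk
```

Deleting the cache directory makes every lookup a miss again, which is why removing ".cache" is enough to clear the cache.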

To try out your scraper, add something like this:

if __name__ == "__main__":
    scraper = Scraper()
    scraper.run()

Then you can just run it directly, e.g.:

$ python <your_scraper>
