# scrapetools.py

Handles repetitive tasks associated with downloading all files linked on a web page.

jake kara

jake@jakekara.com

January 2017

## Files

I often want to bulk-download all of the links on a page, such as a web page with a table of CSV or XLSX files. scrapetools.py provides functions for the repetitive parts of that process (sketched after this list), including:

- creating directories to store output (if they don't exist)
- getting the web "index" page, which contains all the links
- downloading each link's href and saving it
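
As a rough illustration, here is a minimal sketch of those three steps using requests and BeautifulSoup. The function names (`ensure_dir`, `get_index`, `download_links`) and the parser choice are assumptions for illustration, not necessarily the actual scrapetools.py API:

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def ensure_dir(path):
    """Create an output directory if it doesn't already exist."""
    if not os.path.exists(path):
        os.makedirs(path)


def get_index(url):
    """Fetch and parse the 'index' page that lists the links."""
    return BeautifulSoup(requests.get(url).text, "html.parser")


def download_links(index_url, out_dir):
    """Save the target of every <a href> found on the index page."""
    ensure_dir(out_dir)
    for a in get_index(index_url).find_all("a", href=True):
        target = urljoin(index_url, a["href"])  # resolve relative hrefs
        name = os.path.basename(target)
        if not name:  # skip links with no usable file name
            continue
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(requests.get(target).content)
```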

In addition to scrapetools.py, the repository includes seec.py, an example script I wrote that uses scrapetools.py to scrape CT campaign finance data.
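
A caller in the spirit of seec.py might look like the following. The import, URL, and output path are placeholders, since seec.py's actual contents aren't reproduced here:

```python
# Assumes scrapetools.py exposes a helper like download_links (see the
# sketch above); the URL and output directory below are placeholders.
from scrapetools import download_links

download_links("http://example.com/filings/index.html", "output/filings")
```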

## TODO

Sometimes you might want to name the downloaded files based on other data on the page, such as the text content of each link, rather than the href's basename.
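
A hedged sketch of how that could work; `safe_name` and `download_links_named` are hypothetical helpers, not part of scrapetools.py:

```python
import os
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def safe_name(text):
    """Turn arbitrary link text into a filesystem-safe file name."""
    return re.sub(r"[^\w.-]+", "_", text.strip())


def download_links_named(index_url, out_dir):
    """Download every linked file, naming each one after its link text."""
    os.makedirs(out_dir, exist_ok=True)
    soup = BeautifulSoup(requests.get(index_url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        target = urljoin(index_url, a["href"])
        dest = os.path.join(out_dir, safe_name(a.get_text()))
        with open(dest, "wb") as f:
            f.write(requests.get(target).content)
```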