What it does
This scraper was built so that researchers can collect Blogger and Blogspot content for analysis. In its current form, it can collect the 500 most recent posts from a blog and up to 500 of the most recent comments on each of those posts. Please note that this code is written in Python 3.5 and has only been tested on OS X 10.11.5 (El Capitan).
What you'll need
You're going to need a few tools in order to use this scraper.
First and foremost, you'll need Python. If you're on a Mac, you already have Python, but you'll need to get and install version 3.5. If you don't already have this version of Python, I recommend that you install it using Anaconda. Installing Anaconda will install the most recent version of Python and will allow you to use the included environment.yml to recreate the virtual environment in which this code was developed and tested. Here's a helpful blogpost that will teach you a bit about using Anaconda.
In addition to Python, you'll need an API key from Google. You can get your key by following these steps:
Sign in to the Google developer console using a Google account (if you have Gmail, that username/password will work).
Find and click "credentials" in the left sidebar, and then click "create project."
Give your project a name (it can be anything), decide whether you want to receive emails, agree to the terms of service, and then click "create".
Give Google a few moments, and then click "create credentials." This will open a dropdown menu. Within this menu, click "API key." This will bring up a new window. Click "Server key."
Name your key (anything will work) and enter the IP addresses you expect to be using this scraper from. If you need to get your IP address, click here. Then, click "create."
Now you have your API key. Make sure to keep track of this for later.
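Before moving on, it can help to confirm that Google accepts the new key. The sketch below assumes the scraper talks to version 3 of the Blogger API and uses that API's blogs/byurl lookup. The function names build_blog_lookup_url and key_works are illustrative and are not part of this repository.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the scraper uses version 3 of the Blogger API.
BLOGGER_API_ROOT = "https://www.googleapis.com/blogger/v3"


def build_blog_lookup_url(blog_url, api_key):
    """Build the URL for a blogs/byurl lookup (hypothetical helper)."""
    query = urlencode({"url": blog_url, "key": api_key})
    return "{}/blogs/byurl?{}".format(BLOGGER_API_ROOT, query)


def key_works(blog_url, api_key):
    """Return True if the lookup succeeds and the response contains a blog id."""
    try:
        with urlopen(build_blog_lookup_url(blog_url, api_key)) as response:
            # Decode explicitly so this also works on Python 3.5.
            body = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers network and HTTP errors; ValueError covers bad JSON.
        return False
    return "id" in body
```

Calling key_works with the address of any public blog and your key should return True. If it returns False, check that the IP address you are calling from is one of the addresses you entered when creating the key.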
How to use the scraper
Open settings.cfg in a text editor. Fill in information for each value. You'll need an API key from Google (see the steps under "What you'll need" above). For max_comments, enter an integer no larger than 500.
Use the included environment.yml file to create an Anaconda environment (for example, by running conda env create -f environment.yml).
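If you'd like to check your settings.cfg before running anything, a small validation script along the following lines may help. This is only a sketch: it assumes the file uses the standard INI layout that Python's configparser reads, and the section name "settings" is a guess, so adjust it to match the section header in your copy of the file. Only the max_comments key and the limit of 500 come from this README.

```python
import configparser

MAX_ALLOWED_COMMENTS = 500  # upper limit stated in this README


def load_max_comments(config_text, section="settings"):
    """Parse INI-style text and return a validated max_comments value.

    The default section name "settings" is an assumption, not taken
    from the real settings.cfg.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # getint raises ValueError if the value is not an integer.
    value = parser.getint(section, "max_comments")
    if not 0 <= value <= MAX_ALLOWED_COMMENTS:
        raise ValueError(
            "max_comments must be between 0 and {}, got {}".format(
                MAX_ALLOWED_COMMENTS, value
            )
        )
    return value
```

To check your own file, read it and pass the text in, for example load_max_comments(open("settings.cfg").read()).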
Collect, parse, and format the data
Run each of the three scripts in the following order:
blogger_collect.py (retrieves data based on parameters specified in
blogger_parse.py (gets the data you probably want out of the raw data dump)
blogger_toCSV.py (exports the parsed data to a human-readable spreadsheet)
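If you'd rather not launch the three scripts by hand, a small wrapper can run them in order. The sketch below assumes each script takes no command-line arguments and is run from the directory that contains it. The run_pipeline function is hypothetical and is not included in this repository.

```python
import subprocess
import sys

# Script names and order are taken from the list above.
SCRIPTS = ["blogger_collect.py", "blogger_parse.py", "blogger_toCSV.py"]


def run_pipeline(scripts=SCRIPTS, runner=subprocess.check_call):
    """Run each script in order with the current Python interpreter.

    subprocess.check_call raises CalledProcessError when a script exits
    with a non-zero status, so later steps do not run on incomplete data.
    """
    for script in scripts:
        runner([sys.executable, script])
```

Calling run_pipeline() from inside the activated Anaconda environment runs all three steps with that environment's interpreter.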