Scripts to scrape and analyze data from whitehouse.gov, comparing pre- and post-inauguration versions of the site in 2017.

To begin, use wget to download the data you want to analyze. Create a folder for the archive you want to download, and navigate to that folder in your terminal. Then, to download the current content, use:

wget -r -e robots=off --convert-links -nd https://www.whitehouse.gov/

For past snapshots, locate the snapshot in question via the Internet Archive's Wayback Machine. Once you have chosen your snapshot, I recommend using the Wayback Machine Downloader, a Ruby gem, to download it. (See the linked page for installation instructions.) With Wayback Machine Downloader installed, the following command will download a complete snapshot into the current directory. Note: this will likely take a long time, and the time code following -t will differ depending on the snapshot you've chosen. (The time code comes from the middle of the URL for the snapshot; in this case, the snapshot has a time code of 20170120112330.)

wayback_machine_downloader https://www.whitehouse.gov/ -d ./ -t 20170120112330

Once you've downloaded your archives, move all the HTML files to their own folder. Because wget saves most of the pages without a .html extension, be careful! I found it helpful to make folders for css, docs, fonts, html, images, and scripts, to make sure I didn't miss anything.
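Sorting extensionless downloads by hand is error-prone. As a rough sketch (not part of this repository; the folder layout and function names are mine), you can sniff the first bytes of each file to pick out the HTML pages:

```python
import shutil
from pathlib import Path

def looks_like_html(path):
    # Cheap sniff: wget saves most pages without a .html extension,
    # but they still start with a doctype or an <html> tag.
    head = path.read_bytes()[:512].lstrip().lower()
    return head.startswith(b"<!doctype") or head.startswith(b"<html")

def sort_downloads(folder="."):
    # Move every HTML-looking file in `folder` into an html/ subfolder.
    root = Path(folder)
    (root / "html").mkdir(exist_ok=True)
    for f in root.iterdir():
        if f.is_file() and (f.suffix == ".html" or looks_like_html(f)):
            shutil.move(str(f), str(root / "html" / f.name))
```

Similar checks on extensions (.css, .js, .woff, .png, and so on) can sort the remaining files into the other folders.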

Then open the Python parsing script and update the source_folder and output_file variables, if necessary.

Finally, run the Python parsing script to convert all of the downloaded HTML data into a single CSV file you can use for data analysis.

The Python parsing script takes the downloaded HTML files, parses them for page title and content (it looks for sections with id='page', the current format for whitehouse.gov pages), and saves them in a single CSV file with one record per page.
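The repository's own parsing code isn't reproduced here, but the approach can be sketched with the standard library alone (the real script may use a different HTML parser; the source_folder and output_file defaults below are placeholders):

```python
import csv
from html.parser import HTMLParser
from pathlib import Path

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class PageExtractor(HTMLParser):
    """Collect the <title> text and all text inside the element with id='page'."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.content = []
        self._in_title = False
        self._page_depth = 0  # > 0 while inside the id='page' element

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        if tag in VOID_TAGS:
            return
        if self._page_depth > 0:
            self._page_depth += 1
        elif dict(attrs).get("id") == "page":
            self._page_depth = 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if self._page_depth > 0 and tag not in VOID_TAGS:
            self._page_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._in_title:
            self.title += text
        elif self._page_depth > 0 and text:
            self.content.append(text)

def parse_pages(source_folder="html", output_file="pages.csv"):
    # One CSV row per downloaded page: filename, title, page text.
    with open(output_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "title", "content"])
        for path in sorted(Path(source_folder).iterdir()):
            if not path.is_file():
                continue
            parser = PageExtractor()
            parser.feed(path.read_text(encoding="utf-8", errors="ignore"))
            writer.writerow([path.name, parser.title, " ".join(parser.content)])
```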

mine_the_text.R is an R script that imports the scraped and pre-processed data into R for text mining, analysis, and visualization. Currently it only does basic word, bigram, and trigram counts, and comparative analysis between the current whitehouse.gov and an archive of Trump campaign speeches (not included in this repository until I investigate their sources/exhaustiveness).
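For reference, the n-gram counting the R script performs looks roughly like this in Python (the tokenizer here is a deliberate simplification, not the script's actual tokenization):

```python
import re
from collections import Counter

def ngram_counts(texts, n=1):
    """Count n-grams (n=1 words, n=2 bigrams, n=3 trigrams) across documents."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer: lowercase runs of letters, keeping apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts
```

Called on the content column of the CSV, ngram_counts(pages, n=2).most_common(20) would give the top twenty bigrams.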

trump-20170125.csv is an example output, based on a download of whitehouse.gov on January 25, 2017.

obama-20170120.csv is an example output, based on a download of a snapshot from right before the inauguration on January 20, 2017.


diffs contains tables that show page-level changes (additions and deletions) between versions of whitehouse.gov:

pages_unique_to_obama.csv lists pages that were on whitehouse.gov on January 20, 2017, but are no longer on the site.

pages_new_with_trump.csv lists pages that are new to whitehouse.gov since January 20.

pages_always_on_whitehouse_dot_gov.csv lists pages that existed both before and after the change in administration.

pages_new_or_deleted_on_Jan31.csv lists pages that were added or deleted between the January 25 and January 31 versions of whitehouse.gov.

Note: these are page-level changes, not content-level changes. These files list only pages that are new or deleted, not pages whose content has changed.
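The page-level diffs reduce to set operations on the page names in each snapshot. A sketch (the function name is mine, not the repository's):

```python
def page_diffs(before, after):
    """Given two sets of page names, return (deleted, added, unchanged)."""
    return sorted(before - after), sorted(after - before), sorted(before & after)
```

Feeding it the filenames from the January 20 and January 25 html folders would reproduce the first three CSVs above.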

visualizations contains visualizations produced by the R script from preliminary analyses comparing the January 20 and January 25 versions of whitehouse.gov. (Note: work still needs to be done to clean the various headers and sidebars containing site architectural information out of the pages, in order to better focus on the actual content of each page.)

