A data scrape and analysis of WhoPaysWriters.com
Type Name Latest commit message Commit time
.gitignore Update .gitignore Jan 26, 2019
Clean_Data.ipynb update Jan 28, 2019
Explore_Data.ipynb proprietary data conflict Feb 16, 2019
publications.csv first commit Jan 26, 2019
publications_rank.csv
readme.md Update readme.md Feb 16, 2019
scrapeWPW.py first commit Jan 26, 2019


WhoPaysWriters

UPDATE: WhoPaysWriters.com did not take kindly to their data being posted on a third-party site, so the datasets have been removed. Please email me with any questions.


A data scrape and analysis of WhoPaysWriters.com. A summary of the results can be found here. The data was collected for an article in the Columbia Journalism Review. Questions and suggestions for improvement are welcome: kevinrmcelwee@gmail.com.


WhoPaysWriters.com is an anonymous platform where freelance journalists post details about their compensation. There were approximately 3000 submissions to the site from 2012 to 2018, making it the largest publicly available dataset of its kind. Journalists submit not only their pay but also information about their rights, their relationship with the editor, and other contextual data.

scrapeWPW.py

This script creates three kinds of CSVs:

  • publications.csv, which lists all publications scraped from the opening webpage.
  • A CSV created for each publication's page under the data folder.
  • allData_raw.csv, which combines every CSV in the data folder into a single file.

The script requires ChromeDriver to be downloaded, in addition to its Python packages.
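The final combining step can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in scrapeWPW.py: the function name `combine_csvs` is hypothetical, and it assumes every per-publication CSV shares the same header row. The browser-driven scraping itself is not shown.

```python
import csv
import glob
import os


def combine_csvs(data_dir, out_path):
    """Concatenate every CSV in data_dir into one file at out_path.

    Assumes all per-publication files share the same header row.
    Returns the number of data rows written.
    """
    # Collect input paths before opening the output file.
    paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    rows_written = 0
    header = None
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # Write the header once, taken from the first file.
                    header = file_header
                    writer.writerow(header)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Under these assumptions, `combine_csvs("data", "allData_raw.csv")` would write the combined file next to the data folder.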

Clean_Data.ipynb

Cleans data for analysis. Beyond routine cleaning, I made the following decisions:

  • I replaced most other entries with NaNs.
  • I dropped everything with fewer than 100 words.
  • I dropped all fiction and poetry entries.
  • I removed entries for 2019.
  • I cut potential spam and unreasonable outliers, addressing them on a case-by-case basis.

This notebook creates allData_clean.csv, which is ultimately used for analysis.
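The three mechanical drop rules above could look something like this. It is a sketch in plain Python rather than the notebook's own code, and the field names `wordCount`, `genre`, and `year`, as well as the genre labels, are assumptions made for illustration. The NaN replacement and the case-by-case outlier handling are left out because they depend on judgment calls not described here.

```python
# Hypothetical genre labels; the real dataset may spell these differently.
EXCLUDED_GENRES = {"fiction", "poetry"}
MIN_WORDS = 100
EXCLUDED_YEAR = 2019


def keep_entry(entry):
    """Return True if an entry survives the three drop rules."""
    if entry["wordCount"] < MIN_WORDS:
        return False  # fewer than 100 words
    if entry["genre"].lower() in EXCLUDED_GENRES:
        return False  # fiction and poetry
    if entry["year"] == EXCLUDED_YEAR:
        return False  # entries for 2019
    return True


def clean(entries):
    """Filter a list of dict entries, keeping only those that pass."""
    return [entry for entry in entries if keep_entry(entry)]
```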

Explore_Data.ipynb

Explores most two-variable relationships and creates appropriate graphs for study. Also creates publications_rank.csv, which ranks publications with more than 7 submissions using their rankings on totalPaid, wordRate, daysToBePaid, and paymentDifficulty.
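One plausible way to build such a ranking is sketched below. The notebook's actual method is not described here, so several choices are assumptions: each metric is averaged per publication, higher totalPaid and wordRate count as better, lower daysToBePaid and paymentDifficulty count as better, and the per-metric ranks are simply summed. The `publication` field name and the function name are also hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Metric name -> True if a higher value is better (assumed directions).
METRICS = {
    "totalPaid": True,
    "wordRate": True,
    "daysToBePaid": False,
    "paymentDifficulty": False,
}
MIN_SUBMISSIONS = 8  # "more than 7 submissions"


def rank_publications(entries):
    """Return publication names ordered from best to worst combined rank."""
    by_pub = defaultdict(list)
    for entry in entries:
        by_pub[entry["publication"]].append(entry)

    # Keep only publications with enough submissions, then average each metric.
    averages = {
        pub: {m: mean(row[m] for row in rows) for m in METRICS}
        for pub, rows in by_pub.items()
        if len(rows) >= MIN_SUBMISSIONS
    }

    # Rank publications on each metric and sum the rank positions.
    # Ties are broken by insertion order, which is a simplification.
    rank_sums = defaultdict(int)
    for metric, higher_is_better in METRICS.items():
        ordered = sorted(
            averages, key=lambda p: averages[p][metric], reverse=higher_is_better
        )
        for position, pub in enumerate(ordered, start=1):
            rank_sums[pub] += position

    # A lower rank sum means a better overall standing.
    return sorted(rank_sums, key=lambda p: rank_sums[p])
```

Summing ranks weights the four metrics equally; a different weighting would change the ordering.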