This code gets word counts for media coverage of NFL quarterbacks. It downloads
sentences from MediaCloud, removes stopwords, and counts the remaining
words. The resulting data is stored in the data folder so you can analyze
it.
- This code uses Python 2.7.
- Register to get a MediaCloud API Key. You will be able to see the key on your profile page once you are registered.
- Copy config.txt.template to config.txt and then paste in your API key where indicated.
- Install the dependencies via pip: pip install -r requirements.pip
Once you have set it all up, just run sentencedownload.py to get all the sentences and build wordcounts. This doesn't use the MediaCloud word counting because we do not want to sample the data; we want the true full word counts to support this analysis.
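The stopword filtering and counting step can be pictured roughly as follows. This is a minimal sketch, not the code in sentencedownload.py: the function name count_words, the tokenizing regular expression, and the sample sentences and stopwords are all assumptions made for illustration. It is written to run on Python 2.7 as well as Python 3.

```python
from __future__ import print_function

import re
from collections import Counter


def count_words(sentences, stopwords):
    """Count every non-stopword token across all sentences (no sampling)."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Toy inputs for illustration only; the real sentences come from MediaCloud
# and the real stopwords from stop-words-english4.txt.
stopwords = {"the", "a", "to", "and"}
sentences = [
    "The quarterback threw a pass.",
    "A pass to the end zone and a score.",
]

counts = count_words(sentences, stopwords)
print(counts.most_common(3))
```

Counting over every downloaded sentence, rather than asking an API for a sampled word count, is what gives the full counts described above.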
We're using Jupyter Notebooks for analysis. Open qb-analysis.ipynb to start exploring.
We scraped Wikipedia and nfl.com to generate a list of regular season starting quarterbacks. Then we added a column for "race" (see qb-table.csv). Race determination is a thorny and difficult task. Ideally, we would categorize using each player's self-identified race. However, we were unable to find evidence of self-identification for any of the players. For this project, we elected to use the categorizations found in Best Ticket's unofficial 2014 NFL Player Census. To support testing our hypotheses, we group players as either "white" or "other" (i.e., non-white).
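The two-way grouping could be derived from the "race" column along these lines. This is only a sketch: the row layout, the "name" key, the placeholder player names, and the category labels are assumptions, and the actual columns and values in qb-table.csv may differ.

```python
def race_group(race):
    """Collapse a census category into the two analysis groups."""
    return "white" if race.strip().lower() == "white" else "other"


# Placeholder rows for illustration; these are not taken from qb-table.csv.
rows = [
    {"name": "Player A", "race": "White"},
    {"name": "Player B", "race": "Black"},
    {"name": "Player C", "race": " white "},
]

groups = {row["name"]: race_group(row["race"]) for row in rows}
```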
We created a list of sources including US Mainstream Media, US Regional Media, and the top US sports websites (as identified by Alexa rankings). These are listed with their MediaCloud ids in sources.csv.
Our initial list of English stopwords is in stop-words-english4.txt. It combines several existing stopword lists into one larger list so that more of the words we don't want are removed from the analysis.
We took the top 200 words for both "white" and "other" and coded them to identify domain-specific stopwords (e.g., "pass", "twitter"). We coded into the following categories:
- health (injury, knee)
- football stop word (quarterback)
- name (Newton)
- descriptors (smart, fast)
- action (passed, caught, run)
- personal (career, top, performance, other)
The results are saved in manual-word-tags.csv.
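Selecting the words to code, and later applying the tags, might look like the sketch below. The helper names words_to_code and drop_tagged, the toy counts, and the choice of which categories to drop are assumptions for illustration; the real counts come from the download step and the real tags from manual-word-tags.csv, whose column layout is not shown here.

```python
from collections import Counter


def words_to_code(counts_by_group, top_n=200):
    """Union of the top_n most frequent words in each group, sorted."""
    words = set()
    for counts in counts_by_group.values():
        words.update(word for word, _ in counts.most_common(top_n))
    return sorted(words)


def drop_tagged(counts, tags, drop_categories):
    """Remove words whose manual tag is in drop_categories (untagged words stay)."""
    return Counter({
        word: n for word, n in counts.items()
        if tags.get(word) not in drop_categories
    })


# Toy counts and tags for illustration only.
counts_by_group = {
    "white": Counter({"pass": 5, "smart": 3, "knee": 1}),
    "other": Counter({"pass": 4, "fast": 2, "newton": 1}),
}
tags = {"pass": "football stop word", "smart": "descriptors"}

to_code = words_to_code(counts_by_group, top_n=2)
filtered = drop_tagged(counts_by_group["white"], tags, {"football stop word"})
```

Taking the union across both groups means each word is coded once, even when it ranks highly for both.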