A study of global online media coverage. The goal is to identify top American online media outlets (via Alexa ranking), use the MediaCloud dataset and tools to geoparse and locate stories, and then compare global coverage between them.
Set up MongoDB.
Install the Python requirements: `pip install -r requirements.pip`
- We ran `scraper/scrape-alexa.py` to pull the top arts and news rankings from the Alexa website. (Jan 21, 2014)
- Three of us hand-coded the `scraper/data/alexa-NNNN-ranks-raw.csv` files following the `doc/AlexaSourceCodingGuidelines.pdf` document.
- We ran `scraper/compute-intercoder-reliability.py` to compute inter-coder reliability.
- We resolved the (small number of) disagreements manually to produce the `scraper/data/alexa-NNNN-ranks-golden.csv` files.
- We added in the Alexa "Global Rank" metric by running `scraper/scrape-alexa-details.py`.
- We ran `scraper/make-top-results.py` to produce "Top N" lists for each source type we care about. (Feb 17, 2014)
- We hand-edited any entries in the "Top N" lists that didn't make sense; sportsillustrated was the only one. (Feb 17, 2014)
- We removed any non-English sources; eenadu.net on the newspaper list was the only one. (Feb 28, 2014)
- We added a MediaCloud source_id to the "Top N" lists, and added any missing sites to MediaCloud via its admin UI.
- We used a tiny web app (in `media-source-dashboard/`) to make sure stories were being collected by MediaCloud correctly.
- We ran `story-fetcher/fetch-stories.py` to download all the sentences for the sources for the month we care about into a MongoDB database.
- We ran `json-generator/generate-json.py` to aggregate the information into a JSON file describing country coverage by type of media source.
- We added external data files to `analysis/external` as listed in that directory's readme.
- We added a `common.csv` data file to `data`.
- We set the `min_articles` threshold in `mc-client.config` to 200.
- We ran `story-fetcher/fetch-story-counts.py` to create a CSV of story counts for each source in the configured time period.
- We ran `analysis/find-entropy.py` to produce `analysis/output/foreign_attention.csv`. Each row gives the fraction of a particular source's foreign articles that are dedicated to a specific country.
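
To make the "fraction of foreign articles per country" idea concrete, here is a minimal sketch. It is not the code in `analysis/find-entropy.py`. The function names, the input shape (a mapping from country code to article count) and the sample numbers are all hypothetical, and the use of Shannon entropy is only an assumption based on the script's name.

```python
import math
from collections import Counter


def foreign_attention(country_counts):
    """Return {country: fraction} for one source's foreign-article counts."""
    total = sum(country_counts.values())
    if total == 0:
        return {}
    return {country: n / total for country, n in country_counts.items()}


def shannon_entropy(fractions):
    """Shannon entropy, in bits, of a distribution given as {key: fraction}."""
    return -sum(p * math.log2(p) for p in fractions.values() if p > 0)


# Hypothetical counts of foreign articles per country for a single source.
counts = Counter({"GBR": 50, "CHN": 25, "RUS": 25})
fractions = foreign_attention(counts)
entropy = shannon_entropy(fractions)
```

Under this reading, a source whose foreign coverage is spread evenly over many countries gets a high entropy, while one that concentrates on a few countries gets a low one.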
Make sure you set up these indices in the database:
db.people.ensureIndex( { "type": 1 } );
db.people.ensureIndex( { "cliffCountriesOfFocus": 1 } );
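
The commands above are for the mongo shell. Newer MongoDB releases replace `ensureIndex` with `createIndex`. If you would rather create the same indices from Python, the following is a rough sketch using PyMongo's `create_index`. It assumes PyMongo is installed; the helper name `create_indexes` and the placeholder database name are hypothetical and not part of this repository.

```python
def create_indexes(collection):
    """Create the two single-field ascending indexes on `collection`.

    `collection` is expected to behave like a pymongo Collection.
    """
    # 1 means ascending, matching the { "field": 1 } specs in the shell commands.
    for field in ("type", "cliffCountriesOfFocus"):
        collection.create_index([(field, 1)])


# Example usage (assumes a MongoDB server on localhost and a database name
# that you substitute yourself):
#
#   from pymongo import MongoClient
#   client = MongoClient()
#   create_indexes(client["YOUR_DATABASE_NAME"]["people"])
```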