Whitelist documentation and missing news sites #31

Open
profkenm opened this issue May 22, 2020 · 4 comments
Comments

@profkenm

Hello and thanks for Icore!

I have a couple of comments about the whitelisted GKG sites on Icore.

First, the whitelist documentation is mentioned on the Icore site but I can't find the actual documentation anywhere. This would be necessary for researchers to scrutinize how, and preferably which, GKG sites were blacklisted.

Second (maybe related, maybe not), after a few general searches, I noticed that some U.S. news sites I expected to find (such as ABC News and the Wall Street Journal) are not present, but the fake news site infowars.com is there, and so is the controversial (maybe fake?) site zerohedge.com, along with other controversial sites such as breitbart and bizpacreview.com.

For the purposes of research, all of them should be included. (Indeed, as you know, some of the first published political communication research that uses GDELT deals with fake news.) Might the omitted mainstream news sites have been accidentally blacklisted by Icore or, rather, did GDELT fail to scrape them? How would we know?

@musainayatmalik
Collaborator

Hi Ken,

Thanks for using icore and for your feedback!

We provide a short description of our source selection logic at http://icore.mnl.ucsb.edu/whitelist. Each of these sites is also associated with a political bias score (controversial as that idea is itself, and biased as this particular metric may be) derived from Media Bias Monitor. There is also a whitelist table at the bottom of that page where you can type in a source name to check whether icore whitelists it.

Good point on ABC News. I checked April 2020 and it does seem that 'abcnews.go.com' is not being included. This is worth fixing. Thanks for pointing it out!

On that note, we do plan on extending our source whitelist beyond what we have currently. This updated, larger source list will be implemented in our next round of updates. Keeping in mind the scale of global news data, what metric would you suggest, as a user and researcher, to determine whether a news source is significant enough to be included in our database? Are there any relevant databases you know of that we could tap into for this purpose?

Thanks again for your feedback!

@profkenm
Author

profkenm commented May 22, 2020 via email

@fhopp
Collaborator

fhopp commented May 24, 2020

Hey @profkenm,

Great points, thanks for highlighting that. A couple of notes on the whitelist and why we chose to use one.

First, GDELT monitors tens of thousands of sources, many of which do not have any 'news' content (e.g., cars.com). Ingesting all of these sites would result in the same 'big-data' problem that icore tries to mitigate.

The idea is to include a broad selection of news sources per country. As a metric for inclusion, we mainly focus on online reach. The fact that certain sites are not part of icore likely means that they are not included in GDELT, but we will spot check the cases you outlined just to be sure. The absence of WSJ is likely due to WSJ's own paywall.

Furthermore, icore is built to address specific research questions that are geared towards particular countries, regions, or news sites. Indeed, obtaining 'representative' news coverage of a particular country is challenging, but including all sources of a country is not only infeasible from a computational point of view, but likely also not needed (compare traditional news research, which usually contrasts just ~10 sources at a time). In fact, IMHO, the ~800 sources that icore currently includes provide more than ample opportunity to study news from various perspectives (see our case studies in CCR). If I understand your concerns correctly, you would like to have a 'representative' whitelist for particular countries? This is a great but labor-intensive idea; perhaps to facilitate this process, we can make the whitelist available as CSV along with metadata (e.g., bias). Would that be helpful to your research? If you told us a bit more about the specific questions you seek to address with icore, we could probably provide knowledge or features that help you address them.
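
To make the CSV idea a bit more concrete, here is a minimal sketch of how such a file might be used to check whether a source is whitelisted. Nothing like this has been published yet: the column names (`domain`, `country`, `bias`), the sample rows, and the helper functions `load_whitelist` and `check_sources` are all hypothetical, and the bias values are placeholders rather than real scores.

```python
import csv
import io

# Hypothetical whitelist export. The columns and every row below are
# illustrative assumptions, not the actual icore whitelist.
SAMPLE_CSV = """domain,country,bias
example-daily.com,US,-0.2
example-post.co.uk,UK,0.1
example-tribune.com,US,0.4
"""


def load_whitelist(text):
    """Map each lowercased domain to its metadata row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["domain"].strip().lower(): row for row in reader}


def check_sources(whitelist, domains):
    """Return {domain: bool} saying whether each domain is whitelisted."""
    return {d: d.strip().lower() in whitelist for d in domains}


whitelist = load_whitelist(SAMPLE_CSV)

# Check a few domains of interest (matching is case-insensitive).
result = check_sources(whitelist, ["abcnews.go.com", "Example-Daily.com"])

# List the whitelisted sources for one country.
us_sources = sorted(d for d, row in whitelist.items() if row["country"] == "US")

print(result)
print(us_sources)
```

Note that a lookup like this only tells you whether a domain is on the whitelist. It cannot by itself distinguish "not whitelisted" from "whitelisted but not scraped by GDELT"; that would still need a separate check against the GDELT data.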

Again, thank you so much for your helpful feedback and comments.

@profkenm
Author

profkenm commented May 26, 2020 via email
