Whitelist documentation and missing news sites #31
Comments
Hi Ken,
Thanks for using icore and for your feedback!
We provide a short description of our source selection logic at http://icore.mnl.ucsb.edu/whitelist. These sites are also uniquely associated with a political bias score (controversial as that idea is itself, and biased as this particular metric may be) derived from Media Bias Monitor. There is also a whitelist table at the bottom where you can type in a source name to check whether icore whitelists it.
Good point on ABC News. I checked April 2020 and it does seem that 'abcnews.go.com' is not being included. This is worth fixing. Thanks for pointing it out!
On that note, we do plan on extending our source whitelist beyond what we have currently. This updated, larger source list will be implemented in our next round of updates. Keeping in mind the scale of global news data, what metric would you suggest, as a user and researcher, to determine whether a news source is significant enough to be included in our database? Are there any relevant databases you know of that we could tap into for this purpose?
Thanks again for your feedback!
|
Thank you Musa. Allow me please to follow up on your comments. I am
delighted to hear that you plan to revisit the whitelist algorithm. The
summary of the whitelist procedure and validation says that it weeds out
'insignificant' sites; those that are too small or not engaged in 'news
reporting.' The description says the algorithm has been validated, and
that is good to hear, but unless I misunderstood you, it sounds like
abcnews.go.com and wsj.com (and who knows what else) were accidentally
blacklisted by the whitelist algorithm?(!!).
If so, it would be helpful, and for some research probably essential, that
researchers have access to the validation and procedure, and even the
blacklist.
You asked what I would suggest regarding fine tuning the whitelist
procedure. This is a challenging question. Is it necessary to blacklist
any sites on the GKG? It is a question that has both substantive (what is
news?) and practical (thousands of irrelevant hits?) implications. Does
"small" include local news? If so, local news might be of interest to
researchers. Does it omit fake news? Hopefully not. Two of the very few
mass comm studies that use GDELT data (Vargo, Guo, and Amazeen 2018 and
Guo and Vargo 2020) study fake news content.
Unfortunately, I am not aware of databases per se that could help with this.
Nor any research (though perhaps there is some) on how to find the news
needle in a big data haystack (like the GKG). Addressing that question
with GDELT would make for an interesting mass comm methods paper/article.
That is, how to weed out non-news sites from news sites while still casting
a wide net? At some point soon, if not now, that research will be
necessary.
Thank you again for your hard work on this. And thank you for Icore.
--
Ken Mulligan, Associate Professor
Director of Undergraduate Studies
Department of Political Science
Southern Illinois University
3144 Faner Hall
Carbondale, IL 62901
|
Hey @profkenm, great points, thanks for highlighting that. A couple of notes on the whitelist and why we choose to whitelist.
First, GDELT monitors tens of thousands of sources, many of which do not have any 'news' content (e.g., cars.com). Ingesting all of these sites would result in the same 'big-data' problem that icore tries to mitigate. The idea is to include a broad selection of news sources per country. As a metric for inclusion, we mainly focus on online reach. The fact that certain sites are not part of icore likely means that they are not included in GDELT, but we will spot check the cases you outlined just to be sure. Your WSJ example is likely due to WSJ's own paywall.
Furthermore, icore is built to address specific research questions that are geared towards particular countries, regions, or news sites. Indeed, obtaining 'representative' news coverage of a particular country is challenging, but including all sources of a country is not only computationally infeasible but likely also unnecessary (traditional news research usually contrasts just ~10 sources at a time). In fact, IMHO, the ~800 sources that icore currently includes provide more than ample opportunity to study news from various perspectives (see our case studies in CCR). If I understand your concerns correctly, you would like to have a 'representative' whitelist for particular countries? This is a great but labor-intensive idea; perhaps to facilitate this process, we can make the whitelist available as a CSV along with metadata (e.g., bias). Would that be helpful to your research? If you told us a bit more about the specific questions you seek to address with icore, we could probably provide knowledge/features that help you address them.
Again, thank you so much for your helpful feedback and comments.
|
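To make the CSV idea above concrete, here is a minimal sketch of how such a file could be checked for a given source. No whitelist CSV has been published in this thread, so the column names (`domain`, `country`, `bias`), the sample rows, and the helper names are all hypothetical placeholders rather than icore's actual format.

```python
import csv
import io

# Hypothetical whitelist export: the columns and rows are placeholders,
# not real icore data.
SAMPLE_CSV = """domain,country,bias
example-news.com,US,0.12
example-daily.co.uk,UK,-0.30
"""


def normalize(domain: str) -> str:
    """Lower-case a domain and drop a leading 'www.' so lookups are consistent."""
    domain = domain.strip().lower()
    return domain[4:] if domain.startswith("www.") else domain


def load_whitelist(handle) -> dict:
    """Map each normalized domain to its metadata row."""
    return {normalize(row["domain"]): row for row in csv.DictReader(handle)}


# In practice the handle would be an opened CSV file; a string is used here
# so the sketch is self-contained.
whitelist = load_whitelist(io.StringIO(SAMPLE_CSV))


def is_whitelisted(domain: str) -> bool:
    """Return True if the (normalized) domain appears in the whitelist."""
    return normalize(domain) in whitelist
```

With a file like this, a researcher could check any source locally and read its metadata (for example `whitelist["example-daily.co.uk"]["bias"]`) without going through the web table.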
Thank you for your detailed and helpful explanation, Frederic. Icore's
approach to whitelisting does seem reasonable given its goals. And
very useful for most research on most news media. On second thought, I
think the gist of my concerns earlier was misguided, and perhaps useful
mainly for researchers who want either a preselected (country based)
representative sample of news sources, or the population of all news
sources, by country, (including fake news), which would not fit Icore's
mission of making the GDELT data fire hose useful for non-programmers.
|
Hello and thanks for Icore!
I have a couple of comments about the whitelisted GKG sites on Icore.
First, the whitelist documentation is mentioned on the Icore site but I can't find the actual documentation anywhere. This would be necessary for researchers to scrutinize how, and preferably which, GKG sites were blacklisted.
Second, (maybe related, maybe not) after a few general searches, I noticed that some news sites I expected to find in the U.S. (such as ABC News and the Wall Street Journal) are not present, but the fake news site infowars.com is there, and so is the controversial (maybe fake?) site zerohedge.com. Also controversial sites such as breitbart and bizpacreview.com.
For the purposes of research, all of them should be included. (Indeed, as you know, some of the first published political communication research that uses GDELT deals with fake news). Might the omitted mainstream news sites have been accidentally blacklisted by Icore or, rather, did GDELT fail to scrape them? How would we know?
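One way to approach the "how would we know?" question is sketched below, assuming a researcher has two lists in hand: the Icore whitelist and the set of source domains seen in a sample of raw GKG records. Both sets here are made-up placeholders and the function name is hypothetical; nothing in it reflects which sites Icore or GDELT actually contain.

```python
def classify_source(domain: str, whitelist: set, gdelt_domains: set) -> str:
    """Say where a source drops out: at the whitelist step or upstream of it."""
    d = domain.strip().lower()
    if d in whitelist:
        return "whitelisted"
    if d in gdelt_domains:
        # Present in the raw GKG sample, so the exclusion happened at the whitelist.
        return "in GDELT sample but not whitelisted"
    # Absent from the raw sample as well, so the whitelist is not the cause.
    return "not observed in GDELT sample"


# Placeholder inputs for illustration only.
whitelist = {"example-news.com"}
gdelt_domains = {"example-news.com", "example-omitted.com"}

results = {
    name: classify_source(name, whitelist, gdelt_domains)
    for name in ("example-news.com", "example-omitted.com", "example-absent.com")
}
```

A source that lands in the second category was filtered by the whitelist; one in the third category was never in the sampled GKG data to begin with, although a small sample could simply have missed it.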