[drafting…]
Project Goal: Turn Wikimedia into a news-site credibility tool.
These are resources and a data diary for the WikiCred/Iffy.news project, which adds news-site credibility indicators, found in external databases, to Wikidata/Wikipedia. The external data I have is mostly U.S. and English-language. Those with news-media data for other countries and languages may find this repo helpful.
The following workflow came from trial and many errors in my attempts to:
- Find news-media items in Wikidata.
- Create new items for news-media not in Wikidata.
- Match news-media items in Wikidata with their domain names (to relate Wikidata items with their entries in external databases).
- Add data from external media databases into Wikidata (especially credibility indicators like press-association membership and street address).
Useful datasets created by this project include (more coming):
- Wikipedia: US newspapers, auto-compiled (code) from state listings, with WD QID and WP path and page ID.
- Wikidata: US state press associations (added/updated via QuickStatements).
- Wikidata: US cities and towns, with QIDs (also in csv).
- Wikidata: US states, with QID, lat/lon, FIPS and abbreviations (two-letter and AP).
- Wikidata: Identifiers, news-outlet references at external sites.
I gathered Wikidata items with Wikidata Query Service searches (example: news media in the United States), added data with QuickStatements (example: add place of publication) and wikibase-cli, and merged Wikidata with external datasets mostly in Google Sheets, helped by the Wikipedia and Wikidata Tools sheets add-on.
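A query like the "news media in the United States" example could look roughly like the sketch below. It is not the project's actual query: the QID for news media (`Q1193236`) is my assumption and should be confirmed on Wikidata, and the function names and User-Agent string are hypothetical. The properties used are instance of (P31), subclass of (P279), country (P17), and official website (P856).

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: Q1193236 is the "news media" class; confirm before relying on it.
NEWS_MEDIA = "Q1193236"
UNITED_STATES = "Q30"


def build_query(root_class: str = NEWS_MEDIA, country: str = UNITED_STATES,
                limit: int = 100) -> str:
    """Build a SPARQL query for items that are an instance of root_class
    or of any of its subclasses (P31/P279*), located in the given country."""
    return f"""
SELECT ?item ?itemLabel ?website WHERE {{
  ?item wdt:P31/wdt:P279* wd:{root_class} ;
        wdt:P17 wd:{country} .
  OPTIONAL {{ ?item wdt:P856 ?website . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


def flatten_bindings(payload: dict) -> list:
    """Turn the SPARQL JSON result format into plain dicts, adding a bare QID."""
    rows = []
    for binding in payload["results"]["bindings"]:
        row = {name: cell["value"] for name, cell in binding.items()}
        if "item" in row:
            # Item URIs look like http://www.wikidata.org/entity/Q2668654
            row["qid"] = row["item"].rsplit("/", 1)[-1]
        rows.append(row)
    return rows


def run_query(query: str) -> list:
    """Send the query to the Wikidata Query Service (needs network access)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "wikicred-example/0.1 (hypothetical)"})
    with urllib.request.urlopen(request, timeout=60) as response:
        return flatten_bindings(json.load(response))
```

Calling `run_query(build_query(limit=10))` would return a list of dicts with `qid`, `itemLabel`, and (where present) `website` keys, ready to paste into a spreadsheet.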
After starting over several times, I remembered that my betters taught me to make each step replicable and reversible, so I could back out of any import mess I made. To do this, I usually added a column with a sortable flag indicating the source of imported data, to track where things like circulation estimates and domain names came from. As they (should) say in the tech world: Move slow and fix things.
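The source-flag idea can be sketched outside a spreadsheet too. This is a minimal illustration, not the workflow I actually ran in Google Sheets: the function names, the `source` column name, and the `import-A`/`import-B` flags are all hypothetical.

```python
import csv
import io


def tag_rows(rows, source):
    """Return copies of the rows with a 'source' flag naming the import batch."""
    return [{**row, "source": source} for row in rows]


def drop_source(rows, source):
    """Back out one import by removing every row carrying its flag."""
    return [row for row in rows if row.get("source") != source]


def to_csv(rows):
    """Serialize rows to CSV text, keeping columns in first-seen order."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical batches: each import gets its own sortable flag.
batch_a = tag_rows([{"qid": "Q2668654", "domain": "denverpost.com"}], "import-A")
batch_b = tag_rows([{"qid": "Q4115189", "domain": "example.com"}], "import-B")
merged = batch_a + batch_b

# If import-B turns out to be a mess, remove it without touching import-A.
cleaned = drop_source(merged, "import-B")
```

Because every row records where it came from, sorting or filtering on the flag is enough to undo a single batch.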
Wikidata stores structured data used in Wikipedia and other Wikimedia projects. It's a collection of entries for Items, "all the things in human knowledge, including topics, concepts, and objects." Each Item has its own page, URL, and unique QID (Q + a number).
The Denver Post (Q2668654) is an item. It has a label (its name), a QID, a short description ("daily newspaper in Denver, Colorado"), and aliases (alternative names: "Denver Post | denverpost.com"). Those are followed by a list of Statements about the item. Statements have a Property (P + a number) and a Value (in that property's specified data type):
property | value | (data type) |
---|---|---|
instance of (P31) | daily newspaper (Q1110794) | (Item) |
inception (P571) | 1892 | (Point in time) |
official website (P856) | https://www.denverpost.com/ | (URL) |
News media often have a separate list of statements under the heading Identifiers. Those properties have a data type called External identifier, for example, Facebook ID (P2013) and ISSN (P236), the International Standard Serial Number.
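In Wikidata's entity JSON, each statement sits under its property ID, with the value inside `mainsnak.datavalue`. The sketch below reads the three statements from the table above out of a hand-written, heavily trimmed sample shaped like that JSON. It is not a live API response, and the helper name `simple_values` is hypothetical.

```python
# Hand-written sample in the shape of Wikidata entity JSON ("claims" section),
# trimmed to the three statements shown in the table above.
SAMPLE_CLAIMS = {
    "P31": [{"mainsnak": {
        "snaktype": "value", "datatype": "wikibase-item",
        "datavalue": {"type": "wikibase-entityid",
                      "value": {"entity-type": "item", "id": "Q1110794"}}}}],
    "P571": [{"mainsnak": {
        "snaktype": "value", "datatype": "time",
        "datavalue": {"type": "time",
                      # precision 9 means "year"
                      "value": {"time": "+1892-00-00T00:00:00Z",
                                "precision": 9}}}}],
    "P856": [{"mainsnak": {
        "snaktype": "value", "datatype": "url",
        "datavalue": {"type": "string",
                      "value": "https://www.denverpost.com/"}}}],
}


def simple_values(claims):
    """Reduce each statement to a plain value, keyed by property ID."""
    result = {}
    for pid, statements in claims.items():
        values = []
        for statement in statements:
            snak = statement["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # skip "no value" / "unknown value" statements
            datavalue = snak["datavalue"]
            if datavalue["type"] == "wikibase-entityid":
                values.append(datavalue["value"]["id"])    # Item: keep the QID
            elif datavalue["type"] == "time":
                values.append(datavalue["value"]["time"])  # Point in time
            else:
                values.append(datavalue["value"])          # URL, string, etc.
        result[pid] = values
    return result


values = simple_values(SAMPLE_CLAIMS)
```

External identifiers such as ISSN (P236) arrive as plain strings, so they fall through to the last branch.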
An item isn't always one thing. It can be a concept: a Class of things in a hierarchy, with one item being a subclass of (P279) another. Each of these newsy items is a subclass of the one to its left:
media ➡️ mass media ➡️ news media ➡️ written news media ➡️ newspaper ➡️ daily newspaper
For this project, it was convenient to have all news-media outlets in Wikidata be an instance of news media, or an instance of one of its subclasses. Most already were. But some news outlets weren't showing up because they were instances of classes that weren't in the news media hierarchy (e.g., investigative journalism (Q1127717), news program, news magazine). I fixed those by adding a statement making each of those classes a news media subclass. (Check this network chart in Wikidata Graph Builder.)
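The reason an outlet "doesn't show up" can be illustrated with a toy version of the hierarchy. The sketch below walks subclass of links upward by label. It assumes one parent per class, which is a simplification (Wikidata classes can have several parents), and the function names are hypothetical.

```python
# Toy "subclass of" (P279) map, using labels from the chain above.
SUBCLASS_OF = {
    "daily newspaper": "newspaper",
    "newspaper": "written news media",
    "written news media": "news media",
    "news media": "mass media",
    "mass media": "media",
}


def ancestors(cls, parent_of):
    """Follow subclass-of links upward and return every class passed."""
    chain = []
    while cls in parent_of and parent_of[cls] not in chain:
        cls = parent_of[cls]
        chain.append(cls)
    return chain


def in_hierarchy(cls, parent_of, root="news media"):
    """True if cls is the root class or one of its (transitive) subclasses."""
    return cls == root or root in ancestors(cls, parent_of)


# Before the fix: the class has no path up to "news media".
before = in_hierarchy("investigative journalism", SUBCLASS_OF)

# After adding one subclass-of statement, instances of it become findable.
fixed = {**SUBCLASS_OF, "investigative journalism": "news media"}
after = in_hierarchy("investigative journalism", fixed)
```

In a SPARQL query the same upward walk is written as the property path `wdt:P31/wdt:P279*`.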
mindmap
  root((news media))
    id(news agency)
      news photo agency
      newswire
    id(newsletter)
      municipal newsletter
      night letter 1>
      school newsletter
      stock exchange newsletter
      Wikimedia newsletter
    id(investigative journalism)
    id(medical press)
    id(news program)
      current affairs shows
      flagship newscast
      television news magazine
      United States cable news
    id(news broadcasting)
      children's broadcasted news
      current affairs
      election broadcast
      reporting television program
    id(news magazine)
    id(news media in the United States)
    id(news website)
      fake news website
      news aggregation website 2>
      online newspaper
      sports news website
      television news website
      video game news website 1>
    id(talk radio)
      conservative talk radio
      Internet talk radio
      Progressive talk radio
    id(press center)
    id(written news media)
      Cooperative press
      Famille de presse
      newspaper 202>
    id(women's press)
Subclasses of news media, 2 levels down
[@Todo: Briefly explain diff btwn instance and subclass] A few news outlets were instances of items that should have been news-media subclasses but weren't (e.g., news program and news magazine). I brought them into the fold (i.e., made each a news media subclass, or a subclass of a news media subclass).
The classification wrangling went something like this:
- Get all news outlets under one general category: news media.
- Get subclasses into logical categories (one or two levels down).
- Change specific news outlets improperly assigned `subclass of` to `instance of`.
- Label unlabeled subclasses (one or two levels down), consulting the item's Wikipedia article or official website for the best name.
[@Todo: Describe ways to find, confirm website URL, and add both to news-media items.]
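Until that Todo is written up, here is one way the domain matching could work: reduce both the Wikidata official website (P856) value and the external database's URL to a bare hostname, then join on it. This is a sketch under my own assumptions, not the project's code. The function names are hypothetical and the external rows are made-up examples.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Reduce a URL (with or without scheme) to a lowercase host, minus 'www.'."""
    url = url.strip()
    if "://" not in url:
        url = "//" + url  # lets urlsplit treat a bare domain as the host
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


def match_by_domain(wikidata_rows, external_rows):
    """Pair external records with Wikidata items that share a domain."""
    by_domain = {domain_of(row["website"]): row["qid"] for row in wikidata_rows}
    matches = []
    for row in external_rows:
        qid = by_domain.get(domain_of(row["url"]))
        if qid:
            matches.append((qid, row["url"]))
    return matches


wikidata_rows = [{"qid": "Q2668654", "website": "https://www.denverpost.com/"}]
# Hypothetical external-database rows.
external_rows = [{"url": "denverpost.com"}, {"url": "http://example.com/news"}]

matches = match_by_domain(wikidata_rows, external_rows)
```

Stripping `www.` is a judgment call: it catches most mismatches between sources, but outlets hosted on other subdomains would still need a manual check.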
[@Todo: Briefly explain: Find out which properties WD folk use most often for news-media items. Then go with the wiki-crowd wisdom in deciding which property/class to use.]
[@Todo: Briefly explain: The city was most often a place of publication (P291), but sometimes was headquarters location (P159), location (P276), and/or located in the administrative territorial entity (P131). Done: Add place of publication to all news media. Todo: Add street address (P6375) (use format in prop's example: street, city, state, zip)]
[@Todo: Briefly explain: The date a publication ceased was in dissolved, abolished or demolished (P576) statements about 90% of the time, with the rest as end time (P582). Done: Copy all dates in end time into dissolved… (with precision: day, month, or year).]
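For the date copying, QuickStatements (version 1 syntax) takes tab-separated lines, with time values written as a timestamp followed by `/` and a precision code (9 = year, 10 = month, 11 = day). The sketch below builds such lines from partial dates. The helper names are hypothetical, the dates are invented, and the item used is the Wikidata Sandbox (Q4115189) rather than a real newspaper. Check the QuickStatements help page for the exact format before running a batch.

```python
def qs_time(date: str) -> str:
    """Convert 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' to a QuickStatements time value.

    The precision code follows from how much of the date is given:
    9 = year, 10 = month, 11 = day.
    """
    parts = date.split("-")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unrecognized date: {date!r}")
    year = int(parts[0])
    month = int(parts[1]) if len(parts) > 1 else 0
    day = int(parts[2]) if len(parts) > 2 else 0
    precision = 8 + len(parts)
    return f"+{year:04d}-{month:02d}-{day:02d}T00:00:00Z/{precision}"


def qs_line(qid: str, pid: str, value: str) -> str:
    """One QuickStatements V1 command: item, property, value, tab-separated."""
    return "\t".join([qid, pid, value])


# Invented dates on the sandbox item, one per precision level.
batch = [
    qs_line("Q4115189", "P576", qs_time("2009")),
    qs_line("Q4115189", "P576", qs_time("2009-02")),
    qs_line("Q4115189", "P576", qs_time("2009-02-27")),
]
```

Keeping the precision with each date matters: a year-only end time copied as a full date would claim more than the source says.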
[@Todo: Briefly explain: Membership in a press association was almost always member of (P463) but a few times affiliation (P1416).]