jq-wikidata
jq module to process Wikidata JSON format
This git repository contains a module for the jq data transformation language to process entity data from Wikidata or other Wikibase instances serialized in its JSON format.
Several methods exist to get entity data from Wikidata. This module is designed to process entities in their JSON serialization, especially in large numbers. For small sets of items, consider using a dedicated client such as wikidata-cli instead.
Install
Installation requires jq version 1.5 or newer.
Put wikidata.jq in a place where jq can find it as a module.
One way to do so is to check out this repository into the directory ~/.jq/wikidata/:
mkdir -p ~/.jq && git clone https://github.com/nichtich/jq-wikidata.git ~/.jq/wikidata
Usage
The shortest way to use functions of this jq module is to include the module directly:
jq -c 'include "wikidata"; .labels|reduceLabels' entities.ndjson
More complex scripts are better put into a .jq file:
include "wikidata";
.labels|reduceLabels
Process JSON dumps
Wikidata JSON dumps are made available at https://dumps.wikimedia.org/wikidatawiki/entities/. The current dumps exceed 35 GB even in their most compressed form. Each file contains one large JSON array, so it is best converted into a stream of JSON objects for further processing.
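Outside jq, the same conversion can be sketched in a few lines of Python. The sketch rests on an assumption not stated above: that the dump puts every entity on its own line, followed by a comma, with the array brackets on separate lines. The function names iter_entities and iter_dump are hypothetical and not part of this module.

```python
import bz2
import json


def iter_entities(lines):
    """Yield one entity dict per line of a JSON array dump.

    Assumption: each entity sits on its own line, optionally followed
    by a comma, and the array brackets stand on lines of their own.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        yield json.loads(line)


def iter_dump(path):
    """Stream entities from a bz2-compressed dump without loading it whole."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        yield from iter_entities(handle)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a dump file (illustrative data only)
    sample = ["[\n", '{"id": "Q1"},\n', '{"id": "Q2"}\n', "]\n"]
    for entity in iter_entities(sample):
        print(json.dumps(entity))  # one JSON object per line
```

Reading line by line keeps memory use constant, which is the point of turning the array into a stream of objects.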
With a fast and stable internet connection it is possible to process the dump on the fly like this:
curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 \
| bzcat | jq -nc --stream 'include "wikidata"; ndjson' | jq .id
Per-item access
JSON data for single entities can be obtained via the Entity Data URL. Examples:
- https://www.wikidata.org/wiki/Special:EntityData/Q42.json
- https://www.wikidata.org/wiki/Special:EntityData/L3006.json
- https://www.wikidata.org/wiki/Special:EntityData/L3006-F1.json
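The URL pattern visible in these examples can be sketched in Python as follows. This is an illustration only and not the module's implementation: the name build_entity_data_url is hypothetical, and the identifier pattern is an assumption that covers just the shapes shown above plus properties and lexeme senses.

```python
import re

ENTITY_DATA_BASE = "https://www.wikidata.org/wiki/Special:EntityData/"

# Assumed identifier shapes: Q42, P31, L3006, L3006-F1, L3006-S1
ID_PATTERN = re.compile(
    r"^(?:[QP][1-9][0-9]*|L[1-9][0-9]*(?:-[FS][1-9][0-9]*)?)$"
)


def build_entity_data_url(entity_id):
    """Return the JSON Entity Data URL for an identifier string."""
    if not ID_PATTERN.match(entity_id):
        raise ValueError(f"not a recognised entity identifier: {entity_id!r}")
    return f"{ENTITY_DATA_BASE}{entity_id}.json"


if __name__ == "__main__":
    for identifier in ("Q42", "L3006", "L3006-F1"):
        print(build_entity_data_url(identifier))
```

Validating the identifier first avoids sending malformed requests when the input comes from an untrusted list.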
The module function entity_data_url creates these URLs from Wikidata
identifier strings. The resulting data is wrapped in a JSON object; unwrap with
.entities|.[]:
curl $(echo Q42 | jq -rR 'include "wikidata"; entity_data_url') | jq '.entities|.[]'
As mentioned above, wikidata-cli is the better choice for accessing small sets of items:
wd d Q42
To get sets of items that match given criteria, use SPARQL or the MediaWiki API modules wbsearchentities and/or wbgetentities.
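When the identifiers are already known, the wbgetentities route can be sketched in Python as below. The endpoint https://www.wikidata.org/w/api.php and the parameters action, ids, and format come from the public MediaWiki API rather than from this module. The function names and the User-Agent string are hypothetical. The API caps the number of identifiers per request, so long lists need batching, which this sketch leaves out.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"


def wbgetentities_url(ids):
    """Build a wbgetentities request URL for a list of entity identifiers."""
    query = urlencode({
        "action": "wbgetentities",
        "ids": "|".join(ids),  # the API expects pipe-separated identifiers
        "format": "json",
    })
    return f"{API_URL}?{query}"


def to_ndjson(response):
    """Unwrap the "entities" object into one compact JSON line per entity."""
    return [
        json.dumps(entity, separators=(",", ":"))
        for entity in response["entities"].values()
    ]


def fetch_entities(ids):
    """Fetch entities over the network and return them as NDJSON lines."""
    request = Request(
        wbgetentities_url(ids),
        # Placeholder User-Agent; replace with one that identifies your tool
        headers={"User-Agent": "jq-wikidata-example/0.1"},
    )
    with urlopen(request) as reply:
        return to_ndjson(json.load(reply))
```

Writing the returned lines to a file such as entities.ndjson gives input in the shape the jq commands in this document expect.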
Reduce entity data
Use function reduceEntity or more specific functions (reduceInfo, reduceItem, reduceProperty, reduceLexeme) to reduce the JSON data structure without loss of essential information.
Further select only some specific fields if needed:
jq '{id,labels}' entities.ndjson
API
Reduce entity
Applies reduceInfo and one of reduceItem, reduceProperty, reduceLexeme.
reduceEntity
Reduce item
Simplifies labels, descriptions, aliases, claims, and sitelinks of an item.
reduceItem
Reduce property
Simplifies labels, descriptions, aliases, and claims of a property.
reduceProperty
Reduce labels
.labels|reduceLabels
Reduce descriptions
.descriptions|reduceDescriptions
Reduce aliases
.aliases|reduceAliases
Reduce sitelinks
.sitelinks|reduceSitelinks
Reduce lexeme
Simplifies lemmas, forms, and senses of a lexeme entity.
reduceLexeme
Reduce forms
Only lexemes have forms.
.forms|reduceForms
Reduce senses
.senses|reduceSenses
Reduce claims
Removes unnecessary fields .id, .hash, .type, .property and simplifies
values.
.claims|reduceClaims
Reduce info
reduceInfo
Removes additional information fields pageid, ns, title, lastrevid, and modified.
To remove selected fields, see the jq function del.
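For comparison outside jq, the effect described for reduceInfo can be sketched in Python. The sketch mirrors only the sentence above (dropping the five listed top-level fields) and is not derived from the module's source. The function name drop_info is hypothetical and the sample values are made up.

```python
INFO_FIELDS = ("pageid", "ns", "title", "lastrevid", "modified")


def drop_info(entity, fields=INFO_FIELDS):
    """Return a copy of an entity dict without the given top-level fields."""
    return {key: value for key, value in entity.items() if key not in fields}


# Illustrative entity; the numbers and the timestamp are invented
entity = {
    "id": "Q42",
    "type": "item",
    "pageid": 1,
    "ns": 0,
    "title": "Q42",
    "lastrevid": 2,
    "modified": "2020-01-01T00:00:00Z",
}

print(drop_info(entity))  # {'id': 'Q42', 'type': 'item'}
```

Passing a different tuple as fields plays the role of del for an arbitrary selection, and the input dict is left unchanged.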
Stream an array of entities
The module function ndjson can be used to turn a stream containing an array of
entities into a list of entities:
bzcat latest-all.json.bz2 | jq -n --stream 'include "wikidata"; ndjson'
Contributing
The source code is hosted at https://github.com/nichtich/jq-wikidata.
Bug reports and feature requests are welcome!
License
Made available under the MIT License by Jakob Voß.