wikidata

Tools for working with Wikidata (structured data from Wikipedia).

data/YYYYMMDD-properties.json

These files map property identifiers to labels, usually in English. You can generate property maps and extracted data in whichever language you choose. For an example map, see data/20160215-properties.json.
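As a minimal sketch of how such a map might be read, assuming the file is a flat JSON object keyed by property identifier. The helper names and the inline sample are hypothetical and are not part of this repository:

```python
import json

def load_property_map(text):
    """Parse a property map: a JSON object of property id -> label (assumed layout)."""
    return json.loads(text)

def label_for(prop_map, prop_id):
    """Return the label for prop_id, or the id itself when no label is known."""
    return prop_map.get(prop_id, prop_id)

# Hypothetical sample standing in for data/20160215-properties.json
sample = '{"P31": "instance of", "P50": "author"}'
props = load_property_map(sample)
print(label_for(props, "P50"))    # a known identifier
print(label_for(props, "P9999"))  # an unknown identifier falls back to the id
```

Falling back to the raw identifier keeps output usable when a map was generated before a property was created.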

wd-extract.py

Extract items from a JSON dump of wikidata.org. Currently, extracting only English strings (-l en) and stripping all sitelinks shrinks the JSON to roughly one tenth of its original size.

Usage: wd-extract.py [-c|-C] [-DfFLnR] [-i file] [-I labels] [-l lc] [-o file] [-p lc] [-s pat] [-t type] [-w] wd-dump-json

Option	Long Option	Description
-C	--claims	Don't simplify claims. By default, the complex structure will be simplified.
-c	--classes	Create a class hierarchy and dump it in JSON format.
-D	--datatypes	Don't simplify datatypes. e.g. string values will remain wrapped in JSON objects
-f	--failonerror	If present, exit if an error occurs.
-F	--format	Format the extracted data readably; this is mainly useful for testing
-i	--index file	Output an index to a file. This can be used to quickly read an item out of the extracted data; you must specify a type with -t
-I	--include labels	Don't remove the properties in the quoted comma separated list of labels (see the list below for properties that would normally be removed)
-l	--language lc	Use language lc for all string members, falling back to en if needed, falling back to a random language if needed. The member name will also be depluralized (e.g. "labels" to "label"). If not specified, the multilingual string tables will be left unmodified.
-L	--label-language	Preserve language ids in extracted label strings: {"type": "string", "language": , "value": }
-n	--names	Print labels only instead of dumping objects in JSON. Uses language, or en if none specified.
-o	--output file	Output the extracted data or list to a file. Default=stdout
-p	--properties lc	Replace property ids with labels in language lc, falling back to en or a random language if needed. If not already present, a file named ########-properties.json will be generated, containing a map of property ids to labels.
-s	--sitelinks pat	Pattern for sitelinks to include, or "" to exclude all sitelinks. Sitelinks are links to an item's pages on other Wikimedia sites, such as Wikipedia.
-t	--type type	Type of object to extract (all|property|item|Q####). Default=item. Unless "all" is specified, the type member will be removed from all extracted objects.
-R	--references	TBD: Don't remove references. References are links to sources of information.
-w	--warning	Print warnings.
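The language fallback described for -l can be sketched as follows. This is an illustration only, not the script's actual code: `pick_string` is a hypothetical helper, and the table layout assumed here is the one Wikidata dumps use for labels and descriptions (language code mapped to an object with "language" and "value").

```python
import random

def pick_string(table, lc, rng=random):
    """Choose one value from a multilingual table such as an item's "labels".

    Tries lc first, then "en", then any available language.
    Returns None when the table is empty.
    """
    for code in (lc, "en"):
        if code in table:
            return table[code]["value"]
    if table:
        # No preferred language available: pick any language at random
        return table[rng.choice(sorted(table))]["value"]
    return None

labels = {
    "en": {"language": "en", "value": "book"},
    "de": {"language": "de", "value": "Buch"},
}
print(pick_string(labels, "de"))  # requested language is present
print(pick_string(labels, "fr"))  # not present, so "en" is used
```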

Examples

To generate a sorted list of all of the books (id=Q571) in wikidata (72432 as of 2016-02-15), run the following command:

./wd-extract.py -n -l en -p en -s "" -t Q571 data/20160215.json | sort -d

Ignored Properties

The following properties will be removed unless explicitly included with the -I option:

Property Label	Description
BNCF Thesaurus	Florentine national central library
BnF identifier	French national library
Commons category	Wikimedia Commons
Commons gallery	Wikimedia Commons
Freebase identifier	Defunct structured data source, purchased and closed by Google
GND identifier	German universal authority file
IMDb identifier	Internet movie database
ISFDB title ID	Internet speculative fiction database
KINENOTE film ID	Japanese KINENOTE movie database
LCAuth identifier	US Library of Congress
Library of Congress Classification	US Library of Congress
LibraryThing work identifier	LibraryThing
MusicBrainz artist ID	MusicBrainz
MusicBrainz release group ID	MusicBrainz
MusicBrainz work ID	MusicBrainz
NDL identifier	Japan national diet library
NLA (Australia) identifier	Australian national library
OCLC control number	WorldCat
Open Library identifier	openlibrary.org
PSH ID	Czech technical library
Regensburg Classification	German university of Regensburg library
SUDOC authorities	French university libraries
VIAF identifier	Virtual international authority file
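The interaction between the ignored list and -I might look like the sketch below. It assumes claims have already been keyed by property label (as with -p) and uses only three labels from the table above. `drop_ignored` is a hypothetical helper and the identifier values are placeholders.

```python
# A few labels from the table above; the real list is longer.
IGNORED = {"Freebase identifier", "IMDb identifier", "VIAF identifier"}

def drop_ignored(claims, include=""):
    """Remove ignored properties from claims, keeping any named in include.

    claims is assumed to be a dict keyed by property label.
    include is a comma-separated list of labels, like the -I argument.
    """
    keep = {label.strip() for label in include.split(",") if label.strip()}
    return {k: v for k, v in claims.items() if k not in IGNORED or k in keep}

# Placeholder identifier values, for illustration only
claims = {"author": "Larry Niven", "IMDb identifier": "tt0000000", "VIAF identifier": "0"}
print(drop_ignored(claims))
print(drop_ignored(claims, include="VIAF identifier"))
```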

wd-package.py

Generate a package from data extracted from a Wikidata dump.

Usage: wd-package.py [-l languages] class data-file index-file

Currently, class can be the item identifier of a class to package, or it can be books, which will extract all objects related to books. This script is a work in progress.

Option	Long Option	Description
-l	--languages list	Include only books in one of the comma separated list of languages (default=all). e.g. "English,German". If no 'original language of work' is specified and the book does not have a label in the requested language, don't include it. This requires that the extraction being used has multilingual strings (`-l` was not specified) or has label languages preserved (`-L` was specified).

wd-query.py

Query the Wikidata package in packageFile. By default, list all labels of the named attribute.

Usage: wd-query.py [-f label] attribute packageFile

Option	Long Option	Description
-f	--find label	Search for the label in the named attribute, displaying the matching object in JSON if found

Examples

To list the labels of all the books in a package of books:

wd-query.py books books-package.json

To find and display (in JSON) the book "Ringworld":

wd-query.py -f Ringworld books books-package.json

wd-diagram.py

Generates a class diagram from extracted classes

Usage: wd-diagram.py [-dfw] [-l n] wd-classes-json

Option	Long Option	Description
-d	--dot	Output the diagram in dot format (default=ascii)
-f	--failonerror	If present, exit if an error occurs
-l	--levels n	Show only the first n levels of classes below the root in the hierarchy (default=unlimited)
-w	--warning	Print warnings
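To illustrate what limiting the diagram with -l means, here is a rough sketch of a level-limited ASCII rendering. The nested-dict layout and the `render` function are assumptions made for the example. The format of the classes file read by wd-diagram.py is not described here and may differ.

```python
def render(tree, levels=None, depth=0):
    """Return indented lines for a nested {name: children} dict.

    levels limits how many levels below the root are shown;
    None means unlimited.
    """
    lines = []
    for name in sorted(tree):
        lines.append("  " * depth + name)
        if levels is None or depth < levels:
            lines.extend(render(tree[name], levels, depth + 1))
    return lines

# Hypothetical hierarchy, for illustration only
classes = {"work": {"book": {"novel": {}}, "film": {}}}
print("\n".join(render(classes, levels=1)))
```

With levels=1 only the root and its direct subclasses are listed, so "novel" is omitted.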

wd-lookup.py

Looks up a key (an item identifier number with or without the leading Q) in the data extracted from a wikidata dump. This script can be used to test the Index class.

Usage: wd-lookup.py key data-file index-file
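The idea of an index for fast lookup can be sketched as below. This is not the repository's Index class: it assumes, purely for illustration, a data file with one JSON object per line and an in-memory index from item number to byte offset. All function names are hypothetical.

```python
import io
import json

def normalize_key(key):
    """Accept "Q571" or "571" and return the numeric part as an int."""
    text = str(key).strip()
    if text[:1] in ("Q", "q"):
        text = text[1:]
    return int(text)

def build_index(data):
    """Map item number -> byte offset of its line in a one-object-per-line stream."""
    index = {}
    offset = data.tell()
    for line in iter(data.readline, b""):
        if line.strip():
            index[normalize_key(json.loads(line)["id"])] = offset
        offset = data.tell()
    return index

def lookup(key, data, index):
    """Seek straight to the item's line and parse only that line."""
    data.seek(index[normalize_key(key)])
    return json.loads(data.readline())

# Tiny in-memory stand-in for an extracted data file
data = io.BytesIO(b'{"id": "Q571", "label": "book"}\n{"id": "Q1", "label": "universe"}\n')
index = build_index(data)
print(lookup("Q1", data, index))
print(lookup(571, data, index))
```

Storing offsets rather than the objects themselves keeps the index small, and a lookup parses a single line instead of the whole file.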
