## How to
- NodeJS >= v6. If your distribution doesn't provide a recent version of NodeJS, you might want to uninstall it and reinstall it using NVM
```sh
git clone https://github.com/maxlath/import-wikidata-dump-to-couchdb
cd import-wikidata-dump-to-couchdb
npm install
```
Now you can customize `./config/default.js` to your needs.
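For orientation, here is a sketch of what such a config might look like. The key names below are assumptions for illustration, so refer to the actual file shipped with the repo:

```js
// Hypothetical shape of ./config/default.js; key names are assumptions,
// check the file itself for the real ones.
module.exports = {
  // CouchDB instance to import into
  couchdb: {
    host: 'localhost',
    port: 5984,
    db: 'wikidata',
    user: 'admin',
    password: 'secret'
  },
  // what to do when an entity already exists: 'update', 'pass', or 'exit'
  // (see "Behavior on conflict" below)
  onConflict: 'update'
}
```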
Download the latest Wikidata dump.
Extract the subset of the dump fitting your needs, as you might not want to throw ~40 GB at your database's face.
For instance, for the needs of the authors-birthday bot, I wanted to keep only Wikidata entities of writers:
As each line of the dump is an entity, you could do something like this with grep:

```sh
cat dump.json | grep '36180\,' > isWriter.json
```
The trick here is that every entity with occupation->writer (P106->Q36180) will have 36180 somewhere in the line (as a claim's numeric-id). And tadaa, you went from a 39 GB dump to a much nicer 384 MB subset.
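To see why this works, here is the (abridged) shape of a P106 claim inside an entity line; everything except the claim structure is trimmed for readability:

```json
{
  "id": "Q535",
  "claims": {
    "P106": [
      {
        "mainsnak": {
          "snaktype": "value",
          "property": "P106",
          "datavalue": {
            "value": { "entity-type": "item", "numeric-id": 36180 },
            "type": "wikibase-entityid"
          }
        }
      }
    ]
  }
}
```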
But now we can do something cleaner using wikidata-filter:

```sh
cat dump.json | wikidata-filter --claim P106:Q36180 > isWriter.json
```
This new file isn't valid JSON as a whole (it's line-delimited JSON), but every line is, once you remove the comma at the end of it. So here is the plan: take every line, remove the comma, and PUT it into your database; that is essentially what `./import.js` does (a sketch of the idea follows).
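A minimal sketch of that plan in Node, not the project's actual code; it assumes a CouchDB reachable at localhost:5984 and a database named `wikidata`:

```js
// Read line-delimited JSON, strip the trailing comma, and PUT each
// entity into CouchDB under its Wikidata id.
const fs = require('fs')
const http = require('http')
const readline = require('readline')

const rl = readline.createInterface({ input: fs.createReadStream('./isWriter.json') })

rl.on('line', (line) => {
  // drop the trailing comma left over from the original JSON array
  const json = line.replace(/,\s*$/, '')
  // skip the array brackets wrapping the dump, if any survived the filter
  if (json === '[' || json === ']' || json === '') return
  const entity = JSON.parse(json)
  // use the entity id (e.g. Q535) as the CouchDB document id
  const req = http.request({
    method: 'PUT',
    host: 'localhost',
    port: 5984,
    path: '/wikidata/' + entity.id,
    headers: { 'Content-Type': 'application/json' }
  }, (res) => res.resume())
  req.end(JSON.stringify(entity))
})
```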
Specify start and end line numbers:

```sh
startline=5
endline=10 # the line 10 will be included
./import.js ./isWriter.json $startline $endline
```
## Behavior on conflict
In the config file (`./config/default.js`), you can set the behavior on conflict, that is, when the importer tries to add an entity that was already added to CouchDB (see the sketch after the list):
- `update` (default): update the document if there is a change, otherwise pass.
- `pass`: always pass.
- `exit`: exit the process at the first conflict.
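To make the `update` strategy concrete, here is an illustrative sketch, not the project's actual code. It assumes the same local CouchDB as above and compares documents naively with `JSON.stringify`:

```js
// On a conflict, fetch the current doc, and re-PUT the entity with the
// current _rev only if something actually changed.
const http = require('http')

function updateOnConflict (entity) {
  http.get('http://localhost:5984/wikidata/' + entity.id, (res) => {
    let body = ''
    res.on('data', (chunk) => { body += chunk })
    res.on('end', () => {
      const current = JSON.parse(body)
      const rev = current._rev
      delete current._rev
      delete current._id
      // 'pass' case: the stored doc already matches the entity
      // (naive comparison, sensitive to key order)
      if (JSON.stringify(current) === JSON.stringify(entity)) return
      // 'update' case: re-PUT with the current revision to resolve the conflict
      const req = http.request({
        method: 'PUT',
        host: 'localhost',
        port: 5984,
        path: '/wikidata/' + entity.id,
        headers: { 'Content-Type': 'application/json' }
      }, (res2) => res2.resume())
      req.end(JSON.stringify(Object.assign({ _rev: rev }, entity)))
    })
  })
}
```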
## See also

- wikidata-filter: a command-line tool to filter a Wikidata dump by claim
- wikidata-subset-search-engine: tools to set up an ElasticSearch instance fed with subsets of Wikidata
- wikidata-cli: read and edit Wikidata from the command line