import a subset or a full Wikidata dump into a CouchDB database

import-wikidata-dump-to-couchdb

A tool to transfer an extract of a wikidata dump into a CouchDB database

Dependency

  • NodeJS >= v6. If your distribution doesn't provide a recent version of NodeJS, you might want to uninstall NodeJS and reinstall it using NVM

Installation

git clone https://github.com/maxlath/import-wikidata-dump-to-couchdb
cd import-wikidata-dump-to-couchdb
npm install

Now you can customize ./config/default.js to your needs.

How to

Download dump

Download the latest Wikidata JSON dump.

Extract subset

Extract the subset of the dump fitting your needs, as you might not want to throw ~40 GB at your database's face.

For instance, for the needs of the authors-birthday bot, I wanted to keep only the Wikidata entities of writers.

As each line of the dump is an entity, you could do something like this with grep:

grep '36180,' dump.json > isWriter.json

Here the trick is that every entity with occupation -> writer (P106 -> Q36180) will have 36180 somewhere in its line (as a claim numeric-id). And tadaa, you went from a 39 GB dump to a way nicer 384 MB subset.

But now, we can do something cleaner using wikidata-filter:

cat dump.json | wikidata-filter --claim P106:Q36180 > isWriter.json
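
To illustrate why a claim-based filter is cleaner than matching raw text, here is a minimal sketch of the kind of check involved. It is not wikidata-filter's actual implementation: `hasClaim` is a hypothetical helper, and it assumes the usual Wikidata entity JSON layout (`claims`, `mainsnak`, `datavalue`).

```javascript
// Hypothetical helper, not part of wikidata-filter or of this repository.
// Returns true when a dump line holds an entity with the given claim,
// e.g. P106 -> Q36180 (occupation -> writer).
function hasClaim (line, property, entityId) {
  // Dump lines end with a comma; the first and last lines are '[' and ']'
  const json = line.trim().replace(/,$/, '')
  if (json === '' || json === '[' || json === ']') return false

  const entity = JSON.parse(json)
  const statements = (entity.claims && entity.claims[property]) || []

  return statements.some(statement => {
    const datavalue = statement.mainsnak && statement.mainsnak.datavalue
    const value = datavalue && datavalue.value
    // Statements with no value or an unknown value have no datavalue
    if (!value) return false
    // Accept either the full id ('Q36180') or the numeric id (36180)
    return value.id === entityId || `Q${value['numeric-id']}` === entityId
  })
}
```

Unlike the grep approach, this only matches when 36180 is the value of a P106 claim, not when the number happens to appear elsewhere in the line (in a label or under another property, for example).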

Import

This new file isn't valid JSON as a whole (it's line-delimited JSON), but each line is, once you remove the comma at the end of the line. So here is the plan: take every line, remove the comma, PUT it in your database:

./import.js ./isWriter.json
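
The plan above can be sketched roughly as follows. This is an illustration, not the code of import.js: `parseDumpLine` and `buildPutRequest` are hypothetical names, the database URL is a placeholder, and using the Wikidata id as the CouchDB document `_id` is an assumption.

```javascript
// Hypothetical helpers sketching "take a line, remove the comma, PUT it".

// Turn one dump line into an entity object,
// or null for empty lines and the '[' and ']' lines of a full dump
function parseDumpLine (line) {
  const json = line.trim().replace(/,$/, '')
  if (json === '' || json === '[' || json === ']') return null
  return JSON.parse(json)
}

// Describe the CouchDB request for one entity: PUT {db}/{docid}
// Assumption: the Wikidata id (e.g. 'Q42') is used as the document _id
function buildPutRequest (dbUrl, entity) {
  const doc = Object.assign({ _id: entity.id }, entity)
  return {
    method: 'PUT',
    url: `${dbUrl}/${encodeURIComponent(entity.id)}`,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(doc)
  }
}
```

For example, `buildPutRequest('http://localhost:5984/wikidata', parseDumpLine(line))` returns a plain description of the request, which can then be sent with whichever HTTP client you prefer.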

Specify start and end line numbers:

startline=5
# line 10 will be included
endline=10
./import.js ./isWriter.json $startline $endline
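
As a sketch of what an inclusive line range means here, assuming lines are counted from 1 (the README only states that the end line is included), the selection could look like this. `isInRange` and `selectLines` are hypothetical names, not functions from import.js.

```javascript
// Hypothetical helpers: both bounds are inclusive.
// Assumption: line numbers start at 1.
function isInRange (lineNumber, startLine, endLine) {
  // Command-line arguments arrive as strings, hence the Number() conversion
  const start = startLine === undefined ? 1 : Number(startLine)
  const end = endLine === undefined ? Infinity : Number(endLine)
  return lineNumber >= start && lineNumber <= end
}

// Keep only the lines whose 1-based position falls within the range
function selectLines (lines, startLine, endLine) {
  return lines.filter((line, index) => isInRange(index + 1, startLine, endLine))
}
```

With `startLine` 5 and `endLine` 10, six lines are kept: lines 5, 6, 7, 8, 9 and 10.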

Behavior on conflict

In the config file (./config/default.js), you can set the behavior on conflict, that is, when the importer tries to add an entity that was already added to CouchDB:

  • update (default): update document if there is a change, otherwise pass.
  • pass: always pass
  • exit: exit process at first conflict
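
The three behaviors can be sketched as a small decision function. This is a hedged illustration rather than the importer's actual code: `resolveConflict` is a hypothetical name, and comparing documents while ignoring `_rev` is an assumption about what "a change" means. It relies on `util.isDeepStrictEqual`, which requires NodeJS >= v9.

```javascript
const { isDeepStrictEqual } = require('util')

// Copy a document without its CouchDB revision, for comparison purposes
function withoutRev (doc) {
  const copy = Object.assign({}, doc)
  delete copy._rev
  return copy
}

// Hypothetical sketch: decide what to do when the document already exists
function resolveConflict (behavior, existingDoc, newDoc) {
  if (behavior === 'exit') return { action: 'exit' }
  if (behavior === 'pass') return { action: 'pass' }
  if (behavior === 'update') {
    if (isDeepStrictEqual(withoutRev(existingDoc), withoutRev(newDoc))) {
      return { action: 'pass' }
    }
    // CouchDB only accepts an update that carries the current revision
    const doc = Object.assign({}, newDoc, { _rev: existingDoc._rev })
    return { action: 'update', doc }
  }
  throw new Error(`unknown conflict behavior: ${behavior}`)
}
```

In the update case, the returned document carries the existing `_rev`, since CouchDB rejects a PUT on an existing document without its current revision.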

See also

License

MIT