
How-To

Summary

  • Filter entities
  • Parallelize
  • Format entities
  • Other options
  • Usage as module
  • Examples

Filter entities

By claims

  • from a local file
cat entities.json | wikidata-filter --claim P31:Q5 > humans.ndjson
cat entities.json | wikidata-filter --claim P18 > entities_with_an_image.ndjson
cat entities.json | wikidata-filter --claim P31:Q5,Q6256 > humans_and_countries.ndjson

This command filters entities.json into a subset where each line is the JSON of an entity having Q5 in its P31 claims.

(where ndjson stands for newline-delimited JSON)
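For illustration, one line of the output looks something like this (attributes truncated here, real lines are much longer):

{"id":"Q42","type":"item","labels":{...},"descriptions":{...},"aliases":{...},"claims":{"P31":[...]},"sitelinks":{...}}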

  • directly from a Wikidata dump
curl https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz | gzip -d | wikidata-filter --claim P31:Q5 > humans.ndjson

This can be quite convenient when you don't have enough space to keep the whole decompressed dump on your disk: this way, you only ever write the desired subset.

Of course, this probably only makes sense if you are looking for more than roughly 100 000 entities: below that level, it would likely be faster and more efficient to get the list of ids from the Wikidata Query Service, then fetch the entities' data from the API (wikidata-sdk can be helpful there).
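For instance, here is a sketch of that alternative approach, fetching the ids of all countries (P31:Q6256) from the Wikidata Query Service with curl and jq (the SPARQL query and the jq filter are illustrative, adapt them to your case):

# get the entity URIs of all countries from the Wikidata Query Service
QUERY='SELECT ?item WHERE { ?item wdt:P31 wd:Q6256 }'
curl -G 'https://query.wikidata.org/sparql' \
  --data-urlencode "query=$QUERY" \
  --data-urlencode 'format=json' \
  | jq -r '.results.bindings[].item.value' > country_ids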

Long claim option

If your claim is too long and triggers an Argument list too long error, you can pass a file instead:

echo 'P31:Q5,Q6256' > ./claim
cat entities.json | wikidata-filter --claim ./claim > humans_and_countries.ndjson

By sitelinks

Keep only entities with a certain sitelink

# entities with a page on Wikimedia Commons
cat entities.json | wikidata-filter --sitelink commonswiki > subset.ndjson
# entities with a Dutch Wikipedia article
cat entities.json | wikidata-filter --sitelink nlwiki > subset.ndjson
# entities with a Russian Wikipedia or Wikiquote article
cat entities.json | wikidata-filter --sitelink 'ruwiki|ruwikiquote' > subset.ndjson

You can even build finer filters by combining conditions with & (AND) and | (OR).

# entities with Chinese and French Wikipedia articles
cat entities.json | wikidata-filter --sitelink 'zhwiki&frwiki' > subset.ndjson
# entities with Chinese and French Wikipedia articles, or Chinese and Spanish articles
cat entities.json | wikidata-filter --sitelink 'zhwiki&frwiki|eswiki' > subset.ndjson

NB: A&B|C is interpreted as A AND (B OR C)
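If you need the other grouping, (A AND B) OR C, one possible workaround (a sketch, not a documented operator) is to run two filters and merge the deduplicated results:

# entities with (Chinese AND French) OR Spanish Wikipedia articles
cat entities.json | wikidata-filter --sitelink 'zhwiki&frwiki' > part1.ndjson
cat entities.json | wikidata-filter --sitelink eswiki > part2.ndjson
# identical entity lines matched by both filters are deduplicated by sort -u
sort -u part1.ndjson part2.ndjson > subset.ndjson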

By type

Default: item

cat entities.json | wikidata-filter --type item
cat entities.json | wikidata-filter --type property
cat entities.json | wikidata-filter --type both

By something else

Need another kind of filter? Just ask for it in the issues, or make a pull request!

Parallelize

If your hardware happens to have several cores, you can do better:

  • replace gzip by pigz
  • load balance lines over several wikidata-filter processes using load-balance-lines or something that does the same job
# install those new dependencies
sudo apt-get install pigz
npm install --global load-balance-lines

# increase the max RAM available to node processes, to prevent allocation errors
export NODE_OPTIONS=--max_old_space_size=4096

wget --continue https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
nice -n+19 pigz -d < latest-all.json.gz | nice -n+19 load-balance-lines wikidata-filter --claim P31:Q5 > humans.ndjson

nice is used to tell the system that these processes, while eager to eat all the CPUs, should have the lowest priority. If you are not familiar with the < operator, this is equivalent to cat latest-all.json.gz | nice -n+19 pigz -d, but uses the shell's built-in redirection instead of an extra cat process (see the I/O Redirection doc).

Format entities

Filter attributes

Wikidata entities have the following attributes: id, type, labels, descriptions, aliases, claims, sitelinks. Altogether this takes a lot of space and might not be needed in your use case: for instance, if your goal is to do full-text search on a subset of Wikidata, you just need to keep the labels, aliases and descriptions, and can omit the claims and sitelinks, which take up most of the space.

This can be done with either the --keep or the --omit option:

cat entities.json | wikidata-filter --omit claims,sitelinks > humans.ndjson
# which is equivalent to
cat entities.json | wikidata-filter --keep id,type,labels,descriptions,aliases > humans.ndjson

Filter languages

Keep only the desired languages for labels, descriptions, aliases, and sitelinks.

cat entities.json | wikidata-filter --languages en,fr,de,zh,eo > subset.ndjson

Simplify entity data

Uses the wikidata-sdk simplify.entity function to simplify the labels, descriptions, aliases, claims, and sitelinks.

cat entities.json | wikidata-filter --simplified > simplified_dump.ndjson
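For illustration, a simplified entity line could look something like this (illustrative values):

{"id":"Q42","type":"item","labels":{"en":"Douglas Adams"},"aliases":{"en":["Douglas Noel Adams"]},"claims":{"P31":["Q5"]},"sitelinks":{"enwiki":"Douglas Adams"}}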

Other options

-h, --help      output usage information
-V, --version   output the version number

Usage as module

The wikidata-filter module provides helper methods to parse, serialize and filter entities from Wikidata dumps.

const { parser, serializer, filter } = require('wikidata-filter')

Chain filter functions

// Filter functions must return a (possibly modified) entity, or undefined/null
// to filter out the entity
const fixEntities = entity => {
  if (entity.id === 'Q12345') {
    entity.labels.en = { language: 'en', value: 'Count von Count' }
  }
  return entity
}

// Build a filter from the options documented above
const langFilter = filter({ type: 'item', languages: [ 'en', 'fr' ] })

// return a stream of entities
parser(process.stdin)
// filter entity stream
.filter(fixEntities)
// filters can be chained
.filter(langFilter)
// serialize entities as newline delimited JSON
.filter(serializer)
.pipe(process.stdout)

or directly pass the filter options to the parser:

parser(process.stdin, {
  type: 'item',
  keep: [ 'labels', 'claims' ],
  simplified: true,
  languages: [ 'zh', 'fr' ]
})
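The configured parser returns the same kind of entity stream, so it can be serialized and piped as before (a sketch combining the two snippets above):

parser(process.stdin, {
  type: 'item',
  simplified: true,
  languages: [ 'zh', 'fr' ]
})
.filter(serializer)
.pipe(process.stdout)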

Examples

Get all entities with a Chinese and a French Wikipedia article, keeping only id, labels, and sitelinks matching those languages

The equivalent SPARQL query times out.

DUMP='https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz'
curl $DUMP | gzip -d | wikidata-filter --sitelink 'zhwiki&frwiki' --keep id,labels,sitelinks --languages zh,fr --simplified > subset.ndjson