
CLI


Filter entities

⚠️ Filtering entities is done by parsing each entity's JSON object and applying the filtering logic to it. While this allows building very specific filters, it will always be slower than filtering on text patterns (typically with grep): considering your options for prefiltering the dump can save an enormous amount of time.

  • from a local file
cat entities.json | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
cat entities.json | wikibase-dump-filter --claim P18 > entities_with_an_image.ndjson
cat entities.json | wikibase-dump-filter --claim P31:Q5,Q6256 > humans_and_countries.ndjson

This command filters the dump into a subset in which every line is the JSON of an entity having Q5 among its P31 claim values.

(where ndjson stands for newline-delimited JSON)
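The prefiltering advice above can be sketched as follows. This is a hypothetical illustration: the sample file and entity values are invented for the demo, and the right grep pattern depends on what you are filtering for.

```shell
# Build a tiny two-entity sample dump (invented values, illustration only)
printf '%s\n' \
  '{"id":"Q42","type":"item","claims":{"P31":[{"mainsnak":{"datavalue":{"value":{"id":"Q5"}}}}]}}' \
  '{"id":"Q64","type":"item","claims":{"P31":[{"mainsnak":{"datavalue":{"value":{"id":"Q515"}}}}]}}' \
  > entities_sample.json

# Cheap text prefilter: no JSON parsing, thus much faster than --claim,
# but the pattern 'Q5' also matches Q515, a false positive
grep -c 'Q5' entities_sample.json
# -> 2

# So the exact (slower) filter is still applied, but only to the
# prefiltered lines:
#   grep 'Q5' entities_sample.json | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
```

The prefilter only needs to be an over-approximation: it must keep every line the exact filter would keep, and may keep more.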

  • directly from a Wikidata dump
curl https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz | gzip -d | wikibase-dump-filter --claim P31:Q5 > humans.ndjson

This can be quite convenient when you don't have enough space to keep the whole decompressed dump on your disk: here you only write the desired subset.

Of course, this probably only makes sense if the kind of entities you are looking for numbers roughly above 2 000 000: below that level, it would probably be faster and more efficient to get the list of ids from your Wikibase Query Service (see Wikidata Query Service), then get the entities' data from the API (which can easily be done using wikibase-cli's wb data command).

Claims logical operators

and

# operator: &
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&P50' > books_with_an_author.ndjson

or

# operator: |
cat entities.json | wikibase-dump-filter --claim 'P31:Q146|P31:Q144' > cats_and_dogs.ndjson
# which is equivalent to
cat entities.json | wikibase-dump-filter --claim 'P31:Q146,Q144' > cats_and_dogs.ndjson

# the 'or' operator has priority on the 'and' operator:
# this claim filter is equivalent to (P31:Q571 && (P50 || P110))
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&P50|P110' > books_with_an_author_or_an_illustrator.ndjson

not

# operator: ~
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&~P50' > books_without_author.ndjson

Long claim option

If your claim is too long and triggers an "Argument list too long" error, you can pass a file instead:

echo 'P31:Q5,Q6256' > ./claim
cat entities.json | wikibase-dump-filter --claim ./claim > humans_and_countries.ndjson

Keep only entities with a certain sitelink

# entities with a page on Wikimedia Commons
cat entities.json | wikibase-dump-filter --sitelink commonswiki > subset.ndjson
# entities with a Dutch Wikipedia article
cat entities.json | wikibase-dump-filter --sitelink nlwiki > subset.ndjson
# entities with a Russian Wikipedia article or a Russian Wikiquote article
cat entities.json | wikibase-dump-filter --sitelink 'ruwiki|ruwikiquote' > subset.ndjson

You can even do finer filters by combining conditions with & (AND) / | (OR).

# entities with Chinese and French Wikipedia articles
cat entities.json | wikibase-dump-filter --sitelink 'zhwiki&frwiki' > subset.ndjson
# entities with Chinese and French Wikipedia articles, or Chinese and Spanish articles
cat entities.json | wikibase-dump-filter --sitelink 'zhwiki&frwiki|eswiki' > subset.ndjson

NB: A&B|C is interpreted as A AND (B OR C)

By type

Default: item

cat entities.json | wikibase-dump-filter --type item
cat entities.json | wikibase-dump-filter --type property
cat entities.json | wikibase-dump-filter --type both

By something else

Need another kind of filter? Just ask for it in the issues, or make a pull request!

Format entities

Filter attributes

Wikidata entities have the following attributes: id, type, labels, descriptions, aliases, claims, sitelinks. All in all, this takes a lot of space and might not be needed for your use case: for instance, if your goal is to do full-text search on a subset of Wikidata, you just need to keep the labels, aliases, and descriptions, and you can omit the claims and sitelinks, which take up most of the space.

This can be done with either the --keep or the --omit option:

cat entities.json | wikibase-dump-filter --omit claims,sitelinks > humans.ndjson
# which is equivalent to
cat entities.json | wikibase-dump-filter --keep id,type,labels,descriptions,aliases > humans.ndjson

Filter languages

Keep only the desired languages for labels, descriptions, aliases, and sitelinks.

cat entities.json | wikibase-dump-filter --languages en,fr,de,zh,eo > subset.ndjson

Simplify entity data

Uses wikidata-sdk simplify.entity function to parse the labels, descriptions, aliases, claims, and sitelinks.

# Default simplify options
cat entities.json | wikibase-dump-filter --simplify > simplified_dump.ndjson
# Custom options, see wdk.simplify.entity documentation https://github.com/maxlath/wikidata-sdk/blob/master/docs/simplify_entities_data.md
# and specifically for claims options, see https://github.com/maxlath/wikidata-sdk/blob/master/docs/simplify_claims.md#options
cat entities.json | wikibase-dump-filter --simplify '{"keepRichValues":"true","keepQualifiers":"true","keepReferences":"true"}' > simplified_dump.ndjson
# The options can also be passed in a lighter, urlencoded-like, key=value format
# that's simpler than typing all those JSON double quotes
cat entities.json | wikibase-dump-filter --simplify 'keepRichValues=true&keepQualifiers=true&keepReferences=true' > simplified_dump.ndjson

All the options are detailed in the wbk.simplify.entity documentation.

Other options

-h, --help                     output usage information
-p, --progress                 enable the progress bar
-q, --quiet                    disable the progress bar
-V, --version                  output the version number

Tips