Skip to content

Latest commit

 

History

History
19 lines (15 loc) · 1.13 KB

parallelize.md

File metadata and controls

19 lines (15 loc) · 1.13 KB

Parallelize

If your hardware happens to have several cores, we can do better:

  • replace gzip by pigz
  • load balance lines over several wikibase-dump-filter processes using load-balance-lines or something that does the same job
# install those new dependencies
sudo apt-get install pigz
npm install --global load-balance-lines

# increase the max RAM available to node processes, to prevent allocation errors
NODE_OPTIONS=--max_old_space_size=4096

wget --continue https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
nice -n+19 pigz -d < latest-all.json.gz | nice -n+19 load-balance-lines wikibase-dump-filter --claim P31:Q5 > humans.ndjson

Using nice to tell the system that those processes, while eager to eat all the CPUs, should have the lowest priority. If you are not familiar with the < operator, it's the equivalent of cat latest-all.json.gz | nice -n+19 pigz -d but in a shell built-in way. (See I/O Redirection doc)