Wikipedia Languages Pipeline is a multistep pipeline script to:

  • Collect the list of Wikipedia languages based on stats from -

  • Collect Wikipedia article usage per language

  • Create a vocabulary for every language

  • Download the top articles for every language

  • Split every article into sentences and then into words

  • Automatically train a sentence splitter/tokenizer for every language (based on the top articles)

  • Build foreign-language-to-English dictionaries (based on single-word Wikipedia titles with language links)
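
The splitting, vocabulary, and dictionary steps listed above can be sketched roughly as follows. This is a minimal illustration and not the pipeline's own code: the regex-based splitter is a naive stand-in for the per-language tokenizers the pipeline trains, and all function names (`split_sentences`, `split_words`, `build_vocabulary`, `build_dictionary`) and the sample data are hypothetical.

```python
import re
from collections import Counter

# Naive stand-in for a trained sentence splitter (assumption):
# break after ., ! or ? when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A word is a run of Unicode letters/digits (underscore excluded).
WORD = re.compile(r"[^\W_]+")


def split_sentences(text):
    """Split raw article text into sentences (naive rule, not trained)."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def split_words(sentence):
    """Split one sentence into lowercase word tokens."""
    return [w.lower() for w in WORD.findall(sentence)]


def build_vocabulary(articles):
    """Count word frequencies over an iterable of article texts."""
    vocab = Counter()
    for text in articles:
        for sentence in split_sentences(text):
            vocab.update(split_words(sentence))
    return vocab


def build_dictionary(title_links):
    """Keep (foreign title, English title) pairs where both are single words."""
    return {
        foreign.lower(): english.lower()
        for foreign, english in title_links
        if len(foreign.split()) == 1 and len(english.split()) == 1
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    articles = ["Der Hund schläft. Der Hund bellt!"]
    print(build_vocabulary(articles).most_common(2))
    links = [("Hund", "Dog"), ("Vereinigte Staaten", "United States")]
    print(build_dictionary(links))
```

A fixed punctuation rule like this breaks on abbreviations and on languages that do not use these sentence marks, which is presumably why the pipeline trains a splitter per language from the top articles instead.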


Wikipydia library

wpTextExtractor library

NLTK library

Running pipeline

Run pipeline to generate vocabularies for all languages

Get help on command line options

python --help

Run full production pipeline

python --settings settings --debug INFO --tokenizer TRAIN

--tokenizer parameter explained:

  • TRAIN - train new tokenizers and save them

  • SKIP - do not train tokenizers; use the existing ones (assuming they exist for all languages)

--pipeline parameter explained:

  • PROCESS - proceed with generating the vocabulary/dictionary

  • SKIP - skip generating the vocabulary/dictionary

--override parameter explained:

  • YES - overwrite existing data

  • NO - skip if the vocabulary/dictionary already exists
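
To make the interaction of these options concrete, here is a hedged sketch of how such flags could be parsed with Python's standard `argparse` module. It is not the script's actual parser: the default values, the `build_parser` and `should_generate` names, and the exact way `--pipeline` and `--override` combine are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options described above."""
    parser = argparse.ArgumentParser(description="Wikipedia Languages Pipeline")
    # Defaults below are assumptions, not taken from the real script.
    parser.add_argument("--settings", default="settings",
                        help="name of the settings module")
    parser.add_argument("--debug", default="INFO",
                        help="logging level, e.g. INFO")
    parser.add_argument("--tokenizer", choices=["TRAIN", "SKIP"], default="TRAIN",
                        help="TRAIN: train and save tokenizers; SKIP: reuse existing ones")
    parser.add_argument("--pipeline", choices=["PROCESS", "SKIP"], default="PROCESS",
                        help="PROCESS: generate vocabulary/dictionary; SKIP: do not")
    parser.add_argument("--override", choices=["YES", "NO"], default="NO",
                        help="YES: overwrite existing data; NO: keep existing data")
    return parser


def should_generate(args, already_exists):
    """Decide whether to (re)generate the data for one language."""
    if args.pipeline == "SKIP":
        return False  # generation is switched off entirely
    # Generate when nothing exists yet, or when overwriting is requested.
    return args.override == "YES" or not already_exists


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.tokenizer, args.pipeline, args.override)
```

Under these assumptions, `--pipeline SKIP` wins over `--override YES`, and `--override NO` only skips languages whose vocabulary/dictionary is already on disk.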