Wikipedia Languages Pipeline

Wiki Languages Pipeline

Wikipedia Languages Pipeline is a multistep pipeline script to:

  • Collect the list of Wikipedia languages, based on stats from http://meta.wikimedia.org/wiki/List_of_Wikipedias#All_Wikipedias_ordered_by_number_of_articles

  • Collect Wikipedia article usage per language

  • Create a vocabulary for every language

  • Download the top articles for every language

  • Split every article into sentences and then into words

  • Automatically train a sentence splitter/tokenizer for every language (based on the top articles)

  • Build foreign language to English dictionaries (based on single-word Wikipedia titles with language links)
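
To make the splitting, vocabulary and dictionary steps above more concrete, here is a minimal sketch in plain Python. It is not the project's code: the regex-based splitter is a naive stand-in for the NLTK tokenizers the pipeline trains, and the function names, the `(title, language links)` input shape and the rule that only the foreign title must be a single word are all assumptions made for illustration.

```python
import re
from collections import Counter

# Naive stand-ins for the trained sentence splitter/tokenizer (assumption).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_WORD = re.compile(r"\w+")


def split_sentences(text):
    """Split text at whitespace that follows '.', '!' or '?'."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def split_words(sentence):
    """Return the lowercased word tokens of one sentence."""
    return [w.lower() for w in _WORD.findall(sentence)]


def build_vocabulary(articles):
    """Count word frequencies over an iterable of article texts."""
    vocabulary = Counter()
    for text in articles:
        for sentence in split_sentences(text):
            vocabulary.update(split_words(sentence))
    return vocabulary


def build_dictionary(titles_with_links, target="en"):
    """Map single-word titles to their title in the target language.

    titles_with_links is a hypothetical iterable of
    (title, {language_code: linked_title}) pairs.
    """
    dictionary = {}
    for title, links in titles_with_links:
        translation = links.get(target)
        # Keep only single-word titles that have a link to the target language.
        if translation and len(title.split()) == 1:
            dictionary[title] = translation
    return dictionary
```

For example, `build_vocabulary(["The cat sat. The dog sat!"])` counts "the" and "sat" twice each, and `build_dictionary([("Hund", {"en": "Dog"})])` yields a one-entry German to English mapping.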

Requirements

Wikipydia library

wpTextExtractor library

NLTK library

Running pipeline

Run the pipeline to generate vocabularies for all languages.

Get help on command line options

python main.py --help

Run full production pipeline

python main.py --settings settings --debug INFO --tokenizer TRAIN

--tokenizer parameter explained:

  • TRAIN - train new tokenizers and save them

  • SKIP - do not train tokenizers; use the existing ones (assumes they exist for all languages)

--pipeline parameter explained:

  • PROCESS - generate the vocabulary/dictionary

  • SKIP - skip generating the vocabulary/dictionary

--override parameter explained:

  • YES - override existing data

  • NO - skip if the vocabulary/dictionary already exists
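
As an illustration of how the options described above fit together, the following sketch declares them with the standard library's argparse. It is an assumption about the interface, not a copy of main.py: the default values and help texts are hypothetical, and the real script may parse its arguments differently. Run `python main.py --help` for the authoritative list.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(
        description="Wikipedia Languages Pipeline (illustrative sketch)")
    # Defaults below are assumptions; the README does not state them.
    parser.add_argument("--settings", default="settings",
                        help="name of the settings module")
    parser.add_argument("--debug", default="INFO",
                        help="logging level, e.g. INFO")
    parser.add_argument("--tokenizer", choices=["TRAIN", "SKIP"],
                        default="TRAIN",
                        help="train new tokenizers or reuse existing ones")
    parser.add_argument("--pipeline", choices=["PROCESS", "SKIP"],
                        default="PROCESS",
                        help="generate or skip the vocabulary/dictionary")
    parser.add_argument("--override", choices=["YES", "NO"],
                        default="NO",
                        help="override existing data or skip it")
    return parser


# Equivalent to the "full production pipeline" command line shown above.
args = build_parser().parse_args(
    ["--settings", "settings", "--debug", "INFO", "--tokenizer", "TRAIN"])
```

Restricting each option with `choices` makes argparse reject a misspelled value such as `--tokenizer train` before any pipeline step starts.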
