Mining MediaWiki dumps to create better TTS engines (using Machine Learning)
- Version: 0.1.0
- Date: 2016-06-09
- Developer: Alberto Pettarin
- License: the MIT License (MIT)
This is a work in progress. Code, tools, APIs, etc. are subject to change without notice. Use at your own risk until v1.0.0 is released (and this notice disappears). Current TODO list:
- implement comparison with alternatives in lexdiffer
- map eSpeak phones and/or create eSpeak voice from .symbol files
MediaWiki sites (e.g., Wikipedia or Wiktionary) contain a lot of information about the pronunciation of words in several languages.
This information is described by "pronunciation tags", which contain the phonetic/phonemic transcription of the word, written using the International Phonetic Alphabet (IPA):
===Pronunciation===
* {{IPA|/pɹəˌnʌn.siˈeɪ.ʃən/|lang=en}}, {{enPR|prə-nŭn'-sē-ā′-shən}}
* {{IPA|/pɹəˌnʌn.ʃiˈeɪ.ʃən/|lang=en}}, {{enPR|prə-nŭn'-shē-ā′-shən}}
* {{audio|En-us-pronunciation.ogg|Audio (US)|lang=en}}
* {{rhymes|eɪʃən|lang=en}}
* {{hyphenation|pro|nun|ci|a|tion|lang=en}}
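For illustration only, here is a minimal sketch of how an IPA string could be pulled out of such a tag with a regular expression. This is not the actual wiktts.mw.miner parser, which has to handle many more tag variants:

```python
# -*- coding: utf-8 -*-
# Minimal sketch: extract IPA strings from {{IPA|...}} tags in wikitext.
# This is NOT the wiktts parser, just an illustration of the idea.
import re

WIKITEXT = u"* {{IPA|/pɹəˌnʌn.siˈeɪ.ʃən/|lang=en}}, {{enPR|prə-nŭn'-sē-ā′-shən}}"

# match "{{IPA|" followed by anything up to the next "|" or "}}"
IPA_TAG = re.compile(r"\{\{IPA\|([^|}]+)", re.UNICODE)

for match in IPA_TAG.finditer(WIKITEXT):
    print(match.group(1))   # => /pɹəˌnʌn.siˈeɪ.ʃən/
```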
This project provides tools to extract pronunciation information from MediaWiki dump files, to clean the mined IPA strings, and to prepare input files for Machine Learning (ML) tools used in computational linguistics and speech processing.
Possible applications include:
- improving existing open source/free software Text-To-Speech (TTS) tools, for example espeak-ng or idlak, by incorporating the mined pronunciation lexica and/or Letter-To-Sound/Grapheme-To-Phoneme (LTS/G2P) models trained from the mined pronunciation lexica (see the sketch after this list);
- creating TTS voices for "minority" languages;
- creating MediaWiki bots to add a (tentative) IPA transcription to Wiktionary articles missing it;
- creating MediaWiki bots to review the existing IPA transcription provided by a human editor;
- building a crowdsourced CAPTCHA-like service to further refine the transcriptions and hence the derived models;
- research projects in linguistics, phonology, natural language processing, speech synthesis, and speech processing.
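As a concrete example of the first application above, a TTS front-end typically consults a pronunciation lexicon first and falls back to a trained LTS/G2P model for out-of-vocabulary words. The following toy sketch illustrates the idea; the g2p_model object and its transduce() method are hypothetical placeholders:

```python
# -*- coding: utf-8 -*-
# Toy sketch of how a TTS front-end can combine a mined pronunciation
# lexicon (exact lookups) with a trained LTS/G2P model (fallback for
# out-of-vocabulary words). The g2p_model object and its transduce()
# method are hypothetical placeholders, not part of wiktts.

LEXICON = {
    u"pronunciation": u"pɹəˌnʌn.siˈeɪ.ʃən",
}

def phonemize(word, g2p_model=None):
    # 1. exact match against the mined lexicon
    if word in LEXICON:
        return LEXICON[word]
    # 2. fall back to the trained G2P model for unseen words
    if g2p_model is not None:
        return g2p_model.transduce(word)
    raise KeyError(u"Word not in lexicon and no G2P model given: %s" % word)

print(phonemize(u"pronunciation"))  # => pɹəˌnʌn.siˈeɪ.ʃən
```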
This repository contains the following Python 2.7.x/3.5.x tools:
- wiktts.mw.splitter: split a MediaWiki dump into chunks
- wiktts.mw.miner: mine IPA pronunciation strings from a MediaWiki dump file
- wiktts.lexcleaner: clean+normalize a pronunciation lexicon
- wiktts.trainer: prepare train/test/symbol sets for ML tools (e.g., Phonetisaurus or Sequitur)
- wiktts.lexdiffer: compare two pronunciation/mapped lexica
This project uses the sister ipapy Python module, available on PyPI and GitHub under the same license (MIT). The ipapy module is released and maintained in a separate GitHub repository, since it might be used in applications other than wiktts, although the development of ipapy is currently heavily influenced by the needs of wiktts.
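For example, ipapy can validate a mined string and parse it into a sequence of IPA characters. The following is a minimal sketch based on the ipapy documentation; check the ipapy README for the current API:

```python
# -*- coding: utf-8 -*-
# Minimal sketch of validating a mined IPA string with ipapy.
# Based on the ipapy documentation; check the ipapy README for the
# current API before relying on it.
from ipapy import is_valid_ipa
from ipapy.ipastring import IPAString

mined = u"pɹəˌnʌnsiˈeɪʃən"

if is_valid_ipa(mined):
    # parse the string into a sequence of IPA characters
    s_ipa = IPAString(unicode_string=mined)
    print(u"%d IPA characters" % len(s_ipa))
else:
    print(u"Invalid IPA string: %s" % mined)
```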
- Python 2.7.x or 3.5.x
- Python module lxml (pip install lxml)
- Python module ipapy (pip install ipapy)
- Install the dependencies listed above.
- Clone this repo:
  $ git clone https://github.com/pettarin/wiktts.git
- Download the dump(s) you want to work on from Wikimedia Downloads:
  $ cd wiktts/dumps
  $ wget "https://dumps.wikimedia.org/enwiktionary/20160407/enwiktionary-20160407-pages-meta-current.xml.bz2"
- Install the ML tool(s) you want to work with. Currently, wiktts.trainer outputs files in formats readable by Phonetisaurus and Sequitur.

Each tool is run as a Python module:
$ python -m wiktts.mw.splitter DUMP.XML[.BZ2] [OPTIONS]
$ python -m wiktts.mw.miner PARSER DUMP OUTPUTDIR [OPTIONS]
$ python -m wiktts.lexcleaner LEXICON OUTPUTDIR [OPTIONS]
$ python -m wiktts.trainer TOOL LEXICON OUTPUTDIR [OPTIONS]
$ python -m wiktts.lexdiffer LEXICON1 LEXICON2 [OPTIONS]
Note: you might want to use tmux/screen since some of the following commands will require several minutes/hours to run.
$ # clone the repo
$ git clone https://github.com/pettarin/wiktts.git
$ cd wiktts
$ # download the English Wiktionary dump (minutes)
$ cd dumps
$ wget "https://dumps.wikimedia.org/enwiktionary/20160407/enwiktionary-20160407-pages-meta-current.xml.bz2" -O enwiktionary-20160407.xml.bz2
$ cd ..
$ # extract the IPA strings (minutes)
$ python -m wiktts.mw.miner enwiktionary dumps/enwiktionary-20160407.xml.bz2 /tmp/
$ # clean the mined (word, IPA) pairs (minutes)
$ python -m wiktts.lexcleaner /tmp/enwiktionary-20160407.xml.bz2.lex /tmp/
$ # create train/test/symbol files for Sequitur G2P (minutes)
$ python -m wiktts.trainer sequitur /tmp/enwiktionary-20160407.xml.bz2.lex.clean /tmp/
$ # train a G2P model using Sequitur G2P (hours)
$ cd /tmp
$ bash run_sequitur.sh train
$ # self-test the trained G2P model
$ bash run_sequitur.sh test
$ # apply the trained G2P model to a list of new words
$ bash run_sequitur.sh apply new_words.txt
wiktts is released under the MIT License.
If you use wiktts for a research project, please include a citation in your publication/documentation with (at least) the following information:
Alberto Pettarin. wiktts [VERSION_YOU_USE]. https://github.com/pettarin/wiktts (last access: 2016-MM-DD).
For example:
Alberto Pettarin. wiktts v0.1.0. https://github.com/pettarin/wiktts (last access: 2016-06-09).
For a list of resources used to design and implement wiktts, please consult the REFERENCES file.
- Many thanks to Dr. Tony Robinson for many useful discussions about this project.