Mining MediaWiki dumps to create better TTS engines (using Machine Learning)
- Version: 0.1.0
- Date: 2016-06-09
- Developer: Alberto Pettarin
- License: the MIT License (MIT)
This is a work in progress. Code, tools, APIs, etc. are subject to change without notice. Use at your own risk until v1.0.0 is released (and this notice disappears). Current TODO list:
- implement comparison with alternatives in lexdiffer
- map eSpeak phones and/or create eSpeak voice from .symbol files
MediaWiki sites (e.g., Wikipedia or Wiktionary) contain a lot of information about the pronunciation of words in several languages.
This information is described by "pronunciation tags", which contain the phonetic/phonemic transcription of the word, written using the International Phonetic Alphabet (IPA):
===Pronunciation===
* {{IPA|/pɹəˌnʌn.siˈeɪ.ʃən/|lang=en}}, {{enPR|prə-nŭn'-sē-ā′-shən}}
* {{IPA|/pɹəˌnʌn.ʃiˈeɪ.ʃən/|lang=en}}, {{enPR|prə-nŭn'-shē-ā′-shən}}
* {{audio|En-us-pronunciation.ogg|Audio (US)|lang=en}}
* {{rhymes|eɪʃən|lang=en}}
* {{hyphenation|pro|nun|ci|a|tion|lang=en}}
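For illustration only, here is a minimal sketch of how an IPA string could be pulled out of such a tag with a regular expression. This is not the actual wiktts.mw.miner parser, which has to handle many more tag variants:

```python
# -*- coding: utf-8 -*-
# Minimal sketch: extract IPA strings from {{IPA|...}} tags in wikitext.
# This is NOT the wiktts parser, just an illustration of the idea.
import re

WIKITEXT = u"* {{IPA|/pɹəˌnʌn.siˈeɪ.ʃən/|lang=en}}, {{enPR|prə-nŭn'-sē-ā′-shən}}"

# match "{{IPA|" followed by anything up to the next "|" or "}}"
IPA_TAG = re.compile(r"\{\{IPA\|([^|}]+)", re.UNICODE)

for match in IPA_TAG.finditer(WIKITEXT):
    print(match.group(1))   # => /pɹəˌnʌn.siˈeɪ.ʃən/
```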
This project provides tools to extract pronunciation information from MediaWiki dump files, to clean the mined IPA strings, and to prepare input files for Machine Learning (ML) tools used in computational linguistics and speech processing.
Possible applications include:
- improving existing open source/free software Text-To-Speech (TTS) tools, for example espeak-ng or idlak, by incorporating the mined pronunciation lexica and/or Letter-To-Sound/Grapheme-To-Phoneme (LTS/G2P) models trained from the mined pronunciation lexica (see the sketch after this list);
- creating TTS voices for "minority" languages;
- creating MediaWiki bots to add a (tentative) IPA transcription to Wiktionary articles missing it;
- creating MediaWiki bots to review the existing IPA transcription provided by a human editor;
- building a crowdsourced CAPTCHA-like service to further refine the transcriptions and hence the derived models;
- research projects in linguistics, phonology, natural language processing, speech synthesis, and speech processing.
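As a concrete example of the first application above, a TTS front-end typically consults a pronunciation lexicon first and falls back to a trained LTS/G2P model for out-of-vocabulary words. The following toy sketch illustrates the idea; the g2p_model object and its transduce() method are hypothetical placeholders:

```python
# -*- coding: utf-8 -*-
# Toy sketch of how a TTS front-end can combine a mined pronunciation
# lexicon (exact lookups) with a trained LTS/G2P model (fallback for
# out-of-vocabulary words). The g2p_model object and its transduce()
# method are hypothetical placeholders, not part of wiktts.

LEXICON = {
    u"pronunciation": u"pɹəˌnʌn.siˈeɪ.ʃən",
}

def phonemize(word, g2p_model=None):
    # 1. exact match against the mined lexicon
    if word in LEXICON:
        return LEXICON[word]
    # 2. fall back to the trained G2P model for unseen words
    if g2p_model is not None:
        return g2p_model.transduce(word)
    raise KeyError(u"Word not in lexicon and no G2P model given: %s" % word)

print(phonemize(u"pronunciation"))  # => pɹəˌnʌn.siˈeɪ.ʃən
```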
This repository contains the following Python 2.7.x/3.5.x tools:
- wiktts.mw.splitter: split a MediaWiki dump into chunks
- wiktts.mw.miner: mine IPA pronunciation strings from a MediaWiki dump file
- wiktts.lexcleaner: clean+normalize a pronunciation lexicon
- wiktts.trainer: prepare train/test/symbol sets for ML tools (e.g., Phonetisaurus or Sequitur)
- wiktts.lexdiffer: compare two pronunciation/mapped lexica
This project uses the sister ipapy Python module, available on PyPI and GitHub under the same license (MIT). The ipapy module is released and maintained in a separate GitHub repository, since it might be used in applications other than wiktts, although the development of ipapy is currently heavily influenced by the needs of wiktts.
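For example, ipapy can validate a mined string and parse it into a sequence of IPA characters. The following is a minimal sketch based on the ipapy documentation; check the ipapy README for the current API:

```python
# -*- coding: utf-8 -*-
# Minimal sketch of validating a mined IPA string with ipapy.
# Based on the ipapy documentation; check the ipapy README for the
# current API before relying on it.
from ipapy import is_valid_ipa
from ipapy.ipastring import IPAString

mined = u"pɹəˌnʌnsiˈeɪʃən"

if is_valid_ipa(mined):
    # parse the string into a sequence of IPA characters
    s_ipa = IPAString(unicode_string=mined)
    print(u"%d IPA characters" % len(s_ipa))
else:
    print(u"Invalid IPA string: %s" % mined)
```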
- Python 2.7.x or 3.5.x
- Python module lxml (pip install lxml)
- Python module ipapy (pip install ipapy)
- Install the dependencies listed above.
- Clone this repo:
  $ git clone https://github.com/pettarin/wiktts.git
- Download the dump(s) you want to work on from Wikimedia Downloads:
  $ cd wiktts/dumps
  $ wget "https://dumps.wikimedia.org/enwiktionary/20160407/enwiktionary-20160407-pages-meta-current.xml.bz2"
- Install the ML tool(s) you want to work with. Currently, wiktts.trainer outputs files in formats readable by Phonetisaurus and Sequitur.

Each tool is run as a Python module:
$ python -m wiktts.mw.splitter DUMP.XML[.BZ2] [OPTIONS]
$ python -m wiktts.mw.miner PARSER DUMP OUTPUTDIR [OPTIONS]
$ python -m wiktts.lexcleaner LEXICON OUTPUTDIR [OPTIONS]
$ python -m wiktts.trainer TOOL LEXICON OUTPUTDIR [OPTIONS]
$ python -m wiktts.lexdiffer LEXICON1 LEXICON2 [OPTIONS]
Note: you might want to use tmux/screen since some of the following commands will require several minutes/hours to run.
$ # clone the repo
$ git clone https://github.com/pettarin/wiktts.git
$ cd wiktts
$ # download the English Wiktionary dump (minutes)
$ cd dumps
$ wget "https://dumps.wikimedia.org/enwiktionary/20160407/enwiktionary-20160407-pages-meta-current.xml.bz2" -O enwiktionary-20160407.xml.bz2
$ cd ..
$ # extract the IPA strings (minutes)
$ python -m wiktts.mw.miner enwiktionary dumps/enwiktionary-20160407.xml.bz2 /tmp/
$ # clean the mined (word, IPA) pairs (minutes)
$ python -m wiktts.lexcleaner /tmp/enwiktionary-20160407.xml.bz2.lex /tmp/
$ # create train/test/symbol files for Sequitur G2P (minutes)
$ python -m wiktts.trainer sequitur /tmp/enwiktionary-20160407.xml.bz2.lex.clean /tmp/
$ # train a G2P model using Sequitur G2P (hours)
$ cd /tmp
$ bash run_sequitur.sh train
$ # self-test the trained G2P model
$ bash run_sequitur.sh test
$ # apply the trained G2P model to a list of new words
$ bash run_sequitur.sh apply new_words.txt
wiktts is released under the MIT License.
If you use wiktts for a research project, please include a citation in your publication/documentation with (at least) the following information:
Alberto Pettarin. wiktts [VERSION_YOU_USE]. https://github.com/pettarin/wiktts (last access: 2016-MM-DD).
For example:
Alberto Pettarin. wiktts v0.1.0. https://github.com/pettarin/wiktts (last access: 2016-06-09).
For a list of resources used to design and implement wiktts, please consult the REFERENCES file.
- Many thanks to Dr. Tony Robinson for many useful discussions about this project.