This repository has been archived by the owner. It is now read-only.
wiktts

Mining MediaWiki dumps to create better TTS engines (using Machine Learning)

VERY IMPORTANT NOTICE

This is a work in progress. Code, tools, APIs, etc. are subject to change without further notice. Use at your own risk until v1.0.0 is released (and this notice disappears). Current TODO list:

  • implement comparison with alternatives in lexdiffer
  • map eSpeak phones and/or create eSpeak voice from .symbol files

Abstract

MediaWiki sites (e.g., Wikipedia or Wiktionary) contain a lot of information about the pronunciation of words in several languages:

IPA pronunciation for the word "pronunciation" from the English Wiktionary

This information is described by "pronunciation tags", which contain the phonetic/phonemic transcription of the word, written using the International Phonetic Alphabet (IPA):

===Pronunciation===
* {{IPA|/pɹəˌnʌn.siˈeɪ.ʃən/|lang=en}}, {{enPR|prə-nŭn'-sē-ā′-shən}}
* {{IPA|/pɹəˌnʌn.ʃiˈeɪ.ʃən/|lang=en}}, {{enPR|prə-nŭn'-shē-ā′-shən}}
* {{audio|En-us-pronunciation.ogg|Audio (US)|lang=en}}
* {{rhymes|eɪʃən|lang=en}}
* {{hyphenation|pro|nun|ci|a|tion|lang=en}}

This project provides tools to extract pronunciation information from MediaWiki dump files, to clean the mined IPA strings, and to prepare input files for Machine Learning (ML) tools used in computational linguistics and speech processing.
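To make the idea of "pronunciation tags" concrete, here is a minimal Python 3 sketch that pulls the IPA strings out of the wikitext snippet shown above. It is only an illustration and not the code used by wiktts.mw.miner: the regular expression, the function name extract_ipa, and the rule "skip any argument containing =" are assumptions made for this example.

```python
import re

# Matches {{IPA|...}} templates; the capture group holds the template
# arguments, which are separated by "|".
IPA_TEMPLATE = re.compile(r"\{\{IPA\|([^{}]*)\}\}")


def extract_ipa(wikitext):
    """Return the IPA transcriptions found in {{IPA|...}} templates."""
    results = []
    for match in IPA_TEMPLATE.finditer(wikitext):
        for arg in match.group(1).split("|"):
            arg = arg.strip()
            # Assumption: named arguments such as "lang=en" are not IPA.
            if not arg or "=" in arg:
                continue
            results.append(arg)
    return results


SAMPLE = """===Pronunciation===
* {{IPA|/pɹəˌnʌn.siˈeɪ.ʃən/|lang=en}}, {{enPR|prə-nŭn'-sē-ā′-shən}}
* {{IPA|/pɹəˌnʌn.ʃiˈeɪ.ʃən/|lang=en}}, {{enPR|prə-nŭn'-shē-ā′-shən}}
* {{rhymes|eɪʃən|lang=en}}
"""

for ipa in extract_ipa(SAMPLE):
    print(ipa)
```

Only the two {{IPA|...}} templates contribute; the {{enPR|...}} and {{rhymes|...}} templates are ignored. Real dumps contain many template variants per language, so a single pattern like this one would not be enough in practice.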

Possible applications include:

  • improving existing open source/free software Text-To-Speech (TTS) tools, for example espeak-ng or idlak, by incorporating the mined pronunciation lexica and/or Letter-To-Sound/Grapheme-To-Phoneme (LTS/G2P) models trained from the mined pronunciation lexica;
  • creating TTS voices for "minority" languages;
  • creating MediaWiki bots to add a (tentative) IPA transcription to Wiktionary articles missing it;
  • creating MediaWiki bots to review the existing IPA transcription provided by a human editor;
  • building a crowdsourced CAPTCHA-like service to further refine the transcriptions and hence the derived models;
  • research projects in linguistics, phonology, natural language processing, speech synthesis, and speech processing.

In The Box

This repository contains the following Python 2.7.x/3.5.x tools:

  • wiktts.mw.splitter: split a MediaWiki dump into chunks
  • wiktts.mw.miner: mine IPA pronunciation strings from a MediaWiki dump file
  • wiktts.lexcleaner: clean+normalize a pronunciation lexicon
  • wiktts.trainer: prepare train/test/symbol sets for ML tools (e.g., Phonetisaurus or Sequitur)
  • wiktts.lexdiffer: compare two pronunciation/mapped lexica

This project uses the sister ipapy Python module, available on PyPI and GitHub under the same license (MIT License). The ipapy module is released and maintained in a separate GitHub repository, since it might be used in applications other than wiktts, although the development of ipapy is currently heavily influenced by the needs of wiktts.

Dependencies

  1. Python 2.7.x or 3.5.x
  2. Python module lxml (pip install lxml)
  3. Python module ipapy (pip install ipapy)

Installation

  1. Install the dependencies listed above.

  2. Clone this repo:

    $ git clone https://github.com/pettarin/wiktts.git
  3. Download the dump(s) you want to work on from Wikimedia Downloads:

    $ cd wiktts/dumps
    $ wget "https://dumps.wikimedia.org/enwiktionary/20160407/enwiktionary-20160407-pages-meta-current.xml.bz2"
  4. Install the ML tool(s) you want to work with. Currently, wiktts.trainer outputs in formats readable by:

Usage

wiktts.mw.splitter

$ python -m wiktts.mw.splitter DUMP.XML[.BZ2] [OPTIONS]

Details
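As a rough illustration of what splitting a dump into chunks involves, the following Python 3 sketch streams the page elements of a MediaWiki XML export and groups their titles into fixed-size chunks. It is not the wiktts.mw.splitter implementation: the function name iter_page_chunks, the pages_per_chunk parameter, and the choice to collect only titles are assumptions made to keep the example short.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_page_chunks(stream, pages_per_chunk):
    """Yield lists of page titles, at most pages_per_chunk per list."""
    chunk = []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = ""
        for child in elem:
            if local_name(child.tag) == "title":
                title = child.text or ""
        chunk.append(title)
        elem.clear()  # release the memory held by the processed page
        if len(chunk) == pages_per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # last, possibly shorter, chunk


DEMO = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>a</title></page>
<page><title>b</title></page>
<page><title>c</title></page>
</mediawiki>"""

for chunk in iter_page_chunks(io.BytesIO(DEMO), 2):
    print(chunk)
```

For a compressed dump, the same function could be given the file object returned by bz2.open(path, "rb") from the standard library, so the dump never has to be decompressed on disk.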

wiktts.mw.miner

$ python -m wiktts.mw.miner PARSER DUMP OUTPUTDIR [OPTIONS]

Details

wiktts.lexcleaner

$ python -m wiktts.lexcleaner LEXICON OUTPUTDIR [OPTIONS]

Details
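The kind of cleaning and normalization a mined lexicon needs can be sketched in a few lines of Python 3. This is a simplified stand-in, not what wiktts.lexcleaner does (which relies on ipapy): the delimiter set, the use of NFC normalization, and the names clean_ipa and clean_lexicon are assumptions for this example.

```python
import unicodedata

# Assumed delimiters around transcriptions: slashes (phonemic) and
# square brackets (phonetic).
DELIMITERS = "/[]"


def clean_ipa(raw):
    """Normalize a mined IPA string; return None if nothing is left."""
    text = unicodedata.normalize("NFC", raw).strip()
    text = "".join(ch for ch in text if ch not in DELIMITERS)
    # Collapse runs of whitespace into single spaces.
    text = " ".join(text.split())
    return text or None


def clean_lexicon(pairs):
    """Keep (word, ipa) pairs whose IPA survives cleaning, without duplicates."""
    seen = set()
    cleaned = []
    for word, raw in pairs:
        ipa = clean_ipa(raw)
        if ipa is None or (word, ipa) in seen:
            continue
        seen.add((word, ipa))
        cleaned.append((word, ipa))
    return cleaned


mined = [("cat", "/kæt/"), ("cat", "[kæt]"), ("dog", " // ")]
print(clean_lexicon(mined))
```

Here the two entries for "cat" collapse into one after the delimiters are stripped, and the empty transcription for "dog" is dropped.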

wiktts.trainer

$ python -m wiktts.trainer TOOL LEXICON OUTPUTDIR [OPTIONS]

Details
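Preparing train/test/symbol sets boils down to three small steps, sketched below in Python 3. This is not the wiktts.trainer code: the helper names, the default test_fraction, and in particular the tab-separated line layout are assumptions; check the documentation of the ML tool you use for the exact input format it expects.

```python
import random


def split_lexicon(entries, test_fraction=0.1, seed=0):
    """Shuffle entries reproducibly and split them into (train, test)."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def symbol_set(entries):
    """Return the sorted list of distinct phone symbols in the entries."""
    return sorted({phone for _word, phones in entries for phone in phones})


def format_line(word, phones):
    """Format one entry as 'word<TAB>p1 p2 ...' (assumed layout)."""
    return "%s\t%s" % (word, " ".join(phones))


entries = [("w%d" % i, ["a", "b"] if i % 2 else ["b", "c"]) for i in range(10)]
train, test = split_lexicon(entries, test_fraction=0.2)
print(len(train), len(test))
print(symbol_set(entries))
print(format_line("cat", ["k", "æ", "t"]))
```

Fixing the seed keeps the train/test split reproducible across runs, which matters when comparing models trained with different settings.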

wiktts.lexdiffer

$ python -m wiktts.lexdiffer LEXICON1 LEXICON2 [OPTIONS]

Details
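Comparing two lexica can be pictured as a set comparison on their words plus an equality check on the shared entries. The Python 3 sketch below assumes each lexicon has already been loaded into a dict mapping a word to a single pronunciation string; the function name diff_lexica and the shape of the result are hypothetical and do not describe the output of wiktts.lexdiffer.

```python
def diff_lexica(lex1, lex2):
    """Compare two {word: pronunciation} dicts and group the words."""
    words1, words2 = set(lex1), set(lex2)
    common = words1 & words2
    return {
        "only_in_first": sorted(words1 - words2),
        "only_in_second": sorted(words2 - words1),
        "same": sorted(w for w in common if lex1[w] == lex2[w]),
        "different": sorted(w for w in common if lex1[w] != lex2[w]),
    }


first = {"cat": "kæt", "dog": "dɒɡ", "sun": "sʌn"}
second = {"cat": "kæt", "dog": "dɔɡ", "moon": "muːn"}
report = diff_lexica(first, second)
for key in sorted(report):
    print(key, report[key])
```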

Putting It All Together

Note: you might want to use tmux/screen since some of the following commands will require several minutes/hours to run.

$ # clone the repo
$ git clone https://github.com/pettarin/wiktts.git
$ cd wiktts

$ # download the English Wiktionary dump (minutes)
$ cd dumps
$ wget "https://dumps.wikimedia.org/enwiktionary/20160407/enwiktionary-20160407-pages-meta-current.xml.bz2" -O enwiktionary-20160407.xml.bz2
$ cd ..

$ # extract the IPA strings (minutes)
$ python -m wiktts.mw.miner enwiktionary dumps/enwiktionary-20160407.xml.bz2 /tmp/ 

$ # clean the mined (word, IPA) pairs (minutes)
$ python -m wiktts.lexcleaner /tmp/enwiktionary-20160407.xml.bz2.lex /tmp/

$ # create train/test/symbol files for Sequitur G2P (minutes)
$ python -m wiktts.trainer sequitur /tmp/enwiktionary-20160407.xml.bz2.lex.clean /tmp/

$ # train a G2P model using Sequitur G2P (hours)
$ cd /tmp
$ bash run_sequitur.sh train

$ # self-test the trained G2P model
$ bash run_sequitur.sh test

$ # apply the trained G2P model to the given lexicon
$ bash run_sequitur.sh apply new_words.txt

License

wiktts is released under the MIT License.

Citation And References

If you use wiktts for a research project, please include a citation in your publication/documentation with (at least) the following information:

Alberto Pettarin. wiktts [VERSION_YOU_USE]. https://github.com/pettarin/wiktts (last access: 2016-MM-DD).

For example:

Alberto Pettarin. wiktts v0.1.0. https://github.com/pettarin/wiktts (last access: 2016-06-09).

For a list of resources used to design and implement wiktts, please consult the REFERENCES file.

Acknowledgments
