eSpeak NG is an open source speech synthesizer that supports more than hundred languages and accents.

License

GPL-3.0 and 4 other licenses found

Licenses found

GPL-3.0
COPYING
Apache-2.0
COPYING.APACHE
BSD-2-Clause
COPYING.BSD2
Unknown
COPYING.IEEE
Unknown
COPYING.UCD
mbarnig/espeak-ng-lb

 
 

Introduction

This fork adds Luxembourgish as the 127th language to the eSpeak-NG text-to-speech (TTS) synthesizer (version 1.50.1).

The main purpose of the project is to create a rule-based International Phonetic Alphabet (IPA) phonemizer for transcriptions of Luxembourgish audio utterances.

The phonemized text makes it possible to create and use high-quality synthetic Luxembourgish voices, trained with deep machine learning (ML) speech models based on neural networks.

The speech generated by the associated eSpeak-NG sound engine relies on formant synthesis and is of low quality. The related Luxembourgish voice Luxi is intelligible, but I did not optimize the sound because my focus is on the phonemization front end.

Luxembourgish customization

Four files are needed to include Luxembourgish as an additional language in the eSpeak-NG project: ph_letzebuergesch, lb_rules, lb_list and lb_emoji.

The customization process can be split into two steps:

  1. convert the characters (graphemes) of the text into phonetic descriptions (phoneme-IDs)
  2. convert the phoneme-IDs into IPA phonemes and define instructions for generating the sound of each phoneme.

eSpeak-NG is an evolution of the famous speak program for ACORN/RISC_OS computers, developed in 1995 by Jonathan Duddington. In 2007 the program was renamed eSpeak. After the original developer disappeared in early 2015, a new project, eSpeak-NG (new generation), was started in December 2015 under the direction of Reece H. Dunn.

When speak was created, the phoneme-IDs were based on the ASCII text encoding available at the time. This was never changed in the eSpeak-NG fork, to avoid too much divergence from the original code. A few years ago, support for IPA names in the output was added, but this workaround is still laborious compared to a state-of-the-art solution in which IPA names are used as phoneme-IDs. The current problems with the eSpeak-NG architecture are stated in this roadmap document.

Let's start by describing the ph_letzebuergesch file, which is the heart of the whole process.

Phoneme inventory

We distinguish two types of phonemes: vowels (monophthongs and diphthongs) and consonants.

I use the word Auto, with its two vowels au and o and its one consonant t, to explain how the related phonemes are defined in the ph_letzebuergesch file. The code consists of simple text strings. The minimal code for the word Auto has the following format:

phoneme aU
  vwl starttype #a endtype #u
  ipa ɑʊ
  length 190
  FMT(vdiph/au)
endphoneme

phoneme t
  vls alv stp
  ipa t
  WAV(ustop/t, 90)
endphoneme

phoneme o:
  vwl starttype #o endtype #o
  ipa oː
  length 170
  FMT(vowel/o)
endphoneme

The minimal phoneme definition includes five parts:

  • the term phoneme followed by the phoneme-ID
  • the description of the articulation
  • the IPA symbol
  • the parameters for the sound generation (length, FMT, WAV, etc)
  • the closing term endphoneme
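As a rough illustration of this five-part structure, the Python sketch below parses the three minimal definitions for Auto and uses the extracted ipa fields to turn a phoneme-ID sequence into IPA. The helper names (parse_phonemes, to_ipa) are hypothetical and not part of eSpeak-NG; the real phoneme compiler understands many more keywords than this sketch does.

```python
# Hypothetical helpers, not part of eSpeak-NG: parse minimal phoneme blocks.
PH_SOURCE = """
phoneme aU
  vwl starttype #a endtype #u
  ipa ɑʊ
  length 190
  FMT(vdiph/au)
endphoneme

phoneme t
  vls alv stp
  ipa t
  WAV(ustop/t, 90)
endphoneme

phoneme o:
  vwl starttype #o endtype #o
  ipa oː
  length 170
  FMT(vowel/o)
endphoneme
"""


def parse_phonemes(text):
    """Return {phoneme_id: {"ipa": str or None, "lines": [body lines]}}."""
    table, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("phoneme "):
            # opening term: "phoneme" followed by the phoneme-ID
            current = line.split(None, 1)[1]
            table[current] = {"ipa": None, "lines": []}
        elif line == "endphoneme":
            # closing term
            current = None
        elif current is not None:
            table[current]["lines"].append(line)
            if line.startswith("ipa "):
                table[current]["ipa"] = line.split(None, 1)[1]
    return table


def to_ipa(phoneme_ids, table):
    """Concatenate the IPA symbols for a sequence of phoneme-IDs."""
    return "".join(table[p]["ipa"] for p in phoneme_ids)


table = parse_phonemes(PH_SOURCE)
print(to_ipa(["aU", "t", "o:"], table))
```

The sketch only reads the ipa line; the articulation line and the sound parameters are kept verbatim in "lines" so that they could be inspected separately.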

To specify the sound generation for a phoneme, eSpeak-NG provides a whole set of specific parameters and conditions for fine-tuning. A typical, more complex example is shown below:

phoneme t
  vls alv stp
  voicingswitch d
  lengthmod 2
  Vowelin f1=0  f2=1700 -300 300  f3=-100 80
  Vowelout f1=0 f2=1700 -300 250  f3=-100 80  rms=20

  IF nextPh(isPause2) THEN
    WAV(ustop/t_)
  ELIF nextPh(r) THEN
    WAV(ustop/t_)
  ELIF nextPh(@-) THEN
    WAV(ustop/t_dnt, 50)
  ENDIF
  WAV(ustop/t, 90)
endphoneme

I mentioned in the introduction that my goal is not perfect sound design. Therefore I specified only the strict minimum of parameters for each phoneme. Interested users can easily modify the code to enhance the speech quality. The official documentation provides the necessary guidance.

It is not necessary to specify all the phonemes for a language; they can be inherited from a master phoneme file or from another language. Most of the 126 languages supported by eSpeak-NG use this mechanism. Some languages even rely only on inherited phoneme definitions.

To keep full control of the phonemization process and to remain independent of code changes in other languages, I nevertheless prefer to define all the phonemes used for the Luxembourgish language.

Speech models based on deep ML are usually trained with about a hundred hours of audio recordings and related transcriptions. To train a deep neural network with a small dataset, which is the case for Luxembourgish, it is necessary to use a small number of distinct phonemes to obtain valid results. Based on my earlier experience with ML-TTS models I selected the following sets of phonemes (56 in total) for this first release of my project:

Monophthongs (20):
IPA phoneme symbols:    ɑ, aː, ɛː, e, æ, eː, ə,  ɐ,  i, iː, o, oː, u, uː, y, yː, ɑ̃ː, ɛ̃ː, õː, œː
my eSpeak phoneme-IDs:  a, a:, aE, e, E, e:, @E, eR, i, i:, o, o:, u, u:, y, y:, eA, iA, oA, OU

Diphthongs (9):
IPA phoneme symbols:    æːɪ, ɑʊ, æːʊ, ɑɪ, ɜɪ, oɪ, iə, əʊ, uə
my eSpeak phoneme-IDs:  aI, aU,  AU, eI, OI, eU, iE, oU, uE

Consonants (27):
IPA phoneme symbols:    b, c, d, f, g, h, j, k, l, m, n, ŋ, p, ʀ, s, t, v, w, z, ʒ, ʁ, ʑ,  ʦ,  ʃ, dʒ, X, ɕ
my eSpeak phoneme-IDs:  b, c, d, f, g, h, j, k, l, m, n, N, p, r, s, t, v, w, z, J, rR, Z, TS, S, dZ, x, X
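The totals can be checked with a few lines of Python. This is only a sanity check on the phoneme-ID lists as printed above; the variable names are chosen for this sketch and do not appear in the project files.

```python
# Phoneme-IDs copied from the three lists above (names are illustrative).
MONOPHTHONGS = "a a: aE e E e: @E eR i i: o o: u u: y y: eA iA oA OU".split()
DIPHTHONGS = "aI aU AU eI OI eU iE oU uE".split()
CONSONANTS = "b c d f g h j k l m n N p r s t v w z J rR Z TS S dZ x X".split()

inventory = MONOPHTHONGS + DIPHTHONGS + CONSONANTS
counts = (len(MONOPHTHONGS), len(DIPHTHONGS), len(CONSONANTS))

# The comparison is case-sensitive, so "x" and "X" (or "oU" and "OU")
# count as different phoneme-IDs.
all_unique = len(set(inventory)) == len(inventory)

print(counts, len(inventory), all_unique)  # (20, 9, 27) 56 True
```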

If more phonemes are required in the future, it is easy to modify the ph_letzebuergesch file and to adapt the related files lb_rules, lb_list and lb_emoji.

In the next chapter I describe how letters (characters, graphemes) are linked to phoneme-IDs.

Text to Phoneme translation

lb_rules

The main file defining the translation (correspondence) between letters and phoneme-IDs is lb_rules. It is a sort of large table with the following format, starting with a .group line:

.group <one character>  

or

.group <two characters>  

then

   previous letter sequence )  current letter sequence ( next letter sequence   phoneme-ID sequence  // optional comments

Let's take the Auto example again to explain the rule code. The word starts with the diphthong au, followed by the consonant t, and ends with the monophthong o, preceded by the consonant t. The consonant t itself is preceded by the diphthong au and followed by the monophthong o.

.group au
  _) au (t aU
  
.group o
  t) o (_  o:
  
.group t
  au) t (o  t 

Again, eSpeak-NG provides a large set of specific commands and parameters to define the translation rules in a very detailed and comprehensive manner. A simple example is the capital letter A, which stands for any vowel, and the capital letter C, which stands for any consonant. Using A and C in the above example, we can easily define combined rules that phonemize both words Aula and Auto:

.group au
  _) au (C aU
  
.group o
  C) o (_  o:

.group a
  C) a (_  a:
  
.group t
  A) t (A  t 
  
.group l
  A) l (A  l

Another way to code the same result is:

.group au
  _) au (C aU
  
.group to
  A) to (_  to:
  
.group la
  A) la (_  la:

To find the pronunciation of a word, the rules are searched and any which match the letters at the current position in the word are given a score, depending on how many letters are matched. The pronunciation from the best matching rule is chosen. The position pointer inside the word is then advanced past those consumed letters which have been matched and the process is repeated until all the letters of the word have been processed.

There are numerous ways to code the phonemization rules with previous, current and next letter sequences (or placeholders, symbols, options, letter groups with common features, etc.).

There are also some restrictions. The first letter of a two-letter rule group ID must be a 7-bit ASCII character (less than 0x80). For a one-letter group ID, characters beyond 7-bit ASCII (such as è, ö, î) are accepted.
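The restriction can be expressed as a small check. The function below is a hypothetical helper written for this README, not an eSpeak-NG API, and it only encodes the two cases described above (one-letter and two-letter group IDs).

```python
def is_valid_group_id(group_id):
    """Check a rule group ID against the restriction described above."""
    if len(group_id) == 1:
        # one-letter IDs may be non-ASCII, e.g. "ö"
        return True
    if len(group_id) == 2:
        # the first letter of a two-letter ID must be 7-bit ASCII
        return ord(group_id[0]) < 0x80
    return False


for candidate in ["a", "ö", "au", "öu"]:
    print(candidate, is_valid_group_id(candidate))
```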

The guide Text to Phoneme Translation describes all available options for setting up a complete and complex translation table. I opted for a few simple rules, however, because I prefer to put a Luxembourgish pronunciation lexicon into the file lb_list described below. This takes into account that Luxembourgish contains numerous loanwords from German, French and English, which makes global rules more difficult to define.

Interested users can easily extend and adapt the rules in the lb_rules file to their own needs.

lb_list

Usually a text to be synthesized does not only contain common words of a specific language, but also punctuation, numbers, symbols, personal names, abbreviations, loanwords and possibly Speech Synthesis Markup Language (SSML) code. These exceptions are handled in the lb_list file. It is a simple text lexicon with the following entries:

letter-name    phoneme-ID               or            $accent
word           phoneme-ID sequence
word           other similar word                     $text
word           (stressed, unstressed, ..)             $u  // intonation 
abbreviation   phoneme-ID sequence      or            $abbrev
number         phoneme-ID sequence
symbol         phoneme-ID sequence

Here is a small excerpt from the Luxembourgish lb_list file to show the principles:

// letter names
_a a:
_e e:
ä    $accent  
û    $accent  
// intonation
de      $u  // unstress
den     $u
// numbers
_0 nul
_1 e:nt
_2 TSve:
_0C honert
_0M1 daUz@End
// punctuations
_. punkt
_, koma
// symbols
+ pluz
& an
€ eUro
// abbreviations
CSV    $abbrev
OGBL   $abbrev
asbl   $abbrev
ASCII azki
// personal names
Marco  marko:
Barnig barniS
Simone zimon

The flag $accent tells the processor to say the letter name, followed by the name of the diacritic. The flag $abbrev tells it to say the letter names in sequence. Speaking the names of punctuation marks works only if the flag --punct is specified on the command line.
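To make the entry layout concrete, here is a minimal Python sketch that reads a few lines in the lb_list style and separates flags (tokens starting with $) from phoneme-ID strings. The parser is a hypothetical illustration, assuming that // starts a comment and that fields are whitespace-separated; it does not cover the full syntax accepted by eSpeak-NG.

```python
LB_LIST_EXCERPT = """\
// letter names
_a a:
// intonation
de      $u  // unstress
// numbers
_2 TSve:
// abbreviations
OGBL   $abbrev
ASCII azki
// personal names
Marco  marko:
"""


def parse_list(text):
    """Return {word: {"phonemes": str, "flags": [str, ...]}}."""
    entries = {}
    for raw in text.splitlines():
        # drop comments (assumed to start with //) and surrounding blanks
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        word, *rest = line.split()
        flags = [tok for tok in rest if tok.startswith("$")]
        phonemes = [tok for tok in rest if not tok.startswith("$")]
        entries[word] = {"phonemes": " ".join(phonemes), "flags": flags}
    return entries


entries = parse_list(LB_LIST_EXCERPT)
for word, entry in entries.items():
    print(word, entry)
```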

Again, eSpeak-NG offers many options to fine-tune the list entries. My lb_list file for the first version of this project is a work in progress. I will update it progressively to meet my needs. Interested users can do the same.

Note: the supported SSML tags are described in the document SSML and HTML Support.

lb_emoji

Today, graphics are becoming more and more important in text. For this reason eSpeak-NG replaced the former file language_extra with a file called language_emoji. Emojis are small pictures that can be represented as encoded characters. Originating on Japanese mobile phones in 1997, emojis became increasingly popular worldwide in the 2010s after being added to several mobile operating systems. To ensure interoperability between browsers, mobile devices and messaging systems, emojis have been standardized since 2014 by the Unicode Consortium. Every year additional emojis are added to the standard based on public proposals.

The full list of emojis is available on the home page of the Unicode Consortium.

Here are two examples of how to include emojis (graphics) in a Luxembourgish sentence submitted to the eSpeak-NG phonemizer:

An der 🕰 hunn sech den 🧭💨 an d’🌞 gestridden, wie vun hinnen zwee wuel méi 💪 wier, wéi e 🚶, deen an ee waarme 🧥 agepak war, iwwert de 🛤 koum.

Haut sinn ☝ mat mengen Enkelkanner 🧑‍🤝‍🧑 , 👦 , 👧 , an 👩 an den 🎪 gaangen. Do hunn mer e 🦍, eng 🦒, en 🐘 an en 🦏 gesinn.

To phonemize the above sentences I added the following entries to the lb_emoji file:

&#x1F570;  TSaIt
&#x1F9ED;  nort
&#x1F32C;  vant
&#x1F31E;  zon
&#x1F4AA;  Sta:rk
&#x1F6B6;  vand@EreR
&#x1F9E5;  mant@El
&#x1F6E4;  ve:
&#x261D;   eS
&#x1F9D1;&#x200D;&#x1F91D;&#x200D;&#x1F9D1; Sa:rel, toma:
&#x1F466;  asto:r
&#x1F467;  kapuzin
&#x1F469;  tessa
&#x1F3AA;  TSirkus
&#x1F98D;  gorila:
&#x1F992;  giraf
&#x1F418;  elefant
&#x1F98F;  rino:zerus
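The hexadecimal codes in these entries are Unicode code points. Assuming the &#x…; notation above is the usual HTML numeric character reference, the following Python sketch decodes a few of the entries with the standard library and shows that the entry for the two grandchildren is a single sequence joined by zero-width joiners (U+200D).

```python
import html

# A few entries copied from the list above: (code notation, phoneme-IDs)
EMOJI_ENTRIES = [
    ("&#x1F570;", "TSaIt"),
    ("&#x1F3AA;", "TSirkus"),
    ("&#x1F9D1;&#x200D;&#x1F91D;&#x200D;&#x1F9D1;", "Sa:rel, toma:"),
]


def decode(entity_text):
    """Turn numeric character references into the actual characters."""
    return html.unescape(entity_text)


for entity, phonemes in EMOJI_ENTRIES:
    chars = decode(entity)
    code_points = " ".join(f"U+{ord(c):04X}" for c in chars)
    print(chars, code_points, "->", phonemes)
```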

A more comfortable option is to insert emojis from the popup menu in the text editor:

[Image: insert emoji]

Here are the results of the phonemization:

ɑn dɐ ‚ʦæ:ɪt / hun zeɕ dən ’noʀtvɑnt ɑn ‚dzon gə’ʃtʀidən / viə fun hinən ‚ʦve: vuəl ‚meɪ ʃta:ʀk viɐ / veɪ eː ‚vɑndəʀɐ / de:n ɑn eː ‚va:ʀmə ‚mɑntəl ‚a:ɡəpa:k va:ʀ / ivɐt də ‚veː kəʊm //

haUt sin eS mat mengen enkelkaneR Sa:rel / toma: / asto:r / kapuzin an tessa an den TSirkus ga:ngen // do: hun meR e gorila: / eN giraf / en elefant an e rino:zerus gesin //

For convenience I moved the symbols from the lb_list file to the lb_emoji file.

Integration

In the last step the four Luxembourgish files are embedded into the source code of the eSpeak-NG project. The following source files are modified:

In the Makefile.am file I added the following lines:

at position 392:

phsource/ph_letzebuergesch \

at position 570:

espeak-ng-data/lb_dict \

at position 792:

lb: espeak-ng-data/lb_dict
espeak-ng-data/lb_dict: dictsource/lb_list dictsource/lb_rules dictsource/lb_emoji

In the phsource/phonemes file I added these lines at position 1763:

phonemetable lb base1
include ph_letzebuergesch

In the docs/languages.md file I added this line after the Latin language:

gmw | lb | West Germanic | Lëtzebuergesch

Two additional files must be created: the language file lb and the voice file Luxi. Both files are very simple if we include only the strict minimum.

Here is the content of the language file lb:

name Lëtzebuergesch
language lb

Here is the voice file Luxi:

name Luxi
language lb
maintainer mbarnig

As usual, eSpeak-NG provides numerous options to customize these files. Please read the guide Voice and Language files for detailed documentation of all available features. When the project is built for the first time, an additional file lb_dict is created inside the folder espeak-ng-data; it is a compressed binary combination of the three files lb_rules, lb_list and lb_emoji.

Build and use the project

Now the forked eSpeak-NG source code is ready for compilation, hopefully without problems. Compilation is easy if you have a personal computer with a well-configured development environment and all the tools required for C compilation. This is the case on my Ubuntu 20.04 system, where I build and install the project with four commands:

./autogen.sh
./configure
make
sudo make install

[Image: build espeak-ng-lb]

The building guide provides detailed information on how to compile and build eSpeak NG from source on different operating systems. If you need help, please visit the issues section of the master eSpeak-NG project. Currently there are 333 open and 311 closed issues, so another user has probably already found a solution to your problem.

To use the project, I need only one command:

espeak-ng -v lb -q --ipa -f <text file>

The content of the text file is phonemized with Luxembourgish rules (flag -v lb) and output as IPA symbols (flag --ipa), without producing speech (flag -q). All the optional configuration flags are explained in the --help output.
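For scripted use, the same command can be wrapped in a few lines of Python. This is a minimal sketch that assumes espeak-ng is installed and on the PATH; the function names are hypothetical, and only the flags shown in the command above are used.

```python
import subprocess


def build_command(text_file, voice="lb"):
    """Build the argument list for the command shown above."""
    return ["espeak-ng", "-v", voice, "-q", "--ipa", "-f", text_file]


def phonemize_file(text_file, voice="lb"):
    """Run espeak-ng and return the IPA transcription printed on stdout."""
    result = subprocess.run(
        build_command(text_file, voice),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if espeak-ng fails
    )
    return result.stdout.strip()


print(" ".join(build_command("text.txt")))
```

Calling phonemize_file("text.txt") would then return the transcription as a string, where text.txt is a placeholder file name.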

A more comfortable option is to use the great tool espeak-phonemizer, created by Michael Hansen (alias synesthesiam), which transforms the output of eSpeak-NG into a format ready to be used as input for training a deep machine learning Luxembourgish TTS model based on neural networks.

My favorite GitHub projects for my Luxembourgish ML-TTS experiments are:

On November 11, 2021, the present code was merged into the main project by Valdis Vitolins, alias valdisvi, one of the administrators of eSpeak-NG, following my pull request of November 7, 2021. valdisvi made the following changes:

  • Renamed 'Letzebuergesch' to 'Luxembourgish' everywhere because English language names are used
  • Cleaned up unneeded changes in _layouts folder
  • Moved overwritten README.md to docs/languages/lb.md
  • Removed unneeded espeak-ng-data/lb_dict file (it is created automatically during compilation)
  • Merged Makefile.am with 9dba972 commit
  • Removed unneeded Luxi voice, because it didn't have any customization
  • Updated number of supported languages in docs/languages.md
  • Updated CHANGELOG.md
  • Updated tests/language-phonemes.test to pass make check
