Multimodal neural pronunciation modeling for spoken languages with logographic origin

This code implements neural network models that predict Cantonese pronunciation from two sources: the pronunciation of cognates in historically related languages (Mandarin, Vietnamese, Korean) and embeddings of the logographic characters. The code uses Keras with a TensorFlow backend. The models were presented in the paper:

Minh Nguyen, Gia H. Ngo, Nancy F. Chen, Multimodal neural pronunciation modeling for spoken languages with logographic origin, Empirical Methods in Natural Language Processing (EMNLP), 2018.

Data

The dataset is extracted from the Unihan database, which records the pronunciations of characters used in languages with Han logographic writing.

Training set: data/train.csv
Validation set: data/validation.csv
Test set: data/test.csv

Each file has one line per logogram. Each line contains the Unicode code point of the logogram, followed by its phonemes in Mandarin, Cantonese, Korean and Vietnamese. Examples from the training set can be shown with the following command:

python3 preview.py -d data/train.csv
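For readers who want to load the data outside of preview.py, the row layout described above can be sketched in Python as follows. This is a minimal sketch, not the repository's own loader (dataset.py): the field names, the comma delimiter, the absence of a header row, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

# Assumed column order, following the order in which the README lists the
# fields. The real files may differ; check with preview.py.
FIELDS = ["unicode", "mandarin", "cantonese", "korean", "vietnamese"]


def read_rows(handle):
    """Yield one dict per logogram from an open CSV file handle."""
    for row in csv.reader(handle):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        yield dict(zip(FIELDS, (cell.strip() for cell in row)))


# Placeholder rows for illustration only, not real data from data/train.csv.
sample = io.StringIO(
    "U+0001,m1 m2,c1 c2,k1 k2,v1 v2\n"
    "U+0002,m3,c3,k3,v3\n"
)
rows = list(read_rows(sample))
print(rows[0]["cantonese"])  # prints "c1 c2"
```

To try this on the real data, replace the in-memory sample with a file opened via open("data/train.csv", newline="", encoding="utf-8"), and adjust FIELDS if the actual columns differ.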

ids.txt was cloned from https://github.com/cjkvi/cjkvi-ids and contains the Ideographic Description Sequence data derived from the CHISE project. The Ideographic Description Sequences are used to construct the logographs' embeddings.
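To illustrate how an Ideographic Description Sequence can be turned into a feature for a character, the sketch below counts the components of a single sequence. It is not the repository's encoding code: the function name bag_of_components is hypothetical, and the sketch decomposes only one level, without recursively expanding components into radicals. It relies only on the fact that the layout operators in a sequence come from the Unicode Ideographic Description Characters block (U+2FF0 to U+2FFF).

```python
from collections import Counter

# Unicode block "Ideographic Description Characters": layout operators
# such as left-right and top-bottom composition.
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFF


def bag_of_components(ids):
    """Count the components in one IDS string, ignoring layout operators.

    Hypothetical helper for illustration; it does not expand components
    further into smaller radicals.
    """
    return Counter(
        ch for ch in ids if not (IDC_FIRST <= ord(ch) <= IDC_LAST)
    )


# U+2FF0 (left-right composition) followed by the two components of U+597D.
print(bag_of_components("\u2ff0\u5973\u5b50"))
```

A bag like this discards the geometric layout carried by the operators. Keeping the operators in the sequence instead preserves that layout, at the cost of needing a sequence model rather than a fixed-size input.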

Reproducing the paper results

Clone this repository.

git clone https://github.com/nguyen-binh-minh/logographic  
cd logographic

Install Anaconda

Set up Python environment with Anaconda

conda env create --name py3_env --file environment.yaml

Replicate the experiments

source activate py3_env  
./scripts/example_mlp_bor.sh  # MLP with bag-of-radicals input  
./scripts/example_lstm_geod.sh  # LSTM with GeoD input  
./scripts/example_mlp_bor_ph.sh  # MLP with bag-of-radicals input and cognates' phonemes input  
./scripts/example_lstm_geod_ph.sh  # LSTM with GeoD input and cognates' phonemes input

License

The code in this repository is released under the terms of the MIT license.
The Ideographic Description Sequence data is licensed under GPLv2.
