
GraphemeBERT

This is the source code of the paper "Neural Grapheme-to-Phoneme Conversion with Pretrained Grapheme Models".

Neural Grapheme-to-Phoneme Conversion with Pretrained Grapheme Models (ICASSP 2022).

Lu Dong, Zhi-Qiang Guo, Chao-Hong Tan, Ya-Jun Hu, Yuan Jiang and Zhen-Hua Ling

If you find this repo helpful, please cite the following paper:

```bibtex
@inproceedings{dong2021neural,
    title={Neural Grapheme-to-Phoneme Conversion with Pretrained Grapheme Models},
    author={Lu Dong and Zhi-Qiang Guo and Chao-Hong Tan and Ya-Jun Hu and Yuan Jiang and Zhen-Hua Ling},
    year={2022},
    month={May},
    booktitle={2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    publisher={IEEE},
    url={https://arxiv.org/abs/2201.10716},
}
```

Reference code

  • Transformer
  • Beam search
  • BERT
  • BERT-fused model

Requirements

  • python 3.7
  • pytorch 1.7.0
  • torchtext 0.5.0 (from pip)
  • hangul_jamo 1.0.0 (from pip)
  • numpy
  • pandas

Overview

In this paper, we propose a pre-trained grapheme model called grapheme BERT (GBERT), which is built by self-supervised training on a large, language-specific word list with only grapheme information. We borrow the masking mechanism of BERT to capture the contextual grapheme information within a word. Furthermore, two approaches are developed to incorporate GBERT into the state-of-the-art Transformer-based G2P model, i.e., fine-tuning GBERT directly or fusing GBERT into the Transformer model by attention. Experimental results on the Dutch, Serbo-Croatian, Bulgarian and Korean datasets of the SIGMORPHON 2021 G2P task confirm the effectiveness of our GBERT-based G2P models under both medium-resource and low-resource data conditions.
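
As a minimal illustration of the masking idea, here is a simplified sketch of BERT-style masking applied to the grapheme sequence of one word (a hedged example, not the repository's exact preprocessing: it only applies the mask replacement and omits BERT's 80/10/10 scheme; the real code lives in GBERT_pretrain.py):

```python
import random

MASK_TOKEN = "<mask>"

def mask_graphemes(word, mask_prob=0.15, seed=None):
    """Return (input_graphemes, target_graphemes) for masked grapheme modeling."""
    rng = random.Random(seed)
    graphemes = list(word)              # e.g. "grapheme" -> ['g','r','a','p','h','e','m','e']
    inputs, targets = [], []
    for g in graphemes:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)   # hide this grapheme from the model
            targets.append(g)           # ...and ask the model to recover it
        else:
            inputs.append(g)
            targets.append("<pad>")     # ignored positions in the loss
    return inputs, targets

print(mask_graphemes("grapheme", seed=0))
```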

  1. GBERT

    (figure: GBERT)

  2. GBERT finetuning

    1. pretrained GBERT encoder + randomly initialized Transformer decoder
  3. GBERT attention (adopted from the BERT-fused model, which shows how to utilize BERT for generation tasks; NMT is used in their paper); see the fusion sketch after this list.

    (figure: GBERT_attention)
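
A minimal sketch of the fusion idea behind GBERT attention (module and attribute names are illustrative; the actual implementation is in monolingual_G2P_model/GBERT_attention.py). Each decoder layer attends both to the G2P encoder output and to the GBERT output, and the two attention results are combined, following the BERT-fused model; GBERT and the decoder are assumed to share the same hidden size here:

```python
import torch
import torch.nn as nn

class GBERTFusedDecoderLayer(nn.Module):
    """Illustrative decoder layer: self-attention, then attention over both the
    G2P encoder output and the GBERT output, averaged as in the BERT-fused model.
    All tensors are (seq_len, batch, hid_dim)."""

    def __init__(self, hid_dim=256, n_heads=4, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hid_dim, n_heads, dropout=dropout)
        self.enc_attn = nn.MultiheadAttention(hid_dim, n_heads, dropout=dropout)
        self.gbert_attn = nn.MultiheadAttention(hid_dim, n_heads, dropout=dropout)
        self.ff = nn.Sequential(nn.Linear(hid_dim, 4 * hid_dim), nn.ReLU(),
                                nn.Linear(4 * hid_dim, hid_dim))
        self.norms = nn.ModuleList([nn.LayerNorm(hid_dim) for _ in range(3)])

    def forward(self, tgt, enc_out, gbert_out, tgt_mask=None):
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        tgt = self.norms[0](tgt + x)
        enc_x, _ = self.enc_attn(tgt, enc_out, enc_out)          # attend to the G2P encoder
        gbert_x, _ = self.gbert_attn(tgt, gbert_out, gbert_out)  # attend to (frozen) GBERT
        tgt = self.norms[1](tgt + 0.5 * (enc_x + gbert_x))       # average the two attention paths
        return self.norms[2](tgt + self.ff(tgt))
```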

File Structure

  1. monolingual_GBERT_pretrain
    1. get_wikipron_monolingual_word_data_without_dev_and_test_g2p_word.py # GBERT pre-training data preprocessing (removes WikiPron words that appear in the dev and test sets of the G2P dataset and splits the remainder into train and dev word lists)
    2. GBERT_pretrain.py # pre-training a GBERT
  2. monolingual_G2P_model
    1. Transformer.py
    2. GBERT_attention.py
    3. GBERT_finetuning.py
  3. Data
    1. monolingual_g2p_data # G2P data
    2. monolingual_g2p_grapheme_bert_input_data # G2P data formatted as Grapheme BERT input (five columns)
    3. monolingual_word_data # word data (divided into train/dev)
    4. monolingual_word_dictionary # word dictionary (undivided)
  4. Model and vocab
    1. pretrain_model_vocab
    2. torch_models # pre-trained models and downstream G2P models

Word list

We collected word lists from WikiPron, which is the source G2P database for the SIGMORPHON 2021 G2P task. The raw word lists are in monolingual_word_dictionary. Note that we only collected the word lists from WikiPron and used no pronunciation information, and we removed the words that appear in the dev and test sets of the G2P tasks. To train a monolingual GBERT, we divided the remaining words into training and validation sets with a ratio of 9:1. The final divided word lists are in monolingual_word_data.
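
A minimal sketch of that filtering and 9:1 split (the actual script is get_wikipron_monolingual_word_data_without_dev_and_test_g2p_word.py; the function name and file paths below are illustrative):

```python
import random

def build_word_lists(wikipron_words, dev_test_words, dev_ratio=0.1, seed=42):
    """Remove dev/test G2P words from the WikiPron word list and split the rest 9:1."""
    forbidden = set(dev_test_words)
    words = sorted(set(w for w in wikipron_words if w not in forbidden))
    random.Random(seed).shuffle(words)
    n_dev = int(len(words) * dev_ratio)
    return words[n_dev:], words[:n_dev]   # (train_words, dev_words)

# Illustrative usage with made-up file names:
# train, dev = build_word_lists(open("dut_wikipron_words.txt").read().split(),
#                               open("dut_dev_test_words.txt").read().split())
```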

Pre-training

GBERT is a Transformer encoder. For all languages, we used a 6-layer Transformer encoder; the training details follow BERT. Since the pre-trained GBERT is quite small (~4M parameters, while BERT has 110M parameters), pre-training only takes ~6 h on a single GTX 1080 GPU. Therefore, we did not release our pre-trained GBERT; you can pre-train a GBERT quickly yourself.
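
For reference, a hedged sketch of a comparably sized 6-layer encoder (the hidden size, head count and feedforward width are assumptions consistent with the numbers above, not the exact GBERT_pretrain.py configuration):

```python
import torch.nn as nn

# A 6-layer Transformer encoder of roughly GBERT's size; dimensions are assumed.
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4,
                                           dim_feedforward=1024,
                                           dropout=0.1, activation="gelu")
gbert_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

n_params = sum(p.numel() for p in gbert_encoder.parameters())
print(f"{n_params / 1e6:.1f}M parameters")   # a few million, vs. 110M for BERT-base
```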

G2P Datasets

We use the Dutch, Serbo-Croatian, Bulgarian and Korean datasets in the medium-resource subtask of the SIGMORPHON 2021 G2P task for the experiments. Since our code was originally built for a multilingual system, the preprocessing code adds a language code to the input words; this code is removed in the tokenizer function for the monolingual experiments.
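
A hypothetical sketch of that tokenizer step (the function name and the exact tag format are illustrative, not the repository's code):

```python
def tokenize_word(tagged_word, multilingual=False):
    """Split a preprocessed word such as '<dut> h u i s' into grapheme tokens.
    In the monolingual experiments the leading language code is dropped."""
    tokens = tagged_word.split()
    if not multilingual and tokens and tokens[0].startswith("<"):
        tokens = tokens[1:]              # strip the language code, e.g. '<dut>'
    return tokens

print(tokenize_word("<dut> h u i s"))          # ['h', 'u', 'i', 's']
print(tokenize_word("<dut> h u i s", True))    # ['<dut>', 'h', 'u', 'i', 's']
```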

Train and Evaluate

```bash
python x.py >> output.log
```

Here x.py stands for one of the training scripts, e.g. Transformer.py, GBERT_finetuning.py or GBERT_attention.py.

The final results can be found in output.log; the predicted pronunciation files are not written out. If you want prediction files containing the word, the gold pronunciation and the predicted pronunciation, you need to map the token indices back to the vocabulary in the evaluate_beam function.
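
If you add that output, the conversion is essentially an index-to-symbol lookup; a hedged sketch (a torchtext 0.5 style vocab with an itos list and the usual special tokens is assumed):

```python
def indices_to_symbols(indices, vocab, eos_token="<eos>"):
    """Map a beam-search hypothesis (a list of vocabulary indices) back to symbols.
    Assumes a torchtext 0.5 style vocab with an `itos` list; stops at <eos>."""
    symbols = []
    for idx in indices:
        symbol = vocab.itos[idx]
        if symbol == eos_token:
            break
        if symbol not in ("<sos>", "<pad>"):
            symbols.append(symbol)
    return symbols

# Illustrative usage inside evaluate_beam (variable names are hypothetical):
# pred_phonemes = indices_to_symbols(best_hypothesis, TRG.vocab)
# line = f"{word}\t{' '.join(gold_phonemes)}\t{' '.join(pred_phonemes)}"
```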

Implementation Details

To reproduce our experimental results for different languages, you may follow our process of tuning hyperparameters for the different languages and data conditions. The best hyperparameters are chosen by the performance on the corresponding validation set.

  1. GBERT
    1. batch size=1024, lr=1e-4, n layers=6, dropout=0.1, gelu (following BERT); the GBERT attention experiments on bul (medium resource) used a GBERT with relu since a GBERT with gelu did not work well.
  2. Transformers
    1. batch size(1024/512/256 for medium-resource, 16/8 for low-resource)
    2. lr(1e-3/5e-4)
    3. hid dim(256/128 for medium resource, 128 for low-resource),
    4. n layers(3 for medium resource, 2 for low-resource),
    5. dropout(0.1/0.2/0.3)
    6. relu/gelu
  3. GBERT finetuning
    1. The parameters of the Transformer decoder follow those of the tuned Transformer baseline.
    2. The learning rates of the pretrained GBERT encoder and the Transformer decoder are both chosen from {1e-3, 5e-4, 3e-4, 1e-4, 1e-5} (see the optimizer sketch after this list).
  4. GBERT attention
    1. The parameters of the Transformer decoder follow those of the tuned Transformer baseline.
  5. Multilingual experiments
    1. Add a language ID for the multilingual GBERT.
    2. Change the tokenizer manually for the multilingual Transformer so that a language tag is included in the input to the multilingual Transformer.
  6. Our hyperparameters (these may be influenced by random seeds or machines; we ran our models on a V100 GPU under CentOS 7)
    1. Medium resource
      1. Transformer
        1. dut: batch size 256, lr=1e-3, hid dim 256, n layers 3, dropout 0.2, relu
        2. hbs: batch size 256, lr=1e-3, hid dim 256, n layers 3, dropout 0.2, gelu
        3. bul: batch size 1024, lr=1e-3, hid dim 256, n layers 3, dropout 0.2, relu
        4. kor: batch size 256, lr=1e-3, hid dim 128, n layers=3, dropout 0.2, gelu (for GBERT finetuning the encoder-decoder attention projections become W_Q 128 --> 256, W_K 256 --> 256, W_V 256 --> 256, W_O 256 --> 128; the encoder-decoder, GBERT-Enc and GBERT-Dec attention require similar changes)
      2. GBERT finetuning
        1. dut: lr_encoder=3e-5, lr_decoder=1e-3
        2. hbs: lr_encoder=1e-4, lr_decoder=5e-4
        3. bul: lr_encoder=1e-4, lr_decoder=5e-4
        4. kor: lr_encoder=1e-4, lr_decoder=1e-3
      3. GBERT attention
        1. lr_second_train=5e-4 for all experiments.
    2. Low-resource
      1. Transformer
        1. dut: batch size 8, lr=1e-3, hid dim 128, n layers 2, dropout 0.2, relu
        2. hbs: batch size 16, lr=1e-3, hid dim 128, n layers 2, dropout 0.2, gelu
        3. bul: batch size 16, lr=1e-3, hid dim 128, n layers 2, dropout 0.2, gelu
        4. kor: batch size 8, lr=1e-3, hid dim 128, n layers 2, dropout 0.2, gelu
      2. GBERT finetuning
        1. dut: lr_encoder=1e-5, lr_decoder=5e-4
        2. hbs: lr_encoder=1e-4, lr_decoder=5e-4
        3. bul: lr_encoder=1e-4, lr_decoder=1e-3
        4. kor: lr_encoder=1e-5, lr_decoder=1e-4
      3. GBERT attention
        1. lr_second_train=5e-4 for all experiments.
    3. Low resource Transfer
      1. Transformer
        1. eng + dut: batch size 256, lr=1e-3, hid dim 256, n layers 3, dropout 0.1, gelu
      2. GBERT finetuning
        1. eng + dut: lr_encoder=3e-5, lr_decoder=1e-3
      3. GBERT attention
        1. lr_second_train=5e-4 for all experiments.
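
Since GBERT finetuning uses different learning rates for the pretrained encoder and the freshly initialized decoder, separate parameter groups can be passed to one optimizer; a minimal sketch (the model attribute names and the choice of Adam are illustrative assumptions):

```python
import torch

def make_finetune_optimizer(model, lr_encoder=1e-4, lr_decoder=5e-4):
    """One optimizer, two learning rates, as in the GBERT finetuning settings above."""
    return torch.optim.Adam([
        {"params": model.gbert_encoder.parameters(), "lr": lr_encoder},  # pretrained GBERT
        {"params": model.decoder.parameters(), "lr": lr_decoder},        # random-init decoder
    ])
```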

Results

We report the mean and standard deviation of the WER and PER results of five runs on the medium-resource and low-resource G2P tasks.

(figure: Main_v0325)

We also report the results of the low-resource transfer experiments.

(figure: Transfer_v0325)
