English-Vietnamese Bilingual Translation Using a Transformer Model with Separated Positional Embedding (SPETModel)

This project continues development from the NeuralMachineTranslation repo.

@github{Translation,
  author    = {The Ho Sy},
  title     = {English-Vietnamese Bilingual Translation with Transformer},
  year      = {2023},
  url       = {https://github.com/hsthe29/Translation},
}

Model Architecture

The architecture is modified from the vanilla Transformer.

Data

See PhoMT.

Training Task:

Target Masked Translation Modeling (Target MTM)

Target MTM:

Training: 
  input: ["en<s>", "How", "are", "you?", "</s>"]
  target in: ["vi<s>", "Bạn", "có", "<mask>", "không?", "</s>"]
  target out: ["Bạn", "có", "khỏe", "không?", "</s>", "<pad>"]
Inference:
  input: ["en<s>", "How", "are", "you?", "</s>"]
  target in: ["vi<s>"]
  Autoregressive -> full target out: ["vi<s>", "Bạn", "có", "khỏe", "không?", "</s>"]
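
The training pair above can be built from a full target sequence roughly as follows. This is a minimal sketch, not the repository's code: the function name `make_target_pair` and the masking probability `mask_prob` are assumptions made for illustration, and it works on token strings rather than vocabulary ids.

```python
import random

PAD, MASK = "<pad>", "<mask>"

def make_target_pair(target_tokens, mask_prob=0.15, rng=None):
    """Build (target_in, target_out) for Target MTM from a full target sequence.

    target_tokens must start with a language start token and end with "</s>".
    mask_prob is a hypothetical value; the source does not state the real rate.
    """
    rng = rng or random.Random()
    target_in = list(target_tokens)
    # Only ordinary tokens are masked: the start token and "</s>" are kept.
    for i in range(1, len(target_in) - 1):
        if rng.random() < mask_prob:
            target_in[i] = MASK
    # The output is the unmasked sequence shifted left by one position,
    # padded so that it has the same length as target_in.
    target_out = list(target_tokens[1:]) + [PAD]
    return target_in, target_out

tokens = ["vi<s>", "Bạn", "có", "khỏe", "không?", "</s>"]
target_in, target_out = make_target_pair(tokens, rng=random.Random(0))
```

Because the output is never masked, the decoder still has to predict the original token at every position, including the ones hidden in its input.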

Bilingual Vocabulary:

English sentence start token: en<s>
Vietnamese sentence start token: vi<s>
End sentence token: </s>
Mask token: <mask>, used for the Target MTM task (training only)

Example:

Natural English "Hello, how are you?", target start token "vi<s>":
- Transform to "en<s> Hello, how are you? </s>"
- Target: "vi<s> Xin chào, bạn có khỏe không? </s>"
Natural Vietnamese "Xin chào, bạn có khỏe không?", target start token "en<s>":
- Transform to "vi<s> Xin chào, bạn có khỏe không? </s>"
- Target: "en<s> Hello, how are you? </s>"
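
The transformation in this example amounts to wrapping each sentence in its language start token and the shared end token. A minimal sketch, assuming string-level processing before tokenization; the helper names `add_language_tokens` and `make_example` are hypothetical and do not come from the repository:

```python
START_TOKENS = {"en": "en<s>", "vi": "vi<s>"}
END_TOKEN = "</s>"

def add_language_tokens(sentence, lang):
    """Wrap a raw sentence with its language start token and the end token."""
    return f"{START_TOKENS[lang]} {sentence.strip()} {END_TOKEN}"

def make_example(source, source_lang, target, target_lang):
    """Return the (model input, training target) pair for one direction."""
    return (add_language_tokens(source, source_lang),
            add_language_tokens(target, target_lang))

en_to_vi = make_example("Hello, how are you?", "en",
                        "Xin chào, bạn có khỏe không?", "vi")
# Swapping the arguments gives the opposite translation direction.
vi_to_en = make_example("Xin chào, bạn có khỏe không?", "vi",
                        "Hello, how are you?", "en")
```

The same sentence pair therefore yields two training examples, one per direction, and the start token of the target tells the decoder which language to produce.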

Model configuration

See file config.py and configV1.json

Preload dataset

Because the dataset is large and my resources are limited, I preprocess the data and split it into chunks so that the model can be trained.
Preload parameters:
- seed: seed for the random number generator
- shuffle: if True, the dataset is shuffled before it is split into chunks
- chunk_size: number of examples in each chunk
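
The effect of these three parameters can be sketched as below. This is an illustration under assumptions, not the logic of preload_data.py: the function name `chunk_dataset` is hypothetical, and the real script presumably also tokenizes the data and writes each chunk to disk.

```python
import random

def chunk_dataset(examples, chunk_size, shuffle=True, seed=0):
    """Optionally shuffle the examples, then split them into fixed-size chunks."""
    items = list(examples)
    if shuffle:
        # A dedicated generator keeps the shuffle reproducible for a given seed.
        random.Random(seed).shuffle(items)
    # The last chunk may be shorter than chunk_size.
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

chunks = chunk_dataset(range(10), chunk_size=4, shuffle=True, seed=42)
```

Shuffling before chunking means each chunk is a random sample of the whole dataset, so training on one chunk at a time does not expose the model to the corpus in its original order.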

Training parameters

Optimizer: AdamW
Learning rate scheduler: WarmupLinearLR

Training arguments:
  - config: "assets/config/configV1.json"
  - load_prestates: True
  - epochs: 20
  - init_lr: 1e-4
  - train_data_dir: /path_to_train_data_dir/
  - val_data_dir: /path_to_val_data_dir/
  - train_batch_size: 16
  - val_batch_size: 32
  - print_steps: 500
  - validation_steps: 1000
  - max_warmup_steps: 10000
  - gradient_accumulation_steps: 4
  - save_state_steps: 1000
  - weight_decay: 0.001
  - warmup_proportion: 0.1
  - use_gpu: True
  - max_grad_norm: 1.0
  - save_ckpt: True
  - ckpt_loss_path: /path_to_loss_ckpt/
  - ckpt_bleu_path: /path_to_bleu_ckpt/
  - state_path: /path_to_state/
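
To make the interaction of these arguments concrete, the sketch below computes the number of optimizer updates under gradient accumulation and a linear warmup followed by a linear decay. It is written under assumptions: the source does not show how WarmupLinearLR is implemented or how warmup_proportion and max_warmup_steps are combined, so taking the smaller of the two and decaying to zero are guesses, and all function names are hypothetical.

```python
def optimizer_updates(batch_steps, gradient_accumulation_steps=4):
    """Number of optimizer updates performed after a given number of batches."""
    return batch_steps // gradient_accumulation_steps

def warmup_updates(total_updates, warmup_proportion=0.1, max_warmup_steps=10000):
    """Assumed rule: warm up for a share of training, capped at max_warmup_steps."""
    return min(int(total_updates * warmup_proportion), max_warmup_steps)

def linear_warmup_lr(update, total_updates, init_lr=1e-4, warmup=0):
    """Learning rate at a given update: linear ramp up, then linear decay to zero."""
    if warmup > 0 and update < warmup:
        return init_lr * update / warmup
    remaining = max(total_updates - update, 0)
    return init_lr * remaining / max(total_updates - warmup, 1)

total = optimizer_updates(100_000)   # 100k batches with accumulation of 4
warmup = warmup_updates(total)
peak_lr = linear_warmup_lr(warmup, total, warmup=warmup)
```

With gradient_accumulation_steps set to 4 and train_batch_size set to 16, each optimizer update sees 64 examples, and 100k batch steps correspond to 25k updates.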

Training

$ pip install -r requirements.txt
$ python preload_data.py --config=... --data_dir=... --save_dir=... --chunk_size=... --shuffle=...
$ python train.py [training arguments]

Inference
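
At inference time the target starts from the language start token alone and is extended one token at a time. A minimal greedy-decoding sketch of that loop follows; it is an assumption-laden illustration, since the source does not say whether the repository uses greedy search, beam search, or sampling. `next_token` stands in for the trained model and `max_len` is a hypothetical safety limit.

```python
def greedy_decode(next_token, source_tokens, start_token,
                  end_token="</s>", max_len=64):
    """Extend the target one token at a time until the end token or max_len."""
    target = [start_token]
    while len(target) < max_len:
        # next_token(source, target_so_far) returns the model's best next token.
        token = next_token(source_tokens, target)
        target.append(token)
        if token == end_token:
            break
    return target

# Toy stand-in for the model: replays a fixed translation.
canned = ["Bạn", "có", "khỏe", "không?", "</s>"]

def toy_model(source_tokens, target_so_far):
    return canned[len(target_so_far) - 1]

source = ["en<s>", "How", "are", "you?", "</s>"]
output = greedy_decode(toy_model, source, start_token="vi<s>")
```

Passing "en<s>" as the start token instead would ask the same model to decode into English.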

Example

Attention Maps

Web server

A simple Flask web server runs on localhost, provides bilingual translation, and visualizes the attention weights between pairs of sentences.
Attention maps are drawn with Plotly.js.
The server uses the checkpoint at step 100k (25k update steps); cross-entropy per token: 3.5154.
- Download the checkpoint and set "pretrained_path" in the config file to the path where the checkpoint was saved.
Run: $ python run_app.py or $ python3 run_app.py

Simple UI

Please give me a star if you find this project interesting
