
ls4408/bi-gru-translator


Author: Liwei Song

Project name: Bi-GRU machine translator (French to English) with attention.

Environment: Python 3.6 with PyTorch installed, Ubuntu 16.04 (or a compatible version on the cloud)


The translation program can be launched with the following shell command:

python tranlator.py

Process:

Data preparation ---> Build RNN encoder/decoder ---> Add attention to the decoder ---> Try bidirection ---> Train the model ---> Translation code ---> Evaluation with BLEU

Dataset

The dataset is a bilingual subtitle corpus, originally from opensubtitles.org; opus.nlpl.eu converts the subtitles into a parallel corpus.
Original data format: tmx
Data size: 202180 pairs of subtitles in the original data set. 131690 pairs are kept (those whose words all appear more than 3 times in the corpus).
Training sample size: 100000 pairs; validation sample size: 22000 pairs; test sample size: 9690 pairs.
Cleaned data: saved in the ./data folder in txt format -- en.txt & fr.txt.
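The "words appear more than 3 times" filter above can be sketched as follows. This is an illustrative, stdlib-only sketch (the function name `filter_pairs` and the one-sentence-per-line format are assumptions, not the repo's actual preprocessing code):

```python
from collections import Counter

def filter_pairs(en_lines, fr_lines, min_count=3):
    """Keep only sentence pairs whose every word appears more than
    min_count times across both sides of the corpus."""
    counts = Counter()
    for line in en_lines + fr_lines:
        counts.update(line.split())
    return [(en, fr) for en, fr in zip(en_lines, fr_lines)
            if all(counts[w] > min_count for w in en.split() + fr.split())]
```

In practice this is why 131690 of the 202180 pairs survive: one rare word on either side drops the whole pair.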

Model Training


Language model: As I am focusing on sequence-to-sequence language models, two gated recurrent networks (GRUs) are used as the encoder and the decoder, respectively. Due to time limitations, only a bidirectional RNN with 2 layers has been tested (50000 epochs scheduled, 15000 finished).
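For reference, the per-time-step gating math that each GRU layer applies can be written down in a few lines. This scalar sketch (hypothetical weight dicts, biases omitted) follows the standard GRU update used by PyTorch, where the new state interpolates between the old state and a candidate:

```python
import math

def gru_step(x, h, w, u):
    """One scalar GRU update. w and u hold input and recurrent weights
    for the update (z), reset (r), and candidate (n) gates."""
    sigmoid = lambda a: 1.0 / (1.0 + math.exp(-a))
    z = sigmoid(w["z"] * x + u["z"] * h)           # update gate
    r = sigmoid(w["r"] * x + u["r"] * h)           # reset gate
    n = math.tanh(w["n"] * x + u["n"] * (r * h))   # candidate state
    return (1.0 - z) * n + z * h                   # blend old and new state
```

A bidirectional encoder simply runs two such recurrences, one left-to-right and one right-to-left, and concatenates their hidden states per time step.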

Optimization method: Minibatch gradient descent is used to estimate the optimal solution, and backpropagation is used to compute the gradient of the objective function over each minibatch of data points.
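The minibatch loop itself can be sketched on a toy 1-D least-squares problem, fitting y = w*x. Each step computes the gradient on a sampled minibatch only, exactly the pattern the training loop follows (where backpropagation supplies the gradient instead of this hand-derived one). An illustrative sketch, not the repo's training code:

```python
import random

def minibatch_sgd(data, lr=0.02, batch_size=4, epochs=200, seed=0):
    """Minibatch gradient descent on the objective 0.5*(w*x - y)^2."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        batch = rng.sample(data, batch_size)
        # gradient of the mean squared error over this minibatch only
        grad = sum((w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return w
```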

The model is trained on the NYU HPC cloud with the parameters set up in run2.sh.
However, it can also be trained locally on CPU (recommended for non-CUDA devices).

Commands:   ./run2.sh                 (cloud)
            python train_cloud2.py    (local)

File descriptions:

tranlator.py: main program
train_model: defines the encoder/decoder/attention classes as well as the evaluation functions
batch.py: generates minibatches during training
Text_preprocessing_cloud.py: text preprocessing; defines the language class for convenient word embedding
masked_cross_entropy: defines the cross-entropy error used as the objective function
./data/model2-update-decoder.pth: saved decoder
./data/model2-update-encoder.pth: saved encoder
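The idea behind the masked cross-entropy objective listed above is that padded positions in a minibatch must not contribute to the loss. A stdlib-only sketch of that idea (not the repo's tensor implementation; the dict-based `probs` format is an assumption for readability):

```python
import math

def masked_cross_entropy(probs, targets, mask):
    """Mean negative log-likelihood over non-padded tokens only.
    probs[t] maps each token to its predicted probability at step t;
    mask[t] is 1 for real tokens and 0 for padding."""
    total, count = 0.0, 0
    for p, y, m in zip(probs, targets, mask):
        if m:
            total += -math.log(p[y])
            count += 1
    return total / count
```

Without the mask, sequences padded to a common length would reward the model for predicting the padding token, skewing training toward short outputs.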

To-do list:

Calculate the BLEU score for the validation data set.
Compare other RNN unit combinations (GRU/LSTM) and different parameters. Finish 50000 epochs, or stop early once the training error converges.
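For the planned BLEU evaluation, the core computation is clipped n-gram precision combined with a brevity penalty. A simplified sentence-level sketch up to bigrams (a real run would use an established implementation such as nltk.translate.bleu_score or sacrebleu rather than this):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions, scaled by a brevity penalty for short candidates."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())      # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    score = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * score
```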

Reference list:

Effective Approaches to Attention-based Neural Machine Translation
https://arxiv.org/abs/1508.04025
(for the attention model)
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
(seq-to-seq tutorial used as reference and as the basis for my translation machine)
