Despite its potential, English-Bodo (Eng-Brx) neural machine translation has not been the subject of any prior research. According to the 2011 Census of India, Bodo has 14,57,547 native speakers and 14,82,929 total speakers. In the initial stage of this work we searched for an English-Bodo parallel corpus and, to our surprise, found only one resource: the Indian Language Technology Proliferation and Deployment Centre (TDIL-DC).
Tourism corpus: an English-Bodo parallel corpus in the tourism domain (20901 sentences), provided by TDIL-DC.
The detailed cleaning and preprocessing steps are described in the paper.
All experiments were performed using the TensorFlow NMT framework by Thang Luong, Eugene Brevdo, and Rui Zhao.
The training process is similar to that of TensorFlow NMT. To make the hyper-parameters and execution easier to manage, we wrote a shell script, start.sh; the hyper-parameters can be changed in that file.
bash start.sh
or
chmod +x start.sh
./start.sh
The trained models are saved in the models/ directory.
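As an illustration of how a wrapper like start.sh can keep the hyper-parameters in one place, here is a minimal sketch. It is not the actual start.sh from this repository: the hyper-parameter values and the train, vocab and tst2012 file prefixes are assumptions borrowed from the TensorFlow NMT tutorial defaults, while nmt_data, tst2013 and models come from this README. The script only prints the command unless RUN=1 is set.

```shell
#!/usr/bin/env bash
# Hypothetical start.sh-style wrapper; the real script may differ.
set -euo pipefail

# Hyper-parameters gathered at the top (illustrative values, not the authors').
NUM_LAYERS=2
NUM_UNITS=128
DROPOUT=0.2
NUM_TRAIN_STEPS=12000

DATA_DIR=nmt_data   # data directory name as used in the BLEU command
OUT_DIR=models      # trained models are saved here

# Build the TensorFlow NMT training command as an array so quoting stays safe.
cmd=(python -m nmt.nmt
  --src=en --tgt=brx
  --vocab_prefix="$DATA_DIR/vocab"      # assumed prefix
  --train_prefix="$DATA_DIR/train"      # assumed prefix
  --dev_prefix="$DATA_DIR/tst2012"      # assumed prefix
  --test_prefix="$DATA_DIR/tst2013"
  --out_dir="$OUT_DIR"
  --num_train_steps="$NUM_TRAIN_STEPS"
  --steps_per_stats=100
  --num_layers="$NUM_LAYERS"
  --num_units="$NUM_UNITS"
  --dropout="$DROPOUT"
  --metrics=bleu)

# Dry run by default: print the command, launch it only when RUN=1.
printf '%s\n' "${cmd[*]}"
if [ "${RUN:-0}" = "1" ]; then "${cmd[@]}"; fi
```

Keeping the values in variables at the top means a new experiment only requires editing a few lines before rerunning the script.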
To evaluate the trained model on the test set, run out.sh.
- Translate the 2090 English test sentences into Bodo
bash out.sh
or
chmod +x out.sh
./out.sh
- View the translated sentences
gedit output.brx
Terminal editors such as nano do not render Bodo characters properly, so it is better to view the file in gedit or leafpad.
- Calculate BLEU score
perl multi-bleu.perl nmt_data/tst2013.brx < output.brx
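Because multi-bleu.perl pairs the reference and the hypothesis line by line, it can help to confirm that both files have the same number of lines before scoring. The helper below is a small sketch of such a check; the function name same_line_count is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch: verify that a reference file and a hypothesis file have equal line counts.
same_line_count() {
  local ref="$1" hyp="$2"
  local n_ref n_hyp
  # tr strips the padding that some wc implementations add.
  n_ref=$(wc -l < "$ref" | tr -d ' ')
  n_hyp=$(wc -l < "$hyp" | tr -d ' ')
  if [ "$n_ref" -ne "$n_hyp" ]; then
    echo "line count mismatch: $ref has $n_ref, $hyp has $n_hyp" >&2
    return 1
  fi
  echo "both files have $n_ref lines"
}

# Example usage before scoring (uncomment once both files exist):
# same_line_count nmt_data/tst2013.brx output.brx &&
#   perl multi-bleu.perl nmt_data/tst2013.brx < output.brx
```

A mismatch usually means the decoder stopped early or the output file is left over from a different run.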
- Enter the English sentence you want to translate in the test.en file
- Update the models path in translate.sh so that it points to your trained model
- Generate the translation [Eng->Brx]
bash translate.sh
or
chmod +x translate.sh
./translate.sh
- View the translated Bodo sentence
gedit out.brx
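For orientation, the steps above could be wrapped roughly as follows. This is a sketch of what a translate.sh-style script might contain, not the repository's actual script: the MODEL_DIR variable is hypothetical, and the inference flags are those of the TensorFlow NMT command line. The file names test.en and out.brx come from the steps above. The command is only printed unless RUN=1 is set.

```shell
#!/usr/bin/env bash
# Hypothetical translate.sh-style wrapper; the real script may differ.
set -euo pipefail

MODEL_DIR=${MODEL_DIR:-models}   # change this to the directory holding your trained model
INPUT=test.en                    # English sentences, one per line
OUTPUT=out.brx                   # Bodo translations are written here

# TensorFlow NMT inference: load the model from --out_dir and decode the input file.
cmd=(python -m nmt.nmt
  --out_dir="$MODEL_DIR"
  --inference_input_file="$INPUT"
  --inference_output_file="$OUTPUT")

# Dry run by default: print the command, launch it only when RUN=1.
printf '%s\n' "${cmd[*]}"
if [ "${RUN:-0}" = "1" ]; then "${cmd[@]}"; fi
```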