This repository is for the following paper:
Enhancing Statistical Machine Translation for Low-Resource Languages Using Semantic Similarity
The repository includes:
- Corpora
  - Bilingual corpora: training, tuning, and test sets for four language pairs: Japanese-Vietnamese, Indonesian-Vietnamese, Malay-Vietnamese, and Filipino-Vietnamese.
- Sentence alignment
  - The Java implementation of [Moore, 2002] for sentence alignment.
- Extending word alignment by word similarity using word2vec
- Pivot translation
  - The Java implementation of [Wu and Wang, 2007].
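
For readers unfamiliar with pivot translation, the following is a minimal sketch of the triangulation idea commonly associated with [Wu and Wang, 2007]: a source-to-target phrase probability is estimated by summing over shared pivot phrases, phi(t | s) = sum over p of phi(p | s) * phi(t | p). The class name `PivotTriangulation`, the method `triangulate`, and the in-memory `Map` representation of a phrase table are assumptions made for illustration; they are not taken from the code in this repository, which may organize the computation differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of this repository's code.
public class PivotTriangulation {

    /**
     * Combines a source-to-pivot table with a pivot-to-target table.
     * Each table maps a phrase to its candidate translations and their
     * conditional probabilities, e.g. srcToPivot.get(s).get(p) = phi(p | s).
     * The result holds phi(t | s) = sum over p of phi(p | s) * phi(t | p).
     */
    public static Map<String, Map<String, Double>> triangulate(
            Map<String, Map<String, Double>> srcToPivot,
            Map<String, Map<String, Double>> pivotToTgt) {
        Map<String, Map<String, Double>> srcToTgt = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> srcEntry : srcToPivot.entrySet()) {
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Double> pivotEntry : srcEntry.getValue().entrySet()) {
                Map<String, Double> targets = pivotToTgt.get(pivotEntry.getKey());
                if (targets == null) {
                    continue; // pivot phrase is absent from the second table
                }
                for (Map.Entry<String, Double> tgtEntry : targets.entrySet()) {
                    double contribution = pivotEntry.getValue() * tgtEntry.getValue();
                    // Accumulate probability mass over every shared pivot phrase.
                    scores.merge(tgtEntry.getKey(), contribution, Double::sum);
                }
            }
            if (!scores.isEmpty()) {
                srcToTgt.put(srcEntry.getKey(), scores);
            }
        }
        return srcToTgt;
    }
}
```

Source phrases whose pivot translations never appear in the second table are dropped from the result, which is one reason pivot-derived phrase tables tend to be sparser than directly trained ones.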
[1] Moore, Robert C. "Fast and accurate sentence alignment of bilingual corpora." Conference of the Association for Machine Translation in the Americas. Springer Berlin Heidelberg, 2002.
[2] Wu, Hua, and Haifeng Wang. "Pivot language approach for phrase-based statistical machine translation." Machine Translation 21.3 (2007): 165-181.