We present an English-Tulu machine translation model developed with a transfer learning approach that exploits similarities between a high-resource language and a closely related low-resource language. The model and its training procedure are heavily inspired by NMT-Adapt and the findings of the associated paper.
We also present the first parallel dataset for English-Tulu translation. Our dataset is constructed by integrating human translations into the multilingual machine translation resource FLORES-200. We use this dataset for evaluating our translation model.
Tulu (Tcy), a member of the South Dravidian branch of the Dravidian language family, is spoken by approximately 2.5 million people in southwestern India. We use Kannada (Kan), a closely related high-resource language, for the transfer learning approach.
For more details, see our paper.
En-Kan parallel: Samanantar
Tcy monolingual: sentences extracted with wikiextractor from 1,894 articles of the Tulu Wikipedia dump archived here.
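wikiextractor wraps each extracted article in `<doc ...> ... </doc>` markers, with the title on the first line of the body. A minimal sketch of turning that output into a list of article texts (the parsing helper and sample strings are illustrative, not the exact script used):

```python
import re

def parse_wikiextractor_output(text):
    """Split wikiextractor plain-text output into per-article texts.

    Each article is wrapped in <doc ...> ... </doc>; the first
    non-empty line inside the block is the article title.
    """
    articles = []
    for m in re.finditer(r"<doc[^>]*>\n(.*?)</doc>", text, flags=re.DOTALL):
        lines = [ln.strip() for ln in m.group(1).split("\n") if ln.strip()]
        if len(lines) > 1:            # drop the title line, keep the prose
            articles.append(" ".join(lines[1:]))
    return articles

sample = (
    '<doc id="1" url="u" title="T1">\nT1\n\nFirst sentence. Second sentence.\n</doc>\n'
    '<doc id="2" url="u" title="T2">\nT2\n\nAnother article.\n</doc>\n'
)
print(parse_wikiextractor_output(sample))
# → ['First sentence. Second sentence.', 'Another article.']
```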
En-Tcy test dataset: 1,300 sentence pairs in English and Tulu, created by us from the 2,009 publicly available sentences in the FLORES-200 benchmark. This is an ongoing project.
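The construction of the test set can be sketched as follows: pair the FLORES-200 English sentences with the human Tulu translations collected so far and keep the completed pairs. The pairing logic, variable names, and placeholder strings below are illustrative assumptions, not the actual project scripts:

```python
def build_test_set(english_sents, tulu_sents, limit=1300):
    """Pair FLORES English sentences with their human Tulu translations.

    Assumes tulu_sents[i] translates english_sents[i]; sentences not yet
    translated are marked with an empty string and skipped.
    """
    pairs = [(en, tcy) for en, tcy in zip(english_sents, tulu_sents) if tcy]
    return pairs[:limit]

# Toy placeholder data (real Tulu text omitted here).
en = ["Hello.", "How are you?", "Good night."]
tcy = ["tcy-1", "", "tcy-3"]  # second sentence not yet translated
print(build_test_set(en, tcy))
# → [('Hello.', 'tcy-1'), ('Good night.', 'tcy-3')]
```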
Pre-trained model: IndicBARTSS
Tokenizer: AlbertTokenizer
YANMTT was used to follow an iterative training procedure based on NMT-Adapt as laid out in our paper.
Task 1: Fine-tune the IndicBARTSS model in the Kan-En direction on the parallel training dataset, forming the base model for Tcy-En translation.
Task 2: Fine-tune a second IndicBARTSS model on back-translated pairs, built by translating the Tcy sentences of the monolingual dataset with the model from Task 1; this forms the base model for En-Tcy translation.
Task 3: Fine-tune the model from Task 2 on the En-Kan side of the parallel dataset.
Task 4: Run a denoising autoencoding step on noised sentences, prepared from the monolingual Tcy dataset and the Kan training dataset by randomly shuffling and masking words.
Task 5: Fine-tune the En-Tcy model from Task 4 on back-translated pairs. This model is then used to generate the En side of new back-translated pairs, and that data is used to repeat Tasks 2 to 5 starting from the Tcy-En base model of Task 1.
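The word shuffling and masking used in the denoising autoencoding step can be sketched as below. The hyperparameters (mask probability, shuffle window) and the mask token are assumptions for illustration, not values taken from the paper:

```python
import random

MASK = "[MASK]"

def noise_sentence(sentence, mask_prob=0.15, shuffle_window=3, rng=None):
    """Noise a sentence for denoising autoencoding.

    Words are locally shuffled within a bounded window, then a fraction
    of them is replaced by a mask token; the model is trained to
    reconstruct the original sentence from this noised input.
    """
    rng = rng or random.Random(0)
    words = sentence.split()
    # Local shuffle: sort by position plus a bounded random offset.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(words))]
    words = [w for _, w in sorted(zip(keys, words))]
    # Random masking of surviving words.
    words = [MASK if rng.random() < mask_prob else w for w in words]
    return " ".join(words)

print(noise_sentence("the quick brown fox jumps over the lazy dog"))
```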
Evaluation was performed using sacreBLEU scores.