A Tulu-English translation model using a transfer learning approach with Kannada as a related high-resource language.


A Tulu Resource for Machine Translation

Overview

We present an English-Tulu machine translation model developed using a transfer learning approach that exploits similarities between high-resource and low-resource languages. This model and its training approach are heavily inspired by NMT-Adapt and the findings of the associated paper.

We also present the first parallel dataset for English-Tulu translation. Our dataset is constructed by integrating human translations into the multilingual machine translation resource FLORES-200. We use this dataset for evaluating our translation model.

Tulu (Tcy) belongs to the South Dravidian branch of the Dravidian language family and is spoken by approximately 2.5 million people in southwestern India. We use Kannada (Kan), a closely related high-resource language, for the transfer learning approach.

For more details, see our paper.

Requirements

Dataset

Training datasets

En-Kan parallel: Samanantar

Tcy monolingual: Sentences extracted using wikiextractor from 1,894 articles on the Tulu Wikipedia, archived here.

En-Tcy test dataset: 1,300 sentences in English and Tulu, created by us from the 2,009 publicly available sentences in the FLORES-200 benchmark. This is an ongoing project.

Model Training

Pre-trained model: IndicBARTSS
Tokenizer: AlbertTokenizer
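
The model and tokenizer can be loaded with Hugging Face Transformers. A minimal sketch, assuming the published `ai4bharat/IndicBARTSS` checkpoint and the tokenizer options recommended in the IndicBART model card:

```python
# Minimal loading sketch, assuming the published ai4bharat/IndicBARTSS checkpoint.
from transformers import AlbertTokenizer, AutoModelForSeq2SeqLM

# Tokenizer options follow the IndicBART model card: keep case and accents intact.
tokenizer = AlbertTokenizer.from_pretrained(
    "ai4bharat/IndicBARTSS", do_lower_case=False, use_fast=False, keep_accents=True
)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBARTSS")
```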

YANMTT was used to follow an iterative training procedure based on NMT-Adapt as laid out in our paper.

Task 1:

Fine-tune the IndicBARTSS model in the Kan-En direction using the parallel training dataset, forming the base model for Tcy-En translation.
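
YANMTT handles the actual training runs; purely for illustration, here is a rough sketch of the same Kan-En fine-tuning step with the Hugging Face Seq2SeqTrainer. The file name, column names, and hyperparameters are assumptions, and the `<2kn>` / `<2en>` tags follow IndicBART conventions:

```python
# Illustrative only: Kan-En fine-tuning with Hugging Face Seq2SeqTrainer.
# The project itself uses YANMTT; the file name, columns, and hyperparameters
# below are assumptions.
from datasets import load_dataset
from transformers import (AlbertTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AlbertTokenizer.from_pretrained(
    "ai4bharat/IndicBARTSS", do_lower_case=False, keep_accents=True
)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBARTSS")

# Assumed TSV with "kan" and "en" columns holding the Samanantar pairs.
data = load_dataset("csv", data_files="samanantar_en_kan.tsv", delimiter="\t")["train"]

def preprocess(example):
    # IndicBART convention: source ends with "</s> <2xx>", target starts with "<2yy>".
    inputs = tokenizer(example["kan"] + " </s> <2kn>",
                       add_special_tokens=False, truncation=True, max_length=128)
    labels = tokenizer("<2en> " + example["en"] + " </s>",
                       add_special_tokens=False, truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="task1_kan_en",
                                  per_device_train_batch_size=16,
                                  num_train_epochs=1),
    train_dataset=data.map(preprocess, remove_columns=data.column_names),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```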

Task 2:

Fine-tune a second IndicBARTSS model on back-translated data: the Tcy sentences from the monolingual dataset paired with English translations generated by the model from Task 1. This forms the base model for En-Tcy translation.
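
For illustration, the back-translation step amounts to translating each monolingual Tcy sentence into English with the Task 1 model and pairing the synthetic English with the original Tcy. A rough sketch; the model path and file names are hypothetical, and using the `<2kn>` tag for Tulu text (written in Kannada script) is an assumption:

```python
# Illustrative back-translation sketch. The model path and file names are
# hypothetical; tagging Tulu with <2kn> (Tulu is written in Kannada script)
# is an assumption.
from transformers import AlbertTokenizer, AutoModelForSeq2SeqLM

tokenizer = AlbertTokenizer.from_pretrained(
    "ai4bharat/IndicBARTSS", do_lower_case=False, keep_accents=True
)
model = AutoModelForSeq2SeqLM.from_pretrained("task1_kan_en")  # Task 1 checkpoint

with open("tcy_monolingual.txt") as src, open("synthetic_en.txt", "w") as out:
    for line in src:
        batch = tokenizer(line.strip() + " </s> <2kn>",
                          add_special_tokens=False, return_tensors="pt")
        generated = model.generate(
            **batch, num_beams=4, max_length=128,
            decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2en>"),
        )
        out.write(tokenizer.decode(generated[0], skip_special_tokens=True) + "\n")

# Each synthetic English line pairs with the original Tcy line to train En-Tcy.
```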

Task 3:

Fine-tune the model from Task 2 using En-Kan pairs from the parallel dataset.

Task 4:

Perform a denoising autoencoding step using noised sentences prepared from the monolingual Tcy dataset and the Kan training dataset by randomly shuffling and masking words.
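
A minimal sketch of this kind of noising; the masking probability, mask token, and full-sentence shuffle are illustrative assumptions, not the exact settings used in training:

```python
import random

def noise_sentence(sentence, mask_prob=0.15, mask_token="[MASK]"):
    """Return a noised copy of `sentence` for denoising autoencoding.

    Randomly masks words, then shuffles word order. The masking probability
    and mask token here are assumptions for illustration.
    """
    words = [mask_token if random.random() < mask_prob else w
             for w in sentence.split()]
    random.shuffle(words)
    return " ".join(words)

# Denoising training pairs map noise_sentence(s) (input) back to the original s (target).
```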

Task 5:

Fine-tune the En-Tcy model from Task 4 with back-translated pairs. This model is then used to generate the En side of the back-translated pairs, and this data is used to repeat Tasks 2 through 5 on the Tcy-En base model from Task 1.

Evaluation

Evaluation was done using sacreBLEU scores.
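
For example, with the sacrebleu Python package (the hypothesis and reference file names are assumptions):

```python
# Minimal sacreBLEU scoring sketch; file names are assumptions.
import sacrebleu

with open("hypotheses.en") as f:
    hypotheses = [line.strip() for line in f]
with open("references.en") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```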

Model Adaptation

Results

Usage Examples

Acknowledgements

References

Contact
