We present an English-Tulu machine translation model developed with a transfer learning approach that exploits similarities between a high-resource language and a closely related low-resource language. The model and its training procedure are heavily inspired by NMT-Adapt and the findings of the associated paper.
We also present the first parallel dataset for English-Tulu translation. Our dataset is constructed by integrating human translations into the multilingual machine translation resource FLORES-200. We use this dataset for evaluating our translation model.
Tulu (Tcy), a member of the South Dravidian branch of the Dravidian language family, is spoken by approximately 2.5 million people in southwestern India. We use Kannada (Kan), a closely related high-resource language, for the transfer learning approach.
For more details, see our paper.
En-Kan parallel: Samanantar
Tcy monolingual: sentences extracted with wikiextractor from 1,894 articles of the Tulu Wikipedia dump archived here.
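wikiextractor wraps each extracted article in `<doc ...> ... </doc>` markers, with the title on the first line of the body. A minimal sketch of turning that output into a list of article texts (the parsing helper and sample strings are illustrative, not the exact script used):

```python
import re

def parse_wikiextractor_output(text):
    """Split wikiextractor plain-text output into per-article texts.

    Each article is wrapped in <doc ...> ... </doc>; the first
    non-empty line inside the block is the article title.
    """
    articles = []
    for m in re.finditer(r"<doc[^>]*>\n(.*?)</doc>", text, flags=re.DOTALL):
        lines = [ln.strip() for ln in m.group(1).split("\n") if ln.strip()]
        if len(lines) > 1:            # drop the title line, keep the prose
            articles.append(" ".join(lines[1:]))
    return articles

sample = (
    '<doc id="1" url="u" title="T1">\nT1\n\nFirst sentence. Second sentence.\n</doc>\n'
    '<doc id="2" url="u" title="T2">\nT2\n\nAnother article.\n</doc>\n'
)
print(parse_wikiextractor_output(sample))
# → ['First sentence. Second sentence.', 'Another article.']
```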
En-Tcy test dataset: 1,300 sentence pairs in English and Tulu, created by us from the 2,009 publicly available sentences in the FLORES-200 benchmark. This is an ongoing project.
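The construction of the test set can be sketched as follows: pair the FLORES-200 English sentences with the human Tulu translations collected so far and keep the completed pairs. The pairing logic, variable names, and placeholder strings below are illustrative assumptions, not the actual project scripts:

```python
def build_test_set(english_sents, tulu_sents, limit=1300):
    """Pair FLORES English sentences with their human Tulu translations.

    Assumes tulu_sents[i] translates english_sents[i]; sentences not yet
    translated are marked with an empty string and skipped.
    """
    pairs = [(en, tcy) for en, tcy in zip(english_sents, tulu_sents) if tcy]
    return pairs[:limit]

# Toy placeholder data (real Tulu text omitted here).
en = ["Hello.", "How are you?", "Good night."]
tcy = ["tcy-1", "", "tcy-3"]  # second sentence not yet translated
print(build_test_set(en, tcy))
# → [('Hello.', 'tcy-1'), ('Good night.', 'tcy-3')]
```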
Pre-trained model: IndicBARTSS
Tokenizer: AlbertTokenizer
YANMTT was used to follow an iterative training procedure based on NMT-Adapt as laid out in our paper.
Task 1: Fine-tune the IndicBARTSS model in the Kan-En direction on the parallel training dataset, forming the base model for Tcy-En translation.
Task 2: Fine-tune a second IndicBARTSS model on back-translated pairs, built by translating the Tcy sentences of the monolingual dataset with the model from Task 1; this forms the base model for En-Tcy translation.
Task 3: Fine-tune the model from Task 2 on the En-Kan side of the parallel dataset.
Task 4: Run a denoising autoencoding step on noised sentences, prepared from the monolingual Tcy dataset and the Kan training dataset by randomly shuffling and masking words.
Task 5: Fine-tune the En-Tcy model from Task 4 on back-translated pairs. This model is then used to generate the En side of new back-translated pairs, and that data is used to repeat Tasks 2 to 5 starting from the Tcy-En base model of Task 1.
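The word shuffling and masking used in the denoising autoencoding step can be sketched as below. The hyperparameters (mask probability, shuffle window) and the mask token are assumptions for illustration, not values taken from the paper:

```python
import random

MASK = "[MASK]"

def noise_sentence(sentence, mask_prob=0.15, shuffle_window=3, rng=None):
    """Noise a sentence for denoising autoencoding.

    Words are locally shuffled within a bounded window, then a fraction
    of them is replaced by a mask token; the model is trained to
    reconstruct the original sentence from this noised input.
    """
    rng = rng or random.Random(0)
    words = sentence.split()
    # Local shuffle: sort by position plus a bounded random offset.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(words))]
    words = [w for _, w in sorted(zip(keys, words))]
    # Random masking of surviving words.
    words = [MASK if rng.random() < mask_prob else w for w in words]
    return " ".join(words)

print(noise_sentence("the quick brown fox jumps over the lazy dog"))
```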
Evaluation was performed using sacreBLEU scores.