
T3L: Translate-and-test transfer learning for cross-lingual text classification

The code repository for the research paper published in the Transactions of the Association for Computational Linguistics (TACL).

Installation

Python version > 3.6.

Install required packages with pip:

pip install -r requirements.txt

# Install our adapted transformers package
cd t3l/transformers
pip install -e .
cd ../..
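
To confirm that the locally adapted transformers package is the one being picked up (an optional sanity check, not part of the original instructions), you can print its install location:

python -c "import transformers; print(transformers.__file__)"
# The printed path should point inside t3l/transformers if the editable install succeeded.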

Data

The XNLI and MultiEURLEX datasets, their respective few-shot 10 and few-shot 100 data splits, and the TED talk translation corpus used in the paper to fine-tune the bg-en, el-en and sw-en MT-LM models can be found here.

Downloading the MLDoc data requires approval from TREC and following the script provided by the MLDoc paper authors to extract the benchmark dataset.

Getting started

The following commands show how to fine-tune the different models described in the paper.

Fine-tune MT-LM

python -m scripts.train_translation --train_data $TRAIN \
                                    --validation_data $DEV \
                                    --test_data $TEST \
                                    --src $SRC \
                                    --tgt $TGT \
                                    --save_dir $OUT \
                                    --tokenizer $PRETRAINED_LM \
                                    --model_lm_path $PRETRAINED_LM \
                                    --batch_size $BATCH_SIZE \
                                    --grad_accum $GRAD_ACCUM \
                                    --gpus $NUM_GPUS \
                                    --epochs $EPOCHS
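
As an illustration only, the placeholders above could be filled in as follows for a Swahili-to-English MT-LM run; the data paths, mBART-50 language codes, pretrained model name, and hyperparameter values are assumptions, not values prescribed by this repository or the paper:

# Illustrative placeholder values; adapt to your own data paths and checkpoints
export TRAIN=data/ted/sw-en/train.json
export DEV=data/ted/sw-en/dev.json
export TEST=data/ted/sw-en/test.json
export SRC=sw_KE
export TGT=en_XX
export OUT=checkpoints/mt_lm_sw_en
export PRETRAINED_LM=facebook/mbart-large-50-many-to-many-mmt
export BATCH_SIZE=8
export GRAD_ACCUM=4
export NUM_GPUS=1
export EPOCHS=5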

Fine-tune TC-LM

python -m scripts.train_lm --task XNLI \ 
                           --train_data $TRAIN \
                           --validation_data $DEV \
                           --test_data $TEST \
                           --save_dir $OUT \
                           --batch_size $BATCH_SIZE \
                           --grad_accum $GRAD_ACCUM \
                           --lr $LEARNING_RATE \
                           --gpus $NUM_GPUS \
                           --tokenizer $PRETRAINED_EN \
                           --model_lm_path $PRETRAINED_EN \
                           --epochs $EPOCH

T3L

python -m scripts.training_t3l --task XNLI \
                               --train_data $TRAIN \
                               --validation_data $DEV \
                               --test_data $TEST \
                               --src $SRC \
                               --tgt $TGT \
                               --save_dir $OUT \
                               --batch_size $BATCH_SIZE \
                               --grad_accum $GRAD_ACCUM \
                               --freeze_strategy fix_nmt_dec_lm_enc \
                               --gpus $NUM_GPUS \
                               --tokenizer $PRETRAINED_LM_EN \
                               --max_input_len $MAX_SEQ_LEN_INPUT \
                               --max_output_len $MAX_SEQ_LEN_OUTPUT \
                               --warmup $WARMUP \
                               --lr $LEARNING_RATE \
                               --model_lm_path $PRETRAINED_LM_EN \
                               --model_seq2seq_path $PRETRAINED_MT \
                               --epochs $EPOCHS

Testing models

To perform inference (testing) over the test sets with the trained models, simply add the --test flag and the --from_pretrained argument, providing the path to the trained checkpoint.
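
For instance, testing a trained T3L model on XNLI might look as follows, keeping the same arguments as the training command and appending the two extra ones (the $TRAINED_CKPT placeholder is illustrative and should point to your saved checkpoint file):

python -m scripts.training_t3l --task XNLI \
                               --train_data $TRAIN \
                               --validation_data $DEV \
                               --test_data $TEST \
                               --src $SRC \
                               --tgt $TGT \
                               --save_dir $OUT \
                               --batch_size $BATCH_SIZE \
                               --grad_accum $GRAD_ACCUM \
                               --freeze_strategy fix_nmt_dec_lm_enc \
                               --gpus $NUM_GPUS \
                               --tokenizer $PRETRAINED_LM_EN \
                               --max_input_len $MAX_SEQ_LEN_INPUT \
                               --max_output_len $MAX_SEQ_LEN_OUTPUT \
                               --warmup $WARMUP \
                               --lr $LEARNING_RATE \
                               --model_lm_path $PRETRAINED_LM_EN \
                               --model_seq2seq_path $PRETRAINED_MT \
                               --epochs $EPOCHS \
                               --test \
                               --from_pretrained $TRAINED_CKPT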

Citation

Please cite our paper in your work:

@article{10.1162/tacl_a_00593,
    author = {Jauregi Unanue, Inigo and Haffari, Gholamreza and Piccardi, Massimo},
    title = "{T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {11},
    pages = {1147-1161},
    year = {2023},
    month = {09},
    abstract = "{Cross-lingual text classification leverages text classifiers trained in a high-resource language to perform text classification in other languages with no or minimal fine-tuning (zero/ few-shots cross-lingual transfer). Nowadays, cross-lingual text classifiers are typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest. However, the performance of these models varies significantly across languages and classification tasks, suggesting that the superposition of the language modelling and classification tasks is not always effective. For this reason, in this paper we propose revisiting the classic “translate-and-test” pipeline to neatly separate the translation and classification stages. The proposed approach couples 1) a neural machine translator translating from the targeted language to a high-resource language, with 2) a text classifier trained in the high-resource language, but the neural machine translator generates “soft” translations to permit end-to-end backpropagation during fine-tuning of the pipeline. Extensive experiments have been carried out over three cross-lingual text classification datasets (XNLI, MLDoc, and MultiEURLEX), with the results showing that the proposed approach has significantly improved performance over a competitive baseline.}",
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00593},
    url = {https://doi.org/10.1162/tacl\_a\_00593},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00593/2159097/tacl\_a\_00593.pdf},
}
