ZeroSwot

This repository contains the code for the paper "Pushing the Limits of Zero-shot End-to-End Speech Translation".

The paper is available in the Findings of ACL 2024.

Installation

Set the following environment variables:

export FAIRSEQ_ROOT=...     # path to fairseq repo
export ZS_ROOT=...          # path to zeroswot repo
export MODELS_ROOT=...      # where pretrained models (wav2vec 2.0, nllb) are stored
export SAVE_DIR=...         # where the models will be saved
export DATA_ROOT=...        # where the data is stored
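
Every later command depends on these five variables, so it can help to confirm they are all set before continuing. The following is a minimal sketch; `check_env` is a hypothetical helper and not part of this repository. It only checks that each variable is non-empty, not that the paths exist (some of them are created by later steps).

```shell
# check_env is a hypothetical helper, not part of the ZeroSwot repository.
# It prints every required variable that is unset or empty and returns
# non-zero if at least one is missing. The body runs in a subshell, so the
# temporary variables do not leak into the calling shell.
check_env() (
    missing=0
    for name in FAIRSEQ_ROOT ZS_ROOT MODELS_ROOT SAVE_DIR DATA_ROOT; do
        # Read the value of the variable whose name is stored in $name
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "missing: ${name}"
            missing=1
        fi
    done
    [ "${missing}" -eq 0 ]
)
```

Run `check_env` after the exports; it prints nothing and returns 0 when all five variables are set.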

Clone the repository and install the dependencies:

git clone https://github.com/mt-upc/ZeroSwot.git ${ZS_ROOT}
conda env create -f ${ZS_ROOT}/environment.yml
conda activate zeroswot
source ${ZS_ROOT}/constants.sh

Install the fairseq fork with the zeroswot branch:

git clone -b zeroswot https://github.com/mt-upc/fairseq.git ${FAIRSEQ_ROOT}
pip install --editable ${FAIRSEQ_ROOT}
export PYTHONPATH=${FAIRSEQ_ROOT}:${FAIRSEQ_ROOT}/examples:${ZS_ROOT}:${PYTHONPATH}

Download pretrained CTC and MT models

Our models are based on pretrained CTC Encoders and MT models. We use wav2vec 2.0 and NLLB, but other CTC and MT models can be used with some modifications to the code. The models are stored at ${MODELS_ROOT}.

Download the CTC-finetuned wav2vec 2.0 model and the letter dictionary:

mkdir -p ${MODELS_ROOT}/wav2vec2
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_960h_pl.pt -O ${MODELS_ROOT}/wav2vec2/wav2vec_vox_960h_pl.pt
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt -O ${MODELS_ROOT}/wav2vec2/dict.ltr.txt

Download the two distilled NLLB models, 600M (Med) and 1.3B (Lrg), the spm tokenizer, and the dictionary, then copy the full dictionary and the language dictionary from this repository:

mkdir -p ${MODELS_ROOT}/nllb
wget https://tinyurl.com/nllb200densedst600mcheckpoint -O ${MODELS_ROOT}/nllb/nllb-200-distilled-600M.pt
wget https://tinyurl.com/nllb200densedst1bcheckpoint -O ${MODELS_ROOT}/nllb/nllb-200-distilled-1.3B.pt
wget https://tinyurl.com/flores200sacrebleuspm -O ${MODELS_ROOT}/nllb/flores200_sacrebleu_tokenizer_spm.model
wget https://tinyurl.com/nllb200dictionary -O ${MODELS_ROOT}/nllb/dictionary.txt
cp ${ZS_ROOT}/mt/dictionary.full.txt ${MODELS_ROOT}/nllb/dictionary.full.txt
cp ${ZS_ROOT}/mt/lang_dict.txt ${MODELS_ROOT}/nllb/lang_dict.txt
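
Large downloads can fail silently and leave an empty or missing file. The following sketch checks that every file named in the commands above is present and non-empty. `check_models` is a hypothetical helper and not part of this repository; it does not verify checksums or file contents.

```shell
# check_models is a hypothetical helper, not part of the ZeroSwot repository.
# Usage: check_models "${MODELS_ROOT}"
# It reports each expected file that is missing or empty and returns non-zero
# if any check fails. The file list mirrors the download and copy commands above.
check_models() (
    root="$1"
    missing=0
    for rel in \
        wav2vec2/wav2vec_vox_960h_pl.pt \
        wav2vec2/dict.ltr.txt \
        nllb/nllb-200-distilled-600M.pt \
        nllb/nllb-200-distilled-1.3B.pt \
        nllb/flores200_sacrebleu_tokenizer_spm.model \
        nllb/dictionary.txt \
        nllb/dictionary.full.txt \
        nllb/lang_dict.txt
    do
        # -s is true only for files that exist and have a size greater than zero
        if [ ! -s "${root}/${rel}" ]; then
            echo "missing or empty: ${rel}"
            missing=1
        fi
    done
    [ "${missing}" -eq 0 ]
)
```

Calling `check_models "${MODELS_ROOT}"` prints nothing when all eight files are in place.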

Data

Download and prepare the tsv files for MUSTC/CoVoST2 with raw waveforms as input at ${DATA_ROOT}/st. You can find more details at zs_st/README.md. If you have already prepared the data for a different project, make sure to replace each tgt_lang value with the corresponding code used by NLLB (e.g. de -> deu_Latn).
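
For data prepared elsewhere, the language-code replacement can be sketched with awk. This assumes that the tsv file is tab-separated and that its first row is a header containing a tgt_lang column; check this against your own files before using it. `map_tgt_lang` is a hypothetical helper and not part of this repository.

```shell
# map_tgt_lang is a hypothetical helper, not part of the ZeroSwot repository.
# Usage: map_tgt_lang OLD_CODE NLLB_CODE < input.tsv > output.tsv
# Assumption: tab-separated input whose first row is a header with a
# tgt_lang column. Exits with status 1 if that column is not found.
map_tgt_lang() {
    awk -F '\t' -v OFS='\t' -v old="$1" -v new="$2" '
        NR == 1 {
            # Locate the tgt_lang column from the header row
            for (i = 1; i <= NF; i++) if ($i == "tgt_lang") col = i
            if (!col) exit 1
            print
            next
        }
        # Rewrite only exact matches in that column; other fields stay untouched
        $col == old { $col = new }
        { print }
    '
}
```

For example, `map_tgt_lang de deu_Latn < old.tsv > new.tsv` rewrites German rows, where old.tsv and new.tsv are placeholder file names. Writing to a new file rather than editing in place keeps the original available for comparison.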

Model Weights and Usage

We trained Speech Encoders using ASR data with Subword Compression and Optimal Transport in order to adapt them to the representation space of multilingual MT models. The MT models are either the original NLLB models (600M/1.3B) or finetuned versions on MUSTC/CoVoST2. The weights for the Speech Encoders and MT models are available for download below.

Download and extract the weights of a Speech Encoder at ${SAVE_DIR}/speech_encoders. The training scripts for the Speech Encoders can be found at zs_st/README.md.

ASR Data	Adapted to	Link	MUSTC ZS-ST BLEU	CoVoST2 ZS-ST BLEU
MUSTC	NLLB-600M (original)	Download	29.6	/
MUSTC	NLLB-600M (MUSTC)	Download	31.9	/
MUSTC	NLLB-1.3B (original)	Download	31.4	/
MUSTC	NLLB-1.3B (MUSTC)	Download	32.9	/
CommonVoice	NLLB-600M (original)	Download	26.0	23.1
CommonVoice	NLLB-600M (CoVoST2)	Download	/	30.2
CommonVoice	NLLB-1.3B (original)	Download	27.4	25.5
CommonVoice	NLLB-1.3B (CoVoST2)	Download	/	31.2

If you want to use one of the Speech Encoders that was adapted to a finetuned NLLB, download and extract the weights of the corresponding MT model at ${SAVE_DIR}/mt_models. The training scripts for the MT models can be found at mt/README.md.

Model	MT Data	Link	MUSTC MT BLEU	CoVoST2 MT BLEU
NLLB-600M	MUSTC En-X	Download	35.9	/
NLLB-600M	CoVoST2 En-X	Download	/	35.0
NLLB-1.3B	MUSTC En-X	Download	37.2	/
NLLB-1.3B	CoVoST2 En-X	Download	/	36.1

Due to the size of the models, we cannot host all the experiments done in the paper. If you need the weights of another model from the paper, please open an issue and we will provide you with the download link.

Given a Speech Encoder and an MT model, you can build a ZeroSwot model for Speech Translation as follows. The script replaces the MT Embedding layer with the newly trained Speech Encoder.

bash ${ZS_ROOT}/zs_st/utils/construct_model.sh $path_to_speech_encoder $path_to_mt_model

$path_to_speech_encoder should point to the directory of the experiment (i.e. ${exp_path}/ckpts/avg_best_10_checkpoint.pt), while $path_to_mt_model should point directly to the .pt checkpoint file of the MT model. This will create the ZeroSwot checkpoint in ${exp_path}/ckpts_zs.

To use the model for zero-shot ST inference, first prepare the test sets of MUSTC or CoVoST2 as described in zs_st/README.md, then run the following command. Here $dataset_name is either MUSTC_v1.0 or CoVoST2, $dataset_split is the split of the dataset (e.g. tst-COMMON), and $tgt_lang is the target language of the translation (e.g. de):

bash ${ZS_ROOT}/zs_st/eval.sh $path_to_speech_encoder $path_to_mt_model $dataset_name $dataset_split $tgt_lang
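
To evaluate several target languages, the command above can be repeated once per language. The sketch below only prints the commands so they can be inspected before running. `print_eval_cmds` is a hypothetical helper and not part of this repository, and it assumes that none of the paths contain whitespace.

```shell
# print_eval_cmds is a hypothetical helper, not part of the ZeroSwot repository.
# Usage: print_eval_cmds SPEECH_ENCODER MT_MODEL DATASET_NAME DATASET_SPLIT LANG...
# It prints one eval.sh invocation per target language, using the same
# argument order as the command above. Assumes paths without whitespace.
print_eval_cmds() (
    speech_encoder="$1"
    mt_model="$2"
    dataset_name="$3"
    dataset_split="$4"
    shift 4
    # Every remaining argument is treated as a target language
    for tgt_lang in "$@"; do
        printf 'bash %s/zs_st/eval.sh %s %s %s %s %s\n' \
            "${ZS_ROOT}" "${speech_encoder}" "${mt_model}" \
            "${dataset_name}" "${dataset_split}" "${tgt_lang}"
    done
)
```

For example, `print_eval_cmds $path_to_speech_encoder $path_to_mt_model MUSTC_v1.0 tst-COMMON de es fr` lists three commands (es and fr are illustrative choices); append `| bash` to run them one after another.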

Training and More

To train a new Speech Encoder using our method, refer to zs_st/README.md for more details.

For finetuning the NLLB models on MUSTC or CoVoST2, refer to mt/README.md for more details.

For our experiments regarding Cascade ST, refer to cascade_st/README.md for more details.

We also provide scripts for supervised ST finetuning of our ZeroSwot models; refer to supervised_st/README.md for more details.

Citation

If you use this code or the models in your research, please cite this work as:

@inproceedings{tsiamas-etal-2024-pushing,
    title = {{Pushing the Limits of Zero-shot End-to-End Speech Translation}},
    author = "Tsiamas, Ioannis  and
      G{\'a}llego, Gerard  and
      Fonollosa, Jos{\'e}  and
      Costa-juss{\`a}, Marta",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.847",
    pages = "14245--14267",
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
