This is the official code for the paper titled "An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference" (EMNLP 2024 Findings). For reproduction, please refer to Reproduction.
- Python 3.10 or later
- PyTorch v2.1.0 or later
- transformers==4.35.0.dev0
- peft==0.6.2
- datasets==2.15.0
- evaluate==0.4.1
- bitsandbytes==0.41.2.post2
- scipy==1.11.4
- scikit-learn==1.3.2
- sentencepiece
- seaborn==0.13.0
- fasttext: Please visit https://github.com/facebookresearch/fastText to install this package.
- jupyterlab
- sumeval
- janome
- protobuf==4.25.1
- entmax==1.1
- fastdist==1.1.6
- dynamic_embedding_pruning==0.0.1
- rouge-score==0.1.2
- numba==0.58.1
- tensorboardX==2.6.2.2
- pyarabic==0.6.15
- rouge==1.0.1
After manually installing PyTorch, transformers, and fasttext, run the following:
```bash
pip install -r requirements.txt
```
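As a quick sanity check, a minimal Python snippet along these lines (illustrative only; adjust the package list to whatever you installed from `requirements.txt`) can confirm that the pinned versions and a working PyTorch setup are in place:

```python
# Illustrative environment check; adjust the package list to your setup.
import importlib.metadata as metadata

import fasttext  # installed manually from the fastText repository
import torch

# A few of the packages pinned in requirements.txt that the experiments depend on.
for pkg in ["transformers", "peft", "datasets", "evaluate", "bitsandbytes"]:
    print(f"{pkg}: {metadata.version(pkg)}")

print(f"torch: {torch.__version__} (CUDA available: {torch.cuda.is_available()})")
print(f"fasttext loaded from: {fasttext.__file__}")
```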
Follow these steps in order:
1. See Preprocessing.
2. See Adaptation.
3. See Tuning.
4. See Evaluation.
All models are available on the Hugging Face Model Hub. In the tables below, de, ja, ar, and sw denote German, Japanese, Arabic, and Swahili, respectively.
| Approach | BLOOM-1B | BLOOM-7B | TigerBot-7B | Mistral-7B |
|---|---|---|---|---|
| LAPT | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| Random | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| CLP | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| Heuristics | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| FOCUS | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| CLP+ | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
We also release some TigerBot-7B and Mistral-7B models whose output layer is initialized with the corresponding vocabulary initialization method rather than randomly.
| Approach | TigerBot-7B | Mistral-7B |
|---|---|---|
| Heuristics | de/ja/ar/sw | de/ja/ar/sw |
| CLP+ | de/ja/ar/sw | de/ja/ar/sw |
The pre-trained fastText weights used for FOCUS initialization are uploaded together with the BLOOM-1B FOCUS models.
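The released checkpoints can be pulled from the Hub with `transformers`. The sketch below is only illustrative: the repository ID is a placeholder (substitute the Hub ID of the adapted model you want, i.e. the combination of approach, base model, and target language), and depending on how a given checkpoint is packaged (e.g. as a PEFT adapter) the loading code may differ:

```python
# Illustrative sketch: the repository ID below is a placeholder, not a real model.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the Hub ID of the adapted model you want
# (approach x base model x target language).
model_id = "<hub-username>/<adapted-model-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)  # for Arabic, see the note below
model = AutoModelForCausalLM.from_pretrained(model_id)
```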
Note that the adapted tokenizer for each language was obtained from the following sources:
- German: https://huggingface.co/malteos/gpt2-xl-wechsel-german
- Japanese: https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo
- Arabic: https://huggingface.co/aubmindlab/aragpt2-base
- Swahili: https://huggingface.co/benjamin/gpt2-wechsel-swahili
Due to the license restrictions of the Arabic tokenizer, it is not bundled with the corresponding adapted models. To use those models, please download the tokenizer beforehand from the link above.
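For example, to use one of the adapted Arabic models, the tokenizer can be fetched from the repository linked above and paired with the model. This is a minimal sketch only; the adapted-model ID is a placeholder:

```python
# Illustrative sketch: the adapted-model ID below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

# The Arabic tokenizer is not bundled with the adapted models,
# so load it from its original repository (linked above).
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/aragpt2-base")

# Replace with the Hub ID of the adapted Arabic model you want to use.
model = AutoModelForCausalLM.from_pretrained("<hub-username>/<adapted-arabic-model>")

inputs = tokenizer("مرحبا بالعالم", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```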
If you find this work useful, please cite the following:
```bibtex
@article{yamaguchi2024empirical,
  title={An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference},
  author={Atsuki Yamaguchi and Aline Villavicencio and Nikolaos Aletras},
  journal={ArXiv},
  year={2024},
  volume={abs/2402.10712},
  url={https://arxiv.org/abs/2402.10712}
}
```