This is the official code for the paper titled "An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference" (EMNLP 2024 Findings). For reproduction, please refer to Reproduction.
- Python 3.10 or later
- PyTorch v2.1.0 or later
- transformers==4.35.0.dev0
- peft==0.6.2
- datasets==2.15.0
- evaluate==0.4.1
- bitsandbytes==0.41.2.post2
- scipy==1.11.4
- scikit-learn==1.3.2
- sentencepiece
- seaborn==0.13.0
- fasttext: Please visit https://github.com/facebookresearch/fastText to install this package.
- jupyterlab
- sumeval
- janome
- protobuf==4.25.1
- entmax==1.1
- fastdist==1.1.6
- dynamic_embedding_pruning==0.0.1
- rouge-score==0.1.2
- numba==0.58.1
- tensorboardX==2.6.2.2
- pyarabic==0.6.15
- rouge==1.0.1
After manually installing PyTorch, transformers, and fasttext, run the following:
```bash
pip install -r requirements.txt
```
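As a quick sanity check, a minimal Python snippet along these lines (illustrative only; adjust the package list to whatever you installed from `requirements.txt`) can confirm that the pinned versions and a working PyTorch setup are in place:

```python
# Illustrative environment check; adjust the package list to your setup.
import importlib.metadata as metadata

import fasttext  # installed manually from the fastText repository
import torch

# A few of the packages pinned in requirements.txt that the experiments depend on.
for pkg in ["transformers", "peft", "datasets", "evaluate", "bitsandbytes"]:
    print(f"{pkg}: {metadata.version(pkg)}")

print(f"torch: {torch.__version__} (CUDA available: {torch.cuda.is_available()})")
print(f"fasttext loaded from: {fasttext.__file__}")
```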
Follow these steps in order:
1. See Preprocessing.
2. See Adaptation.
3. See Tuning.
4. See Evaluation.
All models are available on the Hugging Face Model Hub. In the tables below, de, ja, ar, and sw denote German, Japanese, Arabic, and Swahili, respectively.
| Approach | BLOOM-1B | BLOOM-7B | TigerBot-7B | Mistral-7B |
|---|---|---|---|---|
| LAPT | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| Random | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| CLP | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| Heuristics | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| FOCUS | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| CLP+ | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
We also release some TigerBot-7B and Mistral-7B models whose output layer is initialized with the corresponding vocabulary initialization method rather than randomly.
| Approach | TigerBot-7B | Mistral-7B |
|---|---|---|
| Heuristics | de/ja/ar/sw | de/ja/ar/sw |
| CLP+ | de/ja/ar/sw | de/ja/ar/sw |
The pre-trained fastText weights used for FOCUS initialization are uploaded together with the BLOOM-1B FOCUS models.
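The released checkpoints can be pulled from the Hub with `transformers`. The sketch below is only illustrative: the repository ID is a placeholder (substitute the Hub ID of the adapted model you want, i.e. the combination of approach, base model, and target language), and depending on how a given checkpoint is packaged (e.g. as a PEFT adapter) the loading code may differ:

```python
# Illustrative sketch: the repository ID below is a placeholder, not a real model.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the Hub ID of the adapted model you want
# (approach x base model x target language).
model_id = "<hub-username>/<adapted-model-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)  # for Arabic, see the note below
model = AutoModelForCausalLM.from_pretrained(model_id)
```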
Note that the adapted tokenizer for each language was obtained from the following sources:
- German: https://huggingface.co/malteos/gpt2-xl-wechsel-german
- Japanese: https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo
- Arabic: https://huggingface.co/aubmindlab/aragpt2-base
- Swahili: https://huggingface.co/benjamin/gpt2-wechsel-swahili
Due to the license restrictions of the Arabic tokenizer, it is not bundled with the corresponding adapted models. To use those models, please download the tokenizer beforehand from the link above.
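For example, to use one of the adapted Arabic models, the tokenizer can be fetched from the repository linked above and paired with the model. This is a minimal sketch only; the adapted-model ID is a placeholder:

```python
# Illustrative sketch: the adapted-model ID below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

# The Arabic tokenizer is not bundled with the adapted models,
# so load it from its original repository (linked above).
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/aragpt2-base")

# Replace with the Hub ID of the adapted Arabic model you want to use.
model = AutoModelForCausalLM.from_pretrained("<hub-username>/<adapted-arabic-model>")

inputs = tokenizer("مرحبا بالعالم", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```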
If you find this work useful, please cite the following:
```bibtex
@article{yamaguchi2024empirical,
  title={An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference},
  author={Atsuki Yamaguchi and Aline Villavicencio and Nikolaos Aletras},
  journal={ArXiv},
  year={2024},
  volume={abs/2402.10712},
  url={https://arxiv.org/abs/2402.10712}
}
```