An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference

This is the official code for the paper "An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference" (EMNLP 2024 Findings). To reproduce our results, please refer to the Reproduction section below.

Requirements

  • Python 3.10 or later
  • PyTorch v2.1.0 or later
  • transformers==4.35.0.dev0
  • peft==0.6.2
  • datasets==2.15.0
  • evaluate==0.4.1
  • bitsandbytes==0.41.2.post2
  • scipy==1.11.4
  • scikit-learn==1.3.2
  • sentencepiece
  • seaborn==0.13.0
  • fasttext: Please visit https://github.com/facebookresearch/fastText to install this package.
  • jupyterlab
  • sumeval
  • janome
  • protobuf==4.25.1
  • entmax==1.1
  • fastdist==1.1.6
  • dynamic_embedding_pruning==0.0.1
  • rouge-score==0.1.2
  • numba==0.58.1
  • tensorboardX==2.6.2.2
  • pyarabic==0.6.15
  • rouge==1.0.1

Installation

After manually installing PyTorch, transformers, and fasttext, please run the following.

pip install -r requirements.txt

Reproduction

1. Preprocessing

See Preprocessing.

2. Target Model Initialization

See Adaptation.

3. LAPT

See Tuning.

4. Evaluation

See Evaluation.

Adapted Models

All models are available on the Hugging Face Model Hub.

| Approach   | BLOOM-1B    | BLOOM-7B    | TigerBot-7B | Mistral-7B  |
|------------|-------------|-------------|-------------|-------------|
| LAPT       | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| Random     | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| CLP        | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| Heuristics | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| FOCUS      | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
| CLP+       | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw | de/ja/ar/sw |
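
For example, an adapted model and its tokenizer can be loaded with the transformers library as in the minimal sketch below. The repository ID is a placeholder; substitute the Hub ID of the adapted model you want to use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository ID: replace it with the Hub ID of the adapted model
# you want to use (e.g. one of the BLOOM / TigerBot / Mistral variants above).
repo_id = "your-username/adapted-model-de"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Simple generation check (placeholder German prompt).
inputs = tokenizer("Berlin ist die Hauptstadt von", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```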

+ Output projection layer initialization

We also release TigerBot-7B and Mistral-7B models whose output projection layer is initialized with the corresponding vocabulary initialization method rather than randomly (a conceptual sketch follows the table below).

| Approach   | TigerBot-7B | Mistral-7B  |
|------------|-------------|-------------|
| Heuristics | de/ja/ar/sw | de/ja/ar/sw |
| CLP+       | de/ja/ar/sw | de/ja/ar/sw |
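
Conceptually, these "+" variants differ from the base variants only in how the output projection is filled after resizing, as sketched below. This is not the repository's actual implementation: the tokenizer is a placeholder for an adapted target-language tokenizer, and the two matrices stand in for matrices produced by one of the initialization methods.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Conceptual sketch only. The tokenizer below is a placeholder for an adapted
# target-language tokenizer, and the two matrices stand in for matrices produced
# by a vocabulary initialization method (Random / CLP / Heuristics / FOCUS).
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
target_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

vocab_size, hidden_size = len(target_tokenizer), model.config.hidden_size
new_input_embeddings = torch.randn(vocab_size, hidden_size)
new_output_embeddings = torch.randn(vocab_size, hidden_size)

model.resize_token_embeddings(vocab_size)
with torch.no_grad():
    # All variants: initialize the input embeddings with the method's matrix.
    model.get_input_embeddings().weight.copy_(new_input_embeddings)
    # "+" variants only: also initialize the output projection with the
    # method's matrix instead of leaving it randomly initialized.
    model.get_output_embeddings().weight.copy_(new_output_embeddings)
```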

fastText weights

Pre-trained fastText weights, used for FOCUS initialization, are uploaded with BLOOM-1B FOCUS models.
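
They can be fetched from the Hub and loaded with the fasttext package roughly as follows; the repository ID and filename are placeholders, so point them at the actual BLOOM-1B FOCUS model repository and its weight file.

```python
import fasttext
from huggingface_hub import hf_hub_download

# Placeholders: set these to the actual BLOOM-1B FOCUS model repository on the
# Hub and the name of the uploaded fastText weight file.
weights_path = hf_hub_download(
    repo_id="your-username/bloom-1b-focus-de",
    filename="fasttext_model.bin",
)

ft_model = fasttext.load_model(weights_path)
print(ft_model.get_dimension())  # dimensionality of the fastText word vectors
```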

License

MIT License

Adapted Tokenizer

Note that the adapted tokenizer for each language was obtained from the following sources:

Due to the license restrictions of the Arabic tokenizer, it is excluded from each corresponding adapted model. To use it, please download the tokenizer beforehand from the link above.
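
After downloading, the tokenizer can be loaded from its local path and paired with the corresponding adapted model, for example (the local path and repository ID below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: local directory of the separately downloaded Arabic tokenizer
# and the Hub ID of the corresponding adapted Arabic model.
tokenizer = AutoTokenizer.from_pretrained("./arabic-tokenizer")
model = AutoModelForCausalLM.from_pretrained("your-username/adapted-model-ar")
```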

Citation

If you find this work useful, please cite the following:

@article{yamaguchi2024empirical,
  title={An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference}, 
  author={Atsuki Yamaguchi and Aline Villavicencio and Nikolaos Aletras},
  journal={ArXiv},
  year={2024},
  volume={abs/2402.10712},
  url={https://arxiv.org/abs/2402.10712}
}
