Vocabulary Expansion for Low-resource Cross-lingual Transfer

This is the official code for the paper "Vocabulary Expansion for Low-resource Cross-lingual Transfer". To reproduce the results, see the Reproduction section.

Requirements

Python 3.11.7 or later
CUDA 11.8
torch==2.2.1
transformers==4.39.0.dev0
jupyterlab==4.1.2
peft==0.8.2
datasets==2.17.1
evaluate==0.4.1
bitsandbytes==0.42.0
scikit-learn==1.4.1.post1
seaborn==0.13.2
sumeval==0.2.2
janome==0.5.0
protobuf==4.25.1
entmax==1.3
fastdist==1.1.6
rouge-score==0.1.2
numba==0.59.0
tensorboardX==2.6.2.2
tensorboard==2.16.2
torch_tb_profiler==0.4.3
pyarabic==0.6.15
rouge==1.0.1
huggingface-hub==0.21.1
zstandard==0.22.0
lm_eval==0.4.2
lighteval==0.4.0
openai==1.25.0
tiktoken==0.6.0
fasttext==0.9.2 (built from source; see Installation)

Installation

After manually installing PyTorch and transformers, run the following commands. The last three build and install fastText from source.

pip install -r requirements.txt
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip install .
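After installation, it can help to confirm that the environment matches the pinned versions in the Requirements list. The following is a minimal sketch, not part of the repository: the names `PINS`, `installed_version`, and `check_pins` are hypothetical, and only a few pins from the list are included for illustration. It uses the standard-library `importlib.metadata` module and ignores local build suffixes such as `+cu118`, which some PyTorch builds append to the version string.

```python
from importlib import metadata

# A few pins copied from the Requirements list; extend as needed.
# (Hypothetical helper, not shipped with the repository.)
PINS = {
    "torch": "2.2.1",
    "peft": "0.8.2",
    "datasets": "2.17.1",
    "fasttext": "0.9.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every package that misses its pin."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        # Drop a local build suffix such as "+cu118" before comparing.
        base = found.split("+", 1)[0] if found else None
        if base != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in check_pins(PINS).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

Running the script prints one line per package whose installed version differs from its pin and prints nothing when every listed pin is satisfied.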

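Reproduction

Reproduction consists of four steps:

1. Preprocessing
2. Target Model Initialization
3. Language Adaptive Pre-training
4. Evaluation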

Adapted Models

All models will be available on the Hugging Face Model Hub.

License

MIT License

Citation

If you find this work useful, please cite the following:

@article{yamaguchi2024vocabulary,
  title={Vocabulary Expansion for Low-resource Cross-lingual Transfer}, 
  author={Atsuki Yamaguchi and Aline Villavicencio and Nikolaos Aletras},
  year={2024},
  journal={arXiv},
  volume={abs/2406.11477},
  url={https://arxiv.org/abs/2406.11477}
}
