Haonan Li*, Fajri Koto*, Minghao Wu, Alham Fikri Aji, Timothy Baldwin (*equal contribution)
- [2023.05.25] The preprint of our paper is here.
- [2023.05.25] The Bactrian-X llama-7b-lora, llama-13b-lora, and bloom-7b1-lora models are available.
- [2023.04.22] Release of data in 52 languages here.
The Bactrian-X dataset contains 3.4M pairs of instructions and responses in 52 languages.
The instructions were obtained from alpaca-52k and dolly-15k, and translated into 52 languages (52 languages x 67k instances = 3.4M instances).
The responses in the 52 languages were generated with the gpt-3.5-turbo model.
Bactrian-X models are a series of LLMs fine-tuned (using low-rank adaptation, LoRA) on the Bactrian-X dataset.
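For a quick look at the data, the per-language subsets can be loaded with the Hugging Face datasets library. The sketch below assumes the dataset is hosted as MBZUAI/Bactrian-X with one configuration per language code; adjust the identifiers if your copy differs.

```python
# Minimal sketch: load one language subset of Bactrian-X with the `datasets` library.
# The dataset id "MBZUAI/Bactrian-X" and per-language configs (e.g. "zh") are assumptions.
from datasets import load_dataset

zh_data = load_dataset("MBZUAI/Bactrian-X", "zh")   # Chinese subset
print(zh_data["train"][0])                          # one record: instruction, optional input, response
```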
Usage and License Notices: Bactrian-X is intended and licensed for research use only. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
We curate our Bactrian instruction dataset with the following steps:
- Collecting English instructions: The English instructions are obtained from alpaca-52k and dolly-15k, and they are saved to instructions.json.
- Translating the English instructions into foreign languages: The instructions (and the corresponding inputs, if any) are translated into 51 languages using the Google Translate API (conducted in April 2023).
- Generating the responses: We generate responses with gpt-3.5-turbo for the instructions in each language (conducted in April 2023); an illustrative sketch of this step is shown below.
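To make the response-generation step concrete, here is an illustrative sketch of querying gpt-3.5-turbo for a translated instruction. It uses the legacy openai (<1.0) Python client, and the prompt assembly is an assumption rather than the exact script we ran.

```python
# Illustrative sketch of the response-generation step; not the exact script used.
# Assumes the legacy openai<1.0 Python client and a simple prompt layout.
import openai

def generate_response(instruction: str, input_text: str = "") -> str:
    prompt = f"{instruction}\n\n{input_text}".strip()
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return completion.choices[0].message["content"]
```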
With our dataset and Low-Rank Adaptation (LoRA), we present a family of multilingual and monolingual models based on LLaMA and BLOOM. Our instruction-tuned multilingual Bactrian-X models are available on the Hugging Face Hub under the MBZUAI organization.
Note: We are continually updating this repository. The number of languages covered will grow beyond 52 in the future, and the current models are mostly 7B in size. We welcome collaborators who are willing to contribute larger models.
conda create -n bactrian python=3.9
conda activate bactrian
pip install -r requirements.txt
Models are trained with the following hyperparameters:
Hyper-parameter | Bactrian-X |
---|---|
batch_size | 128 |
num_epochs | 4 |
learning_rate | 3e-4 |
cutoff_len | 768 |
lora_r | 64 |
lora_alpha | 16 |
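For reference, these settings correspond roughly to the following peft LoraConfig; the dropout and target modules are taken from the training command below, and this is not a verbatim excerpt of finetune.py.

```python
# Rough sketch of the LoRA configuration implied by the table above and the
# training command below; not a verbatim excerpt of finetune.py.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                                                   # lora_r
    lora_alpha=16,                                          # lora_alpha
    lora_dropout=0.05,                                      # from the training command
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```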
Below is a command to train a LLaMA-7B adapter with our datasets in specific language(s). Replace <lang_iso> with a comma-separated list of one or more ISO 639-1 language codes (e.g., en,zh for English and Chinese), and replace <your_output_dir> with the directory where the outputs should be stored.
# Script to train on 4x NVIDIA A100 80GB GPUs
WORLD_SIZE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 finetune.py \
--model_name_or_path decapoda-research/llama-7b-hf \
--lang <lang_iso> \
--output_dir <your_output_dir> \
--load_in_8bit \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--num_train_epochs 8 \
--model_max_length 768 \
--learning_rate 3e-4 \
--val_set_size 2000 \
--warmup_steps 200 \
--lora_r 64 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--lora_target_modules 'q_proj,k_proj,v_proj,o_proj' \
--group_by_length
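With this setup, the effective batch size is 4 GPUs x 32 samples per device x 1 gradient accumulation step = 128, matching the batch_size listed in the hyperparameter table above.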
This is example code that loads both the foundation model and Bactrian LoRA weights from the Hugging Face model hub, and runs a Gradio interface for inference on a specified input.
python generate.py \
--load_8bit \
--base_model 'decapoda-research/llama-7b-hf' \
--lora_weights 'MBZUAI/bactrian-x-llama-7b-lora' \
--share_gradio
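If you prefer scripted inference over the Gradio interface, the following minimal sketch loads the same base model and adapter with transformers and peft. The prompt template is an assumption and may differ from the one used in generate.py.

```python
# Minimal scripted-inference sketch using transformers + peft (instead of the Gradio app).
# The prompt template below is an assumption; generate.py may format prompts differently.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base_model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "MBZUAI/bactrian-x-llama-7b-lora")
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

prompt = "### Instruction:\nTranslate the following sentence into French.\n\n### Input:\nGood morning.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```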
To merge the LoRA weights back into the base model for export to Hugging Face format and to PyTorch state_dicts, see Alpaca-LoRA.
This should help users who want to run inference in projects like llama.cpp or alpaca.cpp.
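Alternatively, peft's merge_and_unload performs the same merge in a few lines; the sketch below mirrors the Alpaca-LoRA export scripts rather than code shipped in this repo.

```python
# Sketch of merging the LoRA weights into the base model with peft, then saving
# in standard Hugging Face format; mirrors the Alpaca-LoRA export scripts.
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "MBZUAI/bactrian-x-llama-7b-lora").merge_and_unload()
merged.save_pretrained("bactrian-x-llama-7b-merged")  # plain LLaMA checkpoint with LoRA folded in
```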
Please check output examples here.
Please cite the following paper if you use the data, models, or code in this repo.
@misc{li2023bactrianx,
title={Bactrian-X : A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation},
author={Haonan Li and Fajri Koto and Minghao Wu and Alham Fikri Aji and Timothy Baldwin},
year={2023},
eprint={2305.15011},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Naturally, you should also cite the original LLaMA paper, the Self-Instruct paper, and the Stanford Alpaca repo.
We are standing on the shoulders of giants and would especially like to acknowledge the previous efforts of the following works: