Important
We are constantly working on cleaning the code, improving the documentation, and adding more implementation details. Please stay tuned!
We build XFT on top of the Magicoder implementation (https://github.com/ise-uiuc/magicoder). To set up the environment for experiments on DeepSeek-Coder-1.3B, run the following commands:
conda env create -f xft_env.yml
conda activate xft
pip install flash-attn==2.1.0 --no-build-isolation
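A quick sanity check of the environment, assuming a CUDA-capable GPU is available (flash-attn requires one):
import torch
import flash_attn

# Confirm the key dependencies are importable and a GPU is visible.
print("flash-attn version:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())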
To obtain XFT_DS, run the following steps in order:
Step 1: Upcycle an MoE model from DeepSeek-Coder-1.3B Base.
export PYTHONPATH=:[YOUR_HOME_PATH]/xft/src:[YOUR_HOME_PATH]/xft/src/magicoder
cd [YOUR_HOME_PATH]/xft/src/magicoder
python convert_dense_to_moe.py \
--model deepseek-ai/deepseek-coder-1.3b-base \
--save_path "deepseek-coder-8x1.3b-top-6-moe-base"
Step 2: Download the Evol-Instruct dataset and put it under the xft/data folder (see the dataset check sketch after this step).
Instruction-tune the upcycled MoE model on the Evol-Instruct dataset.
bash train_moe.sh
Evaluate the instruction-tuned MoE model on HumanEval(+).
bash test_moe.sh
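As referenced above, a minimal check that the downloaded Evol-Instruct data is in place under xft/data; the glob pattern, JSON-lines layout, and field names are assumptions and may differ from the released dataset:
import glob
import json

# List whatever dataset files were placed under xft/data.
files = sorted(glob.glob("data/*.json*"))
print("Found dataset files:", files)

# Peek at the first record of the first file to confirm the expected schema,
# assuming a JSON-lines layout (one example per line).
if files:
    with open(files[0]) as f:
        example = json.loads(f.readline())
    print(json.dumps(example, indent=2)[:500])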
Step 3: Extract FFN weights from the instruction-tuned MoE model.
python convert_moe_to_ffn.py \
--model "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4" \
--save_path "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_ffn"
Step 4: Set the shared_expert_weight and the ffn_folder_path (the path to the folder of FFN weights extracted in Step 3) in the config file of the instruction-tuned MoE model (ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4/config.json) before learning the mixing coefficients.
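If you prefer to set these fields programmatically, here is a minimal sketch; the key names come from this step, while the shared_expert_weight value of 0.75 mirrors the setting passed in Step 7 and is an assumption here:
import json

config_path = "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4/config.json"
with open(config_path) as f:
    config = json.load(f)

# Point the MoE model at the FFN weights extracted in Step 3, and set the
# shared expert weight (0.75 mirrors the value used in Step 7; adjust as needed).
config["ffn_folder_path"] = "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_ffn"
config["shared_expert_weight"] = 0.75

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)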
Step 5: Initialize the mixing coefficients, which will be used to merge the experts in the instruction-tuned MoE model.
python convert_moe_to_weighted.py \
--model "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4" \
--save_path "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_weighted_dense" \
--num_experts 8
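For intuition, the coefficients initialized here (and learned in Step 6) are used in Step 7 to collapse the experts into a single dense FFN. Below is a purely illustrative sketch of such a weighted merge, not the exact formula implemented in convert_weighted_to_dense.py:
import torch

def merge_expert_ffns(expert_weights, mixing_logits, shared_expert_weight=0.75):
    # expert_weights: list of FFN weight tensors, with the shared expert first
    # (the ordering is an assumption). The shared expert keeps a fixed weight and
    # the remaining experts are combined with normalized learned mixing coefficients.
    shared, others = expert_weights[0], expert_weights[1:]
    coeffs = torch.softmax(mixing_logits, dim=0) * (1.0 - shared_expert_weight)
    merged = shared_expert_weight * shared
    for coeff, weight in zip(coeffs, others):
        merged = merged + coeff * weight
    return merged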
Step 6: Learn the mixing coefficients on the Evol-Instruct dataset.
bash train_weighted.sh
Step 7: Merge the instruction-tuned MoE model based on the learned mixing coefficients. You will now obtain an instruction-tuned model with the same architecture as DeepSeek-Coder-1.3B Base.
python convert_weighted_to_dense.py \
--model_moe "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4" \
--model_dense "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_weighted_dense-lambda-75-1e-5_bs_64_epoch_1" \
--save_path "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_weighted_dense-lambda-75-1e-5_bs_64_epoch_1-dense" \
--num_experts 8 \
--shared_expert_weight 0.75
Evaluate the final model on HumanEval(+).
bash test.sh
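Since the merged checkpoint shares the architecture of DeepSeek-Coder-1.3B Base, it should load with the standard transformers classes; a minimal smoke test (the prompt and generation settings are illustrative):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_weighted_dense-lambda-75-1e-5_bs_64_epoch_1-dense"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Generate a short completion for a simple coding prompt.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))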