
WaveCoder
WaveCoder: Widespread And Versatile Enhanced Code LLM

[πŸ“œ Paper] β€’ [πŸ€— HF Models] β€’ [🐱 GitHub]
[🐦 Twitter] β€’ [πŸ’¬ Reddit] β€’ [πŸ€ Unofficial Blog]

Repo for "WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation" [ACL 2024 Main]


Figure 1: The WaveCoder model pipeline.

πŸ”₯ News

  • [2024/05/16] The WaveCoder paper was accepted to the ACL 2024 main conference.
  • [2024/04/10] πŸ”₯πŸ”₯πŸ”₯ WaveCoder repo and models released on πŸ€— Hugging Face!
  • [2023/12/26] WaveCoder paper released.

πŸ’‘ Introduction

WaveCoder 🌊 is a series of large language models (LLMs) for the coding domain, designed to solve code-related problems through instruction following. Its training dataset was generated from a subset of CodeSearchNet data using the LLM-based generator-discriminator framework we propose, and covers four general code-related tasks: code generation, code summarization, code translation, and code repair.

Model                  HumanEval   MBPP (500)   HumanEvalFix (Avg.)   HumanEvalExplain (Avg.)
GPT-4                  85.4        -            47.8                  52.1
WaveCoder-DS-6.7B      65.8        63.0         49.5                  40.8
WaveCoder-Pro-6.7B     74.4        63.4         52.1                  43.0
WaveCoder-Ultra-6.7B   79.9        64.6         52.3                  45.7

LLM-based Generator-Discriminator


Figure 2: Main framework of the LLM-based Generator-Discriminator.

Example of Instruction Generation


Figure 3: An Example of Our Data Generation.

Data Decontamination

We combine our dataset with the decontaminated evol-codealpaca-v1 dataset (WaveCoder-evol-instruct) to train WaveCoder-Ultra-6.7B.

πŸš€ Quick Start

βš™οΈ Setup

We recommend using Conda to manage your environment. Run the following commands to set up your environment:

conda create -n wavecoder python=3.9
conda activate wavecoder
cd src
pip install -r requirements.txt
pip install transformers==4.34.1
pip install flash-attn==2.5.5
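
Once the environment is ready, the released checkpoints can be loaded with Hugging Face transformers. The snippet below is a minimal inference sketch, not part of the released scripts; the plain-text prompt is an assumption, so check the model card of microsoft/wavecoder-ultra-6.7b for the exact instruction template.

# Minimal inference sketch (not from the released scripts). Assumes a GPU with
# enough memory for the 6.7B checkpoint; the plain-text prompt below is an
# assumption -- see the Hugging Face model card for the exact template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/wavecoder-ultra-6.7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))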

⚑️ Training

We also open-source our complete training scripts for the community, so you can construct your own dataset for training. Our training scripts are adapted from FastChat; a sketch of the assumed data format follows the command below.

To train a model, run the following command:

cd src
bash script/train.sh
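
Since the training scripts follow FastChat, the training data is assumed to use FastChat's conversation format. The sketch below builds a minimal file in that style; the schema and the output path train_data.json are illustrative assumptions, so check script/train.sh for the path and fields it actually expects.

# Hypothetical sketch of a FastChat-style training file; the schema and the
# output path "train_data.json" are assumptions -- check script/train.sh for
# what the released scripts actually read.
import json

samples = [
    {
        "id": "example-0",
        "conversations": [
            {"from": "human",
             "value": "Summarize what the following function does:\n\ndef add(a, b):\n    return a + b"},
            {"from": "gpt",
             "value": "The function returns the sum of its two arguments."},
        ],
    }
]

with open("train_data.json", "w") as f:
    json.dump(samples, f, indent=2)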

βš–οΈ Evaluation

  • For the HumanEval benchmark, we use the codebase from EvalPlus. We recommend using the codebase from Magicoder together with the following command to reproduce WaveCoder's HumanEval results.
MODEL_KEY=deepseek-ai/deepseek-coder-6.7b-base
MODEL=microsoft/wavecoder-ultra-6.7b

DATASET=humaneval
SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl
SANITIZED_PATH=humaneval_result/evalplus-$(basename $MODEL)-$DATASET-sanitized.jsonl

python -m experiments.text2code \
  --model_key $MODEL_KEY \
  --model_name_or_path $MODEL \
  --save_path $SAVE_PATH \
  --dataset $DATASET \
  --temperature 0.0 \
  --top_p 1.0 \
  --max_new_tokens 512 \
  --n_problems_per_batch 28 \
  --n_samples_per_problem 1 \
  --n_batches 1

echo "$MODEL"
evalplus.evaluate --dataset $DATASET --samples $SAVE_PATH
  • For MBPP (500), you can get generations by running the following command:
cd src
bash script/generate.sh

and then get the pass@k score and the error-type analysis by running the following command:

bash script/evaluate.sh

🌲 Data Generation

First, prepare your raw code data and save it as a .jsonl file (a minimal sketch of this step appears at the end of this section), then run the following command:

cd src
bash script/coreset.sh

to get the coreset of your raw data. Once you have the coreset, you can run

cd src
bash script/data_generate.sh

to launch the LLM-based Generator-Discriminator framework. You can customize the generated data by adjusting the prompt and the configurations in the above .sh scripts.
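
As a rough illustration of the first step, the sketch below writes raw code snippets to a .jsonl file. The field name "code" and the file name are assumptions, so check the configuration used by script/coreset.sh for the key it actually reads.

# Hypothetical sketch of preparing raw code data as a .jsonl file for
# script/coreset.sh; the field name "code" and the file name are assumptions.
import json

raw_snippets = [
    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
    "def flatten(xs):\n    return [x for sub in xs for x in sub]",
]

with open("raw_code.jsonl", "w") as f:
    for snippet in raw_snippets:
        f.write(json.dumps({"code": snippet}) + "\n")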

πŸ“– License

This code repository is licensed under the MIT License. The use of DeepSeek Coder models is subject to its License.

β˜•οΈ Citation

If you find this repository helpful, please consider citing our paper:

@article{yu2023wavecoder,
  title={Wavecoder: Widespread and versatile enhanced instruction tuning with refined data generation},
  author={Yu, Zhaojian and Zhang, Xin and Shang, Ning and Huang, Yangyu and Xu, Can and Zhao, Yishujie and Hu, Wenxiang and Yin, Qiufeng},
  journal={arXiv preprint arXiv:2312.14187},
  year={2023}
}

πŸ€ Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.


✨ Star History

Star History Chart
