There were three main steps to the training process:
- Supervised fine-tuning of the base llama-7b model to create llama-7b-se.
- Reward modeling on dialog pairs from the SE dataset, starting from llama-7b-se, to create llama-7b-se-rm (a sketch of the pairwise objective follows this list).
- RL fine-tuning of llama-7b-se with the llama-7b-se-rm reward model.
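For reference, the reward-modeling step trains the model to score the preferred answer in each pair above the rejected one. Below is a minimal PyTorch sketch of that pairwise objective; the function name and the example tensors are illustrative, not the exact training code:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for reward modeling.

    Each element pairs the scalar score the reward model assigns to a
    preferred ("chosen") answer with the score of a less-preferred
    ("rejected") answer to the same question; minimizing the loss pushes
    chosen scores above rejected ones.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Example: scores for a batch of 3 answer pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
loss = pairwise_reward_loss(chosen, rejected)  # smaller when chosen > rejected
```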
For all steps, run `python run.py -e configs/` and choose the corresponding config.
My LoRA layers for the vanilla StackLLaMA are publicly available on the Hugging Face Hub.
LoRA layers were used at all stages to reduce memory requirements. At each stage, the PEFT adapter layers were merged with the base model using:

```
python examples/stack_llama/scripts/merge_peft_adapter.py --adapter_model_name=XXX --base_model_name=YYY --output_name=ZZZ
```

I used huggyllama/llama-7b as the base model. Note the order in which the models must be merged (a Python sketch of the merge follows the list):
- llama-7b-se = merge_peft_adapter llama-7b + llama-7b-se-peft
- llama-7b-se-rm = merge_peft_adapter llama-7b-se + llama-7b-se-rm-peft
- llama-7b-se-rl = merge_peft_adapter llama-7b-se + llama-7b-se-rl-peft
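As a reference for what each merge does, here is a minimal Python sketch using the standard peft API; it mirrors the `merge_peft_adapter.py` command above but is not the exact script, and the XXX/ZZZ names are placeholders carried over from that command:

```python
# Minimal sketch of merging a PEFT/LoRA adapter into its base model,
# assuming the standard peft API (not the exact merge_peft_adapter.py).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "huggyllama/llama-7b"  # YYY: the base model used here
adapter_model_name = "XXX"               # placeholder adapter repo/path
output_name = "ZZZ"                      # placeholder output directory

# Load the base weights and attach the trained LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_model_name)

# Fold the low-rank updates into the base weights, yielding a plain
# transformers model that no longer needs peft at load time.
merged = model.merge_and_unload()
merged.save_pretrained(output_name)
AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_name)
```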