Trains and compares a variety of preference models (reward models) with different losses and datasets.
- Add model.
- Add training code.
- Add evaluation code.
- Test the complete workflow with 10% of the train and eval data for one epoch.
- Add requirements.txt.
- Train and verify that the loss decreases.
- Add metrics to track accuracy during training.
- Try different configs:
  - Freeze some of the layers to avoid overfitting.
  - Train the first layer for 0.1 epoch, then train the remaining layers.
  - Add a DeepSpeed config file and try DeepSpeed training.
  - Try PyTorch compile.
- Compare different losses.
- Compare different datasets.
- Add synthetic datasets.
- Incorporate Weights & Biases (wandb) to keep track of experiments.
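
For the losses to compare, one common baseline is the pairwise Bradley-Terry ranking loss used for RLHF reward models, -log sigma(r_chosen - r_rejected). Below is a minimal sketch with a toy linear scorer on synthetic data (TinyRewardModel and the data are illustrative, not this repo's model); it also doubles as the "loss goes down" and accuracy-metric sanity checks from the list above:

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Toy stand-in for a transformer reward model: maps features to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar reward per example

def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

torch.manual_seed(0)
model = TinyRewardModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

chosen = torch.randn(256, 16) + 0.5    # synthetic "preferred" features
rejected = torch.randn(256, 16) - 0.5  # synthetic "rejected" features

losses = []
for _ in range(50):
    loss = pairwise_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

# Training-time accuracy metric: fraction of pairs ranked correctly.
with torch.no_grad():
    acc = (model(chosen) > model(rejected)).float().mean().item()

assert losses[-1] < losses[0]  # sanity check: loss is going down
```

The same loop works for other losses (e.g. a margin/hinge ranking loss) by swapping out `pairwise_loss`.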
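
Freezing layers (and the "train the first layer first" schedule) can be done by toggling `requires_grad` on parameters. A sketch with a generic `nn.Sequential` stand-in (the architecture is illustrative):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 16),  # "first layer": train this alone during the warm-up phase
    nn.ReLU(),
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Phase 1 (~0.1 epoch): freeze everything, then unfreeze only the first layer.
set_trainable(model, False)
set_trainable(model[0], True)
trainable_phase1 = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Phase 2: unfreeze all layers and continue training.
set_trainable(model, True)
trainable_phase2 = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(trainable_phase1, trainable_phase2)  # 272 561
```

When constructing the optimizer for each phase, pass only the trainable parameters, e.g. `filter(lambda p: p.requires_grad, model.parameters())`.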
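
For the DeepSpeed item, a minimal config might look like the following. The values are illustrative placeholders, not tuned settings for this repo; the dict can be passed to `deepspeed.initialize` or saved as a JSON file and referenced on the command line:

```python
# Illustrative DeepSpeed config sketch (values are placeholders).
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # ZeRO stage 2: shard optimizer states and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}
```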
- Code forked from https://github.com/CarperAI/trlx/tree/main/examples/summarize_rlhf.