This project is a proof-of-concept to test out the recent Direct Preference Optimization (DPO) loss in a simple setting: image classification, mainly on the CIFAR10 dataset. I wanted to better understand how to work with this loss outside of its usual language-modeling context. We use a ResNet fine-tuned on CIFAR10 (with added dropout) as a baseline model against which the DPO-optimized results are compared.
Since DPO makes use of two networks, training keeps both a policy model and a reference model, each initialized from the fine-tuned baseline.
The DPO Loss:
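For reference, the loss as defined by Rafailov et al. (cited below), where $\pi_\theta$ is the policy model, $\pi_{\text{ref}}$ the reference model, $\sigma$ the sigmoid function, and $\beta$ a temperature-like hyperparameter:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$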
In the case of CIFAR10, we only have ground-truth labels, so there are no real preferences to learn from. We modeled the winning/preferred labels ($y_w$) as the ground-truth classes, while the losing/rejected labels ($y_l$) have to be constructed from the incorrect classes.
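A minimal sketch of how the loss can be instantiated for classification under these assumptions; the function name and the choice of a random incorrect class as $y_l$ are illustrative, not necessarily the exact scheme used in this repo's training scripts:

```python
import torch
import torch.nn.functional as F

def dpo_classification_loss(policy_logits, ref_logits, labels, beta=0.1):
    """Illustrative DPO loss for classification.

    policy_logits, ref_logits: (batch, num_classes) raw outputs of the policy
    and reference networks; labels: (batch,) ground-truth class indices.
    """
    num_classes = policy_logits.size(1)

    # y_w is the ground-truth class; y_l is a random *incorrect* class
    # (one possible way of fabricating a preference pair from plain labels).
    offsets = torch.randint(1, num_classes, labels.shape, device=labels.device)
    rejected = (labels + offsets) % num_classes

    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    # log pi_theta(y|x) - log pi_ref(y|x) for the chosen and rejected labels
    chosen_ratio = policy_logp.gather(1, labels.unsqueeze(1)) - ref_logp.gather(1, labels.unsqueeze(1))
    rejected_ratio = policy_logp.gather(1, rejected.unsqueeze(1)) - ref_logp.gather(1, rejected.unsqueeze(1))

    # L_DPO = -log sigmoid(beta * (chosen - rejected))
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```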
The training scripts expose a few parameters to experiment with (a short sketch of the two reference-model update modes follows the list):
- do_polyak: The reference model's parameters are updated as a moving average towards the policy model's parameters (Polyak averaging).
- do_copy: Instead of averaging the model parameters together, the policy model is simply copied into the reference model at the end of each epoch.
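A minimal sketch of what these two options correspond to; the function name `update_reference` and the hyperparameter `tau` are illustrative and not necessarily what the training scripts use:

```python
import torch

@torch.no_grad()
def update_reference(policy, reference, do_polyak=False, do_copy=False, tau=0.995):
    """Update the reference network from the policy network.

    do_polyak: exponential moving average of the policy parameters.
    do_copy:   hard copy of the policy weights (e.g. at the end of an epoch).
    """
    if do_polyak:
        for p_ref, p_pol in zip(reference.parameters(), policy.parameters()):
            p_ref.mul_(tau).add_(p_pol, alpha=1.0 - tau)
    elif do_copy:
        reference.load_state_dict(policy.state_dict())
```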
The implementation of DPO is adapted from the original paper: https://arxiv.org/pdf/2305.18290.pdf
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.