
[Question] Does TRL support DPO trainer with simulated environment for generating training data like PPO's step-based training? #1595

Closed
aplmikex opened this issue Apr 28, 2024 · 1 comment


@aplmikex

Hi Hugging Face team,

I'm exploring the possibility of using the TRL library to train a reinforcement learning model with a simulated environment. Specifically, I'm interested in using the DPO (Direct Preference Optimization) trainer to generate training data from the simulated environment, similar to how the PPO (Proximal Policy Optimization) trainer works with its step-based training.
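
For reference, this is roughly the step-based PPO loop I mean, where the simulated environment scores each generated response and that score is fed back as the reward. This is just a sketch based on my reading of the `PPOTrainer` docs; `score_in_env` is a stand-in for my actual environment, not anything in TRL:

```python
# Sketch of the PPO step-based loop: observation -> generation -> env reward -> PPO step.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

def score_in_env(response_text: str) -> float:
    # Placeholder for my simulated environment: the real one would execute
    # the response and return a reward based on the resulting state.
    return float(len(response_text))

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=1, mini_batch_size=1),
    model=model,
    tokenizer=tokenizer,
)

# One environment interaction and one PPO optimization step.
query = tokenizer("Observation: the door is locked.", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate(
    query, return_prompt=False, max_new_tokens=16, pad_token_id=tokenizer.eos_token_id
)[0]
reward = torch.tensor(score_in_env(tokenizer.decode(response, skip_special_tokens=True)))
stats = ppo_trainer.step([query], [response], [reward])
```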

However, after reviewing the TRL documentation and examples, I couldn't find any clear indication of whether this is supported or not. I'd like to know if it's possible to use the DPO trainer in TRL to generate training data based on a simulated environment, where the environment provides rewards and observations that can be used to update the policy.

If this is supported, could you please provide an example or point me to the relevant documentation? If not, are there any plans to add this feature in the future?
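
Concretely, the workflow I have in mind looks something like the sketch below: sample two candidate responses per prompt, let the simulated environment's reward decide which one is chosen and which is rejected, and then train offline on the resulting preference pairs. Again, this is only a sketch based on my understanding that `DPOTrainer` expects a dataset with `prompt`/`chosen`/`rejected` columns; `env_score` is a placeholder for my environment:

```python
# Sketch: build DPO preference pairs from a simulated environment's rewards,
# then train offline with DPOTrainer.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

def env_score(prompt: str, response: str) -> float:
    # Placeholder reward from the simulated environment.
    return float(len(response))

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["Observation: the door is locked.", "Observation: you see a key."]
rows = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Sample two candidate responses per prompt.
    candidates = model.generate(
        **inputs, do_sample=True, num_return_sequences=2,
        max_new_tokens=16, pad_token_id=tokenizer.eos_token_id,
    )
    a, b = (
        tokenizer.decode(c[inputs.input_ids.shape[1]:], skip_special_tokens=True)
        for c in candidates
    )
    # The environment's reward decides which response is preferred.
    chosen, rejected = (a, b) if env_score(prompt, a) >= env_score(prompt, b) else (b, a)
    rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL creates a frozen reference copy when None
    beta=0.1,
    args=TrainingArguments(
        output_dir="dpo-env", per_device_train_batch_size=2,
        num_train_epochs=1, report_to="none",
    ),
    train_dataset=Dataset.from_list(rows),
    tokenizer=tokenizer,
    max_length=256,
    max_prompt_length=64,
)
dpo_trainer.train()
```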

Additional context:

- I've reviewed the TRL documentation and examples, but couldn't find any mention of using a simulated environment with the DPO trainer.
- I've seen examples of using PPO with a simulated environment, where the environment provides rewards and observations that are used to update the policy.
- I'm interested in using TRL because of its ease of use and flexibility, but I need to know if it can support my specific use case.

@aplmikex aplmikex changed the title Does TRL support DPO trainer with simulated environment for generating training data like PPO's step-based training? [Question] Does TRL support DPO trainer with simulated environment for generating training data like PPO's step-based training? Apr 28, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this as completed Jun 6, 2024