I'm exploring the possibility of using the TRL library to train a reinforcement learning model against a simulated environment. Specifically, I'm interested in using the DPO (Direct Preference Optimization) trainer to generate training data from the simulated environment, similar to how PPO (Proximal Policy Optimization) works with its step-based training.
However, after reviewing the TRL documentation and examples, I couldn't find any clear indication of whether this is supported. I'd like to know if it's possible to use the DPO trainer in TRL to generate training data from a simulated environment, where the environment provides rewards and observations that are used to update the policy.
If this is supported, could you please provide an example or point me to the relevant documentation? If not, are there any plans to add this feature in the future?
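For context on what I have in mind: since DPO consumes `(prompt, chosen, rejected)` preference rows rather than per-step rewards, one conceivable bridge is to sample two responses per prompt, score both with the environment's reward, and rank them into a preference pair. A minimal sketch of that idea, where `env_reward` and `policy_sample` are hypothetical stand-ins for a real simulator and a real policy's generation call (only the `prompt`/`chosen`/`rejected` column names come from TRL's DPO dataset format):

```python
import random

def rollout(env_reward, policy_sample, prompt):
    """Sample two candidate responses for a prompt and rank them by the
    environment's reward, yielding a DPO-style preference row.
    `env_reward` and `policy_sample` are hypothetical stand-ins here."""
    a = policy_sample(prompt)
    b = policy_sample(prompt)
    if env_reward(prompt, a) >= env_reward(prompt, b):
        chosen, rejected = a, b
    else:
        chosen, rejected = b, a
    # TRL's DPOTrainer expects rows with "prompt", "chosen", "rejected" keys.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Toy example: the "environment" simply rewards longer responses.
random.seed(0)
reward = lambda prompt, resp: len(resp)
sample = lambda prompt: prompt + " " + "x" * random.randint(1, 5)
pair = rollout(reward, sample, "hello")
print(sorted(pair))  # → ['chosen', 'prompt', 'rejected']
```

This only shows the data shape; whether generating such pairs online, inside the training loop, is supported is exactly what I'm asking.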
Additional context:
- I've reviewed the TRL documentation and examples, but couldn't find any mention of using a simulated environment with the DPO trainer.
- I've seen examples of using PPO with a simulated environment, where the environment provides rewards and observations that are used to update the policy.
- I'm interested in using TRL because of its ease of use and flexibility, but I need to know if it can support my specific use case.
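To make the PPO pattern I'm referring to concrete, here is a mock of the online, step-based loop (queries → responses → environment rewards → update). The classes are illustrative stand-ins, not TRL objects; only the `step(queries, responses, rewards)` calling pattern mirrors what TRL's PPO examples use:

```python
class MockEnv:
    """Scores a response; a real simulator would also return observations."""
    def reward(self, query, response):
        return float(len(response))

class MockPPOTrainer:
    """Stand-in for a trainer exposing a step(queries, responses, rewards)
    update, the pattern TRL's PPO examples follow. Not a real TRL class."""
    def __init__(self):
        self.updates = 0
    def step(self, queries, responses, rewards):
        self.updates += 1
        return {"mean_reward": sum(rewards) / len(rewards)}

env, trainer = MockEnv(), MockPPOTrainer()
queries = ["q1", "q2"]
for _ in range(3):                           # each step: generate, score, update
    responses = [q + "!" for q in queries]   # placeholder for policy generation
    rewards = [env.reward(q, r) for q, r in zip(queries, responses)]
    stats = trainer.step(queries, responses, rewards)
print(trainer.updates)  # → 3
```

It's this per-step reward feedback that I don't see an equivalent of in the DPO trainer, which appears to train purely offline from a fixed preference dataset.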
aplmikex changed the title to "[Question] Does TRL support DPO trainer with simulated environment for generating training data like PPO's step-based training?" on Apr 28, 2024.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.