
[Question] Does TRL support DPO trainer with simulated environment for generating training data like PPO's step-based training? #1595

Closed
aplmikex opened this issue Apr 28, 2024 · 1 comment


@aplmikex

Hi Hugging Face team,

I'm exploring the possibility of using the TRL library to train a reinforcement learning model with a simulated environment. Specifically, I'm interested in using the DPO (Direct Preference Optimization) trainer to generate training data from the simulated environment, similar to how the PPO (Proximal Policy Optimization) trainer works with its step-based training.
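
For reference, this is roughly the step-based PPO loop I mean, where the simulated environment scores each generated response and that score is fed back as the reward. This is just a sketch based on my reading of the `PPOTrainer` docs; `score_in_env` is a stand-in for my actual environment, not anything in TRL:

```python
# Sketch of the PPO step-based loop: observation -> generation -> env reward -> PPO step.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

def score_in_env(response_text: str) -> float:
    # Placeholder for my simulated environment: the real one would execute
    # the response and return a reward based on the resulting state.
    return float(len(response_text))

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=1, mini_batch_size=1),
    model=model,
    tokenizer=tokenizer,
)

# One environment interaction and one PPO optimization step.
query = tokenizer("Observation: the door is locked.", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate(
    query, return_prompt=False, max_new_tokens=16, pad_token_id=tokenizer.eos_token_id
)[0]
reward = torch.tensor(score_in_env(tokenizer.decode(response, skip_special_tokens=True)))
stats = ppo_trainer.step([query], [response], [reward])
```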

However, after reviewing the TRL documentation and examples, I couldn't find any clear indication of whether this is supported or not. I'd like to know if it's possible to use the DPO trainer in TRL to generate training data based on a simulated environment, where the environment provides rewards and observations that can be used to update the policy.

If this is supported, could you please provide an example or point me to the relevant documentation? If not, are there any plans to add this feature in the future?
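
Concretely, the workflow I have in mind looks something like the sketch below: sample two candidate responses per prompt, let the simulated environment's reward decide which one is chosen and which is rejected, and then train offline on the resulting preference pairs. Again, this is only a sketch based on my understanding that `DPOTrainer` expects a dataset with `prompt`/`chosen`/`rejected` columns; `env_score` is a placeholder for my environment:

```python
# Sketch: build DPO preference pairs from a simulated environment's rewards,
# then train offline with DPOTrainer.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

def env_score(prompt: str, response: str) -> float:
    # Placeholder reward from the simulated environment.
    return float(len(response))

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["Observation: the door is locked.", "Observation: you see a key."]
rows = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Sample two candidate responses per prompt.
    candidates = model.generate(
        **inputs, do_sample=True, num_return_sequences=2,
        max_new_tokens=16, pad_token_id=tokenizer.eos_token_id,
    )
    a, b = (
        tokenizer.decode(c[inputs.input_ids.shape[1]:], skip_special_tokens=True)
        for c in candidates
    )
    # The environment's reward decides which response is preferred.
    chosen, rejected = (a, b) if env_score(prompt, a) >= env_score(prompt, b) else (b, a)
    rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL creates a frozen reference copy when None
    beta=0.1,
    args=TrainingArguments(
        output_dir="dpo-env", per_device_train_batch_size=2,
        num_train_epochs=1, report_to="none",
    ),
    train_dataset=Dataset.from_list(rows),
    tokenizer=tokenizer,
    max_length=256,
    max_prompt_length=64,
)
dpo_trainer.train()
```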

Additional context:

- I've reviewed the TRL documentation and examples, but couldn't find any mention of using a simulated environment with the DPO trainer.
- I've seen examples of using PPO with a simulated environment, where the environment provides rewards and observations that are used to update the policy.
- I'm interested in using TRL because of its ease of use and flexibility, but I need to know if it can support my specific use case.

@aplmikex aplmikex changed the title Does TRL support DPO trainer with simulated environment for generating training data like PPO's step-based training? [Question] Does TRL support DPO trainer with simulated environment for generating training data like PPO's step-based training? Apr 28, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this as completed Jun 6, 2024