Categorical Policy for Discrete Action Spaces? #86
Actually, you cannot use the Boltzmann policy with policy gradient methods, as its interface lacks the gradient of the logarithm of the policy. I will put it on the ToDo list.
Ok, thanks for clarifying! When I tried last night, I discovered that the Boltzmann policy has no method for the gradient of the log-probability. I need this categorical policy for discrete action and discrete state spaces for my research, and I'm happy to implement it myself. How would you recommend doing so? To be clear, I don't think anything needs to be deep in GridWorld. Tabular PG and tabular AC methods should (at least in principle) be applicable to GridWorld, right?
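As a concrete note on what that missing piece would compute: for a tabular softmax (Boltzmann) policy with one logit per state-action pair, the gradient of the log-probability with respect to the logits of state `s` has a simple closed form, `onehot(a) - softmax(theta[s])`. A standalone numpy sketch (not mushroom-rl code) that checks the closed form against finite differences:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def diff_log_softmax(theta_s, a):
    """Gradient of log pi(a|s) w.r.t. the logits theta[s]:
    onehot(a) - softmax(theta[s])."""
    g = -softmax(theta_s)
    g[a] += 1.0
    return g

# Finite-difference check of the closed form.
rng = np.random.default_rng(0)
theta_s = rng.normal(size=4)
a = 2
eps = 1e-6
num = np.zeros_like(theta_s)
for i in range(len(theta_s)):
    tp, tm = theta_s.copy(), theta_s.copy()
    tp[i] += eps
    tm[i] -= eps
    num[i] = (np.log(softmax(tp)[a]) - np.log(softmax(tm)[a])) / (2 * eps)

print(np.allclose(num, diff_log_softmax(theta_s, a), atol=1e-5))
```

This is the quantity a policy gradient agent needs from the policy; everything else in REINFORCE is just averaging it, weighted by returns.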
As a work-around, is the following a possible way to obtain a PG agent in a discrete state space and discrete action space? Use a
To explain why, for my research, I want to test policy gradient and actor-critic methods against value-based approaches in tabular domains with discrete action spaces. Is there a way to do this using mushroom-rl? I'm happy to implement whatever I need to myself, if you give me an outline of what needs to change where (and what pitfalls to watch out for)! |
I just tried this myself, and hit the following error inside
Specifically stemming from the method:
The simplest approach is to implement the ParametricPolicy interface with an appropriate policy. This should allow standard policy gradient methods to work, as far as I know. If that's not the case, you may need to change the policy gradient approaches to support your setting, or implement another approximator that supports integer inputs. Note that you can define the policy however you want; there's no need to use any of the mushroom tools (though they can be helpful in more complex scenarios). For deep actor-critic, you can use the torch Boltzmann policy and define an appropriate network that makes sense for an integer input. In general it doesn't seem to be a very good idea to do so, but I won't comment further on this point, as it's out of the scope of mushroom and a very particular setting. You probably can't expect a deep actor-critic approach to have amazing results on grid worlds...
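To make the suggestion above concrete, here is a standalone sketch of what such a tabular Boltzmann policy could look like. The method names (`draw_action`, `diff_log`, `get_weights`, `set_weights`, `weights_size`) are modeled on what a ParametricPolicy-style interface needs, but this class does not import or subclass anything from mushroom-rl; check the actual base class for the exact required signatures before adapting it:

```python
import numpy as np

class TabularBoltzmannPolicy:
    """Sketch of a tabular softmax (Boltzmann) policy with one logit
    per (state, action) pair.

    Assumption: the method names mirror a ParametricPolicy-style
    interface (draw_action, diff_log, weight accessors), but this is
    an illustration, not the mushroom-rl API.
    """

    def __init__(self, n_states, n_actions, rng=None):
        self._n_states = n_states
        self._n_actions = n_actions
        self._theta = np.zeros((n_states, n_actions))
        self._rng = rng if rng is not None else np.random.default_rng()

    def _probs(self, s):
        z = self._theta[s] - self._theta[s].max()  # stable softmax
        e = np.exp(z)
        return e / e.sum()

    def __call__(self, s, a):
        """pi(a|s)."""
        return self._probs(s)[a]

    def draw_action(self, s):
        return self._rng.choice(self._n_actions, p=self._probs(s))

    def diff_log(self, s, a):
        """Gradient of log pi(a|s) w.r.t. the flattened weights."""
        g = np.zeros_like(self._theta)
        g[s] = -self._probs(s)
        g[s, a] += 1.0
        return g.ravel()

    @property
    def weights_size(self):
        return self._theta.size

    def get_weights(self):
        return self._theta.ravel().copy()

    def set_weights(self, w):
        self._theta = w.reshape(self._n_states, self._n_actions)
```

The key design point is `diff_log`: it returns the score function over the *full* weight vector (zeros everywhere except the visited state's row), which is what a generic policy gradient update expects.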
I think you're misunderstanding what I want to do. The goal is simple: REINFORCE in GridWorld using a categorical policy. No deep learning required. This is maybe the simplest application of REINFORCE, and I'm finding it surprisingly difficult to implement.
The solution for this is described in the post above: implement a Boltzmann policy using the ParametricPolicy interface. My comment on deep actor-critic is that those approaches are unlikely to work here even without deep networks, and they would be fairly complex to implement in this setting, requiring many complicated assumptions. Classical actor-critic, once you get standard policy search working, can be ported similarly.
Ok thank you. |
I want to explore policy gradient and actor-critic agents on `GridWorld` environments. To that end, I want to parameterize the policy as a categorical distribution at each state. How do I do this?

Looking through the available policies, `policy.td_policy.Boltzmann` appears to perform softmax(logits), which is what I have in mind, but its logits appear to be dictated by Q values. I don't want the policy gradient agents to learn a Q function, and the fact that `Boltzmann` is under `td_policy` makes me hesitate, because policy gradient methods are not a form of TD learning.