Adapting A2C and deep policy gradient methods to Discrete envs #20

Closed
lionely opened this issue Feb 3, 2020 · 4 comments

@lionely

lionely commented Feb 3, 2020

Describe the bug
Hi, I am getting the following error. I'm trying to use the A2C algorithm, which samples actions from a Gaussian distribution when given a state. The code seems to expect torch.float32; I have checked, and my inputs are indeed torch.float32, not Long. I'm not sure what to do. I know the algorithm runs, as I have run it before, but I get this error when I try to use it in an environment with discrete actions.

Here is the library's draw_action_t function:

def draw_action_t(self, state):
        print('draw_action_t',state.dtype)
        return self.distribution_t(state).sample().detach()

and

def distribution_t(self, state):
        mu, sigma = self.get_mean_and_covariance(state)
        return torch.distributions.MultivariateNormal(loc=mu, covariance_matrix=sigma)

Final error:

RuntimeError: _th_normal_ not supported on CPUType for Long

Help is much appreciated!

System information (please complete the following information):

  • OS: macOS 10.12.6
  • Python version: Python 3.6
  • Torch version: PyTorch 1.2
  • Mushroom version: master

Additional context
I'm working with the Gym 'CartPole-v0' environment.
I need these deep policy gradient methods to work for discrete environments, as I am testing something for my research. Any advice on this?
Help is greatly appreciated!

@boris-il-forte
Collaborator

Hi,
Currently, we don't support actor-critic methods with finite actions.
However, I think it would be quite easy to make A2C work, given the simplicity of the algorithm.
Indeed, you are trying to use a Gaussian policy in a discrete-action environment. What you should do instead is implement a discrete policy (e.g. a Boltzmann policy) by extending the torch policy interface; it should be quite easy.

The only problematic part is remembering to cast the action back to long, since it is converted to a float tensor inside the _loss method (or to change that line in the A2C code).
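
For illustration, a minimal sketch of such a discrete (Boltzmann/softmax) policy in plain PyTorch is shown below. This is only a sketch and does not follow the library's actual torch policy interface; the class name and the network architecture are hypothetical, and only the draw_action_t/distribution_t/log_prob_t methods mirror the code quoted above.

import torch
import torch.nn as nn

class BoltzmannPolicySketch(nn.Module):
    # Hypothetical softmax (Boltzmann) policy for discrete actions.
    # It uses a Categorical distribution instead of a MultivariateNormal.

    def __init__(self, n_features, n_actions, beta=1.0):
        super().__init__()
        self._beta = beta  # inverse temperature of the softmax
        self._logits = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def distribution_t(self, state):
        # Categorical distribution over the discrete actions.
        logits = self._beta * self._logits(state.float())
        return torch.distributions.Categorical(logits=logits)

    def draw_action_t(self, state):
        # Sampled actions are integer (long) tensors, as a discrete
        # Gym environment expects.
        return self.distribution_t(state).sample().detach()

    def log_prob_t(self, state, action):
        # The loss code may hand the action back as a float tensor of
        # shape (N, 1); cast it back to long and flatten it to (N,).
        return self.distribution_t(state).log_prob(action.long().view(-1))

The two key differences from the Gaussian policy are the Categorical distribution in place of the MultivariateNormal, and the cast of the action back to long before evaluating its log-probability.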

@lionely
Author

lionely commented Feb 4, 2020

Hi, thanks for the response! Do you know if there is a deep RL method in your library, apart from DQN, that supports finite actions? If there aren't any available, I will make the necessary changes for A2C.

@boris-il-forte
Collaborator

Mostly, what is missing is the policy. Once you have that, you should be almost done.

All variants of DQN (double, averaged, categorical) support finite actions.

No actor-critic method currently supports finite actions.

You could also try Fitted Q-Iteration with deep networks, although this algorithm works better with extra trees.

@boris-il-forte
Collaborator

We just pushed the BoltzmannTorchPolicy to the dev branch; see commit 4d20e68.
There is also an example of using this policy with A2C here.

This should fix your issue. If not, feel free to open another issue/bug report.
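
As a rough sanity check (this is not the library example referenced above), the hypothetical BoltzmannPolicySketch from the earlier comment could be exercised against CartPole-v0 with the classic Gym step API:

import gym
import torch

env = gym.make('CartPole-v0')
policy = BoltzmannPolicySketch(n_features=env.observation_space.shape[0],
                               n_actions=env.action_space.n)

obs = env.reset()
done = False
while not done:
    state = torch.as_tensor(obs, dtype=torch.float32)
    action = policy.draw_action_t(state)            # 0-d long tensor
    obs, reward, done, _ = env.step(int(action.item()))
env.close()

This only verifies that the sampled actions have the integer type the discrete action space expects; training with A2C still requires a policy that follows the library's torch policy interface, such as the BoltzmannTorchPolicy mentioned above.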
