Beta distribution as policy for environments with bounded continuous action spaces [feature request] #112

Open
skervim opened this issue Dec 4, 2018 · 22 comments
Labels
enhancement (New feature or request) · experimental (Experimental Feature) · help wanted (Help from contributors is needed)

Comments

@skervim

skervim commented Dec 4, 2018

There is an issue at OpenAI Baselines (here) about the advantages of a beta distribution over a diagonal Gaussian distribution + clipping.
The relevant paper: Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution.

Is it possible to add a beta distribution to the repository?
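A quick numerical illustration of the point made in the linked issue and paper (not part of the original request, just a NumPy sketch): clipping a Gaussian piles probability mass exactly at the action bounds, while a Beta sample rescaled to [low, high] stays strictly inside them.

import numpy as np

rng = np.random.default_rng(0)
low, high = -2.0, 2.0

# Gaussian policy whose mean sits near a bound, then clipped to the action space
gauss = np.clip(rng.normal(loc=1.5, scale=1.0, size=100_000), low, high)
# Beta sample in [0, 1], rescaled to the action space
beta = low + (high - low) * rng.beta(a=2.0, b=2.0, size=100_000)

print("clipped-Gaussian samples stuck at a bound:", np.mean((gauss == low) | (gauss == high)))
print("Beta samples at a bound:", np.mean((beta == low) | (beta == high)))  # essentially 0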

araffin added the enhancement label on Dec 4, 2018
@araffin
Collaborator

araffin commented Dec 4, 2018

Hello,

Is it possible to add a beta distribution to the repository?

This is not planned, but we are open to PRs. Also, as for the Huber loss (see #95), we would need to run several benchmarks to assess the utility of such a feature before merging it.

@antoine-galataud

Hello,

I'm working on continuous control problems with asymmetric, bounded continuous action spaces. While Gaussian policies offer decent performance, training often takes a long time and the action distribution often doesn't fully match the problem space.
My current workaround is to rescale the action and clip it (by the way, I had to disable the way clipping is currently done so I could apply a custom transformation). But that only helps with matching environment constraints.

Some real-world continuous control problems would benefit from this, mainly mechanical engine part control and industrial machine optimization (e.g. calibration).
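For reference, a minimal sketch of the rescale-and-clip workaround described above, written as a gym ActionWrapper (the wrapper name is hypothetical, and it assumes the Gaussian policy's raw output lives roughly in [-1, 1]):

import gym
import numpy as np

class RescaleClipAction(gym.ActionWrapper):
    """Map a raw policy output in [-1, 1] onto the env's bounded action space, then clip."""
    def action(self, action):
        low, high = self.action_space.low, self.action_space.high
        action = low + (high - low) * (action + 1.0) / 2.0
        return np.clip(action, low, high)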

araffin added the help wanted label on Dec 5, 2018
@antoine-galataud

I'm testing the following (draft) implementation

import tensorflow as tf
from stable_baselines.common.distributions import ProbabilityDistribution


class BetaProbabilityDistribution(ProbabilityDistribution):
    def __init__(self, flat):
        self.flat = flat
        # as per http://proceedings.mlr.press/v70/chou17a/chou17a.pdf:
        # softplus keeps the concentrations positive, and adding 1 makes the
        # distribution unimodal (alpha, beta > 1)
        # NOTE: len(flat.shape) - 1 is the tensor rank minus one (1 unit for a
        # [batch, features] latent); for multi-dimensional actions this should
        # be the action dimension instead
        alpha = 1.0 + tf.layers.dense(flat, len(flat.shape) - 1, activation=tf.nn.softplus)
        beta = 1.0 + tf.layers.dense(flat, len(flat.shape) - 1, activation=tf.nn.softplus)
        self.dist = tf.distributions.Beta(concentration1=alpha, concentration0=beta,
                                          validate_args=True, allow_nan_stats=False)

    def flatparam(self):
        return self.flat

    def mode(self):
        return self.dist.mode()

    def neglogp(self, x):
        # sum over action dimensions to get the negative log-likelihood of the full action
        return tf.reduce_sum(-self.dist.log_prob(x), axis=-1)

    def kl(self, other):
        assert isinstance(other, BetaProbabilityDistribution)
        return self.dist.kl_divergence(other.dist)

    def entropy(self):
        return self.dist.entropy()

    def sample(self):
        return self.dist.sample()

    @classmethod
    def fromflat(cls, flat):
        return cls(flat)

For now I've been able to run it with a custom policy like:

        pdtype = BetaProbabilityDistributionType(ac_space)
        ...

        obs = self.processed_x

        with tf.variable_scope("model"):
            # policy (actor) network
            x = obs
            ...
            x = tf.nn.relu(tf.layers.dense(x, 128, name='pi_fc' + str(i), kernel_initializer=U.normc_initializer(1.0)))
            self.policy = tf.layers.dense(x, ac_space.shape[0], name='pi')
            # the Beta distribution is built from the latent features x
            self.proba_distribution = pdtype.proba_distribution_from_flat(x)

            # value (critic) network
            x = obs
            ...
            x = tf.nn.relu(tf.layers.dense(x, 128, name='vf_fc' + str(i), kernel_initializer=U.normc_initializer(1.0)))
            value_fn = tf.layers.dense(x, 1, name='vf')
            self.q_value = tf.layers.dense(value_fn, 1, name='q')

...            

I'm now running it with PPO1 & PPO2 against my benchmark environment (asymmetric, bounded, continuous action space) to see how it compares with the Gaussian policy. I'm running into trouble with TRPO, but I haven't had time to investigate further.

Note: it still requires rescaling the action from [0, 1] to the environment's action space. This can be done manually, or a custom post-processing mechanism for the action could be added.
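The custom policy above references a BetaProbabilityDistributionType that is not shown in this thread. A minimal sketch of what such a class could look like, modelled on DiagGaussianProbabilityDistributionType from stable_baselines.common.distributions (an assumption, not the branch code; method names may differ between versions):

import tensorflow as tf
from stable_baselines.common.distributions import ProbabilityDistributionType

class BetaProbabilityDistributionType(ProbabilityDistributionType):
    def __init__(self, ac_space):
        self.size = ac_space.shape[0]  # number of action dimensions

    def probability_distribution_class(self):
        return BetaProbabilityDistribution

    def proba_distribution_from_flat(self, flat):
        # 'flat' is the latent vector produced by the policy network
        return self.probability_distribution_class()(flat)

    def param_shape(self):
        return [2 * self.size]  # one alpha and one beta per action dimension

    def sample_shape(self):
        return [self.size]

    def sample_dtype(self):
        return tf.float32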

@antoine-galataud

Well, after testing a bit, it doesn't seem to improve overall performance, at least on my environment (I didn't test on classic control tasks). It does seem to converge, but it's slower and the average reward is lower than with the Gaussian policy.

@araffin
Collaborator

araffin commented Dec 6, 2018

I would say you need some hyperparameter tuning... The parameters in the current implementation were tuned for Gaussian policies, so it is not completely fair to compare them without tuning.

@antoine-galataud

@araffin I'll try to spend some time on that. Any idea which hyperparameters would be best to tune first?

@araffin
Collaborator

araffin commented Dec 6, 2018

The best practice would be to use Hyperband or Hyperopt to do it automatically (see https://github.com/araffin/robotics-rl-srl#hyperparameter-search).
This script written by @hill-a can get you started.

Otherwise, with PPO, the hyperparameters that matter most in my experience are n_steps (together with nminibatches), ent_coef (the entropy coefficient), and lam (the GAE lambda coefficient). Additionally, you can also tune noptepochs, cliprange, and the learning rate.
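For illustration, here is how those hyperparameters map onto a PPO2 constructor call; the values are arbitrary starting points rather than tuned recommendations, and env is assumed to be an already-created gym environment:

from stable_baselines import PPO2

model = PPO2(
    "MlpPolicy",            # or a custom policy class using the Beta distribution
    env,
    n_steps=2048,           # rollout length; tune together with nminibatches
    nminibatches=32,
    ent_coef=0.0,           # entropy coefficient
    lam=0.95,               # GAE lambda
    noptepochs=10,
    cliprange=0.2,
    learning_rate=2.5e-4,
    verbose=1,
)
model.learn(total_timesteps=int(1e6))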

@HareshKarnan

@antoine-galataud can you share your implementation of the beta distribution?

@antoine-galataud

@HareshMiriyala sure, I'll PR that soon.

@antoine-galataud

I don't think it's ready for a PR so here is the branch link: https://github.com/antoine-galataud/stable-baselines/tree/beta-pd

This is based on the TensorFlow Beta implementation and on Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution (notably the idea of using a softplus activation and adding 1 as a constant to alpha and beta).

Usage: I didn't work on configuring which distribution to use in a generic manner, so you have to use it in a custom policy. Ideally there should be a way to choose between Gaussian and Beta in

def make_proba_dist_type(ac_space):

You can refer to the example above about creating a custom policy that uses it.
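A hedged sketch of what such a switch could look like; the use_beta flag does not exist in stable-baselines and is shown only to illustrate selecting the distribution from configuration:

from gym import spaces
from stable_baselines.common.distributions import DiagGaussianProbabilityDistributionType

def make_proba_dist_type(ac_space, use_beta=False):
    if isinstance(ac_space, spaces.Box):
        assert len(ac_space.shape) == 1, "Error: the action space must be a vector"
        if use_beta:
            return BetaProbabilityDistributionType(ac_space)
        return DiagGaussianProbabilityDistributionType(ac_space.shape[0])
    # ... handling of Discrete, MultiDiscrete, MultiBinary spaces unchanged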

@araffin
Collaborator

araffin commented Jan 7, 2019

@antoine-galataud before submitting a PR, please look at the contribution guide #148 (that would save time ;))
It will be merged with master soon.

@HareshKarnan

@antoine-galataud Thanks a bunch !

@HareshKarnan

@antoine-galataud How are you handling the scaling of the sample from the beta distribution (0,1) to the action space bounds?

@antoine-galataud

@HareshMiriyala I do it like this:

def step(self, action):
  action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
  ...
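For context, a minimal sketch of a custom gym env doing this rescaling inside step(); the class name and bounds are hypothetical, and the placeholder dynamics just show where real environment logic would go:

import gym
import numpy as np
from gym import spaces

class MyBoundedEnv(gym.Env):
    def __init__(self):
        # real actuator bounds, e.g. a valve position between 10% and 90%
        self.action_space = spaces.Box(low=np.array([10.0]), high=np.array([90.0]), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)

    def step(self, action):
        # 'action' comes from the Beta policy, so it lies in [0, 1]
        action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
        obs = np.zeros(4, dtype=np.float32)   # placeholder dynamics
        return obs, 0.0, False, {}

    def reset(self):
        return np.zeros(4, dtype=np.float32)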

@HareshKarnan

Thanks, where do you make this change? I'm not so familiar with the codebase; can you guide me to which file and class to change?

@skervim
Author

skervim commented Jan 11, 2019

@araffin: Could you give an estimate of how long it will take until the beta distribution is merged into master? Thanks in advance!

@araffin
Collaborator

araffin commented Jan 11, 2019

@skervim well, I don't know, as I'm not in charge of implementing it or testing it. However, that does not mean you cannot test it beforehand (cf. install from source in the docs).

@skervim
Author

skervim commented Jan 11, 2019

@araffin: Sorry!! I misunderstood your message:

It will be merged with master soon.

@antoine-galataud: I don't know if it helps, but there is also a beta distribution implemented in Tensorforce (here). Maybe it can serve as a reference? Thank you very much for implementing the beta distribution. I think it will help in a lot of environments and RL problems. :)

@antoine-galataud

@HareshMiriyala the step() function is one that you implement when you write a custom gym env. You can also modify an existing env to see how it goes.

@skervim I couldn't dedicate time to testing (apart from a quick test with a custom env). I also have to write unit tests. If you have access to continuous control environments (MuJoCo, ...) and can give it a try, that would definitely help. Apart from that, I'd like to provide better integration with action value scaling and distribution type selection based on configuration, maybe later, if we see any benefit from this implementation. That doesn't prevent testing it as is, anyway.

@araffin
Collaborator

araffin commented Jan 11, 2019

@skervim if you want to test on continuous envs for free (no MuJoCo licence required), I recommend the PyBullet envs (see the rl baselines zoo).
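For example (assuming the pybullet package is installed; the env id is one of those registered by pybullet_envs):

import gym
import pybullet_envs  # noqa: F401  (importing this registers the Bullet envs with gym)

env = gym.make("HalfCheetahBulletEnv-v0")
print(env.action_space)  # a bounded, continuous Box action space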

@AGPX

AGPX commented Feb 4, 2020

@HareshMiriyala I do it like this:

def step(self, action):
  action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
  ...

@antoine-galataud Is it legit/better to perform this operation in the step function of the environment, or is it better to put it in the network (updating the proba_distribution_from_latent function)? In the first case, during training, after a certain number of episodes I experienced a drop in the average reward. If I put this as the final network layer, this doesn't happen (although convergence is not as good).
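For what it's worth, one hedged way to do the rescaling on the network side would be to let the distribution itself emit actions already scaled to the env bounds; this subclass is hypothetical, not part of the linked branch, and low/high would have to be passed in (e.g. from ac_space):

import tensorflow as tf

class ScaledBetaProbabilityDistribution(BetaProbabilityDistribution):
    def __init__(self, flat, low, high):
        super().__init__(flat)
        self.low, self.high = low, high

    def _scale(self, x):
        # map a Beta sample in [0, 1] onto the env action bounds
        return x * (self.high - self.low) + self.low

    def sample(self):
        return self._scale(self.dist.sample())

    def mode(self):
        return self._scale(self.dist.mode())

    def neglogp(self, x):
        # map the env-scale action back to [0, 1] before evaluating the log-prob;
        # the constant log-Jacobian of the affine map is omitted (it does not
        # change policy-gradient updates)
        x = (x - self.low) / (self.high - self.low)
        return tf.reduce_sum(-self.dist.log_prob(x), axis=-1)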

@antoine-galataud

@HareshMiriyala I've seen rescaling performed in various parts of the code, depending on the env, the framework, or the project. In my opinion, it shouldn't impact overall performance, as long as rescaling consistently gives the same output for a given input.
Out of curiosity, what type of problem are you applying this to, and how does the performance compare to the Gaussian pd?
