Beta distribution as policy for environments with bounded continuous action spaces [feature request] #112

Open
skervim opened this issue Dec 4, 2018 · 22 comments
Labels
enhancement (New feature or request) · experimental (Experimental Feature) · help wanted (Help from contributors is needed)

Comments

@skervim

skervim commented Dec 4, 2018

There is an issue at OpenAI Baselines (here) about the advantages of a beta distribution over a diagonal Gaussian distribution + clipping.
The relevant paper: Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution.

Is it possible to add a beta distribution to the repository?
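A quick numerical illustration of the point made in the linked issue and paper (not part of the original request, just a NumPy sketch): clipping a Gaussian piles probability mass exactly at the action bounds, while a Beta sample rescaled to [low, high] stays strictly inside them.

import numpy as np

rng = np.random.default_rng(0)
low, high = -2.0, 2.0

# Gaussian policy whose mean sits near a bound, then clipped to the action space
gauss = np.clip(rng.normal(loc=1.5, scale=1.0, size=100_000), low, high)
# Beta sample in [0, 1], rescaled to the action space
beta = low + (high - low) * rng.beta(a=2.0, b=2.0, size=100_000)

print("clipped-Gaussian samples stuck at a bound:", np.mean((gauss == low) | (gauss == high)))
print("Beta samples at a bound:", np.mean((beta == low) | (beta == high)))  # essentially 0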

araffin added the enhancement label on Dec 4, 2018
@araffin
Collaborator

araffin commented Dec 4, 2018

Hello,

Is it possible to add a beta distribution to the repository?

This is not planned, but we are open to PRs. Also, as for the Huber loss (see #95), we would need to run several benchmarks to assess the utility of such a feature before merging it.

@antoine-galataud

Hello,

I'm working on continuous control problems with asymmetric, bounded continuous action spaces. While Gaussian policies offer decent performance, training often takes a long time and the action distribution often doesn't fully match the problem space.
My current workaround is to rescale the action and clip it (by the way, I had to disable the way clipping is currently done so I could apply a custom transformation). But that only helps with matching environment constraints.

Some real-world continuous control problems would benefit from this, mainly mechanical engine part control and industrial machine optimization (e.g. calibration).
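For reference, a minimal sketch of the rescale-and-clip workaround described above, written as a gym ActionWrapper (the wrapper name is hypothetical, and it assumes the Gaussian policy's raw output lives roughly in [-1, 1]):

import gym
import numpy as np

class RescaleClipAction(gym.ActionWrapper):
    """Map a raw policy output in [-1, 1] onto the env's bounded action space, then clip."""
    def action(self, action):
        low, high = self.action_space.low, self.action_space.high
        action = low + (high - low) * (action + 1.0) / 2.0
        return np.clip(action, low, high)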

araffin added the help wanted label on Dec 5, 2018
@antoine-galataud

I'm testing the following (draft) implementation

import tensorflow as tf
from stable_baselines.common.distributions import ProbabilityDistribution


class BetaProbabilityDistribution(ProbabilityDistribution):
    def __init__(self, flat):
        self.flat = flat
        # as per http://proceedings.mlr.press/v70/chou17a/chou17a.pdf:
        # softplus keeps the concentrations positive, and adding 1 makes the
        # distribution unimodal (alpha, beta > 1)
        # NOTE: len(flat.shape) - 1 is the tensor rank minus one (1 unit for a
        # [batch, features] latent); for multi-dimensional actions this should
        # be the action dimension instead
        alpha = 1.0 + tf.layers.dense(flat, len(flat.shape) - 1, activation=tf.nn.softplus)
        beta = 1.0 + tf.layers.dense(flat, len(flat.shape) - 1, activation=tf.nn.softplus)
        self.dist = tf.distributions.Beta(concentration1=alpha, concentration0=beta,
                                          validate_args=True, allow_nan_stats=False)

    def flatparam(self):
        return self.flat

    def mode(self):
        return self.dist.mode()

    def neglogp(self, x):
        # sum over action dimensions to get the negative log-likelihood of the full action
        return tf.reduce_sum(-self.dist.log_prob(x), axis=-1)

    def kl(self, other):
        assert isinstance(other, BetaProbabilityDistribution)
        return self.dist.kl_divergence(other.dist)

    def entropy(self):
        return self.dist.entropy()

    def sample(self):
        return self.dist.sample()

    @classmethod
    def fromflat(cls, flat):
        return cls(flat)

For now I've been able to run it with a custom policy like:

        pdtype = BetaProbabilityDistributionType(ac_space)
        ...

        obs = self.processed_x

        with tf.variable_scope("model"):
            # policy (actor) network
            x = obs
            ...
            x = tf.nn.relu(tf.layers.dense(x, 128, name='pi_fc' + str(i), kernel_initializer=U.normc_initializer(1.0)))
            self.policy = tf.layers.dense(x, ac_space.shape[0], name='pi')
            # the Beta distribution is built from the latent features x
            self.proba_distribution = pdtype.proba_distribution_from_flat(x)

            # value (critic) network
            x = obs
            ...
            x = tf.nn.relu(tf.layers.dense(x, 128, name='vf_fc' + str(i), kernel_initializer=U.normc_initializer(1.0)))
            value_fn = tf.layers.dense(x, 1, name='vf')
            self.q_value = tf.layers.dense(value_fn, 1, name='q')

...            

I'm now running it with PPO1 & PPO2 against my benchmark environment (asymmetric, bounded, continuous action space) to see how it compares with the Gaussian policy. I'm running into trouble with TRPO, but I haven't had time to investigate further.

Note: it still requires rescaling the action from [0, 1] to the environment's action space. This can be done manually, or a custom post-processing mechanism for the action could be added.
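The custom policy above references a BetaProbabilityDistributionType that is not shown in this thread. A minimal sketch of what such a class could look like, modelled on DiagGaussianProbabilityDistributionType from stable_baselines.common.distributions (an assumption, not the branch code; method names may differ between versions):

import tensorflow as tf
from stable_baselines.common.distributions import ProbabilityDistributionType

class BetaProbabilityDistributionType(ProbabilityDistributionType):
    def __init__(self, ac_space):
        self.size = ac_space.shape[0]  # number of action dimensions

    def probability_distribution_class(self):
        return BetaProbabilityDistribution

    def proba_distribution_from_flat(self, flat):
        # 'flat' is the latent vector produced by the policy network
        return self.probability_distribution_class()(flat)

    def param_shape(self):
        return [2 * self.size]  # one alpha and one beta per action dimension

    def sample_shape(self):
        return [self.size]

    def sample_dtype(self):
        return tf.float32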

@antoine-galataud

Well, after testing a bit, it doesn't seem to improve overall performance, at least on my environment (I didn't test on classic control tasks). It does seem to converge, but it's slower and the average reward is lower than with the Gaussian policy.

@araffin
Collaborator

araffin commented Dec 6, 2018

I would say you need some hyperparameter tuning... The parameters in the current implementation were tuned for Gaussian policies, so it is not completely fair to compare them without tuning.

@antoine-galataud

@araffin I'll try to spend some time on that. Any idea which hyperparameters would be best to tune first?

@araffin
Collaborator

araffin commented Dec 6, 2018

The best practice would be to use Hyperband or Hyperopt to do it automatically (see https://github.com/araffin/robotics-rl-srl#hyperparameter-search).
This script written by @hill-a can get you started.

Otherwise, with PPO, the hyperparameters that matter most in my experience are n_steps (together with nminibatches), ent_coef (the entropy coefficient), and lam (the GAE lambda coefficient). Additionally, you can also tune noptepochs, cliprange, and the learning rate.
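For illustration, here is how those hyperparameters map onto a PPO2 constructor call; the values are arbitrary starting points rather than tuned recommendations, and env is assumed to be an already-created gym environment:

from stable_baselines import PPO2

model = PPO2(
    "MlpPolicy",            # or a custom policy class using the Beta distribution
    env,
    n_steps=2048,           # rollout length; tune together with nminibatches
    nminibatches=32,
    ent_coef=0.0,           # entropy coefficient
    lam=0.95,               # GAE lambda
    noptepochs=10,
    cliprange=0.2,
    learning_rate=2.5e-4,
    verbose=1,
)
model.learn(total_timesteps=int(1e6))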

@HareshKarnan

@antoine-galataud can you share your implementation of the beta distribution?

@antoine-galataud

@HareshMiriyala sure, I'll PR that soon.

@antoine-galataud

I don't think it's ready for a PR so here is the branch link: https://github.com/antoine-galataud/stable-baselines/tree/beta-pd

This is based on the TensorFlow Beta implementation and on Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution (notably the idea of using a softplus activation and adding 1 as a constant to alpha and beta).

Usage: I didn't work on configuring which distribution to use in a generic manner, so you have to use it in a custom policy. Ideally there should be a way to choose between Gaussian and Beta in

def make_proba_dist_type(ac_space):

You can refer to the example above about creating a custom policy that uses it.
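A hedged sketch of what such a switch could look like; the use_beta flag does not exist in stable-baselines and is shown only to illustrate selecting the distribution from configuration:

from gym import spaces
from stable_baselines.common.distributions import DiagGaussianProbabilityDistributionType

def make_proba_dist_type(ac_space, use_beta=False):
    if isinstance(ac_space, spaces.Box):
        assert len(ac_space.shape) == 1, "Error: the action space must be a vector"
        if use_beta:
            return BetaProbabilityDistributionType(ac_space)
        return DiagGaussianProbabilityDistributionType(ac_space.shape[0])
    # ... handling of Discrete, MultiDiscrete, MultiBinary spaces unchanged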

@araffin
Collaborator

araffin commented Jan 7, 2019

@antoine-galataud before submitting a PR, please look at the contribution guide #148 (that would save time ;))
It will be merged with master soon.

@HareshKarnan

@antoine-galataud Thanks a bunch !

@HareshKarnan

@antoine-galataud How are you handling the scaling of the sample from the beta distribution (0,1) to the action space bounds?

@antoine-galataud

@HareshMiriyala I do it like this:

def step(self, action):
  action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
  ...
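For context, a minimal sketch of a custom gym env doing this rescaling inside step(); the class name and bounds are hypothetical, and the placeholder dynamics just show where real environment logic would go:

import gym
import numpy as np
from gym import spaces

class MyBoundedEnv(gym.Env):
    def __init__(self):
        # real actuator bounds, e.g. a valve position between 10% and 90%
        self.action_space = spaces.Box(low=np.array([10.0]), high=np.array([90.0]), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)

    def step(self, action):
        # 'action' comes from the Beta policy, so it lies in [0, 1]
        action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
        obs = np.zeros(4, dtype=np.float32)   # placeholder dynamics
        return obs, 0.0, False, {}

    def reset(self):
        return np.zeros(4, dtype=np.float32)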

@HareshKarnan

Thanks, where do you make this change? I'm not so familiar with the codebase; can you guide me to which file and class to change?

@skervim
Author

skervim commented Jan 11, 2019

@araffin: Could you give an estimate of how long it will take until the beta distribution is merged into master? Thanks in advance!

@araffin
Collaborator

araffin commented Jan 11, 2019

@skervim well, I don't know, as I'm not in charge of implementing it or testing it. However, that does not mean you cannot test it beforehand (cf. install from source in the docs).

@skervim
Author

skervim commented Jan 11, 2019

@araffin: Sorry!! I misunderstood your message:

It will be merged with master soon.

@antoine-galataud: I don't know if it helps, but there is also a beta distribution implemented in Tensorforce (here). Maybe it can serve as a reference? Thank you very much for implementing the beta distribution. I think it will help in a lot of environments and RL problems. :)

@antoine-galataud

@HareshMiriyala the step() function is one that you implement when you write a custom gym env. You can also modify an existing env to see how it goes.

@skervim I couldn't dedicate time to testing (apart from a quick test with a custom env). I also have to write unit tests. If you have access to continuous control environments (MuJoCo, ...) and can give it a try, that would definitely help. Apart from that, I'd like to provide better integration with action value scaling and distribution type selection based on configuration, maybe later, if we see any benefit from this implementation. That doesn't prevent testing it as is, anyway.

@araffin
Collaborator

araffin commented Jan 11, 2019

@skervim if you want to test on continuous envs for free (no MuJoCo licence required), I recommend the PyBullet envs (see the rl baselines zoo).
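For example (assuming the pybullet package is installed; the env id is one of those registered by pybullet_envs):

import gym
import pybullet_envs  # noqa: F401  (importing this registers the Bullet envs with gym)

env = gym.make("HalfCheetahBulletEnv-v0")
print(env.action_space)  # a bounded, continuous Box action space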

@AGPX

AGPX commented Feb 4, 2020

@HareshMiriyala I do it like this:

def step(self, action):
  action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
  ...

@antoine-galataud Is it legit/better to perform this operation in the step function of the environment, or is it better to put it in the network (updating the proba_distribution_from_latent function)? In the first case, during training, after a certain number of episodes I experienced a drop in the average reward. If I put this as the final network layer, this doesn't happen (although convergence is not as good).
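For what it's worth, one hedged way to do the rescaling on the network side would be to let the distribution itself emit actions already scaled to the env bounds; this subclass is hypothetical, not part of the linked branch, and low/high would have to be passed in (e.g. from ac_space):

import tensorflow as tf

class ScaledBetaProbabilityDistribution(BetaProbabilityDistribution):
    def __init__(self, flat, low, high):
        super().__init__(flat)
        self.low, self.high = low, high

    def _scale(self, x):
        # map a Beta sample in [0, 1] onto the env action bounds
        return x * (self.high - self.low) + self.low

    def sample(self):
        return self._scale(self.dist.sample())

    def mode(self):
        return self._scale(self.dist.mode())

    def neglogp(self, x):
        # map the env-scale action back to [0, 1] before evaluating the log-prob;
        # the constant log-Jacobian of the affine map is omitted (it does not
        # change policy-gradient updates)
        x = (x - self.low) / (self.high - self.low)
        return tf.reduce_sum(-self.dist.log_prob(x), axis=-1)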

@antoine-galataud

@HareshMiriyala I've seen rescaling performed in various parts of the code, depending on the env, the framework, or the project. In my opinion, it shouldn't impact overall performance, as long as rescaling consistently gives the same output for a given input.
Out of curiosity, what type of problem are you applying this to, and how does the performance compare to the Gaussian pd?
