Actions normalization. How to implement it? #678

Closed
AGPX opened this issue Feb 4, 2020 · 10 comments
Labels: question (Further information is requested), RTFM (Answer is the documentation)

Comments

AGPX commented Feb 4, 2020

Hi,

I'm struggling to replace the PPO implementation in DeepMimic with PPO2 from stable-baselines. The actor barely learns to walk for two or three steps and then falls. I now believe I have an action normalization problem (something that is fully handled in the DeepMimic PPO implementation). From the stable-baselines documentation, I read:

normalize your action space and make it symmetric when continuous (cf potential issue below) A good practice is to rescale your actions to lie in [-1, 1]. This does not limit you as you can easily rescale the action inside the environment

How to do this is not clear at all. DeepMimic has continuous observation and action spaces (197 reals for the observation and 36 reals for the action). I know the bounds of the actions, but the problem is that the output of the MlpPolicy can take values far bigger than 1 in absolute value.
We have 2 hidden layers of size 1024 and 512. In theory I would have to normalize the output of the network, but AFAIK it's not possible to add more layers after the output, am I right? And I think the normalization must be done in the network (not in the environment) for backpropagation to work correctly.
So, how can I normalize the actions to the range [-1, 1]?
In addition, is there a way to check whether the action space is sampled uniformly (considering that it is a 36D space)?
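
For reference, a minimal sketch of how such spaces might be declared in gym (the bounds below are placeholders, not the real DeepMimic limits):

```python
import numpy as np
from gym import spaces

# 197-dimensional observation, 36-dimensional action (placeholder bounds).
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(197,), dtype=np.float32)
action_space = spaces.Box(low=-1.0, high=1.0, shape=(36,), dtype=np.float32)
```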

Thanks in advance.

P.S.: Is there any example of how to handle this with OpenAI baselines? I couldn't find any on the internet.

araffin added the question (Further information is requested) label Feb 4, 2020
Miffyli (Collaborator) commented Feb 5, 2020

As you pointed out, this information is in the documentation. Just below the line you quoted there is an example covering the different action spaces. No need to touch the network here: just map your continuous actions from/to the [-1, 1] interval, exactly as if you were normalizing, e.g., the variables of datapoints to [-1, 1] with x = [(x - min(x)) / (max(x) - min(x)) - 0.5] * 2.
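
A minimal sketch of that mapping as plain numpy helpers (the function names are illustrative, assuming the true action bounds `low`/`high` are known arrays):

```python
import numpy as np

def to_unit_interval(action, low, high):
    """Map an action from [low, high] to [-1, 1], element-wise."""
    return 2.0 * (action - low) / (high - low) - 1.0

def from_unit_interval(action, low, high):
    """Map an action from [-1, 1] back to [low, high], element-wise."""
    return low + 0.5 * (action + 1.0) * (high - low)
```

Applied to your case, `from_unit_interval` would be called inside the environment's `step` on the action received from the agent.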

AGPX (Author) commented Feb 5, 2020

@Miffyli Thanks so much for the reply. If I understand correctly, I have to add code like what you suggested to the environment's step function, in order to map the actions to the [-1, 1] interval.

Are min(x) and max(x) the running minimum/maximum of the produced actions? They cannot be the (known) action bounds, because x (the output of the network) falls outside the action boundaries regardless of how I set the low and high values of the action space. And related to this, should I set low=-1 and high=1?

araffin added the RTFM (Answer is the documentation) label Feb 5, 2020
araffin (Collaborator) commented Feb 5, 2020

MlpPolicy can have values by far bigger than 1

From the documentation:
"For all algorithms (except DDPG, TD3 and SAC), continuous actions are clipped during training and testing (to avoid out of bound error)."

Also related: #112

So, as a summary:

  • PPO is parametrized by a Gaussian, so it can in theory output arbitrarily large values.

  • In practice, the action is clipped to match the boundaries (but because of that, and because of the initialization, it is recommended to do the rescaling inside the environment and keep the bounds in [-1, 1]).

  • A cleaner solution would be to use a squashed Gaussian (using tanh), as is done for SAC.
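
As a rough numerical illustration of the difference (plain numpy, not stable-baselines code): clipping piles probability mass exactly at the bounds, while a tanh squash keeps actions strictly inside (-1, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(loc=0.0, scale=2.0, size=10_000)  # unsquashed Gaussian "actions"

clipped = np.clip(raw, -1.0, 1.0)   # clipping at the boundary
squashed = np.tanh(raw)             # squashed Gaussian, as in SAC

print("fraction sitting exactly on a bound after clipping:", np.mean(np.abs(clipped) == 1.0))
print("fraction sitting exactly on a bound after tanh:    ", np.mean(np.abs(squashed) == 1.0))
```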

AGPX (Author) commented Feb 5, 2020

but because of that and the initialization, it is recommended to do the rescaling inside the environment and keep the bounds in [-1, 1]

This means it is better to (avoiding a tanh as the final layer):

  1. Declare the action space as a Box without low and high bounds.
  2. Rescale the action in the environment, using as min and max the minimum and maximum action values received so far.

Is it correct?

araffin (Collaborator) commented Feb 5, 2020

Is it correct?

Not really... You are creating the environment, so you know the limits of the actions in advance.
As recommended (and for the reasons mentioned in the doc and in this issue), you should tell the agent that the actions are in [-1, 1], and then, in your env, rescale the received actions (which will lie in [-1, 1]) to the correct range.
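
A minimal sketch of that idea as a gym `ActionWrapper` (the wrapper name and structure are illustrative, not part of stable-baselines):

```python
import gym
import numpy as np

class RescaleActionWrapper(gym.ActionWrapper):
    """Expose a [-1, 1] action space to the agent, rescale to the env's true bounds."""

    def __init__(self, env):
        super().__init__(env)
        self.low = env.action_space.low
        self.high = env.action_space.high
        # The agent only ever sees a symmetric, normalized action space.
        self.action_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=env.action_space.shape, dtype=np.float32
        )

    def action(self, action):
        # Map the agent's action from [-1, 1] back to [low, high] before stepping the env.
        action = np.clip(action, -1.0, 1.0)
        return self.low + 0.5 * (action + 1.0) * (self.high - self.low)
```

The agent is then trained on the wrapped environment, e.g. `PPO2('MlpPolicy', DummyVecEnv([lambda: RescaleActionWrapper(raw_env)]))`.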

AGPX (Author) commented Feb 5, 2020

@araffin Okay, I'll give it a try, but in all the numerous attempts I have made, I couldn't even get close to the sample efficiency of DeepMimic's PPO implementation (they use SGD with momentum rather than Adam as the optimizer, but I don't think that alone can make all this difference). In my opinion, their PPO implementation contains tricks that would be worth analyzing (especially regarding normalization), because they might lead to an improvement of the current PPO2 algorithm.

AGPX (Author) commented Feb 7, 2020

@araffin @Miffyli Okay, I tried both solutions, and adding the normalization directly in the network (through a Tanh layer) gave me by far the best results. Telling the agent that the actions are in [-1, 1] without using Tanh simply does not work, because the actions saturate (to -1 or 1) from the very first evaluation step, which makes the actor immediately unstable (and therefore learning proceeds very, very slowly, if it proceeds at all).
Big movements are the enemy (any brutal movement makes a physical character fall relentlessly... the force of gravity does not forgive!), so I tried to compensate for this by reducing the cliprange but, still, the result is not satisfactory. By contrast, the output of the network implemented in DeepMimic is very small from the beginning, and the actor slowly but consistently learns to take the first steps (and, in the end, it walks indefinitely).
Anyway, even with the Tanh trick (btw, it would be nice if there were an easier way to add it...), my actor manages to take a few steps and then loses its balance. With the same number of iterations, the original code gives a stable (although not perfect) walk: there must be a problem somewhere else. Extensive tuning of the hyperparameters, or using Nadam or momentum instead of Adam, was useless.
I have almost run out of arrows for my bow... in the end I think I will rewrite DeepMimic's PPO implementation in C++ (I think I will use PyTorch, which has a much better C++ API than TensorFlow). It's a shame, because I would have liked to play with the other RL algorithms to check which ones give the best results in this challenging environment (by the way, I noticed that very few of them allow parallelizing execution over multiple processes, but in applications like DeepMimic, where the GPU can give little help, multiprocessing is fundamental; finally, I really love the idea of vectorized environments implemented by OpenAI!).
Phew, how much I wrote, sorry! Let me conclude by saying that there is a great effort to offer AI courses and democratize the field, but when one is faced with these difficulties, it is really hard to find someone who can give you support. So thank you very much for your advice and your time, guys.

denyHell commented:
Just curious, what to do if:

  • the action space is not bounded

  • the action space A(s) depends on the current state s

araffin (Collaborator) commented Feb 19, 2020
araffin commented Feb 19, 2020

the action space is not bounded

In real life, infinity does not exist, so you usually have an upper bound that has some physical meaning (e.g. torque limits for a robot).

the action space A(s) depends on current state s

This is research, I think there is an issue about that: #461

araffin closed this as completed Feb 28, 2020
HJ-TANG commented Oct 7, 2020

@araffin @Miffyli Okay, I tried both solutions, and adding the normalization directly in the network (through a Tanh layer) gave me by far the best results. Telling the agent that the actions are in [-1, 1] without using Tanh simply does not work, because the actions saturate (to -1 or 1) from the very first evaluation step, which makes the actor immediately unstable (and therefore learning proceeds very, very slowly, if it proceeds at all). [...]

Hi, you said ''adding the normalization directly in the network (through a Tanh layer)''. Do you mean using tanh as the activation function? I don't know how to do that. I scaled the actions to [-1, 1] and ran into the same problem you mentioned: without using Tanh it simply does not work, because the actions saturate (to -1 or 1) from the first evaluation step and the actor immediately becomes unstable.
Waiting for your reply, thanks a lot in advance!
