Some questions regarding VecNormalize #698
The boundaries of the observation space do not really matter (for everything that is not images); we usually set them to [-inf, inf].
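For illustration, a minimal sketch of what that looks like for a custom (non-image) environment, using the standard `gym` API:

```python
import numpy as np
from gym import spaces

# Unbounded observation space: VecNormalize rescales observations anyway,
# so the declared bounds only need to be type-consistent, not tight.
observation_space = spaces.Box(low=-np.inf, high=np.inf,
                               shape=(2,), dtype=np.float32)
```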
Good question, the answer is there: openai#538 and openai#629
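For context (this is not part of the original reply): VecNormalize accepts a `gamma` argument that it uses to maintain a running estimate of the discounted return, which is what the reward normalization is based on. A hedged sketch, where the environment name and the choice of matching gamma to the algorithm are illustrative assumptions:

```python
import gym
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

GAMMA = 0.99  # assumption: same value you pass to the algorithm, e.g. PPO2(..., gamma=GAMMA)

venv = DummyVecEnv([lambda: gym.make('CartPole-v1')])
# `gamma` drives VecNormalize's running estimate of the discounted return,
# which the reward normalization is based on.
env = VecNormalize(venv, norm_obs=True, norm_reward=True, gamma=GAMMA)
```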
Yes.
We should change that (we would appreciate a PR for it); it is an old example, and there is no real reason not to normalize the reward too.
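In code terms, the suggested change to the example would be as small as the following sketch, assuming `env` is already a vectorized environment:

```python
from stable_baselines.common.vec_env import VecNormalize

# Normalize the rewards as well as the observations
# (the docs example only enabled observation normalization)
env = VecNormalize(env, norm_obs=True, norm_reward=True)
```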
Thank you! One more question: is VecNormalize compatible with GoalEnv? How would I go about using VecNormalize with HER?
For now, it is not.
I would advise using DDPG and its built-in normalization in that case, or creating a […]
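For reference, DDPG in stable-baselines exposes this built-in normalization through constructor flags. A sketch, assuming `env` is an existing (continuous-action) environment; check your version's signature:

```python
from stable_baselines import DDPG

# DDPG normalizes internally, so no VecNormalize wrapper is needed here
model = DDPG('MlpPolicy', env,
             normalize_observations=True,  # running mean/std of observations
             normalize_returns=True)       # running mean/std of returns
```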
Ok, thanks!
Oh, so not all algorithms require us to normalize the observations? Are there others that have this built-in besides DDPG?
No. As mentioned in the docs, it is usually crucial for PPO and A2C, but not for SAC and TD3, for instance. DDPG is the only one with it built-in, for legacy reasons.
I wanted to follow up on this topic to ensure I am implementing VecNormalize properly. I am primarily looking at how to continue training while keeping the moving average from the previous training. I have attached some code below:

```python
env_learn = SubprocVecEnv(env_list)
env_learn = VecCheckNan(env_learn, raise_exception=True)
env_learn = VecNormalize(env_learn, training=True, norm_obs=True, norm_reward=True)

env_render = Simulator()  # custom environment used for rendering
env_render.render_environment = True

model = PPO2(policy='CustomPolicy', env=env_learn, verbose=1,
             vf_coef=VF_COEFF,
             noptepochs=EPOCHS,
             ent_coef=ENT_COEFF,
             learning_rate=LEARNING_RATE,
             tensorboard_log=tensorboard_log_location,
             n_steps=NSTEPS,
             nminibatches=MINIBATCHES)
model.save(results_folder + run_name)

# Training the model
for i in range(number_training_steps):
    logname = run_name + '_' + str(i)
    model.learn(total_timesteps=int(total_timesteps / number_training_steps),
                reset_num_timesteps=False,
                tb_log_name=logname)
    env_learn.close()
    path = results_folder + logname
    model.save(path)

    # Testing the performance of the model
    for j in range(3):
        obs = env_render.reset()
        done = False
        rewards = 0.0
        while not done:
            action, _states = model.predict(obs)
            obs, reward, done, info = env_render.step(action)
            rewards += reward
            env_render.render(mode='file')
        print(f'For training step {i} and test number {j}, the reward was: {rewards}')

    if i < number_training_steps:
        env_learn = SubprocVecEnv(env_list)
        env_learn = VecCheckNan(env_learn, raise_exception=True)
        env_learn = VecNormalize(env_learn, training=True, norm_obs=True, norm_reward=True)
        model.load(load_path=path)
        model.set_env(env_learn)
```

Curious if anyone has done an implementation like this in the past and if I am doing it correctly. I am concerned that, the way I currently have it implemented, the moving averages won't be retained for the next set of training.
Does the layer normalization used in some of the policies accomplish the same thing as VecNormalize?
Layer normalization is quite different; see the associated paper: https://arxiv.org/abs/1607.06450
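To make the contrast concrete, here is a minimal NumPy sketch of layer normalization (without the learned gain and bias of the full method): it normalizes across the features of a single sample inside the network at every forward pass, whereas VecNormalize tracks running statistics of the environment's observations over time.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-sample statistics over the feature dimension,
    # recomputed on every forward pass (no running average)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)
```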
@cevans3098 I can only recommend you take a look at the RL zoo; you forgot to save and load the VecNormalize statistics.

Closing this issue, as the original question was answered.
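For completeness, the fix amounts to saving the VecNormalize statistics alongside the model and restoring them before resuming, which is what the RL zoo does. A sketch, assuming a stable-baselines version (>= 2.10) where VecNormalize exposes `save()` and a static `load()` (older versions used `save_running_average()`/`load_running_average()`); `env_list`, `model`, and `path` come from the snippet above, and the stats file name is hypothetical:

```python
from stable_baselines.common.vec_env import SubprocVecEnv, VecCheckNan, VecNormalize

stats_path = path + '_vecnormalize.pkl'  # hypothetical file name

# After model.learn(), before discarding the environment:
env_learn.save(stats_path)
env_learn.close()

# When resuming training, restore the running averages onto a fresh env:
venv = VecCheckNan(SubprocVecEnv(env_list), raise_exception=True)
env_learn = VecNormalize.load(stats_path, venv)
model.set_env(env_learn)
```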
According to the docs, when creating custom environments we should always normalize the observation space. For this, there is the VecNormalize wrapper, which keeps a moving average and then normalizes the observations.
Let's say I have 2 observations: the height (m) and weight (kg) of a person. My observation space would be something like a Box with `low = [0, 0]` and `high = [2.5, 300]`. But since I'm using VecNormalize, this isn't correct anymore, right? So should I instead change it to `low = [-10, -10]` and `high = [10, 10]`? (10 being the default clipping value for VecNormalize.)

Another question: when should we normalize the rewards as well? (In the MuJoCo example shown in the docs you chose to only normalize the observations. Why?)
Finally, what's the purpose of the discount factor? Should it be the same as the discount factor of whatever algorithm we're using?
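For what it's worth, a sketch of how these pieces fit together, based on the answers above: the declared bounds can simply be infinite, and clipping is controlled on the wrapper itself (`clip_obs` defaults to 10 in VecNormalize). The toy environment is hypothetical:

```python
import numpy as np
import gym
from gym import spaces
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

class PersonEnv(gym.Env):  # hypothetical toy environment for illustration
    def __init__(self):
        # Bounds left infinite: VecNormalize handles scaling and clipping
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

    def reset(self):
        return np.array([1.7, 70.0], dtype=np.float32)  # height (m), weight (kg)

    def step(self, action):
        return self.reset(), 0.0, True, {}

# Normalization and clipping live in the wrapper, not in the space definition
env = VecNormalize(DummyVecEnv([PersonEnv]),
                   norm_obs=True, norm_reward=True, clip_obs=10.0)
```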