Hi, thanks for the great work. I have three questions if you don't mind.

In `code-for-paper/src/policy_gradients/torch_utils.py` (line 358 in 094994f), the comments suggest it uses "Incorrect reward normalization". I was wondering if you could elaborate. Does that mean we should avoid using `RewardFilter` because of the incorrect normalization and use `ZFilter` instead for the reward normalization?
Another concern I have is with the `reset()` call of the `RewardFilter`. It seems that in your customized envs,

```python
def reset(self):
    # Reset the state, and the running total reward
    start_state = self.env.reset()
    self.total_true_reward = 0.0
    self.counter = 0.0
    self.state_filter.reset()
    return self.state_filter(start_state, reset=True)
```
It seems the `reward_filter` will never be reset. However, the `reward_filter` always multiplies the existing running return by `gamma`. Could this be a bug?

The `reward_filter` already uses `gamma` as part of its inputs, but do you still calculate the advantage using `gamma` again, or is this somehow omitted?
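For concreteness, here is a minimal sketch of the kind of reward scaling I'm asking about. This is not the repo's exact code; `RunningStd` and `DiscountedReturnRewardFilter` are names made up for this illustration.

```python
import numpy as np


class RunningStd:
    """Tracks a running standard deviation with Welford's algorithm."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return float(np.sqrt(self.m2 / self.n)) if self.n > 1 else 1.0


class DiscountedReturnRewardFilter:
    """Scales each reward by the std of a running discounted return.

    The accumulator `ret` is never cleared between episodes here,
    mirroring the behavior questioned above.
    """
    def __init__(self, gamma):
        self.gamma = gamma
        self.ret = 0.0
        self.rs = RunningStd()

    def __call__(self, reward):
        # ret_t = gamma * ret_{t-1} + r_t  (a backward-looking discounted sum)
        self.ret = self.gamma * self.ret + reward
        self.rs.push(self.ret)
        # Divide by std(ret) only; the mean is not subtracted, which is
        # presumably what the "Incorrect reward normalization" comment refers to.
        return reward / (self.rs.std + 1e-8)
```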
Thanks.
Hi @vwxyzjn,

To my understanding, the `state_filter` and `reward_filter` keep their internal `RunningStats` objects alive and are never reset, which means they track running statistics across all episodes rather than per episode. (I am a little concerned about whether that is reasonable.)

The `gamma` in the `reward_filter` and the `gamma` in the advantage calculation have different meanings and purposes: the former maintains a running sum of rewards (the accumulated past return is discounted by `gamma` and the new reward is added), while the latter discounts future rewards.
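To make the distinction concrete, here is a small illustrative sketch (not code from the repo) contrasting the two uses of `gamma`:

```python
# (1) gamma in the reward filter: a *backward* discounted sum, updated online.
#     ret_t = gamma * ret_{t-1} + r_t
def running_return(rewards, gamma):
    ret, out = 0.0, []
    for r in rewards:
        ret = gamma * ret + r
        out.append(ret)
    return out


# (2) gamma in the return/advantage calculation: a *forward* discounted sum.
#     G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_future_returns(rewards, gamma):
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


rewards = [1.0, 0.0, 2.0]
print(running_return(rewards, 0.99))             # [1.0, 0.99, 2.9801]
print(discounted_future_returns(rewards, 0.99))  # [2.9602, 1.98, 2.0]
```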
Hi all, I'm an author of the corresponding paper for this repo. Since this was an anonymized submission, we were unable to comment on or change the code during the review period. There is now an updated repository with better hyperparameters, where we also switched to a system in which we reset the reward filter: https://github.com/MadryLab/implementation-matters
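For readers who want to apply a similar fix in their own wrappers, a minimal sketch of what resetting such a filter on environment reset could look like is below. The `reset_return()` method is a hypothetical name used for illustration; see the updated repo linked above for the actual implementation.

```python
def reset(self):
    # Reset the state, the running total reward, and (in this sketch)
    # the reward filter's discounted-return accumulator.
    start_state = self.env.reset()
    self.total_true_reward = 0.0
    self.counter = 0.0
    self.state_filter.reset()
    self.reward_filter.reset_return()  # hypothetical method that zeroes the running return
    return self.state_filter(start_state, reset=True)
```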