
Issues with the RewardFilter #2

Open · vwxyzjn opened this issue Jan 23, 2020 · 2 comments

vwxyzjn commented Jan 23, 2020

Hi, thanks for the great work. I have three questions if you don't mind.

  1. In the RewardFilter code, the comments describe it as "Incorrect reward normalization". I was wondering if you could elaborate. Does that mean we should avoid using RewardFilter because of the incorrect normalization and use ZFilter for reward normalization instead? (A rough sketch of my understanding of the two filters is below, after this list.)

Another concern I have is with how reset() interacts with the RewardFilter. It seems that in your customized envs, reset() is defined as:

    def reset(self):
        # Reset the state, and the running total reward
        start_state = self.env.reset()
        self.total_true_reward = 0.0
        self.counter = 0.0
        self.state_filter.reset()
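        # note: self.reward_filter is not reset anywhere in this method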
        return self.state_filter(start_state, reset=True)
  2. It seems the reward_filter is never reset, yet it always multiplies its running return by gamma at every step. Could this be a bug?

  3. The reward_filter already uses gamma as one of its inputs, but do you still apply gamma again when computing the advantage, or is that somehow omitted?
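
For reference, here is my rough understanding of what the two filters do. This is a simplified sketch, not the actual code from this repo, and the class and attribute names are only illustrative:

    import numpy as np

    class RunningStat:
        """Streaming mean/std estimate (Welford's algorithm)."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0

        def push(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        @property
        def std(self):
            return np.sqrt(self.m2 / self.n) if self.n > 1 else 1.0

    class ZFilter:
        """Z-score normalization: y = (x - running_mean) / running_std."""
        def __init__(self):
            self.rs = RunningStat()

        def __call__(self, x):
            self.rs.push(x)
            return (x - self.rs.mean) / (self.rs.std + 1e-8)

    class RewardFilter:
        """'Incorrect' normalization: keep a running discounted return and divide
        the raw reward by the std of that return, without subtracting any mean."""
        def __init__(self, gamma):
            self.gamma = gamma
            self.rs = RunningStat()
            self.ret = 0.0   # running discounted return, never reset between episodes here

        def __call__(self, r):
            self.ret = self.ret * self.gamma + r   # accumulate past rewards
            self.rs.push(self.ret)                 # stats are of the return, not the reward
            return r / (self.rs.std + 1e-8)        # reward is rescaled, never centered

As far as I can tell, the "incorrect" part is that the reward is divided by the std of the discounted return without subtracting (and adding back) a mean, so it is not a proper z-score like ZFilter; my question is whether that is a reason to prefer ZFilter for rewards.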

Thanks.

@walkacross

Hi @vwxyzjn,

To my understanding, the state_filter and the reward_filter keep their internal RunningStats objects alive and are never reset, which means they track running statistics across all episodes rather than per episode (I am a bit concerned about how reasonable that is).

The gamma in the reward_filter and the gamma in the advantage calculation have different meanings and purposes: the former maintains a running discounted sum of past rewards (it discounts the accumulated return and then updates it, much like a momentum update), while the latter discounts future rewards.
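
As a rough sketch of the difference (simplified, and the function names below are made up, not the code from this repo):

    import numpy as np

    def scale_rewards(rewards, gamma=0.99):
        # gamma used "backwards": a running discounted sum of *past* rewards,
        # whose std is only used to rescale each raw reward.
        ret, history, scaled = 0.0, [], []
        for r in rewards:
            ret = gamma * ret + r
            history.append(ret)
            std = np.std(history) if len(history) > 1 else 1.0
            scaled.append(r / (std + 1e-8))
        return scaled

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        # gamma used "forwards": discount *future* TD errors relative to the
        # current timestep (generalized advantage estimation).
        T = len(rewards)
        adv, running = np.zeros(T), 0.0
        for t in reversed(range(T)):
            next_value = values[t + 1] if t + 1 < T else 0.0
            delta = rewards[t] + gamma * next_value - values[t]
            running = delta + gamma * lam * running
            adv[t] = running
        return adv

So gamma does appear in both places, but the reward_filter's gamma only shapes the scale used for normalization, while the advantage computation still needs its own gamma to discount future rewards.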

@andrewilyas

Hi all, I'm an author of the paper corresponding to this repo. Since this was an anonymized submission, we were unable to comment on or change the code during the review period. There is now an updated repository with better hyperparameters, where we also switched to a setup in which we reset the reward filter: https://github.com/MadryLab/implementation-matters
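
Concretely, resetting the reward filter amounts to zeroing its running discounted return when the environment resets, along the lines of the following (a minimal sketch based on the reset() quoted above, not the exact code in the new repo):

    def reset(self):
        # Reset the state, the running total reward, and both filters
        start_state = self.env.reset()
        self.total_true_reward = 0.0
        self.counter = 0.0
        self.state_filter.reset()
        self.reward_filter.reset()   # e.g. set the filter's running return back to 0
        return self.state_filter(start_state, reset=True)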
