
Paper and implementation are different. #6

Closed
jcwleo opened this issue Oct 29, 2018 · 9 comments

Comments


jcwleo commented Oct 29, 2018

In the paper,
[image: equation from the paper using the sum of discounted rewards]
but the implementation just uses the reward, not the sum of discounted rewards.
Why is it different?


yburda commented Oct 29, 2018

We are normalizing the reward here:

rews = self.rollout.buf_rews / np.sqrt(self.rff_rms.var)

The normalization is by the running std of a sum of discounted rewards, computed here:
class RewardForwardFilter(object):
    def __init__(self, gamma):
        self.rewems = None
        self.gamma = gamma

    def update(self, rews):
        if self.rewems is None:
            self.rewems = rews
        else:
            self.rewems = self.rewems * self.gamma + rews
        return self.rewems

One caveat is that for convenience we do the discounting backwards in time rather than forwards (it's convenient because at any moment the past is fully available and the future is yet to come).
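For context, a minimal sketch of how the filter and the normalization might fit together; the RunningMeanStd helper, the buffer shape, and the per-step loop below are illustrative assumptions, not the repo's exact code:

import numpy as np

class RunningMeanStd:
    # Simplified running-variance tracker (assumed helper, not the repo's own class).
    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        batch_mean, batch_var, batch_count = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean += delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.var = m2 / total
        self.count = total

rff = RewardForwardFilter(gamma=0.99)   # class as defined above
rff_rms = RunningMeanStd()

# Toy rollout of intrinsic rewards, shape (n_envs, n_steps) -- illustrative only.
buf_rews = np.random.rand(8, 128)

# Feed rewards one timestep at a time; each call returns a discounted running sum,
# and the running variance is estimated over those sums.
for t in range(buf_rews.shape[1]):
    rff_rms.update(rff.update(buf_rews[:, t]))

# Normalize the raw rewards by the running std of the discounted sums.
rews = buf_rews / np.sqrt(rff_rms.var)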


jcwleo commented Oct 29, 2018

@yburda Thank you for the reply, but I already know that code. RewardForwardFilter.rewems is None, so RewardForwardFilter.update just returns the input reward.
Am I thinking wrong?


yburda commented Oct 30, 2018

Thank you for pointing this out. We will update the paper (we reported results with a version of the code very similar to the published one, so the code is representative).


jcwleo commented Oct 31, 2018

@yburda Did you mean that you did not use the sum of discounted rewards?


yburda commented Oct 31, 2018

Yes.


jcwleo commented Oct 31, 2018

@yburda Thank you very much! :)

jcwleo closed this as completed Oct 31, 2018

yburda commented Nov 1, 2018

Upon thinking about it a bit longer - RewardForwardFilter.rewems is None only the first time you call update. Then it assigns something to it:

self.rewems = rews

And for all future calls to update, it's not None anymore.

Sorry for the temporary confusion.
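A quick check with toy numbers (gamma chosen arbitrarily) shows the running sum kicking in from the second call onward:

rff = RewardForwardFilter(gamma=0.5)
print(rff.update(1.0))  # 1.0  -> first call: rewems was None, so the raw reward comes back
print(rff.update(1.0))  # 1.5  -> 1.0 * 0.5 + 1.0
print(rff.update(1.0))  # 1.75 -> 1.5 * 0.5 + 1.0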

yburda reopened this Nov 1, 2018

jcwleo commented Nov 1, 2018

@yburda Oh I was so stupid... Thank you for letting me know.

@alirezakazemipour

We are normalizing the reward here:

rews = self.rollout.buf_rews / np.sqrt(self.rff_rms.var)

The normalization is by running std of a sum of discounted rewards here:

class RewardForwardFilter(object):
    def __init__(self, gamma):
        self.rewems = None
        self.gamma = gamma

    def update(self, rews):
        if self.rewems is None:
            self.rewems = rews
        else:
            self.rewems = self.rewems * self.gamma + rews
        return self.rewems

One caveat is that for convenience we do the discounting backward in time rather than forwards (it's convenient because at any moment the past is fully available and the future is yet to come).

Sir, I think the code is not right; it should have been self.I.buf_rews_int.T[::-1], as 4kasha has mentioned.
Also, "it's convenient because at any moment the past is fully available and the future is yet to come" does not make sense, since all the intrinsic rewards in a rollout have already been collected, so the future is available too.
