Skip to content
This repository was archived by the owner on Apr 8, 2026. It is now read-only.

Fixed forward discounted return estimation.#3152

Closed
maitchison wants to merge 2 commits into
openai:masterfrom
maitchison:master
Closed

Fixed forward discounted return estimation.#3152
maitchison wants to merge 2 commits into
openai:masterfrom
maitchison:master

Conversation

@maitchison
Copy link
Copy Markdown

Description

Forward discounted return estimation was incorrectly estimated in the normalised rewards wrapper. Having the variance of returns deviate significantly from 1.0 can cause performance problems when using reward normalization.

Fixes # (issue)

Type of change

  • [*] Bug fix (non-breaking change which fixes an issue)

Screenshots

I will include a notebook as a reference.

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@maitchison
Copy link
Copy Markdown
Author

# demonstrate the issue.

import matplotlib.pyplot as plt
import numpy as np

gamma = 0.9

def true_return(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards[::-1], terminals[::-1]):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns[::-1]

def forward_return_v1(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns

def forward_return_v2(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma + r
        returns.append(acc)
        acc *= (1-term)
    return returns

rew = [0,0,1,0,2,0]
term = [0, 0, 1, 0, 0, 0]

print("Returns on simple example.")
        
print("True:  ", true_return(rew,term))
print("Before:", forward_return_v1(rew,term))
print("After: ", forward_return_v2(rew,term))

xs = []
ys = []
zs = []
for _ in range(100):
    gamma = 0.99
    rew = np.random.randint(1,5, size=1000)
    term = np.random.rand(1000) < 0.01
    rew *= term
    a = np.asarray(true_return(rew,term))
    b = np.asarray(forward_return_v1(rew,term))
    c = np.asarray(forward_return_v2(rew,term))
    xs.append(a.var())
    ys.append(b.var())
    zs.append(c.var())
plt.scatter(xs, ys, marker='x', label='zero_before_append')
plt.scatter(xs, zs, marker='o', label='zero_after_append')
plt.plot(range(3), range(3), label='true return', color='black', ls='--')
plt.legend()
plt.show()    

@maitchison
Copy link
Copy Markdown
Author

returns

Copy link
Copy Markdown
Contributor

@pseudo-rnd-thoughts pseudo-rnd-thoughts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, I can just imagine how long this took to find.

We can merge this but are not planning on making another release as we just launched gymnasium, a fork of Gym by the maintainers of Gym for the past 18 months where all maintenance and improvements will happen moving forward. Could you please move this over to the new repo?

If you'd like to read more about the story behind the backstory behind this and our plans going forward, click here.

@pseudo-rnd-thoughts
Copy link
Copy Markdown
Contributor

I just discussed it internally a bit, could you please move it to Gymnasium and we'll merge it and it'll be in the next release? This was a big fix though, and we massively appreciate it

Comment thread gym/wrappers/normalize.py Outdated
dones = np.logical_or(terminateds, truncateds)
self.returns[dones] = 0.0
dones = np.logical_or(terminateds, truncateds)
self.returns = self.returns * self.gamma * (1-dones) + rews
Copy link
Copy Markdown
Contributor

@pseudo-rnd-thoughts pseudo-rnd-thoughts Nov 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be (1-terminated) as in cases of truncation, the state values should still be boostrapped.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, you're right. I'll make the change, add a test case too, and then move it to gymnasium.

@vwxyzjn
Copy link
Copy Markdown
Contributor

vwxyzjn commented Nov 23, 2022

Hi @maitchison this is an interesting issue. Though I am having a hard time parsing through the image. It looks like the x-axis is the variance of the true return, and the y-axis is the variance of the estimated forward-discounted return, so I guess the scatter points are like the ratio of variances?

The following chart helps me understand the idea a bit better 🙂

# demonstrate the issue.
import matplotlib.pyplot as plt
import numpy as np
gamma = 0.9

def true_return(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards[::-1], terminals[::-1]):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns[::-1]

def forward_return_v1(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns

def forward_return_v2(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma + r
        returns.append(acc)
        acc *= (1-term)
    return returns

rew = [0,0,1,0,2,0]
term = [0, 0, 1, 0, 0, 0]

print("Returns on simple example.")
        
print("True:  ", true_return(rew,term))
print("Before:", forward_return_v1(rew,term))
print("After: ", forward_return_v2(rew,term))

xs = []
ys = []
zs = []
for _ in range(100):
    gamma = 0.99
    rew = np.random.randint(1,5, size=1000)
    # rew = np.random.rand(1000)
    term = np.random.rand(1000) < 0.01
    rew *= term
    a = np.asarray(true_return(rew,term))
    b = np.asarray(forward_return_v1(rew,term))
    c = np.asarray(forward_return_v2(rew,term))
    xs.append(a.var())
    ys.append(b.var())
    zs.append(c.var())

plt.plot(xs, label="array of true return")
plt.plot(ys, label="array of forward return v1")
plt.plot(zs, label="array of forward return v2")
plt.ylabel("variance of the array")
plt.xlabel("trial number")
plt.legend()
plt.savefig('test.png')

image

@vwxyzjn
Copy link
Copy Markdown
Contributor

vwxyzjn commented Nov 23, 2022

Seems that the regular tasks are not impacted...?

# demonstrate the issue.

import matplotlib.pyplot as plt
import numpy as np
import gym

gamma = 0.9

def true_return(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards[::-1], terminals[::-1]):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns[::-1]

def forward_return_v1(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns

def forward_return_v2(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma + r
        returns.append(acc)
        acc *= (1-term)
    return returns

rew = [0,0,1,0,2,0]
term = [0, 0, 1, 0, 0, 0]

print("Returns on simple example.")
        
print("True:  ", true_return(rew,term))
print("Before:", forward_return_v1(rew,term))
print("After: ", forward_return_v2(rew,term))

def rollout(env_id, max_steps=1000):
    env = gym.make(env_id)
    env.reset()
    done = False
    rewards = []
    dones = []
    for _ in range(max_steps):
        action = env.action_space.sample()
        _, reward, done, _ = env.step(action)
        rewards += [reward]
        dones += [done]
        if done:
            env.reset()
    return rewards, dones

env_id = "Hopper-v2"
xs = []
ys = []
zs = []
for _ in range(50):
    gamma = 0.99
    # rew = np.random.randint(1,5, size=1000)
    # # rew = np.random.rand(1000)
    # term = np.random.rand(1000) < 0.01
    # rew *= term
    rew, term = rollout(env_id)
    a = np.asarray(true_return(rew,term))
    b = np.asarray(forward_return_v1(rew,term))
    c = np.asarray(forward_return_v2(rew,term))
    xs.append(a.var())
    ys.append(b.var())
    zs.append(c.var())

plt.title(env_id)
plt.plot(xs, label="array of true return", marker="*")
plt.plot(ys, label="array of forward return v1")
plt.plot(zs, label="array of forward return v2")
plt.ylabel("variance of the array")
plt.xlabel("trial number")
plt.legend()
plt.savefig('test.png')

image

image

image

@maitchison
Copy link
Copy Markdown
Author

@vwxyzjn I found the bug when reviewing my Procgen results. I think it only causes problems when the reward is on the final frame. From memory, it was procgen-maze that had the problem. Very few Atari games have this property (Skiing being the only one that I can think of). Mujoco, is probably also unaffected, and rewards tend to be every frame.

@vwxyzjn
Copy link
Copy Markdown
Contributor

vwxyzjn commented Nov 23, 2022

Makes sense. It is nevertheless important to fix this subtle bug :)

Allowed truncated returns to pass through.
@maitchison
Copy link
Copy Markdown
Author

It's a scatter plot with the x-axis being the true variance of the returns and the y-axis being the variance estimated by the normalization wrapper. In theory, these two should be closely correlated.

Because the wrapper discounts previous rewards rather than future rewards, the return estimation is often a little bit wrong. This can't really be avoided. However, if the return is zeroed at the right time, then the variance of the return estimation is well correlated. However, in its current form, it's poorly correlated.

Hope that helps.

@jkterry1
Copy link
Copy Markdown
Collaborator

Closing in favor of Gymnasium https://github.com/Farama-Foundation/Gymnasium

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants