Fixed forward discounted return estimation. #3152
# demonstrate the issue.
import matplotlib.pyplot as plt
import numpy as np

gamma = 0.9

def true_return(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards[::-1], terminals[::-1]):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns[::-1]

def forward_return_v1(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns

def forward_return_v2(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma + r
        returns.append(acc)
        acc *= (1-term)
    return returns

rew = [0,0,1,0,2,0]
term = [0, 0, 1, 0, 0, 0]
print("Returns on simple example.")
print("True: ", true_return(rew,term))
print("Before:", forward_return_v1(rew,term))
print("After: ", forward_return_v2(rew,term))

xs = []
ys = []
zs = []
for _ in range(100):
    gamma = 0.99
    rew = np.random.randint(1,5, size=1000)
    term = np.random.rand(1000) < 0.01
    rew *= term
    a = np.asarray(true_return(rew,term))
    b = np.asarray(forward_return_v1(rew,term))
    c = np.asarray(forward_return_v2(rew,term))
    xs.append(a.var())
    ys.append(b.var())
    zs.append(c.var())

plt.scatter(xs, ys, marker='x', label='zero_before_append')
plt.scatter(xs, zs, marker='o', label='zero_after_append')
plt.plot(range(3), range(3), label='true return', color='black', ls='--')
plt.legend()
plt.show()
pseudo-rnd-thoughts
left a comment
Damn, I can just imagine how long this took to find.
We can merge this, but we are not planning on making another release: we just launched Gymnasium, a fork of Gym by the people who have maintained Gym for the past 18 months, and all maintenance and improvements will happen there moving forward. Could you please move this over to the new repo?
If you'd like to read more about the backstory behind this and our plans going forward, click here.
|
I just discussed it internally a bit. Could you please move it to Gymnasium? We'll merge it there and it'll be in the next release. This was a big fix, though, and we massively appreciate it.
dones = np.logical_or(terminateds, truncateds)
self.returns[dones] = 0.0
dones = np.logical_or(terminateds, truncateds)
self.returns = self.returns * self.gamma * (1-dones) + rews
Should this not be (1-terminated)? In cases of truncation, the state values should still be bootstrapped.
Oh, yes, you're right. I'll make the change, add a test case too, and then move it to gymnasium.
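A minimal sketch of the distinction raised in this review thread, assuming per-environment boolean flags `terminateds` and `truncateds` as in the reviewed line. The helper `update_returns` and its `reset_on_truncation` switch are hypothetical names used only for illustration; this is not the wrapper's actual code.

```python
import numpy as np

def update_returns(returns, rews, terminateds, truncateds, gamma, reset_on_truncation):
    """One forward-accumulator step for a batch of environments.

    Hypothetical helper that mirrors the update form of the reviewed line.
    """
    terminateds = np.asarray(terminateds, dtype=bool)
    truncateds = np.asarray(truncateds, dtype=bool)
    if reset_on_truncation:
        # Form in the reviewed line: any episode end (dones) drops the accumulator.
        mask = np.logical_or(terminateds, truncateds)
    else:
        # Reviewer's suggestion: only a true termination drops the accumulator.
        mask = terminateds
    return returns * gamma * (1.0 - mask) + rews

returns = np.array([1.0, 1.0])
rews = np.array([0.5, 0.5])
terminateds = [False, False]
truncateds = [False, True]  # the second environment hit a time limit

with_dones = update_returns(returns, rews, terminateds, truncateds, 0.9, True)
with_terminated = update_returns(returns, rews, terminateds, truncateds, 0.9, False)
# with_dones is approximately [1.4, 0.5]; with_terminated is approximately [1.4, 1.4]
```

With the `dones` mask the truncated environment loses its accumulated return, while the `terminated`-only mask carries it across the truncation.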
|
Hi @maitchison, this is an interesting issue, though I am having a hard time parsing the image. It looks like the x-axis is the variance of the true return and the y-axis is the variance of the estimated forward-discounted return, so I guess the scatter points are like the ratio of variances? The following chart helps me understand the idea a bit better 🙂
# demonstrate the issue.
import matplotlib.pyplot as plt
import numpy as np

gamma = 0.9

def true_return(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards[::-1], terminals[::-1]):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns[::-1]

def forward_return_v1(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns

def forward_return_v2(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma + r
        returns.append(acc)
        acc *= (1-term)
    return returns

rew = [0,0,1,0,2,0]
term = [0, 0, 1, 0, 0, 0]
print("Returns on simple example.")
print("True: ", true_return(rew,term))
print("Before:", forward_return_v1(rew,term))
print("After: ", forward_return_v2(rew,term))

xs = []
ys = []
zs = []
for _ in range(100):
    gamma = 0.99
    rew = np.random.randint(1,5, size=1000)
    # rew = np.random.rand(1000)
    term = np.random.rand(1000) < 0.01
    rew *= term
    a = np.asarray(true_return(rew,term))
    b = np.asarray(forward_return_v1(rew,term))
    c = np.asarray(forward_return_v2(rew,term))
    xs.append(a.var())
    ys.append(b.var())
    zs.append(c.var())

plt.plot(xs, label="array of true return")
plt.plot(ys, label="array of forward return v1")
plt.plot(zs, label="array of forward return v2")
plt.ylabel("variance of the array")
plt.xlabel("trial number")
plt.legend()
plt.savefig('test.png')
|
Seems that the regular tasks are not impacted...?
# demonstrate the issue.
import matplotlib.pyplot as plt
import numpy as np
import gym

gamma = 0.9

def true_return(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards[::-1], terminals[::-1]):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns[::-1]

def forward_return_v1(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma * (1-term) + r
        returns.append(acc)
    return returns

def forward_return_v2(rewards, terminals):
    returns = []
    acc = 0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma + r
        returns.append(acc)
        acc *= (1-term)
    return returns

rew = [0,0,1,0,2,0]
term = [0, 0, 1, 0, 0, 0]
print("Returns on simple example.")
print("True: ", true_return(rew,term))
print("Before:", forward_return_v1(rew,term))
print("After: ", forward_return_v2(rew,term))

def rollout(env_id, max_steps=1000):
    env = gym.make(env_id)
    env.reset()
    done = False
    rewards = []
    dones = []
    for _ in range(max_steps):
        action = env.action_space.sample()
        _, reward, done, _ = env.step(action)
        rewards += [reward]
        dones += [done]
        if done:
            env.reset()
    return rewards, dones

env_id = "Hopper-v2"
xs = []
ys = []
zs = []
for _ in range(50):
    gamma = 0.99
    # rew = np.random.randint(1,5, size=1000)
    # # rew = np.random.rand(1000)
    # term = np.random.rand(1000) < 0.01
    # rew *= term
    rew, term = rollout(env_id)
    a = np.asarray(true_return(rew,term))
    b = np.asarray(forward_return_v1(rew,term))
    c = np.asarray(forward_return_v2(rew,term))
    xs.append(a.var())
    ys.append(b.var())
    zs.append(c.var())

plt.title(env_id)
plt.plot(xs, label="array of true return", marker="*")
plt.plot(ys, label="array of forward return v1")
plt.plot(zs, label="array of forward return v2")
plt.ylabel("variance of the array")
plt.xlabel("trial number")
plt.legend()
plt.savefig('test.png')
|
@vwxyzjn I found the bug when reviewing my Procgen results. I think it only causes problems when the reward is on the final frame. From memory, it was procgen-maze that had the problem. Very few Atari games have this property (Skiing being the only one that I can think of). MuJoCo is probably also unaffected, as rewards tend to arrive every frame.
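To make the final-frame case concrete, here is a small sketch using the two forward estimators from the scripts in this thread, copied so the block runs on its own. The reward and terminal sequences are made up for illustration: two episodes whose only reward arrives on the terminating step.

```python
import numpy as np

GAMMA = 0.9

def forward_return_v1(rewards, terminals, gamma=GAMMA):
    # Mask the accumulator with the current step's flag, then add the reward.
    returns, acc = [], 0.0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma * (1 - term) + r
        returns.append(acc)
    return returns

def forward_return_v2(rewards, terminals, gamma=GAMMA):
    # Add the reward, record the value, then reset if this step ended the episode.
    returns, acc = [], 0.0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma + r
        returns.append(acc)
        acc *= (1 - term)
    return returns

# Illustrative data: reward only on the final frame of each episode.
rew = [0, 0, 1, 0, 0, 2]
term = [0, 0, 1, 0, 0, 1]

v1 = np.asarray(forward_return_v1(rew, term))
v2 = np.asarray(forward_return_v2(rew, term))
# v1 is approximately [0, 0, 1, 0.9, 0.81, 2]
# v2 is approximately [0, 0, 1, 0,   0,    2]
```

In this sketch the v1 form carries the first episode's final reward into the second episode (the 0.9 and 0.81 entries), while the v2 form starts the second episode from zero.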
|
Makes sense. It is nevertheless important to fix this subtle bug :) |
Allowed truncated returns to pass through.
|
It's a scatter plot with the x-axis being the true variance of the returns and the y-axis being the variance estimated by the normalization wrapper. In theory, these two should be closely correlated. Because the wrapper discounts previous rewards rather than future rewards, the return estimate is often slightly off, which can't really be avoided. If the return is zeroed at the right time, though, the variance of the estimate correlates well with the true variance; in its current form, it correlates poorly. Hope that helps.
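For reference, the three recurrences computed by the scripts in this thread can be written side by side. The notation is an assumption introduced here: $r_t$ is the reward, $d_t \in \{0, 1\}$ the terminal flag at step $t$, $\gamma$ the discount, $G_t$ the true return, and $R^{(1)}_t$, $R^{(2)}_t$ the accumulators of `forward_return_v1` and `forward_return_v2`.

```latex
\begin{align*}
G_t &= r_t + \gamma\,(1 - d_t)\,G_{t+1}, & G_T &= 0 && \text{(true\_return, computed backward)}\\
R^{(1)}_t &= \gamma\,(1 - d_t)\,R^{(1)}_{t-1} + r_t, & R^{(1)}_{-1} &= 0 && \text{(forward\_return\_v1)}\\
R^{(2)}_t &= \gamma\,(1 - d_{t-1})\,R^{(2)}_{t-1} + r_t, & R^{(2)}_{-1} &= 0,\ d_{-1} = 0 && \text{(forward\_return\_v2)}
\end{align*}
```

Both forward accumulators sum discounted past rewards, so neither equals $G_t$ step by step. They differ from each other only in whether the flag at step $t$ or at step $t-1$ masks the carried value.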
|
Closing in favor of Gymnasium https://github.com/Farama-Foundation/Gymnasium |





Description
The forward discounted return was incorrectly estimated in the reward normalization wrapper. Having the variance of returns deviate significantly from 1.0 can cause performance problems when using reward normalization.
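A rough sketch of why the estimate matters, under simplifying assumptions: rewards are scaled by the standard deviation of the forward return estimate, computed here over the whole batch rather than with running statistics, and `eps` is a hypothetical stabilizer. The estimator uses the reset-after-append form from the demonstration script. The synthetic data mirrors that script and is not a real environment.

```python
import numpy as np

def true_return(rewards, terminals, gamma):
    # Backward pass: discounted sum of future rewards within each episode.
    returns, acc = [], 0.0
    for r, term in zip(rewards[::-1], terminals[::-1]):
        acc = acc * gamma * (1 - term) + r
        returns.append(acc)
    return np.asarray(returns[::-1])

def forward_return(rewards, terminals, gamma):
    # Forward pass: accumulate, record, then reset after a terminal step.
    returns, acc = [], 0.0
    for r, term in zip(rewards, terminals):
        acc = acc * gamma + r
        returns.append(acc)
        acc *= (1 - term)
    return np.asarray(returns)

rng = np.random.default_rng(0)
gamma, eps = 0.99, 1e-8
terminals = (rng.random(1000) < 0.01).astype(np.float64)
rewards = rng.integers(1, 5, size=1000) * terminals  # reward only on terminal steps

est_var = forward_return(rewards, terminals, gamma).var()
true_var = true_return(rewards, terminals, gamma).var()

# Scale rewards by the estimated return standard deviation.
scaled = rewards / np.sqrt(est_var + eps)
normalized_true_var = true_return(scaled, terminals, gamma).var()
print(true_var, est_var, normalized_true_var)
```

Because the return is linear in the rewards, the variance of the true return after scaling equals `true_var / (est_var + eps)`. It is close to 1.0 only when the forward estimate's variance tracks the true variance.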
Fixes # (issue)
Type of change
Screenshots
I will include a notebook as a reference.
Checklist:
pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)