Reward loss targets don't account for episodes that finish within n steps #3
Comments
Hi,
Here's what I think is a quick fix. There could probably be a quicker vectorized implementation, though.
Thanks for the bug spot and fix; sorry for being so slow -- merged now!
@Greg-Farquhar I think the np -> torch commit introduced a new bug. In make_seq_mask the mask variable is updated in place, which causes our tmp_masks variable to change in the build_sequences function.
I also think that I initially padded tmp_masks the wrong way. I forgot that PyTorch pads in reverse, and we should also be padding with 1s since they will be flipped to zeros in the make_seq_mask function. If both of these changes sound correct I'll submit a PR.
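(A minimal sketch of the in-place issue described above; the masking rule and padding below are assumptions for illustration, not the actual make_seq_mask/build_sequences code from the repository.)

import torch
import torch.nn.functional as F

def make_seq_mask(mask):
    # Clone first so that writing into `mask` below does not mutate the
    # tmp_masks tensor held by the caller (the in-place bug described above).
    mask = mask.clone()
    mask[mask == 1] = 0  # padded 1s are flipped to zeros (illustrative rule)
    return mask

# A 3-step sequence padded out to length 5. Note that F.pad's pad tuple starts
# from the last dimension (the "pads in reverse" behaviour mentioned above),
# and we pad with 1s so they can be flipped to zeros inside make_seq_mask.
tmp_masks = F.pad(torch.zeros(3), (0, 2), value=1.0)
seq_mask = make_seq_mask(tmp_masks)
assert tmp_masks[-1] == 1  # the caller's tensor is unchanged thanks to clone()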
Ah, thanks. Just pushed, lmk if that fixes correctly!
looks good!
Original issue:
Hi,
Great work. I enjoyed reading the paper and I replicated your work independently.
I noticed a minor performance difference between the two implementations, and I just noticed that while computing the target for the reward loss, you aren't accounting for episodes that finish within nsteps (whereas for the Q loss this is correctly accounted for).
Specifically,
proc_seq.append(seq[env, t+offset:t+offset+depth, :])
in line 32 of treeqn_utils.py doesn't check for done flags. If the episode finished at time t=3, this error will make the target for d=1 at t=3 the reward at t=4, which is wrong. Can you please clarify whether my understanding is correct?
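(For illustration, a hedged sketch of the kind of done-flag handling being suggested; the function name, tensor shapes, and arguments below are assumptions rather than the repository's actual code.)

import torch

def reward_targets_with_done(rewards, dones, nsteps, depth):
    # Hypothetical sketch: rewards and dones are 1-D tensors (length at least
    # nsteps + depth) for a single environment; dones[t] = 1 if the episode
    # ends at step t.
    targets = []
    for t in range(nsteps):
        seq = rewards[t:t + depth].clone()
        # Zero out any reward that falls after a done flag inside the window,
        # so an episode ending at t=3 does not leak the reward at t=4 into
        # the d=1 target for t=3.
        after_done = torch.cumsum(dones[t:t + depth - 1].float(), dim=0) > 0
        seq[1:][after_done] = 0.0
        targets.append(seq)
    return torch.stack(targets)  # shape [nsteps, depth]

# Example: episode ends at t=3 with depth=2 -> the row for t=3 keeps only r_3,
# and the reward at t=4 (from the next episode) is masked to zero.
rewards = torch.arange(6, dtype=torch.float32)
dones = torch.tensor([0, 0, 0, 1, 0, 0], dtype=torch.float32)
print(reward_targets_with_done(rewards, dones, nsteps=4, depth=2))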