PyTorch PPO implementation, dimension difference #384
Hi, Kevin. I regularly use that same algorithm, the PyTorch PPO, and have never had that problem. What error are you getting exactly? And on which environment(s)?
Best,
Alberto

On Wed, Feb 22, 2023, at 21:25, kevin ***@***.***> wrote:
Hello, apologies if I do this wrong; I don't contribute to open source often. I was attempting to run the PyTorch PPO implementation and kept getting several errors regarding the dimension of the observation 'o'. I believe this is because the return values from the environment are being mishandled. See my proposed changes below:
Original:
o, ep_ret, ep_len = env.reset(), 0, 0

# Main loop: collect experience in env and update/log each epoch
for epoch in range(epochs):
    for t in range(local_steps_per_epoch):
        a, v, logp = ac.step(torch.as_tensor(o, dtype=torch.float32))

        next_o, r, d, _ = env.step(a)
        ep_ret += r
        ep_len += 1

        # save and log
        buf.store(o, a, r, v, logp)
        logger.store(VVals=v)

        # Update obs (critical!)
        o = next_o

        timeout = ep_len == max_ep_len
        terminal = d or timeout
        epoch_ended = t==local_steps_per_epoch-1

        if terminal or epoch_ended:
            if epoch_ended and not(terminal):
                print('Warning: trajectory cut off by epoch at %d steps.'%ep_len, flush=True)
            # if trajectory didn't reach terminal state, bootstrap value target
            if timeout or epoch_ended:
                _, v, _ = ac.step(torch.as_tensor(o, dtype=torch.float32))
            else:
                v = 0
            buf.finish_path(v)
            if terminal:
                # only save EpRet / EpLen if trajectory finished
                logger.store(EpRet=ep_ret, EpLen=ep_len)
            o, ep_ret, ep_len = env.reset(), 0, 0
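The dimension error described above is consistent with the newer Gym core API, where reset() returns an (obs, info) tuple rather than the bare observation. A minimal sketch of the failure mode, assuming Gym >= 0.26 and using CartPole-v1 purely for illustration:

import gym
import torch

env = gym.make("CartPole-v1")   # illustrative environment
o = env.reset()                 # Gym >= 0.26: o is a tuple (obs, info)

# torch.as_tensor cannot build a float tensor from an (ndarray, dict)
# tuple, so the first use of 'o' in the loop raises an error:
torch.as_tensor(o, dtype=torch.float32)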
To fix this I just get rid of the 'info' return value from env.reset(), but I imagine this might not work for all environments (for example, if they also have extra return values from env.step()).
Adjusted:
(o, _), ep_ret, ep_len = env.reset(), 0, 0

# Main loop: collect experience in env and update/log each epoch
for epoch in range(epochs):
    for t in range(local_steps_per_epoch):
        a, v, logp = ac.step(torch.as_tensor(o, dtype=torch.float32))

        next_o, r, d, _ = env.step(a)
        print(f"next_o: {next_o}")
        print(f"r: {r}")
        print(f"d: {d}")
        print(f"_: {_}")
        ep_ret += r
        ep_len += 1

        # save and log
        buf.store(o, a, r, v, logp)
        logger.store(VVals=v)

        # Update obs (critical!)
        o = next_o

        timeout = ep_len == max_ep_len
        terminal = d or timeout
        epoch_ended = t==local_steps_per_epoch-1

        if terminal or epoch_ended:
            if epoch_ended and not(terminal):
                print('Warning: trajectory cut off by epoch at %d steps.'%ep_len, flush=True)
            # if trajectory didn't reach terminal state, bootstrap value target
            if timeout or epoch_ended:
                _, v, _ = ac.step(torch.as_tensor(o, dtype=torch.float32))
            else:
                v = 0
            buf.finish_path(v)
            if terminal:
                # only save EpRet / EpLen if trajectory finished
                logger.store(EpRet=ep_ret, EpLen=ep_len)
            (o, _), ep_ret, ep_len = env.reset(), 0, 0
Let me know if anyone else has run into this problem or if this fix works.
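One way to address the concern above about environments with extra return values is to normalize both signatures before the training loop touches them. This is a sketch only; the reset_env and step_env helpers are hypothetical and not part of the Spinning Up code:

def reset_env(env):
    # Normalize reset() across Gym versions: return only the observation.
    out = env.reset()
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        return out[0]   # new API (Gym >= 0.26): (obs, info)
    return out          # old API: obs

def step_env(env, a):
    # Normalize step() across Gym versions to (obs, reward, done, info).
    out = env.step(a)
    if len(out) == 5:
        # New API: (obs, reward, terminated, truncated, info)
        next_o, r, terminated, truncated, info = out
        return next_o, r, terminated or truncated, info
    return out          # old API: (obs, reward, done, info)

The loop above would then call o = reset_env(env) and next_o, r, d, _ = step_env(env, a) with no other changes.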
Hello,
My reset and step functions are defined as follows:
I am a little confused how this could work for other people, when it seems like all the environments should return the same tuple (https://www.gymlibrary.dev/api/core/#gym.Env.reset).
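The confusion may come down to which Gym version is installed, since the core API changed in Gym 0.26. A side-by-side sketch of the two signatures (the linked gymlibrary.dev page documents the newer one):

# Old core API (Gym < 0.26):
o = env.reset()                           # observation only
next_o, r, d, info = env.step(a)          # 4-tuple

# New core API (Gym >= 0.26 / Gymnasium, as at the link above):
o, info = env.reset()                     # (obs, info)
next_o, r, terminated, truncated, info = env.step(a)  # 5-tuple
d = terminated or truncated               # 'done' is now split in two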