I've trained PPO1 with the given hyperparameters, but after 1M timesteps I see the episode reward decline. However, when I render the environment, the humanoid walks better after 20M timesteps of training (where the episode reward is at its minimum) than after 1M timesteps (where the episode reward is at its maximum). Has anyone else noticed this discrepancy? Am I missing something?