steps_per_epoch in DDPG. #776

Closed

blurLake opened this issue Mar 31, 2020 · 8 comments

Labels: duplicate (This issue or pull request already exists), question (Further information is requested)
Hi, I saw in OpenAI Spinning Up

spinup.ddpg_tf1(..., steps_per_epoch=4000, epochs=100, ...)

which specifies the number of steps in each episode/epoch. Is there a similar setting in stable_baselines?
Thanks!

Miffyli (Collaborator) commented Mar 31, 2020

Similar to #352, what is the definition of "epochs" here?

araffin added the duplicate and question labels on Mar 31, 2020
blurLake (Author) commented Mar 31, 2020

Thank you for your reply! I meant one episode as a sequence of states, actions, and rewards that ends with a terminal state. I just wonder if we can set the length of this sequence in the DDPG algorithm, to something like 20, meaning the agent can only interact with the environment for 20 steps. Then we reset the environment after 20 steps, and so on.

Miffyli (Collaborator) commented Mar 31, 2020

I am still rather uncertain what it is you want to achieve, exactly. The naming of the DDPG parameters can be a bit vague: nb_rollout_steps means how many steps we take in the environment before we do nb_train_steps updates to the network, followed by nb_eval_steps of evaluating the agent.
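
For illustration, a minimal sketch (my example, not from the thread) of how those parameters could be passed to the stable-baselines v2 DDPG constructor; Pendulum-v0 and the specific values are just placeholders:

import gym
from stable_baselines import DDPG

env = gym.make("Pendulum-v0")

# Sketch: collect nb_rollout_steps environment steps, then perform
# nb_train_steps gradient updates; nb_eval_steps is only used when
# an eval_env is provided.
model = DDPG("MlpPolicy", env,
             nb_rollout_steps=100,
             nb_train_steps=50,
             nb_eval_steps=100,
             verbose=1)
model.learn(total_timesteps=10000)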

blurLake (Author) commented Mar 31, 2020

Thanks for helping me understand the parameters. It is getting closer. I attached a screenshot of the DDPG algorithm from the original paper (https://arxiv.org/pdf/1509.02971.pdf).
[Screenshot: DDPG pseudocode from the paper]
What I am seeking is how to set "T" in the pseudocode. Thanks!

m-rph commented Mar 31, 2020

T usually, and in this case, signifies the end of the episode. The action selection, storing, network optimisation, and target update occur once per environment step, and when the episode has finished, the noise and the environment are reset. This is done here:

with self.sess.as_default(), self.graph.as_default():
    # Prepare everything.
    self._reset()
    obs = self.env.reset()
    # Retrieve unnormalized observation for saving into the buffer
    if self._vec_normalize_env is not None:
        obs_ = self._vec_normalize_env.get_original_obs().squeeze()
    eval_obs = None
    if self.eval_env is not None:
        eval_obs = self.eval_env.reset()
    episode_reward = 0.
    episode_step = 0
    episodes = 0
    step = 0
    total_steps = 0

and here:

if done:
    # Episode done.
    epoch_episode_rewards.append(episode_reward)
    episode_rewards_history.append(episode_reward)
    epoch_episode_steps.append(episode_step)
    episode_reward = 0.
    episode_step = 0
    epoch_episodes += 1
    episodes += 1
    maybe_is_success = info.get('is_success')
    if maybe_is_success is not None:
        episode_successes.append(float(maybe_is_success))
    self._reset()
    if not isinstance(self.env, VecEnv):
        obs = self.env.reset()

blurLake (Author) commented Apr 1, 2020

Thanks for the reply! Exactly what I was asking for! If I understand correctly, DDPG in stable_baselines can only end an episode when done is True, which in some cases means the reward has reached its maximum or the policy is finely tuned. I feel this is slightly different from the original algorithm, which terminates the episode after a fixed number of steps, T, regardless of the reward or the policy.

In particular, for some complex environments it might take a really long time until done is True. Is there a way to predefine the length of episodes in my script (without changing stable-baselines/stable_baselines/ddpg/ddpg.py)?

Looking forward to the comments!

araffin (Collaborator) commented Apr 1, 2020

The done signal just marks the end of an episode. Usually (e.g. Pendulum-v0 or the pybullet envs), the episode has a time limit and done=True is triggered after that limit. However, if you do so, you need to add a time feature to the observation so as not to break the Markov property.
Current algorithms in stable-baselines are step-based (rather than episode-based), so they explore for n steps (this is called a rollout) and then update the policy parameters (using one or several gradient steps).
I recommend reading the SAC or TD3 (the successor of DDPG) code, which is clearer than the original DDPG code.
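
As a concrete sketch of the time-limit approach (my illustration, not from the thread): gym's TimeLimit wrapper can cap episodes at, say, 20 steps without touching ddpg.py. Adding the remaining time to the observation (the time feature mentioned above) is omitted here for brevity:

import gym
from gym.wrappers import TimeLimit

# Strip the default time limit, then impose a 20-step limit so that
# done=True is returned after 20 steps and the algorithm resets the env.
env = TimeLimit(gym.make("Pendulum-v0").unwrapped, max_episode_steps=20)

obs = env.reset()
for t in range(25):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        print("episode ended after step", t + 1)
        obs = env.reset()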

blurLake (Author) commented Apr 1, 2020

Alright, thanks a lot @Solliet @Miffyli @araffin. I will try TD3 and SAC.
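
For reference, a minimal TD3 sketch of the step-based scheme described above (my example, assuming the stable-baselines v2 API; the hyperparameter values are arbitrary). The agent explores for train_freq environment steps per rollout and then performs gradient_steps updates:

import gym
import numpy as np
from stable_baselines import TD3
from stable_baselines.common.noise import NormalActionNoise

env = gym.make("Pendulum-v0")
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))

# Explore for train_freq steps, then run gradient_steps gradient updates.
model = TD3("MlpPolicy", env,
            action_noise=action_noise,
            train_freq=100,
            gradient_steps=100,
            verbose=1)
model.learn(total_timesteps=10000)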
