steps_per_epoch in DDPG. #776

Closed

blurLake opened this issue Mar 31, 2020 · 8 comments

Labels: duplicate (This issue or pull request already exists), question (Further information is requested)
Hi, I saw in OpenAI Spinning Up

spinup.ddpg_tf1(..., steps_per_epoch=4000, epochs=100, ...)

which specifies the number of steps in each episode/epoch. Is there a similar setting in stable_baselines?
Thanks!

Miffyli (Collaborator) commented Mar 31, 2020

Similar to #352, what is the definition of "epochs" here?

araffin added the duplicate and question labels on Mar 31, 2020
blurLake (Author) commented Mar 31, 2020

Thank you for your reply! I meant one episode as a sequence of states, actions, and rewards that ends with a terminal state. I just wonder if we can set the length of this sequence in the DDPG algorithm, to something like 20, meaning the agent can only interact with the environment for 20 steps. Then we reset the environment after 20 steps, and so on.

Miffyli (Collaborator) commented Mar 31, 2020

I am still rather uncertain what it is you want to achieve, exactly. The naming of the DDPG parameters can be a bit vague: nb_rollout_steps means how many steps we take in the environment before we do nb_train_steps updates to the network, followed by nb_eval_steps of evaluating the agent.
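
For illustration, a minimal sketch (my example, not from the thread) of how those parameters could be passed to the stable-baselines v2 DDPG constructor; Pendulum-v0 and the specific values are just placeholders:

import gym
from stable_baselines import DDPG

env = gym.make("Pendulum-v0")

# Sketch: collect nb_rollout_steps environment steps, then perform
# nb_train_steps gradient updates; nb_eval_steps is only used when
# an eval_env is provided.
model = DDPG("MlpPolicy", env,
             nb_rollout_steps=100,
             nb_train_steps=50,
             nb_eval_steps=100,
             verbose=1)
model.learn(total_timesteps=10000)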

blurLake (Author) commented Mar 31, 2020

Thanks for helping me understand the parameters. It is getting closer. I attached a screenshot of the DDPG algorithm from the original paper (https://arxiv.org/pdf/1509.02971.pdf).
[Screenshot: DDPG pseudocode from the paper]
What I am seeking is how to set "T" in the pseudocode. Thanks!

m-rph commented Mar 31, 2020

T usually, and in this case, signifies the end of the episode. The action selection, storing, network optimisation, and target update occur once per environment step, and when the episode has finished, the noise and the environment are reset. This is done here:

with self.sess.as_default(), self.graph.as_default():
    # Prepare everything.
    self._reset()
    obs = self.env.reset()
    # Retrieve unnormalized observation for saving into the buffer
    if self._vec_normalize_env is not None:
        obs_ = self._vec_normalize_env.get_original_obs().squeeze()
    eval_obs = None
    if self.eval_env is not None:
        eval_obs = self.eval_env.reset()
    episode_reward = 0.
    episode_step = 0
    episodes = 0
    step = 0
    total_steps = 0

and here:

if done:
    # Episode done.
    epoch_episode_rewards.append(episode_reward)
    episode_rewards_history.append(episode_reward)
    epoch_episode_steps.append(episode_step)
    episode_reward = 0.
    episode_step = 0
    epoch_episodes += 1
    episodes += 1
    maybe_is_success = info.get('is_success')
    if maybe_is_success is not None:
        episode_successes.append(float(maybe_is_success))
    self._reset()
    if not isinstance(self.env, VecEnv):
        obs = self.env.reset()

blurLake (Author) commented Apr 1, 2020

Thanks for the reply! Exactly what I was asking for! If I understand correctly, DDPG in stable_baselines can only end an episode when done is True, which in some cases means the reward has reached its maximum or the policy is finely tuned. I feel this is slightly different from the original algorithm, which terminates the episode after a fixed number of steps, T, regardless of the reward or the policy.

In particular, for some complex environments it might take a really long time until done is True. Is there a way to predefine the length of episodes in my script (without changing stable-baselines/stable_baselines/ddpg/ddpg.py)?

Looking forward to the comments!

araffin (Collaborator) commented Apr 1, 2020

The done signal just marks the end of an episode. Usually (e.g. Pendulum-v0 or the pybullet envs), the episode has a time limit and done=True is triggered after that limit. However, if you do so, you need to add a time feature to the observation so as not to break the Markov property.
Current algorithms in stable-baselines are step-based (rather than episode-based), so they explore for n steps (this is called a rollout) and then update the policy parameters (using one or several gradient steps).
I recommend reading the SAC or TD3 (the successor of DDPG) code, which is clearer than the original DDPG code.
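
As a concrete sketch of the time-limit approach (my illustration, not from the thread): gym's TimeLimit wrapper can cap episodes at, say, 20 steps without touching ddpg.py. Adding the remaining time to the observation (the time feature mentioned above) is omitted here for brevity:

import gym
from gym.wrappers import TimeLimit

# Strip the default time limit, then impose a 20-step limit so that
# done=True is returned after 20 steps and the algorithm resets the env.
env = TimeLimit(gym.make("Pendulum-v0").unwrapped, max_episode_steps=20)

obs = env.reset()
for t in range(25):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        print("episode ended after step", t + 1)
        obs = env.reset()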

blurLake (Author) commented Apr 1, 2020

Alright, thanks a lot @Solliet @Miffyli @araffin. I will try TD3 and SAC.
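
For reference, a minimal TD3 sketch of the step-based scheme described above (my example, assuming the stable-baselines v2 API; the hyperparameter values are arbitrary). The agent explores for train_freq environment steps per rollout and then performs gradient_steps updates:

import gym
import numpy as np
from stable_baselines import TD3
from stable_baselines.common.noise import NormalActionNoise

env = gym.make("Pendulum-v0")
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))

# Explore for train_freq steps, then run gradient_steps gradient updates.
model = TD3("MlpPolicy", env,
            action_noise=action_noise,
            train_freq=100,
            gradient_steps=100,
            verbose=1)
model.learn(total_timesteps=10000)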
