# DDPG

This Jupyter notebook shows the differences between a standard actor-critic algorithm and DDPG (Deep Deterministic Policy Gradient), and how DDPG can be trained with stable_baselines.

## The Differences in Algorithms:

### Actor Critic
```python
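import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.layers, tf.variable_scope)

def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    # All entries except the last are hidden layers; the last entry is the output layer.
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)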

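def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = a.shape.as_list()[-1]
    # The network outputs the mean of the action distribution
    mu = mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation)
    # The log standard deviation is a trainable variable that does not depend on the state
    log_std = tf.get_variable(name='log_std', initializer=-0.5*np.ones(act_dim, dtype=np.float32))
    std = tf.exp(log_std)
    # Stochastic policy: sample an action around the mean
    pi = mu + tf.random_normal(tf.shape(mu)) * std
    # gaussian_likelihood is a helper that is not defined in this notebook
    logp = gaussian_likelihood(a, mu, log_std)      # log-probability of the given action a
    logp_pi = gaussian_likelihood(pi, mu, log_std)  # log-probability of the sampled action pi
    return pi, logp, logp_pi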

def mlp_actor_critic(x, a, hidden_sizes=(64,64), activation=tf.tanh,
                     output_activation=None, policy=None, action_space=None):
    # Actor: fall back to the Gaussian policy when no policy function is passed in
    if policy is None:
        policy = mlp_gaussian_policy
    with tf.variable_scope('pi'):
        pi, logp, logp_pi = policy(x, a, hidden_sizes, activation, output_activation, action_space)
    # Critic: state-value function V(s), which takes only the state as input
    with tf.variable_scope('v'):
        v = tf.squeeze(mlp(x, list(hidden_sizes)+[1], activation, None), axis=1)
    return pi, logp, logp_pi, v
```
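
The `gaussian_likelihood` helper called in `mlp_gaussian_policy` is not defined in this notebook. As a minimal sketch, assuming it returns the log-density of a diagonal Gaussian summed over the action dimensions, the same computation in plain Python (without TensorFlow) could look like this. The name `gaussian_log_likelihood` and the list-based inputs are illustrative assumptions, not part of the original code.

```python
import math

def gaussian_log_likelihood(a, mu, log_std):
    """Log-density of action `a` under a diagonal Gaussian N(mu, exp(log_std)^2).

    All three arguments are equal-length lists, one entry per action dimension.
    (Hypothetical helper for illustration.)
    """
    total = 0.0
    for a_i, mu_i, log_std_i in zip(a, mu, log_std):
        std_i = math.exp(log_std_i)
        z = (a_i - mu_i) / std_i
        # Per-dimension log N(a_i | mu_i, std_i^2)
        total += -0.5 * (z * z + 2.0 * log_std_i + math.log(2.0 * math.pi))
    return total

# An action at the mean is more likely than one a full unit away from it.
logp_at_mean = gaussian_log_likelihood([0.0], [0.0], [-0.5])
logp_off_mean = gaussian_log_likelihood([1.0], [0.0], [-0.5])
```

This is what makes the actor-critic policy stochastic: every action has a log-probability, which the policy gradient loss needs. The DDPG actor has no such quantity.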

### DDPG
```python
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)

def mlp_actor_critic(x, a, hidden_sizes=(400,300), activation=tf.nn.relu, 
                     output_activation=tf.tanh, action_space=None):
    act_dim = a.shape.as_list()[-1]
    act_limit = action_space.high[0]
    # Actor
    with tf.variable_scope('pi'):
        pi = act_limit * mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation)
    # Critic Q(s, a): takes the state and the action that was actually taken
    with tf.variable_scope('q'):
        q = tf.squeeze(mlp(tf.concat([x,a], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
    # Critic Q(s, pi(s)): same weights (reuse=True), takes the state and the action proposed by the policy
    with tf.variable_scope('q', reuse=True):
        q_pi = tf.squeeze(mlp(tf.concat([x,pi], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
    return pi, q, q_pi
```
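
The DDPG actor is deterministic: the `tanh` output activation bounds the network output to [-1, 1], and multiplying by `act_limit` maps it onto the action range. A small sketch of that scaling for a single action dimension, where `scale_action` is a hypothetical name used only for illustration:

```python
import math

def scale_action(pre_activation, act_limit):
    """Squash an unbounded network output into [-act_limit, act_limit]."""
    return act_limit * math.tanh(pre_activation)

# The same input always yields the same action, and large inputs saturate at the limit.
small = scale_action(0.0, act_limit=2.0)
large = scale_action(100.0, act_limit=2.0)
```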

## The Differences in Loss Functions:

### Actor Critic objectives:
```python
# VPG objectives:
# Train the policy with the policy gradient method
pi_loss = -tf.reduce_mean(logp * adv_ph)
# Train the critic: it should output the sum of rewards (the return),
# so v is regressed onto the observed returns ret_ph
v_loss = tf.reduce_mean((ret_ph - v)**2)
```
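
The critic target `ret_ph` holds, for every step, the sum of rewards collected from that step onward. A minimal sketch of how such returns can be computed for one finished episode is shown here; the discount factor `gamma` is an assumption of this sketch (with `gamma=1.0` it reduces to the plain sum of rewards), and `discounted_returns` is a hypothetical helper name.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
# example_returns == [1.75, 1.5, 1.0]
```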

### DDPG objectives:
```python
## DDPG losses
# Train the policy: maximize Q(s, pi(s)) by minimizing its negative
pi_loss = -tf.reduce_mean(q_pi)

# Train the critic on the action that was actually taken in the trajectory.
# Bellman backup for the Q function: the policy supplies the action used in the target
# (q_pi_targ); (1-d_ph) removes the bootstrap term when the episode has ended.
backup = tf.stop_gradient(r_ph + gamma*(1-d_ph)*q_pi_targ)
q_loss = tf.reduce_mean((q-backup)**2)
```
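
To make the Bellman backup concrete, here is a small numeric sketch of the same two lines in plain Python. The function names are hypothetical, and the inputs stand in for `r_ph`, `d_ph`, `q_pi_targ`, and `q` from the code above.

```python
def bellman_backup(reward, done, q_next_target, gamma=0.99):
    """Target y = r + gamma * (1 - d) * Q_targ(s', pi(s'))."""
    return reward + gamma * (1.0 - done) * q_next_target

def critic_loss(q_values, targets):
    """Mean squared error between the critic output and the backup targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)

# Terminal transition: only the immediate reward counts.
y_terminal = bellman_backup(reward=1.0, done=1.0, q_next_target=5.0, gamma=0.9)
# Non-terminal transition: the discounted next-state value is added.
y_running = bellman_backup(reward=1.0, done=0.0, q_next_target=5.0, gamma=0.9)
loss = critic_loss([1.0, 5.0], [y_terminal, y_running])
```

Unlike the actor-critic loss above, nothing here needs a complete episode: a single stored transition is enough, which is what allows DDPG to learn off-policy.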

## DDPG Example with stable_baselines

In [57]:
import gym
import numpy as np

from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.ddpg.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec
from stable_baselines import DDPG

In `MountainCarContinuous-v0`, the environment used below, the reward is 100 for reaching the target on top of the right-hand hill, minus the squared sum of actions from start to goal.

This reward function raises an exploration challenge, because if the agent does not reach the target soon enough, it will figure out that it is better not to move, and won't find the target anymore.

Note that this reward is unusual with respect to most published work, where the goal was to reach the target as fast as possible, hence favouring a bang-bang strategy.
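
The exploration challenge can be illustrated with a rough sketch of the episode return. The cost coefficient `action_cost=0.1` and the helper name `episode_return` are assumptions made for this illustration; only the bonus of 100 and the squared-action penalty come from the description above.

```python
def episode_return(actions, reached_goal, action_cost=0.1, goal_bonus=100.0):
    """Goal bonus (if the target was reached) minus a penalty on squared actions."""
    effort_penalty = action_cost * sum(a * a for a in actions)
    return (goal_bonus if reached_goal else 0.0) - effort_penalty

# Standing still for a whole episode costs nothing ...
idle = episode_return([0.0] * 999, reached_goal=False)
# ... while pushing hard without ever reaching the target is penalized.
flailing = episode_return([1.0] * 999, reached_goal=False)
# Reaching the target quickly outweighs the effort spent.
success = episode_return([1.0] * 200, reached_goal=True)
```

Under these assumptions an agent that never sees the goal bonus can only compare `idle` with `flailing`, so not moving looks like the better strategy.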

In [58]:
env = DummyVecEnv([lambda: gym.make('MountainCarContinuous-v0')])
# Environment for evaluation
eval_env = gym.make('MountainCarContinuous-v0')

### Exploration
DDPG trains a deterministic policy in an off-policy way. Because the policy is deterministic, an agent that explored on-policy would probably not try a wide enough variety of actions at the beginning to find useful learning signals. To make DDPG policies explore better, we add noise to their actions at training time. The authors of the original DDPG paper recommended time-correlated Ornstein-Uhlenbeck (OU) noise. To get higher-quality training data, you may reduce the scale of the noise over the course of training.

In [59]:
# the noise objects for DDPG
n_actions = env.action_space.shape[-1]
param_noise = None
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))
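
To show what "time-correlated" means, here is a minimal one-dimensional sketch of an Ornstein-Uhlenbeck process in plain Python. It is not the stable_baselines implementation: the class name `SimpleOUNoise` and the default values of `theta` and `dt` are assumptions chosen for illustration.

```python
import math
import random

class SimpleOUNoise:
    """Hypothetical one-dimensional Ornstein-Uhlenbeck process (illustration only)."""

    def __init__(self, mean=0.0, sigma=0.5, theta=0.15, dt=1e-2, seed=None):
        self.mean = mean
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.rng = random.Random(seed)
        self.x = mean  # the process starts at its mean

    def sample(self):
        # Pull the state back toward the mean, then add a Gaussian kick.
        drift = self.theta * (self.mean - self.x) * self.dt
        kick = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + kick
        return self.x

noise = SimpleOUNoise(sigma=0.5, seed=0)
samples = [noise.sample() for _ in range(5)]
```

Each sample builds on the previous one, so consecutive noise values push the car in a similar direction for a while instead of cancelling out. Lowering `sigma` over time would be one way to reduce the noise scale during training.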

### Create Custom Policy 

In [60]:
# Custom MLP policy of three layers of size 32 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[32, 32, 32],
                                           layer_norm=False,
                                           feature_extraction="mlp")

### Define a Model:
Parameters that the `DDPG` model accepts:
* **policy** – (DDPGPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, LnMlpPolicy, …)
* **env** – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
* **gamma** – (float) the discount factor
* **eval_env** – (Gym Environment) the evaluation environment (can be None)
* **nb_train_steps** – (int) the number of training steps (how many times we sample batches from Replay Buffer)
* **nb_rollout_steps** – (int) the number of rollout (episode) steps
* **nb_eval_steps** – (int) the number of evaluation steps
* **param_noise** – (AdaptiveParamNoiseSpec) the parameter noise type (can be None)
* **action_noise** – (ActionNoise) the action noise type (can be None)
* **param_noise_adaption_interval** – (int) apply param noise every N steps
* **tau** – (float) the soft update coefficient (keep old values, between 0 and 1)
* **normalize_returns** – (bool) should the critic output be normalized
* **normalize_observations** – (bool) should the observation be normalized
* **batch_size** – (int) the size of the batch for learning the policy
* **observation_range** – (tuple) the bounding values for the observation
* **return_range** – (tuple) the bounding values for the critic output
* **critic_l2_reg** – (float) l2 regularizer coefficient
* **actor_lr** – (float) the actor learning rate
* **critic_lr** – (float) the critic learning rate
* **clip_norm** – (float) clip the gradients (disabled if None)
* **reward_scale** – (float) the value the reward should be scaled by
* **render** – (bool) enable rendering of the environment
* **render_eval** – (bool) enable rendering of the evaluation environment
* **buffer_size** – (int) the max number of transitions to store, size of the replay buffer
* **random_exploration** – (float) probability of taking a random action (as in an epsilon-greedy strategy). This is normally not needed for DDPG but can help exploration when using HER + DDPG. This hack was present in the original OpenAI Baselines repo (DDPG + HER).
* **verbose** – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
* **tensorboard_log** – (str) the log location for tensorboard (if None, no logging)
* **policy_kwargs** – (dict) additional arguments to be passed to the policy on creation
* **full_tensorboard_log** – (bool) enable additional logging when using tensorboard WARNING: this logging can take a lot of space quickly
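
The role of `tau` in the list above can be sketched as a soft update of target-network weights. This is a plain-Python illustration under the assumption that the target keeps a fraction `1 - tau` of its old value; `soft_update` is a hypothetical helper and the weights are ordinary lists of floats.

```python
def soft_update(target_weights, source_weights, tau):
    """Move each target weight a fraction `tau` of the way toward the source weight."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_weights, source_weights)]

target = [0.0, 10.0]
source = [1.0, 0.0]
# With a small tau the target changes slowly, which keeps the Bellman targets stable.
target = soft_update(target, source, tau=0.1)
```

With `tau=1.0` the target would be overwritten by the source weights, and with `tau=0.0` it would never change.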


In [66]:
model = DDPG(policy = MlpPolicy, 
             env = env,
             render = False,
             nb_train_steps = 50,
             nb_rollout_steps = 200,
             eval_env = eval_env,
             render_eval = False,
             nb_eval_steps = 2,
             gamma = 0.999,
             buffer_size = 5000, 
             param_noise=param_noise, 
             action_noise=action_noise,
             verbose=1,
             tensorboard_log = '/home/karam/workspaces/reinforcement_learning/RL_AV'
            )

#### Train the Model
* **total_timesteps** – (int) The total number of samples to train on
* **callback** – (function (dict, dict) -> bool) function called at every step with the state of the algorithm. It takes the local and global variables. If it returns False, training is aborted.
* **seed** – (int) The initial seed for training, if None: keep current seed
* **log_interval** – (int) The number of timesteps before logging.
* **tb_log_name** – (str) the name of the run for tensorboard log
* **reset_num_timesteps** – (bool) whether or not to reset the current timestep number (used in logging)
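
As a sketch of the `callback` parameter described above, the following factory builds a function with the `(dict, dict) -> bool` shape that returns False once it has been called a given number of times. The name `make_step_limit_callback` is hypothetical, and the empty dictionaries only stand in for the local and global variables that the training loop would pass.

```python
def make_step_limit_callback(max_calls):
    """Build a callback that asks training to stop on its `max_calls`-th call."""
    state = {"calls": 0}

    def callback(_locals, _globals):
        state["calls"] += 1
        # Returning False signals that training should be aborted.
        return state["calls"] < max_calls

    return callback

stop_after_three = make_step_limit_callback(3)
results = [stop_after_three({}, {}) for _ in range(3)]
# results == [True, True, False]
```

Such a function could then be passed as `callback=stop_after_three` when calling `model.learn`.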

In [67]:
model.learn(total_timesteps=300000,  log_interval=4)

---------------------------------------
| eval/Q                  | -0.00106  |
| eval/episodes           | 0         |
| eval/return             | nan       |
| eval/return_history     | nan       |
| reference_Q_mean        | -0.0421   |
| reference_Q_std         | 0.0346    |
| reference_action_mean   | -0.00825  |
| reference_action_std    | 0.0017    |
| reference_actor_Q_mean  | 0.000233  |
| reference_actor_Q_std   | 0.00293   |
| rollout/Q_mean          | 0.000252  |
| rollout/actions_mean    | 0.38      |
| rollout/actions_std     | 0.496     |
| rollout/episode_steps   | nan       |
| rollout/episodes        | 0         |
| rollout/return          | nan       |
| rollout/return_history  | nan       |
| total/duration          | 1.33      |
| total/episodes          | 0         |
| total/epochs            | 1         |
| total/steps             | 796       |
| total/steps_per_second  | 598       |
| train/loss_actor        | -0.000442 |
| train/loss_critic       | 3.04e-06  |



--------------------------------------
| eval/Q                  | 0.00243  |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.0385  |
| reference_Q_std         | 0.0346   |
| reference_action_mean   | -0.00472 |
| reference_action_std    | 0.0153   |
| reference_actor_Q_mean  | 0.00265  |
| reference_actor_Q_std   | 0.000819 |
| rollout/Q_mean          | 0.00103  |
| rollout/actions_mean    | 0.271    |
| rollout/actions_std     | 0.543    |
| rollout/episode_steps   | 999      |
| rollout/episodes        | 7        |
| rollout/return          | -37.4    |
| rollout/return_history  | -37.4    |
| total/duration          | 10.7     |
| total/episodes          | 7        |
| total/epochs            | 1        |
| total/steps             | 7196     |
| total/steps_per_second  | 671      |
| train/loss_actor        | -0.00241 |
| train/loss_critic       | 3.92e-08 |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 0.00284  |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.0382  |
| reference_Q_std         | 0.0347   |
| reference_action_mean   | 0.00574  |
| reference_action_std    | 0.0134   |
| reference_actor_Q_mean  | 0.00298  |
| reference_actor_Q_std   | 0.000762 |
| rollout/Q_mean          | 0.00153  |
| rollout/actions_mean    | 0.107    |
| rollout/actions_std     | 0.612    |
| rollout/episode_steps   | 999      |
| rollout/episodes        | 13       |
| rollout/return          | -39.2    |
| rollout/return_history  | -39.2    |
| total/duration          | 20.3     |
| total/episodes          | 13       |
| total/epochs            | 1        |
| total/steps             | 13596    |
| total/steps_per_second  | 670      |
| train/loss_actor        | -0.00218 |
| train/loss_critic       | 2.14e-08 |
| train/param_noise_di..


---------------------------------------
| eval/Q                  | -0.101    |
| eval/episodes           | 0         |
| eval/return             | nan       |
| eval/return_history     | nan       |
| reference_Q_mean        | -0.0487   |
| reference_Q_std         | 0.117     |
| reference_action_mean   | -0.0395   |
| reference_action_std    | 0.0161    |
| reference_actor_Q_mean  | -0.0206   |
| reference_actor_Q_std   | 0.12      |
| rollout/Q_mean          | -0.000232 |
| rollout/actions_mean    | 0.0522    |
| rollout/actions_std     | 0.59      |
| rollout/episode_steps   | 961       |
| rollout/episodes        | 20        |
| rollout/return          | -19.4     |
| rollout/return_history  | -19.4     |
| total/duration          | 29.9      |
| total/episodes          | 20        |
| total/epochs            | 1         |
| total/steps             | 19996     |
| total/steps_per_second  | 670       |
| train/loss_actor        | 0.101     |
| train/loss_critic       | 3.73      |


--------------------------------------
| eval/Q                  | -0.179   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.141   |
| reference_Q_std         | 0.0992   |
| reference_action_mean   | 0.0338   |
| reference_action_std    | 0.00525  |
| reference_actor_Q_mean  | -0.108   |
| reference_actor_Q_std   | 0.107    |
| rollout/Q_mean          | -0.0378  |
| rollout/actions_mean    | 0.0532   |
| rollout/actions_std     | 0.612    |
| rollout/episode_steps   | 965      |
| rollout/episodes        | 27       |
| rollout/return          | -21.9    |
| rollout/return_history  | -21.9    |
| total/duration          | 40.2     |
| total/episodes          | 27       |
| total/epochs            | 1        |
| total/steps             | 26396    |
| total/steps_per_second  | 657      |
| train/loss_actor        | 0.153    |
| train/loss_critic       | 0.997    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.184   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.155   |
| reference_Q_std         | 0.106    |
| reference_action_mean   | -0.0654  |
| reference_action_std    | 0.00685  |
| reference_actor_Q_mean  | -0.118   |
| reference_actor_Q_std   | 0.116    |
| rollout/Q_mean          | -0.0641  |
| rollout/actions_mean    | 0.0417   |
| rollout/actions_std     | 0.64     |
| rollout/episode_steps   | 966      |
| rollout/episodes        | 33       |
| rollout/return          | -24.6    |
| rollout/return_history  | -24.6    |
| total/duration          | 51.6     |
| total/episodes          | 33       |
| total/epochs            | 1        |
| total/steps             | 32796    |
| total/steps_per_second  | 636      |
| train/loss_actor        | 0.172    |
| train/loss_critic       | 1.62     |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.168   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.179   |
| reference_Q_std         | 0.0861   |
| reference_action_mean   | -0.15    |
| reference_action_std    | 0.0461   |
| reference_actor_Q_mean  | -0.137   |
| reference_actor_Q_std   | 0.0926   |
| rollout/Q_mean          | -0.0671  |
| rollout/actions_mean    | 0.0304   |
| rollout/actions_std     | 0.633    |
| rollout/episode_steps   | 971      |
| rollout/episodes        | 40       |
| rollout/return          | -26.7    |
| rollout/return_history  | -26.7    |
| total/duration          | 62.4     |
| total/episodes          | 40       |
| total/epochs            | 1        |
| total/steps             | 39196    |
| total/steps_per_second  | 628      |
| train/loss_actor        | 0.0853   |
| train/loss_critic       | 0.00293  |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.126   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.158   |
| reference_Q_std         | 0.0851   |
| reference_action_mean   | -0.159   |
| reference_action_std    | 0.0251   |
| reference_actor_Q_mean  | -0.123   |
| reference_actor_Q_std   | 0.0882   |
| rollout/Q_mean          | -0.0798  |
| rollout/actions_mean    | 0.0235   |
| rollout/actions_std     | 0.623    |
| rollout/episode_steps   | 975      |
| rollout/episodes        | 46       |
| rollout/return          | -27.6    |
| rollout/return_history  | -27.6    |
| total/duration          | 73.3     |
| total/episodes          | 46       |
| total/epochs            | 1        |
| total/steps             | 45596    |
| total/steps_per_second  | 622      |
| train/loss_actor        | 0.148    |
| train/loss_critic       | 0.000877 |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.138   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.198   |
| reference_Q_std         | 0.0831   |
| reference_action_mean   | 0.163    |
| reference_action_std    | 0.0306   |
| reference_actor_Q_mean  | -0.164   |
| reference_actor_Q_std   | 0.0826   |
| rollout/Q_mean          | -0.0854  |
| rollout/actions_mean    | 0.0227   |
| rollout/actions_std     | 0.616    |
| rollout/episode_steps   | 974      |
| rollout/episodes        | 53       |
| rollout/return          | -25.8    |
| rollout/return_history  | -25.8    |
| total/duration          | 84.3     |
| total/episodes          | 53       |
| total/epochs            | 1        |
| total/steps             | 51996    |
| total/steps_per_second  | 617      |
| train/loss_actor        | 0.211    |
| train/loss_critic       | 0.000118 |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.182   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.213   |
| reference_Q_std         | 0.0787   |
| reference_action_mean   | 0.0996   |
| reference_action_std    | 0.0327   |
| reference_actor_Q_mean  | -0.179   |
| reference_actor_Q_std   | 0.078    |
| rollout/Q_mean          | -0.098   |
| rollout/actions_mean    | 0.0468   |
| rollout/actions_std     | 0.615    |
| rollout/episode_steps   | 976      |
| rollout/episodes        | 59       |
| rollout/return          | -27      |
| rollout/return_history  | -27      |
| total/duration          | 94.9     |
| total/episodes          | 59       |
| total/epochs            | 1        |
| total/steps             | 58396    |
| total/steps_per_second  | 616      |
| train/loss_actor        | 0.188    |
| train/loss_critic       | 8.85e-05 |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.123   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.16    |
| reference_Q_std         | 0.131    |
| reference_action_mean   | 0.0262   |
| reference_action_std    | 0.0206   |
| reference_actor_Q_mean  | -0.102   |
| reference_actor_Q_std   | 0.141    |
| rollout/Q_mean          | -0.104   |
| rollout/actions_mean    | 0.0363   |
| rollout/actions_std     | 0.608    |
| rollout/episode_steps   | 979      |
| rollout/episodes        | 66       |
| rollout/return          | -27.3    |
| rollout/return_history  | -27.3    |
| total/duration          | 105      |
| total/episodes          | 66       |
| total/epochs            | 1        |
| total/steps             | 64796    |
| total/steps_per_second  | 619      |
| train/loss_actor        | 0.141    |
| train/loss_critic       | 0.391    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.305   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.219   |
| reference_Q_std         | 0.133    |
| reference_action_mean   | 0.187    |
| reference_action_std    | 0.0265   |
| reference_actor_Q_mean  | -0.188   |
| reference_actor_Q_std   | 0.131    |
| rollout/Q_mean          | -0.112   |
| rollout/actions_mean    | 0.0353   |
| rollout/actions_std     | 0.607    |
| rollout/episode_steps   | 979      |
| rollout/episodes        | 72       |
| rollout/return          | -26.7    |
| rollout/return_history  | -26.7    |
| total/duration          | 115      |
| total/episodes          | 72       |
| total/epochs            | 1        |
| total/steps             | 71196    |
| total/steps_per_second  | 621      |
| train/loss_actor        | 0.185    |
| train/loss_critic       | 0.357    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.297   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.272   |
| reference_Q_std         | 0.0945   |
| reference_action_mean   | 0.195    |
| reference_action_std    | 0.0354   |
| reference_actor_Q_mean  | -0.228   |
| reference_actor_Q_std   | 0.0908   |
| rollout/Q_mean          | -0.119   |
| rollout/actions_mean    | 0.0461   |
| rollout/actions_std     | 0.619    |
| rollout/episode_steps   | 981      |
| rollout/episodes        | 79       |
| rollout/return          | -29      |
| rollout/return_history  | -29      |
| total/duration          | 124      |
| total/episodes          | 79       |
| total/epochs            | 1        |
| total/steps             | 77596    |
| total/steps_per_second  | 623      |
| train/loss_actor        | 0.206    |
| train/loss_critic       | 0.00159  |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.294   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.275   |
| reference_Q_std         | 0.1      |
| reference_action_mean   | 0.367    |
| reference_action_std    | 0.0265   |
| reference_actor_Q_mean  | -0.237   |
| reference_actor_Q_std   | 0.0955   |
| rollout/Q_mean          | -0.129   |
| rollout/actions_mean    | 0.0586   |
| rollout/actions_std     | 0.615    |
| rollout/episode_steps   | 982      |
| rollout/episodes        | 85       |
| rollout/return          | -29.2    |
| rollout/return_history  | -29.2    |
| total/duration          | 134      |
| total/episodes          | 85       |
| total/epochs            | 1        |
| total/steps             | 83996    |
| total/steps_per_second  | 625      |
| train/loss_actor        | 0.259    |
| train/loss_critic       | 0.000228 |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.32    |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.278   |
| reference_Q_std         | 0.113    |
| reference_action_mean   | 0.0964   |
| reference_action_std    | 0.0625   |
| reference_actor_Q_mean  | -0.259   |
| reference_actor_Q_std   | 0.11     |
| rollout/Q_mean          | -0.128   |
| rollout/actions_mean    | 0.0691   |
| rollout/actions_std     | 0.606    |
| rollout/episode_steps   | 981      |
| rollout/episodes        | 92       |
| rollout/return          | -27.8    |
| rollout/return_history  | -27.8    |
| total/duration          | 145      |
| total/episodes          | 92       |
| total/epochs            | 1        |
| total/steps             | 90396    |
| total/steps_per_second  | 622      |
| train/loss_actor        | 0.0285   |
| train/loss_critic       | 0.153    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.178   |
| eval/episodes           | 0        |
| eval/return             | nan      |
| eval/return_history     | nan      |
| reference_Q_mean        | -0.26    |
| reference_Q_std         | 0.128    |
| reference_action_mean   | -0.015   |
| reference_action_std    | 0.0483   |
| reference_actor_Q_mean  | -0.209   |
| reference_actor_Q_std   | 0.121    |
| rollout/Q_mean          | -0.131   |
| rollout/actions_mean    | 0.0434   |
| rollout/actions_std     | 0.616    |
| rollout/episode_steps   | 982      |
| rollout/episodes        | 98       |
| rollout/return          | -29.5    |
| rollout/return_history  | -29.5    |
| total/duration          | 156      |
| total/episodes          | 98       |
| total/epochs            | 1        |
| total/steps             | 96796    |
| total/steps_per_second  | 619      |
| train/loss_actor        | 0.176    |
| train/loss_critic       | 0.00744  |
| train/param_noise_di..


---------------------------------------
| eval/Q                  | -0.113    |
| eval/episodes           | 2         |
| eval/return             | -1.6e-05  |
| eval/return_history     | -2.85e-05 |
| reference_Q_mean        | -0.256    |
| reference_Q_std         | 0.142     |
| reference_action_mean   | 0.0363    |
| reference_action_std    | 0.0391    |
| reference_actor_Q_mean  | -0.207    |
| reference_actor_Q_std   | 0.136     |
| rollout/Q_mean          | -0.119    |
| rollout/actions_mean    | 0.0369    |
| rollout/actions_std     | 0.617     |
| rollout/episode_steps   | 979       |
| rollout/episodes        | 105       |
| rollout/return          | -27.9     |
| rollout/return_history  | -27.6     |
| total/duration          | 168       |
| total/episodes          | 105       |
| total/epochs            | 1         |
| total/steps             | 103196    |
| total/steps_per_second  | 615       |
| train/loss_actor        | -0.00739  |
| train/loss_critic       | 0.0057    |


---------------------------------------
| eval/Q                  | -0.0474   |
| eval/episodes           | 2         |
| eval/return             | -0.000252 |
| eval/return_history     | -0.000169 |
| reference_Q_mean        | -0.247    |
| reference_Q_std         | 0.16      |
| reference_action_mean   | 0.0702    |
| reference_action_std    | 0.035     |
| reference_actor_Q_mean  | -0.195    |
| reference_actor_Q_std   | 0.153     |
| rollout/Q_mean          | -0.115    |
| rollout/actions_mean    | 0.0463    |
| rollout/actions_std     | 0.615     |
| rollout/episode_steps   | 976       |
| rollout/episodes        | 112       |
| rollout/return          | -27.3     |
| rollout/return_history  | -26.1     |
| total/duration          | 179       |
| total/episodes          | 112       |
| total/epochs            | 1         |
| total/steps             | 109596    |
| total/steps_per_second  | 613       |
| train/loss_actor        | 0.18      |
| train/loss_critic       | 0.000248  |


---------------------------------------
| eval/Q                  | -0.0125   |
| eval/episodes           | 2         |
| eval/return             | -0.00122  |
| eval/return_history     | -0.000989 |
| reference_Q_mean        | -0.238    |
| reference_Q_std         | 0.189     |
| reference_action_mean   | 0.133     |
| reference_action_std    | 0.042     |
| reference_actor_Q_mean  | -0.189    |
| reference_actor_Q_std   | 0.183     |
| rollout/Q_mean          | -0.103    |
| rollout/actions_mean    | 0.057     |
| rollout/actions_std     | 0.61      |
| rollout/episode_steps   | 977       |
| rollout/episodes        | 118       |
| rollout/return          | -27.5     |
| rollout/return_history  | -29.2     |
| total/duration          | 189       |
| total/episodes          | 118       |
| total/epochs            | 1         |
| total/steps             | 115996    |
| total/steps_per_second  | 613       |
| train/loss_actor        | -0.13     |
| train/loss_critic       | 0.00437   |


---------------------------------------
| eval/Q                  | -0.0378   |
| eval/episodes           | 2         |
| eval/return             | -0.000661 |
| eval/return_history     | -0.000976 |
| reference_Q_mean        | -0.27     |
| reference_Q_std         | 0.219     |
| reference_action_mean   | -0.0567   |
| reference_action_std    | 0.0282    |
| reference_actor_Q_mean  | -0.213    |
| reference_actor_Q_std   | 0.204     |
| rollout/Q_mean          | -0.0904   |
| rollout/actions_mean    | 0.0485    |
| rollout/actions_std     | 0.613     |
| rollout/episode_steps   | 977       |
| rollout/episodes        | 125       |
| rollout/return          | -27.4     |
| rollout/return_history  | -29.6     |
| total/duration          | 200       |
| total/episodes          | 125       |
| total/epochs            | 1         |
| total/steps             | 122396    |
| total/steps_per_second  | 611       |
| train/loss_actor        | -0.249    |
| train/loss_critic       | 0.00732   |


---------------------------------------
| eval/Q                  | -0.0764   |
| eval/episodes           | 2         |
| eval/return             | -0.00135  |
| eval/return_history     | -0.000435 |
| reference_Q_mean        | -0.218    |
| reference_Q_std         | 0.206     |
| reference_action_mean   | 0.116     |
| reference_action_std    | 0.00548   |
| reference_actor_Q_mean  | -0.162    |
| reference_actor_Q_std   | 0.196     |
| rollout/Q_mean          | -0.0935   |
| rollout/actions_mean    | 0.0728    |
| rollout/actions_std     | 0.619     |
| rollout/episode_steps   | 978       |
| rollout/episodes        | 131       |
| rollout/return          | -28.7     |
| rollout/return_history  | -30.7     |
| total/duration          | 211       |
| total/episodes          | 131       |
| total/epochs            | 1         |
| total/steps             | 128796    |
| total/steps_per_second  | 610       |
| train/loss_actor        | 0.102     |
| train/loss_critic       | 0.00215   |


---------------------------------------
| eval/Q                  | -0.0513   |
| eval/episodes           | 2         |
| eval/return             | -0.00079  |
| eval/return_history     | -0.000583 |
| reference_Q_mean        | -0.184    |
| reference_Q_std         | 0.225     |
| reference_action_mean   | -0.0599   |
| reference_action_std    | 0.0463    |
| reference_actor_Q_mean  | -0.138    |
| reference_actor_Q_std   | 0.216     |
| rollout/Q_mean          | -0.0559   |
| rollout/actions_mean    | 0.0698    |
| rollout/actions_std     | 0.619     |
| rollout/episode_steps   | 979       |
| rollout/episodes        | 138       |
| rollout/return          | -27.9     |
| rollout/return_history  | -28.6     |
| total/duration          | 222       |
| total/episodes          | 138       |
| total/epochs            | 1         |
| total/steps             | 135196    |
| total/steps_per_second  | 610       |
| train/loss_actor        | -0.511    |
| train/loss_critic       | 0.000693  |


---------------------------------------
| eval/Q                  | -0.0952   |
| eval/episodes           | 2         |
| eval/return             | -0.000752 |
| eval/return_history     | -0.000231 |
| reference_Q_mean        | -0.169    |
| reference_Q_std         | 0.232     |
| reference_action_mean   | -0.0159   |
| reference_action_std    | 0.123     |
| reference_actor_Q_mean  | -0.123    |
| reference_actor_Q_std   | 0.227     |
| rollout/Q_mean          | -0.0119   |
| rollout/actions_mean    | 0.0638    |
| rollout/actions_std     | 0.618     |
| rollout/episode_steps   | 976       |
| rollout/episodes        | 145       |
| rollout/return          | -26       |
| rollout/return_history  | -25.6     |
| total/duration          | 232       |
| total/episodes          | 145       |
| total/epochs            | 1         |
| total/steps             | 141596    |
| total/steps_per_second  | 610       |
| train/loss_actor        | -1.19     |
| train/loss_critic       | 0.0937    |


---------------------------------------
| eval/Q                  | -0.161    |
| eval/episodes           | 2         |
| eval/return             | -0.00065  |
| eval/return_history     | -0.000626 |
| reference_Q_mean        | -0.245    |
| reference_Q_std         | 0.264     |
| reference_action_mean   | -0.0169   |
| reference_action_std    | 0.141     |
| reference_actor_Q_mean  | -0.181    |
| reference_actor_Q_std   | 0.252     |
| rollout/Q_mean          | 0.018     |
| rollout/actions_mean    | 0.0415    |
| rollout/actions_std     | 0.624     |
| rollout/episode_steps   | 972       |
| rollout/episodes        | 152       |
| rollout/return          | -24.9     |
| rollout/return_history  | -24.3     |
| total/duration          | 243       |
| total/episodes          | 152       |
| total/epochs            | 1         |
| total/steps             | 147996    |
| total/steps_per_second  | 609       |
| train/loss_actor        | -0.997    |
| train/loss_critic       | 0.119     |


--------------------------------------
| eval/Q                  | -0.116   |
| eval/episodes           | 2        |
| eval/return             | -0.00982 |
| eval/return_history     | -0.00131 |
| reference_Q_mean        | -0.167   |
| reference_Q_std         | 0.242    |
| reference_action_mean   | 0.204    |
| reference_action_std    | 0.115    |
| reference_actor_Q_mean  | -0.11    |
| reference_actor_Q_std   | 0.238    |
| rollout/Q_mean          | 0.106    |
| rollout/actions_mean    | 0.0307   |
| rollout/actions_std     | 0.629    |
| rollout/episode_steps   | 946      |
| rollout/episodes        | 163      |
| rollout/return          | -21      |
| rollout/return_history  | -17.3    |
| total/duration          | 254      |
| total/episodes          | 163      |
| total/epochs            | 1        |
| total/steps             | 154396   |
| total/steps_per_second  | 608      |
| train/loss_actor        | -2.16    |
| train/loss_critic       | 0.0252   |
| train/param_noise_di..


---------------------------------------
| eval/Q                  | 0.277     |
| eval/episodes           | 2         |
| eval/return             | -0.000941 |
| eval/return_history     | -0.00215  |
| reference_Q_mean        | -0.127    |
| reference_Q_std         | 0.275     |
| reference_action_mean   | 0.0718    |
| reference_action_std    | 0.0842    |
| reference_actor_Q_mean  | -0.0522   |
| reference_actor_Q_std   | 0.256     |
| rollout/Q_mean          | 0.243     |
| rollout/actions_mean    | 0.0234    |
| rollout/actions_std     | 0.629     |
| rollout/episode_steps   | 929       |
| rollout/episodes        | 173       |
| rollout/return          | -17.8     |
| rollout/return_history  | -11.4     |
| total/duration          | 265       |
| total/episodes          | 173       |
| total/epochs            | 1         |
| total/steps             | 160796    |
| total/steps_per_second  | 607       |
| train/loss_actor        | -3.45     |
| train/loss_critic       | 0.0326    |


--------------------------------------
| eval/Q                  | 0.346    |
| eval/episodes           | 2        |
| eval/return             | -0.0127  |
| eval/return_history     | -0.00331 |
| reference_Q_mean        | -0.123   |
| reference_Q_std         | 0.283    |
| reference_action_mean   | 0.341    |
| reference_action_std    | 0.0365   |
| reference_actor_Q_mean  | -0.0732  |
| reference_actor_Q_std   | 0.272    |
| rollout/Q_mean          | 0.344    |
| rollout/actions_mean    | 0.0365   |
| rollout/actions_std     | 0.628    |
| rollout/episode_steps   | 913      |
| rollout/episodes        | 183      |
| rollout/return          | -14.3    |
| rollout/return_history  | -1.96    |
| total/duration          | 277      |
| total/episodes          | 183      |
| total/epochs            | 1        |
| total/steps             | 167196   |
| total/steps_per_second  | 604      |
| train/loss_actor        | -3.58    |
| train/loss_critic       | 0.0875   |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | -0.152   |
| eval/episodes           | 2        |
| eval/return             | -0.0138  |
| eval/return_history     | -0.0179  |
| reference_Q_mean        | -0.0876  |
| reference_Q_std         | 0.234    |
| reference_action_mean   | 0.567    |
| reference_action_std    | 0.0763   |
| reference_actor_Q_mean  | -0.0428  |
| reference_actor_Q_std   | 0.242    |
| rollout/Q_mean          | 0.45     |
| rollout/actions_mean    | 0.0581   |
| rollout/actions_std     | 0.633    |
| rollout/episode_steps   | 894      |
| rollout/episodes        | 194      |
| rollout/return          | -11.4    |
| rollout/return_history  | 4.67     |
| total/duration          | 288      |
| total/episodes          | 194      |
| total/epochs            | 1        |
| total/steps             | 173596   |
| total/steps_per_second  | 603      |
| train/loss_actor        | -3.74    |
| train/loss_critic       | 0.59     |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 96.4     |
| eval/episodes           | 2        |
| eval/return             | -0.1     |
| eval/return_history     | -0.0509  |
| reference_Q_mean        | -0.148   |
| reference_Q_std         | 0.338    |
| reference_action_mean   | -0.243   |
| reference_action_std    | 0.142    |
| reference_actor_Q_mean  | -0.126   |
| reference_actor_Q_std   | 0.323    |
| rollout/Q_mean          | 0.571    |
| rollout/actions_mean    | 0.077    |
| rollout/actions_std     | 0.637    |
| rollout/episode_steps   | 875      |
| rollout/episodes        | 205      |
| rollout/return          | -8.82    |
| rollout/return_history  | 11.2     |
| total/duration          | 299      |
| total/episodes          | 205      |
| total/epochs            | 1        |
| total/steps             | 179996   |
| total/steps_per_second  | 601      |
| train/loss_actor        | -4.16    |
| train/loss_critic       | 0.109    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 125      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 57.9     |
| reference_Q_mean        | -0.101   |
| reference_Q_std         | 0.369    |
| reference_action_mean   | 0.513    |
| reference_action_std    | 0.0724   |
| reference_actor_Q_mean  | -0.0452  |
| reference_actor_Q_std   | 0.351    |
| rollout/Q_mean          | 0.606    |
| rollout/actions_mean    | 0.075    |
| rollout/actions_std     | 0.634    |
| rollout/episode_steps   | 873      |
| rollout/episodes        | 213      |
| rollout/return          | -7.98    |
| rollout/return_history  | 13.8     |
| total/duration          | 311      |
| total/episodes          | 213      |
| total/epochs            | 1        |
| total/steps             | 186396   |
| total/steps_per_second  | 600      |
| train/loss_actor        | -0.537   |
| train/loss_critic       | 0.0447   |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 127      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.0999  |
| reference_Q_std         | 0.44     |
| reference_action_mean   | 0.578    |
| reference_action_std    | 0.147    |
| reference_actor_Q_mean  | -0.036   |
| reference_actor_Q_std   | 0.437    |
| rollout/Q_mean          | 0.709    |
| rollout/actions_mean    | 0.0833   |
| rollout/actions_std     | 0.635    |
| rollout/episode_steps   | 862      |
| rollout/episodes        | 223      |
| rollout/return          | -5.58    |
| rollout/return_history  | 20.7     |
| total/duration          | 322      |
| total/episodes          | 223      |
| total/epochs            | 1        |
| total/steps             | 192796   |
| total/steps_per_second  | 599      |
| train/loss_actor        | -3.09    |
| train/loss_critic       | 0.361    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 126      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.184   |
| reference_Q_std         | 0.554    |
| reference_action_mean   | 0.127    |
| reference_action_std    | 0.418    |
| reference_actor_Q_mean  | -0.157   |
| reference_actor_Q_std   | 0.539    |
| rollout/Q_mean          | 0.82     |
| rollout/actions_mean    | 0.0948   |
| rollout/actions_std     | 0.636    |
| rollout/episode_steps   | 845      |
| rollout/episodes        | 235      |
| rollout/return          | -2.95    |
| rollout/return_history  | 31.5     |
| total/duration          | 333      |
| total/episodes          | 235      |
| total/epochs            | 1        |
| total/steps             | 199196   |
| total/steps_per_second  | 598      |
| train/loss_actor        | -4.47    |
| train/loss_critic       | 0.259    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 122      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.083   |
| reference_Q_std         | 0.799    |
| reference_action_mean   | 0.803    |
| reference_action_std    | 0.158    |
| reference_actor_Q_mean  | -0.0442  |
| reference_actor_Q_std   | 0.798    |
| rollout/Q_mean          | 1.24     |
| rollout/actions_mean    | 0.106    |
| rollout/actions_std     | 0.635    |
| rollout/episode_steps   | 774      |
| rollout/episodes        | 265      |
| rollout/return          | 7.54     |
| rollout/return_history  | 52.5     |
| total/duration          | 344      |
| total/episodes          | 265      |
| total/epochs            | 1        |
| total/steps             | 205596   |
| total/steps_per_second  | 597      |
| train/loss_actor        | -14.3    |
| train/loss_critic       | 0.533    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 119      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.564   |
| reference_Q_std         | 1.03     |
| reference_action_mean   | -0.00104 |
| reference_action_std    | 0.369    |
| reference_actor_Q_mean  | -0.462   |
| reference_actor_Q_std   | 1.01     |
| rollout/Q_mean          | 1.64     |
| rollout/actions_mean    | 0.105    |
| rollout/actions_std     | 0.638    |
| rollout/episode_steps   | 730      |
| rollout/episodes        | 290      |
| rollout/return          | 14.3     |
| rollout/return_history  | 66.9     |
| total/duration          | 356      |
| total/episodes          | 290      |
| total/epochs            | 1        |
| total/steps             | 211996   |
| total/steps_per_second  | 595      |
| train/loss_actor        | -16.4    |
| train/loss_critic       | 0.791    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 115      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.17    |
| reference_Q_std         | 1.35     |
| reference_action_mean   | 0.18     |
| reference_action_std    | 0.26     |
| reference_actor_Q_mean  | -0.0609  |
| reference_actor_Q_std   | 1.35     |
| rollout/Q_mean          | 1.98     |
| rollout/actions_mean    | 0.11     |
| rollout/actions_std     | 0.637    |
| rollout/episode_steps   | 707      |
| rollout/episodes        | 309      |
| rollout/return          | 18.7     |
| rollout/return_history  | 73.5     |
| total/duration          | 369      |
| total/episodes          | 309      |
| total/epochs            | 1        |
| total/steps             | 218396   |
| total/steps_per_second  | 593      |
| train/loss_actor        | -13.1    |
| train/loss_critic       | 0.593    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 111      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.129   |
| reference_Q_std         | 1.77     |
| reference_action_mean   | -0.0542  |
| reference_action_std    | 0.39     |
| reference_actor_Q_mean  | -0.0776  |
| reference_actor_Q_std   | 1.76     |
| rollout/Q_mean          | 2.26     |
| rollout/actions_mean    | 0.109    |
| rollout/actions_std     | 0.639    |
| rollout/episode_steps   | 683      |
| rollout/episodes        | 329      |
| rollout/return          | 22.4     |
| rollout/return_history  | 85.9     |
| total/duration          | 380      |
| total/episodes          | 329      |
| total/epochs            | 1        |
| total/steps             | 224796   |
| total/steps_per_second  | 591      |
| train/loss_actor        | -13.2    |
| train/loss_critic       | 0.726    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 108      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.464   |
| reference_Q_std         | 2.15     |
| reference_action_mean   | 0.0759   |
| reference_action_std    | 0.366    |
| reference_actor_Q_mean  | -0.371   |
| reference_actor_Q_std   | 2.16     |
| rollout/Q_mean          | 2.78     |
| rollout/actions_mean    | 0.113    |
| rollout/actions_std     | 0.639    |
| rollout/episode_steps   | 647      |
| rollout/episodes        | 357      |
| rollout/return          | 27.7     |
| rollout/return_history  | 85.9     |
| total/duration          | 392      |
| total/episodes          | 357      |
| total/epochs            | 1        |
| total/steps             | 231196   |
| total/steps_per_second  | 589      |
| train/loss_actor        | -20      |
| train/loss_critic       | 0.902    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 105      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.758   |
| reference_Q_std         | 2.37     |
| reference_action_mean   | 0.152    |
| reference_action_std    | 0.37     |
| reference_actor_Q_mean  | -0.663   |
| reference_actor_Q_std   | 2.38     |
| rollout/Q_mean          | 3.27     |
| rollout/actions_mean    | 0.106    |
| rollout/actions_std     | 0.643    |
| rollout/episode_steps   | 616      |
| rollout/episodes        | 385      |
| rollout/return          | 31.8     |
| rollout/return_history  | 85.3     |
| total/duration          | 404      |
| total/episodes          | 385      |
| total/epochs            | 1        |
| total/steps             | 237596   |
| total/steps_per_second  | 588      |
| train/loss_actor        | -23.5    |
| train/loss_critic       | 0.939    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 102      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.864   |
| reference_Q_std         | 2.74     |
| reference_action_mean   | 0.143    |
| reference_action_std    | 0.506    |
| reference_actor_Q_mean  | -0.759   |
| reference_actor_Q_std   | 2.79     |
| rollout/Q_mean          | 3.74     |
| rollout/actions_mean    | 0.116    |
| rollout/actions_std     | 0.644    |
| rollout/episode_steps   | 592      |
| rollout/episodes        | 412      |
| rollout/return          | 35.3     |
| rollout/return_history  | 85.4     |
| total/duration          | 415      |
| total/episodes          | 412      |
| total/epochs            | 1        |
| total/steps             | 243996   |
| total/steps_per_second  | 587      |
| train/loss_actor        | -21.4    |
| train/loss_critic       | 0.921    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 98.7     |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.81    |
| reference_Q_std         | 3.23     |
| reference_action_mean   | 0.383    |
| reference_action_std    | 0.419    |
| reference_actor_Q_mean  | -0.591   |
| reference_actor_Q_std   | 3.37     |
| rollout/Q_mean          | 4.24     |
| rollout/actions_mean    | 0.128    |
| rollout/actions_std     | 0.646    |
| rollout/episode_steps   | 564      |
| rollout/episodes        | 444      |
| rollout/return          | 38.8     |
| rollout/return_history  | 85       |
| total/duration          | 427      |
| total/episodes          | 444      |
| total/epochs            | 1        |
| total/steps             | 250396   |
| total/steps_per_second  | 586      |
| train/loss_actor        | -22.6    |
| train/loss_critic       | 1.21     |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 100      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | -0.469   |
| reference_Q_std         | 3.26     |
| reference_action_mean   | 0.494    |
| reference_action_std    | 0.315    |
| reference_actor_Q_mean  | -0.302   |
| reference_actor_Q_std   | 3.39     |
| rollout/Q_mean          | 4.69     |
| rollout/actions_mean    | 0.134    |
| rollout/actions_std     | 0.647    |
| rollout/episode_steps   | 550      |
| rollout/episodes        | 466      |
| rollout/return          | 41       |
| rollout/return_history  | 85.9     |
| total/duration          | 439      |
| total/episodes          | 466      |
| total/epochs            | 1        |
| total/steps             | 256796   |
| total/steps_per_second  | 584      |
| train/loss_actor        | -20.4    |
| train/loss_critic       | 4.27     |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 102      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | 0.263    |
| reference_Q_std         | 3.67     |
| reference_action_mean   | 0.577    |
| reference_action_std    | 0.281    |
| reference_actor_Q_mean  | 0.545    |
| reference_actor_Q_std   | 3.87     |
| rollout/Q_mean          | 4.96     |
| rollout/actions_mean    | 0.146    |
| rollout/actions_std     | 0.647    |
| rollout/episode_steps   | 544      |
| rollout/episodes        | 484      |
| rollout/return          | 42.4     |
| rollout/return_history  | 83.8     |
| total/duration          | 452      |
| total/episodes          | 484      |
| total/epochs            | 1        |
| total/steps             | 263196   |
| total/steps_per_second  | 583      |
| train/loss_actor        | -14.2    |
| train/loss_critic       | 0.864    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 103      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | 1.74     |
| reference_Q_std         | 4.56     |
| reference_action_mean   | 0.679    |
| reference_action_std    | 0.169    |
| reference_actor_Q_mean  | 1.93     |
| reference_actor_Q_std   | 4.64     |
| rollout/Q_mean          | 5.4      |
| rollout/actions_mean    | 0.15     |
| rollout/actions_std     | 0.65     |
| rollout/episode_steps   | 544      |
| rollout/episodes        | 495      |
| rollout/return          | 42.1     |
| rollout/return_history  | 77.9     |
| total/duration          | 464      |
| total/episodes          | 495      |
| total/epochs            | 1        |
| total/steps             | 269596   |
| total/steps_per_second  | 581      |
| train/loss_actor        | -26.1    |
| train/loss_critic       | 27.8     |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 103      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | 1.69     |
| reference_Q_std         | 5.19     |
| reference_action_mean   | 0.558    |
| reference_action_std    | 0.21     |
| reference_actor_Q_mean  | 1.94     |
| reference_actor_Q_std   | 5.3      |
| rollout/Q_mean          | 5.63     |
| rollout/actions_mean    | 0.161    |
| rollout/actions_std     | 0.651    |
| rollout/episode_steps   | 546      |
| rollout/episodes        | 505      |
| rollout/return          | 42       |
| rollout/return_history  | 73.1     |
| total/duration          | 476      |
| total/episodes          | 505      |
| total/epochs            | 1        |
| total/steps             | 275996   |
| total/steps_per_second  | 580      |
| train/loss_actor        | -17.9    |
| train/loss_critic       | 9.52     |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 108      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | 1.88     |
| reference_Q_std         | 5.21     |
| reference_action_mean   | 0.664    |
| reference_action_std    | 0.188    |
| reference_actor_Q_mean  | 2.1      |
| reference_actor_Q_std   | 5.47     |
| rollout/Q_mean          | 5.71     |
| rollout/actions_mean    | 0.174    |
| rollout/actions_std     | 0.653    |
| rollout/episode_steps   | 551      |
| rollout/episodes        | 512      |
| rollout/return          | 40.8     |
| rollout/return_history  | 63.5     |
| total/duration          | 488      |
| total/episodes          | 512      |
| total/epochs            | 1        |
| total/steps             | 282396   |
| total/steps_per_second  | 579      |
| train/loss_actor        | -3.06    |
| train/loss_critic       | 0.132    |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 103      |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | 1.72     |
| reference_Q_std         | 5.84     |
| reference_action_mean   | 0.0304   |
| reference_action_std    | 0.653    |
| reference_actor_Q_mean  | 2.02     |
| reference_actor_Q_std   | 5.93     |
| rollout/Q_mean          | 6.4      |
| rollout/actions_mean    | 0.177    |
| rollout/actions_std     | 0.655    |
| rollout/episode_steps   | 537      |
| rollout/episodes        | 538      |
| rollout/return          | 42.6     |
| rollout/return_history  | 61.7     |
| total/duration          | 500      |
| total/episodes          | 538      |
| total/epochs            | 1        |
| total/steps             | 288796   |
| total/steps_per_second  | 578      |
| train/loss_actor        | -47.1    |
| train/loss_critic       | 10.8     |
| train/param_noise_di..


--------------------------------------
| eval/Q                  | 99.7     |
| eval/episodes           | 2        |
| eval/return             | 99.9     |
| eval/return_history     | 99.9     |
| reference_Q_mean        | 2.14     |
| reference_Q_std         | 6.11     |
| reference_action_mean   | -0.349   |
| reference_action_std    | 0.77     |
| reference_actor_Q_mean  | 3        |
| reference_actor_Q_std   | 6.25     |
| rollout/Q_mean          | 7.44     |
| rollout/actions_mean    | 0.18     |
| rollout/actions_std     | 0.658    |
| rollout/episode_steps   | 505      |
| rollout/episodes        | 584      |
| rollout/return          | 46.2     |
| rollout/return_history  | 64.7     |
| total/duration          | 511      |
| total/episodes          | 584      |
| total/epochs            | 1        |
| total/steps             | 295196   |
| total/steps_per_second  | 577      |
| train/loss_actor        | -56.8    |
| train/loss_critic       | 21       |
| train/param_noise_di..

<stable_baselines.ddpg.ddpg.DDPG at 0x7fd4ea135cc0>
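The tables above are easier to compare once they are turned into numbers. The following is a minimal sketch, not part of stable baselines: `parse_log` and `ROW` are hypothetical names, and it assumes the log has been captured as plain text in the layout printed above. The sample rows are copied from two of the tables above, with most rows omitted. Truncated rows such as `train/param_noise_di..` carry no value and are skipped.

```python
import re

# One "| key | value |" row of the logger table.
ROW = re.compile(r"^\|\s*(\S+)\s*\|\s*(\S+)\s*\|\s*$")


def parse_log(text):
    """Split captured logger output into one {key: float} dict per table."""
    tables, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("---"):
            # A dashed rule opens a new table.
            current = {}
            tables.append(current)
            continue
        match = ROW.match(line)
        if match and current is not None:
            key, value = match.groups()
            try:
                current[key] = float(value)
            except ValueError:
                pass  # ignore cells that are not numeric
    # Drop empty tables (for example, from a closing dashed rule).
    return [t for t in tables if t]


# Excerpt of two tables from the output above (rows omitted for brevity).
sample = """
--------------------------------------
| eval/Q                  | 96.4     |
| eval/return             | -0.1     |
| rollout/return_history  | 11.2     |
| total/steps             | 179996   |
| train/param_noise_di..

--------------------------------------
| eval/Q                  | 125      |
| eval/return             | 99.9     |
| rollout/return_history  | 13.8     |
| total/steps             | 186396   |
| train/param_noise_di..
"""

tables = parse_log(sample)
for t in tables:
    print(int(t["total/steps"]), t["eval/return"])
```

In this excerpt `eval/return` jumps from -0.1 to 99.9 between steps 179996 and 186396, which is the point in the log where the evaluation policy starts reaching the goal.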

Save the model

In [54]:
model.save("ddpg_mountain")

Test your model

In [69]:
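obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    # Start a new episode once the current one ends
    # (assumes a single, non-vectorized Gym environment).
    if dones:
        obs = env.reset()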

env.close()
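Rendering shows what the agent does, but a number per episode makes runs easier to compare. Below is a minimal sketch of such a check. `run_episodes`, `StubEnv` and `StubModel` are hypothetical names introduced here; the two stub classes only stand in for the real environment and trained model so the snippet runs on its own. It assumes an environment with the Gym-style `reset()` and `step()` methods that returns a scalar reward and done flag.

```python
def run_episodes(model, env, n_episodes=5, max_steps=1000):
    """Return the undiscounted return of each evaluation episode."""
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()
        total = 0.0
        for _ in range(max_steps):
            action, _state = model.predict(obs)
            obs, reward, done, _info = env.step(action)
            total += float(reward)
            if done:
                break  # episode finished, start the next one
        returns.append(total)
    return returns


# Hypothetical stand-ins so the sketch runs without Gym or stable baselines.
class StubEnv:
    """Episode lasts three steps; each step pays the action as reward."""

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        return float(self.t), action, self.t >= 3, {}


class StubModel:
    """Always predicts the action 1.0 and carries no recurrent state."""

    def predict(self, obs):
        return 1.0, None


print(run_episodes(StubModel(), StubEnv(), n_episodes=2))  # [3.0, 3.0]
```

With the trained agent, the call would be `run_episodes(model, env)` made before `env.close()`. For a vectorized environment, `reward` and `done` would be arrays and the loop would need to be adapted.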