# PPO in Stable Baselines

In single-agent PPO, `MlpPolicy` was used in `PPO1` as follows:

```
model = PPO1(MlpPolicy, env, timesteps_per_actorbatch=4096, clip_param=0.2, entcoeff=0.0, optim_epochs=10,
                 optim_stepsize=3e-4, optim_batchsize=64, gamma=0.99, lam=0.95, schedule='linear', verbose=2)

```

`MlpPolicy` is found in `stable_baselines/common/policies.py`, inheriting `FeedForwardPolicy`, which inherits from `ActorCriticPolicy`.

`FeedForwardPolicy`'s `__init__` contains the following:
```
if net_arch is None:
    if layers is None:
        layers = [64, 64]
    net_arch = [dict(vf=layers, pi=layers)]

with tf.variable_scope("model", reuse=reuse):
    if feature_extraction == "cnn":
        pi_latent = vf_latent = cnn_extractor(self.processed_obs, **kwargs)
    else:
        pi_latent, vf_latent = mlp_extractor(tf.layers.flatten(self.processed_obs), net_arch, act_fun)

    self._value_fn = linear(vf_latent, 'vf', 1)

    self._proba_distribution, self._policy, self.q_value = \
        self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)
```

Since `MlpPolicy` uses `feature_extraction="mlp"`, look into `mlp_extractor` ([here](https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/policies.py)).

`mlp_extractor` constructs an MLP that receives observations as input and outputs one latent representation for the policy network and one for the value network. The number and size of the hidden layers, and how many of them are shared between the policy and value networks, are specified with `net_arch`.
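As a rough illustration of how a `net_arch` list is read, the sketch below splits a specification into shared layer sizes and the policy-only and value-only layer sizes. `split_net_arch` is a hypothetical helper written for this note, not part of Stable Baselines; it only mirrors the parsing logic and builds no layers.

```python
def split_net_arch(net_arch):
    """Split a net_arch spec into (shared, policy_only, value_only) layer sizes.

    Hypothetical helper for illustration: it mirrors how the spec is read
    but does not build any TensorFlow layers.
    """
    shared, policy_only, value_only = [], [], []
    for layer in net_arch:
        if isinstance(layer, int):
            # Leading integers describe layers shared by both networks.
            shared.append(layer)
        else:
            assert isinstance(layer, dict), "net_arch may only contain ints and a dict"
            # The dict describes the non-shared layers; a missing key means none.
            policy_only = list(layer.get("pi", []))
            value_only = list(layer.get("vf", []))
            break  # Anything after the dict is ignored.
    return shared, policy_only, value_only


# Default used by FeedForwardPolicy when neither net_arch nor layers is given.
default_arch = [dict(vf=[64, 64], pi=[64, 64])]
print(split_net_arch(default_arch))                          # ([], [64, 64], [64, 64])
print(split_net_arch([55, dict(vf=[255, 255], pi=[128])]))   # ([55], [128], [255, 255])
```

With the default architecture nothing is shared: the policy and value networks each get their own two hidden layers of 64 units.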

`mlp_extractor` iterates through `net_arch` and creates each layer with `latent = act_fun(linear(latent, ...))`. Therefore, look into `act_fun` and `linear`; the latter is defined in `stable_baselines.common.tf_layers`.

`FeedForwardPolicy`'s default for `act_fun` is `tf.tanh`. `linear` is defined as:

```
def linear(input_tensor, scope, n_hidden, *, init_scale=1.0, init_bias=0.0):
    """
    Creates a fully connected layer for TensorFlow
    :param input_tensor: (TensorFlow Tensor) The input tensor for the fully connected layer
    :param scope: (str) The TensorFlow variable scope
    :param n_hidden: (int) The number of hidden neurons
    :param init_scale: (int) The initialization scale
    :param init_bias: (int) The initialization offset bias
    :return: (TensorFlow Tensor) fully connected layer
    """
    with tf.variable_scope(scope):
        n_input = input_tensor.get_shape()[1].value
        weight = tf.get_variable("w", [n_input, n_hidden], initializer=ortho_init(init_scale))
        bias = tf.get_variable("b", [n_hidden], initializer=tf.constant_initializer(init_bias))
        return tf.matmul(input_tensor, weight) + bias
```

Therefore, to turn this model into a Bayesian neural network, each `linear` layer needs to be replaced with a `tfp.layers.DenseVariational` layer. We can do this by modifying `FeedForwardPolicy` (which `MlpPolicy` inherits from) to use a new `bnn_extractor`, then creating a `BnnPolicy` to replace `MlpPolicy`.

In [157]:
from tensorflow.keras import backend as K
from tensorflow.keras import activations, initializers
from tensorflow.keras.layers import Layer

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfp.__version__

'0.8.0'

In [158]:
# Specify the surrogate posterior over `keras.layers.Dense` `kernel` and `bias`.
def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    c = np.log(np.expm1(1.))
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(2 * n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t[..., :n],
                     scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])

# Specify the prior over `keras.layers.Dense` `kernel` and `bias`.
def prior_trainable(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t, scale=1),
            reinterpreted_batch_ndims=1)),
    ])
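The constant `c = np.log(np.expm1(1.))` in `posterior_mean_field` is the inverse softplus of 1, so a raw scale parameter of zero maps to a standard deviation of about 1. The sketch below checks that arithmetic and draws one reparameterized weight sample using only the standard library. `softplus`, `posterior_scale` and `sample_weight` are illustrative names, not TensorFlow Probability functions.

```python
import math
import random


def softplus(x):
    """Plain softplus, log(1 + exp(x)); adequate for moderate x."""
    return math.log1p(math.exp(x))


C = math.log(math.expm1(1.0))  # inverse softplus of 1


def posterior_scale(raw):
    """Map an unconstrained parameter to a strictly positive standard deviation."""
    return 1e-5 + softplus(C + raw)


def sample_weight(loc, raw, rng):
    """Reparameterized draw: loc + scale * eps with eps ~ N(0, 1)."""
    return loc + posterior_scale(raw) * rng.gauss(0.0, 1.0)


rng = random.Random(0)
print(posterior_scale(0.0))    # about 1.00001
print(posterior_scale(-5.0))   # small but strictly positive
print(sample_weight(0.2, 0.0, rng))
```

The `1e-5` offset keeps the scale away from zero even when the raw parameter becomes very negative.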

In [159]:
def bnn_extractor(flat_observations, net_arch, act_fun):
    """
    Constructs a variational network that receives observations as an input and outputs a latent representation for the policy and
    a value network. The ``net_arch`` parameter allows to specify the amount and size of the hidden layers and how many
    of them are shared between the policy network and the value network. It is assumed to be a list with the following
    structure:
    1. An arbitrary length (zero allowed) number of integers each specifying the number of units in a shared layer.
       If the number of ints is zero, there will be no shared layers.
    2. An optional dict, to specify the following non-shared layers for the value network and the policy network.
       It is formatted like ``dict(vf=[<value layer sizes>], pi=[<policy layer sizes>])``.
       If it is missing any of the keys (pi or vf), no non-shared layers (empty list) is assumed.
    For example to construct a network with one shared layer of size 55 followed by two non-shared layers for the value
    network of size 255 and a single non-shared layer of size 128 for the policy network, the following layers_spec
    would be used: ``[55, dict(vf=[255, 255], pi=[128])]``. A simple shared network topology with two layers of size 128
    would be specified as [128, 128].
    :param flat_observations: (tf.Tensor) The observations to base policy and value function on.
    :param net_arch: ([int or dict]) The specification of the policy and value networks.
        See above for details on its formatting.
    :param act_fun: (tf function) The activation function to use for the networks.
    :return: (tf.Tensor, tf.Tensor) latent_policy, latent_value of the specified network.
        If all layers are shared, then ``latent_policy == latent_value``
    """
    latent = flat_observations
    policy_only_layers = []  # Layer sizes of the network that only belongs to the policy network
    value_only_layers = []  # Layer sizes of the network that only belongs to the value network

    # Iterate through the shared layers and build the shared parts of the network
    for idx, layer in enumerate(net_arch):
        if isinstance(layer, int):  # Check that this is a shared layer
            layer_size = layer
#             latent = act_fun(linear(latent, "shared_fc{}".format(idx), layer_size, init_scale=np.sqrt(2)))
            # Apply the variational layer to the current latent tensor before the activation.
            latent = act_fun(tfp.layers.DenseVariational(layer_size, posterior_mean_field, prior_trainable, kl_weight=1)(latent))
        else:
            assert isinstance(layer, dict), "Error: the net_arch list can only contain ints and dicts"
            if 'pi' in layer:
                assert isinstance(layer['pi'], list), "Error: net_arch[-1]['pi'] must contain a list of integers."
                policy_only_layers = layer['pi']

            if 'vf' in layer:
                assert isinstance(layer['vf'], list), "Error: net_arch[-1]['vf'] must contain a list of integers."
                value_only_layers = layer['vf']
            break  # From here on the network splits up in policy and value network

    # Build the non-shared part of the network
    latent_policy = latent
    latent_value = latent
    for idx, (pi_layer_size, vf_layer_size) in enumerate(zip_longest(policy_only_layers, value_only_layers)):
        if pi_layer_size is not None:
            assert isinstance(pi_layer_size, int), "Error: net_arch[-1]['pi'] must only contain integers."
#             latent_policy = act_fun(linear(latent_policy, "pi_fc{}".format(idx), pi_layer_size, init_scale=np.sqrt(2)))
            latent_policy = act_fun(tfp.layers.DenseVariational(pi_layer_size, posterior_mean_field, prior_trainable, kl_weight=1)(latent_policy))

        if vf_layer_size is not None:
            assert isinstance(vf_layer_size, int), "Error: net_arch[-1]['vf'] must only contain integers."
#             latent_value = act_fun(linear(latent_value, "vf_fc{}".format(idx), vf_layer_size, init_scale=np.sqrt(2)))
            latent_value = act_fun(tfp.layers.DenseVariational(vf_layer_size, posterior_mean_field, prior_trainable, kl_weight=1)(latent_value))

    return latent_policy, latent_value
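`DenseVariational` regularizes each layer with the KL divergence between the surrogate posterior and the prior, scaled by `kl_weight`. For the factorized Gaussians defined above, this divergence also has a closed form, which the sketch below evaluates in plain Python. The function name `kl_diag_gaussians` and the example numbers are assumptions for illustration. Whether this term actually reaches the PPO1 loss depends on how the layer's losses are collected, which this note does not verify.

```python
import math


def kl_diag_gaussians(q_loc, q_scale, p_loc, p_scale):
    """KL(q || p) for factorized Gaussians, summed over all weights (in nats)."""
    total = 0.0
    for mq, sq, mp, sp in zip(q_loc, q_scale, p_loc, p_scale):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


# Identical posterior and prior give zero divergence.
print(kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # 0.0

# Shifting one posterior mean by 1 under a unit-variance prior adds 0.5 nats.
print(kl_diag_gaussians([1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # 0.5
```

In supervised settings `kl_weight` is often set to roughly one over the number of training examples. `bnn_extractor` uses `kl_weight=1`, so the right scaling for PPO minibatches remains an open choice here.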

In [160]:
class FeedForwardPolicy(ActorCriticPolicy):
    """
    Policy object that implements actor critic, using a feed forward neural network.
    :param sess: (TensorFlow session) The current TensorFlow session
    :param ob_space: (Gym Space) The observation space of the environment
    :param ac_space: (Gym Space) The action space of the environment
    :param n_env: (int) The number of environments to run
    :param n_steps: (int) The number of steps to run for each environment
    :param n_batch: (int) The number of batch to run (n_envs * n_steps)
    :param reuse: (bool) If the policy is reusable or not
    :param layers: ([int]) (deprecated, use net_arch instead) The size of the Neural network for the policy
        (if None, default to [64, 64])
    :param net_arch: (list) Specification of the actor-critic policy network architecture (see mlp_extractor
        documentation for details).
    :param act_fun: (tf.func) the activation function to use in the neural network.
    :param cnn_extractor: (function (TensorFlow Tensor, ``**kwargs``): (TensorFlow Tensor)) the CNN feature extraction
    :param feature_extraction: (str) The feature extraction type ("cnn" or "mlp")
    :param kwargs: (dict) Extra keyword arguments for the nature CNN feature extraction
    """

    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, layers=None, net_arch=None,
                 act_fun=tf.tanh, cnn_extractor=nature_cnn, feature_extraction="cnn", **kwargs):
        super(FeedForwardPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse,
                                                scale=(feature_extraction == "cnn"))

        self._kwargs_check(feature_extraction, kwargs)

        if layers is not None:
            warnings.warn("Usage of the `layers` parameter is deprecated! Use net_arch instead "
                          "(it has a different semantics though).", DeprecationWarning)
            if net_arch is not None:
                warnings.warn("The new `net_arch` parameter overrides the deprecated `layers` parameter!",
                              DeprecationWarning)

        if net_arch is None:
            if layers is None:
                layers = [64, 64]
            net_arch = [dict(vf=layers, pi=layers)]

        with tf.variable_scope("model", reuse=reuse):
            if feature_extraction == "cnn":
                pi_latent = vf_latent = cnn_extractor(self.processed_obs, **kwargs)
            elif feature_extraction == "bnn":
                pi_latent, vf_latent = bnn_extractor(tf.layers.flatten(self.processed_obs), net_arch, act_fun)
            else:
                pi_latent, vf_latent = mlp_extractor(tf.layers.flatten(self.processed_obs), net_arch, act_fun)

            self._value_fn = linear(vf_latent, 'vf', 1)

            self._proba_distribution, self._policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self.value_flat, {self.obs_ph: obs})

In [161]:
import warnings
from itertools import zip_longest
from abc import ABC, abstractmethod

import numpy as np
import tensorflow as tf
from gym.spaces import Discrete

from stable_baselines.common.tf_util import batch_to_seq, seq_to_batch
from stable_baselines.common.tf_layers import conv, linear, conv_to_fc, lstm
from stable_baselines.common.distributions import make_proba_dist_type, CategoricalProbabilityDistribution, \
    MultiCategoricalProbabilityDistribution, DiagGaussianProbabilityDistribution, BernoulliProbabilityDistribution
from stable_baselines.common.input import observation_input
from stable_baselines.common.policies import nature_cnn

In [162]:
class BnnPolicy(FeedForwardPolicy):
    """
    Policy object that implements actor critic, using a Bayesian neural net (2 layers of 64)
    :param sess: (TensorFlow session) The current TensorFlow session
    :param ob_space: (Gym Space) The observation space of the environment
    :param ac_space: (Gym Space) The action space of the environment
    :param n_env: (int) The number of environments to run
    :param n_steps: (int) The number of steps to run for each environment
    :param n_batch: (int) The number of batch to run (n_envs * n_steps)
    :param reuse: (bool) If the policy is reusable or not
    :param _kwargs: (dict) Extra keyword arguments for the nature CNN feature extraction
    """

    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs):
        super(BnnPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse,
                                        feature_extraction="bnn", **_kwargs)

# Single-Agent PPO with BNN

In [163]:
#!/usr/bin/env python3

# Train single CPU PPO1 on slimevolley.
# Should solve it (beat existing AI on average over 1000 trials) in 3 hours on single CPU, within 3M steps.

import os
import gym
import slimevolleygym
from slimevolleygym import SurvivalRewardEnv

from stable_baselines.ppo1 import PPO1
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import logger
from stable_baselines.common.callbacks import EvalCallback

NUM_TIMESTEPS = int(5e6)
SEED = 721
EVAL_FREQ = 250000
EVAL_EPISODES = 10  # was 1000
LOGDIR = "bnn_ppo1" # moved to zoo afterwards.

logger.configure(folder=LOGDIR)

env = gym.make("SlimeVolley-v0")
env.seed(SEED)

Logging to bnn_ppo1


[721]

In [None]:
# take mujoco hyperparams (but doubled timesteps_per_actorbatch to cover more steps.)
model = PPO1(BnnPolicy, env, timesteps_per_actorbatch=4096, clip_param=0.2, entcoeff=0.0, optim_epochs=10,
                 optim_stepsize=3e-4, optim_batchsize=64, gamma=0.99, lam=0.95, schedule='linear', verbose=2)

eval_callback = EvalCallback(env, best_model_save_path=LOGDIR, log_path=LOGDIR, eval_freq=EVAL_FREQ, n_eval_episodes=EVAL_EPISODES)

model.learn(total_timesteps=NUM_TIMESTEPS, callback=eval_callback)

model.save(os.path.join(LOGDIR, "final_model")) # probably never get to this point.

env.close()

********** Iteration 0 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
     -0.00074 |       0.00000 |       1.01718 |       0.00014 |       2.07934
     3.04e-05 |       0.00000 |       0.88093 |       0.00028 |       2.07920
     -0.00026 |       0.00000 |       0.87572 |       0.00045 |       2.07903
      0.00029 |       0.00000 |       0.76330 |       0.00054 |       2.07893
     -0.00033 |       0.00000 |       0.73383 |       0.00065 |       2.07883
     -0.00106 |       0.00000 |       0.60832 |       0.00084 |       2.07862
     -0.00165 |       0.00000 |       0.54448 |       0.00106 |       2.07842
      0.00029 |       0.00000 |       0.49235 |       0.00124 |       2.07823
     -0.00087 |       0.00000 |       0.48766 |       0.00134 |       2.07813
      0.00092 |       0.00000 |       0.43227 |       0.00147 |       2.07802
Evaluating losses...
     -0.00190 |       0.00000 |       0.41145 |       0.00165 |       2.07782
-----------------------------

      0.00083 |       0.00000 |       0.05020 |       0.00456 |       2.07654
     -0.00032 |       0.00000 |       0.05025 |       0.00439 |       2.07652
     -0.00081 |       0.00000 |       0.04994 |       0.00462 |       2.07635
Evaluating losses...
      0.00092 |       0.00000 |       0.05043 |       0.00448 |       2.07632
----------------------------------
| EpLenMean       | 564          |
| EpRewMean       | -4.93        |
| EpThisIter      | 7            |
| EpisodesSoFar   | 43           |
| TimeElapsed     | 38.9         |
| TimestepsSoFar  | 24576        |
| ev_tdlam_before | -0.0109      |
| loss_ent        | 2.0763159    |
| loss_kl         | 0.0044785896 |
| loss_pol_entpen | 0.0          |
| loss_pol_surr   | 0.0009182282 |
| loss_vf_loss    | 0.05043229   |
----------------------------------
********** Iteration 6 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00154 |       0.00000 |       0.05140 |  

********** Iteration 11 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
    -2.73e-05 |       0.00000 |       0.05114 |       0.00394 |       2.07346
     1.51e-05 |       0.00000 |       0.05079 |       0.00401 |       2.07348
      0.00199 |       0.00000 |       0.05112 |       0.00391 |       2.07330
     -0.00140 |       0.00000 |       0.05097 |       0.00395 |       2.07327
     -0.00157 |       0.00000 |       0.05087 |       0.00435 |       2.07307
     -0.00209 |       0.00000 |       0.05091 |       0.00419 |       2.07221
     -0.00035 |       0.00000 |       0.05038 |       0.00442 |       2.07200
    -3.37e-05 |       0.00000 |       0.05076 |       0.00467 |       2.07215
      0.00111 |       0.00000 |       0.05128 |       0.00460 |       2.07206
      0.00123 |       0.00000 |       0.05153 |       0.00443 |       2.07171
Evaluating losses...
      0.00044 |       0.00000 |       0.05117 |       0.00454 |       

      0.00290 |       0.00000 |       0.04766 |       0.00470 |       2.07566
      0.00268 |       0.00000 |       0.04717 |       0.00439 |       2.07591
     -0.00097 |       0.00000 |       0.04651 |       0.00482 |       2.07619
      0.00124 |       0.00000 |       0.04655 |       0.00463 |       2.07629
Evaluating losses...
      0.00115 |       0.00000 |       0.04779 |       0.00462 |       2.07606
----------------------------------
| EpLenMean       | 604          |
| EpRewMean       | -4.88        |
| EpThisIter      | 6            |
| EpisodesSoFar   | 116          |
| TimeElapsed     | 111          |
| TimestepsSoFar  | 69632        |
| ev_tdlam_before | -0.00448     |
| loss_ent        | 2.076061     |
| loss_kl         | 0.0046223225 |
| loss_pol_entpen | 0.0          |
| loss_pol_surr   | 0.0011482688 |
| loss_vf_loss    | 0.04778724   |
----------------------------------
********** Iteration 17 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss | 

********** Iteration 22 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00036 |       0.00000 |       0.04540 |       0.00453 |       2.07570
     -0.00065 |       0.00000 |       0.04598 |       0.00451 |       2.07557
      0.00215 |       0.00000 |       0.04563 |       0.00479 |       2.07505
      0.00151 |       0.00000 |       0.04553 |       0.00450 |       2.07520
     -0.00074 |       0.00000 |       0.04614 |       0.00453 |       2.07520
     -0.00155 |       0.00000 |       0.04539 |       0.00477 |       2.07464
     -0.00040 |       0.00000 |       0.04627 |       0.00469 |       2.07481
      0.00080 |       0.00000 |       0.04532 |       0.00505 |       2.07464
     2.44e-05 |       0.00000 |       0.04540 |       0.00507 |       2.07394
      0.00055 |       0.00000 |       0.04542 |       0.00494 |       2.07365
Evaluating losses...
      0.00094 |       0.00000 |       0.04543 |       0.00500 |       

      0.00102 |       0.00000 |       0.04525 |       0.00396 |       2.07366
      0.00028 |       0.00000 |       0.04499 |       0.00397 |       2.07354
Evaluating losses...
      0.00148 |       0.00000 |       0.04462 |       0.00364 |       2.07369
----------------------------------
| EpLenMean       | 611          |
| EpRewMean       | -4.9         |
| EpThisIter      | 6            |
| EpisodesSoFar   | 190          |
| TimeElapsed     | 183          |
| TimestepsSoFar  | 114688       |
| ev_tdlam_before | -0.0127      |
| loss_ent        | 2.0736928    |
| loss_kl         | 0.0036431474 |
| loss_pol_entpen | 0.0          |
| loss_pol_surr   | 0.0014762175 |
| loss_vf_loss    | 0.044619907  |
----------------------------------
********** Iteration 28 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00220 |       0.00000 |       0.04338 |       0.00411 |       2.07352
      0.00042 |       0.00000 |       0.04284 | 

********** Iteration 33 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00183 |       0.00000 |       0.04816 |       0.00428 |       2.07414
      0.00070 |       0.00000 |       0.04851 |       0.00455 |       2.07419
      0.00067 |       0.00000 |       0.04827 |       0.00479 |       2.07358
     -0.00165 |       0.00000 |       0.04794 |       0.00423 |       2.07378
      0.00262 |       0.00000 |       0.04754 |       0.00453 |       2.07382
     -0.00190 |       0.00000 |       0.04786 |       0.00423 |       2.07383
      0.00059 |       0.00000 |       0.04770 |       0.00437 |       2.07409
     -0.00308 |       0.00000 |       0.04740 |       0.00435 |       2.07340
      0.00166 |       0.00000 |       0.04745 |       0.00453 |       2.07363
     -0.00095 |       0.00000 |       0.04845 |       0.00426 |       2.07320
Evaluating losses...
     -0.00112 |       0.00000 |       0.04818 |       0.00475 |       

     -0.00075 |       0.00000 |       0.04497 |       0.00400 |       2.07032
     -0.00142 |       0.00000 |       0.04440 |       0.00403 |       2.07015
      0.00188 |       0.00000 |       0.04511 |       0.00407 |       2.07068
Evaluating losses...
     -0.00192 |       0.00000 |       0.04494 |       0.00401 |       2.06991
-----------------------------------
| EpLenMean       | 608           |
| EpRewMean       | -4.85         |
| EpThisIter      | 7             |
| EpisodesSoFar   | 263           |
| TimeElapsed     | 253           |
| TimestepsSoFar  | 159744        |
| ev_tdlam_before | 0.078         |
| loss_ent        | 2.069914      |
| loss_kl         | 0.004014659   |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | -0.0019240864 |
| loss_vf_loss    | 0.04494209    |
-----------------------------------
********** Iteration 39 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
     -0.00121 |       0.00000 |   

********** Iteration 44 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00067 |       0.00000 |       0.03610 |       0.00402 |       2.06791
      0.00132 |       0.00000 |       0.03616 |       0.00414 |       2.06811
     -0.00020 |       0.00000 |       0.03556 |       0.00406 |       2.06811
      0.00161 |       0.00000 |       0.03512 |       0.00403 |       2.06733
      0.00022 |       0.00000 |       0.03599 |       0.00406 |       2.06765
     9.90e-05 |       0.00000 |       0.03546 |       0.00433 |       2.06731
     -0.00106 |       0.00000 |       0.03625 |       0.00406 |       2.06790
     -0.00069 |       0.00000 |       0.03474 |       0.00434 |       2.06747
      0.00169 |       0.00000 |       0.03493 |       0.00441 |       2.06805
     -0.00230 |       0.00000 |       0.03498 |       0.00404 |       2.06706
Evaluating losses...
      0.00052 |       0.00000 |       0.03504 |       0.00429 |       

     -0.00168 |       0.00000 |       0.02310 |       0.00477 |       2.05857
      0.00067 |       0.00000 |       0.02163 |       0.00478 |       2.05827
     -0.00066 |       0.00000 |       0.02273 |       0.00471 |       2.05722
Evaluating losses...
      0.00013 |       0.00000 |       0.02215 |       0.00481 |       2.05786
-----------------------------------
| EpLenMean       | 601           |
| EpRewMean       | -4.85         |
| EpThisIter      | 7             |
| EpisodesSoFar   | 339           |
| TimeElapsed     | 323           |
| TimestepsSoFar  | 204800        |
| ev_tdlam_before | 0.714         |
| loss_ent        | 2.057865      |
| loss_kl         | 0.004809305   |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | 0.00012547546 |
| loss_vf_loss    | 0.022150118   |
-----------------------------------
********** Iteration 50 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00170 |       0.00000 |   

********** Iteration 55 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
    -2.67e-05 |       0.00000 |       0.02495 |       0.00397 |       2.04569
     -0.00140 |       0.00000 |       0.02495 |       0.00406 |       2.04480
      0.00021 |       0.00000 |       0.02537 |       0.00398 |       2.04469
     -0.00012 |       0.00000 |       0.02528 |       0.00432 |       2.04349
      0.00020 |       0.00000 |       0.02475 |       0.00437 |       2.04322
     -0.00112 |       0.00000 |       0.02433 |       0.00446 |       2.04430
      0.00026 |       0.00000 |       0.02463 |       0.00413 |       2.04310
      0.00077 |       0.00000 |       0.02532 |       0.00435 |       2.04417
      0.00197 |       0.00000 |       0.02470 |       0.00441 |       2.04283
     -0.00039 |       0.00000 |       0.02487 |       0.00449 |       2.04236
Evaluating losses...
     9.64e-05 |       0.00000 |       0.02504 |       0.00452 |       

      0.00151 |       0.00000 |       0.03200 |       0.00390 |       2.04837
      0.00072 |       0.00000 |       0.03238 |       0.00396 |       2.04850
     2.21e-05 |       0.00000 |       0.03213 |       0.00393 |       2.04846
Evaluating losses...
     -0.00095 |       0.00000 |       0.03121 |       0.00407 |       2.04843
-----------------------------------
| EpLenMean       | 607           |
| EpRewMean       | -4.86         |
| EpThisIter      | 5             |
| EpisodesSoFar   | 412           |
| TimeElapsed     | 396           |
| TimestepsSoFar  | 249856        |
| ev_tdlam_before | 0.667         |
| loss_ent        | 2.048428      |
| loss_kl         | 0.004071019   |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | -0.0009456637 |
| loss_vf_loss    | 0.031209854   |
-----------------------------------
********** Iteration 61 ************
Eval num_timesteps=249856, episode_reward=-5.00 +/- 0.00
Episode length: 647.20 +/- 88.13
New best mean reward!
Optimizing...


********** Iteration 66 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
     -0.00021 |       0.00000 |       0.02738 |       0.00443 |       2.03635
      0.00067 |       0.00000 |       0.02811 |       0.00412 |       2.03679
      0.00251 |       0.00000 |       0.02827 |       0.00410 |       2.03658
     -0.00199 |       0.00000 |       0.02738 |       0.00428 |       2.03716
      0.00286 |       0.00000 |       0.02820 |       0.00422 |       2.03738
     -0.00050 |       0.00000 |       0.02814 |       0.00430 |       2.03709
      0.00041 |       0.00000 |       0.02817 |       0.00447 |       2.03769
      0.00171 |       0.00000 |       0.02772 |       0.00424 |       2.03805
      0.00222 |       0.00000 |       0.02841 |       0.00420 |       2.03729
      0.00227 |       0.00000 |       0.02786 |       0.00439 |       2.03842
Evaluating losses...
     -0.00036 |       0.00000 |       0.02797 |       0.00409 |       

    -8.33e-05 |       0.00000 |       0.02581 |       0.00419 |       2.02504
      0.00280 |       0.00000 |       0.02646 |       0.00437 |       2.02574
     -0.00056 |       0.00000 |       0.02543 |       0.00462 |       2.02557
Evaluating losses...
     -0.00128 |       0.00000 |       0.02534 |       0.00418 |       2.02680
-----------------------------------
| EpLenMean       | 646           |
| EpRewMean       | -4.79         |
| EpThisIter      | 6             |
| EpisodesSoFar   | 482           |
| TimeElapsed     | 473           |
| TimestepsSoFar  | 294912        |
| ev_tdlam_before | 0.736         |
| loss_ent        | 2.0268033     |
| loss_kl         | 0.004178507   |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | -0.0012822751 |
| loss_vf_loss    | 0.02534422    |
-----------------------------------
********** Iteration 72 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00015 |       0.00000 |   

********** Iteration 77 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00041 |       0.00000 |       0.03731 |       0.00433 |       2.01733
      0.00137 |       0.00000 |       0.03737 |       0.00433 |       2.01742
     -0.00079 |       0.00000 |       0.03716 |       0.00436 |       2.01846
      0.00143 |       0.00000 |       0.03747 |       0.00439 |       2.01795
      0.00072 |       0.00000 |       0.03658 |       0.00428 |       2.01765
     -0.00056 |       0.00000 |       0.03615 |       0.00458 |       2.01865
     -0.00063 |       0.00000 |       0.03672 |       0.00460 |       2.01979
      0.00079 |       0.00000 |       0.03673 |       0.00441 |       2.02095
     -0.00343 |       0.00000 |       0.03659 |       0.00444 |       2.02020
      0.00024 |       0.00000 |       0.03635 |       0.00440 |       2.02088
Evaluating losses...
      0.00128 |       0.00000 |       0.03691 |       0.00429 |       

      0.00112 |       0.00000 |       0.02375 |       0.00401 |       2.00988
     2.34e-05 |       0.00000 |       0.02411 |       0.00395 |       2.00785
      0.00226 |       0.00000 |       0.02486 |       0.00387 |       2.01045
Evaluating losses...
     -0.00079 |       0.00000 |       0.02451 |       0.00396 |       2.01058
------------------------------------
| EpLenMean       | 638            |
| EpRewMean       | -4.81          |
| EpThisIter      | 6              |
| EpisodesSoFar   | 553            |
| TimeElapsed     | 544            |
| TimestepsSoFar  | 339968         |
| ev_tdlam_before | 0.767          |
| loss_ent        | 2.0105813      |
| loss_kl         | 0.0039596464   |
| loss_pol_entpen | 0.0            |
| loss_pol_surr   | -0.00078561105 |
| loss_vf_loss    | 0.024507962    |
------------------------------------
********** Iteration 83 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
     -0.00130 |     

********** Iteration 88 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00043 |       0.00000 |       0.02501 |       0.00352 |       2.00558
      0.00084 |       0.00000 |       0.02555 |       0.00364 |       2.00630
      0.00185 |       0.00000 |       0.02541 |       0.00382 |       2.00744
     -0.00052 |       0.00000 |       0.02565 |       0.00382 |       2.00676
      0.00079 |       0.00000 |       0.02498 |       0.00363 |       2.00904
     -0.00129 |       0.00000 |       0.02550 |       0.00390 |       2.00819
      0.00262 |       0.00000 |       0.02477 |       0.00371 |       2.00903
     -0.00149 |       0.00000 |       0.02505 |       0.00386 |       2.00907
      0.00087 |       0.00000 |       0.02462 |       0.00383 |       2.00927
      0.00121 |       0.00000 |       0.02531 |       0.00390 |       2.01098
Evaluating losses...
     -0.00129 |       0.00000 |       0.02475 |       0.00371 |
...
      0.00170 |       0.00000 |       0.02574 |       0.00388 |       2.00603
      0.00090 |       0.00000 |       0.02616 |       0.00386 |       2.00857
     -0.00105 |       0.00000 |       0.02573 |       0.00395 |       2.00636
Evaluating losses...
      0.00121 |       0.00000 |       0.02554 |       0.00406 |       2.00832
----------------------------------
| EpLenMean       | 619          |
| EpRewMean       | -4.86        |
| EpThisIter      | 6            |
| EpisodesSoFar   | 626          |
| TimeElapsed     | 615          |
| TimestepsSoFar  | 385024       |
| ev_tdlam_before | 0.713        |
| loss_ent        | 2.008319     |
| loss_kl         | 0.004055469  |
| loss_pol_entpen | 0.0          |
| loss_pol_surr   | 0.0012132622 |
| loss_vf_loss    | 0.025543312  |
----------------------------------
********** Iteration 94 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00072 |       0.00000 |       0.02505 |
...
********** Iteration 99 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00024 |       0.00000 |       0.02129 |       0.00348 |       2.02135
     -0.00071 |       0.00000 |       0.02052 |       0.00364 |       2.02248
      0.00241 |       0.00000 |       0.02080 |       0.00366 |       2.02260
      0.00032 |       0.00000 |       0.02116 |       0.00390 |       2.02344
     -0.00019 |       0.00000 |       0.02077 |       0.00369 |       2.02450
     -0.00045 |       0.00000 |       0.02076 |       0.00376 |       2.02591
      0.00166 |       0.00000 |       0.02037 |       0.00351 |       2.02301
    -3.64e-05 |       0.00000 |       0.02130 |       0.00353 |       2.02558
     -0.00048 |       0.00000 |       0.02100 |       0.00358 |       2.02544
     -0.00063 |       0.00000 |       0.02090 |       0.00364 |       2.02632
Evaluating losses...
     -0.00098 |       0.00000 |       0.02053 |       0.00392 |
...
      0.00014 |       0.00000 |       0.02132 |       0.00359 |       2.01092
      0.00027 |       0.00000 |       0.02109 |       0.00376 |       2.01004
     -0.00226 |       0.00000 |       0.02178 |       0.00378 |       2.01167
Evaluating losses...
     -0.00020 |       0.00000 |       0.02125 |       0.00384 |       2.01149
------------------------------------
| EpLenMean       | 612            |
| EpRewMean       | -4.9           |
| EpThisIter      | 7              |
| EpisodesSoFar   | 700            |
| TimeElapsed     | 686            |
| TimestepsSoFar  | 430080         |
| ev_tdlam_before | 0.791          |
| loss_ent        | 2.0114899      |
| loss_kl         | 0.0038389058   |
| loss_pol_entpen | 0.0            |
| loss_pol_surr   | -0.00020430377 |
| loss_vf_loss    | 0.021253424    |
------------------------------------
********** Iteration 105 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00062 |
...
********** Iteration 110 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00283 |       0.00000 |       0.02463 |       0.00379 |       1.99925
     -0.00031 |       0.00000 |       0.02484 |       0.00392 |       1.99998
     5.60e-05 |       0.00000 |       0.02426 |       0.00394 |       1.99984
      0.00456 |       0.00000 |       0.02432 |       0.00416 |       2.00010
    -6.19e-05 |       0.00000 |       0.02379 |       0.00379 |       1.99917
     -0.00105 |       0.00000 |       0.02404 |       0.00393 |       2.00044
     -0.00092 |       0.00000 |       0.02396 |       0.00378 |       1.99983
     -0.00230 |       0.00000 |       0.02456 |       0.00394 |       2.00042
      0.00213 |       0.00000 |       0.02352 |       0.00413 |       2.00135
     -0.00158 |       0.00000 |       0.02389 |       0.00414 |       2.00106
Evaluating losses...
     -0.00121 |       0.00000 |       0.02413 |       0.00404 |
...
     -0.00078 |       0.00000 |       0.02611 |       0.00395 |       2.00533
     3.62e-06 |       0.00000 |       0.02597 |       0.00378 |       2.00551
     -0.00112 |       0.00000 |       0.02585 |       0.00407 |       2.00555
Evaluating losses...
     4.43e-05 |       0.00000 |       0.02553 |       0.00400 |       2.00533
-----------------------------------
| EpLenMean       | 634           |
| EpRewMean       | -4.82         |
| EpThisIter      | 6             |
| EpisodesSoFar   | 771           |
| TimeElapsed     | 755           |
| TimestepsSoFar  | 475136        |
| ev_tdlam_before | 0.777         |
| loss_ent        | 2.0053287     |
| loss_kl         | 0.0040001743  |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | 4.4297893e-05 |
| loss_vf_loss    | 0.025527893   |
-----------------------------------
********** Iteration 116 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00374 |       0.00000 |
...
********** Iteration 121 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00087 |       0.00000 |       0.02530 |       0.00377 |       2.00857
      0.00114 |       0.00000 |       0.02447 |       0.00370 |       2.00793
      0.00315 |       0.00000 |       0.02432 |       0.00401 |       2.00829
      0.00054 |       0.00000 |       0.02416 |       0.00378 |       2.00912
      0.00124 |       0.00000 |       0.02436 |       0.00366 |       2.00967
      0.00061 |       0.00000 |       0.02450 |       0.00382 |       2.00900
      0.00097 |       0.00000 |       0.02434 |       0.00366 |       2.00898
      0.00060 |       0.00000 |       0.02410 |       0.00379 |       2.00878
      0.00117 |       0.00000 |       0.02474 |       0.00384 |       2.00915
      0.00114 |       0.00000 |       0.02418 |       0.00376 |       2.00938
Evaluating losses...
     -0.00038 |       0.00000 |       0.02409 |       0.00378 |
...
      0.00279 |       0.00000 |       0.02564 |       0.00368 |       2.00309
      0.00054 |       0.00000 |       0.02573 |       0.00372 |       2.00306
     -0.00198 |       0.00000 |       0.02551 |       0.00408 |       2.00154
      0.00143 |       0.00000 |       0.02489 |       0.00399 |       2.00285
Evaluating losses...
     -0.00061 |       0.00000 |       0.02547 |       0.00390 |       2.00377
-----------------------------------
| EpLenMean       | 636           |
| EpRewMean       | -4.82         |
| EpThisIter      | 7             |
| EpisodesSoFar   | 841           |
| TimeElapsed     | 831           |
| TimestepsSoFar  | 520192        |
| ev_tdlam_before | 0.765         |
| loss_ent        | 2.0037653     |
| loss_kl         | 0.0038961964  |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | -0.0006074859 |
| loss_vf_loss    | 0.025471402   |
-----------------------------------
********** Iteration 127 ************
Optimizing...
     pol_surr |    pol_entpen |
...
********** Iteration 132 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00080 |       0.00000 |       0.01960 |       0.00356 |       2.00208
      0.00131 |       0.00000 |       0.01957 |       0.00360 |       2.00281
      0.00056 |       0.00000 |       0.01984 |       0.00349 |       2.00250
      0.00193 |       0.00000 |       0.01884 |       0.00336 |       2.00341
      0.00146 |       0.00000 |       0.01945 |       0.00350 |       2.00242
      0.00488 |       0.00000 |       0.01928 |       0.00331 |       2.00092
      0.00272 |       0.00000 |       0.01966 |       0.00345 |       2.00133
      0.00028 |       0.00000 |       0.01944 |       0.00328 |       2.00145
     -0.00071 |       0.00000 |       0.01931 |       0.00344 |       2.00078
      0.00143 |       0.00000 |       0.01920 |       0.00369 |       2.00014
Evaluating losses...
      0.00035 |       0.00000 |       0.01903 |       0.00344 |
...
      0.00101 |       0.00000 |       0.02127 |       0.00364 |       2.00549
     -0.00156 |       0.00000 |       0.02125 |       0.00365 |       2.00776
     -0.00146 |       0.00000 |       0.02166 |       0.00375 |       2.00756
Evaluating losses...
     -0.00297 |       0.00000 |       0.02137 |       0.00352 |       2.00751
-----------------------------------
| EpLenMean       | 605           |
| EpRewMean       | -4.86         |
| EpThisIter      | 5             |
| EpisodesSoFar   | 915           |
| TimeElapsed     | 901           |
| TimestepsSoFar  | 565248        |
| ev_tdlam_before | 0.778         |
| loss_ent        | 2.0075085     |
| loss_kl         | 0.0035223917  |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | -0.0029667842 |
| loss_vf_loss    | 0.02137308    |
-----------------------------------
********** Iteration 138 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00366 |       0.00000 |
...
********** Iteration 143 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00150 |       0.00000 |       0.02146 |       0.00321 |       2.01379
     -0.00061 |       0.00000 |       0.02088 |       0.00324 |       2.01418
     -0.00062 |       0.00000 |       0.02098 |       0.00348 |       2.01341
      0.00099 |       0.00000 |       0.02142 |       0.00333 |       2.01555
      0.00162 |       0.00000 |       0.02091 |       0.00347 |       2.01398
      0.00079 |       0.00000 |       0.02095 |       0.00358 |       2.01557
      0.00110 |       0.00000 |       0.02111 |       0.00332 |       2.01629
     -0.00192 |       0.00000 |       0.02172 |       0.00341 |       2.01694
      0.00028 |       0.00000 |       0.02068 |       0.00343 |       2.01565
     -0.00150 |       0.00000 |       0.02179 |       0.00362 |       2.01562
Evaluating losses...
     -0.00134 |       0.00000 |       0.02095 |       0.00363 |
...
     -0.00070 |       0.00000 |       0.02292 |       0.00345 |       2.01216
      0.00191 |       0.00000 |       0.02264 |       0.00339 |       2.01291
      0.00192 |       0.00000 |       0.02285 |       0.00315 |       2.01261
Evaluating losses...
     -0.00143 |       0.00000 |       0.02233 |       0.00317 |       2.01226
-----------------------------------
| EpLenMean       | 625           |
| EpRewMean       | -4.88         |
| EpThisIter      | 7             |
| EpisodesSoFar   | 987           |
| TimeElapsed     | 970           |
| TimestepsSoFar  | 610304        |
| ev_tdlam_before | 0.772         |
| loss_ent        | 2.0122554     |
| loss_kl         | 0.0031660174  |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | -0.0014332605 |
| loss_vf_loss    | 0.02232574    |
-----------------------------------
********** Iteration 149 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00152 |       0.00000 |
...
********** Iteration 154 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
     -0.00109 |       0.00000 |       0.01692 |       0.00329 |       2.01572
      0.00225 |       0.00000 |       0.01710 |       0.00355 |       2.01494
     -0.00052 |       0.00000 |       0.01671 |       0.00338 |       2.01499
     -0.00146 |       0.00000 |       0.01686 |       0.00325 |       2.01467
      0.00025 |       0.00000 |       0.01715 |       0.00306 |       2.01642
     6.71e-05 |       0.00000 |       0.01733 |       0.00329 |       2.01576
      0.00116 |       0.00000 |       0.01719 |       0.00321 |       2.01613
      0.00093 |       0.00000 |       0.01720 |       0.00347 |       2.01496
      0.00111 |       0.00000 |       0.01696 |       0.00317 |       2.01609
     -0.00129 |       0.00000 |       0.01696 |       0.00315 |       2.01466
Evaluating losses...
     -0.00034 |       0.00000 |       0.01702 |       0.00328 |
...
      0.00143 |       0.00000 |       0.02179 |       0.00332 |       2.01219
     2.84e-05 |       0.00000 |       0.02161 |       0.00335 |       2.01185
    -2.50e-05 |       0.00000 |       0.02155 |       0.00320 |       2.01288
Evaluating losses...
      0.00018 |       0.00000 |       0.02189 |       0.00311 |       2.01272
-----------------------------------
| EpLenMean       | 643           |
| EpRewMean       | -4.8          |
| EpThisIter      | 7             |
| EpisodesSoFar   | 1056          |
| TimeElapsed     | 1.04e+03      |
| TimestepsSoFar  | 655360        |
| ev_tdlam_before | 0.814         |
| loss_ent        | 2.0127237     |
| loss_kl         | 0.0031119671  |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | 0.00017723814 |
| loss_vf_loss    | 0.021885067   |
-----------------------------------
********** Iteration 160 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00025 |       0.00000 |
...
********** Iteration 165 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00091 |       0.00000 |       0.02693 |       0.00343 |       2.01920
      0.00099 |       0.00000 |       0.02679 |       0.00333 |       2.01990
     -0.00043 |       0.00000 |       0.02691 |       0.00338 |       2.01904
      0.00145 |       0.00000 |       0.02611 |       0.00338 |       2.01931
     -0.00016 |       0.00000 |       0.02645 |       0.00330 |       2.01996
     -0.00175 |       0.00000 |       0.02666 |       0.00343 |       2.02042
     -0.00183 |       0.00000 |       0.02657 |       0.00350 |       2.02162
     -0.00090 |       0.00000 |       0.02691 |       0.00361 |       2.02169
     -0.00233 |       0.00000 |       0.02621 |       0.00363 |       2.02195
     -0.00065 |       0.00000 |       0.02607 |       0.00395 |       2.02238
Evaluating losses...
     -0.00018 |       0.00000 |       0.02657 |       0.00386 |
...
     -0.00192 |       0.00000 |       0.02344 |       0.00373 |       2.01237
      0.00115 |       0.00000 |       0.02325 |       0.00397 |       2.01208
     -0.00305 |       0.00000 |       0.02301 |       0.00370 |       2.01219
Evaluating losses...
      0.00059 |       0.00000 |       0.02332 |       0.00422 |       2.01136
----------------------------------
| EpLenMean       | 614          |
| EpRewMean       | -4.86        |
| EpThisIter      | 7            |
| EpisodesSoFar   | 1131         |
| TimeElapsed     | 1.11e+03     |
| TimestepsSoFar  | 700416       |
| ev_tdlam_before | 0.776        |
| loss_ent        | 2.0113585    |
| loss_kl         | 0.0042213043 |
| loss_pol_entpen | 0.0          |
| loss_pol_surr   | 0.0005923668 |
| loss_vf_loss    | 0.023322314  |
----------------------------------
********** Iteration 171 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00087 |       0.00000 |       0.02101 |
...
********** Iteration 176 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00039 |       0.00000 |       0.02059 |       0.00379 |       2.00181
      0.00039 |       0.00000 |       0.02068 |       0.00381 |       2.00235
      0.00030 |       0.00000 |       0.02065 |       0.00379 |       2.00332
      0.00326 |       0.00000 |       0.02042 |       0.00346 |       2.00243
     -0.00093 |       0.00000 |       0.02038 |       0.00350 |       2.00348
      0.00106 |       0.00000 |       0.02033 |       0.00367 |       2.00443
      0.00158 |       0.00000 |       0.02049 |       0.00362 |       2.00423
     -0.00113 |       0.00000 |       0.02075 |       0.00386 |       2.00348
     -0.00075 |       0.00000 |       0.02002 |       0.00371 |       2.00144
      0.00070 |       0.00000 |       0.02030 |       0.00384 |       2.00468
Evaluating losses...
      0.00137 |       0.00000 |       0.02060 |       0.00346 |
...
      0.00201 |       0.00000 |       0.01920 |       0.00388 |       2.00168
      0.00193 |       0.00000 |       0.01902 |       0.00341 |       2.00309
      0.00087 |       0.00000 |       0.01925 |       0.00379 |       2.00068
Evaluating losses...
      0.00153 |       0.00000 |       0.01880 |       0.00360 |       2.00352
----------------------------------
| EpLenMean       | 610          |
| EpRewMean       | -4.92        |
| EpThisIter      | 6            |
| EpisodesSoFar   | 1204         |
| TimeElapsed     | 1.18e+03     |
| TimestepsSoFar  | 745472       |
| ev_tdlam_before | 0.802        |
| loss_ent        | 2.0035229    |
| loss_kl         | 0.0036037965 |
| loss_pol_entpen | 0.0          |
| loss_pol_surr   | 0.0015287611 |
| loss_vf_loss    | 0.018799677  |
----------------------------------
********** Iteration 182 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00040 |       0.00000 |       0.01710 |
...
********** Iteration 187 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00066 |       0.00000 |       0.02067 |       0.00325 |       1.99904
     -0.00119 |       0.00000 |       0.02068 |       0.00319 |       2.00140
     -0.00057 |       0.00000 |       0.02071 |       0.00341 |       1.99857
     -0.00052 |       0.00000 |       0.02019 |       0.00359 |       1.99956
     -0.00045 |       0.00000 |       0.02032 |       0.00344 |       1.99838
     -0.00247 |       0.00000 |       0.02046 |       0.00362 |       1.99908
     -0.00153 |       0.00000 |       0.02050 |       0.00360 |       1.99786
      0.00049 |       0.00000 |       0.02060 |       0.00366 |       1.99869
      0.00139 |       0.00000 |       0.02045 |       0.00387 |       1.99750
      0.00034 |       0.00000 |       0.02078 |       0.00355 |       1.99763
Evaluating losses...
      0.00114 |       0.00000 |       0.02024 |       0.00371 |
...
      0.00088 |       0.00000 |       0.01868 |       0.00284 |       2.00920
      0.00131 |       0.00000 |       0.01872 |       0.00280 |       2.00863
     -0.00234 |       0.00000 |       0.01883 |       0.00318 |       2.00971
Evaluating losses...
     -0.00342 |       0.00000 |       0.01868 |       0.00275 |       2.01014
-----------------------------------
| EpLenMean       | 633           |
| EpRewMean       | -4.93         |
| EpThisIter      | 7             |
| EpisodesSoFar   | 1275          |
| TimeElapsed     | 1.25e+03      |
| TimestepsSoFar  | 790528        |
| ev_tdlam_before | 0.811         |
| loss_ent        | 2.0101404     |
| loss_kl         | 0.0027456852  |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | -0.0034231772 |
| loss_vf_loss    | 0.018675748   |
-----------------------------------
********** Iteration 193 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00090 |       0.00000 |
...
********** Iteration 198 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00150 |       0.00000 |       0.02641 |       0.00313 |       2.00374
     -0.00038 |       0.00000 |       0.02624 |       0.00325 |       2.00548
      0.00082 |       0.00000 |       0.02582 |       0.00332 |       2.00168
     5.87e-06 |       0.00000 |       0.02590 |       0.00334 |       2.00093
     -0.00031 |       0.00000 |       0.02582 |       0.00346 |       1.99846
      0.00039 |       0.00000 |       0.02615 |       0.00348 |       1.99924
     -0.00224 |       0.00000 |       0.02592 |       0.00343 |       1.99808
     -0.00038 |       0.00000 |       0.02579 |       0.00350 |       1.99921
     -0.00017 |       0.00000 |       0.02599 |       0.00359 |       1.99815
     -0.00032 |       0.00000 |       0.02603 |       0.00356 |       1.99580
Evaluating losses...
      0.00117 |       0.00000 |       0.02576 |       0.00355 |
...
      0.00290 |       0.00000 |       0.02183 |       0.00372 |       1.98632
     1.97e-05 |       0.00000 |       0.02176 |       0.00337 |       1.98581
Evaluating losses...
     -0.00112 |       0.00000 |       0.02170 |       0.00371 |       1.98554
-----------------------------------
| EpLenMean       | 656           |
| EpRewMean       | -4.82         |
| EpThisIter      | 6             |
| EpisodesSoFar   | 1343          |
| TimeElapsed     | 1.32e+03      |
| TimestepsSoFar  | 835584        |
| ev_tdlam_before | 0.803         |
| loss_ent        | 1.985536      |
| loss_kl         | 0.0037120457  |
| loss_pol_entpen | 0.0           |
| loss_pol_surr   | -0.0011218976 |
| loss_vf_loss    | 0.021702224   |
-----------------------------------
********** Iteration 204 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00160 |       0.00000 |       0.02288 |       0.00314 |       1.98673
      0.00054 |       0.00000 |
...
********** Iteration 209 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00085 |       0.00000 |       0.02676 |       0.00353 |       1.98773
      0.00217 |       0.00000 |       0.02630 |       0.00332 |       1.98699
      0.00155 |       0.00000 |       0.02658 |       0.00331 |       1.98655
     -0.00026 |       0.00000 |       0.02711 |       0.00351 |       1.98656
      0.00204 |       0.00000 |       0.02662 |       0.00320 |       1.98592
      0.00171 |       0.00000 |       0.02630 |       0.00337 |       1.98484
      0.00159 |       0.00000 |       0.02609 |       0.00347 |       1.98385
     3.47e-05 |       0.00000 |       0.02610 |       0.00324 |       1.98632
      0.00164 |       0.00000 |       0.02654 |       0.00322 |       1.98569
      0.00076 |       0.00000 |       0.02626 |       0.00338 |       1.98562
Evaluating losses...
     -0.00011 |       0.00000 |       0.02620 |       0.00325 |
...
     -0.00015 |       0.00000 |       0.02620 |       0.00331 |       1.98737
      0.00254 |       0.00000 |       0.02682 |       0.00336 |       1.98806
     -0.00146 |       0.00000 |       0.02615 |       0.00344 |       1.98701
Evaluating losses...
     -0.00035 |       0.00000 |       0.02605 |       0.00348 |       1.98635
----------------------------------
| EpLenMean       | 651          |
| EpRewMean       | -4.8         |
| EpThisIter      | 6            |
| EpisodesSoFar   | 1413         |
| TimeElapsed     | 1.39e+03     |
| TimestepsSoFar  | 880640       |
| ev_tdlam_before | 0.787        |
| loss_ent        | 1.986348     |
| loss_kl         | 0.0034811315 |
| loss_pol_entpen | 0.0          |
| loss_pol_surr   | -0.000353095 |
| loss_vf_loss    | 0.026046634  |
----------------------------------
********** Iteration 215 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00236 |       0.00000 |       0.01788 |
...
********** Iteration 220 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
     -0.00061 |       0.00000 |       0.01936 |       0.00319 |       1.98960
      0.00150 |       0.00000 |       0.01907 |       0.00298 |       1.98866
     4.96e-05 |       0.00000 |       0.01908 |       0.00331 |       1.98898
     -0.00174 |       0.00000 |       0.01905 |       0.00315 |       1.98779
     -0.00062 |       0.00000 |       0.01896 |       0.00316 |       1.98730
      0.00116 |       0.00000 |       0.01923 |       0.00333 |       1.98708
     -0.00011 |       0.00000 |       0.01901 |       0.00325 |       1.98759
      0.00079 |       0.00000 |       0.01940 |       0.00338 |       1.98752
     -0.00126 |       0.00000 |       0.01906 |       0.00337 |       1.98697
      0.00194 |       0.00000 |       0.01900 |       0.00364 |       1.98490
Evaluating losses...
      0.00136 |       0.00000 |       0.01934 |       0.00344 |
...
      0.00019 |       0.00000 |       0.02179 |       0.00314 |       1.97120
     -0.00171 |       0.00000 |       0.02189 |       0.00314 |       1.96865
      0.00070 |       0.00000 |       0.02197 |       0.00308 |       1.97060
Evaluating losses...
      0.00184 |       0.00000 |       0.02194 |       0.00356 |       1.96806
----------------------------------
| EpLenMean       | 636          |
| EpRewMean       | -4.88        |
| EpThisIter      | 7            |
| EpisodesSoFar   | 1484         |
| TimeElapsed     | 1.46e+03     |
| TimestepsSoFar  | 925696       |
| ev_tdlam_before | 0.789        |
| loss_ent        | 1.9680605    |
| loss_kl         | 0.0035593119 |
| loss_pol_entpen | 0.0          |
| loss_pol_surr   | 0.0018398701 |
| loss_vf_loss    | 0.021940231  |
----------------------------------
********** Iteration 226 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00047 |       0.00000 |       0.01687 |
...
********** Iteration 231 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
     -0.00068 |       0.00000 |       0.02358 |       0.00330 |       1.96781
     -0.00083 |       0.00000 |       0.02282 |       0.00319 |       1.96816
      0.00189 |       0.00000 |       0.02258 |       0.00325 |       1.96769
      0.00040 |       0.00000 |       0.02278 |       0.00328 |       1.96990
      0.00158 |       0.00000 |       0.02276 |       0.00311 |       1.96793
      0.00190 |       0.00000 |       0.02262 |       0.00292 |       1.96799
      0.00137 |       0.00000 |       0.02288 |       0.00307 |       1.96978
      0.00198 |       0.00000 |       0.02290 |       0.00279 |       1.96967
     -0.00139 |       0.00000 |       0.02308 |       0.00310 |       1.97034
      0.00089 |       0.00000 |       0.02258 |       0.00306 |       1.97139
Evaluating losses...
      0.00097 |       0.00000 |       0.02246 |       0.00291 |
...
      0.00073 |       0.00000 |       0.02083 |       0.00347 |       1.96104
      0.00041 |       0.00000 |       0.02104 |       0.00342 |       1.95947
      0.00048 |       0.00000 |       0.02091 |       0.00321 |       1.96128
Evaluating losses...
    -9.76e-05 |       0.00000 |       0.02082 |       0.00342 |       1.95804
------------------------------------
| EpLenMean       | 654            |
| EpRewMean       | -4.86          |
| EpThisIter      | 5              |
| EpisodesSoFar   | 1552           |
| TimeElapsed     | 1.53e+03       |
| TimestepsSoFar  | 970752         |
| ev_tdlam_before | 0.82           |
| loss_ent        | 1.9580418      |
| loss_kl         | 0.0034190398   |
| loss_pol_entpen | 0.0            |
| loss_pol_surr   | -9.7645854e-05 |
| loss_vf_loss    | 0.020819036    |
------------------------------------
********** Iteration 237 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00210 |
...
********** Iteration 242 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
      0.00109 |       0.00000 |       0.02100 |       0.00297 |       1.95934
     -0.00012 |       0.00000 |       0.02062 |       0.00307 |       1.95977
     -0.00160 |       0.00000 |       0.02054 |       0.00297 |       1.95861
      0.00066 |       0.00000 |       0.02054 |       0.00327 |       1.96099
     -0.00048 |       0.00000 |       0.02050 |       0.00283 |       1.96059
      0.00011 |       0.00000 |       0.02048 |       0.00291 |       1.96007
      0.00101 |       0.00000 |       0.02064 |       0.00301 |       1.96095
     -0.00028 |       0.00000 |       0.02081 |       0.00294 |       1.96214
      0.00155 |       0.00000 |       0.02054 |       0.00295 |       1.96058
      0.00116 |       0.00000 |       0.02053 |       0.00297 |       1.96076
Evaluating losses...
     -0.00190 |       0.00000 |       0.02023 |       0.00310 |      