# T81-558: Applications of Deep Neural Networks
**Module 12: Reinforcement Learning**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module Video Material

Main video lecture:

* Part 12.1: Introduction to the OpenAI Gym [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_reinforcement.ipynb)
* Part 12.2: Introduction to Q-Learning for Keras [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_reinforcement.ipynb)
* **Part 12.3: Keras Q-Learning in the OpenAI Gym** [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_reinforcement.ipynb)
* Part 12.4: Atari Games with Keras Neural Networks [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_reinforcement.ipynb)
* Part 12.5: How Alpha Zero used Reinforcement Learning to Master Chess [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_reinforcement.ipynb)


# Part 12.3: Keras Q-Learning in the OpenAI Gym

![Deep Q-Learning](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/deepqlearning.png "Reinforcement Learning")

* **CEMAgent**
    * **model** - The neural network that will be trained.
    * **nb_actions** - The number of actions the agent can take (e.g. up, down, left, right, fire)
    * **memory** - The EpisodeParameterMemory object to use. This object records the parameter set used in each episode together with that episode's total reward, so the agent can learn from past episodes later instead of having to gather fresh observations from the environment every time.
    * **batch_size** - The number of stored episodes sampled from memory for each update, similar in spirit to a deep learning batch size.
    * **nb_steps_warmup** - Number of steps to run before any learning occurs.
    * **train_interval** - How often the agent updates its parameters, measured in episodes.
    * **elite_frac** - The fraction of each sampled batch, ranked by total episode reward, that is kept as the "elite" set and used to update the agent.
* **CEMAgent.fit**
    * **env** - The OpenAI gym environment being used.
    * **nb_steps** - Number of training steps to be performed.
    * **visualize** - If `True`, the environment is rendered during training. Rendering is likely to slow training significantly, so it is intended as a debugging aid.
    * **verbose** - 0 for no logging, 1 for interval logging (compare `log_interval`), 2 for per-episode logging.
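
To make `batch_size` and `elite_frac` more concrete, here is a minimal NumPy sketch of the cross-entropy method on a toy objective. It is not the keras-rl implementation: the function `cem_optimize`, the objective, the `target` vector, and the small standard-deviation floor are all hypothetical choices made for illustration. The call uses `elite_frac=0.2` rather than the 0.05 used in this module so that the toy problem keeps 10 elite candidates per round.

```python
import numpy as np

def cem_optimize(score_fn, dim, batch_size=50, elite_frac=0.05,
                 iterations=40, seed=123):
    """Toy cross-entropy method; returns the final mean parameter vector."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    std = np.ones(dim)
    # Number of candidates kept per round (at least one).
    num_best = max(1, int(batch_size * elite_frac))
    for _ in range(iterations):
        # Sample a batch of candidate parameter vectors.
        candidates = rng.normal(mean, std, size=(batch_size, dim))
        scores = np.array([score_fn(c) for c in candidates])
        # Keep the highest-scoring candidates (the "elite" set).
        elite = candidates[np.argsort(scores)[-num_best:]]
        # Refit the sampling distribution to the elite set.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-3  # assumed floor so sampling never collapses
    return mean

# Hypothetical objective: higher is better, with its maximum at `target`.
target = np.array([1.0, -2.0, 0.5])
best = cem_optimize(lambda p: -np.sum((p - target) ** 2), dim=3, elite_frac=0.2)
print(best)
```

In the reinforcement learning setting, the score of a candidate would be the total reward of an episode played with those network weights, rather than a closed-form function.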




```python
import numpy as np
import gym

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten

from rl.agents.cem import CEMAgent
from rl.memory import EpisodeParameterMemory

ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)

nb_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]

# Option 1: simple model (a single dense layer followed by softmax)
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(nb_actions))
model.add(Activation('softmax'))

# Option 2: deep network
# model = Sequential()
# model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(nb_actions))
# model.add(Activation('softmax'))

print(model.summary())

# Configure and compile the agent. The memory keeps up to 1000 episodes.
# CEM is gradient-free, so compile() takes no optimizer here.
memory = EpisodeParameterMemory(limit=1000, window_length=1)

cem = CEMAgent(model=model, nb_actions=nb_actions, memory=memory,
               batch_size=50, nb_steps_warmup=2000, train_interval=50, elite_frac=0.05)
cem.compile()

# Train the agent. Training is not visualized here (visualize=False) because
# rendering slows it down considerably. You can safely abort training early
# with Ctrl + C.
cem.fit(env, nb_steps=100000, visualize=False, verbose=2)

# After training is done, save the final weights.
cem.save_weights('cem_{}_params.h5f'.format(ENV_NAME), overwrite=True)

# Finally, evaluate the agent for 5 episodes.
cem.test(env, nb_episodes=5, visualize=True)
```
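
The Option 1 model is a single dense layer followed by softmax. With 4 observation values and 2 actions, it has 4 × 2 weights plus 2 biases, or 10 trainable parameters, which matches the model summary in the output. The NumPy sketch below mirrors that forward pass under stated assumptions: the weights `W` are random placeholders rather than trained values, the observation is made up, and choosing the most probable action is an illustrative rule, not necessarily how the agent selects actions.

```python
import numpy as np

obs_dim, nb_actions = 4, 2  # sizes taken from the model summary

rng = np.random.default_rng(123)
W = rng.normal(size=(obs_dim, nb_actions))  # hypothetical weights, not trained values
b = np.zeros(nb_actions)

def action_probabilities(observation):
    """Dense layer followed by softmax, as in the Option 1 model."""
    logits = observation @ W + b
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

n_params = W.size + b.size  # 4 * 2 weights + 2 biases
probs = action_probabilities(np.array([0.01, -0.02, 0.03, 0.04]))  # made-up observation
action = int(np.argmax(probs))  # illustrative choice: take the most probable action
print(n_params, probs, action)
```

These 10 numbers form the parameter vector that a search method such as CEM adjusts between episodes.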

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 4)                 0         
_________________________________________________________________
dense (Dense)                (None, 2)                 10        
_________________________________________________________________
activation (Activation)      (None, 2)                 0         
Total params: 10
Trainable params: 10
Non-trainable params: 0
_________________________________________________________________
None
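
With `verbose=2`, the training log contains one line per episode in a regular layout, so it can be summarized with a small parser. The sketch below is one possible approach: the regular expression `LOG_LINE` and the helper `parse_log_line` are hypothetical names written for this example, and the two sample lines are copied from the log in this section (the second is cut off in the source, which the parser tolerates because it only needs the fields up to `episode reward`).

```python
import re

# Matches the step counter, episode number, episode length, and episode reward.
LOG_LINE = re.compile(
    r"^\s*(?P<step>\d+)/(?P<total>\d+): episode: (?P<episode>\d+),"
    r".*?episode steps:\s*(?P<episode_steps>\d+),"
    r".*?episode reward:\s*(?P<reward>-?[\d.]+)"
)

def parse_log_line(line):
    """Return a dict of fields for an episode line, or None for other lines."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "episode": int(m["episode"]),
        "episode_steps": int(m["episode_steps"]),
        "reward": float(m["reward"]),
    }

sample = [
    "Training for 100000 steps ...",
    "    57/100000: episode: 1, duration: 0.067s, episode steps:  57, steps per second: 853, "
    "episode reward: 57.000, mean reward:  1.000 [ 1.000,  1.000], "
    "mean action: 0.491 [0.000, 1.000],  mean_best_reward: --",
    "    81/100000: episode: 2, duration: 0.012s, episode steps:  24, steps per second: 1994, "
    "episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000",
]

records = [r for r in map(parse_log_line, sample) if r is not None]
mean_reward = sum(r["reward"] for r in records) / len(records)
print(records)
print(mean_reward)
```

In CartPole the reward is 1 per step, so `episode reward` equals `episode steps` on every line; a rising average of this value over many episodes is the sign that training is making progress.
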
Training for 100000 steps ...
    57/100000: episode: 1, duration: 0.067s, episode steps:  57, steps per second: 853, episode reward: 57.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.491 [0.000, 1.000],  mean_best_reward: --
    81/100000: episode: 2, duration: 0.012s, episode steps:  24, steps per second: 1994, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000

   909/100000: episode: 38, duration: 0.010s, episode steps:  17, steps per second: 1664, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.412 [0.000, 1.000],  mean_best_reward: --
   932/100000: episode: 39, duration: 0.014s, episode steps:  23, steps per second: 1660, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.478 [0.000, 1.000],  mean_best_reward: --
   946/100000: episode: 40, duration: 0.013s, episode steps:  14, steps per second: 1093, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.714 [0.000, 1.000],  mean_best_reward: --
   965/100000: episode: 41, duration: 0.015s, episode steps:  19, steps per second: 1284, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.579 [0.000, 1.000],  mean_best_reward: --
   984/100000: episode: 42, duration: 0.015s, episode steps:  19, steps per second: 1287, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action:

  1878/100000: episode: 86, duration: 0.014s, episode steps:  22, steps per second: 1597, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  1895/100000: episode: 87, duration: 0.010s, episode steps:  17, steps per second: 1648, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.294 [0.000, 1.000],  mean_best_reward: --
  1910/100000: episode: 88, duration: 0.010s, episode steps:  15, steps per second: 1574, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.400 [0.000, 1.000],  mean_best_reward: --
  1932/100000: episode: 89, duration: 0.012s, episode steps:  22, steps per second: 1880, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.409 [0.000, 1.000],  mean_best_reward: --
  1944/100000: episode: 90, duration: 0.010s, episode steps:  12, steps per second: 1152, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action:

  3080/100000: episode: 133, duration: 0.012s, episode steps:  23, steps per second: 1893, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.565 [0.000, 1.000],  mean_best_reward: --
  3136/100000: episode: 134, duration: 0.030s, episode steps:  56, steps per second: 1851, episode reward: 56.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  3157/100000: episode: 135, duration: 0.016s, episode steps:  21, steps per second: 1338, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.524 [0.000, 1.000],  mean_best_reward: --
  3198/100000: episode: 136, duration: 0.021s, episode steps:  41, steps per second: 1962, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.512 [0.000, 1.000],  mean_best_reward: --
  3216/100000: episode: 137, duration: 0.010s, episode steps:  18, steps per second: 1885, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

  4519/100000: episode: 175, duration: 0.023s, episode steps:  20, steps per second: 887, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.550 [0.000, 1.000],  mean_best_reward: --
  4549/100000: episode: 176, duration: 0.033s, episode steps:  30, steps per second: 919, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  4577/100000: episode: 177, duration: 0.027s, episode steps:  28, steps per second: 1032, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  4636/100000: episode: 178, duration: 0.057s, episode steps:  59, steps per second: 1037, episode reward: 59.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.525 [0.000, 1.000],  mean_best_reward: --
  4705/100000: episode: 179, duration: 0.067s, episode steps:  69, steps per second: 1031, episode reward: 69.000, mean reward:  1.000 [ 1.000,  1.000], mean acti

  6194/100000: episode: 221, duration: 0.020s, episode steps:  24, steps per second: 1210, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.458 [0.000, 1.000],  mean_best_reward: --
  6224/100000: episode: 222, duration: 0.022s, episode steps:  30, steps per second: 1359, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  6239/100000: episode: 223, duration: 0.014s, episode steps:  15, steps per second: 1037, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.667 [0.000, 1.000],  mean_best_reward: --
  6270/100000: episode: 224, duration: 0.021s, episode steps:  31, steps per second: 1498, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.548 [0.000, 1.000],  mean_best_reward: --
  6281/100000: episode: 225, duration: 0.012s, episode steps:  11, steps per second: 926, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean act

  7603/100000: episode: 265, duration: 0.016s, episode steps:  29, steps per second: 1852, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.552 [0.000, 1.000],  mean_best_reward: --
  7643/100000: episode: 266, duration: 0.022s, episode steps:  40, steps per second: 1798, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.525 [0.000, 1.000],  mean_best_reward: --
  7717/100000: episode: 267, duration: 0.044s, episode steps:  74, steps per second: 1691, episode reward: 74.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.554 [0.000, 1.000],  mean_best_reward: --
  7739/100000: episode: 268, duration: 0.012s, episode steps:  22, steps per second: 1860, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.409 [0.000, 1.000],  mean_best_reward: --
  7767/100000: episode: 269, duration: 0.014s, episode steps:  28, steps per second: 1976, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

  8951/100000: episode: 305, duration: 0.075s, episode steps:  81, steps per second: 1083, episode reward: 81.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.531 [0.000, 1.000],  mean_best_reward: --
  9039/100000: episode: 306, duration: 0.056s, episode steps:  88, steps per second: 1572, episode reward: 88.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.511 [0.000, 1.000],  mean_best_reward: --
  9063/100000: episode: 307, duration: 0.011s, episode steps:  24, steps per second: 2131, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.542 [0.000, 1.000],  mean_best_reward: --
  9131/100000: episode: 308, duration: 0.031s, episode steps:  68, steps per second: 2163, episode reward: 68.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
  9160/100000: episode: 309, duration: 0.014s, episode steps:  29, steps per second: 2108, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 10826/100000: episode: 345, duration: 0.013s, episode steps:  25, steps per second: 1909, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 10846/100000: episode: 346, duration: 0.013s, episode steps:  20, steps per second: 1562, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.450 [0.000, 1.000],  mean_best_reward: --
 10887/100000: episode: 347, duration: 0.026s, episode steps:  41, steps per second: 1568, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.463 [0.000, 1.000],  mean_best_reward: --
 10951/100000: episode: 348, duration: 0.030s, episode steps:  64, steps per second: 2101, episode reward: 64.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.531 [0.000, 1.000],  mean_best_reward: --
 10965/100000: episode: 349, duration: 0.007s, episode steps:  14, steps per second: 1908, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 12435/100000: episode: 388, duration: 0.031s, episode steps:  61, steps per second: 1981, episode reward: 61.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.492 [0.000, 1.000],  mean_best_reward: --
 12459/100000: episode: 389, duration: 0.015s, episode steps:  24, steps per second: 1576, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.583 [0.000, 1.000],  mean_best_reward: --
 12474/100000: episode: 390, duration: 0.013s, episode steps:  15, steps per second: 1118, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 12494/100000: episode: 391, duration: 0.012s, episode steps:  20, steps per second: 1722, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 12519/100000: episode: 392, duration: 0.014s, episode steps:  25, steps per second: 1783, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 14082/100000: episode: 434, duration: 0.015s, episode steps:  25, steps per second: 1722, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 14114/100000: episode: 435, duration: 0.020s, episode steps:  32, steps per second: 1600, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.469 [0.000, 1.000],  mean_best_reward: --
 14132/100000: episode: 436, duration: 0.015s, episode steps:  18, steps per second: 1205, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 14155/100000: episode: 437, duration: 0.014s, episode steps:  23, steps per second: 1668, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.565 [0.000, 1.000],  mean_best_reward: --
 14190/100000: episode: 438, duration: 0.019s, episode steps:  35, steps per second: 1875, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 15548/100000: episode: 473, duration: 0.019s, episode steps:  37, steps per second: 1912, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.514 [0.000, 1.000],  mean_best_reward: --
 15568/100000: episode: 474, duration: 0.011s, episode steps:  20, steps per second: 1883, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.450 [0.000, 1.000],  mean_best_reward: --
 15589/100000: episode: 475, duration: 0.015s, episode steps:  21, steps per second: 1433, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.429 [0.000, 1.000],  mean_best_reward: --
 15619/100000: episode: 476, duration: 0.021s, episode steps:  30, steps per second: 1408, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 15638/100000: episode: 477, duration: 0.012s, episode steps:  19, steps per second: 1634, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 17153/100000: episode: 525, duration: 0.010s, episode steps:  19, steps per second: 1913, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.684 [0.000, 1.000],  mean_best_reward: --
 17190/100000: episode: 526, duration: 0.021s, episode steps:  37, steps per second: 1765, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.568 [0.000, 1.000],  mean_best_reward: --
 17208/100000: episode: 527, duration: 0.010s, episode steps:  18, steps per second: 1773, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 17237/100000: episode: 528, duration: 0.019s, episode steps:  29, steps per second: 1529, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.483 [0.000, 1.000],  mean_best_reward: --
 17265/100000: episode: 529, duration: 0.014s, episode steps:  28, steps per second: 2065, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 18726/100000: episode: 573, duration: 0.021s, episode steps:  39, steps per second: 1885, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.538 [0.000, 1.000],  mean_best_reward: --
 18758/100000: episode: 574, duration: 0.019s, episode steps:  32, steps per second: 1646, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.438 [0.000, 1.000],  mean_best_reward: --
 18801/100000: episode: 575, duration: 0.029s, episode steps:  43, steps per second: 1483, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.535 [0.000, 1.000],  mean_best_reward: --
 18826/100000: episode: 576, duration: 0.014s, episode steps:  25, steps per second: 1760, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 18852/100000: episode: 577, duration: 0.014s, episode steps:  26, steps per second: 1868, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 20266/100000: episode: 620, duration: 0.018s, episode steps:  33, steps per second: 1842, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.515 [0.000, 1.000],  mean_best_reward: --
 20298/100000: episode: 621, duration: 0.020s, episode steps:  32, steps per second: 1597, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.531 [0.000, 1.000],  mean_best_reward: --
 20322/100000: episode: 622, duration: 0.020s, episode steps:  24, steps per second: 1209, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.542 [0.000, 1.000],  mean_best_reward: --
 20343/100000: episode: 623, duration: 0.012s, episode steps:  21, steps per second: 1731, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.476 [0.000, 1.000],  mean_best_reward: --
 20377/100000: episode: 624, duration: 0.018s, episode steps:  34, steps per second: 1847, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 21831/100000: episode: 664, duration: 0.008s, episode steps:  13, steps per second: 1683, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.385 [0.000, 1.000],  mean_best_reward: --
 21864/100000: episode: 665, duration: 0.019s, episode steps:  33, steps per second: 1770, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.485 [0.000, 1.000],  mean_best_reward: --
 21879/100000: episode: 666, duration: 0.009s, episode steps:  15, steps per second: 1668, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.400 [0.000, 1.000],  mean_best_reward: --
 21902/100000: episode: 667, duration: 0.015s, episode steps:  23, steps per second: 1547, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.391 [0.000, 1.000],  mean_best_reward: --
 21931/100000: episode: 668, duration: 0.015s, episode steps:  29, steps per second: 1911, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 23501/100000: episode: 713, duration: 0.015s, episode steps:  29, steps per second: 1967, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.483 [0.000, 1.000],  mean_best_reward: --
 23552/100000: episode: 714, duration: 0.029s, episode steps:  51, steps per second: 1770, episode reward: 51.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.510 [0.000, 1.000],  mean_best_reward: --
 23583/100000: episode: 715, duration: 0.019s, episode steps:  31, steps per second: 1657, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.548 [0.000, 1.000],  mean_best_reward: --
 23607/100000: episode: 716, duration: 0.013s, episode steps:  24, steps per second: 1881, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.542 [0.000, 1.000],  mean_best_reward: --
 23634/100000: episode: 717, duration: 0.013s, episode steps:  27, steps per second: 2061, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 25147/100000: episode: 756, duration: 0.038s, episode steps:  68, steps per second: 1810, episode reward: 68.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.544 [0.000, 1.000],  mean_best_reward: --
 25161/100000: episode: 757, duration: 0.009s, episode steps:  14, steps per second: 1561, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 25193/100000: episode: 758, duration: 0.023s, episode steps:  32, steps per second: 1375, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 25240/100000: episode: 759, duration: 0.023s, episode steps:  47, steps per second: 2067, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.489 [0.000, 1.000],  mean_best_reward: --
 25270/100000: episode: 760, duration: 0.014s, episode steps:  30, steps per second: 2089, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 26590/100000: episode: 799, duration: 0.033s, episode steps:  19, steps per second: 571, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.579 [0.000, 1.000],  mean_best_reward: --
 26625/100000: episode: 800, duration: 0.065s, episode steps:  35, steps per second: 536, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.514 [0.000, 1.000],  mean_best_reward: --
 26645/100000: episode: 801, duration: 0.042s, episode steps:  20, steps per second: 473, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.400 [0.000, 1.000],  mean_best_reward: 93.000000
 26667/100000: episode: 802, duration: 0.029s, episode steps:  22, steps per second: 751, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 26752/100000: episode: 803, duration: 0.094s, episode steps:  85, steps per second: 907, episode reward: 85.000, mean reward:  1.000 [ 1.000,  1.000], mean 

 28524/100000: episode: 842, duration: 0.027s, episode steps:  21, steps per second: 781, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.476 [0.000, 1.000],  mean_best_reward: --
 28600/100000: episode: 843, duration: 0.078s, episode steps:  76, steps per second: 979, episode reward: 76.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 28648/100000: episode: 844, duration: 0.056s, episode steps:  48, steps per second: 863, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.542 [0.000, 1.000],  mean_best_reward: --
 28670/100000: episode: 845, duration: 0.027s, episode steps:  22, steps per second: 810, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 28693/100000: episode: 846, duration: 0.016s, episode steps:  23, steps per second: 1398, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action

 30175/100000: episode: 886, duration: 0.033s, episode steps:  28, steps per second: 838, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.464 [0.000, 1.000],  mean_best_reward: --
 30194/100000: episode: 887, duration: 0.022s, episode steps:  19, steps per second: 864, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 30220/100000: episode: 888, duration: 0.025s, episode steps:  26, steps per second: 1061, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 30235/100000: episode: 889, duration: 0.010s, episode steps:  15, steps per second: 1498, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 30256/100000: episode: 890, duration: 0.015s, episode steps:  21, steps per second: 1372, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean acti

 31856/100000: episode: 935, duration: 0.015s, episode steps:  29, steps per second: 1944, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.517 [0.000, 1.000],  mean_best_reward: --
 31928/100000: episode: 936, duration: 0.036s, episode steps:  72, steps per second: 2021, episode reward: 72.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.486 [0.000, 1.000],  mean_best_reward: --
 31960/100000: episode: 937, duration: 0.016s, episode steps:  32, steps per second: 1985, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.469 [0.000, 1.000],  mean_best_reward: --
 31985/100000: episode: 938, duration: 0.013s, episode steps:  25, steps per second: 1929, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 32003/100000: episode: 939, duration: 0.009s, episode steps:  18, steps per second: 1948, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 33587/100000: episode: 977, duration: 0.021s, episode steps:  42, steps per second: 2014, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.548 [0.000, 1.000],  mean_best_reward: --
 33661/100000: episode: 978, duration: 0.034s, episode steps:  74, steps per second: 2177, episode reward: 74.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.473 [0.000, 1.000],  mean_best_reward: --
 33676/100000: episode: 979, duration: 0.008s, episode steps:  15, steps per second: 1865, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 33722/100000: episode: 980, duration: 0.022s, episode steps:  46, steps per second: 2113, episode reward: 46.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.478 [0.000, 1.000],  mean_best_reward: --
 33746/100000: episode: 981, duration: 0.012s, episode steps:  24, steps per second: 2023, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 35335/100000: episode: 1024, duration: 0.025s, episode steps:  48, steps per second: 1934, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 35366/100000: episode: 1025, duration: 0.015s, episode steps:  31, steps per second: 2012, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.484 [0.000, 1.000],  mean_best_reward: --
 35406/100000: episode: 1026, duration: 0.020s, episode steps:  40, steps per second: 2011, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.525 [0.000, 1.000],  mean_best_reward: --
 35423/100000: episode: 1027, duration: 0.009s, episode steps:  17, steps per second: 1901, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 35448/100000: episode: 1028, duration: 0.013s, episode steps:  25, steps per second: 1990, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], me

 36655/100000: episode: 1063, duration: 0.027s, episode steps:  52, steps per second: 1938, episode reward: 52.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.462 [0.000, 1.000],  mean_best_reward: --
 36693/100000: episode: 1064, duration: 0.019s, episode steps:  38, steps per second: 2036, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 36710/100000: episode: 1065, duration: 0.009s, episode steps:  17, steps per second: 1824, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.471 [0.000, 1.000],  mean_best_reward: --
 36749/100000: episode: 1066, duration: 0.019s, episode steps:  39, steps per second: 2043, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.538 [0.000, 1.000],  mean_best_reward: --
 36771/100000: episode: 1067, duration: 0.011s, episode steps:  22, steps per second: 1913, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], me

 38369/100000: episode: 1106, duration: 0.011s, episode steps:  18, steps per second: 1658, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.444 [0.000, 1.000],  mean_best_reward: --
 38412/100000: episode: 1107, duration: 0.022s, episode steps:  43, steps per second: 1987, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.512 [0.000, 1.000],  mean_best_reward: --
 38457/100000: episode: 1108, duration: 0.022s, episode steps:  45, steps per second: 2031, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.511 [0.000, 1.000],  mean_best_reward: --
 38511/100000: episode: 1109, duration: 0.026s, episode steps:  54, steps per second: 2064, episode reward: 54.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.519 [0.000, 1.000],  mean_best_reward: --
 38566/100000: episode: 1110, duration: 0.027s, episode steps:  55, steps per second: 2071, episode reward: 55.000, mean reward:  1.000 [ 1.000,  1.000], me

 40085/100000: episode: 1150, duration: 0.035s, episode steps:  72, steps per second: 2081, episode reward: 72.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.528 [0.000, 1.000],  mean_best_reward: --
 40124/100000: episode: 1151, duration: 0.019s, episode steps:  39, steps per second: 2073, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.487 [0.000, 1.000],  mean_best_reward: 89.000000
 40150/100000: episode: 1152, duration: 0.013s, episode steps:  26, steps per second: 2060, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.538 [0.000, 1.000],  mean_best_reward: --
 40174/100000: episode: 1153, duration: 0.012s, episode steps:  24, steps per second: 2001, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 40211/100000: episode: 1154, duration: 0.017s, episode steps:  37, steps per second: 2136, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.0

 41831/100000: episode: 1197, duration: 0.029s, episode steps:  61, steps per second: 2086, episode reward: 61.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.475 [0.000, 1.000],  mean_best_reward: --
 41896/100000: episode: 1198, duration: 0.032s, episode steps:  65, steps per second: 2023, episode reward: 65.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.462 [0.000, 1.000],  mean_best_reward: --
 41944/100000: episode: 1199, duration: 0.023s, episode steps:  48, steps per second: 2047, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.521 [0.000, 1.000],  mean_best_reward: --
 41975/100000: episode: 1200, duration: 0.014s, episode steps:  31, steps per second: 2239, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.516 [0.000, 1.000],  mean_best_reward: --
 42003/100000: episode: 1201, duration: 0.016s, episode steps:  28, steps per second: 1793, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], me

 43484/100000: episode: 1237, duration: 0.044s, episode steps:  92, steps per second: 2103, episode reward: 92.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 43551/100000: episode: 1238, duration: 0.033s, episode steps:  67, steps per second: 2047, episode reward: 67.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 43572/100000: episode: 1239, duration: 0.011s, episode steps:  21, steps per second: 1865, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.524 [0.000, 1.000],  mean_best_reward: --
 43598/100000: episode: 1240, duration: 0.013s, episode steps:  26, steps per second: 2034, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 43610/100000: episode: 1241, duration: 0.006s, episode steps:  12, steps per second: 1851, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], me

 45214/100000: episode: 1282, duration: 0.015s, episode steps:  29, steps per second: 1953, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.483 [0.000, 1.000],  mean_best_reward: --
 45283/100000: episode: 1283, duration: 0.032s, episode steps:  69, steps per second: 2132, episode reward: 69.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.507 [0.000, 1.000],  mean_best_reward: --
 45342/100000: episode: 1284, duration: 0.027s, episode steps:  59, steps per second: 2147, episode reward: 59.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.525 [0.000, 1.000],  mean_best_reward: --
 45400/100000: episode: 1285, duration: 0.028s, episode steps:  58, steps per second: 2104, episode reward: 58.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.517 [0.000, 1.000],  mean_best_reward: --
 45419/100000: episode: 1286, duration: 0.010s, episode steps:  19, steps per second: 1993, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], me

 46979/100000: episode: 1330, duration: 0.032s, episode steps:  62, steps per second: 1964, episode reward: 62.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.484 [0.000, 1.000],  mean_best_reward: --
 47029/100000: episode: 1331, duration: 0.023s, episode steps:  50, steps per second: 2130, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 47093/100000: episode: 1332, duration: 0.029s, episode steps:  64, steps per second: 2182, episode reward: 64.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.484 [0.000, 1.000],  mean_best_reward: --
 47155/100000: episode: 1333, duration: 0.029s, episode steps:  62, steps per second: 2123, episode reward: 62.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.468 [0.000, 1.000],  mean_best_reward: --
 47200/100000: episode: 1334, duration: 0.020s, episode steps:  45, steps per second: 2199, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], me

 48768/100000: episode: 1379, duration: 0.024s, episode steps:  50, steps per second: 2056, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 48818/100000: episode: 1380, duration: 0.024s, episode steps:  50, steps per second: 2085, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.540 [0.000, 1.000],  mean_best_reward: --
 48841/100000: episode: 1381, duration: 0.011s, episode steps:  23, steps per second: 2027, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.478 [0.000, 1.000],  mean_best_reward: --
 48882/100000: episode: 1382, duration: 0.019s, episode steps:  41, steps per second: 2122, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.463 [0.000, 1.000],  mean_best_reward: --
 48928/100000: episode: 1383, duration: 0.021s, episode steps:  46, steps per second: 2148, episode reward: 46.000, mean reward:  1.000 [ 1.000,  1.000], me

 50508/100000: episode: 1424, duration: 0.018s, episode steps:  35, steps per second: 1987, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.486 [0.000, 1.000],  mean_best_reward: --
 50555/100000: episode: 1425, duration: 0.022s, episode steps:  47, steps per second: 2107, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.489 [0.000, 1.000],  mean_best_reward: --
 50605/100000: episode: 1426, duration: 0.024s, episode steps:  50, steps per second: 2062, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 50630/100000: episode: 1427, duration: 0.013s, episode steps:  25, steps per second: 1995, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 50670/100000: episode: 1428, duration: 0.018s, episode steps:  40, steps per second: 2193, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], me

 52251/100000: episode: 1475, duration: 0.032s, episode steps:  66, steps per second: 2036, episode reward: 66.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.485 [0.000, 1.000],  mean_best_reward: --
 52320/100000: episode: 1476, duration: 0.033s, episode steps:  69, steps per second: 2109, episode reward: 69.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.478 [0.000, 1.000],  mean_best_reward: --
 52333/100000: episode: 1477, duration: 0.007s, episode steps:  13, steps per second: 1911, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.231 [0.000, 1.000],  mean_best_reward: --
 52354/100000: episode: 1478, duration: 0.010s, episode steps:  21, steps per second: 2012, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.476 [0.000, 1.000],  mean_best_reward: --
 52369/100000: episode: 1479, duration: 0.008s, episode steps:  15, steps per second: 1889, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], me

 53525/100000: episode: 1515, duration: 0.014s, episode steps:  27, steps per second: 1924, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.519 [0.000, 1.000],  mean_best_reward: --
 53575/100000: episode: 1516, duration: 0.024s, episode steps:  50, steps per second: 2083, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 53605/100000: episode: 1517, duration: 0.015s, episode steps:  30, steps per second: 1984, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 53642/100000: episode: 1518, duration: 0.017s, episode steps:  37, steps per second: 2131, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.459 [0.000, 1.000],  mean_best_reward: --
 53674/100000: episode: 1519, duration: 0.015s, episode steps:  32, steps per second: 2092, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], me

 55286/100000: episode: 1560, duration: 0.030s, episode steps:  61, steps per second: 2046, episode reward: 61.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.525 [0.000, 1.000],  mean_best_reward: --
 55347/100000: episode: 1561, duration: 0.028s, episode steps:  61, steps per second: 2160, episode reward: 61.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.492 [0.000, 1.000],  mean_best_reward: --
 55381/100000: episode: 1562, duration: 0.017s, episode steps:  34, steps per second: 2003, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 55466/100000: episode: 1563, duration: 0.039s, episode steps:  85, steps per second: 2153, episode reward: 85.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.506 [0.000, 1.000],  mean_best_reward: --
 55505/100000: episode: 1564, duration: 0.019s, episode steps:  39, steps per second: 2087, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], me

 57020/100000: episode: 1606, duration: 0.010s, episode steps:  18, steps per second: 1851, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 57060/100000: episode: 1607, duration: 0.020s, episode steps:  40, steps per second: 2012, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.525 [0.000, 1.000],  mean_best_reward: --
 57096/100000: episode: 1608, duration: 0.017s, episode steps:  36, steps per second: 2112, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.528 [0.000, 1.000],  mean_best_reward: --
 57144/100000: episode: 1609, duration: 0.022s, episode steps:  48, steps per second: 2149, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.458 [0.000, 1.000],  mean_best_reward: --
 57204/100000: episode: 1610, duration: 0.029s, episode steps:  60, steps per second: 2080, episode reward: 60.000, mean reward:  1.000 [ 1.000,  1.000], me

 58748/100000: episode: 1653, duration: 0.012s, episode steps:  25, steps per second: 2015, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 58792/100000: episode: 1654, duration: 0.022s, episode steps:  44, steps per second: 1986, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.455 [0.000, 1.000],  mean_best_reward: --
 58849/100000: episode: 1655, duration: 0.030s, episode steps:  57, steps per second: 1902, episode reward: 57.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.456 [0.000, 1.000],  mean_best_reward: --
 58877/100000: episode: 1656, duration: 0.014s, episode steps:  28, steps per second: 2063, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.536 [0.000, 1.000],  mean_best_reward: --
 58908/100000: episode: 1657, duration: 0.015s, episode steps:  31, steps per second: 2049, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], me

 60502/100000: episode: 1694, duration: 0.016s, episode steps:  30, steps per second: 1905, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
 60518/100000: episode: 1695, duration: 0.008s, episode steps:  16, steps per second: 1887, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 60570/100000: episode: 1696, duration: 0.024s, episode steps:  52, steps per second: 2181, episode reward: 52.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.481 [0.000, 1.000],  mean_best_reward: --
 60602/100000: episode: 1697, duration: 0.016s, episode steps:  32, steps per second: 2033, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.562 [0.000, 1.000],  mean_best_reward: --
 60696/100000: episode: 1698, duration: 0.043s, episode steps:  94, steps per second: 2192, episode reward: 94.000, mean reward:  1.000 [ 1.000,  1.000], me

 62239/100000: episode: 1736, duration: 0.016s, episode steps:  30, steps per second: 1931, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
 62269/100000: episode: 1737, duration: 0.015s, episode steps:  30, steps per second: 1949, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
 62319/100000: episode: 1738, duration: 0.023s, episode steps:  50, steps per second: 2195, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 62395/100000: episode: 1739, duration: 0.037s, episode steps:  76, steps per second: 2046, episode reward: 76.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 62425/100000: episode: 1740, duration: 0.015s, episode steps:  30, steps per second: 2046, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], me

 63973/100000: episode: 1780, duration: 0.020s, episode steps:  40, steps per second: 1962, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 63990/100000: episode: 1781, duration: 0.009s, episode steps:  17, steps per second: 1917, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.471 [0.000, 1.000],  mean_best_reward: --
 64006/100000: episode: 1782, duration: 0.008s, episode steps:  16, steps per second: 1964, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.625 [0.000, 1.000],  mean_best_reward: --
 64032/100000: episode: 1783, duration: 0.012s, episode steps:  26, steps per second: 2110, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.538 [0.000, 1.000],  mean_best_reward: --
 64063/100000: episode: 1784, duration: 0.015s, episode steps:  31, steps per second: 2100, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], me

 65669/100000: episode: 1825, duration: 0.068s, episode steps:  86, steps per second: 1265, episode reward: 86.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.512 [0.000, 1.000],  mean_best_reward: --
 65726/100000: episode: 1826, duration: 0.034s, episode steps:  57, steps per second: 1679, episode reward: 57.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.456 [0.000, 1.000],  mean_best_reward: --
 65763/100000: episode: 1827, duration: 0.022s, episode steps:  37, steps per second: 1706, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.514 [0.000, 1.000],  mean_best_reward: --
 65821/100000: episode: 1828, duration: 0.037s, episode steps:  58, steps per second: 1554, episode reward: 58.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 65852/100000: episode: 1829, duration: 0.016s, episode steps:  31, steps per second: 1922, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], me

 67391/100000: episode: 1868, duration: 0.030s, episode steps:  48, steps per second: 1623, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.521 [0.000, 1.000],  mean_best_reward: --
 67414/100000: episode: 1869, duration: 0.013s, episode steps:  23, steps per second: 1766, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.565 [0.000, 1.000],  mean_best_reward: --
 67446/100000: episode: 1870, duration: 0.016s, episode steps:  32, steps per second: 1976, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 67516/100000: episode: 1871, duration: 0.033s, episode steps:  70, steps per second: 2112, episode reward: 70.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.514 [0.000, 1.000],  mean_best_reward: --
 67582/100000: episode: 1872, duration: 0.032s, episode steps:  66, steps per second: 2052, episode reward: 66.000, mean reward:  1.000 [ 1.000,  1.000], me

 69146/100000: episode: 1908, duration: 0.046s, episode steps:  98, steps per second: 2108, episode reward: 98.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.439 [0.000, 1.000],  mean_best_reward: --
 69197/100000: episode: 1909, duration: 0.024s, episode steps:  51, steps per second: 2105, episode reward: 51.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.490 [0.000, 1.000],  mean_best_reward: --
 69251/100000: episode: 1910, duration: 0.025s, episode steps:  54, steps per second: 2167, episode reward: 54.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 69278/100000: episode: 1911, duration: 0.016s, episode steps:  27, steps per second: 1728, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 69308/100000: episode: 1912, duration: 0.015s, episode steps:  30, steps per second: 2014, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], me

 70686/100000: episode: 1950, duration: 0.021s, episode steps:  45, steps per second: 2113, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 70769/100000: episode: 1951, duration: 0.040s, episode steps:  83, steps per second: 2082, episode reward: 83.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.482 [0.000, 1.000],  mean_best_reward: 117.500000
 70787/100000: episode: 1952, duration: 0.009s, episode steps:  18, steps per second: 2002, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.333 [0.000, 1.000],  mean_best_reward: --
 70808/100000: episode: 1953, duration: 0.011s, episode steps:  21, steps per second: 1856, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.429 [0.000, 1.000],  mean_best_reward: --
 70832/100000: episode: 1954, duration: 0.012s, episode steps:  24, steps per second: 2086, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.

 71987/100000: episode: 1991, duration: 0.033s, episode steps:  69, steps per second: 2073, episode reward: 69.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.536 [0.000, 1.000],  mean_best_reward: --
 72023/100000: episode: 1992, duration: 0.017s, episode steps:  36, steps per second: 2082, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 72087/100000: episode: 1993, duration: 0.031s, episode steps:  64, steps per second: 2073, episode reward: 64.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.484 [0.000, 1.000],  mean_best_reward: --
 72113/100000: episode: 1994, duration: 0.012s, episode steps:  26, steps per second: 2148, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.577 [0.000, 1.000],  mean_best_reward: --
 72157/100000: episode: 1995, duration: 0.022s, episode steps:  44, steps per second: 1995, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], me

 73731/100000: episode: 2030, duration: 0.019s, episode steps:  36, steps per second: 1852, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  mean_best_reward: --
 73754/100000: episode: 2031, duration: 0.012s, episode steps:  23, steps per second: 1886, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 73779/100000: episode: 2032, duration: 0.014s, episode steps:  25, steps per second: 1823, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 73823/100000: episode: 2033, duration: 0.022s, episode steps:  44, steps per second: 2009, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.545 [0.000, 1.000],  mean_best_reward: --
 73873/100000: episode: 2034, duration: 0.025s, episode steps:  50, steps per second: 2031, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], me

 75436/100000: episode: 2078, duration: 0.012s, episode steps:  23, steps per second: 1963, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.565 [0.000, 1.000],  mean_best_reward: --
 75456/100000: episode: 2079, duration: 0.011s, episode steps:  20, steps per second: 1879, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 75482/100000: episode: 2080, duration: 0.013s, episode steps:  26, steps per second: 1954, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 75494/100000: episode: 2081, duration: 0.006s, episode steps:  12, steps per second: 1869, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.583 [0.000, 1.000],  mean_best_reward: --
 75518/100000: episode: 2082, duration: 0.012s, episode steps:  24, steps per second: 2013, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], me

 77237/100000: episode: 2129, duration: 0.031s, episode steps:  64, steps per second: 2038, episode reward: 64.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.516 [0.000, 1.000],  mean_best_reward: --
 77295/100000: episode: 2130, duration: 0.027s, episode steps:  58, steps per second: 2153, episode reward: 58.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 77309/100000: episode: 2131, duration: 0.007s, episode steps:  14, steps per second: 1936, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.429 [0.000, 1.000],  mean_best_reward: --
 77339/100000: episode: 2132, duration: 0.014s, episode steps:  30, steps per second: 2085, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 77420/100000: episode: 2133, duration: 0.038s, episode steps:  81, steps per second: 2131, episode reward: 81.000, mean reward:  1.000 [ 1.000,  1.000], me

 78943/100000: episode: 2180, duration: 0.010s, episode steps:  19, steps per second: 1820, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 78991/100000: episode: 2181, duration: 0.024s, episode steps:  48, steps per second: 2015, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 79040/100000: episode: 2182, duration: 0.023s, episode steps:  49, steps per second: 2158, episode reward: 49.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.531 [0.000, 1.000],  mean_best_reward: --
 79087/100000: episode: 2183, duration: 0.022s, episode steps:  47, steps per second: 2143, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.532 [0.000, 1.000],  mean_best_reward: --
 79128/100000: episode: 2184, duration: 0.019s, episode steps:  41, steps per second: 2173, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], me

 80631/100000: episode: 2231, duration: 0.012s, episode steps:  23, steps per second: 1842, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.565 [0.000, 1.000],  mean_best_reward: --
 80673/100000: episode: 2232, duration: 0.021s, episode steps:  42, steps per second: 2010, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.476 [0.000, 1.000],  mean_best_reward: --
 80698/100000: episode: 2233, duration: 0.012s, episode steps:  25, steps per second: 2070, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 80720/100000: episode: 2234, duration: 0.011s, episode steps:  22, steps per second: 1938, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.591 [0.000, 1.000],  mean_best_reward: --
 80747/100000: episode: 2235, duration: 0.013s, episode steps:  27, steps per second: 2079, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], me

 82338/100000: episode: 2271, duration: 0.011s, episode steps:  22, steps per second: 1977, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.455 [0.000, 1.000],  mean_best_reward: --
 82373/100000: episode: 2272, duration: 0.018s, episode steps:  35, steps per second: 1947, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.457 [0.000, 1.000],  mean_best_reward: --
 82411/100000: episode: 2273, duration: 0.018s, episode steps:  38, steps per second: 2087, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.553 [0.000, 1.000],  mean_best_reward: --
 82451/100000: episode: 2274, duration: 0.019s, episode steps:  40, steps per second: 2122, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.475 [0.000, 1.000],  mean_best_reward: --
 82489/100000: episode: 2275, duration: 0.018s, episode steps:  38, steps per second: 2136, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], me

 84100/100000: episode: 2315, duration: 0.014s, episode steps:  27, steps per second: 1960, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.481 [0.000, 1.000],  mean_best_reward: --
 84146/100000: episode: 2316, duration: 0.023s, episode steps:  46, steps per second: 2029, episode reward: 46.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 84173/100000: episode: 2317, duration: 0.014s, episode steps:  27, steps per second: 1918, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 84210/100000: episode: 2318, duration: 0.017s, episode steps:  37, steps per second: 2117, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.432 [0.000, 1.000],  mean_best_reward: --
 84271/100000: episode: 2319, duration: 0.028s, episode steps:  61, steps per second: 2152, episode reward: 61.000, mean reward:  1.000 [ 1.000,  1.000], me

 85851/100000: episode: 2358, duration: 0.030s, episode steps:  53, steps per second: 1789, episode reward: 53.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  mean_best_reward: --
 85863/100000: episode: 2359, duration: 0.006s, episode steps:  12, steps per second: 1849, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.583 [0.000, 1.000],  mean_best_reward: --
 85888/100000: episode: 2360, duration: 0.012s, episode steps:  25, steps per second: 2040, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 85925/100000: episode: 2361, duration: 0.018s, episode steps:  37, steps per second: 2090, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.486 [0.000, 1.000],  mean_best_reward: --
 85957/100000: episode: 2362, duration: 0.015s, episode steps:  32, steps per second: 2135, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], me

 87596/100000: episode: 2404, duration: 0.022s, episode steps:  45, steps per second: 2017, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.511 [0.000, 1.000],  mean_best_reward: --
 87611/100000: episode: 2405, duration: 0.008s, episode steps:  15, steps per second: 1907, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
 87630/100000: episode: 2406, duration: 0.009s, episode steps:  19, steps per second: 2020, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 87647/100000: episode: 2407, duration: 0.009s, episode steps:  17, steps per second: 1961, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.471 [0.000, 1.000],  mean_best_reward: --
 87680/100000: episode: 2408, duration: 0.016s, episode steps:  33, steps per second: 2059, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], me

 89338/100000: episode: 2446, duration: 0.036s, episode steps:  67, steps per second: 1869, episode reward: 67.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.463 [0.000, 1.000],  mean_best_reward: --
 89367/100000: episode: 2447, duration: 0.015s, episode steps:  29, steps per second: 1955, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.586 [0.000, 1.000],  mean_best_reward: --
 89436/100000: episode: 2448, duration: 0.032s, episode steps:  69, steps per second: 2159, episode reward: 69.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.536 [0.000, 1.000],  mean_best_reward: --
 89480/100000: episode: 2449, duration: 0.021s, episode steps:  44, steps per second: 2075, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 89532/100000: episode: 2450, duration: 0.024s, episode steps:  52, steps per second: 2138, episode reward: 52.000, mean reward:  1.000 [ 1.000,  1.000], me

 90807/100000: episode: 2486, duration: 0.017s, episode steps:  33, steps per second: 1928, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.576 [0.000, 1.000],  mean_best_reward: --
 90830/100000: episode: 2487, duration: 0.013s, episode steps:  23, steps per second: 1752, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.435 [0.000, 1.000],  mean_best_reward: --
 90868/100000: episode: 2488, duration: 0.019s, episode steps:  38, steps per second: 2007, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 90901/100000: episode: 2489, duration: 0.024s, episode steps:  33, steps per second: 1358, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.515 [0.000, 1.000],  mean_best_reward: --
 90976/100000: episode: 2490, duration: 0.075s, episode steps:  75, steps per second: 1002, episode reward: 75.000, mean reward:  1.000 [ 1.000,  1.000], me

 92361/100000: episode: 2533, duration: 0.012s, episode steps:  20, steps per second: 1719, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.600 [0.000, 1.000],  mean_best_reward: --
 92435/100000: episode: 2534, duration: 0.035s, episode steps:  74, steps per second: 2085, episode reward: 74.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 92454/100000: episode: 2535, duration: 0.010s, episode steps:  19, steps per second: 1936, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.316 [0.000, 1.000],  mean_best_reward: --
 92478/100000: episode: 2536, duration: 0.012s, episode steps:  24, steps per second: 2054, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.458 [0.000, 1.000],  mean_best_reward: --
 92516/100000: episode: 2537, duration: 0.018s, episode steps:  38, steps per second: 2131, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], me

 94092/100000: episode: 2579, duration: 0.028s, episode steps:  60, steps per second: 2149, episode reward: 60.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 94117/100000: episode: 2580, duration: 0.014s, episode steps:  25, steps per second: 1795, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 94158/100000: episode: 2581, duration: 0.019s, episode steps:  41, steps per second: 2134, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.488 [0.000, 1.000],  mean_best_reward: --
 94177/100000: episode: 2582, duration: 0.010s, episode steps:  19, steps per second: 1925, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.474 [0.000, 1.000],  mean_best_reward: --
 94187/100000: episode: 2583, duration: 0.005s, episode steps:  10, steps per second: 1869, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], me

 95785/100000: episode: 2626, duration: 0.048s, episode steps: 100, steps per second: 2071, episode reward: 100.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 95805/100000: episode: 2627, duration: 0.011s, episode steps:  20, steps per second: 1835, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 95832/100000: episode: 2628, duration: 0.014s, episode steps:  27, steps per second: 1950, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.481 [0.000, 1.000],  mean_best_reward: --
 95857/100000: episode: 2629, duration: 0.012s, episode steps:  25, steps per second: 2020, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 95869/100000: episode: 2630, duration: 0.007s, episode steps:  12, steps per second: 1841, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], m

 97429/100000: episode: 2677, duration: 0.015s, episode steps:  28, steps per second: 1836, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.464 [0.000, 1.000],  mean_best_reward: --
 97454/100000: episode: 2678, duration: 0.015s, episode steps:  25, steps per second: 1614, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.440 [0.000, 1.000],  mean_best_reward: --
 97476/100000: episode: 2679, duration: 0.012s, episode steps:  22, steps per second: 1810, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.545 [0.000, 1.000],  mean_best_reward: --
 97523/100000: episode: 2680, duration: 0.023s, episode steps:  47, steps per second: 2031, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.468 [0.000, 1.000],  mean_best_reward: --
 97605/100000: episode: 2681, duration: 0.039s, episode steps:  82, steps per second: 2129, episode reward: 82.000, mean reward:  1.000 [ 1.000,  1.000], me

 99147/100000: episode: 2718, duration: 0.041s, episode steps:  85, steps per second: 2082, episode reward: 85.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.482 [0.000, 1.000],  mean_best_reward: --
 99177/100000: episode: 2719, duration: 0.015s, episode steps:  30, steps per second: 1957, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 99226/100000: episode: 2720, duration: 0.024s, episode steps:  49, steps per second: 2033, episode reward: 49.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.510 [0.000, 1.000],  mean_best_reward: --
 99273/100000: episode: 2721, duration: 0.023s, episode steps:  47, steps per second: 2047, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.532 [0.000, 1.000],  mean_best_reward: --
 99298/100000: episode: 2722, duration: 0.013s, episode steps:  25, steps per second: 1906, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], me

<tensorflow.python.keras.callbacks.History at 0x10548fbe0>
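
The training log above is long, and the per-episode reward is easier to judge in aggregate than line by line. The following is a minimal sketch for summarizing it, assuming the log lines keep the format shown above (`step/total: episode: N, ..., episode steps: N, ..., episode reward: X, ...`). The names `EPISODE_RE`, `parse_episode_line`, and `summarize` are hypothetical helpers introduced only for illustration; they are not part of keras-rl. Lines that were cut off in the output can still be parsed as long as the `episode reward` field is present.

```python
import re
from statistics import mean

# Pattern for the per-episode lines shown in the log above. This is an
# assumption based on the visible format, not a documented keras-rl contract.
EPISODE_RE = re.compile(
    r"^\s*(?P<step>\d+)/(?P<total>\d+): episode: (?P<episode>\d+), "
    r".*?episode steps:\s*(?P<episode_steps>\d+), "
    r".*?episode reward:\s*(?P<reward>-?\d+(?:\.\d+)?)"
)


def parse_episode_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = EPISODE_RE.match(line)
    if m is None:
        return None
    return {
        "step": int(m.group("step")),
        "total": int(m.group("total")),
        "episode": int(m.group("episode")),
        "episode_steps": int(m.group("episode_steps")),
        "reward": float(m.group("reward")),
    }


def summarize(lines):
    """Summarize every parseable episode line; non-matching lines are skipped."""
    rows = [r for r in map(parse_episode_line, lines) if r is not None]
    if not rows:
        return {"episodes": 0, "mean_reward": None, "max_reward": None}
    rewards = [r["reward"] for r in rows]
    return {
        "episodes": len(rows),
        "mean_reward": mean(rewards),
        "max_reward": max(rewards),
    }


# Three lines copied from the log above; the last one is truncated in the output.
sample = [
    " 95785/100000: episode: 2626, duration: 0.048s, episode steps: 100, steps per second: 2071, episode reward: 100.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --",
    " 95805/100000: episode: 2627, duration: 0.011s, episode steps:  20, steps per second: 1835, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --",
    " 95869/100000: episode: 2630, duration: 0.007s, episode steps:  12, steps per second: 1841, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], m",
]

print(summarize(sample))
```

Parsing the printed text is only one option. The `History` object returned by `fit` may expose the same per-episode values through its `history` dictionary, but that is an assumption to verify against the installed keras-rl version before relying on it.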