# T81-558: Applications of Deep Neural Networks
**Module 12: Reinforcement Learning**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module Video Material

Main video lecture:

* Part 12.1: Introduction to the OpenAI Gym [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_reinforcement.ipynb)
* Part 12.2: Introduction to Q-Learning for Keras [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_reinforcement.ipynb)
* **Part 12.3: Keras Q-Learning in the OpenAI Gym** [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_reinforcement.ipynb)
* Part 12.4: Atari Games with Keras Neural Networks [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_reinforcement.ipynb)
* Part 12.5: How Alpha Zero used Reinforcement Learning to Master Chess [[Video]](https://www.youtube.com/playlist?list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_12_reinforcement.ipynb)


# Part 12.3: Keras Q-Learning in the OpenAI Gym

![Deep Q-Learning](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/deepqlearning.png "Reinforcement Learning")

This part uses the Keras-RL `CEMAgent`, which implements the cross-entropy method (CEM). Its main parameters are:

* **CEMAgent**
    * **model** - The neural network that will be trained.
    * **nb_actions** - The number of actions the agent can take (e.g. up, down, left, right, fire).
    * **memory** - The `EpisodeParameterMemory` object to use. This object records the network parameters used in each episode together with that episode's total reward, so the agent can sample past episodes when it updates its parameters (instead of having to make new observations from the environment all the time).
    * **batch_size** - The number of stored episodes sampled from memory for each parameter update.
    * **nb_steps_warmup** - Number of environment steps to pass before any learning occurs.
    * **train_interval** - How often, in episodes, the agent updates its parameters.
    * **elite_frac** - The fraction of each sampled batch, ranked by episode reward, that is kept to update the parameter distribution.
* **CEMAgent.fit**
    * **env** - The OpenAI gym environment being used.
    * **nb_steps** - Number of training steps to be performed.
    * **visualize** - If `True`, the environment is visualized during training. This is likely to slow down training significantly, so it is intended as a debugging aid.
    * **verbose** - 0 for no logging, 1 for interval logging (compare `log_interval`), 2 for episode logging.

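To make the roles of `batch_size` and `elite_frac` concrete, here is a minimal pure-Python sketch of the cross-entropy method on a toy two-parameter problem. It is not the Keras-RL implementation: the helper names (`elite_count`, `cem_maximize`, `toy_score`), the floor rounding of the elite count, and the small noise floor added to the standard deviation are all assumptions made for illustration.

```python
import random
import statistics


def elite_count(batch_size, elite_frac):
    """Number of top-scoring samples kept per update (assumed: floor, at least 1)."""
    return max(1, int(batch_size * elite_frac))


def cem_maximize(score, dim, batch_size=50, elite_frac=0.2, iterations=30, seed=123):
    """Toy cross-entropy method: search for parameters that maximize score()."""
    rng = random.Random(seed)
    mean = [0.0] * dim   # current centre of the search distribution
    std = [1.0] * dim    # current spread of the search distribution
    n_elite = elite_count(batch_size, elite_frac)
    for _ in range(iterations):
        # Sample a batch of candidate parameter vectors from the current Gaussian.
        batch = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                 for _ in range(batch_size)]
        # Keep only the best-scoring fraction of the batch (the "elite" set).
        elite = sorted(batch, key=score, reverse=True)[:n_elite]
        # Refit the Gaussian to the elite set; the 0.01 noise floor is an
        # illustrative choice that keeps the spread from collapsing to zero.
        columns = list(zip(*elite))
        mean = [statistics.mean(col) for col in columns]
        std = [statistics.pstdev(col) + 0.01 for col in columns]
    return mean


def toy_score(params):
    """Hypothetical objective with its maximum at (1.0, -0.5)."""
    x, y = params
    return -((x - 1.0) ** 2 + (y + 0.5) ** 2)


best = cem_maximize(toy_score, dim=2)
print(elite_count(50, 0.05))  # 2 elite samples for batch_size=50, elite_frac=0.05
print(best)                   # should land close to [1.0, -0.5]
```

In the agent, the "score" of a parameter vector is the total reward of the episode played with those network weights, so each candidate costs one full episode to evaluate.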



import numpy as np
import gym

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.optimizers import Adam

from rl.agents.cem import CEMAgent
from rl.memory import EpisodeParameterMemory

ENV_NAME = 'CartPole-v0'


# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)

nb_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]

# Option 1 : Simple model
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(nb_actions))
model.add(Activation('softmax'))

# Option 2: deep network
# model = Sequential()
# model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(nb_actions))
# model.add(Activation('softmax'))


print(model.summary())


# Finally, we configure and compile our agent. CEM is a gradient-free method, so
# compile() takes no optimizer or metrics here.
memory = EpisodeParameterMemory(limit=1000, window_length=1)

cem = CEMAgent(model=model, nb_actions=nb_actions, memory=memory,
               batch_size=50, nb_steps_warmup=2000, train_interval=50, elite_frac=0.05)
cem.compile()

# Okay, now it's time to learn something! Visualization is turned off here because it
# slows down training quite a lot. You can always safely abort the training prematurely using
# Ctrl + C.
cem.fit(env, nb_steps=100000, visualize=False, verbose=2)

# After training is done, we save the final weights.
cem.save_weights('cem_{}_params.h5f'.format(ENV_NAME), overwrite=True)

# Finally, evaluate our algorithm for 5 episodes.
cem.test(env, nb_episodes=5, visualize=True)

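The model summary for the simple network reports 10 trainable parameters. As a quick check, the sketch below reproduces that count from the usual dense-layer formula (inputs × units, plus one bias per unit) and applies the same formula to the commented-out "Option 2" network. The helper `dense_param_count` is a hypothetical name introduced here, and the sizes 4 and 2 are read off the `(None, 4)` and `(None, 2)` output shapes in the summary.

```python
def dense_param_count(n_inputs, n_units, use_bias=True):
    """Weights plus optional biases for one fully connected layer."""
    return n_inputs * n_units + (n_units if use_bias else 0)


obs_dim = 4     # size of the flattened CartPole observation
nb_actions = 2  # number of discrete actions

# Option 1: a single Dense layer from the observation to the actions.
simple_total = dense_param_count(obs_dim, nb_actions)
print(simple_total)  # 10

# Option 2: three hidden Dense layers of 16 units, then the output layer.
sizes = [obs_dim, 16, 16, 16, nb_actions]
deep_total = sum(dense_param_count(a, b) for a, b in zip(sizes, sizes[1:]))
print(deep_total)  # 658
```

`Flatten` and `Activation` layers add no parameters, which is why they show `0` in the summary.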
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 10        
_________________________________________________________________
activation_2 (Activation)    (None, 2)                 0         
Total params: 10
Trainable params: 10
Non-trainable params: 0
_________________________________________________________________
None
Training for 100000 steps ...
    47/100000: episode: 1, duration: 0.045s, episode steps:  47, steps per second: 1050, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.426 [0.000, 1.000],  mean_best_reward: --
    90/100000: episode: 2, duration: 0.020s, episode steps:  43, steps per second: 2115, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.

  1148/100000: episode: 50, duration: 0.017s, episode steps:  33, steps per second: 1971, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.485 [0.000, 1.000],  mean_best_reward: --
  1160/100000: episode: 51, duration: 0.007s, episode steps:  12, steps per second: 1808, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.667 [0.000, 1.000],  mean_best_reward: --
  1181/100000: episode: 52, duration: 0.011s, episode steps:  21, steps per second: 1937, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.333 [0.000, 1.000],  mean_best_reward: --
  1196/100000: episode: 53, duration: 0.009s, episode steps:  15, steps per second: 1644, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.600 [0.000, 1.000],  mean_best_reward: --
  1206/100000: episode: 54, duration: 0.006s, episode steps:  10, steps per second: 1726, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action:

  1946/100000: episode: 90, duration: 0.007s, episode steps:  12, steps per second: 1635, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.167 [0.000, 1.000],  mean_best_reward: --
  1959/100000: episode: 91, duration: 0.007s, episode steps:  13, steps per second: 1830, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.769 [0.000, 1.000],  mean_best_reward: --
  1979/100000: episode: 92, duration: 0.012s, episode steps:  20, steps per second: 1739, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.350 [0.000, 1.000],  mean_best_reward: --
  1989/100000: episode: 93, duration: 0.007s, episode steps:  10, steps per second: 1421, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  mean_best_reward: --
  2023/100000: episode: 94, duration: 0.017s, episode steps:  34, steps per second: 1964, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action:

  3229/100000: episode: 136, duration: 0.035s, episode steps:  74, steps per second: 2141, episode reward: 74.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  3263/100000: episode: 137, duration: 0.019s, episode steps:  34, steps per second: 1768, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  3300/100000: episode: 138, duration: 0.018s, episode steps:  37, steps per second: 2090, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.486 [0.000, 1.000],  mean_best_reward: --
  3338/100000: episode: 139, duration: 0.017s, episode steps:  38, steps per second: 2194, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  3375/100000: episode: 140, duration: 0.018s, episode steps:  37, steps per second: 2036, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

  5021/100000: episode: 186, duration: 0.027s, episode steps:  55, steps per second: 2018, episode reward: 55.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.509 [0.000, 1.000],  mean_best_reward: --
  5132/100000: episode: 187, duration: 0.049s, episode steps: 111, steps per second: 2250, episode reward: 111.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.495 [0.000, 1.000],  mean_best_reward: --
  5144/100000: episode: 188, duration: 0.006s, episode steps:  12, steps per second: 1926, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.667 [0.000, 1.000],  mean_best_reward: --
  5192/100000: episode: 189, duration: 0.022s, episode steps:  48, steps per second: 2218, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
  5248/100000: episode: 190, duration: 0.025s, episode steps:  56, steps per second: 2267, episode reward: 56.000, mean reward:  1.000 [ 1.000,  1.000], mean a

  6778/100000: episode: 229, duration: 0.051s, episode steps: 114, steps per second: 2214, episode reward: 114.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.447 [0.000, 1.000],  mean_best_reward: --
  6824/100000: episode: 230, duration: 0.023s, episode steps:  46, steps per second: 2013, episode reward: 46.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
  6834/100000: episode: 231, duration: 0.006s, episode steps:  10, steps per second: 1707, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.600 [0.000, 1.000],  mean_best_reward: --
  6879/100000: episode: 232, duration: 0.021s, episode steps:  45, steps per second: 2125, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
  6901/100000: episode: 233, duration: 0.011s, episode steps:  22, steps per second: 2043, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean a

  8576/100000: episode: 276, duration: 0.017s, episode steps:  33, steps per second: 1908, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.545 [0.000, 1.000],  mean_best_reward: --
  8602/100000: episode: 277, duration: 0.014s, episode steps:  26, steps per second: 1915, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.462 [0.000, 1.000],  mean_best_reward: --
  8647/100000: episode: 278, duration: 0.023s, episode steps:  45, steps per second: 1972, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
  8663/100000: episode: 279, duration: 0.009s, episode steps:  16, steps per second: 1694, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.688 [0.000, 1.000],  mean_best_reward: --
  8739/100000: episode: 280, duration: 0.035s, episode steps:  76, steps per second: 2164, episode reward: 76.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 10291/100000: episode: 320, duration: 0.011s, episode steps:  18, steps per second: 1698, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.444 [0.000, 1.000],  mean_best_reward: --
 10331/100000: episode: 321, duration: 0.020s, episode steps:  40, steps per second: 2035, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.550 [0.000, 1.000],  mean_best_reward: --
 10358/100000: episode: 322, duration: 0.013s, episode steps:  27, steps per second: 2053, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.481 [0.000, 1.000],  mean_best_reward: --
 10406/100000: episode: 323, duration: 0.023s, episode steps:  48, steps per second: 2088, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 10429/100000: episode: 324, duration: 0.011s, episode steps:  23, steps per second: 2047, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 12022/100000: episode: 361, duration: 0.008s, episode steps:  14, steps per second: 1703, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 12037/100000: episode: 362, duration: 0.008s, episode steps:  15, steps per second: 1870, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
 12060/100000: episode: 363, duration: 0.012s, episode steps:  23, steps per second: 1886, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.652 [0.000, 1.000],  mean_best_reward: --
 12120/100000: episode: 364, duration: 0.029s, episode steps:  60, steps per second: 2089, episode reward: 60.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 12168/100000: episode: 365, duration: 0.023s, episode steps:  48, steps per second: 2129, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 13752/100000: episode: 405, duration: 0.020s, episode steps:  40, steps per second: 1996, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.475 [0.000, 1.000],  mean_best_reward: --
 13801/100000: episode: 406, duration: 0.023s, episode steps:  49, steps per second: 2086, episode reward: 49.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.469 [0.000, 1.000],  mean_best_reward: --
 13820/100000: episode: 407, duration: 0.010s, episode steps:  19, steps per second: 1991, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 13833/100000: episode: 408, duration: 0.007s, episode steps:  13, steps per second: 1884, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.615 [0.000, 1.000],  mean_best_reward: --
 13871/100000: episode: 409, duration: 0.018s, episode steps:  38, steps per second: 2112, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 15471/100000: episode: 453, duration: 0.009s, episode steps:  17, steps per second: 1808, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.471 [0.000, 1.000],  mean_best_reward: --
 15497/100000: episode: 454, duration: 0.013s, episode steps:  26, steps per second: 2027, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.538 [0.000, 1.000],  mean_best_reward: --
 15517/100000: episode: 455, duration: 0.010s, episode steps:  20, steps per second: 2042, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.600 [0.000, 1.000],  mean_best_reward: --
 15566/100000: episode: 456, duration: 0.022s, episode steps:  49, steps per second: 2213, episode reward: 49.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.469 [0.000, 1.000],  mean_best_reward: --
 15618/100000: episode: 457, duration: 0.024s, episode steps:  52, steps per second: 2186, episode reward: 52.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 17270/100000: episode: 500, duration: 0.051s, episode steps: 110, steps per second: 2166, episode reward: 110.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.536 [0.000, 1.000],  mean_best_reward: --
 17296/100000: episode: 501, duration: 0.013s, episode steps:  26, steps per second: 2004, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: 90.500000
 17314/100000: episode: 502, duration: 0.009s, episode steps:  18, steps per second: 1955, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 17400/100000: episode: 503, duration: 0.039s, episode steps:  86, steps per second: 2211, episode reward: 86.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.523 [0.000, 1.000],  mean_best_reward: --
 17442/100000: episode: 504, duration: 0.019s, episode steps:  42, steps per second: 2195, episode reward: 42.000, mean reward:  1.000 [ 1.000,  1.000],

 19178/100000: episode: 547, duration: 0.022s, episode steps:  45, steps per second: 2055, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.489 [0.000, 1.000],  mean_best_reward: --
 19262/100000: episode: 548, duration: 0.038s, episode steps:  84, steps per second: 2217, episode reward: 84.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.560 [0.000, 1.000],  mean_best_reward: --
 19353/100000: episode: 549, duration: 0.042s, episode steps:  91, steps per second: 2166, episode reward: 91.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.538 [0.000, 1.000],  mean_best_reward: --
 19370/100000: episode: 550, duration: 0.008s, episode steps:  17, steps per second: 2042, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 19385/100000: episode: 551, duration: 0.008s, episode steps:  15, steps per second: 1808, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 21018/100000: episode: 594, duration: 0.035s, episode steps:  77, steps per second: 2178, episode reward: 77.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.506 [0.000, 1.000],  mean_best_reward: --
 21068/100000: episode: 595, duration: 0.022s, episode steps:  50, steps per second: 2270, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 21090/100000: episode: 596, duration: 0.011s, episode steps:  22, steps per second: 2081, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.545 [0.000, 1.000],  mean_best_reward: --
 21134/100000: episode: 597, duration: 0.020s, episode steps:  44, steps per second: 2206, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.477 [0.000, 1.000],  mean_best_reward: --
 21227/100000: episode: 598, duration: 0.041s, episode steps:  93, steps per second: 2286, episode reward: 93.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 22778/100000: episode: 637, duration: 0.032s, episode steps:  69, steps per second: 2149, episode reward: 69.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.565 [0.000, 1.000],  mean_best_reward: --
 22812/100000: episode: 638, duration: 0.017s, episode steps:  34, steps per second: 1960, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 22858/100000: episode: 639, duration: 0.021s, episode steps:  46, steps per second: 2152, episode reward: 46.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 22906/100000: episode: 640, duration: 0.022s, episode steps:  48, steps per second: 2212, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.479 [0.000, 1.000],  mean_best_reward: --
 22943/100000: episode: 641, duration: 0.018s, episode steps:  37, steps per second: 2082, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 24558/100000: episode: 679, duration: 0.017s, episode steps:  34, steps per second: 2005, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 24578/100000: episode: 680, duration: 0.010s, episode steps:  20, steps per second: 1931, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.550 [0.000, 1.000],  mean_best_reward: --
 24632/100000: episode: 681, duration: 0.027s, episode steps:  54, steps per second: 2036, episode reward: 54.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.463 [0.000, 1.000],  mean_best_reward: --
 24645/100000: episode: 682, duration: 0.007s, episode steps:  13, steps per second: 1783, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.462 [0.000, 1.000],  mean_best_reward: --
 24699/100000: episode: 683, duration: 0.025s, episode steps:  54, steps per second: 2137, episode reward: 54.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 26353/100000: episode: 730, duration: 0.010s, episode steps:  19, steps per second: 1893, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.421 [0.000, 1.000],  mean_best_reward: --
 26380/100000: episode: 731, duration: 0.014s, episode steps:  27, steps per second: 1932, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 26410/100000: episode: 732, duration: 0.014s, episode steps:  30, steps per second: 2074, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 26430/100000: episode: 733, duration: 0.010s, episode steps:  20, steps per second: 1978, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.550 [0.000, 1.000],  mean_best_reward: --
 26446/100000: episode: 734, duration: 0.009s, episode steps:  16, steps per second: 1831, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 28100/100000: episode: 775, duration: 0.029s, episode steps:  59, steps per second: 2009, episode reward: 59.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.458 [0.000, 1.000],  mean_best_reward: --
 28175/100000: episode: 776, duration: 0.035s, episode steps:  75, steps per second: 2159, episode reward: 75.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 28206/100000: episode: 777, duration: 0.015s, episode steps:  31, steps per second: 2086, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.548 [0.000, 1.000],  mean_best_reward: --
 28232/100000: episode: 778, duration: 0.012s, episode steps:  26, steps per second: 2116, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.462 [0.000, 1.000],  mean_best_reward: --
 28271/100000: episode: 779, duration: 0.020s, episode steps:  39, steps per second: 1996, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 29919/100000: episode: 821, duration: 0.060s, episode steps: 115, steps per second: 1908, episode reward: 115.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.504 [0.000, 1.000],  mean_best_reward: --
 29994/100000: episode: 822, duration: 0.035s, episode steps:  75, steps per second: 2136, episode reward: 75.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.493 [0.000, 1.000],  mean_best_reward: --
 30023/100000: episode: 823, duration: 0.015s, episode steps:  29, steps per second: 1888, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.517 [0.000, 1.000],  mean_best_reward: --
 30066/100000: episode: 824, duration: 0.022s, episode steps:  43, steps per second: 1920, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.465 [0.000, 1.000],  mean_best_reward: --
 30208/100000: episode: 825, duration: 0.065s, episode steps: 142, steps per second: 2200, episode reward: 142.000, mean reward:  1.000 [ 1.000,  1.000], mean 

 31713/100000: episode: 865, duration: 0.045s, episode steps:  97, steps per second: 2147, episode reward: 97.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 31738/100000: episode: 866, duration: 0.012s, episode steps:  25, steps per second: 2113, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 31824/100000: episode: 867, duration: 0.039s, episode steps:  86, steps per second: 2202, episode reward: 86.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.535 [0.000, 1.000],  mean_best_reward: --
 31840/100000: episode: 868, duration: 0.008s, episode steps:  16, steps per second: 1992, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.375 [0.000, 1.000],  mean_best_reward: --
 31856/100000: episode: 869, duration: 0.008s, episode steps:  16, steps per second: 1989, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 33487/100000: episode: 911, duration: 0.020s, episode steps:  41, steps per second: 2010, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.537 [0.000, 1.000],  mean_best_reward: --
 33519/100000: episode: 912, duration: 0.015s, episode steps:  32, steps per second: 2121, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.531 [0.000, 1.000],  mean_best_reward: --
 33607/100000: episode: 913, duration: 0.040s, episode steps:  88, steps per second: 2183, episode reward: 88.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.489 [0.000, 1.000],  mean_best_reward: --
 33626/100000: episode: 914, duration: 0.009s, episode steps:  19, steps per second: 2055, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 33650/100000: episode: 915, duration: 0.011s, episode steps:  24, steps per second: 2111, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 35226/100000: episode: 963, duration: 0.013s, episode steps:  26, steps per second: 2043, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.462 [0.000, 1.000],  mean_best_reward: --
 35265/100000: episode: 964, duration: 0.019s, episode steps:  39, steps per second: 2009, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.487 [0.000, 1.000],  mean_best_reward: --
 35294/100000: episode: 965, duration: 0.014s, episode steps:  29, steps per second: 2104, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.483 [0.000, 1.000],  mean_best_reward: --
 35325/100000: episode: 966, duration: 0.014s, episode steps:  31, steps per second: 2164, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.484 [0.000, 1.000],  mean_best_reward: --
 35353/100000: episode: 967, duration: 0.013s, episode steps:  28, steps per second: 2140, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean ac

 37018/100000: episode: 1011, duration: 0.013s, episode steps:  25, steps per second: 1962, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.440 [0.000, 1.000],  mean_best_reward: --
 37067/100000: episode: 1012, duration: 0.023s, episode steps:  49, steps per second: 2107, episode reward: 49.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.449 [0.000, 1.000],  mean_best_reward: --
 37084/100000: episode: 1013, duration: 0.008s, episode steps:  17, steps per second: 2034, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.647 [0.000, 1.000],  mean_best_reward: --
 37103/100000: episode: 1014, duration: 0.009s, episode steps:  19, steps per second: 2068, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.579 [0.000, 1.000],  mean_best_reward: --
 37156/100000: episode: 1015, duration: 0.024s, episode steps:  53, steps per second: 2232, episode reward: 53.000, mean reward:  1.000 [ 1.000,  1.000], me

 38464/100000: episode: 1052, duration: 0.014s, episode steps:  28, steps per second: 1941, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.429 [0.000, 1.000],  mean_best_reward: --
 38527/100000: episode: 1053, duration: 0.029s, episode steps:  63, steps per second: 2202, episode reward: 63.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.492 [0.000, 1.000],  mean_best_reward: --
 38550/100000: episode: 1054, duration: 0.012s, episode steps:  23, steps per second: 1970, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 38616/100000: episode: 1055, duration: 0.030s, episode steps:  66, steps per second: 2219, episode reward: 66.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.545 [0.000, 1.000],  mean_best_reward: --
 38651/100000: episode: 1056, duration: 0.016s, episode steps:  35, steps per second: 2184, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], me

 40207/100000: episode: 1102, duration: 0.010s, episode steps:  18, steps per second: 1828, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 40233/100000: episode: 1103, duration: 0.014s, episode steps:  26, steps per second: 1913, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 40339/100000: episode: 1104, duration: 0.047s, episode steps: 106, steps per second: 2265, episode reward: 106.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 40357/100000: episode: 1105, duration: 0.009s, episode steps:  18, steps per second: 1947, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.611 [0.000, 1.000],  mean_best_reward: --
 40418/100000: episode: 1106, duration: 0.028s, episode steps:  61, steps per second: 2153, episode reward: 61.000, mean reward:  1.000 [ 1.000,  1.000], m

 42053/100000: episode: 1144, duration: 0.021s, episode steps:  43, steps per second: 2093, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.512 [0.000, 1.000],  mean_best_reward: --
 42069/100000: episode: 1145, duration: 0.009s, episode steps:  16, steps per second: 1811, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 42086/100000: episode: 1146, duration: 0.009s, episode steps:  17, steps per second: 1968, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.412 [0.000, 1.000],  mean_best_reward: --
 42142/100000: episode: 1147, duration: 0.026s, episode steps:  56, steps per second: 2157, episode reward: 56.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.518 [0.000, 1.000],  mean_best_reward: --
 42243/100000: episode: 1148, duration: 0.047s, episode steps: 101, steps per second: 2160, episode reward: 101.000, mean reward:  1.000 [ 1.000,  1.000], m

 44150/100000: episode: 1193, duration: 0.014s, episode steps:  27, steps per second: 1973, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 44211/100000: episode: 1194, duration: 0.029s, episode steps:  61, steps per second: 2109, episode reward: 61.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.508 [0.000, 1.000],  mean_best_reward: --
 44225/100000: episode: 1195, duration: 0.007s, episode steps:  14, steps per second: 1891, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.429 [0.000, 1.000],  mean_best_reward: --
 44243/100000: episode: 1196, duration: 0.009s, episode steps:  18, steps per second: 2006, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 44269/100000: episode: 1197, duration: 0.012s, episode steps:  26, steps per second: 2128, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], me

 45910/100000: episode: 1241, duration: 0.006s, episode steps:  10, steps per second: 1811, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.800 [0.000, 1.000],  mean_best_reward: --
 45935/100000: episode: 1242, duration: 0.014s, episode steps:  25, steps per second: 1819, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.440 [0.000, 1.000],  mean_best_reward: --
 45954/100000: episode: 1243, duration: 0.010s, episode steps:  19, steps per second: 1990, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 45968/100000: episode: 1244, duration: 0.007s, episode steps:  14, steps per second: 1945, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  mean_best_reward: --
 46002/100000: episode: 1245, duration: 0.016s, episode steps:  34, steps per second: 2109, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], me

 47188/100000: episode: 1280, duration: 0.020s, episode steps:  41, steps per second: 2043, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.463 [0.000, 1.000],  mean_best_reward: --
 47208/100000: episode: 1281, duration: 0.012s, episode steps:  20, steps per second: 1706, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.550 [0.000, 1.000],  mean_best_reward: --
 47225/100000: episode: 1282, duration: 0.009s, episode steps:  17, steps per second: 1979, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 47270/100000: episode: 1283, duration: 0.021s, episode steps:  45, steps per second: 2111, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.489 [0.000, 1.000],  mean_best_reward: --
 47302/100000: episode: 1284, duration: 0.015s, episode steps:  32, steps per second: 2108, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], me

 49038/100000: episode: 1332, duration: 0.019s, episode steps:  36, steps per second: 1926, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.528 [0.000, 1.000],  mean_best_reward: --
 49061/100000: episode: 1333, duration: 0.011s, episode steps:  23, steps per second: 2051, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.565 [0.000, 1.000],  mean_best_reward: --
 49076/100000: episode: 1334, duration: 0.008s, episode steps:  15, steps per second: 1945, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.600 [0.000, 1.000],  mean_best_reward: --
 49100/100000: episode: 1335, duration: 0.011s, episode steps:  24, steps per second: 2113, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.458 [0.000, 1.000],  mean_best_reward: --
 49156/100000: episode: 1336, duration: 0.025s, episode steps:  56, steps per second: 2241, episode reward: 56.000, mean reward:  1.000 [ 1.000,  1.000], me

 50729/100000: episode: 1376, duration: 0.032s, episode steps:  68, steps per second: 2098, episode reward: 68.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.426 [0.000, 1.000],  mean_best_reward: --
 50774/100000: episode: 1377, duration: 0.022s, episode steps:  45, steps per second: 2028, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
 50808/100000: episode: 1378, duration: 0.016s, episode steps:  34, steps per second: 2083, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 50828/100000: episode: 1379, duration: 0.010s, episode steps:  20, steps per second: 2016, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 50838/100000: episode: 1380, duration: 0.006s, episode steps:  10, steps per second: 1810, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], me

 52463/100000: episode: 1417, duration: 0.019s, episode steps:  39, steps per second: 2042, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.513 [0.000, 1.000],  mean_best_reward: --
 52538/100000: episode: 1418, duration: 0.040s, episode steps:  75, steps per second: 1892, episode reward: 75.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.453 [0.000, 1.000],  mean_best_reward: --
 52564/100000: episode: 1419, duration: 0.018s, episode steps:  26, steps per second: 1440, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 52589/100000: episode: 1420, duration: 0.014s, episode steps:  25, steps per second: 1747, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: --
 52621/100000: episode: 1421, duration: 0.017s, episode steps:  32, steps per second: 1918, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], me

 54053/100000: episode: 1462, duration: 0.014s, episode steps:  27, steps per second: 1871, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.593 [0.000, 1.000],  mean_best_reward: --
 54065/100000: episode: 1463, duration: 0.009s, episode steps:  12, steps per second: 1372, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.750 [0.000, 1.000],  mean_best_reward: --
 54105/100000: episode: 1464, duration: 0.019s, episode steps:  40, steps per second: 2090, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.475 [0.000, 1.000],  mean_best_reward: --
 54121/100000: episode: 1465, duration: 0.008s, episode steps:  16, steps per second: 2085, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.562 [0.000, 1.000],  mean_best_reward: --
 54135/100000: episode: 1466, duration: 0.008s, episode steps:  14, steps per second: 1831, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], me

 55691/100000: episode: 1511, duration: 0.009s, episode steps:  16, steps per second: 1798, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.438 [0.000, 1.000],  mean_best_reward: --
 55715/100000: episode: 1512, duration: 0.012s, episode steps:  24, steps per second: 2084, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 55734/100000: episode: 1513, duration: 0.009s, episode steps:  19, steps per second: 2069, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.474 [0.000, 1.000],  mean_best_reward: --
 55833/100000: episode: 1514, duration: 0.044s, episode steps:  99, steps per second: 2252, episode reward: 99.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.475 [0.000, 1.000],  mean_best_reward: --
 55849/100000: episode: 1515, duration: 0.008s, episode steps:  16, steps per second: 2005, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], me

 57542/100000: episode: 1563, duration: 0.019s, episode steps:  38, steps per second: 2009, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 57565/100000: episode: 1564, duration: 0.012s, episode steps:  23, steps per second: 1988, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 57623/100000: episode: 1565, duration: 0.027s, episode steps:  58, steps per second: 2181, episode reward: 58.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.466 [0.000, 1.000],  mean_best_reward: --
 57644/100000: episode: 1566, duration: 0.011s, episode steps:  21, steps per second: 1941, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.524 [0.000, 1.000],  mean_best_reward: --
 57709/100000: episode: 1567, duration: 0.029s, episode steps:  65, steps per second: 2264, episode reward: 65.000, mean reward:  1.000 [ 1.000,  1.000], me

 59498/100000: episode: 1609, duration: 0.030s, episode steps:  65, steps per second: 2151, episode reward: 65.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.492 [0.000, 1.000],  mean_best_reward: --
 59545/100000: episode: 1610, duration: 0.022s, episode steps:  47, steps per second: 2121, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.447 [0.000, 1.000],  mean_best_reward: --
 59572/100000: episode: 1611, duration: 0.013s, episode steps:  27, steps per second: 2058, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 59599/100000: episode: 1612, duration: 0.013s, episode steps:  27, steps per second: 2090, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 59622/100000: episode: 1613, duration: 0.011s, episode steps:  23, steps per second: 2084, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], me

 61320/100000: episode: 1654, duration: 0.024s, episode steps:  49, steps per second: 2003, episode reward: 49.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.490 [0.000, 1.000],  mean_best_reward: --
 61333/100000: episode: 1655, duration: 0.007s, episode steps:  13, steps per second: 1896, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.615 [0.000, 1.000],  mean_best_reward: --
 61357/100000: episode: 1656, duration: 0.011s, episode steps:  24, steps per second: 2108, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.542 [0.000, 1.000],  mean_best_reward: --
 61394/100000: episode: 1657, duration: 0.017s, episode steps:  37, steps per second: 2200, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.514 [0.000, 1.000],  mean_best_reward: --
 61420/100000: episode: 1658, duration: 0.013s, episode steps:  26, steps per second: 2020, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], me

 63090/100000: episode: 1699, duration: 0.009s, episode steps:  16, steps per second: 1869, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.438 [0.000, 1.000],  mean_best_reward: --
 63106/100000: episode: 1700, duration: 0.008s, episode steps:  16, steps per second: 2022, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 63156/100000: episode: 1701, duration: 0.023s, episode steps:  50, steps per second: 2215, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.480 [0.000, 1.000],  mean_best_reward: 121.000000
 63186/100000: episode: 1702, duration: 0.014s, episode steps:  30, steps per second: 2181, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 63211/100000: episode: 1703, duration: 0.012s, episode steps:  25, steps per second: 2137, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.

 64804/100000: episode: 1746, duration: 0.018s, episode steps:  35, steps per second: 1908, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  mean_best_reward: --
 64837/100000: episode: 1747, duration: 0.017s, episode steps:  33, steps per second: 1991, episode reward: 33.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.485 [0.000, 1.000],  mean_best_reward: --
 64882/100000: episode: 1748, duration: 0.022s, episode steps:  45, steps per second: 2055, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.578 [0.000, 1.000],  mean_best_reward: --
 64907/100000: episode: 1749, duration: 0.013s, episode steps:  25, steps per second: 1860, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 64922/100000: episode: 1750, duration: 0.008s, episode steps:  15, steps per second: 1868, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], me

 66128/100000: episode: 1787, duration: 0.024s, episode steps:  52, steps per second: 2142, episode reward: 52.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.577 [0.000, 1.000],  mean_best_reward: --
 66155/100000: episode: 1788, duration: 0.013s, episode steps:  27, steps per second: 2108, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.407 [0.000, 1.000],  mean_best_reward: --
 66335/100000: episode: 1789, duration: 0.079s, episode steps: 180, steps per second: 2267, episode reward: 180.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 66396/100000: episode: 1790, duration: 0.027s, episode steps:  61, steps per second: 2242, episode reward: 61.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.508 [0.000, 1.000],  mean_best_reward: --
 66426/100000: episode: 1791, duration: 0.014s, episode steps:  30, steps per second: 2164, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], m

 67924/100000: episode: 1833, duration: 0.023s, episode steps:  47, steps per second: 2073, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.468 [0.000, 1.000],  mean_best_reward: --
 67956/100000: episode: 1834, duration: 0.015s, episode steps:  32, steps per second: 2146, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.531 [0.000, 1.000],  mean_best_reward: --
 68050/100000: episode: 1835, duration: 0.041s, episode steps:  94, steps per second: 2275, episode reward: 94.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.511 [0.000, 1.000],  mean_best_reward: --
 68184/100000: episode: 1836, duration: 0.060s, episode steps: 134, steps per second: 2239, episode reward: 134.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 68200/100000: episode: 1837, duration: 0.008s, episode steps:  16, steps per second: 2015, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], m

 69727/100000: episode: 1876, duration: 0.017s, episode steps:  34, steps per second: 2058, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 69778/100000: episode: 1877, duration: 0.023s, episode steps:  51, steps per second: 2245, episode reward: 51.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.529 [0.000, 1.000],  mean_best_reward: --
 69840/100000: episode: 1878, duration: 0.029s, episode steps:  62, steps per second: 2116, episode reward: 62.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.484 [0.000, 1.000],  mean_best_reward: --
 69928/100000: episode: 1879, duration: 0.039s, episode steps:  88, steps per second: 2278, episode reward: 88.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.466 [0.000, 1.000],  mean_best_reward: --
 69948/100000: episode: 1880, duration: 0.010s, episode steps:  20, steps per second: 1911, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], me

 71680/100000: episode: 1927, duration: 0.039s, episode steps:  88, steps per second: 2228, episode reward: 88.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.523 [0.000, 1.000],  mean_best_reward: --
 71727/100000: episode: 1928, duration: 0.022s, episode steps:  47, steps per second: 2122, episode reward: 47.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.532 [0.000, 1.000],  mean_best_reward: --
 71761/100000: episode: 1929, duration: 0.017s, episode steps:  34, steps per second: 1997, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.441 [0.000, 1.000],  mean_best_reward: --
 71780/100000: episode: 1930, duration: 0.010s, episode steps:  19, steps per second: 1939, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.579 [0.000, 1.000],  mean_best_reward: --
 71828/100000: episode: 1931, duration: 0.022s, episode steps:  48, steps per second: 2166, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], me

 73532/100000: episode: 1976, duration: 0.029s, episode steps:  64, steps per second: 2192, episode reward: 64.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.516 [0.000, 1.000],  mean_best_reward: --
 73572/100000: episode: 1977, duration: 0.018s, episode steps:  40, steps per second: 2177, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.525 [0.000, 1.000],  mean_best_reward: --
 73613/100000: episode: 1978, duration: 0.019s, episode steps:  41, steps per second: 2185, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.537 [0.000, 1.000],  mean_best_reward: --
 73631/100000: episode: 1979, duration: 0.009s, episode steps:  18, steps per second: 2063, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 73644/100000: episode: 1980, duration: 0.007s, episode steps:  13, steps per second: 1981, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], me

 75403/100000: episode: 2023, duration: 0.017s, episode steps:  36, steps per second: 2154, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  mean_best_reward: --
 75425/100000: episode: 2024, duration: 0.011s, episode steps:  22, steps per second: 1954, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.545 [0.000, 1.000],  mean_best_reward: --
 75465/100000: episode: 2025, duration: 0.018s, episode steps:  40, steps per second: 2171, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 75499/100000: episode: 2026, duration: 0.016s, episode steps:  34, steps per second: 2167, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.559 [0.000, 1.000],  mean_best_reward: --
 75530/100000: episode: 2027, duration: 0.015s, episode steps:  31, steps per second: 2109, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], me

 77296/100000: episode: 2069, duration: 0.024s, episode steps:  52, steps per second: 2168, episode reward: 52.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.481 [0.000, 1.000],  mean_best_reward: --
 77311/100000: episode: 2070, duration: 0.008s, episode steps:  15, steps per second: 1925, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 77352/100000: episode: 2071, duration: 0.020s, episode steps:  41, steps per second: 2083, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.537 [0.000, 1.000],  mean_best_reward: --
 77381/100000: episode: 2072, duration: 0.014s, episode steps:  29, steps per second: 2072, episode reward: 29.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.517 [0.000, 1.000],  mean_best_reward: --
 77456/100000: episode: 2073, duration: 0.033s, episode steps:  75, steps per second: 2244, episode reward: 75.000, mean reward:  1.000 [ 1.000,  1.000], me

 79214/100000: episode: 2113, duration: 0.048s, episode steps: 108, steps per second: 2233, episode reward: 108.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.528 [0.000, 1.000],  mean_best_reward: --
 79305/100000: episode: 2114, duration: 0.040s, episode steps:  91, steps per second: 2253, episode reward: 91.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.495 [0.000, 1.000],  mean_best_reward: --
 79356/100000: episode: 2115, duration: 0.024s, episode steps:  51, steps per second: 2153, episode reward: 51.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.510 [0.000, 1.000],  mean_best_reward: --
 79421/100000: episode: 2116, duration: 0.030s, episode steps:  65, steps per second: 2199, episode reward: 65.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.492 [0.000, 1.000],  mean_best_reward: --
 79501/100000: episode: 2117, duration: 0.036s, episode steps:  80, steps per second: 2207, episode reward: 80.000, mean reward:  1.000 [ 1.000,  1.000], m

 81009/100000: episode: 2155, duration: 0.009s, episode steps:  18, steps per second: 1904, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 81025/100000: episode: 2156, duration: 0.008s, episode steps:  16, steps per second: 1987, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 81061/100000: episode: 2157, duration: 0.017s, episode steps:  36, steps per second: 2089, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.556 [0.000, 1.000],  mean_best_reward: --
 81083/100000: episode: 2158, duration: 0.010s, episode steps:  22, steps per second: 2108, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.545 [0.000, 1.000],  mean_best_reward: --
 81131/100000: episode: 2159, duration: 0.022s, episode steps:  48, steps per second: 2142, episode reward: 48.000, mean reward:  1.000 [ 1.000,  1.000], me

 82349/100000: episode: 2197, duration: 0.011s, episode steps:  21, steps per second: 1939, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  mean_best_reward: --
 82427/100000: episode: 2198, duration: 0.035s, episode steps:  78, steps per second: 2228, episode reward: 78.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.487 [0.000, 1.000],  mean_best_reward: --
 82496/100000: episode: 2199, duration: 0.031s, episode steps:  69, steps per second: 2229, episode reward: 69.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.478 [0.000, 1.000],  mean_best_reward: --
 82536/100000: episode: 2200, duration: 0.018s, episode steps:  40, steps per second: 2183, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 82590/100000: episode: 2201, duration: 0.025s, episode steps:  54, steps per second: 2201, episode reward: 54.000, mean reward:  1.000 [ 1.000,  1.000], me

 83619/100000: episode: 2239, duration: 0.050s, episode steps:  50, steps per second: 992, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 83674/100000: episode: 2240, duration: 0.040s, episode steps:  55, steps per second: 1365, episode reward: 55.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.473 [0.000, 1.000],  mean_best_reward: --
 83710/100000: episode: 2241, duration: 0.020s, episode steps:  36, steps per second: 1792, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.528 [0.000, 1.000],  mean_best_reward: --
 83760/100000: episode: 2242, duration: 0.027s, episode steps:  50, steps per second: 1822, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.540 [0.000, 1.000],  mean_best_reward: --
 83791/100000: episode: 2243, duration: 0.020s, episode steps:  31, steps per second: 1588, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mea

 85359/100000: episode: 2291, duration: 0.019s, episode steps:  41, steps per second: 2123, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.512 [0.000, 1.000],  mean_best_reward: --
 85398/100000: episode: 2292, duration: 0.018s, episode steps:  39, steps per second: 2186, episode reward: 39.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.513 [0.000, 1.000],  mean_best_reward: --
 85407/100000: episode: 2293, duration: 0.005s, episode steps:   9, steps per second: 1834, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.667 [0.000, 1.000],  mean_best_reward: --
 85430/100000: episode: 2294, duration: 0.011s, episode steps:  23, steps per second: 2115, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 85445/100000: episode: 2295, duration: 0.008s, episode steps:  15, steps per second: 1977, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], me

 87078/100000: episode: 2341, duration: 0.015s, episode steps:  30, steps per second: 1982, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.467 [0.000, 1.000],  mean_best_reward: --
 87154/100000: episode: 2342, duration: 0.036s, episode steps:  76, steps per second: 2138, episode reward: 76.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 87186/100000: episode: 2343, duration: 0.016s, episode steps:  32, steps per second: 2009, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.469 [0.000, 1.000],  mean_best_reward: --
 87212/100000: episode: 2344, duration: 0.012s, episode steps:  26, steps per second: 2122, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.462 [0.000, 1.000],  mean_best_reward: --
 87246/100000: episode: 2345, duration: 0.016s, episode steps:  34, steps per second: 2184, episode reward: 34.000, mean reward:  1.000 [ 1.000,  1.000], me

 88766/100000: episode: 2388, duration: 0.013s, episode steps:  24, steps per second: 1895, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.583 [0.000, 1.000],  mean_best_reward: --
 88802/100000: episode: 2389, duration: 0.018s, episode steps:  36, steps per second: 1971, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.528 [0.000, 1.000],  mean_best_reward: --
 88815/100000: episode: 2390, duration: 0.007s, episode steps:  13, steps per second: 1960, episode reward: 13.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.692 [0.000, 1.000],  mean_best_reward: --
 88855/100000: episode: 2391, duration: 0.019s, episode steps:  40, steps per second: 2149, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.525 [0.000, 1.000],  mean_best_reward: --
 88899/100000: episode: 2392, duration: 0.020s, episode steps:  44, steps per second: 2173, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], me

 90573/100000: episode: 2431, duration: 0.013s, episode steps:  25, steps per second: 1995, episode reward: 25.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 90593/100000: episode: 2432, duration: 0.010s, episode steps:  20, steps per second: 2083, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.450 [0.000, 1.000],  mean_best_reward: --
 90624/100000: episode: 2433, duration: 0.015s, episode steps:  31, steps per second: 2042, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.452 [0.000, 1.000],  mean_best_reward: --
 90662/100000: episode: 2434, duration: 0.018s, episode steps:  38, steps per second: 2102, episode reward: 38.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 90685/100000: episode: 2435, duration: 0.011s, episode steps:  23, steps per second: 2102, episode reward: 23.000, mean reward:  1.000 [ 1.000,  1.000], me

 92340/100000: episode: 2477, duration: 0.009s, episode steps:  15, steps per second: 1714, episode reward: 15.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  mean_best_reward: --
 92368/100000: episode: 2478, duration: 0.013s, episode steps:  28, steps per second: 2134, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.429 [0.000, 1.000],  mean_best_reward: --
 92384/100000: episode: 2479, duration: 0.008s, episode steps:  16, steps per second: 2001, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.562 [0.000, 1.000],  mean_best_reward: --
 92415/100000: episode: 2480, duration: 0.014s, episode steps:  31, steps per second: 2149, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.484 [0.000, 1.000],  mean_best_reward: --
 92439/100000: episode: 2481, duration: 0.012s, episode steps:  24, steps per second: 1987, episode reward: 24.000, mean reward:  1.000 [ 1.000,  1.000], me

 94163/100000: episode: 2516, duration: 0.034s, episode steps:  71, steps per second: 2115, episode reward: 71.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.563 [0.000, 1.000],  mean_best_reward: --
 94294/100000: episode: 2517, duration: 0.059s, episode steps: 131, steps per second: 2237, episode reward: 131.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.519 [0.000, 1.000],  mean_best_reward: --
 94334/100000: episode: 2518, duration: 0.018s, episode steps:  40, steps per second: 2188, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 94418/100000: episode: 2519, duration: 0.040s, episode steps:  84, steps per second: 2117, episode reward: 84.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.476 [0.000, 1.000],  mean_best_reward: --
 94453/100000: episode: 2520, duration: 0.017s, episode steps:  35, steps per second: 2008, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], m

 96096/100000: episode: 2563, duration: 0.031s, episode steps:  63, steps per second: 2052, episode reward: 63.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.524 [0.000, 1.000],  mean_best_reward: --
 96115/100000: episode: 2564, duration: 0.009s, episode steps:  19, steps per second: 2051, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  mean_best_reward: --
 96137/100000: episode: 2565, duration: 0.011s, episode steps:  22, steps per second: 2003, episode reward: 22.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.591 [0.000, 1.000],  mean_best_reward: --
 96169/100000: episode: 2566, duration: 0.015s, episode steps:  32, steps per second: 2171, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.562 [0.000, 1.000],  mean_best_reward: --
 96212/100000: episode: 2567, duration: 0.021s, episode steps:  43, steps per second: 2096, episode reward: 43.000, mean reward:  1.000 [ 1.000,  1.000], me

 97750/100000: episode: 2608, duration: 0.029s, episode steps:  44, steps per second: 1510, episode reward: 44.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.545 [0.000, 1.000],  mean_best_reward: --
 97781/100000: episode: 2609, duration: 0.018s, episode steps:  31, steps per second: 1768, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.484 [0.000, 1.000],  mean_best_reward: --
 97896/100000: episode: 2610, duration: 0.088s, episode steps: 115, steps per second: 1314, episode reward: 115.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.513 [0.000, 1.000],  mean_best_reward: --
 97924/100000: episode: 2611, duration: 0.018s, episode steps:  28, steps per second: 1562, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.464 [0.000, 1.000],  mean_best_reward: --
 97941/100000: episode: 2612, duration: 0.014s, episode steps:  17, steps per second: 1227, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], m

 99339/100000: episode: 2650, duration: 0.053s, episode steps: 102, steps per second: 1921, episode reward: 102.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.520 [0.000, 1.000],  mean_best_reward: --
 99430/100000: episode: 2651, duration: 0.041s, episode steps:  91, steps per second: 2208, episode reward: 91.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.516 [0.000, 1.000],  mean_best_reward: 108.500000
 99458/100000: episode: 2652, duration: 0.013s, episode steps:  28, steps per second: 2121, episode reward: 28.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.536 [0.000, 1.000],  mean_best_reward: --
 99508/100000: episode: 2653, duration: 0.023s, episode steps:  50, steps per second: 2205, episode reward: 50.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  mean_best_reward: --
 99570/100000: episode: 2654, duration: 0.028s, episode steps:  62, steps per second: 2206, episode reward: 62.000, mean reward:  1.000 [ 1.000,  1

<tensorflow.python.keras.callbacks.History at 0x12bf01278>
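
The per-episode log above is easier to judge in aggregate than line by line. The following is a minimal sketch, assuming the line format shown in the output above, that extracts the episode number, step count, and episode reward from each line and reports simple summary statistics. The names `parse_episode_line` and `summarize` are hypothetical helpers written for this illustration; they are not part of any library. Truncated line endings are ignored, because only the leading fields are parsed.

```python
import re
from statistics import mean

# Matches the leading fields of one per-episode log line, in the format shown above.
# Anything after the episode reward (including a truncated tail) is ignored.
EPISODE_RE = re.compile(
    r"^\s*(\d+)/(\d+): episode: (\d+), "
    r".*?episode steps:\s*(\d+), "
    r".*?episode reward:\s*(-?\d+(?:\.\d+)?)"
)

def parse_episode_line(line):
    """Return a dict of parsed fields, or None if the line is not an episode line."""
    m = EPISODE_RE.match(line)
    if m is None:
        return None
    step, total_steps, episode, episode_steps, reward = m.groups()
    return {
        "step": int(step),
        "total_steps": int(total_steps),
        "episode": int(episode),
        "episode_steps": int(episode_steps),
        "reward": float(reward),
    }

def summarize(log_text):
    """Summarize every parseable episode line in a block of log text."""
    rows = [r for r in map(parse_episode_line, log_text.splitlines()) if r]
    rewards = [r["reward"] for r in rows]
    return {
        "episodes": len(rows),
        "mean_reward": mean(rewards) if rewards else None,
        "max_reward": max(rewards) if rewards else None,
    }

# Three lines copied from the output above (the last one is truncated in the source).
sample = """\
 66155/100000: episode: 1788, duration: 0.013s, episode steps:  27, steps per second: 2108, episode reward: 27.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.407 [0.000, 1.000],  mean_best_reward: --
 66335/100000: episode: 1789, duration: 0.079s, episode steps: 180, steps per second: 2267, episode reward: 180.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.522 [0.000, 1.000],  mean_best_reward: --
 66426/100000: episode: 1791, duration: 0.014s, episode steps:  30, steps per second: 2164, episode reward: 30.000, mean reward:  1.000 [ 1.000,  1.000], m
"""

summary = summarize(sample)
print(summary)
```

To apply this to a full run, capture the training output as text (for example, by redirecting standard output to a file) and pass its contents to `summarize`. Lines that do not match the episode format, such as the `History` object representation, are skipped.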