# Question 5

For question 5, the first step was to determine how many training steps each method needed in order to yield roughly the same printed training time. To do this, we used the script below, manually adjusting the number of training iterations for each technique and rerunning the script several times. In the end, we set the training time of every technique to about 10.5 seconds, since this was the point at which the best techniques began to saturate at 100% of the reward in the tests (200/200 in all 5 test episodes). For DQN, this corresponded to a training run of 3200 steps.
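The manual calibration described above can be approximated with a small helper. The sketch below is not part of the original script: `timed` and `steps_for_budget` are hypothetical names, and the estimate assumes that throughput (steps per second) stays roughly constant during training. That is only approximately true, so the result is best treated as a first guess to refine by rerunning.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def steps_for_budget(steps_run, seconds_taken, budget_seconds):
    """Estimate how many steps fit in budget_seconds at the measured rate."""
    if seconds_taken <= 0:
        raise ValueError("seconds_taken must be positive")
    steps_per_second = steps_run / seconds_taken
    return int(round(steps_per_second * budget_seconds))


# Example with made-up measurements: a pilot run of 1000 steps took 2.0 s,
# so roughly 5250 steps should fit in a 10.5 s budget.
print(steps_for_budget(1000, 2.0, 10.5))
```

With the agents from the script, a pilot measurement could be taken with something like `_, seconds = timed(dqn.fit, env, nb_steps=500, visualize=False, verbose=0)` before choosing the final `nb_steps`.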

Still using the script below, we measured the average test performance of each technique. With 5 test episodes worth up to 200 points each, the best-performing technique was Deep Q-Learning (DQN), which averaged 196.8 steps before letting the pendulum fall (in 4 episodes the pendulum never fell, with the count saturating at 200).
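The average is a plain mean over the 5 test episodes. As a minimal sketch, assuming the per-episode rewards are collected into a plain list (for instance, copied from the values printed during `test`), the reported DQN figure can be reproduced as follows. `mean_reward` is a hypothetical helper, and the fifth reward is inferred from the reported mean, not read from a log.

```python
def mean_reward(episode_rewards):
    """Average reward over a list of test episodes."""
    if not episode_rewards:
        raise ValueError("need at least one episode")
    return sum(episode_rewards) / len(episode_rewards)


# Four saturated episodes plus the single value that reproduces the 196.8 mean.
# The 184 is inferred from the reported average (5 * 196.8 - 4 * 200).
dqn_rewards = [200, 200, 200, 200, 184]
print(mean_reward(dqn_rewards))
```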

The second-best technique was the Cross-Entropy Method (CEM), which averaged 192.4 steps after training for 11000 steps (the largest step count in the group). Each of its steps executes quickly, but it needs many steps to improve its performance.

The third-best technique was Duel DQN (dueling DQN), which reached an average of 165.6 after training for only 2900 steps (the smallest step count in the group). This technique does not need as many steps to train, but each step is slow to execute.

The worst result came from State-Action-Reward-State-Action (SARSA), which trained for 4500 steps and averaged 24.2 steps before failing in the tests, dropping the pendulum from the cart much sooner than the other techniques.
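The figures reported for the four techniques can be gathered in one place to make the trade-off between step count and step speed explicit. The sketch below only rearranges numbers already stated in the text. The throughput column is an assumption-laden estimate: it divides each step count by the approximate 10.5-second training time, so it is an average implied by the calibration, not a measurement.

```python
# Training steps and mean test score per technique, as reported in the text.
results = {
    "DQN": {"train_steps": 3200, "mean_score": 196.8},
    "CEM": {"train_steps": 11000, "mean_score": 192.4},
    "Duel DQN": {"train_steps": 2900, "mean_score": 165.6},
    "SARSA": {"train_steps": 4500, "mean_score": 24.2},
}

TRAIN_SECONDS = 10.5  # approximate training time used for every technique
MAX_SCORE = 200       # maximum score per test episode


def summarize(results, train_seconds=TRAIN_SECONDS, max_score=MAX_SCORE):
    """Return rows sorted by mean score, with derived throughput and percentage."""
    rows = []
    for name, r in results.items():
        rows.append({
            "name": name,
            "steps_per_second": r["train_steps"] / train_seconds,
            "percent_of_max": 100.0 * r["mean_score"] / max_score,
        })
    return sorted(rows, key=lambda row: row["percent_of_max"], reverse=True)


for row in summarize(results):
    print(f'{row["name"]:<9} {row["steps_per_second"]:7.0f} steps/s '
          f'{row["percent_of_max"]:6.1f}% of max')
```

Under these assumptions CEM runs at about 1,000 steps per second against fewer than 300 for Duel DQN, which is consistent with the remarks about fast and slow steps.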

Note that when the number of training steps was increased, every technique reached an average performance equal or close to 100%, indicating that all of them can accomplish the task and differ only in the training time they require.
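One way to make this observation systematic is to sweep over increasing training budgets and stop at the first one that reaches a target score. The sketch below is illustrative only: `smallest_sufficient_budget` and `train_and_eval` are hypothetical names, and the score curve used in the example is made up so that the code runs without training anything.

```python
def smallest_sufficient_budget(train_and_eval, budgets, target):
    """Return the first budget whose mean test score reaches target, else None.

    train_and_eval is a hypothetical callable: it takes a number of training
    steps, trains a fresh agent for that long, and returns its mean test score.
    """
    for nb_steps in sorted(budgets):
        if train_and_eval(nb_steps) >= target:
            return nb_steps
    return None


# Stand-in for a real training run: a made-up score curve that grows with
# the budget and saturates at 200. It is NOT measured data.
def fake_train_and_eval(nb_steps):
    return min(200.0, nb_steps / 50.0)


print(smallest_sufficient_budget(fake_train_and_eval, [2000, 5000, 10000, 20000], 200))
```

In a real run, `train_and_eval` would rebuild the model and agent for each budget so that earlier training does not leak into later measurements.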

In [1]:
import time

import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents import SARSAAgent
from rl.agents.cem import CEMAgent
from rl.agents.dqn import DQNAgent
from rl.memory import EpisodeParameterMemory, SequentialMemory
from rl.policy import BoltzmannQPolicy

ENV_NAME = 'CartPole-v0'


# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)

nb_actions = env.action_space.n




# DQN agent: build a very simple model.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

# CEM agent.
obs_dim = env.observation_space.shape[0]  # kept from the original example; not used below

# Option 1: simple model
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(nb_actions))
model.add(Activation('softmax'))

# Option 2: deep network
# model = Sequential()
# model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(16))
# model.add(Activation('relu'))
# model.add(Dense(nb_actions))
# model.add(Activation('softmax'))


print(model.summary())


# Configure and compile the CEM agent. CEM is gradient-free, so compile() is called
# without an optimizer.
memory = EpisodeParameterMemory(limit=1000, window_length=1)

cem = CEMAgent(model=model, nb_actions=nb_actions, memory=memory,
               batch_size=50, nb_steps_warmup=2000, train_interval=50, elite_frac=0.05)
cem.compile()



# SARSA agent: build a very simple model.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

# SARSA does not require a memory.
policy = BoltzmannQPolicy()
sarsa = SARSAAgent(model=model, nb_actions=nb_actions, nb_steps_warmup=10, policy=policy)
sarsa.compile(Adam(lr=1e-3), metrics=['mae'])



# Duel DQN agent: build a very simple model, regardless of the dueling architecture.
# If you enable the dueling network in DQNAgent, it builds a dueling network based on your
# model automatically. Alternatively, you can build a dueling network yourself and turn off
# the dueling option in DQNAgent.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions, activation='linear'))
print(model.summary())

# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()
# enable the dueling network
# you can specify the dueling_type to one of {'avg','max','naive'}
duel = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               enable_dueling_network=True, dueling_type='avg', target_model_update=1e-2, policy=policy)
duel.compile(Adam(lr=1e-3), metrics=['mae'])



# Train each agent in turn, printing a wall-clock timestamp before and after every run so
# that the training time of each technique can be obtained by subtraction.
# We visualize the training here for show, but this slows down training quite a lot.
# You can always safely abort the training prematurely using Ctrl + C.
print("TEMPO: ", time.time())
duel.fit(env, nb_steps=2900, visualize=True, verbose=2)

print("TEMPO: ", time.time())
dqn.fit(env, nb_steps=3200, visualize=True, verbose=2)

print("TEMPO: ", time.time())
cem.fit(env, nb_steps=11000, visualize=True, verbose=2)

print("TEMPO: ", time.time())
sarsa.fit(env, nb_steps=4500, visualize=True, verbose=2)
print("TEMPO: ", time.time())




# After training is done, we save the final weights.
dqn.save_weights('dqn_{}_weights.h5f'.format(ENV_NAME), overwrite=True)


print("\n\nDQN\n\n")
# Finally, evaluate our algorithm for 5 episodes.
dqn.test(env, nb_episodes=5, visualize=True)


# After training is done, we save the best weights.
cem.save_weights('cem_{}_params.h5f'.format(ENV_NAME), overwrite=True)

print("\n\nCEM\n\n")
# Finally, evaluate our algorithm for 5 episodes.
cem.test(env, nb_episodes=5, visualize=True)



# After training is done, we save the final weights.
sarsa.save_weights('sarsa_{}_weights.h5f'.format(ENV_NAME), overwrite=True)

print("\n\nSARSA\n\n")
# Finally, evaluate our algorithm for 5 episodes.
sarsa.test(env, nb_episodes=5, visualize=True)




# After training is done, we save the final weights.
duel.save_weights('duel_dqn_{}_weights.h5f'.format(ENV_NAME), overwrite=True)

print("\n\nDUEL\n\n")
# Finally, evaluate our algorithm for 5 episodes.
duel.test(env, nb_episodes=5, visualize=True)


Using TensorFlow backend.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                80        
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
activation_2 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 16)                272       
_________________________________________________________________
activation_3 (Activation)    (None, 16)                0         
__________



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 10        
_________________________________________________________________
activation_5 (Activation)    (None, 2)                 0         
Total params: 10
Trainable params: 10
Non-trainable params: 0
_________________________________________________________________
None
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_3 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_6 (Dense)              (None, 16)                80        
_________________________________________________________________
activatio



   39/2900: episode: 1, duration: 1.292s, episode steps: 39, steps per second: 30, episode reward: 39.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.487 [0.000, 1.000], mean observation: 0.153 [-0.990, 1.781], loss: 0.410016, mean_absolute_error: 0.561866, mean_q: 0.256616
   60/2900: episode: 2, duration: 0.068s, episode steps: 21, steps per second: 307, episode reward: 21.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.619 [0.000, 1.000], mean observation: -0.009 [-1.990, 1.412], loss: 0.186312, mean_absolute_error: 0.576259, mean_q: 0.599455
   97/2900: episode: 3, duration: 0.124s, episode steps: 37, steps per second: 299, episode reward: 37.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.595 [0.000, 1.000], mean observation: 0.008 [-2.402, 1.795], loss: 0.041041, mean_absolute_error: 0.681984, mean_q: 1.158313
  122/2900: episode: 4, duration: 0.099s, episode steps: 25, steps per second: 253, episode reward: 25.000, mean reward: 1.000 [1.000, 1.000], mean act

  664/2900: episode: 31, duration: 0.132s, episode steps: 36, steps per second: 273, episode reward: 36.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.361 [0.000, 1.000], mean observation: -0.022 [-1.972, 2.740], loss: 0.181011, mean_absolute_error: 2.958665, mean_q: 5.685248
  692/2900: episode: 32, duration: 0.091s, episode steps: 28, steps per second: 307, episode reward: 28.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.571 [0.000, 1.000], mean observation: -0.042 [-1.423, 0.930], loss: 0.140912, mean_absolute_error: 3.107859, mean_q: 6.020526
  719/2900: episode: 33, duration: 0.088s, episode steps: 27, steps per second: 308, episode reward: 27.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.556 [0.000, 1.000], mean observation: -0.026 [-1.669, 1.189], loss: 0.191176, mean_absolute_error: 3.231830, mean_q: 6.267021
  734/2900: episode: 34, duration: 0.055s, episode steps: 15, steps per second: 272, episode reward: 15.000, mean reward: 1.000 [1.000, 1.000], m

 1797/2900: episode: 62, duration: 0.444s, episode steps: 113, steps per second: 254, episode reward: 113.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.522 [0.000, 1.000], mean observation: 0.288 [-0.958, 1.494], loss: 0.644449, mean_absolute_error: 7.446884, mean_q: 14.854182
 1943/2900: episode: 63, duration: 0.906s, episode steps: 146, steps per second: 161, episode reward: 146.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.521 [0.000, 1.000], mean observation: 0.154 [-0.918, 1.163], loss: 0.834452, mean_absolute_error: 7.994983, mean_q: 15.991569
 2072/2900: episode: 64, duration: 0.705s, episode steps: 129, steps per second: 183, episode reward: 129.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.504 [0.000, 1.000], mean observation: -0.062 [-1.445, 1.198], loss: 0.867386, mean_absolute_error: 8.624294, mean_q: 17.293127
 2154/2900: episode: 65, duration: 0.455s, episode steps: 82, steps per second: 180, episode reward: 82.000, mean reward: 1.000 [1.000, 1.



   29/3200: episode: 1, duration: 0.845s, episode steps: 29, steps per second: 34, episode reward: 29.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.517 [0.000, 1.000], mean observation: -0.041 [-1.064, 0.626], loss: 0.455805, mean_absolute_error: 0.499867, mean_q: 0.049333
   61/3200: episode: 2, duration: 0.127s, episode steps: 32, steps per second: 251, episode reward: 32.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.625 [0.000, 1.000], mean observation: -0.048 [-2.420, 1.518], loss: 0.307264, mean_absolute_error: 0.515373, mean_q: 0.270659
   79/3200: episode: 3, duration: 0.072s, episode steps: 18, steps per second: 250, episode reward: 18.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.389 [0.000, 1.000], mean observation: 0.107 [-0.763, 1.610], loss: 0.131428, mean_absolute_error: 0.558231, mean_q: 0.664464




  108/3200: episode: 4, duration: 0.111s, episode steps: 29, steps per second: 260, episode reward: 29.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.483 [0.000, 1.000], mean observation: -0.055 [-1.424, 0.995], loss: 0.042255, mean_absolute_error: 0.671613, mean_q: 1.167719
  124/3200: episode: 5, duration: 0.057s, episode steps: 16, steps per second: 281, episode reward: 16.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.625 [0.000, 1.000], mean observation: -0.066 [-1.749, 1.031], loss: 0.025494, mean_absolute_error: 0.720847, mean_q: 1.304302
  168/3200: episode: 6, duration: 0.165s, episode steps: 44, steps per second: 267, episode reward: 44.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.432 [0.000, 1.000], mean observation: -0.040 [-1.329, 1.731], loss: 0.021315, mean_absolute_error: 0.805734, mean_q: 1.496822
  191/3200: episode: 7, duration: 0.081s, episode steps: 23, steps per second: 283, episode reward: 23.000, mean reward: 1.000 [1.000, 1.000], mean 

  962/3200: episode: 33, duration: 0.239s, episode steps: 76, steps per second: 318, episode reward: 76.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.526 [0.000, 1.000], mean observation: 0.051 [-1.441, 1.340], loss: 0.302551, mean_absolute_error: 4.002175, mean_q: 7.863830
 1053/3200: episode: 34, duration: 0.291s, episode steps: 91, steps per second: 313, episode reward: 91.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.549 [0.000, 1.000], mean observation: 0.226 [-1.720, 1.695], loss: 0.426884, mean_absolute_error: 4.335735, mean_q: 8.489305
 1129/3200: episode: 35, duration: 0.244s, episode steps: 76, steps per second: 312, episode reward: 76.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.474 [0.000, 1.000], mean observation: -0.115 [-1.720, 1.674], loss: 0.349334, mean_absolute_error: 4.584423, mean_q: 9.086555
 1186/3200: episode: 36, duration: 0.179s, episode steps: 57, steps per second: 318, episode reward: 57.000, mean reward: 1.000 [1.000, 1.000], mea

   353/11000: episode: 16, duration: 0.021s, episode steps: 18, steps per second: 870, episode reward: 18.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.611 [0.000, 1.000], mean observation: -0.096 [-1.790, 0.948], mean_best_reward: --
   369/11000: episode: 17, duration: 0.017s, episode steps: 16, steps per second: 956, episode reward: 16.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.312 [0.000, 1.000], mean observation: 0.080 [-1.177, 1.992], mean_best_reward: --
   380/11000: episode: 18, duration: 0.012s, episode steps: 11, steps per second: 927, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.091 [0.000, 1.000], mean observation: 0.124 [-1.738, 2.794], mean_best_reward: --
   392/11000: episode: 19, duration: 0.012s, episode steps: 12, steps per second: 979, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.250 [0.000, 1.000], mean observation: 0.115 [-1.227, 2.081], mean_best_reward: --
   412/11000: episode: 20, 

  1066/11000: episode: 57, duration: 0.038s, episode steps: 37, steps per second: 982, episode reward: 37.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.595 [0.000, 1.000], mean observation: -0.004 [-2.254, 1.573], mean_best_reward: --
  1078/11000: episode: 58, duration: 0.015s, episode steps: 12, steps per second: 825, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: 0.109 [-0.799, 1.285], mean_best_reward: --
  1091/11000: episode: 59, duration: 0.015s, episode steps: 13, steps per second: 886, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.231 [0.000, 1.000], mean observation: 0.110 [-1.380, 2.333], mean_best_reward: --
  1111/11000: episode: 60, duration: 0.020s, episode steps: 20, steps per second: 1003, episode reward: 20.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: -0.104 [-1.196, 0.430], mean_best_reward: --
  1123/11000: episode: 61

  1750/11000: episode: 94, duration: 0.033s, episode steps: 20, steps per second: 612, episode reward: 20.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.400 [0.000, 1.000], mean observation: 0.086 [-0.967, 1.590], mean_best_reward: --
  1767/11000: episode: 95, duration: 0.030s, episode steps: 17, steps per second: 575, episode reward: 17.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.529 [0.000, 1.000], mean observation: -0.080 [-1.227, 0.813], mean_best_reward: --
  1779/11000: episode: 96, duration: 0.020s, episode steps: 12, steps per second: 606, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.167 [0.000, 1.000], mean observation: 0.131 [-1.571, 2.636], mean_best_reward: --
  1791/11000: episode: 97, duration: 0.026s, episode steps: 12, steps per second: 462, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.333 [0.000, 1.000], mean observation: 0.098 [-1.000, 1.553], mean_best_reward: --
  1817/11000: episode: 98, 

  2452/11000: episode: 129, duration: 0.051s, episode steps: 26, steps per second: 513, episode reward: 26.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.346 [0.000, 1.000], mean observation: 0.008 [-1.602, 2.428], mean_best_reward: --
  2471/11000: episode: 130, duration: 0.039s, episode steps: 19, steps per second: 487, episode reward: 19.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.368 [0.000, 1.000], mean observation: 0.063 [-1.148, 1.726], mean_best_reward: --
  2489/11000: episode: 131, duration: 0.028s, episode steps: 18, steps per second: 633, episode reward: 18.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.833 [0.000, 1.000], mean observation: -0.020 [-3.277, 2.340], mean_best_reward: --
  2518/11000: episode: 132, duration: 0.043s, episode steps: 29, steps per second: 669, episode reward: 29.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.414 [0.000, 1.000], mean observation: 0.052 [-1.190, 2.080], mean_best_reward: --
  2532/11000: episode: 

  3108/11000: episode: 164, duration: 0.021s, episode steps: 19, steps per second: 908, episode reward: 19.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.474 [0.000, 1.000], mean observation: 0.080 [-0.798, 1.238], mean_best_reward: --
  3133/11000: episode: 165, duration: 0.028s, episode steps: 25, steps per second: 881, episode reward: 25.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.360 [0.000, 1.000], mean observation: 0.050 [-1.354, 2.187], mean_best_reward: --
  3159/11000: episode: 166, duration: 0.027s, episode steps: 26, steps per second: 955, episode reward: 26.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.423 [0.000, 1.000], mean observation: 0.079 [-0.840, 1.712], mean_best_reward: --
  3180/11000: episode: 167, duration: 0.030s, episode steps: 21, steps per second: 699, episode reward: 21.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.476 [0.000, 1.000], mean observation: 0.077 [-0.644, 1.379], mean_best_reward: --
  3204/11000: episode: 1

  4089/11000: episode: 206, duration: 0.032s, episode steps: 26, steps per second: 801, episode reward: 26.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.385 [0.000, 1.000], mean observation: 0.009 [-1.496, 1.979], mean_best_reward: --
  4098/11000: episode: 207, duration: 0.010s, episode steps: 9, steps per second: 929, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.222 [0.000, 1.000], mean observation: 0.162 [-1.153, 1.965], mean_best_reward: --
  4116/11000: episode: 208, duration: 0.021s, episode steps: 18, steps per second: 840, episode reward: 18.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.389 [0.000, 1.000], mean observation: 0.057 [-1.174, 1.750], mean_best_reward: --
  4139/11000: episode: 209, duration: 0.023s, episode steps: 23, steps per second: 1008, episode reward: 23.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.565 [0.000, 1.000], mean observation: -0.074 [-1.372, 0.794], mean_best_reward: --
  4184/11000: episode: 2

  5141/11000: episode: 245, duration: 0.048s, episode steps: 43, steps per second: 888, episode reward: 43.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.465 [0.000, 1.000], mean observation: -0.003 [-0.926, 1.369], mean_best_reward: --
  5165/11000: episode: 246, duration: 0.024s, episode steps: 24, steps per second: 1005, episode reward: 24.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.542 [0.000, 1.000], mean observation: -0.091 [-1.282, 0.795], mean_best_reward: --
  5181/11000: episode: 247, duration: 0.015s, episode steps: 16, steps per second: 1045, episode reward: 16.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.688 [0.000, 1.000], mean observation: -0.077 [-2.174, 1.415], mean_best_reward: --
  5195/11000: episode: 248, duration: 0.015s, episode steps: 14, steps per second: 906, episode reward: 14.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.714 [0.000, 1.000], mean observation: -0.100 [-2.021, 1.180], mean_best_reward: --
  5245/11000: epis

  6353/11000: episode: 284, duration: 0.048s, episode steps: 23, steps per second: 476, episode reward: 23.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.565 [0.000, 1.000], mean observation: -0.081 [-1.537, 0.771], mean_best_reward: --
  6372/11000: episode: 285, duration: 0.044s, episode steps: 19, steps per second: 435, episode reward: 19.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.368 [0.000, 1.000], mean observation: 0.091 [-0.976, 1.793], mean_best_reward: --
  6438/11000: episode: 286, duration: 0.093s, episode steps: 66, steps per second: 709, episode reward: 66.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.470 [0.000, 1.000], mean observation: -0.124 [-1.305, 0.786], mean_best_reward: --
  6454/11000: episode: 287, duration: 0.022s, episode steps: 16, steps per second: 737, episode reward: 16.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.438 [0.000, 1.000], mean observation: 0.052 [-1.200, 1.650], mean_best_reward: --
  6521/11000: episode:

  7659/11000: episode: 320, duration: 0.101s, episode steps: 97, steps per second: 957, episode reward: 97.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.546 [0.000, 1.000], mean observation: 0.330 [-1.177, 2.024], mean_best_reward: --
  7674/11000: episode: 321, duration: 0.016s, episode steps: 15, steps per second: 917, episode reward: 15.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.533 [0.000, 1.000], mean observation: 0.106 [-0.748, 1.261], mean_best_reward: --
  7755/11000: episode: 322, duration: 0.087s, episode steps: 81, steps per second: 928, episode reward: 81.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.494 [0.000, 1.000], mean observation: -0.008 [-1.333, 1.201], mean_best_reward: --
  7783/11000: episode: 323, duration: 0.028s, episode steps: 28, steps per second: 998, episode reward: 28.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: -0.101 [-1.421, 0.604], mean_best_reward: --
  7828/11000: episode:

  8974/11000: episode: 356, duration: 0.016s, episode steps: 12, steps per second: 762, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.417 [0.000, 1.000], mean observation: 0.105 [-1.187, 1.830], mean_best_reward: --
  8996/11000: episode: 357, duration: 0.026s, episode steps: 22, steps per second: 851, episode reward: 22.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.455 [0.000, 1.000], mean observation: -0.074 [-1.178, 0.798], mean_best_reward: --
  9013/11000: episode: 358, duration: 0.020s, episode steps: 17, steps per second: 833, episode reward: 17.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.706 [0.000, 1.000], mean observation: -0.103 [-2.356, 1.372], mean_best_reward: --
  9031/11000: episode: 359, duration: 0.020s, episode steps: 18, steps per second: 903, episode reward: 18.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.444 [0.000, 1.000], mean observation: 0.111 [-0.756, 1.273], mean_best_reward: --
  9114/11000: episode:

 10257/11000: episode: 396, duration: 0.036s, episode steps: 29, steps per second: 804, episode reward: 29.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.483 [0.000, 1.000], mean observation: -0.113 [-1.117, 0.587], mean_best_reward: --
 10281/11000: episode: 397, duration: 0.025s, episode steps: 24, steps per second: 953, episode reward: 24.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: 0.035 [-0.827, 1.046], mean_best_reward: --
 10293/11000: episode: 398, duration: 0.013s, episode steps: 12, steps per second: 920, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.750 [0.000, 1.000], mean observation: -0.126 [-2.053, 1.159], mean_best_reward: --
 10304/11000: episode: 399, duration: 0.011s, episode steps: 11, steps per second: 977, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.636 [0.000, 1.000], mean observation: -0.138 [-1.675, 0.942], mean_best_reward: --
 10358/11000: episode

  331/4500: episode: 16, duration: 0.036s, episode steps: 16, steps per second: 444, episode reward: 16.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.688 [0.000, 1.000], mean observation: -0.102 [-2.374, 1.387], loss: 2.911917, mean_absolute_error: 3.691833, mean_q: 6.473026
  346/4500: episode: 17, duration: 0.037s, episode steps: 15, steps per second: 404, episode reward: 15.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.600 [0.000, 1.000], mean observation: -0.107 [-1.330, 0.562], loss: 1.382095, mean_absolute_error: 3.569553, mean_q: 6.345711
  364/4500: episode: 18, duration: 0.040s, episode steps: 18, steps per second: 454, episode reward: 18.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.556 [0.000, 1.000], mean observation: -0.066 [-1.560, 0.990], loss: 1.714757, mean_absolute_error: 4.101874, mean_q: 7.170851
  374/4500: episode: 19, duration: 0.023s, episode steps: 10, steps per second: 434, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], m

  695/4500: episode: 47, duration: 0.026s, episode steps: 11, steps per second: 429, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.091 [0.000, 1.000], mean observation: 0.146 [-1.731, 2.783], loss: 8.395165, mean_absolute_error: 5.950137, mean_q: 11.384029
  706/4500: episode: 48, duration: 0.027s, episode steps: 11, steps per second: 402, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.091 [0.000, 1.000], mean observation: 0.124 [-1.789, 2.829], loss: 8.435202, mean_absolute_error: 5.932163, mean_q: 11.424528
  715/4500: episode: 49, duration: 0.020s, episode steps: 9, steps per second: 444, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.141 [-1.789, 2.804], loss: 9.081702, mean_absolute_error: 5.743052, mean_q: 11.090290
  727/4500: episode: 50, duration: 0.026s, episode steps: 12, steps per second: 467, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mea

 1191/4500: episode: 80, duration: 0.046s, episode steps: 19, steps per second: 412, episode reward: 19.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.579 [0.000, 1.000], mean observation: -0.065 [-1.478, 0.961], loss: 3.900349, mean_absolute_error: 7.030389, mean_q: 13.109353
 1202/4500: episode: 81, duration: 0.024s, episode steps: 11, steps per second: 462, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.727 [0.000, 1.000], mean observation: -0.104 [-1.868, 1.189], loss: 8.600678, mean_absolute_error: 7.406077, mean_q: 13.398121
 1213/4500: episode: 82, duration: 0.024s, episode steps: 11, steps per second: 451, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.818 [0.000, 1.000], mean observation: -0.159 [-2.301, 1.327], loss: 7.953047, mean_absolute_error: 7.302922, mean_q: 13.248173
 1231/4500: episode: 83, duration: 0.039s, episode steps: 18, steps per second: 463, episode reward: 18.000, mean reward: 1.000 [1.000, 1.000]

 1863/4500: episode: 110, duration: 0.072s, episode steps: 31, steps per second: 432, episode reward: 31.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.484 [0.000, 1.000], mean observation: -0.099 [-0.800, 0.398], loss: 5.528775, mean_absolute_error: 9.950999, mean_q: 18.657060
 1877/4500: episode: 111, duration: 0.029s, episode steps: 14, steps per second: 481, episode reward: 14.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: 0.098 [-0.741, 1.298], loss: 11.911693, mean_absolute_error: 9.966255, mean_q: 18.435518
 1903/4500: episode: 112, duration: 0.066s, episode steps: 26, steps per second: 392, episode reward: 26.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: -0.101 [-1.207, 0.738], loss: 6.307186, mean_absolute_error: 9.793403, mean_q: 18.342842
 1944/4500: episode: 113, duration: 0.088s, episode steps: 41, steps per second: 466, episode reward: 41.000, mean reward: 1.000 [1.000, 1.

 2733/4500: episode: 143, duration: 0.036s, episode steps: 14, steps per second: 390, episode reward: 14.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.571 [0.000, 1.000], mean observation: -0.083 [-1.706, 1.201], loss: 8.123520, mean_absolute_error: 9.427062, mean_q: 17.745267
 2742/4500: episode: 144, duration: 0.023s, episode steps: 9, steps per second: 384, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.778 [0.000, 1.000], mean observation: -0.128 [-1.970, 1.212], loss: 10.684997, mean_absolute_error: 9.329206, mean_q: 17.194337
 2755/4500: episode: 145, duration: 0.042s, episode steps: 13, steps per second: 310, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.692 [0.000, 1.000], mean observation: -0.077 [-2.129, 1.397], loss: 6.249128, mean_absolute_error: 8.701442, mean_q: 16.126359
 2787/4500: episode: 146, duration: 0.068s, episode steps: 32, steps per second: 472, episode reward: 32.000, mean reward: 1.000 [1.000, 1.0

 3791/4500: episode: 174, duration: 0.048s, episode steps: 20, steps per second: 420, episode reward: 20.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.550 [0.000, 1.000], mean observation: -0.065 [-1.225, 0.622], loss: 13.408300, mean_absolute_error: 13.564146, mean_q: 25.727165
 3826/4500: episode: 175, duration: 0.076s, episode steps: 35, steps per second: 462, episode reward: 35.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.457 [0.000, 1.000], mean observation: -0.096 [-1.147, 0.629], loss: 7.750299, mean_absolute_error: 12.678521, mean_q: 24.407891
 3841/4500: episode: 176, duration: 0.061s, episode steps: 15, steps per second: 247, episode reward: 15.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.467 [0.000, 1.000], mean observation: 0.093 [-0.418, 1.007], loss: 18.974102, mean_absolute_error: 13.189904, mean_q: 24.509106
 3911/4500: episode: 177, duration: 0.187s, episode steps: 70, steps per second: 374, episode reward: 70.000, mean reward: 1.000 [1.000

<keras.callbacks.History at 0x7fd2dc7fa550>