___

<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Copyright by Pierian Data Inc.</em></center>
<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>

# Keras-RL DQN Exercise


In this exercise you are going to implement your first keras-rl agent for the **Acrobot** environment (https://gym.openai.com/envs/Acrobot-v1/). <br />
The goal of this environment is to maneuver the robot arm upwards above the line in as few steps as possible.

**TASK: Import necessary libraries** <br />

In [1]:
import gym
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent

**TASK: Create the environment** <br />
The name is: *Acrobot-v1*

In [2]:
env_name = "Acrobot-v1"
env = gym.make(env_name)

In [3]:
num_actions = env.action_space.n
num_observations = env.observation_space.shape
print(f"Action Space: {env.action_space.n}")
print(f"Observation Space: {num_observations}")

assert num_actions == 3 and num_observations == (6,), "Wrong environment!"

Action Space: 3
Observation Space: (6,)


**TASK: Create the Neural Network for your Deep-Q-Agent** <br />
Take a look at the size of the action space and the size of the observation space.
You are free to choose any architecture you want! <br />
Hint: It already works with three layers, each having 64 neurons.

In [4]:
model = Sequential()

# The leading 1 in the input shape corresponds to window_length=1 of the replay memory
model.add(Flatten(input_shape=(1, ) + num_observations))

model.add(Dense(64))
model.add(Activation("relu"))

model.add(Dense(64))
model.add(Activation("relu"))

model.add(Dense(64))
model.add(Activation("relu"))


model.add(Dense(num_actions))
model.add(Activation("linear"))

In [5]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 6)                 0         
_________________________________________________________________
dense (Dense)                (None, 64)                448       
_________________________________________________________________
activation (Activation)      (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_1 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_2 (Activation)    (None, 64)                0
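
The parameter counts reported by `model.summary()` can be checked by hand. The following is a small sketch, assuming the layer widths from the model above (6 observations, three hidden layers of 64 units, 3 actions); the helper name `dense_params` is made up for this illustration.

```python
# A Dense layer has inputs * units weights plus one bias per unit.
def dense_params(n_inputs: int, n_units: int) -> int:
    return n_inputs * n_units + n_units

# Layer widths assumed from the model above: 6 -> 64 -> 64 -> 64 -> 3
widths = [6, 64, 64, 64, 3]
counts = [dense_params(a, b) for a, b in zip(widths, widths[1:])]

print(counts)       # [448, 4160, 4160, 195]
print(sum(counts))  # 8963
```

The first three values match the `Param #` column shown in the summary; the last value is what the formula gives for the output layer with 3 units.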

**TASK: Initialize the circular buffer**<br />
Make sure you set the limit appropriately (50000 works well)

In [6]:
from rl.memory import SequentialMemory

In [7]:
memory = SequentialMemory(limit=50000, window_length=1)
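
To illustrate what a circular buffer with a `limit` does, here is a minimal sketch built on `collections.deque`. It is not how `SequentialMemory` is implemented; the class `ToyReplayBuffer` and the transition tuples are hypothetical and only show the idea that the oldest experiences are discarded once the limit is reached.

```python
import random
from collections import deque


class ToyReplayBuffer:
    """Hypothetical stand-in for a fixed-size circular replay buffer."""

    def __init__(self, limit: int):
        # deque with maxlen drops the oldest entry once the buffer is full
        self.buffer = deque(maxlen=limit)

    def append(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        # Uniform sampling without replacement from the stored transitions
        return random.sample(list(self.buffer), batch_size)


buf = ToyReplayBuffer(limit=3)
for step in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    buf.append((f"s{step}", 0, -1.0, f"s{step + 1}", False))

oldest_state = buf.buffer[0][0]
print(len(buf.buffer), oldest_state)  # 3 s2
```

With a limit of 50000 and 150,000 training steps, the same mechanism would mean that only the most recent third of all transitions is available for sampling at the end of training.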

**TASK: Use the epsilon greedy action selection strategy with *decaying* epsilon**

In [8]:
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

In [9]:
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                              attr="eps",
                              value_max=1.0,
                              value_min=0.1,
                              value_test=0.05,
                              nb_steps=150000
                             )
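
The decay schedule configured above can be sketched as a plain function. This is an illustration of linear annealing under the parameters used in the cell above, not the library's own code; the function name `annealed_eps` is an assumption made for this example.

```python
def annealed_eps(step, value_max=1.0, value_min=0.1, nb_steps=150000):
    """Interpolate linearly from value_max to value_min over nb_steps, then hold."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


print(round(annealed_eps(0), 4))       # 1.0
print(round(annealed_eps(75000), 4))   # 0.55
print(round(annealed_eps(150000), 4))  # 0.1
print(round(annealed_eps(200000), 4))  # 0.1
```

During training, epsilon therefore falls from 1.0 to 0.1 over the 150,000 steps. `value_test=0.05` is the epsilon used when the agent is run in test mode.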

**TASK: Create the DQNAgent** <br />
Feel free to play with the nb_steps_warmup, target_model_update, batch_size and gamma parameters. <br />
Hint:<br />
You can try *nb_steps_warmup*=1000, *target_model_update*=1000, *batch_size*=32 and *gamma*=0.99 as a first guess

In [10]:
dqn = DQNAgent(model=model, nb_actions=num_actions, memory=memory, 
               nb_steps_warmup=1000, target_model_update=1000, policy=policy,
               gamma=0.99, batch_size=32
              )
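
To give a feeling for what `gamma` does, the following sketch computes the standard one-step Q-learning target for a single transition. It is a simplified illustration: the exact update inside `DQNAgent` may differ (for example when double DQN is enabled), and the function `td_target` and the example Q-values are made up here.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r + gamma * max_a' Q_target(s', a'); no bootstrapping at terminal states."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Example Q-values for the 3 actions in the next state (illustrative numbers)
next_q = [-30.0, -32.0, -31.0]

print(round(td_target(-1.0, next_q, done=False), 2))  # -30.7
print(round(td_target(0.0, next_q, done=True), 2))    # 0.0
```

The `next_q_values` would come from the target network, a copy of the model that is refreshed according to `target_model_update`. A smaller `gamma` shrinks the contribution of future rewards to the target.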

**TASK: Compile the model** <br />
Feel free to explore the effects of different optimizers and learning rates.
You can try Adam with a learning rate of 1e-3 as a first guess 

In [11]:
dqn.compile(Adam(learning_rate=1e-3), metrics=["mae"])

**TASK: Fit the model** <br />
150,000 steps should be a very good starting point

In [12]:
dqn.fit(env, nb_steps=150000, visualize=False, verbose=2)

Training for 150000 steps ...




    500/150000: episode: 1, duration: 1.548s, episode steps: 500, steps per second: 323, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.024 [0.000, 2.000],  loss: --, mae: --, mean_q: --, mean_eps: --
   1000/150000: episode: 2, duration: 0.774s, episode steps: 500, steps per second: 646, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.002 [0.000, 2.000],  loss: --, mae: --, mean_q: --, mean_eps: --




   1500/150000: episode: 3, duration: 6.676s, episode steps: 500, steps per second:  75, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.004 [0.000, 2.000],  loss: 0.007188, mae: 0.512267, mean_q: -0.701002, mean_eps: 0.992500
   2000/150000: episode: 4, duration: 5.994s, episode steps: 500, steps per second:  83, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.046 [0.000, 2.000],  loss: 0.000415, mae: 0.502049, mean_q: -0.718353, mean_eps: 0.989503
   2500/150000: episode: 5, duration: 5.957s, episode steps: 500, steps per second:  84, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.982 [0.000, 2.000],  loss: 0.008334, mae: 1.256795, mean_q: -1.825103, mean_eps: 0.986503
   3000/150000: episode: 6, duration: 6.069s, episode steps: 500, steps per second:  82, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.016 [0.000, 2.000],  loss: 0.000755, mae: 1.254379, mean_q

  17283/150000: episode: 35, duration: 6.979s, episode steps: 500, steps per second:  72, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.988 [0.000, 2.000],  loss: 0.217788, mae: 10.713266, mean_q: -15.849277, mean_eps: 0.897805
  17783/150000: episode: 36, duration: 6.983s, episode steps: 500, steps per second:  72, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.040 [0.000, 2.000],  loss: 0.199262, mae: 10.931666, mean_q: -16.176904, mean_eps: 0.894805
  18283/150000: episode: 37, duration: 7.998s, episode steps: 500, steps per second:  63, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.044 [0.000, 2.000],  loss: 0.247269, mae: 11.218981, mean_q: -16.576819, mean_eps: 0.891805
  18783/150000: episode: 38, duration: 8.743s, episode steps: 500, steps per second:  57, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.932 [0.000, 2.000],  loss: 0.188406, mae: 11.433

  32585/150000: episode: 67, duration: 6.254s, episode steps: 500, steps per second:  80, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.012 [0.000, 2.000],  loss: 0.353642, mae: 16.667734, mean_q: -24.686428, mean_eps: 0.805993
  33085/150000: episode: 68, duration: 6.320s, episode steps: 500, steps per second:  79, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.888 [0.000, 2.000],  loss: 0.513549, mae: 16.751382, mean_q: -24.759036, mean_eps: 0.802993
  33410/150000: episode: 69, duration: 4.127s, episode steps: 325, steps per second:  79, episode reward: -324.000, mean reward: -0.997 [-1.000,  0.000], mean action: 0.972 [0.000, 2.000],  loss: 0.532712, mae: 17.032548, mean_q: -25.169402, mean_eps: 0.800518
  33910/150000: episode: 70, duration: 6.443s, episode steps: 500, steps per second:  78, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.012 [0.000, 2.000],  loss: 0.554668, mae: 16.994

  46509/150000: episode: 99, duration: 6.548s, episode steps: 500, steps per second:  76, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.980 [0.000, 2.000],  loss: 0.853960, mae: 20.430147, mean_q: -30.124311, mean_eps: 0.722449
  46826/150000: episode: 100, duration: 4.097s, episode steps: 317, steps per second:  77, episode reward: -316.000, mean reward: -0.997 [-1.000,  0.000], mean action: 0.937 [0.000, 2.000],  loss: 0.624664, mae: 20.409443, mean_q: -30.112555, mean_eps: 0.719998
  47101/150000: episode: 101, duration: 3.533s, episode steps: 275, steps per second:  78, episode reward: -274.000, mean reward: -0.996 [-1.000,  0.000], mean action: 0.909 [0.000, 2.000],  loss: 0.737468, mae: 20.536915, mean_q: -30.275816, mean_eps: 0.718222
  47319/150000: episode: 102, duration: 2.786s, episode steps: 218, steps per second:  78, episode reward: -217.000, mean reward: -0.995 [-1.000,  0.000], mean action: 0.940 [0.000, 2.000],  loss: 0.799904, mae: 20.

  56979/150000: episode: 131, duration: 5.138s, episode steps: 398, steps per second:  77, episode reward: -397.000, mean reward: -0.997 [-1.000,  0.000], mean action: 1.020 [0.000, 2.000],  loss: 0.716201, mae: 22.245294, mean_q: -32.806619, mean_eps: 0.659323
  57440/150000: episode: 132, duration: 5.918s, episode steps: 461, steps per second:  78, episode reward: -460.000, mean reward: -0.998 [-1.000,  0.000], mean action: 1.011 [0.000, 2.000],  loss: 0.766471, mae: 22.398285, mean_q: -33.019986, mean_eps: 0.656746
  57643/150000: episode: 133, duration: 2.641s, episode steps: 203, steps per second:  77, episode reward: -202.000, mean reward: -0.995 [-1.000,  0.000], mean action: 1.108 [0.000, 2.000],  loss: 0.971967, mae: 22.406157, mean_q: -33.004522, mean_eps: 0.654754
  57950/150000: episode: 134, duration: 4.045s, episode steps: 307, steps per second:  76, episode reward: -306.000, mean reward: -0.997 [-1.000,  0.000], mean action: 0.997 [0.000, 2.000],  loss: 0.828145, mae: 22

  65483/150000: episode: 163, duration: 3.607s, episode steps: 241, steps per second:  67, episode reward: -240.000, mean reward: -0.996 [-1.000,  0.000], mean action: 1.066 [0.000, 2.000],  loss: 0.913826, mae: 23.018549, mean_q: -33.807502, mean_eps: 0.607828
  65659/150000: episode: 164, duration: 2.438s, episode steps: 176, steps per second:  72, episode reward: -175.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.983 [0.000, 2.000],  loss: 0.669293, mae: 23.087313, mean_q: -33.921688, mean_eps: 0.606577
  65943/150000: episode: 165, duration: 3.845s, episode steps: 284, steps per second:  74, episode reward: -283.000, mean reward: -0.996 [-1.000,  0.000], mean action: 1.183 [0.000, 2.000],  loss: 0.813714, mae: 23.096375, mean_q: -33.928919, mean_eps: 0.605197
  66116/150000: episode: 166, duration: 2.341s, episode steps: 173, steps per second:  74, episode reward: -172.000, mean reward: -0.994 [-1.000,  0.000], mean action: 1.116 [0.000, 2.000],  loss: 0.998833, mae: 23

  73140/150000: episode: 195, duration: 3.478s, episode steps: 247, steps per second:  71, episode reward: -246.000, mean reward: -0.996 [-1.000,  0.000], mean action: 0.838 [0.000, 2.000],  loss: 0.690694, mae: 23.656436, mean_q: -34.658451, mean_eps: 0.561904
  73369/150000: episode: 196, duration: 3.019s, episode steps: 229, steps per second:  76, episode reward: -228.000, mean reward: -0.996 [-1.000,  0.000], mean action: 0.786 [0.000, 2.000],  loss: 0.994547, mae: 23.702040, mean_q: -34.731466, mean_eps: 0.560476
  73537/150000: episode: 197, duration: 2.473s, episode steps: 168, steps per second:  68, episode reward: -167.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.935 [0.000, 2.000],  loss: 0.889603, mae: 23.630548, mean_q: -34.620173, mean_eps: 0.559285
  73721/150000: episode: 198, duration: 2.619s, episode steps: 184, steps per second:  70, episode reward: -183.000, mean reward: -0.995 [-1.000,  0.000], mean action: 0.935 [0.000, 2.000],  loss: 0.860892, mae: 23

  79630/150000: episode: 227, duration: 2.321s, episode steps: 180, steps per second:  78, episode reward: -179.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.967 [0.000, 2.000],  loss: 0.722746, mae: 24.187197, mean_q: -35.398550, mean_eps: 0.522763
  79786/150000: episode: 228, duration: 1.836s, episode steps: 156, steps per second:  85, episode reward: -155.000, mean reward: -0.994 [-1.000,  0.000], mean action: 1.077 [0.000, 2.000],  loss: 0.872694, mae: 24.175469, mean_q: -35.380120, mean_eps: 0.521755
  79894/150000: episode: 229, duration: 1.757s, episode steps: 108, steps per second:  61, episode reward: -107.000, mean reward: -0.991 [-1.000,  0.000], mean action: 1.093 [0.000, 2.000],  loss: 0.643840, mae: 24.315198, mean_q: -35.583577, mean_eps: 0.520963
  80074/150000: episode: 230, duration: 2.904s, episode steps: 180, steps per second:  62, episode reward: -179.000, mean reward: -0.994 [-1.000,  0.000], mean action: 1.000 [0.000, 2.000],  loss: 0.757248, mae: 24

  85096/150000: episode: 259, duration: 2.514s, episode steps: 190, steps per second:  76, episode reward: -189.000, mean reward: -0.995 [-1.000,  0.000], mean action: 1.005 [0.000, 2.000],  loss: 0.787117, mae: 24.028595, mean_q: -35.109994, mean_eps: 0.489997
  85307/150000: episode: 260, duration: 2.796s, episode steps: 211, steps per second:  75, episode reward: -210.000, mean reward: -0.995 [-1.000,  0.000], mean action: 0.834 [0.000, 2.000],  loss: 0.786704, mae: 24.112254, mean_q: -35.264618, mean_eps: 0.488794
  85488/150000: episode: 261, duration: 2.380s, episode steps: 181, steps per second:  76, episode reward: -180.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.834 [0.000, 2.000],  loss: 0.749400, mae: 23.968091, mean_q: -35.007933, mean_eps: 0.487618
  85666/150000: episode: 262, duration: 2.350s, episode steps: 178, steps per second:  76, episode reward: -177.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.888 [0.000, 2.000],  loss: 0.876920, mae: 24

  90328/150000: episode: 291, duration: 2.112s, episode steps: 162, steps per second:  77, episode reward: -161.000, mean reward: -0.994 [-1.000,  0.000], mean action: 1.012 [0.000, 2.000],  loss: 0.712209, mae: 24.716010, mean_q: -36.177394, mean_eps: 0.458521
  90462/150000: episode: 292, duration: 1.759s, episode steps: 134, steps per second:  76, episode reward: -133.000, mean reward: -0.993 [-1.000,  0.000], mean action: 1.231 [0.000, 2.000],  loss: 0.765644, mae: 24.773715, mean_q: -36.208499, mean_eps: 0.457633
  90637/150000: episode: 293, duration: 2.313s, episode steps: 175, steps per second:  76, episode reward: -174.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.926 [0.000, 2.000],  loss: 0.769108, mae: 24.517767, mean_q: -35.791574, mean_eps: 0.456706
  90793/150000: episode: 294, duration: 2.043s, episode steps: 156, steps per second:  76, episode reward: -155.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.968 [0.000, 2.000],  loss: 0.658119, mae: 24

  94859/150000: episode: 323, duration: 2.343s, episode steps: 179, steps per second:  76, episode reward: -178.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.966 [0.000, 2.000],  loss: 0.736774, mae: 24.633215, mean_q: -35.964085, mean_eps: 0.431386
  95010/150000: episode: 324, duration: 1.991s, episode steps: 151, steps per second:  76, episode reward: -150.000, mean reward: -0.993 [-1.000,  0.000], mean action: 0.907 [0.000, 2.000],  loss: 0.745039, mae: 24.216858, mean_q: -35.309681, mean_eps: 0.430396
  95181/150000: episode: 325, duration: 2.268s, episode steps: 171, steps per second:  75, episode reward: -170.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.825 [0.000, 2.000],  loss: 0.744557, mae: 24.329805, mean_q: -35.500832, mean_eps: 0.429430
  95295/150000: episode: 326, duration: 1.500s, episode steps: 114, steps per second:  76, episode reward: -113.000, mean reward: -0.991 [-1.000,  0.000], mean action: 1.018 [0.000, 2.000],  loss: 0.814439, mae: 24

  99147/150000: episode: 355, duration: 1.627s, episode steps: 127, steps per second:  78, episode reward: -126.000, mean reward: -0.992 [-1.000,  0.000], mean action: 1.016 [0.000, 2.000],  loss: 0.784625, mae: 24.221216, mean_q: -35.230414, mean_eps: 0.405502
  99265/150000: episode: 356, duration: 1.503s, episode steps: 118, steps per second:  78, episode reward: -117.000, mean reward: -0.992 [-1.000,  0.000], mean action: 0.949 [0.000, 2.000],  loss: 0.751325, mae: 24.478100, mean_q: -35.632714, mean_eps: 0.404767
  99387/150000: episode: 357, duration: 1.560s, episode steps: 122, steps per second:  78, episode reward: -121.000, mean reward: -0.992 [-1.000,  0.000], mean action: 1.098 [0.000, 2.000],  loss: 0.729452, mae: 24.258942, mean_q: -35.340284, mean_eps: 0.404047
  99515/150000: episode: 358, duration: 1.619s, episode steps: 128, steps per second:  79, episode reward: -127.000, mean reward: -0.992 [-1.000,  0.000], mean action: 0.992 [0.000, 2.000],  loss: 0.729132, mae: 24

 103444/150000: episode: 387, duration: 1.388s, episode steps: 109, steps per second:  79, episode reward: -108.000, mean reward: -0.991 [-1.000,  0.000], mean action: 0.890 [0.000, 2.000],  loss: 0.817875, mae: 24.104471, mean_q: -35.029134, mean_eps: 0.379666
 103547/150000: episode: 388, duration: 1.278s, episode steps: 103, steps per second:  81, episode reward: -102.000, mean reward: -0.990 [-1.000,  0.000], mean action: 0.845 [0.000, 2.000],  loss: 0.610881, mae: 24.604855, mean_q: -35.880373, mean_eps: 0.379030
 103661/150000: episode: 389, duration: 1.450s, episode steps: 114, steps per second:  79, episode reward: -113.000, mean reward: -0.991 [-1.000,  0.000], mean action: 1.009 [0.000, 2.000],  loss: 0.754728, mae: 24.346901, mean_q: -35.429966, mean_eps: 0.378379
 103783/150000: episode: 390, duration: 1.515s, episode steps: 122, steps per second:  81, episode reward: -121.000, mean reward: -0.992 [-1.000,  0.000], mean action: 0.820 [0.000, 2.000],  loss: 0.654116, mae: 24

 108050/150000: episode: 419, duration: 2.123s, episode steps: 154, steps per second:  73, episode reward: -153.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.909 [0.000, 2.000],  loss: 0.646414, mae: 23.678283, mean_q: -34.400338, mean_eps: 0.352165
 108230/150000: episode: 420, duration: 2.679s, episode steps: 180, steps per second:  67, episode reward: -179.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.878 [0.000, 2.000],  loss: 0.739951, mae: 23.311210, mean_q: -33.857364, mean_eps: 0.351163
 108369/150000: episode: 421, duration: 1.721s, episode steps: 139, steps per second:  81, episode reward: -138.000, mean reward: -0.993 [-1.000,  0.000], mean action: 0.971 [0.000, 2.000],  loss: 0.626727, mae: 23.442874, mean_q: -34.031088, mean_eps: 0.350206
 108475/150000: episode: 422, duration: 1.558s, episode steps: 106, steps per second:  68, episode reward: -105.000, mean reward: -0.991 [-1.000,  0.000], mean action: 1.066 [0.000, 2.000],  loss: 0.620843, mae: 23

 111976/150000: episode: 451, duration: 1.437s, episode steps: 109, steps per second:  76, episode reward: -108.000, mean reward: -0.991 [-1.000,  0.000], mean action: 1.073 [0.000, 2.000],  loss: 0.574288, mae: 23.581548, mean_q: -34.317972, mean_eps: 0.328474
 112116/150000: episode: 452, duration: 1.971s, episode steps: 140, steps per second:  71, episode reward: -139.000, mean reward: -0.993 [-1.000,  0.000], mean action: 0.993 [0.000, 2.000],  loss: 0.674833, mae: 23.534855, mean_q: -34.183983, mean_eps: 0.327727
 112267/150000: episode: 453, duration: 2.124s, episode steps: 151, steps per second:  71, episode reward: -150.000, mean reward: -0.993 [-1.000,  0.000], mean action: 1.066 [0.000, 2.000],  loss: 0.604503, mae: 23.321090, mean_q: -33.887898, mean_eps: 0.326854
 112369/150000: episode: 454, duration: 1.377s, episode steps: 102, steps per second:  74, episode reward: -101.000, mean reward: -0.990 [-1.000,  0.000], mean action: 1.088 [0.000, 2.000],  loss: 0.594648, mae: 23

 115971/150000: episode: 483, duration: 1.699s, episode steps: 130, steps per second:  77, episode reward: -129.000, mean reward: -0.992 [-1.000,  0.000], mean action: 0.946 [0.000, 2.000],  loss: 0.571952, mae: 22.840438, mean_q: -33.206687, mean_eps: 0.304567
 116079/150000: episode: 484, duration: 1.464s, episode steps: 108, steps per second:  74, episode reward: -107.000, mean reward: -0.991 [-1.000,  0.000], mean action: 1.019 [0.000, 2.000],  loss: 0.638826, mae: 23.283588, mean_q: -33.826215, mean_eps: 0.303853
 116208/150000: episode: 485, duration: 1.723s, episode steps: 129, steps per second:  75, episode reward: -128.000, mean reward: -0.992 [-1.000,  0.000], mean action: 1.070 [0.000, 2.000],  loss: 0.643450, mae: 22.785659, mean_q: -33.059192, mean_eps: 0.303142
 116323/150000: episode: 486, duration: 1.522s, episode steps: 115, steps per second:  76, episode reward: -114.000, mean reward: -0.991 [-1.000,  0.000], mean action: 0.861 [0.000, 2.000],  loss: 0.616342, mae: 23

 120211/150000: episode: 515, duration: 1.244s, episode steps:  95, steps per second:  76, episode reward: -94.000, mean reward: -0.989 [-1.000,  0.000], mean action: 0.895 [0.000, 2.000],  loss: 0.654922, mae: 22.895721, mean_q: -33.183960, mean_eps: 0.279022
 120316/150000: episode: 516, duration: 1.444s, episode steps: 105, steps per second:  73, episode reward: -104.000, mean reward: -0.990 [-1.000,  0.000], mean action: 0.971 [0.000, 2.000],  loss: 0.720821, mae: 22.983417, mean_q: -33.328052, mean_eps: 0.278422
 120421/150000: episode: 517, duration: 1.443s, episode steps: 105, steps per second:  73, episode reward: -104.000, mean reward: -0.990 [-1.000,  0.000], mean action: 1.057 [0.000, 2.000],  loss: 0.701195, mae: 22.792458, mean_q: -33.083661, mean_eps: 0.277792
 120527/150000: episode: 518, duration: 1.478s, episode steps: 106, steps per second:  72, episode reward: -105.000, mean reward: -0.991 [-1.000,  0.000], mean action: 0.774 [0.000, 2.000],  loss: 0.617150, mae: 23.

 124068/150000: episode: 547, duration: 1.160s, episode steps:  90, steps per second:  78, episode reward: -89.000, mean reward: -0.989 [-1.000,  0.000], mean action: 1.233 [0.000, 2.000],  loss: 0.699292, mae: 22.742907, mean_q: -32.955969, mean_eps: 0.255865
 124262/150000: episode: 548, duration: 2.427s, episode steps: 194, steps per second:  80, episode reward: -193.000, mean reward: -0.995 [-1.000,  0.000], mean action: 1.278 [0.000, 2.000],  loss: 0.721473, mae: 22.885147, mean_q: -33.200352, mean_eps: 0.255013
 124397/150000: episode: 549, duration: 1.731s, episode steps: 135, steps per second:  78, episode reward: -134.000, mean reward: -0.993 [-1.000,  0.000], mean action: 1.259 [0.000, 2.000],  loss: 0.726418, mae: 22.654337, mean_q: -32.870471, mean_eps: 0.254026
 124510/150000: episode: 550, duration: 1.417s, episode steps: 113, steps per second:  80, episode reward: -112.000, mean reward: -0.991 [-1.000,  0.000], mean action: 1.221 [0.000, 2.000],  loss: 0.671162, mae: 22.

 127994/150000: episode: 579, duration: 1.105s, episode steps:  79, steps per second:  71, episode reward: -78.000, mean reward: -0.987 [-1.000,  0.000], mean action: 0.924 [0.000, 2.000],  loss: 0.659322, mae: 22.595931, mean_q: -32.874423, mean_eps: 0.232276
 128114/150000: episode: 580, duration: 1.851s, episode steps: 120, steps per second:  65, episode reward: -119.000, mean reward: -0.992 [-1.000,  0.000], mean action: 0.917 [0.000, 2.000],  loss: 0.781931, mae: 22.726148, mean_q: -32.981393, mean_eps: 0.231679
 128198/150000: episode: 581, duration: 1.095s, episode steps:  84, steps per second:  77, episode reward: -83.000, mean reward: -0.988 [-1.000,  0.000], mean action: 1.000 [0.000, 2.000],  loss: 0.697991, mae: 22.493397, mean_q: -32.619364, mean_eps: 0.231067
 128304/150000: episode: 582, duration: 1.309s, episode steps: 106, steps per second:  81, episode reward: -105.000, mean reward: -0.991 [-1.000,  0.000], mean action: 0.972 [0.000, 2.000],  loss: 0.650960, mae: 22.6

 131675/150000: episode: 611, duration: 1.296s, episode steps: 101, steps per second:  78, episode reward: -100.000, mean reward: -0.990 [-1.000,  0.000], mean action: 0.693 [0.000, 2.000],  loss: 0.650319, mae: 22.571280, mean_q: -32.685423, mean_eps: 0.210256
 131827/150000: episode: 612, duration: 1.893s, episode steps: 152, steps per second:  80, episode reward: -151.000, mean reward: -0.993 [-1.000,  0.000], mean action: 0.783 [0.000, 2.000],  loss: 0.568133, mae: 22.542762, mean_q: -32.698610, mean_eps: 0.209497
 132008/150000: episode: 613, duration: 2.256s, episode steps: 181, steps per second:  80, episode reward: -180.000, mean reward: -0.994 [-1.000,  0.000], mean action: 0.746 [0.000, 2.000],  loss: 0.697759, mae: 22.561601, mean_q: -32.700209, mean_eps: 0.208498
 132152/150000: episode: 614, duration: 1.672s, episode steps: 144, steps per second:  86, episode reward: -143.000, mean reward: -0.993 [-1.000,  0.000], mean action: 0.972 [0.000, 2.000],  loss: 0.725325, mae: 22

 135389/150000: episode: 643, duration: 1.855s, episode steps: 128, steps per second:  69, episode reward: -127.000, mean reward: -0.992 [-1.000,  0.000], mean action: 1.023 [0.000, 2.000],  loss: 0.624339, mae: 22.497711, mean_q: -32.645699, mean_eps: 0.188053
 135519/150000: episode: 644, duration: 1.928s, episode steps: 130, steps per second:  67, episode reward: -129.000, mean reward: -0.992 [-1.000,  0.000], mean action: 1.069 [0.000, 2.000],  loss: 0.577864, mae: 22.559071, mean_q: -32.709139, mean_eps: 0.187279
 135603/150000: episode: 645, duration: 1.278s, episode steps:  84, steps per second:  66, episode reward: -83.000, mean reward: -0.988 [-1.000,  0.000], mean action: 0.988 [0.000, 2.000],  loss: 0.628098, mae: 22.434583, mean_q: -32.504304, mean_eps: 0.186637
 135687/150000: episode: 646, duration: 1.253s, episode steps:  84, steps per second:  67, episode reward: -83.000, mean reward: -0.988 [-1.000,  0.000], mean action: 1.036 [0.000, 2.000],  loss: 0.669871, mae: 22.7

 139112/150000: episode: 675, duration: 1.646s, episode steps: 130, steps per second:  79, episode reward: -129.000, mean reward: -0.992 [-1.000,  0.000], mean action: 1.231 [0.000, 2.000],  loss: 0.734915, mae: 22.168101, mean_q: -32.093348, mean_eps: 0.165721
 139208/150000: episode: 676, duration: 1.192s, episode steps:  96, steps per second:  81, episode reward: -95.000, mean reward: -0.990 [-1.000,  0.000], mean action: 1.021 [0.000, 2.000],  loss: 0.668885, mae: 22.243820, mean_q: -32.293613, mean_eps: 0.165043
 139347/150000: episode: 677, duration: 1.773s, episode steps: 139, steps per second:  78, episode reward: -138.000, mean reward: -0.993 [-1.000,  0.000], mean action: 1.165 [0.000, 2.000],  loss: 0.665309, mae: 22.408902, mean_q: -32.471414, mean_eps: 0.164338
 139520/150000: episode: 678, duration: 2.274s, episode steps: 173, steps per second:  76, episode reward: -172.000, mean reward: -0.994 [-1.000,  0.000], mean action: 1.156 [0.000, 2.000],  loss: 0.643340, mae: 22.

 142390/150000: episode: 707, duration: 1.452s, episode steps: 118, steps per second:  81, episode reward: -117.000, mean reward: -0.992 [-1.000,  0.000], mean action: 0.780 [0.000, 2.000],  loss: 0.736946, mae: 22.330822, mean_q: -32.335937, mean_eps: 0.146017
 142489/150000: episode: 708, duration: 1.268s, episode steps:  99, steps per second:  78, episode reward: -98.000, mean reward: -0.990 [-1.000,  0.000], mean action: 0.909 [0.000, 2.000],  loss: 0.651389, mae: 22.475500, mean_q: -32.568982, mean_eps: 0.145366
 142573/150000: episode: 709, duration: 1.042s, episode steps:  84, steps per second:  81, episode reward: -83.000, mean reward: -0.988 [-1.000,  0.000], mean action: 0.940 [0.000, 2.000],  loss: 0.722776, mae: 22.071967, mean_q: -31.973662, mean_eps: 0.144817
 142673/150000: episode: 710, duration: 1.248s, episode steps: 100, steps per second:  80, episode reward: -99.000, mean reward: -0.990 [-1.000,  0.000], mean action: 0.740 [0.000, 2.000],  loss: 0.636538, mae: 22.22

 145742/150000: episode: 739, duration: 1.840s, episode steps: 122, steps per second:  66, episode reward: -121.000, mean reward: -0.992 [-1.000,  0.000], mean action: 0.590 [0.000, 2.000],  loss: 0.629177, mae: 22.734914, mean_q: -32.894172, mean_eps: 0.125917
 145832/150000: episode: 740, duration: 1.044s, episode steps:  90, steps per second:  86, episode reward: -89.000, mean reward: -0.989 [-1.000,  0.000], mean action: 0.956 [0.000, 2.000],  loss: 0.610255, mae: 22.711421, mean_q: -32.903394, mean_eps: 0.125281
 145927/150000: episode: 741, duration: 1.123s, episode steps:  95, steps per second:  85, episode reward: -94.000, mean reward: -0.989 [-1.000,  0.000], mean action: 0.758 [0.000, 2.000],  loss: 0.590614, mae: 22.528282, mean_q: -32.642552, mean_eps: 0.124726
 146062/150000: episode: 742, duration: 1.648s, episode steps: 135, steps per second:  82, episode reward: -134.000, mean reward: -0.993 [-1.000,  0.000], mean action: 0.615 [0.000, 2.000],  loss: 0.734329, mae: 22.3

 148725/150000: episode: 771, duration: 0.955s, episode steps:  86, steps per second:  90, episode reward: -85.000, mean reward: -0.988 [-1.000,  0.000], mean action: 1.198 [0.000, 2.000],  loss: 0.683938, mae: 22.256199, mean_q: -32.188518, mean_eps: 0.107911
 148810/150000: episode: 772, duration: 0.947s, episode steps:  85, steps per second:  90, episode reward: -84.000, mean reward: -0.988 [-1.000,  0.000], mean action: 1.082 [0.000, 2.000],  loss: 0.612877, mae: 22.238542, mean_q: -32.176607, mean_eps: 0.107398
 148901/150000: episode: 773, duration: 1.041s, episode steps:  91, steps per second:  87, episode reward: -90.000, mean reward: -0.989 [-1.000,  0.000], mean action: 1.176 [0.000, 2.000],  loss: 0.619375, mae: 22.170071, mean_q: -32.067989, mean_eps: 0.106870
 148969/150000: episode: 774, duration: 0.787s, episode steps:  68, steps per second:  86, episode reward: -67.000, mean reward: -0.985 [-1.000,  0.000], mean action: 1.103 [0.000, 2.000],  loss: 0.719771, mae: 22.269

<keras.callbacks.History at 0x18403f787f0>
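
The `verbose=2` output is easier to judge when the episode rewards are extracted from it. Below is a small sketch that parses log lines with a regular expression; the two sample lines are shortened copies of lines from the output above, and the pattern is an assumption based on that format.

```python
import re

# Shortened copies of two lines from the training output above
log = """\
  33410/150000: episode: 69, duration: 4.127s, episode steps: 325, steps per second:  79, episode reward: -324.000
 148901/150000: episode: 773, duration: 1.041s, episode steps:  91, steps per second:  87, episode reward: -90.000
"""

# Capture the episode number and the episode reward from each line
pattern = re.compile(r"episode:\s*(\d+),.*?episode reward:\s*(-?\d+\.\d+)")
rewards = {int(m.group(1)): float(m.group(2)) for m in pattern.finditer(log)}

print(rewards)  # {69: -324.0, 773: -90.0}
```

In the log above, the episode reward rises from -500 in the first episodes to values around -85 near the end of training, while `mean_eps` drops towards 0.1.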

**TASK: Evaluate the model**

In [14]:
dqn.test(env, nb_episodes=5, visualize=True)
env.close()

Testing for 5 episodes ...
Episode 1: reward: -500.000, steps: 500
Episode 2: reward: -114.000, steps: 115
Episode 3: reward: -500.000, steps: 500
Episode 4: reward: -500.000, steps: 500
Episode 5: reward: -500.000, steps: 500
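
The test output can be summarized with a few lines of plain Python. This sketch uses the rewards and step counts printed above and assumes the 500-step episode cap that is visible in the logs.

```python
# Rewards and step counts copied from the test output above
rewards = [-500.0, -114.0, -500.0, -500.0, -500.0]
steps = [500, 115, 500, 500, 500]
max_steps = 500  # episode length cap seen in the logs

mean_reward = sum(rewards) / len(rewards)
# Episodes that ended before the cap reached the goal
solved = sum(s < max_steps for s in steps)

print(mean_reward, solved)  # -422.8 1
```

Only one of the five test episodes finished before the cap, so a larger `nb_episodes` would give a more reliable picture of the agent's performance.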
