___

<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Copyright by Pierian Data Inc.</em></center>
<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>

# Keras-RL DQN Exercise


In this exercise you will implement your first keras-rl agent on the **Acrobot** environment (https://gym.openai.com/envs/Acrobot-v1/). <br />
The goal of this environment is to maneuver the robot arm upwards above the line in as few steps as possible.

**TASK: Import necessary libraries** <br />

In [1]:
import gym
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent

**TASK: Create the environment** <br />
The name is: *Acrobot-v1*

In [2]:
env_name = "Acrobot-v1"
env = gym.make(env_name)

In [3]:
num_actions = env.action_space.n
num_observations = env.observation_space.shape
print(f"Action Space: {num_actions}")
print(f"Observation Space: {num_observations}")

assert num_actions == 3 and num_observations == (6,), "Wrong environment!"

Action Space: 3
Observation Space: (6,)


**TASK: Create the Neural Network for your Deep-Q-Agent** <br />
Take a look at the size of the action space and the size of the observation space.
You are free to choose any architecture you want! <br />
Hint: It already works with three layers, each having 64 neurons.

In [4]:
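model = Sequential()

# keras-rl feeds observations with a leading window dimension, so the
# input shape is (window_length,) + observation shape = (1, 6).
model.add(Flatten(input_shape=(1, ) + num_observations))

# Three hidden layers with 64 units each
model.add(Dense(64))
model.add(Activation("relu"))

model.add(Dense(64))
model.add(Activation("relu"))

model.add(Dense(64))
model.add(Activation("relu"))

# One linear output per action: the predicted Q-value
model.add(Dense(num_actions))
model.add(Activation("linear"))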

In [5]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 6)                 0         
_________________________________________________________________
dense (Dense)                (None, 64)                448       
_________________________________________________________________
activation (Activation)      (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_1 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_2 (Activation)    (None, 64)                0
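
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-output pair plus one bias per unit. The sketch below redoes that arithmetic in plain Python. The helper `dense_params` is a hypothetical name used only for illustration, and the last entry (the output layer with 3 units) is derived from the model code above, since the summary excerpt stops before that layer.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# Flattened observation (6) -> 64 -> 64 -> 64 -> number of actions (3)
layer_sizes = [6, 64, 64, 64, 3]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [448, 4160, 4160, 195]
print(sum(counts))  # 8963
```

The first three values match the `Param #` column shown above.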

**TASK: Initialize the circular buffer**<br />
Make sure you set the limit appropriately (50000 works well)

In [6]:
from rl.memory import SequentialMemory

In [7]:
memory = SequentialMemory(limit=50000, window_length=1)
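
To make the role of `limit` concrete, here is a minimal sketch of a circular replay buffer. It is not keras-rl's `SequentialMemory` implementation; the `RingBuffer` class, its methods, and the placeholder transitions are hypothetical and only illustrate the idea that the oldest experience is overwritten once the limit is reached.

```python
import random
from collections import deque


class RingBuffer:
    """Hypothetical fixed-size replay buffer (illustration only)."""

    def __init__(self, limit):
        # deque drops the oldest item automatically once maxlen is reached
        self.data = deque(maxlen=limit)

    def append(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions
        return rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)


buffer = RingBuffer(limit=3)
for step in range(5):
    # (observation, action, reward, done) placeholders
    buffer.append((step, 0, -1.0, False))

print(len(buffer))        # 3
print(buffer.data[0][0])  # 2, because steps 0 and 1 were overwritten
```

A larger limit keeps older experience around for longer, at the cost of more memory.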

**TASK: Use the epsilon greedy action selection strategy with *decaying* epsilon**

In [8]:
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

In [9]:
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                              attr="eps",
                              value_max=1.0,
                              value_min=0.1,
                              value_test=0.05,
                              nb_steps=50000
                             )
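
The schedule configured above can be sketched in plain Python: epsilon falls linearly from `value_max` to `value_min` over `nb_steps` training steps and then stays at `value_min`. The functions `linear_eps` and `eps_greedy` are hypothetical helpers written for this illustration, not the library's own code, and the Q-values in the example are made up.

```python
import random


def linear_eps(step, value_max=1.0, value_min=0.1, nb_steps=50000):
    """Linearly annealed epsilon, clipped at value_min."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


def eps_greedy(q_values, eps, rng):
    """Random action with probability eps, otherwise the greedy action."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


for step in (0, 25000, 50000, 100000):
    print(step, round(linear_eps(step), 4))

# Average epsilon over training steps 1500..1999
mean_eps = sum(linear_eps(s) for s in range(1500, 2000)) / 500
print(round(mean_eps, 6))

rng = random.Random(0)
print(eps_greedy([-3.0, -1.0, -2.0], eps=0.0, rng=rng))  # 1 (greedy choice)
```

Under these assumptions, the averaged value for steps 1500 to 1999 appears consistent with the `mean_eps` that the training log reports for the episode covering those steps.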

**TASK: Create the DQNAgent** <br />
Feel free to play with the nb_steps_warmup, target_model_update, batch_size and gamma parameters. <br />
Hint:<br />
You can try *nb_steps_warmup*=1000, *target_model_update*=1000, *batch_size*=32 and *gamma*=0.99 as a first guess

In [10]:
dqn = DQNAgent(model=model, nb_actions=num_actions, memory=memory, 
               nb_steps_warmup=1000, target_model_update=1000, policy=policy,
               gamma=0.99, batch_size=32
              )
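
To illustrate what `gamma` controls, the sketch below computes the standard one-step Q-learning target for a single transition. It is a simplified illustration, not keras-rl's internal code: `dqn_target` is a hypothetical helper, the Q-values are made up, and in a real agent `next_q_values` would typically come from the target network that is synchronized according to `target_model_update`.

```python
def dqn_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target for a single transition."""
    if done:
        # No future reward after a terminal state
        return reward
    return reward + gamma * max(next_q_values)


# Example with made-up Q-values for the three Acrobot actions
print(dqn_target(-1.0, [-10.0, -12.0, -11.0], done=False))
print(dqn_target(0.0, [-10.0, -12.0, -11.0], done=True))

# If every step costs -1, the discounted return can never drop below
# -1 / (1 - gamma), which bounds the Q-values from below.
gamma = 0.99
lower_bound = -1.0 / (1.0 - gamma)
print(round(lower_bound, 6))  # -100.0
```

With a reward of -1 per step and `gamma=0.99`, Q-estimates that drift downward during training while staying above roughly -100 are within the expected range.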

**TASK: Compile the model** <br />
Feel free to explore the effects of different optimizers and learning rates.
You can try Adam with a learning rate of 1e-3 as a first guess 

In [11]:
dqn.compile(Adam(learning_rate=1e-3), metrics=["mae"])

**TASK: Fit the model** <br />
150,000 steps should be a very good starting point. Note that the run shown below uses only 50,000 steps.

In [12]:
dqn.fit(env, nb_steps=50000, visualize=False, verbose=2)

Training for 50000 steps ...
   500/50000: episode: 1, duration: 1.266s, episode steps: 500, steps per second: 395, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.960 [0.000, 2.000],  loss: --, mae: --, mean_q: --, mean_eps: --
  1000/50000: episode: 2, duration: 0.713s, episode steps: 500, steps per second: 701, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.020 [0.000, 2.000],  loss: --, mae: --, mean_q: --, mean_eps: --
  1500/50000: episode: 3, duration: 6.053s, episode steps: 500, steps per second:  83, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.998 [0.000, 2.000],  loss: 0.013619, mae: 0.640838, mean_q: -0.881434, mean_eps: 0.977500
  2000/50000: episode: 4, duration: 5.934s, episode steps: 500, steps per second:  84, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.966 [0.000, 2.000],  loss: 0.000578, mae: 0.639652, mean_q: -0.923714, mean_eps: 0.968509
  2500/50000: episode: 5, duration: 5.773s, episode steps: 500, steps per second:  87, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.008 [0.000, 2.000],  loss: 0.007981, mae: 1.396117, mean_q: -2.037532, mean_eps: 0.959509
  3000/50000: episode: 6, duration: 5.676s, episode steps: 500, steps per second:  88, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.994 [0.000, 2.000],  loss: 0.001321, mae: 1.391162, mean_q: -2.052

 17310/50000: episode: 35, duration: 6.061s, episode steps: 500, steps per second:  82, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.994 [0.000, 2.000],  loss: 0.231627, mae: 10.548152, mean_q: -15.606153, mean_eps: 0.692929
 17810/50000: episode: 36, duration: 6.094s, episode steps: 500, steps per second:  82, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.136 [0.000, 2.000],  loss: 0.224074, mae: 10.736222, mean_q: -15.890663, mean_eps: 0.683929
 18205/50000: episode: 37, duration: 4.838s, episode steps: 395, steps per second:  82, episode reward: -394.000, mean reward: -0.997 [-1.000,  0.000], mean action: 1.043 [0.000, 2.000],  loss: 0.269359, mae: 10.930934, mean_q: -16.152099, mean_eps: 0.675874
 18705/50000: episode: 38, duration: 6.136s, episode steps: 500, steps per second:  81, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.022 [0.000, 2.000],  loss: 0.200168, mae: 11.107694, mea

 29899/50000: episode: 67, duration: 6.246s, episode steps: 500, steps per second:  80, episode reward: -500.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.938 [0.000, 2.000],  loss: 0.565819, mae: 15.846997, mean_q: -23.415014, mean_eps: 0.466327
 30311/50000: episode: 68, duration: 5.571s, episode steps: 412, steps per second:  74, episode reward: -411.000, mean reward: -0.998 [-1.000,  0.000], mean action: 1.005 [0.000, 2.000],  loss: 0.485658, mae: 16.143714, mean_q: -23.886650, mean_eps: 0.458119
 30692/50000: episode: 69, duration: 5.142s, episode steps: 381, steps per second:  74, episode reward: -380.000, mean reward: -0.997 [-1.000,  0.000], mean action: 1.021 [0.000, 2.000],  loss: 0.451323, mae: 16.230098, mean_q: -24.025420, mean_eps: 0.450982
 30898/50000: episode: 70, duration: 2.503s, episode steps: 206, steps per second:  82, episode reward: -205.000, mean reward: -0.995 [-1.000,  0.000], mean action: 1.141 [0.000, 2.000],  loss: 0.530631, mae: 16.225981, mea

 39217/50000: episode: 99, duration: 3.984s, episode steps: 295, steps per second:  74, episode reward: -294.000, mean reward: -0.997 [-1.000,  0.000], mean action: 0.814 [0.000, 2.000],  loss: 0.796642, mae: 18.937146, mean_q: -27.946136, mean_eps: 0.296758
 39478/50000: episode: 100, duration: 3.483s, episode steps: 261, steps per second:  75, episode reward: -260.000, mean reward: -0.996 [-1.000,  0.000], mean action: 0.705 [0.000, 2.000],  loss: 0.661888, mae: 19.045137, mean_q: -28.159995, mean_eps: 0.291754
 39675/50000: episode: 101, duration: 2.620s, episode steps: 197, steps per second:  75, episode reward: -196.000, mean reward: -0.995 [-1.000,  0.000], mean action: 0.929 [0.000, 2.000],  loss: 0.635070, mae: 19.051954, mean_q: -28.158097, mean_eps: 0.287632
 39894/50000: episode: 102, duration: 3.247s, episode steps: 219, steps per second:  67, episode reward: -218.000, mean reward: -0.995 [-1.000,  0.000], mean action: 0.968 [0.000, 2.000],  loss: 0.851955, mae: 19.016666, 

 46705/50000: episode: 131, duration: 3.079s, episode steps: 231, steps per second:  75, episode reward: -230.000, mean reward: -0.996 [-1.000,  0.000], mean action: 1.100 [0.000, 2.000],  loss: 0.945790, mae: 20.903299, mean_q: -30.869265, mean_eps: 0.161398
 46914/50000: episode: 132, duration: 2.973s, episode steps: 209, steps per second:  70, episode reward: -208.000, mean reward: -0.995 [-1.000,  0.000], mean action: 1.105 [0.000, 2.000],  loss: 1.068174, mae: 20.950235, mean_q: -30.894279, mean_eps: 0.157438
 47067/50000: episode: 133, duration: 2.178s, episode steps: 153, steps per second:  70, episode reward: -152.000, mean reward: -0.993 [-1.000,  0.000], mean action: 1.261 [0.000, 2.000],  loss: 0.834501, mae: 20.905715, mean_q: -30.813558, mean_eps: 0.154180
 47279/50000: episode: 134, duration: 2.782s, episode steps: 212, steps per second:  76, episode reward: -211.000, mean reward: -0.995 [-1.000,  0.000], mean action: 1.226 [0.000, 2.000],  loss: 0.944045, mae: 20.950751,

<keras.callbacks.History at 0x1dc03f45780>
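
The verbose log is easier to judge once the episode rewards are pulled out of it. The following is a minimal sketch that assumes the log text has been captured as a string in the format printed above; `LINE_RE` and `parse_log` are hypothetical names, and the two sample lines are shortened copies of log lines from this run.

```python
import re

# Matches the episode number, episode length and episode reward of one log line
LINE_RE = re.compile(
    r"episode: (\d+), .*?episode steps:\s*(\d+), .*?episode reward: (-?\d+\.\d+)"
)


def parse_log(text):
    """Return a list of (episode, steps, reward) tuples found in the log text."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            rows.append(
                (int(match.group(1)), int(match.group(2)), float(match.group(3)))
            )
    return rows


sample = """
 18205/50000: episode: 37, duration: 4.838s, episode steps: 395, steps per second:  82, episode reward: -394.000
 30898/50000: episode: 70, duration: 2.503s, episode steps: 206, steps per second:  82, episode reward: -205.000
"""

rows = parse_log(sample)
print(rows)  # [(37, 395, -394.0), (70, 206, -205.0)]
```

An episode reward above -500 means the arm reached the target before the 500-step limit seen in the log.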

**TASK: Evaluate the model**

In [13]:
dqn.test(env, nb_episodes=5, visualize=True)
env.close()

Testing for 5 episodes ...
Episode 1: reward: -500.000, steps: 500
Episode 2: reward: -500.000, steps: 500
Episode 3: reward: -500.000, steps: 500
Episode 4: reward: -500.000, steps: 500
Episode 5: reward: -500.000, steps: 500
