___

<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Copyright by Pierian Data Inc.</em></center>
<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>

# Keras-RL DQN Model


## Introduction to keras-rl(2)

In this notebook we will build our first reinforcement learning agent with keras-rl,
using a simple task from OpenAI Gym: the *CartPole* example.

First we import all necessary packages:

In [1]:
import time  # to reduce the game speed when playing manually

import gym  # Contains the game we want to play
from pyglet.window import key  # for manual playing

# import necessary blocks from keras to build the Deep Learning backbone of our agent
from keras.models import Sequential  # To compose multiple Layers
from keras.layers import Dense  # Fully-Connected layer
from keras.layers import Activation  # Activation functions
from keras.layers import Flatten  # Flatten function

from keras.optimizers import Adam  # Adam optimizer

# Now the keras-rl2 agent. Don't get confused: the package is imported as rl, not keras-rl

from rl.agents.dqn import DQNAgent  # Use the basic Deep-Q-Network agent

Now we will create the environment:

In [2]:
# https://stackoverflow.com/questions/56904270/difference-between-openai-gym-environments-cartpole-v0-and-cartpole-v1
env_name = ENV_NAME = 'CartPole-v0'  # https://gym.openai.com/envs/CartPole-v1/
env = gym.make(env_name)  # create the environment
nb_actions = env.action_space.n  # get the number of possible actions
print(nb_actions)  # Cartpole has only two possible actions: Either move left or right

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
2




Let's watch what the game looks like when we choose random actions, or to be precise, when we randomly move left and right.

In [3]:
env.reset()  # reset the environment to the initial state
for _ in range(200):  # play for max 200 iterations
    env.render(mode="human")  # render the current game state on your screen
    random_action = env.action_space.sample()  # choose a random action
    env.step(random_action)  # execute that action
env.close()  # close the environment

WARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.
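
The warning above appears because the loop keeps calling `step()` after an episode has already ended. A minimal sketch of the fix is to check the `done` flag and call `reset()` before continuing. So that it runs without a display, the sketch uses a hypothetical stand-in class `ToyEnv` that only mimics the four-value `step()` return used in this notebook. With the real environment you would apply the same `if done:` check to `env`.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (not part of gym)."""

    def __init__(self, max_steps=10, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # dummy observation

    def sample_action(self):
        return self.rng.randint(0, 1)  # 0 = left, 1 = right

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_steps  # the episode ends after max_steps
        return 0.0, 1.0, done, {}  # observation, reward, done, info


def random_rollout(env, total_steps):
    """Take random actions and reset whenever an episode ends."""
    episodes_finished = 0
    env.reset()
    for _ in range(total_steps):
        _, _, done, _ = env.step(env.sample_action())
        if done:  # never step a finished episode; start a new one instead
            episodes_finished += 1
            env.reset()
    return episodes_finished


finished = random_rollout(ToyEnv(max_steps=10), total_steps=200)
print(finished)
```

Because every toy episode lasts exactly 10 steps, the 200 steps are split into 20 complete episodes and `step()` is never called on a finished one.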


Now it is time for you to try your luck! Control the cart with the left and right arrow keys.

In [4]:
action = 0
def key_press(k, mod):
    '''
    This function gets the key press for gym
    '''
    global action
    if k == key.LEFT:
        action = 0
    if k == key.RIGHT:
        action = 1

env.reset()
rewards = 0
for _ in range(1000):
    env.render(mode="human")
    env.viewer.window.on_key_press = key_press  # update the key press
    observation, reward, done, info = env.step(action)
    rewards+=1
    if done:
        print(f"You got {rewards} points!")
        break
    time.sleep(0.1)  # reduce speed a little bit
env.close()


AttributeError: 'NoneType' object has no attribute 'set_current'

Let us build a deep neural network and see if it can beat our score.

We use the same simple model: two hidden layers with 16 and 32 neurons, each followed by a ReLU activation.

The output layer has 2 nodes, one for each action.

In [8]:
model = Sequential()
# https://keras.io/api/layers/reshaping_layers/flatten/
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))

model.add(Dense(16))
model.add(Activation('relu'))

model.add(Dense(32))
model.add(Activation('relu'))

model.add(Dense(nb_actions))
model.add(Activation('linear'))

model.summary()  # summary() prints the table itself, so no print() is needed

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 4)                 0         
                                                                 
 dense (Dense)               (None, 16)                80        
                                                                 
 activation (Activation)     (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 32)                544       
                                                                 
 activation_1 (Activation)   (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 2)                 66        
                                                                 
 activation_2 (Activation)   (None, 2)                 0
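
The `Param #` column can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per unit. The short sketch below reproduces the three `Dense` counts from the summary. The helper name `dense_params` is ours and is not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Flattened observation (4 values), two hidden layers, one output per action
layer_sizes = [4, 16, 32, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [80, 544, 66]
print(total)   # 690
```

The `Flatten` and `Activation` layers have no trainable parameters, which is why they show 0 in the summary.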

Let's create the DQN agent from keras-rl.
For this setting, the agent takes the following parameters:

1. model = The model
2. nb_actions = The number of actions (2 in this case)
3. memory = The action replay memory. You can choose between *SequentialMemory()* and *EpisodeParameterMemory()*, which is only used for one RL agent called CEM
4. nb_steps_warmup = The number of initial steps without training, used to fill the memory
5. target_model_update = How often the target model is updated
6. policy = The action selection policy. You can choose between *LinearAnnealedPolicy()*, *SoftmaxPolicy()*, *EpsGreedyQPolicy()*, *GreedyQPolicy()*, *MaxBoltzmannQPolicy()* and *BoltzmannGumbelQPolicy()*. We use all of them during the next notebooks, but feel free to try them out and inspect which works best here

There are some more parameters you can pass to the DQN agent. Feel free to explore them, but we will also take a look at them together in the remaining notebooks.
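
To make the idea behind the action selection policies more concrete, here is a rough sketch of epsilon-greedy selection, the strategy behind *EpsGreedyQPolicy()*. It is an illustration in plain Python, not keras-rl's own code, and the function name `eps_greedy_action` and the Q-values are made up for the example.

```python
import random


def eps_greedy_action(q_values, eps, rng=random):
    """Pick a random action with probability eps, else the highest-valued one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


q_values = [0.2, 0.7]  # hypothetical Q-values for "left" (0) and "right" (1)

# eps = 0 never explores, so the best action is always chosen
greedy = eps_greedy_action(q_values, eps=0.0)

# eps = 1 always explores, so both actions show up over many draws
rng = random.Random(0)
explored = [eps_greedy_action(q_values, eps=1.0, rng=rng) for _ in range(100)]

print(greedy)
print(sorted(set(explored)))
```

With `eps=0.0` the sketch behaves like a purely greedy policy, and with `eps=1.0` like a purely random one. Values in between trade off exploration and exploitation.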

Here we initialize the circular buffer with a limit of 20000 entries and a window length of 1.
The window length is the number of consecutive observations that are stacked together to form one state.
This will be demonstrated in the next lecture, when we start dealing with images.


In [9]:
from rl.memory import SequentialMemory  # Sequential memory for storing observations (an optimized circular buffer)

memory = SequentialMemory(limit=20000, window_length=1)
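
To illustrate what "circular buffer with a limit" means, here is a heavily simplified sketch built on `collections.deque`. The class `TinyReplayMemory` is hypothetical and is not how `SequentialMemory` is implemented. It only shows that the oldest entries are dropped once the limit is reached and that training batches are drawn at random.

```python
import random
from collections import deque


class TinyReplayMemory:
    """Hypothetical, simplified stand-in for a replay memory (not rl.memory)."""

    def __init__(self, limit):
        self.buffer = deque(maxlen=limit)  # oldest entries are dropped when full

    def append(self, observation, action, reward, done):
        self.buffer.append((observation, action, reward, done))

    def sample(self, batch_size, rng=random):
        # Draw distinct stored transitions uniformly at random
        return rng.sample(list(self.buffer), batch_size)


memory_sketch = TinyReplayMemory(limit=3)
for step in range(5):
    memory_sketch.append(observation=step, action=0, reward=1.0, done=False)

stored = [entry[0] for entry in memory_sketch.buffer]
batch = memory_sketch.sample(batch_size=2)

print(stored)  # [2, 3, 4]
print(len(batch))
```

After five appends with a limit of 3, only the three most recent observations remain in the buffer.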


Then we define the action selection policy. <br />
We use *LinearAnnealedPolicy* to implement an epsilon-greedy strategy with a linearly decaying epsilon. <br />
*LinearAnnealedPolicy* takes an inner policy, the name of the attribute to anneal, that attribute's maximum and minimum values, and the number of steps over which to anneal, and uses them to create a dynamic policy. <br />
During training, epsilon starts at 1.0 and can drop no lower than 0.1. <br />
For evaluation (e.g. running the trained agent) it is fixed to 0.05.


In [10]:
# LinearAnnealedPolicy allows to decay the epsilon for the epsilon greedy strategy
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), 
                              attr='eps',
                              value_max=1.,
                              value_min=.1,
                              value_test=.05,
                              nb_steps=20000) 
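
The schedule this policy follows can be sketched as a plain function: epsilon falls on a straight line from `value_max` to `value_min` over `nb_steps` steps and then stays at `value_min`. This is our own illustration of linear annealing using the values from the cell above, not a copy of keras-rl's code. The function name `annealed_epsilon` and the `training` flag are assumptions made for the example.

```python
def annealed_epsilon(step, value_max=1.0, value_min=0.1, value_test=0.05,
                     nb_steps=20000, training=True):
    """Linearly decay epsilon during training; use a fixed value for testing."""
    if not training:
        return value_test
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


for step in (0, 10000, 20000, 30000):
    print(step, round(annealed_epsilon(step), 4))

print(annealed_epsilon(0, training=False))
```

Halfway through the schedule epsilon is about 0.55, and from step 20000 onwards it stays at the minimum of 0.1.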


Now we create the DQN agent from the defined model (**model**), the possible actions (**nb_actions**; left and right in this case), the circular buffer (**memory**), the burn-in or warm-up phase (**10** steps), how often the target model gets updated (every **100** steps) and the policy (**policy**).


In [11]:
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=100, policy=policy)
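
The role of `target_model_update` can be sketched with plain lists standing in for network weights. To our understanding, keras-rl treats values of 1 or more as a "hard" update (copy the trained weights every that many steps) and values below 1 as a "soft" update (blend a small fraction in at every step). Treat that reading, and the helper `update_target` below, as assumptions for illustration rather than the library's actual code.

```python
def update_target(online_weights, target_weights, step, target_model_update):
    """Return new target weights following a hard or soft update rule."""
    if target_model_update >= 1:
        # Hard update: copy the online weights every `target_model_update` steps
        if step % int(target_model_update) == 0:
            return list(online_weights)
        return list(target_weights)
    # Soft update: move a fraction tau toward the online weights at every step
    tau = target_model_update
    return [tau * w + (1.0 - tau) * t
            for w, t in zip(online_weights, target_weights)]


online = [1.0, 2.0]  # hypothetical weights of the network being trained
target = [0.0, 0.0]  # hypothetical weights of the target network

unchanged = update_target(online, target, step=50, target_model_update=100)
copied = update_target(online, target, step=100, target_model_update=100)
blended = update_target(online, target, step=1, target_model_update=0.5)

print(unchanged)
print(copied)
print(blended)
```

With the setting used in this notebook (`target_model_update=100`), the target network in this sketch only changes on steps that are multiples of 100, which keeps the learning targets stable in between.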



Finally we compile our model with the Adam optimizer and a learning rate of 0.001.<br />
We log the mean absolute error (MAE) as a metric.

In [12]:
# If you get a warning about `lr`, use Adam(learning_rate=1e-3) instead
dqn.compile(Adam(lr=1e-3), metrics=['mae']) 



Now we run the training for 20000 steps. You can set `visualize=True` if you want to watch your model learning, but keep in mind that this increases the running time.
The training takes around 5 minutes, so grab your favorite beverage and stay tuned.


In [15]:
dqn.fit(env, nb_steps=20000, visualize=False, verbose=2)

Training for 20000 steps ...




    35/20000: episode: 1, duration: 0.860s, episode steps:  35, steps per second:  41, episode reward: 35.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.400 [0.000, 1.000],  loss: 0.549003, mae: 0.583047, mean_q: 0.232418, mean_eps: 0.998988
    46/20000: episode: 2, duration: 0.052s, episode steps:  11, steps per second: 213, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.364 [0.000, 1.000],  loss: 0.447209, mae: 0.646520, mean_q: 0.554865, mean_eps: 0.998200
    82/20000: episode: 3, duration: 0.188s, episode steps:  36, steps per second: 191, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.472 [0.000, 1.000],  loss: 0.207129, mae: 0.664731, mean_q: 0.824566, mean_eps: 0.997143
   127/20000: episode: 4, duration: 0.240s, episode steps:  45, steps per second: 187, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  loss: 0.238419, mae: 0.861196, mean_q: 1.161930, mean_ep

<keras.callbacks.History at 0x218ff50cdc0>

Wow! After only a few minutes of training, we achieve great results!
The reason is that keras-rl implements many optimizations (e.g. the optimized replay buffer), which lead to much faster convergence than our hand-written DQN.

In [None]:
# After training is done, we save the final weights.
dqn.save_weights(f'dqn_{env_name}_weights.h5f', overwrite=True)

In [None]:
# Finally, evaluate our algorithm for 5 episodes.
dqn.test(env, nb_episodes=5, visualize=True)
env.close()