# Keras-RL2 DQN - CartPole

DQN agents are usually not programmed manually, as was done in `03_1_DQN_Manual_Cartpole.ipynb`. Instead, abstraction libraries are used on top of OpenAI Gym and Keras. Many such libraries are available; the one used in the course is [Keras RL2](https://github.com/taylormcnally/keras-rl2), which requires TF >= 2.1.

Note that Keras RL2 is essentially Keras RL ported to TensorFlow 2 and that the repository is archived, i.e., no longer developed. However, it offers a good trade-off between abstraction and manual definition, which suits learning and understanding. Keras RL2 cleanly separates:
- the model: we define only one network; the second (target) network is managed internally
- replay memory/buffer: deque or circular array
- policy: e.g., epsilon-greedy with decaying value
- DQN agent: it takes all of the above; it is then trained on the environment
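
To make the replay memory idea concrete, here is a minimal sketch of a deque-based buffer. The names `Transition` and `ReplayBuffer` are hypothetical and chosen for illustration; this is not the Keras RL2 implementation, only the general idea of a bounded buffer with uniform sampling.

```python
import random
from collections import deque, namedtuple

# Hypothetical names for illustration; not part of Keras RL2
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    def __init__(self, limit):
        # A deque with maxlen drops the oldest transition once the limit is reached
        self.buffer = deque(maxlen=limit)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling without replacement
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(limit=3)
for t in range(5):
    buffer.append(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

oldest_state = buffer.buffer[0].state  # transitions 0 and 1 were evicted
batch = buffer.sample(batch_size=2)
```

Sampling uniformly from past transitions breaks the correlation between consecutive steps, which is the main reason a replay buffer is used in DQN.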

Documentation link: [https://keras-rl.readthedocs.io/en/latest/](https://keras-rl.readthedocs.io/en/latest)

The documentation is not very extensive, but the examples are helpful; they are available on GitHub: [https://github.com/taylormcnally/keras-rl2/tree/master/examples](https://github.com/taylormcnally/keras-rl2/tree/master/examples)

Some alternatives would be:
- OpenAI Baselines
- TensorFlow Agents

Overview of contents:
1. Imports and Setup
2. Creating the ANN
3. DQN Agent: Training
4. Test

## 1. Imports and Setup

In [4]:
import time  # to reduce the game speed when playing manually
import gym
from pyglet.window import key  # for manual playing

# Import TF stuff first, because Keras-RL2 is built on TF
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten
from tensorflow.keras.optimizers import Adam

# Now import the Keras-RL2 agent
# See
# https://keras-rl.readthedocs.io/en/latest/agents/overview/#available-agents
# The package is called rl, but it belongs to Keras-RL2
from rl.agents.dqn import DQNAgent  # Use the basic Deep-Q-Network agent

In [2]:
env_name = "CartPole-v0"

In [3]:
env = gym.make(env_name)

In [9]:
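# Random play (no manual input): sample random actions to check that rendering works
obs = env.reset()
for _ in range(300):
    env.render(mode="human")  # render on screen
    random_action = env.action_space.sample()  # random action
    obs, reward, done, info = env.step(random_action)
    if done:
        obs = env.reset()  # start a new episode once the current one ends
env.close()  # close the render window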

## 2. Creating the ANN

In [22]:
# Get number of actions
n_actions = env.action_space.n

In [23]:
n_actions

2

In [24]:
# Get the shape of the observations
n_observations = env.observation_space.shape

In [27]:
# Note it is a tuple of dim 1
# We need Flatten to address that:
# Flatten() takes (None, a, b, c), where None is the batch,
# and it converts it to (None, a*b*c)
# https://keras.io/api/layers/reshaping_layers/flatten/
n_observations

(4,)
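
The effect of `Flatten` on the input can be sketched with plain NumPy. This assumes the agent feeds the model batches of shape `(batch, window_length, 4)`, with `window_length=1` as used in this notebook; `batch_size` is an arbitrary value chosen for illustration.

```python
import numpy as np

window_length = 1          # same value as used for the replay memory in this notebook
observation_shape = (4,)   # CartPole observation shape
batch_size = 32            # arbitrary batch size for illustration

# Assumed input layout: (batch, window_length, 4)
batch = np.zeros((batch_size, window_length) + observation_shape)

# Equivalent of Flatten(): keep the batch axis, merge all remaining axes
flat = batch.reshape(batch.shape[0], -1)
print(batch.shape, "->", flat.shape)
```

This is why the model input shape is written as `(1,) + n_observations`: the leading 1 is the window length, not the batch size.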

In [32]:
# We build the same model as in the previous notebook
# but now, we also use Flatten
model = Sequential()
# Flatten() takes (None, a, b, c), where None is the batch,
# and it converts it to (None, a*b*c)
# https://keras.io/api/layers/reshaping_layers/flatten/
model.add(Flatten(input_shape=(1,) + n_observations))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(32))
model.add(Activation('relu'))
model.add(Dense(n_actions))
# Linear output: Q-values are unbounded, so they should not be clipped by a ReLU
model.add(Activation('linear'))

In [33]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_3 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_6 (Dense)              (None, 16)                80        
_________________________________________________________________
activation_6 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 32)                544       
_________________________________________________________________
activation_7 (Activation)    (None, 32)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 2)                 66        
_________________________________________________________________
activation_8 (Activation)    (None, 2)                 0         
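
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A small sketch (the helper `dense_params` is a hypothetical name used only here):

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units


# Flattened input (4), two hidden layers (16, 32), output (2 actions)
layer_sizes = [4, 16, 32, 2]
counts = [dense_params(i, o) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

The three values match the `Param #` column of the dense layers above; the `Flatten` and `Activation` layers have no trainable parameters.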

## 3. DQN Agent: Training

In [34]:
# Replay Buffer = Sequential Memory
from rl.memory import SequentialMemory

In [36]:
# limit: the size of the deque
# window_length: it starts making sense with images; use 1 for non-visual data
memory = SequentialMemory(limit=20000, window_length=1)

In [49]:
# Policy
# LinearAnnealedPolicy: linear decay
# EpsGreedyQPolicy: with a linearly decaying epsilon, choose exploitation/exploration according to it
from rl.policy import LinearAnnealedPolicy,EpsGreedyQPolicy
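
As a reminder of what an epsilon-greedy policy does, here is a minimal NumPy sketch. The function `eps_greedy_action` is a hypothetical helper, not the Keras RL2 code; it only illustrates the selection rule.

```python
import numpy as np


def eps_greedy_action(q_values, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))  # exploration
    return int(np.argmax(q_values))              # exploitation


rng = np.random.default_rng(0)
q_values = np.array([0.2, 1.5])

# eps=0: always the action with the highest Q-value
greedy = eps_greedy_action(q_values, eps=0.0, rng=rng)
# eps=1: always a uniformly random action
explore = [eps_greedy_action(q_values, eps=1.0, rng=rng) for _ in range(100)]
```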

In [38]:
# Policy of action choice
# We use the epsilon-greedy policy, as always
# Random (exploration) or best (exploitation) action chosen
# depending on epsilon, which decreases linearly from value_max to value_min.
# value_test: evaluation is performed at a fixed epsilon (should be small: exploitation)
# nb_steps: number of steps over which epsilon is annealed; here it matches our sequential memory size
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                             attr='eps',
                             value_max=1.0,
                             value_min=0.1,
                             value_test=0.05,
                             nb_steps=20000)
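
The linear annealing schedule can be sketched in a few lines. This is a re-implementation under the assumption that epsilon follows a straight line from `value_max` to `value_min` over `nb_steps` and is clamped afterwards; `annealed_eps` is a hypothetical helper, not a library function.

```python
def annealed_eps(step, value_max=1.0, value_min=0.1, nb_steps=20000):
    # Straight line from value_max down to value_min, clamped at value_min
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, value_max + slope * step)


eps_start = annealed_eps(0)        # beginning of training
eps_mid = annealed_eps(10000)      # halfway through the annealing
eps_end = annealed_eps(20000)      # end of the annealing
eps_after = annealed_eps(50000)    # stays at value_min afterwards

# Average epsilon over the first 10000 steps
mean_first_interval = sum(annealed_eps(s) for s in range(10000)) / 10000
```

Under this assumption, the average epsilon over the first 10000 steps is about 0.775, which is consistent with the `mean_eps` value reported in the training log of this notebook.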

In [40]:
# DQN Agent
# We now pass all elements we have to the agent;
# beforehand, we coded all that manually, not anymore.
# nb_steps_warmup: our burn_in = how many steps are collected before learning (weight updates) starts
# target_model_update: if >= 1, the weights of the frozen (target) model are replaced every that many steps;
# if < 1, it is used as the factor of a soft update
dqn = DQNAgent(model=model,
              nb_actions=n_actions,
              memory=memory,
              nb_steps_warmup=10,
              target_model_update=100,
              policy=policy)
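
The two usual ways of updating a frozen target network can be sketched with NumPy arrays standing in for the weights. The helpers `hard_update` and `soft_update` are hypothetical names for illustration, not the library's internals.

```python
import numpy as np


def hard_update(online_weights, target_weights, step, every):
    # Copy the online weights into the target network every `every` steps
    if step % every == 0:
        return [w.copy() for w in online_weights]
    return target_weights


def soft_update(online_weights, target_weights, tau):
    # Move the target weights a small fraction tau towards the online weights
    return [tau * w + (1.0 - tau) * t for w, t in zip(online_weights, target_weights)]


online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]

target_unchanged = hard_update(online, target, step=50, every=100)
target_copied = hard_update(online, target, step=100, every=100)
target_soft = soft_update(online, target, tau=0.01)
```

With a value of 100, as in this notebook, the hard variant applies: the target network stays fixed for 100 steps and is then replaced by a copy of the online network.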

In [41]:
# Compile the Agent
# We need to pass the optimizer for the model and the metric(s)
# 'mae': Mean Absolute Error
dqn.compile(Adam(learning_rate=1e-3),metrics=['mae'])

In [43]:
# Train
# Note that it takes much less time than in the manual case
# nb_steps: total number of environment steps (not episodes)
dqn.fit(env,nb_steps=20000,visualize=False,verbose=1)

Training for 20000 steps ...
Interval 1 (0 steps performed)
238 episodes - episode_reward: 41.815 [8.000, 200.000] - loss: 34.565 - mae: 78.555 - mean_q: 160.273 - mean_eps: 0.775

Interval 2 (10000 steps performed)
done, took 99.197 seconds


<keras.callbacks.History at 0x27eec61f348>

In [48]:
# Save model weights in HDF5 format
dqn.save_weights(f'dqn_{env_name}_krl2_weights.h5f',overwrite=True)

## 4. Test

In [45]:
dqn.test(env,nb_episodes=5,visualize=True)
env.close()

Testing for 5 episodes ...
Episode 1: reward: 182.000, steps: 182
Episode 2: reward: 169.000, steps: 169
Episode 3: reward: 132.000, steps: 132
Episode 4: reward: 134.000, steps: 134
Episode 5: reward: 119.000, steps: 119
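
A quick way to summarize the test run is to average the episode rewards listed above. The values are copied from the output; nothing else is assumed.

```python
# Episode rewards copied from the test output above
rewards = [182.0, 169.0, 132.0, 134.0, 119.0]

mean_reward = sum(rewards) / len(rewards)
print(mean_reward, min(rewards), max(rewards))
```

The mean is well below the maximum reward of 200 seen in the training log, so the agent balances the pole for a while but does not yet do so reliably.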
