# Keras-RL2 DQN - Acrobot

This notebook implements a RL agent able to act on the [Acrobot](https://gym.openai.com/envs/Acrobot-v1/) environment: "The acrobot system includes two joints and two links, where the joint between the two links is actuated. Initially, the links are hanging downwards, and **the goal is to swing the end of the lower link up to a given height.**"

The Github repository: [Acrobot @ Github](https://github.com/openai/gym/blob/master/gym/envs/classic_control/acrobot.py)

From Github, we know:
- State: `[cos(theta1) sin(theta1) cos(theta2) sin(theta2) thetaDot1 thetaDot2]`
- The action is either applying +1, 0 or -1 torque on the joint between the two pendulum links.

The implementation is basically a copy of the contents in the notebook

`03_2_DQN_KerasRL2_Cartpole.ipynb`

Overview of sections:
1. Imports and Setup
2. Creating the ANN
3. DQN Agent: Training
4. Test & Use

## 1. Imports and Setup

In [1]:
import time  # to reduce the game speed when playing manually
import numpy as np
import gym
from pyglet.window import key  # for manual playing

# Import TF stuff first, because Keras-RL2 is built on TF
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten
from tensorflow.keras.optimizers import Adam

# Now the import the Keras-rl2 agent
# See
# https://keras-rl.readthedocs.io/en/latest/agents/overview/#available-agents
# It is called rl, but it belongs to Keras-RL2
from rl.agents.dqn import DQNAgent  # Use the basic Deep-Q-Network agent

  for external in metadata.entry_points().get(self.group, []):


In [5]:
env_name = "Acrobot-v1"

In [6]:
env = gym.make(env_name)

In [7]:
# Manual play
env.reset()
for _ in range(300):
    env.render(mode="human") # render on screen
    random_action = env.action_space.sample() # random action
    env.step(random_action)
env.close() # close

## 2. Creating the ANN

In [8]:
# Get number of actions
n_actions = env.action_space.n

In [9]:
n_actions

3

In [10]:
# Get number of observations
n_observations = env.observation_space.shape

In [11]:
# Note it is a tuple of dim 1
# We need Flatten to address that:
# Flatten() takes (None, a, b, c), where None is the batch,
# and it converts it to (None, a*b*c)
# https://keras.io/api/layers/reshaping_layers/flatten/
n_observations

(6,)

In [12]:
# Similar model as before, but with 64 units in each of the three layers
model = Sequential()
# Flatten() takes (None, a, b, c), where None is the batch,
# and it converts it to (None, a*b*c)
# https://keras.io/api/layers/reshaping_layers/flatten/
model.add(Flatten(input_shape=(1,) + n_observations))
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(n_actions))
model.add(Activation('relu'))

In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 6)                 0         
_________________________________________________________________
dense (Dense)                (None, 64)                448       
_________________________________________________________________
activation (Activation)      (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_1 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_2 (Activation)    (None, 64)                0

## 3. DQN Agent: Training

In [14]:
# Replay Buffer = Sequential Memory
from rl.memory import SequentialMemory

In [15]:
# limit: the size of the deque
# window_length: it starts making sense with images; use 1 for non-visual data
memory = SequentialMemory(limit=50000, window_length=1)

In [16]:
# Policy
# LinearAnnealedPolicy: linear decay
# EpsGreedyQPolicy: with a linearly decaying epsilon, choose exploitation/exploration according to it
from rl.policy import LinearAnnealedPolicy,EpsGreedyQPolicy

In [18]:
# Policy of action choice
# We use the epsilon-greedy policy, as always
# Random (exploration) or best (exploitation) action chosen
# depending on epsilon in [value_min, value_max], decreased by steps.
# value_test: evaluation can be performed at a fixed epsilon (should be small: exploitation)
# nb_steps: we match our sequential memory size
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                             attr='eps',
                             value_max=1.0,
                             value_min=0.1,
                             value_test=0.05,
                             nb_steps=150000)

In [19]:
# DQN Agent
# We now pass all elements we have to the agent;
# beforehand, we coded all that manually, not anymore.
# nb_steps_warmup: our burn_in = how many steps before epsilon starts decreasing
# target_model_update: every how many epochs do we update the weights of the frozen model
# Optional: batch_size, gamma
dqn = DQNAgent(model=model,
              nb_actions=n_actions,
              memory=memory,
              nb_steps_warmup=1000,
              target_model_update=1000,
              batch_size=32,
              gamma=0.99, 
              policy=policy)

In [20]:
# Compile the Agent
# We need to pass the optimizer for the model and the metric(s)
# 'mae': Mean Absolute Error
dqn.compile(Adam(learning_rate=1e-3),metrics=['mae'])

In [None]:
# Train
# Note that it takes much much less than in the manual case, because it's optimized!
# nb_steps: episodes
dqn.fit(env,nb_steps=150000,visualize=False,verbose=1)

Training for 150000 steps ...
Interval 1 (0 steps performed)
  265/10000 [..............................] - ETA: 5s - reward: -1.0000



20 episodes - episode_reward: -500.000 [-500.000, -500.000] - loss: 0.483 - mae: 0.327 - mean_q: 0.000 - mean_eps: 0.967

Interval 2 (10000 steps performed)
20 episodes - episode_reward: -493.400 [-500.000, -368.000] - loss: 0.500 - mae: 0.333 - mean_q: 0.000 - mean_eps: 0.910

Interval 3 (20000 steps performed)
20 episodes - episode_reward: -495.050 [-500.000, -401.000] - loss: 0.500 - mae: 0.333 - mean_q: 0.000 - mean_eps: 0.850

Interval 4 (30000 steps performed)
20 episodes - episode_reward: -494.850 [-500.000, -397.000] - loss: 0.500 - mae: 0.333 - mean_q: 0.000 - mean_eps: 0.790

Interval 5 (40000 steps performed)
20 episodes - episode_reward: -500.000 [-500.000, -500.000] - loss: 0.500 - mae: 0.333 - mean_q: 0.000 - mean_eps: 0.730

Interval 6 (50000 steps performed)
20 episodes - episode_reward: -500.000 [-500.000, -500.000] - loss: 0.500 - mae: 0.333 - mean_q: 0.000 - mean_eps: 0.670

Interval 7 (60000 steps performed)
20 episodes - episode_reward: -498.850 [-500.000, -477.000

In [None]:
# Save model weights in crompressed format: HDF5
dqn.save_weights(f'dqn_{env_name}_krl2_weights.h5f',overwrite=True)

In [None]:
# Load weights
# Note that we need to create the model and the DQN agent before loading the weights!
dqn.load_weights(f'dqn_{env_name}_krl2_weights.h5f')

## 4. Test & Use

In [None]:
# Test
dqn.test(env,nb_episodes=5,visualize=True)
env.close()

In [None]:
# Use the model to carry out actions without Keras-RL2, only with the model
observation = env.reset()
for counter in range(2000):
    env.render()
    print()
    action = np.argmax(model.predict(observation.reshape((1,1,6))))
    observation, reward, done, info = env.step(action)
    if done:
        #pass
        #print('done')
        break
env.close()