### Deep Q-Learning with Keras and OpenAI

Jay Urbain

This tutorial lab will demonstrate how deep reinforcement learning (deep Q-learning) can be implemented and applied to the *CartPole* game using *Keras* and the *OpenAI Gym*.

This lab consists of four parts:  
- A review of reinforcement learning.   
- Introduction to a deep Q-Learning agent that plays the CartPole game using the OpenAI Gym.   
- Experimentation with machine learning hyperparameters to reduce the solution convergence time. Convergence will be defined as the number of training episodes until the value printed as `e` (the agent's exploration rate, `epsilon`) has decayed to $0.01$, i.e., `e: 0.01` in the print statement below.   
- Extension of the Q-Learning agent to play one of the Atari Pac-Man games in the OpenAI Gym playground.


<h2 id="References"><a href="#References" class="headerlink" title="References"></a>References</h2><ul>
<li><a href="https://arxiv.org/abs/1312.5602" target="_blank" rel="external">Playing Atari with Deep Reinforcement Learning</a></li>
<li><a href="https://www.nature.com/articles/nature14236" target="_blank" rel="external">Human-level Control Through Deep Reinforcement Learning</a></li>
<li><a href="https://github.com/rlcode/reinforcement-learning" target="_blank" rel="external">Reinforcement Learning Examples by RLCode</a></li>
<li><a href="https://ai.intel.com/demystifying-deep-reinforcement-learning/" target="_blank" rel="external">Demystifying Deep Reinforcement Learning Part 1</a></li>
<li><a href="https://ai.intel.com/deep-reinforcement-learning-with-neon/" target="_blank" rel="external">Demystifying Deep Reinforcement Learning with Neon</a></li>    
<li><a href="https://keon.io/" target="_blank" rel="external">Deep Q Learning</a></li>
</ul>

<img src='images/animation.gif' width='400px'></img>


### Reinforcement Learning

<img src='images/rl.png' width=400px></img>

Reinforcement Learning is a type of machine learning where agents learn by interacting with their environment. 

For example, we learn to balance and steer a bike by trial and error. As seen in the picture, the brain represents the AI agent, which acts on the environment. After each action, the agent receives feedback. The feedback consists of the reward and next state of the environment. The reward is usually defined by a human. If we use the analogy of the bicycle, we can define reward as the distance from the original starting point.

In the case of CartPole, the action is a push on the cart to the left or to the right, and the feedback includes the angle of the pole.


### Cartpole Game

Usually, training an agent to play an Atari game can take several hours or even days. So we will build an agent that plays a simpler game called *CartPole*, using the same idea as the DeepMind paper listed in the references.

CartPole is one of the basic environments in OpenAI Gym (a game simulator). As you can see in the animation at the top, the goal of CartPole is to balance a pole connected by a single joint to the top of a moving cart. Instead of pixel information, the state consists of 4 values: the position and velocity of the cart, and the angle and angular velocity of the pole. An agent moves the cart by performing a series of actions, each either $0$ or $1$, pushing it left or right.

Gym makes interacting with the game environment relatively simple.
```
next_state, reward, done, info = env.step(action)
```

An action can be either $0$ or $1$. When we pass an action to `env`, which represents the game environment, it emits the results. `done` is a boolean value telling whether the game has ended or not. The old state, paired with `action`, `next_state` and `reward`, is the information we need for training the agent.
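
To make the idea of collecting training information concrete, here is a minimal sketch of how each `(state, action, reward, next_state, done)` tuple could be stored and sampled later. `ToyEnv` is a hypothetical stand-in that only mimics the four-value `step` interface described above, so the sketch runs without Gym installed; it is not the CartPole physics.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (4-value step API)."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]  # 4 numbers, like CartPole's state

    def step(self, action):
        self.t += 1
        next_state = [float(self.t), 0.0, 0.0, 0.0]
        reward = 1.0
        done = self.t >= self.max_steps  # episode ends after max_steps
        return next_state, reward, done, {}


memory = deque(maxlen=2000)  # oldest transitions are dropped when full
toy_env = ToyEnv()
state = toy_env.reset()
done = False
while not done:
    action = random.randrange(2)  # random action: 0 or 1
    next_state, reward, done, info = toy_env.step(action)
    memory.append((state, action, reward, next_state, done))
    state = next_state

# A random minibatch of stored transitions, as used later for training
minibatch = random.sample(memory, 3)
```

The agent class later in this lab keeps its transitions in the same kind of `deque` and draws minibatches from it with `random.sample`.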

Below is a bare-bones model that just takes random actions for 250 iterations. You should see a window pop up rendering the classic cart-pole problem.

In [10]:
import matplotlib.pyplot as plt
%matplotlib inline 

import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(250):
    env.render()
    env.step(env.action_space.sample()) # take a random action

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.


#### Observations  

If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.

The environment’s `step` function returns exactly what we need. In fact, step returns four values. These are:

- `observation` (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.

- `reward` (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.

- `done` (boolean): whether it’s time to `reset` the environment again. Most (but not all) tasks are divided up into well-defined episodes, and `done` being `True` indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)

- `info` (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an `action`, and the environment returns an `observation` and a `reward`.



<img src='images/aeloop.svg' width=400px></img>

The process gets started by calling reset(), which returns an initial observation. So a more proper way of writing the previous code would be to respect the done flag:

In [2]:
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
[-0.02602502  0.00482096  0.04032394 -0.0392331 ]
[-0.0259286   0.19934216  0.03953928 -0.31892569]
[-0.02194176  0.39387933  0.03316077 -0.59888186]
[-0.01406417  0.58852201  0.02118313 -0.88093788]
[-0.00229373  0.3931188   0.00356437 -0.58167145]
[ 0.00556865  0.58819063 -0.00806906 -0.87322941]
[ 0.01733246  0.78342136 -0.02553365 -1.16843826]
[ 0.03300089  0.97886602 -0.04890241 -1.46901576]
[ 0.05257821  1.17455105 -0.07828273 -1.77656398]
[ 0.07606923  0.98039329 -0.11381401 -1.50921088]
[ 0.09567709  0.78681986 -0.14399822 -1.25411843]
[ 0.11141349  0.98346203 -0.16908059 -1.58821819]
[ 0.13108273  1.18014127 -0.20084496 -1.92850323]
Episode finished after 13 timesteps
[ 0.04984661 -0.01677599 -0.02079201  0.00214552]
[ 0.04951109 -0.21159368 -0.0207491   0.28819648]
[ 0.04527922 -0.40641369 -0.01498517  0.57426379]
[ 0.03715095 -0.21108489 -0.00349989  0.27689802]
[ 0.03

[-0.01334292 -0.38963954 -0.08342155  0.39557574]
[-0.02113571 -0.58348489 -0.07551004  0.6608351 ]
[-0.03280541 -0.77747939 -0.06229333  0.9288188 ]
[-0.048355   -0.97170759 -0.04371696  1.20129361]
[-0.06778915 -0.77604834 -0.01969108  0.89523647]
[-0.08331011 -0.580665   -0.00178636  0.59642947]
[-0.09492341 -0.3855181   0.01014223  0.30318439]
[-0.10263378 -0.58078312  0.01620592  0.59904862]
[-0.11424944 -0.77612802  0.02818689  0.89679182]
[-0.129772   -0.5813993   0.04612273  0.61310062]
[-0.14139998 -0.38695121  0.05838474  0.33529395]
[-0.14913901 -0.19270668  0.06509062  0.06157897]
[-0.15299314 -0.38869857  0.0663222   0.37406739]
[-0.16076711 -0.19457833  0.07380355  0.10301163]
[-0.16465868 -0.39067612  0.07586378  0.41803611]
[-0.1724722  -0.1967066   0.0842245   0.15020127]
[-0.17640633 -0.39292729  0.08722853  0.4682217 ]
[-0.18426488 -0.19913897  0.09659296  0.20425707]
[-0.18824766 -0.00552155  0.1006781  -0.05646105]
[-0.18835809  0.18802352  0.09954888 -0.31575917]


[ 0.09586794  0.41691972 -0.11267185 -0.66057831]
[ 0.10420633  0.22353102 -0.12588342 -0.40539009]
[ 0.10867695  0.42019235 -0.13399122 -0.73495991]
[ 0.1170808   0.61688576 -0.14869042 -1.06663112]
[ 0.12941851  0.81362849 -0.17002304 -1.40204298]
[ 0.14569108  0.62097729 -0.1980639  -1.16697926]
Episode finished after 24 timesteps
[-0.0329458   0.04963472 -0.02457102 -0.03189178]
[-0.03195311 -0.14512641 -0.02520885  0.25293858]
[-0.03485564  0.05034626 -0.02015008 -0.04758788]
[-0.03384871  0.24575126 -0.02110184 -0.34655961]
[-0.02893369  0.05093573 -0.02803303 -0.06060485]
[-0.02791497  0.24644816 -0.02924513 -0.36199891]
[-0.02298601  0.05175383 -0.03648511 -0.07867914]
[-0.02195093 -0.14282661 -0.03805869  0.20227304]
[-0.02480746 -0.33738418 -0.03401323  0.48271151]
[-0.03155515 -0.14179908 -0.024359    0.1795056 ]
[-0.03439113  0.05366282 -0.02076889 -0.12076112]
[-0.03331787 -0.1411555  -0.02318411  0.16529773]
[-0.03614098  0.05429052 -0.01987815 -0.13460803]
[-0.03505517 -

### Spaces

In the examples above, we’ve been sampling random actions from the environment’s action space. But what actually are those actions? Every environment comes with an `action_space` and an `observation_space`. These attributes are of type `Space`, and they describe the format of valid actions and observations:
    

In [3]:
import gym
import gym.spaces

env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Discrete(2)
Box(4,)




The Discrete space allows a fixed range of non-negative numbers, so in this case valid actions are either 0 or 1. The Box space represents an n-dimensional box, so valid observations will be an array of 4 numbers. We can also check the Box’s bounds:

In [4]:
print(env.observation_space.high)
#> [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
print(env.observation_space.low)
#> [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]

[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]


This introspection can be helpful to write generic code that works for many different environments. Box and Discrete are the most common Spaces. You can sample from a Space or check that something belongs to it:

In [5]:
from gym import spaces
space = spaces.Discrete(8) # Set with 8 elements {0, 1, 2, ..., 7}
x = space.sample()
assert space.contains(x)
assert space.n == 8

For CartPole-v0 one of the actions applies force to the left, and one of them applies force to the right. (Can you figure out which is which?)

Fortunately, the better your learning algorithm, the less you’ll have to try to interpret these numbers yourself.

### Deep Reinforcement Learning

In 2013, Google’s DeepMind published its famous paper [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602), which introduced a new algorithm called Deep Q Network (DQN for short). It demonstrated how an AI agent can learn to play games just by observing the screen, without any prior information about those games. The results were impressive. This paper opened the era of what is called ‘deep reinforcement learning’, a mix of deep learning and reinforcement learning.

[Click to Watch: DeepMind’s Atari Player](https://www.youtube.com/watch?v=V1eYniJ0Rnk)
    
<img src='images/atari.png' width=400px></img>

In the Q-Learning algorithm, there is a function called the Q function, which estimates the value of taking an action in a given state. We call it Q(s,a): it calculates the expected future reward obtained by taking action a in state s. Similarly, in the Deep Q Network algorithm, we use a neural network to approximate this function: it takes the state as input and outputs an estimated value for each action. We will discuss how this works in detail.
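
For intuition, here is a minimal tabular sketch of the Q-learning update, using the same target as the `replay` method later in this lab: `reward + gamma * max Q(next_state, a)`, or just `reward` when the episode has ended. The state names `"s0"` and `"s1"`, the table `Q`, and the tabular learning rate `ALPHA` are assumptions made for illustration; the lab itself replaces the table with a neural network.

```python
from collections import defaultdict

GAMMA = 0.95  # discount rate, same value the agent in this lab uses
ALPHA = 0.5   # tabular learning rate (assumed for illustration)

# Q[state] holds one estimated value per action (two actions: 0 and 1)
Q = defaultdict(lambda: [0.0, 0.0])


def q_target(reward, next_state, done):
    """Return the value the current estimate Q(s, a) is moved toward."""
    if done:
        return reward  # no future reward after a terminal step
    return reward + GAMMA * max(Q[next_state])


def q_update(state, action, reward, next_state, done):
    """Move Q(state, action) a fraction ALPHA of the way to the target."""
    target = q_target(reward, next_state, done)
    Q[state][action] += ALPHA * (target - Q[state][action])
    return target


Q["s1"] = [2.0, 4.0]                           # pretend these were learned
target = q_update("s0", 1, 1.0, "s1", False)   # target = 1.0 + 0.95 * 4.0
```

With these numbers the target is about 4.8, and `Q["s0"][1]` moves halfway from 0 toward it.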


### Implementing an MLP using Keras

For this tutorial, we can treat the neural network as a black-box algorithm that maps inputs to outputs. It learns from pairs of example input and output data, detects patterns in them, and predicts the output for unseen input data. What matters here is understanding how the neural network is used in the DQN algorithm.

<img src='images/neuralnet.png' width='500px'></img>

Note that the neural net we are going to use is similar to the diagram above. We will have one input layer that receives the 4 state values, followed by 2 hidden layers of 24 nodes each. The output layer has 2 nodes, since there are two buttons (0 and 1) for the game.
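
As a rough check on the size of this network, the sketch below counts trainable parameters for fully connected layers. The layer sizes are taken from the `_build_model` method in this lab (4 inputs, two `Dense` layers of 24 units, 2 outputs); `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [4, 24, 24, 2]  # input, hidden, hidden, output
per_layer = [dense_params(a, b)
             for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [120, 600, 50] 770
```

A network this small is one reason CartPole trains much faster than the pixel-based Atari agents.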

Keras makes it really simple to implement a basic neural network. The code below creates an empty neural net model. `activation`, `loss` and `optimizer` are the parameters that define the characteristics of the neural network, but we are not going to discuss them here.


#### Imports

In [6]:
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


Using TensorFlow backend.


#### Keras neural network model for Q-Learning

In [7]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        # Neural Net for Deep-Q learning Model
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse',
                      optimizer=Adam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

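    def act(self, state):
        # Epsilon-greedy: with probability epsilon, explore with a random action
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        # Otherwise exploit: pick the action with the highest predicted Q-value
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])  # returns action

    def replay(self, batch_size):
        # Train on a random minibatch of stored transitions (experience replay)
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            # Terminal transitions have no future reward to add
            target = reward
            if not done:
                # Target: reward plus discounted best Q-value of the next state
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            # Overwrite only the taken action's value; the other output is unchanged
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        # Decay the exploration rate once per replay call
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay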

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)
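
The exploration schedule in the agent above is easy to reason about on its own. The sketch below, which uses the same `epsilon`, `epsilon_min` and `epsilon_decay` values as `DQNAgent`, counts how many calls to `replay` it takes for `epsilon` to fall to its floor, and compares that with the closed-form estimate `ceil(log(epsilon_min) / log(epsilon_decay))`. It is a standalone calculation, not part of the agent.

```python
import math

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

# Same rule as in replay(): multiply by the decay while above the floor
steps = 0
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    steps += 1

# Smallest n with epsilon_decay ** n <= epsilon_min
closed_form = math.ceil(math.log(epsilon_min) / math.log(epsilon_decay))
print(steps, closed_form)  # 919 919
```

Because `replay` runs once per environment step after the memory holds more than `batch_size` transitions, `epsilon` reaches its floor after roughly 919 such steps. Changing `epsilon_decay` is therefore one direct way to change how quickly `e: 0.01` appears in the training log.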

#### Make OpenAI environment


In [9]:
%matplotlib inline

EPISODES = 1000

env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)
# agent.load("./save/cartpole-dqn.h5")
done = False
batch_size = 32

for e in range(EPISODES):
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    for time in range(500):
        env.render()
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}"
                    .format(e, EPISODES, time, agent.epsilon))
            break
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
    # if e % 10 == 0:
    #     agent.save("./save/cartpole-dqn.h5")

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
episode: 0/1000, score: 14, e: 1.0
episode: 1/1000, score: 28, e: 0.95
episode: 2/1000, score: 26, e: 0.83
episode: 3/1000, score: 20, e: 0.75
episode: 4/1000, score: 14, e: 0.7
episode: 5/1000, score: 15, e: 0.65
episode: 6/1000, score: 10, e: 0.62
episode: 7/1000, score: 18, e: 0.56
episode: 8/1000, score: 8, e: 0.54
episode: 9/1000, score: 11, e: 0.51
episode: 10/1000, score: 8, e: 0.49
episode: 11/1000, score: 7, e: 0.48
episode: 12/1000, score: 12, e: 0.45
episode: 13/1000, score: 8, e: 0.43
episode: 14/1000, score: 17, e: 0.4
episode: 15/1000, score: 11, e: 0.37
episode: 16/1000, score: 13, e: 0.35
episode: 17/1000, score: 16, e: 0.32
episode: 18/1000, score: 29, e: 0.28
episode: 19/1000, score: 42, e: 0.23
episode: 20/1000, score: 46, e: 0.18
episode: 21/1000, score: 68, e: 0.13
episode: 22/1000, score: 33, e: 0.11
episode: 23/1000, score: 34, e: 0.092
episode: 24/1000, sc

episode: 212/1000, score: 206, e: 0.01
episode: 213/1000, score: 171, e: 0.01
episode: 214/1000, score: 196, e: 0.01
episode: 215/1000, score: 130, e: 0.01
episode: 216/1000, score: 148, e: 0.01
episode: 217/1000, score: 166, e: 0.01
episode: 218/1000, score: 134, e: 0.01
episode: 219/1000, score: 116, e: 0.01
episode: 220/1000, score: 126, e: 0.01
episode: 221/1000, score: 135, e: 0.01
episode: 222/1000, score: 146, e: 0.01
episode: 223/1000, score: 141, e: 0.01
episode: 224/1000, score: 152, e: 0.01
episode: 225/1000, score: 163, e: 0.01
episode: 226/1000, score: 177, e: 0.01
episode: 227/1000, score: 146, e: 0.01
episode: 228/1000, score: 153, e: 0.01
episode: 229/1000, score: 226, e: 0.01
episode: 230/1000, score: 151, e: 0.01
episode: 231/1000, score: 98, e: 0.01
episode: 232/1000, score: 43, e: 0.01
episode: 233/1000, score: 12, e: 0.01
episode: 234/1000, score: 135, e: 0.01
episode: 235/1000, score: 202, e: 0.01
episode: 236/1000, score: 124, e: 0.01
episode: 237/1000, score: 16

episode: 425/1000, score: 258, e: 0.01
episode: 426/1000, score: 263, e: 0.01
episode: 427/1000, score: 499, e: 0.01
episode: 428/1000, score: 164, e: 0.01
episode: 429/1000, score: 446, e: 0.01
episode: 430/1000, score: 9, e: 0.01
episode: 431/1000, score: 10, e: 0.01
episode: 432/1000, score: 8, e: 0.01
episode: 433/1000, score: 165, e: 0.01
episode: 434/1000, score: 149, e: 0.01
episode: 435/1000, score: 499, e: 0.01
episode: 436/1000, score: 499, e: 0.01
episode: 437/1000, score: 303, e: 0.01
episode: 438/1000, score: 8, e: 0.01
episode: 439/1000, score: 499, e: 0.01
episode: 440/1000, score: 216, e: 0.01
episode: 441/1000, score: 8, e: 0.01
episode: 442/1000, score: 10, e: 0.01
episode: 443/1000, score: 8, e: 0.01
episode: 444/1000, score: 8, e: 0.01
episode: 445/1000, score: 8, e: 0.01
episode: 446/1000, score: 9, e: 0.01
episode: 447/1000, score: 9, e: 0.01
episode: 448/1000, score: 8, e: 0.01
episode: 449/1000, score: 9, e: 0.01
episode: 450/1000, score: 8, e: 0.01
episode: 451

episode: 640/1000, score: 218, e: 0.01
episode: 641/1000, score: 84, e: 0.01
episode: 642/1000, score: 166, e: 0.01
episode: 643/1000, score: 226, e: 0.01
episode: 644/1000, score: 499, e: 0.01
episode: 645/1000, score: 422, e: 0.01
episode: 646/1000, score: 13, e: 0.01
episode: 647/1000, score: 9, e: 0.01
episode: 648/1000, score: 8, e: 0.01
episode: 649/1000, score: 8, e: 0.01
episode: 650/1000, score: 9, e: 0.01
episode: 651/1000, score: 8, e: 0.01
episode: 652/1000, score: 9, e: 0.01
episode: 653/1000, score: 9, e: 0.01
episode: 654/1000, score: 8, e: 0.01
episode: 655/1000, score: 9, e: 0.01
episode: 656/1000, score: 8, e: 0.01
episode: 657/1000, score: 9, e: 0.01
episode: 658/1000, score: 9, e: 0.01
episode: 659/1000, score: 54, e: 0.01
episode: 660/1000, score: 499, e: 0.01
episode: 661/1000, score: 385, e: 0.01
episode: 662/1000, score: 223, e: 0.01
episode: 663/1000, score: 193, e: 0.01
episode: 664/1000, score: 350, e: 0.01
episode: 665/1000, score: 499, e: 0.01
episode: 666/

episode: 856/1000, score: 74, e: 0.01
episode: 857/1000, score: 9, e: 0.01
episode: 858/1000, score: 8, e: 0.01
episode: 859/1000, score: 9, e: 0.01
episode: 860/1000, score: 8, e: 0.01
episode: 861/1000, score: 9, e: 0.01
episode: 862/1000, score: 9, e: 0.01
episode: 863/1000, score: 8, e: 0.01
episode: 864/1000, score: 9, e: 0.01
episode: 865/1000, score: 9, e: 0.01
episode: 866/1000, score: 9, e: 0.01
episode: 867/1000, score: 9, e: 0.01
episode: 868/1000, score: 9, e: 0.01
episode: 869/1000, score: 9, e: 0.01
episode: 870/1000, score: 8, e: 0.01
episode: 871/1000, score: 9, e: 0.01
episode: 872/1000, score: 8, e: 0.01
episode: 873/1000, score: 8, e: 0.01
episode: 874/1000, score: 8, e: 0.01
episode: 875/1000, score: 8, e: 0.01
episode: 876/1000, score: 8, e: 0.01
episode: 877/1000, score: 7, e: 0.01
episode: 878/1000, score: 9, e: 0.01
episode: 879/1000, score: 9, e: 0.01
episode: 880/1000, score: 7, e: 0.01
episode: 881/1000, score: 9, e: 0.01
episode: 882/1000, score: 9, e: 0.01
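
The scores in the log above swing between 499 and single digits even after `e` has reached 0.01, so a single episode says little about progress. One possible way to smooth this when comparing hyperparameters is a rolling mean of recent scores. `rolling_mean` is a hypothetical helper, and the window size is an assumption; the sample scores are episodes 435 to 442 from the log.

```python
from collections import deque


def rolling_mean(scores, window=100):
    """Mean of the most recent `window` scores after each episode."""
    recent = deque(maxlen=window)  # drops the oldest score automatically
    means = []
    for score in scores:
        recent.append(score)
        means.append(sum(recent) / len(recent))
    return means


sample_scores = [499, 499, 303, 8, 499, 216, 8, 10]  # episodes 435-442
means = rolling_mean(sample_scores, window=4)
print(means[-1])  # 183.25
```

In the training loop, the per-episode `time` value could be appended to a list and passed to a helper like this after each episode.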
