# Chapter 13: Deep Q-Learning

In Chapter 12, you used tabular Q-learning to train a Q-table. You then used the trained Q-table to successfully play the OpenAI Gym Frozen Lake game.

In many situations, the number of possible scenarios is too large for this approach. Examples include Chess and Go: the number of possible board positions is astronomical. It’s impractical to create a Q-table for these types of games for two reasons. First, the computer will not have enough memory to save and update a Q-table with so many rows (each row represents a different scenario). Second, it's impossible to calculate and update the correct Q values because the Q-table is too large.

That's when deep neural networks can help. Neural networks are function approximators, and we’ll use a deep neural network to approximate the Q values. That’s the idea behind deep Q-learning.

This chapter applies deep Q-learning to a game in OpenAI Gym: the Cart Pole game. You’ll learn how to approximate the Q values by using deep Q networks, so that you can tackle much more complicated games later in this book (such as Tic Tac Toe and Connect Four).

***
$\mathbf{\text{Create a subfolder for files in Chapter 13}}$<br>
***
We'll put all files in Chapter 13 in a subfolder /files/ch13. The code in the cell below will create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch13", exist_ok=True)

# 1. The Cart Pole Game in OpenAI Gym
As we discussed in Chapter 12, you need to install the *gym* library first. If you haven't already, install the OpenAI Gym library as follows:

`pip install gym==0.15.7`


## 1.1. Features of the Cart Pole Game 

You can find a description of the game on the official Cart Pole website, https://gym.openai.com/envs/CartPole-v0/.

The problem is considered solved when the average reward is 195 or above over 100 consecutive trials.

The code in the cell below will get you started:

In [2]:
import gym 
 
env = gym.make("CartPole-v0")
env.reset()                    
env.render()


We can also print out all possible actions and states of the game as follows:

In [3]:
# Print out all possible actions in this game
actions = env.action_space
print(f"The action space in the Cart Pole game is {actions}")

# Print out all possible states in this game
states = env.observation_space
print(f"The state space in the Cart Pole game is {states}")

The action space in the Cart Pole game is Discrete(2)
The state space in the Cart Pole game is Box(4,)


The action space in the Cart Pole game has two values: 0 and 1, with the following meanings:
* 0: moving left
* 1: moving right

The state in the Cart Pole game is a collection of four values, with the following meanings:

* The position of the cart, with values between -4.8 and 4.8;
* The velocity of the cart, with values between -4 and 4;
* The angle of the pole, with values between -0.42 and 0.42;
* The angular velocity of the pole, with values between -4 and 4.


The agent earns a reward of 1 for every time step that the pole stays upright. If the pole is more than 15 degrees from vertical or the cart moves more than 2.4 units from the center, the agent loses the game.
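
The end-of-game rule described above can be written as a small check. This is only a sketch based on the thresholds quoted in the text (15 degrees and 2.4 units). The function name `game_over` and the two constants are illustrative and are not read from the gym library, whose internal thresholds may differ.

```python
import math

# Thresholds taken from the description in the text (assumptions, not read from gym)
MAX_ANGLE = math.radians(15)   # pole angle limit, in radians
MAX_POSITION = 2.4             # cart position limit, in units from the center

def game_over(state):
    """Return True if the state violates either limit.

    state is (cart position, cart velocity, pole angle, pole angular velocity).
    """
    position, _, angle, _ = state
    return abs(position) > MAX_POSITION or abs(angle) > MAX_ANGLE

print(game_over([0.0, 0.1, 0.05, 0.2]))   # small angle, cart near the center
print(game_over([2.5, 0.0, 0.0, 0.0]))    # cart too far from the center
```

Only the position and the angle matter for ending the game; the two velocities affect where the cart and pole will be in later time steps.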

## 1.2. A Complete Cart Pole Game
You can play a complete game as follows by using random actions.

In [4]:
env.reset()   

while True:
    action = actions.sample()
    print(action)
    new_state, reward, done, info = env.step(action)
    env.render()
    print(new_state, reward, done, info)    
    if done == True:
        break

0
[-0.00077322 -0.1786936   0.00362296  0.28238639] 1.0 False {}
0
[-0.00434709 -0.37386704  0.00927069  0.57620978] 1.0 False {}
1
[-0.01182443 -0.17887626  0.02079488  0.28646172] 1.0 False {}
1
[-0.01540195  0.01594305  0.02652412  0.00040919] 1.0 False {}
0
[-0.01508309 -0.17954906  0.0265323   0.30134138] 1.0 False {}
0
[-0.01867407 -0.37503893  0.03255913  0.60227256] 1.0 False {}
1
[-0.02617485 -0.18038716  0.04460458  0.32002035] 1.0 False {}
0
[-0.0297826  -0.37611503  0.05100499  0.62642954] 1.0 False {}
1
[-0.0373049  -0.18174075  0.06353358  0.35023625] 1.0 False {}
1
[-0.04093971  0.01242282  0.0705383   0.07824482] 1.0 False {}
0
[-0.04069126 -0.18363572  0.0721032   0.39232236] 1.0 False {}
0
[-0.04436397 -0.37970288  0.07994965  0.70683892] 1.0 False {}
1
[-0.05195803 -0.18577423  0.09408642  0.44035529] 1.0 False {}
0
[-0.05567351 -0.38209301  0.10289353  0.76115164] 1.0 False {}
0
[-0.06331537 -0.57847056  0.11811656  1.0843574 ] 1.0 False {}
0
[-0.07488478 -0.7749359

The above cell used several methods in the game environment:
* The sample() method randomly selects an action from the action space. That is, it returns one of the values in {0, 1}.
* The step() method is where the agent interacts with the environment. It takes the agent’s action as input and returns four values: the new state, the reward, a variable *done* indicating whether the game has ended, and a variable *info* that provides some information about the game.
* The render() method shows a diagram of the resulting state.

The game loop is an infinite *while* loop. When the *done* variable is *True*, the game has ended, and we break out of the loop.

Note that since the actions are chosen randomly, when you run the script, you’ll most likely get different results.
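
To make the reset/step pattern concrete without opening a game window, here is a minimal sketch that plays one game against a stand-in environment. The class `ToyEnv` and the function `play_random_game` are hypothetical names made up for this illustration. `ToyEnv` is not the Cart Pole game; it only mimics the four-value return of step() and ends the game after a fixed number of steps.

```python
import random

class ToyEnv:
    """A hypothetical stand-in environment, NOT the real Cart Pole game.

    It mimics the reset()/step() interface used above: each step earns a
    reward of 1, and the game ends after a fixed number of steps.
    """
    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.steps += 1
        new_state = [0.0, 0.0, 0.0, 0.0]
        reward = 1.0
        done = self.steps >= self.max_steps
        return new_state, reward, done, {}

def play_random_game(env, num_actions=2):
    """Play one game with random actions and return the total reward."""
    env.reset()
    total_reward = 0.0
    while True:
        action = random.randrange(num_actions)  # plays the role of sample()
        new_state, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

print(play_random_game(ToyEnv(max_steps=5)))
```

Because the reward is 1 per time step, the total reward equals the number of steps the game lasted, which is how the score is counted in the rest of this chapter.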

# 2. Deep Q-Learning in Cart Pole

The input to the deep Q-learning model is the state of the game, just as in other deep learning models. However, the output layers are different. In deep learning models, the output is the probability of winning. In contrast, in deep Q-learning, the number of neurons in the output layer is the same as the number of actions, and the outputs are the Q values of the state-action pairs. More on this later.

## 2.1. Create the Deep Q Network
The model we'll use is as follows:

In [5]:
import tensorflow as tf
from tensorflow import keras

# The state has four values; there are two possible actions
input_shape = [4]
num_actions = 2

def create_q_model():
    # Two hidden layers with ELU activations; one output per action
    model = keras.models.Sequential([
        keras.layers.Dense(32, activation="elu",
                           input_shape=input_shape),
        keras.layers.Dense(32, activation="elu"),
        keras.layers.Dense(num_actions)
    ])
    return model

model = create_q_model()

optimizer = keras.optimizers.Adam(learning_rate=0.00025)
loss_fn = keras.losses.mean_squared_error

The neural network has one input layer, two hidden layers, and one output layer. The network is an approximation of the Q-table: it takes in the state and returns two values. The two values are the Q values Q(s, action=0) and Q(s, action=1), respectively. The agent should therefore take the action with the higher Q value, as we did in Chapter 12 with tabular Q-learning.
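
Picking the action with the higher Q value is an argmax over the network's outputs. The sketch below uses plain Python and made-up Q values rather than real network outputs; the helper name `greedy_action` is an illustrative choice.

```python
def greedy_action(q_values):
    """Return the index of the largest Q value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Made-up Q values for one state: [Q(s, 0), Q(s, 1)]
q_values = [17.2, 18.9]
print(greedy_action(q_values))
```

Here the second value is larger, so the agent would choose action 1 (moving right).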

Note that we are using the Exponential Linear Unit (ELU) activation function here instead of the usual ReLU activation function. This is because all rewards in the Cart Pole game are positive numbers. In such situations, the ELU activation function returns negative values for negative inputs, and this allows the function to push the mean activations closer to zero. Alternatively, you can use batch normalization layers to achieve similar effects.
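
The difference between the two activation functions is easy to see with a few numbers. The sketch below implements both in plain Python, assuming the standard ELU definition with alpha = 1 (x for positive inputs, alpha * (exp(x) - 1) otherwise).

```python
import math

def relu(x):
    """ReLU: zero for negative inputs, x otherwise."""
    return max(0.0, x)

def elu(x, alpha=1.0):
    """ELU: x for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Compare the two functions on a few inputs
for x in [-2.0, -0.5, 0.0, 1.5]:
    print(x, relu(x), round(elu(x), 4))
```

For positive inputs the two functions agree. For negative inputs ReLU returns zero, while ELU returns a negative value that never falls below -alpha.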

## 2.2. Train the Deep Q Network
The following script trains the deep Q network for the Cart Pole game. Training stops once the average score is at least 195; that is, the pole needs to stay upright for at least 195 time steps per game on average.

First, we define an update_Q() function to train the model by selecting a batch from the replay buffer:

In [6]:
# Replay and update model parameters
def update_Q():
    # Select a random batch of transitions from the replay buffer
    samples = random.sample(memory, batch_size)
    frames, new_frames, actions, rewards, dones = [], [], [], [], []
    for frame, new_frame, action, reward, done in samples:
        frames.append(frame)
        new_frames.append(new_frame)
        actions.append(action)
        rewards.append(reward)
        # Store done as 1.0 (game ended) or 0.0 (game continues)
        dones.append(1.0 if done else 0.0)
    frames = np.array(frames)
    new_frames = np.array(new_frames)
    rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)
    dones = tf.convert_to_tensor(dones, dtype=tf.float32)
    # Compute the target Q values: reward + gamma * max Q(new state)
    preds = model.predict(new_frames, verbose=0)
    Qs = rewards + gamma * tf.reduce_max(preds, axis=1)
    # If done=1, reset the target Q to -1; important
    new_Qs = Qs * (1 - dones) - dones

    # Update model parameters
    onehot = tf.one_hot(actions, num_actions)
    with tf.GradientTape() as t:
        Q_preds = model(frames)
        # Calculate old Qs for the action taken
        old_Qs = tf.reduce_sum(tf.multiply(Q_preds, onehot), axis=1)
        # Calculate loss between new Qs and old Qs
        loss = loss_fn(new_Qs, old_Qs)
    # Update using backpropagation
    gs = t.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gs, model.trainable_variables))
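
The two key computations inside update_Q() can be traced by hand for a single transition. The sketch below redoes them in plain Python with made-up Q values, assuming the discount factor gamma = 0.95 used in this chapter; the helper names `target_q` and `q_of_action_taken` are illustrative.

```python
gamma = 0.95  # discount factor, the value used in this chapter

def target_q(reward, next_q_values, done):
    """Target for Q(s, a): reward + gamma * max Q(s', a'), or -1 if the game ended."""
    q = reward + gamma * max(next_q_values)
    return q * (1 - done) - done

def q_of_action_taken(q_values, action):
    """Pick Q(s, a) for the action taken by multiplying with a one-hot vector."""
    onehot = [1.0 if a == action else 0.0 for a in range(len(q_values))]
    return sum(q * m for q, m in zip(q_values, onehot))

# Made-up numbers for a single transition in which the agent took action 1
old_q = q_of_action_taken([10.0, 12.0], action=1)
new_q = target_q(reward=1.0, next_q_values=[11.0, 9.0], done=0.0)
print(old_q, new_q, (new_q - old_q) ** 2)

# A transition that ends the game gets a target of -1
print(target_q(reward=1.0, next_q_values=[11.0, 9.0], done=1.0))
```

The squared difference between the target and the old Q value is what the mean squared error loss averages over the batch. Only the Q value of the action actually taken enters the loss, which is what the one-hot multiplication achieves.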

The update_Q() function selects 32 observations from the replay buffer and updates the Q values. We create the replay buffer as follows:

In [7]:
import random
from collections import deque
import numpy as np

# Discount factor for past rewards
gamma = 0.95 
# batch size
batch_size = 32  

# Create a replay buffer with a maximum length of 2000
memory=deque(maxlen=2000)
# Create a running rewards list with a length of 100
running_rewards=deque(maxlen=100)

The replay buffer has a maximum length of 2000 observations. Each time the function update_Q() is called, 32 observations are randomly selected from the buffer and used to update the deep Q network. The running_rewards list has a maximum length of 100, and it is used to keep track of the scores in the last 100 games. If the average score in these 100 games is at least 195, we consider the model trained.

Next, we simulate games to train the model until the average score over the last 100 games reaches 195:
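
The behavior of a bounded deque is worth seeing once on a small scale. The sketch below uses a tiny maximum length (5 instead of 2000, and 3 instead of 100) so the effect is visible; the variable names are illustrative.

```python
import random
from collections import deque

# A tiny buffer (maxlen=5 instead of 2000) so the effect is easy to see
buffer = deque(maxlen=5)
for i in range(8):
    buffer.append(i)        # once the buffer is full, the oldest item is dropped
print(list(buffer))

# Randomly select a batch without replacement, as update_Q() does
batch = random.sample(list(buffer), 3)
print(batch)

# Running average over the most recent scores only
scores = deque(maxlen=3)
for s in [100, 150, 200, 200]:
    scores.append(s)
print(sum(scores) / len(scores))
```

Because old entries fall out automatically, the replay buffer always holds the most recent experience, and the running average reflects only the latest games.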

In [8]:
for episode in range(1, 10001):
    # Reset state and episode reward before each episode
    state = np.array(env.reset())
    episode_reward = 0
    # Epsilon decays linearly with the episode number, down to 0.05
    epsilon = max(1 - episode / 500, 0.05)
    for timestep in range(1, 201):
        # Use epsilon-greedy for exploration
        if epsilon > np.random.rand():
            # Take a random action
            action = np.random.choice(num_actions)
        else:
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            q_values = model(state_tensor, training=False)
            # Take the action with the highest Q value
            action = tf.argmax(q_values[0]).numpy()
        # Apply the chosen action in the environment
        state_next, reward, done, _ = env.step(action)
        state_next = np.array(state_next)
        episode_reward += reward
        # Change done from True/False to 1.0 or 0.0
        done = 1.0 if done else 0.0
        # Save the transition in the replay buffer
        memory.append([state, state_next, action, reward, done])
        # The next state becomes the current state in the next round
        state = state_next
        # Update Q once the buffer holds more than batch_size transitions
        if len(memory) > batch_size:
            update_Q()
        if done == 1.0:
            running_rewards.append(episode_reward)
            break
    running_reward = np.mean(np.array(running_rewards))
    # Log progress every 20 episodes and when the game is solved
    if episode % 20 == 0 or running_reward >= 195:
        template = "running reward: {:.2f} at episode {}, "
        print(template.format(running_reward, episode))
    if running_reward >= 195:
        # Save the trained model and stop training
        model.save("files/ch13/cartpole_deepQ.h5")
        print(f"solved at episode {episode}")
        break

The model is considered trained if the average score in the past 100 games is 195 or above, as stated by the OpenAI Gym rules. That's the criterion used in the training process. Once the goal is achieved, the training stops.
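
The exploration schedule in the training loop deserves a closer look. The sketch below reproduces the formula max(1 - episode / 500, 0.05) in plain Python and pairs it with an epsilon-greedy choice over made-up Q values; the helper names `epsilon_for` and `choose_action` are illustrative.

```python
import random

def epsilon_for(episode):
    """Exploration rate: decays linearly from about 1 and is floored at 0.05."""
    return max(1 - episode / 500, 0.05)

def choose_action(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the best one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Print the exploration rate at a few points during training
for episode in [1, 250, 475, 1000]:
    print(episode, round(epsilon_for(episode), 3))
```

Early in training the agent acts almost entirely at random. From episode 475 onward, it follows the network's recommendation 95% of the time and still explores 5% of the time.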

The training program takes about an hour to run, depending on the speed of your computer. Here is the output from my computer:

```python
...
...
running reward: 193.93 at episode 859, 
running reward: 194.24 at episode 860, 
running reward: 194.24 at episode 861, 
running reward: 194.37 at episode 862, 
running reward: 194.67 at episode 863, 
running reward: 194.75 at episode 864, 
running reward: 194.89 at episode 865, 
running reward: 195.20 at episode 866, 
Solved at episode 866!
```

# 3. Test the Deep Q Network

Now that the model is trained, we can use it to play the OpenAI Gym Cart Pole game and see if it works. 

## 3.1. Test One Game
Next, we'll test one game, with the graphical rendering turned on.

In [9]:
state = env.reset()
score = 0
for i in range(1,201):
    env.render()
    # Use the trained model to predict the Q values of both actions
    X_state = np.array(state).reshape(-1,4)
    prediction = model.predict(X_state)
    # pick the action with the highest Q value
    action = np.argmax(prediction)
    state, reward, done, info = env.step(action)
    score += 1
    if done == True:
        print(f"score is {score}")
        break
env.close()

score is 200


The score is 200, so the cart pole stayed upright for all 200 time steps. The deep Q network really works!

## 3.2. Test the Efficacy of the Deep Q Network
Next, we play the game 100 times using the trained deep Q network and see how effective the trained deep Q network is on average. 

In [10]:
env = gym.make('CartPole-v0')

def test_cartpole():
    state = env.reset()
    score = 0
    for i in range(1,201):
        # Use the model to predict the Q values of both actions
        X_state = np.array(state).reshape(-1,4)
        prediction = model.predict(X_state)
        # pick the action with the highest Q value
        action = np.argmax(prediction)
        state, reward, done, info = env.step(action)
        score += 1
        if done == True:
            break
    return score

#repeat the game 100 times and record all game outcomes
results=[]        
for x in range(100):
    result=test_cartpole()
    results.append(result)    

#print out the average score
average_score = np.array(results).mean()
print("the average score is", average_score)

env.close()

the average score is 200.0

So the trained deep Q network managed to make the cart pole stay upright for 200 consecutive time steps in every single game.