# Chapter 22: Scaling Up Double Deep Q-Learning


New Skills in This Chapter:

• Playing an Atari game with and without the Baselines game wrapper

• Creating a Q-network to train all Atari games

• Defining a function to test any Atari game

• Capturing a game episode with high scores in an Atari game

***
*At DeepMind we have pioneered the combination of these approaches - deep
reinforcement learning - to create the first artificial agents to achieve human-level
performance across many challenging domains.*<br>
***
--DeepMind, 2016

***

In [1]:
import os

os.makedirs("files/ch22", exist_ok=True)

# 22.1. Get Started with the Seaquest Game

## 22.1.1. The Seaquest Game in OpenAI Gym

In [2]:
import gym

env1 = gym.make("Seaquest-v0")
env1.reset()
env1.render()

In [3]:
actions = env1.action_space
print(f"The action space for the Seaquest game is {actions}")
meanings = env1.env.get_action_meanings()
print(f'''The meanings of the actions for the Seaquest game are
      \n {meanings}''')
# Print out the observation space in this game
obs_space = env1.observation_space
print(f"The observation space for the Seaquest game is {obs_space}")

In [4]:
import matplotlib.pyplot as plt
import numpy as np

num_actions1 = env1.action_space.n
env1.reset()
for _ in range(20):
    action = np.random.choice(num_actions1)
    obs1, reward, done, info = env1.step(action)
plt.imshow(obs1)
plt.show()
env1.close()

In [5]:
from pprint import pprint

env1.reset()
env1.render()
history = []
while True:
    action = np.random.choice(num_actions1)
    obs1, reward, done, info = env1.step(action)
    env1.render()
    history.append([reward, done, info])
    if len(history)>1:
        if info["ale.lives"]<history[-2][2]["ale.lives"]:
            pprint(history[-10:])
            break
env1.close()        
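The loop above stops at the first step whose `info["ale.lives"]` is lower than in the previous step. Here is a minimal sketch of that check, run on a hand-made list of info dictionaries instead of a real environment. The helper name `first_life_lost` and the sample values are made up for illustration.

```python
def first_life_lost(infos):
    """Return the index of the first step where the life count drops, or None."""
    for i in range(1, len(infos)):
        if infos[i]["ale.lives"] < infos[i - 1]["ale.lives"]:
            return i
    return None

# Hypothetical info dictionaries, shaped like those returned by env.step()
sample_infos = [{"ale.lives": 4}, {"ale.lives": 4}, {"ale.lives": 3}]
lost_at = first_life_lost(sample_infos)
print(lost_at)
```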

## 22.1.2. Seaquest with the Baselines Game Wrapper

In [6]:
from baselines.common.atari_wrappers import make_atari
from baselines.common.atari_wrappers import wrap_deepmind

env1 = make_atari("SeaquestNoFrameskip-v4")
env1 = wrap_deepmind(env1, frame_stack=True, scale=True)
obs1 = env1.reset()
history = []
while True:
    action = env1.action_space.sample()
    obs1, reward, done, info = env1.step(action)
    history.append([reward, done, info])
    env1.render()
    if done:
        pprint(history[-10:])
        break
env1.close()    

In [7]:
env1.close()

## 22.1.3. Preprocessed Seaquest Game Windows

In [8]:
npobs1=np.array(obs1)
for i in range(4):
    plt.imshow(npobs1[:,:,i])
    plt.show()

## 22.1.4. Subplots of Seaquest Game Windows

In [9]:
from utils.ch22util import seaquest_pixels

seaquest_pixels()

# 22.2. Get Started with Beam Rider

## 22.2.1. Beam Rider without the Game Wrapper

In [10]:
env2 = gym.make("BeamRider-v0")
env2.reset()
env2.render()

In [11]:
actions = env2.action_space
print(f"The action space for Beam Rider is {actions}")
meanings = env2.env.get_action_meanings()
print(f'''The meanings of the actions for Beam Rider are
      \n {meanings}''')
obs_space = env2.observation_space
print(f"The observation space for Beam Rider is {obs_space}")    

In [12]:
num_actions2 = env2.action_space.n
env2.reset()
for _ in range(20):
    action = np.random.choice(num_actions2)
    obs2, reward, done, info = env2.step(action)
plt.imshow(obs2)
plt.show()
env2.close()

In [13]:
env2.reset()
env2.render()
history = []
while True:
    action = np.random.choice(num_actions2)
    obs2, reward, done, info = env2.step(action)
    env2.render()
    history.append([reward, done, info])
    if len(history)>1:
        if info["ale.lives"]<history[-2][2]["ale.lives"]:
            pprint(history[-10:])
            break
env2.close()        

## 22.2.2. Beam Rider with the Baselines Game Wrapper

In [14]:
env2 = make_atari("BeamRiderNoFrameskip-v4")
env2 = wrap_deepmind(env2, frame_stack=True, scale=True)
obs2 = env2.reset()
history = []
while True:
    action = env2.action_space.sample()
    obs2, reward, done, info = env2.step(action)
    history.append([reward, done, info])
    env2.render()
    if done:
        pprint(history[-10:])
        break
env2.close()        

As you can see, when the number of lives changes from 3 to 2, the variable *done* becomes True and the episode ends. Note that the reward in that step is 7, not -1. We can still treat losing a life as a penalty by overriding the target Q-value with -1 whenever *done* is True, using the following line of code, which you'll see later in the training script:

```python
    # Each time the agent loses a life, set Q to -1; important
    new_Qs = Qs * (1 - dones) - dones
```
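To see what this line does, here is a small sketch with made-up numbers: wherever `dones` is 1 the target is replaced by -1, and elsewhere it is left unchanged. The arrays are illustrative only.

```python
import numpy as np

# Made-up bootstrapped targets for a batch of three transitions
Qs = np.array([7.5, 2.0, 0.3])
# 1.0 marks a transition that ended the episode (a lost life), 0.0 otherwise
dones = np.array([0.0, 1.0, 0.0])

# Keep Qs where dones == 0; use -1 where dones == 1
new_Qs = Qs * (1 - dones) - dones
print(new_Qs)
```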

Run the following to close the game window:

In [15]:
env2.close()

## 22.2.3. Preprocessed Beam Rider Game Windows


In [16]:
npobs2=np.array(obs2)
for i in range(4):
    plt.imshow(npobs2[:,:,i])
    plt.show()

## 22.2.4. Subplots of Beam Rider Game Windows

In [17]:
from utils.ch22util import beamrider_pixels

beamrider_pixels() 

# 22.3. Scaling Up the Double Deep Q-Network

## 22.3.1. Differences Among Atari Games

To create a function that can be applied to any Atari game, we first need to understand the differences among Atari games. 

Obviously, the name of the game is different, but there is a pattern. For the four games we have seen so far (Breakout, Space Invaders, Seaquest, and Beam Rider), the environment names are the following:

* BreakoutNoFrameskip-v4
* SpaceInvadersNoFrameskip-v4
* SeaquestNoFrameskip-v4
* BeamRiderNoFrameskip-v4

Therefore, we can use this line of code

```python
f"{name}NoFrameskip-v4"
```

in the function to scale up the game environment. 
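As a quick check of the naming pattern, the sketch below builds the environment ID for each of the four games. The helper name `env_id` is hypothetical and is used here only for illustration.

```python
def env_id(name):
    """Build the environment ID passed to make_atari() from a game name."""
    return f"{name}NoFrameskip-v4"

for game in ["Breakout", "SpaceInvaders", "Seaquest", "BeamRider"]:
    print(env_id(game))
```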

The number of actions is different in different games. For the four games Breakout, Space Invaders, Seaquest, and Beam Rider, the numbers of actions are 4, 6, 18, and 9, respectively. However, we can use the code:

```python
num_actions = env.action_space.n
```

in the function to retrieve the number of actions for each game automatically.  

We can leave everything else in the training program the same.

## 22.3.2. A Generic Double Deep Q-Network

In [18]:
from tensorflow import keras

# Four stacked 84x84 grayscale frames
input_shape = (84, 84, 4,)
def create_model(num_actions):
    model=keras.models.Sequential()
    model.add(keras.layers.Conv2D(filters=32,kernel_size=8,
     strides=(4,4),activation="relu",input_shape=input_shape))
    model.add(keras.layers.Conv2D(filters=64,kernel_size=4,
     strides=(2,2),activation="relu"))
    model.add(keras.layers.Conv2D(filters=64,kernel_size=3,
     strides=(1,1),activation="relu"))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(512,activation="relu"))
    model.add(keras.layers.Dense(num_actions))
    return model
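To get a feel for the size of this network, the sketch below works out the feature-map width after each convolutional layer, assuming the Keras default of no padding ("valid"). The helper `conv_out` is hypothetical; it only does the arithmetic and does not build the model.

```python
def conv_out(size, kernel, stride):
    """Output width of a 'valid' (unpadded) convolution along one axis."""
    return (size - kernel) // stride + 1

size = 84
# (kernel_size, stride) of the three Conv2D layers in create_model()
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    print(size)

# The last convolution has 64 filters, so Flatten() yields size*size*64 values
flat_units = size * size * 64
print(flat_units)
```

Only the final Dense layer depends on the game, through `num_actions`, which is why the same architecture can be reused across Atari games.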

In [19]:
lr=0.00025
optimizer=keras.optimizers.Adam(learning_rate=lr,clipnorm=1)
loss_function=keras.losses.Huber()

## 22.3.3. The Training Process for Any Atari Game 

In [20]:
gamma=0.99 
batch_size=32  

In [21]:
from collections import deque

# Create a replay buffer 
memory=deque(maxlen=50000)
# Create a running rewards list 
running_rewards=deque(maxlen=100)
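Both containers rely on the `maxlen` argument of `deque`: once the limit is reached, appending a new item silently drops the oldest one. A small sketch with a tiny limit makes this visible (the values are made up):

```python
from collections import deque

# Small maxlen so the effect is easy to see; the chapter uses 50000 and 100
buffer = deque(maxlen=3)
for reward in [1, 2, 3, 4, 5]:
    buffer.append(reward)

# Only the three most recent items remain
print(list(buffer))
# Average over what is left, as done for the running reward
print(sum(buffer) / len(buffer))
```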

In [22]:
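import tensorflow as tf

# Replay and update model parameters
def update_Q(num_actions):
    global dnn,target_dnn
    # gen_batch() is not defined in this section; it is expected to
    # sample a batch of transitions from the replay buffer
    dones,frames,new_frames,rewards,actions=gen_batch()
    # Compute target Q-values with the target network
    preds = target_dnn.predict(new_frames, verbose=0)
    Qs = rewards + gamma * tf.reduce_max(preds, axis=1)
    # If done=1, set the target Q to -1; important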
    new_Qs = Qs * (1 - dones) - dones
    # update model parameters
    onehot = tf.one_hot(actions, num_actions)
    with tf.GradientTape() as t:
        Q_preds=dnn(frames)
        # Calculate old Qs for the action taken
        old_Qs=tf.reduce_sum(tf.multiply(Q_preds,onehot),axis=1)
        # Calculate loss between new Qs and old Qs
        loss=loss_function(new_Qs, old_Qs)
    # Update using backpropagation
    gs=t.gradient(loss,dnn.trainable_variables)
    optimizer.apply_gradients(zip(gs,dnn.trainable_variables)) 
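`update_Q()` calls `gen_batch()`, which is not shown in this section. One possible shape for it is sketched below, assuming it samples uniformly from the replay buffer and returns the fields in the order that `update_Q()` unpacks them. Unlike the call above, this version takes the buffer and the batch size as arguments so that the example is self-contained; the actual helper in the book's code may differ.

```python
import random
from collections import deque

import numpy as np

def gen_batch(memory, batch_size=32):
    """Sample transitions and return them as arrays, one per field."""
    samples = random.sample(list(memory), batch_size)
    frames = np.array([s[0] for s in samples])
    new_frames = np.array([s[1] for s in samples])
    actions = np.array([s[2] for s in samples])
    rewards = np.array([s[3] for s in samples], dtype=np.float32)
    dones = np.array([s[4] for s in samples], dtype=np.float32)
    # Same order that update_Q() unpacks
    return dones, frames, new_frames, rewards, actions

# Toy buffer of dummy transitions: [state, state_next, action, reward, done]
memory = deque(maxlen=50000)
for _ in range(40):
    s = np.zeros((84, 84, 4), dtype=np.float32)
    memory.append([s, s, 0, 1.0, 0.0])

dones, frames, new_frames, rewards, actions = gen_batch(memory)
print(frames.shape, rewards.shape)
```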

In [23]:
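def play_episode(num_actions,name):
    # epsilon_random_frames, update_after_actions, and update_target_network
    # are hyperparameters that must be set before this function is called
    global frame_count,env,dnn,target_dnn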
    # reset state and episode reward before each episode
    state = np.array(env.reset())
    episode_reward = 0    
    # Allow 10,000 steps per episode
    for timestep in range(1, 10001):
        frame_count += 1
        # Calculate current epsilon based on frame count
        epsilon = max(0.1, 1 - frame_count * (1-0.1) /1000000)
        # Use epsilon-greedy for exploration
        if frame_count < epsilon_random_frames or \
            epsilon > np.random.rand(1)[0]:
            # Take random action
            action = np.random.choice(num_actions)
        # Use exploitation
        else:
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            action_probs = dnn(state_tensor, training=False)
            action = tf.argmax(action_probs[0]).numpy()
        # Apply the sampled action in our environment
        state_next, reward, done, _ = env.step(action)
        state_next = np.array(state_next)
        episode_reward += reward
        # Change done to 1.0 or 0.0 to prevent error
        if done==True:
            done=1.0
        else:
            done=0.0
        # Save actions and states in replay buffer
        memory.append([state, state_next, action, reward, done])
        # current state becomes the next state in next round
        state = state_next
        # Update Q once batch size is over 32
        if len(memory) > batch_size and \
            frame_count % update_after_actions == 0:
            update_Q(num_actions)
        if frame_count % update_target_network == 0:
            # update the target network with new weights
            target_dnn.set_weights(dnn.get_weights())
            # Periodically save the model
            dnn.save(f"files/ch22/{name}.h5")         
        if done:
            running_rewards.append(episode_reward)
            break
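The exploration rate in `play_episode()` falls linearly from 1 to 0.1 over the first million frames and then stays at 0.1. The sketch below isolates that schedule in a hypothetical helper, `epsilon_at`, whose parameter names are assumptions chosen for readability; the default values are the ones hard-coded in the function above.

```python
def epsilon_at(frame_count, eps_max=1.0, eps_min=0.1, decay_frames=1_000_000):
    """Linear decay from eps_max to eps_min over decay_frames, then constant."""
    decayed = eps_max - frame_count * (eps_max - eps_min) / decay_frames
    return max(eps_min, decayed)

for frames in [0, 250_000, 500_000, 1_000_000, 2_000_000]:
    print(frames, round(epsilon_at(frames), 3))
```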

In [24]:
def train_atari(name):
    global frame_count,env,num_actions,dnn,target_dnn
    # Use the Baselines Atari environment
    env = make_atari(f"{name}NoFrameskip-v4")
    # Process and stack the frames
    env = wrap_deepmind(env, frame_stack=True, scale=True)
    num_actions = env.action_space.n
    
    # Network for training
    dnn=create_model(num_actions)
    # Network for predicting (target network)
    target_dnn=create_model(num_actions)     
    episode=0
    frame_count=0
    while True: 
        episode += 1
        play_episode(num_actions,name)
        running_reward = np.mean(np.array(running_rewards))
        if episode%20==0:
            # Log details
            m="running reward: {:.2f} at episode {} and frame {}"
            print(m.format(running_reward,episode,frame_count))
        if running_reward>20:
            dnn.save(f"files/ch22/{name}.h5")
            print(f"solved at episode {episode}")
            break

# 22.4. Try It on Seaquest

## 22.4.1. Train the Model in Seaquest
The following line of code will train the agent in the Seaquest game:

In [25]:
from utils.ch22util import train_atari

train_atari("Seaquest")

We first import the function *train_atari()* from the local module *ch22util* and then call it with the name of the game, *Seaquest*, as the argument. The training takes a couple of days, but you can instead use the pre-trained model that I put in the book's GitHub repository, saved as *Seaquest.h5*. 

In [26]:
import tensorflow as tf

reload1 = tf.keras.models.load_model("files/ch22/Seaquest.h5")
for i in range(4):
    # Reset the environment at the start of each episode
    state = env1.reset()
    score = 0
    for j in range(10000):
        if np.random.rand(1)[0]<0.01:
            action = np.random.choice(num_actions1)
        else:
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            action_probs = reload1(state_tensor, training=False)
            action = tf.argmax(action_probs[0]).numpy()    
        state, reward, done, info = env1.step(action)
        score += reward
        env1.render()
        if done:
            print("the score is", score)
            break
env1.close()

## 22.4.2. Test the Average Score in Seaquest

In [27]:
def test_atari(name):
    reload = tf.keras.models.load_model(f"files/ch22/{name}.h5")
    env = make_atari(f"{name}NoFrameskip-v4")
    env = wrap_deepmind(env, frame_stack=True, scale=True)
    scores = []
    num_actions = env.action_space.n
    for i in range(100):
        state = env.reset()
        score = 0
        for j in range(10000):
            if np.random.rand(1)[0]<0.01:
                action = np.random.choice(num_actions)
            else:
                state_tensor = tf.convert_to_tensor(state)
                state_tensor = tf.expand_dims(state_tensor, 0)
                action_probs = reload(state_tensor, training=False)
                action = tf.argmax(action_probs[0]).numpy()    
            state, reward, done, info = env.step(action)
            score += reward
            if done:
                print(f"the score in episode {i+1} is {score}")
                scores.append(score)
                break
    env.close()
    print(f"the average score is {np.array(scores).mean()}")  

To test the trained model in Seaquest, we import the function from the local module and call it to test 100 episodes of the game, like so:

In [28]:
from utils.ch22util import test_atari

test_atari("Seaquest")
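During testing, the agent still takes a random action 1% of the time and otherwise picks the action with the highest predicted Q-value. The sketch below isolates that rule in a hypothetical helper, `choose_action`, and applies it to made-up Q-values instead of a trained network.

```python
import numpy as np

def choose_action(q_values, num_actions, epsilon=0.01, rng=None):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(q_values))

# Made-up Q-values for a game with four actions
q_values = np.array([0.1, 0.7, 0.3, 0.2])
# With epsilon=0 the choice is always greedy
action = choose_action(q_values, num_actions=4, epsilon=0.0)
print(action)
```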

## 22.4.3. Animate Successful Episodes
We'll highlight episodes where the agent performs well.

In [29]:
import imageio
import pickle 

# Prefix for the saved files, e.g. Seaquest1.p and Seaquest1.gif
name = "Seaquest"

for i in range(20):
    state = env1.reset()
    frames = []
    for j in range(10000):
        if np.random.rand(1)[0]<0.01:
            action = np.random.choice(num_actions1)
        else:
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            action_Qs = reload1(state_tensor, training=False)
            action = tf.argmax(action_Qs[0]).numpy()    
        state, reward, done, info = env1.step(action)
        frames.append(env1.render(mode='rgb_array'))
        if done:
            # Use a context manager so the file is closed after writing
            with open(f'files/ch22/{name}{i+1}.p', 'wb') as fp:
                pickle.dump(frames, fp)
            imageio.mimsave(f"files/ch22/{name}{i+1}.gif",\
                            frames, fps=240)
            break
env1.close()

Go to the book's GitHub repository and download the file *Seaquest2.zip*. Unzip the file and save the unzipped file *Seaquest2.p* in the folder /Desktop/mla/files/ch22/. Then convert it into an animation as follows:

In [30]:
# Forward slashes work on every platform and avoid invalid escape sequences
with open("files/ch22/Seaquest2.p","rb") as fp:
    Seaquest2=pickle.load(fp)
imageio.mimsave("files/ch22/Seaquest2.gif",Seaquest2[::5],fps=24)

<img src="https://gattonweb.uky.edu/faculty/lium/ml/Seaquest_episode2.gif" />

In [31]:
plots=Seaquest2[::22]
last=Seaquest2[-1].reshape(1,210,160,3)
Seaquest_plots=np.concatenate([plots,last],axis=0)
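The stride of 22 is tied to the length of this particular episode: the slice, together with the last frame, has to supply enough frames to fill a 5×5 grid. For an episode of a different length, one alternative (not the approach used above) is to pick 25 evenly spaced indices with `np.linspace`. Here is a sketch with a hypothetical helper, `pick_frames`, applied to dummy frames:

```python
import numpy as np

def pick_frames(frames, n=25):
    """Return n frames evenly spaced from the first to the last one."""
    idx = np.linspace(0, len(frames) - 1, n).astype(int)
    return np.stack([frames[i] for i in idx])

# Dummy episode of 100 blank frames with the Atari frame shape
episode = [np.zeros((210, 160, 3), dtype=np.uint8) for _ in range(100)]
grid_frames = pick_frames(episode)
print(grid_frames.shape)
```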

In [32]:
plt.figure(figsize=(12,16),dpi=100)
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.imshow(Seaquest_plots[i])
    plt.axis('off')
plt.subplots_adjust(bottom=0.001,right=0.999,top=0.999,
left=0.001, hspace=-0.1,wspace=0.1)
plt.savefig("files/ch22/Seaquest_plots.jpg")

<img src="https://gattonweb.uky.edu/faculty/lium/ml/Seaquest_plots.jpg" />

# 22.5. Try It on Beam Rider

## 22.5.1. Train the Model in Beam Rider

In [33]:
from utils.ch22util import train_atari

train_atari("BeamRider")

The training takes a couple of days. You can use a pre-trained model, *BeamRider.h5*, that I put on the book's GitHub repository. 

In [34]:
import numpy as np
import gym
import tensorflow as tf
from baselines.common.atari_wrappers import make_atari
from baselines.common.atari_wrappers import wrap_deepmind
env2 = make_atari("BeamRiderNoFrameskip-v4")
env2 = wrap_deepmind(env2, frame_stack=True, scale=True)
reload2 = tf.keras.models.load_model("files/ch22/BeamRider.h5")
num_actions2 = env2.action_space.n
for i in range(3):
    # Reset the environment at the start of each episode
    state = env2.reset()
    score = 0
    for j in range(10000):
        if np.random.rand(1)[0]<0.01:
            action = np.random.choice(num_actions2)
        else:
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            action_Qs = reload2(state_tensor, training=False)
            action = tf.argmax(action_Qs[0]).numpy()    
        state, reward, done, info = env2.step(action)
        score += reward
        env2.render()
        if done:
            print("the score is", score)
            break
env2.close()



the score is 23.0
the score is 7.0
the score is 30.0


## 22.5.2. The Average Score in Beam Rider

In [35]:
test_atari("BeamRider")

## 22.5.3. A Successful Episode in Beam Rider

In [36]:
# Prefix for the saved files, e.g. BeamRider1.p and BeamRider1.gif
name = "BeamRider"
for i in range(15):
    state = env2.reset()
    frames = []
    for j in range(10000):
        if np.random.rand(1)[0]<0.01:
            action = np.random.choice(num_actions2)
        else:
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            action_Qs = reload2(state_tensor, training=False)
            action = tf.argmax(action_Qs[0]).numpy()    
        state, reward, done, info = env2.step(action)
        frames.append(env2.render(mode='rgb_array'))
        if done:
            # Use a context manager so the file is closed after writing
            with open(f'files/ch22/{name}{i+1}.p', 'wb') as fp:
                pickle.dump(frames, fp)
            imageio.mimsave(f"files/ch22/{name}{i+1}.gif",\
                            frames, fps=240)
            break
env2.close()

Go to the book's GitHub repository and download the file *BeamRider4.zip*. Unzip the file and save the unzipped file *BeamRider4.p* in the folder /Desktop/mla/files/ch22/. Then convert it into an animation as follows:

In [37]:
# Forward slashes work on every platform and avoid invalid escape sequences
with open("files/ch22/BeamRider4.p","rb") as fp:
    BeamRider4=pickle.load(fp)
imageio.mimsave("files/ch22/BeamRider4.gif",
                BeamRider4[::5],fps=24)

<img src="https://gattonweb.uky.edu/faculty/lium/ml/BeamRider_episode4.gif" />

In [38]:
plots=BeamRider4[::73]
last=BeamRider4[-1].reshape(1,210,160,3)
BeamRider_plots=np.concatenate([plots,last],axis=0)

In [39]:
plt.figure(figsize=(12,16),dpi=100)
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.imshow(BeamRider_plots[i])
    plt.axis('off')
plt.subplots_adjust(bottom=0.001,right=0.999,top=0.999,
left=0.001, hspace=-0.1,wspace=0.1)
plt.savefig("files/ch22/BeamRider_plots.jpg")

<img src="https://gattonweb.uky.edu/faculty/lium/ml/BeamRider_plots.jpg" />