<a href="https://colab.research.google.com/github/kruegz/pong/blob/main/pong.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Playing Pong with Reinforcement Learning


- https://gym.openai.com/envs/Pong-v0/
- https://towardsdatascience.com/deep-q-network-dqn-i-bce08bdf2af
- https://towardsdatascience.com/getting-an-ai-to-play-atari-pong-with-deep-reinforcement-learning-47b0c56e78ae


## Setup

Import necessary packages and configure global settings.


In [1]:
%%bash

# Install main packages
pip install gym > /dev/null 2>&1
pip install pyglet > /dev/null 2>&1
pip install atari-py > /dev/null 2>&1

# Install additional packages for visualization
sudo apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
pip install pyvirtualdisplay > /dev/null 2>&1
pip install git+https://github.com/tensorflow/docs > /dev/null 2>&1
pip install -U colabgymrender > /dev/null 2>&1


# Download and install Atari ROMs
# https://github.com/openai/atari-py#roms
# http://www.atarimania.com/rom_collection_archive_atari_2600_roms.html
wget http://www.atarimania.com/roms/Roms.rar
unrar e Roms.rar
python -m atari_py.import_roms .

export DISPLAY=localhost:0.0 


UNRAR 5.50 freeware      Copyright (c) 1993-2017 Alexander Roshal


Extracting from Roms.rar

Extracting  HC ROMS.zip                                                   36%  OK 
Extracting  ROMS.zip                                                      74% 99%  OK 
All OK
copying adventure.bin from ROMS/Adventure (1980) (Atari, Warren Robinett) (CX2613, CX2613P) (PAL).bin to /usr/local/lib/python3.7/dist-packages/atari_py/atari_roms/adventure.bin
copying air_raid.bin from ROMS/Air Raid (Men-A-Vision) (PAL) ~.bin to /usr/local/lib/python3.7/dist-packages/atari_py/atari_roms/air_raid.bin
copying alien.bin from ROMS/Alien (1982) (20th Century Fox Video Games, Douglas 'Dallas North' Neubauer) (11006) ~.bin to /usr/local/lib/python3.7/dist-packages/atari_py/atari_roms/alien.bin
copying amidar.bin from ROMS/Amidar (1982) (Parker Brothers, Ed Temple) (PB5310) ~.bin to /usr/local/lib/python3.7/dist-packages/atari_py/atari_roms/amidar.bin
copying assault.bin from ROMS/Assau

--2022-01-18 19:04:48--  http://www.atarimania.com/roms/Roms.rar
Resolving www.atarimania.com (www.atarimania.com)... 195.154.81.199
Connecting to www.atarimania.com (www.atarimania.com)|195.154.81.199|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11128004 (11M) [application/x-rar-compressed]
Saving to: ‘Roms.rar’


In [2]:
from collections import deque
import random
import time
from typing import Any, List, Sequence, Tuple
from os.path import exists

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import gym
import cv2

from tensorflow.keras.models import Sequential, clone_model
from tensorflow.keras.layers import Dense, Flatten, Conv2D, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K

from colabgymrender.recorder import Recorder

Imageio: 'ffmpeg-linux64-v3.3.1' was not found on your computer; downloading it now.
Try 1. Download from https://github.com/imageio/imageio-binaries/raw/master/ffmpeg/ffmpeg-linux64-v3.3.1 (43.8 MB)
Downloading: 45929032/45929032 bytes (100.0%)
  Done
File saved as /root

In [3]:
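class Memory():
    """Fixed-size replay memory. The four deques are appended in lockstep, so index i refers to the same timestep in each."""
    def __init__(self,max_len):
        self.max_len = max_len
        self.frames = deque(maxlen = max_len)
        self.actions = deque(maxlen = max_len)
        self.rewards = deque(maxlen = max_len)
        self.done_flags = deque(maxlen = max_len)

    def add_experience(self,next_frame, next_frames_reward, next_action, next_frame_terminal):
        """Append one timestep; once max_len entries are stored, the oldest one is dropped."""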
        self.frames.append(next_frame)
        self.actions.append(next_action)
        self.rewards.append(next_frames_reward)
        self.done_flags.append(next_frame_terminal)
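
The deques above behave as a fixed-size replay buffer: once `maxlen` entries are stored, each append silently drops the oldest entry. A minimal sketch of that eviction behavior, using only the standard library (the `buffer` name and the tiny capacity are illustrative and not part of the notebook):

```python
from collections import deque

# Illustrative capacity; the notebook passes max_mem_len=750000.
buffer = deque(maxlen=3)

for frame_id in range(5):
    buffer.append(frame_id)

# Only the three most recent entries survive; the two oldest were evicted.
print(list(buffer))
```

Because all four deques in `Memory` share the same `maxlen` and are appended together, they evict in step and stay aligned.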

In [4]:
class Agent():
    def __init__(self,possible_actions,starting_mem_len,max_mem_len,starting_epsilon,learn_rate, starting_lives = 5, debug = False):
        self.memory = Memory(max_mem_len)
        self.possible_actions = possible_actions
        self.epsilon = starting_epsilon
        self.epsilon_decay = .9/100000
        self.epsilon_min = .05
        self.gamma = .95
        self.learn_rate = learn_rate
        self.model = self._build_model()
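        self.model_target = clone_model(self.model)
        # clone_model creates freshly initialized weights, so copy the online weights explicitly
        self.model_target.set_weights(self.model.get_weights())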
        self.total_timesteps = 0
        self.lives = starting_lives #this parameter does not apply to pong
        self.starting_mem_len = starting_mem_len
        self.learns = 0


    def _build_model(self):
        model = Sequential()
        model.add(Input((84,84,4)))
        model.add(Conv2D(filters = 32,kernel_size = (8,8),strides = 4,data_format="channels_last", activation = 'relu',kernel_initializer = tf.keras.initializers.VarianceScaling(scale=2)))
        model.add(Conv2D(filters = 64,kernel_size = (4,4),strides = 2,data_format="channels_last", activation = 'relu',kernel_initializer = tf.keras.initializers.VarianceScaling(scale=2)))
        model.add(Conv2D(filters = 64,kernel_size = (3,3),strides = 1,data_format="channels_last", activation = 'relu',kernel_initializer = tf.keras.initializers.VarianceScaling(scale=2)))
        model.add(Flatten())
        model.add(Dense(512,activation = 'relu', kernel_initializer = tf.keras.initializers.VarianceScaling(scale=2)))
        model.add(Dense(len(self.possible_actions), activation = 'linear'))
        optimizer = Adam(self.learn_rate)
        model.compile(optimizer, loss=tf.keras.losses.Huber())
        model.summary()
        print('\nAgent Initialized\n')
        return model

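    def get_action(self,state):
        """Return an epsilon-greedy action for a state batch of shape (1, 84, 84, 4)."""
        # Explore: with probability epsilon, pick a random action
        if np.random.rand() < self.epsilon:
            return random.choice(self.possible_actions)

        # Exploit: pick the action with the highest predicted Q-value
        a_index = np.argmax(self.model.predict(state))
        return self.possible_actions[a_index]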

    def _index_valid(self,index):
        """A sample is valid only if none of its four stacked frames ends an episode."""
        return not any(self.memory.done_flags[i] for i in range(index-3, index+1))

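    def learn(self,debug = False):
        """Sample 32 transitions from memory and fit the model toward the Q-learning target.

        We want output[a] to be R_(t+1) + gamma * Qmax_(t+1), or just R_(t+1) if the episode ended.
        So the target for taking the action at index 1 should be
        [output[0], R_(t+1) + gamma * Qmax_(t+1), output[2]].
        """

        """First we need 32 random valid indices"""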
        states = []
        next_states = []
        actions_taken = []
        next_rewards = []
        next_done_flags = []

        while len(states) < 32:
            index = np.random.randint(4,len(self.memory.frames) - 1)
            if self._index_valid(index):
                state = [self.memory.frames[index-3], self.memory.frames[index-2], self.memory.frames[index-1], self.memory.frames[index]]
                state = np.moveaxis(state,0,2)/255
                next_state = [self.memory.frames[index-2], self.memory.frames[index-1], self.memory.frames[index], self.memory.frames[index+1]]
                next_state = np.moveaxis(next_state,0,2)/255

                states.append(state)
                next_states.append(next_state)
                actions_taken.append(self.memory.actions[index])
                next_rewards.append(self.memory.rewards[index+1])
                next_done_flags.append(self.memory.done_flags[index+1])

        """Now we get the outputs from our model, and the target model. We need this for our target in the error function"""
        labels = self.model.predict(np.array(states))
        next_state_values = self.model_target.predict(np.array(next_states))
        
        """Now we define our labels, or what the output should have been
           We want the output[action_taken] to be R_(t+1) + gamma * Qmax_(t+1) """
        for i in range(32):
            action = self.possible_actions.index(actions_taken[i])
            labels[i][action] = next_rewards[i] + (not next_done_flags[i]) * self.gamma * max(next_state_values[i])

        """Train our model using the states and outputs generated"""
        self.model.fit(np.array(states),labels,batch_size = 32, epochs = 1, verbose = 0)

        """Decrease epsilon and update how many times our agent has learned"""
        # tf.print("epsilon {} epsilon_min {} epsilon_decay{}".format(self.epsilon, self.epsilon_min, self.epsilon_decay))
        if self.epsilon > self.epsilon_min:
            self.epsilon -= self.epsilon_decay
            
        self.learns += 1
        
        """Every 10000 learned, copy our model weights to our target model"""
        if self.learns % 10000 == 0:
            self.model_target.set_weights(self.model.get_weights())
            tf.print('\nTarget model updated')
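
The label construction in `learn` can be illustrated without TensorFlow. The sketch below assumes a hypothetical helper `dqn_targets` and uses plain Python lists in place of the arrays returned by `model.predict`; the action values are indices into `possible_actions`. It illustrates the target r + gamma * max Q_target(s', a') and is not the notebook's own code:

```python
def dqn_targets(q_values, next_q_values, actions, rewards, dones, gamma=0.95):
    """Copy q_values and overwrite the taken action's entry with the TD target."""
    targets = [row[:] for row in q_values]
    for i, action_index in enumerate(actions):
        # No bootstrapping past a terminal transition.
        bootstrap = 0.0 if dones[i] else gamma * max(next_q_values[i])
        targets[i][action_index] = rewards[i] + bootstrap
    return targets

# Two made-up transitions: one mid-episode, one terminal with reward -1.
q = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1]]
q_next = [[1.0, 2.0, 0.5], [0.3, 0.2, 0.1]]
targets = dqn_targets(q, q_next, actions=[1, 0], rewards=[0.0, -1.0],
                      dones=[False, True])
print(targets)
```

Only the entry for the action actually taken changes, so the loss gradient for the other outputs is zero, which matches the intent of reusing the model's own predictions as labels.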

In [5]:
def resize_frame(frame):
    """Crop the frame, convert it to grayscale, and downscale it to 84x84 uint8."""
    frame = frame[30:-12,5:-4]  # crop 30 rows from the top, 12 from the bottom, 5 and 4 columns from the sides
    frame = np.average(frame,axis = 2)  # RGB -> grayscale (plain channel mean)
    frame = cv2.resize(frame,(84,84),interpolation = cv2.INTER_NEAREST)
    frame = np.array(frame,dtype = np.uint8)
    return frame
    
def tf_resize_frame(frame):
    """TensorFlow variant of resize_frame: crop, grayscale, and downscale to 84x84 uint8.

    Note: rgb_to_grayscale uses a weighted sum of the channels and tf.image.resize
    defaults to bilinear interpolation, so the output differs slightly from resize_frame.
    """
    frame = frame[30:-12,5:-4]
    frame = tf.image.rgb_to_grayscale(frame)
    frame = tf.image.resize(frame, (84,84))
    frame = tf.squeeze(frame)
    frame = np.array(frame,dtype = np.uint8)
    return frame

def initialize_new_game(name, env, agent):
    """We don't want an agent's past game influencing its new game, so we add some dummy data to initialize."""
    
    env.reset()
    starting_frame = resize_frame(env.step(0)[0])

    dummy_action = 0
    dummy_reward = 0
    dummy_done = False
    for i in range(3):
        agent.memory.add_experience(starting_frame, dummy_reward, dummy_action, dummy_done)

def make_env(name, agent):
    env = gym.make(name)
    env = Recorder(env, 'recordings')
    return env
    
# Plain-NumPy wrapper around `env.step`. Note that it uses the global `env`,
# not an argument, so `env` must exist before this is called.
def env_step(action: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Returns state, reward and done flag given an action."""

    state, reward, done, _ = env.step(action)
    return (state.astype(np.float32), 
            np.array(reward, np.int32), 
            np.array(done, np.int32))

# Wrap `env_step` as a TensorFlow operation via `tf.numpy_function`.
# This would allow it to be included in a callable TensorFlow graph.
def tf_env_step(action: tf.Tensor) -> List[tf.Tensor]:
    return tf.numpy_function(env_step, [action],
                             [tf.float32, tf.int32, tf.int32])
  
# @tf.function
# def take_step(name, env, agent, score, debug):
    
#     #1 and 2: Update timesteps and save weights
#     agent.total_timesteps += 1
#     if agent.total_timesteps % 50000 == 0:
#       agent.model.save_weights('recent_weights.hdf5')
#       print('\nWeights saved!')

#     #3: Take action
#     next_frame, next_frames_reward, next_frame_terminal = tf_env_step(agent.memory.actions[-1])
    
#     #4: Get next state
#     next_frame = resize_frame(next_frame)
#     new_state = [agent.memory.frames[-3], agent.memory.frames[-2], agent.memory.frames[-1], next_frame]
#     new_state = np.moveaxis(new_state,0,2)/255 #We have to do this to get it into keras's goofy format of [batch_size,rows,columns,channels]
#     new_state = np.expand_dims(new_state,0) #^^^
    
#     #5: Get next action, using next state
#     next_action = agent.get_action(new_state)

#     #6: If game is over, return the score
#     if next_frame_terminal:
#         agent.memory.add_experience(next_frame, next_frames_reward, next_action, next_frame_terminal)
#         return (score + next_frames_reward),True

#     #7: Now we add the next experience to memory
#     agent.memory.add_experience(next_frame, next_frames_reward, next_action, next_frame_terminal)

#     #8: If we are trying to debug this then render
#     if debug:
#         env.render()

#     #9: If the threshold memory is satisfied, make the agent learn from memory
#     # tf.print("len(agent.memory.frames) {} agent.starting_mem_len {}".format(len(agent.memory.frames), agent.starting_mem_len))
#     if len(agent.memory.frames) > agent.starting_mem_len:
#         agent.learn(debug)

#     return (score + next_frames_reward),False

def take_step(name, env, agent, score):
    
    #1 and 2: Update timesteps and save weights
    agent.total_timesteps += 1
    if agent.total_timesteps % 50000 == 0:
        agent.model.save_weights('recent_weights.hdf5')
        print('\nWeights saved!')

    next_frame, next_frames_reward, next_action, next_frame_terminal = tf_take_step(name, env, agent, score)

    #6 and 7: Add the experience to memory (also on the terminal frame) and report whether the game is over
    agent.memory.add_experience(next_frame, next_frames_reward, next_action, next_frame_terminal)
    return (score + next_frames_reward), bool(next_frame_terminal)

# @tf.function
def tf_take_step(name, env, agent, score):
    
    #3: Take action
    next_frame, next_frames_reward, next_frame_terminal = tf_env_step(agent.memory.actions[-1])
    
    #4: Get next state
    next_frame = tf_resize_frame(next_frame)
    new_state = [agent.memory.frames[-3], agent.memory.frames[-2], agent.memory.frames[-1], next_frame]
    new_state = np.moveaxis(new_state,0,2)/255 #We have to do this to get it into keras's goofy format of [batch_size,rows,columns,channels]
    new_state = np.expand_dims(new_state,0) #^^^
    
    #5: Get next action, using next state
    next_action = agent.get_action(new_state)

    #9: If the threshold memory is satisfied, make the agent learn from memory
    # tf.print("len(agent.memory.frames) {} agent.starting_mem_len {}".format(len(agent.memory.frames), agent.starting_mem_len))
    if len(agent.memory.frames) > agent.starting_mem_len:
        agent.learn()

    return next_frame, next_frames_reward, next_action, next_frame_terminal

# @tf.function
def play_episode(name, env, agent, debug = False):
    initialize_new_game(name, env, agent)
    
    done = False
    score = 0
    while not done:
        score,done = take_step(name,env,agent,score)
    
    return score
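
The control flow of `play_episode` and `take_step` can be sketched without Gym or TensorFlow. The `StubEnv` class and `play_stub_episode` function below are hypothetical stand-ins: the stub returns the same four-value tuple that `env.step` returns in this notebook and ends the episode after a fixed number of steps. As in `tf_take_step`, the action applied at each step is the one chosen at the end of the previous step:

```python
import random

class StubEnv:
    """Hypothetical stand-in for a Gym env; the episode ends after episode_len steps."""
    def __init__(self, episode_len=5):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        reward = -1 if done else 0  # lose one point on the final step
        return self.t, reward, done, {}

def play_stub_episode(env, actions=(0, 2, 3)):
    env.reset()
    action = 0  # dummy first action, mirroring initialize_new_game
    score, steps, done = 0, 0, False
    while not done:
        _, reward, done, _ = env.step(action)  # apply the previously chosen action
        action = random.choice(actions)        # choose the action for the next step
        score += reward
        steps += 1
    return score, steps

score, steps = play_stub_episode(StubEnv(episode_len=5))
print(score, steps)
```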


In [6]:
name = 'Pong-v0'

agent = Agent(possible_actions=[0,2,3],starting_mem_len=50000,max_mem_len=750000,starting_epsilon = 1, learn_rate = .00025)
env = make_env(name,agent)

last_100_avg = [-21]
scores = deque(maxlen = 100)
max_score = -21

env.reset()

initialize_new_game(name, env, agent)
next_frame, next_frames_reward, next_frame_terminal = tf_env_step(agent.memory.actions[-1])
print("shape: {}".format(next_frame.shape))
print("resize_frame shape: {}".format(resize_frame(next_frame).shape))
print("tf_resize_frame shape: {}".format(tf_resize_frame(next_frame).shape))

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 20, 20, 32)        8224      
                                                                 
 conv2d_1 (Conv2D)           (None, 9, 9, 64)          32832     
                                                                 
 conv2d_2 (Conv2D)           (None, 7, 7, 64)          36928     
                                                                 
 flatten (Flatten)           (None, 3136)              0         
                                                                 
 dense (Dense)               (None, 512)               1606144   
                                                                 
 dense_1 (Dense)             (None, 3)                 1539      
                                                                 
Total params: 1,685,667
Trainable params: 1,685,667
Non-
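
The output shapes and parameter counts in the summary above follow from the layer definitions in `_build_model`, given that Keras `Conv2D` uses unpadded ("valid") convolutions by default. A small arithmetic check in plain Python (the helper names `conv_out` and `conv_params` are illustrative):

```python
def conv_out(size, kernel, stride):
    """Output size along one axis for an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

def conv_params(kernel, in_channels, out_channels):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * out_channels + out_channels

size, channels, total = 84, 4, 0
# (kernel, stride, filters) for the three Conv2D layers in _build_model
for kernel, stride, filters in [(8, 4, 32), (4, 2, 64), (3, 1, 64)]:
    size = conv_out(size, kernel, stride)
    total += conv_params(kernel, channels, filters)
    channels = filters
    print(size, size, channels)

flat = size * size * channels
total += flat * 512 + 512  # Dense(512)
total += 512 * 3 + 3       # Dense(3): one output per possible action
print(flat, total)
```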

In [None]:
name = 'Pong-v0'

agent = Agent(possible_actions=[0,2,3],starting_mem_len=50000,max_mem_len=750000,starting_epsilon = 1, learn_rate = .00025)
env = make_env(name,agent)

last_100_avg = [-21]
scores = deque(maxlen = 100)
max_score = -21

if exists('recent_weights.hdf5'):
    agent.model.load_weights('recent_weights.hdf5')
    agent.model_target.load_weights('recent_weights.hdf5')
# agent.epsilon = 0.0

env.reset()

for i in range(100):
    timesteps = agent.total_timesteps
    timee = time.time()
    score = play_episode(name, env, agent, debug = False) #the debug flag is currently unused by play_episode
    scores.append(score)
    if score > max_score:
        max_score = score

    print('\nEpisode: ' + str(i))
    print('Steps: ' + str(agent.total_timesteps - timesteps))
    print('Duration: ' + str(time.time() - timee))
    print('Score: ' + str(score))
    print('Max Score: ' + str(max_score))
    print('Epsilon: ' + str(agent.epsilon))
    print('Memory frames: ' + str(len(agent.memory.frames)))

    if i%10 == 0:
        env.play()

    if i%100==0 and i!=0:
        last_100_avg.append(sum(scores)/len(scores))
        plt.plot(np.arange(0,i+1,100),last_100_avg)
        plt.show()


        

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_3 (Conv2D)           (None, 20, 20, 32)        8224      
                                                                 
 conv2d_4 (Conv2D)           (None, 9, 9, 64)          32832     
                                                                 
 conv2d_5 (Conv2D)           (None, 7, 7, 64)          36928     
                                                                 
 flatten_1 (Flatten)         (None, 3136)              0         
                                                                 
 dense_2 (Dense)             (None, 512)               1606144   
                                                                 
 dense_3 (Dense)             (None, 3)                 1539      
                                                                 
Total params: 1,685,667
Trainable params: 1,685,667
No


Episode: 1
Steps: 1183
Duration: 3.7262110710144043
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: -21
Epsilon: 1
Memory frames: 2715

Episode: 2
Steps: 1221
Duration: 3.8707587718963623
Score: tf.Tensor(-20, shape=(), dtype=int32)
Max Score: tf.Tensor(-20, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 3939

Episode: 3
Steps: 1347
Duration: 4.23587441444397
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-20, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 5289

Episode: 4
Steps: 1272
Duration: 3.999191999435425
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-20, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 6564

Episode: 5
Steps: 1176
Duration: 3.702587127685547
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-20, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 7743

Episode: 6
Steps: 1156
Duration: 3.667827606201172
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-20, shape=(), dtype=int32)
Ep


Episode: 11
Steps: 1549
Duration: 4.924765348434448
Score: tf.Tensor(-20, shape=(), dtype=int32)
Max Score: tf.Tensor(-19, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 15968

Episode: 12
Steps: 1259
Duration: 3.998040199279785
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-19, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 17230

Episode: 13
Steps: 1255
Duration: 3.945356607437134
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-19, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 18488

Episode: 14
Steps: 1106
Duration: 3.493860960006714
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-19, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 19597

Episode: 15
Steps: 1094
Duration: 3.5905299186706543
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-19, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 20694

Episode: 16
Steps: 1358
Duration: 4.29331374168396
Score: tf.Tensor(-19, shape=(), dtype=int32)
Max Sco


Episode: 21
Steps: 1420
Duration: 4.557872772216797
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-19, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 28966

Episode: 22
Steps: 1402
Duration: 4.47639536857605
Score: tf.Tensor(-20, shape=(), dtype=int32)
Max Score: tf.Tensor(-19, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 30371

Episode: 23
Steps: 1171
Duration: 3.6952507495880127
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-19, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 31545

Episode: 24
Steps: 1275
Duration: 4.04198956489563
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-19, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 32823

Episode: 25
Steps: 1520
Duration: 4.766033411026001
Score: tf.Tensor(-19, shape=(), dtype=int32)
Max Score: tf.Tensor(-19, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 34346

Episode: 26
Steps: 1310
Duration: 4.175394058227539
Score: tf.Tensor(-20, shape=(), dtype=int32)
Max Scor


Episode: 31
Steps: 1172
Duration: 3.789700746536255
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-18, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 42536

Episode: 32
Steps: 1263
Duration: 3.977405309677124
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-18, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 43802

Episode: 33
Steps: 1338
Duration: 4.179568529129028
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-18, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 45143

Episode: 34
Steps: 1411
Duration: 4.540043830871582
Score: tf.Tensor(-21, shape=(), dtype=int32)
Max Score: tf.Tensor(-18, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 46557

Episode: 35
Steps: 1427
Duration: 4.4916675090789795
Score: tf.Tensor(-19, shape=(), dtype=int32)
Max Score: tf.Tensor(-18, shape=(), dtype=int32)
Epsilon: 1
Memory frames: 47987

Episode: 36
Steps: 1293
Duration: 4.169845819473267
Score: tf.Tensor(-20, shape=(), dtype=int32)
Max Sc
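
Epsilon stays at 1 throughout the log above because the decay happens inside `learn`, which is only called once the memory holds more than `starting_mem_len` (50000) frames. The sketch below mirrors the linear schedule set up in `Agent.__init__`; the function name `epsilon_after` is hypothetical:

```python
def epsilon_after(learn_steps, start=1.0, decay=0.9 / 100000, minimum=0.05):
    """Epsilon after a number of learn() calls, using the same
    'subtract while above the minimum' rule as the Agent class."""
    eps = start
    for _ in range(learn_steps):
        if eps > minimum:
            eps -= decay
    return eps

print(epsilon_after(0))                   # before any learning step
print(round(epsilon_after(50000), 4))     # partway through the decay
print(epsilon_after(200000) <= 0.05)      # well past the end of the decay
```

With these constants the schedule needs a little over 105,000 learning steps to reach the floor, and because the check happens before the subtraction, the final value can end up one decrement below `epsilon_min`.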