## Training an agent to play Breakout using Reinforcement Learning
**Gabriel C. Ullmann, COMP 6321**

In this project, I use three Reinforcement Learning algorithms (PPO, A2C, and DQN) to train an OpenAI Gym agent to play the game Breakout. Agents were trained using different combinations of algorithms, training steps, and reward functions to determine which one reaches the maximum average score and number of lives in the game. In this notebook, I will show how to create an agent, how the agent communicates with the game in order to play it and learn, and how we can understand the training process and the performance of our agent.

**Run the code** cell below to import the required packages for creating an agent:
- **OpenAI Gym**: provides us with a toolkit to build an agent.
- **Stable Baselines3**: provides us with implementations of RL algorithms to train the agent.
- **datetime**: for getting the current timestamp to record our agent tests.
- **NumPy**: used only briefly for array manipulation.

In [1]:
import gym
from gym import spaces
from stable_baselines3 import PPO
from datetime import datetime
import numpy as np

**Run the code** cell below to import the required packages to run the game. 

The implementation of the game Breakout that will be played by our agent was developed by [John Cheetham](https://github.com/johncheetham/breakout) using [pygame](https://github.com/pygame/pygame/), a popular game development library. Besides pygame, we will also need:
- **random**: used to randomize the starting position of the ball in the game. This makes the game a bit less "predictable" and therefore allows us to check if our agent is learning to adapt to different situations and not just repeating the same actions.
- **breakout_objects**: a script written by me (imported as `game.breakout_objects`) that contains classes representing in-game objects (ball, bat, etc.), as well as utility functions and initialization parameters for the game.
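
To make the randomized start concrete, here is a minimal sketch that samples a starting x offset the same way the game class further down does (screen centre plus `random.randint(-200, 200)`). The 640-pixel width is taken from the game class; the function name `sample_ball_start_x` is hypothetical and not part of the project.

```python
import random

SCREEN_WIDTH = 640  # width set in BreakoutGame.init_game


def sample_ball_start_x(width=SCREEN_WIDTH, spread=200):
    """Return a starting x offset: screen centre plus or minus `spread` pixels."""
    return width / 2 + random.randint(-spread, spread)


samples = [sample_ball_start_x() for _ in range(1000)]
print(min(samples), max(samples))  # both values fall inside [120.0, 520.0]
```

Because the ball can start anywhere in a 400-pixel band, an agent that simply memorizes one sequence of moves will not keep the ball in play.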

**P.S:** you should see pygame's version being printed to the console if it loads correctly.

In [2]:
import pygame
import random
import game.breakout_objects

pygame 2.1.2 (SDL 2.0.16, Python 3.8.10)
Hello from the pygame community. https://www.pygame.org/contribute.html


**1.1 Creating the game class:** I organized the entire game logic inside a class (BreakoutGame), and my first idea was simply importing this class into the notebook. However, there was an [issue](https://stackoverflow.com/questions/58687829/why-does-my-jupyter-notebook-keeps-crashing-when-rendering-text-in-pygame) with this approach: the game runs, but the execution is not terminated by calling pygame.quit(). I also tried calling sys.exit(), but then it would crash Jupyter Notebook's kernel too. Therefore, the only way I found to make this project work inside a notebook was by copying the entire class to the cell below. 

Since the focus of this project is Reinforcement Learning, I will not go into detail on how the game works. For brevity, I also removed code comments and documentation, but you can go directly to the source to read these in more detail if you want. However, it is relevant to say that the game works in conjunction with the gym environment (BreakoutAgent) using the [Observer pattern](https://refactoring.guru/design-patterns/observer): at every step of execution, the game notifies the agent of changes to its current state (e.g., position of the ball and bat, score, etc.), and the agent uses these observations to learn and choose its next action.
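
The real `Observer` and `Event` classes live in `game/breakout_objects.py` and are not shown in this notebook. The sketch below is a simplified stand-in, under the assumption that the observer only stores the latest event; the `Subject` class name and the reduced `Event` fields are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Snapshot of the game state sent to observers (fields are illustrative)."""
    score: int
    lives: int


class Observer:
    """Stores the most recent event it was notified about."""

    def __init__(self):
        self.event = None

    def update(self, event):
        self.event = event


class Subject:
    """Keeps a list of observers and pushes every event to all of them."""

    def __init__(self):
        self._observers = []

    def attach(self, observer):
        self._observers.append(observer)

    def notify(self, event):
        for observer in self._observers:
            observer.update(event)


game = Subject()
agent_view = Observer()
game.attach(agent_view)
game.notify(Event(score=10, lives=3))
print(agent_view.event)  # Event(score=10, lives=3)
```

In the notebook, `BreakoutGame` plays the role of the subject, and the environment reads `self.observer.event` after each call to `run_logic`.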

In [3]:
import random
import sys, pygame
import game.breakout_objects as breakout_objects

""" By Gabriel C. Ullmann (2022). Based on code from John Cheetham (2009).
Source: https://github.com/johncheetham/breakout """
class BreakoutGame():

    def __init__(self):
        self._observers = []

    def attach(self, observer: breakout_objects.Observer) -> None:
        self._observers.append(observer)

    def detach(self,observer: breakout_objects.Observer) -> None:
        self._observers.remove(observer)

    def notify(self, event: breakout_objects.Event) -> None:
        for observer in self._observers:
            observer.update(event)

    def init_game(self):
        self.score = 0  
        self.wall = None
        self.ball_xspeed = breakout_objects.BALL_XSPEED
        self.ball_yspeed = breakout_objects.BALL_YSPEED
        self.lives = breakout_objects.MAX_LIVES
        self.bat_speed = breakout_objects.BAT_XSPEED
        self.size = self.width, self.height = 640, 480
        self.gameScreen = None
        self.gameClock = None

        self.init_graphics()
        self.init_objects()
        event = breakout_objects.Event(self.score, self.lives, self.bat, self.ball)
        self.notify(event)

    def init_graphics(self):
        pygame.init()  
        self.gameScreen = pygame.display.set_mode(self.size)
        self.gameClock = pygame.time.Clock()
        pygame.mouse.set_visible(0) 

    def init_objects(self):
        self.wall = breakout_objects.Wall()
        self.wall.build_wall(self.width)
        self.bat = breakout_objects.Bat()
        self.ball = breakout_objects.Ball()
        self.bat.rect = self.bat.rect.move((self.width / 2) - (self.bat.rect.right / 2), self.height - 20)
        self.ball.rect = self.ball.rect.move((self.width / 2) + random.randint(-200, 200), self.height / 2)

    def run_logic(self, comm):
        self.check_agent_commands(comm)
        self.check_quit_command() 
        self.check_collision()   
        self.check_game_over_condition()  
        self.update_ball_position()  
        self.check_ball_hit_wall()    
        event = breakout_objects.Event(self.score, self.lives, self.bat, self.ball)
        self.notify(event)

    def check_agent_commands(self, comm):
        if comm == 0:                        
            self.bat.rect = self.bat.rect.move(-self.bat_speed, 0)     
            if (self.bat.rect.left < 0):                           
                self.bat.rect.left = 0      
        if comm == 1:                    
            self.bat.rect = self.bat.rect.move(self.bat_speed, 0)
            if (self.bat.rect.right > self.width):                            
                self.bat.rect.right = self.width
            
    def check_collision(self):
        if self.ball.is_collided(self.bat.rect):
            self.ball_yspeed = -self.ball_yspeed                            
            offset = self.ball.rect.center[0] - self.bat.rect.center[0]                                             
            if offset > 0:
                if offset > 30:  
                    self.ball_xspeed = 7
                elif offset > 23:                 
                    self.ball_xspeed = 6
                elif offset > 17:
                    self.ball_xspeed = 5 
            else:  
                if offset < -30:                             
                    self.ball_xspeed = -7
                elif offset < -23:
                    self.ball_xspeed = -6
                elif offset < -17:
                    self.ball_xspeed = -5                
   
    def check_game_over_condition(self):
        if self.ball.rect.top > self.height:
            self.lives -= 1    
            self.ball_xspeed = breakout_objects.BALL_XSPEED
            self.ball_yspeed = breakout_objects.BALL_YSPEED            
            self.ball.rect.center = self.width / 2 + random.randint(-200, 200), self.height / 3  

        if self.lives == 0:    
            event = breakout_objects.Event(self.score, self.lives, self.bat, self.ball)
            self.notify(event)
         
    def update_ball_position(self):
        if self.ball.rect.left < 0 or self.ball.rect.right > self.width:
            self.ball_xspeed = -self.ball_xspeed                         
        if self.ball.rect.top < 0:
            self.ball_yspeed = -self.ball_yspeed    

        if self.ball_xspeed < 0 and self.ball.rect.left < 0:
            self.ball_xspeed = -self.ball_xspeed                                
        if self.ball_xspeed > 0 and self.ball.rect.right > self.width:
            self.ball_xspeed = -self.ball_xspeed                               

    def check_ball_hit_wall(self):
        index = self.ball.rect.collidelist(self.wall.brickrect)       
        if index != -1: 
            if self.ball.rect.center[0] > self.wall.brickrect[index].right or \
                self.ball.rect.center[0] < self.wall.brickrect[index].left:
                self.ball_xspeed = -self.ball_xspeed
            else:
                self.ball_yspeed = -self.ball_yspeed                         
            self.wall.brickrect[index:index + 1] = []
            self.score += 10

    def render(self):
        self.gameClock.tick(60)

        self.gameScreen.fill(breakout_objects.BG_COLOR)
        scoretext, scoretextrect = breakout_objects.draw_text(self.score, self.width)
        self.gameScreen.blit(scoretext, scoretextrect)

        for i in range(0, len(self.wall.brickrect)):
            self.gameScreen.blit(self.wall.brick, self.wall.brickrect[i])    

        if self.wall.brickrect == []:              
            self.wall.build_wall(self.width)                
            self.ball_xspeed = breakout_objects.BALL_XSPEED
            self.ball_yspeed = breakout_objects.BALL_YSPEED              
            self.ball.rect.center = self.width / 2, self.height / 3
        
        self.gameScreen.blit(self.ball.sprite, self.ball.rect)
        self.gameScreen.blit(self.bat.sprite, self.bat.rect)
        pygame.display.flip()

    def check_quit_command(self):
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                sys.exit()
            if event.type == pygame.KEYDOWN:
                if event.key == pygame.K_ESCAPE:
                    sys.exit()

**1.2 Creating an agent**: Gym works with the concept of environments, classes with a standardized interface inside of which you can implement your agent. An environment class is composed of 4 methods:
- **Init**: The first method executed after class instantiation. Here we declare some attributes that will dictate the basic behavior of our agent, such as the number of actions and observations. Since this is a game of Breakout, there are only two possible actions: moving the bat to the left or to the right. We will observe 4 variables in our game: the left and right edges of the ball, and the left and right edges of the bat.
- **Step**: This method must be called in every iteration of a loop when we are training or testing our agent. Inside of it, we decide which agent actions return positive or negative rewards, and also collect the observation that will be used to train the agent. Here, the agent gets +1 when the bat follows the ball (the left or the right edges of the ball and bat are less than 50 pixels apart) and -1 when it drifts away from the ball. When the score increases because a block was broken, the reward for that step is 100 instead. The reward is initialized to zero, but one of these rules always overrides it.
- **Reset**: Resets the game to its initial state. If we are executing our agent multiple times and eventually it reaches a "game over" state, we can use this method to restart the game and keep training.
- **Render**: This method is called in every iteration of a loop when we want to watch the agent play (training itself does not need it). Inside it, you can call your game-rendering logic (e.g., drawing things on the screen). As we have already created our game object in **Init**, here we simply call self.game.render().

In [4]:
class BreakoutAgent(gym.Env):

    def __init__(self):
        super(BreakoutAgent, self).__init__()
        number_of_actions = 2
        number_of_observations = 4
        self.action_space = spaces.Discrete(number_of_actions)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(number_of_observations,), dtype=np.float32)
        self.game = BreakoutGame()
        self.observer = breakout_objects.Observer()
        self.prevScore = 0
        self.game.attach(self.observer)

    def step(self, action):
        self.game.run_logic(action)

        # default reward is zero
        reward = 0
        done = (self.observer.event.lives == 0) 
        info = {"score": self.observer.event.score, "lives": self.observer.event.lives}

        ball = self.observer.event.ball.rect
        bat = self.observer.event.bat.rect
        dif_l = abs(ball.left - bat.left)
        dif_r = abs(ball.right - bat.right)  
        
        # reward 1: follow the ball
        if dif_l < 50 or dif_r < 50:
            reward = 1
        else:
            reward = -1

        # reward 2: break blocks to increase the score
        if self.observer.event.score - self.prevScore > 0:
            reward = 100
            
        self.prevScore = self.observer.event.score
        return np.array([ball.left, ball.right, bat.left, bat.right], dtype=np.float32), reward, done, info

    def reset(self):
        self.game.init_game()
        ball = self.observer.event.ball.rect
        bat = self.observer.event.bat.rect
        return np.array([ball.left, ball.right, bat.left, bat.right], dtype=np.float32)

    def render(self):
        self.game.render()
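
To check the reward rules in isolation, they can be pulled out of `step()` into a pure function. This is only a sketch: `compute_reward` is a hypothetical helper, and the edge coordinates in the calls are made-up values, not measurements from the game.

```python
def compute_reward(ball_left, ball_right, bat_left, bat_right, score, prev_score):
    """Mirror of the reward rules in BreakoutAgent.step (hypothetical helper)."""
    # reward 1: +1 while the left or right edges of ball and bat are under 50 px apart
    near = abs(ball_left - bat_left) < 50 or abs(ball_right - bat_right) < 50
    reward = 1 if near else -1
    # reward 2: a score increase overrides the tracking reward for this step
    if score - prev_score > 0:
        reward = 100
    return reward


print(compute_reward(300, 316, 280, 360, score=0, prev_score=0))   # 1
print(compute_reward(300, 316, 500, 580, score=0, prev_score=0))   # -1
print(compute_reward(300, 316, 500, 580, score=10, prev_score=0))  # 100
```

Note that the default reward of zero never survives: every step ends with 1, -1, or 100.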

**1.2 Training the agent**: 
1. Create an instance of the Gym environment class.
1. Create an instance of Stable Baselines3's PPO algorithm, passing the environment as a parameter. Keep verbose=1 so you can observe the statistics output by the agent as it is trained.
1. Call model.learn() and pass the desired number of timesteps (we will use 100K since it yields good results). In general, training for longer tends to improve the agent. Here we will use the default hyperparameters, such as learning_rate=0.0003. The full list is available in the [documentation](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html).
1. The training may take a couple of minutes. After it is finished, the trained agent will be saved to a file so we can play it back later.
1. We close the pygame window so it does not stay running after the agent has finished playing.

In [5]:
steps = 100000
env = BreakoutAgent()
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=steps) 
model.save("model_test")
pygame.quit()

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
-----------------------------
| time/              |      |
|    fps             | 1391 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1401        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.017737664 |
|    clip_fraction        | 0.106       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.678      |
|    explained_variance   | -0.00472    |
|    learning_rate        | 0.0003      |
|    loss                 | 20.8        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.00517    |
|    value_loss         

-----------------------------------------
| time/                   |             |
|    fps                  | 1492        |
|    iterations           | 13          |
|    time_elapsed         | 17          |
|    total_timesteps      | 26624       |
| train/                  |             |
|    approx_kl            | 0.027310079 |
|    clip_fraction        | 0.184       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.623      |
|    explained_variance   | -1.19e-07   |
|    learning_rate        | 0.0003      |
|    loss                 | -0.027      |
|    n_updates            | 120         |
|    policy_gradient_loss | 0.00119     |
|    value_loss           | 8.51        |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1491        |
|    iterations           | 14          |
|    time_elapsed         | 19          |
|    total_timesteps      | 28672 

----------------------------------------
| time/                   |            |
|    fps                  | 1474       |
|    iterations           | 24         |
|    time_elapsed         | 33         |
|    total_timesteps      | 49152      |
| train/                  |            |
|    approx_kl            | 0.00997024 |
|    clip_fraction        | 0.0898     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.582     |
|    explained_variance   | 0          |
|    learning_rate        | 0.0003     |
|    loss                 | -0.00784   |
|    n_updates            | 230        |
|    policy_gradient_loss | 0.000356   |
|    value_loss           | 0.0261     |
----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1473        |
|    iterations           | 25          |
|    time_elapsed         | 34          |
|    total_timesteps      | 51200       |
| train/  

-----------------------------------------
| time/                   |             |
|    fps                  | 1484        |
|    iterations           | 35          |
|    time_elapsed         | 48          |
|    total_timesteps      | 71680       |
| train/                  |             |
|    approx_kl            | 0.013334225 |
|    clip_fraction        | 0.211       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.609      |
|    explained_variance   | -1.99e-05   |
|    learning_rate        | 0.0003      |
|    loss                 | 0.0154      |
|    n_updates            | 340         |
|    policy_gradient_loss | -0.0116     |
|    value_loss           | 0.207       |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1485        |
|    iterations           | 36          |
|    time_elapsed         | 49          |
|    total_timesteps      | 73728 

-----------------------------------------
| time/                   |             |
|    fps                  | 1493        |
|    iterations           | 46          |
|    time_elapsed         | 63          |
|    total_timesteps      | 94208       |
| train/                  |             |
|    approx_kl            | 0.011099597 |
|    clip_fraction        | 0.165       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.564      |
|    explained_variance   | -2.38e-07   |
|    learning_rate        | 0.0003      |
|    loss                 | 4.74        |
|    n_updates            | 450         |
|    policy_gradient_loss | -0.0148     |
|    value_loss           | 11.2        |
-----------------------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 1493         |
|    iterations           | 47           |
|    time_elapsed         | 64           |
|    total_timesteps      | 9
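
The iteration counts in the log above can be reproduced with a little arithmetic. The sketch below assumes PPO's default rollout length of 2048 steps and 10 optimisation epochs per rollout, which matches the logged rows (iteration 46 shows 94208 timesteps and 450 updates); the function name `ppo_schedule` is hypothetical.

```python
import math

N_STEPS = 2048  # assumed PPO default: environment steps collected per iteration
N_EPOCHS = 10   # assumed PPO default: optimisation epochs per rollout


def ppo_schedule(total_timesteps, n_steps=N_STEPS, n_epochs=N_EPOCHS):
    """Return (iterations, timesteps actually collected, n_updates in the last log row)."""
    iterations = math.ceil(total_timesteps / n_steps)
    collected = iterations * n_steps
    # In the log, n_updates lags one iteration behind (iteration 2 -> 10, 46 -> 450)
    n_updates_logged = (iterations - 1) * n_epochs
    return iterations, collected, n_updates_logged


print(46 * N_STEPS)           # 94208, as in the iteration-46 row
print(ppo_schedule(100_000))  # (49, 100352, 480)
```

Under these assumptions, asking for 100K timesteps runs slightly longer than requested, because training only stops at the end of a full rollout.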

**1.3 Testing the agent**: let's create a for loop and run our agent, letting it play for 2000 steps (around 33 s, since the game runs at 60 fps).
1. Call model.predict(), passing the observations as a parameter. As the game has just been reset, the initial observation will correspond to the initial position of the ball and bat.
2. Call env.step() to advance the game by one step and collect the reward. The larger the reward, the more successful the actions taken by the agent.
3. Call env.render() to draw the current frame on the screen (the game logic itself runs inside env.step()).
4. If the agent reaches a "game over" state before reaching 2000 steps, the game session will be terminated.

In [6]:
obs = env.reset()
def runAgent(env, obs, model):
    for i in range(2000):
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        env.render()
        if done:
            print("Game over! No more lives.")
            break
    return info
            
info = runAgent(env, obs, model)
pygame.quit()
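
A single run says little about average performance. As a rough sketch, the same loop can be repeated over several episodes and the final score and lives averaged. `StubEnv` and `evaluate` below are hypothetical stand-ins: the stub only imitates the `reset()` and 4-tuple `step()` contract of `BreakoutAgent`, so the example runs without pygame.

```python
import random
from statistics import mean


class StubEnv:
    """Tiny stand-in for BreakoutAgent with the same reset()/step() contract."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        info = {"score": 10 * self.t, "lives": 0 if done else 3}
        return [0.0, 0.0, 0.0, 0.0], 1, done, info


def evaluate(env, choose_action, episodes=3, max_steps=2000):
    """Run several episodes and average the final score and remaining lives."""
    finals = []
    for _ in range(episodes):
        obs = env.reset()
        info = {"score": 0, "lives": 0}
        for _ in range(max_steps):
            obs, reward, done, info = env.step(choose_action(obs))
            if done:
                break
        finals.append(info)
    return mean(i["score"] for i in finals), mean(i["lives"] for i in finals)


avg_score, avg_lives = evaluate(StubEnv(), lambda obs: random.choice([0, 1]))
print(avg_score, avg_lives)
```

With the trained model, the random policy could be swapped for `lambda obs: model.predict(obs, deterministic=True)[0]` and `StubEnv()` for the real `env`.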

**1.4 Analysing agent performance with tensorboard:** for PPO, A2C, DQN and other algorithms available in Stable Baselines3 you can pass a folder path for the "tensorboard_log" parameter. When this parameter is set, important training metrics such as policy loss and mean reward will be saved to a file that can later be read by [Tensorboard](https://www.tensorflow.org/tensorboard), a visualization tool.

In the cell below, I changed the PPO constructor call by passing the "tensorboard_log" parameter. Run the training again to generate the logs.

In [9]:
tensorboard_logs_path = 'testing/tensorboard'
steps = 100000
env = BreakoutAgent()
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=tensorboard_logs_path)
model.learn(total_timesteps=steps) 
model.save("model_test")
pygame.quit()

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to testing/tensorboard/PPO_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 932      |
|    ep_rew_mean     | -678     |
| time/              |          |
|    fps             | 1901     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 932         |
|    ep_rew_mean          | -678        |
| time/                   |             |
|    fps                  | 1491        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.015061877 |
|    clip_fraction        | 0.0783      |
|    clip_range          

After training, you should see inside "testing/tensorboard" a folder called "PPO_1" and a log file inside of it. If you have tensorboard installed, you can run it from inside this notebook and visualize several charts that describe the training process.

In [10]:
from tensorboard import notebook
notebook.display(port=6006, height=1000) 

Selecting TensorBoard with logdir testing/tblogs/ (started 0:13:56 ago; port 6006, pid 59895).


Tensorboard shows us some interesting data about the training. I will explain some of them briefly below:
- **rollout/ep_len_mean**: Mean episode length. When this value is higher, it means our agent is playing for longer sessions, which means it is not "dying" in the game. If this value increases during training, it is evidence that our agent is being successful.
- **rollout/ep_rew_mean**: Mean reward per episode. If this value increases during training, it is evidence that our agent is doing the actions we want it to do, since they return positive rewards.
- **train/learning_rate**: Since we use the default constant learning rate, this value stays the same throughout training. (Stable Baselines3 also accepts a schedule function for this parameter.)
- **train/entropy_loss**: Entropy is a way to measure "randomness". In the context of our agent, it indicates how random its actions are. Stable Baselines3 logs this metric as the negative of the policy entropy, so as the agent learns and becomes less random the value moves toward zero (in the first training run above, from -0.678 to -0.564).
- **train/policy_gradient_loss**: The loss of the policy update. In the first training run above it fluctuates around zero (between -0.0148 and 0.00119) rather than decreasing steadily, so it is best read together with the mean reward.
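
To get a feel for the entropy values, the sketch below computes the entropy of a two-action policy by hand. It assumes the logged `entropy_loss` is the negative of this quantity (in nats); the helper name `categorical_entropy` is hypothetical.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = categorical_entropy([0.5, 0.5])    # fully random left/right policy
confident = categorical_entropy([0.9, 0.1])  # policy that mostly picks one action
print(round(uniform, 3), round(confident, 3))  # 0.693 0.325
```

Under that assumption, the first logged value of -0.678 is close to -ln 2 (about -0.693), which is what an almost uniformly random choice between the two actions would produce.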

In sum: if the mean reward is increasing and the policy is becoming less random, it is a good sign that the agent is learning how to play the game and do well at it. If it isn't, we could try different reward functions and hyperparameters until we find a better solution.