## Training an agent to play Breakout using Reinforcement Learning
**Gabriel C. Ullmann, COMP 6321**

In this project, I use three Reinforcement Learning algorithms (PPO, A2C and DQN) to train an OpenAI Gym agent to play the game Breakout. Agents were trained using different combinations of algorithm, number of training steps and reward function, in order to determine which combination reaches the highest average score and number of lives in the game.

**Run the code** cell below to import the required packages:
- **OpenAI Gym**: provides us with a toolkit to build and train an agent
- **Stable Baselines3**: provides us with an implementation of an RL algorithm called PPO
- **Datetime**: for getting the current timestamp to record our agent tests
- **Numpy**: used only briefly for array manipulation

In [2]:
import gym
from gym import spaces
from stable_baselines3 import PPO
from datetime import datetime
import numpy as np

**Run the code** cell below to import the Breakout game that our agent will be trained to play (BreakoutGame). 

Additionally, a GameObjects class must be imported, since it contains utility methods used for drawing things on the screen with PyGame, controlling game variables, etc. I will not go into detail about how this works, since our focus here is on the agent, not the game. 

P.S.: you should see pygame's version printed to the console if this package loads correctly.

In [2]:
import GameObjects
from BreakoutGame import BreakoutGame

pygame 2.1.2 (SDL 2.0.16, Python 3.8.10)
Hello from the pygame community. https://www.pygame.org/contribute.html


**1.1 Creating an agent**: Gym works with the concept of environments: classes with a standardized interface inside of which you implement your agent. An environment class is composed of 4 methods:
- **Init**: The first method executed after class instantiation. Here we declare some attributes that dictate the basic behavior of our agent, such as the number of actions and observations. Since this is a game of Breakout, there are only two possible actions: moving the bat to the left or to the right. We observe 4 variables in our game: the left and right edges of the ball, and the left and right edges of the bat.
- **Step**: This method must be called in every iteration of a loop when we are training or testing our agent. Inside of it, we decide which agent actions return positive or negative rewards, and also collect the observation that will be used to train the agent. Here, we give a positive reward if the agent makes the bat follow the ball, and also when it scores points. When the bat gets away from the ball, a negative reward is given.
- **Reset**: Resets the game to its initial state. If we are executing our agent multiple times and eventually it reaches a "game over" state, we can use this method to restart the game and keep training.
- **Render**: Like step, this method must be called in every iteration of a loop when we are training or testing our agent. Inside it you can call your game-rendering logic (e.g., drawing things on the screen, checking for collisions, etc.). As we have already created our game object in **Init**, here we simply call self.game.render(). 

In [3]:
class BreakoutAgent(gym.Env):
    # Possible actions: going left or right
    LEFT = 0
    RIGHT = 1

    def __init__(self):
        super(BreakoutAgent, self).__init__()
        number_of_actions = 2
        number_of_observations = 4
        self.action_space = spaces.Discrete(number_of_actions)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(number_of_observations,), dtype=np.float32)
        self.game = BreakoutGame()
        self.observer = GameObjects.Observer()
        self.prevScore = 0
        self.game.attach(self.observer)

    def step(self, action):
        self.game.runLogic(action)

        # by default, reward is zero; if the agent does something we want, we raise this reward to a positive value
        reward = 0
        
        # indicates whether the agent reached a "game over" state, no more lives
        done = (self.observer.event.lives == 0) 
        
        # returns current score and lives, will be useful later for recording the agent's performance
        info = {"score": self.observer.event.score, "lives": self.observer.event.lives}

        ball = self.observer.event.ball.rect
        bat = self.observer.event.bat.rect
        dif_l = abs(ball.left - bat.left)
        dif_r = abs(ball.right - bat.right)  
        
        # reward 1: follow the ball
        if dif_l < 50 or dif_r < 50:
            reward = 1
        else:
            reward = -1

        # reward 2: break blocks to increase the score
        if self.observer.event.score - self.prevScore > 0:
            reward = 100

        self.prevScore = self.observer.event.score
        
        # on each step, return: observation (ball left/right, bat left/right), reward, done, info
        return np.array([ball.left, ball.right, bat.left, bat.right], dtype=np.float32), reward, done, info

    def reset(self):
        print("reset in the end")
        self.game.initGame()
        ball = self.observer.event.ball.rect
        bat = self.observer.event.bat.rect
        return np.array([ball.left, ball.right, bat.left, bat.right], dtype=np.float32)

    def render(self, mode='human'):
        self.game.render()
        
    def finish(self, mode='human'):
        print("called")
        self.game.finish()
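
The reward logic inside step can also be read in isolation. The sketch below restates it as a standalone function so the two rules are easier to inspect; the name compute_reward and its parameters are hypothetical (they are not part of the original class), while the 50-pixel threshold and the reward values are taken from the code above.

```python
def compute_reward(ball_left, ball_right, bat_left, bat_right,
                   score, prev_score, threshold=50):
    """Hypothetical helper mirroring the reward rules in BreakoutAgent.step."""
    dif_l = abs(ball_left - bat_left)
    dif_r = abs(ball_right - bat_right)

    # reward 1: +1 when the bat is horizontally close to the ball, -1 otherwise
    reward = 1 if (dif_l < threshold or dif_r < threshold) else -1

    # reward 2: an increased score (a broken block) replaces reward 1
    if score - prev_score > 0:
        reward = 100
    return reward


print(compute_reward(100, 110, 90, 150, score=0, prev_score=0))   # bat near ball: 1
print(compute_reward(400, 410, 90, 150, score=0, prev_score=0))   # bat far away: -1
print(compute_reward(400, 410, 90, 150, score=10, prev_score=0))  # block broken: 100
```

Note that the score reward overrides the follow-the-ball reward rather than being added to it, which matches the assignment in step.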

**1.2 Training the agent**: 
1. Create an instance of the Gym environment class
1. Create an instance of Stable Baselines3's PPO algorithm, passing the environment as a parameter. Keep verbose=1 so you can observe the statistics output by the agent as it is trained.
1. Call model.learn() and pass the desired number of timesteps (the project uses 100K, since it yields good results; note that the example cell below sets steps = 100). This is the number of "frames" of your game that will be executed. In general, the longer you train your agent, the better. However, this may vary, and you will have to test multiple values of this hyperparameter.
1. The script will stay on the "learn" line until training finishes. It then saves the trained agent to a file, so we can play it back later, and closes the game.

**P.S:** StableBaselines3 will print several times to the console during training. If this bothers you, change to verbose=0.

In [4]:
%tb
steps = 100
env = BreakoutAgent()
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=steps) 
model.save("model_test")
env.finish()

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


No traceback available to show.


reset in the end
reset in the end
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1.7e+03  |
|    ep_rew_mean     | -532     |
| time/              |          |
|    fps             | 1409     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
called
called 2
quit()
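
The log above reports total_timesteps of 2048 even though steps was set to 100. A likely explanation is that PPO in Stable Baselines3 collects experience in whole rollouts of n_steps per environment (2048 by default) and stops only once the requested total has been reached. The arithmetic can be sketched as follows; the function name is hypothetical, and the sketch assumes the default n_steps and a single environment.

```python
import math


def timesteps_actually_run(requested, n_steps=2048, n_envs=1):
    """Hypothetical helper: timesteps covered when training collects whole
    rollouts of n_steps * n_envs until `requested` has been reached."""
    rollout_size = n_steps * n_envs
    return math.ceil(requested / rollout_size) * rollout_size


print(timesteps_actually_run(100))      # 2048
print(timesteps_actually_run(100_000))  # 100352
```

Under these assumptions, any request below 2048 steps trains for exactly one rollout, so comparing very small step counts would not show a difference.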


**1.3 Testing the agent**: let's create a for loop and run our agent, letting it play for 2000 steps (this is around 30s, since the game runs at 60 fps). Note that the example cell below loops for only 10 steps; increase the range to 2000 for a full session.
1. Call model.predict(), passing the observations as parameter. As the game has just been reset, the initial observation will correspond to the default initial position of the ball and bat.
2. Call env.step() to check if the agent did something that will return rewards to it.
3. Call env.render() to draw the game elements on the screen and execute game logic.
4. If the agent reaches a "game over" state before reaching 2000 steps, terminate the game session.

In [5]:
obs = env.reset()
def runAgent(env, obs, model):
    for i in range(10):
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        env.render()
        if done:
            print("Game over! No more lives.")
            break
    env.finish()
    return info
            
info = runAgent(env, obs, model)

reset in the end
called
called 2
quit()


**1.4 Recording the agent's performance**: you can execute the same function we created in 1.3, but now inside of another for loop that repeats 10 times. This way, you can run your agent through 10 game sessions of 30s, and observe how it performs. At the end of each session, you can save the obtained score and number of lives to a CSV file that can be used for further analysis later.

In [6]:
for i in range(0, 10):
    info = runAgent(env, obs, model)

    with open("scores/score_example.csv", "a") as file:
        p1 = str(i)
        p2 = str(info["score"])
        p3 = str(info["lives"])
        p4 = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        file.write(p1 + "," + p2 + "," + p3 + "," + p4 + "\n")

called
called 2
quit()
called
called 2
quit()
called
called 2
quit()
called
called 2
quit()
called
called 2
quit()
called
called 2
quit()
called
called 2
quit()
called
called 2
quit()
called
called 2
quit()
called
called 2
quit()
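
The loop above passes the same obs to every session and does not call reset between sessions. One possible alternative, sketched below, resets the environment at the start of each session and writes rows with the standard csv module. StubEnv, run_session and record_sessions are hypothetical stand-ins (the stub replaces BreakoutAgent, and a fixed action replaces model.predict) so that the sketch runs on its own; the four-value return of step follows the Gym API used in this notebook.

```python
import csv
import os
import tempfile
from datetime import datetime


class StubEnv:
    """Hypothetical stand-in for BreakoutAgent: loses one life every 3 steps."""

    def reset(self):
        self.lives, self.score, self.t = 5, 0, 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        self.score += 1
        if self.t % 3 == 0:
            self.lives -= 1
        done = self.lives == 0
        info = {"score": self.score, "lives": self.lives}
        return [0.0, 0.0, 0.0, 0.0], 0, done, info


def run_session(env, max_steps):
    obs = env.reset()  # start every session from the initial game state
    info = {}
    for _ in range(max_steps):
        action = 0  # placeholder for model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        if done:
            break
    return info


def record_sessions(env, path, sessions=10, max_steps=2000):
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)  # create the output folder if missing
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for i in range(sessions):
            info = run_session(env, max_steps)
            stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            writer.writerow([i, info["score"], info["lives"], stamp])


# Demo on a temporary folder so no real results file is touched
out_path = os.path.join(tempfile.mkdtemp(), "scores", "score_example.csv")
record_sessions(StubEnv(), out_path, sessions=3, max_steps=6)
with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

In the notebook, the stub would be replaced by the trained environment and the fixed action by the model's prediction; the row layout (session index, score, lives, timestamp) is the same as in the cell above.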


**1.5 Analysing the agent's performance**: read the CSV file and calculate the average score and average number of lives for the agent trained with PPO (100K steps). 

In [15]:
ds = np.loadtxt("scores/score_example.csv", delimiter=',', usecols=(1,2))
# usecols=(1,2) keeps only score and lives, so ds[:, 0] is the score and ds[:, 1] is the lives
avg_score = np.round(np.average(ds[:, 0]), 2)
avg_lives = np.round(np.average(ds[:, 1]), 2)
print("Avg score: %1.2f, Avg lives: %1.2f" % (avg_score, avg_lives))

Avg score: 0.00, Avg lives: 5.00
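
Because the CSV file has no header row, it is easy to mix up column indices. A small sketch using only the standard library is shown below; it names the columns explicitly, following the row layout written in 1.4 (session index, score, lives, timestamp). The function name summarize and the sample rows are hypothetical and are not recorded results.

```python
import csv
import io
from statistics import mean

# Column names follow the order in which section 1.4 writes each row
COLUMNS = ["session", "score", "lives", "timestamp"]


def summarize(csv_file):
    """Hypothetical helper: average score and lives from a header-less CSV."""
    rows = list(csv.DictReader(csv_file, fieldnames=COLUMNS))
    avg_score = mean(float(r["score"]) for r in rows)
    avg_lives = mean(float(r["lives"]) for r in rows)
    return round(avg_score, 2), round(avg_lives, 2)


# Made-up sample data, used only to exercise the function
sample = io.StringIO(
    "0,10,5,2022-01-01 10:00:00\n"
    "1,20,3,2022-01-01 10:01:00\n"
)
avg_score, avg_lives = summarize(sample)
print("Avg score: %1.2f, Avg lives: %1.2f" % (avg_score, avg_lives))
```

To apply it to the recorded sessions, the StringIO object would be replaced by the opened scores/score_example.csv file.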


In the actual project, this process of training/testing was repeated for each combination of algorithm/steps/reward function. You can see more details on how this was done in practice in the file: X