# Learning to Play Pong

This notebook uses MushroomRL (https://mushroomrl.readthedocs.io/en/latest/index.html) and the `Pong` class from `pong.py` to run an experiment in which an agent learns to play a simplified version of Pong.

The simplified Pong consists of a ball bouncing in a 2-D box (no gravity). The left, right and top sides of the box are present and reflect the ball if it hits them, but the bottom side of the box is missing. If the ball moves toward the bottom of the box, the agent has to move a short paddle along the bottom of the box to reflect the ball, otherwise the ball falls out of the box and the game is lost. When not being used to reflect the ball, the paddle should return to the bottom left.
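
The ball dynamics described above can be sketched in a few lines. This is only an illustration, not the code in `pong.py`: the unit box, the time step `dt`, the paddle width, and the helper names `step_ball` and `paddle_covers` are all assumptions made for the example.

```python
def step_ball(x, y, vx, vy, dt=0.1):
    """Advance the ball one time step in an assumed unit box [0, 1] x [0, 1].

    Hypothetical helper: the ball reflects off the left, right and top
    walls. The bottom is open, so the last return value reports whether
    the ball has dropped below y = 0.
    """
    x, y = x + vx * dt, y + vy * dt
    if x < 0.0:            # left wall: mirror position and flip vx
        x, vx = -x, -vx
    elif x > 1.0:          # right wall
        x, vx = 2.0 - x, -vx
    if y > 1.0:            # top wall
        y, vy = 2.0 - y, -vy
    fell_out = y < 0.0     # no bottom wall to stop the ball
    return x, y, vx, vy, fell_out


def paddle_covers(x, paddle_left, paddle_width=0.2):
    """Return True if a paddle starting at paddle_left spans position x."""
    return paddle_left <= x <= paddle_left + paddle_width
```

In this sketch the game would be lost when `fell_out` is true and `paddle_covers` is false for the ball's `x` position.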

Originally I was planning to use manim to visualise the experiments, but mushroomrl has a utility that uses pygame to visualise experiment runs, so I decided to try that instead. The original mushroomrl code had a bug which caused the visualisation to hang, so a corrected version is included in pong.py. The window in which the visualisation runs often appears behind other windows, so you might need to look for it!

In [1]:
%load_ext autoreload
%autoreload 2

# Import all the things

import numpy as np
import random
from sklearn.ensemble import ExtraTreesRegressor

from mushroom_rl.algorithms.value import FQI
from mushroom_rl.core import Core
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.utils.dataset import compute_J
from mushroom_rl.utils.parameters import Parameter

from pong import Pong

We begin by creating the Markov Decision Process (MDP), which defines how the environment responds to each action the agent takes.

In [2]:
mdp = Pong()

Now we define the agent.

- First we create the policy to be followed by the agent. Here we use an epsilon-greedy policy with `epsilon = 1`, so every action is chosen at random while the training data are collected.

- Next, the approximator for the action-value function and its required parameters are created; here we use the `ExtraTreesRegressor` class of scikit-learn.

- Finally, the agent is created by calling the algorithm class and providing the approximator and the policy, together with the parameters used by the algorithm. Here we use Fitted Q-Iteration (FQI).

In [3]:
# Define the agent

# Policy
epsilon = Parameter(value=1.)
pi = EpsGreedy(epsilon=epsilon)

# Approximator
approximator_params = dict(input_shape=mdp.info.observation_space.shape,
                           n_actions=mdp.info.action_space.n,
                           n_estimators=50,
                           min_samples_split=5,
                           min_samples_leaf=2)
approximator = ExtraTreesRegressor

# Agent
agent = FQI(mdp.info, pi, approximator, n_iterations=20,
            approximator_params=approximator_params)
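
To make the role of `epsilon` in the policy above concrete, here is a minimal sketch of epsilon-greedy action selection. It is not the `EpsGreedy` implementation from mushroom_rl; the helper `eps_greedy_action` and its arguments are assumptions made for illustration.

```python
import random


def eps_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values is a list holding one estimated value per action
    (hypothetical input; mushroom_rl obtains these from the approximator).
    """
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return rng.randrange(len(q_values))
    # Exploit: take the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon = 1` the first branch is always taken, so the behaviour is fully random; with `epsilon = 0` the function always returns the highest-valued action.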

Next we create a `Core` object, which provides the methods to learn in the MDP and to evaluate the learned policy.

In [4]:
core = Core(agent, mdp)

Now we're ready to learn! Here we train the agent. With `n_episodes=400` and `n_episodes_per_fit=400`, 400 episodes are collected under the random policy and the agent is then fitted once on that dataset, running the 20 FQI iterations configured above. These were the smallest settings that yielded good learning for this problem.

In [5]:
core.learn(n_episodes=400, n_episodes_per_fit=400)

 88%|███████████████████████████████████▎    | 353/400 [00:00<00:00, 955.76it/s]
  0%|                                                    | 0/20 [00:00<?, ?it/s]
  5%|██▏                                         | 1/20 [00:00<00:03,  5.43it/s]
 10%|████▍                                       | 2/20 [00:00<00:05,  3.38it/s]
 15%|██████▌                                     | 3/20 [00:01<00:07,  2.24it/s]
 20%|████████▊                                   | 4/20 [00:01<00:09,  1.71it/s]
 25%|███████████                                 | 5/20 [00:02<00:10,  1.49it/s]
 30%|█████████████▏                              | 6/20 [00:03<00:10,  1.38it/s]
 35%|███████████████▍                            | 7/20 [00:04<00:09,  1.32it/s]
 40%|█████████████████▌                          | 8/20 [00:05<00:09,  1.29it/s]
 45%|███████████████████▊                        | 9/20 [00:06<00:08,  1.27it/s]
 50%|█████████████████████▌                     | 10/20 [00:06<00:07,  1.26it/s
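
The fitting stage that the second progress bar tracks can be sketched as follows. This is a simplified illustration of Fitted Q-Iteration, not the mushroom_rl implementation: the regressor is swapped for a plain lookup table that averages targets (`fit_table`, a stand-in for `ExtraTreesRegressor`), and the transition format is an assumption.

```python
from collections import defaultdict


def fit_table(inputs, targets):
    """Stand-in regressor: average the targets seen for each (state, action)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, y in zip(inputs, targets):
        sums[key] += y
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def fitted_q_iteration(transitions, n_actions, gamma, n_iterations):
    """Run FQI on a fixed batch of (state, action, reward, next_state, done)."""
    q = {}  # empty table: Q is treated as zero before the first fit
    inputs = [(s, a) for s, a, _, _, _ in transitions]
    for _ in range(n_iterations):
        targets = []
        for s, a, r, s_next, done in transitions:
            # Bootstrapped target: reward plus discounted best next value.
            best_next = 0.0 if done else max(
                q.get((s_next, b), 0.0) for b in range(n_actions))
            targets.append(r + gamma * best_next)
        # Refit the approximator on the same inputs with the new targets.
        q = fit_table(inputs, targets)
    return q
```

The key point is that the dataset stays fixed while the regression targets are recomputed at every iteration from the previous fit.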

And now we evaluate the agent's learning. We measure the agent's performance by the rewards collected during an evaluation run with randomly assigned starting position and velocity. With `epsilon = 0` the policy is purely greedy; it is applied starting from the provided initial state, and the cumulative discounted reward of the run is then displayed.

If `render=True` is passed to `core.evaluate`, a pygame window displays the game. As noted earlier, this window may well open behind your browser, so you might not see it unless you look for it.

In [6]:
pi.set_epsilon(Parameter(0.))
x_start = random.uniform(0, 1)
y_start = random.uniform(0, 1)
vx_start = random.uniform(-1, 1)
vy_start = random.uniform(-1, 1)
initial_state = np.array([[x_start, y_start, vx_start, vy_start]])
dataset = core.evaluate(initial_states=initial_state, render=True)

print("The average cumulative discounted reward is", compute_J(dataset, gamma=mdp.info.gamma))

                                                                                

The average cumulative discounted reward is [2.011039317821579]
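
For reference, the cumulative discounted reward of a single episode is the sum of `gamma**t * r_t` over its time steps. The sketch below computes it for a plain list of rewards; `discounted_return` is a hypothetical helper written for illustration, not part of mushroom_rl.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For example, three rewards of 1 with `gamma = 0.5` give `1 + 0.5 + 0.25 = 1.75`.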


