# Teaching AI to play Atari Pong using reinforcement learning

First, we run the shell commands that set up an environment for reinforcement learning.

In [None]:
!pip uninstall -y gym gymnasium ale-py
!pip install gymnasium[atari,accept-rom-license] ale-py


# Install Stable-Baselines3 for Reinforcement Learning
!pip install stable-baselines3[extra]

# Install virtual display for rendering in Colab
!apt-get install -y xvfb
!pip install pyvirtualdisplay

Found existing installation: gym 0.25.2
Uninstalling gym-0.25.2:
  Successfully uninstalled gym-0.25.2
Collecting ale-py
  Downloading ale_py-0.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.6 kB)
Collecting gymnasium[accept-rom-license,atari]
  Downloading gymnasium-1.0.0-py3-none-any.whl.metadata (9.5 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium[accept-rom-license,atari])
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Downloading ale_py-0.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Downloading gymnasium-1.0.0-py3-none-any.whl (958 kB)
Installing collected packages: far

Check the ale-py version to confirm that the installed release can run an Atari Pong environment.

In [None]:
import ale_py
print(f"ale-py version: {ale_py.__version__}")

ale-py version: 0.10.1


Next, we list the registered Pong environments and choose the version to use for this project.

In [None]:
import gymnasium as gym

# List all environments containing "Pong"
print([env_spec.id for env_spec in gym.envs.registry.values() if "Pong" in env_spec.id])

['Pong-v0', 'PongDeterministic-v0', 'PongNoFrameskip-v0', 'Pong-v4', 'PongDeterministic-v4', 'PongNoFrameskip-v4', 'Pong-ram-v0', 'Pong-ramDeterministic-v0', 'Pong-ramNoFrameskip-v0', 'Pong-ram-v4', 'Pong-ramDeterministic-v4', 'Pong-ramNoFrameskip-v4', 'ALE/Pong-v5', 'ALE/Pong-ram-v5']


This code snippet initializes and starts a virtual display using the pyvirtualdisplay library.
This is necessary because the Google Colab environment does not have a graphical interface.

* visible=0: Specifies that the display should run in headless mode (not physically visible).

* size=(1400, 900): Sets the resolution of the virtual display to 1400x900 pixels.

In [None]:
from pyvirtualdisplay import Display

# Start a virtual display
display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7a4780463010>

Now we will initialize the Pong environment using the gymnasium library.

In [None]:
import gymnasium as gym

# Create Pong environment
env = gym.make("ALE/Pong-v5")

# Test the environment (gymnasium's reset() returns an (observation, info) pair)
obs, info = env.reset()
print("Environment initialized successfully!")

Environment initialized successfully!


This code sets up the Pong environment with frame stacking and parallel execution using tools from Stable-Baselines3, making it suitable for reinforcement learning.

* Imports the Proximal Policy Optimization (PPO) algorithm, a popular RL algorithm for training agents.

* Imports VecFrameStack, which stacks multiple consecutive frames together to help the agent understand motion.

* Imports a utility function to create and wrap Atari environments for parallel execution.

* Creates the Pong environment using the ALE/Pong-v5 identifier.

  * n_envs=1: Sets the number of parallel environments. A single environment is used here; larger values collect experience from several game instances at once.

  * seed=0: Sets a random seed for reproducibility.

* Wraps the environment to stack the last 4 frames along the channel dimension of the observation.

* Stacked frames help the agent understand the dynamics of the environment.

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.env_util import make_atari_env

# Wrap the environment for parallel execution and frame stacking
env = make_atari_env("ALE/Pong-v5", n_envs=1, seed=0)

# Stack 4 frames for motion understanding
env = VecFrameStack(env, n_stack=4)
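
To make the idea of frame stacking concrete, here is a minimal standard-library sketch. It is not how VecFrameStack is implemented: the `FrameStack` class is hypothetical, each "frame" is a single number standing in for an image, and the choice to pad the first observation by repeating the first frame is an assumption (the real wrapper may pad differently).

```python
from collections import deque


class FrameStack:
    """Toy frame stacker (hypothetical): keeps the most recent n_stack frames."""

    def __init__(self, n_stack):
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # Assumption: pad by repeating the first frame so the stack is always full
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(new_frame)
        return list(self.frames)


# Each "frame" is just the ball's x position, standing in for a full image
stack = FrameStack(n_stack=4)
obs = stack.reset(10)
for ball_x in (12, 14, 16):
    obs = stack.step(ball_x)

# A single frame cannot reveal velocity, but two stacked frames can
velocity = obs[-1] - obs[-2]
print(obs, velocity)
```

A single Pong frame shows where the ball is but not where it is heading; the stacked observation carries enough history for the policy to infer direction and speed.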

# Training Model

This code snippet initializes and trains a PPO (Proximal Policy Optimization) model on the Pong environment and saves the trained model for later use.

* "CnnPolicy": Specifies the use of a Convolutional Neural Network (CNN) for processing image-based observations.

* env: The wrapped Atari Pong environment.

* verbose=1: Enables verbose output during training for monitoring progress.

* tensorboard_log="./ppo_pong_tensorboard/": Specifies the directory for storing logs that can be visualized using TensorBoard.

* model.learn(total_timesteps=50000): Trains the model for 50000 timesteps.

  * The number of timesteps can be increased for better model performance.

* model.save("ppo_pong"): Saves the trained model to a file named ppo_pong.zip for later use.

In [None]:
# Create the PPO model
model = PPO("CnnPolicy", env, verbose=1, tensorboard_log="./ppo_pong_tensorboard/")

# Train the model (adjust timesteps as needed for better performance)
model.learn(total_timesteps=50000)

# Save the trained model
model.save("ppo_pong")

Using cuda device
Wrapping the env in a VecTransposeImage.
Logging to ./ppo_pong_tensorboard/PPO_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 904      |
|    ep_rew_mean     | -20.6    |
| time/              |          |
|    fps             | 172      |
|    iterations      | 1        |
|    time_elapsed    | 11       |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 873         |
|    ep_rew_mean          | -20.7       |
| time/                   |             |
|    fps                  | 164         |
|    iterations           | 2           |
|    time_elapsed         | 24          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009303406 |
|    clip_fraction        | 0.106       |
|    clip_range           | 0.2         |
|    entropy_lo

**Observations**

* Rewards: The agent's mean episode reward (ep_rew_mean) stays around -20.6, close to the minimum of -21 for a game of Pong. The agent is still exploring and has not yet learned to win points consistently.

* Explained Variance: Values nearing 0.98 suggest that the value function is accurately estimating returns, which is a good sign.

* Loss Values: Both policy and value losses are decreasing steadily, indicating stable optimization.

* KL Divergence: Values like 0.01 and 0.008 are within a safe range, showing that the updates are not too aggressive.

* To see significant improvement, the number of timesteps should be increased from 50000 to 1 to 2 million.
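
As a rough illustration of the explained variance metric mentioned above, the sketch below uses the usual definition, 1 - Var(returns - predictions) / Var(returns). The function name and the numbers are made up for illustration and are not taken from the training run.

```python
from statistics import pvariance


def explained_variance(predicted, actual):
    """1 - Var(actual - predicted) / Var(actual); 1.0 means a perfect fit."""
    var_actual = pvariance(actual)
    if var_actual == 0:
        return float("nan")  # undefined when the returns never vary
    residuals = [a - p for a, p in zip(actual, predicted)]
    return 1 - pvariance(residuals) / var_actual


# Made-up returns and two hypothetical sets of value predictions
returns = [-1.0, -0.5, 0.0, 0.5, 1.0]
good_values = [-0.9, -0.6, 0.1, 0.4, 1.0]
constant_values = [0.0] * 5

good = explained_variance(good_values, returns)
flat = explained_variance(constant_values, returns)
print(round(good, 3), flat)
```

Note that a high explained variance only says the value function predicts returns well. It is compatible with a mean reward near -21, because the value function may simply have learned to predict that the agent keeps losing.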

This code snippet will load our trained model and test its performance by letting it play for 1000 steps.

In [None]:
# Load the trained model
model = PPO.load("ppo_pong")

# Reset the environment
obs = env.reset()

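# Let the agent play for 1000 steps
for _ in range(1000):
    # predict() returns the action and a recurrent state (unused here)
    action, _ = model.predict(obs)
    # A vectorized env returns four values and resets finished episodes itself
    obs, rewards, dones, info = env.step(action)
    # Renders only if the environment was created with a render_mode
    env.render()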
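
The loop above discards the rewards it receives. A minimal sketch of turning a reward stream into a Pong score follows, assuming the usual Pong convention of +1 when the agent wins a point and -1 when the opponent does. The `tally_score` helper and the reward list are hypothetical; with a vectorized environment you would collect `rewards[0]` at each step.

```python
def tally_score(rewards):
    """Count points for each side from a sequence of per-step rewards."""
    agent = sum(1 for r in rewards if r > 0)      # positive reward: agent scored
    opponent = sum(1 for r in rewards if r < 0)   # negative reward: opponent scored
    return agent, opponent


# Made-up reward stream standing in for values collected during play
rewards = [0, 0, -1, 0, 0, 0, 1, 0, -1, -1, 0]
agent_points, opponent_points = tally_score(rewards)
print(f"agent {agent_points} : {opponent_points} opponent")
```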

# Recording Pong environment

* Creates a Pong environment with the render_mode set to "rgb_array", enabling the recording of frames as image data.

* Wraps the environment with the RecordVideo wrapper to record gameplay.

  * video_folder="./videos": Specifies the folder to save the recorded videos.

  * episode_trigger=lambda x: True: Ensures that every episode is recorded.

* Resets the environment and runs for up to 1000 steps, taking random actions. These can be replaced with the trained model's predictions, provided the observations go through the same preprocessing and frame stacking used during training.

  * Stops recording when the episode ends (dones or truncated is True).

* Closes the environment and ensures the video file is finalized.

* A helper function to embed and display the recorded video in a Jupyter notebook.

  * Reads the video file, encodes it in base64, and creates an HTML video tag to render it inline.


* Displays the first recorded episode (saved as rl-video-episode-0.mp4).

In [None]:
from gymnasium.wrappers import RecordVideo
import gymnasium as gym
import io
import base64
from IPython.display import HTML

# Create the base environment with the correct render_mode
record_env = gym.make("ALE/Pong-v5", render_mode="rgb_array")

# Wrap the environment with RecordVideo
record_env = RecordVideo(record_env, video_folder="./videos", episode_trigger=lambda x: True)

# Record a single episode
obs, info = record_env.reset()
for _ in range(1000):
    action = record_env.action_space.sample()  # Random action (replace with model prediction if desired)
    obs, rewards, dones, truncated, info = record_env.step(action)
    if dones or truncated:
        break

record_env.close()

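# Function to display the video inline in the notebook
def show_video(video_path):
    # Open read-only in binary mode; the with-block closes the file afterwards
    with io.open(video_path, 'rb') as video_file:
        video = video_file.read()
    # Base64-encode the bytes so they can be embedded in a data: URL
    encoded = base64.b64encode(video)
    return HTML(data=f"""
        <video width="640" height="480" controls>
            <source src="data:video/mp4;base64,{encoded.decode('ascii')}" type="video/mp4">
        </video>
    """)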

# Display the video
show_video("./videos/rl-video-episode-0.mp4")

Exception ignored in: <function RecordVideo.__del__ at 0x7a47701fbeb0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gymnasium/wrappers/rendering.py", line 415, in __del__
    if len(self.recorded_frames) > 0:
AttributeError: 'RecordVideo' object has no attribute 'recorded_frames'
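
The display call above hard-codes the file name rl-video-episode-0.mp4. As a hedged alternative, the sketch below picks the most recently modified .mp4 in a folder and builds the same kind of HTML tag from raw bytes. Both `latest_video` and `video_tag` are hypothetical helpers introduced for illustration, not part of gymnasium.

```python
import base64
from pathlib import Path


def latest_video(folder):
    """Return the most recently modified .mp4 in folder, or None if there is none."""
    videos = sorted(Path(folder).glob("*.mp4"), key=lambda p: p.stat().st_mtime)
    return videos[-1] if videos else None


def video_tag(video_bytes, width=640, height=480):
    """Build an HTML <video> tag that embeds the bytes as a base64 data URL."""
    encoded = base64.b64encode(video_bytes).decode("ascii")
    return (
        f'<video width="{width}" height="{height}" controls>'
        f'<source src="data:video/mp4;base64,{encoded}" type="video/mp4">'
        "</video>"
    )


# Placeholder bytes; a real call would pass the contents of an .mp4 file
tag = video_tag(b"not a real video")
print(tag[:40])
```

In a notebook, the result could be shown with `HTML(video_tag(path.read_bytes()))` once `latest_video("./videos")` returns a path.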


**Observations**

A 2-20 loss suggests that the model has not yet learned effective strategies to compete with the Atari CPU in Pong. This is common in early-stage training, especially when the model has been trained for relatively few timesteps.



**Possible improvement strategies**

1) Increase Training Time:
Train the model for more timesteps (e.g., 1 million or more). Pong generally requires significant training to perform well.

2) Evaluate and Tune Hyperparameters:
  * Learning Rate (learning_rate): Try a lower learning rate such as 1e-4 or 5e-5 (the Stable-Baselines3 PPO default is 3e-4).
  * Batch Size (batch_size): 64 is already the Stable-Baselines3 PPO default; try 128 or 256 for more stable gradients.
  * Number of Environments (n_envs): Increase to 8 or more to collect experience faster and from more varied game states.
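
To get a feel for what a longer training run costs, here is a back-of-the-envelope sketch. It takes the 2048 timesteps per iteration and roughly 172 frames per second from the training log above, and assumes that throughput stays constant, which will not hold exactly (in particular, changing n_envs changes it). The `training_budget` function is a hypothetical helper.

```python
import math


def training_budget(total_timesteps, n_steps=2048, n_envs=1, fps=172):
    """Estimate iterations, timesteps actually collected, and wall-clock minutes."""
    rollout_size = n_steps * n_envs               # timesteps gathered per iteration
    iterations = math.ceil(total_timesteps / rollout_size)
    collected = iterations * rollout_size         # training proceeds in whole rollouts
    minutes = collected / fps / 60                # assumes constant throughput
    return iterations, collected, minutes


for budget in (50_000, 1_000_000):
    iterations, collected, minutes = training_budget(budget)
    print(f"{budget}: {iterations} iterations, {collected} timesteps, ~{minutes:.0f} min")
```

Under these assumptions, the 50000-timestep run amounts to 25 iterations, while 1 million timesteps would take on the order of an hour and a half on the same hardware.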