# Task

We will train a PPO agent that learns to play the classic Super Mario Bros. game.

You can use the Stable-Baselines3 implementation of PPO or write your own version.
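If you decide to write your own version, the core piece is PPO's clipped surrogate objective. The block below is only a minimal sketch of that loss in plain Python, assuming per-sample log-probabilities and advantages are already available as lists; the function name `ppo_clip_loss` and its arguments are hypothetical and are not part of Stable-Baselines3.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_range=0.1):
    """Mean clipped surrogate loss over a batch (a quantity to minimize).

    Hypothetical helper for illustration only, not a library function.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_range, 1 + clip_range].
        clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # Take the more pessimistic of the clipped and unclipped objectives.
        total += min(ratio * adv, clipped_ratio * adv)
    # Negate because optimizers minimize, while the objective is maximized.
    return -total / len(advantages)
```

A full implementation would also need a value loss, an entropy bonus, and advantage estimation; this sketch covers the policy term only.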

For the environment, we will use gym_super_mario_bros. Read more about it [here](https://github.com/Kautenja/gym-super-mario-bros/).

Note that the stable-baselines3 implementations expect a gymnasium environment, not a gym environment. (gymnasium is the maintained successor to gym; gym is deprecated, but many environments are still written for it.)

Fortunately, gymnasium has a way to resolve that issue and convert a gym env to a gymnasium env. We do need to install a compatible version of gym though.
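To make the difference concrete: the old gym API returns only an observation from `reset()` and a 4-tuple from `step()`, while gymnasium returns `(obs, info)` from `reset()` and a 5-tuple with separate `terminated` and `truncated` flags from `step()`. The sketch below illustrates the kind of translation a compatibility layer performs, using a toy environment. Both classes are hypothetical and written for this example; they are not the actual gymnasium conversion code.

```python
class OldStyleEnv:
    """Toy environment using the old gym API (hypothetical, for illustration)."""

    def __init__(self, episode_length=3):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # old API: observation only

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, 1.0, done, {}  # old API: (obs, reward, done, info)


class NewStyleAdapter:
    """Exposes an old-style env through gymnasium-style signatures (sketch)."""

    def __init__(self, env):
        self.env = env

    def reset(self, seed=None, options=None):
        # seed/options are accepted for signature compatibility; the toy env ignores them.
        return self.env.reset(), {}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Old-style time limits flag truncation through this info key.
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = done and not truncated
        return obs, reward, terminated, truncated, info
```

For example, `NewStyleAdapter(OldStyleEnv(2))` reports `terminated=True` on its second step, where the wrapped environment reported `done=True`.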

In [6]:
%pip install swig
# Quote the extras so that shells such as zsh do not treat the square brackets as a glob pattern.
%pip install stable-baselines3 "gymnasium[all]" gym_super_mario_bros nes_py gym==0.10.9  # might need a restart of the session.


In [7]:
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML


def frames_to_video(frames, fps=24):
    """Render a list of image arrays as an inline HTML5 video (requires ffmpeg)."""
    fig = plt.figure(figsize=(frames[0].shape[1] / 100, frames[0].shape[0] / 100), dpi=100)
    ax = plt.axes()
    ax.set_axis_off()

    # The colormap is chosen once here, when the image artist is created.
    if len(frames[0].shape) == 2:  # Grayscale image
        im = ax.imshow(frames[0], cmap='gray')
    else:  # Color image
        im = ax.imshow(frames[0])

    def init():
        im.set_data(frames[0])
        return im,

    def update(frame):
        # set_data() takes only the image array, so no cmap argument is passed.
        im.set_data(frames[frame])
        return im,

    interval = 1000 / fps  # milliseconds between frames
    anim = FuncAnimation(fig, update, frames=len(frames), init_func=init, blit=True, interval=interval)
    plt.close()
    return HTML(anim.to_html5_video())

## Making the environment

In addition to creating the gym environment, we will make a vectorized environment (provided by Stable-Baselines3).

This lets us train over multiple environments at once, which speeds up training. We will use DummyVecEnv, which steps the environments sequentially in a single process rather than in subprocesses; for a more complex environment with a higher compute time per step, we could use SubprocVecEnv instead.
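As a rough illustration of what a single-process vectorized environment does, the sketch below steps several toy environments one after another, batches the results, and resets any environment whose episode has ended. `CountdownEnv` and `SequentialVecEnv` are hypothetical names made up for this example. The real DummyVecEnv works with NumPy arrays and has more features, so treat this as a conceptual model only.

```python
class CountdownEnv:
    """Toy env (hypothetical): the episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, float(action), done, {}


class SequentialVecEnv:
    """Steps several envs in one process and batches the results (sketch)."""

    def __init__(self, env_fns):
        # Each entry is a zero-argument function that builds one environment.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs_batch, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            if done:
                # Keep the final observation, then start a new episode right away.
                info = dict(info, terminal_observation=obs)
                obs = env.reset()
            obs_batch.append(obs)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return obs_batch, rewards, dones, infos
```

Because finished environments are reset inside `step`, the training loop never has to call `reset` itself after the initial call.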

Think about what wrappers you can use to make the job easier. You can also make the action-space simpler. Read more about it in the env page referenced above.
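One wrapper that is often used for games like this is frame skipping: the same action is repeated for a few frames and the rewards are summed, so the agent makes fewer decisions per second of gameplay. The sketch below shows the idea on a toy environment with the old 4-tuple `step` API. `TickEnv` and `SkipFrame` are hypothetical names for this example, and the skip value of 4 is an assumption, not a recommendation from the environment's documentation.

```python
class TickEnv:
    """Toy env (hypothetical): reward 1.0 per frame, ends after `length` frames."""

    def __init__(self, length=10):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, 1.0, done, {}


class SkipFrame:
    """Repeats each action for up to `skip` frames and sums the rewards (sketch)."""

    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                # Stop early so we never step an environment that has finished.
                break
        return obs, total_reward, done, info
```

Other wrappers in the same spirit include converting frames to grayscale and downscaling them, both of which shrink the observation the CNN has to process.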

Use the `'SuperMarioBros-v0'` version of the environment.

In [8]:
# code here

# Import necessary libraries
import os

import imageio
import numpy as np
import matplotlib.pyplot as plt
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.results_plotter import load_results, ts2xy
from stable_baselines3.common.vec_env import VecFrameStack

# Create a folder to save logs
log_dir = "tmp/"
os.makedirs(log_dir, exist_ok=True)

# Create the Super Mario Bros environment and wrap it in a Monitor wrapper
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, SIMPLE_MOVEMENT)
env = Monitor(env, log_dir)

# Stack frames for better temporal representation
env = VecFrameStack(make_vec_env(lambda: env, n_envs=1), n_stack=4)

# Set up the hyperparameters
ppo_hyperparams = {
    'learning_rate': 2.5e-4,
    'n_steps': 128,
    'batch_size': 128,  # at most n_steps * n_envs (128 here), the size of one rollout
    'n_epochs': 4,
    'gamma': 0.99,
    'gae_lambda': 0.95,
    'clip_range': 0.1,
    'ent_coef': 0.01,
    'vf_coef': 0.5,
    'max_grad_norm': 0.5,
    # Pass a dict directly; the list-of-dicts form is deprecated since SB3 v1.8.0.
    'policy_kwargs': dict(net_arch=dict(pi=[256, 256], vf=[256, 256]))
}

# Initialize the PPO model with the environment and custom hyperparameters
model = PPO("CnnPolicy", env, verbose=1, **ppo_hyperparams)

# Train the model
model.learn(total_timesteps=1000000, log_interval=10)

# Save the trained model
model.save("ppo_mario")

# Function to plot training progress
def plot_results(log_folder):
    x, y = ts2xy(load_results(log_folder), 'timesteps')
    plt.figure(figsize=(10, 5))
    plt.plot(x, y)
    plt.xlabel('Timesteps')
    plt.ylabel('Rewards')
    plt.title('Training Progress')
    plt.show()

# Plot the training progress
plot_results(log_dir)

# Delete the model to demonstrate loading
del model

# Load the saved model
model = PPO.load("ppo_mario")

# Reset the environment
obs = env.reset()

# Create a list to store frames
frames = []

# Run the model in the environment
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    
    # Capture the current frame
    frame = env.render(mode='rgb_array')
    frames.append(frame)
    
    # No manual reset is needed: the vectorized env resets itself when an episode ends.

# Save frames as a video
video_filename = 'mario_ppo.mp4'
imageio.mimsave(video_filename, [np.array(frame) for frame in frames], fps=30)


Using cpu device
Wrapping the env in a VecTransposeImage.
------------------------------------------
| time/                   |              |
|    fps                  | 14           |
|    iterations           | 10           |
|    time_elapsed         | 90           |
|    total_timesteps      | 1280         |
| train/                  |              |
|    approx_kl            | 0.0007551359 |
|    clip_fraction        | 0            |
|    clip_range           | 0.1          |
|    entropy_loss         | -1.92        |
|    explained_variance   | 0.00158      |
|    learning_rate        | 0.00025      |
|    loss                 | 0.123        |
|    n_updates            | 36           |
|    policy_gradient_loss | 4.69e-05     |
|    value_loss           | 1.01         |
------------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 13          |
|    iterations           | 20          |


## Creating and training the model

In [None]:
# code here

## Visualizing the results

In [None]:
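# get_vec_env is not defined anywhere in this notebook; define it first (for example
# in the model-creation cell above) so that it returns the vectorized Mario environment.
t_env = get_vec_env(render_mode="rgb_array")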

state = t_env.reset()
frames = []

while True:
    action, _ = model.predict(state)
    state_next, r, done, info = t_env.step(action)
    state = state_next.copy()
    frames.append(t_env.render())
    if done:
        break
    if len(frames) > 5000:  # Limit the video length in case Mario gets stuck with an untrained model; can be removed.
        break

t_env.close()

NameError: name 'get_vec_env' is not defined

In [None]:
frames_to_video(frames, fps=60)