# Task

We will train a PPO agent that learns to play the classic Super Mario Bros. game.

You can use the stable-baselines3 implementation of PPO or write your own version.

For the env, we will use gym_super_mario_bros. Read more about it [here](https://github.com/Kautenja/gym-super-mario-bros/).

Note that the stable-baselines3 implementations expect a gymnasium environment rather than a gym environment. (gymnasium is the maintained successor to gym. gym is deprecated, but many environments are still written for it.)

Fortunately, gymnasium provides a compatibility wrapper that converts a gym env to a gymnasium env. We still need to install a compatible version of gym, though.
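To make the difference between the two APIs concrete, here is a minimal pure-Python sketch of what such a conversion has to do. `OldStyleEnv` and `NewStyleAdapter` are hypothetical names invented for this illustration, and the adapter is a simplification, not gymnasium's actual code: the old API returns a bare observation from `reset()` and a 4-tuple with a single `done` flag from `step()`, while the new API returns `(obs, info)` and a 5-tuple with separate `terminated` and `truncated` flags.

```python
class OldStyleEnv:
    """Toy environment using the legacy gym API (hypothetical, for illustration)."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        # Legacy API: reset() returns only the observation.
        self.t = 0
        return self.t

    def step(self, action):
        # Legacy API: step() returns a 4-tuple with a single `done` flag.
        self.t += 1
        done = self.t >= self.max_steps
        return self.t, 1.0, done, {}


class NewStyleAdapter:
    """Simplified old-to-new API adapter (a sketch, not gymnasium's implementation)."""

    def __init__(self, env):
        self.env = env

    def reset(self, seed=None, options=None):
        # New API: reset() returns (observation, info).
        return self.env.reset(), {}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Assumption: a time limit, if any, is reported through this info key.
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = done and not truncated
        return obs, reward, terminated, truncated, info


env = NewStyleAdapter(OldStyleEnv(max_steps=3))
obs, info = env.reset()
steps = 0
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(0)
    steps += 1
print(steps, terminated, truncated)  # 3 True False
```

In the notebook itself we do not need to write this by hand; the sketch only shows the shape of the two interfaces.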

In [None]:
%pip install swig
%pip install stable-baselines3 gymnasium[all] gym_super_mario_bros nes_py gym==0.10.9  # might need a restart of the session.



In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

from gymnasium.wrappers import GrayScaleObservation, EnvCompatibility
import gymnasium as gym
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

In [None]:
def frames_to_video(frames, fps=24):
    fig = plt.figure(figsize=(frames[0].shape[1] / 100, frames[0].shape[0] / 100), dpi=100)
    ax = plt.axes()
    ax.set_axis_off()

    if len(frames[0].shape) == 2:  # Grayscale image
        im = ax.imshow(frames[0], cmap='gray')
    else:  # Color image
        im = ax.imshow(frames[0])

    # set_data() only takes the image array; the colormap was already fixed by imshow() above.
    def init():
        im.set_data(frames[0])
        return im,

    def update(frame):
        im.set_data(frames[frame])
        return im,

    interval = 1000 / fps
    anim = FuncAnimation(fig, update, frames=len(frames), init_func=init, blit=True, interval=interval)
    plt.close()
    return HTML(anim.to_html5_video())

## Making the environment and training the model

In addition to creating the gym environment, we will wrap it in a vectorized environment (provided by stable-baselines3).

A vectorized environment lets us train on multiple environments at once, which makes training faster. We will use `DummyVecEnv`, which doesn't actually use subprocesses; if we were working with a complex environment with a higher compute time, we could use `SubprocVecEnv` instead.
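As a rough illustration of the idea, the sketch below steps several toy environments one after another in a single process and resets any environment whose episode has ended, so the batch always has one observation per environment. `CounterEnv` and `SequentialVecEnv` are hypothetical names made up for this example; this is a simplified picture of the pattern, not the stable-baselines3 implementation.

```python
class CounterEnv:
    """Hypothetical toy environment whose episodes last a fixed number of steps."""

    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.episode_length
        return float(self.t), reward, done


class SequentialVecEnv:
    """Steps several environments one after another in the current process."""

    def __init__(self, env_fns):
        # Each entry is a function that builds one environment.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                # Reset finished environments right away so the batch never stalls.
                obs = env.reset()
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


toy_vec_env = SequentialVecEnv([lambda: CounterEnv(2), lambda: CounterEnv(3)])
first_obs = toy_vec_env.reset()
obs, rewards, dones = toy_vec_env.step([1, 0])
obs, rewards, dones = toy_vec_env.step([1, 1])
print(obs, rewards, dones)  # [0.0, 2.0] [1.0, 1.0] [True, False]
```

Because the loop is sequential, this design gains nothing in wall-clock time per step; the benefit is that the agent receives a batch of transitions from several independent episodes at each call.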

Think about what wrappers you can use to make the job easier. You can also make the action-space simpler. Read more about it in the env page referenced above.

Use the `'SuperMarioBros-v0'` version of the environment.
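To see why a simpler action space helps, consider how a short list of named button combinations can stand in for the full controller. The sketch below is only an illustration: `BUTTONS`, `TOY_MOVEMENT`, and `to_controller_byte` are hypothetical names, and the bit values are assumptions chosen for the example rather than values taken from the library. The agent picks an index into the list, and the index is translated into a single controller byte.

```python
# Bitmask for each controller button (values assumed for illustration).
BUTTONS = {
    "NOOP": 0b00000000,
    "A": 0b00000001,
    "B": 0b00000010,
    "left": 0b01000000,
    "right": 0b10000000,
}

# A small, hand-picked list of button combinations (hypothetical).
TOY_MOVEMENT = [
    ["NOOP"],
    ["right"],
    ["right", "A"],
    ["right", "B"],
    ["right", "A", "B"],
    ["A"],
    ["left"],
]


def to_controller_byte(action_index, movement=TOY_MOVEMENT):
    """Translate a discrete action index into one byte of pressed buttons."""
    byte = 0
    for button in movement[action_index]:
        byte |= BUTTONS[button]
    return byte


print(len(TOY_MOVEMENT))           # 7
print(bin(to_controller_byte(2)))  # 0b10000001
```

With a list like this, the policy chooses among 7 actions instead of every possible combination of 8 buttons, which shrinks the space it has to explore.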

In [None]:
from stable_baselines3.ppo import MlpPolicy, CnnPolicy  # PPO policies, since we train with PPO
from gymnasium.spaces import Box, Discrete
import torch

class ConvertGymWrapper(gym.Wrapper):
  """Re-declares the observation and action spaces as gymnasium spaces."""
  def __init__(self, env):
    super().__init__(env)
    self.observation_space = Box(shape = env.observation_space.shape, low = 0, high = 255)
    self.action_space = Discrete(env.action_space.n)

def wrapped_env():
  env = gym_super_mario_bros.make('SuperMarioBros-v0')
  env = JoypadSpace(env, SIMPLE_MOVEMENT)  # restrict the controller to the SIMPLE_MOVEMENT actions
  env = EnvCompatibility(env)  # convert the gym env to a gymnasium env
  env = ConvertGymWrapper(env)
  return env


if __name__ == '__main__':
  vec_env = DummyVecEnv([wrapped_env for _ in range(1)])
  # Use torch to determine the device
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model = PPO("MlpPolicy", vec_env, verbose = 1, device = device)
  model.learn(total_timesteps = 1000000)


Using cpu device




-----------------------------
| time/              |      |
|    fps             | 53   |
|    iterations      | 1    |
|    time_elapsed    | 38   |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 16          |
|    iterations           | 2           |
|    time_elapsed         | 254         |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008612933 |
|    clip_fraction        | 0.0364      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.94       |
|    explained_variance   | 0.00991     |
|    learning_rate        | 0.0003      |
|    loss                 | 105         |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.000832   |
|    value_loss           | 190         |
-----------------------------------------
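A quick sanity check on the log above: stable-baselines3 reports `entropy_loss` as the negative mean policy entropy, and a policy that picks uniformly among n discrete actions has entropy ln(n). Assuming the action set has 7 entries (an assumption about `SIMPLE_MOVEMENT`, not something printed in the log), the small calculation below suggests the policy is still close to uniform this early in training.

```python
import math

n_actions = 7  # assumed size of the discrete action set
uniform_entropy = math.log(n_actions)  # entropy of a uniform policy, in nats
logged_entropy_loss = -1.94  # value copied from the training log

print(round(uniform_entropy, 3))  # 1.946
print(abs(-uniform_entropy - logged_entropy_loss) < 0.01)  # True
```

As training progresses and the policy commits to particular actions, this value would be expected to move toward zero.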

## Visualizing the results

In [None]:
t_env = wrapped_env()

state = t_env.reset()[0]
frames = []

while True:
    action, _ = model.predict(state)
    state_next, _, terminated, truncated, _ = t_env.step(action.item())

    state = state_next.copy()
    frames.append(state)
    # Cap the video length in case Mario gets stuck with an untrained model; this limit can be removed.
    if terminated or truncated or len(frames) > 5000:
        break

t_env.close()





In [None]:
frames_to_video(frames, fps=60)