# Introduction

So far we have considered single-agent environments, but they have their limitations. What if we want an agent to cooperate or compete with other agents? Neither [StableBaselines3](https://stable-baselines3.readthedocs.io/en/master/) nor [gymnasium](https://gymnasium.farama.org/) natively supports this, but we can work around the problem with the [PettingZoo](https://pettingzoo.farama.org/) library. In this notebook we will extend our training options by creating multi-agent environments with PettingZoo and training on them with StableBaselines3.

# PettingZoo API

PettingZoo is a library built on gymnasium that enables multi-agent environments. It provides the AEC (Agent Environment Cycle) API for environments in which agents act one after another, and the Parallel API for simultaneous actions and observations. The library also ships various wrappers that add further functionality. We describe some of them below.

### AEC API

The AEC (Agent Environment Cycle) API can represent any type of game for multi-agent reinforcement learning. Agents act in turns, one after another. In the following example we create a PettingZoo environment for the game of rock paper scissors. There is no model or policy, only two players taking random actions.

In [1]:
from pettingzoo.classic import rps_v2

env = rps_v2.env(render_mode="human")
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()

    if termination or truncation:
        action = None
    else:
        # this is where you would insert your policy
        action = env.action_space(agent).sample()

    env.step(action)
env.close()
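
To make the turn-taking explicit, here is a minimal pure-Python sketch of the same loop shape. ```ToyTurnEnv``` is a hypothetical stand-in written for this illustration, not a PettingZoo class; it only mimics ```agent_iter``` and ```step``` and assumes a fixed round-robin order.

```python
import random


class ToyTurnEnv:
    """Hypothetical stand-in for an AEC environment; not part of PettingZoo."""

    def __init__(self, agents, max_turns):
        self.possible_agents = list(agents)
        self.max_turns = max_turns

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.turn = 0
        self.agent_selection = self.possible_agents[0]
        self.history = []  # (agent, action) pairs in the order they were played

    def agent_iter(self):
        # Keep yielding whichever agent is currently selected until the turn budget is spent.
        while self.turn < self.max_turns:
            yield self.agent_selection

    def step(self, action):
        # Record the move of the selected agent, then hand the turn to the next one.
        self.history.append((self.agent_selection, action))
        self.turn += 1
        self.agent_selection = self.possible_agents[self.turn % len(self.possible_agents)]


env = ToyTurnEnv(["player_0", "player_1"], max_turns=6)
env.reset(seed=42)
for agent in env.agent_iter():
    action = env.rng.choice(["rock", "paper", "scissors"])
    env.step(action)

order = [agent for agent, _ in env.history]
print(order)
```

The point of the sketch is that only one agent acts per ```step``` call, and the iterator always yields the agent whose turn it is.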

In many games, not every move is available to a player at every turn. In that case we need an action mask that marks the actions currently available to a given player. Below is a simple chess example with random moves and masking. This PettingZoo environment returns each observation as a dict containing the environment observation and the action mask. We will show later how to train models with masks. For now, this is how we can pass the mask to the ```sample``` function.

In [2]:
from pettingzoo.classic import chess_v6

env = chess_v6.env(render_mode="human")
env.metadata['render_fps'] = 30
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()

    if termination or truncation:
        action = None
    else:
        # invalid action masking is optional and environment-dependent
        if "action_mask" in info:
            mask = info["action_mask"]
        elif isinstance(observation, dict) and "action_mask" in observation:
            mask = observation["action_mask"]
        else:
            mask = None
        # this is where you would insert your policy
        action = env.action_space(agent).sample(mask)

    env.step(action)
env.close()
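
What masked sampling amounts to can be sketched without any library: keep only the indices the mask marks as legal and choose uniformly among them. The helper name ```sample_legal_action``` and the example mask are assumptions made for this illustration.

```python
import random


def sample_legal_action(mask, rng):
    """Pick uniformly among the indices whose mask entry is non-zero."""
    legal = [index for index, allowed in enumerate(mask) if allowed]
    if not legal:
        raise ValueError("the mask allows no action")
    return rng.choice(legal)


rng = random.Random(0)
mask = [0, 1, 0, 0, 1, 1, 0]  # hypothetical mask: only actions 1, 4 and 5 are legal
samples = [sample_legal_action(mask, rng) for _ in range(1000)]
print(sorted(set(samples)))
```

Actions with a zero in the mask can never be drawn, which is the property a policy needs when illegal moves would end the game.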

### Parallel API

For simultaneous actions and observations we use the alternative Parallel API. Here we obtain the actions of all agents at once, using the ```sample``` function. The example below shows the Pistonball environment, in which agents cooperate to move a ball to the other side.

In [3]:
from pettingzoo.butterfly import pistonball_v6
parallel_env = pistonball_v6.parallel_env(render_mode="human")
observations, infos = parallel_env.reset(seed=42)

while parallel_env.agents:
    # this is where you would insert your policy
    actions = {agent: parallel_env.action_space(agent).sample() for agent in parallel_env.agents}

    observations, rewards, terminations, truncations, infos = parallel_env.step(actions)
parallel_env.close()
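
The contract of the parallel loop (one dict of actions in, per-agent dicts out, and an empty agent list when the episode is over) can be sketched in plain Python. ```ToyParallelEnv``` is a hypothetical stand-in, not a PettingZoo class, and it returns fewer values from ```step``` than the real API does.

```python
class ToyParallelEnv:
    """Hypothetical stand-in for a Parallel environment; not part of PettingZoo."""

    def __init__(self, agents, max_steps):
        self.possible_agents = list(agents)
        self.max_steps = max_steps

    def reset(self):
        self.agents = list(self.possible_agents)
        self.steps = 0
        return {agent: 0 for agent in self.agents}

    def step(self, actions):
        # One call consumes exactly one action from every live agent.
        assert set(actions) == set(self.agents)
        self.steps += 1
        observations = {agent: self.steps for agent in self.agents}
        rewards = {agent: float(actions[agent]) for agent in self.agents}
        truncated = self.steps >= self.max_steps
        truncations = {agent: truncated for agent in self.agents}
        if truncated:
            self.agents = []  # an empty list ends the `while env.agents` loop
        return observations, rewards, truncations


env = ToyParallelEnv(["piston_0", "piston_1", "piston_2"], max_steps=4)
env.reset()
totals = {agent: 0.0 for agent in env.possible_agents}
while env.agents:
    actions = {agent: 1 for agent in env.agents}
    _, rewards, _ = env.step(actions)
    for agent, reward in rewards.items():
        totals[agent] += reward
print(totals)
```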

### Wrappers

PettingZoo features some useful wrappers. We can convert AEC environments to Parallel and the other way around with ```aec_to_parallel``` and ```parallel_to_aec```. Another useful wrapper is ```TerminateIllegalWrapper```, which ends the game and assigns ```illegal_reward``` to an agent that makes an illegal move. For parallel environments, the corresponding base wrapper is ```BaseParallelWrapper```. More wrappers can be found in the PettingZoo documentation. Keep in mind that most native PettingZoo environments are already wrapped with the appropriate wrappers, so there is no need to do this again.

In [4]:
from pettingzoo.utils.conversions import aec_to_parallel
from pettingzoo.butterfly import pistonball_v6
env = pistonball_v6.env()
env = aec_to_parallel(env)

In [5]:
from pettingzoo.utils import parallel_to_aec
from pettingzoo.butterfly import pistonball_v6
env = pistonball_v6.parallel_env()
env = parallel_to_aec(env)

In [6]:
from pettingzoo.utils import TerminateIllegalWrapper
from pettingzoo.classic import tictactoe_v3
env = tictactoe_v3.env()
env = TerminateIllegalWrapper(env, illegal_reward=-1)

env.reset()
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None
    else:
        # this is where you would insert your policy
        action = env.action_space(agent).sample()
    env.step(action)
env.close()

In this environment, ```observation['action_mask']``` contains a mask of all legal moves that can be chosen.
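
The rule such a wrapper enforces can be sketched as a single check against the mask. This is a simplified illustration under the assumption that an illegal move ends the game with a fixed penalty for the mover; ```resolve_move``` is a hypothetical helper, not PettingZoo code.

```python
def resolve_move(action, action_mask, illegal_reward=-1.0):
    """Return (reward, terminated) for the acting agent in a toy turn."""
    if not action_mask[action]:
        # Illegal move: penalise the mover and stop the game.
        return illegal_reward, True
    # Legal move: the game goes on; a real environment would compute the reward.
    return 0.0, False


mask = [1, 0, 1, 1, 0, 1, 1, 1, 1]  # hypothetical board: cells 1 and 4 are taken
legal_result = resolve_move(0, mask)
illegal_result = resolve_move(4, mask)
print(legal_result, illegal_result)
```

Because the random policy in the cell above ignores the mask, games there can end early through exactly this path.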


In [7]:
from pettingzoo.utils import BaseParallelWrapper
from pettingzoo.butterfly import pistonball_v6

parallel_env = pistonball_v6.parallel_env(render_mode="human")
parallel_env = BaseParallelWrapper(parallel_env)

observations, infos = parallel_env.reset()

while parallel_env.agents:
    # this is where you would insert your policy
    actions = {agent: parallel_env.action_space(agent).sample() for agent in parallel_env.agents}
    observations, rewards, terminations, truncations, infos = parallel_env.step(actions)

parallel_env.close()

# Training and evaluation

Import libraries

In [8]:
from __future__ import annotations

import supersuit as ss
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

from pettingzoo.sisl import waterworld_v4

Create and wrap the environment.

- ```ss.pettingzoo_env_to_vec_env_v1``` makes the environment compatible with Stable Baselines3
- ```ss.concat_vec_envs_v1``` runs n copies of the environment at the same time

In [9]:
env = waterworld_v4.parallel_env()
env.reset(seed=42)

env = ss.pettingzoo_env_to_vec_env_v1(env)
env = ss.concat_vec_envs_v1(env, 8, num_cpus=2, base_class="stable_baselines3")
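
Conceptually, the conversion treats every agent as one slot of a vectorised environment, so a single shared policy sees a batch with one row per agent. The sketch below illustrates that mapping in plain Python; ```to_batch``` and ```from_batch``` are hypothetical helpers and do not reflect how SuperSuit is implemented internally.

```python
def to_batch(observations, agent_order):
    """Stack a per-agent dict into a list ordered like the vector-env slots."""
    return [observations[agent] for agent in agent_order]


def from_batch(batch_actions, agent_order):
    """Map a batch of actions back to the agent names the environment expects."""
    return dict(zip(agent_order, batch_actions))


agent_order = ["pursuer_0", "pursuer_1"]
observations = {"pursuer_1": [0.3, 0.4], "pursuer_0": [0.1, 0.2]}  # made-up values

batch = to_batch(observations, agent_order)
actions = from_batch([[1.0, -1.0], [0.5, 0.5]], agent_order)

copies = 8  # the value passed to concat_vec_envs_v1 above
num_envs = copies * len(agent_order)
print(batch, actions, num_envs)
```

With two pursuers and 8 copies, the model would therefore see 16 environments in total.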

Create the PPO model; you can tune these and many more parameters. The model uses ```MlpPolicy```, which means the policy is a multi-layer perceptron network.

In [10]:
model = PPO(
    MlpPolicy,
    env,
    verbose=3,
    learning_rate=1e-3,
    batch_size=256,
    device='cpu'
)

Using cpu device


Train and save model

In [11]:
model.learn(total_timesteps=10*(2**15), progress_bar=True)
model.save('waterworld_model')
env.close()



------------------------------
| time/              |       |
|    fps             | 1756  |
|    iterations      | 1     |
|    time_elapsed    | 18    |
|    total_timesteps | 32768 |
------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 1584         |
|    iterations           | 2            |
|    time_elapsed         | 41           |
|    total_timesteps      | 65536        |
| train/                  |              |
|    approx_kl            | 0.0048466474 |
|    clip_fraction        | 0.0378       |
|    clip_range           | 0.2          |
|    entropy_loss         | -2.8         |
|    explained_variance   | -0.00415     |
|    learning_rate        | 0.001        |
|    loss                 | 4.3          |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.000366    |
|    std                  | 0.981        |
|    value_loss           | 10.5         |
------------------------------------------


----------------------------------------
| time/                   |            |
|    fps                  | 1515       |
|    iterations           | 3          |
|    time_elapsed         | 64         |
|    total_timesteps      | 98304      |
| train/                  |            |
|    approx_kl            | 0.00454988 |
|    clip_fraction        | 0.0354     |
|    clip_range           | 0.2        |
|    entropy_loss         | -2.77      |
|    explained_variance   | 0.291      |
|    learning_rate        | 0.001      |
|    loss                 | 4.28       |
|    n_updates            | 20         |
|    policy_gradient_loss | -0.000742  |
|    std                  | 0.965      |
|    value_loss           | 11.7       |
----------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 1455         |
|    iterations           | 4            |
|    time_elapsed         | 90           |
|    total_timesteps      | 131072       |
| train/                  |              |
|    approx_kl            | 0.0051311515 |
|    clip_fraction        | 0.0454       |
|    clip_range           | 0.2          |
|    entropy_loss         | -2.73        |
|    explained_variance   | 0.36         |
|    learning_rate        | 0.001        |
|    loss                 | 7.3          |
|    n_updates            | 30           |
|    policy_gradient_loss | -0.000712    |
|    std                  | 0.948        |
|    value_loss           | 14.3         |
------------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 1442         |
|    iterations           | 5            |
|    time_elapsed         | 113          |
|    total_timesteps      | 163840       |
| train/                  |              |
|    approx_kl            | 0.0055509075 |
|    clip_fraction        | 0.0553       |
|    clip_range           | 0.2          |
|    entropy_loss         | -2.7         |
|    explained_variance   | 0.394        |
|    learning_rate        | 0.001        |
|    loss                 | 6.84         |
|    n_updates            | 40           |
|    policy_gradient_loss | -0.000962    |
|    std                  | 0.933        |
|    value_loss           | 15.7         |
------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 1430        |
|    iterations           | 6           |
|    time_elapsed         | 137         |
|    total_timesteps      | 196608      |
| train/                  |             |
|    approx_kl            | 0.004556311 |
|    clip_fraction        | 0.061       |
|    clip_range           | 0.2         |
|    entropy_loss         | -2.68       |
|    explained_variance   | 0.375       |
|    learning_rate        | 0.001       |
|    loss                 | 7.68        |
|    n_updates            | 50          |
|    policy_gradient_loss | -0.00117    |
|    std                  | 0.922       |
|    value_loss           | 15.4        |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 1420        |
|    iterations           | 7           |
|    time_elapsed         | 161         |
|    total_timesteps      | 229376      |
| train/                  |             |
|    approx_kl            | 0.007919838 |
|    clip_fraction        | 0.078       |
|    clip_range           | 0.2         |
|    entropy_loss         | -2.64       |
|    explained_variance   | 0.385       |
|    learning_rate        | 0.001       |
|    loss                 | 8.38        |
|    n_updates            | 60          |
|    policy_gradient_loss | -0.00157    |
|    std                  | 0.904       |
|    value_loss           | 14.5        |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 1420        |
|    iterations           | 8           |
|    time_elapsed         | 184         |
|    total_timesteps      | 262144      |
| train/                  |             |
|    approx_kl            | 0.006140894 |
|    clip_fraction        | 0.0731      |
|    clip_range           | 0.2         |
|    entropy_loss         | -2.6        |
|    explained_variance   | 0.418       |
|    learning_rate        | 0.001       |
|    loss                 | 10.8        |
|    n_updates            | 70          |
|    policy_gradient_loss | -0.000781   |
|    std                  | 0.883       |
|    value_loss           | 18.2        |
-----------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 1418         |
|    iterations           | 9            |
|    time_elapsed         | 207          |
|    total_timesteps      | 294912       |
| train/                  |              |
|    approx_kl            | 0.0065908213 |
|    clip_fraction        | 0.0729       |
|    clip_range           | 0.2          |
|    entropy_loss         | -2.56        |
|    explained_variance   | 0.437        |
|    learning_rate        | 0.001        |
|    loss                 | 9            |
|    n_updates            | 80           |
|    policy_gradient_loss | -0.00107     |
|    std                  | 0.872        |
|    value_loss           | 19.8         |
------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 1428        |
|    iterations           | 10          |
|    time_elapsed         | 229         |
|    total_timesteps      | 327680      |
| train/                  |             |
|    approx_kl            | 0.005607011 |
|    clip_fraction        | 0.0789      |
|    clip_range           | 0.2         |
|    entropy_loss         | -2.53       |
|    explained_variance   | 0.468       |
|    learning_rate        | 0.001       |
|    loss                 | 7.56        |
|    n_updates            | 90          |
|    policy_gradient_loss | -0.00132    |
|    std                  | 0.852       |
|    value_loss           | 19.9        |
-----------------------------------------
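
The counters in these tables can be cross-checked with a little arithmetic. The sketch assumes the SB3 PPO defaults ```n_steps=2048``` and ```n_epochs=10``` (neither is set in this notebook) and 16 vectorised environments, i.e. 8 copies of a two-pursuer Waterworld.

```python
n_steps = 2048                # SB3 PPO default (assumed, not set in the notebook)
n_epochs = 10                 # SB3 PPO default (assumed, not set in the notebook)
num_envs = 8 * 2              # 8 copies x 2 pursuers (assumed from the setup above)
batch_size = 256              # as passed to PPO
total_timesteps = 10 * 2**15  # as passed to model.learn

# Every iteration collects n_steps transitions from each environment.
rollout_size = n_steps * num_envs

# Ceiling division: learning stops once total_timesteps has been reached.
iterations = -(-total_timesteps // rollout_size)

# Each epoch walks through the rollout in minibatches of batch_size.
minibatches_per_epoch = rollout_size // batch_size

# The table for iteration k appears to be printed before that iteration's updates run.
n_updates_logged = (iterations - 1) * n_epochs

print(rollout_size, iterations, minibatches_per_epoch, n_updates_logged)
```

Under these assumptions each iteration adds 32768 timesteps and training takes 10 iterations, which agrees with the ```total_timesteps``` and ```iterations``` rows above; the ```n_updates``` value of 90 in the last table is consistent with the logging order assumed in the comment.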


Evaluate the model by simulating n games (here 10) and collecting the rewards.

In [12]:
env = waterworld_v4.env()
env.reset(seed=42)
model = PPO.load('waterworld_model')

rewards = {agent: 0 for agent in env.possible_agents}

for i in range(10):
    env.reset(seed=i)

    for agent in env.agent_iter():
        obs, reward, termination, truncation, info = env.last()

        for a in env.agents:
            rewards[a] += env.rewards[a]
        if termination or truncation:
            break
        else:
            act = model.predict(obs, deterministic=True)[0]

        env.step(act)
env.close()

avg_reward = sum(rewards.values()) / len(rewards.values())
print("Rewards: ", rewards)
print(f"Avg reward: {avg_reward}")

Rewards:  {'pursuer_0': np.float64(37.14175783466845), 'pursuer_1': np.float64(-173.56618499680962)}
Avg reward: -68.21221358107059
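
The printed average is the mean of the per-agent totals over all 10 games, not a per-game figure. The sketch below reproduces it from the totals printed above and also derives a per-game average for each agent; ```avg_per_game``` is an extra quantity introduced here for illustration.

```python
# Totals copied from the output above (accumulated over 10 games).
rewards = {"pursuer_0": 37.14175783466845, "pursuer_1": -173.56618499680962}
num_games = 10

# Mean over agents of the accumulated totals, as in the evaluation cell.
avg_reward = sum(rewards.values()) / len(rewards)

# Dividing each total by the number of games gives a per-game figure per agent.
avg_per_game = {agent: total / num_games for agent, total in rewards.items()}

print(f"Avg reward: {avg_reward}")
print(avg_per_game)
```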


You can visualise the model by setting the ```render_mode``` flag to ```"human"```.

In [13]:
env = waterworld_v4.env(render_mode='human')
env.metadata['render_fps'] = 60
env.reset(seed=42)
model = PPO.load('waterworld_model')

for agent in env.agent_iter():
    obs, reward, termination, truncation, info = env.last()

    if termination or truncation:
        break
    else:
        act = model.predict(obs, deterministic=True)[0]

    env.step(act)
env.close()

# Training environments with action mask

Following the PettingZoo documentation, to train on environments with an action mask in Stable Baselines3 we need to define the wrapper below. Feel free to copy and paste it, but let's also explain what happens here.

To use ```MaskablePPO``` from ```Stable Baselines3 - Contrib``` we must wrap the environment in ```ActionMasker```. We also need to pass ```ActionMasker``` a function that takes the environment and returns its action mask, like ```mask_fn``` in the example below.
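
A common way for a policy to use such a mask is to push the logits of illegal actions to negative infinity before the softmax, so those actions receive zero probability. The sketch below shows this idea in plain Python; it illustrates the general technique and is not taken from the ```sb3_contrib``` source, whose implementation details may differ.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over logits in which entries with a zero mask get probability 0."""
    masked = [x if allowed else -math.inf for x, allowed in zip(logits, mask)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("the mask allows no action")
    # Subtracting the peak keeps the exponentials numerically stable.
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 3.0]  # made-up policy outputs
mask = [1, 0, 1, 0]            # only actions 0 and 2 are legal
probs = masked_softmax(logits, mask)
print(probs)
```

Note that the action with the highest raw logit (index 3) ends up with zero probability because it is masked out.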

PettingZoo environments approach this in the following fashion: they return the observation as a dictionary with ```observation``` and ```action_mask``` keys, and the wrapper then splits the two apart. Here is the wrapper:

In [14]:
import glob
import os
import time

import gymnasium as gym
from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy
from sb3_contrib.common.wrappers import ActionMasker

import pettingzoo.utils
from pettingzoo.classic import connect_four_v3

In [15]:
# To pass into other gymnasium wrappers, we need to ensure that pettingzoo's wrapper
# can also be a gymnasium Env. Thus, we subclass under gym.Env as well.
class SB3ActionMaskWrapper(pettingzoo.utils.BaseWrapper, gym.Env):
    """Wrapper to allow PettingZoo environments to be used with SB3 illegal action masking."""

    def reset(self, seed=None, options=None):
        """Gymnasium-like reset function which assigns obs/action spaces to be the same for each agent.

        This is required as SB3 is designed for single-agent RL and doesn't expect obs/action spaces to be functions
        """
        super().reset(seed, options)

        # Strip the action mask out from the observation space
        self.observation_space = super().observation_space(self.possible_agents[0])[
            "observation"
        ]
        self.action_space = super().action_space(self.possible_agents[0])

        # Return initial observation, info (PettingZoo AEC envs do not by default)
        return self.observe(self.agent_selection), {}

    def step(self, action):
        """Gymnasium-like step function, returning observation, reward, termination, truncation, info.

        The observation is for the next agent (used to determine the next action), while the remaining
        items are for the agent that just acted (used to understand what just happened).
        """
        current_agent = self.agent_selection

        super().step(action)

        next_agent = self.agent_selection
        return (
            self.observe(next_agent),
            self._cumulative_rewards[current_agent],
            self.terminations[current_agent],
            self.truncations[current_agent],
            self.infos[current_agent],
        )

    def observe(self, agent):
        """Return only raw observation, removing action mask."""
        return super().observe(agent)["observation"]

    def action_mask(self):
        """Separate function used in order to access the action mask."""
        return super().observe(self.agent_selection)["action_mask"]

In [16]:
def mask_fn(env):
    # Do whatever you'd like in this function to return the action mask
    # for the current env. In this example, we assume the env has a
    # helpful method we can rely on.
    return env.action_mask()

### Training

In [17]:
env = connect_four_v3.env()
env = SB3ActionMaskWrapper(env)
env.reset(seed=42)
env = ActionMasker(env, mask_fn)

In [18]:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [19]:
model.learn(total_timesteps=10*(2**12), progress_bar=True)
model.save(f"{env.unwrapped.metadata.get('name')}_{time.strftime('%Y%m%d-%H%M%S')}")
env.close()


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 21.2     |
|    ep_rew_mean     | 1        |
| time/              |          |
|    fps             | 987      |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 2048     |
---------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 21          |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 763         |
|    iterations           | 2           |
|    time_elapsed         | 5           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.007951414 |
|    clip_fraction        | 0.0416      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.9        |
|    explained_variance   | -2.45       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.00133    |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0172     |
|    value_loss           | 0.0755      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 19.7        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 705         |
|    iterations           | 3           |
|    time_elapsed         | 8           |
|    total_timesteps      | 6144        |
| train/                  |             |
|    approx_kl            | 0.008569875 |
|    clip_fraction        | 0.0567      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.9        |
|    explained_variance   | 0.0224      |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0449     |
|    n_updates            | 20          |
|    policy_gradient_loss | -0.021      |
|    value_loss           | 0.0127      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 20.1        |
|    ep_rew_mean          | 0.99        |
| time/                   |             |
|    fps                  | 685         |
|    iterations           | 4           |
|    time_elapsed         | 11          |
|    total_timesteps      | 8192        |
| train/                  |             |
|    approx_kl            | 0.010089598 |
|    clip_fraction        | 0.0881      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.88       |
|    explained_variance   | -0.441      |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0127     |
|    n_updates            | 30          |
|    policy_gradient_loss | -0.0264     |
|    value_loss           | 0.00532     |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 18.8        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 676         |
|    iterations           | 5           |
|    time_elapsed         | 15          |
|    total_timesteps      | 10240       |
| train/                  |             |
|    approx_kl            | 0.010065762 |
|    clip_fraction        | 0.0828      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.85       |
|    explained_variance   | -0.958      |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0444     |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.025      |
|    value_loss           | 0.00525     |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 17.1        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 669         |
|    iterations           | 6           |
|    time_elapsed         | 18          |
|    total_timesteps      | 12288       |
| train/                  |             |
|    approx_kl            | 0.010075117 |
|    clip_fraction        | 0.0911      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.84       |
|    explained_variance   | -0.315      |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0345     |
|    n_updates            | 50          |
|    policy_gradient_loss | -0.0272     |
|    value_loss           | 0.00317     |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 16.9        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 665         |
|    iterations           | 7           |
|    time_elapsed         | 21          |
|    total_timesteps      | 14336       |
| train/                  |             |
|    approx_kl            | 0.011402028 |
|    clip_fraction        | 0.115       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.81       |
|    explained_variance   | -0.357      |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0277     |
|    n_updates            | 60          |
|    policy_gradient_loss | -0.0307     |
|    value_loss           | 0.0026      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 14.9        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 660         |
|    iterations           | 8           |
|    time_elapsed         | 24          |
|    total_timesteps      | 16384       |
| train/                  |             |
|    approx_kl            | 0.010663469 |
|    clip_fraction        | 0.0937      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.78       |
|    explained_variance   | 0.0319      |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0429     |
|    n_updates            | 70          |
|    policy_gradient_loss | -0.029      |
|    value_loss           | 0.00197     |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 13.5        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 656         |
|    iterations           | 9           |
|    time_elapsed         | 28          |
|    total_timesteps      | 18432       |
| train/                  |             |
|    approx_kl            | 0.014470482 |
|    clip_fraction        | 0.175       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.7        |
|    explained_variance   | -0.0936     |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0521     |
|    n_updates            | 80          |
|    policy_gradient_loss | -0.0354     |
|    value_loss           | 0.00189     |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 11.8        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 650         |
|    iterations           | 10          |
|    time_elapsed         | 31          |
|    total_timesteps      | 20480       |
| train/                  |             |
|    approx_kl            | 0.015216706 |
|    clip_fraction        | 0.185       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.62       |
|    explained_variance   | 0.0254      |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0555     |
|    n_updates            | 90          |
|    policy_gradient_loss | -0.0379     |
|    value_loss           | 0.00141     |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 10.6        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 646         |
|    iterations           | 11          |
|    time_elapsed         | 34          |
|    total_timesteps      | 22528       |
| train/                  |             |
|    approx_kl            | 0.015475433 |
|    clip_fraction        | 0.185       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.52       |
|    explained_variance   | 0.151       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0295     |
|    n_updates            | 100         |
|    policy_gradient_loss | -0.0356     |
|    value_loss           | 0.00103     |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 9.55        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 642         |
|    iterations           | 12          |
|    time_elapsed         | 38          |
|    total_timesteps      | 24576       |
| train/                  |             |
|    approx_kl            | 0.016106198 |
|    clip_fraction        | 0.157       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.4        |
|    explained_variance   | 0.321       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0459     |
|    n_updates            | 110         |
|    policy_gradient_loss | -0.0337     |
|    value_loss           | 0.000746    |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 9.09        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 635         |
|    iterations           | 13          |
|    time_elapsed         | 41          |
|    total_timesteps      | 26624       |
| train/                  |             |
|    approx_kl            | 0.016477432 |
|    clip_fraction        | 0.22        |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.22       |
|    explained_variance   | 0.337       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0325     |
|    n_updates            | 120         |
|    policy_gradient_loss | -0.0396     |
|    value_loss           | 0.000515    |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 8.03        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 633         |
|    iterations           | 14          |
|    time_elapsed         | 45          |
|    total_timesteps      | 28672       |
| train/                  |             |
|    approx_kl            | 0.020140778 |
|    clip_fraction        | 0.251       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.04       |
|    explained_variance   | 0.465       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0371     |
|    n_updates            | 130         |
|    policy_gradient_loss | -0.0401     |
|    value_loss           | 0.000381    |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 7.78        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 632         |
|    iterations           | 15          |
|    time_elapsed         | 48          |
|    total_timesteps      | 30720       |
| train/                  |             |
|    approx_kl            | 0.017266545 |
|    clip_fraction        | 0.182       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.858      |
|    explained_variance   | 0.584       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0364     |
|    n_updates            | 140         |
|    policy_gradient_loss | -0.034      |
|    value_loss           | 0.000225    |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 7.36        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 631         |
|    iterations           | 16          |
|    time_elapsed         | 51          |
|    total_timesteps      | 32768       |
| train/                  |             |
|    approx_kl            | 0.016191576 |
|    clip_fraction        | 0.127       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.667      |
|    explained_variance   | 0.779       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.051      |
|    n_updates            | 150         |
|    policy_gradient_loss | -0.0313     |
|    value_loss           | 0.000104    |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 7.16        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 631         |
|    iterations           | 17          |
|    time_elapsed         | 55          |
|    total_timesteps      | 34816       |
| train/                  |             |
|    approx_kl            | 0.012948139 |
|    clip_fraction        | 0.0916      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.539      |
|    explained_variance   | 0.862       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0377     |
|    n_updates            | 160         |
|    policy_gradient_loss | -0.0242     |
|    value_loss           | 5.35e-05    |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 7.07        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 634         |
|    iterations           | 18          |
|    time_elapsed         | 58          |
|    total_timesteps      | 36864       |
| train/                  |             |
|    approx_kl            | 0.006799562 |
|    clip_fraction        | 0.0549      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.455      |
|    explained_variance   | 0.947       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.00904    |
|    n_updates            | 170         |
|    policy_gradient_loss | -0.015      |
|    value_loss           | 2.01e-05    |
-----------------------------------------


------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 7.05         |
|    ep_rew_mean          | 1            |
| time/                   |              |
|    fps                  | 635          |
|    iterations           | 19           |
|    time_elapsed         | 61           |
|    total_timesteps      | 38912        |
| train/                  |              |
|    approx_kl            | 0.0084230825 |
|    clip_fraction        | 0.0427       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.452       |
|    explained_variance   | 0.961        |
|    learning_rate        | 0.0003       |
|    loss                 | -0.0445      |
|    n_updates            | 180          |
|    policy_gradient_loss | -0.0144      |
|    value_loss           | 1.33e-05     |
------------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 7.01        |
|    ep_rew_mean          | 1           |
| time/                   |             |
|    fps                  | 636         |
|    iterations           | 20          |
|    time_elapsed         | 64          |
|    total_timesteps      | 40960       |
| train/                  |             |
|    approx_kl            | 0.003441878 |
|    clip_fraction        | 0.0446      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.427      |
|    explained_variance   | 0.977       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.012      |
|    n_updates            | 190         |
|    policy_gradient_loss | -0.01       |
|    value_loss           | 7.05e-06    |
-----------------------------------------
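
The counters in these tables are related in a simple way. As a rough sanity check, `total_timesteps` and `n_updates` can be reproduced from `iterations`. This sketch assumes the run used a single environment with the Stable Baselines3 PPO defaults of `n_steps=2048` and `n_epochs=10`; those values are an assumption, not read from the training call.

```python
# Assumed rollout settings (SB3 PPO defaults); adjust if the model was configured differently.
N_STEPS = 2048   # steps collected per iteration, single environment assumed
N_EPOCHS = 10    # gradient epochs per iteration


def expected_counters(iteration):
    """Return (total_timesteps, n_updates) as they would appear in the log at a given iteration."""
    total_timesteps = iteration * N_STEPS
    # In the tables above, n_updates lags one iteration behind:
    # at iteration k it reflects the updates done after iterations 1..k-1.
    n_updates = (iteration - 1) * N_EPOCHS
    return total_timesteps, n_updates


for it in (14, 17, 20):
    print(it, expected_counters(it))
```

If the printed pairs differ from a log you produced yourself, the model was most likely created with different `n_steps`, `n_epochs`, or number of environments.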


### Evaluation

This is an example evaluation function from the PettingZoo documentation. It is generalized so that it can be used both for evaluating and for rendering. It first creates the environment with the given parameters, then finds and loads the most recently saved policy, plays `num_games` games while collecting their scores, prints the results, and optionally renders the games. The documentation contains more generalized functions like this one that can be used with various environments: [SB3: Action Masked PPO for Connect Four](https://pettingzoo.farama.org/tutorials/sb3/connect_four/)

In [20]:
"""
Author: Elliot (https://github.com/elliottower)
"""
import glob
import os

from sb3_contrib import MaskablePPO


def eval_action_mask(env_fn, num_games=100, render_mode=None, **env_kwargs):
    # Evaluate a trained agent vs a random agent
    env = env_fn.env(render_mode=render_mode, **env_kwargs)

    print(
        f"Starting evaluation vs a random agent. Trained agent will play as {env.possible_agents[1]}."
    )

    try:
        latest_policy = max(
            glob.glob(f"{env.metadata['name']}*.zip"), key=os.path.getctime
        )
    except ValueError:
        print("Policy not found.")
        exit(0)

    model = MaskablePPO.load(latest_policy)

    scores = {agent: 0 for agent in env.possible_agents}
    total_rewards = {agent: 0 for agent in env.possible_agents}
    round_rewards = []

    for i in range(num_games):
        env.reset(seed=i)
        env.action_space(env.possible_agents[0]).seed(i)

        for agent in env.agent_iter():
            obs, reward, termination, truncation, info = env.last()

            # Separate observation and action mask
            observation, action_mask = obs.values()

            if termination or truncation:
                # If there is a winner, keep track, otherwise don't change the scores (tie)
                if (
                    env.rewards[env.possible_agents[0]]
                    != env.rewards[env.possible_agents[1]]
                ):
                    winner = max(env.rewards, key=env.rewards.get)
                    scores[winner] += env.rewards[
                        winner
                    ]  # only tracks the largest reward (winner of game)
                # Also track negative and positive rewards (penalizes illegal moves)
                for a in env.possible_agents:
                    total_rewards[a] += env.rewards[a]
                # List of rewards by round, for reference
                round_rewards.append(env.rewards)
                break
            else:
                if agent == env.possible_agents[0]:
                    act = env.action_space(agent).sample(action_mask)
                else:
                    # Note: PettingZoo expects integer actions # TODO: change chess to cast actions to type int?
                    act = int(
                        model.predict(
                            observation, action_masks=action_mask, deterministic=True
                        )[0]
                    )
            env.step(act)
    env.close()

    # Avoid dividing by zero
    if sum(scores.values()) == 0:
        winrate = 0
    else:
        winrate = scores[env.possible_agents[1]] / sum(scores.values())
    print("Rewards by round: ", round_rewards)
    print("Total rewards (incl. negative rewards): ", total_rewards)
    print("Winrate: ", winrate)
    print("Final scores: ", scores)
    return round_rewards, total_rewards, winrate, scores

In [21]:
env_fn = connect_four_v3

env_kwargs = {}

# Evaluate 100 games against a random agent (winrate should be ~80%)
eval_action_mask(env_fn, num_games=100, render_mode=None, **env_kwargs)

# Watch two games vs a random agent
eval_action_mask(env_fn, num_games=2, render_mode="human", **env_kwargs)

Starting evaluation vs a random agent. Trained agent will play as player_1.
Rewards by round:  [{'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': 1, 'player_1': -1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': 1, 'player_1': -1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': 1, 'player_1': -1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}, {'player_0': 

([{'player_0': -1, 'player_1': 1}, {'player_0': -1, 'player_1': 1}],
 {'player_0': -2, 'player_1': 2},
 1.0,
 {'player_0': 0, 'player_1': 2})
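
The per-game reward list returned by `eval_action_mask` can be condensed into win, loss, and draw counts. Below is a minimal sketch; `summarize_rounds` is a hypothetical helper (not part of PettingZoo or Stable Baselines3), and the sample data is hand-written in the same format as the output above rather than taken from a real run.

```python
from collections import Counter


def summarize_rounds(round_rewards, trained_agent="player_1"):
    """Count wins, losses and draws for trained_agent from per-game reward dicts."""
    outcomes = Counter()
    for rewards in round_rewards:
        own = rewards[trained_agent]
        # Best reward obtained by any opponent in this game
        best_other = max(r for a, r in rewards.items() if a != trained_agent)
        if own > best_other:
            outcomes["win"] += 1
        elif own < best_other:
            outcomes["loss"] += 1
        else:
            outcomes["draw"] += 1
    return dict(outcomes)


# Hand-written sample in the same format as round_rewards (not real results)
sample = [
    {"player_0": -1, "player_1": 1},
    {"player_0": 1, "player_1": -1},
    {"player_0": 0, "player_1": 0},
    {"player_0": -1, "player_1": 1},
]
print(summarize_rounds(sample))
```

In `eval_action_mask`, tied games add nothing to `scores`, so they do not enter the `winrate` denominator. Counting them separately keeps them visible.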

# Conclusions

We learned to use the PettingZoo library to train Stable Baselines3 models on the multi-agent environments predefined in PettingZoo. This configuration has its disadvantages: we cannot train agents that have different observation or action spaces, or different purposes. That is because in reality we train the same agent from different perspectives. PettingZoo is compatible with more RL libraries, so feel free to try them! We decided on Stable Baselines3 because it is widely used and easy to install. In the next notebook we'll learn to create custom environments.
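
The limitation on observation and action spaces noted above can be checked before training. The sketch below relies only on the `possible_agents`, `observation_space(agent)`, and `action_space(agent)` members of a PettingZoo environment. `shares_spaces` and `StubEnv` are hypothetical names introduced for illustration, and the stub uses plain tuples and integers in place of real gymnasium spaces so that the example runs without any environment installed.

```python
def shares_spaces(env):
    """Return True if every agent has the same observation and action space."""
    agents = list(env.possible_agents)
    first = agents[0]
    return all(
        env.observation_space(a) == env.observation_space(first)
        and env.action_space(a) == env.action_space(first)
        for a in agents[1:]
    )


class StubEnv:
    """Hypothetical stand-in for a PettingZoo env; plain values play the role of spaces."""

    def __init__(self, obs_spaces, act_spaces):
        self.possible_agents = list(obs_spaces)
        self._obs, self._act = obs_spaces, act_spaces

    def observation_space(self, agent):
        return self._obs[agent]

    def action_space(self, agent):
        return self._act[agent]


same = StubEnv({"player_0": (6, 7, 2), "player_1": (6, 7, 2)}, {"player_0": 7, "player_1": 7})
mixed = StubEnv({"player_0": (6, 7, 2), "player_1": (4,)}, {"player_0": 7, "player_1": 2})
print(shares_spaces(same), shares_spaces(mixed))
```

Passing a real environment after `env.reset()` should work the same way, since gymnasium spaces support `==` comparison.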