# Navground-PettingZoo integration 

This notebook showcases the integration between navground and PettingZoo, the "multi-agent" version of Gymnasium.
We focus on the differences with respect to Gymnasium: have a look at `Navground-Gymnasium integration` for the common parts (e.g., rendering).
With Gymnasium, we control a single navground agent, which may move among many other agents controlled by navground. With PettingZoo, we can control multiple agents, even all the agents of a navground simulation. We load the same scenario with 20 agents and the same sensor.

In [1]:
from navground import sim
import numpy as np

scenario = sim.load_scenario("""
type: Cross
agent_margin: 0.1
side: 4
target_margin: 0.1
tolerance: 0.5
groups:
  -
    type: thymio
    number: 20
    radius: 0.1
    control_period: 0.1
    speed_tolerance: 0.02
    color: gray
    kinematics:
      type: 2WDiff
      wheel_axis: 0.094
      max_speed: 0.12
    behavior:
      type: HL
      optimal_speed: 0.12
      horizon: 5.0
      tau: 0.25
      eta: 0.5
      safety_margin: 0.1
    state_estimation:
      type: Bounded
      range: 5.0
""")

sensor = sim.load_state_estimation("""
type: Discs
number: 5
range: 5.0
max_speed: 0.12
max_radius: 0.1
""")

## A single group

Now, instead of a single agent, we want to control a group of agents with a policy acting on the selected sensor.
We define a PettingZoo environment that controls the first 10 agents, which all *share* the same configuration.

In [2]:
from navground_learning.env.pz import shared_parallel_env
from navground_learning.reward import SocialReward
from navground_learning import ObservationConfig, ControlActionConfig

observation_config = ObservationConfig()
action_config = ControlActionConfig()

env = shared_parallel_env(
    scenario=scenario,
    agent_indices=slice(0, 10, 1),
    sensor=sensor,
    action=action_config,
    observation=observation_config,
    reward=SocialReward(),
    time_step=0.1,
    max_duration=60.0)

All agents have the same observation and action spaces, as configured.

In [3]:
print(f'We are controlling {len(env.possible_agents)} agents')

observation_space = env.observation_space(0)
action_space = env.action_space(0)
if all(env.action_space(i) == action_space
       and env.observation_space(i) == observation_space
       for i in env.possible_agents):
    print(f'They share the same observation {observation_space} and action {action_space} spaces')

We are controlling 10 agents
They share the same observation Dict('position': Box(-5.0, 5.0, (5, 2), float64), 'radius': Box(0.0, 0.1, (5,), float64), 'valid': MultiBinary((5,)), 'velocity': Box(-0.12, 0.12, (5, 2), float64), 'ego_target_direction': Box(-1.0, 1.0, (2,), float64), 'ego_target_distance': Box(0.0, 5.0, (1,), float64)) and action Box(-1.0, 1.0, (2,), float64) spaces


The `info` map returned by `reset(...)` and `step(...)` contains the action computed by the original navground behavior, in this case `HL`, for each of the 10 agents.

In [4]:
observations, infos = env.reset()
print(f"Observation #0: {observations[0]}")
print(f"Info #0: {infos[0]}")

Observation #0: {'position': array([[ 0.10201398, -0.28317614],
       [-0.20867327, -0.35694126],
       [ 0.19885384, -0.67285531],
       [-0.56413084, -0.50433907],
       [-0.82594673, -0.22419791]]), 'radius': array([0.1, 0.1, 0.1, 0.1, 0.1]), 'valid': array([1, 1, 1, 1, 1], dtype=uint8), 'velocity': array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]]), 'ego_target_direction': array([ 1.00000000e+00, -3.41599925e-17]), 'ego_target_distance': array([2.1121034])}
Info #0: {'navground_action': array([0.32967995, 0.        ])}


Let's collect the reward from the original controller

In [5]:
all_rewards = []
for n in range(1000):
    actions = {i: info['navground_action'] for i, info in infos.items()}
    observations, rewards, terminated, truncated, infos = env.step(actions)
    all_rewards.append(np.mean(list(rewards.values())))
    done = np.bitwise_or(list(terminated.values()), list(truncated.values()))
    if np.all(done):
        print(f'reset after {n} steps')
        observations, infos = env.reset()

print(f'mean reward {np.mean(all_rewards):.3f}')

reset after 599 steps
mean reward -0.242


and compare it with the reward from a random policy

In [6]:
observations, infos = env.reset()
all_rewards = []
for n in range(1000):
    actions = {i: env.action_space(i).sample() for i in range(10)}
    observations, rewards, terminated, truncated, infos = env.step(actions)
    all_rewards.append(np.mean(list(rewards.values())))
    done = np.bitwise_or(list(terminated.values()), list(truncated.values()))
    if np.all(done):
        print(f'reset after {n} steps')
        observations, infos = env.reset()

print(f'mean reward {np.mean(all_rewards):.3f}')

reset after 599 steps
mean reward -1.017
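
Both loops above reset the environment only once every agent is either terminated or truncated. As a minimal sketch of that test, using plain dictionaries with hypothetical flags instead of a real environment (the helper name `all_done` is ours, not part of PettingZoo or navground):

```python
def all_done(terminated: dict, truncated: dict) -> bool:
    """Return True when every agent is either terminated or truncated."""
    return all(terminated[i] or truncated[i] for i in terminated)


# Hypothetical per-agent flags, shaped like the dicts returned by `env.step`
terminated = {0: False, 1: True, 2: False}
truncated = {0: True, 1: False, 2: True}
print(all_done(terminated, truncated))  # True

# One agent is still running, so the episode is not over yet
truncated[2] = False
print(all_done(terminated, truncated))  # False
```

This is equivalent to the `np.bitwise_or` / `np.all` combination used in the cells, written without NumPy.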


We now want a machine-learning policy to generate the actions. For instance, a random policy, like

In [7]:
from imitation.policies.base import RandomPolicy

policies = {i: RandomPolicy(env.observation_space(i), env.action_space(i)) for i in env.agents}

The `predict` method of a policy returns a tuple `(action, state)`, so the new loop keeps only the first element:

In [8]:
observations, infos = env.reset()
all_rewards = []  # start a fresh accumulator instead of reusing the one from the previous cell
for n in range(1000):
    actions = {i: policies[i].predict(observations[i])[0] for i in env.agents}
    observations, rewards, terminated, truncated, infos = env.step(actions)
    all_rewards.append(np.mean(list(rewards.values())))
    done = np.bitwise_or(list(terminated.values()), list(truncated.values()))
    if np.all(done):
        print(f'reset after {n} steps')
        observations, infos = env.reset()

print(f'mean reward {np.mean(all_rewards):.3f}')

reset after 599 steps
mean reward -1.073
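
To make the `(action, state)` convention concrete without installing `imitation`, here is a minimal sketch with a hypothetical stand-in class, `UniformPolicy`, that mimics only the part of the interface used in the loop above. It is not the `RandomPolicy` implementation.

```python
import random


class UniformPolicy:
    """Hypothetical stand-in for a policy exposing `predict(observation)`."""

    def __init__(self, low: float, high: float, size: int, seed: int = 0):
        self.low, self.high, self.size = low, high, size
        self.rng = random.Random(seed)

    def predict(self, observation, state=None):
        # Ignore the observation and draw each action component uniformly
        action = [self.rng.uniform(self.low, self.high) for _ in range(self.size)]
        # Return a tuple: the action and the (unused) recurrent state
        return action, state


# One policy per agent, as in the notebook; three hypothetical agents here
policies = {i: UniformPolicy(-1.0, 1.0, size=2, seed=i) for i in range(3)}
observations = {i: None for i in policies}  # placeholder observations

# Keep only the first element of the tuple, i.e. the action
actions = {i: policies[i].predict(observations[i])[0] for i in policies}
print(actions)
```

The resulting `actions` dictionary has one entry per agent, which is the shape expected by `env.step`.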


## Two groups

Let us now consider the more complex case where we want to control agents using different sensors and/or configurations.
For instance, we want to control the first 10 agents as before and the other 10 agents using a lidar scanner.
Let's say we also want to control the second group in acceleration, while the first group is controlled in speed.

In [9]:
lidar = sim.load_state_estimation("""
type: Lidar
resolution: 100
range: 5.0
""")

In [10]:
from navground_learning.env.pz import parallel_env
from navground_learning import WorldConfig, GroupConfig

first_group = GroupConfig(
    indices=slice(0, 10, 1), sensor=sensor,
    observation=ObservationConfig(include_target_distance=False),
    action=ControlActionConfig(), reward=SocialReward())
second_group = GroupConfig(
    indices=slice(10, 20, 1), sensor=lidar,
    observation=ObservationConfig(),
    action=ControlActionConfig(use_acceleration_action=True,
                               max_acceleration=1.0,
                               max_angular_acceleration=10.0),
    reward=SocialReward())

env = parallel_env(scenario=scenario, config=WorldConfig(groups=[first_group, second_group]), 
                   time_step=0.1, max_duration=60.0)

The two groups now use different observation spaces

In [11]:
env.observation_space(0)

Dict('position': Box(-5.0, 5.0, (5, 2), float64), 'radius': Box(0.0, 0.1, (5,), float64), 'valid': MultiBinary((5,)), 'velocity': Box(-0.12, 0.12, (5, 2), float64), 'ego_target_direction': Box(-1.0, 1.0, (2,), float64))

In [12]:
env.observation_space(10)

Dict('range': Box(0.0, 5.0, (100,), float64), 'ego_target_direction': Box(-1.0, 1.0, (2,), float64), 'ego_target_distance': Box(0.0, 5.0, (1,), float64))

and different maps between actions and commands

In [13]:
env._possible_agents[0].gym.get_cmd_from_action(np.ones(2), time_step=0.1)

Twist2((0.120000, 0.000000), 2.553191, frame=Frame.relative)

In [14]:
env._possible_agents[10].gym.get_cmd_from_action(np.ones(2), time_step=0.1)

Twist2((0.100000, 0.000000), 1.000000, frame=Frame.relative)
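
The two commands printed above can be reproduced by hand. The sketch below is a reconstruction from the values in the scenario and in the action configuration, not navground's implementation. It assumes that the maximal angular speed of a two-wheeled differential drive is `2 * max_speed / wheel_axis`, and that the agent of the second group starts at rest.

```python
# Values taken from the scenario and from the second group's action config
max_speed = 0.12
wheel_axis = 0.094
max_acceleration = 1.0
max_angular_acceleration = 10.0
time_step = 0.1

action = (1.0, 1.0)  # same as np.ones(2) in the cells above

# Assumption: maximal angular speed of a differential drive
max_angular_speed = 2 * max_speed / wheel_axis

# First group: the action scales the maximal speeds directly
speed_cmd = (action[0] * max_speed, action[1] * max_angular_speed)
print(round(speed_cmd[0], 6), round(speed_cmd[1], 6))  # 0.12 2.553191

# Second group: the action scales the maximal accelerations,
# which are integrated over one time step.
# Assumption: the current velocity is zero.
current = (0.0, 0.0)
acc_cmd = (current[0] + action[0] * max_acceleration * time_step,
           current[1] + action[1] * max_angular_acceleration * time_step)
print(round(acc_cmd[0], 6), round(acc_cmd[1], 6))  # 0.1 1.0
```

The same normalized action therefore leads to different commands, depending on how each group's action is configured.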

## Convert to a Gymnasium Env

If the agents share the same configuration (and in particular the same action and observation spaces), we can convert the PettingZoo env into a Gymnasium vector env.

In [15]:
env = shared_parallel_env(
    scenario=scenario,
    agent_indices=slice(0, 10, 1),
    sensor=sensor,
    action=action_config,
    observation=observation_config,
    reward=SocialReward(),
    time_step=0.1,
    max_duration=60.0)

In [16]:
import supersuit

venv = supersuit.pettingzoo_env_to_vec_env_v1(env)

with 

In [17]:
venv.num_envs

10

environments that represent the individual agents.

This vector env follows the Gymnasium API, stacking together the observations and actions of the individual agents.
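
As a rough illustration of what stacking means here, the sketch below converts between a per-agent dictionary and a list with one row per agent. The helpers `stack` and `unstack` are hypothetical and only illustrate the idea; they are not how SuperSuit implements the conversion.

```python
def stack(per_agent: dict, agents: list) -> list:
    """Collect per-agent values into one row per agent, in a fixed order."""
    return [per_agent[i] for i in agents]


def unstack(rows: list, agents: list) -> dict:
    """Split stacked rows back into a per-agent dictionary."""
    return dict(zip(agents, rows))


agents = [0, 1, 2]  # three hypothetical agents

# PettingZoo side: one action per agent, keyed by agent
actions = {0: [0.5, 0.0], 1: [-1.0, 0.2], 2: [0.0, 1.0]}

# Vector-env side: a single batch whose i-th row belongs to the i-th agent
batch = stack(actions, agents)
print(batch)  # [[0.5, 0.0], [-1.0, 0.2], [0.0, 1.0]]

# Going back recovers the original dictionary
print(unstack(batch, agents) == actions)  # True
```

With this layout, each agent plays the role of one sub-environment of the vector env.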

If we instead want a vector env that follows the SB3 API, we can use `concat_vec_envs_v1`, which can also stack multiple vectorized envs together. Here we concatenate two copies, for a total of 20 environments.

In [18]:
venv1 = supersuit.concat_vec_envs_v1(venv, 2, num_cpus=1, base_class="stable_baselines3")

In [19]:
venv1.num_envs

20