# DRAMA at the PettingZoo: Dynamically Restricted Action Spaces for Multi-Agent Reinforcement Learning Frameworks

This notebook demonstrates the basic functionality of restrictions, restrictors and restriction wrappers as described in _Oesterle et al. (2024): DRAMA at the PettingZoo: Dynamically Restricted Action Spaces for Multi-Agent Reinforcement Learning Frameworks_. More detailed examples can be found in the notebooks under `./examples/`, and the full documentation is available at https://drama-wrapper.readthedocs.io/.

## Imports

In [None]:
import numpy as np

from gymnasium.spaces import Discrete, Box, Space
from pettingzoo import AECEnv
from pettingzoo.classic import rps_v2
from drama import DiscreteSetRestriction, IntervalUnionRestriction, DiscreteSetActionSpace, Restrictor, RestrictionWrapper, RestrictorActionSpace
from examples.utils import play

## Basic usage of restrictions

A restriction represents a subset of a Gymnasium `Space`. It is initialized with a base space and offers the same methods as a `Space`, in particular `contains(x)` and `sample()`.

In [None]:
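restriction = DiscreteSetRestriction(base_space=Discrete(10))
print(restriction)
# Remove actions 3 and 5 from the allowed set
restriction.remove(3)
restriction.remove(5)
print(restriction)
# Add actions 2 and 3 (3 was removed above)
restriction.add(2)
restriction.add(3)
print(restriction.contains(8))  # 8 was never removed
print(restriction.contains(5))  # 5 was removed and not added back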
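To make the set semantics concrete, the following is a minimal plain-Python sketch of what a discrete restriction does conceptually. `ToyDiscreteRestriction` is a hypothetical stand-in written for this notebook, not the class shipped with `drama`; it assumes that every action of the base space is allowed initially.

```python
import random


class ToyDiscreteRestriction:
    """Hypothetical stand-in for a restriction over the integers 0..n-1 (not drama's class)."""

    def __init__(self, n: int):
        self.n = n
        self.allowed = set(range(n))  # assumption: the whole base space is allowed at first

    def remove(self, x: int) -> None:
        self.allowed.discard(x)

    def add(self, x: int) -> None:
        if not 0 <= x < self.n:
            raise ValueError(f"{x} is outside the base space")
        self.allowed.add(x)

    def contains(self, x: int) -> bool:
        return x in self.allowed

    def sample(self) -> int:
        # Draw uniformly from the currently allowed actions
        return random.choice(sorted(self.allowed))


toy = ToyDiscreteRestriction(10)
toy.remove(3)
toy.remove(5)
toy.add(2)  # already allowed, so nothing changes
toy.add(3)  # re-allow 3
print(toy.contains(8), toy.contains(5), sorted(toy.allowed))
```

In this sketch, only action 5 ends up excluded, so `contains(8)` is true and `contains(5)` is false.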

In [None]:
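restriction = IntervalUnionRestriction(base_space=Box(0, 10))
print(restriction)
# Exclude the interval from 3 to 6
restriction.remove(3, 6)
print(restriction)
# Re-allow the interval from 2 to 4, which overlaps the part just removed
restriction.add(2, 4)
print(restriction.contains(3))  # 3 lies inside the re-added interval
print(restriction.contains(5))  # 5 lies in the part that stays removed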
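The interval arithmetic behind the cell above can be sketched with a sorted list of disjoint `(low, high)` pairs. The helper names `remove_interval`, `add_interval` and `contains` are hypothetical and only illustrate the idea; the sketch treats every interval as closed and therefore ignores how `drama` handles open or closed endpoints.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def remove_interval(intervals: List[Interval], lo: float, hi: float) -> List[Interval]:
    """Cut [lo, hi] out of a sorted list of disjoint intervals."""
    result = []
    for a, b in intervals:
        if b <= lo or a >= hi:  # no overlap: keep the interval unchanged
            result.append((a, b))
            continue
        if a < lo:  # keep the part left of the cut
            result.append((a, lo))
        if b > hi:  # keep the part right of the cut
            result.append((hi, b))
    return result


def add_interval(intervals: List[Interval], lo: float, hi: float) -> List[Interval]:
    """Insert [lo, hi] and merge any intervals that overlap or touch."""
    merged: List[Interval] = []
    for a, b in sorted(intervals + [(lo, hi)]):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged


def contains(intervals: List[Interval], x: float) -> bool:
    return any(a <= x <= b for a, b in intervals)


allowed = [(0.0, 10.0)]
allowed = remove_interval(allowed, 3, 6)  # leaves [0, 3] and [6, 10]
allowed = add_interval(allowed, 2, 4)     # merges into [0, 4] and [6, 10]
print(allowed, contains(allowed, 3), contains(allowed, 5))
```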

## Example: Rock-Paper-Scissors

In this example, we build a restriction wrapper around the _Rock-Paper-Scissors_ environment (`rps_v2`) of `pettingzoo`. 

- The restrictor prevents each player from repeating an action, i.e., it observes the player's last move and excludes this action from the set of allowed actions.
- The agents simply choose a random action from the allowed set.
- The `RestrictionWrapper` wraps the environment (including its agents) and one or more `Restrictor`s. It extends the agent-environment cycle (AEC) so that, before each agent acts, the responsible restrictor produces a restriction. The agent then observes both the original observation and the restriction, and can choose its action accordingly.
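The extended cycle described in the list above can be simulated without any library. The sketch below is a toy model under stated assumptions: `no_repeat_restriction` and `simulate` are hypothetical helpers, a single restrictor named `restrictor_0` serves both players, and the actual game outcome is ignored. It only shows the turn order and the no-repeat rule.

```python
import random

ACTIONS = {0, 1, 2}  # rock, paper, scissors


def no_repeat_restriction(last_action):
    """Allowed actions for a player whose previous move was last_action (None before the first move)."""
    return ACTIONS - {last_action}


def simulate(cycles, seed=0):
    """Return the sequence of (actor, output) turns in the extended cycle."""
    rng = random.Random(seed)
    last_action = {"player_0": None, "player_1": None}
    turns = []
    for _ in range(cycles):
        for player in last_action:
            # Restrictor turn: compute the allowed set for the player about to act
            allowed = sorted(no_repeat_restriction(last_action[player]))
            turns.append(("restrictor_0", allowed))
            # Agent turn: pick uniformly at random from the allowed set
            action = rng.choice(allowed)
            turns.append((player, action))
            last_action[player] = action
    return turns


for actor, output in simulate(cycles=2):
    print(actor, output)
```

Each restrictor turn is immediately followed by the turn of the player it restricts, and no player ever plays the same action twice in a row.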

In [None]:
class RPSRestrictor(Restrictor):
    player_mapping = {'player_0': 'player_1', 'player_1': 'player_0'}
    
    def preprocess_observation(self, env: AECEnv):
        # Since the environment state is reset after each round, we need to get a player's 
        # previous action by looking at the _other_ player's observation
        return env.unwrapped.observe(self.player_mapping[env.unwrapped.agent_selection]).item()
    
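    def act(self, observation: Space) -> RestrictorActionSpace:
        # Exclude the player's previous action; any other observation value leaves all three actions allowed
        return DiscreteSetRestriction(base_space=self.action_space.base_space, allowed_actions=set(range(3)) - {observation})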

In [None]:
env = rps_v2.env(num_actions=3, max_cycles=10, render_mode=None)
restrictor = RPSRestrictor(Discrete(4), DiscreteSetActionSpace(base_space=Discrete(3)))
wrapper = RestrictionWrapper(env, restrictor)

def rps_random_policy(obs):
    _, restriction = obs['observation'], obs['restriction']
    return np.random.choice(restriction)

policies = {'player_0': rps_random_policy, 'player_1': rps_random_policy, 'restrictor_0': restrictor.act}

## Execution

We play the game for one episode (10 cycles) and observe that the AEC now consists of alternating restrictor and agent actions. The `play()` utility function records all observations, actions and rewards into a dataframe.

In [None]:
play(wrapper, policies, record_trajectory=True)
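The implementation of `play()` is not shown in this notebook. As a rough sketch of how such a loop could look, the hypothetical `run_episode` below steps any object that exposes the standard PettingZoo AEC methods (`reset`, `agent_iter`, `last`, `step`) and collects one record per turn in a list of dicts instead of a dataframe. That the wrapper can be driven this way is an assumption; `CountdownEnv` is a tiny stand-in used only to keep the example self-contained.

```python
def run_episode(env, policies):
    """Step an AEC-style environment to completion and return one record per turn."""
    trajectory = []
    env.reset()
    for agent in env.agent_iter():
        observation, reward, termination, truncation, _ = env.last()
        # In the PettingZoo AEC API, a finished agent must pass None as its action
        action = None if termination or truncation else policies[agent](observation)
        trajectory.append({"agent": agent, "observation": observation,
                           "action": action, "reward": reward})
        env.step(action)
    return trajectory


class CountdownEnv:
    """Hypothetical stand-in exposing only the four methods used above."""

    def __init__(self, order, turns):
        self.order, self.turns = list(order), turns

    def reset(self):
        self.t = 0

    def agent_iter(self):
        while self.t < self.turns:
            yield self.order[self.t % len(self.order)]

    def last(self):
        return self.t, 0.0, False, False, {}

    def step(self, action):
        self.t += 1


toy_policies = {
    "restrictor_0": lambda obs: {0, 1},  # pretend restriction
    "player_0": lambda obs: 0,
    "player_1": lambda obs: 1,
}
toy_env = CountdownEnv(["restrictor_0", "player_0", "restrictor_0", "player_1"], turns=4)
records = run_episode(toy_env, toy_policies)
print([r["agent"] for r in records])
```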