# Application: Learning optimal restrictions in a continuous-action game

This notebook corresponds to Section 5.1 of the paper "Grams & Oesterle (forthcoming). _DRAMA at the PettingZoo: Dynamically Restricted Action Spaces for Multi-Agent Reinforcement Learning Frameworks_."

## Setup

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os, sys
sys.path.append(f'{os.getcwd()}/../../')

In [None]:
from gymnasium.spaces import Box
import matplotlib

from src.wrapper import RestrictionWrapper
from src.restrictors import IntervalUnionActionSpace

from examples.utils import play
from examples.cournot.utils.env import NFGEnvironment
from examples.cournot.utils.agents import UnrestrictedCournotAgent, RestrictedCournotAgent
from examples.cournot.utils.restrictor import CournotRestrictor

## Definition of the Cournot Game
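
The utilities defined in the next cell have the form `u_i = -q_i**2 - q_i*q_j + lambda*q_i` with `lambda = maximum_price - cost`, so the best response to an opponent action `q_j` is `(lambda - q_j) / 2`. The following minimal sketch iterates these best responses to show where unrestricted play should settle. The helper names `best_response` and `iterate_best_responses` are hypothetical and only serve this illustration; this is not necessarily how `UnrestrictedCournotAgent` chooses its actions.

```python
def best_response(q_other, lam, q_max):
    """Maximizer of -q**2 - q*q_other + lam*q over [0, q_max] (hypothetical helper)."""
    unconstrained = (lam - q_other) / 2.0
    return min(max(unconstrained, 0.0), q_max)


def iterate_best_responses(lam, q_max, steps=100, start=(0.0, 0.0)):
    """Let both players simultaneously best-respond to each other's previous action."""
    q0, q1 = start
    for _ in range(steps):
        q0, q1 = best_response(q1, lam, q_max), best_response(q0, lam, q_max)
    return q0, q1


maximum_price, cost = 120, 12
lam = maximum_price - cost  # 108

q0, q1 = iterate_best_responses(lam, q_max=maximum_price)
nash_action = lam / 3.0  # fixed point of q = (lam - q) / 2
print(q0, q1, nash_action)
```

With the parameters of this notebook, the iteration approaches the symmetric fixed point `lambda / 3 = 36`.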

In [None]:
# Demand intercept (maximum price) and constant marginal cost
maximum_price = 120
cost = 12

# Note: despite its name, `price_space` is the space of quantities chosen by the players
price_space = Box(0, maximum_price)
observation_spaces = {'player_0': price_space, 'player_1': price_space}
action_spaces = {'player_0': price_space, 'player_1': price_space}

# Profit of player i: q_i * (maximum_price - q_0 - q_1 - cost)
utilities = {
    'player_0': (lambda actions: -actions['player_0'] ** 2 - actions['player_0'] * actions['player_1'] + (maximum_price - cost) * actions['player_0']),
    'player_1': (lambda actions: -actions['player_1'] ** 2 - actions['player_0'] * actions['player_1'] + (maximum_price - cost) * actions['player_1'])}

env = NFGEnvironment(observation_spaces, action_spaces, utilities, number_of_steps=100, render_mode='human')

## Test: Play without restrictions

In [None]:
policies = {'player_0': UnrestrictedCournotAgent(maximum_price, cost).act, 'player_1': UnrestrictedCournotAgent(maximum_price, cost).act}
trajectory = play(env, policies, max_iter=100, render_mode=None, record_trajectory=True)

In [None]:
trajectory.groupby('agent')['reward'].plot()

## Self-learning restrictions

When we run the environment with the `CournotRestrictor`, the restrictor first observes the agents and waits until their strategies converge. It then estimates the environment parameter `lambda := maximum_price - cost` from the observed agent actions and defines a suitable restriction. The agents react to the restriction by changing their strategies, which eventually increases their rewards by approximately 12.5%.
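
To make the 12.5% figure concrete, here is a minimal sketch of the arithmetic. It assumes that the converged unrestricted actions equal the symmetric equilibrium value `lambda / 3`, and, purely for illustration, that the restriction caps each action at `lambda / 4`. The functions `estimate_lambda` and `cournot_utility` are hypothetical helpers; the actual logic of `CournotRestrictor` may differ.

```python
def cournot_utility(q_own, q_other, lam):
    """Same payoff form as the utilities defined for the environment above."""
    return -q_own ** 2 - q_own * q_other + lam * q_own


def estimate_lambda(converged_actions):
    """Assumption: at convergence every agent plays lam / 3, so lam = 3 * mean action."""
    mean_action = sum(converged_actions) / len(converged_actions)
    return 3.0 * mean_action


# Converged unrestricted actions for maximum_price = 120 and cost = 12
observed = [36.0, 36.0]
lam_hat = estimate_lambda(observed)

# Assumed restriction for this sketch: allow only actions in [0, lam_hat / 4]
cap = lam_hat / 4.0

reward_before = cournot_utility(observed[0], observed[1], lam_hat)
reward_after = cournot_utility(cap, cap, lam_hat)
relative_gain = reward_after / reward_before - 1.0
print(lam_hat, cap, reward_before, reward_after, relative_gain)
```

Under these assumptions the per-agent reward rises from `lambda**2 / 9` to `lambda**2 / 8`, a relative gain of 1/8.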

In [None]:
agents = [RestrictedCournotAgent(maximum_price, cost), RestrictedCournotAgent(maximum_price, cost)]
restrictor = CournotRestrictor(Box(0, maximum_price, shape=(2, )), IntervalUnionActionSpace(Box(0, maximum_price)))
wrapper = RestrictionWrapper(env, restrictor)

# Use the same restrictor for all agents
policies = {'player_0': agents[0].act, 'player_1': agents[1].act, 'restrictor_0': restrictor.act}

# Run wrapped environment for 100 iterations
trajectory = play(wrapper, policies, max_iter=100, render_mode=None, record_trajectory=True)

In [None]:
trajectory.head(20)

In [None]:
trajectory.groupby('agent')['reward'].plot()