# Env reward scaling - PettingZoo

Acts as a testbed for academics using 2D- or 3D-shaped data.

Rewards are scaled using `MovingFoldChangeReward`.

### Background
In Dualer models I put reward and intrinsic reward in competition. This means the two values must be matched up or, put another way, they need to have common units. There are many ways to do that. Let's consider a biologically motivated approach to reward normalization, inspired by:

- Adler, M., and Alon, U. (2018). Fold-change detection in biological systems. Current Opinion in Systems Biology 8, 81–89.
- Karin, O., and Alon, U. (2021). The dopamine circuit as a reward-taxis navigation system. bioRxiv 439955, 30.
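
To make the "common units" idea concrete, here is a minimal sketch of fold-change scaling. It is not the implementation of `MovingFoldChangeReward`; the function `moving_fold_change`, its `window` and `eps` parameters, and the toy reward values are all assumptions made for illustration. Each reward is divided by the mean magnitude of the rewards that came just before it, so the result is a unitless ratio that no longer depends on the overall scale of the signal.

```python
import numpy as np


def moving_fold_change(rewards, window=10, eps=1e-12):
    """Hypothetical helper: express each reward as a fold change.

    Each reward is divided by the mean absolute value of the previous
    `window` rewards. The first reward has no history, so it is divided
    by its own magnitude. `eps` only guards against division by zero.
    """
    rewards = np.asarray(rewards, dtype=float)
    scaled = np.empty_like(rewards)
    for t, r in enumerate(rewards):
        history = rewards[max(0, t - window):t]
        if history.size:
            baseline = np.mean(np.abs(history))
        else:
            baseline = np.abs(r)
        scaled[t] = r / (baseline + eps)
    return scaled


# Toy values (made up): two signals with the same shape but different units.
reward = np.array([1.0, 2.0, 4.0, 8.0])
intrinsic = 0.01 * reward

print(moving_fold_change(reward, window=2))
print(moving_fold_change(intrinsic, window=2))
```

Because both toy signals differ only by a constant factor, the two printed arrays should agree up to floating-point error, which is the property that lets the two reward streams compete on equal terms. A running mean is only one possible baseline; an exponential moving average would be another reasonable choice.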

We'll try several `envs` from PettingZoo's `mpe` suite.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import gym
import numpy as np
import torch as th

from copy import deepcopy
from stable_baselines3.common.env_util import make_vec_env
from dualer.wrappers.academic import RFPrediction
from dualer.wrappers.normalize import MovingFoldChangeReward
from dualer.wrappers.normalize import ClipReward

import pettingzoo
from pettingzoo import mpe
import supersuit as ss

from infoduel_maddpg.utils.academic_wrappers import StatePrediction

In [3]:
import matplotlib.pyplot as plt 
%matplotlib inline

# Init

In [59]:
env_name = "simple_v2"
# env_name = "simple_tag_v2"

In [60]:
seed = 50
latent_mode = "mlp"
total_timesteps = 1000
max_cycles = 25
lr_academic = 0.001
do_fold = True
do_clip = True

In [61]:
Env = getattr(mpe, env_name)
env = Env.parallel_env(max_cycles=max_cycles, continuous_actions=True)
env.reset()

{'agent_0': array([ 0.        ,  0.        , -0.06511435,  0.8512194 ], dtype=float32)}

In [7]:
env.observation_space(env.possible_agents[0])

(16,)

In [82]:
x = np.random.rand(10, 3)
print(x)
print(x[7:])

[[0.9317701  0.23136078 0.61476906]
 [0.62085595 0.55878321 0.58706657]
 [0.54758801 0.3344116  0.54128054]
 [0.34900793 0.62144991 0.18971545]
 [0.13397827 0.40148372 0.49485851]
 [0.65316473 0.25592372 0.92957907]
 [0.70411778 0.69050959 0.17696261]
 [0.69724967 0.94661423 0.27797765]
 [0.50311391 0.81548859 0.77624513]
 [0.51879887 0.92546242 0.95557266]]
[[0.69724967 0.94661423 0.27797765]
 [0.50311391 0.81548859 0.77624513]
 [0.51879887 0.92546242 0.95557266]]


In [31]:
env = StatePrediction(env, network_hidden=[10])

In [44]:
# env.reset()
actions = {}
for a in env.env.agents:
    actions[a] = env.env.action_space(a).sample()
print(actions)
env.step(actions)

{'adversary_0': array([0.59769356, 0.05751409, 0.79003614, 0.24761441, 0.4978155 ],
      dtype=float32), 'adversary_1': array([0.11352156, 0.03985091, 0.41391292, 0.8710639 , 0.4948355 ],
      dtype=float32), 'adversary_2': array([0.4889414 , 0.90396065, 0.05769864, 0.6752599 , 0.67560387],
      dtype=float32), 'agent_0': array([0.52922493, 0.24319053, 0.657722  , 0.2211245 , 0.45881745],
      dtype=float32)}


({'adversary_0': array([ 0.02599132,  0.30075386, -0.39166135,  0.80715   ,  0.778772  ,
         -0.75070363, -0.2045369 , -0.37165737, -0.32204536, -0.7704419 ,
          0.4751598 , -1.2035803 ,  1.3148565 , -1.7713962 , -0.39270455,
         -0.09498468], dtype=float32),
  'adversary_1': array([-0.2956975 ,  0.13106494, -0.71370673,  0.03670814,  1.1008173 ,
          0.01973824,  0.11750847,  0.3987845 ,  0.32204536,  0.7704419 ,
          0.79720515, -0.43313837,  1.6369019 , -1.0009543 , -0.39270455,
         -0.09498468], dtype=float32),
  'adversary_2': array([ 0.30572632, -0.21128106,  0.08349845, -0.39643022,  0.30361217,
          0.45287663, -0.6796967 ,  0.8319229 , -0.4751598 ,  1.2035803 ,
         -0.79720515,  0.43313837,  0.83969676, -0.5678159 , -0.39270455,
         -0.09498468], dtype=float32),
  'agent_0': array([-0.39270455, -0.09498468,  0.9231952 , -0.96424615, -0.53608453,
          1.0206925 , -1.5193934 ,  1.3997388 , -1.3148565 ,  1.7713962 ,
         -1.6

## Scratch cells

In [None]:
# env.seed(1)
# AEC
# env.reset()
# next_obs, rewards, dones, infos = env.last()

# parallel
next_obs = env.reset()

In [None]:
env.step??

In [None]:
env.possible_agents

In [None]:
from torch.utils.tensorboard import SummaryWriter

In [None]:
SummaryWriter?

In [None]:
np.zeros((10, 3,))

In [None]:
env.step([1])