<a href="https://colab.research.google.com/github/lcipolina/Ray/blob/main/MARL_RLlib_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hands-on RL with Ray’s RLlib 
<hr />

Taken from here:
https://risecamp.berkeley.edu/archives/rise-camp-2021/

Also here:
https://github.com/sven1977/rllib_tutorials

## Tutorial for working with multi-agent environments, models, and algorithms

<img src="https://drive.google.com/uc?export=view&id=1s1chO-ET7inBCKDdKgP4hI0UgTI4bLPs" width=250> <img src="https://drive.google.com/uc?export=view&id=1GGD7V_oO1osZqgKF8QzajM3_bs5o9fNw" width=169> <img src="https://drive.google.com/uc?export=view&id=1xJTlXqv182zVvDPeRc2lEg06zU0GbNrK" width=252> <img src="https://drive.google.com/uc?export=view&id=1X3eVsp3hhFzwFaeqOwwZ9DmJ0UiYfu4y" width=213>

### Overview
“Hands-on RL with Ray’s RLlib” is a beginner's tutorial for working with reinforcement learning (RL) environments, models, and algorithms using Ray’s RLlib library. RLlib offers high scalability, a large list of algorithms to choose from (offline, model-based, model-free, etc.), support for TensorFlow and PyTorch, and a unified API for a variety of applications. This tutorial starts with a brief overview of concepts (e.g. why RL) before proceeding to RLlib (multi- and single-agent) environments, neural network models, student exercises, Q/A, and more. All code will be provided as .py files in a GitHub repo.

### Intended Audience
* Python programmers who want to get started with reinforcement learning and RLlib.

### Prerequisites
* Some Python programming experience.
* Some familiarity with machine learning.
* *Helpful, but not required:* Experience in reinforcement learning and Ray.
* *Helpful, but not required:* Experience with TensorFlow or PyTorch.

### Requirements/Dependencies


In [None]:
!pip install "ray[rllib]"
!pip install tensorflow -U  # or `pip install torch -U`: either framework works!
!pip install matplotlib

### Key Takeaways
* What is reinforcement learning and why RLlib?
* Core concepts of RLlib: Environments, Trainers, Policies, and Models.

### Tutorial Outline (30-40 min)
1. RL and RLlib in a nutshell.
1. Defining an RL-solvable problem: Our first (multi-agent) environment.
1. **Exercise No.1**: Environment Loop.
1. Picking an algorithm and training our first RLlib Trainer.
1. **Exercise No.2**: Fixing our experiment's config - Going multi-agent.

### Other Recommended Readings
* [Reinforcement Learning with RLlib in the Unity Game Engine](https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d)

<img src="https://drive.google.com/uc?export=view&id=1mgu5vPHwTB-3uch1d43BICQoK0h9XkbO" width=400>

* [Attention Nets and More with RLlib's Trajectory View API](https://medium.com/distributed-computing-with-ray/attention-nets-and-more-with-rllibs-trajectory-view-api-d326339a6e65)
* [Intro to RLlib: Example Environments](https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70)


## Environment Setup

### Coding/defining our "problem" via an RL environment.

We will use the following (adversarial) multi-agent environment
throughout this tutorial to demonstrate RLlib's
APIs, features, and customization options.

<img src="https://drive.google.com/uc?export=view&id=1GL5LDrrnw0rx-cYK9ucQ4drpaykz1pBd" width=800>

### A word or two on Spaces:

Spaces are used in ML to describe what possible/valid values inputs and outputs of a neural network can have.

RL environments also use them to describe what their valid observations and actions are.

Spaces are usually defined by their shape (e.g. 84x84x3 RGB images) and datatype (e.g. uint8 for RGB values between 0 and 255).
However, spaces could also be composed of other spaces (see Tuple or Dict spaces) or could be simply discrete with n fixed possible values
(represented by integers). For example, in our game, where each agent can only go up/down/left/right, the action space would be `Discrete(4)`
(no datatype, no shape needs to be defined here). Our observation space will be `MultiDiscrete([n, m])`, where n is the position of the agent observing and m is the position of the opposing agent, so if agent1 starts in the upper left corner and agent2 starts in the bottom right corner, agent1's observation would be: `[0, 63]` (in an 8 x 8 grid) and agent2's observation would be `[63, 0]`.

<img src="https://drive.google.com/uc?export=view&id=1zTklLKfSzK4ia054NNFMq3KLWii2QYa3" width=800>
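The position encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `encode_position` and `decode_position` are made up for this sketch (they are not part of RLlib or gym), and the row-by-row numbering is assumed to match the 8 x 8 example in the text.

```python
def encode_position(row: int, col: int, width: int) -> int:
    """Map a (row, col) grid cell to one integer, counting row by row."""
    return row * width + col


def decode_position(index: int, width: int) -> tuple:
    """Inverse of `encode_position`: integer index back to (row, col)."""
    return divmod(index, width)


width = height = 8
agent1 = (0, 0)                    # upper left corner
agent2 = (height - 1, width - 1)   # bottom right corner

# Each agent observes [own position, other agent's position].
obs_agent1 = [encode_position(*agent1, width), encode_position(*agent2, width)]
obs_agent2 = [encode_position(*agent2, width), encode_position(*agent1, width)]

print(obs_agent1)  # [0, 63]
print(obs_agent2)  # [63, 0]
```

Both entries of an observation take one of `width * height` values, which is why a `MultiDiscrete` space with two components of that size fits here.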

In [None]:
# Let's code our multi-agent environment.

import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
import random

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MultiAgentArena(MultiAgentEnv):
    def __init__(self, config=None):
        """ Config takes in width, height, and ts """
        config = config or {}
        # Dimensions of the grid.
        self.width = config.get("width", 10)
        self.height = config.get("height", 10)

        # End an episode after this many timesteps.
        self.timestep_limit = config.get("ts", 100)

        self.observation_space = MultiDiscrete([self.width * self.height,
                                                self.width * self.height])
        # 0=up, 1=right, 2=down, 3=left.
        self.action_space = Discrete(4)

        # Reset env.
        self.reset()
        
    def reset(self):
        """Returns initial observation of next(!) episode."""
        # Row-major coords.
        self.agent1_pos = [0, 0]  # upper left corner
        self.agent2_pos = [self.height - 1, self.width - 1]  # lower right corner

        # Accumulated rewards in this episode.
        self.agent1_R = 0.0
        self.agent2_R = 0.0

        # Reset agent1's visited fields.
        self.agent1_visited_fields = set([tuple(self.agent1_pos)])

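        # How many timesteps have we done in this episode.
        self.timesteps = 0

        # No collision has happened yet (read by `render()`, which would
        # otherwise fail if called before the first `step()`).
        self.collision = False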

        # Return the initial observation in the new episode.
        return self._get_obs()

    def step(self, action: dict):
        """
        Returns (next observation, rewards, dones, infos) after having taken the given actions.
        
        e.g.
        `action={"agent1": action_for_agent1, "agent2": action_for_agent2}`
        """
        
        # increase our time steps counter by 1.
        self.timesteps += 1
        # An episode is "done" when we reach the time step limit.
        is_done = self.timesteps >= self.timestep_limit

        # Agent2 always moves first.
        # events = [collision|agent1_new_field]
        events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        # Merge in agent1's events (a plain assignment would discard a
        # collision caused by agent2's move).
        events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)

        # Useful for rendering.
        self.collision = "collision" in events
            
        # Get observations (based on new agent positions).
        obs = self._get_obs()

        # Determine rewards based on the collected events:
        r1 = -1.0 if "collision" in events else 1.0 if "agent1_new_field" in events else -0.5
        r2 = 1.0 if "collision" in events else -0.1

        self.agent1_R += r1
        self.agent2_R += r2
        
        rewards = {
            "agent1": r1,
            "agent2": r2,
        }

        # Generate a `done` dict (per-agent and total).
        dones = {
            "agent1": is_done,
            "agent2": is_done,
            # special `__all__` key indicates that the episode is done for all agents.
            "__all__": is_done,
        }

        return obs, rewards, dones, {}  # <- info dict (not needed here).

    def _get_obs(self):
        """
        Returns obs dict (agent name to array of [own discrete pos, other agent's
        discrete pos]) using each agent's current row/column position.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) in place from `coords`
        ([row, col]) using the given action (0=up, 1=right, 2=down, 3=left) and returns
        the resulting set of events:
        {"collision"}: the move was blocked because the target field is occupied by
            the other agent (agent1 then gets -1.0, agent2 gets +1.0).
        {"agent1_new_field"}: agent1 entered a field it had not visited before.
        set(): neither of the above happened.
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"collision"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "agent1_new_field" if a new tile is covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_fields:
            self.agent1_visited_fields.add(tuple(coords))
            return {"agent1_new_field"}
        # No new tile for agent1.
        return set()

    def render(self, mode=None):
        '''
        Prints (displays) the ASCII version of the environment.
        '''
        print("_" * (self.width + 2))
        for r in range(self.height):
            print("|", end="")
            for c in range(self.width):
                if self.agent1_pos == [r, c]:
                    print("1", end="")
                elif self.agent2_pos == [r, c]:
                    print("2", end="")
                elif (r, c) in self.agent1_visited_fields:
                    print(".", end="")
                else:
                    print(" ", end="")
            print("|")
        print("‾" * (self.width + 2))
        print(f"{'!!Collision!!' if self.collision else ''}")
        print("R1={: .1f}".format(self.agent1_R))
        print("R2={: .1f}".format(self.agent2_R))
        print()


env = MultiAgentArena()

obs = env.reset()

# Agent1 will move down, Agent2 moves up.
obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})

env.render()

print("Agent1's x/y position={}".format(env.agent1_pos))
print("Agent2's x/y position={}".format(env.agent2_pos))
print("Env timesteps={}".format(env.timesteps))




____________
|.         |
|1         |
|          |
|          |
|          |
|          |
|          |
|          |
|         2|
|          |
‾‾‾‾‾‾‾‾‾‾‾‾

R1= 1.0
R2=-0.1

Agent1's x/y position=[1, 0]
Agent2's x/y position=[8, 9]
Env timesteps=1
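To make the reward scheme easier to inspect, the rule used inside `step()` can be pulled out into a standalone function. This is a minimal sketch that mirrors the two reward expressions of the environment above; the function name `rewards_from_events` is hypothetical and does not exist in the environment class.

```python
def rewards_from_events(events: set) -> dict:
    """Translate the events of one timestep into per-agent rewards."""
    if "collision" in events:
        r1 = -1.0   # agent1 is penalized for any collision
    elif "agent1_new_field" in events:
        r1 = 1.0    # agent1 is rewarded for exploring a new field
    else:
        r1 = -0.5   # agent1 re-visited a field (or hit a wall)
    # agent2 is rewarded for collisions and pays a small cost per step otherwise.
    r2 = 1.0 if "collision" in events else -0.1
    return {"agent1": r1, "agent2": r2}


print(rewards_from_events({"agent1_new_field"}))  # {'agent1': 1.0, 'agent2': -0.1}
print(rewards_from_events({"collision"}))         # {'agent1': -1.0, 'agent2': 1.0}
print(rewards_from_events(set()))                 # {'agent1': -0.5, 'agent2': -0.1}
```

The first call corresponds to the single step rendered above, where agent1 entered a new field (R1= 1.0, R2=-0.1). The opposing signs on a collision are what make the environment adversarial.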


## Exercise No 1: Environment Rollout using Random Actions

<hr />

<img src="https://drive.google.com/uc?export=view&id=1Ta1s0QOfSCtuK0ZbmviwkI_6GcBWmXzY" width=800>

In the cell above, we performed a `reset()` and a single `step()` call. 

To walk through an entire **episode**, one would normally call `step()` repeatedly (with different actions) until the returned `dones` dict has the "agent1" or "agent2" (or "__all__") key set to True.

Your task is to write an "environment loop" that runs for exactly one episode using our `MultiAgentArena` class.

Follow these steps to get this done:

1. `reset` the already created (variable `env`) environment to get the first (initial) observation.
1. Enter an infinite while loop.
1. Compute the actions for "agent1" and "agent2" calling `DummyTrainer.compute_action([obs])` twice (once for each agent).
1. Put the results of the action computations into an action dict (`{"agent1": ..., "agent2": ...}`).
1. Pass this action dict into the env's `step()` method, just like it's done in the above cell (where we do a single `step()`).
1. Check the returned `dones` dict for True (yes, episode is terminated) and if True, break out of the loop.

**Good luck! :)**


In [None]:
class DummyTrainer:
    """Dummy Trainer class used in Exercise #1.

    Use its `compute_action` method to get a new action for one of the agents,
    given the agent's observation (two discrete values: the field the agent is
    currently in and the field the other agent is in).

    The observation is ignored: the method simply returns a random action.
    """

    def compute_action(self, single_agent_obs=None):
        # Returns a random action for a single agent.
        return np.random.randint(4)  # Discrete(4) -> return rand int between 0 and 3 (incl. 3).

dummy_trainer = DummyTrainer()
# Check, whether it's working.
for _ in range(3):
    # Get action for agent1 (providing agent1's and agent2's positions).
    print("action_agent1={}".format(dummy_trainer.compute_action(np.array([0, 99]))))

    # Get action for agent2 (providing agent2's and agent1's positions).
    print("action_agent2={}".format(dummy_trainer.compute_action(np.array([99, 0]))))

    print()

action_agent1=3
action_agent2=3

action_agent1=1
action_agent2=2

action_agent1=3
action_agent2=3



### Move the agent over the environment by picking random actions

See the agents moving in the environment below!

solution here: https://github.com/sven1977/rllib_tutorials/blob/865d77eacb8cb8025abe372d97c47f70aa1d035b/ray_summit_2021/live_coding/exercise_1.py

In [None]:
# Leave the following as-is. It'll help us with rendering the env in this very cell's output.

import time
from ipywidgets import Output
from IPython import display

out = Output()
display.display(out)

with out:

    # 1) Instantiate and reset the env.
    env = MultiAgentArena()
    obs = env.reset()  # Places the agents at their start positions.
    
    # 2) Enter an infinite while loop (to step through the episode).
    while True:
        # 3) Calculate both agents' actions individually, using
        # dummy_trainer.compute_action([individual agent's obs]).
        # Note: the obs dict maps each agent ID to that agent's observation (see `_get_obs()` in the env).
        a1 = dummy_trainer.compute_action(obs["agent1"])
        a2 = dummy_trainer.compute_action(obs["agent2"])

        # 4) Compile the actions dict from both individual agents' actions.
        actions_dict = {"agent1": a1, "agent2": a2}
        # 5) Send the actions dict to the env's `step()` method to receive the
        # next observations, rewards, dones, and info dicts.
        obs, rewards, dones, _ = env.step(actions_dict)

        # 6) Render the env (clearing the previous frame first).
        out.clear_output(wait=True)
        time.sleep(0.08)
        env.render()

        # 7) Check whether the episode is done; if yes, break out of the while loop.
        if dones["__all__"]:
            break

# 8) Run it! :)

Output()

## STEP2: Training with RLlib's PPO

We will now train an RL agent with RLlib's PPO. PPO is well known in the RL community as one of the most reliable algorithms, and it works on most classes of environments.

There are many different algos in RLlib (over 20!), and you can mix and match whichever algorithm you like to train your RL agent. This is what makes RLlib a versatile library to use!

<img src="https://drive.google.com/uc?export=view&id=11pv431GA0frNFZIRfeSp0mMeJ2coTkPW" width=800>


### Initializing Ray

In [None]:
import numpy as np
import pprint
import ray

# Start a new instance of Ray (when running this tutorial locally) or
# connect to an already running one (when running this tutorial through Anyscale).
# Shutting down first makes this cell safe to re-run.
ray.shutdown()
ray.init(ignore_reinit_error=True)  # NOTE: The returned context shows the dashboard URL (if a dashboard is running).

# In case you encounter the following error during our tutorial: `RuntimeError: Maybe you called ray.init twice by accident?`
# Try: `ray.shutdown() + ray.init()` or `ray.init(ignore_reinit_error=True)`

RayContext(dashboard_url='', python_version='3.7.13', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '172.28.0.2', 'raylet_ip_address': '172.28.0.2', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-08-22_13-53-48_742281_71/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-08-22_13-53-48_742281_71/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2022-08-22_13-53-48_742281_71', 'metrics_export_port': 51415, 'gcs_address': '172.28.0.2:56093', 'address': '172.28.0.2:56093', 'node_id': 'fb268bb01881cd45e995b229de0b43237da86e30268c65a42444910c'})

### Creating an RLlib Trainer (PPOTrainer)

The inputs to the Trainer are the environment and the config dict.

In this case, we pass the env inside the config dict (under the "env" key).

In [None]:
# Import a Trainable (one of RLlib's built-in algorithms):
# We use the PPO algorithm here b/c it's very flexible wrt its supported
# action spaces and model types and b/c it learns well on almost any problem.
from ray.rllib.agents.ppo import PPOTrainer

# The Trainer's input is a config dict.
# The config dict defines the environment and some of the environment's
# options (see environment.py).
config = {
    "env": MultiAgentArena, # "my_env" <- if we previously have registered the env with `tune.register_env("[name]", lambda config: [returns env object])`.
    # RLlib passes `env_config` to the env constructor as its `config` argument,
    # so the keys sit at the top level (not nested under another "config" key).
    "env_config": {
        "width": 10,
        "height": 10,
        "ts": 100,  # timestep limit per episode
    },

    # PyTorch users: set this to "torch" if you installed torch instead of tf.
    "framework": "tf",

    "create_env_on_driver": True,
}
# Instantiate the Trainer object using above config.
rllib_trainer = PPOTrainer(config=config)
rllib_trainer

2022-08-22 13:55:13,936	INFO trainer.py:2333 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-08-22 13:55:13,941	INFO ppo.py:415 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
2022-08-22 13:55:13,942	INFO trainer.py:906 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


PPOTrainer


### Ready to train with RLlib's PPO algorithm

That's it, we are ready to train.
Calling `Trainer.train()` will execute a single "training iteration".

One iteration for most algos involves:

1) sampling from the environment(s) (= Rollout)

2) using the sampled data (observations, actions taken, rewards) to update the policy model (neural network), such that it would pick better actions in the future, leading to higher rewards.
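The two phases of a training iteration can be illustrated with a deliberately tiny, self-contained example. Note that this is not PPO and not RLlib code: it is a toy REINFORCE-style update on a two-armed bandit, written with the standard library only, and the names `train_iteration` and `reward_fn` are made up for this sketch. The only point is the "sample first, then update" structure.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences (logits) into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_iteration(prefs, reward_fn, rng, batch_size=200, lr=0.1):
    """One toy 'training iteration': sample a batch, then update the policy."""
    probs = softmax(prefs)

    # 1) Rollout: sample actions from the current policy and record rewards.
    batch = []
    for _ in range(batch_size):
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        batch.append((action, reward_fn(action)))

    # 2) Update: policy-gradient step, using the batch's mean reward as baseline.
    baseline = sum(reward for _, reward in batch) / batch_size
    grads = [0.0] * len(prefs)
    for action, reward in batch:
        for a in range(len(prefs)):
            indicator = 1.0 if a == action else 0.0
            grads[a] += (reward - baseline) * (indicator - probs[a])
    new_prefs = [p + lr * g / batch_size for p, g in zip(prefs, grads)]
    return new_prefs, baseline


rng = random.Random(0)
prefs = [0.0, 0.0]  # start with a uniform policy over two actions


def reward_fn(action):
    # Hypothetical rewards: action 1 is the better arm.
    return 1.0 if action == 1 else -0.5


for _ in range(50):
    prefs, mean_reward = train_iteration(prefs, reward_fn, rng)

final_probs = softmax(prefs)
print(f"P(action 1) after training: {final_probs[1]:.2f}")
```

After repeated iterations, the policy shifts probability toward the action with the higher reward. RLlib's algorithms follow the same outer pattern, with neural network policies, real environments, and (optionally) parallel rollout workers in place of the toy pieces.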

Let's try it out!


In [None]:
# Runs 1 Iteration of Training
results = rllib_trainer.train()

# Delete the config from the results for clarity.
# Only the stats will remain, then.
del results["config"]
# Pretty print the stats.
pprint.pprint(results)
del rllib_trainer



{'agent_timesteps_total': 4000,
 'counters': {'num_agent_steps_sampled': 4000,
              'num_agent_steps_trained': 4000,
              'num_env_steps_sampled': 4000,
              'num_env_steps_trained': 4000},
 'custom_metrics': {},
 'date': '2022-08-22_13-59-55',
 'done': False,
 'episode_len_mean': 100.0,
 'episode_media': {},
 'episode_reward_max': 11.10000000000002,
 'episode_reward_mean': -6.1949999999999985,
 'episode_reward_min': -36.90000000000007,
 'episodes_this_iter': 20,
 'episodes_total': 20,
 'experiment_id': '0d8d7143d7824bd2a11ebcbf0a71e6a3',
 'hist_stats': {'episode_lengths': [100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                  

### Define directory for checkpoints

Checkpoints save a Trainer's state so that the trained policy can be rolled out after training, or so that training can be resumed later.

In [None]:
import shutil
import os

# Main directory for saving checkpoints.
CHECKPOINT_ROOT = "tmp/ppo/cart"

# Remove checkpoints left over from earlier runs.
shutil.rmtree(CHECKPOINT_ROOT, ignore_errors=True)

# Where some data will be written (it can be viewed with TensorBoard).
# CAUTION: this deletes the results of all earlier Ray runs stored there.
ray_results = os.path.join(os.path.expanduser("~"), "ray_results")
shutil.rmtree(ray_results, ignore_errors=True)



In [None]:
# Alternatively, train for a fixed number of iterations and keep each
# iteration's stats in a list for later inspection.
# See: https://github.com/anyscale/academy/blob/main/ray-rllib/explore-rllib/01-Application-Cart-Pole.ipynb
import json  # needed for `json.dumps` below

N_ITER = 3  # Only 3 iterations, to show the idea.
results = []
episode_data = []
episode_json = []

agent = PPOTrainer(config=config)

for n in range(N_ITER):
    result = agent.train()  # Each call to agent.train() returns a dict of stats that we will inspect below.
    results.append(result)
    
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean':result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean':   result['episode_len_mean']}
    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(CHECKPOINT_ROOT)
    
    print(f'{n:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}. Checkpoint saved to {file_name}')


#### Inspect Training results


In [None]:
# Inspect the results object
results

[]

In [None]:
import pandas as pd
# Convert to df and inspect
df = pd.DataFrame(data=episode_data)
df

In [None]:
#Plot results
df.plot(x="n", y=["episode_reward_mean", "episode_reward_min", "episode_reward_max"], secondary_y=True)

### Print out the policy and model to see the results of training in detail…

In [None]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

## Exercise 2: Training with Multiple Policies

We need a different policy for each agent because they have a different objective (and thus, a different reward scheme).

So far, our experiment has been ill-configured, because both
agents, which should behave differently due to their different
tasks and reward functions, learn the same policy: the "default_policy",
which RLlib always provides if you don't configure anything else.

Remember that RLlib does not know at Trainer setup time, how many and which agents the environment will "produce". Agent control (adding agents, removing them, terminating episodes for agents) is entirely in the Env's hands.
Let's fix our single policy problem and introduce the "multiagent" API.

<img src="https://drive.google.com/uc?export=view&id=1rsRMLN8KyEHKS4XCcjRmUW19kpRjqB8z" width=800>

In order to turn on RLlib's multi-agent functionality, follow these instructions:

1. Define a policies definition dict, mapping policy IDs (e.g. "policy1") to 4-tuples consisting of 1) policy class (None for using the default class), 2) observation space, 3) action space, and 4) config overrides (empty dict for no overrides and using the Trainer's main config dict).
1. Define a policy mapping function, mapping agent IDs (e.g. a string like "agent1", produced by the environment in the returned observation/rewards/dones-dicts) to a policy ID (another string, e.g. "policy1").
1. Pass the policy mapping function and the policies definition dict into the Trainer config.
1. Train!
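A policy mapping function is ordinary Python, so it can be tried out without RLlib. The sketch below uses a different scenario from the exercise: several agents sharing two policies. The agent IDs of the form `agent_<number>` and the helper `make_policy_mapping_fn` are assumptions made for illustration only. The inner function accepts and ignores extra arguments, on the assumption that some RLlib versions pass more than the agent ID to the mapping function (worth checking against your installed version).

```python
def make_policy_mapping_fn(policy_ids):
    """Return a function that spreads agents over `policy_ids` by their numeric suffix."""

    def policy_mapping_fn(agent_id, *args, **kwargs):
        # "agent_3" -> 3; extra positional/keyword arguments are ignored.
        index = int(agent_id.rsplit("_", 1)[1])
        return policy_ids[index % len(policy_ids)]

    return policy_mapping_fn


mapping_fn = make_policy_mapping_fn(["policy1", "policy2"])
agent_ids = [f"agent_{i}" for i in range(5)]
assignment = {agent_id: mapping_fn(agent_id) for agent_id in agent_ids}

for agent_id, policy_id in assignment.items():
    print(f"{agent_id} -> {policy_id}")
```

Here five agents map onto two policies, so agents with even suffixes share `policy1` and agents with odd suffixes share `policy2`. In the two-agent arena of this tutorial, the mapping is simpler: each agent gets its own policy.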

If you get stuck, https://docs.ray.io/en/latest/rllib-env.html#multi-agent-and-hierarchical provides a great example.

**Good luck! :)**

In [None]:
# Run this if necessary
ray.shutdown()
ray.init()

RayContext(dashboard_url='', python_version='3.7.13', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '172.28.0.2', 'raylet_ip_address': '172.28.0.2', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-08-22_14-33-58_707870_71/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-08-22_14-33-58_707870_71/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2022-08-22_14-33-58_707870_71', 'metrics_export_port': 65525, 'gcs_address': '172.28.0.2:50579', 'address': '172.28.0.2:50579', 'node_id': '65eccfa059657088fc7f86a2e96aaf7f6290b7a265df8e530b74ca9f'})

In [None]:
# Exercise 2
# 1) Define the policies definition dict:

# Each policy in there is defined by its ID (key) mapping to a 4-tuple (value):
# - Policy class (`None` for using the "default" class, e.g. PPOTFPolicy for PPO+tf or PPOTorchPolicy for PPO+torch). Its default model is a fully connected network.
# - obs-space (we get this directly from our already created env object).
# - act-space (we get this directly from our already created env object).
# - config-overrides dict (leave empty for using the Trainer's config as-is). Any param of the Trainer's config can be overridden per policy here.

policies = {
    "policy1": (None, env.observation_space, env.action_space, {}),
    # Override the learning rate for policy2 only. Note: 0.0002 is larger than
    # PPO's default `lr` (5e-5), which policy1 keeps, so policy2 takes larger
    # update steps than policy1.
    "policy2": (None, env.observation_space, env.action_space, {"lr": 0.0002}),
}

# Note that now we won't have a "default_policy" anymore, just "policy1" and "policy2".

# 2) Define an agent->policy mapping function.
# The mapping here is M (agents) -> N (policies), where M >= N.
def policy_mapping_fn(agent_id: str) -> str:
    # Make sure agent ID is valid.
    assert agent_id in ["agent1", "agent2"], f"ERROR: invalid agent ID {agent_id}!"

    return 'policy1' if agent_id == 'agent1' else 'policy2'

config = {
    "env": MultiAgentArena,  # "my_env" <- if we previously have registered the env with `tune.register_env("[name]", lambda config: [returns env object])`.
    # RLlib passes `env_config` to the env constructor as its `config` argument,
    # so the keys sit at the top level (not nested under another "config" key).
    "env_config": {
        "width": 10,
        "height": 10,
        "ts": 100,
    },
    # PyTorch users: set this to "torch" if you installed torch instead of tf.
    "framework": "tf",
    "create_env_on_driver": True,
}

# 3) Adding the above to our config.
### Modify Code here ####
config.update({
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
        # We'll leave this empty: Means, we train both policy1 and policy2.
        # "policies_to_train": policies_to_train,
    },
})

pprint.pprint(config)
print()
print(f"agent1 is now mapped to {policy_mapping_fn('agent1')}")
print(f"agent2 is now mapped to {policy_mapping_fn('agent2')}")

#rllib_trainer = PPOTrainer(config=config)

{'create_env_on_driver': True,
 'env': <class '__main__.MultiAgentArena'>,
 'env_config': {'height': 10, 'ts': 100, 'width': 10},
 'framework': 'tf',
 'multiagent': {'policies': {'policy1': (None,
                                         MultiDiscrete([100 100]),
                                         Discrete(4),
                                         {}),
                             'policy2': (None,
                                         MultiDiscrete([100 100]),
                                         Discrete(4),
                                         {'lr': 0.0002})},
                'policy_mapping_fn': <function policy_mapping_fn at 0x7f2a924af830>}}

agent1 is now mapped to policy1
agent2 is now mapped to policy2




In [None]:
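# Recreate our Trainer (we cannot just change the config on-the-fly).
# Stop the old Trainer first, if it still exists (it was deleted with
# `del rllib_trainer` after the single training iteration above).
if "rllib_trainer" in globals():
    rllib_trainer.stop()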

# Using our updated (now multiagent!) config dict.
rllib_trainer = PPOTrainer(config=config)
rllib_trainer



PPOTrainer

Now that we are set up correctly with two policies as per our "multiagent" config, let's call `train()` on the new Trainer several times (say, 10 times).

In [None]:
# 4) Run `train()` n times. Repeatedly call `train()` now to see rewards increase.
# Move on once you see (agent1 + agent2) episode rewards of 10.0 or more.
for _ in range(10):
    results = rllib_trainer.train()
    # Print the mean episode return (summed over both agents) after each iteration.
    print(f"Iteration={rllib_trainer.iteration}: R(\"return\")={results['episode_reward_mean']}")



Iteration=1: R("return")=-13.612500000000006
Iteration=2: R("return")=-8.054999999999996
Iteration=3: R("return")=-4.496999999999989
Iteration=4: R("return")=-0.6929999999999857
Iteration=5: R("return")=0.3600000000000152
Iteration=6: R("return")=1.5330000000000152
Iteration=7: R("return")=1.1910000000000154
Iteration=8: R("return")=1.8120000000000127
Iteration=9: R("return")=2.922000000000009
Iteration=10: R("return")=4.431000000000008


Let's train for another 10 iterations, this time printing each policy's individual mean reward alongside the combined return.

In [None]:
# Do another loop, but this time, we will print out each policy's individual rewards.
for _ in range(10):
    results = rllib_trainer.train()
    r1 = results['policy_reward_mean']['policy1']
    r2 = results['policy_reward_mean']['policy2']
    r = r1 + r2
    print(f"Iteration={rllib_trainer.iteration}: R(\"return\")={r} R1={r1} R2={r2}")

Iteration=11: R("return")=5.541000000000015 R1=13.55 R2=-8.008999999999986
Iteration=12: R("return")=6.414000000000014 R1=14.5 R2=-8.085999999999986
Iteration=13: R("return")=8.337000000000014 R1=16.28 R2=-7.942999999999986
Iteration=14: R("return")=9.126000000000015 R1=16.695 R2=-7.568999999999986
Iteration=15: R("return")=10.893000000000013 R1=18.165 R2=-7.271999999999986
Iteration=16: R("return")=11.394000000000013 R1=18.105 R2=-6.710999999999989
Iteration=17: R("return")=13.155000000000014 R1=19.8 R2=-6.644999999999987
Iteration=18: R("return")=13.287000000000011 R1=20.185 R2=-6.897999999999987
Iteration=19: R("return")=12.699000000000012 R1=20.455 R2=-7.755999999999987
Iteration=20: R("return")=12.816000000000011 R1=19.56 R2=-6.743999999999987


## Evaluating Multiagent PPO Trainer

Now that we are done training with PPO, let's evaluate how the agents behave, using our code in Exercise 1.

In [None]:
out = Output()
display.display(out)

with out:
    env = MultiAgentArena()
    obs = env.reset()
    while True:
        # `compute_single_action` replaces the deprecated `compute_action`.
        a1 = rllib_trainer.compute_single_action(obs["agent1"], policy_id="policy1")
        a2 = rllib_trainer.compute_single_action(obs["agent2"], policy_id="policy2")
        obs, rewards, dones, _ = env.step({"agent1": a1, "agent2": a2})
        out.clear_output(wait=True)
        time.sleep(0.08)
        env.render()
        if dones["agent1"]:
            break

Output()

#### !OPTIONAL HACK!

Feel free to play around with the following code in order to learn how RLlib, under the hood, calculates actions from the environment's observations using the Policies (and their models) inside our Trainer object:

In [None]:
# Let's actually "look inside" our Trainer to see what's in there.
from ray.rllib.utils.numpy import softmax

# To get to one of the policies inside the Trainer, use `Trainer.get_policy([policy ID])`:
policy = rllib_trainer.get_policy("policy1")
print(f"Our (only!) Policy right now is: {policy}")

# To get to the model inside any policy, do:
model = policy.model
#print(f"Our Policy's model is: {model}")

# Print out the policy's action and observation spaces.
print(f"Our Policy's observation space is: {policy.observation_space}")
print(f"Our Policy's action space is: {policy.action_space}")

# Produce a random observation (B=1; batch of size 1).
obs = np.array([policy.observation_space.sample()])
# Alternatively for PyTorch:
#import torch
#obs = torch.from_numpy(obs)

# Get the action logits (as tf tensor).
# If you are using torch, you would get a torch tensor here.
logits, _ = model({"obs": obs})
logits

# Numpyize the tensor by running `logits` through the Policy's own tf.Session.
logits_np = policy.get_session().run(logits)
# For torch, you can simply do: `logits_np = logits.detach().cpu().numpy()`.

# Convert logits into action probabilities and remove the B=1.
action_probs = np.squeeze(softmax(logits_np))

# Sample an action, using the probabilities.
action = np.random.choice([0, 1, 2, 3], p=action_probs)

# Print out the action.
print(f"sampled action={action}")

Our (only!) Policy right now is: PPOTFPolicy
Our Policy's observation space is: Box([-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.
 -1. -1.], [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1
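
The last two steps of the cell above (logits to probabilities, then sampling) do not depend on RLlib or a deep learning framework. Here is a minimal sketch in plain Python, assuming four discrete actions as in the cell above. The logit values are made up for illustration, and this `softmax` is a local helper, not the `ray.rllib.utils.numpy.softmax` imported above.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Made-up logits for 4 discrete actions (illustrative values only).
logits = [2.0, 0.5, -1.0, 0.0]
action_probs = softmax(logits)

# Sample one action index according to the probabilities.
rng = random.Random(42)
action = rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]
print(f"probs={[round(p, 3) for p in action_probs]} sampled action={action}")
```

Sampling instead of taking the argmax is what keeps the policy stochastic, so lower-probability actions still get picked occasionally.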

### Saving and restoring a trained Trainer.
Currently, `rllib_trainer` is in an already trained state.
It holds optimized weights in its policies' models that allow it to act
somewhat smartly in our environment when given an observation.

However, if we closed this notebook right now, all the effort would have been for nothing.
Let's therefore save the state of our trainer to disk for later!

In [None]:
# We use the `Trainer.save()` method to create a checkpoint.
checkpoint_file = rllib_trainer.save()
print(f"Trainer (at iteration {rllib_trainer.iteration}) was saved in '{checkpoint_file}'!")

# Here is what a checkpoint directory contains:
print("The checkpoint directory contains the following files:")
import os
os.listdir(os.path.dirname(checkpoint_file))

Trainer (at iteration 0 was saved in '/root/ray_results/PPOTrainer_MultiAgentArena_2022-08-22_15-02-243up9g08g/checkpoint_000000/checkpoint-0'!
The checkpoint directory contains the following files:


['.is_checkpoint', 'checkpoint-0.tune_metadata', 'checkpoint-0']

### Restoring and evaluating a Trainer
In the following cell, we'll learn how to restore a saved Trainer from a checkpoint file.

We'll also evaluate a completely new Trainer (should act more or less randomly) vs an already trained one (the one we just restored from the created checkpoint file).

In [None]:
# Pretend we wanted to pick up training from a previous run:
new_trainer = PPOTrainer(config=config)
# Evaluate the new trainer (this should yield random results).
results = new_trainer.evaluate()
print(f"Evaluating new trainer: R={results['evaluation']['episode_reward_mean']}")

# Restoring the trained state into the `new_trainer` object.
print(f"Before restoring: Trainer is at iteration={new_trainer.iteration}")
new_trainer.restore(checkpoint_file)
print(f"After restoring: Trainer is at iteration={new_trainer.iteration}")

# Evaluate again (this should yield results we saw after having trained our saved agent).
results = new_trainer.evaluate()
print(f"Evaluating restored trainer: R={results['evaluation']['episode_reward_mean']}")





2022-08-22 15:05:55,272	INFO trainable.py:589 -- Restored on 172.28.0.2 from checkpoint: /root/ray_results/PPOTrainer_MultiAgentArena_2022-08-22_15-02-243up9g08g/checkpoint_000000/checkpoint-0
2022-08-22 15:05:55,274	INFO trainable.py:597 -- Current state after restoring: {'_iteration': 0, '_timesteps_total': None, '_time_total': 0.0, '_episodes_total': None}


Evaluating new trainer: R=-9.855000000000008
Before restoring: Trainer is at iteration=0
After restoring: Trainer is at iteration=0
Evaluating restored trainer: R=-8.909999999999998
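
The save, create, restore sequence can be tried in isolation with a toy object. The sketch below is only an analogy and does not reproduce RLlib's checkpoint format: `ToyTrainer`, its `weights` layout, and the use of `pickle` are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


class ToyTrainer:
    """Hypothetical stand-in for a Trainer: an iteration counter plus weights."""

    def __init__(self):
        self.iteration = 0
        self.weights = {"policy1": [0.0, 0.0], "policy2": [0.0, 0.0]}

    def train(self):
        # Pretend to learn: nudge every weight and count the iteration.
        self.iteration += 1
        for w in self.weights.values():
            for i in range(len(w)):
                w[i] += 0.1

    def save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, f"checkpoint-{self.iteration}")
        with open(path, "wb") as f:
            pickle.dump({"iteration": self.iteration, "weights": self.weights}, f)
        return path

    def restore(self, path):
        with open(path, "rb") as f:
            state = pickle.load(f)
        self.iteration = state["iteration"]
        self.weights = state["weights"]


with tempfile.TemporaryDirectory() as tmp_dir:
    trained = ToyTrainer()
    for _ in range(3):
        trained.train()
    checkpoint_path = trained.save(tmp_dir)

    # A brand-new object starts from scratch until the checkpoint is loaded.
    fresh = ToyTrainer()
    before = fresh.iteration
    fresh.restore(checkpoint_path)
    after = fresh.iteration

print(f"Before restoring: iteration={before}; after restoring: iteration={after}")
```

The point to take away is that both the training progress (the iteration counter) and the learned weights travel with the checkpoint, so a restored object can continue where the saved one left off.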


In order to release all resources from a Trainer, you can use a Trainer's `stop()` method.
You should definitely run this cell, as it frees resources that we'll need later in this tutorial when we do parallel hyperparameter sweeps.

In [None]:
rllib_trainer.stop()
new_trainer.stop()

[2m[36m(RolloutWorker pid=2073)[0m E0822 15:06:15.470895777    2112 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
[2m[36m(RolloutWorker pid=2072)[0m E0822 15:06:15.471469311    2096 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
[2m[36m(pid=2238)[0m Instructions for updating:
[2m[36m(pid=2238)[0m experimental_relax_shapes is deprecated, use reduce_retracing instead
[2m[36m(pid=2239)[0m Instructions for updating:
[2m[36m(pid=2239)[0m experimental_relax_shapes is deprecated, use reduce_retracing instead


### Moving stuff to the professional level: RLlib in connection w/ Ray Tune

Running any experiments through Ray Tune is the recommended way of doing things with RLlib. If you look at our
<a href="https://github.com/ray-project/ray/tree/master/rllib/examples">example scripts folder</a>, you will see that almost all of the scripts use Ray Tune to run the particular RLlib workload demonstrated in each script.


When setting up hyperparameter sweeps for Tune, we'll do this in our already familiar config dict.

So let's take a quick look at our PPO algo's default config to understand which hyperparameters we may want to play around with:

In [None]:
# Configuration dicts and Ray Tune.
# Where are the default configuration dicts stored?

# PPO algorithm:
from ray.rllib.agents.ppo import DEFAULT_CONFIG as PPO_DEFAULT_CONFIG
print(f"PPO's default config is:")
pprint.pprint(PPO_DEFAULT_CONFIG)

# DQN algorithm:
#from ray.rllib.agents.dqn import DEFAULT_CONFIG as DQN_DEFAULT_CONFIG
#print(f"DQN's default config is:")
#pprint.pprint(DQN_DEFAULT_CONFIG)

# Common (all algorithms).
#from ray.rllib.agents.trainer import COMMON_CONFIG
#print(f"RLlib Trainer's default config is:")
#pprint.pprint(COMMON_CONFIG)

PPO's default config is:
{'_disable_action_flattening': False,
 '_disable_execution_plan_api': True,
 '_disable_preprocessor_api': False,
 '_fake_gpus': False,
 '_tf_policy_handles_more_than_one_loss': False,
 'action_space': None,
 'actions_in_input_normalized': False,
 'always_attach_evaluation_results': False,
 'batch_mode': 'truncate_episodes',
 'callbacks': <class 'ray.rllib.agents.callbacks.DefaultCallbacks'>,
 'clip_actions': False,
 'clip_param': 0.3,
 'clip_rewards': None,
 'collect_metrics_timeout': -1,
 'compress_observations': False,
 'create_env_on_driver': False,
 'custom_eval_function': None,
 'custom_resources_per_worker': {},
 'disable_env_checking': False,
 'eager_max_retraces': 20,
 'eager_tracing': False,
 'entropy_coeff': 0.0,
 'entropy_coeff_schedule': None,
 'env': None,
 'env_config': {},
 'env_task_fn': None,
 'evaluation_config': {},
 'evaluation_duration': 10,
 'evaluation_duration_unit': 'episodes',
 'evaluation_interval': None,
 'evaluation_num_episodes': -
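
A user-supplied config only needs to list the keys that differ from these defaults. The sketch below illustrates that idea with a small recursive merge. It is not RLlib's own merging code: `deep_merge` is a hypothetical helper, `toy_defaults` holds just a few entries copied from the printout above, and the `env_config` override (`width`) is made up.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a new dict: `defaults` updated recursively with `overrides`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested dicts key by key instead of replacing them wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A few entries copied from the printed PPO defaults above.
toy_defaults = {
    "batch_mode": "truncate_episodes",
    "clip_param": 0.3,
    "entropy_coeff": 0.0,
    "env_config": {},
}

# Hypothetical user overrides (the `env_config` key/value is made up).
user_config = {"clip_param": 0.2, "env_config": {"width": 10}}

final_config = deep_merge(toy_defaults, user_config)
print(final_config)
```

Keys that the user does not mention, such as `batch_mode` here, keep their default values, and the defaults dict itself is left unmodified.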

### Let's do a very simple grid-search over two learning rates with tune.run().

In particular, we will try the learning rates 0.0001 and 0.5, each combined with the train batch sizes 3000 and 4000, using `tune.grid_search([...])`
inside our config dict:

In [None]:
# Plugging in Ray Tune.
# Note that this is the recommended way to run any experiments with RLlib.
# Reasons:
# - Tune allows you to do hyperparameter tuning in a user-friendly way
#   and at large scale!
# - Tune automatically allocates needed resources for the different
#   hyperparam trials and experiment runs on a cluster.

from ray import tune

# Running stuff with tune, we can re-use the exact
# same config that we used when working with RLlib directly!
tune_config = config.copy()

# Let's add our first hyperparameter search via our config.
# How about we try two different learning rates? Let's say 0.0001 and 0.5 (ouch!).
tune_config["lr"] = tune.grid_search([0.0001, 0.5])  # <- 0.5? again: ouch!
# Also sweep over two train batch sizes; together this yields 2 x 2 = 4 trials.
tune_config["train_batch_size"] = tune.grid_search([3000, 4000])

# Now that we will run things "automatically" through tune, we have to
# define one or more stopping criteria.
# Tune will stop the run, once any single one of the criteria is matched (not all of them!).
stop = {
    # Note that the keys used here can be anything present in the above `rllib_trainer.train()` output dict.
    "training_iteration": 5,
    "episode_reward_mean": 20.0,
}

# "PPO" is a registered name that points to RLlib's PPOTrainer.
# See `ray/rllib/agents/registry.py`

# Run a simple experiment until one of the stopping criteria is met.
tune.run(
    "PPO",
    config=tune_config,
    stop=stop,

    # Note that no trainers will be returned from this call here.
    # Tune will create n Trainers internally, run them in parallel and destroy them at the end.
    # However, you can ...
    checkpoint_at_end=True,  # ... create a checkpoint when done.
    checkpoint_freq=10,  # ... create a checkpoint every 10 training iterations.
)

| Trial name | status | loc | lr | train_batch_size |
|---|---|---|---|---|
| PPO_MultiAgentArena_6a368_00000 | PENDING | | 0.0001 | 3000 |
| PPO_MultiAgentArena_6a368_00001 | PENDING | | 0.5 | 3000 |
| PPO_MultiAgentArena_6a368_00002 | PENDING | | 0.0001 | 4000 |
| PPO_MultiAgentArena_6a368_00003 | PENDING | | 0.5 | 4000 |
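
Two ideas from the cell above can be checked in plain Python: why two grid searches with two values each produce four trials, and what "stop once any single criterion is matched" means. This is a simplified sketch, not Tune's implementation: `should_stop` is a hypothetical helper, the order in which Tune enumerates trials is not modeled, and the `early` and `last` result dicts are made up.

```python
import itertools

# Values taken from the `tune.grid_search([...])` calls above.
search_space = {
    "lr": [0.0001, 0.5],
    "train_batch_size": [3000, 4000],
}

# Cartesian product of all grid values: one config per trial.
keys = list(search_space)
trial_configs = [
    dict(zip(keys, combo))
    for combo in itertools.product(*(search_space[k] for k in keys))
]


def should_stop(result, stop):
    """True as soon as ANY stop metric present in `result` reaches its threshold."""
    return any(
        key in result and result[key] >= threshold
        for key, threshold in stop.items()
    )


stop = {"training_iteration": 5, "episode_reward_mean": 20.0}

# Hand-written result dicts (illustrative values, not real training output).
early = {"training_iteration": 2, "episode_reward_mean": 7.5}
last = {"training_iteration": 5, "episode_reward_mean": 12.0}

print(f"{len(trial_configs)} trials:")
for cfg in trial_configs:
    print(cfg)
print(should_stop(early, stop), should_stop(last, stop))
```

Because the criteria are combined with "any", a trial that never reaches a mean reward of 20.0 still ends after 5 training iterations.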


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000




Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000




Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000




Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000




Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000




Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000




Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6a368_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_6a368_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_6a368_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_6a368_00003,PENDING,,0.5,4000



### Another example of using Ray Tune for the parameters
Now we will use DQN and the Ray Tune runner to train the algorithm.

https://www.codeproject.com/Articles/5271939/Cartpole-The-Hello-World-of-Reinforcement-Learning

In [None]:
import ray
from ray import tune
from ray.rllib.agents.dqn import DQNTrainer
from ray.tune import CLIReporter
from ray.tune.progress_reporter import JupyterNotebookReporter

# Restart Ray so this example begins from a clean state.
ray.shutdown()
ray.init(
    ignore_reinit_error=True
)

ENV = 'CartPole-v0'
TARGET_REWARD = 195  # training stops once this mean episode reward is reached
TRAINER = DQNTrainer

# TRAINING PARAMETERS
# Stopping criteria: Tune stops the trial as soon as ANY one of these is reached.
stop_dict = {"training_iteration": 3,
             "timesteps_total": 5,
             "episode_reward_mean": TARGET_REWARD  # stop as soon as we "solve" the environment
             }

# Parameters for the trainer function - if we use PPO, we can add the Net layers here as above
config_dict = { "env": ENV,
                "num_workers": 0,  # run in a single process
                "num_gpus": 0
                }

# Runner
analysis =  tune.run(
              TRAINER,
              stop  = stop_dict,
              config= config_dict,
              progress_reporter=JupyterNotebookReporter(overwrite=False),
              verbose=2 #can be changed
          )
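
The stop dict passed to `tune.run()` ends a trial as soon as any one of its entries is reached or exceeded. The sketch below mimics that check in plain Python. `should_stop` is a hypothetical helper written for illustration (it is not part of Ray Tune), and the result values are made up.

```python
def should_stop(result, stop):
    """Return True if any stopping criterion is reached (hypothetical helper)."""
    return any(
        result.get(key, float("-inf")) >= threshold
        for key, threshold in stop.items()
    )


stop_dict = {
    "training_iteration": 3,
    "timesteps_total": 5,
    "episode_reward_mean": 195,
}

# Made-up result dict, roughly what a first training iteration might report.
first_result = {
    "training_iteration": 1,
    "timesteps_total": 1000,
    "episode_reward_mean": 22.0,
}

print(should_stop(first_result, stop_dict))  # True: timesteps_total already exceeds 5
```

Under this reading, a very small `timesteps_total` threshold would likely end the trial after its first iteration, so raise or drop that entry if you want longer training.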

Analyse the training results

In [None]:
df = analysis.dataframe()
df

### Why did we use 6 CPUs in the tune run above (3 CPUs per trial)?

PPO - by default - uses 2 "rollout" workers (`num_workers=2`). These are Ray Actors that have their own environment copy(ies) and step through those in parallel. On top of these two "rollout" workers, every Trainer in RLlib always also has a "local" worker, which - in case of PPO - handles the learning updates. This gives us 3 workers (2 rollout + 1 local learner), which require 3 CPUs.
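
The CPU arithmetic above can be written down as a small sketch. It assumes that each rollout worker and the local worker reserve one CPU by default (RLlib's `num_cpus_per_worker` and `num_cpus_for_driver` settings); `cpus_per_trial` is a hypothetical helper, not an RLlib function.

```python
def cpus_per_trial(num_workers, num_cpus_per_worker=1, num_cpus_for_driver=1):
    """CPUs one trial asks for: local worker plus all rollout workers (illustrative)."""
    return num_cpus_for_driver + num_workers * num_cpus_per_worker


ppo_default = cpus_per_trial(num_workers=2)
print(ppo_default)                    # 3 CPUs for one PPO trial with default settings
print(2 * ppo_default)                # 6 CPUs when two such trials run at the same time
print(cpus_per_trial(num_workers=5))  # 6 CPUs for a single trial with 5 rollout workers
```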

## Environment Parallelization

<hr />

Using the `tune_config` that we have built so far, let's run another `tune.run()`, but apply the following changes to our setup this time:
- Set up only 1 learning rate under the "lr" config key. Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set up only 1 train batch size under the "train_batch_size" config key. Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set `num_workers` to 5, which will allow us to run more environment "rollouts" in parallel and to collect training batches more quickly.
- Set the `num_envs_per_worker` config parameter to 5. This clones our env 5 times on each rollout worker, so that the action-computing forward passes through our neural networks are batched across those copies.

Other than that, use the exact same args as in our `tune.run()` call in the previous cell.
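
As a rough sketch of what these changes amount to, the snippet below builds the updated config from a stand-in base dict and counts the environment copies that are stepped at the same time. The helper names and the base dict are hypothetical, and the `lr` and `train_batch_size` values are placeholders: use whichever values your previous run favored.

```python
def parallel_sampling_config(base_config, lr, train_batch_size,
                             num_workers=5, num_envs_per_worker=5):
    """Return a copy of base_config with single values and more parallelism."""
    config = dict(base_config)  # leave the original dict untouched
    config.update({
        "lr": lr,
        "train_batch_size": train_batch_size,
        "num_workers": num_workers,
        "num_envs_per_worker": num_envs_per_worker,
    })
    return config


def total_env_copies(config):
    """Number of env copies across all rollout workers."""
    return config["num_workers"] * config["num_envs_per_worker"]


# Stand-in for the tune_config built earlier (contents are placeholders).
base = {"env": "MultiAgentArena", "lr": None, "train_batch_size": None}
config = parallel_sampling_config(base, lr=0.0001, train_batch_size=4000)
print(total_env_copies(config))  # 25 env copies: 5 workers x 5 envs each
```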



In [None]:
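#### This might not be needed; check the resource utilization in the dashboard ####

# Initialize Ray and declare the resources available to it.
# Note: the run below uses 5 rollout workers plus 1 local worker (about 6 CPUs),
# so num_cpus may need to be raised if Ray is actually (re)initialized here.
ray.init(num_cpus = 1,
         num_gpus = 0,
         ignore_reinit_error = True)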

In [None]:
# Run for longer this time (up to 180 iterations) and try to reach a reward of 60.0 (sum of both agents).
stop = {
    "training_iteration": 180,  # we have the 15min break now to run this many iterations
    "episode_reward_mean": 60.0,  # sum of both agents' rewards. Probably won't reach it, but we should try nevertheless :)
}

# tune_config.update({
# ???
# })



tune_config["lr"] = 0.0001
tune_config["train_batch_size"] = 4000
tune_config["num_envs_per_worker"] = 5
tune_config["num_workers"] = 5

analysis = tune.run("PPO", config=tune_config, stop=stop, checkpoint_at_end=True, checkpoint_freq=5)

### Additional parameters we can pass to the config dict

If training is too slow, you may need to modify the config above to use fewer hidden units, a larger `sgd_minibatch_size`, a smaller `num_sgd_iter`, or a larger `num_workers`.

```
num_sgd_iter - the number of SGD (stochastic gradient descent) epochs, i.e., passes over the collected train batch, used to optimize the PPO surrogate objective in each PPO iteration. Within each epoch, the train batch is split into minibatches ("chunks"); using minibatches is more efficient than training on one record at a time.

sgd_minibatch_size - the number of records in each SGD minibatch used to optimize the PPO surrogate objective.

num_cpus_per_worker - when set to 0, Ray does not reserve a CPU core for each worker. This avoids running out of CPU resources in a constrained environment like a laptop or a cloud VM.
```
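
To get a feel for how these settings trade off, the sketch below estimates the number of minibatch gradient updates per PPO iteration. It is an approximation under the assumption that every epoch walks over the whole train batch once in chunks of `sgd_minibatch_size`; RLlib's exact minibatching may differ slightly. `sgd_updates_per_iteration` is a hypothetical helper.

```python
import math


def sgd_updates_per_iteration(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Approximate number of minibatch updates in one PPO training iteration."""
    minibatches_per_epoch = math.ceil(train_batch_size / sgd_minibatch_size)
    return minibatches_per_epoch * num_sgd_iter


# 4000 records, minibatches of 250, 10 epochs: 16 minibatches x 10 epochs.
print(sgd_updates_per_iteration(4000, 250, 10))  # 160

# Larger minibatches and fewer epochs mean far fewer updates per iteration.
print(sgd_updates_per_iteration(4000, 500, 5))   # 40
```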

In [None]:
import ray.rllib.agents.ppo as ppo

config = ppo.DEFAULT_CONFIG.copy()              # PPO's default configuration.
config["log_level"] = "WARN"                    # Suppress too many messages, but try "INFO" to see what can be printed.

# Other settings we might adjust:
config["num_workers"] = 1                       # Use > 1 for using more CPU cores, including over a cluster
config["num_sgd_iter"] = 10                     # Number of SGD (stochastic gradient descent) epochs per train batch.
                                                # I.e., do this many passes over the collected data in each training iteration.
config["sgd_minibatch_size"] = 250              # The number of data records per minibatch
config["model"]["fcnet_hiddens"] = [100, 50]    # Neural network with two hidden layers; the list contains the number of units in each layer
config["num_cpus_per_worker"] = 0               # This avoids running out of resources in the notebook environment when this cell is re-executed

In [None]:
import json

# N_ITER, SELECT_ENV and CHECKPOINT_ROOT are assumed to be defined in an earlier cell.
agent = ppo.PPOTrainer(config, env=SELECT_ENV)

results = []
episode_data = []
episode_json = []

for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}
    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(CHECKPOINT_ROOT)
    
    print(f'{n:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}. Checkpoint saved to {file_name}')

## How do we extract any checkpoint from a trial of a tune.run?

In [None]:
# Restore from a file

# Bring the model config
trained_config = config.copy()

# Load trained model
test_agent = ppo.PPOTrainer(trained_config, SELECT_ENV) #initialize object
test_agent.restore(file_name)  #above we have defined: file_name = agent.save(CHECKPOINT_ROOT)

In [None]:
# The previous tune.run (the one we did before the exercise) returned an Analysis object, from which we can access any checkpoint
# (given we set checkpoint_freq or checkpoint_at_end to reasonable values) like so:
print(analysis)
# Get all trials (we only have one).
trials = analysis.trials
# Assuming the first trial was the best, extract that trial's best checkpoint:
best_checkpoint = analysis.get_best_checkpoint(trial=trials[0], metric="episode_reward_mean", mode="max")
print(f"Found best checkpoint for trial #0: {best_checkpoint}")

# Undo the grid-search config, which RLlib doesn't understand.
rllib_config = tune_config.copy()
rllib_config["lr"] = 0.00005
rllib_config["train_batch_size"] = 4000

# Restore a RLlib Trainer from the checkpoint.
new_trainer = PPOTrainer(config=rllib_config)
new_trainer.restore(best_checkpoint)
new_trainer
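
Conceptually, asking for the best checkpoint with `metric="episode_reward_mean"` and `mode="max"` means picking the checkpoint whose recorded metric is highest. The following plain-Python sketch illustrates that idea only; it is not Tune's implementation, `pick_best_checkpoint` is a hypothetical helper, and the checkpoint names and reward values are made up.

```python
def pick_best_checkpoint(checkpoints, mode="max"):
    """Pick the path with the best metric from (path, metric_value) pairs."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    chooser = max if mode == "max" else min
    best_path, _ = chooser(checkpoints, key=lambda item: item[1])
    return best_path


# Made-up checkpoints with the mean episode reward recorded at save time.
made_up = [
    ("checkpoint_000005", 12.5),
    ("checkpoint_000010", 31.0),
    ("checkpoint_000015", 27.4),
]

print(pick_best_checkpoint(made_up))              # checkpoint_000010
print(pick_best_checkpoint(made_up, mode="min"))  # checkpoint_000005
```

Note that the last checkpoint is not necessarily the best one, which is why a reasonable `checkpoint_freq` matters.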

In [None]:
out = Output()
display.display(out)

with out:
    obs = env.reset()
    while True:
        a1 = new_trainer.compute_action(obs["agent1"], policy_id="policy1")
        a2 = new_trainer.compute_action(obs["agent2"], policy_id="policy2")
        actions = {"agent1": a1, "agent2": a2}
        obs, rewards, dones, _ = env.step(actions)

        out.clear_output(wait=True)
        env.render()
        time.sleep(0.07)

        if dones["agent1"] is True:
            break

## Let's talk about customization options

### Deep Dive: How do we customize RLlib's RL loop?

RLlib offers a callbacks API that allows you to add custom behavior to
all major events during the environment sampling and learning process.

**Our problem:** So far, we can only see standard stats, such as rewards and episode lengths.
These do not always give us enough insight into important questions, such as: How many times
have both agents collided? How many times has agent1 discovered a new field?

In the following cell, we will create custom callback "hooks" that add these stats to the
returned metrics dict, so that they are also displayed in TensorBoard.

For that, we will subclass RLlib's `DefaultCallbacks` class and override its
`on_episode_start`, `on_episode_step`, and `on_episode_end` methods:


In [None]:
# Override the DefaultCallbacks with your own and implement any methods (hooks)
# that you need.
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.evaluation.episode import MultiAgentEpisode


class MyCallbacks(DefaultCallbacks):
    def on_episode_start(self,
                         *,
                         worker,
                         base_env,
                         policies,
                         episode: MultiAgentEpisode,
                         env_index,
                         **kwargs):
        # We will use the `MultiAgentEpisode` object being passed into
        # all episode-related callbacks. It comes with a `user_data` property (dict),
        # into which we can write arbitrary data.

        # At the end of an episode, we'll transfer that data into the `custom_metrics`
        # property to make sure our custom data is displayed in TensorBoard.

        # The episode is starting:
        # Set per-episode counter for the new fields discovered by agent1.
        episode.user_data["new_fields_discovered"] = 0
        # Set per-episode collision counter (how many times have both agents collided?).
        episode.user_data["num_collisions"] = 0

    def on_episode_step(self,
                        *,
                        worker,
                        base_env,
                        episode: MultiAgentEpisode,
                        env_index,
                        **kwargs):
        # Get both rewards.
        ag1_r = episode.prev_reward_for("agent1")
        ag2_r = episode.prev_reward_for("agent2")

        # Agent1 discovered a new field.
        if ag1_r == 1.0:
            episode.user_data["new_fields_discovered"] += 1
        # Collision.
        elif ag2_r == 1.0:
            episode.user_data["num_collisions"] += 1

    def on_episode_end(self,
                       *,
                       worker,
                       base_env,
                       policies,
                       episode: MultiAgentEpisode,
                       env_index,
                       **kwargs):
        # Episode is done:
        # Write the scalar counters collected during the episode to `custom_metrics`.
        # They will then be visible in TensorBoard.
        episode.custom_metrics["new_fields_discovered"] = episode.user_data["new_fields_discovered"]
        episode.custom_metrics["num_collisions"] = episode.user_data["num_collisions"]
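
Each episode writes one value per custom metric. In the training results, RLlib typically reports these per-episode values in aggregated form, with `_mean`, `_min` and `_max` suffixes added to each key. The sketch below imitates that aggregation in plain Python to show what to look for in the results; `summarize_custom_metrics` is a hypothetical stand-in, and the episode numbers are made up.

```python
from statistics import mean


def summarize_custom_metrics(per_episode_metrics):
    """Aggregate per-episode custom metrics into mean/min/max entries."""
    collected = {}
    for metrics in per_episode_metrics:
        for key, value in metrics.items():
            collected.setdefault(key, []).append(value)

    summary = {}
    for key, values in collected.items():
        summary[f"{key}_mean"] = mean(values)
        summary[f"{key}_min"] = min(values)
        summary[f"{key}_max"] = max(values)
    return summary


# Made-up custom metrics from two finished episodes.
episodes = [
    {"new_fields_discovered": 12, "num_collisions": 3},
    {"new_fields_discovered": 18, "num_collisions": 1},
]

summary = summarize_custom_metrics(episodes)
for key in sorted(summary):
    print(key, summary[key])
```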


# Solution Exercise #3

In [None]:


import ray
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray import tune


class MyCallback(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env,
                         policies, episode,
                         env_index, **kwargs):
        # Set per-episode object to capture, which states (observations)
        # have been visited by agent1.
        episode.user_data["ground_covered"] = set()
        # Set per-episode agent2-blocks counter (how many times has agent2 blocked agent1?).
        episode.user_data["num_blocks"] = 0

    def on_episode_step(self, *, worker, base_env,
                        episode, env_index, **kwargs):
        # Add agent1's observation to our set of unique observations.
        ag1_obs = episode.last_raw_obs_for("agent1")
        episode.user_data["ground_covered"].add(ag1_obs)
        # If agent2's reward > 0.0, it means she has blocked agent1.
        ag2_r = episode.prev_reward_for("agent2")
        if ag2_r > 0.0:
            episode.user_data["num_blocks"] += 1

    def on_episode_end(self, *, worker, base_env,
                       policies, episode,
                       env_index, **kwargs):
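        # Report the collected stats as custom metrics, so they show up in TensorBoard.
        episode.custom_metrics["ground_covered"] = len(episode.user_data["ground_covered"])
        episode.custom_metrics["num_blocks"] = episode.user_data["num_blocks"]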



ray.init(ignore_reinit_error=True)  # don't fail if Ray is already running in this notebook

stop = {"training_iteration": 10}
# Specify env and custom callbacks in our config (leave everything else
# as-is (defaults)).
config = {
    "env": MultiAgentArena,
    "callbacks": MyCallback,
}

# Run for a few iterations.
tune.run("PPO", stop=stop, config=config)

# Check tensorboard.

In [None]:
# Setting up our config to point to our new custom callbacks class:
config = {
    "env": MultiAgentArena,
    "callbacks": MyCallbacks,  # by default, this would point to `rllib.agents.callbacks.DefaultCallbacks`, which does nothing.
    "num_workers": 5,  # we know now: this speeds up things!
}

tune.run(
    "PPO",
    config=config,
    stop={"training_iteration": 20},
    checkpoint_at_end=True,
    # If you'd like to restore the tune run from an existing checkpoint file, you can do the following:
    #restore="/Users/sven/ray_results/PPO/PPO_MultiAgentArena_fd451_00000_0_2021-05-25_15-13-26/checkpoint_000010/checkpoint-10",
)

### Let's check TensorBoard for the new custom metrics!

1. Head over to the Anyscale project view and click on the "TensorBoard" button

See images here:

https://github.com/sven1977/rllib_tutorials/blob/865d77eacb8cb8025abe372d97c47f70aa1d035b/ray_summit_2021/tutorial_notebook.ipynb

Alternatively - if you ran this locally on your own machine:

1. Head over to ~/ray_results/PPO/PPO_MultiAgentArena_[some key]_00000_0_[date]_[time]/
1. In that directory, you should see an `event.out....` file.
1. Run `tensorboard --logdir .` and head to http://localhost:6006



### Deep Dive: Writing custom Models in tf or torch.

In [None]:
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tf_version = try_import_tf()
torch, nn = try_import_torch()


# Custom Neural Network Models.
class MyKerasModel(TFModelV2):
    """Custom model for policy gradient algorithms."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        """Build a simple MLP with one 16-unit hidden layer (+ value branch)."""
        super(MyKerasModel, self).__init__(obs_space, action_space,
                                           num_outputs, model_config, name)
        
        # Keras Input layer.
        self.inputs = tf.keras.layers.Input(
            shape=obs_space.shape, name="observations")

        # Hidden layer (shared by action logits outputs and value output).
        layer_1 = tf.keras.layers.Dense(
            16,
            name="layer1",
            activation=tf.nn.relu)(self.inputs)
        
        # Action logits output.
        logits = tf.keras.layers.Dense(
            num_outputs,
            name="out",
            activation=None)(layer_1)

        # "Value"-branch (single node output).
        # Used by several RLlib algorithms (e.g. PPO) to calculate an observation's value.
        value_out = tf.keras.layers.Dense(
            1,
            name="value",
            activation=None)(layer_1)

        # The actual Keras model:
        self.base_model = tf.keras.Model(self.inputs,
                                         [logits, value_out])
        # Set by `forward()`; read back in `value_function()`.
        self.cur_value = None

    def forward(self, input_dict, state, seq_lens):
        """Custom-define your forward pass logic here."""
        # Pass inputs through our 2 layers and calculate the "value"
        # of the observation and store it for when `value_function` is called.
        logits, self.cur_value = self.base_model(input_dict["obs"])
        return logits, state

    def value_function(self):
        """Implement the value branch forward pass logic here:
        
        We will just return the already calculated `self.cur_value`.
        """
        assert self.cur_value is not None, "Must call `forward()` first!"
        return tf.reshape(self.cur_value, [-1])


class MyTorchModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        """Build a simple [16]-MLP (+ value branch)."""
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)

        self.device = torch.device("cuda"
                                   if torch.cuda.is_available() else "cpu")

        # Hidden layer (shared by action logits outputs and value output).
        self.layer_1 = nn.Linear(obs_space.shape[0], 16).to(self.device)

        # Action logits output.
        self.layer_out = nn.Linear(16, num_outputs).to(self.device)

        # "Value"-branch (single node output).
        # Used by several RLlib algorithms (e.g. PPO) to calculate an observation's value.
        self.value_branch = nn.Linear(16, 1).to(self.device)
        self.cur_value = None

    def forward(self, input_dict, state, seq_lens):
        """Custom-define your forward pass logic here."""
        # Pass inputs through our 2 layers.
        layer_1_out = self.layer_1(input_dict["obs"])
        logits = self.layer_out(layer_1_out)

        # Calculate the "value" of the observation and store it for
        # when `value_function` is called.
        self.cur_value = self.value_branch(layer_1_out).squeeze(1)

        return logits, state

    def value_function(self):
        """Implement the value branch forward pass logic here:
        
        We will just return the already calculated `self.cur_value`.
        """
        assert self.cur_value is not None, "Must call `forward()` first!"
        return self.cur_value
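
To make the structure shared by both models concrete, here is a framework-free NumPy sketch of the same computation: one hidden layer feeds both the action-logits head and the value head. The weights are random and the names (`init_params`, `forward`) are illustrative only; they are not RLlib APIs.

```python
import numpy as np


def init_params(obs_dim, num_outputs, hidden=16, seed=0):
    """Create random weights for a shared hidden layer and two heads."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(scale=0.1, size=(obs_dim, hidden)),
        "b1": np.zeros(hidden),
        "w_logits": rng.normal(scale=0.1, size=(hidden, num_outputs)),
        "b_logits": np.zeros(num_outputs),
        "w_value": rng.normal(scale=0.1, size=(hidden, 1)),
        "b_value": np.zeros(1),
    }


def forward(params, obs):
    """Return (logits, values) for a batch of observations."""
    # Shared hidden layer with ReLU activation.
    hidden = np.maximum(0.0, obs @ params["w1"] + params["b1"])
    # Head 1: one logit per action, shape [batch, num_outputs].
    logits = hidden @ params["w_logits"] + params["b_logits"]
    # Head 2: one scalar value per observation, flattened to shape [batch].
    values = (hidden @ params["w_value"] + params["b_value"]).reshape(-1)
    return logits, values


params = init_params(obs_dim=2, num_outputs=2)
logits, values = forward(params, np.array([[0.5, 0.5], [-1.0, 1.0]]))
print(logits.shape, values.shape)  # (2, 2) (2,)
```

The flattening in the value head mirrors what `tf.reshape(self.cur_value, [-1])` and `.squeeze(1)` do in the two model classes above.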

In [None]:
# Do a quick test on the custom model classes.
test_model_tf = MyKerasModel(
    obs_space=gym.spaces.Box(-1.0, 1.0, (2, )),
    action_space=None,
    num_outputs=2,
    model_config={},
    name="MyModel",
)

print("TF-output={}".format(test_model_tf({"obs": np.array([[0.5, 0.5]])})))

# For PyTorch, you can do:
#test_model_torch = MyTorchModel(
#    obs_space=gym.spaces.Box(-1.0, 1.0, (2, )),
#    action_space=None,
#    num_outputs=2,
#    model_config={},
#    name="MyModel",
#)
#print("Torch-output={}".format(test_model_torch({"obs": torch.from_numpy(np.array([[0.5, 0.5]], dtype=np.float32))})))


In [None]:
# Set up our custom model and re-run the experiment.
config.update({
    "model": {
        "custom_model": MyKerasModel,  # for torch users: "custom_model": MyTorchModel
        "custom_model_config": {
            #"layers": [128, 128],
        },
    },
})

tune.run(
    "PPO",
    config=config,  # for torch users: config=dict(config, **{"framework": "torch"}),
    stop={
        "training_iteration": 5,
    },
)


### Model Rollout

Once we have trained a policy, we deploy it in the environment.

A "rollout" is the application of the trained policy to the environment. That is, for a given state, the policy outputs the action it has learned to prefer.

****************************************************************************

WARNING: The rllib rollout command discussed next won't work in a cloud environment, because it attempts to pop up a window.


https://docs.ray.io/en/latest/rllib-concepts.html#policy-evaluation

***************************************************************************

Example of a rollout:

Reuse the trained policy to act in an environment.
The line `test_agent.compute_single_action(state)` uses the trained policy to pick an action given the state.

The cumulative reward should be close to the episode reward seen at the end of training.

In [None]:
env = gym.make(SELECT_ENV)
state = env.reset()
done = False
cumulative_reward = 0

while not done:
    # Get the next action for the current state from the trained policy.
    action = test_agent.compute_single_action(state)
    state, reward, done, _ = env.step(action)
    cumulative_reward += reward

print(cumulative_reward)

### Tensorboard results

Note: Weights & Biases (WandB) can be used as an alternative to TensorBoard.

In [None]:
# From the command line:
# tensorboard --logdir=$HOME/ray_results/

### Shut down the service

In [None]:
ray.shutdown()

### Deep Dive: A closer look at RLlib's components

We already took a quick look inside an RLlib Trainer object and extracted its Policy(ies) and the Policy's model (neural network). 

Here is a much more detailed overview of what's inside a Trainer object.

At the core is the so-called `WorkerSet` sitting under `Trainer.workers`. A WorkerSet is a group of `RolloutWorker` (`rllib.evaluation.rollout_worker.py`) objects that always consists of a "local worker" (`Trainer.workers.local_worker()`) and 'n' "remote workers" (`Trainer.workers.remote_workers()`).

See image here:

https://github.com/sven1977/rllib_tutorials/blob/865d77eacb8cb8025abe372d97c47f70aa1d035b/ray_summit_2021/tutorial_notebook.ipynb

### Scaling RLlib

Scaling RLlib works by parallelizing the "jobs" that the remote `RolloutWorkers` do. In a vanilla RL algorithm such as PPO or DQN, the `@ray.remote`-labeled RolloutWorkers in the figure linked above are responsible for interacting with one or more environments and thereby collecting experiences. The environment produces an observation; the copy of the Policy(ies) located on the remote worker computes an action and sends it to the environment, which produces the next observation. This cycle repeats continuously and is only interrupted to send experience batches ("train batches") of a certain size to the "local worker". There, these batches are used to call `Policy.learn_on_batch()`, which performs a loss calculation followed by a model weights update; the updated weights are then broadcast back to all the remote workers.
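
The sample, learn, and broadcast cycle can be sketched without Ray at all. The toy below is an illustration under loud assumptions, not RLlib's implementation: `ToyWorker`, its `sample()` method, and the one-parameter "policy" are all made up, and the workers run one after another instead of in parallel.

```python
import random


class ToyWorker:
    """Stand-in for a remote rollout worker with its own copy of the weights."""

    def __init__(self, seed):
        self.weights = 0.0
        self.rng = random.Random(seed)

    def set_weights(self, weights):
        self.weights = weights

    def sample(self, num_steps):
        """Collect `num_steps` fake (obs, action, target) experiences."""
        batch = []
        for _ in range(num_steps):
            obs = self.rng.uniform(-1.0, 1.0)
            action = self.weights * obs  # computed by the worker's policy copy
            target = 2.0 * obs           # what the toy "environment" wants
            batch.append((obs, action, target))
        return batch


def learn_on_batch(weights, batch, lr=0.1):
    """One gradient step on the mean squared error between action and target."""
    grad = sum(2.0 * (action - target) * obs for obs, action, target in batch)
    return weights - lr * grad / len(batch)


local_weights = 0.0
remote_workers = [ToyWorker(seed=i) for i in range(3)]

for iteration in range(200):
    # 1) Every remote worker collects experiences with its current weights.
    train_batch = [step for w in remote_workers for step in w.sample(10)]
    # 2) The local worker updates the weights on the concatenated batch.
    local_weights = learn_on_batch(local_weights, train_batch)
    # 3) The new weights are broadcast back to all remote workers.
    for w in remote_workers:
        w.set_weights(local_weights)

print(round(local_weights, 3))
```

The learned weight approaches 2.0. If step 3 were skipped, the workers would keep acting with stale weights and the recorded actions would no longer reflect what the learner has learned, which is why the broadcast is part of every cycle.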

### Here are a couple of links that you may find useful.

- The <a href="https://github.com/sven1977/rllib_tutorials.git">github repo of this tutorial</a>.
- <a href="https://docs.ray.io/en/master/rllib.html">RLlib's documentation main page</a>.
- <a href="http://discuss.ray.io">Our discourse forum</a> to ask questions on Ray and its libraries.
- Our <a href="https://forms.gle/9TSdDYUgxYs8SA9e8">Slack channel</a> for interacting with other Ray RLlib users.
- The <a href="https://github.com/ray-project/ray/blob/master/rllib/examples/">RLlib examples scripts folder</a> with tons of examples on how to do different stuff with RLlib.
- A <a href="https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d">blog post on training with RLlib inside a Unity3D environment</a>.
