# Hands-on RL with Ray’s RLlib
## A beginner’s tutorial for working with multi-agent environments, models, and algorithms

<img src="images/pitfall.jpg" width=250> <img src="images/tesla.jpg" width=254> <img src="images/forklifts.jpg" width=169> <img src="images/robots.jpg" width=252> <img src="images/dota2.jpg" width=213>

### Overview
“Hands-on RL with Ray’s RLlib” is a beginner’s tutorial for working with reinforcement learning (RL) environments, models, and algorithms using Ray’s RLlib library. RLlib offers high scalability, a large list of algorithms to choose from (offline, model-based, model-free, etc.), support for TensorFlow and PyTorch, and a unified API for a variety of applications. This tutorial starts with a brief introduction to the core concepts (e.g. why RL?) before moving on to RLlib (multi- and single-agent) environments, neural network models, hyperparameter tuning, debugging, student exercises, Q/A, and more. All code will be provided as .py files in a GitHub repo.

### Intended Audience
* Python programmers who want to get started with reinforcement learning and RLlib.

### Prerequisites
* Some Python programming experience.
* Some familiarity with machine learning.
* *Helpful, but not required:* Experience in reinforcement learning and Ray.
* *Helpful, but not required:* Experience with TensorFlow or PyTorch.

### Requirements/Dependencies

To get this notebook up and running on your local machine, first install conda (https://www.anaconda.com/products/individual), then follow the setup instructions for your platform below.

#### Quick `conda` setup instructions (Linux):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install ray[rllib]
$ pip install tensorflow  # <- either one works!
$ pip install torch  # <- either one works!
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Mac):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install cmake "ray[rllib]"
$ pip install tensorflow  # <- either one works!
$ pip install torch  # <- either one works!
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Win10):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install ray[rllib]
$ pip install [tensorflow|torch]  # <- either one works!
$ pip install jupyterlab
$ conda install pywin32
```

Also, for Win10 Atari support, we have to install atari_py from a different source (gym does not support Atari envs on Windows).

```
$ pip install git+https://github.com/Kojoley/atari-py.git
```

### Opening these tutorial files:
```
$ git clone https://github.com/sven1977/rllib_tutorials
$ cd rllib_tutorials
$ jupyter-lab
```

### Key Takeaways
* What is reinforcement learning and why RLlib?
* Core concepts of RLlib: Environments, Trainers, Policies, and Models.
* How to configure, hyperparameter-tune, and parallelize RLlib.
* RLlib debugging best practices.


### Tutorial Outline
1. RL and RLlib in a nutshell.
1. Defining an RL-solvable problem: Our first environment.
1. **Exercise No.1**: Environment loop.


(15min break)

1. Picking an algorithm and training our first RLlib Trainer.
1. Configurations and hyperparameters - Easy tuning with Ray Tune.
1. Fixing our experiment's config - Going multi-agent.
1. The "infinite laptop": Quick intro into how to use RLlib with Anyscale's product.
1. **Exercise No.2**: Run your own Ray RLlib+Tune experiment.
1. Neural network models - Provide your custom models using tf.keras or torch.nn.


(15min break)

1. Deeper dive into RLlib's parallelization architecture.
1. Specifying different compute resources and parallelization options through our config.
1. "Hacking in": Using callbacks to customize the RL loop and generate our own metrics.
1. **Exercise No.3**: Write your own custom callback.
1. "Hacking in (part II)" - Debugging with RLlib and PyCharm.
1. Checking on the "infinite laptop" - Did RLlib learn to solve the problem?


### Other Recommended Readings
* [Reinforcement Learning with RLlib in the Unity Game Engine](https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d)

<img src="images/unity3d_blog_post.png" width=400>

* [Attention Nets and More with RLlib's Trajectory View API](https://medium.com/distributed-computing-with-ray/attention-nets-and-more-with-rllibs-trajectory-view-api-d326339a6e65)
* [Intro to RLlib: Example Environments](https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70)


## The RL cycle

<img src="images/rl-cycle.png" width=800>

### Coding/defining our "problem" via an RL environment.

We will use the following (adversarial) multi-agent environment
throughout this tutorial to demonstrate a large fraction of RLlib's
APIs, features, and customization options.

<img src="images/environment.png" width=800>

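### A quick look at spaces:

In ML, spaces are used to describe the possible/valid values that the inputs and outputs of a neural network can take.

RL environments also use them to describe what their valid observations and actions are.

A space is usually defined by a shape (e.g. 84x84x3 for RGB images) and a data type (e.g. uint8 for RGB values between 0 and 255).
However, a space can also be composed of other spaces (see the Tuple or Dict spaces) or be simply discrete (integer) with n fixed possible values. For example, in a game where each agent can only move up/down/left/right, the action space is `Discrete(4)` (no data type and no shape need to be defined in this case). Our observation space will be `MultiDiscrete([n, m])`, where the first component encodes the position of the observing agent and the second component the position of the opposing agent. So if agent1 starts in the upper left corner and agent2 starts in the bottom right corner, agent1's observation is `[0, 63]` (on an 8 x 8 grid) and agent2's observation is `[63, 0]`.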

<img src="images/spaces.png" width=800>
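To make the `[0, 63]` example concrete, here is a minimal sketch of how a (row, column) grid position can be flattened into one discrete value and how the per-agent observations are assembled from it. The helper names (`encode_pos`, `decode_pos`, `observations`) are made up for this illustration and are not part of gym or RLlib. The environment coded below uses the same row-major idea.

```python
# Hypothetical helper names; the grid size matches the 8 x 8 example above.
WIDTH, HEIGHT = 8, 8


def encode_pos(row, col, width=WIDTH):
    """Row-major encoding of a (row, col) cell as a single integer."""
    return row * width + col


def decode_pos(index, width=WIDTH):
    """Inverse of `encode_pos`: integer index back to (row, col)."""
    return divmod(index, width)


def observations(agent1_pos, agent2_pos):
    """Each agent sees [own position, opponent's position]."""
    p1 = encode_pos(*agent1_pos)
    p2 = encode_pos(*agent2_pos)
    return {"agent1": [p1, p2], "agent2": [p2, p1]}


# agent1 in the upper left corner, agent2 in the bottom right corner.
obs = observations(agent1_pos=(0, 0), agent2_pos=(HEIGHT - 1, WIDTH - 1))
print(obs)  # {'agent1': [0, 63], 'agent2': [63, 0]}
```

Both entries of an observation lie in `0 .. WIDTH * HEIGHT - 1`, which is why the space for this grid would be `MultiDiscrete([64, 64])`.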

In [1]:
# Let's code our multi-agent environment.

import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
import random

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MultiAgentArena(MultiAgentEnv):
    def __init__(self, config=None):
        config = config or {}
        # Dimensions of the grid.
        self.width = config.get("width", 10)
        self.height = config.get("height", 10)

        # End an episode after this many timesteps.
        self.timestep_limit = config.get("ts", 100)

        self.observation_space = MultiDiscrete([self.width * self.height,
                                                self.width * self.height])
        # 0=up, 1=right, 2=down, 3=left.
        self.action_space = Discrete(4)

        # Reset env.
        self.reset()
        
    def reset(self):
        """Returns initial observation of next(!) episode."""
        # Row-major coords.
        self.agent1_pos = [0, 0]  # upper left corner
        self.agent2_pos = [self.height - 1, self.width - 1]  # lower right corner

        # Accumulated rewards in this episode.
        self.agent1_R = 0.0
        self.agent2_R = 0.0

        # Reset agent1's visited fields.
        self.agent1_visited_fields = set([tuple(self.agent1_pos)])

        # How many timesteps have we done in this episode.
        self.timesteps = 0

        # No collision has happened yet (read by `render()`).
        self.collision = False

        # Return the initial observation in the new episode.
        return self._get_obs()

    def step(self, action: dict):
        """
        Returns (next observation, rewards, dones, infos) after having taken the given actions.
        
        e.g.
        `action={"agent1": action_for_agent1, "agent2": action_for_agent2}`
        """
        
        # increase our time steps counter by 1.
        self.timesteps += 1
        # An episode is "done" when we reach the time step limit.
        is_done = self.timesteps >= self.timestep_limit

        # Agent2 always moves first.
        # events = [collision|agent1_new_field]
        events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)

        # Useful for rendering.
        self.collision = "collision" in events
            
        # Get observations (based on new agent positions).
        obs = self._get_obs()

        # Determine rewards based on the collected events:
        r1 = -1.0 if "collision" in events else 1.0 if "agent1_new_field" in events else -0.5
        r2 = 1.0 if "collision" in events else -0.1

        self.agent1_R += r1
        self.agent2_R += r2
        
        rewards = {
            "agent1": r1,
            "agent2": r2,
        }

        # Generate a `done` dict (per-agent and total).
        dones = {
            "agent1": is_done,
            "agent2": is_done,
            # special `__all__` key indicates that the episode is done for all agents.
            "__all__": is_done,
        }

        return obs, rewards, dones, {}  # <- info dict (not needed here).

    def _get_obs(self):
        """
        Returns obs dict (agent name to discrete-pos tuple) using each
        agent's current x/y-positions.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) from `coords` (row/col) using the
        given action (0=up, 1=right, 2=down, 3=left) and returns the resulting set of events:
        {"collision"} if the move was blocked by the other agent (the agent then stays in place),
        {"agent1_new_field"} if agent1 entered a field it had not visited before,
        or an empty set otherwise.
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"collision"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "new" if new tile covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_fields:
            self.agent1_visited_fields.add(tuple(coords))
            return {"agent1_new_field"}
        # No new tile for agent1.
        return set()

    def render(self, mode=None):
        print("_" * (self.width + 2))
        for r in range(self.height):
            print("|", end="")
            for c in range(self.width):
                if self.agent1_pos == [r, c]:
                    print("1", end="")
                elif self.agent2_pos == [r, c]:
                    print("2", end="")
                elif (r, c) in self.agent1_visited_fields:
                    print(".", end="")
                else:
                    print(" ", end="")
            print("|")
        print("‾" * (self.width + 2))
        print(f"{'!!Collision!!' if self.collision else ''}")
        print("R1={: .1f}".format(self.agent1_R))
        print("R2={: .1f}".format(self.agent2_R))
        print()


env = MultiAgentArena()

obs = env.reset()

# Agent1 will move down, Agent2 moves up.
obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})

env.render()

print("Agent1's x/y position={}".format(env.agent1_pos))
print("Agent2's x/y position={}".format(env.agent2_pos))
print("Env timesteps={}".format(env.timesteps))




____________
|.         |
|1         |
|          |
|          |
|          |
|          |
|          |
|          |
|         2|
|          |
‾‾‾‾‾‾‾‾‾‾‾‾

R1= 1.0
R2=-0.1

Agent1's x/y position=[1, 0]
Agent2's x/y position=[8, 9]
Env timesteps=1
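The reward logic inside `step()` is compact, so here it is once more as a standalone sketch. The function name `rewards_from_events` is made up for this illustration. The rules themselves are the ones from the environment code above: a collision always takes precedence, agent1 is otherwise rewarded for entering new fields, and agent2 pays a small penalty for every step without a collision.

```python
def rewards_from_events(events):
    """Maps the set of events from one `step()` to (r1, r2)."""
    if "collision" in events:
        r1 = -1.0  # agent1 is punished for any collision ...
        r2 = 1.0   # ... and agent2 is rewarded for it.
    else:
        # agent1: +1.0 for a new field, -0.5 for a field it has seen before.
        r1 = 1.0 if "agent1_new_field" in events else -0.5
        # agent2: small penalty for every step without a collision.
        r2 = -0.1
    return r1, r2


print(rewards_from_events({"agent1_new_field"}))  # (1.0, -0.1)
print(rewards_from_events({"collision"}))         # (-1.0, 1.0)
print(rewards_from_events(set()))                 # (-0.5, -0.1)
```

The first case is the one we just saw in the rendered output: agent1 moved down onto a new field (R1=1.0) and agent2 did not collide with it (R2=-0.1).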


## Exercise No 1

<hr />

<img src="images/exercise1.png" width=400>

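In the cell above, we performed a `reset()` and a single `step()` call. To walk through an entire episode, one would normally call `step()` repeatedly (with different actions) until the returned `dones` dict has the "agent1" or "agent2" (or "__all__") key set to True. Your task is to write an "environment loop" that runs for exactly one episode using our `MultiAgentArena` class.

Follow the instructions below to get this done.

1. `reset` the already created environment (variable `env`) to get the first (initial) observation.
1. Enter an infinite while loop.
1. Compute the actions for "agent1" and "agent2" by calling `DummyTrainer.compute_action([obs])` twice (once for each agent).
1. Put the results of the action computations into an action dict (`{"agent1": ..., "agent2": ...}`).
1. Pass this action dict into the env's `step()` method, just like it's done in the above cell (where we do a single `step()`).
1. Check the returned `dones` dict for True (meaning the episode has terminated) and, if True, break out of the loop.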

**Good luck! :)**


In [2]:
class DummyTrainer:
    """Dummy Trainer class used in Exercise #1.

    Use its `compute_action` method to get a new action for one of the agents,
    given the agent's observation (an array holding the agent's own and the
    opponent's discrete positions). The observation is ignored here.
    """

    def compute_action(self, single_agent_obs=None):
        # Returns a random action for a single agent.
        return np.random.randint(4)  # Discrete(4) -> return rand int between 0 and 3 (incl. 3).

dummy_trainer = DummyTrainer()
# Check, whether it's working.
for _ in range(3):
    # Get action for agent1 (providing agent1's and agent2's positions).
    print("action_agent1={}".format(dummy_trainer.compute_action(np.array([0, 99]))))

    # Get action for agent2 (providing agent2's and agent1's positions).
    print("action_agent2={}".format(dummy_trainer.compute_action(np.array([99, 0]))))

    print()

action_agent1=0
action_agent2=3

action_agent1=3
action_agent2=2

action_agent1=3
action_agent2=1



Write your solution code into this cell here:

In [3]:
# !LIVE CODING!

# Leave the following as-is. It'll help us with rendering the env in this very cell's output.
import time
from ipywidgets import Output
from IPython import display

out = Output()
display.display(out)

with out:

    # Solution to Exercise #1:

    # Start coding here inside this `with`-block:
    # 1) Reset the env.
    env = MultiAgentArena()
    obs = env.reset()
    
    dummy_trainer = DummyTrainer()
    # 2) Enter an infinite while loop (to step through the episode).
    while True:
        # 3) Calculate both agents' actions individually, using dummy_trainer.compute_action([individual agent's obs])
        a1 = dummy_trainer.compute_action(obs["agent1"])
        a2 = dummy_trainer.compute_action(obs["agent2"])
        # 4) Compile the actions dict from both individual agents' actions.
        actions = {"agent1": a1, "agent2": a2}
        # 5) Send the actions dict to the env's `step()` method to receive: obs, rewards, dones, info dicts
        obs, rewards, dones, _ = env.step(actions)
        # 6) We'll do this together: Render the env.
        # Don't write any code here (skip directly to 7).
        out.clear_output(wait=True)
        time.sleep(0.08)
        env.render()

        # 7) Check whether the episode is done; if yes, break out of the while loop.
        if dones["__all__"]:
            break
# 8) Run it! :)

Output()

------------------
## 15 min break :)
------------------

### And now for something completely different:
#### Plugging in RLlib!

In [4]:
import numpy as np
import pprint
import ray

# Start a new instance of Ray (when running this tutorial locally) or
# connect to an already running one (when running this tutorial through Anyscale).

ray.init()  # Hear the engine humming? ;)

# In case you encounter the following error during our tutorial: `RuntimeError: Maybe you called ray.init twice by accident?`
# Try: `ray.shutdown() + ray.init()` or `ray.init(ignore_reinit_error=True)`

2021-07-01 11:06:46,197	INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '172.30.1.40',
 'raylet_ip_address': '172.30.1.40',
 'redis_address': '172.30.1.40:6379',
 'object_store_address': '/tmp/ray/session_2021-07-01_11-06-44_562960_4956/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-07-01_11-06-44_562960_4956/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-07-01_11-06-44_562960_4956',
 'metrics_export_port': 55923,
 'node_id': '589bd408048d2c7dfc1b96eac59af4507da2caa95f5a1de1c6719304'}

### Picking an RLlib algorithm - We'll use PPO throughout this tutorial (one-size-fits-all-kind-of-algo)

<img src="images/rllib_algos.png" width=800>

https://docs.ray.io/en/master/rllib-algorithms.html#available-algorithms-overview

In [5]:
# Import a Trainable (one of RLlib's built-in algorithms):
# We use the PPO algorithm here b/c it's very flexible wrt its supported
# action spaces and model types and b/c it learns well on a wide range of problems.
from ray.rllib.agents.ppo import PPOTrainer

# Specify a very simple config, defining our environment and some environment
# options (see environment.py).
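config = {
    "env": MultiAgentArena,  # "my_env" <- if we previously have registered the env with `tune.register_env("[name]", lambda config: [returns env object])`.
    # Note: RLlib hands `env_config` as-is to the env's constructor, and
    # `MultiAgentArena.__init__` looks up "width"/"height"/"ts" at the top level.
    # The values nested under "config" below are therefore not picked up; the env
    # falls back to its defaults (which happen to be the same values).
    "env_config": {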
        "config": {
            "width": 10,
            "height": 10,
            "ts": 100,
        },
    },

    # !PyTorch users!
    "framework": "torch",  # If users have chosen to install torch instead of tf.

    "create_env_on_driver": True,
}
# Instantiate the Trainer object using above config.
rllib_trainer = PPOTrainer(config=config)
rllib_trainer

2021-07-01 11:06:53,133	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


PPO

### Ready to train with RLlib's PPO algorithm

That's it, we are ready to train.
Calling `Trainer.train()` will execute a single "training iteration".

One iteration for most algos involves:

1) sampling from the environment(s)
2) using the sampled data (observations, actions taken, rewards) to update the policy model (neural network), such that it would pick better actions in the future, leading to higher rewards.
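To make these two steps concrete before running RLlib, here is a toy sketch of a sample-then-update loop. It is deliberately not PPO and not how RLlib implements a training iteration: the "environment" is a made-up 4-action bandit with fixed rewards, the "policy" is a plain list of action preferences, and the update is a simple policy-gradient step with the batch mean as baseline. All names and numbers in it are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

# Toy stand-in for an environment: 4 actions with fixed (made-up) rewards.
REWARDS = [-1.0, 1.0, -0.5, -0.1]


def softmax(prefs):
    """Turns action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_iteration(prefs, batch_size=20, lr=0.5):
    """One toy "training iteration": 1) sample, 2) update."""
    probs = softmax(prefs)
    # 1) Sampling: act with the current policy and record the rewards.
    actions = random.choices(range(len(prefs)), weights=probs, k=batch_size)
    rewards = [REWARDS[a] for a in actions]
    # 2) Update: policy-gradient step, using the batch mean as a baseline.
    baseline = sum(rewards) / batch_size
    grads = [0.0] * len(prefs)
    for a, r in zip(actions, rewards):
        for i in range(len(prefs)):
            # d log pi(a) / d prefs[i] for a softmax policy.
            grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
            grads[i] += (r - baseline) * grad_log_pi / batch_size
    new_prefs = [p + lr * g for p, g in zip(prefs, grads)]
    return new_prefs, baseline


prefs = [0.0] * 4  # Start with a uniformly random policy.
for i in range(200):
    prefs, mean_reward = train_iteration(prefs)
    if i % 50 == 0:
        print(f"iteration={i} mean sampled reward={mean_reward:.2f}")

final_probs = softmax(prefs)
print("final action probabilities:", [round(p, 2) for p in final_probs])
```

Over the iterations, the policy shifts its probability mass toward the action with the highest reward. RLlib's `train()` follows the same sample/update rhythm, only with real environments, neural network policies, and parallel sampling.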

Let's try it out:


In [6]:
results = rllib_trainer.train()

# Delete the config from the results for clarity.
# Only the stats will remain, then.
del results["config"]
# Pretty print the stats.
pprint.pprint(results)

{'agent_timesteps_total': 4000,
 'custom_metrics': {},
 'date': '2021-07-01_11-07-30',
 'done': False,
 'episode_len_mean': 100.0,
 'episode_media': {},
 'episode_reward_max': 10.79999999999997,
 'episode_reward_mean': -4.3950000000000005,
 'episode_reward_min': -28.500000000000043,
 'episodes_this_iter': 20,
 'episodes_total': 20,
 'experiment_id': '7db9968dfd0f49b1b9e90e48edd6e575',
 'hist_stats': {'episode_lengths': [100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    10

### Going from single policy (RLlib's default) to multi-policy:

So far, our experiment has been ill-configured, because both
agents, which should behave differently due to their different
tasks and reward functions, learn the same policy: the "default_policy",
which RLlib always provides if you don't configure anything else.
Remember that RLlib does not know at Trainer setup time, how many and which agents
the environment will "produce". Agent control (adding agents, removing them, terminating
episodes for agents) is entirely in the Env's hands.
Let's fix our single policy problem and introduce the "multiagent" API.

<img src="images/from_single_agent_to_multi_agent.png" width=800>

In order to turn on RLlib's multi-agent functionality, we need two things:

1. A policy mapping function, mapping agent IDs (e.g. a string like "agent1", produced by the environment in the returned observation/rewards/dones-dicts) to a policy ID (another string, e.g. "policy1", which is under our control).
1. A policies definition dict, mapping policy IDs (e.g. "policy1") to 4-tuples consisting of 1) policy class (None for using the default class), 2) observation space, 3) action space, and 4) config overrides (empty dict for no overrides and using the Trainer's main config dict).

Let's take a closer look:


In [7]:
# Define the policies definition dict:
# Each policy in there is defined by its ID (key) mapping to a 4-tuple (value):
# - Policy class (None for using the "default" class, e.g. PPOTFPolicy for PPO+tf or PPOTorchPolicy for PPO+torch).
# - obs-space (we get this directly from our already created env object).
# - act-space (we get this directly from our already created env object).
# - config-overrides dict (leave empty for using the Trainer's config as-is)
policies = {
    "policy1": (None, env.observation_space, env.action_space, {}),
    "policy2": (None, env.observation_space, env.action_space, {"lr": 0.0002}),
}
# Note that now we won't have a "default_policy" anymore, just "policy1" and "policy2".

# Define an agent->policy mapping function.
# Which agents (defined by the environment) use which policies (defined by us)?
# The mapping here is M (agents) -> N (policies), where M >= N.
def policy_mapping_fn(agent_id: str):
    # Make sure agent ID is valid.
    assert agent_id in ["agent1", "agent2"], f"ERROR: invalid agent ID {agent_id}!"
    # Map agent1 to policy1, and agent2 to policy2.
    return "policy1" if agent_id == "agent1" else "policy2"

# We could - if we wanted - specify, which policies should be learnt (by default, RLlib learns all).
# Non-learnt policies will be frozen and not updated:
# policies_to_train = ["policy1", "policy2"]

# Adding the above to our config.
config.update({
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
        # We'll leave this empty: Means, we train both policy1 and policy2.
        # "policies_to_train": policies_to_train,
    },
})

pprint.pprint(config)
print()
print(f"agent1 is now mapped to {policy_mapping_fn('agent1')}")
print(f"agent2 is now mapped to {policy_mapping_fn('agent2')}")

{'create_env_on_driver': True,
 'env': <class '__main__.MultiAgentArena'>,
 'env_config': {'config': {'height': 10, 'ts': 100, 'width': 10}},
 'framework': 'torch',
 'multiagent': {'policies': {'policy1': (None,
                                         MultiDiscrete([100 100]),
                                         Discrete(4),
                                         {}),
                             'policy2': (None,
                                         MultiDiscrete([100 100]),
                                         Discrete(4),
                                         {'lr': 0.0002})},
                'policy_mapping_fn': <function policy_mapping_fn at 0x7f81915380d0>}}

agent1 is now mapped to policy1
agent2 is now mapped to policy2
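The mapping above is one-to-one (two agents, two policies). As noted in the code comment, the general case is M agents mapped to N policies with M >= N. The following sketch shows such a many-to-few mapping. The agent IDs ("red_0", "blue_1", ...), the policy IDs, and the function name are made up for this illustration. Our arena only ever produces "agent1" and "agent2".

```python
# Hypothetical agent IDs of the form "<team>_<index>", e.g. "red_0", "blue_3".
# These IDs are made up for illustration; our arena only has "agent1"/"agent2".
POLICY_FOR_TEAM = {"red": "red_policy", "blue": "blue_policy"}


def team_policy_mapping_fn(agent_id: str) -> str:
    """Maps every agent of a team to that team's single shared policy."""
    team, _, _ = agent_id.partition("_")
    if team not in POLICY_FOR_TEAM:
        raise ValueError(f"Unknown team in agent ID {agent_id!r}")
    return POLICY_FOR_TEAM[team]


agent_ids = ["red_0", "red_1", "red_2", "blue_0", "blue_1"]
mapping = {agent_id: team_policy_mapping_fn(agent_id) for agent_id in agent_ids}
print(mapping)
print(f"{len(mapping)} agents -> {len(set(mapping.values()))} policies")
```

Agents that share a policy also share its weights, so the experience collected by all agents of a team is used to update the same model.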


In [8]:
# Recreate our Trainer (we cannot just change the config on-the-fly).
rllib_trainer.stop()

# Using our updated (now multiagent!) config dict.
rllib_trainer = PPOTrainer(config=config)
rllib_trainer



PPO

Now that we are set up correctly with two policies as per our "multiagent" config, let's call `train()` on the new Trainer several times (how about 10 times?).

In [9]:
# Run `train()` n times. Repeatedly call `train()` now to see rewards increase.
# Move on once you see (agent1 + agent2) episode rewards of 10.0 or more.
for _ in range(10):
    results = rllib_trainer.train()
    print(f"Iteration={rllib_trainer.iteration}: R(\"return\")={results['episode_reward_mean']}")

Iteration=1: R("return")=-10.687500000000002
Iteration=2: R("return")=-5.0362499999999955
Iteration=3: R("return")=-1.7219999999999933
Iteration=4: R("return")=1.8510000000000093
Iteration=5: R("return")=1.5630000000000106
Iteration=6: R("return")=1.5600000000000147
Iteration=7: R("return")=1.5300000000000133
Iteration=8: R("return")=2.1150000000000126
Iteration=9: R("return")=3.3930000000000073
Iteration=10: R("return")=5.769


In [10]:
# Do another loop, but this time, we will print out each policy's individual rewards.
for _ in range(10):
    results = rllib_trainer.train()
    r1 = results['policy_reward_mean']['policy1']
    r2 = results['policy_reward_mean']['policy2']
    r = r1 + r2
    print(f"Iteration={rllib_trainer.iteration}: R(\"return\")={r} R1={r1} R2={r2}")

Iteration=11: R("return")=6.624000000000014 R1=13.61 R2=-6.9859999999999856
Iteration=12: R("return")=6.981000000000013 R1=13.725 R2=-6.7439999999999864
Iteration=13: R("return")=8.145000000000014 R1=14.405 R2=-6.2599999999999865
Iteration=14: R("return")=10.344000000000014 R1=15.845 R2=-5.500999999999987
Iteration=15: R("return")=11.763000000000012 R1=16.89 R2=-5.126999999999989
Iteration=16: R("return")=12.27300000000001 R1=17.125 R2=-4.8519999999999905
Iteration=17: R("return")=12.567000000000009 R1=17.54 R2=-4.97299999999999
Iteration=18: R("return")=14.190000000000012 R1=19.295 R2=-5.104999999999989
Iteration=19: R("return")=16.35600000000001 R1=21.45 R2=-5.09399999999999
Iteration=20: R("return")=17.55000000000001 R1=21.225 R2=-3.674999999999989


#### !OPTIONAL HACK! (<-- we will not do these during the tutorial, but feel free to try these cells by yourself)

Use the above solution of Exercise #1 and replace our `dummy_trainer` in that solution
with the now trained `rllib_trainer`. You should see a better performance of the two agents.

However, keep in mind that most of the improvement comes from agent1, which is the "easier" one
to collect high rewards with (compare R1 and R2 in the output above).


#### !OPTIONAL HACK!

Feel free to play around with the following code in order to learn how RLlib (under the hood) calculates actions from the environment's observations, using the Policies and their model(s) inside our Trainer object:

In [18]:
# Let's actually "look inside" our Trainer to see what's in there.
from ray.rllib.utils.numpy import softmax

# To get to one of the policies inside the Trainer, use `Trainer.get_policy([policy ID])`:
policy = rllib_trainer.get_policy("policy1")
print(f"Our (only!) Policy right now is: {policy}")

# To get to the model inside any policy, do:
model = policy.model
#print(f"Our Policy's model is: {model}")

# Print out the policy's action and observation spaces.
print(f"Our Policy's observation space is: {policy.observation_space}")
print(f"Our Policy's action space is: {policy.action_space}")

# Produce a random observation (B=1; batch of size 1).
obs = np.array([policy.observation_space.sample()])
# We configured "framework": "torch", so the model expects a torch tensor.
import torch
obs = torch.from_numpy(obs)

# Get the action logits (a torch tensor here; a tf tensor if you use tf).
logits, _ = model({"obs": obs})
logits

# Convert the logits tensor to a numpy array.
# For tf, you would instead run `logits` through the Policy's own tf.Session:
# logits_np = policy.get_session().run(logits)
logits_np = logits.detach().cpu().numpy()

# Convert logits into action probabilities and remove the B=1.
action_probs = np.squeeze(softmax(logits_np))

# Sample an action, using the probabilities.
action = np.random.choice([0, 1, 2, 3], p=action_probs)

# Print out the action.
print(f"sampled action={action}")

Our (only!) Policy right now is: <ray.rllib.policy.policy_template.PPOTorchPolicy object at 0x7f8194131610>
Our Policy's observation space is: Box(-1.0, 1.0, (200,), float32)
Our Policy's action space is: Discrete(4)
sampled action=1
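The last two steps of the cell above (logits to probabilities, then sampling) do not depend on RLlib at all. Here is a minimal standalone sketch of them using only the Python standard library. The logits are made-up numbers, not the output of our trained model, and this `softmax` is our own small helper rather than the one imported from `ray.rllib.utils.numpy`.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # Subtracting the max avoids overflow in exp().
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the 4 actions (0=up, 1=right, 2=down, 3=left).
logits = [0.5, 2.0, -1.0, 0.0]
action_probs = softmax(logits)

random.seed(42)
# Sample one action index according to the probabilities.
action = random.choices(range(len(logits)), weights=action_probs, k=1)[0]

print("action_probs =", [round(p, 3) for p in action_probs])
print("sampled action =", action)
```

Sampling from the distribution (instead of always taking the most likely action) is what gives a stochastic policy its exploration behavior during training.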


### Saving and restoring a trained Trainer.
Currently, `rllib_trainer` is in an already trained state.
It holds optimized weights in its Policies' models, which already allow it
to act somewhat smartly in our environment when given an observation.

However, if we closed this notebook right now, all the effort would have been for nothing.
Let's therefore save the state of our trainer to disk for later!


In [19]:
# We use the `Trainer.save()` method to create a checkpoint.
checkpoint_file = rllib_trainer.save()
print(f"Trainer (at iteration {rllib_trainer.iteration} was saved in '{checkpoint_file}'!")

# Here is what a checkpoint directory contains:
print("The checkpoint directory contains the following files:")
import os
os.listdir(os.path.dirname(checkpoint_file))

Trainer (at iteration 20 was saved in '/Users/parksurk/ray_results/PPO_MultiAgentArena_2021-07-01_11-07-40poki00jw/checkpoint_000020/checkpoint-20'!
The checkpoint directory contains the following files:


['checkpoint-20', 'checkpoint-20.tune_metadata', '.is_checkpoint']

### Restoring and evaluating a Trainer
In the following cell, we'll learn how to restore a saved Trainer from a checkpoint file.

We'll also evaluate a completely new Trainer (which should act more or less randomly) against an already trained one (the one we restore from the checkpoint file we just created).

In [20]:
# Pretend we wanted to pick up training from a previous run:
new_trainer = PPOTrainer(config=config)
# Evaluate the new trainer (this should yield random results).
results = new_trainer.evaluate()
print(f"Evaluating new trainer: R={results['evaluation']['episode_reward_mean']}")

# Restoring the trained state into the `new_trainer` object.
print(f"Before restoring: Trainer is at iteration={new_trainer.iteration}")
new_trainer.restore(checkpoint_file)
print(f"After restoring: Trainer is at iteration={new_trainer.iteration}")

# Evaluate again (this should yield results we saw after having trained our saved agent).
results = new_trainer.evaluate()
print(f"Evaluating restored trainer: R={results['evaluation']['episode_reward_mean']}")

2021-07-01 11:40:26,861	INFO trainable.py:377 -- Restored on 172.30.1.40 from checkpoint: /Users/parksurk/ray_results/PPO_MultiAgentArena_2021-07-01_11-07-40poki00jw/checkpoint_000020/checkpoint-20
2021-07-01 11:40:26,862	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 368.76109981536865, '_episodes_total': 800}


Evaluating new trainer: R=-12.330000000000002
Before restoring: Trainer is at iteration=0
After restoring: Trainer is at iteration=20
Evaluating restored trainer: R=18.70499999999994
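
To make the save/restore round trip above more concrete, here is a toy sketch that mimics it with plain `pickle`. `ToyTrainer` is a hypothetical stand-in invented for illustration; it is not how RLlib implements checkpoints, and it only shows the idea that restoring overwrites a fresh object's state (including its iteration counter) with the saved one.

```python
import os
import pickle
import tempfile


class ToyTrainer:
    """Hypothetical stand-in for a Trainer; NOT RLlib's implementation."""

    def __init__(self):
        self.iteration = 0
        self.weights = [0.0, 0.0]

    def train(self):
        # Pretend that each iteration nudges the weights a little.
        self.iteration += 1
        self.weights = [w + 0.1 for w in self.weights]

    def save(self, checkpoint_dir):
        # Write the full state to one file and return its path.
        path = os.path.join(checkpoint_dir, f"checkpoint-{self.iteration}")
        with open(path, "wb") as f:
            pickle.dump({"iteration": self.iteration, "weights": self.weights}, f)
        return path

    def restore(self, path):
        # Overwrite the current state with the saved one.
        with open(path, "rb") as f:
            state = pickle.load(f)
        self.iteration = state["iteration"]
        self.weights = state["weights"]


trained = ToyTrainer()
for _ in range(20):
    trained.train()

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = trained.save(tmp_dir)
    fresh = ToyTrainer()
    iteration_before = fresh.iteration
    fresh.restore(checkpoint_path)

print(f"Before restoring: iteration={iteration_before}")
print(f"After restoring: iteration={fresh.iteration}")
```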


To release all resources held by a Trainer, call its `stop()` method.
You should definitely run this cell, as it frees resources that we'll need later in this tutorial for parallel hyperparameter sweeps.

In [21]:
rllib_trainer.stop()
new_trainer.stop()

### Moving stuff to the professional level: RLlib in connection w/ Ray Tune

Running experiments through Ray Tune is the recommended way of working with RLlib. If you look at our
<a href="https://github.com/ray-project/ray/tree/master/rllib/examples">example scripts folder</a>, you will see that almost all of the scripts use Ray Tune to run the particular RLlib workload demonstrated in each script.

<img src="images/rllib_and_tune.png" width=400>

When setting up hyperparameter sweeps for Tune, we do so in our already familiar config dict.

So let's take a quick look at the PPO algorithm's default config to see which hyperparameters we may want to play around with:

In [22]:
# Configuration dicts and Ray Tune.
# Where are the default configuration dicts stored?

# PPO algorithm:
from ray.rllib.agents.ppo import DEFAULT_CONFIG as PPO_DEFAULT_CONFIG
print(f"PPO's default config is:")
pprint.pprint(PPO_DEFAULT_CONFIG)

# DQN algorithm:
#from ray.rllib.agents.dqn import DEFAULT_CONFIG as DQN_DEFAULT_CONFIG
#print(f"DQN's default config is:")
#pprint.pprint(DQN_DEFAULT_CONFIG)

# Common (all algorithms).
#from ray.rllib.agents.trainer import COMMON_CONFIG
#print(f"RLlib Trainer's default config is:")
#pprint.pprint(COMMON_CONFIG)

PPO's default config is:
{'_fake_gpus': False,
 'batch_mode': 'truncate_episodes',
 'callbacks': <class 'ray.rllib.agents.callbacks.DefaultCallbacks'>,
 'clip_actions': True,
 'clip_param': 0.3,
 'clip_rewards': None,
 'collect_metrics_timeout': 180,
 'compress_observations': False,
 'create_env_on_driver': False,
 'custom_eval_function': None,
 'custom_resources_per_worker': {},
 'eager_tracing': False,
 'entropy_coeff': 0.0,
 'entropy_coeff_schedule': None,
 'env': None,
 'env_config': {},
 'env_task_fn': None,
 'evaluation_config': {},
 'evaluation_interval': None,
 'evaluation_num_episodes': 10,
 'evaluation_num_workers': 0,
 'evaluation_parallel_to_training': False,
 'exploration_config': {'type': 'StochasticSampling'},
 'explore': True,
 'extra_python_environs_for_driver': {},
 'extra_python_environs_for_worker': {},
 'fake_sampler': False,
 'framework': 'tf',
 'gamma': 0.99,
 'grad_clip': None,
 'horizon': None,
 'ignore_worker_failures': False,
 'in_evaluation': False,
 'input'
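
When deriving your own config from a dict like this one, it helps to know how Python copies nested dicts. The sketch below uses a heavily shortened, hypothetical stand-in for a config dict (the `width` key is invented for illustration). It shows that `dict.copy()` is shallow: overriding a top-level key such as `lr` is safe, but editing a nested dict in place also changes the original.

```python
import copy

# Hypothetical, heavily shortened stand-in for a config dict.
default_config = {
    "lr": 5e-5,
    "train_batch_size": 4000,
    "env_config": {"width": 10},  # nested dict; `width` is an invented key
}

# Shallow copy: top-level keys are independent, nested dicts are shared.
shallow = default_config.copy()
shallow["lr"] = 0.0001                # original "lr" stays untouched
shallow["env_config"]["width"] = 20   # the original's nested dict changes too

# Deep copy: fully independent, including nested dicts.
deep = copy.deepcopy(default_config)
deep["env_config"]["width"] = 30      # original stays at 20

print(default_config["lr"], default_config["env_config"]["width"])
print(deep["env_config"]["width"])
```

If you only replace top-level keys, a shallow copy is enough; reach for `copy.deepcopy` when you modify nested settings.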

### Let's do a very simple grid-search over two learning rates and two train batch sizes with tune.run().

In particular, we will try the learning rates 0.0001 and 0.5, each combined with the train batch sizes 3000 and 4000, using `tune.grid_search([...])`
inside our config dict:
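
As a rough mental model, a grid search expands into one trial per combination of the listed values. The sketch below reproduces that expansion with `itertools.product`, using the grid values from the next cell. `expand_grid` is a hypothetical helper written for this illustration, not part of Ray Tune, and the order in which Tune enumerates trials may differ.

```python
import itertools

# Grid values as used in the tutorial's `tune.run()` cell.
search_space = {
    "lr": [0.0001, 0.5],
    "train_batch_size": [3000, 4000],
}


def expand_grid(space):
    """Return one config dict per combination of the listed values."""
    keys = list(space)
    value_lists = [space[key] for key in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


trials = expand_grid(search_space)
for index, trial in enumerate(trials):
    print(index, trial)
```

Two values for each of two hyperparameters give 2 x 2 = 4 trials, which matches the four trial rows in the Tune status output.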

In [23]:
# Plugging in Ray Tune.
# Note that this is the recommended way to run any experiments with RLlib.
# Reasons:
# - Tune allows you to do hyperparameter tuning in a user-friendly way
#   and at large scale!
# - Tune automatically allocates needed resources for the different
#   hyperparam trials and experiment runs on a cluster.

from ray import tune

# Running stuff with tune, we can re-use the exact
# same config that we used when working with RLlib directly!
tune_config = config.copy()

# Let's add our first hyperparameter search via our config.
# How about we try two different learning rates? Let's say 0.0001 and 0.5 (ouch!).
tune_config["lr"] = tune.grid_search([0.0001, 0.5])  # <- 0.5? again: ouch!
# We also grid-search over two train batch sizes, which yields 2 x 2 = 4 trials.
tune_config["train_batch_size"] = tune.grid_search([3000, 4000])

# Now that we will run things "automatically" through tune, we have to
# define one or more stopping criteria.
# Tune will stop the run, once any single one of the criteria is matched (not all of them!).
stop = {
    # Note that the keys used here can be anything present in the above `rllib_trainer.train()` output dict.
    "training_iteration": 5,
    "episode_reward_mean": 20.0,
}

# "PPO" is a registered name that points to RLlib's PPOTrainer.
# See `ray/rllib/agents/registry.py`

# Run a simple experiment until one of the stopping criteria is met.
tune.run(
    "PPO",
    config=tune_config,
    stop=stop,

    # Note that no trainers will be returned from this call here.
    # Tune will create n Trainers internally, run them in parallel and destroy them at the end.
    # However, you can ...
    checkpoint_at_end=True,  # ... create a checkpoint when done.
    checkpoint_freq=10,  # ... create a checkpoint every 10 training iterations.
)

Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_e0548_00000,PENDING,,0.0001,3000
PPO_MultiAgentArena_e0548_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_e0548_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_e0548_00003,PENDING,,0.5,4000


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_e0548_00000,RUNNING,,0.0001,3000
PPO_MultiAgentArena_e0548_00001,RUNNING,,0.5,3000
PPO_MultiAgentArena_e0548_00002,PENDING,,0.0001,4000
PPO_MultiAgentArena_e0548_00003,PENDING,,0.5,4000


[2m[36m(pid=5317)[0m 2021-07-01 11:56:13,300	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=5318)[0m 2021-07-01 11:56:13,300	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=5317)[0m 2021-07-01 11:56:23,687	INFO trainable.py:101 -- Trainable.setup took 10.387 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=5318)[0m 2021-07-01 11:56:23,685	INFO trainable.py:101 -- Trainable.setup took 10.385 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Result for PPO_MultiAgentArena_e0548_00000:
  agent_timesteps_total: 6000
  custom_metrics: {}
  date: 2021-07-01_11-56-40
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 10.200000000000028
  episode_reward_mean: -9.09
  episode_reward_min: -37.50000000000004
  episodes_this_iter: 30
  episodes_total: 30
  experiment_id: 109393779bf04cd2a072197ec47e0f68
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.20000000000000004
          cur_lr: 0.0001
          entropy: 1.3596946497758229
          entropy_coeff: 0.0
          kl: 0.027194303460419178
          policy_loss: -0.07365838422750433
          total_loss: 39.00392214457194
          vf_explained_var: 0.16660884022712708
          vf_loss: 39.072141806284584
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.20000000000000004
          cur_lr: 0.0002
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00000,RUNNING,172.30.1.40:5318,0.0001,3000,1.0,16.4154,3000.0,-9.09,10.2,-37.5,100.0
PPO_MultiAgentArena_e0548_00001,RUNNING,,0.5,3000,,,,,,,
PPO_MultiAgentArena_e0548_00002,PENDING,,0.0001,4000,,,,,,,
PPO_MultiAgentArena_e0548_00003,PENDING,,0.5,4000,,,,,,,




Result for PPO_MultiAgentArena_e0548_00001:
  agent_timesteps_total: 6000
  custom_metrics: {}
  date: 2021-07-01_11-56-41
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 10.200000000000001
  episode_reward_mean: -11.0
  episode_reward_min: -30.00000000000005
  episodes_this_iter: 30
  episodes_total: 30
  experiment_id: 30c1148e916449e1884fd7c5abd4d048
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.20000000000000004
          cur_lr: 0.5
          entropy: 0.09071534869838398
          entropy_coeff: 0.0
          kl: .inf
          policy_loss: 0.477949483320117
          total_loss: .inf
          vf_explained_var: 0.031452279537916183
          vf_loss: 42.38547468185425
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.20000000000000004
          cur_lr: 0.0002
          entropy: 1.3447722991307576
  

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00000,RUNNING,172.30.1.40:5318,0.0001,3000,2.0,34.1811,6000.0,-4.48,18.6,-37.5,100.0
PPO_MultiAgentArena_e0548_00001,RUNNING,172.30.1.40:5317,0.5,3000,1.0,18.1531,3000.0,-11.0,10.2,-30.0,100.0
PPO_MultiAgentArena_e0548_00002,PENDING,,0.0001,4000,,,,,,,
PPO_MultiAgentArena_e0548_00003,PENDING,,0.5,4000,,,,,,,




Result for PPO_MultiAgentArena_e0548_00001:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-07-01_11-57-01
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 10.200000000000001
  episode_reward_mean: -21.575000000000014
  episode_reward_min: -35.70000000000004
  episodes_this_iter: 30
  episodes_total: 60
  experiment_id: 30c1148e916449e1884fd7c5abd4d048
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.30000000000000004
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: .inf
          policy_loss: 0.1340596245136112
          total_loss: .inf
          vf_explained_var: 0.013232949189841747
          vf_loss: 101.28861967722575
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.30000000000000004
          cur_lr: 0.0002
          entropy: 1.2741331954797108
 

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00000,RUNNING,172.30.1.40:5318,0.0001,3000,3.0,51.1524,9000.0,-3.69,18.6,-37.5,100.0
PPO_MultiAgentArena_e0548_00001,RUNNING,172.30.1.40:5317,0.5,3000,2.0,37.7615,6000.0,-21.575,10.2,-35.7,100.0
PPO_MultiAgentArena_e0548_00002,PENDING,,0.0001,4000,,,,,,,
PPO_MultiAgentArena_e0548_00003,PENDING,,0.5,4000,,,,,,,


Result for PPO_MultiAgentArena_e0548_00001:
  agent_timesteps_total: 18000
  custom_metrics: {}
  date: 2021-07-01_11-57-23
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 10.200000000000001
  episode_reward_mean: -29.083333333333364
  episode_reward_min: -46.500000000000064
  episodes_this_iter: 30
  episodes_total: 90
  experiment_id: 30c1148e916449e1884fd7c5abd4d048
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.45
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: 0.0
          policy_loss: -0.0031135548294211426
          total_loss: 100.55957317352295
          vf_explained_var: 0.06382351368665695
          vf_loss: 100.5626875559489
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.45
          cur_lr: 0.0002
          entropy: 1.2185085713863373
          entro

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00000,RUNNING,172.30.1.40:5318,0.0001,3000,3.0,51.1524,9000.0,-3.69,18.6,-37.5,100.0
PPO_MultiAgentArena_e0548_00001,RUNNING,172.30.1.40:5317,0.5,3000,3.0,59.2336,9000.0,-29.0833,10.2,-46.5,100.0
PPO_MultiAgentArena_e0548_00002,PENDING,,0.0001,4000,,,,,,,
PPO_MultiAgentArena_e0548_00003,PENDING,,0.5,4000,,,,,,,


Result for PPO_MultiAgentArena_e0548_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-07-01_11-57-30
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 18.59999999999996
  episode_reward_mean: -1.604999999999989
  episode_reward_min: -37.50000000000004
  episodes_this_iter: 30
  episodes_total: 120
  experiment_id: 109393779bf04cd2a072197ec47e0f68
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.6750000000000002
          cur_lr: 0.0001
          entropy: 1.2709183792273204
          entropy_coeff: 0.0
          kl: 0.021722817871098716
          policy_loss: -0.0721723132301122
          total_loss: 29.253071626027424
          vf_explained_var: 0.2923828065395355
          vf_loss: 29.31058120727539
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.6750000000000002
          cur_lr: 0.0

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00000,RUNNING,172.30.1.40:5318,0.0001,3000,4.0,66.6113,12000.0,-1.605,18.6,-37.5,100.0
PPO_MultiAgentArena_e0548_00001,RUNNING,172.30.1.40:5317,0.5,3000,3.0,59.2336,9000.0,-29.0833,10.2,-46.5,100.0
PPO_MultiAgentArena_e0548_00002,PENDING,,0.0001,4000,,,,,,,
PPO_MultiAgentArena_e0548_00003,PENDING,,0.5,4000,,,,,,,


Result for PPO_MultiAgentArena_e0548_00001:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-07-01_11-57-43
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 3.000000000000019
  episode_reward_mean: -35.76000000000003
  episode_reward_min: -46.500000000000064
  episodes_this_iter: 30
  episodes_total: 120
  experiment_id: 30c1148e916449e1884fd7c5abd4d048
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.225
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: 0.0
          policy_loss: -0.0036503668719281754
          total_loss: 97.39595603942871
          vf_explained_var: 0.050319839268922806
          vf_loss: 97.39960606892903
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.6750000000000002
          cur_lr: 0.0002
          entropy: 1.1506167352199554
 

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00000,RUNNING,172.30.1.40:5318,0.0001,3000,4.0,66.6113,12000.0,-1.605,18.6,-37.5,100.0
PPO_MultiAgentArena_e0548_00001,RUNNING,172.30.1.40:5317,0.5,3000,4.0,79.305,12000.0,-35.76,3.0,-46.5,100.0
PPO_MultiAgentArena_e0548_00002,PENDING,,0.0001,4000,,,,,,,
PPO_MultiAgentArena_e0548_00003,PENDING,,0.5,4000,,,,,,,


Result for PPO_MultiAgentArena_e0548_00000:
  agent_timesteps_total: 30000
  custom_metrics: {}
  date: 2021-07-01_11-57-46
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 21.599999999999966
  episode_reward_mean: -0.6179999999999874
  episode_reward_min: -22.50000000000003
  episodes_this_iter: 30
  episodes_total: 150
  experiment_id: 109393779bf04cd2a072197ec47e0f68
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 1.258529340227445
          entropy_coeff: 0.0
          kl: 0.01582040269083033
          policy_loss: -0.06465477589517832
          total_loss: 20.755126516024273
          vf_explained_var: 0.4199785888195038
          vf_loss: 20.803763031959534
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

[2m[36m(pid=20990)[0m 2021-07-01 11:57:55,822	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Result for PPO_MultiAgentArena_e0548_00001:
  agent_timesteps_total: 30000
  custom_metrics: {}
  date: 2021-07-01_11-58-03
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -24.30000000000001
  episode_reward_mean: -38.703000000000046
  episode_reward_min: -46.500000000000064
  episodes_this_iter: 30
  episodes_total: 150
  experiment_id: 30c1148e916449e1884fd7c5abd4d048
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.1125
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: 0.0
          policy_loss: 0.0017244585712129872
          total_loss: 109.28126271565755
          vf_explained_var: 0.10310747474431992
          vf_loss: 109.27954006195068
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0002
          entropy: 1.0966935753822327

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00001,RUNNING,172.30.1.40:5317,0.5,3000,5.0,99.9738,15000.0,-38.703,-24.3,-46.5,100.0
PPO_MultiAgentArena_e0548_00002,RUNNING,,0.0001,4000,,,,,,,
PPO_MultiAgentArena_e0548_00003,PENDING,,0.5,4000,,,,,,,
PPO_MultiAgentArena_e0548_00000,TERMINATED,,0.0001,3000,5.0,82.8723,15000.0,-0.618,21.6,-22.5,100.0


[2m[36m(pid=21614)[0m 2021-07-01 11:58:12,422	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Result for PPO_MultiAgentArena_e0548_00002:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-07-01_11-58-25
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 12.000000000000023
  episode_reward_mean: -9.899999999999995
  episode_reward_min: -35.40000000000006
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: 165b12bdda7d4a6ba79c5cf4f1f16b99
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.2
          cur_lr: 0.0001
          entropy: 1.3582803755998611
          entropy_coeff: 0.0
          kl: 0.028701205796096474
          policy_loss: -0.059343935987271834
          total_loss: 34.44962179660797
          vf_explained_var: 0.11263073980808258
          vf_loss: 34.5032262802124
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.2
          cur_lr: 0.0002
          entropy: 1.34889

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00002,RUNNING,172.30.1.40:20990,0.0001,4000,1.0,22.9159,4000.0,-9.9,12.0,-35.4,100.0
PPO_MultiAgentArena_e0548_00003,RUNNING,,0.5,4000,,,,,,,
PPO_MultiAgentArena_e0548_00000,TERMINATED,,0.0001,3000,5.0,82.8723,15000.0,-0.618,21.6,-22.5,100.0
PPO_MultiAgentArena_e0548_00001,TERMINATED,,0.5,3000,5.0,99.9738,15000.0,-38.703,-24.3,-46.5,100.0




Result for PPO_MultiAgentArena_e0548_00003:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-07-01_11-58-46
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 11.700000000000019
  episode_reward_mean: -7.424999999999992
  episode_reward_min: -34.50000000000006
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: a5ea5be08c514c6a8d524ac08f699f01
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.2
          cur_lr: 0.5
          entropy: 0.06490051870840502
          entropy_coeff: 0.0
          kl: .inf
          policy_loss: 0.4366196487098932
          total_loss: .inf
          vf_explained_var: -0.060224324464797974
          vf_loss: 50.42529737949371
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.2
          cur_lr: 0.0002
          entropy: 1.3498334139585495
          entropy_c

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00002,RUNNING,172.30.1.40:20990,0.0001,4000,1,22.9159,4000,-9.9,12.0,-35.4,100
PPO_MultiAgentArena_e0548_00003,RUNNING,172.30.1.40:21614,0.5,4000,1,24.7548,4000,-7.425,11.7,-34.5,100
PPO_MultiAgentArena_e0548_00000,TERMINATED,,0.0001,3000,5,82.8723,15000,-0.618,21.6,-22.5,100
PPO_MultiAgentArena_e0548_00001,TERMINATED,,0.5,3000,5,99.9738,15000,-38.703,-24.3,-46.5,100


Result for PPO_MultiAgentArena_e0548_00002:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-07-01_11-58-48
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 15.000000000000021
  episode_reward_mean: -6.978749999999991
  episode_reward_min: -35.40000000000006
  episodes_this_iter: 40
  episodes_total: 80
  experiment_id: 165b12bdda7d4a6ba79c5cf4f1f16b99
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.30000000000000004
          cur_lr: 0.0001
          entropy: 1.3225731179118156
          entropy_coeff: 0.0
          kl: 0.029981161875184625
          policy_loss: -0.06853678141487762
          total_loss: 33.906187415122986
          vf_explained_var: 0.1982439011335373
          vf_loss: 33.9657301902771
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.30000000000000004
          cur_lr: 0

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00002,RUNNING,172.30.1.40:20990,0.0001,4000,3,65.9046,12000,-4.89,19.2,-30.0,100
PPO_MultiAgentArena_e0548_00003,RUNNING,172.30.1.40:21614,0.5,4000,1,24.7548,4000,-7.425,11.7,-34.5,100
PPO_MultiAgentArena_e0548_00000,TERMINATED,,0.0001,3000,5,82.8723,15000,-0.618,21.6,-22.5,100
PPO_MultiAgentArena_e0548_00001,TERMINATED,,0.5,3000,5,99.9738,15000,-38.703,-24.3,-46.5,100




Result for PPO_MultiAgentArena_e0548_00003:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-07-01_11-59-10
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 11.700000000000019
  episode_reward_mean: -29.692500000000035
  episode_reward_min: -58.50000000000009
  episodes_this_iter: 40
  episodes_total: 80
  experiment_id: a5ea5be08c514c6a8d524ac08f699f01
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.30000000000000004
          cur_lr: 0.5
          entropy: 0.03847502931260749
          entropy_coeff: 0.0
          kl: .inf
          policy_loss: 0.11948691852740012
          total_loss: .inf
          vf_explained_var: -0.054587654769420624
          vf_loss: 147.67479872703552
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.30000000000000004
          cur_lr: 0.0002
          entropy: 1.

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00002,RUNNING,172.30.1.40:20990,0.0001,4000,4,85.7141,16000,-2.394,19.5,-30.0,100
PPO_MultiAgentArena_e0548_00003,RUNNING,172.30.1.40:21614,0.5,4000,2,48.3912,8000,-29.6925,11.7,-58.5,100
PPO_MultiAgentArena_e0548_00000,TERMINATED,,0.0001,3000,5,82.8723,15000,-0.618,21.6,-22.5,100
PPO_MultiAgentArena_e0548_00001,TERMINATED,,0.5,3000,5,99.9738,15000,-38.703,-24.3,-46.5,100




Result for PPO_MultiAgentArena_e0548_00003:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-07-01_11-59-36
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 11.700000000000019
  episode_reward_mean: -45.675000000000075
  episode_reward_min: -60.0000000000001
  episodes_this_iter: 40
  episodes_total: 120
  experiment_id: a5ea5be08c514c6a8d524ac08f699f01
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.45000000000000007
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: .inf
          policy_loss: 0.003162487060762942
          total_loss: .inf
          vf_explained_var: 0.09282225370407104
          vf_loss: 88.48809289932251
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.45000000000000007
          cur_lr: 0.0002
          entropy: 1.3065645135939121
 

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00002,RUNNING,172.30.1.40:20990,0.0001,4000,4,85.7141,16000,-2.394,19.5,-30.0,100
PPO_MultiAgentArena_e0548_00003,RUNNING,172.30.1.40:21614,0.5,4000,3,74.1656,12000,-45.675,11.7,-60.0,100
PPO_MultiAgentArena_e0548_00000,TERMINATED,,0.0001,3000,5,82.8723,15000,-0.618,21.6,-22.5,100
PPO_MultiAgentArena_e0548_00001,TERMINATED,,0.5,3000,5,99.9738,15000,-38.703,-24.3,-46.5,100


Result for PPO_MultiAgentArena_e0548_00002:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-07-01_11-59-48
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.999999999999925
  episode_reward_mean: 1.35300000000001
  episode_reward_min: -18.299999999999983
  episodes_this_iter: 40
  episodes_total: 200
  experiment_id: 165b12bdda7d4a6ba79c5cf4f1f16b99
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.675
          cur_lr: 0.0001
          entropy: 1.2573885843157768
          entropy_coeff: 0.0
          kl: 0.020924198208376765
          policy_loss: -0.06594341617892496
          total_loss: 31.756824374198914
          vf_explained_var: 0.3943982422351837
          vf_loss: 31.808643698692322
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0002
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00002,RUNNING,172.30.1.40:20990,0.0001,4000,5,105.782,20000,1.353,24.0,-18.3,100
PPO_MultiAgentArena_e0548_00003,RUNNING,172.30.1.40:21614,0.5,4000,3,74.1656,12000,-45.675,11.7,-60.0,100
PPO_MultiAgentArena_e0548_00000,TERMINATED,,0.0001,3000,5,82.8723,15000,-0.618,21.6,-22.5,100
PPO_MultiAgentArena_e0548_00001,TERMINATED,,0.5,3000,5,99.9738,15000,-38.703,-24.3,-46.5,100


Result for PPO_MultiAgentArena_e0548_00003:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-07-01_12-00-00
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -45.90000000000006
  episode_reward_mean: -58.479000000000106
  episode_reward_min: -60.0000000000001
  episodes_this_iter: 40
  episodes_total: 160
  experiment_id: a5ea5be08c514c6a8d524ac08f699f01
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.675
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: 0.0
          policy_loss: 0.0031935407605487853
          total_loss: 99.45838046073914
          vf_explained_var: 0.11185324937105179
          vf_loss: 99.45518708229065
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.675
          cur_lr: 0.0002
          entropy: 1.2973405234515667
          entrop

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00003,RUNNING,172.30.1.40:21614,0.5,4000,4,98.7268,16000,-58.479,-45.9,-60.0,100
PPO_MultiAgentArena_e0548_00000,TERMINATED,,0.0001,3000,5,82.8723,15000,-0.618,21.6,-22.5,100
PPO_MultiAgentArena_e0548_00001,TERMINATED,,0.5,3000,5,99.9738,15000,-38.703,-24.3,-46.5,100
PPO_MultiAgentArena_e0548_00002,TERMINATED,,0.0001,4000,5,105.782,20000,1.353,24.0,-18.3,100


Result for PPO_MultiAgentArena_e0548_00003:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-07-01_12-00-24
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -58.50000000000009
  episode_reward_mean: -59.9250000000001
  episode_reward_min: -60.0000000000001
  episodes_this_iter: 40
  episodes_total: 200
  experiment_id: a5ea5be08c514c6a8d524ac08f699f01
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.3375
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: 0.0
          policy_loss: -0.007534682372352108
          total_loss: 150.69444930553436
          vf_explained_var: 0.10564549267292023
          vf_loss: 150.70198106765747
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.675
          cur_lr: 0.0002
          entropy: 1.2682304754853249
          entrop

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00003,RUNNING,172.30.1.40:21614,0.5,4000,5,122.003,20000,-59.925,-58.5,-60.0,100
PPO_MultiAgentArena_e0548_00000,TERMINATED,,0.0001,3000,5,82.8723,15000,-0.618,21.6,-22.5,100
PPO_MultiAgentArena_e0548_00001,TERMINATED,,0.5,3000,5,99.9738,15000,-38.703,-24.3,-46.5,100
PPO_MultiAgentArena_e0548_00002,TERMINATED,,0.0001,4000,5,105.782,20000,1.353,24.0,-18.3,100


| Trial name | status | loc | lr | train_batch_size | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_e0548_00000 | TERMINATED | | 0.0001 | 3000 | 5 | 82.8723 | 15000 | -0.618 | 21.6 | -22.5 | 100 |
| PPO_MultiAgentArena_e0548_00001 | TERMINATED | | 0.5 | 3000 | 5 | 99.9738 | 15000 | -38.703 | -24.3 | -46.5 | 100 |
| PPO_MultiAgentArena_e0548_00002 | TERMINATED | | 0.0001 | 4000 | 5 | 105.782 | 20000 | 1.353 | 24.0 | -18.3 | 100 |
| PPO_MultiAgentArena_e0548_00003 | TERMINATED | | 0.5 | 4000 | 5 | 122.003 | 20000 | -59.925 | -58.5 | -60.0 | 100 |


2021-07-01 12:00:24,393	INFO tune.py:549 -- Total run time: 259.64 seconds (259.40 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7f81a5d869a0>

### Why did we use 6 CPUs in the tune run above (3 CPUs per trial)?

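By default, PPO uses 2 "rollout" workers (`num_workers=2`). These are Ray actors that each hold their own copy of the environment and step through it in parallel. In addition to these two rollout workers, every RLlib Trainer always has a "local" worker, which in PPO's case handles the learning updates. That gives 3 workers in total (2 rollout workers + 1 local learner), which require 3 CPUs.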
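The CPU arithmetic above can be sketched in a few lines. This is only an illustration: the helper `cpus_per_trial` is hypothetical (not part of RLlib), and it assumes that every rollout worker and the local worker each request one CPU, which corresponds to leaving the `num_cpus_per_worker` and `num_cpus_for_driver` config keys at 1.

```python
def cpus_per_trial(num_workers: int,
                   cpus_per_worker: int = 1,
                   cpus_for_driver: int = 1) -> int:
    """Hypothetical helper: CPUs one trial needs under the assumptions above.

    num_workers     -- number of remote rollout workers
    cpus_per_worker -- CPUs requested by each rollout worker (assumed 1)
    cpus_for_driver -- CPUs requested by the local worker (assumed 1)
    """
    return num_workers * cpus_per_worker + cpus_for_driver


# PPO default: 2 rollout workers + 1 local worker.
ppo_default = cpus_per_trial(num_workers=2)

# With 6 CPUs available, this many 3-CPU trials fit at the same time.
concurrent_trials = 6 // ppo_default

# The setup asked for in Exercise No 2 (num_workers=5).
exercise_setup = cpus_per_trial(num_workers=5)

print(f"CPUs per default PPO trial: {ppo_default}")
print(f"Trials that fit on 6 CPUs:  {concurrent_trials}")
print(f"CPUs for num_workers=5:     {exercise_setup}")
```

Under these assumptions, raising `num_workers` to 5 means a single trial asks for 6 CPUs, so it occupies the whole 6-CPU budget by itself.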

## Exercise No 2

<hr />

Using the `tune_config` that we have built so far, let's run another `tune.run()`, but apply the following changes to our setup this time:
- Set up only 1 learning rate under the "lr" config key. Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set up only 1 train batch size under the "train_batch_size" config key. Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set `num_workers` to 5, which will allow us to run more environment "rollouts" in parallel and to collect training batches more quickly.
- Set the `num_envs_per_worker` config parameter to 5. This will clone our env on each rollout worker, and thus parallelize the action-computing forward passes through our neural networks.

Other than that, use the exact same args as in our `tune.run()` call in the previous cell.
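As a rough aid for the first two bullets, the final status table printed by the previous run can be read programmatically instead of by eye. The sketch below copies that table verbatim into a string and picks the row with the highest mean reward; it uses only the standard library, and the variable names (`TABLE`, `best`, `total_envs`) are made up for illustration. It also multiplies out the last two bullets to show how many env copies end up being stepped in parallel.

```python
import csv
import io

# Final status table from the previous tune.run(), copied from its output.
TABLE = """
Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_e0548_00000,TERMINATED,,0.0001,3000,5,82.8723,15000,-0.618,21.6,-22.5,100
PPO_MultiAgentArena_e0548_00001,TERMINATED,,0.5,3000,5,99.9738,15000,-38.703,-24.3,-46.5,100
PPO_MultiAgentArena_e0548_00002,TERMINATED,,0.0001,4000,5,105.782,20000,1.353,24.0,-18.3,100
PPO_MultiAgentArena_e0548_00003,TERMINATED,,0.5,4000,5,122.003,20000,-59.925,-58.5,-60.0,100
"""

# Parse the CSV-style table into one dict per trial.
rows = list(csv.DictReader(io.StringIO(TABLE.strip())))

# "reward" is the mean episode reward; take the trial where it is highest.
best = max(rows, key=lambda row: float(row["reward"]))
best_lr = float(best["lr"])
best_batch = int(best["train_batch_size"])

# Env copies stepped in parallel: workers x envs per worker.
num_workers = 5
num_envs_per_worker = 5
total_envs = num_workers * num_envs_per_worker

print(f"best trial: {best['Trial name']} (reward {best['reward']})")
print(f"lr={best_lr}, train_batch_size={best_batch}")
print(f"env copies running in parallel: {total_envs}")
```

Keep in mind that each trial ran for only 5 iterations, so "best" here is tentative, which is why the exercise says "seemingly".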


**Good luck! :)**


In [24]:
# !LIVE CODING!

# Solution to Exercise #2

# Run for longer this time (up to 180 iterations) and try to reach 60.0 reward (sum of both agents).
stop = {
    "training_iteration": 180,  # we have the 15min break now to run this many iterations
    "episode_reward_mean": 60.0,  # sum of both agents' rewards. Probably won't reach it, but we should try nevertheless :)
}

# tune_config.update({
# ???
# })

# analysis = tune.run(...)

tune_config["lr"] = 0.0001
tune_config["train_batch_size"] = 4000
tune_config["num_envs_per_worker"] = 5
tune_config["num_workers"] = 5

analysis = tune.run("PPO", config=tune_config, stop=stop, checkpoint_at_end=True, checkpoint_freq=5)

Trial name,status,loc
PPO_MultiAgentArena_7f8d1_00000,PENDING,


[2m[36m(pid=21670)[0m 2021-07-01 12:14:57,619	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=21670)[0m 2021-07-01 12:15:08,737	INFO trainable.py:101 -- Trainable.setup took 11.119 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-07-01_12-15-25
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 8.100000000000028
  episode_reward_mean: -6.767999999999992
  episode_reward_min: -31.50000000000002
  episodes_this_iter: 25
  episodes_total: 25
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.2
          cur_lr: 0.0001
          entropy: 1.3568887673318386
          entropy_coeff: 0.0
          kl: 0.03012266621226445
          policy_loss: -0.06361623853445053
          total_loss: 29.52478688955307
          vf_explained_var: 0.08373932540416718
          vf_loss: 29.582378208637238
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.2
          cur_lr: 0.0002
          entropy: 1.348378

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,1,16.7223,4000,-6.768,8.1,-31.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-07-01_12-15-41
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.099999999999973
  episode_reward_mean: -5.49199999999999
  episode_reward_min: -31.50000000000002
  episodes_this_iter: 50
  episodes_total: 75
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.30000000000000004
          cur_lr: 0.0001
          entropy: 1.3113512620329857
          entropy_coeff: 0.0
          kl: 0.03275543975178152
          policy_loss: -0.07503505887871142
          total_loss: 23.290098428726196
          vf_explained_var: 0.2802092730998993
          vf_loss: 23.355306833982468
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.30000000000000004
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,2,32.4361,8000,-5.492,23.1,-31.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-07-01_12-15-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.099999999999973
  episode_reward_mean: -3.26099999999999
  episode_reward_min: -31.50000000000002
  episodes_this_iter: 25
  episodes_total: 100
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.45000000000000007
          cur_lr: 0.0001
          entropy: 1.2861355915665627
          entropy_coeff: 0.0
          kl: 0.025299696542788297
          policy_loss: -0.0661568658251781
          total_loss: 22.09027224779129
          vf_explained_var: 0.3724067211151123
          vf_loss: 22.145043909549713
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.45000000000000007
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,3,47.5472,12000,-3.261,23.1,-31.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-07-01_12-16-12
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.099999999999973
  episode_reward_mean: 0.006000000000012518
  episode_reward_min: -21.00000000000004
  episodes_this_iter: 50
  episodes_total: 150
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.675
          cur_lr: 0.0001
          entropy: 1.2525263242423534
          entropy_coeff: 0.0
          kl: 0.02081197005463764
          policy_loss: -0.06522226356901228
          total_loss: 24.721259891986847
          vf_explained_var: 0.37142133712768555
          vf_loss: 24.772433876991272
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.675
          cur_lr: 0.0002
          entropy

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,4,64.1034,16000,0.006,23.1,-21,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-07-01_12-16-29
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 21.000000000000018
  episode_reward_mean: 1.2750000000000128
  episode_reward_min: -19.499999999999986
  episodes_this_iter: 50
  episodes_total: 200
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 1.2251441702246666
          entropy_coeff: 0.0
          kl: 0.01720927143469453
          policy_loss: -0.05809747794410214
          total_loss: 24.986693501472473
          vf_explained_var: 0.3580786883831024
          vf_loss: 25.027366757392883
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,5,80.422,20000,1.275,21,-19.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 48000
  custom_metrics: {}
  date: 2021-07-01_12-16-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 28.499999999999925
  episode_reward_mean: 1.8540000000000116
  episode_reward_min: -14.69999999999999
  episodes_this_iter: 25
  episodes_total: 225
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 1.2090138867497444
          entropy_coeff: 0.0
          kl: 0.016028896439820528
          policy_loss: -0.06064255003002472
          total_loss: 29.62217015028
          vf_explained_var: 0.4218500256538391
          vf_loss: 29.66658365726471
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,6,98.2084,24000,1.854,28.5,-14.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 56000
  custom_metrics: {}
  date: 2021-07-01_12-17-06
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 28.499999999999925
  episode_reward_mean: 3.2040000000000113
  episode_reward_min: -15.299999999999978
  episodes_this_iter: 50
  episodes_total: 275
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 1.1845766603946686
          entropy_coeff: 0.0
          kl: 0.01669146100175567
          policy_loss: -0.05655819133971818
          total_loss: 20.38133680820465
          vf_explained_var: 0.4689350724220276
          vf_loss: 20.420994997024536
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,7,117.386,28000,3.204,28.5,-15.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 64000
  custom_metrics: {}
  date: 2021-07-01_12-17-22
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 28.499999999999925
  episode_reward_mean: 3.4200000000000097
  episode_reward_min: -15.299999999999978
  episodes_this_iter: 25
  episodes_total: 300
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 1.168278556317091
          entropy_coeff: 0.0
          kl: 0.015423265751451254
          policy_loss: -0.05428478046087548
          total_loss: 30.142649054527283
          vf_explained_var: 0.33004117012023926
          vf_loss: 30.18131709098816
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,8,133.693,32000,3.42,28.5,-15.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 72000
  custom_metrics: {}
  date: 2021-07-01_12-17-39
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 36.599999999999945
  episode_reward_mean: 5.7450000000000045
  episode_reward_min: -14.099999999999978
  episodes_this_iter: 50
  episodes_total: 350
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 1.1381555162370205
          entropy_coeff: 0.0
          kl: 0.015697246795753017
          policy_loss: -0.05019960334175266
          total_loss: 31.474104523658752
          vf_explained_var: 0.3831806480884552
          vf_loss: 31.508410453796387
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,9,149.991,36000,5.745,36.6,-14.1,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 80000
  custom_metrics: {}
  date: 2021-07-01_12-17-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 36.599999999999945
  episode_reward_mean: 8.300999999999995
  episode_reward_min: -11.999999999999973
  episodes_this_iter: 50
  episodes_total: 400
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 1.116125252097845
          entropy_coeff: 0.0
          kl: 0.016089638404082507
          policy_loss: -0.050306459015700966
          total_loss: 33.64773416519165
          vf_explained_var: 0.2923678755760193
          vf_loss: 33.68174958229065
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,10,166.911,40000,8.301,36.6,-12,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 88000
  custom_metrics: {}
  date: 2021-07-01_12-18-13
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 27.599999999999888
  episode_reward_mean: 8.588999999999992
  episode_reward_min: -11.999999999999973
  episodes_this_iter: 25
  episodes_total: 425
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 1.0902166105806828
          entropy_coeff: 0.0
          kl: 0.017312415409833193
          policy_loss: -0.05841508507728577
          total_loss: 27.08051198720932
          vf_explained_var: 0.41637498140335083
          vf_loss: 27.121398210525513
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,11,184.122,44000,8.589,27.6,-12,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 96000
  custom_metrics: {}
  date: 2021-07-01_12-18-29
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 29.09999999999995
  episode_reward_mean: 10.145999999999987
  episode_reward_min: -11.69999999999998
  episodes_this_iter: 50
  episodes_total: 475
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 1.0569895319640636
          entropy_coeff: 0.0
          kl: 0.01619165571173653
          policy_loss: -0.05843629679293372
          total_loss: 39.908546686172485
          vf_explained_var: 0.31136554479599
          vf_loss: 39.950589179992676
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.00

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,12,200.406,48000,10.146,29.1,-11.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 104000
  custom_metrics: {}
  date: 2021-07-01_12-18-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 29.09999999999995
  episode_reward_mean: 12.422999999999979
  episode_reward_min: -7.799999999999984
  episodes_this_iter: 25
  episodes_total: 500
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 1.0248748362064362
          entropy_coeff: 0.0
          kl: 0.017148243845440447
          policy_loss: -0.058404994721058756
          total_loss: 39.854022204875946
          vf_explained_var: 0.3219492733478546
          vf_loss: 39.895064771175385
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,13,218.164,52000,12.423,29.1,-7.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 112000
  custom_metrics: {}
  date: 2021-07-01_12-19-05
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 29.09999999999995
  episode_reward_mean: 14.432999999999963
  episode_reward_min: -5.3999999999999915
  episodes_this_iter: 50
  episodes_total: 550
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.9973515067249537
          entropy_coeff: 0.0
          kl: 0.01615451308316551
          policy_loss: -0.05109630296647083
          total_loss: 31.19097602367401
          vf_explained_var: 0.37627577781677246
          vf_loss: 31.225715935230255
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,14,235.636,56000,14.433,29.1,-5.4,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 120000
  custom_metrics: {}
  date: 2021-07-01_12-19-21
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 29.099999999999923
  episode_reward_mean: 15.878999999999968
  episode_reward_min: 2.99482660892636e-14
  episodes_this_iter: 50
  episodes_total: 600
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.9702014122158289
          entropy_coeff: 0.0
          kl: 0.015383271791506559
          policy_loss: -0.05491414025891572
          total_loss: 36.91341292858124
          vf_explained_var: 0.3397338390350342
          vf_loss: 36.95275259017944
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,15,251.823,60000,15.879,29.1,2.99483e-14,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 128000
  custom_metrics: {}
  date: 2021-07-01_12-19-39
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.0999999999999
  episode_reward_mean: 16.493999999999957
  episode_reward_min: -2.999999999999973
  episodes_this_iter: 25
  episodes_total: 625
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.9621274322271347
          entropy_coeff: 0.0
          kl: 0.014050655881874263
          policy_loss: -0.047218539693858474
          total_loss: 37.175280690193176
          vf_explained_var: 0.38614946603775024
          vf_loss: 37.20827263593674
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,16,269.88,64000,16.494,32.1,-3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 136000
  custom_metrics: {}
  date: 2021-07-01_12-19-57
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.6999999999999
  episode_reward_mean: 18.242999999999952
  episode_reward_min: -2.999999999999973
  episodes_this_iter: 50
  episodes_total: 675
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.9346323776990175
          entropy_coeff: 0.0
          kl: 0.01375378860393539
          policy_loss: -0.0499732798198238
          total_loss: 29.570641458034515
          vf_explained_var: 0.42696326971054077
          vf_loss: 29.606688916683197
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,17,287.622,68000,18.243,32.7,-3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 144000
  custom_metrics: {}
  date: 2021-07-01_12-20-14
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.6999999999999
  episode_reward_mean: 18.926999999999946
  episode_reward_min: -2.999999999999973
  episodes_this_iter: 25
  episodes_total: 700
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.9122906383126974
          entropy_coeff: 0.0
          kl: 0.013836637837812304
          policy_loss: -0.043922056909650564
          total_loss: 59.79106044769287
          vf_explained_var: 0.24767276644706726
          vf_loss: 59.82097339630127
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,18,305.313,72000,18.927,32.7,-3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 152000
  custom_metrics: {}
  date: 2021-07-01_12-20-30
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 34.4999999999999
  episode_reward_mean: 19.904999999999944
  episode_reward_min: 3.2999999999999394
  episodes_this_iter: 50
  episodes_total: 750
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.9015404060482979
          entropy_coeff: 0.0
          kl: 0.013370373228099197
          policy_loss: -0.043550039059482515
          total_loss: 38.45504206418991
          vf_explained_var: 0.38878071308135986
          vf_loss: 38.48505401611328
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,19,321.005,76000,19.905,34.5,3.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 160000
  custom_metrics: {}
  date: 2021-07-01_12-20-49
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 35.39999999999989
  episode_reward_mean: 20.789999999999942
  episode_reward_min: -4.19999999999998
  episodes_this_iter: 50
  episodes_total: 800
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.893436374142766
          entropy_coeff: 0.0
          kl: 0.015498742839554325
          policy_loss: -0.054675333318300545
          total_loss: 45.45559549331665
          vf_explained_var: 0.32242220640182495
          vf_loss: 45.49457883834839
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,20,339.407,80000,20.79,35.4,-4.2,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 168000
  custom_metrics: {}
  date: 2021-07-01_12-21-07
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 35.39999999999989
  episode_reward_mean: 20.138999999999943
  episode_reward_min: -14.999999999999973
  episodes_this_iter: 25
  episodes_total: 825
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.8833394609391689
          entropy_coeff: 0.0
          kl: 0.013538038620026782
          policy_loss: -0.03531770597328432
          total_loss: 81.41321158409119
          vf_explained_var: 0.31171923875808716
          vf_loss: 81.43482375144958
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,21,357.9,84000,20.139,35.4,-15,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 176000
  custom_metrics: {}
  date: 2021-07-01_12-21-24
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 36.599999999999916
  episode_reward_mean: 21.305999999999937
  episode_reward_min: -14.999999999999973
  episodes_this_iter: 50
  episodes_total: 875
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.8898424785584211
          entropy_coeff: 0.0
          kl: 0.013627707463456318
          policy_loss: -0.05538107693428174
          total_loss: 47.56302630901337
          vf_explained_var: 0.38327717781066895
          vf_loss: 47.604610085487366
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,22,375.134,88000,21.306,36.6,-15,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 184000
  custom_metrics: {}
  date: 2021-07-01_12-21-42
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 36.599999999999916
  episode_reward_mean: 21.81899999999993
  episode_reward_min: -14.999999999999973
  episodes_this_iter: 25
  episodes_total: 900
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.8714092690497637
          entropy_coeff: 0.0
          kl: 0.012820605625165626
          policy_loss: -0.03884829790331423
          total_loss: 46.085367918014526
          vf_explained_var: 0.350822389125824
          vf_loss: 46.11123466491699
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,23,393.048,92000,21.819,36.6,-15,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 192000
  custom_metrics: {}
  date: 2021-07-01_12-21-59
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 35.99999999999991
  episode_reward_mean: 22.259999999999927
  episode_reward_min: -4.4999999999999725
  episodes_this_iter: 50
  episodes_total: 950
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.8506576735526323
          entropy_coeff: 0.0
          kl: 0.013460650690831244
          policy_loss: -0.04159090976463631
          total_loss: 39.805975914001465
          vf_explained_var: 0.4023236036300659
          vf_loss: 39.83393895626068
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,24,409.521,96000,22.26,36,-4.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 200000
  custom_metrics: {}
  date: 2021-07-01_12-22-15
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 34.79999999999993
  episode_reward_mean: 22.23899999999993
  episode_reward_min: -3.899999999999974
  episodes_this_iter: 50
  episodes_total: 1000
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.8361954726278782
          entropy_coeff: 0.0
          kl: 0.013801533525111154
          policy_loss: -0.042492512468015775
          total_loss: 39.49115043878555
          vf_explained_var: 0.3885577917098999
          vf_loss: 39.5196692943573
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,25,425.471,100000,22.239,34.8,-3.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 208000
  custom_metrics: {}
  date: 2021-07-01_12-22-32
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 39.599999999999916
  episode_reward_mean: 23.150999999999918
  episode_reward_min: -3.899999999999974
  episodes_this_iter: 25
  episodes_total: 1025
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.8104434404522181
          entropy_coeff: 0.0
          kl: 0.01443797501269728
          policy_loss: -0.04361645434983075
          total_loss: 54.72861862182617
          vf_explained_var: 0.3610352575778961
          vf_loss: 54.757617235183716
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,26,442.583,104000,23.151,39.6,-3.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 216000
  custom_metrics: {}
  date: 2021-07-01_12-22-50
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 39.599999999999916
  episode_reward_mean: 23.078999999999922
  episode_reward_min: 8.10000000000003
  episodes_this_iter: 50
  episodes_total: 1075
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.7827772051095963
          entropy_coeff: 0.0
          kl: 0.01187054009642452
          policy_loss: -0.04296684722066857
          total_loss: 58.35486912727356
          vf_explained_var: 0.3712236285209656
          vf_loss: 58.38581693172455
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,27,460.625,108000,23.079,39.6,8.1,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 224000
  custom_metrics: {}
  date: 2021-07-01_12-23-08
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 39.599999999999916
  episode_reward_mean: 23.51099999999992
  episode_reward_min: 8.10000000000003
  episodes_this_iter: 25
  episodes_total: 1100
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.755469124764204
          entropy_coeff: 0.0
          kl: 0.012736326403683051
          policy_loss: -0.04194001277210191
          total_loss: 69.02539026737213
          vf_explained_var: 0.3581518232822418
          vf_loss: 69.05443394184113
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.00

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,28,478.096,112000,23.511,39.6,8.1,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 232000
  custom_metrics: {}
  date: 2021-07-01_12-23-26
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 37.199999999999925
  episode_reward_mean: 23.77199999999992
  episode_reward_min: 8.10000000000003
  episodes_this_iter: 50
  episodes_total: 1150
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.7568658664822578
          entropy_coeff: 0.0
          kl: 0.011726018390618265
          policy_loss: -0.0369270934315864
          total_loss: 37.519820392131805
          vf_explained_var: 0.498402863740921
          vf_loss: 37.544874131679535
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,29,496.327,116000,23.772,37.2,8.1,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 240000
  custom_metrics: {}
  date: 2021-07-01_12-23-43
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 37.199999999999925
  episode_reward_mean: 24.143999999999913
  episode_reward_min: 6.600000000000022
  episodes_this_iter: 50
  episodes_total: 1200
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.7550983149558306
          entropy_coeff: 0.0
          kl: 0.013223929272498935
          policy_loss: -0.05174273130251095
          total_loss: 46.10219955444336
          vf_explained_var: 0.2923959493637085
          vf_loss: 46.140552282333374
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,30,513.229,120000,24.144,37.2,6.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 248000
  custom_metrics: {}
  date: 2021-07-01_12-24-00
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 37.199999999999925
  episode_reward_mean: 25.013999999999925
  episode_reward_min: 6.600000000000022
  episodes_this_iter: 25
  episodes_total: 1225
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.7478128559887409
          entropy_coeff: 0.0
          kl: 0.014537624316290021
          policy_loss: -0.04249158198945224
          total_loss: 58.531182169914246
          vf_explained_var: 0.4047550559043884
          vf_loss: 58.55895435810089
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,31,529.872,124000,25.014,37.2,6.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 256000
  custom_metrics: {}
  date: 2021-07-01_12-24-17
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 38.6999999999999
  episode_reward_mean: 25.439999999999912
  episode_reward_min: 6.600000000000022
  episodes_this_iter: 50
  episodes_total: 1275
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.7493421267718077
          entropy_coeff: 0.0
          kl: 0.011357261042576283
          policy_loss: -0.032791847304906696
          total_loss: 43.244035959243774
          vf_explained_var: 0.46728771924972534
          vf_loss: 43.265329122543335
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,32,546.772,128000,25.44,38.7,6.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 264000
  custom_metrics: {}
  date: 2021-07-01_12-24-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 38.69999999999992
  episode_reward_mean: 26.24399999999991
  episode_reward_min: 13.199999999999976
  episodes_this_iter: 25
  episodes_total: 1300
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.7520488016307354
          entropy_coeff: 0.0
          kl: 0.014453867683187127
          policy_loss: -0.046342017958522774
          total_loss: 47.91534101963043
          vf_explained_var: 0.31279441714286804
          vf_loss: 47.94704806804657
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,33,564.337,132000,26.244,38.7,13.2,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 272000
  custom_metrics: {}
  date: 2021-07-01_12-24-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 38.69999999999992
  episode_reward_mean: 24.782999999999916
  episode_reward_min: -15.000000000000016
  episodes_this_iter: 50
  episodes_total: 1350
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.733967112377286
          entropy_coeff: 0.0
          kl: 0.010923820751486346
          policy_loss: -0.03539090038975701
          total_loss: 48.204896569252014
          vf_explained_var: 0.4383675754070282
          vf_loss: 48.22922682762146
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,34,582.065,136000,24.783,38.7,-15,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 280000
  custom_metrics: {}
  date: 2021-07-01_12-25-10
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 41.69999999999991
  episode_reward_mean: 23.297999999999917
  episode_reward_min: -15.000000000000016
  episodes_this_iter: 50
  episodes_total: 1400
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.7258352134376764
          entropy_coeff: 0.0
          kl: 0.012346163217443973
          policy_loss: -0.03469091281294823
          total_loss: 41.05376237630844
          vf_explained_var: 0.3798743188381195
          vf_loss: 41.075952887535095
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,35,599.647,140000,23.298,41.7,-15,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 288000
  custom_metrics: {}
  date: 2021-07-01_12-25-28
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 41.9999999999999
  episode_reward_mean: 24.611999999999917
  episode_reward_min: -15.000000000000016
  episodes_this_iter: 25
  episodes_total: 1425
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.7090427149087191
          entropy_coeff: 0.0
          kl: 0.011347666091751307
          policy_loss: -0.03905441757524386
          total_loss: 54.887192249298096
          vf_explained_var: 0.3882172107696533
          vf_loss: 54.91475749015808
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,36,618.055,144000,24.612,42,-15,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 296000
  custom_metrics: {}
  date: 2021-07-01_12-25-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.19999999999991
  episode_reward_mean: 25.700999999999922
  episode_reward_min: 6.599999999999946
  episodes_this_iter: 50
  episodes_total: 1475
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.6513062454760075
          entropy_coeff: 0.0
          kl: 0.010608064330881462
          policy_loss: -0.03765065909828991
          total_loss: 83.23577332496643
          vf_explained_var: 0.3778274357318878
          vf_loss: 83.26268267631531
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,37,637.367,148000,25.701,46.2,6.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 304000
  custom_metrics: {}
  date: 2021-07-01_12-26-05
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.19999999999991
  episode_reward_mean: 26.54399999999992
  episode_reward_min: 10.799999999999931
  episodes_this_iter: 25
  episodes_total: 1500
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.6578108109533787
          entropy_coeff: 0.0
          kl: 0.01142730432911776
          policy_loss: -0.04072327809990384
          total_loss: 93.50556802749634
          vf_explained_var: 0.29439038038253784
          vf_loss: 93.53472149372101
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,38,654.955,152000,26.544,46.2,10.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 312000
  custom_metrics: {}
  date: 2021-07-01_12-26-24
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 41.099999999999916
  episode_reward_mean: 26.807999999999925
  episode_reward_min: 11.699999999999978
  episodes_this_iter: 50
  episodes_total: 1550
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.6529938206076622
          entropy_coeff: 0.0
          kl: 0.01205019476765301
          policy_loss: -0.03559972153743729
          total_loss: 74.10644471645355
          vf_explained_var: 0.39646294713020325
          vf_loss: 74.12984478473663
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,39,674.302,156000,26.808,41.1,11.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 320000
  custom_metrics: {}
  date: 2021-07-01_12-26-42
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 42.29999999999994
  episode_reward_mean: 27.401999999999926
  episode_reward_min: 9.300000000000015
  episodes_this_iter: 50
  episodes_total: 1600
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.6171208284795284
          entropy_coeff: 0.0
          kl: 0.010537848487729207
          policy_loss: -0.038302931119687855
          total_loss: 98.75880146026611
          vf_explained_var: 0.3327019214630127
          vf_loss: 98.78643572330475
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,40,691.905,160000,27.402,42.3,9.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 328000
  custom_metrics: {}
  date: 2021-07-01_12-26-58
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 42.29999999999994
  episode_reward_mean: 26.759999999999923
  episode_reward_min: -5.399999999999974
  episodes_this_iter: 25
  episodes_total: 1625
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.608578335493803
          entropy_coeff: 0.0
          kl: 0.010056228551547974
          policy_loss: -0.039720732427667826
          total_loss: 126.71490669250488
          vf_explained_var: 0.3913487195968628
          vf_loss: 126.74444818496704
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,41,708.004,164000,26.76,42.3,-5.4,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 336000
  custom_metrics: {}
  date: 2021-07-01_12-27-14
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 41.0999999999999
  episode_reward_mean: 26.768999999999924
  episode_reward_min: -5.399999999999974
  episodes_this_iter: 50
  episodes_total: 1675
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.6082373224198818
          entropy_coeff: 0.0
          kl: 0.010465923434821889
          policy_loss: -0.03888287820154801
          total_loss: 90.99619698524475
          vf_explained_var: 0.40143096446990967
          vf_loss: 91.02448320388794
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,42,723.608,168000,26.769,41.1,-5.4,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 344000
  custom_metrics: {}
  date: 2021-07-01_12-27-30
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 41.0999999999999
  episode_reward_mean: 26.354999999999926
  episode_reward_min: -5.399999999999974
  episodes_this_iter: 25
  episodes_total: 1700
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.6330260615795851
          entropy_coeff: 0.0
          kl: 0.011587152344873175
          policy_loss: -0.04267140389129054
          total_loss: 55.799885392189026
          vf_explained_var: 0.3577066659927368
          vf_loss: 55.83082449436188
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,43,739.542,172000,26.355,41.1,-5.4,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 352000
  custom_metrics: {}
  date: 2021-07-01_12-27-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 44.099999999999895
  episode_reward_mean: 26.16299999999992
  episode_reward_min: 10.20000000000001
  episodes_this_iter: 50
  episodes_total: 1750
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.6044534202665091
          entropy_coeff: 0.0
          kl: 0.009882264173938893
          policy_loss: -0.03455116726399865
          total_loss: 63.919331073760986
          vf_explained_var: 0.4511594772338867
          vf_loss: 63.943875193595886
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,44,756.293,176000,26.163,44.1,10.2,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 360000
  custom_metrics: {}
  date: 2021-07-01_12-28-05
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 44.099999999999945
  episode_reward_mean: 26.936999999999916
  episode_reward_min: 10.199999999999916
  episodes_this_iter: 50
  episodes_total: 1800
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5773807624354959
          entropy_coeff: 0.0
          kl: 0.01096820883685723
          policy_loss: -0.03606245148694143
          total_loss: 67.6617283821106
          vf_explained_var: 0.41563931107521057
          vf_loss: 67.68668520450592
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,45,774.134,180000,26.937,44.1,10.2,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 368000
  custom_metrics: {}
  date: 2021-07-01_12-28-24
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 44.099999999999945
  episode_reward_mean: 27.71399999999992
  episode_reward_min: 9.599999999999929
  episodes_this_iter: 25
  episodes_total: 1825
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.575436856597662
          entropy_coeff: 0.0
          kl: 0.010584337898762897
          policy_loss: -0.029876374581363052
          total_loss: 99.357541680336
          vf_explained_var: 0.4079066514968872
          vf_loss: 99.37670063972473
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.00

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,46,793.549,184000,27.714,44.1,9.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 376000
  custom_metrics: {}
  date: 2021-07-01_12-28-40
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 44.099999999999945
  episode_reward_mean: 27.014999999999926
  episode_reward_min: -4.73232564246473e-14
  episodes_this_iter: 50
  episodes_total: 1875
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5624528033658862
          entropy_coeff: 0.0
          kl: 0.010137731253053062
          policy_loss: -0.029504501959308982
          total_loss: 90.40749025344849
          vf_explained_var: 0.4197303056716919
          vf_loss: 90.42673194408417
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,47,809.291,188000,27.015,44.1,-4.73233e-14,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 384000
  custom_metrics: {}
  date: 2021-07-01_12-28-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 43.49999999999994
  episode_reward_mean: 26.942999999999916
  episode_reward_min: -4.73232564246473e-14
  episodes_this_iter: 25
  episodes_total: 1900
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5556779690086842
          entropy_coeff: 0.0
          kl: 0.010142985032871366
          policy_loss: -0.03942727667163126
          total_loss: 108.74274230003357
          vf_explained_var: 0.32439902424812317
          vf_loss: 108.77189946174622
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,48,825.312,192000,26.943,43.5,-4.73233e-14,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 392000
  custom_metrics: {}
  date: 2021-07-01_12-29-13
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 44.99999999999989
  episode_reward_mean: 26.97299999999991
  episode_reward_min: 11.099999999999975
  episodes_this_iter: 50
  episodes_total: 1950
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5412654019892216
          entropy_coeff: 0.0
          kl: 0.01096715868334286
          policy_loss: -0.037435822290717624
          total_loss: 50.55959916114807
          vf_explained_var: 0.5152097344398499
          vf_loss: 50.58593064546585
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,49,842.617,196000,26.973,45,11.1,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 400000
  custom_metrics: {}
  date: 2021-07-01_12-29-31
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.29999999999989
  episode_reward_mean: 27.608999999999913
  episode_reward_min: 11.099999999999923
  episodes_this_iter: 50
  episodes_total: 2000
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5460654096677899
          entropy_coeff: 0.0
          kl: 0.011444914096500725
          policy_loss: -0.03660273423884064
          total_loss: 51.23738205432892
          vf_explained_var: 0.42363882064819336
          vf_loss: 51.262396693229675
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,50,860.614,200000,27.609,45.3,11.1,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 408000
  custom_metrics: {}
  date: 2021-07-01_12-29-50
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.29999999999989
  episode_reward_mean: 27.25799999999991
  episode_reward_min: -0.9000000000000057
  episodes_this_iter: 25
  episodes_total: 2025
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5393119771033525
          entropy_coeff: 0.0
          kl: 0.010714047908550128
          policy_loss: -0.031208670377964154
          total_loss: 58.7618545293808
          vf_explained_var: 0.42772674560546875
          vf_loss: 58.782214522361755
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,51,879.565,204000,27.258,45.3,-0.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 416000
  custom_metrics: {}
  date: 2021-07-01_12-30-08
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.29999999999989
  episode_reward_mean: 26.672999999999917
  episode_reward_min: -0.9000000000000057
  episodes_this_iter: 50
  episodes_total: 2075
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5260745612904429
          entropy_coeff: 0.0
          kl: 0.010782082536024973
          policy_loss: -0.038400133460527286
          total_loss: 45.476948857307434
          vf_explained_var: 0.48442479968070984
          vf_loss: 45.504432022571564
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,52,897.399,208000,26.673,45.3,-0.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 424000
  custom_metrics: {}
  date: 2021-07-01_12-30-25
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 42.899999999999906
  episode_reward_mean: 26.120999999999913
  episode_reward_min: -0.9000000000000057
  episodes_this_iter: 25
  episodes_total: 2100
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5207427088171244
          entropy_coeff: 0.0
          kl: 0.009903548183501698
          policy_loss: -0.030531662399880588
          total_loss: 64.98694276809692
          vf_explained_var: 0.3759056627750397
          vf_loss: 65.00744581222534
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,53,914.11,212000,26.121,42.9,-0.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 432000
  custom_metrics: {}
  date: 2021-07-01_12-30-43
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 44.699999999999896
  episode_reward_mean: 27.581999999999912
  episode_reward_min: 9.299999999999978
  episodes_this_iter: 50
  episodes_total: 2150
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.532847173511982
          entropy_coeff: 0.0
          kl: 0.011965885641984642
          policy_loss: -0.0397716409934219
          total_loss: 47.60154092311859
          vf_explained_var: 0.44051337242126465
          vf_loss: 47.62919735908508
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,54,932.338,216000,27.582,44.7,9.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 440000
  custom_metrics: {}
  date: 2021-07-01_12-31-01
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 44.699999999999896
  episode_reward_mean: 28.91999999999991
  episode_reward_min: 9.299999999999978
  episodes_this_iter: 50
  episodes_total: 2200
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5267959600314498
          entropy_coeff: 0.0
          kl: 0.011378162686014548
          policy_loss: -0.04015610304486472
          total_loss: 51.2565575838089
          vf_explained_var: 0.38757336139678955
          vf_loss: 51.28519296646118
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,55,949.484,220000,28.92,44.7,9.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 448000
  custom_metrics: {}
  date: 2021-07-01_12-31-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 43.79999999999989
  episode_reward_mean: 28.49399999999991
  episode_reward_min: 10.199999999999973
  episodes_this_iter: 25
  episodes_total: 2225
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.518477720208466
          entropy_coeff: 0.0
          kl: 0.0101266161800595
          policy_loss: -0.028769243916030973
          total_loss: 58.48160147666931
          vf_explained_var: 0.4852888584136963
          vf_loss: 58.500118136405945
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,56,968.443,224000,28.494,43.8,10.2,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 456000
  custom_metrics: {}
  date: 2021-07-01_12-31-38
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 43.79999999999989
  episode_reward_mean: 28.77299999999991
  episode_reward_min: 10.199999999999973
  episodes_this_iter: 50
  episodes_total: 2275
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5084627205505967
          entropy_coeff: 0.0
          kl: 0.010730282490840182
          policy_loss: -0.04745615113642998
          total_loss: 43.285601019859314
          vf_explained_var: 0.46620872616767883
          vf_loss: 43.322192907333374
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,57,986.468,228000,28.773,43.8,10.2,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 464000
  custom_metrics: {}
  date: 2021-07-01_12-31-57
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.899999999999906
  episode_reward_mean: 28.511999999999908
  episode_reward_min: 11.999999999999925
  episodes_this_iter: 25
  episodes_total: 2300
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.5033513372763991
          entropy_coeff: 0.0
          kl: 0.011761668807594106
          policy_loss: -0.037840775679796934
          total_loss: 49.8784402012825
          vf_explained_var: 0.3748028874397278
          vf_loss: 49.904372692108154
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,58,1005.5,232000,28.512,45.9,12,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 472000
  custom_metrics: {}
  date: 2021-07-01_12-32-16
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.899999999999906
  episode_reward_mean: 28.925999999999913
  episode_reward_min: 0.30000000000000937
  episodes_this_iter: 50
  episodes_total: 2350
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.499114696867764
          entropy_coeff: 0.0
          kl: 0.01040422500227578
          policy_loss: -0.030269727867562324
          total_loss: 39.088170289993286
          vf_explained_var: 0.5453683137893677
          vf_loss: 39.10790550708771
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,59,1024.66,236000,28.926,45.9,0.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 480000
  custom_metrics: {}
  date: 2021-07-01_12-32-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.599999999999895
  episode_reward_mean: 28.841999999999906
  episode_reward_min: 0.30000000000000937
  episodes_this_iter: 50
  episodes_total: 2400
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.4971405565738678
          entropy_coeff: 0.0
          kl: 0.009180304448818788
          policy_loss: -0.03316359795280732
          total_loss: 40.95639330148697
          vf_explained_var: 0.4539433717727661
          vf_loss: 40.98026245832443
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,60,1042.8,240000,28.842,45.6,0.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 488000
  custom_metrics: {}
  date: 2021-07-01_12-32-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.599999999999895
  episode_reward_mean: 28.529999999999905
  episode_reward_min: 0.30000000000000937
  episodes_this_iter: 25
  episodes_total: 2425
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.4604586251080036
          entropy_coeff: 0.0
          kl: 0.00968187366379425
          policy_loss: -0.03634499783220235
          total_loss: 54.27969813346863
          vf_explained_var: 0.4854733347892761
          vf_loss: 54.30624163150787
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,61,1060.18,244000,28.53,45.6,0.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 496000
  custom_metrics: {}
  date: 2021-07-01_12-33-10
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.299999999999905
  episode_reward_mean: 29.345999999999908
  episode_reward_min: 6.899999999999972
  episodes_this_iter: 50
  episodes_total: 2475
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.45990800578147173
          entropy_coeff: 0.0
          kl: 0.009412181709194556
          policy_loss: -0.03621422732248902
          total_loss: 47.44175672531128
          vf_explained_var: 0.4747450351715088
          vf_loss: 47.4684419631958
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,62,1078.32,248000,29.346,48.3,6.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 504000
  custom_metrics: {}
  date: 2021-07-01_12-33-28
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.99999999999989
  episode_reward_mean: 29.91899999999991
  episode_reward_min: 9.299999999999946
  episodes_this_iter: 25
  episodes_total: 2500
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.45996512845158577
          entropy_coeff: 0.0
          kl: 0.009708321245852858
          policy_loss: -0.029493181966245174
          total_loss: 57.936488032341
          vf_explained_var: 0.32221248745918274
          vf_loss: 57.956151843070984
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,63,1096.17,252000,29.919,51,9.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 512000
  custom_metrics: {}
  date: 2021-07-01_12-33-46
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.99999999999989
  episode_reward_mean: 29.843999999999905
  episode_reward_min: 4.7999999999999705
  episodes_this_iter: 50
  episodes_total: 2550
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.4333848813548684
          entropy_coeff: 0.0
          kl: 0.011029368717572652
          policy_loss: -0.03199455395224504
          total_loss: 50.58541393280029
          vf_explained_var: 0.49820467829704285
          vf_loss: 50.606241106987
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,64,1114.93,256000,29.844,51,4.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 520000
  custom_metrics: {}
  date: 2021-07-01_12-34-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 44.99999999999989
  episode_reward_mean: 30.506999999999913
  episode_reward_min: 4.7999999999999705
  episodes_this_iter: 50
  episodes_total: 2600
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.4466161960735917
          entropy_coeff: 0.0
          kl: 0.009506653674179688
          policy_loss: -0.02882041031261906
          total_loss: 48.38812077045441
          vf_explained_var: 0.38340672850608826
          vf_loss: 48.40731596946716
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,65,1131.77,260000,30.507,45,4.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 528000
  custom_metrics: {}
  date: 2021-07-01_12-34-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 44.99999999999989
  episode_reward_mean: 31.460999999999903
  episode_reward_min: 13.49999999999991
  episodes_this_iter: 25
  episodes_total: 2625
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.41558741219341755
          entropy_coeff: 0.0
          kl: 0.008428301807725802
          policy_loss: -0.03117535766796209
          total_loss: 43.22935688495636
          vf_explained_var: 0.5371391177177429
          vf_loss: 43.25199830532074
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,66,1148.34,264000,31.461,45,13.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 536000
  custom_metrics: {}
  date: 2021-07-01_12-34-36
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.89999999999989
  episode_reward_mean: 31.664999999999896
  episode_reward_min: 16.499999999999922
  episodes_this_iter: 50
  episodes_total: 2675
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.41459635086357594
          entropy_coeff: 0.0
          kl: 0.008136734672007151
          policy_loss: -0.02910262248769868
          total_loss: 54.96034324169159
          vf_explained_var: 0.47449517250061035
          vf_loss: 54.98120844364166
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,67,1164.61,268000,31.665,48.9,16.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 544000
  custom_metrics: {}
  date: 2021-07-01_12-34-53
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.89999999999989
  episode_reward_mean: 31.4129999999999
  episode_reward_min: 10.79999999999995
  episodes_this_iter: 25
  episodes_total: 2700
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.4151736833155155
          entropy_coeff: 0.0
          kl: 0.009356792696053162
          policy_loss: -0.02684480749303475
          total_loss: 46.228831708431244
          vf_explained_var: 0.4356357157230377
          vf_loss: 46.24620372056961
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,68,1181.41,272000,31.413,48.9,10.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 552000
  custom_metrics: {}
  date: 2021-07-01_12-35-12
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.89999999999989
  episode_reward_mean: 31.427999999999912
  episode_reward_min: 10.79999999999995
  episodes_this_iter: 50
  episodes_total: 2750
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.39950852654874325
          entropy_coeff: 0.0
          kl: 0.008149424218572676
          policy_loss: -0.031159012636635453
          total_loss: 37.72542840242386
          vf_explained_var: 0.550269365310669
          vf_loss: 37.74833619594574
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,69,1200.27,276000,31.428,48.9,10.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 560000
  custom_metrics: {}
  date: 2021-07-01_12-35-31
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.5999999999999
  episode_reward_mean: 31.0379999999999
  episode_reward_min: 12.899999999999928
  episodes_this_iter: 50
  episodes_total: 2800
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.40106165409088135
          entropy_coeff: 0.0
          kl: 0.008346112736035138
          policy_loss: -0.02956718095811084
          total_loss: 45.721173882484436
          vf_explained_var: 0.4126655161380768
          vf_loss: 45.74229061603546
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,70,1219.21,280000,31.038,45.6,12.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 568000
  custom_metrics: {}
  date: 2021-07-01_12-35-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 53.099999999999895
  episode_reward_mean: 31.709999999999905
  episode_reward_min: 15.299999999999908
  episodes_this_iter: 25
  episodes_total: 2825
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3888206984847784
          entropy_coeff: 0.0
          kl: 0.008587263888330199
          policy_loss: -0.028097990056267008
          total_loss: 51.13039267063141
          vf_explained_var: 0.5088303089141846
          vf_loss: 51.14979660511017
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,71,1235.41,284000,31.71,53.1,15.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 576000
  custom_metrics: {}
  date: 2021-07-01_12-36-05
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 53.099999999999895
  episode_reward_mean: 31.403999999999904
  episode_reward_min: 11.699999999999932
  episodes_this_iter: 50
  episodes_total: 2875
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3801783425733447
          entropy_coeff: 0.0
          kl: 0.00835933315102011
          policy_loss: -0.025995594056439586
          total_loss: 42.57580357789993
          vf_explained_var: 0.5453577637672424
          vf_loss: 42.59333539009094
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,72,1252.77,288000,31.404,53.1,11.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 584000
  custom_metrics: {}
  date: 2021-07-01_12-36-21
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 53.099999999999895
  episode_reward_mean: 31.214999999999907
  episode_reward_min: 11.699999999999932
  episodes_this_iter: 25
  episodes_total: 2900
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.40327494870871305
          entropy_coeff: 0.0
          kl: 0.012119251972762868
          policy_loss: -0.029202532081399113
          total_loss: 53.28054475784302
          vf_explained_var: 0.36442506313323975
          vf_loss: 53.297476291656494
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,73,1269.08,292000,31.215,53.1,11.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 592000
  custom_metrics: {}
  date: 2021-07-01_12-36-37
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.99999999999989
  episode_reward_mean: 31.166999999999906
  episode_reward_min: 11.699999999999932
  episodes_this_iter: 50
  episodes_total: 2950
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.37697332352399826
          entropy_coeff: 0.0
          kl: 0.009029634325997904
          policy_loss: -0.022354075117618777
          total_loss: 41.856142938137054
          vf_explained_var: 0.5243224501609802
          vf_loss: 41.86935383081436
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,74,1284.64,296000,31.167,48,11.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 600000
  custom_metrics: {}
  date: 2021-07-01_12-36-53
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.99999999999989
  episode_reward_mean: 33.19499999999991
  episode_reward_min: 15.599999999999929
  episodes_this_iter: 50
  episodes_total: 3000
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.38701165188103914
          entropy_coeff: 0.0
          kl: 0.009140318594290875
          policy_loss: -0.02292408555513248
          total_loss: 42.5873920917511
          vf_explained_var: 0.4088112711906433
          vf_loss: 42.6010617017746
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,75,1301.25,300000,33.195,48,15.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 608000
  custom_metrics: {}
  date: 2021-07-01_12-37-11
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.99999999999989
  episode_reward_mean: 33.74399999999991
  episode_reward_min: 15.599999999999929
  episodes_this_iter: 25
  episodes_total: 3025
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3619242087006569
          entropy_coeff: 0.0
          kl: 0.008744300954276696
          policy_loss: -0.02794624667149037
          total_loss: 38.97281110286713
          vf_explained_var: 0.543172299861908
          vf_loss: 38.99190402030945
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,76,1318.74,304000,33.744,48,15.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 616000
  custom_metrics: {}
  date: 2021-07-01_12-37-26
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.1999999999999
  episode_reward_mean: 32.56499999999991
  episode_reward_min: 14.099999999999952
  episodes_this_iter: 50
  episodes_total: 3075
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.37443354818969965
          entropy_coeff: 0.0
          kl: 0.00910986056260299
          policy_loss: -0.02700778329744935
          total_loss: 41.13952624797821
          vf_explained_var: 0.5260223150253296
          vf_loss: 41.157310128211975
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,77,1333.71,308000,32.565,46.2,14.1,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 624000
  custom_metrics: {}
  date: 2021-07-01_12-37-40
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.2999999999999
  episode_reward_mean: 31.937999999999906
  episode_reward_min: 7.500000000000012
  episodes_this_iter: 25
  episodes_total: 3100
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3730168053880334
          entropy_coeff: 0.0
          kl: 0.009844091939157806
          policy_loss: -0.031390396296046674
          total_loss: 48.65114152431488
          vf_explained_var: 0.4138753414154053
          vf_loss: 48.6725652217865
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,78,1348.4,312000,31.938,48.3,7.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 632000
  custom_metrics: {}
  date: 2021-07-01_12-37-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.2999999999999
  episode_reward_mean: 32.27999999999991
  episode_reward_min: 7.500000000000012
  episodes_this_iter: 50
  episodes_total: 3150
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3438447033986449
          entropy_coeff: 0.0
          kl: 0.007547226006863639
          policy_loss: -0.03104245100985281
          total_loss: 34.29871141910553
          vf_explained_var: 0.6018577814102173
          vf_loss: 34.32211208343506
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.00

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,79,1363.57,316000,32.28,48.3,7.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 640000
  custom_metrics: {}
  date: 2021-07-01_12-38-11
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.799999999999905
  episode_reward_mean: 33.128999999999905
  episode_reward_min: 14.39999999999996
  episodes_this_iter: 50
  episodes_total: 3200
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3706224812194705
          entropy_coeff: 0.0
          kl: 0.010421168219181709
          policy_loss: -0.035137413564370945
          total_loss: 36.567643105983734
          vf_explained_var: 0.46790772676467896
          vf_loss: 36.59222894906998
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,80,1379,320000,33.129,49.8,14.4,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 648000
  custom_metrics: {}
  date: 2021-07-01_12-38-28
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.799999999999905
  episode_reward_mean: 33.2609999999999
  episode_reward_min: 14.39999999999996
  episodes_this_iter: 25
  episodes_total: 3225
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3481900170445442
          entropy_coeff: 0.0
          kl: 0.008395638724323362
          policy_loss: -0.02565316326217726
          total_loss: 37.0484356880188
          vf_explained_var: 0.574562668800354
          vf_loss: 37.065587759017944
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.00

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,81,1395.61,324000,33.261,49.8,14.4,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 656000
  custom_metrics: {}
  date: 2021-07-01_12-38-44
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.09999999999991
  episode_reward_mean: 32.036999999999914
  episode_reward_min: 10.799999999999939
  episodes_this_iter: 50
  episodes_total: 3275
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3373169479891658
          entropy_coeff: 0.0
          kl: 0.008734833361813799
          policy_loss: -0.0339308048132807
          total_loss: 29.622411727905273
          vf_explained_var: 0.6293529272079468
          vf_loss: 29.647498726844788
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,82,1411.86,328000,32.037,47.1,10.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 664000
  custom_metrics: {}
  date: 2021-07-01_12-39-01
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.4999999999999
  episode_reward_mean: 32.38199999999991
  episode_reward_min: 10.799999999999939
  episodes_this_iter: 25
  episodes_total: 3300
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.34206255059689283
          entropy_coeff: 0.0
          kl: 0.007730686294962652
          policy_loss: -0.026720883310190402
          total_loss: 40.78951561450958
          vf_explained_var: 0.41514626145362854
          vf_loss: 40.80840760469437
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,83,1428.97,332000,32.382,49.5,10.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 672000
  custom_metrics: {}
  date: 2021-07-01_12-39-18
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.4999999999999
  episode_reward_mean: 32.41199999999991
  episode_reward_min: 10.799999999999939
  episodes_this_iter: 50
  episodes_total: 3350
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3181705502793193
          entropy_coeff: 0.0
          kl: 0.00733953551389277
          policy_loss: -0.026901803154032677
          total_loss: 38.14650648832321
          vf_explained_var: 0.5793315768241882
          vf_loss: 38.165977120399475
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,84,1445.94,336000,32.412,49.5,10.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 680000
  custom_metrics: {}
  date: 2021-07-01_12-39-36
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 43.79999999999991
  episode_reward_mean: 32.8589999999999
  episode_reward_min: 18.29999999999989
  episodes_this_iter: 50
  episodes_total: 3400
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3206331096589565
          entropy_coeff: 0.0
          kl: 0.007782569402479567
          policy_loss: -0.023343132925219834
          total_loss: 35.257325530052185
          vf_explained_var: 0.5374382138252258
          vf_loss: 35.272788286209106
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,85,1463.32,340000,32.859,43.8,18.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 688000
  custom_metrics: {}
  date: 2021-07-01_12-39-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.5999999999999
  episode_reward_mean: 33.17999999999991
  episode_reward_min: 16.499999999999922
  episodes_this_iter: 25
  episodes_total: 3425
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.30223307851701975
          entropy_coeff: 0.0
          kl: 0.01105104829184711
          policy_loss: -0.03933554410468787
          total_loss: 61.824514746665955
          vf_explained_var: 0.5247037410736084
          vf_loss: 61.85266041755676
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,86,1479.81,344000,33.18,45.6,16.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 696000
  custom_metrics: {}
  date: 2021-07-01_12-40-08
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.6999999999999
  episode_reward_mean: 33.32399999999991
  episode_reward_min: 7.499999999999927
  episodes_this_iter: 50
  episodes_total: 3475
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.32161636743694544
          entropy_coeff: 0.0
          kl: 0.00792305514914915
          policy_loss: -0.02969865084742196
          total_loss: 37.84160125255585
          vf_explained_var: 0.5959841012954712
          vf_loss: 37.86327821016312
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.00

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,87,1495.54,348000,33.324,47.7,7.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 704000
  custom_metrics: {}
  date: 2021-07-01_12-40-23
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.6999999999999
  episode_reward_mean: 34.022999999999904
  episode_reward_min: 7.499999999999927
  episodes_this_iter: 25
  episodes_total: 3500
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3082009954378009
          entropy_coeff: 0.0
          kl: 0.0075245433108648285
          policy_loss: -0.021467555314302444
          total_loss: 53.491371154785156
          vf_explained_var: 0.43768495321273804
          vf_loss: 53.505221247673035
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,88,1510.26,352000,34.023,47.7,7.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 712000
  custom_metrics: {}
  date: 2021-07-01_12-40-38
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.99999999999989
  episode_reward_mean: 34.72799999999991
  episode_reward_min: 7.499999999999927
  episodes_this_iter: 50
  episodes_total: 3550
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2936189705505967
          entropy_coeff: 0.0
          kl: 0.006276743049966171
          policy_loss: -0.021215364278759807
          total_loss: 41.04600924253464
          vf_explained_var: 0.6234234571456909
          vf_loss: 41.06086963415146
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,89,1525.87,356000,34.728,51,7.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 720000
  custom_metrics: {}
  date: 2021-07-01_12-40-54
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.99999999999989
  episode_reward_mean: 35.06399999999991
  episode_reward_min: 16.799999999999905
  episodes_this_iter: 50
  episodes_total: 3600
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3105206345207989
          entropy_coeff: 0.0
          kl: 0.0076787130237789825
          policy_loss: -0.019555308695998974
          total_loss: 39.540516912937164
          vf_explained_var: 0.5051263570785522
          vf_loss: 39.5522980093956
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,90,1541.48,360000,35.064,51,16.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 728000
  custom_metrics: {}
  date: 2021-07-01_12-41-10
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.09999999999991
  episode_reward_mean: 34.9229999999999
  episode_reward_min: 16.799999999999905
  episodes_this_iter: 25
  episodes_total: 3625
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.29324008291587234
          entropy_coeff: 0.0
          kl: 0.007221027131890878
          policy_loss: -0.017683978308923542
          total_loss: 37.96734815835953
          vf_explained_var: 0.610059916973114
          vf_loss: 37.97772002220154
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,91,1556.92,364000,34.923,50.1,16.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 736000
  custom_metrics: {}
  date: 2021-07-01_12-41-24
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.09999999999991
  episode_reward_mean: 34.35599999999991
  episode_reward_min: 20.099999999999895
  episodes_this_iter: 50
  episodes_total: 3675
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.30507426522672176
          entropy_coeff: 0.0
          kl: 0.008362106789718382
          policy_loss: -0.03387977530655917
          total_loss: 30.464163541793823
          vf_explained_var: 0.6516702175140381
          vf_loss: 30.489576995372772
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,92,1571.08,368000,34.356,50.1,20.1,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 744000
  custom_metrics: {}
  date: 2021-07-01_12-41-39
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.0999999999999
  episode_reward_mean: 34.62299999999991
  episode_reward_min: 22.799999999999955
  episodes_this_iter: 25
  episodes_total: 3700
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.3115471573546529
          entropy_coeff: 0.0
          kl: 0.006594996724743396
          policy_loss: -0.018588885141070932
          total_loss: 47.37114715576172
          vf_explained_var: 0.4623723030090332
          vf_loss: 47.38305854797363
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,93,1586.23,372000,34.623,47.1,22.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 752000
  custom_metrics: {}
  date: 2021-07-01_12-41-54
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.8999999999999
  episode_reward_mean: 34.3259999999999
  episode_reward_min: 20.69999999999991
  episodes_this_iter: 50
  episodes_total: 3750
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.27907597040757537
          entropy_coeff: 0.0
          kl: 0.0060969748155912384
          policy_loss: -0.026106120640179142
          total_loss: 31.99796837568283
          vf_explained_var: 0.6687365770339966
          vf_loss: 32.01790100336075
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,94,1600.76,376000,34.326,51.9,20.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 760000
  custom_metrics: {}
  date: 2021-07-01_12-42-09
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.8999999999999
  episode_reward_mean: 34.970999999999904
  episode_reward_min: 20.69999999999991
  episodes_this_iter: 50
  episodes_total: 3800
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.29612492490559816
          entropy_coeff: 0.0
          kl: 0.0077678941015619785
          policy_loss: -0.031062833368196152
          total_loss: 33.28314524888992
          vf_explained_var: 0.5937671661376953
          vf_loss: 33.30634289979935
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,95,1615.83,380000,34.971,51.9,20.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 768000
  custom_metrics: {}
  date: 2021-07-01_12-42-26
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.8999999999999
  episode_reward_mean: 35.36999999999991
  episode_reward_min: 20.69999999999991
  episodes_this_iter: 25
  episodes_total: 3825
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2776964367367327
          entropy_coeff: 0.0
          kl: 0.005776785059424583
          policy_loss: -0.019745711993891746
          total_loss: 33.27989602088928
          vf_explained_var: 0.658515214920044
          vf_loss: 33.293793082237244
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,96,1632.66,384000,35.37,45.9,20.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 776000
  custom_metrics: {}
  date: 2021-07-01_12-42-40
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.8999999999999
  episode_reward_mean: 35.8799999999999
  episode_reward_min: 19.799999999999912
  episodes_this_iter: 50
  episodes_total: 3875
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2829280002042651
          entropy_coeff: 0.0
          kl: 0.007355675392318517
          policy_loss: -0.019168791361153126
          total_loss: 34.18053990602493
          vf_explained_var: 0.6561027765274048
          vf_loss: 34.19226098060608
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,97,1647.22,388000,35.88,45.9,19.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 784000
  custom_metrics: {}
  date: 2021-07-01_12-42-57
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.299999999999905
  episode_reward_mean: 35.28299999999991
  episode_reward_min: 19.799999999999912
  episodes_this_iter: 25
  episodes_total: 3900
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2834771778434515
          entropy_coeff: 0.0
          kl: 0.006141318059235346
          policy_loss: -0.016993618512060493
          total_loss: 52.25935709476471
          vf_explained_var: 0.5137035846710205
          vf_loss: 52.270132184028625
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,98,1663.62,392000,35.283,45.3,19.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 792000
  custom_metrics: {}
  date: 2021-07-01_12-43-12
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.2999999999999
  episode_reward_mean: 35.171999999999905
  episode_reward_min: 19.799999999999912
  episodes_this_iter: 50
  episodes_total: 3950
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2654889728873968
          entropy_coeff: 0.0
          kl: 0.0064079228905029595
          policy_loss: -0.017270981101319194
          total_loss: 39.83958142995834
          vf_explained_var: 0.6194511651992798
          vf_loss: 39.85036474466324
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,99,1679.54,396000,35.172,48.3,19.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 800000
  custom_metrics: {}
  date: 2021-07-01_12-43-29
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.2999999999999
  episode_reward_mean: 35.41799999999991
  episode_reward_min: 3.5999999999999446
  episodes_this_iter: 50
  episodes_total: 4000
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2996889129281044
          entropy_coeff: 0.0
          kl: 0.0076964655600022525
          policy_loss: -0.01908352685859427
          total_loss: 40.998607099056244
          vf_explained_var: 0.49544084072113037
          vf_loss: 41.00989830493927
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,100,1695.79,400000,35.418,48.3,3.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 808000
  custom_metrics: {}
  date: 2021-07-01_12-43-48
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 54.59999999999991
  episode_reward_mean: 35.909999999999904
  episode_reward_min: 3.5999999999999446
  episodes_this_iter: 25
  episodes_total: 4025
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2836586721241474
          entropy_coeff: 0.0
          kl: 0.007766731956508011
          policy_loss: -0.02051701524760574
          total_loss: 49.495555996894836
          vf_explained_var: 0.5558496713638306
          vf_loss: 49.50820851325989
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,101,1714.59,404000,35.91,54.6,3.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 816000
  custom_metrics: {}
  date: 2021-07-01_12-44-04
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 54.59999999999991
  episode_reward_mean: 36.146999999999906
  episode_reward_min: 14.699999999999909
  episodes_this_iter: 50
  episodes_total: 4075
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.28829260542988777
          entropy_coeff: 0.0
          kl: 0.008042376401135698
          policy_loss: -0.023153640038799495
          total_loss: 37.82472234964371
          vf_explained_var: 0.62677401304245
          vf_loss: 37.839733600616455
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,102,1730.69,408000,36.147,54.6,14.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 824000
  custom_metrics: {}
  date: 2021-07-01_12-44-19
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 54.59999999999991
  episode_reward_mean: 36.18899999999991
  episode_reward_min: 14.699999999999909
  episodes_this_iter: 25
  episodes_total: 4100
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2918240646831691
          entropy_coeff: 0.0
          kl: 0.006812435181927867
          policy_loss: -0.013199584267567843
          total_loss: 64.1449259519577
          vf_explained_var: 0.45528942346572876
          vf_loss: 64.15122902393341
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,103,1746.17,412000,36.189,54.6,14.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 832000
  custom_metrics: {}
  date: 2021-07-01_12-44-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.19999999999992
  episode_reward_mean: 35.59499999999991
  episode_reward_min: 15.299999999999901
  episodes_this_iter: 50
  episodes_total: 4150
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2741573555395007
          entropy_coeff: 0.0
          kl: 0.006900431224494241
          policy_loss: -0.021427681029308587
          total_loss: 45.07684409618378
          vf_explained_var: 0.5862284302711487
          vf_loss: 45.09128439426422
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,104,1760.53,416000,35.595,49.2,15.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 840000
  custom_metrics: {}
  date: 2021-07-01_12-44-48
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.89999999999991
  episode_reward_mean: 35.9969999999999
  episode_reward_min: 20.699999999999896
  episodes_this_iter: 50
  episodes_total: 4200
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.28605205193161964
          entropy_coeff: 0.0
          kl: 0.010135989359696396
          policy_loss: -0.024289785185828805
          total_loss: 53.171104311943054
          vf_explained_var: 0.47232311964035034
          vf_loss: 53.18513059616089
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,105,1774.9,420000,35.997,51.9,20.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 848000
  custom_metrics: {}
  date: 2021-07-01_12-45-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.89999999999991
  episode_reward_mean: 36.33299999999991
  episode_reward_min: 20.699999999999896
  episodes_this_iter: 25
  episodes_total: 4225
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.27026319690048695
          entropy_coeff: 0.0
          kl: 0.006855769461253658
          policy_loss: -0.02048995642689988
          total_loss: 45.55350911617279
          vf_explained_var: 0.6111595630645752
          vf_loss: 45.56705713272095
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,106,1789.47,424000,36.333,51.9,20.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 856000
  custom_metrics: {}
  date: 2021-07-01_12-45-17
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.899999999999906
  episode_reward_mean: 36.131999999999906
  episode_reward_min: 14.700000000000026
  episodes_this_iter: 50
  episodes_total: 4275
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.27970315515995026
          entropy_coeff: 0.0
          kl: 0.0068431787367444485
          policy_loss: -0.017431623593438417
          total_loss: 39.11892366409302
          vf_explained_var: 0.6292023062705994
          vf_loss: 39.12942677736282
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_l

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,107,1803.77,428000,36.132,45.9,14.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 864000
  custom_metrics: {}
  date: 2021-07-01_12-45-31
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.899999999999906
  episode_reward_mean: 35.96999999999991
  episode_reward_min: 14.700000000000026
  episodes_this_iter: 25
  episodes_total: 4300
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2770256018266082
          entropy_coeff: 0.0
          kl: 0.0073635741719044745
          policy_loss: -0.019001650740392506
          total_loss: 43.169180035591125
          vf_explained_var: 0.5191107988357544
          vf_loss: 43.18072581291199
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,108,1817.58,432000,35.97,45.9,14.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 872000
  custom_metrics: {}
  date: 2021-07-01_12-45-45
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.2999999999999
  episode_reward_mean: 36.27299999999991
  episode_reward_min: 20.699999999999896
  episodes_this_iter: 50
  episodes_total: 4350
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2592877075076103
          entropy_coeff: 0.0
          kl: 0.006118778212112375
          policy_loss: -0.013852222706191242
          total_loss: 34.52606463432312
          vf_explained_var: 0.6525859832763672
          vf_loss: 34.53372144699097
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,109,1831.68,436000,36.273,48.3,20.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 880000
  custom_metrics: {}
  date: 2021-07-01_12-45-59
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.2999999999999
  episode_reward_mean: 35.903999999999904
  episode_reward_min: 17.099999999999902
  episodes_this_iter: 50
  episodes_total: 4400
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2835629410110414
          entropy_coeff: 0.0
          kl: 0.006559956993442029
          policy_loss: -0.01645479310536757
          total_loss: 44.392465114593506
          vf_explained_var: 0.45895153284072876
          vf_loss: 44.402278423309326
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,110,1845.32,440000,35.904,48.3,17.1,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 888000
  custom_metrics: {}
  date: 2021-07-01_12-46-13
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.8999999999999
  episode_reward_mean: 35.711999999999904
  episode_reward_min: 11.399999999999913
  episodes_this_iter: 25
  episodes_total: 4425
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2562564886175096
          entropy_coeff: 0.0
          kl: 0.007741676803561859
          policy_loss: -0.02819039062887896
          total_loss: 45.992348432540894
          vf_explained_var: 0.6086419820785522
          vf_loss: 46.01270067691803
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,111,1859.32,444000,35.712,51.9,11.4,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 896000
  custom_metrics: {}
  date: 2021-07-01_12-46-27
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.8999999999999
  episode_reward_mean: 35.0489999999999
  episode_reward_min: 11.399999999999913
  episodes_this_iter: 50
  episodes_total: 4475
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2717755059711635
          entropy_coeff: 0.0
          kl: 0.0074429627857171
          policy_loss: -0.022095425403676927
          total_loss: 36.2455113530159
          vf_explained_var: 0.658694863319397
          vf_loss: 36.26007068157196
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0002

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,112,1873.37,448000,35.049,51.9,11.4,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 904000
  custom_metrics: {}
  date: 2021-07-01_12-46-41
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.8999999999999
  episode_reward_mean: 34.439999999999905
  episode_reward_min: 11.399999999999913
  episodes_this_iter: 25
  episodes_total: 4500
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.27741820691153407
          entropy_coeff: 0.0
          kl: 0.00654564805154223
          policy_loss: -0.026817948615644127
          total_loss: 43.92541819810867
          vf_explained_var: 0.4934638440608978
          vf_loss: 43.945608615875244
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,113,1887.54,452000,34.44,51.9,11.4,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 912000
  custom_metrics: {}
  date: 2021-07-01_12-46-55
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.2999999999999
  episode_reward_mean: 34.53299999999991
  episode_reward_min: 12.599999999999921
  episodes_this_iter: 50
  episodes_total: 4550
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.24796770233660936
          entropy_coeff: 0.0
          kl: 0.006497509493783582
          policy_loss: -0.025395021948497742
          total_loss: 37.1281304359436
          vf_explained_var: 0.6683326959609985
          vf_loss: 37.14694797992706
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,114,1901.87,456000,34.533,48.3,12.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 920000
  custom_metrics: {}
  date: 2021-07-01_12-47-09
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.799999999999905
  episode_reward_mean: 35.966999999999906
  episode_reward_min: 12.599999999999921
  episodes_this_iter: 50
  episodes_total: 4600
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.27358304569497705
          entropy_coeff: 0.0
          kl: 0.00751752796350047
          policy_loss: -0.028997768939007074
          total_loss: 39.3973907828331
          vf_explained_var: 0.5322921872138977
          vf_loss: 39.41877752542496
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,115,1915.68,460000,35.967,49.8,12.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 928000
  custom_metrics: {}
  date: 2021-07-01_12-47-23
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.799999999999905
  episode_reward_mean: 35.1899999999999
  episode_reward_min: 12.599999999999921
  episodes_this_iter: 25
  episodes_total: 4625
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2556845103390515
          entropy_coeff: 0.0
          kl: 0.0055202865623869
          policy_loss: -0.017197587963892147
          total_loss: 39.52508723735809
          vf_explained_var: 0.6065208911895752
          vf_loss: 39.53669512271881
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,116,1929.82,464000,35.19,49.8,12.6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 936000
  custom_metrics: {}
  date: 2021-07-01_12-47-37
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.29999999999989
  episode_reward_mean: 34.19699999999991
  episode_reward_min: 4.5000000000000195
  episodes_this_iter: 50
  episodes_total: 4675
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.25412569660693407
          entropy_coeff: 0.0
          kl: 0.007471303280908614
          policy_loss: -0.025338502397062257
          total_loss: 38.690449595451355
          vf_explained_var: 0.6168878674507141
          vf_loss: 38.708222687244415
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,117,1943.75,468000,34.197,48.3,4.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 944000
  custom_metrics: {}
  date: 2021-07-01_12-47-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.29999999999989
  episode_reward_mean: 33.515999999999906
  episode_reward_min: 4.5000000000000195
  episodes_this_iter: 25
  episodes_total: 4700
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.25230602640658617
          entropy_coeff: 0.0
          kl: 0.00612779168295674
          policy_loss: -0.022795697150286287
          total_loss: 45.82136404514313
          vf_explained_var: 0.46159565448760986
          vf_loss: 45.83795523643494
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,118,1958.14,472000,33.516,48.3,4.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 952000
  custom_metrics: {}
  date: 2021-07-01_12-48-06
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.3999999999999
  episode_reward_mean: 33.491999999999905
  episode_reward_min: 4.5000000000000195
  episodes_this_iter: 50
  episodes_total: 4750
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.25140486052259803
          entropy_coeff: 0.0
          kl: 0.008559684807551093
          policy_loss: -0.02557358704507351
          total_loss: 31.490682721138
          vf_explained_var: 0.6811472773551941
          vf_loss: 31.507589638233185
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,119,1972.49,476000,33.492,47.4,4.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 960000
  custom_metrics: {}
  date: 2021-07-01_12-48-21
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.4999999999999
  episode_reward_mean: 33.04199999999991
  episode_reward_min: 14.699999999999902
  episodes_this_iter: 50
  episodes_total: 4800
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.25868048472329974
          entropy_coeff: 0.0
          kl: 0.00693410616077017
          policy_loss: -0.02096278520184569
          total_loss: 31.207833528518677
          vf_explained_var: 0.6010348796844482
          vf_loss: 31.221775889396667
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,120,1987.23,480000,33.042,46.5,14.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 968000
  custom_metrics: {}
  date: 2021-07-01_12-48-35
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.199999999999896
  episode_reward_mean: 33.0149999999999
  episode_reward_min: 14.699999999999902
  episodes_this_iter: 25
  episodes_total: 4825
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.22876544995233417
          entropy_coeff: 0.0
          kl: 0.005539232937735505
          policy_loss: -0.01443162449868396
          total_loss: 38.01633733510971
          vf_explained_var: 0.6308506727218628
          vf_loss: 38.02516049146652
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,121,2001.35,484000,33.015,46.2,14.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 976000
  custom_metrics: {}
  date: 2021-07-01_12-48-49
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.199999999999896
  episode_reward_mean: 32.5259999999999
  episode_reward_min: 6.299999999999931
  episodes_this_iter: 50
  episodes_total: 4875
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.24252421548590064
          entropy_coeff: 0.0
          kl: 0.006604652138776146
          policy_loss: -0.01377787091769278
          total_loss: 40.83487820625305
          vf_explained_var: 0.6356974840164185
          vf_loss: 40.84196877479553
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,122,2015.22,488000,32.526,46.2,6.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 984000
  custom_metrics: {}
  date: 2021-07-01_12-49-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.599999999999895
  episode_reward_mean: 32.897999999999904
  episode_reward_min: 6.299999999999931
  episodes_this_iter: 25
  episodes_total: 4900
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2594753201119602
          entropy_coeff: 0.0
          kl: 0.006647950780461542
          policy_loss: -0.0191470542922616
          total_loss: 40.532476365566254
          vf_explained_var: 0.512898325920105
          vf_loss: 40.54489290714264
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,123,2029.18,492000,32.898,48.6,6.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 992000
  custom_metrics: {}
  date: 2021-07-01_12-49-17
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.699999999999896
  episode_reward_mean: 32.4959999999999
  episode_reward_min: 6.299999999999931
  episodes_this_iter: 50
  episodes_total: 4950
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.24385864986106753
          entropy_coeff: 0.0
          kl: 0.0054721335254726
          policy_loss: -0.025443819089559838
          total_loss: 33.25220447778702
          vf_explained_var: 0.6340951919555664
          vf_loss: 33.27210694551468
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,124,2043.45,496000,32.496,50.7,6.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1000000
  custom_metrics: {}
  date: 2021-07-01_12-49-32
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.699999999999896
  episode_reward_mean: 31.982999999999905
  episode_reward_min: 12.899999999999917
  episodes_this_iter: 50
  episodes_total: 5000
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.23850606102496386
          entropy_coeff: 0.0
          kl: 0.005489326700626407
          policy_loss: -0.014752764604054391
          total_loss: 59.97450911998749
          vf_explained_var: 0.5382956266403198
          vf_loss: 59.983705043792725
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,125,2058.01,500000,31.983,50.7,12.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1008000
  custom_metrics: {}
  date: 2021-07-01_12-49-49
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.6999999999999
  episode_reward_mean: 32.72099999999991
  episode_reward_min: 13.799999999999926
  episodes_this_iter: 25
  episodes_total: 5025
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.23300011595711112
          entropy_coeff: 0.0
          kl: 0.006959000849747099
          policy_loss: -0.023597833671374246
          total_loss: 60.041171073913574
          vf_explained_var: 0.5113990306854248
          vf_loss: 60.05772376060486
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,126,2074.76,504000,32.721,47.7,13.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1016000
  custom_metrics: {}
  date: 2021-07-01_12-50-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.799999999999905
  episode_reward_mean: 33.98699999999991
  episode_reward_min: 14.699999999999925
  episodes_this_iter: 50
  episodes_total: 5075
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.23740224773064256
          entropy_coeff: 0.0
          kl: 0.006237456123926677
          policy_loss: -0.019811995938653126
          total_loss: 36.69084280729294
          vf_explained_var: 0.6539090871810913
          vf_loss: 36.704338788986206
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_l

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,127,2089.21,508000,33.987,49.8,14.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1024000
  custom_metrics: {}
  date: 2021-07-01_12-50-18
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.799999999999905
  episode_reward_mean: 33.91499999999991
  episode_reward_min: 13.199999999999967
  episodes_this_iter: 25
  episodes_total: 5100
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2508497373200953
          entropy_coeff: 0.0
          kl: 0.007742087604128756
          policy_loss: -0.023394023010041565
          total_loss: 53.029316544532776
          vf_explained_var: 0.5134106874465942
          vf_loss: 53.04487133026123
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,128,2103.57,512000,33.915,49.8,13.2,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1032000
  custom_metrics: {}
  date: 2021-07-01_12-50-32
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.799999999999905
  episode_reward_mean: 34.040999999999904
  episode_reward_min: 13.199999999999967
  episodes_this_iter: 50
  episodes_total: 5150
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.23639861727133393
          entropy_coeff: 0.0
          kl: 0.006641366278927308
          policy_loss: -0.017231517034815624
          total_loss: 36.11501634120941
          vf_explained_var: 0.6323361396789551
          vf_loss: 36.125523924827576
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,129,2117.84,516000,34.041,49.8,13.2,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1040000
  custom_metrics: {}
  date: 2021-07-01_12-50-46
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.8999999999999
  episode_reward_mean: 34.964999999999904
  episode_reward_min: 18.599999999999895
  episodes_this_iter: 50
  episodes_total: 5200
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2588189085945487
          entropy_coeff: 0.0
          kl: 0.007066527308779769
          policy_loss: -0.024600098404334858
          total_loss: 40.8539314866066
          vf_explained_var: 0.5260671377182007
          vf_loss: 40.87137722969055
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 130 | 2132.19 | 520000 | 34.965 | 48.9 | 18.6 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1048000
  custom_metrics: {}
  date: 2021-07-01_12-51-00
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.49999999999991
  episode_reward_mean: 34.6049999999999
  episode_reward_min: 17.3999999999999
  episodes_this_iter: 25
  episodes_total: 5225
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2424436816945672
          entropy_coeff: 0.0
          kl: 0.007093300097039901
          policy_loss: -0.021399646007921547
          total_loss: 34.758048474788666
          vf_explained_var: 0.6197205781936646
          vf_loss: 34.7722664475441
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 131 | 2146.18 | 524000 | 34.605 | 46.5 | 17.4 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1056000
  custom_metrics: {}
  date: 2021-07-01_12-51-15
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.7999999999999
  episode_reward_mean: 34.0169999999999
  episode_reward_min: 16.199999999999903
  episodes_this_iter: 50
  episodes_total: 5275
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.23858831264078617
          entropy_coeff: 0.0
          kl: 0.007312173467653338
          policy_loss: -0.033376879640854895
          total_loss: 30.977666795253754
          vf_explained_var: 0.6662899851799011
          vf_loss: 31.003640353679657
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 132 | 2160.5 | 528000 | 34.017 | 49.8 | 16.2 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1064000
  custom_metrics: {}
  date: 2021-07-01_12-51-29
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.79999999999991
  episode_reward_mean: 33.83399999999991
  episode_reward_min: 15.899999999999912
  episodes_this_iter: 25
  episodes_total: 5300
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.24242649413645267
          entropy_coeff: 0.0
          kl: 0.006287668409640901
          policy_loss: -0.020267354528186843
          total_loss: 48.0935400724411
          vf_explained_var: 0.4969027638435364
          vf_loss: 48.10744118690491
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 133 | 2174.8 | 532000 | 33.834 | 49.8 | 15.9 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1072000
  custom_metrics: {}
  date: 2021-07-01_12-51-44
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.79999999999991
  episode_reward_mean: 33.095999999999904
  episode_reward_min: 11.69999999999991
  episodes_this_iter: 50
  episodes_total: 5350
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.23096564365550876
          entropy_coeff: 0.0
          kl: 0.008015323022846133
          policy_loss: -0.023659745638724416
          total_loss: 32.68730962276459
          vf_explained_var: 0.669015645980835
          vf_loss: 32.70285415649414
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 134 | 2189.25 | 536000 | 33.096 | 49.8 | 11.7 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1080000
  custom_metrics: {}
  date: 2021-07-01_12-51-58
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.899999999999906
  episode_reward_mean: 32.20199999999991
  episode_reward_min: 11.69999999999991
  episodes_this_iter: 50
  episodes_total: 5400
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2427317025139928
          entropy_coeff: 0.0
          kl: 0.006937732730875723
          policy_loss: -0.025318899424746633
          total_loss: 36.5800661444664
          vf_explained_var: 0.5336323976516724
          vf_loss: 36.59836006164551
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 135 | 2203.59 | 540000 | 32.202 | 45.9 | 11.7 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1088000
  custom_metrics: {}
  date: 2021-07-01_12-52-12
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.89999999999991
  episode_reward_mean: 32.40899999999991
  episode_reward_min: 10.199999999999914
  episodes_this_iter: 25
  episodes_total: 5425
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.23279994213953614
          entropy_coeff: 0.0
          kl: 0.007348718514549546
          policy_loss: -0.017887908674310893
          total_loss: 50.645334005355835
          vf_explained_var: 0.5299735069274902
          vf_loss: 50.655781984329224
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_l

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 136 | 2217.86 | 544000 | 32.409 | 45.9 | 10.2 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1096000
  custom_metrics: {}
  date: 2021-07-01_12-52-28
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.7999999999999
  episode_reward_mean: 32.954999999999906
  episode_reward_min: 10.199999999999914
  episodes_this_iter: 50
  episodes_total: 5475
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2429116521961987
          entropy_coeff: 0.0
          kl: 0.008331216595252044
          policy_loss: -0.022531356691615656
          total_loss: 61.87284767627716
          vf_explained_var: 0.4856888949871063
          vf_loss: 61.886942744255066
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 137 | 2233.09 | 548000 | 32.955 | 46.8 | 10.2 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1104000
  custom_metrics: {}
  date: 2021-07-01_12-52-46
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.7999999999999
  episode_reward_mean: 32.74499999999991
  episode_reward_min: 7.7999999999999154
  episodes_this_iter: 25
  episodes_total: 5500
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2534395381808281
          entropy_coeff: 0.0
          kl: 0.006808859929151367
          policy_loss: -0.01966734675806947
          total_loss: 54.55338394641876
          vf_explained_var: 0.44026944041252136
          vf_loss: 54.566158056259155
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 138 | 2251.65 | 552000 | 32.745 | 46.8 | 7.8 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1112000
  custom_metrics: {}
  date: 2021-07-01_12-53-02
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.59999999999992
  episode_reward_mean: 33.77699999999991
  episode_reward_min: 7.7999999999999154
  episodes_this_iter: 50
  episodes_total: 5550
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.253717967774719
          entropy_coeff: 0.0
          kl: 0.007236987832584418
          policy_loss: -0.0330022034177091
          total_loss: 30.307249307632446
          vf_explained_var: 0.6179074048995972
          vf_loss: 30.332924485206604
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 139 | 2267.86 | 556000 | 33.777 | 45.6 | 7.8 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1120000
  custom_metrics: {}
  date: 2021-07-01_12-53-19
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.599999999999916
  episode_reward_mean: 34.18199999999991
  episode_reward_min: 15.299999999999905
  episodes_this_iter: 50
  episodes_total: 5600
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.2609911886975169
          entropy_coeff: 0.0
          kl: 0.0067008822079515085
          policy_loss: -0.02537728524475824
          total_loss: 38.01330989599228
          vf_explained_var: 0.563621461391449
          vf_loss: 38.031902849674225
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 140 | 2284.44 | 560000 | 34.182 | 48.6 | 15.3 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1128000
  custom_metrics: {}
  date: 2021-07-01_12-53-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.599999999999916
  episode_reward_mean: 33.42599999999991
  episode_reward_min: 15.299999999999905
  episodes_this_iter: 25
  episodes_total: 5625
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 1.0125000000000002
          cur_lr: 0.0001
          entropy: 0.23163103172555566
          entropy_coeff: 0.0
          kl: 0.0047566401335643604
          policy_loss: -0.011789395764935762
          total_loss: 48.40736836194992
          vf_explained_var: 0.5938019752502441
          vf_loss: 48.41434186697006
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_l

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 141 | 2299.51 | 564000 | 33.426 | 48.6 | 15.3 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1136000
  custom_metrics: {}
  date: 2021-07-01_12-53-49
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.999999999999915
  episode_reward_mean: 33.16799999999991
  episode_reward_min: 15.299999999999905
  episodes_this_iter: 50
  episodes_total: 5675
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2249625655822456
          entropy_coeff: 0.0
          kl: 0.011607827400439419
          policy_loss: -0.018548636115156114
          total_loss: 33.35839283466339
          vf_explained_var: 0.6509989500045776
          vf_loss: 33.37106537818909
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 142 | 2313.96 | 568000 | 33.168 | 48 | 15.3 | 100 |
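
In the results above, `cur_kl_coeff` for `policy1` drops from 1.0125 to 0.50625 between iterations 141 and 142, right after iteration 141 reported a `kl` of about 0.0048. This is consistent with PPO's adaptive KL penalty, which scales the coefficient up when the measured KL is well above a target and down when it is well below. The following is a minimal sketch of that rule, not RLlib's own code: the function name `update_kl_coeff` is hypothetical, and `kl_target=0.01` is an assumption, since the run's config is not shown in this output.

```python
def update_kl_coeff(kl_coeff, sampled_kl, kl_target=0.01):
    """Return the next KL penalty coefficient (sketch of an adaptive KL rule).

    kl_target=0.01 is an assumed value; the actual config of this run
    does not appear in the log output.
    """
    if sampled_kl > 2.0 * kl_target:
        # Policy moved too far from the old policy: penalize KL more strongly.
        return kl_coeff * 1.5
    if sampled_kl < 0.5 * kl_target:
        # Policy barely moved: relax the penalty.
        return kl_coeff * 0.5
    # KL is within the tolerated band: keep the coefficient unchanged.
    return kl_coeff


# policy1 values copied from the iteration-141 result block.
new_coeff = update_kl_coeff(1.0125000000000002, 0.0047566401335643604)
print(new_coeff)  # 0.5062500000000001, the value reported at iteration 142
```

Under the same assumption, the `kl` values reported for `policy1` in later iterations (roughly 0.009 to 0.019) stay inside the band between half and twice the target, which would explain why the coefficient then remains at 0.50625.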


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1144000
  custom_metrics: {}
  date: 2021-07-01_12-54-04
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.799999999999926
  episode_reward_mean: 32.8109999999999
  episode_reward_min: 13.199999999999982
  episodes_this_iter: 25
  episodes_total: 5700
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.25183290196582675
          entropy_coeff: 0.0
          kl: 0.014860992698231712
          policy_loss: -0.03232403949368745
          total_loss: 50.6695591211319
          vf_explained_var: 0.5301828384399414
          vf_loss: 50.69436037540436
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 143 | 2328.82 | 572000 | 32.811 | 46.8 | 13.2 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1152000
  custom_metrics: {}
  date: 2021-07-01_12-54-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.0999999999999
  episode_reward_mean: 33.22199999999991
  episode_reward_min: 10.499999999999979
  episodes_this_iter: 50
  episodes_total: 5750
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.23627133574336767
          entropy_coeff: 0.0
          kl: 0.012405671994201839
          policy_loss: -0.023141042023780756
          total_loss: 44.154469192028046
          vf_explained_var: 0.655792236328125
          vf_loss: 44.17132955789566
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 144 | 2345.01 | 576000 | 33.222 | 47.1 | 10.5 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1160000
  custom_metrics: {}
  date: 2021-07-01_12-54-35
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.4999999999999
  episode_reward_mean: 32.59799999999991
  episode_reward_min: 10.499999999999979
  episodes_this_iter: 50
  episodes_total: 5800
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2344177351333201
          entropy_coeff: 0.0
          kl: 0.010900532550294884
          policy_loss: -0.026935252884868532
          total_loss: 42.40937739610672
          vf_explained_var: 0.5766823291778564
          vf_loss: 42.430795431137085
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 145 | 2360.07 | 580000 | 32.598 | 49.5 | 10.5 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1168000
  custom_metrics: {}
  date: 2021-07-01_12-54-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.4999999999999
  episode_reward_mean: 32.927999999999905
  episode_reward_min: 10.499999999999979
  episodes_this_iter: 25
  episodes_total: 5825
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2229600022546947
          entropy_coeff: 0.0
          kl: 0.010285116790328175
          policy_loss: -0.028172020858619362
          total_loss: 54.33310651779175
          vf_explained_var: 0.5925516486167908
          vf_loss: 54.35607159137726
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 146 | 2376.74 | 584000 | 32.928 | 49.5 | 10.5 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1176000
  custom_metrics: {}
  date: 2021-07-01_12-55-08
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.0999999999999
  episode_reward_mean: 33.215999999999916
  episode_reward_min: 11.699999999999935
  episodes_this_iter: 50
  episodes_total: 5875
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2117365263402462
          entropy_coeff: 0.0
          kl: 0.00916745059657842
          policy_loss: -0.018735508376266807
          total_loss: 54.27224266529083
          vf_explained_var: 0.6216288805007935
          vf_loss: 54.286336183547974
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 147 | 2392.73 | 588000 | 33.216 | 50.1 | 11.7 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1184000
  custom_metrics: {}
  date: 2021-07-01_12-55-23
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.0999999999999
  episode_reward_mean: 33.149999999999906
  episode_reward_min: 11.699999999999935
  episodes_this_iter: 25
  episodes_total: 5900
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2273365636356175
          entropy_coeff: 0.0
          kl: 0.009956065405276604
          policy_loss: -0.029662500164704397
          total_loss: 42.508177518844604
          vf_explained_var: 0.5807116031646729
          vf_loss: 42.53279936313629
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 148 | 2408.23 | 592000 | 33.15 | 50.1 | 11.7 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1192000
  custom_metrics: {}
  date: 2021-07-01_12-55-40
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.0999999999999
  episode_reward_mean: 33.491999999999905
  episode_reward_min: 11.699999999999935
  episodes_this_iter: 50
  episodes_total: 5950
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2384969755075872
          entropy_coeff: 0.0
          kl: 0.011584791980567388
          policy_loss: -0.018274011759785935
          total_loss: 32.01387083530426
          vf_explained_var: 0.6652053594589233
          vf_loss: 32.026279866695404
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 149 | 2424.71 | 596000 | 33.492 | 50.1 | 11.7 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1200000
  custom_metrics: {}
  date: 2021-07-01_12-55-59
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.39999999999989
  episode_reward_mean: 32.68199999999991
  episode_reward_min: 7.49999999999992
  episodes_this_iter: 50
  episodes_total: 6000
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.23385249124839902
          entropy_coeff: 0.0
          kl: 0.0130698015273083
          policy_loss: -0.026355884736403823
          total_loss: 45.99314433336258
          vf_explained_var: 0.5023248791694641
          vf_loss: 46.01288318634033
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 150 | 2443.57 | 600000 | 32.682 | 47.4 | 7.5 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1208000
  custom_metrics: {}
  date: 2021-07-01_12-56-15
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.999999999999915
  episode_reward_mean: 32.216999999999906
  episode_reward_min: 7.49999999999992
  episodes_this_iter: 25
  episodes_total: 6025
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.24564681900665164
          entropy_coeff: 0.0
          kl: 0.012811394815798849
          policy_loss: -0.024723084061406553
          total_loss: 49.326666593551636
          vf_explained_var: 0.5660548806190491
          vf_loss: 49.34490394592285
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 151 | 2460.1 | 604000 | 32.217 | 48 | 7.5 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1216000
  custom_metrics: {}
  date: 2021-07-01_12-56-31
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.8999999999999
  episode_reward_mean: 31.802999999999905
  episode_reward_min: 13.199999999999948
  episodes_this_iter: 50
  episodes_total: 6075
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2559543545357883
          entropy_coeff: 0.0
          kl: 0.018963325303047895
          policy_loss: -0.02703658299287781
          total_loss: 37.216406881809235
          vf_explained_var: 0.6427992582321167
          vf_loss: 37.23384338617325
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 152 | 2475.76 | 608000 | 31.803 | 48.9 | 13.2 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1224000
  custom_metrics: {}
  date: 2021-07-01_12-56-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.8999999999999
  episode_reward_mean: 30.875999999999912
  episode_reward_min: 11.999999999999922
  episodes_this_iter: 25
  episodes_total: 6100
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.25559239834547043
          entropy_coeff: 0.0
          kl: 0.01256851345533505
          policy_loss: -0.022375546977855265
          total_loss: 38.282747983932495
          vf_explained_var: 0.5036213397979736
          vf_loss: 38.29876047372818
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr:

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 153 | 2491.6 | 612000 | 30.876 | 48.9 | 12 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1232000
  custom_metrics: {}
  date: 2021-07-01_12-57-04
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.8999999999999
  episode_reward_mean: 30.836999999999918
  episode_reward_min: 5.999999999999945
  episodes_this_iter: 50
  episodes_total: 6150
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2453766157850623
          entropy_coeff: 0.0
          kl: 0.010266824887366965
          policy_loss: -0.02087603820837103
          total_loss: 39.54801380634308
          vf_explained_var: 0.5995327830314636
          vf_loss: 39.56369352340698
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 154 | 2508.41 | 616000 | 30.837 | 48.9 | 6 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1240000
  custom_metrics: {}
  date: 2021-07-01_12-57-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.49999999999991
  episode_reward_mean: 31.115999999999918
  episode_reward_min: 5.999999999999945
  episodes_this_iter: 50
  episodes_total: 6200
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2446759552694857
          entropy_coeff: 0.0
          kl: 0.011487491661682725
          policy_loss: -0.020894525077892467
          total_loss: 35.94888353347778
          vf_explained_var: 0.527084231376648
          vf_loss: 35.96396219730377
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 155 | 2524.51 | 620000 | 31.116 | 46.5 | 6 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1248000
  custom_metrics: {}
  date: 2021-07-01_12-57-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.49999999999991
  episode_reward_mean: 31.07399999999991
  episode_reward_min: 5.999999999999945
  episodes_this_iter: 25
  episodes_total: 6225
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.24111131113022566
          entropy_coeff: 0.0
          kl: 0.01088010887906421
          policy_loss: -0.02161408815300092
          total_loss: 39.79644852876663
          vf_explained_var: 0.5909992456436157
          vf_loss: 39.81255429983139
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_7f8d1_00000 | RUNNING | 172.30.1.40:21670 | 156 | 2538.87 | 624000 | 31.074 | 46.5 | 6 | 100 |


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1256000
  custom_metrics: {}
  date: 2021-07-01_12-57-48
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.49999999999991
  episode_reward_mean: 33.10799999999992
  episode_reward_min: -1.8000000000000544
  episodes_this_iter: 50
  episodes_total: 6275
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2535309884697199
          entropy_coeff: 0.0
          kl: 0.011989142498350702
          policy_loss: -0.019312340853502974
          total_loss: 36.60686683654785
          vf_explained_var: 0.6438410878181458
          vf_loss: 36.6201097369194
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,157,2552.81,628000,33.108,46.5,-1.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1264000
  custom_metrics: {}
  date: 2021-07-01_12-58-02
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.29999999999991
  episode_reward_mean: 33.17999999999991
  episode_reward_min: -1.8000000000000544
  episodes_this_iter: 25
  episodes_total: 6300
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.24417909421026707
          entropy_coeff: 0.0
          kl: 0.010733691859059036
          policy_loss: -0.022517229823279195
          total_loss: 45.951009809970856
          vf_explained_var: 0.46042293310165405
          vf_loss: 45.968092262744904
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,158,2566.88,632000,33.18,45.3,-1.8,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1272000
  custom_metrics: {}
  date: 2021-07-01_12-58-17
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.09999999999991
  episode_reward_mean: 34.694999999999915
  episode_reward_min: 10.49999999999998
  episodes_this_iter: 50
  episodes_total: 6350
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.23996385373175144
          entropy_coeff: 0.0
          kl: 0.013039452052908018
          policy_loss: -0.025721884914673865
          total_loss: 35.456013441085815
          vf_explained_var: 0.6026760339736938
          vf_loss: 35.475133776664734
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_l

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,159,2581.58,636000,34.695,47.1,10.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1280000
  custom_metrics: {}
  date: 2021-07-01_12-58-33
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.09999999999991
  episode_reward_mean: 34.154999999999916
  episode_reward_min: 5.999999999999952
  episodes_this_iter: 50
  episodes_total: 6400
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.23094602720811963
          entropy_coeff: 0.0
          kl: 0.010230292617052328
          policy_loss: -0.0252016419544816
          total_loss: 41.946721255779266
          vf_explained_var: 0.50103759765625
          vf_loss: 41.96674311161041
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,160,2597.28,640000,34.155,47.1,6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1288000
  custom_metrics: {}
  date: 2021-07-01_12-58-48
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.7999999999999
  episode_reward_mean: 34.433999999999905
  episode_reward_min: 5.999999999999952
  episodes_this_iter: 25
  episodes_total: 6425
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2273476216942072
          entropy_coeff: 0.0
          kl: 0.00958817829086911
          policy_loss: -0.013595336451544426
          total_loss: 49.55330407619476
          vf_explained_var: 0.552842915058136
          vf_loss: 49.56204617023468
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,161,2612.94,644000,34.434,49.8,6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1296000
  custom_metrics: {}
  date: 2021-07-01_12-59-05
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.7999999999999
  episode_reward_mean: 34.66499999999991
  episode_reward_min: 5.999999999999952
  episodes_this_iter: 50
  episodes_total: 6475
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2162245586514473
          entropy_coeff: 0.0
          kl: 0.009379845345392823
          policy_loss: -0.021022748725954443
          total_loss: 48.32509821653366
          vf_explained_var: 0.5778358578681946
          vf_loss: 48.34137284755707
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,162,2629.79,648000,34.665,49.8,6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1304000
  custom_metrics: {}
  date: 2021-07-01_12-59-23
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.7999999999999
  episode_reward_mean: 34.7609999999999
  episode_reward_min: 17.699999999999903
  episodes_this_iter: 25
  episodes_total: 6500
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.22046523028984666
          entropy_coeff: 0.0
          kl: 0.009326205872639548
          policy_loss: -0.016885988152353093
          total_loss: 43.75933909416199
          vf_explained_var: 0.5231469869613647
          vf_loss: 43.77150458097458
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,163,2647.18,652000,34.761,49.8,17.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1312000
  custom_metrics: {}
  date: 2021-07-01_12-59-39
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.1999999999999
  episode_reward_mean: 34.49399999999991
  episode_reward_min: 15.899999999999908
  episodes_this_iter: 50
  episodes_total: 6550
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.22121370490640402
          entropy_coeff: 0.0
          kl: 0.012302072544116527
          policy_loss: -0.020638422603951767
          total_loss: 34.98120719194412
          vf_explained_var: 0.5532575845718384
          vf_loss: 34.99561721086502
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,164,2663.01,656000,34.494,49.2,15.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1320000
  custom_metrics: {}
  date: 2021-07-01_12-59-54
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.29999999999991
  episode_reward_mean: 33.21599999999991
  episode_reward_min: 8.999999999999936
  episodes_this_iter: 50
  episodes_total: 6600
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2167542432434857
          entropy_coeff: 0.0
          kl: 0.0091228888486512
          policy_loss: -0.01421427502646111
          total_loss: 45.66939890384674
          vf_explained_var: 0.47786056995391846
          vf_loss: 45.67899560928345
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,165,2678.65,660000,33.216,45.3,9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1328000
  custom_metrics: {}
  date: 2021-07-01_13-00-10
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.9999999999999
  episode_reward_mean: 32.58299999999991
  episode_reward_min: 8.999999999999936
  episodes_this_iter: 25
  episodes_total: 6625
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.21668453561142087
          entropy_coeff: 0.0
          kl: 0.01135477764182724
          policy_loss: -0.01901558364625089
          total_loss: 53.20595681667328
          vf_explained_var: 0.5217186212539673
          vf_loss: 53.21922433376312
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,166,2694.55,664000,32.583,48,9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1336000
  custom_metrics: {}
  date: 2021-07-01_13-00-26
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.9999999999999
  episode_reward_mean: 32.612999999999914
  episode_reward_min: 8.999999999999936
  episodes_this_iter: 50
  episodes_total: 6675
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.21538107795640826
          entropy_coeff: 0.0
          kl: 0.010739857898443006
          policy_loss: -0.015469940175535157
          total_loss: 43.212106227874756
          vf_explained_var: 0.5664441585540771
          vf_loss: 43.222140073776245
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,167,2710.7,668000,32.613,48,9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1344000
  custom_metrics: {}
  date: 2021-07-01_13-00-43
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.9999999999999
  episode_reward_mean: 33.42299999999991
  episode_reward_min: 11.699999999999925
  episodes_this_iter: 25
  episodes_total: 6700
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2069063000380993
          entropy_coeff: 0.0
          kl: 0.008866777963703498
          policy_loss: -0.01646372675895691
          total_loss: 47.40563225746155
          vf_explained_var: 0.4623560905456543
          vf_loss: 47.417606830596924
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,168,2727.12,672000,33.423,48,11.7,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1352000
  custom_metrics: {}
  date: 2021-07-01_13-01-00
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.9999999999999
  episode_reward_mean: 33.500999999999905
  episode_reward_min: 12.299999999999914
  episodes_this_iter: 50
  episodes_total: 6750
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.21435610111802816
          entropy_coeff: 0.0
          kl: 0.010512712833588012
          policy_loss: -0.0234015857859049
          total_loss: 40.08094775676727
          vf_explained_var: 0.5776660442352295
          vf_loss: 40.099027037620544
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,169,2744.49,676000,33.501,48,12.3,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1360000
  custom_metrics: {}
  date: 2021-07-01_13-01-16
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.5999999999999
  episode_reward_mean: 32.609999999999914
  episode_reward_min: 6.000000000000017
  episodes_this_iter: 50
  episodes_total: 6800
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2170128095895052
          entropy_coeff: 0.0
          kl: 0.007484836882213131
          policy_loss: -0.007072918291669339
          total_loss: 63.3177444934845
          vf_explained_var: 0.3791038393974304
          vf_loss: 63.32102870941162
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,170,2760.62,680000,32.61,48.6,6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1368000
  custom_metrics: {}
  date: 2021-07-01_13-01-32
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.5999999999999
  episode_reward_mean: 32.72399999999991
  episode_reward_min: 6.000000000000017
  episodes_this_iter: 25
  episodes_total: 6825
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.21780216647312045
          entropy_coeff: 0.0
          kl: 0.010549621743848547
          policy_loss: -0.023908889677841216
          total_loss: 36.60573881864548
          vf_explained_var: 0.5831892490386963
          vf_loss: 36.62430739402771
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,171,2776.36,684000,32.724,48.6,6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1376000
  custom_metrics: {}
  date: 2021-07-01_13-01-48
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.299999999999905
  episode_reward_mean: 33.749999999999915
  episode_reward_min: 6.000000000000017
  episodes_this_iter: 50
  episodes_total: 6875
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.21999898832291365
          entropy_coeff: 0.0
          kl: 0.00967420803499408
          policy_loss: -0.01003492617746815
          total_loss: 44.01618945598602
          vf_explained_var: 0.5957474112510681
          vf_loss: 44.02132749557495
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.7593750000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,172,2792.37,688000,33.75,48.3,6,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1384000
  custom_metrics: {}
  date: 2021-07-01_13-02-06
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.799999999999905
  episode_reward_mean: 33.36899999999991
  episode_reward_min: 9.899999999999968
  episodes_this_iter: 25
  episodes_total: 6900
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.22293304931372404
          entropy_coeff: 0.0
          kl: 0.011049154039938003
          policy_loss: -0.020793142612092197
          total_loss: 52.960651993751526
          vf_explained_var: 0.4275742471218109
          vf_loss: 52.975852370262146
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.7593750000000001
          cur_l

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,173,2809.55,692000,33.369,46.8,9.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1392000
  custom_metrics: {}
  date: 2021-07-01_13-02-23
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.89999999999991
  episode_reward_mean: 32.58899999999992
  episode_reward_min: -49.50000000000007
  episodes_this_iter: 50
  episodes_total: 6950
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.21623583836480975
          entropy_coeff: 0.0
          kl: 0.016494034833158366
          policy_loss: -0.03617893095361069
          total_loss: 54.19572901725769
          vf_explained_var: 0.4942289888858795
          vf_loss: 54.22355842590332
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.7593750000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,174,2827.23,696000,32.589,48.9,-49.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1400000
  custom_metrics: {}
  date: 2021-07-01_13-02-41
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.89999999999991
  episode_reward_mean: 32.65799999999991
  episode_reward_min: -49.50000000000007
  episodes_this_iter: 50
  episodes_total: 7000
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.22391396202147007
          entropy_coeff: 0.0
          kl: 0.00959914390114136
          policy_loss: -0.022862165642436594
          total_loss: 46.81623840332031
          vf_explained_var: 0.45141756534576416
          vf_loss: 46.83424139022827
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.7593750000000001
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,175,2844.65,700000,32.658,48.9,-49.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1408000
  custom_metrics: {}
  date: 2021-07-01_13-02-57
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.4999999999999
  episode_reward_mean: 34.02599999999991
  episode_reward_min: -49.50000000000007
  episodes_this_iter: 25
  episodes_total: 7025
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.20179722364991903
          entropy_coeff: 0.0
          kl: 0.010360228407080285
          policy_loss: -0.019091066365945153
          total_loss: 48.753127574920654
          vf_explained_var: 0.5255646109580994
          vf_loss: 48.76697289943695
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.7593750000000001
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,176,2861.2,704000,34.026,46.5,-49.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1416000
  custom_metrics: {}
  date: 2021-07-01_13-03-15
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 52.1999999999999
  episode_reward_mean: 35.13599999999991
  episode_reward_min: 10.499999999999918
  episodes_this_iter: 50
  episodes_total: 7075
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.21377791184931993
          entropy_coeff: 0.0
          kl: 0.012634761922527105
          policy_loss: -0.019614091113908216
          total_loss: 41.71421056985855
          vf_explained_var: 0.5917768478393555
          vf_loss: 41.727427661418915
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.7593750000000001
          cur_lr:

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,177,2878.62,708000,35.136,52.2,10.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1424000
  custom_metrics: {}
  date: 2021-07-01_13-03-32
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 52.1999999999999
  episode_reward_mean: 35.85599999999991
  episode_reward_min: 10.499999999999918
  episodes_this_iter: 25
  episodes_total: 7100
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.2073258738964796
          entropy_coeff: 0.0
          kl: 0.0073325229532201774
          policy_loss: -0.01641579464194365
          total_loss: 78.31391596794128
          vf_explained_var: 0.40429461002349854
          vf_loss: 78.3266190290451
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.7593750000000001
          cur_lr: 0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,178,2895.54,712000,35.856,52.2,10.5,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1432000
  custom_metrics: {}
  date: 2021-07-01_13-03-48
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.699999999999896
  episode_reward_mean: 35.14499999999991
  episode_reward_min: 15.899999999999922
  episodes_this_iter: 50
  episodes_total: 7150
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.18840395519509912
          entropy_coeff: 0.0
          kl: 0.009152248705504462
          policy_loss: -0.021580108907073736
          total_loss: 44.6403426527977
          vf_explained_var: 0.588764488697052
          vf_loss: 44.65728962421417
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.7593750000000001
          cur_lr: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,179,2911.53,716000,35.145,50.7,15.9,100


Result for PPO_MultiAgentArena_7f8d1_00000:
  agent_timesteps_total: 1440000
  custom_metrics: {}
  date: 2021-07-01_13-04-02
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 53.9999999999999
  episode_reward_mean: 36.81599999999991
  episode_reward_min: 12.89999999999991
  episodes_this_iter: 50
  episodes_total: 7200
  experiment_id: 9a7578a6603e466f8f19eb8a71ff19dd
  hostname: SKCC17N00536.local
  info:
    learner:
      policy1:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.5062500000000001
          cur_lr: 0.0001
          entropy: 0.19940048549324274
          entropy_coeff: 0.0
          kl: 0.011309475332382135
          policy_loss: -0.02954732798389159
          total_loss: 56.96694266796112
          vf_explained_var: 0.458646297454834
          vf_loss: 56.990763902664185
      policy2:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.7593750000000001
          cur_lr: 0.0

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,180,2925.99,720000,36.816,54,12.9,100


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,TERMINATED,,180,2925.99,720000,36.816,54,12.9,100


2021-07-01 13:04:03,100	INFO tune.py:549 -- Total run time: 2952.21 seconds (2951.97 seconds for the tuning loop).
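
The status rows printed between the result dumps are easier to scan than the dumps themselves. As a minimal sketch, the snippet below parses the header and the last three rows exactly as they appear in the output above and pulls out the reward per iteration. The variable names are illustrative, and the comma-separated layout is an assumption based on how the table appears in this export.

```python
import csv
import io

# Header and last three status rows, copied from the output above.
status_text = """\
Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,178,2895.54,712000,35.856,52.2,10.5,100
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,179,2911.53,716000,35.145,50.7,15.9,100
PPO_MultiAgentArena_7f8d1_00000,RUNNING,172.30.1.40:21670,180,2925.99,720000,36.816,54,12.9,100
"""

rows = list(csv.DictReader(io.StringIO(status_text)))

# (iteration, mean episode reward) pairs.
curve = [(int(r["iter"]), float(r["reward"])) for r in rows]

# Environment timesteps collected between consecutive iterations.
ts = [int(r["ts"]) for r in rows]
steps_per_iter = [b - a for a, b in zip(ts, ts[1:])]

# Iteration with the highest mean reward among the parsed rows.
best_iter, best_reward = max(curve, key=lambda pair: pair[1])

print(curve)
print(steps_per_iter)
print(best_iter, best_reward)
```

The same approach works on any number of rows, which makes it easy to spot plateaus or dips without reading every `Result for ...` block.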


------------------
## 15 min break :)
------------------


(while the above experiment is running (and hopefully learning))


## How do we extract any checkpoint from a trial of a tune.run?

In [25]:
# The previous tune.run (the one we did before the exercise) returned an Analysis object, from which we can access any checkpoint
# (given we set checkpoint_freq or checkpoint_at_end to reasonable values) like so:
print(analysis)
# Get all trials (we only have one).
trials = analysis.trials
# Assuming the first trial was the best, extract that trial's best checkpoint:
best_checkpoint = analysis.get_best_checkpoint(trial=trials[0], metric="episode_reward_mean", mode="max")
print(f"Found best checkpoint for trial #2: {best_checkpoint}")

# Undo the grid-search config, which RLlib doesn't understand.
rllib_config = tune_config.copy()
rllib_config["lr"] = 0.00005
rllib_config["train_batch_size"] = 4000

# Restore an RLlib Trainer from the checkpoint.
new_trainer = PPOTrainer(config=rllib_config)
new_trainer.restore(best_checkpoint)
new_trainer



<ray.tune.analysis.experiment_analysis.ExperimentAnalysis object at 0x7f8196a0b400>
Found best checkpoint for trial #2: /Users/parksurk/ray_results/PPO/PPO_MultiAgentArena_7f8d1_00000_0_2021-07-01_12-14-51/checkpoint_000180/checkpoint-180


2021-07-01 13:05:26,819	INFO trainable.py:101 -- Trainable.setup took 10.661 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-07-01 13:05:26,849	INFO trainable.py:377 -- Restored on 172.30.1.40 from checkpoint: /Users/parksurk/ray_results/PPO/PPO_MultiAgentArena_7f8d1_00000_0_2021-07-01_12-14-51/checkpoint_000180/checkpoint-180
2021-07-01 13:05:26,850	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 180, '_timesteps_total': None, '_time_total': 2925.9923446178436, '_episodes_total': 7200}


PPO
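
Conceptually, selecting a "best" checkpoint means comparing a metric across saved checkpoints and taking the extreme value for the chosen `mode`. The following is only an illustrative sketch of that idea in plain Python, not Tune's implementation. The reward values are taken from the last three iterations of the log above; the short checkpoint names are placeholders, and whether a checkpoint exists for each iteration depends on `checkpoint_freq`.

```python
# Hypothetical (checkpoint name, episode_reward_mean) pairs.
# Rewards come from iterations 178-180 in the log; names are placeholders.
checkpoints = [
    ("checkpoint_000178", 35.856),
    ("checkpoint_000179", 35.145),
    ("checkpoint_000180", 36.816),
]


def pick_best(checkpoints, mode="max"):
    """Return the checkpoint name with the largest (or smallest) metric value."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    chooser = max if mode == "max" else min
    name, _ = chooser(checkpoints, key=lambda item: item[1])
    return name


best = pick_best(checkpoints, mode="max")
print(best)
```

With `mode="min"` the same helper would return the checkpoint with the lowest metric, which is what you would want for a loss-like quantity.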

In [28]:
out = Output()
display.display(out)

with out:
    obs = env.reset()
    while True:
        a1 = new_trainer.compute_action(obs["agent1"], policy_id="policy1")
        a2 = new_trainer.compute_action(obs["agent2"], policy_id="policy2")
        actions = {"agent1": a1, "agent2": a2}
        obs, rewards, dones, _ = env.step(actions)

        out.clear_output(wait=True)
        env.render()
        time.sleep(0.07)

        if dones["agent1"] is True:
            break


Output()
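
The loop above only renders the episode. If you also want the per-agent return, the same dict-based loop can accumulate rewards as it goes. The sketch below uses a hypothetical `ToyTwoAgentEnv` and plain Python functions in place of the tutorial's environment and the trainer's `compute_action` calls, so that it runs without Ray. The reward values and the number of actions are arbitrary, and it assumes the environment also reports the `"__all__"` key in `dones`, which is RLlib's multi-agent convention for "the whole episode is over".

```python
import random


class ToyTwoAgentEnv:
    """Hypothetical stand-in for a two-agent env (not the tutorial's arena)."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return {"agent1": 0, "agent2": 0}

    def step(self, actions):
        self.t += 1
        obs = {agent: self.t for agent in actions}
        # Arbitrary fixed rewards, chosen only to make the totals easy to check.
        rewards = {"agent1": 1.0, "agent2": -0.1}
        done = self.t >= self.horizon
        dones = {"agent1": done, "agent2": done, "__all__": done}
        return obs, rewards, dones, {}


def run_episode(env, policies):
    """Roll out one episode and return the summed reward per agent."""
    obs = env.reset()
    returns = {agent: 0.0 for agent in obs}
    while True:
        # One action per agent, keyed by agent ID, as in the loop above.
        actions = {agent: policies[agent](o) for agent, o in obs.items()}
        obs, rewards, dones, _ = env.step(actions)
        for agent, reward in rewards.items():
            returns[agent] += reward
        if dones["__all__"]:
            return returns


# Random policies standing in for the trained ones (4 actions is an assumption).
policies = {
    "agent1": lambda obs: random.randrange(4),
    "agent2": lambda obs: random.randrange(4),
}
returns = run_episode(ToyTwoAgentEnv(horizon=5), policies)
print(returns)
```

In the notebook, the policy functions would be replaced by the two `new_trainer.compute_action(..., policy_id=...)` calls.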

## Let's talk about customization options

### Deep Dive: How do we customize RLlib's RL loop?

RLlib offers a callbacks API that allows you to add custom behavior to
all major events during the environment sampling and learning process.

**Our problem:** So far, we can only see standard stats, such as rewards, episode lengths, etc.
These do not always give us enough insight into important questions, such as: How many times
have both agents collided? Or: How many times has agent1 discovered a new field?

In the following cell, we will create custom callback "hooks" that allow us to
add these stats to the returned metrics dict, so that they are also displayed in TensorBoard.

For that, we will subclass RLlib's `DefaultCallbacks` class and implement its
`on_episode_start`, `on_episode_step`, and `on_episode_end` methods:

RLlib는 환경 샘플링 및 학습 과정 동안의 모든 주요 이벤트에 대응하는 사용자 정의 행동을 추가 할 수있는 콜백 API를 제공합니다.

**문제 :** 지금까지는 Reward, 에피소드 길이 등과 같은 표준 통계만 볼 수 있었습니다.
이것은 때때로 다음과 같은 중요한 질문에 대한 충분한 통찰력을 제공하지 않습니다.
두 요원이 충돌 했습니까? 또는 agent1이 새 필드를 몇 번이나 발견 했습니까?

다음 셀에서는 사용자 정의 Callback "hook"를 생성하고, 이런 통계치를 반환된 메트릭 딕셔너리에 추가하여 텐서 보드에 표시합니다!

이를 위해 RLlib의 DefaultCallbacks 클래스를 재정의하고
`on_episode_start`,`on_episode_step` 및`on_episode_end` 메소드를 구현합니다. :


In [30]:
# Override the DefaultCallbacks with your own and implement any methods (hooks)
# that you need.
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.evaluation.episode import MultiAgentEpisode


class MyCallbacks(DefaultCallbacks):
    def on_episode_start(self,
                         *,
                         worker,
                         base_env,
                         policies,
                         episode: MultiAgentEpisode,
                         env_index,
                         **kwargs):
        # We will use the `MultiAgentEpisode` object being passed into
        # all episode-related callbacks. It comes with a user_data property (dict),
        # which we can write arbitrary data into.

        # At the end of an episode, we'll transfer that data into the `custom_metrics`
        # property to make sure our custom data is displayed in TensorBoard.

        # The episode is starting:
        # Set a per-episode counter for the number of new fields discovered by agent1.
        episode.user_data["new_fields_discovered"] = 0
        # Set a per-episode collision counter (how many times has agent2 blocked agent1?).
        episode.user_data["num_collisions"] = 0

    def on_episode_step(self,
                        *,
                        worker,
                        base_env,
                        episode: MultiAgentEpisode,
                        env_index,
                        **kwargs):
        # Get both rewards.
        ag1_r = episode.prev_reward_for("agent1")
        ag2_r = episode.prev_reward_for("agent2")

        # Agent1 discovered a new field.
        if ag1_r == 1.0:
            episode.user_data["new_fields_discovered"] += 1
        # Collision.
        elif ag2_r == 1.0:
            episode.user_data["num_collisions"] += 1

    def on_episode_end(self,
                       *,
                       worker,
                       base_env,
                       policies,
                       episode: MultiAgentEpisode,
                       env_index,
                       **kwargs):
        # Episode is done:
        # Copy the per-episode counters from `user_data` into `custom_metrics`,
        # so they are reported in the training results and shown in TensorBoard.
        episode.custom_metrics["new_fields_discovered"] = episode.user_data["new_fields_discovered"]
        episode.custom_metrics["num_collisions"] = episode.user_data["num_collisions"]
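
Each key written to `episode.custom_metrics` shows up in the training results in aggregated form, as `<key>_max`, `<key>_mean`, and `<key>_min` (for example `num_collisions_mean`). The following is a minimal, framework-free sketch of that kind of aggregation. The function name `summarize_custom_metrics` and the episode values are made up for illustration; RLlib's own implementation may differ in details such as the window of episodes it averages over.

```python
from statistics import mean


def summarize_custom_metrics(episodes):
    """Aggregate per-episode metric dicts into <key>_max/_mean/_min entries."""
    collected = {}
    for metrics in episodes:
        for key, value in metrics.items():
            collected.setdefault(key, []).append(value)

    summary = {}
    for key, values in sorted(collected.items()):
        summary[key + "_max"] = max(values)
        summary[key + "_mean"] = mean(values)
        summary[key + "_min"] = min(values)
    return summary


# Hypothetical per-episode values, as `on_episode_end` might have written them.
episodes = [
    {"new_fields_discovered": 19, "num_collisions": 0},
    {"new_fields_discovered": 55, "num_collisions": 7},
    {"new_fields_discovered": 31, "num_collisions": 2},
]
summary = summarize_custom_metrics(episodes)
for name, value in summary.items():
    print(name, value)
```

Because only the aggregated values are reported, a single unusual episode is visible through `_max` or `_min` but barely moves `_mean`.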


In [33]:
new_trainer.stop()
# Setting up our config to point to our new custom callbacks class:
config = {
    "env": MultiAgentArena,
    "callbacks": MyCallbacks,  # by default, this would point to `rllib.agents.callbacks.DefaultCallbacks`, which does nothing.
    "num_workers": 5,  # we know now: this speeds up things!
}

tune.run(
    "PPO",
    config=config,
    stop={"training_iteration": 20},
    checkpoint_at_end=True,
    # If you'd like to restore the tune run from an existing checkpoint file, you can do the following:
    #restore="/Users/sven/ray_results/PPO/PPO_MultiAgentArena_fd451_00000_0_2021-05-25_15-13-26/checkpoint_000010/checkpoint-10",
)

Trial name,status,loc
PPO_MultiAgentArena_d2605_00000,PENDING,


[2m[36m(pid=44012)[0m 2021-07-01 13:28:53,236	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=44012)[0m 2021-07-01 13:28:53,236	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=44012)[0m 2021-07-01 13:29:08,596	INFO trainable.py:101 -- Trainable.setup took 15.361 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 4000
  custom_metrics:
    new_fields_discovered_max: 55
    new_fields_discovered_mean: 33.25
    new_fields_discovered_min: 19
    num_collisions_max: 7
    num_collisions_mean: 1.3
    num_collisions_min: 0
  date: 2021-07-01_13-29-13
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.09999999999998
  episode_reward_mean: -9.119999999999996
  episode_reward_min: -31.50000000000003
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.366931676864624
          entropy_coeff: 0.0
          kl: 0.0194054264575243
          model: {}
          policy_loss: -0.05213673785328865
          total_loss: 25.09111976623535
          vf_explained_var: 0.

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,RUNNING,172.30.1.40:44012,1,4.97972,4000,-9.12,23.1,-31.5,100


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 12000
  custom_metrics:
    new_fields_discovered_max: 55
    new_fields_discovered_mean: 35.63333333333333
    new_fields_discovered_min: 19
    num_collisions_max: 20
    num_collisions_mean: 1.7833333333333334
    num_collisions_min: 0
  date: 2021-07-01_13-29-21
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.09999999999998
  episode_reward_mean: -5.194999999999993
  episode_reward_min: -31.50000000000003
  episodes_this_iter: 20
  episodes_total: 60
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.2905468940734863
          entropy_coeff: 0.0
          kl: 0.02482198178768158
          model: {}
          policy_loss: -0.05899612605571747
          total_loss: 18.5784854888916


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,RUNNING,172.30.1.40:44012,3,13.0767,12000,-5.195,23.1,-31.5,100


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 20000
  custom_metrics:
    new_fields_discovered_max: 55
    new_fields_discovered_mean: 36.67
    new_fields_discovered_min: 19
    num_collisions_max: 20
    num_collisions_mean: 1.73
    num_collisions_min: 0
  date: 2021-07-01_13-29-29
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.09999999999998
  episode_reward_mean: -3.629999999999992
  episode_reward_min: -31.50000000000003
  episodes_this_iter: 20
  episodes_total: 100
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 1.2443851232528687
          entropy_coeff: 0.0
          kl: 0.020394615828990936
          model: {}
          policy_loss: -0.05789593607187271
          total_loss: 14.557127952575684
          vf_explained

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,RUNNING,172.30.1.40:44012,5,20.3497,20000,-3.63,23.1,-31.5,100


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 28000
  custom_metrics:
    new_fields_discovered_max: 51
    new_fields_discovered_mean: 37.99
    new_fields_discovered_min: 23
    num_collisions_max: 7
    num_collisions_mean: 1.68
    num_collisions_min: 0
  date: 2021-07-01_13-29-36
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 16.500000000000018
  episode_reward_mean: -1.6679999999999893
  episode_reward_min: -25.500000000000004
  episodes_this_iter: 20
  episodes_total: 140
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 1.1852091550827026
          entropy_coeff: 0.0
          kl: 0.01788623444736004
          model: {}
          policy_loss: -0.053150419145822525
          total_loss: 18.78753662109375
          vf_explaine

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,RUNNING,172.30.1.40:44012,7,27.8785,28000,-1.668,16.5,-25.5,100


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 36000
  custom_metrics:
    new_fields_discovered_max: 53
    new_fields_discovered_mean: 39.17
    new_fields_discovered_min: 23
    num_collisions_max: 11
    num_collisions_mean: 2.07
    num_collisions_min: 0
  date: 2021-07-01_13-29-44
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 19.499999999999922
  episode_reward_mean: 0.3210000000000104
  episode_reward_min: -25.500000000000004
  episodes_this_iter: 20
  episodes_total: 180
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 1.1406888961791992
          entropy_coeff: 0.0
          kl: 0.018338419497013092
          model: {}
          policy_loss: -0.05762483552098274
          total_loss: 14.989665985107422
          vf_explain

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,RUNNING,172.30.1.40:44012,9,35.2997,36000,0.321,19.5,-25.5,100


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 44000
  custom_metrics:
    new_fields_discovered_max: 54
    new_fields_discovered_mean: 40.29
    new_fields_discovered_min: 23
    num_collisions_max: 15
    num_collisions_mean: 2.0
    num_collisions_min: 0
  date: 2021-07-01_13-29-51
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 21.59999999999993
  episode_reward_mean: 1.9890000000000065
  episode_reward_min: -25.500000000000004
  episodes_this_iter: 20
  episodes_total: 220
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 1.0832452774047852
          entropy_coeff: 0.0
          kl: 0.019080184400081635
          model: {}
          policy_loss: -0.05616436526179314
          total_loss: 17.575571060180664
          vf_explained

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,RUNNING,172.30.1.40:44012,11,42.6406,44000,1.989,21.6,-25.5,100


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 52000
  custom_metrics:
    new_fields_discovered_max: 54
    new_fields_discovered_mean: 42.38
    new_fields_discovered_min: 28
    num_collisions_max: 15
    num_collisions_mean: 1.99
    num_collisions_min: 0
  date: 2021-07-01_13-30-00
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 22.79999999999997
  episode_reward_mean: 5.181000000000004
  episode_reward_min: -13.799999999999978
  episodes_this_iter: 20
  episodes_total: 260
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 1.0272111892700195
          entropy_coeff: 0.0
          kl: 0.018489399924874306
          model: {}
          policy_loss: -0.05418701097369194
          total_loss: 18.107131958007812
          vf_explained

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,RUNNING,172.30.1.40:44012,13,51.0677,52000,5.181,22.8,-13.8,100


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 60000
  custom_metrics:
    new_fields_discovered_max: 54
    new_fields_discovered_mean: 43.47
    new_fields_discovered_min: 32
    num_collisions_max: 13
    num_collisions_mean: 2.1
    num_collisions_min: 0
  date: 2021-07-01_13-30-07
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 22.79999999999997
  episode_reward_mean: 6.878999999999999
  episode_reward_min: -11.399999999999988
  episodes_this_iter: 20
  episodes_total: 300
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 0.9820687770843506
          entropy_coeff: 0.0
          kl: 0.01893707364797592
          model: {}
          policy_loss: -0.054366763681173325
          total_loss: 20.352720260620117
          vf_explained_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,RUNNING,172.30.1.40:44012,15,58.3884,60000,6.879,22.8,-11.4,100


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 68000
  custom_metrics:
    new_fields_discovered_max: 54
    new_fields_discovered_mean: 43.78
    new_fields_discovered_min: 30
    num_collisions_max: 13
    num_collisions_mean: 2.16
    num_collisions_min: 0
  date: 2021-07-01_13-30-14
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 22.79999999999997
  episode_reward_mean: 7.3439999999999985
  episode_reward_min: -11.999999999999984
  episodes_this_iter: 20
  episodes_total: 340
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 0.9339596033096313
          entropy_coeff: 0.0
          kl: 0.018967602401971817
          model: {}
          policy_loss: -0.05348660424351692
          total_loss: 19.252872467041016
          vf_explaine

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,RUNNING,172.30.1.40:44012,17,65.669,68000,7.344,22.8,-12,100


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 76000
  custom_metrics:
    new_fields_discovered_max: 54
    new_fields_discovered_mean: 43.53
    new_fields_discovered_min: 30
    num_collisions_max: 10
    num_collisions_mean: 2.06
    num_collisions_min: 0
  date: 2021-07-01_13-30-21
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.1
  episode_reward_mean: 6.839999999999993
  episode_reward_min: -11.999999999999984
  episodes_this_iter: 20
  episodes_total: 380
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 0.8836515545845032
          entropy_coeff: 0.0
          kl: 0.0193608608096838
          model: {}
          policy_loss: -0.05156983435153961
          total_loss: 21.538658142089844
          vf_explained_var: 0.3287947

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,RUNNING,172.30.1.40:44012,19,72.8003,76000,6.84,23.1,-12,100


Result for PPO_MultiAgentArena_d2605_00000:
  agent_timesteps_total: 80000
  custom_metrics:
    new_fields_discovered_max: 61
    new_fields_discovered_mean: 44.25
    new_fields_discovered_min: 30
    num_collisions_max: 10
    num_collisions_mean: 2.25
    num_collisions_min: 0
  date: 2021-07-01_13-30-25
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 33.299999999999905
  episode_reward_mean: 8.00999999999999
  episode_reward_min: -11.999999999999984
  episodes_this_iter: 20
  episodes_total: 400
  experiment_id: 1b52ca63368549ca92f67da3fee55fbb
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 0.84686279296875
          entropy_coeff: 0.0
          kl: 0.019242260605096817
          model: {}
          policy_loss: -0.052279114723205566
          total_loss: 30.78931999206543
          vf_explained_va

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_d2605_00000,TERMINATED,,20,76.3225,80000,8.01,33.3,-12,100


2021-07-01 13:30:26,036	INFO tune.py:549 -- Total run time: 101.23 seconds (100.75 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7f819120e9a0>

### Let's check tensorboard for the new custom metrics!

1. Head over to the Anyscale project view and click on the "TensorBoard" button:

<img src="images/tensorboard_button.png" width=1000>

Alternatively - if you ran this locally on your own machine:

1. Head over to `~/ray_results/PPO/PPO_MultiAgentArena_[some key]_00000_0_[date]_[time]/`.
1. In that directory, you should see an `events.out.tfevents....` file.
1. Run `tensorboard --logdir .` and head to http://localhost:6006
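
If you are unsure which trial directory is the most recent one, a small helper can pick it for you. This is a minimal sketch using only the standard library; the helper name `newest_trial_dir` is made up for this example, and it assumes the default `~/ray_results/PPO` location mentioned above.

```python
import os


def newest_trial_dir(results_root):
    """Return the most recently modified sub-directory of `results_root`, or None."""
    candidates = [
        os.path.join(results_root, name)
        for name in os.listdir(results_root)
        if os.path.isdir(os.path.join(results_root, name))
    ]
    return max(candidates, key=os.path.getmtime) if candidates else None


# Assumed default location of PPO results; adjust if your runs live elsewhere.
results_root = os.path.expanduser("~/ray_results/PPO")
if os.path.isdir(results_root):
    # Print the command to run for the latest trial.
    print("tensorboard --logdir", newest_trial_dir(results_root))
```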

<img src="images/tensorboard.png" width=800>


### Deep Dive: Writing custom Models in tf or torch.

In [34]:
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tf_version = try_import_tf()
torch, nn = try_import_torch()


# Custom Neural Network Models.
class MyKerasModel(TFModelV2):
    """Custom model for policy gradient algorithms."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        """Build a simple MLP with one 16-unit hidden layer (+ value branch)."""
        super(MyKerasModel, self).__init__(obs_space, action_space,
                                           num_outputs, model_config, name)
        
        # Keras Input layer.
        self.inputs = tf.keras.layers.Input(
            shape=obs_space.shape, name="observations")

        # Hidden layer (shared by action logits outputs and value output).
        layer_1 = tf.keras.layers.Dense(
            16,
            name="layer1",
            activation=tf.nn.relu)(self.inputs)
        
        # Action logits output.
        logits = tf.keras.layers.Dense(
            num_outputs,
            name="out",
            activation=None)(layer_1)

        # "Value"-branch (single node output).
        # Used by several RLlib algorithms (e.g. PPO) to calculate an observation's value.
        value_out = tf.keras.layers.Dense(
            1,
            name="value",
            activation=None)(layer_1)

        # The actual Keras model:
        self.base_model = tf.keras.Model(self.inputs,
                                         [logits, value_out])
        # Holds the most recent value output (set in `forward()`).
        self.cur_value = None

    def forward(self, input_dict, state, seq_lens):
        """Custom-define your forward pass logic here."""
        # Pass inputs through the hidden layer and both output layers, and
        # store the "value" of the observation for when `value_function` is called.
        logits, self.cur_value = self.base_model(input_dict["obs"])
        return logits, state

    def value_function(self):
        """Implement the value branch forward pass logic here:
        
        We will just return the already calculated `self.cur_value`.
        """
        assert self.cur_value is not None, "Must call `forward()` first!"
        return tf.reshape(self.cur_value, [-1])


class MyTorchModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        """Build a simple MLP with one 16-unit hidden layer (+ value branch)."""
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)

        self.device = torch.device("cuda"
                                   if torch.cuda.is_available() else "cpu")

        # Hidden layer (shared by action logits outputs and value output).
        self.layer_1 = nn.Linear(obs_space.shape[0], 16).to(self.device)

        # Action logits output.
        self.layer_out = nn.Linear(16, num_outputs).to(self.device)

        # "Value"-branch (single node output).
        # Used by several RLlib algorithms (e.g. PPO) to calculate an observation's value.
        self.value_branch = nn.Linear(16, 1).to(self.device)
        self.cur_value = None

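    def forward(self, input_dict, state, seq_lens):
        """Custom-define your forward pass logic here."""
        # Pass inputs through the shared hidden layer (with ReLU, matching the
        # Keras model above) and then through the action logits layer.
        layer_1_out = torch.relu(self.layer_1(input_dict["obs"]))
        logits = self.layer_out(layer_1_out)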

        # Calculate the "value" of the observation and store it for
        # when `value_function` is called.
        self.cur_value = self.value_branch(layer_1_out).squeeze(1)

        return logits, state

    def value_function(self):
        """Implement the value branch forward pass logic here:
        
        We will just return the already calculated `self.cur_value`.
        """
        assert self.cur_value is not None, "Must call `forward()` first!"
        return self.cur_value


In [35]:
# Do a quick test on the custom model classes.
#test_model_tf = MyKerasModel(
#    obs_space=gym.spaces.Box(-1.0, 1.0, (2, )),
#    action_space=None,
#    num_outputs=2,
#    model_config={},
#    name="MyModel",
#)

#print("TF-output={}".format(test_model_tf({"obs": np.array([[0.5, 0.5]])})))

# For PyTorch, you can do:
test_model_torch = MyTorchModel(
    obs_space=gym.spaces.Box(-1.0, 1.0, (2, )),
    action_space=None,
    num_outputs=2,
    model_config={},
    name="MyModel",
)
print("Torch-output={}".format(test_model_torch({"obs": torch.from_numpy(np.array([[0.5, 0.5]], dtype=np.float32))})))


Torch-output=(tensor([[ 0.6229, -0.7027]], grad_fn=<AddmmBackward>), [])
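
To make the layout of the two model classes above concrete, here is a minimal pure-Python sketch of the same idea: one shared hidden layer feeding two heads, one for the action logits and one for the single value output. The weights are hand-picked for illustration and the hidden layer has 2 units instead of 16; none of the helper names come from RLlib.

```python
def linear(x, weights, bias):
    """Compute W x + b for one input vector (`weights` is a list of rows)."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def relu(v):
    """Element-wise max(0, a)."""
    return [max(0.0, a) for a in v]


def forward(obs, params):
    """Shared hidden layer, then separate logits and value heads."""
    hidden = relu(linear(obs, params["w1"], params["b1"]))
    logits = linear(hidden, params["w_out"], params["b_out"])
    value = linear(hidden, params["w_val"], params["b_val"])[0]
    return logits, value


# Hand-picked illustrative weights: 2 inputs -> 2 hidden units -> 2 logits / 1 value.
params = {
    "w1": [[1.0, 0.0], [0.0, -1.0]],
    "b1": [0.0, 0.0],
    "w_out": [[1.0, 1.0], [2.0, 0.0]],
    "b_out": [0.0, 0.5],
    "w_val": [[3.0, 1.0]],
    "b_val": [-1.0],
}
logits, value = forward([0.5, 0.5], params)
print("logits:", logits, "value:", value)
```

Both heads read the same `hidden` activations, which is why the model classes above can compute the value inside `forward()` and simply return the cached result from `value_function()`.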


In [40]:
# Set up our custom model and re-run the experiment.
#config.update({
#    "model": {
#        "custom_model": MyKerasModel,  # for torch users: "custom_model": MyTorchModel
#        "custom_model_config": {
#            #"layers": [128, 128],
#        },
#    },
#})

config.update({
    "model": {
        "custom_model": MyTorchModel,  # for tf users: "custom_model": MyKerasModel
        "custom_model_config": {
            "layers": [128, 128],
        },
    },
})

#tune.run(
#    "PPO",
#    config=config,  # for torch users: config=dict(config, **{"framework": "torch"}),
#    stop={
#        "training_iteration": 5,
#    },
#)

tune.run(
    "PPO",
    config=dict(config, **{"framework": "torch"}),  # for tf users: config=config
    stop={
        "training_iteration": 5,
    },
)


Trial name,status,loc
PPO_MultiAgentArena_9889f_00000,PENDING,


[2m[36m(pid=66554)[0m 2021-07-01 14:24:30,416	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=66554)[0m 2021-07-01 14:24:40,554	INFO trainable.py:101 -- Trainable.setup took 10.139 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Result for PPO_MultiAgentArena_9889f_00000:
  agent_timesteps_total: 4000
  custom_metrics:
    new_fields_discovered_max: 47
    new_fields_discovered_mean: 29.2
    new_fields_discovered_min: 17
    num_collisions_max: 5
    num_collisions_mean: 1.25
    num_collisions_min: 0
  date: 2021-07-01_14-24-44
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 12.000000000000023
  episode_reward_mean: -15.045000000000002
  episode_reward_min: -31.500000000000036
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 8219602d9bd44911809d9131340f776b
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.2
          cur_lr: 5.0e-05
          entropy: 1.3840322196483612
          entropy_coeff: 0.0
          kl: 0.0024876923416741192
          policy_loss: -0.005995301267830655
          total_loss: 38.67150855064392
          vf_explained_var: 0.003352800

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_9889f_00000,RUNNING,172.30.1.40:66554,1,4.35012,4000,-15.045,12,-31.5,100


Result for PPO_MultiAgentArena_9889f_00000:
  agent_timesteps_total: 12000
  custom_metrics:
    new_fields_discovered_max: 53
    new_fields_discovered_mean: 33.18333333333333
    new_fields_discovered_min: 16
    num_collisions_max: 7
    num_collisions_mean: 1.1666666666666667
    num_collisions_min: 0
  date: 2021-07-01_14-24-51
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 20.999999999999922
  episode_reward_mean: -9.130000000000003
  episode_reward_min: -35.40000000000004
  episodes_this_iter: 20
  episodes_total: 60
  experiment_id: 8219602d9bd44911809d9131340f776b
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.1
          cur_lr: 5.0e-05
          entropy: 1.3656450547277927
          entropy_coeff: 0.0
          kl: 0.003002431447384879
          policy_loss: -0.0022404889023164287
          total_loss: 25.51762819290161
          vf_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_9889f_00000,RUNNING,172.30.1.40:66554,3,10.5624,12000,-9.13,21,-35.4,100


Result for PPO_MultiAgentArena_9889f_00000:
  agent_timesteps_total: 20000
  custom_metrics:
    new_fields_discovered_max: 53
    new_fields_discovered_mean: 34.65
    new_fields_discovered_min: 16
    num_collisions_max: 8
    num_collisions_mean: 1.38
    num_collisions_min: 0
  date: 2021-07-01_14-24-56
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 20.999999999999922
  episode_reward_mean: -6.842999999999997
  episode_reward_min: -35.40000000000004
  episodes_this_iter: 20
  episodes_total: 100
  experiment_id: 8219602d9bd44911809d9131340f776b
  hostname: SKCC17N00536.local
  info:
    learner:
      default_policy:
        learner_stats:
          allreduce_latency: 0.0
          cur_kl_coeff: 0.025
          cur_lr: 5.0e-05
          entropy: 1.3530115261673927
          entropy_coeff: 0.0
          kl: 0.0003534120914991945
          policy_loss: -0.005451907229144126
          total_loss: 18.50159814953804
          vf_explained_var: 0.1307562

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_9889f_00000,TERMINATED,,5,15.5658,20000,-6.843,21,-35.4,100



2021-07-01 14:24:56,857	INFO tune.py:549 -- Total run time: 33.11 seconds (32.55 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7f8192ce3610>

### Deep Dive: A closer look at RLlib's components
#### (Depending on the time left and the number of questions that have accumulated :)

We already took a quick look inside an RLlib Trainer object and extracted its Policy(ies) and the Policy's model (neural network). Here is a much more detailed overview of what's inside a Trainer object.

At the core is the so-called `WorkerSet`, located under `Trainer.workers`. A WorkerSet is a group of `RolloutWorker` (`rllib.evaluation.rollout_worker.py`) objects that always consists of one "local worker" (`Trainer.workers.local_worker()`) and n "remote workers" (`Trainer.workers.remote_workers()`).

<img src="images/rllib_structure.png" width=1000>

### Scaling RLlib

Scaling RLlib works by parallelizing the "jobs" that the remote `RolloutWorkers` do. In a vanilla RL algorithm such as PPO, DQN, and many others, the RolloutWorkers labeled `@ray.remote` in the figure above are responsible for interacting with one or more environments and thereby collecting experiences. The environment produces observations; the copy of the Policy(ies) located on the remote worker then computes actions, which are sent to the environment to produce the next observation. This cycle repeats continuously and is only interrupted from time to time to send experience batches ("train batches") of a certain size to the "local worker". There, these batches are used to call `Policy.learn_on_batch()`, which performs a loss calculation followed by a model weights update; the updated weights are then broadcast back to all remote workers.
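
The sample, learn, and broadcast cycle described above can be sketched in plain Python. This is a toy illustration of the data flow only: the class `ToyRolloutWorker`, the scalar "policy weight", and the regression-style loss are all invented for this example, and the workers run one after another here, whereas RLlib's remote workers run in parallel as Ray actors.

```python
import random


class ToyRolloutWorker:
    """Stand-in for a remote worker: holds a weight copy and collects samples."""

    def __init__(self, seed):
        self.weights = 0.0
        self.rng = random.Random(seed)

    def set_weights(self, weights):
        self.weights = weights

    def sample(self, num_steps):
        # Each "experience" is an (observation, target) pair with target = 2 * obs.
        batch = []
        for _ in range(num_steps):
            obs = self.rng.uniform(-1.0, 1.0)
            batch.append((obs, 2.0 * obs))
        return batch


def learn_on_batch(weights, batch, lr=0.5):
    """One gradient step on the mean squared error of the model y = weights * obs."""
    grad = sum(2.0 * (weights * obs - target) * obs for obs, target in batch) / len(batch)
    return weights - lr * grad


def train(num_workers=3, iterations=50, steps_per_worker=20):
    local_weights = 0.0
    remote_workers = [ToyRolloutWorker(seed=i) for i in range(num_workers)]
    for _ in range(iterations):
        # 1) Every remote worker collects experiences.
        train_batch = []
        for worker in remote_workers:
            train_batch.extend(worker.sample(steps_per_worker))
        # 2) The "local worker" updates the weights on the concatenated train batch.
        local_weights = learn_on_batch(local_weights, train_batch)
        # 3) The new weights are broadcast back to all remote workers.
        for worker in remote_workers:
            worker.set_weights(local_weights)
    return local_weights, remote_workers


final_weights, workers = train()
print("learned weight:", final_weights)
```

Adding workers increases the number of experiences collected per iteration without changing the learning step itself.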

## Time for Q&A

...

## Thank you for listening and participating!

### Here are a couple of links that you may find useful.

- The <a href="https://github.com/sven1977/rllib_tutorials.git">github repo of this tutorial</a>.
- <a href="https://docs.ray.io/en/master/rllib.html">RLlib's documentation main page</a>.
- <a href="http://discuss.ray.io">Our discourse forum</a> to ask questions on Ray and its libraries.
- Our <a href="https://forms.gle/9TSdDYUgxYs8SA9e8">Slack channel</a> for interacting with other Ray RLlib users.
- The <a href="https://github.com/ray-project/ray/blob/master/rllib/examples/">RLlib examples scripts folder</a> with tons of examples on how to do different stuff with RLlib.
- A <a href="https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d">blog post on training with RLlib inside a Unity3D environment</a>.
