# Connect four with two Deep Q-Network agents from Tianshou

In the previous notebook, `2-learning-connect-four-dqn-vs-random-agent-tianshou.ipynb`, we trained a DQN agent by letting it play against a random agent.
To make things more interesting and challanging, we now train two DQN agents to play against each other.


<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training two DQN agents on connect four Gym
  - Create the Gym environment
  - Building the environment
  - Implementing the DQN policy
  - Building agents
  - Function for letting agents learn
  - Function for watching learned agent
  - Doing the experiment
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Argparser
import argparse

# More data types
import typing
import numpy as np

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
importlib.invalidate_caches()
importlib.reload(cfgym)

# Time for allowing "freezes" in execution
import time;

# Allow for copying objects in a non reference manner
import copy

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0
Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [4]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device
print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

# Show cuda device name
print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training two DQN agents on connect four Gym

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, agents will always play as the same player, e.g. always as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Create the Gym environment

Our environment used is identical to the last notebook, `2-learning-connect-four-dqn-vs-random-agent-tianshou.ipynb`.

In [5]:
####################################################
# SETTING UP THE GYM ENVIRONMENT
####################################################

# Create an instance of the environment to be used
# V2 is used as this contains edits for Tianshou
# We use the Tianshou PettingZooEnv wrapper for multiagent support
env = ts.env.PettingZooEnv(cfgym.env())

# Get information about the environment
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation
del env

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


<hr>

### Building the environment

We build a simple function to get the environment the agent will play in.
This is identical to the last notebook, `2-learning-connect-four-dqn-vs-random-agent-tianshou.ipynb`.

In [6]:
####################################################
# CONNECT FOUR V2 ENVIRONMENT
####################################################

def get_env():
    """
    Returns the connect four gym environment V2 altered for Tianshou and Petting Zoo compatibility.
    Already wrapped with a ts.env.PettingZooEnv wrapper.
    """
    return ts.env.PettingZooEnv(cfgym.env())

<hr>

### Implementing the DQN policy

The DQN policy for the agent is configured and set up below.
This is identical to the last notebook, `2-learning-connect-four-dqn-vs-random-agent-tianshou.ipynb`.

In [7]:
####################################################
# DQN POLICY
####################################################

def cf_dqn_policy(state_shape: tuple,
                  action_shape: tuple,
                  optim: typing.Optional[torch.optim.Optimizer] = None,
                  learning_rate: float =  0.0001,
                  gamma: float = 0.9, # Smaller gamma favours "faster" win
                  n_step: int = 3, # Number of steps to look ahead
                  target_update_freq: int = 320,
                  hidden_sizes: list = [128, 128, 128, 128]):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    ## Set to use the CUDA device if possible
    net = ts.utils.net.common.Net(state_shape= state_shape,
                                  action_shape= action_shape,
                                  hidden_sizes= hidden_sizes,
                                  device= device).to(device)
    
    # Default optimizer is an adam optimizer with the argparser learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)

<hr>

### Building agents

We change the code used from the previous notebook (`2-learning-connect-four-dqn-vs-random-agent-tianshou.ipynb`) so that two DQN agents are now created.

In [8]:
####################################################
# AGENT CREATION
####################################################

def get_agents(agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
               agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
               optim: typing.Optional[torch.optim.Optimizer] = None,
               resume_path_player_1: str = '', # Path to file to resume agent training from
               resume_path_player_2: str = '', 
               ) -> typing.Tuple[ts.policy.BasePolicy, torch.optim.Optimizer, list]:
    """
    Gets a multi agent policy manager, optimizer and player ids for the connect four V2 gym environment.
    Per default this returns 
        - Multi agent manager for 2 agents using DQN
        - Adam optimizer
        - ['player_1', 'player_2'] from the connect four environment
    """
    
    # Get the environment to play in (Connect four gym V2)
    env = get_env()
    
    # Get the observation space from the environment, depending on typo of space (ternary operator)
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    
    # Set the arguments
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent player 1 to be a DQN if no policy is passed.
    if agent_player1 is None:
        # Our agent1 uses a DQN policy
        agent_player1 = cf_dqn_policy(state_shape= state_shape,
                                      action_shape= action_shape,
                                      optim= optim)
        
        # If we resume our agent we need to load the previous config
        if resume_path_player_1:
            agent_player1.load_state_dict(torch.load(resume_path_player_1))
    
    # Configure agent player 2 to be a DQN if no policy is passed.
    if agent_player2 is None:
        # Our agent1 uses a DQN policy
        agent_player2 = cf_dqn_policy(state_shape= state_shape,
                                      action_shape= action_shape,
                                      optim= optim)
        
        # If we resume our agent we need to load the previous config
        if resume_path_player_2:
            agent_player2.load_state_dict(torch.load(resume_path_player_2))

    # Both our agents are DQN agents by default
    agents = [agent_player1, agent_player2]
        
    # Our policy depends on the order of the agents
    policy = ts.policy.MultiAgentPolicyManager(agents, env)
    
    # Return our policy, optimizer and the available agents in the environment
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    
    return policy, optim, env.agents

<hr>

### Function for letting agents learn

Whilst the previous notebook (`2-learning-connect-four-dqn-vs-random-agent-tianshou.ipynb`) only trained one agent, we now have to train two agents.
This makes things a little harder.
Most importantly, we don't want to focus on using a single agent's score as metric, since this would favour one strong and one weak agent.
In an ideal world, both agents would be so good that there is always a draw.
However, given the connect four game is decided also means that a perfectly trained player 1 agent would result in a permanent win for agent 1.

In [9]:
####################################################
# AGENT TRAINING
####################################################

def train_agent(filename: str = "dqn_vs_dqn",
                agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                optim: typing.Optional[torch.optim.Optimizer] = None,
                training_env_num: int = 10,
                testing_env_num: int = 10,
                buffer_size: int = 2^14, # 16 384
                batch_size: int = 64,
                epochs: int = 50,
                step_per_epoch: int = 1000,
                step_per_collect: int = 10,
                update_per_step: float = 0.1,
                testing_eps: float = 0.05,
                training_eps: float = 0.1,
                ) -> typing.Tuple[dict, ts.policy.BasePolicy]:
    """
    Trains two agents in the connect four V2 environment and saves their best model and logs.
    Returns:
        - result from offpolicy_trainer
        - final version of agent 1
        - final version of agent 2
    """

    # ======== notebook specific =========
    notebook_version = '3' # Used for foldering logs and models

    # ======== environment setup =========
    train_envs = ts.env.DummyVectorEnv([get_env for _ in range(training_env_num)])
    test_envs = ts.env.DummyVectorEnv([get_env for _ in range(testing_env_num)])
    
    # set the seed for reproducibility
    np.random.seed(1998)
    torch.manual_seed(1998)
    train_envs.seed(1998)
    test_envs.seed(1998)

    # ======== agent setup =========
    # Gets our agents from the previously made function
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    policy, optim, agents = get_agents(agent_player1=agent_player1,
                                       agent_player2=agent_player2,
                                       optim=optim)

    # ======== collector setup =========
    # Make a collector for the training environments
    train_collector = ts.data.Collector(policy= policy,
                                        env= train_envs,
                                        buffer= ts.data.VectorReplayBuffer(buffer_size, len(train_envs)),
                                        exploration_noise= True)
    
    # Make a collector for the testing environments
    test_collector = ts.data.Collector(policy= policy,
                                       env= test_envs,
                                       exploration_noise= True)
    
    # Uncomment below if you want to set epsilon in epsilon policy
    # policy.set_eps(1)
    
    # Collect data fot the training evnironments
    train_collector.collect(n_step= batch_size * training_env_num)
    
    # ======== ensure folders exist =========
    if not os.path.exists(os.path.join('./logs', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./logs', 'paper_notebooks', notebook_version, filename))
    if not os.path.exists(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename))

    # ======== tensorboard logging setup =========
    # Allows to save the training progress to tensorboard compatable logs
    log_path = os.path.join('./logs', 'paper_notebooks', notebook_version, filename)
    writer = torch.utils.tensorboard.SummaryWriter(log_path)
    logger = ts.utils.TensorboardLogger(writer)

    # ======== callback functions used during training =========
    # We want to save our best policy
    def save_best_fn(policy):
        """
        Callback to save the best model
        """
        # Save agent 1
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent1.pth')
        torch.save(policy.policies[agents[0]].state_dict(), model_save_path)
        
        # Save agent 2
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent2.pth')
        torch.save(policy.policies[agents[1]].state_dict(), model_save_path)
        
        # Save agent2

    def stop_fn(mean_rewards):
        """
        Callback to stop training when we've reached the win rate
        """
        return mean_rewards >= 7 # (win = 10, 70% win without invalid moves = mean of 7)

    def train_fn(epoch, env_step):
        """
        Callback before training
        """        
        # Before training we want to configure the epsilon for the agents
        # In general more exploratory than the test case
        policy.policies[agents[0]].set_eps(training_eps)
        policy.policies[agents[1]].set_eps(training_eps)

    def test_fn(epoch, env_step):
        """
        Callback beore testing
        """        
        # Before testing we want to configure the epsilon for the agents
        # In general more greedy than the train case but not
        #   to avoid getting stuck on invalid moves
        policy.policies[agents[0]].set_eps(testing_eps)
        policy.policies[agents[1]].set_eps(testing_eps)

    def reward_metric(rews):
        """
        Callback for reward collection
        """
        # We are interested in having a high total total reward,
        #   as this would mean equally good agents.
        return rews[:, 0] + rews[:, 1]

    # trainer
    result = ts.trainer.offpolicy_trainer(policy= policy,
                                          train_collector= train_collector,
                                          test_collector= test_collector,
                                          max_epoch= epochs,
                                          step_per_epoch= step_per_epoch,
                                          step_per_collect= step_per_collect,
                                          episode_per_test= testing_env_num,
                                          batch_size= batch_size,
                                          train_fn= train_fn,
                                          test_fn= test_fn,
                                          # Stop function to stop before specified amount of epochs
                                          #stop_fn= stop_fn
                                          save_best_fn= save_best_fn,
                                          update_per_step= update_per_step,
                                          logger= logger,
                                          test_in_train= False,
                                          reward_metric= reward_metric)

    return result, policy.policies[agents[0]], policy.policies[agents[1]]

<hr>

### Function for watching learned agent

When an agent has learned, we can watch the learned policy in action.

In [10]:
####################################################
# WATCHING THE LEARNED POLICY IN ACTION
####################################################

def watch(numer_of_games:int = 3,
          agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
          agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
          test_epsilon: float = 0.05, # For the watching we act completely greedy but low random for not getting stuck on invalid move
          render_speed: float = 0.15, # Amount of seconds to update frame/ do a step
          ) -> None:
    
    # Get the connect four V2 environment (must be a list)
    env= ts.env.DummyVectorEnv([get_env])
    
    # Get the agents from the trained agents
    policy, optim, agents = get_agents(agent_player1= agent_player1,
                                       agent_player2= agent_player2)
    
    # Evaluate the policy
    policy.eval()
    
    # Set the testing policy epsilon for our agents
    policy.policies[agents[0]].set_eps(test_epsilon)
    policy.policies[agents[1]].set_eps(test_epsilon)
    
    # Collect the test data
    collector = ts.data.Collector(policy= policy,
                                  env= env,
                                  exploration_noise= True)
    
    # Render games in human mode to see how it plays
    result = collector.collect(n_episode= numer_of_games, render= render_speed)
    
    # Close the environment aftering collecting the results
    # This closes the pygame window after completion
    env.close()
    
    # Get the rewards and length from the test trials
    rewards, length = result["rews"], result["lens"]
    
    # Print the final reward for the first agent
    print(f"Average steps of game:  {length.mean()}")
    print(f"Final mean reward agent 1: {rewards[:, 0].mean()}, std: {rewards[:, 0].std()}")
    print(f"Final mean reward agent 2: {rewards[:, 1].mean()}, std: {rewards[:, 1].std()}")

<hr>

### Doing the experiment

We now do the experiment with using our previously created functions.

In [21]:
####################################################
# EXPERIMENT: TRAINING AGENTS
####################################################

# Train the agent
off_policy_traininer_results, final_agent_player1, final_agent_player2 = train_agent(epochs= 500)

Epoch #1: 1001it [00:02, 470.55it/s, env_step=1000, len=8, n/ep=0, n/st=10, player_1/loss=22.262, player_2/loss=24.486, rew=0.00]                          


Epoch #1: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #2: 1001it [00:02, 481.17it/s, env_step=2000, len=32, n/ep=0, n/st=10, player_1/loss=126.596, player_2/loss=23.137, rew=-136.00]                          


Epoch #2: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #3: 1001it [00:02, 469.66it/s, env_step=3000, len=7, n/ep=0, n/st=10, player_1/loss=56.899, player_2/loss=24.074, rew=0.00]                             


Epoch #3: test_reward: -33194.000000 ± 52344.971827, best_reward: 0.000000 ± 0.000000 in #0


Epoch #4: 1001it [00:02, 457.85it/s, env_step=4000, len=25, n/ep=3, n/st=10, player_1/loss=48.430, player_2/loss=25.406, rew=-97.00]                          


Epoch #4: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #5: 1001it [00:02, 453.09it/s, env_step=5000, len=9, n/ep=1, n/st=10, player_1/loss=20.314, player_2/loss=11.178, rew=0.00]                            


Epoch #5: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #6: 1001it [00:02, 453.50it/s, env_step=6000, len=7, n/ep=1, n/st=10, player_1/loss=15.429, player_2/loss=13.926, rew=0.00]                             


Epoch #6: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #7: 1001it [00:02, 455.77it/s, env_step=7000, len=7, n/ep=2, n/st=10, player_1/loss=16.056, player_2/loss=26.163, rew=0.00]                             


Epoch #7: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #8: 1001it [00:02, 456.39it/s, env_step=8000, len=8, n/ep=0, n/st=10, player_1/loss=8.224, player_2/loss=15.882, rew=0.00]                           


Epoch #8: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #9: 1001it [00:02, 452.88it/s, env_step=9000, len=7, n/ep=2, n/st=10, player_1/loss=6.921, player_2/loss=15.742, rew=0.00]                          


Epoch #9: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #10: 1001it [00:02, 439.37it/s, env_step=10000, len=7, n/ep=1, n/st=10, player_1/loss=7.635, player_2/loss=15.326, rew=0.00]                          


Epoch #10: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #11: 1001it [00:02, 464.65it/s, env_step=11000, len=8, n/ep=2, n/st=10, player_1/loss=7.923, player_2/loss=12.452, rew=0.00]                          


Epoch #11: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #12: 1001it [00:02, 458.27it/s, env_step=12000, len=8, n/ep=0, n/st=10, player_1/loss=47.349, player_2/loss=26.207, rew=0.00]                             


Epoch #12: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #13: 1001it [00:02, 456.18it/s, env_step=13000, len=7, n/ep=2, n/st=10, player_1/loss=10.868, player_2/loss=76.904, rew=0.00]                             


Epoch #13: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #14: 1001it [00:02, 450.03it/s, env_step=14000, len=7, n/ep=2, n/st=10, player_1/loss=19.280, player_2/loss=115.152, rew=0.00]                             


Epoch #14: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #15: 1001it [00:02, 456.60it/s, env_step=15000, len=8, n/ep=0, n/st=10, player_1/loss=8.475, player_2/loss=22.045, rew=0.00]                           


Epoch #15: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #16: 1001it [00:02, 461.87it/s, env_step=16000, len=7, n/ep=1, n/st=10, player_1/loss=11.963, player_2/loss=21.164, rew=0.00]                          


Epoch #16: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #17: 1001it [00:02, 460.17it/s, env_step=17000, len=7, n/ep=1, n/st=10, player_1/loss=7.921, player_2/loss=16.131, rew=0.00]                           


Epoch #17: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #18: 1001it [00:02, 449.63it/s, env_step=18000, len=8, n/ep=3, n/st=10, player_1/loss=7.082, player_2/loss=14.208, rew=0.00]                          


Epoch #18: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #19: 1001it [00:02, 455.35it/s, env_step=19000, len=7, n/ep=0, n/st=10, player_1/loss=7.600, player_2/loss=13.221, rew=0.00]                          


Epoch #19: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #20: 1001it [00:02, 462.08it/s, env_step=20000, len=7, n/ep=1, n/st=10, player_1/loss=7.286, player_2/loss=15.405, rew=0.00]                            


Epoch #20: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #21: 1001it [00:02, 461.87it/s, env_step=21000, len=7, n/ep=0, n/st=10, player_1/loss=9.134, player_2/loss=11.762, rew=0.00]                          


Epoch #21: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #22: 1001it [00:02, 459.32it/s, env_step=22000, len=7, n/ep=1, n/st=10, player_1/loss=7.222, player_2/loss=9.857, rew=0.00]                           


Epoch #22: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #23: 1001it [00:02, 459.96it/s, env_step=23000, len=9, n/ep=1, n/st=10, player_1/loss=7.088, player_2/loss=9.970, rew=0.00]                           


Epoch #23: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #24: 1001it [00:02, 450.84it/s, env_step=24000, len=8, n/ep=1, n/st=10, player_1/loss=5.530, player_2/loss=11.347, rew=0.00]                          


Epoch #24: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #25: 1001it [00:02, 462.72it/s, env_step=25000, len=9, n/ep=1, n/st=10, player_1/loss=7.636, player_2/loss=19.092, rew=0.00]                            


Epoch #25: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #26: 1001it [00:02, 459.53it/s, env_step=26000, len=8, n/ep=3, n/st=10, player_1/loss=5.903, player_2/loss=12.655, rew=0.00]                          


Epoch #26: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #27: 1001it [00:02, 456.39it/s, env_step=27000, len=7, n/ep=0, n/st=10, player_1/loss=7.528, player_2/loss=22.077, rew=0.00]                          


Epoch #27: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #28: 1001it [00:02, 465.08it/s, env_step=28000, len=10, n/ep=2, n/st=10, player_1/loss=20.616, player_2/loss=19.179, rew=0.00]                          


Epoch #28: test_reward: -12.600000 ± 37.800000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #29: 1001it [00:02, 469.88it/s, env_step=29000, len=48, n/ep=0, n/st=10, player_1/loss=69.683, player_2/loss=59.766, rew=-468.00]                          


Epoch #29: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #30: 1001it [00:02, 461.65it/s, env_step=30000, len=9, n/ep=2, n/st=10, player_1/loss=18.380, player_2/loss=112.053, rew=0.00]                             


Epoch #30: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #31: 1001it [00:02, 460.38it/s, env_step=31000, len=10, n/ep=0, n/st=10, player_1/loss=12.340, player_2/loss=54.290, rew=0.00]                          


Epoch #31: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #32: 1001it [00:02, 455.35it/s, env_step=32000, len=8, n/ep=4, n/st=10, player_1/loss=12.226, player_2/loss=32.722, rew=0.00]                          


Epoch #32: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #33: 1001it [00:02, 470.33it/s, env_step=33000, len=7, n/ep=2, n/st=10, player_1/loss=95.995, player_2/loss=129.231, rew=0.00]                              


Epoch #33: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #34: 1001it [00:02, 458.90it/s, env_step=34000, len=9, n/ep=1, n/st=10, player_1/loss=109.152, player_2/loss=36.945, rew=0.00]                           


Epoch #34: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #35: 1001it [00:02, 452.88it/s, env_step=35000, len=8, n/ep=0, n/st=10, player_1/loss=78.314, player_2/loss=129.219, rew=0.00]                             


Epoch #35: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #36: 1001it [00:02, 448.82it/s, env_step=36000, len=7, n/ep=1, n/st=10, player_1/loss=69.859, player_2/loss=142.929, rew=0.00]                          


Epoch #36: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #37: 1001it [00:02, 457.23it/s, env_step=37000, len=9, n/ep=0, n/st=10, player_1/loss=125.630, player_2/loss=52.612, rew=0.00]                              


Epoch #37: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #38: 1001it [00:02, 450.44it/s, env_step=38000, len=9, n/ep=0, n/st=10, player_1/loss=99.821, player_2/loss=159.539, rew=0.00]                             


Epoch #38: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #39: 1001it [00:02, 458.69it/s, env_step=39000, len=7, n/ep=2, n/st=10, player_1/loss=76.985, player_2/loss=114.848, rew=0.00]                              


Epoch #39: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #40: 1001it [00:02, 466.60it/s, env_step=40000, len=7, n/ep=2, n/st=10, player_1/loss=40.331, player_2/loss=37.417, rew=0.00]                          


Epoch #40: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #41: 1001it [00:02, 460.38it/s, env_step=41000, len=9, n/ep=2, n/st=10, player_1/loss=144.085, player_2/loss=39.315, rew=0.00]                              


Epoch #41: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #42: 1001it [00:02, 452.47it/s, env_step=42000, len=8, n/ep=0, n/st=10, player_1/loss=98.317, player_2/loss=29.118, rew=0.00]                             


Epoch #42: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #43: 1001it [00:02, 459.96it/s, env_step=43000, len=7, n/ep=1, n/st=10, player_1/loss=53.416, player_2/loss=59.994, rew=0.00]                               


Epoch #43: test_reward: -1143.000000 ± 3429.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #44: 1001it [00:02, 467.69it/s, env_step=44000, len=7, n/ep=1, n/st=10, player_1/loss=27.354, player_2/loss=51.321, rew=0.00]                             


Epoch #44: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #45: 1001it [00:02, 468.13it/s, env_step=45000, len=7, n/ep=1, n/st=10, player_1/loss=23.256, player_2/loss=72.869, rew=0.00]                              


Epoch #45: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #46: 1001it [00:02, 462.29it/s, env_step=46000, len=7, n/ep=1, n/st=10, player_1/loss=7.812, player_2/loss=21.665, rew=0.00]                           


Epoch #46: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #47: 1001it [00:02, 469.00it/s, env_step=47000, len=8, n/ep=0, n/st=10, player_1/loss=15.844, player_2/loss=22.196, rew=0.00]                           


Epoch #47: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #48: 1001it [00:02, 459.11it/s, env_step=48000, len=8, n/ep=1, n/st=10, player_1/loss=47.805, player_2/loss=51.436, rew=0.00]                             


Epoch #48: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #49: 1001it [00:02, 466.17it/s, env_step=49000, len=8, n/ep=2, n/st=10, player_1/loss=112.601, player_2/loss=19.931, rew=0.00]                            


Epoch #49: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #50: 1001it [00:02, 452.07it/s, env_step=50000, len=7, n/ep=2, n/st=10, player_1/loss=51.256, player_2/loss=16.615, rew=0.00]                          


Epoch #50: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #51: 1001it [00:02, 462.93it/s, env_step=51000, len=7, n/ep=1, n/st=10, player_1/loss=49.375, player_2/loss=29.967, rew=0.00]                           


Epoch #51: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #52: 1001it [00:02, 462.93it/s, env_step=52000, len=9, n/ep=0, n/st=10, player_1/loss=28.688, player_2/loss=16.754, rew=0.00]                          


Epoch #52: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #53: 1001it [00:02, 458.48it/s, env_step=53000, len=7, n/ep=2, n/st=10, player_1/loss=55.524, player_2/loss=14.850, rew=0.00]                          


Epoch #53: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #54: 1001it [00:02, 456.18it/s, env_step=54000, len=8, n/ep=0, n/st=10, player_1/loss=14.767, player_2/loss=14.295, rew=0.00]                          


Epoch #54: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #55: 1001it [00:02, 457.02it/s, env_step=55000, len=7, n/ep=2, n/st=10, player_1/loss=20.828, player_2/loss=18.401, rew=0.00]                          


Epoch #55: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #56: 1001it [00:02, 461.44it/s, env_step=56000, len=7, n/ep=1, n/st=10, player_1/loss=12.996, player_2/loss=14.361, rew=0.00]                          


Epoch #56: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #57: 1001it [00:02, 455.98it/s, env_step=57000, len=7, n/ep=0, n/st=10, player_1/loss=18.708, player_2/loss=20.445, rew=0.00]                           


Epoch #57: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #58: 1001it [00:02, 465.73it/s, env_step=58000, len=8, n/ep=2, n/st=10, player_1/loss=13.941, player_2/loss=17.836, rew=0.00]                          


Epoch #58: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #59: 1001it [00:02, 456.60it/s, env_step=59000, len=7, n/ep=3, n/st=10, player_1/loss=21.564, player_2/loss=13.593, rew=0.00]                           


Epoch #59: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #60: 1001it [00:02, 455.15it/s, env_step=60000, len=7, n/ep=2, n/st=10, player_1/loss=9.910, player_2/loss=11.804, rew=0.00]                           


Epoch #60: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #61: 1001it [00:02, 454.53it/s, env_step=61000, len=8, n/ep=0, n/st=10, player_1/loss=6.895, player_2/loss=9.187, rew=0.00]                           


Epoch #61: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #62: 1001it [00:02, 441.89it/s, env_step=62000, len=9, n/ep=0, n/st=10, player_1/loss=11.980, player_2/loss=10.641, rew=0.00]                          


Epoch #62: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #63: 1001it [00:02, 419.53it/s, env_step=63000, len=8, n/ep=2, n/st=10, player_1/loss=8.332, player_2/loss=14.803, rew=0.00]                          


Epoch #63: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #64: 1001it [00:02, 418.96it/s, env_step=64000, len=8, n/ep=2, n/st=10, player_1/loss=9.062, player_2/loss=11.521, rew=0.00]                           


Epoch #64: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #65: 1001it [00:02, 444.24it/s, env_step=65000, len=8, n/ep=2, n/st=10, player_1/loss=17.987, player_2/loss=34.400, rew=0.00]                          


Epoch #65: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #66: 1001it [00:02, 446.82it/s, env_step=66000, len=8, n/ep=0, n/st=10, player_1/loss=29.986, player_2/loss=228.115, rew=0.00]                              


Epoch #66: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #67: 1001it [00:02, 463.15it/s, env_step=67000, len=8, n/ep=0, n/st=10, player_1/loss=25.649, player_2/loss=29.831, rew=0.00]                            


Epoch #67: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #68: 1001it [00:02, 465.73it/s, env_step=68000, len=8, n/ep=1, n/st=10, player_1/loss=20.361, player_2/loss=150.863, rew=0.00]                              


Epoch #68: test_reward: -13.500000 ± 40.500000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #69: 1001it [00:02, 483.73it/s, env_step=69000, len=8, n/ep=1, n/st=10, player_1/loss=54.286, player_2/loss=227.229, rew=0.00]                              


Epoch #69: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #70: 1001it [00:02, 469.66it/s, env_step=70000, len=10, n/ep=1, n/st=10, player_1/loss=18.124, player_2/loss=22.191, rew=0.00]                          


Epoch #70: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #71: 1001it [00:02, 477.50it/s, env_step=71000, len=8, n/ep=0, n/st=10, player_1/loss=30.641, player_2/loss=34.377, rew=0.00]                            


Epoch #71: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #72: 1001it [00:02, 477.05it/s, env_step=72000, len=8, n/ep=1, n/st=10, player_1/loss=21.739, player_2/loss=63.610, rew=0.00]                             


Epoch #72: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #73: 1001it [00:02, 454.32it/s, env_step=73000, len=11, n/ep=0, n/st=10, player_1/loss=27.132, player_2/loss=135.385, rew=0.00]                             


Epoch #73: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #74: 1001it [00:02, 453.29it/s, env_step=74000, len=7, n/ep=4, n/st=10, player_1/loss=33.460, player_2/loss=110.133, rew=0.00]                               


Epoch #74: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #75: 1001it [00:02, 467.25it/s, env_step=75000, len=9, n/ep=0, n/st=10, player_1/loss=7.648, player_2/loss=16.685, rew=0.00]                            


Epoch #75: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #76: 1001it [00:02, 485.37it/s, env_step=76000, len=7, n/ep=0, n/st=10, player_1/loss=9.001, player_2/loss=24.199, rew=0.00]                           


Epoch #76: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #77: 1001it [00:02, 479.56it/s, env_step=77000, len=8, n/ep=0, n/st=10, player_1/loss=14.267, player_2/loss=26.862, rew=0.00]                             


Epoch #77: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #78: 1001it [00:02, 485.04it/s, env_step=78000, len=10, n/ep=1, n/st=10, player_1/loss=12.699, player_2/loss=132.042, rew=0.00]                            


Epoch #78: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #79: 1001it [00:02, 472.10it/s, env_step=79000, len=7, n/ep=3, n/st=10, player_1/loss=7.763, player_2/loss=20.794, rew=0.00]                            


Epoch #79: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #80: 1001it [00:02, 480.02it/s, env_step=80000, len=7, n/ep=1, n/st=10, player_1/loss=4.826, player_2/loss=20.018, rew=0.00]                          


Epoch #80: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #81: 1001it [00:02, 455.98it/s, env_step=81000, len=7, n/ep=1, n/st=10, player_1/loss=11.473, player_2/loss=21.510, rew=0.00]                          


Epoch #81: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #82: 1001it [00:02, 469.00it/s, env_step=82000, len=7, n/ep=1, n/st=10, player_1/loss=6.248, player_2/loss=24.222, rew=0.00]                          


Epoch #82: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #83: 1001it [00:02, 481.64it/s, env_step=83000, len=8, n/ep=1, n/st=10, player_1/loss=8.404, player_2/loss=20.522, rew=0.00]                          


Epoch #83: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #84: 1001it [00:02, 468.57it/s, env_step=84000, len=7, n/ep=0, n/st=10, player_1/loss=10.086, player_2/loss=24.937, rew=0.00]                          


Epoch #84: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #85: 1001it [00:02, 481.41it/s, env_step=85000, len=7, n/ep=2, n/st=10, player_1/loss=6.132, player_2/loss=21.571, rew=0.00]                          


Epoch #85: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #86: 1001it [00:02, 481.64it/s, env_step=86000, len=7, n/ep=3, n/st=10, player_1/loss=9.958, player_2/loss=21.448, rew=0.00]                          


Epoch #86: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #87: 1001it [00:02, 477.05it/s, env_step=87000, len=9, n/ep=1, n/st=10, player_1/loss=18.756, player_2/loss=48.439, rew=0.00]                          


Epoch #87: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #88: 1001it [00:02, 471.88it/s, env_step=88000, len=7, n/ep=2, n/st=10, player_1/loss=5.335, player_2/loss=25.302, rew=0.00]                          


Epoch #88: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #89: 1001it [00:02, 489.17it/s, env_step=89000, len=9, n/ep=0, n/st=10, player_1/loss=16.438, player_2/loss=23.938, rew=0.00]                          


Epoch #89: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #90: 1001it [00:02, 470.33it/s, env_step=90000, len=7, n/ep=1, n/st=10, player_1/loss=6.122, player_2/loss=22.621, rew=0.00]                           


Epoch #90: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #91: 1001it [00:02, 484.90it/s, env_step=91000, len=7, n/ep=2, n/st=10, player_1/loss=9.975, player_2/loss=19.675, rew=0.00]                          


Epoch #91: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #92: 1001it [00:02, 479.33it/s, env_step=92000, len=7, n/ep=1, n/st=10, player_1/loss=6.177, player_2/loss=20.793, rew=0.00]                          


Epoch #92: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #93: 1001it [00:02, 474.34it/s, env_step=93000, len=7, n/ep=1, n/st=10, player_1/loss=6.492, player_2/loss=19.934, rew=0.00]                          


Epoch #93: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #94: 1001it [00:02, 472.78it/s, env_step=94000, len=8, n/ep=2, n/st=10, player_1/loss=5.514, player_2/loss=21.549, rew=0.00]                          


Epoch #94: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #95: 1001it [00:02, 478.87it/s, env_step=95000, len=7, n/ep=1, n/st=10, player_1/loss=6.942, player_2/loss=20.103, rew=0.00]                          


Epoch #95: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #96: 1001it [00:02, 483.03it/s, env_step=96000, len=7, n/ep=0, n/st=10, player_1/loss=6.474, player_2/loss=20.882, rew=0.00]                          


Epoch #96: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #97: 1001it [00:02, 482.80it/s, env_step=97000, len=7, n/ep=3, n/st=10, player_1/loss=27.770, player_2/loss=23.008, rew=0.00]                             


Epoch #97: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #98: 1001it [00:02, 450.03it/s, env_step=98000, len=8, n/ep=2, n/st=10, player_1/loss=11.293, player_2/loss=19.513, rew=0.00]                          


Epoch #98: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #99: 1001it [00:02, 455.15it/s, env_step=99000, len=9, n/ep=1, n/st=10, player_1/loss=8.998, player_2/loss=20.212, rew=0.00]                           


Epoch #99: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #100: 1001it [00:02, 467.69it/s, env_step=100000, len=7, n/ep=1, n/st=10, player_1/loss=9.357, player_2/loss=23.168, rew=0.00]                          


Epoch #100: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #101: 1001it [00:02, 451.25it/s, env_step=101000, len=7, n/ep=1, n/st=10, player_1/loss=7.247, player_2/loss=23.406, rew=0.00]                          


Epoch #101: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #102: 1001it [00:02, 462.51it/s, env_step=102000, len=7, n/ep=3, n/st=10, player_1/loss=6.599, player_2/loss=18.520, rew=0.00]                          


Epoch #102: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #103: 1001it [00:02, 452.27it/s, env_step=103000, len=7, n/ep=3, n/st=10, player_1/loss=12.555, player_2/loss=15.114, rew=0.00]                          


Epoch #103: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #104: 1001it [00:02, 459.32it/s, env_step=104000, len=7, n/ep=0, n/st=10, player_1/loss=7.457, player_2/loss=16.482, rew=0.00]                           


Epoch #104: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #105: 1001it [00:02, 461.23it/s, env_step=105000, len=7, n/ep=0, n/st=10, player_1/loss=6.092, player_2/loss=17.135, rew=0.00]                          


Epoch #105: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #106: 1001it [00:02, 447.42it/s, env_step=106000, len=7, n/ep=1, n/st=10, player_1/loss=8.320, player_2/loss=19.829, rew=0.00]                          


Epoch #106: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #107: 1001it [00:02, 425.55it/s, env_step=107000, len=7, n/ep=2, n/st=10, player_1/loss=6.189, player_2/loss=20.052, rew=0.00]                          


Epoch #107: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #108: 1001it [00:02, 395.16it/s, env_step=108000, len=8, n/ep=1, n/st=10, player_1/loss=6.554, player_2/loss=17.961, rew=0.00]                          


Epoch #108: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #109: 1001it [00:02, 427.00it/s, env_step=109000, len=7, n/ep=2, n/st=10, player_1/loss=7.344, player_2/loss=17.130, rew=0.00]                          


Epoch #109: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #110: 1001it [00:02, 496.11it/s, env_step=110000, len=7, n/ep=1, n/st=10, player_1/loss=5.227, player_2/loss=17.865, rew=0.00]                          


Epoch #110: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #111: 1001it [00:02, 494.62it/s, env_step=111000, len=8, n/ep=2, n/st=10, player_1/loss=7.461, player_2/loss=16.954, rew=0.00]                          


Epoch #111: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #112: 1001it [00:02, 498.27it/s, env_step=112000, len=7, n/ep=1, n/st=10, player_1/loss=5.507, player_2/loss=18.899, rew=0.00]                          


Epoch #112: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #113: 1001it [00:02, 498.03it/s, env_step=113000, len=7, n/ep=1, n/st=10, player_1/loss=7.673, player_2/loss=19.920, rew=0.00]                          


Epoch #113: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #114: 1001it [00:01, 501.93it/s, env_step=114000, len=7, n/ep=1, n/st=10, player_1/loss=7.231, player_2/loss=19.896, rew=0.00]                          


Epoch #114: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #115: 1001it [00:02, 493.71it/s, env_step=115000, len=8, n/ep=2, n/st=10, player_1/loss=8.251, player_2/loss=19.088, rew=0.00]                          


Epoch #115: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #116: 1001it [00:01, 503.63it/s, env_step=116000, len=7, n/ep=1, n/st=10, player_1/loss=4.811, player_2/loss=15.668, rew=0.00]                          


Epoch #116: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #117: 1001it [00:01, 506.59it/s, env_step=117000, len=8, n/ep=3, n/st=10, player_1/loss=8.723, player_2/loss=14.386, rew=0.00]                          


Epoch #117: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #118: 1001it [00:02, 492.71it/s, env_step=118000, len=7, n/ep=0, n/st=10, player_1/loss=8.682, player_2/loss=15.533, rew=0.00]                            


Epoch #118: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #119: 1001it [00:02, 491.57it/s, env_step=119000, len=7, n/ep=1, n/st=10, player_1/loss=38.618, player_2/loss=85.405, rew=0.00]                              


Epoch #119: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #120: 1001it [00:02, 447.22it/s, env_step=120000, len=7, n/ep=1, n/st=10, player_1/loss=7.739, player_2/loss=18.132, rew=0.00]                          


Epoch #120: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #121: 1001it [00:02, 445.03it/s, env_step=121000, len=7, n/ep=2, n/st=10, player_1/loss=7.066, player_2/loss=15.473, rew=0.00]                          


Epoch #121: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #122: 1001it [00:02, 472.97it/s, env_step=122000, len=8, n/ep=1, n/st=10, player_1/loss=21.926, player_2/loss=57.030, rew=0.00]                              


Epoch #122: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #123: 1001it [00:02, 437.30it/s, env_step=123000, len=7, n/ep=2, n/st=10, player_1/loss=64.902, player_2/loss=70.689, rew=0.00]                              


Epoch #123: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #124: 1001it [00:02, 450.12it/s, env_step=124000, len=7, n/ep=3, n/st=10, player_1/loss=30.975, player_2/loss=96.868, rew=0.00]                              


Epoch #124: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #125: 1001it [00:02, 454.01it/s, env_step=125000, len=7, n/ep=2, n/st=10, player_1/loss=10.062, player_2/loss=14.592, rew=0.00]                          


Epoch #125: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #126: 1001it [00:02, 486.20it/s, env_step=126000, len=10, n/ep=2, n/st=10, player_1/loss=15.559, player_2/loss=23.339, rew=0.00]                          


Epoch #126: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #127: 1001it [00:02, 494.70it/s, env_step=127000, len=8, n/ep=0, n/st=10, player_1/loss=10.195, player_2/loss=20.302, rew=0.00]                          


Epoch #127: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #128: 1001it [00:02, 487.50it/s, env_step=128000, len=7, n/ep=3, n/st=10, player_1/loss=10.791, player_2/loss=18.413, rew=0.00]                          


Epoch #128: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #129: 1001it [00:02, 492.29it/s, env_step=129000, len=7, n/ep=1, n/st=10, player_1/loss=6.570, player_2/loss=18.680, rew=0.00]                          


Epoch #129: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #130: 1001it [00:02, 485.48it/s, env_step=130000, len=9, n/ep=1, n/st=10, player_1/loss=7.049, player_2/loss=20.018, rew=0.00]                          


Epoch #130: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #131: 1001it [00:02, 489.73it/s, env_step=131000, len=8, n/ep=2, n/st=10, player_1/loss=6.794, player_2/loss=18.898, rew=0.00]                          


Epoch #131: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #132: 1001it [00:02, 482.91it/s, env_step=132000, len=7, n/ep=1, n/st=10, player_1/loss=8.294, player_2/loss=21.325, rew=0.00]                          


Epoch #132: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #133: 1001it [00:02, 485.01it/s, env_step=133000, len=7, n/ep=2, n/st=10, player_1/loss=6.933, player_2/loss=20.513, rew=0.00]                          


Epoch #133: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #134: 1001it [00:02, 498.38it/s, env_step=134000, len=7, n/ep=4, n/st=10, player_1/loss=9.183, player_2/loss=17.689, rew=0.00]                          


Epoch #134: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #135: 1001it [00:02, 435.81it/s, env_step=135000, len=8, n/ep=3, n/st=10, player_1/loss=5.407, player_2/loss=19.987, rew=0.00]                          


Epoch #135: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #136: 1001it [00:02, 454.83it/s, env_step=136000, len=10, n/ep=0, n/st=10, player_1/loss=6.287, player_2/loss=20.559, rew=0.00]                          


Epoch #136: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #137: 1001it [00:02, 458.78it/s, env_step=137000, len=7, n/ep=0, n/st=10, player_1/loss=7.073, player_2/loss=16.311, rew=0.00]                          


Epoch #137: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #138: 1001it [00:02, 450.94it/s, env_step=138000, len=7, n/ep=0, n/st=10, player_1/loss=6.854, player_2/loss=19.072, rew=0.00]                          


Epoch #138: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #139: 1001it [00:02, 458.74it/s, env_step=139000, len=9, n/ep=2, n/st=10, player_1/loss=7.558, player_2/loss=19.562, rew=0.00]                          


Epoch #139: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #140: 1001it [00:02, 442.96it/s, env_step=140000, len=8, n/ep=0, n/st=10, player_1/loss=9.732, player_2/loss=15.937, rew=0.00]                           


Epoch #140: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #141: 1001it [00:02, 467.35it/s, env_step=141000, len=8, n/ep=0, n/st=10, player_1/loss=8.586, player_2/loss=16.959, rew=0.00]                          


Epoch #141: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #142: 1001it [00:02, 449.72it/s, env_step=142000, len=7, n/ep=1, n/st=10, player_1/loss=7.274, player_2/loss=19.397, rew=0.00]                          


Epoch #142: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #143: 1001it [00:02, 464.75it/s, env_step=143000, len=7, n/ep=1, n/st=10, player_1/loss=6.039, player_2/loss=17.278, rew=0.00]                          


Epoch #143: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #144: 1001it [00:02, 461.70it/s, env_step=144000, len=9, n/ep=2, n/st=10, player_1/loss=6.665, player_2/loss=15.886, rew=0.00]                          


Epoch #144: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #145: 1001it [00:02, 461.01it/s, env_step=145000, len=7, n/ep=2, n/st=10, player_1/loss=7.131, player_2/loss=18.062, rew=0.00]                          


Epoch #145: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #146: 1001it [00:02, 454.22it/s, env_step=146000, len=8, n/ep=1, n/st=10, player_1/loss=6.167, player_2/loss=17.725, rew=0.00]                          


Epoch #146: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #147: 1001it [00:02, 456.90it/s, env_step=147000, len=8, n/ep=1, n/st=10, player_1/loss=6.173, player_2/loss=17.514, rew=0.00]                           


Epoch #147: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #148: 1001it [00:02, 454.01it/s, env_step=148000, len=9, n/ep=0, n/st=10, player_1/loss=6.793, player_2/loss=20.837, rew=0.00]                          


Epoch #148: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #149: 1001it [00:02, 448.27it/s, env_step=149000, len=7, n/ep=0, n/st=10, player_1/loss=5.535, player_2/loss=18.017, rew=0.00]                          


Epoch #149: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #150: 1001it [00:02, 459.01it/s, env_step=150000, len=8, n/ep=0, n/st=10, player_1/loss=5.779, player_2/loss=19.200, rew=0.00]                          


Epoch #150: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #151: 1001it [00:02, 456.49it/s, env_step=151000, len=7, n/ep=0, n/st=10, player_1/loss=25.876, player_2/loss=23.004, rew=0.00]                            


Epoch #151: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #152: 1001it [00:02, 447.31it/s, env_step=152000, len=7, n/ep=5, n/st=10, player_1/loss=11.047, player_2/loss=37.367, rew=0.00]                           


Epoch #152: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #153: 1001it [00:02, 453.15it/s, env_step=153000, len=8, n/ep=0, n/st=10, player_1/loss=8.818, player_2/loss=16.865, rew=0.00]                            


Epoch #153: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #154: 1001it [00:02, 460.43it/s, env_step=154000, len=7, n/ep=3, n/st=10, player_1/loss=3.621, player_2/loss=16.626, rew=0.00]                           


Epoch #154: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #155: 1001it [00:02, 454.63it/s, env_step=155000, len=7, n/ep=2, n/st=10, player_1/loss=6.101, player_2/loss=14.477, rew=0.00]                          


Epoch #155: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #156: 1001it [00:02, 461.12it/s, env_step=156000, len=8, n/ep=3, n/st=10, player_1/loss=7.453, player_2/loss=16.558, rew=0.00]                          


Epoch #156: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #157: 1001it [00:02, 456.49it/s, env_step=157000, len=8, n/ep=1, n/st=10, player_1/loss=7.135, player_2/loss=16.932, rew=0.00]                          


Epoch #157: test_reward: -58.500000 ± 175.500000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #158: 1001it [00:02, 456.50it/s, env_step=158000, len=7, n/ep=2, n/st=10, player_1/loss=35.332, player_2/loss=47.830, rew=0.00]                             


Epoch #158: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #159: 1001it [00:02, 464.54it/s, env_step=159000, len=9, n/ep=1, n/st=10, player_1/loss=18.988, player_2/loss=52.981, rew=0.00]                             


Epoch #159: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #160: 1001it [00:02, 467.14it/s, env_step=160000, len=7, n/ep=1, n/st=10, player_1/loss=10.135, player_2/loss=35.925, rew=0.00]                             


Epoch #160: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #161: 1001it [00:02, 446.98it/s, env_step=161000, len=7, n/ep=2, n/st=10, player_1/loss=14.818, player_2/loss=30.892, rew=0.00]                             


Epoch #161: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #162: 1001it [00:02, 459.63it/s, env_step=162000, len=9, n/ep=1, n/st=10, player_1/loss=8.119, player_2/loss=20.805, rew=0.00]                           


Epoch #162: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #163: 1001it [00:02, 463.68it/s, env_step=163000, len=7, n/ep=3, n/st=10, player_1/loss=8.707, player_2/loss=83.224, rew=0.00]                                


Epoch #163: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #164: 1001it [00:02, 455.82it/s, env_step=164000, len=7, n/ep=1, n/st=10, player_1/loss=9.229, player_2/loss=16.151, rew=0.00]                          


Epoch #164: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #165: 1001it [00:02, 460.89it/s, env_step=165000, len=8, n/ep=0, n/st=10, player_1/loss=7.290, player_2/loss=14.901, rew=0.00]                          


Epoch #165: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #166: 1001it [00:02, 456.91it/s, env_step=166000, len=7, n/ep=2, n/st=10, player_1/loss=11.061, player_2/loss=13.806, rew=0.00]                          


Epoch #166: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #167: 1001it [00:02, 468.89it/s, env_step=167000, len=9, n/ep=0, n/st=10, player_1/loss=9.654, player_2/loss=18.757, rew=0.00]                           


Epoch #167: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #168: 1001it [00:02, 445.33it/s, env_step=168000, len=7, n/ep=3, n/st=10, player_1/loss=6.925, player_2/loss=13.640, rew=0.00]                          


Epoch #168: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #169: 1001it [00:02, 460.43it/s, env_step=169000, len=7, n/ep=1, n/st=10, player_1/loss=6.510, player_2/loss=17.182, rew=0.00]                          


Epoch #169: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #170: 1001it [00:02, 460.48it/s, env_step=170000, len=8, n/ep=2, n/st=10, player_1/loss=12.240, player_2/loss=27.426, rew=0.00]                          


Epoch #170: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #171: 1001it [00:02, 461.33it/s, env_step=171000, len=7, n/ep=3, n/st=10, player_1/loss=55.411, player_2/loss=64.756, rew=0.00]                               


Epoch #171: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #172: 1001it [00:02, 452.78it/s, env_step=172000, len=8, n/ep=1, n/st=10, player_1/loss=154.127, player_2/loss=29.642, rew=0.00]                              


Epoch #172: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #173: 1001it [00:02, 459.84it/s, env_step=173000, len=7, n/ep=2, n/st=10, player_1/loss=10.089, player_2/loss=18.714, rew=0.00]                           


Epoch #173: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #174: 1001it [00:02, 425.42it/s, env_step=174000, len=7, n/ep=2, n/st=10, player_1/loss=8.896, player_2/loss=14.894, rew=0.00]                          


Epoch #174: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #175: 1001it [00:02, 429.10it/s, env_step=175000, len=7, n/ep=2, n/st=10, player_1/loss=6.477, player_2/loss=18.973, rew=0.00]                          


Epoch #175: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #176: 1001it [00:02, 421.13it/s, env_step=176000, len=7, n/ep=1, n/st=10, player_1/loss=8.940, player_2/loss=23.697, rew=0.00]                          


Epoch #176: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #177: 1001it [00:02, 444.09it/s, env_step=177000, len=7, n/ep=4, n/st=10, player_1/loss=12.914, player_2/loss=20.853, rew=0.00]                          


Epoch #177: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #178: 1001it [00:02, 428.46it/s, env_step=178000, len=9, n/ep=1, n/st=10, player_1/loss=71.824, player_2/loss=18.583, rew=0.00]                              


Epoch #178: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #179: 1001it [00:02, 404.76it/s, env_step=179000, len=7, n/ep=2, n/st=10, player_1/loss=5.945, player_2/loss=18.134, rew=0.00]                          


Epoch #179: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #180: 1001it [00:02, 450.88it/s, env_step=180000, len=8, n/ep=0, n/st=10, player_1/loss=9.067, player_2/loss=20.991, rew=0.00]                          


Epoch #180: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #181: 1001it [00:02, 447.75it/s, env_step=181000, len=8, n/ep=3, n/st=10, player_1/loss=6.316, player_2/loss=20.113, rew=0.00]                          


Epoch #181: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #182: 1001it [00:02, 463.20it/s, env_step=182000, len=7, n/ep=3, n/st=10, player_1/loss=11.306, player_2/loss=21.662, rew=0.00]                           


Epoch #182: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #183: 1001it [00:02, 456.69it/s, env_step=183000, len=7, n/ep=0, n/st=10, player_1/loss=9.380, player_2/loss=20.945, rew=0.00]                           


Epoch #183: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #184: 1001it [00:02, 444.93it/s, env_step=184000, len=8, n/ep=0, n/st=10, player_1/loss=11.281, player_2/loss=20.571, rew=0.00]                          


Epoch #184: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #185: 1001it [00:02, 452.12it/s, env_step=185000, len=8, n/ep=1, n/st=10, player_1/loss=7.113, player_2/loss=22.585, rew=0.00]                          


Epoch #185: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #186: 1001it [00:02, 453.39it/s, env_step=186000, len=9, n/ep=1, n/st=10, player_1/loss=8.008, player_2/loss=16.977, rew=0.00]                          


Epoch #186: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #187: 1001it [00:02, 465.17it/s, env_step=187000, len=7, n/ep=2, n/st=10, player_1/loss=8.692, player_2/loss=16.743, rew=0.00]                          


Epoch #187: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #188: 1001it [00:02, 423.30it/s, env_step=188000, len=8, n/ep=1, n/st=10, player_1/loss=7.699, player_2/loss=17.307, rew=0.00]                          


Epoch #188: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #189: 1001it [00:02, 441.49it/s, env_step=189000, len=7, n/ep=1, n/st=10, player_1/loss=7.980, player_2/loss=17.717, rew=0.00]                          


Epoch #189: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #190: 1001it [00:02, 452.88it/s, env_step=190000, len=11, n/ep=1, n/st=10, player_1/loss=7.430, player_2/loss=20.126, rew=0.00]                          


Epoch #190: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #191: 1001it [00:02, 412.57it/s, env_step=191000, len=7, n/ep=1, n/st=10, player_1/loss=5.161, player_2/loss=19.357, rew=0.00]                          


Epoch #191: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #192: 1001it [00:02, 398.91it/s, env_step=192000, len=8, n/ep=1, n/st=10, player_1/loss=5.961, player_2/loss=19.429, rew=0.00]                          


Epoch #192: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #193: 1001it [00:02, 461.46it/s, env_step=193000, len=7, n/ep=1, n/st=10, player_1/loss=10.535, player_2/loss=16.456, rew=0.00]                          


Epoch #193: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #194: 1001it [00:02, 448.51it/s, env_step=194000, len=7, n/ep=0, n/st=10, player_1/loss=9.410, player_2/loss=20.862, rew=0.00]                           


Epoch #194: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #195: 1001it [00:02, 459.90it/s, env_step=195000, len=7, n/ep=0, n/st=10, player_1/loss=4.965, player_2/loss=15.103, rew=0.00]                          


Epoch #195: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #196: 1001it [00:02, 446.66it/s, env_step=196000, len=7, n/ep=2, n/st=10, player_1/loss=4.723, player_2/loss=17.642, rew=0.00]                          


Epoch #196: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #197: 1001it [00:02, 446.24it/s, env_step=197000, len=7, n/ep=1, n/st=10, player_1/loss=6.782, player_2/loss=17.844, rew=0.00]                          


Epoch #197: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #198: 1001it [00:02, 455.04it/s, env_step=198000, len=7, n/ep=0, n/st=10, player_1/loss=6.729, player_2/loss=16.348, rew=0.00]                          


Epoch #198: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #199: 1001it [00:02, 451.27it/s, env_step=199000, len=10, n/ep=0, n/st=10, player_1/loss=19.444, player_2/loss=21.826, rew=0.00]                          


Epoch #199: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #200: 1001it [00:02, 451.44it/s, env_step=200000, len=8, n/ep=1, n/st=10, player_1/loss=98.271, player_2/loss=32.186, rew=0.00]                             


Epoch #200: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #201: 1001it [00:02, 463.03it/s, env_step=201000, len=7, n/ep=1, n/st=10, player_1/loss=32.484, player_2/loss=41.574, rew=0.00]                             


Epoch #201: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #202: 1001it [00:02, 433.87it/s, env_step=202000, len=7, n/ep=5, n/st=10, player_1/loss=9.202, player_2/loss=17.273, rew=0.00]                           


Epoch #202: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #203: 1001it [00:02, 448.43it/s, env_step=203000, len=8, n/ep=2, n/st=10, player_1/loss=4.724, player_2/loss=12.301, rew=0.00]                          


Epoch #203: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #204: 1001it [00:02, 444.54it/s, env_step=204000, len=7, n/ep=2, n/st=10, player_1/loss=8.242, player_2/loss=13.489, rew=0.00]                          


Epoch #204: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #205: 1001it [00:02, 428.25it/s, env_step=205000, len=8, n/ep=3, n/st=10, player_1/loss=17.921, player_2/loss=18.553, rew=0.00]                            


Epoch #205: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #206: 1001it [00:02, 439.57it/s, env_step=206000, len=7, n/ep=2, n/st=10, player_1/loss=38.133, player_2/loss=30.635, rew=0.00]                              


Epoch #206: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #207: 1001it [00:02, 452.50it/s, env_step=207000, len=7, n/ep=3, n/st=10, player_1/loss=21.878, player_2/loss=24.087, rew=0.00]                             


Epoch #207: test_reward: -8480.100000 ± 25440.300000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #208: 1001it [00:02, 463.17it/s, env_step=208000, len=7, n/ep=0, n/st=10, player_1/loss=77.546, player_2/loss=28.834, rew=0.00]                              


Epoch #208: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #209: 1001it [00:02, 461.64it/s, env_step=209000, len=7, n/ep=2, n/st=10, player_1/loss=30.354, player_2/loss=22.584, rew=0.00]                          


Epoch #209: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #210: 1001it [00:02, 443.30it/s, env_step=210000, len=7, n/ep=1, n/st=10, player_1/loss=21.524, player_2/loss=28.896, rew=0.00]                            


Epoch #210: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #211: 1001it [00:02, 454.22it/s, env_step=211000, len=8, n/ep=1, n/st=10, player_1/loss=7.214, player_2/loss=14.971, rew=0.00]                           


Epoch #211: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #212: 1001it [00:02, 454.55it/s, env_step=212000, len=8, n/ep=2, n/st=10, player_1/loss=5.165, player_2/loss=18.466, rew=0.00]                          


Epoch #212: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #213: 1001it [00:02, 462.82it/s, env_step=213000, len=7, n/ep=0, n/st=10, player_1/loss=7.707, player_2/loss=15.876, rew=0.00]                          


Epoch #213: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #214: 1001it [00:02, 451.34it/s, env_step=214000, len=9, n/ep=1, n/st=10, player_1/loss=8.321, player_2/loss=18.438, rew=0.00]                          


Epoch #214: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #215: 1001it [00:02, 459.09it/s, env_step=215000, len=7, n/ep=0, n/st=10, player_1/loss=7.119, player_2/loss=21.873, rew=0.00]                          


Epoch #215: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #216: 1001it [00:02, 460.27it/s, env_step=216000, len=10, n/ep=1, n/st=10, player_1/loss=6.963, player_2/loss=17.836, rew=0.00]                          


Epoch #216: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #217: 1001it [00:02, 449.32it/s, env_step=217000, len=8, n/ep=1, n/st=10, player_1/loss=6.555, player_2/loss=18.390, rew=0.00]                          


Epoch #217: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #218: 1001it [00:02, 456.49it/s, env_step=218000, len=8, n/ep=2, n/st=10, player_1/loss=5.821, player_2/loss=16.639, rew=0.00]                          


Epoch #218: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #219: 1001it [00:02, 459.21it/s, env_step=219000, len=8, n/ep=0, n/st=10, player_1/loss=6.900, player_2/loss=16.800, rew=0.00]                          


Epoch #219: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #220: 1001it [00:02, 435.02it/s, env_step=220000, len=8, n/ep=0, n/st=10, player_1/loss=5.554, player_2/loss=16.982, rew=0.00]                          


Epoch #220: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #221: 1001it [00:02, 424.69it/s, env_step=221000, len=8, n/ep=2, n/st=10, player_1/loss=5.883, player_2/loss=18.616, rew=0.00]                          


Epoch #221: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #222: 1001it [00:02, 446.91it/s, env_step=222000, len=8, n/ep=4, n/st=10, player_1/loss=5.274, player_2/loss=17.046, rew=0.00]                          


Epoch #222: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #223: 1001it [00:02, 451.08it/s, env_step=223000, len=7, n/ep=5, n/st=10, player_1/loss=6.805, player_2/loss=16.553, rew=0.00]                          


Epoch #223: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #224: 1001it [00:02, 421.34it/s, env_step=224000, len=7, n/ep=1, n/st=10, player_1/loss=5.412, player_2/loss=17.935, rew=0.00]                          


Epoch #224: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #225: 1001it [00:02, 444.66it/s, env_step=225000, len=7, n/ep=0, n/st=10, player_1/loss=16.486, player_2/loss=18.795, rew=0.00]                          


Epoch #225: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #226: 1001it [00:02, 453.93it/s, env_step=226000, len=7, n/ep=2, n/st=10, player_1/loss=127.496, player_2/loss=43.601, rew=0.00]                              


Epoch #226: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #227: 1001it [00:02, 454.23it/s, env_step=227000, len=7, n/ep=3, n/st=10, player_1/loss=6.744, player_2/loss=16.461, rew=0.00]                           


Epoch #227: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #228: 1001it [00:02, 451.62it/s, env_step=228000, len=10, n/ep=3, n/st=10, player_1/loss=8.006, player_2/loss=16.204, rew=0.00]                          


Epoch #228: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #229: 1001it [00:02, 463.95it/s, env_step=229000, len=7, n/ep=0, n/st=10, player_1/loss=6.014, player_2/loss=17.199, rew=0.00]                          


Epoch #229: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #230: 1001it [00:02, 443.68it/s, env_step=230000, len=12, n/ep=0, n/st=10, player_1/loss=6.366, player_2/loss=16.163, rew=0.00]                          


Epoch #230: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #231: 1001it [00:02, 459.35it/s, env_step=231000, len=9, n/ep=0, n/st=10, player_1/loss=9.892, player_2/loss=17.260, rew=0.00]                          


Epoch #231: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #232: 1001it [00:02, 444.73it/s, env_step=232000, len=7, n/ep=0, n/st=10, player_1/loss=4.708, player_2/loss=16.917, rew=0.00]                          


Epoch #232: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #233: 1001it [00:02, 461.12it/s, env_step=233000, len=7, n/ep=1, n/st=10, player_1/loss=7.470, player_2/loss=17.775, rew=0.00]                          


Epoch #233: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #234: 1001it [00:02, 465.83it/s, env_step=234000, len=15, n/ep=0, n/st=10, player_1/loss=23.733, player_2/loss=14.267, rew=0.00]                           


Epoch #234: test_reward: -534.100000 ± 1579.378007, best_reward: 0.000000 ± 0.000000 in #0


Epoch #235: 1001it [00:02, 448.75it/s, env_step=235000, len=7, n/ep=0, n/st=10, player_1/loss=25.185, player_2/loss=19.994, rew=0.00]                           


Epoch #235: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #236: 1001it [00:02, 460.27it/s, env_step=236000, len=8, n/ep=0, n/st=10, player_1/loss=5.284, player_2/loss=16.185, rew=0.00]                          


Epoch #236: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #237: 1001it [00:02, 446.90it/s, env_step=237000, len=7, n/ep=2, n/st=10, player_1/loss=4.172, player_2/loss=17.548, rew=0.00]                          


Epoch #237: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #238: 1001it [00:02, 448.70it/s, env_step=238000, len=8, n/ep=0, n/st=10, player_1/loss=7.089, player_2/loss=15.680, rew=0.00]                           


Epoch #238: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #239: 1001it [00:02, 441.01it/s, env_step=239000, len=7, n/ep=2, n/st=10, player_1/loss=24.151, player_2/loss=20.411, rew=0.00]                             


Epoch #239: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #240: 1001it [00:02, 435.66it/s, env_step=240000, len=8, n/ep=2, n/st=10, player_1/loss=8.464, player_2/loss=21.902, rew=0.00]                           


Epoch #240: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #241: 1001it [00:02, 448.42it/s, env_step=241000, len=11, n/ep=1, n/st=10, player_1/loss=13.589, player_2/loss=14.337, rew=0.00]                           


Epoch #241: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #242: 1001it [00:02, 447.72it/s, env_step=242000, len=9, n/ep=2, n/st=10, player_1/loss=11.651, player_2/loss=12.966, rew=0.00]                            


Epoch #242: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #243: 1001it [00:02, 462.38it/s, env_step=243000, len=10, n/ep=1, n/st=10, player_1/loss=35.151, player_2/loss=8.502, rew=0.00]                          


Epoch #243: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #244: 1001it [00:02, 448.21it/s, env_step=244000, len=8, n/ep=2, n/st=10, player_1/loss=27.665, player_2/loss=6.857, rew=0.00]                            


Epoch #244: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #245: 1001it [00:02, 361.47it/s, env_step=245000, len=7, n/ep=2, n/st=10, player_1/loss=60.463, player_2/loss=11.849, rew=0.00]                            


Epoch #245: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #246: 1001it [00:02, 420.53it/s, env_step=246000, len=8, n/ep=2, n/st=10, player_1/loss=27.652, player_2/loss=15.519, rew=0.00]                             


Epoch #246: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #247: 1001it [00:02, 429.97it/s, env_step=247000, len=8, n/ep=2, n/st=10, player_1/loss=23.395, player_2/loss=6.768, rew=0.00]                           


Epoch #247: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #248: 1001it [00:02, 440.43it/s, env_step=248000, len=10, n/ep=3, n/st=10, player_1/loss=23.490, player_2/loss=24.111, rew=0.00]                          


Epoch #248: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #249: 1001it [00:02, 391.96it/s, env_step=249000, len=9, n/ep=3, n/st=10, player_1/loss=159.147, player_2/loss=36.175, rew=0.00]                               


Epoch #249: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #250: 1001it [00:02, 371.42it/s, env_step=250000, len=9, n/ep=2, n/st=10, player_1/loss=59.971, player_2/loss=16.057, rew=0.00]                             


Epoch #250: test_reward: -347.100000 ± 1041.300000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #251: 1001it [00:02, 481.99it/s, env_step=251000, len=13, n/ep=0, n/st=10, player_1/loss=16.911, player_2/loss=15.759, rew=0.00]                           


Epoch #251: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #252: 1001it [00:02, 487.31it/s, env_step=252000, len=8, n/ep=0, n/st=10, player_1/loss=10.823, player_2/loss=10.593, rew=0.00]                          


Epoch #252: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #253: 1001it [00:02, 488.09it/s, env_step=253000, len=7, n/ep=0, n/st=10, player_1/loss=6.295, player_2/loss=15.833, rew=0.00]                          


Epoch #253: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #254: 1001it [00:02, 478.99it/s, env_step=254000, len=7, n/ep=2, n/st=10, player_1/loss=6.995, player_2/loss=14.765, rew=0.00]                          


Epoch #254: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #255: 1001it [00:02, 488.82it/s, env_step=255000, len=8, n/ep=3, n/st=10, player_1/loss=20.162, player_2/loss=11.886, rew=0.00]                            


Epoch #255: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #256: 1001it [00:02, 485.25it/s, env_step=256000, len=7, n/ep=0, n/st=10, player_1/loss=13.587, player_2/loss=15.846, rew=0.00]                             


Epoch #256: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #257: 1001it [00:02, 491.53it/s, env_step=257000, len=9, n/ep=0, n/st=10, player_1/loss=10.553, player_2/loss=11.452, rew=0.00]                          


Epoch #257: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #258: 1001it [00:02, 479.66it/s, env_step=258000, len=17, n/ep=1, n/st=10, player_1/loss=7.474, player_2/loss=16.823, rew=0.00]                            


Epoch #258: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #259: 1001it [00:02, 477.85it/s, env_step=259000, len=8, n/ep=2, n/st=10, player_1/loss=46.854, player_2/loss=17.037, rew=0.00]                             


Epoch #259: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #260: 1001it [00:02, 460.73it/s, env_step=260000, len=7, n/ep=0, n/st=10, player_1/loss=85.376, player_2/loss=32.477, rew=0.00]                                


Epoch #260: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #261: 1001it [00:02, 474.88it/s, env_step=261000, len=7, n/ep=3, n/st=10, player_1/loss=5.442, player_2/loss=8.876, rew=0.00]                            


Epoch #261: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #262: 1001it [00:02, 486.66it/s, env_step=262000, len=8, n/ep=0, n/st=10, player_1/loss=15.953, player_2/loss=18.397, rew=0.00]                            


Epoch #262: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #263: 1001it [00:02, 488.07it/s, env_step=263000, len=7, n/ep=4, n/st=10, player_1/loss=21.770, player_2/loss=33.756, rew=0.00]                             


Epoch #263: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #264: 1001it [00:02, 482.44it/s, env_step=264000, len=8, n/ep=4, n/st=10, player_1/loss=6.060, player_2/loss=16.365, rew=0.00]                          


Epoch #264: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #265: 1001it [00:02, 478.87it/s, env_step=265000, len=9, n/ep=1, n/st=10, player_1/loss=7.350, player_2/loss=13.967, rew=0.00]                          


Epoch #265: test_reward: -6.300000 ± 18.900000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #266: 1001it [00:02, 441.80it/s, env_step=266000, len=7, n/ep=1, n/st=10, player_1/loss=9.412, player_2/loss=14.531, rew=0.00]                          


Epoch #266: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #267: 1001it [00:02, 393.25it/s, env_step=267000, len=7, n/ep=1, n/st=10, player_1/loss=9.644, player_2/loss=15.039, rew=0.00]                           


Epoch #267: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #268: 1001it [00:02, 467.49it/s, env_step=268000, len=7, n/ep=1, n/st=10, player_1/loss=136.456, player_2/loss=14.582, rew=0.00]                            


Epoch #268: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #269: 1001it [00:02, 480.01it/s, env_step=269000, len=7, n/ep=0, n/st=10, player_1/loss=120.735, player_2/loss=42.551, rew=0.00]                             


Epoch #269: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #270: 1001it [00:02, 478.23it/s, env_step=270000, len=8, n/ep=2, n/st=10, player_1/loss=12.503, player_2/loss=15.695, rew=0.00]                           


Epoch #270: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #271: 1001it [00:02, 494.53it/s, env_step=271000, len=8, n/ep=1, n/st=10, player_1/loss=8.469, player_2/loss=12.092, rew=0.00]                           


Epoch #271: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #272: 1001it [00:02, 474.00it/s, env_step=272000, len=7, n/ep=1, n/st=10, player_1/loss=9.998, player_2/loss=12.421, rew=0.00]                           


Epoch #272: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #273: 1001it [00:02, 481.00it/s, env_step=273000, len=7, n/ep=0, n/st=10, player_1/loss=4.169, player_2/loss=19.088, rew=0.00]                          


Epoch #273: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #274: 1001it [00:02, 456.91it/s, env_step=274000, len=7, n/ep=0, n/st=10, player_1/loss=8.758, player_2/loss=17.858, rew=0.00]                          


Epoch #274: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #275: 1001it [00:02, 484.24it/s, env_step=275000, len=7, n/ep=1, n/st=10, player_1/loss=7.904, player_2/loss=17.412, rew=0.00]                           


Epoch #275: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #276: 1001it [00:02, 486.61it/s, env_step=276000, len=8, n/ep=0, n/st=10, player_1/loss=5.486, player_2/loss=18.396, rew=0.00]                          


Epoch #276: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #277: 1001it [00:02, 459.31it/s, env_step=277000, len=7, n/ep=0, n/st=10, player_1/loss=9.599, player_2/loss=17.760, rew=0.00]                          


Epoch #277: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #278: 1001it [00:02, 468.24it/s, env_step=278000, len=10, n/ep=0, n/st=10, player_1/loss=4.915, player_2/loss=18.591, rew=0.00]                          


Epoch #278: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #279: 1001it [00:02, 495.33it/s, env_step=279000, len=7, n/ep=1, n/st=10, player_1/loss=5.666, player_2/loss=19.881, rew=0.00]                          


Epoch #279: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #280: 1001it [00:02, 476.83it/s, env_step=280000, len=7, n/ep=3, n/st=10, player_1/loss=11.986, player_2/loss=23.840, rew=0.00]                          


Epoch #280: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #281: 1001it [00:02, 486.67it/s, env_step=281000, len=8, n/ep=3, n/st=10, player_1/loss=6.994, player_2/loss=18.090, rew=0.00]                          


Epoch #281: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #282: 1001it [00:02, 484.12it/s, env_step=282000, len=7, n/ep=0, n/st=10, player_1/loss=7.900, player_2/loss=19.252, rew=0.00]                          


Epoch #282: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #283: 1001it [00:02, 479.19it/s, env_step=283000, len=9, n/ep=0, n/st=10, player_1/loss=16.654, player_2/loss=17.956, rew=0.00]                          


Epoch #283: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #284: 1001it [00:02, 480.37it/s, env_step=284000, len=15, n/ep=2, n/st=10, player_1/loss=55.302, player_2/loss=18.668, rew=0.00]                           


Epoch #284: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #285: 1001it [00:02, 483.84it/s, env_step=285000, len=9, n/ep=0, n/st=10, player_1/loss=97.439, player_2/loss=9.624, rew=0.00]                                


Epoch #285: test_reward: -10359.800000 ± 31079.400000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #286: 1001it [00:02, 482.57it/s, env_step=286000, len=15, n/ep=1, n/st=10, player_1/loss=42.948, player_2/loss=27.017, rew=0.00]                            


Epoch #286: test_reward: -181.800000 ± 545.400000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #287: 1001it [00:02, 491.81it/s, env_step=287000, len=8, n/ep=1, n/st=10, player_1/loss=33.242, player_2/loss=15.395, rew=0.00]                            


Epoch #287: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #288: 1001it [00:02, 487.50it/s, env_step=288000, len=9, n/ep=1, n/st=10, player_1/loss=114.884, player_2/loss=27.860, rew=0.00]                             


Epoch #288: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #289: 1001it [00:02, 482.10it/s, env_step=289000, len=11, n/ep=0, n/st=10, player_1/loss=59.179, player_2/loss=19.375, rew=0.00]                          


Epoch #289: test_reward: -4836.000000 ± 12999.929608, best_reward: 0.000000 ± 0.000000 in #0


Epoch #290: 1001it [00:02, 482.80it/s, env_step=290000, len=43, n/ep=0, n/st=10, player_1/loss=457.095, player_2/loss=185.506, rew=-261.00]                          


Epoch #290: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #291: 1001it [00:02, 487.97it/s, env_step=291000, len=8, n/ep=1, n/st=10, player_1/loss=646.523, player_2/loss=173.380, rew=0.00]                               


Epoch #291: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #292: 1001it [00:02, 445.03it/s, env_step=292000, len=8, n/ep=2, n/st=10, player_1/loss=11.587, player_2/loss=24.639, rew=0.00]                            


Epoch #292: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #293: 1001it [00:02, 405.67it/s, env_step=293000, len=7, n/ep=1, n/st=10, player_1/loss=7.500, player_2/loss=19.314, rew=0.00]                          


Epoch #293: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #294: 1001it [00:02, 445.03it/s, env_step=294000, len=7, n/ep=1, n/st=10, player_1/loss=6.156, player_2/loss=20.015, rew=0.00]                          


Epoch #294: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #295: 1001it [00:02, 465.08it/s, env_step=295000, len=7, n/ep=1, n/st=10, player_1/loss=12.394, player_2/loss=19.690, rew=0.00]                          


Epoch #295: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #296: 1001it [00:02, 451.14it/s, env_step=296000, len=8, n/ep=0, n/st=10, player_1/loss=7.662, player_2/loss=17.846, rew=0.00]                           


Epoch #296: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #297: 1001it [00:02, 433.23it/s, env_step=297000, len=9, n/ep=2, n/st=10, player_1/loss=6.629, player_2/loss=20.833, rew=0.00]                           


Epoch #297: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #298: 1001it [00:02, 441.79it/s, env_step=298000, len=7, n/ep=0, n/st=10, player_1/loss=7.383, player_2/loss=19.930, rew=0.00]                          


Epoch #298: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #299: 1001it [00:02, 433.51it/s, env_step=299000, len=7, n/ep=2, n/st=10, player_1/loss=6.476, player_2/loss=18.884, rew=0.00]                          


Epoch #299: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #300: 1001it [00:02, 428.28it/s, env_step=300000, len=7, n/ep=0, n/st=10, player_1/loss=21.269, player_2/loss=16.481, rew=0.00]                          


Epoch #300: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #301: 1001it [00:02, 451.71it/s, env_step=301000, len=9, n/ep=1, n/st=10, player_1/loss=13.875, player_2/loss=9.073, rew=0.00]                           


Epoch #301: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #302: 1001it [00:02, 447.01it/s, env_step=302000, len=8, n/ep=3, n/st=10, player_1/loss=12.297, player_2/loss=16.956, rew=0.00]                          


Epoch #302: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #303: 1001it [00:02, 433.56it/s, env_step=303000, len=8, n/ep=2, n/st=10, player_1/loss=32.791, player_2/loss=11.851, rew=0.00]                          


Epoch #303: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #304: 1001it [00:02, 451.07it/s, env_step=304000, len=7, n/ep=2, n/st=10, player_1/loss=282.690, player_2/loss=42.683, rew=0.00]                              


Epoch #304: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #305: 1001it [00:02, 399.34it/s, env_step=305000, len=8, n/ep=0, n/st=10, player_1/loss=80.561, player_2/loss=56.942, rew=0.00]                               


Epoch #305: test_reward: -34.800000 ± 104.400000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #306: 1001it [00:02, 466.60it/s, env_step=306000, len=8, n/ep=0, n/st=10, player_1/loss=21.961, player_2/loss=39.890, rew=0.00]                             


Epoch #306: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #307: 1001it [00:02, 465.71it/s, env_step=307000, len=10, n/ep=0, n/st=10, player_1/loss=12.040, player_2/loss=10.185, rew=0.00]                          


Epoch #307: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #308: 1001it [00:02, 426.82it/s, env_step=308000, len=8, n/ep=0, n/st=10, player_1/loss=14.116, player_2/loss=23.012, rew=0.00]                          


Epoch #308: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #309: 1001it [00:02, 442.28it/s, env_step=309000, len=10, n/ep=0, n/st=10, player_1/loss=12.433, player_2/loss=7.492, rew=0.00]                          


Epoch #309: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #310: 1001it [00:02, 446.02it/s, env_step=310000, len=8, n/ep=1, n/st=10, player_1/loss=11.997, player_2/loss=6.311, rew=0.00]                          


Epoch #310: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #311: 1001it [00:02, 399.23it/s, env_step=311000, len=8, n/ep=1, n/st=10, player_1/loss=10.218, player_2/loss=4.855, rew=0.00]                          


Epoch #311: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #312: 1001it [00:02, 440.92it/s, env_step=312000, len=8, n/ep=1, n/st=10, player_1/loss=14.331, player_2/loss=5.463, rew=0.00]                          


Epoch #312: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #313: 1001it [00:02, 435.74it/s, env_step=313000, len=8, n/ep=1, n/st=10, player_1/loss=10.862, player_2/loss=20.079, rew=0.00]                             


Epoch #313: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #314: 1001it [00:02, 442.67it/s, env_step=314000, len=8, n/ep=2, n/st=10, player_1/loss=8.223, player_2/loss=6.979, rew=0.00]                            


Epoch #314: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #315: 1001it [00:02, 444.24it/s, env_step=315000, len=8, n/ep=1, n/st=10, player_1/loss=11.289, player_2/loss=10.991, rew=0.00]                          


Epoch #315: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #316: 1001it [00:02, 424.29it/s, env_step=316000, len=12, n/ep=1, n/st=10, player_1/loss=11.012, player_2/loss=7.267, rew=0.00]                          


Epoch #316: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #317: 1001it [00:02, 434.23it/s, env_step=317000, len=8, n/ep=1, n/st=10, player_1/loss=34.881, player_2/loss=19.861, rew=0.00]                             


Epoch #317: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #318: 1001it [00:02, 461.01it/s, env_step=318000, len=8, n/ep=1, n/st=10, player_1/loss=4.997, player_2/loss=2.766, rew=0.00]                            


Epoch #318: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #319: 1001it [00:02, 362.78it/s, env_step=319000, len=10, n/ep=1, n/st=10, player_1/loss=5.888, player_2/loss=6.275, rew=0.00]                          


Epoch #319: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #320: 1001it [00:02, 410.71it/s, env_step=320000, len=8, n/ep=1, n/st=10, player_1/loss=6.236, player_2/loss=4.111, rew=0.00]                          


Epoch #320: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #321: 1001it [00:02, 392.68it/s, env_step=321000, len=8, n/ep=2, n/st=10, player_1/loss=6.319, player_2/loss=3.459, rew=0.00]                          


Epoch #321: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #322: 1001it [00:02, 422.73it/s, env_step=322000, len=11, n/ep=2, n/st=10, player_1/loss=36.987, player_2/loss=18.961, rew=0.00]                            


Epoch #322: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #323: 1001it [00:02, 457.02it/s, env_step=323000, len=8, n/ep=1, n/st=10, player_1/loss=53.160, player_2/loss=15.323, rew=0.00]                          


Epoch #323: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #324: 1001it [00:02, 401.52it/s, env_step=324000, len=7, n/ep=0, n/st=10, player_1/loss=71.955, player_2/loss=26.420, rew=0.00]                             


Epoch #324: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #325: 1001it [00:02, 478.42it/s, env_step=325000, len=8, n/ep=1, n/st=10, player_1/loss=147.495, player_2/loss=62.084, rew=0.00]                             


Epoch #325: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #326: 1001it [00:02, 469.66it/s, env_step=326000, len=7, n/ep=2, n/st=10, player_1/loss=158.088, player_2/loss=100.833, rew=0.00]                              


Epoch #326: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #327: 1001it [00:02, 419.67it/s, env_step=327000, len=10, n/ep=2, n/st=10, player_1/loss=53.977, player_2/loss=122.107, rew=0.00]                             


Epoch #327: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #328: 1001it [00:02, 454.12it/s, env_step=328000, len=9, n/ep=2, n/st=10, player_1/loss=129.895, player_2/loss=72.101, rew=0.00]                          


Epoch #328: test_reward: -1.800000 ± 5.400000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #329: 1001it [00:02, 476.59it/s, env_step=329000, len=7, n/ep=0, n/st=10, player_1/loss=480.991, player_2/loss=46.921, rew=0.00]                              


Epoch #329: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #330: 1001it [00:02, 487.50it/s, env_step=330000, len=7, n/ep=3, n/st=10, player_1/loss=15.180, player_2/loss=27.901, rew=0.00]                           


Epoch #330: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #331: 1001it [00:02, 408.37it/s, env_step=331000, len=9, n/ep=1, n/st=10, player_1/loss=10.252, player_2/loss=48.429, rew=0.00]                          


Epoch #331: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #332: 1001it [00:02, 442.66it/s, env_step=332000, len=7, n/ep=1, n/st=10, player_1/loss=7.183, player_2/loss=47.669, rew=0.00]                              


Epoch #332: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #333: 1001it [00:02, 450.06it/s, env_step=333000, len=7, n/ep=2, n/st=10, player_1/loss=7.111, player_2/loss=24.325, rew=0.00]                           


Epoch #333: test_reward: -0.200000 ± 0.600000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #334: 1001it [00:02, 477.96it/s, env_step=334000, len=7, n/ep=3, n/st=10, player_1/loss=9.549, player_2/loss=19.658, rew=0.00]                             


Epoch #334: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #335: 1001it [00:02, 439.15it/s, env_step=335000, len=7, n/ep=1, n/st=10, player_1/loss=11.030, player_2/loss=36.580, rew=0.00]                          


Epoch #335: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #336: 1001it [00:02, 479.01it/s, env_step=336000, len=10, n/ep=1, n/st=10, player_1/loss=11.177, player_2/loss=47.785, rew=0.00]                          


Epoch #336: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #337: 1001it [00:02, 445.88it/s, env_step=337000, len=10, n/ep=0, n/st=10, player_1/loss=20.382, player_2/loss=74.808, rew=0.00]                          


Epoch #337: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #338: 1001it [00:02, 458.89it/s, env_step=338000, len=10, n/ep=0, n/st=10, player_1/loss=10.061, player_2/loss=18.700, rew=0.00]                          


Epoch #338: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #339: 1001it [00:02, 467.04it/s, env_step=339000, len=10, n/ep=0, n/st=10, player_1/loss=63.173, player_2/loss=150.214, rew=0.00]                           


Epoch #339: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #340: 1001it [00:02, 429.24it/s, env_step=340000, len=8, n/ep=1, n/st=10, player_1/loss=172.400, player_2/loss=45.491, rew=0.00]                               


Epoch #340: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #341: 1001it [00:02, 462.72it/s, env_step=341000, len=9, n/ep=0, n/st=10, player_1/loss=77.155, player_2/loss=35.965, rew=0.00]                               


Epoch #341: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #342: 1001it [00:02, 465.08it/s, env_step=342000, len=10, n/ep=1, n/st=10, player_1/loss=28.718, player_2/loss=27.103, rew=0.00]                          


Epoch #342: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #343: 1001it [00:02, 483.03it/s, env_step=343000, len=7, n/ep=0, n/st=10, player_1/loss=33.675, player_2/loss=18.658, rew=0.00]                          


Epoch #343: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #344: 1001it [00:02, 455.56it/s, env_step=344000, len=9, n/ep=0, n/st=10, player_1/loss=14.939, player_2/loss=12.829, rew=0.00]                          


Epoch #344: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #345: 1001it [00:02, 449.32it/s, env_step=345000, len=9, n/ep=0, n/st=10, player_1/loss=87.387, player_2/loss=16.186, rew=0.00]                          


Epoch #345: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #346: 1001it [00:02, 446.78it/s, env_step=346000, len=7, n/ep=1, n/st=10, player_1/loss=137.911, player_2/loss=37.738, rew=0.00]                            


Epoch #346: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #347: 1001it [00:02, 456.97it/s, env_step=347000, len=7, n/ep=1, n/st=10, player_1/loss=31.061, player_2/loss=14.794, rew=0.00]                          


Epoch #347: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #348: 1001it [00:02, 470.60it/s, env_step=348000, len=10, n/ep=1, n/st=10, player_1/loss=105.642, player_2/loss=14.105, rew=0.00]                          


Epoch #348: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #349: 1001it [00:02, 466.81it/s, env_step=349000, len=9, n/ep=1, n/st=10, player_1/loss=109.434, player_2/loss=77.084, rew=0.00]                             


Epoch #349: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #350: 1001it [00:02, 475.12it/s, env_step=350000, len=8, n/ep=0, n/st=10, player_1/loss=184.798, player_2/loss=24.372, rew=0.00]                            


Epoch #350: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #351: 1001it [00:02, 478.70it/s, env_step=351000, len=14, n/ep=1, n/st=10, player_1/loss=283.061, player_2/loss=31.006, rew=0.00]                           


Epoch #351: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #352: 1001it [00:02, 478.75it/s, env_step=352000, len=9, n/ep=0, n/st=10, player_1/loss=760.876, player_2/loss=103.111, rew=0.00]                             


Epoch #352: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #353: 1001it [00:02, 476.14it/s, env_step=353000, len=9, n/ep=2, n/st=10, player_1/loss=863.709, player_2/loss=185.015, rew=0.00]                                 


Epoch #353: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #354: 1001it [00:02, 456.18it/s, env_step=354000, len=13, n/ep=1, n/st=10, player_1/loss=275.596, player_2/loss=304.223, rew=0.00]                               


Epoch #354: test_reward: -908.100000 ± 2724.300000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #355: 1001it [00:02, 463.82it/s, env_step=355000, len=9, n/ep=1, n/st=10, player_1/loss=237.428, player_2/loss=230.825, rew=0.00]                          


Epoch #355: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #356: 1001it [00:02, 401.43it/s, env_step=356000, len=9, n/ep=1, n/st=10, player_1/loss=1238.159, player_2/loss=1956.661, rew=0.00]                             


Epoch #356: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #357: 1001it [00:02, 446.47it/s, env_step=357000, len=10, n/ep=1, n/st=10, player_1/loss=1092.903, player_2/loss=887.980, rew=0.00]                                


Epoch #357: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #358: 1001it [00:02, 447.82it/s, env_step=358000, len=7, n/ep=2, n/st=10, player_1/loss=437.417, player_2/loss=498.917, rew=0.00]                              


Epoch #358: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #359: 1001it [00:02, 474.79it/s, env_step=359000, len=7, n/ep=1, n/st=10, player_1/loss=191.347, player_2/loss=195.496, rew=0.00]                             


Epoch #359: test_reward: -214.200000 ± 642.600000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #360: 1001it [00:02, 473.67it/s, env_step=360000, len=9, n/ep=0, n/st=10, player_1/loss=84.840, player_2/loss=30.474, rew=0.00]                          


Epoch #360: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #361: 1001it [00:02, 472.10it/s, env_step=361000, len=9, n/ep=0, n/st=10, player_1/loss=168.918, player_2/loss=81.762, rew=0.00]                             


Epoch #361: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #362: 1001it [00:02, 439.89it/s, env_step=362000, len=9, n/ep=3, n/st=10, player_1/loss=257.654, player_2/loss=147.647, rew=0.00]                              


Epoch #362: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #363: 1001it [00:02, 465.52it/s, env_step=363000, len=7, n/ep=2, n/st=10, player_1/loss=71.521, player_2/loss=45.246, rew=0.00]                           


Epoch #363: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #364: 1001it [00:02, 490.60it/s, env_step=364000, len=11, n/ep=0, n/st=10, player_1/loss=109.398, player_2/loss=72.687, rew=0.00]                          


Epoch #364: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #365: 1001it [00:02, 468.57it/s, env_step=365000, len=7, n/ep=0, n/st=10, player_1/loss=108.331, player_2/loss=88.476, rew=0.00]                           


Epoch #365: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #366: 1001it [00:02, 486.08it/s, env_step=366000, len=7, n/ep=0, n/st=10, player_1/loss=35.838, player_2/loss=24.712, rew=0.00]                          


Epoch #366: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #367: 1001it [00:02, 476.14it/s, env_step=367000, len=9, n/ep=1, n/st=10, player_1/loss=131.455, player_2/loss=155.698, rew=0.00]                           


Epoch #367: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #368: 1001it [00:02, 483.03it/s, env_step=368000, len=7, n/ep=1, n/st=10, player_1/loss=161.776, player_2/loss=140.767, rew=0.00]                           


Epoch #368: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #369: 1001it [00:02, 481.41it/s, env_step=369000, len=7, n/ep=3, n/st=10, player_1/loss=31.796, player_2/loss=14.945, rew=0.00]                            


Epoch #369: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #370: 1001it [00:02, 471.88it/s, env_step=370000, len=8, n/ep=0, n/st=10, player_1/loss=65.756, player_2/loss=36.035, rew=0.00]                          


Epoch #370: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #371: 1001it [00:02, 478.87it/s, env_step=371000, len=14, n/ep=1, n/st=10, player_1/loss=93.574, player_2/loss=25.987, rew=0.00]                          


Epoch #371: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #372: 1001it [00:02, 479.33it/s, env_step=372000, len=7, n/ep=1, n/st=10, player_1/loss=318.908, player_2/loss=38.537, rew=0.00]                            


Epoch #372: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #373: 1001it [00:02, 422.84it/s, env_step=373000, len=8, n/ep=0, n/st=10, player_1/loss=91.536, player_2/loss=38.802, rew=0.00]                           


Epoch #373: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #374: 1001it [00:02, 367.90it/s, env_step=374000, len=7, n/ep=1, n/st=10, player_1/loss=85.518, player_2/loss=214.915, rew=0.00]                             


Epoch #374: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #375: 1001it [00:02, 397.52it/s, env_step=375000, len=7, n/ep=1, n/st=10, player_1/loss=140.653, player_2/loss=322.542, rew=0.00]                               


Epoch #375: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #376: 1001it [00:02, 404.30it/s, env_step=376000, len=7, n/ep=0, n/st=10, player_1/loss=17.188, player_2/loss=117.090, rew=0.00]                          


Epoch #376: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #377: 1001it [00:02, 367.10it/s, env_step=377000, len=10, n/ep=1, n/st=10, player_1/loss=80.073, player_2/loss=95.794, rew=0.00]                            


Epoch #377: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #378: 1001it [00:02, 411.73it/s, env_step=378000, len=8, n/ep=1, n/st=10, player_1/loss=26.276, player_2/loss=27.626, rew=0.00]                          


Epoch #378: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #379: 1001it [00:02, 458.27it/s, env_step=379000, len=10, n/ep=0, n/st=10, player_1/loss=43.927, player_2/loss=36.957, rew=0.00]                          


Epoch #379: test_reward: -61.900000 ± 185.700000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #380: 1001it [00:02, 494.24it/s, env_step=380000, len=7, n/ep=2, n/st=10, player_1/loss=78.467, player_2/loss=38.068, rew=0.00]                             


Epoch #380: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #381: 1001it [00:02, 436.50it/s, env_step=381000, len=7, n/ep=0, n/st=10, player_1/loss=5.982, player_2/loss=31.266, rew=0.00]                           


Epoch #381: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #382: 1001it [00:02, 445.23it/s, env_step=382000, len=9, n/ep=1, n/st=10, player_1/loss=8.096, player_2/loss=26.856, rew=0.00]                           


Epoch #382: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #383: 1001it [00:02, 458.33it/s, env_step=383000, len=8, n/ep=2, n/st=10, player_1/loss=6.618, player_2/loss=17.518, rew=0.00]                          


Epoch #383: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #384: 1001it [00:02, 433.33it/s, env_step=384000, len=8, n/ep=1, n/st=10, player_1/loss=70.857, player_2/loss=30.958, rew=0.00]                            


Epoch #384: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #385: 1001it [00:02, 445.63it/s, env_step=385000, len=8, n/ep=3, n/st=10, player_1/loss=38.393, player_2/loss=39.359, rew=0.00]                          


Epoch #385: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #386: 1001it [00:02, 435.87it/s, env_step=386000, len=7, n/ep=2, n/st=10, player_1/loss=705.208, player_2/loss=235.203, rew=0.00]                               


Epoch #386: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #387: 1001it [00:02, 468.35it/s, env_step=387000, len=9, n/ep=2, n/st=10, player_1/loss=55.003, player_2/loss=51.401, rew=0.00]                            


Epoch #387: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #388: 1001it [00:02, 480.94it/s, env_step=388000, len=7, n/ep=0, n/st=10, player_1/loss=204.968, player_2/loss=103.587, rew=0.00]                             


Epoch #388: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #389: 1001it [00:02, 486.32it/s, env_step=389000, len=8, n/ep=2, n/st=10, player_1/loss=25.450, player_2/loss=42.866, rew=0.00]                            


Epoch #389: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #390: 1001it [00:02, 479.10it/s, env_step=390000, len=8, n/ep=0, n/st=10, player_1/loss=58.791, player_2/loss=81.955, rew=0.00]                             


Epoch #390: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #391: 1001it [00:02, 459.74it/s, env_step=391000, len=7, n/ep=1, n/st=10, player_1/loss=28.791, player_2/loss=48.847, rew=0.00]                          


Epoch #391: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #392: 1001it [00:02, 452.68it/s, env_step=392000, len=7, n/ep=1, n/st=10, player_1/loss=639.751, player_2/loss=36.662, rew=0.00]                            


Epoch #392: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #393: 1001it [00:02, 467.69it/s, env_step=393000, len=33, n/ep=0, n/st=10, player_1/loss=578.142, player_2/loss=31.267, rew=-916.00]                            


Epoch #393: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #394: 1001it [00:02, 479.33it/s, env_step=394000, len=8, n/ep=0, n/st=10, player_1/loss=348.664, player_2/loss=43.307, rew=0.00]                             


Epoch #394: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #395: 1001it [00:02, 469.44it/s, env_step=395000, len=7, n/ep=1, n/st=10, player_1/loss=548.053, player_2/loss=96.320, rew=0.00]                               


Epoch #395: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #396: 1001it [00:02, 461.65it/s, env_step=396000, len=7, n/ep=3, n/st=10, player_1/loss=239.463, player_2/loss=337.665, rew=0.00]                              


Epoch #396: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #397: 1001it [00:02, 466.60it/s, env_step=397000, len=7, n/ep=0, n/st=10, player_1/loss=531.351, player_2/loss=116.168, rew=0.00]                              


Epoch #397: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #398: 1001it [00:02, 474.79it/s, env_step=398000, len=8, n/ep=1, n/st=10, player_1/loss=25.733, player_2/loss=25.670, rew=0.00]                           


Epoch #398: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #399: 1001it [00:02, 461.50it/s, env_step=399000, len=10, n/ep=1, n/st=10, player_1/loss=132.991, player_2/loss=82.289, rew=0.00]                            


Epoch #399: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #400: 1001it [00:02, 466.38it/s, env_step=400000, len=7, n/ep=1, n/st=10, player_1/loss=41.654, player_2/loss=76.571, rew=0.00]                            


Epoch #400: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #401: 1001it [00:02, 480.02it/s, env_step=401000, len=8, n/ep=0, n/st=10, player_1/loss=38.672, player_2/loss=48.580, rew=0.00]                          


Epoch #401: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #402: 1001it [00:02, 464.65it/s, env_step=402000, len=7, n/ep=2, n/st=10, player_1/loss=95.002, player_2/loss=315.581, rew=0.00]                             


Epoch #402: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #403: 1001it [00:02, 445.03it/s, env_step=403000, len=11, n/ep=3, n/st=10, player_1/loss=49.847, player_2/loss=45.535, rew=0.00]                          


Epoch #403: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #404: 1001it [00:02, 494.73it/s, env_step=404000, len=7, n/ep=1, n/st=10, player_1/loss=28.973, player_2/loss=25.634, rew=0.00]                          


Epoch #404: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #405: 1001it [00:02, 472.77it/s, env_step=405000, len=7, n/ep=3, n/st=10, player_1/loss=50.150, player_2/loss=27.202, rew=0.00]                          


Epoch #405: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #406: 1001it [00:02, 489.41it/s, env_step=406000, len=8, n/ep=0, n/st=10, player_1/loss=14.975, player_2/loss=22.124, rew=0.00]                          


Epoch #406: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #407: 1001it [00:02, 483.26it/s, env_step=407000, len=7, n/ep=2, n/st=10, player_1/loss=21.605, player_2/loss=18.232, rew=0.00]                          


Epoch #407: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #408: 1001it [00:02, 476.14it/s, env_step=408000, len=7, n/ep=0, n/st=10, player_1/loss=50.529, player_2/loss=30.701, rew=0.00]                          


Epoch #408: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #409: 1001it [00:02, 469.22it/s, env_step=409000, len=7, n/ep=2, n/st=10, player_1/loss=17.157, player_2/loss=44.303, rew=0.00]                          


Epoch #409: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #410: 1001it [00:02, 480.71it/s, env_step=410000, len=8, n/ep=3, n/st=10, player_1/loss=14.610, player_2/loss=14.686, rew=0.00]                          


Epoch #410: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #411: 1001it [00:02, 487.26it/s, env_step=411000, len=7, n/ep=1, n/st=10, player_1/loss=11.338, player_2/loss=14.906, rew=0.00]                          


Epoch #411: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #412: 1001it [00:02, 479.56it/s, env_step=412000, len=7, n/ep=2, n/st=10, player_1/loss=9.215, player_2/loss=15.634, rew=0.00]                           


Epoch #412: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #413: 1001it [00:02, 478.87it/s, env_step=413000, len=8, n/ep=2, n/st=10, player_1/loss=9.785, player_2/loss=14.542, rew=0.00]                          


Epoch #413: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #414: 1001it [00:02, 484.20it/s, env_step=414000, len=7, n/ep=1, n/st=10, player_1/loss=11.585, player_2/loss=19.313, rew=0.00]                          


Epoch #414: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #415: 1001it [00:02, 440.53it/s, env_step=415000, len=7, n/ep=0, n/st=10, player_1/loss=9.778, player_2/loss=15.036, rew=0.00]                           


Epoch #415: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #416: 1001it [00:02, 486.55it/s, env_step=416000, len=7, n/ep=0, n/st=10, player_1/loss=17.905, player_2/loss=17.245, rew=0.00]                          


Epoch #416: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #417: 1001it [00:02, 470.99it/s, env_step=417000, len=7, n/ep=2, n/st=10, player_1/loss=8.536, player_2/loss=18.747, rew=0.00]                           


Epoch #417: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #418: 1001it [00:02, 491.33it/s, env_step=418000, len=15, n/ep=1, n/st=10, player_1/loss=20.105, player_2/loss=16.061, rew=0.00]                          


Epoch #418: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #419: 1001it [00:02, 483.03it/s, env_step=419000, len=10, n/ep=2, n/st=10, player_1/loss=12.676, player_2/loss=17.434, rew=0.00]                          


Epoch #419: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #420: 1001it [00:02, 472.55it/s, env_step=420000, len=7, n/ep=1, n/st=10, player_1/loss=9.907, player_2/loss=16.001, rew=0.00]                           


Epoch #420: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #421: 1001it [00:02, 473.66it/s, env_step=421000, len=11, n/ep=1, n/st=10, player_1/loss=13.415, player_2/loss=22.878, rew=0.00]                          


Epoch #421: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #422: 1001it [00:02, 472.77it/s, env_step=422000, len=9, n/ep=0, n/st=10, player_1/loss=15.749, player_2/loss=17.276, rew=0.00]                          


Epoch #422: test_reward: -2230.000000 ± 6690.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #423: 1001it [00:02, 437.07it/s, env_step=423000, len=7, n/ep=2, n/st=10, player_1/loss=8.831, player_2/loss=15.923, rew=0.00]                          


Epoch #423: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #424: 1001it [00:02, 453.09it/s, env_step=424000, len=9, n/ep=1, n/st=10, player_1/loss=5.901, player_2/loss=17.216, rew=0.00]                          


Epoch #424: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #425: 1001it [00:02, 465.73it/s, env_step=425000, len=9, n/ep=2, n/st=10, player_1/loss=7.554, player_2/loss=16.882, rew=0.00]                          


Epoch #425: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #426: 1001it [00:02, 467.25it/s, env_step=426000, len=7, n/ep=1, n/st=10, player_1/loss=7.618, player_2/loss=17.189, rew=0.00]                          


Epoch #426: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #427: 1001it [00:02, 451.66it/s, env_step=427000, len=7, n/ep=2, n/st=10, player_1/loss=9.023, player_2/loss=16.327, rew=0.00]                          


Epoch #427: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #428: 1001it [00:02, 481.41it/s, env_step=428000, len=7, n/ep=0, n/st=10, player_1/loss=10.473, player_2/loss=15.843, rew=0.00]                          


Epoch #428: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #429: 1001it [00:02, 457.23it/s, env_step=429000, len=7, n/ep=1, n/st=10, player_1/loss=9.107, player_2/loss=15.047, rew=0.00]                           


Epoch #429: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #430: 1001it [00:02, 479.79it/s, env_step=430000, len=7, n/ep=3, n/st=10, player_1/loss=9.468, player_2/loss=16.443, rew=0.00]                           


Epoch #430: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #431: 1001it [00:02, 480.02it/s, env_step=431000, len=7, n/ep=1, n/st=10, player_1/loss=10.200, player_2/loss=17.939, rew=0.00]                          


Epoch #431: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #432: 1001it [00:02, 471.65it/s, env_step=432000, len=8, n/ep=0, n/st=10, player_1/loss=7.546, player_2/loss=17.843, rew=0.00]                           


Epoch #432: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #433: 1001it [00:02, 464.22it/s, env_step=433000, len=7, n/ep=1, n/st=10, player_1/loss=8.027, player_2/loss=20.113, rew=0.00]                          


Epoch #433: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #434: 1001it [00:02, 483.73it/s, env_step=434000, len=8, n/ep=4, n/st=10, player_1/loss=6.962, player_2/loss=19.344, rew=0.00]                          


Epoch #434: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #435: 1001it [00:02, 473.89it/s, env_step=435000, len=8, n/ep=1, n/st=10, player_1/loss=9.624, player_2/loss=19.816, rew=0.00]                           


Epoch #435: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #436: 1001it [00:02, 478.87it/s, env_step=436000, len=7, n/ep=2, n/st=10, player_1/loss=9.279, player_2/loss=17.802, rew=0.00]                          


Epoch #436: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #437: 1001it [00:02, 481.64it/s, env_step=437000, len=8, n/ep=3, n/st=10, player_1/loss=9.192, player_2/loss=17.403, rew=0.00]                          


Epoch #437: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #438: 1001it [00:02, 487.97it/s, env_step=438000, len=7, n/ep=1, n/st=10, player_1/loss=8.994, player_2/loss=16.883, rew=0.00]                          


Epoch #438: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #439: 1001it [00:02, 462.93it/s, env_step=439000, len=7, n/ep=2, n/st=10, player_1/loss=32.015, player_2/loss=50.909, rew=0.00]                          


Epoch #439: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #440: 1001it [00:02, 479.10it/s, env_step=440000, len=7, n/ep=1, n/st=10, player_1/loss=41.773, player_2/loss=89.711, rew=0.00]                               


Epoch #440: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #441: 1001it [00:02, 483.03it/s, env_step=441000, len=7, n/ep=1, n/st=10, player_1/loss=7.539, player_2/loss=19.089, rew=0.00]                          


Epoch #441: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #442: 1001it [00:02, 484.90it/s, env_step=442000, len=7, n/ep=3, n/st=10, player_1/loss=9.609, player_2/loss=16.688, rew=0.00]                          


Epoch #442: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #443: 1001it [00:02, 481.87it/s, env_step=443000, len=8, n/ep=4, n/st=10, player_1/loss=7.032, player_2/loss=19.262, rew=0.00]                          


Epoch #443: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #444: 1001it [00:02, 480.94it/s, env_step=444000, len=8, n/ep=2, n/st=10, player_1/loss=5.901, player_2/loss=21.667, rew=0.00]                          


Epoch #444: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #445: 1001it [00:02, 481.87it/s, env_step=445000, len=7, n/ep=4, n/st=10, player_1/loss=6.847, player_2/loss=18.490, rew=0.00]                          


Epoch #445: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #446: 1001it [00:02, 481.17it/s, env_step=446000, len=7, n/ep=0, n/st=10, player_1/loss=10.196, player_2/loss=13.442, rew=0.00]                          


Epoch #446: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #447: 1001it [00:02, 463.97it/s, env_step=447000, len=8, n/ep=3, n/st=10, player_1/loss=137.467, player_2/loss=30.234, rew=0.00]                          


Epoch #447: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #448: 1001it [00:02, 496.44it/s, env_step=448000, len=8, n/ep=1, n/st=10, player_1/loss=132.868, player_2/loss=90.847, rew=0.00]                               


Epoch #448: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #449: 1001it [00:02, 472.33it/s, env_step=449000, len=7, n/ep=2, n/st=10, player_1/loss=67.133, player_2/loss=18.546, rew=0.00]                               


Epoch #449: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #450: 1001it [00:02, 433.60it/s, env_step=450000, len=7, n/ep=2, n/st=10, player_1/loss=10.189, player_2/loss=25.542, rew=0.00]                          


Epoch #450: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #451: 1001it [00:02, 469.44it/s, env_step=451000, len=8, n/ep=2, n/st=10, player_1/loss=11.555, player_2/loss=18.714, rew=0.00]                          


Epoch #451: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #452: 1001it [00:02, 489.41it/s, env_step=452000, len=7, n/ep=2, n/st=10, player_1/loss=7.545, player_2/loss=18.679, rew=0.00]                           


Epoch #452: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #453: 1001it [00:02, 472.32it/s, env_step=453000, len=7, n/ep=1, n/st=10, player_1/loss=9.652, player_2/loss=15.672, rew=0.00]                            


Epoch #453: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #454: 1001it [00:02, 481.17it/s, env_step=454000, len=7, n/ep=1, n/st=10, player_1/loss=8.717, player_2/loss=25.081, rew=0.00]                          


Epoch #454: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #455: 1001it [00:02, 475.69it/s, env_step=455000, len=7, n/ep=0, n/st=10, player_1/loss=7.464, player_2/loss=18.059, rew=0.00]                          


Epoch #455: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #456: 1001it [00:02, 478.87it/s, env_step=456000, len=7, n/ep=0, n/st=10, player_1/loss=6.572, player_2/loss=16.314, rew=0.00]                          


Epoch #456: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #457: 1001it [00:02, 471.88it/s, env_step=457000, len=8, n/ep=2, n/st=10, player_1/loss=20.189, player_2/loss=35.177, rew=0.00]                          


Epoch #457: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #458: 1001it [00:02, 414.37it/s, env_step=458000, len=7, n/ep=2, n/st=10, player_1/loss=7.948, player_2/loss=74.176, rew=0.00]                              


Epoch #458: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #459: 1001it [00:02, 479.33it/s, env_step=459000, len=7, n/ep=0, n/st=10, player_1/loss=10.506, player_2/loss=14.576, rew=0.00]                          


Epoch #459: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #460: 1001it [00:02, 476.59it/s, env_step=460000, len=10, n/ep=0, n/st=10, player_1/loss=31.400, player_2/loss=12.768, rew=0.00]                          


Epoch #460: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #461: 1001it [00:02, 481.87it/s, env_step=461000, len=7, n/ep=0, n/st=10, player_1/loss=31.332, player_2/loss=14.538, rew=0.00]                            


Epoch #461: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #462: 1001it [00:02, 469.44it/s, env_step=462000, len=18, n/ep=1, n/st=10, player_1/loss=37.367, player_2/loss=14.241, rew=0.00]                          


Epoch #462: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #463: 1001it [00:02, 448.62it/s, env_step=463000, len=8, n/ep=1, n/st=10, player_1/loss=36.581, player_2/loss=23.164, rew=0.00]                              


Epoch #463: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #464: 1001it [00:02, 467.04it/s, env_step=464000, len=11, n/ep=1, n/st=10, player_1/loss=49.164, player_2/loss=33.693, rew=0.00]                            


Epoch #464: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #465: 1001it [00:02, 464.01it/s, env_step=465000, len=25, n/ep=1, n/st=10, player_1/loss=157.064, player_2/loss=20.808, rew=-17.00]                          


Epoch #465: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #466: 1001it [00:02, 478.64it/s, env_step=466000, len=8, n/ep=1, n/st=10, player_1/loss=98.314, player_2/loss=29.505, rew=0.00]                              


Epoch #466: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #467: 1001it [00:02, 452.27it/s, env_step=467000, len=8, n/ep=1, n/st=10, player_1/loss=8.459, player_2/loss=12.480, rew=0.00]                           


Epoch #467: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #468: 1001it [00:02, 465.73it/s, env_step=468000, len=7, n/ep=2, n/st=10, player_1/loss=10.716, player_2/loss=12.906, rew=0.00]                          


Epoch #468: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #469: 1001it [00:02, 481.17it/s, env_step=469000, len=10, n/ep=2, n/st=10, player_1/loss=12.420, player_2/loss=12.202, rew=0.00]                          


Epoch #469: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #470: 1001it [00:02, 474.34it/s, env_step=470000, len=7, n/ep=3, n/st=10, player_1/loss=198.819, player_2/loss=36.914, rew=0.00]                             


Epoch #470: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #471: 1001it [00:02, 468.35it/s, env_step=471000, len=18, n/ep=3, n/st=10, player_1/loss=54.127, player_2/loss=15.895, rew=-40.33]                          


Epoch #471: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #472: 1001it [00:02, 460.43it/s, env_step=472000, len=7, n/ep=1, n/st=10, player_1/loss=21.300, player_2/loss=56.222, rew=0.00]                              


Epoch #472: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #473: 1001it [00:02, 465.95it/s, env_step=473000, len=7, n/ep=0, n/st=10, player_1/loss=13.237, player_2/loss=13.697, rew=0.00]                          


Epoch #473: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #474: 1001it [00:02, 484.67it/s, env_step=474000, len=7, n/ep=1, n/st=10, player_1/loss=9.532, player_2/loss=15.736, rew=0.00]                          


Epoch #474: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #475: 1001it [00:02, 473.44it/s, env_step=475000, len=8, n/ep=1, n/st=10, player_1/loss=22.044, player_2/loss=15.010, rew=0.00]                             


Epoch #475: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #476: 1001it [00:02, 471.43it/s, env_step=476000, len=18, n/ep=1, n/st=10, player_1/loss=17.029, player_2/loss=15.365, rew=0.00]                          


Epoch #476: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #477: 1001it [00:02, 481.41it/s, env_step=477000, len=7, n/ep=0, n/st=10, player_1/loss=10.580, player_2/loss=14.277, rew=0.00]                          


Epoch #477: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #478: 1001it [00:02, 484.90it/s, env_step=478000, len=8, n/ep=1, n/st=10, player_1/loss=12.021, player_2/loss=16.232, rew=0.00]                          


Epoch #478: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #479: 1001it [00:02, 489.17it/s, env_step=479000, len=7, n/ep=2, n/st=10, player_1/loss=11.351, player_2/loss=16.883, rew=0.00]                          


Epoch #479: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #480: 1001it [00:02, 466.60it/s, env_step=480000, len=10, n/ep=1, n/st=10, player_1/loss=7.844, player_2/loss=16.034, rew=0.00]                          


Epoch #480: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #481: 1001it [00:02, 462.76it/s, env_step=481000, len=9, n/ep=1, n/st=10, player_1/loss=43.161, player_2/loss=20.294, rew=0.00]                            


Epoch #481: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #482: 1001it [00:02, 487.26it/s, env_step=482000, len=12, n/ep=2, n/st=10, player_1/loss=16.214, player_2/loss=13.226, rew=0.00]                          


Epoch #482: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #483: 1001it [00:02, 472.10it/s, env_step=483000, len=10, n/ep=0, n/st=10, player_1/loss=130.616, player_2/loss=19.530, rew=0.00]                           


Epoch #483: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #484: 1001it [00:02, 477.50it/s, env_step=484000, len=10, n/ep=1, n/st=10, player_1/loss=18.625, player_2/loss=17.569, rew=0.00]                            


Epoch #484: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #485: 1001it [00:02, 483.03it/s, env_step=485000, len=15, n/ep=1, n/st=10, player_1/loss=15.650, player_2/loss=13.927, rew=0.00]                          


Epoch #485: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #486: 1001it [00:02, 446.09it/s, env_step=486000, len=13, n/ep=0, n/st=10, player_1/loss=14.205, player_2/loss=9.169, rew=0.00]                          


Epoch #486: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #487: 1001it [00:02, 439.18it/s, env_step=487000, len=12, n/ep=1, n/st=10, player_1/loss=11.029, player_2/loss=15.508, rew=0.00]                          


Epoch #487: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #488: 1001it [00:02, 458.90it/s, env_step=488000, len=10, n/ep=2, n/st=10, player_1/loss=14.002, player_2/loss=7.533, rew=0.00]                           


Epoch #488: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #489: 1001it [00:02, 460.77it/s, env_step=489000, len=10, n/ep=1, n/st=10, player_1/loss=29.451, player_2/loss=12.109, rew=0.00]                            


Epoch #489: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #490: 1001it [00:02, 477.73it/s, env_step=490000, len=8, n/ep=2, n/st=10, player_1/loss=111.446, player_2/loss=13.337, rew=0.00]                              


Epoch #490: test_reward: -1621.000000 ± 4863.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #491: 1001it [00:02, 477.73it/s, env_step=491000, len=14, n/ep=0, n/st=10, player_1/loss=19.891, player_2/loss=18.486, rew=-32.33]                          


Epoch #491: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #492: 1001it [00:02, 443.78it/s, env_step=492000, len=8, n/ep=0, n/st=10, player_1/loss=147.538, player_2/loss=38.621, rew=0.00]                              


Epoch #492: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #493: 1001it [00:02, 484.20it/s, env_step=493000, len=7, n/ep=1, n/st=10, player_1/loss=221.828, player_2/loss=21.986, rew=0.00]                              


Epoch #493: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #494: 1001it [00:02, 459.74it/s, env_step=494000, len=7, n/ep=0, n/st=10, player_1/loss=22.597, player_2/loss=18.242, rew=0.00]                           


Epoch #494: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #495: 1001it [00:02, 478.87it/s, env_step=495000, len=7, n/ep=0, n/st=10, player_1/loss=85.657, player_2/loss=36.561, rew=0.00]                             


Epoch #495: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #496: 1001it [00:02, 482.57it/s, env_step=496000, len=10, n/ep=1, n/st=10, player_1/loss=89.100, player_2/loss=105.960, rew=0.00]                              


Epoch #496: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #497: 1001it [00:02, 469.66it/s, env_step=497000, len=7, n/ep=0, n/st=10, player_1/loss=27.231, player_2/loss=12.048, rew=0.00]                          


Epoch #497: test_reward: -2578.100000 ± 7734.300000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #498: 1001it [00:02, 477.73it/s, env_step=498000, len=8, n/ep=1, n/st=10, player_1/loss=226.103, player_2/loss=22.247, rew=0.00]                             


Epoch #498: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #499: 1001it [00:02, 486.79it/s, env_step=499000, len=10, n/ep=2, n/st=10, player_1/loss=346.372, player_2/loss=7.495, rew=0.00]                              

Epoch #499: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0





In [11]:
####################################################
# EXPERIMENT: VIEWING THE LEARNED POLICY
####################################################

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure the best agent
best_agent1 = cf_dqn_policy(state_shape= state_shape,
                            action_shape= action_shape)
best_agent1.load_state_dict(torch.load("./saved_variables/paper_notebooks/3/dqn_vs_dqn/best_policy_agent1.pth"))


best_agent2 = cf_dqn_policy(state_shape= state_shape,
                            action_shape= action_shape)
best_agent2.load_state_dict(torch.load("./saved_variables/paper_notebooks/3/dqn_vs_dqn/best_policy_agent2.pth"))

# Watch the best agetn at work
watch(numer_of_games= 3,
      agent_player1= best_agent1,
      agent_player2= best_agent2)



Average steps of game:  7.333333333333333
Final mean reward agent 1: 3.3333333333333335, std: 9.428090415820634
Final mean reward agent 2: -3.3333333333333335, std: 9.428090415820634


<hr><hr>

## Discussion

The strategy of the DQN remains unchanged. Agent one wins simply by stacking coins, which agent 2 also does.
Perhaps running for more epochs could give a boost in performance but this is not feasible given our limited amount of computational power.
The next notebook will try to optimize the results further using different techniques.

In [26]:
####################################################
# CLEAN VARIABLES
####################################################

del action_shape
del best_agent1
del best_agent2
del env
del final_agent_player1
del final_agent_player2
del observation_space
del off_policy_traininer_results
del state_shape
