# MLP based DQN agent against fixed oponent

In the previous notebook, `7-cnn-dqn-fixed-oponent.ipynb`, we used the CNN based model for training through an iteration of alternating frozen agents.
We found this to give interesting but not fully statisfactory results.
We will now use the same technique for the custom MLP based approach designed in `5-improving-dqn-architecture.ipynb` to properly compare both architectures performance for the agents.

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training two DQN agents on connect four Gym
  - Building the environment
  - Implementing the DQN policy
  - Building agents
  - Function for letting agents learn
  - Function for watching learned agent
  - Doing the experiment
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Argparser
import argparse

# More data types
import typing
import numpy as np

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
importlib.invalidate_caches()
importlib.reload(cfgym)

# Time for allowing "freezes" in execution
import time;

# Allow for copying objects in a non reference manner
import copy

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0
Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [4]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device
print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

# Show cuda device name
print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training two DQN agents on connect four Gym

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, agents will always play as the same player, e.g. always as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Building the environment

This code is taken from previous notebooks.
We don't allow invalid moves to make the problem easier for now.

In [5]:
####################################################
# CONNECT FOUR V2 ENVIRONMENT
####################################################

def get_env():
    """
    Returns the connect four gym environment V2 altered for Tianshou and Petting Zoo compatibility.
    Already wrapped with a ts.env.PettingZooEnv wrapper.
    """
    return ts.env.PettingZooEnv(cfgym.env(reward_move= 0, # Set to 1 for reward to make moves (incentivise longer games)
                                          reward_invalid= -3,
                                          reward_draw= 100,
                                          reward_win= 25,
                                          reward_loss= -25,
                                          allow_invalid_move= False))
    
    
# Test the environment
env = get_env()
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation
del env

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


<hr>

### Implementing the DQN policy

We use the strategy created in `5-improving-dqn-architecture.ipynb`.

In [6]:
####################################################
# DQN ARCHITECTURE
####################################################

class CustomDQN(torch.nn.Module):
    """
    Custom DQN using a model based on CNN
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',):
        # Parent call
        super().__init__()
        
        # Save device (e.g. cuda)
        self.device = device
        
        self.model = torch.nn.Sequential(
            torch.nn.Linear(np.prod(state_shape), 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, np.prod(action_shape)),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
        batch = obs.shape[0]
        logits = self.model(obs.view(batch, -1))
        return logits, state


In [7]:
####################################################
# DQN POLICY
####################################################

def cf_custom_dqn_policy(state_shape: tuple,
                         action_shape: tuple,
                         optim: typing.Optional[torch.optim.Optimizer] = None,
                         learning_rate: float =  0.0001,
                         gamma: float = 0.9, # Smaller gamma favours "faster" win
                         n_step: int = 4, # Number of steps to look ahead
                         frozen: bool = False,
                         target_update_freq: int = 320):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    net = CustomDQN(state_shape, action_shape, device= device).to(device)
    
    # Default optimizer is an adam optimizer with the argparser learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # If we are frozen, we use an optimizer that has learning rate 0
    if frozen:
        optim = torch.optim.SGD(net.parameters(), lr= 0)
        
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)

<hr>

### Building agents

This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [8]:
####################################################
# AGENT CREATION
####################################################

def get_agents(agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
               agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
               optim: typing.Optional[torch.optim.Optimizer] = None,
               resume_path_player_1: str = '', # Path to file to resume agent training from
               resume_path_player_2: str = '', 
               agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
               agent_player2_frozen: bool = False,
               ) -> typing.Tuple[ts.policy.BasePolicy, torch.optim.Optimizer, list]:
    """
    Gets a multi agent policy manager, optimizer and player ids for the connect four V2 gym environment.
    Per default this returns 
        - Multi agent manager for 2 agents using DQN
        - Adam optimizer
        - ['player_1', 'player_2'] from the connect four environment
    """
    
    # Get the environment to play in (Connect four gym V2)
    env = get_env()
    
    # Get the observation space from the environment, depending on typo of space (ternary operator)
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    
    # Set the arguments
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent player 1 to be a DQN if no policy is passed.
    if agent_player1 is None:
        # Our agent1 uses a DQN policy
        agent_player1 = cf_custom_dqn_policy(state_shape= state_shape,
                                             action_shape= action_shape,
                                             optim= optim,
                                             frozen= agent_player1_frozen)
                
        # If we resume our agent we need to load the previous config
        if resume_path_player_1:
            agent_player1.load_state_dict(torch.load(resume_path_player_1))
            
    
    # Configure agent player 2 to be a DQN if no policy is passed.
    if agent_player2 is None:
        # Our agent1 uses a DQN policy
        agent_player2 = cf_custom_dqn_policy(state_shape= state_shape,
                                             action_shape= action_shape,
                                             optim= optim,
                                             frozen= agent_player2_frozen)
        
                
        # If we resume our agent we need to load the previous config
        if resume_path_player_2:
            agent_player2.load_state_dict(torch.load(resume_path_player_2))

    # Both our agents are DQN agents by default
    agents = [agent_player1, agent_player2]
        
    # Our policy depends on the order of the agents
    policy = ts.policy.MultiAgentPolicyManager(agents, env)
    
    # Return our policy, optimizer and the available agents in the environment
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    
    return policy, optim, env.agents

<hr>

### Function for letting agents learn

This is identical to the previous notebook.

In [9]:
####################################################
# AGENT TRAINING
####################################################

def train_agent(filename: str = "dqn_vs_dqn_cnn_based",
                agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
                agent_player2_frozen: bool = False,
                single_agent_score_as_reward: bool= False, # Uses non frozen agent's score as reward
                optim: typing.Optional[torch.optim.Optimizer] = None,
                training_env_num: int = 1,
                testing_env_num: int = 1,
                buffer_size: int = 2^14,
                batch_size: int = 1, 
                epochs: int = 50, #50
                step_per_epoch: int = 1024, #1024
                step_per_collect: int = 64, # transition before update
                update_per_step: float = 0.1,
                testing_eps: float = 0.05,
                training_eps: float = 0.1,
                ) -> typing.Tuple[dict, ts.policy.BasePolicy]:
    """
    Trains two agents in the connect four V2 environment and saves their best model and logs.
    Returns:
        - result from offpolicy_trainer
        - final version of agent 1
        - final version of agent 2
    """

    # ======== notebook specific =========
    notebook_version = '8' # Used for foldering logs and models

    # ======== environment setup =========
    train_envs = ts.env.DummyVectorEnv([get_env for _ in range(training_env_num)])
    test_envs = ts.env.DummyVectorEnv([get_env for _ in range(testing_env_num)])
    
    # set the seed for reproducibility
    np.random.seed(1998)
    torch.manual_seed(1998)
    train_envs.seed(1998)
    test_envs.seed(1998)

    # ======== agent setup =========
    # Gets our agents from the previously made function
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    policy, optim, agents = get_agents(agent_player1=agent_player1,
                                       agent_player2=agent_player2,
                                       agent_player1_frozen= agent_player1_frozen,
                                       agent_player2_frozen= agent_player2_frozen,
                                       optim=optim)

    # ======== collector setup =========
    # Make a collector for the training environments
    train_collector = ts.data.Collector(policy= policy,
                                        env= train_envs,
                                        buffer= ts.data.VectorReplayBuffer(buffer_size, len(train_envs)),
                                        exploration_noise= True)
    
    # Make a collector for the testing environments
    test_collector = ts.data.Collector(policy= policy,
                                       env= test_envs,
                                       buffer= ts.data.VectorReplayBuffer(buffer_size, len(test_envs)),
                                       exploration_noise= True)
    
    # Uncomment below if you want to set epsilon in epsilon policy
    # policy.set_eps(1)
    
    # Collect data fot the training evnironments
    train_collector.collect(n_step= batch_size * training_env_num)
    
    # ======== ensure folders exist =========
    if not os.path.exists(os.path.join('./logs', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./logs', 'paper_notebooks', notebook_version, filename))
    if not os.path.exists(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename))

    # ======== tensorboard logging setup =========
    # Allows to save the training progress to tensorboard compatable logs
    log_path = os.path.join('./logs', 'paper_notebooks', notebook_version, filename)
    writer = torch.utils.tensorboard.SummaryWriter(log_path)
    logger = ts.utils.TensorboardLogger(writer)

    # ======== callback functions used during training =========
    # We want to save our best policy
    def save_best_fn(policy):
        """
        Callback to save the best model
        """
        # Save best agent 1
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent1.pth')
        torch.save(policy.policies[agents[0]].state_dict(), model_save_path)
        
        # Save best agent 2
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent2.pth')
        torch.save(policy.policies[agents[1]].state_dict(), model_save_path)
        
        # Save agent2

    def stop_fn(mean_rewards):
        """
        Callback to stop training when we've reached the win rate
        """
        return mean_rewards >= 7 # (win = 10, 70% win without invalid moves = mean of 7)

    def train_fn(epoch, env_step):
        """
        Callback before training
        """        
        # Before training we want to configure the epsilon for the agents
        # In general more exploratory than the test case
        policy.policies[agents[0]].set_eps(training_eps)
        policy.policies[agents[1]].set_eps(training_eps)

    def test_fn(epoch, env_step):
        """
        Callback beore testing
        """        
        # Before testing we want to configure the epsilon for the agents
        # In general more greedy than the train case but not
        #   to avoid getting stuck on invalid moves
        policy.policies[agents[0]].set_eps(testing_eps)
        policy.policies[agents[1]].set_eps(testing_eps)

    def reward_metric(rews):
        """
        Callback for reward collection
        """        
        if agent_player2_frozen and single_agent_score_as_reward:
            # agent 2 frozen, optimizing for agent 1
            return rews[:, 0]
        
        if agent_player1_frozen and single_agent_score_as_reward:
            # agent 1 frozen, optimizing for agent 2
            return rews[:, 1]
        
        # Per default we are interested in optimizing both agents
        return rews[:, 0] + rews[:, 1]
    
            

    # trainer
    result = ts.trainer.offpolicy_trainer(policy= policy,
                                          train_collector= train_collector,
                                          test_collector= test_collector,
                                          max_epoch= epochs,
                                          step_per_epoch= step_per_epoch,
                                          step_per_collect= step_per_collect,
                                          episode_per_test= testing_env_num,
                                          batch_size= batch_size,
                                          train_fn= train_fn,
                                          test_fn= test_fn,
                                          # Stop function to stop before specified amount of epochs
                                          #stop_fn= stop_fn
                                          save_best_fn= save_best_fn,
                                          update_per_step= update_per_step,
                                          logger= logger,
                                          test_in_train= False,
                                          reward_metric= reward_metric)
    
    # Save final agent 1
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent1.pth')
    torch.save(policy.policies[agents[0]].state_dict(), model_save_path)

    # Save final agent 2
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent2.pth')
    torch.save(policy.policies[agents[1]].state_dict(), model_save_path)

    return result, policy.policies[agents[0]], policy.policies[agents[1]]

<hr>

### Function for watching learned agent

Identical to the previous notebook.

In [10]:
####################################################
# WATCHING THE LEARNED POLICY IN ACTION
####################################################

def watch(numer_of_games: int = 3,
          agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
          agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
          test_epsilon: float = 0.05, # For the watching we act completely greedy but low random for not getting stuck on invalid move
          render_speed: float = 0.15, # Amount of seconds to update frame/ do a step
          ) -> None:
    
    # Get the connect four V2 environment (must be a list)
    env= ts.env.DummyVectorEnv([get_env])
    
    # Get the agents from the trained agents
    policy, optim, agents = get_agents(agent_player1= agent_player1,
                                       agent_player2= agent_player2)
    
    # Evaluate the policy
    policy.eval()
    
    # Set the testing policy epsilon for our agents
    policy.policies[agents[0]].set_eps(test_epsilon)
    policy.policies[agents[1]].set_eps(test_epsilon)
    
    # Collect the test data
    collector = ts.data.Collector(policy= policy,
                                  env= env,
                                  exploration_noise= True)
    
    # Render games in human mode to see how it plays
    result = collector.collect(n_episode= numer_of_games, render= render_speed)
    
    # Close the environment aftering collecting the results
    # This closes the pygame window after completion
    env.close()
    
    # Get the rewards and length from the test trials
    rewards, length = result["rews"], result["lens"]
    
    # Print the final reward for the first agent
    print(f"Average steps of game:  {length.mean()}")
    print(f"Final mean reward agent 1: {rewards[:, 0].mean()}, std: {rewards[:, 0].std()}")
    print(f"Final mean reward agent 2: {rewards[:, 1].mean()}, std: {rewards[:, 1].std()}")

<hr>

### Doing the experiment

We now do the experiment with using our previously created functions.
We freeze one agent and initialize both agents from previous versions.

The following iterations were made:

1. Freeze agent 1, train agent 2:
    - Model save name: `1-mlp_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `1` with test reward `1102`
    - Scoring: sum of `both` agent's score
2. Freeze agent 2, train agent 1:
    - Model save name: `2-mlp_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/1-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `482` with test reward `1102`
    - Scoring: sum of `both` agent's score

After which the agent was so focused on prolonging the game, we decided to lower the learning rate and start optimizing for winning again. We also lowered the amount of epochs in each iterations of swapping the frozen agent.

3. Freeze agent 1, train agent 2:
    - Model save name: `3-mlp_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/1-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.00005` # halfed learning rate
    - Training epsilon: `0.1` # halfed training epsilon
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` 
    - Best epoch: `7` with test reward `100`
    - Scoring: reward of `agent 2`
4. Freeze agent 2, train agent 1:
    - Model save name: `4-mlp_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/3-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.00005`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8`
    - Best epoch: `XXX` with test reward `YYY`
    - Scoring: reward of `agent 1`
    
To do further training, a loop was created which alternated between freezing agens every 50 epochs. This loop was executed 20 times. The learning rate was also lowered once again.

5. Loop frozen agents:
    - Model save name: `5-50epoch_20loop/looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/4-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/3-mlp_dqn_frozen_agent1/best_policy_agent2.pth`
    - Learning rate: `0.000001`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `50` x `20` loops 
    - Gamma: `0.8` 
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
6. Loop frozen agents:
    - Model save name: `6-20epoch_100loop/looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/5-50epoch_20loop/looping-iteration-18/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/5-50epoch_20loop/looping-iteration-19/best_policy_agent2.pth`
    - Learning rate: `0.000003`
    - Training epsilon: `0.1`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `100` loops 
    - Gamma: `0.9` 
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
7. Loop frozen agents:
    - Model save name: `7-20epoch_500loop/looping-iteration-i` 
    - Agent 1 start: `XXX`
    - Agent 2 start: `XXX`
    - Learning rate: `0.001`
    - Training epsilon: `0.05`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `500` loops 
    - Gamma: `0.9` 
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`

For file size reasons, only a portion of the saved agents are kept and stored on GitHub.


In [14]:
####################################################
# EXPERIMENT: TRAINING AGENTS
####################################################

# Configs for the agents
#freeze_agent1 = False
agent1_starting_params = "./saved_variables/paper_notebooks/8/5-50epoch_20loop/looping-iteration-18/best_policy_agent1.pth"

#freeze_agent2 = True
agent2_starting_params = "./saved_variables/paper_notebooks/8/5-50epoch_20loop/looping-iteration-19/best_policy_agent2.pth"

single_agent_score_as_reward = True # To use combined reward or non frozen agent reward as scoring
filename = "6-20epoch_100loop/looping-iteration-i"
epochs = 20
loops = 100

learning_rate = 0.000003
training_eps = 0.1
gamma = 0.9
n_step = 8

for loop_idx in range(loops):
    # Filename
    filename = f"6-20epoch_100loop/looping-iteration-{loop_idx}"
    
    # Use provided starting params in first loop, the one from previous iteration in next
    if loop_idx > 0:
        agent1_starting_params = f"./saved_variables/paper_notebooks/8/6-20epoch_100loop/looping-iteration-{loop_idx-1}/final_policy_agent1.pth"
        agent2_starting_params = f"./saved_variables/paper_notebooks/8/6-20epoch_100loop/looping-iteration-{loop_idx-1}/final_policy_agent2.pth"
    
    # Determine what agent to freeze
    freeze_agent1 = True if loop_idx % 2 == 1 else False
    freeze_agent2 = True if loop_idx % 2 == 0 else False
    
    # Get the environment settings
    env = get_env()
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent 1
    agent1 = cf_custom_dqn_policy(state_shape= state_shape,
                                  action_shape= action_shape,
                                  gamma= gamma,
                                  frozen= freeze_agent1,
                                  learning_rate = learning_rate,
                                  n_step= n_step)
    
    if agent1_starting_params:
        agent1.load_state_dict(torch.load(agent1_starting_params))
        
        # Configure agent 2
        agent2 = cf_custom_dqn_policy(state_shape= state_shape,
                                      action_shape= action_shape,
                                      gamma= gamma,
                                      frozen= freeze_agent2,
                                      learning_rate = learning_rate,
                                      n_step= n_step)
        
        if agent2_starting_params:
            agent2.load_state_dict(torch.load(agent2_starting_params))
            
            
            # Train the agent
            off_policy_traininer_results, final_agent_player1, final_agent_player2 = train_agent(epochs= epochs,
                                                                                                 agent_player1= agent1,
                                                                                                 agent_player1_frozen = freeze_agent1,
                                                                                                 agent_player2= agent2,
                                                                                                 agent_player2_frozen = freeze_agent2,
                                                                                                 filename= filename,
                                                                                                 single_agent_score_as_reward = single_agent_score_as_reward,
                                                                                                 training_eps= training_eps)
            
            

Epoch #1: 1025it [00:02, 492.06it/s, env_step=1024, len=19, n/ep=3, n/st=64, player_1/loss=172.980, player_2/loss=111.378, rew=-8.33]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 498.67it/s, env_step=2048, len=15, n/ep=4, n/st=64, player_1/loss=127.040, player_2/loss=177.031, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 503.00it/s, env_step=3072, len=19, n/ep=3, n/st=64, player_1/loss=149.251, player_2/loss=232.460, rew=-8.33]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 478.35it/s, env_step=4096, len=17, n/ep=4, n/st=64, player_1/loss=199.072, player_2/loss=215.538, rew=0.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 509.95it/s, env_step=5120, len=18, n/ep=4, n/st=64, player_1/loss=208.388, player_2/loss=239.569, rew=12.50]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 496.37it/s, env_step=6144, len=11, n/ep=6, n/st=64, player_1/loss=116.101, player_2/loss=308.222, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 481.26it/s, env_step=7168, len=20, n/ep=3, n/st=64, player_1/loss=123.514, player_2/loss=345.400, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 430.00it/s, env_step=8192, len=11, n/ep=7, n/st=64, player_1/loss=119.759, player_2/loss=304.477, rew=17.86]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 367.76it/s, env_step=9216, len=8, n/ep=8, n/st=64, player_1/loss=107.984, player_2/loss=239.138, rew=18.75]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 457.22it/s, env_step=10240, len=9, n/ep=6, n/st=64, player_1/loss=109.250, player_2/loss=282.704, rew=16.67]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 471.97it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=44.932, player_2/loss=376.338, rew=18.75]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 365.50it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=93.521, player_2/loss=347.089, rew=18.75]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 351.56it/s, env_step=13312, len=10, n/ep=7, n/st=64, player_1/loss=117.018, player_2/loss=347.670, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 446.25it/s, env_step=14336, len=11, n/ep=6, n/st=64, player_1/loss=110.691, player_2/loss=512.006, rew=16.67]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 478.73it/s, env_step=15360, len=13, n/ep=5, n/st=64, player_1/loss=129.533, player_2/loss=457.146, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 416.03it/s, env_step=16384, len=12, n/ep=5, n/st=64, player_1/loss=199.854, player_2/loss=364.689, rew=-15.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 417.39it/s, env_step=17408, len=10, n/ep=6, n/st=64, player_1/loss=279.624, player_2/loss=470.412, rew=8.33]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 479.70it/s, env_step=18432, len=17, n/ep=4, n/st=64, player_1/loss=185.094, player_2/loss=463.918, rew=12.50]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 484.20it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=137.122, player_2/loss=361.981, rew=17.86]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 483.45it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=195.233, player_2/loss=335.252, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 482.97it/s, env_step=2048, len=8, n/ep=7, n/st=64, player_1/loss=196.315, player_2/loss=430.587, rew=-17.86]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #3: 1025it [00:02, 483.13it/s, env_step=3072, len=9, n/ep=8, n/st=64, player_1/loss=119.770, player_2/loss=496.304, rew=-12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #4: 1025it [00:02, 484.59it/s, env_step=4096, len=12, n/ep=5, n/st=64, player_2/loss=400.398, rew=-15.00]        


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #5: 1025it [00:02, 481.05it/s, env_step=5120, len=14, n/ep=5, n/st=64, player_1/loss=139.403, player_2/loss=367.069, rew=-5.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #6: 1025it [00:02, 485.64it/s, env_step=6144, len=10, n/ep=5, n/st=64, player_1/loss=131.403, player_2/loss=442.113, rew=-15.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #7: 1025it [00:02, 482.99it/s, env_step=7168, len=11, n/ep=6, n/st=64, player_1/loss=117.258, player_2/loss=402.250, rew=-16.67]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #8: 1025it [00:02, 488.17it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=118.640, player_2/loss=280.847, rew=-8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #9: 1025it [00:02, 486.58it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=148.680, rew=-25.00]         


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #10: 1025it [00:02, 483.72it/s, env_step=10240, len=10, n/ep=7, n/st=64, player_1/loss=213.486, rew=-17.86]      


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #11: 1025it [00:02, 478.85it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=233.750, player_2/loss=384.609, rew=-19.44]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #12: 1025it [00:02, 486.28it/s, env_step=12288, len=9, n/ep=6, n/st=64, player_1/loss=102.214, player_2/loss=344.728, rew=-16.67]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #13: 1025it [00:02, 487.72it/s, env_step=13312, len=11, n/ep=6, n/st=64, player_1/loss=148.131, player_2/loss=361.354, rew=0.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #14: 1025it [00:02, 481.47it/s, env_step=14336, len=11, n/ep=6, n/st=64, player_1/loss=215.804, player_2/loss=347.676, rew=0.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #15: 1025it [00:02, 484.83it/s, env_step=15360, len=12, n/ep=4, n/st=64, player_1/loss=133.599, player_2/loss=374.364, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #16: 1025it [00:02, 486.37it/s, env_step=16384, len=11, n/ep=6, n/st=64, player_1/loss=80.933, player_2/loss=393.494, rew=8.33]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #17: 1025it [00:02, 483.05it/s, env_step=17408, len=10, n/ep=6, n/st=64, player_1/loss=121.921, player_2/loss=421.516, rew=-16.67]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #18: 1025it [00:02, 484.06it/s, env_step=18432, len=10, n/ep=6, n/st=64, player_1/loss=199.796, player_2/loss=348.733, rew=-16.67]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #19: 1025it [00:02, 485.78it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=195.093, player_2/loss=289.257, rew=-17.86]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #1: 1025it [00:02, 484.85it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=107.445, player_2/loss=403.419, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 484.37it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=195.934, player_2/loss=366.286, rew=25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 485.18it/s, env_step=3072, len=12, n/ep=6, n/st=64, player_1/loss=160.181, player_2/loss=392.364, rew=8.33]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 486.47it/s, env_step=4096, len=13, n/ep=5, n/st=64, player_1/loss=144.087, player_2/loss=362.954, rew=25.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 486.17it/s, env_step=5120, len=13, n/ep=5, n/st=64, player_1/loss=179.103, player_2/loss=374.715, rew=5.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 485.50it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=128.817, player_2/loss=465.776, rew=6.25]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 484.63it/s, env_step=7168, len=11, n/ep=6, n/st=64, player_1/loss=140.995, player_2/loss=373.569, rew=8.33]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 483.07it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=166.871, player_2/loss=352.390, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 488.54it/s, env_step=9216, len=10, n/ep=6, n/st=64, player_1/loss=114.429, player_2/loss=411.723, rew=8.33]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 486.45it/s, env_step=10240, len=8, n/ep=8, n/st=64, player_1/loss=122.563, player_2/loss=403.597, rew=6.25]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 481.32it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=148.232, rew=12.50]        


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 486.09it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=201.888, player_2/loss=354.214, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 489.17it/s, env_step=13312, len=11, n/ep=7, n/st=64, player_1/loss=232.626, rew=25.00]       


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 486.25it/s, env_step=14336, len=7, n/ep=8, n/st=64, player_1/loss=134.704, player_2/loss=560.185, rew=12.50]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 485.09it/s, env_step=15360, len=17, n/ep=4, n/st=64, player_1/loss=191.930, player_2/loss=450.193, rew=12.50]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 488.32it/s, env_step=16384, len=15, n/ep=5, n/st=64, player_1/loss=202.720, player_2/loss=339.760, rew=5.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 485.02it/s, env_step=17408, len=18, n/ep=4, n/st=64, player_1/loss=172.620, player_2/loss=265.695, rew=12.50]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 487.86it/s, env_step=18432, len=9, n/ep=6, n/st=64, player_1/loss=205.577, player_2/loss=249.595, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 485.57it/s, env_step=19456, len=18, n/ep=3, n/st=64, player_1/loss=211.637, player_2/loss=214.816, rew=-8.33]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 486.50it/s, env_step=1024, len=13, n/ep=5, n/st=64, player_1/loss=238.236, player_2/loss=132.290, rew=-5.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 487.05it/s, env_step=2048, len=15, n/ep=3, n/st=64, player_1/loss=202.589, player_2/loss=126.956, rew=-8.33]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 491.43it/s, env_step=3072, len=15, n/ep=4, n/st=64, player_1/loss=113.938, player_2/loss=250.592, rew=0.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 485.84it/s, env_step=4096, len=19, n/ep=3, n/st=64, player_1/loss=121.348, player_2/loss=308.960, rew=8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 488.66it/s, env_step=5120, len=20, n/ep=3, n/st=64, player_1/loss=200.399, player_2/loss=231.207, rew=-8.33]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 489.70it/s, env_step=6144, len=22, n/ep=2, n/st=64, player_1/loss=140.284, player_2/loss=206.508, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 486.59it/s, env_step=7168, len=18, n/ep=4, n/st=64, player_1/loss=131.249, player_2/loss=236.858, rew=-12.50]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 493.05it/s, env_step=8192, len=19, n/ep=3, n/st=64, player_1/loss=159.137, player_2/loss=183.406, rew=-25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 489.07it/s, env_step=9216, len=18, n/ep=3, n/st=64, player_1/loss=177.378, player_2/loss=204.097, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 487.98it/s, env_step=10240, len=23, n/ep=3, n/st=64, player_1/loss=173.152, player_2/loss=198.334, rew=-8.33]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 487.51it/s, env_step=11264, len=19, n/ep=3, n/st=64, player_1/loss=122.334, player_2/loss=205.736, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 488.87it/s, env_step=12288, len=17, n/ep=4, n/st=64, player_1/loss=152.851, player_2/loss=177.718, rew=-25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 491.43it/s, env_step=13312, len=15, n/ep=4, n/st=64, player_1/loss=132.661, player_2/loss=180.265, rew=-12.50]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 488.86it/s, env_step=14336, len=15, n/ep=5, n/st=64, player_1/loss=124.221, player_2/loss=199.259, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 487.09it/s, env_step=15360, len=20, n/ep=3, n/st=64, player_2/loss=188.209, rew=-8.33]       


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 488.26it/s, env_step=16384, len=16, n/ep=5, n/st=64, player_1/loss=121.406, player_2/loss=222.097, rew=-5.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 486.05it/s, env_step=17408, len=18, n/ep=4, n/st=64, player_1/loss=73.776, player_2/loss=159.967, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 488.25it/s, env_step=18432, len=17, n/ep=4, n/st=64, player_1/loss=61.043, player_2/loss=149.849, rew=0.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 487.05it/s, env_step=19456, len=15, n/ep=5, n/st=64, player_1/loss=83.296, player_2/loss=158.173, rew=-5.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 486.48it/s, env_step=1024, len=18, n/ep=3, n/st=64, player_1/loss=44.687, player_2/loss=105.827, rew=8.33]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:02, 484.69it/s, env_step=2048, len=10, n/ep=6, n/st=64, player_1/loss=91.389, player_2/loss=229.295, rew=16.67]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:02, 486.68it/s, env_step=3072, len=8, n/ep=7, n/st=64, player_1/loss=146.746, player_2/loss=316.688, rew=3.57]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:02, 487.66it/s, env_step=4096, len=8, n/ep=8, n/st=64, player_1/loss=202.035, player_2/loss=355.828, rew=18.75]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:02, 489.14it/s, env_step=5120, len=14, n/ep=5, n/st=64, player_1/loss=241.529, player_2/loss=353.110, rew=15.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #6: 1025it [00:02, 485.66it/s, env_step=6144, len=12, n/ep=5, n/st=64, player_1/loss=236.974, player_2/loss=244.000, rew=15.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #7: 1025it [00:02, 487.02it/s, env_step=7168, len=15, n/ep=4, n/st=64, player_1/loss=147.032, player_2/loss=223.409, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #8: 1025it [00:02, 486.88it/s, env_step=8192, len=11, n/ep=6, n/st=64, player_1/loss=145.368, player_2/loss=229.134, rew=16.67]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #9: 1025it [00:02, 487.48it/s, env_step=9216, len=10, n/ep=6, n/st=64, player_1/loss=235.291, player_2/loss=220.722, rew=8.33]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #10: 1025it [00:02, 485.50it/s, env_step=10240, len=14, n/ep=5, n/st=64, player_1/loss=237.201, player_2/loss=275.699, rew=15.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #11: 1025it [00:02, 486.97it/s, env_step=11264, len=13, n/ep=5, n/st=64, player_1/loss=291.475, player_2/loss=284.810, rew=15.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #12: 1025it [00:02, 485.57it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=261.891, player_2/loss=240.650, rew=17.86]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #13: 1025it [00:02, 488.23it/s, env_step=13312, len=9, n/ep=6, n/st=64, player_1/loss=177.846, player_2/loss=195.122, rew=16.67]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #14: 1025it [00:02, 485.41it/s, env_step=14336, len=15, n/ep=4, n/st=64, player_1/loss=229.318, player_2/loss=258.823, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #15: 1025it [00:02, 486.65it/s, env_step=15360, len=15, n/ep=4, n/st=64, player_1/loss=338.182, player_2/loss=217.247, rew=0.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #16: 1025it [00:02, 487.84it/s, env_step=16384, len=19, n/ep=4, n/st=64, player_1/loss=264.928, player_2/loss=163.922, rew=12.50]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #17: 1025it [00:02, 485.23it/s, env_step=17408, len=17, n/ep=5, n/st=64, player_1/loss=165.585, player_2/loss=173.066, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #18: 1025it [00:02, 490.40it/s, env_step=18432, len=14, n/ep=5, n/st=64, player_1/loss=147.979, player_2/loss=204.716, rew=25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #19: 1025it [00:02, 486.38it/s, env_step=19456, len=7, n/ep=8, n/st=64, player_1/loss=127.987, player_2/loss=265.242, rew=6.25]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #1: 1025it [00:02, 483.70it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=294.850, player_2/loss=292.872, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 486.41it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=286.254, player_2/loss=344.007, rew=-18.75]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 490.42it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=202.840, player_2/loss=405.464, rew=6.25]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 484.20it/s, env_step=4096, len=7, n/ep=9, n/st=64, player_1/loss=150.477, player_2/loss=426.263, rew=-25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 485.82it/s, env_step=5120, len=8, n/ep=6, n/st=64, player_1/loss=215.598, player_2/loss=379.968, rew=-8.33]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 487.72it/s, env_step=6144, len=10, n/ep=7, n/st=64, player_1/loss=242.811, player_2/loss=334.345, rew=-10.71]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #7: 1025it [00:02, 489.08it/s, env_step=7168, len=9, n/ep=7, n/st=64, player_1/loss=202.762, player_2/loss=345.136, rew=-10.71]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #8: 1025it [00:02, 485.67it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=157.656, player_2/loss=321.454, rew=-12.50]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #9: 1025it [00:02, 486.36it/s, env_step=9216, len=8, n/ep=6, n/st=64, player_1/loss=199.545, player_2/loss=330.406, rew=-16.67]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #10: 1025it [00:02, 485.47it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=201.703, player_2/loss=286.839, rew=-13.89]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #11: 1025it [00:02, 487.20it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=112.029, player_2/loss=319.394, rew=-12.50]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #12: 1025it [00:02, 489.38it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=165.270, player_2/loss=352.417, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #13: 1025it [00:02, 488.64it/s, env_step=13312, len=8, n/ep=8, n/st=64, player_1/loss=192.990, player_2/loss=373.093, rew=-12.50]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #14: 1025it [00:02, 489.03it/s, env_step=14336, len=9, n/ep=7, n/st=64, player_1/loss=158.587, player_2/loss=341.125, rew=-10.71]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #15: 1025it [00:02, 487.60it/s, env_step=15360, len=11, n/ep=6, n/st=64, player_1/loss=114.810, player_2/loss=371.684, rew=-16.67]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #16: 1025it [00:02, 484.94it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=91.218, player_2/loss=447.379, rew=12.50]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #17: 1025it [00:02, 488.34it/s, env_step=17408, len=8, n/ep=7, n/st=64, player_1/loss=105.855, player_2/loss=478.440, rew=-17.86]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #18: 1025it [00:02, 485.59it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=128.203, player_2/loss=435.199, rew=-12.50]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #19: 1025it [00:02, 483.80it/s, env_step=19456, len=10, n/ep=6, n/st=64, player_1/loss=235.994, player_2/loss=322.334, rew=0.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #6


Epoch #1: 1025it [00:02, 486.75it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=283.654, player_2/loss=313.043, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 486.39it/s, env_step=2048, len=7, n/ep=7, n/st=64, player_1/loss=193.535, player_2/loss=341.287, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 485.35it/s, env_step=3072, len=8, n/ep=7, n/st=64, player_1/loss=119.013, player_2/loss=337.044, rew=10.71]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 486.63it/s, env_step=4096, len=7, n/ep=8, n/st=64, player_1/loss=150.455, player_2/loss=373.644, rew=25.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 487.18it/s, env_step=5120, len=8, n/ep=6, n/st=64, player_1/loss=157.604, player_2/loss=363.031, rew=16.67]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 485.81it/s, env_step=6144, len=9, n/ep=7, n/st=64, player_1/loss=229.460, player_2/loss=331.397, rew=10.71]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 485.87it/s, env_step=7168, len=8, n/ep=7, n/st=64, player_1/loss=242.272, player_2/loss=307.841, rew=3.57]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 487.27it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=274.188, player_2/loss=267.970, rew=12.50]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 485.30it/s, env_step=9216, len=9, n/ep=7, n/st=64, player_1/loss=344.485, player_2/loss=293.828, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 484.83it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=223.485, player_2/loss=284.809, rew=19.44]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 484.83it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=134.770, player_2/loss=336.444, rew=2.78]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 486.84it/s, env_step=12288, len=17, n/ep=3, n/st=64, player_1/loss=193.497, rew=-8.33]       


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 487.71it/s, env_step=13312, len=13, n/ep=5, n/st=64, player_1/loss=266.510, player_2/loss=197.642, rew=-15.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 485.77it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=289.913, player_2/loss=231.322, rew=13.89]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 486.07it/s, env_step=15360, len=9, n/ep=8, n/st=64, player_1/loss=142.306, player_2/loss=298.808, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 483.97it/s, env_step=16384, len=7, n/ep=8, n/st=64, player_1/loss=152.114, player_2/loss=360.078, rew=6.25]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 484.03it/s, env_step=17408, len=7, n/ep=8, n/st=64, player_1/loss=232.484, player_2/loss=299.475, rew=6.25]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 485.33it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=178.011, player_2/loss=384.053, rew=12.50]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 488.18it/s, env_step=19456, len=8, n/ep=9, n/st=64, player_1/loss=191.504, player_2/loss=421.345, rew=19.44]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 486.53it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=173.687, player_2/loss=484.212, rew=-6.25]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 483.50it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=209.070, player_2/loss=423.800, rew=-18.75]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #3: 1025it [00:02, 487.81it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=200.656, player_2/loss=411.356, rew=-12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #4: 1025it [00:02, 488.63it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=186.037, player_2/loss=346.737, rew=-8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #5: 1025it [00:02, 485.58it/s, env_step=5120, len=13, n/ep=5, n/st=64, player_1/loss=182.449, player_2/loss=342.390, rew=-15.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #6: 1025it [00:02, 488.33it/s, env_step=6144, len=11, n/ep=4, n/st=64, player_1/loss=112.138, player_2/loss=359.289, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #7: 1025it [00:02, 484.80it/s, env_step=7168, len=10, n/ep=6, n/st=64, player_1/loss=85.117, player_2/loss=357.232, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #8: 1025it [00:02, 488.08it/s, env_step=8192, len=11, n/ep=6, n/st=64, player_1/loss=137.580, player_2/loss=395.344, rew=-16.67]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #9: 1025it [00:02, 487.04it/s, env_step=9216, len=10, n/ep=7, n/st=64, player_1/loss=134.172, player_2/loss=353.104, rew=-10.71]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #10: 1025it [00:02, 486.91it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=131.619, player_2/loss=360.169, rew=-19.44]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #11: 1025it [00:02, 489.22it/s, env_step=11264, len=7, n/ep=7, n/st=64, player_1/loss=141.981, player_2/loss=382.499, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #12: 1025it [00:02, 468.18it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=156.088, player_2/loss=438.534, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #13: 1025it [00:02, 488.22it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=149.082, player_2/loss=449.790, rew=-17.86]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #14: 1025it [00:02, 485.60it/s, env_step=14336, len=9, n/ep=7, n/st=64, player_1/loss=121.977, player_2/loss=439.307, rew=-17.86]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #15: 1025it [00:02, 486.63it/s, env_step=15360, len=10, n/ep=7, n/st=64, player_1/loss=141.537, player_2/loss=336.892, rew=-17.86]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #16: 1025it [00:02, 491.66it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=128.276, player_2/loss=315.403, rew=-19.44]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #17: 1025it [00:02, 486.45it/s, env_step=17408, len=9, n/ep=6, n/st=64, player_1/loss=126.818, player_2/loss=308.992, rew=-8.33]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #18: 1025it [00:02, 487.62it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=237.454, player_2/loss=337.614, rew=-12.50]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #19: 1025it [00:02, 485.55it/s, env_step=19456, len=10, n/ep=6, n/st=64, player_1/loss=232.831, player_2/loss=337.028, rew=-8.33]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #1: 1025it [00:02, 483.37it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=229.547, player_2/loss=380.719, rew=18.75]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 487.50it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=260.708, player_2/loss=448.121, rew=25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 485.37it/s, env_step=3072, len=7, n/ep=9, n/st=64, player_1/loss=231.631, player_2/loss=453.635, rew=19.44]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 487.63it/s, env_step=4096, len=10, n/ep=6, n/st=64, player_1/loss=132.980, player_2/loss=403.307, rew=0.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 473.52it/s, env_step=5120, len=9, n/ep=7, n/st=64, player_1/loss=133.444, player_2/loss=458.131, rew=25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 443.15it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=74.945, player_2/loss=453.338, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 413.47it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=67.474, player_2/loss=374.848, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 389.09it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=97.795, player_2/loss=388.168, rew=18.75]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 396.64it/s, env_step=9216, len=8, n/ep=8, n/st=64, player_1/loss=100.080, player_2/loss=416.217, rew=18.75]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 404.42it/s, env_step=10240, len=8, n/ep=8, n/st=64, player_1/loss=64.051, player_2/loss=516.275, rew=18.75]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 480.19it/s, env_step=11264, len=7, n/ep=7, n/st=64, player_1/loss=46.888, player_2/loss=515.062, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 415.72it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=53.326, player_2/loss=526.606, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 476.33it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=51.845, player_2/loss=490.958, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 439.16it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=87.723, rew=25.00]         


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 411.32it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=78.700, player_2/loss=411.659, rew=-3.57]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 298.65it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=69.020, player_2/loss=410.664, rew=18.75]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 343.31it/s, env_step=17408, len=7, n/ep=8, n/st=64, player_1/loss=57.178, player_2/loss=400.225, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 361.51it/s, env_step=18432, len=7, n/ep=9, n/st=64, player_1/loss=54.755, player_2/loss=429.585, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 363.50it/s, env_step=19456, len=7, n/ep=8, n/st=64, player_1/loss=52.971, player_2/loss=474.154, rew=25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 354.90it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=52.660, player_2/loss=474.019, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 363.28it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=49.252, player_2/loss=480.790, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 364.79it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=79.624, player_2/loss=418.383, rew=-18.75]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 355.45it/s, env_step=4096, len=9, n/ep=7, n/st=64, player_1/loss=79.969, player_2/loss=424.219, rew=-25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 359.74it/s, env_step=5120, len=7, n/ep=9, n/st=64, player_1/loss=85.706, player_2/loss=462.826, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 325.61it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=74.668, player_2/loss=471.297, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 335.62it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=82.539, player_2/loss=467.888, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 329.18it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=113.349, player_2/loss=450.711, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 334.78it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=133.692, player_2/loss=403.295, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 356.80it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=150.884, player_2/loss=355.431, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 332.54it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=111.163, player_2/loss=379.843, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 361.72it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=76.639, player_2/loss=428.672, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 355.81it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=73.126, player_2/loss=426.939, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 334.51it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=77.057, player_2/loss=372.758, rew=-10.71]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 363.21it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=55.265, player_2/loss=407.063, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 358.87it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=48.351, player_2/loss=477.820, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 345.21it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=56.747, player_2/loss=390.670, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 350.77it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=57.614, player_2/loss=399.601, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 341.81it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=81.030, player_2/loss=397.107, rew=-15.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 331.64it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=47.573, player_2/loss=339.461, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 334.83it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=72.699, player_2/loss=345.250, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 329.53it/s, env_step=3072, len=7, n/ep=9, n/st=64, player_1/loss=105.655, player_2/loss=343.780, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 272.97it/s, env_step=4096, len=10, n/ep=7, n/st=64, player_1/loss=95.554, player_2/loss=418.663, rew=17.86]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 298.81it/s, env_step=5120, len=10, n/ep=6, n/st=64, player_1/loss=84.459, player_2/loss=465.027, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 331.87it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=56.765, player_2/loss=461.670, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 335.20it/s, env_step=7168, len=13, n/ep=6, n/st=64, player_1/loss=43.277, player_2/loss=500.035, rew=16.67]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 308.14it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=51.501, player_2/loss=486.880, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 324.80it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=56.621, player_2/loss=537.398, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 338.61it/s, env_step=10240, len=7, n/ep=7, n/st=64, player_1/loss=58.722, player_2/loss=508.443, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 325.40it/s, env_step=11264, len=9, n/ep=8, n/st=64, player_1/loss=58.919, player_2/loss=442.937, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 330.97it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=88.977, player_2/loss=377.097, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 335.18it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=78.031, player_2/loss=492.877, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 309.36it/s, env_step=14336, len=11, n/ep=6, n/st=64, player_1/loss=43.299, player_2/loss=480.268, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 315.28it/s, env_step=15360, len=9, n/ep=8, n/st=64, player_1/loss=46.512, player_2/loss=451.667, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 327.83it/s, env_step=16384, len=8, n/ep=7, n/st=64, player_1/loss=47.328, player_2/loss=477.131, rew=17.86]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 316.49it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=74.039, player_2/loss=473.926, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:03, 305.54it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=90.738, player_2/loss=453.196, rew=18.75]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 320.37it/s, env_step=19456, len=8, n/ep=8, n/st=64, player_1/loss=162.384, player_2/loss=354.226, rew=18.75]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 316.75it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=38.761, player_2/loss=444.180, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 324.16it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=63.115, player_2/loss=435.151, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 323.25it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=70.234, player_2/loss=443.918, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 325.24it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=40.061, player_2/loss=440.673, rew=-25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 333.82it/s, env_step=5120, len=9, n/ep=6, n/st=64, player_1/loss=60.958, player_2/loss=466.449, rew=-16.67]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 323.01it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=139.595, player_2/loss=444.052, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 326.17it/s, env_step=7168, len=8, n/ep=9, n/st=64, player_1/loss=141.430, player_2/loss=402.064, rew=-19.44]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 335.24it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=88.537, player_2/loss=366.969, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 338.43it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=65.601, player_2/loss=361.199, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 341.52it/s, env_step=10240, len=8, n/ep=7, n/st=64, player_1/loss=61.805, player_2/loss=356.142, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 328.23it/s, env_step=11264, len=8, n/ep=7, n/st=64, player_1/loss=64.181, player_2/loss=385.017, rew=-17.86]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 334.05it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=51.574, player_2/loss=422.682, rew=-18.75]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 323.17it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=40.050, player_2/loss=433.583, rew=-13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 329.03it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=79.365, player_2/loss=366.407, rew=-19.44]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 320.89it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=93.472, player_2/loss=387.166, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 334.31it/s, env_step=16384, len=9, n/ep=7, n/st=64, player_1/loss=78.681, player_2/loss=411.495, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 268.57it/s, env_step=17408, len=9, n/ep=8, n/st=64, player_1/loss=57.341, player_2/loss=377.553, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:04, 253.00it/s, env_step=18432, len=11, n/ep=6, n/st=64, player_1/loss=104.491, player_2/loss=348.818, rew=-16.67]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:04, 241.20it/s, env_step=19456, len=7, n/ep=9, n/st=64, player_1/loss=128.535, player_2/loss=349.046, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 275.97it/s, env_step=1024, len=10, n/ep=6, n/st=64, player_1/loss=166.050, player_2/loss=369.279, rew=8.33]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 327.13it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=168.131, player_2/loss=434.768, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 300.63it/s, env_step=3072, len=7, n/ep=9, n/st=64, player_1/loss=126.224, player_2/loss=443.764, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 350.24it/s, env_step=4096, len=9, n/ep=7, n/st=64, player_1/loss=41.112, player_2/loss=459.852, rew=10.71]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 341.31it/s, env_step=5120, len=7, n/ep=9, n/st=64, player_1/loss=91.119, player_2/loss=439.625, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 339.51it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=85.200, player_2/loss=393.161, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 341.35it/s, env_step=7168, len=17, n/ep=4, n/st=64, player_1/loss=39.210, player_2/loss=412.583, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 333.30it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=50.269, player_2/loss=391.938, rew=18.75]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 340.49it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=53.786, player_2/loss=413.334, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 340.69it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=54.598, player_2/loss=443.805, rew=18.75]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 335.47it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=71.916, rew=25.00]         


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 338.14it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=150.089, player_2/loss=428.441, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 345.40it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=107.263, player_2/loss=441.843, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 315.52it/s, env_step=14336, len=9, n/ep=7, n/st=64, player_1/loss=63.240, player_2/loss=368.136, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 331.46it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=139.694, player_2/loss=361.744, rew=17.86]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 316.73it/s, env_step=16384, len=7, n/ep=8, n/st=64, player_1/loss=112.894, player_2/loss=360.019, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 311.68it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=37.781, player_2/loss=374.069, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:03, 329.78it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=113.261, player_2/loss=383.249, rew=17.86]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 320.68it/s, env_step=19456, len=8, n/ep=8, n/st=64, player_1/loss=175.796, player_2/loss=407.652, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 306.38it/s, env_step=1024, len=10, n/ep=6, n/st=64, player_1/loss=168.334, player_2/loss=382.396, rew=-8.33]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 345.06it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=225.840, player_2/loss=402.431, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 350.22it/s, env_step=3072, len=7, n/ep=9, n/st=64, player_1/loss=139.632, player_2/loss=390.647, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 304.23it/s, env_step=4096, len=9, n/ep=7, n/st=64, player_1/loss=41.743, player_2/loss=411.210, rew=-10.71]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 333.56it/s, env_step=5120, len=7, n/ep=9, n/st=64, player_1/loss=91.603, player_2/loss=427.270, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 322.29it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=89.342, player_2/loss=399.621, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 330.38it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=35.041, player_2/loss=385.462, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 342.92it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=105.370, player_2/loss=390.635, rew=-18.75]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 347.20it/s, env_step=9216, len=8, n/ep=8, n/st=64, player_1/loss=171.948, player_2/loss=334.599, rew=-18.75]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 345.59it/s, env_step=10240, len=9, n/ep=6, n/st=64, player_1/loss=162.576, player_2/loss=345.421, rew=-16.67]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 346.96it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=181.160, player_2/loss=346.711, rew=-19.44]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 311.20it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=164.109, player_2/loss=333.042, rew=-16.67]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 322.51it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=38.681, player_2/loss=388.445, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 343.85it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=33.464, player_2/loss=360.485, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 351.20it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=35.249, player_2/loss=402.958, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 301.59it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=29.594, player_2/loss=437.965, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 318.34it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=86.921, player_2/loss=373.712, rew=-19.44]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:03, 285.23it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=70.742, rew=-17.86]        


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 322.80it/s, env_step=19456, len=14, n/ep=4, n/st=64, player_1/loss=64.578, player_2/loss=344.775, rew=0.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 343.98it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=39.881, player_2/loss=445.608, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 351.90it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=128.264, player_2/loss=424.304, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 340.88it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=118.731, player_2/loss=395.033, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 344.79it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=24.296, player_2/loss=395.693, rew=25.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 336.50it/s, env_step=5120, len=7, n/ep=9, n/st=64, player_1/loss=27.647, player_2/loss=407.969, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 349.69it/s, env_step=6144, len=12, n/ep=5, n/st=64, player_1/loss=35.916, player_2/loss=392.544, rew=15.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 347.31it/s, env_step=7168, len=9, n/ep=7, n/st=64, player_1/loss=31.025, rew=17.86]           


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 315.79it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=87.285, player_2/loss=412.251, rew=18.75]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 339.79it/s, env_step=9216, len=8, n/ep=8, n/st=64, player_1/loss=122.397, player_2/loss=382.389, rew=18.75]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 342.45it/s, env_step=10240, len=9, n/ep=7, n/st=64, player_1/loss=113.487, player_2/loss=337.609, rew=17.86]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 329.20it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=150.662, player_2/loss=331.614, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 335.02it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=158.043, player_2/loss=366.498, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 358.53it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=46.427, player_2/loss=368.702, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 361.02it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=42.420, player_2/loss=363.100, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 358.10it/s, env_step=15360, len=7, n/ep=8, n/st=64, player_1/loss=28.166, player_2/loss=366.158, rew=18.75]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 364.85it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=36.274, player_2/loss=393.704, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 346.16it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=91.571, player_2/loss=333.451, rew=19.44]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 363.67it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=113.927, player_2/loss=320.881, rew=17.86]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 362.48it/s, env_step=19456, len=14, n/ep=4, n/st=64, player_1/loss=56.929, player_2/loss=347.193, rew=0.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 357.67it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=104.123, player_2/loss=424.354, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 349.58it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=127.834, player_2/loss=437.241, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 350.34it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=91.002, player_2/loss=402.699, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 362.05it/s, env_step=4096, len=11, n/ep=8, n/st=64, player_1/loss=62.546, player_2/loss=363.853, rew=-18.75]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 363.78it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=47.869, player_2/loss=336.725, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 356.81it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=28.788, player_2/loss=364.478, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 360.58it/s, env_step=7168, len=10, n/ep=7, n/st=64, player_1/loss=25.835, player_2/loss=363.665, rew=-17.86]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 373.93it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=36.470, player_2/loss=349.294, rew=-18.75]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 371.03it/s, env_step=9216, len=8, n/ep=9, n/st=64, player_1/loss=70.879, player_2/loss=333.758, rew=-19.44]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 361.79it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=78.357, player_2/loss=355.064, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 355.21it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_2/loss=363.126, rew=-18.75]       


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 371.40it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=91.739, player_2/loss=312.469, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 363.49it/s, env_step=13312, len=8, n/ep=8, n/st=64, player_1/loss=38.133, player_2/loss=322.895, rew=-18.75]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 365.16it/s, env_step=14336, len=8, n/ep=9, n/st=64, player_1/loss=55.966, player_2/loss=335.987, rew=-19.44]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 383.53it/s, env_step=15360, len=17, n/ep=3, n/st=64, player_1/loss=85.760, player_2/loss=286.628, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 382.09it/s, env_step=16384, len=7, n/ep=7, n/st=64, player_1/loss=58.149, player_2/loss=278.333, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 371.90it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=45.603, player_2/loss=278.106, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 345.82it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=74.462, player_2/loss=286.777, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 349.39it/s, env_step=19456, len=7, n/ep=9, n/st=64, player_1/loss=150.507, player_2/loss=311.899, rew=-13.89]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #19


Epoch #1: 1025it [00:02, 352.90it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=23.383, player_2/loss=425.027, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 367.10it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=40.345, player_2/loss=401.186, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 382.77it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=32.914, player_2/loss=365.930, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 379.72it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=29.581, player_2/loss=376.040, rew=16.67]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 380.38it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=25.081, player_2/loss=373.934, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 370.71it/s, env_step=6144, len=12, n/ep=6, n/st=64, player_1/loss=32.057, player_2/loss=346.430, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 383.07it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=31.360, player_2/loss=327.942, rew=18.75]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 375.33it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=77.417, player_2/loss=338.502, rew=18.75]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 346.53it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=96.677, player_2/loss=316.252, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 370.45it/s, env_step=10240, len=9, n/ep=7, n/st=64, player_1/loss=85.889, player_2/loss=301.098, rew=17.86]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 368.54it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=73.108, player_2/loss=310.326, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 369.76it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=53.998, player_2/loss=340.611, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 312.95it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=46.148, player_2/loss=348.976, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 362.02it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=36.513, player_2/loss=341.158, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 363.04it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=27.466, player_2/loss=333.777, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 373.98it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=32.639, player_2/loss=343.110, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 367.40it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=45.808, player_2/loss=290.754, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 353.90it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=43.450, player_2/loss=289.758, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 384.02it/s, env_step=19456, len=18, n/ep=3, n/st=64, player_1/loss=59.507, player_2/loss=317.140, rew=8.33]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 355.64it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=137.326, player_2/loss=292.186, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 349.97it/s, env_step=2048, len=17, n/ep=4, n/st=64, player_1/loss=150.183, player_2/loss=346.744, rew=-25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #3: 1025it [00:02, 365.90it/s, env_step=3072, len=7, n/ep=9, n/st=64, player_1/loss=180.434, player_2/loss=300.744, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #4: 1025it [00:02, 368.03it/s, env_step=4096, len=16, n/ep=4, n/st=64, player_1/loss=112.525, player_2/loss=326.028, rew=-12.50]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #5: 1025it [00:02, 376.93it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=135.777, player_2/loss=399.858, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #6: 1025it [00:02, 360.77it/s, env_step=6144, len=15, n/ep=5, n/st=64, player_1/loss=210.600, player_2/loss=342.024, rew=-15.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #7: 1025it [00:02, 370.23it/s, env_step=7168, len=12, n/ep=5, n/st=64, player_1/loss=142.723, player_2/loss=279.100, rew=-15.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #8: 1025it [00:02, 373.95it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=96.841, player_2/loss=244.367, rew=-18.75]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #9: 1025it [00:02, 370.38it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=64.218, player_2/loss=263.430, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #10: 1025it [00:02, 372.74it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=33.384, player_2/loss=352.153, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #11: 1025it [00:02, 362.23it/s, env_step=11264, len=9, n/ep=5, n/st=64, player_1/loss=80.207, player_2/loss=343.440, rew=-25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #12: 1025it [00:02, 377.24it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=136.154, player_2/loss=283.505, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #13: 1025it [00:02, 387.16it/s, env_step=13312, len=7, n/ep=7, n/st=64, player_1/loss=87.026, player_2/loss=248.529, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #14: 1025it [00:02, 383.15it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=80.315, player_2/loss=349.653, rew=-12.50]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #15: 1025it [00:02, 381.45it/s, env_step=15360, len=14, n/ep=5, n/st=64, player_1/loss=116.220, player_2/loss=303.197, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #16: 1025it [00:02, 387.39it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=83.738, player_2/loss=264.817, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #17: 1025it [00:02, 388.12it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=42.398, player_2/loss=294.823, rew=-19.44]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #18: 1025it [00:02, 376.62it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=27.676, player_2/loss=311.451, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #19: 1025it [00:02, 378.05it/s, env_step=19456, len=7, n/ep=9, n/st=64, player_1/loss=25.180, player_2/loss=352.948, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #1: 1025it [00:02, 382.91it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=45.420, player_2/loss=275.664, rew=18.75]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 387.31it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=95.258, player_2/loss=300.191, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 359.78it/s, env_step=3072, len=8, n/ep=7, n/st=64, player_1/loss=69.797, player_2/loss=314.156, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 375.53it/s, env_step=4096, len=9, n/ep=7, n/st=64, player_1/loss=76.020, player_2/loss=357.147, rew=17.86]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 372.95it/s, env_step=5120, len=9, n/ep=7, n/st=64, player_1/loss=79.206, player_2/loss=362.538, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 379.77it/s, env_step=6144, len=11, n/ep=5, n/st=64, player_1/loss=97.128, player_2/loss=372.885, rew=15.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 378.85it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=118.464, player_2/loss=341.938, rew=18.75]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 375.21it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=84.005, player_2/loss=286.085, rew=12.50]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 380.81it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=60.566, player_2/loss=235.066, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 376.28it/s, env_step=10240, len=8, n/ep=7, n/st=64, player_1/loss=72.790, player_2/loss=229.795, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 313.10it/s, env_step=11264, len=8, n/ep=7, n/st=64, player_1/loss=58.349, player_2/loss=279.311, rew=17.86]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 321.35it/s, env_step=12288, len=10, n/ep=7, n/st=64, player_1/loss=46.257, player_2/loss=291.704, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 378.99it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=52.285, player_2/loss=338.880, rew=19.44]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 366.12it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=58.102, player_2/loss=333.263, rew=13.89]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 369.50it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=85.910, player_2/loss=355.719, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 333.57it/s, env_step=16384, len=11, n/ep=6, n/st=64, player_1/loss=71.252, player_2/loss=315.143, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 330.23it/s, env_step=17408, len=10, n/ep=7, n/st=64, player_1/loss=33.986, player_2/loss=331.947, rew=17.86]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 352.97it/s, env_step=18432, len=12, n/ep=4, n/st=64, player_1/loss=36.709, player_2/loss=331.840, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 378.89it/s, env_step=19456, len=7, n/ep=9, n/st=64, player_1/loss=55.884, player_2/loss=304.192, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 375.38it/s, env_step=1024, len=18, n/ep=3, n/st=64, player_1/loss=78.162, player_2/loss=346.112, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 379.63it/s, env_step=2048, len=28, n/ep=2, n/st=64, player_1/loss=94.606, player_2/loss=338.536, rew=0.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 380.75it/s, env_step=3072, len=14, n/ep=4, n/st=64, player_1/loss=76.428, player_2/loss=306.373, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 370.78it/s, env_step=4096, len=23, n/ep=3, n/st=64, player_1/loss=67.239, player_2/loss=267.311, rew=-25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 379.82it/s, env_step=5120, len=8, n/ep=8, n/st=64, player_1/loss=71.451, player_2/loss=307.523, rew=-18.75]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 396.27it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=71.064, player_2/loss=304.536, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 386.54it/s, env_step=7168, len=8, n/ep=7, n/st=64, player_1/loss=110.442, player_2/loss=290.878, rew=-17.86]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 382.81it/s, env_step=8192, len=14, n/ep=5, n/st=64, player_1/loss=122.308, player_2/loss=206.059, rew=-15.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 392.93it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=85.681, player_2/loss=212.921, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 395.84it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=69.020, player_2/loss=242.476, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 375.10it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=53.243, player_2/loss=230.377, rew=-13.89]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 365.27it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=57.685, player_2/loss=237.152, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 370.14it/s, env_step=13312, len=19, n/ep=3, n/st=64, player_1/loss=56.875, rew=-25.00]       


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #13


Epoch #14: 1025it [00:02, 375.04it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=126.251, player_2/loss=298.295, rew=-8.33]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #13


Epoch #15: 1025it [00:02, 391.27it/s, env_step=15360, len=17, n/ep=3, n/st=64, player_1/loss=96.356, player_2/loss=282.070, rew=-8.33]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #13


Epoch #16: 1025it [00:02, 389.06it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=32.846, player_2/loss=211.692, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #13


Epoch #17: 1025it [00:02, 374.15it/s, env_step=17408, len=9, n/ep=7, n/st=64, player_1/loss=34.411, player_2/loss=249.116, rew=-10.71]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #13


Epoch #18: 1025it [00:02, 342.70it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=57.925, player_2/loss=230.517, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #13


Epoch #19: 1025it [00:02, 356.58it/s, env_step=19456, len=9, n/ep=8, n/st=64, player_1/loss=61.600, player_2/loss=193.208, rew=-12.50]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #13


Epoch #1: 1025it [00:02, 362.27it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=34.162, player_2/loss=261.752, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 377.40it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=194.766, player_2/loss=261.187, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 376.85it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=129.712, player_2/loss=279.483, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 388.09it/s, env_step=4096, len=11, n/ep=6, n/st=64, player_1/loss=139.009, player_2/loss=300.580, rew=16.67]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 389.90it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=118.794, player_2/loss=303.235, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 387.36it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=37.151, player_2/loss=285.711, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 390.06it/s, env_step=7168, len=8, n/ep=9, n/st=64, player_1/loss=53.445, player_2/loss=317.688, rew=19.44]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 382.97it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=54.901, player_2/loss=280.840, rew=18.75]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 372.98it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=51.703, player_2/loss=268.381, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 377.45it/s, env_step=10240, len=8, n/ep=7, n/st=64, player_1/loss=44.036, player_2/loss=275.670, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 378.47it/s, env_step=11264, len=10, n/ep=6, n/st=64, player_1/loss=85.777, player_2/loss=264.353, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 389.07it/s, env_step=12288, len=9, n/ep=8, n/st=64, player_1/loss=87.691, rew=18.75]         


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 377.65it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=99.676, player_2/loss=249.624, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 361.00it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=114.593, player_2/loss=255.721, rew=13.89]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 370.04it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=36.315, player_2/loss=315.814, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 371.42it/s, env_step=16384, len=9, n/ep=7, n/st=64, player_1/loss=73.043, player_2/loss=321.228, rew=10.71]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 362.10it/s, env_step=17408, len=11, n/ep=6, n/st=64, player_1/loss=99.002, player_2/loss=313.076, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 374.39it/s, env_step=18432, len=8, n/ep=9, n/st=64, player_1/loss=146.520, player_2/loss=294.459, rew=13.89]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 375.01it/s, env_step=19456, len=7, n/ep=9, n/st=64, player_1/loss=81.021, player_2/loss=288.934, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 372.07it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=61.185, player_2/loss=278.716, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 350.84it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=89.791, player_2/loss=225.690, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 364.25it/s, env_step=3072, len=20, n/ep=3, n/st=64, player_1/loss=78.284, player_2/loss=230.768, rew=-25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #4: 1025it [00:02, 383.30it/s, env_step=4096, len=9, n/ep=7, n/st=64, player_2/loss=246.700, rew=-10.71]         


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #5: 1025it [00:02, 368.38it/s, env_step=5120, len=26, n/ep=2, n/st=64, player_1/loss=70.778, player_2/loss=215.366, rew=0.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #6: 1025it [00:02, 375.86it/s, env_step=6144, len=7, n/ep=7, n/st=64, player_1/loss=59.727, player_2/loss=230.520, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #7: 1025it [00:02, 378.10it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=101.951, player_2/loss=253.117, rew=-18.75]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #8: 1025it [00:02, 386.36it/s, env_step=8192, len=9, n/ep=7, n/st=64, player_1/loss=88.027, player_2/loss=256.307, rew=-17.86]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #9: 1025it [00:02, 385.40it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=112.984, player_2/loss=307.607, rew=-18.75]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #10: 1025it [00:02, 380.51it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=116.436, player_2/loss=318.500, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #11: 1025it [00:02, 347.95it/s, env_step=11264, len=10, n/ep=7, n/st=64, player_1/loss=31.016, player_2/loss=285.005, rew=-17.86]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #12: 1025it [00:02, 350.06it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=78.054, player_2/loss=253.677, rew=-18.75]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #13: 1025it [00:02, 377.55it/s, env_step=13312, len=8, n/ep=9, n/st=64, player_1/loss=76.076, player_2/loss=241.910, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #14: 1025it [00:02, 382.19it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=39.616, player_2/loss=271.695, rew=-12.50]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #15: 1025it [00:02, 373.49it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=31.160, player_2/loss=273.999, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #16: 1025it [00:02, 371.78it/s, env_step=16384, len=14, n/ep=5, n/st=64, player_1/loss=68.038, player_2/loss=299.496, rew=-15.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #17: 1025it [00:02, 375.76it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=126.914, player_2/loss=301.143, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #18: 1025it [00:02, 372.71it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=87.438, player_2/loss=316.030, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #19: 1025it [00:02, 346.07it/s, env_step=19456, len=8, n/ep=8, n/st=64, player_1/loss=50.579, player_2/loss=307.255, rew=-18.75]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #1: 1025it [00:03, 338.56it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=32.313, player_2/loss=265.550, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 347.76it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=150.158, player_2/loss=270.920, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 353.50it/s, env_step=3072, len=10, n/ep=6, n/st=64, player_1/loss=147.543, player_2/loss=233.583, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 363.89it/s, env_step=4096, len=9, n/ep=7, n/st=64, player_1/loss=60.174, player_2/loss=269.536, rew=17.86]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 354.69it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=108.347, player_2/loss=254.730, rew=18.75]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 366.83it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=91.647, player_2/loss=262.108, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 362.64it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=41.821, player_2/loss=263.809, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 353.69it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=82.299, player_2/loss=281.443, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 372.09it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=51.000, player_2/loss=256.424, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 371.83it/s, env_step=10240, len=9, n/ep=7, n/st=64, player_1/loss=141.898, player_2/loss=262.113, rew=17.86]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 364.95it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=121.486, player_2/loss=277.310, rew=18.75]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 370.13it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=85.167, player_2/loss=296.968, rew=16.67]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 371.43it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=92.544, player_2/loss=283.431, rew=18.75]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 383.45it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=112.779, player_2/loss=267.275, rew=17.86]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 369.19it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=106.566, player_2/loss=272.495, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 361.39it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=52.798, player_2/loss=289.427, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 369.55it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=87.006, player_2/loss=261.086, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 380.98it/s, env_step=18432, len=12, n/ep=6, n/st=64, player_1/loss=113.151, player_2/loss=287.047, rew=16.67]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 376.39it/s, env_step=19456, len=8, n/ep=5, n/st=64, player_1/loss=112.185, player_2/loss=293.466, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 365.39it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=24.596, player_2/loss=255.805, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 330.62it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=126.880, player_2/loss=268.687, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 351.48it/s, env_step=3072, len=9, n/ep=5, n/st=64, player_1/loss=76.795, player_2/loss=249.419, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 377.37it/s, env_step=4096, len=10, n/ep=7, n/st=64, player_1/loss=61.063, player_2/loss=297.519, rew=-3.57]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 370.75it/s, env_step=5120, len=11, n/ep=5, n/st=64, player_1/loss=64.586, player_2/loss=271.015, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 353.50it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=64.266, player_2/loss=281.884, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 357.38it/s, env_step=7168, len=7, n/ep=9, n/st=64, player_1/loss=80.447, rew=-13.89]          


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 365.84it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=103.517, player_2/loss=246.041, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 370.70it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=136.920, player_2/loss=236.460, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 363.16it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=126.900, player_2/loss=273.103, rew=-19.44]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 398.48it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=166.730, player_2/loss=278.002, rew=-8.33]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 383.53it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=215.424, player_2/loss=300.464, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 381.01it/s, env_step=13312, len=8, n/ep=8, n/st=64, player_1/loss=170.772, player_2/loss=306.680, rew=-12.50]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 376.09it/s, env_step=14336, len=9, n/ep=8, n/st=64, player_1/loss=119.902, player_2/loss=292.988, rew=-12.50]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 359.95it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=61.326, player_2/loss=286.777, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 373.63it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=78.473, player_2/loss=303.106, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 395.07it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=99.809, player_2/loss=281.383, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 394.79it/s, env_step=18432, len=8, n/ep=9, n/st=64, player_1/loss=68.262, rew=-19.44]        


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 377.93it/s, env_step=19456, len=8, n/ep=5, n/st=64, player_1/loss=103.648, player_2/loss=277.336, rew=-5.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 296.91it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=33.024, player_2/loss=242.078, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 354.45it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=92.219, player_2/loss=285.614, rew=19.44]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 377.76it/s, env_step=3072, len=8, n/ep=6, n/st=64, player_1/loss=112.329, player_2/loss=291.953, rew=16.67]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 372.20it/s, env_step=4096, len=10, n/ep=5, n/st=64, player_1/loss=80.220, player_2/loss=263.912, rew=15.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 368.23it/s, env_step=5120, len=11, n/ep=5, n/st=64, player_1/loss=49.397, player_2/loss=304.782, rew=25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 338.63it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=19.371, player_2/loss=305.056, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 333.04it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=31.071, player_2/loss=280.985, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 378.83it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=142.956, player_2/loss=308.763, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 352.74it/s, env_step=9216, len=9, n/ep=7, n/st=64, player_1/loss=226.124, player_2/loss=302.495, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 344.73it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=148.711, player_2/loss=285.616, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 348.75it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=76.734, player_2/loss=282.531, rew=13.89]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 338.13it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=140.548, player_2/loss=292.673, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 380.44it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=157.717, player_2/loss=271.384, rew=19.44]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 332.12it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=213.161, player_2/loss=261.544, rew=10.71]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 360.55it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=126.558, player_2/loss=277.637, rew=19.44]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 364.65it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=57.195, player_2/loss=314.526, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 358.41it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=65.527, player_2/loss=268.310, rew=19.44]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 368.78it/s, env_step=18432, len=8, n/ep=9, n/st=64, player_1/loss=47.237, rew=19.44]         


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 380.81it/s, env_step=19456, len=8, n/ep=7, n/st=64, player_1/loss=84.553, player_2/loss=247.092, rew=10.71]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 376.65it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=29.967, player_2/loss=295.381, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 376.76it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=118.930, player_2/loss=283.195, rew=-19.44]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 391.05it/s, env_step=3072, len=8, n/ep=7, n/st=64, player_1/loss=149.840, player_2/loss=273.292, rew=-17.86]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 395.48it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=118.649, player_2/loss=274.317, rew=-17.86]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 394.93it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=106.902, player_2/loss=279.377, rew=-18.75]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 398.67it/s, env_step=6144, len=8, n/ep=7, n/st=64, player_1/loss=105.721, player_2/loss=282.428, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 377.86it/s, env_step=7168, len=7, n/ep=9, n/st=64, player_1/loss=75.799, rew=-13.89]          


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 369.74it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=137.605, player_2/loss=273.086, rew=-18.75]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 387.03it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=256.501, player_2/loss=232.773, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 388.60it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=220.072, player_2/loss=241.046, rew=-18.75]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 378.61it/s, env_step=11264, len=10, n/ep=6, n/st=64, player_1/loss=80.619, player_2/loss=256.044, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 380.03it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=51.745, player_2/loss=275.723, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 389.12it/s, env_step=13312, len=8, n/ep=8, n/st=64, player_1/loss=86.140, player_2/loss=278.576, rew=-12.50]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 379.47it/s, env_step=14336, len=11, n/ep=7, n/st=64, player_1/loss=126.159, player_2/loss=269.319, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 393.03it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=58.017, player_2/loss=254.156, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 394.55it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=65.867, player_2/loss=268.761, rew=-19.44]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 383.57it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=86.722, player_2/loss=256.721, rew=-19.44]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 376.84it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=51.257, player_2/loss=226.669, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 381.33it/s, env_step=19456, len=11, n/ep=6, n/st=64, player_1/loss=102.708, player_2/loss=226.678, rew=-8.33]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 378.10it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=23.785, player_2/loss=277.997, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 376.85it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=22.022, player_2/loss=300.494, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 372.54it/s, env_step=3072, len=7, n/ep=9, n/st=64, player_1/loss=24.221, player_2/loss=280.583, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 388.13it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=27.898, player_2/loss=272.847, rew=17.86]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 351.92it/s, env_step=5120, len=11, n/ep=5, n/st=64, player_1/loss=70.246, player_2/loss=265.971, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 324.83it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=35.861, player_2/loss=238.936, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 312.00it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=104.483, player_2/loss=233.845, rew=18.75]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 328.62it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=72.391, player_2/loss=362.816, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 341.57it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=77.925, player_2/loss=376.711, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 361.76it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=110.705, player_2/loss=320.959, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 394.19it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=77.011, player_2/loss=212.328, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 385.39it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=56.747, player_2/loss=245.415, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 366.84it/s, env_step=13312, len=8, n/ep=8, n/st=64, player_1/loss=101.753, player_2/loss=272.864, rew=12.50]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 370.96it/s, env_step=14336, len=11, n/ep=7, n/st=64, player_1/loss=81.393, player_2/loss=258.363, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 387.36it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=29.283, player_2/loss=273.274, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 410.70it/s, env_step=16384, len=7, n/ep=8, n/st=64, player_1/loss=110.300, player_2/loss=256.613, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 407.32it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=135.335, player_2/loss=229.633, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 390.59it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=111.293, player_2/loss=237.504, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 378.72it/s, env_step=19456, len=8, n/ep=8, n/st=64, player_1/loss=23.275, player_2/loss=237.429, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 402.19it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=26.099, player_2/loss=281.310, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 407.65it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=81.345, player_2/loss=278.217, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 388.06it/s, env_step=3072, len=8, n/ep=6, n/st=64, player_1/loss=137.516, player_2/loss=267.845, rew=-16.67]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 387.57it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=144.333, player_2/loss=225.423, rew=-16.67]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 384.77it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=115.059, player_2/loss=234.653, rew=-18.75]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 410.57it/s, env_step=6144, len=8, n/ep=7, n/st=64, player_1/loss=88.622, player_2/loss=269.982, rew=-17.86]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 402.39it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=68.547, player_2/loss=297.038, rew=-18.75]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 409.20it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=104.562, player_2/loss=263.796, rew=-18.75]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 405.03it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=139.351, player_2/loss=247.152, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 393.65it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=118.386, player_2/loss=244.853, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 389.87it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=74.229, player_2/loss=278.357, rew=-19.44]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 419.32it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=49.481, player_2/loss=281.477, rew=-16.67]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 411.81it/s, env_step=13312, len=10, n/ep=7, n/st=64, player_1/loss=108.800, player_2/loss=263.465, rew=-3.57]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 412.22it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=112.216, player_2/loss=178.240, rew=-12.50]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 408.21it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=92.026, player_2/loss=163.908, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 410.81it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=201.304, player_2/loss=267.805, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 412.13it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=162.650, player_2/loss=290.068, rew=-19.44]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 411.25it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=45.174, player_2/loss=274.942, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 339.17it/s, env_step=19456, len=18, n/ep=4, n/st=64, player_1/loss=89.831, player_2/loss=229.626, rew=12.50]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 400.93it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=68.419, player_2/loss=229.146, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 365.76it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=103.426, player_2/loss=268.433, rew=19.44]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 369.97it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=137.818, player_2/loss=284.187, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 398.42it/s, env_step=4096, len=12, n/ep=5, n/st=64, player_1/loss=88.420, player_2/loss=264.804, rew=15.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 414.16it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=45.196, player_2/loss=256.576, rew=18.75]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 412.48it/s, env_step=6144, len=8, n/ep=7, n/st=64, player_1/loss=50.167, player_2/loss=270.443, rew=17.86]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 412.59it/s, env_step=7168, len=7, n/ep=9, n/st=64, player_1/loss=65.445, rew=25.00]           


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 402.36it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=116.950, player_2/loss=228.568, rew=18.75]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 396.26it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=195.423, player_2/loss=246.978, rew=10.71]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 367.36it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=158.375, player_2/loss=247.707, rew=19.44]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 398.70it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=77.640, player_2/loss=271.276, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 395.54it/s, env_step=12288, len=8, n/ep=9, n/st=64, player_1/loss=40.036, player_2/loss=290.132, rew=13.89]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 399.51it/s, env_step=13312, len=8, n/ep=8, n/st=64, player_1/loss=49.695, player_2/loss=253.608, rew=0.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 393.04it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=126.164, player_2/loss=223.599, rew=12.50]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 377.98it/s, env_step=15360, len=7, n/ep=8, n/st=64, player_1/loss=104.343, player_2/loss=268.110, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 400.60it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=126.426, player_2/loss=292.409, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 402.88it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=128.931, player_2/loss=260.719, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 394.87it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=49.854, player_2/loss=254.367, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 360.28it/s, env_step=19456, len=8, n/ep=7, n/st=64, player_1/loss=36.906, player_2/loss=250.947, rew=10.71]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 398.19it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=70.223, player_2/loss=228.762, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 391.58it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=87.334, player_2/loss=245.746, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 387.69it/s, env_step=3072, len=9, n/ep=7, n/st=64, player_1/loss=71.295, player_2/loss=254.952, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 393.20it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=54.729, player_2/loss=264.161, rew=-25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 396.00it/s, env_step=5120, len=8, n/ep=8, n/st=64, player_1/loss=49.107, player_2/loss=259.614, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 393.76it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=32.664, player_2/loss=266.528, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 395.22it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=52.990, player_2/loss=290.252, rew=-18.75]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 398.26it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=125.561, player_2/loss=275.398, rew=-18.75]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 393.74it/s, env_step=9216, len=11, n/ep=6, n/st=64, player_1/loss=123.893, player_2/loss=235.346, rew=-8.33]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 368.16it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=147.495, player_2/loss=220.375, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 395.76it/s, env_step=11264, len=9, n/ep=7, n/st=64, player_1/loss=156.683, player_2/loss=209.452, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 397.05it/s, env_step=12288, len=10, n/ep=7, n/st=64, player_1/loss=66.179, player_2/loss=201.557, rew=-10.71]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 394.61it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=77.807, player_2/loss=205.492, rew=-18.75]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 403.32it/s, env_step=14336, len=11, n/ep=7, n/st=64, player_1/loss=110.082, player_2/loss=236.919, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 411.64it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=68.761, player_2/loss=241.143, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 376.59it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=90.714, player_2/loss=214.075, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 381.38it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=107.072, player_2/loss=223.921, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 360.91it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=54.627, player_2/loss=242.428, rew=-17.86]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 361.91it/s, env_step=19456, len=19, n/ep=4, n/st=64, player_1/loss=94.980, player_2/loss=185.437, rew=0.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 339.06it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=30.146, player_2/loss=213.976, rew=19.44]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 357.21it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=71.043, player_2/loss=261.099, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 364.68it/s, env_step=3072, len=11, n/ep=6, n/st=64, player_1/loss=84.092, player_2/loss=281.859, rew=16.67]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 384.37it/s, env_step=4096, len=14, n/ep=4, n/st=64, player_1/loss=42.241, player_2/loss=288.116, rew=12.50]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 394.06it/s, env_step=5120, len=7, n/ep=7, n/st=64, player_1/loss=44.577, player_2/loss=271.031, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 390.34it/s, env_step=6144, len=13, n/ep=5, n/st=64, player_1/loss=69.336, player_2/loss=246.792, rew=15.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 407.87it/s, env_step=7168, len=8, n/ep=7, n/st=64, player_1/loss=70.551, player_2/loss=235.379, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 402.64it/s, env_step=8192, len=9, n/ep=7, n/st=64, player_1/loss=42.003, player_2/loss=224.988, rew=17.86]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 411.45it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=37.542, player_2/loss=217.611, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 398.06it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=112.128, player_2/loss=199.846, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 396.66it/s, env_step=11264, len=8, n/ep=7, n/st=64, player_1/loss=145.150, player_2/loss=216.285, rew=17.86]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 390.44it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=103.123, player_2/loss=253.878, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 380.38it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=61.698, player_2/loss=224.048, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 381.76it/s, env_step=14336, len=9, n/ep=8, n/st=64, player_1/loss=34.887, player_2/loss=208.445, rew=18.75]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 393.31it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=27.619, player_2/loss=196.792, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 409.70it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=23.916, player_2/loss=224.504, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 383.63it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=49.621, player_2/loss=245.982, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 400.93it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=52.886, player_2/loss=257.129, rew=17.86]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 401.46it/s, env_step=19456, len=17, n/ep=4, n/st=64, player_1/loss=66.217, player_2/loss=216.770, rew=0.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 396.02it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=22.901, player_2/loss=207.673, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 385.37it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=37.193, player_2/loss=234.730, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 406.12it/s, env_step=3072, len=10, n/ep=6, n/st=64, player_1/loss=58.683, player_2/loss=232.952, rew=-16.67]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 411.29it/s, env_step=4096, len=11, n/ep=5, n/st=64, player_1/loss=89.131, player_2/loss=257.124, rew=-15.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 412.19it/s, env_step=5120, len=8, n/ep=8, n/st=64, player_1/loss=61.051, player_2/loss=279.186, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 404.48it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=27.687, player_2/loss=257.080, rew=-18.75]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 409.36it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=44.740, player_2/loss=261.620, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 407.00it/s, env_step=8192, len=8, n/ep=5, n/st=64, player_1/loss=53.822, player_2/loss=255.604, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 405.94it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=64.596, player_2/loss=236.336, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 406.32it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=155.251, player_2/loss=222.668, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 409.18it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=126.612, player_2/loss=268.748, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 402.48it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=74.830, player_2/loss=281.358, rew=-10.71]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 408.55it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=98.601, player_2/loss=272.724, rew=-17.86]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 410.37it/s, env_step=14336, len=10, n/ep=8, n/st=64, player_1/loss=75.626, player_2/loss=232.998, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 405.10it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=58.999, player_2/loss=234.312, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 401.48it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=36.591, player_2/loss=250.342, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 402.24it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=49.813, player_2/loss=260.250, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 411.23it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=57.258, player_2/loss=232.515, rew=-17.86]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 407.22it/s, env_step=19456, len=14, n/ep=4, n/st=64, player_1/loss=112.201, player_2/loss=191.297, rew=-12.50]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 408.66it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=53.111, player_2/loss=237.349, rew=19.44]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 410.53it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=52.367, player_2/loss=242.496, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 411.59it/s, env_step=3072, len=9, n/ep=7, n/st=64, player_1/loss=78.802, player_2/loss=260.419, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 411.22it/s, env_step=4096, len=10, n/ep=6, n/st=64, player_1/loss=85.555, player_2/loss=287.692, rew=16.67]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 406.39it/s, env_step=5120, len=8, n/ep=8, n/st=64, player_1/loss=32.807, player_2/loss=290.525, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 402.78it/s, env_step=6144, len=9, n/ep=7, n/st=64, player_1/loss=46.798, player_2/loss=260.373, rew=17.86]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 396.00it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=62.412, player_2/loss=258.074, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 409.13it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=71.700, player_2/loss=235.234, rew=16.67]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 402.13it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=78.319, player_2/loss=229.273, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 406.05it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=138.564, player_2/loss=216.953, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 406.22it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=118.788, player_2/loss=236.285, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 400.91it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=71.450, player_2/loss=287.061, rew=17.86]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 405.31it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=138.179, player_2/loss=295.082, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 403.12it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=115.482, player_2/loss=250.024, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 409.07it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=56.334, player_2/loss=218.810, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 401.47it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=41.560, player_2/loss=214.100, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 405.75it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=33.770, player_2/loss=227.335, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 406.94it/s, env_step=18432, len=8, n/ep=9, n/st=64, player_1/loss=33.994, rew=19.44]         


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 403.00it/s, env_step=19456, len=15, n/ep=4, n/st=64, player_1/loss=173.454, player_2/loss=206.659, rew=0.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 407.71it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=103.045, player_2/loss=250.368, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 409.72it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=97.345, player_2/loss=247.481, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 412.57it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=92.881, player_2/loss=261.973, rew=-18.75]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 407.06it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=83.139, player_2/loss=269.434, rew=-17.86]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 404.47it/s, env_step=5120, len=8, n/ep=8, n/st=64, player_1/loss=60.548, player_2/loss=262.588, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 412.13it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=69.000, player_2/loss=252.147, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 410.65it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=62.389, player_2/loss=224.508, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 416.77it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=25.933, player_2/loss=225.511, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 411.10it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=59.840, player_2/loss=256.505, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 413.03it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=90.224, player_2/loss=239.795, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 414.16it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=58.514, player_2/loss=242.381, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 414.82it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=30.240, player_2/loss=268.100, rew=-18.75]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 412.15it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=27.683, player_2/loss=232.198, rew=-17.86]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 412.80it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=75.413, player_2/loss=211.577, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 416.50it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=92.876, player_2/loss=255.784, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 411.09it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=47.312, player_2/loss=258.639, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 396.06it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=32.351, player_2/loss=253.847, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 398.53it/s, env_step=18432, len=8, n/ep=9, n/st=64, player_1/loss=38.966, rew=-19.44]        


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 395.88it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=27.700, player_2/loss=204.018, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 395.18it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=36.000, player_2/loss=186.462, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 393.61it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=59.884, player_2/loss=208.837, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 395.05it/s, env_step=3072, len=9, n/ep=7, n/st=64, player_1/loss=86.199, player_2/loss=226.433, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 394.13it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=58.026, player_2/loss=268.213, rew=16.67]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 398.26it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=43.055, player_2/loss=303.165, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 393.94it/s, env_step=6144, len=10, n/ep=7, n/st=64, player_1/loss=45.272, player_2/loss=264.161, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 392.52it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=33.514, player_2/loss=249.176, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 355.47it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=41.401, player_2/loss=231.409, rew=19.44]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 360.56it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=64.251, player_2/loss=229.769, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 392.99it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=52.228, player_2/loss=237.183, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 391.98it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=30.779, player_2/loss=246.569, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 395.46it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=29.020, player_2/loss=270.722, rew=18.75]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 392.45it/s, env_step=13312, len=12, n/ep=5, n/st=64, player_1/loss=87.781, player_2/loss=248.257, rew=-5.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 395.69it/s, env_step=14336, len=9, n/ep=8, n/st=64, player_1/loss=121.316, player_2/loss=221.308, rew=18.75]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 394.62it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=67.491, player_2/loss=239.374, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 394.65it/s, env_step=16384, len=7, n/ep=8, n/st=64, player_1/loss=26.652, player_2/loss=239.698, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 391.50it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=35.538, player_2/loss=238.323, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 399.80it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=34.233, player_2/loss=217.551, rew=17.86]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 395.13it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=103.272, player_2/loss=184.382, rew=15.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 390.91it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=33.236, player_2/loss=195.594, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 396.23it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=28.770, player_2/loss=207.818, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 396.98it/s, env_step=3072, len=10, n/ep=6, n/st=64, player_1/loss=59.311, player_2/loss=230.450, rew=-16.67]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 398.11it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=62.047, player_2/loss=248.647, rew=-25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 394.66it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=42.745, player_2/loss=286.658, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 396.99it/s, env_step=6144, len=10, n/ep=7, n/st=64, player_1/loss=43.185, player_2/loss=254.962, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 395.10it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=77.087, player_2/loss=231.905, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 398.69it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=115.456, player_2/loss=236.499, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 397.57it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=80.339, player_2/loss=237.405, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 400.94it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=56.746, player_2/loss=238.687, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 399.05it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=39.952, player_2/loss=232.346, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 399.92it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=38.919, player_2/loss=248.678, rew=-18.75]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 398.41it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=35.230, player_2/loss=240.816, rew=-17.86]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 397.51it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=43.400, player_2/loss=238.473, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 397.33it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=52.344, player_2/loss=251.047, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 396.58it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=40.164, player_2/loss=269.043, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 402.67it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=64.199, player_2/loss=245.877, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 397.78it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=66.979, player_2/loss=245.295, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 398.26it/s, env_step=19456, len=14, n/ep=5, n/st=64, player_1/loss=77.311, player_2/loss=215.349, rew=-5.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 370.36it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=67.841, player_2/loss=241.834, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 393.70it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=86.885, player_2/loss=239.944, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 396.14it/s, env_step=3072, len=10, n/ep=6, n/st=64, player_1/loss=88.712, player_2/loss=220.015, rew=16.67]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 396.80it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=54.240, player_2/loss=235.955, rew=25.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 397.32it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=37.623, player_2/loss=279.799, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 396.54it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=67.937, player_2/loss=287.852, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 395.52it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=110.683, player_2/loss=260.436, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 380.01it/s, env_step=8192, len=11, n/ep=6, n/st=64, player_1/loss=96.970, player_2/loss=235.428, rew=16.67]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 390.27it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=94.237, player_2/loss=239.829, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 395.98it/s, env_step=10240, len=8, n/ep=7, n/st=64, player_1/loss=104.591, rew=17.86]        


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 397.17it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=61.898, player_2/loss=239.876, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 396.06it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=43.486, player_2/loss=237.855, rew=16.67]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 395.95it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=37.792, player_2/loss=245.382, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 397.30it/s, env_step=14336, len=11, n/ep=7, n/st=64, player_1/loss=50.587, player_2/loss=231.003, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 395.20it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=51.789, player_2/loss=225.228, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 396.83it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=32.975, player_2/loss=246.531, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 396.78it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=65.174, player_2/loss=239.568, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 396.66it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=84.140, player_2/loss=211.540, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 387.08it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=129.311, player_2/loss=185.811, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 393.55it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=25.951, player_2/loss=275.396, rew=-18.75]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 398.28it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=30.177, player_2/loss=246.485, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 399.53it/s, env_step=3072, len=10, n/ep=6, n/st=64, player_1/loss=40.657, player_2/loss=234.625, rew=-16.67]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 394.19it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=31.303, player_2/loss=198.063, rew=-25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 437.21it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=26.999, player_2/loss=276.274, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 498.17it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=74.284, player_2/loss=266.168, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 495.94it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=116.547, player_2/loss=250.063, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 491.60it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=75.675, player_2/loss=244.562, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 492.69it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=50.918, player_2/loss=227.555, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 488.19it/s, env_step=10240, len=8, n/ep=7, n/st=64, player_1/loss=68.504, rew=-17.86]        


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 489.13it/s, env_step=11264, len=8, n/ep=7, n/st=64, player_1/loss=46.777, player_2/loss=216.526, rew=-17.86]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 489.14it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=30.565, player_2/loss=240.032, rew=-16.67]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 490.56it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=34.316, player_2/loss=231.810, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 491.08it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=35.154, player_2/loss=224.420, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 491.84it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=31.677, player_2/loss=208.056, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 490.55it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=45.500, player_2/loss=230.764, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 491.76it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=45.509, player_2/loss=198.747, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 482.06it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=31.595, player_2/loss=211.308, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 491.17it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=28.229, player_2/loss=226.445, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 488.33it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=77.449, player_2/loss=229.383, rew=18.75]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 488.73it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=50.836, player_2/loss=221.259, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 484.89it/s, env_step=3072, len=7, n/ep=6, n/st=64, player_1/loss=37.872, player_2/loss=233.351, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 484.56it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=45.654, player_2/loss=202.474, rew=25.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 491.37it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=33.023, player_2/loss=245.729, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 488.47it/s, env_step=6144, len=13, n/ep=5, n/st=64, player_1/loss=21.515, player_2/loss=269.588, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 486.04it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=18.266, player_2/loss=248.002, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 490.27it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=46.840, player_2/loss=237.360, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 489.93it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=51.840, player_2/loss=238.592, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 489.00it/s, env_step=10240, len=9, n/ep=7, n/st=64, player_1/loss=83.657, player_2/loss=249.634, rew=17.86]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 488.99it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=78.297, player_2/loss=282.154, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 491.40it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=22.174, player_2/loss=277.142, rew=16.67]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 490.82it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=27.102, player_2/loss=248.100, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 489.97it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=34.424, player_2/loss=204.333, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 487.81it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=30.460, player_2/loss=198.826, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 493.39it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=38.720, player_2/loss=226.363, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 490.59it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=44.186, player_2/loss=207.605, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 491.33it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=22.074, player_2/loss=216.379, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 489.85it/s, env_step=19456, len=10, n/ep=6, n/st=64, player_1/loss=46.826, player_2/loss=229.310, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 490.38it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=83.926, player_2/loss=258.560, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 487.36it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=59.085, player_2/loss=225.004, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 491.49it/s, env_step=3072, len=8, n/ep=7, n/st=64, player_1/loss=58.636, player_2/loss=209.967, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 492.96it/s, env_step=4096, len=11, n/ep=5, n/st=64, player_1/loss=58.250, player_2/loss=215.346, rew=-25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 492.48it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=32.029, player_2/loss=274.654, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 492.91it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=58.174, player_2/loss=265.803, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 489.35it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=49.019, player_2/loss=255.993, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 489.65it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=23.519, player_2/loss=243.640, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 487.55it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=29.115, player_2/loss=229.489, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 493.98it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=39.995, player_2/loss=248.738, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 488.37it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=38.538, player_2/loss=230.166, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 491.87it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=33.435, player_2/loss=248.951, rew=-16.67]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 490.34it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=36.266, player_2/loss=242.221, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 490.03it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=53.169, player_2/loss=243.273, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 496.02it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=46.675, player_2/loss=225.843, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 488.73it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=33.204, player_2/loss=259.353, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 491.67it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=53.183, player_2/loss=248.171, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 486.32it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=54.004, player_2/loss=224.521, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 493.32it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=28.039, player_2/loss=224.473, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 490.86it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=22.108, player_2/loss=266.653, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 488.51it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=34.024, player_2/loss=268.987, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 490.82it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=69.546, player_2/loss=231.530, rew=18.75]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 486.96it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=58.497, player_2/loss=203.951, rew=25.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 491.49it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=23.032, player_2/loss=265.399, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 492.19it/s, env_step=6144, len=9, n/ep=7, n/st=64, player_1/loss=24.353, player_2/loss=249.796, rew=17.86]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 493.07it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=17.190, player_2/loss=241.453, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 484.89it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=39.184, player_2/loss=234.617, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 489.67it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=36.663, player_2/loss=240.399, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 488.73it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=25.679, player_2/loss=224.637, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 494.13it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=37.458, player_2/loss=222.932, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 488.78it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=40.374, player_2/loss=245.631, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 487.94it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=56.738, player_2/loss=253.992, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 492.24it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=51.777, player_2/loss=231.310, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 484.06it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=16.420, player_2/loss=232.951, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 492.89it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=42.034, player_2/loss=264.289, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 493.24it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=81.685, player_2/loss=238.250, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 491.21it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=136.395, player_2/loss=216.679, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 488.04it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=19.673, player_2/loss=230.997, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 490.51it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=24.222, player_2/loss=254.246, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 494.06it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=32.401, player_2/loss=230.147, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 492.30it/s, env_step=3072, len=8, n/ep=7, n/st=64, player_1/loss=56.407, player_2/loss=233.223, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 491.57it/s, env_step=4096, len=10, n/ep=6, n/st=64, player_1/loss=54.349, player_2/loss=240.337, rew=-16.67]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 494.46it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=26.346, player_2/loss=258.346, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 495.13it/s, env_step=6144, len=9, n/ep=7, n/st=64, player_1/loss=62.103, player_2/loss=245.021, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 489.38it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=55.978, player_2/loss=246.153, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 493.15it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=35.422, player_2/loss=238.795, rew=-18.75]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 490.71it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=39.555, player_2/loss=244.389, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 492.62it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=32.919, player_2/loss=235.319, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 491.64it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=30.289, player_2/loss=217.123, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 493.52it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=106.674, player_2/loss=257.817, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 494.34it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=103.818, player_2/loss=282.496, rew=-19.44]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 488.95it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=76.248, player_2/loss=245.698, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 492.91it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=41.947, player_2/loss=213.425, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 493.71it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=20.973, player_2/loss=232.477, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 495.20it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=53.578, player_2/loss=203.099, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 491.94it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=128.272, player_2/loss=192.843, rew=-17.86]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 492.81it/s, env_step=19456, len=11, n/ep=5, n/st=64, player_1/loss=104.096, player_2/loss=228.058, rew=-5.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 488.17it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=30.929, player_2/loss=282.536, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 491.78it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=30.902, player_2/loss=258.853, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 490.40it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=59.112, player_2/loss=238.991, rew=18.75]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 489.53it/s, env_step=4096, len=10, n/ep=6, n/st=64, player_1/loss=59.330, player_2/loss=216.157, rew=16.67]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 491.24it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=34.456, player_2/loss=245.966, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 492.76it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=86.215, player_2/loss=225.821, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 489.82it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=84.064, player_2/loss=231.720, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 486.21it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=105.492, player_2/loss=220.201, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 489.96it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=103.563, player_2/loss=212.875, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 490.18it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=35.879, player_2/loss=192.711, rew=18.75]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 486.21it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=45.926, player_2/loss=184.274, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 493.95it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=119.151, player_2/loss=246.966, rew=16.67]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 489.38it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=135.316, player_2/loss=244.603, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 491.58it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=84.170, player_2/loss=203.359, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 491.42it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=43.411, player_2/loss=226.696, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 492.91it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=52.922, player_2/loss=258.161, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 490.52it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=82.322, player_2/loss=225.614, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 492.19it/s, env_step=18432, len=8, n/ep=9, n/st=64, player_1/loss=57.413, rew=19.44]         


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 490.86it/s, env_step=19456, len=8, n/ep=8, n/st=64, player_1/loss=52.021, player_2/loss=207.363, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 488.81it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=24.807, player_2/loss=263.278, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 490.02it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=24.495, player_2/loss=241.335, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 492.03it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=52.900, player_2/loss=251.409, rew=-18.75]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 490.31it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=64.031, player_2/loss=232.433, rew=-17.86]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 486.40it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=39.492, player_2/loss=242.742, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 488.71it/s, env_step=6144, len=11, n/ep=6, n/st=64, player_1/loss=49.265, player_2/loss=222.207, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 487.04it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=47.569, player_2/loss=226.237, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 495.34it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=59.395, player_2/loss=222.150, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 490.75it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=83.581, player_2/loss=237.179, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 488.35it/s, env_step=10240, len=8, n/ep=7, n/st=64, player_1/loss=147.613, player_2/loss=203.591, rew=-17.86]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 490.15it/s, env_step=11264, len=9, n/ep=7, n/st=64, player_1/loss=140.872, player_2/loss=212.279, rew=-17.86]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 494.33it/s, env_step=12288, len=15, n/ep=3, n/st=64, player_1/loss=85.689, player_2/loss=236.265, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 486.51it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=88.228, player_2/loss=263.649, rew=-18.75]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 474.96it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=58.734, player_2/loss=228.530, rew=-18.75]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 487.48it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=39.742, player_2/loss=198.855, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 491.26it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=41.343, player_2/loss=250.947, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 487.61it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=45.015, player_2/loss=243.634, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 486.40it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=100.013, player_2/loss=219.547, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 489.45it/s, env_step=19456, len=8, n/ep=7, n/st=64, player_1/loss=85.915, player_2/loss=209.881, rew=-17.86]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 482.91it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=19.681, player_2/loss=218.398, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 489.05it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=30.041, player_2/loss=274.934, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 486.40it/s, env_step=3072, len=9, n/ep=7, n/st=64, player_1/loss=85.754, player_2/loss=268.242, rew=17.86]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 489.84it/s, env_step=4096, len=10, n/ep=6, n/st=64, player_1/loss=85.342, player_2/loss=276.569, rew=16.67]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 488.45it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=32.542, player_2/loss=263.637, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 493.44it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=40.235, player_2/loss=260.781, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 489.26it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=54.549, player_2/loss=250.923, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 490.05it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=53.696, player_2/loss=254.088, rew=18.75]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 488.88it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=51.696, player_2/loss=235.650, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 491.96it/s, env_step=10240, len=9, n/ep=7, n/st=64, player_1/loss=79.466, player_2/loss=181.199, rew=17.86]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 494.49it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=65.752, player_2/loss=187.753, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 491.98it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=120.763, player_2/loss=258.277, rew=10.71]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 491.84it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=109.268, player_2/loss=276.682, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 493.90it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=41.763, player_2/loss=253.263, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 491.70it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=10.804, player_2/loss=244.702, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 494.55it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=34.203, player_2/loss=251.586, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 493.65it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=41.313, player_2/loss=219.760, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 493.42it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=97.298, player_2/loss=207.181, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 488.40it/s, env_step=19456, len=8, n/ep=7, n/st=64, player_1/loss=18.154, player_2/loss=219.848, rew=17.86]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 490.48it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=23.357, player_2/loss=182.896, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 489.38it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=54.652, player_2/loss=220.081, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 492.84it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=82.910, player_2/loss=237.035, rew=-18.75]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 490.97it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=54.975, player_2/loss=236.333, rew=-16.67]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 491.82it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=28.138, player_2/loss=262.939, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 487.22it/s, env_step=6144, len=11, n/ep=6, n/st=64, player_1/loss=71.899, player_2/loss=236.006, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 490.22it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=68.791, player_2/loss=219.344, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 495.35it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=32.277, player_2/loss=189.159, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 490.06it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=47.439, player_2/loss=206.825, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 493.72it/s, env_step=10240, len=8, n/ep=8, n/st=64, player_1/loss=140.047, player_2/loss=228.834, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 494.46it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=172.233, player_2/loss=248.495, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 490.44it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=112.432, player_2/loss=268.462, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 493.44it/s, env_step=13312, len=10, n/ep=6, n/st=64, player_1/loss=65.569, player_2/loss=262.326, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 492.31it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=29.905, player_2/loss=239.285, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 490.79it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=53.822, player_2/loss=229.118, rew=-19.44]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 493.08it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=94.770, player_2/loss=221.373, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 494.84it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=67.327, player_2/loss=201.128, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 488.59it/s, env_step=18432, len=7, n/ep=7, n/st=64, player_1/loss=92.927, player_2/loss=184.019, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 488.25it/s, env_step=19456, len=10, n/ep=6, n/st=64, player_1/loss=77.150, player_2/loss=210.826, rew=-16.67]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 490.25it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=22.620, player_2/loss=285.559, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 493.05it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=22.219, player_2/loss=263.024, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 488.86it/s, env_step=3072, len=7, n/ep=6, n/st=64, player_1/loss=22.451, player_2/loss=253.834, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 488.66it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=40.703, player_2/loss=232.782, rew=16.67]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 488.96it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=39.014, player_2/loss=254.414, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 494.40it/s, env_step=6144, len=10, n/ep=6, n/st=64, player_1/loss=74.695, player_2/loss=237.167, rew=16.67]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 490.72it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=69.327, player_2/loss=240.621, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 493.53it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=53.185, player_2/loss=225.067, rew=18.75]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 493.42it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=67.991, player_2/loss=219.541, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 493.50it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=35.701, player_2/loss=207.112, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 493.51it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=31.596, player_2/loss=232.632, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 493.70it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=112.624, player_2/loss=273.231, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 490.67it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=109.578, player_2/loss=282.007, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 488.38it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=31.175, player_2/loss=256.917, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 492.41it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=31.918, player_2/loss=270.901, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 493.55it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=91.155, player_2/loss=256.779, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 488.69it/s, env_step=17408, len=7, n/ep=8, n/st=64, player_1/loss=135.136, player_2/loss=206.642, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 492.08it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=49.182, player_2/loss=213.337, rew=17.86]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 490.16it/s, env_step=19456, len=13, n/ep=5, n/st=64, player_1/loss=57.957, player_2/loss=211.077, rew=5.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 491.48it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=15.583, player_2/loss=254.213, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 492.07it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=24.156, player_2/loss=235.982, rew=-19.44]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 490.76it/s, env_step=3072, len=9, n/ep=7, n/st=64, player_1/loss=29.840, player_2/loss=249.784, rew=-17.86]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 495.70it/s, env_step=4096, len=11, n/ep=8, n/st=64, player_1/loss=28.918, player_2/loss=262.913, rew=-18.75]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 496.07it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=39.915, player_2/loss=249.666, rew=-18.75]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:02, 492.01it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=35.175, player_2/loss=251.666, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:02, 494.75it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=29.998, player_2/loss=252.256, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:02, 496.22it/s, env_step=8192, len=9, n/ep=6, n/st=64, player_1/loss=45.827, player_2/loss=253.051, rew=-16.67]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:02, 492.30it/s, env_step=9216, len=9, n/ep=7, n/st=64, player_1/loss=29.211, player_2/loss=247.100, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:02, 496.66it/s, env_step=10240, len=9, n/ep=7, n/st=64, player_1/loss=115.036, player_2/loss=233.832, rew=-10.71]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:02, 492.43it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=166.563, player_2/loss=248.041, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:02, 493.24it/s, env_step=12288, len=11, n/ep=7, n/st=64, player_1/loss=74.206, player_2/loss=265.891, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:02, 497.35it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=38.850, player_2/loss=252.059, rew=-17.86]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:02, 489.91it/s, env_step=14336, len=10, n/ep=7, n/st=64, player_1/loss=79.746, player_2/loss=208.927, rew=-17.86]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:02, 490.98it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=47.660, player_2/loss=248.013, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:02, 493.14it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=36.040, player_2/loss=284.232, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:02, 492.79it/s, env_step=17408, len=7, n/ep=8, n/st=64, player_1/loss=83.587, player_2/loss=253.062, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:02, 485.91it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=83.643, player_2/loss=215.494, rew=-18.75]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:02, 492.81it/s, env_step=19456, len=11, n/ep=6, n/st=64, player_1/loss=51.672, player_2/loss=225.374, rew=-16.67]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:02, 490.49it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=6.192, player_2/loss=257.887, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 495.40it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=37.003, player_2/loss=255.718, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 491.54it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=67.484, player_2/loss=255.888, rew=18.75]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 492.85it/s, env_step=4096, len=14, n/ep=4, n/st=64, player_1/loss=55.705, player_2/loss=246.058, rew=12.50]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 492.72it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=13.693, player_2/loss=252.824, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 497.87it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=45.907, player_2/loss=236.756, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 490.75it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=58.321, player_2/loss=243.939, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 490.64it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=54.373, player_2/loss=264.072, rew=18.75]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 491.71it/s, env_step=9216, len=7, n/ep=7, n/st=64, player_1/loss=58.437, player_2/loss=250.439, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 491.92it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=103.695, player_2/loss=227.499, rew=19.44]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 488.58it/s, env_step=11264, len=9, n/ep=7, n/st=64, player_1/loss=95.474, player_2/loss=236.424, rew=10.71]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 492.36it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=79.869, player_2/loss=276.677, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 486.72it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=89.315, player_2/loss=285.932, rew=18.75]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 491.58it/s, env_step=14336, len=9, n/ep=8, n/st=64, player_1/loss=78.032, player_2/loss=271.763, rew=12.50]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 488.99it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=32.291, player_2/loss=263.181, rew=19.44]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 492.01it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=41.303, player_2/loss=254.233, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 491.72it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=108.349, player_2/loss=208.518, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 487.02it/s, env_step=18432, len=8, n/ep=9, n/st=64, player_1/loss=110.478, rew=19.44]        


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 493.82it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=37.325, player_2/loss=209.283, rew=15.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 488.45it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=18.105, player_2/loss=247.139, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 494.11it/s, env_step=2048, len=16, n/ep=4, n/st=64, player_1/loss=74.832, player_2/loss=233.090, rew=12.50]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 496.06it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=166.011, player_2/loss=190.177, rew=-18.75]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 484.81it/s, env_step=4096, len=10, n/ep=6, n/st=64, player_1/loss=101.527, player_2/loss=191.228, rew=-8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 491.40it/s, env_step=5120, len=19, n/ep=3, n/st=64, player_1/loss=51.502, player_2/loss=220.513, rew=25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 491.35it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=110.193, player_2/loss=177.749, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 495.03it/s, env_step=7168, len=15, n/ep=4, n/st=64, player_1/loss=130.296, player_2/loss=181.563, rew=12.50]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #8: 1025it [00:02, 492.71it/s, env_step=8192, len=15, n/ep=4, n/st=64, player_1/loss=211.399, player_2/loss=215.689, rew=12.50]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #9: 1025it [00:02, 495.01it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=224.578, player_2/loss=185.221, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #10: 1025it [00:02, 494.90it/s, env_step=10240, len=12, n/ep=5, n/st=64, player_1/loss=159.739, player_2/loss=201.851, rew=-5.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #11: 1025it [00:02, 492.07it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=146.170, player_2/loss=229.505, rew=-18.75]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #12: 1025it [00:02, 494.67it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=113.370, player_2/loss=224.608, rew=-18.75]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #13: 1025it [00:02, 493.99it/s, env_step=13312, len=9, n/ep=6, n/st=64, player_1/loss=128.432, player_2/loss=214.123, rew=-16.67]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #14: 1025it [00:02, 491.30it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=115.062, player_2/loss=200.533, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #15: 1025it [00:02, 493.50it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=49.870, player_2/loss=236.089, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #16: 1025it [00:02, 491.53it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=54.608, player_2/loss=238.502, rew=-18.75]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #17: 1025it [00:02, 491.48it/s, env_step=17408, len=12, n/ep=5, n/st=64, player_1/loss=50.434, player_2/loss=251.005, rew=-15.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #18: 1025it [00:02, 490.89it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=32.556, player_2/loss=241.734, rew=-18.75]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #19: 1025it [00:02, 490.83it/s, env_step=19456, len=7, n/ep=9, n/st=64, player_1/loss=28.842, player_2/loss=203.036, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #7


Epoch #1: 1025it [00:02, 488.52it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=195.041, player_2/loss=478.810, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 491.63it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=166.967, player_2/loss=373.555, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 491.62it/s, env_step=3072, len=9, n/ep=7, n/st=64, player_1/loss=105.602, player_2/loss=260.198, rew=17.86]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 487.39it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=78.997, player_2/loss=251.164, rew=10.71]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 494.21it/s, env_step=5120, len=7, n/ep=9, n/st=64, player_1/loss=116.178, player_2/loss=223.380, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 487.54it/s, env_step=6144, len=9, n/ep=7, n/st=64, player_1/loss=97.549, player_2/loss=234.944, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 490.20it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=44.421, player_2/loss=258.815, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 488.19it/s, env_step=8192, len=9, n/ep=7, n/st=64, player_1/loss=73.617, player_2/loss=268.772, rew=17.86]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 488.92it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=107.621, player_2/loss=260.565, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 484.80it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=145.517, player_2/loss=240.324, rew=19.44]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 490.81it/s, env_step=11264, len=11, n/ep=6, n/st=64, player_1/loss=108.739, player_2/loss=243.049, rew=8.33]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 490.57it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=169.041, player_2/loss=283.885, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 487.90it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=145.749, player_2/loss=287.305, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 491.38it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=73.482, player_2/loss=253.673, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 488.35it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=64.420, player_2/loss=215.780, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 485.32it/s, env_step=16384, len=8, n/ep=7, n/st=64, player_1/loss=54.660, player_2/loss=243.747, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 488.02it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=72.991, player_2/loss=226.805, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 488.56it/s, env_step=18432, len=7, n/ep=7, n/st=64, player_1/loss=59.142, player_2/loss=205.126, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 489.59it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=35.907, player_2/loss=210.157, rew=15.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 490.13it/s, env_step=1024, len=12, n/ep=5, n/st=64, player_1/loss=32.568, player_2/loss=171.722, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 490.01it/s, env_step=2048, len=9, n/ep=7, n/st=64, player_1/loss=70.594, player_2/loss=169.140, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 492.00it/s, env_step=3072, len=13, n/ep=5, n/st=64, player_1/loss=93.077, player_2/loss=176.319, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 481.39it/s, env_step=4096, len=21, n/ep=3, n/st=64, player_1/loss=100.858, player_2/loss=135.191, rew=-25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 493.48it/s, env_step=5120, len=8, n/ep=6, n/st=64, player_1/loss=65.190, player_2/loss=133.237, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 492.83it/s, env_step=6144, len=12, n/ep=4, n/st=64, player_1/loss=33.376, player_2/loss=152.510, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 491.95it/s, env_step=7168, len=12, n/ep=5, n/st=64, player_1/loss=118.499, player_2/loss=178.523, rew=-15.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 492.33it/s, env_step=8192, len=19, n/ep=4, n/st=64, player_1/loss=117.711, player_2/loss=160.185, rew=-12.50]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 489.12it/s, env_step=9216, len=16, n/ep=4, n/st=64, player_1/loss=96.077, player_2/loss=137.138, rew=-25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #10: 1025it [00:02, 491.06it/s, env_step=10240, len=17, n/ep=4, n/st=64, player_1/loss=135.101, player_2/loss=94.103, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #11: 1025it [00:02, 486.60it/s, env_step=11264, len=13, n/ep=4, n/st=64, player_1/loss=169.625, player_2/loss=101.732, rew=-12.50]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #12: 1025it [00:02, 491.96it/s, env_step=12288, len=12, n/ep=5, n/st=64, player_1/loss=229.327, player_2/loss=128.776, rew=-5.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #13: 1025it [00:02, 490.35it/s, env_step=13312, len=17, n/ep=3, n/st=64, player_1/loss=176.117, player_2/loss=144.627, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #14: 1025it [00:02, 491.30it/s, env_step=14336, len=13, n/ep=5, n/st=64, player_1/loss=88.882, player_2/loss=143.740, rew=-15.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #15: 1025it [00:02, 494.20it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_2/loss=149.300, rew=-17.86]       


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #16: 1025it [00:02, 491.18it/s, env_step=16384, len=15, n/ep=5, n/st=64, player_1/loss=132.865, player_2/loss=148.998, rew=-15.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #17: 1025it [00:02, 489.28it/s, env_step=17408, len=16, n/ep=4, n/st=64, player_1/loss=125.932, player_2/loss=141.575, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #18: 1025it [00:02, 491.98it/s, env_step=18432, len=9, n/ep=6, n/st=64, player_1/loss=97.969, player_2/loss=97.867, rew=-16.67]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #19: 1025it [00:02, 489.62it/s, env_step=19456, len=15, n/ep=4, n/st=64, player_1/loss=124.237, player_2/loss=110.966, rew=-12.50]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #1: 1025it [00:02, 489.82it/s, env_step=1024, len=12, n/ep=5, n/st=64, player_1/loss=123.024, player_2/loss=96.197, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 494.88it/s, env_step=2048, len=12, n/ep=6, n/st=64, player_1/loss=105.064, player_2/loss=110.842, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 492.43it/s, env_step=3072, len=13, n/ep=7, n/st=64, player_1/loss=113.875, player_2/loss=108.135, rew=3.57]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 492.04it/s, env_step=4096, len=16, n/ep=5, n/st=64, player_1/loss=110.599, player_2/loss=103.429, rew=-5.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 496.98it/s, env_step=5120, len=13, n/ep=4, n/st=64, player_1/loss=141.763, player_2/loss=132.090, rew=-12.50]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 489.54it/s, env_step=6144, len=12, n/ep=5, n/st=64, player_1/loss=156.359, rew=15.00]         


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 494.16it/s, env_step=7168, len=12, n/ep=6, n/st=64, player_1/loss=132.412, player_2/loss=144.554, rew=16.67]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 493.82it/s, env_step=8192, len=11, n/ep=5, n/st=64, player_1/loss=127.336, player_2/loss=142.579, rew=5.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 489.66it/s, env_step=9216, len=13, n/ep=5, n/st=64, player_1/loss=113.563, player_2/loss=124.921, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 489.88it/s, env_step=10240, len=13, n/ep=5, n/st=64, player_1/loss=144.218, player_2/loss=103.802, rew=25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 495.57it/s, env_step=11264, len=19, n/ep=3, n/st=64, player_1/loss=164.530, player_2/loss=112.574, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 494.05it/s, env_step=12288, len=12, n/ep=5, n/st=64, player_1/loss=114.148, player_2/loss=153.569, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 490.39it/s, env_step=13312, len=13, n/ep=5, n/st=64, player_1/loss=125.184, player_2/loss=176.745, rew=5.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 492.37it/s, env_step=14336, len=16, n/ep=4, n/st=64, player_1/loss=256.034, player_2/loss=169.080, rew=0.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 493.83it/s, env_step=15360, len=15, n/ep=5, n/st=64, player_1/loss=194.954, player_2/loss=123.087, rew=5.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 494.58it/s, env_step=16384, len=14, n/ep=4, n/st=64, player_1/loss=107.750, player_2/loss=75.870, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 493.22it/s, env_step=17408, len=14, n/ep=4, n/st=64, player_1/loss=109.375, player_2/loss=109.471, rew=-12.50]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 497.15it/s, env_step=18432, len=13, n/ep=5, n/st=64, player_1/loss=162.757, player_2/loss=142.918, rew=5.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 487.45it/s, env_step=19456, len=22, n/ep=3, n/st=64, player_1/loss=191.902, player_2/loss=138.987, rew=8.33]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 485.74it/s, env_step=1024, len=15, n/ep=4, n/st=64, player_1/loss=246.904, player_2/loss=152.009, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:02, 488.56it/s, env_step=2048, len=8, n/ep=8, n/st=64, player_1/loss=207.819, player_2/loss=172.020, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:02, 490.10it/s, env_step=3072, len=15, n/ep=4, n/st=64, player_1/loss=173.910, player_2/loss=186.415, rew=12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:02, 488.91it/s, env_step=4096, len=10, n/ep=6, n/st=64, player_1/loss=172.698, player_2/loss=170.500, rew=-16.67]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:02, 491.47it/s, env_step=5120, len=21, n/ep=3, n/st=64, player_1/loss=204.184, player_2/loss=137.274, rew=8.33]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #6: 1025it [00:02, 489.41it/s, env_step=6144, len=9, n/ep=7, n/st=64, player_1/loss=239.463, player_2/loss=160.862, rew=-17.86]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #7: 1025it [00:02, 488.49it/s, env_step=7168, len=14, n/ep=5, n/st=64, player_2/loss=195.048, rew=-5.00]         


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #8: 1025it [00:02, 487.07it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=178.119, player_2/loss=203.539, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #9: 1025it [00:02, 486.23it/s, env_step=9216, len=9, n/ep=8, n/st=64, player_1/loss=159.014, player_2/loss=210.462, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #10: 1025it [00:02, 491.01it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=75.431, player_2/loss=197.354, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #11: 1025it [00:02, 488.27it/s, env_step=11264, len=10, n/ep=6, n/st=64, player_1/loss=90.502, player_2/loss=140.703, rew=-16.67]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #12: 1025it [00:02, 491.76it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=141.442, player_2/loss=199.087, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #13: 1025it [00:02, 493.75it/s, env_step=13312, len=9, n/ep=7, n/st=64, player_1/loss=131.909, player_2/loss=224.211, rew=-17.86]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #14: 1025it [00:02, 490.57it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=107.031, player_2/loss=195.578, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #15: 1025it [00:02, 489.43it/s, env_step=15360, len=7, n/ep=7, n/st=64, player_1/loss=114.182, player_2/loss=171.312, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #16: 1025it [00:02, 494.16it/s, env_step=16384, len=15, n/ep=4, n/st=64, player_1/loss=215.592, player_2/loss=246.894, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #17: 1025it [00:02, 487.93it/s, env_step=17408, len=7, n/ep=10, n/st=64, player_1/loss=167.684, player_2/loss=224.475, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #18: 1025it [00:02, 490.26it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=124.593, player_2/loss=192.751, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #19: 1025it [00:02, 492.36it/s, env_step=19456, len=11, n/ep=6, n/st=64, player_1/loss=151.868, player_2/loss=228.072, rew=-8.33]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #1: 1025it [00:02, 489.01it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=85.272, player_2/loss=303.745, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 489.02it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=109.692, player_2/loss=231.433, rew=19.44]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 492.74it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=78.166, player_2/loss=201.475, rew=18.75]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 490.45it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=128.541, player_2/loss=204.141, rew=10.71]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 490.79it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=119.709, player_2/loss=260.738, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 489.64it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=88.268, player_2/loss=263.180, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 486.93it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=90.546, rew=25.00]           


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 492.55it/s, env_step=8192, len=9, n/ep=7, n/st=64, player_1/loss=71.565, player_2/loss=241.925, rew=17.86]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 490.12it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=67.782, player_2/loss=230.400, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 489.48it/s, env_step=10240, len=10, n/ep=6, n/st=64, player_1/loss=87.274, player_2/loss=232.861, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 484.14it/s, env_step=11264, len=9, n/ep=7, n/st=64, player_1/loss=117.982, player_2/loss=198.130, rew=10.71]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 489.09it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=121.091, player_2/loss=232.346, rew=17.86]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 493.33it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=79.872, player_2/loss=241.883, rew=17.86]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 492.52it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=37.484, player_2/loss=202.047, rew=17.86]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 486.85it/s, env_step=15360, len=7, n/ep=8, n/st=64, player_1/loss=68.895, player_2/loss=218.431, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 488.90it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=140.422, player_2/loss=263.571, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 490.45it/s, env_step=17408, len=7, n/ep=8, n/st=64, player_1/loss=105.936, player_2/loss=230.945, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 489.96it/s, env_step=18432, len=8, n/ep=9, n/st=64, player_1/loss=41.610, rew=19.44]         


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 493.19it/s, env_step=19456, len=13, n/ep=5, n/st=64, player_1/loss=66.208, player_2/loss=209.285, rew=15.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 489.82it/s, env_step=1024, len=18, n/ep=3, n/st=64, player_1/loss=114.841, player_2/loss=200.592, rew=-8.33]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 492.35it/s, env_step=2048, len=8, n/ep=8, n/st=64, player_1/loss=59.570, player_2/loss=194.093, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 487.97it/s, env_step=3072, len=9, n/ep=7, n/st=64, player_1/loss=51.815, player_2/loss=217.401, rew=-17.86]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 494.29it/s, env_step=4096, len=11, n/ep=6, n/st=64, player_1/loss=107.351, player_2/loss=226.060, rew=-16.67]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #5: 1025it [00:02, 494.87it/s, env_step=5120, len=13, n/ep=5, n/st=64, player_1/loss=216.766, player_2/loss=169.340, rew=5.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #6: 1025it [00:02, 493.39it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=191.415, player_2/loss=163.420, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #7: 1025it [00:02, 495.73it/s, env_step=7168, len=10, n/ep=7, n/st=64, player_1/loss=75.284, player_2/loss=145.469, rew=-17.86]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #8: 1025it [00:02, 491.63it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=32.262, player_2/loss=196.382, rew=-8.33]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #9: 1025it [00:02, 490.82it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=40.248, player_2/loss=221.116, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #10: 1025it [00:02, 490.67it/s, env_step=10240, len=8, n/ep=6, n/st=64, player_1/loss=75.646, player_2/loss=217.491, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #11: 1025it [00:02, 489.65it/s, env_step=11264, len=13, n/ep=5, n/st=64, player_1/loss=99.182, player_2/loss=220.120, rew=5.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #12: 1025it [00:02, 493.33it/s, env_step=12288, len=16, n/ep=5, n/st=64, player_1/loss=130.085, rew=-15.00]      


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #13: 1025it [00:02, 488.70it/s, env_step=13312, len=11, n/ep=5, n/st=64, player_1/loss=156.590, player_2/loss=192.037, rew=-5.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #14: 1025it [00:02, 491.43it/s, env_step=14336, len=21, n/ep=3, n/st=64, player_1/loss=176.332, player_2/loss=154.610, rew=-8.33]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #15: 1025it [00:02, 493.39it/s, env_step=15360, len=16, n/ep=4, n/st=64, player_1/loss=180.719, player_2/loss=144.410, rew=-12.50]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #16: 1025it [00:02, 495.64it/s, env_step=16384, len=19, n/ep=3, n/st=64, player_1/loss=158.121, player_2/loss=144.326, rew=-8.33]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #17: 1025it [00:02, 493.54it/s, env_step=17408, len=19, n/ep=3, n/st=64, player_1/loss=165.626, player_2/loss=145.628, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #18: 1025it [00:02, 489.22it/s, env_step=18432, len=19, n/ep=4, n/st=64, player_1/loss=227.755, player_2/loss=127.658, rew=12.50]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #19: 1025it [00:02, 498.94it/s, env_step=19456, len=16, n/ep=4, n/st=64, player_1/loss=226.531, player_2/loss=131.741, rew=-12.50]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #1: 1025it [00:02, 486.81it/s, env_step=1024, len=18, n/ep=3, n/st=64, player_1/loss=102.932, player_2/loss=139.973, rew=8.33]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 489.32it/s, env_step=2048, len=14, n/ep=5, n/st=64, player_1/loss=137.034, player_2/loss=136.847, rew=15.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 493.40it/s, env_step=3072, len=11, n/ep=6, n/st=64, player_1/loss=147.739, player_2/loss=105.512, rew=0.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 492.49it/s, env_step=4096, len=16, n/ep=4, n/st=64, player_1/loss=128.811, player_2/loss=83.429, rew=12.50]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 491.83it/s, env_step=5120, len=15, n/ep=4, n/st=64, player_1/loss=169.847, player_2/loss=113.993, rew=-12.50]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 491.04it/s, env_step=6144, len=20, n/ep=3, n/st=64, player_1/loss=183.302, player_2/loss=138.480, rew=8.33]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 495.86it/s, env_step=7168, len=11, n/ep=5, n/st=64, player_1/loss=204.022, player_2/loss=127.256, rew=-5.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 490.84it/s, env_step=8192, len=14, n/ep=5, n/st=64, player_1/loss=288.610, player_2/loss=135.709, rew=-5.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 489.11it/s, env_step=9216, len=19, n/ep=3, n/st=64, player_1/loss=219.130, player_2/loss=147.203, rew=8.33]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 493.40it/s, env_step=10240, len=12, n/ep=6, n/st=64, player_1/loss=125.908, player_2/loss=121.996, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 489.61it/s, env_step=11264, len=20, n/ep=3, n/st=64, player_1/loss=165.538, player_2/loss=105.521, rew=-8.33]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 490.28it/s, env_step=12288, len=20, n/ep=4, n/st=64, player_1/loss=165.830, player_2/loss=112.190, rew=12.50]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 491.27it/s, env_step=13312, len=14, n/ep=5, n/st=64, player_1/loss=114.096, player_2/loss=138.252, rew=-5.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 493.63it/s, env_step=14336, len=15, n/ep=4, n/st=64, player_1/loss=174.416, player_2/loss=88.625, rew=25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 494.18it/s, env_step=15360, len=10, n/ep=5, n/st=64, player_1/loss=143.985, player_2/loss=99.897, rew=15.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 491.80it/s, env_step=16384, len=18, n/ep=3, n/st=64, player_1/loss=77.377, player_2/loss=129.575, rew=-8.33]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 489.42it/s, env_step=17408, len=15, n/ep=4, n/st=64, player_1/loss=190.685, player_2/loss=129.438, rew=-12.50]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 493.57it/s, env_step=18432, len=14, n/ep=5, n/st=64, player_1/loss=266.017, player_2/loss=96.484, rew=-5.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 494.60it/s, env_step=19456, len=16, n/ep=3, n/st=64, player_1/loss=213.216, player_2/loss=92.650, rew=-8.33]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 488.81it/s, env_step=1024, len=19, n/ep=3, n/st=64, player_1/loss=357.326, player_2/loss=146.489, rew=-8.33]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:02, 489.84it/s, env_step=2048, len=15, n/ep=5, n/st=64, player_1/loss=274.581, player_2/loss=144.712, rew=-5.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:02, 495.63it/s, env_step=3072, len=20, n/ep=3, n/st=64, player_1/loss=222.151, player_2/loss=156.248, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:02, 493.32it/s, env_step=4096, len=15, n/ep=4, n/st=64, player_1/loss=296.639, player_2/loss=161.129, rew=12.50]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:02, 493.61it/s, env_step=5120, len=18, n/ep=4, n/st=64, player_1/loss=243.901, player_2/loss=148.317, rew=12.50]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #6: 1025it [00:02, 494.34it/s, env_step=6144, len=12, n/ep=5, n/st=64, player_1/loss=198.719, player_2/loss=128.788, rew=5.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #7: 1025it [00:02, 493.00it/s, env_step=7168, len=16, n/ep=4, n/st=64, player_1/loss=303.370, player_2/loss=124.579, rew=12.50]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #8: 1025it [00:02, 492.13it/s, env_step=8192, len=8, n/ep=7, n/st=64, player_1/loss=272.805, player_2/loss=94.303, rew=-17.86]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #9: 1025it [00:02, 490.22it/s, env_step=9216, len=11, n/ep=5, n/st=64, player_1/loss=103.120, player_2/loss=116.946, rew=-5.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #10: 1025it [00:02, 494.09it/s, env_step=10240, len=11, n/ep=6, n/st=64, player_1/loss=171.053, player_2/loss=115.818, rew=-8.33]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #11: 1025it [00:02, 493.44it/s, env_step=11264, len=15, n/ep=4, n/st=64, player_1/loss=262.133, player_2/loss=124.289, rew=12.50]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #12: 1025it [00:02, 491.69it/s, env_step=12288, len=13, n/ep=5, n/st=64, player_1/loss=182.985, player_2/loss=160.272, rew=15.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #13: 1025it [00:02, 492.12it/s, env_step=13312, len=9, n/ep=6, n/st=64, player_1/loss=177.588, player_2/loss=187.007, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #14: 1025it [00:02, 496.10it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=187.385, player_2/loss=185.347, rew=-13.89]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #15: 1025it [00:02, 485.06it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_2/loss=251.637, rew=-18.75]       


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #16: 1025it [00:02, 490.32it/s, env_step=16384, len=11, n/ep=6, n/st=64, player_1/loss=199.259, player_2/loss=257.321, rew=-8.33]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #17: 1025it [00:02, 486.34it/s, env_step=17408, len=8, n/ep=7, n/st=64, player_1/loss=300.681, player_2/loss=262.700, rew=-3.57]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #18: 1025it [00:02, 492.13it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=257.609, player_2/loss=254.502, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #19: 1025it [00:02, 486.57it/s, env_step=19456, len=7, n/ep=8, n/st=64, player_2/loss=236.438, rew=-18.75]       


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #1: 1025it [00:02, 485.24it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=125.980, player_2/loss=265.980, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 492.00it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=128.615, player_2/loss=259.576, rew=19.44]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 487.90it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=156.544, player_2/loss=234.825, rew=18.75]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 492.12it/s, env_step=4096, len=10, n/ep=6, n/st=64, player_1/loss=162.543, player_2/loss=246.934, rew=-8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 469.25it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=134.396, player_2/loss=290.661, rew=25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 488.79it/s, env_step=6144, len=8, n/ep=7, n/st=64, player_1/loss=105.248, player_2/loss=274.787, rew=17.86]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 489.88it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=35.825, player_2/loss=269.578, rew=12.50]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 488.14it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=118.402, player_2/loss=277.410, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 491.36it/s, env_step=9216, len=8, n/ep=8, n/st=64, player_1/loss=179.948, player_2/loss=262.328, rew=12.50]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 489.22it/s, env_step=10240, len=9, n/ep=7, n/st=64, player_1/loss=151.271, player_2/loss=259.881, rew=10.71]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 489.14it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=169.013, player_2/loss=260.707, rew=12.50]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 488.11it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=150.703, player_2/loss=240.867, rew=12.50]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 474.55it/s, env_step=13312, len=8, n/ep=8, n/st=64, player_1/loss=125.633, player_2/loss=206.436, rew=12.50]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 492.29it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=126.549, player_2/loss=223.718, rew=17.86]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 493.32it/s, env_step=15360, len=7, n/ep=8, n/st=64, player_1/loss=116.893, player_2/loss=259.065, rew=18.75]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 487.22it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=116.533, player_2/loss=278.640, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 486.38it/s, env_step=17408, len=8, n/ep=8, n/st=64, player_1/loss=112.061, player_2/loss=265.813, rew=18.75]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 491.51it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=227.906, player_2/loss=236.917, rew=12.50]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 490.02it/s, env_step=19456, len=7, n/ep=9, n/st=64, player_1/loss=168.640, player_2/loss=242.772, rew=19.44]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 488.42it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=400.168, player_2/loss=240.047, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 488.47it/s, env_step=2048, len=14, n/ep=4, n/st=64, player_1/loss=281.946, player_2/loss=251.935, rew=25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 494.99it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=215.375, player_2/loss=233.067, rew=-18.75]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 494.17it/s, env_step=4096, len=13, n/ep=6, n/st=64, player_1/loss=207.738, player_2/loss=163.306, rew=0.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 493.47it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=147.215, player_2/loss=195.863, rew=-25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:02, 492.19it/s, env_step=6144, len=8, n/ep=7, n/st=64, player_1/loss=99.587, player_2/loss=215.949, rew=-17.86]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:02, 489.95it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=87.711, player_2/loss=228.211, rew=-12.50]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:02, 489.57it/s, env_step=8192, len=12, n/ep=5, n/st=64, player_1/loss=121.584, player_2/loss=247.458, rew=-5.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:02, 490.96it/s, env_step=9216, len=8, n/ep=9, n/st=64, player_1/loss=169.881, player_2/loss=217.938, rew=-13.89]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:02, 491.66it/s, env_step=10240, len=8, n/ep=8, n/st=64, player_1/loss=134.164, player_2/loss=189.067, rew=-12.50]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:02, 491.75it/s, env_step=11264, len=15, n/ep=4, n/st=64, player_1/loss=181.309, player_2/loss=212.054, rew=25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:02, 494.02it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=211.737, player_2/loss=195.044, rew=-12.50]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:02, 493.44it/s, env_step=13312, len=14, n/ep=5, n/st=64, player_1/loss=159.436, player_2/loss=178.016, rew=15.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:02, 491.52it/s, env_step=14336, len=9, n/ep=7, n/st=64, player_1/loss=168.566, player_2/loss=161.815, rew=3.57]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:02, 488.50it/s, env_step=15360, len=7, n/ep=8, n/st=64, player_1/loss=180.316, player_2/loss=241.608, rew=-18.75]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:02, 494.53it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=151.554, player_2/loss=265.845, rew=-19.44]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:02, 494.89it/s, env_step=17408, len=14, n/ep=4, n/st=64, player_1/loss=116.266, player_2/loss=269.670, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:02, 491.98it/s, env_step=18432, len=9, n/ep=6, n/st=64, player_1/loss=179.866, player_2/loss=180.137, rew=-8.33]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:02, 490.96it/s, env_step=19456, len=10, n/ep=6, n/st=64, player_1/loss=219.548, player_2/loss=196.451, rew=0.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:02, 486.92it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=165.336, player_2/loss=254.796, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 492.14it/s, env_step=2048, len=7, n/ep=7, n/st=64, player_1/loss=195.316, player_2/loss=258.457, rew=25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 485.34it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=198.269, player_2/loss=288.043, rew=6.25]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 487.74it/s, env_step=4096, len=7, n/ep=9, n/st=64, player_1/loss=194.887, player_2/loss=275.584, rew=19.44]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 489.96it/s, env_step=5120, len=8, n/ep=6, n/st=64, player_1/loss=154.619, player_2/loss=276.317, rew=16.67]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 490.86it/s, env_step=6144, len=10, n/ep=6, n/st=64, player_1/loss=142.102, player_2/loss=270.150, rew=8.33]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 490.10it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=131.296, player_2/loss=277.308, rew=12.50]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 490.72it/s, env_step=8192, len=9, n/ep=7, n/st=64, player_1/loss=168.549, player_2/loss=285.445, rew=3.57]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 494.17it/s, env_step=9216, len=8, n/ep=8, n/st=64, player_1/loss=189.987, player_2/loss=219.625, rew=6.25]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 488.39it/s, env_step=10240, len=8, n/ep=7, n/st=64, player_1/loss=182.920, player_2/loss=199.623, rew=10.71]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 488.15it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=154.872, player_2/loss=246.138, rew=12.50]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 490.51it/s, env_step=12288, len=9, n/ep=7, n/st=64, player_1/loss=147.388, player_2/loss=247.157, rew=3.57]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 490.07it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=145.933, player_2/loss=222.473, rew=25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 491.79it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=110.957, player_2/loss=234.849, rew=19.44]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 490.51it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=154.340, player_2/loss=246.310, rew=-3.57]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 489.36it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=183.107, player_2/loss=264.777, rew=6.25]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 491.26it/s, env_step=17408, len=8, n/ep=8, n/st=64, player_1/loss=191.902, player_2/loss=251.600, rew=18.75]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 493.85it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=291.111, player_2/loss=260.218, rew=17.86]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 488.95it/s, env_step=19456, len=9, n/ep=6, n/st=64, player_1/loss=229.431, player_2/loss=269.036, rew=16.67]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 490.81it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=128.045, player_2/loss=236.603, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 490.03it/s, env_step=2048, len=7, n/ep=7, n/st=64, player_1/loss=190.396, player_2/loss=250.511, rew=-25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #3: 1025it [00:02, 491.44it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=201.295, player_2/loss=283.130, rew=-6.25]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #4: 1025it [00:02, 493.21it/s, env_step=4096, len=7, n/ep=9, n/st=64, player_1/loss=208.306, player_2/loss=284.179, rew=-19.44]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #5: 1025it [00:02, 497.16it/s, env_step=5120, len=8, n/ep=6, n/st=64, player_1/loss=157.312, player_2/loss=249.832, rew=-16.67]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #6: 1025it [00:02, 491.75it/s, env_step=6144, len=10, n/ep=6, n/st=64, player_1/loss=141.989, player_2/loss=255.702, rew=-8.33]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #7: 1025it [00:02, 490.73it/s, env_step=7168, len=7, n/ep=9, n/st=64, player_1/loss=114.625, player_2/loss=259.966, rew=-13.89]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #8: 1025it [00:02, 493.83it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=226.993, player_2/loss=275.969, rew=-12.50]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #9: 1025it [00:02, 489.76it/s, env_step=9216, len=8, n/ep=8, n/st=64, player_1/loss=232.544, player_2/loss=257.663, rew=-6.25]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #10: 1025it [00:02, 490.54it/s, env_step=10240, len=8, n/ep=7, n/st=64, player_1/loss=148.771, player_2/loss=241.348, rew=-10.71]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #11: 1025it [00:02, 492.60it/s, env_step=11264, len=14, n/ep=4, n/st=64, player_2/loss=204.030, rew=25.00]       


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #12: 1025it [00:02, 494.64it/s, env_step=12288, len=9, n/ep=6, n/st=64, player_1/loss=161.355, player_2/loss=177.695, rew=0.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #13: 1025it [00:02, 487.86it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=193.321, player_2/loss=146.419, rew=-17.86]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #14: 1025it [00:02, 490.86it/s, env_step=14336, len=9, n/ep=8, n/st=64, player_1/loss=232.474, player_2/loss=195.661, rew=-6.25]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #15: 1025it [00:02, 491.59it/s, env_step=15360, len=8, n/ep=7, n/st=64, player_1/loss=234.061, player_2/loss=285.864, rew=-10.71]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #16: 1025it [00:02, 495.14it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=259.254, player_2/loss=292.769, rew=-18.75]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #17: 1025it [00:02, 493.77it/s, env_step=17408, len=9, n/ep=8, n/st=64, player_1/loss=240.097, player_2/loss=274.902, rew=-12.50]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #18: 1025it [00:02, 489.64it/s, env_step=18432, len=7, n/ep=7, n/st=64, player_1/loss=229.590, player_2/loss=165.147, rew=-3.57]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #19: 1025it [00:02, 492.00it/s, env_step=19456, len=10, n/ep=6, n/st=64, player_1/loss=228.085, player_2/loss=157.809, rew=0.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #1: 1025it [00:02, 489.51it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=208.225, player_2/loss=308.403, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 489.71it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=170.790, player_2/loss=318.071, rew=25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 491.77it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=173.171, player_2/loss=311.724, rew=6.25]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 490.34it/s, env_step=4096, len=7, n/ep=8, n/st=64, player_1/loss=246.150, player_2/loss=306.997, rew=12.50]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 487.78it/s, env_step=5120, len=8, n/ep=6, n/st=64, player_1/loss=201.028, player_2/loss=278.570, rew=16.67]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 486.98it/s, env_step=6144, len=10, n/ep=6, n/st=64, player_1/loss=190.188, player_2/loss=261.007, rew=8.33]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 493.78it/s, env_step=7168, len=7, n/ep=9, n/st=64, player_1/loss=182.907, player_2/loss=268.352, rew=13.89]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 489.04it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=292.543, player_2/loss=278.041, rew=12.50]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 489.60it/s, env_step=9216, len=10, n/ep=6, n/st=64, player_1/loss=199.054, player_2/loss=257.568, rew=0.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 486.83it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=111.778, player_2/loss=237.871, rew=6.25]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 490.06it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=135.148, player_2/loss=232.716, rew=12.50]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 488.68it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=169.293, player_2/loss=218.425, rew=-3.57]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 488.32it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=178.422, player_2/loss=212.479, rew=18.75]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 493.42it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=166.404, player_2/loss=236.629, rew=19.44]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 493.43it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=152.031, player_2/loss=246.236, rew=-3.57]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 488.93it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=85.782, player_2/loss=231.638, rew=6.25]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 490.23it/s, env_step=17408, len=8, n/ep=8, n/st=64, player_1/loss=121.432, player_2/loss=221.078, rew=18.75]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 493.41it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=206.806, player_2/loss=229.762, rew=3.57]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 488.72it/s, env_step=19456, len=9, n/ep=6, n/st=64, player_1/loss=188.773, player_2/loss=241.238, rew=16.67]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 493.05it/s, env_step=1024, len=7, n/ep=9, n/st=64, player_1/loss=215.989, player_2/loss=290.874, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 492.43it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=182.026, player_2/loss=310.510, rew=-25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #3: 1025it [00:02, 494.18it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=243.579, player_2/loss=308.635, rew=-6.25]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #4: 1025it [00:02, 496.06it/s, env_step=4096, len=7, n/ep=8, n/st=64, player_1/loss=307.220, player_2/loss=300.241, rew=-12.50]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #5: 1025it [00:02, 491.31it/s, env_step=5120, len=8, n/ep=8, n/st=64, player_1/loss=209.666, player_2/loss=296.774, rew=-6.25]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #6: 1025it [00:02, 491.53it/s, env_step=6144, len=10, n/ep=6, n/st=64, player_1/loss=188.412, player_2/loss=285.640, rew=-8.33]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #7: 1025it [00:02, 485.82it/s, env_step=7168, len=7, n/ep=9, n/st=64, player_1/loss=205.048, player_2/loss=248.542, rew=-13.89]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #8: 1025it [00:02, 493.64it/s, env_step=8192, len=9, n/ep=7, n/st=64, player_1/loss=176.533, player_2/loss=233.258, rew=-3.57]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #9: 1025it [00:02, 498.71it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=254.486, player_2/loss=203.091, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #10: 1025it [00:02, 494.89it/s, env_step=10240, len=10, n/ep=6, n/st=64, player_1/loss=344.063, player_2/loss=183.846, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #11: 1025it [00:02, 491.94it/s, env_step=11264, len=10, n/ep=7, n/st=64, player_1/loss=240.361, player_2/loss=196.641, rew=10.71]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #12: 1025it [00:02, 493.92it/s, env_step=12288, len=10, n/ep=7, n/st=64, player_1/loss=261.096, player_2/loss=217.839, rew=3.57]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #13: 1025it [00:02, 489.36it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=275.742, player_2/loss=206.661, rew=-17.86]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #14: 1025it [00:02, 491.42it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=177.721, player_2/loss=205.405, rew=-18.75]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #15: 1025it [00:02, 494.44it/s, env_step=15360, len=11, n/ep=6, n/st=64, player_1/loss=141.552, rew=0.00]        


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #16: 1025it [00:02, 492.31it/s, env_step=16384, len=9, n/ep=7, n/st=64, player_1/loss=230.278, player_2/loss=287.614, rew=-17.86]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #17: 1025it [00:02, 492.25it/s, env_step=17408, len=10, n/ep=7, n/st=64, player_1/loss=206.768, player_2/loss=284.527, rew=-3.57]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #18: 1025it [00:02, 494.16it/s, env_step=18432, len=10, n/ep=6, n/st=64, player_1/loss=197.230, player_2/loss=310.696, rew=8.33]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #19: 1025it [00:02, 490.29it/s, env_step=19456, len=10, n/ep=7, n/st=64, player_1/loss=229.137, player_2/loss=310.145, rew=-10.71]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #2


Epoch #1: 1025it [00:02, 492.43it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=174.276, player_2/loss=282.456, rew=18.75]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 493.07it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=144.629, player_2/loss=213.780, rew=12.50]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 490.26it/s, env_step=3072, len=7, n/ep=8, n/st=64, player_1/loss=243.587, player_2/loss=199.635, rew=18.75]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 493.75it/s, env_step=4096, len=11, n/ep=5, n/st=64, player_1/loss=282.697, player_2/loss=258.191, rew=-5.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 493.76it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=154.273, player_2/loss=247.487, rew=-3.57]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 492.80it/s, env_step=6144, len=8, n/ep=7, n/st=64, player_1/loss=189.391, player_2/loss=224.022, rew=3.57]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 490.66it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=249.756, player_2/loss=232.694, rew=6.25]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 491.94it/s, env_step=8192, len=9, n/ep=6, n/st=64, player_1/loss=272.663, player_2/loss=258.254, rew=0.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 494.38it/s, env_step=9216, len=8, n/ep=8, n/st=64, player_1/loss=276.807, player_2/loss=264.737, rew=12.50]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 486.62it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=258.842, player_2/loss=255.331, rew=18.75]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 493.04it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=246.927, player_2/loss=250.219, rew=18.75]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 493.94it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_2/loss=224.953, rew=12.50]        


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 491.75it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=139.766, player_2/loss=206.412, rew=19.44]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 490.87it/s, env_step=14336, len=7, n/ep=8, n/st=64, player_1/loss=189.082, player_2/loss=263.029, rew=18.75]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 490.95it/s, env_step=15360, len=14, n/ep=5, n/st=64, player_1/loss=191.102, player_2/loss=277.502, rew=5.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 491.24it/s, env_step=16384, len=10, n/ep=6, n/st=64, player_1/loss=154.652, player_2/loss=262.432, rew=0.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 488.81it/s, env_step=17408, len=8, n/ep=8, n/st=64, player_1/loss=244.580, player_2/loss=228.353, rew=12.50]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 489.23it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=240.363, player_2/loss=227.088, rew=10.71]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 480.65it/s, env_step=19456, len=8, n/ep=9, n/st=64, player_1/loss=187.535, player_2/loss=271.195, rew=13.89]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 489.34it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=351.606, player_2/loss=209.944, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 491.05it/s, env_step=2048, len=9, n/ep=7, n/st=64, player_1/loss=280.023, player_2/loss=227.976, rew=17.86]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 488.67it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=291.607, player_2/loss=247.874, rew=-12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 494.25it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=303.309, player_2/loss=258.021, rew=8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 489.63it/s, env_step=5120, len=8, n/ep=8, n/st=64, player_1/loss=177.875, player_2/loss=273.720, rew=-25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:02, 491.19it/s, env_step=6144, len=8, n/ep=7, n/st=64, player_1/loss=210.831, player_2/loss=254.575, rew=-3.57]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:02, 492.65it/s, env_step=7168, len=7, n/ep=7, n/st=64, player_1/loss=249.297, player_2/loss=257.791, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:02, 491.23it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=198.179, player_2/loss=259.445, rew=-12.50]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:02, 491.48it/s, env_step=9216, len=12, n/ep=5, n/st=64, player_1/loss=207.945, player_2/loss=232.921, rew=15.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:02, 490.99it/s, env_step=10240, len=11, n/ep=6, n/st=64, player_2/loss=232.842, rew=25.00]       


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:02, 495.82it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=274.786, player_2/loss=249.766, rew=-12.50]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:02, 495.75it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=218.185, player_2/loss=260.098, rew=-18.75]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:02, 491.90it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=269.995, player_2/loss=288.906, rew=-19.44]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:02, 490.06it/s, env_step=14336, len=7, n/ep=8, n/st=64, player_1/loss=265.247, player_2/loss=268.923, rew=-18.75]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:02, 492.02it/s, env_step=15360, len=8, n/ep=7, n/st=64, player_1/loss=213.502, player_2/loss=271.682, rew=-17.86]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:02, 495.13it/s, env_step=16384, len=9, n/ep=7, n/st=64, player_1/loss=258.266, player_2/loss=252.682, rew=-10.71]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:02, 494.57it/s, env_step=17408, len=8, n/ep=8, n/st=64, player_1/loss=343.388, player_2/loss=300.804, rew=0.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:02, 496.31it/s, env_step=18432, len=10, n/ep=6, n/st=64, player_1/loss=337.215, player_2/loss=297.001, rew=0.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:02, 491.84it/s, env_step=19456, len=7, n/ep=8, n/st=64, player_1/loss=288.788, player_2/loss=244.027, rew=0.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:02, 486.42it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=160.644, player_2/loss=276.644, rew=18.75]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 494.95it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=243.005, player_2/loss=250.968, rew=13.89]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 475.69it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=258.279, player_2/loss=217.020, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 480.64it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=219.840, player_2/loss=235.287, rew=0.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 487.03it/s, env_step=5120, len=9, n/ep=7, n/st=64, player_1/loss=143.604, player_2/loss=260.376, rew=-3.57]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 486.58it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=177.380, player_2/loss=295.055, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 491.85it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=274.731, player_2/loss=288.995, rew=12.50]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 488.17it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=188.330, player_2/loss=291.090, rew=8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 491.03it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=263.648, player_2/loss=298.780, rew=13.89]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 490.26it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=296.479, player_2/loss=280.663, rew=18.75]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 486.76it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=239.126, player_2/loss=301.979, rew=12.50]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 485.29it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=180.423, player_2/loss=291.863, rew=13.89]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 491.97it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=100.861, player_2/loss=254.390, rew=12.50]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 493.38it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=131.242, player_2/loss=280.714, rew=3.57]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 486.53it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=113.915, player_2/loss=294.706, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 487.56it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=111.252, player_2/loss=282.284, rew=19.44]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 489.59it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=215.454, player_2/loss=243.456, rew=13.89]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 487.54it/s, env_step=18432, len=7, n/ep=9, n/st=64, player_1/loss=158.740, rew=19.44]        


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 488.43it/s, env_step=19456, len=8, n/ep=6, n/st=64, player_1/loss=129.680, player_2/loss=251.948, rew=0.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 493.41it/s, env_step=1024, len=9, n/ep=6, n/st=64, player_1/loss=304.134, player_2/loss=271.278, rew=8.33]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:02, 492.78it/s, env_step=2048, len=9, n/ep=6, n/st=64, player_1/loss=292.910, player_2/loss=228.811, rew=8.33]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:02, 488.15it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=332.965, player_2/loss=209.417, rew=0.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:02, 491.48it/s, env_step=4096, len=12, n/ep=5, n/st=64, player_1/loss=281.572, player_2/loss=253.077, rew=5.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:02, 492.63it/s, env_step=5120, len=8, n/ep=6, n/st=64, player_1/loss=251.975, player_2/loss=211.677, rew=0.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #6: 1025it [00:02, 495.11it/s, env_step=6144, len=12, n/ep=5, n/st=64, player_1/loss=348.644, player_2/loss=215.396, rew=15.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #7: 1025it [00:02, 492.37it/s, env_step=7168, len=11, n/ep=5, n/st=64, player_1/loss=500.580, player_2/loss=256.978, rew=-5.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #8: 1025it [00:02, 467.68it/s, env_step=8192, len=15, n/ep=5, n/st=64, player_1/loss=364.890, player_2/loss=235.009, rew=-5.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #9: 1025it [00:02, 485.54it/s, env_step=9216, len=10, n/ep=7, n/st=64, player_1/loss=164.779, player_2/loss=214.012, rew=17.86]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #10: 1025it [00:02, 484.59it/s, env_step=10240, len=9, n/ep=7, n/st=64, player_1/loss=264.647, player_2/loss=219.964, rew=10.71]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #11: 1025it [00:02, 404.40it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=251.815, player_2/loss=180.167, rew=-8.33]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #12: 1025it [00:02, 485.36it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=339.888, rew=0.00]         


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #13: 1025it [00:02, 473.06it/s, env_step=13312, len=11, n/ep=7, n/st=64, player_1/loss=318.866, player_2/loss=210.369, rew=10.71]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #14: 1025it [00:02, 390.82it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=197.863, player_2/loss=236.195, rew=-18.75]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #15: 1025it [00:02, 350.27it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=271.978, player_2/loss=241.578, rew=10.71]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #16: 1025it [00:02, 360.45it/s, env_step=16384, len=9, n/ep=8, n/st=64, player_1/loss=260.201, player_2/loss=213.489, rew=6.25]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #17: 1025it [00:03, 333.06it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=199.348, player_2/loss=190.871, rew=-19.44]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #18: 1025it [00:02, 342.41it/s, env_step=18432, len=11, n/ep=5, n/st=64, player_1/loss=155.138, player_2/loss=161.983, rew=-5.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #19: 1025it [00:03, 306.91it/s, env_step=19456, len=7, n/ep=8, n/st=64, player_1/loss=162.367, player_2/loss=220.715, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #1: 1025it [00:03, 336.26it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=129.389, player_2/loss=246.636, rew=18.75]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 325.98it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=230.516, player_2/loss=265.798, rew=18.75]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 321.33it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=281.505, player_2/loss=236.882, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 340.21it/s, env_step=4096, len=14, n/ep=6, n/st=64, player_1/loss=230.818, player_2/loss=238.737, rew=0.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 331.81it/s, env_step=5120, len=7, n/ep=7, n/st=64, player_1/loss=171.154, player_2/loss=244.233, rew=17.86]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 330.77it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=135.966, player_2/loss=240.436, rew=12.50]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 336.91it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=148.777, player_2/loss=220.943, rew=18.75]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 336.34it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=116.941, player_2/loss=269.263, rew=8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 334.20it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=215.748, player_2/loss=292.286, rew=13.89]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 315.73it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=265.860, player_2/loss=294.391, rew=18.75]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 328.93it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=302.908, player_2/loss=293.583, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 340.46it/s, env_step=12288, len=10, n/ep=7, n/st=64, player_2/loss=278.558, rew=10.71]       


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 352.81it/s, env_step=13312, len=8, n/ep=8, n/st=64, player_1/loss=180.335, player_2/loss=233.875, rew=12.50]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 352.01it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=194.216, player_2/loss=253.590, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 333.28it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=140.720, player_2/loss=262.079, rew=19.44]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 278.38it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=84.839, player_2/loss=244.958, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 261.47it/s, env_step=17408, len=9, n/ep=7, n/st=64, player_1/loss=206.774, player_2/loss=237.961, rew=10.71]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:03, 298.52it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=187.502, player_2/loss=244.477, rew=6.25]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 308.63it/s, env_step=19456, len=7, n/ep=8, n/st=64, player_1/loss=100.500, player_2/loss=240.605, rew=6.25]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 308.34it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=388.858, player_2/loss=240.471, rew=-18.75]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 330.12it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=394.766, player_2/loss=249.535, rew=-18.75]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 326.91it/s, env_step=3072, len=10, n/ep=6, n/st=64, player_1/loss=223.717, player_2/loss=214.118, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 342.44it/s, env_step=4096, len=12, n/ep=5, n/st=64, player_1/loss=302.029, player_2/loss=226.231, rew=5.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 317.76it/s, env_step=5120, len=9, n/ep=7, n/st=64, player_1/loss=268.216, player_2/loss=246.837, rew=17.86]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:03, 312.23it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=173.062, player_2/loss=232.729, rew=-12.50]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:02, 343.31it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=178.821, player_2/loss=220.070, rew=-12.50]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:02, 351.36it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=183.994, player_2/loss=228.073, rew=-8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:03, 339.34it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=253.391, player_2/loss=230.707, rew=-6.25]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:02, 366.00it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=242.870, player_2/loss=223.561, rew=-2.78]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:05, 201.28it/s, env_step=11264, len=9, n/ep=7, n/st=64, player_2/loss=229.279, rew=17.86]        


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:05, 198.54it/s, env_step=12288, len=11, n/ep=6, n/st=64, player_1/loss=269.246, player_2/loss=243.201, rew=8.33]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:03, 283.34it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=343.414, player_2/loss=227.835, rew=-3.57]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:03, 274.48it/s, env_step=14336, len=11, n/ep=5, n/st=64, player_1/loss=259.665, player_2/loss=192.592, rew=-15.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:03, 272.20it/s, env_step=15360, len=7, n/ep=8, n/st=64, player_1/loss=276.393, player_2/loss=225.612, rew=-12.50]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:03, 260.07it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=253.232, player_2/loss=265.742, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:04, 237.84it/s, env_step=17408, len=9, n/ep=7, n/st=64, player_1/loss=237.499, player_2/loss=303.790, rew=17.86]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:03, 262.61it/s, env_step=18432, len=10, n/ep=6, n/st=64, player_1/loss=254.873, player_2/loss=278.613, rew=16.67]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:03, 278.31it/s, env_step=19456, len=13, n/ep=5, n/st=64, player_1/loss=326.447, player_2/loss=231.403, rew=15.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:03, 281.91it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=180.326, player_2/loss=243.720, rew=18.75]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 271.94it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=226.652, player_2/loss=268.138, rew=18.75]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 272.63it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=314.814, player_2/loss=253.462, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 268.91it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=272.987, player_2/loss=229.472, rew=10.71]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 257.53it/s, env_step=5120, len=7, n/ep=7, n/st=64, player_1/loss=196.952, player_2/loss=227.756, rew=17.86]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 262.21it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=169.408, player_2/loss=244.880, rew=0.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 309.63it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=153.629, player_2/loss=219.162, rew=18.75]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 282.29it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=84.551, player_2/loss=256.878, rew=8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 300.85it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=204.003, player_2/loss=284.982, rew=2.78]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 280.00it/s, env_step=10240, len=8, n/ep=8, n/st=64, player_1/loss=228.037, player_2/loss=272.931, rew=18.75]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 270.57it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=282.163, player_2/loss=273.937, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 283.75it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=224.137, rew=18.75]        


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 265.30it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=189.574, player_2/loss=259.049, rew=12.50]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 257.96it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=197.337, player_2/loss=279.355, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 261.33it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=131.138, player_2/loss=293.036, rew=19.44]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 257.65it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=116.353, player_2/loss=300.767, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:04, 245.55it/s, env_step=17408, len=8, n/ep=7, n/st=64, player_1/loss=207.084, player_2/loss=301.739, rew=10.71]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:03, 265.55it/s, env_step=18432, len=8, n/ep=9, n/st=64, player_1/loss=164.359, player_2/loss=292.289, rew=2.78]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 259.24it/s, env_step=19456, len=8, n/ep=8, n/st=64, player_1/loss=171.072, player_2/loss=252.953, rew=12.50]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 284.32it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=273.235, player_2/loss=294.062, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 300.58it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=282.382, player_2/loss=251.722, rew=-19.44]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 315.38it/s, env_step=3072, len=9, n/ep=7, n/st=64, player_1/loss=285.991, player_2/loss=239.701, rew=-17.86]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 308.07it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=244.600, player_2/loss=228.022, rew=-10.71]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 290.77it/s, env_step=5120, len=15, n/ep=4, n/st=64, player_1/loss=257.429, player_2/loss=223.045, rew=12.50]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:03, 293.48it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=241.369, player_2/loss=217.156, rew=-12.50]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:03, 291.48it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=193.759, player_2/loss=199.780, rew=-18.75]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:03, 287.18it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=193.490, player_2/loss=231.470, rew=-8.33]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:03, 284.17it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=278.414, player_2/loss=274.743, rew=-2.78]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:03, 286.08it/s, env_step=10240, len=9, n/ep=7, n/st=64, player_1/loss=302.197, player_2/loss=246.780, rew=-25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:03, 291.68it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=334.765, player_2/loss=254.129, rew=-19.44]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:03, 296.50it/s, env_step=12288, len=9, n/ep=8, n/st=64, player_1/loss=206.497, rew=-18.75]       


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:03, 295.98it/s, env_step=13312, len=10, n/ep=7, n/st=64, player_1/loss=169.030, player_2/loss=218.114, rew=10.71]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:03, 290.52it/s, env_step=14336, len=10, n/ep=6, n/st=64, player_1/loss=198.258, player_2/loss=209.266, rew=16.67]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:03, 288.92it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=259.096, player_2/loss=213.381, rew=10.71]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:03, 290.08it/s, env_step=16384, len=9, n/ep=7, n/st=64, player_1/loss=234.922, player_2/loss=249.283, rew=-17.86]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:03, 296.06it/s, env_step=17408, len=13, n/ep=5, n/st=64, player_1/loss=157.766, player_2/loss=262.638, rew=-5.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:03, 304.28it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=229.923, player_2/loss=255.074, rew=-18.75]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:03, 289.76it/s, env_step=19456, len=8, n/ep=8, n/st=64, player_1/loss=273.829, player_2/loss=277.323, rew=-18.75]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:03, 283.35it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=166.026, player_2/loss=270.733, rew=18.75]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:04, 233.90it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=228.279, player_2/loss=268.561, rew=19.44]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:04, 237.93it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=304.191, player_2/loss=223.537, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 292.61it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=278.499, player_2/loss=204.146, rew=10.71]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 319.41it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=211.671, player_2/loss=256.868, rew=-3.57]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 275.14it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=149.419, player_2/loss=263.050, rew=0.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 299.99it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=168.510, player_2/loss=233.990, rew=18.75]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 303.17it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=159.232, player_2/loss=265.917, rew=13.89]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 310.42it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=244.831, player_2/loss=293.887, rew=2.78]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 295.21it/s, env_step=10240, len=8, n/ep=8, n/st=64, player_1/loss=245.111, player_2/loss=255.899, rew=18.75]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 320.74it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=279.156, player_2/loss=267.736, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 325.81it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=221.727, rew=18.75]        


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 353.52it/s, env_step=13312, len=8, n/ep=8, n/st=64, player_1/loss=170.385, player_2/loss=258.758, rew=12.50]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 267.58it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=187.956, player_2/loss=278.253, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 293.61it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=109.139, player_2/loss=292.290, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 317.66it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=124.298, player_2/loss=285.019, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 326.11it/s, env_step=17408, len=8, n/ep=7, n/st=64, player_1/loss=254.063, player_2/loss=286.701, rew=10.71]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:03, 330.90it/s, env_step=18432, len=8, n/ep=9, n/st=64, player_1/loss=225.746, player_2/loss=271.927, rew=2.78]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 341.41it/s, env_step=19456, len=8, n/ep=8, n/st=64, player_1/loss=188.412, player_2/loss=270.016, rew=12.50]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 313.48it/s, env_step=1024, len=7, n/ep=8, n/st=64, player_1/loss=354.317, player_2/loss=282.757, rew=-18.75]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 322.91it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=337.396, player_2/loss=265.687, rew=-19.44]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 332.83it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=337.198, player_2/loss=229.267, rew=-12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 338.95it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=275.604, player_2/loss=220.765, rew=-10.71]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 335.49it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=205.762, player_2/loss=240.971, rew=3.57]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:03, 322.20it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=143.969, player_2/loss=256.793, rew=0.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:03, 322.66it/s, env_step=7168, len=7, n/ep=8, n/st=64, player_1/loss=162.479, player_2/loss=218.417, rew=-18.75]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:03, 331.15it/s, env_step=8192, len=9, n/ep=6, n/st=64, player_1/loss=205.292, player_2/loss=229.191, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:03, 325.06it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=233.494, player_2/loss=248.834, rew=-6.25]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:02, 348.81it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=273.151, player_2/loss=248.897, rew=-6.25]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:03, 314.90it/s, env_step=11264, len=9, n/ep=7, n/st=64, player_2/loss=283.180, rew=25.00]        


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:03, 318.37it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=234.890, player_2/loss=272.145, rew=3.57]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:03, 327.55it/s, env_step=13312, len=9, n/ep=7, n/st=64, player_1/loss=256.797, player_2/loss=247.077, rew=17.86]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:02, 345.30it/s, env_step=14336, len=7, n/ep=8, n/st=64, player_1/loss=235.048, player_2/loss=298.960, rew=-6.25]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:03, 321.32it/s, env_step=15360, len=8, n/ep=7, n/st=64, player_1/loss=257.216, player_2/loss=302.585, rew=-3.57]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:03, 271.74it/s, env_step=16384, len=7, n/ep=8, n/st=64, player_1/loss=258.250, player_2/loss=284.722, rew=-12.50]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:04, 220.13it/s, env_step=17408, len=7, n/ep=9, n/st=64, player_1/loss=165.111, player_2/loss=278.005, rew=-13.89]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:04, 239.34it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=260.385, player_2/loss=277.500, rew=3.57]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:04, 230.70it/s, env_step=19456, len=7, n/ep=9, n/st=64, player_1/loss=262.078, player_2/loss=254.708, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:03, 296.87it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=97.730, player_2/loss=244.492, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 298.43it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=136.951, player_2/loss=283.995, rew=13.89]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 300.03it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=232.953, player_2/loss=286.893, rew=6.25]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 312.34it/s, env_step=4096, len=11, n/ep=7, n/st=64, player_1/loss=225.713, player_2/loss=231.893, rew=-3.57]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 317.77it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=133.923, player_2/loss=216.819, rew=-3.57]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 324.93it/s, env_step=6144, len=10, n/ep=7, n/st=64, player_1/loss=116.243, player_2/loss=212.388, rew=17.86]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 327.33it/s, env_step=7168, len=8, n/ep=7, n/st=64, player_1/loss=151.715, player_2/loss=222.713, rew=10.71]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 324.81it/s, env_step=8192, len=8, n/ep=9, n/st=64, player_1/loss=204.300, player_2/loss=266.589, rew=8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 295.70it/s, env_step=9216, len=8, n/ep=8, n/st=64, player_1/loss=296.755, player_2/loss=272.943, rew=0.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 296.69it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=261.299, player_2/loss=238.654, rew=18.75]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 314.72it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=208.928, player_2/loss=226.461, rew=12.50]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 343.71it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_2/loss=267.441, rew=12.50]        


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 297.35it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=148.779, player_2/loss=273.431, rew=10.71]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 339.84it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=153.529, player_2/loss=278.028, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 316.79it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=90.445, player_2/loss=289.026, rew=19.44]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 342.07it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=111.349, player_2/loss=279.545, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 346.07it/s, env_step=17408, len=8, n/ep=7, n/st=64, player_1/loss=241.717, player_2/loss=271.873, rew=10.71]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 345.67it/s, env_step=18432, len=7, n/ep=9, n/st=64, player_1/loss=207.725, player_2/loss=287.693, rew=13.89]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 318.57it/s, env_step=19456, len=8, n/ep=8, n/st=64, player_1/loss=179.629, player_2/loss=258.892, rew=12.50]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 328.61it/s, env_step=1024, len=10, n/ep=6, n/st=64, player_1/loss=183.293, player_2/loss=243.018, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:03, 341.49it/s, env_step=2048, len=13, n/ep=5, n/st=64, player_1/loss=299.490, player_2/loss=248.026, rew=5.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:03, 328.37it/s, env_step=3072, len=15, n/ep=5, n/st=64, player_1/loss=300.974, player_2/loss=233.311, rew=5.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:03, 329.86it/s, env_step=4096, len=8, n/ep=7, n/st=64, player_1/loss=221.611, player_2/loss=251.905, rew=3.57]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:03, 330.95it/s, env_step=5120, len=10, n/ep=6, n/st=64, player_1/loss=265.243, player_2/loss=251.367, rew=-8.33]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #6: 1025it [00:03, 291.30it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=217.948, player_2/loss=257.529, rew=0.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #7: 1025it [00:03, 288.42it/s, env_step=7168, len=9, n/ep=7, n/st=64, player_1/loss=208.058, player_2/loss=235.286, rew=17.86]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #8: 1025it [00:03, 262.61it/s, env_step=8192, len=7, n/ep=9, n/st=64, player_1/loss=181.705, player_2/loss=216.196, rew=-8.33]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #9: 1025it [00:03, 267.50it/s, env_step=9216, len=9, n/ep=6, n/st=64, player_1/loss=221.597, player_2/loss=241.783, rew=-16.67]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #10: 1025it [00:03, 283.86it/s, env_step=10240, len=12, n/ep=5, n/st=64, player_1/loss=238.455, player_2/loss=250.339, rew=5.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #11: 1025it [00:03, 273.85it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=187.182, player_2/loss=259.158, rew=-12.50]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #12: 1025it [00:03, 264.23it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=237.124, player_2/loss=239.593, rew=-19.44]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #13: 1025it [00:03, 284.28it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=249.203, player_2/loss=251.378, rew=-19.44]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #14: 1025it [00:03, 285.17it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=288.474, player_2/loss=233.226, rew=-19.44]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #15: 1025it [00:03, 279.50it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=278.169, player_2/loss=264.592, rew=-12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #16: 1025it [00:03, 280.21it/s, env_step=16384, len=7, n/ep=8, n/st=64, player_1/loss=228.750, player_2/loss=270.925, rew=-12.50]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #17: 1025it [00:03, 283.50it/s, env_step=17408, len=9, n/ep=6, n/st=64, player_1/loss=238.865, player_2/loss=277.074, rew=16.67]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #18: 1025it [00:03, 280.11it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=241.845, player_2/loss=254.917, rew=0.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #19: 1025it [00:03, 278.70it/s, env_step=19456, len=10, n/ep=7, n/st=64, player_1/loss=189.213, player_2/loss=212.903, rew=-17.86]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #1: 1025it [00:03, 293.86it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=162.927, player_2/loss=313.954, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 295.20it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=256.299, player_2/loss=218.206, rew=12.50]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 288.67it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=311.496, player_2/loss=177.362, rew=6.25]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 304.45it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=209.316, player_2/loss=187.654, rew=0.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 301.14it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=181.759, player_2/loss=214.968, rew=3.57]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 286.83it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=179.168, player_2/loss=216.843, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 309.28it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=237.947, player_2/loss=249.509, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 303.81it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=267.138, player_2/loss=238.490, rew=-8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 279.89it/s, env_step=9216, len=11, n/ep=6, n/st=64, player_1/loss=178.901, player_2/loss=239.331, rew=8.33]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 302.02it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=159.421, player_2/loss=245.876, rew=13.89]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 306.76it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=130.404, player_2/loss=244.410, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 297.58it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=187.671, player_2/loss=224.789, rew=-3.57]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 309.10it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=257.780, player_2/loss=253.024, rew=13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 307.30it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=166.986, player_2/loss=261.352, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 311.03it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=158.581, player_2/loss=274.274, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 302.63it/s, env_step=16384, len=8, n/ep=7, n/st=64, player_1/loss=178.056, player_2/loss=250.946, rew=3.57]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 286.68it/s, env_step=17408, len=10, n/ep=6, n/st=64, player_1/loss=177.703, player_2/loss=226.444, rew=0.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:03, 314.78it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=168.851, player_2/loss=242.946, rew=12.50]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 331.79it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=231.734, player_2/loss=258.555, rew=-5.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 319.99it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=204.272, player_2/loss=251.028, rew=-6.25]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 293.42it/s, env_step=2048, len=10, n/ep=8, n/st=64, player_1/loss=218.645, player_2/loss=223.160, rew=-12.50]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 300.43it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=293.408, player_2/loss=220.093, rew=-6.25]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 278.27it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=210.138, player_2/loss=235.993, rew=0.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 301.50it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=148.803, player_2/loss=295.023, rew=-3.57]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:03, 289.26it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=282.637, player_2/loss=271.148, rew=-18.75]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:03, 307.72it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=345.292, player_2/loss=267.390, rew=0.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:03, 308.06it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=281.439, player_2/loss=274.867, rew=8.33]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:03, 314.50it/s, env_step=9216, len=9, n/ep=7, n/st=64, player_1/loss=177.736, player_2/loss=244.043, rew=-17.86]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:03, 319.79it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=175.620, player_2/loss=238.075, rew=-13.89]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:03, 299.97it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_2/loss=198.445, rew=-12.50]       


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:03, 301.39it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=190.256, player_2/loss=187.120, rew=3.57]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:03, 295.44it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=253.371, player_2/loss=226.923, rew=-13.89]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:03, 305.28it/s, env_step=14336, len=10, n/ep=6, n/st=64, player_1/loss=204.774, player_2/loss=244.190, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:03, 308.30it/s, env_step=15360, len=8, n/ep=7, n/st=64, player_1/loss=154.062, player_2/loss=233.271, rew=-10.71]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:03, 297.08it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=169.673, player_2/loss=224.683, rew=-12.50]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:03, 307.17it/s, env_step=17408, len=8, n/ep=8, n/st=64, player_1/loss=276.392, player_2/loss=256.878, rew=-6.25]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:03, 308.90it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=322.739, player_2/loss=251.834, rew=3.57]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:03, 296.71it/s, env_step=19456, len=11, n/ep=6, n/st=64, player_1/loss=228.254, player_2/loss=230.773, rew=0.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:03, 303.86it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=146.341, player_2/loss=293.776, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 298.38it/s, env_step=2048, len=10, n/ep=8, n/st=64, player_1/loss=227.495, player_2/loss=236.296, rew=12.50]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 266.35it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=212.630, player_2/loss=203.226, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 306.56it/s, env_step=4096, len=12, n/ep=5, n/st=64, player_1/loss=156.975, player_2/loss=204.701, rew=5.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 320.49it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=135.275, player_2/loss=266.196, rew=3.57]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 314.31it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=218.453, player_2/loss=237.938, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 295.73it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=274.873, player_2/loss=244.781, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 293.39it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=251.671, player_2/loss=264.493, rew=-8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 267.40it/s, env_step=9216, len=9, n/ep=5, n/st=64, player_1/loss=211.583, player_2/loss=234.439, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 261.82it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=182.879, player_2/loss=219.107, rew=13.89]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 283.68it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=111.762, player_2/loss=197.718, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 286.32it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=161.419, player_2/loss=214.628, rew=-3.57]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 287.91it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=251.276, player_2/loss=237.589, rew=13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 301.01it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=206.040, player_2/loss=248.065, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 316.21it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=204.105, player_2/loss=263.848, rew=10.71]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 312.48it/s, env_step=16384, len=10, n/ep=7, n/st=64, player_1/loss=264.139, rew=-10.71]      


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 301.83it/s, env_step=17408, len=12, n/ep=5, n/st=64, player_1/loss=205.027, player_2/loss=210.463, rew=-5.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:03, 328.59it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=221.952, player_2/loss=229.719, rew=12.50]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 330.45it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=258.480, player_2/loss=275.972, rew=-5.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 305.88it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=165.225, player_2/loss=245.479, rew=-3.57]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 310.66it/s, env_step=2048, len=10, n/ep=6, n/st=64, player_1/loss=209.171, player_2/loss=190.399, rew=25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 321.97it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=307.193, player_2/loss=193.724, rew=-12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 302.78it/s, env_step=4096, len=11, n/ep=6, n/st=64, player_1/loss=289.191, player_2/loss=236.586, rew=8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 286.72it/s, env_step=5120, len=11, n/ep=4, n/st=64, player_1/loss=171.499, player_2/loss=291.208, rew=0.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:03, 298.32it/s, env_step=6144, len=10, n/ep=6, n/st=64, player_1/loss=278.540, player_2/loss=284.405, rew=0.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:03, 290.45it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=334.603, player_2/loss=269.919, rew=0.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:04, 255.57it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=270.735, player_2/loss=260.217, rew=8.33]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:03, 295.25it/s, env_step=9216, len=9, n/ep=7, n/st=64, player_1/loss=167.341, player_2/loss=230.981, rew=-17.86]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:03, 287.03it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=152.175, player_2/loss=214.143, rew=-19.44]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:03, 281.28it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=152.553, player_2/loss=237.402, rew=-19.44]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:03, 285.36it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=205.009, player_2/loss=242.985, rew=3.57]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:03, 277.10it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=237.163, player_2/loss=243.009, rew=-13.89]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:03, 276.66it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=182.061, player_2/loss=252.903, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:03, 311.06it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=203.573, player_2/loss=262.973, rew=-10.71]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:03, 313.64it/s, env_step=16384, len=8, n/ep=7, n/st=64, player_1/loss=249.523, player_2/loss=265.648, rew=-3.57]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:03, 296.93it/s, env_step=17408, len=11, n/ep=5, n/st=64, player_1/loss=269.716, player_2/loss=243.167, rew=15.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:04, 231.28it/s, env_step=18432, len=12, n/ep=5, n/st=64, player_1/loss=314.469, player_2/loss=254.630, rew=15.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:05, 201.40it/s, env_step=19456, len=11, n/ep=6, n/st=64, player_1/loss=278.002, player_2/loss=262.068, rew=-8.33]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:08, 123.33it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=196.743, player_2/loss=270.651, rew=6.25]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:06, 153.76it/s, env_step=2048, len=10, n/ep=8, n/st=64, player_1/loss=210.583, player_2/loss=214.711, rew=12.50]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:05, 171.65it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=302.808, player_2/loss=204.708, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:04, 231.53it/s, env_step=4096, len=11, n/ep=6, n/st=64, player_1/loss=247.451, player_2/loss=228.224, rew=-8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:04, 253.04it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=186.717, player_2/loss=286.200, rew=3.57]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:04, 247.65it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=252.406, player_2/loss=265.795, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 382.74it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=254.656, player_2/loss=273.588, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 309.78it/s, env_step=8192, len=12, n/ep=7, n/st=64, player_1/loss=218.697, player_2/loss=254.008, rew=-3.57]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 320.36it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=244.680, player_2/loss=229.187, rew=18.75]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 336.20it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=248.536, player_2/loss=236.486, rew=13.89]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 328.80it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=165.304, player_2/loss=241.424, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 342.49it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=193.718, player_2/loss=242.987, rew=-3.57]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 327.99it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=213.517, player_2/loss=250.971, rew=13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 330.12it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=159.341, player_2/loss=256.531, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 330.27it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=202.630, player_2/loss=262.657, rew=10.71]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 329.61it/s, env_step=16384, len=8, n/ep=7, n/st=64, player_1/loss=250.185, player_2/loss=272.122, rew=3.57]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 331.74it/s, env_step=17408, len=11, n/ep=5, n/st=64, player_1/loss=196.976, player_2/loss=253.528, rew=-15.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:03, 337.91it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=161.046, player_2/loss=283.183, rew=10.71]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 328.95it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=191.367, player_2/loss=277.361, rew=-5.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 327.54it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=256.919, player_2/loss=269.199, rew=-6.25]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 336.32it/s, env_step=2048, len=8, n/ep=7, n/st=64, player_1/loss=255.634, player_2/loss=221.001, rew=-17.86]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 340.04it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=292.177, player_2/loss=182.818, rew=-12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 333.38it/s, env_step=4096, len=11, n/ep=6, n/st=64, player_1/loss=228.745, player_2/loss=207.169, rew=8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 344.86it/s, env_step=5120, len=11, n/ep=4, n/st=64, player_1/loss=174.938, player_2/loss=280.305, rew=0.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:03, 325.80it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=307.320, player_2/loss=276.109, rew=-18.75]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:03, 333.37it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=300.123, player_2/loss=271.968, rew=0.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:03, 333.07it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=235.733, player_2/loss=265.575, rew=8.33]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:03, 336.08it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=210.994, player_2/loss=225.295, rew=-18.75]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:03, 337.18it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=195.227, player_2/loss=219.321, rew=-13.89]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:03, 334.41it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=179.941, player_2/loss=234.848, rew=-19.44]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:03, 335.51it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=183.992, player_2/loss=225.140, rew=3.57]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:03, 325.29it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=185.456, player_2/loss=242.190, rew=-13.89]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:03, 330.10it/s, env_step=14336, len=10, n/ep=6, n/st=64, player_1/loss=186.031, player_2/loss=238.721, rew=25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:03, 323.88it/s, env_step=15360, len=11, n/ep=6, n/st=64, player_1/loss=220.234, player_2/loss=232.952, rew=8.33]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:03, 332.19it/s, env_step=16384, len=8, n/ep=7, n/st=64, player_1/loss=243.343, player_2/loss=251.773, rew=-3.57]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:03, 333.70it/s, env_step=17408, len=11, n/ep=5, n/st=64, player_1/loss=257.932, player_2/loss=241.938, rew=15.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:03, 334.32it/s, env_step=18432, len=12, n/ep=5, n/st=64, player_1/loss=292.356, player_2/loss=235.992, rew=15.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:02, 342.84it/s, env_step=19456, len=11, n/ep=6, n/st=64, player_1/loss=276.427, player_2/loss=231.826, rew=-8.33]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:03, 322.41it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=182.667, player_2/loss=246.761, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 339.06it/s, env_step=2048, len=8, n/ep=7, n/st=64, player_1/loss=194.189, player_2/loss=220.244, rew=17.86]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 338.75it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=237.710, player_2/loss=204.962, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 333.75it/s, env_step=4096, len=11, n/ep=6, n/st=64, player_1/loss=207.633, player_2/loss=217.365, rew=-8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 337.28it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=172.718, player_2/loss=266.020, rew=3.57]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 332.91it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=231.639, player_2/loss=252.299, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:03, 336.03it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=248.705, player_2/loss=265.241, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:03, 335.99it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=204.713, player_2/loss=265.713, rew=-8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:03, 333.20it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=152.311, player_2/loss=245.766, rew=18.75]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:03, 337.00it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=144.469, player_2/loss=224.606, rew=13.89]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 318.39it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=147.691, player_2/loss=248.781, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 314.13it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=208.870, player_2/loss=260.537, rew=-3.57]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 327.15it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=237.626, player_2/loss=247.773, rew=13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 329.06it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=180.932, player_2/loss=233.943, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:03, 323.45it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=179.107, player_2/loss=228.676, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:03, 327.78it/s, env_step=16384, len=8, n/ep=7, n/st=64, player_1/loss=144.372, player_2/loss=272.084, rew=3.57]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:03, 329.05it/s, env_step=17408, len=11, n/ep=5, n/st=64, player_1/loss=138.445, player_2/loss=269.236, rew=-15.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:03, 325.70it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=220.257, player_2/loss=269.169, rew=12.50]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 328.61it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=256.836, player_2/loss=257.247, rew=-5.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:03, 320.24it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=202.292, player_2/loss=229.598, rew=-12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:03, 332.96it/s, env_step=2048, len=10, n/ep=7, n/st=64, player_1/loss=217.727, player_2/loss=211.084, rew=-3.57]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:03, 318.72it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=224.594, player_2/loss=233.057, rew=0.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:03, 331.29it/s, env_step=4096, len=11, n/ep=5, n/st=64, player_1/loss=227.481, player_2/loss=254.269, rew=15.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:03, 326.94it/s, env_step=5120, len=8, n/ep=8, n/st=64, player_1/loss=214.117, player_2/loss=202.766, rew=0.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #6: 1025it [00:03, 326.09it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=192.133, player_2/loss=201.954, rew=6.25]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #7: 1025it [00:03, 335.56it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=200.300, player_2/loss=258.018, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #8: 1025it [00:03, 335.01it/s, env_step=8192, len=8, n/ep=7, n/st=64, player_1/loss=167.714, player_2/loss=285.659, rew=-3.57]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #9: 1025it [00:03, 324.30it/s, env_step=9216, len=10, n/ep=6, n/st=64, player_1/loss=199.998, player_2/loss=225.520, rew=-16.67]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #10: 1025it [00:03, 317.52it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=144.190, player_2/loss=197.808, rew=-12.50]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #11: 1025it [00:03, 320.35it/s, env_step=11264, len=11, n/ep=6, n/st=64, player_1/loss=133.293, player_2/loss=232.379, rew=-8.33]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #12: 1025it [00:03, 317.22it/s, env_step=12288, len=10, n/ep=8, n/st=64, player_1/loss=166.689, player_2/loss=234.412, rew=-6.25]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #13: 1025it [00:03, 315.70it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=189.942, player_2/loss=227.442, rew=-18.75]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #14: 1025it [00:03, 315.53it/s, env_step=14336, len=7, n/ep=8, n/st=64, player_1/loss=181.046, player_2/loss=216.309, rew=-12.50]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #15: 1025it [00:03, 318.27it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=183.077, player_2/loss=213.290, rew=-12.50]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #16: 1025it [00:03, 324.30it/s, env_step=16384, len=11, n/ep=5, n/st=64, player_1/loss=209.230, player_2/loss=253.374, rew=25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #17: 1025it [00:03, 280.60it/s, env_step=17408, len=10, n/ep=7, n/st=64, player_1/loss=245.104, player_2/loss=238.144, rew=10.71]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #18: 1025it [00:03, 274.48it/s, env_step=18432, len=7, n/ep=8, n/st=64, player_1/loss=270.271, player_2/loss=171.797, rew=-12.50]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #19: 1025it [00:03, 270.62it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=266.591, player_2/loss=186.269, rew=-3.57]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #1: 1025it [00:03, 320.10it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=189.684, player_2/loss=272.852, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:03, 288.01it/s, env_step=2048, len=11, n/ep=7, n/st=64, player_1/loss=176.532, player_2/loss=222.222, rew=10.71]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:03, 316.16it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=228.746, player_2/loss=204.132, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 340.86it/s, env_step=4096, len=11, n/ep=6, n/st=64, player_1/loss=182.368, player_2/loss=207.686, rew=-8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 349.05it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=113.519, player_2/loss=236.471, rew=3.57]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 360.20it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=187.506, player_2/loss=223.020, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 354.95it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=216.108, player_2/loss=225.413, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 358.75it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=184.613, player_2/loss=235.839, rew=-8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 360.75it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=160.770, player_2/loss=241.520, rew=18.75]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 359.82it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=188.789, player_2/loss=222.351, rew=13.89]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 360.38it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=185.972, player_2/loss=225.035, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 357.76it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=224.282, player_2/loss=217.686, rew=-3.57]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 356.86it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=221.811, player_2/loss=230.432, rew=13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 357.27it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=161.571, player_2/loss=245.468, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 358.64it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=195.772, player_2/loss=241.291, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 356.30it/s, env_step=16384, len=9, n/ep=7, n/st=64, player_1/loss=206.784, rew=-3.57]        


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 360.72it/s, env_step=17408, len=11, n/ep=4, n/st=64, player_1/loss=172.345, player_2/loss=219.008, rew=-25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 360.83it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=180.066, player_2/loss=238.215, rew=12.50]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 358.27it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=192.512, player_2/loss=249.232, rew=-5.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 360.24it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=154.865, player_2/loss=226.806, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 358.08it/s, env_step=2048, len=10, n/ep=8, n/st=64, player_1/loss=187.848, player_2/loss=195.240, rew=-6.25]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 358.71it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=261.472, player_2/loss=197.361, rew=-12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 364.90it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=146.484, player_2/loss=220.806, rew=8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 351.69it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=148.281, player_2/loss=280.387, rew=-3.57]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:02, 359.13it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=249.372, player_2/loss=270.871, rew=-18.75]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:02, 342.68it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=210.163, player_2/loss=262.146, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:03, 278.46it/s, env_step=8192, len=8, n/ep=7, n/st=64, player_1/loss=280.975, player_2/loss=207.663, rew=-3.57]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:04, 242.08it/s, env_step=9216, len=9, n/ep=7, n/st=64, player_1/loss=247.778, player_2/loss=214.320, rew=-17.86]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:03, 322.80it/s, env_step=10240, len=8, n/ep=8, n/st=64, player_1/loss=216.385, player_2/loss=253.443, rew=-12.50]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:03, 338.33it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=187.162, player_2/loss=256.526, rew=-12.50]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:02, 352.67it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=206.380, player_2/loss=219.849, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:02, 353.44it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=277.316, player_2/loss=179.349, rew=-19.44]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:02, 361.16it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=222.314, player_2/loss=214.451, rew=-19.44]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:02, 357.06it/s, env_step=15360, len=13, n/ep=6, n/st=64, player_1/loss=126.777, player_2/loss=195.772, rew=16.67]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:02, 356.14it/s, env_step=16384, len=10, n/ep=6, n/st=64, player_1/loss=217.330, player_2/loss=175.559, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:02, 349.29it/s, env_step=17408, len=10, n/ep=6, n/st=64, player_1/loss=309.955, player_2/loss=201.412, rew=25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:02, 352.63it/s, env_step=18432, len=8, n/ep=8, n/st=64, player_1/loss=281.876, player_2/loss=212.065, rew=0.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:03, 325.69it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=160.205, player_2/loss=193.900, rew=-10.71]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:02, 345.05it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=155.340, player_2/loss=322.132, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 344.27it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=225.008, player_2/loss=225.350, rew=12.50]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 346.87it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=272.704, player_2/loss=190.030, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:03, 317.35it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=172.322, player_2/loss=230.112, rew=-8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:03, 327.91it/s, env_step=5120, len=10, n/ep=6, n/st=64, player_1/loss=163.977, player_2/loss=272.577, rew=8.33]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 384.76it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=261.867, player_2/loss=248.521, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 360.53it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=212.510, player_2/loss=232.907, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 357.95it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=172.410, player_2/loss=250.113, rew=6.25]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 359.25it/s, env_step=9216, len=9, n/ep=7, n/st=64, player_1/loss=159.695, player_2/loss=241.335, rew=17.86]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 380.14it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=212.177, player_2/loss=235.629, rew=13.89]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 387.40it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=183.685, player_2/loss=250.697, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:03, 292.65it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=213.331, player_2/loss=237.013, rew=-3.57]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:03, 328.34it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=218.352, player_2/loss=219.446, rew=13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 354.68it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=165.627, player_2/loss=233.792, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 348.96it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=180.623, player_2/loss=250.459, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 352.17it/s, env_step=16384, len=9, n/ep=7, n/st=64, player_1/loss=233.400, rew=-3.57]        


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 404.70it/s, env_step=17408, len=11, n/ep=5, n/st=64, player_1/loss=213.218, player_2/loss=260.683, rew=-15.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 387.40it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=236.337, player_2/loss=280.964, rew=3.57]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 339.31it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=204.115, player_2/loss=279.093, rew=-5.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 375.25it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=304.958, player_2/loss=179.248, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 342.38it/s, env_step=2048, len=10, n/ep=6, n/st=64, player_1/loss=274.301, player_2/loss=169.173, rew=25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 360.85it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=317.829, player_2/loss=169.794, rew=-12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 404.89it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=239.772, player_2/loss=200.330, rew=8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #5: 1025it [00:02, 367.99it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=185.748, player_2/loss=216.044, rew=-3.57]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #6: 1025it [00:02, 347.90it/s, env_step=6144, len=8, n/ep=7, n/st=64, player_1/loss=254.866, player_2/loss=201.711, rew=-10.71]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #7: 1025it [00:02, 381.75it/s, env_step=7168, len=10, n/ep=5, n/st=64, player_1/loss=221.813, player_2/loss=198.345, rew=5.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #8: 1025it [00:02, 369.01it/s, env_step=8192, len=8, n/ep=7, n/st=64, player_1/loss=114.056, player_2/loss=208.987, rew=-3.57]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #9: 1025it [00:02, 362.25it/s, env_step=9216, len=10, n/ep=6, n/st=64, player_1/loss=167.161, player_2/loss=198.869, rew=0.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #10: 1025it [00:02, 399.04it/s, env_step=10240, len=10, n/ep=7, n/st=64, player_1/loss=238.223, player_2/loss=154.338, rew=25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #11: 1025it [00:02, 410.37it/s, env_step=11264, len=14, n/ep=5, n/st=64, player_1/loss=182.986, player_2/loss=147.604, rew=25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #12: 1025it [00:02, 417.65it/s, env_step=12288, len=13, n/ep=5, n/st=64, player_1/loss=217.019, player_2/loss=160.035, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #13: 1025it [00:02, 416.38it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=221.753, player_2/loss=136.974, rew=-3.57]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #14: 1025it [00:02, 421.49it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=177.812, player_2/loss=180.618, rew=-19.44]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #15: 1025it [00:02, 420.33it/s, env_step=15360, len=7, n/ep=8, n/st=64, player_1/loss=179.602, player_2/loss=197.640, rew=-18.75]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #16: 1025it [00:02, 418.52it/s, env_step=16384, len=9, n/ep=6, n/st=64, player_1/loss=272.672, player_2/loss=212.151, rew=16.67]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #17: 1025it [00:02, 418.55it/s, env_step=17408, len=8, n/ep=7, n/st=64, player_1/loss=230.667, player_2/loss=183.820, rew=-10.71]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #18: 1025it [00:02, 410.99it/s, env_step=18432, len=9, n/ep=5, n/st=64, player_1/loss=177.955, player_2/loss=194.250, rew=-5.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #19: 1025it [00:02, 414.67it/s, env_step=19456, len=7, n/ep=9, n/st=64, player_1/loss=254.451, player_2/loss=231.376, rew=-13.89]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #1: 1025it [00:02, 414.64it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=149.579, player_2/loss=297.853, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 413.45it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=240.869, player_2/loss=217.328, rew=12.50]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 421.43it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=320.966, player_2/loss=203.703, rew=6.25]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 414.34it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=231.801, player_2/loss=260.870, rew=8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 405.13it/s, env_step=5120, len=10, n/ep=6, n/st=64, player_1/loss=160.427, player_2/loss=272.725, rew=8.33]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 407.15it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=229.706, player_2/loss=239.516, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 398.79it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=209.128, player_2/loss=227.699, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 402.63it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=169.845, player_2/loss=244.897, rew=6.25]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 409.37it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=115.593, player_2/loss=240.002, rew=19.44]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 404.92it/s, env_step=10240, len=7, n/ep=9, n/st=64, player_1/loss=150.248, player_2/loss=234.597, rew=13.89]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 402.76it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=186.774, player_2/loss=247.323, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 401.58it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=200.384, player_2/loss=206.313, rew=-3.57]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 377.47it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=187.379, player_2/loss=217.454, rew=13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 368.17it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=156.589, player_2/loss=232.252, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 388.84it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=162.913, player_2/loss=239.321, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 390.02it/s, env_step=16384, len=9, n/ep=7, n/st=64, player_1/loss=211.198, rew=-3.57]        


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 388.30it/s, env_step=17408, len=11, n/ep=5, n/st=64, player_1/loss=213.443, player_2/loss=236.365, rew=-15.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 375.01it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=197.701, player_2/loss=242.767, rew=10.71]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 402.38it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=212.348, player_2/loss=249.080, rew=3.57]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 409.17it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=254.072, player_2/loss=248.155, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 416.33it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=316.162, player_2/loss=180.522, rew=-18.75]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 411.71it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=303.756, player_2/loss=161.291, rew=-6.25]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 401.69it/s, env_step=4096, len=11, n/ep=5, n/st=64, player_1/loss=243.587, player_2/loss=225.814, rew=15.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #5: 1025it [00:02, 421.92it/s, env_step=5120, len=10, n/ep=6, n/st=64, player_1/loss=159.028, player_2/loss=203.601, rew=8.33]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #6: 1025it [00:02, 412.63it/s, env_step=6144, len=11, n/ep=6, n/st=64, player_1/loss=187.929, rew=25.00]         


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #7: 1025it [00:02, 420.15it/s, env_step=7168, len=8, n/ep=7, n/st=64, player_1/loss=263.415, player_2/loss=167.009, rew=-10.71]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #8: 1025it [00:02, 420.87it/s, env_step=8192, len=15, n/ep=4, n/st=64, player_1/loss=209.893, player_2/loss=203.035, rew=12.50]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #9: 1025it [00:02, 422.29it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=211.384, player_2/loss=220.889, rew=-18.75]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #10: 1025it [00:02, 418.86it/s, env_step=10240, len=8, n/ep=9, n/st=64, player_1/loss=238.491, rew=-13.89]       


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #11: 1025it [00:02, 420.09it/s, env_step=11264, len=10, n/ep=6, n/st=64, player_1/loss=253.691, player_2/loss=179.537, rew=0.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #12: 1025it [00:02, 416.93it/s, env_step=12288, len=10, n/ep=7, n/st=64, player_1/loss=270.299, player_2/loss=226.220, rew=17.86]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #13: 1025it [00:02, 418.17it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=316.093, player_2/loss=229.763, rew=-19.44]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #14: 1025it [00:02, 419.59it/s, env_step=14336, len=11, n/ep=6, n/st=64, player_1/loss=302.025, player_2/loss=246.757, rew=0.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #15: 1025it [00:02, 420.37it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=215.438, player_2/loss=226.836, rew=-2.78]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #16: 1025it [00:02, 419.97it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=154.902, player_2/loss=230.203, rew=-19.44]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #17: 1025it [00:02, 419.62it/s, env_step=17408, len=8, n/ep=8, n/st=64, player_1/loss=179.245, player_2/loss=240.705, rew=-6.25]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #18: 1025it [00:02, 419.09it/s, env_step=18432, len=11, n/ep=5, n/st=64, player_1/loss=235.281, player_2/loss=190.133, rew=5.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #19: 1025it [00:02, 421.41it/s, env_step=19456, len=10, n/ep=7, n/st=64, player_1/loss=245.053, player_2/loss=171.206, rew=-10.71]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #1: 1025it [00:02, 411.21it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=150.560, player_2/loss=308.255, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 407.19it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=267.718, player_2/loss=205.330, rew=18.75]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 399.99it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=308.158, player_2/loss=170.361, rew=6.25]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 397.41it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=222.428, player_2/loss=219.611, rew=8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 402.01it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=125.817, player_2/loss=257.437, rew=10.71]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 400.63it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=183.040, player_2/loss=232.916, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 399.05it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=234.223, player_2/loss=236.529, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 401.23it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=218.763, player_2/loss=244.793, rew=6.25]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 399.59it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=147.709, player_2/loss=229.423, rew=19.44]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 399.65it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=176.720, player_2/loss=227.952, rew=12.50]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 399.07it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=196.952, player_2/loss=222.726, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 397.87it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_1/loss=164.077, player_2/loss=193.645, rew=-8.33]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 394.91it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=161.206, player_2/loss=187.687, rew=13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 397.26it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=158.928, player_2/loss=219.035, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 396.98it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=184.057, player_2/loss=267.020, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 395.42it/s, env_step=16384, len=11, n/ep=6, n/st=64, player_1/loss=238.009, rew=0.00]        


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 399.62it/s, env_step=17408, len=11, n/ep=5, n/st=64, player_1/loss=266.382, player_2/loss=249.231, rew=-15.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 403.28it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=245.920, player_2/loss=272.917, rew=10.71]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 400.62it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=172.629, player_2/loss=270.119, rew=3.57]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 399.02it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=360.888, player_2/loss=224.171, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 394.65it/s, env_step=2048, len=8, n/ep=7, n/st=64, player_1/loss=316.854, player_2/loss=204.669, rew=-17.86]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 401.20it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=218.427, player_2/loss=212.326, rew=-6.25]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 395.82it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=162.275, player_2/loss=253.491, rew=-8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 401.54it/s, env_step=5120, len=7, n/ep=8, n/st=64, player_1/loss=211.180, player_2/loss=241.316, rew=-6.25]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:02, 400.53it/s, env_step=6144, len=12, n/ep=5, n/st=64, player_1/loss=327.236, player_2/loss=216.271, rew=15.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:02, 402.10it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=304.719, player_2/loss=214.307, rew=-12.50]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:02, 399.73it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=303.572, player_2/loss=220.106, rew=8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:02, 402.43it/s, env_step=9216, len=8, n/ep=7, n/st=64, player_1/loss=275.050, player_2/loss=225.613, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:02, 403.06it/s, env_step=10240, len=10, n/ep=6, n/st=64, player_1/loss=250.448, player_2/loss=229.808, rew=16.67]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:02, 400.71it/s, env_step=11264, len=9, n/ep=7, n/st=64, player_1/loss=326.514, player_2/loss=207.544, rew=3.57]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:02, 401.60it/s, env_step=12288, len=11, n/ep=6, n/st=64, player_1/loss=305.171, player_2/loss=175.441, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:02, 397.16it/s, env_step=13312, len=8, n/ep=8, n/st=64, player_1/loss=194.984, player_2/loss=153.074, rew=-6.25]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:02, 402.13it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=162.738, player_2/loss=171.066, rew=-19.44]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:02, 403.89it/s, env_step=15360, len=7, n/ep=8, n/st=64, player_1/loss=185.719, player_2/loss=187.986, rew=-18.75]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:02, 403.62it/s, env_step=16384, len=8, n/ep=7, n/st=64, player_1/loss=212.321, player_2/loss=211.664, rew=-3.57]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:02, 397.34it/s, env_step=17408, len=8, n/ep=8, n/st=64, player_1/loss=222.626, player_2/loss=229.751, rew=-12.50]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:02, 403.19it/s, env_step=18432, len=9, n/ep=6, n/st=64, player_1/loss=285.954, player_2/loss=207.657, rew=16.67]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:02, 401.82it/s, env_step=19456, len=11, n/ep=6, n/st=64, player_1/loss=283.972, player_2/loss=194.168, rew=8.33]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:02, 401.45it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=150.605, player_2/loss=298.105, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 402.42it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=200.626, player_2/loss=221.096, rew=18.75]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 401.46it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=244.624, player_2/loss=198.547, rew=6.25]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 401.24it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=210.466, player_2/loss=249.061, rew=8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 400.78it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=163.414, player_2/loss=273.898, rew=10.71]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 399.66it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=246.561, player_2/loss=252.759, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 397.47it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=203.513, player_2/loss=248.534, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 402.43it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=196.964, player_2/loss=248.126, rew=6.25]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 401.78it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=153.894, player_2/loss=224.720, rew=19.44]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 399.77it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=165.241, player_2/loss=227.798, rew=12.50]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 380.66it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=206.428, player_2/loss=244.764, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 377.84it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=219.452, player_2/loss=226.851, rew=-3.57]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 381.28it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=200.777, player_2/loss=221.912, rew=13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 381.74it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=166.882, player_2/loss=228.535, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 380.02it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=172.052, player_2/loss=227.693, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 377.61it/s, env_step=16384, len=10, n/ep=6, n/st=64, player_1/loss=182.906, rew=0.00]        


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 382.79it/s, env_step=17408, len=10, n/ep=6, n/st=64, player_1/loss=186.661, player_2/loss=227.437, rew=0.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 387.01it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=225.549, player_2/loss=267.498, rew=10.71]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 389.49it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=220.670, player_2/loss=266.693, rew=3.57]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 389.08it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=221.951, player_2/loss=276.786, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 384.62it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=301.484, player_2/loss=202.739, rew=-18.75]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 389.30it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=331.965, player_2/loss=171.664, rew=-12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 392.64it/s, env_step=4096, len=12, n/ep=5, n/st=64, player_1/loss=183.007, player_2/loss=209.565, rew=5.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 386.50it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=228.809, player_2/loss=246.067, rew=-10.71]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:02, 386.09it/s, env_step=6144, len=10, n/ep=7, n/st=64, player_1/loss=328.895, player_2/loss=242.092, rew=17.86]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:02, 390.82it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=291.775, player_2/loss=201.233, rew=-6.25]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:02, 389.80it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=191.752, player_2/loss=181.495, rew=16.67]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:02, 387.85it/s, env_step=9216, len=9, n/ep=6, n/st=64, player_1/loss=159.111, player_2/loss=193.226, rew=-16.67]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:02, 392.12it/s, env_step=10240, len=9, n/ep=6, n/st=64, player_1/loss=192.881, player_2/loss=218.895, rew=-8.33]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:02, 391.68it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=249.745, player_2/loss=221.022, rew=-12.50]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:02, 392.54it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=232.343, player_2/loss=234.380, rew=-18.75]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:02, 388.57it/s, env_step=13312, len=10, n/ep=7, n/st=64, player_1/loss=271.898, player_2/loss=203.370, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:02, 388.02it/s, env_step=14336, len=9, n/ep=8, n/st=64, player_1/loss=250.289, player_2/loss=153.856, rew=0.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:02, 394.27it/s, env_step=15360, len=8, n/ep=8, n/st=64, player_1/loss=334.602, player_2/loss=152.233, rew=-6.25]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:02, 387.92it/s, env_step=16384, len=7, n/ep=8, n/st=64, player_2/loss=214.181, rew=-25.00]       


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:02, 381.82it/s, env_step=17408, len=10, n/ep=6, n/st=64, player_1/loss=327.775, player_2/loss=230.955, rew=25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:02, 391.24it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=281.366, player_2/loss=227.161, rew=3.57]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:02, 390.59it/s, env_step=19456, len=8, n/ep=8, n/st=64, player_1/loss=230.749, player_2/loss=211.717, rew=-12.50]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:02, 389.17it/s, env_step=1024, len=10, n/ep=6, n/st=64, player_1/loss=386.732, player_2/loss=146.368, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 392.08it/s, env_step=2048, len=9, n/ep=7, n/st=64, player_1/loss=353.279, player_2/loss=142.426, rew=-17.86]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 393.21it/s, env_step=3072, len=10, n/ep=6, n/st=64, player_1/loss=340.565, player_2/loss=143.089, rew=-25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #4: 1025it [00:02, 390.08it/s, env_step=4096, len=10, n/ep=6, n/st=64, player_1/loss=360.349, player_2/loss=132.641, rew=-8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #5: 1025it [00:02, 390.32it/s, env_step=5120, len=14, n/ep=5, n/st=64, player_1/loss=412.714, player_2/loss=123.439, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #6: 1025it [00:02, 384.32it/s, env_step=6144, len=9, n/ep=6, n/st=64, player_1/loss=387.661, player_2/loss=152.502, rew=-8.33]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #7: 1025it [00:02, 392.38it/s, env_step=7168, len=10, n/ep=6, n/st=64, player_1/loss=401.262, player_2/loss=166.088, rew=-16.67]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #8: 1025it [00:02, 390.61it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=363.292, player_2/loss=154.756, rew=-8.33]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #9: 1025it [00:02, 390.22it/s, env_step=9216, len=12, n/ep=5, n/st=64, player_1/loss=392.761, player_2/loss=168.351, rew=-5.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #10: 1025it [00:02, 352.20it/s, env_step=10240, len=13, n/ep=5, n/st=64, player_1/loss=410.679, player_2/loss=192.831, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #11: 1025it [00:02, 383.36it/s, env_step=11264, len=10, n/ep=6, n/st=64, player_1/loss=378.537, player_2/loss=166.901, rew=-16.67]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #12: 1025it [00:02, 385.35it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_2/loss=159.578, rew=-16.67]      


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #13: 1025it [00:02, 389.57it/s, env_step=13312, len=10, n/ep=6, n/st=64, player_1/loss=354.553, player_2/loss=148.352, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #14: 1025it [00:03, 320.80it/s, env_step=14336, len=11, n/ep=5, n/st=64, player_1/loss=347.501, player_2/loss=127.651, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #15: 1025it [00:03, 263.48it/s, env_step=15360, len=10, n/ep=6, n/st=64, player_1/loss=390.963, player_2/loss=160.916, rew=-16.67]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #16: 1025it [00:03, 322.51it/s, env_step=16384, len=11, n/ep=6, n/st=64, player_1/loss=424.883, player_2/loss=177.461, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #17: 1025it [00:02, 368.42it/s, env_step=17408, len=12, n/ep=5, n/st=64, player_1/loss=419.517, player_2/loss=186.300, rew=-15.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #18: 1025it [00:02, 393.69it/s, env_step=18432, len=10, n/ep=6, n/st=64, player_1/loss=337.203, player_2/loss=174.717, rew=-8.33]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #19: 1025it [00:02, 373.09it/s, env_step=19456, len=12, n/ep=6, n/st=64, player_1/loss=359.008, player_2/loss=168.245, rew=-16.67]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #1: 1025it [00:02, 384.65it/s, env_step=1024, len=10, n/ep=6, n/st=64, player_1/loss=264.446, player_2/loss=164.092, rew=25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 386.25it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=261.558, player_2/loss=165.685, rew=-18.75]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 370.14it/s, env_step=3072, len=8, n/ep=7, n/st=64, player_1/loss=241.109, player_2/loss=184.080, rew=-17.86]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 382.48it/s, env_step=4096, len=11, n/ep=6, n/st=64, player_1/loss=192.605, player_2/loss=224.141, rew=8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 346.71it/s, env_step=5120, len=10, n/ep=6, n/st=64, player_1/loss=215.974, player_2/loss=211.073, rew=25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:03, 317.02it/s, env_step=6144, len=10, n/ep=6, n/st=64, player_1/loss=279.126, player_2/loss=184.830, rew=-16.67]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 343.07it/s, env_step=7168, len=8, n/ep=7, n/st=64, player_1/loss=253.323, player_2/loss=207.768, rew=10.71]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 415.24it/s, env_step=8192, len=14, n/ep=4, n/st=64, player_1/loss=206.860, player_2/loss=214.778, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 410.86it/s, env_step=9216, len=9, n/ep=7, n/st=64, player_1/loss=210.091, player_2/loss=220.419, rew=10.71]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 386.15it/s, env_step=10240, len=10, n/ep=6, n/st=64, player_1/loss=189.010, player_2/loss=223.611, rew=25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 409.89it/s, env_step=11264, len=11, n/ep=6, n/st=64, player_1/loss=200.148, player_2/loss=229.325, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 408.69it/s, env_step=12288, len=8, n/ep=7, n/st=64, player_1/loss=249.545, player_2/loss=234.477, rew=-10.71]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 406.79it/s, env_step=13312, len=13, n/ep=5, n/st=64, player_1/loss=229.251, player_2/loss=224.122, rew=15.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 355.93it/s, env_step=14336, len=10, n/ep=6, n/st=64, player_1/loss=188.568, player_2/loss=210.790, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 396.13it/s, env_step=15360, len=8, n/ep=7, n/st=64, player_1/loss=261.148, player_2/loss=207.062, rew=-3.57]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 410.34it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=227.829, player_2/loss=187.810, rew=12.50]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 381.49it/s, env_step=17408, len=8, n/ep=7, n/st=64, player_1/loss=182.120, player_2/loss=170.383, rew=-3.57]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 398.65it/s, env_step=18432, len=8, n/ep=7, n/st=64, player_1/loss=245.029, player_2/loss=176.496, rew=-17.86]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 398.79it/s, env_step=19456, len=7, n/ep=9, n/st=64, player_1/loss=277.530, player_2/loss=214.601, rew=-19.44]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 400.24it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=161.611, player_2/loss=298.732, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 397.95it/s, env_step=2048, len=7, n/ep=8, n/st=64, player_1/loss=207.448, player_2/loss=231.405, rew=18.75]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 387.49it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=222.148, player_2/loss=205.441, rew=6.25]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 374.53it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=175.779, player_2/loss=244.645, rew=-8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 400.04it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=133.234, player_2/loss=290.526, rew=10.71]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 383.49it/s, env_step=6144, len=7, n/ep=8, n/st=64, player_1/loss=238.126, player_2/loss=273.197, rew=18.75]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 359.26it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=209.865, player_2/loss=286.199, rew=6.25]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 346.07it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=189.337, player_2/loss=286.475, rew=6.25]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 354.41it/s, env_step=9216, len=7, n/ep=9, n/st=64, player_1/loss=188.214, player_2/loss=240.096, rew=19.44]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 366.88it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=190.029, player_2/loss=246.602, rew=12.50]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 378.29it/s, env_step=11264, len=7, n/ep=9, n/st=64, player_1/loss=223.648, player_2/loss=249.334, rew=19.44]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 386.49it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=212.973, player_2/loss=246.260, rew=6.25]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 353.25it/s, env_step=13312, len=7, n/ep=9, n/st=64, player_1/loss=197.364, player_2/loss=246.970, rew=13.89]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 369.78it/s, env_step=14336, len=7, n/ep=9, n/st=64, player_1/loss=200.140, player_2/loss=268.590, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 384.41it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=193.868, player_2/loss=263.339, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 383.93it/s, env_step=16384, len=8, n/ep=7, n/st=64, player_1/loss=173.921, player_2/loss=260.603, rew=3.57]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 389.18it/s, env_step=17408, len=11, n/ep=5, n/st=64, player_1/loss=153.931, player_2/loss=259.056, rew=-15.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 387.03it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=210.781, player_2/loss=266.837, rew=10.71]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 381.71it/s, env_step=19456, len=8, n/ep=7, n/st=64, player_1/loss=190.093, player_2/loss=253.324, rew=3.57]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 385.75it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=125.522, player_2/loss=302.291, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 383.45it/s, env_step=2048, len=8, n/ep=7, n/st=64, player_1/loss=170.037, player_2/loss=235.092, rew=-17.86]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 388.12it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=173.290, player_2/loss=210.636, rew=-6.25]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 392.59it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=202.062, player_2/loss=208.168, rew=8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 378.43it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=265.121, player_2/loss=206.338, rew=-10.71]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:02, 371.41it/s, env_step=6144, len=11, n/ep=5, n/st=64, player_1/loss=346.604, player_2/loss=212.979, rew=5.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:02, 376.11it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=335.972, player_2/loss=191.512, rew=-6.25]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:02, 365.22it/s, env_step=8192, len=8, n/ep=8, n/st=64, player_1/loss=265.910, player_2/loss=126.651, rew=-12.50]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:02, 369.86it/s, env_step=9216, len=8, n/ep=8, n/st=64, player_1/loss=233.715, player_2/loss=144.364, rew=-12.50]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:02, 364.16it/s, env_step=10240, len=8, n/ep=8, n/st=64, player_1/loss=208.660, player_2/loss=188.741, rew=-12.50]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:02, 359.10it/s, env_step=11264, len=9, n/ep=6, n/st=64, player_1/loss=173.724, player_2/loss=169.629, rew=16.67]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:02, 360.17it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=201.604, player_2/loss=191.784, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:02, 367.70it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=253.585, player_2/loss=202.639, rew=-10.71]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:02, 364.33it/s, env_step=14336, len=8, n/ep=8, n/st=64, player_1/loss=327.179, player_2/loss=191.756, rew=-6.25]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:02, 381.19it/s, env_step=15360, len=9, n/ep=8, n/st=64, player_1/loss=250.296, player_2/loss=192.263, rew=0.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:02, 388.55it/s, env_step=16384, len=7, n/ep=9, n/st=64, player_1/loss=179.381, rew=-13.89]       


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:02, 372.32it/s, env_step=17408, len=10, n/ep=6, n/st=64, player_1/loss=239.126, player_2/loss=256.388, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:02, 368.71it/s, env_step=18432, len=9, n/ep=6, n/st=64, player_1/loss=246.584, player_2/loss=221.619, rew=0.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:02, 384.11it/s, env_step=19456, len=12, n/ep=5, n/st=64, player_1/loss=215.175, player_2/loss=233.633, rew=5.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #5


Epoch #1: 1025it [00:02, 386.35it/s, env_step=1024, len=10, n/ep=6, n/st=64, player_1/loss=335.391, player_2/loss=136.567, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 378.46it/s, env_step=2048, len=11, n/ep=6, n/st=64, player_1/loss=377.311, player_2/loss=129.773, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 364.93it/s, env_step=3072, len=10, n/ep=6, n/st=64, player_1/loss=371.820, player_2/loss=128.446, rew=-25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #4: 1025it [00:02, 361.00it/s, env_step=4096, len=9, n/ep=6, n/st=64, player_1/loss=326.652, player_2/loss=133.315, rew=-8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #5: 1025it [00:02, 374.56it/s, env_step=5120, len=14, n/ep=5, n/st=64, player_1/loss=372.708, player_2/loss=122.418, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #6: 1025it [00:02, 373.01it/s, env_step=6144, len=9, n/ep=6, n/st=64, player_1/loss=402.256, player_2/loss=128.901, rew=-8.33]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #7: 1025it [00:02, 371.61it/s, env_step=7168, len=10, n/ep=6, n/st=64, player_1/loss=380.049, player_2/loss=147.447, rew=-16.67]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #8: 1025it [00:02, 371.32it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=346.427, player_2/loss=131.672, rew=-8.33]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #9: 1025it [00:02, 394.43it/s, env_step=9216, len=12, n/ep=5, n/st=64, player_1/loss=329.706, player_2/loss=139.738, rew=-5.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #10: 1025it [00:03, 336.31it/s, env_step=10240, len=10, n/ep=6, n/st=64, player_1/loss=380.979, player_2/loss=169.281, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #11: 1025it [00:02, 375.27it/s, env_step=11264, len=11, n/ep=5, n/st=64, player_1/loss=377.425, player_2/loss=151.066, rew=-25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #12: 1025it [00:02, 360.54it/s, env_step=12288, len=10, n/ep=6, n/st=64, player_2/loss=154.326, rew=-16.67]      


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #13: 1025it [00:02, 372.93it/s, env_step=13312, len=10, n/ep=6, n/st=64, player_1/loss=295.566, player_2/loss=147.517, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #14: 1025it [00:02, 380.99it/s, env_step=14336, len=11, n/ep=6, n/st=64, player_1/loss=327.140, player_2/loss=130.270, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #15: 1025it [00:02, 375.65it/s, env_step=15360, len=11, n/ep=6, n/st=64, player_1/loss=332.850, player_2/loss=129.413, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #16: 1025it [00:02, 389.18it/s, env_step=16384, len=11, n/ep=6, n/st=64, player_1/loss=354.951, player_2/loss=133.229, rew=-16.67]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #17: 1025it [00:02, 366.26it/s, env_step=17408, len=12, n/ep=5, n/st=64, player_1/loss=341.799, player_2/loss=140.623, rew=-15.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #18: 1025it [00:02, 378.89it/s, env_step=18432, len=10, n/ep=6, n/st=64, player_1/loss=278.069, player_2/loss=132.576, rew=0.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #19: 1025it [00:02, 384.42it/s, env_step=19456, len=11, n/ep=6, n/st=64, player_1/loss=275.749, player_2/loss=146.700, rew=-8.33]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #1: 1025it [00:02, 380.85it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=192.742, player_2/loss=181.329, rew=-10.71]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 396.51it/s, env_step=2048, len=10, n/ep=6, n/st=64, player_1/loss=185.829, player_2/loss=169.489, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 413.57it/s, env_step=3072, len=8, n/ep=9, n/st=64, player_1/loss=197.494, player_2/loss=155.663, rew=-19.44]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 411.71it/s, env_step=4096, len=7, n/ep=8, n/st=64, player_1/loss=216.118, player_2/loss=189.395, rew=-18.75]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 358.17it/s, env_step=5120, len=8, n/ep=7, n/st=64, player_1/loss=215.941, player_2/loss=212.228, rew=-3.57]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 347.25it/s, env_step=6144, len=12, n/ep=5, n/st=64, player_1/loss=255.172, player_2/loss=240.520, rew=15.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 362.75it/s, env_step=7168, len=11, n/ep=6, n/st=64, player_1/loss=252.136, player_2/loss=196.222, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 380.04it/s, env_step=8192, len=12, n/ep=5, n/st=64, player_1/loss=214.505, player_2/loss=175.122, rew=25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 367.56it/s, env_step=9216, len=7, n/ep=8, n/st=64, player_1/loss=196.153, player_2/loss=180.647, rew=-18.75]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 350.64it/s, env_step=10240, len=8, n/ep=7, n/st=64, player_1/loss=214.744, player_2/loss=204.317, rew=-10.71]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:03, 326.83it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=189.062, player_2/loss=177.058, rew=-12.50]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 350.81it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=171.305, player_2/loss=155.761, rew=-18.75]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 370.45it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=157.786, player_2/loss=173.151, rew=-18.75]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 394.18it/s, env_step=14336, len=8, n/ep=7, n/st=64, player_1/loss=149.764, player_2/loss=220.101, rew=-3.57]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 398.66it/s, env_step=15360, len=10, n/ep=7, n/st=64, player_1/loss=166.508, player_2/loss=231.692, rew=17.86]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 400.76it/s, env_step=16384, len=10, n/ep=6, n/st=64, player_1/loss=222.458, player_2/loss=256.182, rew=16.67]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 393.42it/s, env_step=17408, len=8, n/ep=8, n/st=64, player_1/loss=225.227, player_2/loss=219.061, rew=-6.25]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 402.89it/s, env_step=18432, len=10, n/ep=7, n/st=64, player_1/loss=216.480, player_2/loss=214.607, rew=17.86]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 395.79it/s, env_step=19456, len=10, n/ep=6, n/st=64, player_1/loss=168.148, player_2/loss=199.331, rew=16.67]

Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0





In [16]:
####################################################
# EXPERIMENT: VIEWING THE BEST LEARNED POLICY
####################################################

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure the best agent
best_agent1 = cf_custom_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape)
best_agent1.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/6-20epoch_100loop/looping-iteration-99/best_policy_agent1.pth"))
best_agent1.set_eps(0)


best_agent2 = cf_custom_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape)
best_agent2.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/6-20epoch_100loop/looping-iteration-98/best_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= best_agent1,
      agent_player2= best_agent2)



Average steps of game:  10.0
Final mean reward agent 1: -25.0, std: 0.0
Final mean reward agent 2: 25.0, std: 0.0


In [17]:
####################################################
# EXPERIMENT: VIEWING THE LAST LEARNED POLICY
####################################################

# Configure the final agent
final_agent_player1 = cf_custom_dqn_policy(state_shape= state_shape,
                                           action_shape= action_shape)
final_agent_player1.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/6-20epoch_100loop/looping-iteration-99/final_policy_agent1.pth"))
best_agent1.set_eps(0)

final_agent_player2 = cf_custom_dqn_policy(state_shape= state_shape,
                                           action_shape= action_shape)
final_agent_player2.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/6-20epoch_100loop/looping-iteration-98/final_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= final_agent_player1,
      agent_player2= final_agent_player2)



Average steps of game:  11.333333333333334
Final mean reward agent 1: -25.0, std: 0.0
Final mean reward agent 2: 25.0, std: 0.0


<hr><hr>

## Discussion

We see that the agent can learn quickly to win against a fixed strategy oponent but the overall performance of the agent is still weak, making human play of very poor quality once again.

In [None]:
####################################################
# CLEAN VARIABLES
####################################################

del action_shape
del agent1
del agent2
del best_agent1
del best_agent2
del env
del final_agent_player1
del final_agent_player2
del observation_space
del off_policy_traininer_results
del state_shape
