# CNN based DQN agent against fixed oponent

As discussed in `5-improving-dqn-architecture.ipynb` we thought of three aspects that might be the root of the agent's not learning to play the game pleasingly:
- Training two DQN agents simultaneously is known to be though, especially when starting from a random initialisation
- The network used was a simple MLP
- The training is not done over enough iterations

In the notebooks `5-improving-dqn-architecture.ipynb` and `6-dqn-using-a-cnn.ipynb`, two alternative networks besides MLP were used.
Whilst these give somewhat satisfactory results when trained for long enough and incentivising moves by giving a reward for making a move, it is still far from perfect.
The iterations were also boosted to a couple of hours on a CUDA GPU, which didn't improve things all that much.

Thus, what is most likely to be an issue is the fact that we are training two agents simultaneously.
This makes it hard to get a good performing agent and makes the target non stationary as both agents evolve over time.
An alternative to this is training an agent for a couple of epochs whilst freezing the other and alternating this between the agents.
This makes the problem to learn more stationary and is known to make learning easier.
What is also done, often in very complex games, is starting from a somewhat smart agent instead of a random one.

Whilst some libraries such as Ray RL lib offer implementations of such a training strategy, the experimental notebook `4-rllib-for-more-learning-control.ipynb` found that even the Ray provided example results in error codes.
Seeing their GitHub page has many open issues, the one we encountered being one of them, we refrain from using a different library considering Tianshou has many algorithms implemented and we have found a way to make things work.

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training two DQN agents on connect four Gym
  - Building the environment
  - Implementing the DQN policy
  - Building agents
  - Function for letting agents learn
  - Function for watching learned agent
  - Doing the experiment
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [2]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Argparser
import argparse

# More data types
import typing
import numpy as np

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
importlib.invalidate_caches()
importlib.reload(cfgym)

# Time for allowing "freezes" in execution
import time;

# Allow for copying objects in a non reference manner
import copy

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
pygame 2.1.2 (SDL 2.0.18, Python 3.8.10)
Hello from the pygame community. https://www.pygame.org/contribute.html
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0


  from .autonotebook import tqdm as notebook_tqdm


Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [3]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device
print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

# Show cuda device name
print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training two DQN agents on connect four Gym

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, agents will always play as the same player, e.g. always as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Building the environment

This code is taken from previous notebooks.
We don't allow invalid moves to make the problem easier for now.

In [4]:
####################################################
# CONNECT FOUR V2 ENVIRONMENT
####################################################

def get_env():
    """
    Returns the connect four gym environment V2 altered for Tianshou and Petting Zoo compatibility.
    Already wrapped with a ts.env.PettingZooEnv wrapper.
    """
    return ts.env.PettingZooEnv(cfgym.env(reward_move= 0, # Set to 1 for reward to make moves (incentivise longer games)
                                          reward_invalid= -3,
                                          reward_draw= 100,
                                          reward_win= 25,
                                          reward_loss= -25,
                                          allow_invalid_move= False))
    
    
# Test the environment
env = get_env()
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation
del env

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


<hr>

### Implementing the DQN policy

The DQN policy for the agent is configured and set up below.
This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [5]:
####################################################
# DQN ARCHITECTURE
####################################################

class CNNBasedDQN(torch.nn.Module):
    """
    Custom DQN using a model based on CNN
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',):
        # Parent call
        super().__init__()
        
        # Save device (e.g. cuda)
        self.device = device
        
        # Number of input channels
        input_channels_cnn = 1
        output_channels_cnn = 32
        flatten_size = (state_shape[0] - 3) * (state_shape[1] - 3) * output_channels_cnn
        output_size= np.prod(action_shape)
        
        self.model = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels= input_channels_cnn, out_channels= output_channels_cnn, kernel_size= 4, stride= 1), torch.nn.ReLU(inplace=True),
            torch.nn.Flatten(0,-1),
            torch.nn.Unflatten(0, (1, flatten_size)),
            torch.nn.Linear(flatten_size, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, output_size),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
        
        logits = self.model(obs)
        return logits, state


In [6]:
####################################################
# DQN POLICY
####################################################

def cf_cnn_dqn_policy(state_shape: tuple,
                      action_shape: tuple,
                      optim: typing.Optional[torch.optim.Optimizer] = None,
                      learning_rate: float =  0.0001,
                      gamma: float = 0.9, # Smaller gamma favours "faster" win
                      n_step: int = 4, # Number of steps to look ahead
                      frozen: bool = False,
                      target_update_freq: int = 320):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    net = CNNBasedDQN(state_shape, action_shape, device= device).to(device)
    
    # Default optimizer is an adam optimizer with the argparser learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # If we are frozen, we use an optimizer that has learning rate 0
    if frozen:
        optim = torch.optim.SGD(net.parameters(), lr= 0)
        
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)

<hr>

### Building agents

This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [7]:
####################################################
# AGENT CREATION
####################################################

def get_agents(agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
               agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
               optim: typing.Optional[torch.optim.Optimizer] = None,
               resume_path_player_1: str = '', # Path to file to resume agent training from
               resume_path_player_2: str = '', 
               agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
               agent_player2_frozen: bool = False,
               ) -> typing.Tuple[ts.policy.BasePolicy, torch.optim.Optimizer, list]:
    """
    Gets a multi agent policy manager, optimizer and player ids for the connect four V2 gym environment.
    Per default this returns 
        - Multi agent manager for 2 agents using DQN
        - Adam optimizer
        - ['player_1', 'player_2'] from the connect four environment
    """
    
    # Get the environment to play in (Connect four gym V2)
    env = get_env()
    
    # Get the observation space from the environment, depending on typo of space (ternary operator)
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    
    # Set the arguments
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent player 1 to be a DQN if no policy is passed.
    if agent_player1 is None:
        # Our agent1 uses a DQN policy
        agent_player1 = cf_cnn_dqn_policy(state_shape= state_shape,
                                          action_shape= action_shape,
                                          optim= optim,
                                          frozen= agent_player1_frozen)
        
        # If we resume our agent we need to load the previous config
        if resume_path_player_1:
            agent_player1.load_state_dict(torch.load(resume_path_player_1))
            
    
    # Configure agent player 2 to be a DQN if no policy is passed.
    if agent_player2 is None:
        # Our agent1 uses a DQN policy
        agent_player2 = cf_cnn_dqn_policy(state_shape= state_shape,
                                          action_shape= action_shape,
                                          optim= optim,
                                          frozen= agent_player2_frozen)
        
        # If we resume our agent we need to load the previous config
        if resume_path_player_2:
            agent_player2.load_state_dict(torch.load(resume_path_player_2))

    # Both our agents are DQN agents by default
    agents = [agent_player1, agent_player2]
        
    # Our policy depends on the order of the agents
    policy = ts.policy.MultiAgentPolicyManager(agents, env)
    
    # Return our policy, optimizer and the available agents in the environment
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    
    return policy, optim, env.agents

<hr>

### Function for letting agents learn

This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [8]:
####################################################
# AGENT TRAINING
####################################################

def train_agent(filename: str = "dqn_vs_dqn_cnn_based",
                agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
                agent_player2_frozen: bool = False,
                single_agent_score_as_reward: bool= False, # Uses non frozen agent's score as reward
                optim: typing.Optional[torch.optim.Optimizer] = None,
                training_env_num: int = 1,
                testing_env_num: int = 1,
                buffer_size: int = 2^14,
                batch_size: int = 1, 
                epochs: int = 50, #50
                step_per_epoch: int = 1024, #1024
                step_per_collect: int = 64, # transition before update
                update_per_step: float = 0.1,
                testing_eps: float = 0.05,
                training_eps: float = 0.1,
                ) -> typing.Tuple[dict, ts.policy.BasePolicy]:
    """
    Trains two agents in the connect four V2 environment and saves their best model and logs.
    Returns:
        - result from offpolicy_trainer
        - final version of agent 1
        - final version of agent 2
    """

    # ======== notebook specific =========
    notebook_version = '7' # Used for foldering logs and models

    # ======== environment setup =========
    train_envs = ts.env.DummyVectorEnv([get_env for _ in range(training_env_num)])
    test_envs = ts.env.DummyVectorEnv([get_env for _ in range(testing_env_num)])
    
    # set the seed for reproducibility
    np.random.seed(1998)
    torch.manual_seed(1998)
    train_envs.seed(1998)
    test_envs.seed(1998)

    # ======== agent setup =========
    # Gets our agents from the previously made function
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    policy, optim, agents = get_agents(agent_player1=agent_player1,
                                       agent_player2=agent_player2,
                                       agent_player1_frozen= agent_player1_frozen,
                                       agent_player2_frozen= agent_player2_frozen,
                                       optim=optim)

    # ======== collector setup =========
    # Make a collector for the training environments
    train_collector = ts.data.Collector(policy= policy,
                                        env= train_envs,
                                        buffer= ts.data.VectorReplayBuffer(buffer_size, len(train_envs)),
                                        exploration_noise= True)
    
    # Make a collector for the testing environments
    test_collector = ts.data.Collector(policy= policy,
                                       env= test_envs,
                                       buffer= ts.data.VectorReplayBuffer(buffer_size, len(test_envs)),
                                       exploration_noise= True)
    
    # Uncomment below if you want to set epsilon in epsilon policy
    # policy.set_eps(1)
    
    # Collect data fot the training evnironments
    train_collector.collect(n_step= batch_size * training_env_num)
    
    # ======== ensure folders exist =========
    if not os.path.exists(os.path.join('./logs', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./logs', 'paper_notebooks', notebook_version, filename))
    if not os.path.exists(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename))

    # ======== tensorboard logging setup =========
    # Allows to save the training progress to tensorboard compatable logs
    log_path = os.path.join('./logs', 'paper_notebooks', notebook_version, filename)
    writer = torch.utils.tensorboard.SummaryWriter(log_path)
    logger = ts.utils.TensorboardLogger(writer)

    # ======== callback functions used during training =========
    # We want to save our best policy
    def save_best_fn(policy):
        """
        Callback to save the best model
        """
        # Save best agent 1
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent1.pth')
        torch.save(policy.policies[agents[0]].state_dict(), model_save_path)
        
        # Save best agent 2
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent2.pth')
        torch.save(policy.policies[agents[1]].state_dict(), model_save_path)
        
        # Save agent2

    def stop_fn(mean_rewards):
        """
        Callback to stop training when we've reached the win rate
        """
        return mean_rewards >= 7 # (win = 10, 70% win without invalid moves = mean of 7)

    def train_fn(epoch, env_step):
        """
        Callback before training
        """        
        # Before training we want to configure the epsilon for the agents
        # In general more exploratory than the test case
        policy.policies[agents[0]].set_eps(training_eps)
        policy.policies[agents[1]].set_eps(training_eps)

    def test_fn(epoch, env_step):
        """
        Callback beore testing
        """        
        # Before testing we want to configure the epsilon for the agents
        # In general more greedy than the train case but not
        #   to avoid getting stuck on invalid moves
        policy.policies[agents[0]].set_eps(testing_eps)
        policy.policies[agents[1]].set_eps(testing_eps)

    def reward_metric(rews):
        """
        Callback for reward collection
        """        
        if agent_player2_frozen and single_agent_score_as_reward:
            # agent 2 frozen, optimizing for agent 1
            return rews[:, 0]
        
        if agent_player1_frozen and single_agent_score_as_reward:
            # agent 1 frozen, optimizing for agent 2
            return rews[:, 1]
        
        # Per default we are interested in optimizing both agents
        return rews[:, 0] + rews[:, 1]
    
            

    # trainer
    result = ts.trainer.offpolicy_trainer(policy= policy,
                                          train_collector= train_collector,
                                          test_collector= test_collector,
                                          max_epoch= epochs,
                                          step_per_epoch= step_per_epoch,
                                          step_per_collect= step_per_collect,
                                          episode_per_test= testing_env_num,
                                          batch_size= batch_size,
                                          train_fn= train_fn,
                                          test_fn= test_fn,
                                          # Stop function to stop before specified amount of epochs
                                          #stop_fn= stop_fn
                                          save_best_fn= save_best_fn,
                                          update_per_step= update_per_step,
                                          logger= logger,
                                          test_in_train= False,
                                          reward_metric= reward_metric)
    
    # Save final agent 1
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent1.pth')
    torch.save(policy.policies[agents[0]].state_dict(), model_save_path)

    # Save final agent 2
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent2.pth')
    torch.save(policy.policies[agents[1]].state_dict(), model_save_path)

    return result, policy.policies[agents[0]], policy.policies[agents[1]]

<hr>

### Function for watching learned agent

Identical to the previous notebook.

In [9]:
####################################################
# WATCHING THE LEARNED POLICY IN ACTION
####################################################

def watch(numer_of_games: int = 3,
          agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
          agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
          test_epsilon: float = 0.05, # For the watching we act completely greedy but low random for not getting stuck on invalid move
          render_speed: float = 0.15, # Amount of seconds to update frame/ do a step
          ) -> None:
    
    # Get the connect four V2 environment (must be a list)
    env= ts.env.DummyVectorEnv([get_env])
    
    # Get the agents from the trained agents
    policy, optim, agents = get_agents(agent_player1= agent_player1,
                                       agent_player2= agent_player2)
    
    # Evaluate the policy
    policy.eval()
    
    # Set the testing policy epsilon for our agents
    policy.policies[agents[0]].set_eps(test_epsilon)
    policy.policies[agents[1]].set_eps(test_epsilon)
    
    # Collect the test data
    collector = ts.data.Collector(policy= policy,
                                  env= env,
                                  exploration_noise= True)
    
    # Render games in human mode to see how it plays
    result = collector.collect(n_episode= numer_of_games, render= render_speed)
    
    # Close the environment aftering collecting the results
    # This closes the pygame window after completion
    env.close()
    
    # Get the rewards and length from the test trials
    rewards, length = result["rews"], result["lens"]
    
    # Print the final reward for the first agent
    print(f"Average steps of game:  {length.mean()}")
    print(f"Final mean reward agent 1: {rewards[:, 0].mean()}, std: {rewards[:, 0].std()}")
    print(f"Final mean reward agent 2: {rewards[:, 1].mean()}, std: {rewards[:, 1].std()}")

<hr>

### Doing the experiment

We now do the experiment with using our previously created functions.
We freeze one agent and initialize both agents from previous versions.

The following iterations were made:

1. Freeze agent 1, train agent 2:
    - Model save name: `1-cnn_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/6/dqn_vs_dqn_cnn_based/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/6/dqn_vs_dqn_cnn_based/best_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `51` with test reward `1102`
    - Scoring: sum of `both` agent's score
2. Freeze agent 2, train agent 1:
    - Model save name: `2-cnn_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/6/dqn_vs_dqn_cnn_based/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/7/1-cnn_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `360` with test reward `1102`
    - Scoring: sum of `both` agent's score

After which the agent was so focused on prolonging the game, we decided to lower the learning rate and start optimizing for winning again. We also lowered the amount of epochs in each iterations of swapping the frozen agent.

3. Freeze agent 1, train agent 2:
    - Model save name: `3-cnn_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/7/2-cnn_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/7/1-cnn_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.00005` # halfed learning rate
    - Training epsilon: `0.1` # halfed training epsilon
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: `1` with test reward `100` - tie game
    - Scoring: reward of `agent 2`
4. Freeze agent 2, train agent 1:
    - Model save name: `4-cnn_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/7/2-cnn_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/7/3-cnn_dqn_frozen_agent1/best_policy_agent2.pth`
    - Learning rate: `0.00005`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: `1` with test reward `100` - tie game
    - Scoring: reward of `agent 1`
    
To do further training, a loop was created which alternated between freezing agens every 50 epochs. This loop was executed 20 times. The learning rate was also lowered once again.

5. Loop frozen agents:
    - Model save name: `5-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/7/4-cnn_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/7/3-cnn_dqn_frozen_agent1/best_policy_agent2.pth`
    - Learning rate: `0.000001`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `50` x `20` loops 
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
6. Loop frozen agents:
    - Model save name: `6-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/7/5-looping-iteration-19/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/7/5-looping-iteration-19/best_policy_agent2.pth`
    - Learning rate: `0.000001`
    - Training epsilon: `0.1`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `100` loops 
    - Gamma: `0.9` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`



In [26]:
####################################################
# EXPERIMENT: TRAINING AGENTS
####################################################

# Configs for the agents
freeze_agent1 = False
agent1_starting_params = "./saved_variables/paper_notebooks/7/5-looping-iteration-19/best_policy_agent1.pth"

freeze_agent2 = True
agent2_starting_params = "./saved_variables/paper_notebooks/7/5-looping-iteration-19/best_policy_agent2.pth"

single_agent_score_as_reward = True # To use combined reward or non frozen agent reward as scoring
filename = "6-looping-iteration-i"
epochs = 20
loops = 100

learning_rate = 0.000001
training_eps = 0.1
gamma = 0.9
n_step = 8

for loop_idx in range(loops):
    # Filename
    filename = f"6-looping-iteration-{loop_idx}"
    
    # Use provided starting params in first loop, the one from previous iteration in next
    if loop_idx > 0:
        agent1_starting_params = f"./saved_variables/paper_notebooks/7/6-looping-iteration-{loop_idx-1}/best_policy_agent1.pth"
        agent2_starting_params = f"./saved_variables/paper_notebooks/7/6-looping-iteration-{loop_idx-1}/best_policy_agent2.pth"
    
    # Determine what agent to freeze
    freeze_agent1 = True if loop_idx % 2 == 1 else False
    freeze_agent2 = True if loop_idx % 2 == 0 else False
    
    # Get the environment settings
    env = get_env()
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent 1
    agent1 = cf_cnn_dqn_policy(state_shape= state_shape,
                               action_shape= action_shape,
                               gamma= gamma,
                               frozen= freeze_agent1,
                               learning_rate = learning_rate,
                               n_step= n_step)
    
    if agent1_starting_params:
        agent1.load_state_dict(torch.load(agent1_starting_params))
        
        # Configure agent 2
        agent2 = cf_cnn_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape,
                                   gamma= gamma,
                                   frozen= freeze_agent2,
                                   learning_rate = learning_rate,
                                   n_step= n_step)
        
        if agent2_starting_params:
            agent2.load_state_dict(torch.load(agent2_starting_params))
            
            
            # Train the agent
            off_policy_traininer_results, final_agent_player1, final_agent_player2 = train_agent(epochs= epochs,
                                                                                                 agent_player1= agent1,
                                                                                                 agent_player1_frozen = freeze_agent1,
                                                                                                 agent_player2= agent2,
                                                                                                 agent_player2_frozen = freeze_agent2,
                                                                                                 filename= filename,
                                                                                                 single_agent_score_as_reward = single_agent_score_as_reward,
                                                                                                 training_eps= training_eps)
            
            

Epoch #1: 1025it [00:02, 367.02it/s, env_step=1024, len=32, n/ep=2, n/st=64, player_1/loss=1848.734, player_2/loss=1739.194, rew=0.00]


Epoch #1: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:02, 414.10it/s, env_step=2048, len=39, n/ep=1, n/st=64, player_1/loss=2260.895, player_2/loss=1494.538, rew=25.00]


Epoch #2: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:02, 437.65it/s, env_step=3072, len=37, n/ep=2, n/st=64, player_1/loss=2344.517, player_2/loss=1558.386, rew=0.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:02, 414.16it/s, env_step=4096, len=34, n/ep=2, n/st=64, player_1/loss=2172.545, player_2/loss=1671.299, rew=0.00]


Epoch #4: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:02, 400.32it/s, env_step=5120, len=38, n/ep=2, n/st=64, player_1/loss=2567.365, player_2/loss=1602.979, rew=0.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #6: 1025it [00:02, 442.97it/s, env_step=6144, len=34, n/ep=2, n/st=64, player_1/loss=2412.646, player_2/loss=1576.615, rew=25.00]


Epoch #6: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #7: 1025it [00:02, 441.80it/s, env_step=7168, len=37, n/ep=1, n/st=64, player_1/loss=2089.986, player_2/loss=1636.302, rew=25.00]


Epoch #7: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #8: 1025it [00:02, 447.52it/s, env_step=8192, len=22, n/ep=4, n/st=64, player_1/loss=2027.799, player_2/loss=1518.107, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #9: 1025it [00:02, 448.70it/s, env_step=9216, len=40, n/ep=2, n/st=64, player_1/loss=2293.830, player_2/loss=1463.477, rew=37.50]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #10: 1025it [00:02, 448.63it/s, env_step=10240, len=40, n/ep=2, n/st=64, player_1/loss=2541.487, player_2/loss=1505.114, rew=37.50]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #11: 1025it [00:02, 450.03it/s, env_step=11264, len=36, n/ep=2, n/st=64, player_1/loss=2843.984, rew=0.00]       


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #12: 1025it [00:02, 439.49it/s, env_step=12288, len=42, n/ep=1, n/st=64, player_1/loss=2501.857, player_2/loss=1526.268, rew=100.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #13: 1025it [00:02, 447.59it/s, env_step=13312, len=38, n/ep=2, n/st=64, player_1/loss=2057.891, player_2/loss=1324.525, rew=-25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #14: 1025it [00:02, 453.64it/s, env_step=14336, len=31, n/ep=3, n/st=64, player_1/loss=2097.435, player_2/loss=1545.766, rew=16.67]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #15: 1025it [00:02, 431.03it/s, env_step=15360, len=33, n/ep=2, n/st=64, player_1/loss=2081.850, player_2/loss=1723.927, rew=0.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #16: 1025it [00:02, 445.68it/s, env_step=16384, len=24, n/ep=3, n/st=64, player_1/loss=1897.142, player_2/loss=1810.944, rew=8.33]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #17: 1025it [00:02, 445.40it/s, env_step=17408, len=37, n/ep=2, n/st=64, player_2/loss=1547.853, rew=0.00]       


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #18: 1025it [00:02, 452.77it/s, env_step=18432, len=32, n/ep=2, n/st=64, player_1/loss=1611.917, player_2/loss=1332.235, rew=25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #19: 1025it [00:02, 449.49it/s, env_step=19456, len=38, n/ep=1, n/st=64, player_1/loss=1781.395, player_2/loss=1636.670, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #20: 1025it [00:02, 449.42it/s, env_step=20480, len=34, n/ep=2, n/st=64, player_2/loss=1795.970, rew=0.00]       


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #21: 1025it [00:02, 445.94it/s, env_step=21504, len=38, n/ep=2, n/st=64, player_1/loss=1886.535, player_2/loss=1566.611, rew=0.00]


Epoch #21: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #22: 1025it [00:02, 444.23it/s, env_step=22528, len=34, n/ep=1, n/st=64, player_1/loss=2258.709, player_2/loss=1366.765, rew=-25.00]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #23: 1025it [00:02, 442.00it/s, env_step=23552, len=31, n/ep=2, n/st=64, player_1/loss=2161.187, player_2/loss=1640.639, rew=-25.00]


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #24: 1025it [00:02, 421.42it/s, env_step=24576, len=35, n/ep=2, n/st=64, player_1/loss=1813.516, player_2/loss=1741.370, rew=0.00]


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #25: 1025it [00:02, 425.64it/s, env_step=25600, len=35, n/ep=2, n/st=64, player_1/loss=1647.634, player_2/loss=1643.214, rew=0.00]


Epoch #25: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #26: 1025it [00:02, 424.45it/s, env_step=26624, len=40, n/ep=1, n/st=64, player_1/loss=2027.439, player_2/loss=1687.467, rew=-25.00]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #27: 1025it [00:02, 422.12it/s, env_step=27648, len=36, n/ep=2, n/st=64, player_1/loss=1935.929, player_2/loss=1592.575, rew=25.00]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #28: 1025it [00:02, 425.07it/s, env_step=28672, len=39, n/ep=2, n/st=64, player_1/loss=1707.841, player_2/loss=1415.511, rew=0.00]


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #29: 1025it [00:02, 426.01it/s, env_step=29696, len=38, n/ep=1, n/st=64, player_1/loss=1770.235, player_2/loss=1466.771, rew=-25.00]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #30: 1025it [00:02, 422.18it/s, env_step=30720, len=37, n/ep=2, n/st=64, player_1/loss=1874.184, player_2/loss=1671.404, rew=-25.00]


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #31: 1025it [00:02, 428.58it/s, env_step=31744, len=38, n/ep=1, n/st=64, player_1/loss=1772.584, player_2/loss=1675.870, rew=-25.00]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #32: 1025it [00:02, 419.44it/s, env_step=32768, len=38, n/ep=2, n/st=64, player_1/loss=1787.401, player_2/loss=1384.719, rew=-25.00]


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #33: 1025it [00:02, 413.86it/s, env_step=33792, len=37, n/ep=2, n/st=64, player_1/loss=1849.402, player_2/loss=1538.907, rew=-25.00]


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #34: 1025it [00:02, 398.62it/s, env_step=34816, len=32, n/ep=3, n/st=64, player_1/loss=1829.932, player_2/loss=1520.660, rew=33.33]


Epoch #34: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #35: 1025it [00:02, 428.17it/s, env_step=35840, len=37, n/ep=2, n/st=64, player_1/loss=2115.886, player_2/loss=1779.105, rew=25.00]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #36: 1025it [00:02, 423.67it/s, env_step=36864, len=38, n/ep=2, n/st=64, player_1/loss=2157.624, player_2/loss=1731.260, rew=-25.00]


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #37: 1025it [00:02, 421.49it/s, env_step=37888, len=39, n/ep=2, n/st=64, player_1/loss=1808.263, player_2/loss=1639.152, rew=62.50]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #38: 1025it [00:02, 399.69it/s, env_step=38912, len=35, n/ep=2, n/st=64, player_1/loss=1470.636, player_2/loss=1645.520, rew=0.00]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #39: 1025it [00:02, 385.60it/s, env_step=39936, len=33, n/ep=2, n/st=64, player_1/loss=1535.614, player_2/loss=1721.461, rew=0.00]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #40: 1025it [00:02, 378.18it/s, env_step=40960, len=37, n/ep=2, n/st=64, player_1/loss=1598.587, player_2/loss=1727.485, rew=0.00]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #41: 1025it [00:02, 389.61it/s, env_step=41984, len=35, n/ep=2, n/st=64, player_1/loss=1848.389, player_2/loss=1905.148, rew=0.00]


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #42: 1025it [00:02, 432.83it/s, env_step=43008, len=26, n/ep=3, n/st=64, player_1/loss=1995.427, player_2/loss=1818.703, rew=-8.33]


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #43: 1025it [00:02, 421.26it/s, env_step=44032, len=34, n/ep=2, n/st=64, player_1/loss=1773.038, player_2/loss=1608.660, rew=-25.00]


Epoch #43: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #44: 1025it [00:02, 431.75it/s, env_step=45056, len=37, n/ep=2, n/st=64, player_1/loss=1299.627, player_2/loss=1438.975, rew=37.50]


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #45: 1025it [00:02, 429.23it/s, env_step=46080, len=25, n/ep=2, n/st=64, player_1/loss=1449.154, player_2/loss=1520.041, rew=25.00]


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #46: 1025it [00:02, 434.25it/s, env_step=47104, len=38, n/ep=1, n/st=64, player_1/loss=1468.565, player_2/loss=1873.950, rew=-25.00]


Epoch #46: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #47: 1025it [00:02, 431.76it/s, env_step=48128, len=39, n/ep=2, n/st=64, player_1/loss=1409.264, player_2/loss=1570.801, rew=62.50]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #48: 1025it [00:02, 429.72it/s, env_step=49152, len=38, n/ep=2, n/st=64, player_1/loss=1186.087, player_2/loss=1387.847, rew=0.00]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #49: 1025it [00:02, 430.96it/s, env_step=50176, len=37, n/ep=1, n/st=64, player_1/loss=1338.201, player_2/loss=1593.859, rew=25.00]


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #1: 1025it [00:02, 430.95it/s, env_step=1024, len=32, n/ep=2, n/st=64, player_1/loss=1838.289, player_2/loss=1384.061, rew=-25.00]


Epoch #1: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:02, 433.16it/s, env_step=2048, len=35, n/ep=2, n/st=64, player_1/loss=1856.977, player_2/loss=1316.945, rew=25.00]


Epoch #2: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:02, 434.59it/s, env_step=3072, len=39, n/ep=1, n/st=64, player_1/loss=1806.113, player_2/loss=1451.195, rew=-25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:02, 432.44it/s, env_step=4096, len=29, n/ep=2, n/st=64, player_1/loss=1899.106, player_2/loss=1759.196, rew=0.00]


Epoch #4: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:02, 434.22it/s, env_step=5120, len=38, n/ep=2, n/st=64, player_1/loss=2208.521, player_2/loss=1777.489, rew=0.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #6: 1025it [00:02, 433.89it/s, env_step=6144, len=39, n/ep=1, n/st=64, player_1/loss=1763.014, player_2/loss=1597.152, rew=-25.00]


Epoch #6: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #7: 1025it [00:02, 432.68it/s, env_step=7168, len=37, n/ep=1, n/st=64, player_1/loss=1577.808, player_2/loss=1731.837, rew=-25.00]


Epoch #7: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #8: 1025it [00:02, 428.92it/s, env_step=8192, len=35, n/ep=1, n/st=64, player_1/loss=1998.952, player_2/loss=1598.372, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #9: 1025it [00:02, 395.56it/s, env_step=9216, len=35, n/ep=2, n/st=64, player_1/loss=1697.578, player_2/loss=1553.805, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #10: 1025it [00:02, 368.05it/s, env_step=10240, len=37, n/ep=2, n/st=64, player_1/loss=2147.494, player_2/loss=1443.386, rew=0.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #11: 1025it [00:02, 398.62it/s, env_step=11264, len=32, n/ep=2, n/st=64, player_1/loss=1978.333, rew=0.00]       


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #12: 1025it [00:02, 383.77it/s, env_step=12288, len=42, n/ep=1, n/st=64, player_1/loss=1217.767, player_2/loss=1567.039, rew=100.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #13: 1025it [00:02, 387.13it/s, env_step=13312, len=28, n/ep=2, n/st=64, player_1/loss=1368.620, player_2/loss=1116.496, rew=25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #14: 1025it [00:02, 431.67it/s, env_step=14336, len=31, n/ep=3, n/st=64, player_1/loss=1853.776, player_2/loss=1350.068, rew=50.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #15: 1025it [00:02, 428.65it/s, env_step=15360, len=33, n/ep=2, n/st=64, player_1/loss=1701.954, player_2/loss=1553.927, rew=0.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #16: 1025it [00:02, 416.39it/s, env_step=16384, len=29, n/ep=3, n/st=64, player_1/loss=1816.875, player_2/loss=1765.062, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #17: 1025it [00:02, 424.45it/s, env_step=17408, len=34, n/ep=2, n/st=64, player_1/loss=1571.104, player_2/loss=1617.025, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #18: 1025it [00:02, 408.90it/s, env_step=18432, len=34, n/ep=2, n/st=64, player_1/loss=1229.228, player_2/loss=1236.744, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #19: 1025it [00:02, 421.09it/s, env_step=19456, len=36, n/ep=2, n/st=64, player_1/loss=1516.529, player_2/loss=1337.834, rew=0.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #20: 1025it [00:02, 414.85it/s, env_step=20480, len=34, n/ep=2, n/st=64, player_2/loss=1526.668, rew=0.00]       


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #21: 1025it [00:02, 431.09it/s, env_step=21504, len=39, n/ep=1, n/st=64, player_1/loss=1518.298, player_2/loss=1396.211, rew=-25.00]


Epoch #21: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #22: 1025it [00:02, 432.77it/s, env_step=22528, len=34, n/ep=1, n/st=64, player_1/loss=1724.678, player_2/loss=1206.087, rew=25.00]


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #23: 1025it [00:02, 433.78it/s, env_step=23552, len=34, n/ep=2, n/st=64, player_1/loss=1819.565, player_2/loss=1305.585, rew=25.00]


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #24: 1025it [00:02, 436.05it/s, env_step=24576, len=40, n/ep=1, n/st=64, player_1/loss=1622.412, player_2/loss=1259.633, rew=25.00]


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #25: 1025it [00:02, 432.65it/s, env_step=25600, len=35, n/ep=2, n/st=64, player_1/loss=1639.831, player_2/loss=1107.418, rew=0.00]


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #26: 1025it [00:02, 432.75it/s, env_step=26624, len=35, n/ep=2, n/st=64, player_1/loss=1841.753, player_2/loss=1046.327, rew=62.50]


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #27: 1025it [00:02, 434.22it/s, env_step=27648, len=34, n/ep=2, n/st=64, player_1/loss=2266.452, player_2/loss=1137.350, rew=25.00]


Epoch #27: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #28: 1025it [00:02, 404.53it/s, env_step=28672, len=25, n/ep=2, n/st=64, player_1/loss=2414.033, player_2/loss=1366.159, rew=-25.00]


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #29: 1025it [00:02, 391.17it/s, env_step=29696, len=39, n/ep=2, n/st=64, player_1/loss=1840.626, player_2/loss=1305.033, rew=37.50]


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #30: 1025it [00:02, 375.31it/s, env_step=30720, len=37, n/ep=1, n/st=64, player_1/loss=1793.405, player_2/loss=1232.066, rew=-25.00]


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #31: 1025it [00:02, 378.78it/s, env_step=31744, len=37, n/ep=2, n/st=64, player_1/loss=1569.912, player_2/loss=1583.764, rew=0.00]


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #32: 1025it [00:02, 380.78it/s, env_step=32768, len=37, n/ep=2, n/st=64, player_1/loss=1710.332, player_2/loss=1839.364, rew=0.00]


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #33: 1025it [00:02, 384.32it/s, env_step=33792, len=38, n/ep=2, n/st=64, player_1/loss=1515.147, player_2/loss=1718.261, rew=-25.00]


Epoch #33: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #34: 1025it [00:02, 385.09it/s, env_step=34816, len=38, n/ep=2, n/st=64, player_1/loss=1713.632, player_2/loss=1094.112, rew=0.00]


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #35: 1025it [00:02, 381.46it/s, env_step=35840, len=36, n/ep=1, n/st=64, player_1/loss=1783.424, player_2/loss=1036.792, rew=25.00]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #36: 1025it [00:02, 377.49it/s, env_step=36864, len=36, n/ep=2, n/st=64, player_1/loss=1774.225, player_2/loss=1333.885, rew=0.00]


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #37: 1025it [00:02, 384.63it/s, env_step=37888, len=33, n/ep=2, n/st=64, player_1/loss=1919.493, rew=0.00]       


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #38: 1025it [00:02, 384.15it/s, env_step=38912, len=37, n/ep=2, n/st=64, player_1/loss=1988.214, player_2/loss=1171.427, rew=0.00]


Epoch #38: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #39: 1025it [00:02, 384.31it/s, env_step=39936, len=35, n/ep=2, n/st=64, player_1/loss=2164.427, player_2/loss=1357.507, rew=37.50]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #40: 1025it [00:02, 383.13it/s, env_step=40960, len=32, n/ep=2, n/st=64, player_1/loss=1805.569, player_2/loss=1225.678, rew=-25.00]


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #41: 1025it [00:02, 381.83it/s, env_step=41984, len=33, n/ep=2, n/st=64, player_1/loss=1579.844, rew=0.00]       


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #42: 1025it [00:02, 383.90it/s, env_step=43008, len=40, n/ep=2, n/st=64, player_1/loss=1563.771, player_2/loss=1315.917, rew=62.50]


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #43: 1025it [00:02, 380.50it/s, env_step=44032, len=38, n/ep=1, n/st=64, player_1/loss=1510.927, player_2/loss=1331.653, rew=25.00]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #44: 1025it [00:02, 384.32it/s, env_step=45056, len=38, n/ep=2, n/st=64, player_1/loss=1786.095, player_2/loss=1208.245, rew=25.00]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #45: 1025it [00:02, 380.71it/s, env_step=46080, len=34, n/ep=2, n/st=64, player_1/loss=1836.211, player_2/loss=1271.920, rew=0.00]


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #46: 1025it [00:02, 381.47it/s, env_step=47104, len=36, n/ep=2, n/st=64, player_1/loss=1565.312, player_2/loss=1471.106, rew=-25.00]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #47: 1025it [00:02, 382.12it/s, env_step=48128, len=37, n/ep=2, n/st=64, player_1/loss=1591.664, player_2/loss=1591.716, rew=0.00]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #48: 1025it [00:02, 385.92it/s, env_step=49152, len=38, n/ep=2, n/st=64, player_1/loss=1549.626, player_2/loss=1458.659, rew=25.00]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #49: 1025it [00:02, 382.93it/s, env_step=50176, len=37, n/ep=2, n/st=64, player_1/loss=1465.298, player_2/loss=1170.000, rew=62.50]


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #1


Epoch #1: 1025it [00:02, 378.54it/s, env_step=1024, len=29, n/ep=2, n/st=64, player_2/loss=954.186, rew=-25.00]        


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 382.28it/s, env_step=2048, len=39, n/ep=2, n/st=64, player_1/loss=2191.761, player_2/loss=1008.689, rew=37.50]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 382.08it/s, env_step=3072, len=38, n/ep=2, n/st=64, player_1/loss=2000.023, player_2/loss=1074.668, rew=-25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #4: 1025it [00:02, 379.69it/s, env_step=4096, len=34, n/ep=2, n/st=64, player_1/loss=1707.668, player_2/loss=1106.994, rew=0.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #5: 1025it [00:02, 382.20it/s, env_step=5120, len=35, n/ep=2, n/st=64, player_1/loss=1410.066, player_2/loss=1241.815, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #6: 1025it [00:02, 382.90it/s, env_step=6144, len=35, n/ep=2, n/st=64, player_1/loss=1275.696, player_2/loss=1605.923, rew=0.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #7: 1025it [00:02, 385.41it/s, env_step=7168, len=27, n/ep=2, n/st=64, player_1/loss=1525.389, player_2/loss=1538.928, rew=0.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #8: 1025it [00:02, 386.29it/s, env_step=8192, len=38, n/ep=2, n/st=64, player_1/loss=1455.505, player_2/loss=1252.710, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #9: 1025it [00:02, 381.39it/s, env_step=9216, len=36, n/ep=2, n/st=64, player_1/loss=1355.693, player_2/loss=1253.693, rew=25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #10: 1025it [00:02, 383.71it/s, env_step=10240, len=29, n/ep=3, n/st=64, player_1/loss=1402.621, player_2/loss=1205.525, rew=-8.33]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #11: 1025it [00:02, 382.74it/s, env_step=11264, len=37, n/ep=2, n/st=64, player_1/loss=1412.602, player_2/loss=1322.061, rew=25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #12: 1025it [00:02, 383.92it/s, env_step=12288, len=27, n/ep=2, n/st=64, player_1/loss=1418.544, player_2/loss=1244.078, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #13: 1025it [00:02, 382.76it/s, env_step=13312, len=39, n/ep=2, n/st=64, player_1/loss=1339.019, player_2/loss=1217.535, rew=62.50]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #14: 1025it [00:02, 378.56it/s, env_step=14336, len=38, n/ep=1, n/st=64, player_1/loss=1190.748, player_2/loss=1357.701, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #15: 1025it [00:02, 383.80it/s, env_step=15360, len=24, n/ep=2, n/st=64, player_1/loss=1247.032, player_2/loss=1571.994, rew=0.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #16: 1025it [00:02, 380.53it/s, env_step=16384, len=35, n/ep=2, n/st=64, player_1/loss=1351.244, player_2/loss=1672.685, rew=0.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #17: 1025it [00:02, 384.80it/s, env_step=17408, len=37, n/ep=1, n/st=64, player_1/loss=1151.216, player_2/loss=1259.495, rew=25.00]


Epoch #17: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #18: 1025it [00:02, 386.28it/s, env_step=18432, len=30, n/ep=2, n/st=64, player_1/loss=1254.359, player_2/loss=1236.533, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #19: 1025it [00:02, 381.71it/s, env_step=19456, len=38, n/ep=2, n/st=64, player_1/loss=1310.288, player_2/loss=1243.807, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #20: 1025it [00:02, 380.86it/s, env_step=20480, len=31, n/ep=2, n/st=64, player_1/loss=1511.953, player_2/loss=1086.830, rew=-25.00]


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #21: 1025it [00:02, 382.07it/s, env_step=21504, len=42, n/ep=1, n/st=64, player_1/loss=1443.655, player_2/loss=1077.276, rew=100.00]


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #22: 1025it [00:02, 385.28it/s, env_step=22528, len=42, n/ep=1, n/st=64, player_1/loss=1178.562, player_2/loss=1059.490, rew=100.00]


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #23: 1025it [00:02, 384.93it/s, env_step=23552, len=32, n/ep=2, n/st=64, player_1/loss=1297.358, player_2/loss=991.855, rew=0.00]


Epoch #23: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #24: 1025it [00:02, 376.46it/s, env_step=24576, len=37, n/ep=2, n/st=64, player_1/loss=1441.858, rew=0.00]       


Epoch #24: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #25: 1025it [00:02, 376.99it/s, env_step=25600, len=38, n/ep=2, n/st=64, player_1/loss=1459.897, player_2/loss=1237.975, rew=-25.00]


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #26: 1025it [00:02, 378.22it/s, env_step=26624, len=39, n/ep=2, n/st=64, player_1/loss=1512.171, player_2/loss=1281.468, rew=-25.00]


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #27: 1025it [00:02, 393.33it/s, env_step=27648, len=38, n/ep=1, n/st=64, player_1/loss=1576.616, player_2/loss=1547.563, rew=-25.00]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #28: 1025it [00:02, 389.95it/s, env_step=28672, len=32, n/ep=2, n/st=64, player_1/loss=1809.867, player_2/loss=1563.241, rew=25.00]


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #29: 1025it [00:02, 374.83it/s, env_step=29696, len=33, n/ep=2, n/st=64, player_1/loss=1619.785, player_2/loss=1496.365, rew=0.00]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #30: 1025it [00:02, 376.66it/s, env_step=30720, len=39, n/ep=1, n/st=64, player_1/loss=1786.542, player_2/loss=1341.220, rew=25.00]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #31: 1025it [00:02, 371.27it/s, env_step=31744, len=36, n/ep=2, n/st=64, player_1/loss=1942.763, player_2/loss=1287.998, rew=0.00]


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #32: 1025it [00:02, 374.96it/s, env_step=32768, len=38, n/ep=2, n/st=64, player_1/loss=1481.536, player_2/loss=1081.959, rew=-25.00]


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #33: 1025it [00:02, 374.89it/s, env_step=33792, len=24, n/ep=2, n/st=64, player_1/loss=1457.726, player_2/loss=1056.610, rew=0.00]


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #34: 1025it [00:02, 375.71it/s, env_step=34816, len=38, n/ep=2, n/st=64, player_1/loss=1279.419, player_2/loss=1256.970, rew=0.00]


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #35: 1025it [00:02, 372.40it/s, env_step=35840, len=34, n/ep=2, n/st=64, player_1/loss=1182.130, player_2/loss=1192.567, rew=25.00]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #36: 1025it [00:02, 374.61it/s, env_step=36864, len=42, n/ep=1, n/st=64, player_1/loss=1273.274, player_2/loss=1274.234, rew=100.00]


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #37: 1025it [00:02, 373.64it/s, env_step=37888, len=28, n/ep=2, n/st=64, player_1/loss=1741.800, player_2/loss=1663.912, rew=-25.00]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #38: 1025it [00:02, 370.83it/s, env_step=38912, len=35, n/ep=2, n/st=64, player_1/loss=1737.335, player_2/loss=1658.081, rew=-25.00]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #39: 1025it [00:02, 371.10it/s, env_step=39936, len=36, n/ep=2, n/st=64, player_1/loss=1273.073, player_2/loss=1306.199, rew=0.00]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #40: 1025it [00:02, 373.06it/s, env_step=40960, len=22, n/ep=2, n/st=64, player_1/loss=1500.215, player_2/loss=1190.305, rew=-25.00]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #41: 1025it [00:02, 370.66it/s, env_step=41984, len=34, n/ep=2, n/st=64, player_1/loss=1622.628, player_2/loss=1235.377, rew=-25.00]


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #42: 1025it [00:02, 371.02it/s, env_step=43008, len=27, n/ep=3, n/st=64, player_1/loss=1364.321, player_2/loss=1267.602, rew=8.33]


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #43: 1025it [00:02, 372.47it/s, env_step=44032, len=27, n/ep=2, n/st=64, player_1/loss=1136.161, player_2/loss=1378.689, rew=0.00]


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #44: 1025it [00:02, 373.69it/s, env_step=45056, len=38, n/ep=1, n/st=64, player_1/loss=1251.563, player_2/loss=1249.384, rew=-25.00]


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #45: 1025it [00:02, 371.64it/s, env_step=46080, len=19, n/ep=3, n/st=64, player_1/loss=1207.172, player_2/loss=1257.904, rew=-8.33]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #46: 1025it [00:02, 374.11it/s, env_step=47104, len=23, n/ep=2, n/st=64, player_1/loss=1067.782, player_2/loss=1497.791, rew=0.00]


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #47: 1025it [00:02, 373.63it/s, env_step=48128, len=35, n/ep=2, n/st=64, player_1/loss=1206.980, player_2/loss=1401.270, rew=-25.00]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #48: 1025it [00:02, 371.94it/s, env_step=49152, len=42, n/ep=1, n/st=64, player_1/loss=1247.282, player_2/loss=1229.895, rew=100.00]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #49: 1025it [00:02, 375.05it/s, env_step=50176, len=35, n/ep=2, n/st=64, player_1/loss=1219.496, player_2/loss=1184.261, rew=0.00]


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #17


Epoch #1: 1025it [00:02, 370.88it/s, env_step=1024, len=38, n/ep=1, n/st=64, player_1/loss=1260.154, rew=25.00]        


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 372.72it/s, env_step=2048, len=38, n/ep=1, n/st=64, player_1/loss=1285.047, player_2/loss=1200.431, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 372.94it/s, env_step=3072, len=38, n/ep=1, n/st=64, player_1/loss=1384.744, player_2/loss=1096.031, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 372.78it/s, env_step=4096, len=31, n/ep=3, n/st=64, player_1/loss=1397.676, player_2/loss=1133.055, rew=8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 372.53it/s, env_step=5120, len=29, n/ep=3, n/st=64, player_1/loss=1561.019, player_2/loss=1273.017, rew=33.33]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 372.82it/s, env_step=6144, len=38, n/ep=2, n/st=64, player_1/loss=1653.738, player_2/loss=1354.355, rew=-25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 373.09it/s, env_step=7168, len=38, n/ep=2, n/st=64, player_1/loss=1481.514, player_2/loss=1140.933, rew=0.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 372.11it/s, env_step=8192, len=36, n/ep=2, n/st=64, player_1/loss=1444.545, player_2/loss=1041.290, rew=0.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 372.43it/s, env_step=9216, len=31, n/ep=2, n/st=64, player_1/loss=1400.649, player_2/loss=1225.796, rew=0.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 371.76it/s, env_step=10240, len=38, n/ep=2, n/st=64, player_1/loss=1339.818, player_2/loss=1114.134, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 358.88it/s, env_step=11264, len=37, n/ep=2, n/st=64, player_1/loss=1264.575, player_2/loss=1006.838, rew=0.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 370.86it/s, env_step=12288, len=21, n/ep=4, n/st=64, player_1/loss=1368.996, player_2/loss=1161.472, rew=12.50]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 373.45it/s, env_step=13312, len=31, n/ep=2, n/st=64, player_1/loss=1132.018, player_2/loss=1066.174, rew=25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 372.33it/s, env_step=14336, len=33, n/ep=2, n/st=64, player_1/loss=738.627, player_2/loss=908.620, rew=0.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 374.40it/s, env_step=15360, len=39, n/ep=2, n/st=64, player_1/loss=960.759, player_2/loss=917.980, rew=0.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 373.26it/s, env_step=16384, len=34, n/ep=2, n/st=64, player_1/loss=999.529, player_2/loss=976.424, rew=0.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 372.95it/s, env_step=17408, len=32, n/ep=2, n/st=64, player_1/loss=988.380, player_2/loss=1051.469, rew=0.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 370.89it/s, env_step=18432, len=40, n/ep=1, n/st=64, player_1/loss=1294.137, player_2/loss=1059.102, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 382.22it/s, env_step=19456, len=39, n/ep=2, n/st=64, player_1/loss=1522.993, player_2/loss=1070.254, rew=0.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 383.96it/s, env_step=20480, len=35, n/ep=2, n/st=64, player_2/loss=1142.421, rew=25.00]      


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 373.49it/s, env_step=21504, len=20, n/ep=2, n/st=64, player_1/loss=1529.104, player_2/loss=1089.107, rew=0.00]


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 370.07it/s, env_step=22528, len=36, n/ep=2, n/st=64, player_1/loss=1601.844, player_2/loss=1037.944, rew=0.00]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 372.68it/s, env_step=23552, len=38, n/ep=1, n/st=64, player_1/loss=1327.257, player_2/loss=953.162, rew=25.00]


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 369.90it/s, env_step=24576, len=23, n/ep=2, n/st=64, player_1/loss=1274.566, player_2/loss=920.863, rew=0.00]


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 372.55it/s, env_step=25600, len=38, n/ep=2, n/st=64, player_1/loss=1102.274, player_2/loss=918.326, rew=25.00]


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 374.76it/s, env_step=26624, len=40, n/ep=2, n/st=64, player_1/loss=1251.701, player_2/loss=1148.861, rew=62.50]


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 359.19it/s, env_step=27648, len=27, n/ep=2, n/st=64, player_1/loss=1366.977, player_2/loss=1238.497, rew=0.00]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 370.25it/s, env_step=28672, len=29, n/ep=1, n/st=64, player_1/loss=1955.899, player_2/loss=1055.482, rew=-25.00]


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 371.05it/s, env_step=29696, len=18, n/ep=3, n/st=64, player_1/loss=1770.269, player_2/loss=1351.473, rew=8.33]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 370.46it/s, env_step=30720, len=23, n/ep=2, n/st=64, player_1/loss=1177.262, player_2/loss=1535.200, rew=0.00]


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 371.19it/s, env_step=31744, len=36, n/ep=2, n/st=64, player_1/loss=1133.267, player_2/loss=1107.128, rew=25.00]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 372.18it/s, env_step=32768, len=37, n/ep=2, n/st=64, player_2/loss=987.858, rew=25.00]       


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 371.56it/s, env_step=33792, len=32, n/ep=3, n/st=64, player_1/loss=1072.330, player_2/loss=996.614, rew=25.00]


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 371.51it/s, env_step=34816, len=39, n/ep=2, n/st=64, player_1/loss=1106.611, player_2/loss=945.998, rew=25.00]


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 373.73it/s, env_step=35840, len=27, n/ep=2, n/st=64, player_1/loss=1450.704, player_2/loss=1005.281, rew=25.00]


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 375.65it/s, env_step=36864, len=23, n/ep=2, n/st=64, player_1/loss=1509.438, player_2/loss=998.584, rew=0.00]


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 358.25it/s, env_step=37888, len=39, n/ep=1, n/st=64, player_1/loss=1168.454, player_2/loss=927.687, rew=-25.00]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 363.23it/s, env_step=38912, len=27, n/ep=3, n/st=64, player_1/loss=1073.842, player_2/loss=881.727, rew=25.00]


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 358.78it/s, env_step=39936, len=38, n/ep=1, n/st=64, player_1/loss=1026.686, player_2/loss=988.092, rew=25.00]


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 361.68it/s, env_step=40960, len=30, n/ep=3, n/st=64, player_1/loss=1304.873, player_2/loss=967.113, rew=25.00]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 363.20it/s, env_step=41984, len=29, n/ep=2, n/st=64, player_1/loss=1397.290, player_2/loss=923.860, rew=25.00]


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 361.57it/s, env_step=43008, len=38, n/ep=2, n/st=64, player_1/loss=1493.594, player_2/loss=978.994, rew=25.00]


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 369.93it/s, env_step=44032, len=28, n/ep=2, n/st=64, player_1/loss=1424.201, player_2/loss=873.684, rew=25.00]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 367.55it/s, env_step=45056, len=34, n/ep=3, n/st=64, player_1/loss=976.150, player_2/loss=844.675, rew=75.00]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 369.14it/s, env_step=46080, len=20, n/ep=4, n/st=64, player_1/loss=1008.324, player_2/loss=1023.925, rew=12.50]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 368.63it/s, env_step=47104, len=36, n/ep=2, n/st=64, player_1/loss=1114.010, player_2/loss=919.248, rew=25.00]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 370.09it/s, env_step=48128, len=40, n/ep=2, n/st=64, player_1/loss=1061.262, player_2/loss=755.920, rew=25.00]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 370.95it/s, env_step=49152, len=31, n/ep=3, n/st=64, player_1/loss=1117.428, player_2/loss=792.732, rew=8.33]


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 368.63it/s, env_step=50176, len=29, n/ep=3, n/st=64, player_2/loss=1061.473, rew=-8.33]      


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 367.05it/s, env_step=1024, len=24, n/ep=3, n/st=64, player_1/loss=702.281, player_2/loss=867.599, rew=16.67]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 367.41it/s, env_step=2048, len=38, n/ep=2, n/st=64, player_1/loss=892.963, player_2/loss=1082.692, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 366.37it/s, env_step=3072, len=29, n/ep=3, n/st=64, player_1/loss=1013.267, player_2/loss=1063.105, rew=16.67]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 365.09it/s, env_step=4096, len=32, n/ep=2, n/st=64, player_1/loss=1110.133, player_2/loss=819.482, rew=0.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 377.57it/s, env_step=5120, len=28, n/ep=2, n/st=64, player_1/loss=1201.980, player_2/loss=878.146, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 380.32it/s, env_step=6144, len=36, n/ep=2, n/st=64, player_1/loss=1159.074, player_2/loss=881.884, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 373.85it/s, env_step=7168, len=26, n/ep=3, n/st=64, player_1/loss=1019.247, player_2/loss=972.056, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 383.10it/s, env_step=8192, len=27, n/ep=2, n/st=64, player_1/loss=1046.442, player_2/loss=895.310, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 380.98it/s, env_step=9216, len=28, n/ep=2, n/st=64, player_1/loss=1151.056, player_2/loss=874.236, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 381.07it/s, env_step=10240, len=25, n/ep=3, n/st=64, player_1/loss=1069.195, player_2/loss=935.182, rew=-25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 376.75it/s, env_step=11264, len=26, n/ep=3, n/st=64, player_1/loss=1164.370, player_2/loss=1010.789, rew=8.33]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 380.63it/s, env_step=12288, len=37, n/ep=1, n/st=64, player_1/loss=1221.656, player_2/loss=942.170, rew=25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 380.74it/s, env_step=13312, len=35, n/ep=2, n/st=64, player_1/loss=1042.860, player_2/loss=896.693, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 381.74it/s, env_step=14336, len=36, n/ep=2, n/st=64, player_1/loss=1101.963, player_2/loss=916.485, rew=0.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #15: 1025it [00:02, 381.12it/s, env_step=15360, len=39, n/ep=2, n/st=64, player_1/loss=1056.704, player_2/loss=812.119, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #16: 1025it [00:02, 380.86it/s, env_step=16384, len=39, n/ep=1, n/st=64, player_1/loss=955.668, player_2/loss=788.286, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #17: 1025it [00:02, 382.95it/s, env_step=17408, len=25, n/ep=3, n/st=64, player_1/loss=933.400, player_2/loss=914.229, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #18: 1025it [00:02, 378.17it/s, env_step=18432, len=21, n/ep=3, n/st=64, player_1/loss=866.882, player_2/loss=1042.394, rew=-8.33]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #19: 1025it [00:02, 378.39it/s, env_step=19456, len=27, n/ep=2, n/st=64, player_1/loss=1085.330, player_2/loss=1073.311, rew=0.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #20: 1025it [00:02, 379.26it/s, env_step=20480, len=30, n/ep=3, n/st=64, player_1/loss=1180.411, player_2/loss=963.661, rew=-25.00]


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #21: 1025it [00:02, 377.89it/s, env_step=21504, len=25, n/ep=3, n/st=64, player_1/loss=1090.932, player_2/loss=885.540, rew=-25.00]


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #22: 1025it [00:02, 378.15it/s, env_step=22528, len=35, n/ep=2, n/st=64, player_1/loss=917.568, player_2/loss=946.223, rew=37.50]


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #23: 1025it [00:02, 379.81it/s, env_step=23552, len=16, n/ep=2, n/st=64, player_1/loss=1006.782, player_2/loss=938.639, rew=-25.00]


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #24: 1025it [00:02, 383.25it/s, env_step=24576, len=28, n/ep=3, n/st=64, player_1/loss=918.647, player_2/loss=808.550, rew=-8.33]


Epoch #24: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #25: 1025it [00:02, 380.23it/s, env_step=25600, len=27, n/ep=2, n/st=64, player_1/loss=818.835, player_2/loss=850.560, rew=0.00]


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #26: 1025it [00:02, 382.43it/s, env_step=26624, len=24, n/ep=3, n/st=64, player_1/loss=921.048, player_2/loss=842.672, rew=-25.00]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #27: 1025it [00:02, 381.41it/s, env_step=27648, len=28, n/ep=2, n/st=64, player_1/loss=1050.761, player_2/loss=710.616, rew=-25.00]


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #28: 1025it [00:02, 382.73it/s, env_step=28672, len=28, n/ep=2, n/st=64, player_1/loss=976.616, player_2/loss=849.691, rew=-25.00]


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #29: 1025it [00:02, 378.54it/s, env_step=29696, len=25, n/ep=2, n/st=64, player_1/loss=900.534, player_2/loss=837.111, rew=-25.00]


Epoch #29: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #30: 1025it [00:02, 380.60it/s, env_step=30720, len=40, n/ep=1, n/st=64, player_1/loss=1227.551, player_2/loss=783.737, rew=-25.00]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #31: 1025it [00:02, 380.69it/s, env_step=31744, len=26, n/ep=2, n/st=64, player_1/loss=1175.403, player_2/loss=889.026, rew=0.00]


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #32: 1025it [00:02, 383.32it/s, env_step=32768, len=35, n/ep=2, n/st=64, player_1/loss=1045.340, player_2/loss=847.525, rew=0.00]


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #33: 1025it [00:02, 381.75it/s, env_step=33792, len=27, n/ep=2, n/st=64, player_1/loss=1045.707, player_2/loss=710.459, rew=-25.00]


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #34: 1025it [00:02, 381.45it/s, env_step=34816, len=36, n/ep=2, n/st=64, player_1/loss=1160.835, player_2/loss=768.564, rew=-25.00]


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #35: 1025it [00:02, 383.09it/s, env_step=35840, len=40, n/ep=2, n/st=64, player_1/loss=1113.045, player_2/loss=807.500, rew=37.50]


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #36: 1025it [00:02, 375.19it/s, env_step=36864, len=24, n/ep=3, n/st=64, player_1/loss=905.677, player_2/loss=798.481, rew=-25.00]


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #37: 1025it [00:02, 382.23it/s, env_step=37888, len=24, n/ep=3, n/st=64, player_1/loss=744.226, player_2/loss=791.564, rew=-25.00]


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #38: 1025it [00:02, 380.45it/s, env_step=38912, len=28, n/ep=2, n/st=64, player_2/loss=759.398, rew=-25.00]      


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #39: 1025it [00:02, 381.74it/s, env_step=39936, len=38, n/ep=2, n/st=64, player_1/loss=884.713, player_2/loss=726.570, rew=-25.00]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #40: 1025it [00:02, 382.62it/s, env_step=40960, len=24, n/ep=3, n/st=64, player_1/loss=1041.867, player_2/loss=681.202, rew=-25.00]


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #41: 1025it [00:02, 383.00it/s, env_step=41984, len=38, n/ep=2, n/st=64, player_1/loss=978.393, player_2/loss=783.681, rew=-25.00]


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #42: 1025it [00:02, 380.03it/s, env_step=43008, len=36, n/ep=2, n/st=64, player_1/loss=864.608, player_2/loss=884.875, rew=0.00]


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #43: 1025it [00:02, 381.40it/s, env_step=44032, len=39, n/ep=2, n/st=64, player_1/loss=1000.293, player_2/loss=836.226, rew=-25.00]


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #44: 1025it [00:02, 379.17it/s, env_step=45056, len=24, n/ep=3, n/st=64, player_1/loss=958.562, player_2/loss=1016.541, rew=-8.33]


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #45: 1025it [00:02, 378.90it/s, env_step=46080, len=34, n/ep=2, n/st=64, player_1/loss=1115.169, player_2/loss=1093.196, rew=0.00]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #46: 1025it [00:02, 381.27it/s, env_step=47104, len=38, n/ep=2, n/st=64, player_1/loss=1223.997, player_2/loss=896.348, rew=37.50]


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #47: 1025it [00:02, 381.34it/s, env_step=48128, len=14, n/ep=4, n/st=64, player_1/loss=1142.600, player_2/loss=906.309, rew=-12.50]


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #48: 1025it [00:02, 381.07it/s, env_step=49152, len=28, n/ep=2, n/st=64, player_1/loss=882.946, player_2/loss=878.967, rew=0.00]


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #49: 1025it [00:02, 380.85it/s, env_step=50176, len=34, n/ep=2, n/st=64, player_1/loss=811.332, player_2/loss=810.714, rew=-25.00]


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #24


Epoch #1: 1025it [00:02, 378.97it/s, env_step=1024, len=23, n/ep=3, n/st=64, player_1/loss=619.973, player_2/loss=855.159, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 382.25it/s, env_step=2048, len=17, n/ep=4, n/st=64, player_1/loss=623.304, player_2/loss=962.898, rew=0.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 383.52it/s, env_step=3072, len=29, n/ep=3, n/st=64, player_1/loss=896.147, player_2/loss=936.149, rew=50.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 379.80it/s, env_step=4096, len=32, n/ep=2, n/st=64, player_1/loss=991.888, player_2/loss=751.185, rew=0.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 383.15it/s, env_step=5120, len=31, n/ep=2, n/st=64, player_1/loss=842.010, player_2/loss=799.966, rew=0.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 378.38it/s, env_step=6144, len=23, n/ep=3, n/st=64, player_1/loss=902.143, player_2/loss=708.206, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 379.74it/s, env_step=7168, len=26, n/ep=1, n/st=64, player_1/loss=1023.188, player_2/loss=721.647, rew=25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 383.03it/s, env_step=8192, len=21, n/ep=3, n/st=64, player_1/loss=957.397, player_2/loss=807.066, rew=8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 383.30it/s, env_step=9216, len=39, n/ep=2, n/st=64, player_1/loss=829.350, player_2/loss=861.549, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 385.08it/s, env_step=10240, len=17, n/ep=2, n/st=64, player_1/loss=779.362, player_2/loss=856.358, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 383.95it/s, env_step=11264, len=35, n/ep=2, n/st=64, player_1/loss=728.386, player_2/loss=762.147, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 381.19it/s, env_step=12288, len=22, n/ep=4, n/st=64, player_1/loss=992.165, player_2/loss=860.749, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 381.85it/s, env_step=13312, len=20, n/ep=3, n/st=64, player_1/loss=904.140, player_2/loss=756.900, rew=-8.33]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 382.10it/s, env_step=14336, len=39, n/ep=2, n/st=64, player_1/loss=826.632, player_2/loss=642.467, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 384.64it/s, env_step=15360, len=19, n/ep=4, n/st=64, player_1/loss=775.541, player_2/loss=664.034, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 383.12it/s, env_step=16384, len=37, n/ep=2, n/st=64, player_1/loss=994.980, player_2/loss=808.721, rew=0.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 382.76it/s, env_step=17408, len=28, n/ep=2, n/st=64, player_1/loss=1005.505, player_2/loss=886.995, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 381.08it/s, env_step=18432, len=32, n/ep=2, n/st=64, player_1/loss=1157.428, player_2/loss=779.411, rew=0.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 382.32it/s, env_step=19456, len=19, n/ep=4, n/st=64, player_1/loss=1162.880, player_2/loss=716.432, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 378.11it/s, env_step=20480, len=24, n/ep=3, n/st=64, player_1/loss=1088.065, player_2/loss=723.452, rew=25.00]


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 382.12it/s, env_step=21504, len=27, n/ep=2, n/st=64, player_1/loss=1048.245, player_2/loss=828.355, rew=25.00]


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 383.87it/s, env_step=22528, len=27, n/ep=2, n/st=64, player_1/loss=1142.213, player_2/loss=811.728, rew=0.00]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 383.36it/s, env_step=23552, len=15, n/ep=5, n/st=64, player_1/loss=993.840, player_2/loss=727.618, rew=15.00]


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 381.75it/s, env_step=24576, len=38, n/ep=2, n/st=64, player_1/loss=671.579, player_2/loss=757.138, rew=0.00]


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 381.28it/s, env_step=25600, len=37, n/ep=2, n/st=64, player_1/loss=667.223, player_2/loss=676.329, rew=25.00]


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 382.32it/s, env_step=26624, len=36, n/ep=2, n/st=64, player_1/loss=739.512, player_2/loss=633.182, rew=25.00]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 385.65it/s, env_step=27648, len=36, n/ep=2, n/st=64, player_1/loss=775.034, player_2/loss=657.763, rew=25.00]


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 382.55it/s, env_step=28672, len=25, n/ep=2, n/st=64, player_1/loss=847.316, player_2/loss=706.034, rew=0.00]


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 380.82it/s, env_step=29696, len=16, n/ep=3, n/st=64, player_1/loss=836.692, player_2/loss=734.369, rew=25.00]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 384.89it/s, env_step=30720, len=20, n/ep=3, n/st=64, player_1/loss=894.947, player_2/loss=680.758, rew=8.33]


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 384.27it/s, env_step=31744, len=32, n/ep=2, n/st=64, player_1/loss=794.465, player_2/loss=802.267, rew=25.00]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 381.72it/s, env_step=32768, len=28, n/ep=3, n/st=64, player_1/loss=912.567, player_2/loss=824.880, rew=-8.33]


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 381.15it/s, env_step=33792, len=35, n/ep=2, n/st=64, player_1/loss=1093.220, player_2/loss=609.983, rew=25.00]


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 382.29it/s, env_step=34816, len=38, n/ep=1, n/st=64, player_1/loss=1081.491, player_2/loss=653.004, rew=25.00]


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 379.78it/s, env_step=35840, len=25, n/ep=3, n/st=64, player_1/loss=1139.889, player_2/loss=816.353, rew=8.33]


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 382.93it/s, env_step=36864, len=24, n/ep=3, n/st=64, player_1/loss=1178.474, player_2/loss=737.855, rew=25.00]


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 375.95it/s, env_step=37888, len=33, n/ep=2, n/st=64, player_1/loss=1086.797, player_2/loss=682.391, rew=0.00]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 385.35it/s, env_step=38912, len=34, n/ep=2, n/st=64, player_1/loss=845.072, player_2/loss=733.657, rew=0.00]


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 380.64it/s, env_step=39936, len=27, n/ep=2, n/st=64, player_1/loss=922.587, player_2/loss=634.500, rew=25.00]


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 383.62it/s, env_step=40960, len=30, n/ep=2, n/st=64, player_1/loss=827.329, player_2/loss=512.752, rew=25.00]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 381.00it/s, env_step=41984, len=39, n/ep=1, n/st=64, player_1/loss=1007.858, player_2/loss=577.857, rew=-25.00]


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 381.98it/s, env_step=43008, len=37, n/ep=1, n/st=64, player_1/loss=1273.028, player_2/loss=729.075, rew=-25.00]


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 382.11it/s, env_step=44032, len=23, n/ep=3, n/st=64, player_1/loss=1253.962, player_2/loss=660.030, rew=8.33]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 381.97it/s, env_step=45056, len=35, n/ep=2, n/st=64, player_1/loss=983.085, player_2/loss=646.478, rew=25.00]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 381.64it/s, env_step=46080, len=24, n/ep=3, n/st=64, player_1/loss=982.757, player_2/loss=607.554, rew=25.00]


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 382.37it/s, env_step=47104, len=38, n/ep=1, n/st=64, player_1/loss=1066.684, player_2/loss=601.692, rew=25.00]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 383.40it/s, env_step=48128, len=16, n/ep=3, n/st=64, player_1/loss=1122.844, player_2/loss=614.557, rew=25.00]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 382.26it/s, env_step=49152, len=24, n/ep=3, n/st=64, player_1/loss=846.690, player_2/loss=510.413, rew=8.33]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 380.58it/s, env_step=50176, len=30, n/ep=3, n/st=64, player_1/loss=869.086, player_2/loss=572.097, rew=25.00]


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 379.46it/s, env_step=1024, len=36, n/ep=1, n/st=64, player_1/loss=728.414, player_2/loss=736.435, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 378.85it/s, env_step=2048, len=28, n/ep=2, n/st=64, player_1/loss=738.632, player_2/loss=648.577, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 384.22it/s, env_step=3072, len=30, n/ep=3, n/st=64, player_1/loss=838.578, player_2/loss=532.179, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 383.46it/s, env_step=4096, len=28, n/ep=3, n/st=64, player_1/loss=767.742, player_2/loss=603.322, rew=-25.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #5: 1025it [00:02, 383.33it/s, env_step=5120, len=31, n/ep=2, n/st=64, player_1/loss=756.807, player_2/loss=636.255, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #6: 1025it [00:02, 383.85it/s, env_step=6144, len=34, n/ep=2, n/st=64, player_1/loss=855.097, player_2/loss=581.825, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #7: 1025it [00:02, 381.84it/s, env_step=7168, len=32, n/ep=2, n/st=64, player_1/loss=978.001, player_2/loss=565.190, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #8: 1025it [00:02, 382.43it/s, env_step=8192, len=20, n/ep=2, n/st=64, player_1/loss=1058.843, player_2/loss=553.700, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #9: 1025it [00:02, 384.38it/s, env_step=9216, len=26, n/ep=3, n/st=64, player_1/loss=1030.114, player_2/loss=733.594, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #10: 1025it [00:02, 384.93it/s, env_step=10240, len=23, n/ep=3, n/st=64, player_1/loss=1139.842, player_2/loss=745.122, rew=-8.33]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #11: 1025it [00:02, 384.51it/s, env_step=11264, len=31, n/ep=2, n/st=64, player_1/loss=939.302, player_2/loss=643.031, rew=0.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #12: 1025it [00:02, 381.10it/s, env_step=12288, len=21, n/ep=3, n/st=64, player_1/loss=900.031, player_2/loss=582.583, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #13: 1025it [00:02, 384.89it/s, env_step=13312, len=16, n/ep=4, n/st=64, player_1/loss=967.347, player_2/loss=469.125, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #14: 1025it [00:02, 381.44it/s, env_step=14336, len=36, n/ep=1, n/st=64, player_1/loss=813.633, player_2/loss=515.176, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #15: 1025it [00:02, 384.91it/s, env_step=15360, len=28, n/ep=2, n/st=64, player_1/loss=778.396, player_2/loss=626.761, rew=-25.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #16: 1025it [00:02, 385.37it/s, env_step=16384, len=24, n/ep=3, n/st=64, player_1/loss=1057.151, player_2/loss=575.706, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #17: 1025it [00:02, 383.42it/s, env_step=17408, len=23, n/ep=3, n/st=64, player_1/loss=1180.316, player_2/loss=501.590, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #18: 1025it [00:02, 383.05it/s, env_step=18432, len=20, n/ep=3, n/st=64, player_1/loss=1041.087, player_2/loss=543.967, rew=-8.33]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #19: 1025it [00:02, 387.20it/s, env_step=19456, len=23, n/ep=3, n/st=64, player_1/loss=905.072, player_2/loss=527.683, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #20: 1025it [00:02, 382.01it/s, env_step=20480, len=31, n/ep=2, n/st=64, player_1/loss=893.380, player_2/loss=599.846, rew=25.00]


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #21: 1025it [00:02, 382.86it/s, env_step=21504, len=21, n/ep=3, n/st=64, player_1/loss=887.320, player_2/loss=741.426, rew=-8.33]


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #22: 1025it [00:02, 383.11it/s, env_step=22528, len=29, n/ep=2, n/st=64, player_1/loss=733.945, player_2/loss=771.260, rew=-25.00]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #23: 1025it [00:02, 383.85it/s, env_step=23552, len=32, n/ep=1, n/st=64, player_1/loss=781.302, player_2/loss=711.528, rew=-25.00]


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #24: 1025it [00:02, 382.67it/s, env_step=24576, len=36, n/ep=2, n/st=64, player_1/loss=803.926, player_2/loss=681.146, rew=-25.00]


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #25: 1025it [00:02, 382.23it/s, env_step=25600, len=28, n/ep=2, n/st=64, player_1/loss=688.394, player_2/loss=695.576, rew=0.00]


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #26: 1025it [00:02, 383.15it/s, env_step=26624, len=24, n/ep=2, n/st=64, player_1/loss=763.501, player_2/loss=686.159, rew=0.00]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #27: 1025it [00:02, 386.64it/s, env_step=27648, len=29, n/ep=2, n/st=64, player_1/loss=749.322, player_2/loss=686.862, rew=-25.00]


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #28: 1025it [00:02, 385.87it/s, env_step=28672, len=38, n/ep=1, n/st=64, player_1/loss=740.159, player_2/loss=915.563, rew=-25.00]


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #29: 1025it [00:02, 384.39it/s, env_step=29696, len=35, n/ep=2, n/st=64, player_1/loss=574.765, player_2/loss=871.405, rew=0.00]


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #30: 1025it [00:02, 383.97it/s, env_step=30720, len=26, n/ep=2, n/st=64, player_1/loss=698.284, player_2/loss=656.843, rew=-25.00]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #31: 1025it [00:02, 384.10it/s, env_step=31744, len=27, n/ep=2, n/st=64, player_1/loss=749.994, player_2/loss=613.157, rew=-25.00]


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #32: 1025it [00:02, 383.82it/s, env_step=32768, len=35, n/ep=2, n/st=64, player_1/loss=632.791, player_2/loss=735.873, rew=-25.00]


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #33: 1025it [00:02, 383.77it/s, env_step=33792, len=38, n/ep=2, n/st=64, player_1/loss=710.737, player_2/loss=840.381, rew=-25.00]


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #34: 1025it [00:02, 384.17it/s, env_step=34816, len=34, n/ep=2, n/st=64, player_2/loss=831.832, rew=0.00]        


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #35: 1025it [00:02, 384.97it/s, env_step=35840, len=32, n/ep=2, n/st=64, player_1/loss=848.003, player_2/loss=778.580, rew=25.00]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #36: 1025it [00:02, 382.36it/s, env_step=36864, len=25, n/ep=2, n/st=64, player_1/loss=872.437, player_2/loss=806.721, rew=0.00]


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #37: 1025it [00:02, 381.78it/s, env_step=37888, len=35, n/ep=2, n/st=64, player_1/loss=700.294, player_2/loss=916.876, rew=0.00]


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #38: 1025it [00:02, 384.69it/s, env_step=38912, len=35, n/ep=2, n/st=64, player_1/loss=716.189, player_2/loss=846.824, rew=-25.00]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #39: 1025it [00:02, 382.07it/s, env_step=39936, len=40, n/ep=1, n/st=64, player_1/loss=778.374, player_2/loss=822.297, rew=-25.00]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #40: 1025it [00:02, 386.51it/s, env_step=40960, len=35, n/ep=2, n/st=64, player_1/loss=765.147, player_2/loss=733.491, rew=-25.00]


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #41: 1025it [00:02, 382.21it/s, env_step=41984, len=28, n/ep=2, n/st=64, player_1/loss=736.602, player_2/loss=664.284, rew=0.00]


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #42: 1025it [00:02, 384.98it/s, env_step=43008, len=38, n/ep=1, n/st=64, player_1/loss=937.901, player_2/loss=738.342, rew=-25.00]


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #43: 1025it [00:02, 381.08it/s, env_step=44032, len=38, n/ep=2, n/st=64, player_1/loss=900.382, player_2/loss=713.836, rew=0.00]


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #44: 1025it [00:02, 384.89it/s, env_step=45056, len=35, n/ep=2, n/st=64, player_1/loss=851.994, player_2/loss=797.051, rew=-25.00]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #45: 1025it [00:02, 378.25it/s, env_step=46080, len=38, n/ep=2, n/st=64, player_1/loss=661.190, player_2/loss=920.018, rew=-25.00]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #46: 1025it [00:02, 384.17it/s, env_step=47104, len=38, n/ep=2, n/st=64, player_1/loss=608.958, player_2/loss=767.993, rew=0.00]


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #47: 1025it [00:02, 385.08it/s, env_step=48128, len=35, n/ep=2, n/st=64, player_1/loss=684.541, player_2/loss=580.213, rew=25.00]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #48: 1025it [00:02, 380.79it/s, env_step=49152, len=39, n/ep=2, n/st=64, player_1/loss=760.440, player_2/loss=668.416, rew=-25.00]


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #49: 1025it [00:02, 384.10it/s, env_step=50176, len=31, n/ep=3, n/st=64, player_1/loss=662.435, player_2/loss=681.828, rew=8.33]


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #1: 1025it [00:02, 380.90it/s, env_step=1024, len=27, n/ep=2, n/st=64, player_2/loss=580.714, rew=25.00]         


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 381.94it/s, env_step=2048, len=38, n/ep=2, n/st=64, player_1/loss=682.825, player_2/loss=580.077, rew=0.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 383.73it/s, env_step=3072, len=24, n/ep=2, n/st=64, player_1/loss=935.866, player_2/loss=783.615, rew=25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 383.63it/s, env_step=4096, len=34, n/ep=2, n/st=64, player_1/loss=781.793, player_2/loss=740.326, rew=0.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 381.00it/s, env_step=5120, len=38, n/ep=2, n/st=64, player_1/loss=701.293, player_2/loss=650.887, rew=0.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 380.35it/s, env_step=6144, len=37, n/ep=2, n/st=64, player_1/loss=826.987, player_2/loss=683.389, rew=0.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 380.83it/s, env_step=7168, len=27, n/ep=2, n/st=64, player_1/loss=736.753, player_2/loss=701.230, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 383.47it/s, env_step=8192, len=34, n/ep=2, n/st=64, player_1/loss=595.083, player_2/loss=677.379, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 380.99it/s, env_step=9216, len=36, n/ep=2, n/st=64, player_1/loss=749.682, player_2/loss=584.685, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 381.73it/s, env_step=10240, len=28, n/ep=2, n/st=64, player_1/loss=838.998, player_2/loss=569.145, rew=0.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 388.50it/s, env_step=11264, len=27, n/ep=2, n/st=64, player_1/loss=809.417, player_2/loss=638.808, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 382.38it/s, env_step=12288, len=38, n/ep=2, n/st=64, player_1/loss=853.124, player_2/loss=760.617, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 386.16it/s, env_step=13312, len=38, n/ep=2, n/st=64, player_1/loss=812.859, player_2/loss=913.702, rew=0.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 372.36it/s, env_step=14336, len=37, n/ep=2, n/st=64, player_1/loss=907.788, player_2/loss=887.420, rew=0.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 373.02it/s, env_step=15360, len=30, n/ep=3, n/st=64, player_1/loss=746.830, player_2/loss=800.142, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 368.66it/s, env_step=16384, len=38, n/ep=2, n/st=64, player_1/loss=615.152, player_2/loss=689.178, rew=0.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 372.49it/s, env_step=17408, len=32, n/ep=2, n/st=64, player_1/loss=624.748, player_2/loss=603.929, rew=0.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 372.40it/s, env_step=18432, len=27, n/ep=3, n/st=64, player_1/loss=719.148, player_2/loss=689.324, rew=8.33]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 374.36it/s, env_step=19456, len=38, n/ep=1, n/st=64, player_1/loss=766.450, player_2/loss=641.677, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 371.86it/s, env_step=20480, len=34, n/ep=2, n/st=64, player_1/loss=877.606, player_2/loss=666.531, rew=0.00]


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 371.56it/s, env_step=21504, len=38, n/ep=1, n/st=64, player_1/loss=836.556, player_2/loss=720.917, rew=25.00]


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 374.63it/s, env_step=22528, len=33, n/ep=2, n/st=64, player_1/loss=748.189, player_2/loss=638.031, rew=0.00]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 372.22it/s, env_step=23552, len=40, n/ep=1, n/st=64, player_1/loss=744.544, player_2/loss=648.024, rew=25.00]


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 373.19it/s, env_step=24576, len=25, n/ep=3, n/st=64, player_1/loss=809.136, rew=8.33]        


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 372.47it/s, env_step=25600, len=37, n/ep=2, n/st=64, player_1/loss=869.733, player_2/loss=549.763, rew=25.00]


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 371.35it/s, env_step=26624, len=27, n/ep=3, n/st=64, player_1/loss=944.587, player_2/loss=573.330, rew=8.33]


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 373.38it/s, env_step=27648, len=38, n/ep=1, n/st=64, player_1/loss=936.683, player_2/loss=596.781, rew=25.00]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 371.66it/s, env_step=28672, len=36, n/ep=1, n/st=64, player_1/loss=974.938, player_2/loss=631.266, rew=25.00]


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 372.70it/s, env_step=29696, len=19, n/ep=3, n/st=64, player_1/loss=793.655, player_2/loss=547.223, rew=-8.33]


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 375.80it/s, env_step=30720, len=28, n/ep=2, n/st=64, player_1/loss=799.354, player_2/loss=567.914, rew=25.00]


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 373.13it/s, env_step=31744, len=36, n/ep=2, n/st=64, player_1/loss=897.282, player_2/loss=553.734, rew=25.00]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 371.96it/s, env_step=32768, len=35, n/ep=2, n/st=64, player_1/loss=903.897, player_2/loss=505.831, rew=25.00]


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 372.22it/s, env_step=33792, len=34, n/ep=2, n/st=64, player_1/loss=694.269, player_2/loss=443.076, rew=0.00]


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 370.53it/s, env_step=34816, len=38, n/ep=1, n/st=64, player_1/loss=670.976, player_2/loss=528.863, rew=25.00]


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 374.20it/s, env_step=35840, len=27, n/ep=2, n/st=64, player_1/loss=950.024, player_2/loss=632.096, rew=25.00]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 374.32it/s, env_step=36864, len=25, n/ep=3, n/st=64, player_1/loss=1128.689, player_2/loss=644.660, rew=25.00]


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 373.32it/s, env_step=37888, len=23, n/ep=2, n/st=64, player_1/loss=907.635, player_2/loss=598.399, rew=0.00]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 374.43it/s, env_step=38912, len=38, n/ep=2, n/st=64, player_1/loss=801.050, player_2/loss=635.027, rew=0.00]


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 377.73it/s, env_step=39936, len=38, n/ep=1, n/st=64, player_1/loss=798.423, player_2/loss=670.103, rew=25.00]


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 382.89it/s, env_step=40960, len=38, n/ep=2, n/st=64, player_1/loss=850.318, player_2/loss=600.764, rew=0.00]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 387.89it/s, env_step=41984, len=38, n/ep=2, n/st=64, player_1/loss=785.831, player_2/loss=578.287, rew=0.00]


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 389.14it/s, env_step=43008, len=38, n/ep=1, n/st=64, player_1/loss=701.871, player_2/loss=469.897, rew=25.00]


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 389.12it/s, env_step=44032, len=30, n/ep=3, n/st=64, player_1/loss=718.285, player_2/loss=458.495, rew=25.00]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 385.97it/s, env_step=45056, len=22, n/ep=2, n/st=64, player_1/loss=862.915, player_2/loss=539.627, rew=0.00]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 382.97it/s, env_step=46080, len=23, n/ep=2, n/st=64, player_1/loss=880.220, player_2/loss=533.179, rew=25.00]


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 384.20it/s, env_step=47104, len=28, n/ep=2, n/st=64, player_1/loss=810.133, player_2/loss=484.462, rew=0.00]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 382.98it/s, env_step=48128, len=24, n/ep=3, n/st=64, player_1/loss=770.100, player_2/loss=688.095, rew=-8.33]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 378.24it/s, env_step=49152, len=33, n/ep=2, n/st=64, player_1/loss=775.017, player_2/loss=713.641, rew=0.00]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 372.75it/s, env_step=50176, len=36, n/ep=2, n/st=64, player_1/loss=872.805, player_2/loss=617.891, rew=25.00]


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 369.18it/s, env_step=1024, len=27, n/ep=2, n/st=64, player_1/loss=600.194, player_2/loss=601.211, rew=-25.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 372.21it/s, env_step=2048, len=31, n/ep=3, n/st=64, player_1/loss=628.308, player_2/loss=546.765, rew=-25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 371.24it/s, env_step=3072, len=32, n/ep=2, n/st=64, player_1/loss=910.003, player_2/loss=563.779, rew=-25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 370.16it/s, env_step=4096, len=33, n/ep=2, n/st=64, player_1/loss=1017.218, player_2/loss=492.981, rew=-25.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 372.09it/s, env_step=5120, len=38, n/ep=1, n/st=64, player_1/loss=802.835, player_2/loss=464.349, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 372.19it/s, env_step=6144, len=28, n/ep=2, n/st=64, player_1/loss=759.208, player_2/loss=525.669, rew=25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 372.76it/s, env_step=7168, len=33, n/ep=2, n/st=64, player_1/loss=722.598, player_2/loss=493.269, rew=0.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 372.49it/s, env_step=8192, len=17, n/ep=2, n/st=64, player_1/loss=573.009, player_2/loss=510.081, rew=0.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 374.45it/s, env_step=9216, len=37, n/ep=2, n/st=64, player_1/loss=603.009, player_2/loss=528.056, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 374.83it/s, env_step=10240, len=33, n/ep=2, n/st=64, player_1/loss=674.703, player_2/loss=499.220, rew=0.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 373.01it/s, env_step=11264, len=36, n/ep=2, n/st=64, player_1/loss=691.789, player_2/loss=499.310, rew=-25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 373.79it/s, env_step=12288, len=37, n/ep=2, n/st=64, player_1/loss=799.290, player_2/loss=556.537, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 371.62it/s, env_step=13312, len=27, n/ep=2, n/st=64, player_1/loss=1102.786, player_2/loss=728.483, rew=0.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 373.67it/s, env_step=14336, len=31, n/ep=3, n/st=64, player_1/loss=895.069, player_2/loss=673.624, rew=8.33]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 371.17it/s, env_step=15360, len=35, n/ep=2, n/st=64, player_1/loss=756.960, player_2/loss=657.045, rew=0.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 371.69it/s, env_step=16384, len=34, n/ep=2, n/st=64, player_1/loss=655.457, player_2/loss=625.067, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 371.23it/s, env_step=17408, len=28, n/ep=3, n/st=64, player_1/loss=707.801, player_2/loss=555.229, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 370.28it/s, env_step=18432, len=32, n/ep=1, n/st=64, player_1/loss=602.667, player_2/loss=409.885, rew=-25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 373.31it/s, env_step=19456, len=20, n/ep=3, n/st=64, player_1/loss=441.394, player_2/loss=481.030, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 372.60it/s, env_step=20480, len=39, n/ep=1, n/st=64, player_1/loss=651.821, player_2/loss=502.061, rew=25.00]


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 374.08it/s, env_step=21504, len=21, n/ep=3, n/st=64, player_1/loss=917.143, player_2/loss=564.983, rew=-25.00]


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 370.98it/s, env_step=22528, len=28, n/ep=2, n/st=64, player_1/loss=859.625, player_2/loss=589.368, rew=-25.00]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 370.60it/s, env_step=23552, len=38, n/ep=1, n/st=64, player_1/loss=642.784, player_2/loss=563.619, rew=-25.00]


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 374.93it/s, env_step=24576, len=33, n/ep=2, n/st=64, player_1/loss=763.828, player_2/loss=621.013, rew=-25.00]


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 372.43it/s, env_step=25600, len=28, n/ep=2, n/st=64, player_1/loss=835.498, player_2/loss=602.662, rew=-25.00]


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 374.48it/s, env_step=26624, len=35, n/ep=2, n/st=64, player_1/loss=726.035, player_2/loss=521.080, rew=-25.00]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 370.42it/s, env_step=27648, len=33, n/ep=2, n/st=64, player_1/loss=609.390, player_2/loss=566.687, rew=-25.00]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 371.49it/s, env_step=28672, len=30, n/ep=3, n/st=64, player_1/loss=670.458, player_2/loss=477.645, rew=-8.33]


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 373.13it/s, env_step=29696, len=19, n/ep=2, n/st=64, player_1/loss=689.944, player_2/loss=575.341, rew=-25.00]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 374.55it/s, env_step=30720, len=29, n/ep=3, n/st=64, player_1/loss=681.080, player_2/loss=606.906, rew=-25.00]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 373.47it/s, env_step=31744, len=38, n/ep=1, n/st=64, player_1/loss=633.621, player_2/loss=469.077, rew=-25.00]


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 367.26it/s, env_step=32768, len=20, n/ep=3, n/st=64, player_1/loss=566.807, player_2/loss=510.457, rew=-25.00]


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 369.86it/s, env_step=33792, len=25, n/ep=2, n/st=64, player_1/loss=552.542, player_2/loss=534.587, rew=0.00]


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 372.12it/s, env_step=34816, len=23, n/ep=2, n/st=64, player_1/loss=532.446, player_2/loss=474.294, rew=0.00]


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 374.19it/s, env_step=35840, len=35, n/ep=2, n/st=64, player_1/loss=590.189, player_2/loss=505.775, rew=-25.00]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 372.49it/s, env_step=36864, len=32, n/ep=2, n/st=64, player_1/loss=648.195, player_2/loss=563.354, rew=0.00]


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 372.25it/s, env_step=37888, len=30, n/ep=2, n/st=64, player_1/loss=482.737, player_2/loss=555.015, rew=0.00]


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 372.31it/s, env_step=38912, len=29, n/ep=2, n/st=64, player_1/loss=462.690, player_2/loss=491.120, rew=25.00]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 369.91it/s, env_step=39936, len=31, n/ep=2, n/st=64, player_1/loss=553.018, player_2/loss=524.693, rew=0.00]


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 372.78it/s, env_step=40960, len=39, n/ep=1, n/st=64, player_1/loss=568.222, player_2/loss=582.428, rew=25.00]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 371.04it/s, env_step=41984, len=35, n/ep=2, n/st=64, player_1/loss=677.262, player_2/loss=585.205, rew=-25.00]


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 370.22it/s, env_step=43008, len=36, n/ep=2, n/st=64, player_1/loss=660.468, player_2/loss=616.825, rew=0.00]


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 369.30it/s, env_step=44032, len=37, n/ep=2, n/st=64, player_1/loss=616.467, player_2/loss=586.234, rew=-25.00]


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 372.24it/s, env_step=45056, len=27, n/ep=3, n/st=64, player_1/loss=662.894, player_2/loss=465.345, rew=8.33]


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 371.51it/s, env_step=46080, len=26, n/ep=2, n/st=64, player_1/loss=518.602, player_2/loss=384.988, rew=0.00]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 372.41it/s, env_step=47104, len=20, n/ep=2, n/st=64, player_1/loss=489.809, player_2/loss=462.534, rew=0.00]


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 370.76it/s, env_step=48128, len=32, n/ep=2, n/st=64, player_1/loss=514.804, player_2/loss=502.686, rew=-25.00]


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 373.33it/s, env_step=49152, len=23, n/ep=3, n/st=64, player_1/loss=541.009, player_2/loss=486.738, rew=-8.33]


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 375.99it/s, env_step=50176, len=36, n/ep=2, n/st=64, player_1/loss=535.242, player_2/loss=396.557, rew=-25.00]


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 368.15it/s, env_step=1024, len=35, n/ep=2, n/st=64, player_1/loss=1153.816, player_2/loss=400.958, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 372.11it/s, env_step=2048, len=20, n/ep=3, n/st=64, player_1/loss=850.797, player_2/loss=390.594, rew=8.33]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 371.55it/s, env_step=3072, len=28, n/ep=2, n/st=64, player_1/loss=576.619, player_2/loss=367.711, rew=25.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 373.00it/s, env_step=4096, len=29, n/ep=2, n/st=64, player_1/loss=434.820, player_2/loss=373.065, rew=-25.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 367.67it/s, env_step=5120, len=22, n/ep=3, n/st=64, player_1/loss=528.361, player_2/loss=426.465, rew=25.00]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 371.68it/s, env_step=6144, len=29, n/ep=2, n/st=64, player_1/loss=533.524, player_2/loss=405.469, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 370.85it/s, env_step=7168, len=36, n/ep=1, n/st=64, player_1/loss=576.171, player_2/loss=541.066, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 370.26it/s, env_step=8192, len=25, n/ep=3, n/st=64, player_1/loss=573.324, player_2/loss=618.432, rew=8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 372.36it/s, env_step=9216, len=33, n/ep=2, n/st=64, player_1/loss=627.321, player_2/loss=537.755, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 370.36it/s, env_step=10240, len=27, n/ep=2, n/st=64, player_1/loss=520.851, player_2/loss=475.264, rew=0.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 369.51it/s, env_step=11264, len=32, n/ep=3, n/st=64, player_1/loss=472.062, player_2/loss=471.038, rew=25.00]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 371.43it/s, env_step=12288, len=29, n/ep=3, n/st=64, player_1/loss=644.021, player_2/loss=561.412, rew=8.33]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 372.03it/s, env_step=13312, len=32, n/ep=2, n/st=64, player_1/loss=598.353, player_2/loss=601.956, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 371.37it/s, env_step=14336, len=23, n/ep=2, n/st=64, player_1/loss=631.907, player_2/loss=665.133, rew=0.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 374.17it/s, env_step=15360, len=29, n/ep=2, n/st=64, player_1/loss=696.435, player_2/loss=666.089, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 370.73it/s, env_step=16384, len=29, n/ep=2, n/st=64, player_1/loss=689.463, player_2/loss=471.325, rew=0.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 365.50it/s, env_step=17408, len=22, n/ep=2, n/st=64, player_1/loss=730.006, player_2/loss=502.235, rew=0.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 370.21it/s, env_step=18432, len=30, n/ep=2, n/st=64, player_1/loss=488.048, player_2/loss=595.542, rew=0.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 371.17it/s, env_step=19456, len=29, n/ep=3, n/st=64, player_1/loss=410.689, rew=25.00]       


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 371.17it/s, env_step=20480, len=22, n/ep=2, n/st=64, player_1/loss=627.704, player_2/loss=759.634, rew=25.00]


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 369.35it/s, env_step=21504, len=27, n/ep=2, n/st=64, player_1/loss=746.644, player_2/loss=537.934, rew=-25.00]


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 371.48it/s, env_step=22528, len=28, n/ep=2, n/st=64, player_1/loss=616.193, player_2/loss=502.822, rew=25.00]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 371.55it/s, env_step=23552, len=34, n/ep=2, n/st=64, player_1/loss=644.326, player_2/loss=530.218, rew=25.00]


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 370.27it/s, env_step=24576, len=27, n/ep=2, n/st=64, player_1/loss=481.312, player_2/loss=541.458, rew=-25.00]


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 372.00it/s, env_step=25600, len=28, n/ep=2, n/st=64, player_1/loss=396.554, player_2/loss=488.613, rew=0.00]


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 365.41it/s, env_step=26624, len=22, n/ep=3, n/st=64, player_1/loss=634.138, player_2/loss=440.601, rew=8.33]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 386.01it/s, env_step=27648, len=27, n/ep=3, n/st=64, player_1/loss=544.391, player_2/loss=389.060, rew=-8.33]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 373.68it/s, env_step=28672, len=30, n/ep=2, n/st=64, player_1/loss=457.491, player_2/loss=404.570, rew=25.00]


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 371.82it/s, env_step=29696, len=37, n/ep=2, n/st=64, player_1/loss=452.430, player_2/loss=455.582, rew=25.00]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 373.23it/s, env_step=30720, len=31, n/ep=2, n/st=64, player_1/loss=444.932, player_2/loss=472.497, rew=25.00]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 372.97it/s, env_step=31744, len=28, n/ep=2, n/st=64, player_1/loss=444.066, player_2/loss=400.430, rew=25.00]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 371.05it/s, env_step=32768, len=31, n/ep=2, n/st=64, player_1/loss=522.277, player_2/loss=536.537, rew=25.00]


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 372.04it/s, env_step=33792, len=32, n/ep=2, n/st=64, player_1/loss=472.868, player_2/loss=638.814, rew=0.00]


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 370.49it/s, env_step=34816, len=27, n/ep=2, n/st=64, player_1/loss=410.910, player_2/loss=428.418, rew=25.00]


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 369.58it/s, env_step=35840, len=32, n/ep=2, n/st=64, player_1/loss=479.349, player_2/loss=350.204, rew=0.00]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 366.79it/s, env_step=36864, len=28, n/ep=2, n/st=64, player_1/loss=483.373, player_2/loss=460.840, rew=0.00]


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 372.05it/s, env_step=37888, len=24, n/ep=3, n/st=64, player_1/loss=468.431, player_2/loss=507.456, rew=-8.33]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 373.22it/s, env_step=38912, len=24, n/ep=3, n/st=64, player_1/loss=442.678, player_2/loss=623.196, rew=-8.33]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 370.44it/s, env_step=39936, len=22, n/ep=3, n/st=64, player_1/loss=434.151, player_2/loss=589.660, rew=8.33]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 360.23it/s, env_step=40960, len=30, n/ep=2, n/st=64, player_1/loss=423.595, player_2/loss=613.224, rew=25.00]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 370.86it/s, env_step=41984, len=20, n/ep=3, n/st=64, player_1/loss=486.397, player_2/loss=618.694, rew=-8.33]


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 372.76it/s, env_step=43008, len=34, n/ep=2, n/st=64, player_1/loss=440.605, player_2/loss=499.069, rew=25.00]


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 372.51it/s, env_step=44032, len=23, n/ep=4, n/st=64, player_1/loss=275.045, player_2/loss=563.557, rew=0.00]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 369.68it/s, env_step=45056, len=25, n/ep=2, n/st=64, player_1/loss=332.452, player_2/loss=539.511, rew=-25.00]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 374.04it/s, env_step=46080, len=17, n/ep=4, n/st=64, player_1/loss=405.078, player_2/loss=474.086, rew=-12.50]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 372.55it/s, env_step=47104, len=26, n/ep=3, n/st=64, player_1/loss=380.807, player_2/loss=425.108, rew=-8.33]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 366.93it/s, env_step=48128, len=20, n/ep=4, n/st=64, player_1/loss=536.432, player_2/loss=471.203, rew=0.00]


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 372.99it/s, env_step=49152, len=20, n/ep=3, n/st=64, player_1/loss=510.424, player_2/loss=447.726, rew=-8.33]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 371.13it/s, env_step=50176, len=23, n/ep=3, n/st=64, player_1/loss=335.009, player_2/loss=471.095, rew=-8.33]


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 366.51it/s, env_step=1024, len=14, n/ep=5, n/st=64, player_1/loss=680.497, player_2/loss=489.480, rew=15.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 370.78it/s, env_step=2048, len=23, n/ep=3, n/st=64, player_1/loss=507.252, player_2/loss=418.360, rew=-8.33]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 367.39it/s, env_step=3072, len=22, n/ep=3, n/st=64, player_1/loss=425.808, player_2/loss=484.444, rew=8.33]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 369.55it/s, env_step=4096, len=27, n/ep=3, n/st=64, player_1/loss=456.927, player_2/loss=445.355, rew=-8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 370.74it/s, env_step=5120, len=18, n/ep=4, n/st=64, player_1/loss=400.289, player_2/loss=389.124, rew=12.50]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 364.74it/s, env_step=6144, len=22, n/ep=2, n/st=64, player_1/loss=434.061, player_2/loss=398.710, rew=0.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 367.36it/s, env_step=7168, len=17, n/ep=3, n/st=64, player_1/loss=666.144, player_2/loss=384.800, rew=8.33]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 369.89it/s, env_step=8192, len=21, n/ep=3, n/st=64, player_1/loss=544.602, player_2/loss=428.518, rew=-8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 368.45it/s, env_step=9216, len=17, n/ep=5, n/st=64, player_1/loss=582.652, player_2/loss=450.051, rew=15.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 371.16it/s, env_step=10240, len=20, n/ep=3, n/st=64, player_1/loss=535.647, player_2/loss=460.629, rew=8.33]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 373.53it/s, env_step=11264, len=28, n/ep=2, n/st=64, player_1/loss=393.244, player_2/loss=512.309, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 371.71it/s, env_step=12288, len=19, n/ep=3, n/st=64, player_1/loss=454.369, player_2/loss=395.593, rew=8.33]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 369.88it/s, env_step=13312, len=25, n/ep=3, n/st=64, player_1/loss=550.457, player_2/loss=317.155, rew=-8.33]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 372.89it/s, env_step=14336, len=21, n/ep=3, n/st=64, player_1/loss=529.292, player_2/loss=336.522, rew=-8.33]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 367.63it/s, env_step=15360, len=18, n/ep=3, n/st=64, player_1/loss=378.454, player_2/loss=370.929, rew=8.33]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 378.00it/s, env_step=16384, len=24, n/ep=3, n/st=64, player_1/loss=488.970, player_2/loss=375.362, rew=-8.33]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 376.48it/s, env_step=17408, len=19, n/ep=3, n/st=64, player_1/loss=375.819, player_2/loss=355.048, rew=8.33]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 366.88it/s, env_step=18432, len=26, n/ep=2, n/st=64, player_1/loss=461.191, player_2/loss=434.434, rew=0.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 373.05it/s, env_step=19456, len=17, n/ep=3, n/st=64, player_1/loss=519.691, player_2/loss=487.558, rew=8.33]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 382.13it/s, env_step=20480, len=21, n/ep=2, n/st=64, player_1/loss=506.686, player_2/loss=519.735, rew=0.00]


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 383.28it/s, env_step=21504, len=18, n/ep=4, n/st=64, player_1/loss=559.726, player_2/loss=530.709, rew=12.50]


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 377.35it/s, env_step=22528, len=22, n/ep=2, n/st=64, player_1/loss=416.362, player_2/loss=485.517, rew=0.00]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 378.90it/s, env_step=23552, len=19, n/ep=3, n/st=64, player_1/loss=302.677, player_2/loss=601.667, rew=8.33]


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 379.38it/s, env_step=24576, len=16, n/ep=4, n/st=64, player_1/loss=425.173, player_2/loss=649.530, rew=25.00]


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 371.88it/s, env_step=25600, len=24, n/ep=3, n/st=64, player_1/loss=399.666, player_2/loss=673.366, rew=-25.00]


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 376.10it/s, env_step=26624, len=19, n/ep=3, n/st=64, player_1/loss=422.087, player_2/loss=460.965, rew=-8.33]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 360.43it/s, env_step=27648, len=16, n/ep=3, n/st=64, player_1/loss=534.254, player_2/loss=517.304, rew=8.33]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 359.58it/s, env_step=28672, len=22, n/ep=3, n/st=64, player_1/loss=470.686, player_2/loss=440.694, rew=-8.33]


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 365.63it/s, env_step=29696, len=24, n/ep=2, n/st=64, player_1/loss=407.204, player_2/loss=308.178, rew=-25.00]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 362.54it/s, env_step=30720, len=27, n/ep=3, n/st=64, player_1/loss=389.749, player_2/loss=392.542, rew=25.00]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 365.56it/s, env_step=31744, len=24, n/ep=2, n/st=64, player_1/loss=422.165, player_2/loss=489.550, rew=0.00]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 364.08it/s, env_step=32768, len=23, n/ep=3, n/st=64, player_1/loss=350.362, player_2/loss=444.554, rew=-8.33]


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 364.95it/s, env_step=33792, len=22, n/ep=3, n/st=64, player_1/loss=297.767, player_2/loss=372.441, rew=-8.33]


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 366.01it/s, env_step=34816, len=20, n/ep=3, n/st=64, player_1/loss=404.961, player_2/loss=441.296, rew=-8.33]


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 366.69it/s, env_step=35840, len=23, n/ep=3, n/st=64, player_1/loss=443.873, player_2/loss=421.993, rew=8.33]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 366.66it/s, env_step=36864, len=26, n/ep=2, n/st=64, player_1/loss=366.044, player_2/loss=406.116, rew=0.00]


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 368.12it/s, env_step=37888, len=21, n/ep=3, n/st=64, player_1/loss=362.058, player_2/loss=481.191, rew=8.33]


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 365.72it/s, env_step=38912, len=27, n/ep=2, n/st=64, player_1/loss=278.904, player_2/loss=647.697, rew=0.00]


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 365.34it/s, env_step=39936, len=28, n/ep=2, n/st=64, player_1/loss=388.026, player_2/loss=655.603, rew=-25.00]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 365.67it/s, env_step=40960, len=19, n/ep=2, n/st=64, player_1/loss=453.623, player_2/loss=638.344, rew=0.00]


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 365.77it/s, env_step=41984, len=22, n/ep=4, n/st=64, player_1/loss=349.053, player_2/loss=427.271, rew=-12.50]


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 366.88it/s, env_step=43008, len=27, n/ep=3, n/st=64, player_1/loss=328.984, player_2/loss=328.281, rew=-25.00]


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 367.00it/s, env_step=44032, len=24, n/ep=3, n/st=64, player_1/loss=454.741, player_2/loss=365.902, rew=8.33]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 364.60it/s, env_step=45056, len=19, n/ep=4, n/st=64, player_1/loss=543.021, player_2/loss=373.755, rew=0.00]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 363.86it/s, env_step=46080, len=20, n/ep=2, n/st=64, player_1/loss=464.290, player_2/loss=459.117, rew=0.00]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 364.77it/s, env_step=47104, len=19, n/ep=3, n/st=64, player_1/loss=445.006, player_2/loss=402.699, rew=-25.00]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 366.65it/s, env_step=48128, len=28, n/ep=3, n/st=64, player_1/loss=486.242, player_2/loss=333.251, rew=-25.00]


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 365.65it/s, env_step=49152, len=24, n/ep=2, n/st=64, player_1/loss=419.125, player_2/loss=385.236, rew=25.00]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 362.72it/s, env_step=50176, len=30, n/ep=2, n/st=64, player_1/loss=414.111, player_2/loss=412.339, rew=-25.00]


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 361.32it/s, env_step=1024, len=29, n/ep=2, n/st=64, player_1/loss=483.079, player_2/loss=249.426, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 365.97it/s, env_step=2048, len=26, n/ep=3, n/st=64, player_1/loss=476.803, player_2/loss=289.291, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 367.37it/s, env_step=3072, len=20, n/ep=2, n/st=64, player_1/loss=520.188, player_2/loss=346.861, rew=0.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 365.74it/s, env_step=4096, len=21, n/ep=3, n/st=64, player_1/loss=457.960, player_2/loss=422.383, rew=25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 363.02it/s, env_step=5120, len=24, n/ep=3, n/st=64, player_1/loss=332.158, player_2/loss=470.563, rew=-8.33]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 361.62it/s, env_step=6144, len=27, n/ep=3, n/st=64, player_1/loss=511.670, player_2/loss=413.617, rew=8.33]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 365.47it/s, env_step=7168, len=21, n/ep=2, n/st=64, player_1/loss=566.527, player_2/loss=354.781, rew=0.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 366.86it/s, env_step=8192, len=26, n/ep=3, n/st=64, player_1/loss=431.937, player_2/loss=345.796, rew=8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 359.27it/s, env_step=9216, len=25, n/ep=3, n/st=64, player_1/loss=409.244, player_2/loss=381.658, rew=-8.33]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 367.44it/s, env_step=10240, len=27, n/ep=2, n/st=64, player_1/loss=517.172, player_2/loss=450.699, rew=25.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 368.01it/s, env_step=11264, len=29, n/ep=2, n/st=64, player_1/loss=589.381, player_2/loss=442.812, rew=0.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 362.29it/s, env_step=12288, len=24, n/ep=3, n/st=64, player_1/loss=534.029, player_2/loss=287.299, rew=25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 374.28it/s, env_step=13312, len=29, n/ep=3, n/st=64, player_1/loss=379.825, player_2/loss=333.024, rew=8.33]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 379.70it/s, env_step=14336, len=19, n/ep=4, n/st=64, player_1/loss=468.996, player_2/loss=337.205, rew=25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 367.75it/s, env_step=15360, len=28, n/ep=3, n/st=64, player_1/loss=641.425, player_2/loss=466.651, rew=8.33]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 364.36it/s, env_step=16384, len=22, n/ep=3, n/st=64, player_1/loss=558.172, player_2/loss=463.066, rew=8.33]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 364.04it/s, env_step=17408, len=20, n/ep=3, n/st=64, player_1/loss=528.326, player_2/loss=308.042, rew=8.33]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 366.30it/s, env_step=18432, len=37, n/ep=2, n/st=64, player_1/loss=575.059, player_2/loss=313.906, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 358.70it/s, env_step=19456, len=26, n/ep=3, n/st=64, player_1/loss=465.836, player_2/loss=268.537, rew=8.33]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 365.22it/s, env_step=20480, len=22, n/ep=3, n/st=64, player_1/loss=386.927, player_2/loss=288.102, rew=-8.33]


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 364.85it/s, env_step=21504, len=24, n/ep=3, n/st=64, player_1/loss=352.486, player_2/loss=335.696, rew=8.33]


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 363.23it/s, env_step=22528, len=19, n/ep=3, n/st=64, player_1/loss=320.115, player_2/loss=379.332, rew=8.33]


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 368.11it/s, env_step=23552, len=27, n/ep=3, n/st=64, player_1/loss=405.743, player_2/loss=446.601, rew=8.33]


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 366.85it/s, env_step=24576, len=22, n/ep=3, n/st=64, player_1/loss=427.361, player_2/loss=460.903, rew=25.00]


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 365.78it/s, env_step=25600, len=26, n/ep=3, n/st=64, player_1/loss=458.406, player_2/loss=417.180, rew=8.33]


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 365.49it/s, env_step=26624, len=29, n/ep=3, n/st=64, player_1/loss=488.981, player_2/loss=390.610, rew=25.00]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 368.07it/s, env_step=27648, len=24, n/ep=2, n/st=64, player_1/loss=455.508, player_2/loss=345.495, rew=0.00]


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 365.07it/s, env_step=28672, len=20, n/ep=3, n/st=64, player_1/loss=441.579, player_2/loss=324.091, rew=-8.33]


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 366.13it/s, env_step=29696, len=22, n/ep=2, n/st=64, player_1/loss=495.203, player_2/loss=327.048, rew=0.00]


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 367.05it/s, env_step=30720, len=19, n/ep=3, n/st=64, player_1/loss=505.264, player_2/loss=283.458, rew=-25.00]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 364.24it/s, env_step=31744, len=24, n/ep=3, n/st=64, player_1/loss=445.281, player_2/loss=286.878, rew=-8.33]


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 360.59it/s, env_step=32768, len=24, n/ep=2, n/st=64, player_1/loss=318.652, player_2/loss=408.075, rew=-25.00]


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 363.36it/s, env_step=33792, len=24, n/ep=3, n/st=64, player_1/loss=244.339, player_2/loss=454.358, rew=25.00]


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 368.38it/s, env_step=34816, len=21, n/ep=4, n/st=64, player_1/loss=271.457, player_2/loss=314.742, rew=-12.50]


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 365.17it/s, env_step=35840, len=25, n/ep=2, n/st=64, player_1/loss=441.852, player_2/loss=315.710, rew=25.00]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 356.67it/s, env_step=36864, len=24, n/ep=3, n/st=64, player_1/loss=399.995, player_2/loss=323.109, rew=8.33]


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 365.38it/s, env_step=37888, len=24, n/ep=3, n/st=64, player_1/loss=337.496, player_2/loss=359.211, rew=-8.33]


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 365.50it/s, env_step=38912, len=21, n/ep=3, n/st=64, player_1/loss=401.602, player_2/loss=420.215, rew=-8.33]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 365.63it/s, env_step=39936, len=22, n/ep=3, n/st=64, player_1/loss=397.187, player_2/loss=382.155, rew=-8.33]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 366.12it/s, env_step=40960, len=22, n/ep=3, n/st=64, player_1/loss=354.062, player_2/loss=333.805, rew=8.33]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 367.93it/s, env_step=41984, len=28, n/ep=2, n/st=64, player_1/loss=388.981, player_2/loss=289.424, rew=25.00]


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 365.08it/s, env_step=43008, len=23, n/ep=3, n/st=64, player_1/loss=373.131, player_2/loss=255.132, rew=25.00]


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 361.34it/s, env_step=44032, len=19, n/ep=3, n/st=64, player_1/loss=439.185, player_2/loss=318.772, rew=-8.33]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 360.23it/s, env_step=45056, len=21, n/ep=3, n/st=64, player_1/loss=483.444, player_2/loss=314.844, rew=8.33]


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 365.02it/s, env_step=46080, len=24, n/ep=2, n/st=64, player_1/loss=320.847, player_2/loss=263.676, rew=25.00]


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 367.77it/s, env_step=47104, len=21, n/ep=2, n/st=64, player_1/loss=298.746, player_2/loss=263.308, rew=0.00]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 363.98it/s, env_step=48128, len=21, n/ep=3, n/st=64, player_1/loss=460.466, player_2/loss=262.763, rew=8.33]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 367.39it/s, env_step=49152, len=28, n/ep=2, n/st=64, player_1/loss=399.199, player_2/loss=298.536, rew=25.00]


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 366.97it/s, env_step=50176, len=24, n/ep=3, n/st=64, player_1/loss=344.684, player_2/loss=345.916, rew=-8.33]


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 363.29it/s, env_step=1024, len=19, n/ep=2, n/st=64, player_1/loss=383.487, player_2/loss=328.073, rew=-25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 367.75it/s, env_step=2048, len=19, n/ep=4, n/st=64, player_1/loss=358.760, player_2/loss=282.149, rew=-12.50]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 364.94it/s, env_step=3072, len=17, n/ep=3, n/st=64, player_1/loss=396.987, player_2/loss=285.004, rew=25.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 367.15it/s, env_step=4096, len=21, n/ep=3, n/st=64, player_1/loss=419.393, player_2/loss=241.967, rew=-8.33]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 374.30it/s, env_step=5120, len=24, n/ep=2, n/st=64, player_1/loss=308.332, player_2/loss=209.016, rew=0.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 443.08it/s, env_step=6144, len=19, n/ep=3, n/st=64, player_1/loss=334.960, player_2/loss=378.247, rew=-8.33]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 417.74it/s, env_step=7168, len=28, n/ep=3, n/st=64, player_1/loss=346.308, player_2/loss=482.223, rew=8.33]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 404.08it/s, env_step=8192, len=21, n/ep=3, n/st=64, player_1/loss=272.119, player_2/loss=360.540, rew=8.33]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 390.56it/s, env_step=9216, len=19, n/ep=3, n/st=64, player_1/loss=281.722, player_2/loss=257.824, rew=-8.33]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 380.51it/s, env_step=10240, len=23, n/ep=3, n/st=64, player_1/loss=234.356, player_2/loss=240.921, rew=-8.33]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 366.38it/s, env_step=11264, len=34, n/ep=2, n/st=64, player_1/loss=281.868, player_2/loss=284.313, rew=-25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 367.93it/s, env_step=12288, len=24, n/ep=3, n/st=64, player_1/loss=385.426, player_2/loss=322.394, rew=8.33]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 367.58it/s, env_step=13312, len=19, n/ep=3, n/st=64, player_1/loss=333.346, player_2/loss=383.945, rew=-25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 358.42it/s, env_step=14336, len=21, n/ep=3, n/st=64, player_1/loss=270.753, player_2/loss=365.373, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 366.44it/s, env_step=15360, len=19, n/ep=4, n/st=64, player_1/loss=304.276, player_2/loss=267.468, rew=-12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 367.62it/s, env_step=16384, len=21, n/ep=3, n/st=64, player_1/loss=344.510, player_2/loss=320.622, rew=8.33]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 361.49it/s, env_step=17408, len=22, n/ep=3, n/st=64, player_1/loss=444.000, player_2/loss=349.908, rew=-8.33]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 366.89it/s, env_step=18432, len=20, n/ep=3, n/st=64, player_1/loss=377.412, player_2/loss=342.608, rew=8.33]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 365.62it/s, env_step=19456, len=20, n/ep=3, n/st=64, player_1/loss=363.055, player_2/loss=286.696, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 366.83it/s, env_step=20480, len=23, n/ep=2, n/st=64, player_1/loss=456.093, player_2/loss=278.533, rew=0.00]


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 367.21it/s, env_step=21504, len=21, n/ep=3, n/st=64, player_1/loss=355.194, player_2/loss=394.053, rew=-8.33]


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 364.36it/s, env_step=22528, len=28, n/ep=2, n/st=64, player_1/loss=282.803, player_2/loss=406.633, rew=-25.00]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 366.25it/s, env_step=23552, len=24, n/ep=3, n/st=64, player_1/loss=401.437, player_2/loss=366.922, rew=-8.33]


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 365.04it/s, env_step=24576, len=16, n/ep=3, n/st=64, player_1/loss=463.049, player_2/loss=364.546, rew=8.33]


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 365.97it/s, env_step=25600, len=19, n/ep=3, n/st=64, player_1/loss=364.682, player_2/loss=335.741, rew=-8.33]


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 364.18it/s, env_step=26624, len=20, n/ep=4, n/st=64, player_1/loss=492.456, player_2/loss=357.838, rew=0.00]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 364.49it/s, env_step=27648, len=23, n/ep=3, n/st=64, player_1/loss=464.293, player_2/loss=372.470, rew=-8.33]


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 366.47it/s, env_step=28672, len=21, n/ep=3, n/st=64, player_1/loss=334.134, player_2/loss=314.478, rew=-8.33]


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 364.79it/s, env_step=29696, len=23, n/ep=3, n/st=64, player_1/loss=295.274, player_2/loss=365.497, rew=-8.33]


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 365.52it/s, env_step=30720, len=20, n/ep=3, n/st=64, player_1/loss=304.674, player_2/loss=433.112, rew=-25.00]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 365.75it/s, env_step=31744, len=21, n/ep=3, n/st=64, player_1/loss=286.841, player_2/loss=343.675, rew=-8.33]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 367.25it/s, env_step=32768, len=22, n/ep=3, n/st=64, player_1/loss=218.472, player_2/loss=310.516, rew=-25.00]


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 367.95it/s, env_step=33792, len=19, n/ep=3, n/st=64, player_1/loss=243.838, player_2/loss=293.066, rew=8.33]


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 364.41it/s, env_step=34816, len=18, n/ep=3, n/st=64, player_1/loss=298.622, player_2/loss=254.989, rew=8.33]


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 365.26it/s, env_step=35840, len=19, n/ep=3, n/st=64, player_1/loss=316.161, player_2/loss=215.035, rew=-8.33]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 366.42it/s, env_step=36864, len=19, n/ep=3, n/st=64, player_1/loss=313.236, player_2/loss=238.971, rew=-8.33]


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 364.27it/s, env_step=37888, len=25, n/ep=3, n/st=64, player_1/loss=283.067, player_2/loss=312.590, rew=-8.33]


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 361.93it/s, env_step=38912, len=21, n/ep=3, n/st=64, player_1/loss=273.595, player_2/loss=285.243, rew=-25.00]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 365.45it/s, env_step=39936, len=19, n/ep=4, n/st=64, player_1/loss=420.188, player_2/loss=300.265, rew=-12.50]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 364.88it/s, env_step=40960, len=18, n/ep=4, n/st=64, player_1/loss=407.167, player_2/loss=361.396, rew=-25.00]


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 365.82it/s, env_step=41984, len=32, n/ep=2, n/st=64, player_1/loss=407.206, player_2/loss=248.372, rew=0.00]


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 366.36it/s, env_step=43008, len=17, n/ep=4, n/st=64, player_1/loss=425.506, player_2/loss=203.519, rew=-25.00]


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 366.26it/s, env_step=44032, len=21, n/ep=2, n/st=64, player_1/loss=369.813, player_2/loss=242.195, rew=-25.00]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 366.07it/s, env_step=45056, len=19, n/ep=3, n/st=64, player_1/loss=338.863, player_2/loss=227.470, rew=-25.00]


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 365.20it/s, env_step=46080, len=16, n/ep=4, n/st=64, player_1/loss=355.522, player_2/loss=251.273, rew=-25.00]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 365.60it/s, env_step=47104, len=21, n/ep=3, n/st=64, player_1/loss=358.169, player_2/loss=255.937, rew=-8.33]


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 366.65it/s, env_step=48128, len=18, n/ep=4, n/st=64, player_1/loss=485.429, player_2/loss=204.369, rew=-25.00]


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:03, 325.00it/s, env_step=49152, len=17, n/ep=3, n/st=64, player_1/loss=475.155, player_2/loss=213.095, rew=-25.00]


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 374.74it/s, env_step=50176, len=16, n/ep=4, n/st=64, player_1/loss=349.676, player_2/loss=279.191, rew=-25.00]


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 404.68it/s, env_step=1024, len=16, n/ep=4, n/st=64, player_1/loss=470.404, player_2/loss=161.103, rew=25.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 433.58it/s, env_step=2048, len=18, n/ep=4, n/st=64, player_1/loss=444.635, player_2/loss=222.240, rew=12.50]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 426.86it/s, env_step=3072, len=22, n/ep=2, n/st=64, player_1/loss=327.457, player_2/loss=289.555, rew=0.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 431.97it/s, env_step=4096, len=20, n/ep=3, n/st=64, player_1/loss=319.690, player_2/loss=299.049, rew=25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 438.02it/s, env_step=5120, len=18, n/ep=3, n/st=64, player_1/loss=378.259, player_2/loss=272.030, rew=8.33]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 416.96it/s, env_step=6144, len=17, n/ep=4, n/st=64, player_1/loss=340.624, player_2/loss=236.871, rew=12.50]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 406.09it/s, env_step=7168, len=20, n/ep=3, n/st=64, player_1/loss=302.243, player_2/loss=234.730, rew=25.00]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 392.52it/s, env_step=8192, len=14, n/ep=4, n/st=64, player_1/loss=336.229, player_2/loss=249.647, rew=12.50]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 379.37it/s, env_step=9216, len=24, n/ep=3, n/st=64, player_1/loss=349.088, player_2/loss=201.435, rew=-8.33]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 377.24it/s, env_step=10240, len=18, n/ep=2, n/st=64, player_1/loss=297.975, player_2/loss=245.396, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 363.67it/s, env_step=11264, len=16, n/ep=4, n/st=64, player_1/loss=353.701, player_2/loss=279.566, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 377.22it/s, env_step=12288, len=21, n/ep=3, n/st=64, player_1/loss=378.639, player_2/loss=291.087, rew=-8.33]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 434.47it/s, env_step=13312, len=23, n/ep=2, n/st=64, player_1/loss=325.051, player_2/loss=261.164, rew=0.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:03, 325.31it/s, env_step=14336, len=16, n/ep=4, n/st=64, player_1/loss=446.712, player_2/loss=276.879, rew=25.00]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 395.40it/s, env_step=15360, len=16, n/ep=4, n/st=64, player_1/loss=442.650, player_2/loss=199.046, rew=25.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 449.10it/s, env_step=16384, len=21, n/ep=3, n/st=64, player_1/loss=408.917, player_2/loss=209.857, rew=8.33]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 408.96it/s, env_step=17408, len=19, n/ep=3, n/st=64, player_1/loss=409.987, player_2/loss=221.342, rew=25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 452.83it/s, env_step=18432, len=21, n/ep=4, n/st=64, player_1/loss=394.531, player_2/loss=212.662, rew=0.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 406.54it/s, env_step=19456, len=15, n/ep=4, n/st=64, player_1/loss=365.092, player_2/loss=203.042, rew=0.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 456.14it/s, env_step=20480, len=18, n/ep=4, n/st=64, player_1/loss=394.280, player_2/loss=200.676, rew=25.00]


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 421.68it/s, env_step=21504, len=21, n/ep=3, n/st=64, player_1/loss=467.528, player_2/loss=213.571, rew=25.00]


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 448.26it/s, env_step=22528, len=16, n/ep=4, n/st=64, player_1/loss=471.930, player_2/loss=175.991, rew=25.00]


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 458.00it/s, env_step=23552, len=16, n/ep=3, n/st=64, player_1/loss=370.488, player_2/loss=187.549, rew=25.00]


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 453.26it/s, env_step=24576, len=18, n/ep=4, n/st=64, player_1/loss=375.144, player_2/loss=265.553, rew=12.50]


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 442.52it/s, env_step=25600, len=26, n/ep=3, n/st=64, player_1/loss=524.589, player_2/loss=260.820, rew=25.00]


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 445.63it/s, env_step=26624, len=20, n/ep=4, n/st=64, player_1/loss=430.440, player_2/loss=228.400, rew=12.50]


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 432.21it/s, env_step=27648, len=17, n/ep=3, n/st=64, player_1/loss=312.610, player_2/loss=235.483, rew=25.00]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 452.67it/s, env_step=28672, len=18, n/ep=3, n/st=64, player_1/loss=412.041, player_2/loss=231.530, rew=25.00]


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 459.17it/s, env_step=29696, len=24, n/ep=2, n/st=64, player_1/loss=354.566, player_2/loss=229.018, rew=0.00]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 386.94it/s, env_step=30720, len=26, n/ep=2, n/st=64, player_2/loss=258.766, rew=0.00]        


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 423.14it/s, env_step=31744, len=17, n/ep=4, n/st=64, player_1/loss=257.740, player_2/loss=239.394, rew=12.50]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 413.98it/s, env_step=32768, len=21, n/ep=3, n/st=64, player_1/loss=300.606, player_2/loss=183.351, rew=25.00]


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 435.28it/s, env_step=33792, len=19, n/ep=3, n/st=64, player_1/loss=403.754, player_2/loss=167.664, rew=8.33]


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 436.97it/s, env_step=34816, len=19, n/ep=3, n/st=64, player_1/loss=330.313, player_2/loss=221.768, rew=8.33]


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 455.78it/s, env_step=35840, len=17, n/ep=4, n/st=64, player_1/loss=268.676, player_2/loss=242.786, rew=25.00]


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 453.14it/s, env_step=36864, len=18, n/ep=4, n/st=64, player_1/loss=325.127, player_2/loss=235.286, rew=25.00]


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 392.45it/s, env_step=37888, len=16, n/ep=4, n/st=64, player_1/loss=486.132, player_2/loss=210.784, rew=25.00]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 440.02it/s, env_step=38912, len=19, n/ep=3, n/st=64, player_1/loss=510.358, player_2/loss=172.582, rew=25.00]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 461.27it/s, env_step=39936, len=17, n/ep=4, n/st=64, player_1/loss=471.121, player_2/loss=211.629, rew=12.50]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 461.23it/s, env_step=40960, len=16, n/ep=4, n/st=64, player_1/loss=429.479, player_2/loss=212.274, rew=12.50]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 438.47it/s, env_step=41984, len=15, n/ep=4, n/st=64, player_1/loss=318.645, player_2/loss=205.219, rew=25.00]


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 442.93it/s, env_step=43008, len=15, n/ep=3, n/st=64, player_1/loss=293.001, player_2/loss=262.350, rew=25.00]


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 470.33it/s, env_step=44032, len=22, n/ep=3, n/st=64, player_1/loss=466.039, player_2/loss=231.749, rew=8.33]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 464.49it/s, env_step=45056, len=18, n/ep=3, n/st=64, player_1/loss=537.623, player_2/loss=155.108, rew=8.33]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 467.98it/s, env_step=46080, len=17, n/ep=4, n/st=64, player_1/loss=451.476, player_2/loss=159.206, rew=12.50]


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 435.19it/s, env_step=47104, len=17, n/ep=3, n/st=64, player_2/loss=139.244, rew=25.00]       


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 466.93it/s, env_step=48128, len=15, n/ep=4, n/st=64, player_1/loss=406.566, player_2/loss=144.758, rew=25.00]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 445.33it/s, env_step=49152, len=18, n/ep=3, n/st=64, player_1/loss=374.332, player_2/loss=151.171, rew=25.00]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 440.20it/s, env_step=50176, len=21, n/ep=3, n/st=64, player_1/loss=530.855, rew=25.00]       


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 399.95it/s, env_step=1024, len=19, n/ep=4, n/st=64, player_1/loss=363.228, player_2/loss=164.275, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 374.11it/s, env_step=2048, len=39, n/ep=1, n/st=64, player_1/loss=403.885, player_2/loss=217.256, rew=25.00]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 375.90it/s, env_step=3072, len=17, n/ep=4, n/st=64, player_1/loss=419.941, player_2/loss=216.560, rew=-12.50]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 420.24it/s, env_step=4096, len=15, n/ep=4, n/st=64, player_1/loss=389.799, player_2/loss=195.435, rew=0.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 464.22it/s, env_step=5120, len=20, n/ep=3, n/st=64, player_1/loss=366.131, player_2/loss=173.950, rew=-25.00]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 446.41it/s, env_step=6144, len=20, n/ep=4, n/st=64, player_1/loss=365.521, player_2/loss=218.325, rew=-12.50]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 436.61it/s, env_step=7168, len=16, n/ep=5, n/st=64, player_1/loss=306.905, player_2/loss=220.943, rew=-5.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 433.55it/s, env_step=8192, len=13, n/ep=5, n/st=64, player_1/loss=499.086, player_2/loss=197.874, rew=-15.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 449.95it/s, env_step=9216, len=18, n/ep=3, n/st=64, player_1/loss=528.890, player_2/loss=190.500, rew=-8.33]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 445.14it/s, env_step=10240, len=18, n/ep=4, n/st=64, player_1/loss=441.311, player_2/loss=207.218, rew=-12.50]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 444.71it/s, env_step=11264, len=15, n/ep=4, n/st=64, player_1/loss=339.380, player_2/loss=208.481, rew=-12.50]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #12: 1025it [00:02, 402.74it/s, env_step=12288, len=20, n/ep=3, n/st=64, player_1/loss=416.264, player_2/loss=204.238, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #13: 1025it [00:02, 454.21it/s, env_step=13312, len=16, n/ep=4, n/st=64, player_1/loss=341.231, player_2/loss=202.524, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #14: 1025it [00:02, 462.02it/s, env_step=14336, len=16, n/ep=4, n/st=64, player_1/loss=339.139, player_2/loss=207.087, rew=-25.00]


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #15: 1025it [00:02, 458.19it/s, env_step=15360, len=12, n/ep=5, n/st=64, player_1/loss=352.454, player_2/loss=263.365, rew=-5.00]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #16: 1025it [00:02, 454.76it/s, env_step=16384, len=25, n/ep=3, n/st=64, player_1/loss=232.089, player_2/loss=310.354, rew=-8.33]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #17: 1025it [00:02, 438.02it/s, env_step=17408, len=14, n/ep=4, n/st=64, player_1/loss=207.827, player_2/loss=237.659, rew=-12.50]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #18: 1025it [00:02, 418.05it/s, env_step=18432, len=16, n/ep=5, n/st=64, player_1/loss=292.850, player_2/loss=222.587, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #19: 1025it [00:02, 415.22it/s, env_step=19456, len=17, n/ep=4, n/st=64, player_1/loss=343.731, player_2/loss=304.665, rew=25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #20: 1025it [00:02, 432.15it/s, env_step=20480, len=16, n/ep=4, n/st=64, player_1/loss=337.518, player_2/loss=271.435, rew=-12.50]


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #21: 1025it [00:02, 465.24it/s, env_step=21504, len=17, n/ep=3, n/st=64, player_1/loss=354.725, player_2/loss=200.339, rew=-8.33]


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #22: 1025it [00:02, 440.88it/s, env_step=22528, len=15, n/ep=5, n/st=64, player_1/loss=294.448, player_2/loss=210.098, rew=-15.00]


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #23: 1025it [00:02, 417.51it/s, env_step=23552, len=16, n/ep=4, n/st=64, player_2/loss=284.832, rew=-12.50]      


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #24: 1025it [00:02, 407.92it/s, env_step=24576, len=17, n/ep=4, n/st=64, player_1/loss=403.291, player_2/loss=265.920, rew=-25.00]


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #25: 1025it [00:02, 395.82it/s, env_step=25600, len=17, n/ep=4, n/st=64, player_1/loss=335.174, player_2/loss=230.637, rew=-12.50]


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #26: 1025it [00:02, 397.12it/s, env_step=26624, len=14, n/ep=4, n/st=64, player_1/loss=310.445, player_2/loss=200.834, rew=-12.50]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #27: 1025it [00:02, 388.40it/s, env_step=27648, len=15, n/ep=4, n/st=64, player_1/loss=325.728, player_2/loss=175.146, rew=-12.50]


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #28: 1025it [00:02, 440.13it/s, env_step=28672, len=18, n/ep=4, n/st=64, player_1/loss=281.529, player_2/loss=205.262, rew=-25.00]


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #29: 1025it [00:02, 451.15it/s, env_step=29696, len=16, n/ep=4, n/st=64, player_1/loss=363.671, player_2/loss=163.624, rew=-12.50]


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #30: 1025it [00:02, 445.31it/s, env_step=30720, len=18, n/ep=3, n/st=64, player_1/loss=434.235, player_2/loss=211.913, rew=-8.33]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #31: 1025it [00:02, 453.27it/s, env_step=31744, len=19, n/ep=4, n/st=64, player_1/loss=354.362, player_2/loss=225.115, rew=-12.50]


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #32: 1025it [00:02, 449.21it/s, env_step=32768, len=20, n/ep=3, n/st=64, player_1/loss=239.633, player_2/loss=165.498, rew=-25.00]


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #33: 1025it [00:02, 448.21it/s, env_step=33792, len=20, n/ep=3, n/st=64, player_1/loss=214.676, player_2/loss=239.956, rew=-25.00]


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #34: 1025it [00:02, 437.38it/s, env_step=34816, len=19, n/ep=3, n/st=64, player_1/loss=268.519, player_2/loss=265.870, rew=-8.33]


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #35: 1025it [00:02, 437.61it/s, env_step=35840, len=21, n/ep=4, n/st=64, player_1/loss=393.902, player_2/loss=188.312, rew=-12.50]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #36: 1025it [00:02, 441.15it/s, env_step=36864, len=19, n/ep=3, n/st=64, player_1/loss=391.366, player_2/loss=182.184, rew=-8.33]


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #37: 1025it [00:02, 446.69it/s, env_step=37888, len=17, n/ep=4, n/st=64, player_1/loss=310.853, player_2/loss=282.631, rew=0.00]


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #38: 1025it [00:02, 456.96it/s, env_step=38912, len=18, n/ep=3, n/st=64, player_1/loss=210.568, player_2/loss=313.150, rew=-8.33]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #39: 1025it [00:02, 445.83it/s, env_step=39936, len=17, n/ep=4, n/st=64, player_1/loss=207.687, player_2/loss=243.567, rew=-25.00]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #40: 1025it [00:02, 440.93it/s, env_step=40960, len=23, n/ep=3, n/st=64, player_1/loss=246.868, rew=-8.33]       


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #41: 1025it [00:02, 441.96it/s, env_step=41984, len=18, n/ep=3, n/st=64, player_1/loss=259.220, player_2/loss=202.946, rew=-8.33]


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #42: 1025it [00:02, 456.52it/s, env_step=43008, len=23, n/ep=3, n/st=64, player_1/loss=177.403, player_2/loss=205.812, rew=-25.00]


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #43: 1025it [00:02, 447.77it/s, env_step=44032, len=25, n/ep=2, n/st=64, player_1/loss=148.484, player_2/loss=243.912, rew=25.00]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #44: 1025it [00:02, 448.78it/s, env_step=45056, len=15, n/ep=4, n/st=64, player_1/loss=146.880, player_2/loss=213.210, rew=0.00]


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #45: 1025it [00:02, 459.96it/s, env_step=46080, len=15, n/ep=4, n/st=64, player_1/loss=226.688, player_2/loss=171.604, rew=0.00]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #46: 1025it [00:02, 430.83it/s, env_step=47104, len=20, n/ep=4, n/st=64, player_1/loss=282.803, player_2/loss=193.810, rew=-12.50]


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #47: 1025it [00:02, 419.89it/s, env_step=48128, len=20, n/ep=3, n/st=64, player_1/loss=194.413, player_2/loss=222.523, rew=-25.00]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #48: 1025it [00:02, 473.25it/s, env_step=49152, len=19, n/ep=3, n/st=64, player_1/loss=212.702, player_2/loss=275.494, rew=-8.33]


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #49: 1025it [00:02, 472.06it/s, env_step=50176, len=21, n/ep=3, n/st=64, player_1/loss=235.301, player_2/loss=240.795, rew=-8.33]


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #11


Epoch #1: 1025it [00:02, 475.26it/s, env_step=1024, len=20, n/ep=4, n/st=64, player_1/loss=265.268, player_2/loss=263.515, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 472.74it/s, env_step=2048, len=16, n/ep=4, n/st=64, player_1/loss=252.853, player_2/loss=200.540, rew=12.50]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 464.31it/s, env_step=3072, len=19, n/ep=4, n/st=64, player_1/loss=238.843, player_2/loss=171.101, rew=0.00]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 477.39it/s, env_step=4096, len=23, n/ep=3, n/st=64, player_1/loss=216.049, player_2/loss=261.774, rew=25.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 469.72it/s, env_step=5120, len=18, n/ep=3, n/st=64, player_1/loss=279.408, player_2/loss=258.571, rew=8.33]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 469.49it/s, env_step=6144, len=12, n/ep=4, n/st=64, player_1/loss=329.044, player_2/loss=210.804, rew=-12.50]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 476.73it/s, env_step=7168, len=20, n/ep=3, n/st=64, player_1/loss=286.938, player_2/loss=298.334, rew=8.33]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 476.49it/s, env_step=8192, len=20, n/ep=3, n/st=64, player_1/loss=363.423, player_2/loss=241.325, rew=25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 475.40it/s, env_step=9216, len=27, n/ep=3, n/st=64, player_1/loss=416.103, player_2/loss=165.971, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 432.75it/s, env_step=10240, len=28, n/ep=2, n/st=64, player_1/loss=325.018, player_2/loss=231.394, rew=25.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 456.70it/s, env_step=11264, len=20, n/ep=3, n/st=64, player_1/loss=317.465, player_2/loss=227.830, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 421.71it/s, env_step=12288, len=18, n/ep=3, n/st=64, player_1/loss=296.732, player_2/loss=214.853, rew=8.33]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 422.99it/s, env_step=13312, len=18, n/ep=4, n/st=64, player_1/loss=358.094, player_2/loss=251.855, rew=0.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 401.82it/s, env_step=14336, len=16, n/ep=4, n/st=64, player_1/loss=360.796, player_2/loss=303.799, rew=12.50]


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 396.68it/s, env_step=15360, len=14, n/ep=4, n/st=64, player_1/loss=322.845, player_2/loss=299.733, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 401.24it/s, env_step=16384, len=14, n/ep=3, n/st=64, player_1/loss=276.293, player_2/loss=232.144, rew=25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 432.87it/s, env_step=17408, len=17, n/ep=4, n/st=64, player_1/loss=313.592, player_2/loss=189.256, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 430.81it/s, env_step=18432, len=15, n/ep=4, n/st=64, player_1/loss=313.589, player_2/loss=143.437, rew=0.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 412.53it/s, env_step=19456, len=22, n/ep=3, n/st=64, player_1/loss=258.343, player_2/loss=155.082, rew=25.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 390.69it/s, env_step=20480, len=15, n/ep=4, n/st=64, player_1/loss=309.430, player_2/loss=147.778, rew=25.00]


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 393.68it/s, env_step=21504, len=17, n/ep=3, n/st=64, player_1/loss=298.451, player_2/loss=178.568, rew=8.33]


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 402.91it/s, env_step=22528, len=26, n/ep=2, n/st=64, player_1/loss=340.818, player_2/loss=195.480, rew=25.00]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 403.14it/s, env_step=23552, len=14, n/ep=4, n/st=64, player_1/loss=345.852, player_2/loss=168.859, rew=25.00]


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 425.13it/s, env_step=24576, len=16, n/ep=4, n/st=64, player_1/loss=246.820, player_2/loss=178.343, rew=0.00]


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 401.43it/s, env_step=25600, len=15, n/ep=4, n/st=64, player_1/loss=332.817, player_2/loss=203.684, rew=25.00]


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 415.98it/s, env_step=26624, len=16, n/ep=4, n/st=64, player_1/loss=381.621, player_2/loss=167.160, rew=25.00]


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 423.80it/s, env_step=27648, len=21, n/ep=4, n/st=64, player_1/loss=437.726, player_2/loss=137.771, rew=25.00]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 430.51it/s, env_step=28672, len=17, n/ep=3, n/st=64, player_1/loss=457.742, player_2/loss=137.211, rew=8.33]


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 428.93it/s, env_step=29696, len=15, n/ep=5, n/st=64, player_1/loss=311.602, player_2/loss=165.534, rew=15.00]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 427.93it/s, env_step=30720, len=16, n/ep=4, n/st=64, player_1/loss=299.029, player_2/loss=172.457, rew=12.50]


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 428.91it/s, env_step=31744, len=14, n/ep=5, n/st=64, player_1/loss=413.022, player_2/loss=160.969, rew=15.00]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 430.93it/s, env_step=32768, len=15, n/ep=4, n/st=64, player_1/loss=423.180, player_2/loss=198.092, rew=25.00]


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 432.78it/s, env_step=33792, len=15, n/ep=5, n/st=64, player_1/loss=369.006, player_2/loss=189.273, rew=25.00]


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 439.03it/s, env_step=34816, len=18, n/ep=4, n/st=64, player_1/loss=413.167, player_2/loss=213.173, rew=0.00]


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 438.67it/s, env_step=35840, len=15, n/ep=4, n/st=64, player_1/loss=319.879, player_2/loss=231.715, rew=25.00]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 439.56it/s, env_step=36864, len=14, n/ep=4, n/st=64, player_1/loss=297.160, player_2/loss=177.347, rew=25.00]


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 436.92it/s, env_step=37888, len=13, n/ep=4, n/st=64, player_1/loss=385.794, player_2/loss=181.985, rew=12.50]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 436.42it/s, env_step=38912, len=14, n/ep=5, n/st=64, player_1/loss=285.742, player_2/loss=207.525, rew=15.00]


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 412.76it/s, env_step=39936, len=15, n/ep=4, n/st=64, player_1/loss=255.750, player_2/loss=201.930, rew=25.00]


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 390.82it/s, env_step=40960, len=22, n/ep=3, n/st=64, player_1/loss=406.503, player_2/loss=141.268, rew=8.33]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 378.98it/s, env_step=41984, len=13, n/ep=5, n/st=64, player_1/loss=383.871, player_2/loss=167.869, rew=15.00]


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 365.76it/s, env_step=43008, len=19, n/ep=4, n/st=64, player_1/loss=285.101, player_2/loss=176.116, rew=25.00]


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 362.66it/s, env_step=44032, len=14, n/ep=4, n/st=64, player_1/loss=245.478, player_2/loss=138.728, rew=25.00]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 364.79it/s, env_step=45056, len=18, n/ep=4, n/st=64, player_1/loss=343.462, player_2/loss=162.413, rew=12.50]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 404.79it/s, env_step=46080, len=19, n/ep=3, n/st=64, player_1/loss=378.790, player_2/loss=155.673, rew=25.00]


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 431.92it/s, env_step=47104, len=15, n/ep=5, n/st=64, player_1/loss=289.905, player_2/loss=149.051, rew=25.00]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 411.63it/s, env_step=48128, len=14, n/ep=4, n/st=64, player_1/loss=327.808, player_2/loss=163.238, rew=0.00]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 380.66it/s, env_step=49152, len=16, n/ep=4, n/st=64, player_1/loss=418.770, player_2/loss=188.708, rew=12.50]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 366.93it/s, env_step=50176, len=15, n/ep=4, n/st=64, player_1/loss=387.906, player_2/loss=206.412, rew=25.00]


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 435.22it/s, env_step=1024, len=15, n/ep=4, n/st=64, player_1/loss=237.775, player_2/loss=275.397, rew=0.00]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 447.44it/s, env_step=2048, len=13, n/ep=4, n/st=64, player_1/loss=280.560, player_2/loss=195.402, rew=-12.50]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 418.40it/s, env_step=3072, len=14, n/ep=5, n/st=64, player_1/loss=328.546, player_2/loss=131.415, rew=-15.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 438.52it/s, env_step=4096, len=18, n/ep=3, n/st=64, player_1/loss=364.859, player_2/loss=203.494, rew=-8.33]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 446.49it/s, env_step=5120, len=15, n/ep=4, n/st=64, player_1/loss=359.401, player_2/loss=208.428, rew=-12.50]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 431.21it/s, env_step=6144, len=16, n/ep=4, n/st=64, player_1/loss=419.851, player_2/loss=166.495, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 447.58it/s, env_step=7168, len=17, n/ep=4, n/st=64, player_1/loss=368.321, player_2/loss=204.509, rew=-12.50]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 452.51it/s, env_step=8192, len=14, n/ep=3, n/st=64, player_1/loss=242.139, player_2/loss=222.542, rew=-25.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 437.92it/s, env_step=9216, len=16, n/ep=4, n/st=64, player_1/loss=258.080, player_2/loss=266.366, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 450.06it/s, env_step=10240, len=16, n/ep=4, n/st=64, player_1/loss=286.613, player_2/loss=226.873, rew=0.00]


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 413.06it/s, env_step=11264, len=14, n/ep=4, n/st=64, player_1/loss=333.844, player_2/loss=246.189, rew=-12.50]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 397.39it/s, env_step=12288, len=15, n/ep=4, n/st=64, player_1/loss=474.180, player_2/loss=275.108, rew=0.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 373.39it/s, env_step=13312, len=14, n/ep=4, n/st=64, player_1/loss=461.532, player_2/loss=200.569, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 380.03it/s, env_step=14336, len=15, n/ep=4, n/st=64, player_2/loss=224.838, rew=-12.50]      


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 403.01it/s, env_step=15360, len=15, n/ep=4, n/st=64, player_1/loss=385.754, player_2/loss=233.554, rew=-12.50]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 386.00it/s, env_step=16384, len=16, n/ep=4, n/st=64, player_1/loss=304.239, player_2/loss=188.155, rew=-25.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #17: 1025it [00:02, 408.30it/s, env_step=17408, len=15, n/ep=4, n/st=64, player_1/loss=220.819, player_2/loss=155.324, rew=-25.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #18: 1025it [00:02, 446.80it/s, env_step=18432, len=14, n/ep=4, n/st=64, player_1/loss=229.751, player_2/loss=140.281, rew=-25.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #19: 1025it [00:02, 444.40it/s, env_step=19456, len=14, n/ep=5, n/st=64, player_1/loss=220.591, player_2/loss=142.171, rew=-5.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #20: 1025it [00:02, 446.85it/s, env_step=20480, len=17, n/ep=4, n/st=64, player_1/loss=202.696, player_2/loss=153.026, rew=-12.50]


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #21: 1025it [00:02, 446.56it/s, env_step=21504, len=15, n/ep=4, n/st=64, player_1/loss=353.403, player_2/loss=172.265, rew=-12.50]


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #22: 1025it [00:02, 448.66it/s, env_step=22528, len=13, n/ep=5, n/st=64, player_1/loss=343.258, player_2/loss=217.658, rew=-15.00]


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #23: 1025it [00:02, 450.79it/s, env_step=23552, len=14, n/ep=4, n/st=64, player_1/loss=172.726, player_2/loss=246.864, rew=-25.00]


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #24: 1025it [00:02, 414.94it/s, env_step=24576, len=14, n/ep=4, n/st=64, player_1/loss=229.067, player_2/loss=241.871, rew=-12.50]


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #25: 1025it [00:02, 440.13it/s, env_step=25600, len=15, n/ep=4, n/st=64, player_1/loss=300.321, player_2/loss=276.945, rew=-25.00]


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #26: 1025it [00:02, 424.64it/s, env_step=26624, len=16, n/ep=4, n/st=64, player_1/loss=302.407, player_2/loss=193.053, rew=-25.00]


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #27: 1025it [00:02, 448.02it/s, env_step=27648, len=16, n/ep=4, n/st=64, player_1/loss=304.112, player_2/loss=176.077, rew=-25.00]


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #28: 1025it [00:02, 451.49it/s, env_step=28672, len=20, n/ep=3, n/st=64, player_1/loss=239.053, player_2/loss=199.173, rew=-25.00]


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #29: 1025it [00:02, 450.85it/s, env_step=29696, len=13, n/ep=5, n/st=64, player_1/loss=200.478, player_2/loss=193.245, rew=-15.00]


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #30: 1025it [00:02, 455.45it/s, env_step=30720, len=17, n/ep=4, n/st=64, player_1/loss=281.956, player_2/loss=215.817, rew=12.50]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #31: 1025it [00:02, 450.19it/s, env_step=31744, len=14, n/ep=4, n/st=64, player_1/loss=264.345, player_2/loss=150.750, rew=-25.00]


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #32: 1025it [00:02, 428.89it/s, env_step=32768, len=18, n/ep=3, n/st=64, player_1/loss=226.034, player_2/loss=151.452, rew=-8.33]


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #33: 1025it [00:02, 398.24it/s, env_step=33792, len=17, n/ep=4, n/st=64, player_1/loss=240.084, player_2/loss=171.560, rew=-12.50]


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #34: 1025it [00:02, 380.47it/s, env_step=34816, len=15, n/ep=4, n/st=64, player_1/loss=223.550, player_2/loss=213.186, rew=-12.50]


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #35: 1025it [00:02, 374.50it/s, env_step=35840, len=15, n/ep=4, n/st=64, player_1/loss=293.123, player_2/loss=196.635, rew=-12.50]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #36: 1025it [00:02, 367.03it/s, env_step=36864, len=16, n/ep=4, n/st=64, player_1/loss=333.383, player_2/loss=170.257, rew=0.00]


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #37: 1025it [00:02, 363.73it/s, env_step=37888, len=15, n/ep=4, n/st=64, player_1/loss=308.023, player_2/loss=168.627, rew=-25.00]


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #38: 1025it [00:02, 419.86it/s, env_step=38912, len=15, n/ep=4, n/st=64, player_1/loss=256.482, player_2/loss=148.815, rew=-12.50]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #39: 1025it [00:02, 432.57it/s, env_step=39936, len=15, n/ep=4, n/st=64, player_1/loss=245.057, player_2/loss=194.152, rew=-12.50]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #40: 1025it [00:02, 434.75it/s, env_step=40960, len=14, n/ep=5, n/st=64, player_1/loss=239.737, player_2/loss=245.736, rew=-25.00]


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #41: 1025it [00:02, 433.07it/s, env_step=41984, len=14, n/ep=5, n/st=64, player_1/loss=153.441, player_2/loss=205.432, rew=-25.00]


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #42: 1025it [00:02, 446.86it/s, env_step=43008, len=17, n/ep=4, n/st=64, player_1/loss=191.954, player_2/loss=177.232, rew=0.00]


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #43: 1025it [00:02, 447.87it/s, env_step=44032, len=13, n/ep=5, n/st=64, player_1/loss=253.952, player_2/loss=186.050, rew=-15.00]


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #44: 1025it [00:02, 448.54it/s, env_step=45056, len=14, n/ep=4, n/st=64, player_1/loss=231.687, player_2/loss=181.094, rew=-25.00]


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #45: 1025it [00:02, 446.21it/s, env_step=46080, len=14, n/ep=5, n/st=64, player_1/loss=286.242, player_2/loss=144.063, rew=-15.00]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #46: 1025it [00:02, 444.70it/s, env_step=47104, len=15, n/ep=4, n/st=64, player_1/loss=389.498, player_2/loss=134.394, rew=-25.00]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #47: 1025it [00:02, 427.69it/s, env_step=48128, len=15, n/ep=4, n/st=64, player_1/loss=283.856, player_2/loss=130.519, rew=0.00]


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #48: 1025it [00:02, 417.37it/s, env_step=49152, len=14, n/ep=4, n/st=64, player_1/loss=237.736, player_2/loss=120.493, rew=-12.50]


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #49: 1025it [00:02, 437.17it/s, env_step=50176, len=14, n/ep=4, n/st=64, player_1/loss=283.665, player_2/loss=158.577, rew=-12.50]


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #1: 1025it [00:02, 419.32it/s, env_step=1024, len=15, n/ep=4, n/st=64, player_1/loss=202.305, player_2/loss=251.873, rew=0.00]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 425.80it/s, env_step=2048, len=15, n/ep=4, n/st=64, player_1/loss=259.574, player_2/loss=193.570, rew=12.50]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 425.81it/s, env_step=3072, len=12, n/ep=6, n/st=64, player_1/loss=309.637, player_2/loss=202.450, rew=8.33]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 416.67it/s, env_step=4096, len=15, n/ep=4, n/st=64, player_1/loss=299.535, player_2/loss=148.757, rew=12.50]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 408.22it/s, env_step=5120, len=18, n/ep=3, n/st=64, player_1/loss=301.198, player_2/loss=132.560, rew=8.33]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 385.19it/s, env_step=6144, len=14, n/ep=4, n/st=64, player_1/loss=315.521, player_2/loss=145.308, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 374.98it/s, env_step=7168, len=17, n/ep=4, n/st=64, player_1/loss=329.076, player_2/loss=257.786, rew=12.50]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 374.83it/s, env_step=8192, len=14, n/ep=4, n/st=64, player_1/loss=223.122, player_2/loss=286.844, rew=0.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 357.51it/s, env_step=9216, len=14, n/ep=4, n/st=64, player_1/loss=260.283, player_2/loss=207.508, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 368.82it/s, env_step=10240, len=16, n/ep=4, n/st=64, player_1/loss=371.347, player_2/loss=146.744, rew=0.00]


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 366.89it/s, env_step=11264, len=18, n/ep=3, n/st=64, player_1/loss=400.737, player_2/loss=144.722, rew=8.33]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 363.71it/s, env_step=12288, len=14, n/ep=4, n/st=64, player_1/loss=326.067, player_2/loss=178.724, rew=25.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 348.14it/s, env_step=13312, len=14, n/ep=4, n/st=64, player_1/loss=408.789, player_2/loss=175.501, rew=25.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 346.91it/s, env_step=14336, len=14, n/ep=5, n/st=64, player_2/loss=168.754, rew=5.00]        


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 355.51it/s, env_step=15360, len=15, n/ep=4, n/st=64, player_1/loss=302.527, player_2/loss=142.291, rew=12.50]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 362.79it/s, env_step=16384, len=14, n/ep=4, n/st=64, player_1/loss=243.070, player_2/loss=202.919, rew=25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 366.76it/s, env_step=17408, len=15, n/ep=4, n/st=64, player_1/loss=213.069, player_2/loss=224.961, rew=25.00]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 372.77it/s, env_step=18432, len=14, n/ep=5, n/st=64, player_1/loss=245.653, player_2/loss=158.627, rew=5.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 353.33it/s, env_step=19456, len=15, n/ep=4, n/st=64, player_1/loss=354.576, player_2/loss=141.489, rew=0.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 359.62it/s, env_step=20480, len=13, n/ep=4, n/st=64, player_1/loss=352.723, player_2/loss=163.836, rew=0.00]


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 352.57it/s, env_step=21504, len=18, n/ep=4, n/st=64, player_1/loss=260.109, player_2/loss=222.289, rew=12.50]


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 366.26it/s, env_step=22528, len=14, n/ep=4, n/st=64, player_1/loss=270.172, player_2/loss=199.880, rew=-12.50]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 414.52it/s, env_step=23552, len=14, n/ep=4, n/st=64, player_1/loss=341.988, player_2/loss=190.770, rew=25.00]


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 421.75it/s, env_step=24576, len=15, n/ep=4, n/st=64, player_1/loss=368.258, player_2/loss=200.981, rew=12.50]


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 442.33it/s, env_step=25600, len=15, n/ep=4, n/st=64, player_1/loss=434.149, player_2/loss=112.660, rew=12.50]


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 424.50it/s, env_step=26624, len=14, n/ep=5, n/st=64, player_1/loss=404.816, player_2/loss=117.248, rew=15.00]


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 392.99it/s, env_step=27648, len=15, n/ep=4, n/st=64, player_1/loss=265.267, player_2/loss=139.570, rew=12.50]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 397.55it/s, env_step=28672, len=14, n/ep=4, n/st=64, player_1/loss=293.555, player_2/loss=176.500, rew=12.50]


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 408.65it/s, env_step=29696, len=15, n/ep=4, n/st=64, player_1/loss=276.440, player_2/loss=172.869, rew=25.00]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 404.28it/s, env_step=30720, len=18, n/ep=3, n/st=64, player_1/loss=161.609, player_2/loss=189.524, rew=25.00]


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 386.54it/s, env_step=31744, len=15, n/ep=4, n/st=64, player_1/loss=260.691, player_2/loss=213.512, rew=25.00]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 362.77it/s, env_step=32768, len=15, n/ep=2, n/st=64, player_1/loss=338.403, player_2/loss=173.628, rew=25.00]


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 427.83it/s, env_step=33792, len=17, n/ep=4, n/st=64, player_1/loss=340.781, player_2/loss=194.659, rew=25.00]


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 414.48it/s, env_step=34816, len=15, n/ep=5, n/st=64, player_1/loss=301.892, player_2/loss=211.337, rew=15.00]


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 435.08it/s, env_step=35840, len=12, n/ep=5, n/st=64, player_1/loss=345.767, player_2/loss=191.890, rew=5.00]


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 381.08it/s, env_step=36864, len=19, n/ep=4, n/st=64, player_1/loss=317.303, player_2/loss=118.997, rew=25.00]


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 440.88it/s, env_step=37888, len=13, n/ep=4, n/st=64, player_1/loss=314.119, player_2/loss=114.363, rew=0.00]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 447.13it/s, env_step=38912, len=14, n/ep=4, n/st=64, player_1/loss=288.884, player_2/loss=107.218, rew=0.00]


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 446.63it/s, env_step=39936, len=14, n/ep=4, n/st=64, player_1/loss=280.385, player_2/loss=117.725, rew=25.00]


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 450.35it/s, env_step=40960, len=14, n/ep=5, n/st=64, player_1/loss=310.189, player_2/loss=129.498, rew=25.00]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 460.90it/s, env_step=41984, len=13, n/ep=5, n/st=64, player_1/loss=358.595, player_2/loss=151.566, rew=15.00]


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 452.88it/s, env_step=43008, len=14, n/ep=4, n/st=64, player_1/loss=280.739, player_2/loss=174.461, rew=12.50]


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 448.70it/s, env_step=44032, len=13, n/ep=5, n/st=64, player_1/loss=237.391, player_2/loss=203.171, rew=15.00]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 442.45it/s, env_step=45056, len=20, n/ep=3, n/st=64, player_1/loss=292.632, player_2/loss=162.584, rew=-8.33]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 436.33it/s, env_step=46080, len=14, n/ep=4, n/st=64, player_1/loss=333.529, player_2/loss=154.359, rew=-12.50]


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 449.69it/s, env_step=47104, len=15, n/ep=4, n/st=64, player_1/loss=262.969, player_2/loss=136.202, rew=25.00]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 425.07it/s, env_step=48128, len=14, n/ep=5, n/st=64, player_1/loss=236.227, player_2/loss=143.882, rew=25.00]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 428.63it/s, env_step=49152, len=13, n/ep=4, n/st=64, player_1/loss=236.817, player_2/loss=147.070, rew=12.50]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 444.13it/s, env_step=50176, len=14, n/ep=5, n/st=64, player_1/loss=244.136, player_2/loss=135.017, rew=-5.00]


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 442.48it/s, env_step=1024, len=20, n/ep=4, n/st=64, player_1/loss=286.751, player_2/loss=128.363, rew=-12.50]


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 420.30it/s, env_step=2048, len=13, n/ep=4, n/st=64, player_1/loss=209.352, player_2/loss=162.089, rew=-12.50]


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 416.97it/s, env_step=3072, len=14, n/ep=5, n/st=64, player_1/loss=213.275, player_2/loss=152.219, rew=-5.00]


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 394.32it/s, env_step=4096, len=15, n/ep=4, n/st=64, player_1/loss=276.550, player_2/loss=132.392, rew=-25.00]


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 390.71it/s, env_step=5120, len=16, n/ep=4, n/st=64, player_1/loss=311.784, player_2/loss=115.988, rew=-12.50]


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 342.87it/s, env_step=6144, len=14, n/ep=4, n/st=64, player_1/loss=292.553, player_2/loss=116.318, rew=-25.00]


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 443.23it/s, env_step=7168, len=15, n/ep=4, n/st=64, player_1/loss=310.508, player_2/loss=115.447, rew=-25.00]


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 443.06it/s, env_step=8192, len=14, n/ep=4, n/st=64, player_1/loss=287.701, player_2/loss=131.195, rew=0.00]


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 451.26it/s, env_step=9216, len=14, n/ep=4, n/st=64, player_1/loss=178.901, player_2/loss=170.982, rew=-25.00]


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 440.67it/s, env_step=10240, len=20, n/ep=3, n/st=64, player_1/loss=265.860, rew=-25.00]      


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 442.71it/s, env_step=11264, len=15, n/ep=4, n/st=64, player_1/loss=365.190, player_2/loss=154.784, rew=-12.50]


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 427.92it/s, env_step=12288, len=14, n/ep=4, n/st=64, player_1/loss=321.040, player_2/loss=140.983, rew=-25.00]


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 445.83it/s, env_step=13312, len=16, n/ep=4, n/st=64, player_1/loss=331.402, player_2/loss=163.910, rew=-25.00]


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 446.25it/s, env_step=14336, len=15, n/ep=4, n/st=64, player_2/loss=162.688, rew=-12.50]      


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 450.19it/s, env_step=15360, len=15, n/ep=4, n/st=64, player_1/loss=315.683, player_2/loss=112.910, rew=-12.50]


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 451.58it/s, env_step=16384, len=14, n/ep=4, n/st=64, player_1/loss=302.074, player_2/loss=186.922, rew=-25.00]


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 455.48it/s, env_step=17408, len=14, n/ep=5, n/st=64, player_1/loss=201.306, player_2/loss=212.002, rew=-5.00]


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 441.13it/s, env_step=18432, len=13, n/ep=5, n/st=64, player_1/loss=200.971, player_2/loss=144.835, rew=-15.00]


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 456.03it/s, env_step=19456, len=15, n/ep=4, n/st=64, player_1/loss=220.212, player_2/loss=150.903, rew=-25.00]


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 444.45it/s, env_step=20480, len=17, n/ep=3, n/st=64, player_1/loss=212.401, player_2/loss=148.521, rew=8.33]


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 416.40it/s, env_step=21504, len=14, n/ep=5, n/st=64, player_1/loss=255.086, player_2/loss=127.575, rew=-25.00]


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 441.38it/s, env_step=22528, len=16, n/ep=4, n/st=64, player_1/loss=265.617, player_2/loss=173.094, rew=-12.50]


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 436.59it/s, env_step=23552, len=15, n/ep=4, n/st=64, player_1/loss=269.214, player_2/loss=179.912, rew=0.00]


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 398.40it/s, env_step=24576, len=16, n/ep=4, n/st=64, player_1/loss=254.036, player_2/loss=129.382, rew=-12.50]


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 443.86it/s, env_step=25600, len=16, n/ep=5, n/st=64, player_1/loss=247.143, player_2/loss=110.459, rew=-15.00]


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 413.32it/s, env_step=26624, len=15, n/ep=4, n/st=64, player_1/loss=325.878, player_2/loss=133.539, rew=-12.50]


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #27: 1025it [00:02, 379.10it/s, env_step=27648, len=17, n/ep=4, n/st=64, player_1/loss=319.163, player_2/loss=149.842, rew=-12.50]


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #28: 1025it [00:02, 381.77it/s, env_step=28672, len=15, n/ep=3, n/st=64, player_1/loss=291.789, player_2/loss=133.915, rew=-8.33]


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #29: 1025it [00:02, 439.97it/s, env_step=29696, len=14, n/ep=4, n/st=64, player_1/loss=267.208, player_2/loss=134.489, rew=-12.50]


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #30: 1025it [00:02, 429.71it/s, env_step=30720, len=18, n/ep=4, n/st=64, player_1/loss=274.414, player_2/loss=135.183, rew=-12.50]


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #31: 1025it [00:02, 443.71it/s, env_step=31744, len=15, n/ep=4, n/st=64, player_1/loss=243.851, player_2/loss=157.303, rew=0.00]


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #32: 1025it [00:02, 457.83it/s, env_step=32768, len=16, n/ep=4, n/st=64, player_1/loss=255.422, player_2/loss=149.690, rew=-12.50]


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #33: 1025it [00:02, 458.33it/s, env_step=33792, len=15, n/ep=4, n/st=64, player_1/loss=234.744, player_2/loss=114.246, rew=0.00]


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #34: 1025it [00:02, 443.99it/s, env_step=34816, len=14, n/ep=4, n/st=64, player_1/loss=206.986, player_2/loss=148.168, rew=-25.00]


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #35: 1025it [00:02, 424.59it/s, env_step=35840, len=14, n/ep=4, n/st=64, player_1/loss=236.398, player_2/loss=146.544, rew=0.00]


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #36: 1025it [00:02, 443.72it/s, env_step=36864, len=15, n/ep=4, n/st=64, player_2/loss=116.109, rew=-12.50]      


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #37: 1025it [00:02, 428.86it/s, env_step=37888, len=14, n/ep=4, n/st=64, player_1/loss=192.260, player_2/loss=129.456, rew=-25.00]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #38: 1025it [00:02, 441.37it/s, env_step=38912, len=14, n/ep=4, n/st=64, player_1/loss=224.551, player_2/loss=115.928, rew=-12.50]


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #39: 1025it [00:02, 444.63it/s, env_step=39936, len=15, n/ep=5, n/st=64, player_1/loss=284.396, player_2/loss=114.072, rew=-25.00]


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #40: 1025it [00:02, 443.60it/s, env_step=40960, len=22, n/ep=3, n/st=64, player_1/loss=258.852, player_2/loss=104.824, rew=-8.33]


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #41: 1025it [00:02, 397.49it/s, env_step=41984, len=15, n/ep=5, n/st=64, player_1/loss=200.485, player_2/loss=110.089, rew=-15.00]


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #42: 1025it [00:02, 418.75it/s, env_step=43008, len=15, n/ep=4, n/st=64, player_1/loss=158.223, player_2/loss=131.333, rew=-25.00]


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #43: 1025it [00:02, 433.93it/s, env_step=44032, len=15, n/ep=4, n/st=64, player_1/loss=182.362, player_2/loss=137.366, rew=-12.50]


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #44: 1025it [00:02, 413.49it/s, env_step=45056, len=14, n/ep=4, n/st=64, player_1/loss=307.585, player_2/loss=113.399, rew=-12.50]


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #45: 1025it [00:02, 422.28it/s, env_step=46080, len=15, n/ep=4, n/st=64, player_1/loss=369.925, player_2/loss=98.301, rew=0.00]


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #46: 1025it [00:02, 396.19it/s, env_step=47104, len=17, n/ep=4, n/st=64, player_1/loss=293.645, player_2/loss=115.769, rew=-12.50]


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #47: 1025it [00:02, 434.04it/s, env_step=48128, len=13, n/ep=4, n/st=64, player_1/loss=264.925, player_2/loss=127.813, rew=-25.00]


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #48: 1025it [00:02, 406.13it/s, env_step=49152, len=15, n/ep=4, n/st=64, player_1/loss=172.567, player_2/loss=135.454, rew=-25.00]


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #49: 1025it [00:02, 390.67it/s, env_step=50176, len=14, n/ep=4, n/st=64, player_1/loss=181.777, player_2/loss=129.847, rew=-25.00]


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #26


Epoch #1: 1025it [00:02, 391.74it/s, env_step=1024, len=17, n/ep=4, n/st=64, player_1/loss=270.302, player_2/loss=125.298, rew=12.50]


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 449.55it/s, env_step=2048, len=16, n/ep=4, n/st=64, player_1/loss=245.749, player_2/loss=117.470, rew=25.00]


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 439.21it/s, env_step=3072, len=16, n/ep=4, n/st=64, player_1/loss=262.116, player_2/loss=114.368, rew=12.50]


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 415.22it/s, env_step=4096, len=15, n/ep=4, n/st=64, player_1/loss=262.773, player_2/loss=139.909, rew=0.00]


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 413.00it/s, env_step=5120, len=16, n/ep=4, n/st=64, player_1/loss=256.399, player_2/loss=179.408, rew=12.50]


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 452.23it/s, env_step=6144, len=14, n/ep=4, n/st=64, player_1/loss=230.039, player_2/loss=167.462, rew=25.00]


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 442.63it/s, env_step=7168, len=16, n/ep=4, n/st=64, player_1/loss=187.523, player_2/loss=152.639, rew=12.50]


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 428.10it/s, env_step=8192, len=14, n/ep=4, n/st=64, player_1/loss=188.944, player_2/loss=139.778, rew=25.00]


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 424.32it/s, env_step=9216, len=14, n/ep=4, n/st=64, player_1/loss=162.539, player_2/loss=171.564, rew=25.00]


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 434.03it/s, env_step=10240, len=17, n/ep=4, n/st=64, player_1/loss=202.437, rew=12.50]       


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 431.88it/s, env_step=11264, len=15, n/ep=4, n/st=64, player_1/loss=206.453, player_2/loss=122.712, rew=25.00]


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 425.46it/s, env_step=12288, len=14, n/ep=5, n/st=64, player_1/loss=245.077, player_2/loss=110.376, rew=5.00]


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 428.30it/s, env_step=13312, len=13, n/ep=5, n/st=64, player_1/loss=280.516, player_2/loss=102.593, rew=15.00]


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 419.74it/s, env_step=14336, len=15, n/ep=5, n/st=64, player_2/loss=109.962, rew=15.00]       


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 418.85it/s, env_step=15360, len=14, n/ep=5, n/st=64, player_1/loss=247.346, player_2/loss=130.272, rew=15.00]


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 425.08it/s, env_step=16384, len=13, n/ep=5, n/st=64, player_1/loss=257.038, player_2/loss=165.793, rew=15.00]


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 407.38it/s, env_step=17408, len=15, n/ep=4, n/st=64, player_1/loss=263.409, player_2/loss=198.741, rew=12.50]


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 410.59it/s, env_step=18432, len=13, n/ep=4, n/st=64, player_1/loss=270.558, player_2/loss=170.691, rew=25.00]


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 419.03it/s, env_step=19456, len=14, n/ep=5, n/st=64, player_1/loss=268.664, player_2/loss=165.267, rew=15.00]


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 402.06it/s, env_step=20480, len=15, n/ep=5, n/st=64, player_1/loss=254.396, rew=25.00]       


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 401.22it/s, env_step=21504, len=14, n/ep=5, n/st=64, player_1/loss=314.329, player_2/loss=144.027, rew=25.00]


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 397.34it/s, env_step=22528, len=15, n/ep=4, n/st=64, player_1/loss=285.192, player_2/loss=134.532, rew=12.50]


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 389.62it/s, env_step=23552, len=14, n/ep=5, n/st=64, player_1/loss=308.347, player_2/loss=137.898, rew=25.00]


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 430.89it/s, env_step=24576, len=15, n/ep=5, n/st=64, player_1/loss=283.297, player_2/loss=163.620, rew=15.00]


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 451.70it/s, env_step=25600, len=14, n/ep=4, n/st=64, player_1/loss=222.885, player_2/loss=163.328, rew=25.00]


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 431.99it/s, env_step=26624, len=15, n/ep=4, n/st=64, player_1/loss=266.062, player_2/loss=110.345, rew=25.00]


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 434.99it/s, env_step=27648, len=15, n/ep=5, n/st=64, player_1/loss=211.762, player_2/loss=127.468, rew=25.00]


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 444.57it/s, env_step=28672, len=12, n/ep=5, n/st=64, player_1/loss=168.565, player_2/loss=136.166, rew=25.00]


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 448.03it/s, env_step=29696, len=15, n/ep=4, n/st=64, player_1/loss=184.184, player_2/loss=103.225, rew=12.50]


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 456.66it/s, env_step=30720, len=14, n/ep=4, n/st=64, player_1/loss=224.244, player_2/loss=95.341, rew=12.50]


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 430.16it/s, env_step=31744, len=15, n/ep=4, n/st=64, player_1/loss=255.671, player_2/loss=157.093, rew=25.00]


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 422.06it/s, env_step=32768, len=13, n/ep=5, n/st=64, player_1/loss=234.547, player_2/loss=168.635, rew=15.00]


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 442.66it/s, env_step=33792, len=13, n/ep=5, n/st=64, player_1/loss=185.262, player_2/loss=194.090, rew=15.00]


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 434.49it/s, env_step=34816, len=15, n/ep=4, n/st=64, player_1/loss=237.545, player_2/loss=174.450, rew=12.50]


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 430.63it/s, env_step=35840, len=13, n/ep=5, n/st=64, player_1/loss=252.996, player_2/loss=126.033, rew=25.00]


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 427.04it/s, env_step=36864, len=13, n/ep=4, n/st=64, player_1/loss=221.237, player_2/loss=136.829, rew=12.50]


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 408.03it/s, env_step=37888, len=13, n/ep=5, n/st=64, player_1/loss=299.980, player_2/loss=127.280, rew=15.00]


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 452.60it/s, env_step=38912, len=13, n/ep=5, n/st=64, player_1/loss=286.626, player_2/loss=129.659, rew=15.00]


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 424.90it/s, env_step=39936, len=14, n/ep=5, n/st=64, player_1/loss=217.799, player_2/loss=131.001, rew=15.00]


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 420.60it/s, env_step=40960, len=14, n/ep=4, n/st=64, player_1/loss=242.075, player_2/loss=131.458, rew=25.00]


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 403.77it/s, env_step=41984, len=13, n/ep=5, n/st=64, player_1/loss=246.365, player_2/loss=92.416, rew=15.00]


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 413.08it/s, env_step=43008, len=12, n/ep=5, n/st=64, player_1/loss=257.944, player_2/loss=148.253, rew=25.00]


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 421.54it/s, env_step=44032, len=13, n/ep=5, n/st=64, player_1/loss=235.823, player_2/loss=150.022, rew=25.00]


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 435.67it/s, env_step=45056, len=16, n/ep=4, n/st=64, player_1/loss=221.217, player_2/loss=108.781, rew=12.50]


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 396.89it/s, env_step=46080, len=13, n/ep=5, n/st=64, player_1/loss=189.655, player_2/loss=113.886, rew=25.00]


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 407.75it/s, env_step=47104, len=13, n/ep=5, n/st=64, player_1/loss=228.796, player_2/loss=124.528, rew=15.00]


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 409.67it/s, env_step=48128, len=13, n/ep=5, n/st=64, player_1/loss=204.789, player_2/loss=121.079, rew=15.00]


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 399.44it/s, env_step=49152, len=13, n/ep=5, n/st=64, player_1/loss=173.100, player_2/loss=130.727, rew=15.00]


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 409.40it/s, env_step=50176, len=13, n/ep=5, n/st=64, player_1/loss=203.702, player_2/loss=129.309, rew=15.00]

Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0





In [27]:
####################################################
# EXPERIMENT: VIEWING THE BEST LEARNED POLICY
####################################################

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure the best agent
best_agent1 = cf_cnn_dqn_policy(state_shape= state_shape,
                                action_shape= action_shape)
best_agent1.load_state_dict(torch.load("./saved_variables/paper_notebooks/7/5-looping-iteration-19/best_policy_agent1.pth"))
best_agent1.set_eps(0)


best_agent2 = cf_cnn_dqn_policy(state_shape= state_shape,
                                action_shape= action_shape)
best_agent2.load_state_dict(torch.load("./saved_variables/paper_notebooks/7/5-looping-iteration-19/best_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= best_agent1,
      agent_player2= best_agent2)



Average steps of game:  15.0
Final mean reward agent 1: -8.333333333333334, std: 23.570226039551585
Final mean reward agent 2: 8.333333333333334, std: 23.570226039551585


In [28]:
####################################################
# EXPERIMENT: VIEWING THE LAST LEARNED POLICY
####################################################

# Configure the final agent
final_agent_player1 = cf_cnn_dqn_policy(state_shape= state_shape,
                                        action_shape= action_shape)
final_agent_player1.load_state_dict(torch.load("./saved_variables/paper_notebooks/7/5-looping-iteration-19/final_policy_agent1.pth"))
best_agent1.set_eps(0)

final_agent_player2 = cf_cnn_dqn_policy(state_shape= state_shape,
                                        action_shape= action_shape)
final_agent_player2.load_state_dict(torch.load("./saved_variables/paper_notebooks/7/5-looping-iteration-19/best_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= final_agent_player1,
      agent_player2= final_agent_player2)



Average steps of game:  14.0
Final mean reward agent 1: -25.0, std: 0.0
Final mean reward agent 2: 25.0, std: 0.0


<hr><hr>

## Discussion

The performance of this model based on a CNN is similar to the previous model used.
We will address other difficult points in the next notebooks to build an appropriate bot. 

In [13]:
####################################################
# CLEAN VARIABLES
####################################################

del action_shape
del agent1
del agent2
del best_agent1
del best_agent2
del env
del final_agent_player1
del final_agent_player2
del observation_space
del off_policy_traininer_results
del state_shape
