# MLP based DQN agent against fixed oponent

In the previous notebook, `7-cnn-dqn-fixed-oponent.ipynb`, we used the CNN based model for training through an iteration of alternating frozen agents.
We found this to give interesting but not fully statisfactory results.
We will now use the same technique for the custom MLP based approach designed in `5-improving-dqn-architecture.ipynb` to properly compare both architectures performance for the agents.

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training two DQN agents on connect four Gym
  - Building the environment
  - Implementing the DQN policy
  - Building agents
  - Function for letting agents learn
  - Function for watching learned agent
  - Doing the experiment
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Argparser
import argparse

# More data types
import typing
import numpy as np

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
importlib.invalidate_caches()
importlib.reload(cfgym)

# Time for allowing "freezes" in execution
import time;

# Allow for copying objects in a non reference manner
import copy

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0
Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [4]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device
print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

# Show cuda device name
print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training two DQN agents on connect four Gym

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, agents will always play as the same player, e.g. always as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Building the environment

This code is taken from previous notebooks.
We don't allow invalid moves to make the problem easier for now.

In [5]:
####################################################
# CONNECT FOUR V2 ENVIRONMENT
####################################################

def get_env():
    """
    Returns the connect four gym environment V2 altered for Tianshou and Petting Zoo compatibility.
    Already wrapped with a ts.env.PettingZooEnv wrapper.
    """
    return ts.env.PettingZooEnv(cfgym.env(reward_move= 0, # Set to 1 for reward to make moves (incentivise longer games)
                                          reward_invalid= -3,
                                          reward_draw= 100,
                                          reward_win= 25,
                                          reward_loss= -25,
                                          allow_invalid_move= False))
    
    
# Test the environment
env = get_env()
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation
del env

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


<hr>

### Implementing the DQN policy

We use the strategy created in `5-improving-dqn-architecture.ipynb`.

In [6]:
####################################################
# DQN ARCHITECTURE
####################################################

class CustomDQN(torch.nn.Module):
    """
    Custom DQN using a model based on CNN
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',):
        # Parent call
        super().__init__()
        
        # Save device (e.g. cuda)
        self.device = device
        
        self.model = torch.nn.Sequential(
            torch.nn.Linear(np.prod(state_shape), 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, np.prod(action_shape)),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
        batch = obs.shape[0]
        logits = self.model(obs.view(batch, -1))
        return logits, state


In [7]:
####################################################
# DQN POLICY
####################################################

def cf_custom_dqn_policy(state_shape: tuple,
                         action_shape: tuple,
                         optim: typing.Optional[torch.optim.Optimizer] = None,
                         learning_rate: float =  0.0001,
                         gamma: float = 0.9, # Smaller gamma favours "faster" win
                         n_step: int = 4, # Number of steps to look ahead
                         frozen: bool = False,
                         target_update_freq: int = 320):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    net = CustomDQN(state_shape, action_shape, device= device).to(device)
    
    # Default optimizer is an adam optimizer with the argparser learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # If we are frozen, we use an optimizer that has learning rate 0
    if frozen:
        optim = torch.optim.SGD(net.parameters(), lr= 0)
        
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)

<hr>

### Building agents

This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [8]:
####################################################
# AGENT CREATION
####################################################

def get_agents(agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
               agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
               optim: typing.Optional[torch.optim.Optimizer] = None,
               resume_path_player_1: str = '', # Path to file to resume agent training from
               resume_path_player_2: str = '', 
               agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
               agent_player2_frozen: bool = False,
               ) -> typing.Tuple[ts.policy.BasePolicy, torch.optim.Optimizer, list]:
    """
    Gets a multi agent policy manager, optimizer and player ids for the connect four V2 gym environment.
    Per default this returns 
        - Multi agent manager for 2 agents using DQN
        - Adam optimizer
        - ['player_1', 'player_2'] from the connect four environment
    """
    
    # Get the environment to play in (Connect four gym V2)
    env = get_env()
    
    # Get the observation space from the environment, depending on typo of space (ternary operator)
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    
    # Set the arguments
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent player 1 to be a DQN if no policy is passed.
    if agent_player1 is None:
        # Our agent1 uses a DQN policy
        agent_player1 = cf_custom_dqn_policy(state_shape= state_shape,
                                             action_shape= action_shape,
                                             optim= optim,
                                             frozen= agent_player1_frozen)
                
        # If we resume our agent we need to load the previous config
        if resume_path_player_1:
            agent_player1.load_state_dict(torch.load(resume_path_player_1))
            
    
    # Configure agent player 2 to be a DQN if no policy is passed.
    if agent_player2 is None:
        # Our agent1 uses a DQN policy
        agent_player2 = cf_custom_dqn_policy(state_shape= state_shape,
                                             action_shape= action_shape,
                                             optim= optim,
                                             frozen= agent_player2_frozen)
        
                
        # If we resume our agent we need to load the previous config
        if resume_path_player_2:
            agent_player2.load_state_dict(torch.load(resume_path_player_2))

    # Both our agents are DQN agents by default
    agents = [agent_player1, agent_player2]
        
    # Our policy depends on the order of the agents
    policy = ts.policy.MultiAgentPolicyManager(agents, env)
    
    # Return our policy, optimizer and the available agents in the environment
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    
    return policy, optim, env.agents

<hr>

### Function for letting agents learn

This is identical to the previous notebook.

In [9]:
####################################################
# AGENT TRAINING
####################################################

def train_agent(filename: str = "dqn_vs_dqn_cnn_based",
                agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
                agent_player2_frozen: bool = False,
                single_agent_score_as_reward: bool= False, # Uses non frozen agent's score as reward
                optim: typing.Optional[torch.optim.Optimizer] = None,
                training_env_num: int = 1,
                testing_env_num: int = 1,
                buffer_size: int = 2^14,
                batch_size: int = 1, 
                epochs: int = 50, #50
                step_per_epoch: int = 1024, #1024
                step_per_collect: int = 64, # transition before update
                update_per_step: float = 0.1,
                testing_eps: float = 0.05,
                training_eps: float = 0.1,
                ) -> typing.Tuple[dict, ts.policy.BasePolicy]:
    """
    Trains two agents in the connect four V2 environment and saves their best model and logs.
    Returns:
        - result from offpolicy_trainer
        - final version of agent 1
        - final version of agent 2
    """

    # ======== notebook specific =========
    notebook_version = '8' # Used for foldering logs and models

    # ======== environment setup =========
    train_envs = ts.env.DummyVectorEnv([get_env for _ in range(training_env_num)])
    test_envs = ts.env.DummyVectorEnv([get_env for _ in range(testing_env_num)])
    
    # set the seed for reproducibility
    np.random.seed(1998)
    torch.manual_seed(1998)
    train_envs.seed(1998)
    test_envs.seed(1998)

    # ======== agent setup =========
    # Gets our agents from the previously made function
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    policy, optim, agents = get_agents(agent_player1=agent_player1,
                                       agent_player2=agent_player2,
                                       agent_player1_frozen= agent_player1_frozen,
                                       agent_player2_frozen= agent_player2_frozen,
                                       optim=optim)

    # ======== collector setup =========
    # Make a collector for the training environments
    train_collector = ts.data.Collector(policy= policy,
                                        env= train_envs,
                                        buffer= ts.data.VectorReplayBuffer(buffer_size, len(train_envs)),
                                        exploration_noise= True)
    
    # Make a collector for the testing environments
    test_collector = ts.data.Collector(policy= policy,
                                       env= test_envs,
                                       buffer= ts.data.VectorReplayBuffer(buffer_size, len(test_envs)),
                                       exploration_noise= True)
    
    # Uncomment below if you want to set epsilon in epsilon policy
    # policy.set_eps(1)
    
    # Collect data fot the training evnironments
    train_collector.collect(n_step= batch_size * training_env_num)
    
    # ======== ensure folders exist =========
    if not os.path.exists(os.path.join('./logs', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./logs', 'paper_notebooks', notebook_version, filename))
    if not os.path.exists(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename))

    # ======== tensorboard logging setup =========
    # Allows to save the training progress to tensorboard compatable logs
    log_path = os.path.join('./logs', 'paper_notebooks', notebook_version, filename)
    writer = torch.utils.tensorboard.SummaryWriter(log_path)
    logger = ts.utils.TensorboardLogger(writer)

    # ======== callback functions used during training =========
    # We want to save our best policy
    def save_best_fn(policy):
        """
        Callback to save the best model
        """
        # Save best agent 1
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent1.pth')
        torch.save(policy.policies[agents[0]].state_dict(), model_save_path)
        
        # Save best agent 2
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent2.pth')
        torch.save(policy.policies[agents[1]].state_dict(), model_save_path)
        
        # Save agent2

    def stop_fn(mean_rewards):
        """
        Callback to stop training when we've reached the win rate
        """
        return mean_rewards >= 7 # (win = 10, 70% win without invalid moves = mean of 7)

    def train_fn(epoch, env_step):
        """
        Callback before training
        """        
        # Before training we want to configure the epsilon for the agents
        # In general more exploratory than the test case
        policy.policies[agents[0]].set_eps(training_eps)
        policy.policies[agents[1]].set_eps(training_eps)

    def test_fn(epoch, env_step):
        """
        Callback beore testing
        """        
        # Before testing we want to configure the epsilon for the agents
        # In general more greedy than the train case but not
        #   to avoid getting stuck on invalid moves
        policy.policies[agents[0]].set_eps(testing_eps)
        policy.policies[agents[1]].set_eps(testing_eps)

    def reward_metric(rews):
        """
        Callback for reward collection
        """        
        if agent_player2_frozen and single_agent_score_as_reward:
            # agent 2 frozen, optimizing for agent 1
            return rews[:, 0]
        
        if agent_player1_frozen and single_agent_score_as_reward:
            # agent 1 frozen, optimizing for agent 2
            return rews[:, 1]
        
        # Per default we are interested in optimizing both agents
        return rews[:, 0] + rews[:, 1]
    
            

    # trainer
    result = ts.trainer.offpolicy_trainer(policy= policy,
                                          train_collector= train_collector,
                                          test_collector= test_collector,
                                          max_epoch= epochs,
                                          step_per_epoch= step_per_epoch,
                                          step_per_collect= step_per_collect,
                                          episode_per_test= testing_env_num,
                                          batch_size= batch_size,
                                          train_fn= train_fn,
                                          test_fn= test_fn,
                                          # Stop function to stop before specified amount of epochs
                                          #stop_fn= stop_fn
                                          save_best_fn= save_best_fn,
                                          update_per_step= update_per_step,
                                          logger= logger,
                                          test_in_train= False,
                                          reward_metric= reward_metric)
    
    # Save final agent 1
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent1.pth')
    torch.save(policy.policies[agents[0]].state_dict(), model_save_path)

    # Save final agent 2
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent2.pth')
    torch.save(policy.policies[agents[1]].state_dict(), model_save_path)

    return result, policy.policies[agents[0]], policy.policies[agents[1]]

<hr>

### Function for watching learned agent

Identical to the previous notebook.

In [10]:
####################################################
# WATCHING THE LEARNED POLICY IN ACTION
####################################################

def watch(numer_of_games: int = 3,
          agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
          agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
          test_epsilon: float = 0.05, # For the watching we act completely greedy but low random for not getting stuck on invalid move
          render_speed: float = 0.15, # Amount of seconds to update frame/ do a step
          ) -> None:
    
    # Get the connect four V2 environment (must be a list)
    env= ts.env.DummyVectorEnv([get_env])
    
    # Get the agents from the trained agents
    policy, optim, agents = get_agents(agent_player1= agent_player1,
                                       agent_player2= agent_player2)
    
    # Evaluate the policy
    policy.eval()
    
    # Set the testing policy epsilon for our agents
    policy.policies[agents[0]].set_eps(test_epsilon)
    policy.policies[agents[1]].set_eps(test_epsilon)
    
    # Collect the test data
    collector = ts.data.Collector(policy= policy,
                                  env= env,
                                  exploration_noise= True)
    
    # Render games in human mode to see how it plays
    result = collector.collect(n_episode= numer_of_games, render= render_speed)
    
    # Close the environment aftering collecting the results
    # This closes the pygame window after completion
    env.close()
    
    # Get the rewards and length from the test trials
    rewards, length = result["rews"], result["lens"]
    
    # Print the final reward for the first agent
    print(f"Average steps of game:  {length.mean()}")
    print(f"Final mean reward agent 1: {rewards[:, 0].mean()}, std: {rewards[:, 0].std()}")
    print(f"Final mean reward agent 2: {rewards[:, 1].mean()}, std: {rewards[:, 1].std()}")

<hr>

### Doing the experiment

We now do the experiment with using our previously created functions.
We freeze one agent and initialize both agents from previous versions.

The following iterations were made:

1. Freeze agent 1, train agent 2:
    - Model save name: `1-mlp_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `1` with test reward `1102`
    - Scoring: sum of `both` agent's score
2. Freeze agent 2, train agent 1:
    - Model save name: `2-mlp_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/1-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `482` with test reward `1102`
    - Scoring: sum of `both` agent's score

After which the agent was so focused on prolonging the game, we decided to lower the learning rate and start optimizing for winning again. We also lowered the amount of epochs in each iterations of swapping the frozen agent.

3. Freeze agent 1, train agent 2:
    - Model save name: `3-mlp_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/1-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.00005` # halfed learning rate
    - Training epsilon: `0.1` # halfed training epsilon
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: `7` with test reward `100`
    - Scoring: reward of `agent 2`
4. Freeze agent 2, train agent 1:
    - Model save name: `4-mlp_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/3-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.00005`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: `XXX` with test reward `YYY`
    - Scoring: reward of `agent 1`
    
To do further training, a loop was created which alternated between freezing agens every 50 epochs. This loop was executed 20 times. The learning rate was also lowered once again.

5. Loop frozen agents:
    - Model save name: `5-50epoch_20loop/looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/4-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/3-mlp_dqn_frozen_agent1/best_policy_agent2.pth`
    - Learning rate: `0.000001`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `50` x `20` loops 
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
6. Loop frozen agents:
    - Model save name: `6-20epoch_100loop/looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/5-looping-iteration-19/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/5-looping-iteration-19/best_policy_agent2.pth`
    - Learning rate: `0.000003`
    - Training epsilon: `0.1`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `100` loops 
    - Gamma: `0.9` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
7. Loop frozen agents:
    - Model save name: `7-20epoch_500loop/looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/6-looping-iteration-99/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/6-looping-iteration-99/best_policy_agent2.pth`
    - Learning rate: `0.001`
    - Training epsilon: `0.05`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `500` loops 
    - Gamma: `0.9` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`

For file size reasons, only a portion of the saved agents are kept and stored on GitHub.


In [18]:
####################################################
# EXPERIMENT: TRAINING AGENTS
####################################################

# Configs for the agents
#freeze_agent1 = False
agent1_starting_params = "./saved_variables/paper_notebooks/8/4-mlp_dqn_frozen_agent2/best_policy_agent1.pth"

#freeze_agent2 = True
agent2_starting_params = "./saved_variables/paper_notebooks/8/3-mlp_dqn_frozen_agent1/best_policy_agent2.pth"

single_agent_score_as_reward = True # To use combined reward or non frozen agent reward as scoring
filename = "5-looping-iteration-i"
epochs = 50
loops = 20

learning_rate = 0.000001
training_eps = 0.1
gamma = 0.8
n_step = 4

for loop_idx in range(loops):
    # Filename
    filename = f"5-50epoch_20loop/looping-iteration-{loop_idx}"
    
    # Use provided starting params in first loop, the one from previous iteration in next
    if loop_idx > 0:
        agent1_starting_params = f"./saved_variables/paper_notebooks/8/5-50epoch_20loop/looping-iteration-{loop_idx-1}/final_policy_agent1.pth"
        agent2_starting_params = f"./saved_variables/paper_notebooks/8/5-50epoch_20loop/looping-iteration-{loop_idx-1}/final_policy_agent2.pth"
    
    # Determine what agent to freeze
    freeze_agent1 = True if loop_idx % 2 == 1 else False
    freeze_agent2 = True if loop_idx % 2 == 0 else False
    
    # Get the environment settings
    env = get_env()
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent 1
    agent1 = cf_custom_dqn_policy(state_shape= state_shape,
                                  action_shape= action_shape,
                                  gamma= gamma,
                                  frozen= freeze_agent1,
                                  learning_rate = learning_rate,
                                  n_step= n_step)
    
    if agent1_starting_params:
        agent1.load_state_dict(torch.load(agent1_starting_params))
        
        # Configure agent 2
        agent2 = cf_custom_dqn_policy(state_shape= state_shape,
                                      action_shape= action_shape,
                                      gamma= gamma,
                                      frozen= freeze_agent2,
                                      learning_rate = learning_rate,
                                      n_step= n_step)
        
        if agent2_starting_params:
            agent2.load_state_dict(torch.load(agent2_starting_params))
            
            
            # Train the agent
            off_policy_traininer_results, final_agent_player1, final_agent_player2 = train_agent(epochs= epochs,
                                                                                                 agent_player1= agent1,
                                                                                                 agent_player1_frozen = freeze_agent1,
                                                                                                 agent_player2= agent2,
                                                                                                 agent_player2_frozen = freeze_agent2,
                                                                                                 filename= filename,
                                                                                                 single_agent_score_as_reward = single_agent_score_as_reward,
                                                                                                 training_eps= training_eps)
            
            

Epoch #1: 1025it [00:02, 380.25it/s, env_step=1024, len=24, n/ep=2, n/st=64, player_1/loss=712.986, player_2/loss=1092.072, rew=25.00]                                                                                                                                                                                      


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 483.34it/s, env_step=2048, len=17, n/ep=4, n/st=64, player_1/loss=775.767, player_2/loss=893.256, rew=25.00]                                                                                                                                                                                       


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 474.98it/s, env_step=3072, len=31, n/ep=2, n/st=64, player_1/loss=729.617, rew=25.00]                                                                                                                                                                                                              


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 489.80it/s, env_step=4096, len=12, n/ep=4, n/st=64, player_1/loss=748.700, player_2/loss=826.673, rew=25.00]                                                                                                                                                                                       


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 487.37it/s, env_step=5120, len=26, n/ep=3, n/st=64, player_1/loss=880.022, player_2/loss=747.462, rew=25.00]                                                                                                                                                                                       


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 478.58it/s, env_step=6144, len=17, n/ep=3, n/st=64, player_1/loss=868.350, player_2/loss=811.166, rew=25.00]                                                                                                                                                                                       


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 491.60it/s, env_step=7168, len=17, n/ep=4, n/st=64, player_1/loss=911.909, player_2/loss=817.532, rew=0.00]                                                                                                                                                                                        


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 490.02it/s, env_step=8192, len=22, n/ep=3, n/st=64, player_1/loss=977.128, player_2/loss=857.136, rew=8.33]                                                                                                                                                                                        


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 488.92it/s, env_step=9216, len=29, n/ep=2, n/st=64, player_1/loss=781.723, player_2/loss=820.591, rew=0.00]                                                                                                                                                                                        


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 490.99it/s, env_step=10240, len=31, n/ep=2, n/st=64, player_1/loss=810.156, player_2/loss=788.820, rew=0.00]                                                                                                                                                                                      


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 441.85it/s, env_step=11264, len=28, n/ep=3, n/st=64, player_1/loss=884.174, player_2/loss=822.599, rew=25.00]                                                                                                                                                                                     


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 470.05it/s, env_step=12288, len=23, n/ep=2, n/st=64, player_1/loss=886.511, player_2/loss=996.311, rew=25.00]                                                                                                                                                                                     


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 480.06it/s, env_step=13312, len=26, n/ep=3, n/st=64, player_1/loss=845.450, player_2/loss=971.054, rew=25.00]                                                                                                                                                                                     


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 483.54it/s, env_step=14336, len=17, n/ep=4, n/st=64, player_1/loss=732.529, player_2/loss=850.641, rew=0.00]                                                                                                                                                                                      


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 491.00it/s, env_step=15360, len=21, n/ep=2, n/st=64, player_1/loss=699.443, player_2/loss=912.814, rew=0.00]                                                                                                                                                                                      


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 490.98it/s, env_step=16384, len=21, n/ep=3, n/st=64, player_1/loss=714.935, player_2/loss=1024.085, rew=25.00]                                                                                                                                                                                    


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 491.26it/s, env_step=17408, len=29, n/ep=2, n/st=64, player_1/loss=692.452, player_2/loss=987.421, rew=25.00]                                                                                                                                                                                     


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 490.42it/s, env_step=18432, len=31, n/ep=2, n/st=64, player_1/loss=705.333, player_2/loss=1122.597, rew=25.00]                                                                                                                                                                                    


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 492.87it/s, env_step=19456, len=27, n/ep=2, n/st=64, player_1/loss=906.390, player_2/loss=914.896, rew=0.00]                                                                                                                                                                                      


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 480.70it/s, env_step=20480, len=18, n/ep=3, n/st=64, player_1/loss=834.893, player_2/loss=616.326, rew=25.00]                                                                                                                                                                                     


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 469.22it/s, env_step=21504, len=27, n/ep=2, n/st=64, player_1/loss=840.704, player_2/loss=543.293, rew=0.00]                                                                                                                                                                                      


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 438.31it/s, env_step=22528, len=30, n/ep=3, n/st=64, player_1/loss=849.080, player_2/loss=691.731, rew=8.33]                                                                                                                                                                                      


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 405.39it/s, env_step=23552, len=28, n/ep=2, n/st=64, player_1/loss=784.676, player_2/loss=635.102, rew=25.00]                                                                                                                                                                                     


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 394.62it/s, env_step=24576, len=29, n/ep=3, n/st=64, player_1/loss=747.163, player_2/loss=680.564, rew=8.33]                                                                                                                                                                                      


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 393.79it/s, env_step=25600, len=26, n/ep=2, n/st=64, player_1/loss=932.537, player_2/loss=715.809, rew=25.00]                                                                                                                                                                                     


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 394.28it/s, env_step=26624, len=36, n/ep=2, n/st=64, player_1/loss=939.064, player_2/loss=556.564, rew=25.00]                                                                                                                                                                                     


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 395.53it/s, env_step=27648, len=35, n/ep=2, n/st=64, player_1/loss=793.535, player_2/loss=450.538, rew=0.00]                                                                                                                                                                                      


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 392.81it/s, env_step=28672, len=35, n/ep=2, n/st=64, player_1/loss=641.575, player_2/loss=559.943, rew=0.00]                                                                                                                                                                                      


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 394.23it/s, env_step=29696, len=34, n/ep=2, n/st=64, player_1/loss=736.676, player_2/loss=668.448, rew=0.00]                                                                                                                                                                                      


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 392.75it/s, env_step=30720, len=29, n/ep=2, n/st=64, player_1/loss=891.939, player_2/loss=699.465, rew=25.00]                                                                                                                                                                                     


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 391.79it/s, env_step=31744, len=23, n/ep=2, n/st=64, player_1/loss=757.307, player_2/loss=715.680, rew=25.00]                                                                                                                                                                                     


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 393.89it/s, env_step=32768, len=31, n/ep=3, n/st=64, player_1/loss=734.747, player_2/loss=994.188, rew=25.00]                                                                                                                                                                                     


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 395.43it/s, env_step=33792, len=24, n/ep=3, n/st=64, player_1/loss=784.039, player_2/loss=1071.183, rew=25.00]                                                                                                                                                                                    


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 393.29it/s, env_step=34816, len=30, n/ep=2, n/st=64, player_1/loss=667.546, player_2/loss=855.619, rew=0.00]                                                                                                                                                                                      


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 392.68it/s, env_step=35840, len=31, n/ep=2, n/st=64, player_1/loss=727.713, player_2/loss=606.014, rew=25.00]                                                                                                                                                                                     


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 385.62it/s, env_step=36864, len=30, n/ep=2, n/st=64, player_1/loss=772.833, player_2/loss=594.141, rew=-25.00]                                                                                                                                                                                    


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 389.36it/s, env_step=37888, len=30, n/ep=2, n/st=64, player_1/loss=703.800, player_2/loss=628.800, rew=0.00]                                                                                                                                                                                      


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 396.14it/s, env_step=38912, len=31, n/ep=2, n/st=64, player_1/loss=714.497, player_2/loss=810.417, rew=0.00]                                                                                                                                                                                      


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 394.92it/s, env_step=39936, len=27, n/ep=2, n/st=64, player_1/loss=894.057, player_2/loss=785.513, rew=25.00]                                                                                                                                                                                     


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 391.57it/s, env_step=40960, len=24, n/ep=2, n/st=64, player_1/loss=871.558, player_2/loss=751.351, rew=0.00]                                                                                                                                                                                      


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 394.35it/s, env_step=41984, len=36, n/ep=2, n/st=64, player_1/loss=702.690, player_2/loss=691.405, rew=62.50]                                                                                                                                                                                     


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 393.11it/s, env_step=43008, len=21, n/ep=3, n/st=64, player_1/loss=774.967, player_2/loss=555.584, rew=8.33]                                                                                                                                                                                      


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 393.58it/s, env_step=44032, len=34, n/ep=2, n/st=64, player_1/loss=859.315, player_2/loss=472.247, rew=0.00]                                                                                                                                                                                      


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 392.68it/s, env_step=45056, len=32, n/ep=1, n/st=64, player_1/loss=821.616, player_2/loss=531.962, rew=-25.00]                                                                                                                                                                                    


Epoch #44: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #44


Epoch #45: 1025it [00:02, 392.14it/s, env_step=46080, len=31, n/ep=2, n/st=64, player_1/loss=743.715, player_2/loss=526.980, rew=-25.00]                                                                                                                                                                                    


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #44


Epoch #46: 1025it [00:02, 391.91it/s, env_step=47104, len=32, n/ep=2, n/st=64, player_1/loss=852.333, player_2/loss=507.008, rew=-25.00]                                                                                                                                                                                    


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #44


Epoch #47: 1025it [00:02, 394.46it/s, env_step=48128, len=30, n/ep=3, n/st=64, player_1/loss=778.520, player_2/loss=570.438, rew=8.33]                                                                                                                                                                                      


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #44


Epoch #48: 1025it [00:02, 392.64it/s, env_step=49152, len=32, n/ep=1, n/st=64, player_1/loss=640.781, player_2/loss=563.890, rew=-25.00]                                                                                                                                                                                    


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #44


Epoch #49: 1025it [00:02, 393.03it/s, env_step=50176, len=22, n/ep=3, n/st=64, player_1/loss=760.250, player_2/loss=672.482, rew=8.33]                                                                                                                                                                                      


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #44


Epoch #1: 1025it [00:02, 393.54it/s, env_step=1024, len=22, n/ep=2, n/st=64, player_1/loss=788.356, player_2/loss=866.964, rew=0.00]                                                                                                                                                                                        


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 381.16it/s, env_step=2048, len=31, n/ep=2, n/st=64, player_1/loss=730.939, player_2/loss=772.040, rew=25.00]                                                                                                                                                                                       


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 441.23it/s, env_step=3072, len=19, n/ep=3, n/st=64, player_1/loss=680.518, player_2/loss=739.832, rew=-25.00]                                                                                                                                                                                      


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 416.42it/s, env_step=4096, len=30, n/ep=3, n/st=64, player_1/loss=794.974, player_2/loss=758.144, rew=-8.33]                                                                                                                                                                                       


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #5: 1025it [00:02, 387.57it/s, env_step=5120, len=31, n/ep=2, n/st=64, player_1/loss=750.491, player_2/loss=601.191, rew=0.00]                                                                                                                                                                                        


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #6: 1025it [00:02, 342.54it/s, env_step=6144, len=33, n/ep=2, n/st=64, player_1/loss=575.736, player_2/loss=452.329, rew=25.00]                                                                                                                                                                                       


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #7: 1025it [00:02, 447.24it/s, env_step=7168, len=27, n/ep=3, n/st=64, player_1/loss=663.533, player_2/loss=462.318, rew=-25.00]                                                                                                                                                                                      


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #8: 1025it [00:02, 443.54it/s, env_step=8192, len=32, n/ep=2, n/st=64, player_1/loss=695.174, player_2/loss=529.309, rew=-25.00]                                                                                                                                                                                      


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #9: 1025it [00:02, 477.80it/s, env_step=9216, len=19, n/ep=3, n/st=64, player_1/loss=780.117, player_2/loss=615.758, rew=-25.00]                                                                                                                                                                                      


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #10: 1025it [00:02, 483.42it/s, env_step=10240, len=28, n/ep=2, n/st=64, player_1/loss=975.232, player_2/loss=535.906, rew=0.00]                                                                                                                                                                                      


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #11: 1025it [00:02, 468.64it/s, env_step=11264, len=21, n/ep=3, n/st=64, player_1/loss=805.814, player_2/loss=504.407, rew=8.33]                                                                                                                                                                                      


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #12: 1025it [00:02, 479.72it/s, env_step=12288, len=37, n/ep=1, n/st=64, player_1/loss=670.056, player_2/loss=590.533, rew=-25.00]                                                                                                                                                                                    


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #13: 1025it [00:02, 487.34it/s, env_step=13312, len=35, n/ep=2, n/st=64, player_1/loss=890.649, player_2/loss=641.248, rew=25.00]                                                                                                                                                                                     


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #14: 1025it [00:02, 494.57it/s, env_step=14336, len=30, n/ep=3, n/st=64, player_1/loss=930.034, player_2/loss=634.548, rew=-8.33]                                                                                                                                                                                     


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #15: 1025it [00:02, 495.96it/s, env_step=15360, len=33, n/ep=2, n/st=64, player_1/loss=841.567, player_2/loss=768.770, rew=0.00]                                                                                                                                                                                      


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #16: 1025it [00:02, 496.43it/s, env_step=16384, len=29, n/ep=2, n/st=64, player_1/loss=814.635, player_2/loss=645.805, rew=0.00]                                                                                                                                                                                      


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #17: 1025it [00:02, 496.29it/s, env_step=17408, len=29, n/ep=2, n/st=64, player_1/loss=1000.306, player_2/loss=384.112, rew=0.00]                                                                                                                                                                                     


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #18: 1025it [00:02, 496.93it/s, env_step=18432, len=21, n/ep=2, n/st=64, player_1/loss=1003.773, player_2/loss=628.540, rew=-25.00]                                                                                                                                                                                   


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #19: 1025it [00:02, 495.14it/s, env_step=19456, len=26, n/ep=2, n/st=64, player_1/loss=789.553, player_2/loss=690.840, rew=-25.00]                                                                                                                                                                                    


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #20: 1025it [00:02, 496.04it/s, env_step=20480, len=25, n/ep=2, n/st=64, player_1/loss=805.029, player_2/loss=617.710, rew=0.00]                                                                                                                                                                                      


Epoch #20: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #21: 1025it [00:02, 491.34it/s, env_step=21504, len=30, n/ep=2, n/st=64, player_1/loss=860.556, player_2/loss=600.391, rew=0.00]                                                                                                                                                                                      


Epoch #21: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #22: 1025it [00:02, 497.34it/s, env_step=22528, len=23, n/ep=2, n/st=64, player_1/loss=806.619, player_2/loss=455.953, rew=0.00]                                                                                                                                                                                      


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #23: 1025it [00:02, 493.78it/s, env_step=23552, len=37, n/ep=2, n/st=64, player_1/loss=658.961, player_2/loss=740.736, rew=0.00]                                                                                                                                                                                      


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #24: 1025it [00:02, 494.94it/s, env_step=24576, len=39, n/ep=1, n/st=64, player_1/loss=745.536, player_2/loss=889.034, rew=-25.00]                                                                                                                                                                                    


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #25: 1025it [00:02, 495.88it/s, env_step=25600, len=34, n/ep=2, n/st=64, player_1/loss=717.264, player_2/loss=570.120, rew=25.00]                                                                                                                                                                                     


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #26: 1025it [00:02, 480.55it/s, env_step=26624, len=32, n/ep=1, n/st=64, player_1/loss=749.564, player_2/loss=454.047, rew=25.00]                                                                                                                                                                                     


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #27: 1025it [00:02, 461.90it/s, env_step=27648, len=38, n/ep=1, n/st=64, player_1/loss=917.881, player_2/loss=548.363, rew=25.00]                                                                                                                                                                                     


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #28: 1025it [00:02, 440.36it/s, env_step=28672, len=30, n/ep=2, n/st=64, player_1/loss=958.106, player_2/loss=650.078, rew=0.00]                                                                                                                                                                                      


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #29: 1025it [00:02, 410.26it/s, env_step=29696, len=26, n/ep=3, n/st=64, player_1/loss=960.951, player_2/loss=658.622, rew=-8.33]                                                                                                                                                                                     


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #30: 1025it [00:02, 392.91it/s, env_step=30720, len=26, n/ep=2, n/st=64, player_1/loss=826.207, player_2/loss=515.135, rew=0.00]                                                                                                                                                                                      


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #31: 1025it [00:02, 395.22it/s, env_step=31744, len=33, n/ep=2, n/st=64, player_1/loss=698.411, player_2/loss=524.024, rew=25.00]                                                                                                                                                                                     


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #32: 1025it [00:02, 393.41it/s, env_step=32768, len=35, n/ep=2, n/st=64, player_1/loss=799.181, player_2/loss=587.321, rew=25.00]                                                                                                                                                                                     


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #33: 1025it [00:02, 392.95it/s, env_step=33792, len=29, n/ep=2, n/st=64, player_1/loss=734.090, player_2/loss=625.661, rew=0.00]                                                                                                                                                                                      


Epoch #33: test_reward: 100.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #34: 1025it [00:02, 391.36it/s, env_step=34816, len=30, n/ep=2, n/st=64, player_1/loss=670.056, player_2/loss=581.981, rew=0.00]                                                                                                                                                                                      


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #35: 1025it [00:02, 393.28it/s, env_step=35840, len=34, n/ep=2, n/st=64, player_1/loss=768.321, player_2/loss=446.010, rew=0.00]                                                                                                                                                                                      


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #36: 1025it [00:02, 386.39it/s, env_step=36864, len=26, n/ep=3, n/st=64, player_1/loss=819.268, player_2/loss=521.435, rew=-8.33]                                                                                                                                                                                     


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #37: 1025it [00:02, 388.86it/s, env_step=37888, len=34, n/ep=2, n/st=64, player_1/loss=648.908, player_2/loss=521.031, rew=25.00]                                                                                                                                                                                     


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #38: 1025it [00:02, 394.35it/s, env_step=38912, len=31, n/ep=3, n/st=64, player_1/loss=674.091, player_2/loss=390.228, rew=8.33]                                                                                                                                                                                      


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #39: 1025it [00:02, 392.73it/s, env_step=39936, len=29, n/ep=3, n/st=64, player_1/loss=713.760, player_2/loss=472.842, rew=25.00]                                                                                                                                                                                     


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #40: 1025it [00:02, 394.74it/s, env_step=40960, len=25, n/ep=3, n/st=64, player_1/loss=906.763, player_2/loss=533.117, rew=25.00]                                                                                                                                                                                     


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #41: 1025it [00:02, 391.63it/s, env_step=41984, len=36, n/ep=2, n/st=64, player_1/loss=936.891, player_2/loss=494.917, rew=0.00]                                                                                                                                                                                      


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #42: 1025it [00:02, 394.99it/s, env_step=43008, len=16, n/ep=3, n/st=64, player_1/loss=754.460, player_2/loss=561.477, rew=-25.00]                                                                                                                                                                                    


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #43: 1025it [00:02, 392.72it/s, env_step=44032, len=28, n/ep=2, n/st=64, player_1/loss=689.488, player_2/loss=548.429, rew=0.00]                                                                                                                                                                                      


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #44: 1025it [00:02, 393.14it/s, env_step=45056, len=29, n/ep=3, n/st=64, player_1/loss=672.601, player_2/loss=460.301, rew=-25.00]                                                                                                                                                                                    


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #45: 1025it [00:02, 393.51it/s, env_step=46080, len=27, n/ep=3, n/st=64, player_1/loss=721.603, player_2/loss=457.644, rew=8.33]                                                                                                                                                                                      


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #46: 1025it [00:02, 391.99it/s, env_step=47104, len=33, n/ep=2, n/st=64, player_1/loss=738.047, player_2/loss=523.043, rew=37.50]                                                                                                                                                                                     


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #47: 1025it [00:02, 393.93it/s, env_step=48128, len=22, n/ep=2, n/st=64, player_2/loss=659.088, rew=-25.00]                                                                                                                                                                                                           


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #48: 1025it [00:02, 394.88it/s, env_step=49152, len=29, n/ep=2, n/st=64, player_1/loss=801.097, player_2/loss=543.517, rew=0.00]                                                                                                                                                                                      


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #49: 1025it [00:02, 393.89it/s, env_step=50176, len=23, n/ep=3, n/st=64, player_1/loss=791.234, player_2/loss=425.010, rew=-8.33]                                                                                                                                                                                     


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 100.000000 ± 0.000000 in #20


Epoch #1: 1025it [00:02, 391.34it/s, env_step=1024, len=29, n/ep=2, n/st=64, player_1/loss=858.236, player_2/loss=513.161, rew=0.00]                                                                                                                                                                                        


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 395.89it/s, env_step=2048, len=31, n/ep=2, n/st=64, player_1/loss=819.974, player_2/loss=458.931, rew=-25.00]                                                                                                                                                                                      


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 392.44it/s, env_step=3072, len=32, n/ep=2, n/st=64, player_1/loss=750.302, player_2/loss=622.740, rew=-25.00]                                                                                                                                                                                      


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 395.14it/s, env_step=4096, len=25, n/ep=3, n/st=64, player_1/loss=809.522, player_2/loss=741.775, rew=25.00]                                                                                                                                                                                       


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 394.96it/s, env_step=5120, len=31, n/ep=2, n/st=64, player_1/loss=887.775, player_2/loss=609.359, rew=25.00]                                                                                                                                                                                       


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 394.48it/s, env_step=6144, len=42, n/ep=1, n/st=64, player_1/loss=759.684, player_2/loss=673.619, rew=-25.00]                                                                                                                                                                                      


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 393.12it/s, env_step=7168, len=30, n/ep=2, n/st=64, player_1/loss=869.235, player_2/loss=595.022, rew=25.00]                                                                                                                                                                                       


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 392.63it/s, env_step=8192, len=21, n/ep=2, n/st=64, player_1/loss=904.951, player_2/loss=542.882, rew=0.00]                                                                                                                                                                                        


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 393.32it/s, env_step=9216, len=35, n/ep=2, n/st=64, player_1/loss=957.200, player_2/loss=417.795, rew=-25.00]                                                                                                                                                                                      


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 393.90it/s, env_step=10240, len=28, n/ep=3, n/st=64, player_1/loss=923.855, player_2/loss=475.546, rew=8.33]                                                                                                                                                                                      


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 396.51it/s, env_step=11264, len=30, n/ep=2, n/st=64, player_1/loss=677.078, player_2/loss=461.281, rew=25.00]                                                                                                                                                                                     


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 394.56it/s, env_step=12288, len=36, n/ep=2, n/st=64, player_1/loss=733.642, player_2/loss=399.704, rew=0.00]                                                                                                                                                                                      


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 395.82it/s, env_step=13312, len=34, n/ep=2, n/st=64, player_1/loss=776.964, player_2/loss=369.411, rew=0.00]                                                                                                                                                                                      


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 395.56it/s, env_step=14336, len=38, n/ep=1, n/st=64, player_1/loss=765.948, player_2/loss=507.226, rew=-25.00]                                                                                                                                                                                    


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 391.67it/s, env_step=15360, len=28, n/ep=1, n/st=64, player_1/loss=800.453, player_2/loss=633.468, rew=-25.00]                                                                                                                                                                                    


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 387.24it/s, env_step=16384, len=19, n/ep=3, n/st=64, player_1/loss=753.623, player_2/loss=488.351, rew=25.00]                                                                                                                                                                                     


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 394.98it/s, env_step=17408, len=36, n/ep=2, n/st=64, player_1/loss=687.180, player_2/loss=456.088, rew=-25.00]                                                                                                                                                                                    


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 392.75it/s, env_step=18432, len=30, n/ep=2, n/st=64, player_1/loss=728.421, player_2/loss=483.481, rew=0.00]                                                                                                                                                                                      


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:03, 337.21it/s, env_step=19456, len=32, n/ep=2, n/st=64, player_1/loss=654.375, player_2/loss=403.128, rew=-25.00]                                                                                                                                                                                    


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 402.58it/s, env_step=20480, len=32, n/ep=1, n/st=64, player_1/loss=665.589, player_2/loss=467.330, rew=-25.00]                                                                                                                                                                                    


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 395.29it/s, env_step=21504, len=30, n/ep=2, n/st=64, player_1/loss=774.694, player_2/loss=593.564, rew=25.00]                                                                                                                                                                                     


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 394.46it/s, env_step=22528, len=32, n/ep=2, n/st=64, player_1/loss=816.761, player_2/loss=518.963, rew=0.00]                                                                                                                                                                                      


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 394.74it/s, env_step=23552, len=29, n/ep=3, n/st=64, player_1/loss=773.578, player_2/loss=424.120, rew=-8.33]                                                                                                                                                                                     


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 392.77it/s, env_step=24576, len=30, n/ep=2, n/st=64, player_1/loss=581.195, player_2/loss=406.349, rew=0.00]                                                                                                                                                                                      


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 393.76it/s, env_step=25600, len=31, n/ep=3, n/st=64, player_1/loss=487.624, player_2/loss=415.275, rew=-8.33]                                                                                                                                                                                     


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 393.81it/s, env_step=26624, len=31, n/ep=2, n/st=64, player_1/loss=825.870, player_2/loss=562.308, rew=-25.00]                                                                                                                                                                                    


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 393.62it/s, env_step=27648, len=40, n/ep=2, n/st=64, player_1/loss=882.576, player_2/loss=465.756, rew=37.50]                                                                                                                                                                                     


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 396.22it/s, env_step=28672, len=34, n/ep=2, n/st=64, player_1/loss=573.015, player_2/loss=381.372, rew=0.00]                                                                                                                                                                                      


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 354.85it/s, env_step=29696, len=28, n/ep=2, n/st=64, player_1/loss=739.518, player_2/loss=372.572, rew=0.00]                                                                                                                                                                                      


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 415.10it/s, env_step=30720, len=31, n/ep=2, n/st=64, player_1/loss=924.058, rew=0.00]                                                                                                                                                                                                             


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 417.49it/s, env_step=31744, len=30, n/ep=2, n/st=64, player_1/loss=806.345, player_2/loss=513.193, rew=25.00]                                                                                                                                                                                     


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 358.30it/s, env_step=32768, len=28, n/ep=3, n/st=64, player_1/loss=780.851, player_2/loss=525.846, rew=-8.33]                                                                                                                                                                                     


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 399.40it/s, env_step=33792, len=29, n/ep=2, n/st=64, player_1/loss=750.396, player_2/loss=493.580, rew=0.00]                                                                                                                                                                                      


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 393.43it/s, env_step=34816, len=22, n/ep=3, n/st=64, player_1/loss=747.185, player_2/loss=514.229, rew=-25.00]                                                                                                                                                                                    


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 453.79it/s, env_step=35840, len=30, n/ep=3, n/st=64, player_1/loss=609.863, player_2/loss=508.182, rew=8.33]                                                                                                                                                                                      


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 425.36it/s, env_step=36864, len=28, n/ep=3, n/st=64, player_1/loss=536.741, player_2/loss=498.057, rew=8.33]                                                                                                                                                                                      


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 437.70it/s, env_step=37888, len=32, n/ep=2, n/st=64, player_1/loss=532.661, player_2/loss=433.442, rew=0.00]                                                                                                                                                                                      


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 444.18it/s, env_step=38912, len=27, n/ep=1, n/st=64, player_1/loss=587.650, player_2/loss=521.665, rew=25.00]                                                                                                                                                                                     


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 431.62it/s, env_step=39936, len=38, n/ep=2, n/st=64, player_1/loss=543.899, player_2/loss=545.521, rew=-25.00]                                                                                                                                                                                    


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 402.68it/s, env_step=40960, len=25, n/ep=3, n/st=64, player_1/loss=528.130, player_2/loss=440.436, rew=8.33]                                                                                                                                                                                      


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 468.26it/s, env_step=41984, len=31, n/ep=2, n/st=64, player_1/loss=655.070, player_2/loss=373.520, rew=0.00]                                                                                                                                                                                      


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 472.06it/s, env_step=43008, len=25, n/ep=2, n/st=64, player_1/loss=686.615, player_2/loss=378.950, rew=25.00]                                                                                                                                                                                     


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 488.61it/s, env_step=44032, len=30, n/ep=2, n/st=64, player_1/loss=638.771, player_2/loss=355.834, rew=0.00]                                                                                                                                                                                      


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 499.24it/s, env_step=45056, len=30, n/ep=2, n/st=64, player_1/loss=517.597, player_2/loss=365.535, rew=-25.00]                                                                                                                                                                                    


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 490.88it/s, env_step=46080, len=35, n/ep=2, n/st=64, player_1/loss=449.626, player_2/loss=424.591, rew=25.00]                                                                                                                                                                                     


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 496.66it/s, env_step=47104, len=29, n/ep=2, n/st=64, player_1/loss=507.074, player_2/loss=389.154, rew=-25.00]                                                                                                                                                                                    


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 494.32it/s, env_step=48128, len=32, n/ep=2, n/st=64, player_1/loss=466.344, player_2/loss=376.303, rew=0.00]                                                                                                                                                                                      


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 495.35it/s, env_step=49152, len=27, n/ep=2, n/st=64, player_1/loss=539.979, player_2/loss=451.654, rew=-25.00]                                                                                                                                                                                    


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 496.24it/s, env_step=50176, len=28, n/ep=3, n/st=64, player_1/loss=606.895, player_2/loss=452.407, rew=8.33]                                                                                                                                                                                      


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 494.60it/s, env_step=1024, len=26, n/ep=2, n/st=64, player_1/loss=543.306, player_2/loss=333.586, rew=0.00]                                                                                                                                                                                        


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 496.68it/s, env_step=2048, len=27, n/ep=3, n/st=64, player_1/loss=556.532, player_2/loss=367.793, rew=8.33]                                                                                                                                                                                        


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 494.59it/s, env_step=3072, len=26, n/ep=3, n/st=64, player_1/loss=610.555, player_2/loss=417.685, rew=8.33]                                                                                                                                                                                        


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 495.46it/s, env_step=4096, len=24, n/ep=3, n/st=64, player_1/loss=594.494, player_2/loss=479.597, rew=25.00]                                                                                                                                                                                       


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 493.86it/s, env_step=5120, len=26, n/ep=3, n/st=64, player_1/loss=693.538, player_2/loss=441.936, rew=25.00]                                                                                                                                                                                       


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 493.14it/s, env_step=6144, len=29, n/ep=2, n/st=64, player_1/loss=754.095, player_2/loss=351.712, rew=-25.00]                                                                                                                                                                                      


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 492.22it/s, env_step=7168, len=25, n/ep=3, n/st=64, player_1/loss=646.622, player_2/loss=358.990, rew=-25.00]                                                                                                                                                                                      


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 495.97it/s, env_step=8192, len=29, n/ep=2, n/st=64, player_1/loss=733.945, player_2/loss=320.149, rew=25.00]                                                                                                                                                                                       


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 495.71it/s, env_step=9216, len=31, n/ep=2, n/st=64, player_1/loss=730.346, player_2/loss=342.687, rew=0.00]                                                                                                                                                                                        


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 495.63it/s, env_step=10240, len=26, n/ep=3, n/st=64, player_1/loss=680.222, player_2/loss=350.404, rew=25.00]                                                                                                                                                                                     


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 478.03it/s, env_step=11264, len=26, n/ep=2, n/st=64, player_1/loss=686.844, player_2/loss=319.369, rew=-25.00]                                                                                                                                                                                    


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 455.98it/s, env_step=12288, len=27, n/ep=2, n/st=64, player_1/loss=716.007, player_2/loss=347.581, rew=-25.00]                                                                                                                                                                                    


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 434.94it/s, env_step=13312, len=32, n/ep=2, n/st=64, player_1/loss=679.906, player_2/loss=450.465, rew=25.00]                                                                                                                                                                                     


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 408.13it/s, env_step=14336, len=22, n/ep=2, n/st=64, player_1/loss=592.616, player_2/loss=446.715, rew=0.00]                                                                                                                                                                                      


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 397.06it/s, env_step=15360, len=33, n/ep=2, n/st=64, player_1/loss=617.116, player_2/loss=329.187, rew=0.00]                                                                                                                                                                                      


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 395.34it/s, env_step=16384, len=27, n/ep=2, n/st=64, player_1/loss=790.960, player_2/loss=304.990, rew=25.00]                                                                                                                                                                                     


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 395.03it/s, env_step=17408, len=31, n/ep=2, n/st=64, player_1/loss=800.473, player_2/loss=306.084, rew=25.00]                                                                                                                                                                                     


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 394.91it/s, env_step=18432, len=24, n/ep=3, n/st=64, player_1/loss=823.747, player_2/loss=429.481, rew=8.33]                                                                                                                                                                                      


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 395.28it/s, env_step=19456, len=33, n/ep=2, n/st=64, player_1/loss=712.196, player_2/loss=416.396, rew=0.00]                                                                                                                                                                                      


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 395.64it/s, env_step=20480, len=27, n/ep=2, n/st=64, player_1/loss=570.703, rew=25.00]                                                                                                                                                                                                            


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 391.31it/s, env_step=21504, len=26, n/ep=2, n/st=64, player_1/loss=518.670, player_2/loss=411.812, rew=25.00]                                                                                                                                                                                     


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 394.83it/s, env_step=22528, len=25, n/ep=2, n/st=64, player_1/loss=585.847, player_2/loss=379.791, rew=0.00]                                                                                                                                                                                      


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 394.57it/s, env_step=23552, len=27, n/ep=2, n/st=64, player_1/loss=676.085, player_2/loss=379.093, rew=25.00]                                                                                                                                                                                     


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 394.64it/s, env_step=24576, len=22, n/ep=3, n/st=64, player_1/loss=683.456, player_2/loss=363.681, rew=-25.00]                                                                                                                                                                                    


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 395.23it/s, env_step=25600, len=27, n/ep=2, n/st=64, player_1/loss=719.519, player_2/loss=279.927, rew=0.00]                                                                                                                                                                                      


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 394.00it/s, env_step=26624, len=22, n/ep=3, n/st=64, player_1/loss=665.840, player_2/loss=256.411, rew=-8.33]                                                                                                                                                                                     


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 393.64it/s, env_step=27648, len=26, n/ep=3, n/st=64, player_1/loss=586.049, player_2/loss=441.782, rew=8.33]                                                                                                                                                                                      


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 392.98it/s, env_step=28672, len=27, n/ep=2, n/st=64, player_1/loss=604.931, player_2/loss=429.783, rew=0.00]                                                                                                                                                                                      


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 394.72it/s, env_step=29696, len=25, n/ep=2, n/st=64, player_1/loss=751.227, player_2/loss=294.235, rew=0.00]                                                                                                                                                                                      


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 394.62it/s, env_step=30720, len=25, n/ep=2, n/st=64, player_1/loss=624.745, player_2/loss=350.880, rew=-25.00]                                                                                                                                                                                    


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 390.62it/s, env_step=31744, len=26, n/ep=3, n/st=64, player_1/loss=798.343, player_2/loss=395.187, rew=25.00]                                                                                                                                                                                     


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 392.71it/s, env_step=32768, len=29, n/ep=3, n/st=64, player_1/loss=794.694, player_2/loss=344.565, rew=-8.33]                                                                                                                                                                                     


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 385.10it/s, env_step=33792, len=25, n/ep=3, n/st=64, player_1/loss=759.831, player_2/loss=298.964, rew=8.33]                                                                                                                                                                                      


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 389.36it/s, env_step=34816, len=26, n/ep=3, n/st=64, player_1/loss=693.252, player_2/loss=358.561, rew=-8.33]                                                                                                                                                                                     


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 393.73it/s, env_step=35840, len=31, n/ep=2, n/st=64, player_1/loss=813.827, player_2/loss=362.859, rew=0.00]                                                                                                                                                                                      


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 393.20it/s, env_step=36864, len=27, n/ep=3, n/st=64, player_1/loss=815.309, player_2/loss=362.895, rew=8.33]                                                                                                                                                                                      


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 394.70it/s, env_step=37888, len=25, n/ep=2, n/st=64, player_1/loss=634.753, player_2/loss=393.894, rew=25.00]                                                                                                                                                                                     


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 392.84it/s, env_step=38912, len=28, n/ep=2, n/st=64, player_1/loss=580.032, player_2/loss=388.510, rew=0.00]                                                                                                                                                                                      


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 391.51it/s, env_step=39936, len=26, n/ep=2, n/st=64, player_1/loss=669.155, player_2/loss=386.978, rew=0.00]                                                                                                                                                                                      


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 395.88it/s, env_step=40960, len=29, n/ep=3, n/st=64, player_1/loss=718.495, player_2/loss=396.167, rew=8.33]                                                                                                                                                                                      


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 394.05it/s, env_step=41984, len=25, n/ep=3, n/st=64, player_1/loss=876.242, player_2/loss=335.632, rew=-8.33]                                                                                                                                                                                     


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 394.23it/s, env_step=43008, len=25, n/ep=2, n/st=64, player_1/loss=833.507, player_2/loss=372.214, rew=-25.00]                                                                                                                                                                                    


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 396.16it/s, env_step=44032, len=32, n/ep=2, n/st=64, player_1/loss=744.788, player_2/loss=416.083, rew=0.00]                                                                                                                                                                                      


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 389.48it/s, env_step=45056, len=24, n/ep=3, n/st=64, player_1/loss=956.490, player_2/loss=354.287, rew=-8.33]                                                                                                                                                                                     


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 394.99it/s, env_step=46080, len=27, n/ep=2, n/st=64, player_1/loss=828.489, player_2/loss=318.564, rew=25.00]                                                                                                                                                                                     


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 391.17it/s, env_step=47104, len=28, n/ep=2, n/st=64, player_1/loss=760.677, player_2/loss=291.265, rew=25.00]                                                                                                                                                                                     


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 395.14it/s, env_step=48128, len=26, n/ep=2, n/st=64, player_1/loss=689.576, player_2/loss=264.476, rew=25.00]                                                                                                                                                                                     


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 393.15it/s, env_step=49152, len=29, n/ep=3, n/st=64, player_1/loss=614.258, rew=-25.00]                                                                                                                                                                                                           


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 391.30it/s, env_step=50176, len=27, n/ep=2, n/st=64, player_1/loss=648.737, player_2/loss=209.501, rew=0.00]                                                                                                                                                                                      


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 392.34it/s, env_step=1024, len=25, n/ep=3, n/st=64, player_1/loss=732.021, player_2/loss=232.382, rew=-8.33]                                                                                                                                                                                       


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 393.71it/s, env_step=2048, len=27, n/ep=3, n/st=64, player_1/loss=769.625, player_2/loss=263.499, rew=-8.33]                                                                                                                                                                                       


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 394.78it/s, env_step=3072, len=27, n/ep=3, n/st=64, player_1/loss=667.785, player_2/loss=360.097, rew=-8.33]                                                                                                                                                                                       


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 394.77it/s, env_step=4096, len=37, n/ep=2, n/st=64, player_1/loss=522.342, player_2/loss=430.698, rew=0.00]                                                                                                                                                                                        


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #5: 1025it [00:02, 395.54it/s, env_step=5120, len=32, n/ep=2, n/st=64, player_1/loss=536.402, player_2/loss=336.878, rew=25.00]                                                                                                                                                                                       


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #6: 1025it [00:02, 391.34it/s, env_step=6144, len=29, n/ep=2, n/st=64, player_1/loss=491.623, player_2/loss=342.801, rew=25.00]                                                                                                                                                                                       


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #7: 1025it [00:02, 395.57it/s, env_step=7168, len=27, n/ep=3, n/st=64, player_1/loss=564.587, player_2/loss=301.531, rew=8.33]                                                                                                                                                                                        


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #8: 1025it [00:02, 394.09it/s, env_step=8192, len=32, n/ep=2, n/st=64, player_1/loss=607.422, player_2/loss=428.803, rew=25.00]                                                                                                                                                                                       


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #9: 1025it [00:02, 392.01it/s, env_step=9216, len=26, n/ep=3, n/st=64, player_1/loss=682.224, player_2/loss=471.058, rew=8.33]                                                                                                                                                                                        


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #10: 1025it [00:02, 393.62it/s, env_step=10240, len=25, n/ep=2, n/st=64, player_1/loss=591.247, player_2/loss=467.969, rew=-25.00]                                                                                                                                                                                    


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #11: 1025it [00:02, 396.26it/s, env_step=11264, len=28, n/ep=2, n/st=64, player_2/loss=329.114, rew=0.00]                                                                                                                                                                                                             


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #12: 1025it [00:02, 393.14it/s, env_step=12288, len=26, n/ep=2, n/st=64, player_1/loss=718.755, player_2/loss=280.710, rew=-25.00]                                                                                                                                                                                    


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #13: 1025it [00:02, 394.56it/s, env_step=13312, len=22, n/ep=3, n/st=64, player_1/loss=777.543, player_2/loss=310.541, rew=8.33]                                                                                                                                                                                      


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #14: 1025it [00:02, 395.52it/s, env_step=14336, len=25, n/ep=3, n/st=64, player_1/loss=795.822, player_2/loss=360.801, rew=-8.33]                                                                                                                                                                                     


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #15: 1025it [00:02, 394.83it/s, env_step=15360, len=23, n/ep=3, n/st=64, player_1/loss=700.373, player_2/loss=412.337, rew=25.00]                                                                                                                                                                                     


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #16: 1025it [00:02, 397.33it/s, env_step=16384, len=33, n/ep=2, n/st=64, player_1/loss=639.253, player_2/loss=327.211, rew=0.00]                                                                                                                                                                                      


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #17: 1025it [00:02, 394.67it/s, env_step=17408, len=18, n/ep=4, n/st=64, player_1/loss=581.250, player_2/loss=259.085, rew=12.50]                                                                                                                                                                                     


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #18: 1025it [00:02, 394.62it/s, env_step=18432, len=28, n/ep=2, n/st=64, player_1/loss=558.255, player_2/loss=205.642, rew=-25.00]                                                                                                                                                                                    


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #19: 1025it [00:02, 394.49it/s, env_step=19456, len=23, n/ep=3, n/st=64, player_1/loss=540.586, player_2/loss=266.061, rew=-8.33]                                                                                                                                                                                     


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #20: 1025it [00:02, 393.50it/s, env_step=20480, len=31, n/ep=2, n/st=64, player_1/loss=650.360, player_2/loss=347.678, rew=0.00]                                                                                                                                                                                      


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #21: 1025it [00:02, 394.72it/s, env_step=21504, len=26, n/ep=3, n/st=64, player_1/loss=578.202, player_2/loss=351.111, rew=-8.33]                                                                                                                                                                                     


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #22: 1025it [00:02, 392.05it/s, env_step=22528, len=33, n/ep=2, n/st=64, player_1/loss=532.655, player_2/loss=305.786, rew=25.00]                                                                                                                                                                                     


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #23: 1025it [00:02, 396.05it/s, env_step=23552, len=31, n/ep=2, n/st=64, player_1/loss=521.440, player_2/loss=369.967, rew=0.00]                                                                                                                                                                                      


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #24: 1025it [00:02, 392.62it/s, env_step=24576, len=26, n/ep=2, n/st=64, player_1/loss=571.483, player_2/loss=360.636, rew=-25.00]                                                                                                                                                                                    


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #25: 1025it [00:02, 396.88it/s, env_step=25600, len=27, n/ep=3, n/st=64, player_1/loss=580.052, player_2/loss=303.632, rew=-8.33]                                                                                                                                                                                     


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #26: 1025it [00:02, 394.77it/s, env_step=26624, len=26, n/ep=3, n/st=64, player_1/loss=582.729, player_2/loss=307.849, rew=-25.00]                                                                                                                                                                                    


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #27: 1025it [00:02, 393.92it/s, env_step=27648, len=28, n/ep=2, n/st=64, player_1/loss=551.119, player_2/loss=365.693, rew=0.00]                                                                                                                                                                                      


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #28: 1025it [00:02, 392.84it/s, env_step=28672, len=27, n/ep=2, n/st=64, player_1/loss=556.329, player_2/loss=332.683, rew=-25.00]                                                                                                                                                                                    


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #29: 1025it [00:02, 386.42it/s, env_step=29696, len=24, n/ep=2, n/st=64, player_1/loss=493.889, player_2/loss=364.834, rew=0.00]                                                                                                                                                                                      


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #30: 1025it [00:02, 389.44it/s, env_step=30720, len=29, n/ep=3, n/st=64, player_1/loss=481.201, player_2/loss=411.152, rew=-8.33]                                                                                                                                                                                     


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #31: 1025it [00:02, 393.09it/s, env_step=31744, len=25, n/ep=3, n/st=64, player_1/loss=504.738, player_2/loss=389.726, rew=25.00]                                                                                                                                                                                     


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #32: 1025it [00:02, 391.18it/s, env_step=32768, len=27, n/ep=3, n/st=64, player_1/loss=486.237, player_2/loss=355.046, rew=-8.33]                                                                                                                                                                                     


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #33: 1025it [00:02, 393.98it/s, env_step=33792, len=25, n/ep=2, n/st=64, player_1/loss=509.644, player_2/loss=319.152, rew=0.00]                                                                                                                                                                                      


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #34: 1025it [00:02, 391.11it/s, env_step=34816, len=25, n/ep=3, n/st=64, player_1/loss=624.955, player_2/loss=243.704, rew=8.33]                                                                                                                                                                                      


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #35: 1025it [00:02, 395.18it/s, env_step=35840, len=26, n/ep=2, n/st=64, player_1/loss=675.920, player_2/loss=276.206, rew=0.00]                                                                                                                                                                                      


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #36: 1025it [00:02, 392.75it/s, env_step=36864, len=30, n/ep=2, n/st=64, player_1/loss=522.280, player_2/loss=342.888, rew=-25.00]                                                                                                                                                                                    


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #37: 1025it [00:02, 393.87it/s, env_step=37888, len=27, n/ep=3, n/st=64, player_1/loss=430.761, player_2/loss=406.883, rew=-8.33]                                                                                                                                                                                     


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #38: 1025it [00:02, 395.90it/s, env_step=38912, len=26, n/ep=3, n/st=64, player_1/loss=456.031, player_2/loss=335.247, rew=-25.00]                                                                                                                                                                                    


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #39: 1025it [00:02, 392.01it/s, env_step=39936, len=30, n/ep=3, n/st=64, player_1/loss=507.908, player_2/loss=283.619, rew=-25.00]                                                                                                                                                                                    


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #40: 1025it [00:02, 392.83it/s, env_step=40960, len=26, n/ep=3, n/st=64, player_1/loss=596.347, player_2/loss=305.879, rew=-25.00]                                                                                                                                                                                    


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #41: 1025it [00:02, 395.38it/s, env_step=41984, len=26, n/ep=2, n/st=64, player_1/loss=551.598, player_2/loss=355.108, rew=0.00]                                                                                                                                                                                      


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #42: 1025it [00:02, 393.26it/s, env_step=43008, len=30, n/ep=3, n/st=64, player_1/loss=672.080, player_2/loss=318.213, rew=8.33]                                                                                                                                                                                      


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #43: 1025it [00:02, 395.91it/s, env_step=44032, len=30, n/ep=2, n/st=64, player_1/loss=713.488, player_2/loss=259.866, rew=-25.00]                                                                                                                                                                                    


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #44: 1025it [00:02, 396.26it/s, env_step=45056, len=29, n/ep=2, n/st=64, player_1/loss=687.309, player_2/loss=312.004, rew=0.00]                                                                                                                                                                                      


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #45: 1025it [00:02, 395.40it/s, env_step=46080, len=26, n/ep=2, n/st=64, player_1/loss=637.301, player_2/loss=341.141, rew=0.00]                                                                                                                                                                                      


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #46: 1025it [00:02, 396.26it/s, env_step=47104, len=27, n/ep=2, n/st=64, player_1/loss=671.270, player_2/loss=278.984, rew=-25.00]                                                                                                                                                                                    


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #47: 1025it [00:02, 393.57it/s, env_step=48128, len=28, n/ep=2, n/st=64, player_1/loss=675.231, player_2/loss=287.961, rew=-25.00]                                                                                                                                                                                    


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #48: 1025it [00:02, 395.20it/s, env_step=49152, len=28, n/ep=2, n/st=64, player_1/loss=615.222, player_2/loss=338.431, rew=0.00]                                                                                                                                                                                      


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #49: 1025it [00:02, 393.47it/s, env_step=50176, len=30, n/ep=2, n/st=64, player_1/loss=744.094, rew=0.00]                                                                                                                                                                                                             


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #1: 1025it [00:02, 393.07it/s, env_step=1024, len=27, n/ep=2, n/st=64, player_1/loss=512.314, player_2/loss=272.952, rew=25.00]                                                                                                                                                                                       


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 395.96it/s, env_step=2048, len=26, n/ep=2, n/st=64, player_1/loss=514.672, player_2/loss=291.159, rew=25.00]                                                                                                                                                                                       


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 394.90it/s, env_step=3072, len=29, n/ep=2, n/st=64, player_1/loss=555.674, player_2/loss=285.166, rew=25.00]                                                                                                                                                                                       


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 396.78it/s, env_step=4096, len=25, n/ep=2, n/st=64, player_1/loss=635.540, player_2/loss=247.702, rew=0.00]                                                                                                                                                                                        


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 386.75it/s, env_step=5120, len=25, n/ep=2, n/st=64, player_1/loss=585.558, player_2/loss=235.789, rew=0.00]                                                                                                                                                                                        


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 392.02it/s, env_step=6144, len=28, n/ep=3, n/st=64, player_1/loss=495.328, player_2/loss=251.330, rew=25.00]                                                                                                                                                                                       


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 395.16it/s, env_step=7168, len=25, n/ep=2, n/st=64, player_1/loss=451.252, player_2/loss=281.099, rew=25.00]                                                                                                                                                                                       


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 396.98it/s, env_step=8192, len=25, n/ep=3, n/st=64, player_1/loss=552.247, player_2/loss=295.594, rew=-25.00]                                                                                                                                                                                      


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 396.41it/s, env_step=9216, len=26, n/ep=3, n/st=64, player_1/loss=568.181, player_2/loss=377.916, rew=8.33]                                                                                                                                                                                        


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 395.42it/s, env_step=10240, len=28, n/ep=2, n/st=64, player_1/loss=559.599, player_2/loss=345.895, rew=0.00]                                                                                                                                                                                      


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 393.60it/s, env_step=11264, len=30, n/ep=2, n/st=64, player_1/loss=564.765, player_2/loss=254.056, rew=25.00]                                                                                                                                                                                     


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 394.42it/s, env_step=12288, len=26, n/ep=1, n/st=64, player_1/loss=547.949, player_2/loss=267.291, rew=25.00]                                                                                                                                                                                     


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 394.01it/s, env_step=13312, len=23, n/ep=3, n/st=64, player_1/loss=608.991, player_2/loss=269.004, rew=-25.00]                                                                                                                                                                                    


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 394.70it/s, env_step=14336, len=27, n/ep=2, n/st=64, player_1/loss=704.315, player_2/loss=290.455, rew=25.00]                                                                                                                                                                                     


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 394.23it/s, env_step=15360, len=31, n/ep=2, n/st=64, player_1/loss=705.095, player_2/loss=264.793, rew=0.00]                                                                                                                                                                                      


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 396.73it/s, env_step=16384, len=30, n/ep=2, n/st=64, player_1/loss=687.317, player_2/loss=253.906, rew=0.00]                                                                                                                                                                                      


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 396.36it/s, env_step=17408, len=26, n/ep=2, n/st=64, player_1/loss=719.331, player_2/loss=300.186, rew=25.00]                                                                                                                                                                                     


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 398.22it/s, env_step=18432, len=25, n/ep=3, n/st=64, player_1/loss=610.944, player_2/loss=312.872, rew=8.33]                                                                                                                                                                                      


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 393.18it/s, env_step=19456, len=25, n/ep=2, n/st=64, player_1/loss=594.860, player_2/loss=309.814, rew=25.00]                                                                                                                                                                                     


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 392.65it/s, env_step=20480, len=26, n/ep=2, n/st=64, player_1/loss=607.497, player_2/loss=331.913, rew=25.00]                                                                                                                                                                                     


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 397.13it/s, env_step=21504, len=29, n/ep=2, n/st=64, player_1/loss=572.131, player_2/loss=323.740, rew=-25.00]                                                                                                                                                                                    


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 394.20it/s, env_step=22528, len=23, n/ep=3, n/st=64, player_1/loss=530.073, player_2/loss=329.350, rew=8.33]                                                                                                                                                                                      


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 396.37it/s, env_step=23552, len=23, n/ep=3, n/st=64, player_1/loss=481.189, player_2/loss=336.742, rew=8.33]                                                                                                                                                                                      


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 394.40it/s, env_step=24576, len=26, n/ep=3, n/st=64, player_1/loss=509.162, player_2/loss=290.769, rew=8.33]                                                                                                                                                                                      


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 390.09it/s, env_step=25600, len=25, n/ep=3, n/st=64, player_1/loss=522.383, player_2/loss=253.445, rew=-25.00]                                                                                                                                                                                    


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 389.65it/s, env_step=26624, len=25, n/ep=3, n/st=64, player_1/loss=597.585, player_2/loss=266.676, rew=25.00]                                                                                                                                                                                     


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 396.70it/s, env_step=27648, len=26, n/ep=3, n/st=64, player_1/loss=607.763, player_2/loss=316.515, rew=25.00]                                                                                                                                                                                     


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 394.61it/s, env_step=28672, len=27, n/ep=2, n/st=64, player_1/loss=607.595, player_2/loss=268.693, rew=-25.00]                                                                                                                                                                                    


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 394.10it/s, env_step=29696, len=26, n/ep=3, n/st=64, player_1/loss=631.342, player_2/loss=190.228, rew=25.00]                                                                                                                                                                                     


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 395.29it/s, env_step=30720, len=28, n/ep=2, n/st=64, player_1/loss=582.823, player_2/loss=260.370, rew=0.00]                                                                                                                                                                                      


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 396.03it/s, env_step=31744, len=26, n/ep=3, n/st=64, player_1/loss=586.814, player_2/loss=260.447, rew=25.00]                                                                                                                                                                                     


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 394.99it/s, env_step=32768, len=27, n/ep=2, n/st=64, player_1/loss=625.739, player_2/loss=217.570, rew=25.00]                                                                                                                                                                                     


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 397.09it/s, env_step=33792, len=27, n/ep=2, n/st=64, player_1/loss=655.836, player_2/loss=256.841, rew=25.00]                                                                                                                                                                                     


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 393.58it/s, env_step=34816, len=30, n/ep=2, n/st=64, player_1/loss=579.205, player_2/loss=307.634, rew=0.00]                                                                                                                                                                                      


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 395.20it/s, env_step=35840, len=28, n/ep=2, n/st=64, player_1/loss=472.296, player_2/loss=242.590, rew=0.00]                                                                                                                                                                                      


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 393.89it/s, env_step=36864, len=26, n/ep=3, n/st=64, player_1/loss=514.148, player_2/loss=211.303, rew=-8.33]                                                                                                                                                                                     


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 395.76it/s, env_step=37888, len=30, n/ep=2, n/st=64, player_1/loss=499.608, player_2/loss=221.715, rew=25.00]                                                                                                                                                                                     


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 394.81it/s, env_step=38912, len=26, n/ep=2, n/st=64, player_1/loss=557.273, player_2/loss=288.547, rew=25.00]                                                                                                                                                                                     


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 395.62it/s, env_step=39936, len=24, n/ep=3, n/st=64, player_1/loss=578.285, player_2/loss=351.794, rew=8.33]                                                                                                                                                                                      


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 395.92it/s, env_step=40960, len=28, n/ep=2, n/st=64, player_1/loss=570.451, player_2/loss=303.434, rew=25.00]                                                                                                                                                                                     


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 394.93it/s, env_step=41984, len=27, n/ep=3, n/st=64, player_1/loss=528.998, player_2/loss=204.485, rew=8.33]                                                                                                                                                                                      


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 395.15it/s, env_step=43008, len=24, n/ep=2, n/st=64, player_1/loss=453.204, player_2/loss=208.858, rew=25.00]                                                                                                                                                                                     


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 393.61it/s, env_step=44032, len=26, n/ep=3, n/st=64, player_1/loss=483.571, player_2/loss=212.457, rew=25.00]                                                                                                                                                                                     


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 396.06it/s, env_step=45056, len=26, n/ep=2, n/st=64, player_1/loss=498.924, player_2/loss=215.024, rew=25.00]                                                                                                                                                                                     


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 392.83it/s, env_step=46080, len=27, n/ep=2, n/st=64, player_1/loss=546.343, player_2/loss=274.927, rew=25.00]                                                                                                                                                                                     


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 395.36it/s, env_step=47104, len=32, n/ep=2, n/st=64, player_1/loss=576.985, player_2/loss=297.387, rew=25.00]                                                                                                                                                                                     


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 396.20it/s, env_step=48128, len=26, n/ep=3, n/st=64, player_1/loss=590.302, player_2/loss=288.828, rew=8.33]                                                                                                                                                                                      


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 396.87it/s, env_step=49152, len=23, n/ep=3, n/st=64, player_1/loss=409.565, player_2/loss=259.059, rew=8.33]                                                                                                                                                                                      


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 394.80it/s, env_step=50176, len=31, n/ep=2, n/st=64, player_1/loss=409.462, player_2/loss=284.601, rew=25.00]                                                                                                                                                                                     


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 390.85it/s, env_step=1024, len=34, n/ep=2, n/st=64, player_1/loss=537.776, player_2/loss=193.852, rew=0.00]                                                                                                                                                                                        


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 394.30it/s, env_step=2048, len=27, n/ep=3, n/st=64, player_1/loss=588.630, player_2/loss=188.121, rew=-25.00]                                                                                                                                                                                      


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 395.64it/s, env_step=3072, len=24, n/ep=3, n/st=64, player_1/loss=542.128, player_2/loss=229.181, rew=-8.33]                                                                                                                                                                                       


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 396.19it/s, env_step=4096, len=27, n/ep=3, n/st=64, player_1/loss=600.522, player_2/loss=260.351, rew=-8.33]                                                                                                                                                                                       


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 396.77it/s, env_step=5120, len=27, n/ep=3, n/st=64, player_1/loss=626.740, player_2/loss=214.844, rew=-25.00]                                                                                                                                                                                      


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 395.19it/s, env_step=6144, len=26, n/ep=2, n/st=64, player_1/loss=659.380, player_2/loss=290.739, rew=-25.00]                                                                                                                                                                                      


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 391.51it/s, env_step=7168, len=28, n/ep=3, n/st=64, player_1/loss=507.365, player_2/loss=281.872, rew=-25.00]                                                                                                                                                                                      


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 394.07it/s, env_step=8192, len=24, n/ep=3, n/st=64, player_1/loss=446.457, player_2/loss=228.749, rew=-8.33]                                                                                                                                                                                       


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 395.52it/s, env_step=9216, len=25, n/ep=2, n/st=64, player_1/loss=565.685, player_2/loss=196.495, rew=-25.00]                                                                                                                                                                                      


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 395.00it/s, env_step=10240, len=26, n/ep=2, n/st=64, player_1/loss=612.300, rew=0.00]                                                                                                                                                                                                             


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 392.98it/s, env_step=11264, len=27, n/ep=3, n/st=64, player_1/loss=560.830, player_2/loss=249.151, rew=-8.33]                                                                                                                                                                                     


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 393.60it/s, env_step=12288, len=27, n/ep=2, n/st=64, player_1/loss=636.311, player_2/loss=255.518, rew=-25.00]                                                                                                                                                                                    


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 392.65it/s, env_step=13312, len=25, n/ep=3, n/st=64, player_1/loss=641.902, player_2/loss=262.234, rew=-25.00]                                                                                                                                                                                    


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 391.32it/s, env_step=14336, len=28, n/ep=2, n/st=64, player_1/loss=527.410, player_2/loss=229.021, rew=0.00]                                                                                                                                                                                      


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 396.04it/s, env_step=15360, len=26, n/ep=2, n/st=64, player_1/loss=646.180, player_2/loss=201.186, rew=0.00]                                                                                                                                                                                      


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 393.75it/s, env_step=16384, len=26, n/ep=2, n/st=64, player_1/loss=657.631, player_2/loss=176.557, rew=-25.00]                                                                                                                                                                                    


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #17: 1025it [00:02, 395.92it/s, env_step=17408, len=23, n/ep=3, n/st=64, player_1/loss=555.586, player_2/loss=180.984, rew=-8.33]                                                                                                                                                                                     


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #18: 1025it [00:02, 393.82it/s, env_step=18432, len=26, n/ep=3, n/st=64, player_1/loss=534.143, player_2/loss=188.344, rew=-25.00]                                                                                                                                                                                    


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #19: 1025it [00:02, 394.10it/s, env_step=19456, len=27, n/ep=2, n/st=64, player_1/loss=558.018, player_2/loss=201.269, rew=-25.00]                                                                                                                                                                                    


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #20: 1025it [00:02, 394.55it/s, env_step=20480, len=28, n/ep=2, n/st=64, player_1/loss=602.474, player_2/loss=212.077, rew=0.00]                                                                                                                                                                                      


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #21: 1025it [00:02, 384.92it/s, env_step=21504, len=33, n/ep=2, n/st=64, player_1/loss=571.625, player_2/loss=219.544, rew=25.00]                                                                                                                                                                                     


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #22: 1025it [00:02, 388.88it/s, env_step=22528, len=25, n/ep=3, n/st=64, player_1/loss=509.772, player_2/loss=228.135, rew=-25.00]                                                                                                                                                                                    


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #23: 1025it [00:02, 397.08it/s, env_step=23552, len=27, n/ep=2, n/st=64, player_1/loss=611.669, player_2/loss=202.454, rew=-25.00]                                                                                                                                                                                    


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #24: 1025it [00:02, 393.83it/s, env_step=24576, len=25, n/ep=3, n/st=64, player_1/loss=599.062, player_2/loss=192.838, rew=-8.33]                                                                                                                                                                                     


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #25: 1025it [00:02, 393.03it/s, env_step=25600, len=25, n/ep=3, n/st=64, player_1/loss=468.140, player_2/loss=246.813, rew=-8.33]                                                                                                                                                                                     


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #26: 1025it [00:02, 393.13it/s, env_step=26624, len=24, n/ep=3, n/st=64, player_1/loss=519.519, player_2/loss=229.742, rew=-8.33]                                                                                                                                                                                     


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #27: 1025it [00:02, 397.11it/s, env_step=27648, len=26, n/ep=2, n/st=64, player_1/loss=571.366, player_2/loss=200.471, rew=-25.00]                                                                                                                                                                                    


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #28: 1025it [00:02, 393.62it/s, env_step=28672, len=32, n/ep=2, n/st=64, player_1/loss=425.453, player_2/loss=330.322, rew=25.00]                                                                                                                                                                                     


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #29: 1025it [00:02, 394.15it/s, env_step=29696, len=28, n/ep=2, n/st=64, player_1/loss=416.607, player_2/loss=355.408, rew=-25.00]                                                                                                                                                                                    


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #30: 1025it [00:02, 395.08it/s, env_step=30720, len=25, n/ep=2, n/st=64, player_1/loss=509.107, player_2/loss=249.302, rew=0.00]                                                                                                                                                                                      


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #31: 1025it [00:02, 393.91it/s, env_step=31744, len=27, n/ep=2, n/st=64, player_1/loss=672.386, player_2/loss=195.177, rew=0.00]                                                                                                                                                                                      


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #32: 1025it [00:02, 391.99it/s, env_step=32768, len=33, n/ep=2, n/st=64, player_1/loss=551.433, rew=0.00]                                                                                                                                                                                                             


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #33: 1025it [00:02, 394.09it/s, env_step=33792, len=27, n/ep=2, n/st=64, player_1/loss=414.126, player_2/loss=208.567, rew=-25.00]                                                                                                                                                                                    


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #34: 1025it [00:02, 396.49it/s, env_step=34816, len=24, n/ep=2, n/st=64, player_1/loss=424.601, player_2/loss=191.710, rew=-25.00]                                                                                                                                                                                    


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #35: 1025it [00:02, 392.50it/s, env_step=35840, len=22, n/ep=3, n/st=64, player_1/loss=542.478, player_2/loss=215.734, rew=-8.33]                                                                                                                                                                                     


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #36: 1025it [00:02, 395.45it/s, env_step=36864, len=26, n/ep=2, n/st=64, player_1/loss=521.218, player_2/loss=215.086, rew=0.00]                                                                                                                                                                                      


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #37: 1025it [00:02, 395.97it/s, env_step=37888, len=28, n/ep=2, n/st=64, player_1/loss=400.382, player_2/loss=280.136, rew=25.00]                                                                                                                                                                                     


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #38: 1025it [00:02, 394.20it/s, env_step=38912, len=24, n/ep=3, n/st=64, player_1/loss=352.703, player_2/loss=253.483, rew=-25.00]                                                                                                                                                                                    


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #39: 1025it [00:02, 394.09it/s, env_step=39936, len=26, n/ep=3, n/st=64, player_1/loss=467.562, player_2/loss=225.557, rew=-25.00]                                                                                                                                                                                    


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #40: 1025it [00:02, 395.70it/s, env_step=40960, len=25, n/ep=2, n/st=64, player_1/loss=482.625, player_2/loss=196.983, rew=-25.00]                                                                                                                                                                                    


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #41: 1025it [00:02, 393.02it/s, env_step=41984, len=28, n/ep=2, n/st=64, player_1/loss=411.070, player_2/loss=209.101, rew=-25.00]                                                                                                                                                                                    


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #42: 1025it [00:02, 394.76it/s, env_step=43008, len=26, n/ep=3, n/st=64, player_1/loss=389.177, player_2/loss=243.148, rew=-8.33]                                                                                                                                                                                     


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #43: 1025it [00:02, 394.88it/s, env_step=44032, len=22, n/ep=3, n/st=64, player_1/loss=476.992, player_2/loss=257.956, rew=8.33]                                                                                                                                                                                      


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #44: 1025it [00:02, 395.33it/s, env_step=45056, len=25, n/ep=2, n/st=64, player_1/loss=477.753, player_2/loss=285.719, rew=-25.00]                                                                                                                                                                                    


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #45: 1025it [00:02, 395.90it/s, env_step=46080, len=26, n/ep=2, n/st=64, player_1/loss=445.932, player_2/loss=255.183, rew=-25.00]                                                                                                                                                                                    


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #46: 1025it [00:02, 399.19it/s, env_step=47104, len=29, n/ep=2, n/st=64, player_1/loss=509.642, player_2/loss=194.589, rew=25.00]                                                                                                                                                                                     


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #47: 1025it [00:02, 395.03it/s, env_step=48128, len=27, n/ep=2, n/st=64, player_1/loss=523.382, player_2/loss=193.236, rew=0.00]                                                                                                                                                                                      


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #48: 1025it [00:02, 394.74it/s, env_step=49152, len=28, n/ep=2, n/st=64, player_1/loss=513.238, player_2/loss=212.220, rew=0.00]                                                                                                                                                                                      


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #49: 1025it [00:02, 389.51it/s, env_step=50176, len=26, n/ep=2, n/st=64, player_1/loss=547.687, player_2/loss=222.135, rew=-25.00]                                                                                                                                                                                    


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #16


Epoch #1: 1025it [00:02, 390.50it/s, env_step=1024, len=28, n/ep=2, n/st=64, player_1/loss=273.000, player_2/loss=279.083, rew=-25.00]                                                                                                                                                                                      


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 396.06it/s, env_step=2048, len=27, n/ep=3, n/st=64, player_1/loss=457.601, player_2/loss=236.432, rew=25.00]                                                                                                                                                                                       


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 395.19it/s, env_step=3072, len=28, n/ep=3, n/st=64, player_1/loss=493.593, player_2/loss=271.819, rew=25.00]                                                                                                                                                                                       


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 395.31it/s, env_step=4096, len=25, n/ep=2, n/st=64, player_1/loss=415.065, player_2/loss=250.200, rew=0.00]                                                                                                                                                                                        


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 394.13it/s, env_step=5120, len=26, n/ep=2, n/st=64, player_1/loss=497.499, player_2/loss=177.377, rew=0.00]                                                                                                                                                                                        


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 395.31it/s, env_step=6144, len=27, n/ep=2, n/st=64, player_1/loss=418.767, player_2/loss=215.372, rew=25.00]                                                                                                                                                                                       


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 392.88it/s, env_step=7168, len=23, n/ep=3, n/st=64, player_1/loss=310.037, player_2/loss=215.704, rew=8.33]                                                                                                                                                                                        


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 396.65it/s, env_step=8192, len=29, n/ep=2, n/st=64, player_1/loss=293.938, player_2/loss=175.000, rew=0.00]                                                                                                                                                                                        


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 395.32it/s, env_step=9216, len=25, n/ep=2, n/st=64, player_1/loss=473.924, player_2/loss=186.482, rew=25.00]                                                                                                                                                                                       


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 394.90it/s, env_step=10240, len=20, n/ep=3, n/st=64, player_2/loss=270.413, rew=-8.33]                                                                                                                                                                                                            


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 396.15it/s, env_step=11264, len=27, n/ep=3, n/st=64, player_1/loss=415.130, player_2/loss=342.172, rew=8.33]                                                                                                                                                                                      


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 395.51it/s, env_step=12288, len=27, n/ep=2, n/st=64, player_1/loss=572.843, player_2/loss=248.149, rew=25.00]                                                                                                                                                                                     


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 393.40it/s, env_step=13312, len=26, n/ep=2, n/st=64, player_1/loss=583.601, player_2/loss=231.703, rew=25.00]                                                                                                                                                                                     


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 396.70it/s, env_step=14336, len=29, n/ep=2, n/st=64, player_1/loss=476.489, player_2/loss=273.996, rew=25.00]                                                                                                                                                                                     


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 396.36it/s, env_step=15360, len=34, n/ep=2, n/st=64, player_1/loss=578.322, player_2/loss=228.633, rew=62.50]                                                                                                                                                                                     


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 395.98it/s, env_step=16384, len=26, n/ep=2, n/st=64, player_1/loss=593.432, player_2/loss=166.466, rew=25.00]                                                                                                                                                                                     


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 389.43it/s, env_step=17408, len=33, n/ep=2, n/st=64, player_1/loss=628.410, player_2/loss=185.051, rew=0.00]                                                                                                                                                                                      


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 388.69it/s, env_step=18432, len=25, n/ep=3, n/st=64, player_1/loss=608.648, player_2/loss=186.054, rew=8.33]                                                                                                                                                                                      


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 394.79it/s, env_step=19456, len=23, n/ep=3, n/st=64, player_1/loss=502.090, player_2/loss=210.370, rew=8.33]                                                                                                                                                                                      


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 394.73it/s, env_step=20480, len=27, n/ep=3, n/st=64, player_1/loss=484.640, player_2/loss=228.472, rew=25.00]                                                                                                                                                                                     


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 394.73it/s, env_step=21504, len=25, n/ep=2, n/st=64, player_1/loss=512.666, player_2/loss=221.243, rew=25.00]                                                                                                                                                                                     


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 394.57it/s, env_step=22528, len=25, n/ep=3, n/st=64, player_1/loss=504.877, player_2/loss=221.026, rew=25.00]                                                                                                                                                                                     


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 395.96it/s, env_step=23552, len=27, n/ep=3, n/st=64, player_1/loss=528.395, player_2/loss=204.093, rew=25.00]                                                                                                                                                                                     


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 395.11it/s, env_step=24576, len=25, n/ep=3, n/st=64, player_1/loss=577.258, player_2/loss=214.199, rew=25.00]                                                                                                                                                                                     


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 393.27it/s, env_step=25600, len=25, n/ep=2, n/st=64, player_1/loss=457.048, player_2/loss=178.611, rew=0.00]                                                                                                                                                                                      


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 393.18it/s, env_step=26624, len=26, n/ep=2, n/st=64, player_1/loss=410.014, player_2/loss=180.550, rew=0.00]                                                                                                                                                                                      


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 396.03it/s, env_step=27648, len=26, n/ep=3, n/st=64, player_1/loss=340.786, player_2/loss=201.155, rew=8.33]                                                                                                                                                                                      


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 395.94it/s, env_step=28672, len=26, n/ep=2, n/st=64, player_1/loss=370.203, player_2/loss=173.783, rew=25.00]                                                                                                                                                                                     


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 392.36it/s, env_step=29696, len=25, n/ep=2, n/st=64, player_1/loss=458.258, player_2/loss=165.011, rew=0.00]                                                                                                                                                                                      


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 395.70it/s, env_step=30720, len=27, n/ep=2, n/st=64, player_1/loss=468.361, player_2/loss=164.214, rew=25.00]                                                                                                                                                                                     


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 394.85it/s, env_step=31744, len=25, n/ep=3, n/st=64, player_1/loss=459.147, player_2/loss=184.779, rew=8.33]                                                                                                                                                                                      


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 395.38it/s, env_step=32768, len=25, n/ep=2, n/st=64, player_1/loss=450.579, player_2/loss=180.267, rew=25.00]                                                                                                                                                                                     


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 395.63it/s, env_step=33792, len=30, n/ep=2, n/st=64, player_1/loss=424.147, player_2/loss=210.853, rew=0.00]                                                                                                                                                                                      


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 395.42it/s, env_step=34816, len=26, n/ep=3, n/st=64, player_1/loss=451.688, player_2/loss=206.723, rew=25.00]                                                                                                                                                                                     


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 397.79it/s, env_step=35840, len=24, n/ep=3, n/st=64, player_1/loss=373.732, player_2/loss=176.017, rew=8.33]                                                                                                                                                                                      


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 393.95it/s, env_step=36864, len=26, n/ep=2, n/st=64, player_1/loss=398.478, player_2/loss=198.855, rew=25.00]                                                                                                                                                                                     


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 393.17it/s, env_step=37888, len=25, n/ep=3, n/st=64, player_1/loss=478.933, player_2/loss=200.878, rew=8.33]                                                                                                                                                                                      


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 392.25it/s, env_step=38912, len=25, n/ep=3, n/st=64, player_1/loss=434.365, player_2/loss=178.590, rew=8.33]                                                                                                                                                                                      


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 391.73it/s, env_step=39936, len=23, n/ep=3, n/st=64, player_1/loss=448.589, player_2/loss=226.716, rew=-8.33]                                                                                                                                                                                     


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 395.48it/s, env_step=40960, len=26, n/ep=3, n/st=64, player_1/loss=407.995, player_2/loss=195.676, rew=25.00]                                                                                                                                                                                     


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 390.27it/s, env_step=41984, len=26, n/ep=2, n/st=64, player_1/loss=457.209, player_2/loss=145.584, rew=0.00]                                                                                                                                                                                      


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 389.94it/s, env_step=43008, len=27, n/ep=2, n/st=64, player_2/loss=210.663, rew=25.00]                                                                                                                                                                                                            


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 391.43it/s, env_step=44032, len=26, n/ep=2, n/st=64, player_1/loss=407.858, player_2/loss=222.515, rew=25.00]                                                                                                                                                                                     


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 391.10it/s, env_step=45056, len=26, n/ep=3, n/st=64, player_1/loss=473.351, player_2/loss=190.338, rew=25.00]                                                                                                                                                                                     


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 392.90it/s, env_step=46080, len=25, n/ep=3, n/st=64, player_1/loss=526.919, player_2/loss=169.526, rew=25.00]                                                                                                                                                                                     


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 395.90it/s, env_step=47104, len=27, n/ep=2, n/st=64, player_1/loss=566.329, player_2/loss=141.243, rew=25.00]                                                                                                                                                                                     


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 394.47it/s, env_step=48128, len=26, n/ep=2, n/st=64, player_1/loss=523.309, rew=0.00]                                                                                                                                                                                                             


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 394.48it/s, env_step=49152, len=30, n/ep=2, n/st=64, player_1/loss=394.796, player_2/loss=241.406, rew=25.00]                                                                                                                                                                                     


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 395.80it/s, env_step=50176, len=20, n/ep=3, n/st=64, player_1/loss=343.749, player_2/loss=238.270, rew=-8.33]                                                                                                                                                                                     


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 394.50it/s, env_step=1024, len=28, n/ep=2, n/st=64, player_1/loss=271.924, player_2/loss=223.199, rew=25.00]                                                                                                                                                                                       


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 395.34it/s, env_step=2048, len=27, n/ep=3, n/st=64, player_1/loss=453.842, player_2/loss=186.582, rew=-25.00]                                                                                                                                                                                      


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 394.24it/s, env_step=3072, len=24, n/ep=3, n/st=64, player_1/loss=478.517, player_2/loss=201.348, rew=-8.33]                                                                                                                                                                                       


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 395.81it/s, env_step=4096, len=27, n/ep=2, n/st=64, player_1/loss=481.151, player_2/loss=188.692, rew=-25.00]                                                                                                                                                                                      


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 396.02it/s, env_step=5120, len=28, n/ep=2, n/st=64, player_1/loss=499.450, player_2/loss=179.754, rew=-25.00]                                                                                                                                                                                      


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 395.15it/s, env_step=6144, len=26, n/ep=2, n/st=64, player_1/loss=424.695, player_2/loss=255.519, rew=-25.00]                                                                                                                                                                                      


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 393.47it/s, env_step=7168, len=26, n/ep=3, n/st=64, player_1/loss=356.913, player_2/loss=269.863, rew=-8.33]                                                                                                                                                                                       


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 396.21it/s, env_step=8192, len=32, n/ep=2, n/st=64, player_1/loss=347.905, player_2/loss=183.290, rew=0.00]                                                                                                                                                                                        


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 392.80it/s, env_step=9216, len=27, n/ep=2, n/st=64, player_1/loss=402.069, player_2/loss=175.062, rew=-25.00]                                                                                                                                                                                      


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 395.76it/s, env_step=10240, len=20, n/ep=3, n/st=64, player_2/loss=245.747, rew=8.33]                                                                                                                                                                                                             


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 390.28it/s, env_step=11264, len=25, n/ep=2, n/st=64, player_1/loss=451.086, player_2/loss=255.191, rew=-25.00]                                                                                                                                                                                    


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 395.89it/s, env_step=12288, len=26, n/ep=3, n/st=64, player_1/loss=479.374, player_2/loss=218.395, rew=-25.00]                                                                                                                                                                                    


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 388.77it/s, env_step=13312, len=25, n/ep=3, n/st=64, player_1/loss=402.751, player_2/loss=196.362, rew=-8.33]                                                                                                                                                                                     


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 389.09it/s, env_step=14336, len=25, n/ep=3, n/st=64, player_1/loss=523.316, player_2/loss=212.695, rew=8.33]                                                                                                                                                                                      


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #15: 1025it [00:02, 393.15it/s, env_step=15360, len=27, n/ep=3, n/st=64, player_2/loss=216.781, rew=-8.33]                                                                                                                                                                                                            


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #16: 1025it [00:02, 393.18it/s, env_step=16384, len=24, n/ep=3, n/st=64, player_1/loss=335.065, player_2/loss=217.457, rew=8.33]                                                                                                                                                                                      


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #17: 1025it [00:02, 394.75it/s, env_step=17408, len=28, n/ep=2, n/st=64, player_1/loss=357.409, player_2/loss=213.852, rew=0.00]                                                                                                                                                                                      


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #18: 1025it [00:02, 393.97it/s, env_step=18432, len=25, n/ep=2, n/st=64, player_1/loss=392.789, player_2/loss=197.810, rew=-25.00]                                                                                                                                                                                    


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #19: 1025it [00:02, 394.27it/s, env_step=19456, len=23, n/ep=3, n/st=64, player_1/loss=497.988, player_2/loss=176.075, rew=-8.33]                                                                                                                                                                                     


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #20: 1025it [00:02, 394.75it/s, env_step=20480, len=25, n/ep=3, n/st=64, player_1/loss=472.368, player_2/loss=194.878, rew=-8.33]                                                                                                                                                                                     


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #21: 1025it [00:02, 393.57it/s, env_step=21504, len=26, n/ep=3, n/st=64, player_1/loss=356.661, player_2/loss=186.447, rew=8.33]                                                                                                                                                                                      


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #22: 1025it [00:02, 396.06it/s, env_step=22528, len=26, n/ep=3, n/st=64, player_1/loss=403.582, player_2/loss=146.338, rew=-25.00]                                                                                                                                                                                    


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #23: 1025it [00:02, 396.51it/s, env_step=23552, len=26, n/ep=2, n/st=64, player_1/loss=372.154, player_2/loss=150.751, rew=-25.00]                                                                                                                                                                                    


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #24: 1025it [00:02, 393.98it/s, env_step=24576, len=26, n/ep=3, n/st=64, player_1/loss=405.128, player_2/loss=151.483, rew=-8.33]                                                                                                                                                                                     


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #25: 1025it [00:02, 394.42it/s, env_step=25600, len=27, n/ep=2, n/st=64, player_1/loss=382.994, player_2/loss=214.856, rew=-25.00]                                                                                                                                                                                    


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #26: 1025it [00:02, 393.13it/s, env_step=26624, len=26, n/ep=2, n/st=64, player_1/loss=271.680, player_2/loss=242.640, rew=-25.00]                                                                                                                                                                                    


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #27: 1025it [00:02, 393.11it/s, env_step=27648, len=26, n/ep=2, n/st=64, player_1/loss=330.502, player_2/loss=229.540, rew=-25.00]                                                                                                                                                                                    


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #28: 1025it [00:02, 393.47it/s, env_step=28672, len=26, n/ep=3, n/st=64, player_1/loss=386.449, player_2/loss=262.813, rew=-25.00]                                                                                                                                                                                    


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #29: 1025it [00:02, 393.93it/s, env_step=29696, len=28, n/ep=2, n/st=64, player_1/loss=351.861, player_2/loss=228.413, rew=0.00]                                                                                                                                                                                      


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #30: 1025it [00:02, 394.94it/s, env_step=30720, len=29, n/ep=3, n/st=64, player_1/loss=412.326, player_2/loss=153.508, rew=-8.33]                                                                                                                                                                                     


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #31: 1025it [00:02, 393.34it/s, env_step=31744, len=28, n/ep=3, n/st=64, player_1/loss=384.238, player_2/loss=158.444, rew=8.33]                                                                                                                                                                                      


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #32: 1025it [00:02, 394.09it/s, env_step=32768, len=26, n/ep=2, n/st=64, player_1/loss=348.859, rew=-25.00]                                                                                                                                                                                                           


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #33: 1025it [00:02, 393.62it/s, env_step=33792, len=23, n/ep=2, n/st=64, player_1/loss=433.856, player_2/loss=166.587, rew=0.00]                                                                                                                                                                                      


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #34: 1025it [00:02, 391.66it/s, env_step=34816, len=30, n/ep=2, n/st=64, player_1/loss=457.911, player_2/loss=179.136, rew=0.00]                                                                                                                                                                                      


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #35: 1025it [00:02, 392.15it/s, env_step=35840, len=26, n/ep=2, n/st=64, player_1/loss=399.608, player_2/loss=183.978, rew=-25.00]                                                                                                                                                                                    


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #36: 1025it [00:02, 394.75it/s, env_step=36864, len=27, n/ep=2, n/st=64, player_1/loss=376.262, player_2/loss=244.804, rew=0.00]                                                                                                                                                                                      


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #37: 1025it [00:02, 393.47it/s, env_step=37888, len=24, n/ep=3, n/st=64, player_1/loss=324.895, player_2/loss=252.537, rew=-25.00]                                                                                                                                                                                    


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #38: 1025it [00:02, 392.41it/s, env_step=38912, len=25, n/ep=2, n/st=64, player_1/loss=454.557, player_2/loss=203.589, rew=0.00]                                                                                                                                                                                      


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #39: 1025it [00:02, 394.52it/s, env_step=39936, len=25, n/ep=3, n/st=64, player_1/loss=439.096, player_2/loss=206.327, rew=25.00]                                                                                                                                                                                     


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #40: 1025it [00:02, 392.21it/s, env_step=40960, len=26, n/ep=2, n/st=64, player_1/loss=317.840, player_2/loss=212.727, rew=-25.00]                                                                                                                                                                                    


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #41: 1025it [00:02, 393.63it/s, env_step=41984, len=27, n/ep=3, n/st=64, player_2/loss=173.504, rew=-25.00]                                                                                                                                                                                                           


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #42: 1025it [00:02, 393.34it/s, env_step=43008, len=28, n/ep=2, n/st=64, player_1/loss=313.624, player_2/loss=129.259, rew=25.00]                                                                                                                                                                                     


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #43: 1025it [00:02, 393.33it/s, env_step=44032, len=24, n/ep=2, n/st=64, player_1/loss=308.270, player_2/loss=128.638, rew=0.00]                                                                                                                                                                                      


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #44: 1025it [00:02, 394.50it/s, env_step=45056, len=27, n/ep=2, n/st=64, player_1/loss=385.332, player_2/loss=144.894, rew=-25.00]                                                                                                                                                                                    


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #45: 1025it [00:02, 393.03it/s, env_step=46080, len=26, n/ep=3, n/st=64, player_1/loss=400.433, player_2/loss=165.212, rew=-25.00]                                                                                                                                                                                    


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #46: 1025it [00:02, 394.17it/s, env_step=47104, len=27, n/ep=2, n/st=64, player_1/loss=408.685, player_2/loss=179.545, rew=-25.00]                                                                                                                                                                                    


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #47: 1025it [00:02, 395.45it/s, env_step=48128, len=27, n/ep=2, n/st=64, player_1/loss=451.630, player_2/loss=157.454, rew=-25.00]                                                                                                                                                                                    


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #48: 1025it [00:02, 390.79it/s, env_step=49152, len=26, n/ep=3, n/st=64, player_1/loss=513.816, player_2/loss=135.627, rew=-25.00]                                                                                                                                                                                    


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #49: 1025it [00:02, 393.42it/s, env_step=50176, len=26, n/ep=2, n/st=64, player_1/loss=412.957, player_2/loss=177.830, rew=0.00]                                                                                                                                                                                      


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #14


Epoch #1: 1025it [00:02, 389.60it/s, env_step=1024, len=26, n/ep=3, n/st=64, player_1/loss=297.682, player_2/loss=147.063, rew=25.00]                                                                                                                                                                                       


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 392.63it/s, env_step=2048, len=26, n/ep=2, n/st=64, player_1/loss=288.442, player_2/loss=154.854, rew=25.00]                                                                                                                                                                                       


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 390.97it/s, env_step=3072, len=23, n/ep=3, n/st=64, player_1/loss=295.719, player_2/loss=163.498, rew=8.33]                                                                                                                                                                                        


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 392.71it/s, env_step=4096, len=27, n/ep=2, n/st=64, player_1/loss=325.491, player_2/loss=157.952, rew=25.00]                                                                                                                                                                                       


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 394.94it/s, env_step=5120, len=27, n/ep=3, n/st=64, player_1/loss=328.901, player_2/loss=148.128, rew=25.00]                                                                                                                                                                                       


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 394.44it/s, env_step=6144, len=27, n/ep=3, n/st=64, player_1/loss=320.349, player_2/loss=202.651, rew=8.33]                                                                                                                                                                                        


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 392.37it/s, env_step=7168, len=26, n/ep=3, n/st=64, player_1/loss=326.291, player_2/loss=189.718, rew=8.33]                                                                                                                                                                                        


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 392.65it/s, env_step=8192, len=26, n/ep=2, n/st=64, player_1/loss=312.191, player_2/loss=141.979, rew=0.00]                                                                                                                                                                                        


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 384.64it/s, env_step=9216, len=23, n/ep=3, n/st=64, player_1/loss=316.857, player_2/loss=137.767, rew=8.33]                                                                                                                                                                                        


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 388.31it/s, env_step=10240, len=20, n/ep=3, n/st=64, player_2/loss=151.153, rew=-8.33]                                                                                                                                                                                                            


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 393.58it/s, env_step=11264, len=25, n/ep=2, n/st=64, player_1/loss=390.958, player_2/loss=149.903, rew=25.00]                                                                                                                                                                                     


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 393.98it/s, env_step=12288, len=21, n/ep=3, n/st=64, player_1/loss=346.395, player_2/loss=160.049, rew=8.33]                                                                                                                                                                                      


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 390.68it/s, env_step=13312, len=26, n/ep=2, n/st=64, player_1/loss=389.237, player_2/loss=178.142, rew=0.00]                                                                                                                                                                                      


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 390.05it/s, env_step=14336, len=23, n/ep=3, n/st=64, player_1/loss=400.435, player_2/loss=195.377, rew=8.33]                                                                                                                                                                                      


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 383.46it/s, env_step=15360, len=26, n/ep=2, n/st=64, player_1/loss=412.402, player_2/loss=176.674, rew=25.00]                                                                                                                                                                                     


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 391.80it/s, env_step=16384, len=24, n/ep=2, n/st=64, player_1/loss=324.974, player_2/loss=166.342, rew=25.00]                                                                                                                                                                                     


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 392.77it/s, env_step=17408, len=26, n/ep=3, n/st=64, player_1/loss=328.147, player_2/loss=152.558, rew=25.00]                                                                                                                                                                                     


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 392.97it/s, env_step=18432, len=23, n/ep=2, n/st=64, player_1/loss=293.129, player_2/loss=168.867, rew=-25.00]                                                                                                                                                                                    


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 393.81it/s, env_step=19456, len=29, n/ep=2, n/st=64, player_1/loss=371.497, player_2/loss=174.109, rew=25.00]                                                                                                                                                                                     


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 391.23it/s, env_step=20480, len=26, n/ep=3, n/st=64, player_1/loss=434.361, player_2/loss=166.433, rew=25.00]                                                                                                                                                                                     


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 390.58it/s, env_step=21504, len=31, n/ep=2, n/st=64, player_1/loss=381.735, player_2/loss=154.680, rew=-25.00]                                                                                                                                                                                    


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 391.85it/s, env_step=22528, len=24, n/ep=3, n/st=64, player_1/loss=432.290, player_2/loss=140.818, rew=8.33]                                                                                                                                                                                      


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 394.65it/s, env_step=23552, len=25, n/ep=3, n/st=64, player_1/loss=349.882, player_2/loss=131.394, rew=8.33]                                                                                                                                                                                      


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 391.68it/s, env_step=24576, len=27, n/ep=2, n/st=64, player_1/loss=290.224, player_2/loss=153.505, rew=25.00]                                                                                                                                                                                     


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 391.90it/s, env_step=25600, len=27, n/ep=2, n/st=64, player_1/loss=316.626, player_2/loss=166.557, rew=0.00]                                                                                                                                                                                      


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:03, 327.81it/s, env_step=26624, len=26, n/ep=3, n/st=64, player_1/loss=404.885, player_2/loss=201.392, rew=25.00]                                                                                                                                                                                     


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 343.59it/s, env_step=27648, len=24, n/ep=3, n/st=64, player_1/loss=414.130, player_2/loss=203.773, rew=25.00]                                                                                                                                                                                     


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 421.34it/s, env_step=28672, len=26, n/ep=2, n/st=64, player_1/loss=461.582, player_2/loss=148.482, rew=25.00]                                                                                                                                                                                     


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 472.94it/s, env_step=29696, len=25, n/ep=3, n/st=64, player_1/loss=440.652, player_2/loss=137.912, rew=25.00]                                                                                                                                                                                     


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 470.96it/s, env_step=30720, len=28, n/ep=3, n/st=64, player_1/loss=404.448, player_2/loss=127.381, rew=25.00]                                                                                                                                                                                     


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 410.63it/s, env_step=31744, len=26, n/ep=3, n/st=64, player_1/loss=309.053, player_2/loss=136.393, rew=25.00]                                                                                                                                                                                     


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 441.69it/s, env_step=32768, len=31, n/ep=2, n/st=64, player_1/loss=272.402, player_2/loss=162.061, rew=0.00]                                                                                                                                                                                      


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 490.64it/s, env_step=33792, len=26, n/ep=2, n/st=64, player_1/loss=380.981, player_2/loss=194.318, rew=0.00]                                                                                                                                                                                      


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 479.12it/s, env_step=34816, len=26, n/ep=2, n/st=64, player_1/loss=483.823, player_2/loss=189.724, rew=25.00]                                                                                                                                                                                     


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 466.23it/s, env_step=35840, len=27, n/ep=3, n/st=64, player_1/loss=407.928, player_2/loss=125.432, rew=8.33]                                                                                                                                                                                      


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 441.84it/s, env_step=36864, len=27, n/ep=2, n/st=64, player_1/loss=351.647, player_2/loss=161.332, rew=0.00]                                                                                                                                                                                      


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 399.66it/s, env_step=37888, len=26, n/ep=2, n/st=64, player_1/loss=437.399, player_2/loss=164.038, rew=25.00]                                                                                                                                                                                     


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 487.77it/s, env_step=38912, len=25, n/ep=3, n/st=64, player_1/loss=456.207, player_2/loss=108.377, rew=8.33]                                                                                                                                                                                      


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 481.95it/s, env_step=39936, len=25, n/ep=3, n/st=64, player_1/loss=410.434, player_2/loss=141.269, rew=25.00]                                                                                                                                                                                     


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 469.31it/s, env_step=40960, len=23, n/ep=3, n/st=64, player_1/loss=343.823, player_2/loss=163.725, rew=8.33]                                                                                                                                                                                      


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 455.91it/s, env_step=41984, len=24, n/ep=3, n/st=64, player_1/loss=338.066, player_2/loss=152.012, rew=8.33]                                                                                                                                                                                      


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 478.09it/s, env_step=43008, len=24, n/ep=2, n/st=64, player_1/loss=360.229, player_2/loss=165.530, rew=0.00]                                                                                                                                                                                      


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 493.20it/s, env_step=44032, len=28, n/ep=3, n/st=64, player_1/loss=411.141, player_2/loss=146.531, rew=25.00]                                                                                                                                                                                     


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 484.82it/s, env_step=45056, len=28, n/ep=2, n/st=64, player_1/loss=319.646, player_2/loss=139.468, rew=-25.00]                                                                                                                                                                                    


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 478.33it/s, env_step=46080, len=25, n/ep=2, n/st=64, player_1/loss=323.216, player_2/loss=178.836, rew=0.00]                                                                                                                                                                                      


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 491.73it/s, env_step=47104, len=29, n/ep=2, n/st=64, player_1/loss=337.707, player_2/loss=182.568, rew=25.00]                                                                                                                                                                                     


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 475.10it/s, env_step=48128, len=28, n/ep=2, n/st=64, player_1/loss=352.994, player_2/loss=172.652, rew=25.00]                                                                                                                                                                                     


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 467.21it/s, env_step=49152, len=27, n/ep=3, n/st=64, player_1/loss=414.421, player_2/loss=152.147, rew=25.00]                                                                                                                                                                                     


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 461.95it/s, env_step=50176, len=25, n/ep=3, n/st=64, player_1/loss=449.786, player_2/loss=138.876, rew=8.33]                                                                                                                                                                                      


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 397.95it/s, env_step=1024, len=23, n/ep=3, n/st=64, player_1/loss=265.808, player_2/loss=119.364, rew=-8.33]                                                                                                                                                                                       


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 436.36it/s, env_step=2048, len=26, n/ep=3, n/st=64, player_1/loss=405.429, player_2/loss=129.556, rew=-25.00]                                                                                                                                                                                      


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 465.70it/s, env_step=3072, len=20, n/ep=3, n/st=64, player_1/loss=421.060, player_2/loss=136.544, rew=8.33]                                                                                                                                                                                        


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 430.44it/s, env_step=4096, len=27, n/ep=2, n/st=64, player_1/loss=345.797, player_2/loss=131.438, rew=-25.00]                                                                                                                                                                                      


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #5: 1025it [00:02, 405.60it/s, env_step=5120, len=25, n/ep=3, n/st=64, player_1/loss=387.729, player_2/loss=134.090, rew=-25.00]                                                                                                                                                                                      


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #6: 1025it [00:02, 404.63it/s, env_step=6144, len=26, n/ep=3, n/st=64, player_1/loss=380.870, player_2/loss=172.215, rew=-25.00]                                                                                                                                                                                      


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #7: 1025it [00:02, 403.51it/s, env_step=7168, len=26, n/ep=3, n/st=64, player_1/loss=347.440, player_2/loss=173.049, rew=-8.33]                                                                                                                                                                                       


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #8: 1025it [00:02, 405.39it/s, env_step=8192, len=26, n/ep=2, n/st=64, player_1/loss=333.692, player_2/loss=131.767, rew=0.00]                                                                                                                                                                                        


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #9: 1025it [00:02, 404.75it/s, env_step=9216, len=26, n/ep=2, n/st=64, player_1/loss=282.057, player_2/loss=122.115, rew=-25.00]                                                                                                                                                                                      


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #10: 1025it [00:02, 402.94it/s, env_step=10240, len=26, n/ep=2, n/st=64, player_1/loss=316.942, rew=-25.00]                                                                                                                                                                                                           


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #11: 1025it [00:02, 352.49it/s, env_step=11264, len=25, n/ep=2, n/st=64, player_1/loss=345.881, player_2/loss=180.445, rew=-25.00]                                                                                                                                                                                    


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #12: 1025it [00:02, 372.36it/s, env_step=12288, len=21, n/ep=3, n/st=64, player_1/loss=312.945, player_2/loss=166.304, rew=-8.33]                                                                                                                                                                                     


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #13: 1025it [00:02, 390.06it/s, env_step=13312, len=26, n/ep=2, n/st=64, player_1/loss=267.868, player_2/loss=125.895, rew=-25.00]                                                                                                                                                                                    


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #14: 1025it [00:02, 366.15it/s, env_step=14336, len=27, n/ep=2, n/st=64, player_1/loss=275.328, player_2/loss=182.079, rew=-25.00]                                                                                                                                                                                    


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #15: 1025it [00:02, 409.72it/s, env_step=15360, len=26, n/ep=2, n/st=64, player_1/loss=290.415, player_2/loss=169.708, rew=-25.00]                                                                                                                                                                                    


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #16: 1025it [00:02, 374.28it/s, env_step=16384, len=27, n/ep=3, n/st=64, player_1/loss=341.962, player_2/loss=104.503, rew=-8.33]                                                                                                                                                                                     


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #17: 1025it [00:02, 406.50it/s, env_step=17408, len=27, n/ep=2, n/st=64, player_1/loss=353.707, player_2/loss=129.858, rew=0.00]                                                                                                                                                                                      


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #18: 1025it [00:02, 416.46it/s, env_step=18432, len=27, n/ep=3, n/st=64, player_1/loss=323.624, player_2/loss=135.181, rew=-8.33]                                                                                                                                                                                     


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #19: 1025it [00:02, 470.55it/s, env_step=19456, len=27, n/ep=3, n/st=64, player_1/loss=332.462, player_2/loss=139.645, rew=-8.33]                                                                                                                                                                                     


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #20: 1025it [00:02, 468.19it/s, env_step=20480, len=24, n/ep=3, n/st=64, player_1/loss=269.387, player_2/loss=131.797, rew=-8.33]                                                                                                                                                                                     


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #21: 1025it [00:02, 498.82it/s, env_step=21504, len=25, n/ep=3, n/st=64, player_1/loss=269.658, player_2/loss=169.718, rew=8.33]                                                                                                                                                                                      


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #22: 1025it [00:02, 494.08it/s, env_step=22528, len=23, n/ep=3, n/st=64, player_1/loss=330.865, player_2/loss=158.292, rew=-8.33]                                                                                                                                                                                     


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #23: 1025it [00:02, 467.75it/s, env_step=23552, len=25, n/ep=3, n/st=64, player_1/loss=320.379, player_2/loss=142.602, rew=-8.33]                                                                                                                                                                                     


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #24: 1025it [00:02, 447.85it/s, env_step=24576, len=30, n/ep=2, n/st=64, player_1/loss=266.731, player_2/loss=130.811, rew=-25.00]                                                                                                                                                                                    


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #25: 1025it [00:02, 365.06it/s, env_step=25600, len=27, n/ep=2, n/st=64, player_1/loss=295.199, player_2/loss=109.970, rew=0.00]                                                                                                                                                                                      


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #26: 1025it [00:02, 359.39it/s, env_step=26624, len=26, n/ep=2, n/st=64, player_1/loss=394.055, player_2/loss=115.722, rew=-25.00]                                                                                                                                                                                    


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #27: 1025it [00:02, 388.51it/s, env_step=27648, len=27, n/ep=3, n/st=64, player_1/loss=340.305, player_2/loss=151.619, rew=-25.00]                                                                                                                                                                                    


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #28: 1025it [00:02, 465.85it/s, env_step=28672, len=25, n/ep=2, n/st=64, player_1/loss=356.606, player_2/loss=167.202, rew=0.00]                                                                                                                                                                                      


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #29: 1025it [00:02, 481.60it/s, env_step=29696, len=26, n/ep=2, n/st=64, player_1/loss=350.557, player_2/loss=133.495, rew=0.00]                                                                                                                                                                                      


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #30: 1025it [00:02, 407.21it/s, env_step=30720, len=23, n/ep=3, n/st=64, player_1/loss=269.698, player_2/loss=204.186, rew=25.00]                                                                                                                                                                                     


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #31: 1025it [00:02, 450.08it/s, env_step=31744, len=25, n/ep=3, n/st=64, player_1/loss=253.586, player_2/loss=224.782, rew=8.33]                                                                                                                                                                                      


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #32: 1025it [00:02, 488.64it/s, env_step=32768, len=28, n/ep=2, n/st=64, player_1/loss=284.725, player_2/loss=201.165, rew=-25.00]                                                                                                                                                                                    


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #33: 1025it [00:02, 463.48it/s, env_step=33792, len=27, n/ep=2, n/st=64, player_1/loss=323.967, player_2/loss=168.728, rew=25.00]                                                                                                                                                                                     


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #34: 1025it [00:02, 491.57it/s, env_step=34816, len=25, n/ep=3, n/st=64, player_1/loss=312.851, player_2/loss=176.778, rew=-8.33]                                                                                                                                                                                     


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #35: 1025it [00:02, 491.16it/s, env_step=35840, len=28, n/ep=2, n/st=64, player_1/loss=249.918, player_2/loss=163.135, rew=-25.00]                                                                                                                                                                                    


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #36: 1025it [00:02, 489.00it/s, env_step=36864, len=26, n/ep=2, n/st=64, player_1/loss=231.340, player_2/loss=193.625, rew=-25.00]                                                                                                                                                                                    


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #37: 1025it [00:02, 491.72it/s, env_step=37888, len=26, n/ep=2, n/st=64, player_1/loss=311.214, player_2/loss=168.730, rew=-25.00]                                                                                                                                                                                    


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #38: 1025it [00:02, 491.49it/s, env_step=38912, len=26, n/ep=3, n/st=64, player_1/loss=403.641, player_2/loss=111.911, rew=-25.00]                                                                                                                                                                                    


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #39: 1025it [00:02, 494.24it/s, env_step=39936, len=26, n/ep=3, n/st=64, player_1/loss=458.958, player_2/loss=119.590, rew=-25.00]                                                                                                                                                                                    


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #40: 1025it [00:02, 494.85it/s, env_step=40960, len=26, n/ep=2, n/st=64, player_1/loss=371.779, player_2/loss=114.635, rew=-25.00]                                                                                                                                                                                    


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #41: 1025it [00:02, 492.44it/s, env_step=41984, len=27, n/ep=2, n/st=64, player_1/loss=301.789, player_2/loss=120.519, rew=-25.00]                                                                                                                                                                                    


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #42: 1025it [00:02, 492.95it/s, env_step=43008, len=23, n/ep=2, n/st=64, player_2/loss=119.884, rew=0.00]                                                                                                                                                                                                             


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #43: 1025it [00:02, 492.93it/s, env_step=44032, len=28, n/ep=3, n/st=64, player_1/loss=340.570, player_2/loss=135.652, rew=-25.00]                                                                                                                                                                                    


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #44: 1025it [00:02, 492.80it/s, env_step=45056, len=27, n/ep=3, n/st=64, player_1/loss=285.659, player_2/loss=171.551, rew=8.33]                                                                                                                                                                                      


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #45: 1025it [00:02, 482.33it/s, env_step=46080, len=27, n/ep=3, n/st=64, player_1/loss=258.092, player_2/loss=164.500, rew=-8.33]                                                                                                                                                                                     


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #46: 1025it [00:02, 460.03it/s, env_step=47104, len=29, n/ep=2, n/st=64, player_2/loss=146.994, rew=25.00]                                                                                                                                                                                                            


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #47: 1025it [00:02, 438.95it/s, env_step=48128, len=25, n/ep=3, n/st=64, player_1/loss=309.045, rew=-8.33]                                                                                                                                                                                                            


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #48: 1025it [00:02, 411.06it/s, env_step=49152, len=23, n/ep=3, n/st=64, player_1/loss=311.812, player_2/loss=156.879, rew=-8.33]                                                                                                                                                                                     


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #49: 1025it [00:02, 392.26it/s, env_step=50176, len=25, n/ep=2, n/st=64, player_1/loss=323.609, player_2/loss=140.882, rew=-25.00]                                                                                                                                                                                    


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #4


Epoch #1: 1025it [00:02, 390.46it/s, env_step=1024, len=26, n/ep=3, n/st=64, player_1/loss=222.761, player_2/loss=121.151, rew=25.00]                                                                                                                                                                                       


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 390.43it/s, env_step=2048, len=24, n/ep=3, n/st=64, player_1/loss=260.629, player_2/loss=135.534, rew=8.33]                                                                                                                                                                                        


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 392.29it/s, env_step=3072, len=24, n/ep=2, n/st=64, player_1/loss=289.372, player_2/loss=150.168, rew=25.00]                                                                                                                                                                                       


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 393.74it/s, env_step=4096, len=28, n/ep=2, n/st=64, player_1/loss=206.322, player_2/loss=207.824, rew=0.00]                                                                                                                                                                                        


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 393.21it/s, env_step=5120, len=26, n/ep=3, n/st=64, player_1/loss=172.717, player_2/loss=212.952, rew=8.33]                                                                                                                                                                                        


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 392.48it/s, env_step=6144, len=25, n/ep=2, n/st=64, player_1/loss=238.525, player_2/loss=187.252, rew=0.00]                                                                                                                                                                                        


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 390.20it/s, env_step=7168, len=22, n/ep=3, n/st=64, player_1/loss=286.513, player_2/loss=156.625, rew=-8.33]                                                                                                                                                                                       


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 393.77it/s, env_step=8192, len=31, n/ep=3, n/st=64, player_1/loss=279.500, rew=8.33]                                                                                                                                                                                                               


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 385.33it/s, env_step=9216, len=26, n/ep=3, n/st=64, player_1/loss=249.928, player_2/loss=215.499, rew=25.00]                                                                                                                                                                                       


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 385.91it/s, env_step=10240, len=26, n/ep=3, n/st=64, player_1/loss=239.786, player_2/loss=199.380, rew=-8.33]                                                                                                                                                                                     


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 392.62it/s, env_step=11264, len=26, n/ep=3, n/st=64, player_1/loss=214.684, player_2/loss=181.776, rew=25.00]                                                                                                                                                                                     


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 389.65it/s, env_step=12288, len=29, n/ep=2, n/st=64, player_1/loss=230.686, player_2/loss=183.834, rew=-25.00]                                                                                                                                                                                    


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 391.54it/s, env_step=13312, len=27, n/ep=2, n/st=64, player_1/loss=228.964, player_2/loss=202.137, rew=0.00]                                                                                                                                                                                      


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 391.74it/s, env_step=14336, len=21, n/ep=3, n/st=64, player_1/loss=282.286, player_2/loss=172.571, rew=8.33]                                                                                                                                                                                      


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 392.70it/s, env_step=15360, len=22, n/ep=3, n/st=64, player_1/loss=271.829, player_2/loss=189.761, rew=-25.00]                                                                                                                                                                                    


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 389.63it/s, env_step=16384, len=23, n/ep=3, n/st=64, player_1/loss=232.286, player_2/loss=227.881, rew=8.33]                                                                                                                                                                                      


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 391.22it/s, env_step=17408, len=27, n/ep=2, n/st=64, player_1/loss=302.951, player_2/loss=168.469, rew=0.00]                                                                                                                                                                                      


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 390.95it/s, env_step=18432, len=26, n/ep=3, n/st=64, player_1/loss=331.479, player_2/loss=163.337, rew=-8.33]                                                                                                                                                                                     


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 389.51it/s, env_step=19456, len=26, n/ep=2, n/st=64, player_1/loss=322.849, player_2/loss=157.356, rew=25.00]                                                                                                                                                                                     


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 392.98it/s, env_step=20480, len=25, n/ep=3, n/st=64, player_1/loss=293.039, player_2/loss=209.163, rew=8.33]                                                                                                                                                                                      


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 392.31it/s, env_step=21504, len=25, n/ep=2, n/st=64, player_1/loss=255.812, player_2/loss=281.954, rew=25.00]                                                                                                                                                                                     


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 393.61it/s, env_step=22528, len=26, n/ep=3, n/st=64, player_1/loss=238.444, player_2/loss=258.633, rew=-8.33]                                                                                                                                                                                     


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 391.94it/s, env_step=23552, len=26, n/ep=2, n/st=64, player_1/loss=289.027, player_2/loss=259.018, rew=25.00]                                                                                                                                                                                     


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 391.88it/s, env_step=24576, len=27, n/ep=2, n/st=64, player_1/loss=270.179, player_2/loss=206.758, rew=25.00]                                                                                                                                                                                     


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 389.85it/s, env_step=25600, len=23, n/ep=2, n/st=64, player_1/loss=203.374, player_2/loss=153.251, rew=-25.00]                                                                                                                                                                                    


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 390.32it/s, env_step=26624, len=23, n/ep=3, n/st=64, player_1/loss=228.124, player_2/loss=169.168, rew=8.33]                                                                                                                                                                                      


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 390.42it/s, env_step=27648, len=26, n/ep=3, n/st=64, player_1/loss=261.788, player_2/loss=142.331, rew=-8.33]                                                                                                                                                                                     


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 391.70it/s, env_step=28672, len=28, n/ep=2, n/st=64, player_1/loss=251.137, player_2/loss=154.001, rew=-25.00]                                                                                                                                                                                    


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 391.10it/s, env_step=29696, len=26, n/ep=2, n/st=64, player_1/loss=239.051, player_2/loss=160.289, rew=25.00]                                                                                                                                                                                     


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 390.54it/s, env_step=30720, len=26, n/ep=3, n/st=64, player_1/loss=333.080, player_2/loss=145.373, rew=-8.33]                                                                                                                                                                                     


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 389.70it/s, env_step=31744, len=25, n/ep=2, n/st=64, player_1/loss=317.262, player_2/loss=129.901, rew=0.00]                                                                                                                                                                                      


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 390.31it/s, env_step=32768, len=24, n/ep=2, n/st=64, player_1/loss=246.851, player_2/loss=161.440, rew=0.00]                                                                                                                                                                                      


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 392.09it/s, env_step=33792, len=26, n/ep=2, n/st=64, player_1/loss=290.350, player_2/loss=172.589, rew=25.00]                                                                                                                                                                                     


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 390.74it/s, env_step=34816, len=24, n/ep=2, n/st=64, player_1/loss=246.294, player_2/loss=185.937, rew=0.00]                                                                                                                                                                                      


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 392.30it/s, env_step=35840, len=26, n/ep=2, n/st=64, player_1/loss=247.074, player_2/loss=154.142, rew=0.00]                                                                                                                                                                                      


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 390.91it/s, env_step=36864, len=26, n/ep=2, n/st=64, player_1/loss=223.394, player_2/loss=133.222, rew=25.00]                                                                                                                                                                                     


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 382.20it/s, env_step=37888, len=27, n/ep=3, n/st=64, player_1/loss=237.465, player_2/loss=216.899, rew=8.33]                                                                                                                                                                                      


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 389.84it/s, env_step=38912, len=30, n/ep=2, n/st=64, player_1/loss=241.858, player_2/loss=197.227, rew=0.00]                                                                                                                                                                                      


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 391.29it/s, env_step=39936, len=26, n/ep=3, n/st=64, player_1/loss=288.802, player_2/loss=126.909, rew=25.00]                                                                                                                                                                                     


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 391.00it/s, env_step=40960, len=26, n/ep=2, n/st=64, player_1/loss=326.715, player_2/loss=106.797, rew=0.00]                                                                                                                                                                                      


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 393.17it/s, env_step=41984, len=26, n/ep=3, n/st=64, player_1/loss=324.560, player_2/loss=112.091, rew=-8.33]                                                                                                                                                                                     


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 390.31it/s, env_step=43008, len=23, n/ep=2, n/st=64, player_1/loss=248.835, player_2/loss=175.471, rew=0.00]                                                                                                                                                                                      


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 392.49it/s, env_step=44032, len=24, n/ep=3, n/st=64, player_1/loss=227.561, player_2/loss=181.925, rew=8.33]                                                                                                                                                                                      


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 392.19it/s, env_step=45056, len=24, n/ep=3, n/st=64, player_1/loss=234.488, player_2/loss=185.614, rew=-8.33]                                                                                                                                                                                     


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 391.95it/s, env_step=46080, len=22, n/ep=2, n/st=64, player_1/loss=242.955, player_2/loss=143.729, rew=0.00]                                                                                                                                                                                      


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 391.30it/s, env_step=47104, len=27, n/ep=3, n/st=64, player_1/loss=272.100, player_2/loss=113.864, rew=25.00]                                                                                                                                                                                     


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 392.77it/s, env_step=48128, len=25, n/ep=3, n/st=64, player_1/loss=316.070, player_2/loss=118.493, rew=8.33]                                                                                                                                                                                      


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 389.35it/s, env_step=49152, len=26, n/ep=2, n/st=64, player_1/loss=373.106, player_2/loss=141.253, rew=25.00]                                                                                                                                                                                     


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 391.87it/s, env_step=50176, len=23, n/ep=2, n/st=64, player_1/loss=284.011, player_2/loss=138.108, rew=-25.00]                                                                                                                                                                                    


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 388.94it/s, env_step=1024, len=26, n/ep=3, n/st=64, player_1/loss=236.405, player_2/loss=75.004, rew=-25.00]                                                                                                                                                                                       


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:02, 389.14it/s, env_step=2048, len=29, n/ep=2, n/st=64, player_1/loss=254.184, player_2/loss=98.753, rew=-25.00]                                                                                                                                                                                       


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:02, 391.32it/s, env_step=3072, len=25, n/ep=3, n/st=64, player_1/loss=303.703, player_2/loss=106.975, rew=-25.00]                                                                                                                                                                                      


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:02, 390.77it/s, env_step=4096, len=29, n/ep=3, n/st=64, player_1/loss=293.280, player_2/loss=104.113, rew=-8.33]                                                                                                                                                                                       


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:02, 380.13it/s, env_step=5120, len=26, n/ep=3, n/st=64, player_1/loss=343.457, player_2/loss=124.300, rew=8.33]                                                                                                                                                                                        


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #6: 1025it [00:02, 390.58it/s, env_step=6144, len=26, n/ep=3, n/st=64, player_1/loss=386.640, player_2/loss=115.884, rew=-25.00]                                                                                                                                                                                      


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #7: 1025it [00:02, 389.20it/s, env_step=7168, len=23, n/ep=3, n/st=64, player_1/loss=353.252, player_2/loss=107.204, rew=-8.33]                                                                                                                                                                                       


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #8: 1025it [00:02, 390.69it/s, env_step=8192, len=27, n/ep=2, n/st=64, player_1/loss=272.630, player_2/loss=105.957, rew=-25.00]                                                                                                                                                                                      


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #9: 1025it [00:02, 387.45it/s, env_step=9216, len=26, n/ep=2, n/st=64, player_1/loss=215.823, player_2/loss=84.001, rew=0.00]                                                                                                                                                                                         


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #10: 1025it [00:02, 392.66it/s, env_step=10240, len=23, n/ep=3, n/st=64, player_1/loss=260.462, rew=-8.33]                                                                                                                                                                                                            


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #11: 1025it [00:02, 390.19it/s, env_step=11264, len=27, n/ep=3, n/st=64, player_1/loss=285.684, player_2/loss=125.907, rew=-8.33]                                                                                                                                                                                     


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #12: 1025it [00:02, 390.44it/s, env_step=12288, len=24, n/ep=3, n/st=64, player_1/loss=385.831, player_2/loss=129.903, rew=-25.00]                                                                                                                                                                                    


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #13: 1025it [00:02, 390.42it/s, env_step=13312, len=28, n/ep=2, n/st=64, player_1/loss=409.789, player_2/loss=146.076, rew=-25.00]                                                                                                                                                                                    


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #14: 1025it [00:02, 391.06it/s, env_step=14336, len=25, n/ep=3, n/st=64, player_1/loss=305.181, player_2/loss=148.553, rew=-8.33]                                                                                                                                                                                     


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #15: 1025it [00:02, 390.49it/s, env_step=15360, len=26, n/ep=2, n/st=64, player_1/loss=275.000, player_2/loss=128.930, rew=-25.00]                                                                                                                                                                                    


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #16: 1025it [00:02, 392.34it/s, env_step=16384, len=27, n/ep=3, n/st=64, player_1/loss=282.353, player_2/loss=115.513, rew=-8.33]                                                                                                                                                                                     


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #17: 1025it [00:02, 394.25it/s, env_step=17408, len=27, n/ep=2, n/st=64, player_1/loss=272.972, player_2/loss=116.170, rew=0.00]                                                                                                                                                                                      


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #18: 1025it [00:02, 391.57it/s, env_step=18432, len=25, n/ep=3, n/st=64, player_1/loss=265.773, player_2/loss=111.705, rew=-8.33]                                                                                                                                                                                     


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #19: 1025it [00:02, 389.79it/s, env_step=19456, len=29, n/ep=2, n/st=64, player_1/loss=297.876, player_2/loss=111.039, rew=-25.00]                                                                                                                                                                                    


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #20: 1025it [00:02, 390.80it/s, env_step=20480, len=24, n/ep=3, n/st=64, player_1/loss=278.204, player_2/loss=124.383, rew=-8.33]                                                                                                                                                                                     


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #21: 1025it [00:02, 392.10it/s, env_step=21504, len=26, n/ep=2, n/st=64, player_1/loss=193.088, player_2/loss=119.401, rew=-25.00]                                                                                                                                                                                    


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #22: 1025it [00:02, 390.72it/s, env_step=22528, len=28, n/ep=2, n/st=64, player_1/loss=265.476, player_2/loss=103.524, rew=25.00]                                                                                                                                                                                     


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #23: 1025it [00:02, 392.61it/s, env_step=23552, len=26, n/ep=2, n/st=64, player_1/loss=268.298, player_2/loss=146.795, rew=-25.00]                                                                                                                                                                                    


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #24: 1025it [00:02, 388.96it/s, env_step=24576, len=29, n/ep=2, n/st=64, player_1/loss=203.153, player_2/loss=146.796, rew=0.00]                                                                                                                                                                                      


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #25: 1025it [00:02, 388.24it/s, env_step=25600, len=30, n/ep=2, n/st=64, player_1/loss=212.075, player_2/loss=108.032, rew=0.00]                                                                                                                                                                                      


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #26: 1025it [00:02, 389.36it/s, env_step=26624, len=25, n/ep=3, n/st=64, player_1/loss=287.336, player_2/loss=119.230, rew=8.33]                                                                                                                                                                                      


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #27: 1025it [00:02, 392.28it/s, env_step=27648, len=26, n/ep=2, n/st=64, player_1/loss=297.621, player_2/loss=102.309, rew=-25.00]                                                                                                                                                                                    


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #28: 1025it [00:02, 391.48it/s, env_step=28672, len=26, n/ep=3, n/st=64, player_1/loss=268.120, player_2/loss=114.074, rew=-25.00]                                                                                                                                                                                    


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #29: 1025it [00:02, 391.50it/s, env_step=29696, len=31, n/ep=2, n/st=64, player_1/loss=294.026, player_2/loss=151.166, rew=-25.00]                                                                                                                                                                                    


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #30: 1025it [00:02, 391.63it/s, env_step=30720, len=26, n/ep=2, n/st=64, player_1/loss=267.572, player_2/loss=150.123, rew=0.00]                                                                                                                                                                                      


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #31: 1025it [00:02, 376.06it/s, env_step=31744, len=25, n/ep=2, n/st=64, player_1/loss=239.471, player_2/loss=125.617, rew=-25.00]                                                                                                                                                                                    


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #32: 1025it [00:02, 372.52it/s, env_step=32768, len=27, n/ep=2, n/st=64, player_1/loss=374.496, player_2/loss=130.046, rew=0.00]                                                                                                                                                                                      


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #33: 1025it [00:02, 389.74it/s, env_step=33792, len=27, n/ep=2, n/st=64, player_1/loss=336.152, player_2/loss=115.121, rew=-25.00]                                                                                                                                                                                    


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #34: 1025it [00:02, 390.47it/s, env_step=34816, len=27, n/ep=2, n/st=64, player_1/loss=232.809, player_2/loss=141.401, rew=0.00]                                                                                                                                                                                      


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #35: 1025it [00:02, 390.68it/s, env_step=35840, len=25, n/ep=2, n/st=64, player_1/loss=243.288, player_2/loss=148.763, rew=-25.00]                                                                                                                                                                                    


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #36: 1025it [00:02, 391.45it/s, env_step=36864, len=26, n/ep=3, n/st=64, player_1/loss=290.928, player_2/loss=139.704, rew=-8.33]                                                                                                                                                                                     


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #37: 1025it [00:02, 392.43it/s, env_step=37888, len=26, n/ep=2, n/st=64, player_1/loss=235.877, player_2/loss=136.754, rew=-25.00]                                                                                                                                                                                    


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #38: 1025it [00:02, 392.03it/s, env_step=38912, len=26, n/ep=2, n/st=64, player_1/loss=269.291, player_2/loss=123.130, rew=-25.00]                                                                                                                                                                                    


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #39: 1025it [00:02, 389.87it/s, env_step=39936, len=24, n/ep=3, n/st=64, player_1/loss=271.205, player_2/loss=105.034, rew=-25.00]                                                                                                                                                                                    


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #40: 1025it [00:02, 390.22it/s, env_step=40960, len=26, n/ep=3, n/st=64, player_1/loss=237.017, player_2/loss=103.737, rew=-25.00]                                                                                                                                                                                    


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #41: 1025it [00:02, 390.95it/s, env_step=41984, len=28, n/ep=2, n/st=64, player_1/loss=274.830, player_2/loss=117.811, rew=-25.00]                                                                                                                                                                                    


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #42: 1025it [00:02, 392.59it/s, env_step=43008, len=23, n/ep=2, n/st=64, player_2/loss=122.454, rew=0.00]                                                                                                                                                                                                             


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #43: 1025it [00:02, 390.64it/s, env_step=44032, len=26, n/ep=3, n/st=64, player_1/loss=326.904, player_2/loss=113.028, rew=-8.33]                                                                                                                                                                                     


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #44: 1025it [00:02, 389.11it/s, env_step=45056, len=25, n/ep=3, n/st=64, player_1/loss=289.249, player_2/loss=110.236, rew=-8.33]                                                                                                                                                                                     


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #45: 1025it [00:02, 388.36it/s, env_step=46080, len=26, n/ep=3, n/st=64, player_1/loss=260.818, player_2/loss=111.603, rew=-25.00]                                                                                                                                                                                    


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #46: 1025it [00:02, 393.16it/s, env_step=47104, len=26, n/ep=3, n/st=64, player_1/loss=242.414, player_2/loss=91.524, rew=-25.00]                                                                                                                                                                                     


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #47: 1025it [00:02, 387.32it/s, env_step=48128, len=28, n/ep=3, n/st=64, player_1/loss=222.628, player_2/loss=134.370, rew=-8.33]                                                                                                                                                                                     


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #48: 1025it [00:02, 389.77it/s, env_step=49152, len=25, n/ep=2, n/st=64, player_1/loss=264.741, player_2/loss=140.801, rew=0.00]                                                                                                                                                                                      


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #49: 1025it [00:02, 387.19it/s, env_step=50176, len=24, n/ep=3, n/st=64, player_1/loss=321.972, player_2/loss=94.830, rew=8.33]                                                                                                                                                                                       


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #1


Epoch #1: 1025it [00:02, 384.30it/s, env_step=1024, len=26, n/ep=2, n/st=64, player_1/loss=251.646, player_2/loss=92.209, rew=0.00]                                                                                                                                                                                         


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 393.03it/s, env_step=2048, len=26, n/ep=3, n/st=64, player_1/loss=257.358, player_2/loss=87.883, rew=8.33]                                                                                                                                                                                         


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 390.13it/s, env_step=3072, len=26, n/ep=2, n/st=64, player_1/loss=237.122, player_2/loss=108.873, rew=25.00]                                                                                                                                                                                       


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 389.96it/s, env_step=4096, len=30, n/ep=2, n/st=64, player_1/loss=238.480, player_2/loss=166.957, rew=0.00]                                                                                                                                                                                        


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 392.30it/s, env_step=5120, len=26, n/ep=2, n/st=64, player_1/loss=299.428, player_2/loss=162.608, rew=0.00]                                                                                                                                                                                        


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 390.04it/s, env_step=6144, len=24, n/ep=3, n/st=64, player_1/loss=254.923, player_2/loss=106.695, rew=8.33]                                                                                                                                                                                        


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 391.46it/s, env_step=7168, len=22, n/ep=3, n/st=64, player_1/loss=207.201, player_2/loss=101.354, rew=8.33]                                                                                                                                                                                        


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 391.89it/s, env_step=8192, len=27, n/ep=2, n/st=64, player_1/loss=207.447, player_2/loss=114.940, rew=0.00]                                                                                                                                                                                        


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 390.47it/s, env_step=9216, len=27, n/ep=2, n/st=64, player_1/loss=212.453, player_2/loss=137.661, rew=0.00]                                                                                                                                                                                        


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 392.00it/s, env_step=10240, len=25, n/ep=2, n/st=64, player_2/loss=125.625, rew=0.00]                                                                                                                                                                                                             


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 391.06it/s, env_step=11264, len=25, n/ep=2, n/st=64, player_1/loss=228.378, player_2/loss=111.392, rew=25.00]                                                                                                                                                                                     


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 389.42it/s, env_step=12288, len=26, n/ep=3, n/st=64, player_1/loss=245.639, player_2/loss=97.179, rew=-8.33]                                                                                                                                                                                      


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 391.60it/s, env_step=13312, len=28, n/ep=2, n/st=64, player_1/loss=221.702, player_2/loss=147.459, rew=25.00]                                                                                                                                                                                     


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 392.44it/s, env_step=14336, len=25, n/ep=3, n/st=64, player_1/loss=194.813, player_2/loss=177.823, rew=8.33]                                                                                                                                                                                      


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 389.44it/s, env_step=15360, len=25, n/ep=3, n/st=64, player_1/loss=228.164, player_2/loss=133.708, rew=25.00]                                                                                                                                                                                     


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 388.20it/s, env_step=16384, len=27, n/ep=2, n/st=64, player_1/loss=227.879, player_2/loss=130.634, rew=0.00]                                                                                                                                                                                      


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 391.45it/s, env_step=17408, len=27, n/ep=2, n/st=64, player_1/loss=243.200, player_2/loss=125.940, rew=0.00]                                                                                                                                                                                      


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 392.60it/s, env_step=18432, len=29, n/ep=2, n/st=64, player_1/loss=209.698, player_2/loss=108.983, rew=25.00]                                                                                                                                                                                     


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 391.61it/s, env_step=19456, len=25, n/ep=3, n/st=64, player_1/loss=165.310, player_2/loss=141.216, rew=-8.33]                                                                                                                                                                                     


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 390.11it/s, env_step=20480, len=27, n/ep=2, n/st=64, player_1/loss=160.150, player_2/loss=144.996, rew=25.00]                                                                                                                                                                                     


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 389.04it/s, env_step=21504, len=27, n/ep=2, n/st=64, player_1/loss=201.339, player_2/loss=121.884, rew=25.00]                                                                                                                                                                                     


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 390.87it/s, env_step=22528, len=25, n/ep=3, n/st=64, player_1/loss=309.304, player_2/loss=130.243, rew=8.33]                                                                                                                                                                                      


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 389.86it/s, env_step=23552, len=26, n/ep=2, n/st=64, player_1/loss=341.092, player_2/loss=141.025, rew=0.00]                                                                                                                                                                                      


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 389.39it/s, env_step=24576, len=26, n/ep=2, n/st=64, player_1/loss=289.589, player_2/loss=128.528, rew=25.00]                                                                                                                                                                                     


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 388.85it/s, env_step=25600, len=26, n/ep=2, n/st=64, player_1/loss=224.197, player_2/loss=115.680, rew=0.00]                                                                                                                                                                                      


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 390.59it/s, env_step=26624, len=25, n/ep=2, n/st=64, player_1/loss=197.111, player_2/loss=149.559, rew=0.00]                                                                                                                                                                                      


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 390.21it/s, env_step=27648, len=22, n/ep=3, n/st=64, player_1/loss=204.256, player_2/loss=146.207, rew=25.00]                                                                                                                                                                                     


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 391.49it/s, env_step=28672, len=27, n/ep=2, n/st=64, player_1/loss=184.399, player_2/loss=94.824, rew=25.00]                                                                                                                                                                                      


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 388.99it/s, env_step=29696, len=28, n/ep=3, n/st=64, player_1/loss=154.026, player_2/loss=115.081, rew=8.33]                                                                                                                                                                                      


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 390.72it/s, env_step=30720, len=26, n/ep=2, n/st=64, player_1/loss=143.019, player_2/loss=152.309, rew=25.00]                                                                                                                                                                                     


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 390.82it/s, env_step=31744, len=27, n/ep=2, n/st=64, player_1/loss=186.727, player_2/loss=156.025, rew=-25.00]                                                                                                                                                                                    


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 391.46it/s, env_step=32768, len=26, n/ep=2, n/st=64, player_1/loss=245.379, player_2/loss=116.881, rew=25.00]                                                                                                                                                                                     


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 389.64it/s, env_step=33792, len=27, n/ep=3, n/st=64, player_1/loss=257.701, player_2/loss=100.875, rew=8.33]                                                                                                                                                                                      


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 394.10it/s, env_step=34816, len=24, n/ep=2, n/st=64, player_1/loss=225.423, player_2/loss=120.826, rew=0.00]                                                                                                                                                                                      


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 392.76it/s, env_step=35840, len=26, n/ep=3, n/st=64, player_1/loss=203.949, player_2/loss=113.162, rew=25.00]                                                                                                                                                                                     


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 387.51it/s, env_step=36864, len=24, n/ep=3, n/st=64, player_1/loss=190.232, player_2/loss=99.165, rew=8.33]                                                                                                                                                                                       


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 389.07it/s, env_step=37888, len=26, n/ep=2, n/st=64, player_1/loss=225.061, player_2/loss=105.692, rew=25.00]                                                                                                                                                                                     


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 391.80it/s, env_step=38912, len=25, n/ep=2, n/st=64, player_1/loss=238.636, player_2/loss=95.403, rew=25.00]                                                                                                                                                                                      


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 391.22it/s, env_step=39936, len=27, n/ep=2, n/st=64, player_1/loss=229.608, player_2/loss=142.914, rew=25.00]                                                                                                                                                                                     


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 390.30it/s, env_step=40960, len=28, n/ep=2, n/st=64, player_1/loss=213.275, player_2/loss=150.841, rew=0.00]                                                                                                                                                                                      


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 391.93it/s, env_step=41984, len=25, n/ep=2, n/st=64, player_1/loss=241.064, player_2/loss=117.821, rew=0.00]                                                                                                                                                                                      


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 390.09it/s, env_step=43008, len=25, n/ep=3, n/st=64, player_1/loss=212.145, player_2/loss=118.841, rew=8.33]                                                                                                                                                                                      


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 389.17it/s, env_step=44032, len=26, n/ep=2, n/st=64, player_1/loss=181.584, player_2/loss=87.043, rew=-25.00]                                                                                                                                                                                     


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 391.55it/s, env_step=45056, len=28, n/ep=2, n/st=64, player_1/loss=226.173, player_2/loss=93.175, rew=25.00]                                                                                                                                                                                      


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 383.40it/s, env_step=46080, len=26, n/ep=3, n/st=64, player_1/loss=203.527, player_2/loss=145.888, rew=-8.33]                                                                                                                                                                                     


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 385.07it/s, env_step=47104, len=24, n/ep=3, n/st=64, player_1/loss=180.838, player_2/loss=168.955, rew=8.33]                                                                                                                                                                                      


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 393.34it/s, env_step=48128, len=27, n/ep=2, n/st=64, player_1/loss=229.321, player_2/loss=129.524, rew=25.00]                                                                                                                                                                                     


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 391.61it/s, env_step=49152, len=24, n/ep=2, n/st=64, player_1/loss=306.832, player_2/loss=94.148, rew=0.00]                                                                                                                                                                                       


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 386.10it/s, env_step=50176, len=26, n/ep=3, n/st=64, player_1/loss=245.269, player_2/loss=105.454, rew=-8.33]                                                                                                                                                                                     


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 373.18it/s, env_step=1024, len=27, n/ep=2, n/st=64, player_1/loss=127.374, player_2/loss=145.676, rew=-25.00]                                                                                                                                                                                      


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 389.93it/s, env_step=2048, len=26, n/ep=3, n/st=64, player_1/loss=213.514, player_2/loss=111.478, rew=-25.00]                                                                                                                                                                                      


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 391.76it/s, env_step=3072, len=26, n/ep=3, n/st=64, player_1/loss=227.554, player_2/loss=115.763, rew=-8.33]                                                                                                                                                                                       


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 388.89it/s, env_step=4096, len=26, n/ep=2, n/st=64, player_1/loss=202.617, player_2/loss=76.675, rew=-25.00]                                                                                                                                                                                       


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 390.21it/s, env_step=5120, len=26, n/ep=3, n/st=64, player_1/loss=223.350, player_2/loss=110.600, rew=-25.00]                                                                                                                                                                                      


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 386.32it/s, env_step=6144, len=26, n/ep=2, n/st=64, player_1/loss=194.610, player_2/loss=118.009, rew=0.00]                                                                                                                                                                                        


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 391.56it/s, env_step=7168, len=27, n/ep=3, n/st=64, player_1/loss=228.279, player_2/loss=77.045, rew=-25.00]                                                                                                                                                                                       


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 390.73it/s, env_step=8192, len=25, n/ep=3, n/st=64, player_1/loss=237.482, player_2/loss=114.908, rew=-8.33]                                                                                                                                                                                       


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 392.53it/s, env_step=9216, len=23, n/ep=3, n/st=64, player_1/loss=239.932, player_2/loss=136.706, rew=-8.33]                                                                                                                                                                                       


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #10: 1025it [00:02, 391.59it/s, env_step=10240, len=25, n/ep=2, n/st=64, player_1/loss=211.651, player_2/loss=105.078, rew=0.00]                                                                                                                                                                                      


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #11: 1025it [00:02, 391.10it/s, env_step=11264, len=26, n/ep=3, n/st=64, player_1/loss=197.122, player_2/loss=130.721, rew=-25.00]                                                                                                                                                                                    


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #12: 1025it [00:02, 391.54it/s, env_step=12288, len=26, n/ep=3, n/st=64, player_1/loss=250.341, player_2/loss=127.840, rew=-25.00]                                                                                                                                                                                    


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #13: 1025it [00:02, 393.60it/s, env_step=13312, len=26, n/ep=3, n/st=64, player_1/loss=282.236, player_2/loss=128.405, rew=25.00]                                                                                                                                                                                     


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #14: 1025it [00:02, 390.98it/s, env_step=14336, len=27, n/ep=3, n/st=64, player_1/loss=226.064, player_2/loss=160.434, rew=-8.33]                                                                                                                                                                                     


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #15: 1025it [00:02, 392.20it/s, env_step=15360, len=26, n/ep=3, n/st=64, player_1/loss=177.660, player_2/loss=164.814, rew=-25.00]                                                                                                                                                                                    


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #16: 1025it [00:02, 393.06it/s, env_step=16384, len=29, n/ep=2, n/st=64, player_1/loss=235.692, player_2/loss=85.530, rew=-25.00]                                                                                                                                                                                     


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #17: 1025it [00:02, 392.93it/s, env_step=17408, len=27, n/ep=3, n/st=64, player_1/loss=261.012, player_2/loss=113.078, rew=8.33]                                                                                                                                                                                      


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #18: 1025it [00:02, 392.52it/s, env_step=18432, len=25, n/ep=2, n/st=64, player_1/loss=214.433, player_2/loss=127.801, rew=-25.00]                                                                                                                                                                                    


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #19: 1025it [00:02, 390.19it/s, env_step=19456, len=29, n/ep=2, n/st=64, player_1/loss=189.611, player_2/loss=105.008, rew=-25.00]                                                                                                                                                                                    


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #20: 1025it [00:02, 390.69it/s, env_step=20480, len=27, n/ep=2, n/st=64, player_1/loss=170.235, player_2/loss=99.933, rew=0.00]                                                                                                                                                                                       


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #21: 1025it [00:02, 389.54it/s, env_step=21504, len=26, n/ep=2, n/st=64, player_1/loss=203.269, player_2/loss=88.218, rew=0.00]                                                                                                                                                                                       


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #22: 1025it [00:02, 391.06it/s, env_step=22528, len=29, n/ep=3, n/st=64, player_1/loss=223.825, player_2/loss=94.384, rew=-8.33]                                                                                                                                                                                      


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #23: 1025it [00:02, 392.10it/s, env_step=23552, len=26, n/ep=3, n/st=64, player_1/loss=198.843, player_2/loss=109.797, rew=-25.00]                                                                                                                                                                                    


Epoch #23: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #24: 1025it [00:02, 392.54it/s, env_step=24576, len=29, n/ep=2, n/st=64, player_1/loss=241.089, player_2/loss=121.814, rew=-25.00]                                                                                                                                                                                    


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #25: 1025it [00:02, 389.16it/s, env_step=25600, len=26, n/ep=3, n/st=64, player_1/loss=209.960, player_2/loss=111.961, rew=-25.00]                                                                                                                                                                                    


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #26: 1025it [00:02, 391.19it/s, env_step=26624, len=26, n/ep=2, n/st=64, player_1/loss=173.692, player_2/loss=92.667, rew=0.00]                                                                                                                                                                                       


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #27: 1025it [00:02, 392.07it/s, env_step=27648, len=27, n/ep=2, n/st=64, player_1/loss=211.365, player_2/loss=91.857, rew=-25.00]                                                                                                                                                                                     


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #28: 1025it [00:02, 393.07it/s, env_step=28672, len=26, n/ep=2, n/st=64, player_1/loss=196.154, player_2/loss=140.745, rew=-25.00]                                                                                                                                                                                    


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #29: 1025it [00:02, 387.92it/s, env_step=29696, len=31, n/ep=2, n/st=64, player_1/loss=168.132, player_2/loss=140.247, rew=-25.00]                                                                                                                                                                                    


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #30: 1025it [00:02, 390.53it/s, env_step=30720, len=28, n/ep=2, n/st=64, player_1/loss=225.079, player_2/loss=82.473, rew=-25.00]                                                                                                                                                                                     


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #31: 1025it [00:02, 391.51it/s, env_step=31744, len=22, n/ep=2, n/st=64, player_1/loss=247.565, player_2/loss=82.770, rew=0.00]                                                                                                                                                                                       


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #32: 1025it [00:02, 388.48it/s, env_step=32768, len=25, n/ep=2, n/st=64, player_1/loss=207.327, rew=0.00]                                                                                                                                                                                                             


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #33: 1025it [00:02, 391.09it/s, env_step=33792, len=26, n/ep=2, n/st=64, player_1/loss=174.488, player_2/loss=121.626, rew=-25.00]                                                                                                                                                                                    


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #34: 1025it [00:02, 389.98it/s, env_step=34816, len=26, n/ep=3, n/st=64, player_1/loss=191.129, player_2/loss=125.272, rew=-8.33]                                                                                                                                                                                     


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #35: 1025it [00:02, 389.94it/s, env_step=35840, len=26, n/ep=2, n/st=64, player_1/loss=157.884, player_2/loss=123.183, rew=0.00]                                                                                                                                                                                      


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #36: 1025it [00:02, 392.71it/s, env_step=36864, len=28, n/ep=2, n/st=64, player_1/loss=195.897, player_2/loss=132.441, rew=-25.00]                                                                                                                                                                                    


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #37: 1025it [00:02, 389.67it/s, env_step=37888, len=24, n/ep=2, n/st=64, player_1/loss=244.307, player_2/loss=110.734, rew=0.00]                                                                                                                                                                                      


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #38: 1025it [00:02, 392.59it/s, env_step=38912, len=25, n/ep=2, n/st=64, player_1/loss=227.867, player_2/loss=76.864, rew=-25.00]                                                                                                                                                                                     


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #39: 1025it [00:02, 387.40it/s, env_step=39936, len=27, n/ep=2, n/st=64, player_1/loss=195.729, player_2/loss=96.730, rew=0.00]                                                                                                                                                                                       


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #40: 1025it [00:02, 383.38it/s, env_step=40960, len=26, n/ep=3, n/st=64, player_1/loss=217.702, player_2/loss=103.561, rew=-25.00]                                                                                                                                                                                    


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #41: 1025it [00:02, 386.52it/s, env_step=41984, len=25, n/ep=3, n/st=64, player_1/loss=206.278, player_2/loss=119.361, rew=-8.33]                                                                                                                                                                                     


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #42: 1025it [00:02, 394.95it/s, env_step=43008, len=26, n/ep=2, n/st=64, player_2/loss=150.271, rew=25.00]                                                                                                                                                                                                            


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #43: 1025it [00:02, 389.15it/s, env_step=44032, len=25, n/ep=3, n/st=64, player_1/loss=245.802, player_2/loss=148.098, rew=-25.00]                                                                                                                                                                                    


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #44: 1025it [00:02, 391.91it/s, env_step=45056, len=22, n/ep=3, n/st=64, player_1/loss=190.308, player_2/loss=131.393, rew=-8.33]                                                                                                                                                                                     


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #45: 1025it [00:02, 392.37it/s, env_step=46080, len=24, n/ep=3, n/st=64, player_1/loss=181.790, player_2/loss=101.029, rew=-25.00]                                                                                                                                                                                    


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #46: 1025it [00:02, 388.96it/s, env_step=47104, len=27, n/ep=2, n/st=64, player_1/loss=202.029, rew=0.00]                                                                                                                                                                                                             


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #47: 1025it [00:02, 389.80it/s, env_step=48128, len=26, n/ep=2, n/st=64, player_1/loss=231.470, player_2/loss=113.852, rew=-25.00]                                                                                                                                                                                    


Epoch #47: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #48: 1025it [00:02, 391.27it/s, env_step=49152, len=27, n/ep=2, n/st=64, player_1/loss=261.067, player_2/loss=101.518, rew=-25.00]                                                                                                                                                                                    


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #49: 1025it [00:02, 391.21it/s, env_step=50176, len=25, n/ep=2, n/st=64, player_1/loss=235.792, player_2/loss=86.407, rew=-25.00]                                                                                                                                                                                     


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #9


Epoch #1: 1025it [00:02, 389.81it/s, env_step=1024, len=27, n/ep=2, n/st=64, player_1/loss=179.789, player_2/loss=86.003, rew=0.00]                                                                                                                                                                                         


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 391.60it/s, env_step=2048, len=26, n/ep=3, n/st=64, player_1/loss=227.600, player_2/loss=79.139, rew=25.00]                                                                                                                                                                                        


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 391.71it/s, env_step=3072, len=27, n/ep=3, n/st=64, player_1/loss=268.935, player_2/loss=68.583, rew=-8.33]                                                                                                                                                                                        


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 392.68it/s, env_step=4096, len=26, n/ep=2, n/st=64, player_1/loss=213.778, player_2/loss=102.236, rew=25.00]                                                                                                                                                                                       


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 393.02it/s, env_step=5120, len=26, n/ep=2, n/st=64, player_1/loss=171.748, player_2/loss=112.938, rew=25.00]                                                                                                                                                                                       


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 391.15it/s, env_step=6144, len=27, n/ep=2, n/st=64, player_1/loss=184.330, player_2/loss=81.101, rew=25.00]                                                                                                                                                                                        


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 395.18it/s, env_step=7168, len=24, n/ep=2, n/st=64, player_1/loss=199.580, player_2/loss=71.571, rew=25.00]                                                                                                                                                                                        


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 395.38it/s, env_step=8192, len=26, n/ep=3, n/st=64, player_1/loss=268.655, player_2/loss=77.838, rew=25.00]                                                                                                                                                                                        


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 390.43it/s, env_step=9216, len=21, n/ep=3, n/st=64, player_1/loss=271.159, player_2/loss=95.574, rew=25.00]                                                                                                                                                                                        


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 392.40it/s, env_step=10240, len=21, n/ep=3, n/st=64, player_1/loss=260.127, rew=8.33]                                                                                                                                                                                                             


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 393.18it/s, env_step=11264, len=23, n/ep=2, n/st=64, player_1/loss=247.800, player_2/loss=105.660, rew=0.00]                                                                                                                                                                                      


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 389.06it/s, env_step=12288, len=26, n/ep=2, n/st=64, player_1/loss=215.045, player_2/loss=91.663, rew=25.00]                                                                                                                                                                                      


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 391.72it/s, env_step=13312, len=26, n/ep=2, n/st=64, player_1/loss=228.540, player_2/loss=85.896, rew=25.00]                                                                                                                                                                                      


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 391.02it/s, env_step=14336, len=23, n/ep=2, n/st=64, player_1/loss=224.721, player_2/loss=85.229, rew=0.00]                                                                                                                                                                                       


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 392.73it/s, env_step=15360, len=28, n/ep=2, n/st=64, player_1/loss=202.792, player_2/loss=79.678, rew=25.00]                                                                                                                                                                                      


Epoch #15: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 393.82it/s, env_step=16384, len=25, n/ep=2, n/st=64, player_1/loss=208.176, player_2/loss=67.736, rew=25.00]                                                                                                                                                                                      


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 392.98it/s, env_step=17408, len=22, n/ep=2, n/st=64, player_1/loss=239.895, player_2/loss=76.881, rew=0.00]                                                                                                                                                                                       


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 389.45it/s, env_step=18432, len=24, n/ep=3, n/st=64, player_1/loss=227.983, player_2/loss=81.085, rew=8.33]                                                                                                                                                                                       


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 393.43it/s, env_step=19456, len=25, n/ep=2, n/st=64, player_1/loss=226.738, player_2/loss=76.886, rew=25.00]                                                                                                                                                                                      


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 390.23it/s, env_step=20480, len=26, n/ep=3, n/st=64, player_1/loss=248.897, player_2/loss=71.317, rew=25.00]                                                                                                                                                                                      


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 389.13it/s, env_step=21504, len=25, n/ep=3, n/st=64, player_1/loss=231.731, player_2/loss=77.682, rew=25.00]                                                                                                                                                                                      


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 391.52it/s, env_step=22528, len=29, n/ep=2, n/st=64, player_1/loss=238.795, player_2/loss=72.839, rew=25.00]                                                                                                                                                                                      


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 392.21it/s, env_step=23552, len=25, n/ep=3, n/st=64, player_1/loss=302.377, player_2/loss=70.292, rew=8.33]                                                                                                                                                                                       


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 392.44it/s, env_step=24576, len=25, n/ep=2, n/st=64, player_1/loss=236.068, player_2/loss=101.969, rew=25.00]                                                                                                                                                                                     


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 390.34it/s, env_step=25600, len=24, n/ep=3, n/st=64, player_1/loss=173.986, player_2/loss=115.834, rew=25.00]                                                                                                                                                                                     


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 392.83it/s, env_step=26624, len=30, n/ep=2, n/st=64, player_1/loss=222.721, player_2/loss=96.392, rew=0.00]                                                                                                                                                                                       


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 392.18it/s, env_step=27648, len=30, n/ep=2, n/st=64, player_1/loss=197.973, player_2/loss=90.175, rew=0.00]                                                                                                                                                                                       


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 389.96it/s, env_step=28672, len=25, n/ep=2, n/st=64, player_1/loss=201.930, player_2/loss=120.992, rew=25.00]                                                                                                                                                                                     


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 391.66it/s, env_step=29696, len=23, n/ep=3, n/st=64, player_1/loss=235.068, rew=8.33]                                                                                                                                                                                                             


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 390.16it/s, env_step=30720, len=24, n/ep=3, n/st=64, player_1/loss=231.473, player_2/loss=114.158, rew=-8.33]                                                                                                                                                                                     


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 391.16it/s, env_step=31744, len=26, n/ep=3, n/st=64, player_1/loss=253.085, player_2/loss=115.251, rew=25.00]                                                                                                                                                                                     


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 390.80it/s, env_step=32768, len=25, n/ep=2, n/st=64, player_1/loss=260.361, player_2/loss=78.184, rew=25.00]                                                                                                                                                                                      


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 391.02it/s, env_step=33792, len=31, n/ep=2, n/st=64, player_1/loss=201.672, player_2/loss=72.413, rew=25.00]                                                                                                                                                                                      


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 389.58it/s, env_step=34816, len=25, n/ep=3, n/st=64, player_1/loss=187.103, player_2/loss=93.114, rew=25.00]                                                                                                                                                                                      


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 390.91it/s, env_step=35840, len=20, n/ep=3, n/st=64, player_1/loss=208.099, player_2/loss=105.176, rew=25.00]                                                                                                                                                                                     


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 385.35it/s, env_step=36864, len=25, n/ep=3, n/st=64, player_1/loss=198.661, player_2/loss=94.476, rew=25.00]                                                                                                                                                                                      


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 383.86it/s, env_step=37888, len=26, n/ep=3, n/st=64, player_2/loss=99.900, rew=25.00]                                                                                                                                                                                                             


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 390.67it/s, env_step=38912, len=24, n/ep=2, n/st=64, player_1/loss=240.694, player_2/loss=82.481, rew=0.00]                                                                                                                                                                                       


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 388.73it/s, env_step=39936, len=22, n/ep=3, n/st=64, player_1/loss=248.091, player_2/loss=66.030, rew=25.00]                                                                                                                                                                                      


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 391.28it/s, env_step=40960, len=22, n/ep=2, n/st=64, player_1/loss=241.331, player_2/loss=73.444, rew=0.00]                                                                                                                                                                                       


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 391.23it/s, env_step=41984, len=23, n/ep=3, n/st=64, player_1/loss=348.244, player_2/loss=69.001, rew=25.00]                                                                                                                                                                                      


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 390.57it/s, env_step=43008, len=23, n/ep=3, n/st=64, player_1/loss=306.048, player_2/loss=95.179, rew=8.33]                                                                                                                                                                                       


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 390.06it/s, env_step=44032, len=21, n/ep=3, n/st=64, player_1/loss=179.409, player_2/loss=102.605, rew=25.00]                                                                                                                                                                                     


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 390.70it/s, env_step=45056, len=22, n/ep=3, n/st=64, player_1/loss=191.073, player_2/loss=77.926, rew=8.33]                                                                                                                                                                                       


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 391.24it/s, env_step=46080, len=22, n/ep=3, n/st=64, player_1/loss=192.622, player_2/loss=78.374, rew=25.00]                                                                                                                                                                                      


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 390.46it/s, env_step=47104, len=25, n/ep=2, n/st=64, player_1/loss=184.982, player_2/loss=66.156, rew=25.00]                                                                                                                                                                                      


Epoch #46: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 389.67it/s, env_step=48128, len=27, n/ep=2, n/st=64, player_1/loss=289.206, player_2/loss=83.757, rew=0.00]                                                                                                                                                                                       


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 389.98it/s, env_step=49152, len=19, n/ep=3, n/st=64, player_1/loss=263.772, player_2/loss=102.184, rew=8.33]                                                                                                                                                                                      


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 390.26it/s, env_step=50176, len=23, n/ep=2, n/st=64, player_1/loss=193.380, player_2/loss=86.841, rew=25.00]                                                                                                                                                                                      


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 386.57it/s, env_step=1024, len=25, n/ep=3, n/st=64, player_1/loss=152.611, player_2/loss=122.607, rew=8.33]                                                                                                                                                                                        


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 391.41it/s, env_step=2048, len=25, n/ep=2, n/st=64, player_1/loss=235.221, player_2/loss=110.826, rew=-25.00]                                                                                                                                                                                      


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 390.14it/s, env_step=3072, len=22, n/ep=3, n/st=64, player_1/loss=290.389, player_2/loss=97.031, rew=-25.00]                                                                                                                                                                                       


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 394.38it/s, env_step=4096, len=24, n/ep=2, n/st=64, player_1/loss=268.847, player_2/loss=98.950, rew=-25.00]                                                                                                                                                                                       


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 392.08it/s, env_step=5120, len=28, n/ep=3, n/st=64, player_1/loss=233.145, player_2/loss=113.662, rew=-8.33]                                                                                                                                                                                       


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 392.56it/s, env_step=6144, len=20, n/ep=3, n/st=64, player_1/loss=225.873, player_2/loss=107.970, rew=-25.00]                                                                                                                                                                                      


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 388.32it/s, env_step=7168, len=27, n/ep=2, n/st=64, player_1/loss=221.860, player_2/loss=97.525, rew=-25.00]                                                                                                                                                                                       


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 389.25it/s, env_step=8192, len=25, n/ep=2, n/st=64, player_1/loss=181.802, player_2/loss=102.161, rew=-25.00]                                                                                                                                                                                      


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 388.64it/s, env_step=9216, len=21, n/ep=3, n/st=64, player_1/loss=156.586, player_2/loss=95.484, rew=-8.33]                                                                                                                                                                                        


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 390.45it/s, env_step=10240, len=22, n/ep=3, n/st=64, player_1/loss=171.194, player_2/loss=113.934, rew=-25.00]                                                                                                                                                                                    


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 389.93it/s, env_step=11264, len=20, n/ep=4, n/st=64, player_1/loss=237.264, player_2/loss=96.374, rew=-25.00]                                                                                                                                                                                     


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 391.23it/s, env_step=12288, len=24, n/ep=3, n/st=64, player_1/loss=227.231, player_2/loss=71.942, rew=-25.00]                                                                                                                                                                                     


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 391.77it/s, env_step=13312, len=22, n/ep=3, n/st=64, player_1/loss=238.594, player_2/loss=115.364, rew=-25.00]                                                                                                                                                                                    


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 389.84it/s, env_step=14336, len=24, n/ep=3, n/st=64, player_1/loss=231.499, player_2/loss=112.478, rew=-25.00]                                                                                                                                                                                    


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 384.85it/s, env_step=15360, len=22, n/ep=2, n/st=64, player_1/loss=225.703, player_2/loss=100.793, rew=0.00]                                                                                                                                                                                      


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 389.74it/s, env_step=16384, len=22, n/ep=3, n/st=64, player_1/loss=188.028, player_2/loss=102.210, rew=-25.00]                                                                                                                                                                                    


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 389.66it/s, env_step=17408, len=21, n/ep=3, n/st=64, player_1/loss=216.405, player_2/loss=92.208, rew=-8.33]                                                                                                                                                                                      


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 390.60it/s, env_step=18432, len=22, n/ep=2, n/st=64, player_1/loss=276.165, player_2/loss=79.374, rew=0.00]                                                                                                                                                                                       


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #19: 1025it [00:02, 390.32it/s, env_step=19456, len=21, n/ep=3, n/st=64, player_1/loss=321.825, player_2/loss=65.119, rew=-25.00]                                                                                                                                                                                     


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #20: 1025it [00:02, 389.75it/s, env_step=20480, len=24, n/ep=2, n/st=64, player_1/loss=228.587, player_2/loss=79.238, rew=0.00]                                                                                                                                                                                       


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #21: 1025it [00:02, 389.29it/s, env_step=21504, len=21, n/ep=2, n/st=64, player_1/loss=191.859, player_2/loss=104.224, rew=-25.00]                                                                                                                                                                                    


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #22: 1025it [00:02, 390.01it/s, env_step=22528, len=19, n/ep=3, n/st=64, player_1/loss=230.650, player_2/loss=120.677, rew=-8.33]                                                                                                                                                                                     


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #23: 1025it [00:02, 389.86it/s, env_step=23552, len=23, n/ep=3, n/st=64, player_1/loss=230.788, player_2/loss=89.906, rew=-25.00]                                                                                                                                                                                     


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #24: 1025it [00:02, 389.46it/s, env_step=24576, len=22, n/ep=3, n/st=64, player_1/loss=216.064, player_2/loss=82.055, rew=-8.33]                                                                                                                                                                                      


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #25: 1025it [00:02, 389.36it/s, env_step=25600, len=22, n/ep=3, n/st=64, player_1/loss=227.212, player_2/loss=115.917, rew=-25.00]                                                                                                                                                                                    


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #26: 1025it [00:02, 392.05it/s, env_step=26624, len=23, n/ep=2, n/st=64, player_1/loss=232.252, player_2/loss=120.381, rew=-25.00]                                                                                                                                                                                    


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #27: 1025it [00:02, 392.02it/s, env_step=27648, len=23, n/ep=3, n/st=64, player_1/loss=186.696, player_2/loss=120.016, rew=-8.33]                                                                                                                                                                                     


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #28: 1025it [00:02, 390.01it/s, env_step=28672, len=22, n/ep=3, n/st=64, player_1/loss=196.602, player_2/loss=99.489, rew=-25.00]                                                                                                                                                                                     


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #29: 1025it [00:02, 386.84it/s, env_step=29696, len=30, n/ep=2, n/st=64, player_1/loss=226.875, player_2/loss=75.147, rew=-25.00]                                                                                                                                                                                     


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #30: 1025it [00:02, 391.50it/s, env_step=30720, len=22, n/ep=3, n/st=64, player_2/loss=101.270, rew=-8.33]                                                                                                                                                                                                            


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #31: 1025it [00:02, 390.78it/s, env_step=31744, len=24, n/ep=3, n/st=64, player_1/loss=260.040, player_2/loss=99.998, rew=-25.00]                                                                                                                                                                                     


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #32: 1025it [00:02, 384.01it/s, env_step=32768, len=24, n/ep=2, n/st=64, player_1/loss=314.833, player_2/loss=79.382, rew=-25.00]                                                                                                                                                                                     


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #33: 1025it [00:02, 386.30it/s, env_step=33792, len=21, n/ep=3, n/st=64, player_1/loss=290.411, player_2/loss=80.552, rew=-8.33]                                                                                                                                                                                      


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #34: 1025it [00:02, 389.04it/s, env_step=34816, len=23, n/ep=2, n/st=64, player_1/loss=184.687, player_2/loss=72.097, rew=-25.00]                                                                                                                                                                                     


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #35: 1025it [00:02, 388.27it/s, env_step=35840, len=24, n/ep=3, n/st=64, player_1/loss=192.883, player_2/loss=76.028, rew=-8.33]                                                                                                                                                                                      


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #36: 1025it [00:02, 390.54it/s, env_step=36864, len=27, n/ep=2, n/st=64, player_1/loss=245.093, player_2/loss=82.274, rew=0.00]                                                                                                                                                                                       


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #37: 1025it [00:02, 389.99it/s, env_step=37888, len=20, n/ep=3, n/st=64, player_1/loss=274.718, player_2/loss=107.952, rew=-8.33]                                                                                                                                                                                     


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #38: 1025it [00:02, 391.98it/s, env_step=38912, len=21, n/ep=3, n/st=64, player_1/loss=234.857, player_2/loss=96.814, rew=-8.33]                                                                                                                                                                                      


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #39: 1025it [00:02, 390.36it/s, env_step=39936, len=29, n/ep=2, n/st=64, player_1/loss=252.460, player_2/loss=97.232, rew=-25.00]                                                                                                                                                                                     


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #40: 1025it [00:02, 389.30it/s, env_step=40960, len=23, n/ep=3, n/st=64, player_1/loss=232.620, player_2/loss=81.571, rew=-8.33]                                                                                                                                                                                      


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #41: 1025it [00:02, 391.64it/s, env_step=41984, len=22, n/ep=3, n/st=64, player_1/loss=184.525, player_2/loss=99.632, rew=8.33]                                                                                                                                                                                       


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #42: 1025it [00:02, 390.94it/s, env_step=43008, len=22, n/ep=3, n/st=64, player_1/loss=214.243, player_2/loss=95.939, rew=-25.00]                                                                                                                                                                                     


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #43: 1025it [00:02, 388.01it/s, env_step=44032, len=22, n/ep=3, n/st=64, player_1/loss=190.352, player_2/loss=101.065, rew=-8.33]                                                                                                                                                                                     


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #44: 1025it [00:02, 392.60it/s, env_step=45056, len=25, n/ep=2, n/st=64, player_1/loss=197.222, player_2/loss=99.205, rew=-25.00]                                                                                                                                                                                     


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #45: 1025it [00:02, 388.78it/s, env_step=46080, len=23, n/ep=3, n/st=64, player_1/loss=217.902, player_2/loss=115.377, rew=-8.33]                                                                                                                                                                                     


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #46: 1025it [00:02, 388.19it/s, env_step=47104, len=19, n/ep=3, n/st=64, player_1/loss=204.992, player_2/loss=116.958, rew=-25.00]                                                                                                                                                                                    


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #47: 1025it [00:02, 392.20it/s, env_step=48128, len=21, n/ep=3, n/st=64, player_1/loss=139.150, player_2/loss=83.033, rew=-8.33]                                                                                                                                                                                      


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #48: 1025it [00:02, 387.77it/s, env_step=49152, len=22, n/ep=2, n/st=64, player_1/loss=137.349, player_2/loss=83.211, rew=0.00]                                                                                                                                                                                       


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #49: 1025it [00:02, 390.83it/s, env_step=50176, len=23, n/ep=3, n/st=64, player_1/loss=189.982, player_2/loss=83.131, rew=-8.33]                                                                                                                                                                                      


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #18


Epoch #1: 1025it [00:02, 385.29it/s, env_step=1024, len=20, n/ep=3, n/st=64, player_1/loss=201.992, player_2/loss=54.847, rew=8.33]                                                                                                                                                                                         


Epoch #1: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 387.60it/s, env_step=2048, len=22, n/ep=2, n/st=64, player_1/loss=211.739, player_2/loss=70.875, rew=0.00]                                                                                                                                                                                         


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 387.85it/s, env_step=3072, len=20, n/ep=3, n/st=64, player_1/loss=222.991, player_2/loss=83.533, rew=25.00]                                                                                                                                                                                        


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 390.11it/s, env_step=4096, len=26, n/ep=2, n/st=64, player_1/loss=225.434, player_2/loss=117.673, rew=25.00]                                                                                                                                                                                       


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 388.93it/s, env_step=5120, len=20, n/ep=4, n/st=64, player_1/loss=209.298, player_2/loss=133.401, rew=25.00]                                                                                                                                                                                       


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 389.65it/s, env_step=6144, len=20, n/ep=3, n/st=64, player_1/loss=207.418, player_2/loss=103.229, rew=-8.33]                                                                                                                                                                                       


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 389.23it/s, env_step=7168, len=19, n/ep=3, n/st=64, player_1/loss=205.513, player_2/loss=93.692, rew=-8.33]                                                                                                                                                                                        


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 391.40it/s, env_step=8192, len=22, n/ep=3, n/st=64, player_1/loss=166.549, player_2/loss=106.676, rew=8.33]                                                                                                                                                                                        


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 388.57it/s, env_step=9216, len=20, n/ep=3, n/st=64, player_1/loss=233.546, player_2/loss=89.042, rew=8.33]                                                                                                                                                                                         


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 389.64it/s, env_step=10240, len=20, n/ep=3, n/st=64, player_1/loss=208.910, player_2/loss=69.745, rew=25.00]                                                                                                                                                                                      


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 391.59it/s, env_step=11264, len=26, n/ep=3, n/st=64, player_1/loss=199.561, player_2/loss=73.438, rew=25.00]                                                                                                                                                                                      


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 391.48it/s, env_step=12288, len=18, n/ep=2, n/st=64, player_1/loss=207.450, player_2/loss=90.165, rew=0.00]                                                                                                                                                                                       


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 389.23it/s, env_step=13312, len=18, n/ep=4, n/st=64, player_1/loss=142.071, player_2/loss=109.944, rew=12.50]                                                                                                                                                                                     


Epoch #13: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 387.81it/s, env_step=14336, len=21, n/ep=3, n/st=64, player_1/loss=151.630, player_2/loss=116.330, rew=8.33]                                                                                                                                                                                      


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 390.49it/s, env_step=15360, len=30, n/ep=2, n/st=64, player_1/loss=186.778, player_2/loss=113.919, rew=0.00]                                                                                                                                                                                      


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 392.49it/s, env_step=16384, len=21, n/ep=3, n/st=64, player_1/loss=162.402, player_2/loss=130.543, rew=8.33]                                                                                                                                                                                      


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 390.41it/s, env_step=17408, len=19, n/ep=4, n/st=64, player_1/loss=164.619, player_2/loss=155.458, rew=25.00]                                                                                                                                                                                     


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 389.87it/s, env_step=18432, len=19, n/ep=2, n/st=64, player_1/loss=194.083, player_2/loss=96.646, rew=0.00]                                                                                                                                                                                       


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 389.41it/s, env_step=19456, len=20, n/ep=3, n/st=64, player_1/loss=205.981, player_2/loss=72.946, rew=8.33]                                                                                                                                                                                       


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 391.93it/s, env_step=20480, len=20, n/ep=3, n/st=64, player_1/loss=188.468, player_2/loss=109.502, rew=25.00]                                                                                                                                                                                     


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 392.52it/s, env_step=21504, len=16, n/ep=4, n/st=64, player_1/loss=193.279, player_2/loss=135.392, rew=25.00]                                                                                                                                                                                     


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 390.62it/s, env_step=22528, len=19, n/ep=3, n/st=64, player_1/loss=178.957, player_2/loss=148.374, rew=8.33]                                                                                                                                                                                      


Epoch #22: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 389.21it/s, env_step=23552, len=19, n/ep=3, n/st=64, player_1/loss=179.937, player_2/loss=131.670, rew=-8.33]                                                                                                                                                                                     


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 391.57it/s, env_step=24576, len=18, n/ep=4, n/st=64, player_1/loss=195.306, player_2/loss=100.546, rew=-12.50]                                                                                                                                                                                    


Epoch #24: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 391.51it/s, env_step=25600, len=21, n/ep=2, n/st=64, player_1/loss=222.515, player_2/loss=78.076, rew=25.00]                                                                                                                                                                                      


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 384.86it/s, env_step=26624, len=18, n/ep=3, n/st=64, player_1/loss=192.877, player_2/loss=95.202, rew=25.00]                                                                                                                                                                                      


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 390.49it/s, env_step=27648, len=30, n/ep=2, n/st=64, player_1/loss=159.936, player_2/loss=128.265, rew=25.00]                                                                                                                                                                                     


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 379.61it/s, env_step=28672, len=24, n/ep=3, n/st=64, player_1/loss=135.866, player_2/loss=115.376, rew=25.00]                                                                                                                                                                                     


Epoch #28: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 388.82it/s, env_step=29696, len=18, n/ep=4, n/st=64, player_1/loss=187.247, player_2/loss=106.629, rew=0.00]                                                                                                                                                                                      


Epoch #29: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 389.39it/s, env_step=30720, len=18, n/ep=4, n/st=64, player_1/loss=214.865, player_2/loss=116.798, rew=25.00]                                                                                                                                                                                     


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 393.49it/s, env_step=31744, len=22, n/ep=3, n/st=64, player_1/loss=165.720, player_2/loss=117.557, rew=25.00]                                                                                                                                                                                     


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 390.52it/s, env_step=32768, len=16, n/ep=4, n/st=64, player_1/loss=121.402, player_2/loss=124.137, rew=-12.50]                                                                                                                                                                                    


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 390.69it/s, env_step=33792, len=22, n/ep=3, n/st=64, player_1/loss=121.998, player_2/loss=145.807, rew=25.00]                                                                                                                                                                                     


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 386.55it/s, env_step=34816, len=15, n/ep=4, n/st=64, player_1/loss=132.079, player_2/loss=161.082, rew=-12.50]                                                                                                                                                                                    


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 388.97it/s, env_step=35840, len=15, n/ep=4, n/st=64, player_1/loss=99.315, player_2/loss=163.802, rew=-25.00]                                                                                                                                                                                     


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 388.78it/s, env_step=36864, len=20, n/ep=2, n/st=64, player_1/loss=109.949, player_2/loss=164.103, rew=25.00]                                                                                                                                                                                     


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 389.48it/s, env_step=37888, len=15, n/ep=4, n/st=64, player_1/loss=130.436, player_2/loss=143.789, rew=-12.50]                                                                                                                                                                                    


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 385.95it/s, env_step=38912, len=16, n/ep=4, n/st=64, player_1/loss=107.084, player_2/loss=129.217, rew=0.00]                                                                                                                                                                                      


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 389.65it/s, env_step=39936, len=17, n/ep=4, n/st=64, player_1/loss=128.663, player_2/loss=137.477, rew=12.50]                                                                                                                                                                                     


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 389.90it/s, env_step=40960, len=18, n/ep=3, n/st=64, player_1/loss=167.956, player_2/loss=151.173, rew=-8.33]                                                                                                                                                                                     


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 391.63it/s, env_step=41984, len=16, n/ep=4, n/st=64, player_1/loss=188.115, player_2/loss=106.559, rew=-25.00]                                                                                                                                                                                    


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 390.16it/s, env_step=43008, len=16, n/ep=4, n/st=64, player_1/loss=171.546, player_2/loss=149.491, rew=0.00]                                                                                                                                                                                      


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 387.30it/s, env_step=44032, len=16, n/ep=4, n/st=64, player_1/loss=155.190, player_2/loss=159.788, rew=-12.50]                                                                                                                                                                                    


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 390.59it/s, env_step=45056, len=20, n/ep=3, n/st=64, player_1/loss=121.835, player_2/loss=141.106, rew=8.33]                                                                                                                                                                                      


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 391.59it/s, env_step=46080, len=13, n/ep=5, n/st=64, player_1/loss=99.326, player_2/loss=152.725, rew=-15.00]                                                                                                                                                                                     


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 390.19it/s, env_step=47104, len=14, n/ep=4, n/st=64, player_1/loss=91.646, player_2/loss=175.301, rew=-12.50]                                                                                                                                                                                     


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 392.59it/s, env_step=48128, len=15, n/ep=5, n/st=64, player_1/loss=132.854, player_2/loss=175.777, rew=-25.00]                                                                                                                                                                                    


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 386.93it/s, env_step=49152, len=18, n/ep=3, n/st=64, player_1/loss=133.272, player_2/loss=150.675, rew=-8.33]                                                                                                                                                                                     


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 389.24it/s, env_step=50176, len=17, n/ep=4, n/st=64, player_1/loss=152.462, player_2/loss=188.052, rew=0.00]                                                                                                                                                                                      


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 388.78it/s, env_step=1024, len=19, n/ep=3, n/st=64, player_1/loss=100.310, player_2/loss=70.608, rew=-8.33]                                                                                                                                                                                        


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 390.27it/s, env_step=2048, len=24, n/ep=3, n/st=64, player_1/loss=106.943, player_2/loss=114.747, rew=-25.00]                                                                                                                                                                                      


Epoch #2: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 390.26it/s, env_step=3072, len=15, n/ep=4, n/st=64, player_1/loss=111.834, player_2/loss=136.655, rew=12.50]                                                                                                                                                                                       


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 391.39it/s, env_step=4096, len=17, n/ep=3, n/st=64, player_1/loss=113.404, player_2/loss=143.936, rew=8.33]                                                                                                                                                                                        


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 391.34it/s, env_step=5120, len=19, n/ep=3, n/st=64, player_1/loss=136.329, player_2/loss=163.635, rew=-8.33]                                                                                                                                                                                       


Epoch #5: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 389.46it/s, env_step=6144, len=16, n/ep=4, n/st=64, player_1/loss=139.686, player_2/loss=161.465, rew=0.00]                                                                                                                                                                                        


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 390.61it/s, env_step=7168, len=18, n/ep=3, n/st=64, player_1/loss=110.981, player_2/loss=149.071, rew=-8.33]                                                                                                                                                                                       


Epoch #7: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 389.56it/s, env_step=8192, len=15, n/ep=4, n/st=64, player_1/loss=129.467, player_2/loss=124.147, rew=25.00]                                                                                                                                                                                       


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 391.08it/s, env_step=9216, len=15, n/ep=4, n/st=64, player_1/loss=121.913, player_2/loss=104.418, rew=0.00]                                                                                                                                                                                        


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 389.05it/s, env_step=10240, len=18, n/ep=4, n/st=64, player_1/loss=122.949, player_2/loss=128.806, rew=12.50]                                                                                                                                                                                     


Epoch #10: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 390.42it/s, env_step=11264, len=15, n/ep=4, n/st=64, player_1/loss=119.701, player_2/loss=127.401, rew=25.00]                                                                                                                                                                                     


Epoch #11: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 390.77it/s, env_step=12288, len=16, n/ep=4, n/st=64, player_1/loss=134.597, player_2/loss=100.235, rew=12.50]                                                                                                                                                                                     


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 391.11it/s, env_step=13312, len=21, n/ep=3, n/st=64, player_1/loss=132.155, player_2/loss=164.170, rew=-8.33]                                                                                                                                                                                     


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 389.63it/s, env_step=14336, len=20, n/ep=3, n/st=64, player_1/loss=93.965, player_2/loss=210.484, rew=-8.33]                                                                                                                                                                                      


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 391.65it/s, env_step=15360, len=20, n/ep=3, n/st=64, player_1/loss=125.228, player_2/loss=147.490, rew=8.33]                                                                                                                                                                                      


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 391.57it/s, env_step=16384, len=20, n/ep=3, n/st=64, player_1/loss=178.563, player_2/loss=120.942, rew=8.33]                                                                                                                                                                                      


Epoch #16: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 391.33it/s, env_step=17408, len=15, n/ep=4, n/st=64, player_1/loss=168.295, player_2/loss=146.035, rew=25.00]                                                                                                                                                                                     


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 387.58it/s, env_step=18432, len=15, n/ep=4, n/st=64, player_1/loss=156.340, player_2/loss=169.620, rew=25.00]                                                                                                                                                                                     


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 391.92it/s, env_step=19456, len=19, n/ep=4, n/st=64, player_1/loss=160.985, player_2/loss=137.511, rew=25.00]                                                                                                                                                                                     


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 392.75it/s, env_step=20480, len=16, n/ep=3, n/st=64, player_1/loss=140.059, player_2/loss=95.260, rew=8.33]                                                                                                                                                                                       


Epoch #20: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 389.78it/s, env_step=21504, len=16, n/ep=4, n/st=64, player_1/loss=104.172, rew=12.50]                                                                                                                                                                                                            


Epoch #21: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 390.29it/s, env_step=22528, len=18, n/ep=4, n/st=64, player_1/loss=80.236, player_2/loss=185.874, rew=-12.50]                                                                                                                                                                                     


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 390.82it/s, env_step=23552, len=20, n/ep=3, n/st=64, player_1/loss=98.938, player_2/loss=169.156, rew=8.33]                                                                                                                                                                                       


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 384.69it/s, env_step=24576, len=19, n/ep=3, n/st=64, player_1/loss=115.386, player_2/loss=111.804, rew=25.00]                                                                                                                                                                                     


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 393.45it/s, env_step=25600, len=20, n/ep=3, n/st=64, player_1/loss=128.284, player_2/loss=137.944, rew=25.00]                                                                                                                                                                                     


Epoch #25: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 390.19it/s, env_step=26624, len=19, n/ep=4, n/st=64, player_1/loss=126.349, player_2/loss=146.281, rew=0.00]                                                                                                                                                                                      


Epoch #26: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 391.46it/s, env_step=27648, len=22, n/ep=3, n/st=64, player_1/loss=99.759, player_2/loss=145.118, rew=-8.33]                                                                                                                                                                                      


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 393.20it/s, env_step=28672, len=15, n/ep=4, n/st=64, player_1/loss=142.269, player_2/loss=106.743, rew=-25.00]                                                                                                                                                                                    


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 387.14it/s, env_step=29696, len=19, n/ep=3, n/st=64, player_1/loss=205.299, player_2/loss=96.546, rew=8.33]                                                                                                                                                                                       


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 389.88it/s, env_step=30720, len=23, n/ep=3, n/st=64, player_1/loss=242.904, player_2/loss=128.166, rew=25.00]                                                                                                                                                                                     


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 391.53it/s, env_step=31744, len=21, n/ep=3, n/st=64, player_1/loss=204.323, player_2/loss=140.146, rew=-8.33]                                                                                                                                                                                     


Epoch #31: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 393.10it/s, env_step=32768, len=9, n/ep=7, n/st=64, player_1/loss=133.728, rew=10.71]                                                                                                                                                                                                             


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 388.12it/s, env_step=33792, len=9, n/ep=7, n/st=64, player_1/loss=124.221, player_2/loss=212.895, rew=3.57]                                                                                                                                                                                       


Epoch #33: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 386.67it/s, env_step=34816, len=15, n/ep=5, n/st=64, player_1/loss=142.942, player_2/loss=331.964, rew=5.00]                                                                                                                                                                                      


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 390.30it/s, env_step=35840, len=19, n/ep=4, n/st=64, player_1/loss=131.793, player_2/loss=309.141, rew=-12.50]                                                                                                                                                                                    


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 389.74it/s, env_step=36864, len=10, n/ep=6, n/st=64, player_1/loss=130.403, player_2/loss=210.403, rew=8.33]                                                                                                                                                                                      


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 389.90it/s, env_step=37888, len=17, n/ep=4, n/st=64, player_1/loss=154.655, player_2/loss=145.852, rew=25.00]                                                                                                                                                                                     


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 388.92it/s, env_step=38912, len=26, n/ep=2, n/st=64, player_1/loss=171.621, player_2/loss=136.467, rew=0.00]                                                                                                                                                                                      


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 392.00it/s, env_step=39936, len=18, n/ep=3, n/st=64, player_1/loss=177.592, player_2/loss=134.410, rew=-8.33]                                                                                                                                                                                     


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 389.40it/s, env_step=40960, len=23, n/ep=2, n/st=64, player_1/loss=116.736, player_2/loss=120.131, rew=-25.00]                                                                                                                                                                                    


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 390.53it/s, env_step=41984, len=17, n/ep=4, n/st=64, player_1/loss=109.520, rew=0.00]                                                                                                                                                                                                             


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 389.95it/s, env_step=43008, len=18, n/ep=3, n/st=64, player_1/loss=103.074, player_2/loss=104.204, rew=-8.33]                                                                                                                                                                                     


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 391.45it/s, env_step=44032, len=8, n/ep=7, n/st=64, player_1/loss=140.416, player_2/loss=133.091, rew=17.86]                                                                                                                                                                                      


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 390.36it/s, env_step=45056, len=16, n/ep=5, n/st=64, player_1/loss=176.946, player_2/loss=112.244, rew=-5.00]                                                                                                                                                                                     


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 387.28it/s, env_step=46080, len=7, n/ep=9, n/st=64, player_1/loss=168.493, player_2/loss=164.834, rew=25.00]                                                                                                                                                                                      


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 387.64it/s, env_step=47104, len=21, n/ep=3, n/st=64, player_1/loss=170.990, rew=-8.33]                                                                                                                                                                                                            


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 389.43it/s, env_step=48128, len=30, n/ep=2, n/st=64, player_1/loss=166.006, player_2/loss=143.483, rew=-25.00]                                                                                                                                                                                    


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 392.38it/s, env_step=49152, len=13, n/ep=4, n/st=64, player_1/loss=91.127, rew=12.50]                                                                                                                                                                                                             


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 391.43it/s, env_step=50176, len=9, n/ep=7, n/st=64, player_1/loss=166.052, player_2/loss=89.910, rew=17.86]                                                                                                                                                                                       


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #1: 1025it [00:02, 388.29it/s, env_step=1024, len=12, n/ep=7, n/st=64, player_1/loss=140.552, player_2/loss=111.786, rew=-17.86]                                                                                                                                                                                      


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 388.84it/s, env_step=2048, len=18, n/ep=3, n/st=64, player_1/loss=118.835, player_2/loss=157.343, rew=-8.33]                                                                                                                                                                                       


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 390.44it/s, env_step=3072, len=16, n/ep=5, n/st=64, player_1/loss=117.490, player_2/loss=138.371, rew=-5.00]                                                                                                                                                                                       


Epoch #3: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 388.46it/s, env_step=4096, len=11, n/ep=6, n/st=64, player_1/loss=127.881, player_2/loss=169.676, rew=-16.67]                                                                                                                                                                                      


Epoch #4: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 387.69it/s, env_step=5120, len=10, n/ep=6, n/st=64, player_1/loss=146.897, player_2/loss=260.856, rew=-8.33]                                                                                                                                                                                       


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 387.48it/s, env_step=6144, len=10, n/ep=6, n/st=64, player_1/loss=131.268, player_2/loss=218.233, rew=-8.33]                                                                                                                                                                                       


Epoch #6: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 391.58it/s, env_step=7168, len=13, n/ep=6, n/st=64, player_1/loss=96.149, player_2/loss=236.902, rew=-8.33]                                                                                                                                                                                        


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 390.38it/s, env_step=8192, len=16, n/ep=4, n/st=64, player_1/loss=77.591, player_2/loss=233.139, rew=-12.50]                                                                                                                                                                                       


Epoch #8: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 387.88it/s, env_step=9216, len=14, n/ep=4, n/st=64, player_1/loss=78.781, player_2/loss=195.269, rew=0.00]                                                                                                                                                                                         


Epoch #9: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 388.94it/s, env_step=10240, len=17, n/ep=3, n/st=64, player_1/loss=94.984, player_2/loss=155.355, rew=-8.33]                                                                                                                                                                                      


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 389.63it/s, env_step=11264, len=9, n/ep=6, n/st=64, player_1/loss=113.463, player_2/loss=221.482, rew=-8.33]                                                                                                                                                                                      


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 388.95it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=132.734, player_2/loss=216.913, rew=-18.75]                                                                                                                                                                                     


Epoch #12: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 389.56it/s, env_step=13312, len=12, n/ep=5, n/st=64, player_1/loss=101.038, rew=-5.00]                                                                                                                                                                                                            


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 390.53it/s, env_step=14336, len=11, n/ep=6, n/st=64, player_1/loss=118.499, player_2/loss=256.285, rew=0.00]                                                                                                                                                                                      


Epoch #14: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 389.19it/s, env_step=15360, len=9, n/ep=7, n/st=64, player_1/loss=144.493, player_2/loss=217.197, rew=-3.57]                                                                                                                                                                                      


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 389.07it/s, env_step=16384, len=20, n/ep=3, n/st=64, player_1/loss=215.121, player_2/loss=169.030, rew=8.33]                                                                                                                                                                                      


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 385.97it/s, env_step=17408, len=13, n/ep=5, n/st=64, player_1/loss=218.391, player_2/loss=306.346, rew=-15.00]                                                                                                                                                                                    


Epoch #17: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 388.78it/s, env_step=18432, len=9, n/ep=7, n/st=64, player_1/loss=126.684, player_2/loss=347.281, rew=-10.71]                                                                                                                                                                                     


Epoch #18: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 384.99it/s, env_step=19456, len=9, n/ep=8, n/st=64, player_1/loss=167.579, player_2/loss=259.587, rew=-12.50]                                                                                                                                                                                     


Epoch #19: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 382.19it/s, env_step=20480, len=11, n/ep=7, n/st=64, player_1/loss=148.177, player_2/loss=203.223, rew=-17.86]                                                                                                                                                                                    


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 390.45it/s, env_step=21504, len=10, n/ep=5, n/st=64, player_1/loss=112.843, player_2/loss=190.517, rew=-15.00]                                                                                                                                                                                    


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 388.53it/s, env_step=22528, len=12, n/ep=5, n/st=64, player_1/loss=146.138, player_2/loss=217.462, rew=-15.00]                                                                                                                                                                                    


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 388.52it/s, env_step=23552, len=12, n/ep=5, n/st=64, player_1/loss=121.992, player_2/loss=267.968, rew=-25.00]                                                                                                                                                                                    


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 386.53it/s, env_step=24576, len=10, n/ep=5, n/st=64, player_1/loss=122.661, player_2/loss=276.546, rew=-5.00]                                                                                                                                                                                     


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 374.83it/s, env_step=25600, len=13, n/ep=6, n/st=64, player_1/loss=145.604, player_2/loss=237.370, rew=-8.33]                                                                                                                                                                                     


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 388.84it/s, env_step=26624, len=10, n/ep=6, n/st=64, player_1/loss=185.861, player_2/loss=235.757, rew=-16.67]                                                                                                                                                                                    


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 389.63it/s, env_step=27648, len=12, n/ep=5, n/st=64, player_1/loss=165.833, player_2/loss=223.031, rew=-25.00]                                                                                                                                                                                    


Epoch #27: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 389.10it/s, env_step=28672, len=16, n/ep=4, n/st=64, player_1/loss=109.407, player_2/loss=235.896, rew=-12.50]                                                                                                                                                                                    


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 387.82it/s, env_step=29696, len=16, n/ep=5, n/st=64, player_1/loss=118.782, player_2/loss=212.515, rew=-15.00]                                                                                                                                                                                    


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 387.69it/s, env_step=30720, len=8, n/ep=8, n/st=64, player_1/loss=113.723, player_2/loss=216.600, rew=-25.00]                                                                                                                                                                                     


Epoch #30: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 390.97it/s, env_step=31744, len=11, n/ep=6, n/st=64, player_1/loss=113.042, player_2/loss=263.403, rew=-16.67]                                                                                                                                                                                    


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 387.89it/s, env_step=32768, len=13, n/ep=5, n/st=64, player_1/loss=137.650, player_2/loss=242.102, rew=-5.00]                                                                                                                                                                                     


Epoch #32: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 389.80it/s, env_step=33792, len=16, n/ep=4, n/st=64, player_1/loss=138.285, player_2/loss=156.540, rew=0.00]                                                                                                                                                                                      


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 388.50it/s, env_step=34816, len=11, n/ep=5, n/st=64, player_1/loss=165.761, player_2/loss=218.619, rew=-15.00]                                                                                                                                                                                    


Epoch #34: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 390.10it/s, env_step=35840, len=8, n/ep=6, n/st=64, player_1/loss=124.567, player_2/loss=330.085, rew=-16.67]                                                                                                                                                                                     


Epoch #35: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 386.81it/s, env_step=36864, len=10, n/ep=6, n/st=64, player_1/loss=161.968, player_2/loss=271.680, rew=8.33]                                                                                                                                                                                      


Epoch #36: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 387.15it/s, env_step=37888, len=9, n/ep=8, n/st=64, player_1/loss=141.042, player_2/loss=220.295, rew=-18.75]                                                                                                                                                                                     


Epoch #37: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 389.96it/s, env_step=38912, len=11, n/ep=5, n/st=64, player_1/loss=74.498, player_2/loss=182.098, rew=-15.00]                                                                                                                                                                                     


Epoch #38: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 385.36it/s, env_step=39936, len=8, n/ep=7, n/st=64, player_1/loss=93.164, player_2/loss=214.525, rew=-17.86]                                                                                                                                                                                      


Epoch #39: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 385.95it/s, env_step=40960, len=8, n/ep=7, n/st=64, player_1/loss=100.865, player_2/loss=260.014, rew=-17.86]                                                                                                                                                                                     


Epoch #40: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 387.46it/s, env_step=41984, len=10, n/ep=5, n/st=64, player_1/loss=88.723, player_2/loss=223.600, rew=-5.00]                                                                                                                                                                                      


Epoch #41: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 390.12it/s, env_step=43008, len=9, n/ep=7, n/st=64, player_1/loss=108.947, player_2/loss=252.588, rew=-25.00]                                                                                                                                                                                     


Epoch #42: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 391.09it/s, env_step=44032, len=9, n/ep=8, n/st=64, player_1/loss=115.589, player_2/loss=256.987, rew=-18.75]                                                                                                                                                                                     


Epoch #43: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 388.83it/s, env_step=45056, len=11, n/ep=5, n/st=64, player_1/loss=140.155, player_2/loss=209.680, rew=-15.00]                                                                                                                                                                                    


Epoch #44: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 388.18it/s, env_step=46080, len=9, n/ep=6, n/st=64, player_1/loss=146.769, player_2/loss=190.500, rew=-16.67]                                                                                                                                                                                     


Epoch #45: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 387.71it/s, env_step=47104, len=16, n/ep=4, n/st=64, player_1/loss=146.358, player_2/loss=192.493, rew=-12.50]                                                                                                                                                                                    


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 386.31it/s, env_step=48128, len=8, n/ep=7, n/st=64, player_1/loss=135.291, player_2/loss=162.138, rew=-25.00]                                                                                                                                                                                     


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 391.44it/s, env_step=49152, len=7, n/ep=9, n/st=64, player_1/loss=185.734, player_2/loss=167.615, rew=-13.89]                                                                                                                                                                                     


Epoch #48: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 388.51it/s, env_step=50176, len=12, n/ep=5, n/st=64, player_1/loss=131.804, player_2/loss=199.390, rew=-25.00]                                                                                                                                                                                    


Epoch #49: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #0


In [19]:
####################################################
# EXPERIMENT: VIEWING THE BEST LEARNED POLICY
####################################################

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure the best agent
best_agent1 = cf_custom_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape)
best_agent1.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/5-50epoch_20loop/looping-iteration-19/best_policy_agent1.pth"))
best_agent1.set_eps(0)


best_agent2 = cf_custom_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape)
best_agent2.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/5-50epoch_20loop/looping-iteration-19/best_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= best_agent1,
      agent_player2= best_agent2)



Average steps of game:  8.333333333333334
Final mean reward agent 1: 25.0, std: 0.0
Final mean reward agent 2: -25.0, std: 0.0


In [20]:
####################################################
# EXPERIMENT: VIEWING THE LAST LEARNED POLICY
####################################################

# Configure the final agent
final_agent_player1 = cf_custom_dqn_policy(state_shape= state_shape,
                                           action_shape= action_shape)
final_agent_player1.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/5-50epoch_20loop/looping-iteration-19/final_policy_agent1.pth"))
best_agent1.set_eps(0)

final_agent_player2 = cf_custom_dqn_policy(state_shape= state_shape,
                                           action_shape= action_shape)
final_agent_player2.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/5-50epoch_20loop/looping-iteration-19/final_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= final_agent_player1,
      agent_player2= final_agent_player2)



Average steps of game:  7.0
Final mean reward agent 1: 25.0, std: 0.0
Final mean reward agent 2: -25.0, std: 0.0


<hr><hr>

## Discussion

We see that the agent can learn quickly to win against a fixed strategy oponent but the overall performance of the agent is still weak, making human play of very poor quality once again.

In [None]:
####################################################
# CLEAN VARIABLES
####################################################

del action_shape
del agent1
del agent2
del best_agent1
del best_agent2
del env
del final_agent_player1
del final_agent_player2
del observation_space
del off_policy_traininer_results
del state_shape
