# CNN based DQN agent against fixed oponent

As discussed in `5-improving-dqn-architecture.ipynb` we thought of three aspects that might be the root of the agent's not learning to play the game pleasingly:
- Training two DQN agents simultaneously is known to be though, especially when starting from a random initialisation
- The network used was a simple MLP
- The training is not done over enough iterations

In the notebooks `5-improving-dqn-architecture.ipynb` and `6-dqn-using-a-cnn.ipynb`, two alternative networks besides MLP were used.
Whilst these give somewhat satisfactory results when trained for long enough and incentivising moves by giving a reward for making a move, it is still far from perfect.
The iterations were also boosted to a couple of hours on a CUDA GPU, which didn't improve things all that much.

Thus, what is most likely to be an issue is the fact that we are training two agents simultaneously.
This makes it hard to get a good performing agent and makes the target non stationary as both agents evolve over time.
An alternative to this is training an agent for a couple of epochs whilst freezing the other and alternating this between the agents.
This makes the problem to learn more stationary and is known to make learning easier.
What is also done, often in very complex games, is starting from a somewhat smart agent instead of a random one.

Whilst some libraries such as Ray RL lib offer implementations of such a training strategy, the experimental notebook `4-rllib-for-more-learning-control.ipynb` found that even the Ray provided example results in error codes.
Seeing their GitHub page has many open issues, the one we encountered being one of them, we refrain from using a different library considering Tianshou has many algorithms implemented and we have found a way to make things work.

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training two DQN agents on connect four Gym
  - Building the environment
  - Implementing the DQN policy
  - Building agents
  - Function for letting agents learn
  - Function for watching learned agent
  - Doing the experiment
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Argparser
import argparse

# More data types
import typing
import numpy as np

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
importlib.invalidate_caches()
importlib.reload(cfgym)

# Time for allowing "freezes" in execution
import time;

# Allow for copying objects in a non reference manner
import copy

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0
Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [4]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device
print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

# Show cuda device name
print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training two DQN agents on connect four Gym

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, agents will always play as the same player, e.g. always as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Building the environment

This code is taken from previous notebooks.
We don't allow invalid moves to make the problem easier for now.

In [5]:
####################################################
# CONNECT FOUR V2 ENVIRONMENT
####################################################

def get_env():
    """
    Returns the connect four gym environment V2 altered for Tianshou and Petting Zoo compatibility.
    Already wrapped with a ts.env.PettingZooEnv wrapper.
    """
    return ts.env.PettingZooEnv(cfgym.env(reward_move= 1, # Set to 1 for reward to make moves (incentivise longer games)
                                          reward_invalid= -3,
                                          reward_draw= 15,
                                          reward_win= 25,
                                          reward_loss= -25,
                                          allow_invalid_move= False))
    
    
# Test the environment
env = get_env()
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation
del env

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


<hr>

### Implementing the DQN policy

The DQN policy for the agent is configured and set up below.
This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [6]:
####################################################
# DQN ARCHITECTURE
####################################################

class CNNBasedDQN(torch.nn.Module):
    """
    Custom DQN using a model based on CNN
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',):
        # Parent call
        super().__init__()
        
        # Save device (e.g. cuda)
        self.device = device
        
        # Number of input channels
        input_channels_cnn = 1
        output_channels_cnn = 32
        flatten_size = (state_shape[0] - 3) * (state_shape[1] - 3) * output_channels_cnn
        output_size= np.prod(action_shape)
        
        self.model = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels= input_channels_cnn, out_channels= output_channels_cnn, kernel_size= 4, stride= 1), torch.nn.ReLU(inplace=True),
            torch.nn.Flatten(0,-1),
            torch.nn.Unflatten(0, (1, flatten_size)),
            torch.nn.Linear(flatten_size, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, output_size),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
        
        logits = self.model(obs)
        return logits, state


In [7]:
####################################################
# DQN POLICY
####################################################

def cf_cnn_dqn_policy(state_shape: tuple,
                      action_shape: tuple,
                      optim: typing.Optional[torch.optim.Optimizer] = None,
                      learning_rate: float =  0.0001,
                      gamma: float = 0.9, # Smaller gamma favours "faster" win
                      n_step: int = 4, # Number of steps to look ahead
                      frozen: bool = False,
                      target_update_freq: int = 320):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    net = CNNBasedDQN(state_shape, action_shape, device= device).to(device)
    
    # Default optimizer is an adam optimizer with the argparser learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # If we are frozen, we use an optimizer that has learning rate 0
    if frozen:
        optim = torch.optim.SGD(net.parameters(), lr= 0)
        
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)

<hr>

### Building agents

This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [8]:
####################################################
# AGENT CREATION
####################################################

def get_agents(agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
               agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
               optim: typing.Optional[torch.optim.Optimizer] = None,
               resume_path_player_1: str = '', # Path to file to resume agent training from
               resume_path_player_2: str = '', 
               agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
               agent_player2_frozen: bool = False,
               ) -> typing.Tuple[ts.policy.BasePolicy, torch.optim.Optimizer, list]:
    """
    Gets a multi agent policy manager, optimizer and player ids for the connect four V2 gym environment.
    Per default this returns 
        - Multi agent manager for 2 agents using DQN
        - Adam optimizer
        - ['player_1', 'player_2'] from the connect four environment
    """
    
    # Get the environment to play in (Connect four gym V2)
    env = get_env()
    
    # Get the observation space from the environment, depending on typo of space (ternary operator)
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    
    # Set the arguments
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent player 1 to be a DQN if no policy is passed.
    if agent_player1 is None:
        # Our agent1 uses a DQN policy
        agent_player1 = cf_cnn_dqn_policy(state_shape= state_shape,
                                          action_shape= action_shape,
                                          optim= optim,
                                          frozen= agent_player1_frozen)
        
        # If we resume our agent we need to load the previous config
        if resume_path_player_1:
            agent_player1.load_state_dict(torch.load(resume_path_player_1))
            
    
    # Configure agent player 2 to be a DQN if no policy is passed.
    if agent_player2 is None:
        # Our agent1 uses a DQN policy
        agent_player2 = cf_cnn_dqn_policy(state_shape= state_shape,
                                          action_shape= action_shape,
                                          optim= optim,
                                          frozen= agent_player2_frozen)
        
        # If we resume our agent we need to load the previous config
        if resume_path_player_2:
            agent_player2.load_state_dict(torch.load(resume_path_player_2))

    # Both our agents are DQN agents by default
    agents = [agent_player1, agent_player2]
        
    # Our policy depends on the order of the agents
    policy = ts.policy.MultiAgentPolicyManager(agents, env)
    
    # Return our policy, optimizer and the available agents in the environment
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    
    return policy, optim, env.agents

<hr>

### Function for letting agents learn

This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [9]:
####################################################
# AGENT TRAINING
####################################################

def train_agent(filename: str = "dqn_vs_dqn_cnn_based",
                agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
                agent_player2_frozen: bool = False,
                optim: typing.Optional[torch.optim.Optimizer] = None,
                training_env_num: int = 1,
                testing_env_num: int = 1,
                buffer_size: int = 2^14,
                batch_size: int = 1, 
                epochs: int = 50, #50
                step_per_epoch: int = 1024, #1024
                step_per_collect: int = 64, # transition before update
                update_per_step: float = 0.1,
                testing_eps: float = 0.05,
                training_eps: float = 0.1,
                ) -> typing.Tuple[dict, ts.policy.BasePolicy]:
    """
    Trains two agents in the connect four V2 environment and saves their best model and logs.
    Returns:
        - result from offpolicy_trainer
        - final version of agent 1
        - final version of agent 2
    """

    # ======== notebook specific =========
    notebook_version = '7' # Used for foldering logs and models

    # ======== environment setup =========
    train_envs = ts.env.DummyVectorEnv([get_env for _ in range(training_env_num)])
    test_envs = ts.env.DummyVectorEnv([get_env for _ in range(testing_env_num)])
    
    # set the seed for reproducibility
    np.random.seed(1998)
    torch.manual_seed(1998)
    train_envs.seed(1998)
    test_envs.seed(1998)

    # ======== agent setup =========
    # Gets our agents from the previously made function
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    policy, optim, agents = get_agents(agent_player1=agent_player1,
                                       agent_player2=agent_player2,
                                       agent_player1_frozen= agent_player1_frozen,
                                       agent_player2_frozen= agent_player2_frozen,
                                       optim=optim)

    # ======== collector setup =========
    # Make a collector for the training environments
    train_collector = ts.data.Collector(policy= policy,
                                        env= train_envs,
                                        buffer= ts.data.VectorReplayBuffer(buffer_size, len(train_envs)),
                                        exploration_noise= True)
    
    # Make a collector for the testing environments
    test_collector = ts.data.Collector(policy= policy,
                                       env= test_envs,
                                       buffer= ts.data.VectorReplayBuffer(buffer_size, len(test_envs)),
                                       exploration_noise= True)
    
    # Uncomment below if you want to set epsilon in epsilon policy
    # policy.set_eps(1)
    
    # Collect data fot the training evnironments
    train_collector.collect(n_step= batch_size * training_env_num)
    
    # ======== ensure folders exist =========
    if not os.path.exists(os.path.join('./logs', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./logs', 'paper_notebooks', notebook_version, filename))
    if not os.path.exists(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename))

    # ======== tensorboard logging setup =========
    # Allows to save the training progress to tensorboard compatable logs
    log_path = os.path.join('./logs', 'paper_notebooks', notebook_version, filename)
    writer = torch.utils.tensorboard.SummaryWriter(log_path)
    logger = ts.utils.TensorboardLogger(writer)

    # ======== callback functions used during training =========
    # We want to save our best policy
    def save_best_fn(policy):
        """
        Callback to save the best model
        """
        # Save best agent 1
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent1.pth')
        torch.save(policy.policies[agents[0]].state_dict(), model_save_path)
        
        # Save best agent 2
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent2.pth')
        torch.save(policy.policies[agents[1]].state_dict(), model_save_path)
        
        # Save agent2

    def stop_fn(mean_rewards):
        """
        Callback to stop training when we've reached the win rate
        """
        return mean_rewards >= 7 # (win = 10, 70% win without invalid moves = mean of 7)

    def train_fn(epoch, env_step):
        """
        Callback before training
        """        
        # Before training we want to configure the epsilon for the agents
        # In general more exploratory than the test case
        policy.policies[agents[0]].set_eps(training_eps)
        policy.policies[agents[1]].set_eps(training_eps)

    def test_fn(epoch, env_step):
        """
        Callback beore testing
        """        
        # Before testing we want to configure the epsilon for the agents
        # In general more greedy than the train case but not
        #   to avoid getting stuck on invalid moves
        policy.policies[agents[0]].set_eps(testing_eps)
        policy.policies[agents[1]].set_eps(testing_eps)

    def reward_metric(rews):
        """
        Callback for reward collection
        """
        # We are interested in having a high total total reward,
        #   as this would mean equally good agents.
        return rews[:, 0] + rews[:, 1]

    # trainer
    result = ts.trainer.offpolicy_trainer(policy= policy,
                                          train_collector= train_collector,
                                          test_collector= test_collector,
                                          max_epoch= epochs,
                                          step_per_epoch= step_per_epoch,
                                          step_per_collect= step_per_collect,
                                          episode_per_test= testing_env_num,
                                          batch_size= batch_size,
                                          train_fn= train_fn,
                                          test_fn= test_fn,
                                          # Stop function to stop before specified amount of epochs
                                          #stop_fn= stop_fn
                                          save_best_fn= save_best_fn,
                                          update_per_step= update_per_step,
                                          logger= logger,
                                          test_in_train= False,
                                          reward_metric= reward_metric)
    
    # Save final agent 1
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent1.pth')
    torch.save(policy.policies[agents[0]].state_dict(), model_save_path)

    # Save final agent 2
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent2.pth')
    torch.save(policy.policies[agents[1]].state_dict(), model_save_path)

    return result, policy.policies[agents[0]], policy.policies[agents[1]]

<hr>

### Function for watching learned agent

Identical to the previous notebook.

In [10]:
####################################################
# WATCHING THE LEARNED POLICY IN ACTION
####################################################

def watch(numer_of_games: int = 3,
          agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
          agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
          test_epsilon: float = 0.05, # For the watching we act completely greedy but low random for not getting stuck on invalid move
          render_speed: float = 0.15, # Amount of seconds to update frame/ do a step
          ) -> None:
    
    # Get the connect four V2 environment (must be a list)
    env= ts.env.DummyVectorEnv([get_env])
    
    # Get the agents from the trained agents
    policy, optim, agents = get_agents(agent_player1= agent_player1,
                                       agent_player2= agent_player2)
    
    # Evaluate the policy
    policy.eval()
    
    # Set the testing policy epsilon for our agents
    policy.policies[agents[0]].set_eps(test_epsilon)
    policy.policies[agents[1]].set_eps(test_epsilon)
    
    # Collect the test data
    collector = ts.data.Collector(policy= policy,
                                  env= env,
                                  exploration_noise= True)
    
    # Render games in human mode to see how it plays
    result = collector.collect(n_episode= numer_of_games, render= render_speed)
    
    # Close the environment aftering collecting the results
    # This closes the pygame window after completion
    env.close()
    
    # Get the rewards and length from the test trials
    rewards, length = result["rews"], result["lens"]
    
    # Print the final reward for the first agent
    print(f"Average steps of game:  {length.mean()}")
    print(f"Final mean reward agent 1: {rewards[:, 0].mean()}, std: {rewards[:, 0].std()}")
    print(f"Final mean reward agent 2: {rewards[:, 1].mean()}, std: {rewards[:, 1].std()}")

<hr>

### Doing the experiment

We now do the experiment with using our previously created functions.
We freeze one agent and initialize both agents from previous versions.

1. Freeze agent 1, train agent 2:
    - Model save name: 1-cnn_dqn_frozen_agent1
    - Agent 1 start: ./saved_variables/paper_notebooks/6/dqn_vs_dqn_cnn_based/best_policy_agent1.pth
    - Agent 2 start: ./saved_variables/paper_notebooks/6/dqn_vs_dqn_cnn_based/best_policy_agent2.pth
    - Learning rate: 0.0001
    - Look ahead steps: 4
    - Reward for move: 1
    - Epochs: 1000
    - Best epoch: 88


In [12]:
####################################################
# EXPERIMENT: TRAINING AGENTS
####################################################

# Configs for the agents
freeze_agent1 = False
agent1_starting_params = "./saved_variables/paper_notebooks/6/dqn_vs_dqn_cnn_based/best_policy_agent1.pth"
freeze_agent2 = True
agent2_starting_params = "./saved_variables/paper_notebooks/7/1-cnn_dqn_frozen_agent1/final_policy_agent2.pth"

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure agent 1
agent1 = cf_cnn_dqn_policy(state_shape= state_shape,
                           action_shape= action_shape,
                           gamma= 0.95, # Favour shorter solutions if small
                           frozen= freeze_agent1,
                           learning_rate = 0.0001,
                           n_step= 4)

if agent1_starting_params:
    agent1.load_state_dict(torch.load(agent1_starting_params))

# Configure agent 2
agent2 = cf_cnn_dqn_policy(state_shape= state_shape,
                           action_shape= action_shape,
                           gamma= 0.95, # Favour shorter solutions if small
                           frozen= freeze_agent2,
                           learning_rate = 0.0001,
                           n_step= 4)

if agent2_starting_params:
    agent2.load_state_dict(torch.load(agent2_starting_params))


# Train the agent
off_policy_traininer_results, final_agent_player1, final_agent_player2 = train_agent(epochs= 1000,
                                                                                     filename="2-cnn_dqn_frozen_agent2",
                                                                                     training_eps= 0.2)

Epoch #1: 1025it [00:02, 439.28it/s, env_step=1024, len=8, n/ep=8, n/st=64, player_1/loss=895.981, player_2/loss=363.033, rew=77.75]                                                                                                        


Epoch #1: test_reward: 70.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:02, 448.88it/s, env_step=2048, len=7, n/ep=9, n/st=64, player_1/loss=637.946, player_2/loss=393.168, rew=63.11]                                                                                                        


Epoch #2: test_reward: 70.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:02, 435.77it/s, env_step=3072, len=8, n/ep=8, n/st=64, player_1/loss=764.732, player_2/loss=570.248, rew=78.50]                                                                                                        


Epoch #3: test_reward: 54.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:02, 448.23it/s, env_step=4096, len=11, n/ep=7, n/st=64, player_1/loss=848.691, player_2/loss=733.860, rew=191.14]                                                                                                      


Epoch #4: test_reward: 54.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:02, 448.61it/s, env_step=5120, len=8, n/ep=8, n/st=64, player_1/loss=516.356, player_2/loss=463.644, rew=87.25]                                                                                                        


Epoch #5: test_reward: 54.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #6: 1025it [00:02, 445.72it/s, env_step=6144, len=9, n/ep=7, n/st=64, player_1/loss=398.166, player_2/loss=415.905, rew=106.29]                                                                                                       


Epoch #6: test_reward: 54.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #7: 1025it [00:02, 447.16it/s, env_step=7168, len=8, n/ep=8, n/st=64, player_1/loss=318.328, player_2/loss=356.495, rew=73.25]                                                                                                        


Epoch #7: test_reward: 54.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #8: 1025it [00:02, 452.18it/s, env_step=8192, len=7, n/ep=8, n/st=64, player_1/loss=270.324, player_2/loss=309.888, rew=66.50]                                                                                                        


Epoch #8: test_reward: 54.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #9: 1025it [00:02, 452.26it/s, env_step=9216, len=9, n/ep=7, n/st=64, player_1/loss=221.353, player_2/loss=271.589, rew=94.57]                                                                                                        


Epoch #9: test_reward: 54.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #10: 1025it [00:02, 446.76it/s, env_step=10240, len=7, n/ep=8, n/st=64, player_1/loss=177.534, player_2/loss=237.566, rew=67.00]                                                                                                      


Epoch #10: test_reward: 54.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #11: 1025it [00:02, 450.43it/s, env_step=11264, len=8, n/ep=7, n/st=64, player_1/loss=190.786, player_2/loss=278.734, rew=76.29]                                                                                                      


Epoch #11: test_reward: 54.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #12: 1025it [00:02, 451.48it/s, env_step=12288, len=8, n/ep=8, n/st=64, player_1/loss=242.433, player_2/loss=270.893, rew=77.25]                                                                                                      


Epoch #12: test_reward: 70.000000 ± 0.000000, best_reward: 70.000000 ± 0.000000 in #1


Epoch #13: 1025it [00:02, 445.33it/s, env_step=13312, len=8, n/ep=7, n/st=64, player_1/loss=202.155, player_2/loss=249.902, rew=97.14]                                                                                                      


Epoch #13: test_reward: 88.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #14: 1025it [00:02, 416.89it/s, env_step=14336, len=10, n/ep=6, n/st=64, player_1/loss=201.942, player_2/loss=329.607, rew=152.00]                                                                                                    


Epoch #14: test_reward: 70.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #15: 1025it [00:02, 434.91it/s, env_step=15360, len=8, n/ep=7, n/st=64, player_1/loss=297.982, player_2/loss=329.431, rew=80.57]                                                                                                      


Epoch #15: test_reward: 70.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #16: 1025it [00:02, 442.38it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=262.221, player_2/loss=428.093, rew=81.00]                                                                                                      


Epoch #16: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #17: 1025it [00:02, 425.49it/s, env_step=17408, len=8, n/ep=8, n/st=64, player_1/loss=254.089, player_2/loss=434.316, rew=71.75]                                                                                                      


Epoch #17: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #18: 1025it [00:02, 450.50it/s, env_step=18432, len=7, n/ep=9, n/st=64, player_1/loss=283.815, player_2/loss=369.610, rew=67.33]                                                                                                      


Epoch #18: test_reward: 88.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #19: 1025it [00:02, 451.99it/s, env_step=19456, len=10, n/ep=6, n/st=64, player_1/loss=259.773, player_2/loss=431.712, rew=119.33]                                                                                                    


Epoch #19: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #20: 1025it [00:02, 455.59it/s, env_step=20480, len=8, n/ep=8, n/st=64, player_1/loss=348.658, player_2/loss=504.590, rew=78.00]                                                                                                      


Epoch #20: test_reward: 88.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #21: 1025it [00:02, 412.43it/s, env_step=21504, len=9, n/ep=8, n/st=64, player_1/loss=412.300, player_2/loss=592.312, rew=127.00]                                                                                                     


Epoch #21: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #22: 1025it [00:02, 420.46it/s, env_step=22528, len=8, n/ep=8, n/st=64, player_1/loss=317.211, player_2/loss=405.562, rew=73.25]                                                                                                      


Epoch #22: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #23: 1025it [00:02, 400.53it/s, env_step=23552, len=8, n/ep=7, n/st=64, player_1/loss=208.120, player_2/loss=380.847, rew=75.43]                                                                                                      


Epoch #23: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #24: 1025it [00:02, 436.38it/s, env_step=24576, len=8, n/ep=7, n/st=64, player_1/loss=154.013, player_2/loss=364.535, rew=80.86]                                                                                                      


Epoch #24: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #25: 1025it [00:02, 427.71it/s, env_step=25600, len=7, n/ep=8, n/st=64, player_1/loss=187.262, player_2/loss=453.237, rew=67.00]                                                                                                      


Epoch #25: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #26: 1025it [00:02, 416.90it/s, env_step=26624, len=8, n/ep=8, n/st=64, player_1/loss=173.900, player_2/loss=431.555, rew=78.00]                                                                                                      


Epoch #26: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #27: 1025it [00:02, 415.42it/s, env_step=27648, len=9, n/ep=7, n/st=64, player_1/loss=170.100, player_2/loss=315.946, rew=90.29]                                                                                                      


Epoch #27: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #28: 1025it [00:02, 419.99it/s, env_step=28672, len=8, n/ep=8, n/st=64, player_1/loss=212.601, player_2/loss=286.868, rew=71.50]                                                                                                      


Epoch #28: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #29: 1025it [00:02, 428.30it/s, env_step=29696, len=8, n/ep=7, n/st=64, player_1/loss=218.974, player_2/loss=348.681, rew=85.43]                                                                                                      


Epoch #29: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #30: 1025it [00:02, 404.57it/s, env_step=30720, len=8, n/ep=7, n/st=64, player_1/loss=313.475, player_2/loss=479.987, rew=85.14]                                                                                                      


Epoch #30: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #31: 1025it [00:02, 434.88it/s, env_step=31744, len=8, n/ep=8, n/st=64, player_1/loss=457.904, player_2/loss=602.726, rew=78.25]                                                                                                      


Epoch #31: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #32: 1025it [00:02, 457.87it/s, env_step=32768, len=7, n/ep=8, n/st=64, player_1/loss=381.609, player_2/loss=522.867, rew=60.25]                                                                                                      


Epoch #32: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #33: 1025it [00:02, 423.16it/s, env_step=33792, len=8, n/ep=7, n/st=64, player_1/loss=232.985, player_2/loss=399.575, rew=80.86]                                                                                                      


Epoch #33: test_reward: 70.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #34: 1025it [00:02, 443.56it/s, env_step=34816, len=9, n/ep=7, n/st=64, player_1/loss=219.736, player_2/loss=420.206, rew=94.00]                                                                                                      


Epoch #34: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #35: 1025it [00:02, 421.89it/s, env_step=35840, len=9, n/ep=6, n/st=64, player_1/loss=188.700, player_2/loss=482.059, rew=100.00]                                                                                                     


Epoch #35: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #36: 1025it [00:02, 426.40it/s, env_step=36864, len=8, n/ep=7, n/st=64, player_1/loss=150.452, player_2/loss=468.332, rew=82.29]                                                                                                      


Epoch #36: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #37: 1025it [00:02, 362.04it/s, env_step=37888, len=9, n/ep=8, n/st=64, player_1/loss=183.049, player_2/loss=434.625, rew=120.25]                                                                                                     


Epoch #37: test_reward: 70.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #38: 1025it [00:02, 358.67it/s, env_step=38912, len=8, n/ep=8, n/st=64, player_1/loss=157.116, player_2/loss=391.640, rew=73.25]                                                                                                      


Epoch #38: test_reward: 70.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #39: 1025it [00:02, 406.38it/s, env_step=39936, len=8, n/ep=8, n/st=64, player_1/loss=212.993, player_2/loss=318.622, rew=76.25]                                                                                                      


Epoch #39: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #40: 1025it [00:02, 424.77it/s, env_step=40960, len=8, n/ep=8, n/st=64, player_1/loss=230.364, player_2/loss=268.057, rew=76.50]                                                                                                      


Epoch #40: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #41: 1025it [00:02, 419.43it/s, env_step=41984, len=7, n/ep=9, n/st=64, player_1/loss=307.875, player_2/loss=303.808, rew=57.56]                                                                                                      


Epoch #41: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #42: 1025it [00:02, 415.56it/s, env_step=43008, len=8, n/ep=8, n/st=64, player_1/loss=247.898, rew=83.75]                                                                                                                             


Epoch #42: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #43: 1025it [00:02, 423.06it/s, env_step=44032, len=8, n/ep=8, n/st=64, player_1/loss=139.241, player_2/loss=325.046, rew=75.75]                                                                                                      


Epoch #43: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #44: 1025it [00:02, 425.23it/s, env_step=45056, len=7, n/ep=8, n/st=64, player_1/loss=172.027, player_2/loss=324.379, rew=58.00]                                                                                                      


Epoch #44: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #45: 1025it [00:02, 422.58it/s, env_step=46080, len=8, n/ep=8, n/st=64, player_1/loss=241.665, player_2/loss=344.659, rew=79.75]                                                                                                      


Epoch #45: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #46: 1025it [00:02, 423.97it/s, env_step=47104, len=19, n/ep=3, n/st=64, player_1/loss=212.249, player_2/loss=412.567, rew=574.00]                                                                                                    


Epoch #46: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #47: 1025it [00:02, 424.42it/s, env_step=48128, len=9, n/ep=7, n/st=64, player_1/loss=217.867, player_2/loss=338.183, rew=90.29]                                                                                                      


Epoch #47: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #48: 1025it [00:02, 427.09it/s, env_step=49152, len=7, n/ep=6, n/st=64, player_1/loss=332.579, player_2/loss=466.611, rew=62.00]                                                                                                      


Epoch #48: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #49: 1025it [00:02, 426.39it/s, env_step=50176, len=7, n/ep=8, n/st=64, player_1/loss=292.163, player_2/loss=593.396, rew=58.00]                                                                                                      


Epoch #49: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #50: 1025it [00:02, 425.26it/s, env_step=51200, len=8, n/ep=8, n/st=64, player_1/loss=191.712, player_2/loss=469.326, rew=78.00]                                                                                                      


Epoch #50: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #51: 1025it [00:02, 423.77it/s, env_step=52224, len=9, n/ep=7, n/st=64, player_1/loss=141.973, player_2/loss=423.013, rew=122.86]                                                                                                     


Epoch #51: test_reward: 54.000000 ± 0.000000, best_reward: 88.000000 ± 0.000000 in #13


Epoch #52: 1025it [00:02, 425.02it/s, env_step=53248, len=15, n/ep=4, n/st=64, player_1/loss=282.954, player_2/loss=520.214, rew=284.50]                                                                                                    


Epoch #52: test_reward: 378.000000 ± 0.000000, best_reward: 378.000000 ± 0.000000 in #52


Epoch #53: 1025it [00:02, 421.78it/s, env_step=54272, len=17, n/ep=4, n/st=64, player_1/loss=525.177, player_2/loss=620.986, rew=360.50]                                                                                                    


Epoch #53: test_reward: 378.000000 ± 0.000000, best_reward: 378.000000 ± 0.000000 in #52


Epoch #54: 1025it [00:02, 424.34it/s, env_step=55296, len=19, n/ep=3, n/st=64, player_1/loss=410.065, player_2/loss=521.683, rew=392.67]                                                                                                    


Epoch #54: test_reward: 270.000000 ± 0.000000, best_reward: 378.000000 ± 0.000000 in #52


Epoch #55: 1025it [00:02, 424.73it/s, env_step=56320, len=21, n/ep=3, n/st=64, player_1/loss=394.415, player_2/loss=442.847, rew=489.33]                                                                                                    


Epoch #55: test_reward: 460.000000 ± 0.000000, best_reward: 460.000000 ± 0.000000 in #55


Epoch #56: 1025it [00:02, 424.93it/s, env_step=57344, len=21, n/ep=3, n/st=64, player_1/loss=730.396, player_2/loss=491.373, rew=478.67]                                                                                                    


Epoch #56: test_reward: 340.000000 ± 0.000000, best_reward: 460.000000 ± 0.000000 in #55


Epoch #57: 1025it [00:02, 424.82it/s, env_step=58368, len=10, n/ep=6, n/st=64, player_1/loss=691.001, player_2/loss=490.229, rew=111.33]                                                                                                    


Epoch #57: test_reward: 88.000000 ± 0.000000, best_reward: 460.000000 ± 0.000000 in #55


Epoch #58: 1025it [00:02, 427.95it/s, env_step=59392, len=21, n/ep=3, n/st=64, player_1/loss=435.400, player_2/loss=615.755, rew=462.67]                                                                                                    


Epoch #58: test_reward: 238.000000 ± 0.000000, best_reward: 460.000000 ± 0.000000 in #55


Epoch #59: 1025it [00:02, 426.71it/s, env_step=60416, len=23, n/ep=3, n/st=64, player_1/loss=394.106, player_2/loss=561.617, rew=567.33]                                                                                                    


Epoch #59: test_reward: 378.000000 ± 0.000000, best_reward: 460.000000 ± 0.000000 in #55


Epoch #60: 1025it [00:02, 426.36it/s, env_step=61440, len=21, n/ep=3, n/st=64, player_1/loss=703.435, player_2/loss=603.096, rew=478.00]                                                                                                    


Epoch #60: test_reward: 460.000000 ± 0.000000, best_reward: 460.000000 ± 0.000000 in #55


Epoch #61: 1025it [00:02, 423.93it/s, env_step=62464, len=28, n/ep=2, n/st=64, player_1/loss=1138.412, player_2/loss=845.227, rew=811.00]                                                                                                   


Epoch #61: test_reward: 868.000000 ± 0.000000, best_reward: 868.000000 ± 0.000000 in #61


Epoch #62: 1025it [00:02, 422.99it/s, env_step=63488, len=20, n/ep=3, n/st=64, player_1/loss=1234.145, player_2/loss=848.496, rew=436.00]                                                                                                   


Epoch #62: test_reward: 378.000000 ± 0.000000, best_reward: 868.000000 ± 0.000000 in #61


Epoch #63: 1025it [00:02, 425.08it/s, env_step=64512, len=26, n/ep=3, n/st=64, player_1/loss=881.911, player_2/loss=552.861, rew=738.00]                                                                                                    


Epoch #63: test_reward: 1054.000000 ± 0.000000, best_reward: 1054.000000 ± 0.000000 in #63


Epoch #64: 1025it [00:02, 423.93it/s, env_step=65536, len=28, n/ep=3, n/st=64, player_1/loss=1046.460, player_2/loss=560.647, rew=828.67]                                                                                                   


Epoch #64: test_reward: 1258.000000 ± 0.000000, best_reward: 1258.000000 ± 0.000000 in #64


Epoch #65: 1025it [00:02, 425.91it/s, env_step=66560, len=25, n/ep=3, n/st=64, player_1/loss=1572.768, player_2/loss=966.078, rew=704.00]                                                                                                   


Epoch #65: test_reward: 868.000000 ± 0.000000, best_reward: 1258.000000 ± 0.000000 in #64


Epoch #66: 1025it [00:02, 426.94it/s, env_step=67584, len=28, n/ep=3, n/st=64, player_1/loss=1402.392, player_2/loss=1209.436, rew=826.67]                                                                                                  


Epoch #66: test_reward: 868.000000 ± 0.000000, best_reward: 1258.000000 ± 0.000000 in #64


Epoch #67: 1025it [00:02, 427.75it/s, env_step=68608, len=20, n/ep=2, n/st=64, player_1/loss=820.367, player_2/loss=1275.021, rew=419.00]                                                                                                   


Epoch #67: test_reward: 754.000000 ± 0.000000, best_reward: 1258.000000 ± 0.000000 in #64


Epoch #68: 1025it [00:02, 423.75it/s, env_step=69632, len=30, n/ep=2, n/st=64, player_1/loss=1171.672, player_2/loss=1626.300, rew=929.00]                                                                                                  


Epoch #68: test_reward: 990.000000 ± 0.000000, best_reward: 1258.000000 ± 0.000000 in #64


Epoch #69: 1025it [00:02, 424.08it/s, env_step=70656, len=27, n/ep=2, n/st=64, player_1/loss=1654.587, player_2/loss=1188.219, rew=770.00]                                                                                                  


Epoch #69: test_reward: 1558.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #70: 1025it [00:02, 423.74it/s, env_step=71680, len=23, n/ep=3, n/st=64, player_1/loss=1552.119, player_2/loss=1015.186, rew=652.00]                                                                                                  


Epoch #70: test_reward: 238.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #71: 1025it [00:02, 426.98it/s, env_step=72704, len=25, n/ep=2, n/st=64, player_1/loss=754.747, player_2/loss=627.698, rew=648.00]                                                                                                    


Epoch #71: test_reward: 700.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #72: 1025it [00:02, 427.55it/s, env_step=73728, len=31, n/ep=2, n/st=64, player_1/loss=685.578, player_2/loss=530.249, rew=1006.00]                                                                                                   


Epoch #72: test_reward: 1330.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #73: 1025it [00:02, 425.28it/s, env_step=74752, len=38, n/ep=2, n/st=64, player_1/loss=946.239, player_2/loss=733.926, rew=1546.00]                                                                                                   


Epoch #73: test_reward: 1120.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #74: 1025it [00:02, 424.32it/s, env_step=75776, len=23, n/ep=3, n/st=64, player_1/loss=839.831, player_2/loss=1081.471, rew=602.00]                                                                                                   


Epoch #74: test_reward: 1330.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #75: 1025it [00:02, 424.87it/s, env_step=76800, len=28, n/ep=3, n/st=64, player_1/loss=1173.473, player_2/loss=1867.039, rew=876.00]                                                                                                  


Epoch #75: test_reward: 1480.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #76: 1025it [00:02, 425.01it/s, env_step=77824, len=23, n/ep=3, n/st=64, player_1/loss=1657.910, player_2/loss=2384.002, rew=590.00]                                                                                                  


Epoch #76: test_reward: 700.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #77: 1025it [00:02, 427.62it/s, env_step=78848, len=24, n/ep=3, n/st=64, player_1/loss=2358.159, player_2/loss=2031.038, rew=665.33]                                                                                                  


Epoch #77: test_reward: 1404.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #78: 1025it [00:02, 426.02it/s, env_step=79872, len=15, n/ep=4, n/st=64, player_1/loss=2200.318, player_2/loss=2333.583, rew=251.00]                                                                                                  


Epoch #78: test_reward: 270.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #79: 1025it [00:02, 414.77it/s, env_step=80896, len=28, n/ep=2, n/st=64, player_1/loss=1825.422, player_2/loss=2097.944, rew=895.00]                                                                                                  


Epoch #79: test_reward: 460.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #80: 1025it [00:02, 416.89it/s, env_step=81920, len=21, n/ep=3, n/st=64, player_1/loss=1199.147, player_2/loss=1917.574, rew=492.00]                                                                                                  


Epoch #80: test_reward: 270.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #81: 1025it [00:02, 429.43it/s, env_step=82944, len=24, n/ep=3, n/st=64, player_1/loss=1257.130, player_2/loss=1991.188, rew=730.67]                                                                                                  


Epoch #81: test_reward: 810.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #82: 1025it [00:02, 405.57it/s, env_step=83968, len=31, n/ep=2, n/st=64, player_1/loss=1058.093, player_2/loss=1634.332, rew=1022.00]                                                                                                 


Epoch #82: test_reward: 1480.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #83: 1025it [00:03, 330.76it/s, env_step=84992, len=31, n/ep=2, n/st=64, player_1/loss=1357.320, player_2/loss=1946.776, rew=994.00]                                                                                                  


Epoch #83: test_reward: 418.000000 ± 0.000000, best_reward: 1558.000000 ± 0.000000 in #69


Epoch #84: 1025it [00:03, 322.23it/s, env_step=86016, len=36, n/ep=1, n/st=64, player_1/loss=1982.438, player_2/loss=2389.662, rew=1330.00]                                                                                                 


Epoch #84: test_reward: 1638.000000 ± 0.000000, best_reward: 1638.000000 ± 0.000000 in #84


Epoch #85: 1025it [00:03, 339.07it/s, env_step=87040, len=28, n/ep=1, n/st=64, player_1/loss=1596.666, rew=810.00]                                                                                                                          


Epoch #85: test_reward: 418.000000 ± 0.000000, best_reward: 1638.000000 ± 0.000000 in #84


Epoch #86: 1025it [00:02, 349.35it/s, env_step=88064, len=33, n/ep=2, n/st=64, player_1/loss=1209.815, player_2/loss=1478.173, rew=1156.00]                                                                                                 


Epoch #86: test_reward: 754.000000 ± 0.000000, best_reward: 1638.000000 ± 0.000000 in #84


Epoch #87: 1025it [00:02, 348.46it/s, env_step=89088, len=34, n/ep=2, n/st=64, player_1/loss=1442.112, player_2/loss=1674.623, rew=1192.00]                                                                                                 


Epoch #87: test_reward: 990.000000 ± 0.000000, best_reward: 1638.000000 ± 0.000000 in #84


Epoch #88: 1025it [00:03, 340.99it/s, env_step=90112, len=36, n/ep=2, n/st=64, player_1/loss=1367.231, player_2/loss=1726.907, rew=1412.00]                                                                                                 


Epoch #88: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #89: 1025it [00:03, 328.31it/s, env_step=91136, len=30, n/ep=2, n/st=64, player_1/loss=2090.661, player_2/loss=2145.683, rew=1106.00]                                                                                                 


Epoch #89: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #90: 1025it [00:04, 248.26it/s, env_step=92160, len=35, n/ep=2, n/st=64, player_1/loss=2055.884, player_2/loss=2064.327, rew=1306.00]                                                                                                 


Epoch #90: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #91: 1025it [00:03, 273.79it/s, env_step=93184, len=22, n/ep=2, n/st=64, player_1/loss=1567.609, player_2/loss=1541.441, rew=617.00]                                                                                                  


Epoch #91: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #92: 1025it [00:03, 270.48it/s, env_step=94208, len=29, n/ep=2, n/st=64, player_1/loss=1595.501, player_2/loss=956.136, rew=918.00]                                                                                                   


Epoch #92: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #93: 1025it [00:03, 269.34it/s, env_step=95232, len=16, n/ep=2, n/st=64, player_1/loss=1518.382, player_2/loss=1394.839, rew=271.00]                                                                                                  


Epoch #93: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #94: 1025it [00:03, 265.53it/s, env_step=96256, len=30, n/ep=2, n/st=64, player_1/loss=1619.390, player_2/loss=1992.078, rew=1015.00]                                                                                                 


Epoch #94: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #95: 1025it [00:03, 270.16it/s, env_step=97280, len=25, n/ep=4, n/st=64, player_1/loss=2112.976, player_2/loss=2539.524, rew=758.00]                                                                                                  


Epoch #95: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #96: 1025it [00:03, 274.64it/s, env_step=98304, len=34, n/ep=2, n/st=64, player_1/loss=2355.543, player_2/loss=2776.237, rew=1235.00]                                                                                                 


Epoch #96: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #97: 1025it [00:03, 271.79it/s, env_step=99328, len=36, n/ep=2, n/st=64, player_1/loss=2668.769, player_2/loss=2996.128, rew=1331.00]                                                                                                 


Epoch #97: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #98: 1025it [00:03, 270.23it/s, env_step=100352, len=25, n/ep=2, n/st=64, player_1/loss=2412.493, player_2/loss=2464.939, rew=784.00]                                                                                                 


Epoch #98: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #99: 1025it [00:04, 254.03it/s, env_step=101376, len=15, n/ep=4, n/st=64, player_1/loss=2436.687, player_2/loss=2764.170, rew=238.50]                                                                                                 


Epoch #99: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #100: 1025it [00:03, 257.68it/s, env_step=102400, len=22, n/ep=3, n/st=64, player_1/loss=2689.833, player_2/loss=2440.923, rew=537.33]                                                                                                


Epoch #100: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #101: 1025it [00:03, 290.60it/s, env_step=103424, len=24, n/ep=2, n/st=64, player_1/loss=2623.336, player_2/loss=1597.189, rew=805.00]                                                                                                


Epoch #101: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #102: 1025it [00:03, 289.54it/s, env_step=104448, len=33, n/ep=2, n/st=64, player_1/loss=1877.298, player_2/loss=1837.178, rew=1154.00]                                                                                               


Epoch #102: test_reward: 1638.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #103: 1025it [00:03, 287.69it/s, env_step=105472, len=36, n/ep=2, n/st=64, player_1/loss=1442.476, player_2/loss=1566.153, rew=1369.00]                                                                                               


Epoch #103: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #104: 1025it [00:03, 297.08it/s, env_step=106496, len=24, n/ep=3, n/st=64, player_1/loss=967.848, player_2/loss=1157.263, rew=732.67]                                                                                                 


Epoch #104: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #105: 1025it [00:03, 295.23it/s, env_step=107520, len=32, n/ep=2, n/st=64, player_1/loss=1929.538, player_2/loss=1842.381, rew=1087.00]                                                                                               


Epoch #105: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #106: 1025it [00:03, 294.48it/s, env_step=108544, len=15, n/ep=4, n/st=64, player_1/loss=2445.090, player_2/loss=1951.664, rew=256.00]                                                                                                


Epoch #106: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #107: 1025it [00:03, 285.54it/s, env_step=109568, len=32, n/ep=2, n/st=64, player_1/loss=1993.447, player_2/loss=2282.804, rew=1055.00]                                                                                               


Epoch #107: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #108: 1025it [00:03, 281.11it/s, env_step=110592, len=28, n/ep=3, n/st=64, player_1/loss=1626.335, player_2/loss=2137.086, rew=842.00]                                                                                                


Epoch #108: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #109: 1025it [00:03, 281.29it/s, env_step=111616, len=35, n/ep=2, n/st=64, player_1/loss=2058.145, player_2/loss=1327.081, rew=1294.00]                                                                                               


Epoch #109: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #110: 1025it [00:03, 278.43it/s, env_step=112640, len=31, n/ep=2, n/st=64, player_1/loss=2534.787, player_2/loss=1693.702, rew=1006.00]                                                                                               


Epoch #110: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #111: 1025it [00:04, 243.49it/s, env_step=113664, len=27, n/ep=2, n/st=64, player_1/loss=2217.136, player_2/loss=2389.413, rew=835.00]                                                                                                


Epoch #111: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #112: 1025it [00:05, 197.41it/s, env_step=114688, len=14, n/ep=4, n/st=64, player_1/loss=2073.590, player_2/loss=2549.408, rew=231.00]                                                                                                


Epoch #112: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #113: 1025it [00:04, 233.78it/s, env_step=115712, len=27, n/ep=2, n/st=64, player_1/loss=1357.960, player_2/loss=1944.730, rew=784.00]                                                                                                


Epoch #113: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #114: 1025it [00:04, 207.58it/s, env_step=116736, len=28, n/ep=3, n/st=64, player_1/loss=1684.082, player_2/loss=1250.008, rew=840.00]                                                                                                


Epoch #114: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #115: 1025it [00:04, 225.14it/s, env_step=117760, len=35, n/ep=2, n/st=64, player_1/loss=3237.619, player_2/loss=1407.524, rew=1267.00]                                                                                               


Epoch #115: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #116: 1025it [00:04, 228.55it/s, env_step=118784, len=27, n/ep=2, n/st=64, player_1/loss=2835.467, player_2/loss=1345.807, rew=802.00]                                                                                                


Epoch #116: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #117: 1025it [00:04, 228.55it/s, env_step=119808, len=30, n/ep=2, n/st=64, player_1/loss=1456.390, player_2/loss=1717.042, rew=953.00]                                                                                                


Epoch #117: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #118: 1025it [00:04, 224.72it/s, env_step=120832, len=29, n/ep=2, n/st=64, player_1/loss=1669.057, player_2/loss=2103.081, rew=904.00]                                                                                                


Epoch #118: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #119: 1025it [00:04, 227.07it/s, env_step=121856, len=28, n/ep=3, n/st=64, player_1/loss=1638.665, player_2/loss=1942.437, rew=895.33]                                                                                                


Epoch #119: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #120: 1025it [00:04, 228.58it/s, env_step=122880, len=34, n/ep=2, n/st=64, player_1/loss=1908.114, player_2/loss=2118.445, rew=1188.00]                                                                                               


Epoch #120: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #121: 1025it [00:04, 224.73it/s, env_step=123904, len=28, n/ep=3, n/st=64, player_1/loss=2601.959, player_2/loss=2272.643, rew=930.00]                                                                                                


Epoch #121: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #122: 1025it [00:04, 226.12it/s, env_step=124928, len=23, n/ep=3, n/st=64, player_1/loss=1974.370, player_2/loss=2109.035, rew=584.67]                                                                                                


Epoch #122: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #123: 1025it [00:04, 217.86it/s, env_step=125952, len=17, n/ep=3, n/st=64, player_1/loss=1609.920, player_2/loss=2356.845, rew=317.33]                                                                                                


Epoch #123: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #124: 1025it [00:03, 273.90it/s, env_step=126976, len=31, n/ep=2, n/st=64, player_1/loss=1438.057, player_2/loss=1492.799, rew=1132.00]                                                                                               


Epoch #124: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #125: 1025it [00:04, 251.44it/s, env_step=128000, len=33, n/ep=2, n/st=64, player_1/loss=1081.203, player_2/loss=805.295, rew=1120.00]                                                                                                


Epoch #125: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #126: 1025it [00:04, 227.66it/s, env_step=129024, len=35, n/ep=2, n/st=64, player_1/loss=1180.902, player_2/loss=1122.969, rew=1296.00]                                                                                               


Epoch #126: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #127: 1025it [00:04, 223.67it/s, env_step=130048, len=27, n/ep=2, n/st=64, player_1/loss=1126.269, player_2/loss=1541.714, rew=898.00]                                                                                                


Epoch #127: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #128: 1025it [00:04, 223.73it/s, env_step=131072, len=16, n/ep=3, n/st=64, player_1/loss=1193.268, player_2/loss=1185.977, rew=286.67]                                                                                                


Epoch #128: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #129: 1025it [00:04, 228.52it/s, env_step=132096, len=19, n/ep=4, n/st=64, player_1/loss=677.790, player_2/loss=1310.390, rew=453.00]                                                                                                 


Epoch #129: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #130: 1025it [00:04, 229.58it/s, env_step=133120, len=23, n/ep=3, n/st=64, player_1/loss=928.697, player_2/loss=2203.569, rew=564.00]                                                                                                 


Epoch #130: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #131: 1025it [00:04, 227.42it/s, env_step=134144, len=29, n/ep=2, n/st=64, player_1/loss=1512.661, player_2/loss=2758.040, rew=904.00]                                                                                                


Epoch #131: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #132: 1025it [00:04, 250.52it/s, env_step=135168, len=32, n/ep=2, n/st=64, player_1/loss=1586.068, player_2/loss=2606.335, rew=1063.00]                                                                                               


Epoch #132: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #133: 1025it [00:04, 220.49it/s, env_step=136192, len=15, n/ep=4, n/st=64, player_1/loss=1417.604, player_2/loss=2719.435, rew=252.00]                                                                                                


Epoch #133: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #134: 1025it [00:04, 217.07it/s, env_step=137216, len=14, n/ep=5, n/st=64, player_1/loss=1701.486, player_2/loss=3974.140, rew=220.00]                                                                                                


Epoch #134: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #135: 1025it [00:04, 218.54it/s, env_step=138240, len=27, n/ep=3, n/st=64, player_1/loss=2047.902, player_2/loss=4301.670, rew=818.67]                                                                                                


Epoch #135: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #136: 1025it [00:04, 227.55it/s, env_step=139264, len=27, n/ep=3, n/st=64, player_1/loss=2122.914, player_2/loss=2502.348, rew=792.67]                                                                                                


Epoch #136: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #137: 1025it [00:03, 267.84it/s, env_step=140288, len=32, n/ep=2, n/st=64, player_1/loss=1683.189, player_2/loss=1672.727, rew=1070.00]                                                                                               


Epoch #137: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #138: 1025it [00:04, 228.24it/s, env_step=141312, len=32, n/ep=2, n/st=64, player_1/loss=933.618, player_2/loss=1663.323, rew=1093.00]                                                                                                


Epoch #138: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #139: 1025it [00:04, 234.75it/s, env_step=142336, len=34, n/ep=2, n/st=64, player_1/loss=844.317, player_2/loss=1865.788, rew=1225.00]                                                                                                


Epoch #139: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #140: 1025it [00:04, 224.41it/s, env_step=143360, len=34, n/ep=2, n/st=64, player_1/loss=1102.291, player_2/loss=2329.185, rew=1294.00]                                                                                               


Epoch #140: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #141: 1025it [00:04, 240.38it/s, env_step=144384, len=36, n/ep=2, n/st=64, player_1/loss=1315.910, player_2/loss=2749.426, rew=1412.00]                                                                                               


Epoch #141: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #142: 1025it [00:04, 244.88it/s, env_step=145408, len=15, n/ep=3, n/st=64, player_1/loss=2033.894, player_2/loss=2514.473, rew=266.00]                                                                                                


Epoch #142: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #143: 1025it [00:04, 236.97it/s, env_step=146432, len=23, n/ep=2, n/st=64, player_1/loss=1702.394, player_2/loss=1631.320, rew=586.00]                                                                                                


Epoch #143: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #144: 1025it [00:04, 246.78it/s, env_step=147456, len=30, n/ep=2, n/st=64, player_1/loss=1568.755, player_2/loss=1690.798, rew=937.00]                                                                                                


Epoch #144: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #145: 1025it [00:04, 238.03it/s, env_step=148480, len=21, n/ep=2, n/st=64, player_1/loss=1900.023, player_2/loss=1844.313, rew=581.00]                                                                                                


Epoch #145: test_reward: 1558.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #146: 1025it [00:04, 232.13it/s, env_step=149504, len=20, n/ep=3, n/st=64, player_1/loss=1945.801, player_2/loss=2050.070, rew=463.33]                                                                                                


Epoch #146: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #147: 1025it [00:04, 233.98it/s, env_step=150528, len=21, n/ep=3, n/st=64, player_1/loss=1700.314, player_2/loss=2368.115, rew=490.67]                                                                                                


Epoch #147: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #148: 1025it [00:04, 234.99it/s, env_step=151552, len=34, n/ep=2, n/st=64, player_1/loss=1601.808, player_2/loss=1655.779, rew=1267.00]                                                                                               


Epoch #148: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #149: 1025it [00:04, 231.26it/s, env_step=152576, len=35, n/ep=2, n/st=64, player_1/loss=1468.120, player_2/loss=2031.661, rew=1300.00]                                                                                               


Epoch #149: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #150: 1025it [00:04, 231.87it/s, env_step=153600, len=33, n/ep=1, n/st=64, player_1/loss=1201.822, player_2/loss=2114.368, rew=1120.00]                                                                                               


Epoch #150: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #151: 1025it [00:04, 217.39it/s, env_step=154624, len=22, n/ep=3, n/st=64, player_1/loss=1294.820, player_2/loss=2521.794, rew=597.33]                                                                                                


Epoch #151: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #152: 1025it [00:04, 228.95it/s, env_step=155648, len=21, n/ep=2, n/st=64, player_1/loss=1122.955, player_2/loss=2379.665, rew=482.00]                                                                                                


Epoch #152: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #153: 1025it [00:04, 240.76it/s, env_step=156672, len=20, n/ep=3, n/st=64, player_1/loss=1808.426, player_2/loss=1734.111, rew=448.67]                                                                                                


Epoch #153: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #154: 1025it [00:04, 253.49it/s, env_step=157696, len=29, n/ep=2, n/st=64, player_1/loss=1712.926, player_2/loss=2012.881, rew=893.00]                                                                                                


Epoch #154: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #155: 1025it [00:04, 230.58it/s, env_step=158720, len=33, n/ep=2, n/st=64, player_1/loss=1535.511, rew=1120.00]                                                                                                                       


Epoch #155: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #156: 1025it [00:04, 222.66it/s, env_step=159744, len=38, n/ep=2, n/st=64, player_1/loss=2364.486, player_2/loss=2246.768, rew=1511.00]                                                                                               


Epoch #156: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #157: 1025it [00:05, 185.16it/s, env_step=160768, len=21, n/ep=3, n/st=64, player_1/loss=1931.736, player_2/loss=1491.125, rew=475.33]                                                                                                


Epoch #157: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #158: 1025it [00:04, 225.10it/s, env_step=161792, len=31, n/ep=2, n/st=64, player_1/loss=1873.286, player_2/loss=1753.789, rew=1006.00]                                                                                               


Epoch #158: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #159: 1025it [00:03, 275.52it/s, env_step=162816, len=30, n/ep=2, n/st=64, player_1/loss=2079.144, player_2/loss=2484.067, rew=929.00]                                                                                                


Epoch #159: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #160: 1025it [00:04, 243.84it/s, env_step=163840, len=28, n/ep=2, n/st=64, player_1/loss=2091.263, player_2/loss=1836.097, rew=839.00]                                                                                                


Epoch #160: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #161: 1025it [00:03, 276.84it/s, env_step=164864, len=33, n/ep=2, n/st=64, player_1/loss=2297.103, player_2/loss=1454.028, rew=1160.00]                                                                                               


Epoch #161: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #162: 1025it [00:03, 278.80it/s, env_step=165888, len=24, n/ep=3, n/st=64, player_1/loss=1965.965, player_2/loss=1015.547, rew=652.00]                                                                                                


Epoch #162: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #163: 1025it [00:03, 275.53it/s, env_step=166912, len=23, n/ep=2, n/st=64, player_1/loss=1481.120, player_2/loss=882.134, rew=614.00]                                                                                                 


Epoch #163: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #164: 1025it [00:03, 272.83it/s, env_step=167936, len=33, n/ep=2, n/st=64, player_1/loss=1343.829, player_2/loss=717.097, rew=1156.00]                                                                                                


Epoch #164: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #165: 1025it [00:03, 265.64it/s, env_step=168960, len=33, n/ep=2, n/st=64, player_1/loss=1180.465, player_2/loss=1485.949, rew=1156.00]                                                                                               


Epoch #165: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #166: 1025it [00:03, 258.40it/s, env_step=169984, len=35, n/ep=2, n/st=64, player_1/loss=1859.369, player_2/loss=2441.473, rew=1262.00]                                                                                               


Epoch #166: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #167: 1025it [00:03, 263.09it/s, env_step=171008, len=29, n/ep=2, n/st=64, player_1/loss=1691.605, player_2/loss=1960.522, rew=910.00]                                                                                                


Epoch #167: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #168: 1025it [00:03, 261.82it/s, env_step=172032, len=24, n/ep=2, n/st=64, player_1/loss=1379.086, player_2/loss=1225.448, rew=623.00]                                                                                                


Epoch #168: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #169: 1025it [00:03, 258.51it/s, env_step=173056, len=26, n/ep=3, n/st=64, player_1/loss=1327.150, player_2/loss=1610.563, rew=731.33]                                                                                                


Epoch #169: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #170: 1025it [00:03, 287.58it/s, env_step=174080, len=35, n/ep=2, n/st=64, player_1/loss=1494.293, player_2/loss=1829.795, rew=1259.00]                                                                                               


Epoch #170: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #171: 1025it [00:04, 248.18it/s, env_step=175104, len=36, n/ep=2, n/st=64, player_1/loss=1262.303, player_2/loss=1289.339, rew=1334.00]                                                                                               


Epoch #171: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #172: 1025it [00:03, 264.55it/s, env_step=176128, len=22, n/ep=3, n/st=64, player_1/loss=1053.866, player_2/loss=1340.148, rew=536.00]                                                                                                


Epoch #172: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #173: 1025it [00:03, 269.41it/s, env_step=177152, len=37, n/ep=2, n/st=64, player_1/loss=1393.173, player_2/loss=1316.139, rew=1413.00]                                                                                               


Epoch #173: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #174: 1025it [00:03, 276.57it/s, env_step=178176, len=27, n/ep=3, n/st=64, player_1/loss=1533.218, player_2/loss=2565.693, rew=802.00]                                                                                                


Epoch #174: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #175: 1025it [00:03, 281.82it/s, env_step=179200, len=18, n/ep=4, n/st=64, player_1/loss=1826.709, player_2/loss=2693.952, rew=429.00]                                                                                                


Epoch #175: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #176: 1025it [00:03, 280.31it/s, env_step=180224, len=26, n/ep=3, n/st=64, player_1/loss=1555.080, player_2/loss=1757.410, rew=700.67]                                                                                                


Epoch #176: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #177: 1025it [00:03, 281.79it/s, env_step=181248, len=25, n/ep=3, n/st=64, player_1/loss=915.619, player_2/loss=1632.451, rew=648.00]                                                                                                 


Epoch #177: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #178: 1025it [00:03, 282.99it/s, env_step=182272, len=29, n/ep=2, n/st=64, player_1/loss=910.793, player_2/loss=1516.404, rew=940.00]                                                                                                 


Epoch #178: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #179: 1025it [00:03, 275.09it/s, env_step=183296, len=42, n/ep=1, n/st=64, player_1/loss=769.719, player_2/loss=1191.048, rew=1834.00]                                                                                                


Epoch #179: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #180: 1025it [00:03, 275.62it/s, env_step=184320, len=30, n/ep=3, n/st=64, player_1/loss=569.123, player_2/loss=1076.435, rew=995.33]                                                                                                 


Epoch #180: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #181: 1025it [00:03, 276.13it/s, env_step=185344, len=16, n/ep=4, n/st=64, player_1/loss=1078.571, player_2/loss=2426.113, rew=294.00]                                                                                                


Epoch #181: test_reward: 154.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #182: 1025it [00:03, 307.96it/s, env_step=186368, len=39, n/ep=2, n/st=64, player_1/loss=1568.735, player_2/loss=3066.601, rew=1558.00]                                                                                               


Epoch #182: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #183: 1025it [00:03, 286.03it/s, env_step=187392, len=29, n/ep=2, n/st=64, player_1/loss=2048.794, player_2/loss=2767.995, rew=872.00]                                                                                                


Epoch #183: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #184: 1025it [00:03, 282.57it/s, env_step=188416, len=33, n/ep=2, n/st=64, player_1/loss=1330.680, player_2/loss=2215.335, rew=1184.00]                                                                                               


Epoch #184: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #185: 1025it [00:03, 282.76it/s, env_step=189440, len=26, n/ep=2, n/st=64, player_1/loss=841.699, player_2/loss=1463.603, rew=727.00]                                                                                                 


Epoch #185: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #186: 1025it [00:03, 301.10it/s, env_step=190464, len=13, n/ep=5, n/st=64, player_1/loss=1724.211, rew=197.20]                                                                                                                        


Epoch #186: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #187: 1025it [00:04, 238.40it/s, env_step=191488, len=31, n/ep=3, n/st=64, player_1/loss=1812.913, rew=1068.67]                                                                                                                       


Epoch #187: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #188: 1025it [00:04, 208.14it/s, env_step=192512, len=29, n/ep=2, n/st=64, player_1/loss=1152.007, player_2/loss=1657.883, rew=918.00]                                                                                                


Epoch #188: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #189: 1025it [00:05, 201.30it/s, env_step=193536, len=35, n/ep=2, n/st=64, player_1/loss=940.479, player_2/loss=1279.137, rew=1274.00]                                                                                                


Epoch #189: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #190: 1025it [00:05, 199.53it/s, env_step=194560, len=41, n/ep=2, n/st=64, player_1/loss=1180.063, player_2/loss=1530.468, rew=1777.00]                                                                                               


Epoch #190: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #191: 1025it [00:03, 261.80it/s, env_step=195584, len=15, n/ep=4, n/st=64, player_1/loss=1426.700, player_2/loss=1639.270, rew=264.00]                                                                                                


Epoch #191: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #192: 1025it [00:03, 265.06it/s, env_step=196608, len=16, n/ep=4, n/st=64, player_1/loss=1695.073, player_2/loss=2070.711, rew=283.50]                                                                                                


Epoch #192: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #193: 1025it [00:03, 276.35it/s, env_step=197632, len=33, n/ep=2, n/st=64, player_1/loss=1799.063, player_2/loss=2738.434, rew=1154.00]                                                                                               


Epoch #193: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #194: 1025it [00:03, 284.27it/s, env_step=198656, len=22, n/ep=3, n/st=64, player_1/loss=1033.163, player_2/loss=2275.527, rew=582.67]                                                                                                


Epoch #194: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #195: 1025it [00:03, 278.69it/s, env_step=199680, len=21, n/ep=3, n/st=64, player_1/loss=1239.315, player_2/loss=1521.820, rew=464.67]                                                                                                


Epoch #195: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #196: 1025it [00:03, 278.14it/s, env_step=200704, len=32, n/ep=2, n/st=64, player_1/loss=1060.442, player_2/loss=1382.235, rew=1055.00]                                                                                               


Epoch #196: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #197: 1025it [00:03, 266.47it/s, env_step=201728, len=21, n/ep=3, n/st=64, player_1/loss=853.900, player_2/loss=2227.246, rew=482.67]                                                                                                 


Epoch #197: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #198: 1025it [00:03, 270.20it/s, env_step=202752, len=19, n/ep=4, n/st=64, player_1/loss=1540.203, player_2/loss=2413.968, rew=384.00]                                                                                                


Epoch #198: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #199: 1025it [00:03, 272.82it/s, env_step=203776, len=23, n/ep=4, n/st=64, player_1/loss=1836.111, player_2/loss=1698.935, rew=632.50]                                                                                                


Epoch #199: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #200: 1025it [00:03, 309.67it/s, env_step=204800, len=10, n/ep=7, n/st=64, player_1/loss=1395.473, player_2/loss=1849.092, rew=138.86]                                                                                                


Epoch #200: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #201: 1025it [00:04, 249.30it/s, env_step=205824, len=31, n/ep=2, n/st=64, player_1/loss=1681.119, player_2/loss=2383.295, rew=1006.00]                                                                                               


Epoch #201: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #202: 1025it [00:03, 264.55it/s, env_step=206848, len=23, n/ep=2, n/st=64, player_1/loss=1345.189, player_2/loss=2006.973, rew=554.00]                                                                                                


Epoch #202: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #203: 1025it [00:03, 262.22it/s, env_step=207872, len=29, n/ep=2, n/st=64, player_1/loss=1377.110, player_2/loss=1500.753, rew=904.00]                                                                                                


Epoch #203: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #204: 1025it [00:03, 279.11it/s, env_step=208896, len=18, n/ep=3, n/st=64, player_1/loss=1629.255, player_2/loss=1508.791, rew=526.67]                                                                                                


Epoch #204: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #205: 1025it [00:03, 285.86it/s, env_step=209920, len=28, n/ep=2, n/st=64, player_1/loss=1453.261, player_2/loss=1673.946, rew=1036.00]                                                                                               


Epoch #205: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #206: 1025it [00:03, 283.11it/s, env_step=210944, len=20, n/ep=4, n/st=64, player_1/loss=1118.083, player_2/loss=2037.036, rew=530.50]                                                                                                


Epoch #206: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #207: 1025it [00:03, 283.23it/s, env_step=211968, len=33, n/ep=2, n/st=64, player_1/loss=1116.270, player_2/loss=2263.189, rew=1166.00]                                                                                               


Epoch #207: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #208: 1025it [00:03, 288.99it/s, env_step=212992, len=31, n/ep=2, n/st=64, player_1/loss=1754.457, player_2/loss=2231.556, rew=1022.00]                                                                                               


Epoch #208: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #209: 1025it [00:03, 286.36it/s, env_step=214016, len=15, n/ep=4, n/st=64, player_1/loss=1440.269, player_2/loss=2034.825, rew=266.50]                                                                                                


Epoch #209: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #210: 1025it [00:03, 265.50it/s, env_step=215040, len=16, n/ep=4, n/st=64, player_1/loss=1517.162, player_2/loss=1475.438, rew=298.50]                                                                                                


Epoch #210: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #211: 1025it [00:03, 258.24it/s, env_step=216064, len=15, n/ep=4, n/st=64, player_1/loss=1654.882, player_2/loss=1594.000, rew=264.00]                                                                                                


Epoch #211: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #212: 1025it [00:04, 254.86it/s, env_step=217088, len=28, n/ep=3, n/st=64, player_1/loss=1335.604, player_2/loss=1377.490, rew=894.67]                                                                                                


Epoch #212: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #213: 1025it [00:03, 265.59it/s, env_step=218112, len=22, n/ep=3, n/st=64, player_1/loss=730.884, player_2/loss=1056.612, rew=506.00]                                                                                                 


Epoch #213: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #214: 1025it [00:04, 249.25it/s, env_step=219136, len=17, n/ep=4, n/st=64, player_1/loss=765.056, player_2/loss=1265.710, rew=382.00]                                                                                                 


Epoch #214: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #215: 1025it [00:03, 271.59it/s, env_step=220160, len=14, n/ep=6, n/st=64, player_1/loss=1207.026, player_2/loss=2062.913, rew=377.67]                                                                                                


Epoch #215: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #216: 1025it [00:03, 268.03it/s, env_step=221184, len=25, n/ep=2, n/st=64, player_1/loss=1979.227, player_2/loss=2249.568, rew=674.00]                                                                                                


Epoch #216: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #217: 1025it [00:04, 247.37it/s, env_step=222208, len=28, n/ep=2, n/st=64, player_1/loss=1685.055, player_2/loss=1678.429, rew=846.00]                                                                                                


Epoch #217: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #218: 1025it [00:04, 237.39it/s, env_step=223232, len=7, n/ep=8, n/st=64, player_1/loss=1404.239, player_2/loss=1596.262, rew=62.50]                                                                                                  


Epoch #218: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #219: 1025it [00:04, 252.34it/s, env_step=224256, len=20, n/ep=3, n/st=64, player_1/loss=1263.523, player_2/loss=1801.314, rew=418.67]                                                                                                


Epoch #219: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #220: 1025it [00:04, 238.94it/s, env_step=225280, len=29, n/ep=2, n/st=64, player_1/loss=1743.786, player_2/loss=1673.308, rew=904.00]                                                                                                


Epoch #220: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #221: 1025it [00:04, 236.71it/s, env_step=226304, len=25, n/ep=3, n/st=64, player_1/loss=1667.138, player_2/loss=1054.253, rew=672.67]                                                                                                


Epoch #221: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #222: 1025it [00:04, 242.95it/s, env_step=227328, len=25, n/ep=2, n/st=64, player_1/loss=1113.967, player_2/loss=1513.912, rew=961.00]                                                                                                


Epoch #222: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #223: 1025it [00:04, 243.17it/s, env_step=228352, len=36, n/ep=2, n/st=64, player_1/loss=1074.343, player_2/loss=1636.984, rew=1373.00]                                                                                               


Epoch #223: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #224: 1025it [00:04, 247.20it/s, env_step=229376, len=23, n/ep=3, n/st=64, player_1/loss=1115.702, player_2/loss=1336.931, rew=560.67]                                                                                                


Epoch #224: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #225: 1025it [00:04, 251.02it/s, env_step=230400, len=30, n/ep=2, n/st=64, player_1/loss=1090.236, player_2/loss=1594.088, rew=977.00]                                                                                                


Epoch #225: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #226: 1025it [00:04, 238.83it/s, env_step=231424, len=32, n/ep=2, n/st=64, player_1/loss=1090.295, player_2/loss=1580.657, rew=1058.00]                                                                                               


Epoch #226: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #227: 1025it [00:04, 243.36it/s, env_step=232448, len=26, n/ep=2, n/st=64, player_1/loss=1272.162, player_2/loss=1693.008, rew=909.00]                                                                                                


Epoch #227: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #228: 1025it [00:04, 242.87it/s, env_step=233472, len=22, n/ep=2, n/st=64, player_1/loss=1352.346, player_2/loss=1940.719, rew=547.00]                                                                                                


Epoch #228: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #229: 1025it [00:04, 234.72it/s, env_step=234496, len=20, n/ep=3, n/st=64, player_1/loss=1045.440, player_2/loss=1025.113, rew=447.33]                                                                                                


Epoch #229: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #230: 1025it [00:03, 260.30it/s, env_step=235520, len=32, n/ep=2, n/st=64, player_1/loss=898.557, player_2/loss=1307.766, rew=1117.00]                                                                                                


Epoch #230: test_reward: 130.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #231: 1025it [00:03, 261.51it/s, env_step=236544, len=12, n/ep=4, n/st=64, player_1/loss=1134.121, player_2/loss=1607.581, rew=173.00]                                                                                                


Epoch #231: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #232: 1025it [00:03, 267.09it/s, env_step=237568, len=15, n/ep=4, n/st=64, player_1/loss=1300.577, player_2/loss=957.041, rew=244.50]                                                                                                 


Epoch #232: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #233: 1025it [00:03, 278.06it/s, env_step=238592, len=8, n/ep=8, n/st=64, player_1/loss=1380.673, player_2/loss=1010.733, rew=79.25]                                                                                                  


Epoch #233: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #234: 1025it [00:03, 279.31it/s, env_step=239616, len=7, n/ep=8, n/st=64, player_1/loss=1379.621, player_2/loss=1646.864, rew=69.50]                                                                                                  


Epoch #234: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #235: 1025it [00:03, 273.88it/s, env_step=240640, len=8, n/ep=7, n/st=64, player_1/loss=1528.456, player_2/loss=2408.060, rew=76.57]                                                                                                  


Epoch #235: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #236: 1025it [00:03, 260.08it/s, env_step=241664, len=13, n/ep=5, n/st=64, player_1/loss=1125.996, player_2/loss=1852.493, rew=189.60]                                                                                                


Epoch #236: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #237: 1025it [00:04, 253.02it/s, env_step=242688, len=16, n/ep=4, n/st=64, player_1/loss=988.545, player_2/loss=1319.187, rew=320.50]                                                                                                 


Epoch #237: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #238: 1025it [00:03, 272.48it/s, env_step=243712, len=8, n/ep=8, n/st=64, player_1/loss=1529.160, player_2/loss=1659.375, rew=78.25]                                                                                                  


Epoch #238: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #239: 1025it [00:05, 193.82it/s, env_step=244736, len=25, n/ep=2, n/st=64, player_1/loss=1277.605, player_2/loss=1801.998, rew=712.00]                                                                                                


Epoch #239: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #240: 1025it [00:04, 242.56it/s, env_step=245760, len=7, n/ep=9, n/st=64, player_1/loss=969.805, rew=68.00]                                                                                                                           


Epoch #240: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #241: 1025it [00:04, 245.08it/s, env_step=246784, len=14, n/ep=4, n/st=64, player_1/loss=891.528, player_2/loss=1438.678, rew=217.00]                                                                                                 


Epoch #241: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #242: 1025it [00:04, 250.60it/s, env_step=247808, len=21, n/ep=3, n/st=64, player_1/loss=939.963, player_2/loss=1609.901, rew=502.67]                                                                                                 


Epoch #242: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #243: 1025it [00:04, 244.29it/s, env_step=248832, len=34, n/ep=2, n/st=64, player_1/loss=1049.163, player_2/loss=1421.219, rew=1225.00]                                                                                               


Epoch #243: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #244: 1025it [00:04, 254.91it/s, env_step=249856, len=9, n/ep=7, n/st=64, player_1/loss=1011.886, player_2/loss=1713.987, rew=100.00]                                                                                                 


Epoch #244: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #245: 1025it [00:03, 263.82it/s, env_step=250880, len=12, n/ep=5, n/st=64, player_1/loss=1106.319, player_2/loss=1435.087, rew=172.80]                                                                                                


Epoch #245: test_reward: 130.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #246: 1025it [00:03, 294.84it/s, env_step=251904, len=14, n/ep=5, n/st=64, player_1/loss=998.546, player_2/loss=1114.752, rew=233.20]                                                                                                 


Epoch #246: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #247: 1025it [00:04, 219.42it/s, env_step=252928, len=21, n/ep=3, n/st=64, player_2/loss=1625.251, rew=492.00]                                                                                                                        


Epoch #247: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #248: 1025it [00:05, 204.72it/s, env_step=253952, len=34, n/ep=1, n/st=64, player_1/loss=1651.626, player_2/loss=2135.944, rew=1188.00]                                                                                               


Epoch #248: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #249: 1025it [00:10, 94.22it/s, env_step=254976, len=27, n/ep=3, n/st=64, player_1/loss=1266.685, player_2/loss=1668.241, rew=820.67]                                                                                                 


Epoch #249: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #250: 1025it [00:08, 116.33it/s, env_step=256000, len=27, n/ep=3, n/st=64, player_1/loss=1475.487, player_2/loss=1417.469, rew=826.67]                                                                                                


Epoch #250: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #251: 1025it [00:04, 253.06it/s, env_step=257024, len=29, n/ep=2, n/st=64, player_1/loss=1596.103, player_2/loss=1052.280, rew=877.00]                                                                                                


Epoch #251: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #252: 1025it [00:03, 298.68it/s, env_step=258048, len=21, n/ep=3, n/st=64, player_1/loss=1194.097, player_2/loss=1331.966, rew=512.00]                                                                                                


Epoch #252: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #253: 1025it [00:03, 301.00it/s, env_step=259072, len=12, n/ep=6, n/st=64, player_1/loss=892.545, player_2/loss=1745.432, rew=242.67]                                                                                                 


Epoch #253: test_reward: 130.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #254: 1025it [00:03, 308.13it/s, env_step=260096, len=34, n/ep=2, n/st=64, player_1/loss=661.088, player_2/loss=1856.449, rew=1189.00]                                                                                                


Epoch #254: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #255: 1025it [00:03, 301.37it/s, env_step=261120, len=20, n/ep=4, n/st=64, player_1/loss=1476.162, player_2/loss=1278.657, rew=623.50]                                                                                                


Epoch #255: test_reward: 70.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #256: 1025it [00:03, 299.71it/s, env_step=262144, len=30, n/ep=2, n/st=64, player_1/loss=1611.821, player_2/loss=1085.588, rew=1049.00]                                                                                               


Epoch #256: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #257: 1025it [00:03, 295.46it/s, env_step=263168, len=39, n/ep=2, n/st=64, player_1/loss=1040.396, player_2/loss=1050.379, rew=1619.00]                                                                                               


Epoch #257: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #258: 1025it [00:03, 267.99it/s, env_step=264192, len=35, n/ep=2, n/st=64, player_1/loss=1256.130, player_2/loss=1084.994, rew=1262.00]                                                                                               


Epoch #258: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #259: 1025it [00:03, 272.25it/s, env_step=265216, len=23, n/ep=3, n/st=64, player_1/loss=1007.849, player_2/loss=829.936, rew=716.00]                                                                                                 


Epoch #259: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #260: 1025it [00:03, 286.93it/s, env_step=266240, len=28, n/ep=2, n/st=64, player_1/loss=1450.195, player_2/loss=745.482, rew=835.00]                                                                                                 


Epoch #260: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #261: 1025it [00:03, 288.96it/s, env_step=267264, len=34, n/ep=2, n/st=64, player_1/loss=1700.437, player_2/loss=1062.046, rew=1189.00]                                                                                               


Epoch #261: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #262: 1025it [00:13, 78.81it/s, env_step=268288, len=36, n/ep=2, n/st=64, player_1/loss=1221.134, player_2/loss=1050.271, rew=1331.00]                                                                                                


Epoch #262: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #263: 1025it [00:06, 154.08it/s, env_step=269312, len=37, n/ep=1, n/st=64, player_1/loss=1289.091, player_2/loss=1269.679, rew=1404.00]                                                                                               


Epoch #263: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #264: 1025it [00:06, 160.11it/s, env_step=270336, len=19, n/ep=3, n/st=64, player_1/loss=1230.072, player_2/loss=1580.971, rew=380.67]                                                                                                


Epoch #264: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #265: 1025it [00:06, 163.47it/s, env_step=271360, len=20, n/ep=4, n/st=64, player_1/loss=940.210, player_2/loss=1725.516, rew=432.50]                                                                                                 


Epoch #265: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #266: 1025it [00:03, 293.00it/s, env_step=272384, len=18, n/ep=3, n/st=64, player_1/loss=1095.801, player_2/loss=1411.862, rew=380.67]                                                                                                


Epoch #266: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #267: 1025it [00:04, 240.63it/s, env_step=273408, len=34, n/ep=2, n/st=64, player_1/loss=1498.913, player_2/loss=1368.423, rew=1223.00]                                                                                               


Epoch #267: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #268: 1025it [00:05, 182.94it/s, env_step=274432, len=28, n/ep=2, n/st=64, player_1/loss=1453.833, player_2/loss=1297.800, rew=859.00]                                                                                                


Epoch #268: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #269: 1025it [00:04, 214.20it/s, env_step=275456, len=26, n/ep=2, n/st=64, player_1/loss=878.515, player_2/loss=1206.230, rew=757.00]                                                                                                 


Epoch #269: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #270: 1025it [00:03, 260.01it/s, env_step=276480, len=31, n/ep=2, n/st=64, player_1/loss=623.829, player_2/loss=1140.537, rew=1022.00]                                                                                                


Epoch #270: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #271: 1025it [00:03, 261.01it/s, env_step=277504, len=22, n/ep=3, n/st=64, player_1/loss=581.377, player_2/loss=1259.207, rew=536.00]                                                                                                 


Epoch #271: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #272: 1025it [00:03, 272.11it/s, env_step=278528, len=20, n/ep=3, n/st=64, player_1/loss=1569.207, player_2/loss=1261.757, rew=447.33]                                                                                                


Epoch #272: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #273: 1025it [00:03, 272.94it/s, env_step=279552, len=19, n/ep=3, n/st=64, player_1/loss=1364.481, player_2/loss=1698.521, rew=391.33]                                                                                                


Epoch #273: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #274: 1025it [00:04, 242.96it/s, env_step=280576, len=24, n/ep=2, n/st=64, player_1/loss=1006.616, player_2/loss=1940.043, rew=625.00]                                                                                                


Epoch #274: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #275: 1025it [00:04, 232.31it/s, env_step=281600, len=35, n/ep=1, n/st=64, player_1/loss=1045.131, player_2/loss=1996.316, rew=1258.00]                                                                                               


Epoch #275: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #276: 1025it [00:04, 243.46it/s, env_step=282624, len=17, n/ep=3, n/st=64, player_1/loss=994.393, player_2/loss=1384.362, rew=312.00]                                                                                                 


Epoch #276: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #277: 1025it [00:03, 269.53it/s, env_step=283648, len=14, n/ep=5, n/st=64, player_1/loss=991.742, rew=224.40]                                                                                                                         


Epoch #277: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #278: 1025it [00:03, 274.31it/s, env_step=284672, len=17, n/ep=4, n/st=64, player_1/loss=1067.634, player_2/loss=1510.281, rew=329.00]                                                                                                


Epoch #278: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #279: 1025it [00:03, 264.65it/s, env_step=285696, len=29, n/ep=2, n/st=64, player_1/loss=1116.725, player_2/loss=1328.591, rew=904.00]                                                                                                


Epoch #279: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #280: 1025it [00:03, 267.84it/s, env_step=286720, len=19, n/ep=3, n/st=64, player_1/loss=1433.165, player_2/loss=1676.390, rew=405.33]                                                                                                


Epoch #280: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #281: 1025it [00:03, 279.64it/s, env_step=287744, len=20, n/ep=3, n/st=64, player_1/loss=1871.455, player_2/loss=1927.836, rew=432.67]                                                                                                


Epoch #281: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #282: 1025it [00:03, 291.32it/s, env_step=288768, len=31, n/ep=2, n/st=64, player_1/loss=1917.869, player_2/loss=1106.859, rew=994.00]                                                                                                


Epoch #282: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #283: 1025it [00:03, 288.89it/s, env_step=289792, len=42, n/ep=1, n/st=64, player_1/loss=2074.012, player_2/loss=1285.056, rew=1834.00]                                                                                               


Epoch #283: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #284: 1025it [00:03, 293.35it/s, env_step=290816, len=20, n/ep=3, n/st=64, player_1/loss=1737.495, player_2/loss=1675.896, rew=426.67]                                                                                                


Epoch #284: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #285: 1025it [00:03, 285.88it/s, env_step=291840, len=35, n/ep=2, n/st=64, player_1/loss=1318.995, player_2/loss=1793.538, rew=1259.00]                                                                                               


Epoch #285: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #286: 1025it [00:03, 287.44it/s, env_step=292864, len=27, n/ep=2, n/st=64, player_1/loss=1006.033, player_2/loss=1926.076, rew=784.00]                                                                                                


Epoch #286: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #287: 1025it [00:03, 282.45it/s, env_step=293888, len=19, n/ep=3, n/st=64, player_2/loss=1064.103, rew=420.67]                                                                                                                        


Epoch #287: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #288: 1025it [00:03, 287.17it/s, env_step=294912, len=31, n/ep=2, n/st=64, player_1/loss=984.520, player_2/loss=866.264, rew=1026.00]                                                                                                 


Epoch #288: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #289: 1025it [00:03, 285.46it/s, env_step=295936, len=28, n/ep=2, n/st=64, player_2/loss=846.118, rew=971.00]                                                                                                                         


Epoch #289: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #290: 1025it [00:03, 268.01it/s, env_step=296960, len=33, n/ep=2, n/st=64, player_1/loss=579.798, player_2/loss=1238.398, rew=1241.00]                                                                                                


Epoch #290: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #291: 1025it [00:03, 273.61it/s, env_step=297984, len=21, n/ep=3, n/st=64, player_1/loss=672.589, player_2/loss=1425.806, rew=496.00]                                                                                                 


Epoch #291: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #292: 1025it [00:03, 277.18it/s, env_step=299008, len=25, n/ep=3, n/st=64, player_1/loss=759.888, player_2/loss=1147.053, rew=790.67]                                                                                                 


Epoch #292: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #293: 1025it [00:03, 273.00it/s, env_step=300032, len=31, n/ep=2, n/st=64, player_1/loss=993.735, player_2/loss=1324.290, rew=1064.00]                                                                                                


Epoch #293: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #294: 1025it [00:03, 278.52it/s, env_step=301056, len=29, n/ep=2, n/st=64, player_1/loss=945.547, player_2/loss=1008.510, rew=910.00]                                                                                                 


Epoch #294: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #295: 1025it [00:03, 268.17it/s, env_step=302080, len=35, n/ep=1, n/st=64, player_1/loss=808.518, player_2/loss=1005.295, rew=1258.00]                                                                                                


Epoch #295: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #296: 1025it [00:03, 271.69it/s, env_step=303104, len=14, n/ep=4, n/st=64, player_1/loss=1698.802, player_2/loss=1531.978, rew=231.00]                                                                                                


Epoch #296: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #297: 1025it [00:03, 267.32it/s, env_step=304128, len=13, n/ep=4, n/st=64, player_1/loss=1944.851, player_2/loss=1890.263, rew=203.00]                                                                                                


Epoch #297: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #298: 1025it [00:04, 244.88it/s, env_step=305152, len=17, n/ep=3, n/st=64, player_1/loss=1316.309, player_2/loss=2380.290, rew=369.33]                                                                                                


Epoch #298: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #299: 1025it [00:04, 245.46it/s, env_step=306176, len=23, n/ep=3, n/st=64, player_1/loss=1072.463, player_2/loss=1685.420, rew=722.00]                                                                                                


Epoch #299: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #300: 1025it [00:04, 250.68it/s, env_step=307200, len=26, n/ep=3, n/st=64, player_1/loss=915.520, player_2/loss=1718.453, rew=712.67]                                                                                                 


Epoch #300: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #301: 1025it [00:04, 247.75it/s, env_step=308224, len=25, n/ep=3, n/st=64, player_1/loss=740.106, player_2/loss=1143.410, rew=704.00]                                                                                                 


Epoch #301: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #302: 1025it [00:04, 244.87it/s, env_step=309248, len=20, n/ep=3, n/st=64, player_1/loss=650.886, player_2/loss=602.817, rew=452.00]                                                                                                  


Epoch #302: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #303: 1025it [00:04, 247.08it/s, env_step=310272, len=19, n/ep=3, n/st=64, player_1/loss=1405.990, player_2/loss=1522.711, rew=398.00]                                                                                                


Epoch #303: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #304: 1025it [00:04, 240.00it/s, env_step=311296, len=19, n/ep=4, n/st=64, player_1/loss=1641.264, player_2/loss=1882.297, rew=387.00]                                                                                                


Epoch #304: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #305: 1025it [00:04, 249.89it/s, env_step=312320, len=22, n/ep=2, n/st=64, player_1/loss=1131.378, player_2/loss=1225.734, rew=539.00]                                                                                                


Epoch #305: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #306: 1025it [00:04, 249.73it/s, env_step=313344, len=22, n/ep=3, n/st=64, player_1/loss=1183.470, player_2/loss=1307.275, rew=508.67]                                                                                                


Epoch #306: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #307: 1025it [00:04, 247.41it/s, env_step=314368, len=29, n/ep=2, n/st=64, player_1/loss=1159.916, player_2/loss=1273.925, rew=868.00]                                                                                                


Epoch #307: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #308: 1025it [00:03, 257.99it/s, env_step=315392, len=26, n/ep=3, n/st=64, player_1/loss=782.585, player_2/loss=1530.058, rew=718.00]                                                                                                 


Epoch #308: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #309: 1025it [00:04, 252.90it/s, env_step=316416, len=24, n/ep=3, n/st=64, player_1/loss=805.214, player_2/loss=1494.319, rew=604.00]                                                                                                 


Epoch #309: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #310: 1025it [00:04, 249.64it/s, env_step=317440, len=26, n/ep=2, n/st=64, player_1/loss=791.761, player_2/loss=836.122, rew=725.00]                                                                                                  


Epoch #310: test_reward: 1558.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #311: 1025it [00:04, 252.84it/s, env_step=318464, len=15, n/ep=6, n/st=64, player_1/loss=619.999, player_2/loss=1238.136, rew=360.00]                                                                                                 


Epoch #311: test_reward: 108.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #312: 1025it [00:04, 253.26it/s, env_step=319488, len=21, n/ep=3, n/st=64, player_1/loss=1540.362, player_2/loss=2548.368, rew=462.00]                                                                                                


Epoch #312: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #313: 1025it [00:04, 253.59it/s, env_step=320512, len=27, n/ep=3, n/st=64, player_1/loss=2141.156, player_2/loss=2669.798, rew=820.67]                                                                                                


Epoch #313: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #314: 1025it [00:04, 251.28it/s, env_step=321536, len=29, n/ep=2, n/st=64, player_1/loss=1443.649, player_2/loss=2198.041, rew=904.00]                                                                                                


Epoch #314: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #315: 1025it [00:04, 245.82it/s, env_step=322560, len=15, n/ep=4, n/st=64, player_1/loss=1017.598, player_2/loss=2010.021, rew=241.50]                                                                                                


Epoch #315: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #316: 1025it [00:04, 247.58it/s, env_step=323584, len=14, n/ep=4, n/st=64, player_1/loss=1123.081, player_2/loss=1982.217, rew=224.00]                                                                                                


Epoch #316: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #317: 1025it [00:04, 251.92it/s, env_step=324608, len=14, n/ep=4, n/st=64, player_1/loss=1475.314, player_2/loss=1526.711, rew=216.00]                                                                                                


Epoch #317: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #318: 1025it [00:04, 248.42it/s, env_step=325632, len=21, n/ep=3, n/st=64, player_1/loss=1365.995, player_2/loss=1541.284, rew=490.00]                                                                                                


Epoch #318: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #319: 1025it [00:04, 243.63it/s, env_step=326656, len=31, n/ep=2, n/st=64, player_1/loss=1298.020, player_2/loss=1660.331, rew=1034.00]                                                                                               


Epoch #319: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #320: 1025it [00:04, 248.99it/s, env_step=327680, len=38, n/ep=2, n/st=64, player_1/loss=1134.229, player_2/loss=1810.768, rew=1484.00]                                                                                               


Epoch #320: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #321: 1025it [00:03, 262.51it/s, env_step=328704, len=30, n/ep=2, n/st=64, player_1/loss=673.593, player_2/loss=1676.661, rew=971.00]                                                                                                 


Epoch #321: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #322: 1025it [00:02, 431.46it/s, env_step=329728, len=37, n/ep=1, n/st=64, player_1/loss=480.609, rew=1404.00]                                                                                                                        


Epoch #322: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #323: 1025it [00:02, 428.85it/s, env_step=330752, len=20, n/ep=2, n/st=64, player_1/loss=920.408, player_2/loss=1755.388, rew=571.00]                                                                                                 


Epoch #323: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #324: 1025it [00:02, 426.83it/s, env_step=331776, len=37, n/ep=1, n/st=64, player_1/loss=1216.419, player_2/loss=1677.880, rew=1404.00]                                                                                               


Epoch #324: test_reward: 1558.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #325: 1025it [00:02, 431.19it/s, env_step=332800, len=15, n/ep=4, n/st=64, player_1/loss=1457.237, player_2/loss=1920.533, rew=248.00]                                                                                                


Epoch #325: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #326: 1025it [00:02, 430.36it/s, env_step=333824, len=19, n/ep=3, n/st=64, player_1/loss=1463.557, player_2/loss=1963.415, rew=410.67]                                                                                                


Epoch #326: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #327: 1025it [00:02, 427.56it/s, env_step=334848, len=23, n/ep=3, n/st=64, player_1/loss=1272.544, player_2/loss=1762.307, rew=570.00]                                                                                                


Epoch #327: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #328: 1025it [00:02, 427.11it/s, env_step=335872, len=15, n/ep=4, n/st=64, player_1/loss=1213.255, player_2/loss=1497.958, rew=247.00]                                                                                                


Epoch #328: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #329: 1025it [00:02, 409.64it/s, env_step=336896, len=12, n/ep=5, n/st=64, player_1/loss=912.644, player_2/loss=1431.873, rew=179.60]                                                                                                 


Epoch #329: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #330: 1025it [00:02, 397.66it/s, env_step=337920, len=29, n/ep=2, n/st=64, player_1/loss=1092.459, player_2/loss=1361.295, rew=917.00]                                                                                                


Epoch #330: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #331: 1025it [00:02, 443.02it/s, env_step=338944, len=14, n/ep=4, n/st=64, player_1/loss=1686.717, player_2/loss=1360.590, rew=209.00]                                                                                                


Epoch #331: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #332: 1025it [00:02, 433.53it/s, env_step=339968, len=34, n/ep=2, n/st=64, player_1/loss=1605.105, player_2/loss=1388.453, rew=1267.00]                                                                                               


Epoch #332: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #333: 1025it [00:02, 428.46it/s, env_step=340992, len=22, n/ep=3, n/st=64, player_1/loss=1172.837, player_2/loss=1102.123, rew=523.33]                                                                                                


Epoch #333: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #334: 1025it [00:02, 417.88it/s, env_step=342016, len=14, n/ep=5, n/st=64, player_1/loss=1254.643, player_2/loss=1202.236, rew=220.80]                                                                                                


Epoch #334: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #335: 1025it [00:02, 436.33it/s, env_step=343040, len=13, n/ep=4, n/st=64, player_1/loss=1387.576, player_2/loss=1346.618, rew=202.00]                                                                                                


Epoch #335: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #336: 1025it [00:02, 421.46it/s, env_step=344064, len=14, n/ep=5, n/st=64, player_1/loss=1074.839, player_2/loss=1177.405, rew=216.00]                                                                                                


Epoch #336: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #337: 1025it [00:02, 429.34it/s, env_step=345088, len=26, n/ep=3, n/st=64, player_1/loss=719.182, player_2/loss=1323.470, rew=720.67]                                                                                                 


Epoch #337: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #338: 1025it [00:02, 396.71it/s, env_step=346112, len=19, n/ep=3, n/st=64, player_1/loss=593.858, player_2/loss=1225.253, rew=394.00]                                                                                                 


Epoch #338: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #339: 1025it [00:02, 412.45it/s, env_step=347136, len=27, n/ep=2, n/st=64, player_1/loss=812.094, player_2/loss=1047.382, rew=784.00]                                                                                                 


Epoch #339: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #340: 1025it [00:02, 447.71it/s, env_step=348160, len=31, n/ep=2, n/st=64, player_1/loss=996.427, player_2/loss=968.244, rew=999.00]                                                                                                  


Epoch #340: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #341: 1025it [00:02, 443.39it/s, env_step=349184, len=14, n/ep=5, n/st=64, player_1/loss=925.492, player_2/loss=633.251, rew=208.40]                                                                                                  


Epoch #341: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #342: 1025it [00:02, 447.39it/s, env_step=350208, len=22, n/ep=2, n/st=64, player_1/loss=833.656, player_2/loss=983.561, rew=505.00]                                                                                                  


Epoch #342: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #343: 1025it [00:02, 452.64it/s, env_step=351232, len=32, n/ep=2, n/st=64, player_1/loss=1115.273, player_2/loss=1059.816, rew=1089.00]                                                                                               


Epoch #343: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #344: 1025it [00:02, 448.82it/s, env_step=352256, len=16, n/ep=3, n/st=64, player_1/loss=1232.388, player_2/loss=1118.958, rew=294.00]                                                                                                


Epoch #344: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #345: 1025it [00:02, 443.95it/s, env_step=353280, len=11, n/ep=5, n/st=64, player_1/loss=1361.628, player_2/loss=1756.664, rew=184.40]                                                                                                


Epoch #345: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #346: 1025it [00:02, 451.38it/s, env_step=354304, len=27, n/ep=3, n/st=64, player_1/loss=1132.339, player_2/loss=1674.026, rew=786.00]                                                                                                


Epoch #346: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #347: 1025it [00:02, 450.74it/s, env_step=355328, len=28, n/ep=2, n/st=64, player_1/loss=1051.363, player_2/loss=1340.438, rew=811.00]                                                                                                


Epoch #347: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #348: 1025it [00:02, 450.22it/s, env_step=356352, len=14, n/ep=4, n/st=64, player_1/loss=1347.054, player_2/loss=847.120, rew=208.50]                                                                                                 


Epoch #348: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #349: 1025it [00:02, 453.11it/s, env_step=357376, len=21, n/ep=3, n/st=64, player_1/loss=1321.601, player_2/loss=1193.968, rew=518.67]                                                                                                


Epoch #349: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #350: 1025it [00:02, 450.94it/s, env_step=358400, len=26, n/ep=3, n/st=64, player_1/loss=743.667, player_2/loss=1343.714, rew=780.00]                                                                                                 


Epoch #350: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #351: 1025it [00:02, 454.24it/s, env_step=359424, len=24, n/ep=3, n/st=64, player_1/loss=960.657, player_2/loss=824.876, rew=664.00]                                                                                                  


Epoch #351: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #352: 1025it [00:02, 454.26it/s, env_step=360448, len=30, n/ep=2, n/st=64, player_1/loss=1386.193, player_2/loss=1497.746, rew=929.00]                                                                                                


Epoch #352: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #353: 1025it [00:02, 453.21it/s, env_step=361472, len=26, n/ep=2, n/st=64, player_1/loss=945.984, player_2/loss=1255.542, rew=733.00]                                                                                                 


Epoch #353: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #354: 1025it [00:02, 453.72it/s, env_step=362496, len=29, n/ep=2, n/st=64, player_1/loss=836.062, player_2/loss=1259.578, rew=904.00]                                                                                                 


Epoch #354: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #355: 1025it [00:02, 451.54it/s, env_step=363520, len=37, n/ep=2, n/st=64, player_1/loss=906.774, player_2/loss=1061.571, rew=1420.00]                                                                                                


Epoch #355: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #356: 1025it [00:02, 454.25it/s, env_step=364544, len=17, n/ep=3, n/st=64, player_1/loss=1413.130, player_2/loss=1399.992, rew=340.00]                                                                                                


Epoch #356: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #357: 1025it [00:02, 448.48it/s, env_step=365568, len=16, n/ep=4, n/st=64, player_1/loss=1632.874, player_2/loss=1345.328, rew=283.50]                                                                                                


Epoch #357: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #358: 1025it [00:02, 451.39it/s, env_step=366592, len=19, n/ep=4, n/st=64, player_1/loss=1090.363, player_2/loss=1095.670, rew=437.00]                                                                                                


Epoch #358: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #359: 1025it [00:02, 450.91it/s, env_step=367616, len=19, n/ep=3, n/st=64, player_1/loss=1416.737, player_2/loss=1044.718, rew=392.67]                                                                                                


Epoch #359: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #360: 1025it [00:02, 452.12it/s, env_step=368640, len=10, n/ep=6, n/st=64, player_1/loss=1348.678, player_2/loss=1267.532, rew=147.67]                                                                                                


Epoch #360: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #361: 1025it [00:02, 434.50it/s, env_step=369664, len=25, n/ep=2, n/st=64, player_1/loss=919.352, player_2/loss=1652.791, rew=657.00]                                                                                                 


Epoch #361: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #362: 1025it [00:02, 444.33it/s, env_step=370688, len=14, n/ep=4, n/st=64, player_1/loss=806.521, player_2/loss=1434.955, rew=214.00]                                                                                                 


Epoch #362: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #363: 1025it [00:02, 444.81it/s, env_step=371712, len=21, n/ep=3, n/st=64, player_1/loss=737.654, player_2/loss=1263.910, rew=500.67]                                                                                                 


Epoch #363: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #364: 1025it [00:02, 449.83it/s, env_step=372736, len=16, n/ep=3, n/st=64, player_1/loss=875.137, player_2/loss=1235.160, rew=278.67]                                                                                                 


Epoch #364: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #365: 1025it [00:02, 452.16it/s, env_step=373760, len=24, n/ep=2, n/st=64, player_1/loss=724.481, player_2/loss=1283.295, rew=623.00]                                                                                                 


Epoch #365: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #366: 1025it [00:02, 451.30it/s, env_step=374784, len=14, n/ep=5, n/st=64, player_1/loss=1077.389, player_2/loss=1354.338, rew=222.00]                                                                                                


Epoch #366: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #367: 1025it [00:02, 450.85it/s, env_step=375808, len=23, n/ep=3, n/st=64, player_1/loss=1413.219, player_2/loss=1225.299, rew=568.67]                                                                                                


Epoch #367: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #368: 1025it [00:02, 453.67it/s, env_step=376832, len=23, n/ep=3, n/st=64, player_1/loss=1131.934, player_2/loss=699.949, rew=558.00]                                                                                                 


Epoch #368: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #369: 1025it [00:02, 452.19it/s, env_step=377856, len=28, n/ep=2, n/st=64, player_1/loss=661.875, player_2/loss=888.991, rew=859.00]                                                                                                  


Epoch #369: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #370: 1025it [00:02, 448.49it/s, env_step=378880, len=23, n/ep=3, n/st=64, player_1/loss=804.393, player_2/loss=1067.022, rew=552.67]                                                                                                 


Epoch #370: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #371: 1025it [00:02, 449.61it/s, env_step=379904, len=14, n/ep=5, n/st=64, player_1/loss=1009.239, player_2/loss=1499.953, rew=215.20]                                                                                                


Epoch #371: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #372: 1025it [00:02, 452.11it/s, env_step=380928, len=27, n/ep=2, n/st=64, player_1/loss=856.672, player_2/loss=1312.573, rew=782.00]                                                                                                 


Epoch #372: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #373: 1025it [00:02, 451.86it/s, env_step=381952, len=18, n/ep=4, n/st=64, player_1/loss=1138.026, player_2/loss=1385.906, rew=390.00]                                                                                                


Epoch #373: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #374: 1025it [00:02, 453.13it/s, env_step=382976, len=20, n/ep=3, n/st=64, player_2/loss=956.964, rew=446.00]                                                                                                                         


Epoch #374: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #375: 1025it [00:02, 451.53it/s, env_step=384000, len=21, n/ep=3, n/st=64, player_1/loss=741.367, player_2/loss=929.487, rew=478.67]                                                                                                  


Epoch #375: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #376: 1025it [00:02, 450.54it/s, env_step=385024, len=19, n/ep=3, n/st=64, player_1/loss=769.904, player_2/loss=1108.519, rew=392.00]                                                                                                 


Epoch #376: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #377: 1025it [00:02, 447.39it/s, env_step=386048, len=27, n/ep=2, n/st=64, player_1/loss=762.216, player_2/loss=1142.885, rew=779.00]                                                                                                 


Epoch #377: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #378: 1025it [00:02, 453.06it/s, env_step=387072, len=30, n/ep=2, n/st=64, player_1/loss=505.662, player_2/loss=800.483, rew=959.00]                                                                                                  


Epoch #378: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #379: 1025it [00:02, 454.11it/s, env_step=388096, len=18, n/ep=3, n/st=64, player_1/loss=700.843, player_2/loss=596.172, rew=344.67]                                                                                                  


Epoch #379: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #380: 1025it [00:02, 451.36it/s, env_step=389120, len=14, n/ep=5, n/st=64, player_1/loss=866.644, player_2/loss=721.000, rew=223.20]                                                                                                  


Epoch #380: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #381: 1025it [00:02, 449.77it/s, env_step=390144, len=15, n/ep=5, n/st=64, player_1/loss=999.242, player_2/loss=1010.731, rew=253.20]                                                                                                 


Epoch #381: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #382: 1025it [00:02, 454.29it/s, env_step=391168, len=15, n/ep=4, n/st=64, player_1/loss=1335.245, player_2/loss=1040.357, rew=262.00]                                                                                                


Epoch #382: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #383: 1025it [00:02, 447.94it/s, env_step=392192, len=20, n/ep=3, n/st=64, player_1/loss=1170.121, player_2/loss=1180.505, rew=432.67]                                                                                                


Epoch #383: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #384: 1025it [00:02, 450.52it/s, env_step=393216, len=32, n/ep=2, n/st=64, player_1/loss=1021.174, player_2/loss=1595.672, rew=1058.00]                                                                                               


Epoch #384: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #385: 1025it [00:02, 450.87it/s, env_step=394240, len=26, n/ep=3, n/st=64, player_1/loss=991.238, rew=732.67]                                                                                                                         


Epoch #385: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #386: 1025it [00:02, 449.88it/s, env_step=395264, len=29, n/ep=3, n/st=64, player_1/loss=1086.567, player_2/loss=1245.545, rew=876.00]                                                                                                


Epoch #386: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #387: 1025it [00:02, 450.40it/s, env_step=396288, len=31, n/ep=2, n/st=64, player_1/loss=1158.515, player_2/loss=1454.628, rew=1022.00]                                                                                               


Epoch #387: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #388: 1025it [00:02, 450.99it/s, env_step=397312, len=22, n/ep=3, n/st=64, player_1/loss=853.363, player_2/loss=1146.021, rew=520.00]                                                                                                 


Epoch #388: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #389: 1025it [00:02, 455.60it/s, env_step=398336, len=21, n/ep=4, n/st=64, player_1/loss=832.843, player_2/loss=682.824, rew=473.50]                                                                                                  


Epoch #389: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #390: 1025it [00:02, 452.31it/s, env_step=399360, len=29, n/ep=2, n/st=64, player_1/loss=1188.142, player_2/loss=666.223, rew=900.00]                                                                                                 


Epoch #390: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #391: 1025it [00:02, 451.42it/s, env_step=400384, len=28, n/ep=3, n/st=64, player_1/loss=1273.340, player_2/loss=1023.136, rew=814.67]                                                                                                


Epoch #391: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #392: 1025it [00:02, 453.01it/s, env_step=401408, len=15, n/ep=4, n/st=64, player_1/loss=1297.301, player_2/loss=1245.668, rew=239.50]                                                                                                


Epoch #392: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #393: 1025it [00:02, 453.15it/s, env_step=402432, len=26, n/ep=2, n/st=64, player_1/loss=1091.708, player_2/loss=1381.767, rew=739.00]                                                                                                


Epoch #393: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #394: 1025it [00:02, 453.20it/s, env_step=403456, len=17, n/ep=4, n/st=64, player_1/loss=1081.456, player_2/loss=964.736, rew=353.50]                                                                                                 


Epoch #394: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #395: 1025it [00:02, 420.57it/s, env_step=404480, len=24, n/ep=3, n/st=64, player_1/loss=1349.936, player_2/loss=790.034, rew=660.67]                                                                                                 


Epoch #395: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #396: 1025it [00:02, 416.31it/s, env_step=405504, len=17, n/ep=5, n/st=64, player_1/loss=1695.856, player_2/loss=792.547, rew=386.80]                                                                                                 


Epoch #396: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #397: 1025it [00:02, 433.08it/s, env_step=406528, len=22, n/ep=3, n/st=64, player_1/loss=1652.610, player_2/loss=1233.235, rew=590.00]                                                                                                


Epoch #397: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #398: 1025it [00:02, 440.44it/s, env_step=407552, len=22, n/ep=3, n/st=64, player_1/loss=1593.000, player_2/loss=1843.744, rew=512.67]                                                                                                


Epoch #398: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #399: 1025it [00:02, 416.51it/s, env_step=408576, len=19, n/ep=3, n/st=64, player_1/loss=1311.798, player_2/loss=1987.336, rew=392.00]                                                                                                


Epoch #399: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #400: 1025it [00:02, 436.76it/s, env_step=409600, len=15, n/ep=5, n/st=64, player_1/loss=1142.327, player_2/loss=1278.848, rew=246.00]                                                                                                


Epoch #400: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #401: 1025it [00:02, 451.75it/s, env_step=410624, len=26, n/ep=3, n/st=64, player_1/loss=1060.826, player_2/loss=1055.912, rew=724.67]                                                                                                


Epoch #401: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #402: 1025it [00:02, 451.69it/s, env_step=411648, len=19, n/ep=3, n/st=64, player_1/loss=1391.713, player_2/loss=1495.597, rew=402.67]                                                                                                


Epoch #402: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #403: 1025it [00:02, 449.94it/s, env_step=412672, len=18, n/ep=4, n/st=64, player_1/loss=2283.746, player_2/loss=1487.354, rew=340.00]                                                                                                


Epoch #403: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #404: 1025it [00:02, 414.85it/s, env_step=413696, len=17, n/ep=4, n/st=64, player_1/loss=2046.761, player_2/loss=1450.617, rew=331.00]                                                                                                


Epoch #404: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #405: 1025it [00:02, 398.89it/s, env_step=414720, len=13, n/ep=4, n/st=64, player_1/loss=1350.618, player_2/loss=1191.615, rew=194.00]                                                                                                


Epoch #405: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #406: 1025it [00:02, 437.81it/s, env_step=415744, len=13, n/ep=5, n/st=64, player_1/loss=1228.940, player_2/loss=1158.067, rew=192.00]                                                                                                


Epoch #406: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #407: 1025it [00:02, 450.68it/s, env_step=416768, len=14, n/ep=5, n/st=64, player_1/loss=1069.617, player_2/loss=1221.338, rew=217.60]                                                                                                


Epoch #407: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #408: 1025it [00:02, 451.05it/s, env_step=417792, len=14, n/ep=4, n/st=64, player_2/loss=1388.885, rew=209.50]                                                                                                                        


Epoch #408: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #409: 1025it [00:02, 452.24it/s, env_step=418816, len=18, n/ep=3, n/st=64, player_1/loss=731.204, player_2/loss=1318.920, rew=360.67]                                                                                                 


Epoch #409: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #410: 1025it [00:02, 450.63it/s, env_step=419840, len=18, n/ep=3, n/st=64, player_1/loss=603.954, player_2/loss=1200.168, rew=358.67]                                                                                                 


Epoch #410: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #411: 1025it [00:02, 448.53it/s, env_step=420864, len=18, n/ep=4, n/st=64, player_1/loss=768.905, player_2/loss=1207.639, rew=373.50]                                                                                                 


Epoch #411: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #412: 1025it [00:02, 422.30it/s, env_step=421888, len=19, n/ep=3, n/st=64, player_1/loss=644.918, player_2/loss=948.140, rew=400.67]                                                                                                  


Epoch #412: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #413: 1025it [00:02, 442.48it/s, env_step=422912, len=20, n/ep=3, n/st=64, player_1/loss=793.614, player_2/loss=601.401, rew=418.67]                                                                                                  


Epoch #413: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #414: 1025it [00:02, 441.96it/s, env_step=423936, len=18, n/ep=4, n/st=64, player_1/loss=701.261, player_2/loss=666.615, rew=374.50]                                                                                                  


Epoch #414: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #415: 1025it [00:02, 440.99it/s, env_step=424960, len=28, n/ep=2, n/st=64, player_1/loss=500.950, player_2/loss=461.496, rew=841.00]                                                                                                  


Epoch #415: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #416: 1025it [00:02, 442.30it/s, env_step=425984, len=22, n/ep=3, n/st=64, player_1/loss=528.825, player_2/loss=701.495, rew=504.67]                                                                                                  


Epoch #416: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #417: 1025it [00:02, 424.11it/s, env_step=427008, len=28, n/ep=2, n/st=64, player_1/loss=620.434, rew=841.00]                                                                                                                         


Epoch #417: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #418: 1025it [00:02, 421.99it/s, env_step=428032, len=24, n/ep=2, n/st=64, player_1/loss=857.769, player_2/loss=1335.640, rew=647.00]                                                                                                 


Epoch #418: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #419: 1025it [00:02, 438.24it/s, env_step=429056, len=33, n/ep=2, n/st=64, player_1/loss=873.587, player_2/loss=1182.489, rew=1120.00]                                                                                                


Epoch #419: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #420: 1025it [00:02, 445.73it/s, env_step=430080, len=35, n/ep=2, n/st=64, player_1/loss=825.044, player_2/loss=763.266, rew=1294.00]                                                                                                 


Epoch #420: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #421: 1025it [00:02, 435.67it/s, env_step=431104, len=24, n/ep=2, n/st=64, player_1/loss=653.562, player_2/loss=676.012, rew=665.00]                                                                                                  


Epoch #421: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #422: 1025it [00:02, 435.69it/s, env_step=432128, len=31, n/ep=2, n/st=64, player_1/loss=841.507, player_2/loss=2011.532, rew=1042.00]                                                                                                


Epoch #422: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #423: 1025it [00:02, 444.69it/s, env_step=433152, len=18, n/ep=3, n/st=64, player_1/loss=1171.910, player_2/loss=2188.377, rew=382.67]                                                                                                


Epoch #423: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #424: 1025it [00:02, 425.37it/s, env_step=434176, len=9, n/ep=8, n/st=64, player_1/loss=1250.640, player_2/loss=1824.435, rew=117.50]                                                                                                 


Epoch #424: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #425: 1025it [00:02, 434.66it/s, env_step=435200, len=9, n/ep=6, n/st=64, player_1/loss=1324.890, player_2/loss=1982.141, rew=109.67]                                                                                                 


Epoch #425: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #426: 1025it [00:02, 448.40it/s, env_step=436224, len=21, n/ep=3, n/st=64, player_1/loss=1154.753, player_2/loss=1611.239, rew=490.00]                                                                                                


Epoch #426: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #427: 1025it [00:02, 450.92it/s, env_step=437248, len=19, n/ep=3, n/st=64, player_1/loss=1155.582, player_2/loss=1590.288, rew=404.67]                                                                                                


Epoch #427: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #428: 1025it [00:02, 449.43it/s, env_step=438272, len=9, n/ep=6, n/st=64, player_1/loss=1403.858, player_2/loss=1416.332, rew=106.00]                                                                                                 


Epoch #428: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #429: 1025it [00:02, 451.71it/s, env_step=439296, len=14, n/ep=4, n/st=64, player_1/loss=1128.981, player_2/loss=983.805, rew=258.50]                                                                                                 


Epoch #429: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #430: 1025it [00:02, 447.42it/s, env_step=440320, len=18, n/ep=3, n/st=64, player_1/loss=924.640, player_2/loss=1181.496, rew=368.67]                                                                                                 


Epoch #430: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #431: 1025it [00:02, 450.16it/s, env_step=441344, len=36, n/ep=2, n/st=64, player_1/loss=943.188, player_2/loss=1828.459, rew=1334.00]                                                                                                


Epoch #431: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #432: 1025it [00:02, 451.59it/s, env_step=442368, len=26, n/ep=2, n/st=64, player_1/loss=1212.498, player_2/loss=2352.748, rew=739.00]                                                                                                


Epoch #432: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #433: 1025it [00:02, 451.92it/s, env_step=443392, len=28, n/ep=2, n/st=64, player_1/loss=1219.884, player_2/loss=2365.142, rew=859.00]                                                                                                


Epoch #433: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #434: 1025it [00:02, 447.91it/s, env_step=444416, len=22, n/ep=3, n/st=64, player_1/loss=1352.487, player_2/loss=1671.408, rew=599.33]                                                                                                


Epoch #434: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #435: 1025it [00:02, 448.48it/s, env_step=445440, len=14, n/ep=5, n/st=64, player_1/loss=1246.022, player_2/loss=1058.368, rew=226.40]                                                                                                


Epoch #435: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #436: 1025it [00:02, 452.53it/s, env_step=446464, len=19, n/ep=3, n/st=64, player_1/loss=1231.304, player_2/loss=1182.735, rew=391.33]                                                                                                


Epoch #436: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #437: 1025it [00:02, 446.06it/s, env_step=447488, len=14, n/ep=4, n/st=64, player_1/loss=1049.212, player_2/loss=1817.703, rew=223.00]                                                                                                


Epoch #437: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #438: 1025it [00:02, 381.43it/s, env_step=448512, len=18, n/ep=4, n/st=64, player_1/loss=1202.839, player_2/loss=1483.778, rew=390.00]                                                                                                


Epoch #438: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #439: 1025it [00:02, 351.08it/s, env_step=449536, len=21, n/ep=3, n/st=64, player_1/loss=1459.355, player_2/loss=1320.045, rew=511.33]                                                                                                


Epoch #439: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #440: 1025it [00:02, 458.75it/s, env_step=450560, len=23, n/ep=3, n/st=64, player_1/loss=1012.249, player_2/loss=1135.103, rew=639.33]                                                                                                


Epoch #440: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #441: 1025it [00:03, 336.07it/s, env_step=451584, len=22, n/ep=2, n/st=64, player_1/loss=985.202, player_2/loss=1134.158, rew=547.00]                                                                                                 


Epoch #441: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #442: 1025it [00:02, 364.58it/s, env_step=452608, len=25, n/ep=3, n/st=64, player_1/loss=1025.672, player_2/loss=1048.753, rew=726.67]                                                                                                


Epoch #442: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #443: 1025it [00:02, 439.47it/s, env_step=453632, len=22, n/ep=3, n/st=64, player_1/loss=1111.777, player_2/loss=1412.816, rew=563.33]                                                                                                


Epoch #443: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #444: 1025it [00:02, 460.80it/s, env_step=454656, len=36, n/ep=2, n/st=64, player_1/loss=1133.079, player_2/loss=1329.403, rew=1369.00]                                                                                               


Epoch #444: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #445: 1025it [00:02, 465.69it/s, env_step=455680, len=27, n/ep=2, n/st=64, player_1/loss=758.470, player_2/loss=648.319, rew=779.00]                                                                                                  


Epoch #445: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #446: 1025it [00:02, 419.89it/s, env_step=456704, len=25, n/ep=2, n/st=64, player_1/loss=571.460, player_2/loss=509.842, rew=686.00]                                                                                                  


Epoch #446: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #447: 1025it [00:02, 367.79it/s, env_step=457728, len=30, n/ep=2, n/st=64, player_1/loss=1383.915, player_2/loss=1719.248, rew=961.00]                                                                                                


Epoch #447: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #448: 1025it [00:02, 365.80it/s, env_step=458752, len=38, n/ep=1, n/st=64, player_1/loss=1588.278, player_2/loss=2499.396, rew=1480.00]                                                                                               


Epoch #448: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #449: 1025it [00:02, 466.60it/s, env_step=459776, len=19, n/ep=4, n/st=64, player_1/loss=1887.249, player_2/loss=1832.651, rew=427.50]                                                                                                


Epoch #449: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #450: 1025it [00:02, 463.31it/s, env_step=460800, len=27, n/ep=2, n/st=64, player_1/loss=2151.974, player_2/loss=1566.473, rew=758.00]                                                                                                


Epoch #450: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #451: 1025it [00:02, 446.81it/s, env_step=461824, len=35, n/ep=2, n/st=64, player_1/loss=1439.323, player_2/loss=1185.902, rew=1258.00]                                                                                               


Epoch #451: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #452: 1025it [00:02, 401.52it/s, env_step=462848, len=34, n/ep=2, n/st=64, player_1/loss=1368.945, player_2/loss=1473.568, rew=1189.00]                                                                                               


Epoch #452: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #453: 1025it [00:02, 446.07it/s, env_step=463872, len=21, n/ep=3, n/st=64, player_1/loss=1354.130, player_2/loss=1417.321, rew=464.67]                                                                                                


Epoch #453: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #454: 1025it [00:02, 407.28it/s, env_step=464896, len=32, n/ep=2, n/st=64, player_1/loss=960.446, player_2/loss=1163.872, rew=1058.00]                                                                                                


Epoch #454: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #455: 1025it [00:02, 437.68it/s, env_step=465920, len=13, n/ep=4, n/st=64, player_1/loss=1075.454, player_2/loss=1347.961, rew=202.00]                                                                                                


Epoch #455: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #456: 1025it [00:02, 465.98it/s, env_step=466944, len=20, n/ep=4, n/st=64, player_1/loss=1851.077, player_2/loss=1597.901, rew=424.00]                                                                                                


Epoch #456: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #457: 1025it [00:02, 411.00it/s, env_step=467968, len=22, n/ep=3, n/st=64, player_1/loss=2061.518, player_2/loss=2243.709, rew=535.33]                                                                                                


Epoch #457: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #458: 1025it [00:02, 412.81it/s, env_step=468992, len=33, n/ep=2, n/st=64, player_1/loss=1175.002, player_2/loss=1945.796, rew=1166.00]                                                                                               


Epoch #458: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #459: 1025it [00:02, 465.08it/s, env_step=470016, len=33, n/ep=2, n/st=64, player_1/loss=684.589, player_2/loss=1107.169, rew=1124.00]                                                                                                


Epoch #459: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #460: 1025it [00:02, 466.50it/s, env_step=471040, len=21, n/ep=3, n/st=64, player_1/loss=519.276, player_2/loss=1174.434, rew=484.00]                                                                                                 


Epoch #460: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #461: 1025it [00:02, 463.22it/s, env_step=472064, len=36, n/ep=2, n/st=64, player_1/loss=824.100, player_2/loss=1595.180, rew=1367.00]                                                                                                


Epoch #461: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #462: 1025it [00:02, 465.81it/s, env_step=473088, len=28, n/ep=2, n/st=64, player_1/loss=824.716, player_2/loss=1452.788, rew=835.00]                                                                                                 


Epoch #462: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #463: 1025it [00:02, 460.38it/s, env_step=474112, len=33, n/ep=2, n/st=64, player_1/loss=834.703, player_2/loss=1220.884, rew=1154.00]                                                                                                


Epoch #463: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #464: 1025it [00:02, 462.35it/s, env_step=475136, len=21, n/ep=3, n/st=64, player_1/loss=999.421, player_2/loss=996.282, rew=486.00]                                                                                                  


Epoch #464: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #465: 1025it [00:02, 465.89it/s, env_step=476160, len=20, n/ep=4, n/st=64, player_1/loss=1540.851, player_2/loss=1251.738, rew=475.50]                                                                                                


Epoch #465: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #466: 1025it [00:02, 463.85it/s, env_step=477184, len=14, n/ep=4, n/st=64, player_1/loss=1634.865, player_2/loss=1397.430, rew=232.50]                                                                                                


Epoch #466: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #467: 1025it [00:02, 452.31it/s, env_step=478208, len=28, n/ep=2, n/st=64, player_1/loss=1725.532, player_2/loss=1427.916, rew=845.00]                                                                                                


Epoch #467: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #468: 1025it [00:02, 456.77it/s, env_step=479232, len=25, n/ep=3, n/st=64, player_1/loss=1179.116, player_2/loss=1831.990, rew=676.00]                                                                                                


Epoch #468: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #469: 1025it [00:02, 467.34it/s, env_step=480256, len=23, n/ep=3, n/st=64, player_1/loss=742.042, player_2/loss=1592.347, rew=588.67]                                                                                                 


Epoch #469: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #470: 1025it [00:02, 466.87it/s, env_step=481280, len=32, n/ep=2, n/st=64, player_1/loss=856.485, player_2/loss=1050.473, rew=1089.00]                                                                                                


Epoch #470: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #471: 1025it [00:02, 464.01it/s, env_step=482304, len=23, n/ep=3, n/st=64, player_1/loss=1347.520, player_2/loss=1020.570, rew=753.33]                                                                                                


Epoch #471: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #472: 1025it [00:02, 463.49it/s, env_step=483328, len=21, n/ep=3, n/st=64, player_1/loss=1786.928, player_2/loss=1964.077, rew=466.00]                                                                                                


Epoch #472: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #473: 1025it [00:02, 464.77it/s, env_step=484352, len=42, n/ep=1, n/st=64, player_1/loss=1707.891, player_2/loss=2330.277, rew=1834.00]                                                                                               


Epoch #473: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #474: 1025it [00:02, 462.86it/s, env_step=485376, len=22, n/ep=3, n/st=64, player_1/loss=1295.251, player_2/loss=1906.247, rew=576.67]                                                                                                


Epoch #474: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #475: 1025it [00:02, 461.77it/s, env_step=486400, len=26, n/ep=2, n/st=64, player_1/loss=1073.715, player_2/loss=1485.492, rew=700.00]                                                                                                


Epoch #475: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #476: 1025it [00:02, 458.72it/s, env_step=487424, len=15, n/ep=4, n/st=64, player_1/loss=986.660, player_2/loss=1511.914, rew=248.50]                                                                                                 


Epoch #476: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #477: 1025it [00:02, 441.87it/s, env_step=488448, len=20, n/ep=3, n/st=64, player_1/loss=1284.841, player_2/loss=1983.257, rew=418.67]                                                                                                


Epoch #477: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #478: 1025it [00:02, 409.55it/s, env_step=489472, len=19, n/ep=3, n/st=64, player_1/loss=1109.344, player_2/loss=2422.326, rew=406.00]                                                                                                


Epoch #478: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #479: 1025it [00:02, 434.03it/s, env_step=490496, len=31, n/ep=2, n/st=64, player_1/loss=775.722, player_2/loss=1960.169, rew=1064.00]                                                                                                


Epoch #479: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #480: 1025it [00:02, 466.13it/s, env_step=491520, len=34, n/ep=2, n/st=64, player_1/loss=828.099, player_2/loss=1537.226, rew=1189.00]                                                                                                


Epoch #480: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #481: 1025it [00:02, 466.88it/s, env_step=492544, len=25, n/ep=3, n/st=64, player_1/loss=838.659, player_2/loss=1386.110, rew=720.00]                                                                                                 


Epoch #481: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #482: 1025it [00:02, 469.01it/s, env_step=493568, len=28, n/ep=3, n/st=64, player_1/loss=502.281, player_2/loss=1194.902, rew=835.33]                                                                                                 


Epoch #482: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #483: 1025it [00:02, 467.53it/s, env_step=494592, len=27, n/ep=3, n/st=64, player_1/loss=495.812, player_2/loss=1503.899, rew=875.33]                                                                                                 


Epoch #483: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #484: 1025it [00:02, 468.98it/s, env_step=495616, len=20, n/ep=2, n/st=64, player_1/loss=1131.617, player_2/loss=1950.460, rew=495.00]                                                                                                


Epoch #484: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #485: 1025it [00:02, 463.97it/s, env_step=496640, len=15, n/ep=4, n/st=64, player_1/loss=1233.479, player_2/loss=1763.997, rew=256.50]                                                                                                


Epoch #485: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #486: 1025it [00:02, 464.44it/s, env_step=497664, len=26, n/ep=3, n/st=64, player_1/loss=1171.640, player_2/loss=1379.642, rew=750.00]                                                                                                


Epoch #486: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #487: 1025it [00:02, 436.39it/s, env_step=498688, len=28, n/ep=3, n/st=64, player_1/loss=1263.227, rew=895.33]                                                                                                                        


Epoch #487: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #488: 1025it [00:02, 429.07it/s, env_step=499712, len=22, n/ep=3, n/st=64, player_1/loss=1331.370, rew=536.00]                                                                                                                        


Epoch #488: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #489: 1025it [00:02, 370.46it/s, env_step=500736, len=24, n/ep=3, n/st=64, player_1/loss=935.367, player_2/loss=738.866, rew=639.33]                                                                                                  


Epoch #489: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #490: 1025it [00:02, 393.37it/s, env_step=501760, len=23, n/ep=3, n/st=64, player_1/loss=751.707, player_2/loss=858.686, rew=556.00]                                                                                                  


Epoch #490: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #491: 1025it [00:02, 444.11it/s, env_step=502784, len=34, n/ep=2, n/st=64, player_2/loss=1385.400, rew=1223.00]                                                                                                                       


Epoch #491: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #492: 1025it [00:02, 456.74it/s, env_step=503808, len=25, n/ep=2, n/st=64, player_1/loss=1318.238, player_2/loss=1682.279, rew=684.00]                                                                                                


Epoch #492: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #493: 1025it [00:02, 464.13it/s, env_step=504832, len=29, n/ep=2, n/st=64, player_1/loss=1368.183, rew=900.00]                                                                                                                        


Epoch #493: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #494: 1025it [00:02, 464.75it/s, env_step=505856, len=33, n/ep=1, n/st=64, player_1/loss=882.827, player_2/loss=1306.896, rew=1120.00]                                                                                                


Epoch #494: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #495: 1025it [00:02, 464.10it/s, env_step=506880, len=36, n/ep=2, n/st=64, player_1/loss=922.587, player_2/loss=1325.674, rew=1381.00]                                                                                                


Epoch #495: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #496: 1025it [00:02, 345.11it/s, env_step=507904, len=8, n/ep=8, n/st=64, player_1/loss=1545.896, player_2/loss=1195.564, rew=74.00]                                                                                                  


Epoch #496: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #497: 1025it [00:02, 458.30it/s, env_step=508928, len=26, n/ep=3, n/st=64, player_1/loss=1256.070, player_2/loss=1068.291, rew=718.00]                                                                                                


Epoch #497: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #498: 1025it [00:02, 458.99it/s, env_step=509952, len=25, n/ep=2, n/st=64, player_1/loss=865.469, player_2/loss=1633.508, rew=676.00]                                                                                                 


Epoch #498: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #499: 1025it [00:02, 466.90it/s, env_step=510976, len=25, n/ep=3, n/st=64, player_1/loss=921.166, player_2/loss=1213.736, rew=692.00]                                                                                                 


Epoch #499: test_reward: 130.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #500: 1025it [00:02, 437.94it/s, env_step=512000, len=13, n/ep=4, n/st=64, player_1/loss=1140.302, player_2/loss=1100.251, rew=252.00]                                                                                                


Epoch #500: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #501: 1025it [00:02, 464.48it/s, env_step=513024, len=17, n/ep=4, n/st=64, player_1/loss=1114.305, player_2/loss=1418.284, rew=406.00]                                                                                                


Epoch #501: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #502: 1025it [00:02, 467.40it/s, env_step=514048, len=8, n/ep=8, n/st=64, player_1/loss=842.028, player_2/loss=1195.256, rew=81.50]                                                                                                   


Epoch #502: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #503: 1025it [00:02, 378.91it/s, env_step=515072, len=19, n/ep=3, n/st=64, player_1/loss=791.528, player_2/loss=1170.272, rew=420.67]                                                                                                 


Epoch #503: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #504: 1025it [00:02, 422.58it/s, env_step=516096, len=17, n/ep=3, n/st=64, player_1/loss=835.555, player_2/loss=1423.272, rew=348.00]                                                                                                 


Epoch #504: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #505: 1025it [00:02, 433.79it/s, env_step=517120, len=9, n/ep=7, n/st=64, player_1/loss=866.789, player_2/loss=1522.061, rew=110.57]                                                                                                  


Epoch #505: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #506: 1025it [00:02, 412.09it/s, env_step=518144, len=14, n/ep=4, n/st=64, player_1/loss=1060.396, player_2/loss=1685.806, rew=219.00]                                                                                                


Epoch #506: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #507: 1025it [00:02, 408.63it/s, env_step=519168, len=15, n/ep=4, n/st=64, player_1/loss=1197.551, player_2/loss=1486.904, rew=251.00]                                                                                                


Epoch #507: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #508: 1025it [00:02, 404.61it/s, env_step=520192, len=30, n/ep=2, n/st=64, player_1/loss=1152.629, player_2/loss=1341.035, rew=929.00]                                                                                                


Epoch #508: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #509: 1025it [00:02, 412.30it/s, env_step=521216, len=21, n/ep=3, n/st=64, player_1/loss=917.953, player_2/loss=1630.828, rew=478.67]                                                                                                 


Epoch #509: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #510: 1025it [00:02, 432.83it/s, env_step=522240, len=13, n/ep=4, n/st=64, player_1/loss=969.216, player_2/loss=1312.242, rew=201.50]                                                                                                 


Epoch #510: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #511: 1025it [00:02, 435.73it/s, env_step=523264, len=20, n/ep=3, n/st=64, player_1/loss=1123.578, player_2/loss=1144.197, rew=432.67]                                                                                                


Epoch #511: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #512: 1025it [00:02, 435.24it/s, env_step=524288, len=27, n/ep=2, n/st=64, player_1/loss=1041.540, player_2/loss=1107.702, rew=758.00]                                                                                                


Epoch #512: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #513: 1025it [00:02, 435.37it/s, env_step=525312, len=13, n/ep=4, n/st=64, player_1/loss=1073.600, player_2/loss=867.008, rew=237.00]                                                                                                 


Epoch #513: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #514: 1025it [00:02, 432.59it/s, env_step=526336, len=26, n/ep=3, n/st=64, player_1/loss=1118.431, player_2/loss=963.957, rew=892.00]                                                                                                 


Epoch #514: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #515: 1025it [00:02, 436.58it/s, env_step=527360, len=14, n/ep=5, n/st=64, player_1/loss=1087.091, player_2/loss=1304.626, rew=245.60]                                                                                                


Epoch #515: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #516: 1025it [00:02, 434.16it/s, env_step=528384, len=27, n/ep=2, n/st=64, player_1/loss=1161.924, player_2/loss=1406.339, rew=914.00]                                                                                                


Epoch #516: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #517: 1025it [00:02, 436.54it/s, env_step=529408, len=15, n/ep=5, n/st=64, player_1/loss=1217.341, rew=246.00]                                                                                                                        


Epoch #517: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #518: 1025it [00:02, 437.40it/s, env_step=530432, len=24, n/ep=2, n/st=64, player_1/loss=1174.685, player_2/loss=1344.392, rew=647.00]                                                                                                


Epoch #518: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #519: 1025it [00:02, 420.54it/s, env_step=531456, len=18, n/ep=3, n/st=64, player_1/loss=1245.006, player_2/loss=1692.165, rew=416.00]                                                                                                


Epoch #519: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #520: 1025it [00:02, 419.20it/s, env_step=532480, len=24, n/ep=2, n/st=64, player_1/loss=1117.440, player_2/loss=1232.098, rew=679.00]                                                                                                


Epoch #520: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #521: 1025it [00:02, 457.30it/s, env_step=533504, len=23, n/ep=3, n/st=64, player_1/loss=1089.462, player_2/loss=1061.740, rew=620.67]                                                                                                


Epoch #521: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #522: 1025it [00:02, 345.66it/s, env_step=534528, len=33, n/ep=2, n/st=64, player_1/loss=976.964, player_2/loss=1040.897, rew=1121.00]                                                                                                


Epoch #522: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #523: 1025it [00:02, 437.92it/s, env_step=535552, len=26, n/ep=2, n/st=64, player_1/loss=1633.232, player_2/loss=1533.502, rew=799.00]                                                                                                


Epoch #523: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #524: 1025it [00:02, 395.42it/s, env_step=536576, len=34, n/ep=2, n/st=64, player_1/loss=1972.443, player_2/loss=1486.796, rew=1188.00]                                                                                               


Epoch #524: test_reward: 130.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #525: 1025it [00:02, 419.42it/s, env_step=537600, len=24, n/ep=2, n/st=64, player_1/loss=1793.758, player_2/loss=1216.307, rew=623.00]                                                                                                


Epoch #525: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #526: 1025it [00:02, 436.86it/s, env_step=538624, len=34, n/ep=2, n/st=64, player_1/loss=1610.586, player_2/loss=1547.550, rew=1225.00]                                                                                               


Epoch #526: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #527: 1025it [00:02, 442.87it/s, env_step=539648, len=13, n/ep=4, n/st=64, player_1/loss=1880.028, player_2/loss=1680.138, rew=203.50]                                                                                                


Epoch #527: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #528: 1025it [00:02, 405.18it/s, env_step=540672, len=30, n/ep=2, n/st=64, player_1/loss=1400.812, player_2/loss=1658.610, rew=965.00]                                                                                                


Epoch #528: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #529: 1025it [00:02, 407.25it/s, env_step=541696, len=27, n/ep=3, n/st=64, player_1/loss=864.881, player_2/loss=1764.118, rew=875.33]                                                                                                 


Epoch #529: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #530: 1025it [00:02, 419.45it/s, env_step=542720, len=35, n/ep=1, n/st=64, player_1/loss=2063.877, player_2/loss=1562.073, rew=1258.00]                                                                                               


Epoch #530: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #531: 1025it [00:02, 420.90it/s, env_step=543744, len=16, n/ep=4, n/st=64, player_1/loss=1892.541, player_2/loss=1680.795, rew=371.00]                                                                                                


Epoch #531: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #532: 1025it [00:02, 448.34it/s, env_step=544768, len=9, n/ep=6, n/st=64, player_1/loss=1111.166, player_2/loss=1837.940, rew=123.00]                                                                                                 


Epoch #532: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #533: 1025it [00:02, 447.91it/s, env_step=545792, len=8, n/ep=8, n/st=64, player_1/loss=847.435, player_2/loss=1865.215, rew=77.50]                                                                                                   


Epoch #533: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #534: 1025it [00:02, 442.08it/s, env_step=546816, len=20, n/ep=2, n/st=64, player_1/loss=929.478, player_2/loss=2003.389, rew=418.00]                                                                                                 


Epoch #534: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #535: 1025it [00:02, 426.62it/s, env_step=547840, len=15, n/ep=4, n/st=64, player_1/loss=1332.595, player_2/loss=1690.554, rew=354.00]                                                                                                


Epoch #535: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #536: 1025it [00:02, 428.95it/s, env_step=548864, len=16, n/ep=4, n/st=64, player_1/loss=1433.094, player_2/loss=1582.963, rew=297.50]                                                                                                


Epoch #536: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #537: 1025it [00:02, 420.79it/s, env_step=549888, len=31, n/ep=2, n/st=64, player_1/loss=1270.288, player_2/loss=1743.850, rew=1028.00]                                                                                               


Epoch #537: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #538: 1025it [00:02, 442.06it/s, env_step=550912, len=16, n/ep=3, n/st=64, player_1/loss=1168.609, player_2/loss=1742.074, rew=316.67]                                                                                                


Epoch #538: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #539: 1025it [00:02, 448.11it/s, env_step=551936, len=9, n/ep=5, n/st=64, player_1/loss=1464.352, player_2/loss=1522.622, rew=122.00]                                                                                                 


Epoch #539: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #540: 1025it [00:02, 435.52it/s, env_step=552960, len=10, n/ep=7, n/st=64, player_1/loss=1845.130, player_2/loss=1644.038, rew=116.29]                                                                                                


Epoch #540: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #541: 1025it [00:02, 440.27it/s, env_step=553984, len=24, n/ep=2, n/st=64, player_1/loss=1732.281, player_2/loss=1674.776, rew=698.00]                                                                                                


Epoch #541: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #542: 1025it [00:02, 408.58it/s, env_step=555008, len=10, n/ep=7, n/st=64, player_1/loss=1381.259, player_2/loss=1621.177, rew=121.43]                                                                                                


Epoch #542: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #543: 1025it [00:02, 437.99it/s, env_step=556032, len=11, n/ep=5, n/st=64, player_1/loss=1241.707, player_2/loss=2140.446, rew=152.80]                                                                                                


Epoch #543: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #544: 1025it [00:02, 433.38it/s, env_step=557056, len=27, n/ep=2, n/st=64, player_1/loss=1281.147, player_2/loss=2022.929, rew=838.00]                                                                                                


Epoch #544: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #545: 1025it [00:02, 433.48it/s, env_step=558080, len=19, n/ep=4, n/st=64, player_1/loss=1121.765, player_2/loss=1361.301, rew=496.50]                                                                                                


Epoch #545: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #546: 1025it [00:02, 424.74it/s, env_step=559104, len=19, n/ep=3, n/st=64, player_1/loss=1556.545, player_2/loss=1370.773, rew=386.00]                                                                                                


Epoch #546: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #547: 1025it [00:02, 430.87it/s, env_step=560128, len=28, n/ep=2, n/st=64, player_1/loss=1322.386, player_2/loss=1007.885, rew=846.00]                                                                                                


Epoch #547: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #548: 1025it [00:02, 466.70it/s, env_step=561152, len=15, n/ep=4, n/st=64, player_1/loss=1053.493, player_2/loss=1356.835, rew=238.00]                                                                                                


Epoch #548: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #549: 1025it [00:02, 464.65it/s, env_step=562176, len=18, n/ep=3, n/st=64, player_1/loss=1513.246, player_2/loss=1926.653, rew=348.67]                                                                                                


Epoch #549: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #550: 1025it [00:02, 465.63it/s, env_step=563200, len=26, n/ep=2, n/st=64, player_1/loss=1483.224, player_2/loss=1638.824, rew=727.00]                                                                                                


Epoch #550: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #551: 1025it [00:02, 466.27it/s, env_step=564224, len=20, n/ep=4, n/st=64, player_1/loss=1237.542, player_2/loss=1568.352, rew=458.50]                                                                                                


Epoch #551: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #552: 1025it [00:02, 464.46it/s, env_step=565248, len=14, n/ep=5, n/st=64, player_1/loss=1105.761, player_2/loss=1829.350, rew=214.00]                                                                                                


Epoch #552: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #553: 1025it [00:02, 467.12it/s, env_step=566272, len=13, n/ep=4, n/st=64, player_1/loss=1433.349, player_2/loss=1788.523, rew=201.50]                                                                                                


Epoch #553: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #554: 1025it [00:02, 464.15it/s, env_step=567296, len=22, n/ep=3, n/st=64, player_1/loss=1381.301, player_2/loss=1241.646, rew=512.67]                                                                                                


Epoch #554: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #555: 1025it [00:02, 465.34it/s, env_step=568320, len=20, n/ep=2, n/st=64, player_1/loss=1481.338, player_2/loss=1367.966, rew=571.00]                                                                                                


Epoch #555: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #556: 1025it [00:02, 464.19it/s, env_step=569344, len=27, n/ep=2, n/st=64, player_1/loss=1002.202, player_2/loss=1321.312, rew=824.00]                                                                                                


Epoch #556: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #557: 1025it [00:02, 466.21it/s, env_step=570368, len=20, n/ep=4, n/st=64, player_1/loss=1120.662, player_2/loss=936.036, rew=485.50]                                                                                                 


Epoch #557: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #558: 1025it [00:02, 467.37it/s, env_step=571392, len=19, n/ep=3, n/st=64, player_1/loss=1263.691, player_2/loss=1020.135, rew=404.67]                                                                                                


Epoch #558: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #559: 1025it [00:02, 464.43it/s, env_step=572416, len=28, n/ep=3, n/st=64, player_1/loss=1737.089, player_2/loss=1339.274, rew=812.67]                                                                                                


Epoch #559: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #560: 1025it [00:02, 466.21it/s, env_step=573440, len=32, n/ep=2, n/st=64, player_1/loss=2001.877, player_2/loss=1866.783, rew=1063.00]                                                                                               


Epoch #560: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #561: 1025it [00:02, 465.33it/s, env_step=574464, len=25, n/ep=3, n/st=64, player_1/loss=1760.858, player_2/loss=1822.805, rew=694.67]                                                                                                


Epoch #561: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #562: 1025it [00:02, 468.53it/s, env_step=575488, len=36, n/ep=1, n/st=64, player_1/loss=1686.475, player_2/loss=1126.965, rew=1330.00]                                                                                               


Epoch #562: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #563: 1025it [00:02, 464.34it/s, env_step=576512, len=37, n/ep=2, n/st=64, player_1/loss=1179.424, player_2/loss=641.659, rew=1454.00]                                                                                                


Epoch #563: test_reward: 1558.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #564: 1025it [00:02, 466.26it/s, env_step=577536, len=27, n/ep=2, n/st=64, player_1/loss=1211.627, player_2/loss=1357.124, rew=754.00]                                                                                                


Epoch #564: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #565: 1025it [00:02, 465.29it/s, env_step=578560, len=19, n/ep=3, n/st=64, player_1/loss=2089.021, player_2/loss=1876.311, rew=442.67]                                                                                                


Epoch #565: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #566: 1025it [00:02, 467.67it/s, env_step=579584, len=24, n/ep=2, n/st=64, player_1/loss=1848.229, player_2/loss=1337.086, rew=713.00]                                                                                                


Epoch #566: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #567: 1025it [00:02, 467.66it/s, env_step=580608, len=32, n/ep=2, n/st=64, player_1/loss=1333.968, player_2/loss=1644.543, rew=1089.00]                                                                                               


Epoch #567: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #568: 1025it [00:02, 467.56it/s, env_step=581632, len=27, n/ep=2, n/st=64, player_1/loss=1270.396, player_2/loss=1875.067, rew=803.00]                                                                                                


Epoch #568: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #569: 1025it [00:02, 465.61it/s, env_step=582656, len=27, n/ep=2, n/st=64, player_1/loss=1739.589, player_2/loss=1488.388, rew=755.00]                                                                                                


Epoch #569: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #570: 1025it [00:02, 464.23it/s, env_step=583680, len=32, n/ep=2, n/st=64, player_1/loss=1523.656, player_2/loss=1514.702, rew=1090.00]                                                                                               


Epoch #570: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #571: 1025it [00:02, 468.39it/s, env_step=584704, len=13, n/ep=4, n/st=64, player_1/loss=928.685, player_2/loss=1239.743, rew=202.50]                                                                                                 


Epoch #571: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #572: 1025it [00:02, 467.58it/s, env_step=585728, len=20, n/ep=3, n/st=64, player_1/loss=1359.081, player_2/loss=878.873, rew=436.67]                                                                                                 


Epoch #572: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #573: 1025it [00:02, 467.81it/s, env_step=586752, len=22, n/ep=3, n/st=64, player_1/loss=1597.982, player_2/loss=1432.252, rew=512.00]                                                                                                


Epoch #573: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #574: 1025it [00:02, 466.55it/s, env_step=587776, len=22, n/ep=3, n/st=64, player_1/loss=1207.805, player_2/loss=2025.238, rew=504.67]                                                                                                


Epoch #574: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #575: 1025it [00:02, 463.42it/s, env_step=588800, len=20, n/ep=3, n/st=64, player_1/loss=1602.568, player_2/loss=1929.688, rew=422.67]                                                                                                


Epoch #575: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #576: 1025it [00:02, 466.55it/s, env_step=589824, len=19, n/ep=3, n/st=64, player_1/loss=1448.162, player_2/loss=2037.375, rew=404.67]                                                                                                


Epoch #576: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #577: 1025it [00:02, 466.48it/s, env_step=590848, len=24, n/ep=3, n/st=64, player_1/loss=1039.390, player_2/loss=2352.492, rew=667.33]                                                                                                


Epoch #577: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #578: 1025it [00:02, 464.87it/s, env_step=591872, len=23, n/ep=3, n/st=64, player_1/loss=1203.264, player_2/loss=2446.700, rew=575.33]                                                                                                


Epoch #578: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #579: 1025it [00:02, 468.98it/s, env_step=592896, len=18, n/ep=3, n/st=64, player_1/loss=1538.753, player_2/loss=2706.485, rew=382.67]                                                                                                


Epoch #579: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #580: 1025it [00:02, 465.75it/s, env_step=593920, len=30, n/ep=2, n/st=64, player_1/loss=1459.994, player_2/loss=1936.652, rew=937.00]                                                                                                


Epoch #580: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #581: 1025it [00:02, 467.63it/s, env_step=594944, len=33, n/ep=2, n/st=64, player_1/loss=1122.342, player_2/loss=1403.425, rew=1120.00]                                                                                               


Epoch #581: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #582: 1025it [00:02, 461.24it/s, env_step=595968, len=28, n/ep=2, n/st=64, player_1/loss=1462.291, player_2/loss=1266.644, rew=841.00]                                                                                                


Epoch #582: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #583: 1025it [00:02, 405.76it/s, env_step=596992, len=26, n/ep=3, n/st=64, player_1/loss=1622.827, player_2/loss=977.608, rew=775.33]                                                                                                 


Epoch #583: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #584: 1025it [00:02, 407.70it/s, env_step=598016, len=28, n/ep=2, n/st=64, player_1/loss=1576.250, player_2/loss=1719.613, rew=814.00]                                                                                                


Epoch #584: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #585: 1025it [00:02, 453.03it/s, env_step=599040, len=26, n/ep=3, n/st=64, player_2/loss=1614.227, rew=716.67]                                                                                                                        


Epoch #585: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #586: 1025it [00:02, 434.74it/s, env_step=600064, len=28, n/ep=2, n/st=64, player_1/loss=2109.012, player_2/loss=1091.610, rew=819.00]                                                                                                


Epoch #586: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #587: 1025it [00:02, 427.50it/s, env_step=601088, len=27, n/ep=2, n/st=64, player_1/loss=1943.865, player_2/loss=1644.973, rew=758.00]                                                                                                


Epoch #587: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #588: 1025it [00:02, 442.57it/s, env_step=602112, len=26, n/ep=2, n/st=64, player_1/loss=1141.151, player_2/loss=1632.116, rew=701.00]                                                                                                


Epoch #588: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #589: 1025it [00:02, 450.14it/s, env_step=603136, len=33, n/ep=2, n/st=64, player_1/loss=1343.813, player_2/loss=1337.185, rew=1154.00]                                                                                               


Epoch #589: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #590: 1025it [00:02, 449.37it/s, env_step=604160, len=26, n/ep=3, n/st=64, player_1/loss=1162.137, player_2/loss=959.576, rew=750.67]                                                                                                 


Epoch #590: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #591: 1025it [00:02, 451.14it/s, env_step=605184, len=35, n/ep=1, n/st=64, player_1/loss=727.656, player_2/loss=994.001, rew=1258.00]                                                                                                 


Epoch #591: test_reward: 130.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #592: 1025it [00:02, 452.15it/s, env_step=606208, len=32, n/ep=2, n/st=64, player_1/loss=1076.061, player_2/loss=1848.131, rew=1055.00]                                                                                               


Epoch #592: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #593: 1025it [00:02, 452.99it/s, env_step=607232, len=22, n/ep=3, n/st=64, player_1/loss=962.681, player_2/loss=1828.637, rew=530.67]                                                                                                 


Epoch #593: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #594: 1025it [00:02, 448.46it/s, env_step=608256, len=28, n/ep=2, n/st=64, player_1/loss=897.881, player_2/loss=1728.980, rew=851.00]                                                                                                 


Epoch #594: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #595: 1025it [00:02, 437.40it/s, env_step=609280, len=15, n/ep=4, n/st=64, player_1/loss=1274.731, player_2/loss=1690.100, rew=373.50]                                                                                                


Epoch #595: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #596: 1025it [00:02, 447.51it/s, env_step=610304, len=34, n/ep=2, n/st=64, player_1/loss=1049.485, player_2/loss=1250.692, rew=1192.00]                                                                                               


Epoch #596: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #597: 1025it [00:02, 451.49it/s, env_step=611328, len=35, n/ep=2, n/st=64, player_1/loss=1209.195, player_2/loss=1106.713, rew=1274.00]                                                                                               


Epoch #597: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #598: 1025it [00:02, 451.51it/s, env_step=612352, len=34, n/ep=2, n/st=64, player_1/loss=1277.272, player_2/loss=1085.296, rew=1223.00]                                                                                               


Epoch #598: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #599: 1025it [00:02, 447.12it/s, env_step=613376, len=20, n/ep=3, n/st=64, player_1/loss=1624.255, player_2/loss=1349.867, rew=433.33]                                                                                                


Epoch #599: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #600: 1025it [00:02, 353.86it/s, env_step=614400, len=15, n/ep=4, n/st=64, player_1/loss=1226.714, player_2/loss=1601.826, rew=253.50]                                                                                                


Epoch #600: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #601: 1025it [00:03, 257.23it/s, env_step=615424, len=33, n/ep=2, n/st=64, player_1/loss=1010.548, player_2/loss=1615.751, rew=1156.00]                                                                                               


Epoch #601: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #602: 1025it [00:02, 399.28it/s, env_step=616448, len=33, n/ep=2, n/st=64, player_1/loss=1354.672, player_2/loss=1807.252, rew=1196.00]                                                                                               


Epoch #602: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #603: 1025it [00:02, 414.33it/s, env_step=617472, len=32, n/ep=2, n/st=64, player_1/loss=1295.894, player_2/loss=1832.943, rew=1087.00]                                                                                               


Epoch #603: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #604: 1025it [00:02, 451.02it/s, env_step=618496, len=32, n/ep=2, n/st=64, player_1/loss=958.018, player_2/loss=1158.851, rew=1055.00]                                                                                                


Epoch #604: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #605: 1025it [00:02, 448.71it/s, env_step=619520, len=25, n/ep=2, n/st=64, player_1/loss=679.956, player_2/loss=1148.946, rew=769.00]                                                                                                 


Epoch #605: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #606: 1025it [00:02, 447.42it/s, env_step=620544, len=31, n/ep=2, n/st=64, player_1/loss=935.193, rew=999.00]                                                                                                                         


Epoch #606: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #607: 1025it [00:02, 443.74it/s, env_step=621568, len=34, n/ep=2, n/st=64, player_1/loss=749.780, player_2/loss=1096.131, rew=1223.00]                                                                                                


Epoch #607: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #608: 1025it [00:02, 447.85it/s, env_step=622592, len=17, n/ep=4, n/st=64, player_1/loss=849.983, player_2/loss=1905.890, rew=338.50]                                                                                                 


Epoch #608: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #609: 1025it [00:02, 449.12it/s, env_step=623616, len=22, n/ep=3, n/st=64, player_1/loss=1324.252, player_2/loss=1936.341, rew=536.00]                                                                                                


Epoch #609: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #610: 1025it [00:02, 449.58it/s, env_step=624640, len=25, n/ep=3, n/st=64, player_1/loss=1596.546, player_2/loss=1493.997, rew=698.00]                                                                                                


Epoch #610: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #611: 1025it [00:02, 450.20it/s, env_step=625664, len=27, n/ep=2, n/st=64, player_1/loss=1462.151, player_2/loss=1579.571, rew=754.00]                                                                                                


Epoch #611: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #612: 1025it [00:02, 447.68it/s, env_step=626688, len=32, n/ep=2, n/st=64, player_1/loss=1293.771, player_2/loss=1782.009, rew=1099.00]                                                                                               


Epoch #612: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #613: 1025it [00:02, 447.97it/s, env_step=627712, len=36, n/ep=2, n/st=64, player_1/loss=1172.826, player_2/loss=2002.339, rew=1334.00]                                                                                               


Epoch #613: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #614: 1025it [00:02, 450.99it/s, env_step=628736, len=17, n/ep=4, n/st=64, player_1/loss=1551.049, player_2/loss=2174.025, rew=418.50]                                                                                                


Epoch #614: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #615: 1025it [00:02, 449.01it/s, env_step=629760, len=24, n/ep=2, n/st=64, player_1/loss=1502.340, player_2/loss=1980.832, rew=719.00]                                                                                                


Epoch #615: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #616: 1025it [00:02, 452.12it/s, env_step=630784, len=24, n/ep=3, n/st=64, player_1/loss=1126.289, player_2/loss=1516.746, rew=648.67]                                                                                                


Epoch #616: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #617: 1025it [00:02, 448.29it/s, env_step=631808, len=25, n/ep=3, n/st=64, player_1/loss=1235.665, player_2/loss=1418.957, rew=810.00]                                                                                                


Epoch #617: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #618: 1025it [00:02, 448.15it/s, env_step=632832, len=19, n/ep=4, n/st=64, player_1/loss=1342.209, player_2/loss=1741.589, rew=436.50]                                                                                                


Epoch #618: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #619: 1025it [00:02, 450.30it/s, env_step=633856, len=27, n/ep=2, n/st=64, player_1/loss=1524.531, player_2/loss=2403.789, rew=779.00]                                                                                                


Epoch #619: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #620: 1025it [00:02, 449.76it/s, env_step=634880, len=29, n/ep=2, n/st=64, player_1/loss=1641.420, player_2/loss=2531.579, rew=872.00]                                                                                                


Epoch #620: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #621: 1025it [00:02, 450.15it/s, env_step=635904, len=7, n/ep=8, n/st=64, player_1/loss=1597.734, player_2/loss=1736.877, rew=64.50]                                                                                                  


Epoch #621: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #622: 1025it [00:02, 448.43it/s, env_step=636928, len=18, n/ep=4, n/st=64, player_1/loss=2022.371, player_2/loss=1865.303, rew=367.50]                                                                                                


Epoch #622: test_reward: 130.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #623: 1025it [00:02, 450.42it/s, env_step=637952, len=8, n/ep=7, n/st=64, player_1/loss=2083.388, player_2/loss=1961.546, rew=85.43]                                                                                                  


Epoch #623: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #624: 1025it [00:02, 451.88it/s, env_step=638976, len=26, n/ep=3, n/st=64, player_1/loss=1674.216, player_2/loss=2382.762, rew=753.33]                                                                                                


Epoch #624: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #625: 1025it [00:02, 450.64it/s, env_step=640000, len=37, n/ep=2, n/st=64, player_1/loss=1524.653, player_2/loss=1934.559, rew=1442.00]                                                                                               


Epoch #625: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #626: 1025it [00:02, 449.58it/s, env_step=641024, len=27, n/ep=2, n/st=64, player_1/loss=1317.031, player_2/loss=1515.561, rew=812.00]                                                                                                


Epoch #626: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #627: 1025it [00:02, 451.31it/s, env_step=642048, len=37, n/ep=1, n/st=64, player_1/loss=1552.228, player_2/loss=1444.809, rew=1404.00]                                                                                               


Epoch #627: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #628: 1025it [00:02, 445.98it/s, env_step=643072, len=14, n/ep=4, n/st=64, player_1/loss=1650.518, player_2/loss=1897.415, rew=228.00]                                                                                                


Epoch #628: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #629: 1025it [00:02, 450.23it/s, env_step=644096, len=19, n/ep=3, n/st=64, player_1/loss=1618.767, player_2/loss=1661.711, rew=382.67]                                                                                                


Epoch #629: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #630: 1025it [00:02, 452.31it/s, env_step=645120, len=19, n/ep=3, n/st=64, player_1/loss=1460.465, player_2/loss=1446.261, rew=479.33]                                                                                                


Epoch #630: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #631: 1025it [00:02, 454.94it/s, env_step=646144, len=11, n/ep=7, n/st=64, player_1/loss=794.164, player_2/loss=1170.929, rew=164.86]                                                                                                 


Epoch #631: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #632: 1025it [00:02, 452.39it/s, env_step=647168, len=21, n/ep=3, n/st=64, player_1/loss=1081.186, player_2/loss=1288.350, rew=532.67]                                                                                                


Epoch #632: test_reward: 154.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #633: 1025it [00:02, 446.17it/s, env_step=648192, len=27, n/ep=2, n/st=64, player_1/loss=1665.135, player_2/loss=1234.586, rew=782.00]                                                                                                


Epoch #633: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #634: 1025it [00:02, 450.11it/s, env_step=649216, len=22, n/ep=3, n/st=64, player_1/loss=1341.068, player_2/loss=1287.049, rew=605.33]                                                                                                


Epoch #634: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #635: 1025it [00:02, 450.26it/s, env_step=650240, len=25, n/ep=2, n/st=64, player_1/loss=983.917, player_2/loss=1307.258, rew=746.00]                                                                                                 


Epoch #635: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #636: 1025it [00:02, 447.77it/s, env_step=651264, len=15, n/ep=6, n/st=64, player_1/loss=1217.955, player_2/loss=878.129, rew=302.33]                                                                                                 


Epoch #636: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #637: 1025it [00:02, 449.48it/s, env_step=652288, len=32, n/ep=2, n/st=64, player_1/loss=1444.500, player_2/loss=1838.206, rew=1103.00]                                                                                               


Epoch #637: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #638: 1025it [00:02, 409.42it/s, env_step=653312, len=12, n/ep=7, n/st=64, player_1/loss=1836.464, player_2/loss=1913.495, rew=233.14]                                                                                                


Epoch #638: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #639: 1025it [00:02, 442.56it/s, env_step=654336, len=8, n/ep=7, n/st=64, player_1/loss=2145.927, player_2/loss=1997.848, rew=79.14]                                                                                                  


Epoch #639: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #640: 1025it [00:02, 430.82it/s, env_step=655360, len=12, n/ep=4, n/st=64, player_1/loss=1926.069, player_2/loss=1238.177, rew=195.00]                                                                                                


Epoch #640: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #641: 1025it [00:02, 399.26it/s, env_step=656384, len=8, n/ep=7, n/st=64, player_1/loss=1551.673, player_2/loss=1140.183, rew=84.57]                                                                                                  


Epoch #641: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #642: 1025it [00:03, 320.59it/s, env_step=657408, len=25, n/ep=3, n/st=64, player_1/loss=1108.452, player_2/loss=1520.818, rew=665.33]                                                                                                


Epoch #642: test_reward: 1558.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #643: 1025it [00:02, 379.99it/s, env_step=658432, len=32, n/ep=1, n/st=64, player_1/loss=1335.788, player_2/loss=2403.667, rew=1054.00]                                                                                               


Epoch #643: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #644: 1025it [00:02, 377.11it/s, env_step=659456, len=28, n/ep=2, n/st=64, player_1/loss=1710.779, player_2/loss=2387.981, rew=859.00]                                                                                                


Epoch #644: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #645: 1025it [00:02, 421.06it/s, env_step=660480, len=28, n/ep=2, n/st=64, player_1/loss=1966.552, player_2/loss=1283.340, rew=845.00]                                                                                                


Epoch #645: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #646: 1025it [00:02, 394.01it/s, env_step=661504, len=25, n/ep=3, n/st=64, player_1/loss=1469.958, player_2/loss=1373.763, rew=745.33]                                                                                                


Epoch #646: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #647: 1025it [00:02, 468.05it/s, env_step=662528, len=12, n/ep=5, n/st=64, player_1/loss=1081.743, player_2/loss=2229.698, rew=168.00]                                                                                                


Epoch #647: test_reward: 130.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #648: 1025it [00:02, 456.19it/s, env_step=663552, len=24, n/ep=3, n/st=64, player_1/loss=684.940, player_2/loss=2132.795, rew=617.33]                                                                                                 


Epoch #648: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #649: 1025it [00:02, 357.51it/s, env_step=664576, len=22, n/ep=3, n/st=64, player_1/loss=770.001, player_2/loss=1590.589, rew=564.00]                                                                                                 


Epoch #649: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #650: 1025it [00:03, 329.81it/s, env_step=665600, len=25, n/ep=3, n/st=64, player_1/loss=1101.847, player_2/loss=1244.265, rew=666.67]                                                                                                


Epoch #650: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #651: 1025it [00:02, 414.88it/s, env_step=666624, len=21, n/ep=3, n/st=64, player_1/loss=884.455, player_2/loss=842.935, rew=564.00]                                                                                                  


Epoch #651: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #652: 1025it [00:02, 459.59it/s, env_step=667648, len=15, n/ep=4, n/st=64, player_1/loss=675.995, player_2/loss=724.556, rew=290.50]                                                                                                  


Epoch #652: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #653: 1025it [00:02, 459.56it/s, env_step=668672, len=22, n/ep=3, n/st=64, player_1/loss=1226.001, player_2/loss=1031.137, rew=612.00]                                                                                                


Epoch #653: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #654: 1025it [00:02, 462.53it/s, env_step=669696, len=25, n/ep=2, n/st=64, player_1/loss=772.869, player_2/loss=1244.642, rew=806.00]                                                                                                 


Epoch #654: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #655: 1025it [00:02, 461.52it/s, env_step=670720, len=31, n/ep=2, n/st=64, player_1/loss=626.181, player_2/loss=915.330, rew=991.00]                                                                                                  


Epoch #655: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #656: 1025it [00:02, 464.70it/s, env_step=671744, len=26, n/ep=2, n/st=64, player_1/loss=1012.520, player_2/loss=998.002, rew=821.00]                                                                                                 


Epoch #656: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #657: 1025it [00:02, 465.97it/s, env_step=672768, len=25, n/ep=2, n/st=64, player_1/loss=1153.642, player_2/loss=1708.400, rew=649.00]                                                                                                


Epoch #657: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #658: 1025it [00:02, 465.32it/s, env_step=673792, len=27, n/ep=2, n/st=64, player_1/loss=1267.165, player_2/loss=1844.055, rew=779.00]                                                                                                


Epoch #658: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #659: 1025it [00:02, 459.83it/s, env_step=674816, len=17, n/ep=4, n/st=64, player_1/loss=1430.040, player_2/loss=1750.136, rew=312.00]                                                                                                


Epoch #659: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #660: 1025it [00:02, 462.79it/s, env_step=675840, len=31, n/ep=3, n/st=64, player_1/loss=1538.043, player_2/loss=1832.161, rew=1044.00]                                                                                               


Epoch #660: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #661: 1025it [00:02, 462.17it/s, env_step=676864, len=18, n/ep=4, n/st=64, player_1/loss=1438.399, player_2/loss=1519.153, rew=436.50]                                                                                                


Epoch #661: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #662: 1025it [00:02, 455.94it/s, env_step=677888, len=17, n/ep=3, n/st=64, player_1/loss=1307.800, player_2/loss=1414.454, rew=312.00]                                                                                                


Epoch #662: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #663: 1025it [00:02, 463.75it/s, env_step=678912, len=34, n/ep=2, n/st=64, player_1/loss=1260.329, player_2/loss=1763.297, rew=1223.00]                                                                                               


Epoch #663: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #664: 1025it [00:02, 377.84it/s, env_step=679936, len=32, n/ep=2, n/st=64, player_1/loss=1469.839, player_2/loss=1706.550, rew=1055.00]                                                                                               


Epoch #664: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #665: 1025it [00:02, 462.14it/s, env_step=680960, len=25, n/ep=2, n/st=64, player_1/loss=1152.102, player_2/loss=1842.918, rew=716.00]                                                                                                


Epoch #665: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #666: 1025it [00:02, 465.62it/s, env_step=681984, len=33, n/ep=2, n/st=64, player_1/loss=1158.611, player_2/loss=1387.870, rew=1120.00]                                                                                               


Epoch #666: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #667: 1025it [00:02, 446.93it/s, env_step=683008, len=22, n/ep=3, n/st=64, player_1/loss=948.238, player_2/loss=1047.205, rew=599.33]                                                                                                 


Epoch #667: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #668: 1025it [00:02, 445.79it/s, env_step=684032, len=29, n/ep=3, n/st=64, player_1/loss=859.500, player_2/loss=1049.983, rew=998.67]                                                                                                 


Epoch #668: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #669: 1025it [00:02, 468.20it/s, env_step=685056, len=33, n/ep=2, n/st=64, player_1/loss=1427.401, player_2/loss=1212.187, rew=1154.00]                                                                                               


Epoch #669: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #670: 1025it [00:02, 465.28it/s, env_step=686080, len=13, n/ep=5, n/st=64, player_1/loss=1634.906, player_2/loss=1275.264, rew=202.80]                                                                                                


Epoch #670: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #671: 1025it [00:02, 468.73it/s, env_step=687104, len=28, n/ep=2, n/st=64, player_1/loss=1673.635, player_2/loss=1322.222, rew=819.00]                                                                                                


Epoch #671: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #672: 1025it [00:02, 426.22it/s, env_step=688128, len=31, n/ep=2, n/st=64, player_1/loss=1586.371, player_2/loss=1043.406, rew=1071.00]                                                                                               


Epoch #672: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #673: 1025it [00:02, 409.30it/s, env_step=689152, len=19, n/ep=3, n/st=64, player_1/loss=1151.913, player_2/loss=1060.937, rew=407.33]                                                                                                


Epoch #673: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #674: 1025it [00:02, 462.69it/s, env_step=690176, len=37, n/ep=2, n/st=64, player_1/loss=1106.003, player_2/loss=1150.115, rew=1405.00]                                                                                               


Epoch #674: test_reward: 1558.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #675: 1025it [00:02, 454.54it/s, env_step=691200, len=20, n/ep=3, n/st=64, player_1/loss=1034.513, player_2/loss=1260.795, rew=433.33]                                                                                                


Epoch #675: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #676: 1025it [00:02, 391.82it/s, env_step=692224, len=28, n/ep=2, n/st=64, player_1/loss=702.813, player_2/loss=1584.337, rew=971.00]                                                                                                 


Epoch #676: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #677: 1025it [00:02, 445.83it/s, env_step=693248, len=34, n/ep=2, n/st=64, player_1/loss=1107.479, player_2/loss=1892.674, rew=1189.00]                                                                                               


Epoch #677: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #678: 1025it [00:02, 464.61it/s, env_step=694272, len=18, n/ep=2, n/st=64, player_1/loss=1050.234, player_2/loss=1614.935, rew=361.00]                                                                                                


Epoch #678: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #679: 1025it [00:02, 461.21it/s, env_step=695296, len=29, n/ep=2, n/st=64, player_1/loss=879.625, player_2/loss=1705.352, rew=970.00]                                                                                                 


Epoch #679: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #680: 1025it [00:02, 449.59it/s, env_step=696320, len=29, n/ep=2, n/st=64, player_1/loss=1452.598, player_2/loss=1390.297, rew=877.00]                                                                                                


Epoch #680: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #681: 1025it [00:02, 417.33it/s, env_step=697344, len=31, n/ep=2, n/st=64, player_1/loss=1285.181, player_2/loss=1221.318, rew=1078.00]                                                                                               


Epoch #681: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #682: 1025it [00:02, 419.61it/s, env_step=698368, len=19, n/ep=4, n/st=64, player_1/loss=1308.454, player_2/loss=1279.564, rew=450.50]                                                                                                


Epoch #682: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #683: 1025it [00:02, 439.11it/s, env_step=699392, len=27, n/ep=2, n/st=64, player_1/loss=1238.901, player_2/loss=1935.454, rew=824.00]                                                                                                


Epoch #683: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #684: 1025it [00:02, 435.07it/s, env_step=700416, len=32, n/ep=2, n/st=64, player_1/loss=1337.260, player_2/loss=1771.667, rew=1058.00]                                                                                               


Epoch #684: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #685: 1025it [00:02, 460.10it/s, env_step=701440, len=22, n/ep=3, n/st=64, player_1/loss=1443.720, player_2/loss=1485.890, rew=530.00]                                                                                                


Epoch #685: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #686: 1025it [00:02, 449.08it/s, env_step=702464, len=22, n/ep=3, n/st=64, player_1/loss=1549.593, player_2/loss=1578.156, rew=504.67]                                                                                                


Epoch #686: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #687: 1025it [00:02, 455.35it/s, env_step=703488, len=8, n/ep=8, n/st=64, player_1/loss=1957.214, player_2/loss=1608.378, rew=80.75]                                                                                                  


Epoch #687: test_reward: 70.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #688: 1025it [00:02, 389.27it/s, env_step=704512, len=23, n/ep=3, n/st=64, player_1/loss=1619.392, player_2/loss=1827.964, rew=550.67]                                                                                                


Epoch #688: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #689: 1025it [00:02, 396.16it/s, env_step=705536, len=34, n/ep=2, n/st=64, player_1/loss=1523.443, player_2/loss=1576.281, rew=1223.00]                                                                                               


Epoch #689: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #690: 1025it [00:02, 399.61it/s, env_step=706560, len=36, n/ep=2, n/st=64, player_1/loss=1576.114, player_2/loss=1408.900, rew=1379.00]                                                                                               


Epoch #690: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #691: 1025it [00:02, 453.52it/s, env_step=707584, len=23, n/ep=2, n/st=64, player_1/loss=666.073, player_2/loss=1664.489, rew=580.00]                                                                                                 


Epoch #691: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #692: 1025it [00:02, 372.00it/s, env_step=708608, len=15, n/ep=4, n/st=64, player_1/loss=963.109, player_2/loss=1969.768, rew=263.50]                                                                                                 


Epoch #692: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #693: 1025it [00:02, 414.33it/s, env_step=709632, len=27, n/ep=2, n/st=64, player_1/loss=1702.020, player_2/loss=1671.054, rew=779.00]                                                                                                


Epoch #693: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #694: 1025it [00:02, 458.45it/s, env_step=710656, len=22, n/ep=3, n/st=64, player_1/loss=1570.678, player_2/loss=1411.047, rew=632.00]                                                                                                


Epoch #694: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #695: 1025it [00:02, 367.96it/s, env_step=711680, len=13, n/ep=5, n/st=64, player_1/loss=1602.506, player_2/loss=1766.086, rew=276.40]                                                                                                


Epoch #695: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #696: 1025it [00:02, 396.18it/s, env_step=712704, len=24, n/ep=3, n/st=64, player_1/loss=1859.807, player_2/loss=2162.350, rew=634.67]                                                                                                


Epoch #696: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #697: 1025it [00:02, 440.65it/s, env_step=713728, len=22, n/ep=3, n/st=64, player_1/loss=1793.045, player_2/loss=1471.935, rew=508.67]                                                                                                


Epoch #697: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #698: 1025it [00:02, 431.56it/s, env_step=714752, len=28, n/ep=2, n/st=64, player_1/loss=1391.551, player_2/loss=1196.406, rew=851.00]                                                                                                


Epoch #698: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #699: 1025it [00:02, 449.08it/s, env_step=715776, len=15, n/ep=3, n/st=64, player_1/loss=1744.407, player_2/loss=1003.944, rew=252.67]                                                                                                


Epoch #699: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #700: 1025it [00:02, 446.92it/s, env_step=716800, len=18, n/ep=4, n/st=64, player_1/loss=2299.557, player_2/loss=761.986, rew=367.50]                                                                                                 


Epoch #700: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #701: 1025it [00:02, 452.51it/s, env_step=717824, len=27, n/ep=2, n/st=64, player_1/loss=2132.007, player_2/loss=1064.458, rew=794.00]                                                                                                


Epoch #701: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #702: 1025it [00:02, 438.11it/s, env_step=718848, len=26, n/ep=2, n/st=64, player_1/loss=1785.139, player_2/loss=1441.507, rew=729.00]                                                                                                


Epoch #702: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #703: 1025it [00:02, 441.02it/s, env_step=719872, len=37, n/ep=2, n/st=64, player_1/loss=1811.460, player_2/loss=1222.133, rew=1408.00]                                                                                               


Epoch #703: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #704: 1025it [00:02, 432.26it/s, env_step=720896, len=24, n/ep=3, n/st=64, player_1/loss=1547.964, player_2/loss=1368.871, rew=665.33]                                                                                                


Epoch #704: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #705: 1025it [00:02, 402.55it/s, env_step=721920, len=14, n/ep=4, n/st=64, player_1/loss=1429.891, player_2/loss=1680.277, rew=213.50]                                                                                                


Epoch #705: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #706: 1025it [00:02, 405.28it/s, env_step=722944, len=37, n/ep=2, n/st=64, player_1/loss=1777.140, player_2/loss=1312.186, rew=1477.00]                                                                                               


Epoch #706: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #707: 1025it [00:02, 384.78it/s, env_step=723968, len=40, n/ep=2, n/st=64, player_1/loss=1826.994, player_2/loss=1080.535, rew=1639.00]                                                                                               


Epoch #707: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #708: 1025it [00:02, 423.78it/s, env_step=724992, len=33, n/ep=2, n/st=64, player_1/loss=1853.389, player_2/loss=1547.798, rew=1136.00]                                                                                               


Epoch #708: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #709: 1025it [00:02, 417.04it/s, env_step=726016, len=36, n/ep=2, n/st=64, player_1/loss=1373.308, player_2/loss=1991.560, rew=1369.00]                                                                                               


Epoch #709: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #710: 1025it [00:02, 425.80it/s, env_step=727040, len=23, n/ep=3, n/st=64, player_1/loss=1105.629, player_2/loss=1832.191, rew=562.67]                                                                                                


Epoch #710: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #711: 1025it [00:02, 422.59it/s, env_step=728064, len=28, n/ep=2, n/st=64, player_1/loss=1213.026, player_2/loss=1340.451, rew=841.00]                                                                                                


Epoch #711: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #712: 1025it [00:02, 417.86it/s, env_step=729088, len=22, n/ep=3, n/st=64, player_1/loss=1505.670, player_2/loss=1142.447, rew=530.67]                                                                                                


Epoch #712: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #713: 1025it [00:02, 436.15it/s, env_step=730112, len=26, n/ep=2, n/st=64, player_1/loss=1288.424, player_2/loss=1458.329, rew=739.00]                                                                                                


Epoch #713: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #714: 1025it [00:02, 445.46it/s, env_step=731136, len=27, n/ep=2, n/st=64, player_1/loss=1115.614, player_2/loss=2103.511, rew=770.00]                                                                                                


Epoch #714: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #715: 1025it [00:02, 448.02it/s, env_step=732160, len=23, n/ep=2, n/st=64, player_1/loss=1235.515, player_2/loss=1954.738, rew=604.00]                                                                                                


Epoch #715: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #716: 1025it [00:02, 450.86it/s, env_step=733184, len=31, n/ep=2, n/st=64, player_1/loss=1752.701, player_2/loss=1464.793, rew=999.00]                                                                                                


Epoch #716: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #717: 1025it [00:02, 451.35it/s, env_step=734208, len=32, n/ep=2, n/st=64, player_1/loss=1487.335, player_2/loss=1059.326, rew=1079.00]                                                                                               


Epoch #717: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #718: 1025it [00:02, 445.09it/s, env_step=735232, len=32, n/ep=3, n/st=64, player_1/loss=1471.360, player_2/loss=1397.039, rew=1062.67]                                                                                               


Epoch #718: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #719: 1025it [00:02, 448.33it/s, env_step=736256, len=30, n/ep=2, n/st=64, player_1/loss=1347.811, player_2/loss=1312.391, rew=971.00]                                                                                                


Epoch #719: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #720: 1025it [00:02, 437.31it/s, env_step=737280, len=35, n/ep=2, n/st=64, player_1/loss=1156.829, player_2/loss=1113.980, rew=1259.00]                                                                                               


Epoch #720: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #721: 1025it [00:02, 440.48it/s, env_step=738304, len=42, n/ep=1, n/st=64, player_1/loss=749.739, player_2/loss=837.536, rew=1834.00]                                                                                                 


Epoch #721: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #722: 1025it [00:02, 431.25it/s, env_step=739328, len=38, n/ep=2, n/st=64, player_1/loss=888.828, player_2/loss=830.267, rew=1511.00]                                                                                                 


Epoch #722: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #723: 1025it [00:02, 435.23it/s, env_step=740352, len=21, n/ep=4, n/st=64, player_1/loss=1008.068, player_2/loss=1287.323, rew=552.50]                                                                                                


Epoch #723: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #724: 1025it [00:02, 440.38it/s, env_step=741376, len=26, n/ep=3, n/st=64, player_1/loss=858.251, player_2/loss=1227.802, rew=800.67]                                                                                                 


Epoch #724: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #725: 1025it [00:02, 454.06it/s, env_step=742400, len=35, n/ep=2, n/st=64, player_1/loss=763.783, player_2/loss=1058.346, rew=1274.00]                                                                                                


Epoch #725: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #726: 1025it [00:02, 455.93it/s, env_step=743424, len=30, n/ep=2, n/st=64, player_1/loss=997.619, player_2/loss=675.142, rew=971.00]                                                                                                  


Epoch #726: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #727: 1025it [00:02, 444.17it/s, env_step=744448, len=39, n/ep=1, n/st=64, player_1/loss=1508.034, player_2/loss=1023.934, rew=1558.00]                                                                                               


Epoch #727: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #728: 1025it [00:02, 431.53it/s, env_step=745472, len=25, n/ep=2, n/st=64, player_1/loss=1085.603, player_2/loss=831.892, rew=784.00]                                                                                                 


Epoch #728: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #729: 1025it [00:02, 449.93it/s, env_step=746496, len=27, n/ep=2, n/st=64, player_1/loss=1204.196, player_2/loss=1161.327, rew=754.00]                                                                                                


Epoch #729: test_reward: 1404.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #730: 1025it [00:02, 449.20it/s, env_step=747520, len=20, n/ep=2, n/st=64, player_1/loss=1625.779, player_2/loss=1719.327, rew=511.00]                                                                                                


Epoch #730: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #731: 1025it [00:02, 448.98it/s, env_step=748544, len=26, n/ep=3, n/st=64, player_1/loss=1705.445, player_2/loss=1370.909, rew=858.67]                                                                                                


Epoch #731: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #732: 1025it [00:02, 450.07it/s, env_step=749568, len=28, n/ep=2, n/st=64, player_1/loss=1358.553, player_2/loss=1492.150, rew=839.00]                                                                                                


Epoch #732: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #733: 1025it [00:02, 449.45it/s, env_step=750592, len=31, n/ep=2, n/st=64, player_1/loss=526.019, player_2/loss=1120.074, rew=994.00]                                                                                                 


Epoch #733: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #734: 1025it [00:02, 453.16it/s, env_step=751616, len=38, n/ep=2, n/st=64, player_1/loss=633.614, player_2/loss=1049.343, rew=1525.00]                                                                                                


Epoch #734: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #735: 1025it [00:02, 451.03it/s, env_step=752640, len=25, n/ep=2, n/st=64, player_1/loss=1005.236, player_2/loss=1534.123, rew=712.00]                                                                                                


Epoch #735: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #736: 1025it [00:02, 449.02it/s, env_step=753664, len=15, n/ep=4, n/st=64, player_1/loss=1185.095, player_2/loss=1685.333, rew=262.50]                                                                                                


Epoch #736: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #737: 1025it [00:02, 455.04it/s, env_step=754688, len=16, n/ep=3, n/st=64, player_1/loss=1389.305, player_2/loss=1758.829, rew=274.67]                                                                                                


Epoch #737: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #738: 1025it [00:02, 449.53it/s, env_step=755712, len=19, n/ep=4, n/st=64, player_1/loss=1446.800, player_2/loss=1649.087, rew=417.50]                                                                                                


Epoch #738: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #739: 1025it [00:02, 454.34it/s, env_step=756736, len=24, n/ep=3, n/st=64, player_1/loss=1470.282, rew=598.67]                                                                                                                        


Epoch #739: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #740: 1025it [00:02, 452.40it/s, env_step=757760, len=23, n/ep=3, n/st=64, player_1/loss=1331.325, player_2/loss=657.683, rew=562.67]                                                                                                 


Epoch #740: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #741: 1025it [00:02, 451.14it/s, env_step=758784, len=19, n/ep=3, n/st=64, player_1/loss=1066.806, player_2/loss=618.882, rew=380.67]                                                                                                 


Epoch #741: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #742: 1025it [00:02, 438.11it/s, env_step=759808, len=31, n/ep=2, n/st=64, player_1/loss=1380.801, player_2/loss=1079.404, rew=990.00]                                                                                                


Epoch #742: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #743: 1025it [00:02, 444.40it/s, env_step=760832, len=29, n/ep=2, n/st=64, player_1/loss=1929.827, player_2/loss=1500.169, rew=893.00]                                                                                                


Epoch #743: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #744: 1025it [00:02, 439.95it/s, env_step=761856, len=33, n/ep=2, n/st=64, player_1/loss=1120.311, player_2/loss=1323.830, rew=1136.00]                                                                                               


Epoch #744: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #745: 1025it [00:02, 444.52it/s, env_step=762880, len=28, n/ep=2, n/st=64, player_1/loss=297.014, player_2/loss=1247.599, rew=841.00]                                                                                                 


Epoch #745: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #746: 1025it [00:02, 450.50it/s, env_step=763904, len=33, n/ep=3, n/st=64, player_1/loss=921.621, player_2/loss=1194.146, rew=1190.67]                                                                                                


Epoch #746: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #747: 1025it [00:02, 446.93it/s, env_step=764928, len=32, n/ep=2, n/st=64, player_1/loss=1529.448, player_2/loss=821.588, rew=1058.00]                                                                                                


Epoch #747: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #748: 1025it [00:02, 447.21it/s, env_step=765952, len=26, n/ep=2, n/st=64, player_1/loss=1662.976, player_2/loss=826.019, rew=781.00]                                                                                                 


Epoch #748: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #749: 1025it [00:02, 423.02it/s, env_step=766976, len=14, n/ep=4, n/st=64, player_1/loss=1537.363, player_2/loss=914.492, rew=234.00]                                                                                                 


Epoch #749: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #750: 1025it [00:02, 440.30it/s, env_step=768000, len=36, n/ep=2, n/st=64, player_1/loss=1281.197, player_2/loss=985.018, rew=1373.00]                                                                                                


Epoch #750: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #751: 1025it [00:02, 432.59it/s, env_step=769024, len=33, n/ep=2, n/st=64, player_1/loss=1503.630, player_2/loss=934.201, rew=1120.00]                                                                                                


Epoch #751: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #752: 1025it [00:02, 439.09it/s, env_step=770048, len=34, n/ep=1, n/st=64, player_1/loss=2062.268, player_2/loss=1388.373, rew=1188.00]                                                                                               


Epoch #752: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #753: 1025it [00:02, 444.82it/s, env_step=771072, len=11, n/ep=6, n/st=64, player_1/loss=2125.976, player_2/loss=1444.713, rew=155.67]                                                                                                


Epoch #753: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #754: 1025it [00:02, 435.62it/s, env_step=772096, len=30, n/ep=2, n/st=64, player_1/loss=2330.560, player_2/loss=1746.828, rew=961.00]                                                                                                


Epoch #754: test_reward: 130.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #755: 1025it [00:02, 446.92it/s, env_step=773120, len=28, n/ep=2, n/st=64, player_1/loss=2114.446, player_2/loss=1489.852, rew=881.00]                                                                                                


Epoch #755: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #756: 1025it [00:02, 446.53it/s, env_step=774144, len=27, n/ep=2, n/st=64, player_1/loss=2279.019, player_2/loss=1048.607, rew=794.00]                                                                                                


Epoch #756: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #757: 1025it [00:02, 446.27it/s, env_step=775168, len=32, n/ep=2, n/st=64, player_1/loss=1413.835, player_2/loss=1494.657, rew=1058.00]                                                                                               


Epoch #757: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #758: 1025it [00:02, 438.28it/s, env_step=776192, len=8, n/ep=6, n/st=64, player_1/loss=1066.914, player_2/loss=2184.397, rew=85.00]                                                                                                  


Epoch #758: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #759: 1025it [00:02, 423.80it/s, env_step=777216, len=15, n/ep=4, n/st=64, player_1/loss=1015.858, player_2/loss=2255.729, rew=258.50]                                                                                                


Epoch #759: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #760: 1025it [00:02, 412.84it/s, env_step=778240, len=13, n/ep=5, n/st=64, player_1/loss=883.079, player_2/loss=2093.544, rew=226.00]                                                                                                 


Epoch #760: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #761: 1025it [00:02, 404.86it/s, env_step=779264, len=37, n/ep=2, n/st=64, player_1/loss=786.337, player_2/loss=1907.184, rew=1420.00]                                                                                                


Epoch #761: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #762: 1025it [00:02, 445.81it/s, env_step=780288, len=21, n/ep=3, n/st=64, player_1/loss=704.788, player_2/loss=1638.446, rew=462.00]                                                                                                 


Epoch #762: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #763: 1025it [00:02, 424.78it/s, env_step=781312, len=21, n/ep=4, n/st=64, player_1/loss=1025.109, player_2/loss=1667.637, rew=573.00]                                                                                                


Epoch #763: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #764: 1025it [00:02, 440.21it/s, env_step=782336, len=20, n/ep=3, n/st=64, player_1/loss=1039.731, player_2/loss=1657.677, rew=446.67]                                                                                                


Epoch #764: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #765: 1025it [00:02, 438.23it/s, env_step=783360, len=36, n/ep=2, n/st=64, player_1/loss=1183.698, player_2/loss=1801.683, rew=1379.00]                                                                                               


Epoch #765: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #766: 1025it [00:02, 422.68it/s, env_step=784384, len=22, n/ep=3, n/st=64, player_1/loss=1369.500, player_2/loss=1423.512, rew=596.00]                                                                                                


Epoch #766: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #767: 1025it [00:02, 428.70it/s, env_step=785408, len=7, n/ep=8, n/st=64, player_1/loss=1399.928, player_2/loss=1233.931, rew=64.75]                                                                                                  


Epoch #767: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #768: 1025it [00:02, 388.05it/s, env_step=786432, len=16, n/ep=3, n/st=64, player_1/loss=1382.877, player_2/loss=1307.983, rew=293.33]                                                                                                


Epoch #768: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #769: 1025it [00:02, 370.56it/s, env_step=787456, len=19, n/ep=3, n/st=64, player_1/loss=1330.959, player_2/loss=939.437, rew=421.33]                                                                                                 


Epoch #769: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #770: 1025it [00:02, 345.23it/s, env_step=788480, len=7, n/ep=5, n/st=64, player_1/loss=1073.894, player_2/loss=1085.802, rew=64.00]                                                                                                  


Epoch #770: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #771: 1025it [00:03, 325.40it/s, env_step=789504, len=11, n/ep=5, n/st=64, player_1/loss=1052.516, player_2/loss=1317.044, rew=151.60]                                                                                                


Epoch #771: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #772: 1025it [00:03, 326.02it/s, env_step=790528, len=24, n/ep=3, n/st=64, player_1/loss=1637.062, player_2/loss=1591.889, rew=702.67]                                                                                                


Epoch #772: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #773: 1025it [00:03, 304.25it/s, env_step=791552, len=17, n/ep=4, n/st=64, player_1/loss=2003.880, player_2/loss=1972.790, rew=391.50]                                                                                                


Epoch #773: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #774: 1025it [00:02, 401.64it/s, env_step=792576, len=30, n/ep=2, n/st=64, player_1/loss=1415.119, player_2/loss=1787.650, rew=959.00]                                                                                                


Epoch #774: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #775: 1025it [00:02, 437.73it/s, env_step=793600, len=38, n/ep=1, n/st=64, player_1/loss=992.540, player_2/loss=1285.663, rew=1480.00]                                                                                                


Epoch #775: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #776: 1025it [00:02, 427.95it/s, env_step=794624, len=36, n/ep=2, n/st=64, player_1/loss=714.503, player_2/loss=803.257, rew=1334.00]                                                                                                 


Epoch #776: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #777: 1025it [00:02, 440.33it/s, env_step=795648, len=26, n/ep=2, n/st=64, player_1/loss=573.363, player_2/loss=878.978, rew=781.00]                                                                                                  


Epoch #777: test_reward: 1558.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #778: 1025it [00:02, 450.01it/s, env_step=796672, len=13, n/ep=4, n/st=64, player_1/loss=1561.767, player_2/loss=1466.541, rew=194.50]                                                                                                


Epoch #778: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #779: 1025it [00:02, 425.49it/s, env_step=797696, len=25, n/ep=2, n/st=64, player_1/loss=1815.149, player_2/loss=1380.563, rew=748.00]                                                                                                


Epoch #779: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #780: 1025it [00:02, 433.88it/s, env_step=798720, len=34, n/ep=2, n/st=64, player_1/loss=1449.118, player_2/loss=1204.913, rew=1279.00]                                                                                               


Epoch #780: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #781: 1025it [00:02, 420.79it/s, env_step=799744, len=7, n/ep=9, n/st=64, player_1/loss=1237.159, player_2/loss=1254.016, rew=59.56]                                                                                                  


Epoch #781: test_reward: 1638.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #782: 1025it [00:02, 434.69it/s, env_step=800768, len=17, n/ep=5, n/st=64, player_1/loss=1352.125, player_2/loss=1456.193, rew=366.80]                                                                                                


Epoch #782: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #783: 1025it [00:02, 447.20it/s, env_step=801792, len=32, n/ep=2, n/st=64, player_1/loss=1353.184, player_2/loss=1061.064, rew=1058.00]                                                                                               


Epoch #783: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #784: 1025it [00:02, 434.49it/s, env_step=802816, len=21, n/ep=3, n/st=64, player_1/loss=1543.766, player_2/loss=1080.579, rew=462.67]                                                                                                


Epoch #784: test_reward: 1558.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #785: 1025it [00:02, 426.68it/s, env_step=803840, len=29, n/ep=2, n/st=64, player_1/loss=1363.890, player_2/loss=1225.512, rew=900.00]                                                                                                


Epoch #785: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #786: 1025it [00:02, 443.03it/s, env_step=804864, len=30, n/ep=2, n/st=64, player_1/loss=1022.591, player_2/loss=1233.056, rew=979.00]                                                                                                


Epoch #786: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #787: 1025it [00:02, 445.53it/s, env_step=805888, len=26, n/ep=3, n/st=64, player_1/loss=792.567, player_2/loss=846.646, rew=740.00]                                                                                                  


Epoch #787: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #788: 1025it [00:02, 450.65it/s, env_step=806912, len=27, n/ep=2, n/st=64, player_1/loss=780.049, player_2/loss=711.162, rew=758.00]                                                                                                  


Epoch #788: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #789: 1025it [00:02, 436.13it/s, env_step=807936, len=28, n/ep=2, n/st=64, player_1/loss=955.818, player_2/loss=998.406, rew=841.00]                                                                                                  


Epoch #789: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #790: 1025it [00:02, 454.75it/s, env_step=808960, len=8, n/ep=7, n/st=64, player_1/loss=1147.576, player_2/loss=1617.648, rew=83.71]                                                                                                  


Epoch #790: test_reward: 130.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #791: 1025it [00:02, 454.22it/s, env_step=809984, len=38, n/ep=2, n/st=64, player_1/loss=1169.017, player_2/loss=1399.440, rew=1546.00]                                                                                               


Epoch #791: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #792: 1025it [00:02, 461.74it/s, env_step=811008, len=29, n/ep=2, n/st=64, player_1/loss=1113.288, player_2/loss=1250.787, rew=868.00]                                                                                                


Epoch #792: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #793: 1025it [00:02, 440.94it/s, env_step=812032, len=28, n/ep=3, n/st=64, player_1/loss=820.000, player_2/loss=1245.045, rew=834.67]                                                                                                 


Epoch #793: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #794: 1025it [00:02, 447.61it/s, env_step=813056, len=20, n/ep=3, n/st=64, player_1/loss=1155.614, player_2/loss=1560.133, rew=548.67]                                                                                                


Epoch #794: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #795: 1025it [00:02, 436.94it/s, env_step=814080, len=21, n/ep=3, n/st=64, player_1/loss=1432.133, player_2/loss=1502.879, rew=476.67]                                                                                                


Epoch #795: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #796: 1025it [00:02, 421.74it/s, env_step=815104, len=34, n/ep=2, n/st=64, player_1/loss=1440.416, player_2/loss=1463.070, rew=1213.00]                                                                                               


Epoch #796: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #797: 1025it [00:02, 396.47it/s, env_step=816128, len=34, n/ep=2, n/st=64, player_1/loss=1222.493, player_2/loss=1376.366, rew=1224.00]                                                                                               


Epoch #797: test_reward: 1558.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #798: 1025it [00:02, 439.90it/s, env_step=817152, len=33, n/ep=2, n/st=64, player_1/loss=1039.489, player_2/loss=1483.081, rew=1124.00]                                                                                               


Epoch #798: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #799: 1025it [00:02, 434.61it/s, env_step=818176, len=10, n/ep=8, n/st=64, player_1/loss=1502.988, player_2/loss=1413.736, rew=164.00]                                                                                                


Epoch #799: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #800: 1025it [00:02, 443.98it/s, env_step=819200, len=35, n/ep=2, n/st=64, player_1/loss=1251.223, player_2/loss=1355.221, rew=1258.00]                                                                                               


Epoch #800: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #801: 1025it [00:02, 432.11it/s, env_step=820224, len=25, n/ep=2, n/st=64, player_1/loss=1517.705, player_2/loss=2086.267, rew=648.00]                                                                                                


Epoch #801: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #802: 1025it [00:02, 437.33it/s, env_step=821248, len=25, n/ep=2, n/st=64, player_1/loss=1314.413, player_2/loss=1646.720, rew=686.00]                                                                                                


Epoch #802: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #803: 1025it [00:02, 442.53it/s, env_step=822272, len=29, n/ep=2, n/st=64, player_2/loss=1415.350, rew=918.00]                                                                                                                        


Epoch #803: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #804: 1025it [00:02, 434.54it/s, env_step=823296, len=23, n/ep=2, n/st=64, player_1/loss=570.524, player_2/loss=1205.507, rew=574.00]                                                                                                 


Epoch #804: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #805: 1025it [00:02, 430.28it/s, env_step=824320, len=14, n/ep=3, n/st=64, player_1/loss=498.010, player_2/loss=936.974, rew=208.67]                                                                                                  


Epoch #805: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #806: 1025it [00:02, 424.07it/s, env_step=825344, len=22, n/ep=3, n/st=64, player_1/loss=1476.355, player_2/loss=1988.707, rew=541.33]                                                                                                


Epoch #806: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #807: 1025it [00:02, 443.39it/s, env_step=826368, len=24, n/ep=3, n/st=64, player_1/loss=2164.572, player_2/loss=1927.719, rew=662.67]                                                                                                


Epoch #807: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #808: 1025it [00:02, 449.14it/s, env_step=827392, len=21, n/ep=3, n/st=64, player_1/loss=1454.096, player_2/loss=1144.893, rew=526.67]                                                                                                


Epoch #808: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #809: 1025it [00:02, 438.93it/s, env_step=828416, len=25, n/ep=3, n/st=64, player_1/loss=952.368, player_2/loss=374.755, rew=686.00]                                                                                                  


Epoch #809: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #810: 1025it [00:02, 436.53it/s, env_step=829440, len=22, n/ep=3, n/st=64, player_1/loss=811.252, player_2/loss=1208.415, rew=556.67]                                                                                                 


Epoch #810: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #811: 1025it [00:02, 432.44it/s, env_step=830464, len=27, n/ep=2, n/st=64, player_1/loss=1054.716, player_2/loss=1548.148, rew=770.00]                                                                                                


Epoch #811: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #812: 1025it [00:02, 431.28it/s, env_step=831488, len=27, n/ep=2, n/st=64, player_1/loss=1022.850, player_2/loss=1342.985, rew=802.00]                                                                                                


Epoch #812: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #813: 1025it [00:02, 447.16it/s, env_step=832512, len=16, n/ep=4, n/st=64, player_1/loss=870.509, player_2/loss=1075.751, rew=297.00]                                                                                                 


Epoch #813: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #814: 1025it [00:02, 436.85it/s, env_step=833536, len=21, n/ep=3, n/st=64, player_1/loss=1240.277, player_2/loss=1780.647, rew=462.67]                                                                                                


Epoch #814: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #815: 1025it [00:02, 434.64it/s, env_step=834560, len=28, n/ep=2, n/st=64, player_1/loss=1191.487, player_2/loss=1510.874, rew=859.00]                                                                                                


Epoch #815: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #816: 1025it [00:02, 418.97it/s, env_step=835584, len=31, n/ep=2, n/st=64, player_1/loss=1389.167, player_2/loss=1715.663, rew=991.00]                                                                                                


Epoch #816: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #817: 1025it [00:02, 404.01it/s, env_step=836608, len=21, n/ep=3, n/st=64, player_1/loss=1895.640, player_2/loss=2587.423, rew=462.00]                                                                                                


Epoch #817: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #818: 1025it [00:02, 437.16it/s, env_step=837632, len=24, n/ep=3, n/st=64, player_1/loss=1760.228, player_2/loss=2300.438, rew=610.67]                                                                                                


Epoch #818: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #819: 1025it [00:02, 441.77it/s, env_step=838656, len=23, n/ep=2, n/st=64, player_2/loss=1753.633, rew=554.00]                                                                                                                        


Epoch #819: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #820: 1025it [00:02, 449.72it/s, env_step=839680, len=35, n/ep=2, n/st=64, player_1/loss=2117.901, player_2/loss=1974.854, rew=1296.00]                                                                                               


Epoch #820: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #821: 1025it [00:02, 452.00it/s, env_step=840704, len=19, n/ep=3, n/st=64, player_1/loss=2402.539, player_2/loss=1827.014, rew=392.00]                                                                                                


Epoch #821: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #822: 1025it [00:02, 451.29it/s, env_step=841728, len=36, n/ep=2, n/st=64, player_1/loss=1652.149, player_2/loss=1673.388, rew=1339.00]                                                                                               


Epoch #822: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #823: 1025it [00:02, 449.07it/s, env_step=842752, len=34, n/ep=2, n/st=64, player_1/loss=1080.168, player_2/loss=1440.524, rew=1189.00]                                                                                               


Epoch #823: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #824: 1025it [00:02, 451.43it/s, env_step=843776, len=24, n/ep=2, n/st=64, player_1/loss=1042.732, player_2/loss=1786.949, rew=607.00]                                                                                                


Epoch #824: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #825: 1025it [00:02, 451.12it/s, env_step=844800, len=20, n/ep=2, n/st=64, player_1/loss=964.439, player_2/loss=1731.292, rew=451.00]                                                                                                 


Epoch #825: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #826: 1025it [00:02, 450.90it/s, env_step=845824, len=30, n/ep=2, n/st=64, player_1/loss=999.327, player_2/loss=1725.542, rew=937.00]                                                                                                 


Epoch #826: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #827: 1025it [00:02, 451.67it/s, env_step=846848, len=29, n/ep=2, n/st=64, player_1/loss=958.202, player_2/loss=2112.413, rew=893.00]                                                                                                 


Epoch #827: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #828: 1025it [00:02, 449.41it/s, env_step=847872, len=32, n/ep=2, n/st=64, player_1/loss=1147.308, player_2/loss=1705.144, rew=1087.00]                                                                                               


Epoch #828: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #829: 1025it [00:02, 451.83it/s, env_step=848896, len=21, n/ep=2, n/st=64, player_1/loss=1661.395, player_2/loss=1575.275, rew=524.00]                                                                                                


Epoch #829: test_reward: 270.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #830: 1025it [00:02, 453.82it/s, env_step=849920, len=33, n/ep=2, n/st=64, player_1/loss=942.434, player_2/loss=1410.430, rew=1120.00]                                                                                                


Epoch #830: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #831: 1025it [00:02, 453.58it/s, env_step=850944, len=30, n/ep=2, n/st=64, player_1/loss=628.166, player_2/loss=1341.676, rew=932.00]                                                                                                 


Epoch #831: test_reward: 1558.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #832: 1025it [00:02, 445.53it/s, env_step=851968, len=30, n/ep=2, n/st=64, player_1/loss=1370.950, player_2/loss=1829.947, rew=965.00]                                                                                                


Epoch #832: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #833: 1025it [00:02, 451.64it/s, env_step=852992, len=25, n/ep=3, n/st=64, player_2/loss=1773.418, rew=722.67]                                                                                                                        


Epoch #833: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #834: 1025it [00:02, 447.92it/s, env_step=854016, len=32, n/ep=2, n/st=64, player_1/loss=1059.326, player_2/loss=1612.188, rew=1055.00]                                                                                               


Epoch #834: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #835: 1025it [00:02, 448.13it/s, env_step=855040, len=20, n/ep=3, n/st=64, player_1/loss=930.750, player_2/loss=1156.614, rew=433.33]                                                                                                 


Epoch #835: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #836: 1025it [00:02, 448.67it/s, env_step=856064, len=23, n/ep=3, n/st=64, player_1/loss=1127.923, player_2/loss=1517.892, rew=558.00]                                                                                                


Epoch #836: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #837: 1025it [00:02, 448.64it/s, env_step=857088, len=32, n/ep=2, n/st=64, player_1/loss=1158.128, player_2/loss=1714.974, rew=1087.00]                                                                                               


Epoch #837: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #838: 1025it [00:02, 448.11it/s, env_step=858112, len=32, n/ep=2, n/st=64, player_1/loss=1527.581, player_2/loss=1943.877, rew=1058.00]                                                                                               


Epoch #838: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #839: 1025it [00:02, 449.11it/s, env_step=859136, len=31, n/ep=2, n/st=64, player_1/loss=1304.045, player_2/loss=1602.013, rew=1024.00]                                                                                               


Epoch #839: test_reward: 1638.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #840: 1025it [00:02, 449.63it/s, env_step=860160, len=29, n/ep=2, n/st=64, player_1/loss=1011.677, player_2/loss=1002.430, rew=940.00]                                                                                                


Epoch #840: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #841: 1025it [00:02, 446.40it/s, env_step=861184, len=21, n/ep=2, n/st=64, player_1/loss=1012.932, player_2/loss=1230.524, rew=482.00]                                                                                                


Epoch #841: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #842: 1025it [00:02, 448.77it/s, env_step=862208, len=30, n/ep=2, n/st=64, player_1/loss=798.973, player_2/loss=772.252, rew=929.00]                                                                                                  


Epoch #842: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #843: 1025it [00:02, 438.25it/s, env_step=863232, len=27, n/ep=2, n/st=64, player_1/loss=702.385, player_2/loss=1071.631, rew=835.00]                                                                                                 


Epoch #843: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #844: 1025it [00:02, 447.94it/s, env_step=864256, len=42, n/ep=1, n/st=64, player_1/loss=780.395, player_2/loss=1201.798, rew=1834.00]                                                                                                


Epoch #844: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #845: 1025it [00:02, 447.52it/s, env_step=865280, len=42, n/ep=1, n/st=64, player_1/loss=726.224, player_2/loss=608.010, rew=1834.00]                                                                                                 


Epoch #845: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #846: 1025it [00:02, 448.96it/s, env_step=866304, len=32, n/ep=2, n/st=64, player_1/loss=736.472, player_2/loss=340.827, rew=1090.00]                                                                                                 


Epoch #846: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #847: 1025it [00:02, 444.24it/s, env_step=867328, len=38, n/ep=2, n/st=64, player_1/loss=794.022, player_2/loss=636.643, rew=1519.00]                                                                                                 


Epoch #847: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #848: 1025it [00:02, 445.68it/s, env_step=868352, len=31, n/ep=2, n/st=64, player_1/loss=1144.267, player_2/loss=1261.379, rew=994.00]                                                                                                


Epoch #848: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #849: 1025it [00:02, 448.93it/s, env_step=869376, len=30, n/ep=2, n/st=64, player_1/loss=1237.586, player_2/loss=1865.672, rew=977.00]                                                                                                


Epoch #849: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #850: 1025it [00:02, 450.47it/s, env_step=870400, len=33, n/ep=2, n/st=64, player_2/loss=2093.155, rew=1174.00]                                                                                                                       


Epoch #850: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #851: 1025it [00:02, 448.69it/s, env_step=871424, len=26, n/ep=3, n/st=64, player_1/loss=1688.537, player_2/loss=1491.457, rew=716.67]                                                                                                


Epoch #851: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #852: 1025it [00:02, 449.24it/s, env_step=872448, len=31, n/ep=2, n/st=64, player_1/loss=1147.776, player_2/loss=651.814, rew=991.00]                                                                                                 


Epoch #852: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #853: 1025it [00:02, 448.97it/s, env_step=873472, len=25, n/ep=3, n/st=64, player_1/loss=1181.464, player_2/loss=771.375, rew=683.33]                                                                                                 


Epoch #853: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #854: 1025it [00:02, 449.18it/s, env_step=874496, len=31, n/ep=2, n/st=64, player_1/loss=1688.473, player_2/loss=2109.527, rew=991.00]                                                                                                


Epoch #854: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #855: 1025it [00:02, 447.54it/s, env_step=875520, len=27, n/ep=3, n/st=64, player_1/loss=2563.676, player_2/loss=2842.623, rew=882.00]                                                                                                


Epoch #855: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #856: 1025it [00:02, 448.95it/s, env_step=876544, len=33, n/ep=2, n/st=64, player_1/loss=1623.040, player_2/loss=1439.986, rew=1121.00]                                                                                               


Epoch #856: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #857: 1025it [00:02, 448.69it/s, env_step=877568, len=32, n/ep=2, n/st=64, player_1/loss=1549.174, player_2/loss=1730.005, rew=1103.00]                                                                                               


Epoch #857: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #858: 1025it [00:02, 445.80it/s, env_step=878592, len=16, n/ep=4, n/st=64, player_1/loss=2168.920, player_2/loss=1412.169, rew=384.00]                                                                                                


Epoch #858: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #859: 1025it [00:02, 451.32it/s, env_step=879616, len=22, n/ep=3, n/st=64, player_1/loss=1524.176, player_2/loss=828.208, rew=538.00]                                                                                                 


Epoch #859: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #860: 1025it [00:02, 450.25it/s, env_step=880640, len=33, n/ep=2, n/st=64, player_1/loss=1265.926, player_2/loss=1138.285, rew=1174.00]                                                                                               


Epoch #860: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #861: 1025it [00:02, 446.37it/s, env_step=881664, len=26, n/ep=3, n/st=64, player_1/loss=1168.615, player_2/loss=1005.483, rew=765.33]                                                                                                


Epoch #861: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #862: 1025it [00:02, 445.41it/s, env_step=882688, len=31, n/ep=2, n/st=64, player_1/loss=784.648, player_2/loss=860.184, rew=991.00]                                                                                                  


Epoch #862: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #863: 1025it [00:02, 448.81it/s, env_step=883712, len=18, n/ep=3, n/st=64, player_1/loss=1338.324, player_2/loss=836.886, rew=340.67]                                                                                                 


Epoch #863: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #864: 1025it [00:02, 451.78it/s, env_step=884736, len=28, n/ep=2, n/st=64, player_1/loss=1526.235, player_2/loss=802.103, rew=851.00]                                                                                                 


Epoch #864: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #865: 1025it [00:02, 450.85it/s, env_step=885760, len=27, n/ep=2, n/st=64, player_1/loss=1063.063, player_2/loss=1056.716, rew=763.00]                                                                                                


Epoch #865: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #866: 1025it [00:02, 446.63it/s, env_step=886784, len=24, n/ep=2, n/st=64, player_1/loss=970.262, player_2/loss=1107.197, rew=599.00]                                                                                                 


Epoch #866: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #867: 1025it [00:02, 452.02it/s, env_step=887808, len=34, n/ep=2, n/st=64, player_1/loss=1123.075, player_2/loss=1808.142, rew=1235.00]                                                                                               


Epoch #867: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #868: 1025it [00:02, 451.13it/s, env_step=888832, len=32, n/ep=2, n/st=64, player_1/loss=1067.056, player_2/loss=2023.715, rew=1055.00]                                                                                               


Epoch #868: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #869: 1025it [00:02, 451.22it/s, env_step=889856, len=35, n/ep=1, n/st=64, player_1/loss=571.232, player_2/loss=1320.926, rew=1258.00]                                                                                                


Epoch #869: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #870: 1025it [00:02, 446.51it/s, env_step=890880, len=23, n/ep=3, n/st=64, player_1/loss=1191.681, player_2/loss=1106.256, rew=664.67]                                                                                                


Epoch #870: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #871: 1025it [00:02, 450.00it/s, env_step=891904, len=27, n/ep=3, n/st=64, player_1/loss=1520.023, player_2/loss=1591.964, rew=776.00]                                                                                                


Epoch #871: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #872: 1025it [00:02, 448.97it/s, env_step=892928, len=24, n/ep=3, n/st=64, player_1/loss=1179.899, player_2/loss=1562.125, rew=632.67]                                                                                                


Epoch #872: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #873: 1025it [00:02, 448.64it/s, env_step=893952, len=30, n/ep=2, n/st=64, player_1/loss=1292.116, player_2/loss=1053.771, rew=932.00]                                                                                                


Epoch #873: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #874: 1025it [00:02, 449.40it/s, env_step=894976, len=33, n/ep=2, n/st=64, player_1/loss=1195.568, player_2/loss=1143.203, rew=1120.00]                                                                                               


Epoch #874: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #875: 1025it [00:02, 449.44it/s, env_step=896000, len=25, n/ep=2, n/st=64, player_1/loss=672.961, player_2/loss=903.986, rew=652.00]                                                                                                  


Epoch #875: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #876: 1025it [00:02, 449.18it/s, env_step=897024, len=28, n/ep=2, n/st=64, player_1/loss=380.610, player_2/loss=1341.122, rew=819.00]                                                                                                 


Epoch #876: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #877: 1025it [00:02, 447.87it/s, env_step=898048, len=28, n/ep=2, n/st=64, player_1/loss=1173.654, player_2/loss=1940.806, rew=869.00]                                                                                                


Epoch #877: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #878: 1025it [00:02, 450.06it/s, env_step=899072, len=24, n/ep=3, n/st=64, player_1/loss=1356.705, player_2/loss=1860.668, rew=639.33]                                                                                                


Epoch #878: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #879: 1025it [00:02, 446.81it/s, env_step=900096, len=31, n/ep=3, n/st=64, player_1/loss=1167.669, player_2/loss=1550.021, rew=1042.67]                                                                                               


Epoch #879: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #880: 1025it [00:02, 450.37it/s, env_step=901120, len=37, n/ep=2, n/st=64, player_1/loss=1430.843, player_2/loss=1150.084, rew=1477.00]                                                                                               


Epoch #880: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #881: 1025it [00:02, 449.68it/s, env_step=902144, len=36, n/ep=2, n/st=64, player_1/loss=1073.480, player_2/loss=633.429, rew=1397.00]                                                                                                


Epoch #881: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #882: 1025it [00:02, 452.35it/s, env_step=903168, len=29, n/ep=3, n/st=64, player_1/loss=1178.483, player_2/loss=1141.371, rew=956.00]                                                                                                


Epoch #882: test_reward: 1720.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #883: 1025it [00:02, 448.00it/s, env_step=904192, len=28, n/ep=2, n/st=64, player_1/loss=946.043, player_2/loss=1453.185, rew=851.00]                                                                                                 


Epoch #883: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #884: 1025it [00:02, 450.46it/s, env_step=905216, len=26, n/ep=2, n/st=64, player_1/loss=1457.594, player_2/loss=1411.251, rew=733.00]                                                                                                


Epoch #884: test_reward: 810.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #885: 1025it [00:02, 449.63it/s, env_step=906240, len=33, n/ep=3, n/st=64, player_1/loss=1752.360, player_2/loss=1618.598, rew=1178.67]                                                                                               


Epoch #885: test_reward: 1330.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #886: 1025it [00:02, 448.59it/s, env_step=907264, len=35, n/ep=2, n/st=64, player_1/loss=992.995, player_2/loss=1871.801, rew=1300.00]                                                                                                


Epoch #886: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #887: 1025it [00:02, 449.27it/s, env_step=908288, len=14, n/ep=5, n/st=64, player_1/loss=1198.007, player_2/loss=2077.103, rew=223.60]                                                                                                


Epoch #887: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #888: 1025it [00:02, 446.90it/s, env_step=909312, len=24, n/ep=3, n/st=64, player_1/loss=1440.249, player_2/loss=2576.117, rew=682.00]                                                                                                


Epoch #888: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #889: 1025it [00:02, 449.02it/s, env_step=910336, len=15, n/ep=4, n/st=64, player_2/loss=2280.870, rew=241.50]                                                                                                                        


Epoch #889: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #890: 1025it [00:02, 452.50it/s, env_step=911360, len=16, n/ep=3, n/st=64, player_1/loss=1324.065, player_2/loss=1189.399, rew=272.00]                                                                                                


Epoch #890: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #891: 1025it [00:02, 448.63it/s, env_step=912384, len=24, n/ep=2, n/st=64, player_1/loss=1464.894, player_2/loss=1216.037, rew=679.00]                                                                                                


Epoch #891: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #892: 1025it [00:02, 447.19it/s, env_step=913408, len=37, n/ep=1, n/st=64, player_1/loss=1183.213, player_2/loss=1298.064, rew=1404.00]                                                                                               


Epoch #892: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #893: 1025it [00:02, 449.68it/s, env_step=914432, len=23, n/ep=3, n/st=64, player_1/loss=1310.698, player_2/loss=1662.374, rew=624.67]                                                                                                


Epoch #893: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #894: 1025it [00:02, 451.41it/s, env_step=915456, len=22, n/ep=3, n/st=64, player_1/loss=1379.327, player_2/loss=1494.822, rew=520.67]                                                                                                


Epoch #894: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #895: 1025it [00:02, 449.42it/s, env_step=916480, len=24, n/ep=2, n/st=64, player_1/loss=1312.462, player_2/loss=1560.145, rew=623.00]                                                                                                


Epoch #895: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #896: 1025it [00:02, 445.32it/s, env_step=917504, len=30, n/ep=2, n/st=64, player_1/loss=1548.524, player_2/loss=1230.868, rew=971.00]                                                                                                


Epoch #896: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #897: 1025it [00:02, 450.26it/s, env_step=918528, len=25, n/ep=3, n/st=64, player_1/loss=1225.490, player_2/loss=828.738, rew=701.33]                                                                                                 


Epoch #897: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #898: 1025it [00:02, 446.60it/s, env_step=919552, len=31, n/ep=2, n/st=64, player_1/loss=876.219, player_2/loss=608.200, rew=1042.00]                                                                                                 


Epoch #898: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #899: 1025it [00:02, 448.07it/s, env_step=920576, len=19, n/ep=3, n/st=64, player_1/loss=890.365, player_2/loss=741.726, rew=446.00]                                                                                                  


Epoch #899: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #900: 1025it [00:02, 453.26it/s, env_step=921600, len=29, n/ep=2, n/st=64, player_1/loss=883.471, player_2/loss=1015.705, rew=884.00]                                                                                                 


Epoch #900: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #901: 1025it [00:02, 450.61it/s, env_step=922624, len=23, n/ep=3, n/st=64, player_1/loss=978.715, player_2/loss=870.640, rew=664.00]                                                                                                  


Epoch #901: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #902: 1025it [00:02, 448.59it/s, env_step=923648, len=28, n/ep=2, n/st=64, player_1/loss=1067.098, player_2/loss=724.588, rew=814.00]                                                                                                 


Epoch #902: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #903: 1025it [00:02, 452.33it/s, env_step=924672, len=28, n/ep=2, n/st=64, player_1/loss=2200.393, player_2/loss=809.271, rew=859.00]                                                                                                 


Epoch #903: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #904: 1025it [00:02, 449.61it/s, env_step=925696, len=21, n/ep=3, n/st=64, player_1/loss=1924.289, player_2/loss=870.816, rew=474.67]                                                                                                 


Epoch #904: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #905: 1025it [00:02, 450.39it/s, env_step=926720, len=27, n/ep=2, n/st=64, player_1/loss=1183.802, player_2/loss=1008.487, rew=788.00]                                                                                                


Epoch #905: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #906: 1025it [00:02, 448.83it/s, env_step=927744, len=24, n/ep=3, n/st=64, player_1/loss=1814.652, player_2/loss=1144.191, rew=625.33]                                                                                                


Epoch #906: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #907: 1025it [00:02, 451.51it/s, env_step=928768, len=21, n/ep=3, n/st=64, player_1/loss=1810.221, player_2/loss=905.758, rew=490.67]                                                                                                 


Epoch #907: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #908: 1025it [00:02, 446.98it/s, env_step=929792, len=33, n/ep=2, n/st=64, player_1/loss=1250.732, player_2/loss=1391.802, rew=1154.00]                                                                                               


Epoch #908: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #909: 1025it [00:02, 446.15it/s, env_step=930816, len=29, n/ep=2, n/st=64, player_1/loss=1254.064, player_2/loss=1240.437, rew=910.00]                                                                                                


Epoch #909: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #910: 1025it [00:02, 451.04it/s, env_step=931840, len=31, n/ep=2, n/st=64, player_1/loss=880.992, player_2/loss=538.363, rew=1022.00]                                                                                                 


Epoch #910: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #911: 1025it [00:02, 449.25it/s, env_step=932864, len=25, n/ep=2, n/st=64, player_1/loss=648.152, player_2/loss=330.335, rew=694.00]                                                                                                  


Epoch #911: test_reward: 928.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #912: 1025it [00:02, 444.66it/s, env_step=933888, len=16, n/ep=4, n/st=64, player_1/loss=958.213, player_2/loss=249.383, rew=302.00]                                                                                                  


Epoch #912: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #913: 1025it [00:02, 449.85it/s, env_step=934912, len=31, n/ep=2, n/st=64, player_1/loss=904.757, player_2/loss=840.482, rew=1015.00]                                                                                                 


Epoch #913: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #914: 1025it [00:02, 449.60it/s, env_step=935936, len=25, n/ep=2, n/st=64, player_1/loss=675.680, player_2/loss=858.891, rew=664.00]                                                                                                  


Epoch #914: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #915: 1025it [00:02, 450.67it/s, env_step=936960, len=37, n/ep=2, n/st=64, player_1/loss=951.589, player_2/loss=1192.329, rew=1404.00]                                                                                                


Epoch #915: test_reward: 1480.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #916: 1025it [00:02, 447.22it/s, env_step=937984, len=29, n/ep=3, n/st=64, player_1/loss=1012.990, player_2/loss=1077.167, rew=984.00]                                                                                                


Epoch #916: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #917: 1025it [00:02, 451.16it/s, env_step=939008, len=21, n/ep=3, n/st=64, player_1/loss=914.435, player_2/loss=1082.797, rew=460.00]                                                                                                 


Epoch #917: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #918: 1025it [00:02, 449.68it/s, env_step=940032, len=28, n/ep=2, n/st=64, player_1/loss=1075.941, player_2/loss=1098.275, rew=810.00]                                                                                                


Epoch #918: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #919: 1025it [00:02, 448.85it/s, env_step=941056, len=33, n/ep=2, n/st=64, player_1/loss=987.120, player_2/loss=1300.857, rew=1120.00]                                                                                                


Epoch #919: test_reward: 1054.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #920: 1025it [00:02, 448.77it/s, env_step=942080, len=19, n/ep=4, n/st=64, player_1/loss=1004.541, player_2/loss=1815.161, rew=479.00]                                                                                                


Epoch #920: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #921: 1025it [00:02, 449.80it/s, env_step=943104, len=28, n/ep=3, n/st=64, player_1/loss=1293.501, player_2/loss=1920.659, rew=832.00]                                                                                                


Epoch #921: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #922: 1025it [00:02, 443.97it/s, env_step=944128, len=20, n/ep=3, n/st=64, player_1/loss=1132.303, player_2/loss=2397.656, rew=420.00]                                                                                                


Epoch #922: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #923: 1025it [00:02, 447.03it/s, env_step=945152, len=23, n/ep=3, n/st=64, player_1/loss=808.527, player_2/loss=1558.467, rew=576.00]                                                                                                 


Epoch #923: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #924: 1025it [00:02, 447.33it/s, env_step=946176, len=20, n/ep=3, n/st=64, player_1/loss=816.869, player_2/loss=1112.625, rew=418.00]                                                                                                 


Epoch #924: test_reward: 990.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #925: 1025it [00:02, 450.70it/s, env_step=947200, len=29, n/ep=2, n/st=64, player_1/loss=577.640, player_2/loss=838.806, rew=910.00]                                                                                                  


Epoch #925: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #926: 1025it [00:02, 446.89it/s, env_step=948224, len=13, n/ep=4, n/st=64, player_1/loss=661.200, player_2/loss=1042.053, rew=231.50]                                                                                                 


Epoch #926: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #927: 1025it [00:02, 448.09it/s, env_step=949248, len=14, n/ep=5, n/st=64, player_1/loss=816.776, player_2/loss=996.944, rew=215.20]                                                                                                  


Epoch #927: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #928: 1025it [00:02, 446.97it/s, env_step=950272, len=22, n/ep=3, n/st=64, player_1/loss=986.399, player_2/loss=829.583, rew=536.00]                                                                                                  


Epoch #928: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #929: 1025it [00:02, 446.81it/s, env_step=951296, len=22, n/ep=4, n/st=64, player_1/loss=1015.898, player_2/loss=1273.249, rew=535.50]                                                                                                


Epoch #929: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #930: 1025it [00:02, 448.08it/s, env_step=952320, len=20, n/ep=3, n/st=64, player_1/loss=865.237, player_2/loss=1370.848, rew=446.67]                                                                                                 


Epoch #930: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #931: 1025it [00:02, 448.79it/s, env_step=953344, len=23, n/ep=2, n/st=64, player_1/loss=972.060, player_2/loss=1272.229, rew=574.00]                                                                                                 


Epoch #931: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #932: 1025it [00:02, 448.12it/s, env_step=954368, len=20, n/ep=3, n/st=64, player_1/loss=881.476, player_2/loss=917.636, rew=448.67]                                                                                                  


Epoch #932: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #933: 1025it [00:02, 450.28it/s, env_step=955392, len=17, n/ep=3, n/st=64, player_1/loss=542.323, player_2/loss=769.520, rew=534.67]                                                                                                  


Epoch #933: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #934: 1025it [00:02, 452.43it/s, env_step=956416, len=19, n/ep=4, n/st=64, player_1/loss=726.184, player_2/loss=898.524, rew=451.00]                                                                                                  


Epoch #934: test_reward: 1188.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #935: 1025it [00:02, 445.12it/s, env_step=957440, len=22, n/ep=3, n/st=64, player_1/loss=815.297, player_2/loss=1147.647, rew=535.33]                                                                                                 


Epoch #935: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #936: 1025it [00:02, 449.34it/s, env_step=958464, len=20, n/ep=3, n/st=64, player_1/loss=865.760, player_2/loss=884.418, rew=432.00]                                                                                                  


Epoch #936: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #937: 1025it [00:02, 448.52it/s, env_step=959488, len=16, n/ep=4, n/st=64, player_1/loss=873.121, player_2/loss=893.727, rew=319.50]                                                                                                  


Epoch #937: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #938: 1025it [00:02, 448.63it/s, env_step=960512, len=27, n/ep=2, n/st=64, player_1/loss=727.346, player_2/loss=1042.073, rew=763.00]                                                                                                 


Epoch #938: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #939: 1025it [00:02, 446.80it/s, env_step=961536, len=22, n/ep=3, n/st=64, player_1/loss=602.645, player_2/loss=713.883, rew=510.00]                                                                                                  


Epoch #939: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #940: 1025it [00:02, 450.88it/s, env_step=962560, len=21, n/ep=3, n/st=64, player_1/loss=463.417, player_2/loss=648.525, rew=596.67]                                                                                                  


Epoch #940: test_reward: 868.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #941: 1025it [00:02, 450.59it/s, env_step=963584, len=24, n/ep=3, n/st=64, player_1/loss=704.517, player_2/loss=799.862, rew=604.00]                                                                                                  


Epoch #941: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #942: 1025it [00:02, 448.72it/s, env_step=964608, len=9, n/ep=7, n/st=64, player_1/loss=825.535, player_2/loss=893.290, rew=95.14]                                                                                                    


Epoch #942: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #943: 1025it [00:02, 445.32it/s, env_step=965632, len=16, n/ep=4, n/st=64, player_1/loss=1140.332, player_2/loss=968.133, rew=275.00]                                                                                                 


Epoch #943: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #944: 1025it [00:02, 428.69it/s, env_step=966656, len=17, n/ep=4, n/st=64, player_1/loss=899.011, player_2/loss=1032.390, rew=308.50]                                                                                                 


Epoch #944: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #945: 1025it [00:02, 450.10it/s, env_step=967680, len=22, n/ep=3, n/st=64, player_1/loss=640.432, player_2/loss=735.438, rew=510.00]                                                                                                  


Epoch #945: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #946: 1025it [00:02, 448.02it/s, env_step=968704, len=27, n/ep=2, n/st=64, player_1/loss=448.752, player_2/loss=556.143, rew=794.00]                                                                                                  


Epoch #946: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #947: 1025it [00:02, 449.92it/s, env_step=969728, len=22, n/ep=3, n/st=64, player_1/loss=355.139, player_2/loss=697.872, rew=504.67]                                                                                                  


Epoch #947: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #948: 1025it [00:02, 443.73it/s, env_step=970752, len=21, n/ep=3, n/st=64, player_1/loss=367.732, player_2/loss=853.494, rew=490.67]                                                                                                  


Epoch #948: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #949: 1025it [00:02, 451.12it/s, env_step=971776, len=20, n/ep=3, n/st=64, player_1/loss=586.163, player_2/loss=991.296, rew=446.67]                                                                                                  


Epoch #949: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #950: 1025it [00:02, 445.74it/s, env_step=972800, len=21, n/ep=3, n/st=64, player_1/loss=719.581, player_2/loss=989.651, rew=490.67]                                                                                                  


Epoch #950: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #951: 1025it [00:02, 445.32it/s, env_step=973824, len=20, n/ep=3, n/st=64, player_1/loss=802.692, player_2/loss=1046.312, rew=448.67]                                                                                                 


Epoch #951: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #952: 1025it [00:02, 447.40it/s, env_step=974848, len=20, n/ep=3, n/st=64, player_1/loss=715.609, player_2/loss=1342.843, rew=446.00]                                                                                                 


Epoch #952: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #953: 1025it [00:02, 450.93it/s, env_step=975872, len=20, n/ep=3, n/st=64, player_1/loss=710.605, player_2/loss=1061.448, rew=446.00]                                                                                                 


Epoch #953: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #954: 1025it [00:02, 449.78it/s, env_step=976896, len=22, n/ep=3, n/st=64, player_1/loss=689.469, player_2/loss=688.219, rew=506.00]                                                                                                  


Epoch #954: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #955: 1025it [00:02, 449.05it/s, env_step=977920, len=23, n/ep=3, n/st=64, player_1/loss=616.555, player_2/loss=736.377, rew=606.67]                                                                                                  


Epoch #955: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #956: 1025it [00:02, 448.89it/s, env_step=978944, len=17, n/ep=3, n/st=64, player_1/loss=662.334, player_2/loss=720.893, rew=378.67]                                                                                                  


Epoch #956: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #957: 1025it [00:02, 447.37it/s, env_step=979968, len=19, n/ep=4, n/st=64, player_1/loss=589.120, player_2/loss=833.136, rew=442.00]                                                                                                  


Epoch #957: test_reward: 340.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #958: 1025it [00:02, 448.55it/s, env_step=980992, len=21, n/ep=3, n/st=64, player_1/loss=595.968, player_2/loss=830.958, rew=462.00]                                                                                                  


Epoch #958: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #959: 1025it [00:02, 449.04it/s, env_step=982016, len=24, n/ep=2, n/st=64, player_1/loss=468.007, player_2/loss=686.336, rew=629.00]                                                                                                  


Epoch #959: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #960: 1025it [00:02, 448.90it/s, env_step=983040, len=20, n/ep=3, n/st=64, player_1/loss=433.522, player_2/loss=762.314, rew=418.00]                                                                                                  


Epoch #960: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #961: 1025it [00:02, 448.17it/s, env_step=984064, len=20, n/ep=3, n/st=64, player_1/loss=347.602, player_2/loss=515.315, rew=418.67]                                                                                                  


Epoch #961: test_reward: 550.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #962: 1025it [00:02, 450.20it/s, env_step=985088, len=18, n/ep=3, n/st=64, player_1/loss=250.959, player_2/loss=753.276, rew=371.33]                                                                                                  


Epoch #962: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #963: 1025it [00:02, 449.72it/s, env_step=986112, len=8, n/ep=7, n/st=64, player_1/loss=390.456, player_2/loss=865.247, rew=86.29]                                                                                                    


Epoch #963: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #964: 1025it [00:02, 449.56it/s, env_step=987136, len=7, n/ep=9, n/st=64, player_1/loss=423.603, player_2/loss=865.164, rew=67.11]                                                                                                    


Epoch #964: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #965: 1025it [00:02, 448.35it/s, env_step=988160, len=14, n/ep=4, n/st=64, player_1/loss=428.311, player_2/loss=1078.955, rew=271.00]                                                                                                 


Epoch #965: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #966: 1025it [00:02, 450.09it/s, env_step=989184, len=21, n/ep=3, n/st=64, player_1/loss=611.703, player_2/loss=995.848, rew=500.00]                                                                                                  


Epoch #966: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #967: 1025it [00:02, 448.65it/s, env_step=990208, len=19, n/ep=3, n/st=64, player_1/loss=627.354, player_2/loss=574.114, rew=394.00]                                                                                                  


Epoch #967: test_reward: 1258.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #968: 1025it [00:02, 449.46it/s, env_step=991232, len=19, n/ep=4, n/st=64, player_1/loss=490.888, player_2/loss=610.363, rew=396.50]                                                                                                  


Epoch #968: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #969: 1025it [00:02, 447.63it/s, env_step=992256, len=21, n/ep=3, n/st=64, player_1/loss=640.383, player_2/loss=620.246, rew=494.67]                                                                                                  


Epoch #969: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #970: 1025it [00:02, 449.22it/s, env_step=993280, len=16, n/ep=4, n/st=64, player_1/loss=665.238, player_2/loss=945.972, rew=278.50]                                                                                                  


Epoch #970: test_reward: 238.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #971: 1025it [00:02, 440.50it/s, env_step=994304, len=23, n/ep=3, n/st=64, player_1/loss=484.299, player_2/loss=792.034, rew=568.67]                                                                                                  


Epoch #971: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #972: 1025it [00:02, 436.33it/s, env_step=995328, len=22, n/ep=3, n/st=64, player_1/loss=397.619, player_2/loss=516.606, rew=540.00]                                                                                                  


Epoch #972: test_reward: 648.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #973: 1025it [00:02, 447.47it/s, env_step=996352, len=21, n/ep=3, n/st=64, player_1/loss=690.572, player_2/loss=712.384, rew=460.67]                                                                                                  


Epoch #973: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #974: 1025it [00:02, 449.10it/s, env_step=997376, len=20, n/ep=3, n/st=64, player_1/loss=873.408, player_2/loss=1003.941, rew=452.00]                                                                                                 


Epoch #974: test_reward: 598.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #975: 1025it [00:02, 444.46it/s, env_step=998400, len=21, n/ep=3, n/st=64, player_2/loss=1181.924, rew=546.67]                                                                                                                        


Epoch #975: test_reward: 88.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #976: 1025it [00:02, 447.25it/s, env_step=999424, len=15, n/ep=4, n/st=64, player_1/loss=977.177, player_2/loss=1522.370, rew=257.00]                                                                                                 


Epoch #976: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #977: 1025it [00:02, 449.89it/s, env_step=1000448, len=15, n/ep=4, n/st=64, player_1/loss=824.258, player_2/loss=1250.337, rew=254.00]                                                                                                


Epoch #977: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #978: 1025it [00:02, 446.61it/s, env_step=1001472, len=16, n/ep=4, n/st=64, player_1/loss=579.319, player_2/loss=1244.423, rew=310.00]                                                                                                


Epoch #978: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #979: 1025it [00:02, 448.79it/s, env_step=1002496, len=15, n/ep=4, n/st=64, player_1/loss=665.117, player_2/loss=945.619, rew=246.50]                                                                                                 


Epoch #979: test_reward: 154.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #980: 1025it [00:02, 452.14it/s, env_step=1003520, len=23, n/ep=3, n/st=64, player_1/loss=874.447, player_2/loss=782.003, rew=552.00]                                                                                                 


Epoch #980: test_reward: 504.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #981: 1025it [00:02, 450.19it/s, env_step=1004544, len=20, n/ep=3, n/st=64, player_1/loss=703.660, player_2/loss=727.046, rew=438.00]                                                                                                 


Epoch #981: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #982: 1025it [00:02, 448.73it/s, env_step=1005568, len=23, n/ep=2, n/st=64, player_1/loss=514.467, player_2/loss=577.198, rew=566.00]                                                                                                 


Epoch #982: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #983: 1025it [00:02, 449.00it/s, env_step=1006592, len=19, n/ep=3, n/st=64, player_1/loss=569.409, player_2/loss=828.312, rew=447.33]                                                                                                 


Epoch #983: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #984: 1025it [00:02, 447.97it/s, env_step=1007616, len=19, n/ep=3, n/st=64, player_1/loss=628.339, player_2/loss=1259.003, rew=399.33]                                                                                                


Epoch #984: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #985: 1025it [00:02, 450.98it/s, env_step=1008640, len=26, n/ep=3, n/st=64, player_1/loss=773.662, player_2/loss=1284.003, rew=723.33]                                                                                                


Epoch #985: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #986: 1025it [00:02, 450.15it/s, env_step=1009664, len=28, n/ep=3, n/st=64, player_1/loss=713.102, player_2/loss=931.252, rew=838.67]                                                                                                 


Epoch #986: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #987: 1025it [00:02, 447.08it/s, env_step=1010688, len=16, n/ep=4, n/st=64, player_1/loss=359.205, player_2/loss=940.971, rew=278.50]                                                                                                 


Epoch #987: test_reward: 304.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #988: 1025it [00:02, 450.34it/s, env_step=1011712, len=20, n/ep=3, n/st=64, player_1/loss=522.540, player_2/loss=1057.511, rew=420.00]                                                                                                


Epoch #988: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #989: 1025it [00:02, 448.97it/s, env_step=1012736, len=39, n/ep=1, n/st=64, player_1/loss=527.023, player_2/loss=964.284, rew=1558.00]                                                                                                


Epoch #989: test_reward: 1834.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #990: 1025it [00:02, 449.46it/s, env_step=1013760, len=17, n/ep=3, n/st=64, player_1/loss=888.274, player_2/loss=839.144, rew=350.00]                                                                                                 


Epoch #990: test_reward: 378.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #991: 1025it [00:02, 446.28it/s, env_step=1014784, len=25, n/ep=2, n/st=64, player_1/loss=1442.435, player_2/loss=1011.435, rew=674.00]                                                                                               


Epoch #991: test_reward: 700.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #992: 1025it [00:02, 451.01it/s, env_step=1015808, len=32, n/ep=2, n/st=64, player_1/loss=929.270, player_2/loss=1098.822, rew=1058.00]                                                                                               


Epoch #992: test_reward: 1120.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #993: 1025it [00:02, 450.85it/s, env_step=1016832, len=34, n/ep=2, n/st=64, player_1/loss=899.295, player_2/loss=1581.813, rew=1225.00]                                                                                               


Epoch #993: test_reward: 754.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #994: 1025it [00:02, 448.35it/s, env_step=1017856, len=20, n/ep=3, n/st=64, player_1/loss=1104.846, player_2/loss=1536.671, rew=446.67]                                                                                               


Epoch #994: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #995: 1025it [00:02, 445.06it/s, env_step=1018880, len=21, n/ep=3, n/st=64, player_1/loss=1037.217, player_2/loss=1464.853, rew=460.67]                                                                                               


Epoch #995: test_reward: 460.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #996: 1025it [00:02, 449.58it/s, env_step=1019904, len=12, n/ep=5, n/st=64, player_1/loss=902.864, player_2/loss=1192.768, rew=214.80]                                                                                                


Epoch #996: test_reward: 54.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #997: 1025it [00:02, 448.22it/s, env_step=1020928, len=16, n/ep=3, n/st=64, player_1/loss=1430.841, player_2/loss=1116.632, rew=272.00]                                                                                               


Epoch #997: test_reward: 208.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #998: 1025it [00:02, 446.80it/s, env_step=1021952, len=14, n/ep=5, n/st=64, player_1/loss=1063.792, player_2/loss=748.506, rew=220.00]                                                                                                


Epoch #998: test_reward: 180.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88


Epoch #999: 1025it [00:02, 449.98it/s, env_step=1022976, len=21, n/ep=3, n/st=64, player_1/loss=823.298, player_2/loss=981.766, rew=462.67]                                                                                                 

Epoch #999: test_reward: 418.000000 ± 0.000000, best_reward: 1834.000000 ± 0.000000 in #88





In [13]:
####################################################
# EXPERIMENT: VIEWING THE BEST LEARNED POLICY
####################################################

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure the best agent
best_agent1 = cf_cnn_dqn_policy(state_shape= state_shape,
                                action_shape= action_shape)
best_agent1.load_state_dict(torch.load("./saved_variables/paper_notebooks/7/1-cnn_dqn_frozen_agent1/best_policy_agent1.pth"))
best_agent1.set_eps(0)


best_agent2 = cf_cnn_dqn_policy(state_shape= state_shape,
                                action_shape= action_shape)
best_agent2.load_state_dict(torch.load("./saved_variables/paper_notebooks/7/1-cnn_dqn_frozen_agent1/best_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= best_agent1,
      agent_player2= best_agent2)



Average steps of game:  32.333333333333336
Final mean reward agent 1: 564.6666666666666, std: 63.168205785998246
Final mean reward agent 2: 514.6666666666666, std: 63.168205785998246


In [14]:
####################################################
# EXPERIMENT: VIEWING THE LAST LEARNED POLICY
####################################################

# Configure the final agent
final_agent_player1 = cf_cnn_dqn_policy(state_shape= state_shape,
                                        action_shape= action_shape)
final_agent_player1.load_state_dict(torch.load("./saved_variables/paper_notebooks/7/1-cnn_dqn_frozen_agent1/final_policy_agent1.pth"))
best_agent1.set_eps(0)

final_agent_player2 = cf_cnn_dqn_policy(state_shape= state_shape,
                                        action_shape= action_shape)
final_agent_player2.load_state_dict(torch.load("./saved_variables/paper_notebooks/7/1-cnn_dqn_frozen_agent1/final_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= final_agent_player1,
      agent_player2= final_agent_player2)



Average steps of game:  20.0
Final mean reward agent 1: 218.33333333333334, std: 6.128258770283411
Final mean reward agent 2: 201.66666666666666, std: 53.26871084938658


<hr><hr>

## Discussion

The performance of this model based on a CNN is similar to the previous model used.
We will address other difficult points in the next notebooks to build an appropriate bot. 

In [None]:
####################################################
# CLEAN VARIABLES
####################################################

del action_shape
del agent1
del agent2
del best_agent1
del best_agent2
del env
del final_agent_player1
del final_agent_player2
del observation_space
del off_policy_traininer_results
del state_shape
