# MLP based DQN agent against fixed oponent

In the previous notebook, `7-cnn-dqn-fixed-oponent.ipynb`, we used the CNN based model for training through an iteration of alternating frozen agents.
We found this to give interesting but not fully statisfactory results.
We will now use the same technique for the custom MLP based approach designed in `5-improving-dqn-architecture.ipynb` to properly compare both architectures performance for the agents.

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training two DQN agents on connect four Gym
  - Building the environment
  - Implementing the DQN policy
  - Building agents
  - Function for letting agents learn
  - Function for watching learned agent
  - Doing the experiment
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Argparser
import argparse

# More data types
import typing
import numpy as np

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
importlib.invalidate_caches()
importlib.reload(cfgym)

# Time for allowing "freezes" in execution
import time;

# Allow for copying objects in a non reference manner
import copy

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0
Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [4]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device
print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

# Show cuda device name
print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training two DQN agents on connect four Gym

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, agents will always play as the same player, e.g. always as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Building the environment

This code is taken from previous notebooks.
We don't allow invalid moves to make the problem easier for now.

In [5]:
####################################################
# CONNECT FOUR V2 ENVIRONMENT
####################################################

def get_env():
    """
    Returns the connect four gym environment V2 altered for Tianshou and Petting Zoo compatibility.
    Already wrapped with a ts.env.PettingZooEnv wrapper.
    """
    return ts.env.PettingZooEnv(cfgym.env(reward_move= 1, # Set to 1 for reward to make moves (incentivise longer games)
                                          reward_invalid= -3,
                                          reward_draw= 100,
                                          reward_win= 25,
                                          reward_loss= -25,
                                          allow_invalid_move= False))
    
    
# Test the environment
env = get_env()
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation
del env

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


<hr>

### Implementing the DQN policy

We use the strategy created in `5-improving-dqn-architecture.ipynb`.

In [6]:
####################################################
# DQN ARCHITECTURE
####################################################

class CustomDQN(torch.nn.Module):
    """
    Custom DQN using a model based on CNN
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',):
        # Parent call
        super().__init__()
        
        # Save device (e.g. cuda)
        self.device = device
        
        self.model = torch.nn.Sequential(
            torch.nn.Linear(np.prod(state_shape), 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, np.prod(action_shape)),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
        batch = obs.shape[0]
        logits = self.model(obs.view(batch, -1))
        return logits, state


In [7]:
####################################################
# DQN POLICY
####################################################

def cf_custom_dqn_policy(state_shape: tuple,
                         action_shape: tuple,
                         optim: typing.Optional[torch.optim.Optimizer] = None,
                         learning_rate: float =  0.0001,
                         gamma: float = 0.9, # Smaller gamma favours "faster" win
                         n_step: int = 4, # Number of steps to look ahead
                         frozen: bool = False,
                         target_update_freq: int = 320):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    net = CustomDQN(state_shape, action_shape, device= device).to(device)
    
    # Default optimizer is an adam optimizer with the argparser learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # If we are frozen, we use an optimizer that has learning rate 0
    if frozen:
        optim = torch.optim.SGD(net.parameters(), lr= 0)
        
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)

<hr>

### Building agents

This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [8]:
####################################################
# AGENT CREATION
####################################################

def get_agents(agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
               agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
               optim: typing.Optional[torch.optim.Optimizer] = None,
               resume_path_player_1: str = '', # Path to file to resume agent training from
               resume_path_player_2: str = '', 
               agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
               agent_player2_frozen: bool = False,
               ) -> typing.Tuple[ts.policy.BasePolicy, torch.optim.Optimizer, list]:
    """
    Gets a multi agent policy manager, optimizer and player ids for the connect four V2 gym environment.
    Per default this returns 
        - Multi agent manager for 2 agents using DQN
        - Adam optimizer
        - ['player_1', 'player_2'] from the connect four environment
    """
    
    # Get the environment to play in (Connect four gym V2)
    env = get_env()
    
    # Get the observation space from the environment, depending on typo of space (ternary operator)
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    
    # Set the arguments
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent player 1 to be a DQN if no policy is passed.
    if agent_player1 is None:
        # Our agent1 uses a DQN policy
        agent_player1 = cf_custom_dqn_policy(state_shape= state_shape,
                                             action_shape= action_shape,
                                             optim= optim,
                                             frozen= agent_player1_frozen)
                
        # If we resume our agent we need to load the previous config
        if resume_path_player_1:
            agent_player1.load_state_dict(torch.load(resume_path_player_1))
            
    
    # Configure agent player 2 to be a DQN if no policy is passed.
    if agent_player2 is None:
        # Our agent1 uses a DQN policy
        agent_player2 = cf_custom_dqn_policy(state_shape= state_shape,
                                             action_shape= action_shape,
                                             optim= optim,
                                             frozen= agent_player2_frozen)
        
                
        # If we resume our agent we need to load the previous config
        if resume_path_player_2:
            agent_player2.load_state_dict(torch.load(resume_path_player_2))

    # Both our agents are DQN agents by default
    agents = [agent_player1, agent_player2]
        
    # Our policy depends on the order of the agents
    policy = ts.policy.MultiAgentPolicyManager(agents, env)
    
    # Return our policy, optimizer and the available agents in the environment
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    
    return policy, optim, env.agents

<hr>

### Function for letting agents learn

This is identical to the previous notebook.

In [9]:
####################################################
# AGENT TRAINING
####################################################

def train_agent(filename: str = "dqn_vs_dqn_cnn_based",
                agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
                agent_player2_frozen: bool = False,
                single_agent_score_as_reward: bool= False, # Uses non frozen agent's score as reward
                optim: typing.Optional[torch.optim.Optimizer] = None,
                training_env_num: int = 1,
                testing_env_num: int = 1,
                buffer_size: int = 2^14,
                batch_size: int = 1, 
                epochs: int = 50, #50
                step_per_epoch: int = 1024, #1024
                step_per_collect: int = 64, # transition before update
                update_per_step: float = 0.1,
                testing_eps: float = 0.05,
                training_eps: float = 0.1,
                ) -> typing.Tuple[dict, ts.policy.BasePolicy]:
    """
    Trains two agents in the connect four V2 environment and saves their best model and logs.
    Returns:
        - result from offpolicy_trainer
        - final version of agent 1
        - final version of agent 2
    """

    # ======== notebook specific =========
    notebook_version = '8' # Used for foldering logs and models

    # ======== environment setup =========
    train_envs = ts.env.DummyVectorEnv([get_env for _ in range(training_env_num)])
    test_envs = ts.env.DummyVectorEnv([get_env for _ in range(testing_env_num)])
    
    # set the seed for reproducibility
    np.random.seed(1998)
    torch.manual_seed(1998)
    train_envs.seed(1998)
    test_envs.seed(1998)

    # ======== agent setup =========
    # Gets our agents from the previously made function
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    policy, optim, agents = get_agents(agent_player1=agent_player1,
                                       agent_player2=agent_player2,
                                       agent_player1_frozen= agent_player1_frozen,
                                       agent_player2_frozen= agent_player2_frozen,
                                       optim=optim)

    # ======== collector setup =========
    # Make a collector for the training environments
    train_collector = ts.data.Collector(policy= policy,
                                        env= train_envs,
                                        buffer= ts.data.VectorReplayBuffer(buffer_size, len(train_envs)),
                                        exploration_noise= True)
    
    # Make a collector for the testing environments
    test_collector = ts.data.Collector(policy= policy,
                                       env= test_envs,
                                       buffer= ts.data.VectorReplayBuffer(buffer_size, len(test_envs)),
                                       exploration_noise= True)
    
    # Uncomment below if you want to set epsilon in epsilon policy
    # policy.set_eps(1)
    
    # Collect data fot the training evnironments
    train_collector.collect(n_step= batch_size * training_env_num)
    
    # ======== ensure folders exist =========
    if not os.path.exists(os.path.join('./logs', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./logs', 'paper_notebooks', notebook_version, filename))
    if not os.path.exists(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename))

    # ======== tensorboard logging setup =========
    # Allows to save the training progress to tensorboard compatable logs
    log_path = os.path.join('./logs', 'paper_notebooks', notebook_version, filename)
    writer = torch.utils.tensorboard.SummaryWriter(log_path)
    logger = ts.utils.TensorboardLogger(writer)

    # ======== callback functions used during training =========
    # We want to save our best policy
    def save_best_fn(policy):
        """
        Callback to save the best model
        """
        # Save best agent 1
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent1.pth')
        torch.save(policy.policies[agents[0]].state_dict(), model_save_path)
        
        # Save best agent 2
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent2.pth')
        torch.save(policy.policies[agents[1]].state_dict(), model_save_path)
        
        # Save agent2

    def stop_fn(mean_rewards):
        """
        Callback to stop training when we've reached the win rate
        """
        return mean_rewards >= 7 # (win = 10, 70% win without invalid moves = mean of 7)

    def train_fn(epoch, env_step):
        """
        Callback before training
        """        
        # Before training we want to configure the epsilon for the agents
        # In general more exploratory than the test case
        policy.policies[agents[0]].set_eps(training_eps)
        policy.policies[agents[1]].set_eps(training_eps)

    def test_fn(epoch, env_step):
        """
        Callback beore testing
        """        
        # Before testing we want to configure the epsilon for the agents
        # In general more greedy than the train case but not
        #   to avoid getting stuck on invalid moves
        policy.policies[agents[0]].set_eps(testing_eps)
        policy.policies[agents[1]].set_eps(testing_eps)

    def reward_metric(rews):
        """
        Callback for reward collection
        """        
        if agent_player2_frozen and single_agent_score_as_reward:
            # agent 2 frozen, optimizing for agent 1
            return rews[:, 0]
        
        if agent_player1_frozen and single_agent_score_as_reward:
            # agent 1 frozen, optimizing for agent 2
            return rews[:, 1]
        
        # Per default we are interested in optimizing both agents
        return rews[:, 0] + rews[:, 1]
    
            

    # trainer
    result = ts.trainer.offpolicy_trainer(policy= policy,
                                          train_collector= train_collector,
                                          test_collector= test_collector,
                                          max_epoch= epochs,
                                          step_per_epoch= step_per_epoch,
                                          step_per_collect= step_per_collect,
                                          episode_per_test= testing_env_num,
                                          batch_size= batch_size,
                                          train_fn= train_fn,
                                          test_fn= test_fn,
                                          # Stop function to stop before specified amount of epochs
                                          #stop_fn= stop_fn
                                          save_best_fn= save_best_fn,
                                          update_per_step= update_per_step,
                                          logger= logger,
                                          test_in_train= False,
                                          reward_metric= reward_metric)
    
    # Save final agent 1
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent1.pth')
    torch.save(policy.policies[agents[0]].state_dict(), model_save_path)

    # Save final agent 2
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent2.pth')
    torch.save(policy.policies[agents[1]].state_dict(), model_save_path)

    return result, policy.policies[agents[0]], policy.policies[agents[1]]

<hr>

### Function for watching learned agent

Identical to the previous notebook.

In [10]:
####################################################
# WATCHING THE LEARNED POLICY IN ACTION
####################################################

def watch(numer_of_games: int = 3,
          agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
          agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
          test_epsilon: float = 0.05, # For the watching we act completely greedy but low random for not getting stuck on invalid move
          render_speed: float = 0.15, # Amount of seconds to update frame/ do a step
          ) -> None:
    
    # Get the connect four V2 environment (must be a list)
    env= ts.env.DummyVectorEnv([get_env])
    
    # Get the agents from the trained agents
    policy, optim, agents = get_agents(agent_player1= agent_player1,
                                       agent_player2= agent_player2)
    
    # Evaluate the policy
    policy.eval()
    
    # Set the testing policy epsilon for our agents
    policy.policies[agents[0]].set_eps(test_epsilon)
    policy.policies[agents[1]].set_eps(test_epsilon)
    
    # Collect the test data
    collector = ts.data.Collector(policy= policy,
                                  env= env,
                                  exploration_noise= True)
    
    # Render games in human mode to see how it plays
    result = collector.collect(n_episode= numer_of_games, render= render_speed)
    
    # Close the environment aftering collecting the results
    # This closes the pygame window after completion
    env.close()
    
    # Get the rewards and length from the test trials
    rewards, length = result["rews"], result["lens"]
    
    # Print the final reward for the first agent
    print(f"Average steps of game:  {length.mean()}")
    print(f"Final mean reward agent 1: {rewards[:, 0].mean()}, std: {rewards[:, 0].std()}")
    print(f"Final mean reward agent 2: {rewards[:, 1].mean()}, std: {rewards[:, 1].std()}")

<hr>

### Doing the experiment

We now do the experiment with using our previously created functions.
We freeze one agent and initialize both agents from previous versions.

The following iterations were made:

1. Freeze agent 1, train agent 2:
    - Model save name: `1-cnn_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `17` with test reward `1102`
    - Scoring: sum of `both` agent's score
2. Freeze agent 2, train agent 1:
    - Model save name: `2-cnn_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/1-cnn_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `XXX` with test reward `YYY`
    - Scoring: sum of `both` agent's score

After which the agent was so focused on prolonging the game, we decided to lower the learning rate and start optimizing for winning again. We also lowered the amount of epochs in each iterations of swapping the frozen agent.

3. Freeze agent 1, train agent 2:
    - Model save name: `3-cnn_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/2-cnn_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/1-cnn_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.00005` # halfed learning rate
    - Training epsilon: `0.1` # halfed training epsilon
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: `XXX` with test reward `YYY`
    - Scoring: reward of `agent 2`
4. Freeze agent 2, train agent 1:
    - Model save name: `4-cnn_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/2-cnn_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/3-cnn_dqn_frozen_agent1/best_policy_agent2.pth`
    - Learning rate: `0.00005`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: `XXX` with test reward `YYY`
    - Scoring: reward of `agent 1`
    
To do further training, a loop was created which alternated between freezing agens every 50 epochs. This loop was executed 20 times. The learning rate was also lowered once again.

5. Loop frozen agents:
    - Model save name: `5-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/4-cnn_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/3-cnn_dqn_frozen_agent1/best_policy_agent2.pth`
    - Learning rate: `0.000001`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `50` x `20` loops 
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
6. Loop frozen agents:
    - Model save name: `6-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/5-looping-iteration-19/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/5-looping-iteration-19/best_policy_agent2.pth`
    - Learning rate: `0.000003`
    - Training epsilon: `0.1`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `100` loops 
    - Gamma: `0.9` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
7. Loop frozen agents:
    - Model save name: `7-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/6-looping-iteration-99/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/6-looping-iteration-99/best_policy_agent2.pth`
    - Learning rate: `0.001`
    - Training epsilon: `0.05`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `500` loops 
    - Gamma: `0.9` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`

For file size reasons, only a portion of the saved agents are kept and stored on GitHub.


In [11]:
####################################################
# EXPERIMENT: TRAINING AGENTS
####################################################

# Configs for the agents
freeze_agent1 = False
agent1_starting_params = "./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth"

freeze_agent2 = True
agent2_starting_params = "./saved_variables/paper_notebooks/8/1-cnn_dqn_frozen_agent1/final_policy_agent2.pth"

single_agent_score_as_reward = False # To use combined reward or non frozen agent reward as scoring
filename = "2-cnn_dqn_frozen_agent2"
epochs = 1000
loops = 1

learning_rate = 0.0001
training_eps = 0.2
gamma = 0.9
n_step = 4

for loop_idx in range(loops):
    # Filename
    #filename = f"7-20epoch_500loop/7-looping-iteration-{loop_idx}"
    
    # Use provided starting params in first loop, the one from previous iteration in next
    if loop_idx > 0:
        agent1_starting_params = f"./saved_variables/paper_notebooks/7/7-20epoch_500loop/7-looping-iteration-{loop_idx-1}/final_policy_agent1.pth"
        agent2_starting_params = f"./saved_variables/paper_notebooks/7/7-20epoch_500loop/7-looping-iteration-{loop_idx-1}/final_policy_agent2.pth"
    
    # Determine what agent to freeze
    freeze_agent1 = True if loop_idx % 2 == 1 else False
    freeze_agent2 = True if loop_idx % 2 == 0 else False
    
    # Get the environment settings
    env = get_env()
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent 1
    agent1 = cf_custom_dqn_policy(state_shape= state_shape,
                                  action_shape= action_shape,
                                  gamma= gamma,
                                  frozen= freeze_agent1,
                                  learning_rate = learning_rate,
                                  n_step= n_step)
    
    if agent1_starting_params:
        agent1.load_state_dict(torch.load(agent1_starting_params))
        
        # Configure agent 2
        agent2 = cf_custom_dqn_policy(state_shape= state_shape,
                                      action_shape= action_shape,
                                      gamma= gamma,
                                      frozen= freeze_agent2,
                                      learning_rate = learning_rate,
                                      n_step= n_step)
        
        if agent2_starting_params:
            agent2.load_state_dict(torch.load(agent2_starting_params))
            
            
            # Train the agent
            off_policy_traininer_results, final_agent_player1, final_agent_player2 = train_agent(epochs= epochs,
                                                                                                 agent_player1= agent1,
                                                                                                 agent_player1_frozen = freeze_agent1,
                                                                                                 agent_player2= agent2,
                                                                                                 agent_player2_frozen = freeze_agent2,
                                                                                                 filename= filename,
                                                                                                 single_agent_score_as_reward = single_agent_score_as_reward,
                                                                                                 training_eps= training_eps)
            
            

Epoch #1: 1025it [00:02, 343.39it/s, env_step=1024, len=38, n/ep=1, n/st=64, player_1/loss=278.103, player_2/loss=532.296, rew=740.00]                                                                                                      


Epoch #1: test_reward: 779.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:02, 489.63it/s, env_step=2048, len=37, n/ep=2, n/st=64, player_1/loss=518.275, player_2/loss=1072.717, rew=702.00]                                                                                                     


Epoch #2: test_reward: 902.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #3: 1025it [00:02, 491.84it/s, env_step=3072, len=25, n/ep=2, n/st=64, player_1/loss=649.638, player_2/loss=2439.666, rew=332.00]                                                                                                     


Epoch #3: test_reward: 495.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #4: 1025it [00:02, 492.33it/s, env_step=4096, len=34, n/ep=2, n/st=64, player_1/loss=544.846, player_2/loss=2333.940, rew=617.50]                                                                                                     


Epoch #4: test_reward: 495.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #5: 1025it [00:02, 489.19it/s, env_step=5120, len=34, n/ep=2, n/st=64, player_1/loss=268.966, player_2/loss=813.408, rew=594.00]                                                                                                      


Epoch #5: test_reward: 527.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #6: 1025it [00:02, 473.90it/s, env_step=6144, len=33, n/ep=2, n/st=64, player_1/loss=472.900, player_2/loss=791.190, rew=564.50]                                                                                                      


Epoch #6: test_reward: 527.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #7: 1025it [00:02, 446.16it/s, env_step=7168, len=35, n/ep=2, n/st=64, player_1/loss=667.595, player_2/loss=2459.029, rew=631.00]                                                                                                     


Epoch #7: test_reward: 495.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #8: 1025it [00:02, 461.99it/s, env_step=8192, len=28, n/ep=2, n/st=64, player_1/loss=486.614, player_2/loss=2701.178, rew=423.00]                                                                                                     


Epoch #8: test_reward: 377.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #9: 1025it [00:02, 464.38it/s, env_step=9216, len=26, n/ep=2, n/st=64, player_1/loss=370.845, player_2/loss=1899.617, rew=369.50]                                                                                                     


Epoch #9: test_reward: 495.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #10: 1025it [00:02, 462.49it/s, env_step=10240, len=28, n/ep=2, n/st=64, player_1/loss=359.629, player_2/loss=2014.178, rew=405.50]                                                                                                   


Epoch #10: test_reward: 819.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #11: 1025it [00:02, 480.17it/s, env_step=11264, len=38, n/ep=2, n/st=64, player_1/loss=782.728, player_2/loss=1574.316, rew=759.50]                                                                                                   


Epoch #11: test_reward: 702.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #12: 1025it [00:02, 461.80it/s, env_step=12288, len=40, n/ep=2, n/st=64, player_1/loss=838.158, player_2/loss=1391.153, rew=940.50]                                                                                                   


Epoch #12: test_reward: 665.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #13: 1025it [00:02, 458.69it/s, env_step=13312, len=32, n/ep=2, n/st=64, player_1/loss=553.729, player_2/loss=1723.525, rew=564.50]                                                                                                   


Epoch #13: test_reward: 405.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #14: 1025it [00:02, 467.97it/s, env_step=14336, len=34, n/ep=2, n/st=64, player_1/loss=457.069, player_2/loss=2338.162, rew=596.00]                                                                                                   


Epoch #14: test_reward: 405.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #15: 1025it [00:02, 484.35it/s, env_step=15360, len=33, n/ep=2, n/st=64, player_1/loss=348.922, player_2/loss=1971.656, rew=580.00]                                                                                                   


Epoch #15: test_reward: 495.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #16: 1025it [00:02, 464.81it/s, env_step=16384, len=19, n/ep=3, n/st=64, player_1/loss=817.143, player_2/loss=1182.986, rew=202.33]                                                                                                   


Epoch #16: test_reward: 189.000000 ± 0.000000, best_reward: 902.000000 ± 0.000000 in #2


Epoch #17: 1025it [00:02, 475.63it/s, env_step=17408, len=37, n/ep=2, n/st=64, player_1/loss=832.179, player_2/loss=2203.608, rew=721.00]                                                                                                   


Epoch #17: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #18: 1025it [00:02, 459.54it/s, env_step=18432, len=29, n/ep=2, n/st=64, player_1/loss=526.295, player_2/loss=2482.637, rew=434.00]                                                                                                   


Epoch #18: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #19: 1025it [00:02, 482.05it/s, env_step=19456, len=25, n/ep=3, n/st=64, player_1/loss=798.311, player_2/loss=2310.614, rew=327.00]                                                                                                   


Epoch #19: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #20: 1025it [00:02, 483.80it/s, env_step=20480, len=34, n/ep=2, n/st=64, player_1/loss=662.948, player_2/loss=2061.843, rew=614.50]                                                                                                   


Epoch #20: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #21: 1025it [00:02, 452.87it/s, env_step=21504, len=35, n/ep=1, n/st=64, player_1/loss=521.166, player_2/loss=2551.713, rew=629.00]                                                                                                   


Epoch #21: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #22: 1025it [00:02, 483.60it/s, env_step=22528, len=31, n/ep=2, n/st=64, player_1/loss=1026.281, player_2/loss=2825.064, rew=535.50]                                                                                                  


Epoch #22: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #23: 1025it [00:02, 485.09it/s, env_step=23552, len=32, n/ep=2, n/st=64, player_1/loss=961.095, player_2/loss=3296.784, rew=579.50]                                                                                                   


Epoch #23: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #24: 1025it [00:02, 488.22it/s, env_step=24576, len=34, n/ep=2, n/st=64, player_1/loss=363.042, player_2/loss=3519.510, rew=596.00]                                                                                                   


Epoch #24: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #25: 1025it [00:02, 460.55it/s, env_step=25600, len=28, n/ep=2, n/st=64, player_1/loss=249.792, player_2/loss=1540.637, rew=409.50]                                                                                                   


Epoch #25: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #26: 1025it [00:02, 455.84it/s, env_step=26624, len=30, n/ep=2, n/st=64, player_1/loss=265.540, player_2/loss=717.162, rew=507.50]                                                                                                    


Epoch #26: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #27: 1025it [00:02, 437.88it/s, env_step=27648, len=22, n/ep=3, n/st=64, player_1/loss=397.803, player_2/loss=1539.487, rew=255.00]                                                                                                   


Epoch #27: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #28: 1025it [00:02, 422.49it/s, env_step=28672, len=25, n/ep=2, n/st=64, player_1/loss=682.888, player_2/loss=2161.854, rew=347.00]                                                                                                   


Epoch #28: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #29: 1025it [00:02, 484.99it/s, env_step=29696, len=25, n/ep=3, n/st=64, player_1/loss=718.108, player_2/loss=3255.225, rew=335.67]                                                                                                   


Epoch #29: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #30: 1025it [00:02, 500.46it/s, env_step=30720, len=27, n/ep=3, n/st=64, player_1/loss=490.491, player_2/loss=2778.044, rew=402.33]                                                                                                   


Epoch #30: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #31: 1025it [00:02, 488.53it/s, env_step=31744, len=31, n/ep=2, n/st=64, player_1/loss=452.596, player_2/loss=2802.679, rew=519.50]                                                                                                   


Epoch #31: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #32: 1025it [00:02, 487.67it/s, env_step=32768, len=26, n/ep=3, n/st=64, player_1/loss=568.222, player_2/loss=3199.090, rew=350.33]                                                                                                   


Epoch #32: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #33: 1025it [00:02, 470.75it/s, env_step=33792, len=28, n/ep=3, n/st=64, player_1/loss=562.622, player_2/loss=2734.193, rew=418.00]                                                                                                   


Epoch #33: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #34: 1025it [00:02, 438.48it/s, env_step=34816, len=32, n/ep=2, n/st=64, player_1/loss=549.212, player_2/loss=3067.665, rew=564.50]                                                                                                   


Epoch #34: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #35: 1025it [00:02, 446.36it/s, env_step=35840, len=27, n/ep=3, n/st=64, player_1/loss=462.343, player_2/loss=3236.627, rew=388.33]                                                                                                   


Epoch #35: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #36: 1025it [00:02, 447.96it/s, env_step=36864, len=29, n/ep=2, n/st=64, player_1/loss=379.820, player_2/loss=1743.683, rew=459.00]                                                                                                   


Epoch #36: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #37: 1025it [00:02, 500.40it/s, env_step=37888, len=24, n/ep=3, n/st=64, player_1/loss=358.658, player_2/loss=1540.630, rew=321.33]                                                                                                   


Epoch #37: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #38: 1025it [00:02, 442.43it/s, env_step=38912, len=27, n/ep=3, n/st=64, player_1/loss=365.555, player_2/loss=2093.021, rew=517.67]                                                                                                   


Epoch #38: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #39: 1025it [00:02, 457.51it/s, env_step=39936, len=26, n/ep=3, n/st=64, player_1/loss=332.240, rew=356.33]                                                                                                                           


Epoch #39: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #40: 1025it [00:02, 496.95it/s, env_step=40960, len=31, n/ep=2, n/st=64, player_1/loss=395.048, player_2/loss=1982.427, rew=526.00]                                                                                                   


Epoch #40: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #41: 1025it [00:02, 440.13it/s, env_step=41984, len=32, n/ep=2, n/st=64, player_1/loss=331.433, player_2/loss=2520.617, rew=543.50]                                                                                                   


Epoch #41: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #42: 1025it [00:02, 482.06it/s, env_step=43008, len=30, n/ep=2, n/st=64, player_1/loss=302.874, player_2/loss=2500.478, rew=464.00]                                                                                                   


Epoch #42: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #43: 1025it [00:02, 450.80it/s, env_step=44032, len=32, n/ep=2, n/st=64, player_1/loss=244.065, player_2/loss=3340.011, rew=546.50]                                                                                                   


Epoch #43: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #44: 1025it [00:02, 491.77it/s, env_step=45056, len=33, n/ep=2, n/st=64, player_1/loss=250.420, player_2/loss=2774.161, rew=562.00]                                                                                                   


Epoch #44: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #45: 1025it [00:02, 486.23it/s, env_step=46080, len=31, n/ep=3, n/st=64, player_1/loss=170.137, player_2/loss=3127.453, rew=504.00]                                                                                                   


Epoch #45: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #46: 1025it [00:02, 493.49it/s, env_step=47104, len=34, n/ep=2, n/st=64, player_1/loss=208.266, player_2/loss=1749.006, rew=617.50]                                                                                                   


Epoch #46: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #47: 1025it [00:02, 488.95it/s, env_step=48128, len=27, n/ep=2, n/st=64, player_1/loss=222.373, player_2/loss=1280.671, rew=379.00]                                                                                                   


Epoch #47: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #48: 1025it [00:02, 491.50it/s, env_step=49152, len=27, n/ep=3, n/st=64, player_1/loss=205.800, player_2/loss=1068.169, rew=402.33]                                                                                                   


Epoch #48: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #49: 1025it [00:02, 492.27it/s, env_step=50176, len=19, n/ep=2, n/st=64, player_1/loss=211.393, player_2/loss=1764.139, rew=235.00]                                                                                                   


Epoch #49: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #50: 1025it [00:02, 492.05it/s, env_step=51200, len=33, n/ep=2, n/st=64, player_1/loss=381.080, player_2/loss=2522.734, rew=583.00]                                                                                                   


Epoch #50: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #51: 1025it [00:02, 491.60it/s, env_step=52224, len=36, n/ep=2, n/st=64, player_1/loss=452.506, player_2/loss=2089.391, rew=684.50]                                                                                                   


Epoch #51: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #52: 1025it [00:02, 489.35it/s, env_step=53248, len=25, n/ep=2, n/st=64, player_1/loss=444.506, player_2/loss=1920.970, rew=337.00]                                                                                                   


Epoch #52: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #53: 1025it [00:02, 489.44it/s, env_step=54272, len=21, n/ep=3, n/st=64, player_1/loss=570.672, player_2/loss=2264.813, rew=231.33]                                                                                                   


Epoch #53: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #54: 1025it [00:02, 491.68it/s, env_step=55296, len=25, n/ep=3, n/st=64, player_1/loss=596.806, player_2/loss=2611.883, rew=334.00]                                                                                                   


Epoch #54: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #55: 1025it [00:02, 488.87it/s, env_step=56320, len=28, n/ep=2, n/st=64, player_1/loss=376.305, player_2/loss=2042.992, rew=409.50]                                                                                                   


Epoch #55: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #56: 1025it [00:02, 491.04it/s, env_step=57344, len=27, n/ep=2, n/st=64, player_1/loss=283.868, player_2/loss=1669.319, rew=401.00]                                                                                                   


Epoch #56: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #57: 1025it [00:02, 491.83it/s, env_step=58368, len=36, n/ep=2, n/st=64, player_1/loss=352.216, player_2/loss=1964.006, rew=686.50]                                                                                                   


Epoch #57: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #58: 1025it [00:02, 491.23it/s, env_step=59392, len=38, n/ep=2, n/st=64, player_1/loss=505.464, player_2/loss=1812.881, rew=848.00]                                                                                                   


Epoch #58: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #59: 1025it [00:02, 490.58it/s, env_step=60416, len=35, n/ep=2, n/st=64, player_1/loss=472.160, player_2/loss=1476.104, rew=650.00]                                                                                                   


Epoch #59: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #60: 1025it [00:02, 493.64it/s, env_step=61440, len=14, n/ep=4, n/st=64, player_1/loss=701.332, player_2/loss=1556.820, rew=104.75]                                                                                                   


Epoch #60: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #61: 1025it [00:02, 491.01it/s, env_step=62464, len=15, n/ep=4, n/st=64, player_1/loss=487.380, player_2/loss=2495.663, rew=129.50]                                                                                                   


Epoch #61: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #62: 1025it [00:02, 435.64it/s, env_step=63488, len=16, n/ep=4, n/st=64, player_1/loss=628.130, player_2/loss=2682.453, rew=148.00]                                                                                                   


Epoch #62: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #63: 1025it [00:02, 476.68it/s, env_step=64512, len=19, n/ep=3, n/st=64, player_1/loss=481.448, player_2/loss=3001.766, rew=202.67]                                                                                                   


Epoch #63: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #64: 1025it [00:02, 452.48it/s, env_step=65536, len=22, n/ep=3, n/st=64, player_1/loss=529.715, rew=252.33]                                                                                                                           


Epoch #64: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #65: 1025it [00:02, 493.07it/s, env_step=66560, len=30, n/ep=2, n/st=64, player_1/loss=543.033, player_2/loss=2284.900, rew=466.00]                                                                                                   


Epoch #65: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #66: 1025it [00:02, 491.27it/s, env_step=67584, len=28, n/ep=2, n/st=64, player_1/loss=330.256, player_2/loss=2287.305, rew=405.50]                                                                                                   


Epoch #66: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #67: 1025it [00:02, 494.07it/s, env_step=68608, len=34, n/ep=2, n/st=64, player_1/loss=243.347, player_2/loss=1398.008, rew=726.00]                                                                                                   


Epoch #67: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #68: 1025it [00:02, 492.61it/s, env_step=69632, len=26, n/ep=2, n/st=64, player_1/loss=253.279, player_2/loss=1562.327, rew=354.50]                                                                                                   


Epoch #68: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #69: 1025it [00:02, 492.55it/s, env_step=70656, len=34, n/ep=2, n/st=64, player_1/loss=255.361, player_2/loss=1643.401, rew=621.50]                                                                                                   


Epoch #69: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #70: 1025it [00:02, 493.85it/s, env_step=71680, len=32, n/ep=2, n/st=64, player_1/loss=238.061, player_2/loss=1462.242, rew=544.50]                                                                                                   


Epoch #70: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #71: 1025it [00:02, 488.06it/s, env_step=72704, len=33, n/ep=2, n/st=64, player_1/loss=274.925, player_2/loss=2736.150, rew=592.00]                                                                                                   


Epoch #71: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #72: 1025it [00:02, 492.99it/s, env_step=73728, len=32, n/ep=2, n/st=64, player_1/loss=512.572, player_2/loss=3004.666, rew=529.00]                                                                                                   


Epoch #72: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #73: 1025it [00:02, 493.15it/s, env_step=74752, len=39, n/ep=1, n/st=64, player_1/loss=498.613, player_2/loss=2216.698, rew=779.00]                                                                                                   


Epoch #73: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #74: 1025it [00:02, 492.80it/s, env_step=75776, len=42, n/ep=2, n/st=64, player_1/loss=332.942, player_2/loss=2184.223, rew=1102.00]                                                                                                  


Epoch #74: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #75: 1025it [00:02, 496.67it/s, env_step=76800, len=17, n/ep=4, n/st=64, player_1/loss=512.511, player_2/loss=1392.319, rew=180.00]                                                                                                   


Epoch #75: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #76: 1025it [00:02, 491.65it/s, env_step=77824, len=19, n/ep=4, n/st=64, player_1/loss=439.308, player_2/loss=1676.744, rew=207.00]                                                                                                   


Epoch #76: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #77: 1025it [00:02, 491.16it/s, env_step=78848, len=27, n/ep=3, n/st=64, player_1/loss=467.511, player_2/loss=1733.493, rew=410.33]                                                                                                   


Epoch #77: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #78: 1025it [00:02, 491.87it/s, env_step=79872, len=42, n/ep=1, n/st=64, player_1/loss=607.358, player_2/loss=1583.067, rew=1102.00]                                                                                                  


Epoch #78: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #79: 1025it [00:02, 492.26it/s, env_step=80896, len=32, n/ep=2, n/st=64, player_1/loss=483.376, player_2/loss=2598.596, rew=553.50]                                                                                                   


Epoch #79: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #80: 1025it [00:02, 493.33it/s, env_step=81920, len=32, n/ep=2, n/st=64, player_1/loss=313.661, player_2/loss=2583.672, rew=553.50]                                                                                                   


Epoch #80: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #81: 1025it [00:02, 490.44it/s, env_step=82944, len=38, n/ep=2, n/st=64, player_1/loss=188.161, player_2/loss=1122.993, rew=740.00]                                                                                                   


Epoch #81: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #82: 1025it [00:02, 491.14it/s, env_step=83968, len=30, n/ep=2, n/st=64, player_1/loss=327.613, player_2/loss=1134.932, rew=524.50]                                                                                                   


Epoch #82: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #83: 1025it [00:02, 491.87it/s, env_step=84992, len=30, n/ep=2, n/st=64, player_1/loss=565.132, player_2/loss=1270.734, rew=496.00]                                                                                                   


Epoch #83: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #84: 1025it [00:02, 490.79it/s, env_step=86016, len=34, n/ep=2, n/st=64, player_1/loss=489.898, player_2/loss=1600.295, rew=602.00]                                                                                                   


Epoch #84: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #85: 1025it [00:02, 491.97it/s, env_step=87040, len=30, n/ep=2, n/st=64, player_1/loss=269.922, player_2/loss=1756.928, rew=476.50]                                                                                                   


Epoch #85: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #86: 1025it [00:02, 493.86it/s, env_step=88064, len=27, n/ep=2, n/st=64, player_1/loss=199.702, player_2/loss=1632.558, rew=401.00]                                                                                                   


Epoch #86: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #87: 1025it [00:02, 488.31it/s, env_step=89088, len=26, n/ep=3, n/st=64, player_1/loss=305.231, player_2/loss=2133.274, rew=446.67]                                                                                                   


Epoch #87: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #88: 1025it [00:02, 488.34it/s, env_step=90112, len=22, n/ep=3, n/st=64, player_1/loss=268.145, player_2/loss=2267.636, rew=261.00]                                                                                                   


Epoch #88: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #89: 1025it [00:02, 492.50it/s, env_step=91136, len=26, n/ep=2, n/st=64, player_1/loss=300.496, player_2/loss=1395.354, rew=410.50]                                                                                                   


Epoch #89: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #90: 1025it [00:02, 487.08it/s, env_step=92160, len=28, n/ep=3, n/st=64, player_1/loss=204.382, player_2/loss=2367.637, rew=448.33]                                                                                                   


Epoch #90: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #91: 1025it [00:02, 488.56it/s, env_step=93184, len=35, n/ep=2, n/st=64, player_1/loss=239.463, player_2/loss=2560.946, rew=753.50]                                                                                                   


Epoch #91: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #92: 1025it [00:02, 492.18it/s, env_step=94208, len=29, n/ep=2, n/st=64, player_1/loss=269.657, player_2/loss=1830.093, rew=434.50]                                                                                                   


Epoch #92: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #93: 1025it [00:02, 495.02it/s, env_step=95232, len=30, n/ep=3, n/st=64, player_1/loss=512.786, player_2/loss=2274.211, rew=519.00]                                                                                                   


Epoch #93: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #94: 1025it [00:02, 492.48it/s, env_step=96256, len=27, n/ep=2, n/st=64, player_1/loss=600.171, player_2/loss=1463.207, rew=427.00]                                                                                                   


Epoch #94: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #95: 1025it [00:02, 493.72it/s, env_step=97280, len=21, n/ep=3, n/st=64, player_1/loss=587.637, player_2/loss=2209.015, rew=251.67]                                                                                                   


Epoch #95: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #96: 1025it [00:02, 489.70it/s, env_step=98304, len=33, n/ep=2, n/st=64, player_1/loss=660.279, player_2/loss=2281.034, rew=572.50]                                                                                                   


Epoch #96: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #97: 1025it [00:02, 491.35it/s, env_step=99328, len=37, n/ep=1, n/st=64, player_2/loss=2082.405, rew=702.00]                                                                                                                          


Epoch #97: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #98: 1025it [00:02, 487.83it/s, env_step=100352, len=29, n/ep=2, n/st=64, player_1/loss=362.502, player_2/loss=1160.999, rew=477.00]                                                                                                  


Epoch #98: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #99: 1025it [00:02, 492.48it/s, env_step=101376, len=23, n/ep=3, n/st=64, player_1/loss=251.599, player_2/loss=1462.076, rew=351.00]                                                                                                  


Epoch #99: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #100: 1025it [00:02, 495.32it/s, env_step=102400, len=34, n/ep=2, n/st=64, player_1/loss=387.254, player_2/loss=2970.325, rew=602.00]                                                                                                 


Epoch #100: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #101: 1025it [00:02, 495.78it/s, env_step=103424, len=33, n/ep=2, n/st=64, player_1/loss=382.898, player_2/loss=3024.703, rew=700.50]                                                                                                 


Epoch #101: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #102: 1025it [00:02, 492.80it/s, env_step=104448, len=29, n/ep=2, n/st=64, player_1/loss=719.137, player_2/loss=2525.660, rew=449.00]                                                                                                 


Epoch #102: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #103: 1025it [00:02, 492.98it/s, env_step=105472, len=27, n/ep=2, n/st=64, player_1/loss=667.649, player_2/loss=1949.807, rew=392.00]                                                                                                 


Epoch #103: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #104: 1025it [00:02, 492.78it/s, env_step=106496, len=29, n/ep=2, n/st=64, player_1/loss=366.048, player_2/loss=2256.409, rew=434.00]                                                                                                 


Epoch #104: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #105: 1025it [00:02, 492.18it/s, env_step=107520, len=36, n/ep=2, n/st=64, player_1/loss=331.969, player_2/loss=1811.400, rew=667.00]                                                                                                 


Epoch #105: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #106: 1025it [00:02, 493.56it/s, env_step=108544, len=25, n/ep=2, n/st=64, player_1/loss=345.739, player_2/loss=1667.849, rew=340.00]                                                                                                 


Epoch #106: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #107: 1025it [00:02, 493.20it/s, env_step=109568, len=38, n/ep=2, n/st=64, player_1/loss=425.873, player_2/loss=3237.563, rew=740.50]                                                                                                 


Epoch #107: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #108: 1025it [00:02, 494.65it/s, env_step=110592, len=25, n/ep=3, n/st=64, player_1/loss=479.162, player_2/loss=3643.006, rew=336.00]                                                                                                 


Epoch #108: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #109: 1025it [00:02, 493.08it/s, env_step=111616, len=30, n/ep=3, n/st=64, player_1/loss=367.738, player_2/loss=3283.973, rew=480.00]                                                                                                 


Epoch #109: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #110: 1025it [00:02, 490.81it/s, env_step=112640, len=32, n/ep=1, n/st=64, player_1/loss=533.841, player_2/loss=2827.981, rew=527.00]                                                                                                 


Epoch #110: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #111: 1025it [00:02, 493.83it/s, env_step=113664, len=32, n/ep=2, n/st=64, player_1/loss=615.787, player_2/loss=1969.731, rew=539.50]                                                                                                 


Epoch #111: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #112: 1025it [00:02, 490.19it/s, env_step=114688, len=36, n/ep=2, n/st=64, player_1/loss=595.823, player_2/loss=1487.360, rew=667.00]                                                                                                 


Epoch #112: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #113: 1025it [00:02, 492.81it/s, env_step=115712, len=34, n/ep=2, n/st=64, player_1/loss=391.765, player_2/loss=1393.393, rew=594.50]                                                                                                 


Epoch #113: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #114: 1025it [00:02, 492.41it/s, env_step=116736, len=33, n/ep=2, n/st=64, player_1/loss=290.220, player_2/loss=1082.122, rew=562.00]                                                                                                 


Epoch #114: test_reward: 860.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #115: 1025it [00:02, 493.18it/s, env_step=117760, len=38, n/ep=1, n/st=64, player_1/loss=244.126, player_2/loss=1342.739, rew=740.00]                                                                                                 


Epoch #115: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #116: 1025it [00:02, 493.96it/s, env_step=118784, len=36, n/ep=2, n/st=64, player_1/loss=250.449, player_2/loss=2448.639, rew=667.00]                                                                                                 


Epoch #116: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #117: 1025it [00:02, 494.64it/s, env_step=119808, len=24, n/ep=3, n/st=64, player_1/loss=308.461, player_2/loss=2112.447, rew=320.33]                                                                                                 


Epoch #117: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #118: 1025it [00:02, 493.72it/s, env_step=120832, len=20, n/ep=3, n/st=64, player_1/loss=201.680, player_2/loss=1000.356, rew=223.33]                                                                                                 


Epoch #118: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #119: 1025it [00:02, 495.11it/s, env_step=121856, len=28, n/ep=2, n/st=64, player_1/loss=358.946, player_2/loss=1502.725, rew=405.00]                                                                                                 


Epoch #119: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #120: 1025it [00:02, 490.24it/s, env_step=122880, len=27, n/ep=2, n/st=64, player_1/loss=413.216, player_2/loss=1568.710, rew=401.00]                                                                                                 


Epoch #120: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #121: 1025it [00:02, 495.04it/s, env_step=123904, len=28, n/ep=2, n/st=64, player_1/loss=473.434, player_2/loss=1434.122, rew=464.50]                                                                                                 


Epoch #121: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #122: 1025it [00:02, 492.36it/s, env_step=124928, len=21, n/ep=3, n/st=64, player_1/loss=583.247, player_2/loss=1404.100, rew=230.33]                                                                                                 


Epoch #122: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #123: 1025it [00:02, 492.61it/s, env_step=125952, len=26, n/ep=3, n/st=64, player_1/loss=397.127, player_2/loss=1714.949, rew=351.00]                                                                                                 


Epoch #123: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #124: 1025it [00:02, 491.72it/s, env_step=126976, len=39, n/ep=1, n/st=64, player_1/loss=379.728, player_2/loss=2552.690, rew=779.00]                                                                                                 


Epoch #124: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #125: 1025it [00:02, 494.15it/s, env_step=128000, len=30, n/ep=2, n/st=64, player_1/loss=434.161, player_2/loss=2451.650, rew=496.00]                                                                                                 


Epoch #125: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #126: 1025it [00:02, 493.10it/s, env_step=129024, len=21, n/ep=3, n/st=64, player_1/loss=236.547, player_2/loss=2321.133, rew=238.67]                                                                                                 


Epoch #126: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #127: 1025it [00:02, 494.00it/s, env_step=130048, len=20, n/ep=3, n/st=64, player_1/loss=303.106, player_2/loss=2772.024, rew=225.00]                                                                                                 


Epoch #127: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #128: 1025it [00:02, 492.98it/s, env_step=131072, len=38, n/ep=1, n/st=64, player_1/loss=312.982, player_2/loss=2232.873, rew=740.00]                                                                                                 


Epoch #128: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #129: 1025it [00:02, 490.77it/s, env_step=132096, len=27, n/ep=3, n/st=64, player_1/loss=426.565, player_2/loss=2227.280, rew=396.33]                                                                                                 


Epoch #129: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #130: 1025it [00:02, 491.08it/s, env_step=133120, len=20, n/ep=3, n/st=64, player_1/loss=631.359, player_2/loss=1841.844, rew=216.33]                                                                                                 


Epoch #130: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #131: 1025it [00:02, 491.43it/s, env_step=134144, len=33, n/ep=1, n/st=64, player_1/loss=437.290, player_2/loss=1969.838, rew=560.00]                                                                                                 


Epoch #131: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #132: 1025it [00:02, 493.35it/s, env_step=135168, len=31, n/ep=2, n/st=64, player_1/loss=211.366, player_2/loss=1639.684, rew=521.00]                                                                                                 


Epoch #132: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #133: 1025it [00:02, 495.01it/s, env_step=136192, len=26, n/ep=3, n/st=64, player_1/loss=271.486, player_2/loss=1650.147, rew=350.00]                                                                                                 


Epoch #133: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #134: 1025it [00:02, 493.72it/s, env_step=137216, len=33, n/ep=3, n/st=64, player_1/loss=394.547, player_2/loss=1735.828, rew=569.00]                                                                                                 


Epoch #134: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #135: 1025it [00:02, 494.96it/s, env_step=138240, len=35, n/ep=2, n/st=64, player_1/loss=362.991, player_2/loss=1981.553, rew=629.00]                                                                                                 


Epoch #135: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #136: 1025it [00:02, 492.06it/s, env_step=139264, len=22, n/ep=3, n/st=64, player_1/loss=79.263, player_2/loss=1967.897, rew=258.33]                                                                                                  


Epoch #136: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #137: 1025it [00:02, 491.74it/s, env_step=140288, len=36, n/ep=2, n/st=64, player_1/loss=471.438, player_2/loss=1833.806, rew=665.50]                                                                                                 


Epoch #137: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #138: 1025it [00:02, 492.17it/s, env_step=141312, len=31, n/ep=2, n/st=64, player_1/loss=527.863, player_2/loss=2476.402, rew=526.00]                                                                                                 


Epoch #138: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #139: 1025it [00:02, 491.59it/s, env_step=142336, len=33, n/ep=2, n/st=64, player_1/loss=405.779, player_2/loss=3022.182, rew=578.00]                                                                                                 


Epoch #139: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #140: 1025it [00:02, 476.17it/s, env_step=143360, len=19, n/ep=4, n/st=64, player_1/loss=335.895, player_2/loss=3278.101, rew=206.00]                                                                                                 


Epoch #140: test_reward: 119.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #141: 1025it [00:02, 450.48it/s, env_step=144384, len=26, n/ep=2, n/st=64, player_1/loss=305.152, player_2/loss=2862.334, rew=350.00]                                                                                                 


Epoch #141: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #142: 1025it [00:02, 470.33it/s, env_step=145408, len=39, n/ep=2, n/st=64, player_1/loss=330.560, player_2/loss=3077.579, rew=799.00]                                                                                                 


Epoch #142: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #143: 1025it [00:02, 493.36it/s, env_step=146432, len=40, n/ep=2, n/st=64, player_1/loss=347.508, player_2/loss=2998.952, rew=940.50]                                                                                                 


Epoch #143: test_reward: 152.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #144: 1025it [00:02, 491.10it/s, env_step=147456, len=33, n/ep=2, n/st=64, player_1/loss=421.878, player_2/loss=2353.980, rew=572.50]                                                                                                 


Epoch #144: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #145: 1025it [00:02, 491.66it/s, env_step=148480, len=8, n/ep=8, n/st=64, player_1/loss=481.021, player_2/loss=3221.918, rew=36.12]                                                                                                   


Epoch #145: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #146: 1025it [00:02, 486.94it/s, env_step=149504, len=17, n/ep=3, n/st=64, player_1/loss=407.536, player_2/loss=4037.544, rew=177.67]                                                                                                 


Epoch #146: test_reward: 594.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #147: 1025it [00:02, 489.19it/s, env_step=150528, len=26, n/ep=3, n/st=64, player_1/loss=299.516, player_2/loss=4081.044, rew=375.33]                                                                                                 


Epoch #147: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #148: 1025it [00:02, 489.00it/s, env_step=151552, len=25, n/ep=2, n/st=64, player_1/loss=218.901, player_2/loss=3439.212, rew=358.00]                                                                                                 


Epoch #148: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #149: 1025it [00:02, 492.19it/s, env_step=152576, len=29, n/ep=2, n/st=64, player_1/loss=201.186, player_2/loss=2175.677, rew=452.00]                                                                                                 


Epoch #149: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #150: 1025it [00:02, 492.36it/s, env_step=153600, len=30, n/ep=2, n/st=64, player_1/loss=207.337, player_2/loss=1581.250, rew=488.50]                                                                                                 


Epoch #150: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #151: 1025it [00:02, 490.98it/s, env_step=154624, len=30, n/ep=2, n/st=64, player_1/loss=228.066, player_2/loss=1992.571, rew=524.50]                                                                                                 


Epoch #151: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #152: 1025it [00:02, 492.50it/s, env_step=155648, len=26, n/ep=3, n/st=64, player_1/loss=234.906, player_2/loss=2009.322, rew=368.00]                                                                                                 


Epoch #152: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #153: 1025it [00:02, 493.28it/s, env_step=156672, len=15, n/ep=4, n/st=64, player_1/loss=403.887, player_2/loss=2593.792, rew=123.25]                                                                                                 


Epoch #153: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #154: 1025it [00:02, 494.07it/s, env_step=157696, len=40, n/ep=2, n/st=64, player_1/loss=413.218, player_2/loss=2110.217, rew=921.00]                                                                                                 


Epoch #154: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #155: 1025it [00:02, 490.77it/s, env_step=158720, len=22, n/ep=3, n/st=64, player_1/loss=507.052, player_2/loss=1082.150, rew=286.00]                                                                                                 


Epoch #155: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #156: 1025it [00:02, 494.03it/s, env_step=159744, len=26, n/ep=3, n/st=64, player_1/loss=471.743, player_2/loss=2044.775, rew=350.33]                                                                                                 


Epoch #156: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #157: 1025it [00:02, 494.77it/s, env_step=160768, len=29, n/ep=2, n/st=64, player_1/loss=325.057, player_2/loss=2996.925, rew=494.00]                                                                                                 


Epoch #157: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #158: 1025it [00:02, 492.33it/s, env_step=161792, len=30, n/ep=2, n/st=64, player_1/loss=405.979, player_2/loss=2304.314, rew=488.50]                                                                                                 


Epoch #158: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #159: 1025it [00:02, 493.16it/s, env_step=162816, len=27, n/ep=2, n/st=64, player_1/loss=307.744, player_2/loss=1317.902, rew=391.00]                                                                                                 


Epoch #159: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #160: 1025it [00:02, 492.06it/s, env_step=163840, len=37, n/ep=1, n/st=64, player_1/loss=215.251, player_2/loss=1881.038, rew=702.00]                                                                                                 


Epoch #160: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #161: 1025it [00:02, 489.16it/s, env_step=164864, len=26, n/ep=2, n/st=64, player_1/loss=429.291, player_2/loss=2798.715, rew=410.50]                                                                                                 


Epoch #161: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #162: 1025it [00:02, 487.80it/s, env_step=165888, len=11, n/ep=6, n/st=64, player_1/loss=347.648, player_2/loss=2714.028, rew=73.67]                                                                                                  


Epoch #162: test_reward: 35.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #163: 1025it [00:02, 490.26it/s, env_step=166912, len=30, n/ep=2, n/st=64, player_1/loss=354.381, player_2/loss=2247.908, rew=496.00]                                                                                                 


Epoch #163: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #164: 1025it [00:02, 493.49it/s, env_step=167936, len=33, n/ep=2, n/st=64, player_1/loss=467.011, player_2/loss=1581.199, rew=587.00]                                                                                                 


Epoch #164: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #165: 1025it [00:02, 492.49it/s, env_step=168960, len=27, n/ep=2, n/st=64, player_1/loss=323.681, player_2/loss=1353.626, rew=391.00]                                                                                                 


Epoch #165: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #166: 1025it [00:02, 490.31it/s, env_step=169984, len=11, n/ep=8, n/st=64, player_1/loss=463.744, player_2/loss=1705.107, rew=104.25]                                                                                                 


Epoch #166: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #167: 1025it [00:02, 491.92it/s, env_step=171008, len=26, n/ep=3, n/st=64, player_1/loss=505.539, player_2/loss=1518.103, rew=391.33]                                                                                                 


Epoch #167: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #168: 1025it [00:02, 492.05it/s, env_step=172032, len=31, n/ep=2, n/st=64, player_1/loss=584.683, player_2/loss=2115.191, rew=495.50]                                                                                                 


Epoch #168: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #169: 1025it [00:02, 493.18it/s, env_step=173056, len=39, n/ep=1, n/st=64, player_1/loss=525.210, player_2/loss=2212.440, rew=779.00]                                                                                                 


Epoch #169: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #170: 1025it [00:02, 493.97it/s, env_step=174080, len=42, n/ep=1, n/st=64, player_1/loss=521.825, player_2/loss=1652.726, rew=1102.00]                                                                                                


Epoch #170: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #171: 1025it [00:02, 489.48it/s, env_step=175104, len=40, n/ep=1, n/st=64, player_1/loss=518.194, player_2/loss=2409.124, rew=819.00]                                                                                                 


Epoch #171: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #172: 1025it [00:02, 495.44it/s, env_step=176128, len=24, n/ep=2, n/st=64, player_1/loss=355.658, player_2/loss=2000.630, rew=326.50]                                                                                                 


Epoch #172: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #173: 1025it [00:02, 494.19it/s, env_step=177152, len=28, n/ep=2, n/st=64, player_1/loss=385.533, player_2/loss=453.523, rew=425.50]                                                                                                  


Epoch #173: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #174: 1025it [00:02, 494.20it/s, env_step=178176, len=38, n/ep=1, n/st=64, player_1/loss=367.492, player_2/loss=979.466, rew=740.00]                                                                                                  


Epoch #174: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #175: 1025it [00:02, 491.46it/s, env_step=179200, len=33, n/ep=2, n/st=64, player_1/loss=143.525, player_2/loss=1773.260, rew=568.00]                                                                                                 


Epoch #175: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #176: 1025it [00:02, 496.19it/s, env_step=180224, len=38, n/ep=2, n/st=64, player_1/loss=471.803, player_2/loss=3416.299, rew=740.00]                                                                                                 


Epoch #176: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #177: 1025it [00:02, 494.03it/s, env_step=181248, len=23, n/ep=3, n/st=64, player_1/loss=683.340, player_2/loss=3942.419, rew=287.00]                                                                                                 


Epoch #177: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #178: 1025it [00:02, 492.70it/s, env_step=182272, len=25, n/ep=3, n/st=64, player_1/loss=795.481, player_2/loss=2207.991, rew=344.67]                                                                                                 


Epoch #178: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #179: 1025it [00:02, 486.31it/s, env_step=183296, len=28, n/ep=2, n/st=64, player_1/loss=589.360, player_2/loss=2767.081, rew=447.50]                                                                                                 


Epoch #179: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #180: 1025it [00:02, 490.39it/s, env_step=184320, len=38, n/ep=1, n/st=64, player_1/loss=815.184, player_2/loss=1544.585, rew=740.00]                                                                                                 


Epoch #180: test_reward: 594.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #181: 1025it [00:02, 491.05it/s, env_step=185344, len=38, n/ep=2, n/st=64, player_1/loss=590.916, player_2/loss=1240.839, rew=740.50]                                                                                                 


Epoch #181: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #182: 1025it [00:02, 493.36it/s, env_step=186368, len=14, n/ep=4, n/st=64, player_1/loss=415.491, player_2/loss=1592.441, rew=112.75]                                                                                                 


Epoch #182: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #183: 1025it [00:02, 494.39it/s, env_step=187392, len=31, n/ep=2, n/st=64, player_1/loss=349.306, player_2/loss=2587.794, rew=511.00]                                                                                                 


Epoch #183: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #184: 1025it [00:02, 492.84it/s, env_step=188416, len=33, n/ep=2, n/st=64, player_1/loss=359.264, player_2/loss=2039.630, rew=577.00]                                                                                                 


Epoch #184: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #185: 1025it [00:02, 493.05it/s, env_step=189440, len=40, n/ep=1, n/st=64, player_1/loss=247.788, player_2/loss=1338.204, rew=819.00]                                                                                                 


Epoch #185: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #186: 1025it [00:02, 491.83it/s, env_step=190464, len=32, n/ep=2, n/st=64, player_1/loss=319.716, player_2/loss=2251.070, rew=558.50]                                                                                                 


Epoch #186: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #187: 1025it [00:02, 491.18it/s, env_step=191488, len=21, n/ep=3, n/st=64, player_1/loss=386.308, player_2/loss=2946.898, rew=240.67]                                                                                                 


Epoch #187: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #188: 1025it [00:02, 494.51it/s, env_step=192512, len=22, n/ep=3, n/st=64, player_1/loss=375.048, player_2/loss=3322.323, rew=263.00]                                                                                                 


Epoch #188: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #189: 1025it [00:02, 492.80it/s, env_step=193536, len=29, n/ep=2, n/st=64, player_1/loss=384.684, player_2/loss=2714.479, rew=484.00]                                                                                                 


Epoch #189: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #190: 1025it [00:02, 491.48it/s, env_step=194560, len=41, n/ep=2, n/st=64, player_1/loss=431.455, player_2/loss=2943.018, rew=960.50]                                                                                                 


Epoch #190: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #191: 1025it [00:02, 493.61it/s, env_step=195584, len=42, n/ep=1, n/st=64, player_1/loss=297.643, player_2/loss=3828.543, rew=1102.00]                                                                                                


Epoch #191: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #192: 1025it [00:02, 494.08it/s, env_step=196608, len=40, n/ep=1, n/st=64, player_1/loss=228.223, player_2/loss=1819.152, rew=819.00]                                                                                                 


Epoch #192: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #193: 1025it [00:02, 493.68it/s, env_step=197632, len=23, n/ep=3, n/st=64, player_1/loss=317.918, player_2/loss=1088.760, rew=337.00]                                                                                                 


Epoch #193: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #194: 1025it [00:02, 492.02it/s, env_step=198656, len=31, n/ep=2, n/st=64, player_1/loss=381.401, player_2/loss=1160.004, rew=513.00]                                                                                                 


Epoch #194: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #195: 1025it [00:02, 490.90it/s, env_step=199680, len=37, n/ep=2, n/st=64, player_1/loss=350.805, player_2/loss=958.469, rew=721.00]                                                                                                  


Epoch #195: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #196: 1025it [00:02, 493.13it/s, env_step=200704, len=31, n/ep=2, n/st=64, player_1/loss=365.038, player_2/loss=1554.539, rew=495.50]                                                                                                 


Epoch #196: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #197: 1025it [00:02, 489.36it/s, env_step=201728, len=29, n/ep=3, n/st=64, player_1/loss=193.927, player_2/loss=1312.218, rew=481.33]                                                                                                 


Epoch #197: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #198: 1025it [00:02, 490.24it/s, env_step=202752, len=15, n/ep=4, n/st=64, player_1/loss=322.374, player_2/loss=833.447, rew=119.25]                                                                                                  


Epoch #198: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #199: 1025it [00:02, 492.26it/s, env_step=203776, len=14, n/ep=5, n/st=64, player_1/loss=578.142, player_2/loss=2027.754, rew=113.80]                                                                                                 


Epoch #199: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #200: 1025it [00:02, 492.82it/s, env_step=204800, len=21, n/ep=3, n/st=64, player_1/loss=648.848, player_2/loss=2648.229, rew=244.67]                                                                                                 


Epoch #200: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #201: 1025it [00:02, 494.14it/s, env_step=205824, len=20, n/ep=3, n/st=64, player_1/loss=372.390, player_2/loss=2409.643, rew=210.00]                                                                                                 


Epoch #201: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #202: 1025it [00:02, 490.42it/s, env_step=206848, len=42, n/ep=1, n/st=64, player_1/loss=330.295, player_2/loss=1983.288, rew=1102.00]                                                                                                


Epoch #202: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #203: 1025it [00:02, 489.04it/s, env_step=207872, len=31, n/ep=2, n/st=64, player_1/loss=442.919, player_2/loss=1732.491, rew=513.00]                                                                                                 


Epoch #203: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #204: 1025it [00:02, 491.69it/s, env_step=208896, len=29, n/ep=3, n/st=64, player_1/loss=434.161, player_2/loss=990.700, rew=553.00]                                                                                                  


Epoch #204: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #205: 1025it [00:02, 493.41it/s, env_step=209920, len=23, n/ep=3, n/st=64, player_1/loss=236.722, player_2/loss=951.936, rew=293.00]                                                                                                  


Epoch #205: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #206: 1025it [00:02, 491.36it/s, env_step=210944, len=29, n/ep=2, n/st=64, player_1/loss=208.611, player_2/loss=1233.869, rew=434.00]                                                                                                 


Epoch #206: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #207: 1025it [00:02, 493.90it/s, env_step=211968, len=31, n/ep=2, n/st=64, player_1/loss=284.446, player_2/loss=1973.589, rew=503.00]                                                                                                 


Epoch #207: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #208: 1025it [00:02, 491.71it/s, env_step=212992, len=33, n/ep=2, n/st=64, player_1/loss=337.311, player_2/loss=2661.114, rew=592.00]                                                                                                 


Epoch #208: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #209: 1025it [00:02, 483.99it/s, env_step=214016, len=30, n/ep=2, n/st=64, player_1/loss=417.370, player_2/loss=2306.375, rew=496.00]                                                                                                 


Epoch #209: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #210: 1025it [00:02, 493.86it/s, env_step=215040, len=31, n/ep=2, n/st=64, player_1/loss=513.808, player_2/loss=1753.082, rew=517.00]                                                                                                 


Epoch #210: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #211: 1025it [00:02, 493.24it/s, env_step=216064, len=30, n/ep=2, n/st=64, player_1/loss=333.878, player_2/loss=1287.808, rew=489.50]                                                                                                 


Epoch #211: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #212: 1025it [00:02, 488.93it/s, env_step=217088, len=22, n/ep=3, n/st=64, player_1/loss=264.412, player_2/loss=626.216, rew=321.00]                                                                                                  


Epoch #212: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #213: 1025it [00:02, 491.58it/s, env_step=218112, len=41, n/ep=2, n/st=64, player_1/loss=323.483, player_2/loss=1140.937, rew=960.50]                                                                                                 


Epoch #213: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #214: 1025it [00:02, 494.57it/s, env_step=219136, len=24, n/ep=3, n/st=64, player_1/loss=388.022, player_2/loss=1583.341, rew=374.67]                                                                                                 


Epoch #214: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #215: 1025it [00:02, 490.05it/s, env_step=220160, len=29, n/ep=3, n/st=64, player_1/loss=196.798, player_2/loss=1352.455, rew=462.00]                                                                                                 


Epoch #215: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #216: 1025it [00:02, 489.61it/s, env_step=221184, len=25, n/ep=2, n/st=64, player_1/loss=194.549, player_2/loss=1476.600, rew=343.00]                                                                                                 


Epoch #216: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #217: 1025it [00:02, 496.16it/s, env_step=222208, len=33, n/ep=1, n/st=64, player_1/loss=280.807, player_2/loss=1664.308, rew=560.00]                                                                                                 


Epoch #217: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #218: 1025it [00:02, 494.11it/s, env_step=223232, len=30, n/ep=2, n/st=64, player_1/loss=232.146, player_2/loss=1346.440, rew=488.50]                                                                                                 


Epoch #218: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #219: 1025it [00:02, 492.73it/s, env_step=224256, len=23, n/ep=3, n/st=64, player_1/loss=415.182, player_2/loss=1276.722, rew=322.33]                                                                                                 


Epoch #219: test_reward: 35.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #220: 1025it [00:02, 487.40it/s, env_step=225280, len=15, n/ep=4, n/st=64, player_1/loss=459.602, player_2/loss=3111.281, rew=123.00]                                                                                                 


Epoch #220: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #221: 1025it [00:02, 488.33it/s, env_step=226304, len=22, n/ep=3, n/st=64, player_1/loss=418.261, player_2/loss=3923.168, rew=252.33]                                                                                                 


Epoch #221: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #222: 1025it [00:02, 494.04it/s, env_step=227328, len=32, n/ep=2, n/st=64, player_1/loss=362.102, player_2/loss=2453.152, rew=543.50]                                                                                                 


Epoch #222: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #223: 1025it [00:02, 489.44it/s, env_step=228352, len=32, n/ep=2, n/st=64, player_1/loss=352.236, player_2/loss=1963.454, rew=527.50]                                                                                                 


Epoch #223: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #224: 1025it [00:02, 492.70it/s, env_step=229376, len=32, n/ep=2, n/st=64, player_1/loss=193.967, player_2/loss=1704.706, rew=527.50]                                                                                                 


Epoch #224: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #225: 1025it [00:02, 493.29it/s, env_step=230400, len=27, n/ep=2, n/st=64, player_1/loss=328.617, player_2/loss=1529.569, rew=385.00]                                                                                                 


Epoch #225: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #226: 1025it [00:02, 491.24it/s, env_step=231424, len=22, n/ep=2, n/st=64, player_1/loss=446.088, player_2/loss=1628.129, rew=276.50]                                                                                                 


Epoch #226: test_reward: 119.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #227: 1025it [00:02, 491.94it/s, env_step=232448, len=25, n/ep=3, n/st=64, player_1/loss=581.050, player_2/loss=1747.251, rew=378.67]                                                                                                 


Epoch #227: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #228: 1025it [00:02, 493.55it/s, env_step=233472, len=17, n/ep=4, n/st=64, player_1/loss=451.639, player_2/loss=1939.077, rew=176.00]                                                                                                 


Epoch #228: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #229: 1025it [00:02, 488.13it/s, env_step=234496, len=22, n/ep=3, n/st=64, player_1/loss=241.727, player_2/loss=1642.956, rew=253.33]                                                                                                 


Epoch #229: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #230: 1025it [00:02, 491.41it/s, env_step=235520, len=22, n/ep=3, n/st=64, player_1/loss=271.142, player_2/loss=1228.351, rew=260.00]                                                                                                 


Epoch #230: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #231: 1025it [00:02, 491.20it/s, env_step=236544, len=24, n/ep=3, n/st=64, player_1/loss=274.309, player_2/loss=1253.607, rew=363.67]                                                                                                 


Epoch #231: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #232: 1025it [00:02, 486.42it/s, env_step=237568, len=35, n/ep=2, n/st=64, player_1/loss=264.199, player_2/loss=1557.149, rew=650.00]                                                                                                 


Epoch #232: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #233: 1025it [00:02, 491.79it/s, env_step=238592, len=36, n/ep=2, n/st=64, player_1/loss=411.543, player_2/loss=2096.926, rew=665.50]                                                                                                 


Epoch #233: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #234: 1025it [00:02, 492.51it/s, env_step=239616, len=18, n/ep=3, n/st=64, player_1/loss=417.987, player_2/loss=2455.432, rew=190.00]                                                                                                 


Epoch #234: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #235: 1025it [00:02, 492.52it/s, env_step=240640, len=32, n/ep=2, n/st=64, player_1/loss=432.902, player_2/loss=1875.023, rew=527.00]                                                                                                 


Epoch #235: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #236: 1025it [00:02, 492.90it/s, env_step=241664, len=36, n/ep=2, n/st=64, player_1/loss=503.735, player_2/loss=2051.838, rew=684.50]                                                                                                 


Epoch #236: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #237: 1025it [00:02, 490.84it/s, env_step=242688, len=37, n/ep=1, n/st=64, player_1/loss=340.257, player_2/loss=1755.521, rew=702.00]                                                                                                 


Epoch #237: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #238: 1025it [00:02, 491.30it/s, env_step=243712, len=15, n/ep=4, n/st=64, player_1/loss=342.831, player_2/loss=2270.730, rew=129.75]                                                                                                 


Epoch #238: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #239: 1025it [00:02, 495.23it/s, env_step=244736, len=22, n/ep=3, n/st=64, player_1/loss=418.115, player_2/loss=1912.092, rew=268.00]                                                                                                 


Epoch #239: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #240: 1025it [00:02, 491.72it/s, env_step=245760, len=27, n/ep=2, n/st=64, player_1/loss=506.284, player_2/loss=1550.522, rew=391.00]                                                                                                 


Epoch #240: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #241: 1025it [00:02, 489.61it/s, env_step=246784, len=32, n/ep=2, n/st=64, player_1/loss=485.848, player_2/loss=2713.784, rew=545.00]                                                                                                 


Epoch #241: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #242: 1025it [00:02, 488.65it/s, env_step=247808, len=26, n/ep=2, n/st=64, player_1/loss=380.310, player_2/loss=2969.063, rew=366.50]                                                                                                 


Epoch #242: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #243: 1025it [00:02, 489.44it/s, env_step=248832, len=30, n/ep=3, n/st=64, player_1/loss=489.053, player_2/loss=2490.722, rew=478.33]                                                                                                 


Epoch #243: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #244: 1025it [00:02, 494.99it/s, env_step=249856, len=17, n/ep=4, n/st=64, player_1/loss=533.997, player_2/loss=1349.966, rew=153.00]                                                                                                 


Epoch #244: test_reward: 152.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #245: 1025it [00:02, 487.90it/s, env_step=250880, len=37, n/ep=1, n/st=64, player_1/loss=344.556, player_2/loss=1588.421, rew=702.00]                                                                                                 


Epoch #245: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #246: 1025it [00:02, 495.58it/s, env_step=251904, len=31, n/ep=2, n/st=64, player_1/loss=290.404, player_2/loss=1977.025, rew=527.00]                                                                                                 


Epoch #246: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #247: 1025it [00:02, 493.49it/s, env_step=252928, len=30, n/ep=3, n/st=64, player_1/loss=314.524, player_2/loss=2247.912, rew=487.67]                                                                                                 


Epoch #247: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #248: 1025it [00:02, 493.51it/s, env_step=253952, len=40, n/ep=1, n/st=64, player_1/loss=319.827, player_2/loss=2549.793, rew=819.00]                                                                                                 


Epoch #248: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #249: 1025it [00:02, 493.73it/s, env_step=254976, len=30, n/ep=2, n/st=64, player_1/loss=229.095, player_2/loss=1743.477, rew=464.50]                                                                                                 


Epoch #249: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #250: 1025it [00:02, 491.42it/s, env_step=256000, len=17, n/ep=4, n/st=64, player_1/loss=132.167, player_2/loss=1907.337, rew=177.50]                                                                                                 


Epoch #250: test_reward: 594.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #251: 1025it [00:02, 489.92it/s, env_step=257024, len=34, n/ep=2, n/st=64, player_1/loss=132.792, player_2/loss=1675.113, rew=621.50]                                                                                                 


Epoch #251: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #252: 1025it [00:02, 488.95it/s, env_step=258048, len=14, n/ep=4, n/st=64, player_1/loss=298.752, player_2/loss=2423.191, rew=108.50]                                                                                                 


Epoch #252: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #253: 1025it [00:02, 490.56it/s, env_step=259072, len=26, n/ep=3, n/st=64, player_1/loss=509.871, player_2/loss=3420.775, rew=359.33]                                                                                                 


Epoch #253: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #254: 1025it [00:02, 490.70it/s, env_step=260096, len=34, n/ep=2, n/st=64, player_1/loss=771.721, player_2/loss=2715.693, rew=596.00]                                                                                                 


Epoch #254: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #255: 1025it [00:02, 481.14it/s, env_step=261120, len=40, n/ep=2, n/st=64, player_1/loss=678.392, rew=921.00]                                                                                                                         


Epoch #255: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #256: 1025it [00:02, 492.47it/s, env_step=262144, len=31, n/ep=2, n/st=64, player_1/loss=629.658, player_2/loss=2480.692, rew=517.00]                                                                                                 


Epoch #256: test_reward: 44.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #257: 1025it [00:02, 471.20it/s, env_step=263168, len=16, n/ep=4, n/st=64, player_1/loss=568.733, player_2/loss=3365.098, rew=147.00]                                                                                                 


Epoch #257: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #258: 1025it [00:02, 488.10it/s, env_step=264192, len=16, n/ep=4, n/st=64, player_1/loss=363.180, player_2/loss=3265.180, rew=148.75]                                                                                                 


Epoch #258: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #259: 1025it [00:02, 484.64it/s, env_step=265216, len=37, n/ep=2, n/st=64, player_1/loss=330.827, player_2/loss=2199.213, rew=721.00]                                                                                                 


Epoch #259: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #260: 1025it [00:02, 488.90it/s, env_step=266240, len=27, n/ep=2, n/st=64, player_1/loss=272.201, player_2/loss=1126.065, rew=395.00]                                                                                                 


Epoch #260: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #261: 1025it [00:02, 492.43it/s, env_step=267264, len=34, n/ep=2, n/st=64, player_1/loss=181.790, player_2/loss=745.172, rew=621.50]                                                                                                  


Epoch #261: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #262: 1025it [00:02, 489.13it/s, env_step=268288, len=40, n/ep=2, n/st=64, player_1/loss=312.730, player_2/loss=801.542, rew=940.50]                                                                                                  


Epoch #262: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #263: 1025it [00:02, 494.50it/s, env_step=269312, len=16, n/ep=4, n/st=64, player_1/loss=521.894, player_2/loss=1193.359, rew=209.50]                                                                                                 


Epoch #263: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #264: 1025it [00:02, 492.59it/s, env_step=270336, len=33, n/ep=2, n/st=64, player_1/loss=513.325, player_2/loss=1394.957, rew=572.50]                                                                                                 


Epoch #264: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #265: 1025it [00:02, 494.87it/s, env_step=271360, len=33, n/ep=2, n/st=64, player_1/loss=523.590, player_2/loss=1714.994, rew=583.00]                                                                                                 


Epoch #265: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #266: 1025it [00:02, 492.59it/s, env_step=272384, len=37, n/ep=2, n/st=64, player_1/loss=331.514, player_2/loss=1742.573, rew=724.00]                                                                                                 


Epoch #266: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #267: 1025it [00:02, 491.60it/s, env_step=273408, len=35, n/ep=2, n/st=64, player_1/loss=347.834, player_2/loss=1099.620, rew=753.50]                                                                                                 


Epoch #267: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #268: 1025it [00:02, 484.63it/s, env_step=274432, len=29, n/ep=2, n/st=64, player_1/loss=451.554, player_2/loss=918.079, rew=477.00]                                                                                                  


Epoch #268: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #269: 1025it [00:02, 492.78it/s, env_step=275456, len=38, n/ep=2, n/st=64, player_1/loss=518.993, player_2/loss=1742.642, rew=740.00]                                                                                                 


Epoch #269: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #270: 1025it [00:02, 487.65it/s, env_step=276480, len=7, n/ep=9, n/st=64, player_1/loss=312.303, rew=33.44]                                                                                                                           


Epoch #270: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #271: 1025it [00:02, 492.93it/s, env_step=277504, len=33, n/ep=2, n/st=64, player_1/loss=233.529, player_2/loss=1396.717, rew=583.00]                                                                                                 


Epoch #271: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #272: 1025it [00:02, 488.59it/s, env_step=278528, len=37, n/ep=2, n/st=64, player_1/loss=281.360, player_2/loss=1015.998, rew=831.00]                                                                                                 


Epoch #272: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #273: 1025it [00:02, 444.27it/s, env_step=279552, len=41, n/ep=1, n/st=64, player_1/loss=268.661, player_2/loss=1363.437, rew=860.00]                                                                                                 


Epoch #273: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #274: 1025it [00:02, 494.14it/s, env_step=280576, len=34, n/ep=2, n/st=64, player_1/loss=176.477, player_2/loss=1555.798, rew=594.50]                                                                                                 


Epoch #274: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #275: 1025it [00:02, 489.34it/s, env_step=281600, len=29, n/ep=2, n/st=64, player_1/loss=114.948, player_2/loss=1382.844, rew=455.00]                                                                                                 


Epoch #275: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #276: 1025it [00:02, 494.50it/s, env_step=282624, len=40, n/ep=1, n/st=64, player_1/loss=252.904, player_2/loss=1923.792, rew=819.00]                                                                                                 


Epoch #276: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #277: 1025it [00:02, 489.94it/s, env_step=283648, len=27, n/ep=3, n/st=64, player_1/loss=423.485, player_2/loss=2412.630, rew=397.67]                                                                                                 


Epoch #277: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #278: 1025it [00:02, 489.01it/s, env_step=284672, len=37, n/ep=1, n/st=64, player_1/loss=291.042, player_2/loss=2908.588, rew=702.00]                                                                                                 


Epoch #278: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #279: 1025it [00:02, 491.02it/s, env_step=285696, len=28, n/ep=2, n/st=64, player_1/loss=417.833, player_2/loss=3486.991, rew=429.50]                                                                                                 


Epoch #279: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #280: 1025it [00:02, 490.41it/s, env_step=286720, len=38, n/ep=2, n/st=64, player_1/loss=372.244, player_2/loss=2462.949, rew=865.50]                                                                                                 


Epoch #280: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #281: 1025it [00:02, 489.00it/s, env_step=287744, len=28, n/ep=2, n/st=64, player_1/loss=205.652, player_2/loss=2556.364, rew=445.50]                                                                                                 


Epoch #281: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #282: 1025it [00:02, 492.52it/s, env_step=288768, len=37, n/ep=2, n/st=64, player_1/loss=329.359, player_2/loss=1445.710, rew=721.00]                                                                                                 


Epoch #282: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #283: 1025it [00:02, 487.94it/s, env_step=289792, len=26, n/ep=3, n/st=64, player_1/loss=517.346, player_2/loss=1000.010, rew=371.33]                                                                                                 


Epoch #283: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #284: 1025it [00:02, 490.86it/s, env_step=290816, len=34, n/ep=2, n/st=64, player_1/loss=474.081, player_2/loss=2234.455, rew=726.00]                                                                                                 


Epoch #284: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #285: 1025it [00:02, 489.52it/s, env_step=291840, len=37, n/ep=1, n/st=64, player_1/loss=292.343, player_2/loss=3757.519, rew=702.00]                                                                                                 


Epoch #285: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #286: 1025it [00:02, 490.99it/s, env_step=292864, len=33, n/ep=2, n/st=64, player_1/loss=416.975, player_2/loss=3004.509, rew=700.50]                                                                                                 


Epoch #286: test_reward: 860.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #287: 1025it [00:02, 490.82it/s, env_step=293888, len=29, n/ep=2, n/st=64, player_1/loss=354.823, player_2/loss=1370.402, rew=485.00]                                                                                                 


Epoch #287: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #288: 1025it [00:02, 492.15it/s, env_step=294912, len=39, n/ep=1, n/st=64, player_1/loss=345.915, player_2/loss=1456.686, rew=779.00]                                                                                                 


Epoch #288: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #289: 1025it [00:02, 492.33it/s, env_step=295936, len=31, n/ep=2, n/st=64, player_1/loss=169.937, player_2/loss=1601.397, rew=545.00]                                                                                                 


Epoch #289: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #290: 1025it [00:02, 491.98it/s, env_step=296960, len=31, n/ep=2, n/st=64, player_1/loss=160.703, player_2/loss=1218.692, rew=513.00]                                                                                                 


Epoch #290: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #291: 1025it [00:02, 492.96it/s, env_step=297984, len=30, n/ep=2, n/st=64, player_1/loss=104.630, player_2/loss=1131.166, rew=480.50]                                                                                                 


Epoch #291: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #292: 1025it [00:02, 490.42it/s, env_step=299008, len=27, n/ep=3, n/st=64, player_1/loss=187.641, player_2/loss=2095.216, rew=402.33]                                                                                                 


Epoch #292: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #293: 1025it [00:02, 490.28it/s, env_step=300032, len=40, n/ep=1, n/st=64, player_1/loss=337.199, player_2/loss=2147.952, rew=819.00]                                                                                                 


Epoch #293: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #294: 1025it [00:02, 493.03it/s, env_step=301056, len=35, n/ep=2, n/st=64, player_1/loss=471.513, player_2/loss=2121.398, rew=648.00]                                                                                                 


Epoch #294: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #295: 1025it [00:02, 493.01it/s, env_step=302080, len=19, n/ep=2, n/st=64, player_1/loss=393.918, player_2/loss=1674.361, rew=200.00]                                                                                                 


Epoch #295: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #296: 1025it [00:02, 490.87it/s, env_step=303104, len=42, n/ep=1, n/st=64, player_1/loss=320.392, player_2/loss=780.476, rew=1102.00]                                                                                                 


Epoch #296: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #297: 1025it [00:02, 489.62it/s, env_step=304128, len=30, n/ep=2, n/st=64, player_1/loss=413.259, player_2/loss=1512.248, rew=482.00]                                                                                                 


Epoch #297: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #298: 1025it [00:02, 492.93it/s, env_step=305152, len=38, n/ep=2, n/st=64, player_1/loss=450.671, player_2/loss=1525.620, rew=759.50]                                                                                                 


Epoch #298: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #299: 1025it [00:02, 491.29it/s, env_step=306176, len=29, n/ep=2, n/st=64, player_1/loss=324.115, player_2/loss=1142.232, rew=434.50]                                                                                                 


Epoch #299: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #300: 1025it [00:02, 493.54it/s, env_step=307200, len=37, n/ep=1, n/st=64, player_1/loss=354.728, player_2/loss=1203.548, rew=702.00]                                                                                                 


Epoch #300: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #301: 1025it [00:02, 492.92it/s, env_step=308224, len=31, n/ep=2, n/st=64, player_1/loss=429.772, player_2/loss=1349.789, rew=517.00]                                                                                                 


Epoch #301: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #302: 1025it [00:02, 489.11it/s, env_step=309248, len=31, n/ep=2, n/st=64, player_1/loss=492.504, player_2/loss=1824.581, rew=495.50]                                                                                                 


Epoch #302: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #303: 1025it [00:02, 491.30it/s, env_step=310272, len=30, n/ep=2, n/st=64, player_1/loss=552.422, player_2/loss=1155.233, rew=482.00]                                                                                                 


Epoch #303: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #304: 1025it [00:02, 494.25it/s, env_step=311296, len=38, n/ep=2, n/st=64, player_1/loss=510.826, player_2/loss=1265.971, rew=865.50]                                                                                                 


Epoch #304: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #305: 1025it [00:02, 489.41it/s, env_step=312320, len=35, n/ep=2, n/st=64, player_1/loss=399.611, player_2/loss=1646.424, rew=648.00]                                                                                                 


Epoch #305: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #306: 1025it [00:02, 491.84it/s, env_step=313344, len=39, n/ep=1, n/st=64, player_1/loss=354.343, player_2/loss=1647.266, rew=779.00]                                                                                                 


Epoch #306: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #307: 1025it [00:02, 491.22it/s, env_step=314368, len=31, n/ep=2, n/st=64, player_1/loss=253.449, player_2/loss=1546.797, rew=495.50]                                                                                                 


Epoch #307: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #308: 1025it [00:02, 490.90it/s, env_step=315392, len=37, n/ep=1, n/st=64, player_1/loss=329.236, player_2/loss=2369.202, rew=702.00]                                                                                                 


Epoch #308: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #309: 1025it [00:02, 492.11it/s, env_step=316416, len=30, n/ep=2, n/st=64, player_1/loss=360.273, player_2/loss=2990.020, rew=515.50]                                                                                                 


Epoch #309: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #310: 1025it [00:02, 491.88it/s, env_step=317440, len=29, n/ep=3, n/st=64, player_1/loss=333.998, player_2/loss=2927.244, rew=474.67]                                                                                                 


Epoch #310: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #311: 1025it [00:02, 492.01it/s, env_step=318464, len=33, n/ep=2, n/st=64, player_1/loss=327.289, player_2/loss=1706.191, rew=587.00]                                                                                                 


Epoch #311: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #312: 1025it [00:02, 492.47it/s, env_step=319488, len=30, n/ep=3, n/st=64, player_1/loss=581.410, player_2/loss=2132.819, rew=498.00]                                                                                                 


Epoch #312: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #313: 1025it [00:02, 493.35it/s, env_step=320512, len=26, n/ep=4, n/st=64, player_1/loss=570.737, player_2/loss=2141.922, rew=402.75]                                                                                                 


Epoch #313: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #314: 1025it [00:02, 486.09it/s, env_step=321536, len=36, n/ep=2, n/st=64, player_1/loss=319.019, player_2/loss=2193.625, rew=684.50]                                                                                                 


Epoch #314: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #315: 1025it [00:02, 489.65it/s, env_step=322560, len=31, n/ep=2, n/st=64, player_1/loss=210.353, player_2/loss=2429.265, rew=512.00]                                                                                                 


Epoch #315: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #316: 1025it [00:02, 489.04it/s, env_step=323584, len=30, n/ep=2, n/st=64, player_1/loss=425.064, player_2/loss=2387.002, rew=496.00]                                                                                                 


Epoch #316: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #317: 1025it [00:02, 491.33it/s, env_step=324608, len=30, n/ep=2, n/st=64, player_1/loss=442.660, player_2/loss=1848.599, rew=496.00]                                                                                                 


Epoch #317: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #318: 1025it [00:02, 492.13it/s, env_step=325632, len=22, n/ep=3, n/st=64, player_1/loss=244.874, player_2/loss=1438.995, rew=332.33]                                                                                                 


Epoch #318: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #319: 1025it [00:02, 478.62it/s, env_step=326656, len=36, n/ep=2, n/st=64, player_1/loss=223.003, player_2/loss=1091.236, rew=684.50]                                                                                                 


Epoch #319: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #320: 1025it [00:02, 439.60it/s, env_step=327680, len=20, n/ep=3, n/st=64, player_1/loss=97.935, player_2/loss=400.318, rew=223.33]                                                                                                   


Epoch #320: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #321: 1025it [00:02, 461.31it/s, env_step=328704, len=38, n/ep=2, n/st=64, player_1/loss=185.948, player_2/loss=690.445, rew=740.50]                                                                                                  


Epoch #321: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #322: 1025it [00:02, 490.24it/s, env_step=329728, len=35, n/ep=2, n/st=64, player_1/loss=227.804, player_2/loss=773.370, rew=653.00]                                                                                                  


Epoch #322: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #323: 1025it [00:02, 490.95it/s, env_step=330752, len=31, n/ep=2, n/st=64, player_1/loss=1014.269, player_2/loss=1693.039, rew=495.50]                                                                                                


Epoch #323: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #324: 1025it [00:02, 486.74it/s, env_step=331776, len=28, n/ep=2, n/st=64, player_1/loss=1175.689, player_2/loss=2039.447, rew=405.00]                                                                                                


Epoch #324: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #325: 1025it [00:02, 489.72it/s, env_step=332800, len=24, n/ep=2, n/st=64, player_1/loss=540.224, player_2/loss=2144.795, rew=317.50]                                                                                                 


Epoch #325: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #326: 1025it [00:02, 491.29it/s, env_step=333824, len=37, n/ep=2, n/st=64, player_1/loss=693.570, player_2/loss=1791.393, rew=721.00]                                                                                                 


Epoch #326: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #327: 1025it [00:02, 487.65it/s, env_step=334848, len=27, n/ep=2, n/st=64, player_1/loss=884.453, player_2/loss=1401.963, rew=437.50]                                                                                                 


Epoch #327: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #328: 1025it [00:02, 492.12it/s, env_step=335872, len=27, n/ep=2, n/st=64, player_1/loss=568.217, player_2/loss=1251.116, rew=385.00]                                                                                                 


Epoch #328: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #329: 1025it [00:02, 489.71it/s, env_step=336896, len=31, n/ep=2, n/st=64, player_1/loss=250.037, player_2/loss=2869.237, rew=495.00]                                                                                                 


Epoch #329: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #330: 1025it [00:02, 492.42it/s, env_step=337920, len=37, n/ep=1, n/st=64, player_1/loss=514.376, player_2/loss=3775.675, rew=702.00]                                                                                                 


Epoch #330: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #331: 1025it [00:02, 493.81it/s, env_step=338944, len=38, n/ep=1, n/st=64, player_1/loss=573.346, player_2/loss=2489.530, rew=740.00]                                                                                                 


Epoch #331: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #332: 1025it [00:02, 489.46it/s, env_step=339968, len=38, n/ep=2, n/st=64, player_1/loss=311.425, player_2/loss=1933.302, rew=742.00]                                                                                                 


Epoch #332: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #333: 1025it [00:02, 493.68it/s, env_step=340992, len=30, n/ep=2, n/st=64, player_1/loss=356.900, player_2/loss=1873.056, rew=488.50]                                                                                                 


Epoch #333: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #334: 1025it [00:02, 490.36it/s, env_step=342016, len=26, n/ep=2, n/st=64, player_1/loss=359.185, player_2/loss=1808.829, rew=366.50]                                                                                                 


Epoch #334: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #335: 1025it [00:02, 494.53it/s, env_step=343040, len=32, n/ep=2, n/st=64, player_1/loss=489.628, player_2/loss=1823.149, rew=527.50]                                                                                                 


Epoch #335: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #336: 1025it [00:02, 490.54it/s, env_step=344064, len=24, n/ep=2, n/st=64, player_1/loss=360.099, player_2/loss=1469.394, rew=299.00]                                                                                                 


Epoch #336: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #337: 1025it [00:02, 490.58it/s, env_step=345088, len=31, n/ep=1, n/st=64, player_1/loss=318.011, player_2/loss=1384.186, rew=495.00]                                                                                                 


Epoch #337: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #338: 1025it [00:02, 488.34it/s, env_step=346112, len=36, n/ep=2, n/st=64, player_1/loss=320.325, player_2/loss=2600.693, rew=665.50]                                                                                                 


Epoch #338: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #339: 1025it [00:02, 494.00it/s, env_step=347136, len=27, n/ep=3, n/st=64, player_1/loss=338.544, player_2/loss=1584.880, rew=401.00]                                                                                                 


Epoch #339: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #340: 1025it [00:02, 491.46it/s, env_step=348160, len=26, n/ep=3, n/st=64, player_1/loss=343.101, player_2/loss=2176.253, rew=408.67]                                                                                                 


Epoch #340: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #341: 1025it [00:02, 491.64it/s, env_step=349184, len=28, n/ep=1, n/st=64, player_1/loss=440.702, player_2/loss=2022.793, rew=405.00]                                                                                                 


Epoch #341: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #342: 1025it [00:02, 489.91it/s, env_step=350208, len=33, n/ep=2, n/st=64, player_1/loss=500.633, player_2/loss=1953.706, rew=562.00]                                                                                                 


Epoch #342: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #343: 1025it [00:02, 489.16it/s, env_step=351232, len=25, n/ep=2, n/st=64, player_1/loss=416.232, player_2/loss=1952.906, rew=364.50]                                                                                                 


Epoch #343: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #344: 1025it [00:02, 488.91it/s, env_step=352256, len=32, n/ep=2, n/st=64, player_1/loss=284.751, player_2/loss=1607.693, rew=529.00]                                                                                                 


Epoch #344: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #345: 1025it [00:02, 490.50it/s, env_step=353280, len=36, n/ep=2, n/st=64, player_1/loss=242.761, player_2/loss=1563.222, rew=686.50]                                                                                                 


Epoch #345: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #346: 1025it [00:02, 491.79it/s, env_step=354304, len=39, n/ep=1, n/st=64, player_1/loss=324.129, player_2/loss=1585.041, rew=779.00]                                                                                                 


Epoch #346: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #347: 1025it [00:02, 491.73it/s, env_step=355328, len=23, n/ep=2, n/st=64, player_1/loss=196.417, player_2/loss=1518.244, rew=277.00]                                                                                                 


Epoch #347: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #348: 1025it [00:02, 492.21it/s, env_step=356352, len=24, n/ep=3, n/st=64, player_1/loss=229.389, player_2/loss=1739.044, rew=360.00]                                                                                                 


Epoch #348: test_reward: 65.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #349: 1025it [00:02, 489.98it/s, env_step=357376, len=38, n/ep=2, n/st=64, player_1/loss=212.703, player_2/loss=2022.232, rew=740.50]                                                                                                 


Epoch #349: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #350: 1025it [00:02, 494.25it/s, env_step=358400, len=32, n/ep=2, n/st=64, player_1/loss=158.588, player_2/loss=2321.679, rew=529.00]                                                                                                 


Epoch #350: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #351: 1025it [00:02, 491.58it/s, env_step=359424, len=42, n/ep=1, n/st=64, player_1/loss=249.965, player_2/loss=1859.472, rew=1102.00]                                                                                                


Epoch #351: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #352: 1025it [00:02, 490.07it/s, env_step=360448, len=15, n/ep=4, n/st=64, player_1/loss=343.978, player_2/loss=1703.824, rew=124.00]                                                                                                 


Epoch #352: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #353: 1025it [00:02, 493.44it/s, env_step=361472, len=29, n/ep=2, n/st=64, player_1/loss=351.929, player_2/loss=2493.106, rew=449.00]                                                                                                 


Epoch #353: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #354: 1025it [00:02, 490.45it/s, env_step=362496, len=29, n/ep=2, n/st=64, player_1/loss=375.221, player_2/loss=2103.904, rew=477.00]                                                                                                 


Epoch #354: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #355: 1025it [00:02, 489.74it/s, env_step=363520, len=35, n/ep=2, n/st=64, player_1/loss=294.326, player_2/loss=1164.360, rew=637.00]                                                                                                 


Epoch #355: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #356: 1025it [00:02, 490.95it/s, env_step=364544, len=36, n/ep=2, n/st=64, player_1/loss=208.391, player_2/loss=1065.149, rew=783.00]                                                                                                 


Epoch #356: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #357: 1025it [00:02, 491.71it/s, env_step=365568, len=39, n/ep=2, n/st=64, player_1/loss=224.475, player_2/loss=1717.485, rew=781.00]                                                                                                 


Epoch #357: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #358: 1025it [00:02, 493.40it/s, env_step=366592, len=36, n/ep=2, n/st=64, player_1/loss=192.069, player_2/loss=2588.839, rew=698.50]                                                                                                 


Epoch #358: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #359: 1025it [00:02, 488.98it/s, env_step=367616, len=30, n/ep=2, n/st=64, player_1/loss=246.819, player_2/loss=1908.935, rew=482.50]                                                                                                 


Epoch #359: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #360: 1025it [00:02, 490.74it/s, env_step=368640, len=25, n/ep=2, n/st=64, player_1/loss=405.738, player_2/loss=1308.426, rew=573.00]                                                                                                 


Epoch #360: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #361: 1025it [00:02, 494.54it/s, env_step=369664, len=36, n/ep=2, n/st=64, player_1/loss=528.284, player_2/loss=1992.118, rew=665.50]                                                                                                 


Epoch #361: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #362: 1025it [00:02, 492.19it/s, env_step=370688, len=24, n/ep=3, n/st=64, player_1/loss=639.623, player_2/loss=1579.651, rew=323.67]                                                                                                 


Epoch #362: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #363: 1025it [00:02, 492.63it/s, env_step=371712, len=28, n/ep=2, n/st=64, player_1/loss=460.588, player_2/loss=1330.434, rew=434.50]                                                                                                 


Epoch #363: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #364: 1025it [00:02, 490.19it/s, env_step=372736, len=27, n/ep=2, n/st=64, player_1/loss=454.717, player_2/loss=2244.332, rew=406.00]                                                                                                 


Epoch #364: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #365: 1025it [00:02, 491.24it/s, env_step=373760, len=33, n/ep=1, n/st=64, player_1/loss=730.949, player_2/loss=2240.063, rew=560.00]                                                                                                 


Epoch #365: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #366: 1025it [00:02, 491.72it/s, env_step=374784, len=27, n/ep=3, n/st=64, player_1/loss=765.890, player_2/loss=2812.278, rew=381.33]                                                                                                 


Epoch #366: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #367: 1025it [00:02, 489.60it/s, env_step=375808, len=29, n/ep=2, n/st=64, player_1/loss=341.881, player_2/loss=2523.409, rew=434.50]                                                                                                 


Epoch #367: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #368: 1025it [00:02, 491.78it/s, env_step=376832, len=29, n/ep=2, n/st=64, player_1/loss=163.493, player_2/loss=1717.547, rew=459.00]                                                                                                 


Epoch #368: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #369: 1025it [00:02, 491.83it/s, env_step=377856, len=31, n/ep=2, n/st=64, player_1/loss=320.949, player_2/loss=2294.233, rew=495.00]                                                                                                 


Epoch #369: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #370: 1025it [00:02, 487.27it/s, env_step=378880, len=31, n/ep=2, n/st=64, player_1/loss=559.959, player_2/loss=2719.051, rew=512.00]                                                                                                 


Epoch #370: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #371: 1025it [00:02, 488.53it/s, env_step=379904, len=29, n/ep=1, n/st=64, player_1/loss=594.169, player_2/loss=2889.376, rew=434.00]                                                                                                 


Epoch #371: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #372: 1025it [00:02, 492.88it/s, env_step=380928, len=30, n/ep=2, n/st=64, player_1/loss=524.643, player_2/loss=2269.974, rew=479.50]                                                                                                 


Epoch #372: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #373: 1025it [00:02, 488.54it/s, env_step=381952, len=28, n/ep=2, n/st=64, player_1/loss=504.853, player_2/loss=2094.742, rew=409.50]                                                                                                 


Epoch #373: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #374: 1025it [00:02, 492.21it/s, env_step=382976, len=37, n/ep=1, n/st=64, player_1/loss=485.525, rew=702.00]                                                                                                                         


Epoch #374: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #375: 1025it [00:02, 489.06it/s, env_step=384000, len=35, n/ep=1, n/st=64, player_1/loss=381.086, player_2/loss=2463.084, rew=629.00]                                                                                                 


Epoch #375: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #376: 1025it [00:02, 491.58it/s, env_step=385024, len=26, n/ep=2, n/st=64, player_1/loss=389.237, player_2/loss=2333.405, rew=410.50]                                                                                                 


Epoch #376: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #377: 1025it [00:02, 491.89it/s, env_step=386048, len=27, n/ep=2, n/st=64, player_1/loss=291.910, player_2/loss=2338.656, rew=395.00]                                                                                                 


Epoch #377: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #378: 1025it [00:02, 491.43it/s, env_step=387072, len=22, n/ep=3, n/st=64, player_1/loss=144.968, player_2/loss=3196.606, rew=268.00]                                                                                                 


Epoch #378: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #379: 1025it [00:02, 492.57it/s, env_step=388096, len=37, n/ep=1, n/st=64, player_1/loss=233.683, player_2/loss=2828.669, rew=702.00]                                                                                                 


Epoch #379: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #380: 1025it [00:02, 494.19it/s, env_step=389120, len=32, n/ep=2, n/st=64, player_1/loss=400.680, player_2/loss=1280.841, rew=553.50]                                                                                                 


Epoch #380: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #381: 1025it [00:02, 490.11it/s, env_step=390144, len=26, n/ep=2, n/st=64, player_1/loss=397.534, player_2/loss=897.608, rew=418.50]                                                                                                  


Epoch #381: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #382: 1025it [00:02, 489.27it/s, env_step=391168, len=21, n/ep=3, n/st=64, player_1/loss=300.331, player_2/loss=954.055, rew=281.33]                                                                                                  


Epoch #382: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #383: 1025it [00:02, 490.95it/s, env_step=392192, len=34, n/ep=1, n/st=64, player_1/loss=463.423, player_2/loss=1583.473, rew=594.00]                                                                                                 


Epoch #383: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #384: 1025it [00:02, 493.05it/s, env_step=393216, len=26, n/ep=2, n/st=64, player_1/loss=405.406, player_2/loss=1277.499, rew=441.50]                                                                                                 


Epoch #384: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #385: 1025it [00:02, 490.73it/s, env_step=394240, len=40, n/ep=1, n/st=64, player_1/loss=126.536, player_2/loss=836.480, rew=819.00]                                                                                                  


Epoch #385: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #386: 1025it [00:02, 488.97it/s, env_step=395264, len=34, n/ep=2, n/st=64, player_1/loss=173.461, player_2/loss=815.171, rew=626.50]                                                                                                  


Epoch #386: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #387: 1025it [00:02, 490.13it/s, env_step=396288, len=31, n/ep=3, n/st=64, player_1/loss=283.927, player_2/loss=1459.523, rew=514.00]                                                                                                 


Epoch #387: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #388: 1025it [00:02, 491.55it/s, env_step=397312, len=31, n/ep=2, n/st=64, player_2/loss=1863.651, rew=511.00]                                                                                                                        


Epoch #388: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #389: 1025it [00:02, 489.04it/s, env_step=398336, len=23, n/ep=2, n/st=64, player_1/loss=429.961, player_2/loss=1751.289, rew=307.00]                                                                                                 


Epoch #389: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #390: 1025it [00:02, 492.70it/s, env_step=399360, len=36, n/ep=2, n/st=64, player_1/loss=384.663, player_2/loss=1023.794, rew=689.50]                                                                                                 


Epoch #390: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #391: 1025it [00:02, 487.90it/s, env_step=400384, len=35, n/ep=2, n/st=64, player_1/loss=460.920, player_2/loss=1644.102, rew=629.50]                                                                                                 


Epoch #391: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #392: 1025it [00:02, 487.96it/s, env_step=401408, len=37, n/ep=2, n/st=64, player_1/loss=269.124, player_2/loss=3152.397, rew=721.00]                                                                                                 


Epoch #392: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #393: 1025it [00:02, 489.69it/s, env_step=402432, len=35, n/ep=2, n/st=64, player_1/loss=289.064, player_2/loss=2448.589, rew=641.50]                                                                                                 


Epoch #393: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #394: 1025it [00:02, 492.72it/s, env_step=403456, len=26, n/ep=2, n/st=64, player_1/loss=214.316, player_2/loss=1372.780, rew=391.50]                                                                                                 


Epoch #394: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #395: 1025it [00:02, 487.27it/s, env_step=404480, len=30, n/ep=2, n/st=64, player_1/loss=272.539, player_2/loss=2156.266, rew=507.50]                                                                                                 


Epoch #395: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #396: 1025it [00:02, 489.97it/s, env_step=405504, len=30, n/ep=2, n/st=64, player_1/loss=315.400, player_2/loss=3041.413, rew=476.50]                                                                                                 


Epoch #396: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #397: 1025it [00:02, 490.95it/s, env_step=406528, len=38, n/ep=2, n/st=64, player_1/loss=257.714, player_2/loss=2355.809, rew=740.00]                                                                                                 


Epoch #397: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #398: 1025it [00:02, 491.66it/s, env_step=407552, len=35, n/ep=2, n/st=64, player_1/loss=290.025, player_2/loss=2431.539, rew=650.00]                                                                                                 


Epoch #398: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #399: 1025it [00:02, 491.26it/s, env_step=408576, len=30, n/ep=2, n/st=64, player_1/loss=211.011, player_2/loss=1615.099, rew=480.50]                                                                                                 


Epoch #399: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #400: 1025it [00:02, 491.25it/s, env_step=409600, len=36, n/ep=2, n/st=64, player_1/loss=603.092, player_2/loss=2294.458, rew=665.50]                                                                                                 


Epoch #400: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #401: 1025it [00:02, 491.31it/s, env_step=410624, len=28, n/ep=2, n/st=64, player_1/loss=654.512, player_2/loss=2569.053, rew=429.50]                                                                                                 


Epoch #401: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #402: 1025it [00:02, 488.95it/s, env_step=411648, len=33, n/ep=2, n/st=64, player_1/loss=242.659, player_2/loss=3143.735, rew=578.00]                                                                                                 


Epoch #402: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #403: 1025it [00:02, 492.11it/s, env_step=412672, len=28, n/ep=3, n/st=64, player_1/loss=244.077, player_2/loss=3231.062, rew=441.00]                                                                                                 


Epoch #403: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #404: 1025it [00:02, 489.33it/s, env_step=413696, len=32, n/ep=2, n/st=64, player_1/loss=254.617, player_2/loss=2675.392, rew=553.50]                                                                                                 


Epoch #404: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #405: 1025it [00:02, 491.28it/s, env_step=414720, len=30, n/ep=2, n/st=64, player_1/loss=596.655, player_2/loss=1537.062, rew=489.50]                                                                                                 


Epoch #405: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #406: 1025it [00:02, 488.65it/s, env_step=415744, len=38, n/ep=1, n/st=64, player_1/loss=693.349, player_2/loss=1589.612, rew=740.00]                                                                                                 


Epoch #406: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #407: 1025it [00:02, 491.86it/s, env_step=416768, len=23, n/ep=2, n/st=64, player_1/loss=320.375, rew=299.50]                                                                                                                         


Epoch #407: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #408: 1025it [00:02, 490.15it/s, env_step=417792, len=35, n/ep=2, n/st=64, player_1/loss=404.066, player_2/loss=1576.790, rew=631.00]                                                                                                 


Epoch #408: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #409: 1025it [00:02, 490.53it/s, env_step=418816, len=29, n/ep=1, n/st=64, player_1/loss=483.321, player_2/loss=2591.691, rew=434.00]                                                                                                 


Epoch #409: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #410: 1025it [00:02, 489.56it/s, env_step=419840, len=34, n/ep=2, n/st=64, player_1/loss=290.303, player_2/loss=2715.358, rew=614.50]                                                                                                 


Epoch #410: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #411: 1025it [00:02, 491.84it/s, env_step=420864, len=23, n/ep=2, n/st=64, player_1/loss=250.199, player_2/loss=1824.188, rew=299.50]                                                                                                 


Epoch #411: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #412: 1025it [00:02, 489.34it/s, env_step=421888, len=29, n/ep=2, n/st=64, player_1/loss=237.116, player_2/loss=1999.506, rew=434.00]                                                                                                 


Epoch #412: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #413: 1025it [00:02, 490.31it/s, env_step=422912, len=20, n/ep=2, n/st=64, player_1/loss=240.309, player_2/loss=2289.997, rew=225.50]                                                                                                 


Epoch #413: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #414: 1025it [00:02, 489.10it/s, env_step=423936, len=11, n/ep=7, n/st=64, player_2/loss=3816.834, rew=99.86]                                                                                                                         


Epoch #414: test_reward: 44.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #415: 1025it [00:02, 491.01it/s, env_step=424960, len=36, n/ep=2, n/st=64, player_1/loss=403.147, player_2/loss=2821.021, rew=667.00]                                                                                                 


Epoch #415: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #416: 1025it [00:02, 489.98it/s, env_step=425984, len=32, n/ep=3, n/st=64, player_1/loss=300.172, player_2/loss=1994.347, rew=559.00]                                                                                                 


Epoch #416: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #417: 1025it [00:02, 489.81it/s, env_step=427008, len=35, n/ep=2, n/st=64, player_1/loss=238.364, player_2/loss=1388.877, rew=653.00]                                                                                                 


Epoch #417: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #418: 1025it [00:02, 489.67it/s, env_step=428032, len=36, n/ep=2, n/st=64, player_1/loss=72.017, player_2/loss=1467.276, rew=665.50]                                                                                                  


Epoch #418: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #419: 1025it [00:02, 487.21it/s, env_step=429056, len=20, n/ep=3, n/st=64, player_1/loss=325.854, player_2/loss=1799.953, rew=236.00]                                                                                                 


Epoch #419: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #420: 1025it [00:02, 491.63it/s, env_step=430080, len=22, n/ep=2, n/st=64, player_1/loss=505.583, player_2/loss=1674.860, rew=273.50]                                                                                                 


Epoch #420: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #421: 1025it [00:02, 488.84it/s, env_step=431104, len=26, n/ep=2, n/st=64, player_1/loss=420.164, player_2/loss=1909.934, rew=352.00]                                                                                                 


Epoch #421: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #422: 1025it [00:02, 492.25it/s, env_step=432128, len=37, n/ep=2, n/st=64, player_1/loss=297.163, player_2/loss=1424.620, rew=724.00]                                                                                                 


Epoch #422: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #423: 1025it [00:02, 493.39it/s, env_step=433152, len=28, n/ep=2, n/st=64, player_1/loss=293.014, player_2/loss=1868.001, rew=455.50]                                                                                                 


Epoch #423: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #424: 1025it [00:02, 492.65it/s, env_step=434176, len=28, n/ep=3, n/st=64, player_1/loss=299.358, player_2/loss=1980.588, rew=417.33]                                                                                                 


Epoch #424: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #425: 1025it [00:02, 489.35it/s, env_step=435200, len=30, n/ep=3, n/st=64, player_1/loss=387.152, player_2/loss=1607.876, rew=498.67]                                                                                                 


Epoch #425: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #426: 1025it [00:02, 486.87it/s, env_step=436224, len=35, n/ep=2, n/st=64, player_1/loss=382.606, player_2/loss=1923.652, rew=641.50]                                                                                                 


Epoch #426: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #427: 1025it [00:02, 488.41it/s, env_step=437248, len=22, n/ep=3, n/st=64, player_1/loss=392.680, player_2/loss=2722.209, rew=261.00]                                                                                                 


Epoch #427: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #428: 1025it [00:02, 486.39it/s, env_step=438272, len=28, n/ep=2, n/st=64, player_1/loss=485.213, player_2/loss=4352.649, rew=413.00]                                                                                                 


Epoch #428: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #429: 1025it [00:02, 491.75it/s, env_step=439296, len=36, n/ep=2, n/st=64, player_1/loss=509.218, player_2/loss=4066.698, rew=798.50]                                                                                                 


Epoch #429: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #430: 1025it [00:02, 491.39it/s, env_step=440320, len=16, n/ep=4, n/st=64, player_1/loss=553.947, player_2/loss=2213.184, rew=302.50]                                                                                                 


Epoch #430: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #431: 1025it [00:02, 488.58it/s, env_step=441344, len=35, n/ep=2, n/st=64, player_1/loss=521.486, player_2/loss=1504.338, rew=633.50]                                                                                                 


Epoch #431: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #432: 1025it [00:02, 489.21it/s, env_step=442368, len=29, n/ep=2, n/st=64, player_1/loss=505.737, player_2/loss=1177.964, rew=464.00]                                                                                                 


Epoch #432: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #433: 1025it [00:02, 492.50it/s, env_step=443392, len=38, n/ep=1, n/st=64, player_1/loss=242.695, player_2/loss=715.984, rew=740.00]                                                                                                  


Epoch #433: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #434: 1025it [00:02, 490.55it/s, env_step=444416, len=32, n/ep=1, n/st=64, player_1/loss=80.525, player_2/loss=1616.798, rew=527.00]                                                                                                  


Epoch #434: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #435: 1025it [00:02, 491.50it/s, env_step=445440, len=33, n/ep=2, n/st=64, player_1/loss=107.952, player_2/loss=2570.617, rew=568.00]                                                                                                 


Epoch #435: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #436: 1025it [00:02, 490.43it/s, env_step=446464, len=33, n/ep=2, n/st=64, player_1/loss=299.860, player_2/loss=2760.654, rew=584.50]                                                                                                 


Epoch #436: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #437: 1025it [00:02, 491.96it/s, env_step=447488, len=33, n/ep=3, n/st=64, player_1/loss=297.422, player_2/loss=2322.041, rew=582.67]                                                                                                 


Epoch #437: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #438: 1025it [00:02, 483.42it/s, env_step=448512, len=37, n/ep=2, n/st=64, player_1/loss=129.811, player_2/loss=2171.042, rew=721.00]                                                                                                 


Epoch #438: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #439: 1025it [00:02, 493.60it/s, env_step=449536, len=31, n/ep=2, n/st=64, player_1/loss=473.167, player_2/loss=2103.474, rew=503.00]                                                                                                 


Epoch #439: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #440: 1025it [00:02, 490.46it/s, env_step=450560, len=34, n/ep=1, n/st=64, player_1/loss=478.996, player_2/loss=1931.511, rew=594.00]                                                                                                 


Epoch #440: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #441: 1025it [00:02, 493.10it/s, env_step=451584, len=39, n/ep=2, n/st=64, player_1/loss=321.101, player_2/loss=1202.848, rew=779.50]                                                                                                 


Epoch #441: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #442: 1025it [00:02, 489.73it/s, env_step=452608, len=36, n/ep=2, n/st=64, player_1/loss=288.584, player_2/loss=1009.072, rew=684.50]                                                                                                 


Epoch #442: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #443: 1025it [00:02, 490.75it/s, env_step=453632, len=33, n/ep=2, n/st=64, player_1/loss=243.667, player_2/loss=1161.863, rew=584.50]                                                                                                 


Epoch #443: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #444: 1025it [00:02, 491.10it/s, env_step=454656, len=38, n/ep=2, n/st=64, player_1/loss=480.610, player_2/loss=1446.257, rew=865.50]                                                                                                 


Epoch #444: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #445: 1025it [00:02, 491.92it/s, env_step=455680, len=31, n/ep=3, n/st=64, player_1/loss=404.658, player_2/loss=1847.578, rew=517.33]                                                                                                 


Epoch #445: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #446: 1025it [00:02, 492.04it/s, env_step=456704, len=38, n/ep=1, n/st=64, player_1/loss=398.960, player_2/loss=2679.802, rew=740.00]                                                                                                 


Epoch #446: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #447: 1025it [00:02, 489.49it/s, env_step=457728, len=33, n/ep=2, n/st=64, player_1/loss=392.991, player_2/loss=3330.502, rew=583.00]                                                                                                 


Epoch #447: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #448: 1025it [00:02, 491.85it/s, env_step=458752, len=27, n/ep=2, n/st=64, player_1/loss=441.482, player_2/loss=2568.369, rew=412.00]                                                                                                 


Epoch #448: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #449: 1025it [00:02, 488.98it/s, env_step=459776, len=27, n/ep=3, n/st=64, player_1/loss=436.085, rew=386.67]                                                                                                                         


Epoch #449: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #450: 1025it [00:02, 486.32it/s, env_step=460800, len=29, n/ep=2, n/st=64, player_1/loss=497.344, player_2/loss=2175.909, rew=438.50]                                                                                                 


Epoch #450: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #451: 1025it [00:02, 490.94it/s, env_step=461824, len=31, n/ep=2, n/st=64, player_1/loss=591.145, player_2/loss=3231.708, rew=535.50]                                                                                                 


Epoch #451: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #452: 1025it [00:02, 491.11it/s, env_step=462848, len=24, n/ep=3, n/st=64, player_1/loss=573.361, player_2/loss=2751.596, rew=360.67]                                                                                                 


Epoch #452: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #453: 1025it [00:02, 487.39it/s, env_step=463872, len=32, n/ep=2, n/st=64, player_1/loss=370.945, player_2/loss=1366.201, rew=535.00]                                                                                                 


Epoch #453: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #454: 1025it [00:02, 491.50it/s, env_step=464896, len=34, n/ep=2, n/st=64, player_1/loss=98.256, player_2/loss=1222.331, rew=614.50]                                                                                                  


Epoch #454: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #455: 1025it [00:02, 492.98it/s, env_step=465920, len=30, n/ep=3, n/st=64, player_1/loss=186.869, player_2/loss=1336.040, rew=490.33]                                                                                                 


Epoch #455: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #456: 1025it [00:02, 486.64it/s, env_step=466944, len=37, n/ep=2, n/st=64, player_1/loss=228.449, player_2/loss=1009.088, rew=721.00]                                                                                                 


Epoch #456: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #457: 1025it [00:02, 488.98it/s, env_step=467968, len=19, n/ep=3, n/st=64, player_1/loss=260.445, player_2/loss=809.828, rew=211.00]                                                                                                  


Epoch #457: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #458: 1025it [00:02, 490.73it/s, env_step=468992, len=41, n/ep=2, n/st=64, player_1/loss=399.041, player_2/loss=1720.337, rew=960.50]                                                                                                 


Epoch #458: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #459: 1025it [00:02, 487.53it/s, env_step=470016, len=35, n/ep=2, n/st=64, player_1/loss=340.182, player_2/loss=2205.428, rew=768.00]                                                                                                 


Epoch #459: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #460: 1025it [00:02, 491.69it/s, env_step=471040, len=33, n/ep=2, n/st=64, player_1/loss=377.240, player_2/loss=2528.215, rew=568.00]                                                                                                 


Epoch #460: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #461: 1025it [00:02, 475.84it/s, env_step=472064, len=29, n/ep=2, n/st=64, player_1/loss=515.261, player_2/loss=2453.383, rew=466.00]                                                                                                 


Epoch #461: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #462: 1025it [00:02, 491.35it/s, env_step=473088, len=23, n/ep=3, n/st=64, player_1/loss=503.143, player_2/loss=1850.352, rew=349.67]                                                                                                 


Epoch #462: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #463: 1025it [00:02, 488.82it/s, env_step=474112, len=40, n/ep=2, n/st=64, player_1/loss=280.245, player_2/loss=1357.048, rew=940.50]                                                                                                 


Epoch #463: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #464: 1025it [00:02, 487.95it/s, env_step=475136, len=33, n/ep=3, n/st=64, player_1/loss=528.174, player_2/loss=1399.095, rew=666.67]                                                                                                 


Epoch #464: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #465: 1025it [00:02, 492.19it/s, env_step=476160, len=27, n/ep=3, n/st=64, player_1/loss=505.657, player_2/loss=1950.408, rew=394.33]                                                                                                 


Epoch #465: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #466: 1025it [00:02, 493.37it/s, env_step=477184, len=29, n/ep=3, n/st=64, player_1/loss=696.426, player_2/loss=3593.228, rew=434.33]                                                                                                 


Epoch #466: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #467: 1025it [00:02, 490.53it/s, env_step=478208, len=33, n/ep=2, n/st=64, player_1/loss=640.414, player_2/loss=4641.297, rew=560.50]                                                                                                 


Epoch #467: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #468: 1025it [00:02, 486.06it/s, env_step=479232, len=41, n/ep=2, n/st=64, player_1/loss=683.241, player_2/loss=2964.324, rew=960.50]                                                                                                 


Epoch #468: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #469: 1025it [00:02, 489.77it/s, env_step=480256, len=17, n/ep=4, n/st=64, player_1/loss=587.916, player_2/loss=1714.857, rew=158.00]                                                                                                 


Epoch #469: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #470: 1025it [00:02, 490.62it/s, env_step=481280, len=33, n/ep=2, n/st=64, player_1/loss=412.542, player_2/loss=1985.985, rew=587.00]                                                                                                 


Epoch #470: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #471: 1025it [00:02, 489.51it/s, env_step=482304, len=30, n/ep=2, n/st=64, player_1/loss=259.854, player_2/loss=1865.092, rew=494.50]                                                                                                 


Epoch #471: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #472: 1025it [00:02, 490.83it/s, env_step=483328, len=38, n/ep=1, n/st=64, player_1/loss=193.109, player_2/loss=1785.995, rew=740.00]                                                                                                 


Epoch #472: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #473: 1025it [00:02, 489.36it/s, env_step=484352, len=42, n/ep=1, n/st=64, player_1/loss=229.430, player_2/loss=2143.016, rew=1102.00]                                                                                                


Epoch #473: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #474: 1025it [00:02, 491.74it/s, env_step=485376, len=26, n/ep=3, n/st=64, player_1/loss=203.193, player_2/loss=2025.346, rew=493.67]                                                                                                 


Epoch #474: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #475: 1025it [00:02, 486.60it/s, env_step=486400, len=32, n/ep=2, n/st=64, player_1/loss=235.640, player_2/loss=1608.666, rew=564.50]                                                                                                 


Epoch #475: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #476: 1025it [00:02, 490.56it/s, env_step=487424, len=37, n/ep=2, n/st=64, player_1/loss=388.938, player_2/loss=954.235, rew=702.50]                                                                                                  


Epoch #476: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #477: 1025it [00:02, 489.98it/s, env_step=488448, len=31, n/ep=2, n/st=64, player_1/loss=404.053, player_2/loss=1187.393, rew=495.50]                                                                                                 


Epoch #477: test_reward: 119.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #478: 1025it [00:02, 490.72it/s, env_step=489472, len=19, n/ep=2, n/st=64, player_1/loss=335.491, player_2/loss=2118.237, rew=249.50]                                                                                                 


Epoch #478: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #479: 1025it [00:02, 491.19it/s, env_step=490496, len=27, n/ep=2, n/st=64, player_1/loss=573.688, player_2/loss=2308.672, rew=391.00]                                                                                                 


Epoch #479: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #480: 1025it [00:02, 489.52it/s, env_step=491520, len=38, n/ep=2, n/st=64, player_1/loss=550.752, player_2/loss=1392.977, rew=744.50]                                                                                                 


Epoch #480: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #481: 1025it [00:02, 486.41it/s, env_step=492544, len=24, n/ep=2, n/st=64, player_1/loss=410.815, player_2/loss=879.649, rew=383.50]                                                                                                  


Epoch #481: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #482: 1025it [00:02, 485.20it/s, env_step=493568, len=40, n/ep=2, n/st=64, player_1/loss=559.297, player_2/loss=2069.126, rew=819.50]                                                                                                 


Epoch #482: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #483: 1025it [00:02, 485.94it/s, env_step=494592, len=39, n/ep=1, n/st=64, player_1/loss=420.854, player_2/loss=2536.731, rew=779.00]                                                                                                 


Epoch #483: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #484: 1025it [00:02, 491.55it/s, env_step=495616, len=23, n/ep=3, n/st=64, player_1/loss=249.396, player_2/loss=1632.229, rew=332.33]                                                                                                 


Epoch #484: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #485: 1025it [00:02, 493.04it/s, env_step=496640, len=37, n/ep=1, n/st=64, player_1/loss=227.979, player_2/loss=1000.714, rew=702.00]                                                                                                 


Epoch #485: test_reward: 44.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #486: 1025it [00:02, 488.50it/s, env_step=497664, len=33, n/ep=2, n/st=64, player_1/loss=339.249, player_2/loss=1350.158, rew=578.00]                                                                                                 


Epoch #486: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #487: 1025it [00:02, 490.09it/s, env_step=498688, len=27, n/ep=3, n/st=64, player_1/loss=342.405, player_2/loss=2033.362, rew=428.33]                                                                                                 


Epoch #487: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #488: 1025it [00:02, 493.20it/s, env_step=499712, len=35, n/ep=1, n/st=64, player_1/loss=355.580, player_2/loss=2237.671, rew=629.00]                                                                                                 


Epoch #488: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #489: 1025it [00:02, 491.62it/s, env_step=500736, len=40, n/ep=1, n/st=64, player_1/loss=275.391, player_2/loss=2082.447, rew=819.00]                                                                                                 


Epoch #489: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #490: 1025it [00:02, 485.58it/s, env_step=501760, len=27, n/ep=3, n/st=64, player_1/loss=247.413, player_2/loss=2851.131, rew=420.00]                                                                                                 


Epoch #490: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #491: 1025it [00:02, 487.29it/s, env_step=502784, len=14, n/ep=4, n/st=64, player_1/loss=393.078, player_2/loss=2369.243, rew=116.75]                                                                                                 


Epoch #491: test_reward: 119.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #492: 1025it [00:02, 489.65it/s, env_step=503808, len=25, n/ep=3, n/st=64, player_1/loss=549.273, player_2/loss=1723.681, rew=396.33]                                                                                                 


Epoch #492: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #493: 1025it [00:02, 492.59it/s, env_step=504832, len=22, n/ep=2, n/st=64, player_1/loss=689.092, player_2/loss=1527.679, rew=383.50]                                                                                                 


Epoch #493: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #494: 1025it [00:02, 485.88it/s, env_step=505856, len=22, n/ep=3, n/st=64, player_1/loss=746.171, player_2/loss=1650.222, rew=277.67]                                                                                                 


Epoch #494: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #495: 1025it [00:02, 491.79it/s, env_step=506880, len=27, n/ep=3, n/st=64, player_1/loss=763.306, player_2/loss=1101.447, rew=387.33]                                                                                                 


Epoch #495: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #496: 1025it [00:02, 494.03it/s, env_step=507904, len=40, n/ep=1, n/st=64, player_1/loss=799.047, player_2/loss=1927.389, rew=819.00]                                                                                                 


Epoch #496: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #497: 1025it [00:02, 490.11it/s, env_step=508928, len=22, n/ep=3, n/st=64, player_1/loss=561.887, player_2/loss=2900.583, rew=268.00]                                                                                                 


Epoch #497: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #498: 1025it [00:02, 488.16it/s, env_step=509952, len=37, n/ep=1, n/st=64, player_1/loss=864.221, player_2/loss=1801.290, rew=702.00]                                                                                                 


Epoch #498: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #499: 1025it [00:02, 493.75it/s, env_step=510976, len=29, n/ep=2, n/st=64, player_1/loss=845.483, player_2/loss=1699.247, rew=449.00]                                                                                                 


Epoch #499: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #500: 1025it [00:02, 491.05it/s, env_step=512000, len=36, n/ep=2, n/st=64, player_1/loss=270.051, player_2/loss=2342.211, rew=686.50]                                                                                                 


Epoch #500: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #501: 1025it [00:02, 491.86it/s, env_step=513024, len=36, n/ep=2, n/st=64, player_1/loss=253.468, player_2/loss=1141.519, rew=684.50]                                                                                                 


Epoch #501: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #502: 1025it [00:02, 489.18it/s, env_step=514048, len=23, n/ep=3, n/st=64, player_1/loss=464.307, player_2/loss=621.099, rew=295.67]                                                                                                  


Epoch #502: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #503: 1025it [00:02, 490.95it/s, env_step=515072, len=21, n/ep=4, n/st=64, player_1/loss=559.652, player_2/loss=2160.263, rew=231.25]                                                                                                 


Epoch #503: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #504: 1025it [00:02, 483.35it/s, env_step=516096, len=35, n/ep=2, n/st=64, player_1/loss=384.102, player_2/loss=2570.550, rew=753.50]                                                                                                 


Epoch #504: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #505: 1025it [00:02, 488.33it/s, env_step=517120, len=29, n/ep=2, n/st=64, player_1/loss=276.691, player_2/loss=2213.093, rew=449.00]                                                                                                 


Epoch #505: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #506: 1025it [00:02, 489.19it/s, env_step=518144, len=26, n/ep=2, n/st=64, player_1/loss=467.941, player_2/loss=2674.020, rew=410.50]                                                                                                 


Epoch #506: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #507: 1025it [00:02, 487.90it/s, env_step=519168, len=26, n/ep=3, n/st=64, player_1/loss=547.671, player_2/loss=2566.109, rew=351.33]                                                                                                 


Epoch #507: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #508: 1025it [00:02, 489.59it/s, env_step=520192, len=38, n/ep=1, n/st=64, player_1/loss=301.578, player_2/loss=1632.630, rew=740.00]                                                                                                 


Epoch #508: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #509: 1025it [00:02, 491.22it/s, env_step=521216, len=23, n/ep=3, n/st=64, player_1/loss=457.466, player_2/loss=1821.011, rew=305.67]                                                                                                 


Epoch #509: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #510: 1025it [00:02, 490.72it/s, env_step=522240, len=31, n/ep=2, n/st=64, player_1/loss=631.389, player_2/loss=3006.214, rew=512.00]                                                                                                 


Epoch #510: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #511: 1025it [00:02, 489.32it/s, env_step=523264, len=21, n/ep=3, n/st=64, player_1/loss=505.058, player_2/loss=3693.580, rew=270.67]                                                                                                 


Epoch #511: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #512: 1025it [00:02, 492.68it/s, env_step=524288, len=26, n/ep=3, n/st=64, player_1/loss=474.437, player_2/loss=2923.489, rew=371.00]                                                                                                 


Epoch #512: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #513: 1025it [00:02, 488.63it/s, env_step=525312, len=24, n/ep=3, n/st=64, player_1/loss=252.621, player_2/loss=2160.100, rew=343.00]                                                                                                 


Epoch #513: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #514: 1025it [00:02, 491.78it/s, env_step=526336, len=41, n/ep=1, n/st=64, player_1/loss=252.997, player_2/loss=1768.858, rew=860.00]                                                                                                 


Epoch #514: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #515: 1025it [00:02, 490.18it/s, env_step=527360, len=21, n/ep=3, n/st=64, player_1/loss=297.062, player_2/loss=1687.229, rew=290.33]                                                                                                 


Epoch #515: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #516: 1025it [00:02, 487.89it/s, env_step=528384, len=37, n/ep=2, n/st=64, player_1/loss=334.100, player_2/loss=2188.921, rew=721.00]                                                                                                 


Epoch #516: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #517: 1025it [00:02, 491.13it/s, env_step=529408, len=31, n/ep=2, n/st=64, player_1/loss=395.658, player_2/loss=1987.309, rew=556.00]                                                                                                 


Epoch #517: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #518: 1025it [00:02, 489.34it/s, env_step=530432, len=26, n/ep=3, n/st=64, player_1/loss=444.167, player_2/loss=1979.328, rew=376.00]                                                                                                 


Epoch #518: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #519: 1025it [00:02, 487.76it/s, env_step=531456, len=34, n/ep=2, n/st=64, player_1/loss=345.197, player_2/loss=2175.292, rew=596.00]                                                                                                 


Epoch #519: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #520: 1025it [00:02, 491.03it/s, env_step=532480, len=26, n/ep=3, n/st=64, player_1/loss=304.895, player_2/loss=1781.181, rew=362.00]                                                                                                 


Epoch #520: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #521: 1025it [00:02, 486.62it/s, env_step=533504, len=26, n/ep=3, n/st=64, player_1/loss=292.079, player_2/loss=1308.161, rew=378.00]                                                                                                 


Epoch #521: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #522: 1025it [00:02, 489.92it/s, env_step=534528, len=29, n/ep=2, n/st=64, player_1/loss=312.620, player_2/loss=2122.110, rew=434.50]                                                                                                 


Epoch #522: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #523: 1025it [00:02, 491.77it/s, env_step=535552, len=31, n/ep=2, n/st=64, player_1/loss=291.032, player_2/loss=4149.214, rew=512.00]                                                                                                 


Epoch #523: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #524: 1025it [00:02, 489.88it/s, env_step=536576, len=31, n/ep=2, n/st=64, player_1/loss=261.958, player_2/loss=3061.502, rew=527.00]                                                                                                 


Epoch #524: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #525: 1025it [00:02, 491.23it/s, env_step=537600, len=26, n/ep=2, n/st=64, player_1/loss=386.180, player_2/loss=2246.897, rew=350.00]                                                                                                 


Epoch #525: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #526: 1025it [00:02, 487.97it/s, env_step=538624, len=10, n/ep=7, n/st=64, player_1/loss=355.236, player_2/loss=2150.200, rew=85.86]                                                                                                  


Epoch #526: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #527: 1025it [00:02, 489.93it/s, env_step=539648, len=18, n/ep=4, n/st=64, player_1/loss=289.895, player_2/loss=2882.399, rew=194.50]                                                                                                 


Epoch #527: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #528: 1025it [00:02, 493.02it/s, env_step=540672, len=25, n/ep=2, n/st=64, player_1/loss=252.441, player_2/loss=2470.747, rew=358.00]                                                                                                 


Epoch #528: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #529: 1025it [00:02, 491.40it/s, env_step=541696, len=33, n/ep=2, n/st=64, player_1/loss=348.098, player_2/loss=2988.080, rew=580.00]                                                                                                 


Epoch #529: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #530: 1025it [00:02, 489.77it/s, env_step=542720, len=32, n/ep=2, n/st=64, player_1/loss=480.130, player_2/loss=4005.165, rew=531.50]                                                                                                 


Epoch #530: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #531: 1025it [00:02, 489.97it/s, env_step=543744, len=31, n/ep=2, n/st=64, player_1/loss=443.149, player_2/loss=2639.629, rew=655.50]                                                                                                 


Epoch #531: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #532: 1025it [00:02, 489.46it/s, env_step=544768, len=29, n/ep=2, n/st=64, player_1/loss=192.069, player_2/loss=1994.658, rew=449.00]                                                                                                 


Epoch #532: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #533: 1025it [00:02, 490.71it/s, env_step=545792, len=32, n/ep=2, n/st=64, player_1/loss=294.546, player_2/loss=2367.901, rew=558.50]                                                                                                 


Epoch #533: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #534: 1025it [00:02, 490.02it/s, env_step=546816, len=36, n/ep=1, n/st=64, player_1/loss=295.229, player_2/loss=2808.294, rew=665.00]                                                                                                 


Epoch #534: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #535: 1025it [00:02, 492.05it/s, env_step=547840, len=25, n/ep=2, n/st=64, player_1/loss=201.389, player_2/loss=1927.070, rew=382.00]                                                                                                 


Epoch #535: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #536: 1025it [00:02, 489.65it/s, env_step=548864, len=27, n/ep=2, n/st=64, player_1/loss=206.041, player_2/loss=2374.238, rew=394.00]                                                                                                 


Epoch #536: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #537: 1025it [00:02, 485.14it/s, env_step=549888, len=32, n/ep=2, n/st=64, player_1/loss=202.040, player_2/loss=2406.573, rew=558.50]                                                                                                 


Epoch #537: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #538: 1025it [00:02, 488.22it/s, env_step=550912, len=25, n/ep=3, n/st=64, player_1/loss=265.733, player_2/loss=1656.456, rew=346.00]                                                                                                 


Epoch #538: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #539: 1025it [00:02, 491.65it/s, env_step=551936, len=25, n/ep=2, n/st=64, player_1/loss=277.101, player_2/loss=1225.371, rew=326.00]                                                                                                 


Epoch #539: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #540: 1025it [00:02, 492.83it/s, env_step=552960, len=38, n/ep=2, n/st=64, player_1/loss=343.041, player_2/loss=1556.137, rew=759.50]                                                                                                 


Epoch #540: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #541: 1025it [00:02, 491.23it/s, env_step=553984, len=39, n/ep=1, n/st=64, player_1/loss=421.508, player_2/loss=1681.529, rew=779.00]                                                                                                 


Epoch #541: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #542: 1025it [00:02, 491.34it/s, env_step=555008, len=26, n/ep=2, n/st=64, player_1/loss=430.869, player_2/loss=1653.330, rew=391.50]                                                                                                 


Epoch #542: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #543: 1025it [00:02, 488.41it/s, env_step=556032, len=35, n/ep=1, n/st=64, player_1/loss=233.156, player_2/loss=1620.495, rew=629.00]                                                                                                 


Epoch #543: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #544: 1025it [00:02, 486.02it/s, env_step=557056, len=26, n/ep=2, n/st=64, player_1/loss=160.897, player_2/loss=2885.438, rew=368.00]                                                                                                 


Epoch #544: test_reward: 65.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #545: 1025it [00:02, 488.79it/s, env_step=558080, len=8, n/ep=8, n/st=64, player_1/loss=384.684, player_2/loss=3183.480, rew=44.88]                                                                                                   


Epoch #545: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #546: 1025it [00:02, 484.79it/s, env_step=559104, len=23, n/ep=3, n/st=64, player_1/loss=601.873, player_2/loss=2094.863, rew=291.00]                                                                                                 


Epoch #546: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #547: 1025it [00:02, 491.49it/s, env_step=560128, len=29, n/ep=3, n/st=64, player_1/loss=598.797, player_2/loss=2664.836, rew=454.00]                                                                                                 


Epoch #547: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #548: 1025it [00:02, 491.21it/s, env_step=561152, len=23, n/ep=3, n/st=64, player_1/loss=597.445, player_2/loss=2392.617, rew=281.33]                                                                                                 


Epoch #548: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #549: 1025it [00:02, 493.24it/s, env_step=562176, len=32, n/ep=2, n/st=64, player_1/loss=455.501, player_2/loss=1592.418, rew=543.50]                                                                                                 


Epoch #549: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #550: 1025it [00:02, 491.82it/s, env_step=563200, len=39, n/ep=2, n/st=64, player_1/loss=351.874, player_2/loss=1717.729, rew=902.00]                                                                                                 


Epoch #550: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #551: 1025it [00:02, 492.10it/s, env_step=564224, len=28, n/ep=3, n/st=64, player_2/loss=1608.104, rew=434.00]                                                                                                                        


Epoch #551: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #552: 1025it [00:02, 490.76it/s, env_step=565248, len=32, n/ep=2, n/st=64, player_1/loss=225.413, player_2/loss=2068.980, rew=531.50]                                                                                                 


Epoch #552: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #553: 1025it [00:02, 487.49it/s, env_step=566272, len=40, n/ep=2, n/st=64, player_1/loss=140.450, player_2/loss=2209.077, rew=940.50]                                                                                                 


Epoch #553: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #554: 1025it [00:02, 489.92it/s, env_step=567296, len=31, n/ep=3, n/st=64, player_1/loss=210.995, player_2/loss=1903.370, rew=514.00]                                                                                                 


Epoch #554: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #555: 1025it [00:02, 492.19it/s, env_step=568320, len=32, n/ep=2, n/st=64, player_2/loss=3302.237, rew=564.50]                                                                                                                        


Epoch #555: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #556: 1025it [00:02, 488.08it/s, env_step=569344, len=13, n/ep=4, n/st=64, player_1/loss=475.244, rew=93.25]                                                                                                                          


Epoch #556: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #557: 1025it [00:02, 488.45it/s, env_step=570368, len=21, n/ep=3, n/st=64, player_1/loss=430.305, player_2/loss=2442.182, rew=248.67]                                                                                                 


Epoch #557: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #558: 1025it [00:02, 489.53it/s, env_step=571392, len=32, n/ep=2, n/st=64, player_1/loss=410.064, player_2/loss=1611.483, rew=558.50]                                                                                                 


Epoch #558: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #559: 1025it [00:02, 488.55it/s, env_step=572416, len=34, n/ep=2, n/st=64, player_1/loss=316.382, player_2/loss=1305.053, rew=617.50]                                                                                                 


Epoch #559: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #560: 1025it [00:02, 490.94it/s, env_step=573440, len=28, n/ep=2, n/st=64, player_1/loss=411.004, player_2/loss=2123.194, rew=434.50]                                                                                                 


Epoch #560: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #561: 1025it [00:02, 487.61it/s, env_step=574464, len=34, n/ep=2, n/st=64, player_1/loss=610.176, player_2/loss=2153.954, rew=621.50]                                                                                                 


Epoch #561: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #562: 1025it [00:02, 488.08it/s, env_step=575488, len=27, n/ep=2, n/st=64, player_1/loss=468.639, player_2/loss=2082.873, rew=394.00]                                                                                                 


Epoch #562: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #563: 1025it [00:02, 491.46it/s, env_step=576512, len=36, n/ep=2, n/st=64, player_1/loss=329.838, player_2/loss=1843.193, rew=686.50]                                                                                                 


Epoch #563: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #564: 1025it [00:02, 490.20it/s, env_step=577536, len=33, n/ep=2, n/st=64, player_1/loss=575.093, rew=577.00]                                                                                                                         


Epoch #564: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #565: 1025it [00:02, 486.52it/s, env_step=578560, len=29, n/ep=3, n/st=64, player_1/loss=536.252, player_2/loss=2859.733, rew=434.33]                                                                                                 


Epoch #565: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #566: 1025it [00:02, 489.78it/s, env_step=579584, len=20, n/ep=4, n/st=64, player_1/loss=305.233, player_2/loss=3112.013, rew=250.50]                                                                                                 


Epoch #566: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #567: 1025it [00:02, 486.59it/s, env_step=580608, len=36, n/ep=2, n/st=64, player_1/loss=240.137, player_2/loss=3068.214, rew=686.50]                                                                                                 


Epoch #567: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #568: 1025it [00:02, 488.61it/s, env_step=581632, len=31, n/ep=3, n/st=64, player_1/loss=306.363, player_2/loss=1855.225, rew=559.33]                                                                                                 


Epoch #568: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #569: 1025it [00:02, 488.83it/s, env_step=582656, len=33, n/ep=2, n/st=64, player_1/loss=395.111, player_2/loss=1299.451, rew=577.00]                                                                                                 


Epoch #569: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #570: 1025it [00:02, 489.73it/s, env_step=583680, len=28, n/ep=2, n/st=64, player_1/loss=442.366, player_2/loss=1607.730, rew=464.50]                                                                                                 


Epoch #570: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #571: 1025it [00:02, 487.45it/s, env_step=584704, len=38, n/ep=1, n/st=64, player_1/loss=421.662, player_2/loss=3105.636, rew=740.00]                                                                                                 


Epoch #571: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #572: 1025it [00:02, 490.11it/s, env_step=585728, len=28, n/ep=2, n/st=64, player_1/loss=376.218, player_2/loss=3737.709, rew=429.50]                                                                                                 


Epoch #572: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #573: 1025it [00:02, 486.44it/s, env_step=586752, len=33, n/ep=2, n/st=64, player_1/loss=214.309, player_2/loss=2785.166, rew=592.00]                                                                                                 


Epoch #573: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #574: 1025it [00:02, 492.37it/s, env_step=587776, len=33, n/ep=2, n/st=64, player_1/loss=318.698, player_2/loss=2126.784, rew=568.00]                                                                                                 


Epoch #574: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #575: 1025it [00:02, 484.99it/s, env_step=588800, len=33, n/ep=2, n/st=64, player_1/loss=616.812, player_2/loss=2194.655, rew=592.00]                                                                                                 


Epoch #575: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #576: 1025it [00:02, 487.87it/s, env_step=589824, len=33, n/ep=2, n/st=64, player_1/loss=307.650, player_2/loss=1959.322, rew=560.50]                                                                                                 


Epoch #576: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #577: 1025it [00:02, 489.54it/s, env_step=590848, len=30, n/ep=2, n/st=64, player_1/loss=198.187, player_2/loss=1286.804, rew=476.50]                                                                                                 


Epoch #577: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #578: 1025it [00:02, 489.28it/s, env_step=591872, len=25, n/ep=3, n/st=64, player_1/loss=250.626, player_2/loss=1525.272, rew=368.33]                                                                                                 


Epoch #578: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #579: 1025it [00:02, 491.02it/s, env_step=592896, len=29, n/ep=2, n/st=64, player_1/loss=490.185, player_2/loss=2948.505, rew=450.00]                                                                                                 


Epoch #579: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #580: 1025it [00:02, 491.09it/s, env_step=593920, len=25, n/ep=3, n/st=64, player_1/loss=463.452, player_2/loss=3002.878, rew=347.33]                                                                                                 


Epoch #580: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #581: 1025it [00:02, 493.04it/s, env_step=594944, len=13, n/ep=4, n/st=64, player_1/loss=162.313, player_2/loss=1861.253, rew=111.50]                                                                                                 


Epoch #581: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #582: 1025it [00:02, 490.24it/s, env_step=595968, len=34, n/ep=2, n/st=64, player_1/loss=168.712, player_2/loss=2420.969, rew=594.00]                                                                                                 


Epoch #582: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #583: 1025it [00:02, 489.07it/s, env_step=596992, len=35, n/ep=2, n/st=64, player_1/loss=345.550, player_2/loss=2794.236, rew=631.00]                                                                                                 


Epoch #583: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #584: 1025it [00:02, 488.33it/s, env_step=598016, len=30, n/ep=3, n/st=64, player_1/loss=638.333, player_2/loss=1669.237, rew=485.67]                                                                                                 


Epoch #584: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #585: 1025it [00:02, 484.78it/s, env_step=599040, len=31, n/ep=2, n/st=64, player_1/loss=562.531, player_2/loss=1382.534, rew=503.00]                                                                                                 


Epoch #585: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #586: 1025it [00:02, 488.40it/s, env_step=600064, len=36, n/ep=2, n/st=64, player_1/loss=203.385, player_2/loss=1249.451, rew=669.50]                                                                                                 


Epoch #586: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #587: 1025it [00:02, 489.67it/s, env_step=601088, len=37, n/ep=1, n/st=64, player_1/loss=254.412, player_2/loss=1865.452, rew=702.00]                                                                                                 


Epoch #587: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #588: 1025it [00:02, 492.45it/s, env_step=602112, len=30, n/ep=2, n/st=64, player_1/loss=222.758, player_2/loss=1586.857, rew=476.50]                                                                                                 


Epoch #588: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #589: 1025it [00:02, 485.70it/s, env_step=603136, len=41, n/ep=2, n/st=64, player_1/loss=404.555, player_2/loss=1227.391, rew=960.50]                                                                                                 


Epoch #589: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #590: 1025it [00:02, 488.18it/s, env_step=604160, len=36, n/ep=2, n/st=64, player_1/loss=382.185, player_2/loss=1606.931, rew=689.50]                                                                                                 


Epoch #590: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #591: 1025it [00:02, 489.28it/s, env_step=605184, len=34, n/ep=2, n/st=64, player_1/loss=326.522, player_2/loss=1755.758, rew=602.00]                                                                                                 


Epoch #591: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #592: 1025it [00:02, 488.93it/s, env_step=606208, len=29, n/ep=3, n/st=64, player_1/loss=456.057, player_2/loss=1593.176, rew=508.00]                                                                                                 


Epoch #592: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #593: 1025it [00:02, 487.45it/s, env_step=607232, len=33, n/ep=1, n/st=64, player_1/loss=356.021, player_2/loss=1608.703, rew=560.00]                                                                                                 


Epoch #593: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #594: 1025it [00:02, 485.28it/s, env_step=608256, len=35, n/ep=2, n/st=64, player_1/loss=351.605, player_2/loss=2572.450, rew=768.00]                                                                                                 


Epoch #594: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #595: 1025it [00:02, 487.73it/s, env_step=609280, len=26, n/ep=2, n/st=64, player_1/loss=254.266, player_2/loss=2148.888, rew=354.50]                                                                                                 


Epoch #595: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #596: 1025it [00:02, 489.60it/s, env_step=610304, len=32, n/ep=1, n/st=64, player_1/loss=177.613, player_2/loss=1572.276, rew=527.00]                                                                                                 


Epoch #596: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #597: 1025it [00:02, 487.58it/s, env_step=611328, len=27, n/ep=2, n/st=64, player_1/loss=302.184, player_2/loss=2666.901, rew=446.00]                                                                                                 


Epoch #597: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #598: 1025it [00:02, 489.46it/s, env_step=612352, len=24, n/ep=3, n/st=64, player_1/loss=310.422, player_2/loss=2496.792, rew=320.33]                                                                                                 


Epoch #598: test_reward: 65.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #599: 1025it [00:02, 489.67it/s, env_step=613376, len=17, n/ep=3, n/st=64, player_1/loss=274.499, player_2/loss=1666.968, rew=169.67]                                                                                                 


Epoch #599: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #600: 1025it [00:02, 477.16it/s, env_step=614400, len=40, n/ep=2, n/st=64, player_1/loss=318.681, player_2/loss=2106.888, rew=921.00]                                                                                                 


Epoch #600: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #601: 1025it [00:02, 488.36it/s, env_step=615424, len=27, n/ep=3, n/st=64, player_1/loss=242.494, player_2/loss=2545.481, rew=417.33]                                                                                                 


Epoch #601: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #602: 1025it [00:02, 489.90it/s, env_step=616448, len=32, n/ep=2, n/st=64, player_1/loss=239.487, player_2/loss=2532.700, rew=558.50]                                                                                                 


Epoch #602: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #603: 1025it [00:02, 489.22it/s, env_step=617472, len=31, n/ep=2, n/st=64, player_1/loss=323.009, player_2/loss=2356.913, rew=532.00]                                                                                                 


Epoch #603: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #604: 1025it [00:02, 489.48it/s, env_step=618496, len=21, n/ep=3, n/st=64, player_1/loss=486.018, player_2/loss=1796.759, rew=253.00]                                                                                                 


Epoch #604: test_reward: 170.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #605: 1025it [00:02, 489.84it/s, env_step=619520, len=26, n/ep=2, n/st=64, player_1/loss=419.552, player_2/loss=2316.113, rew=350.50]                                                                                                 


Epoch #605: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #606: 1025it [00:02, 490.58it/s, env_step=620544, len=31, n/ep=2, n/st=64, player_1/loss=450.011, player_2/loss=2473.604, rew=513.00]                                                                                                 


Epoch #606: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #607: 1025it [00:02, 490.97it/s, env_step=621568, len=28, n/ep=2, n/st=64, player_1/loss=392.185, player_2/loss=1928.810, rew=610.50]                                                                                                 


Epoch #607: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #608: 1025it [00:02, 489.15it/s, env_step=622592, len=29, n/ep=2, n/st=64, player_1/loss=164.518, player_2/loss=1490.451, rew=452.00]                                                                                                 


Epoch #608: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #609: 1025it [00:02, 491.56it/s, env_step=623616, len=21, n/ep=3, n/st=64, player_1/loss=375.816, player_2/loss=2337.560, rew=230.33]                                                                                                 


Epoch #609: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #610: 1025it [00:02, 493.20it/s, env_step=624640, len=27, n/ep=2, n/st=64, player_1/loss=358.219, player_2/loss=1890.189, rew=381.50]                                                                                                 


Epoch #610: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #611: 1025it [00:02, 488.45it/s, env_step=625664, len=31, n/ep=2, n/st=64, player_1/loss=336.017, player_2/loss=2078.821, rew=507.50]                                                                                                 


Epoch #611: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #612: 1025it [00:02, 490.66it/s, env_step=626688, len=36, n/ep=2, n/st=64, player_1/loss=293.041, player_2/loss=2083.926, rew=667.00]                                                                                                 


Epoch #612: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #613: 1025it [00:02, 489.32it/s, env_step=627712, len=35, n/ep=2, n/st=64, player_1/loss=224.641, player_2/loss=1741.616, rew=631.00]                                                                                                 


Epoch #613: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #614: 1025it [00:02, 489.42it/s, env_step=628736, len=23, n/ep=3, n/st=64, player_1/loss=329.211, player_2/loss=2643.652, rew=339.67]                                                                                                 


Epoch #614: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #615: 1025it [00:02, 490.65it/s, env_step=629760, len=16, n/ep=5, n/st=64, player_1/loss=330.845, player_2/loss=2594.211, rew=156.80]                                                                                                 


Epoch #615: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #616: 1025it [00:02, 482.91it/s, env_step=630784, len=24, n/ep=3, n/st=64, player_1/loss=444.518, player_2/loss=4271.705, rew=305.33]                                                                                                 


Epoch #616: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #617: 1025it [00:02, 489.29it/s, env_step=631808, len=21, n/ep=2, n/st=64, player_1/loss=334.464, player_2/loss=3608.888, rew=241.00]                                                                                                 


Epoch #617: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #618: 1025it [00:02, 489.86it/s, env_step=632832, len=26, n/ep=3, n/st=64, player_1/loss=306.300, player_2/loss=3358.764, rew=368.67]                                                                                                 


Epoch #618: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #619: 1025it [00:02, 488.26it/s, env_step=633856, len=35, n/ep=2, n/st=64, player_1/loss=325.271, player_2/loss=3378.779, rew=647.00]                                                                                                 


Epoch #619: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #620: 1025it [00:02, 489.22it/s, env_step=634880, len=39, n/ep=1, n/st=64, player_1/loss=497.869, player_2/loss=2743.278, rew=779.00]                                                                                                 


Epoch #620: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #621: 1025it [00:02, 486.89it/s, env_step=635904, len=31, n/ep=2, n/st=64, player_1/loss=422.795, player_2/loss=2845.253, rew=511.00]                                                                                                 


Epoch #621: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #622: 1025it [00:02, 486.56it/s, env_step=636928, len=34, n/ep=2, n/st=64, player_1/loss=324.265, player_2/loss=2999.940, rew=739.50]                                                                                                 


Epoch #622: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #623: 1025it [00:02, 487.92it/s, env_step=637952, len=36, n/ep=2, n/st=64, player_2/loss=1949.606, rew=673.00]                                                                                                                        


Epoch #623: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #624: 1025it [00:02, 486.81it/s, env_step=638976, len=34, n/ep=2, n/st=64, player_1/loss=290.880, player_2/loss=1513.805, rew=594.00]                                                                                                 


Epoch #624: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #625: 1025it [00:02, 488.05it/s, env_step=640000, len=36, n/ep=2, n/st=64, player_1/loss=350.839, player_2/loss=1594.094, rew=677.50]                                                                                                 


Epoch #625: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #626: 1025it [00:02, 488.64it/s, env_step=641024, len=38, n/ep=2, n/st=64, player_1/loss=339.674, player_2/loss=2711.708, rew=759.50]                                                                                                 


Epoch #626: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #627: 1025it [00:02, 489.80it/s, env_step=642048, len=36, n/ep=1, n/st=64, player_1/loss=353.203, player_2/loss=3260.130, rew=665.00]                                                                                                 


Epoch #627: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #628: 1025it [00:02, 491.51it/s, env_step=643072, len=37, n/ep=2, n/st=64, player_1/loss=467.058, player_2/loss=2304.206, rew=721.00]                                                                                                 


Epoch #628: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #629: 1025it [00:02, 490.38it/s, env_step=644096, len=30, n/ep=2, n/st=64, player_1/loss=344.168, player_2/loss=1491.025, rew=507.50]                                                                                                 


Epoch #629: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #630: 1025it [00:02, 487.30it/s, env_step=645120, len=37, n/ep=2, n/st=64, player_1/loss=159.383, player_2/loss=923.683, rew=721.00]                                                                                                  


Epoch #630: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #631: 1025it [00:02, 488.25it/s, env_step=646144, len=24, n/ep=2, n/st=64, player_1/loss=236.004, player_2/loss=1651.228, rew=356.50]                                                                                                 


Epoch #631: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #632: 1025it [00:02, 488.63it/s, env_step=647168, len=40, n/ep=1, n/st=64, player_1/loss=434.856, player_2/loss=2603.169, rew=819.00]                                                                                                 


Epoch #632: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #633: 1025it [00:02, 491.85it/s, env_step=648192, len=15, n/ep=4, n/st=64, player_1/loss=555.700, player_2/loss=3224.085, rew=129.50]                                                                                                 


Epoch #633: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #634: 1025it [00:02, 491.92it/s, env_step=649216, len=14, n/ep=4, n/st=64, player_1/loss=360.226, player_2/loss=2624.027, rew=112.00]                                                                                                 


Epoch #634: test_reward: 119.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #635: 1025it [00:02, 491.84it/s, env_step=650240, len=36, n/ep=1, n/st=64, player_1/loss=495.194, player_2/loss=1822.780, rew=665.00]                                                                                                 


Epoch #635: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #636: 1025it [00:02, 491.38it/s, env_step=651264, len=36, n/ep=2, n/st=64, player_1/loss=431.652, player_2/loss=1336.083, rew=684.50]                                                                                                 


Epoch #636: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #637: 1025it [00:02, 490.25it/s, env_step=652288, len=29, n/ep=2, n/st=64, player_1/loss=279.536, player_2/loss=1309.649, rew=446.50]                                                                                                 


Epoch #637: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #638: 1025it [00:02, 488.47it/s, env_step=653312, len=38, n/ep=1, n/st=64, player_1/loss=392.004, player_2/loss=1445.852, rew=740.00]                                                                                                 


Epoch #638: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #639: 1025it [00:02, 490.56it/s, env_step=654336, len=26, n/ep=2, n/st=64, player_1/loss=374.044, player_2/loss=2075.657, rew=390.50]                                                                                                 


Epoch #639: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #640: 1025it [00:02, 491.17it/s, env_step=655360, len=36, n/ep=2, n/st=64, player_1/loss=475.737, player_2/loss=2830.633, rew=686.50]                                                                                                 


Epoch #640: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #641: 1025it [00:02, 487.36it/s, env_step=656384, len=42, n/ep=1, n/st=64, player_1/loss=304.143, player_2/loss=1916.970, rew=1102.00]                                                                                                


Epoch #641: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #642: 1025it [00:02, 488.27it/s, env_step=657408, len=26, n/ep=3, n/st=64, player_1/loss=168.700, player_2/loss=1363.280, rew=427.67]                                                                                                 


Epoch #642: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #643: 1025it [00:02, 483.86it/s, env_step=658432, len=21, n/ep=3, n/st=64, player_1/loss=362.164, player_2/loss=1437.793, rew=232.33]                                                                                                 


Epoch #643: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #644: 1025it [00:02, 491.94it/s, env_step=659456, len=21, n/ep=3, n/st=64, player_1/loss=487.758, player_2/loss=1382.623, rew=237.67]                                                                                                 


Epoch #644: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #645: 1025it [00:02, 489.52it/s, env_step=660480, len=32, n/ep=2, n/st=64, player_1/loss=520.648, player_2/loss=2457.664, rew=544.50]                                                                                                 


Epoch #645: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #646: 1025it [00:02, 487.13it/s, env_step=661504, len=23, n/ep=2, n/st=64, player_1/loss=410.426, player_2/loss=2495.676, rew=293.00]                                                                                                 


Epoch #646: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #647: 1025it [00:02, 489.75it/s, env_step=662528, len=27, n/ep=2, n/st=64, player_1/loss=417.028, player_2/loss=2107.279, rew=395.00]                                                                                                 


Epoch #647: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #648: 1025it [00:02, 486.34it/s, env_step=663552, len=38, n/ep=2, n/st=64, player_1/loss=421.221, player_2/loss=2111.645, rew=740.50]                                                                                                 


Epoch #648: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #649: 1025it [00:02, 485.13it/s, env_step=664576, len=31, n/ep=2, n/st=64, player_1/loss=239.552, player_2/loss=1326.861, rew=497.00]                                                                                                 


Epoch #649: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #650: 1025it [00:02, 482.45it/s, env_step=665600, len=28, n/ep=3, n/st=64, player_1/loss=414.497, player_2/loss=1874.294, rew=460.33]                                                                                                 


Epoch #650: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #651: 1025it [00:02, 488.77it/s, env_step=666624, len=32, n/ep=2, n/st=64, player_1/loss=640.029, player_2/loss=2422.741, rew=588.50]                                                                                                 


Epoch #651: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #652: 1025it [00:02, 488.06it/s, env_step=667648, len=28, n/ep=2, n/st=64, player_1/loss=472.641, player_2/loss=2772.130, rew=434.50]                                                                                                 


Epoch #652: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #653: 1025it [00:02, 486.78it/s, env_step=668672, len=30, n/ep=2, n/st=64, player_1/loss=399.948, player_2/loss=1742.031, rew=466.00]                                                                                                 


Epoch #653: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #654: 1025it [00:02, 491.37it/s, env_step=669696, len=27, n/ep=2, n/st=64, player_1/loss=290.223, player_2/loss=2200.900, rew=381.50]                                                                                                 


Epoch #654: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #655: 1025it [00:02, 486.75it/s, env_step=670720, len=20, n/ep=3, n/st=64, player_1/loss=219.906, player_2/loss=2972.862, rew=282.00]                                                                                                 


Epoch #655: test_reward: 65.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #656: 1025it [00:02, 485.50it/s, env_step=671744, len=8, n/ep=8, n/st=64, player_1/loss=222.198, player_2/loss=2845.849, rew=39.25]                                                                                                   


Epoch #656: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #657: 1025it [00:02, 488.28it/s, env_step=672768, len=32, n/ep=3, n/st=64, player_1/loss=352.871, player_2/loss=2545.842, rew=549.00]                                                                                                 


Epoch #657: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #658: 1025it [00:02, 485.44it/s, env_step=673792, len=32, n/ep=2, n/st=64, player_1/loss=528.498, player_2/loss=2926.178, rew=546.50]                                                                                                 


Epoch #658: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #659: 1025it [00:02, 492.13it/s, env_step=674816, len=33, n/ep=2, n/st=64, player_1/loss=482.173, player_2/loss=2264.389, rew=562.00]                                                                                                 


Epoch #659: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #660: 1025it [00:02, 488.13it/s, env_step=675840, len=12, n/ep=5, n/st=64, player_1/loss=278.795, player_2/loss=1725.748, rew=129.40]                                                                                                 


Epoch #660: test_reward: 65.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #661: 1025it [00:02, 487.50it/s, env_step=676864, len=12, n/ep=5, n/st=64, player_1/loss=327.497, player_2/loss=2277.372, rew=85.40]                                                                                                  


Epoch #661: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #662: 1025it [00:02, 486.38it/s, env_step=677888, len=32, n/ep=2, n/st=64, player_1/loss=235.094, player_2/loss=2745.253, rew=571.50]                                                                                                 


Epoch #662: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #663: 1025it [00:02, 489.69it/s, env_step=678912, len=25, n/ep=3, n/st=64, player_1/loss=117.696, player_2/loss=2940.369, rew=353.33]                                                                                                 


Epoch #663: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #664: 1025it [00:02, 488.51it/s, env_step=679936, len=28, n/ep=2, n/st=64, player_1/loss=227.270, player_2/loss=3051.693, rew=445.50]                                                                                                 


Epoch #664: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #665: 1025it [00:02, 488.54it/s, env_step=680960, len=36, n/ep=2, n/st=64, player_1/loss=239.891, player_2/loss=1479.397, rew=669.50]                                                                                                 


Epoch #665: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #666: 1025it [00:02, 488.49it/s, env_step=681984, len=42, n/ep=1, n/st=64, player_1/loss=729.232, player_2/loss=1524.950, rew=1102.00]                                                                                                


Epoch #666: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #667: 1025it [00:02, 486.41it/s, env_step=683008, len=28, n/ep=2, n/st=64, player_1/loss=628.199, player_2/loss=1969.482, rew=464.50]                                                                                                 


Epoch #667: test_reward: 594.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #668: 1025it [00:02, 488.38it/s, env_step=684032, len=20, n/ep=3, n/st=64, player_1/loss=469.920, player_2/loss=2335.014, rew=214.33]                                                                                                 


Epoch #668: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #669: 1025it [00:02, 489.65it/s, env_step=685056, len=35, n/ep=2, n/st=64, player_1/loss=543.032, player_2/loss=1390.829, rew=648.00]                                                                                                 


Epoch #669: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #670: 1025it [00:02, 487.47it/s, env_step=686080, len=32, n/ep=2, n/st=64, player_1/loss=430.467, player_2/loss=1358.803, rew=543.50]                                                                                                 


Epoch #670: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #671: 1025it [00:02, 487.74it/s, env_step=687104, len=33, n/ep=2, n/st=64, player_1/loss=247.123, player_2/loss=1575.217, rew=568.00]                                                                                                 


Epoch #671: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #672: 1025it [00:02, 490.30it/s, env_step=688128, len=37, n/ep=2, n/st=64, player_1/loss=287.515, player_2/loss=1538.678, rew=722.00]                                                                                                 


Epoch #672: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #673: 1025it [00:02, 489.62it/s, env_step=689152, len=29, n/ep=3, n/st=64, player_1/loss=216.145, player_2/loss=1458.927, rew=456.00]                                                                                                 


Epoch #673: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #674: 1025it [00:02, 488.19it/s, env_step=690176, len=21, n/ep=2, n/st=64, player_1/loss=246.982, player_2/loss=1208.552, rew=247.00]                                                                                                 


Epoch #674: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #675: 1025it [00:02, 490.57it/s, env_step=691200, len=37, n/ep=2, n/st=64, player_1/loss=420.225, player_2/loss=1653.764, rew=706.50]                                                                                                 


Epoch #675: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #676: 1025it [00:02, 471.07it/s, env_step=692224, len=26, n/ep=2, n/st=64, player_1/loss=393.723, player_2/loss=1046.923, rew=352.00]                                                                                                 


Epoch #676: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #677: 1025it [00:02, 488.44it/s, env_step=693248, len=31, n/ep=2, n/st=64, player_1/loss=484.993, player_2/loss=1327.223, rew=495.00]                                                                                                 


Epoch #677: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #678: 1025it [00:02, 486.27it/s, env_step=694272, len=25, n/ep=2, n/st=64, player_1/loss=560.326, player_2/loss=1721.751, rew=324.50]                                                                                                 


Epoch #678: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #679: 1025it [00:02, 487.05it/s, env_step=695296, len=29, n/ep=3, n/st=64, player_1/loss=308.817, rew=460.33]                                                                                                                         


Epoch #679: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #680: 1025it [00:02, 490.34it/s, env_step=696320, len=34, n/ep=2, n/st=64, player_1/loss=153.586, player_2/loss=2286.754, rew=594.50]                                                                                                 


Epoch #680: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #681: 1025it [00:02, 486.38it/s, env_step=697344, len=32, n/ep=2, n/st=64, player_1/loss=394.122, player_2/loss=1155.848, rew=527.50]                                                                                                 


Epoch #681: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #682: 1025it [00:02, 485.29it/s, env_step=698368, len=30, n/ep=2, n/st=64, player_1/loss=487.471, player_2/loss=2296.037, rew=466.00]                                                                                                 


Epoch #682: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #683: 1025it [00:02, 485.50it/s, env_step=699392, len=26, n/ep=2, n/st=64, player_1/loss=348.710, player_2/loss=2976.709, rew=350.50]                                                                                                 


Epoch #683: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #684: 1025it [00:02, 486.88it/s, env_step=700416, len=42, n/ep=1, n/st=64, player_1/loss=426.354, player_2/loss=2691.327, rew=902.00]                                                                                                 


Epoch #684: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #685: 1025it [00:02, 487.68it/s, env_step=701440, len=32, n/ep=2, n/st=64, player_1/loss=228.786, player_2/loss=2066.459, rew=551.50]                                                                                                 


Epoch #685: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #686: 1025it [00:02, 490.94it/s, env_step=702464, len=31, n/ep=2, n/st=64, player_1/loss=152.174, player_2/loss=909.528, rew=521.00]                                                                                                  


Epoch #686: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #687: 1025it [00:02, 489.33it/s, env_step=703488, len=31, n/ep=2, n/st=64, player_1/loss=248.695, player_2/loss=1499.975, rew=535.50]                                                                                                 


Epoch #687: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #688: 1025it [00:02, 490.66it/s, env_step=704512, len=25, n/ep=2, n/st=64, player_1/loss=290.519, player_2/loss=2637.767, rew=328.50]                                                                                                 


Epoch #688: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #689: 1025it [00:02, 488.64it/s, env_step=705536, len=25, n/ep=1, n/st=64, player_1/loss=278.822, player_2/loss=1808.820, rew=324.00]                                                                                                 


Epoch #689: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #690: 1025it [00:02, 491.31it/s, env_step=706560, len=32, n/ep=2, n/st=64, player_1/loss=341.750, player_2/loss=1362.133, rew=551.50]                                                                                                 


Epoch #690: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #691: 1025it [00:02, 486.64it/s, env_step=707584, len=27, n/ep=3, n/st=64, player_1/loss=431.546, player_2/loss=1602.561, rew=390.33]                                                                                                 


Epoch #691: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #692: 1025it [00:02, 487.62it/s, env_step=708608, len=33, n/ep=2, n/st=64, player_1/loss=452.716, player_2/loss=1483.679, rew=583.00]                                                                                                 


Epoch #692: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #693: 1025it [00:02, 490.52it/s, env_step=709632, len=38, n/ep=2, n/st=64, player_1/loss=357.909, player_2/loss=2931.390, rew=740.00]                                                                                                 


Epoch #693: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #694: 1025it [00:02, 489.79it/s, env_step=710656, len=40, n/ep=2, n/st=64, player_1/loss=299.364, player_2/loss=2753.238, rew=940.50]                                                                                                 


Epoch #694: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #695: 1025it [00:02, 488.30it/s, env_step=711680, len=34, n/ep=2, n/st=64, player_1/loss=348.826, player_2/loss=1270.828, rew=598.50]                                                                                                 


Epoch #695: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #696: 1025it [00:02, 489.32it/s, env_step=712704, len=27, n/ep=2, n/st=64, player_1/loss=392.915, player_2/loss=1149.495, rew=377.50]                                                                                                 


Epoch #696: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #697: 1025it [00:02, 488.36it/s, env_step=713728, len=26, n/ep=3, n/st=64, player_1/loss=501.078, player_2/loss=1968.156, rew=390.00]                                                                                                 


Epoch #697: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #698: 1025it [00:02, 488.45it/s, env_step=714752, len=22, n/ep=2, n/st=64, player_1/loss=443.179, player_2/loss=2184.698, rew=252.50]                                                                                                 


Epoch #698: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #699: 1025it [00:02, 489.65it/s, env_step=715776, len=23, n/ep=2, n/st=64, player_1/loss=527.396, player_2/loss=1944.994, rew=332.00]                                                                                                 


Epoch #699: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #700: 1025it [00:02, 487.34it/s, env_step=716800, len=24, n/ep=3, n/st=64, player_1/loss=580.269, player_2/loss=1614.671, rew=351.33]                                                                                                 


Epoch #700: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #701: 1025it [00:02, 484.31it/s, env_step=717824, len=23, n/ep=2, n/st=64, player_1/loss=465.039, player_2/loss=1769.725, rew=302.00]                                                                                                 


Epoch #701: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #702: 1025it [00:02, 486.68it/s, env_step=718848, len=40, n/ep=2, n/st=64, player_1/loss=244.404, player_2/loss=1519.644, rew=921.00]                                                                                                 


Epoch #702: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #703: 1025it [00:02, 481.80it/s, env_step=719872, len=40, n/ep=2, n/st=64, player_1/loss=331.144, player_2/loss=1829.959, rew=921.00]                                                                                                 


Epoch #703: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #704: 1025it [00:02, 487.10it/s, env_step=720896, len=20, n/ep=2, n/st=64, player_1/loss=429.951, player_2/loss=2317.002, rew=269.50]                                                                                                 


Epoch #704: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #705: 1025it [00:02, 483.23it/s, env_step=721920, len=31, n/ep=2, n/st=64, player_1/loss=569.134, player_2/loss=3810.021, rew=519.50]                                                                                                 


Epoch #705: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #706: 1025it [00:02, 484.49it/s, env_step=722944, len=31, n/ep=2, n/st=64, player_1/loss=537.016, player_2/loss=3736.902, rew=666.00]                                                                                                 


Epoch #706: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #707: 1025it [00:02, 490.06it/s, env_step=723968, len=39, n/ep=2, n/st=64, player_1/loss=406.448, player_2/loss=3472.674, rew=902.00]                                                                                                 


Epoch #707: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #708: 1025it [00:02, 484.23it/s, env_step=724992, len=26, n/ep=3, n/st=64, player_1/loss=307.220, player_2/loss=3024.735, rew=408.00]                                                                                                 


Epoch #708: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #709: 1025it [00:02, 489.65it/s, env_step=726016, len=30, n/ep=2, n/st=64, player_1/loss=245.810, player_2/loss=1409.161, rew=504.50]                                                                                                 


Epoch #709: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #710: 1025it [00:02, 486.96it/s, env_step=727040, len=35, n/ep=2, n/st=64, player_1/loss=146.198, player_2/loss=921.908, rew=631.00]                                                                                                  


Epoch #710: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #711: 1025it [00:02, 491.41it/s, env_step=728064, len=30, n/ep=1, n/st=64, player_1/loss=168.213, player_2/loss=1086.207, rew=464.00]                                                                                                 


Epoch #711: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #712: 1025it [00:02, 488.71it/s, env_step=729088, len=38, n/ep=2, n/st=64, player_1/loss=130.639, player_2/loss=1292.143, rew=760.50]                                                                                                 


Epoch #712: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #713: 1025it [00:02, 487.70it/s, env_step=730112, len=39, n/ep=2, n/st=64, player_1/loss=192.509, player_2/loss=1382.766, rew=779.50]                                                                                                 


Epoch #713: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #714: 1025it [00:02, 489.11it/s, env_step=731136, len=20, n/ep=3, n/st=64, player_1/loss=409.930, player_2/loss=2214.920, rew=231.67]                                                                                                 


Epoch #714: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #715: 1025it [00:02, 487.89it/s, env_step=732160, len=30, n/ep=2, n/st=64, player_1/loss=388.302, player_2/loss=2141.553, rew=524.50]                                                                                                 


Epoch #715: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #716: 1025it [00:02, 486.62it/s, env_step=733184, len=23, n/ep=3, n/st=64, player_1/loss=286.401, player_2/loss=1939.733, rew=358.00]                                                                                                 


Epoch #716: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #717: 1025it [00:02, 487.73it/s, env_step=734208, len=23, n/ep=3, n/st=64, player_1/loss=287.036, player_2/loss=1422.586, rew=287.67]                                                                                                 


Epoch #717: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #718: 1025it [00:02, 490.47it/s, env_step=735232, len=25, n/ep=2, n/st=64, player_1/loss=270.828, player_2/loss=1732.480, rew=337.00]                                                                                                 


Epoch #718: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #719: 1025it [00:02, 486.79it/s, env_step=736256, len=28, n/ep=3, n/st=64, player_1/loss=322.670, player_2/loss=2328.223, rew=405.33]                                                                                                 


Epoch #719: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #720: 1025it [00:02, 488.35it/s, env_step=737280, len=18, n/ep=3, n/st=64, player_1/loss=549.411, player_2/loss=2241.191, rew=200.67]                                                                                                 


Epoch #720: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #721: 1025it [00:02, 487.40it/s, env_step=738304, len=37, n/ep=2, n/st=64, player_1/loss=684.972, player_2/loss=2132.763, rew=721.00]                                                                                                 


Epoch #721: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #722: 1025it [00:02, 483.80it/s, env_step=739328, len=29, n/ep=3, n/st=64, player_1/loss=471.276, player_2/loss=1836.408, rew=475.67]                                                                                                 


Epoch #722: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #723: 1025it [00:02, 487.08it/s, env_step=740352, len=33, n/ep=2, n/st=64, player_1/loss=259.271, player_2/loss=2669.726, rew=592.00]                                                                                                 


Epoch #723: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #724: 1025it [00:02, 488.19it/s, env_step=741376, len=25, n/ep=2, n/st=64, player_1/loss=260.897, player_2/loss=2695.081, rew=342.00]                                                                                                 


Epoch #724: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #725: 1025it [00:02, 490.16it/s, env_step=742400, len=35, n/ep=2, n/st=64, player_1/loss=203.044, player_2/loss=1800.393, rew=648.00]                                                                                                 


Epoch #725: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #726: 1025it [00:02, 481.32it/s, env_step=743424, len=26, n/ep=3, n/st=64, player_1/loss=79.411, player_2/loss=1708.053, rew=352.33]                                                                                                  


Epoch #726: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #727: 1025it [00:02, 481.46it/s, env_step=744448, len=29, n/ep=2, n/st=64, player_1/loss=142.748, player_2/loss=1871.122, rew=485.00]                                                                                                 


Epoch #727: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #728: 1025it [00:02, 488.43it/s, env_step=745472, len=30, n/ep=2, n/st=64, player_1/loss=248.742, player_2/loss=1416.629, rew=504.50]                                                                                                 


Epoch #728: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #729: 1025it [00:02, 488.12it/s, env_step=746496, len=35, n/ep=2, n/st=64, player_1/loss=251.688, player_2/loss=2232.936, rew=648.00]                                                                                                 


Epoch #729: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #730: 1025it [00:02, 488.18it/s, env_step=747520, len=31, n/ep=3, n/st=64, player_1/loss=338.295, player_2/loss=2844.880, rew=539.00]                                                                                                 


Epoch #730: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #731: 1025it [00:02, 490.11it/s, env_step=748544, len=25, n/ep=3, n/st=64, player_1/loss=314.811, player_2/loss=2530.877, rew=336.33]                                                                                                 


Epoch #731: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #732: 1025it [00:02, 486.80it/s, env_step=749568, len=31, n/ep=2, n/st=64, player_1/loss=209.975, player_2/loss=2136.202, rew=511.00]                                                                                                 


Epoch #732: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #733: 1025it [00:02, 487.23it/s, env_step=750592, len=30, n/ep=2, n/st=64, player_1/loss=120.105, player_2/loss=1120.641, rew=479.50]                                                                                                 


Epoch #733: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #734: 1025it [00:02, 488.97it/s, env_step=751616, len=26, n/ep=2, n/st=64, player_1/loss=122.525, player_2/loss=989.310, rew=350.00]                                                                                                  


Epoch #734: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #735: 1025it [00:02, 488.51it/s, env_step=752640, len=31, n/ep=2, n/st=64, player_1/loss=273.938, player_2/loss=2722.048, rew=511.00]                                                                                                 


Epoch #735: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #736: 1025it [00:02, 487.77it/s, env_step=753664, len=26, n/ep=2, n/st=64, player_1/loss=320.244, player_2/loss=2927.528, rew=354.50]                                                                                                 


Epoch #736: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #737: 1025it [00:02, 486.85it/s, env_step=754688, len=27, n/ep=2, n/st=64, player_1/loss=202.768, player_2/loss=1250.217, rew=379.00]                                                                                                 


Epoch #737: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #738: 1025it [00:02, 487.14it/s, env_step=755712, len=32, n/ep=2, n/st=64, player_1/loss=173.162, player_2/loss=1209.745, rew=546.50]                                                                                                 


Epoch #738: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #739: 1025it [00:02, 477.48it/s, env_step=756736, len=27, n/ep=3, n/st=64, player_1/loss=283.041, player_2/loss=2303.474, rew=397.00]                                                                                                 


Epoch #739: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #740: 1025it [00:02, 487.05it/s, env_step=757760, len=35, n/ep=1, n/st=64, player_1/loss=256.893, player_2/loss=3325.077, rew=629.00]                                                                                                 


Epoch #740: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #741: 1025it [00:02, 484.83it/s, env_step=758784, len=22, n/ep=3, n/st=64, player_1/loss=459.292, player_2/loss=2177.255, rew=286.67]                                                                                                 


Epoch #741: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #742: 1025it [00:02, 487.87it/s, env_step=759808, len=37, n/ep=2, n/st=64, player_1/loss=489.360, player_2/loss=1938.012, rew=831.00]                                                                                                 


Epoch #742: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #743: 1025it [00:02, 490.00it/s, env_step=760832, len=26, n/ep=3, n/st=64, player_1/loss=196.940, player_2/loss=1410.006, rew=399.00]                                                                                                 


Epoch #743: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #744: 1025it [00:02, 488.15it/s, env_step=761856, len=27, n/ep=3, n/st=64, player_1/loss=423.447, player_2/loss=1894.825, rew=396.33]                                                                                                 


Epoch #744: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #745: 1025it [00:02, 486.83it/s, env_step=762880, len=29, n/ep=3, n/st=64, player_1/loss=418.116, player_2/loss=1571.386, rew=467.33]                                                                                                 


Epoch #745: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #746: 1025it [00:02, 487.05it/s, env_step=763904, len=34, n/ep=2, n/st=64, player_1/loss=196.107, player_2/loss=1042.985, rew=602.00]                                                                                                 


Epoch #746: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #747: 1025it [00:02, 489.87it/s, env_step=764928, len=23, n/ep=3, n/st=64, player_1/loss=97.211, player_2/loss=954.137, rew=355.33]                                                                                                   


Epoch #747: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #748: 1025it [00:02, 484.55it/s, env_step=765952, len=31, n/ep=3, n/st=64, player_1/loss=146.452, player_2/loss=1054.446, rew=545.33]                                                                                                 


Epoch #748: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #749: 1025it [00:02, 489.01it/s, env_step=766976, len=31, n/ep=3, n/st=64, player_1/loss=568.929, player_2/loss=1275.301, rew=580.67]                                                                                                 


Epoch #749: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #750: 1025it [00:02, 489.47it/s, env_step=768000, len=15, n/ep=4, n/st=64, player_1/loss=612.484, player_2/loss=2017.092, rew=120.25]                                                                                                 


Epoch #750: test_reward: 135.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #751: 1025it [00:02, 486.18it/s, env_step=769024, len=14, n/ep=4, n/st=64, player_1/loss=437.984, player_2/loss=2608.244, rew=121.50]                                                                                                 


Epoch #751: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #752: 1025it [00:02, 488.09it/s, env_step=770048, len=16, n/ep=4, n/st=64, player_1/loss=360.732, player_2/loss=3182.262, rew=158.50]                                                                                                 


Epoch #752: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #753: 1025it [00:02, 488.41it/s, env_step=771072, len=21, n/ep=3, n/st=64, player_1/loss=509.883, player_2/loss=2523.755, rew=230.33]                                                                                                 


Epoch #753: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #754: 1025it [00:02, 486.50it/s, env_step=772096, len=19, n/ep=3, n/st=64, player_1/loss=392.955, player_2/loss=1548.825, rew=203.33]                                                                                                 


Epoch #754: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #755: 1025it [00:02, 487.07it/s, env_step=773120, len=19, n/ep=3, n/st=64, player_1/loss=287.525, player_2/loss=1793.903, rew=217.67]                                                                                                 


Epoch #755: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #756: 1025it [00:02, 485.93it/s, env_step=774144, len=24, n/ep=2, n/st=64, player_1/loss=278.194, rew=321.50]                                                                                                                         


Epoch #756: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #757: 1025it [00:02, 490.62it/s, env_step=775168, len=22, n/ep=3, n/st=64, player_1/loss=303.027, player_2/loss=901.720, rew=327.33]                                                                                                  


Epoch #757: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #758: 1025it [00:02, 487.28it/s, env_step=776192, len=28, n/ep=2, n/st=64, player_1/loss=229.923, player_2/loss=1155.679, rew=409.50]                                                                                                 


Epoch #758: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #759: 1025it [00:02, 487.86it/s, env_step=777216, len=25, n/ep=2, n/st=64, player_1/loss=169.298, player_2/loss=1813.040, rew=337.00]                                                                                                 


Epoch #759: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #760: 1025it [00:02, 483.44it/s, env_step=778240, len=26, n/ep=3, n/st=64, player_1/loss=155.943, player_2/loss=1767.650, rew=368.67]                                                                                                 


Epoch #760: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #761: 1025it [00:02, 486.17it/s, env_step=779264, len=34, n/ep=2, n/st=64, player_1/loss=147.556, player_2/loss=1408.014, rew=596.00]                                                                                                 


Epoch #761: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #762: 1025it [00:02, 487.17it/s, env_step=780288, len=39, n/ep=1, n/st=64, player_1/loss=238.228, player_2/loss=1803.403, rew=779.00]                                                                                                 


Epoch #762: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #763: 1025it [00:02, 488.72it/s, env_step=781312, len=30, n/ep=3, n/st=64, player_1/loss=321.405, player_2/loss=3291.518, rew=508.67]                                                                                                 


Epoch #763: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #764: 1025it [00:02, 485.75it/s, env_step=782336, len=33, n/ep=2, n/st=64, player_1/loss=385.878, player_2/loss=3619.223, rew=598.00]                                                                                                 


Epoch #764: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #765: 1025it [00:02, 486.08it/s, env_step=783360, len=31, n/ep=2, n/st=64, player_1/loss=303.394, player_2/loss=2623.043, rew=532.00]                                                                                                 


Epoch #765: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #766: 1025it [00:02, 487.19it/s, env_step=784384, len=28, n/ep=2, n/st=64, player_2/loss=2076.847, rew=465.50]                                                                                                                        


Epoch #766: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #767: 1025it [00:02, 487.19it/s, env_step=785408, len=23, n/ep=3, n/st=64, player_1/loss=257.350, player_2/loss=1688.882, rew=358.00]                                                                                                 


Epoch #767: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #768: 1025it [00:02, 492.95it/s, env_step=786432, len=34, n/ep=2, n/st=64, player_1/loss=296.690, player_2/loss=1632.869, rew=617.50]                                                                                                 


Epoch #768: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #769: 1025it [00:02, 486.26it/s, env_step=787456, len=27, n/ep=2, n/st=64, player_1/loss=368.577, player_2/loss=1620.114, rew=427.00]                                                                                                 


Epoch #769: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #770: 1025it [00:02, 489.17it/s, env_step=788480, len=30, n/ep=2, n/st=64, player_1/loss=381.178, player_2/loss=1213.142, rew=504.50]                                                                                                 


Epoch #770: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #771: 1025it [00:02, 486.96it/s, env_step=789504, len=32, n/ep=2, n/st=64, player_1/loss=430.403, player_2/loss=1259.998, rew=588.50]                                                                                                 


Epoch #771: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #772: 1025it [00:02, 489.50it/s, env_step=790528, len=33, n/ep=2, n/st=64, player_1/loss=728.964, player_2/loss=1498.806, rew=560.00]                                                                                                 


Epoch #772: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #773: 1025it [00:02, 487.52it/s, env_step=791552, len=9, n/ep=8, n/st=64, player_1/loss=563.865, player_2/loss=1493.214, rew=45.62]                                                                                                   


Epoch #773: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #774: 1025it [00:02, 486.08it/s, env_step=792576, len=25, n/ep=3, n/st=64, player_1/loss=269.923, player_2/loss=1993.923, rew=341.33]                                                                                                 


Epoch #774: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #775: 1025it [00:02, 489.61it/s, env_step=793600, len=37, n/ep=2, n/st=64, player_1/loss=198.914, player_2/loss=2131.425, rew=702.00]                                                                                                 


Epoch #775: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #776: 1025it [00:02, 487.46it/s, env_step=794624, len=21, n/ep=4, n/st=64, player_1/loss=208.832, player_2/loss=1544.931, rew=293.25]                                                                                                 


Epoch #776: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #777: 1025it [00:02, 490.88it/s, env_step=795648, len=37, n/ep=2, n/st=64, player_1/loss=164.661, player_2/loss=1044.732, rew=704.00]                                                                                                 


Epoch #777: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #778: 1025it [00:02, 486.05it/s, env_step=796672, len=35, n/ep=2, n/st=64, player_1/loss=148.510, player_2/loss=1110.248, rew=650.00]                                                                                                 


Epoch #778: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #779: 1025it [00:02, 487.99it/s, env_step=797696, len=16, n/ep=4, n/st=64, player_1/loss=430.449, player_2/loss=1810.176, rew=151.00]                                                                                                 


Epoch #779: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #780: 1025it [00:02, 485.56it/s, env_step=798720, len=32, n/ep=2, n/st=64, player_1/loss=492.378, player_2/loss=2563.756, rew=546.50]                                                                                                 


Epoch #780: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #781: 1025it [00:02, 487.55it/s, env_step=799744, len=31, n/ep=2, n/st=64, player_1/loss=272.945, player_2/loss=2571.943, rew=527.00]                                                                                                 


Epoch #781: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #782: 1025it [00:02, 486.52it/s, env_step=800768, len=34, n/ep=2, n/st=64, player_1/loss=224.216, player_2/loss=2460.368, rew=621.50]                                                                                                 


Epoch #782: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #783: 1025it [00:02, 482.66it/s, env_step=801792, len=14, n/ep=4, n/st=64, player_1/loss=264.681, player_2/loss=2250.104, rew=116.25]                                                                                                 


Epoch #783: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #784: 1025it [00:02, 489.10it/s, env_step=802816, len=17, n/ep=3, n/st=64, player_1/loss=213.168, player_2/loss=1568.697, rew=166.67]                                                                                                 


Epoch #784: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #785: 1025it [00:02, 487.71it/s, env_step=803840, len=32, n/ep=2, n/st=64, player_1/loss=127.174, player_2/loss=2262.969, rew=553.50]                                                                                                 


Epoch #785: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #786: 1025it [00:02, 488.96it/s, env_step=804864, len=21, n/ep=3, n/st=64, player_1/loss=128.433, player_2/loss=2499.613, rew=231.33]                                                                                                 


Epoch #786: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #787: 1025it [00:02, 485.25it/s, env_step=805888, len=28, n/ep=2, n/st=64, player_1/loss=176.256, player_2/loss=2202.899, rew=425.50]                                                                                                 


Epoch #787: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #788: 1025it [00:02, 487.20it/s, env_step=806912, len=8, n/ep=8, n/st=64, player_1/loss=478.842, player_2/loss=2110.743, rew=36.38]                                                                                                   


Epoch #788: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #789: 1025it [00:02, 481.57it/s, env_step=807936, len=12, n/ep=4, n/st=64, player_1/loss=614.593, player_2/loss=3157.688, rew=110.00]                                                                                                 


Epoch #789: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #790: 1025it [00:02, 486.02it/s, env_step=808960, len=17, n/ep=4, n/st=64, player_1/loss=479.351, player_2/loss=3311.200, rew=156.75]                                                                                                 


Epoch #790: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #791: 1025it [00:02, 485.88it/s, env_step=809984, len=34, n/ep=2, n/st=64, player_1/loss=481.009, player_2/loss=1573.917, rew=606.50]                                                                                                 


Epoch #791: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #792: 1025it [00:02, 480.43it/s, env_step=811008, len=34, n/ep=2, n/st=64, player_1/loss=282.334, player_2/loss=1521.894, rew=726.00]                                                                                                 


Epoch #792: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #793: 1025it [00:02, 486.17it/s, env_step=812032, len=34, n/ep=2, n/st=64, player_1/loss=196.790, player_2/loss=1939.876, rew=614.50]                                                                                                 


Epoch #793: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #794: 1025it [00:02, 487.53it/s, env_step=813056, len=29, n/ep=2, n/st=64, player_1/loss=200.070, player_2/loss=1329.746, rew=494.50]                                                                                                 


Epoch #794: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #795: 1025it [00:02, 489.00it/s, env_step=814080, len=31, n/ep=2, n/st=64, player_1/loss=153.989, player_2/loss=721.764, rew=497.00]                                                                                                  


Epoch #795: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #796: 1025it [00:02, 488.22it/s, env_step=815104, len=34, n/ep=2, n/st=64, player_1/loss=345.210, player_2/loss=2649.071, rew=739.50]                                                                                                 


Epoch #796: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #797: 1025it [00:02, 486.03it/s, env_step=816128, len=28, n/ep=2, n/st=64, player_1/loss=425.220, player_2/loss=2982.066, rew=409.50]                                                                                                 


Epoch #797: test_reward: 65.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #798: 1025it [00:02, 488.64it/s, env_step=817152, len=36, n/ep=2, n/st=64, player_1/loss=373.045, player_2/loss=1465.229, rew=665.50]                                                                                                 


Epoch #798: test_reward: 594.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #799: 1025it [00:02, 486.31it/s, env_step=818176, len=33, n/ep=2, n/st=64, player_1/loss=571.141, player_2/loss=1829.599, rew=578.00]                                                                                                 


Epoch #799: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #800: 1025it [00:02, 491.97it/s, env_step=819200, len=26, n/ep=2, n/st=64, player_1/loss=379.093, player_2/loss=1623.796, rew=369.50]                                                                                                 


Epoch #800: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #801: 1025it [00:02, 488.82it/s, env_step=820224, len=33, n/ep=2, n/st=64, player_1/loss=359.102, player_2/loss=1589.291, rew=592.00]                                                                                                 


Epoch #801: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #802: 1025it [00:02, 487.25it/s, env_step=821248, len=37, n/ep=2, n/st=64, player_1/loss=118.745, player_2/loss=1604.921, rew=721.00]                                                                                                 


Epoch #802: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #803: 1025it [00:02, 486.11it/s, env_step=822272, len=32, n/ep=2, n/st=64, player_1/loss=373.607, player_2/loss=1300.933, rew=543.50]                                                                                                 


Epoch #803: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #804: 1025it [00:02, 491.67it/s, env_step=823296, len=38, n/ep=2, n/st=64, player_1/loss=491.332, player_2/loss=2045.572, rew=759.50]                                                                                                 


Epoch #804: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #805: 1025it [00:02, 486.92it/s, env_step=824320, len=34, n/ep=2, n/st=64, player_2/loss=3633.703, rew=612.50]                                                                                                                        


Epoch #805: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #806: 1025it [00:02, 481.38it/s, env_step=825344, len=37, n/ep=2, n/st=64, player_1/loss=269.508, player_2/loss=2642.435, rew=721.00]                                                                                                 


Epoch #806: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #807: 1025it [00:02, 490.44it/s, env_step=826368, len=29, n/ep=2, n/st=64, player_1/loss=207.029, player_2/loss=1861.974, rew=459.00]                                                                                                 


Epoch #807: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #808: 1025it [00:02, 490.04it/s, env_step=827392, len=28, n/ep=3, n/st=64, player_1/loss=294.262, player_2/loss=2037.857, rew=455.00]                                                                                                 


Epoch #808: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #809: 1025it [00:02, 489.34it/s, env_step=828416, len=33, n/ep=2, n/st=64, player_1/loss=290.572, player_2/loss=1543.720, rew=572.50]                                                                                                 


Epoch #809: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #810: 1025it [00:02, 482.74it/s, env_step=829440, len=30, n/ep=2, n/st=64, player_1/loss=373.334, player_2/loss=1154.393, rew=479.50]                                                                                                 


Epoch #810: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #811: 1025it [00:02, 489.00it/s, env_step=830464, len=21, n/ep=3, n/st=64, player_1/loss=360.366, player_2/loss=1039.459, rew=250.33]                                                                                                 


Epoch #811: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #812: 1025it [00:02, 486.71it/s, env_step=831488, len=18, n/ep=3, n/st=64, player_1/loss=260.820, player_2/loss=1914.602, rew=187.00]                                                                                                 


Epoch #812: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #813: 1025it [00:02, 487.88it/s, env_step=832512, len=34, n/ep=2, n/st=64, player_1/loss=365.344, player_2/loss=3126.942, rew=612.00]                                                                                                 


Epoch #813: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #814: 1025it [00:02, 489.84it/s, env_step=833536, len=35, n/ep=2, n/st=64, player_1/loss=578.368, player_2/loss=1988.986, rew=629.50]                                                                                                 


Epoch #814: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #815: 1025it [00:02, 488.63it/s, env_step=834560, len=33, n/ep=2, n/st=64, player_1/loss=546.426, player_2/loss=1940.064, rew=560.50]                                                                                                 


Epoch #815: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #816: 1025it [00:02, 483.90it/s, env_step=835584, len=27, n/ep=3, n/st=64, player_1/loss=376.082, player_2/loss=2012.477, rew=385.33]                                                                                                 


Epoch #816: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #817: 1025it [00:02, 485.11it/s, env_step=836608, len=36, n/ep=2, n/st=64, player_1/loss=305.699, player_2/loss=1997.066, rew=667.00]                                                                                                 


Epoch #817: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #818: 1025it [00:02, 487.77it/s, env_step=837632, len=36, n/ep=2, n/st=64, player_1/loss=207.729, player_2/loss=1982.772, rew=798.50]                                                                                                 


Epoch #818: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #819: 1025it [00:02, 484.63it/s, env_step=838656, len=35, n/ep=2, n/st=64, player_1/loss=177.933, player_2/loss=2536.997, rew=633.50]                                                                                                 


Epoch #819: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #820: 1025it [00:02, 487.15it/s, env_step=839680, len=33, n/ep=2, n/st=64, player_1/loss=232.913, player_2/loss=2810.887, rew=562.00]                                                                                                 


Epoch #820: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #821: 1025it [00:02, 486.98it/s, env_step=840704, len=28, n/ep=2, n/st=64, player_1/loss=197.794, player_2/loss=2128.100, rew=434.50]                                                                                                 


Epoch #821: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #822: 1025it [00:02, 486.63it/s, env_step=841728, len=26, n/ep=3, n/st=64, player_1/loss=265.925, player_2/loss=2259.514, rew=381.00]                                                                                                 


Epoch #822: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #823: 1025it [00:02, 490.15it/s, env_step=842752, len=36, n/ep=1, n/st=64, player_1/loss=429.824, player_2/loss=1765.074, rew=665.00]                                                                                                 


Epoch #823: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #824: 1025it [00:02, 487.92it/s, env_step=843776, len=34, n/ep=2, n/st=64, player_1/loss=307.793, player_2/loss=1685.915, rew=606.50]                                                                                                 


Epoch #824: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #825: 1025it [00:02, 485.86it/s, env_step=844800, len=37, n/ep=1, n/st=64, player_1/loss=216.480, player_2/loss=1177.902, rew=702.00]                                                                                                 


Epoch #825: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #826: 1025it [00:02, 489.46it/s, env_step=845824, len=28, n/ep=3, n/st=64, player_1/loss=300.829, player_2/loss=2077.889, rew=405.33]                                                                                                 


Epoch #826: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #827: 1025it [00:02, 486.56it/s, env_step=846848, len=35, n/ep=2, n/st=64, player_1/loss=253.980, player_2/loss=2397.922, rew=657.00]                                                                                                 


Epoch #827: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #828: 1025it [00:02, 485.08it/s, env_step=847872, len=37, n/ep=2, n/st=64, player_1/loss=280.979, player_2/loss=2216.646, rew=702.50]                                                                                                 


Epoch #828: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #829: 1025it [00:02, 486.05it/s, env_step=848896, len=28, n/ep=1, n/st=64, player_1/loss=284.976, player_2/loss=1921.920, rew=405.00]                                                                                                 


Epoch #829: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #830: 1025it [00:02, 488.90it/s, env_step=849920, len=34, n/ep=2, n/st=64, player_1/loss=328.863, player_2/loss=2205.338, rew=632.50]                                                                                                 


Epoch #830: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #831: 1025it [00:02, 486.85it/s, env_step=850944, len=42, n/ep=1, n/st=64, player_1/loss=664.384, player_2/loss=1986.809, rew=1102.00]                                                                                                


Epoch #831: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #832: 1025it [00:02, 484.18it/s, env_step=851968, len=25, n/ep=2, n/st=64, player_1/loss=726.492, player_2/loss=1758.955, rew=340.00]                                                                                                 


Epoch #832: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #833: 1025it [00:02, 485.77it/s, env_step=852992, len=32, n/ep=2, n/st=64, player_1/loss=465.971, player_2/loss=1245.779, rew=527.50]                                                                                                 


Epoch #833: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #834: 1025it [00:02, 488.37it/s, env_step=854016, len=29, n/ep=2, n/st=64, player_1/loss=221.586, player_2/loss=1004.518, rew=436.00]                                                                                                 


Epoch #834: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #835: 1025it [00:02, 487.99it/s, env_step=855040, len=38, n/ep=2, n/st=64, player_1/loss=237.435, player_2/loss=1130.225, rew=742.00]                                                                                                 


Epoch #835: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #836: 1025it [00:02, 487.92it/s, env_step=856064, len=42, n/ep=1, n/st=64, player_1/loss=474.801, player_2/loss=1298.619, rew=1102.00]                                                                                                


Epoch #836: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #837: 1025it [00:02, 486.24it/s, env_step=857088, len=41, n/ep=1, n/st=64, player_1/loss=450.436, player_2/loss=2009.407, rew=860.00]                                                                                                 


Epoch #837: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #838: 1025it [00:02, 486.27it/s, env_step=858112, len=38, n/ep=1, n/st=64, player_1/loss=332.103, player_2/loss=1942.849, rew=740.00]                                                                                                 


Epoch #838: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #839: 1025it [00:02, 490.34it/s, env_step=859136, len=21, n/ep=3, n/st=64, player_1/loss=283.474, player_2/loss=2010.950, rew=250.33]                                                                                                 


Epoch #839: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #840: 1025it [00:02, 486.17it/s, env_step=860160, len=21, n/ep=3, n/st=64, player_1/loss=359.702, player_2/loss=2004.962, rew=230.33]                                                                                                 


Epoch #840: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #841: 1025it [00:02, 484.04it/s, env_step=861184, len=32, n/ep=2, n/st=64, player_1/loss=505.339, player_2/loss=1882.635, rew=529.00]                                                                                                 


Epoch #841: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #842: 1025it [00:02, 489.03it/s, env_step=862208, len=33, n/ep=2, n/st=64, player_1/loss=390.940, player_2/loss=2329.616, rew=583.00]                                                                                                 


Epoch #842: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #843: 1025it [00:02, 484.68it/s, env_step=863232, len=35, n/ep=2, n/st=64, player_1/loss=467.076, player_2/loss=3109.896, rew=653.00]                                                                                                 


Epoch #843: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #844: 1025it [00:02, 484.82it/s, env_step=864256, len=28, n/ep=2, n/st=64, player_1/loss=466.801, player_2/loss=2384.766, rew=434.50]                                                                                                 


Epoch #844: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #845: 1025it [00:02, 486.53it/s, env_step=865280, len=21, n/ep=3, n/st=64, player_1/loss=637.351, player_2/loss=1672.666, rew=230.33]                                                                                                 


Epoch #845: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #846: 1025it [00:02, 485.29it/s, env_step=866304, len=31, n/ep=3, n/st=64, player_1/loss=702.342, player_2/loss=2202.664, rew=532.33]                                                                                                 


Epoch #846: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #847: 1025it [00:02, 485.86it/s, env_step=867328, len=39, n/ep=2, n/st=64, player_1/loss=637.806, player_2/loss=2889.661, rew=783.50]                                                                                                 


Epoch #847: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #848: 1025it [00:02, 488.58it/s, env_step=868352, len=21, n/ep=3, n/st=64, player_1/loss=213.580, player_2/loss=2557.945, rew=240.33]                                                                                                 


Epoch #848: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #849: 1025it [00:02, 488.64it/s, env_step=869376, len=32, n/ep=2, n/st=64, player_1/loss=233.984, rew=558.50]                                                                                                                         


Epoch #849: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #850: 1025it [00:02, 484.68it/s, env_step=870400, len=25, n/ep=3, n/st=64, player_1/loss=309.121, player_2/loss=1779.612, rew=356.00]                                                                                                 


Epoch #850: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #851: 1025it [00:02, 489.72it/s, env_step=871424, len=20, n/ep=3, n/st=64, player_1/loss=197.911, player_2/loss=1127.344, rew=237.67]                                                                                                 


Epoch #851: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #852: 1025it [00:02, 488.74it/s, env_step=872448, len=27, n/ep=3, n/st=64, player_2/loss=1956.074, rew=379.33]                                                                                                                        


Epoch #852: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #853: 1025it [00:02, 487.46it/s, env_step=873472, len=27, n/ep=3, n/st=64, player_1/loss=299.421, player_2/loss=2513.373, rew=443.00]                                                                                                 


Epoch #853: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #854: 1025it [00:02, 487.42it/s, env_step=874496, len=24, n/ep=4, n/st=64, player_1/loss=527.219, player_2/loss=2951.200, rew=348.25]                                                                                                 


Epoch #854: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #855: 1025it [00:02, 483.59it/s, env_step=875520, len=28, n/ep=2, n/st=64, player_1/loss=510.658, player_2/loss=2058.501, rew=455.00]                                                                                                 


Epoch #855: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #856: 1025it [00:02, 486.32it/s, env_step=876544, len=31, n/ep=2, n/st=64, player_1/loss=232.087, player_2/loss=1552.595, rew=495.50]                                                                                                 


Epoch #856: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #857: 1025it [00:02, 487.73it/s, env_step=877568, len=41, n/ep=2, n/st=64, player_1/loss=202.220, player_2/loss=1188.556, rew=960.50]                                                                                                 


Epoch #857: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #858: 1025it [00:02, 488.82it/s, env_step=878592, len=24, n/ep=3, n/st=64, player_1/loss=207.452, player_2/loss=878.743, rew=342.00]                                                                                                  


Epoch #858: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #859: 1025it [00:02, 487.31it/s, env_step=879616, len=29, n/ep=2, n/st=64, player_1/loss=295.818, player_2/loss=2388.399, rew=452.00]                                                                                                 


Epoch #859: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #860: 1025it [00:02, 488.84it/s, env_step=880640, len=23, n/ep=3, n/st=64, player_1/loss=334.541, player_2/loss=3141.059, rew=292.33]                                                                                                 


Epoch #860: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #861: 1025it [00:02, 487.65it/s, env_step=881664, len=25, n/ep=2, n/st=64, player_1/loss=303.568, player_2/loss=2940.887, rew=373.00]                                                                                                 


Epoch #861: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #862: 1025it [00:02, 485.36it/s, env_step=882688, len=21, n/ep=3, n/st=64, player_1/loss=395.499, player_2/loss=3687.902, rew=245.33]                                                                                                 


Epoch #862: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #863: 1025it [00:02, 489.43it/s, env_step=883712, len=42, n/ep=2, n/st=64, player_1/loss=423.212, player_2/loss=2611.473, rew=1102.00]                                                                                                


Epoch #863: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #864: 1025it [00:02, 484.56it/s, env_step=884736, len=34, n/ep=2, n/st=64, player_1/loss=249.981, player_2/loss=1833.204, rew=621.50]                                                                                                 


Epoch #864: test_reward: 594.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #865: 1025it [00:02, 484.63it/s, env_step=885760, len=35, n/ep=2, n/st=64, player_1/loss=153.474, player_2/loss=1925.311, rew=633.50]                                                                                                 


Epoch #865: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #866: 1025it [00:02, 486.00it/s, env_step=886784, len=20, n/ep=4, n/st=64, player_1/loss=146.853, player_2/loss=1490.485, rew=215.50]                                                                                                 


Epoch #866: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #867: 1025it [00:02, 483.74it/s, env_step=887808, len=31, n/ep=2, n/st=64, player_1/loss=233.508, player_2/loss=1037.364, rew=513.00]                                                                                                 


Epoch #867: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #868: 1025it [00:02, 488.23it/s, env_step=888832, len=17, n/ep=3, n/st=64, player_1/loss=391.021, player_2/loss=2279.871, rew=179.33]                                                                                                 


Epoch #868: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #869: 1025it [00:02, 489.71it/s, env_step=889856, len=26, n/ep=2, n/st=64, player_1/loss=357.891, player_2/loss=2568.580, rew=364.50]                                                                                                 


Epoch #869: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #870: 1025it [00:02, 485.99it/s, env_step=890880, len=22, n/ep=3, n/st=64, player_1/loss=432.453, player_2/loss=2340.410, rew=278.33]                                                                                                 


Epoch #870: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #871: 1025it [00:02, 484.92it/s, env_step=891904, len=29, n/ep=2, n/st=64, player_1/loss=359.508, player_2/loss=2224.353, rew=436.00]                                                                                                 


Epoch #871: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #872: 1025it [00:02, 480.52it/s, env_step=892928, len=19, n/ep=3, n/st=64, player_1/loss=419.422, player_2/loss=1999.866, rew=198.00]                                                                                                 


Epoch #872: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #873: 1025it [00:02, 484.43it/s, env_step=893952, len=34, n/ep=2, n/st=64, player_1/loss=309.963, player_2/loss=1485.896, rew=594.50]                                                                                                 


Epoch #873: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #874: 1025it [00:02, 487.29it/s, env_step=894976, len=21, n/ep=3, n/st=64, player_1/loss=189.992, player_2/loss=971.521, rew=243.00]                                                                                                  


Epoch #874: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #875: 1025it [00:02, 487.17it/s, env_step=896000, len=38, n/ep=1, n/st=64, player_1/loss=479.314, player_2/loss=1505.627, rew=740.00]                                                                                                 


Epoch #875: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #876: 1025it [00:02, 484.23it/s, env_step=897024, len=33, n/ep=2, n/st=64, player_1/loss=646.822, player_2/loss=1991.211, rew=578.00]                                                                                                 


Epoch #876: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #877: 1025it [00:02, 482.68it/s, env_step=898048, len=33, n/ep=2, n/st=64, player_1/loss=551.901, player_2/loss=1660.171, rew=700.50]                                                                                                 


Epoch #877: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #878: 1025it [00:02, 481.21it/s, env_step=899072, len=17, n/ep=2, n/st=64, player_1/loss=472.886, player_2/loss=1666.838, rew=176.00]                                                                                                 


Epoch #878: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #879: 1025it [00:02, 489.30it/s, env_step=900096, len=36, n/ep=2, n/st=64, player_1/loss=406.077, player_2/loss=1441.315, rew=665.50]                                                                                                 


Epoch #879: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #880: 1025it [00:02, 487.46it/s, env_step=901120, len=31, n/ep=2, n/st=64, player_1/loss=312.032, player_2/loss=2331.147, rew=499.50]                                                                                                 


Epoch #880: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #881: 1025it [00:02, 484.28it/s, env_step=902144, len=26, n/ep=2, n/st=64, player_1/loss=832.112, player_2/loss=2690.134, rew=410.50]                                                                                                 


Epoch #881: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #882: 1025it [00:02, 487.23it/s, env_step=903168, len=30, n/ep=2, n/st=64, player_1/loss=763.702, player_2/loss=3061.582, rew=645.50]                                                                                                 


Epoch #882: test_reward: 594.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #883: 1025it [00:02, 489.20it/s, env_step=904192, len=24, n/ep=2, n/st=64, player_1/loss=440.021, player_2/loss=2945.821, rew=311.50]                                                                                                 


Epoch #883: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #884: 1025it [00:02, 486.66it/s, env_step=905216, len=38, n/ep=2, n/st=64, player_1/loss=422.607, player_2/loss=2017.944, rew=740.00]                                                                                                 


Epoch #884: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #885: 1025it [00:02, 488.80it/s, env_step=906240, len=31, n/ep=3, n/st=64, player_1/loss=290.960, player_2/loss=1225.175, rew=563.00]                                                                                                 


Epoch #885: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #886: 1025it [00:02, 486.42it/s, env_step=907264, len=33, n/ep=2, n/st=64, player_1/loss=188.206, player_2/loss=984.142, rew=587.00]                                                                                                  


Epoch #886: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #887: 1025it [00:02, 485.24it/s, env_step=908288, len=33, n/ep=2, n/st=64, player_1/loss=216.394, player_2/loss=1245.208, rew=592.00]                                                                                                 


Epoch #887: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #888: 1025it [00:02, 487.54it/s, env_step=909312, len=34, n/ep=1, n/st=64, player_1/loss=142.493, player_2/loss=1275.709, rew=594.00]                                                                                                 


Epoch #888: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #889: 1025it [00:02, 485.93it/s, env_step=910336, len=30, n/ep=3, n/st=64, player_1/loss=434.027, player_2/loss=2048.128, rew=498.67]                                                                                                 


Epoch #889: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #890: 1025it [00:02, 485.08it/s, env_step=911360, len=38, n/ep=1, n/st=64, player_1/loss=364.984, player_2/loss=3289.118, rew=740.00]                                                                                                 


Epoch #890: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #891: 1025it [00:02, 488.82it/s, env_step=912384, len=15, n/ep=4, n/st=64, player_1/loss=495.658, player_2/loss=3305.653, rew=120.75]                                                                                                 


Epoch #891: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #892: 1025it [00:02, 488.14it/s, env_step=913408, len=36, n/ep=2, n/st=64, player_1/loss=795.926, player_2/loss=2265.245, rew=665.50]                                                                                                 


Epoch #892: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #893: 1025it [00:02, 486.67it/s, env_step=914432, len=27, n/ep=2, n/st=64, player_1/loss=451.256, player_2/loss=3013.250, rew=419.00]                                                                                                 


Epoch #893: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #894: 1025it [00:02, 486.72it/s, env_step=915456, len=36, n/ep=2, n/st=64, player_1/loss=153.872, player_2/loss=3877.582, rew=684.50]                                                                                                 


Epoch #894: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #895: 1025it [00:02, 487.42it/s, env_step=916480, len=34, n/ep=2, n/st=64, player_1/loss=187.302, player_2/loss=4203.036, rew=612.00]                                                                                                 


Epoch #895: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #896: 1025it [00:02, 486.19it/s, env_step=917504, len=30, n/ep=2, n/st=64, player_1/loss=167.334, player_2/loss=3135.539, rew=464.50]                                                                                                 


Epoch #896: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #897: 1025it [00:02, 487.44it/s, env_step=918528, len=32, n/ep=2, n/st=64, player_1/loss=136.809, player_2/loss=2126.931, rew=546.50]                                                                                                 


Epoch #897: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #898: 1025it [00:02, 484.59it/s, env_step=919552, len=34, n/ep=2, n/st=64, player_1/loss=233.532, player_2/loss=1214.381, rew=598.50]                                                                                                 


Epoch #898: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #899: 1025it [00:02, 487.39it/s, env_step=920576, len=38, n/ep=1, n/st=64, player_1/loss=307.041, player_2/loss=1594.337, rew=740.00]                                                                                                 


Epoch #899: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #900: 1025it [00:02, 486.88it/s, env_step=921600, len=13, n/ep=4, n/st=64, player_1/loss=357.936, player_2/loss=2726.774, rew=101.00]                                                                                                 


Epoch #900: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #901: 1025it [00:02, 489.60it/s, env_step=922624, len=39, n/ep=1, n/st=64, player_1/loss=362.758, player_2/loss=2383.990, rew=779.00]                                                                                                 


Epoch #901: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #902: 1025it [00:02, 485.17it/s, env_step=923648, len=28, n/ep=2, n/st=64, player_1/loss=328.024, player_2/loss=1826.181, rew=405.50]                                                                                                 


Epoch #902: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #903: 1025it [00:02, 484.62it/s, env_step=924672, len=32, n/ep=3, n/st=64, player_1/loss=457.601, player_2/loss=2415.751, rew=549.33]                                                                                                 


Epoch #903: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #904: 1025it [00:02, 485.94it/s, env_step=925696, len=31, n/ep=2, n/st=64, player_1/loss=406.817, player_2/loss=2161.726, rew=511.00]                                                                                                 


Epoch #904: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #905: 1025it [00:02, 486.32it/s, env_step=926720, len=38, n/ep=1, n/st=64, player_1/loss=176.092, player_2/loss=1479.502, rew=740.00]                                                                                                 


Epoch #905: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #906: 1025it [00:02, 484.12it/s, env_step=927744, len=29, n/ep=2, n/st=64, player_1/loss=321.622, player_2/loss=1654.430, rew=477.00]                                                                                                 


Epoch #906: test_reward: 119.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #907: 1025it [00:02, 487.26it/s, env_step=928768, len=32, n/ep=2, n/st=64, player_1/loss=312.618, player_2/loss=2054.216, rew=551.50]                                                                                                 


Epoch #907: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #908: 1025it [00:02, 484.96it/s, env_step=929792, len=31, n/ep=2, n/st=64, player_1/loss=456.435, player_2/loss=1993.768, rew=497.00]                                                                                                 


Epoch #908: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #909: 1025it [00:02, 484.62it/s, env_step=930816, len=28, n/ep=3, n/st=64, player_1/loss=546.526, player_2/loss=1823.652, rew=409.33]                                                                                                 


Epoch #909: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #910: 1025it [00:02, 489.33it/s, env_step=931840, len=32, n/ep=2, n/st=64, player_1/loss=317.586, player_2/loss=1600.883, rew=539.50]                                                                                                 


Epoch #910: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #911: 1025it [00:02, 483.98it/s, env_step=932864, len=40, n/ep=1, n/st=64, player_1/loss=312.252, player_2/loss=1495.167, rew=819.00]                                                                                                 


Epoch #911: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #912: 1025it [00:02, 485.30it/s, env_step=933888, len=32, n/ep=2, n/st=64, player_1/loss=425.919, player_2/loss=1521.170, rew=546.50]                                                                                                 


Epoch #912: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #913: 1025it [00:02, 488.84it/s, env_step=934912, len=31, n/ep=3, n/st=64, player_1/loss=404.198, player_2/loss=1523.360, rew=528.67]                                                                                                 


Epoch #913: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #914: 1025it [00:02, 486.82it/s, env_step=935936, len=39, n/ep=2, n/st=64, player_1/loss=335.624, player_2/loss=1225.907, rew=800.00]                                                                                                 


Epoch #914: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #915: 1025it [00:02, 483.15it/s, env_step=936960, len=37, n/ep=1, n/st=64, player_1/loss=252.558, player_2/loss=1442.842, rew=702.00]                                                                                                 


Epoch #915: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #916: 1025it [00:02, 486.05it/s, env_step=937984, len=26, n/ep=2, n/st=64, player_1/loss=249.407, player_2/loss=1727.258, rew=350.00]                                                                                                 


Epoch #916: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #917: 1025it [00:02, 489.94it/s, env_step=939008, len=37, n/ep=2, n/st=64, player_1/loss=219.612, player_2/loss=2194.002, rew=721.00]                                                                                                 


Epoch #917: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #918: 1025it [00:02, 483.93it/s, env_step=940032, len=26, n/ep=2, n/st=64, player_1/loss=353.361, player_2/loss=1346.503, rew=364.50]                                                                                                 


Epoch #918: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #919: 1025it [00:02, 484.57it/s, env_step=941056, len=37, n/ep=2, n/st=64, player_1/loss=345.987, player_2/loss=982.153, rew=721.00]                                                                                                  


Epoch #919: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #920: 1025it [00:02, 488.82it/s, env_step=942080, len=38, n/ep=1, n/st=64, player_1/loss=171.429, player_2/loss=1503.770, rew=740.00]                                                                                                 


Epoch #920: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #921: 1025it [00:02, 484.22it/s, env_step=943104, len=35, n/ep=2, n/st=64, player_1/loss=366.076, player_2/loss=2172.068, rew=633.50]                                                                                                 


Epoch #921: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #922: 1025it [00:02, 489.07it/s, env_step=944128, len=31, n/ep=2, n/st=64, player_1/loss=503.639, player_2/loss=2446.581, rew=512.00]                                                                                                 


Epoch #922: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #923: 1025it [00:02, 483.00it/s, env_step=945152, len=28, n/ep=2, n/st=64, player_1/loss=473.376, player_2/loss=2968.446, rew=422.50]                                                                                                 


Epoch #923: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #924: 1025it [00:02, 487.25it/s, env_step=946176, len=25, n/ep=3, n/st=64, player_1/loss=418.301, player_2/loss=2980.573, rew=373.33]                                                                                                 


Epoch #924: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #925: 1025it [00:02, 486.44it/s, env_step=947200, len=41, n/ep=1, n/st=64, player_1/loss=487.880, player_2/loss=2257.895, rew=860.00]                                                                                                 


Epoch #925: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #926: 1025it [00:02, 486.88it/s, env_step=948224, len=29, n/ep=3, n/st=64, player_1/loss=568.112, player_2/loss=2179.236, rew=567.00]                                                                                                 


Epoch #926: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #927: 1025it [00:02, 482.97it/s, env_step=949248, len=25, n/ep=2, n/st=64, player_1/loss=662.318, player_2/loss=1748.649, rew=328.50]                                                                                                 


Epoch #927: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #928: 1025it [00:02, 483.31it/s, env_step=950272, len=35, n/ep=2, n/st=64, player_1/loss=377.562, player_2/loss=1952.319, rew=753.50]                                                                                                 


Epoch #928: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #929: 1025it [00:02, 485.58it/s, env_step=951296, len=39, n/ep=1, n/st=64, player_1/loss=274.383, player_2/loss=2894.272, rew=779.00]                                                                                                 


Epoch #929: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #930: 1025it [00:02, 487.21it/s, env_step=952320, len=28, n/ep=2, n/st=64, player_1/loss=347.815, player_2/loss=2873.148, rew=489.50]                                                                                                 


Epoch #930: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #931: 1025it [00:02, 483.72it/s, env_step=953344, len=27, n/ep=3, n/st=64, player_1/loss=350.173, player_2/loss=1770.405, rew=397.00]                                                                                                 


Epoch #931: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #932: 1025it [00:02, 486.90it/s, env_step=954368, len=33, n/ep=2, n/st=64, player_1/loss=135.161, player_2/loss=1565.385, rew=562.00]                                                                                                 


Epoch #932: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #933: 1025it [00:02, 487.03it/s, env_step=955392, len=40, n/ep=2, n/st=64, player_1/loss=156.939, player_2/loss=2317.460, rew=819.00]                                                                                                 


Epoch #933: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #934: 1025it [00:02, 486.65it/s, env_step=956416, len=42, n/ep=1, n/st=64, player_1/loss=263.477, player_2/loss=2239.941, rew=1102.00]                                                                                                


Epoch #934: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #935: 1025it [00:02, 485.39it/s, env_step=957440, len=39, n/ep=2, n/st=64, player_1/loss=275.910, player_2/loss=1451.718, rew=799.00]                                                                                                 


Epoch #935: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #936: 1025it [00:02, 488.95it/s, env_step=958464, len=28, n/ep=2, n/st=64, player_1/loss=171.877, player_2/loss=1217.373, rew=419.50]                                                                                                 


Epoch #936: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #937: 1025it [00:02, 484.56it/s, env_step=959488, len=28, n/ep=3, n/st=64, player_1/loss=188.481, player_2/loss=2183.603, rew=542.00]                                                                                                 


Epoch #937: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #938: 1025it [00:02, 486.83it/s, env_step=960512, len=28, n/ep=2, n/st=64, player_1/loss=278.759, player_2/loss=1737.124, rew=429.50]                                                                                                 


Epoch #938: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #939: 1025it [00:02, 487.13it/s, env_step=961536, len=35, n/ep=2, n/st=64, player_1/loss=473.399, player_2/loss=1199.674, rew=647.00]                                                                                                 


Epoch #939: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #940: 1025it [00:02, 489.42it/s, env_step=962560, len=32, n/ep=1, n/st=64, player_1/loss=493.624, player_2/loss=1882.206, rew=527.00]                                                                                                 


Epoch #940: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #941: 1025it [00:02, 491.76it/s, env_step=963584, len=29, n/ep=2, n/st=64, player_1/loss=231.783, player_2/loss=2316.068, rew=494.00]                                                                                                 


Epoch #941: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #942: 1025it [00:02, 490.42it/s, env_step=964608, len=24, n/ep=3, n/st=64, player_1/loss=194.717, player_2/loss=1331.839, rew=349.33]                                                                                                 


Epoch #942: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #943: 1025it [00:02, 484.65it/s, env_step=965632, len=34, n/ep=2, n/st=64, player_1/loss=157.771, player_2/loss=1542.288, rew=726.00]                                                                                                 


Epoch #943: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #944: 1025it [00:02, 490.17it/s, env_step=966656, len=30, n/ep=2, n/st=64, player_1/loss=169.434, player_2/loss=1489.651, rew=466.00]                                                                                                 


Epoch #944: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #945: 1025it [00:02, 489.90it/s, env_step=967680, len=28, n/ep=2, n/st=64, player_1/loss=264.639, player_2/loss=1826.877, rew=429.50]                                                                                                 


Epoch #945: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #946: 1025it [00:02, 486.42it/s, env_step=968704, len=30, n/ep=3, n/st=64, player_1/loss=312.487, player_2/loss=2160.124, rew=514.33]                                                                                                 


Epoch #946: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #947: 1025it [00:02, 486.60it/s, env_step=969728, len=32, n/ep=2, n/st=64, player_1/loss=611.143, player_2/loss=1866.276, rew=553.50]                                                                                                 


Epoch #947: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #948: 1025it [00:02, 484.29it/s, env_step=970752, len=40, n/ep=2, n/st=64, player_1/loss=571.474, player_2/loss=2694.835, rew=921.00]                                                                                                 


Epoch #948: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #949: 1025it [00:02, 485.53it/s, env_step=971776, len=24, n/ep=3, n/st=64, player_1/loss=474.579, player_2/loss=1776.840, rew=337.33]                                                                                                 


Epoch #949: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #950: 1025it [00:02, 484.96it/s, env_step=972800, len=36, n/ep=2, n/st=64, player_1/loss=829.312, player_2/loss=1461.048, rew=689.50]                                                                                                 


Epoch #950: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #951: 1025it [00:02, 485.58it/s, env_step=973824, len=28, n/ep=3, n/st=64, player_1/loss=598.265, player_2/loss=1411.403, rew=449.67]                                                                                                 


Epoch #951: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #952: 1025it [00:02, 483.31it/s, env_step=974848, len=31, n/ep=2, n/st=64, player_2/loss=1073.738, rew=495.50]                                                                                                                        


Epoch #952: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #953: 1025it [00:02, 488.90it/s, env_step=975872, len=32, n/ep=2, n/st=64, player_1/loss=215.916, player_2/loss=1509.791, rew=529.00]                                                                                                 


Epoch #953: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #954: 1025it [00:02, 486.18it/s, env_step=976896, len=31, n/ep=2, n/st=64, player_1/loss=146.617, player_2/loss=1306.600, rew=495.50]                                                                                                 


Epoch #954: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #955: 1025it [00:02, 486.53it/s, env_step=977920, len=27, n/ep=2, n/st=64, player_1/loss=224.989, player_2/loss=1424.799, rew=395.00]                                                                                                 


Epoch #955: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #956: 1025it [00:02, 484.19it/s, env_step=978944, len=30, n/ep=2, n/st=64, player_1/loss=357.855, player_2/loss=1782.058, rew=479.50]                                                                                                 


Epoch #956: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #957: 1025it [00:02, 485.69it/s, env_step=979968, len=19, n/ep=2, n/st=64, player_1/loss=208.790, player_2/loss=2111.105, rew=227.00]                                                                                                 


Epoch #957: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #958: 1025it [00:02, 485.30it/s, env_step=980992, len=24, n/ep=2, n/st=64, player_1/loss=220.847, player_2/loss=1987.026, rew=323.50]                                                                                                 


Epoch #958: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #959: 1025it [00:02, 487.90it/s, env_step=982016, len=41, n/ep=2, n/st=64, player_1/loss=210.142, player_2/loss=1684.549, rew=981.00]                                                                                                 


Epoch #959: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #960: 1025it [00:02, 489.06it/s, env_step=983040, len=33, n/ep=2, n/st=64, player_1/loss=311.445, player_2/loss=2683.807, rew=578.00]                                                                                                 


Epoch #960: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #961: 1025it [00:02, 485.68it/s, env_step=984064, len=29, n/ep=3, n/st=64, player_1/loss=288.259, player_2/loss=2340.511, rew=453.00]                                                                                                 


Epoch #961: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #962: 1025it [00:02, 489.45it/s, env_step=985088, len=34, n/ep=2, n/st=64, player_1/loss=228.334, player_2/loss=2071.853, rew=614.50]                                                                                                 


Epoch #962: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #963: 1025it [00:02, 484.55it/s, env_step=986112, len=26, n/ep=2, n/st=64, player_1/loss=253.798, player_2/loss=1824.961, rew=390.50]                                                                                                 


Epoch #963: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #964: 1025it [00:02, 485.15it/s, env_step=987136, len=33, n/ep=2, n/st=64, player_1/loss=344.139, player_2/loss=1652.746, rew=572.50]                                                                                                 


Epoch #964: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #965: 1025it [00:02, 484.76it/s, env_step=988160, len=28, n/ep=3, n/st=64, player_1/loss=270.814, player_2/loss=1749.656, rew=462.00]                                                                                                 


Epoch #965: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #966: 1025it [00:02, 486.02it/s, env_step=989184, len=25, n/ep=2, n/st=64, player_1/loss=199.885, player_2/loss=1658.359, rew=352.00]                                                                                                 


Epoch #966: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #967: 1025it [00:02, 485.42it/s, env_step=990208, len=35, n/ep=2, n/st=64, player_1/loss=220.084, player_2/loss=1853.716, rew=633.50]                                                                                                 


Epoch #967: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #968: 1025it [00:02, 486.42it/s, env_step=991232, len=33, n/ep=2, n/st=64, player_1/loss=386.663, player_2/loss=2121.557, rew=583.00]                                                                                                 


Epoch #968: test_reward: 152.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #969: 1025it [00:02, 486.13it/s, env_step=992256, len=29, n/ep=2, n/st=64, player_1/loss=598.309, player_2/loss=2497.687, rew=436.00]                                                                                                 


Epoch #969: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #970: 1025it [00:02, 485.91it/s, env_step=993280, len=36, n/ep=2, n/st=64, player_1/loss=607.554, player_2/loss=1650.486, rew=783.00]                                                                                                 


Epoch #970: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #971: 1025it [00:02, 486.67it/s, env_step=994304, len=28, n/ep=2, n/st=64, player_1/loss=462.513, player_2/loss=771.867, rew=419.50]                                                                                                  


Epoch #971: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #972: 1025it [00:02, 483.35it/s, env_step=995328, len=39, n/ep=1, n/st=64, player_1/loss=215.855, player_2/loss=2100.073, rew=779.00]                                                                                                 


Epoch #972: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #973: 1025it [00:02, 486.57it/s, env_step=996352, len=30, n/ep=2, n/st=64, player_1/loss=105.545, player_2/loss=1763.978, rew=479.50]                                                                                                 


Epoch #973: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #974: 1025it [00:02, 486.12it/s, env_step=997376, len=35, n/ep=1, n/st=64, player_1/loss=417.603, player_2/loss=2373.353, rew=629.00]                                                                                                 


Epoch #974: test_reward: 594.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #975: 1025it [00:02, 485.95it/s, env_step=998400, len=32, n/ep=2, n/st=64, player_1/loss=446.968, player_2/loss=2665.969, rew=527.50]                                                                                                 


Epoch #975: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #976: 1025it [00:02, 486.87it/s, env_step=999424, len=19, n/ep=2, n/st=64, player_1/loss=142.030, player_2/loss=1721.106, rew=209.00]                                                                                                 


Epoch #976: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #977: 1025it [00:02, 486.24it/s, env_step=1000448, len=38, n/ep=2, n/st=64, player_1/loss=235.238, player_2/loss=2007.499, rew=740.50]                                                                                                


Epoch #977: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #978: 1025it [00:02, 484.94it/s, env_step=1001472, len=29, n/ep=2, n/st=64, player_1/loss=232.782, player_2/loss=1686.990, rew=459.00]                                                                                                


Epoch #978: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #979: 1025it [00:02, 487.18it/s, env_step=1002496, len=30, n/ep=2, n/st=64, player_1/loss=93.058, player_2/loss=1180.373, rew=500.50]                                                                                                 


Epoch #979: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #980: 1025it [00:02, 484.51it/s, env_step=1003520, len=17, n/ep=4, n/st=64, player_1/loss=93.448, player_2/loss=1791.916, rew=167.25]                                                                                                 


Epoch #980: test_reward: 54.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #981: 1025it [00:02, 484.68it/s, env_step=1004544, len=32, n/ep=2, n/st=64, player_1/loss=100.023, player_2/loss=1684.857, rew=553.50]                                                                                                


Epoch #981: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #982: 1025it [00:02, 486.30it/s, env_step=1005568, len=27, n/ep=3, n/st=64, player_1/loss=214.280, player_2/loss=1419.863, rew=421.33]                                                                                                


Epoch #982: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #983: 1025it [00:02, 483.54it/s, env_step=1006592, len=31, n/ep=2, n/st=64, player_1/loss=245.639, player_2/loss=1721.945, rew=513.00]                                                                                                


Epoch #983: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #984: 1025it [00:02, 486.50it/s, env_step=1007616, len=27, n/ep=2, n/st=64, player_1/loss=525.291, player_2/loss=1521.458, rew=381.50]                                                                                                


Epoch #984: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #985: 1025it [00:02, 483.10it/s, env_step=1008640, len=26, n/ep=2, n/st=64, player_1/loss=591.097, player_2/loss=1614.454, rew=363.50]                                                                                                


Epoch #985: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #986: 1025it [00:02, 484.89it/s, env_step=1009664, len=26, n/ep=2, n/st=64, player_1/loss=369.390, player_2/loss=1963.239, rew=364.50]                                                                                                


Epoch #986: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #987: 1025it [00:02, 480.52it/s, env_step=1010688, len=20, n/ep=4, n/st=64, player_1/loss=292.119, player_2/loss=1964.263, rew=262.00]                                                                                                


Epoch #987: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #988: 1025it [00:02, 488.30it/s, env_step=1011712, len=20, n/ep=2, n/st=64, player_1/loss=368.758, player_2/loss=1851.350, rew=310.50]                                                                                                


Epoch #988: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #989: 1025it [00:02, 487.64it/s, env_step=1012736, len=32, n/ep=2, n/st=64, player_1/loss=468.534, player_2/loss=1931.578, rew=688.50]                                                                                                


Epoch #989: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #990: 1025it [00:02, 490.30it/s, env_step=1013760, len=36, n/ep=2, n/st=64, player_1/loss=529.125, player_2/loss=2179.770, rew=683.50]                                                                                                


Epoch #990: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #991: 1025it [00:02, 483.50it/s, env_step=1014784, len=28, n/ep=2, n/st=64, player_1/loss=305.765, player_2/loss=2588.321, rew=465.50]                                                                                                


Epoch #991: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #992: 1025it [00:02, 486.65it/s, env_step=1015808, len=35, n/ep=1, n/st=64, player_1/loss=146.317, player_2/loss=2265.250, rew=629.00]                                                                                                


Epoch #992: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #993: 1025it [00:02, 485.78it/s, env_step=1016832, len=34, n/ep=2, n/st=64, player_1/loss=191.432, player_2/loss=1919.339, rew=739.50]                                                                                                


Epoch #993: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #994: 1025it [00:02, 485.52it/s, env_step=1017856, len=36, n/ep=2, n/st=64, player_1/loss=173.222, player_2/loss=1251.593, rew=783.00]                                                                                                


Epoch #994: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #995: 1025it [00:02, 484.07it/s, env_step=1018880, len=33, n/ep=2, n/st=64, player_1/loss=307.165, player_2/loss=619.052, rew=578.00]                                                                                                 


Epoch #995: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #996: 1025it [00:02, 483.17it/s, env_step=1019904, len=28, n/ep=3, n/st=64, player_1/loss=344.827, player_2/loss=969.689, rew=432.00]                                                                                                 


Epoch #996: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #997: 1025it [00:02, 485.39it/s, env_step=1020928, len=27, n/ep=3, n/st=64, player_1/loss=300.377, rew=427.67]                                                                                                                        


Epoch #997: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #998: 1025it [00:02, 489.63it/s, env_step=1021952, len=34, n/ep=2, n/st=64, player_1/loss=314.217, player_2/loss=1549.383, rew=617.50]                                                                                                


Epoch #998: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


Epoch #999: 1025it [00:02, 485.32it/s, env_step=1022976, len=30, n/ep=2, n/st=64, player_1/loss=261.252, player_2/loss=1287.074, rew=464.00]                                                                                                


Epoch #999: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #17


In [12]:
####################################################
# EXPERIMENT: VIEWING THE BEST LEARNED POLICY
####################################################

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure the best agent
best_agent1 = cf_custom_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape)
best_agent1.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/1-cnn_dqn_frozen_agent1/best_policy_agent1.pth"))
best_agent1.set_eps(0)


best_agent2 = cf_custom_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape)
best_agent2.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/1-cnn_dqn_frozen_agent1/best_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= best_agent1,
      agent_player2= best_agent2)



Average steps of game:  38.333333333333336
Final mean reward agent 1: 378.3333333333333, std: 16.996731711975947
Final mean reward agent 2: 375.3333333333333, std: 44.21412544525662


In [14]:
####################################################
# EXPERIMENT: VIEWING THE LAST LEARNED POLICY
####################################################

# Configure the final agent
final_agent_player1 = cf_custom_dqn_policy(state_shape= state_shape,
                                           action_shape= action_shape)
final_agent_player1.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/1-cnn_dqn_frozen_agent1/final_policy_agent1.pth"))
best_agent1.set_eps(0)

final_agent_player2 = cf_custom_dqn_policy(state_shape= state_shape,
                                           action_shape= action_shape)
final_agent_player2.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/1-cnn_dqn_frozen_agent1/final_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= final_agent_player1,
      agent_player2= final_agent_player2)



Average steps of game:  35.666666666666664
Final mean reward agent 1: 329.0, std: 59.262129560116215
Final mean reward agent 2: 327.3333333333333, std: 43.484352230301056


<hr><hr>

## Discussion

We see that the agent can learn quickly to win against a fixed strategy oponent but the overall performance of the agent is still weak, making human play of very poor quality once again.

In [None]:
####################################################
# CLEAN VARIABLES
####################################################

del action_shape
del agent1
del agent2
del best_agent1
del best_agent2
del env
del final_agent_player1
del final_agent_player2
del observation_space
del off_policy_traininer_results
del state_shape
