# MLP based DQN agent against fixed oponent

In the previous notebook, `7-cnn-dqn-fixed-oponent.ipynb`, we used the CNN based model for training through an iteration of alternating frozen agents.
We found this to give interesting but not fully statisfactory results.
We will now use the same technique for the custom MLP based approach designed in `5-improving-dqn-architecture.ipynb` to properly compare both architectures performance for the agents.

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training two DQN agents on connect four Gym
  - Building the environment
  - Implementing the DQN policy
  - Building agents
  - Function for letting agents learn
  - Function for watching learned agent
  - Doing the experiment
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Argparser
import argparse

# More data types
import typing
import numpy as np

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
importlib.invalidate_caches()
importlib.reload(cfgym)

# Time for allowing "freezes" in execution
import time;

# Allow for copying objects in a non reference manner
import copy

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0
Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [4]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device
print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

# Show cuda device name
print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training two DQN agents on connect four Gym

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, agents will always play as the same player, e.g. always as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Building the environment

This code is taken from previous notebooks.
We don't allow invalid moves to make the problem easier for now.

In [5]:
####################################################
# CONNECT FOUR V2 ENVIRONMENT
####################################################

def get_env():
    """
    Returns the connect four gym environment V2 altered for Tianshou and Petting Zoo compatibility.
    Already wrapped with a ts.env.PettingZooEnv wrapper.
    """
    return ts.env.PettingZooEnv(cfgym.env(reward_move= 1, # Set to 1 for reward to make moves (incentivise longer games)
                                          reward_invalid= -3,
                                          reward_draw= 100,
                                          reward_win= 25,
                                          reward_loss= -25,
                                          allow_invalid_move= False))
    
    
# Test the environment
env = get_env()
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation
del env

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


<hr>

### Implementing the DQN policy

We use the strategy created in `5-improving-dqn-architecture.ipynb`.

In [6]:
####################################################
# DQN ARCHITECTURE
####################################################

class CustomDQN(torch.nn.Module):
    """
    Custom DQN using a model based on CNN
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',):
        # Parent call
        super().__init__()
        
        # Save device (e.g. cuda)
        self.device = device
        
        self.model = torch.nn.Sequential(
            torch.nn.Linear(np.prod(state_shape), 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, np.prod(action_shape)),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
        batch = obs.shape[0]
        logits = self.model(obs.view(batch, -1))
        return logits, state


In [7]:
####################################################
# DQN POLICY
####################################################

def cf_custom_dqn_policy(state_shape: tuple,
                         action_shape: tuple,
                         optim: typing.Optional[torch.optim.Optimizer] = None,
                         learning_rate: float =  0.0001,
                         gamma: float = 0.9, # Smaller gamma favours "faster" win
                         n_step: int = 4, # Number of steps to look ahead
                         frozen: bool = False,
                         target_update_freq: int = 320):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    net = CustomDQN(state_shape, action_shape, device= device).to(device)
    
    # Default optimizer is an adam optimizer with the argparser learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # If we are frozen, we use an optimizer that has learning rate 0
    if frozen:
        optim = torch.optim.SGD(net.parameters(), lr= 0)
        
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)

<hr>

### Building agents

This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [8]:
####################################################
# AGENT CREATION
####################################################

def get_agents(agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
               agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
               optim: typing.Optional[torch.optim.Optimizer] = None,
               resume_path_player_1: str = '', # Path to file to resume agent training from
               resume_path_player_2: str = '', 
               agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
               agent_player2_frozen: bool = False,
               ) -> typing.Tuple[ts.policy.BasePolicy, torch.optim.Optimizer, list]:
    """
    Gets a multi agent policy manager, optimizer and player ids for the connect four V2 gym environment.
    Per default this returns 
        - Multi agent manager for 2 agents using DQN
        - Adam optimizer
        - ['player_1', 'player_2'] from the connect four environment
    """
    
    # Get the environment to play in (Connect four gym V2)
    env = get_env()
    
    # Get the observation space from the environment, depending on typo of space (ternary operator)
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    
    # Set the arguments
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent player 1 to be a DQN if no policy is passed.
    if agent_player1 is None:
        # Our agent1 uses a DQN policy
        agent_player1 = cf_custom_dqn_policy(state_shape= state_shape,
                                             action_shape= action_shape,
                                             optim= optim,
                                             frozen= agent_player1_frozen)
                
        # If we resume our agent we need to load the previous config
        if resume_path_player_1:
            agent_player1.load_state_dict(torch.load(resume_path_player_1))
            
    
    # Configure agent player 2 to be a DQN if no policy is passed.
    if agent_player2 is None:
        # Our agent1 uses a DQN policy
        agent_player2 = cf_custom_dqn_policy(state_shape= state_shape,
                                             action_shape= action_shape,
                                             optim= optim,
                                             frozen= agent_player2_frozen)
        
                
        # If we resume our agent we need to load the previous config
        if resume_path_player_2:
            agent_player2.load_state_dict(torch.load(resume_path_player_2))

    # Both our agents are DQN agents by default
    agents = [agent_player1, agent_player2]
        
    # Our policy depends on the order of the agents
    policy = ts.policy.MultiAgentPolicyManager(agents, env)
    
    # Return our policy, optimizer and the available agents in the environment
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    
    return policy, optim, env.agents

<hr>

### Function for letting agents learn

This is identical to the previous notebook.

In [9]:
####################################################
# AGENT TRAINING
####################################################

def train_agent(filename: str = "dqn_vs_dqn_cnn_based",
                agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
                agent_player2_frozen: bool = False,
                single_agent_score_as_reward: bool= False, # Uses non frozen agent's score as reward
                optim: typing.Optional[torch.optim.Optimizer] = None,
                training_env_num: int = 1,
                testing_env_num: int = 1,
                buffer_size: int = 2^14,
                batch_size: int = 1, 
                epochs: int = 50, #50
                step_per_epoch: int = 1024, #1024
                step_per_collect: int = 64, # transition before update
                update_per_step: float = 0.1,
                testing_eps: float = 0.05,
                training_eps: float = 0.1,
                ) -> typing.Tuple[dict, ts.policy.BasePolicy]:
    """
    Trains two agents in the connect four V2 environment and saves their best model and logs.
    Returns:
        - result from offpolicy_trainer
        - final version of agent 1
        - final version of agent 2
    """

    # ======== notebook specific =========
    notebook_version = '8' # Used for foldering logs and models

    # ======== environment setup =========
    train_envs = ts.env.DummyVectorEnv([get_env for _ in range(training_env_num)])
    test_envs = ts.env.DummyVectorEnv([get_env for _ in range(testing_env_num)])
    
    # set the seed for reproducibility
    np.random.seed(1998)
    torch.manual_seed(1998)
    train_envs.seed(1998)
    test_envs.seed(1998)

    # ======== agent setup =========
    # Gets our agents from the previously made function
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    policy, optim, agents = get_agents(agent_player1=agent_player1,
                                       agent_player2=agent_player2,
                                       agent_player1_frozen= agent_player1_frozen,
                                       agent_player2_frozen= agent_player2_frozen,
                                       optim=optim)

    # ======== collector setup =========
    # Make a collector for the training environments
    train_collector = ts.data.Collector(policy= policy,
                                        env= train_envs,
                                        buffer= ts.data.VectorReplayBuffer(buffer_size, len(train_envs)),
                                        exploration_noise= True)
    
    # Make a collector for the testing environments
    test_collector = ts.data.Collector(policy= policy,
                                       env= test_envs,
                                       buffer= ts.data.VectorReplayBuffer(buffer_size, len(test_envs)),
                                       exploration_noise= True)
    
    # Uncomment below if you want to set epsilon in epsilon policy
    # policy.set_eps(1)
    
    # Collect data fot the training evnironments
    train_collector.collect(n_step= batch_size * training_env_num)
    
    # ======== ensure folders exist =========
    if not os.path.exists(os.path.join('./logs', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./logs', 'paper_notebooks', notebook_version, filename))
    if not os.path.exists(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename))

    # ======== tensorboard logging setup =========
    # Allows to save the training progress to tensorboard compatable logs
    log_path = os.path.join('./logs', 'paper_notebooks', notebook_version, filename)
    writer = torch.utils.tensorboard.SummaryWriter(log_path)
    logger = ts.utils.TensorboardLogger(writer)

    # ======== callback functions used during training =========
    # We want to save our best policy
    def save_best_fn(policy):
        """
        Callback to save the best model
        """
        # Save best agent 1
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent1.pth')
        torch.save(policy.policies[agents[0]].state_dict(), model_save_path)
        
        # Save best agent 2
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent2.pth')
        torch.save(policy.policies[agents[1]].state_dict(), model_save_path)
        
        # Save agent2

    def stop_fn(mean_rewards):
        """
        Callback to stop training when we've reached the win rate
        """
        return mean_rewards >= 7 # (win = 10, 70% win without invalid moves = mean of 7)

    def train_fn(epoch, env_step):
        """
        Callback before training
        """        
        # Before training we want to configure the epsilon for the agents
        # In general more exploratory than the test case
        policy.policies[agents[0]].set_eps(training_eps)
        policy.policies[agents[1]].set_eps(training_eps)

    def test_fn(epoch, env_step):
        """
        Callback beore testing
        """        
        # Before testing we want to configure the epsilon for the agents
        # In general more greedy than the train case but not
        #   to avoid getting stuck on invalid moves
        policy.policies[agents[0]].set_eps(testing_eps)
        policy.policies[agents[1]].set_eps(testing_eps)

    def reward_metric(rews):
        """
        Callback for reward collection
        """        
        if agent_player2_frozen and single_agent_score_as_reward:
            # agent 2 frozen, optimizing for agent 1
            return rews[:, 0]
        
        if agent_player1_frozen and single_agent_score_as_reward:
            # agent 1 frozen, optimizing for agent 2
            return rews[:, 1]
        
        # Per default we are interested in optimizing both agents
        return rews[:, 0] + rews[:, 1]
    
            

    # trainer
    result = ts.trainer.offpolicy_trainer(policy= policy,
                                          train_collector= train_collector,
                                          test_collector= test_collector,
                                          max_epoch= epochs,
                                          step_per_epoch= step_per_epoch,
                                          step_per_collect= step_per_collect,
                                          episode_per_test= testing_env_num,
                                          batch_size= batch_size,
                                          train_fn= train_fn,
                                          test_fn= test_fn,
                                          # Stop function to stop before specified amount of epochs
                                          #stop_fn= stop_fn
                                          save_best_fn= save_best_fn,
                                          update_per_step= update_per_step,
                                          logger= logger,
                                          test_in_train= False,
                                          reward_metric= reward_metric)
    
    # Save final agent 1
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent1.pth')
    torch.save(policy.policies[agents[0]].state_dict(), model_save_path)

    # Save final agent 2
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent2.pth')
    torch.save(policy.policies[agents[1]].state_dict(), model_save_path)

    return result, policy.policies[agents[0]], policy.policies[agents[1]]

<hr>

### Function for watching learned agent

Identical to the previous notebook.

In [10]:
####################################################
# WATCHING THE LEARNED POLICY IN ACTION
####################################################

def watch(numer_of_games: int = 3,
          agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
          agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
          test_epsilon: float = 0.05, # For the watching we act completely greedy but low random for not getting stuck on invalid move
          render_speed: float = 0.15, # Amount of seconds to update frame/ do a step
          ) -> None:
    
    # Get the connect four V2 environment (must be a list)
    env= ts.env.DummyVectorEnv([get_env])
    
    # Get the agents from the trained agents
    policy, optim, agents = get_agents(agent_player1= agent_player1,
                                       agent_player2= agent_player2)
    
    # Evaluate the policy
    policy.eval()
    
    # Set the testing policy epsilon for our agents
    policy.policies[agents[0]].set_eps(test_epsilon)
    policy.policies[agents[1]].set_eps(test_epsilon)
    
    # Collect the test data
    collector = ts.data.Collector(policy= policy,
                                  env= env,
                                  exploration_noise= True)
    
    # Render games in human mode to see how it plays
    result = collector.collect(n_episode= numer_of_games, render= render_speed)
    
    # Close the environment aftering collecting the results
    # This closes the pygame window after completion
    env.close()
    
    # Get the rewards and length from the test trials
    rewards, length = result["rews"], result["lens"]
    
    # Print the final reward for the first agent
    print(f"Average steps of game:  {length.mean()}")
    print(f"Final mean reward agent 1: {rewards[:, 0].mean()}, std: {rewards[:, 0].std()}")
    print(f"Final mean reward agent 2: {rewards[:, 1].mean()}, std: {rewards[:, 1].std()}")

<hr>

### Doing the experiment

We now do the experiment with using our previously created functions.
We freeze one agent and initialize both agents from previous versions.

The following iterations were made:

1. Freeze agent 1, train agent 2:
    - Model save name: `1-mlp_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `1` with test reward `1102`
    - Scoring: sum of `both` agent's score
2. Freeze agent 2, train agent 1:
    - Model save name: `2-mlp_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/1-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `482` with test reward `1102`
    - Scoring: sum of `both` agent's score

After which the agent was so focused on prolonging the game, we decided to lower the learning rate and start optimizing for winning again. We also lowered the amount of epochs in each iterations of swapping the frozen agent.

3. Freeze agent 1, train agent 2:
    - Model save name: `3-mlp_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/1-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.00005` # halfed learning rate
    - Training epsilon: `0.1` # halfed training epsilon
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: `XXX` with test reward `YYY`
    - Scoring: reward of `agent 2`
4. Freeze agent 2, train agent 1:
    - Model save name: `4-mlp_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/2-cnn_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/3-cnn_dqn_frozen_agent1/best_policy_agent2.pth`
    - Learning rate: `0.00005`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: `XXX` with test reward `YYY`
    - Scoring: reward of `agent 1`
    
To do further training, a loop was created which alternated between freezing agens every 50 epochs. This loop was executed 20 times. The learning rate was also lowered once again.

5. Loop frozen agents:
    - Model save name: `5-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/4-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/3-mlp_dqn_frozen_agent1/best_policy_agent2.pth`
    - Learning rate: `0.000001`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `50` x `20` loops 
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
6. Loop frozen agents:
    - Model save name: `6-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/5-looping-iteration-19/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/5-looping-iteration-19/best_policy_agent2.pth`
    - Learning rate: `0.000003`
    - Training epsilon: `0.1`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `100` loops 
    - Gamma: `0.9` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
7. Loop frozen agents:
    - Model save name: `7-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/6-looping-iteration-99/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/6-looping-iteration-99/best_policy_agent2.pth`
    - Learning rate: `0.001`
    - Training epsilon: `0.05`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `500` loops 
    - Gamma: `0.9` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`

For file size reasons, only a portion of the saved agents are kept and stored on GitHub.


In [21]:
####################################################
# EXPERIMENT: TRAINING AGENTS
####################################################

# Configs for the agents
freeze_agent1 = False
agent1_starting_params = "./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth"

freeze_agent2 = True
agent2_starting_params = "./saved_variables/paper_notebooks/8/1-mlp_dqn_frozen_agent1/final_policy_agent2.pth"

single_agent_score_as_reward = False # To use combined reward or non frozen agent reward as scoring
filename = "2-mlp_dqn_frozen_agent2"
epochs = 1000
loops = 1

learning_rate = 0.0001
training_eps = 0.2
gamma = 0.9
n_step = 4

for loop_idx in range(loops):
    # Filename
    #filename = f"7-20epoch_500loop/7-looping-iteration-{loop_idx}"
    
    # Use provided starting params in first loop, the one from previous iteration in next
    #if loop_idx > 0:
    #    agent1_starting_params = f"./saved_variables/paper_notebooks/7/7-20epoch_500loop/7-looping-iteration-{loop_idx-1}/final_policy_agent1.pth"
    #    agent2_starting_params = f"./saved_variables/paper_notebooks/7/7-20epoch_500loop/7-looping-iteration-{loop_idx-1}/final_policy_agent2.pth"
    
    # Determine what agent to freeze
    #freeze_agent1 = True if loop_idx % 2 == 1 else False
    #freeze_agent2 = True if loop_idx % 2 == 0 else False
    
    # Get the environment settings
    env = get_env()
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent 1
    agent1 = cf_custom_dqn_policy(state_shape= state_shape,
                                  action_shape= action_shape,
                                  gamma= gamma,
                                  frozen= freeze_agent1,
                                  learning_rate = learning_rate,
                                  n_step= n_step)
    
    if agent1_starting_params:
        agent1.load_state_dict(torch.load(agent1_starting_params))
        
        # Configure agent 2
        agent2 = cf_custom_dqn_policy(state_shape= state_shape,
                                      action_shape= action_shape,
                                      gamma= gamma,
                                      frozen= freeze_agent2,
                                      learning_rate = learning_rate,
                                      n_step= n_step)
        
        if agent2_starting_params:
            agent2.load_state_dict(torch.load(agent2_starting_params))
            
            
            # Train the agent
            off_policy_traininer_results, final_agent_player1, final_agent_player2 = train_agent(epochs= epochs,
                                                                                                 agent_player1= agent1,
                                                                                                 agent_player1_frozen = freeze_agent1,
                                                                                                 agent_player2= agent2,
                                                                                                 agent_player2_frozen = freeze_agent2,
                                                                                                 filename= filename,
                                                                                                 single_agent_score_as_reward = single_agent_score_as_reward,
                                                                                                 training_eps= training_eps)
            
            

Epoch #1: 1025it [00:02, 403.38it/s, env_step=1024, len=27, n/ep=2, n/st=64, player_1/loss=719.967, player_2/loss=302.765, rew=394.00]                                                                                                                                                                                      


Epoch #1: test_reward: 702.000000 ± 0.000000, best_reward: 702.000000 ± 0.000000 in #1


Epoch #2: 1025it [00:02, 468.71it/s, env_step=2048, len=32, n/ep=2, n/st=64, player_1/loss=868.773, player_2/loss=256.675, rew=551.50]                                                                                                                                                                                      


Epoch #2: test_reward: 377.000000 ± 0.000000, best_reward: 702.000000 ± 0.000000 in #1


Epoch #3: 1025it [00:02, 433.74it/s, env_step=3072, len=34, n/ep=2, n/st=64, player_1/loss=962.933, player_2/loss=191.449, rew=596.00]                                                                                                                                                                                      


Epoch #3: test_reward: 464.000000 ± 0.000000, best_reward: 702.000000 ± 0.000000 in #1


Epoch #4: 1025it [00:02, 470.63it/s, env_step=4096, len=28, n/ep=3, n/st=64, player_1/loss=1111.502, player_2/loss=223.862, rew=417.33]                                                                                                                                                                                     


Epoch #4: test_reward: 560.000000 ± 0.000000, best_reward: 702.000000 ± 0.000000 in #1


Epoch #5: 1025it [00:02, 470.42it/s, env_step=5120, len=27, n/ep=2, n/st=64, player_1/loss=955.145, player_2/loss=244.195, rew=437.50]                                                                                                                                                                                      


Epoch #5: test_reward: 740.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #6: 1025it [00:02, 461.09it/s, env_step=6144, len=29, n/ep=2, n/st=64, player_1/loss=580.834, player_2/loss=291.272, rew=449.00]                                                                                                                                                                                      


Epoch #6: test_reward: 324.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #7: 1025it [00:02, 445.86it/s, env_step=7168, len=27, n/ep=2, n/st=64, player_1/loss=636.476, player_2/loss=374.353, rew=377.00]                                                                                                                                                                                      


Epoch #7: test_reward: 324.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #8: 1025it [00:02, 463.59it/s, env_step=8192, len=33, n/ep=2, n/st=64, player_1/loss=487.458, player_2/loss=494.931, rew=568.00]                                                                                                                                                                                      


Epoch #8: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #9: 1025it [00:02, 452.98it/s, env_step=9216, len=34, n/ep=2, n/st=64, player_1/loss=302.927, player_2/loss=348.577, rew=606.50]                                                                                                                                                                                      


Epoch #9: test_reward: 252.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #10: 1025it [00:02, 457.55it/s, env_step=10240, len=26, n/ep=2, n/st=64, player_1/loss=377.901, player_2/loss=224.517, rew=391.50]                                                                                                                                                                                    


Epoch #10: test_reward: 405.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #11: 1025it [00:02, 460.82it/s, env_step=11264, len=37, n/ep=2, n/st=64, player_1/loss=440.214, player_2/loss=315.899, rew=814.50]                                                                                                                                                                                    


Epoch #11: test_reward: 464.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #12: 1025it [00:02, 448.41it/s, env_step=12288, len=32, n/ep=2, n/st=64, player_1/loss=729.531, player_2/loss=404.135, rew=544.50]                                                                                                                                                                                    


Epoch #12: test_reward: 434.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #13: 1025it [00:02, 465.47it/s, env_step=13312, len=28, n/ep=2, n/st=64, player_1/loss=644.488, player_2/loss=436.000, rew=405.00]                                                                                                                                                                                    


Epoch #13: test_reward: 405.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #14: 1025it [00:02, 456.77it/s, env_step=14336, len=28, n/ep=3, n/st=64, player_1/loss=356.654, player_2/loss=599.372, rew=432.33]                                                                                                                                                                                    


Epoch #14: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #15: 1025it [00:02, 436.12it/s, env_step=15360, len=39, n/ep=1, n/st=64, player_1/loss=726.996, player_2/loss=603.813, rew=779.00]                                                                                                                                                                                    


Epoch #15: test_reward: 324.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #16: 1025it [00:02, 452.73it/s, env_step=16384, len=24, n/ep=2, n/st=64, player_1/loss=765.428, player_2/loss=408.753, rew=311.50]                                                                                                                                                                                    


Epoch #16: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #17: 1025it [00:02, 455.57it/s, env_step=17408, len=37, n/ep=2, n/st=64, player_1/loss=512.376, player_2/loss=190.768, rew=722.00]                                                                                                                                                                                    


Epoch #17: test_reward: 324.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #18: 1025it [00:02, 467.69it/s, env_step=18432, len=35, n/ep=2, n/st=64, player_1/loss=562.079, player_2/loss=113.064, rew=650.00]                                                                                                                                                                                    


Epoch #18: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #19: 1025it [00:02, 458.25it/s, env_step=19456, len=27, n/ep=2, n/st=64, player_1/loss=428.199, player_2/loss=244.858, rew=391.00]                                                                                                                                                                                    


Epoch #19: test_reward: 405.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #20: 1025it [00:02, 431.07it/s, env_step=20480, len=30, n/ep=2, n/st=64, player_1/loss=293.117, player_2/loss=267.798, rew=479.50]                                                                                                                                                                                    


Epoch #20: test_reward: 324.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #21: 1025it [00:02, 437.73it/s, env_step=21504, len=33, n/ep=2, n/st=64, player_2/loss=233.640, rew=568.00]                                                                                                                                                                                                           


Epoch #21: test_reward: 702.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #22: 1025it [00:02, 495.87it/s, env_step=22528, len=30, n/ep=2, n/st=64, player_1/loss=295.658, player_2/loss=265.445, rew=507.50]                                                                                                                                                                                    


Epoch #22: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #23: 1025it [00:02, 488.63it/s, env_step=23552, len=28, n/ep=2, n/st=64, player_1/loss=338.977, player_2/loss=300.350, rew=434.50]                                                                                                                                                                                    


Epoch #23: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #24: 1025it [00:02, 485.65it/s, env_step=24576, len=30, n/ep=2, n/st=64, player_1/loss=267.482, player_2/loss=330.980, rew=464.00]                                                                                                                                                                                    


Epoch #24: test_reward: 464.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #25: 1025it [00:02, 473.03it/s, env_step=25600, len=28, n/ep=2, n/st=64, player_1/loss=614.153, player_2/loss=451.340, rew=405.50]                                                                                                                                                                                    


Epoch #25: test_reward: 104.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #26: 1025it [00:02, 484.94it/s, env_step=26624, len=28, n/ep=2, n/st=64, player_1/loss=624.202, player_2/loss=401.072, rew=407.00]                                                                                                                                                                                    


Epoch #26: test_reward: 324.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #27: 1025it [00:02, 485.21it/s, env_step=27648, len=28, n/ep=3, n/st=64, player_1/loss=242.947, player_2/loss=388.245, rew=424.67]                                                                                                                                                                                    


Epoch #27: test_reward: 405.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #28: 1025it [00:02, 494.70it/s, env_step=28672, len=28, n/ep=3, n/st=64, player_2/loss=506.314, rew=416.00]                                                                                                                                                                                                           


Epoch #28: test_reward: 405.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #29: 1025it [00:02, 412.78it/s, env_step=29696, len=28, n/ep=3, n/st=64, player_1/loss=273.794, player_2/loss=327.559, rew=448.00]                                                                                                                                                                                    


Epoch #29: test_reward: 350.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #30: 1025it [00:02, 473.24it/s, env_step=30720, len=30, n/ep=2, n/st=64, player_1/loss=407.311, player_2/loss=398.527, rew=482.50]                                                                                                                                                                                    


Epoch #30: test_reward: 527.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #31: 1025it [00:02, 461.44it/s, env_step=31744, len=27, n/ep=3, n/st=64, player_1/loss=386.968, player_2/loss=516.519, rew=392.00]                                                                                                                                                                                    


Epoch #31: test_reward: 324.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #32: 1025it [00:02, 485.37it/s, env_step=32768, len=29, n/ep=2, n/st=64, player_1/loss=463.842, player_2/loss=344.013, rew=452.00]                                                                                                                                                                                    


Epoch #32: test_reward: 405.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #33: 1025it [00:02, 496.73it/s, env_step=33792, len=26, n/ep=3, n/st=64, player_1/loss=624.742, player_2/loss=181.670, rew=357.00]                                                                                                                                                                                    


Epoch #33: test_reward: 230.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #34: 1025it [00:02, 479.20it/s, env_step=34816, len=32, n/ep=2, n/st=64, player_1/loss=316.879, player_2/loss=203.319, rew=558.50]                                                                                                                                                                                    


Epoch #34: test_reward: 527.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #35: 1025it [00:02, 501.17it/s, env_step=35840, len=34, n/ep=2, n/st=64, player_1/loss=169.811, player_2/loss=226.522, rew=626.50]                                                                                                                                                                                    


Epoch #35: test_reward: 702.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #36: 1025it [00:02, 501.11it/s, env_step=36864, len=32, n/ep=2, n/st=64, player_1/loss=240.061, player_2/loss=301.590, rew=558.50]                                                                                                                                                                                    


Epoch #36: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #37: 1025it [00:02, 500.19it/s, env_step=37888, len=36, n/ep=2, n/st=64, player_1/loss=297.982, player_2/loss=238.134, rew=783.00]                                                                                                                                                                                    


Epoch #37: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #38: 1025it [00:02, 501.12it/s, env_step=38912, len=37, n/ep=2, n/st=64, player_1/loss=204.514, player_2/loss=191.071, rew=721.00]                                                                                                                                                                                    


Epoch #38: test_reward: 527.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #39: 1025it [00:02, 497.57it/s, env_step=39936, len=39, n/ep=1, n/st=64, player_1/loss=188.768, player_2/loss=382.586, rew=779.00]                                                                                                                                                                                    


Epoch #39: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #40: 1025it [00:02, 498.81it/s, env_step=40960, len=28, n/ep=2, n/st=64, player_1/loss=354.794, player_2/loss=332.922, rew=447.50]                                                                                                                                                                                    


Epoch #40: test_reward: 740.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #41: 1025it [00:02, 496.40it/s, env_step=41984, len=28, n/ep=1, n/st=64, player_1/loss=730.996, player_2/loss=257.973, rew=405.00]                                                                                                                                                                                    


Epoch #41: test_reward: 324.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #42: 1025it [00:02, 496.07it/s, env_step=43008, len=18, n/ep=3, n/st=64, player_1/loss=552.092, player_2/loss=364.528, rew=179.33]                                                                                                                                                                                    


Epoch #42: test_reward: 230.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #43: 1025it [00:02, 487.81it/s, env_step=44032, len=30, n/ep=2, n/st=64, player_1/loss=405.987, player_2/loss=353.614, rew=482.50]                                                                                                                                                                                    


Epoch #43: test_reward: 495.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #44: 1025it [00:02, 464.89it/s, env_step=45056, len=29, n/ep=2, n/st=64, player_1/loss=266.026, player_2/loss=273.142, rew=434.50]                                                                                                                                                                                    


Epoch #44: test_reward: 434.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #45: 1025it [00:02, 447.05it/s, env_step=46080, len=30, n/ep=2, n/st=64, player_1/loss=236.354, player_2/loss=365.991, rew=476.50]                                                                                                                                                                                    


Epoch #45: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #46: 1025it [00:02, 431.11it/s, env_step=47104, len=33, n/ep=2, n/st=64, player_1/loss=271.360, player_2/loss=452.464, rew=568.00]                                                                                                                                                                                    


Epoch #46: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #47: 1025it [00:02, 420.51it/s, env_step=48128, len=34, n/ep=2, n/st=64, player_1/loss=670.273, player_2/loss=711.686, rew=606.50]                                                                                                                                                                                    


Epoch #47: test_reward: 405.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #48: 1025it [00:02, 412.96it/s, env_step=49152, len=26, n/ep=2, n/st=64, player_1/loss=554.660, player_2/loss=483.115, rew=429.50]                                                                                                                                                                                    


Epoch #48: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #49: 1025it [00:02, 396.12it/s, env_step=50176, len=27, n/ep=3, n/st=64, player_1/loss=470.314, player_2/loss=521.681, rew=401.00]                                                                                                                                                                                    


Epoch #49: test_reward: 405.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #50: 1025it [00:02, 396.31it/s, env_step=51200, len=30, n/ep=2, n/st=64, player_1/loss=442.905, player_2/loss=537.930, rew=480.50]                                                                                                                                                                                    


Epoch #50: test_reward: 405.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #51: 1025it [00:02, 397.56it/s, env_step=52224, len=26, n/ep=2, n/st=64, player_2/loss=293.436, rew=350.50]                                                                                                                                                                                                           


Epoch #51: test_reward: 90.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #52: 1025it [00:02, 392.58it/s, env_step=53248, len=35, n/ep=2, n/st=64, player_1/loss=357.152, player_2/loss=281.016, rew=631.00]                                                                                                                                                                                    


Epoch #52: test_reward: 350.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #53: 1025it [00:02, 396.38it/s, env_step=54272, len=24, n/ep=2, n/st=64, player_1/loss=308.308, player_2/loss=236.508, rew=317.50]                                                                                                                                                                                    


Epoch #53: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #54: 1025it [00:02, 398.07it/s, env_step=55296, len=26, n/ep=3, n/st=64, player_1/loss=266.993, player_2/loss=281.758, rew=354.33]                                                                                                                                                                                    


Epoch #54: test_reward: 324.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #55: 1025it [00:02, 395.61it/s, env_step=56320, len=29, n/ep=2, n/st=64, player_1/loss=121.177, player_2/loss=284.274, rew=464.00]                                                                                                                                                                                    


Epoch #55: test_reward: 377.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #56: 1025it [00:02, 398.37it/s, env_step=57344, len=30, n/ep=2, n/st=64, player_1/loss=185.801, player_2/loss=447.152, rew=468.50]                                                                                                                                                                                    


Epoch #56: test_reward: 324.000000 ± 0.000000, best_reward: 740.000000 ± 0.000000 in #5


Epoch #57: 1025it [00:02, 396.22it/s, env_step=58368, len=25, n/ep=3, n/st=64, player_1/loss=407.387, player_2/loss=407.076, rew=326.33]                                                                                                                                                                                    


Epoch #57: test_reward: 779.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #58: 1025it [00:02, 395.17it/s, env_step=59392, len=27, n/ep=2, n/st=64, player_1/loss=307.237, player_2/loss=380.030, rew=377.50]                                                                                                                                                                                    


Epoch #58: test_reward: 779.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #59: 1025it [00:02, 396.48it/s, env_step=60416, len=34, n/ep=2, n/st=64, player_1/loss=316.799, player_2/loss=275.419, rew=614.50]                                                                                                                                                                                    


Epoch #59: test_reward: 405.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #60: 1025it [00:02, 396.68it/s, env_step=61440, len=26, n/ep=3, n/st=64, player_1/loss=306.911, player_2/loss=333.913, rew=373.67]                                                                                                                                                                                    


Epoch #60: test_reward: 405.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #61: 1025it [00:02, 395.74it/s, env_step=62464, len=31, n/ep=2, n/st=64, player_1/loss=263.855, player_2/loss=363.444, rew=556.00]                                                                                                                                                                                    


Epoch #61: test_reward: 740.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #62: 1025it [00:02, 396.35it/s, env_step=63488, len=32, n/ep=2, n/st=64, player_1/loss=282.811, player_2/loss=416.429, rew=688.50]                                                                                                                                                                                    


Epoch #62: test_reward: 779.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #63: 1025it [00:02, 397.85it/s, env_step=64512, len=34, n/ep=2, n/st=64, player_1/loss=467.827, player_2/loss=415.316, rew=621.50]                                                                                                                                                                                    


Epoch #63: test_reward: 405.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #64: 1025it [00:02, 394.80it/s, env_step=65536, len=39, n/ep=2, n/st=64, player_1/loss=529.032, player_2/loss=635.204, rew=883.50]                                                                                                                                                                                    


Epoch #64: test_reward: 464.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #65: 1025it [00:02, 395.02it/s, env_step=66560, len=36, n/ep=1, n/st=64, player_1/loss=199.838, player_2/loss=556.037, rew=665.00]                                                                                                                                                                                    


Epoch #65: test_reward: 377.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #66: 1025it [00:02, 395.82it/s, env_step=67584, len=34, n/ep=2, n/st=64, player_1/loss=303.663, player_2/loss=346.117, rew=632.50]                                                                                                                                                                                    


Epoch #66: test_reward: 405.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #67: 1025it [00:02, 395.67it/s, env_step=68608, len=30, n/ep=2, n/st=64, player_1/loss=506.748, player_2/loss=383.079, rew=485.50]                                                                                                                                                                                    


Epoch #67: test_reward: 405.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #68: 1025it [00:02, 398.85it/s, env_step=69632, len=33, n/ep=2, n/st=64, player_1/loss=396.869, player_2/loss=268.559, rew=592.00]                                                                                                                                                                                    


Epoch #68: test_reward: 275.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #69: 1025it [00:02, 390.34it/s, env_step=70656, len=26, n/ep=2, n/st=64, player_1/loss=121.929, player_2/loss=475.584, rew=363.50]                                                                                                                                                                                    


Epoch #69: test_reward: 594.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #70: 1025it [00:02, 407.91it/s, env_step=71680, len=24, n/ep=2, n/st=64, player_1/loss=331.624, player_2/loss=532.762, rew=311.50]                                                                                                                                                                                    


Epoch #70: test_reward: 377.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #71: 1025it [00:02, 418.52it/s, env_step=72704, len=29, n/ep=3, n/st=64, player_1/loss=386.338, player_2/loss=194.863, rew=464.00]                                                                                                                                                                                    


Epoch #71: test_reward: 702.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #72: 1025it [00:02, 407.22it/s, env_step=73728, len=33, n/ep=2, n/st=64, player_1/loss=436.185, player_2/loss=196.369, rew=578.00]                                                                                                                                                                                    


Epoch #72: test_reward: 405.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #73: 1025it [00:02, 398.57it/s, env_step=74752, len=36, n/ep=2, n/st=64, player_1/loss=475.350, player_2/loss=496.208, rew=686.50]                                                                                                                                                                                    


Epoch #73: test_reward: 527.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #74: 1025it [00:02, 397.19it/s, env_step=75776, len=35, n/ep=2, n/st=64, player_1/loss=370.828, player_2/loss=329.731, rew=653.50]                                                                                                                                                                                    


Epoch #74: test_reward: 405.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #75: 1025it [00:02, 397.82it/s, env_step=76800, len=21, n/ep=2, n/st=64, player_1/loss=323.072, player_2/loss=281.680, rew=251.00]                                                                                                                                                                                    


Epoch #75: test_reward: 464.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #76: 1025it [00:02, 395.41it/s, env_step=77824, len=14, n/ep=3, n/st=64, player_1/loss=334.363, player_2/loss=344.375, rew=105.00]                                                                                                                                                                                    


Epoch #76: test_reward: 90.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #77: 1025it [00:02, 395.86it/s, env_step=78848, len=20, n/ep=4, n/st=64, player_1/loss=624.948, player_2/loss=343.948, rew=209.25]                                                                                                                                                                                    


Epoch #77: test_reward: 189.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #78: 1025it [00:02, 393.77it/s, env_step=79872, len=32, n/ep=2, n/st=64, player_1/loss=597.174, player_2/loss=379.862, rew=558.50]                                                                                                                                                                                    


Epoch #78: test_reward: 377.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #79: 1025it [00:02, 397.22it/s, env_step=80896, len=33, n/ep=2, n/st=64, player_1/loss=420.190, player_2/loss=337.567, rew=578.00]                                                                                                                                                                                    


Epoch #79: test_reward: 377.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #80: 1025it [00:02, 396.63it/s, env_step=81920, len=26, n/ep=2, n/st=64, player_1/loss=255.296, player_2/loss=265.821, rew=429.50]                                                                                                                                                                                    


Epoch #80: test_reward: 350.000000 ± 0.000000, best_reward: 779.000000 ± 0.000000 in #57


Epoch #81: 1025it [00:02, 398.01it/s, env_step=82944, len=20, n/ep=3, n/st=64, player_1/loss=257.275, player_2/loss=361.924, rew=221.33]                                                                                                                                                                                    


Epoch #81: test_reward: 819.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #82: 1025it [00:02, 396.23it/s, env_step=83968, len=28, n/ep=2, n/st=64, player_1/loss=316.971, player_2/loss=270.383, rew=429.50]                                                                                                                                                                                    


Epoch #82: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #83: 1025it [00:02, 398.51it/s, env_step=84992, len=22, n/ep=2, n/st=64, player_1/loss=351.306, player_2/loss=206.652, rew=256.50]                                                                                                                                                                                    


Epoch #83: test_reward: 209.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #84: 1025it [00:02, 396.13it/s, env_step=86016, len=29, n/ep=2, n/st=64, player_1/loss=296.937, player_2/loss=426.582, rew=434.50]                                                                                                                                                                                    


Epoch #84: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #85: 1025it [00:02, 395.43it/s, env_step=87040, len=35, n/ep=2, n/st=64, player_1/loss=336.168, player_2/loss=366.353, rew=633.50]                                                                                                                                                                                    


Epoch #85: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #86: 1025it [00:02, 385.65it/s, env_step=88064, len=32, n/ep=2, n/st=64, player_1/loss=238.007, player_2/loss=201.290, rew=529.00]                                                                                                                                                                                    


Epoch #86: test_reward: 527.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #87: 1025it [00:02, 344.76it/s, env_step=89088, len=31, n/ep=2, n/st=64, player_1/loss=224.758, player_2/loss=374.148, rew=495.50]                                                                                                                                                                                    


Epoch #87: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #88: 1025it [00:02, 374.54it/s, env_step=90112, len=25, n/ep=3, n/st=64, player_1/loss=241.538, player_2/loss=383.338, rew=347.33]                                                                                                                                                                                    


Epoch #88: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #89: 1025it [00:02, 409.65it/s, env_step=91136, len=27, n/ep=2, n/st=64, player_1/loss=481.158, player_2/loss=447.893, rew=389.50]                                                                                                                                                                                    


Epoch #89: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #90: 1025it [00:02, 402.71it/s, env_step=92160, len=27, n/ep=3, n/st=64, player_1/loss=373.167, player_2/loss=341.406, rew=386.33]                                                                                                                                                                                    


Epoch #90: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #91: 1025it [00:02, 398.40it/s, env_step=93184, len=32, n/ep=2, n/st=64, player_1/loss=582.805, player_2/loss=308.099, rew=546.50]                                                                                                                                                                                    


Epoch #91: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #92: 1025it [00:02, 398.83it/s, env_step=94208, len=26, n/ep=3, n/st=64, player_2/loss=274.595, rew=362.00]                                                                                                                                                                                                           


Epoch #92: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #93: 1025it [00:02, 397.37it/s, env_step=95232, len=28, n/ep=2, n/st=64, player_1/loss=325.955, player_2/loss=375.683, rew=405.50]                                                                                                                                                                                    


Epoch #93: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #94: 1025it [00:02, 393.51it/s, env_step=96256, len=25, n/ep=3, n/st=64, player_1/loss=213.468, player_2/loss=412.371, rew=361.33]                                                                                                                                                                                    


Epoch #94: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #95: 1025it [00:02, 392.31it/s, env_step=97280, len=29, n/ep=2, n/st=64, player_1/loss=266.507, player_2/loss=250.600, rew=452.00]                                                                                                                                                                                    


Epoch #95: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #96: 1025it [00:02, 396.81it/s, env_step=98304, len=29, n/ep=2, n/st=64, player_1/loss=311.927, player_2/loss=387.007, rew=434.00]                                                                                                                                                                                    


Epoch #96: test_reward: 299.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #97: 1025it [00:02, 395.50it/s, env_step=99328, len=27, n/ep=3, n/st=64, player_1/loss=256.297, player_2/loss=543.629, rew=394.33]                                                                                                                                                                                    


Epoch #97: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #98: 1025it [00:02, 394.54it/s, env_step=100352, len=30, n/ep=3, n/st=64, player_1/loss=122.837, player_2/loss=534.527, rew=475.00]                                                                                                                                                                                   


Epoch #98: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #99: 1025it [00:02, 396.53it/s, env_step=101376, len=20, n/ep=3, n/st=64, player_1/loss=157.552, player_2/loss=289.415, rew=225.33]                                                                                                                                                                                   


Epoch #99: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #100: 1025it [00:02, 399.34it/s, env_step=102400, len=26, n/ep=2, n/st=64, player_1/loss=156.808, player_2/loss=338.657, rew=364.50]                                                                                                                                                                                  


Epoch #100: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #101: 1025it [00:02, 398.19it/s, env_step=103424, len=28, n/ep=2, n/st=64, player_1/loss=139.697, player_2/loss=381.145, rew=422.50]                                                                                                                                                                                  


Epoch #101: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #102: 1025it [00:02, 397.68it/s, env_step=104448, len=33, n/ep=2, n/st=64, player_1/loss=171.440, player_2/loss=304.572, rew=578.00]                                                                                                                                                                                  


Epoch #102: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #103: 1025it [00:02, 395.90it/s, env_step=105472, len=28, n/ep=2, n/st=64, player_1/loss=191.468, player_2/loss=206.666, rew=420.50]                                                                                                                                                                                  


Epoch #103: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #104: 1025it [00:02, 395.01it/s, env_step=106496, len=22, n/ep=2, n/st=64, player_1/loss=181.701, player_2/loss=198.871, rew=263.50]                                                                                                                                                                                  


Epoch #104: test_reward: 495.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #105: 1025it [00:02, 396.22it/s, env_step=107520, len=16, n/ep=5, n/st=64, player_1/loss=210.356, player_2/loss=240.606, rew=215.80]                                                                                                                                                                                  


Epoch #105: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #106: 1025it [00:02, 400.06it/s, env_step=108544, len=28, n/ep=2, n/st=64, player_1/loss=197.555, player_2/loss=179.959, rew=405.50]                                                                                                                                                                                  


Epoch #106: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #107: 1025it [00:02, 396.05it/s, env_step=109568, len=32, n/ep=2, n/st=64, player_1/loss=282.518, player_2/loss=343.574, rew=558.50]                                                                                                                                                                                  


Epoch #107: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #108: 1025it [00:02, 397.37it/s, env_step=110592, len=26, n/ep=2, n/st=64, player_1/loss=273.532, player_2/loss=356.092, rew=366.50]                                                                                                                                                                                  


Epoch #108: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #109: 1025it [00:02, 397.65it/s, env_step=111616, len=30, n/ep=2, n/st=64, player_1/loss=290.321, player_2/loss=396.878, rew=479.50]                                                                                                                                                                                  


Epoch #109: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #110: 1025it [00:02, 394.72it/s, env_step=112640, len=25, n/ep=2, n/st=64, player_1/loss=195.230, player_2/loss=251.402, rew=324.50]                                                                                                                                                                                  


Epoch #110: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #111: 1025it [00:02, 395.66it/s, env_step=113664, len=26, n/ep=3, n/st=64, player_1/loss=483.521, player_2/loss=310.243, rew=368.00]                                                                                                                                                                                  


Epoch #111: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #112: 1025it [00:02, 396.07it/s, env_step=114688, len=31, n/ep=2, n/st=64, player_1/loss=511.665, player_2/loss=474.056, rew=495.50]                                                                                                                                                                                  


Epoch #112: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #113: 1025it [00:02, 395.66it/s, env_step=115712, len=28, n/ep=2, n/st=64, player_1/loss=239.658, player_2/loss=273.302, rew=425.50]                                                                                                                                                                                  


Epoch #113: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #114: 1025it [00:02, 398.11it/s, env_step=116736, len=28, n/ep=2, n/st=64, player_1/loss=244.866, player_2/loss=503.228, rew=434.50]                                                                                                                                                                                  


Epoch #114: test_reward: 527.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #115: 1025it [00:02, 398.79it/s, env_step=117760, len=28, n/ep=3, n/st=64, player_1/loss=287.147, player_2/loss=593.161, rew=407.33]                                                                                                                                                                                  


Epoch #115: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #116: 1025it [00:02, 406.30it/s, env_step=118784, len=24, n/ep=3, n/st=64, player_1/loss=285.555, player_2/loss=466.939, rew=300.33]                                                                                                                                                                                  


Epoch #116: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #117: 1025it [00:02, 395.99it/s, env_step=119808, len=29, n/ep=2, n/st=64, player_1/loss=245.176, player_2/loss=267.205, rew=452.00]                                                                                                                                                                                  


Epoch #117: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #118: 1025it [00:02, 395.99it/s, env_step=120832, len=29, n/ep=3, n/st=64, player_1/loss=288.798, player_2/loss=135.808, rew=473.33]                                                                                                                                                                                  


Epoch #118: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #119: 1025it [00:02, 396.44it/s, env_step=121856, len=28, n/ep=3, n/st=64, player_1/loss=304.462, player_2/loss=187.826, rew=421.33]                                                                                                                                                                                  


Epoch #119: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #120: 1025it [00:02, 394.40it/s, env_step=122880, len=33, n/ep=2, n/st=64, player_1/loss=364.793, player_2/loss=392.555, rew=572.50]                                                                                                                                                                                  


Epoch #120: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #121: 1025it [00:02, 396.04it/s, env_step=123904, len=33, n/ep=2, n/st=64, player_1/loss=390.178, player_2/loss=394.565, rew=572.50]                                                                                                                                                                                  


Epoch #121: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #122: 1025it [00:02, 398.36it/s, env_step=124928, len=31, n/ep=2, n/st=64, player_1/loss=274.590, player_2/loss=363.625, rew=527.00]                                                                                                                                                                                  


Epoch #122: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #123: 1025it [00:02, 396.23it/s, env_step=125952, len=31, n/ep=3, n/st=64, player_1/loss=344.080, player_2/loss=493.607, rew=528.67]                                                                                                                                                                                  


Epoch #123: test_reward: 275.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #124: 1025it [00:02, 395.84it/s, env_step=126976, len=24, n/ep=2, n/st=64, player_1/loss=352.375, player_2/loss=375.688, rew=301.00]                                                                                                                                                                                  


Epoch #124: test_reward: 350.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #125: 1025it [00:02, 398.40it/s, env_step=128000, len=26, n/ep=3, n/st=64, player_1/loss=277.354, player_2/loss=345.031, rew=369.33]                                                                                                                                                                                  


Epoch #125: test_reward: 527.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #126: 1025it [00:02, 397.89it/s, env_step=129024, len=31, n/ep=2, n/st=64, player_1/loss=163.232, player_2/loss=474.850, rew=495.00]                                                                                                                                                                                  


Epoch #126: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #127: 1025it [00:02, 397.88it/s, env_step=130048, len=25, n/ep=3, n/st=64, player_1/loss=178.752, player_2/loss=377.080, rew=373.33]                                                                                                                                                                                  


Epoch #127: test_reward: 90.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #128: 1025it [00:02, 396.89it/s, env_step=131072, len=20, n/ep=3, n/st=64, player_1/loss=260.797, player_2/loss=403.727, rew=223.33]                                                                                                                                                                                  


Epoch #128: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #129: 1025it [00:02, 396.56it/s, env_step=132096, len=32, n/ep=2, n/st=64, player_1/loss=252.487, player_2/loss=358.698, rew=571.50]                                                                                                                                                                                  


Epoch #129: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #130: 1025it [00:02, 396.88it/s, env_step=133120, len=28, n/ep=2, n/st=64, player_1/loss=297.913, player_2/loss=470.694, rew=610.50]                                                                                                                                                                                  


Epoch #130: test_reward: 779.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #131: 1025it [00:02, 395.30it/s, env_step=134144, len=28, n/ep=2, n/st=64, player_1/loss=340.314, player_2/loss=453.071, rew=407.00]                                                                                                                                                                                  


Epoch #131: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #132: 1025it [00:02, 393.92it/s, env_step=135168, len=34, n/ep=1, n/st=64, player_1/loss=325.332, player_2/loss=647.307, rew=594.00]                                                                                                                                                                                  


Epoch #132: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #133: 1025it [00:02, 396.80it/s, env_step=136192, len=23, n/ep=2, n/st=64, player_2/loss=633.595, rew=275.50]                                                                                                                                                                                                         


Epoch #133: test_reward: 275.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #134: 1025it [00:02, 393.33it/s, env_step=137216, len=31, n/ep=2, n/st=64, player_1/loss=430.290, player_2/loss=426.900, rew=556.00]                                                                                                                                                                                  


Epoch #134: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #135: 1025it [00:02, 395.58it/s, env_step=138240, len=35, n/ep=2, n/st=64, player_1/loss=184.234, player_2/loss=383.261, rew=753.50]                                                                                                                                                                                  


Epoch #135: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #136: 1025it [00:02, 393.71it/s, env_step=139264, len=28, n/ep=2, n/st=64, player_1/loss=93.941, player_2/loss=525.435, rew=419.50]                                                                                                                                                                                   


Epoch #136: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #137: 1025it [00:02, 395.44it/s, env_step=140288, len=33, n/ep=2, n/st=64, player_1/loss=105.266, player_2/loss=382.107, rew=568.00]                                                                                                                                                                                  


Epoch #137: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #138: 1025it [00:02, 396.30it/s, env_step=141312, len=33, n/ep=2, n/st=64, player_1/loss=189.612, player_2/loss=308.827, rew=583.00]                                                                                                                                                                                  


Epoch #138: test_reward: 779.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #139: 1025it [00:02, 398.77it/s, env_step=142336, len=27, n/ep=3, n/st=64, player_1/loss=167.447, player_2/loss=480.833, rew=463.33]                                                                                                                                                                                  


Epoch #139: test_reward: 27.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #140: 1025it [00:02, 396.67it/s, env_step=143360, len=34, n/ep=2, n/st=64, player_1/loss=262.532, player_2/loss=310.077, rew=739.50]                                                                                                                                                                                  


Epoch #140: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #141: 1025it [00:02, 397.10it/s, env_step=144384, len=27, n/ep=2, n/st=64, player_1/loss=416.375, player_2/loss=220.472, rew=379.00]                                                                                                                                                                                  


Epoch #141: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #142: 1025it [00:02, 399.35it/s, env_step=145408, len=31, n/ep=2, n/st=64, player_1/loss=258.810, player_2/loss=191.169, rew=545.00]                                                                                                                                                                                  


Epoch #142: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #143: 1025it [00:02, 397.33it/s, env_step=146432, len=28, n/ep=3, n/st=64, player_1/loss=375.183, player_2/loss=127.305, rew=427.00]                                                                                                                                                                                  


Epoch #143: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #144: 1025it [00:02, 396.70it/s, env_step=147456, len=27, n/ep=2, n/st=64, player_1/loss=594.813, player_2/loss=622.298, rew=391.00]                                                                                                                                                                                  


Epoch #144: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #145: 1025it [00:02, 394.20it/s, env_step=148480, len=31, n/ep=3, n/st=64, player_1/loss=557.766, player_2/loss=1087.187, rew=577.67]                                                                                                                                                                                 


Epoch #145: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #146: 1025it [00:02, 390.33it/s, env_step=149504, len=24, n/ep=2, n/st=64, player_1/loss=544.757, player_2/loss=891.432, rew=317.50]                                                                                                                                                                                  


Epoch #146: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #147: 1025it [00:02, 397.00it/s, env_step=150528, len=24, n/ep=2, n/st=64, player_1/loss=391.471, player_2/loss=403.191, rew=402.50]                                                                                                                                                                                  


Epoch #147: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #148: 1025it [00:02, 396.74it/s, env_step=151552, len=30, n/ep=3, n/st=64, player_1/loss=359.142, player_2/loss=273.587, rew=503.00]                                                                                                                                                                                  


Epoch #148: test_reward: 252.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #149: 1025it [00:02, 394.20it/s, env_step=152576, len=21, n/ep=2, n/st=64, player_1/loss=410.950, player_2/loss=291.207, rew=296.00]                                                                                                                                                                                  


Epoch #149: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #150: 1025it [00:02, 395.21it/s, env_step=153600, len=19, n/ep=3, n/st=64, player_1/loss=357.692, player_2/loss=334.066, rew=213.00]                                                                                                                                                                                  


Epoch #150: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #151: 1025it [00:02, 395.30it/s, env_step=154624, len=29, n/ep=2, n/st=64, player_1/loss=115.851, player_2/loss=173.646, rew=452.00]                                                                                                                                                                                  


Epoch #151: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #152: 1025it [00:02, 395.16it/s, env_step=155648, len=30, n/ep=2, n/st=64, player_1/loss=224.327, player_2/loss=351.503, rew=500.50]                                                                                                                                                                                  


Epoch #152: test_reward: 527.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #153: 1025it [00:02, 397.06it/s, env_step=156672, len=28, n/ep=2, n/st=64, player_1/loss=246.675, player_2/loss=359.105, rew=420.50]                                                                                                                                                                                  


Epoch #153: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #154: 1025it [00:02, 396.43it/s, env_step=157696, len=32, n/ep=2, n/st=64, player_1/loss=220.276, player_2/loss=148.166, rew=558.50]                                                                                                                                                                                  


Epoch #154: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #155: 1025it [00:02, 396.74it/s, env_step=158720, len=14, n/ep=4, n/st=64, player_1/loss=513.655, player_2/loss=469.953, rew=108.00]                                                                                                                                                                                  


Epoch #155: test_reward: 90.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #156: 1025it [00:02, 397.21it/s, env_step=159744, len=16, n/ep=4, n/st=64, player_1/loss=479.871, player_2/loss=628.832, rew=145.50]                                                                                                                                                                                  


Epoch #156: test_reward: 90.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #157: 1025it [00:02, 397.40it/s, env_step=160768, len=21, n/ep=3, n/st=64, player_1/loss=359.230, player_2/loss=337.840, rew=320.00]                                                                                                                                                                                  


Epoch #157: test_reward: 104.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #158: 1025it [00:02, 394.85it/s, env_step=161792, len=27, n/ep=2, n/st=64, player_1/loss=304.653, player_2/loss=271.874, rew=395.00]                                                                                                                                                                                  


Epoch #158: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #159: 1025it [00:02, 397.19it/s, env_step=162816, len=32, n/ep=2, n/st=64, player_1/loss=132.769, player_2/loss=287.187, rew=543.50]                                                                                                                                                                                  


Epoch #159: test_reward: 560.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #160: 1025it [00:02, 425.52it/s, env_step=163840, len=17, n/ep=5, n/st=64, player_1/loss=165.299, player_2/loss=437.525, rew=176.00]                                                                                                                                                                                  


Epoch #160: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #161: 1025it [00:02, 465.04it/s, env_step=164864, len=25, n/ep=2, n/st=64, player_1/loss=264.436, player_2/loss=718.716, rew=347.00]                                                                                                                                                                                  


Epoch #161: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #162: 1025it [00:02, 446.67it/s, env_step=165888, len=30, n/ep=2, n/st=64, player_1/loss=218.102, player_2/loss=818.367, rew=496.00]                                                                                                                                                                                  


Epoch #162: test_reward: 594.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #163: 1025it [00:02, 435.12it/s, env_step=166912, len=31, n/ep=2, n/st=64, player_1/loss=195.375, player_2/loss=346.514, rew=503.00]                                                                                                                                                                                  


Epoch #163: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #164: 1025it [00:02, 421.26it/s, env_step=167936, len=35, n/ep=2, n/st=64, player_1/loss=256.195, player_2/loss=184.283, rew=637.00]                                                                                                                                                                                  


Epoch #164: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #165: 1025it [00:02, 411.48it/s, env_step=168960, len=21, n/ep=3, n/st=64, player_1/loss=298.276, player_2/loss=203.823, rew=246.00]                                                                                                                                                                                  


Epoch #165: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #166: 1025it [00:02, 404.38it/s, env_step=169984, len=26, n/ep=2, n/st=64, player_1/loss=341.386, player_2/loss=212.454, rew=391.50]                                                                                                                                                                                  


Epoch #166: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #167: 1025it [00:02, 393.84it/s, env_step=171008, len=37, n/ep=2, n/st=64, player_1/loss=263.456, player_2/loss=217.585, rew=721.00]                                                                                                                                                                                  


Epoch #167: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #168: 1025it [00:02, 396.42it/s, env_step=172032, len=25, n/ep=3, n/st=64, player_1/loss=341.132, player_2/loss=198.066, rew=337.33]                                                                                                                                                                                  


Epoch #168: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #169: 1025it [00:02, 394.24it/s, env_step=173056, len=21, n/ep=3, n/st=64, player_1/loss=542.706, player_2/loss=410.350, rew=240.33]                                                                                                                                                                                  


Epoch #169: test_reward: 434.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #170: 1025it [00:02, 396.66it/s, env_step=174080, len=33, n/ep=2, n/st=64, player_1/loss=231.145, player_2/loss=495.761, rew=578.00]                                                                                                                                                                                  


Epoch #170: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #171: 1025it [00:02, 398.30it/s, env_step=175104, len=27, n/ep=3, n/st=64, player_1/loss=221.400, player_2/loss=664.167, rew=378.00]                                                                                                                                                                                  


Epoch #171: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #172: 1025it [00:02, 390.11it/s, env_step=176128, len=20, n/ep=3, n/st=64, player_1/loss=553.350, player_2/loss=575.149, rew=224.67]                                                                                                                                                                                  


Epoch #172: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #173: 1025it [00:02, 395.23it/s, env_step=177152, len=20, n/ep=3, n/st=64, player_1/loss=506.238, player_2/loss=484.972, rew=223.67]                                                                                                                                                                                  


Epoch #173: test_reward: 252.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #174: 1025it [00:02, 393.13it/s, env_step=178176, len=32, n/ep=2, n/st=64, player_1/loss=264.389, player_2/loss=264.504, rew=546.50]                                                                                                                                                                                  


Epoch #174: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #175: 1025it [00:02, 396.34it/s, env_step=179200, len=29, n/ep=2, n/st=64, player_1/loss=260.099, player_2/loss=490.949, rew=436.00]                                                                                                                                                                                  


Epoch #175: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #176: 1025it [00:02, 394.57it/s, env_step=180224, len=28, n/ep=2, n/st=64, player_1/loss=282.006, player_2/loss=599.199, rew=405.50]                                                                                                                                                                                  


Epoch #176: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #177: 1025it [00:02, 395.81it/s, env_step=181248, len=32, n/ep=2, n/st=64, player_1/loss=297.335, player_2/loss=452.052, rew=546.50]                                                                                                                                                                                  


Epoch #177: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #178: 1025it [00:02, 394.89it/s, env_step=182272, len=26, n/ep=3, n/st=64, player_1/loss=269.117, player_2/loss=589.170, rew=350.33]                                                                                                                                                                                  


Epoch #178: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #179: 1025it [00:02, 398.49it/s, env_step=183296, len=26, n/ep=3, n/st=64, player_1/loss=345.919, player_2/loss=630.620, rew=414.00]                                                                                                                                                                                  


Epoch #179: test_reward: 594.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #180: 1025it [00:02, 395.47it/s, env_step=184320, len=41, n/ep=1, n/st=64, player_1/loss=371.035, player_2/loss=251.562, rew=860.00]                                                                                                                                                                                  


Epoch #180: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #181: 1025it [00:02, 397.29it/s, env_step=185344, len=24, n/ep=3, n/st=64, player_1/loss=231.646, player_2/loss=284.918, rew=334.00]                                                                                                                                                                                  


Epoch #181: test_reward: 252.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #182: 1025it [00:02, 395.70it/s, env_step=186368, len=18, n/ep=3, n/st=64, player_1/loss=289.295, player_2/loss=393.037, rew=186.67]                                                                                                                                                                                  


Epoch #182: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #183: 1025it [00:02, 397.10it/s, env_step=187392, len=34, n/ep=2, n/st=64, player_1/loss=390.001, player_2/loss=425.424, rew=617.50]                                                                                                                                                                                  


Epoch #183: test_reward: 252.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #184: 1025it [00:02, 394.97it/s, env_step=188416, len=30, n/ep=2, n/st=64, player_1/loss=389.237, player_2/loss=410.333, rew=464.00]                                                                                                                                                                                  


Epoch #184: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #185: 1025it [00:02, 398.56it/s, env_step=189440, len=22, n/ep=3, n/st=64, player_1/loss=250.385, player_2/loss=414.186, rew=259.67]                                                                                                                                                                                  


Epoch #185: test_reward: 209.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #186: 1025it [00:02, 396.09it/s, env_step=190464, len=29, n/ep=2, n/st=64, player_1/loss=252.625, player_2/loss=320.688, rew=452.00]                                                                                                                                                                                  


Epoch #186: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #187: 1025it [00:02, 396.43it/s, env_step=191488, len=20, n/ep=4, n/st=64, player_1/loss=325.684, player_2/loss=392.410, rew=259.50]                                                                                                                                                                                  


Epoch #187: test_reward: 527.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #188: 1025it [00:02, 395.90it/s, env_step=192512, len=28, n/ep=2, n/st=64, player_1/loss=232.809, player_2/loss=451.767, rew=420.50]                                                                                                                                                                                  


Epoch #188: test_reward: 819.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #189: 1025it [00:02, 397.36it/s, env_step=193536, len=37, n/ep=2, n/st=64, player_1/loss=193.140, player_2/loss=369.060, rew=706.50]                                                                                                                                                                                  


Epoch #189: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #190: 1025it [00:02, 396.29it/s, env_step=194560, len=39, n/ep=2, n/st=64, player_1/loss=295.257, player_2/loss=250.363, rew=799.00]                                                                                                                                                                                  


Epoch #190: test_reward: 252.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #191: 1025it [00:02, 396.76it/s, env_step=195584, len=31, n/ep=3, n/st=64, player_1/loss=346.977, player_2/loss=235.734, rew=520.33]                                                                                                                                                                                  


Epoch #191: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #192: 1025it [00:02, 394.27it/s, env_step=196608, len=19, n/ep=3, n/st=64, player_1/loss=468.462, player_2/loss=664.141, rew=233.33]                                                                                                                                                                                  


Epoch #192: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #193: 1025it [00:02, 396.48it/s, env_step=197632, len=31, n/ep=2, n/st=64, player_1/loss=418.175, player_2/loss=900.155, rew=526.00]                                                                                                                                                                                  


Epoch #193: test_reward: 275.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #194: 1025it [00:02, 395.11it/s, env_step=198656, len=38, n/ep=2, n/st=64, player_1/loss=242.412, player_2/loss=634.099, rew=760.50]                                                                                                                                                                                  


Epoch #194: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #195: 1025it [00:02, 397.03it/s, env_step=199680, len=36, n/ep=2, n/st=64, player_1/loss=303.360, player_2/loss=213.126, rew=667.00]                                                                                                                                                                                  


Epoch #195: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #196: 1025it [00:02, 397.92it/s, env_step=200704, len=24, n/ep=3, n/st=64, player_1/loss=361.520, player_2/loss=253.668, rew=307.33]                                                                                                                                                                                  


Epoch #196: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #197: 1025it [00:02, 398.75it/s, env_step=201728, len=26, n/ep=2, n/st=64, player_1/loss=315.742, player_2/loss=346.467, rew=366.50]                                                                                                                                                                                  


Epoch #197: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #198: 1025it [00:02, 398.09it/s, env_step=202752, len=27, n/ep=3, n/st=64, player_1/loss=274.773, player_2/loss=402.821, rew=377.00]                                                                                                                                                                                  


Epoch #198: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #199: 1025it [00:02, 393.90it/s, env_step=203776, len=15, n/ep=4, n/st=64, player_1/loss=303.723, player_2/loss=393.892, rew=137.75]                                                                                                                                                                                  


Epoch #199: test_reward: 90.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #200: 1025it [00:02, 398.47it/s, env_step=204800, len=20, n/ep=3, n/st=64, player_1/loss=529.504, player_2/loss=442.283, rew=218.00]                                                                                                                                                                                  


Epoch #200: test_reward: 170.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #201: 1025it [00:02, 395.18it/s, env_step=205824, len=22, n/ep=3, n/st=64, player_1/loss=491.240, player_2/loss=431.956, rew=253.33]                                                                                                                                                                                  


Epoch #201: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #202: 1025it [00:02, 397.48it/s, env_step=206848, len=22, n/ep=2, n/st=64, player_1/loss=333.738, player_2/loss=476.074, rew=254.00]                                                                                                                                                                                  


Epoch #202: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #203: 1025it [00:02, 395.63it/s, env_step=207872, len=32, n/ep=2, n/st=64, player_1/loss=264.129, player_2/loss=411.745, rew=545.00]                                                                                                                                                                                  


Epoch #203: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #204: 1025it [00:02, 395.95it/s, env_step=208896, len=30, n/ep=2, n/st=64, player_1/loss=165.883, player_2/loss=257.419, rew=479.50]                                                                                                                                                                                  


Epoch #204: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #205: 1025it [00:02, 397.08it/s, env_step=209920, len=25, n/ep=3, n/st=64, player_1/loss=213.147, player_2/loss=209.562, rew=346.67]                                                                                                                                                                                  


Epoch #205: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #206: 1025it [00:02, 397.40it/s, env_step=210944, len=20, n/ep=2, n/st=64, player_1/loss=218.178, player_2/loss=341.159, rew=240.50]                                                                                                                                                                                  


Epoch #206: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #207: 1025it [00:02, 416.41it/s, env_step=211968, len=23, n/ep=3, n/st=64, player_1/loss=195.765, player_2/loss=584.099, rew=322.33]                                                                                                                                                                                  


Epoch #207: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #208: 1025it [00:02, 411.77it/s, env_step=212992, len=33, n/ep=2, n/st=64, player_1/loss=201.485, player_2/loss=608.552, rew=560.50]                                                                                                                                                                                  


Epoch #208: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #209: 1025it [00:02, 401.56it/s, env_step=214016, len=31, n/ep=2, n/st=64, player_1/loss=227.092, player_2/loss=405.609, rew=495.50]                                                                                                                                                                                  


Epoch #209: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #210: 1025it [00:02, 398.33it/s, env_step=215040, len=26, n/ep=2, n/st=64, player_1/loss=245.665, player_2/loss=539.994, rew=364.50]                                                                                                                                                                                  


Epoch #210: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #211: 1025it [00:02, 396.17it/s, env_step=216064, len=28, n/ep=3, n/st=64, player_1/loss=300.445, player_2/loss=825.182, rew=427.33]                                                                                                                                                                                  


Epoch #211: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #212: 1025it [00:02, 396.17it/s, env_step=217088, len=28, n/ep=2, n/st=64, player_1/loss=380.626, player_2/loss=611.309, rew=425.50]                                                                                                                                                                                  


Epoch #212: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #213: 1025it [00:02, 396.76it/s, env_step=218112, len=30, n/ep=2, n/st=64, player_1/loss=510.787, player_2/loss=351.574, rew=485.50]                                                                                                                                                                                  


Epoch #213: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #214: 1025it [00:02, 396.64it/s, env_step=219136, len=37, n/ep=1, n/st=64, player_1/loss=425.068, player_2/loss=304.100, rew=702.00]                                                                                                                                                                                  


Epoch #214: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #215: 1025it [00:02, 397.02it/s, env_step=220160, len=8, n/ep=7, n/st=64, player_1/loss=466.172, player_2/loss=252.945, rew=36.57]                                                                                                                                                                                    


Epoch #215: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #216: 1025it [00:02, 398.24it/s, env_step=221184, len=33, n/ep=2, n/st=64, player_1/loss=453.034, player_2/loss=254.428, rew=592.00]                                                                                                                                                                                  


Epoch #216: test_reward: 434.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #217: 1025it [00:02, 396.99it/s, env_step=222208, len=30, n/ep=1, n/st=64, player_1/loss=333.099, player_2/loss=316.601, rew=464.00]                                                                                                                                                                                  


Epoch #217: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #218: 1025it [00:02, 398.84it/s, env_step=223232, len=7, n/ep=6, n/st=64, player_1/loss=419.252, player_2/loss=552.870, rew=33.83]                                                                                                                                                                                    


Epoch #218: test_reward: 27.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #219: 1025it [00:02, 395.63it/s, env_step=224256, len=27, n/ep=2, n/st=64, player_1/loss=413.285, player_2/loss=628.394, rew=395.00]                                                                                                                                                                                  


Epoch #219: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #220: 1025it [00:02, 397.16it/s, env_step=225280, len=29, n/ep=2, n/st=64, player_1/loss=274.334, player_2/loss=480.143, rew=450.00]                                                                                                                                                                                  


Epoch #220: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #221: 1025it [00:02, 396.47it/s, env_step=226304, len=14, n/ep=3, n/st=64, player_1/loss=370.347, player_2/loss=615.817, rew=109.67]                                                                                                                                                                                  


Epoch #221: test_reward: 104.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #222: 1025it [00:02, 394.48it/s, env_step=227328, len=21, n/ep=3, n/st=64, player_1/loss=328.751, player_2/loss=573.074, rew=252.33]                                                                                                                                                                                  


Epoch #222: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #223: 1025it [00:02, 395.43it/s, env_step=228352, len=37, n/ep=2, n/st=64, player_1/loss=377.510, player_2/loss=235.895, rew=721.00]                                                                                                                                                                                  


Epoch #223: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #224: 1025it [00:02, 396.18it/s, env_step=229376, len=28, n/ep=3, n/st=64, player_1/loss=288.717, player_2/loss=290.358, rew=442.00]                                                                                                                                                                                  


Epoch #224: test_reward: 779.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #225: 1025it [00:02, 396.01it/s, env_step=230400, len=27, n/ep=2, n/st=64, player_1/loss=236.899, player_2/loss=298.706, rew=392.00]                                                                                                                                                                                  


Epoch #225: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #226: 1025it [00:02, 396.18it/s, env_step=231424, len=31, n/ep=2, n/st=64, player_1/loss=456.041, player_2/loss=333.074, rew=532.00]                                                                                                                                                                                  


Epoch #226: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #227: 1025it [00:02, 394.05it/s, env_step=232448, len=27, n/ep=3, n/st=64, player_1/loss=324.031, player_2/loss=268.732, rew=396.33]                                                                                                                                                                                  


Epoch #227: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #228: 1025it [00:02, 394.53it/s, env_step=233472, len=26, n/ep=3, n/st=64, player_1/loss=175.661, player_2/loss=384.251, rew=362.33]                                                                                                                                                                                  


Epoch #228: test_reward: 629.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #229: 1025it [00:02, 397.70it/s, env_step=234496, len=27, n/ep=3, n/st=64, player_1/loss=401.042, player_2/loss=475.596, rew=394.33]                                                                                                                                                                                  


Epoch #229: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #230: 1025it [00:02, 397.02it/s, env_step=235520, len=26, n/ep=2, n/st=64, player_1/loss=348.122, player_2/loss=476.076, rew=352.00]                                                                                                                                                                                  


Epoch #230: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #231: 1025it [00:02, 397.50it/s, env_step=236544, len=28, n/ep=2, n/st=64, player_1/loss=213.542, player_2/loss=498.493, rew=420.50]                                                                                                                                                                                  


Epoch #231: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #232: 1025it [00:02, 398.64it/s, env_step=237568, len=25, n/ep=3, n/st=64, player_1/loss=253.340, player_2/loss=427.927, rew=347.00]                                                                                                                                                                                  


Epoch #232: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #233: 1025it [00:02, 398.23it/s, env_step=238592, len=28, n/ep=2, n/st=64, player_1/loss=404.227, player_2/loss=503.791, rew=425.50]                                                                                                                                                                                  


Epoch #233: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #234: 1025it [00:02, 399.76it/s, env_step=239616, len=27, n/ep=2, n/st=64, player_1/loss=404.547, player_2/loss=439.246, rew=381.50]                                                                                                                                                                                  


Epoch #234: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #235: 1025it [00:02, 395.37it/s, env_step=240640, len=34, n/ep=2, n/st=64, player_1/loss=330.770, player_2/loss=276.582, rew=602.00]                                                                                                                                                                                  


Epoch #235: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #236: 1025it [00:02, 396.48it/s, env_step=241664, len=37, n/ep=2, n/st=64, player_1/loss=299.636, player_2/loss=241.144, rew=721.00]                                                                                                                                                                                  


Epoch #236: test_reward: 252.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #237: 1025it [00:02, 396.12it/s, env_step=242688, len=22, n/ep=2, n/st=64, player_1/loss=306.380, player_2/loss=664.860, rew=329.50]                                                                                                                                                                                  


Epoch #237: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #238: 1025it [00:02, 397.78it/s, env_step=243712, len=40, n/ep=2, n/st=64, player_1/loss=211.953, player_2/loss=659.697, rew=921.00]                                                                                                                                                                                  


Epoch #238: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #239: 1025it [00:02, 396.00it/s, env_step=244736, len=35, n/ep=2, n/st=64, player_1/loss=96.601, player_2/loss=238.540, rew=637.00]                                                                                                                                                                                   


Epoch #239: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #240: 1025it [00:02, 397.41it/s, env_step=245760, len=11, n/ep=6, n/st=64, player_1/loss=283.237, player_2/loss=327.656, rew=77.50]                                                                                                                                                                                   


Epoch #240: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #241: 1025it [00:02, 398.14it/s, env_step=246784, len=32, n/ep=2, n/st=64, player_1/loss=416.109, player_2/loss=579.411, rew=527.00]                                                                                                                                                                                  


Epoch #241: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #242: 1025it [00:02, 396.59it/s, env_step=247808, len=14, n/ep=5, n/st=64, player_1/loss=463.784, player_2/loss=556.445, rew=112.40]                                                                                                                                                                                  


Epoch #242: test_reward: 104.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #243: 1025it [00:02, 397.97it/s, env_step=248832, len=15, n/ep=5, n/st=64, player_1/loss=517.050, player_2/loss=488.745, rew=119.20]                                                                                                                                                                                  


Epoch #243: test_reward: 104.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #244: 1025it [00:02, 396.77it/s, env_step=249856, len=27, n/ep=3, n/st=64, player_1/loss=411.239, player_2/loss=391.513, rew=377.33]                                                                                                                                                                                  


Epoch #244: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #245: 1025it [00:02, 398.23it/s, env_step=250880, len=25, n/ep=2, n/st=64, player_1/loss=247.626, player_2/loss=160.803, rew=337.00]                                                                                                                                                                                  


Epoch #245: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #246: 1025it [00:02, 397.27it/s, env_step=251904, len=26, n/ep=2, n/st=64, player_1/loss=299.132, player_2/loss=445.218, rew=352.00]                                                                                                                                                                                  


Epoch #246: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #247: 1025it [00:02, 400.42it/s, env_step=252928, len=32, n/ep=2, n/st=64, player_1/loss=298.262, player_2/loss=526.883, rew=553.50]                                                                                                                                                                                  


Epoch #247: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #248: 1025it [00:02, 399.06it/s, env_step=253952, len=40, n/ep=1, n/st=64, player_1/loss=192.782, player_2/loss=606.636, rew=819.00]                                                                                                                                                                                  


Epoch #248: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #249: 1025it [00:02, 396.82it/s, env_step=254976, len=37, n/ep=2, n/st=64, player_1/loss=343.961, player_2/loss=568.328, rew=814.50]                                                                                                                                                                                  


Epoch #249: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #250: 1025it [00:02, 398.33it/s, env_step=256000, len=28, n/ep=3, n/st=64, player_1/loss=317.880, player_2/loss=180.032, rew=433.33]                                                                                                                                                                                  


Epoch #250: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #251: 1025it [00:02, 390.44it/s, env_step=257024, len=25, n/ep=2, n/st=64, player_1/loss=288.250, player_2/loss=201.186, rew=326.00]                                                                                                                                                                                  


Epoch #251: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #252: 1025it [00:02, 403.36it/s, env_step=258048, len=26, n/ep=3, n/st=64, player_1/loss=165.174, player_2/loss=271.433, rew=378.00]                                                                                                                                                                                  


Epoch #252: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #253: 1025it [00:02, 406.24it/s, env_step=259072, len=30, n/ep=2, n/st=64, player_1/loss=430.237, player_2/loss=554.120, rew=496.00]                                                                                                                                                                                  


Epoch #253: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #254: 1025it [00:02, 396.59it/s, env_step=260096, len=20, n/ep=3, n/st=64, player_1/loss=543.704, player_2/loss=613.382, rew=214.33]                                                                                                                                                                                  


Epoch #254: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #255: 1025it [00:02, 397.16it/s, env_step=261120, len=28, n/ep=2, n/st=64, player_1/loss=504.891, player_2/loss=914.253, rew=405.00]                                                                                                                                                                                  


Epoch #255: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #256: 1025it [00:02, 399.57it/s, env_step=262144, len=29, n/ep=3, n/st=64, player_1/loss=409.527, player_2/loss=717.271, rew=454.00]                                                                                                                                                                                  


Epoch #256: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #257: 1025it [00:02, 396.46it/s, env_step=263168, len=32, n/ep=2, n/st=64, player_1/loss=270.002, player_2/loss=259.123, rew=543.50]                                                                                                                                                                                  


Epoch #257: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #258: 1025it [00:02, 394.77it/s, env_step=264192, len=10, n/ep=7, n/st=64, player_1/loss=181.372, player_2/loss=565.330, rew=66.43]                                                                                                                                                                                   


Epoch #258: test_reward: 90.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #259: 1025it [00:02, 396.48it/s, env_step=265216, len=25, n/ep=3, n/st=64, player_1/loss=507.415, player_2/loss=469.053, rew=342.67]                                                                                                                                                                                  


Epoch #259: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #260: 1025it [00:02, 395.26it/s, env_step=266240, len=26, n/ep=2, n/st=64, player_1/loss=401.176, player_2/loss=546.912, rew=350.50]                                                                                                                                                                                  


Epoch #260: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #261: 1025it [00:02, 397.30it/s, env_step=267264, len=33, n/ep=2, n/st=64, player_1/loss=311.850, player_2/loss=206.045, rew=598.00]                                                                                                                                                                                  


Epoch #261: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #262: 1025it [00:02, 394.50it/s, env_step=268288, len=25, n/ep=2, n/st=64, player_1/loss=264.417, player_2/loss=212.257, rew=338.00]                                                                                                                                                                                  


Epoch #262: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #263: 1025it [00:02, 396.45it/s, env_step=269312, len=10, n/ep=6, n/st=64, player_1/loss=219.613, player_2/loss=585.294, rew=58.50]                                                                                                                                                                                   


Epoch #263: test_reward: 27.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #264: 1025it [00:02, 397.02it/s, env_step=270336, len=7, n/ep=8, n/st=64, player_1/loss=404.229, player_2/loss=811.424, rew=34.50]                                                                                                                                                                                    


Epoch #264: test_reward: 27.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #265: 1025it [00:02, 395.27it/s, env_step=271360, len=17, n/ep=4, n/st=64, player_1/loss=555.755, player_2/loss=736.428, rew=164.25]                                                                                                                                                                                  


Epoch #265: test_reward: 90.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #266: 1025it [00:02, 395.31it/s, env_step=272384, len=15, n/ep=4, n/st=64, player_1/loss=412.084, player_2/loss=645.256, rew=132.75]                                                                                                                                                                                  


Epoch #266: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #267: 1025it [00:02, 399.14it/s, env_step=273408, len=32, n/ep=2, n/st=64, player_1/loss=298.493, player_2/loss=574.167, rew=553.50]                                                                                                                                                                                  


Epoch #267: test_reward: 252.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #268: 1025it [00:02, 398.23it/s, env_step=274432, len=21, n/ep=3, n/st=64, player_1/loss=444.157, player_2/loss=455.372, rew=244.33]                                                                                                                                                                                  


Epoch #268: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #269: 1025it [00:02, 397.61it/s, env_step=275456, len=34, n/ep=2, n/st=64, player_1/loss=526.187, player_2/loss=331.956, rew=606.50]                                                                                                                                                                                  


Epoch #269: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #270: 1025it [00:02, 399.17it/s, env_step=276480, len=38, n/ep=2, n/st=64, player_1/loss=330.609, player_2/loss=302.420, rew=740.00]                                                                                                                                                                                  


Epoch #270: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #271: 1025it [00:02, 400.01it/s, env_step=277504, len=24, n/ep=3, n/st=64, player_1/loss=444.779, player_2/loss=274.166, rew=314.33]                                                                                                                                                                                  


Epoch #271: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #272: 1025it [00:02, 398.41it/s, env_step=278528, len=29, n/ep=2, n/st=64, player_1/loss=444.511, player_2/loss=193.553, rew=450.00]                                                                                                                                                                                  


Epoch #272: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #273: 1025it [00:02, 395.58it/s, env_step=279552, len=26, n/ep=1, n/st=64, player_1/loss=285.746, player_2/loss=214.529, rew=350.00]                                                                                                                                                                                  


Epoch #273: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #274: 1025it [00:02, 397.17it/s, env_step=280576, len=29, n/ep=1, n/st=64, player_1/loss=242.065, player_2/loss=312.515, rew=434.00]                                                                                                                                                                                  


Epoch #274: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #275: 1025it [00:02, 396.81it/s, env_step=281600, len=28, n/ep=2, n/st=64, player_1/loss=222.841, player_2/loss=538.555, rew=407.00]                                                                                                                                                                                  


Epoch #275: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #276: 1025it [00:02, 398.77it/s, env_step=282624, len=31, n/ep=2, n/st=64, player_1/loss=484.727, player_2/loss=643.383, rew=497.00]                                                                                                                                                                                  


Epoch #276: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #277: 1025it [00:02, 398.78it/s, env_step=283648, len=26, n/ep=2, n/st=64, player_1/loss=444.200, player_2/loss=336.578, rew=391.50]                                                                                                                                                                                  


Epoch #277: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #278: 1025it [00:02, 397.66it/s, env_step=284672, len=30, n/ep=2, n/st=64, player_1/loss=325.010, player_2/loss=242.473, rew=479.50]                                                                                                                                                                                  


Epoch #278: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #279: 1025it [00:02, 397.43it/s, env_step=285696, len=27, n/ep=2, n/st=64, player_1/loss=457.779, player_2/loss=244.926, rew=394.00]                                                                                                                                                                                  


Epoch #279: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #280: 1025it [00:02, 389.06it/s, env_step=286720, len=31, n/ep=2, n/st=64, player_1/loss=552.637, player_2/loss=195.295, rew=495.50]                                                                                                                                                                                  


Epoch #280: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #281: 1025it [00:02, 398.59it/s, env_step=287744, len=21, n/ep=3, n/st=64, player_1/loss=380.577, player_2/loss=269.713, rew=269.33]                                                                                                                                                                                  


Epoch #281: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #282: 1025it [00:02, 396.91it/s, env_step=288768, len=36, n/ep=2, n/st=64, player_1/loss=255.977, player_2/loss=238.739, rew=667.00]                                                                                                                                                                                  


Epoch #282: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #283: 1025it [00:02, 396.16it/s, env_step=289792, len=27, n/ep=2, n/st=64, player_1/loss=265.805, player_2/loss=146.773, rew=391.00]                                                                                                                                                                                  


Epoch #283: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #284: 1025it [00:02, 394.72it/s, env_step=290816, len=28, n/ep=2, n/st=64, player_1/loss=295.612, player_2/loss=141.610, rew=419.50]                                                                                                                                                                                  


Epoch #284: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #285: 1025it [00:02, 397.25it/s, env_step=291840, len=35, n/ep=2, n/st=64, player_1/loss=394.764, player_2/loss=285.338, rew=647.00]                                                                                                                                                                                  


Epoch #285: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #286: 1025it [00:02, 397.34it/s, env_step=292864, len=27, n/ep=2, n/st=64, player_1/loss=304.605, player_2/loss=351.556, rew=392.00]                                                                                                                                                                                  


Epoch #286: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #287: 1025it [00:02, 397.68it/s, env_step=293888, len=29, n/ep=2, n/st=64, player_2/loss=611.929, rew=627.00]                                                                                                                                                                                                         


Epoch #287: test_reward: 819.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #288: 1025it [00:02, 398.96it/s, env_step=294912, len=30, n/ep=2, n/st=64, player_1/loss=288.950, player_2/loss=737.578, rew=488.50]                                                                                                                                                                                  


Epoch #288: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #289: 1025it [00:02, 396.51it/s, env_step=295936, len=34, n/ep=2, n/st=64, player_1/loss=248.787, player_2/loss=323.528, rew=614.50]                                                                                                                                                                                  


Epoch #289: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #290: 1025it [00:02, 396.81it/s, env_step=296960, len=33, n/ep=2, n/st=64, player_1/loss=302.380, player_2/loss=432.216, rew=598.00]                                                                                                                                                                                  


Epoch #290: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #291: 1025it [00:02, 396.14it/s, env_step=297984, len=34, n/ep=2, n/st=64, player_1/loss=382.618, player_2/loss=757.384, rew=614.50]                                                                                                                                                                                  


Epoch #291: test_reward: 299.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #292: 1025it [00:02, 396.62it/s, env_step=299008, len=30, n/ep=3, n/st=64, player_1/loss=363.630, player_2/loss=1002.876, rew=502.67]                                                                                                                                                                                 


Epoch #292: test_reward: 275.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #293: 1025it [00:02, 399.32it/s, env_step=300032, len=35, n/ep=2, n/st=64, player_1/loss=479.081, player_2/loss=875.577, rew=647.00]                                                                                                                                                                                  


Epoch #293: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #294: 1025it [00:02, 397.10it/s, env_step=301056, len=26, n/ep=2, n/st=64, player_1/loss=498.727, player_2/loss=699.699, rew=366.50]                                                                                                                                                                                  


Epoch #294: test_reward: 252.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #295: 1025it [00:02, 398.83it/s, env_step=302080, len=31, n/ep=2, n/st=64, player_1/loss=676.677, player_2/loss=270.133, rew=503.00]                                                                                                                                                                                  


Epoch #295: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #296: 1025it [00:02, 398.91it/s, env_step=303104, len=30, n/ep=2, n/st=64, player_1/loss=491.172, player_2/loss=250.641, rew=466.00]                                                                                                                                                                                  


Epoch #296: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #297: 1025it [00:02, 400.43it/s, env_step=304128, len=29, n/ep=2, n/st=64, player_1/loss=181.886, player_2/loss=285.700, rew=438.50]                                                                                                                                                                                  


Epoch #297: test_reward: 275.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #298: 1025it [00:02, 405.43it/s, env_step=305152, len=29, n/ep=2, n/st=64, player_1/loss=135.719, player_2/loss=128.276, rew=449.00]                                                                                                                                                                                  


Epoch #298: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #299: 1025it [00:02, 408.10it/s, env_step=306176, len=27, n/ep=2, n/st=64, player_1/loss=137.884, player_2/loss=191.374, rew=394.00]                                                                                                                                                                                  


Epoch #299: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #300: 1025it [00:02, 399.11it/s, env_step=307200, len=27, n/ep=3, n/st=64, player_1/loss=169.419, player_2/loss=470.038, rew=388.33]                                                                                                                                                                                  


Epoch #300: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #301: 1025it [00:02, 397.60it/s, env_step=308224, len=33, n/ep=2, n/st=64, player_1/loss=230.170, player_2/loss=385.325, rew=587.00]                                                                                                                                                                                  


Epoch #301: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #302: 1025it [00:02, 399.91it/s, env_step=309248, len=26, n/ep=3, n/st=64, player_1/loss=225.550, player_2/loss=260.420, rew=359.33]                                                                                                                                                                                  


Epoch #302: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #303: 1025it [00:02, 399.34it/s, env_step=310272, len=36, n/ep=1, n/st=64, player_1/loss=207.932, player_2/loss=387.936, rew=665.00]                                                                                                                                                                                  


Epoch #303: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #304: 1025it [00:02, 397.93it/s, env_step=311296, len=35, n/ep=2, n/st=64, player_1/loss=133.509, player_2/loss=298.570, rew=633.50]                                                                                                                                                                                  


Epoch #304: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #305: 1025it [00:02, 397.79it/s, env_step=312320, len=26, n/ep=2, n/st=64, player_1/loss=170.305, player_2/loss=261.200, rew=352.00]                                                                                                                                                                                  


Epoch #305: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #306: 1025it [00:02, 397.52it/s, env_step=313344, len=26, n/ep=2, n/st=64, player_1/loss=200.924, player_2/loss=260.078, rew=366.50]                                                                                                                                                                                  


Epoch #306: test_reward: 629.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #307: 1025it [00:02, 396.29it/s, env_step=314368, len=28, n/ep=2, n/st=64, player_1/loss=419.740, player_2/loss=429.880, rew=440.50]                                                                                                                                                                                  


Epoch #307: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #308: 1025it [00:02, 395.69it/s, env_step=315392, len=34, n/ep=2, n/st=64, player_1/loss=404.710, player_2/loss=409.412, rew=594.50]                                                                                                                                                                                  


Epoch #308: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #309: 1025it [00:02, 396.91it/s, env_step=316416, len=29, n/ep=2, n/st=64, player_1/loss=99.020, player_2/loss=257.524, rew=455.00]                                                                                                                                                                                   


Epoch #309: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #310: 1025it [00:02, 396.17it/s, env_step=317440, len=27, n/ep=2, n/st=64, player_1/loss=57.881, player_2/loss=399.998, rew=377.00]                                                                                                                                                                                   


Epoch #310: test_reward: 299.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #311: 1025it [00:02, 397.65it/s, env_step=318464, len=35, n/ep=2, n/st=64, player_1/loss=99.413, player_2/loss=623.668, rew=768.00]                                                                                                                                                                                   


Epoch #311: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #312: 1025it [00:02, 396.64it/s, env_step=319488, len=29, n/ep=2, n/st=64, player_1/loss=126.670, player_2/loss=643.802, rew=434.50]                                                                                                                                                                                  


Epoch #312: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #313: 1025it [00:02, 396.74it/s, env_step=320512, len=29, n/ep=2, n/st=64, player_1/loss=169.060, player_2/loss=370.282, rew=434.50]                                                                                                                                                                                  


Epoch #313: test_reward: 495.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #314: 1025it [00:02, 399.51it/s, env_step=321536, len=33, n/ep=2, n/st=64, player_1/loss=327.823, player_2/loss=346.406, rew=578.00]                                                                                                                                                                                  


Epoch #314: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #315: 1025it [00:02, 393.87it/s, env_step=322560, len=37, n/ep=2, n/st=64, player_1/loss=344.022, player_2/loss=550.747, rew=721.00]                                                                                                                                                                                  


Epoch #315: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #316: 1025it [00:02, 395.91it/s, env_step=323584, len=29, n/ep=2, n/st=64, player_1/loss=319.723, player_2/loss=471.774, rew=452.00]                                                                                                                                                                                  


Epoch #316: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #317: 1025it [00:02, 397.35it/s, env_step=324608, len=23, n/ep=3, n/st=64, player_1/loss=256.921, player_2/loss=358.020, rew=277.33]                                                                                                                                                                                  


Epoch #317: test_reward: 209.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #318: 1025it [00:02, 395.50it/s, env_step=325632, len=28, n/ep=3, n/st=64, player_1/loss=385.324, player_2/loss=429.204, rew=440.00]                                                                                                                                                                                  


Epoch #318: test_reward: 299.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #319: 1025it [00:02, 398.28it/s, env_step=326656, len=32, n/ep=2, n/st=64, player_1/loss=371.562, player_2/loss=400.571, rew=549.50]                                                                                                                                                                                  


Epoch #319: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #320: 1025it [00:02, 396.64it/s, env_step=327680, len=23, n/ep=3, n/st=64, player_1/loss=257.815, player_2/loss=359.909, rew=277.33]                                                                                                                                                                                  


Epoch #320: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #321: 1025it [00:02, 396.29it/s, env_step=328704, len=18, n/ep=3, n/st=64, player_1/loss=207.562, player_2/loss=388.656, rew=185.33]                                                                                                                                                                                  


Epoch #321: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #322: 1025it [00:02, 396.08it/s, env_step=329728, len=22, n/ep=3, n/st=64, player_1/loss=545.237, player_2/loss=414.488, rew=268.67]                                                                                                                                                                                  


Epoch #322: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #323: 1025it [00:02, 395.64it/s, env_step=330752, len=22, n/ep=3, n/st=64, player_1/loss=999.346, player_2/loss=574.694, rew=256.33]                                                                                                                                                                                  


Epoch #323: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #324: 1025it [00:02, 396.64it/s, env_step=331776, len=38, n/ep=2, n/st=64, player_1/loss=756.017, player_2/loss=573.498, rew=759.50]                                                                                                                                                                                  


Epoch #324: test_reward: 275.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #325: 1025it [00:02, 396.41it/s, env_step=332800, len=32, n/ep=3, n/st=64, player_1/loss=392.816, player_2/loss=414.035, rew=567.00]                                                                                                                                                                                  


Epoch #325: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #326: 1025it [00:02, 397.78it/s, env_step=333824, len=28, n/ep=2, n/st=64, player_1/loss=272.714, player_2/loss=420.000, rew=420.50]                                                                                                                                                                                  


Epoch #326: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #327: 1025it [00:02, 396.69it/s, env_step=334848, len=31, n/ep=2, n/st=64, player_1/loss=211.802, player_2/loss=515.562, rew=507.50]                                                                                                                                                                                  


Epoch #327: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #328: 1025it [00:02, 395.53it/s, env_step=335872, len=37, n/ep=2, n/st=64, player_1/loss=251.391, player_2/loss=381.913, rew=704.00]                                                                                                                                                                                  


Epoch #328: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #329: 1025it [00:02, 397.77it/s, env_step=336896, len=30, n/ep=2, n/st=64, player_1/loss=390.241, player_2/loss=347.174, rew=479.50]                                                                                                                                                                                  


Epoch #329: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #330: 1025it [00:02, 398.71it/s, env_step=337920, len=28, n/ep=2, n/st=64, player_1/loss=400.742, player_2/loss=440.105, rew=420.50]                                                                                                                                                                                  


Epoch #330: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #331: 1025it [00:02, 398.61it/s, env_step=338944, len=33, n/ep=2, n/st=64, player_1/loss=340.619, player_2/loss=489.102, rew=572.50]                                                                                                                                                                                  


Epoch #331: test_reward: 779.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #332: 1025it [00:02, 397.74it/s, env_step=339968, len=14, n/ep=5, n/st=64, player_1/loss=379.565, player_2/loss=431.459, rew=111.80]                                                                                                                                                                                  


Epoch #332: test_reward: 90.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #333: 1025it [00:02, 397.09it/s, env_step=340992, len=29, n/ep=2, n/st=64, player_1/loss=533.498, player_2/loss=396.173, rew=434.50]                                                                                                                                                                                  


Epoch #333: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #334: 1025it [00:02, 396.94it/s, env_step=342016, len=40, n/ep=2, n/st=64, player_1/loss=514.957, player_2/loss=488.976, rew=940.50]                                                                                                                                                                                  


Epoch #334: test_reward: 560.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #335: 1025it [00:02, 396.08it/s, env_step=343040, len=29, n/ep=3, n/st=64, player_1/loss=217.851, player_2/loss=428.142, rew=469.33]                                                                                                                                                                                  


Epoch #335: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #336: 1025it [00:02, 396.67it/s, env_step=344064, len=31, n/ep=3, n/st=64, player_1/loss=151.952, player_2/loss=323.717, rew=531.67]                                                                                                                                                                                  


Epoch #336: test_reward: 560.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #337: 1025it [00:02, 394.98it/s, env_step=345088, len=34, n/ep=2, n/st=64, player_1/loss=559.156, player_2/loss=314.665, rew=606.50]                                                                                                                                                                                  


Epoch #337: test_reward: 819.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #338: 1025it [00:02, 396.25it/s, env_step=346112, len=32, n/ep=2, n/st=64, player_1/loss=628.042, player_2/loss=256.297, rew=529.00]                                                                                                                                                                                  


Epoch #338: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #339: 1025it [00:02, 395.49it/s, env_step=347136, len=29, n/ep=2, n/st=64, player_1/loss=260.564, player_2/loss=277.716, rew=452.00]                                                                                                                                                                                  


Epoch #339: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #340: 1025it [00:02, 396.43it/s, env_step=348160, len=31, n/ep=2, n/st=64, player_1/loss=204.006, player_2/loss=557.308, rew=499.50]                                                                                                                                                                                  


Epoch #340: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #341: 1025it [00:02, 397.68it/s, env_step=349184, len=29, n/ep=3, n/st=64, player_1/loss=280.700, player_2/loss=292.472, rew=438.33]                                                                                                                                                                                  


Epoch #341: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #342: 1025it [00:02, 392.58it/s, env_step=350208, len=24, n/ep=2, n/st=64, player_1/loss=261.139, player_2/loss=318.247, rew=307.00]                                                                                                                                                                                  


Epoch #342: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #343: 1025it [00:02, 404.78it/s, env_step=351232, len=29, n/ep=2, n/st=64, player_1/loss=160.986, player_2/loss=367.878, rew=436.00]                                                                                                                                                                                  


Epoch #343: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #344: 1025it [00:02, 409.53it/s, env_step=352256, len=32, n/ep=2, n/st=64, player_1/loss=141.518, player_2/loss=347.212, rew=571.50]                                                                                                                                                                                  


Epoch #344: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #345: 1025it [00:02, 409.16it/s, env_step=353280, len=35, n/ep=2, n/st=64, player_1/loss=146.916, player_2/loss=321.321, rew=662.00]                                                                                                                                                                                  


Epoch #345: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #346: 1025it [00:02, 398.28it/s, env_step=354304, len=26, n/ep=2, n/st=64, player_1/loss=349.167, player_2/loss=190.724, rew=363.50]                                                                                                                                                                                  


Epoch #346: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #347: 1025it [00:02, 396.98it/s, env_step=355328, len=24, n/ep=3, n/st=64, player_1/loss=363.456, player_2/loss=547.988, rew=310.33]                                                                                                                                                                                  


Epoch #347: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #348: 1025it [00:02, 398.52it/s, env_step=356352, len=37, n/ep=2, n/st=64, player_1/loss=284.949, player_2/loss=638.507, rew=706.50]                                                                                                                                                                                  


Epoch #348: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #349: 1025it [00:02, 394.48it/s, env_step=357376, len=26, n/ep=2, n/st=64, player_1/loss=287.383, player_2/loss=255.066, rew=418.50]                                                                                                                                                                                  


Epoch #349: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #350: 1025it [00:02, 397.56it/s, env_step=358400, len=33, n/ep=2, n/st=64, player_1/loss=298.059, player_2/loss=74.310, rew=578.00]                                                                                                                                                                                   


Epoch #350: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #351: 1025it [00:02, 394.19it/s, env_step=359424, len=28, n/ep=2, n/st=64, player_1/loss=486.793, player_2/loss=203.090, rew=420.50]                                                                                                                                                                                  


Epoch #351: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #352: 1025it [00:02, 396.07it/s, env_step=360448, len=22, n/ep=3, n/st=64, player_1/loss=377.697, player_2/loss=449.996, rew=268.33]                                                                                                                                                                                  


Epoch #352: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #353: 1025it [00:02, 397.70it/s, env_step=361472, len=21, n/ep=3, n/st=64, player_1/loss=740.306, player_2/loss=904.464, rew=246.00]                                                                                                                                                                                  


Epoch #353: test_reward: 209.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #354: 1025it [00:02, 397.65it/s, env_step=362496, len=28, n/ep=2, n/st=64, player_1/loss=742.911, player_2/loss=931.438, rew=474.50]                                                                                                                                                                                  


Epoch #354: test_reward: 119.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #355: 1025it [00:02, 397.89it/s, env_step=363520, len=27, n/ep=3, n/st=64, player_1/loss=338.233, player_2/loss=648.916, rew=414.00]                                                                                                                                                                                  


Epoch #355: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #356: 1025it [00:02, 394.65it/s, env_step=364544, len=8, n/ep=7, n/st=64, player_1/loss=227.971, player_2/loss=582.230, rew=43.29]                                                                                                                                                                                    


Epoch #356: test_reward: 35.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #357: 1025it [00:02, 397.94it/s, env_step=365568, len=26, n/ep=3, n/st=64, player_1/loss=167.897, player_2/loss=456.135, rew=433.67]                                                                                                                                                                                  


Epoch #357: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #358: 1025it [00:02, 393.82it/s, env_step=366592, len=28, n/ep=2, n/st=64, player_1/loss=182.345, player_2/loss=431.636, rew=420.50]                                                                                                                                                                                  


Epoch #358: test_reward: 779.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #359: 1025it [00:02, 395.26it/s, env_step=367616, len=29, n/ep=2, n/st=64, player_1/loss=241.131, player_2/loss=445.375, rew=452.00]                                                                                                                                                                                  


Epoch #359: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #360: 1025it [00:02, 398.22it/s, env_step=368640, len=33, n/ep=2, n/st=64, player_1/loss=333.017, player_2/loss=437.770, rew=578.00]                                                                                                                                                                                  


Epoch #360: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #361: 1025it [00:02, 395.97it/s, env_step=369664, len=30, n/ep=2, n/st=64, player_1/loss=245.402, player_2/loss=548.811, rew=466.00]                                                                                                                                                                                  


Epoch #361: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #362: 1025it [00:02, 396.06it/s, env_step=370688, len=33, n/ep=2, n/st=64, player_2/loss=840.644, rew=572.50]                                                                                                                                                                                                         


Epoch #362: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #363: 1025it [00:02, 392.19it/s, env_step=371712, len=17, n/ep=5, n/st=64, player_1/loss=247.357, player_2/loss=662.487, rew=200.40]                                                                                                                                                                                  


Epoch #363: test_reward: 35.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #364: 1025it [00:02, 393.99it/s, env_step=372736, len=32, n/ep=2, n/st=64, player_1/loss=404.181, player_2/loss=236.537, rew=558.50]                                                                                                                                                                                  


Epoch #364: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #365: 1025it [00:02, 398.53it/s, env_step=373760, len=38, n/ep=2, n/st=64, player_1/loss=345.658, player_2/loss=218.590, rew=740.50]                                                                                                                                                                                  


Epoch #365: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #366: 1025it [00:02, 395.79it/s, env_step=374784, len=28, n/ep=3, n/st=64, player_1/loss=407.102, player_2/loss=566.883, rew=433.33]                                                                                                                                                                                  


Epoch #366: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #367: 1025it [00:02, 396.17it/s, env_step=375808, len=24, n/ep=2, n/st=64, player_1/loss=455.148, player_2/loss=638.148, rew=307.00]                                                                                                                                                                                  


Epoch #367: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #368: 1025it [00:02, 396.95it/s, env_step=376832, len=30, n/ep=2, n/st=64, player_1/loss=344.627, player_2/loss=305.622, rew=482.50]                                                                                                                                                                                  


Epoch #368: test_reward: 779.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #369: 1025it [00:02, 395.72it/s, env_step=377856, len=23, n/ep=2, n/st=64, player_1/loss=287.466, player_2/loss=363.475, rew=287.00]                                                                                                                                                                                  


Epoch #369: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #370: 1025it [00:02, 394.99it/s, env_step=378880, len=9, n/ep=7, n/st=64, player_1/loss=139.140, player_2/loss=457.963, rew=46.71]                                                                                                                                                                                    


Epoch #370: test_reward: 27.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #371: 1025it [00:02, 396.23it/s, env_step=379904, len=22, n/ep=3, n/st=64, player_1/loss=273.533, player_2/loss=510.087, rew=260.00]                                                                                                                                                                                  


Epoch #371: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #372: 1025it [00:02, 394.59it/s, env_step=380928, len=32, n/ep=3, n/st=64, player_1/loss=550.013, player_2/loss=456.464, rew=535.33]                                                                                                                                                                                  


Epoch #372: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #373: 1025it [00:02, 396.26it/s, env_step=381952, len=32, n/ep=2, n/st=64, player_1/loss=486.120, player_2/loss=428.152, rew=535.00]                                                                                                                                                                                  


Epoch #373: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #374: 1025it [00:02, 395.49it/s, env_step=382976, len=31, n/ep=2, n/st=64, player_1/loss=405.210, player_2/loss=469.145, rew=532.00]                                                                                                                                                                                  


Epoch #374: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #375: 1025it [00:02, 395.21it/s, env_step=384000, len=29, n/ep=2, n/st=64, player_1/loss=292.408, player_2/loss=602.875, rew=470.00]                                                                                                                                                                                  


Epoch #375: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #376: 1025it [00:02, 395.10it/s, env_step=385024, len=21, n/ep=4, n/st=64, player_1/loss=310.159, player_2/loss=551.391, rew=277.75]                                                                                                                                                                                  


Epoch #376: test_reward: 119.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #377: 1025it [00:02, 396.01it/s, env_step=386048, len=32, n/ep=2, n/st=64, player_1/loss=573.628, player_2/loss=460.078, rew=527.00]                                                                                                                                                                                  


Epoch #377: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #378: 1025it [00:02, 396.59it/s, env_step=387072, len=21, n/ep=3, n/st=64, player_1/loss=493.294, player_2/loss=569.752, rew=237.33]                                                                                                                                                                                  


Epoch #378: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #379: 1025it [00:02, 397.06it/s, env_step=388096, len=30, n/ep=2, n/st=64, player_1/loss=368.659, player_2/loss=762.105, rew=479.50]                                                                                                                                                                                  


Epoch #379: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #380: 1025it [00:02, 394.85it/s, env_step=389120, len=25, n/ep=3, n/st=64, player_1/loss=304.522, player_2/loss=449.684, rew=330.33]                                                                                                                                                                                  


Epoch #380: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #381: 1025it [00:02, 394.93it/s, env_step=390144, len=29, n/ep=2, n/st=64, player_1/loss=236.603, player_2/loss=189.847, rew=438.50]                                                                                                                                                                                  


Epoch #381: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #382: 1025it [00:02, 395.67it/s, env_step=391168, len=31, n/ep=2, n/st=64, player_1/loss=225.159, player_2/loss=308.255, rew=503.00]                                                                                                                                                                                  


Epoch #382: test_reward: 594.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #383: 1025it [00:02, 397.05it/s, env_step=392192, len=32, n/ep=2, n/st=64, player_1/loss=295.226, player_2/loss=367.143, rew=545.00]                                                                                                                                                                                  


Epoch #383: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #384: 1025it [00:02, 395.46it/s, env_step=393216, len=28, n/ep=2, n/st=64, player_1/loss=450.311, player_2/loss=380.497, rew=413.00]                                                                                                                                                                                  


Epoch #384: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #385: 1025it [00:02, 394.07it/s, env_step=394240, len=29, n/ep=2, n/st=64, player_1/loss=483.931, player_2/loss=998.327, rew=434.00]                                                                                                                                                                                  


Epoch #385: test_reward: 629.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #386: 1025it [00:02, 394.16it/s, env_step=395264, len=33, n/ep=2, n/st=64, player_1/loss=486.242, player_2/loss=933.214, rew=568.00]                                                                                                                                                                                  


Epoch #386: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #387: 1025it [00:02, 395.20it/s, env_step=396288, len=22, n/ep=3, n/st=64, player_1/loss=548.656, player_2/loss=370.964, rew=259.67]                                                                                                                                                                                  


Epoch #387: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #388: 1025it [00:02, 394.07it/s, env_step=397312, len=33, n/ep=2, n/st=64, player_1/loss=569.789, player_2/loss=296.546, rew=587.00]                                                                                                                                                                                  


Epoch #388: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #389: 1025it [00:02, 408.19it/s, env_step=398336, len=27, n/ep=2, n/st=64, player_1/loss=590.565, player_2/loss=164.799, rew=394.00]                                                                                                                                                                                  


Epoch #389: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #390: 1025it [00:02, 394.50it/s, env_step=399360, len=24, n/ep=3, n/st=64, player_1/loss=569.415, player_2/loss=355.182, rew=314.33]                                                                                                                                                                                  


Epoch #390: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #391: 1025it [00:02, 395.67it/s, env_step=400384, len=23, n/ep=2, n/st=64, player_1/loss=365.932, player_2/loss=445.472, rew=290.00]                                                                                                                                                                                  


Epoch #391: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #392: 1025it [00:02, 395.69it/s, env_step=401408, len=37, n/ep=2, n/st=64, player_1/loss=465.026, player_2/loss=408.945, rew=702.50]                                                                                                                                                                                  


Epoch #392: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #393: 1025it [00:02, 388.42it/s, env_step=402432, len=34, n/ep=2, n/st=64, player_1/loss=451.333, player_2/loss=491.879, rew=606.50]                                                                                                                                                                                  


Epoch #393: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #394: 1025it [00:02, 391.46it/s, env_step=403456, len=27, n/ep=2, n/st=64, player_2/loss=395.977, rew=391.00]                                                                                                                                                                                                         


Epoch #394: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #395: 1025it [00:02, 395.03it/s, env_step=404480, len=31, n/ep=2, n/st=64, player_1/loss=514.301, player_2/loss=323.728, rew=513.00]                                                                                                                                                                                  


Epoch #395: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #396: 1025it [00:02, 394.13it/s, env_step=405504, len=29, n/ep=2, n/st=64, player_1/loss=344.707, player_2/loss=359.621, rew=466.00]                                                                                                                                                                                  


Epoch #396: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #397: 1025it [00:02, 396.39it/s, env_step=406528, len=34, n/ep=2, n/st=64, player_1/loss=291.378, player_2/loss=320.808, rew=602.00]                                                                                                                                                                                  


Epoch #397: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #398: 1025it [00:02, 395.34it/s, env_step=407552, len=28, n/ep=2, n/st=64, player_1/loss=357.927, player_2/loss=305.734, rew=420.50]                                                                                                                                                                                  


Epoch #398: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #399: 1025it [00:02, 396.96it/s, env_step=408576, len=33, n/ep=2, n/st=64, player_1/loss=334.398, player_2/loss=509.141, rew=583.00]                                                                                                                                                                                  


Epoch #399: test_reward: 779.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #400: 1025it [00:02, 393.70it/s, env_step=409600, len=28, n/ep=3, n/st=64, player_1/loss=362.167, player_2/loss=773.606, rew=424.67]                                                                                                                                                                                  


Epoch #400: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #401: 1025it [00:02, 395.10it/s, env_step=410624, len=40, n/ep=2, n/st=64, player_1/loss=267.345, player_2/loss=635.175, rew=840.50]                                                                                                                                                                                  


Epoch #401: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #402: 1025it [00:02, 394.64it/s, env_step=411648, len=29, n/ep=2, n/st=64, player_1/loss=365.590, player_2/loss=460.821, rew=485.00]                                                                                                                                                                                  


Epoch #402: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #403: 1025it [00:02, 394.65it/s, env_step=412672, len=38, n/ep=2, n/st=64, player_1/loss=410.521, player_2/loss=340.150, rew=740.50]                                                                                                                                                                                  


Epoch #403: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #404: 1025it [00:02, 396.90it/s, env_step=413696, len=34, n/ep=2, n/st=64, player_1/loss=427.338, player_2/loss=223.717, rew=617.50]                                                                                                                                                                                  


Epoch #404: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #405: 1025it [00:02, 394.95it/s, env_step=414720, len=39, n/ep=2, n/st=64, player_1/loss=288.084, player_2/loss=281.912, rew=799.00]                                                                                                                                                                                  


Epoch #405: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #406: 1025it [00:02, 395.00it/s, env_step=415744, len=29, n/ep=3, n/st=64, player_1/loss=233.044, rew=444.33]                                                                                                                                                                                                         


Epoch #406: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #407: 1025it [00:02, 396.24it/s, env_step=416768, len=35, n/ep=2, n/st=64, player_1/loss=241.640, player_2/loss=279.750, rew=650.00]                                                                                                                                                                                  


Epoch #407: test_reward: 779.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #408: 1025it [00:02, 391.40it/s, env_step=417792, len=27, n/ep=2, n/st=64, player_1/loss=229.781, player_2/loss=386.343, rew=381.50]                                                                                                                                                                                  


Epoch #408: test_reward: 434.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #409: 1025it [00:02, 394.71it/s, env_step=418816, len=25, n/ep=3, n/st=64, player_1/loss=440.274, player_2/loss=330.906, rew=334.00]                                                                                                                                                                                  


Epoch #409: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #410: 1025it [00:02, 394.79it/s, env_step=419840, len=25, n/ep=2, n/st=64, player_1/loss=402.906, player_2/loss=374.475, rew=324.00]                                                                                                                                                                                  


Epoch #410: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #411: 1025it [00:02, 394.13it/s, env_step=420864, len=28, n/ep=2, n/st=64, player_1/loss=322.914, player_2/loss=534.596, rew=405.50]                                                                                                                                                                                  


Epoch #411: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #412: 1025it [00:02, 395.40it/s, env_step=421888, len=31, n/ep=2, n/st=64, player_1/loss=297.734, player_2/loss=384.691, rew=513.00]                                                                                                                                                                                  


Epoch #412: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #413: 1025it [00:02, 398.15it/s, env_step=422912, len=26, n/ep=2, n/st=64, player_1/loss=437.076, player_2/loss=413.389, rew=369.50]                                                                                                                                                                                  


Epoch #413: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #414: 1025it [00:02, 395.33it/s, env_step=423936, len=28, n/ep=2, n/st=64, player_1/loss=363.865, player_2/loss=347.784, rew=405.50]                                                                                                                                                                                  


Epoch #414: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #415: 1025it [00:02, 394.00it/s, env_step=424960, len=25, n/ep=3, n/st=64, player_1/loss=205.162, player_2/loss=163.909, rew=419.00]                                                                                                                                                                                  


Epoch #415: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #416: 1025it [00:02, 396.91it/s, env_step=425984, len=31, n/ep=2, n/st=64, player_1/loss=310.711, rew=532.00]                                                                                                                                                                                                         


Epoch #416: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #417: 1025it [00:02, 394.58it/s, env_step=427008, len=29, n/ep=2, n/st=64, player_1/loss=416.251, player_2/loss=434.109, rew=515.00]                                                                                                                                                                                  


Epoch #417: test_reward: 275.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #418: 1025it [00:02, 396.28it/s, env_step=428032, len=36, n/ep=2, n/st=64, player_1/loss=454.047, player_2/loss=247.507, rew=683.50]                                                                                                                                                                                  


Epoch #418: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #419: 1025it [00:02, 395.37it/s, env_step=429056, len=30, n/ep=2, n/st=64, player_1/loss=432.288, player_2/loss=240.269, rew=464.00]                                                                                                                                                                                  


Epoch #419: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #420: 1025it [00:02, 394.79it/s, env_step=430080, len=29, n/ep=2, n/st=64, player_1/loss=526.782, player_2/loss=235.690, rew=434.50]                                                                                                                                                                                  


Epoch #420: test_reward: 434.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #421: 1025it [00:02, 394.59it/s, env_step=431104, len=30, n/ep=2, n/st=64, player_1/loss=387.222, player_2/loss=289.121, rew=496.00]                                                                                                                                                                                  


Epoch #421: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #422: 1025it [00:02, 396.25it/s, env_step=432128, len=28, n/ep=2, n/st=64, player_1/loss=275.023, player_2/loss=324.243, rew=405.00]                                                                                                                                                                                  


Epoch #422: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #423: 1025it [00:02, 394.06it/s, env_step=433152, len=39, n/ep=2, n/st=64, player_1/loss=397.383, player_2/loss=329.748, rew=781.00]                                                                                                                                                                                  


Epoch #423: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #424: 1025it [00:02, 395.97it/s, env_step=434176, len=22, n/ep=2, n/st=64, player_1/loss=488.257, player_2/loss=225.653, rew=260.00]                                                                                                                                                                                  


Epoch #424: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #425: 1025it [00:02, 394.89it/s, env_step=435200, len=33, n/ep=2, n/st=64, player_1/loss=325.893, player_2/loss=392.127, rew=587.00]                                                                                                                                                                                  


Epoch #425: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #426: 1025it [00:02, 394.35it/s, env_step=436224, len=37, n/ep=2, n/st=64, player_1/loss=320.614, player_2/loss=388.105, rew=721.00]                                                                                                                                                                                  


Epoch #426: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #427: 1025it [00:02, 391.53it/s, env_step=437248, len=8, n/ep=6, n/st=64, player_1/loss=747.297, player_2/loss=934.922, rew=38.83]                                                                                                                                                                                    


Epoch #427: test_reward: 27.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #428: 1025it [00:02, 392.26it/s, env_step=438272, len=7, n/ep=8, n/st=64, player_1/loss=809.680, player_2/loss=1100.680, rew=34.25]                                                                                                                                                                                   


Epoch #428: test_reward: 27.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #429: 1025it [00:02, 394.51it/s, env_step=439296, len=24, n/ep=3, n/st=64, player_1/loss=602.384, player_2/loss=861.011, rew=357.00]                                                                                                                                                                                  


Epoch #429: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #430: 1025it [00:02, 395.99it/s, env_step=440320, len=26, n/ep=2, n/st=64, player_1/loss=499.737, player_2/loss=493.373, rew=399.50]                                                                                                                                                                                  


Epoch #430: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #431: 1025it [00:02, 395.72it/s, env_step=441344, len=12, n/ep=6, n/st=64, player_1/loss=298.443, player_2/loss=289.518, rew=127.33]                                                                                                                                                                                  


Epoch #431: test_reward: 27.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #432: 1025it [00:02, 395.69it/s, env_step=442368, len=33, n/ep=2, n/st=64, player_1/loss=218.780, player_2/loss=402.179, rew=572.50]                                                                                                                                                                                  


Epoch #432: test_reward: 527.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #433: 1025it [00:02, 394.06it/s, env_step=443392, len=38, n/ep=2, n/st=64, player_1/loss=145.534, player_2/loss=406.477, rew=740.00]                                                                                                                                                                                  


Epoch #433: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #434: 1025it [00:02, 404.76it/s, env_step=444416, len=22, n/ep=2, n/st=64, player_1/loss=269.655, player_2/loss=301.504, rew=278.50]                                                                                                                                                                                  


Epoch #434: test_reward: 275.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #435: 1025it [00:02, 396.37it/s, env_step=445440, len=32, n/ep=2, n/st=64, player_1/loss=431.026, player_2/loss=330.485, rew=558.50]                                                                                                                                                                                  


Epoch #435: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #436: 1025it [00:02, 395.95it/s, env_step=446464, len=28, n/ep=2, n/st=64, player_1/loss=465.761, player_2/loss=379.579, rew=455.00]                                                                                                                                                                                  


Epoch #436: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #437: 1025it [00:02, 398.09it/s, env_step=447488, len=34, n/ep=2, n/st=64, player_2/loss=374.547, rew=614.50]                                                                                                                                                                                                         


Epoch #437: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #438: 1025it [00:02, 394.20it/s, env_step=448512, len=32, n/ep=2, n/st=64, player_1/loss=225.456, player_2/loss=348.853, rew=564.50]                                                                                                                                                                                  


Epoch #438: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #439: 1025it [00:02, 393.29it/s, env_step=449536, len=36, n/ep=2, n/st=64, player_1/loss=391.882, player_2/loss=233.512, rew=665.50]                                                                                                                                                                                  


Epoch #439: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #440: 1025it [00:02, 393.59it/s, env_step=450560, len=31, n/ep=2, n/st=64, player_1/loss=521.558, player_2/loss=201.773, rew=517.00]                                                                                                                                                                                  


Epoch #440: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #441: 1025it [00:02, 394.17it/s, env_step=451584, len=25, n/ep=3, n/st=64, player_1/loss=228.133, player_2/loss=283.762, rew=341.00]                                                                                                                                                                                  


Epoch #441: test_reward: 495.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #442: 1025it [00:02, 395.77it/s, env_step=452608, len=38, n/ep=1, n/st=64, player_1/loss=76.685, player_2/loss=384.420, rew=740.00]                                                                                                                                                                                   


Epoch #442: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #443: 1025it [00:02, 396.52it/s, env_step=453632, len=23, n/ep=3, n/st=64, player_1/loss=252.542, player_2/loss=460.354, rew=284.33]                                                                                                                                                                                  


Epoch #443: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #444: 1025it [00:02, 395.67it/s, env_step=454656, len=27, n/ep=2, n/st=64, player_1/loss=324.086, player_2/loss=299.085, rew=381.50]                                                                                                                                                                                  


Epoch #444: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #445: 1025it [00:02, 392.65it/s, env_step=455680, len=25, n/ep=3, n/st=64, player_1/loss=426.172, player_2/loss=571.310, rew=363.33]                                                                                                                                                                                  


Epoch #445: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #446: 1025it [00:02, 397.24it/s, env_step=456704, len=26, n/ep=2, n/st=64, player_2/loss=579.523, rew=354.50]                                                                                                                                                                                                         


Epoch #446: test_reward: 495.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #447: 1025it [00:02, 395.15it/s, env_step=457728, len=29, n/ep=2, n/st=64, player_1/loss=251.632, player_2/loss=440.270, rew=466.00]                                                                                                                                                                                  


Epoch #447: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #448: 1025it [00:02, 395.50it/s, env_step=458752, len=18, n/ep=3, n/st=64, player_1/loss=307.734, player_2/loss=571.756, rew=189.33]                                                                                                                                                                                  


Epoch #448: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #449: 1025it [00:02, 393.79it/s, env_step=459776, len=39, n/ep=1, n/st=64, player_1/loss=353.813, player_2/loss=590.324, rew=779.00]                                                                                                                                                                                  


Epoch #449: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #450: 1025it [00:02, 397.50it/s, env_step=460800, len=25, n/ep=2, n/st=64, player_1/loss=563.628, player_2/loss=338.376, rew=338.00]                                                                                                                                                                                  


Epoch #450: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #451: 1025it [00:02, 390.81it/s, env_step=461824, len=33, n/ep=2, n/st=64, player_1/loss=602.487, player_2/loss=220.791, rew=562.00]                                                                                                                                                                                  


Epoch #451: test_reward: 434.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #452: 1025it [00:02, 394.33it/s, env_step=462848, len=30, n/ep=2, n/st=64, player_1/loss=526.102, player_2/loss=272.684, rew=466.00]                                                                                                                                                                                  


Epoch #452: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #453: 1025it [00:02, 393.57it/s, env_step=463872, len=25, n/ep=2, n/st=64, player_1/loss=504.296, player_2/loss=537.544, rew=338.00]                                                                                                                                                                                  


Epoch #453: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #454: 1025it [00:02, 394.47it/s, env_step=464896, len=25, n/ep=3, n/st=64, player_1/loss=257.188, player_2/loss=620.433, rew=335.33]                                                                                                                                                                                  


Epoch #454: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #455: 1025it [00:02, 394.83it/s, env_step=465920, len=31, n/ep=2, n/st=64, player_1/loss=196.964, player_2/loss=297.822, rew=519.50]                                                                                                                                                                                  


Epoch #455: test_reward: 464.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #456: 1025it [00:02, 395.65it/s, env_step=466944, len=25, n/ep=3, n/st=64, player_1/loss=216.114, player_2/loss=259.518, rew=372.67]                                                                                                                                                                                  


Epoch #456: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #457: 1025it [00:02, 395.02it/s, env_step=467968, len=26, n/ep=2, n/st=64, player_1/loss=199.787, player_2/loss=284.688, rew=369.50]                                                                                                                                                                                  


Epoch #457: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #458: 1025it [00:02, 395.48it/s, env_step=468992, len=38, n/ep=2, n/st=64, player_1/loss=359.670, player_2/loss=326.245, rew=760.50]                                                                                                                                                                                  


Epoch #458: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #459: 1025it [00:02, 396.05it/s, env_step=470016, len=26, n/ep=2, n/st=64, player_1/loss=518.534, player_2/loss=249.574, rew=374.50]                                                                                                                                                                                  


Epoch #459: test_reward: 275.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #460: 1025it [00:02, 395.51it/s, env_step=471040, len=27, n/ep=3, n/st=64, player_1/loss=539.087, player_2/loss=232.182, rew=378.00]                                                                                                                                                                                  


Epoch #460: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #461: 1025it [00:02, 394.09it/s, env_step=472064, len=24, n/ep=2, n/st=64, player_1/loss=490.700, player_2/loss=354.251, rew=301.00]                                                                                                                                                                                  


Epoch #461: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #462: 1025it [00:02, 394.86it/s, env_step=473088, len=35, n/ep=2, n/st=64, player_1/loss=378.263, player_2/loss=598.041, rew=648.00]                                                                                                                                                                                  


Epoch #462: test_reward: 299.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #463: 1025it [00:02, 394.29it/s, env_step=474112, len=28, n/ep=3, n/st=64, player_1/loss=242.140, player_2/loss=511.199, rew=421.33]                                                                                                                                                                                  


Epoch #463: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #464: 1025it [00:02, 398.27it/s, env_step=475136, len=27, n/ep=3, n/st=64, player_1/loss=263.904, player_2/loss=269.213, rew=405.00]                                                                                                                                                                                  


Epoch #464: test_reward: 405.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #465: 1025it [00:02, 396.94it/s, env_step=476160, len=18, n/ep=4, n/st=64, player_1/loss=275.940, player_2/loss=244.627, rew=188.75]                                                                                                                                                                                  


Epoch #465: test_reward: 189.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #466: 1025it [00:02, 396.81it/s, env_step=477184, len=25, n/ep=2, n/st=64, player_1/loss=360.408, player_2/loss=297.838, rew=343.00]                                                                                                                                                                                  


Epoch #466: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #467: 1025it [00:02, 395.88it/s, env_step=478208, len=39, n/ep=2, n/st=64, player_1/loss=517.026, player_2/loss=540.290, rew=779.50]                                                                                                                                                                                  


Epoch #467: test_reward: 702.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #468: 1025it [00:02, 394.99it/s, env_step=479232, len=28, n/ep=2, n/st=64, player_1/loss=434.783, player_2/loss=416.222, rew=429.50]                                                                                                                                                                                  


Epoch #468: test_reward: 740.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #469: 1025it [00:02, 395.04it/s, env_step=480256, len=24, n/ep=2, n/st=64, player_1/loss=480.322, player_2/loss=169.958, rew=314.50]                                                                                                                                                                                  


Epoch #469: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #470: 1025it [00:02, 395.76it/s, env_step=481280, len=31, n/ep=2, n/st=64, player_1/loss=463.149, player_2/loss=220.877, rew=532.00]                                                                                                                                                                                  


Epoch #470: test_reward: 252.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #471: 1025it [00:02, 393.80it/s, env_step=482304, len=31, n/ep=2, n/st=64, player_1/loss=304.343, player_2/loss=387.278, rew=503.00]                                                                                                                                                                                  


Epoch #471: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #472: 1025it [00:02, 395.08it/s, env_step=483328, len=27, n/ep=3, n/st=64, player_1/loss=398.852, player_2/loss=649.007, rew=402.67]                                                                                                                                                                                  


Epoch #472: test_reward: 230.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #473: 1025it [00:02, 395.07it/s, env_step=484352, len=31, n/ep=2, n/st=64, player_1/loss=417.664, player_2/loss=834.139, rew=497.00]                                                                                                                                                                                  


Epoch #473: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #474: 1025it [00:02, 394.90it/s, env_step=485376, len=28, n/ep=3, n/st=64, player_1/loss=639.902, player_2/loss=523.403, rew=407.33]                                                                                                                                                                                  


Epoch #474: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #475: 1025it [00:02, 393.10it/s, env_step=486400, len=29, n/ep=2, n/st=64, player_1/loss=571.359, player_2/loss=510.756, rew=450.00]                                                                                                                                                                                  


Epoch #475: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #476: 1025it [00:02, 394.31it/s, env_step=487424, len=29, n/ep=2, n/st=64, player_1/loss=269.740, player_2/loss=485.821, rew=449.00]                                                                                                                                                                                  


Epoch #476: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #477: 1025it [00:02, 396.03it/s, env_step=488448, len=27, n/ep=2, n/st=64, player_2/loss=294.663, rew=379.00]                                                                                                                                                                                                         


Epoch #477: test_reward: 350.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #478: 1025it [00:02, 390.33it/s, env_step=489472, len=31, n/ep=1, n/st=64, player_1/loss=225.373, player_2/loss=480.042, rew=495.00]                                                                                                                                                                                  


Epoch #478: test_reward: 434.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #479: 1025it [00:02, 388.31it/s, env_step=490496, len=29, n/ep=2, n/st=64, player_1/loss=320.589, player_2/loss=438.215, rew=436.00]                                                                                                                                                                                  


Epoch #479: test_reward: 434.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #480: 1025it [00:02, 394.28it/s, env_step=491520, len=31, n/ep=3, n/st=64, player_1/loss=283.373, player_2/loss=406.949, rew=519.00]                                                                                                                                                                                  


Epoch #480: test_reward: 377.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #481: 1025it [00:02, 395.99it/s, env_step=492544, len=23, n/ep=3, n/st=64, player_1/loss=248.296, player_2/loss=351.154, rew=302.00]                                                                                                                                                                                  


Epoch #481: test_reward: 324.000000 ± 0.000000, best_reward: 819.000000 ± 0.000000 in #81


Epoch #482: 1025it [00:02, 394.21it/s, env_step=493568, len=22, n/ep=3, n/st=64, player_1/loss=401.961, player_2/loss=287.921, rew=278.67]                                                                                                                                                                                  


Epoch #482: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #483: 1025it [00:02, 395.86it/s, env_step=494592, len=22, n/ep=3, n/st=64, player_1/loss=429.985, player_2/loss=314.206, rew=272.33]                                                                                                                                                                                  


Epoch #483: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #484: 1025it [00:02, 395.63it/s, env_step=495616, len=28, n/ep=3, n/st=64, player_1/loss=323.272, player_2/loss=333.506, rew=406.00]                                                                                                                                                                                  


Epoch #484: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #485: 1025it [00:02, 394.07it/s, env_step=496640, len=32, n/ep=2, n/st=64, player_1/loss=252.593, player_2/loss=347.883, rew=677.00]                                                                                                                                                                                  


Epoch #485: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #486: 1025it [00:02, 397.30it/s, env_step=497664, len=34, n/ep=2, n/st=64, player_1/loss=246.944, player_2/loss=352.257, rew=606.50]                                                                                                                                                                                  


Epoch #486: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #487: 1025it [00:02, 384.11it/s, env_step=498688, len=38, n/ep=2, n/st=64, player_1/loss=291.042, player_2/loss=328.154, rew=740.50]                                                                                                                                                                                  


Epoch #487: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #488: 1025it [00:02, 394.64it/s, env_step=499712, len=29, n/ep=3, n/st=64, player_1/loss=241.115, player_2/loss=223.853, rew=464.33]                                                                                                                                                                                  


Epoch #488: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #489: 1025it [00:02, 397.52it/s, env_step=500736, len=25, n/ep=2, n/st=64, player_1/loss=381.066, player_2/loss=251.767, rew=338.00]                                                                                                                                                                                  


Epoch #489: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #490: 1025it [00:02, 396.60it/s, env_step=501760, len=22, n/ep=3, n/st=64, player_1/loss=456.238, player_2/loss=350.550, rew=254.33]                                                                                                                                                                                  


Epoch #490: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #491: 1025it [00:02, 394.62it/s, env_step=502784, len=38, n/ep=2, n/st=64, player_1/loss=301.839, player_2/loss=313.313, rew=759.50]                                                                                                                                                                                  


Epoch #491: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #492: 1025it [00:02, 394.22it/s, env_step=503808, len=28, n/ep=3, n/st=64, player_1/loss=404.830, player_2/loss=381.479, rew=406.33]                                                                                                                                                                                  


Epoch #492: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #493: 1025it [00:02, 396.76it/s, env_step=504832, len=22, n/ep=3, n/st=64, player_1/loss=428.239, player_2/loss=431.189, rew=277.33]                                                                                                                                                                                  


Epoch #493: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #494: 1025it [00:02, 394.83it/s, env_step=505856, len=30, n/ep=1, n/st=64, player_1/loss=291.210, player_2/loss=253.102, rew=464.00]                                                                                                                                                                                  


Epoch #494: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #495: 1025it [00:02, 393.04it/s, env_step=506880, len=36, n/ep=2, n/st=64, player_1/loss=226.191, player_2/loss=185.692, rew=684.50]                                                                                                                                                                                  


Epoch #495: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #496: 1025it [00:02, 393.51it/s, env_step=507904, len=29, n/ep=2, n/st=64, player_1/loss=192.816, player_2/loss=187.824, rew=438.50]                                                                                                                                                                                  


Epoch #496: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #497: 1025it [00:02, 396.01it/s, env_step=508928, len=33, n/ep=2, n/st=64, player_1/loss=192.599, player_2/loss=258.008, rew=578.00]                                                                                                                                                                                  


Epoch #497: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #498: 1025it [00:02, 398.05it/s, env_step=509952, len=24, n/ep=3, n/st=64, player_1/loss=209.223, player_2/loss=576.650, rew=314.67]                                                                                                                                                                                  


Epoch #498: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #499: 1025it [00:02, 395.55it/s, env_step=510976, len=33, n/ep=2, n/st=64, player_1/loss=350.981, player_2/loss=590.725, rew=560.00]                                                                                                                                                                                  


Epoch #499: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #500: 1025it [00:02, 394.79it/s, env_step=512000, len=33, n/ep=2, n/st=64, player_1/loss=508.717, player_2/loss=471.725, rew=572.50]                                                                                                                                                                                  


Epoch #500: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #501: 1025it [00:02, 393.78it/s, env_step=513024, len=28, n/ep=2, n/st=64, player_1/loss=350.338, player_2/loss=479.687, rew=464.50]                                                                                                                                                                                  


Epoch #501: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #502: 1025it [00:02, 395.74it/s, env_step=514048, len=29, n/ep=2, n/st=64, player_1/loss=249.209, player_2/loss=420.688, rew=485.00]                                                                                                                                                                                  


Epoch #502: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #503: 1025it [00:02, 395.97it/s, env_step=515072, len=28, n/ep=2, n/st=64, player_1/loss=392.848, player_2/loss=417.269, rew=417.50]                                                                                                                                                                                  


Epoch #503: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #504: 1025it [00:02, 396.03it/s, env_step=516096, len=23, n/ep=2, n/st=64, player_1/loss=406.901, player_2/loss=328.518, rew=287.00]                                                                                                                                                                                  


Epoch #504: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #505: 1025it [00:02, 395.36it/s, env_step=517120, len=28, n/ep=3, n/st=64, player_1/loss=368.914, player_2/loss=393.868, rew=408.00]                                                                                                                                                                                  


Epoch #505: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #506: 1025it [00:02, 390.47it/s, env_step=518144, len=21, n/ep=3, n/st=64, player_1/loss=431.145, player_2/loss=280.418, rew=230.33]                                                                                                                                                                                  


Epoch #506: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #507: 1025it [00:02, 392.54it/s, env_step=519168, len=24, n/ep=3, n/st=64, player_2/loss=393.074, rew=307.33]                                                                                                                                                                                                         


Epoch #507: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #508: 1025it [00:02, 394.79it/s, env_step=520192, len=27, n/ep=2, n/st=64, player_1/loss=604.644, player_2/loss=458.393, rew=377.50]                                                                                                                                                                                  


Epoch #508: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #509: 1025it [00:02, 395.79it/s, env_step=521216, len=28, n/ep=2, n/st=64, player_1/loss=483.785, player_2/loss=298.204, rew=419.50]                                                                                                                                                                                  


Epoch #509: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #510: 1025it [00:02, 397.92it/s, env_step=522240, len=28, n/ep=2, n/st=64, player_1/loss=142.546, player_2/loss=182.937, rew=434.50]                                                                                                                                                                                  


Epoch #510: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #511: 1025it [00:02, 395.17it/s, env_step=523264, len=29, n/ep=2, n/st=64, player_1/loss=234.065, player_2/loss=397.555, rew=434.50]                                                                                                                                                                                  


Epoch #511: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #512: 1025it [00:02, 395.25it/s, env_step=524288, len=27, n/ep=2, n/st=64, player_1/loss=447.473, player_2/loss=444.498, rew=391.00]                                                                                                                                                                                  


Epoch #512: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #513: 1025it [00:02, 396.28it/s, env_step=525312, len=32, n/ep=2, n/st=64, player_1/loss=375.225, player_2/loss=207.063, rew=546.50]                                                                                                                                                                                  


Epoch #513: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #514: 1025it [00:02, 393.43it/s, env_step=526336, len=37, n/ep=1, n/st=64, player_1/loss=342.725, player_2/loss=186.153, rew=702.00]                                                                                                                                                                                  


Epoch #514: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #515: 1025it [00:02, 395.60it/s, env_step=527360, len=32, n/ep=2, n/st=64, player_1/loss=483.574, player_2/loss=414.371, rew=545.00]                                                                                                                                                                                  


Epoch #515: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #516: 1025it [00:02, 394.52it/s, env_step=528384, len=32, n/ep=2, n/st=64, player_1/loss=505.155, player_2/loss=482.967, rew=564.50]                                                                                                                                                                                  


Epoch #516: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #517: 1025it [00:02, 391.97it/s, env_step=529408, len=26, n/ep=3, n/st=64, player_1/loss=558.923, player_2/loss=503.910, rew=352.33]                                                                                                                                                                                  


Epoch #517: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #518: 1025it [00:02, 397.39it/s, env_step=530432, len=30, n/ep=2, n/st=64, player_1/loss=516.841, player_2/loss=393.320, rew=488.50]                                                                                                                                                                                  


Epoch #518: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #519: 1025it [00:02, 395.70it/s, env_step=531456, len=38, n/ep=2, n/st=64, player_1/loss=352.811, player_2/loss=244.717, rew=760.50]                                                                                                                                                                                  


Epoch #519: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #520: 1025it [00:02, 396.99it/s, env_step=532480, len=28, n/ep=3, n/st=64, player_1/loss=522.155, player_2/loss=340.162, rew=430.00]                                                                                                                                                                                  


Epoch #520: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #521: 1025it [00:02, 395.08it/s, env_step=533504, len=31, n/ep=2, n/st=64, player_1/loss=475.732, player_2/loss=392.730, rew=526.00]                                                                                                                                                                                  


Epoch #521: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #522: 1025it [00:02, 394.47it/s, env_step=534528, len=31, n/ep=2, n/st=64, player_1/loss=186.482, player_2/loss=423.833, rew=519.50]                                                                                                                                                                                  


Epoch #522: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #523: 1025it [00:02, 393.59it/s, env_step=535552, len=30, n/ep=2, n/st=64, player_1/loss=227.266, player_2/loss=334.444, rew=494.50]                                                                                                                                                                                  


Epoch #523: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #524: 1025it [00:02, 402.11it/s, env_step=536576, len=31, n/ep=2, n/st=64, player_1/loss=538.405, player_2/loss=290.724, rew=495.50]                                                                                                                                                                                  


Epoch #524: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #525: 1025it [00:02, 402.42it/s, env_step=537600, len=35, n/ep=1, n/st=64, player_1/loss=519.382, player_2/loss=341.943, rew=629.00]                                                                                                                                                                                  


Epoch #525: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #526: 1025it [00:02, 395.41it/s, env_step=538624, len=26, n/ep=2, n/st=64, player_1/loss=443.591, player_2/loss=528.221, rew=363.50]                                                                                                                                                                                  


Epoch #526: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #527: 1025it [00:02, 394.43it/s, env_step=539648, len=30, n/ep=2, n/st=64, player_1/loss=405.082, rew=494.50]                                                                                                                                                                                                         


Epoch #527: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #528: 1025it [00:02, 392.07it/s, env_step=540672, len=32, n/ep=2, n/st=64, player_1/loss=233.772, player_2/loss=602.105, rew=545.00]                                                                                                                                                                                  


Epoch #528: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #529: 1025it [00:02, 392.56it/s, env_step=541696, len=32, n/ep=2, n/st=64, player_1/loss=258.047, player_2/loss=387.845, rew=558.50]                                                                                                                                                                                  


Epoch #529: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #530: 1025it [00:02, 392.32it/s, env_step=542720, len=36, n/ep=2, n/st=64, player_1/loss=209.029, player_2/loss=174.775, rew=669.50]                                                                                                                                                                                  


Epoch #530: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #531: 1025it [00:02, 393.92it/s, env_step=543744, len=32, n/ep=2, n/st=64, player_1/loss=238.846, player_2/loss=206.288, rew=529.00]                                                                                                                                                                                  


Epoch #531: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #532: 1025it [00:02, 393.86it/s, env_step=544768, len=35, n/ep=2, n/st=64, player_1/loss=237.459, player_2/loss=319.001, rew=641.50]                                                                                                                                                                                  


Epoch #532: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #533: 1025it [00:02, 395.34it/s, env_step=545792, len=34, n/ep=2, n/st=64, player_1/loss=268.518, player_2/loss=326.064, rew=621.50]                                                                                                                                                                                  


Epoch #533: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #534: 1025it [00:02, 398.29it/s, env_step=546816, len=33, n/ep=2, n/st=64, player_1/loss=248.961, player_2/loss=395.086, rew=713.00]                                                                                                                                                                                  


Epoch #534: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #535: 1025it [00:02, 394.37it/s, env_step=547840, len=35, n/ep=2, n/st=64, player_1/loss=319.924, player_2/loss=386.400, rew=768.00]                                                                                                                                                                                  


Epoch #535: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #536: 1025it [00:02, 392.47it/s, env_step=548864, len=32, n/ep=2, n/st=64, player_1/loss=566.112, player_2/loss=352.239, rew=539.50]                                                                                                                                                                                  


Epoch #536: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #537: 1025it [00:02, 395.80it/s, env_step=549888, len=28, n/ep=2, n/st=64, player_1/loss=498.756, player_2/loss=407.419, rew=419.50]                                                                                                                                                                                  


Epoch #537: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #538: 1025it [00:02, 397.22it/s, env_step=550912, len=27, n/ep=2, n/st=64, player_1/loss=234.319, player_2/loss=492.019, rew=412.00]                                                                                                                                                                                  


Epoch #538: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #539: 1025it [00:02, 392.85it/s, env_step=551936, len=27, n/ep=1, n/st=64, player_1/loss=288.738, player_2/loss=493.612, rew=377.00]                                                                                                                                                                                  


Epoch #539: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #540: 1025it [00:02, 395.33it/s, env_step=552960, len=36, n/ep=2, n/st=64, player_1/loss=225.113, player_2/loss=588.744, rew=693.50]                                                                                                                                                                                  


Epoch #540: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #541: 1025it [00:02, 391.84it/s, env_step=553984, len=18, n/ep=3, n/st=64, player_1/loss=284.989, player_2/loss=576.047, rew=179.00]                                                                                                                                                                                  


Epoch #541: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #542: 1025it [00:02, 393.31it/s, env_step=555008, len=21, n/ep=3, n/st=64, player_1/loss=272.073, player_2/loss=356.029, rew=246.00]                                                                                                                                                                                  


Epoch #542: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #543: 1025it [00:02, 392.77it/s, env_step=556032, len=26, n/ep=2, n/st=64, player_1/loss=380.821, player_2/loss=472.692, rew=352.00]                                                                                                                                                                                  


Epoch #543: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #544: 1025it [00:02, 394.12it/s, env_step=557056, len=42, n/ep=1, n/st=64, player_1/loss=539.199, player_2/loss=606.132, rew=1102.00]                                                                                                                                                                                 


Epoch #544: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #545: 1025it [00:02, 391.78it/s, env_step=558080, len=31, n/ep=2, n/st=64, player_1/loss=359.026, player_2/loss=791.203, rew=503.00]                                                                                                                                                                                  


Epoch #545: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #546: 1025it [00:02, 392.85it/s, env_step=559104, len=33, n/ep=2, n/st=64, player_1/loss=441.217, rew=577.00]                                                                                                                                                                                                         


Epoch #546: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #547: 1025it [00:02, 395.59it/s, env_step=560128, len=39, n/ep=2, n/st=64, player_1/loss=472.188, player_2/loss=411.178, rew=779.00]                                                                                                                                                                                  


Epoch #547: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #548: 1025it [00:02, 393.93it/s, env_step=561152, len=33, n/ep=2, n/st=64, player_1/loss=412.250, player_2/loss=338.618, rew=583.00]                                                                                                                                                                                  


Epoch #548: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #549: 1025it [00:02, 393.81it/s, env_step=562176, len=30, n/ep=2, n/st=64, player_1/loss=336.051, player_2/loss=117.773, rew=464.00]                                                                                                                                                                                  


Epoch #549: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #550: 1025it [00:02, 395.34it/s, env_step=563200, len=29, n/ep=2, n/st=64, player_1/loss=225.303, player_2/loss=223.840, rew=464.00]                                                                                                                                                                                  


Epoch #550: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #551: 1025it [00:02, 393.35it/s, env_step=564224, len=21, n/ep=3, n/st=64, player_1/loss=401.240, player_2/loss=204.149, rew=254.67]                                                                                                                                                                                  


Epoch #551: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #552: 1025it [00:02, 393.35it/s, env_step=565248, len=38, n/ep=1, n/st=64, player_1/loss=430.831, player_2/loss=191.238, rew=740.00]                                                                                                                                                                                  


Epoch #552: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #553: 1025it [00:02, 392.84it/s, env_step=566272, len=22, n/ep=2, n/st=64, player_1/loss=401.121, player_2/loss=551.346, rew=260.00]                                                                                                                                                                                  


Epoch #553: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #554: 1025it [00:02, 395.39it/s, env_step=567296, len=30, n/ep=3, n/st=64, player_1/loss=519.431, player_2/loss=546.594, rew=479.67]                                                                                                                                                                                  


Epoch #554: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #555: 1025it [00:02, 394.11it/s, env_step=568320, len=33, n/ep=2, n/st=64, player_1/loss=389.483, player_2/loss=169.566, rew=578.00]                                                                                                                                                                                  


Epoch #555: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #556: 1025it [00:02, 395.69it/s, env_step=569344, len=25, n/ep=3, n/st=64, player_1/loss=333.319, player_2/loss=192.652, rew=394.67]                                                                                                                                                                                  


Epoch #556: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #557: 1025it [00:02, 393.66it/s, env_step=570368, len=27, n/ep=2, n/st=64, player_1/loss=290.825, player_2/loss=364.283, rew=385.00]                                                                                                                                                                                  


Epoch #557: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #558: 1025it [00:02, 392.84it/s, env_step=571392, len=36, n/ep=2, n/st=64, player_1/loss=340.568, player_2/loss=436.551, rew=798.50]                                                                                                                                                                                  


Epoch #558: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #559: 1025it [00:02, 397.61it/s, env_step=572416, len=34, n/ep=2, n/st=64, player_1/loss=286.176, rew=594.50]                                                                                                                                                                                                         


Epoch #559: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #560: 1025it [00:02, 395.12it/s, env_step=573440, len=27, n/ep=2, n/st=64, player_1/loss=234.206, player_2/loss=307.310, rew=391.00]                                                                                                                                                                                  


Epoch #560: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #561: 1025it [00:02, 394.52it/s, env_step=574464, len=31, n/ep=2, n/st=64, player_1/loss=264.421, player_2/loss=251.492, rew=497.00]                                                                                                                                                                                  


Epoch #561: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #562: 1025it [00:02, 391.56it/s, env_step=575488, len=35, n/ep=2, n/st=64, player_1/loss=267.020, player_2/loss=191.817, rew=633.50]                                                                                                                                                                                  


Epoch #562: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #563: 1025it [00:02, 395.11it/s, env_step=576512, len=34, n/ep=2, n/st=64, player_1/loss=227.865, player_2/loss=401.143, rew=621.50]                                                                                                                                                                                  


Epoch #563: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #564: 1025it [00:02, 394.43it/s, env_step=577536, len=40, n/ep=2, n/st=64, player_1/loss=244.558, player_2/loss=342.418, rew=940.50]                                                                                                                                                                                  


Epoch #564: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #565: 1025it [00:02, 395.04it/s, env_step=578560, len=30, n/ep=2, n/st=64, player_1/loss=333.750, player_2/loss=266.670, rew=468.50]                                                                                                                                                                                  


Epoch #565: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #566: 1025it [00:02, 393.31it/s, env_step=579584, len=33, n/ep=2, n/st=64, player_1/loss=253.190, player_2/loss=371.281, rew=583.00]                                                                                                                                                                                  


Epoch #566: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #567: 1025it [00:02, 393.78it/s, env_step=580608, len=26, n/ep=3, n/st=64, player_1/loss=215.339, player_2/loss=372.738, rew=430.33]                                                                                                                                                                                  


Epoch #567: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #568: 1025it [00:02, 389.82it/s, env_step=581632, len=31, n/ep=2, n/st=64, player_1/loss=354.027, player_2/loss=427.607, rew=532.00]                                                                                                                                                                                  


Epoch #568: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #569: 1025it [00:02, 408.91it/s, env_step=582656, len=37, n/ep=2, n/st=64, player_1/loss=348.484, player_2/loss=423.133, rew=702.50]                                                                                                                                                                                  


Epoch #569: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #570: 1025it [00:02, 415.79it/s, env_step=583680, len=34, n/ep=2, n/st=64, player_1/loss=246.571, player_2/loss=464.890, rew=739.50]                                                                                                                                                                                  


Epoch #570: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #571: 1025it [00:02, 403.33it/s, env_step=584704, len=40, n/ep=2, n/st=64, player_1/loss=308.404, player_2/loss=386.439, rew=840.50]                                                                                                                                                                                  


Epoch #571: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #572: 1025it [00:02, 392.60it/s, env_step=585728, len=24, n/ep=2, n/st=64, player_1/loss=338.749, player_2/loss=252.161, rew=312.50]                                                                                                                                                                                  


Epoch #572: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #573: 1025it [00:02, 395.69it/s, env_step=586752, len=27, n/ep=2, n/st=64, player_1/loss=297.600, player_2/loss=271.397, rew=392.00]                                                                                                                                                                                  


Epoch #573: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #574: 1025it [00:02, 392.06it/s, env_step=587776, len=28, n/ep=2, n/st=64, player_1/loss=193.464, player_2/loss=393.047, rew=485.50]                                                                                                                                                                                  


Epoch #574: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #575: 1025it [00:02, 393.12it/s, env_step=588800, len=33, n/ep=2, n/st=64, player_1/loss=142.600, player_2/loss=385.236, rew=568.00]                                                                                                                                                                                  


Epoch #575: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #576: 1025it [00:02, 395.02it/s, env_step=589824, len=35, n/ep=1, n/st=64, player_1/loss=312.816, player_2/loss=207.558, rew=629.00]                                                                                                                                                                                  


Epoch #576: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #577: 1025it [00:02, 390.88it/s, env_step=590848, len=26, n/ep=2, n/st=64, player_1/loss=395.467, player_2/loss=291.295, rew=363.50]                                                                                                                                                                                  


Epoch #577: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #578: 1025it [00:02, 394.74it/s, env_step=591872, len=34, n/ep=1, n/st=64, player_1/loss=258.453, player_2/loss=330.808, rew=594.00]                                                                                                                                                                                  


Epoch #578: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #579: 1025it [00:02, 393.77it/s, env_step=592896, len=30, n/ep=2, n/st=64, player_1/loss=219.109, player_2/loss=408.299, rew=482.50]                                                                                                                                                                                  


Epoch #579: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #580: 1025it [00:02, 393.17it/s, env_step=593920, len=19, n/ep=4, n/st=64, player_1/loss=166.346, player_2/loss=430.872, rew=213.75]                                                                                                                                                                                  


Epoch #580: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #581: 1025it [00:02, 395.42it/s, env_step=594944, len=30, n/ep=2, n/st=64, player_1/loss=276.304, player_2/loss=724.807, rew=464.50]                                                                                                                                                                                  


Epoch #581: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #582: 1025it [00:02, 394.02it/s, env_step=595968, len=26, n/ep=2, n/st=64, player_1/loss=432.736, player_2/loss=564.138, rew=350.50]                                                                                                                                                                                  


Epoch #582: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #583: 1025it [00:02, 391.50it/s, env_step=596992, len=26, n/ep=2, n/st=64, player_1/loss=454.342, player_2/loss=414.163, rew=364.50]                                                                                                                                                                                  


Epoch #583: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #584: 1025it [00:02, 394.66it/s, env_step=598016, len=19, n/ep=3, n/st=64, player_1/loss=381.027, player_2/loss=271.158, rew=206.33]                                                                                                                                                                                  


Epoch #584: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #585: 1025it [00:02, 394.45it/s, env_step=599040, len=28, n/ep=2, n/st=64, player_1/loss=374.666, player_2/loss=362.881, rew=405.50]                                                                                                                                                                                  


Epoch #585: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #586: 1025it [00:02, 394.51it/s, env_step=600064, len=27, n/ep=2, n/st=64, player_1/loss=390.602, player_2/loss=473.751, rew=391.00]                                                                                                                                                                                  


Epoch #586: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #587: 1025it [00:02, 397.08it/s, env_step=601088, len=26, n/ep=3, n/st=64, player_1/loss=336.497, player_2/loss=275.426, rew=368.67]                                                                                                                                                                                  


Epoch #587: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #588: 1025it [00:02, 390.88it/s, env_step=602112, len=26, n/ep=2, n/st=64, player_1/loss=249.109, player_2/loss=124.615, rew=352.00]                                                                                                                                                                                  


Epoch #588: test_reward: 594.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #589: 1025it [00:02, 394.12it/s, env_step=603136, len=24, n/ep=2, n/st=64, player_1/loss=343.155, player_2/loss=235.430, rew=311.50]                                                                                                                                                                                  


Epoch #589: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #590: 1025it [00:02, 395.08it/s, env_step=604160, len=26, n/ep=2, n/st=64, player_1/loss=363.058, player_2/loss=332.793, rew=364.50]                                                                                                                                                                                  


Epoch #590: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #591: 1025it [00:02, 393.51it/s, env_step=605184, len=28, n/ep=2, n/st=64, player_1/loss=217.193, player_2/loss=291.079, rew=420.50]                                                                                                                                                                                  


Epoch #591: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #592: 1025it [00:02, 392.91it/s, env_step=606208, len=24, n/ep=2, n/st=64, player_1/loss=199.732, player_2/loss=281.127, rew=416.50]                                                                                                                                                                                  


Epoch #592: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #593: 1025it [00:02, 390.57it/s, env_step=607232, len=36, n/ep=2, n/st=64, player_1/loss=311.125, player_2/loss=282.912, rew=665.00]                                                                                                                                                                                  


Epoch #593: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #594: 1025it [00:02, 392.02it/s, env_step=608256, len=24, n/ep=3, n/st=64, player_1/loss=377.912, player_2/loss=385.504, rew=334.33]                                                                                                                                                                                  


Epoch #594: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #595: 1025it [00:02, 395.34it/s, env_step=609280, len=28, n/ep=3, n/st=64, player_1/loss=124.055, player_2/loss=422.660, rew=427.67]                                                                                                                                                                                  


Epoch #595: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #596: 1025it [00:02, 394.25it/s, env_step=610304, len=27, n/ep=3, n/st=64, player_1/loss=200.844, player_2/loss=318.477, rew=378.00]                                                                                                                                                                                  


Epoch #596: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #597: 1025it [00:02, 391.12it/s, env_step=611328, len=39, n/ep=1, n/st=64, player_1/loss=280.267, player_2/loss=368.871, rew=779.00]                                                                                                                                                                                  


Epoch #597: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #598: 1025it [00:02, 392.73it/s, env_step=612352, len=36, n/ep=2, n/st=64, player_1/loss=342.904, player_2/loss=351.705, rew=669.50]                                                                                                                                                                                  


Epoch #598: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #599: 1025it [00:02, 394.36it/s, env_step=613376, len=29, n/ep=2, n/st=64, player_1/loss=287.994, player_2/loss=415.076, rew=434.50]                                                                                                                                                                                  


Epoch #599: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #600: 1025it [00:02, 392.50it/s, env_step=614400, len=24, n/ep=3, n/st=64, player_1/loss=213.045, player_2/loss=565.707, rew=300.00]                                                                                                                                                                                  


Epoch #600: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #601: 1025it [00:02, 393.84it/s, env_step=615424, len=30, n/ep=2, n/st=64, player_1/loss=289.513, player_2/loss=395.531, rew=476.50]                                                                                                                                                                                  


Epoch #601: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #602: 1025it [00:02, 394.82it/s, env_step=616448, len=39, n/ep=2, n/st=64, player_1/loss=209.130, player_2/loss=302.967, rew=779.50]                                                                                                                                                                                  


Epoch #602: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #603: 1025it [00:02, 393.90it/s, env_step=617472, len=13, n/ep=5, n/st=64, player_1/loss=148.877, player_2/loss=224.264, rew=177.40]                                                                                                                                                                                  


Epoch #603: test_reward: 35.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #604: 1025it [00:02, 392.22it/s, env_step=618496, len=31, n/ep=2, n/st=64, player_1/loss=420.999, player_2/loss=518.160, rew=503.00]                                                                                                                                                                                  


Epoch #604: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #605: 1025it [00:02, 394.71it/s, env_step=619520, len=29, n/ep=2, n/st=64, player_1/loss=471.062, player_2/loss=592.673, rew=485.00]                                                                                                                                                                                  


Epoch #605: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #606: 1025it [00:02, 392.51it/s, env_step=620544, len=26, n/ep=2, n/st=64, player_1/loss=401.700, player_2/loss=396.465, rew=366.50]                                                                                                                                                                                  


Epoch #606: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #607: 1025it [00:02, 394.21it/s, env_step=621568, len=34, n/ep=2, n/st=64, player_1/loss=343.998, player_2/loss=365.479, rew=617.50]                                                                                                                                                                                  


Epoch #607: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #608: 1025it [00:02, 396.27it/s, env_step=622592, len=31, n/ep=2, n/st=64, player_1/loss=276.782, player_2/loss=327.603, rew=532.00]                                                                                                                                                                                  


Epoch #608: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #609: 1025it [00:02, 392.75it/s, env_step=623616, len=30, n/ep=2, n/st=64, player_1/loss=202.957, player_2/loss=157.026, rew=485.50]                                                                                                                                                                                  


Epoch #609: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #610: 1025it [00:02, 393.42it/s, env_step=624640, len=40, n/ep=2, n/st=64, player_1/loss=185.636, player_2/loss=409.672, rew=940.50]                                                                                                                                                                                  


Epoch #610: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #611: 1025it [00:02, 394.41it/s, env_step=625664, len=27, n/ep=2, n/st=64, player_1/loss=145.070, player_2/loss=412.151, rew=401.50]                                                                                                                                                                                  


Epoch #611: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #612: 1025it [00:02, 394.74it/s, env_step=626688, len=26, n/ep=2, n/st=64, player_1/loss=195.291, player_2/loss=245.683, rew=363.50]                                                                                                                                                                                  


Epoch #612: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #613: 1025it [00:02, 391.06it/s, env_step=627712, len=27, n/ep=2, n/st=64, player_1/loss=284.619, player_2/loss=264.778, rew=394.00]                                                                                                                                                                                  


Epoch #613: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #614: 1025it [00:02, 418.29it/s, env_step=628736, len=27, n/ep=3, n/st=64, player_1/loss=188.990, player_2/loss=233.062, rew=387.67]                                                                                                                                                                                  


Epoch #614: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #615: 1025it [00:02, 416.56it/s, env_step=629760, len=33, n/ep=2, n/st=64, player_1/loss=238.349, player_2/loss=507.540, rew=578.00]                                                                                                                                                                                  


Epoch #615: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #616: 1025it [00:02, 406.66it/s, env_step=630784, len=35, n/ep=2, n/st=64, player_1/loss=353.605, player_2/loss=489.438, rew=637.00]                                                                                                                                                                                  


Epoch #616: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #617: 1025it [00:02, 394.74it/s, env_step=631808, len=33, n/ep=2, n/st=64, player_1/loss=262.758, player_2/loss=592.541, rew=592.00]                                                                                                                                                                                  


Epoch #617: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #618: 1025it [00:02, 393.72it/s, env_step=632832, len=30, n/ep=2, n/st=64, player_1/loss=204.775, player_2/loss=493.984, rew=464.50]                                                                                                                                                                                  


Epoch #618: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #619: 1025it [00:02, 388.40it/s, env_step=633856, len=28, n/ep=2, n/st=64, player_1/loss=548.955, player_2/loss=414.448, rew=434.50]                                                                                                                                                                                  


Epoch #619: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #620: 1025it [00:02, 394.54it/s, env_step=634880, len=26, n/ep=3, n/st=64, player_1/loss=594.536, player_2/loss=502.803, rew=376.67]                                                                                                                                                                                  


Epoch #620: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #621: 1025it [00:02, 397.50it/s, env_step=635904, len=8, n/ep=7, n/st=64, player_1/loss=445.291, player_2/loss=648.429, rew=39.00]                                                                                                                                                                                    


Epoch #621: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #622: 1025it [00:02, 395.83it/s, env_step=636928, len=30, n/ep=2, n/st=64, player_1/loss=253.303, player_2/loss=477.728, rew=488.50]                                                                                                                                                                                  


Epoch #622: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #623: 1025it [00:02, 393.01it/s, env_step=637952, len=24, n/ep=3, n/st=64, player_1/loss=188.820, player_2/loss=400.778, rew=331.00]                                                                                                                                                                                  


Epoch #623: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #624: 1025it [00:02, 396.34it/s, env_step=638976, len=30, n/ep=2, n/st=64, player_1/loss=134.975, player_2/loss=317.629, rew=485.50]                                                                                                                                                                                  


Epoch #624: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #625: 1025it [00:02, 392.56it/s, env_step=640000, len=22, n/ep=3, n/st=64, player_1/loss=90.649, player_2/loss=462.271, rew=272.00]                                                                                                                                                                                   


Epoch #625: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #626: 1025it [00:02, 393.62it/s, env_step=641024, len=26, n/ep=3, n/st=64, player_1/loss=163.189, player_2/loss=411.536, rew=369.33]                                                                                                                                                                                  


Epoch #626: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #627: 1025it [00:02, 394.27it/s, env_step=642048, len=28, n/ep=3, n/st=64, player_1/loss=221.319, player_2/loss=230.573, rew=406.00]                                                                                                                                                                                  


Epoch #627: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #628: 1025it [00:02, 394.58it/s, env_step=643072, len=28, n/ep=2, n/st=64, player_1/loss=208.802, player_2/loss=357.723, rew=407.00]                                                                                                                                                                                  


Epoch #628: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #629: 1025it [00:02, 394.42it/s, env_step=644096, len=30, n/ep=2, n/st=64, player_1/loss=152.790, player_2/loss=359.978, rew=480.50]                                                                                                                                                                                  


Epoch #629: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #630: 1025it [00:02, 396.30it/s, env_step=645120, len=29, n/ep=3, n/st=64, player_1/loss=172.079, player_2/loss=225.538, rew=437.00]                                                                                                                                                                                  


Epoch #630: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #631: 1025it [00:02, 398.33it/s, env_step=646144, len=29, n/ep=2, n/st=64, player_1/loss=201.720, player_2/loss=440.656, rew=459.00]                                                                                                                                                                                  


Epoch #631: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #632: 1025it [00:02, 396.66it/s, env_step=647168, len=27, n/ep=2, n/st=64, player_1/loss=176.901, player_2/loss=580.322, rew=377.50]                                                                                                                                                                                  


Epoch #632: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #633: 1025it [00:02, 392.54it/s, env_step=648192, len=33, n/ep=2, n/st=64, player_1/loss=187.029, player_2/loss=550.544, rew=578.00]                                                                                                                                                                                  


Epoch #633: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #634: 1025it [00:02, 393.32it/s, env_step=649216, len=28, n/ep=2, n/st=64, player_1/loss=183.094, player_2/loss=398.540, rew=464.50]                                                                                                                                                                                  


Epoch #634: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #635: 1025it [00:02, 393.53it/s, env_step=650240, len=37, n/ep=1, n/st=64, player_1/loss=220.753, player_2/loss=331.539, rew=702.00]                                                                                                                                                                                  


Epoch #635: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #636: 1025it [00:02, 393.49it/s, env_step=651264, len=29, n/ep=2, n/st=64, player_1/loss=462.079, player_2/loss=475.599, rew=452.00]                                                                                                                                                                                  


Epoch #636: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #637: 1025it [00:02, 394.65it/s, env_step=652288, len=30, n/ep=1, n/st=64, player_1/loss=409.209, player_2/loss=458.988, rew=464.00]                                                                                                                                                                                  


Epoch #637: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #638: 1025it [00:02, 393.64it/s, env_step=653312, len=31, n/ep=2, n/st=64, player_1/loss=230.036, player_2/loss=563.304, rew=655.50]                                                                                                                                                                                  


Epoch #638: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #639: 1025it [00:02, 395.43it/s, env_step=654336, len=19, n/ep=3, n/st=64, player_1/loss=234.787, player_2/loss=374.419, rew=203.00]                                                                                                                                                                                  


Epoch #639: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #640: 1025it [00:02, 393.98it/s, env_step=655360, len=41, n/ep=1, n/st=64, player_1/loss=313.793, player_2/loss=484.125, rew=860.00]                                                                                                                                                                                  


Epoch #640: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #641: 1025it [00:02, 396.73it/s, env_step=656384, len=31, n/ep=2, n/st=64, player_1/loss=306.101, player_2/loss=512.555, rew=513.00]                                                                                                                                                                                  


Epoch #641: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #642: 1025it [00:02, 393.71it/s, env_step=657408, len=23, n/ep=2, n/st=64, player_1/loss=187.005, rew=299.50]                                                                                                                                                                                                         


Epoch #642: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #643: 1025it [00:02, 390.78it/s, env_step=658432, len=28, n/ep=3, n/st=64, player_1/loss=150.808, player_2/loss=446.997, rew=431.33]                                                                                                                                                                                  


Epoch #643: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #644: 1025it [00:02, 395.33it/s, env_step=659456, len=25, n/ep=2, n/st=64, player_1/loss=117.314, player_2/loss=556.948, rew=328.50]                                                                                                                                                                                  


Epoch #644: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #645: 1025it [00:02, 394.99it/s, env_step=660480, len=31, n/ep=2, n/st=64, player_1/loss=399.700, player_2/loss=551.031, rew=521.00]                                                                                                                                                                                  


Epoch #645: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #646: 1025it [00:02, 398.52it/s, env_step=661504, len=22, n/ep=3, n/st=64, player_1/loss=372.104, player_2/loss=340.680, rew=294.67]                                                                                                                                                                                  


Epoch #646: test_reward: 65.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #647: 1025it [00:02, 393.47it/s, env_step=662528, len=26, n/ep=3, n/st=64, player_1/loss=353.615, player_2/loss=244.102, rew=357.00]                                                                                                                                                                                  


Epoch #647: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #648: 1025it [00:02, 392.74it/s, env_step=663552, len=34, n/ep=2, n/st=64, player_1/loss=458.488, player_2/loss=398.178, rew=632.50]                                                                                                                                                                                  


Epoch #648: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #649: 1025it [00:02, 394.71it/s, env_step=664576, len=22, n/ep=4, n/st=64, player_1/loss=367.358, player_2/loss=574.365, rew=369.00]                                                                                                                                                                                  


Epoch #649: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #650: 1025it [00:02, 395.30it/s, env_step=665600, len=29, n/ep=2, n/st=64, player_1/loss=455.700, player_2/loss=718.664, rew=477.00]                                                                                                                                                                                  


Epoch #650: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #651: 1025it [00:02, 397.42it/s, env_step=666624, len=19, n/ep=4, n/st=64, player_1/loss=408.014, player_2/loss=539.998, rew=230.25]                                                                                                                                                                                  


Epoch #651: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #652: 1025it [00:02, 395.90it/s, env_step=667648, len=33, n/ep=2, n/st=64, player_1/loss=312.303, player_2/loss=282.441, rew=580.00]                                                                                                                                                                                  


Epoch #652: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #653: 1025it [00:02, 393.75it/s, env_step=668672, len=28, n/ep=3, n/st=64, player_1/loss=359.952, player_2/loss=225.241, rew=446.33]                                                                                                                                                                                  


Epoch #653: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #654: 1025it [00:02, 393.85it/s, env_step=669696, len=32, n/ep=2, n/st=64, player_1/loss=351.774, player_2/loss=302.980, rew=531.50]                                                                                                                                                                                  


Epoch #654: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #655: 1025it [00:02, 394.01it/s, env_step=670720, len=26, n/ep=2, n/st=64, player_1/loss=344.284, player_2/loss=354.215, rew=400.00]                                                                                                                                                                                  


Epoch #655: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #656: 1025it [00:02, 396.52it/s, env_step=671744, len=33, n/ep=3, n/st=64, player_1/loss=380.056, player_2/loss=468.089, rew=703.33]                                                                                                                                                                                  


Epoch #656: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #657: 1025it [00:02, 395.43it/s, env_step=672768, len=27, n/ep=3, n/st=64, player_1/loss=556.214, player_2/loss=391.702, rew=386.33]                                                                                                                                                                                  


Epoch #657: test_reward: 152.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #658: 1025it [00:02, 386.69it/s, env_step=673792, len=23, n/ep=3, n/st=64, player_2/loss=210.202, rew=285.33]                                                                                                                                                                                                         


Epoch #658: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #659: 1025it [00:02, 404.07it/s, env_step=674816, len=32, n/ep=2, n/st=64, player_1/loss=405.968, player_2/loss=310.371, rew=571.50]                                                                                                                                                                                  


Epoch #659: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #660: 1025it [00:02, 405.45it/s, env_step=675840, len=22, n/ep=3, n/st=64, player_1/loss=197.291, player_2/loss=247.978, rew=289.33]                                                                                                                                                                                  


Epoch #660: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #661: 1025it [00:02, 394.29it/s, env_step=676864, len=22, n/ep=2, n/st=64, player_1/loss=269.105, player_2/loss=371.573, rew=318.50]                                                                                                                                                                                  


Epoch #661: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #662: 1025it [00:02, 392.20it/s, env_step=677888, len=38, n/ep=2, n/st=64, player_1/loss=439.485, player_2/loss=295.280, rew=740.00]                                                                                                                                                                                  


Epoch #662: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #663: 1025it [00:02, 396.25it/s, env_step=678912, len=31, n/ep=2, n/st=64, player_1/loss=397.478, player_2/loss=222.447, rew=507.50]                                                                                                                                                                                  


Epoch #663: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #664: 1025it [00:02, 394.72it/s, env_step=679936, len=25, n/ep=2, n/st=64, player_1/loss=282.038, player_2/loss=219.601, rew=324.50]                                                                                                                                                                                  


Epoch #664: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #665: 1025it [00:02, 395.59it/s, env_step=680960, len=28, n/ep=2, n/st=64, player_1/loss=316.246, player_2/loss=228.805, rew=420.50]                                                                                                                                                                                  


Epoch #665: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #666: 1025it [00:02, 393.29it/s, env_step=681984, len=28, n/ep=3, n/st=64, player_1/loss=248.383, player_2/loss=228.731, rew=414.67]                                                                                                                                                                                  


Epoch #666: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #667: 1025it [00:02, 392.67it/s, env_step=683008, len=29, n/ep=2, n/st=64, player_1/loss=185.135, player_2/loss=341.191, rew=434.50]                                                                                                                                                                                  


Epoch #667: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #668: 1025it [00:02, 397.51it/s, env_step=684032, len=30, n/ep=2, n/st=64, player_1/loss=446.867, player_2/loss=508.222, rew=504.50]                                                                                                                                                                                  


Epoch #668: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #669: 1025it [00:02, 393.51it/s, env_step=685056, len=26, n/ep=3, n/st=64, player_1/loss=410.072, player_2/loss=519.041, rew=360.67]                                                                                                                                                                                  


Epoch #669: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #670: 1025it [00:02, 396.30it/s, env_step=686080, len=31, n/ep=2, n/st=64, player_1/loss=147.342, player_2/loss=634.212, rew=532.00]                                                                                                                                                                                  


Epoch #670: test_reward: 152.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #671: 1025it [00:02, 393.80it/s, env_step=687104, len=33, n/ep=2, n/st=64, player_1/loss=236.194, player_2/loss=518.264, rew=583.00]                                                                                                                                                                                  


Epoch #671: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #672: 1025it [00:02, 392.38it/s, env_step=688128, len=26, n/ep=2, n/st=64, player_1/loss=569.953, player_2/loss=528.115, rew=363.50]                                                                                                                                                                                  


Epoch #672: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #673: 1025it [00:02, 395.00it/s, env_step=689152, len=15, n/ep=6, n/st=64, player_1/loss=572.225, player_2/loss=417.146, rew=178.50]                                                                                                                                                                                  


Epoch #673: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #674: 1025it [00:02, 395.80it/s, env_step=690176, len=34, n/ep=2, n/st=64, player_1/loss=421.247, player_2/loss=179.289, rew=611.50]                                                                                                                                                                                  


Epoch #674: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #675: 1025it [00:02, 394.29it/s, env_step=691200, len=30, n/ep=2, n/st=64, player_1/loss=541.600, player_2/loss=351.484, rew=482.50]                                                                                                                                                                                  


Epoch #675: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #676: 1025it [00:02, 397.01it/s, env_step=692224, len=29, n/ep=3, n/st=64, player_1/loss=594.921, player_2/loss=528.072, rew=464.00]                                                                                                                                                                                  


Epoch #676: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #677: 1025it [00:02, 391.59it/s, env_step=693248, len=33, n/ep=2, n/st=64, player_1/loss=321.340, player_2/loss=279.119, rew=583.00]                                                                                                                                                                                  


Epoch #677: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #678: 1025it [00:02, 394.48it/s, env_step=694272, len=39, n/ep=1, n/st=64, player_1/loss=248.139, player_2/loss=320.948, rew=779.00]                                                                                                                                                                                  


Epoch #678: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #679: 1025it [00:02, 394.03it/s, env_step=695296, len=26, n/ep=3, n/st=64, player_1/loss=221.608, player_2/loss=351.080, rew=416.00]                                                                                                                                                                                  


Epoch #679: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #680: 1025it [00:02, 394.03it/s, env_step=696320, len=27, n/ep=2, n/st=64, player_1/loss=69.572, player_2/loss=214.017, rew=427.00]                                                                                                                                                                                   


Epoch #680: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #681: 1025it [00:02, 397.35it/s, env_step=697344, len=32, n/ep=2, n/st=64, player_1/loss=192.135, player_2/loss=366.594, rew=558.50]                                                                                                                                                                                  


Epoch #681: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #682: 1025it [00:02, 390.65it/s, env_step=698368, len=32, n/ep=2, n/st=64, player_1/loss=477.014, player_2/loss=690.870, rew=551.50]                                                                                                                                                                                  


Epoch #682: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #683: 1025it [00:02, 390.83it/s, env_step=699392, len=26, n/ep=3, n/st=64, player_1/loss=501.837, player_2/loss=797.275, rew=360.33]                                                                                                                                                                                  


Epoch #683: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #684: 1025it [00:02, 395.80it/s, env_step=700416, len=27, n/ep=2, n/st=64, player_1/loss=309.434, player_2/loss=448.048, rew=394.00]                                                                                                                                                                                  


Epoch #684: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #685: 1025it [00:02, 393.08it/s, env_step=701440, len=27, n/ep=3, n/st=64, player_1/loss=556.597, player_2/loss=399.549, rew=440.33]                                                                                                                                                                                  


Epoch #685: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #686: 1025it [00:02, 392.95it/s, env_step=702464, len=25, n/ep=2, n/st=64, player_1/loss=664.003, player_2/loss=339.518, rew=326.00]                                                                                                                                                                                  


Epoch #686: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #687: 1025it [00:02, 393.05it/s, env_step=703488, len=38, n/ep=2, n/st=64, player_1/loss=486.259, player_2/loss=233.916, rew=759.50]                                                                                                                                                                                  


Epoch #687: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #688: 1025it [00:02, 392.65it/s, env_step=704512, len=33, n/ep=2, n/st=64, player_1/loss=422.088, player_2/loss=375.637, rew=572.50]                                                                                                                                                                                  


Epoch #688: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #689: 1025it [00:02, 394.44it/s, env_step=705536, len=22, n/ep=3, n/st=64, player_1/loss=307.784, player_2/loss=531.008, rew=304.00]                                                                                                                                                                                  


Epoch #689: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #690: 1025it [00:02, 393.33it/s, env_step=706560, len=37, n/ep=2, n/st=64, player_1/loss=237.474, player_2/loss=342.550, rew=721.00]                                                                                                                                                                                  


Epoch #690: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #691: 1025it [00:02, 392.44it/s, env_step=707584, len=24, n/ep=2, n/st=64, player_1/loss=323.271, player_2/loss=231.584, rew=311.50]                                                                                                                                                                                  


Epoch #691: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #692: 1025it [00:02, 393.77it/s, env_step=708608, len=40, n/ep=2, n/st=64, player_1/loss=428.388, player_2/loss=406.011, rew=821.00]                                                                                                                                                                                  


Epoch #692: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #693: 1025it [00:02, 396.22it/s, env_step=709632, len=28, n/ep=2, n/st=64, player_1/loss=380.730, player_2/loss=619.532, rew=407.00]                                                                                                                                                                                  


Epoch #693: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #694: 1025it [00:02, 394.27it/s, env_step=710656, len=20, n/ep=2, n/st=64, player_1/loss=446.574, player_2/loss=607.388, rew=220.50]                                                                                                                                                                                  


Epoch #694: test_reward: 119.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #695: 1025it [00:02, 395.93it/s, env_step=711680, len=22, n/ep=3, n/st=64, player_1/loss=598.497, player_2/loss=377.422, rew=267.67]                                                                                                                                                                                  


Epoch #695: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #696: 1025it [00:02, 395.69it/s, env_step=712704, len=26, n/ep=2, n/st=64, player_1/loss=749.394, player_2/loss=325.737, rew=350.50]                                                                                                                                                                                  


Epoch #696: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #697: 1025it [00:02, 391.40it/s, env_step=713728, len=16, n/ep=4, n/st=64, player_1/loss=729.996, player_2/loss=365.860, rew=145.00]                                                                                                                                                                                  


Epoch #697: test_reward: 152.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #698: 1025it [00:02, 394.50it/s, env_step=714752, len=19, n/ep=2, n/st=64, player_1/loss=580.612, player_2/loss=473.103, rew=200.00]                                                                                                                                                                                  


Epoch #698: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #699: 1025it [00:02, 393.85it/s, env_step=715776, len=33, n/ep=2, n/st=64, player_1/loss=727.168, player_2/loss=580.486, rew=584.50]                                                                                                                                                                                  


Epoch #699: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #700: 1025it [00:02, 394.76it/s, env_step=716800, len=26, n/ep=2, n/st=64, player_1/loss=751.552, player_2/loss=360.908, rew=350.00]                                                                                                                                                                                  


Epoch #700: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #701: 1025it [00:02, 391.27it/s, env_step=717824, len=34, n/ep=2, n/st=64, player_1/loss=632.942, player_2/loss=121.204, rew=606.50]                                                                                                                                                                                  


Epoch #701: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #702: 1025it [00:02, 392.06it/s, env_step=718848, len=23, n/ep=2, n/st=64, player_1/loss=407.136, player_2/loss=196.858, rew=275.50]                                                                                                                                                                                  


Epoch #702: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #703: 1025it [00:02, 391.86it/s, env_step=719872, len=32, n/ep=2, n/st=64, player_1/loss=328.943, player_2/loss=200.909, rew=529.00]                                                                                                                                                                                  


Epoch #703: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #704: 1025it [00:02, 388.08it/s, env_step=720896, len=33, n/ep=2, n/st=64, player_1/loss=333.484, player_2/loss=194.100, rew=564.50]                                                                                                                                                                                  


Epoch #704: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #705: 1025it [00:02, 403.00it/s, env_step=721920, len=14, n/ep=5, n/st=64, player_1/loss=253.244, player_2/loss=462.467, rew=119.60]                                                                                                                                                                                  


Epoch #705: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #706: 1025it [00:02, 392.09it/s, env_step=722944, len=28, n/ep=2, n/st=64, player_1/loss=360.903, player_2/loss=649.900, rew=420.50]                                                                                                                                                                                  


Epoch #706: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #707: 1025it [00:02, 394.58it/s, env_step=723968, len=28, n/ep=3, n/st=64, player_1/loss=288.620, player_2/loss=437.130, rew=415.33]                                                                                                                                                                                  


Epoch #707: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #708: 1025it [00:02, 394.19it/s, env_step=724992, len=27, n/ep=3, n/st=64, player_1/loss=376.246, player_2/loss=443.593, rew=410.33]                                                                                                                                                                                  


Epoch #708: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #709: 1025it [00:02, 396.28it/s, env_step=726016, len=26, n/ep=2, n/st=64, player_1/loss=493.031, player_2/loss=374.273, rew=364.50]                                                                                                                                                                                  


Epoch #709: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #710: 1025it [00:02, 393.29it/s, env_step=727040, len=16, n/ep=4, n/st=64, player_1/loss=300.454, player_2/loss=488.681, rew=221.00]                                                                                                                                                                                  


Epoch #710: test_reward: 44.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #711: 1025it [00:02, 394.75it/s, env_step=728064, len=15, n/ep=4, n/st=64, player_1/loss=219.542, player_2/loss=574.927, rew=130.50]                                                                                                                                                                                  


Epoch #711: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #712: 1025it [00:02, 394.93it/s, env_step=729088, len=22, n/ep=2, n/st=64, player_1/loss=477.186, player_2/loss=564.878, rew=256.50]                                                                                                                                                                                  


Epoch #712: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #713: 1025it [00:02, 392.81it/s, env_step=730112, len=32, n/ep=2, n/st=64, player_1/loss=471.183, player_2/loss=359.624, rew=564.50]                                                                                                                                                                                  


Epoch #713: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #714: 1025it [00:02, 394.27it/s, env_step=731136, len=36, n/ep=2, n/st=64, player_1/loss=306.075, player_2/loss=265.257, rew=783.00]                                                                                                                                                                                  


Epoch #714: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #715: 1025it [00:02, 391.74it/s, env_step=732160, len=7, n/ep=8, n/st=64, player_1/loss=388.545, player_2/loss=465.467, rew=33.25]                                                                                                                                                                                    


Epoch #715: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #716: 1025it [00:02, 393.21it/s, env_step=733184, len=28, n/ep=3, n/st=64, player_1/loss=521.234, player_2/loss=864.982, rew=436.67]                                                                                                                                                                                  


Epoch #716: test_reward: 119.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #717: 1025it [00:02, 394.45it/s, env_step=734208, len=24, n/ep=2, n/st=64, player_1/loss=378.511, player_2/loss=700.541, rew=339.50]                                                                                                                                                                                  


Epoch #717: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #718: 1025it [00:02, 393.94it/s, env_step=735232, len=20, n/ep=3, n/st=64, player_1/loss=209.850, player_2/loss=422.717, rew=224.33]                                                                                                                                                                                  


Epoch #718: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #719: 1025it [00:02, 394.85it/s, env_step=736256, len=18, n/ep=2, n/st=64, player_1/loss=238.729, player_2/loss=354.433, rew=224.50]                                                                                                                                                                                  


Epoch #719: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #720: 1025it [00:02, 393.77it/s, env_step=737280, len=22, n/ep=3, n/st=64, player_2/loss=760.063, rew=293.33]                                                                                                                                                                                                         


Epoch #720: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #721: 1025it [00:02, 394.68it/s, env_step=738304, len=20, n/ep=3, n/st=64, player_1/loss=305.482, player_2/loss=956.952, rew=282.00]                                                                                                                                                                                  


Epoch #721: test_reward: 44.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #722: 1025it [00:02, 393.26it/s, env_step=739328, len=25, n/ep=3, n/st=64, player_1/loss=291.670, player_2/loss=629.644, rew=384.00]                                                                                                                                                                                  


Epoch #722: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #723: 1025it [00:02, 393.65it/s, env_step=740352, len=13, n/ep=4, n/st=64, player_1/loss=418.706, player_2/loss=337.418, rew=142.25]                                                                                                                                                                                  


Epoch #723: test_reward: 35.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #724: 1025it [00:02, 393.28it/s, env_step=741376, len=33, n/ep=2, n/st=64, player_1/loss=472.284, player_2/loss=335.612, rew=578.00]                                                                                                                                                                                  


Epoch #724: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #725: 1025it [00:02, 393.20it/s, env_step=742400, len=36, n/ep=2, n/st=64, player_1/loss=384.590, player_2/loss=470.317, rew=673.00]                                                                                                                                                                                  


Epoch #725: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #726: 1025it [00:02, 395.86it/s, env_step=743424, len=27, n/ep=2, n/st=64, player_2/loss=605.839, rew=446.00]                                                                                                                                                                                                         


Epoch #726: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #727: 1025it [00:02, 392.24it/s, env_step=744448, len=38, n/ep=2, n/st=64, player_1/loss=85.628, player_2/loss=419.256, rew=759.50]                                                                                                                                                                                   


Epoch #727: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #728: 1025it [00:02, 395.06it/s, env_step=745472, len=30, n/ep=2, n/st=64, player_1/loss=101.642, player_2/loss=293.600, rew=472.00]                                                                                                                                                                                  


Epoch #728: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #729: 1025it [00:02, 390.66it/s, env_step=746496, len=33, n/ep=2, n/st=64, player_1/loss=100.206, player_2/loss=332.207, rew=578.00]                                                                                                                                                                                  


Epoch #729: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #730: 1025it [00:02, 391.74it/s, env_step=747520, len=20, n/ep=3, n/st=64, player_1/loss=191.183, player_2/loss=538.878, rew=400.00]                                                                                                                                                                                  


Epoch #730: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #731: 1025it [00:02, 392.15it/s, env_step=748544, len=17, n/ep=3, n/st=64, player_1/loss=261.005, player_2/loss=593.679, rew=189.33]                                                                                                                                                                                  


Epoch #731: test_reward: 119.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #732: 1025it [00:02, 390.41it/s, env_step=749568, len=24, n/ep=3, n/st=64, player_1/loss=237.448, player_2/loss=470.614, rew=317.67]                                                                                                                                                                                  


Epoch #732: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #733: 1025it [00:02, 390.31it/s, env_step=750592, len=14, n/ep=4, n/st=64, player_1/loss=306.720, player_2/loss=383.875, rew=112.00]                                                                                                                                                                                  


Epoch #733: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #734: 1025it [00:02, 393.10it/s, env_step=751616, len=36, n/ep=2, n/st=64, player_1/loss=379.227, player_2/loss=502.369, rew=667.00]                                                                                                                                                                                  


Epoch #734: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #735: 1025it [00:02, 393.17it/s, env_step=752640, len=26, n/ep=2, n/st=64, player_1/loss=261.615, player_2/loss=492.583, rew=352.00]                                                                                                                                                                                  


Epoch #735: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #736: 1025it [00:02, 389.80it/s, env_step=753664, len=31, n/ep=2, n/st=64, player_1/loss=284.032, player_2/loss=347.891, rew=517.00]                                                                                                                                                                                  


Epoch #736: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #737: 1025it [00:02, 394.49it/s, env_step=754688, len=31, n/ep=2, n/st=64, player_1/loss=278.298, player_2/loss=286.823, rew=539.00]                                                                                                                                                                                  


Epoch #737: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #738: 1025it [00:02, 393.52it/s, env_step=755712, len=23, n/ep=3, n/st=64, player_1/loss=194.161, rew=296.33]                                                                                                                                                                                                         


Epoch #738: test_reward: 44.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #739: 1025it [00:02, 393.53it/s, env_step=756736, len=20, n/ep=3, n/st=64, player_1/loss=237.756, player_2/loss=437.841, rew=230.67]                                                                                                                                                                                  


Epoch #739: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #740: 1025it [00:02, 392.17it/s, env_step=757760, len=28, n/ep=2, n/st=64, player_1/loss=553.770, player_2/loss=438.933, rew=420.50]                                                                                                                                                                                  


Epoch #740: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #741: 1025it [00:02, 393.22it/s, env_step=758784, len=29, n/ep=2, n/st=64, player_1/loss=660.755, player_2/loss=382.678, rew=449.00]                                                                                                                                                                                  


Epoch #741: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #742: 1025it [00:02, 393.43it/s, env_step=759808, len=23, n/ep=2, n/st=64, player_1/loss=451.235, player_2/loss=355.924, rew=275.50]                                                                                                                                                                                  


Epoch #742: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #743: 1025it [00:02, 394.12it/s, env_step=760832, len=33, n/ep=2, n/st=64, player_1/loss=495.702, player_2/loss=453.560, rew=592.00]                                                                                                                                                                                  


Epoch #743: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #744: 1025it [00:02, 394.44it/s, env_step=761856, len=41, n/ep=1, n/st=64, player_1/loss=439.185, player_2/loss=358.093, rew=860.00]                                                                                                                                                                                  


Epoch #744: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #745: 1025it [00:02, 392.92it/s, env_step=762880, len=25, n/ep=2, n/st=64, player_1/loss=232.800, player_2/loss=186.327, rew=326.00]                                                                                                                                                                                  


Epoch #745: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #746: 1025it [00:02, 392.82it/s, env_step=763904, len=17, n/ep=3, n/st=64, player_1/loss=221.547, player_2/loss=356.161, rew=177.33]                                                                                                                                                                                  


Epoch #746: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #747: 1025it [00:02, 391.26it/s, env_step=764928, len=29, n/ep=2, n/st=64, player_1/loss=384.754, player_2/loss=276.905, rew=466.00]                                                                                                                                                                                  


Epoch #747: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #748: 1025it [00:02, 393.49it/s, env_step=765952, len=20, n/ep=3, n/st=64, player_1/loss=371.623, player_2/loss=118.218, rew=223.33]                                                                                                                                                                                  


Epoch #748: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #749: 1025it [00:02, 385.04it/s, env_step=766976, len=26, n/ep=2, n/st=64, player_1/loss=436.937, player_2/loss=274.474, rew=354.50]                                                                                                                                                                                  


Epoch #749: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #750: 1025it [00:02, 393.57it/s, env_step=768000, len=35, n/ep=1, n/st=64, player_1/loss=468.138, player_2/loss=457.114, rew=629.00]                                                                                                                                                                                  


Epoch #750: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #751: 1025it [00:02, 391.45it/s, env_step=769024, len=23, n/ep=3, n/st=64, player_1/loss=184.890, player_2/loss=328.974, rew=322.33]                                                                                                                                                                                  


Epoch #751: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #752: 1025it [00:02, 393.00it/s, env_step=770048, len=32, n/ep=3, n/st=64, player_1/loss=572.487, player_2/loss=237.113, rew=546.00]                                                                                                                                                                                  


Epoch #752: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #753: 1025it [00:02, 393.35it/s, env_step=771072, len=29, n/ep=2, n/st=64, player_1/loss=656.559, player_2/loss=351.952, rew=459.00]                                                                                                                                                                                  


Epoch #753: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #754: 1025it [00:02, 393.68it/s, env_step=772096, len=34, n/ep=2, n/st=64, player_1/loss=265.691, player_2/loss=283.880, rew=602.00]                                                                                                                                                                                  


Epoch #754: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #755: 1025it [00:02, 392.06it/s, env_step=773120, len=29, n/ep=2, n/st=64, player_1/loss=369.932, player_2/loss=319.300, rew=455.00]                                                                                                                                                                                  


Epoch #755: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #756: 1025it [00:02, 394.25it/s, env_step=774144, len=28, n/ep=3, n/st=64, player_1/loss=411.775, player_2/loss=488.885, rew=416.33]                                                                                                                                                                                  


Epoch #756: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #757: 1025it [00:02, 393.21it/s, env_step=775168, len=29, n/ep=2, n/st=64, player_1/loss=338.949, player_2/loss=556.027, rew=436.00]                                                                                                                                                                                  


Epoch #757: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #758: 1025it [00:02, 392.76it/s, env_step=776192, len=40, n/ep=1, n/st=64, player_1/loss=392.364, player_2/loss=448.324, rew=819.00]                                                                                                                                                                                  


Epoch #758: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #759: 1025it [00:02, 392.09it/s, env_step=777216, len=26, n/ep=2, n/st=64, player_1/loss=312.469, player_2/loss=298.034, rew=363.50]                                                                                                                                                                                  


Epoch #759: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #760: 1025it [00:02, 391.34it/s, env_step=778240, len=29, n/ep=2, n/st=64, player_1/loss=266.087, player_2/loss=435.027, rew=455.00]                                                                                                                                                                                  


Epoch #760: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #761: 1025it [00:02, 391.52it/s, env_step=779264, len=21, n/ep=3, n/st=64, player_1/loss=383.195, player_2/loss=327.042, rew=247.33]                                                                                                                                                                                  


Epoch #761: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #762: 1025it [00:02, 393.64it/s, env_step=780288, len=30, n/ep=2, n/st=64, player_1/loss=284.252, player_2/loss=250.016, rew=504.50]                                                                                                                                                                                  


Epoch #762: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #763: 1025it [00:02, 393.34it/s, env_step=781312, len=39, n/ep=2, n/st=64, player_1/loss=121.725, player_2/loss=200.643, rew=779.00]                                                                                                                                                                                  


Epoch #763: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #764: 1025it [00:02, 390.19it/s, env_step=782336, len=35, n/ep=2, n/st=64, player_1/loss=374.274, player_2/loss=221.234, rew=629.00]                                                                                                                                                                                  


Epoch #764: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #765: 1025it [00:02, 391.56it/s, env_step=783360, len=32, n/ep=3, n/st=64, player_1/loss=560.037, player_2/loss=187.269, rew=624.00]                                                                                                                                                                                  


Epoch #765: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #766: 1025it [00:02, 393.25it/s, env_step=784384, len=20, n/ep=3, n/st=64, player_1/loss=759.223, player_2/loss=367.957, rew=216.33]                                                                                                                                                                                  


Epoch #766: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #767: 1025it [00:02, 394.32it/s, env_step=785408, len=37, n/ep=2, n/st=64, player_1/loss=492.283, player_2/loss=471.086, rew=706.50]                                                                                                                                                                                  


Epoch #767: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #768: 1025it [00:02, 392.77it/s, env_step=786432, len=30, n/ep=2, n/st=64, player_1/loss=165.869, player_2/loss=216.666, rew=494.50]                                                                                                                                                                                  


Epoch #768: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #769: 1025it [00:02, 392.03it/s, env_step=787456, len=33, n/ep=2, n/st=64, player_1/loss=128.433, player_2/loss=172.476, rew=572.50]                                                                                                                                                                                  


Epoch #769: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #770: 1025it [00:02, 394.31it/s, env_step=788480, len=26, n/ep=3, n/st=64, player_1/loss=389.175, player_2/loss=478.222, rew=350.33]                                                                                                                                                                                  


Epoch #770: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #771: 1025it [00:02, 392.38it/s, env_step=789504, len=30, n/ep=3, n/st=64, player_1/loss=542.635, player_2/loss=500.294, rew=487.67]                                                                                                                                                                                  


Epoch #771: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #772: 1025it [00:02, 392.63it/s, env_step=790528, len=30, n/ep=2, n/st=64, player_1/loss=380.850, player_2/loss=456.617, rew=645.50]                                                                                                                                                                                  


Epoch #772: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #773: 1025it [00:02, 387.84it/s, env_step=791552, len=29, n/ep=3, n/st=64, player_1/loss=261.618, player_2/loss=219.279, rew=477.00]                                                                                                                                                                                  


Epoch #773: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #774: 1025it [00:02, 389.83it/s, env_step=792576, len=29, n/ep=3, n/st=64, player_1/loss=182.996, player_2/loss=196.943, rew=572.67]                                                                                                                                                                                  


Epoch #774: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #775: 1025it [00:02, 385.96it/s, env_step=793600, len=33, n/ep=2, n/st=64, player_1/loss=189.110, player_2/loss=245.356, rew=713.00]                                                                                                                                                                                  


Epoch #775: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #776: 1025it [00:02, 389.40it/s, env_step=794624, len=40, n/ep=2, n/st=64, player_1/loss=145.578, player_2/loss=296.013, rew=839.50]                                                                                                                                                                                  


Epoch #776: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #777: 1025it [00:02, 393.18it/s, env_step=795648, len=29, n/ep=3, n/st=64, player_1/loss=380.800, player_2/loss=417.541, rew=471.33]                                                                                                                                                                                  


Epoch #777: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #778: 1025it [00:02, 394.56it/s, env_step=796672, len=23, n/ep=3, n/st=64, player_1/loss=360.945, player_2/loss=496.705, rew=350.00]                                                                                                                                                                                  


Epoch #778: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #779: 1025it [00:02, 388.02it/s, env_step=797696, len=39, n/ep=2, n/st=64, player_1/loss=317.458, player_2/loss=333.257, rew=802.00]                                                                                                                                                                                  


Epoch #779: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #780: 1025it [00:02, 393.63it/s, env_step=798720, len=24, n/ep=3, n/st=64, player_1/loss=349.924, player_2/loss=235.995, rew=311.33]                                                                                                                                                                                  


Epoch #780: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #781: 1025it [00:02, 390.97it/s, env_step=799744, len=30, n/ep=2, n/st=64, player_1/loss=483.004, player_2/loss=228.019, rew=507.50]                                                                                                                                                                                  


Epoch #781: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #782: 1025it [00:02, 391.61it/s, env_step=800768, len=27, n/ep=2, n/st=64, player_1/loss=413.254, player_2/loss=252.545, rew=379.00]                                                                                                                                                                                  


Epoch #782: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #783: 1025it [00:02, 370.65it/s, env_step=801792, len=30, n/ep=3, n/st=64, player_1/loss=440.480, player_2/loss=441.855, rew=480.00]                                                                                                                                                                                  


Epoch #783: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #784: 1025it [00:02, 405.42it/s, env_step=802816, len=40, n/ep=1, n/st=64, player_1/loss=601.411, player_2/loss=529.598, rew=819.00]                                                                                                                                                                                  


Epoch #784: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #785: 1025it [00:02, 392.65it/s, env_step=803840, len=29, n/ep=2, n/st=64, player_1/loss=552.048, player_2/loss=415.118, rew=434.50]                                                                                                                                                                                  


Epoch #785: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #786: 1025it [00:02, 393.72it/s, env_step=804864, len=30, n/ep=2, n/st=64, player_1/loss=330.904, player_2/loss=274.568, rew=464.00]                                                                                                                                                                                  


Epoch #786: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #787: 1025it [00:02, 403.23it/s, env_step=805888, len=18, n/ep=4, n/st=64, player_1/loss=170.959, player_2/loss=190.011, rew=198.75]                                                                                                                                                                                  


Epoch #787: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #788: 1025it [00:02, 406.78it/s, env_step=806912, len=34, n/ep=2, n/st=64, player_1/loss=427.383, player_2/loss=305.590, rew=614.50]                                                                                                                                                                                  


Epoch #788: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #789: 1025it [00:02, 405.36it/s, env_step=807936, len=34, n/ep=2, n/st=64, player_2/loss=306.257, rew=626.50]                                                                                                                                                                                                         


Epoch #789: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #790: 1025it [00:02, 407.70it/s, env_step=808960, len=35, n/ep=2, n/st=64, player_1/loss=525.236, player_2/loss=207.955, rew=629.50]                                                                                                                                                                                  


Epoch #790: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #791: 1025it [00:02, 405.57it/s, env_step=809984, len=31, n/ep=3, n/st=64, player_1/loss=463.648, player_2/loss=296.871, rew=519.33]                                                                                                                                                                                  


Epoch #791: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #792: 1025it [00:02, 406.61it/s, env_step=811008, len=26, n/ep=2, n/st=64, player_1/loss=369.720, player_2/loss=394.732, rew=374.50]                                                                                                                                                                                  


Epoch #792: test_reward: 594.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #793: 1025it [00:02, 408.52it/s, env_step=812032, len=38, n/ep=1, n/st=64, player_1/loss=528.096, player_2/loss=450.046, rew=740.00]                                                                                                                                                                                  


Epoch #793: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #794: 1025it [00:02, 392.23it/s, env_step=813056, len=29, n/ep=2, n/st=64, player_1/loss=561.803, player_2/loss=605.156, rew=455.00]                                                                                                                                                                                  


Epoch #794: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #795: 1025it [00:02, 407.14it/s, env_step=814080, len=35, n/ep=2, n/st=64, player_1/loss=438.092, player_2/loss=453.077, rew=641.50]                                                                                                                                                                                  


Epoch #795: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #796: 1025it [00:02, 407.15it/s, env_step=815104, len=30, n/ep=2, n/st=64, player_1/loss=215.011, player_2/loss=254.678, rew=480.50]                                                                                                                                                                                  


Epoch #796: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #797: 1025it [00:02, 403.75it/s, env_step=816128, len=30, n/ep=2, n/st=64, player_1/loss=487.947, player_2/loss=155.004, rew=482.00]                                                                                                                                                                                  


Epoch #797: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #798: 1025it [00:02, 407.75it/s, env_step=817152, len=27, n/ep=2, n/st=64, player_1/loss=273.140, player_2/loss=261.328, rew=389.50]                                                                                                                                                                                  


Epoch #798: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #799: 1025it [00:02, 402.28it/s, env_step=818176, len=31, n/ep=2, n/st=64, player_1/loss=191.986, player_2/loss=474.016, rew=517.00]                                                                                                                                                                                  


Epoch #799: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #800: 1025it [00:02, 405.78it/s, env_step=819200, len=22, n/ep=3, n/st=64, player_1/loss=434.979, player_2/loss=517.396, rew=253.33]                                                                                                                                                                                  


Epoch #800: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #801: 1025it [00:02, 408.39it/s, env_step=820224, len=23, n/ep=2, n/st=64, player_1/loss=356.223, player_2/loss=350.763, rew=297.00]                                                                                                                                                                                  


Epoch #801: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #802: 1025it [00:02, 406.94it/s, env_step=821248, len=27, n/ep=2, n/st=64, player_1/loss=238.678, player_2/loss=260.708, rew=394.00]                                                                                                                                                                                  


Epoch #802: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #803: 1025it [00:02, 393.48it/s, env_step=822272, len=24, n/ep=3, n/st=64, player_1/loss=469.943, player_2/loss=423.677, rew=302.00]                                                                                                                                                                                  


Epoch #803: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #804: 1025it [00:02, 394.12it/s, env_step=823296, len=25, n/ep=2, n/st=64, player_1/loss=277.863, player_2/loss=501.936, rew=347.00]                                                                                                                                                                                  


Epoch #804: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #805: 1025it [00:02, 394.42it/s, env_step=824320, len=42, n/ep=1, n/st=64, player_1/loss=265.705, player_2/loss=403.649, rew=1102.00]                                                                                                                                                                                 


Epoch #805: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #806: 1025it [00:02, 391.99it/s, env_step=825344, len=33, n/ep=2, n/st=64, player_1/loss=389.532, player_2/loss=259.503, rew=577.00]                                                                                                                                                                                  


Epoch #806: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #807: 1025it [00:02, 393.70it/s, env_step=826368, len=28, n/ep=3, n/st=64, player_1/loss=309.483, player_2/loss=520.210, rew=409.33]                                                                                                                                                                                  


Epoch #807: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #808: 1025it [00:02, 389.33it/s, env_step=827392, len=31, n/ep=2, n/st=64, player_1/loss=197.362, player_2/loss=581.126, rew=513.00]                                                                                                                                                                                  


Epoch #808: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #809: 1025it [00:02, 391.20it/s, env_step=828416, len=11, n/ep=7, n/st=64, player_1/loss=127.631, player_2/loss=535.214, rew=88.00]                                                                                                                                                                                   


Epoch #809: test_reward: 44.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #810: 1025it [00:02, 392.24it/s, env_step=829440, len=28, n/ep=1, n/st=64, player_1/loss=159.015, player_2/loss=371.930, rew=405.00]                                                                                                                                                                                  


Epoch #810: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #811: 1025it [00:02, 393.80it/s, env_step=830464, len=25, n/ep=2, n/st=64, player_1/loss=217.593, player_2/loss=346.384, rew=338.00]                                                                                                                                                                                  


Epoch #811: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #812: 1025it [00:02, 391.21it/s, env_step=831488, len=25, n/ep=3, n/st=64, player_1/loss=263.788, player_2/loss=352.287, rew=343.33]                                                                                                                                                                                  


Epoch #812: test_reward: 44.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #813: 1025it [00:02, 391.97it/s, env_step=832512, len=28, n/ep=2, n/st=64, player_1/loss=418.171, player_2/loss=415.780, rew=420.50]                                                                                                                                                                                  


Epoch #813: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #814: 1025it [00:02, 394.03it/s, env_step=833536, len=27, n/ep=3, n/st=64, player_1/loss=269.563, player_2/loss=379.019, rew=397.00]                                                                                                                                                                                  


Epoch #814: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #815: 1025it [00:02, 391.74it/s, env_step=834560, len=26, n/ep=3, n/st=64, player_1/loss=263.335, player_2/loss=293.076, rew=367.00]                                                                                                                                                                                  


Epoch #815: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #816: 1025it [00:02, 392.60it/s, env_step=835584, len=29, n/ep=2, n/st=64, player_1/loss=274.903, player_2/loss=346.858, rew=470.00]                                                                                                                                                                                  


Epoch #816: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #817: 1025it [00:02, 394.76it/s, env_step=836608, len=29, n/ep=2, n/st=64, player_1/loss=246.927, player_2/loss=407.837, rew=434.00]                                                                                                                                                                                  


Epoch #817: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #818: 1025it [00:02, 391.37it/s, env_step=837632, len=35, n/ep=2, n/st=64, player_1/loss=136.749, player_2/loss=286.519, rew=657.00]                                                                                                                                                                                  


Epoch #818: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #819: 1025it [00:02, 393.00it/s, env_step=838656, len=39, n/ep=1, n/st=64, player_1/loss=183.547, player_2/loss=236.245, rew=779.00]                                                                                                                                                                                  


Epoch #819: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #820: 1025it [00:02, 395.40it/s, env_step=839680, len=28, n/ep=2, n/st=64, player_1/loss=281.723, player_2/loss=220.952, rew=409.50]                                                                                                                                                                                  


Epoch #820: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #821: 1025it [00:02, 392.02it/s, env_step=840704, len=37, n/ep=1, n/st=64, player_1/loss=390.121, player_2/loss=373.107, rew=702.00]                                                                                                                                                                                  


Epoch #821: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #822: 1025it [00:02, 394.66it/s, env_step=841728, len=36, n/ep=2, n/st=64, player_1/loss=313.416, player_2/loss=484.808, rew=665.50]                                                                                                                                                                                  


Epoch #822: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #823: 1025it [00:02, 389.50it/s, env_step=842752, len=26, n/ep=3, n/st=64, player_1/loss=250.203, player_2/loss=606.340, rew=382.67]                                                                                                                                                                                  


Epoch #823: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #824: 1025it [00:02, 394.06it/s, env_step=843776, len=29, n/ep=3, n/st=64, player_1/loss=324.904, player_2/loss=792.194, rew=477.00]                                                                                                                                                                                  


Epoch #824: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #825: 1025it [00:02, 394.64it/s, env_step=844800, len=30, n/ep=3, n/st=64, player_1/loss=423.993, player_2/loss=813.808, rew=507.67]                                                                                                                                                                                  


Epoch #825: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #826: 1025it [00:02, 382.49it/s, env_step=845824, len=30, n/ep=2, n/st=64, player_1/loss=400.756, player_2/loss=501.750, rew=515.50]                                                                                                                                                                                  


Epoch #826: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #827: 1025it [00:02, 394.27it/s, env_step=846848, len=30, n/ep=2, n/st=64, player_1/loss=196.487, player_2/loss=340.499, rew=507.50]                                                                                                                                                                                  


Epoch #827: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #828: 1025it [00:02, 393.84it/s, env_step=847872, len=28, n/ep=2, n/st=64, player_1/loss=222.733, player_2/loss=547.256, rew=429.50]                                                                                                                                                                                  


Epoch #828: test_reward: 860.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #829: 1025it [00:02, 392.16it/s, env_step=848896, len=30, n/ep=2, n/st=64, player_1/loss=265.937, player_2/loss=388.419, rew=489.50]                                                                                                                                                                                  


Epoch #829: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #830: 1025it [00:02, 393.72it/s, env_step=849920, len=25, n/ep=2, n/st=64, player_1/loss=203.840, player_2/loss=276.491, rew=324.50]                                                                                                                                                                                  


Epoch #830: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #831: 1025it [00:02, 394.86it/s, env_step=850944, len=27, n/ep=2, n/st=64, player_1/loss=302.852, player_2/loss=302.171, rew=379.00]                                                                                                                                                                                  


Epoch #831: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #832: 1025it [00:02, 391.23it/s, env_step=851968, len=36, n/ep=2, n/st=64, player_1/loss=441.674, player_2/loss=310.163, rew=683.50]                                                                                                                                                                                  


Epoch #832: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #833: 1025it [00:02, 392.93it/s, env_step=852992, len=25, n/ep=2, n/st=64, player_1/loss=426.139, player_2/loss=348.428, rew=324.50]                                                                                                                                                                                  


Epoch #833: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #834: 1025it [00:02, 390.88it/s, env_step=854016, len=21, n/ep=3, n/st=64, player_1/loss=431.857, player_2/loss=421.836, rew=239.33]                                                                                                                                                                                  


Epoch #834: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #835: 1025it [00:02, 394.58it/s, env_step=855040, len=30, n/ep=2, n/st=64, player_1/loss=311.669, player_2/loss=528.820, rew=466.00]                                                                                                                                                                                  


Epoch #835: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #836: 1025it [00:02, 395.37it/s, env_step=856064, len=19, n/ep=3, n/st=64, player_2/loss=542.397, rew=219.00]                                                                                                                                                                                                         


Epoch #836: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #837: 1025it [00:02, 393.65it/s, env_step=857088, len=29, n/ep=2, n/st=64, player_1/loss=181.740, player_2/loss=498.658, rew=452.00]                                                                                                                                                                                  


Epoch #837: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #838: 1025it [00:02, 391.74it/s, env_step=858112, len=17, n/ep=4, n/st=64, player_1/loss=192.954, player_2/loss=367.843, rew=154.75]                                                                                                                                                                                  


Epoch #838: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #839: 1025it [00:02, 397.22it/s, env_step=859136, len=24, n/ep=3, n/st=64, player_1/loss=297.369, player_2/loss=418.320, rew=307.33]                                                                                                                                                                                  


Epoch #839: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #840: 1025it [00:02, 403.01it/s, env_step=860160, len=22, n/ep=3, n/st=64, player_1/loss=374.921, player_2/loss=940.635, rew=272.00]                                                                                                                                                                                  


Epoch #840: test_reward: 170.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #841: 1025it [00:02, 395.75it/s, env_step=861184, len=20, n/ep=3, n/st=64, player_1/loss=203.595, player_2/loss=788.248, rew=218.00]                                                                                                                                                                                  


Epoch #841: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #842: 1025it [00:02, 393.32it/s, env_step=862208, len=22, n/ep=3, n/st=64, player_1/loss=52.589, player_2/loss=434.763, rew=271.33]                                                                                                                                                                                   


Epoch #842: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #843: 1025it [00:02, 393.36it/s, env_step=863232, len=31, n/ep=2, n/st=64, player_1/loss=66.264, player_2/loss=244.098, rew=517.00]                                                                                                                                                                                   


Epoch #843: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #844: 1025it [00:02, 389.25it/s, env_step=864256, len=26, n/ep=3, n/st=64, player_1/loss=210.669, player_2/loss=150.858, rew=367.33]                                                                                                                                                                                  


Epoch #844: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #845: 1025it [00:02, 388.27it/s, env_step=865280, len=27, n/ep=2, n/st=64, player_1/loss=306.184, player_2/loss=412.197, rew=381.50]                                                                                                                                                                                  


Epoch #845: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #846: 1025it [00:02, 392.43it/s, env_step=866304, len=34, n/ep=2, n/st=64, player_1/loss=317.432, player_2/loss=491.528, rew=602.00]                                                                                                                                                                                  


Epoch #846: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #847: 1025it [00:02, 394.15it/s, env_step=867328, len=20, n/ep=3, n/st=64, player_1/loss=243.693, player_2/loss=503.428, rew=239.33]                                                                                                                                                                                  


Epoch #847: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #848: 1025it [00:02, 394.65it/s, env_step=868352, len=34, n/ep=2, n/st=64, player_1/loss=218.964, player_2/loss=512.971, rew=596.00]                                                                                                                                                                                  


Epoch #848: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #849: 1025it [00:02, 395.93it/s, env_step=869376, len=21, n/ep=3, n/st=64, player_1/loss=286.296, player_2/loss=275.132, rew=251.33]                                                                                                                                                                                  


Epoch #849: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #850: 1025it [00:02, 394.12it/s, env_step=870400, len=32, n/ep=2, n/st=64, player_2/loss=327.235, rew=545.00]                                                                                                                                                                                                         


Epoch #850: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #851: 1025it [00:02, 391.74it/s, env_step=871424, len=10, n/ep=6, n/st=64, player_1/loss=369.648, player_2/loss=640.698, rew=73.50]                                                                                                                                                                                   


Epoch #851: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #852: 1025it [00:02, 393.65it/s, env_step=872448, len=14, n/ep=5, n/st=64, player_1/loss=465.141, player_2/loss=943.711, rew=110.00]                                                                                                                                                                                  


Epoch #852: test_reward: 152.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #853: 1025it [00:02, 395.16it/s, env_step=873472, len=22, n/ep=2, n/st=64, player_1/loss=666.434, player_2/loss=705.954, rew=252.50]                                                                                                                                                                                  


Epoch #853: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #854: 1025it [00:02, 392.46it/s, env_step=874496, len=24, n/ep=2, n/st=64, player_1/loss=544.382, player_2/loss=424.810, rew=314.50]                                                                                                                                                                                  


Epoch #854: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #855: 1025it [00:02, 391.54it/s, env_step=875520, len=20, n/ep=3, n/st=64, player_1/loss=346.594, player_2/loss=279.596, rew=209.33]                                                                                                                                                                                  


Epoch #855: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #856: 1025it [00:02, 392.97it/s, env_step=876544, len=23, n/ep=3, n/st=64, player_1/loss=312.194, player_2/loss=103.611, rew=286.33]                                                                                                                                                                                  


Epoch #856: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #857: 1025it [00:02, 391.92it/s, env_step=877568, len=24, n/ep=2, n/st=64, player_1/loss=238.165, player_2/loss=206.559, rew=312.50]                                                                                                                                                                                  


Epoch #857: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #858: 1025it [00:02, 394.23it/s, env_step=878592, len=29, n/ep=2, n/st=64, player_1/loss=179.189, player_2/loss=281.082, rew=452.00]                                                                                                                                                                                  


Epoch #858: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #859: 1025it [00:02, 392.61it/s, env_step=879616, len=22, n/ep=3, n/st=64, player_1/loss=197.993, player_2/loss=276.788, rew=305.33]                                                                                                                                                                                  


Epoch #859: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #860: 1025it [00:02, 392.71it/s, env_step=880640, len=29, n/ep=2, n/st=64, player_1/loss=242.776, player_2/loss=393.928, rew=434.00]                                                                                                                                                                                  


Epoch #860: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #861: 1025it [00:02, 395.10it/s, env_step=881664, len=27, n/ep=2, n/st=64, player_1/loss=201.289, player_2/loss=427.877, rew=392.00]                                                                                                                                                                                  


Epoch #861: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #862: 1025it [00:02, 394.40it/s, env_step=882688, len=29, n/ep=2, n/st=64, player_1/loss=177.906, player_2/loss=515.604, rew=434.00]                                                                                                                                                                                  


Epoch #862: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #863: 1025it [00:02, 395.48it/s, env_step=883712, len=34, n/ep=2, n/st=64, player_1/loss=362.827, player_2/loss=482.387, rew=726.00]                                                                                                                                                                                  


Epoch #863: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #864: 1025it [00:02, 391.32it/s, env_step=884736, len=23, n/ep=3, n/st=64, player_1/loss=304.644, player_2/loss=298.708, rew=350.33]                                                                                                                                                                                  


Epoch #864: test_reward: 35.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #865: 1025it [00:02, 394.23it/s, env_step=885760, len=24, n/ep=2, n/st=64, player_1/loss=126.025, player_2/loss=350.432, rew=311.50]                                                                                                                                                                                  


Epoch #865: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #866: 1025it [00:02, 393.24it/s, env_step=886784, len=30, n/ep=2, n/st=64, player_1/loss=195.684, player_2/loss=289.625, rew=500.50]                                                                                                                                                                                  


Epoch #866: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #867: 1025it [00:02, 395.54it/s, env_step=887808, len=26, n/ep=2, n/st=64, player_2/loss=294.184, rew=352.00]                                                                                                                                                                                                         


Epoch #867: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #868: 1025it [00:02, 391.26it/s, env_step=888832, len=21, n/ep=2, n/st=64, player_1/loss=455.597, player_2/loss=448.704, rew=269.00]                                                                                                                                                                                  


Epoch #868: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #869: 1025it [00:02, 395.68it/s, env_step=889856, len=25, n/ep=3, n/st=64, player_1/loss=334.506, player_2/loss=336.919, rew=343.00]                                                                                                                                                                                  


Epoch #869: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #870: 1025it [00:02, 391.16it/s, env_step=890880, len=17, n/ep=4, n/st=64, player_1/loss=183.633, player_2/loss=344.609, rew=187.00]                                                                                                                                                                                  


Epoch #870: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #871: 1025it [00:02, 393.91it/s, env_step=891904, len=33, n/ep=2, n/st=64, player_1/loss=237.853, player_2/loss=260.228, rew=578.00]                                                                                                                                                                                  


Epoch #871: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #872: 1025it [00:02, 394.63it/s, env_step=892928, len=36, n/ep=2, n/st=64, player_1/loss=346.213, player_2/loss=334.200, rew=798.50]                                                                                                                                                                                  


Epoch #872: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #873: 1025it [00:02, 394.27it/s, env_step=893952, len=27, n/ep=2, n/st=64, player_1/loss=547.554, player_2/loss=378.042, rew=397.00]                                                                                                                                                                                  


Epoch #873: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #874: 1025it [00:02, 393.14it/s, env_step=894976, len=16, n/ep=4, n/st=64, player_1/loss=449.131, player_2/loss=521.234, rew=157.50]                                                                                                                                                                                  


Epoch #874: test_reward: 44.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #875: 1025it [00:02, 395.23it/s, env_step=896000, len=26, n/ep=2, n/st=64, player_1/loss=92.030, player_2/loss=474.661, rew=350.50]                                                                                                                                                                                   


Epoch #875: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #876: 1025it [00:02, 391.88it/s, env_step=897024, len=19, n/ep=4, n/st=64, player_1/loss=113.853, player_2/loss=286.052, rew=260.00]                                                                                                                                                                                  


Epoch #876: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #877: 1025it [00:02, 394.22it/s, env_step=898048, len=15, n/ep=3, n/st=64, player_1/loss=236.229, player_2/loss=179.078, rew=131.67]                                                                                                                                                                                  


Epoch #877: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #878: 1025it [00:02, 393.42it/s, env_step=899072, len=30, n/ep=2, n/st=64, player_1/loss=297.757, player_2/loss=229.679, rew=482.00]                                                                                                                                                                                  


Epoch #878: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #879: 1025it [00:02, 394.41it/s, env_step=900096, len=29, n/ep=2, n/st=64, player_1/loss=228.061, player_2/loss=488.519, rew=504.00]                                                                                                                                                                                  


Epoch #879: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #880: 1025it [00:02, 395.32it/s, env_step=901120, len=32, n/ep=2, n/st=64, player_1/loss=350.357, player_2/loss=480.735, rew=527.50]                                                                                                                                                                                  


Epoch #880: test_reward: 104.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #881: 1025it [00:02, 392.50it/s, env_step=902144, len=29, n/ep=2, n/st=64, player_1/loss=296.248, player_2/loss=701.249, rew=449.00]                                                                                                                                                                                  


Epoch #881: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #882: 1025it [00:02, 390.46it/s, env_step=903168, len=26, n/ep=3, n/st=64, player_1/loss=298.715, player_2/loss=698.142, rew=402.00]                                                                                                                                                                                  


Epoch #882: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #883: 1025it [00:02, 390.43it/s, env_step=904192, len=28, n/ep=3, n/st=64, player_1/loss=293.032, player_2/loss=379.242, rew=413.33]                                                                                                                                                                                  


Epoch #883: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #884: 1025it [00:02, 384.36it/s, env_step=905216, len=29, n/ep=2, n/st=64, player_1/loss=251.833, player_2/loss=343.415, rew=485.00]                                                                                                                                                                                  


Epoch #884: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #885: 1025it [00:02, 389.06it/s, env_step=906240, len=29, n/ep=3, n/st=64, player_1/loss=444.892, player_2/loss=436.668, rew=468.00]                                                                                                                                                                                  


Epoch #885: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #886: 1025it [00:02, 396.47it/s, env_step=907264, len=29, n/ep=3, n/st=64, player_1/loss=357.221, player_2/loss=349.242, rew=451.33]                                                                                                                                                                                  


Epoch #886: test_reward: 495.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #887: 1025it [00:02, 392.03it/s, env_step=908288, len=29, n/ep=2, n/st=64, player_1/loss=176.061, player_2/loss=261.842, rew=627.00]                                                                                                                                                                                  


Epoch #887: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #888: 1025it [00:02, 394.71it/s, env_step=909312, len=35, n/ep=2, n/st=64, player_1/loss=334.927, player_2/loss=279.651, rew=662.00]                                                                                                                                                                                  


Epoch #888: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #889: 1025it [00:02, 392.70it/s, env_step=910336, len=22, n/ep=3, n/st=64, player_1/loss=783.271, player_2/loss=378.707, rew=324.33]                                                                                                                                                                                  


Epoch #889: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #890: 1025it [00:02, 395.51it/s, env_step=911360, len=14, n/ep=4, n/st=64, player_1/loss=738.700, player_2/loss=682.595, rew=113.50]                                                                                                                                                                                  


Epoch #890: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #891: 1025it [00:02, 392.40it/s, env_step=912384, len=18, n/ep=4, n/st=64, player_1/loss=445.251, player_2/loss=678.045, rew=206.00]                                                                                                                                                                                  


Epoch #891: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #892: 1025it [00:02, 393.01it/s, env_step=913408, len=21, n/ep=3, n/st=64, player_1/loss=560.871, player_2/loss=714.902, rew=246.00]                                                                                                                                                                                  


Epoch #892: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #893: 1025it [00:02, 393.97it/s, env_step=914432, len=22, n/ep=3, n/st=64, player_1/loss=474.541, player_2/loss=684.951, rew=260.33]                                                                                                                                                                                  


Epoch #893: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #894: 1025it [00:02, 394.16it/s, env_step=915456, len=38, n/ep=1, n/st=64, player_1/loss=563.081, player_2/loss=450.147, rew=740.00]                                                                                                                                                                                  


Epoch #894: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #895: 1025it [00:02, 394.31it/s, env_step=916480, len=22, n/ep=2, n/st=64, player_1/loss=428.549, player_2/loss=267.671, rew=308.50]                                                                                                                                                                                  


Epoch #895: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #896: 1025it [00:02, 395.93it/s, env_step=917504, len=28, n/ep=1, n/st=64, player_1/loss=259.461, player_2/loss=270.649, rew=405.00]                                                                                                                                                                                  


Epoch #896: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #897: 1025it [00:02, 391.65it/s, env_step=918528, len=23, n/ep=3, n/st=64, player_1/loss=425.640, player_2/loss=215.247, rew=335.00]                                                                                                                                                                                  


Epoch #897: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #898: 1025it [00:02, 395.52it/s, env_step=919552, len=33, n/ep=2, n/st=64, player_1/loss=481.031, player_2/loss=267.375, rew=568.00]                                                                                                                                                                                  


Epoch #898: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #899: 1025it [00:02, 393.26it/s, env_step=920576, len=23, n/ep=3, n/st=64, player_1/loss=424.687, player_2/loss=364.692, rew=294.00]                                                                                                                                                                                  


Epoch #899: test_reward: 170.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #900: 1025it [00:02, 392.47it/s, env_step=921600, len=31, n/ep=2, n/st=64, player_1/loss=321.878, player_2/loss=318.394, rew=519.50]                                                                                                                                                                                  


Epoch #900: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #901: 1025it [00:02, 393.17it/s, env_step=922624, len=31, n/ep=3, n/st=64, player_1/loss=214.428, player_2/loss=315.750, rew=523.67]                                                                                                                                                                                  


Epoch #901: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #902: 1025it [00:02, 385.02it/s, env_step=923648, len=29, n/ep=2, n/st=64, player_1/loss=214.906, player_2/loss=231.428, rew=434.50]                                                                                                                                                                                  


Epoch #902: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #903: 1025it [00:02, 393.88it/s, env_step=924672, len=37, n/ep=2, n/st=64, player_1/loss=276.230, player_2/loss=309.977, rew=704.00]                                                                                                                                                                                  


Epoch #903: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #904: 1025it [00:02, 394.19it/s, env_step=925696, len=32, n/ep=2, n/st=64, player_1/loss=241.391, player_2/loss=379.478, rew=558.50]                                                                                                                                                                                  


Epoch #904: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #905: 1025it [00:02, 390.54it/s, env_step=926720, len=28, n/ep=3, n/st=64, player_1/loss=307.832, player_2/loss=445.248, rew=442.33]                                                                                                                                                                                  


Epoch #905: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #906: 1025it [00:02, 395.16it/s, env_step=927744, len=22, n/ep=3, n/st=64, player_1/loss=417.678, player_2/loss=455.143, rew=281.67]                                                                                                                                                                                  


Epoch #906: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #907: 1025it [00:02, 392.63it/s, env_step=928768, len=26, n/ep=3, n/st=64, player_1/loss=403.722, player_2/loss=352.761, rew=352.33]                                                                                                                                                                                  


Epoch #907: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #908: 1025it [00:02, 393.95it/s, env_step=929792, len=34, n/ep=2, n/st=64, player_1/loss=307.765, player_2/loss=266.132, rew=602.00]                                                                                                                                                                                  


Epoch #908: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #909: 1025it [00:02, 391.86it/s, env_step=930816, len=32, n/ep=2, n/st=64, player_1/loss=221.515, player_2/loss=294.003, rew=531.50]                                                                                                                                                                                  


Epoch #909: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #910: 1025it [00:02, 394.55it/s, env_step=931840, len=30, n/ep=2, n/st=64, player_1/loss=388.083, player_2/loss=353.521, rew=464.00]                                                                                                                                                                                  


Epoch #910: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #911: 1025it [00:02, 394.08it/s, env_step=932864, len=32, n/ep=2, n/st=64, player_1/loss=606.407, player_2/loss=351.426, rew=545.00]                                                                                                                                                                                  


Epoch #911: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #912: 1025it [00:02, 394.61it/s, env_step=933888, len=30, n/ep=2, n/st=64, player_1/loss=372.686, player_2/loss=376.262, rew=464.50]                                                                                                                                                                                  


Epoch #912: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #913: 1025it [00:02, 391.64it/s, env_step=934912, len=24, n/ep=3, n/st=64, player_1/loss=207.877, player_2/loss=619.512, rew=299.33]                                                                                                                                                                                  


Epoch #913: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #914: 1025it [00:02, 394.55it/s, env_step=935936, len=32, n/ep=2, n/st=64, player_1/loss=405.982, player_2/loss=450.516, rew=549.50]                                                                                                                                                                                  


Epoch #914: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #915: 1025it [00:02, 392.59it/s, env_step=936960, len=33, n/ep=2, n/st=64, player_1/loss=510.228, player_2/loss=283.675, rew=584.50]                                                                                                                                                                                  


Epoch #915: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #916: 1025it [00:02, 394.73it/s, env_step=937984, len=32, n/ep=2, n/st=64, player_1/loss=449.569, player_2/loss=376.234, rew=543.50]                                                                                                                                                                                  


Epoch #916: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #917: 1025it [00:02, 393.29it/s, env_step=939008, len=41, n/ep=2, n/st=64, player_1/loss=396.113, player_2/loss=397.894, rew=881.00]                                                                                                                                                                                  


Epoch #917: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #918: 1025it [00:02, 395.89it/s, env_step=940032, len=29, n/ep=2, n/st=64, player_1/loss=449.223, player_2/loss=530.404, rew=434.00]                                                                                                                                                                                  


Epoch #918: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #919: 1025it [00:02, 391.97it/s, env_step=941056, len=32, n/ep=2, n/st=64, player_1/loss=487.885, player_2/loss=560.296, rew=546.50]                                                                                                                                                                                  


Epoch #919: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #920: 1025it [00:02, 393.40it/s, env_step=942080, len=26, n/ep=2, n/st=64, player_1/loss=486.888, player_2/loss=459.049, rew=363.50]                                                                                                                                                                                  


Epoch #920: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #921: 1025it [00:02, 392.04it/s, env_step=943104, len=33, n/ep=2, n/st=64, player_1/loss=355.361, player_2/loss=349.930, rew=580.00]                                                                                                                                                                                  


Epoch #921: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #922: 1025it [00:02, 393.12it/s, env_step=944128, len=32, n/ep=2, n/st=64, player_1/loss=375.796, player_2/loss=266.304, rew=558.50]                                                                                                                                                                                  


Epoch #922: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #923: 1025it [00:02, 394.82it/s, env_step=945152, len=26, n/ep=2, n/st=64, player_1/loss=251.498, player_2/loss=349.646, rew=358.00]                                                                                                                                                                                  


Epoch #923: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #924: 1025it [00:02, 392.08it/s, env_step=946176, len=29, n/ep=2, n/st=64, player_1/loss=405.965, player_2/loss=422.065, rew=455.00]                                                                                                                                                                                  


Epoch #924: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #925: 1025it [00:02, 392.47it/s, env_step=947200, len=27, n/ep=2, n/st=64, player_1/loss=477.484, player_2/loss=298.393, rew=449.00]                                                                                                                                                                                  


Epoch #925: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #926: 1025it [00:02, 392.53it/s, env_step=948224, len=35, n/ep=1, n/st=64, player_1/loss=358.058, player_2/loss=170.150, rew=629.00]                                                                                                                                                                                  


Epoch #926: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #927: 1025it [00:02, 394.03it/s, env_step=949248, len=34, n/ep=2, n/st=64, player_1/loss=161.089, player_2/loss=85.480, rew=614.50]                                                                                                                                                                                   


Epoch #927: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #928: 1025it [00:02, 394.83it/s, env_step=950272, len=28, n/ep=3, n/st=64, player_1/loss=196.056, player_2/loss=209.885, rew=406.00]                                                                                                                                                                                  


Epoch #928: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #929: 1025it [00:02, 398.10it/s, env_step=951296, len=15, n/ep=4, n/st=64, player_1/loss=331.771, player_2/loss=464.512, rew=131.25]                                                                                                                                                                                  


Epoch #929: test_reward: 90.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #930: 1025it [00:02, 401.87it/s, env_step=952320, len=34, n/ep=2, n/st=64, player_1/loss=511.981, player_2/loss=563.869, rew=611.50]                                                                                                                                                                                  


Epoch #930: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #931: 1025it [00:02, 392.55it/s, env_step=953344, len=27, n/ep=2, n/st=64, player_1/loss=565.542, player_2/loss=520.092, rew=392.00]                                                                                                                                                                                  


Epoch #931: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #932: 1025it [00:02, 390.98it/s, env_step=954368, len=39, n/ep=2, n/st=64, player_1/loss=295.293, player_2/loss=268.659, rew=902.00]                                                                                                                                                                                  


Epoch #932: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #933: 1025it [00:02, 394.91it/s, env_step=955392, len=38, n/ep=2, n/st=64, player_1/loss=222.683, player_2/loss=357.285, rew=740.00]                                                                                                                                                                                  


Epoch #933: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #934: 1025it [00:02, 393.77it/s, env_step=956416, len=35, n/ep=2, n/st=64, player_1/loss=381.039, player_2/loss=484.543, rew=641.50]                                                                                                                                                                                  


Epoch #934: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #935: 1025it [00:02, 393.72it/s, env_step=957440, len=32, n/ep=2, n/st=64, player_1/loss=452.888, player_2/loss=389.116, rew=527.50]                                                                                                                                                                                  


Epoch #935: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #936: 1025it [00:02, 390.69it/s, env_step=958464, len=26, n/ep=2, n/st=64, player_1/loss=332.467, player_2/loss=489.483, rew=352.00]                                                                                                                                                                                  


Epoch #936: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #937: 1025it [00:02, 394.08it/s, env_step=959488, len=26, n/ep=3, n/st=64, player_1/loss=431.305, player_2/loss=507.310, rew=359.33]                                                                                                                                                                                  


Epoch #937: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #938: 1025it [00:02, 392.77it/s, env_step=960512, len=19, n/ep=3, n/st=64, player_1/loss=554.566, player_2/loss=439.142, rew=207.67]                                                                                                                                                                                  


Epoch #938: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #939: 1025it [00:02, 395.89it/s, env_step=961536, len=19, n/ep=2, n/st=64, player_1/loss=616.260, player_2/loss=602.177, rew=209.00]                                                                                                                                                                                  


Epoch #939: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #940: 1025it [00:02, 393.41it/s, env_step=962560, len=29, n/ep=3, n/st=64, player_1/loss=631.259, player_2/loss=459.358, rew=528.00]                                                                                                                                                                                  


Epoch #940: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #941: 1025it [00:02, 393.44it/s, env_step=963584, len=28, n/ep=2, n/st=64, player_1/loss=576.843, player_2/loss=489.131, rew=447.50]                                                                                                                                                                                  


Epoch #941: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #942: 1025it [00:02, 395.24it/s, env_step=964608, len=32, n/ep=2, n/st=64, player_1/loss=550.983, player_2/loss=540.703, rew=564.50]                                                                                                                                                                                  


Epoch #942: test_reward: 560.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #943: 1025it [00:02, 393.84it/s, env_step=965632, len=29, n/ep=2, n/st=64, player_1/loss=489.003, player_2/loss=420.205, rew=434.50]                                                                                                                                                                                  


Epoch #943: test_reward: 629.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #944: 1025it [00:02, 394.69it/s, env_step=966656, len=38, n/ep=2, n/st=64, player_1/loss=286.137, player_2/loss=382.329, rew=740.00]                                                                                                                                                                                  


Epoch #944: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #945: 1025it [00:02, 393.81it/s, env_step=967680, len=29, n/ep=2, n/st=64, player_2/loss=487.161, rew=627.00]                                                                                                                                                                                                         


Epoch #945: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #946: 1025it [00:02, 394.75it/s, env_step=968704, len=35, n/ep=2, n/st=64, player_1/loss=401.140, player_2/loss=343.754, rew=647.00]                                                                                                                                                                                  


Epoch #946: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #947: 1025it [00:02, 392.98it/s, env_step=969728, len=39, n/ep=1, n/st=64, player_1/loss=276.433, player_2/loss=260.891, rew=779.00]                                                                                                                                                                                  


Epoch #947: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #948: 1025it [00:02, 394.24it/s, env_step=970752, len=28, n/ep=2, n/st=64, player_1/loss=246.297, rew=417.50]                                                                                                                                                                                                         


Epoch #948: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #949: 1025it [00:02, 394.73it/s, env_step=971776, len=32, n/ep=3, n/st=64, player_1/loss=257.525, player_2/loss=359.986, rew=577.67]                                                                                                                                                                                  


Epoch #949: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #950: 1025it [00:02, 395.45it/s, env_step=972800, len=34, n/ep=2, n/st=64, player_1/loss=222.861, player_2/loss=364.601, rew=726.00]                                                                                                                                                                                  


Epoch #950: test_reward: 252.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #951: 1025it [00:02, 391.58it/s, env_step=973824, len=29, n/ep=3, n/st=64, player_1/loss=291.702, player_2/loss=264.625, rew=473.67]                                                                                                                                                                                  


Epoch #951: test_reward: 275.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #952: 1025it [00:02, 394.49it/s, env_step=974848, len=29, n/ep=3, n/st=64, player_1/loss=354.221, player_2/loss=407.244, rew=443.33]                                                                                                                                                                                  


Epoch #952: test_reward: 464.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #953: 1025it [00:02, 393.93it/s, env_step=975872, len=21, n/ep=3, n/st=64, player_1/loss=417.596, player_2/loss=501.159, rew=246.00]                                                                                                                                                                                  


Epoch #953: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #954: 1025it [00:02, 393.43it/s, env_step=976896, len=34, n/ep=2, n/st=64, player_1/loss=482.509, player_2/loss=530.032, rew=617.50]                                                                                                                                                                                  


Epoch #954: test_reward: 1102.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #955: 1025it [00:02, 391.09it/s, env_step=977920, len=33, n/ep=2, n/st=64, player_1/loss=411.150, rew=572.50]                                                                                                                                                                                                         


Epoch #955: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #956: 1025it [00:02, 394.54it/s, env_step=978944, len=27, n/ep=3, n/st=64, player_1/loss=286.313, player_2/loss=453.908, rew=381.00]                                                                                                                                                                                  


Epoch #956: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #957: 1025it [00:02, 389.06it/s, env_step=979968, len=27, n/ep=1, n/st=64, player_1/loss=221.814, player_2/loss=413.551, rew=377.00]                                                                                                                                                                                  


Epoch #957: test_reward: 434.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #958: 1025it [00:02, 392.43it/s, env_step=980992, len=15, n/ep=3, n/st=64, player_1/loss=296.844, player_2/loss=225.997, rew=124.33]                                                                                                                                                                                  


Epoch #958: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #959: 1025it [00:02, 392.80it/s, env_step=982016, len=33, n/ep=2, n/st=64, player_1/loss=358.905, player_2/loss=210.877, rew=564.50]                                                                                                                                                                                  


Epoch #959: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #960: 1025it [00:02, 393.77it/s, env_step=983040, len=33, n/ep=2, n/st=64, player_2/loss=180.541, rew=568.00]                                                                                                                                                                                                         


Epoch #960: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #961: 1025it [00:02, 392.87it/s, env_step=984064, len=28, n/ep=2, n/st=64, player_1/loss=406.781, player_2/loss=373.575, rew=413.00]                                                                                                                                                                                  


Epoch #961: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #962: 1025it [00:02, 394.16it/s, env_step=985088, len=29, n/ep=2, n/st=64, player_1/loss=423.285, player_2/loss=341.669, rew=436.00]                                                                                                                                                                                  


Epoch #962: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #963: 1025it [00:02, 390.88it/s, env_step=986112, len=24, n/ep=2, n/st=64, player_1/loss=201.649, player_2/loss=171.613, rew=301.00]                                                                                                                                                                                  


Epoch #963: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #964: 1025it [00:02, 394.44it/s, env_step=987136, len=28, n/ep=2, n/st=64, player_1/loss=393.348, player_2/loss=329.101, rew=405.50]                                                                                                                                                                                  


Epoch #964: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #965: 1025it [00:02, 391.99it/s, env_step=988160, len=24, n/ep=2, n/st=64, player_1/loss=538.562, player_2/loss=413.701, rew=332.50]                                                                                                                                                                                  


Epoch #965: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #966: 1025it [00:02, 390.30it/s, env_step=989184, len=37, n/ep=2, n/st=64, player_1/loss=349.742, player_2/loss=393.516, rew=702.00]                                                                                                                                                                                  


Epoch #966: test_reward: 299.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #967: 1025it [00:02, 390.53it/s, env_step=990208, len=32, n/ep=2, n/st=64, player_1/loss=87.632, player_2/loss=346.121, rew=531.50]                                                                                                                                                                                   


Epoch #967: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #968: 1025it [00:02, 393.67it/s, env_step=991232, len=32, n/ep=2, n/st=64, player_1/loss=164.884, player_2/loss=683.200, rew=571.50]                                                                                                                                                                                  


Epoch #968: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #969: 1025it [00:02, 392.04it/s, env_step=992256, len=24, n/ep=2, n/st=64, player_1/loss=313.329, player_2/loss=949.870, rew=402.50]                                                                                                                                                                                  


Epoch #969: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #970: 1025it [00:02, 393.57it/s, env_step=993280, len=24, n/ep=3, n/st=64, player_1/loss=343.087, player_2/loss=467.726, rew=384.00]                                                                                                                                                                                  


Epoch #970: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #971: 1025it [00:02, 394.60it/s, env_step=994304, len=31, n/ep=2, n/st=64, player_1/loss=342.679, player_2/loss=313.585, rew=532.00]                                                                                                                                                                                  


Epoch #971: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #972: 1025it [00:02, 390.77it/s, env_step=995328, len=9, n/ep=7, n/st=64, player_1/loss=654.099, player_2/loss=433.528, rew=52.00]                                                                                                                                                                                    


Epoch #972: test_reward: 27.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #973: 1025it [00:02, 391.98it/s, env_step=996352, len=19, n/ep=4, n/st=64, player_1/loss=739.581, player_2/loss=400.578, rew=189.00]                                                                                                                                                                                  


Epoch #973: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #974: 1025it [00:02, 390.55it/s, env_step=997376, len=22, n/ep=3, n/st=64, player_1/loss=506.231, player_2/loss=399.116, rew=269.00]                                                                                                                                                                                  


Epoch #974: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #975: 1025it [00:02, 407.42it/s, env_step=998400, len=21, n/ep=3, n/st=64, player_1/loss=543.114, player_2/loss=526.940, rew=239.33]                                                                                                                                                                                  


Epoch #975: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #976: 1025it [00:02, 396.34it/s, env_step=999424, len=25, n/ep=3, n/st=64, player_1/loss=517.911, player_2/loss=502.188, rew=348.33]                                                                                                                                                                                  


Epoch #976: test_reward: 230.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #977: 1025it [00:02, 391.89it/s, env_step=1000448, len=21, n/ep=3, n/st=64, player_1/loss=338.649, player_2/loss=583.539, rew=230.33]                                                                                                                                                                                 


Epoch #977: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #978: 1025it [00:02, 392.53it/s, env_step=1001472, len=23, n/ep=2, n/st=64, player_1/loss=163.733, player_2/loss=570.741, rew=288.00]                                                                                                                                                                                 


Epoch #978: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #979: 1025it [00:02, 393.45it/s, env_step=1002496, len=23, n/ep=3, n/st=64, player_1/loss=225.252, player_2/loss=384.081, rew=291.00]                                                                                                                                                                                 


Epoch #979: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #980: 1025it [00:02, 392.57it/s, env_step=1003520, len=20, n/ep=3, n/st=64, player_1/loss=306.223, player_2/loss=637.604, rew=257.00]                                                                                                                                                                                 


Epoch #980: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #981: 1025it [00:02, 392.61it/s, env_step=1004544, len=20, n/ep=3, n/st=64, player_1/loss=332.192, player_2/loss=449.412, rew=223.67]                                                                                                                                                                                 


Epoch #981: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #982: 1025it [00:02, 393.50it/s, env_step=1005568, len=34, n/ep=2, n/st=64, player_1/loss=370.395, player_2/loss=259.987, rew=618.50]                                                                                                                                                                                 


Epoch #982: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #983: 1025it [00:02, 392.25it/s, env_step=1006592, len=32, n/ep=3, n/st=64, player_1/loss=481.971, player_2/loss=412.913, rew=561.00]                                                                                                                                                                                 


Epoch #983: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #984: 1025it [00:02, 394.97it/s, env_step=1007616, len=23, n/ep=3, n/st=64, player_1/loss=394.005, player_2/loss=468.002, rew=307.33]                                                                                                                                                                                 


Epoch #984: test_reward: 665.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #985: 1025it [00:02, 393.26it/s, env_step=1008640, len=32, n/ep=3, n/st=64, player_1/loss=368.815, player_2/loss=457.955, rew=570.00]                                                                                                                                                                                 


Epoch #985: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #986: 1025it [00:02, 393.43it/s, env_step=1009664, len=25, n/ep=2, n/st=64, player_1/loss=418.768, player_2/loss=423.340, rew=328.50]                                                                                                                                                                                 


Epoch #986: test_reward: 527.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #987: 1025it [00:02, 394.00it/s, env_step=1010688, len=28, n/ep=3, n/st=64, player_1/loss=388.223, player_2/loss=381.454, rew=457.67]                                                                                                                                                                                 


Epoch #987: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #988: 1025it [00:02, 393.38it/s, env_step=1011712, len=28, n/ep=2, n/st=64, player_1/loss=257.074, player_2/loss=320.163, rew=405.00]                                                                                                                                                                                 


Epoch #988: test_reward: 405.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #989: 1025it [00:02, 389.44it/s, env_step=1012736, len=28, n/ep=2, n/st=64, player_1/loss=154.872, player_2/loss=427.842, rew=405.50]                                                                                                                                                                                 


Epoch #989: test_reward: 324.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #990: 1025it [00:02, 393.24it/s, env_step=1013760, len=29, n/ep=3, n/st=64, player_1/loss=165.704, player_2/loss=312.230, rew=435.33]                                                                                                                                                                                 


Epoch #990: test_reward: 350.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #991: 1025it [00:02, 393.00it/s, env_step=1014784, len=38, n/ep=1, n/st=64, player_1/loss=155.185, player_2/loss=323.290, rew=740.00]                                                                                                                                                                                 


Epoch #991: test_reward: 702.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #992: 1025it [00:02, 393.94it/s, env_step=1015808, len=39, n/ep=2, n/st=64, player_1/loss=260.103, player_2/loss=550.569, rew=779.00]                                                                                                                                                                                 


Epoch #992: test_reward: 779.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #993: 1025it [00:02, 391.69it/s, env_step=1016832, len=23, n/ep=3, n/st=64, player_1/loss=277.010, player_2/loss=533.872, rew=294.00]                                                                                                                                                                                 


Epoch #993: test_reward: 377.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #994: 1025it [00:02, 392.90it/s, env_step=1017856, len=37, n/ep=2, n/st=64, player_1/loss=230.946, player_2/loss=498.285, rew=727.00]                                                                                                                                                                                 


Epoch #994: test_reward: 819.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #995: 1025it [00:02, 393.21it/s, env_step=1018880, len=27, n/ep=2, n/st=64, player_1/loss=172.644, player_2/loss=391.347, rew=446.00]                                                                                                                                                                                 


Epoch #995: test_reward: 740.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #996: 1025it [00:02, 391.87it/s, env_step=1019904, len=22, n/ep=3, n/st=64, player_2/loss=171.078, rew=283.00]                                                                                                                                                                                                        


Epoch #996: test_reward: 119.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #997: 1025it [00:02, 393.03it/s, env_step=1020928, len=29, n/ep=2, n/st=64, player_1/loss=322.937, player_2/loss=270.545, rew=485.00]                                                                                                                                                                                 


Epoch #997: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #998: 1025it [00:02, 394.55it/s, env_step=1021952, len=22, n/ep=3, n/st=64, player_1/loss=315.949, player_2/loss=317.354, rew=260.00]                                                                                                                                                                                 


Epoch #998: test_reward: 189.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


Epoch #999: 1025it [00:02, 389.74it/s, env_step=1022976, len=19, n/ep=3, n/st=64, player_1/loss=73.239, player_2/loss=298.779, rew=202.33]                                                                                                                                                                                  


Epoch #999: test_reward: 209.000000 ± 0.000000, best_reward: 1102.000000 ± 0.000000 in #482


In [23]:
####################################################
# EXPERIMENT: VIEWING THE BEST LEARNED POLICY
####################################################

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure the best agent
best_agent1 = cf_custom_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape)
best_agent1.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/best_policy_agent1.pth"))
best_agent1.set_eps(0)


best_agent2 = cf_custom_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape)
best_agent2.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/best_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= best_agent1,
      agent_player2= best_agent2)



Average steps of game:  25.666666666666668
Final mean reward agent 1: 171.66666666666666, std: 18.856180831641268
Final mean reward agent 2: 175.0, std: 63.63961030678928


In [24]:
####################################################
# EXPERIMENT: VIEWING THE LAST LEARNED POLICY
####################################################

# Configure the final agent
final_agent_player1 = cf_custom_dqn_policy(state_shape= state_shape,
                                           action_shape= action_shape)
final_agent_player1.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/final_policy_agent1.pth"))
best_agent1.set_eps(0)

final_agent_player2 = cf_custom_dqn_policy(state_shape= state_shape,
                                           action_shape= action_shape)
final_agent_player2.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/final_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= final_agent_player1,
      agent_player2= final_agent_player2)



Average steps of game:  20.666666666666668
Final mean reward agent 1: 92.33333333333333, std: 10.370899457402697
Final mean reward agent 2: 131.0, std: 9.899494936611665


<hr><hr>

## Discussion

We see that the agent can learn quickly to win against a fixed strategy oponent but the overall performance of the agent is still weak, making human play of very poor quality once again.

In [None]:
####################################################
# CLEAN VARIABLES
####################################################

del action_shape
del agent1
del agent2
del best_agent1
del best_agent2
del env
del final_agent_player1
del final_agent_player2
del observation_space
del off_policy_traininer_results
del state_shape
