# Improving DQN architecture

The previous two notebooks, `3-learning-connect-four-dqn-agents-tianshou.ipynb` and `4-improving-dqn-agents.ipynb`, have shown that there is an improvement to be made to make the agents truly interesting.
The issue probably lies in a combination of at least three things:
- Training two DQN agents simultaneously is known to be though, especially when starting from a random initialisation
- The network used was a simple MLP
- The training is not done over enough iterations

A solution to the first issue would be to fix one agent's policy for several episodes whilst training the other and switching back and forth between the agents.
Instead of an MLP, a CNN based approach could be used, which makes sense in this setting.
Once somewhat promising results emerge, training time can be increased given it is still manageable on average consumer-grade hardware.

In this notebook we will try to use a custom NN for training the DQN.

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training two DQN agents on connect four Gym
  - Building the environment
  - Implementing the DQN policy
  - Building agents
  - Function for letting agents learn
  - Function for watching learned agent
  - Doing the experiment
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Argparser
import argparse

# More data types
import typing
import numpy as np

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.11.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
importlib.invalidate_caches()
importlib.reload(cfgym)

# Time for allowing "freezes" in execution
import time;

# Allow for copying objects in a non reference manner
import copy

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0
Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.11.0 recommended): 1.12.0.dev20220520+cu116


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [4]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device
print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

# Show cuda device name
print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training two DQN agents on connect four Gym

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, agents will always play as the same player, e.g. always as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Building the environment

This code is taken from previous notebooks.
We don't allow invalid moves to make the problem easier for now.

In [5]:
####################################################
# CONNECT FOUR V2 ENVIRONMENT
####################################################

def get_env():
    """
    Returns the connect four gym environment V2 altered for Tianshou and Petting Zoo compatibility.
    Already wrapped with a ts.env.PettingZooEnv wrapper.
    """
    return ts.env.PettingZooEnv(cfgym.env(reward_move= 0, # Set to 1 for reward to make moves (incentivise longer games)
                                          reward_invalid= -3,
                                          reward_draw= 15,
                                          reward_win= 25,
                                          reward_loss= -25,
                                          allow_invalid_move= False))
    
    
# Test the environment
env = get_env()
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation
del env

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


<hr>

### Implementing the DQN policy

The DQN policy for the agent is configured and set up below.
This is identical to the previous notebook.

In [6]:
####################################################
# DQN ARCHITECTURE
####################################################

class CustomDQN(torch.nn.Module):
    """
    Custom DQN using a model based on CNN
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',):
        # Parent call
        super().__init__()
        
        # Save device (e.g. cuda)
        self.device = device
        
        self.model = torch.nn.Sequential(
            torch.nn.Linear(np.prod(state_shape), 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, np.prod(action_shape)),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
        batch = obs.shape[0]
        logits = self.model(obs.view(batch, -1))
        return logits, state


In [7]:
####################################################
# DQN POLICY
####################################################

def cf_dqn_policy(state_shape: tuple,
                  action_shape: tuple,
                  optim: typing.Optional[torch.optim.Optimizer] = None,
                  learning_rate: float =  0.0001,
                  gamma: float = 0.9, # Smaller gamma favours "faster" win
                  n_step: int = 1, # Number of steps to look ahead
                  target_update_freq: int = 320):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    net = CustomDQN(state_shape, action_shape, device= device).to(device)
    
    # Default optimizer is an adam optimizer with the argparser learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)

<hr>

### Building agents

Identical to the previous notebook.

In [8]:
####################################################
# AGENT CREATION
####################################################

def get_agents(agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
               agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
               optim: typing.Optional[torch.optim.Optimizer] = None,
               resume_path_player_1: str = '', # Path to file to resume agent training from
               resume_path_player_2: str = '', 
               ) -> typing.Tuple[ts.policy.BasePolicy, torch.optim.Optimizer, list]:
    """
    Gets a multi agent policy manager, optimizer and player ids for the connect four V2 gym environment.
    Per default this returns 
        - Multi agent manager for 2 agents using DQN
        - Adam optimizer
        - ['player_1', 'player_2'] from the connect four environment
    """
    
    # Get the environment to play in (Connect four gym V2)
    env = get_env()
    
    # Get the observation space from the environment, depending on typo of space (ternary operator)
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    
    # Set the arguments
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent player 1 to be a DQN if no policy is passed.
    if agent_player1 is None:
        # Our agent1 uses a DQN policy
        agent_player1 = cf_dqn_policy(state_shape= state_shape,
                                      action_shape= action_shape,
                                      optim= optim)
        
        # If we resume our agent we need to load the previous config
        if resume_path_player_1:
            agent_player1.load_state_dict(torch.load(resume_path_player_1))
    
    # Configure agent player 2 to be a DQN if no policy is passed.
    if agent_player2 is None:
        # Our agent1 uses a DQN policy
        agent_player2 = cf_dqn_policy(state_shape= state_shape,
                                      action_shape= action_shape,
                                      optim= optim)
        
        # If we resume our agent we need to load the previous config
        if resume_path_player_2:
            agent_player2.load_state_dict(torch.load(resume_path_player_2))

    # Both our agents are DQN agents by default
    agents = [agent_player1, agent_player2]
        
    # Our policy depends on the order of the agents
    policy = ts.policy.MultiAgentPolicyManager(agents, env)
    
    # Return our policy, optimizer and the available agents in the environment
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    
    return policy, optim, env.agents

<hr>

### Function for letting agents learn

Identical to the previous notebook.

In [9]:
####################################################
# AGENT TRAINING
####################################################

def train_agent(filename: str = "dqn_vs_dqn",
                agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                optim: typing.Optional[torch.optim.Optimizer] = None,
                training_env_num: int = 1,
                testing_env_num: int = 1,
                buffer_size: int = 2^14,
                batch_size: int = 64, #64
                epochs: int = 50, #50
                step_per_epoch: int = 1024, #1024
                step_per_collect: int = 64, # transition before update
                update_per_step: float = 0.1,
                testing_eps: float = 0.05,
                training_eps: float = 0.1,
                ) -> typing.Tuple[dict, ts.policy.BasePolicy]:
    """
    Trains two agents in the connect four V2 environment and saves their best model and logs.
    Returns:
        - result from offpolicy_trainer
        - final version of agent 1
        - final version of agent 2
    """

    # ======== notebook specific =========
    notebook_version = '5' # Used for foldering logs and models

    # ======== environment setup =========
    train_envs = ts.env.DummyVectorEnv([get_env for _ in range(training_env_num)])
    test_envs = ts.env.DummyVectorEnv([get_env for _ in range(testing_env_num)])
    
    # set the seed for reproducibility
    np.random.seed(1998)
    torch.manual_seed(1998)
    train_envs.seed(1998)
    test_envs.seed(1998)

    # ======== agent setup =========
    # Gets our agents from the previously made function
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    policy, optim, agents = get_agents(agent_player1=agent_player1,
                                       agent_player2=agent_player2,
                                       optim=optim)

    # ======== collector setup =========
    # Make a collector for the training environments
    train_collector = ts.data.Collector(policy= policy,
                                        env= train_envs,
                                        buffer= ts.data.VectorReplayBuffer(buffer_size, len(train_envs)),
                                        exploration_noise= True)
    
    # Make a collector for the testing environments
    test_collector = ts.data.Collector(policy= policy,
                                       env= train_envs,
                                       exploration_noise= True)
    
    # Uncomment below if you want to set epsilon in epsilon policy
    # policy.set_eps(1)
    
    # Collect data fot the training evnironments
    train_collector.collect(n_step= batch_size * training_env_num)
    
    # ======== ensure folders exist =========
    if not os.path.exists(os.path.join('./logs', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./logs', 'paper_notebooks', notebook_version, filename))
    if not os.path.exists(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename))

    # ======== tensorboard logging setup =========
    # Allows to save the training progress to tensorboard compatable logs
    log_path = os.path.join('./logs', 'paper_notebooks', notebook_version, filename)
    writer = torch.utils.tensorboard.SummaryWriter(log_path)
    logger = ts.utils.TensorboardLogger(writer)

    # ======== callback functions used during training =========
    # We want to save our best policy
    def save_best_fn(policy):
        """
        Callback to save the best model
        """
        # Save best agent 1
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent1.pth')
        torch.save(policy.policies[agents[0]].state_dict(), model_save_path)
        
        # Save best agent 2
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent2.pth')
        torch.save(policy.policies[agents[1]].state_dict(), model_save_path)
        
        # Save agent2

    def stop_fn(mean_rewards):
        """
        Callback to stop training when we've reached the win rate
        """
        return mean_rewards >= 7 # (win = 10, 70% win without invalid moves = mean of 7)

    def train_fn(epoch, env_step):
        """
        Callback before training
        """        
        # Before training we want to configure the epsilon for the agents
        # In general more exploratory than the test case
        policy.policies[agents[0]].set_eps(training_eps)
        policy.policies[agents[1]].set_eps(training_eps)

    def test_fn(epoch, env_step):
        """
        Callback beore testing
        """        
        # Before testing we want to configure the epsilon for the agents
        # In general more greedy than the train case but not
        #   to avoid getting stuck on invalid moves
        policy.policies[agents[0]].set_eps(testing_eps)

    def reward_metric(rews):
        """
        Callback for reward collection
        """
        # We are interested in having a high total total reward,
        #   as this would mean equally good agents.
        return rews[:, 0] + rews[:, 1]

    # trainer
    result = ts.trainer.offpolicy_trainer(policy= policy,
                                          train_collector= train_collector,
                                          test_collector= test_collector,
                                          max_epoch= epochs,
                                          step_per_epoch= step_per_epoch,
                                          step_per_collect= step_per_collect,
                                          episode_per_test= testing_env_num,
                                          batch_size= batch_size,
                                          train_fn= train_fn,
                                          test_fn= test_fn,
                                          # Stop function to stop before specified amount of epochs
                                          #stop_fn= stop_fn
                                          save_best_fn= save_best_fn,
                                          update_per_step= update_per_step,
                                          logger= logger,
                                          test_in_train= False,
                                          reward_metric= reward_metric)
    
    # Save final agent 1
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent1.pth')
    torch.save(policy.policies[agents[0]].state_dict(), model_save_path)

    # Save final agent 2
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent2.pth')
    torch.save(policy.policies[agents[1]].state_dict(), model_save_path)

    return result, policy.policies[agents[0]], policy.policies[agents[1]]

<hr>

### Function for watching learned agent

Identical to the previous notebook.

In [10]:
####################################################
# WATCHING THE LEARNED POLICY IN ACTION
####################################################

def watch(numer_of_games: int = 3,
          agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
          agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
          test_epsilon: float = 0.05, # For the watching we act completely greedy but low random for not getting stuck on invalid move
          render_speed: float = 0.15, # Amount of seconds to update frame/ do a step
          ) -> None:
    
    # Get the connect four V2 environment (must be a list)
    env= ts.env.DummyVectorEnv([get_env])
    
    # Get the agents from the trained agents
    policy, optim, agents = get_agents(agent_player1= agent_player1,
                                       agent_player2= agent_player2)
    
    # Evaluate the policy
    policy.eval()
    
    # Set the testing policy epsilon for our agents
    policy.policies[agents[0]].set_eps(test_epsilon)
    policy.policies[agents[1]].set_eps(test_epsilon)
    
    # Collect the test data
    collector = ts.data.Collector(policy= policy,
                                  env= env,
                                  exploration_noise= True)
    
    # Render games in human mode to see how it plays
    result = collector.collect(n_episode= numer_of_games, render= render_speed)
    
    # Close the environment aftering collecting the results
    # This closes the pygame window after completion
    env.close()
    
    # Get the rewards and length from the test trials
    rewards, length = result["rews"], result["lens"]
    
    # Print the final reward for the first agent
    print(f"Average steps of game:  {length.mean()}")
    print(f"Final mean reward agent 1: {rewards[:, 0].mean()}, std: {rewards[:, 0].std()}")
    print(f"Final mean reward agent 2: {rewards[:, 1].mean()}, std: {rewards[:, 1].std()}")

<hr>

### Doing the experiment

We now do the experiment with using our previously created functions.
We update some parameter settings to find if we can improve our DQN agents.

In [10]:
####################################################
# EXPERIMENT: TRAINING AGENTS
####################################################

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure the agents
agent1 = cf_dqn_policy(state_shape= state_shape,
                       action_shape= action_shape,
                       gamma= 0.95, # Favour shorter solutions if small
                       n_step= 6)


agent2 = cf_dqn_policy(state_shape= state_shape,
                       action_shape= action_shape,
                       gamma= 0.95, # Favour shorter solutions if small
                       n_step= 6)

# Train the agent
off_policy_traininer_results, final_agent_player1, final_agent_player2 = train_agent(epochs= 1000,
                                                                                     filename= "dqn_vs_dqn_no_move_reward",
                                                                                     training_eps= 0.2)

Epoch #1: 1025it [00:03, 311.05it/s, env_step=1024, len=8, n/ep=7, n/st=64, player_1/loss=95.106, player_2/loss=29.814, rew=0.00]


Epoch #1: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 424.29it/s, env_step=2048, len=9, n/ep=7, n/st=64, player_1/loss=78.133, player_2/loss=53.710, rew=0.00]


Epoch #2: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 421.31it/s, env_step=3072, len=8, n/ep=7, n/st=64, player_1/loss=54.524, player_2/loss=62.461, rew=0.00]


Epoch #3: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #4: 1025it [00:02, 420.80it/s, env_step=4096, len=7, n/ep=8, n/st=64, player_1/loss=48.256, player_2/loss=29.000, rew=0.00]


Epoch #4: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #5: 1025it [00:02, 419.36it/s, env_step=5120, len=11, n/ep=5, n/st=64, player_1/loss=27.549, player_2/loss=11.180, rew=0.00]


Epoch #5: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #6: 1025it [00:02, 417.24it/s, env_step=6144, len=7, n/ep=9, n/st=64, player_1/loss=34.553, player_2/loss=27.511, rew=0.00]


Epoch #6: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #7: 1025it [00:02, 370.49it/s, env_step=7168, len=8, n/ep=7, n/st=64, player_1/loss=36.894, player_2/loss=31.725, rew=0.00]


Epoch #7: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #8: 1025it [00:02, 418.32it/s, env_step=8192, len=10, n/ep=6, n/st=64, player_1/loss=26.122, player_2/loss=6.978, rew=0.00]


Epoch #8: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #9: 1025it [00:02, 414.95it/s, env_step=9216, len=10, n/ep=6, n/st=64, player_1/loss=23.621, player_2/loss=1.448, rew=0.00]


Epoch #9: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #10: 1025it [00:02, 377.10it/s, env_step=10240, len=15, n/ep=4, n/st=64, player_1/loss=26.390, player_2/loss=5.426, rew=0.00]


Epoch #10: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #11: 1025it [00:02, 420.08it/s, env_step=11264, len=7, n/ep=8, n/st=64, player_1/loss=19.458, player_2/loss=43.371, rew=0.00]


Epoch #11: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #12: 1025it [00:02, 401.08it/s, env_step=12288, len=7, n/ep=9, n/st=64, player_1/loss=15.791, player_2/loss=35.784, rew=0.00]


Epoch #12: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #13: 1025it [00:02, 405.08it/s, env_step=13312, len=7, n/ep=8, n/st=64, player_1/loss=18.389, player_2/loss=27.638, rew=0.00]


Epoch #13: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #14: 1025it [00:02, 384.24it/s, env_step=14336, len=7, n/ep=8, n/st=64, player_1/loss=14.880, player_2/loss=16.686, rew=0.00]


Epoch #14: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #15: 1025it [00:02, 367.05it/s, env_step=15360, len=7, n/ep=9, n/st=64, player_1/loss=9.752, player_2/loss=9.431, rew=0.00]


Epoch #15: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #16: 1025it [00:02, 359.17it/s, env_step=16384, len=8, n/ep=8, n/st=64, player_1/loss=6.954, player_2/loss=6.517, rew=0.00]


Epoch #16: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #17: 1025it [00:02, 356.71it/s, env_step=17408, len=9, n/ep=7, n/st=64, player_1/loss=10.668, player_2/loss=7.651, rew=0.00]


Epoch #17: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #18: 1025it [00:02, 356.21it/s, env_step=18432, len=11, n/ep=6, n/st=64, player_1/loss=2.287, player_2/loss=5.966, rew=0.00]


Epoch #18: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #19: 1025it [00:02, 356.43it/s, env_step=19456, len=9, n/ep=7, n/st=64, player_1/loss=5.076, player_2/loss=4.794, rew=0.00]


Epoch #19: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #20: 1025it [00:02, 357.48it/s, env_step=20480, len=8, n/ep=8, n/st=64, player_1/loss=14.015, player_2/loss=1.985, rew=0.00]


Epoch #20: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #21: 1025it [00:02, 356.50it/s, env_step=21504, len=8, n/ep=8, n/st=64, player_1/loss=5.230, player_2/loss=8.238, rew=0.00]


Epoch #21: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #22: 1025it [00:02, 351.62it/s, env_step=22528, len=7, n/ep=8, n/st=64, player_1/loss=5.120, player_2/loss=4.977, rew=0.00]


Epoch #22: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #23: 1025it [00:02, 351.32it/s, env_step=23552, len=7, n/ep=8, n/st=64, player_1/loss=3.041, player_2/loss=3.747, rew=0.00]


Epoch #23: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #24: 1025it [00:02, 374.46it/s, env_step=24576, len=9, n/ep=7, n/st=64, player_1/loss=2.756, player_2/loss=5.264, rew=0.00]


Epoch #24: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #25: 1025it [00:02, 413.32it/s, env_step=25600, len=9, n/ep=7, n/st=64, player_1/loss=2.818, player_2/loss=2.964, rew=0.00]


Epoch #25: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #26: 1025it [00:02, 407.26it/s, env_step=26624, len=9, n/ep=7, n/st=64, player_1/loss=1.950, player_2/loss=3.991, rew=0.00]


Epoch #26: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #27: 1025it [00:02, 402.37it/s, env_step=27648, len=8, n/ep=8, n/st=64, player_1/loss=1.372, player_2/loss=2.985, rew=0.00]


Epoch #27: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #28: 1025it [00:02, 409.75it/s, env_step=28672, len=8, n/ep=8, n/st=64, player_1/loss=1.419, player_2/loss=3.157, rew=0.00]


Epoch #28: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #29: 1025it [00:02, 400.32it/s, env_step=29696, len=8, n/ep=8, n/st=64, player_1/loss=1.821, player_2/loss=2.599, rew=0.00]


Epoch #29: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #30: 1025it [00:02, 354.97it/s, env_step=30720, len=7, n/ep=8, n/st=64, player_1/loss=2.167, player_2/loss=1.252, rew=0.00]


Epoch #30: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #31: 1025it [00:02, 414.54it/s, env_step=31744, len=8, n/ep=7, n/st=64, player_1/loss=1.699, player_2/loss=3.704, rew=0.00]


Epoch #31: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #32: 1025it [00:02, 416.56it/s, env_step=32768, len=8, n/ep=7, n/st=64, player_1/loss=2.065, player_2/loss=3.815, rew=0.00]


Epoch #32: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #33: 1025it [00:02, 402.79it/s, env_step=33792, len=7, n/ep=9, n/st=64, player_1/loss=1.472, player_2/loss=2.628, rew=0.00]


Epoch #33: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #34: 1025it [00:02, 357.89it/s, env_step=34816, len=7, n/ep=8, n/st=64, player_1/loss=2.782, player_2/loss=2.369, rew=0.00]


Epoch #34: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #35: 1025it [00:02, 356.69it/s, env_step=35840, len=9, n/ep=7, n/st=64, player_1/loss=2.490, player_2/loss=2.720, rew=0.00]


Epoch #35: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #36: 1025it [00:02, 341.81it/s, env_step=36864, len=9, n/ep=7, n/st=64, player_1/loss=2.668, player_2/loss=2.264, rew=0.00]


Epoch #36: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #37: 1025it [00:02, 376.71it/s, env_step=37888, len=8, n/ep=8, n/st=64, player_1/loss=0.753, player_2/loss=2.082, rew=0.00]


Epoch #37: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #38: 1025it [00:02, 376.90it/s, env_step=38912, len=9, n/ep=7, n/st=64, player_1/loss=1.627, player_2/loss=2.171, rew=0.00]


Epoch #38: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #39: 1025it [00:02, 363.40it/s, env_step=39936, len=9, n/ep=7, n/st=64, player_1/loss=1.684, player_2/loss=1.813, rew=0.00]


Epoch #39: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #40: 1025it [00:02, 378.64it/s, env_step=40960, len=9, n/ep=7, n/st=64, player_1/loss=1.501, player_2/loss=2.311, rew=0.00]


Epoch #40: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #41: 1025it [00:02, 396.75it/s, env_step=41984, len=8, n/ep=8, n/st=64, player_1/loss=3.454, player_2/loss=4.738, rew=0.00]


Epoch #41: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #42: 1025it [00:02, 393.60it/s, env_step=43008, len=8, n/ep=7, n/st=64, player_1/loss=1.431, player_2/loss=2.364, rew=0.00]


Epoch #42: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #43: 1025it [00:02, 396.97it/s, env_step=44032, len=8, n/ep=8, n/st=64, player_1/loss=1.537, player_2/loss=4.535, rew=0.00]


Epoch #43: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #44: 1025it [00:02, 395.37it/s, env_step=45056, len=9, n/ep=7, n/st=64, player_1/loss=5.193, player_2/loss=1.637, rew=0.00]


Epoch #44: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #45: 1025it [00:02, 379.77it/s, env_step=46080, len=22, n/ep=3, n/st=64, player_1/loss=3.741, player_2/loss=1.965, rew=0.00]


Epoch #45: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #46: 1025it [00:02, 385.33it/s, env_step=47104, len=10, n/ep=6, n/st=64, player_1/loss=2.425, player_2/loss=4.548, rew=0.00]


Epoch #46: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #47: 1025it [00:02, 390.08it/s, env_step=48128, len=11, n/ep=6, n/st=64, player_1/loss=2.732, player_2/loss=2.082, rew=0.00]


Epoch #47: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #48: 1025it [00:02, 396.47it/s, env_step=49152, len=7, n/ep=8, n/st=64, player_1/loss=1.465, player_2/loss=1.357, rew=0.00]


Epoch #48: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #49: 1025it [00:02, 390.40it/s, env_step=50176, len=7, n/ep=8, n/st=64, player_1/loss=2.089, player_2/loss=1.085, rew=0.00]


Epoch #49: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #50: 1025it [00:02, 370.26it/s, env_step=51200, len=7, n/ep=9, n/st=64, player_1/loss=1.779, player_2/loss=0.995, rew=0.00]


Epoch #50: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #51: 1025it [00:02, 352.91it/s, env_step=52224, len=8, n/ep=8, n/st=64, player_1/loss=1.661, player_2/loss=0.843, rew=0.00]


Epoch #51: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #52: 1025it [00:02, 360.76it/s, env_step=53248, len=7, n/ep=8, n/st=64, player_1/loss=1.703, player_2/loss=0.631, rew=0.00]


Epoch #52: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #53: 1025it [00:02, 362.12it/s, env_step=54272, len=8, n/ep=8, n/st=64, player_1/loss=3.382, player_2/loss=0.949, rew=0.00]


Epoch #53: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #54: 1025it [00:02, 377.79it/s, env_step=55296, len=9, n/ep=7, n/st=64, player_1/loss=2.059, player_2/loss=1.606, rew=0.00]


Epoch #54: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #55: 1025it [00:02, 383.93it/s, env_step=56320, len=8, n/ep=8, n/st=64, player_1/loss=3.794, player_2/loss=0.549, rew=0.00]


Epoch #55: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #56: 1025it [00:02, 396.44it/s, env_step=57344, len=7, n/ep=8, n/st=64, player_1/loss=1.192, player_2/loss=0.433, rew=0.00]


Epoch #56: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #57: 1025it [00:02, 381.83it/s, env_step=58368, len=11, n/ep=6, n/st=64, player_1/loss=1.921, player_2/loss=2.015, rew=0.00]


Epoch #57: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #58: 1025it [00:02, 381.05it/s, env_step=59392, len=7, n/ep=9, n/st=64, player_1/loss=1.703, player_2/loss=1.629, rew=0.00]


Epoch #58: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #59: 1025it [00:02, 369.11it/s, env_step=60416, len=10, n/ep=7, n/st=64, player_1/loss=3.639, player_2/loss=1.135, rew=0.00]


Epoch #59: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #60: 1025it [00:02, 362.98it/s, env_step=61440, len=10, n/ep=6, n/st=64, player_1/loss=1.245, player_2/loss=3.128, rew=0.00]


Epoch #60: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #61: 1025it [00:02, 358.15it/s, env_step=62464, len=9, n/ep=7, n/st=64, player_1/loss=1.929, player_2/loss=1.829, rew=0.00]


Epoch #61: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #62: 1025it [00:02, 355.71it/s, env_step=63488, len=7, n/ep=8, n/st=64, player_1/loss=6.159, player_2/loss=1.384, rew=0.00]


Epoch #62: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #63: 1025it [00:02, 356.16it/s, env_step=64512, len=10, n/ep=6, n/st=64, player_1/loss=3.896, player_2/loss=0.883, rew=0.00]


Epoch #63: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #64: 1025it [00:02, 358.28it/s, env_step=65536, len=8, n/ep=6, n/st=64, player_1/loss=5.337, player_2/loss=3.503, rew=0.00]


Epoch #64: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #65: 1025it [00:02, 358.09it/s, env_step=66560, len=8, n/ep=8, n/st=64, player_1/loss=5.061, player_2/loss=2.954, rew=0.00]


Epoch #65: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #66: 1025it [00:02, 355.18it/s, env_step=67584, len=9, n/ep=8, n/st=64, player_1/loss=3.562, player_2/loss=2.006, rew=0.00]


Epoch #66: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #67: 1025it [00:02, 357.66it/s, env_step=68608, len=8, n/ep=8, n/st=64, player_1/loss=1.958, player_2/loss=7.328, rew=0.00]


Epoch #67: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #68: 1025it [00:02, 357.05it/s, env_step=69632, len=8, n/ep=7, n/st=64, player_1/loss=1.998, player_2/loss=5.715, rew=0.00]


Epoch #68: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #69: 1025it [00:02, 357.37it/s, env_step=70656, len=7, n/ep=9, n/st=64, player_1/loss=7.980, player_2/loss=3.343, rew=0.00]


Epoch #69: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #70: 1025it [00:02, 356.39it/s, env_step=71680, len=8, n/ep=7, n/st=64, player_1/loss=5.262, player_2/loss=3.033, rew=0.00]


Epoch #70: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #71: 1025it [00:02, 357.32it/s, env_step=72704, len=8, n/ep=8, n/st=64, player_1/loss=1.996, player_2/loss=2.303, rew=0.00]


Epoch #71: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #72: 1025it [00:02, 357.09it/s, env_step=73728, len=7, n/ep=8, n/st=64, player_1/loss=2.323, player_2/loss=1.682, rew=0.00]


Epoch #72: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #73: 1025it [00:02, 353.15it/s, env_step=74752, len=8, n/ep=7, n/st=64, player_1/loss=0.583, player_2/loss=0.958, rew=0.00]


Epoch #73: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #74: 1025it [00:02, 355.04it/s, env_step=75776, len=9, n/ep=6, n/st=64, player_1/loss=1.465, player_2/loss=1.179, rew=0.00]


Epoch #74: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #75: 1025it [00:02, 356.47it/s, env_step=76800, len=8, n/ep=8, n/st=64, player_1/loss=1.006, player_2/loss=0.656, rew=0.00]


Epoch #75: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #76: 1025it [00:02, 356.83it/s, env_step=77824, len=8, n/ep=8, n/st=64, player_1/loss=1.186, player_2/loss=0.841, rew=0.00]


Epoch #76: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #77: 1025it [00:02, 356.94it/s, env_step=78848, len=9, n/ep=7, n/st=64, player_1/loss=1.186, player_2/loss=1.389, rew=0.00]


Epoch #77: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #78: 1025it [00:02, 357.93it/s, env_step=79872, len=7, n/ep=8, n/st=64, player_1/loss=0.884, player_2/loss=1.030, rew=0.00]


Epoch #78: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #79: 1025it [00:02, 356.92it/s, env_step=80896, len=8, n/ep=7, n/st=64, player_1/loss=0.652, player_2/loss=1.821, rew=0.00]


Epoch #79: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #80: 1025it [00:02, 358.90it/s, env_step=81920, len=8, n/ep=8, n/st=64, player_1/loss=2.018, player_2/loss=1.729, rew=0.00]


Epoch #80: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #81: 1025it [00:02, 356.98it/s, env_step=82944, len=7, n/ep=8, n/st=64, player_1/loss=3.462, player_2/loss=2.046, rew=0.00]


Epoch #81: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #82: 1025it [00:02, 357.00it/s, env_step=83968, len=8, n/ep=7, n/st=64, player_1/loss=5.194, player_2/loss=2.349, rew=0.00]


Epoch #82: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #83: 1025it [00:02, 357.09it/s, env_step=84992, len=8, n/ep=8, n/st=64, player_1/loss=4.712, player_2/loss=1.604, rew=0.00]


Epoch #83: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #84: 1025it [00:02, 357.45it/s, env_step=86016, len=7, n/ep=9, n/st=64, player_1/loss=4.694, player_2/loss=1.634, rew=0.00]


Epoch #84: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #85: 1025it [00:02, 359.45it/s, env_step=87040, len=11, n/ep=7, n/st=64, player_1/loss=6.056, player_2/loss=2.243, rew=0.00]


Epoch #85: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #86: 1025it [00:02, 357.77it/s, env_step=88064, len=10, n/ep=7, n/st=64, player_1/loss=3.666, player_2/loss=1.285, rew=0.00]


Epoch #86: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #87: 1025it [00:02, 358.72it/s, env_step=89088, len=12, n/ep=5, n/st=64, player_1/loss=6.380, player_2/loss=1.315, rew=0.00]


Epoch #87: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #88: 1025it [00:02, 356.96it/s, env_step=90112, len=8, n/ep=8, n/st=64, player_1/loss=3.657, player_2/loss=2.628, rew=0.00]


Epoch #88: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #89: 1025it [00:02, 358.70it/s, env_step=91136, len=8, n/ep=8, n/st=64, player_1/loss=4.521, player_2/loss=1.937, rew=0.00]


Epoch #89: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #90: 1025it [00:02, 357.17it/s, env_step=92160, len=8, n/ep=8, n/st=64, player_1/loss=2.942, player_2/loss=1.510, rew=0.00]


Epoch #90: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #91: 1025it [00:02, 358.08it/s, env_step=93184, len=8, n/ep=8, n/st=64, player_1/loss=6.219, player_2/loss=2.323, rew=0.00]


Epoch #91: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #92: 1025it [00:02, 357.29it/s, env_step=94208, len=8, n/ep=7, n/st=64, player_1/loss=3.722, player_2/loss=1.189, rew=0.00]


Epoch #92: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #93: 1025it [00:02, 356.29it/s, env_step=95232, len=8, n/ep=7, n/st=64, player_1/loss=1.806, player_2/loss=0.691, rew=0.00]


Epoch #93: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #94: 1025it [00:02, 357.52it/s, env_step=96256, len=7, n/ep=7, n/st=64, player_1/loss=3.078, player_2/loss=2.628, rew=0.00]


Epoch #94: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #95: 1025it [00:02, 356.76it/s, env_step=97280, len=8, n/ep=7, n/st=64, player_1/loss=4.271, player_2/loss=1.007, rew=0.00]


Epoch #95: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #96: 1025it [00:02, 356.75it/s, env_step=98304, len=8, n/ep=7, n/st=64, player_1/loss=2.853, player_2/loss=1.425, rew=0.00]


Epoch #96: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #97: 1025it [00:02, 355.51it/s, env_step=99328, len=9, n/ep=7, n/st=64, player_1/loss=5.042, player_2/loss=1.116, rew=0.00]


Epoch #97: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #98: 1025it [00:02, 359.11it/s, env_step=100352, len=8, n/ep=7, n/st=64, player_1/loss=4.039, player_2/loss=0.884, rew=0.00]


Epoch #98: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #99: 1025it [00:02, 358.10it/s, env_step=101376, len=7, n/ep=7, n/st=64, player_1/loss=5.347, player_2/loss=0.816, rew=0.00]


Epoch #99: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #100: 1025it [00:02, 356.82it/s, env_step=102400, len=8, n/ep=7, n/st=64, player_1/loss=2.769, player_2/loss=0.605, rew=0.00]


Epoch #100: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #101: 1025it [00:02, 356.93it/s, env_step=103424, len=8, n/ep=7, n/st=64, player_1/loss=1.366, player_2/loss=1.880, rew=0.00]


Epoch #101: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #102: 1025it [00:02, 358.26it/s, env_step=104448, len=8, n/ep=8, n/st=64, player_1/loss=2.285, player_2/loss=0.251, rew=0.00]


Epoch #102: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #103: 1025it [00:02, 356.43it/s, env_step=105472, len=8, n/ep=7, n/st=64, player_1/loss=5.271, player_2/loss=0.965, rew=0.00]


Epoch #103: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #104: 1025it [00:02, 357.47it/s, env_step=106496, len=8, n/ep=9, n/st=64, player_1/loss=3.984, player_2/loss=3.145, rew=0.00]


Epoch #104: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #105: 1025it [00:02, 357.53it/s, env_step=107520, len=7, n/ep=7, n/st=64, player_1/loss=3.572, player_2/loss=2.752, rew=0.00]


Epoch #105: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #106: 1025it [00:02, 357.35it/s, env_step=108544, len=7, n/ep=8, n/st=64, player_1/loss=2.204, player_2/loss=1.906, rew=0.00]


Epoch #106: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #107: 1025it [00:02, 355.90it/s, env_step=109568, len=8, n/ep=7, n/st=64, player_1/loss=3.415, player_2/loss=1.168, rew=0.00]


Epoch #107: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #108: 1025it [00:02, 357.47it/s, env_step=110592, len=8, n/ep=7, n/st=64, player_1/loss=2.184, player_2/loss=2.204, rew=0.00]


Epoch #108: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #109: 1025it [00:02, 357.60it/s, env_step=111616, len=10, n/ep=6, n/st=64, player_1/loss=5.587, player_2/loss=3.341, rew=0.00]


Epoch #109: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #110: 1025it [00:02, 356.93it/s, env_step=112640, len=8, n/ep=8, n/st=64, player_1/loss=7.768, player_2/loss=1.991, rew=0.00]


Epoch #110: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #111: 1025it [00:02, 356.44it/s, env_step=113664, len=9, n/ep=7, n/st=64, player_1/loss=4.545, player_2/loss=1.635, rew=0.00]


Epoch #111: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #112: 1025it [00:02, 358.18it/s, env_step=114688, len=8, n/ep=8, n/st=64, player_1/loss=3.335, player_2/loss=4.653, rew=0.00]


Epoch #112: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #113: 1025it [00:02, 358.41it/s, env_step=115712, len=8, n/ep=7, n/st=64, player_1/loss=2.364, player_2/loss=2.656, rew=0.00]


Epoch #113: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #114: 1025it [00:02, 356.05it/s, env_step=116736, len=8, n/ep=7, n/st=64, player_1/loss=3.987, player_2/loss=0.627, rew=0.00]


Epoch #114: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #115: 1025it [00:02, 350.94it/s, env_step=117760, len=8, n/ep=7, n/st=64, player_1/loss=1.649, player_2/loss=0.366, rew=0.00]


Epoch #115: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #116: 1025it [00:02, 357.09it/s, env_step=118784, len=8, n/ep=8, n/st=64, player_1/loss=1.392, player_2/loss=0.394, rew=0.00]


Epoch #116: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #117: 1025it [00:02, 356.46it/s, env_step=119808, len=8, n/ep=8, n/st=64, player_1/loss=2.072, player_2/loss=0.666, rew=0.00]


Epoch #117: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #118: 1025it [00:02, 357.86it/s, env_step=120832, len=9, n/ep=5, n/st=64, player_1/loss=2.926, player_2/loss=0.783, rew=0.00]


Epoch #118: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #119: 1025it [00:02, 356.13it/s, env_step=121856, len=7, n/ep=7, n/st=64, player_1/loss=4.535, player_2/loss=1.049, rew=0.00]


Epoch #119: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #120: 1025it [00:02, 358.74it/s, env_step=122880, len=8, n/ep=8, n/st=64, player_1/loss=2.948, player_2/loss=1.289, rew=0.00]


Epoch #120: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #121: 1025it [00:02, 357.75it/s, env_step=123904, len=11, n/ep=6, n/st=64, player_1/loss=3.766, player_2/loss=0.715, rew=0.00]


Epoch #121: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #122: 1025it [00:02, 357.41it/s, env_step=124928, len=8, n/ep=8, n/st=64, player_1/loss=5.128, player_2/loss=0.340, rew=0.00]


Epoch #122: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #123: 1025it [00:02, 357.47it/s, env_step=125952, len=8, n/ep=9, n/st=64, player_1/loss=3.690, player_2/loss=0.640, rew=0.00]


Epoch #123: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #124: 1025it [00:02, 356.92it/s, env_step=126976, len=7, n/ep=9, n/st=64, player_1/loss=3.500, player_2/loss=0.662, rew=0.00]


Epoch #124: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #125: 1025it [00:02, 358.89it/s, env_step=128000, len=8, n/ep=7, n/st=64, player_1/loss=2.678, player_2/loss=0.286, rew=0.00]


Epoch #125: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #126: 1025it [00:02, 356.64it/s, env_step=129024, len=8, n/ep=8, n/st=64, player_1/loss=5.103, player_2/loss=0.670, rew=0.00]


Epoch #126: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #127: 1025it [00:02, 356.08it/s, env_step=130048, len=7, n/ep=8, n/st=64, player_1/loss=2.610, player_2/loss=0.910, rew=0.00]


Epoch #127: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #128: 1025it [00:02, 352.69it/s, env_step=131072, len=8, n/ep=8, n/st=64, player_1/loss=3.497, player_2/loss=0.623, rew=0.00]


Epoch #128: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #129: 1025it [00:02, 357.82it/s, env_step=132096, len=8, n/ep=7, n/st=64, player_1/loss=3.147, player_2/loss=0.663, rew=0.00]


Epoch #129: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #130: 1025it [00:02, 359.44it/s, env_step=133120, len=8, n/ep=8, n/st=64, player_1/loss=2.732, player_2/loss=0.483, rew=0.00]


Epoch #130: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #131: 1025it [00:02, 356.46it/s, env_step=134144, len=9, n/ep=7, n/st=64, player_1/loss=3.620, player_2/loss=0.577, rew=0.00]


Epoch #131: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #132: 1025it [00:02, 354.83it/s, env_step=135168, len=8, n/ep=8, n/st=64, player_1/loss=2.773, player_2/loss=0.651, rew=0.00]


Epoch #132: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #133: 1025it [00:02, 356.92it/s, env_step=136192, len=8, n/ep=7, n/st=64, player_1/loss=2.039, player_2/loss=0.789, rew=0.00]


Epoch #133: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #134: 1025it [00:02, 358.30it/s, env_step=137216, len=11, n/ep=5, n/st=64, player_1/loss=2.783, player_2/loss=0.813, rew=0.00]


Epoch #134: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #135: 1025it [00:02, 357.93it/s, env_step=138240, len=9, n/ep=7, n/st=64, player_1/loss=2.111, player_2/loss=1.606, rew=0.00]


Epoch #135: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #136: 1025it [00:02, 358.21it/s, env_step=139264, len=8, n/ep=7, n/st=64, player_1/loss=4.926, player_2/loss=1.446, rew=0.00]


Epoch #136: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #137: 1025it [00:02, 357.57it/s, env_step=140288, len=8, n/ep=7, n/st=64, player_1/loss=5.023, player_2/loss=0.941, rew=0.00]


Epoch #137: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #138: 1025it [00:02, 357.94it/s, env_step=141312, len=9, n/ep=7, n/st=64, player_1/loss=3.443, player_2/loss=1.944, rew=0.00]


Epoch #138: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #139: 1025it [00:02, 357.40it/s, env_step=142336, len=7, n/ep=8, n/st=64, player_1/loss=1.809, player_2/loss=1.487, rew=0.00]


Epoch #139: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #140: 1025it [00:02, 359.10it/s, env_step=143360, len=8, n/ep=7, n/st=64, player_1/loss=1.342, player_2/loss=1.476, rew=0.00]


Epoch #140: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #141: 1025it [00:02, 358.65it/s, env_step=144384, len=8, n/ep=8, n/st=64, player_1/loss=3.213, player_2/loss=2.456, rew=0.00]


Epoch #141: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #142: 1025it [00:02, 356.43it/s, env_step=145408, len=8, n/ep=7, n/st=64, player_1/loss=2.039, player_2/loss=2.639, rew=0.00]


Epoch #142: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #143: 1025it [00:02, 359.18it/s, env_step=146432, len=8, n/ep=7, n/st=64, player_1/loss=1.979, player_2/loss=0.650, rew=0.00]


Epoch #143: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #144: 1025it [00:02, 357.65it/s, env_step=147456, len=8, n/ep=7, n/st=64, player_1/loss=5.169, player_2/loss=4.575, rew=0.00]


Epoch #144: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #145: 1025it [00:02, 357.59it/s, env_step=148480, len=10, n/ep=7, n/st=64, player_1/loss=3.128, player_2/loss=2.364, rew=0.00]


Epoch #145: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #146: 1025it [00:02, 358.61it/s, env_step=149504, len=8, n/ep=8, n/st=64, player_1/loss=3.082, player_2/loss=2.782, rew=0.00]


Epoch #146: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #147: 1025it [00:02, 358.22it/s, env_step=150528, len=7, n/ep=8, n/st=64, player_1/loss=2.140, player_2/loss=2.738, rew=0.00]


Epoch #147: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #148: 1025it [00:02, 353.25it/s, env_step=151552, len=8, n/ep=8, n/st=64, player_1/loss=2.381, player_2/loss=2.150, rew=0.00]


Epoch #148: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #149: 1025it [00:02, 354.96it/s, env_step=152576, len=9, n/ep=7, n/st=64, player_1/loss=0.758, player_2/loss=1.373, rew=0.00]


Epoch #149: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #150: 1025it [00:02, 358.25it/s, env_step=153600, len=9, n/ep=7, n/st=64, player_1/loss=1.717, player_2/loss=0.185, rew=0.00]


Epoch #150: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #151: 1025it [00:02, 357.60it/s, env_step=154624, len=9, n/ep=7, n/st=64, player_1/loss=1.411, player_2/loss=1.441, rew=0.00]


Epoch #151: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #152: 1025it [00:02, 357.86it/s, env_step=155648, len=8, n/ep=8, n/st=64, player_1/loss=3.660, player_2/loss=1.013, rew=0.00]


Epoch #152: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #153: 1025it [00:02, 358.41it/s, env_step=156672, len=7, n/ep=8, n/st=64, player_1/loss=4.808, player_2/loss=1.354, rew=0.00]


Epoch #153: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #154: 1025it [00:02, 357.05it/s, env_step=157696, len=10, n/ep=6, n/st=64, player_1/loss=3.332, player_2/loss=2.985, rew=0.00]


Epoch #154: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #155: 1025it [00:02, 358.05it/s, env_step=158720, len=8, n/ep=8, n/st=64, player_1/loss=3.307, player_2/loss=2.377, rew=0.00]


Epoch #155: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #156: 1025it [00:02, 354.40it/s, env_step=159744, len=7, n/ep=9, n/st=64, player_1/loss=3.505, player_2/loss=2.566, rew=0.00]


Epoch #156: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #157: 1025it [00:02, 355.34it/s, env_step=160768, len=12, n/ep=5, n/st=64, player_1/loss=3.005, player_2/loss=2.441, rew=0.00]


Epoch #157: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #158: 1025it [00:02, 357.29it/s, env_step=161792, len=7, n/ep=8, n/st=64, player_1/loss=1.602, player_2/loss=2.038, rew=0.00]


Epoch #158: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #159: 1025it [00:02, 357.37it/s, env_step=162816, len=7, n/ep=7, n/st=64, player_1/loss=1.265, player_2/loss=2.826, rew=0.00]


Epoch #159: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #160: 1025it [00:02, 354.86it/s, env_step=163840, len=8, n/ep=7, n/st=64, player_1/loss=1.400, player_2/loss=2.392, rew=0.00]


Epoch #160: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #161: 1025it [00:02, 356.69it/s, env_step=164864, len=11, n/ep=7, n/st=64, player_1/loss=3.287, player_2/loss=2.531, rew=0.00]


Epoch #161: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #162: 1025it [00:02, 356.17it/s, env_step=165888, len=7, n/ep=8, n/st=64, player_1/loss=1.836, player_2/loss=1.591, rew=0.00]


Epoch #162: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #163: 1025it [00:02, 358.43it/s, env_step=166912, len=7, n/ep=9, n/st=64, player_1/loss=1.554, player_2/loss=2.711, rew=0.00]


Epoch #163: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #164: 1025it [00:02, 356.93it/s, env_step=167936, len=9, n/ep=7, n/st=64, player_1/loss=2.876, player_2/loss=4.165, rew=0.00]


Epoch #164: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #165: 1025it [00:02, 358.32it/s, env_step=168960, len=8, n/ep=9, n/st=64, player_1/loss=2.942, player_2/loss=2.166, rew=0.00]


Epoch #165: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #166: 1025it [00:03, 340.22it/s, env_step=169984, len=9, n/ep=7, n/st=64, player_1/loss=3.796, player_2/loss=4.005, rew=0.00]


Epoch #166: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #167: 1025it [00:02, 346.28it/s, env_step=171008, len=8, n/ep=8, n/st=64, player_1/loss=1.342, player_2/loss=3.607, rew=0.00]


Epoch #167: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #168: 1025it [00:02, 346.52it/s, env_step=172032, len=8, n/ep=8, n/st=64, player_1/loss=1.330, player_2/loss=3.681, rew=0.00]


Epoch #168: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #169: 1025it [00:02, 358.44it/s, env_step=173056, len=10, n/ep=6, n/st=64, player_1/loss=3.732, player_2/loss=1.278, rew=0.00]


Epoch #169: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #170: 1025it [00:02, 357.14it/s, env_step=174080, len=16, n/ep=5, n/st=64, player_1/loss=5.849, player_2/loss=2.137, rew=0.00]


Epoch #170: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #171: 1025it [00:02, 360.80it/s, env_step=175104, len=7, n/ep=7, n/st=64, player_1/loss=7.127, player_2/loss=4.396, rew=0.00]


Epoch #171: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #172: 1025it [00:02, 355.74it/s, env_step=176128, len=8, n/ep=8, n/st=64, player_1/loss=10.017, player_2/loss=1.748, rew=0.00]


Epoch #172: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #173: 1025it [00:02, 358.03it/s, env_step=177152, len=7, n/ep=9, n/st=64, player_1/loss=7.720, player_2/loss=2.721, rew=0.00]


Epoch #173: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #174: 1025it [00:02, 358.50it/s, env_step=178176, len=10, n/ep=6, n/st=64, player_1/loss=5.367, player_2/loss=2.319, rew=0.00]


Epoch #174: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #175: 1025it [00:02, 355.76it/s, env_step=179200, len=8, n/ep=7, n/st=64, player_1/loss=6.171, player_2/loss=1.765, rew=0.00]


Epoch #175: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #176: 1025it [00:02, 356.81it/s, env_step=180224, len=8, n/ep=8, n/st=64, player_1/loss=7.251, player_2/loss=1.519, rew=0.00]


Epoch #176: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #177: 1025it [00:02, 358.25it/s, env_step=181248, len=8, n/ep=7, n/st=64, player_1/loss=2.496, player_2/loss=1.219, rew=0.00]


Epoch #177: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #178: 1025it [00:02, 358.46it/s, env_step=182272, len=8, n/ep=8, n/st=64, player_1/loss=5.093, player_2/loss=2.684, rew=0.00]


Epoch #178: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #179: 1025it [00:02, 356.69it/s, env_step=183296, len=8, n/ep=7, n/st=64, player_1/loss=2.961, player_2/loss=1.758, rew=0.00]


Epoch #179: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #180: 1025it [00:02, 358.17it/s, env_step=184320, len=9, n/ep=7, n/st=64, player_1/loss=1.720, player_2/loss=1.554, rew=0.00]


Epoch #180: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #181: 1025it [00:02, 357.71it/s, env_step=185344, len=9, n/ep=8, n/st=64, player_1/loss=3.144, player_2/loss=2.479, rew=0.00]


Epoch #181: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #182: 1025it [00:02, 356.66it/s, env_step=186368, len=8, n/ep=7, n/st=64, player_1/loss=3.699, player_2/loss=0.927, rew=0.00]


Epoch #182: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #183: 1025it [00:02, 358.66it/s, env_step=187392, len=12, n/ep=5, n/st=64, player_1/loss=5.309, player_2/loss=1.455, rew=0.00]


Epoch #183: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #184: 1025it [00:02, 356.72it/s, env_step=188416, len=9, n/ep=6, n/st=64, player_1/loss=2.376, player_2/loss=1.950, rew=0.00]


Epoch #184: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #185: 1025it [00:02, 357.79it/s, env_step=189440, len=10, n/ep=6, n/st=64, player_1/loss=3.988, player_2/loss=3.399, rew=0.00]


Epoch #185: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #186: 1025it [00:02, 358.24it/s, env_step=190464, len=9, n/ep=8, n/st=64, player_1/loss=3.416, player_2/loss=1.860, rew=0.00]


Epoch #186: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #187: 1025it [00:02, 357.67it/s, env_step=191488, len=7, n/ep=9, n/st=64, player_1/loss=3.920, player_2/loss=1.391, rew=0.00]


Epoch #187: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #188: 1025it [00:02, 356.64it/s, env_step=192512, len=8, n/ep=8, n/st=64, player_1/loss=2.628, player_2/loss=1.695, rew=0.00]


Epoch #188: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #189: 1025it [00:02, 357.55it/s, env_step=193536, len=8, n/ep=8, n/st=64, player_1/loss=4.881, player_2/loss=2.057, rew=0.00]


Epoch #189: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #190: 1025it [00:02, 360.42it/s, env_step=194560, len=8, n/ep=7, n/st=64, player_1/loss=4.477, player_2/loss=0.898, rew=0.00]


Epoch #190: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #191: 1025it [00:02, 359.45it/s, env_step=195584, len=8, n/ep=7, n/st=64, player_1/loss=3.462, player_2/loss=2.627, rew=0.00]


Epoch #191: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #192: 1025it [00:02, 358.74it/s, env_step=196608, len=7, n/ep=8, n/st=64, player_1/loss=1.718, player_2/loss=2.391, rew=0.00]


Epoch #192: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #193: 1025it [00:02, 357.89it/s, env_step=197632, len=8, n/ep=8, n/st=64, player_1/loss=2.010, player_2/loss=0.882, rew=0.00]


Epoch #193: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #194: 1025it [00:02, 357.34it/s, env_step=198656, len=8, n/ep=7, n/st=64, player_1/loss=2.956, player_2/loss=0.915, rew=0.00]


Epoch #194: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #195: 1025it [00:02, 356.98it/s, env_step=199680, len=8, n/ep=7, n/st=64, player_1/loss=2.615, player_2/loss=4.275, rew=0.00]


Epoch #195: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #196: 1025it [00:02, 358.03it/s, env_step=200704, len=10, n/ep=7, n/st=64, player_1/loss=2.564, player_2/loss=4.173, rew=0.00]


Epoch #196: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #197: 1025it [00:02, 356.12it/s, env_step=201728, len=8, n/ep=7, n/st=64, player_1/loss=2.564, player_2/loss=2.524, rew=0.00]


Epoch #197: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #198: 1025it [00:02, 352.41it/s, env_step=202752, len=10, n/ep=6, n/st=64, player_1/loss=3.673, player_2/loss=2.453, rew=0.00]


Epoch #198: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #199: 1025it [00:02, 359.35it/s, env_step=203776, len=7, n/ep=9, n/st=64, player_1/loss=1.504, player_2/loss=1.465, rew=0.00]


Epoch #199: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #200: 1025it [00:02, 357.31it/s, env_step=204800, len=8, n/ep=8, n/st=64, player_1/loss=3.419, player_2/loss=3.055, rew=0.00]


Epoch #200: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #201: 1025it [00:02, 357.44it/s, env_step=205824, len=8, n/ep=7, n/st=64, player_1/loss=0.457, player_2/loss=2.497, rew=0.00]


Epoch #201: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #202: 1025it [00:02, 359.90it/s, env_step=206848, len=8, n/ep=7, n/st=64, player_1/loss=3.388, player_2/loss=2.861, rew=0.00]


Epoch #202: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #203: 1025it [00:02, 358.28it/s, env_step=207872, len=9, n/ep=7, n/st=64, player_1/loss=2.923, player_2/loss=2.409, rew=0.00]


Epoch #203: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #204: 1025it [00:02, 355.83it/s, env_step=208896, len=8, n/ep=7, n/st=64, player_1/loss=2.703, player_2/loss=4.038, rew=0.00]


Epoch #204: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #205: 1025it [00:02, 358.84it/s, env_step=209920, len=8, n/ep=8, n/st=64, player_1/loss=1.502, player_2/loss=1.430, rew=0.00]


Epoch #205: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #206: 1025it [00:02, 357.82it/s, env_step=210944, len=7, n/ep=8, n/st=64, player_1/loss=0.744, player_2/loss=0.831, rew=0.00]


Epoch #206: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #207: 1025it [00:02, 357.08it/s, env_step=211968, len=9, n/ep=6, n/st=64, player_1/loss=2.393, player_2/loss=1.643, rew=0.00]


Epoch #207: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #208: 1025it [00:02, 355.76it/s, env_step=212992, len=8, n/ep=8, n/st=64, player_1/loss=4.631, player_2/loss=5.215, rew=0.00]


Epoch #208: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #209: 1025it [00:02, 358.09it/s, env_step=214016, len=8, n/ep=7, n/st=64, player_1/loss=4.853, player_2/loss=4.868, rew=0.00]


Epoch #209: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #210: 1025it [00:02, 357.10it/s, env_step=215040, len=7, n/ep=8, n/st=64, player_1/loss=1.969, player_2/loss=1.172, rew=0.00]


Epoch #210: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #211: 1025it [00:02, 356.89it/s, env_step=216064, len=9, n/ep=6, n/st=64, player_1/loss=0.649, player_2/loss=1.282, rew=0.00]


Epoch #211: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #212: 1025it [00:02, 356.64it/s, env_step=217088, len=7, n/ep=7, n/st=64, player_1/loss=1.495, player_2/loss=1.454, rew=0.00]


Epoch #212: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #213: 1025it [00:02, 356.84it/s, env_step=218112, len=8, n/ep=8, n/st=64, player_1/loss=2.363, player_2/loss=1.775, rew=0.00]


Epoch #213: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #214: 1025it [00:02, 356.52it/s, env_step=219136, len=7, n/ep=7, n/st=64, player_1/loss=3.288, player_2/loss=1.721, rew=0.00]


Epoch #214: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #215: 1025it [00:02, 355.74it/s, env_step=220160, len=8, n/ep=8, n/st=64, player_1/loss=1.419, player_2/loss=5.697, rew=0.00]


Epoch #215: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #216: 1025it [00:02, 357.02it/s, env_step=221184, len=7, n/ep=7, n/st=64, player_1/loss=2.643, player_2/loss=2.558, rew=0.00]


Epoch #216: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #217: 1025it [00:02, 356.32it/s, env_step=222208, len=10, n/ep=6, n/st=64, player_1/loss=3.385, player_2/loss=0.599, rew=0.00]


Epoch #217: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #218: 1025it [00:02, 355.88it/s, env_step=223232, len=10, n/ep=6, n/st=64, player_1/loss=1.984, player_2/loss=2.106, rew=0.00]


Epoch #218: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #219: 1025it [00:02, 358.78it/s, env_step=224256, len=8, n/ep=8, n/st=64, player_1/loss=2.059, player_2/loss=4.969, rew=0.00]


Epoch #219: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #220: 1025it [00:02, 357.32it/s, env_step=225280, len=8, n/ep=8, n/st=64, player_1/loss=1.293, player_2/loss=2.099, rew=0.00]


Epoch #220: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #221: 1025it [00:02, 357.20it/s, env_step=226304, len=7, n/ep=8, n/st=64, player_1/loss=1.807, player_2/loss=1.822, rew=0.00]


Epoch #221: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #222: 1025it [00:02, 357.00it/s, env_step=227328, len=9, n/ep=7, n/st=64, player_1/loss=1.672, player_2/loss=1.579, rew=0.00]


Epoch #222: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #223: 1025it [00:02, 354.54it/s, env_step=228352, len=8, n/ep=8, n/st=64, player_1/loss=3.138, player_2/loss=4.193, rew=0.00]


Epoch #223: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #224: 1025it [00:02, 356.93it/s, env_step=229376, len=8, n/ep=7, n/st=64, player_1/loss=1.818, player_2/loss=1.485, rew=0.00]


Epoch #224: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #225: 1025it [00:02, 356.07it/s, env_step=230400, len=7, n/ep=7, n/st=64, player_1/loss=2.897, player_2/loss=1.682, rew=0.00]


Epoch #225: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #226: 1025it [00:02, 357.79it/s, env_step=231424, len=14, n/ep=4, n/st=64, player_1/loss=1.929, player_2/loss=1.628, rew=0.00]


Epoch #226: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #227: 1025it [00:02, 357.40it/s, env_step=232448, len=8, n/ep=7, n/st=64, player_1/loss=1.281, player_2/loss=1.562, rew=0.00]


Epoch #227: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #228: 1025it [00:02, 356.31it/s, env_step=233472, len=7, n/ep=8, n/st=64, player_1/loss=1.691, player_2/loss=2.099, rew=0.00]


Epoch #228: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #229: 1025it [00:02, 359.58it/s, env_step=234496, len=9, n/ep=7, n/st=64, player_1/loss=1.179, player_2/loss=2.797, rew=0.00]


Epoch #229: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #230: 1025it [00:02, 356.14it/s, env_step=235520, len=7, n/ep=8, n/st=64, player_1/loss=1.826, player_2/loss=1.616, rew=0.00]


Epoch #230: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #231: 1025it [00:02, 357.44it/s, env_step=236544, len=8, n/ep=8, n/st=64, player_1/loss=1.430, player_2/loss=2.245, rew=0.00]


Epoch #231: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #232: 1025it [00:02, 351.68it/s, env_step=237568, len=8, n/ep=8, n/st=64, player_1/loss=2.046, player_2/loss=2.287, rew=0.00]


Epoch #232: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #233: 1025it [00:02, 355.88it/s, env_step=238592, len=8, n/ep=7, n/st=64, player_1/loss=1.401, player_2/loss=2.165, rew=0.00]


Epoch #233: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #234: 1025it [00:02, 357.88it/s, env_step=239616, len=8, n/ep=7, n/st=64, player_1/loss=0.946, player_2/loss=2.607, rew=0.00]


Epoch #234: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #235: 1025it [00:02, 355.68it/s, env_step=240640, len=7, n/ep=9, n/st=64, player_1/loss=1.220, player_2/loss=1.653, rew=0.00]


Epoch #235: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #236: 1025it [00:02, 357.41it/s, env_step=241664, len=8, n/ep=7, n/st=64, player_1/loss=1.278, player_2/loss=2.267, rew=0.00]


Epoch #236: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #237: 1025it [00:02, 356.63it/s, env_step=242688, len=8, n/ep=8, n/st=64, player_1/loss=1.818, player_2/loss=3.099, rew=0.00]


Epoch #237: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #238: 1025it [00:02, 357.23it/s, env_step=243712, len=10, n/ep=6, n/st=64, player_1/loss=1.799, player_2/loss=4.151, rew=0.00]


Epoch #238: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #239: 1025it [00:02, 353.41it/s, env_step=244736, len=8, n/ep=7, n/st=64, player_1/loss=2.271, player_2/loss=4.621, rew=0.00]


Epoch #239: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #240: 1025it [00:02, 355.40it/s, env_step=245760, len=9, n/ep=7, n/st=64, player_1/loss=2.215, player_2/loss=2.742, rew=0.00]


Epoch #240: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #241: 1025it [00:02, 357.11it/s, env_step=246784, len=8, n/ep=8, n/st=64, player_1/loss=0.565, player_2/loss=2.525, rew=0.00]


Epoch #241: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #242: 1025it [00:02, 357.98it/s, env_step=247808, len=8, n/ep=7, n/st=64, player_1/loss=2.192, player_2/loss=2.565, rew=0.00]


Epoch #242: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #243: 1025it [00:02, 352.92it/s, env_step=248832, len=8, n/ep=8, n/st=64, player_1/loss=1.488, player_2/loss=6.149, rew=0.00]


Epoch #243: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #244: 1025it [00:02, 357.04it/s, env_step=249856, len=8, n/ep=8, n/st=64, player_1/loss=2.764, player_2/loss=3.347, rew=0.00]


Epoch #244: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #245: 1025it [00:02, 356.05it/s, env_step=250880, len=9, n/ep=7, n/st=64, player_1/loss=1.651, player_2/loss=1.745, rew=0.00]


Epoch #245: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #246: 1025it [00:02, 355.83it/s, env_step=251904, len=8, n/ep=8, n/st=64, player_1/loss=1.643, player_2/loss=3.640, rew=0.00]


Epoch #246: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #247: 1025it [00:02, 359.08it/s, env_step=252928, len=9, n/ep=7, n/st=64, player_1/loss=3.028, player_2/loss=1.735, rew=0.00]


Epoch #247: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #248: 1025it [00:02, 357.54it/s, env_step=253952, len=7, n/ep=8, n/st=64, player_1/loss=3.745, player_2/loss=3.601, rew=0.00]


Epoch #248: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #249: 1025it [00:02, 356.75it/s, env_step=254976, len=12, n/ep=5, n/st=64, player_1/loss=2.055, player_2/loss=2.029, rew=0.00]


Epoch #249: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #250: 1025it [00:02, 357.91it/s, env_step=256000, len=9, n/ep=7, n/st=64, player_1/loss=2.824, player_2/loss=0.906, rew=0.00]


Epoch #250: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #251: 1025it [00:02, 357.16it/s, env_step=257024, len=7, n/ep=9, n/st=64, player_1/loss=1.813, player_2/loss=1.842, rew=0.00]


Epoch #251: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #252: 1025it [00:02, 355.29it/s, env_step=258048, len=8, n/ep=8, n/st=64, player_1/loss=1.997, player_2/loss=2.886, rew=0.00]


Epoch #252: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #253: 1025it [00:02, 355.48it/s, env_step=259072, len=9, n/ep=7, n/st=64, player_1/loss=2.758, player_2/loss=1.098, rew=0.00]


Epoch #253: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #254: 1025it [00:02, 356.25it/s, env_step=260096, len=8, n/ep=7, n/st=64, player_1/loss=1.811, player_2/loss=1.679, rew=0.00]


Epoch #254: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #255: 1025it [00:02, 357.73it/s, env_step=261120, len=8, n/ep=8, n/st=64, player_1/loss=4.077, player_2/loss=4.662, rew=0.00]


Epoch #255: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #256: 1025it [00:02, 359.41it/s, env_step=262144, len=9, n/ep=6, n/st=64, player_1/loss=3.756, player_2/loss=2.292, rew=0.00]


Epoch #256: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #257: 1025it [00:02, 358.79it/s, env_step=263168, len=7, n/ep=8, n/st=64, player_1/loss=2.332, player_2/loss=0.537, rew=0.00]


Epoch #257: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #258: 1025it [00:02, 357.26it/s, env_step=264192, len=7, n/ep=8, n/st=64, player_1/loss=1.169, player_2/loss=2.835, rew=0.00]


Epoch #258: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #259: 1025it [00:02, 356.59it/s, env_step=265216, len=7, n/ep=8, n/st=64, player_1/loss=3.699, player_2/loss=2.072, rew=0.00]


Epoch #259: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #260: 1025it [00:02, 356.82it/s, env_step=266240, len=7, n/ep=8, n/st=64, player_1/loss=2.377, player_2/loss=1.742, rew=0.00]


Epoch #260: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #261: 1025it [00:02, 357.02it/s, env_step=267264, len=13, n/ep=5, n/st=64, player_1/loss=1.214, player_2/loss=1.540, rew=0.00]


Epoch #261: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #262: 1025it [00:02, 355.61it/s, env_step=268288, len=7, n/ep=7, n/st=64, player_1/loss=1.950, player_2/loss=3.998, rew=0.00]


Epoch #262: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #263: 1025it [00:02, 357.20it/s, env_step=269312, len=8, n/ep=8, n/st=64, player_1/loss=1.260, player_2/loss=3.779, rew=0.00]


Epoch #263: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #264: 1025it [00:02, 356.71it/s, env_step=270336, len=9, n/ep=7, n/st=64, player_1/loss=3.448, player_2/loss=3.855, rew=0.00]


Epoch #264: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #265: 1025it [00:02, 357.14it/s, env_step=271360, len=8, n/ep=8, n/st=64, player_1/loss=1.083, player_2/loss=0.693, rew=0.00]


Epoch #265: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #266: 1025it [00:02, 357.20it/s, env_step=272384, len=7, n/ep=9, n/st=64, player_1/loss=0.699, player_2/loss=1.283, rew=0.00]


Epoch #266: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #267: 1025it [00:02, 356.86it/s, env_step=273408, len=7, n/ep=8, n/st=64, player_1/loss=0.593, player_2/loss=0.853, rew=0.00]


Epoch #267: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #268: 1025it [00:02, 356.30it/s, env_step=274432, len=9, n/ep=6, n/st=64, player_1/loss=4.154, player_2/loss=1.777, rew=0.00]


Epoch #268: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #269: 1025it [00:02, 355.62it/s, env_step=275456, len=7, n/ep=8, n/st=64, player_1/loss=5.997, player_2/loss=1.157, rew=0.00]


Epoch #269: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #270: 1025it [00:02, 356.21it/s, env_step=276480, len=8, n/ep=8, n/st=64, player_1/loss=3.425, player_2/loss=2.286, rew=0.00]


Epoch #270: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #271: 1025it [00:02, 356.26it/s, env_step=277504, len=8, n/ep=8, n/st=64, player_1/loss=1.345, player_2/loss=5.544, rew=0.00]


Epoch #271: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #272: 1025it [00:02, 356.69it/s, env_step=278528, len=7, n/ep=9, n/st=64, player_1/loss=0.683, player_2/loss=2.327, rew=0.00]


Epoch #272: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #273: 1025it [00:02, 356.25it/s, env_step=279552, len=10, n/ep=8, n/st=64, player_1/loss=3.705, player_2/loss=8.371, rew=0.00]


Epoch #273: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #274: 1025it [00:02, 357.18it/s, env_step=280576, len=8, n/ep=8, n/st=64, player_1/loss=2.330, player_2/loss=2.411, rew=0.00]


Epoch #274: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #275: 1025it [00:02, 356.15it/s, env_step=281600, len=8, n/ep=7, n/st=64, player_1/loss=4.291, player_2/loss=5.433, rew=0.00]


Epoch #275: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #276: 1025it [00:02, 356.69it/s, env_step=282624, len=8, n/ep=7, n/st=64, player_1/loss=3.793, player_2/loss=3.279, rew=0.00]


Epoch #276: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #277: 1025it [00:02, 355.93it/s, env_step=283648, len=8, n/ep=8, n/st=64, player_1/loss=2.899, player_2/loss=4.789, rew=0.00]


Epoch #277: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #278: 1025it [00:02, 357.81it/s, env_step=284672, len=8, n/ep=8, n/st=64, player_1/loss=1.508, player_2/loss=3.103, rew=0.00]


Epoch #278: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #279: 1025it [00:02, 357.61it/s, env_step=285696, len=10, n/ep=6, n/st=64, player_1/loss=1.524, player_2/loss=2.754, rew=0.00]


Epoch #279: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #280: 1025it [00:02, 353.33it/s, env_step=286720, len=9, n/ep=7, n/st=64, player_1/loss=3.600, player_2/loss=2.216, rew=0.00]


Epoch #280: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #281: 1025it [00:02, 360.83it/s, env_step=287744, len=7, n/ep=8, n/st=64, player_1/loss=1.791, player_2/loss=6.859, rew=0.00]


Epoch #281: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #282: 1025it [00:02, 360.58it/s, env_step=288768, len=9, n/ep=7, n/st=64, player_1/loss=5.558, player_2/loss=3.213, rew=0.00]


Epoch #282: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #283: 1025it [00:02, 358.45it/s, env_step=289792, len=8, n/ep=8, n/st=64, player_1/loss=1.843, player_2/loss=2.140, rew=0.00]


Epoch #283: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #284: 1025it [00:02, 358.03it/s, env_step=290816, len=8, n/ep=7, n/st=64, player_1/loss=2.204, player_2/loss=2.930, rew=0.00]


Epoch #284: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #285: 1025it [00:02, 356.42it/s, env_step=291840, len=8, n/ep=7, n/st=64, player_1/loss=2.353, player_2/loss=4.110, rew=0.00]


Epoch #285: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #286: 1025it [00:02, 357.32it/s, env_step=292864, len=9, n/ep=7, n/st=64, player_1/loss=3.055, player_2/loss=8.196, rew=0.00]


Epoch #286: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #287: 1025it [00:02, 355.71it/s, env_step=293888, len=8, n/ep=8, n/st=64, player_1/loss=2.199, player_2/loss=3.829, rew=0.00]


Epoch #287: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #288: 1025it [00:02, 358.22it/s, env_step=294912, len=8, n/ep=8, n/st=64, player_1/loss=0.787, player_2/loss=1.262, rew=0.00]


Epoch #288: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #289: 1025it [00:02, 355.51it/s, env_step=295936, len=8, n/ep=7, n/st=64, player_1/loss=1.740, player_2/loss=1.334, rew=0.00]


Epoch #289: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #290: 1025it [00:02, 356.55it/s, env_step=296960, len=8, n/ep=8, n/st=64, player_1/loss=1.697, player_2/loss=2.236, rew=0.00]


Epoch #290: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #291: 1025it [00:02, 357.05it/s, env_step=297984, len=8, n/ep=8, n/st=64, player_1/loss=1.521, player_2/loss=1.442, rew=0.00]


Epoch #291: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #292: 1025it [00:02, 353.25it/s, env_step=299008, len=8, n/ep=8, n/st=64, player_1/loss=1.767, player_2/loss=2.720, rew=0.00]


Epoch #292: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #293: 1025it [00:02, 356.25it/s, env_step=300032, len=8, n/ep=8, n/st=64, player_1/loss=1.225, player_2/loss=1.903, rew=0.00]


Epoch #293: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #294: 1025it [00:02, 357.31it/s, env_step=301056, len=7, n/ep=8, n/st=64, player_1/loss=0.965, player_2/loss=1.646, rew=0.00]


Epoch #294: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #295: 1025it [00:02, 357.68it/s, env_step=302080, len=10, n/ep=6, n/st=64, player_1/loss=1.776, player_2/loss=2.084, rew=0.00]


Epoch #295: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #296: 1025it [00:02, 356.96it/s, env_step=303104, len=9, n/ep=7, n/st=64, player_1/loss=1.182, player_2/loss=1.231, rew=0.00]


Epoch #296: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #297: 1025it [00:02, 356.58it/s, env_step=304128, len=10, n/ep=7, n/st=64, player_1/loss=1.294, player_2/loss=1.855, rew=0.00]


Epoch #297: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #298: 1025it [00:02, 356.27it/s, env_step=305152, len=7, n/ep=8, n/st=64, player_1/loss=1.477, player_2/loss=0.910, rew=0.00]


Epoch #298: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #299: 1025it [00:02, 357.00it/s, env_step=306176, len=8, n/ep=7, n/st=64, player_1/loss=1.681, player_2/loss=0.917, rew=0.00]


Epoch #299: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #300: 1025it [00:02, 356.26it/s, env_step=307200, len=8, n/ep=8, n/st=64, player_1/loss=3.946, player_2/loss=0.592, rew=0.00]


Epoch #300: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #301: 1025it [00:02, 357.36it/s, env_step=308224, len=9, n/ep=7, n/st=64, player_1/loss=1.372, player_2/loss=0.781, rew=0.00]


Epoch #301: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #302: 1025it [00:02, 356.47it/s, env_step=309248, len=8, n/ep=9, n/st=64, player_1/loss=1.549, player_2/loss=0.733, rew=0.00]


Epoch #302: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #303: 1025it [00:02, 356.50it/s, env_step=310272, len=8, n/ep=8, n/st=64, player_1/loss=0.576, player_2/loss=1.355, rew=0.00]


Epoch #303: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #304: 1025it [00:02, 356.72it/s, env_step=311296, len=7, n/ep=9, n/st=64, player_1/loss=2.924, player_2/loss=1.412, rew=0.00]


Epoch #304: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #305: 1025it [00:02, 356.03it/s, env_step=312320, len=8, n/ep=8, n/st=64, player_1/loss=3.226, player_2/loss=0.765, rew=0.00]


Epoch #305: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #306: 1025it [00:02, 357.07it/s, env_step=313344, len=7, n/ep=8, n/st=64, player_1/loss=5.029, player_2/loss=1.757, rew=0.00]


Epoch #306: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #307: 1025it [00:02, 355.61it/s, env_step=314368, len=8, n/ep=8, n/st=64, player_1/loss=3.423, player_2/loss=2.165, rew=0.00]


Epoch #307: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #308: 1025it [00:02, 356.46it/s, env_step=315392, len=8, n/ep=8, n/st=64, player_1/loss=3.067, player_2/loss=0.937, rew=0.00]


Epoch #308: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #309: 1025it [00:02, 356.56it/s, env_step=316416, len=9, n/ep=7, n/st=64, player_1/loss=3.353, player_2/loss=0.964, rew=0.00]


Epoch #309: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #310: 1025it [00:02, 356.65it/s, env_step=317440, len=8, n/ep=8, n/st=64, player_1/loss=2.949, player_2/loss=2.163, rew=0.00]


Epoch #310: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #311: 1025it [00:02, 356.56it/s, env_step=318464, len=8, n/ep=8, n/st=64, player_1/loss=2.192, player_2/loss=1.615, rew=0.00]


Epoch #311: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #312: 1025it [00:02, 356.85it/s, env_step=319488, len=9, n/ep=7, n/st=64, player_1/loss=2.891, player_2/loss=0.903, rew=0.00]


Epoch #312: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #313: 1025it [00:02, 357.46it/s, env_step=320512, len=8, n/ep=7, n/st=64, player_1/loss=12.398, player_2/loss=2.599, rew=0.00]


Epoch #313: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #314: 1025it [00:02, 356.28it/s, env_step=321536, len=8, n/ep=8, n/st=64, player_1/loss=11.302, player_2/loss=4.769, rew=0.00]


Epoch #314: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #315: 1025it [00:02, 354.87it/s, env_step=322560, len=8, n/ep=7, n/st=64, player_1/loss=18.219, player_2/loss=2.784, rew=0.00]


Epoch #315: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #316: 1025it [00:02, 354.98it/s, env_step=323584, len=8, n/ep=7, n/st=64, player_1/loss=11.736, player_2/loss=2.545, rew=0.00]


Epoch #316: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #317: 1025it [00:02, 359.14it/s, env_step=324608, len=7, n/ep=8, n/st=64, player_1/loss=8.104, player_2/loss=3.548, rew=0.00]


Epoch #317: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #318: 1025it [00:02, 357.20it/s, env_step=325632, len=7, n/ep=8, n/st=64, player_1/loss=5.443, player_2/loss=1.867, rew=0.00]


Epoch #318: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #319: 1025it [00:02, 356.51it/s, env_step=326656, len=13, n/ep=5, n/st=64, player_1/loss=6.178, player_2/loss=6.268, rew=0.00]


Epoch #319: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #320: 1025it [00:02, 356.18it/s, env_step=327680, len=9, n/ep=7, n/st=64, player_1/loss=6.519, player_2/loss=3.244, rew=0.00]


Epoch #320: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #321: 1025it [00:02, 358.31it/s, env_step=328704, len=8, n/ep=8, n/st=64, player_1/loss=3.147, player_2/loss=2.678, rew=0.00]


Epoch #321: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #322: 1025it [00:02, 352.21it/s, env_step=329728, len=9, n/ep=6, n/st=64, player_1/loss=3.576, player_2/loss=5.508, rew=0.00]


Epoch #322: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #323: 1025it [00:02, 355.52it/s, env_step=330752, len=9, n/ep=6, n/st=64, player_1/loss=4.912, player_2/loss=4.411, rew=0.00]


Epoch #323: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #324: 1025it [00:02, 358.28it/s, env_step=331776, len=9, n/ep=7, n/st=64, player_1/loss=3.653, player_2/loss=7.322, rew=0.00]


Epoch #324: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #325: 1025it [00:02, 358.29it/s, env_step=332800, len=8, n/ep=8, n/st=64, player_1/loss=2.623, player_2/loss=2.833, rew=0.00]


Epoch #325: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #326: 1025it [00:02, 355.27it/s, env_step=333824, len=8, n/ep=7, n/st=64, player_1/loss=1.609, player_2/loss=6.259, rew=0.00]


Epoch #326: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #327: 1025it [00:02, 356.98it/s, env_step=334848, len=8, n/ep=7, n/st=64, player_1/loss=1.835, player_2/loss=9.040, rew=0.00]


Epoch #327: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #328: 1025it [00:02, 356.56it/s, env_step=335872, len=8, n/ep=7, n/st=64, player_1/loss=2.037, player_2/loss=3.930, rew=0.00]


Epoch #328: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #329: 1025it [00:02, 357.28it/s, env_step=336896, len=7, n/ep=9, n/st=64, player_1/loss=3.875, player_2/loss=3.964, rew=0.00]


Epoch #329: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #330: 1025it [00:02, 356.31it/s, env_step=337920, len=8, n/ep=8, n/st=64, player_1/loss=0.837, player_2/loss=1.194, rew=0.00]


Epoch #330: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #331: 1025it [00:02, 356.46it/s, env_step=338944, len=8, n/ep=8, n/st=64, player_1/loss=0.956, player_2/loss=6.847, rew=0.00]


Epoch #331: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #332: 1025it [00:02, 357.61it/s, env_step=339968, len=9, n/ep=7, n/st=64, player_1/loss=1.573, player_2/loss=4.565, rew=0.00]


Epoch #332: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #333: 1025it [00:02, 357.44it/s, env_step=340992, len=8, n/ep=8, n/st=64, player_1/loss=1.357, player_2/loss=4.459, rew=0.00]


Epoch #333: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #334: 1025it [00:02, 356.31it/s, env_step=342016, len=8, n/ep=8, n/st=64, player_1/loss=2.997, player_2/loss=3.518, rew=0.00]


Epoch #334: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #335: 1025it [00:02, 351.94it/s, env_step=343040, len=8, n/ep=8, n/st=64, player_1/loss=2.363, player_2/loss=6.935, rew=0.00]


Epoch #335: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #336: 1025it [00:02, 352.70it/s, env_step=344064, len=7, n/ep=8, n/st=64, player_1/loss=3.556, player_2/loss=4.135, rew=0.00]


Epoch #336: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #337: 1025it [00:02, 356.52it/s, env_step=345088, len=7, n/ep=8, n/st=64, player_1/loss=1.931, player_2/loss=4.102, rew=0.00]


Epoch #337: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #338: 1025it [00:02, 357.69it/s, env_step=346112, len=8, n/ep=8, n/st=64, player_1/loss=1.483, player_2/loss=5.355, rew=0.00]


Epoch #338: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #339: 1025it [00:02, 355.72it/s, env_step=347136, len=8, n/ep=8, n/st=64, player_1/loss=2.501, player_2/loss=5.484, rew=0.00]


Epoch #339: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #340: 1025it [00:02, 358.19it/s, env_step=348160, len=9, n/ep=7, n/st=64, player_1/loss=1.634, player_2/loss=3.719, rew=0.00]


Epoch #340: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #341: 1025it [00:02, 356.52it/s, env_step=349184, len=8, n/ep=8, n/st=64, player_1/loss=2.416, player_2/loss=2.334, rew=0.00]


Epoch #341: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #342: 1025it [00:02, 355.32it/s, env_step=350208, len=7, n/ep=8, n/st=64, player_1/loss=6.866, player_2/loss=2.013, rew=0.00]


Epoch #342: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #343: 1025it [00:02, 355.22it/s, env_step=351232, len=7, n/ep=9, n/st=64, player_1/loss=2.634, player_2/loss=2.419, rew=0.00]


Epoch #343: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #344: 1025it [00:02, 357.82it/s, env_step=352256, len=8, n/ep=8, n/st=64, player_1/loss=4.669, player_2/loss=2.070, rew=0.00]


Epoch #344: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #345: 1025it [00:02, 357.45it/s, env_step=353280, len=8, n/ep=8, n/st=64, player_1/loss=3.992, player_2/loss=2.437, rew=0.00]


Epoch #345: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #346: 1025it [00:02, 355.35it/s, env_step=354304, len=10, n/ep=6, n/st=64, player_1/loss=9.119, player_2/loss=1.030, rew=0.00]


Epoch #346: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #347: 1025it [00:02, 357.31it/s, env_step=355328, len=12, n/ep=5, n/st=64, player_1/loss=8.020, player_2/loss=1.902, rew=0.00]


Epoch #347: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #348: 1025it [00:02, 357.99it/s, env_step=356352, len=9, n/ep=6, n/st=64, player_1/loss=1.547, player_2/loss=4.402, rew=0.00]


Epoch #348: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #349: 1025it [00:02, 355.68it/s, env_step=357376, len=7, n/ep=8, n/st=64, player_1/loss=3.317, player_2/loss=1.428, rew=0.00]


Epoch #349: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #350: 1025it [00:02, 354.75it/s, env_step=358400, len=11, n/ep=6, n/st=64, player_1/loss=7.532, player_2/loss=2.512, rew=0.00]


Epoch #350: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #351: 1025it [00:02, 356.32it/s, env_step=359424, len=13, n/ep=5, n/st=64, player_1/loss=5.776, player_2/loss=2.941, rew=0.00]


Epoch #351: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #352: 1025it [00:02, 357.64it/s, env_step=360448, len=8, n/ep=7, n/st=64, player_1/loss=6.116, player_2/loss=2.352, rew=0.00]


Epoch #352: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #353: 1025it [00:02, 354.92it/s, env_step=361472, len=8, n/ep=8, n/st=64, player_1/loss=4.652, player_2/loss=1.132, rew=0.00]


Epoch #353: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #354: 1025it [00:02, 357.08it/s, env_step=362496, len=10, n/ep=7, n/st=64, player_1/loss=4.093, player_2/loss=3.358, rew=0.00]


Epoch #354: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #355: 1025it [00:02, 356.49it/s, env_step=363520, len=9, n/ep=7, n/st=64, player_1/loss=1.481, player_2/loss=2.396, rew=0.00]


Epoch #355: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #356: 1025it [00:02, 354.57it/s, env_step=364544, len=7, n/ep=8, n/st=64, player_1/loss=2.854, player_2/loss=1.699, rew=0.00]


Epoch #356: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #357: 1025it [00:02, 356.63it/s, env_step=365568, len=8, n/ep=8, n/st=64, player_1/loss=3.293, player_2/loss=0.862, rew=0.00]


Epoch #357: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #358: 1025it [00:02, 354.10it/s, env_step=366592, len=8, n/ep=8, n/st=64, player_1/loss=7.048, player_2/loss=2.474, rew=0.00]


Epoch #358: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #359: 1025it [00:02, 355.95it/s, env_step=367616, len=11, n/ep=6, n/st=64, player_1/loss=8.477, player_2/loss=2.001, rew=0.00]


Epoch #359: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #360: 1025it [00:02, 356.06it/s, env_step=368640, len=9, n/ep=7, n/st=64, player_1/loss=4.877, player_2/loss=1.380, rew=0.00]


Epoch #360: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #361: 1025it [00:02, 356.74it/s, env_step=369664, len=7, n/ep=8, n/st=64, player_1/loss=5.793, player_2/loss=2.125, rew=0.00]


Epoch #361: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #362: 1025it [00:02, 355.87it/s, env_step=370688, len=8, n/ep=7, n/st=64, player_1/loss=5.249, player_2/loss=0.349, rew=0.00]


Epoch #362: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #363: 1025it [00:02, 354.09it/s, env_step=371712, len=8, n/ep=8, n/st=64, player_1/loss=4.102, player_2/loss=0.517, rew=0.00]


Epoch #363: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #364: 1025it [00:02, 357.99it/s, env_step=372736, len=7, n/ep=8, n/st=64, player_1/loss=5.955, player_2/loss=1.467, rew=0.00]


Epoch #364: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #365: 1025it [00:02, 359.61it/s, env_step=373760, len=8, n/ep=8, n/st=64, player_1/loss=3.449, player_2/loss=1.268, rew=0.00]


Epoch #365: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #366: 1025it [00:02, 354.67it/s, env_step=374784, len=7, n/ep=8, n/st=64, player_1/loss=3.664, player_2/loss=0.530, rew=0.00]


Epoch #366: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #367: 1025it [00:02, 356.65it/s, env_step=375808, len=10, n/ep=6, n/st=64, player_1/loss=1.487, player_2/loss=0.392, rew=0.00]


Epoch #367: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #368: 1025it [00:02, 357.49it/s, env_step=376832, len=8, n/ep=8, n/st=64, player_1/loss=3.469, player_2/loss=1.576, rew=0.00]


Epoch #368: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #369: 1025it [00:02, 355.08it/s, env_step=377856, len=8, n/ep=7, n/st=64, player_1/loss=3.270, player_2/loss=0.995, rew=0.00]


Epoch #369: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #370: 1025it [00:02, 356.07it/s, env_step=378880, len=8, n/ep=8, n/st=64, player_1/loss=4.605, player_2/loss=1.958, rew=0.00]


Epoch #370: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #371: 1025it [00:02, 356.27it/s, env_step=379904, len=8, n/ep=8, n/st=64, player_1/loss=4.907, player_2/loss=1.546, rew=0.00]


Epoch #371: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #372: 1025it [00:02, 355.37it/s, env_step=380928, len=7, n/ep=8, n/st=64, player_1/loss=3.019, player_2/loss=1.318, rew=0.00]


Epoch #372: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #373: 1025it [00:02, 355.27it/s, env_step=381952, len=7, n/ep=8, n/st=64, player_1/loss=2.565, player_2/loss=2.544, rew=0.00]


Epoch #373: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #374: 1025it [00:02, 354.91it/s, env_step=382976, len=8, n/ep=7, n/st=64, player_1/loss=1.814, player_2/loss=0.997, rew=0.00]


Epoch #374: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #375: 1025it [00:02, 357.10it/s, env_step=384000, len=9, n/ep=6, n/st=64, player_1/loss=6.466, player_2/loss=1.924, rew=0.00]


Epoch #375: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #376: 1025it [00:02, 356.83it/s, env_step=385024, len=9, n/ep=7, n/st=64, player_1/loss=10.305, player_2/loss=1.327, rew=0.00]


Epoch #376: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #377: 1025it [00:02, 356.37it/s, env_step=386048, len=9, n/ep=7, n/st=64, player_1/loss=10.459, player_2/loss=1.493, rew=0.00]


Epoch #377: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #378: 1025it [00:02, 355.89it/s, env_step=387072, len=8, n/ep=8, n/st=64, player_1/loss=3.069, player_2/loss=2.682, rew=0.00]


Epoch #378: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #379: 1025it [00:02, 355.57it/s, env_step=388096, len=7, n/ep=9, n/st=64, player_1/loss=3.909, player_2/loss=1.051, rew=0.00]


Epoch #379: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #380: 1025it [00:02, 356.78it/s, env_step=389120, len=11, n/ep=6, n/st=64, player_1/loss=3.914, player_2/loss=2.168, rew=0.00]


Epoch #380: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #381: 1025it [00:02, 356.14it/s, env_step=390144, len=12, n/ep=5, n/st=64, player_1/loss=3.779, player_2/loss=2.814, rew=0.00]


Epoch #381: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #382: 1025it [00:02, 354.61it/s, env_step=391168, len=7, n/ep=9, n/st=64, player_1/loss=1.571, player_2/loss=0.932, rew=0.00]


Epoch #382: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #383: 1025it [00:02, 355.74it/s, env_step=392192, len=8, n/ep=7, n/st=64, player_1/loss=1.798, player_2/loss=1.168, rew=0.00]


Epoch #383: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #384: 1025it [00:02, 357.38it/s, env_step=393216, len=8, n/ep=8, n/st=64, player_1/loss=2.380, player_2/loss=1.468, rew=0.00]


Epoch #384: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #385: 1025it [00:02, 356.61it/s, env_step=394240, len=8, n/ep=8, n/st=64, player_1/loss=1.972, player_2/loss=1.738, rew=0.00]


Epoch #385: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #386: 1025it [00:02, 356.39it/s, env_step=395264, len=8, n/ep=8, n/st=64, player_1/loss=1.411, player_2/loss=2.101, rew=0.00]


Epoch #386: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #387: 1025it [00:02, 356.61it/s, env_step=396288, len=7, n/ep=8, n/st=64, player_1/loss=0.997, player_2/loss=1.325, rew=0.00]


Epoch #387: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #388: 1025it [00:02, 355.01it/s, env_step=397312, len=8, n/ep=8, n/st=64, player_1/loss=0.936, player_2/loss=0.997, rew=0.00]


Epoch #388: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #389: 1025it [00:02, 355.27it/s, env_step=398336, len=10, n/ep=6, n/st=64, player_1/loss=2.232, player_2/loss=0.626, rew=0.00]


Epoch #389: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #390: 1025it [00:02, 355.41it/s, env_step=399360, len=8, n/ep=7, n/st=64, player_1/loss=3.042, player_2/loss=0.683, rew=0.00]


Epoch #390: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #391: 1025it [00:02, 356.03it/s, env_step=400384, len=8, n/ep=7, n/st=64, player_1/loss=3.008, player_2/loss=1.830, rew=0.00]


Epoch #391: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #392: 1025it [00:02, 355.36it/s, env_step=401408, len=8, n/ep=8, n/st=64, player_1/loss=2.729, player_2/loss=1.571, rew=0.00]


Epoch #392: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #393: 1025it [00:02, 357.23it/s, env_step=402432, len=8, n/ep=7, n/st=64, player_1/loss=2.085, player_2/loss=1.579, rew=0.00]


Epoch #393: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #394: 1025it [00:02, 356.96it/s, env_step=403456, len=9, n/ep=8, n/st=64, player_1/loss=1.034, player_2/loss=2.784, rew=0.00]


Epoch #394: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #395: 1025it [00:02, 357.40it/s, env_step=404480, len=18, n/ep=3, n/st=64, player_1/loss=5.724, player_2/loss=5.197, rew=0.00]


Epoch #395: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #396: 1025it [00:02, 356.32it/s, env_step=405504, len=8, n/ep=7, n/st=64, player_1/loss=8.549, player_2/loss=2.766, rew=0.00]


Epoch #396: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #397: 1025it [00:02, 355.55it/s, env_step=406528, len=9, n/ep=8, n/st=64, player_1/loss=5.220, player_2/loss=3.739, rew=0.00]


Epoch #397: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #398: 1025it [00:02, 354.65it/s, env_step=407552, len=7, n/ep=9, n/st=64, player_1/loss=2.760, player_2/loss=1.007, rew=0.00]


Epoch #398: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #399: 1025it [00:02, 353.66it/s, env_step=408576, len=9, n/ep=7, n/st=64, player_1/loss=2.254, player_2/loss=1.773, rew=0.00]


Epoch #399: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #400: 1025it [00:02, 357.98it/s, env_step=409600, len=9, n/ep=6, n/st=64, player_1/loss=1.725, player_2/loss=1.239, rew=0.00]


Epoch #400: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #401: 1025it [00:02, 354.86it/s, env_step=410624, len=7, n/ep=8, n/st=64, player_1/loss=2.146, player_2/loss=0.777, rew=0.00]


Epoch #401: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #402: 1025it [00:02, 356.49it/s, env_step=411648, len=8, n/ep=8, n/st=64, player_1/loss=2.941, player_2/loss=0.907, rew=0.00]


Epoch #402: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #403: 1025it [00:02, 356.93it/s, env_step=412672, len=7, n/ep=7, n/st=64, player_1/loss=2.577, player_2/loss=1.798, rew=0.00]


Epoch #403: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #404: 1025it [00:02, 354.30it/s, env_step=413696, len=8, n/ep=8, n/st=64, player_1/loss=3.396, player_2/loss=1.060, rew=0.00]


Epoch #404: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #405: 1025it [00:02, 348.53it/s, env_step=414720, len=8, n/ep=8, n/st=64, player_1/loss=1.810, player_2/loss=1.135, rew=0.00]


Epoch #405: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #406: 1025it [00:02, 356.49it/s, env_step=415744, len=8, n/ep=8, n/st=64, player_1/loss=1.456, player_2/loss=2.037, rew=0.00]


Epoch #406: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #407: 1025it [00:02, 357.14it/s, env_step=416768, len=9, n/ep=7, n/st=64, player_1/loss=2.867, player_2/loss=1.702, rew=0.00]


Epoch #407: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #408: 1025it [00:02, 354.74it/s, env_step=417792, len=7, n/ep=8, n/st=64, player_1/loss=3.146, player_2/loss=3.595, rew=0.00]


Epoch #408: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #409: 1025it [00:02, 355.52it/s, env_step=418816, len=8, n/ep=8, n/st=64, player_1/loss=1.598, player_2/loss=3.016, rew=0.00]


Epoch #409: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #410: 1025it [00:02, 353.62it/s, env_step=419840, len=10, n/ep=6, n/st=64, player_1/loss=3.876, player_2/loss=4.697, rew=0.00]


Epoch #410: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #411: 1025it [00:02, 356.29it/s, env_step=420864, len=8, n/ep=8, n/st=64, player_1/loss=3.439, player_2/loss=3.134, rew=0.00]


Epoch #411: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #412: 1025it [00:02, 354.62it/s, env_step=421888, len=8, n/ep=8, n/st=64, player_1/loss=1.950, player_2/loss=0.823, rew=0.00]


Epoch #412: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #413: 1025it [00:02, 356.23it/s, env_step=422912, len=8, n/ep=7, n/st=64, player_1/loss=3.417, player_2/loss=1.628, rew=0.00]


Epoch #413: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #414: 1025it [00:02, 355.63it/s, env_step=423936, len=9, n/ep=7, n/st=64, player_1/loss=4.117, player_2/loss=4.584, rew=0.00]


Epoch #414: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #415: 1025it [00:02, 354.53it/s, env_step=424960, len=7, n/ep=9, n/st=64, player_1/loss=3.701, player_2/loss=3.251, rew=0.00]


Epoch #415: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #416: 1025it [00:02, 356.51it/s, env_step=425984, len=11, n/ep=6, n/st=64, player_1/loss=0.621, player_2/loss=4.445, rew=0.00]


Epoch #416: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #417: 1025it [00:02, 356.54it/s, env_step=427008, len=8, n/ep=7, n/st=64, player_1/loss=4.281, player_2/loss=3.607, rew=0.00]


Epoch #417: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #418: 1025it [00:02, 356.69it/s, env_step=428032, len=11, n/ep=6, n/st=64, player_1/loss=3.452, player_2/loss=1.878, rew=0.00]


Epoch #418: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #419: 1025it [00:02, 356.16it/s, env_step=429056, len=7, n/ep=9, n/st=64, player_1/loss=2.400, player_2/loss=0.789, rew=0.00]


Epoch #419: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #420: 1025it [00:02, 355.47it/s, env_step=430080, len=8, n/ep=8, n/st=64, player_1/loss=2.871, player_2/loss=1.139, rew=0.00]


Epoch #420: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #421: 1025it [00:02, 354.54it/s, env_step=431104, len=9, n/ep=8, n/st=64, player_1/loss=2.332, player_2/loss=3.668, rew=0.00]


Epoch #421: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #422: 1025it [00:02, 357.53it/s, env_step=432128, len=8, n/ep=7, n/st=64, player_1/loss=4.561, player_2/loss=1.235, rew=0.00]


Epoch #422: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #423: 1025it [00:02, 354.12it/s, env_step=433152, len=7, n/ep=8, n/st=64, player_1/loss=2.994, player_2/loss=1.456, rew=0.00]


Epoch #423: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #424: 1025it [00:02, 354.69it/s, env_step=434176, len=8, n/ep=8, n/st=64, player_1/loss=3.389, player_2/loss=0.809, rew=0.00]


Epoch #424: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #425: 1025it [00:02, 355.18it/s, env_step=435200, len=9, n/ep=7, n/st=64, player_1/loss=2.354, player_2/loss=0.728, rew=0.00]


Epoch #425: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #426: 1025it [00:02, 355.13it/s, env_step=436224, len=9, n/ep=7, n/st=64, player_1/loss=1.783, player_2/loss=0.224, rew=0.00]


Epoch #426: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #427: 1025it [00:02, 354.17it/s, env_step=437248, len=11, n/ep=6, n/st=64, player_1/loss=2.750, player_2/loss=1.793, rew=0.00]


Epoch #427: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #428: 1025it [00:02, 356.37it/s, env_step=438272, len=9, n/ep=7, n/st=64, player_1/loss=2.111, player_2/loss=1.771, rew=0.00]


Epoch #428: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #429: 1025it [00:02, 354.08it/s, env_step=439296, len=8, n/ep=8, n/st=64, player_1/loss=1.340, player_2/loss=2.164, rew=0.00]


Epoch #429: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #430: 1025it [00:02, 356.42it/s, env_step=440320, len=9, n/ep=7, n/st=64, player_1/loss=0.645, player_2/loss=2.259, rew=0.00]


Epoch #430: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #431: 1025it [00:02, 354.66it/s, env_step=441344, len=8, n/ep=8, n/st=64, player_1/loss=1.369, player_2/loss=1.741, rew=0.00]


Epoch #431: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #432: 1025it [00:02, 352.67it/s, env_step=442368, len=8, n/ep=8, n/st=64, player_1/loss=0.543, player_2/loss=1.870, rew=0.00]


Epoch #432: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #433: 1025it [00:02, 355.54it/s, env_step=443392, len=10, n/ep=7, n/st=64, player_1/loss=2.350, player_2/loss=2.507, rew=0.00]


Epoch #433: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #434: 1025it [00:02, 356.42it/s, env_step=444416, len=9, n/ep=6, n/st=64, player_1/loss=1.379, player_2/loss=2.064, rew=0.00]


Epoch #434: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #435: 1025it [00:02, 353.50it/s, env_step=445440, len=7, n/ep=8, n/st=64, player_1/loss=0.638, player_2/loss=4.011, rew=0.00]


Epoch #435: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #436: 1025it [00:02, 353.74it/s, env_step=446464, len=8, n/ep=7, n/st=64, player_1/loss=1.007, player_2/loss=1.295, rew=0.00]


Epoch #436: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #437: 1025it [00:02, 354.49it/s, env_step=447488, len=8, n/ep=8, n/st=64, player_1/loss=1.394, player_2/loss=2.838, rew=0.00]


Epoch #437: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #438: 1025it [00:02, 353.46it/s, env_step=448512, len=8, n/ep=8, n/st=64, player_1/loss=0.402, player_2/loss=2.211, rew=0.00]


Epoch #438: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #439: 1025it [00:02, 351.27it/s, env_step=449536, len=9, n/ep=6, n/st=64, player_1/loss=0.694, player_2/loss=3.492, rew=0.00]


Epoch #439: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #440: 1025it [00:02, 356.53it/s, env_step=450560, len=8, n/ep=8, n/st=64, player_1/loss=1.107, player_2/loss=3.250, rew=0.00]


Epoch #440: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #441: 1025it [00:02, 356.36it/s, env_step=451584, len=8, n/ep=8, n/st=64, player_1/loss=1.656, player_2/loss=1.654, rew=0.00]


Epoch #441: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #442: 1025it [00:02, 355.87it/s, env_step=452608, len=8, n/ep=7, n/st=64, player_1/loss=1.057, player_2/loss=3.145, rew=0.00]


Epoch #442: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #443: 1025it [00:02, 354.79it/s, env_step=453632, len=7, n/ep=9, n/st=64, player_1/loss=1.008, player_2/loss=1.493, rew=0.00]


Epoch #443: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #444: 1025it [00:02, 354.27it/s, env_step=454656, len=8, n/ep=8, n/st=64, player_1/loss=1.994, player_2/loss=2.023, rew=0.00]


Epoch #444: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #445: 1025it [00:02, 353.62it/s, env_step=455680, len=7, n/ep=8, n/st=64, player_1/loss=1.550, player_2/loss=3.099, rew=0.00]


Epoch #445: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #446: 1025it [00:02, 352.49it/s, env_step=456704, len=10, n/ep=8, n/st=64, player_1/loss=2.547, player_2/loss=5.556, rew=0.00]


Epoch #446: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #447: 1025it [00:02, 354.04it/s, env_step=457728, len=9, n/ep=7, n/st=64, player_1/loss=1.594, player_2/loss=4.974, rew=0.00]


Epoch #447: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #448: 1025it [00:02, 355.86it/s, env_step=458752, len=7, n/ep=8, n/st=64, player_1/loss=5.752, player_2/loss=3.776, rew=0.00]


Epoch #448: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #449: 1025it [00:02, 355.69it/s, env_step=459776, len=7, n/ep=8, n/st=64, player_1/loss=3.158, player_2/loss=2.675, rew=0.00]


Epoch #449: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #450: 1025it [00:02, 354.34it/s, env_step=460800, len=8, n/ep=8, n/st=64, player_1/loss=3.069, player_2/loss=4.459, rew=0.00]


Epoch #450: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #451: 1025it [00:02, 354.02it/s, env_step=461824, len=8, n/ep=8, n/st=64, player_1/loss=1.529, player_2/loss=5.009, rew=0.00]


Epoch #451: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #452: 1025it [00:02, 356.97it/s, env_step=462848, len=8, n/ep=7, n/st=64, player_1/loss=3.868, player_2/loss=2.656, rew=0.00]


Epoch #452: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #453: 1025it [00:02, 355.99it/s, env_step=463872, len=8, n/ep=7, n/st=64, player_1/loss=2.904, player_2/loss=5.191, rew=0.00]


Epoch #453: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #454: 1025it [00:02, 354.40it/s, env_step=464896, len=8, n/ep=8, n/st=64, player_1/loss=2.680, player_2/loss=2.726, rew=0.00]


Epoch #454: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #455: 1025it [00:02, 354.99it/s, env_step=465920, len=9, n/ep=6, n/st=64, player_1/loss=1.226, player_2/loss=2.719, rew=0.00]


Epoch #455: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #456: 1025it [00:02, 354.73it/s, env_step=466944, len=9, n/ep=7, n/st=64, player_1/loss=7.917, player_2/loss=3.977, rew=0.00]


Epoch #456: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #457: 1025it [00:02, 353.93it/s, env_step=467968, len=9, n/ep=6, n/st=64, player_1/loss=9.662, player_2/loss=4.581, rew=0.00]


Epoch #457: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #458: 1025it [00:02, 355.35it/s, env_step=468992, len=9, n/ep=7, n/st=64, player_1/loss=2.980, player_2/loss=2.402, rew=0.00]


Epoch #458: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #459: 1025it [00:02, 354.92it/s, env_step=470016, len=8, n/ep=8, n/st=64, player_1/loss=7.105, player_2/loss=3.286, rew=0.00]


Epoch #459: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #460: 1025it [00:02, 354.02it/s, env_step=471040, len=8, n/ep=7, n/st=64, player_1/loss=3.210, player_2/loss=1.242, rew=0.00]


Epoch #460: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #461: 1025it [00:02, 356.67it/s, env_step=472064, len=8, n/ep=8, n/st=64, player_1/loss=2.609, player_2/loss=2.031, rew=0.00]


Epoch #461: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #462: 1025it [00:02, 356.77it/s, env_step=473088, len=7, n/ep=9, n/st=64, player_1/loss=2.641, player_2/loss=0.983, rew=0.00]


Epoch #462: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #463: 1025it [00:02, 355.64it/s, env_step=474112, len=8, n/ep=7, n/st=64, player_1/loss=3.189, player_2/loss=1.390, rew=0.00]


Epoch #463: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #464: 1025it [00:02, 355.18it/s, env_step=475136, len=7, n/ep=8, n/st=64, player_1/loss=2.003, player_2/loss=0.683, rew=0.00]


Epoch #464: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #465: 1025it [00:02, 354.68it/s, env_step=476160, len=9, n/ep=7, n/st=64, player_1/loss=1.735, player_2/loss=1.075, rew=0.00]


Epoch #465: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #466: 1025it [00:02, 355.83it/s, env_step=477184, len=10, n/ep=7, n/st=64, player_1/loss=1.858, player_2/loss=0.693, rew=0.00]


Epoch #466: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #467: 1025it [00:02, 356.42it/s, env_step=478208, len=10, n/ep=6, n/st=64, player_1/loss=1.509, player_2/loss=1.184, rew=0.00]


Epoch #467: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #468: 1025it [00:02, 356.31it/s, env_step=479232, len=8, n/ep=7, n/st=64, player_1/loss=1.937, player_2/loss=1.526, rew=0.00]


Epoch #468: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #469: 1025it [00:02, 356.74it/s, env_step=480256, len=7, n/ep=8, n/st=64, player_1/loss=2.906, player_2/loss=1.017, rew=0.00]


Epoch #469: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #470: 1025it [00:02, 355.06it/s, env_step=481280, len=8, n/ep=7, n/st=64, player_1/loss=2.566, player_2/loss=1.439, rew=0.00]


Epoch #470: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #471: 1025it [00:02, 353.40it/s, env_step=482304, len=8, n/ep=8, n/st=64, player_1/loss=2.462, player_2/loss=2.226, rew=0.00]


Epoch #471: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #472: 1025it [00:02, 356.29it/s, env_step=483328, len=9, n/ep=7, n/st=64, player_1/loss=3.368, player_2/loss=0.832, rew=0.00]


Epoch #472: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #473: 1025it [00:02, 353.84it/s, env_step=484352, len=9, n/ep=7, n/st=64, player_1/loss=3.026, player_2/loss=1.058, rew=0.00]


Epoch #473: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #474: 1025it [00:02, 355.06it/s, env_step=485376, len=9, n/ep=7, n/st=64, player_1/loss=3.484, player_2/loss=1.979, rew=0.00]


Epoch #474: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #475: 1025it [00:02, 354.44it/s, env_step=486400, len=8, n/ep=7, n/st=64, player_1/loss=2.877, player_2/loss=1.914, rew=0.00]


Epoch #475: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #476: 1025it [00:02, 355.92it/s, env_step=487424, len=10, n/ep=6, n/st=64, player_1/loss=1.073, player_2/loss=5.281, rew=0.00]


Epoch #476: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #477: 1025it [00:02, 353.41it/s, env_step=488448, len=9, n/ep=7, n/st=64, player_1/loss=1.541, player_2/loss=1.528, rew=0.00]


Epoch #477: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #478: 1025it [00:02, 355.66it/s, env_step=489472, len=9, n/ep=7, n/st=64, player_1/loss=0.724, player_2/loss=1.278, rew=0.00]


Epoch #478: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #479: 1025it [00:02, 342.83it/s, env_step=490496, len=8, n/ep=8, n/st=64, player_1/loss=4.036, player_2/loss=4.043, rew=0.00]


Epoch #479: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #480: 1025it [00:02, 354.54it/s, env_step=491520, len=7, n/ep=8, n/st=64, player_1/loss=2.855, player_2/loss=1.798, rew=0.00]


Epoch #480: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #481: 1025it [00:02, 355.39it/s, env_step=492544, len=9, n/ep=7, n/st=64, player_1/loss=3.421, player_2/loss=2.219, rew=0.00]


Epoch #481: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #482: 1025it [00:02, 355.03it/s, env_step=493568, len=9, n/ep=7, n/st=64, player_1/loss=2.649, player_2/loss=0.934, rew=0.00]


Epoch #482: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #483: 1025it [00:02, 355.37it/s, env_step=494592, len=14, n/ep=6, n/st=64, player_1/loss=4.181, player_2/loss=3.429, rew=0.00]


Epoch #483: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #484: 1025it [00:02, 354.89it/s, env_step=495616, len=9, n/ep=7, n/st=64, player_1/loss=1.772, player_2/loss=0.755, rew=0.00]


Epoch #484: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #485: 1025it [00:02, 356.43it/s, env_step=496640, len=11, n/ep=6, n/st=64, player_1/loss=1.560, player_2/loss=3.902, rew=0.00]


Epoch #485: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #486: 1025it [00:02, 355.95it/s, env_step=497664, len=8, n/ep=7, n/st=64, player_1/loss=3.043, player_2/loss=1.545, rew=0.00]


Epoch #486: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #487: 1025it [00:02, 351.67it/s, env_step=498688, len=7, n/ep=8, n/st=64, player_1/loss=4.234, player_2/loss=0.957, rew=0.00]


Epoch #487: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #488: 1025it [00:02, 354.54it/s, env_step=499712, len=8, n/ep=8, n/st=64, player_1/loss=2.921, player_2/loss=1.873, rew=0.00]


Epoch #488: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #489: 1025it [00:02, 355.92it/s, env_step=500736, len=11, n/ep=5, n/st=64, player_1/loss=4.783, player_2/loss=1.987, rew=0.00]


Epoch #489: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #490: 1025it [00:02, 355.96it/s, env_step=501760, len=8, n/ep=8, n/st=64, player_1/loss=3.778, player_2/loss=1.538, rew=0.00]


Epoch #490: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #491: 1025it [00:02, 353.76it/s, env_step=502784, len=9, n/ep=7, n/st=64, player_1/loss=4.064, player_2/loss=1.262, rew=0.00]


Epoch #491: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #492: 1025it [00:02, 353.94it/s, env_step=503808, len=8, n/ep=8, n/st=64, player_1/loss=5.853, player_2/loss=2.134, rew=0.00]


Epoch #492: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #493: 1025it [00:02, 352.93it/s, env_step=504832, len=7, n/ep=8, n/st=64, player_1/loss=3.812, player_2/loss=0.947, rew=0.00]


Epoch #493: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #494: 1025it [00:02, 355.64it/s, env_step=505856, len=8, n/ep=7, n/st=64, player_1/loss=3.702, player_2/loss=2.321, rew=0.00]


Epoch #494: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #495: 1025it [00:02, 354.27it/s, env_step=506880, len=8, n/ep=8, n/st=64, player_1/loss=5.328, player_2/loss=2.229, rew=0.00]


Epoch #495: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #496: 1025it [00:02, 355.53it/s, env_step=507904, len=8, n/ep=7, n/st=64, player_1/loss=1.802, player_2/loss=0.893, rew=0.00]


Epoch #496: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #497: 1025it [00:02, 354.64it/s, env_step=508928, len=10, n/ep=5, n/st=64, player_1/loss=2.440, player_2/loss=1.590, rew=0.00]


Epoch #497: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #498: 1025it [00:02, 355.48it/s, env_step=509952, len=7, n/ep=7, n/st=64, player_1/loss=1.067, player_2/loss=1.233, rew=0.00]


Epoch #498: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #499: 1025it [00:02, 354.44it/s, env_step=510976, len=7, n/ep=8, n/st=64, player_1/loss=1.863, player_2/loss=1.019, rew=0.00]


Epoch #499: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #500: 1025it [00:02, 355.67it/s, env_step=512000, len=10, n/ep=7, n/st=64, player_1/loss=4.597, player_2/loss=1.431, rew=0.00]


Epoch #500: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #501: 1025it [00:02, 355.49it/s, env_step=513024, len=8, n/ep=7, n/st=64, player_1/loss=4.289, player_2/loss=2.166, rew=0.00]


Epoch #501: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #502: 1025it [00:02, 354.89it/s, env_step=514048, len=8, n/ep=8, n/st=64, player_1/loss=2.935, player_2/loss=3.074, rew=0.00]


Epoch #502: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #503: 1025it [00:02, 356.23it/s, env_step=515072, len=8, n/ep=7, n/st=64, player_1/loss=2.275, player_2/loss=0.920, rew=0.00]


Epoch #503: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #504: 1025it [00:02, 353.88it/s, env_step=516096, len=8, n/ep=7, n/st=64, player_1/loss=2.470, player_2/loss=2.495, rew=0.00]


Epoch #504: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #505: 1025it [00:02, 356.56it/s, env_step=517120, len=8, n/ep=7, n/st=64, player_1/loss=3.630, player_2/loss=2.937, rew=0.00]


Epoch #505: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #506: 1025it [00:02, 353.17it/s, env_step=518144, len=8, n/ep=5, n/st=64, player_1/loss=2.708, player_2/loss=2.164, rew=0.00]


Epoch #506: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #507: 1025it [00:02, 355.97it/s, env_step=519168, len=8, n/ep=8, n/st=64, player_1/loss=4.268, player_2/loss=2.273, rew=0.00]


Epoch #507: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #508: 1025it [00:02, 355.89it/s, env_step=520192, len=9, n/ep=7, n/st=64, player_1/loss=3.667, player_2/loss=1.287, rew=0.00]


Epoch #508: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #509: 1025it [00:02, 354.11it/s, env_step=521216, len=7, n/ep=8, n/st=64, player_1/loss=2.354, player_2/loss=2.990, rew=0.00]


Epoch #509: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #510: 1025it [00:02, 354.44it/s, env_step=522240, len=7, n/ep=8, n/st=64, player_1/loss=3.318, player_2/loss=1.662, rew=0.00]


Epoch #510: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #511: 1025it [00:02, 354.52it/s, env_step=523264, len=7, n/ep=8, n/st=64, player_1/loss=2.145, player_2/loss=1.772, rew=0.00]


Epoch #511: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #512: 1025it [00:02, 355.19it/s, env_step=524288, len=8, n/ep=8, n/st=64, player_1/loss=3.919, player_2/loss=4.888, rew=0.00]


Epoch #512: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #513: 1025it [00:02, 355.44it/s, env_step=525312, len=11, n/ep=6, n/st=64, player_1/loss=1.938, player_2/loss=2.695, rew=0.00]


Epoch #513: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #514: 1025it [00:02, 355.58it/s, env_step=526336, len=7, n/ep=8, n/st=64, player_1/loss=2.887, player_2/loss=2.823, rew=0.00]


Epoch #514: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #515: 1025it [00:02, 355.14it/s, env_step=527360, len=9, n/ep=7, n/st=64, player_1/loss=3.112, player_2/loss=3.284, rew=0.00]


Epoch #515: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #516: 1025it [00:02, 355.11it/s, env_step=528384, len=8, n/ep=8, n/st=64, player_1/loss=3.627, player_2/loss=4.410, rew=0.00]


Epoch #516: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #517: 1025it [00:02, 353.66it/s, env_step=529408, len=7, n/ep=8, n/st=64, player_1/loss=6.399, player_2/loss=3.392, rew=0.00]


Epoch #517: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #518: 1025it [00:02, 355.15it/s, env_step=530432, len=8, n/ep=7, n/st=64, player_1/loss=3.233, player_2/loss=6.953, rew=0.00]


Epoch #518: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #519: 1025it [00:02, 353.65it/s, env_step=531456, len=7, n/ep=8, n/st=64, player_1/loss=5.662, player_2/loss=3.387, rew=0.00]


Epoch #519: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #520: 1025it [00:02, 356.83it/s, env_step=532480, len=8, n/ep=8, n/st=64, player_1/loss=3.965, player_2/loss=5.779, rew=0.00]


Epoch #520: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #521: 1025it [00:02, 354.79it/s, env_step=533504, len=8, n/ep=8, n/st=64, player_1/loss=4.503, player_2/loss=2.887, rew=0.00]


Epoch #521: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #522: 1025it [00:02, 355.52it/s, env_step=534528, len=7, n/ep=9, n/st=64, player_1/loss=7.410, player_2/loss=3.657, rew=0.00]


Epoch #522: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #523: 1025it [00:02, 354.99it/s, env_step=535552, len=11, n/ep=6, n/st=64, player_1/loss=5.356, player_2/loss=1.583, rew=0.00]


Epoch #523: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #524: 1025it [00:02, 354.54it/s, env_step=536576, len=8, n/ep=8, n/st=64, player_1/loss=8.822, player_2/loss=5.090, rew=0.00]


Epoch #524: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #525: 1025it [00:02, 355.30it/s, env_step=537600, len=9, n/ep=7, n/st=64, player_1/loss=6.270, player_2/loss=6.907, rew=0.00]


Epoch #525: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #526: 1025it [00:02, 356.55it/s, env_step=538624, len=10, n/ep=6, n/st=64, player_1/loss=11.978, player_2/loss=2.420, rew=0.00]


Epoch #526: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #527: 1025it [00:02, 355.97it/s, env_step=539648, len=9, n/ep=7, n/st=64, player_1/loss=2.551, player_2/loss=1.702, rew=0.00]


Epoch #527: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #528: 1025it [00:02, 355.85it/s, env_step=540672, len=10, n/ep=7, n/st=64, player_1/loss=1.542, player_2/loss=1.250, rew=0.00]


Epoch #528: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #529: 1025it [00:02, 349.14it/s, env_step=541696, len=8, n/ep=7, n/st=64, player_1/loss=1.452, player_2/loss=1.609, rew=0.00]


Epoch #529: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #530: 1025it [00:02, 356.08it/s, env_step=542720, len=7, n/ep=9, n/st=64, player_1/loss=1.187, player_2/loss=2.842, rew=0.00]


Epoch #530: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #531: 1025it [00:02, 355.74it/s, env_step=543744, len=10, n/ep=7, n/st=64, player_1/loss=6.061, player_2/loss=2.660, rew=0.00]


Epoch #531: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #532: 1025it [00:02, 357.38it/s, env_step=544768, len=9, n/ep=7, n/st=64, player_1/loss=2.801, player_2/loss=5.922, rew=0.00]


Epoch #532: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #533: 1025it [00:02, 355.78it/s, env_step=545792, len=9, n/ep=7, n/st=64, player_1/loss=4.259, player_2/loss=1.701, rew=0.00]


Epoch #533: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #534: 1025it [00:02, 354.83it/s, env_step=546816, len=9, n/ep=6, n/st=64, player_1/loss=2.298, player_2/loss=1.258, rew=0.00]


Epoch #534: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #535: 1025it [00:02, 353.48it/s, env_step=547840, len=7, n/ep=8, n/st=64, player_1/loss=2.974, player_2/loss=2.445, rew=0.00]


Epoch #535: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #536: 1025it [00:02, 355.28it/s, env_step=548864, len=8, n/ep=8, n/st=64, player_1/loss=3.678, player_2/loss=6.170, rew=0.00]


Epoch #536: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #537: 1025it [00:02, 354.41it/s, env_step=549888, len=7, n/ep=8, n/st=64, player_1/loss=2.335, player_2/loss=3.074, rew=0.00]


Epoch #537: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #538: 1025it [00:02, 354.03it/s, env_step=550912, len=7, n/ep=8, n/st=64, player_1/loss=2.074, player_2/loss=2.903, rew=0.00]


Epoch #538: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #539: 1025it [00:02, 356.37it/s, env_step=551936, len=8, n/ep=7, n/st=64, player_1/loss=1.548, player_2/loss=3.125, rew=0.00]


Epoch #539: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #540: 1025it [00:02, 354.32it/s, env_step=552960, len=10, n/ep=7, n/st=64, player_1/loss=1.068, player_2/loss=6.129, rew=0.00]


Epoch #540: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #541: 1025it [00:02, 354.49it/s, env_step=553984, len=8, n/ep=8, n/st=64, player_1/loss=2.826, player_2/loss=5.506, rew=0.00]


Epoch #541: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #542: 1025it [00:02, 351.77it/s, env_step=555008, len=9, n/ep=8, n/st=64, player_1/loss=1.012, player_2/loss=4.027, rew=0.00]


Epoch #542: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #543: 1025it [00:02, 355.52it/s, env_step=556032, len=8, n/ep=8, n/st=64, player_1/loss=2.454, player_2/loss=1.921, rew=0.00]


Epoch #543: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #544: 1025it [00:02, 355.60it/s, env_step=557056, len=8, n/ep=8, n/st=64, player_1/loss=2.103, player_2/loss=4.150, rew=0.00]


Epoch #544: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #545: 1025it [00:02, 355.90it/s, env_step=558080, len=7, n/ep=8, n/st=64, player_1/loss=1.748, player_2/loss=2.030, rew=0.00]


Epoch #545: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #546: 1025it [00:02, 355.70it/s, env_step=559104, len=11, n/ep=5, n/st=64, player_1/loss=2.569, player_2/loss=2.230, rew=0.00]


Epoch #546: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #547: 1025it [00:02, 354.77it/s, env_step=560128, len=9, n/ep=6, n/st=64, player_1/loss=3.780, player_2/loss=3.678, rew=0.00]


Epoch #547: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #548: 1025it [00:02, 356.02it/s, env_step=561152, len=8, n/ep=7, n/st=64, player_1/loss=2.512, player_2/loss=3.691, rew=0.00]


Epoch #548: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #549: 1025it [00:02, 355.83it/s, env_step=562176, len=10, n/ep=7, n/st=64, player_1/loss=2.887, player_2/loss=1.333, rew=0.00]


Epoch #549: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #550: 1025it [00:02, 357.19it/s, env_step=563200, len=10, n/ep=7, n/st=64, player_1/loss=2.425, player_2/loss=1.327, rew=0.00]


Epoch #550: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #551: 1025it [00:02, 355.44it/s, env_step=564224, len=9, n/ep=7, n/st=64, player_1/loss=3.049, player_2/loss=4.692, rew=0.00]


Epoch #551: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #552: 1025it [00:02, 354.37it/s, env_step=565248, len=9, n/ep=7, n/st=64, player_1/loss=2.276, player_2/loss=3.921, rew=0.00]


Epoch #552: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #553: 1025it [00:02, 355.24it/s, env_step=566272, len=8, n/ep=8, n/st=64, player_1/loss=2.062, player_2/loss=4.652, rew=0.00]


Epoch #553: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #554: 1025it [00:02, 355.99it/s, env_step=567296, len=7, n/ep=8, n/st=64, player_1/loss=3.370, player_2/loss=3.567, rew=0.00]


Epoch #554: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #555: 1025it [00:02, 355.39it/s, env_step=568320, len=11, n/ep=6, n/st=64, player_1/loss=2.173, player_2/loss=1.869, rew=0.00]


Epoch #555: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #556: 1025it [00:02, 356.37it/s, env_step=569344, len=8, n/ep=6, n/st=64, player_1/loss=2.082, player_2/loss=3.287, rew=0.00]


Epoch #556: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #557: 1025it [00:02, 357.17it/s, env_step=570368, len=8, n/ep=8, n/st=64, player_1/loss=2.959, player_2/loss=2.939, rew=0.00]


Epoch #557: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #558: 1025it [00:02, 354.79it/s, env_step=571392, len=8, n/ep=8, n/st=64, player_1/loss=2.205, player_2/loss=1.742, rew=0.00]


Epoch #558: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #559: 1025it [00:02, 355.06it/s, env_step=572416, len=8, n/ep=7, n/st=64, player_1/loss=3.368, player_2/loss=1.513, rew=0.00]


Epoch #559: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #560: 1025it [00:02, 355.94it/s, env_step=573440, len=7, n/ep=8, n/st=64, player_1/loss=1.672, player_2/loss=1.883, rew=0.00]


Epoch #560: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #561: 1025it [00:02, 355.44it/s, env_step=574464, len=10, n/ep=7, n/st=64, player_1/loss=0.999, player_2/loss=1.994, rew=0.00]


Epoch #561: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #562: 1025it [00:02, 356.40it/s, env_step=575488, len=8, n/ep=8, n/st=64, player_1/loss=3.087, player_2/loss=1.696, rew=0.00]


Epoch #562: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #563: 1025it [00:02, 354.72it/s, env_step=576512, len=7, n/ep=8, n/st=64, player_1/loss=5.225, player_2/loss=3.450, rew=0.00]


Epoch #563: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #564: 1025it [00:02, 356.12it/s, env_step=577536, len=8, n/ep=8, n/st=64, player_1/loss=4.027, player_2/loss=1.247, rew=0.00]


Epoch #564: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #565: 1025it [00:02, 356.02it/s, env_step=578560, len=9, n/ep=6, n/st=64, player_1/loss=3.289, player_2/loss=1.320, rew=0.00]


Epoch #565: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #566: 1025it [00:02, 357.76it/s, env_step=579584, len=8, n/ep=8, n/st=64, player_1/loss=2.719, player_2/loss=2.197, rew=0.00]


Epoch #566: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #567: 1025it [00:02, 355.36it/s, env_step=580608, len=8, n/ep=8, n/st=64, player_1/loss=2.568, player_2/loss=1.720, rew=0.00]


Epoch #567: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #568: 1025it [00:02, 354.73it/s, env_step=581632, len=7, n/ep=9, n/st=64, player_1/loss=1.523, player_2/loss=2.989, rew=0.00]


Epoch #568: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #569: 1025it [00:02, 355.10it/s, env_step=582656, len=8, n/ep=7, n/st=64, player_1/loss=1.886, player_2/loss=1.546, rew=0.00]


Epoch #569: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #570: 1025it [00:02, 353.07it/s, env_step=583680, len=7, n/ep=8, n/st=64, player_1/loss=1.934, player_2/loss=2.660, rew=0.00]


Epoch #570: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #571: 1025it [00:02, 353.85it/s, env_step=584704, len=8, n/ep=8, n/st=64, player_1/loss=2.441, player_2/loss=1.006, rew=0.00]


Epoch #571: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #572: 1025it [00:02, 356.14it/s, env_step=585728, len=8, n/ep=8, n/st=64, player_1/loss=2.170, player_2/loss=1.491, rew=0.00]


Epoch #572: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #573: 1025it [00:02, 354.42it/s, env_step=586752, len=7, n/ep=8, n/st=64, player_1/loss=2.230, player_2/loss=0.934, rew=0.00]


Epoch #573: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #574: 1025it [00:02, 354.74it/s, env_step=587776, len=8, n/ep=6, n/st=64, player_1/loss=1.720, player_2/loss=1.582, rew=0.00]


Epoch #574: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #575: 1025it [00:02, 356.74it/s, env_step=588800, len=11, n/ep=6, n/st=64, player_1/loss=3.617, player_2/loss=2.022, rew=0.00]


Epoch #575: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #576: 1025it [00:02, 355.75it/s, env_step=589824, len=9, n/ep=6, n/st=64, player_1/loss=2.654, player_2/loss=0.801, rew=0.00]


Epoch #576: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #577: 1025it [00:02, 357.01it/s, env_step=590848, len=9, n/ep=7, n/st=64, player_1/loss=1.539, player_2/loss=0.860, rew=0.00]


Epoch #577: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #578: 1025it [00:02, 356.57it/s, env_step=591872, len=8, n/ep=7, n/st=64, player_1/loss=3.320, player_2/loss=1.578, rew=0.00]


Epoch #578: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #579: 1025it [00:02, 354.71it/s, env_step=592896, len=8, n/ep=7, n/st=64, player_1/loss=4.382, player_2/loss=1.331, rew=0.00]


Epoch #579: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #580: 1025it [00:02, 355.16it/s, env_step=593920, len=8, n/ep=7, n/st=64, player_1/loss=4.059, player_2/loss=0.842, rew=0.00]


Epoch #580: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #581: 1025it [00:02, 357.16it/s, env_step=594944, len=7, n/ep=8, n/st=64, player_1/loss=7.716, player_2/loss=2.642, rew=0.00]


Epoch #581: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #582: 1025it [00:02, 354.06it/s, env_step=595968, len=7, n/ep=8, n/st=64, player_1/loss=6.311, player_2/loss=1.605, rew=0.00]


Epoch #582: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #583: 1025it [00:02, 355.61it/s, env_step=596992, len=8, n/ep=8, n/st=64, player_1/loss=2.931, player_2/loss=1.937, rew=0.00]


Epoch #583: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #584: 1025it [00:02, 354.38it/s, env_step=598016, len=8, n/ep=8, n/st=64, player_1/loss=3.675, player_2/loss=1.816, rew=0.00]


Epoch #584: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #585: 1025it [00:02, 354.83it/s, env_step=599040, len=7, n/ep=8, n/st=64, player_1/loss=5.404, player_2/loss=0.517, rew=0.00]


Epoch #585: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #586: 1025it [00:02, 356.86it/s, env_step=600064, len=10, n/ep=6, n/st=64, player_1/loss=4.411, player_2/loss=2.034, rew=0.00]


Epoch #586: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #587: 1025it [00:02, 354.99it/s, env_step=601088, len=7, n/ep=9, n/st=64, player_1/loss=5.871, player_2/loss=1.723, rew=0.00]


Epoch #587: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #588: 1025it [00:02, 358.10it/s, env_step=602112, len=8, n/ep=8, n/st=64, player_1/loss=5.524, player_2/loss=1.184, rew=0.00]


Epoch #588: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #589: 1025it [00:02, 356.90it/s, env_step=603136, len=10, n/ep=5, n/st=64, player_1/loss=5.937, player_2/loss=0.736, rew=0.00]


Epoch #589: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #590: 1025it [00:02, 355.16it/s, env_step=604160, len=9, n/ep=6, n/st=64, player_1/loss=4.447, player_2/loss=2.574, rew=0.00]


Epoch #590: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #591: 1025it [00:02, 354.53it/s, env_step=605184, len=9, n/ep=8, n/st=64, player_1/loss=2.325, player_2/loss=2.183, rew=0.00]


Epoch #591: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #592: 1025it [00:02, 355.24it/s, env_step=606208, len=8, n/ep=7, n/st=64, player_1/loss=6.801, player_2/loss=3.132, rew=0.00]


Epoch #592: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #593: 1025it [00:02, 354.74it/s, env_step=607232, len=9, n/ep=6, n/st=64, player_1/loss=2.493, player_2/loss=1.350, rew=0.00]


Epoch #593: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #594: 1025it [00:02, 355.86it/s, env_step=608256, len=8, n/ep=7, n/st=64, player_1/loss=1.766, player_2/loss=1.581, rew=0.00]


Epoch #594: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #595: 1025it [00:02, 356.67it/s, env_step=609280, len=8, n/ep=7, n/st=64, player_1/loss=2.698, player_2/loss=1.747, rew=0.00]


Epoch #595: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #596: 1025it [00:02, 355.20it/s, env_step=610304, len=7, n/ep=8, n/st=64, player_1/loss=2.923, player_2/loss=1.024, rew=0.00]


Epoch #596: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #597: 1025it [00:02, 355.60it/s, env_step=611328, len=8, n/ep=8, n/st=64, player_1/loss=2.406, player_2/loss=1.658, rew=0.00]


Epoch #597: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #598: 1025it [00:02, 355.24it/s, env_step=612352, len=8, n/ep=8, n/st=64, player_1/loss=4.749, player_2/loss=1.837, rew=0.00]


Epoch #598: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #599: 1025it [00:02, 356.33it/s, env_step=613376, len=8, n/ep=7, n/st=64, player_1/loss=3.871, player_2/loss=1.707, rew=0.00]


Epoch #599: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #600: 1025it [00:02, 355.39it/s, env_step=614400, len=9, n/ep=7, n/st=64, player_1/loss=2.121, player_2/loss=2.239, rew=0.00]


Epoch #600: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #601: 1025it [00:02, 355.49it/s, env_step=615424, len=8, n/ep=8, n/st=64, player_1/loss=2.930, player_2/loss=2.490, rew=0.00]


Epoch #601: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #602: 1025it [00:02, 354.08it/s, env_step=616448, len=8, n/ep=8, n/st=64, player_1/loss=1.987, player_2/loss=0.356, rew=0.00]


Epoch #602: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #603: 1025it [00:02, 355.20it/s, env_step=617472, len=9, n/ep=7, n/st=64, player_1/loss=1.746, player_2/loss=0.814, rew=0.00]


Epoch #603: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #604: 1025it [00:02, 357.71it/s, env_step=618496, len=8, n/ep=8, n/st=64, player_1/loss=2.220, player_2/loss=0.618, rew=0.00]


Epoch #604: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #605: 1025it [00:02, 356.69it/s, env_step=619520, len=9, n/ep=7, n/st=64, player_1/loss=2.968, player_2/loss=1.273, rew=0.00]


Epoch #605: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #606: 1025it [00:02, 355.57it/s, env_step=620544, len=9, n/ep=7, n/st=64, player_1/loss=1.398, player_2/loss=4.557, rew=0.00]


Epoch #606: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #607: 1025it [00:02, 354.32it/s, env_step=621568, len=9, n/ep=7, n/st=64, player_1/loss=3.503, player_2/loss=0.430, rew=0.00]


Epoch #607: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #608: 1025it [00:02, 355.75it/s, env_step=622592, len=8, n/ep=8, n/st=64, player_1/loss=1.814, player_2/loss=1.750, rew=0.00]


Epoch #608: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #609: 1025it [00:02, 356.00it/s, env_step=623616, len=8, n/ep=8, n/st=64, player_1/loss=1.715, player_2/loss=1.439, rew=0.00]


Epoch #609: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #610: 1025it [00:02, 354.26it/s, env_step=624640, len=7, n/ep=8, n/st=64, player_1/loss=3.368, player_2/loss=1.842, rew=0.00]


Epoch #610: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #611: 1025it [00:02, 351.21it/s, env_step=625664, len=7, n/ep=9, n/st=64, player_1/loss=3.007, player_2/loss=1.520, rew=0.00]


Epoch #611: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #612: 1025it [00:02, 400.02it/s, env_step=626688, len=7, n/ep=8, n/st=64, player_1/loss=3.178, player_2/loss=0.774, rew=0.00]


Epoch #612: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #613: 1025it [00:02, 399.63it/s, env_step=627712, len=9, n/ep=7, n/st=64, player_1/loss=3.084, player_2/loss=0.679, rew=0.00]


Epoch #613: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #614: 1025it [00:02, 372.55it/s, env_step=628736, len=8, n/ep=8, n/st=64, player_1/loss=1.924, player_2/loss=2.016, rew=0.00]


Epoch #614: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #615: 1025it [00:02, 363.65it/s, env_step=629760, len=8, n/ep=8, n/st=64, player_1/loss=5.467, player_2/loss=2.436, rew=0.00]


Epoch #615: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #616: 1025it [00:03, 341.48it/s, env_step=630784, len=11, n/ep=6, n/st=64, player_1/loss=6.001, player_2/loss=3.235, rew=0.00]


Epoch #616: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #617: 1025it [00:02, 392.57it/s, env_step=631808, len=11, n/ep=6, n/st=64, player_1/loss=2.739, player_2/loss=3.463, rew=0.00]


Epoch #617: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #618: 1025it [00:02, 357.47it/s, env_step=632832, len=10, n/ep=6, n/st=64, player_1/loss=4.902, player_2/loss=1.222, rew=0.00]


Epoch #618: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #619: 1025it [00:03, 303.47it/s, env_step=633856, len=9, n/ep=7, n/st=64, player_1/loss=2.513, player_2/loss=2.360, rew=0.00]


Epoch #619: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #620: 1025it [00:02, 359.14it/s, env_step=634880, len=8, n/ep=7, n/st=64, player_1/loss=2.929, player_2/loss=1.206, rew=0.00]


Epoch #620: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #621: 1025it [00:02, 357.78it/s, env_step=635904, len=8, n/ep=8, n/st=64, player_1/loss=1.812, player_2/loss=1.392, rew=0.00]


Epoch #621: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #622: 1025it [00:03, 317.77it/s, env_step=636928, len=7, n/ep=8, n/st=64, player_1/loss=3.900, player_2/loss=2.092, rew=0.00]


Epoch #622: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #623: 1025it [00:02, 354.32it/s, env_step=637952, len=7, n/ep=7, n/st=64, player_1/loss=1.382, player_2/loss=1.976, rew=0.00]


Epoch #623: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #624: 1025it [00:02, 358.12it/s, env_step=638976, len=8, n/ep=8, n/st=64, player_1/loss=5.514, player_2/loss=3.248, rew=0.00]


Epoch #624: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #625: 1025it [00:02, 371.91it/s, env_step=640000, len=8, n/ep=7, n/st=64, player_1/loss=4.127, player_2/loss=1.821, rew=0.00]


Epoch #625: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #626: 1025it [00:02, 376.20it/s, env_step=641024, len=9, n/ep=7, n/st=64, player_1/loss=2.204, player_2/loss=2.653, rew=0.00]


Epoch #626: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #627: 1025it [00:02, 364.97it/s, env_step=642048, len=10, n/ep=6, n/st=64, player_1/loss=3.412, player_2/loss=2.072, rew=0.00]


Epoch #627: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #628: 1025it [00:02, 368.23it/s, env_step=643072, len=8, n/ep=7, n/st=64, player_1/loss=1.616, player_2/loss=3.483, rew=0.00]


Epoch #628: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #629: 1025it [00:02, 374.18it/s, env_step=644096, len=8, n/ep=8, n/st=64, player_1/loss=2.417, player_2/loss=1.181, rew=0.00]


Epoch #629: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #630: 1025it [00:02, 362.69it/s, env_step=645120, len=7, n/ep=9, n/st=64, player_1/loss=0.867, player_2/loss=1.211, rew=0.00]


Epoch #630: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #631: 1025it [00:02, 377.46it/s, env_step=646144, len=8, n/ep=8, n/st=64, player_1/loss=2.070, player_2/loss=9.365, rew=0.00]


Epoch #631: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #632: 1025it [00:02, 378.21it/s, env_step=647168, len=8, n/ep=8, n/st=64, player_1/loss=1.807, player_2/loss=0.932, rew=0.00]


Epoch #632: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #633: 1025it [00:02, 375.88it/s, env_step=648192, len=7, n/ep=8, n/st=64, player_1/loss=1.456, player_2/loss=1.096, rew=0.00]


Epoch #633: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #634: 1025it [00:02, 373.79it/s, env_step=649216, len=8, n/ep=8, n/st=64, player_1/loss=4.674, player_2/loss=4.303, rew=0.00]


Epoch #634: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #635: 1025it [00:02, 370.92it/s, env_step=650240, len=9, n/ep=6, n/st=64, player_1/loss=1.721, player_2/loss=4.214, rew=0.00]


Epoch #635: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #636: 1025it [00:02, 380.89it/s, env_step=651264, len=9, n/ep=7, n/st=64, player_1/loss=2.916, player_2/loss=3.937, rew=0.00]


Epoch #636: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #637: 1025it [00:02, 386.22it/s, env_step=652288, len=8, n/ep=8, n/st=64, player_1/loss=1.390, player_2/loss=2.067, rew=0.00]


Epoch #637: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #638: 1025it [00:02, 363.29it/s, env_step=653312, len=8, n/ep=8, n/st=64, player_1/loss=2.080, player_2/loss=0.800, rew=0.00]


Epoch #638: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #639: 1025it [00:02, 356.47it/s, env_step=654336, len=8, n/ep=8, n/st=64, player_1/loss=0.893, player_2/loss=1.515, rew=0.00]


Epoch #639: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #640: 1025it [00:02, 354.62it/s, env_step=655360, len=8, n/ep=8, n/st=64, player_1/loss=0.929, player_2/loss=1.609, rew=0.00]


Epoch #640: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #641: 1025it [00:02, 353.61it/s, env_step=656384, len=8, n/ep=7, n/st=64, player_1/loss=0.504, player_2/loss=1.352, rew=0.00]


Epoch #641: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #642: 1025it [00:02, 353.78it/s, env_step=657408, len=9, n/ep=7, n/st=64, player_1/loss=1.192, player_2/loss=1.354, rew=0.00]


Epoch #642: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #643: 1025it [00:02, 355.08it/s, env_step=658432, len=11, n/ep=7, n/st=64, player_1/loss=0.751, player_2/loss=1.148, rew=0.00]


Epoch #643: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #644: 1025it [00:02, 353.58it/s, env_step=659456, len=8, n/ep=7, n/st=64, player_1/loss=1.620, player_2/loss=2.054, rew=0.00]


Epoch #644: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #645: 1025it [00:02, 354.50it/s, env_step=660480, len=7, n/ep=8, n/st=64, player_1/loss=0.950, player_2/loss=1.706, rew=0.00]


Epoch #645: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #646: 1025it [00:02, 348.84it/s, env_step=661504, len=8, n/ep=7, n/st=64, player_1/loss=1.391, player_2/loss=1.494, rew=0.00]


Epoch #646: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #647: 1025it [00:02, 353.91it/s, env_step=662528, len=8, n/ep=7, n/st=64, player_1/loss=0.810, player_2/loss=1.801, rew=0.00]


Epoch #647: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #648: 1025it [00:02, 354.27it/s, env_step=663552, len=8, n/ep=6, n/st=64, player_1/loss=1.126, player_2/loss=2.406, rew=0.00]


Epoch #648: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #649: 1025it [00:02, 353.53it/s, env_step=664576, len=8, n/ep=8, n/st=64, player_1/loss=3.030, player_2/loss=3.495, rew=0.00]


Epoch #649: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #650: 1025it [00:02, 354.92it/s, env_step=665600, len=8, n/ep=7, n/st=64, player_1/loss=4.931, player_2/loss=3.244, rew=0.00]


Epoch #650: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #651: 1025it [00:02, 351.84it/s, env_step=666624, len=9, n/ep=7, n/st=64, player_1/loss=2.567, player_2/loss=2.124, rew=0.00]


Epoch #651: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #652: 1025it [00:02, 354.96it/s, env_step=667648, len=12, n/ep=5, n/st=64, player_1/loss=2.360, player_2/loss=1.596, rew=0.00]


Epoch #652: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #653: 1025it [00:02, 349.15it/s, env_step=668672, len=7, n/ep=8, n/st=64, player_1/loss=1.745, player_2/loss=0.811, rew=0.00]


Epoch #653: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #654: 1025it [00:02, 351.74it/s, env_step=669696, len=8, n/ep=8, n/st=64, player_1/loss=0.873, player_2/loss=0.950, rew=0.00]


Epoch #654: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #655: 1025it [00:02, 352.99it/s, env_step=670720, len=9, n/ep=6, n/st=64, player_1/loss=0.704, player_2/loss=1.297, rew=0.00]


Epoch #655: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #656: 1025it [00:02, 352.27it/s, env_step=671744, len=7, n/ep=9, n/st=64, player_1/loss=1.631, player_2/loss=0.461, rew=0.00]


Epoch #656: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #657: 1025it [00:02, 356.54it/s, env_step=672768, len=8, n/ep=7, n/st=64, player_1/loss=1.953, player_2/loss=0.672, rew=0.00]


Epoch #657: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #658: 1025it [00:02, 353.66it/s, env_step=673792, len=7, n/ep=8, n/st=64, player_1/loss=2.752, player_2/loss=1.364, rew=0.00]


Epoch #658: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #659: 1025it [00:02, 355.00it/s, env_step=674816, len=8, n/ep=8, n/st=64, player_1/loss=1.734, player_2/loss=2.734, rew=0.00]


Epoch #659: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #660: 1025it [00:02, 354.13it/s, env_step=675840, len=7, n/ep=9, n/st=64, player_1/loss=1.830, player_2/loss=0.591, rew=0.00]


Epoch #660: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #661: 1025it [00:02, 354.19it/s, env_step=676864, len=13, n/ep=5, n/st=64, player_1/loss=3.716, player_2/loss=1.644, rew=0.00]


Epoch #661: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #662: 1025it [00:02, 354.59it/s, env_step=677888, len=8, n/ep=7, n/st=64, player_1/loss=4.648, player_2/loss=6.187, rew=0.00]


Epoch #662: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #663: 1025it [00:02, 352.77it/s, env_step=678912, len=9, n/ep=6, n/st=64, player_1/loss=2.452, player_2/loss=3.769, rew=0.00]


Epoch #663: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #664: 1025it [00:02, 354.57it/s, env_step=679936, len=9, n/ep=7, n/st=64, player_1/loss=1.475, player_2/loss=1.864, rew=0.00]


Epoch #664: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #665: 1025it [00:02, 353.95it/s, env_step=680960, len=9, n/ep=6, n/st=64, player_1/loss=1.611, player_2/loss=1.671, rew=0.00]


Epoch #665: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #666: 1025it [00:02, 352.28it/s, env_step=681984, len=9, n/ep=7, n/st=64, player_1/loss=3.951, player_2/loss=1.797, rew=0.00]


Epoch #666: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #667: 1025it [00:02, 352.36it/s, env_step=683008, len=9, n/ep=7, n/st=64, player_1/loss=1.751, player_2/loss=2.771, rew=0.00]


Epoch #667: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #668: 1025it [00:02, 353.97it/s, env_step=684032, len=10, n/ep=6, n/st=64, player_1/loss=2.466, player_2/loss=1.954, rew=0.00]


Epoch #668: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #669: 1025it [00:02, 355.05it/s, env_step=685056, len=9, n/ep=6, n/st=64, player_1/loss=3.020, player_2/loss=4.449, rew=0.00]


Epoch #669: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #670: 1025it [00:02, 357.84it/s, env_step=686080, len=8, n/ep=7, n/st=64, player_1/loss=1.475, player_2/loss=1.360, rew=0.00]


Epoch #670: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #671: 1025it [00:02, 352.34it/s, env_step=687104, len=9, n/ep=6, n/st=64, player_1/loss=1.375, player_2/loss=0.307, rew=0.00]


Epoch #671: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #672: 1025it [00:02, 353.08it/s, env_step=688128, len=9, n/ep=7, n/st=64, player_1/loss=1.628, player_2/loss=2.425, rew=0.00]


Epoch #672: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #673: 1025it [00:02, 352.11it/s, env_step=689152, len=9, n/ep=7, n/st=64, player_1/loss=1.319, player_2/loss=2.375, rew=0.00]


Epoch #673: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #674: 1025it [00:02, 354.84it/s, env_step=690176, len=9, n/ep=7, n/st=64, player_1/loss=5.598, player_2/loss=1.703, rew=0.00]


Epoch #674: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #675: 1025it [00:02, 353.70it/s, env_step=691200, len=8, n/ep=8, n/st=64, player_1/loss=11.074, player_2/loss=2.350, rew=0.00]


Epoch #675: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #676: 1025it [00:02, 349.56it/s, env_step=692224, len=8, n/ep=7, n/st=64, player_1/loss=9.233, player_2/loss=1.590, rew=0.00]


Epoch #676: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #677: 1025it [00:02, 354.68it/s, env_step=693248, len=7, n/ep=9, n/st=64, player_1/loss=6.487, player_2/loss=0.634, rew=0.00]


Epoch #677: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #678: 1025it [00:02, 352.70it/s, env_step=694272, len=8, n/ep=8, n/st=64, player_1/loss=4.996, player_2/loss=0.773, rew=0.00]


Epoch #678: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #679: 1025it [00:02, 352.69it/s, env_step=695296, len=8, n/ep=8, n/st=64, player_1/loss=5.754, player_2/loss=0.917, rew=0.00]


Epoch #679: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #680: 1025it [00:02, 353.95it/s, env_step=696320, len=8, n/ep=8, n/st=64, player_1/loss=5.174, player_2/loss=1.042, rew=0.00]


Epoch #680: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #681: 1025it [00:02, 354.49it/s, env_step=697344, len=12, n/ep=5, n/st=64, player_1/loss=6.606, player_2/loss=1.651, rew=0.00]


Epoch #681: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #682: 1025it [00:02, 353.01it/s, env_step=698368, len=7, n/ep=8, n/st=64, player_1/loss=6.296, player_2/loss=1.006, rew=0.00]


Epoch #682: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #683: 1025it [00:02, 352.48it/s, env_step=699392, len=10, n/ep=7, n/st=64, player_1/loss=4.538, player_2/loss=1.892, rew=0.00]


Epoch #683: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #684: 1025it [00:02, 354.66it/s, env_step=700416, len=8, n/ep=7, n/st=64, player_1/loss=4.099, player_2/loss=1.136, rew=0.00]


Epoch #684: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #685: 1025it [00:02, 354.18it/s, env_step=701440, len=11, n/ep=5, n/st=64, player_1/loss=9.175, player_2/loss=1.169, rew=0.00]


Epoch #685: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #686: 1025it [00:02, 355.13it/s, env_step=702464, len=8, n/ep=7, n/st=64, player_1/loss=5.060, player_2/loss=2.074, rew=0.00]


Epoch #686: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #687: 1025it [00:02, 352.26it/s, env_step=703488, len=8, n/ep=8, n/st=64, player_1/loss=5.710, player_2/loss=1.160, rew=0.00]


Epoch #687: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #688: 1025it [00:02, 354.31it/s, env_step=704512, len=9, n/ep=7, n/st=64, player_1/loss=2.591, player_2/loss=0.458, rew=0.00]


Epoch #688: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #689: 1025it [00:02, 352.44it/s, env_step=705536, len=10, n/ep=7, n/st=64, player_1/loss=1.678, player_2/loss=1.148, rew=0.00]


Epoch #689: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #690: 1025it [00:02, 352.61it/s, env_step=706560, len=9, n/ep=7, n/st=64, player_1/loss=1.814, player_2/loss=1.622, rew=0.00]


Epoch #690: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #691: 1025it [00:02, 352.58it/s, env_step=707584, len=9, n/ep=6, n/st=64, player_1/loss=4.126, player_2/loss=1.623, rew=0.00]


Epoch #691: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #692: 1025it [00:02, 354.55it/s, env_step=708608, len=9, n/ep=7, n/st=64, player_1/loss=4.190, player_2/loss=1.808, rew=0.00]


Epoch #692: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #693: 1025it [00:02, 355.06it/s, env_step=709632, len=11, n/ep=6, n/st=64, player_1/loss=2.641, player_2/loss=1.269, rew=0.00]


Epoch #693: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #694: 1025it [00:02, 350.38it/s, env_step=710656, len=11, n/ep=7, n/st=64, player_1/loss=4.102, player_2/loss=0.632, rew=0.00]


Epoch #694: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #695: 1025it [00:02, 351.50it/s, env_step=711680, len=13, n/ep=5, n/st=64, player_1/loss=4.207, player_2/loss=0.593, rew=0.00]


Epoch #695: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #696: 1025it [00:02, 353.50it/s, env_step=712704, len=8, n/ep=8, n/st=64, player_1/loss=5.710, player_2/loss=1.156, rew=0.00]


Epoch #696: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #697: 1025it [00:02, 352.44it/s, env_step=713728, len=8, n/ep=7, n/st=64, player_1/loss=4.185, player_2/loss=2.032, rew=0.00]


Epoch #697: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #698: 1025it [00:02, 355.19it/s, env_step=714752, len=8, n/ep=8, n/st=64, player_1/loss=2.439, player_2/loss=0.997, rew=0.00]


Epoch #698: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #699: 1025it [00:02, 353.75it/s, env_step=715776, len=7, n/ep=8, n/st=64, player_1/loss=2.667, player_2/loss=1.537, rew=0.00]


Epoch #699: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #700: 1025it [00:02, 352.84it/s, env_step=716800, len=8, n/ep=7, n/st=64, player_1/loss=3.625, player_2/loss=2.448, rew=0.00]


Epoch #700: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #701: 1025it [00:02, 353.08it/s, env_step=717824, len=7, n/ep=9, n/st=64, player_1/loss=2.984, player_2/loss=3.409, rew=0.00]


Epoch #701: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #702: 1025it [00:02, 353.43it/s, env_step=718848, len=8, n/ep=8, n/st=64, player_1/loss=4.037, player_2/loss=1.078, rew=0.00]


Epoch #702: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #703: 1025it [00:02, 354.65it/s, env_step=719872, len=9, n/ep=7, n/st=64, player_1/loss=4.430, player_2/loss=1.459, rew=0.00]


Epoch #703: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #704: 1025it [00:02, 353.59it/s, env_step=720896, len=8, n/ep=8, n/st=64, player_1/loss=3.225, player_2/loss=3.691, rew=0.00]


Epoch #704: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #705: 1025it [00:02, 352.97it/s, env_step=721920, len=7, n/ep=8, n/st=64, player_1/loss=0.779, player_2/loss=2.469, rew=0.00]


Epoch #705: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #706: 1025it [00:02, 352.31it/s, env_step=722944, len=8, n/ep=7, n/st=64, player_1/loss=1.235, player_2/loss=2.164, rew=0.00]


Epoch #706: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #707: 1025it [00:02, 354.40it/s, env_step=723968, len=9, n/ep=7, n/st=64, player_1/loss=1.142, player_2/loss=0.779, rew=0.00]


Epoch #707: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #708: 1025it [00:02, 353.99it/s, env_step=724992, len=10, n/ep=6, n/st=64, player_1/loss=0.722, player_2/loss=1.379, rew=0.00]


Epoch #708: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #709: 1025it [00:02, 352.02it/s, env_step=726016, len=8, n/ep=8, n/st=64, player_1/loss=1.031, player_2/loss=3.803, rew=0.00]


Epoch #709: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #710: 1025it [00:02, 354.91it/s, env_step=727040, len=8, n/ep=7, n/st=64, player_1/loss=2.140, player_2/loss=4.796, rew=0.00]


Epoch #710: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #711: 1025it [00:02, 352.31it/s, env_step=728064, len=10, n/ep=6, n/st=64, player_1/loss=1.503, player_2/loss=3.055, rew=0.00]


Epoch #711: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #712: 1025it [00:02, 356.41it/s, env_step=729088, len=8, n/ep=7, n/st=64, player_1/loss=1.208, player_2/loss=6.558, rew=0.00]


Epoch #712: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #713: 1025it [00:02, 351.52it/s, env_step=730112, len=11, n/ep=5, n/st=64, player_1/loss=3.699, player_2/loss=3.256, rew=0.00]


Epoch #713: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #714: 1025it [00:02, 353.77it/s, env_step=731136, len=8, n/ep=8, n/st=64, player_1/loss=3.904, player_2/loss=4.947, rew=0.00]


Epoch #714: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #715: 1025it [00:02, 355.79it/s, env_step=732160, len=10, n/ep=6, n/st=64, player_1/loss=4.319, player_2/loss=5.291, rew=0.00]


Epoch #715: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #716: 1025it [00:02, 353.51it/s, env_step=733184, len=11, n/ep=6, n/st=64, player_1/loss=3.398, player_2/loss=5.062, rew=0.00]


Epoch #716: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #717: 1025it [00:02, 353.50it/s, env_step=734208, len=9, n/ep=7, n/st=64, player_1/loss=2.955, player_2/loss=3.700, rew=0.00]


Epoch #717: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #718: 1025it [00:02, 352.34it/s, env_step=735232, len=9, n/ep=7, n/st=64, player_1/loss=5.722, player_2/loss=3.605, rew=0.00]


Epoch #718: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #719: 1025it [00:02, 354.12it/s, env_step=736256, len=10, n/ep=6, n/st=64, player_1/loss=5.221, player_2/loss=1.649, rew=0.00]


Epoch #719: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #720: 1025it [00:02, 353.91it/s, env_step=737280, len=13, n/ep=5, n/st=64, player_1/loss=3.650, player_2/loss=4.736, rew=0.00]


Epoch #720: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #721: 1025it [00:02, 354.88it/s, env_step=738304, len=9, n/ep=7, n/st=64, player_1/loss=2.026, player_2/loss=4.431, rew=0.00]


Epoch #721: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #722: 1025it [00:02, 352.95it/s, env_step=739328, len=9, n/ep=8, n/st=64, player_1/loss=2.273, player_2/loss=1.875, rew=0.00]


Epoch #722: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #723: 1025it [00:02, 354.28it/s, env_step=740352, len=9, n/ep=7, n/st=64, player_1/loss=1.552, player_2/loss=1.725, rew=0.00]


Epoch #723: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #724: 1025it [00:02, 353.64it/s, env_step=741376, len=9, n/ep=7, n/st=64, player_1/loss=2.156, player_2/loss=2.151, rew=0.00]


Epoch #724: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #725: 1025it [00:02, 355.00it/s, env_step=742400, len=9, n/ep=7, n/st=64, player_1/loss=2.898, player_2/loss=3.824, rew=0.00]


Epoch #725: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #726: 1025it [00:02, 354.36it/s, env_step=743424, len=9, n/ep=7, n/st=64, player_1/loss=3.064, player_2/loss=3.745, rew=0.00]


Epoch #726: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #727: 1025it [00:02, 353.22it/s, env_step=744448, len=8, n/ep=7, n/st=64, player_1/loss=2.160, player_2/loss=2.011, rew=0.00]


Epoch #727: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #728: 1025it [00:02, 354.39it/s, env_step=745472, len=10, n/ep=6, n/st=64, player_1/loss=1.257, player_2/loss=0.810, rew=0.00]


Epoch #728: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #729: 1025it [00:02, 353.53it/s, env_step=746496, len=10, n/ep=6, n/st=64, player_1/loss=4.545, player_2/loss=2.936, rew=0.00]


Epoch #729: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #730: 1025it [00:02, 352.54it/s, env_step=747520, len=8, n/ep=8, n/st=64, player_1/loss=4.050, player_2/loss=1.765, rew=0.00]


Epoch #730: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #731: 1025it [00:02, 354.81it/s, env_step=748544, len=9, n/ep=6, n/st=64, player_1/loss=2.380, player_2/loss=2.812, rew=0.00]


Epoch #731: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #732: 1025it [00:02, 354.09it/s, env_step=749568, len=7, n/ep=9, n/st=64, player_1/loss=3.257, player_2/loss=2.878, rew=0.00]


Epoch #732: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #733: 1025it [00:02, 353.88it/s, env_step=750592, len=8, n/ep=8, n/st=64, player_1/loss=3.809, player_2/loss=1.854, rew=0.00]


Epoch #733: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #734: 1025it [00:02, 352.79it/s, env_step=751616, len=9, n/ep=7, n/st=64, player_1/loss=2.875, player_2/loss=2.679, rew=0.00]


Epoch #734: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #735: 1025it [00:02, 350.02it/s, env_step=752640, len=8, n/ep=8, n/st=64, player_1/loss=1.984, player_2/loss=1.024, rew=0.00]


Epoch #735: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #736: 1025it [00:02, 349.86it/s, env_step=753664, len=9, n/ep=7, n/st=64, player_1/loss=0.959, player_2/loss=1.142, rew=0.00]


Epoch #736: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #737: 1025it [00:02, 353.37it/s, env_step=754688, len=8, n/ep=8, n/st=64, player_1/loss=2.302, player_2/loss=1.799, rew=0.00]


Epoch #737: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #738: 1025it [00:02, 351.34it/s, env_step=755712, len=9, n/ep=7, n/st=64, player_1/loss=1.633, player_2/loss=1.938, rew=0.00]


Epoch #738: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #739: 1025it [00:02, 354.22it/s, env_step=756736, len=9, n/ep=7, n/st=64, player_1/loss=2.055, player_2/loss=2.306, rew=0.00]


Epoch #739: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #740: 1025it [00:02, 354.27it/s, env_step=757760, len=9, n/ep=8, n/st=64, player_1/loss=1.284, player_2/loss=3.676, rew=0.00]


Epoch #740: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #741: 1025it [00:02, 351.39it/s, env_step=758784, len=8, n/ep=8, n/st=64, player_1/loss=0.408, player_2/loss=0.725, rew=0.00]


Epoch #741: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #742: 1025it [00:02, 352.37it/s, env_step=759808, len=8, n/ep=7, n/st=64, player_1/loss=1.798, player_2/loss=1.128, rew=0.00]


Epoch #742: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #743: 1025it [00:02, 354.45it/s, env_step=760832, len=8, n/ep=8, n/st=64, player_1/loss=1.023, player_2/loss=1.829, rew=0.00]


Epoch #743: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #744: 1025it [00:02, 352.97it/s, env_step=761856, len=8, n/ep=8, n/st=64, player_1/loss=2.068, player_2/loss=1.071, rew=0.00]


Epoch #744: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #745: 1025it [00:02, 353.61it/s, env_step=762880, len=10, n/ep=7, n/st=64, player_1/loss=1.493, player_2/loss=1.513, rew=0.00]


Epoch #745: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #746: 1025it [00:02, 353.10it/s, env_step=763904, len=8, n/ep=8, n/st=64, player_1/loss=1.025, player_2/loss=0.839, rew=0.00]


Epoch #746: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #747: 1025it [00:02, 354.25it/s, env_step=764928, len=10, n/ep=6, n/st=64, player_1/loss=1.377, player_2/loss=2.044, rew=0.00]


Epoch #747: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #748: 1025it [00:02, 353.03it/s, env_step=765952, len=7, n/ep=8, n/st=64, player_1/loss=2.797, player_2/loss=1.909, rew=0.00]


Epoch #748: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #749: 1025it [00:02, 346.29it/s, env_step=766976, len=10, n/ep=6, n/st=64, player_1/loss=1.574, player_2/loss=3.482, rew=0.00]


Epoch #749: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #750: 1025it [00:02, 353.95it/s, env_step=768000, len=9, n/ep=7, n/st=64, player_1/loss=3.381, player_2/loss=1.350, rew=0.00]


Epoch #750: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #751: 1025it [00:02, 350.47it/s, env_step=769024, len=8, n/ep=8, n/st=64, player_1/loss=1.192, player_2/loss=0.827, rew=0.00]


Epoch #751: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #752: 1025it [00:02, 352.76it/s, env_step=770048, len=17, n/ep=3, n/st=64, player_1/loss=3.869, player_2/loss=2.555, rew=0.00]


Epoch #752: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #753: 1025it [00:02, 351.30it/s, env_step=771072, len=8, n/ep=7, n/st=64, player_1/loss=3.176, player_2/loss=1.305, rew=0.00]


Epoch #753: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #754: 1025it [00:02, 352.32it/s, env_step=772096, len=8, n/ep=8, n/st=64, player_1/loss=2.891, player_2/loss=2.111, rew=0.00]


Epoch #754: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #755: 1025it [00:02, 353.76it/s, env_step=773120, len=8, n/ep=7, n/st=64, player_1/loss=1.347, player_2/loss=1.190, rew=0.00]


Epoch #755: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #756: 1025it [00:02, 352.42it/s, env_step=774144, len=9, n/ep=7, n/st=64, player_1/loss=1.778, player_2/loss=1.728, rew=0.00]


Epoch #756: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #757: 1025it [00:02, 352.88it/s, env_step=775168, len=9, n/ep=7, n/st=64, player_1/loss=2.871, player_2/loss=1.678, rew=0.00]


Epoch #757: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #758: 1025it [00:02, 352.37it/s, env_step=776192, len=8, n/ep=7, n/st=64, player_1/loss=1.735, player_2/loss=1.306, rew=0.00]


Epoch #758: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #759: 1025it [00:02, 351.44it/s, env_step=777216, len=8, n/ep=7, n/st=64, player_1/loss=1.165, player_2/loss=0.876, rew=0.00]


Epoch #759: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #760: 1025it [00:02, 353.48it/s, env_step=778240, len=7, n/ep=8, n/st=64, player_1/loss=2.139, player_2/loss=1.166, rew=0.00]


Epoch #760: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #761: 1025it [00:02, 351.61it/s, env_step=779264, len=9, n/ep=7, n/st=64, player_1/loss=2.642, player_2/loss=1.360, rew=0.00]


Epoch #761: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #762: 1025it [00:02, 351.08it/s, env_step=780288, len=8, n/ep=8, n/st=64, player_1/loss=2.872, player_2/loss=1.042, rew=0.00]


Epoch #762: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #763: 1025it [00:02, 352.47it/s, env_step=781312, len=12, n/ep=6, n/st=64, player_1/loss=3.913, player_2/loss=0.791, rew=0.00]


Epoch #763: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #764: 1025it [00:02, 351.69it/s, env_step=782336, len=8, n/ep=7, n/st=64, player_1/loss=1.333, player_2/loss=0.835, rew=0.00]


Epoch #764: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #765: 1025it [00:02, 354.21it/s, env_step=783360, len=8, n/ep=8, n/st=64, player_1/loss=3.672, player_2/loss=1.993, rew=0.00]


Epoch #765: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #766: 1025it [00:02, 353.23it/s, env_step=784384, len=8, n/ep=8, n/st=64, player_1/loss=1.809, player_2/loss=0.729, rew=0.00]


Epoch #766: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #767: 1025it [00:02, 353.06it/s, env_step=785408, len=13, n/ep=5, n/st=64, player_1/loss=1.364, player_2/loss=1.910, rew=0.00]


Epoch #767: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #768: 1025it [00:02, 353.65it/s, env_step=786432, len=8, n/ep=8, n/st=64, player_1/loss=8.420, player_2/loss=2.799, rew=0.00]


Epoch #768: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #769: 1025it [00:02, 350.72it/s, env_step=787456, len=10, n/ep=6, n/st=64, player_1/loss=6.276, player_2/loss=1.339, rew=0.00]


Epoch #769: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #770: 1025it [00:02, 351.00it/s, env_step=788480, len=7, n/ep=8, n/st=64, player_1/loss=5.986, player_2/loss=2.280, rew=0.00]


Epoch #770: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #771: 1025it [00:02, 351.91it/s, env_step=789504, len=10, n/ep=8, n/st=64, player_1/loss=2.305, player_2/loss=2.130, rew=0.00]


Epoch #771: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #772: 1025it [00:02, 353.38it/s, env_step=790528, len=11, n/ep=6, n/st=64, player_1/loss=1.419, player_2/loss=3.327, rew=0.00]


Epoch #772: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #773: 1025it [00:02, 353.32it/s, env_step=791552, len=11, n/ep=5, n/st=64, player_1/loss=2.437, player_2/loss=2.256, rew=0.00]


Epoch #773: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #774: 1025it [00:02, 353.56it/s, env_step=792576, len=12, n/ep=6, n/st=64, player_1/loss=1.643, player_2/loss=1.493, rew=0.00]


Epoch #774: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #775: 1025it [00:02, 354.10it/s, env_step=793600, len=9, n/ep=6, n/st=64, player_1/loss=0.879, player_2/loss=2.790, rew=0.00]


Epoch #775: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #776: 1025it [00:02, 349.63it/s, env_step=794624, len=10, n/ep=6, n/st=64, player_1/loss=0.585, player_2/loss=3.395, rew=0.00]


Epoch #776: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #777: 1025it [00:02, 350.80it/s, env_step=795648, len=11, n/ep=5, n/st=64, player_1/loss=4.163, player_2/loss=1.023, rew=0.00]


Epoch #777: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #778: 1025it [00:02, 352.81it/s, env_step=796672, len=9, n/ep=7, n/st=64, player_1/loss=2.704, player_2/loss=1.047, rew=0.00]


Epoch #778: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #779: 1025it [00:02, 352.62it/s, env_step=797696, len=8, n/ep=7, n/st=64, player_1/loss=2.203, player_2/loss=1.151, rew=0.00]


Epoch #779: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #780: 1025it [00:02, 353.63it/s, env_step=798720, len=8, n/ep=8, n/st=64, player_1/loss=1.612, player_2/loss=1.961, rew=0.00]


Epoch #780: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #781: 1025it [00:02, 353.49it/s, env_step=799744, len=9, n/ep=7, n/st=64, player_1/loss=2.455, player_2/loss=1.433, rew=0.00]


Epoch #781: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #782: 1025it [00:02, 352.52it/s, env_step=800768, len=9, n/ep=7, n/st=64, player_1/loss=1.057, player_2/loss=2.281, rew=0.00]


Epoch #782: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #783: 1025it [00:02, 355.42it/s, env_step=801792, len=8, n/ep=8, n/st=64, player_1/loss=0.852, player_2/loss=1.595, rew=0.00]


Epoch #783: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #784: 1025it [00:02, 351.92it/s, env_step=802816, len=8, n/ep=7, n/st=64, player_1/loss=1.871, player_2/loss=1.137, rew=0.00]


Epoch #784: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #785: 1025it [00:02, 352.70it/s, env_step=803840, len=8, n/ep=8, n/st=64, player_1/loss=1.546, player_2/loss=3.330, rew=0.00]


Epoch #785: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #786: 1025it [00:02, 352.55it/s, env_step=804864, len=8, n/ep=8, n/st=64, player_1/loss=0.616, player_2/loss=0.997, rew=0.00]


Epoch #786: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #787: 1025it [00:02, 353.27it/s, env_step=805888, len=8, n/ep=7, n/st=64, player_1/loss=1.485, player_2/loss=1.207, rew=0.00]


Epoch #787: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #788: 1025it [00:02, 353.23it/s, env_step=806912, len=8, n/ep=8, n/st=64, player_1/loss=2.035, player_2/loss=2.609, rew=0.00]


Epoch #788: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #789: 1025it [00:03, 340.55it/s, env_step=807936, len=10, n/ep=6, n/st=64, player_1/loss=1.591, player_2/loss=2.302, rew=0.00]


Epoch #789: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #790: 1025it [00:02, 351.33it/s, env_step=808960, len=7, n/ep=8, n/st=64, player_1/loss=0.910, player_2/loss=1.773, rew=0.00]


Epoch #790: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #791: 1025it [00:02, 353.19it/s, env_step=809984, len=13, n/ep=5, n/st=64, player_1/loss=2.612, player_2/loss=3.895, rew=0.00]


Epoch #791: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #792: 1025it [00:02, 352.40it/s, env_step=811008, len=8, n/ep=8, n/st=64, player_1/loss=3.025, player_2/loss=1.751, rew=0.00]


Epoch #792: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #793: 1025it [00:02, 352.50it/s, env_step=812032, len=11, n/ep=6, n/st=64, player_1/loss=1.264, player_2/loss=2.423, rew=0.00]


Epoch #793: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #794: 1025it [00:02, 352.71it/s, env_step=813056, len=8, n/ep=8, n/st=64, player_1/loss=1.884, player_2/loss=2.023, rew=0.00]


Epoch #794: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #795: 1025it [00:02, 354.34it/s, env_step=814080, len=15, n/ep=4, n/st=64, player_1/loss=1.928, player_2/loss=1.022, rew=0.00]


Epoch #795: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #796: 1025it [00:02, 353.88it/s, env_step=815104, len=11, n/ep=5, n/st=64, player_1/loss=2.194, player_2/loss=2.324, rew=0.00]


Epoch #796: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #797: 1025it [00:02, 354.77it/s, env_step=816128, len=11, n/ep=5, n/st=64, player_1/loss=5.940, player_2/loss=1.660, rew=0.00]


Epoch #797: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #798: 1025it [00:02, 351.65it/s, env_step=817152, len=14, n/ep=5, n/st=64, player_1/loss=7.428, player_2/loss=2.700, rew=0.00]


Epoch #798: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #799: 1025it [00:02, 354.36it/s, env_step=818176, len=9, n/ep=7, n/st=64, player_1/loss=4.524, player_2/loss=2.661, rew=0.00]


Epoch #799: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #800: 1025it [00:02, 353.92it/s, env_step=819200, len=8, n/ep=7, n/st=64, player_1/loss=1.840, player_2/loss=1.431, rew=0.00]


Epoch #800: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #801: 1025it [00:02, 349.66it/s, env_step=820224, len=8, n/ep=7, n/st=64, player_1/loss=2.015, player_2/loss=2.620, rew=0.00]


Epoch #801: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #802: 1025it [00:02, 352.27it/s, env_step=821248, len=8, n/ep=8, n/st=64, player_1/loss=1.012, player_2/loss=3.021, rew=0.00]


Epoch #802: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #803: 1025it [00:02, 353.63it/s, env_step=822272, len=7, n/ep=8, n/st=64, player_1/loss=1.633, player_2/loss=3.955, rew=0.00]


Epoch #803: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #804: 1025it [00:02, 354.97it/s, env_step=823296, len=9, n/ep=7, n/st=64, player_1/loss=1.538, player_2/loss=4.149, rew=0.00]


Epoch #804: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #805: 1025it [00:02, 353.75it/s, env_step=824320, len=10, n/ep=6, n/st=64, player_1/loss=2.749, player_2/loss=2.189, rew=0.00]


Epoch #805: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #806: 1025it [00:02, 353.79it/s, env_step=825344, len=17, n/ep=5, n/st=64, player_1/loss=2.436, player_2/loss=2.800, rew=0.00]


Epoch #806: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #807: 1025it [00:02, 351.55it/s, env_step=826368, len=10, n/ep=6, n/st=64, player_1/loss=2.267, player_2/loss=4.579, rew=0.00]


Epoch #807: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #808: 1025it [00:02, 353.58it/s, env_step=827392, len=8, n/ep=7, n/st=64, player_1/loss=3.617, player_2/loss=4.132, rew=0.00]


Epoch #808: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #809: 1025it [00:02, 352.42it/s, env_step=828416, len=9, n/ep=7, n/st=64, player_1/loss=3.146, player_2/loss=2.078, rew=0.00]


Epoch #809: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #810: 1025it [00:02, 351.93it/s, env_step=829440, len=9, n/ep=7, n/st=64, player_1/loss=0.943, player_2/loss=0.992, rew=0.00]


Epoch #810: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #811: 1025it [00:02, 353.19it/s, env_step=830464, len=7, n/ep=8, n/st=64, player_1/loss=1.955, player_2/loss=4.831, rew=0.00]


Epoch #811: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #812: 1025it [00:02, 351.33it/s, env_step=831488, len=9, n/ep=7, n/st=64, player_1/loss=3.202, player_2/loss=3.555, rew=0.00]


Epoch #812: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #813: 1025it [00:02, 354.20it/s, env_step=832512, len=18, n/ep=3, n/st=64, player_1/loss=3.570, player_2/loss=2.091, rew=0.00]


Epoch #813: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #814: 1025it [00:02, 352.29it/s, env_step=833536, len=8, n/ep=8, n/st=64, player_1/loss=8.294, player_2/loss=6.237, rew=0.00]


Epoch #814: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #815: 1025it [00:02, 353.68it/s, env_step=834560, len=9, n/ep=7, n/st=64, player_1/loss=1.916, player_2/loss=1.547, rew=0.00]


Epoch #815: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #816: 1025it [00:02, 353.22it/s, env_step=835584, len=8, n/ep=8, n/st=64, player_1/loss=5.565, player_2/loss=1.398, rew=0.00]


Epoch #816: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #817: 1025it [00:02, 349.15it/s, env_step=836608, len=7, n/ep=8, n/st=64, player_1/loss=4.068, player_2/loss=4.352, rew=0.00]


Epoch #817: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #818: 1025it [00:02, 351.40it/s, env_step=837632, len=8, n/ep=8, n/st=64, player_1/loss=2.415, player_2/loss=2.600, rew=0.00]


Epoch #818: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #819: 1025it [00:02, 354.53it/s, env_step=838656, len=8, n/ep=7, n/st=64, player_1/loss=3.332, player_2/loss=2.800, rew=0.00]


Epoch #819: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #820: 1025it [00:02, 350.84it/s, env_step=839680, len=8, n/ep=8, n/st=64, player_1/loss=3.943, player_2/loss=2.805, rew=0.00]


Epoch #820: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #821: 1025it [00:02, 354.05it/s, env_step=840704, len=8, n/ep=7, n/st=64, player_1/loss=2.200, player_2/loss=1.878, rew=0.00]


Epoch #821: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #822: 1025it [00:02, 351.57it/s, env_step=841728, len=8, n/ep=7, n/st=64, player_1/loss=0.809, player_2/loss=2.054, rew=0.00]


Epoch #822: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #823: 1025it [00:02, 351.21it/s, env_step=842752, len=8, n/ep=8, n/st=64, player_1/loss=0.630, player_2/loss=0.831, rew=0.00]


Epoch #823: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #824: 1025it [00:02, 352.27it/s, env_step=843776, len=8, n/ep=8, n/st=64, player_1/loss=1.041, player_2/loss=1.336, rew=0.00]


Epoch #824: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #825: 1025it [00:02, 352.84it/s, env_step=844800, len=8, n/ep=8, n/st=64, player_1/loss=0.903, player_2/loss=0.624, rew=0.00]


Epoch #825: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #826: 1025it [00:02, 352.91it/s, env_step=845824, len=8, n/ep=8, n/st=64, player_1/loss=0.528, player_2/loss=0.345, rew=0.00]


Epoch #826: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #827: 1025it [00:02, 350.97it/s, env_step=846848, len=8, n/ep=7, n/st=64, player_1/loss=2.214, player_2/loss=1.650, rew=0.00]


Epoch #827: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #828: 1025it [00:02, 353.22it/s, env_step=847872, len=8, n/ep=8, n/st=64, player_1/loss=1.523, player_2/loss=4.590, rew=0.00]


Epoch #828: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #829: 1025it [00:02, 352.18it/s, env_step=848896, len=8, n/ep=7, n/st=64, player_1/loss=3.191, player_2/loss=3.244, rew=0.00]


Epoch #829: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #830: 1025it [00:02, 353.32it/s, env_step=849920, len=8, n/ep=7, n/st=64, player_1/loss=0.894, player_2/loss=1.413, rew=0.00]


Epoch #830: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #831: 1025it [00:02, 353.21it/s, env_step=850944, len=8, n/ep=8, n/st=64, player_1/loss=1.943, player_2/loss=4.671, rew=0.00]


Epoch #831: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #832: 1025it [00:02, 354.11it/s, env_step=851968, len=9, n/ep=7, n/st=64, player_1/loss=0.824, player_2/loss=0.892, rew=0.00]


Epoch #832: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #833: 1025it [00:02, 354.38it/s, env_step=852992, len=9, n/ep=7, n/st=64, player_1/loss=1.108, player_2/loss=1.781, rew=0.00]


Epoch #833: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #834: 1025it [00:02, 352.93it/s, env_step=854016, len=9, n/ep=7, n/st=64, player_1/loss=1.179, player_2/loss=2.049, rew=0.00]


Epoch #834: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #835: 1025it [00:02, 353.14it/s, env_step=855040, len=10, n/ep=6, n/st=64, player_1/loss=0.890, player_2/loss=2.635, rew=0.00]


Epoch #835: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #836: 1025it [00:02, 353.55it/s, env_step=856064, len=9, n/ep=7, n/st=64, player_1/loss=0.596, player_2/loss=1.529, rew=0.00]


Epoch #836: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #837: 1025it [00:02, 353.07it/s, env_step=857088, len=11, n/ep=5, n/st=64, player_1/loss=2.235, player_2/loss=4.166, rew=0.00]


Epoch #837: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #838: 1025it [00:02, 354.22it/s, env_step=858112, len=8, n/ep=7, n/st=64, player_1/loss=2.093, player_2/loss=1.878, rew=0.00]


Epoch #838: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #839: 1025it [00:02, 352.71it/s, env_step=859136, len=15, n/ep=5, n/st=64, player_1/loss=1.831, player_2/loss=3.697, rew=0.00]


Epoch #839: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #840: 1025it [00:02, 353.54it/s, env_step=860160, len=14, n/ep=5, n/st=64, player_1/loss=7.684, player_2/loss=3.502, rew=0.00]


Epoch #840: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #841: 1025it [00:02, 353.37it/s, env_step=861184, len=11, n/ep=6, n/st=64, player_1/loss=5.712, player_2/loss=1.468, rew=0.00]


Epoch #841: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #842: 1025it [00:02, 354.28it/s, env_step=862208, len=8, n/ep=8, n/st=64, player_1/loss=6.957, player_2/loss=0.853, rew=0.00]


Epoch #842: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #843: 1025it [00:02, 350.33it/s, env_step=863232, len=9, n/ep=7, n/st=64, player_1/loss=6.500, player_2/loss=1.751, rew=0.00]


Epoch #843: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #844: 1025it [00:02, 352.88it/s, env_step=864256, len=10, n/ep=7, n/st=64, player_1/loss=5.670, player_2/loss=1.224, rew=0.00]


Epoch #844: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #845: 1025it [00:02, 354.23it/s, env_step=865280, len=11, n/ep=5, n/st=64, player_1/loss=5.271, player_2/loss=0.953, rew=0.00]


Epoch #845: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #846: 1025it [00:02, 353.95it/s, env_step=866304, len=9, n/ep=7, n/st=64, player_1/loss=6.977, player_2/loss=0.608, rew=0.00]


Epoch #846: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #847: 1025it [00:02, 352.80it/s, env_step=867328, len=8, n/ep=7, n/st=64, player_1/loss=4.901, player_2/loss=0.851, rew=0.00]


Epoch #847: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #848: 1025it [00:02, 352.50it/s, env_step=868352, len=9, n/ep=7, n/st=64, player_1/loss=4.950, player_2/loss=1.924, rew=0.00]


Epoch #848: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #849: 1025it [00:02, 354.09it/s, env_step=869376, len=10, n/ep=6, n/st=64, player_1/loss=2.475, player_2/loss=0.756, rew=0.00]


Epoch #849: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #850: 1025it [00:02, 353.31it/s, env_step=870400, len=8, n/ep=8, n/st=64, player_1/loss=2.961, player_2/loss=2.120, rew=0.00]


Epoch #850: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #851: 1025it [00:02, 350.82it/s, env_step=871424, len=9, n/ep=6, n/st=64, player_1/loss=4.046, player_2/loss=0.652, rew=0.00]


Epoch #851: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #852: 1025it [00:02, 352.42it/s, env_step=872448, len=9, n/ep=7, n/st=64, player_1/loss=5.434, player_2/loss=1.754, rew=0.00]


Epoch #852: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #853: 1025it [00:02, 352.87it/s, env_step=873472, len=9, n/ep=7, n/st=64, player_1/loss=3.930, player_2/loss=1.096, rew=0.00]


Epoch #853: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #854: 1025it [00:02, 352.14it/s, env_step=874496, len=9, n/ep=6, n/st=64, player_1/loss=3.456, player_2/loss=0.272, rew=0.00]


Epoch #854: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #855: 1025it [00:02, 355.40it/s, env_step=875520, len=9, n/ep=7, n/st=64, player_1/loss=3.690, player_2/loss=1.131, rew=0.00]


Epoch #855: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #856: 1025it [00:02, 352.89it/s, env_step=876544, len=9, n/ep=7, n/st=64, player_1/loss=2.035, player_2/loss=0.717, rew=0.00]


Epoch #856: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #857: 1025it [00:02, 352.99it/s, env_step=877568, len=8, n/ep=7, n/st=64, player_1/loss=1.996, player_2/loss=0.618, rew=0.00]


Epoch #857: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #858: 1025it [00:02, 349.48it/s, env_step=878592, len=8, n/ep=7, n/st=64, player_1/loss=1.209, player_2/loss=1.295, rew=0.00]


Epoch #858: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #859: 1025it [00:02, 356.24it/s, env_step=879616, len=9, n/ep=6, n/st=64, player_1/loss=1.629, player_2/loss=1.221, rew=0.00]


Epoch #859: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #860: 1025it [00:02, 355.80it/s, env_step=880640, len=10, n/ep=6, n/st=64, player_1/loss=1.345, player_2/loss=1.191, rew=0.00]


Epoch #860: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #861: 1025it [00:02, 352.53it/s, env_step=881664, len=9, n/ep=7, n/st=64, player_1/loss=1.349, player_2/loss=1.566, rew=0.00]


Epoch #861: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #862: 1025it [00:02, 354.09it/s, env_step=882688, len=9, n/ep=7, n/st=64, player_1/loss=2.224, player_2/loss=1.666, rew=0.00]


Epoch #862: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #863: 1025it [00:02, 353.49it/s, env_step=883712, len=9, n/ep=7, n/st=64, player_1/loss=2.139, player_2/loss=1.164, rew=0.00]


Epoch #863: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #864: 1025it [00:02, 353.98it/s, env_step=884736, len=8, n/ep=8, n/st=64, player_1/loss=1.199, player_2/loss=1.605, rew=0.00]


Epoch #864: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #865: 1025it [00:02, 353.20it/s, env_step=885760, len=9, n/ep=7, n/st=64, player_1/loss=1.295, player_2/loss=1.083, rew=0.00]


Epoch #865: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #866: 1025it [00:02, 353.25it/s, env_step=886784, len=8, n/ep=7, n/st=64, player_1/loss=1.499, player_2/loss=2.040, rew=0.00]


Epoch #866: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #867: 1025it [00:02, 353.35it/s, env_step=887808, len=12, n/ep=5, n/st=64, player_1/loss=2.117, player_2/loss=4.233, rew=0.00]


Epoch #867: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #868: 1025it [00:02, 354.06it/s, env_step=888832, len=9, n/ep=7, n/st=64, player_1/loss=1.995, player_2/loss=2.198, rew=0.00]


Epoch #868: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #869: 1025it [00:02, 354.53it/s, env_step=889856, len=10, n/ep=7, n/st=64, player_1/loss=2.876, player_2/loss=0.779, rew=0.00]


Epoch #869: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #870: 1025it [00:02, 351.79it/s, env_step=890880, len=9, n/ep=6, n/st=64, player_1/loss=1.513, player_2/loss=3.171, rew=0.00]


Epoch #870: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #871: 1025it [00:02, 352.68it/s, env_step=891904, len=15, n/ep=4, n/st=64, player_1/loss=4.076, player_2/loss=2.552, rew=0.00]


Epoch #871: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #872: 1025it [00:02, 351.21it/s, env_step=892928, len=9, n/ep=6, n/st=64, player_1/loss=2.429, player_2/loss=3.972, rew=0.00]


Epoch #872: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #873: 1025it [00:02, 353.84it/s, env_step=893952, len=9, n/ep=6, n/st=64, player_1/loss=4.432, player_2/loss=4.362, rew=0.00]


Epoch #873: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #874: 1025it [00:02, 352.11it/s, env_step=894976, len=9, n/ep=6, n/st=64, player_1/loss=3.727, player_2/loss=2.713, rew=0.00]


Epoch #874: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #875: 1025it [00:02, 352.22it/s, env_step=896000, len=10, n/ep=7, n/st=64, player_1/loss=3.536, player_2/loss=2.183, rew=0.00]


Epoch #875: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #876: 1025it [00:02, 352.34it/s, env_step=897024, len=8, n/ep=7, n/st=64, player_1/loss=5.196, player_2/loss=2.097, rew=0.00]


Epoch #876: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #877: 1025it [00:02, 353.54it/s, env_step=898048, len=10, n/ep=6, n/st=64, player_1/loss=1.679, player_2/loss=0.992, rew=0.00]


Epoch #877: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #878: 1025it [00:02, 353.52it/s, env_step=899072, len=10, n/ep=7, n/st=64, player_1/loss=3.707, player_2/loss=2.802, rew=0.00]


Epoch #878: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #879: 1025it [00:02, 353.81it/s, env_step=900096, len=9, n/ep=7, n/st=64, player_1/loss=0.462, player_2/loss=1.564, rew=0.00]


Epoch #879: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #880: 1025it [00:02, 351.76it/s, env_step=901120, len=9, n/ep=6, n/st=64, player_1/loss=2.926, player_2/loss=2.666, rew=0.00]


Epoch #880: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #881: 1025it [00:02, 353.60it/s, env_step=902144, len=9, n/ep=6, n/st=64, player_1/loss=1.908, player_2/loss=0.699, rew=0.00]


Epoch #881: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #882: 1025it [00:02, 352.47it/s, env_step=903168, len=10, n/ep=6, n/st=64, player_1/loss=0.626, player_2/loss=1.100, rew=0.00]


Epoch #882: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #883: 1025it [00:02, 353.51it/s, env_step=904192, len=9, n/ep=7, n/st=64, player_1/loss=2.512, player_2/loss=1.891, rew=0.00]


Epoch #883: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #884: 1025it [00:02, 352.54it/s, env_step=905216, len=9, n/ep=8, n/st=64, player_1/loss=3.154, player_2/loss=2.281, rew=0.00]


Epoch #884: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #885: 1025it [00:02, 355.63it/s, env_step=906240, len=9, n/ep=7, n/st=64, player_1/loss=0.629, player_2/loss=1.106, rew=0.00]


Epoch #885: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #886: 1025it [00:02, 353.21it/s, env_step=907264, len=9, n/ep=7, n/st=64, player_1/loss=1.568, player_2/loss=3.198, rew=0.00]


Epoch #886: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #887: 1025it [00:02, 352.01it/s, env_step=908288, len=8, n/ep=8, n/st=64, player_1/loss=0.846, player_2/loss=2.756, rew=0.00]


Epoch #887: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #888: 1025it [00:02, 353.69it/s, env_step=909312, len=9, n/ep=6, n/st=64, player_1/loss=1.542, player_2/loss=1.415, rew=0.00]


Epoch #888: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #889: 1025it [00:02, 353.60it/s, env_step=910336, len=11, n/ep=5, n/st=64, player_1/loss=1.082, player_2/loss=0.926, rew=0.00]


Epoch #889: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #890: 1025it [00:02, 353.25it/s, env_step=911360, len=9, n/ep=7, n/st=64, player_1/loss=0.882, player_2/loss=1.460, rew=0.00]


Epoch #890: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #891: 1025it [00:02, 352.69it/s, env_step=912384, len=8, n/ep=7, n/st=64, player_1/loss=0.745, player_2/loss=1.308, rew=0.00]


Epoch #891: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #892: 1025it [00:02, 352.50it/s, env_step=913408, len=10, n/ep=6, n/st=64, player_1/loss=1.183, player_2/loss=1.892, rew=0.00]


Epoch #892: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #893: 1025it [00:02, 356.38it/s, env_step=914432, len=10, n/ep=6, n/st=64, player_1/loss=1.059, player_2/loss=2.295, rew=0.00]


Epoch #893: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #894: 1025it [00:02, 352.11it/s, env_step=915456, len=10, n/ep=6, n/st=64, player_1/loss=1.446, player_2/loss=1.562, rew=0.00]


Epoch #894: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #895: 1025it [00:02, 354.04it/s, env_step=916480, len=12, n/ep=5, n/st=64, player_1/loss=0.989, player_2/loss=1.424, rew=0.00]


Epoch #895: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #896: 1025it [00:02, 354.69it/s, env_step=917504, len=10, n/ep=7, n/st=64, player_1/loss=2.224, player_2/loss=2.397, rew=0.00]


Epoch #896: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #897: 1025it [00:02, 352.63it/s, env_step=918528, len=11, n/ep=7, n/st=64, player_1/loss=1.965, player_2/loss=0.860, rew=0.00]


Epoch #897: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #898: 1025it [00:02, 353.32it/s, env_step=919552, len=12, n/ep=5, n/st=64, player_1/loss=6.925, player_2/loss=0.700, rew=0.00]


Epoch #898: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #899: 1025it [00:02, 348.77it/s, env_step=920576, len=10, n/ep=7, n/st=64, player_1/loss=7.284, player_2/loss=0.477, rew=0.00]


Epoch #899: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #900: 1025it [00:02, 349.54it/s, env_step=921600, len=10, n/ep=6, n/st=64, player_1/loss=3.702, player_2/loss=0.730, rew=0.00]


Epoch #900: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #901: 1025it [00:02, 351.64it/s, env_step=922624, len=8, n/ep=7, n/st=64, player_1/loss=1.884, player_2/loss=0.915, rew=0.00]


Epoch #901: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #902: 1025it [00:02, 353.48it/s, env_step=923648, len=9, n/ep=8, n/st=64, player_1/loss=1.819, player_2/loss=3.202, rew=0.00]


Epoch #902: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #903: 1025it [00:02, 351.07it/s, env_step=924672, len=9, n/ep=6, n/st=64, player_1/loss=0.984, player_2/loss=1.574, rew=0.00]


Epoch #903: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #904: 1025it [00:02, 352.74it/s, env_step=925696, len=10, n/ep=6, n/st=64, player_1/loss=1.689, player_2/loss=1.660, rew=0.00]


Epoch #904: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #905: 1025it [00:02, 354.12it/s, env_step=926720, len=7, n/ep=8, n/st=64, player_1/loss=1.727, player_2/loss=3.360, rew=0.00]


Epoch #905: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #906: 1025it [00:02, 352.37it/s, env_step=927744, len=8, n/ep=8, n/st=64, player_1/loss=1.216, player_2/loss=0.936, rew=0.00]


Epoch #906: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #907: 1025it [00:02, 354.28it/s, env_step=928768, len=7, n/ep=9, n/st=64, player_1/loss=1.336, player_2/loss=1.520, rew=0.00]


Epoch #907: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #908: 1025it [00:02, 353.42it/s, env_step=929792, len=9, n/ep=7, n/st=64, player_1/loss=1.365, player_2/loss=1.193, rew=0.00]


Epoch #908: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #909: 1025it [00:02, 353.64it/s, env_step=930816, len=10, n/ep=6, n/st=64, player_1/loss=1.610, player_2/loss=1.049, rew=0.00]


Epoch #909: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #910: 1025it [00:02, 354.17it/s, env_step=931840, len=8, n/ep=9, n/st=64, player_1/loss=1.805, player_2/loss=1.276, rew=0.00]


Epoch #910: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #911: 1025it [00:02, 352.76it/s, env_step=932864, len=10, n/ep=5, n/st=64, player_1/loss=1.433, player_2/loss=2.708, rew=0.00]


Epoch #911: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #912: 1025it [00:02, 354.19it/s, env_step=933888, len=8, n/ep=8, n/st=64, player_1/loss=1.686, player_2/loss=1.911, rew=0.00]


Epoch #912: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #913: 1025it [00:02, 352.93it/s, env_step=934912, len=11, n/ep=6, n/st=64, player_1/loss=2.095, player_2/loss=2.701, rew=0.00]


Epoch #913: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #914: 1025it [00:02, 354.06it/s, env_step=935936, len=16, n/ep=4, n/st=64, player_1/loss=3.957, player_2/loss=0.949, rew=0.00]


Epoch #914: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #915: 1025it [00:02, 355.40it/s, env_step=936960, len=11, n/ep=6, n/st=64, player_1/loss=3.646, player_2/loss=1.128, rew=0.00]


Epoch #915: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #916: 1025it [00:02, 353.23it/s, env_step=937984, len=9, n/ep=7, n/st=64, player_1/loss=2.307, player_2/loss=3.733, rew=0.00]


Epoch #916: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #917: 1025it [00:02, 352.55it/s, env_step=939008, len=9, n/ep=7, n/st=64, player_1/loss=1.430, player_2/loss=4.144, rew=0.00]


Epoch #917: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #918: 1025it [00:02, 354.65it/s, env_step=940032, len=11, n/ep=6, n/st=64, player_1/loss=1.076, player_2/loss=3.096, rew=0.00]


Epoch #918: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #919: 1025it [00:02, 354.73it/s, env_step=941056, len=8, n/ep=8, n/st=64, player_1/loss=2.485, player_2/loss=1.645, rew=0.00]


Epoch #919: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #920: 1025it [00:02, 353.41it/s, env_step=942080, len=9, n/ep=7, n/st=64, player_1/loss=1.484, player_2/loss=1.493, rew=0.00]


Epoch #920: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #921: 1025it [00:02, 352.15it/s, env_step=943104, len=9, n/ep=7, n/st=64, player_1/loss=1.594, player_2/loss=3.723, rew=0.00]


Epoch #921: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #922: 1025it [00:02, 351.78it/s, env_step=944128, len=8, n/ep=7, n/st=64, player_1/loss=0.909, player_2/loss=0.835, rew=0.00]


Epoch #922: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #923: 1025it [00:02, 352.30it/s, env_step=945152, len=9, n/ep=7, n/st=64, player_1/loss=1.038, player_2/loss=1.823, rew=0.00]


Epoch #923: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #924: 1025it [00:02, 353.96it/s, env_step=946176, len=10, n/ep=7, n/st=64, player_1/loss=1.189, player_2/loss=2.860, rew=0.00]


Epoch #924: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #925: 1025it [00:02, 354.79it/s, env_step=947200, len=8, n/ep=7, n/st=64, player_1/loss=1.557, player_2/loss=1.967, rew=0.00]


Epoch #925: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #926: 1025it [00:02, 391.79it/s, env_step=948224, len=8, n/ep=7, n/st=64, player_1/loss=1.513, player_2/loss=3.417, rew=0.00]


Epoch #926: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #927: 1025it [00:02, 390.95it/s, env_step=949248, len=8, n/ep=7, n/st=64, player_1/loss=1.641, player_2/loss=3.043, rew=0.00]


Epoch #927: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #928: 1025it [00:02, 357.82it/s, env_step=950272, len=9, n/ep=7, n/st=64, player_1/loss=2.444, player_2/loss=2.491, rew=0.00]


Epoch #928: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #929: 1025it [00:02, 367.87it/s, env_step=951296, len=10, n/ep=6, n/st=64, player_1/loss=1.236, player_2/loss=2.358, rew=0.00]


Epoch #929: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #930: 1025it [00:02, 366.05it/s, env_step=952320, len=10, n/ep=7, n/st=64, player_1/loss=1.716, player_2/loss=1.434, rew=0.00]


Epoch #930: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #931: 1025it [00:02, 362.25it/s, env_step=953344, len=10, n/ep=6, n/st=64, player_1/loss=1.489, player_2/loss=0.757, rew=0.00]


Epoch #931: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #932: 1025it [00:02, 363.49it/s, env_step=954368, len=8, n/ep=7, n/st=64, player_1/loss=1.781, player_2/loss=2.493, rew=0.00]


Epoch #932: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #933: 1025it [00:02, 363.05it/s, env_step=955392, len=11, n/ep=5, n/st=64, player_1/loss=3.415, player_2/loss=2.950, rew=0.00]


Epoch #933: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #934: 1025it [00:02, 369.85it/s, env_step=956416, len=10, n/ep=6, n/st=64, player_1/loss=8.425, player_2/loss=3.201, rew=0.00]


Epoch #934: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #935: 1025it [00:02, 375.37it/s, env_step=957440, len=11, n/ep=6, n/st=64, player_1/loss=8.419, player_2/loss=3.141, rew=0.00]


Epoch #935: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #936: 1025it [00:02, 369.35it/s, env_step=958464, len=10, n/ep=6, n/st=64, player_1/loss=9.901, player_2/loss=1.600, rew=0.00]


Epoch #936: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #937: 1025it [00:02, 369.42it/s, env_step=959488, len=10, n/ep=6, n/st=64, player_1/loss=6.933, player_2/loss=1.151, rew=0.00]


Epoch #937: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #938: 1025it [00:02, 367.73it/s, env_step=960512, len=9, n/ep=6, n/st=64, player_1/loss=4.103, player_2/loss=1.224, rew=0.00]


Epoch #938: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #939: 1025it [00:02, 390.89it/s, env_step=961536, len=11, n/ep=6, n/st=64, player_1/loss=3.026, player_2/loss=1.140, rew=0.00]


Epoch #939: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #940: 1025it [00:02, 380.03it/s, env_step=962560, len=11, n/ep=5, n/st=64, player_1/loss=5.619, player_2/loss=0.348, rew=0.00]


Epoch #940: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #941: 1025it [00:02, 359.60it/s, env_step=963584, len=8, n/ep=7, n/st=64, player_1/loss=5.018, player_2/loss=3.516, rew=0.00]


Epoch #941: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #942: 1025it [00:02, 357.82it/s, env_step=964608, len=9, n/ep=7, n/st=64, player_1/loss=2.900, player_2/loss=3.929, rew=0.00]


Epoch #942: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #943: 1025it [00:02, 354.81it/s, env_step=965632, len=10, n/ep=6, n/st=64, player_1/loss=4.767, player_2/loss=4.526, rew=0.00]


Epoch #943: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #944: 1025it [00:02, 352.91it/s, env_step=966656, len=9, n/ep=7, n/st=64, player_1/loss=2.599, player_2/loss=1.630, rew=0.00]


Epoch #944: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #945: 1025it [00:02, 357.29it/s, env_step=967680, len=13, n/ep=5, n/st=64, player_1/loss=0.709, player_2/loss=2.620, rew=0.00]


Epoch #945: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #946: 1025it [00:02, 353.42it/s, env_step=968704, len=10, n/ep=6, n/st=64, player_1/loss=2.995, player_2/loss=1.272, rew=0.00]


Epoch #946: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #947: 1025it [00:02, 353.98it/s, env_step=969728, len=10, n/ep=6, n/st=64, player_1/loss=3.562, player_2/loss=0.631, rew=0.00]


Epoch #947: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #948: 1025it [00:02, 353.01it/s, env_step=970752, len=11, n/ep=6, n/st=64, player_1/loss=2.226, player_2/loss=0.467, rew=0.00]


Epoch #948: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #949: 1025it [00:02, 353.22it/s, env_step=971776, len=10, n/ep=6, n/st=64, player_1/loss=1.286, player_2/loss=0.908, rew=0.00]


Epoch #949: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #950: 1025it [00:02, 352.05it/s, env_step=972800, len=10, n/ep=6, n/st=64, player_1/loss=1.851, player_2/loss=2.850, rew=0.00]


Epoch #950: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #951: 1025it [00:02, 353.23it/s, env_step=973824, len=13, n/ep=5, n/st=64, player_1/loss=2.359, player_2/loss=0.763, rew=0.00]


Epoch #951: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #952: 1025it [00:02, 351.60it/s, env_step=974848, len=10, n/ep=6, n/st=64, player_1/loss=3.952, player_2/loss=1.637, rew=0.00]


Epoch #952: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #953: 1025it [00:02, 354.68it/s, env_step=975872, len=10, n/ep=6, n/st=64, player_1/loss=3.328, player_2/loss=1.117, rew=0.00]


Epoch #953: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #954: 1025it [00:02, 349.64it/s, env_step=976896, len=9, n/ep=7, n/st=64, player_1/loss=2.076, player_2/loss=1.304, rew=0.00]


Epoch #954: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #955: 1025it [00:02, 352.96it/s, env_step=977920, len=8, n/ep=8, n/st=64, player_1/loss=2.352, player_2/loss=3.195, rew=0.00]


Epoch #955: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #956: 1025it [00:02, 352.52it/s, env_step=978944, len=9, n/ep=6, n/st=64, player_1/loss=4.015, player_2/loss=4.601, rew=0.00]


Epoch #956: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #957: 1025it [00:02, 353.81it/s, env_step=979968, len=8, n/ep=7, n/st=64, player_1/loss=0.888, player_2/loss=5.466, rew=0.00]


Epoch #957: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #958: 1025it [00:02, 353.53it/s, env_step=980992, len=9, n/ep=7, n/st=64, player_1/loss=2.794, player_2/loss=1.615, rew=0.00]


Epoch #958: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #959: 1025it [00:02, 351.59it/s, env_step=982016, len=8, n/ep=8, n/st=64, player_1/loss=1.568, player_2/loss=1.819, rew=0.00]


Epoch #959: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #960: 1025it [00:02, 351.58it/s, env_step=983040, len=8, n/ep=7, n/st=64, player_1/loss=2.132, player_2/loss=2.232, rew=0.00]


Epoch #960: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #961: 1025it [00:02, 354.03it/s, env_step=984064, len=8, n/ep=7, n/st=64, player_1/loss=1.285, player_2/loss=0.687, rew=0.00]


Epoch #961: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #962: 1025it [00:02, 354.30it/s, env_step=985088, len=8, n/ep=7, n/st=64, player_1/loss=1.246, player_2/loss=3.022, rew=0.00]


Epoch #962: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #963: 1025it [00:02, 354.35it/s, env_step=986112, len=10, n/ep=6, n/st=64, player_1/loss=3.109, player_2/loss=1.939, rew=0.00]


Epoch #963: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #964: 1025it [00:02, 351.16it/s, env_step=987136, len=9, n/ep=7, n/st=64, player_1/loss=1.087, player_2/loss=5.345, rew=0.00]


Epoch #964: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #965: 1025it [00:02, 353.43it/s, env_step=988160, len=11, n/ep=6, n/st=64, player_1/loss=1.626, player_2/loss=4.137, rew=0.00]


Epoch #965: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #966: 1025it [00:02, 350.79it/s, env_step=989184, len=9, n/ep=7, n/st=64, player_1/loss=1.606, player_2/loss=1.951, rew=0.00]


Epoch #966: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #967: 1025it [00:02, 354.17it/s, env_step=990208, len=9, n/ep=7, n/st=64, player_1/loss=1.162, player_2/loss=1.616, rew=0.00]


Epoch #967: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #968: 1025it [00:02, 350.86it/s, env_step=991232, len=8, n/ep=8, n/st=64, player_1/loss=0.925, player_2/loss=1.688, rew=0.00]


Epoch #968: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #969: 1025it [00:02, 352.95it/s, env_step=992256, len=10, n/ep=6, n/st=64, player_1/loss=0.969, player_2/loss=3.830, rew=0.00]


Epoch #969: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #970: 1025it [00:02, 353.38it/s, env_step=993280, len=9, n/ep=7, n/st=64, player_1/loss=2.035, player_2/loss=2.657, rew=0.00]


Epoch #970: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #971: 1025it [00:02, 354.39it/s, env_step=994304, len=9, n/ep=7, n/st=64, player_1/loss=2.390, player_2/loss=1.752, rew=0.00]


Epoch #971: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #972: 1025it [00:02, 352.35it/s, env_step=995328, len=10, n/ep=7, n/st=64, player_1/loss=2.051, player_2/loss=1.170, rew=0.00]


Epoch #972: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #973: 1025it [00:02, 352.38it/s, env_step=996352, len=10, n/ep=6, n/st=64, player_1/loss=1.878, player_2/loss=2.206, rew=0.00]


Epoch #973: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #974: 1025it [00:02, 354.47it/s, env_step=997376, len=8, n/ep=7, n/st=64, player_1/loss=0.682, player_2/loss=3.135, rew=0.00]


Epoch #974: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #975: 1025it [00:02, 351.42it/s, env_step=998400, len=9, n/ep=6, n/st=64, player_1/loss=1.201, player_2/loss=0.856, rew=0.00]


Epoch #975: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #976: 1025it [00:02, 350.86it/s, env_step=999424, len=9, n/ep=7, n/st=64, player_1/loss=0.473, player_2/loss=4.410, rew=0.00]


Epoch #976: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #977: 1025it [00:02, 353.07it/s, env_step=1000448, len=8, n/ep=7, n/st=64, player_1/loss=1.686, player_2/loss=2.743, rew=0.00]


Epoch #977: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #978: 1025it [00:02, 353.35it/s, env_step=1001472, len=10, n/ep=6, n/st=64, player_1/loss=1.551, player_2/loss=1.710, rew=0.00]


Epoch #978: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #979: 1025it [00:02, 354.44it/s, env_step=1002496, len=8, n/ep=9, n/st=64, player_1/loss=3.517, player_2/loss=2.124, rew=0.00]


Epoch #979: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #980: 1025it [00:02, 351.56it/s, env_step=1003520, len=8, n/ep=7, n/st=64, player_1/loss=3.504, player_2/loss=1.614, rew=0.00]


Epoch #980: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #981: 1025it [00:02, 353.74it/s, env_step=1004544, len=9, n/ep=7, n/st=64, player_1/loss=1.913, player_2/loss=1.105, rew=0.00]


Epoch #981: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #982: 1025it [00:02, 349.96it/s, env_step=1005568, len=10, n/ep=7, n/st=64, player_1/loss=2.270, player_2/loss=1.531, rew=0.00]


Epoch #982: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #983: 1025it [00:02, 348.85it/s, env_step=1006592, len=8, n/ep=7, n/st=64, player_1/loss=2.425, player_2/loss=1.193, rew=0.00]


Epoch #983: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #984: 1025it [00:02, 352.51it/s, env_step=1007616, len=8, n/ep=8, n/st=64, player_1/loss=1.176, player_2/loss=1.869, rew=0.00]


Epoch #984: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #985: 1025it [00:02, 354.12it/s, env_step=1008640, len=8, n/ep=7, n/st=64, player_1/loss=3.489, player_2/loss=2.006, rew=0.00]


Epoch #985: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #986: 1025it [00:02, 353.69it/s, env_step=1009664, len=11, n/ep=5, n/st=64, player_1/loss=5.887, player_2/loss=2.499, rew=0.00]


Epoch #986: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #987: 1025it [00:02, 353.75it/s, env_step=1010688, len=8, n/ep=8, n/st=64, player_1/loss=5.846, player_2/loss=1.451, rew=0.00]


Epoch #987: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #988: 1025it [00:02, 351.94it/s, env_step=1011712, len=8, n/ep=7, n/st=64, player_1/loss=1.296, player_2/loss=5.382, rew=0.00]


Epoch #988: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #989: 1025it [00:02, 352.04it/s, env_step=1012736, len=9, n/ep=7, n/st=64, player_1/loss=1.098, player_2/loss=3.609, rew=0.00]


Epoch #989: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #990: 1025it [00:02, 352.72it/s, env_step=1013760, len=9, n/ep=7, n/st=64, player_1/loss=1.672, player_2/loss=1.799, rew=0.00]


Epoch #990: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #991: 1025it [00:02, 353.07it/s, env_step=1014784, len=12, n/ep=5, n/st=64, player_1/loss=1.775, player_2/loss=6.103, rew=0.00]


Epoch #991: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #992: 1025it [00:02, 353.43it/s, env_step=1015808, len=10, n/ep=7, n/st=64, player_1/loss=3.692, player_2/loss=4.204, rew=0.00]


Epoch #992: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #993: 1025it [00:02, 352.51it/s, env_step=1016832, len=9, n/ep=6, n/st=64, player_1/loss=4.025, player_2/loss=0.726, rew=0.00]


Epoch #993: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #994: 1025it [00:02, 354.08it/s, env_step=1017856, len=10, n/ep=6, n/st=64, player_1/loss=2.825, player_2/loss=1.885, rew=0.00]


Epoch #994: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #995: 1025it [00:02, 352.11it/s, env_step=1018880, len=9, n/ep=7, n/st=64, player_1/loss=4.085, player_2/loss=1.087, rew=0.00]


Epoch #995: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #996: 1025it [00:02, 352.92it/s, env_step=1019904, len=8, n/ep=8, n/st=64, player_1/loss=2.497, player_2/loss=1.113, rew=0.00]


Epoch #996: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #997: 1025it [00:02, 352.48it/s, env_step=1020928, len=8, n/ep=7, n/st=64, player_1/loss=1.606, player_2/loss=1.592, rew=0.00]


Epoch #997: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #998: 1025it [00:02, 354.00it/s, env_step=1021952, len=10, n/ep=7, n/st=64, player_1/loss=1.961, player_2/loss=1.705, rew=0.00]


Epoch #998: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0


Epoch #999: 1025it [00:02, 351.38it/s, env_step=1022976, len=7, n/ep=7, n/st=64, player_1/loss=2.653, player_2/loss=3.550, rew=0.00]

Epoch #999: test_reward: 0.000000 ± 0.000000, best_reward: 0.000000 ± 0.000000 in #0





In [11]:
####################################################
# EXPERIMENT: VIEWING THE BEST LEARNED POLICY
####################################################

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure the best agent
best_agent1 = cf_dqn_policy(state_shape= state_shape,
                            action_shape= action_shape)
best_agent1.load_state_dict(torch.load("./saved_variables/paper_notebooks/5/dqn_vs_dqn_no_move_reward/best_policy_agent1.pth"))


best_agent2 = cf_dqn_policy(state_shape= state_shape,
                            action_shape= action_shape)
best_agent2.load_state_dict(torch.load("./saved_variables/paper_notebooks/5/dqn_vs_dqn_no_move_reward/best_policy_agent2.pth"))

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= best_agent1,
      agent_player2= best_agent2)



Average steps of game:  8.333333333333334
Final mean reward agent 1: 25.0, std: 0.0
Final mean reward agent 2: -25.0, std: 0.0


In [12]:
####################################################
# EXPERIMENT: VIEWING THE LAST LEARNED POLICY
####################################################

# Configure the final agent
final_agent_player1 = cf_dqn_policy(state_shape= state_shape,
                            action_shape= action_shape)
final_agent_player1.load_state_dict(torch.load("./saved_variables/paper_notebooks/5/dqn_vs_dqn_no_move_reward/final_policy_agent1.pth"))


final_agent_player2 = cf_dqn_policy(state_shape= state_shape,
                            action_shape= action_shape)
final_agent_player2.load_state_dict(torch.load("./saved_variables/paper_notebooks/5/dqn_vs_dqn_no_move_reward/final_policy_agent2.pth"))

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= final_agent_player1,
      agent_player2= final_agent_player2)



Average steps of game:  7.666666666666667
Final mean reward agent 1: 25.0, std: 0.0
Final mean reward agent 2: -25.0, std: 0.0


<hr><hr>

## Discussion

Using the custom model and an environment with a reward for making moves, the DQN agents seem to have converged to a policy where they try to play for maximum amount of iterations. This results in a slightly more human way of playing, although clear winning opertunities are sometimes still left open in favour of a longer game, which is not the wanted behaviour. The agent playing as player two exhibits the most favourable behaviour.

When not using a reward for taking moves, the policy still mainly converges to stacking coins.

Whilst the policy is thus the best so far when using a reward for making moves, it is still far from ideal. We will have to look further if we want to create a trully meaningful bot for connect four. We also wonder if the learned policy is a reaction on the observed board or a replay from memory of what worked. This is the difference between a connect four bot working against any component vs a specific one. To test this, the next notebook will focus on creating a loop in such a way that we can play against a bot.

In [13]:
####################################################
# CLEAN VARIABLES
####################################################

del action_shape
del agent1
del agent2
del best_agent1
del best_agent2
del env
del final_agent_player1
del final_agent_player2
del observation_space
del off_policy_traininer_results
del state_shape
