# MLP based DQN agent against fixed oponent

In the previous notebook, `7-cnn-dqn-fixed-oponent.ipynb`, we used the CNN based model for training through an iteration of alternating frozen agents.
We found this to give interesting but not fully statisfactory results.
We will now use the same technique for the custom MLP based approach designed in `5-improving-dqn-architecture.ipynb` to properly compare both architectures performance for the agents.

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training two DQN agents on connect four Gym
  - Building the environment
  - Implementing the DQN policy
  - Building agents
  - Function for letting agents learn
  - Function for watching learned agent
  - Doing the experiment
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Argparser
import argparse

# More data types
import typing
import numpy as np

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
importlib.invalidate_caches()
importlib.reload(cfgym)

# Time for allowing "freezes" in execution
import time;

# Allow for copying objects in a non reference manner
import copy

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0
Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [4]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device
print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

# Show cuda device name
print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training two DQN agents on connect four Gym

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, agents will always play as the same player, e.g. always as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Building the environment

This code is taken from previous notebooks.
We don't allow invalid moves to make the problem easier for now.

In [5]:
####################################################
# CONNECT FOUR V2 ENVIRONMENT
####################################################

def get_env():
    """
    Returns the connect four gym environment V2 altered for Tianshou and Petting Zoo compatibility.
    Already wrapped with a ts.env.PettingZooEnv wrapper.
    """
    return ts.env.PettingZooEnv(cfgym.env(reward_move= 0, # Set to 1 for reward to make moves (incentivise longer games)
                                          reward_invalid= -3,
                                          reward_draw= 100,
                                          reward_win= 25,
                                          reward_loss= -25,
                                          allow_invalid_move= False))
    
    
# Test the environment
env = get_env()
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation
del env

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


<hr>

### Implementing the DQN policy

We use the strategy created in `5-improving-dqn-architecture.ipynb`.

In [6]:
####################################################
# DQN ARCHITECTURE
####################################################

class CustomDQN(torch.nn.Module):
    """
    Custom DQN using a model based on CNN
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',):
        # Parent call
        super().__init__()
        
        # Save device (e.g. cuda)
        self.device = device
        
        self.model = torch.nn.Sequential(
            torch.nn.Linear(np.prod(state_shape), 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, np.prod(action_shape)),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
        batch = obs.shape[0]
        logits = self.model(obs.view(batch, -1))
        return logits, state


In [7]:
####################################################
# DQN POLICY
####################################################

def cf_custom_dqn_policy(state_shape: tuple,
                         action_shape: tuple,
                         optim: typing.Optional[torch.optim.Optimizer] = None,
                         learning_rate: float =  0.0001,
                         gamma: float = 0.9, # Smaller gamma favours "faster" win
                         n_step: int = 4, # Number of steps to look ahead
                         frozen: bool = False,
                         target_update_freq: int = 320):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    net = CustomDQN(state_shape, action_shape, device= device).to(device)
    
    # Default optimizer is an adam optimizer with the argparser learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # If we are frozen, we use an optimizer that has learning rate 0
    if frozen:
        optim = torch.optim.SGD(net.parameters(), lr= 0)
        
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)

<hr>

### Building agents

This is identical to the previous notebook with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [8]:
####################################################
# AGENT CREATION
####################################################

def get_agents(agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
               agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
               optim: typing.Optional[torch.optim.Optimizer] = None,
               resume_path_player_1: str = '', # Path to file to resume agent training from
               resume_path_player_2: str = '', 
               agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
               agent_player2_frozen: bool = False,
               ) -> typing.Tuple[ts.policy.BasePolicy, torch.optim.Optimizer, list]:
    """
    Gets a multi agent policy manager, optimizer and player ids for the connect four V2 gym environment.
    Per default this returns 
        - Multi agent manager for 2 agents using DQN
        - Adam optimizer
        - ['player_1', 'player_2'] from the connect four environment
    """
    
    # Get the environment to play in (Connect four gym V2)
    env = get_env()
    
    # Get the observation space from the environment, depending on typo of space (ternary operator)
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    
    # Set the arguments
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent player 1 to be a DQN if no policy is passed.
    if agent_player1 is None:
        # Our agent1 uses a DQN policy
        agent_player1 = cf_custom_dqn_policy(state_shape= state_shape,
                                             action_shape= action_shape,
                                             optim= optim,
                                             frozen= agent_player1_frozen)
                
        # If we resume our agent we need to load the previous config
        if resume_path_player_1:
            agent_player1.load_state_dict(torch.load(resume_path_player_1))
            
    
    # Configure agent player 2 to be a DQN if no policy is passed.
    if agent_player2 is None:
        # Our agent1 uses a DQN policy
        agent_player2 = cf_custom_dqn_policy(state_shape= state_shape,
                                             action_shape= action_shape,
                                             optim= optim,
                                             frozen= agent_player2_frozen)
        
                
        # If we resume our agent we need to load the previous config
        if resume_path_player_2:
            agent_player2.load_state_dict(torch.load(resume_path_player_2))

    # Both our agents are DQN agents by default
    agents = [agent_player1, agent_player2]
        
    # Our policy depends on the order of the agents
    policy = ts.policy.MultiAgentPolicyManager(agents, env)
    
    # Return our policy, optimizer and the available agents in the environment
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    
    return policy, optim, env.agents

<hr>

### Function for letting agents learn

This is identical to the previous notebook.

In [9]:
####################################################
# AGENT TRAINING
####################################################

def train_agent(filename: str = "dqn_vs_dqn_cnn_based",
                agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
                agent_player2_frozen: bool = False,
                single_agent_score_as_reward: bool= False, # Uses non frozen agent's score as reward
                optim: typing.Optional[torch.optim.Optimizer] = None,
                training_env_num: int = 1,
                testing_env_num: int = 1,
                buffer_size: int = 2^14,
                batch_size: int = 1, 
                epochs: int = 50, #50
                step_per_epoch: int = 1024, #1024
                step_per_collect: int = 64, # transition before update
                update_per_step: float = 0.1,
                testing_eps: float = 0.05,
                training_eps: float = 0.1,
                ) -> typing.Tuple[dict, ts.policy.BasePolicy]:
    """
    Trains two agents in the connect four V2 environment and saves their best model and logs.
    Returns:
        - result from offpolicy_trainer
        - final version of agent 1
        - final version of agent 2
    """

    # ======== notebook specific =========
    notebook_version = '8' # Used for foldering logs and models

    # ======== environment setup =========
    train_envs = ts.env.DummyVectorEnv([get_env for _ in range(training_env_num)])
    test_envs = ts.env.DummyVectorEnv([get_env for _ in range(testing_env_num)])
    
    # set the seed for reproducibility
    np.random.seed(1998)
    torch.manual_seed(1998)
    train_envs.seed(1998)
    test_envs.seed(1998)

    # ======== agent setup =========
    # Gets our agents from the previously made function
    # Per default: 
    #   - Multi agent manager for 2 agents using DQN
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    policy, optim, agents = get_agents(agent_player1=agent_player1,
                                       agent_player2=agent_player2,
                                       agent_player1_frozen= agent_player1_frozen,
                                       agent_player2_frozen= agent_player2_frozen,
                                       optim=optim)

    # ======== collector setup =========
    # Make a collector for the training environments
    train_collector = ts.data.Collector(policy= policy,
                                        env= train_envs,
                                        buffer= ts.data.VectorReplayBuffer(buffer_size, len(train_envs)),
                                        exploration_noise= True)
    
    # Make a collector for the testing environments
    test_collector = ts.data.Collector(policy= policy,
                                       env= test_envs,
                                       buffer= ts.data.VectorReplayBuffer(buffer_size, len(test_envs)),
                                       exploration_noise= True)
    
    # Uncomment below if you want to set epsilon in epsilon policy
    # policy.set_eps(1)
    
    # Collect data fot the training evnironments
    train_collector.collect(n_step= batch_size * training_env_num)
    
    # ======== ensure folders exist =========
    if not os.path.exists(os.path.join('./logs', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./logs', 'paper_notebooks', notebook_version, filename))
    if not os.path.exists(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename))

    # ======== tensorboard logging setup =========
    # Allows to save the training progress to tensorboard compatable logs
    log_path = os.path.join('./logs', 'paper_notebooks', notebook_version, filename)
    writer = torch.utils.tensorboard.SummaryWriter(log_path)
    logger = ts.utils.TensorboardLogger(writer)

    # ======== callback functions used during training =========
    # We want to save our best policy
    def save_best_fn(policy):
        """
        Callback to save the best model
        """
        # Save best agent 1
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent1.pth')
        torch.save(policy.policies[agents[0]].state_dict(), model_save_path)
        
        # Save best agent 2
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent2.pth')
        torch.save(policy.policies[agents[1]].state_dict(), model_save_path)
        
        # Save agent2

    def stop_fn(mean_rewards):
        """
        Callback to stop training when we've reached the win rate
        """
        return mean_rewards >= 7 # (win = 10, 70% win without invalid moves = mean of 7)

    def train_fn(epoch, env_step):
        """
        Callback before training
        """        
        # Before training we want to configure the epsilon for the agents
        # In general more exploratory than the test case
        policy.policies[agents[0]].set_eps(training_eps)
        policy.policies[agents[1]].set_eps(training_eps)

    def test_fn(epoch, env_step):
        """
        Callback beore testing
        """        
        # Before testing we want to configure the epsilon for the agents
        # In general more greedy than the train case but not
        #   to avoid getting stuck on invalid moves
        policy.policies[agents[0]].set_eps(testing_eps)
        policy.policies[agents[1]].set_eps(testing_eps)

    def reward_metric(rews):
        """
        Callback for reward collection
        """        
        if agent_player2_frozen and single_agent_score_as_reward:
            # agent 2 frozen, optimizing for agent 1
            return rews[:, 0]
        
        if agent_player1_frozen and single_agent_score_as_reward:
            # agent 1 frozen, optimizing for agent 2
            return rews[:, 1]
        
        # Per default we are interested in optimizing both agents
        return rews[:, 0] + rews[:, 1]
    
            

    # trainer
    result = ts.trainer.offpolicy_trainer(policy= policy,
                                          train_collector= train_collector,
                                          test_collector= test_collector,
                                          max_epoch= epochs,
                                          step_per_epoch= step_per_epoch,
                                          step_per_collect= step_per_collect,
                                          episode_per_test= testing_env_num,
                                          batch_size= batch_size,
                                          train_fn= train_fn,
                                          test_fn= test_fn,
                                          # Stop function to stop before specified amount of epochs
                                          #stop_fn= stop_fn
                                          save_best_fn= save_best_fn,
                                          update_per_step= update_per_step,
                                          logger= logger,
                                          test_in_train= False,
                                          reward_metric= reward_metric)
    
    # Save final agent 1
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent1.pth')
    torch.save(policy.policies[agents[0]].state_dict(), model_save_path)

    # Save final agent 2
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent2.pth')
    torch.save(policy.policies[agents[1]].state_dict(), model_save_path)

    return result, policy.policies[agents[0]], policy.policies[agents[1]]

<hr>

### Function for watching learned agent

Identical to the previous notebook.

In [10]:
####################################################
# WATCHING THE LEARNED POLICY IN ACTION
####################################################

def watch(numer_of_games: int = 3,
          agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
          agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
          test_epsilon: float = 0.05, # For the watching we act completely greedy but low random for not getting stuck on invalid move
          render_speed: float = 0.15, # Amount of seconds to update frame/ do a step
          ) -> None:
    
    # Get the connect four V2 environment (must be a list)
    env= ts.env.DummyVectorEnv([get_env])
    
    # Get the agents from the trained agents
    policy, optim, agents = get_agents(agent_player1= agent_player1,
                                       agent_player2= agent_player2)
    
    # Evaluate the policy
    policy.eval()
    
    # Set the testing policy epsilon for our agents
    policy.policies[agents[0]].set_eps(test_epsilon)
    policy.policies[agents[1]].set_eps(test_epsilon)
    
    # Collect the test data
    collector = ts.data.Collector(policy= policy,
                                  env= env,
                                  exploration_noise= True)
    
    # Render games in human mode to see how it plays
    result = collector.collect(n_episode= numer_of_games, render= render_speed)
    
    # Close the environment aftering collecting the results
    # This closes the pygame window after completion
    env.close()
    
    # Get the rewards and length from the test trials
    rewards, length = result["rews"], result["lens"]
    
    # Print the final reward for the first agent
    print(f"Average steps of game:  {length.mean()}")
    print(f"Final mean reward agent 1: {rewards[:, 0].mean()}, std: {rewards[:, 0].std()}")
    print(f"Final mean reward agent 2: {rewards[:, 1].mean()}, std: {rewards[:, 1].std()}")

<hr>

### Doing the experiment

We now do the experiment with using our previously created functions.
We freeze one agent and initialize both agents from previous versions.

The following iterations were made:

1. Freeze agent 1, train agent 2:
    - Model save name: `1-mlp_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `1` with test reward `1102`
    - Scoring: sum of `both` agent's score
2. Freeze agent 2, train agent 1:
    - Model save name: `2-mlp_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/5/dqn_vs_dqn/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/1-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.0001`
    - Training epsilon: `0.2`
    - Look ahead steps: `4`
    - Reward for move/invalid: `+1` / `-3`
    - Allow invalid move: `False`
    - Epochs: `1000`
    - Gamma: `0.9`
    - Best epoch: `482` with test reward `1102`
    - Scoring: sum of `both` agent's score

After which the agent was so focused on prolonging the game, we decided to lower the learning rate and start optimizing for winning again. We also lowered the amount of epochs in each iterations of swapping the frozen agent.

3. Freeze agent 1, train agent 2:
    - Model save name: `3-mlp_dqn_frozen_agent1` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/1-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.00005` # halfed learning rate
    - Training epsilon: `0.1` # halfed training epsilon
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: `7` with test reward `100`
    - Scoring: reward of `agent 2`
4. Freeze agent 2, train agent 1:
    - Model save name: `4-mlp_dqn_frozen_agent2` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/3-mlp_dqn_frozen_agent1/final_policy_agent2.pth`
    - Learning rate: `0.00005`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `500`
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: `XXX` with test reward `YYY`
    - Scoring: reward of `agent 1`
    
To do further training, a loop was created which alternated between freezing agens every 50 epochs. This loop was executed 20 times. The learning rate was also lowered once again.

5. Loop frozen agents:
    - Model save name: `5-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/4-mlp_dqn_frozen_agent2/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/3-mlp_dqn_frozen_agent1/best_policy_agent2.pth`
    - Learning rate: `0.000001`
    - Training epsilon: `0.1`
    - Look ahead steps: `4`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `50` x `20` loops 
    - Gamma: `0.8` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
6. Loop frozen agents:
    - Model save name: `6-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/5-looping-iteration-19/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/5-looping-iteration-19/best_policy_agent2.pth`
    - Learning rate: `0.000003`
    - Training epsilon: `0.1`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `100` loops 
    - Gamma: `0.9` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`
7. Loop frozen agents:
    - Model save name: `7-looping-iteration-i` 
    - Agent 1 start: `./saved_variables/paper_notebooks/8/6-looping-iteration-99/best_policy_agent1.pth`
    - Agent 2 start: `./saved_variables/paper_notebooks/8/6-looping-iteration-99/best_policy_agent2.pth`
    - Learning rate: `0.001`
    - Training epsilon: `0.05`
    - Look ahead steps: `8`
    - Reward for move/invalid: `0` / `-3`
    - Allow invalid move: `False`
    - Epochs: `20` x `500` loops 
    - Gamma: `0.9` # Lowered to not make agent want to play too fast again
    - Best epoch: final epoch always taken to next round
    - Scoring: reward of `non frozen agent`

For file size reasons, only a portion of the saved agents are kept and stored on GitHub.


In [14]:
####################################################
# EXPERIMENT: TRAINING AGENTS
####################################################

# Configs for the agents
freeze_agent1 = False
agent1_starting_params = "./saved_variables/paper_notebooks/8/2-mlp_dqn_frozen_agent2/best_policy_agent1.pth"

freeze_agent2 = True
agent2_starting_params = "./saved_variables/paper_notebooks/8/3-mlp_dqn_frozen_agent1/final_policy_agent2.pth"

single_agent_score_as_reward = True # To use combined reward or non frozen agent reward as scoring
filename = "4-mlp_dqn_frozen_agent2"
epochs = 500
loops = 1

learning_rate = 0.00005
training_eps = 0.1
gamma = 0.8
n_step = 4

for loop_idx in range(loops):
    # Filename
    #filename = f"7-20epoch_500loop/7-looping-iteration-{loop_idx}"
    
    # Use provided starting params in first loop, the one from previous iteration in next
    #if loop_idx > 0:
    #    agent1_starting_params = f"./saved_variables/paper_notebooks/7/7-20epoch_500loop/7-looping-iteration-{loop_idx-1}/final_policy_agent1.pth"
    #    agent2_starting_params = f"./saved_variables/paper_notebooks/7/7-20epoch_500loop/7-looping-iteration-{loop_idx-1}/final_policy_agent2.pth"
    
    # Determine what agent to freeze
    #freeze_agent1 = True if loop_idx % 2 == 1 else False
    #freeze_agent2 = True if loop_idx % 2 == 0 else False
    
    # Get the environment settings
    env = get_env()
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent 1
    agent1 = cf_custom_dqn_policy(state_shape= state_shape,
                                  action_shape= action_shape,
                                  gamma= gamma,
                                  frozen= freeze_agent1,
                                  learning_rate = learning_rate,
                                  n_step= n_step)
    
    if agent1_starting_params:
        agent1.load_state_dict(torch.load(agent1_starting_params))
        
        # Configure agent 2
        agent2 = cf_custom_dqn_policy(state_shape= state_shape,
                                      action_shape= action_shape,
                                      gamma= gamma,
                                      frozen= freeze_agent2,
                                      learning_rate = learning_rate,
                                      n_step= n_step)
        
        if agent2_starting_params:
            agent2.load_state_dict(torch.load(agent2_starting_params))
            
            
            # Train the agent
            off_policy_traininer_results, final_agent_player1, final_agent_player2 = train_agent(epochs= epochs,
                                                                                                 agent_player1= agent1,
                                                                                                 agent_player1_frozen = freeze_agent1,
                                                                                                 agent_player2= agent2,
                                                                                                 agent_player2_frozen = freeze_agent2,
                                                                                                 filename= filename,
                                                                                                 single_agent_score_as_reward = single_agent_score_as_reward,
                                                                                                 training_eps= training_eps)
            
            

Epoch #1: 1025it [00:02, 442.19it/s, env_step=1024, len=13, n/ep=5, n/st=64, player_1/loss=2022.014, player_2/loss=11.597, rew=-25.00]                                                                                                                                                                                      


Epoch #1: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #2: 1025it [00:02, 433.46it/s, env_step=2048, len=12, n/ep=5, n/st=64, player_1/loss=1742.202, player_2/loss=15.667, rew=-5.00]                                                                                                                                                                                       


Epoch #2: test_reward: -25.000000 ± 0.000000, best_reward: -25.000000 ± 0.000000 in #0


Epoch #3: 1025it [00:02, 465.65it/s, env_step=3072, len=8, n/ep=7, n/st=64, player_1/loss=1100.388, player_2/loss=42.151, rew=17.86]                                                                                                                                                                                        


Epoch #3: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #4: 1025it [00:02, 446.26it/s, env_step=4096, len=12, n/ep=5, n/st=64, player_1/loss=645.603, player_2/loss=61.477, rew=-25.00]                                                                                                                                                                                       


Epoch #4: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #5: 1025it [00:02, 468.58it/s, env_step=5120, len=12, n/ep=5, n/st=64, player_1/loss=433.571, player_2/loss=75.934, rew=-25.00]                                                                                                                                                                                       


Epoch #5: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #6: 1025it [00:02, 463.57it/s, env_step=6144, len=8, n/ep=8, n/st=64, player_1/loss=485.334, player_2/loss=95.623, rew=-12.50]                                                                                                                                                                                        


Epoch #6: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #7: 1025it [00:02, 429.41it/s, env_step=7168, len=12, n/ep=5, n/st=64, player_2/loss=79.297, rew=-25.00]                                                                                                                                                                                                              


Epoch #7: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #8: 1025it [00:02, 456.50it/s, env_step=8192, len=13, n/ep=5, n/st=64, player_1/loss=600.770, player_2/loss=67.861, rew=-5.00]                                                                                                                                                                                        


Epoch #8: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #9: 1025it [00:02, 464.79it/s, env_step=9216, len=11, n/ep=6, n/st=64, player_1/loss=425.118, player_2/loss=67.068, rew=-25.00]                                                                                                                                                                                       


Epoch #9: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #10: 1025it [00:02, 465.52it/s, env_step=10240, len=8, n/ep=8, n/st=64, player_1/loss=309.031, rew=18.75]                                                                                                                                                                                                             


Epoch #10: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #11: 1025it [00:02, 419.27it/s, env_step=11264, len=8, n/ep=8, n/st=64, player_1/loss=292.683, player_2/loss=136.683, rew=0.00]                                                                                                                                                                                       


Epoch #11: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #12: 1025it [00:02, 453.42it/s, env_step=12288, len=7, n/ep=8, n/st=64, player_1/loss=213.679, player_2/loss=131.050, rew=25.00]                                                                                                                                                                                      


Epoch #12: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #13: 1025it [00:02, 452.26it/s, env_step=13312, len=13, n/ep=4, n/st=64, player_1/loss=189.048, rew=0.00]                                                                                                                                                                                                             


Epoch #13: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #14: 1025it [00:02, 447.54it/s, env_step=14336, len=17, n/ep=4, n/st=64, player_1/loss=252.485, player_2/loss=80.875, rew=-12.50]                                                                                                                                                                                     


Epoch #14: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #15: 1025it [00:02, 451.35it/s, env_step=15360, len=11, n/ep=6, n/st=64, player_1/loss=260.541, player_2/loss=97.499, rew=-16.67]                                                                                                                                                                                     


Epoch #15: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #16: 1025it [00:02, 451.34it/s, env_step=16384, len=12, n/ep=6, n/st=64, player_1/loss=206.075, player_2/loss=124.507, rew=-8.33]                                                                                                                                                                                     


Epoch #16: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #17: 1025it [00:02, 400.33it/s, env_step=17408, len=9, n/ep=7, n/st=64, player_1/loss=181.370, player_2/loss=92.148, rew=17.86]                                                                                                                                                                                       


Epoch #17: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #18: 1025it [00:02, 449.43it/s, env_step=18432, len=14, n/ep=4, n/st=64, player_1/loss=201.524, player_2/loss=75.970, rew=-12.50]                                                                                                                                                                                     


Epoch #18: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #19: 1025it [00:02, 460.16it/s, env_step=19456, len=7, n/ep=8, n/st=64, player_1/loss=241.595, player_2/loss=120.174, rew=18.75]                                                                                                                                                                                      


Epoch #19: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #20: 1025it [00:02, 461.43it/s, env_step=20480, len=23, n/ep=3, n/st=64, player_1/loss=214.584, player_2/loss=148.353, rew=-8.33]                                                                                                                                                                                     


Epoch #20: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #21: 1025it [00:02, 472.07it/s, env_step=21504, len=15, n/ep=4, n/st=64, player_1/loss=196.548, player_2/loss=115.615, rew=-25.00]                                                                                                                                                                                    


Epoch #21: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #22: 1025it [00:02, 420.49it/s, env_step=22528, len=8, n/ep=8, n/st=64, player_1/loss=196.362, player_2/loss=83.962, rew=18.75]                                                                                                                                                                                       


Epoch #22: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #23: 1025it [00:03, 284.66it/s, env_step=23552, len=9, n/ep=6, n/st=64, player_1/loss=132.146, player_2/loss=76.293, rew=8.33]                                                                                                                                                                                        


Epoch #23: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #24: 1025it [00:02, 346.75it/s, env_step=24576, len=11, n/ep=6, n/st=64, player_1/loss=126.760, player_2/loss=75.158, rew=0.00]                                                                                                                                                                                       


Epoch #24: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #25: 1025it [00:02, 421.20it/s, env_step=25600, len=9, n/ep=6, n/st=64, player_1/loss=134.385, player_2/loss=68.029, rew=16.67]                                                                                                                                                                                       


Epoch #25: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #26: 1025it [00:02, 397.28it/s, env_step=26624, len=12, n/ep=5, n/st=64, player_1/loss=136.754, player_2/loss=73.037, rew=15.00]                                                                                                                                                                                      


Epoch #26: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #27: 1025it [00:02, 401.30it/s, env_step=27648, len=12, n/ep=5, n/st=64, player_1/loss=155.579, player_2/loss=63.990, rew=15.00]                                                                                                                                                                                      


Epoch #27: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #28: 1025it [00:02, 386.34it/s, env_step=28672, len=14, n/ep=4, n/st=64, player_1/loss=162.885, player_2/loss=38.243, rew=12.50]                                                                                                                                                                                      


Epoch #28: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #29: 1025it [00:02, 392.13it/s, env_step=29696, len=24, n/ep=3, n/st=64, player_1/loss=222.680, player_2/loss=80.634, rew=-25.00]                                                                                                                                                                                     


Epoch #29: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #30: 1025it [00:02, 392.14it/s, env_step=30720, len=13, n/ep=5, n/st=64, player_1/loss=209.378, player_2/loss=82.140, rew=-25.00]                                                                                                                                                                                     


Epoch #30: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #31: 1025it [00:02, 386.02it/s, env_step=31744, len=8, n/ep=8, n/st=64, player_1/loss=164.072, player_2/loss=72.011, rew=-18.75]                                                                                                                                                                                      


Epoch #31: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #32: 1025it [00:02, 379.68it/s, env_step=32768, len=8, n/ep=7, n/st=64, player_1/loss=163.489, player_2/loss=75.523, rew=17.86]                                                                                                                                                                                       


Epoch #32: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #33: 1025it [00:02, 356.50it/s, env_step=33792, len=7, n/ep=9, n/st=64, player_1/loss=132.454, player_2/loss=128.274, rew=25.00]                                                                                                                                                                                      


Epoch #33: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #34: 1025it [00:02, 501.14it/s, env_step=34816, len=8, n/ep=8, n/st=64, player_1/loss=119.783, player_2/loss=176.550, rew=25.00]                                                                                                                                                                                      


Epoch #34: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #35: 1025it [00:02, 501.21it/s, env_step=35840, len=7, n/ep=9, n/st=64, player_1/loss=104.310, player_2/loss=209.286, rew=19.44]                                                                                                                                                                                      


Epoch #35: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #36: 1025it [00:02, 499.20it/s, env_step=36864, len=7, n/ep=8, n/st=64, player_1/loss=133.759, player_2/loss=210.242, rew=18.75]                                                                                                                                                                                      


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #37: 1025it [00:02, 499.56it/s, env_step=37888, len=7, n/ep=9, n/st=64, player_1/loss=118.603, player_2/loss=186.671, rew=19.44]                                                                                                                                                                                      


Epoch #37: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #38: 1025it [00:02, 437.06it/s, env_step=38912, len=7, n/ep=8, n/st=64, player_1/loss=106.166, player_2/loss=177.873, rew=25.00]                                                                                                                                                                                      


Epoch #38: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #39: 1025it [00:02, 474.42it/s, env_step=39936, len=7, n/ep=9, n/st=64, player_1/loss=110.499, player_2/loss=182.151, rew=19.44]                                                                                                                                                                                      


Epoch #39: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #40: 1025it [00:02, 497.81it/s, env_step=40960, len=7, n/ep=9, n/st=64, player_1/loss=92.469, player_2/loss=190.819, rew=25.00]                                                                                                                                                                                       


Epoch #40: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #41: 1025it [00:02, 499.94it/s, env_step=41984, len=9, n/ep=8, n/st=64, player_1/loss=98.877, player_2/loss=195.810, rew=18.75]                                                                                                                                                                                       


Epoch #41: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #42: 1025it [00:02, 503.70it/s, env_step=43008, len=12, n/ep=5, n/st=64, player_1/loss=101.064, player_2/loss=163.632, rew=15.00]                                                                                                                                                                                     


Epoch #42: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #43: 1025it [00:02, 500.05it/s, env_step=44032, len=7, n/ep=9, n/st=64, player_1/loss=86.360, player_2/loss=137.305, rew=19.44]                                                                                                                                                                                       


Epoch #43: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #44: 1025it [00:02, 381.56it/s, env_step=45056, len=9, n/ep=7, n/st=64, player_1/loss=107.409, player_2/loss=161.815, rew=17.86]                                                                                                                                                                                      


Epoch #44: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #45: 1025it [00:02, 413.25it/s, env_step=46080, len=12, n/ep=5, n/st=64, player_1/loss=144.332, player_2/loss=149.270, rew=5.00]                                                                                                                                                                                      


Epoch #45: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #46: 1025it [00:02, 409.47it/s, env_step=47104, len=14, n/ep=4, n/st=64, player_1/loss=144.011, player_2/loss=97.763, rew=0.00]                                                                                                                                                                                       


Epoch #46: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #47: 1025it [00:02, 415.08it/s, env_step=48128, len=8, n/ep=7, n/st=64, player_1/loss=146.102, player_2/loss=66.610, rew=-3.57]                                                                                                                                                                                       


Epoch #47: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #48: 1025it [00:02, 378.67it/s, env_step=49152, len=8, n/ep=8, n/st=64, player_1/loss=121.218, player_2/loss=163.360, rew=6.25]                                                                                                                                                                                       


Epoch #48: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #49: 1025it [00:03, 331.61it/s, env_step=50176, len=7, n/ep=8, n/st=64, player_1/loss=84.744, player_2/loss=233.976, rew=25.00]                                                                                                                                                                                       


Epoch #49: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #50: 1025it [00:03, 334.86it/s, env_step=51200, len=8, n/ep=8, n/st=64, player_1/loss=87.681, player_2/loss=216.915, rew=25.00]                                                                                                                                                                                       


Epoch #50: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #51: 1025it [00:03, 333.01it/s, env_step=52224, len=8, n/ep=7, n/st=64, player_1/loss=103.648, player_2/loss=213.279, rew=25.00]                                                                                                                                                                                      


Epoch #51: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #52: 1025it [00:03, 337.39it/s, env_step=53248, len=8, n/ep=8, n/st=64, player_1/loss=107.901, player_2/loss=201.900, rew=18.75]                                                                                                                                                                                      


Epoch #52: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #53: 1025it [00:03, 340.87it/s, env_step=54272, len=8, n/ep=7, n/st=64, player_1/loss=87.804, player_2/loss=178.385, rew=17.86]                                                                                                                                                                                       


Epoch #53: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #54: 1025it [00:02, 351.50it/s, env_step=55296, len=7, n/ep=8, n/st=64, player_1/loss=81.156, player_2/loss=218.446, rew=18.75]                                                                                                                                                                                       


Epoch #54: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #55: 1025it [00:03, 330.97it/s, env_step=56320, len=7, n/ep=8, n/st=64, player_1/loss=94.794, player_2/loss=226.127, rew=18.75]                                                                                                                                                                                       


Epoch #55: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #56: 1025it [00:02, 345.00it/s, env_step=57344, len=7, n/ep=8, n/st=64, player_1/loss=117.649, player_2/loss=209.448, rew=18.75]                                                                                                                                                                                      


Epoch #56: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #57: 1025it [00:02, 373.66it/s, env_step=58368, len=7, n/ep=8, n/st=64, player_1/loss=102.116, player_2/loss=173.214, rew=25.00]                                                                                                                                                                                      


Epoch #57: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #58: 1025it [00:02, 423.29it/s, env_step=59392, len=7, n/ep=8, n/st=64, player_1/loss=91.403, rew=18.75]                                                                                                                                                                                                              


Epoch #58: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #59: 1025it [00:02, 413.78it/s, env_step=60416, len=7, n/ep=8, n/st=64, player_1/loss=84.712, player_2/loss=155.689, rew=18.75]                                                                                                                                                                                       


Epoch #59: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #60: 1025it [00:02, 411.30it/s, env_step=61440, len=8, n/ep=8, n/st=64, player_1/loss=76.180, player_2/loss=160.411, rew=18.75]                                                                                                                                                                                       


Epoch #60: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #61: 1025it [00:02, 418.66it/s, env_step=62464, len=7, n/ep=8, n/st=64, player_1/loss=80.615, rew=18.75]                                                                                                                                                                                                              


Epoch #61: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #62: 1025it [00:02, 461.52it/s, env_step=63488, len=7, n/ep=9, n/st=64, player_1/loss=80.767, player_2/loss=141.758, rew=13.89]                                                                                                                                                                                       


Epoch #62: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #63: 1025it [00:02, 460.85it/s, env_step=64512, len=7, n/ep=8, n/st=64, player_1/loss=88.694, player_2/loss=127.863, rew=25.00]                                                                                                                                                                                       


Epoch #63: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #64: 1025it [00:02, 435.54it/s, env_step=65536, len=8, n/ep=8, n/st=64, player_1/loss=93.617, player_2/loss=129.383, rew=12.50]                                                                                                                                                                                       


Epoch #64: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #65: 1025it [00:02, 447.22it/s, env_step=66560, len=7, n/ep=9, n/st=64, player_1/loss=85.258, player_2/loss=166.592, rew=25.00]                                                                                                                                                                                       


Epoch #65: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #66: 1025it [00:02, 456.52it/s, env_step=67584, len=7, n/ep=8, n/st=64, player_1/loss=72.867, player_2/loss=179.405, rew=18.75]                                                                                                                                                                                       


Epoch #66: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #67: 1025it [00:02, 410.33it/s, env_step=68608, len=7, n/ep=10, n/st=64, player_1/loss=67.834, player_2/loss=174.164, rew=25.00]                                                                                                                                                                                      


Epoch #67: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #68: 1025it [00:02, 450.48it/s, env_step=69632, len=8, n/ep=8, n/st=64, player_1/loss=94.634, player_2/loss=167.654, rew=6.25]                                                                                                                                                                                        


Epoch #68: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #69: 1025it [00:02, 462.69it/s, env_step=70656, len=7, n/ep=9, n/st=64, player_1/loss=125.631, player_2/loss=223.951, rew=25.00]                                                                                                                                                                                      


Epoch #69: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #70: 1025it [00:02, 443.42it/s, env_step=71680, len=8, n/ep=7, n/st=64, player_1/loss=89.174, player_2/loss=219.889, rew=17.86]                                                                                                                                                                                       


Epoch #70: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #71: 1025it [00:02, 428.36it/s, env_step=72704, len=7, n/ep=9, n/st=64, player_1/loss=79.037, player_2/loss=229.253, rew=13.89]                                                                                                                                                                                       


Epoch #71: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #72: 1025it [00:02, 414.74it/s, env_step=73728, len=8, n/ep=7, n/st=64, player_1/loss=93.661, player_2/loss=173.821, rew=10.71]                                                                                                                                                                                       


Epoch #72: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #73: 1025it [00:02, 365.33it/s, env_step=74752, len=7, n/ep=8, n/st=64, player_1/loss=112.205, player_2/loss=172.404, rew=12.50]                                                                                                                                                                                      


Epoch #73: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #74: 1025it [00:02, 386.77it/s, env_step=75776, len=8, n/ep=8, n/st=64, player_1/loss=111.528, player_2/loss=194.220, rew=6.25]                                                                                                                                                                                       


Epoch #74: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #75: 1025it [00:02, 412.17it/s, env_step=76800, len=7, n/ep=9, n/st=64, player_1/loss=74.620, player_2/loss=202.324, rew=13.89]                                                                                                                                                                                       


Epoch #75: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #76: 1025it [00:02, 465.82it/s, env_step=77824, len=8, n/ep=7, n/st=64, player_1/loss=102.407, player_2/loss=184.245, rew=17.86]                                                                                                                                                                                      


Epoch #76: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #77: 1025it [00:02, 439.97it/s, env_step=78848, len=8, n/ep=8, n/st=64, player_1/loss=98.377, player_2/loss=167.940, rew=12.50]                                                                                                                                                                                       


Epoch #77: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #78: 1025it [00:02, 403.20it/s, env_step=79872, len=8, n/ep=8, n/st=64, player_1/loss=61.168, player_2/loss=189.414, rew=18.75]                                                                                                                                                                                       


Epoch #78: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #79: 1025it [00:02, 423.47it/s, env_step=80896, len=7, n/ep=9, n/st=64, player_1/loss=68.867, player_2/loss=210.129, rew=25.00]                                                                                                                                                                                       


Epoch #79: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #80: 1025it [00:02, 460.33it/s, env_step=81920, len=9, n/ep=7, n/st=64, player_2/loss=213.185, rew=17.86]                                                                                                                                                                                                             


Epoch #80: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #81: 1025it [00:02, 464.31it/s, env_step=82944, len=7, n/ep=8, n/st=64, player_1/loss=104.134, player_2/loss=205.841, rew=12.50]                                                                                                                                                                                      


Epoch #81: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #82: 1025it [00:02, 456.99it/s, env_step=83968, len=7, n/ep=9, n/st=64, player_1/loss=78.877, player_2/loss=173.692, rew=25.00]                                                                                                                                                                                       


Epoch #82: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #83: 1025it [00:02, 458.53it/s, env_step=84992, len=7, n/ep=8, n/st=64, player_1/loss=69.944, player_2/loss=177.017, rew=25.00]                                                                                                                                                                                       


Epoch #83: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #84: 1025it [00:02, 469.33it/s, env_step=86016, len=9, n/ep=7, n/st=64, player_1/loss=86.692, player_2/loss=170.930, rew=3.57]                                                                                                                                                                                        


Epoch #84: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #85: 1025it [00:02, 466.24it/s, env_step=87040, len=8, n/ep=7, n/st=64, player_1/loss=71.905, player_2/loss=155.342, rew=10.71]                                                                                                                                                                                       


Epoch #85: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #86: 1025it [00:02, 444.83it/s, env_step=88064, len=8, n/ep=8, n/st=64, player_1/loss=75.772, player_2/loss=120.015, rew=18.75]                                                                                                                                                                                       


Epoch #86: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #87: 1025it [00:02, 418.87it/s, env_step=89088, len=7, n/ep=9, n/st=64, player_1/loss=107.100, player_2/loss=161.238, rew=19.44]                                                                                                                                                                                      


Epoch #87: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #88: 1025it [00:02, 383.28it/s, env_step=90112, len=8, n/ep=7, n/st=64, player_1/loss=107.158, player_2/loss=186.666, rew=17.86]                                                                                                                                                                                      


Epoch #88: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #89: 1025it [00:02, 386.41it/s, env_step=91136, len=7, n/ep=9, n/st=64, player_1/loss=91.054, player_2/loss=187.761, rew=13.89]                                                                                                                                                                                       


Epoch #89: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #90: 1025it [00:03, 286.91it/s, env_step=92160, len=7, n/ep=8, n/st=64, player_1/loss=76.490, player_2/loss=212.319, rew=6.25]                                                                                                                                                                                        


Epoch #90: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #91: 1025it [00:03, 282.95it/s, env_step=93184, len=7, n/ep=8, n/st=64, player_1/loss=89.968, player_2/loss=180.039, rew=25.00]                                                                                                                                                                                       


Epoch #91: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #92: 1025it [00:03, 307.04it/s, env_step=94208, len=8, n/ep=8, n/st=64, player_1/loss=92.440, player_2/loss=140.571, rew=25.00]                                                                                                                                                                                       


Epoch #92: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #93: 1025it [00:03, 308.22it/s, env_step=95232, len=7, n/ep=8, n/st=64, player_1/loss=62.446, player_2/loss=178.222, rew=18.75]                                                                                                                                                                                       


Epoch #93: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #94: 1025it [00:03, 311.35it/s, env_step=96256, len=8, n/ep=7, n/st=64, player_1/loss=48.369, player_2/loss=196.139, rew=17.86]                                                                                                                                                                                       


Epoch #94: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #95: 1025it [00:03, 305.97it/s, env_step=97280, len=7, n/ep=9, n/st=64, player_1/loss=58.374, player_2/loss=192.158, rew=13.89]                                                                                                                                                                                       


Epoch #95: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #96: 1025it [00:03, 307.08it/s, env_step=98304, len=7, n/ep=9, n/st=64, player_1/loss=81.229, player_2/loss=173.855, rew=13.89]                                                                                                                                                                                       


Epoch #96: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #97: 1025it [00:03, 311.45it/s, env_step=99328, len=8, n/ep=7, n/st=64, player_1/loss=99.256, player_2/loss=215.782, rew=3.57]                                                                                                                                                                                        


Epoch #97: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #98: 1025it [00:03, 317.29it/s, env_step=100352, len=7, n/ep=8, n/st=64, player_1/loss=79.125, player_2/loss=227.998, rew=25.00]                                                                                                                                                                                      


Epoch #98: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #99: 1025it [00:03, 312.43it/s, env_step=101376, len=7, n/ep=9, n/st=64, player_1/loss=121.252, player_2/loss=171.999, rew=19.44]                                                                                                                                                                                     


Epoch #99: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #100: 1025it [00:03, 303.96it/s, env_step=102400, len=8, n/ep=7, n/st=64, player_1/loss=118.944, player_2/loss=145.998, rew=17.86]                                                                                                                                                                                    


Epoch #100: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #101: 1025it [00:03, 302.42it/s, env_step=103424, len=7, n/ep=8, n/st=64, player_1/loss=71.148, player_2/loss=136.338, rew=18.75]                                                                                                                                                                                     


Epoch #101: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #102: 1025it [00:03, 314.07it/s, env_step=104448, len=7, n/ep=9, n/st=64, player_1/loss=105.578, player_2/loss=159.395, rew=8.33]                                                                                                                                                                                     


Epoch #102: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #103: 1025it [00:03, 306.81it/s, env_step=105472, len=8, n/ep=8, n/st=64, player_1/loss=73.788, player_2/loss=158.880, rew=25.00]                                                                                                                                                                                     


Epoch #103: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #104: 1025it [00:03, 308.87it/s, env_step=106496, len=9, n/ep=7, n/st=64, player_1/loss=64.979, player_2/loss=175.277, rew=17.86]                                                                                                                                                                                     


Epoch #104: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #105: 1025it [00:03, 313.66it/s, env_step=107520, len=7, n/ep=7, n/st=64, player_1/loss=97.337, player_2/loss=162.604, rew=17.86]                                                                                                                                                                                     


Epoch #105: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #106: 1025it [00:03, 314.78it/s, env_step=108544, len=7, n/ep=8, n/st=64, player_1/loss=94.368, player_2/loss=179.009, rew=25.00]                                                                                                                                                                                     


Epoch #106: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #107: 1025it [00:03, 313.48it/s, env_step=109568, len=7, n/ep=9, n/st=64, player_1/loss=66.379, player_2/loss=194.806, rew=13.89]                                                                                                                                                                                     


Epoch #107: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #108: 1025it [00:03, 300.44it/s, env_step=110592, len=9, n/ep=7, n/st=64, player_1/loss=53.531, player_2/loss=207.222, rew=3.57]                                                                                                                                                                                      


Epoch #108: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #109: 1025it [00:03, 305.28it/s, env_step=111616, len=7, n/ep=8, n/st=64, player_1/loss=86.614, player_2/loss=231.488, rew=25.00]                                                                                                                                                                                     


Epoch #109: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #110: 1025it [00:03, 305.49it/s, env_step=112640, len=7, n/ep=8, n/st=64, player_1/loss=120.435, player_2/loss=194.326, rew=25.00]                                                                                                                                                                                    


Epoch #110: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #111: 1025it [00:02, 343.79it/s, env_step=113664, len=7, n/ep=9, n/st=64, player_1/loss=115.446, player_2/loss=218.790, rew=19.44]                                                                                                                                                                                    


Epoch #111: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #112: 1025it [00:03, 323.41it/s, env_step=114688, len=7, n/ep=8, n/st=64, player_1/loss=70.387, player_2/loss=235.951, rew=12.50]                                                                                                                                                                                     


Epoch #112: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #113: 1025it [00:03, 323.95it/s, env_step=115712, len=8, n/ep=8, n/st=64, player_1/loss=79.856, player_2/loss=221.406, rew=12.50]                                                                                                                                                                                     


Epoch #113: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #114: 1025it [00:03, 325.85it/s, env_step=116736, len=7, n/ep=8, n/st=64, player_1/loss=94.603, player_2/loss=219.456, rew=18.75]                                                                                                                                                                                     


Epoch #114: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #115: 1025it [00:03, 318.96it/s, env_step=117760, len=7, n/ep=9, n/st=64, player_1/loss=52.231, player_2/loss=205.951, rew=25.00]                                                                                                                                                                                     


Epoch #115: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #116: 1025it [00:03, 332.40it/s, env_step=118784, len=8, n/ep=8, n/st=64, player_1/loss=37.257, player_2/loss=163.749, rew=6.25]                                                                                                                                                                                      


Epoch #116: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #117: 1025it [00:03, 329.20it/s, env_step=119808, len=7, n/ep=8, n/st=64, player_1/loss=34.843, player_2/loss=152.278, rew=25.00]                                                                                                                                                                                     


Epoch #117: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #118: 1025it [00:03, 326.89it/s, env_step=120832, len=7, n/ep=9, n/st=64, player_1/loss=54.257, player_2/loss=167.000, rew=19.44]                                                                                                                                                                                     


Epoch #118: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #119: 1025it [00:03, 327.05it/s, env_step=121856, len=7, n/ep=9, n/st=64, player_1/loss=92.006, player_2/loss=201.907, rew=13.89]                                                                                                                                                                                     


Epoch #119: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #120: 1025it [00:03, 328.40it/s, env_step=122880, len=7, n/ep=9, n/st=64, player_1/loss=88.109, player_2/loss=159.906, rew=19.44]                                                                                                                                                                                     


Epoch #120: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #121: 1025it [00:03, 323.37it/s, env_step=123904, len=7, n/ep=9, n/st=64, player_1/loss=61.632, player_2/loss=169.764, rew=19.44]                                                                                                                                                                                     


Epoch #121: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #122: 1025it [00:03, 321.27it/s, env_step=124928, len=7, n/ep=9, n/st=64, player_1/loss=78.783, player_2/loss=196.720, rew=8.33]                                                                                                                                                                                      


Epoch #122: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #123: 1025it [00:03, 331.41it/s, env_step=125952, len=7, n/ep=8, n/st=64, player_1/loss=92.614, player_2/loss=187.683, rew=18.75]                                                                                                                                                                                     


Epoch #123: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #124: 1025it [00:03, 336.37it/s, env_step=126976, len=8, n/ep=8, n/st=64, player_1/loss=97.491, player_2/loss=149.428, rew=18.75]                                                                                                                                                                                     


Epoch #124: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #125: 1025it [00:03, 320.65it/s, env_step=128000, len=7, n/ep=8, n/st=64, player_1/loss=65.184, player_2/loss=177.337, rew=18.75]                                                                                                                                                                                     


Epoch #125: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #126: 1025it [00:03, 318.31it/s, env_step=129024, len=7, n/ep=8, n/st=64, player_1/loss=26.093, player_2/loss=188.666, rew=25.00]                                                                                                                                                                                     


Epoch #126: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #127: 1025it [00:03, 336.47it/s, env_step=130048, len=8, n/ep=7, n/st=64, player_1/loss=30.337, player_2/loss=192.411, rew=17.86]                                                                                                                                                                                     


Epoch #127: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #128: 1025it [00:04, 234.24it/s, env_step=131072, len=8, n/ep=8, n/st=64, player_1/loss=37.555, player_2/loss=162.305, rew=12.50]                                                                                                                                                                                     


Epoch #128: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #129: 1025it [00:04, 240.12it/s, env_step=132096, len=8, n/ep=8, n/st=64, player_1/loss=91.519, player_2/loss=168.992, rew=12.50]                                                                                                                                                                                     


Epoch #129: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #130: 1025it [00:03, 275.21it/s, env_step=133120, len=8, n/ep=8, n/st=64, player_1/loss=70.032, player_2/loss=208.708, rew=12.50]                                                                                                                                                                                     


Epoch #130: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #131: 1025it [00:04, 224.83it/s, env_step=134144, len=8, n/ep=9, n/st=64, player_1/loss=67.076, player_2/loss=216.903, rew=8.33]                                                                                                                                                                                      


Epoch #131: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #132: 1025it [00:04, 224.55it/s, env_step=135168, len=7, n/ep=8, n/st=64, player_1/loss=70.282, player_2/loss=255.520, rew=18.75]                                                                                                                                                                                     


Epoch #132: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #133: 1025it [00:04, 224.88it/s, env_step=136192, len=7, n/ep=8, n/st=64, player_1/loss=55.648, player_2/loss=234.865, rew=18.75]                                                                                                                                                                                     


Epoch #133: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #134: 1025it [00:03, 267.21it/s, env_step=137216, len=8, n/ep=8, n/st=64, player_1/loss=62.461, player_2/loss=219.757, rew=25.00]                                                                                                                                                                                     


Epoch #134: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #135: 1025it [00:03, 256.31it/s, env_step=138240, len=7, n/ep=8, n/st=64, player_1/loss=81.927, player_2/loss=214.278, rew=25.00]                                                                                                                                                                                     


Epoch #135: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #136: 1025it [00:04, 255.15it/s, env_step=139264, len=7, n/ep=9, n/st=64, player_1/loss=85.010, player_2/loss=210.222, rew=19.44]                                                                                                                                                                                     


Epoch #136: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #137: 1025it [00:04, 251.26it/s, env_step=140288, len=9, n/ep=6, n/st=64, player_1/loss=77.277, player_2/loss=188.187, rew=8.33]                                                                                                                                                                                      


Epoch #137: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #138: 1025it [00:04, 255.97it/s, env_step=141312, len=7, n/ep=9, n/st=64, player_1/loss=91.067, player_2/loss=150.226, rew=19.44]                                                                                                                                                                                     


Epoch #138: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #139: 1025it [00:04, 254.02it/s, env_step=142336, len=8, n/ep=8, n/st=64, player_1/loss=56.213, player_2/loss=149.035, rew=12.50]                                                                                                                                                                                     


Epoch #139: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #140: 1025it [00:04, 255.68it/s, env_step=143360, len=7, n/ep=8, n/st=64, player_1/loss=69.760, player_2/loss=147.370, rew=25.00]                                                                                                                                                                                     


Epoch #140: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #141: 1025it [00:04, 254.40it/s, env_step=144384, len=7, n/ep=9, n/st=64, player_1/loss=63.735, player_2/loss=170.487, rew=25.00]                                                                                                                                                                                     


Epoch #141: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #142: 1025it [00:04, 254.77it/s, env_step=145408, len=7, n/ep=8, n/st=64, player_1/loss=49.338, player_2/loss=178.220, rew=6.25]                                                                                                                                                                                      


Epoch #142: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #143: 1025it [00:04, 246.53it/s, env_step=146432, len=7, n/ep=9, n/st=64, player_1/loss=58.503, player_2/loss=209.141, rew=19.44]                                                                                                                                                                                     


Epoch #143: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #144: 1025it [00:03, 259.25it/s, env_step=147456, len=9, n/ep=8, n/st=64, player_1/loss=60.040, player_2/loss=203.410, rew=12.50]                                                                                                                                                                                     


Epoch #144: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #145: 1025it [00:04, 252.97it/s, env_step=148480, len=8, n/ep=8, n/st=64, player_1/loss=94.545, player_2/loss=200.367, rew=18.75]                                                                                                                                                                                     


Epoch #145: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #146: 1025it [00:04, 250.12it/s, env_step=149504, len=7, n/ep=9, n/st=64, player_1/loss=87.234, player_2/loss=172.543, rew=19.44]                                                                                                                                                                                     


Epoch #146: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #147: 1025it [00:03, 302.27it/s, env_step=150528, len=7, n/ep=9, n/st=64, player_1/loss=54.756, player_2/loss=172.659, rew=25.00]                                                                                                                                                                                     


Epoch #147: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #148: 1025it [00:04, 255.25it/s, env_step=151552, len=9, n/ep=7, n/st=64, player_1/loss=38.988, player_2/loss=148.661, rew=25.00]                                                                                                                                                                                     


Epoch #148: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #149: 1025it [00:04, 254.86it/s, env_step=152576, len=7, n/ep=8, n/st=64, player_1/loss=57.756, player_2/loss=146.243, rew=18.75]                                                                                                                                                                                     


Epoch #149: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #150: 1025it [00:04, 255.52it/s, env_step=153600, len=8, n/ep=8, n/st=64, player_1/loss=74.702, player_2/loss=174.708, rew=12.50]                                                                                                                                                                                     


Epoch #150: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #151: 1025it [00:04, 249.75it/s, env_step=154624, len=7, n/ep=9, n/st=64, player_1/loss=50.207, player_2/loss=179.444, rew=19.44]                                                                                                                                                                                     


Epoch #151: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #152: 1025it [00:04, 246.46it/s, env_step=155648, len=8, n/ep=8, n/st=64, player_1/loss=103.464, player_2/loss=207.412, rew=12.50]                                                                                                                                                                                    


Epoch #152: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #153: 1025it [00:04, 243.11it/s, env_step=156672, len=7, n/ep=9, n/st=64, player_1/loss=88.718, player_2/loss=205.128, rew=25.00]                                                                                                                                                                                     


Epoch #153: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #154: 1025it [00:04, 233.39it/s, env_step=157696, len=8, n/ep=6, n/st=64, player_1/loss=49.485, player_2/loss=211.226, rew=25.00]                                                                                                                                                                                     


Epoch #154: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #155: 1025it [00:04, 232.00it/s, env_step=158720, len=7, n/ep=8, n/st=64, player_1/loss=53.337, player_2/loss=213.836, rew=18.75]                                                                                                                                                                                     


Epoch #155: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #156: 1025it [00:04, 237.24it/s, env_step=159744, len=8, n/ep=8, n/st=64, player_1/loss=38.595, player_2/loss=188.265, rew=18.75]                                                                                                                                                                                     


Epoch #156: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #157: 1025it [00:04, 243.14it/s, env_step=160768, len=7, n/ep=9, n/st=64, player_1/loss=86.172, player_2/loss=164.879, rew=19.44]                                                                                                                                                                                     


Epoch #157: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #158: 1025it [00:04, 239.36it/s, env_step=161792, len=7, n/ep=8, n/st=64, player_1/loss=44.116, player_2/loss=170.439, rew=12.50]                                                                                                                                                                                     


Epoch #158: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #159: 1025it [00:03, 260.90it/s, env_step=162816, len=7, n/ep=9, n/st=64, player_1/loss=40.071, player_2/loss=218.968, rew=25.00]                                                                                                                                                                                     


Epoch #159: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #160: 1025it [00:04, 252.08it/s, env_step=163840, len=8, n/ep=8, n/st=64, player_1/loss=73.209, player_2/loss=207.615, rew=18.75]                                                                                                                                                                                     


Epoch #160: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #161: 1025it [00:03, 302.69it/s, env_step=164864, len=10, n/ep=6, n/st=64, player_1/loss=63.959, player_2/loss=157.263, rew=16.67]                                                                                                                                                                                    


Epoch #161: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #162: 1025it [00:03, 281.27it/s, env_step=165888, len=7, n/ep=8, n/st=64, player_1/loss=44.263, player_2/loss=204.341, rew=25.00]                                                                                                                                                                                     


Epoch #162: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #163: 1025it [00:03, 287.88it/s, env_step=166912, len=8, n/ep=8, n/st=64, player_1/loss=86.266, player_2/loss=220.164, rew=12.50]                                                                                                                                                                                     


Epoch #163: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #164: 1025it [00:03, 279.58it/s, env_step=167936, len=7, n/ep=8, n/st=64, player_1/loss=81.803, player_2/loss=197.775, rew=25.00]                                                                                                                                                                                     


Epoch #164: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #165: 1025it [00:03, 265.56it/s, env_step=168960, len=8, n/ep=8, n/st=64, player_1/loss=47.951, player_2/loss=191.000, rew=18.75]                                                                                                                                                                                     


Epoch #165: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #166: 1025it [00:03, 260.91it/s, env_step=169984, len=8, n/ep=7, n/st=64, player_1/loss=37.630, player_2/loss=170.210, rew=3.57]                                                                                                                                                                                      


Epoch #166: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #167: 1025it [00:04, 252.99it/s, env_step=171008, len=7, n/ep=8, n/st=64, player_1/loss=58.492, player_2/loss=156.261, rew=18.75]                                                                                                                                                                                     


Epoch #167: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #168: 1025it [00:04, 254.09it/s, env_step=172032, len=7, n/ep=9, n/st=64, player_1/loss=91.459, player_2/loss=169.745, rew=25.00]                                                                                                                                                                                     


Epoch #168: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #169: 1025it [00:04, 256.22it/s, env_step=173056, len=7, n/ep=8, n/st=64, player_1/loss=91.328, player_2/loss=215.891, rew=25.00]                                                                                                                                                                                     


Epoch #169: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #170: 1025it [00:04, 254.96it/s, env_step=174080, len=7, n/ep=9, n/st=64, player_1/loss=54.782, player_2/loss=194.162, rew=19.44]                                                                                                                                                                                     


Epoch #170: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #171: 1025it [00:03, 266.23it/s, env_step=175104, len=7, n/ep=8, n/st=64, player_1/loss=27.791, player_2/loss=188.095, rew=12.50]                                                                                                                                                                                     


Epoch #171: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #172: 1025it [00:03, 261.55it/s, env_step=176128, len=9, n/ep=7, n/st=64, player_1/loss=23.841, player_2/loss=207.780, rew=25.00]                                                                                                                                                                                     


Epoch #172: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #173: 1025it [00:03, 271.60it/s, env_step=177152, len=7, n/ep=8, n/st=64, player_1/loss=13.892, player_2/loss=205.812, rew=25.00]                                                                                                                                                                                     


Epoch #173: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #174: 1025it [00:03, 264.29it/s, env_step=178176, len=7, n/ep=9, n/st=64, player_1/loss=41.971, player_2/loss=200.345, rew=13.89]                                                                                                                                                                                     


Epoch #174: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #175: 1025it [00:04, 254.79it/s, env_step=179200, len=7, n/ep=8, n/st=64, player_1/loss=55.589, player_2/loss=214.772, rew=25.00]                                                                                                                                                                                     


Epoch #175: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #176: 1025it [00:04, 254.62it/s, env_step=180224, len=7, n/ep=9, n/st=64, player_1/loss=96.434, player_2/loss=213.509, rew=25.00]                                                                                                                                                                                     


Epoch #176: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #177: 1025it [00:03, 257.95it/s, env_step=181248, len=7, n/ep=8, n/st=64, player_1/loss=66.917, player_2/loss=188.975, rew=25.00]                                                                                                                                                                                     


Epoch #177: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #178: 1025it [00:04, 254.05it/s, env_step=182272, len=7, n/ep=8, n/st=64, player_1/loss=40.974, player_2/loss=178.180, rew=25.00]                                                                                                                                                                                     


Epoch #178: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #179: 1025it [00:04, 218.06it/s, env_step=183296, len=8, n/ep=8, n/st=64, player_1/loss=61.713, player_2/loss=171.460, rew=18.75]                                                                                                                                                                                     


Epoch #179: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #180: 1025it [00:04, 208.77it/s, env_step=184320, len=10, n/ep=6, n/st=64, player_1/loss=60.614, player_2/loss=177.403, rew=25.00]                                                                                                                                                                                    


Epoch #180: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #181: 1025it [00:04, 255.73it/s, env_step=185344, len=7, n/ep=9, n/st=64, player_1/loss=56.544, player_2/loss=189.836, rew=19.44]                                                                                                                                                                                     


Epoch #181: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #182: 1025it [00:03, 304.61it/s, env_step=186368, len=7, n/ep=9, n/st=64, player_1/loss=52.016, player_2/loss=174.439, rew=13.89]                                                                                                                                                                                     


Epoch #182: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #183: 1025it [00:03, 278.98it/s, env_step=187392, len=7, n/ep=9, n/st=64, player_1/loss=45.907, player_2/loss=190.940, rew=25.00]                                                                                                                                                                                     


Epoch #183: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #184: 1025it [00:03, 300.56it/s, env_step=188416, len=7, n/ep=8, n/st=64, player_1/loss=21.877, player_2/loss=164.038, rew=25.00]                                                                                                                                                                                     


Epoch #184: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #185: 1025it [00:03, 303.55it/s, env_step=189440, len=7, n/ep=9, n/st=64, player_1/loss=22.439, player_2/loss=184.642, rew=25.00]                                                                                                                                                                                     


Epoch #185: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #186: 1025it [00:03, 310.86it/s, env_step=190464, len=7, n/ep=9, n/st=64, player_1/loss=48.948, player_2/loss=183.726, rew=25.00]                                                                                                                                                                                     


Epoch #186: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #187: 1025it [00:03, 310.32it/s, env_step=191488, len=7, n/ep=9, n/st=64, player_1/loss=68.803, player_2/loss=201.644, rew=25.00]                                                                                                                                                                                     


Epoch #187: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #188: 1025it [00:03, 316.71it/s, env_step=192512, len=8, n/ep=8, n/st=64, player_1/loss=53.756, player_2/loss=173.965, rew=25.00]                                                                                                                                                                                     


Epoch #188: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #189: 1025it [00:03, 300.55it/s, env_step=193536, len=7, n/ep=8, n/st=64, player_1/loss=29.951, player_2/loss=162.766, rew=25.00]                                                                                                                                                                                     


Epoch #189: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #190: 1025it [00:03, 304.50it/s, env_step=194560, len=8, n/ep=8, n/st=64, player_1/loss=35.547, player_2/loss=180.328, rew=25.00]                                                                                                                                                                                     


Epoch #190: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #191: 1025it [00:03, 298.12it/s, env_step=195584, len=8, n/ep=8, n/st=64, player_1/loss=15.866, rew=12.50]                                                                                                                                                                                                            


Epoch #191: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #192: 1025it [00:03, 298.39it/s, env_step=196608, len=7, n/ep=9, n/st=64, player_1/loss=28.862, player_2/loss=146.755, rew=25.00]                                                                                                                                                                                     


Epoch #192: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #193: 1025it [00:03, 300.94it/s, env_step=197632, len=7, n/ep=9, n/st=64, player_1/loss=47.113, player_2/loss=154.032, rew=19.44]                                                                                                                                                                                     


Epoch #193: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #194: 1025it [00:03, 307.19it/s, env_step=198656, len=8, n/ep=7, n/st=64, player_1/loss=47.947, player_2/loss=178.981, rew=25.00]                                                                                                                                                                                     


Epoch #194: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #195: 1025it [00:03, 300.15it/s, env_step=199680, len=8, n/ep=8, n/st=64, player_1/loss=20.334, player_2/loss=160.142, rew=18.75]                                                                                                                                                                                     


Epoch #195: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #196: 1025it [00:02, 357.24it/s, env_step=200704, len=7, n/ep=8, n/st=64, player_1/loss=42.439, player_2/loss=163.576, rew=18.75]                                                                                                                                                                                     


Epoch #196: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #197: 1025it [00:03, 282.94it/s, env_step=201728, len=8, n/ep=8, n/st=64, player_1/loss=57.056, player_2/loss=198.064, rew=25.00]                                                                                                                                                                                     


Epoch #197: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #198: 1025it [00:03, 273.40it/s, env_step=202752, len=8, n/ep=7, n/st=64, player_1/loss=44.943, player_2/loss=178.239, rew=17.86]                                                                                                                                                                                     


Epoch #198: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #199: 1025it [00:03, 294.37it/s, env_step=203776, len=7, n/ep=9, n/st=64, player_1/loss=53.198, player_2/loss=151.387, rew=13.89]                                                                                                                                                                                     


Epoch #199: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #200: 1025it [00:03, 308.78it/s, env_step=204800, len=7, n/ep=9, n/st=64, player_1/loss=69.091, player_2/loss=134.808, rew=25.00]                                                                                                                                                                                     


Epoch #200: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #201: 1025it [00:03, 312.09it/s, env_step=205824, len=8, n/ep=8, n/st=64, player_1/loss=82.069, player_2/loss=180.201, rew=12.50]                                                                                                                                                                                     


Epoch #201: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #202: 1025it [00:03, 311.89it/s, env_step=206848, len=7, n/ep=8, n/st=64, player_1/loss=92.966, player_2/loss=208.004, rew=25.00]                                                                                                                                                                                     


Epoch #202: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #203: 1025it [00:03, 304.38it/s, env_step=207872, len=7, n/ep=8, n/st=64, player_1/loss=45.865, player_2/loss=230.951, rew=6.25]                                                                                                                                                                                      


Epoch #203: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #204: 1025it [00:03, 315.96it/s, env_step=208896, len=7, n/ep=9, n/st=64, player_1/loss=59.824, player_2/loss=224.102, rew=19.44]                                                                                                                                                                                     


Epoch #204: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #205: 1025it [00:03, 316.88it/s, env_step=209920, len=7, n/ep=8, n/st=64, player_1/loss=67.715, player_2/loss=174.197, rew=12.50]                                                                                                                                                                                     


Epoch #205: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #206: 1025it [00:03, 296.97it/s, env_step=210944, len=8, n/ep=8, n/st=64, player_1/loss=52.749, player_2/loss=169.983, rew=12.50]                                                                                                                                                                                     


Epoch #206: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #207: 1025it [00:03, 307.77it/s, env_step=211968, len=7, n/ep=8, n/st=64, player_1/loss=46.780, player_2/loss=191.521, rew=25.00]                                                                                                                                                                                     


Epoch #207: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #208: 1025it [00:03, 311.68it/s, env_step=212992, len=7, n/ep=9, n/st=64, player_1/loss=51.013, player_2/loss=212.564, rew=19.44]                                                                                                                                                                                     


Epoch #208: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #209: 1025it [00:03, 331.75it/s, env_step=214016, len=7, n/ep=9, n/st=64, player_1/loss=44.589, player_2/loss=186.604, rew=25.00]                                                                                                                                                                                     


Epoch #209: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #210: 1025it [00:03, 316.97it/s, env_step=215040, len=7, n/ep=8, n/st=64, player_1/loss=46.021, player_2/loss=169.438, rew=18.75]                                                                                                                                                                                     


Epoch #210: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #211: 1025it [00:03, 309.70it/s, env_step=216064, len=7, n/ep=9, n/st=64, player_1/loss=74.065, player_2/loss=187.399, rew=19.44]                                                                                                                                                                                     


Epoch #211: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #212: 1025it [00:03, 301.61it/s, env_step=217088, len=8, n/ep=8, n/st=64, player_1/loss=53.183, player_2/loss=196.028, rew=18.75]                                                                                                                                                                                     


Epoch #212: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #213: 1025it [00:03, 299.83it/s, env_step=218112, len=7, n/ep=8, n/st=64, player_1/loss=27.741, player_2/loss=178.631, rew=25.00]                                                                                                                                                                                     


Epoch #213: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #214: 1025it [00:04, 238.65it/s, env_step=219136, len=7, n/ep=8, n/st=64, player_1/loss=62.164, player_2/loss=219.821, rew=18.75]                                                                                                                                                                                     


Epoch #214: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #215: 1025it [00:05, 199.44it/s, env_step=220160, len=7, n/ep=8, n/st=64, player_1/loss=79.471, player_2/loss=229.728, rew=0.00]                                                                                                                                                                                      


Epoch #215: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #216: 1025it [00:04, 223.52it/s, env_step=221184, len=8, n/ep=8, n/st=64, player_1/loss=85.744, player_2/loss=205.924, rew=25.00]                                                                                                                                                                                     


Epoch #216: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #217: 1025it [00:04, 211.08it/s, env_step=222208, len=7, n/ep=9, n/st=64, player_1/loss=48.984, player_2/loss=208.697, rew=19.44]                                                                                                                                                                                     


Epoch #217: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #218: 1025it [00:03, 285.81it/s, env_step=223232, len=9, n/ep=7, n/st=64, player_1/loss=26.146, player_2/loss=197.017, rew=17.86]                                                                                                                                                                                     


Epoch #218: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #219: 1025it [00:03, 266.91it/s, env_step=224256, len=8, n/ep=8, n/st=64, player_1/loss=29.195, player_2/loss=216.203, rew=12.50]                                                                                                                                                                                     


Epoch #219: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #220: 1025it [00:03, 285.62it/s, env_step=225280, len=7, n/ep=10, n/st=64, player_1/loss=50.070, player_2/loss=208.124, rew=25.00]                                                                                                                                                                                    


Epoch #220: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #221: 1025it [00:03, 308.45it/s, env_step=226304, len=7, n/ep=8, n/st=64, player_1/loss=59.960, player_2/loss=202.716, rew=18.75]                                                                                                                                                                                     


Epoch #221: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #222: 1025it [00:03, 304.51it/s, env_step=227328, len=8, n/ep=9, n/st=64, player_1/loss=60.521, player_2/loss=234.969, rew=19.44]                                                                                                                                                                                     


Epoch #222: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #223: 1025it [00:03, 306.26it/s, env_step=228352, len=7, n/ep=8, n/st=64, player_1/loss=82.134, player_2/loss=181.244, rew=18.75]                                                                                                                                                                                     


Epoch #223: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #224: 1025it [00:03, 304.53it/s, env_step=229376, len=7, n/ep=8, n/st=64, player_1/loss=67.171, player_2/loss=127.221, rew=18.75]                                                                                                                                                                                     


Epoch #224: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #225: 1025it [00:03, 307.20it/s, env_step=230400, len=8, n/ep=8, n/st=64, player_1/loss=23.064, player_2/loss=171.633, rew=12.50]                                                                                                                                                                                     


Epoch #225: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #226: 1025it [00:03, 300.83it/s, env_step=231424, len=7, n/ep=8, n/st=64, player_2/loss=193.592, rew=12.50]                                                                                                                                                                                                           


Epoch #226: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #227: 1025it [00:03, 294.44it/s, env_step=232448, len=7, n/ep=9, n/st=64, player_1/loss=42.942, player_2/loss=211.352, rew=19.44]                                                                                                                                                                                     


Epoch #227: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #228: 1025it [00:03, 331.75it/s, env_step=233472, len=7, n/ep=9, n/st=64, player_1/loss=48.212, player_2/loss=199.416, rew=19.44]                                                                                                                                                                                     


Epoch #228: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #229: 1025it [00:03, 320.11it/s, env_step=234496, len=7, n/ep=9, n/st=64, player_1/loss=84.553, player_2/loss=188.663, rew=13.89]                                                                                                                                                                                     


Epoch #229: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #230: 1025it [00:03, 311.69it/s, env_step=235520, len=8, n/ep=8, n/st=64, player_1/loss=78.894, player_2/loss=196.357, rew=12.50]                                                                                                                                                                                     


Epoch #230: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #231: 1025it [00:03, 303.64it/s, env_step=236544, len=7, n/ep=8, n/st=64, player_1/loss=60.893, player_2/loss=180.062, rew=18.75]                                                                                                                                                                                     


Epoch #231: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #232: 1025it [00:03, 290.20it/s, env_step=237568, len=9, n/ep=7, n/st=64, player_1/loss=49.638, player_2/loss=165.079, rew=17.86]                                                                                                                                                                                     


Epoch #232: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #233: 1025it [00:03, 291.48it/s, env_step=238592, len=7, n/ep=9, n/st=64, player_1/loss=77.836, player_2/loss=184.658, rew=19.44]                                                                                                                                                                                     


Epoch #233: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #234: 1025it [00:03, 285.62it/s, env_step=239616, len=8, n/ep=8, n/st=64, player_1/loss=78.979, player_2/loss=174.756, rew=12.50]                                                                                                                                                                                     


Epoch #234: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #235: 1025it [00:03, 276.39it/s, env_step=240640, len=7, n/ep=9, n/st=64, player_1/loss=30.121, player_2/loss=152.965, rew=25.00]                                                                                                                                                                                     


Epoch #235: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #236: 1025it [00:03, 271.73it/s, env_step=241664, len=7, n/ep=9, n/st=64, player_1/loss=53.202, player_2/loss=147.869, rew=13.89]                                                                                                                                                                                     


Epoch #236: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #237: 1025it [00:03, 265.39it/s, env_step=242688, len=7, n/ep=9, n/st=64, player_1/loss=51.453, player_2/loss=162.911, rew=25.00]                                                                                                                                                                                     


Epoch #237: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #238: 1025it [00:04, 248.31it/s, env_step=243712, len=8, n/ep=8, n/st=64, player_1/loss=36.651, player_2/loss=168.535, rew=12.50]                                                                                                                                                                                     


Epoch #238: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #239: 1025it [00:04, 251.22it/s, env_step=244736, len=7, n/ep=8, n/st=64, player_1/loss=46.009, player_2/loss=171.236, rew=12.50]                                                                                                                                                                                     


Epoch #239: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #240: 1025it [00:04, 237.86it/s, env_step=245760, len=8, n/ep=8, n/st=64, player_1/loss=34.003, player_2/loss=197.149, rew=18.75]                                                                                                                                                                                     


Epoch #240: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #241: 1025it [00:04, 231.65it/s, env_step=246784, len=7, n/ep=8, n/st=64, player_1/loss=53.711, player_2/loss=231.686, rew=12.50]                                                                                                                                                                                     


Epoch #241: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #242: 1025it [00:04, 228.03it/s, env_step=247808, len=10, n/ep=6, n/st=64, player_1/loss=87.806, player_2/loss=235.300, rew=8.33]                                                                                                                                                                                     


Epoch #242: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #243: 1025it [00:04, 235.35it/s, env_step=248832, len=7, n/ep=9, n/st=64, player_1/loss=57.286, player_2/loss=187.063, rew=13.89]                                                                                                                                                                                     


Epoch #243: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #244: 1025it [00:04, 234.02it/s, env_step=249856, len=7, n/ep=9, n/st=64, player_1/loss=33.092, player_2/loss=191.956, rew=13.89]                                                                                                                                                                                     


Epoch #244: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #245: 1025it [00:04, 239.09it/s, env_step=250880, len=8, n/ep=8, n/st=64, player_1/loss=71.718, player_2/loss=194.515, rew=12.50]                                                                                                                                                                                     


Epoch #245: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #246: 1025it [00:04, 233.24it/s, env_step=251904, len=8, n/ep=7, n/st=64, player_1/loss=72.219, player_2/loss=208.601, rew=10.71]                                                                                                                                                                                     


Epoch #246: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #247: 1025it [00:04, 223.91it/s, env_step=252928, len=7, n/ep=9, n/st=64, player_1/loss=43.501, player_2/loss=220.237, rew=25.00]                                                                                                                                                                                     


Epoch #247: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #248: 1025it [00:04, 219.82it/s, env_step=253952, len=7, n/ep=9, n/st=64, player_1/loss=51.568, player_2/loss=212.321, rew=25.00]                                                                                                                                                                                     


Epoch #248: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #249: 1025it [00:04, 228.10it/s, env_step=254976, len=7, n/ep=8, n/st=64, player_1/loss=63.969, player_2/loss=210.955, rew=25.00]                                                                                                                                                                                     


Epoch #249: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #250: 1025it [00:04, 216.22it/s, env_step=256000, len=9, n/ep=8, n/st=64, player_1/loss=56.173, player_2/loss=193.270, rew=12.50]                                                                                                                                                                                     


Epoch #250: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #251: 1025it [00:04, 232.66it/s, env_step=257024, len=7, n/ep=9, n/st=64, player_1/loss=53.758, player_2/loss=199.831, rew=25.00]                                                                                                                                                                                     


Epoch #251: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #252: 1025it [00:04, 223.34it/s, env_step=258048, len=9, n/ep=7, n/st=64, player_1/loss=47.220, player_2/loss=252.662, rew=10.71]                                                                                                                                                                                     


Epoch #252: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #253: 1025it [00:04, 222.10it/s, env_step=259072, len=8, n/ep=8, n/st=64, player_1/loss=54.583, rew=25.00]                                                                                                                                                                                                            


Epoch #253: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #254: 1025it [00:04, 220.91it/s, env_step=260096, len=8, n/ep=8, n/st=64, player_1/loss=70.621, rew=18.75]                                                                                                                                                                                                            


Epoch #254: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #255: 1025it [00:04, 220.18it/s, env_step=261120, len=7, n/ep=8, n/st=64, player_1/loss=86.660, player_2/loss=202.078, rew=12.50]                                                                                                                                                                                     


Epoch #255: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #256: 1025it [00:04, 228.13it/s, env_step=262144, len=8, n/ep=8, n/st=64, player_1/loss=75.581, player_2/loss=185.753, rew=12.50]                                                                                                                                                                                     


Epoch #256: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #257: 1025it [00:03, 258.37it/s, env_step=263168, len=8, n/ep=8, n/st=64, player_1/loss=53.506, player_2/loss=150.235, rew=0.00]                                                                                                                                                                                      


Epoch #257: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #258: 1025it [00:03, 268.95it/s, env_step=264192, len=7, n/ep=8, n/st=64, player_1/loss=45.713, player_2/loss=168.579, rew=18.75]                                                                                                                                                                                     


Epoch #258: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #259: 1025it [00:03, 262.45it/s, env_step=265216, len=9, n/ep=8, n/st=64, player_1/loss=47.435, player_2/loss=154.808, rew=18.75]                                                                                                                                                                                     


Epoch #259: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #260: 1025it [00:03, 257.31it/s, env_step=266240, len=7, n/ep=9, n/st=64, player_1/loss=34.822, player_2/loss=166.577, rew=25.00]                                                                                                                                                                                     


Epoch #260: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #261: 1025it [00:04, 252.21it/s, env_step=267264, len=7, n/ep=8, n/st=64, player_1/loss=23.157, player_2/loss=177.259, rew=25.00]                                                                                                                                                                                     


Epoch #261: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #262: 1025it [00:03, 259.68it/s, env_step=268288, len=7, n/ep=9, n/st=64, player_1/loss=35.157, player_2/loss=205.006, rew=19.44]                                                                                                                                                                                     


Epoch #262: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #263: 1025it [00:03, 262.10it/s, env_step=269312, len=8, n/ep=8, n/st=64, player_1/loss=22.093, player_2/loss=222.437, rew=25.00]                                                                                                                                                                                     


Epoch #263: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #264: 1025it [00:04, 238.26it/s, env_step=270336, len=7, n/ep=9, n/st=64, player_1/loss=23.017, player_2/loss=203.383, rew=25.00]                                                                                                                                                                                     


Epoch #264: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #265: 1025it [00:03, 296.72it/s, env_step=271360, len=7, n/ep=8, n/st=64, player_1/loss=37.197, player_2/loss=162.491, rew=18.75]                                                                                                                                                                                     


Epoch #265: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #266: 1025it [00:03, 295.59it/s, env_step=272384, len=7, n/ep=8, n/st=64, player_1/loss=99.807, player_2/loss=142.706, rew=18.75]                                                                                                                                                                                     


Epoch #266: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #267: 1025it [00:03, 295.27it/s, env_step=273408, len=8, n/ep=8, n/st=64, player_1/loss=74.528, player_2/loss=205.276, rew=18.75]                                                                                                                                                                                     


Epoch #267: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #268: 1025it [00:03, 293.06it/s, env_step=274432, len=7, n/ep=8, n/st=64, player_1/loss=24.767, player_2/loss=195.300, rew=18.75]                                                                                                                                                                                     


Epoch #268: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #269: 1025it [00:03, 296.15it/s, env_step=275456, len=7, n/ep=9, n/st=64, player_1/loss=29.051, player_2/loss=187.523, rew=19.44]                                                                                                                                                                                     


Epoch #269: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #270: 1025it [00:03, 292.51it/s, env_step=276480, len=7, n/ep=9, n/st=64, player_1/loss=19.107, player_2/loss=185.928, rew=13.89]                                                                                                                                                                                     


Epoch #270: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #271: 1025it [00:03, 299.44it/s, env_step=277504, len=10, n/ep=6, n/st=64, player_1/loss=36.679, player_2/loss=142.645, rew=8.33]                                                                                                                                                                                     


Epoch #271: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #272: 1025it [00:03, 293.80it/s, env_step=278528, len=8, n/ep=7, n/st=64, player_1/loss=42.134, player_2/loss=164.221, rew=25.00]                                                                                                                                                                                     


Epoch #272: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #273: 1025it [00:03, 301.46it/s, env_step=279552, len=7, n/ep=8, n/st=64, player_1/loss=48.985, player_2/loss=152.285, rew=18.75]                                                                                                                                                                                     


Epoch #273: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #274: 1025it [00:03, 292.19it/s, env_step=280576, len=7, n/ep=8, n/st=64, player_1/loss=57.814, player_2/loss=154.492, rew=25.00]                                                                                                                                                                                     


Epoch #274: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #275: 1025it [00:03, 297.43it/s, env_step=281600, len=7, n/ep=8, n/st=64, player_1/loss=49.251, player_2/loss=204.013, rew=12.50]                                                                                                                                                                                     


Epoch #275: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #276: 1025it [00:03, 306.12it/s, env_step=282624, len=7, n/ep=8, n/st=64, player_1/loss=48.969, player_2/loss=173.133, rew=18.75]                                                                                                                                                                                     


Epoch #276: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #277: 1025it [00:03, 295.38it/s, env_step=283648, len=8, n/ep=8, n/st=64, player_1/loss=65.672, player_2/loss=210.030, rew=18.75]                                                                                                                                                                                     


Epoch #277: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #278: 1025it [00:05, 199.96it/s, env_step=284672, len=8, n/ep=8, n/st=64, player_1/loss=54.244, player_2/loss=194.772, rew=25.00]                                                                                                                                                                                     


Epoch #278: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #279: 1025it [00:03, 274.59it/s, env_step=285696, len=7, n/ep=9, n/st=64, player_1/loss=47.550, player_2/loss=151.043, rew=8.33]                                                                                                                                                                                      


Epoch #279: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #280: 1025it [00:04, 229.31it/s, env_step=286720, len=7, n/ep=8, n/st=64, player_1/loss=66.697, player_2/loss=192.935, rew=18.75]                                                                                                                                                                                     


Epoch #280: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #281: 1025it [00:04, 225.73it/s, env_step=287744, len=7, n/ep=8, n/st=64, player_1/loss=54.147, player_2/loss=218.799, rew=25.00]                                                                                                                                                                                     


Epoch #281: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #282: 1025it [00:04, 238.44it/s, env_step=288768, len=7, n/ep=8, n/st=64, player_1/loss=16.967, player_2/loss=175.309, rew=12.50]                                                                                                                                                                                     


Epoch #282: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #283: 1025it [00:04, 224.85it/s, env_step=289792, len=7, n/ep=8, n/st=64, player_1/loss=50.568, player_2/loss=183.233, rew=12.50]                                                                                                                                                                                     


Epoch #283: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #284: 1025it [00:04, 223.29it/s, env_step=290816, len=7, n/ep=8, n/st=64, player_1/loss=43.869, player_2/loss=173.217, rew=18.75]                                                                                                                                                                                     


Epoch #284: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #285: 1025it [00:04, 213.33it/s, env_step=291840, len=7, n/ep=8, n/st=64, player_2/loss=168.295, rew=18.75]                                                                                                                                                                                                           


Epoch #285: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #286: 1025it [00:05, 193.98it/s, env_step=292864, len=7, n/ep=9, n/st=64, player_1/loss=9.633, player_2/loss=164.621, rew=25.00]                                                                                                                                                                                      


Epoch #286: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #287: 1025it [00:05, 198.81it/s, env_step=293888, len=8, n/ep=8, n/st=64, player_1/loss=56.007, player_2/loss=111.183, rew=6.25]                                                                                                                                                                                      


Epoch #287: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #288: 1025it [00:04, 206.38it/s, env_step=294912, len=7, n/ep=8, n/st=64, player_1/loss=39.118, player_2/loss=148.338, rew=18.75]                                                                                                                                                                                     


Epoch #288: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #289: 1025it [00:04, 213.27it/s, env_step=295936, len=10, n/ep=6, n/st=64, player_1/loss=56.678, player_2/loss=153.331, rew=25.00]                                                                                                                                                                                    


Epoch #289: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #290: 1025it [00:04, 224.43it/s, env_step=296960, len=8, n/ep=8, n/st=64, player_1/loss=108.647, player_2/loss=163.423, rew=25.00]                                                                                                                                                                                    


Epoch #290: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #291: 1025it [00:04, 214.88it/s, env_step=297984, len=7, n/ep=9, n/st=64, player_1/loss=107.756, player_2/loss=186.817, rew=13.89]                                                                                                                                                                                    


Epoch #291: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #292: 1025it [00:04, 235.16it/s, env_step=299008, len=7, n/ep=8, n/st=64, player_1/loss=61.211, player_2/loss=201.967, rew=18.75]                                                                                                                                                                                     


Epoch #292: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #293: 1025it [00:04, 241.68it/s, env_step=300032, len=7, n/ep=9, n/st=64, player_1/loss=33.696, player_2/loss=169.855, rew=13.89]                                                                                                                                                                                     


Epoch #293: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #294: 1025it [00:04, 248.97it/s, env_step=301056, len=7, n/ep=8, n/st=64, player_1/loss=49.307, player_2/loss=156.059, rew=18.75]                                                                                                                                                                                     


Epoch #294: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #295: 1025it [00:04, 247.98it/s, env_step=302080, len=8, n/ep=7, n/st=64, player_1/loss=82.634, rew=17.86]                                                                                                                                                                                                            


Epoch #295: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #296: 1025it [00:04, 235.16it/s, env_step=303104, len=7, n/ep=8, n/st=64, player_1/loss=54.267, player_2/loss=197.767, rew=18.75]                                                                                                                                                                                     


Epoch #296: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #297: 1025it [00:04, 236.33it/s, env_step=304128, len=7, n/ep=9, n/st=64, player_1/loss=40.906, player_2/loss=216.448, rew=19.44]                                                                                                                                                                                     


Epoch #297: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #298: 1025it [00:04, 224.78it/s, env_step=305152, len=7, n/ep=9, n/st=64, player_1/loss=41.261, player_2/loss=189.394, rew=25.00]                                                                                                                                                                                     


Epoch #298: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #299: 1025it [00:05, 199.06it/s, env_step=306176, len=7, n/ep=8, n/st=64, player_1/loss=30.667, player_2/loss=197.247, rew=18.75]                                                                                                                                                                                     


Epoch #299: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #300: 1025it [00:05, 195.32it/s, env_step=307200, len=8, n/ep=7, n/st=64, player_1/loss=34.063, player_2/loss=218.449, rew=17.86]                                                                                                                                                                                     


Epoch #300: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #301: 1025it [00:05, 193.20it/s, env_step=308224, len=8, n/ep=8, n/st=64, player_1/loss=41.535, player_2/loss=226.280, rew=18.75]                                                                                                                                                                                     


Epoch #301: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #302: 1025it [00:04, 215.90it/s, env_step=309248, len=8, n/ep=8, n/st=64, player_1/loss=37.010, player_2/loss=192.306, rew=12.50]                                                                                                                                                                                     


Epoch #302: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #303: 1025it [00:02, 369.33it/s, env_step=310272, len=7, n/ep=9, n/st=64, player_1/loss=36.350, player_2/loss=146.641, rew=13.89]                                                                                                                                                                                     


Epoch #303: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #304: 1025it [00:02, 353.71it/s, env_step=311296, len=7, n/ep=8, n/st=64, player_1/loss=31.162, player_2/loss=169.340, rew=12.50]                                                                                                                                                                                     


Epoch #304: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #305: 1025it [00:02, 354.03it/s, env_step=312320, len=7, n/ep=9, n/st=64, player_1/loss=35.590, player_2/loss=169.089, rew=25.00]                                                                                                                                                                                     


Epoch #305: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #306: 1025it [00:02, 360.15it/s, env_step=313344, len=7, n/ep=8, n/st=64, player_1/loss=68.801, player_2/loss=209.149, rew=6.25]                                                                                                                                                                                      


Epoch #306: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #307: 1025it [00:02, 358.09it/s, env_step=314368, len=7, n/ep=8, n/st=64, player_1/loss=119.080, player_2/loss=204.419, rew=12.50]                                                                                                                                                                                    


Epoch #307: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #308: 1025it [00:02, 355.83it/s, env_step=315392, len=7, n/ep=9, n/st=64, player_1/loss=76.093, player_2/loss=187.936, rew=19.44]                                                                                                                                                                                     


Epoch #308: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #309: 1025it [00:02, 353.01it/s, env_step=316416, len=7, n/ep=9, n/st=64, player_1/loss=16.946, player_2/loss=193.613, rew=19.44]                                                                                                                                                                                     


Epoch #309: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #310: 1025it [00:02, 356.21it/s, env_step=317440, len=9, n/ep=7, n/st=64, player_1/loss=11.428, player_2/loss=171.610, rew=10.71]                                                                                                                                                                                     


Epoch #310: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #311: 1025it [00:02, 353.06it/s, env_step=318464, len=11, n/ep=6, n/st=64, player_1/loss=22.362, player_2/loss=188.707, rew=8.33]                                                                                                                                                                                     


Epoch #311: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #312: 1025it [00:02, 361.47it/s, env_step=319488, len=7, n/ep=8, n/st=64, player_1/loss=19.908, player_2/loss=182.465, rew=25.00]                                                                                                                                                                                     


Epoch #312: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #313: 1025it [00:02, 368.97it/s, env_step=320512, len=7, n/ep=8, n/st=64, player_1/loss=37.836, player_2/loss=184.519, rew=18.75]                                                                                                                                                                                     


Epoch #313: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #314: 1025it [00:02, 355.83it/s, env_step=321536, len=7, n/ep=8, n/st=64, player_1/loss=29.520, player_2/loss=167.223, rew=18.75]                                                                                                                                                                                     


Epoch #314: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #315: 1025it [00:04, 238.41it/s, env_step=322560, len=7, n/ep=9, n/st=64, player_1/loss=27.901, player_2/loss=171.937, rew=19.44]                                                                                                                                                                                     


Epoch #315: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #316: 1025it [00:03, 258.66it/s, env_step=323584, len=7, n/ep=9, n/st=64, player_1/loss=41.052, player_2/loss=178.584, rew=25.00]                                                                                                                                                                                     


Epoch #316: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #317: 1025it [00:03, 263.40it/s, env_step=324608, len=7, n/ep=8, n/st=64, player_1/loss=38.509, player_2/loss=184.497, rew=25.00]                                                                                                                                                                                     


Epoch #317: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #318: 1025it [00:04, 251.77it/s, env_step=325632, len=7, n/ep=8, n/st=64, player_1/loss=47.846, player_2/loss=199.497, rew=18.75]                                                                                                                                                                                     


Epoch #318: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #319: 1025it [00:04, 251.59it/s, env_step=326656, len=7, n/ep=9, n/st=64, player_1/loss=40.886, player_2/loss=178.725, rew=25.00]                                                                                                                                                                                     


Epoch #319: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #320: 1025it [00:04, 253.41it/s, env_step=327680, len=10, n/ep=6, n/st=64, player_1/loss=28.021, rew=8.33]                                                                                                                                                                                                            


Epoch #320: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #321: 1025it [00:04, 225.71it/s, env_step=328704, len=7, n/ep=9, n/st=64, player_1/loss=58.169, player_2/loss=222.272, rew=19.44]                                                                                                                                                                                     


Epoch #321: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #322: 1025it [00:04, 243.04it/s, env_step=329728, len=8, n/ep=8, n/st=64, player_1/loss=38.253, player_2/loss=228.526, rew=18.75]                                                                                                                                                                                     


Epoch #322: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #323: 1025it [00:04, 242.86it/s, env_step=330752, len=11, n/ep=6, n/st=64, player_1/loss=31.669, player_2/loss=207.362, rew=8.33]                                                                                                                                                                                     


Epoch #323: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #324: 1025it [00:03, 259.27it/s, env_step=331776, len=7, n/ep=9, n/st=64, player_1/loss=37.601, player_2/loss=195.897, rew=25.00]                                                                                                                                                                                     


Epoch #324: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #325: 1025it [00:04, 236.62it/s, env_step=332800, len=7, n/ep=8, n/st=64, player_1/loss=46.419, player_2/loss=222.058, rew=12.50]                                                                                                                                                                                     


Epoch #325: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #326: 1025it [00:04, 232.33it/s, env_step=333824, len=7, n/ep=8, n/st=64, player_1/loss=50.108, player_2/loss=165.495, rew=18.75]                                                                                                                                                                                     


Epoch #326: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #327: 1025it [00:04, 219.87it/s, env_step=334848, len=8, n/ep=8, n/st=64, player_1/loss=59.281, player_2/loss=162.405, rew=12.50]                                                                                                                                                                                     


Epoch #327: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #328: 1025it [00:05, 195.27it/s, env_step=335872, len=7, n/ep=9, n/st=64, player_1/loss=37.899, rew=19.44]                                                                                                                                                                                                            


Epoch #328: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #329: 1025it [00:04, 208.60it/s, env_step=336896, len=7, n/ep=8, n/st=64, player_1/loss=22.762, player_2/loss=195.001, rew=18.75]                                                                                                                                                                                     


Epoch #329: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #330: 1025it [00:04, 236.31it/s, env_step=337920, len=7, n/ep=8, n/st=64, player_1/loss=29.382, player_2/loss=187.131, rew=18.75]                                                                                                                                                                                     


Epoch #330: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #331: 1025it [00:03, 312.29it/s, env_step=338944, len=7, n/ep=8, n/st=64, player_1/loss=27.556, player_2/loss=168.334, rew=18.75]                                                                                                                                                                                     


Epoch #331: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #332: 1025it [00:03, 260.68it/s, env_step=339968, len=8, n/ep=8, n/st=64, player_1/loss=51.602, player_2/loss=191.064, rew=25.00]                                                                                                                                                                                     


Epoch #332: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #333: 1025it [00:04, 247.11it/s, env_step=340992, len=7, n/ep=8, n/st=64, player_1/loss=52.299, player_2/loss=191.242, rew=18.75]                                                                                                                                                                                     


Epoch #333: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #334: 1025it [00:04, 247.33it/s, env_step=342016, len=8, n/ep=8, n/st=64, player_1/loss=45.248, player_2/loss=205.145, rew=18.75]                                                                                                                                                                                     


Epoch #334: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #335: 1025it [00:04, 247.15it/s, env_step=343040, len=7, n/ep=9, n/st=64, player_1/loss=21.691, player_2/loss=169.557, rew=13.89]                                                                                                                                                                                     


Epoch #335: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #336: 1025it [00:03, 256.81it/s, env_step=344064, len=7, n/ep=8, n/st=64, player_1/loss=24.444, player_2/loss=183.733, rew=18.75]                                                                                                                                                                                     


Epoch #336: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #337: 1025it [00:03, 287.91it/s, env_step=345088, len=8, n/ep=8, n/st=64, player_1/loss=38.645, player_2/loss=184.421, rew=12.50]                                                                                                                                                                                     


Epoch #337: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #338: 1025it [00:03, 303.51it/s, env_step=346112, len=10, n/ep=6, n/st=64, player_1/loss=46.243, player_2/loss=172.844, rew=16.67]                                                                                                                                                                                    


Epoch #338: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #339: 1025it [00:03, 263.40it/s, env_step=347136, len=8, n/ep=7, n/st=64, player_1/loss=57.666, player_2/loss=192.070, rew=17.86]                                                                                                                                                                                     


Epoch #339: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #340: 1025it [00:04, 240.89it/s, env_step=348160, len=8, n/ep=8, n/st=64, player_1/loss=42.242, player_2/loss=174.941, rew=25.00]                                                                                                                                                                                     


Epoch #340: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #341: 1025it [00:05, 197.51it/s, env_step=349184, len=7, n/ep=8, n/st=64, player_1/loss=18.920, player_2/loss=172.707, rew=18.75]                                                                                                                                                                                     


Epoch #341: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #342: 1025it [00:04, 241.40it/s, env_step=350208, len=9, n/ep=7, n/st=64, player_1/loss=22.641, player_2/loss=216.361, rew=17.86]                                                                                                                                                                                     


Epoch #342: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #343: 1025it [00:03, 277.47it/s, env_step=351232, len=7, n/ep=8, n/st=64, player_1/loss=49.981, player_2/loss=221.886, rew=25.00]                                                                                                                                                                                     


Epoch #343: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #344: 1025it [00:03, 276.70it/s, env_step=352256, len=7, n/ep=8, n/st=64, player_1/loss=56.851, player_2/loss=184.288, rew=25.00]                                                                                                                                                                                     


Epoch #344: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #345: 1025it [00:03, 273.65it/s, env_step=353280, len=9, n/ep=7, n/st=64, player_1/loss=39.117, player_2/loss=212.083, rew=10.71]                                                                                                                                                                                     


Epoch #345: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #346: 1025it [00:03, 274.99it/s, env_step=354304, len=8, n/ep=7, n/st=64, player_1/loss=18.970, player_2/loss=195.697, rew=10.71]                                                                                                                                                                                     


Epoch #346: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #347: 1025it [00:03, 284.31it/s, env_step=355328, len=7, n/ep=9, n/st=64, player_1/loss=37.484, player_2/loss=168.865, rew=13.89]                                                                                                                                                                                     


Epoch #347: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #348: 1025it [00:03, 268.39it/s, env_step=356352, len=7, n/ep=9, n/st=64, player_1/loss=31.640, player_2/loss=186.756, rew=19.44]                                                                                                                                                                                     


Epoch #348: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #349: 1025it [00:03, 272.56it/s, env_step=357376, len=7, n/ep=9, n/st=64, player_1/loss=22.551, player_2/loss=174.198, rew=25.00]                                                                                                                                                                                     


Epoch #349: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #350: 1025it [00:03, 288.65it/s, env_step=358400, len=7, n/ep=9, n/st=64, player_1/loss=23.305, player_2/loss=174.107, rew=19.44]                                                                                                                                                                                     


Epoch #350: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #351: 1025it [00:03, 311.09it/s, env_step=359424, len=8, n/ep=8, n/st=64, player_1/loss=14.309, player_2/loss=214.492, rew=6.25]                                                                                                                                                                                      


Epoch #351: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #352: 1025it [00:04, 231.84it/s, env_step=360448, len=8, n/ep=7, n/st=64, player_1/loss=7.824, player_2/loss=182.483, rew=17.86]                                                                                                                                                                                      


Epoch #352: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #353: 1025it [00:04, 225.53it/s, env_step=361472, len=7, n/ep=8, n/st=64, player_1/loss=23.696, player_2/loss=120.859, rew=18.75]                                                                                                                                                                                     


Epoch #353: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #354: 1025it [00:08, 127.21it/s, env_step=362496, len=7, n/ep=9, n/st=64, player_1/loss=54.496, player_2/loss=145.605, rew=19.44]                                                                                                                                                                                     


Epoch #354: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #355: 1025it [00:11, 87.60it/s, env_step=363520, len=7, n/ep=8, n/st=64, player_1/loss=52.040, player_2/loss=149.390, rew=25.00]                                                                                                                                                                                      


Epoch #355: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #356: 1025it [00:06, 147.71it/s, env_step=364544, len=7, n/ep=8, n/st=64, player_1/loss=19.722, player_2/loss=172.751, rew=18.75]                                                                                                                                                                                     


Epoch #356: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #357: 1025it [00:05, 199.90it/s, env_step=365568, len=7, n/ep=9, n/st=64, player_1/loss=40.615, player_2/loss=205.796, rew=25.00]                                                                                                                                                                                     


Epoch #357: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #358: 1025it [00:04, 229.23it/s, env_step=366592, len=7, n/ep=9, n/st=64, player_1/loss=58.774, player_2/loss=208.029, rew=19.44]                                                                                                                                                                                     


Epoch #358: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #359: 1025it [00:04, 225.98it/s, env_step=367616, len=7, n/ep=8, n/st=64, player_1/loss=30.902, player_2/loss=151.803, rew=18.75]                                                                                                                                                                                     


Epoch #359: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #360: 1025it [00:04, 242.32it/s, env_step=368640, len=7, n/ep=8, n/st=64, player_1/loss=36.701, player_2/loss=162.101, rew=6.25]                                                                                                                                                                                      


Epoch #360: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #361: 1025it [00:04, 242.62it/s, env_step=369664, len=7, n/ep=9, n/st=64, player_1/loss=61.961, player_2/loss=204.036, rew=19.44]                                                                                                                                                                                     


Epoch #361: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #362: 1025it [00:04, 250.63it/s, env_step=370688, len=7, n/ep=8, n/st=64, player_1/loss=60.604, player_2/loss=224.346, rew=18.75]                                                                                                                                                                                     


Epoch #362: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #363: 1025it [00:04, 244.82it/s, env_step=371712, len=8, n/ep=8, n/st=64, player_1/loss=45.734, player_2/loss=208.664, rew=6.25]                                                                                                                                                                                      


Epoch #363: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #364: 1025it [00:04, 240.33it/s, env_step=372736, len=7, n/ep=8, n/st=64, player_1/loss=32.659, player_2/loss=191.484, rew=12.50]                                                                                                                                                                                     


Epoch #364: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #365: 1025it [00:04, 239.47it/s, env_step=373760, len=8, n/ep=8, n/st=64, player_1/loss=39.780, player_2/loss=193.915, rew=18.75]                                                                                                                                                                                     


Epoch #365: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #366: 1025it [00:03, 310.36it/s, env_step=374784, len=7, n/ep=8, n/st=64, player_1/loss=47.670, player_2/loss=200.224, rew=18.75]                                                                                                                                                                                     


Epoch #366: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #367: 1025it [00:03, 307.70it/s, env_step=375808, len=7, n/ep=8, n/st=64, player_1/loss=71.411, player_2/loss=212.262, rew=25.00]                                                                                                                                                                                     


Epoch #367: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #368: 1025it [00:03, 309.09it/s, env_step=376832, len=7, n/ep=9, n/st=64, player_1/loss=36.413, player_2/loss=224.112, rew=25.00]                                                                                                                                                                                     


Epoch #368: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #369: 1025it [00:03, 322.11it/s, env_step=377856, len=7, n/ep=9, n/st=64, player_1/loss=13.741, player_2/loss=223.943, rew=25.00]                                                                                                                                                                                     


Epoch #369: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #370: 1025it [00:03, 313.17it/s, env_step=378880, len=7, n/ep=9, n/st=64, player_1/loss=39.229, player_2/loss=183.629, rew=13.89]                                                                                                                                                                                     


Epoch #370: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #371: 1025it [00:03, 321.36it/s, env_step=379904, len=7, n/ep=9, n/st=64, player_1/loss=55.868, player_2/loss=178.528, rew=19.44]                                                                                                                                                                                     


Epoch #371: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #372: 1025it [00:03, 313.69it/s, env_step=380928, len=7, n/ep=9, n/st=64, player_1/loss=32.899, player_2/loss=188.386, rew=19.44]                                                                                                                                                                                     


Epoch #372: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #373: 1025it [00:03, 318.47it/s, env_step=381952, len=7, n/ep=8, n/st=64, player_1/loss=40.766, player_2/loss=200.983, rew=18.75]                                                                                                                                                                                     


Epoch #373: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #374: 1025it [00:03, 319.35it/s, env_step=382976, len=7, n/ep=7, n/st=64, player_1/loss=96.525, player_2/loss=218.356, rew=25.00]                                                                                                                                                                                     


Epoch #374: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #375: 1025it [00:03, 321.96it/s, env_step=384000, len=8, n/ep=8, n/st=64, player_1/loss=91.078, player_2/loss=207.444, rew=25.00]                                                                                                                                                                                     


Epoch #375: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #376: 1025it [00:03, 310.69it/s, env_step=385024, len=8, n/ep=8, n/st=64, player_1/loss=42.927, player_2/loss=162.424, rew=18.75]                                                                                                                                                                                     


Epoch #376: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #377: 1025it [00:03, 311.44it/s, env_step=386048, len=8, n/ep=8, n/st=64, player_1/loss=39.882, player_2/loss=178.850, rew=12.50]                                                                                                                                                                                     


Epoch #377: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #378: 1025it [00:03, 316.68it/s, env_step=387072, len=7, n/ep=8, n/st=64, player_1/loss=28.972, player_2/loss=217.886, rew=12.50]                                                                                                                                                                                     


Epoch #378: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #379: 1025it [00:03, 314.91it/s, env_step=388096, len=8, n/ep=8, n/st=64, player_1/loss=40.133, player_2/loss=168.416, rew=12.50]                                                                                                                                                                                     


Epoch #379: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #380: 1025it [00:03, 317.40it/s, env_step=389120, len=8, n/ep=8, n/st=64, player_1/loss=48.980, player_2/loss=176.400, rew=18.75]                                                                                                                                                                                     


Epoch #380: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #381: 1025it [00:03, 311.06it/s, env_step=390144, len=8, n/ep=8, n/st=64, player_1/loss=55.293, player_2/loss=191.925, rew=18.75]                                                                                                                                                                                     


Epoch #381: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #382: 1025it [00:03, 308.98it/s, env_step=391168, len=7, n/ep=8, n/st=64, player_1/loss=78.807, player_2/loss=192.522, rew=25.00]                                                                                                                                                                                     


Epoch #382: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #383: 1025it [00:03, 313.79it/s, env_step=392192, len=8, n/ep=8, n/st=64, player_1/loss=49.691, player_2/loss=183.187, rew=18.75]                                                                                                                                                                                     


Epoch #383: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #384: 1025it [00:03, 314.42it/s, env_step=393216, len=8, n/ep=8, n/st=64, player_1/loss=15.028, player_2/loss=181.903, rew=18.75]                                                                                                                                                                                     


Epoch #384: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #385: 1025it [00:03, 307.86it/s, env_step=394240, len=7, n/ep=8, n/st=64, player_2/loss=174.167, rew=25.00]                                                                                                                                                                                                           


Epoch #385: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #386: 1025it [00:03, 314.02it/s, env_step=395264, len=8, n/ep=8, n/st=64, player_1/loss=65.701, player_2/loss=155.702, rew=18.75]                                                                                                                                                                                     


Epoch #386: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #387: 1025it [00:03, 314.85it/s, env_step=396288, len=7, n/ep=9, n/st=64, player_1/loss=56.756, player_2/loss=205.699, rew=19.44]                                                                                                                                                                                     


Epoch #387: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #388: 1025it [00:03, 319.77it/s, env_step=397312, len=7, n/ep=9, n/st=64, player_1/loss=71.977, player_2/loss=218.789, rew=25.00]                                                                                                                                                                                     


Epoch #388: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #389: 1025it [00:03, 321.44it/s, env_step=398336, len=8, n/ep=8, n/st=64, player_1/loss=42.622, player_2/loss=187.639, rew=12.50]                                                                                                                                                                                     


Epoch #389: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #390: 1025it [00:03, 320.80it/s, env_step=399360, len=7, n/ep=9, n/st=64, player_1/loss=61.776, player_2/loss=188.624, rew=8.33]                                                                                                                                                                                      


Epoch #390: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #391: 1025it [00:03, 314.46it/s, env_step=400384, len=7, n/ep=8, n/st=64, player_1/loss=64.104, player_2/loss=197.807, rew=18.75]                                                                                                                                                                                     


Epoch #391: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #392: 1025it [00:03, 299.31it/s, env_step=401408, len=8, n/ep=8, n/st=64, player_1/loss=23.760, player_2/loss=164.615, rew=18.75]                                                                                                                                                                                     


Epoch #392: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #393: 1025it [00:02, 456.29it/s, env_step=402432, len=7, n/ep=9, n/st=64, player_1/loss=17.376, player_2/loss=163.246, rew=25.00]                                                                                                                                                                                     


Epoch #393: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #394: 1025it [00:02, 423.14it/s, env_step=403456, len=7, n/ep=8, n/st=64, player_1/loss=31.973, player_2/loss=182.771, rew=6.25]                                                                                                                                                                                      


Epoch #394: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #395: 1025it [00:02, 479.76it/s, env_step=404480, len=8, n/ep=7, n/st=64, player_1/loss=40.610, player_2/loss=171.277, rew=17.86]                                                                                                                                                                                     


Epoch #395: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #396: 1025it [00:02, 434.00it/s, env_step=405504, len=7, n/ep=9, n/st=64, player_1/loss=43.873, player_2/loss=168.770, rew=19.44]                                                                                                                                                                                     


Epoch #396: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #397: 1025it [00:02, 446.49it/s, env_step=406528, len=8, n/ep=7, n/st=64, player_1/loss=39.058, player_2/loss=176.117, rew=10.71]                                                                                                                                                                                     


Epoch #397: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #398: 1025it [00:02, 470.31it/s, env_step=407552, len=8, n/ep=9, n/st=64, player_1/loss=37.013, player_2/loss=208.860, rew=19.44]                                                                                                                                                                                     


Epoch #398: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #399: 1025it [00:02, 486.20it/s, env_step=408576, len=7, n/ep=8, n/st=64, player_1/loss=40.047, player_2/loss=216.002, rew=25.00]                                                                                                                                                                                     


Epoch #399: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #400: 1025it [00:02, 487.55it/s, env_step=409600, len=9, n/ep=7, n/st=64, player_1/loss=80.682, player_2/loss=194.515, rew=17.86]                                                                                                                                                                                     


Epoch #400: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #401: 1025it [00:02, 488.36it/s, env_step=410624, len=9, n/ep=7, n/st=64, player_1/loss=95.714, player_2/loss=166.477, rew=17.86]                                                                                                                                                                                     


Epoch #401: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #402: 1025it [00:02, 486.95it/s, env_step=411648, len=7, n/ep=9, n/st=64, player_1/loss=50.022, player_2/loss=198.627, rew=13.89]                                                                                                                                                                                     


Epoch #402: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #403: 1025it [00:02, 490.87it/s, env_step=412672, len=9, n/ep=7, n/st=64, player_1/loss=48.444, rew=-3.57]                                                                                                                                                                                                            


Epoch #403: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #404: 1025it [00:02, 485.30it/s, env_step=413696, len=7, n/ep=8, n/st=64, player_1/loss=35.991, player_2/loss=185.588, rew=18.75]                                                                                                                                                                                     


Epoch #404: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #405: 1025it [00:02, 486.73it/s, env_step=414720, len=7, n/ep=8, n/st=64, player_1/loss=31.173, player_2/loss=201.422, rew=18.75]                                                                                                                                                                                     


Epoch #405: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #406: 1025it [00:02, 486.29it/s, env_step=415744, len=7, n/ep=9, n/st=64, player_1/loss=47.645, player_2/loss=216.586, rew=13.89]                                                                                                                                                                                     


Epoch #406: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #407: 1025it [00:02, 491.14it/s, env_step=416768, len=7, n/ep=9, n/st=64, player_1/loss=36.299, player_2/loss=211.134, rew=19.44]                                                                                                                                                                                     


Epoch #407: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #408: 1025it [00:02, 488.96it/s, env_step=417792, len=8, n/ep=8, n/st=64, player_1/loss=13.603, player_2/loss=207.892, rew=25.00]                                                                                                                                                                                     


Epoch #408: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #409: 1025it [00:02, 489.55it/s, env_step=418816, len=7, n/ep=8, n/st=64, player_1/loss=11.808, player_2/loss=247.310, rew=18.75]                                                                                                                                                                                     


Epoch #409: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #410: 1025it [00:02, 490.80it/s, env_step=419840, len=7, n/ep=9, n/st=64, player_1/loss=16.060, player_2/loss=247.266, rew=19.44]                                                                                                                                                                                     


Epoch #410: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #411: 1025it [00:02, 493.94it/s, env_step=420864, len=7, n/ep=9, n/st=64, player_1/loss=36.149, player_2/loss=218.326, rew=25.00]                                                                                                                                                                                     


Epoch #411: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #412: 1025it [00:02, 491.04it/s, env_step=421888, len=7, n/ep=8, n/st=64, player_1/loss=33.069, player_2/loss=203.948, rew=25.00]                                                                                                                                                                                     


Epoch #412: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #413: 1025it [00:02, 485.02it/s, env_step=422912, len=7, n/ep=8, n/st=64, player_1/loss=18.098, player_2/loss=216.039, rew=18.75]                                                                                                                                                                                     


Epoch #413: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #414: 1025it [00:02, 488.01it/s, env_step=423936, len=7, n/ep=8, n/st=64, player_1/loss=34.093, player_2/loss=232.808, rew=25.00]                                                                                                                                                                                     


Epoch #414: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #415: 1025it [00:02, 489.71it/s, env_step=424960, len=7, n/ep=8, n/st=64, player_1/loss=60.742, player_2/loss=229.052, rew=18.75]                                                                                                                                                                                     


Epoch #415: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #416: 1025it [00:02, 490.52it/s, env_step=425984, len=7, n/ep=8, n/st=64, player_1/loss=50.119, player_2/loss=189.095, rew=18.75]                                                                                                                                                                                     


Epoch #416: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #417: 1025it [00:02, 482.04it/s, env_step=427008, len=8, n/ep=8, n/st=64, player_1/loss=45.376, player_2/loss=172.013, rew=25.00]                                                                                                                                                                                     


Epoch #417: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #418: 1025it [00:02, 466.62it/s, env_step=428032, len=9, n/ep=6, n/st=64, player_1/loss=40.677, player_2/loss=195.983, rew=16.67]                                                                                                                                                                                     


Epoch #418: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #419: 1025it [00:02, 436.25it/s, env_step=429056, len=7, n/ep=9, n/st=64, player_1/loss=18.553, player_2/loss=188.270, rew=19.44]                                                                                                                                                                                     


Epoch #419: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #420: 1025it [00:02, 431.34it/s, env_step=430080, len=7, n/ep=8, n/st=64, player_1/loss=32.205, player_2/loss=167.562, rew=12.50]                                                                                                                                                                                     


Epoch #420: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #421: 1025it [00:02, 420.88it/s, env_step=431104, len=8, n/ep=8, n/st=64, player_1/loss=38.364, player_2/loss=181.892, rew=6.25]                                                                                                                                                                                      


Epoch #421: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #422: 1025it [00:02, 406.97it/s, env_step=432128, len=7, n/ep=8, n/st=64, player_1/loss=40.227, player_2/loss=237.398, rew=18.75]                                                                                                                                                                                     


Epoch #422: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #423: 1025it [00:02, 396.15it/s, env_step=433152, len=8, n/ep=7, n/st=64, player_1/loss=61.296, player_2/loss=235.247, rew=3.57]                                                                                                                                                                                      


Epoch #423: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #424: 1025it [00:02, 391.39it/s, env_step=434176, len=9, n/ep=7, n/st=64, player_1/loss=52.954, player_2/loss=197.641, rew=10.71]                                                                                                                                                                                     


Epoch #424: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #425: 1025it [00:02, 392.81it/s, env_step=435200, len=7, n/ep=9, n/st=64, player_1/loss=39.645, player_2/loss=157.660, rew=19.44]                                                                                                                                                                                     


Epoch #425: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #426: 1025it [00:02, 392.09it/s, env_step=436224, len=8, n/ep=8, n/st=64, player_1/loss=40.198, player_2/loss=164.774, rew=25.00]                                                                                                                                                                                     


Epoch #426: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #427: 1025it [00:02, 390.85it/s, env_step=437248, len=7, n/ep=8, n/st=64, player_1/loss=20.865, player_2/loss=150.420, rew=18.75]                                                                                                                                                                                     


Epoch #427: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #428: 1025it [00:02, 390.51it/s, env_step=438272, len=7, n/ep=8, n/st=64, player_1/loss=32.588, player_2/loss=137.231, rew=25.00]                                                                                                                                                                                     


Epoch #428: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #429: 1025it [00:02, 391.55it/s, env_step=439296, len=7, n/ep=8, n/st=64, player_1/loss=19.562, player_2/loss=163.563, rew=25.00]                                                                                                                                                                                     


Epoch #429: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #430: 1025it [00:02, 392.17it/s, env_step=440320, len=7, n/ep=9, n/st=64, player_1/loss=46.288, player_2/loss=175.898, rew=19.44]                                                                                                                                                                                     


Epoch #430: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #431: 1025it [00:02, 391.93it/s, env_step=441344, len=7, n/ep=9, n/st=64, player_1/loss=53.942, player_2/loss=138.320, rew=25.00]                                                                                                                                                                                     


Epoch #431: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #432: 1025it [00:02, 391.62it/s, env_step=442368, len=8, n/ep=8, n/st=64, player_1/loss=49.168, player_2/loss=139.728, rew=12.50]                                                                                                                                                                                     


Epoch #432: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #433: 1025it [00:02, 394.63it/s, env_step=443392, len=7, n/ep=8, n/st=64, player_1/loss=37.605, player_2/loss=152.265, rew=18.75]                                                                                                                                                                                     


Epoch #433: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #434: 1025it [00:02, 393.38it/s, env_step=444416, len=7, n/ep=9, n/st=64, player_1/loss=25.501, player_2/loss=156.457, rew=25.00]                                                                                                                                                                                     


Epoch #434: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #435: 1025it [00:02, 393.13it/s, env_step=445440, len=7, n/ep=9, n/st=64, player_1/loss=34.894, player_2/loss=176.820, rew=13.89]                                                                                                                                                                                     


Epoch #435: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #436: 1025it [00:02, 377.36it/s, env_step=446464, len=7, n/ep=9, n/st=64, player_1/loss=26.786, player_2/loss=198.508, rew=25.00]                                                                                                                                                                                     


Epoch #436: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #437: 1025it [00:02, 448.31it/s, env_step=447488, len=7, n/ep=9, n/st=64, player_1/loss=28.005, player_2/loss=204.969, rew=25.00]                                                                                                                                                                                     


Epoch #437: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #438: 1025it [00:02, 458.07it/s, env_step=448512, len=7, n/ep=9, n/st=64, player_1/loss=36.131, player_2/loss=199.439, rew=8.33]                                                                                                                                                                                      


Epoch #438: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #439: 1025it [00:02, 438.77it/s, env_step=449536, len=7, n/ep=9, n/st=64, player_1/loss=30.902, player_2/loss=166.361, rew=19.44]                                                                                                                                                                                     


Epoch #439: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #440: 1025it [00:02, 419.14it/s, env_step=450560, len=7, n/ep=9, n/st=64, player_1/loss=51.683, player_2/loss=133.739, rew=19.44]                                                                                                                                                                                     


Epoch #440: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #441: 1025it [00:02, 401.99it/s, env_step=451584, len=8, n/ep=7, n/st=64, player_1/loss=53.643, player_2/loss=147.265, rew=17.86]                                                                                                                                                                                     


Epoch #441: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #442: 1025it [00:03, 330.01it/s, env_step=452608, len=7, n/ep=8, n/st=64, player_1/loss=18.919, player_2/loss=148.354, rew=18.75]                                                                                                                                                                                     


Epoch #442: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #443: 1025it [00:03, 329.93it/s, env_step=453632, len=8, n/ep=8, n/st=64, player_1/loss=25.194, player_2/loss=150.187, rew=25.00]                                                                                                                                                                                     


Epoch #443: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #444: 1025it [00:02, 397.50it/s, env_step=454656, len=7, n/ep=9, n/st=64, player_1/loss=34.075, player_2/loss=174.108, rew=13.89]                                                                                                                                                                                     


Epoch #444: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #445: 1025it [00:02, 433.36it/s, env_step=455680, len=7, n/ep=8, n/st=64, player_1/loss=40.640, player_2/loss=201.842, rew=25.00]                                                                                                                                                                                     


Epoch #445: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #446: 1025it [00:02, 436.45it/s, env_step=456704, len=7, n/ep=8, n/st=64, player_1/loss=44.131, player_2/loss=175.813, rew=12.50]                                                                                                                                                                                     


Epoch #446: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #447: 1025it [00:02, 421.37it/s, env_step=457728, len=7, n/ep=8, n/st=64, player_1/loss=35.493, player_2/loss=168.071, rew=25.00]                                                                                                                                                                                     


Epoch #447: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #448: 1025it [00:02, 411.61it/s, env_step=458752, len=8, n/ep=8, n/st=64, player_1/loss=27.476, player_2/loss=165.924, rew=12.50]                                                                                                                                                                                     


Epoch #448: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #449: 1025it [00:02, 391.40it/s, env_step=459776, len=7, n/ep=8, n/st=64, player_1/loss=36.959, player_2/loss=195.143, rew=25.00]                                                                                                                                                                                     


Epoch #449: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #450: 1025it [00:02, 408.57it/s, env_step=460800, len=7, n/ep=10, n/st=64, player_1/loss=39.815, player_2/loss=221.490, rew=25.00]                                                                                                                                                                                    


Epoch #450: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #451: 1025it [00:02, 390.15it/s, env_step=461824, len=7, n/ep=9, n/st=64, player_1/loss=57.626, player_2/loss=243.635, rew=13.89]                                                                                                                                                                                     


Epoch #451: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #452: 1025it [00:02, 388.19it/s, env_step=462848, len=8, n/ep=8, n/st=64, player_1/loss=70.844, player_2/loss=183.850, rew=25.00]                                                                                                                                                                                     


Epoch #452: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #453: 1025it [00:02, 415.20it/s, env_step=463872, len=8, n/ep=8, n/st=64, player_1/loss=43.874, player_2/loss=172.478, rew=25.00]                                                                                                                                                                                     


Epoch #453: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #454: 1025it [00:02, 371.36it/s, env_step=464896, len=8, n/ep=7, n/st=64, player_1/loss=40.656, player_2/loss=166.614, rew=17.86]                                                                                                                                                                                     


Epoch #454: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #455: 1025it [00:02, 411.58it/s, env_step=465920, len=7, n/ep=9, n/st=64, player_1/loss=22.151, player_2/loss=167.759, rew=25.00]                                                                                                                                                                                     


Epoch #455: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #456: 1025it [00:02, 437.90it/s, env_step=466944, len=7, n/ep=9, n/st=64, player_1/loss=23.621, player_2/loss=142.971, rew=19.44]                                                                                                                                                                                     


Epoch #456: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #457: 1025it [00:02, 480.07it/s, env_step=467968, len=7, n/ep=8, n/st=64, player_1/loss=18.090, player_2/loss=143.297, rew=25.00]                                                                                                                                                                                     


Epoch #457: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #458: 1025it [00:02, 433.20it/s, env_step=468992, len=8, n/ep=8, n/st=64, player_1/loss=25.119, player_2/loss=158.296, rew=25.00]                                                                                                                                                                                     


Epoch #458: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #459: 1025it [00:02, 442.13it/s, env_step=470016, len=7, n/ep=8, n/st=64, player_1/loss=42.772, player_2/loss=137.959, rew=18.75]                                                                                                                                                                                     


Epoch #459: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #460: 1025it [00:02, 433.61it/s, env_step=471040, len=8, n/ep=8, n/st=64, player_1/loss=46.075, player_2/loss=142.173, rew=25.00]                                                                                                                                                                                     


Epoch #460: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #461: 1025it [00:02, 496.42it/s, env_step=472064, len=8, n/ep=8, n/st=64, player_1/loss=36.354, player_2/loss=155.040, rew=12.50]                                                                                                                                                                                     


Epoch #461: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #462: 1025it [00:02, 503.95it/s, env_step=473088, len=7, n/ep=8, n/st=64, player_1/loss=46.776, player_2/loss=174.501, rew=12.50]                                                                                                                                                                                     


Epoch #462: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #463: 1025it [00:02, 483.44it/s, env_step=474112, len=8, n/ep=8, n/st=64, player_1/loss=43.154, player_2/loss=193.399, rew=6.25]                                                                                                                                                                                      


Epoch #463: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #464: 1025it [00:02, 488.60it/s, env_step=475136, len=7, n/ep=8, n/st=64, player_1/loss=22.351, player_2/loss=169.546, rew=18.75]                                                                                                                                                                                     


Epoch #464: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #465: 1025it [00:02, 490.26it/s, env_step=476160, len=7, n/ep=8, n/st=64, player_1/loss=35.218, player_2/loss=188.011, rew=25.00]                                                                                                                                                                                     


Epoch #465: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #466: 1025it [00:02, 465.12it/s, env_step=477184, len=8, n/ep=8, n/st=64, player_1/loss=53.127, player_2/loss=175.831, rew=25.00]                                                                                                                                                                                     


Epoch #466: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #467: 1025it [00:02, 494.69it/s, env_step=478208, len=7, n/ep=8, n/st=64, player_1/loss=60.261, player_2/loss=166.983, rew=18.75]                                                                                                                                                                                     


Epoch #467: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #468: 1025it [00:02, 467.75it/s, env_step=479232, len=7, n/ep=8, n/st=64, player_1/loss=57.338, player_2/loss=184.446, rew=25.00]                                                                                                                                                                                     


Epoch #468: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #469: 1025it [00:02, 456.06it/s, env_step=480256, len=8, n/ep=7, n/st=64, player_1/loss=71.106, player_2/loss=168.286, rew=17.86]                                                                                                                                                                                     


Epoch #469: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #470: 1025it [00:02, 444.37it/s, env_step=481280, len=7, n/ep=8, n/st=64, player_1/loss=66.133, player_2/loss=154.548, rew=12.50]                                                                                                                                                                                     


Epoch #470: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #471: 1025it [00:02, 435.97it/s, env_step=482304, len=8, n/ep=8, n/st=64, player_1/loss=40.744, player_2/loss=169.739, rew=0.00]                                                                                                                                                                                      


Epoch #471: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #472: 1025it [00:02, 417.84it/s, env_step=483328, len=8, n/ep=8, n/st=64, player_1/loss=29.495, player_2/loss=171.229, rew=6.25]                                                                                                                                                                                      


Epoch #472: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #473: 1025it [00:02, 412.70it/s, env_step=484352, len=8, n/ep=7, n/st=64, player_1/loss=43.067, player_2/loss=165.059, rew=25.00]                                                                                                                                                                                     


Epoch #473: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #474: 1025it [00:02, 396.20it/s, env_step=485376, len=7, n/ep=9, n/st=64, player_1/loss=33.576, player_2/loss=167.284, rew=19.44]                                                                                                                                                                                     


Epoch #474: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #475: 1025it [00:02, 410.37it/s, env_step=486400, len=7, n/ep=9, n/st=64, player_1/loss=11.137, player_2/loss=200.243, rew=25.00]                                                                                                                                                                                     


Epoch #475: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #476: 1025it [00:02, 483.74it/s, env_step=487424, len=7, n/ep=9, n/st=64, player_1/loss=25.665, player_2/loss=223.394, rew=19.44]                                                                                                                                                                                     


Epoch #476: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #477: 1025it [00:02, 482.64it/s, env_step=488448, len=7, n/ep=9, n/st=64, player_1/loss=24.269, player_2/loss=223.966, rew=19.44]                                                                                                                                                                                     


Epoch #477: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #478: 1025it [00:02, 484.28it/s, env_step=489472, len=7, n/ep=8, n/st=64, player_1/loss=23.251, player_2/loss=212.322, rew=12.50]                                                                                                                                                                                     


Epoch #478: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #479: 1025it [00:02, 482.30it/s, env_step=490496, len=8, n/ep=8, n/st=64, player_1/loss=26.420, player_2/loss=171.946, rew=25.00]                                                                                                                                                                                     


Epoch #479: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #480: 1025it [00:02, 481.25it/s, env_step=491520, len=7, n/ep=8, n/st=64, player_1/loss=36.429, player_2/loss=176.724, rew=18.75]                                                                                                                                                                                     


Epoch #480: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #481: 1025it [00:02, 478.50it/s, env_step=492544, len=7, n/ep=9, n/st=64, player_1/loss=50.437, player_2/loss=202.607, rew=19.44]                                                                                                                                                                                     


Epoch #481: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #482: 1025it [00:02, 483.08it/s, env_step=493568, len=8, n/ep=8, n/st=64, player_1/loss=79.334, player_2/loss=193.744, rew=25.00]                                                                                                                                                                                     


Epoch #482: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #483: 1025it [00:02, 482.02it/s, env_step=494592, len=7, n/ep=9, n/st=64, player_1/loss=66.740, player_2/loss=176.888, rew=25.00]                                                                                                                                                                                     


Epoch #483: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #484: 1025it [00:02, 472.50it/s, env_step=495616, len=7, n/ep=8, n/st=64, player_1/loss=39.851, player_2/loss=190.564, rew=25.00]                                                                                                                                                                                     


Epoch #484: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #485: 1025it [00:02, 474.28it/s, env_step=496640, len=7, n/ep=8, n/st=64, player_1/loss=40.689, player_2/loss=198.786, rew=18.75]                                                                                                                                                                                     


Epoch #485: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #486: 1025it [00:02, 484.52it/s, env_step=497664, len=8, n/ep=8, n/st=64, player_1/loss=20.956, player_2/loss=237.556, rew=6.25]                                                                                                                                                                                      


Epoch #486: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #487: 1025it [00:02, 482.35it/s, env_step=498688, len=7, n/ep=9, n/st=64, player_1/loss=13.983, player_2/loss=221.760, rew=19.44]                                                                                                                                                                                     


Epoch #487: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #488: 1025it [00:02, 405.07it/s, env_step=499712, len=8, n/ep=8, n/st=64, player_1/loss=36.893, player_2/loss=200.059, rew=6.25]                                                                                                                                                                                      


Epoch #488: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #489: 1025it [00:02, 451.73it/s, env_step=500736, len=7, n/ep=8, n/st=64, player_1/loss=24.187, player_2/loss=182.906, rew=25.00]                                                                                                                                                                                     


Epoch #489: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #490: 1025it [00:02, 452.33it/s, env_step=501760, len=7, n/ep=9, n/st=64, player_1/loss=14.524, player_2/loss=173.602, rew=19.44]                                                                                                                                                                                     


Epoch #490: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #491: 1025it [00:02, 453.19it/s, env_step=502784, len=7, n/ep=9, n/st=64, player_1/loss=26.259, player_2/loss=184.222, rew=25.00]                                                                                                                                                                                     


Epoch #491: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #492: 1025it [00:02, 454.74it/s, env_step=503808, len=7, n/ep=9, n/st=64, player_1/loss=45.922, player_2/loss=187.388, rew=19.44]                                                                                                                                                                                     


Epoch #492: test_reward: -25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #493: 1025it [00:02, 406.64it/s, env_step=504832, len=8, n/ep=8, n/st=64, player_1/loss=47.439, player_2/loss=169.204, rew=25.00]                                                                                                                                                                                     


Epoch #493: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #494: 1025it [00:02, 380.51it/s, env_step=505856, len=7, n/ep=9, n/st=64, player_1/loss=51.821, player_2/loss=199.674, rew=25.00]                                                                                                                                                                                     


Epoch #494: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #495: 1025it [00:02, 351.84it/s, env_step=506880, len=7, n/ep=8, n/st=64, player_1/loss=21.402, player_2/loss=199.220, rew=18.75]                                                                                                                                                                                     


Epoch #495: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #496: 1025it [00:02, 348.07it/s, env_step=507904, len=7, n/ep=8, n/st=64, player_1/loss=16.928, rew=18.75]                                                                                                                                                                                                            


Epoch #496: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #497: 1025it [00:02, 347.81it/s, env_step=508928, len=7, n/ep=8, n/st=64, player_1/loss=28.678, player_2/loss=162.131, rew=25.00]                                                                                                                                                                                     


Epoch #497: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #498: 1025it [00:02, 346.25it/s, env_step=509952, len=7, n/ep=9, n/st=64, player_1/loss=13.205, player_2/loss=217.632, rew=25.00]                                                                                                                                                                                     


Epoch #498: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


Epoch #499: 1025it [00:02, 374.72it/s, env_step=510976, len=7, n/ep=8, n/st=64, player_1/loss=37.137, player_2/loss=223.203, rew=18.75]                                                                                                                                                                                     


Epoch #499: test_reward: 25.000000 ± 0.000000, best_reward: 25.000000 ± 0.000000 in #3


In [17]:
####################################################
# EXPERIMENT: VIEWING THE BEST LEARNED POLICY
####################################################

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n

# Configure the best agent
best_agent1 = cf_custom_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape)
best_agent1.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/4-mlp_dqn_frozen_agent2/best_policy_agent1.pth"))
best_agent1.set_eps(0)


best_agent2 = cf_custom_dqn_policy(state_shape= state_shape,
                                   action_shape= action_shape)
best_agent2.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/4-mlp_dqn_frozen_agent2/best_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= best_agent1,
      agent_player2= best_agent2)



Average steps of game:  7.0
Final mean reward agent 1: 25.0, std: 0.0
Final mean reward agent 2: -25.0, std: 0.0


In [16]:
####################################################
# EXPERIMENT: VIEWING THE LAST LEARNED POLICY
####################################################

# Configure the final agent
final_agent_player1 = cf_custom_dqn_policy(state_shape= state_shape,
                                           action_shape= action_shape)
final_agent_player1.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/4-mlp_dqn_frozen_agent2/final_policy_agent1.pth"))
best_agent1.set_eps(0)

final_agent_player2 = cf_custom_dqn_policy(state_shape= state_shape,
                                           action_shape= action_shape)
final_agent_player2.load_state_dict(torch.load("./saved_variables/paper_notebooks/8/4-mlp_dqn_frozen_agent2/final_policy_agent2.pth"))
best_agent2.set_eps(0)

# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= final_agent_player1,
      agent_player2= final_agent_player2)



Average steps of game:  7.0
Final mean reward agent 1: 25.0, std: 0.0
Final mean reward agent 2: -25.0, std: 0.0


<hr><hr>

## Discussion

We see that the agent can learn quickly to win against a fixed strategy oponent but the overall performance of the agent is still weak, making human play of very poor quality once again.

In [None]:
####################################################
# CLEAN VARIABLES
####################################################

del action_shape
del agent1
del agent2
del best_agent1
del best_agent2
del env
del final_agent_player1
del final_agent_player2
del observation_space
del off_policy_traininer_results
del state_shape
